
Lecture Notes in Computer Science 1607

Edited by G. Goos, J. Hartmanis and J. van Leeuwen


Springer
Berlin
Heidelberg
New York
Barcelona
Hong Kong
London
Milan
Paris
Singapore
Tokyo
José Mira  Juan V. Sánchez-Andrés (Eds.)

Engineering Applications
of Bio-Inspired
Artificial Neural Networks
International Work-Conference on
Artificial and Natural Neural Networks, IWANN'99
Alicante, Spain, June 2-4, 1999
Proceedings, Volume II

Springer
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
José Mira
Universidad Nacional de Educación a Distancia
Departamento de Inteligencia Artificial
Senda del Rey, s/n, E-28040 Madrid, Spain
E-mail: jmira@dia.uned.es

Juan V. Sánchez-Andrés
Universidad Miguel Hernández, Departamento de Fisiología
Centro de Bioingeniería, Campus de San Juan, Apdo. 18
Ctra. Valencia, s/n, E-03550 San Juan de Alicante, Spain
E-mail: juanvi@umh.es

Cataloging-in-Publication data applied for


Die Deutsche Bibliothek - CIP-Einheitsaufnahme

International Work Conference on Artificial and Natural Neural Networks <5, 1999, Alicante>:
International Work Conference on Artificial and Natural Neural Networks :
Alicante, Spain, June 2 - 4, 1999, proceedings / IWANN '99. José Mira ; Juan
V. Sánchez-Andrés (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong
Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer.

Vol. 2. Engineering applications of bio-inspired artificial neural networks. - (1999)
(Lecture notes in computer science, Vol. 1607)
ISBN 3-540-66068-2
CR Subject Classification (1998): F.1.1, I.2, E.1.1, C.1.3, C.2.1, G.1.6, I.5.1,
I.4, J.1, J.2
ISSN 0302-9743
ISBN 3-540-66068-2 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1999


Printed in Germany
Typesetting: Camera-ready by author
SPIN: 10704957   06/3142 - 5 4 3 2 1 0
Printed on acid-free paper
Preface

Fifty years after the publication of Norbert Wiener's book on Cybernetics and
a hundred years after the birth of Warren S. McCulloch (1898), we still have
a deeply-held conviction of the value of the interdisciplinary approach in the
understanding of the nervous system and in the engineering use of the results of
this understanding. In the words of N. Wiener, "The mathematician (nowadays
also the physicist, the computer scientist, or the electronic engineer) need not
have the skill to conduct a physiological experiment, but he must have the skill
to understand one, to criticize one, and to suggest one. The physiologist need
not be able to prove a certain mathematical theorem (or to program a model
of a neuron or to formulate a signaling code...) but he must be able to grasp
its physiological significance and to tell the mathematician for what he should
look". We, as Wiener did, had dreamed for years of a team of interdisciplinary
scientists working together to understand the interplay between Neuroscience
and Computation, and "to lend one another the strength of that understanding".
The basic idea during the initial Neurocybernetics stage of Artificial Intelli-
gence and Neural Computation was that both the living beings and the man-
made machines could be understood using the same organizational and struc-
tural principles, the same experimental methodology, and the same theoretical
and formal tools (logic, mathematics, knowledge modeling, and computation
languages).
This interdisciplinary approach has been the basis of the organization of
all the IWANN biennial conferences, with the aim of promoting the interplay
between Neuroscience and Computation, without disciplinary boundaries.
IWANN'99, the fifth International Work-Conference on Artificial and Natural
Neural Networks, which took place in Alicante (Spain), June 2-4, 1999, focused on
the following goals:
I. Developments on Foundations and Methodology.
II. From Artificial to Natural: How can Systems Theory, Electronics, and
Computation (including AI) aid in the understanding of the nervous system?
III. From Natural to Artificial: How can understanding the nervous system
help in obtaining bio-inspired models of artificial neurons, evolutionary
architectures, and learning algorithms of value in Computation and Engineering?
IV. Bio-inspired Technology and Engineering Applications: How can we ob-
tain bio-inspired formulations for sensory coding, perception, memory, decision
making, planning, and control?
IWANN'99 was organized by the Asociación Española de Redes Neuronales,
the Universidad Nacional de Educación a Distancia, UNED (Madrid), and the
Instituto de Bioingeniería of the Universidad Miguel Hernández, UMH (Alicante),
also in cooperation with IFIP (Working Group in Neural Computer Systems,
WG10.6), and the Spanish RIG IEEE Neural Networks Council.

Sponsorship was obtained from the Spanish CICYT and DGICYT (MEC),
the organizing universities (UNED and UMH), and the Fundación Obra Social
of the CAM.
The papers presented here correspond to talks delivered at the conference.
After the evaluation process, 181 papers were accepted for oral or poster presen-
tation, according to the recommendations of the reviewers and the authors' pref-
erences. We have organized these papers in two volumes arranged basically fol-
lowing the topics list included in the call for papers. The first volume, entitled
"Foundations and Tools in Neural Modeling" is divided into three main parts
and includes the contributions on:
1. Neural Modeling (Biophysical and Structural Models).
2. Plasticity Phenomena (Maturing, Learning and Memory).
3. Artificial Intelligence and Cognitive Neuroscience.
In the second volume, with the title, "Engineering Applications of Bioin-
spired Artificial Neural Nets", we have included the contributions dealing with
applications. These contributions are grouped into four parts:
1. Artificial Neural Nets Simulation and Implementation.
2. Bio-inspired Systems.
3. Images.
4. Engineering Applications (including Data Analysis and Robotics).
We would like to express our sincere gratitude to the members of the orga-
nizing and program committees, in particular to F. de la Paz and J.R. Álvarez,
to the reviewers, and to the organizers of invited sessions (Bahamonde, Barro,
Benjamins, Cabestany, Dorronsoro, Fukushima, González-Cristóbal, Jutten, Mil-
lán, Moreno-Arostegui, Taddei-Ferretti, and Vellasco) for their invaluable effort
in helping with the preparation of this conference. Thanks also to the invited
speakers (Abeles, Gordon, Marder, Poggio, and Schiff) for their effort in prepar-
ing the plenary lectures.
Last, but not least, the editors would like to thank Springer-Verlag, in partic-
ular Alfred Hofmann, for the continuous and excellent collaboration
from the first IWANN in Granada (1991, LNCS 540) through the successive meetings
in Sitges (1993, LNCS 686), Torremolinos (1995, LNCS 930), and Lanzarote
(1997, LNCS 1240), and now in Alicante.
The theme for the 1999 conference (from artificial to natural and back again),
focused on the interdisciplinary spirit of the pioneers in Neurocybernetics
(N. Wiener, A. Rosenblueth, J. Bigelow, W.S. McCulloch, W. Pitts, H. von
Foerster, J.Y. Lettvin, J. von Neumann, ...) and the thought-provoking meet-
ings of the Macy Foundation. We hope that these two volumes will contribute
to a better understanding of the nervous system and, equally, to an expansion
of the field of bio-inspired technologies. For that, we rely on the future work of
the authors of these volumes and on our potential readers.

June 1999  José Mira


Juan V. Sánchez
Invited Speakers

Prof. Moshe Abeles (Hebrew Univ., Jerusalem, Israel)
Prof. Mirta Gordon (CEA-Dept. Rech. Fond. Mat. Cond. SPSMS, France)
Prof. Eve Marder (Brandeis Univ., Waltham, MA, USA)
Prof. Tomaso Poggio (Brain Sci. Dept. AI Lab. MIT, Cambridge, MA, USA)
Prof. Steven Schiff (Krasnow Inst. Adv. Stud., George Mason Univ., VA, USA)

Field Editors

Prof. A. Bahamonde (Univ. de Oviedo en Gijón, Spain)
Prof. S. Barro (Univ. de Santiago de Compostela, Spain)
Prof. R. Benjamins (University of Amsterdam, Netherlands)
Prof. J. Cabestany (Universidad Politécnica de Cataluña, Spain)
Prof. J.R. Dorronsoro (Universidad Autónoma de Madrid, Spain)
Prof. K. Fukushima (Osaka Univ., Japan)
Prof. J.C. González-Cristóbal (Univ. Politécnica de Madrid, Spain)
Prof. C. Jutten (LIS-INPG, France)
Prof. J. del R. Millán (Joint Research Center - European Commission, Ispra, Italy)
Prof. J.M. Moreno-Arostegui (Univ. Politécnica de Cataluña, Spain)
Prof. C. Taddei-Ferretti (Istituto di Cibernetica, CNR, Italy)
Prof. M. Vellasco (Pontifícia Univ. Católica, Rio de Janeiro, Brazil)
Table of Contents, Vol. II

Artificial Neural Nets Simulation and Implementation

A Unified Model for the Simulation of Artificial and Biology-Oriented
Neural Networks .... 1
A. Strey

Weight Freezing in Constructive Neural Networks: A Novel Approach .... 11
S. Hosseini, C. Jutten

Can General Purpose Micro-Processors Simulate Neural Networks in
Real-Time? .... 21
B. Granado, L. Lacassagne, P. Garda

Large Neural Net Simulation under Beowulf-Like Systems .... 30
C.J. García Orellana, F.J. López-Aligué, H.M. González Velasco,
M. Macías Macías, M.I. Acevedo-Sotoca

A Constructive Cascade Network with Adaptive Regularisation .... 40
N.K. Treadgold, T.D. Gedeon

An Agent-Based Operational Model for Hybrid Connectionist-Symbolic
Learning .... 50
J.C. González Cristóbal, J.R. Velasco, C.A. Iglesias

Optimal Discrete Recombination: Hybridising Evolution Strategies with
the A* Algorithm .... 58
C. Cotta, J.M. Troya Linero

Extracting Rules from Artificial Neural Networks with Kernel-Based
Representations .... 68
J.M. Ramírez

Rule Improvement Through Decision Boundary Detection Using
Sensitivity Analysis .... 78
A.P. Engelbrecht, H.L. Viktor

The Role of Dynamic Reconfiguration for Implementing Artificial Neural
Networks Models in Programmable Hardware .... 85
J.M. Moreno Aróstegui, J. Cabestany, E. Cantó, J. Faura,
J.M. Insenser

An Associative Neural Network and Its Special Purpose Pipeline
Architecture in Image Analysis .... 95
F. Ibarra Picó, S. Cuenca Asensi

Effects of Global Perturbations on Learning Capability in a CMOS
Analogue Implementation of Synchronous Boltzmann Machine .... 107
K. Madani, G. de Trémiolles

Beta-CMOS Artificial Neuron and Implementability Limits .... 117
V. Varshavsky, V. Marakhovsky

Using On-Line Arithmetic and Reconfiguration for Neuroprocessor
Implementation .... 129
J.-L. Beuchat, E. Sánchez

Digital Implementation of Artificial Neural Networks: From VHDL
Description to FPGA Implementation .... 139
N. Izeboudjen, A. Farah, S. Titri, H. Boumeridja

Hardware Implementation Using DSP's of the Neurocontrol of a
Wheelchair .... 149
P. Martín, M. Mazo, L. Boquete, F.J. Rodríguez, I. Fernández,
R. Barea, J.L. Lázaro

Forward-Backward Parallelism in On-Line Backpropagation .... 157
R. Gadea Gironés, A. Mocholí Salcedo

A VLSI Approach for Spike Timing Coding .... 166
E. Ros, F.J. Pelayo, I. Rojas, F.J. Fernández, A. Prieto

An Artificial Dendrite Using Active Channels .... 176
E. Rouw, J. Hoekstra, A.H.M. van Roermund

Analog Electronic System for Simulating Biological Neurons .... 188
V. Douence, A. Laflaquière, S. Le Masson, T. Bal, G. Le Masson

Neural Addition and Fibonacci Numbers .... 198
V. Beiu

Adaptive Cooperation Between Processors in a Parallel Boltzmann
Machine Implementation .... 208
J. Ortega Lopera, L. Parrilla, J.L. Bernier, C. Gil, B. Pino,
M. Anguita

Bio-inspired Systems

Adaptive Brain Interfaces .... 219
J. del R. Millán, J. Mouriño, J. Heikkonen, K. Kaski, F. Babiloni,
M.G. Marciani, F. Topani, I. Canale

Identifying Mental Tasks from Spontaneous EEG: Signal Representation
and Spatial Analysis .... 228
C.W. Anderson

Independent Component Analysis of Human Brain Waves .... 238
R. Vigário, E. Oja

EEG-Based Brain-Computer Interface Using Subject-Specific Spatial
Filters .... 248
G. Pfurtscheller, C. Guger, H. Ramoser

Multi-neural Network Approach for Classification of Brainstem Evoked
Response Auditory .... 255
A.-S. Dujardin, V. Amarger, K. Madani, O. Adam, J.-F. Motsch

EEG-Based Cognitive Task Classification with ICA and Neural Networks .... 265
D.A. Peterson, C.W. Anderson

Local Pattern of Synchronization in Extraestriate Networks During Visual
Attention .... 273
L. Menéndez de la Prida, F. Barceló, M.A. Pozo, F.J. Rubia

A Bioinspired Hierarchical System for Speech Recognition .... 279
J.M. Ferrández, M.V. Rodellar Biarge, P. Gómez

A Neural Network Approach for the Analysis of Multineural Recordings
in Retinal Ganglion Cells .... 289
J.M. Ferrández, J.A. Bolea, J. Ammermüller, R.A. Normann,
E. Fernández

Challenges for a Real-World Information Processing by Means of
Real-Time Neural Computation and Real-Conditions Simulation .... 299
J.C. Herrero

A Parametrizable Design of the Mechanical-Neural Transduction System
of the Auditory Brainstem .... 312
J.A. Macías Iglesias, M.V. Rodellar Biarge

Development of a New Space Perception System for Blind People, Based
on the Creation of a Virtual Acoustic Space .... 321
J.L. González-Mora, A. Rodríguez-Hernández,
L.F. Rodríguez-Ramos, L. Díaz-Saco, N. Sosa

Images
Application of the Fuzzy Kohonen Clustering Network to Biological
Macromolecules Images Classification .... 331
A. Pascual, M. Bárcena, J.J. Merelo, J.-M. Carazo

Bayesian VQ Image Filtering Design with Fast Adaptation Competitive
Neural Networks .... 341
A.I. González, M. Graña, I. Echave, J. Ruiz-Cabello

Neural Networks for Coefficient Prediction in Wavelet Image Coders .... 351
C. Daniell, R. Matic

A Neural Network Architecture for Trademark Image Retrieval .... 361
S. Alwis, J. Austin

Improved Automatic Classification of Biological Particles from
Electron-Microscopy Images Using Genetic Neural Nets .... 373
J.J. Merelo, V. Rivas, G. Romero, P.A. Castillo, A. Pascual,
J.M. Carazo

Pattern Recognition Using Neural Network Based on Multi-valued
Neurons .... 383
I.N. Aizenberg, N.N. Aizenberg

Input Pre-processing for Transformation Invariant Pattern Recognition .... 393
G. Tascini, A. Montesanto, G. Fazzini, P. Puliti

Method for Automatic Karyotyping of Human Chromosomes Based on the
Visual Attention System .... 402
J.F. Díez Higuera, F.J. Díaz Pernas

Adaptive Adjustment of the CNN Output Function to Obtain Contrast
Enhancement .... 412
M.A. Jaramillo Morán, J.A. Fernández Muñoz

Application of ANN Techniques to Automated Identification of Bovine
Livestock .... 422
H.M. González Velasco, F.J. López-Aligué, C.J. García Orellana,
M. Macías Macías, M.I. Acevedo-Sotoca

An Investigation into Cellular Neural Networks Internal Dynamics Applied
to Image Processing .... 432
D. Monnin, L. Merlat, A. Köneke, J. Hérault

Autopoiesis and Image Processing: Detection of Structure and
Organization in Images .... 442
M. Köppen, J. Ruiz-del-Solar

Preprocessing of Radiological Images: Comparison of the Application of
Polynomic Algorithms and Artificial Neural Networks to the Elimination
of Variations in Background Luminosity .... 452
B. Arcay Varela, A. Alonso Betanzos, A. Castro Martínez,
C. Seijo García, J. Suárez Bustillo

Feature Extraction with an Associative Neural Network and Its
Application in Industrial Quality Control .... 460
F. Ibarra Picó, S. Cuenca Asensi, J.M. García Chamizo

Genetic Algorithm Based Training for Multilayer Discrete-Time Cellular
Neural Networks .... 467
P. López, D.L. Vilariño, D. Cabello

Engineering Applications

How to Select the Inputs for a Multilayer Feedforward Network by Using
the Training Set .... 477
M. Fernández Redondo, C. Hernández Espinosa

Neural Implementation of the JADE-Algorithm .... 487
C. Ziegaus, E.W. Lang

Variable Selection by Recurrent Neural Networks. Application in Structure
Activity Relationship Study of Cephalosporins .... 497
N. López, R. Cruz, B. Llorente

Optimal Use of a Trained Neural Network for Input Selection .... 506
M. Fernández Redondo, C. Hernández Espinosa

Applying Evolution Strategies to Neural Network Robot Controller .... 516
A. Berlanga, J.M. Molina, A. Sanchis, P. Isasi

On Virtual Sensory Coding: An Analytical Model of the Endogenous
Representation .... 526
J.R. Álvarez-Sánchez, F. de la Paz López, J. Mira Mira

Using Temporal Information in ANNs for the Implementation of
Autonomous Robot Controllers .... 540
J.A. Becerra, J. Santos, R.J. Duro

Learning Symbolic Rules with a Reactive with Tags Classifier System in
Robot Navigation .... 548
A. Sanchis, J.M. Molina, P. Isasi, J. Segovia

Small Sample Discrimination and Professional Performance Assessment .... 558
D. Aguado, J.R. Dorronsoro, B. Lucia, C. Santa Cruz

SOM Based Analysis of Pulping Process Data .... 567
O. Simula, E. Alhoniemi

Gradient Descent Learning Algorithm for Hierarchical Neural Networks:
A Case Study in Industrial Quality .... 578
D. Baratta, F. Diotalevi, M. Valle, D.D. Caviglia

Application of Neural Networks for Automated X-Ray Image Inspection
in Electronics Manufacturing .... 588
A. König, A. Herenz, K. Wolter

Forecasting Financial Time Series Through Intrinsic Dimension Estimation
and Non-linear Data Projection .... 596
M. Verleysen, E. de Bodt, A. Lendasse

Parametric Characterization of Hardness Profiles of Steels with
Neuro-Wavelet Networks .... 606
V. Colla, L.M. Reyneri, M. Sgarbi

Study of Two ANN Digital Implementations of a Radar Detector
Candidate to an On-Board Satellite Experiment .... 615
R. Velazco, Ch. Godin, Ph. Cheynet, S. Torres-Alegre, D. Andina,
M.B. Gordon

Curvilinear Component Analysis for High-Dimensional Data
Representation: I. Theoretical Aspects and Practical Use in the Presence
of Noise .... 625
J. Hérault, C. Jausions-Picaud, A. Guérin-Dugué

Curvilinear Component Analysis for High-Dimensional Data
Representation: II. Examples of Additional Mapping Constraints in
Specific Applications .... 635
A. Guérin-Dugué, P. Teissier, G. Delso Gafaro, J. Hérault

Image Motion Analysis Using Scale Space Approximation and Simulated
Annealing .... 645
V. Parisi Baradad, H. Yahia, J. Font, I. Herlin, E. García-Ladona

Blind Inversion of Wiener Systems .... 655
A. Taleb, J. Solé, C. Jutten

Separation of Speech Signals for Nonlinear Mixtures .... 665
C.G. Puntonet, M.M. Rodríguez-Álvarez, A. Prieto, B. Prieto

Nonlinear Blind Source Separation by Pattern Repulsion .... 674
L.B. Almeida, G.C. Marques

Text-to-Text Machine Translation Using the RECONTRA Connectionist
Model .... 683
M.A. Castaño, F. Casacuberta

An Intelligent Agent for Brokering Problem-Solving Knowledge .... 693
V.R. Benjamins, B.J. Wielinga, J. Wielemaker, D. Fensel

A System for Facilitating and Enhancing Web Search .... 706
S. Staab, C. Braun, I. Bruder, A. Düsterhöft, A. Heuer, M. Klettke,
G. Neumann, B. Prager, J. Pretzel, H.-P. Schnurr, R. Studer,
H. Uszkoreit, B. Wrenger

Applying Ontology to the Web: A Case Study .... 715
J. Heflin, J. Hendler, S. Luke

How to Find Suitable Ontologies Using an Ontology-Based WWW Broker .... 725
J.C. Arpírez Vega, A. Gómez-Pérez, A. Lozano Tello, H.S. Andrade,
N.P. Pinto

Towards Personalized Distance Learning on the Web .... 740
J.G. Boticario, E. Gaudioso Vázquez

Visual Knowledge Engineering as a Cognitive Tool .... 750
T. Gavrilova, A. Voinov, E. Vasilyeva

Optimizing Web Newspaper Layout Using Simulated Annealing .... 759
J. González, J.J. Merelo, P.A. Castillo, V. Rivas, G. Romero

Artificial Neural Network-Based Diagnostic System Methodology .... 769
M. Reyes de los Mozos, D. Puiggrós, A. Calderón

Neural Networks in Automatic Diagnosis of Malignant Brain Tumors .... 778
F. Morales Arcia, P. Ballesteros, S. Cerdán

A New Evolutionary Diagram: Application to BTGP and Information
Retrieval .... 788
J.L. Fernández-Villacañas

Artificial Neural Networks as Useful Tools for the Optimization of the
Relative Offset between Two Consecutive Sets of Traffic Lights .... 795
S. López, P. Hernández, A. Hernández, M. García

ASGCS: A New Self-Organizing Network for Automatic Selection of
Feature Variables .... 805
J. Ruiz-del-Solar, D. Kottow

Adaptive Hybrid Speech Coding with a MLP/LPC Structure .... 814
M. Faúndez-Zanuy

Neural Predictive Coding for Speech Signal .... 824
C. Chavy, B. Gas, J.L. Zarader

Support Vector Machines for Multi-class Classification .... 833
E. Mayoraz, E. Alpaydin

Self-Organizing Yprel Network Population for Distributed Classification
Problem Solving .... 843
E. Stocker, A. Ribert, Y. Lecourtier

An Accurate Measure for Multilayer Perceptron Tolerance to Additive
Weight Deviations .... 853
J.L. Bernier, J. Ortega Lopera, M.M. Rodríguez-Álvarez, I. Rojas,
A. Prieto

Fuzzy Inputs and Missing Data in Similarity-Based Heterogeneous Neural
Networks .... 863
L.A. Belanche, J.J. Valdés

A Neural Network Approach for Generating Solar Irradiation Artificial
Series .... 874
P.J. Zufiria, A. Vázquez-López, Y. Riesco-Prieto, J. Aguilera,
L. Hontoria

Color Recipe Specification in the Textile Print Shop Using Radial Basis
Function Networks .... 884
S. Rautenberg, J.L. Todesco

Predicting the Speed of Beer Fermentation in Laboratory and Industrial
Scale .... 893
J. Rousu, T. Elomaa, R. Aarts

Author Index .... 903


Table of Contents, Vol. I

Neural Modeling (Biophysical and Structural Models)

Self-Assembly of Oscillatory Neurons and Networks .... 1
E. Marder, J. Golowasch, K.S. Richards, C. Soto-Treviño,
W.L. Miller, L.F. Abbott

Reverberating Loops of Information as a Dynamic Mode of Functional
Organization of the N.S.: A Working Conjecture .... 12
J. Mira Mira, A.E. Delgado García

Reconstruction of Brain Networks by Algorithmic Amplification of
Morphometry Data .... 25
S.L. Senft, G.A. Ascoli

Slow Learning and Fast Evolution: An Approach to Cytoarchitectonic
Parcellation .... 34
J.G. Wallace, K. Bluff

Dendritic [Ca²⁺] Dynamics in the Presence of Immobile Buffers and of Dyes .... 43
M. Maravall, Z.F. Mainen, K. Svoboda

Development of Directionally Selective Microcircuits in Striate Cortex .... 53
M.A. Sánchez-Montañés, F.J. Corbacho, J.A. Sigüenza

Neural Circuitry and Plasticity in the Adult Vertebrate Inner Retina .... 65
G. Maguire, A. Straiker, D. Chander, S.N. Haamedi, D. Piomelli,
N. Stella, Q.-J. Lu

Modelling the Circuitry of the Cuneate Nucleus .... 73
E. Sánchez, S. Barro Ameneiro, J. Mariño, A. Canedo, P. Vázquez

Filtering Capability of Neural Networks from the Developing Mammalian
Hippocampus .... 86
L. Menéndez de la Prida, J.V. Sánchez-Andrés

Spatial Inversion and Facilitation in the J. Gonzalo's Research of the
Sensorial Cortex. Integrative Aspects .... 94
I. Gonzalo

A Self-Organizing Model for the Development of Ocular Dominance and
Orientation Columns in the Visual Cortex .... 104
E.M. Muro, M.A. Andrade, P. Isasi, F. Morán

Gaze Control with Neural Networks: A Unified Approach for Saccades and
Smooth Pursuit .... 113
M. Pauly, K. Kopecz, R. Eckhorn

The Neural Net of Hydra and the Modulation of Its Periodic Activity .... 123
C. Taddei-Ferretti, C. Musio

A Biophysical Model of Intestinal Motility: Application in Pharmacological
Studies .... 138
R. Miftakhov, J. Christensen

Model of the Neuronal Net for Detection of Single Bars and Cross-Like
Figures .... 152
K.A. Saltykov, I.A. Shevelev

Connected Cortical Recurrent Networks .... 163
A. Renart, N. Parga, E.T. Rolls

Inter-spike Interval Statistics of Cortical Neurons .... 171
S. Shinomoto, Y. Sakai

A New Cochlear Model Based on Adaptive Gain Mechanism .... 180
X. Lu, D. Chen

Structure of Lateral Inhibition in an Olfactory Bulb Model .... 189
A. Davison, J. Feng, D. Brown

Effects of Correlation and Degree of Balance in Random Synaptic Inputs
on the Output of the Hodgkin-Huxley Model .... 197
D. Brown, J. Feng

Oscillations in the Lower Stations of the Somatosensory Pathway .... 206
F. Panetsos, A. Núñez, C. Avendaño

Effects of the Ganglion Cell Response Nonlinear Mapping on Visual
System's Noise Filtering Characteristics .... 211
L. Orzó

Paradoxical Relationship Between Output and Input Regularity for the
FitzHugh-Nagumo Model .... 221
S. Feerick, J. Feng, D. Brown

Synchronisation in a Network of FHN Units with Synaptic-Like Coupling .... 230
S. Chillemi, M. Barbi, A. Di Garbo

Two-Compartment Stochastic Model of a Neuron with Periodic Input .... 240
R. Rodriguez, P. Lánský

Stochastic Model of the Place Cell Discharge .... 248
P. Lánský, J. Vaillant

Integrate-and-Fire Model with Correlated Inputs .... 258
J. Feng

Noise Modulation by Stochastic Neurons of the Integrate-and-Fire Type .... 268
M. Spiridon, W. Gerstner

Bayesian Modeling of Neural Networks .... 277
R. Mutihac, A. Cicuttin, A. Cerdeira Estrada, A.A. Colavita

Neural Networks of the Hopfield Type .... 287
L.B. Litinskii

Stability Properties of BSB Models .... 297
F. Botelho

Storage Capacity of the Exponential Correlation Associative Memory .... 301
R.C. Wilson, E.R. Hancock

A New Input-Output Function for Binary Hopfield Neural Networks .... 311
G. Galán Marín, J. Muñoz Pérez

On the Three Layer Neural Networks Using Sigmoidal Functions .... 321
I. Ciuca, E. Jitaru

The Capacity and Attractor Basins of Associative Memory Models .... 330
N. Davey, S.P. Hunt

A Modular Attractor Model of Semantic Access .... 340
W. Power, R. Frank, J. Done, N. Davey

Priming an Artificial Associative Memory .... 348
C. Bertolini, H. Paugam-Moisy, D. Puzenat

What Does a Peak in the Landscape of a Hopfield Associative Memory
Look Like? .... 357
A. Imada

Periodic and Synchronic Firing in an Ensemble of Identical Stochastic
Units: Structural Stability .... 367
F.B. Rodríguez, V. López

Driving Neuromodules into Synchronous Chaos .... 377
F. Pasemann

Aging and Lévy Distributions in Sandpiles .... 385
O. Sotolongo-Costa, A. Vazquez, J.C. Antoranz

Finite Size Effects in Neural Networks .... 393
L. Viana, A. Castellanos, A.C.C. Coolen

On the Computational Power of Limited Precision Weights Neural
Networks in Classification Problems: How to Calculate the Weight Range
so that a Solution Will Exist .... 401
S. Draghici

Plasticity Phenomena (Maturing, Learning & Memory)

Estimating Exact Form of Generalisation Errors .... 413
Y. Feng

A Network Model for the Emergence of Orientation Maps and Local
Lateral Circuits .... 421
T. Burger, E.W. Lang

A Neural Network Model for the Self-Organization of Cortical Grating
Cells .... 431
C. Bauer, T. Burger, E.W. Lang

Extended Nonlinear Hebbian Learning for Developing Sparse-Distributed
Representation .... 442
B.-L. Zhang, T.D. Gedeon

Cascade Error Projection: A Learning Algorithm for Hardware
Implementation .... 450
T.A. Duong, T. Daud

Unification of Supervised and Unsupervised Training .... 458
L.M. Reyneri

On-Line Optimization of Radial Basis Function Networks with Orthogonal
Techniques .... 467
M. Salmerón, J. Ortega Lopera, C.G. Puntonet

A Fast Orthogonalized FIR Adaptive Filter Structure Using a Recurrent
Hopfield-Like Network .... 478
M. Nakano-Miyatake, H.M. Pérez-Meana

Using Temporal Neighborhoods to Adapt Function Approximators in
Reinforcement Learning .... 488
R.M. Kretchmar, C.W. Anderson

Autonomous Clustering for Machine Learning .... 497
O. Luaces, J.J. del Coz, J.R. Quevedo, J. Alonso, J. Ranilla,
A. Bahamonde

Bioinspired Framework for General-Purpose Learning .... 507
S. Álvarez de Toledo, J.M. Barreiro

Learning Efficient Rulesets from Fuzzy Data with a Genetic Algorithm .... 517
F. Botana

Self-Organizing Cases to Find Paradigms .... 527
J.J. del Coz, O. Luaces, J.R. Quevedo, J. Alonso, J. Ranilla,
A. Bahamonde

Training Higher Order Gaussian Synapses .... 537
R.J. Duro, J.L. Crespo, J. Santos

On-Line Gradient Learning Algorithms for K-Nearest Neighbor Classifiers .... 546
S. Bermejo, J. Cabestany

Structure Adaptation in Artificial Neural Networks through Adaptive
Clustering and through Growth in State Space .... 556
A. Pérez-Uribe, E. Sánchez

Sensitivity Analysis of Radial Basis Function Networks for Fault Tolerance
Purposes .... 566
X. Parra, A. Català

Association with Multi-dendritic Radial Basis Units .... 573
J.D. Buldain, A. Roy

A Boolean Neural Network Controlling Task Sequences in a Noisy
Environment .... 582
F.E. Lauria, M. Milo, R. Prevete, S. Visco

SOAN: Self Organizing with Adaptative Neighborhood Neural Network .... 591
R. Iglesias, S. Barro Ameneiro

Topology Preservation in SOFM: An Euclidean Versus Manhattan
Distance Comparison .... 601
N.J. Medrano-Marqués, B. Martín-del-Brío

Supervised VQ Learning Based on Temporal Inhibition .... 610
P. Martín-Smith, F.J. Pelayo, E. Ros, A. Prieto

Improving the LBG Algorithm .... 621
M. Russo, G. Patanè

Sequential Learning Algorithm for PG-RBF Network Using Regression
Weights for Time Series Prediction .... 631
I. Rojas, H. Pomares, J.L. Bernier, J. Ortega Lopera, E. Ros,
A. Prieto

Parallel Fuzzy Learning .... 641
M. Russo

Classification and Feature Selection by a Self-Organizing Neural Network .... 651
A. Ribert, E. Stocker, A. Ennaji, Y. Lecourtier

SA-Prop: Optimization of Multilayer Perceptron Parameters Using
Simulated Annealing .... 661
P.A. Castillo, J.J. Merelo, J. González, V. Rivas, G. Romero

Mobile Robot Path Planning Using Genetic Algorithms .... 671
C.E. Thomaz, M.A.C. Pacheco, M.M.B.R. Vellasco

Do Plants Optimize? .... 680
H.J.S. Coutinho, E.A. Lanzer, A.B. Tcholakian

Heuristic Generation of the Initial Population in Solving Job Shop
Problems by Evolutionary Strategies .... 690
R. Varela, A. Gómez, C.R. Vela, J. Puente, C. Alonso

Randomness in Heuristics: An Experimental Investigation for the
Maximum Satisfiability Problem .... 700
H. Drias

Solving the Packing and Strip-Packing Problems with Genetic Algorithms .... 709
A. Gómez, D. de la Fuente

Multichannel Pattern Recognition Neural Network .... 719
M. Fernández-Delgado, J. Presedo, S. Barro Ameneiro

A Biologically Plausible Maturation of an ART Network .... 730
M.E.J. Raijmakers, P.C.M. Molenaar

Adaptive Resonance Theory Microchips .... 737
T. Serrano-Gotarredona, B. Linares-Barranco

Application of ART2-A as a Pseudo-supervised Paradigm to Nuclear
Reactor Diagnostics .... 747
S. Keyvan, L.C. Rabelo

Supervised ART-I: A New Neural Network Architecture for Learning and
Classifying Multivalued Input Patterns .... 756
K.R. Al-Rawi

Artificial Intelligence and Cognitive Neuroscience

Conscious and Intentional Access to Unconscious Decision-Making Module
in Ambiguous Visual Perception .... 766
C. Taddei-Ferretti, C. Musio, S. Santillo, A. Cotugno

A Psychophysical Approach to the Mechanism of Human Stereovision .... 776
F. Moradi

Neural Coding and Color Sensations .... 786
W. Backhaus

Neurocomputational Models of Visualisation: A Preliminary Report .... 798
I. Aleksander, B. Dunmall, V. del Frate

Self-Organization of Shift-Invariant Receptive Fields .... 806
K. Fukushima, K. Yoshimoto

Pattern Recognition System with Top-Down Process of Mental Rotation .... 816
S. Satoh, H. Aso, S. Miyake, J. Kuroiwa

Segmentation of Occluded Objects Using a Hybrid of Selective Attention
and Symbolic Knowledge .... 826
Y. Mitsumori, T. Omori

Hypercolumn Model: A Modified Model of Neocognitron Using
Hierarchical Self-Organizing Maps .... 840
N. Tsuruta, R.-i. Taniguchi, M. Amamiya

Attentional Strategies for Object Recognition .... 850
L. Pessoa, S. Exel

Author Index .... 861


A Unified Model for the Simulation of
Artificial and Biology-Oriented Neural Networks

Alfred Strey

Department of Neural Information Processing


University of Ulm, Oberer Eselsberg, D-89069 Ulm, Germany

Abstract. A unified model for the simulation of artificial and biology-
oriented neural networks is presented. It supports all rate-coded and
also many pulse-coded neural network models. The focus of the paper is
on the special requirements for the simulation of neural networks built
from neurons modelled by a single compartment. The derived unified
neural network model represents a basis for the design of a universal
neurosimulator. Several extensions of the neural specification language
EpsiloNN to incorporate the new model are explained.

1 Introduction

Many artificial neural network models have been proposed and successfully
applied to technical problems. They always use simple rate-coded neurons. Also
several neurosimulators have been implemented to simplify the development of
neural applications. Often they are based on neural simulation kernels
containing optimized realizations of a few artificial neural network models and
learning algorithms (e.g. SNNS [17], NeuralWorks [9]). Alternatively, several
neural specification languages (like AXON [5], CONNECT [8], EpsiloNN [12])
allow a more or less flexible description of artificial neural networks.
However, there is a current trend in neural network research towards more
biologically plausible neural networks. Several experimental results and
theoretical studies show that the timing and temporal correlation of neuron
activity are relevant in neural signal processing [2] [11]. To study the
behaviour of such neural networks, only a few specialized simulation tools
exist. They support the neural network simulation on only one of several
abstraction levels.
Neurosimulators like GENESIS [1] or NEURON [6] are specialized for the
simulation of multi-compartment models. Here the biophysical processes of each
neuron are simulated on a microscopic level. The spatial extension of a neuron
is considered by partitioning the neuron model into several compartments: the
soma, the axon and many dendritic compartments. Each compartment is modelled
by a differential equation which represents the behaviour of the internal cell
membrane. For the simulation of the complete neural network a system of many
coupled differential equations has to be solved numerically. Due to the high
computational effort only small neural networks can be simulated.
On a more abstract level the neuron is modelled by a single compartment.
The detailed internal mechanism of the cell membrane is hidden. Each neuron
generates a spike if its input signals fulfil a certain condition (e.g. if the
total input potential exceeds a certain threshold). The spike impulse is
propagated to succeeding neurons by synaptic connections. Here the information
is weighted and a postsynaptic potential according to an impulse response
function is generated. The spatial aspect is reflected in delays: an impulse of
a distant neuron may be more delayed than the impulse of a neighbour neuron. A
typical simulator supporting this abstraction level is SimSPiNN [15].
The simulation of pulse-coded neural networks can be further simplified if
delays are not supported. So the spatial extension of the neural network is
fully ignored. This is realized in several neurosimulators like MDL [14] or
NEXUS [10]. Neurosimulators for artificial neural networks like the SNNS
mentioned above are suitable only for rate-coded neuron models. They do not
support temporal behaviour. Also a restricted underlying neural network model
often allows only the simulation of a limited subset of artificial neural
networks.
So the user must first determine the abstraction level and then select an
appropriate simulation tool. A change of the level either requires the use of
another simulation tool or results in an inefficient simulation. Also hybrid
network architectures consisting of artificial and biology-oriented models
cannot be simulated. To overcome this problem a universal neurosimulator
capable of efficient neural network simulation on different abstraction levels
is highly desirable.
In this paper a unified neural network model for the simulation of neural
networks on several abstraction levels is presented. The focus is on the
modelling of neurons consisting of a single compartment, although
multi-compartmental neurons can easily be mapped onto the model too. The
resulting unified model, which is described in Sect. 2, represents a basis for
the design of a universally applicable neurosimulator. The neural specification
language EpsiloNN [12] [13], originally developed by the author for the
simulation of artificial network models, has been redesigned to incorporate the
new unified model. Its extensions to support also biology-oriented neural
networks are summarized in Sect. 3.

2 A unified neural network model

The first step in the implementation of a neurosimulator is the design of an
underlying neural network model. It has to contain all neural network features
that are required for a simulation on the desired abstraction levels. For a
formal description of the model either (1) operations on vectors and matrices
or (2) operations on neural objects and a description of the topology can be
used. For simple artificial neural network models the first approach is
sufficient; standard fully connected network models can easily be expressed by
several linear algebra operations. However, the second approach represents a
better high-level abstraction and also allows a simple description of
spike-processing neural networks and of very complex architectures. Thus, the
second approach is preferred here.
Many artificial and biology-oriented neural networks have been analyzed by
the author to derive a unified neural network model. It consists of neurons
(see Sect. 2.1) and synapses (see Sect. 2.2) as basic objects. The network (see
Sect. 2.3) describes the arrangement of the objects and the topology by which
they are connected. Simulation is organized in discrete time steps; the
duration of a time step Δt should be selected according to the actual decay
constants τ. The selection of Δt (often 1 ms is used) represents a compromise:
with a too coarse time step the behaviour of the simulated neural network is
incorrect; on the other hand, computational resources are wasted by selecting a
too fine time step. In each time step the following operations have to be
performed (a code sketch follows the list):

1. A (possibly new) input pattern is presented to the external network inputs.
2. All external input and all internal neuron output signals are propagated
   through the synapses. Here they are weighted and may also be delayed. In
   case of spiking network models also a postsynaptic potential may be induced.
3. The state variables of all objects are updated.
4. The new neuron output values are computed.
5. Learning may be performed by adjusting the synaptic weights and/or some
   neuron parameters.
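A minimal Python sketch of this per-time-step schedule is given below; all
class, attribute and method names (network, synapses, propagate, ...) are
hypothetical placeholders, since the unified model only prescribes the order of
the five operations:

    # Hypothetical sketch of the five-step simulation schedule above.
    # All names are illustrative; the model only prescribes the order.
    def simulate(network, patterns, learning=True):
        for pattern in patterns:
            network.present_input(pattern)           # step 1: apply input pattern
            for synapse in network.synapses:         # step 2: weight, delay and
                synapse.propagate()                  #   possibly induce a PSP
            for obj in network.neurons + network.synapses:
                obj.update_state()                   # step 3: update state variables
            for neuron in network.neurons:
                neuron.compute_output()              # step 4: new output values
            if learning:                             # step 5: adapt weights and/or
                for synapse in network.synapses:     #   neuron parameters
                    synapse.adapt_weight()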

Especially for biology-oriented neural networks, mostly linear first-order
differential equations of the form

    τ·dx(t)/dt = −x(t) + g(u, t)                                            (1)

are used for the description of the model behaviour. After discretization, they
can be simulated by a difference equation of the form

    x(t + Δt) = (1 − Δt/τ)·x(t) + (Δt/τ)·g(u, t) = α·x(t) + (1 − α)·g(u, t) (2)

Throughout this paper it is assumed that such difference equations with
exponential decay constants α are sufficient for a description and a correct
computer simulation of the model behaviour.
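As an illustration, the following minimal Python sketch iterates the difference
equation (2) for a given input sequence; only the update rule and the decay
constant α = 1 − Δt/τ come from Eq. (2), all names are chosen for the example:

    # Minimal sketch: simulating the difference equation (2) for a leaky
    # state variable x driven by an input g; names are illustrative only.
    def simulate_decay(g_values, tau, dt, x0=0.0):
        alpha = 1.0 - dt / tau       # decay constant of Eq. (2)
        x = x0
        trace = []
        for g in g_values:
            x = alpha * x + (1.0 - alpha) * g
            trace.append(x)
        return trace

    # x converges towards a constant input with time constant tau:
    # simulate_decay([1.0] * 50, tau=10.0, dt=1.0)[-1] is close to 1.0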

2.1 The neuron model

Each neuron (see Fig. 1) consists of several input groups (also called sites), by
which it receives signals from other neurons or external inputs. A site contains all
inputs with similar characteristics from a certain part of the dendritic tree. The
signals u(/)(t) . .(u~. j). , (J) of each site j are combined by an arbitrary func-
, U~naxj
tion f(uj) resulting in an input potential p(J)(t) = f(J)(u(J)(t), p(J)(t - At)) which
may be excitatory (p(J) > 0) or inhibitory (p(J) < 0). The total neural activity x
(also called neuron potential or membrane potential) is calculated by the follow-
ing function from all k input potentials: x(t) = fact (p(1)(t),..., p(k)(t), x(t--At)).
The neuron's output potential y(t) (also called axonal potential) is com-
puted from the neuron's activity x by applying an arbitrary output function:
y(t) = fout(X(t),~(t)). The sigmoidal function y(t) = 1/(1 + e-x(t)), the Gaus-
sian function y(t) = e-X~(t)/z(t)2 or a threshold function are often used here. In
biology-oriented simulations the output y may be delayed by an axonal delay
l(t)~ ..... . . . . . - ............
l "--...../
, teach

(.ii" ~" /' y(t)


y(t-/~(n))

" ' " - - . . . . . . . . . . . ....-''~176

Fig. 1. The unified neuron model

A(n) = d. A t which is a multiple d of the time step. The output functions often
need a p a r a m e t e r fl or a p a r a m e t e r vector j3 = (/31,...,~ma• It m a y repre-
sent e.g. a threshold 6) or a variance a 2. The p a r a m e t e r ~ m a y also be adapted
by a function t3(t + 1) = fz(l~(t), x(t), y(t), l(t)). In biology-oriented neural net-
works the p a r a m e t e r ~ often describes a dynamic threshold by which a refractory
mechanism is realized:

~9(t + 1) = f ~(s) if x(t) > O(t) (3)


[O(t)*~ else

Here after the generation of a spike for x > (9 at time step t (8) the threshold is
raised to a high value 0(s) to prevent the neuron from spiking again.
In adaptive neural networks also a learning potential has to be computed:
l(t) = fiearn(x(t), y(t), teach(t), e(t)). It is required by the incoming synapses for
learning (compare Sect. 2.2). In case of Hebbian learning l(t) is identical with
the activity x(t). If supervised learning algorithms are used l(t) either depends
on an externally supplied teacher signal (e.g. l(t) = teach(t) - y(t)) or on an
internal error potential e(t) = fd(d(t)) which is calculated from the elements of
a difference vector d = ( d l , . . . , d m a x ) received from other succeeding neurons
via synaptic connections (e.g. e(t) = ~y dj).
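A compact sketch of such a single-compartment neuron, assuming one site with a
plain sum as site function, a binary threshold output function and the
refractory threshold of Eq. (3), might look as follows; all parameter values
and names are illustrative only:

    # Illustrative single-compartment neuron: one site, leaky activation,
    # binary threshold output and the refractory threshold of Eq. (3).
    class UnifiedNeuron:
        def __init__(self, alpha=0.9, theta0=0.5, theta_spike=5.0):
            self.alpha = alpha              # decay constant (Eq. 2)
            self.theta = theta0             # dynamic threshold (parameter beta)
            self.theta_spike = theta_spike  # raised value Theta^(s) after a spike
            self.x = 0.0                    # activity (membrane potential)
            self.y = 0.0                    # axonal output

        def step(self, inputs):
            p = sum(inputs)                                    # site function f(u)
            self.x = self.alpha * self.x + (1 - self.alpha) * p  # f_act, Eq. (2)
            self.y = 1.0 if self.x > self.theta else 0.0       # threshold f_out
            # Eq. (3): raise the threshold after a spike, let it decay otherwise
            self.theta = self.theta_spike if self.y > 0 else self.alpha * self.theta
            return self.y

    # In the Hebbian case the learning potential l(t) is simply the activity x(t).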

2.2 The synapse model

Each synapse (see Fig. 2) represents a connection between two neurons or
between an external network input node and a neuron. It has at least one input
in, one output out and at most one weight value w which represents the synaptic
strength. In artificial neural network models, the output
out(t) = f_prop(in(t), w(t)) is computed from the presynaptic potential in and
the weight w (mostly by a simple function like out(t) = in(t)·w(t) or
out(t) = in(t) − w(t)).

[Fig. 2. The unified synapse model, showing the forward path from in to out and
the backward output back derived from the postsynaptic potential post.]

In case of spike-processing networks the propagate function f_prop is more
complex. Here each presynaptic spike induces a postsynaptic potential which can
be excitatory (EPSP) or inhibitory (IPSP). It can be described by an impulse
response function [3] of the form

    ε(t) = { exp(−(t − Δ^(s))/τ₁) − exp(−(t − Δ^(s))/τ₂)   if t ≥ Δ^(s)
           { 0                                             else            (4)

with sharp rise and exponential decay. Here Δ^(s) represents the synaptic
delay. The actual output value out(t) = Σ_i w·ε(t − t_i^(s)) is a superposition
of the response functions induced by all previous spikes at time steps t_i^(s).
However, not all time steps of previous spikes must be stored in each synapse.
The output value can more easily be calculated by the following equation:

    out(t + Δt) = { w·(a₁·(1 + out₁(t)) − a₂·(1 + out₂(t)))   if t − Δ^(s) = t_i^(s)
                  { w·(a₁·out₁(t) − a₂·out₂(t))               else                  (5)

Here out₁(t) and out₂(t) represent the parts of the output signal that result
from the first and second exponential term of Eq. 4. The values
a₁ = exp(−Δt/τ₁) and a₂ = exp(−Δt/τ₂) indicate the corresponding decay
constants.
More generally, the actual synaptic output value out(t) can be described by a
function f_prop(in(t − Δ^(s)), out(t − Δt), w(t)) that depends on the synaptic
delay and the past output value. The synaptic delay of each synapse is modelled
by a (not adjustable) multiple of the time step: Δ^(s) = d·Δt. It can be
realized by an internal FIFO buffer containing at least the input signals of
the last d time steps and a demultiplexer for selecting the correct value of
time step t − Δ^(s).
Learning depends on the presynaptic potential in(t) and a postsynaptic
potential post(t) (usually the learning potential l(t) of the postsynaptic
neuron, compare Sect. 2.1): Δw(t) = f_learn(in(t − Δ^(s)), post(t), w(t − Δt)).
Often Hebbian learning is used here: Δw(t) = γ·in(t − Δ^(s))·post(t). It may be
combined with a decay term w(t) = γ·w(t − Δt) + Δw(t) to realise a forgetting
mechanism. Many learning functions depend on a local parameter γ or on a local
parameter vector γ = (γ₁, ..., γ_max). These parameters may also be updated
during learning by a local function f_γ(γ(t − Δt), in(t − Δ^(s)), post(t)).
In most neural network models a synapse represents a unidirectional
connection. However, in several supervised learning algorithms for multi-layer
networks there is also a flow of (weighted) error information in the reverse
direction. So each synapse requires an additional output
back(t) = f_back(post(t), w(t)).
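A sketch of such a spike-processing synapse is shown below: a FIFO buffer
realizes the synaptic delay, and two decaying state variables implement the
recursion of Eq. (5), so no spike history has to be stored. All names and
default values are assumptions for the example:

    # Illustrative spike-processing synapse: FIFO delay plus the recursive
    # double-exponential PSP computation of Eq. (5).
    import math
    from collections import deque

    class SpikingSynapse:
        def __init__(self, w, d, tau1, tau2, dt=1.0):
            self.w = w                        # synaptic weight
            self.fifo = deque([0.0] * d)      # delay buffer of d time steps
            self.a1 = math.exp(-dt / tau1)    # decay constants a1, a2 of Eq. (5)
            self.a2 = math.exp(-dt / tau2)
            self.out1 = 0.0                   # slow exponential term of Eq. (4)
            self.out2 = 0.0                   # fast exponential term of Eq. (4)

        def propagate(self, spike_in):
            self.fifo.append(spike_in)
            delayed = self.fifo.popleft()     # input of time step t - Delta(s)
            if delayed:                       # a delayed spike arrives, Eq. (5)
                self.out1 = self.a1 * (1.0 + self.out1)
                self.out2 = self.a2 * (1.0 + self.out2)
            else:                             # no spike: pure exponential decay
                self.out1 *= self.a1
                self.out2 *= self.a2
            return self.w * (self.out1 - self.out2)   # superposed PSPs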

2.3 The network model

A neural network is a directed acyclic or cyclic graph. Each node represents


a neuron or an i n p u t / o u t p u t node, each arc is a synaptic connection between
two nodes. The nodes are organized in populations (also called layers in ar-
tificial network models) which are one- or two-dimensional arrays of identical
neurons or i n p u t / o u t p u t nodes. All neurons of one population use the same in-
put, activation, output and learning functions, only parameters are different.
The populations are interconnected by synaptic networks. Each network either
connects two different populations (interpopulation network) or connects neu-
rons of the same population (intrapopulation network). All connections of one
network are realized by identical synapses. So they share the same propagate and
learning functions with locally different parameters. The connections of each net-
work are arranged in certain topologies. Regular topologies are preferred in most
simulations of artificial and biology-oriented neural networks. The three most
important topologies are illustrated in Fig. 3 and described in the following for
two-dimensional populations. They exist for one-dimensional networks too.
full: Each node (ix, iy) of the source population is connected to all nodes of the
destination population.
corresponding: The source node (ix, iy) is connected with node (ix, iy) of the
destination population.
topographic map: This topology is very often used in biology-oriented neural
networks because it corresponds to the connection scheme of the cortex. Here
each node (ix, iy) receives input from a neighbourhood (also called receptive
field) around a certain node of the source population. The neighbourhood has
a rectangular shape of size kx × ky (with kx, ky odd) and is centered at the
node c(ix, iy). More formally, let sx × sy and dx × dy be the sizes of the source
and destination populations. Then the center node of the neighbourhood in
the source population for each node (ix, iy) of the destination population with
ix = 0, ..., dx − 1 and iy = 0, ..., dy − 1 is

$$c(i_x, i_y) = \begin{cases} (i_x, i_y) & \text{if } s_x = d_x \text{ and } s_y = d_y \\ \left( \lceil i_x s_x / d_x \rceil,\ \lceil i_y s_y / d_y \rceil \right) & \text{else.} \end{cases} \qquad (6)$$

So populations of different sizes can also be connected by this topology to allow
expansion and compression. The resulting topographic maps will have an overlap
of ox = kx − ⌈sx/dx⌉ and oy = ky − ⌈sy/dy⌉. If a map connects the neuron
Fig. 3. An example network (topologies map (intra), full, map, and corresponding between the external input, population 1, population 2, and the external output)

outputs y to the neuron inputs u and the propagate function out = in · w is
used in each single synapse, then the total network operation will be similar to
a 2D convolution:

$$u_{i_x, i_y} = \sum_{x=-(k_x-1)/2}^{(k_x-1)/2} \; \sum_{y=-(k_y-1)/2}^{(k_y-1)/2} w_{x,y,i_x,i_y} \cdot y_{c_x+x,\, c_y+y} \qquad (7)$$

However, in contrast to a convolution (where identical kernel elements w_{x,y} are
required), the weights w_{x,y,ix,iy} can be different here for map instances connected
to different nodes (ix, iy).
Rarely, irregular topologies are also used, especially for connecting the neu-
rons of very small neural networks. Here the connectivity can be described ex-
plicitly by index pairs of source and destination nodes.
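As an illustration, the following Python sketch builds the receptive field of one destination node according to Eq. 6. The function names, the reading of the bracket in Eq. 6 as a ceiling, and the zero/cyclic border handling (which anticipates the options described in Sect. 3) are assumptions, not part of the original model.

import math

def center_node(ix, iy, sx, sy, dx, dy):
    # Center of the receptive field in the source population (Eq. 6)
    if sx == dx and sy == dy:
        return ix, iy
    return math.ceil(ix * sx / dx), math.ceil(iy * sy / dy)

def receptive_field(ix, iy, kx, ky, sx, sy, dx, dy, cyclic=False):
    # Source nodes feeding destination node (ix, iy); kx and ky must be odd.
    # Connections to missing border nodes are truncated unless cyclic=True.
    cx, cy = center_node(ix, iy, sx, sy, dx, dy)
    field = []
    for x in range(-(kx - 1) // 2, (kx - 1) // 2 + 1):
        for y in range(-(ky - 1) // 2, (ky - 1) // 2 + 1):
            jx, jy = cx + x, cy + y
            if cyclic:
                field.append((jx % sx, jy % sy))
            elif 0 <= jx < sx and 0 <= jy < sy:
                field.append((jx, jy))
    return field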

2.4 Implementation remarks

The unified neural network model presented in the previous sections can either
directly be implemented in a neural simulation kernel or it can be used as an
underlying model for a neural specification language.
A neural simulation kernel allows an efficient simulation because all required
activation, output, propagate and learning functions of the synapse and neuron
models can be optimally implemented in the simulation kernel. The user must
only select such functions, several parameters and (possibly) the network topol-
ogy, which can be done by a configuration file or a graphical interface. However,
the flexibility is limited: only the parameters and functions predefined in the
kernel can be used. For each new parameter or new function the kernel source
code has to be extended and recompiled.
A neural specification language allows the description of all neural networks
that conform to the underlying formal model. A compiler translates the
specification into simulation code. This methodology is rather flexible because
the specification language allows the definition of any arbitrary neural function
that can be expressed by the language and an arbitrary number of internal pa-
rameters. The abstract high-level syntax follows the neural network terminology
and allows a concise and unique neural network specification. Thus it also sim-
plifies the interchange of specifications between neural network researchers from
different disciplines. Furthermore, the specification is also independent of the
target computer architecture. Compilers for parallel computers can be imple-
mented too. Thus, it represents the preferable approach for the implementation
of a neurosimulator.
Usually a time-driven simulation is realized. In each simulation step the vari-
ables of all neural objects are updated, so the network behaviour is simulated
exactly. However, in many simulations of spiking neural network models the
mean network activity is low: at most a few neurons generate a spike in each
time step. Here an event-driven simulation can be used to improve the efficiency
of the simulation [16]. Each spike is considered as an event which is character-
ized by the time step t^(s) and the index i of the spike-generating neuron. A
central event list contains all spikes in temporal order. Only those synapses w_ij
connected to spiking neurons must be simulated for a certain period of time
starting at time step t^(s) + Δ_i^(n) + Δ_ij^(s) and ending when the induced postsynap-
tic potential is again negligibly small (i.e. below a certain threshold). Selecting
an appropriate threshold represents a compromise between a high efficiency and
a high simulation accuracy. The event-driven simulation is especially interesting
for special-purpose hardware [4] [7]. It can also be included in a neurosimula-
tor if the compiler can generate event-driven simulation code from a network
specification.
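A minimal Python sketch of such an event-driven scheme follows. A global heap serves as the central event list; all names, the flat synapse tuples and the stopping test are simplifying assumptions, not the implementation of [16].

import heapq
import math

def epsilon(t, tau1, tau2):
    # Impulse response of Eq. 4 with the delay already subtracted from t
    if t < 0.0:
        return 0.0
    return math.exp(-t / tau1) - math.exp(-t / tau2)

def simulate_event_driven(spike_events, synapses, dt, t_end, threshold=1e-4):
    # spike_events: list of (t_s, i) pairs; synapses[i]: list of
    # (j, w, delay, tau1, tau2) connections from neuron i to neuron j
    heapq.heapify(spike_events)                # the central event list
    traces = {}                                # (j, time step) -> summed PSP
    while spike_events and spike_events[0][0] <= t_end:
        t_s, i = heapq.heappop(spike_events)   # next spike in temporal order
        for (j, w, delay, tau1, tau2) in synapses.get(i, []):
            t = t_s + delay                    # the PSP starts after the delay
            while t <= t_end:
                psp = w * epsilon(t - t_s - delay, tau1, tau2)
                # stop once the induced potential is negligibly small again
                if t - t_s - delay > tau1 and abs(psp) < threshold:
                    break
                step = round(t / dt)
                traces[(j, step)] = traces.get((j, step), 0.0) + psp
                t += dt
    return traces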
In the unified neural network model the postsynaptic potential is modelled (in
accordance with biology) by the synaptic propagate function f_prop. The neuron
simply adds all incoming potentials. For the implementation, however, it is more
efficient to combine the calculation of all postsynaptic potentials with identical
time constants in the postsynaptic neuron: it adds the weighted input signals
and computes all impulse response functions locally.

3 Extensions of the specification language EpsiloNN

The neural specification language EpsiloNN has been designed especially for the
simulation of artificial neural networks on different computers [12] [13]. To also
support biology-oriented neural networks, the language has been redesigned.
First, the underlying neural network model has been extended in accordance
with the unified model presented in the previous section. Secondly, several new
language constructs have been included in the latest EpsiloNN release to support
all new features of the underlying model:
- Two-dimensional populations of neurons or input/output nodes can be spec-
ified (e.g. spiking_neuron popl[50][50]) and connected by all topologies.
- The new field topology is available for a simple specification of topographic
maps. Here the user specifies the names of source and destination population,
the size kx × ky of the neighborhood (in the example below 7 × 11) and the
neuron input/output variables that are connected by the map, e.g.:
map_synapse net = {field, popl, popi, 7, 11, "init.map", zero,
in = popl.y, out = popi.y}

Initial weights w_xy (identical for all instances of the map) may be read
from an optional initialization file. Alternatively, the weights can be set by
a user-defined function (randomly or dependent on the indices of the source
and destination nodes). Also, arbitrary learning functions can be defined for
updating the weights according to presynaptic and postsynaptic potentials.
Thus, the weights can differ in different instances of each map.
All connections required for the topographic map will automatically be built
by the simulator (also if the sizes of source and destination population are
different, see Sect. 2.3). At the borders of the source population only a partial
map can be realized because the source nodes (ix + x, iy + y) do not exist for all
x ∈ {−(kx − 1)/2, ..., (kx − 1)/2} and all y ∈ {−(ky − 1)/2, ..., (ky − 1)/2}.
The connections to the missing nodes are either truncated by the option
zero (default), or the source nodes ((ix + x) mod sx, (iy + y) mod sy) at the
opposite side of the source population are used instead (option cyclic).
- All network delays are mapped into the synapse object. So the delay Δ_ij
of the synapse with weight w_ij represents the sum of the delay Δ_i^(n) of the
presynaptic neuron i and the synaptic delay Δ_ij^(s). It can be specified by the
user as a multiple d_ij of the time step, Δ_ij := d_ij · Δt. The delay can be
different for each synapse of the same network and can be set by a user-
defined function (dependent on the indices of the source and destination
nodes). Internally, the FIFO buffer required for storing the d_ij last input
values is implemented in the presynaptic neuron and not in each synapse (as
assumed in the underlying model). Thus, the output values must be stored
only once in a FIFO buffer of size max_i d_ij and the efficiency is improved
(see the sketch below).
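For illustration, a minimal Python sketch of such a shared presynaptic delay line follows; the class and method names are assumptions.

from collections import deque

class PresynapticDelayLine:
    # One shared buffer per neuron, of size max_i d_ij time steps.
    def __init__(self, max_delay_steps):
        self.buf = deque([0.0] * max_delay_steps, maxlen=max_delay_steps)

    def push(self, y):
        # Store the neuron output of the current time step.
        self.buf.append(y)

    def delayed(self, d):
        # Output of d time steps ago (1 <= d <= max_delay_steps); a synapse
        # with delay d_ij reads delayed(d_ij) before push() for this step.
        return self.buf[-d]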

4 Conclusions

The presented unified neural network model incorporates the features of all ar-
tificial and many spiking neural network models. In particular, the most important
characteristics of biology-oriented neural network models (postsynaptic poten-
tials, delays, spike generation and topographic maps) are included. Thus, im-
portant models like the integrate-and-fire neuron or the spike response model
[3] can easily be described. Because of the common notation, artificial and spiking
neural networks can be combined to model complex hybrid neural architectures.

Acknowledgements
This work is partially supported by the DFG (SFB 527, subproject B3).

References
1. Bower, J., and Beeman, D. The Book of GENESIS: Exploring Realistic Neural
Models with the GEneral NEural SImulation System. Springer, New York, 1995.
2. Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Re-
itboeck, H. Coherent Oscillations: A mechanism of feature linking in the visual
cortex? Biological Cybernetics 60 (1988), 121-130.
3. Gerstner, W. Spiking Neurons. In Pulsed Neural Networks, W. Maass and C. Bishop,
Eds. MIT Press, 1998, ch. 1, pp. 3-54.
4. Hartmann, G., Frank, G., Schäfer, M., and Wolff, C. Spike128K - An Accel-
erator for Dynamic Simulation of Large Pulse-Coded Networks. In Proceedings
MicroNeuro'97 (1997), H. Klar, A. Koenig, and U. Ramacher, Eds., pp. 130-139.
5. Hecht-Nielsen, R. Neurocomputing. Addison-Wesley, 1990.
6. Hines, M., and Carnevale, N. The NEURON Simulation Environment. Neural
Computation 9 (1997), 1179-1209.
7. Jahnke, A., Roth, U., and Schönauer, T. Digital Simulation of Spiking Neural
Networks. In Pulsed Neural Networks, W. Maass and C. Bishop, Eds. MIT Press,
1998, ch. 9.
8. Kock, G., and Serbedžija, N. Artificial Neural Networks: From compact descrip-
tions to C++. In Proceedings of the International Conference on Artificial Neural
Networks ICANN'94 (1994), Springer, pp. 1372-1375.
9. NeuralWare, Inc., Pittsburgh (PA). NeuralWorks Reference Guide, 1995.
10. Sajda, P., and Finkel, L. NEXUS: A simulation environment for large-scale neural
systems. SIMULATION 59, 6 (1992), 358-364.
11. Singer, W., and Gray, C. Visual feature integration and the temporal correlation
hypothesis. Ann. Rev. Neuroscience 18 (1995), 555-586.
12. Strey, A. EpsiloNN - A Specification Language for the Efficient Parallel Imple-
mentation of Neural Networks. In Biological and Artificial Computation: From
Neuroscience to Technology, LNCS 1240 (Berlin, 1997), J. Mira, R. Moreno-Diaz,
and J. Cabestany, Eds., Springer, pp. 714-722.
13. Strey, A. EpsiloNN - A Tool for the Abstract Specification and Parallel Simulation
of Neural Networks. Systems Analysis Modelling Simulation (SAMS), Gordon &
Breach, 1999, in print.
14. Teeters, J. MDL: A system for fast simulation of layered neural networks. SIMU-
LATION 56, 6 (June 1991), 369-379.
15. Walker, M., Wang, H., Kartamihardjo, S., and Roth, U. SimSPiNN - A Simulator
for Spike-Processing Neural Networks. In Proceedings of the 15th IMACS World
Congress on Scientific Computation, Modelling and Applied Mathematics (Berlin,
1997), A. Sydow, Ed., Wissenschaft & Technik Verlag.
16. Watts, L. Event-driven simulation of networks of spiking neurons. In Advances in
Neural Information Processing Systems (1994), J. Cowan, G. Tesauro, and J. Al-
spector, Eds., vol. 6, Morgan Kaufmann Publishers, Inc., pp. 927-934.
17. Zell, A. et al. SNNS Stuttgart Neural Network Simulator, User Manual, Version
4.0. Report 6/95, University of Stuttgart, 1995.
Weight Freezing in Constructive Neural
Networks: A Novel Approach

Shahram Hosseini, Christian Jutten*

LIS, INPG, 46 Av. F. Viallet, 38031 Grenoble cedex, France

Abstract. Constructive algorithms can be classified in two main groups:
freezing and non-freezing, each one having its own advantages and incon-
veniences. In large scale problems, freezing algorithms are more suitable
thanks to their speed. The main problem of these algorithms, however,
comes from the fixed-size nature of the new units that they use. In this
paper, we present a new freezing algorithm which constructs the main
network by adding small and variable-size accessory networks trained by
a non-freezing algorithm instead of simple units.

1 Introduction

Multi-layer perceptrons (MLP) trained by error backpropagation are widely used
for function approximation. Given a data set:

$$(x_i, y_i) = (x_i,\ g(x_i) + n_i), \quad i = 1, \ldots, N, \qquad (1)$$

where x_i and y_i are samples of the variables x ∈ ℝ^r and y ∈ ℝ and the n_i are zero-
mean noise samples, one wants to find a good approximation of the underlying
relationship g(·). In general, the optimal size of an MLP for a given problem is
unknown. A too small network cannot learn the data with a sufficient accuracy
and a too large network leads to overfitting and thus to poor generalization.
Constructive approaches have been used to solve this problem [1]. These meth-
ods start with a minimal configuration and dynamically grow the network until
the target function can be fitted with a satisfying accuracy. According to the
method used to train the new units, these algorithms can be classified in two
main categories: freezing and non-freezing algorithms.

In the freezing algorithms, at each step of the network construction, one
computes the residue of the current network and tries to estimate it with a new
unit. The inputs of this unit are the network inputs and possibly the outputs
of the other units of the main network. Once the training of the new unit is
finished, its weights are frozen and the unit is added to the main network. After
the fusion, the weights of the input and the hidden layers do not change any-
more. However, to ensure that the new residual signal remains orthogonal to the

* Christian Jutten is professor at the Institut des Sciences et Techniques de Grenoble
(ISTG) of the Université Joseph Fourier (UJF).

subspace spanned by the different units, the weights of the output layer must be
retrained. [2] and [3] are the most popular algorithms of this category.

In the non-freezing algorithms, however, at each step of the network con-
struction and after adding the new unit, there is no frozen parameter and all
the network weights can be modified. Before retraining the whole network, the
weights of the new unit may be initialized either randomly [4] or by teaching the
current network residue to the new unit [5].

Two main criteria may be used to compare freezing and non-freezing algo-
rithms: the final network size and the convergence speed.

In general, the freezing algorithms lead to larger networks. Indeed, these algo-
rithms try to find the optimal solution in a small subset of the parameter space and
not in the whole space. Consequently, with respect to non-freezing algorithms,
they need more parameters to achieve the same performance. This problem espe-
cially depends on the estimation capacity of the new unit. For a simple sigmoidal
unit, this capacity is very limited, especially when a single hidden layer network
is used. When the network locks into a "hard state", one needs to add a consid-
erable number of single sigmoidal units to exit it [6]. A solution proposed by many
researchers is to use more complicated units [3] or the cascade architecture [2].

Many parameters may influence the convergence speed of a constructive al-
gorithm. For a freezing algorithm, the time required for convergence depends
essentially on the learning capacity of the new unit. If the unit learns the residue
sufficiently well, the algorithm converges quickly. On the other hand, a bad choice
of the new unit leads to adding many neurons without considerably improving the
estimation and slows down the convergence. Concerning non-freezing
methods, the relation between the speed of the algorithm and the network size is
very important. When a small network is able to solve the problem, the time
necessary to train the whole network can be acceptable. But imagine a problem
requiring a network of thousands of units: evidently, retraining the whole network
after adding each new unit is very time-consuming. In such a case, the freezing
algorithms are the only practical methods of network construction.

With this discussion, we conclude that while non-freezing algorithms are
suitable for small size problems, weight freezing is required for large scale,
real world ones. However, the choice of a single sigmoid as the new unit enlarges
the size of the network and slows down the convergence, while the choice of a too
complicated new unit (like the smoothers used in [3]) increases the number of
parameters and may degrade the generalization. In our opinion¹, the main
difficulty comes from the fixed-size nature of the new units, and a good solution
for this problem is constructing the new units (or more precisely, new accessory
networks) instead of adding simple neurons one by one. In the next section, we

1 And it is verified by the following results.



present a freezing algorithm which constructs the main network by adding such
small accessory networks trained by a non-freezing algorithm.

2 Algorithm

The essential specification of this algorithm is that a new accessory network,
instead of a new unit, is added to the current network for estimating the residue.
In the following, we suppose the network contains a single hidden layer. Denoting
φ_l(x) the output of the l-th accessory network, the main network output will be:

$$f_L(x) = \sum_{l=1}^{L} \beta_l\, \varphi_l(x) \qquad (2)$$

where β_l represents the output weights and L the number of accessory networks.
Supposing the l-th accessory network contains M_l neurons, its output is:

$$\varphi_l(x) = \sum_{j=1}^{M_l} a_{lj}\, \psi(w_{lj}^T x + \theta_{lj}) \qquad (3)$$

where ψ is a sigmoidal function. Considering the data model (1), suppose we al-
ready have K accessory networks providing the estimation f_K(x) = Σ_{l=1}^K β_l φ_l(x)
of g(x). Hence, the residue of estimation is ε_K = y − f_K(x), and we want to add
another accessory network φ_{K+1}(x) to minimize ‖ε_K − β_{K+1} φ_{K+1}(x)‖². It has
been shown [7] that the above expression achieves its minimum by maximizing:

$$E = \frac{\left( \sum_{i=1}^{N} \varepsilon_K(x_i)\, \varphi_{K+1}(x_i) \right)^2}{\sum_{i=1}^{N} \varphi_{K+1}^2(x_i)} \qquad (4)$$

and choosing:

$$\beta_{K+1} = \frac{\sum_{i=1}^{N} \varepsilon_K(x_i)\, \varphi_{K+1}(x_i)}{\sum_{i=1}^{N} \varphi_{K+1}^2(x_i)} \qquad (5)$$
In our method the accessory network φ_{K+1} is constructed using a non-
freezing algorithm. In fact, after computing the residue of the main network,
ε_K, we first try to estimate it by maximizing (4) with a single neuron. If this
neuron succeeds in significantly decreasing the error of residue approximation,
that is, the objective function E mentioned in (4) is greater than a predefined
threshold, it will be added to the main network. Otherwise, we add another neu-
ron to the first one and these two neurons, this time together, try to estimate the
residue. After the convergence, we verify again whether there is a significant reduction
of error or not. If yes, the construction of the accessory network will be stopped;
otherwise, the process of construction continues until there is a sensible reduction
of error. Afterwards, the weights of the accessory network will be frozen and its
output will be added to the main network with its output weight, β_{K+1}, whose
initial value can be computed using (5). Then, in order that the residue remains
orthogonal to the subspace spanned by the different accessory networks, the

weights of the output layer will be updated. Finally, we compute once more the
residue and another accessory network is constructed for estimating it. The algo-
rithm continues until a good estimation of the target function, satisfying a stopping
criterion, is obtained. If there is enough data, a cross validation on a test data
base can be used as stopping criterion; otherwise other methods of generalization
evaluation may be considered. Figure 1 shows the network construction scheme.

Fig. 1. Network construction method: a) Residue computation for the main network.
b) Estimation of the residue with an accessory network using a non-freezing algorithm.
c) The fusion of the accessory network into the main network

The algorithm can be summarized as follows:

BEGIN
  Initialization: main-network-size = 0, residue = target-function.
  DO {
    new-accessory-network-size = 0.
    DO {
      new-accessory-network-size++.
      Train accessory network to estimate the residue.
    } WHILE ((E / residue-power) < TH1).
    Connect the accessory network to the main network.
    Optimize the output layer weights.
    Compute new residue.
  } WHILE (stopping criterion is not satisfied).
END
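For illustration, a small numpy sketch of the quantities driving the inner loop is given below: the objective E of Eq. (4) and the initial output weight of Eq. (5). The training of the accessory network itself (quickprop, see Sect. 3) is omitted, and all names are our own.

import numpy as np

def objective_E(residue, phi_out):
    # Eq. 4: squared correlation of the residue with the accessory output,
    # normalized by the accessory output power
    return np.dot(residue, phi_out) ** 2 / np.dot(phi_out, phi_out)

def initial_beta(residue, phi_out):
    # Eq. 5: initial output weight of the frozen accessory network
    return np.dot(residue, phi_out) / np.dot(phi_out, phi_out)

# Inner-loop acceptance test of the algorithm above, with TH1 a chosen
# threshold: accept once
#   objective_E(residue, phi_out) / np.dot(residue, residue) > TH1,
# then freeze the accessory weights and connect it to the main network.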

The advantages of this algorithm are:
i) As the main network is based on a freezing algorithm, it can be constructed
in a reasonable time even for large-scale problems.
ii) The constructive design of the accessory network guarantees a significant
estimation of the residue. Hence, we overcome the main impediment of freezing algo-
rithms, i.e. their difficulty in situations where a new unit cannot approximate
the residue well.
iii) A non-freezing algorithm is used for building the accessory network. As
we said, these algorithms lead to good performance for small networks.

3 Simulation results with artificial data

The experiments concern the approximation of the function g(x) = sin(πx −
0.5) cos(2πx), x ∈ [−2, 2]. The learning samples consist of N = 200 patterns
(x_i, g(x_i) + n_i) where the n_i are i.i.d. zero-mean noise samples. The experiments are
realized with 3 algorithms: our method, i.e. the Accessory Network Based Freezing
Algorithm (ANBFA), a freezing algorithm which adds new single sigmoids
(similar to [7]), which we refer to as the Single Sigmoid Based Freezing Algorithm (SS-
BFA), and a Non-Freezing Algorithm (NFA) introduced in [4]. In all experiments,
single hidden layer networks are trained using the "quickprop" algorithm and the
stopping criterion consists in comparing the test error with a predefined thresh-
old. The experiments consist of 10 runs, each run corresponding to a different
weight initialization.

Fig. 2. Approximation of a function with ANBFA. Target function (solid line), its noisy
samples (dots) and the approximation (dashed line).

In Fig. 2, the result of a sample run with our algorithm is shown. Table 1
illustrates the result of the experiments for the 3 algorithms. In this table, "Combi-
nation" indicates the sizes of the different accessory networks constructed by ANBFA.
As can be seen, ANBFA is on average 4.4 times faster than SSBFA and 3.1
times faster than NFA, and it leads to slightly larger networks. Moreover, in
30% of the experiments, the SSBFA method was locked in a plateau so that even
with 100 neurons it was not able to satisfy the stopping criterion and we had to
stop the construction procedure. In the row "Average", we neglected these runs.

4 Simulation results with real data

In the second experiment, we would like to find suitable models for geostatistical
data, concerning the different pollution factors measured in the Leman lake. At

          ANBFA                           SSBFA          NFA
run       time  #n   combination          time   #n      time  #n
1         1.9   16   2,2,1,2,2,3,2,2      >100   -       5.1   9
2         1.1   14   2,2,1,2,2,3,2        9.7    15      6.2   18
3         2.7   16   2,2,2,3,1,2,4        10.1   13      11.1  12
4         1.0   13   2,2,1,2,2,2,2        7.5    12      6.7   12
5         1.5   13   2,2,3,4,2            >100   -       9.8   21
6         5.3   15   2,2,4,2,1,4          >100   -       10.7  9
7         2.5   13   2,2,1,2,2,2,2        8.3    11      7.7   11
8         3.5   18   2,2,3,3,2,4,2        13.5   13      21.2  12
9         5.2   15   2,2,1,2,2,2,4        18.4   14      6.5   11
10        5.0   18   2,2,2,4,2,2,4        25.1   16      7.0   13
Average   3.0   15.1                      13.2   13.4    9.2   12.8

Table 1. Simulation results. For each algorithm the normalized time necessary for
convergence and the final number of hidden units (#n) are given. "Combination" represents
the sizes of the different accessory networks constructed by ANBFA.

293 points of the lake, 8 pollution variables² are measured as a function of the x and y
coordinates. The data map is shown in figure 3. The first step in the data analysis

Fig. 3. Data map.

consists in eliminating the outliers. For this purpose we used two known geosta-
tistical methods [8]: the first one is based on plotting the bivariate scatterplot
of the data to compare the values of adjacent data points, and the second one con-
sists in comparing the mean and the median of the data on each row and column
of the data matrix. After these analyses, 4 samples were considered as spurious and
eliminated from the data base.

For each of the 8 pollution variables, the 3 algorithms mentioned in the pre-
vious section (ANBFA, SSBFA and NFA) were used to construct the network.
Considering the relatively small size of the data base, we preferred to use all the
data for training the network. Hence, instead of the cross validation, a kind of

² Concentrations of Hg, Co, Pb, ...



internal validation proposed in [5] is used as stopping criterion. This method
consists in comparing the residue of estimation with white noise. In fact, if the
main network output f_K(x) (containing K accessory networks) gives a good
approximation of the target function g(x), using the data model (1) we can
conclude that the residue of the main network approximation, ε_K = y − f_K(x), is
nearly equal to the noise term n = y − g(x). Therefore, supposing a zero-mean,
i.i.d. noise model, a simple correlation test on the residue can be used to verify
its whiteness and to stop the construction algorithm. When the noise term n is
colored, this method is still usable using more complicated procedures [9].

In our experiment, we used a first order correlation hypothesis test. At each
step of the main network construction, we compute the correlation between all the
residue sample pairs whose relative distance is less than √5. This distance is
chosen so that in each direction, each sample has no more than one neighbour.
Hence, the formula to compute the first correlation coefficient is:

$$c(1) = \frac{\frac{1}{P}\sum_{(i,j)\in D} \varepsilon_i\, \varepsilon_j}{\frac{1}{N}\sum_{i=1}^{N} \varepsilon_i^2} \qquad (6)$$

where N = 289 is the size of the training data base and P is the number of sample
pairs belonging to D, defined as:

$$D = \left\{ (i,j) : (x_i - x_j)^2 + (y_i - y_j)^2 \le 5 \right\} \qquad (7)$$


In our experiment P = 976. Supposing a Gaussian distribution of the noise, the
correlation is statistically null with a significance level of 0.05 if c(1) < 0.063
[10]. Figure 4 illustrates the data surfaces and 3D contours for the first pollution
variable and its estimation using our method. In geostatistics, data analysis is
usually made using the 2D variogram, defined as [8]:

$$2\gamma(h) = \frac{1}{|N(h)|} \sum_{N(h)} \left( Z(s_i) - Z(s_j) \right)^2 \qquad (8)$$

where Z is the output variable, s_i and s_j are the points in ℝ², the sum is over
N(h) = {(i,j) : s_i − s_j = h}, and |N(h)| is the number of distinct elements of
N(h). Figure 5 shows the 2D variogram of the first pollution variable, its estimation
by our algorithm and the final residue. The absence of correlation in the final
residue (Fig. 5.c) is quite obvious.
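As an illustration, a small numpy sketch of an omnidirectional empirical variogram in the sense of Eq. (8) follows. The binning of distances h and all names are assumptions, not the exact estimator used in the experiments.

import numpy as np

def variogram(points, z, bin_width=1.0, n_bins=16):
    # points: (N, 2) array of coordinates; z: (N,) array of values.
    # Returns the estimate of 2*gamma(h) for each distance bin.
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins, dtype=int)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            h = np.linalg.norm(points[i] - points[j])
            b = int(h / bin_width)
            if b < n_bins:
                sums[b] += (z[i] - z[j]) ** 2
                counts[b] += 1
    return np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)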

Table 2 illustrates the mean and the standard deviation of the final network
size, of the normalized training time and of the final training error computed
on the 8 pollution variables for each of the three algorithms. As can be seen,
for nearly the same performance (approximation error measured on the training base),
our method is the fastest (10 times faster than the non-freezing algorithm). The
final network size is on average much less than with the single sigmoid based freezing
algorithm but more than the network constructed by a non-freezing algorithm.
Considering the discussion of section 1, these results are not surprising.

Fig. 4. a) Original data surface. b) Neural network estimation surface (trained by
ANBFA). c) Original data 3D contour. d) Neural network estimation 3D contour.

Fig. 5. Omnidirectional variogram for a) original data, b) neural network estimation
(trained by ANBFA), c) residue.

            Network size        Normalized time        Training error
algorithm   mean      STD       mean       STD         mean     STD
ANBFA       9.3750    1.4079    29.6250    15.6108     0.0600   0.0347
SSBFA       28.2500   14.1295   61.3750    25.9777     0.0615   0.0358
NFA         4.5000    0.7559    202.7500   137.3845    0.0606   0.0359

Table 2. Estimation of pollution variables. For each algorithm the mean and the
standard deviation of the final number of hidden units, of the normalized time necessary
for convergence and of the final training error are given.

As in the freezing algorithms the input weights are optimized separately, it
has been suggested [12] that a whole retraining of the final network (usually known
as "fine tuning") can improve the performance. We examined this hypothesis
and obtained an average training error reduction of 3.01% (with a standard de-
viation of 0.0165). Considering the relatively long time necessary for training
the final network, this improvement is not so convincing.

As can be seen in table 2, the network constructed by our algorithm is
nearly twice as large as the network constructed by the non-freezing algorithm. In
fact, as the non-freezing algorithms perform the optimization over the whole weight
space, the network constructed by these algorithms could be the optimal net-
work. Hence, it is reasonable to think that nearly half of the weights in the
network constructed by our algorithm are "unimportant weights" and a pruning
algorithm may be used to eliminate them. To verify this hypothesis, we
applied the "Optimal Brain Damage" (OBD) algorithm [11] on the final con-
structed network. At each step of network pruning, the weight with minimum
saliency is eliminated. The pruning procedure continues until the final residue
can no longer be considered as white noise. The results of the pruning procedure for
the 8 pollution variables are given in table 3, and it may be verified that this
procedure can eliminate on average 33% of the weights without a sensible
reduction of network performance. However, this pruning procedure is slow, and
for large size problems other kinds of pruning algorithms are preferable.
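For illustration, a minimal numpy sketch of one OBD pruning step follows, using the diagonal-Hessian saliency s_k = h_kk w_k²/2 of [11]; the names are assumptions, and the whiteness test above would decide when to stop pruning.

import numpy as np

def prune_step(weights, hessian_diag):
    # Saliency s_k = h_kk * w_k^2 / 2 (diagonal-Hessian approximation);
    # the least salient weight is set to zero and its index returned.
    saliency = 0.5 * hessian_diag * weights ** 2
    saliency[weights == 0.0] = np.inf   # skip already-eliminated weights
    k = int(np.argmin(saliency))
    weights[k] = 0.0
    return k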

5 Conclusion

In this paper, we introduced a novel approach to construct neural networks
without overfitting. The primary results with artificial and real data prove this
method is much faster than the freezing or non-freezing algorithms which add
fixed-size new units to the network. We think this advantage will be even more visible
in very large scale problems. Our method, like other freezing algorithms, leads
to slightly oversized networks. Thus, a short pruning pass on the final network
could be useful. However, care must be taken so that pruning does not influence
the robustness of the network. We are currently working on this issue.

                Before pruning            After pruning
Variable    #θ   Training error       #θ   Training error   Normalized time
1           28   0.0677               13   0.0703           662
2           32   0.0372               25   0.0375           93
3           40   0.0591               28   0.0654           575
4           40   0.0260               16   0.0344           358
5           36   0.0848               36   0.0848           30
6           44   0.1306               24   0.1481           3094
7           36   0.0336               31   0.0339           186
8           44   0.0407               21   0.0427           869
                 0.0347                    0.0387           997.6

Table 3. The results of the pruning algorithm applied on the final network constructed by
ANBFA. #θ represents the number of parameters (weights) in the network.

References
1. T.Y. Kwok and D.Y. Yeung, Constructive Algorithms for Structure Learning in
Feedforward Neural Networks for Regression Problems. IEEE Trans. on Neural
Networks, vol. 8, no. 3, pp. 630-645, May 1997.
2. S.E. Fahlman and C. Lebiere, The Cascade-Correlation Learning Architecture,
in Advances in Neural Information Processing Systems 2, D.S. Touretzky Ed., pp.
524-532, Morgan Kaufmann, Los Altos CA, 1990.
3. J.H. Friedman and W. Stuetzle, Projection Pursuit Regression, Journal of the American
Statistical Association, vol. 76, no. 376, pp. 817-823, 1981.
4. T. Ash, Dynamic Node Creation in Backpropagation Networks, Connection Science,
vol. 1, no. 4, pp. 365-375, 1989.
5. Ch. Jutten and R. Chentouf, A New Scheme for Incremental Learning. Neural Pro-
cessing Letters, vol. 2, no. 1, pp. 1-4, 1995.
6. T.Y. Kwok and D.Y. Yeung, Experimental Analysis of Input Weight Freezing in
Constructive Neural Networks, in Proceedings of the IEEE International Conference
on Neural Networks, vol. 1, pp. 511-516, San Francisco, California, USA, 1993.
7. T.Y. Kwok and D.Y. Yeung, Objective Functions for Training New Hidden Units in
Constructive Neural Networks, IEEE Transactions on Neural Networks, vol. 8, no.
5, pp. 1131-1148, September 1997.
8. N.A.C. Cressie, Statistics for Spatial Data, John Wiley & Sons, New York, 1991.
9. Sh. Hosseini and Ch. Jutten, Simultaneous Estimation of Signal and Noise by Con-
structive Neural Networks, in Proceedings of the International ICSC/IFAC Symposium
on Neural Computation, Vienna, Austria, September 1998.
10. G. Baillargeon, Méthodes Statistiques de l'Ingénieur, volume 1. Les Éditions SMG,
1994.
11. Y. Le Cun, J.S. Denker, and S.A. Solla, Optimal Brain Damage, in Advances
in Neural Information Processing Systems 2, D.S. Touretzky Ed., pp. 598-605, Morgan
Kaufmann, 1990.
12. M. Lehtokangas, J. Saarinen and K. Kaski, Fine-tuning cascade-correlation feed-
forward network trained with backpropagation, Neural Processing Letters, vol. 2,
no. 2, pp. 10-12, 1995.
Can General Purpose Micro-processors Simulate
Neural Networks in Real-Time?

Bertrand Granado, Lionel Lacassagne, and Patrick Garda

Université Pierre et Marie Curie
Laboratoire des Instruments et Systèmes, Boîte 252
4, place Jussieu - 75252 Paris Cedex 05
Tel.: (33) 01 44.27.75.07 - Fax.: (33) 01 44.27.75.09
email: Bertrand.Granado@lis.jussieu.fr

1 Introduction

By their universal character, general purpose micro-processors may be used to
simulate artificial neural networks. However, until now, they were not capable
of performing these simulations in real time. On the other hand, the computational
power of these processors has increased tremendously recently. Thus, one may
wonder whether up-to-date general purpose micro-processors can simulate neural
networks in real time.
To answer this question, we need to evaluate the performances of these ar-
chitectures for the simulation of neural networks.
To realize this evaluation we have developed an original methodology [6]
which can predict the simulation time of a neural network on an electronic
architecture. This prediction is based on an analytic model of the architecture
performances.

2 Real time

Neural networks are often used in real-time applications. Such applications are
for example the recognition of the amount of a bank check or of a zip postal
code. In these applications the simulation time is hard limited. In this article
we have taken a time constraint of 40 ms, which corresponds to the CCIR video
rate.

3 Neural Net models

In this article we consider the two most used kinds of neural networks, which
are the Multi-Layer Perceptrons (MLP) and the Radial Basis Function networks
(RBF).
To determine whether general purpose micro-processors can perform real-time sim-
ulation of artificial neural networks, we simulated two neural nets: an MLP called
LENET and an RBF called RBF3.

3.1 LENET

LENET is a TDNN Multi-Layer Perceptron with 96522 local connections, 1920
full connections and 4365 neurons. Its function is to recognize handwritten digits.
It was developed by Y. LeCun in the AT&T laboratories [8].

3.2 RBF3

RBF3 [6, 5] is a Radial Basis Function network. Its 3 layers include respectively
256, 10 and 4 neurons, and it uses the Mahalanobis distance. This distance is
a very hard benchmark because the number of computations increases as the
square of the number of neurons in the input layer.

4 Evaluation

To determine the interest of general purpose micro-processors for the real-time
simulation of neural networks, we have developed an original methodology for
the evaluation and prediction of the processors' performances [6].

4.1 Method

The usual method to predict the simulation time of neural networks on an elec-
tronic architecture is based on the measure of an average processing time S per con-
nection. Then the simulation time of an MLP or an RBF with C
connections is simply taken as S · C. We have demonstrated in [7] that this
method cannot be applied for a general neural network architecture because it
leads to very high prediction errors. Thus we introduced a new method for this
prediction.

Description. This methodology is based on the extraction of an analytical model
for the computational primitives of the neural network model. These primitives
are the basic mathematical operations that define the model.
The extracted analytical model is a mathematical function that provides
the simulation time of a neural network depending on some neural network
parameters, like the number of neurons or the kind of connections (local or full).
It also depends on some hardware parameters, like the cache size or the clock
frequency.
To get the total simulation time of a neural network, we simply accumu-
late the simulation times given by the analytical model for all the primitives.

Primitives for MLP. Equations 1 and 2 give the primitives associated to
the MLP model:

$$g(x_j,\ j \in E_i) = \sum_{j \in E_i} x_j \cdot w_{ij} \qquad (1)$$

$$f(v_i) = m\, \frac{1 - e^{-\beta v_i}}{1 + e^{-\beta v_i}} \qquad (2)$$

where m determines the range of the neuron state, included in [−1 : 1], and β is
the slope of f.

Primitives for RBF. Equations 3 and 4 give the primitives associated to
the RBF model, when the Mahalanobis distance is used:

$$g(x_j,\ j \in E_i) = (w_i - x)^t\, \Sigma_i^{-1}\, (w_i - x) \qquad (3)$$

$$f(v_i) = e^{-v_i / \sigma_i^2} \qquad (4)$$

where σ_i characterizes the influence zone of the neuron and Σ_i^{−1} is the inverse of the
covariance matrix associated to the neuron i.
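As an illustration, a small numpy sketch of these four primitives (floating-point versions, following the equations as reconstructed above) is given below; all names and default parameters are assumptions.

import numpy as np

def mlp_g(x, w):
    # Eq. 1: weighted sum over the inputs of neuron i
    return np.dot(w, x)

def mlp_f(v, m=1.0, beta=1.0):
    # Eq. 2: sigmoid with output range [-m, m] and slope beta
    return m * (1.0 - np.exp(-beta * v)) / (1.0 + np.exp(-beta * v))

def rbf_g(x, w_i, sigma_inv):
    # Eq. 3: Mahalanobis distance between input x and prototype w_i
    d = w_i - x
    return float(d @ sigma_inv @ d)

def rbf_f(v, sigma=1.0):
    # Eq. 4: exponential activation; sigma sets the influence zone
    return np.exp(-v / sigma ** 2)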

Number of primitives. In a general purpose micro-processor, there are two
major kinds of computation units, an integer unit and a floating-point unit. Thus
we have evaluated these two units and we have programmed the four primitives
described above in an integer and in a floating-point version: this leads to eight
primitives.

4.2 Analytical models for a general purpose micro-processor

To determine the execution time of a primitive P, we first determine the total
NumBer of Instructions needed to simulate this primitive, NBI_P(k, l, ..., m, n),
as a function of the sizes (k, l, ..., m, n) of the layers (w, x, ..., y, z). This step
can be realized by an analysis of the assembler code of the programmed primitive.
Thanks to this function, we can estimate the number of Cycles Per Instruc-
tion CPI_P for this primitive, with the formula:

$$CPI_P(k, l, \ldots, m, n) = \frac{T \cdot F}{NBI_P(k, l, \ldots, m, n)} \qquad (5)$$

where F is the CPU frequency and T is the simulation time measured for this
primitive. To approximate the CPI_P, we have made numerous simulations of
the primitives, measured the simulation times and determined the CPI_P with
formula 5.
At this point we have two functions: a function NBI_P which provides the
number of instructions for the executed primitive as a function of the sizes of the

layers of a neural network, and a function CPI_P which provides the number of
cycles per instruction for the primitive as a function of the sizes of the layers
of a neural network.
Let us now take a neural network characterized by the primitives p ∈ Π
and by the layers (a_p, b_p, ..., c_p, d_p) for the primitive p. We can compute the
simulation time TS of this neural network with the formula:

$$TS = \frac{1}{F} \sum_{p \in \Pi} NBI_p(a_p, b_p, \ldots, c_p, d_p) \cdot CPI_p(a_p, b_p, \ldots, c_p, d_p) \qquad (6)$$

With equation 6, we can predict the simulation time of any neural network
without programming it on the architecture. Moreover, this analytical model
depends on the sizes of the layers but also on the parameters of the architecture,
like the clock frequency, the cache size, etc. Then if we change the value of a
parameter, for example the clock frequency, we can compute the simulation time
of a neural network on a new architecture which is a minor modification of the
architecture originally evaluated. Thus we can now forecast the performances of
a processor which will be introduced in the future.
But it is hard to give a deterministic analytical model for the architecture
of a general purpose micro-processor, because it includes complex mechanisms.
Such mechanisms are:
- Memory management including two or three memory cache levels.
- Instruction flow sequencing mechanism with branch prediction.
- Out of order execution of the instructions.
These mechanisms introduce non deterministic execution times of the in-
struction flow, because they depend on the values and the nature of the data.
The consequence of these features is that the estimations of the CPI_P given
by equation 5 show a very large dispersion.
To overcome this problem we estimate the range of CPI_P thanks to two
extrema, CPI_P^min and CPI_P^max. These two values are defined such that for any
network:

$$CPI_P^{min} \le CPI_P(k, l, \ldots, m, n) \le CPI_P^{max}$$

With these two extrema, our methodology gives two predicted times, a max-
imum predicted time and a minimum predicted time. Then if the maximum
predicted time is smaller than the real-time constraint, we can say that the neu-
ral network is simulated in real time.
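For illustration, a minimal Python sketch of this bounded prediction follows; the flat list of (NBI, CPI_min, CPI_max) triples and all names are assumptions.

def predict_time(primitives, freq_hz):
    # primitives: list of (nbi, cpi_min, cpi_max) tuples, one per primitive
    # instance of the network; returns the (minimum, maximum) predicted
    # simulation time in seconds, following Eq. 6 with the CPI bounds.
    t_min = sum(nbi * cpi_min for nbi, cpi_min, _ in primitives) / freq_hz
    t_max = sum(nbi * cpi_max for nbi, _, cpi_max in primitives) / freq_hz
    return t_min, t_max

# If t_max is below the 40 ms constraint, the network is guaranteed to be
# simulated in real time on that processor.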

5 Evaluation of SPARC and X86 family processors


To determine the analytic models of the SPARC and X86 processors, we used two
commercial C language compilers: the Sun Microsystems CC-4.2 compiler for SPARC
and Microsoft Visual C++ 5 for X86.

5.1 The processors: SUPERSPARC, ULTRASPARCII

Firstly we evaluated two processors of the SPARC family: the SUPERSPARC and
the ULTRASPARCII.

Hardware. In this section, we describe the hardware architecture of the evalu-
ated processors. These descriptions are derived from [9, 2, 1].
The SPARC¹ architecture is derived from the Berkeley university studies
between 1984 and 1987. It is a RISC architecture owned by Sun Microsystems.
The two evaluated processors' characteristics are:

- SUPERSPARC complies with the SPARC V8 norm. It is a degree-three super-
scalar processor. It has one integer unit with two ALUs, one floating-point
unit, one memory management unit, a 16 KB L1 data cache and a 20 KB
L1 instruction cache. Its clock frequency is 50 MHz; it has 3.1 million
transistors in a BiCMOS 0.6 µm technology.
- ULTRASPARCII complies with the SPARC V9 norm. It is a degree-four super-
scalar processor. It has one integer unit with two ALUs, one floating-point
and VIS² graphic unit with 5 processing units, one memory management
unit, a 16 KB L1 instruction cache and 16 KB of L1 data cache. It has an
L2 cache whose size is in the range [512 KB, 16 MB]. Its clock frequency is 250
MHz; it has 3.8 million transistors in a 0.29 µm CMOS technology.

5.2 The processors X86

The X86 processor family is derived from an Intel seventies CISC architecture.
But to compete with other micro-processors in scientific applications, there is
with the PENTIUM micro-processors an evolution towards a RISC internal micro-
architecture.

Hardware

- PENTIUMPRO is a CISC-RISC micro-processor; the first stage of the pipeline
is dedicated to translating CISC instructions into 118-bit RISC-like micro-
instructions. This micro-processor has an integer unit with two ALUs, a
floating-point unit, a memory management unit, 8 KB of L1 instruction
cache and 8 KB of L1 data cache. There is, at the same CPU clock fre-
quency, a 256 or 512 KB unified L2 cache. The CPU clock is 200 MHz.
- PENTIUMII is an improvement of the PENTIUMPRO. There are MMX³
graphic units, the sizes of the L1 caches are increased up to 16 KB, and the L2 cache
is only running at 2/3 of the CPU clock and its size is not limited to
512 KB. There are 7.5 million transistors in a 0.28 µm CMOS technology
and a CPU clock of 266 MHz.
¹ Scalable Processor ARChitecture
² Visual Instructions Set
³ MultiMedia eXtension

5.3 Analytical models

We extracted the analytical models for the eight primitives and for the four
processors. We cannot give all the models in this article, but we give the example
of the PENTIUMII processor for the integer Mahalanobis distance primitive in
table 1. The range of CPI_mahal is:
CPI_mahal^min = 1.1311 and CPI_mahal^max = 3.5772.
The function h is defined as:

h(x) = 1 if x > 0, else h(x) = 0

Primitives                Analytical model
Mahalanobis distance      (1.1311/F) · (39 + h(size) · (9 + 12 · size)
(integer version,          + h(size − 3) · (11 + ⌊…⌋ · (9 + 19 · size))
for CPI_mahal^min)         + h(size%4) · (7 + (size%4) · (8 + 9 · size)))
Mahalanobis distance      (3.5772/F) · (39 + h(size) · (9 + 12 · size)
(integer version,          + h(size − 3) · (11 + ⌊…⌋ · (9 + 19 · size))
for CPI_mahal^max)         + h(size%4) · (7 + (size%4) · (8 + 9 · size)))

Table 1. Example of PENTIUMII analytical models, where F is the clock frequency

With all the analytical models we can perform both evaluation and predic-
tion.

5.4 Evaluation and Prediction

We present here the predicted and measured simulation times of the two neural
networks, LENET and RBF3.

SPARC family. Table 2 shows that measured times are smaller than the maxi-
mum predicted time and larger than the minimum predicted time: this confirms the
validity of our methodology.
For the real-time simulation of the neural networks, this table shows that the
SUPERSPARC processor cannot satisfy the 40 ms time constraint.
But on the other hand, the ULTRASPARCII can manage the real-time simu-
lation of LENET. We have a maximum time of 8.3 ms for the integer version and
a maximum time of 14.621 ms for the floating-point version. Because LENET
is one of the biggest MLPs ever designed, we can state that current MLPs can
be simulated in real time on general purpose micro-processors, when the time
constraint is 40 ms.
However, table 2 shows that the real-time simulation of RBF3 cannot
always be achieved.

Processor      Neural         Measured   Minimum     Maximum
               Network        Time       Predicted   Predicted
                              (in ms)    Time        Time
                                         (in ms)     (in ms)
SUPERSPARC     Lenet integer   37.424     22.939      46.005
               Lenet float     51.199     24.465      56.593
               rbf3 integer   230.697    144.903     259.369
               rbf3 float     211.703    190.944     255.853
ULTRASPARCII   Lenet integer    4.578      2.728       8.359
               Lenet float     11.709      7.380      14.621
               rbf3 integer    43.206     30.500      65.395
               rbf3 float      37.619     21.821      46.244

Table 2. Predicted and measured simulation times for LENET and RBF3 on SUPERSPARC
and ULTRASPARCII processors

The results shown in table 2 demonstrate the impressive evolution of general
purpose micro-processors. The SUPERSPARC, introduced in 1992, is seven times
less powerful than the ULTRASPARCII introduced in 1997. This evolution is not
only a consequence of the increase of the clock frequency, as the ratio between the
two clock frequencies is only equal to five, but also a consequence of architecture
improvements like memory cache management or the duplication of computational
units.
If this evolution continues, the integer version of the RBF3 network could be
simulated in 9.34 ms in the year 2002 on a SPARC processor which would be 7 times
more powerful than the ULTRASPARCII. Then general purpose micro-processors
could be used for the real-time simulation of RBFs with the Mahalanobis distance.

Processor     Neural         Measured   Minimum     Maximum
              Network        Time       Predicted   Predicted
                             (in ms)    Time        Time
                                        (in ms)     (in ms)
PENTIUMPRO    Lenet integer    3.019      2.751       8.086
              Lenet float     37.869     10.853      41.523
              rbf3 integer    51.404     17.816      56.346
              rbf3 float      54.094     20.886      75.583
PENTIUMII     Lenet integer    2.134      2.113      21.252
              Lenet float     24.378      7.933      39.046
              rbf3 integer    42.800     13.033      48.442
              rbf3 float      43.238     16.149      54.198

Table 3. Predicted and measured times on PENTIUMPRO and PENTIUMII

X86 family. Similarly to the SPARC family, table 3 shows that our methodol-
ogy is valid, and that MLPs can be simulated in real time on these architectures.

6 Predicted performances for future electronic architectures

Our methodology can evaluate actual electronic architectures, but it can also
predict the simulation time of future evolutions of these architectures. We used
it to predict the simulation time of the neural networks LENET and RBF3 on four
possible future evolutions of the ULTRASPARCII and PENTIUMII. For the sake
of simplicity, we modified only one parameter: the clock frequency. The prediction
will be pessimistic, because progress in microelectronics technology may lead to
speedups larger than the ratio of the clock frequencies, as we saw when we compared
the SUPERSPARC and the ULTRASPARCII.
The four evolutions for which we predict the simulation time of the LENET and
RBF3 networks are:

- an ULTRASPARCII with a 400 MHz clock frequency,
- an ULTRASPARCII with a 1 GHz clock frequency,
- a PENTIUMII with a 400 MHz clock frequency,
- a PENTIUMII with a 1 GHz clock frequency.

The clock frequency of 400 MHz is up-to-date, as the current generation of
PENTIUMII has a frequency of 450 MHz, and the ULTRASPARCIII a frequency
of 360 MHz.
The 1 GHz frequency will be available before the year 2002. This is not a dream,
as Peter Bannon of Compaq said at the MicroProcessor Forum on October 1,
1998. The Alpha EV7 micro-processor, the next generation of Alpha processors,
will operate at more than 1 GHz [4]. Sun announces in its roadmap [3] a new
generation of ULTRASPARC processor with a frequency of 1.5 GHz in 2002.
The prediction results are shown in table 4.

              ULTRASPARCII   PENTIUMII   ULTRASPARCII   PENTIUMII
              400 MHz        400 MHz     1 GHz          1 GHz
Neural        Maximum        Maximum     Maximum        Maximum
Network       Time           Time        Time           Time
              (in ms)        (in ms)     (in ms)        (in ms)
LeNet float    9.138         25.965       3.655         10.386
Rbf3 integer  40.871         32.214      16.348         12.885
Rbf3 float    28.902         36.042      11.561         14.416

Table 4. Predicted times for ULTRASPARCII and PENTIUMII with 400 MHz and 1 GHz
clock frequencies

This table shows that with 400 MHz and 1 GHz clock frequencies, simulations
of neural networks will be possible in real time for the two kinds of neural networks
when the time constraint is 40 ms.

7 Conclusion

In this article we propose a new methodology to evaluate and predict the simula-
tion time of MLP and RBF neural networks on general purpose micro-processors.
With this methodology we evaluated two processor families, SPARC and X86,
and we demonstrated that general purpose micro-processors can now simu-
late Multi-Layer Perceptrons with a 40 ms real-time constraint.
We also used our methodology to predict the simulation time of neural net-
works on two possible future evolutions of the SPARC and X86 families, and we showed
that these architectures would simulate Radial Basis Function networks with
the Mahalanobis distance in real time with a 40 ms time constraint. They could be
available in the next three years.

References

1. UltraSPARC User's Manual - UltraI - UltraII. Technical report, Sun Microsystems.
http://www.sun.com/microelectronics/manual/ultrasparc/802-7220-02.pdf.
2. Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture.
Technical report, Intel Corporation, 1997.
http://developer.intel.com/design/pentium/manuals/24319001.pdf.
3. 1999. http://www.sun.com/microelectronics/roadmap/.
4. Peter Bannon. Alpha 21364: A scalable single-chip SMP. Compaq Computer Cor-
poration, Shrewsbury, MA, October 1998.
5. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley,
New York, United States, 1973.
6. Bertrand Granado. Architecture des systèmes électroniques pour les réseaux de neu-
rones - Conception d'une rétine connexionniste. PhD thesis, Université Paris XI,
November 1998.
7. Bertrand Granado and Patrick Garda. Evaluation of the CNAPS neuro-computer for the
simulation of MLPs with receptive fields. In Proceedings of IWANN'97, Lanzarote -
Canary Islands, Spain, June 1997.
8. Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.J.
Jackel. Handwritten digit recognition with a back-propagation network. In Neural
Information Processing Systems, pages 396-404, 1990.
9. André Seznec and Thierry Lafage. Évolution des gammes de processeurs MIPS, DEC
Alpha, PowerPC, SPARC, x86 et PA-RISC. Technical Report 1110, Institut de Recherche
en Informatique et Systèmes Aléatoires, 1996.
Large Neural Net Simulation under Beowulf-Like
Systems

Carlos J. García Orellana, Francisco J. López Aligué, Horacio M. González
Velasco, Miguel Macías Macías and M. Isabel Acevedo Sotoca.

Departamento de Electrónica e Ing. Electromecánica
Universidad de Extremadura
Avd. Elvas, s/n. 06071 Badajoz - SPAIN.
carlos@nemet.unex.es, aligue@unex.es, horacio@nemet.unex.es,
miguel@nemet.unex.es, acevedo@unex.es

Abstract. In this work we address the problem of large neural network
simulation using low-cost distributed systems. We have developed for this
purpose high-performance client-server simulation software, where the server
runs on a multiprocessor Beowulf system. On the basis of a performance
analysis, we propose an estimator for the simulation time.

1. Introduction

Large neural network software simulation has a problem in its large requirements for
hardware resources (especially memory for storing the weights), due to which, until a
short time ago, the simulation of this type of neural network was restricted to the use of
neurocomputers [11][13]. However, these neurocomputers have a high cost, and are
expensive to keep updated.
Over the last years, there have been some advances that have changed the
panorama of simulation and of scientific calculations in general.
- In the first place, standard hardware cost is falling while its power is increasing.
When we say standard hardware, we are of course referring to computers built
around Intel x86 processors. This evolution has narrowed the distance separating
this standard hardware from the workstation, and in some fields it is already a
serious competitor to the latter.
- Also, the hardware for interconnecting computers has undergone a major
evolution, it being quite normal to have a switched Ethernet network at 100 Mbits
at a low cost.
- The other important thing is the appearance on the scene of the Linux operating
system, which, as we know, is a complete UNIX, with excellent performance and
with great interconnection possibilities. In addition to these excellent
characteristics, we must bear in mind its price (it is freeware).

These three facts together allow us to build Beowulf systems at a low cost
[2][15][16], i.e., PC clusters connected by fast Ethernet and using Linux as the operating
system. This class of system has been used for scientific calculations (high-energy
physics, plasma studies, etc.) with great success [20], obtaining an excellent
performance/cost relationship.
The use of such systems for neural network simulation, although they do not offer
neurocomputer performance, is certainly a good alternative, as we shall show in a
following section.

2. Simulation software description

Over recent years we have been using multiprocessor systems for neural network simulation, in particular VME backplanes with Motorola processors (MC680X0 and PowerPC) and VxWorks as the operating system. However, keeping our VME multiprocessor up-to-date is far too expensive and, as we observed in the previous section, the performance of Intel x86 processors is better and their cost keeps decreasing. We therefore decided to implement our neural network simulator on a Beowulf system built around Pentium processors.
The simulation system is built on a client-server structure, based on object-oriented modeling of neural networks using the OMT methodology [8]. In this model, we consider the layers and the connections between layers as the units that make up the neural network; connections may run in all directions (feed-forward, feedback and lateral interaction), and the recognition and learning algorithms can be chosen freely.
The server, called NeuSim, is responsible for carrying out the simulation, and is the part that runs on the Beowulf system. On one cluster station runs the subsystem that we denote the master, in charge of coordinating and supervising the simulation and monitoring the state of the other cluster stations, on which the other server subsystem runs: the slave part, which does the actual simulation. Communication between subsystems is done using TCP/IP sockets.
The client part does not run on the Beowulf system, but on another UNIX workstation. Instead of developing an end-user application as the client, we decided to develop a library (which we call NNLIB) for programming in the C language. We believe this gives the final users more flexibility to adapt the simulation software to their needs. The library has functions to create and delete objects, set and get attributes, control the simulation, handle events, etc.
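As an illustration of how a client might use the library, the sketch below creates a tiny network and runs a simulation. The NNLIB function names and signatures shown are hypothetical assumptions, since the actual interface is not reproduced in this paper.

    /* Hypothetical NNLIB client sketch: the function names and signatures
     * below are illustrative assumptions, not the actual NNLIB interface. */
    #include <stdio.h>
    #include "nnlib.h"   /* assumed NNLIB header */

    int main(void)
    {
        nn_session *s = nn_connect("beowulf-master", 4000); /* assumed host/port */
        if (s == NULL) { fprintf(stderr, "cannot reach NeuSim server\n"); return 1; }

        int in  = nn_create_layer(s, 256, NN_INPUT);     /* create objects...  */
        int out = nn_create_layer(s, 10,  NN_OUTPUT);
        int c   = nn_create_connection(s, in, out, NN_FEEDFORWARD);

        nn_set_attribute(s, c, "learning", "backprop");  /* ...set attributes  */
        nn_simulate(s, "train.pat", 100);                /* ...control the run */
        nn_disconnect(s);
        return 0;
    }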
The NNLIB library could be used easily with widget libraries for X-Window (such
as GNU/GTK) for straightforward customization of any application with a graphics
interface.
Another important aspect of NeuSim-NNLIB is the possibility of developing new
models (recognition and learning algorithms, connection patterns, etc.) using
extensions (plug-ins), which can be written using a defined protocol [8].

3. Performance analysis

Our purpose is to get a simulation time estimate as a function of the neural network's
characteristic topology and the number and type of installed processors. Although
approximate, this prior estimate of simulation time will help us to decide if it is
necessary to use all the cluster processors or if it is better to use only a part of the
cluster for that simulation.
In recent years, many investigators have been working on automatically dividing the execution of algorithms into atomic blocks, with the idea of exploiting cache memories and systems with more than one processor (either multiprocessor systems [1] or clusters of workstations [4]).
However, in general, the problem is quite difficult, and efforts have focused on parallelizing nested-loop algorithms with uniform dependencies [3][14][16][18]. In practice, many problems can be solved using this method.
The optimization of nested-loop algorithms with uniform dependencies is approached using a technique called Supernode Transformation or Tiling [10][3].
To describe this technique without going into mathematical detail, let us suppose we have a problem with an n-times nested loop, and denote by index space a subset J of Z^n, i.e., a set in which every point represents one iteration of the n-times nested loop. To execute the whole loop, we must execute every point in the iteration index space. If there were no dependencies between points in the iteration space, we could execute every point in parallel until the whole iteration space was completed.
However, the dependencies may require one block to be executed before others. Uniform dependencies [18] help us simplify the problem, because they are the same for all points in the iteration space. Mathematically, the dependencies can be characterized by a matrix in which every column represents a dependency, which is a vector of dimension n.
The tiling consists of dividing the iteration space into blocks (called tiles) using a
transformation. This transformation gives us a new iteration index space, where each
point represents one of these blocks (or tiles). The execution of each block can be
made in a practically autonomous way, needing at most the tiles executed
immediately before. The values of the components of the dependency vectors will be 0 or 1.
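As a concrete illustration of the technique (a generic sketch, not taken from any of the cited systems), the doubly nested loop below is executed tile by tile: each (ti, tj) pair is a point of the new iteration index space, and the uniform dependencies (1,0) and (0,1) of the inner computation induce inter-tile dependency components of 0 or 1.

    /* Generic tiling sketch for a 2-deep loop nest with uniform dependencies.
     * The original iteration space {0..N-1} x {0..N-1} is covered by B x B
     * tiles; each tile could be assigned to a different processor, subject to
     * the inter-tile dependencies (here: the tiles to the left and above). */
    #define N 1024
    #define B 64                      /* tile size: the parameter to optimise */

    static double a[N][N];

    void run_tiled(void)
    {
        for (int ti = 0; ti < N; ti += B)      /* new (tile) iteration space */
            for (int tj = 0; tj < N; tj += B)
                for (int i = ti; i < ti + B && i < N; i++)
                    for (int j = tj; j < tj + B && j < N; j++)
                        a[i][j] = (i > 0 && j > 0)   /* uniform deps (1,0),(0,1) */
                                ? 0.5 * (a[i-1][j] + a[i][j-1])
                                : 1.0;
    }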
But, since the dependencies still exist, we still have the problem of not being able to execute the tiles independently in parallel, so it becomes necessary to plan the execution with respect to the dependencies [6][17].
To optimize the execution in a multiprocessor environment, we should choose the size and shape of the tile appropriately [9][12][5]. The time needed to execute a tile is made up of two terms: one due to the computation itself, and another due to the data communication. The computation time is proportional to the volume of the tile, and the communication time also grows with the size of the tile, because the data exchanged with neighboring tiles grow as well. If we also take into account that the time needed to set up a communication is much greater than that needed to transmit one item of data, we reach the conclusion that the larger the tile, the better the performance.

On the other hand, since the dependencies force us to execute the tiles in a certain order, choosing a very large tile means we cannot take maximum advantage of parallel execution [7]. This leads us to look for an optimum tile size.
Focusing now on our problem of neural network simulation, we find that the recognition (or learning) algorithms for each layer (connection) correspond to nested loops in which the dependencies are not uniform, since each layer can receive information from several connections, and the dependencies also differ between layers. Because we execute the net as a whole, assigning to each processor the neurons of each layer that it should process (and therefore the weights it must store), executing each layer individually, connection by connection, would involve an excessive increase in communication between processors, and hence would be unprofitable in terms of total calculation time.
If we analyze the form of the algorithms used in neural networks, we should notice the following facts:
- Firstly, in the recognition phases there exist dependencies that are fixed by the exact form of the connection patterns. These dependencies affect the values that we want to calculate: the new states of the neurons. If we took these dependencies into account, we would be forced to respect the order in which we execute the neurons. Since the memory occupied by the neuron states is insignificant relative to the memory occupied by the weights, we can maintain two copies of the neuron states, alternating their use. This eliminates the obligation of executing the neurons in a certain order.
- Secondly, during the learning phases, the learning algorithms are generally such that the values of the new weights are not affected by the values of the neighboring weights. This eliminates any problem with the dependencies as far as the execution order is concerned.

These considerations obviate our having to plan the order of execution of the
blocks of neurons, thus giving us the possibility to exploit the parallelism better.
In estimating the simulation time in the recognition phase we will follow the
exposition of other work [9][12]. We seek an estimate that is reliable when the neural
net is large, without worrying unduly if the prediction is less accurate with a small
network. We will divide the simulation time into two contributions: one due to the
calculation itself and the other due to data communication between processors.
The calculation contribution will be proportional to the time needed to execute a
connection.
For the second contribution (that due to communication), we shall neglect the
particularities of the physical medium used. We will assume that the communication
time is linear in form (first-order approximation). The constant represents the time
that it takes to establish communication, while the slope represents the cost of
transmitting a datum.
Let us now define the parameters that we will use in the expressions to model the
simulation time.

p     →  whole cluster performance index.
p_i   →  processor i performance index.
c     →  total connections of the neural network.
tc_i  →  time that processor i takes to execute a connection.
c_i   →  connections assigned to processor i.
α     →  constant proportional to the power of processor i.
β     →  time needed to send one item of data from one processor to another.
γ     →  constant proportional to the time needed to establish the communication.

With these parameters we seek an expression for the number of connections processed per second (n) as a function of the number of connections to process (c) and the number of processors (p), i.e.:

    $n = f(c, p)$                                                          (1)

First, we consider the time needed to perform one iteration (t). This will also be a function of the previous two variables, and will be the sum, as noted above, of a calculation time and a communication time, i.e.:

    $t(p,c) = t_{calc}(p,c) + t_{com}(p,c)$                                (2)

Let us first consider the estimation of the computation time. If we define:

    $p_i = \frac{\alpha}{tc_i} \;\Rightarrow\; tc_i = \frac{\alpha}{p_i} \qquad \text{and} \qquad c_i = \frac{c \cdot p_i}{p}$        (3)

then, taking into account that the computation time should be the same for all processors in the cluster, we have:

    $t_{calc} = tc_i \cdot c_i \;\Rightarrow\; t_{calc}(p,c) = \frac{\alpha}{p} \cdot c$        (4)

Secondly, we consider the communication time. As noted above, we can estimate the communication time due to one processor of the cluster ($t^{i}_{com}$) as:

    $t^{i}_{com} = \beta \cdot n_i + \gamma$                               (5)

where $n_i$ is the number of data items that the processor must exchange across the communication network.
In a Beowulf system, these data are the neuron states. We must take into account that in this case the weights needed by a processor are held in its local memory, and therefore it is not necessary to exchange them. In this situation the data to exchange are only some of the neuron states of the network, which implies in practice that we can neglect the term $\beta \cdot n_i$ relative to $\gamma$, and therefore have:

    $\beta \cdot n_i \ll \gamma \;\Rightarrow\; t^{i}_{ws\,com} \approx \gamma$        (6)

i.e.,

    $t_{ws\,com} = \sum_{j=1}^{p} t^{j}_{ws\,com} = \gamma \cdot p$        (7)

We thus have the total simulation time estimated by the expression

    $t_{ws}(p,c) = \frac{\alpha}{p} \cdot c + \gamma \cdot p$              (8)

If we take into account that $n(p,c) = \frac{c}{t(p,c)}$, we have that:

    $n_{ws}(p,c) = \frac{p \cdot c}{\gamma \cdot p^{2} + \alpha \cdot c}$  (9)
Analyzing the function n(p,c) with the number of connections held constant, we find that the function n(p, c = const) has a maximum at:

    $p_{ws\,max}(c) = \sqrt{\frac{\alpha}{\gamma} \cdot c}$                (10)

If we keep the processor index constant, we find that the function n(p = const, c) has a horizontal asymptote given by the expression:

    $n_{ws\,lim}(p) = \frac{p}{\alpha}$                                    (11)
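Expressions (9)-(11) are straightforward to evaluate in code. The helper functions below are a direct transcription of the formulas; the values in the final comment use the α and γ fitted later in Table 3, purely as an example.

    #include <math.h>

    /* Model of Section 3: c in Kcu, p = cluster performance index,
     * alpha in s*ip/Kcu, gamma in s/ip (values fitted in Table 3). */
    double n_ws(double p, double c, double alpha, double gamma)
    {
        return (p * c) / (gamma * p * p + alpha * c);      /* expression (9)  */
    }

    double p_ws_max(double c, double alpha, double gamma)
    {
        return sqrt(alpha * c / gamma);                    /* expression (10) */
    }

    double n_ws_lim(double p, double alpha)
    {
        return p / alpha;                                  /* expression (11) */
    }

    /* Example: n_ws(5.11, 9065.0, 2.748e-4, 0.0394) is roughly 13160 Kcu/s,
     * matching the "Teor." entry for c = 9065 Kcu, p = 5.11 in Table 4. */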

4. Results

To check the expressions derived in the previous section, we performed a series of simulations on a Beowulf system with a total of 6 Pentium and Pentium MMX processors at different clock frequencies, all of them with 64 Mbytes of RAM and interconnected by a non-switched fast Ethernet network.
Our tests were done using 8 neural networks with sizes between 80 000 and 33 200 000 connections. They are shown in Table 1.
The simulation times are shown in Table 2. We estimated the parameters α and γ using non-linear regression. We list the results in Table 3.
We compare the estimated performance with the real performance graphically in Fig. 1, and numerically in Table 4. One sees that, if the neural network is large enough, our estimate agrees with the real value.

Table 1. Characteristics of the eight neural networks used for performance analysis.

Network   Layers   Neurons   Conn./neuron   Connections
Net 1       1        1600         50             80000
Net 2       1       10000         50            500000
Net 3       2       20000         83           1660000
Net 4       1       40000         74           2960000
Net 5       1      122500         74           9065000
Net 6       1      200000         74          14800000
Net 7       2      245000         83          20335000
Net 8       2      400000         83          33200000

Fig. 1. Graphical comparison between experimental and estimated results for each of our eight neural networks.

Table 2. Simulation times (s) of the eight neural nets for different values of the cluster performance index.

Simulated   Whole cluster performance index (ip)
Network     5.11    4.26    3.70    2.86    2.00    1.00
Net 1       0.192   0.190   0.173   0.174   0.163   0.157
Net 2       0.203   0.214   0.200   0.210   0.217   0.265
Net 3       0.256   0.275   0.270   0.303   0.342   0.512
Net 4       0.320   0.330   0.353   0.434   0.512   0.818
Net 5       0.670   0.752   0.801   0.990   1.262   2.255
Net 6       1.104   1.191   1.234   1.492   1.998   ---
Net 7       1.303   1.511   1.611   2.001   2.604   ---
Net 8       2.025   2.312   2.531   3.100   ---     ---

Table 3. Non-linear regression results using the model given by expression (9).

Parameter        Estimated value
α (s·ip/Kcu)     (2.748 ± 0.004) × 10^-4
γ (s/ip)         0.0394 ± 0.0020
r²               0.9880

Lastly, in Table 5 we list the estimated values of the optimal number of processors and of the performance limit.

Table 4. Numerical comparison between experimental (Exp.) and estimated (Teor.) performance, in Kcu/s; d (%) is the relative deviation of the estimate with respect to the experimental value.

c (Kcu)          p(ip)=1.00    2.00     2.86     3.70     4.26     5.11
80      Exp.        509.6     489.8    459.8    463.8    422.2    416.7
        d (%)       155.6      81.7     44.4     13.7      9.4     -6.8
        Teor.      1302.3     889.7    664.0    527.1    462.0    388.4
500     Exp.       1886.8    2309.5   2381.0   2500.0   2336.4   2469.1
        d (%)        49.9      46.7     30.6      9.3      6.9    -11.4
        Teor.      2827.4    3387.6   3108.7   2732.5   2496.7   2187.8
1660    Exp.       3242.2    4853.8   5478.5   6148.1   6036.4   6484.4
        d (%)         3.3      11.4     11.3      0.3      0.0    -12.0
        Teor.      3349.6    5408.9   6095.7   6166.1   6034.0   5706.7
2960    Exp.       3617.1    5781.3   6820.3   8397.2   8969.7   9250.0
        d (%)        -4.0       5.5      9.2     -3.7     -8.1    -11.3
        Teor.      3470.9    6097.6   7451.0   8090.6   8245.5   8204.8
9065    Exp.       4020.0    7183.0   9156.6  11317.1  12054.5  13529.9
        d (%)       -10.9      -4.7      0.6     -2.3     -0.1     -2.7
        Teor.      3582.5    6847.4   9212.4  11059.0  12040.6  13158.3
14800   Exp.          ---    7407.4   9919.6  11993.5  12423.1  13405.8
        d (%)         ---      -5.4     -2.8     -1.0      6.1     10.7
        Teor.      3604.3    7009.3   9640.6  11877.7  13180.4  14841.8
20335   Exp.          ---    7809.1  10162.4  12622.6  13458.0  15606.3
        d (%)         ---      -9.3     -3.2     -2.8      2.1      0.6
        Teor.      3613.8    7081.4   9837.4  12268.5  13740.0  15706.3
33200   Exp.          ---       ---  10709.7  13117.3  14359.9  16395.1
        d (%)         ---       ---     -6.2     -3.2      0.1      2.0
        Teor.      3623.6    7157.6  10049.8  12701.5  14372.2  16715.2

Table 5. Estimated values of the parameters p_ws,max and n_ws,lim.

c (Kcu)      80    500   1660   2960   9065   14800   20335   33200
p_ws,max    0.75   1.87   3.40   4.54   7.95   10.15   11.90   15.21

p (ip)      5.11   4.26   3.70   2.86   2.00   1.00
n_ws,lim   18604  15499  13452  10405   7281   3639

5. Conclusions

We have demonstrated the viability of a cluster of PCs running the Linux operating system (a Beowulf system) for large neural network simulation.
From the interpretation of the results, we find that, as the number of processors grows, the performance limit follows expression (11). This is because the data that represent the weights are kept in the processor that has to work with them, obviating the need to use the communication network. This underlines the importance of equipping each component of the PC cluster with enough memory.
Also, our estimate of the simulation time was found to be accurate when the neural network is large enough.

Acknowledgements

This work has been partially supported by project PRI9606D007, financed by the Junta de Extremadura.

References

1. Agarwal A., Kranz D.A., Natarajan V.: Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors. IEEE Trans. Parallel Distributed Systems, 6(9):943-962, 1995.
2. Becker D.J., Sterling T., Savarese D., Dorband J.E., Ranawake U.A., Packer C.V.: Beowulf: A parallel workstation for scientific computation. Proceedings, International Conference on Parallel Processing, 1995.
3. Boulet P., Darte A., Risset T. and Robert Y.: (Pen)-ultimate tiling? Integration, the VLSI Journal, 17:33-51, 1994.
4. Boulet P., Dongarra J., Robert Y. and Vivien F.: Tiling for heterogeneous computing platforms. Report UT-CS-97-373, Jul. 1997.
5. Calland P.Y., Dongarra J. and Robert Y.: Tiling with limited resources. Application Specific Systems, Architectures and Processors, ASAP'97, pp. 229-238. IEEE Computer Society Press, 1997.
6. Darte A., Khachiyan L. and Robert Y.: Linear scheduling is nearly optimal. Parallel Processing Letters, vol. 1.2, pp. 73-81, 1991.
7. Desprez F., Dongarra J., Rastello F. and Robert Y.: Determining the idle time of a tiling: new results. Journal of Information Science and Engineering, vol. 14, no. 1, pp. 167-190, March 1997.
8. García Orellana C.J.: Modelado y Simulación de Grandes Redes Neuronales. Doctoral Thesis, University of Extremadura, October 1998.
9. Hodzic E. and Shang W.: On supernode transformation with minimized total running time. IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 5, pp. 417-428, May 1998.
10. Irigoin F. and Triolet R.: Supernode partitioning. In Proc. 15th Annual ACM Symp. Principles of Programming Languages, pages 319-329, CA, January 1988.
11. Müller A., Gunzinger A. and Guggenbühl W.: Fast neural net simulation with a DSP processor array. IEEE Transactions on Neural Networks, vol. 6, no. 1, January 1995.
12. Ohta H., Saito Y., Kainaga M. and Ono H.: Optimal tile size adjustment in compiling general DOACROSS loop nests. Proc. 1995 Int'l Conf. Supercomputing, pp. 270-279. ACM Press, 1995.
13. Ramacher U. et al.: SYNAPSE-1 -- a general purpose neurocomputer. Siemens AG, available on request, Feb. 1994.
14. Ramanujam J. and Sadayappan P.: Tiling multidimensional iteration spaces for multicomputers. J. Parallel and Distributed Computing, vol. 16, pp. 108-120, 1992.
15. Reschke C., Sterling T., Ridge D., Savarese D., Becker D., Merkey P.: A design study of alternative network topologies for the Beowulf parallel workstation. Proceedings, High Performance and Distributed Computing, 1996.
16. Schreiber R. and Dongarra J.J.: Automatic blocking of nested loops. Technical Report 90.38, RIACS, Aug. 1990.
17. Shang W. and Fortes J.A.B.: Time optimal linear schedules for algorithms with uniform dependencies. IEEE Trans. Computers, vol. 40, no. 6, pp. 723-742, Jun. 1991.
18. Shang W. and Fortes J.A.B.: Independent partitioning of algorithms with uniform dependencies. IEEE Trans. Computers, vol. 41, no. 2, pp. 190-206, Feb. 1992.
19. Sterling T., Becker D.J., Savarese D., Berry M.R., Reschke C.: Achieving a balanced low-cost architecture for mass storage management through multiple fast Ethernet channels on a Beowulf parallel workstation. Proceedings, International Parallel Processing Symposium, 1996.
20. Warren M.S., Salmon J.K., Becker D.J., Goda M.P., Sterling T., Winckelmans G.S.: Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red, II. Price/performance of $50/Mflop on Loki and Hyglac. In Supercomputing'97, Los Alamitos, 1997. IEEE Computer Society.
A Constructive Cascade Network with Adaptive Regularisation

N.K. Treadgold and T.D. Gedeon

Department of Information Engineering,
Computer Science and Engineering,
University of New South Wales,
Sydney, Australia
{nickt, tom}@cse.unsw.edu.au

Abstract. Determining the optimum amount of regularisation to obtain the best generalisation performance in feedforward neural networks is a difficult problem. This problem is addressed in the casper algorithm, a constructive cascade algorithm that uses regularisation. Previously the amount of regularisation used by this algorithm was set by a parameter prior to training. This work explores the use of an adaptive method to automatically set the amount of regularisation as the network is constructed. This technique is compared against the original method of user-optimised regularisation settings and is shown to maintain, and sometimes improve, the generalisation results, while also constructing smaller networks. Further benchmarking on the Proben1 series of data sets is performed and the results compared to an optimised Cascade Correlation algorithm.

1 Introduction

The casper algorithm [1, 2] has been shown to be a powerful method for training feedforward neural networks. It is a constructive algorithm that inserts hidden neurons one at a time to form a cascade architecture, similar to Cascade Correlation (cascor) [3]. The amount of regularisation in casper is set by a parameter. The optimal value for this parameter is difficult to estimate prior to training, and is generally obtained through trial and error. An inherent problem for the regularisation of constructive networks is that the number of weights in the network is continually changing, and thus even an optimal regularisation level for a given size network will become redundant as the network grows. This work explores the use of a method which adaptively sets the regularisation level as the network is constructed. This paper will first give an introduction to the casper algorithm, then describe the adaptive regularisation method and provide the results of some comparative simulations. Finally the algorithm is benchmarked on the Proben1 [4] series of data sets and its performance is compared to an optimised Cascade Correlation algorithm.

2 The casper Algorithm

The casper algorithm uses a version of the RPROP algorithm [5] for network training. RPROP is a gradient descent algorithm which uses separate adaptive learning rates for each weight. Each weight begins with an initial learning rate, which is then adapted depending on the sign of the error gradient seen by the weight as it traverses the error surface. This results in the update value for each weight adaptively growing or shrinking as a result of the sign of the gradient seen by that weight.
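As a reference point, the following is a minimal sketch of the RPROP update rule just described, for a single weight. It is one common variant: the original algorithm [5] additionally backtracks the previous weight step when the gradient changes sign, which is omitted here. The constants η+ = 1.2 and η- = 0.5 are the usual choices from [5].

    /* Minimal RPROP step for one weight: the update value grows while the
     * gradient keeps its sign, and shrinks when the sign flips. */
    #include <math.h>

    typedef struct { double w, delta, prev_grad; } rprop_w;

    void rprop_step(rprop_w *u, double grad)
    {
        const double eta_plus = 1.2, eta_minus = 0.5;
        const double d_max = 50.0, d_min = 1e-6;

        if (u->prev_grad * grad > 0)              /* same sign: accelerate */
            u->delta = fmin(u->delta * eta_plus, d_max);
        else if (u->prev_grad * grad < 0) {       /* sign change: back off */
            u->delta = fmax(u->delta * eta_minus, d_min);
            grad = 0.0;                           /* skip update this step */
        }
        if (grad > 0)      u->w -= u->delta;      /* step against gradient */
        else if (grad < 0) u->w += u->delta;
        u->prev_grad = grad;
    }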
Casper constructs a cascade network in a similar manner to cascor: it begins with all inputs connected directly to the outputs, and successively inserts neurons which receive inputs from all prior hidden neurons and inputs. RPROP is used to train the whole network each time a hidden neuron is added. The use of RPROP is modified, however, such that when a new neuron is inserted, the initial learning rates for the weights in the network are reset to values that depend on the position of the weight in the network. The network is divided into three separate regions, each with its own initial learning rate: L1, L2 and L3. The first region is made up of all weights connecting to the new neuron from previous hidden and input neurons. The second region consists of all weights connecting the output of the new neuron to the output neurons. The third region is made up of the remaining weights, which consist of all weights connected to, and coming from, the old hidden and input neurons.
The values of L1, L2 and L3 are set such that L1 >> L2 > L3. The reason for these settings is similar to the reason that cascor uses the correlation measure: the high value of L1 as compared to L2 and L3 allows the new hidden neuron to learn the remaining network error. Similarly, having L2 larger than L3 allows the new neuron to reduce the network error without too much interference from other weights. Importantly, however, no weights are frozen, and hence if the network can gain benefit by modifying an old weight, this occurs, albeit at an initially slower rate than the weights connected to the new neuron. In addition, the L1 weights are trained by a variation of RPROP termed SARPROP [6]. The SARPROP algorithm is based on RPROP, but uses a noise factor to enhance the ability of the network to escape from local minima.
In casper a new hidden neuron is installed after the decrease of the validation error has fallen below a set amount. All hidden neurons use a symmetric logistic activation function ranging between -0.5 and 0.5. The output neuron activation function depends on the type of analysis performed. Regression problems use a linear activation function. Classification tasks use the standard logistic function for single-output classification tasks; for tasks with multiple outputs the softmax activation function [7] is used. Similarly, the error function selected depends on the problem. Regression problems use the standard sum-of-squares error function. Classification problems use the cross-entropy function [8]. For classification tasks, a 1-of-c coding scheme for c classes is used, where the output for the class to be learnt is set to 1, and all other class outputs are set to 0. For a two-class classification task, a single output is used with the values 1 and 0 representing the two classes. For multiple classes a winner-takes-all strategy is used in which the output with the highest value designates the selected class.
The regularisation used in casper is implemented through a penalisation term added to the error function, of weight-decay form:

    $E = E_{0} + \lambda \, S \sum_{i} w_{i}^{2}$

where λ sets the regularisation magnitude, and S is a Simulated Annealing (SA) term. The SA term reduces the amount of decay as training proceeds, and is reset each time a new neuron is added to the network.
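Assuming the penalty takes the weight-decay form given above, and assuming an exponentially decaying SA term (the exact decay schedule is an assumption here, not specified in this paper), the gradient seen by the training algorithm for each weight would be modified roughly as follows.

    #include <math.h>

    /* Sketch of a weight-decay penalty with a simulated-annealing factor,
     * under the assumption that the penalty is lambda * S(t) * sum(w^2).
     * The exponential schedule and tau are assumptions for illustration. */
    double sa_term(int epoch, double tau)
    {
        return exp(-(double)epoch / tau);   /* decays as training proceeds */
    }

    double penalised_gradient(double grad, double w,
                              double lambda, int epoch, double tau)
    {
        return grad + 2.0 * lambda * sa_term(epoch, tau) * w;
    }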

3 Implementing Adaptive Regularisation

One method that would allow the amount of regularisation to change in con-
structive algorithms is to adapt this parameter as the network is trained. This
was done using the following method as applied to the casper algorithm. The
adaptation process relies on using three training stages for each new hidden neu-
ron added, instead of the usual single training stage. The validation results taken
after the completion of each training stage are then used to adapt the regulari-
sation levels for future network training. This process repeats as the network is
constructed.
For each new hidden neuron inserted into the network, three training stages are performed. Each training stage is performed using the same method as the casper algorithm, and is halted using the same criterion. The commencement of a new training stage results in all RPROP and SA parameters being reset to their initial values. Importantly, however, the final weights from the previous training stage are retained and act as the starting point for the next training stage. The motivation for this is that it is likely to increase convergence speed, and thereby construct smaller networks.
The regularisation level for the network once a new neuron is added is set to an initial value, λ_i, termed the initial decay value. This parameter takes the form λ_i = 10^(-a). It is this initial decay value that is adapted as the network is constructed. The first training stage uses the initial decay value. Each successive stage uses a regularisation level that has been reduced by a factor of ten from the previous stage. After each training stage the performance of the network on the validation set is measured, and the network weights recorded. On completion of the third training stage, the initial decay value is adapted as follows: if the best performing regularisation level occurred during the first two training stages, the initial decay value is increased by a factor of ten, else it is decreased by a factor of ten. At this point the weights that produced the best validation results are restored to the network. When the next neuron is added, the process repeats using the newly adapted initial decay value.
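The adaptation loop just described can be summarised in the following sketch; train_stage() and validate() are placeholders standing in for casper's actual training and validation procedures, and the weight snapshot/restore bookkeeping is omitted.

    #include <math.h>

    /* Sketch of acasper's decay adaptation for one new hidden neuron.
     * a is the exponent of the initial decay value lambda_i = 10^(-a). */
    extern void   train_stage(double lambda);   /* placeholder            */
    extern double validate(void);               /* placeholder: val error */

    int adapt_decay(int a)                      /* returns the updated a  */
    {
        double best_err = 1e30;
        int stage, best_stage = 0;

        for (stage = 0; stage < 3; stage++) {
            double lambda = pow(10.0, -(double)(a + stage)); /* /10 each stage */
            train_stage(lambda);
            double err = validate();
            if (err < best_err) { best_err = err; best_stage = stage; }
        }
        if (best_stage < 2) a -= 1;   /* best in first two stages: raise lambda_i */
        else                a += 1;   /* best in last stage: lower lambda_i       */
        if (a < 1) a = 1;             /* limits on the initial decay value        */
        if (a > 4) a = 4;
        return a;
    }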
The initial network with no hidden neurons is trained using a single training stage with a regularisation level of λ = 0. The adaptation scheme begins with the addition of the first hidden neuron, which is given an initial decay value of a = 2. The initial decay value is chosen to give a relatively high regularisation level, as this can easily be reduced through network growth and the adaptation process. The limits placed on the initial decay value are a = 1 to 4, which gives a total possible regularisation range of a = 1 to 6 (since there are three training stages). The lower initial decay limit (a = 4) was selected to stop the regularisation level falling too low, which can occur in early stages of training when the network is still learning the general features of the data set. The upper initial decay limit (a = 1) was selected since convergence becomes difficult with excessive regularisation levels.
For reasons of efficiency, if the validation result of the second stage is worse than that of the first, the third training stage is not performed. In addition, if the validation results of the first training stage are worse than the best validation results of the previous network architecture, the weights are reset to the values they had before this training stage commenced. The regularisation level is then reduced as normal, and the second training stage is started. This was done to stop excessive regularisation levels distorting past learning.
This regularisation selection method allows the network to adapt the level
of regularisation as the network grows in size. The motivation for using this
adaption scheme is the relationship between good regularisation levels in similar
size networks. By finding a good regularisation level in a given network, it is
likely that a slightly larger network will benefit from a similar regularisation
level. The adaption process allows a good regularisation level to be found by
modifying the window of regularisation magnitudes that are examined. This
adaption process is biased towards selecting larger regularisation levels since the
initial decay value is increased if either of the first two training stages have the
best validation result. The reason for this bias is that as the network grows in
size, in general more regularisation will be required.
The motivation for reducing the regularisation level through each training
stage is that it allows the network to model the main features of the data set,
which can then be refined by lowering the regularisation level. This is the same
motivation for the use of the SA term in the regularisation function. The algo-
rithm incorporating this adaptive regularisation method will be termed acasper.
The parameter values for this algorithm were selected after some initial tuning on the Two Spirals [3] and Complex interaction [9] data sets. Some tuning was also performed using the cancer1 data set from the Proben1 collection.

4 Comparative Simulations

In order to test the performance of acasper, it was compared against casper on three regression and two classification benchmarks. The regression data sets are based on the Complex additive (Cadd), Complex interactive (Cif), and Harmonic
(Harm) functions [9]. Each data set is made up of a training set of 225 points randomly selected over the input space [0, 1]², a validation set of size 110 similarly generated, and a test set of size 10,000 generated by uniformly sampling the grid [0, 1]². Gaussian noise of 0 mean and 0.25 standard deviation was added to the training and validation sets. The two classification benchmarks were the Glass and Thyroid data sets, which are glass1 and thyroid1 respectively from Proben1.
For each data set 50 training runs were performed for each algorithm using
different initial starting weights. The Mann-Whitney U test [10] was used to
compare results, with results significant to a 95% confidence level indicated in
bold. Training in both casper and acasper is halted when either the validation
error (measured after the installation of each hidden neuron) fails to decrease af-
ter the addition of 6 hidden neurons, or a maximum number of hidden neurons has been installed. This maximum was set to 8 and 30 for the classification
and regression data sets respectively. The measure of computational cost used is
connection crossings (CC) which Fahlman [3] defines as the number of multiply-
accumulate steps required to propagate activation values forward through the
network, and error values backward. This measure is more appropriate for con-
structive networks than the number of epochs trained since it takes into account
varying network size.
The results on the test sets, taken at the point where the best validation result occurred for the networks constructed once the halting criterion was satisfied, are given in Tables 1 and 2. For the classification data sets this measure is the percentage of misclassified patterns, while for the regression data sets it is the Fraction of Variance Unexplained (FVU) [9], a measure proportional to the total sum-of-squares error. Also reported are the number of hidden neurons installed at the point where the best validation result occurred, and the total number of connection crossings performed when the halting criterion was reached. The casper results reported are those that gave the best generalisation results from a range of regularisation levels: letting λ = 10^(-a), a was varied from 1 to 5.

Table 1. Comparative Results for the Classification Data Sets

Data Set  Algorithm  Property          Mean   StDv   Median  Min    Max
Glass     casper     Test Error %      28.94  2.34   28.30   26.42  33.96
                     Hidden Neurons    3.06   2.06   3.00    0.00   8.00
                     CC (x10^8)        0.52   0.00   0.52    0.52   0.52
Glass     acasper    Test Error %      30.68  2.61   30.19   26.42  35.85
                     Hidden Neurons    4.18   2.21   4.00    1.00   8.00
                     CC (x10^8)        1.33   0.09   1.32    1.11   1.50
Thyroid   casper     Test Error %      1.68   0.23   1.61    1.33   2.29
                     Hidden Neurons    7.18   1.37   8.00    2.00   8.00
                     CC (x10^8)        25.71  0.91   25.46   23.77  29.07
Thyroid   acasper    Test Error %      1.67   0.26   1.64    1.28   2.28
                     Hidden Neurons    4.64   2.34   5.00    1.00   8.00
                     CC (x10^8)        69.34  3.80   69.48   60.00  77.38

Table 2. Comparative Results for the Regression Data Sets

Data Set  Algorithm  Property            Mean   StDv   Median  Min    Max
Cadd      casper     Test FVU (x10^-2)   1.29   0.60   1.17    0.81   4.03
                     Hidden Neurons      21.26  7.70   21.50   4.00   30.00
                     CC (x10^8)          11.93  0.07   11.81   11.72  12.08
Cadd      acasper    Test FVU (x10^-2)   1.18   0.24   1.09    0.84   1.86
                     Hidden Neurons      16.16  5.74   15.00   7.00   29.00
                     CC (x10^8)          34.55  1.74   34.66   30.34  39.88
Cif       casper     Test FVU (x10^-2)   2.98   0.98   2.69    1.63   5.98
                     Hidden Neurons      24.52  6.84   27.50   6.00   30.00
                     CC (x10^8)          12.09  0.20   12.07   11.78  12.75
Cif       acasper    Test FVU (x10^-2)   2.38   0.61   2.21    1.48   4.03
                     Hidden Neurons      20.16  6.24   19.50   8.00   30.00
                     CC (x10^8)          34.27  1.91   34.24   28.61  38.77
Harm      casper     Test FVU (x10^-2)   3.12   0.95   2.89    1.59   5.46
                     Hidden Neurons      26.00  5.71   29.50   12.00  30.00
                     CC (x10^8)          12.33  0.37   12.28   11.82  13.78
Harm      acasper    Test FVU (x10^-2)   2.37   0.69   2.18    1.45   4.70
                     Hidden Neurons      19.34  5.30   18.00   10.00  30.00
                     CC (x10^8)          36.02  2.51   35.84   28.00  41.17

4.1 Discussion

In general the acasper algorithm is able to maintain or better the generalisation results obtained by the casper algorithm with a fixed, user-optimised regularisation level. The only data set where acasper performs significantly worse is the Glass data set, although this reduction in performance is relatively small. The good performance of acasper can be attributed to its ability to adapt the regularisation level by taking into account such factors as the current network size and the presence of noise. Figure 1 demonstrates acasper's ability to adapt regularisation levels depending on the noise present in the data. This figure shows an example of the λ values selected by acasper on the Cif data set, with and without added noise, for a typical training run. The regularisation magnitudes selected for the noisy data set become greater as training proceeds, and are successful in preventing the network over-fitting the data.
In terms of the network size constructed, the acasper algorithm maintains, and often reduces, the number of hidden neurons installed. The reduction is sometimes large, as can be seen for the regression tasks. This can be attributed to two factors. First, the acasper algorithm performs more training at each period of network construction. This takes the form of restarting training with different regularisation levels and with reset RPROP and SA parameters. This increases the chance of the network escaping from the current (possibly local) minimum and perhaps converging to a better solution. Second, the adaptation of the regularisation level may result in faster convergence in comparison to a fixed level.

Fig. 1. Regularisation magnitudes (λ, log scale) selected by acasper on the Cif data set, with and without added noise, over the course of network construction.

The main disadvantage of the adaptive regularisation method used in acasper is the increase in computational cost. For the benchmark results obtained, this increase is of the order of a factor of two to three in comparison to casper. The increase in computational cost is expected to scale approximately linearly in comparison to corresponding size networks trained by casper, since it is a result of at most three additional training stages at each point of network construction. Part of the increased cost of training acasper is balanced by its ability to construct smaller networks than casper. The use of adaptive regularisation also removes the need to select a regularisation level in casper. The computational cost of such preliminary training is significant but not easily quantifiable, and is not reflected in the results quoted for casper.

4.2 Benchmarking acasper

In order to allow comparison between the acasper algorithm and other neural network algorithms, an additional series of benchmarking was performed on the remaining data sets in the Proben1 collection. The same benchmarking set-up was used as for the previous comparisons, except that the maximum network size for the regression problems was set to eight. The four regression data sets in Proben1 are building1, flare1, hearta1, and heartac1. The test results for these data sets are given in terms of the squared error percentage as defined by Prechelt [4]:

    $E_{sep} = 100 \cdot \frac{o_{max} - o_{min}}{N \cdot c} \sum_{i=1}^{N} \sum_{j=1}^{c} (o_{ij} - t_{ij})^{2}$

where o_max and o_min are the maximum and minimum values of the outputs, N is the number of patterns, and c the number of output neurons.
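A direct transcription of this error measure (as reconstructed above) might look as follows, with the N·c actual and target outputs flattened into arrays.

    /* Squared error percentage in the sense of Prechelt [4], as given above:
     * o[] and t[] hold the N*c actual and target output values. */
    double squared_error_pct(const double *o, const double *t,
                             int N, int c, double omax, double omin)
    {
        double sse = 0.0;
        for (int k = 0; k < N * c; k++) {
            double d = o[k] - t[k];
            sse += d * d;
        }
        return 100.0 * (omax - omin) / ((double)N * c) * sse;
    }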
To allow direct comparison with a well-known constructive algorithm, the results obtained by the cascor algorithm [3] are also given. These results were obtained from benchmarking carried out in [11]. This version of cascor incorporates a sophisticated implementation of early stopping. The results of these simulations are given in Tables 3 and 4, which show the test and hidden unit results respectively. Results which are significant to a 95% confidence level are printed in bold. At this level, the flare results in Table 3 were given as significantly different by the Mann-Whitney U test; however, the test scores were found to have very different distributions, and hence this result was not treated as significant.

Table 3. Proben1 Benchmarking: Test Error Percentage

Data Set   Algorithm  Mean   StDv  Median  Min    Max
cancer1    acasper    1.89   0.80  1.72    0.57   4.02
           cascor     1.95   0.38  1.72    1.15   2.87
card1      acasper    13.72  0.59  13.37   13.37  16.86
           cascor     13.58  0.43  13.37   12.79  14.54
diabetes1  acasper    23.14  1.26  22.92   20.31  27.08
           cascor     24.53  1.44  24.48   22.40  28.65
gene1      acasper    11.72  0.09  11.73   11.60  11.85
           cascor     13.38  0.47  13.49   11.98  14.38
glass1     acasper    30.68  2.61  30.19   26.42  35.85
           cascor     34.76  5.88  33.96   26.42  47.17
heart1     acasper    19.21  0.44  19.13   16.96  20.00
           cascor     19.89  1.58  20.44   16.09  22.17
heartc1    acasper    18.85  1.14  18.67   18.67  26.67
           cascor     19.47  1.28  18.67   18.67  24.00
horse1     acasper    32.46  0.71  32.97   29.67  34.07
           cascor     26.37  2.58  26.37   20.88  31.87
soybean1   acasper    7.89   1.03  7.65    5.29   10.00
           cascor     9.46   0.86  9.41    7.65   11.77
thyroid1   acasper    1.67   0.26  1.64    1.28   2.28
           cascor     3.03   1.15  2.67    2.11   6.56
building1  acasper    0.64   0.02  0.64    0.61   0.71
           cascor     0.82   0.23  0.72    0.49   1.42
flare1     acasper    0.53   0.01  0.52    0.52   0.58
           cascor     0.53   0.01  0.53    0.51   0.55
hearta1    acasper    4.74   0.10  4.69    4.67   5.23
           cascor     4.62   0.15  4.60    4.43   5.02
heartac1   acasper    2.75   0.14  2.72    2.62   3.11
           cascor     2.87   0.44  2.70    2.48   4.25

It can be seen that acasper outperforms cascor both in terms of test results and in constructing smaller networks. There are eight data sets where acasper obtains significantly better test results than cascor, compared to two where cascor outperforms acasper (four with no significant difference). For all data sets acasper was able to produce smaller networks than cascor, with significant results for twelve out of the fourteen data sets. There are some cases where the difference is surprisingly large, for example the Soybean and Thyroid data sets. One reason for this may be that the halting criterion for acasper specifies a maximum network size of eight, although in general this limit is rarely reached by acasper during the benchmarking.

Table 4. Proben1 Benchmarking: Hidden Units Used

Data Set   Algorithm  Mean   StDv  Median  Min   Max
cancer1    acasper    4.86   2.08  4.50    1.00  8.00
           cascor     5.18   2.05  4.00    3.00  10.00
card1      acasper    0.12   0.59  0.00    0.00  4.00
           cascor     1.07   0.25  1.00    1.00  2.00
diabetes1  acasper    3.02   1.55  3.00    1.00  8.00
           cascor     9.78   5.32  9.00    0.00  25.00
gene1      acasper    0.00   0.00  0.00    0.00  0.00
           cascor     2.73   1.19  2.00    1.00  6.00
glass1     acasper    4.18   2.21  4.00    1.00  8.00
           cascor     8.07   5.19  7.00    1.00  24.00
heart1     acasper    0.10   0.36  0.00    0.00  2.00
           cascor     2.64   1.17  2.00    1.00  7.00
heartc1    acasper    0.10   0.36  0.00    0.00  2.00
           cascor     1.38   0.49  1.00    1.00  2.00
horse1     acasper    0.12   0.59  0.00    0.00  4.00
           cascor     0.82   0.39  1.00    0.00  1.00
soybean1   acasper    2.16   1.08  2.00    1.00  5.00
           cascor     16.04  5.17  16.00   6.00  24.00
thyroid1   acasper    4.64   2.34  5.00    1.00  8.00
           cascor     25.04  8.71  27.00   2.00  44.00
building1  acasper    6.36   2.15  7.00    1.00  8.00
           cascor     9.27   9.73  6.00    0.00  29.00
flare1     acasper    1.30   1.59  1.00    0.00  6.00
           cascor     2.63   0.67  3.00    2.00  4.00
hearta1    acasper    0.40   0.57  0.00    0.00  2.00
           cascor     2.77   1.72  2.00    0.00  7.00
heartac1   acasper    0.20   0.86  0.00    0.00  5.00
           cascor     1.47   0.73  1.00    0.00  3.00

Interestingly, many of the data sets are solved by acasper using very small networks, often with no hidden units at all. This illustrates a major advantage of using constructive networks: the simple solutions are tried first. It is often the case that many real-world data sets, such as the ones in Proben1, can be solved by relatively simple networks.

5 Conclusion

The introduction of an adaptive regularisation scheme to the casper algorithm is shown to maintain, and sometimes improve, the generalisation results compared to a fixed, user-optimised regularisation setting. In addition, smaller networks are generally constructed. In comparisons with an optimised version of cascor, acasper is shown to improve generalisation results and construct smaller networks. One further advantage of acasper is that it performs automatic model selection through automatic network construction and regularisation. This removes the need for the user to select these parameters, and in the process makes the acasper algorithm free of parameters which must be optimised prior to the commencement of training.

References

1. N. K. Treadgold and T. D. Gedeon, "A cascade network algorithm employing progressive RPROP," in Proc. of the Int. Work-Conf. on Artificial and Natural Neural Systems, Lanzarote, Spain, June 1997, pp. 723-732.
2. N. K. Treadgold and T. D. Gedeon, "Extending casper: A regression survey," in Proc. of the Int. Conf. on Neural Information Processing, Dunedin, New Zealand, Nov. 1997, pp. 310-313.
3. S. E. Fahlman and C. Lebiere, "The Cascade-Correlation learning architecture," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 524-532.
4. L. Prechelt, "Proben1 - a set of neural network benchmark problems and benchmarking rules," Tech. Rep. 21/94, Fakultät für Informatik, Universität Karlsruhe, 1994.
5. M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proc. of the IEEE Int. Conf. on Neural Networks, San Francisco, CA, Apr. 1993, pp. 586-591.
6. N. K. Treadgold and T. D. Gedeon, "Simulated annealing and weight decay in adaptive learning: The SARPROP algorithm," IEEE Transactions on Neural Networks, vol. 9, pp. 662-668, July 1998.
7. J. S. Bridle, "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," in Neuro-computing: Algorithms, Architectures and Applications, F. Fogelman Soulié and J. Hérault, Eds. Berlin: Springer-Verlag, 1990, pp. 227-236.
8. C. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
9. J.-N. Hwang, S.-R. Lay, M. Maechler, R. D. Martin, and J. Schimert, "Regression modeling in back-propagation and projection pursuit learning," IEEE Transactions on Neural Networks, vol. 5, pp. 342-353, May 1994.
10. R. Steel and J. Torrie, Principles and Procedures of Statistics: A Biomedical Approach. Singapore: McGraw-Hill, 1980.
11. L. Prechelt, "Investigation of the CasCor family of learning algorithms," Neural Networks, vol. 10, no. 5, pp. 885-896, 1997.
An Agent-Based Operational Model for Hybrid Connectionist-Symbolic Learning*

José C. González, Juan R. Velasco, and Carlos A. Iglesias

Dep. Ingeniería de Sistemas Telemáticos
Universidad Politécnica de Madrid, SPAIN
{cif, jcg, juanra}@gsi.dit.upm.es

Abstract. Hybridization of connectionist and symbolic systems is being proposed for machine learning purposes in many applications in different fields. However, a unified framework to analyse and compare learning methods has not appeared yet. In this paper, a multiagent-based approach is presented as an adequate model for hybrid learning. This approach is built upon the concept of bias.

1 Introduction

In her work "Bias and Knowledge in Symbolic and Connectionist Induction"


[3], M. Hilario addresses a key issue in the Machine Learning field: the need
of a unified framework for the analysis, evaluation and comparison of different
(symbolic, cormectionist, hybrid . . . . ) learning methods. This need is justified
upon the fact that there are no universally superior methods for induction. She
builds this unified framework upon the concept of bias.
This paper follows the same line, but from a different perspective. The main
point here is that the conceptual-level framework put forward by Hilario can be
complemented with its counterpart at the operational level. The purpose of this
work is, then, three-fold:

- Firstly, showing that the agent-based paradigm can provide a neutral, unbiased, operational model for such a unified framework.

* This research is funded in part by the Commission of the European Communities under the ESPRIT Basic Research Project MIX: Modular Integration of Connectionist and Symbolic Processing in Knowledge Based Systems, ESPRIT-9119, and by CYCIT, the Spanish Council for Research and Development, under the project M2D2: Metaaprendizaje en Minería de Datos Distribuida, TIC97-1343. The MIX consortium is formed by the following institutions and companies: Institut National de Recherche en Informatique et en Automatique (INRIA-Lorraine/CRIN-CNRS, France), Centre Universitaire d'Informatique (Université de Genève, Switzerland), Institut d'Informatique et de Mathématiques Appliquées de Grenoble (France), Kratzer Automatisierung (Germany), Fakultät für Informatik (Technische Universität München, Germany) and Dep. Ingeniería de Sistemas Telemáticos (Universidad Politécnica de Madrid, Spain).

- Secondly, showing that this model includes most of the known forms of meta-learning proposed by the machine learning community.
- Thirdly, showing that this kind of model may help to overcome some of the traditionally weak points of the work around meta-learning.

2 A Distributed Tool for Experimentation

In the MIX project, several models of hybrid systems integrating connectionist and symbolic paradigms for supervised induction have been studied and applied to well-defined real-world problems in different domains. These models have been implemented through the integration of software components (including connectionist and symbolic ones) on a common platform. This platform was developed partially under and for the MIX project. Software components are encapsulated as agents in a distributed environment. Agents in the MIX platform offer their services to other agents, carrying them out through cooperation protocols.
In the past, the platform has been mainly used for building object-level hy-
brids, i.e. hybrid systems developed to improve performance (in comparison with
symbolic or connectionist systems alone) when carrying out particular tasks (pre-
diction or classification) on specific real-world problems.
This application-oriented work has led to good results (in terms of increase
of performance, measured as a reduction of task error rates). Some amount of
qualitative knowledge about hybridization was derived from these experiences.
However, this knowledge is not enough for guiding the selection of an adequate
problem-solving strategy in face of a particular problem. Summing up, what we
should look for are general and well-founded bias management techniques, calling
bias "any basis for choosing one generalization over another, other than strict
consistency with accepted domain knowledge and the observed training instances"
[3].
Our proposal is that the same platform used until now for object-level hybrids be used to explore different bias management policies. A general architecture to do so can be seen in Fig. 1. This architecture will be particularised for several interesting cases. But, before that, a brief overview of the concept of bias, classified along four levels, will be presented.

3 Classes of Bias

Hilario distinguishes between two kinds of bias, representational and search bias,
that can be studied at different grain levels. We classify these granularity levels
as follows:

- Hypothetical level.
On the representational side, it has to do with the selection of formalisms or languages used for the description of hypotheses and instances in the problem space.

Fig. 1. A multi-agent architecture for bias management

Regarding search, this level deals with the kind of task we are trying to ac-
complish through automatic means: classification, prediction, optimization,
etc.
- Strategic level.
A particular representation model (production rules, decision trees, per-
ceptrons, etc.) has to be selected, compatible with the formalism preferred
at the previous level. This model is built by a particular learning algorithm
by searching the hypothesis space.
- Tactical level.
Once a pair model/algorithm has been selected, some tactical decisions may
remain to be taken about the representation model (e.g., model topology
in neural nets) or the search model (number of generations in genetic al-
gorithms, stopping criteria when inducing decision trees, etc.)
- Semantic level.
This level concerns the interpretation of the primitive objects, relations and
operators. Concerning representation, this level includes the selection, com-
bination, normalization (scaling, in general), discretization, etc. of attributes
in the problem domain. Semantic level search bias includes the selection of
the weight updating operator in neural nets and the fitness updating operator in genetic algorithms, the information-content measure used for the selection of the most informative attribute in algorithms for the induction of decision trees, etc.

4 Case 1. Semantic Level Bias: Attribute Selection

The determination of the relevant attributes for a particular task is a fundamental issue in any machine learning application. Statistical techniques should play a fundamental role for this purpose. However, commercial tools integrating statistical analysis along with symbolic or connectionist machine learning algorithms have appeared only recently. For instance, the researcher needs to have a clear idea about the correlation between variables for guiding the experiments: dropping variables, creating new ones by combination of others, etc. The evaluator may compare the results obtained by a particular learning algorithm applied to different subsets or supersets of the source data-set, looking for statistically significant differences.
The data analyser in Fig. 1 takes a data set in the machine learning repos-
itory as input and produces a description of this data set in terms of problem
type (classification, prediction or optimization), size (amount of variables and
samples), statistical measures (variable distribution and correlation), consistency
measures (amount of contradictory samples), information measures (absolute
and conditional entropy of variables, presence of missing values), etc.
A transformation agent (not shown in the figure) can be coupled into this architecture. The goal of this agent is to propose experiments on data sets generated from the source one. Transformed data sets may be obtained by several methods:
- Sampling: it is almost compulsory for data sets too big for machine learn-
ing processes. Moreover, random or stratified sampling techniques can be
necessary for experimental purposes.
- Dropping of variables: the less informative variables can be considered as
noise. Noise makes learning more difficult.
- Replacing or adding variables: the new ones can be formed by combination
of others (to be deleted or not).
- Clustering of samples: the activity of a system may fall in different macro-
states where the behaviour of the system may be qualitatively different.
These differences can be associated with completely different deep models,
in such a way that learning algorithms might perform better when trained
from cases in one individual macro-state.
- Discretization of variables: the precision used to represent a continuous variable can hide the fact that precision does not imply relevance. Some machine learning algorithms handle only discrete variables, but discretization can also attain performance improvements with algorithms capable of managing continuous and discrete attributes. Discretization can be achieved by crisp methods (splitting the range of a variable into homogeneous sub-ranges in terms of size or number of cases falling in each range), or non-crisp ones (by connectionist or fuzzy clustering techniques); a minimal sketch of the crisp, equal-width variant follows this list.
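The sketch below illustrates the crisp, equal-width method mentioned in the last item; k is the chosen number of sub-ranges, and [lo, hi] is the observed range of the variable.

    /* Crisp equal-width discretization: splits the range [lo, hi] of a
     * continuous variable into k homogeneous sub-ranges and returns the
     * index of the bin into which value x falls. */
    int discretize_equal_width(double x, double lo, double hi, int k)
    {
        if (x <= lo) return 0;
        if (x >= hi) return k - 1;
        int bin = (int)((x - lo) / (hi - lo) * k);
        return bin < k ? bin : k - 1;
    }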


Fig. 2. Architecture for tactical bias selection

5 Case 2. Tactical Level Bias: Parameter Selection

A good amount of work can be found in the literature about systems intended for
the selection of adequate representational or search bias at the tactical level. For
instance, the C45TOX system, developed for a toxicology application in the MIX
project, uses genetic algorithms for optimising the parameters used by the C4.5
learning algorithm. A work with the same goal had been previously developed
by Kohavi and John [4]. They used a wrapper algorithm for parameter setting.
In the C45TOX system, the genetic algorithm acts as a specialised config-
uration manager. It provides the experiment designer with candidate sets of
parameters that are used for training a decision tree. This tree is tested using
cross-validation. The evaluator agent estimates the performance of the decision
tree and transmits the error rate to the genetic agent to update the fitness of the
corresponding individual of the population. The knowledge base of the genetic
system evolves through the application of genetic operators. When a new genera-
tion is obtained, new experiments are launched until no significant improvement
is achieved.
The architecture of this system is shown in Fig. 2.

6 Case 3. Hypothetical/Strategic Level Bias: Algorithm Selection

Advances in software technology, and especially in the field of distributed processing, permit the easy integration of several algorithms co-operating to carry out a particular task: classification, prediction, etc. Differences in performance estimated at training time can be used to configure strategies for bias management through arbiters or combiners. Both arbiters and combiners can be developed according to fixed policies (e.g., a majority voting scheme in the case of arbiters) or variable policies.
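As an illustration of the simplest fixed policy, a majority-voting arbiter over the class labels proposed by m learning agents reduces to a frequency count; the sketch below assumes labels are small integers in 0..nclasses-1.

    /* Majority-voting arbiter: given the class labels proposed by m agents,
     * return the most frequently voted class. */
    int majority_vote(const int *votes, int m, int nclasses)
    {
        int best = 0;
        int count[64] = {0};              /* assumes nclasses <= 64 */
        for (int i = 0; i < m; i++)
            count[votes[i]]++;
        for (int cls = 1; cls < nclasses; cls++)
            if (count[cls] > count[best]) best = cls;
        return best;
    }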
One interesting research avenue in the field of meta-learning concerns the selection of the most adequate algorithm for a task according to variable inductive policies. One of the biggest efforts along this line has taken place in the framework of the STATLOG project [5, 2]: 24 different algorithms were applied to 22 databases classical in the machine learning literature. Finding mappings between tasks and biases was proposed first as a classification problem (to select the best candidate algorithm for an unseen task). For this purpose, C4.5 was used. Afterwards, meta-learning was implemented as a prediction problem intended to estimate the performance of a particular algorithm in comparison with others in face of an unseen database.
Some difficulties are evident with this approach. First, 22 data-sets are too
few for meta-learning. Second, standard (default) parameters were used to con-
figure each algorithm. Nobody knows, then, if the low performance of an in-
dividual system comes from itself or from a bad selection of parameters. All
the meta-learning systems described in the literature [7,1, 8] suffer from similar
drawbacks.
In Fig. 3 we show the instantiation of the proposed distributed architecture
for strategic bias selection. Systems are characterised according to their perform-
ance (basically, error rate, error cost, computing time and learning time) on a
particular data-set.
The architecture has several appealing features:

- Full integration. The meta-learning agents are exactly the same as those used for
object-level learning. In the same way, several learning agents can be launched
simultaneously for meta-learning, and their results can be compared or in-
tegrated in an arbiter or combiner structure.
- On-line learning. Meta-learning can be achieved simultaneously with object-
level learning.
- Use of transformed and artificial data-sets. The lack of source data-bases is
a difficulty that can be overcome through the generation of new data-sets
obtained from the transformation of the original ones. New attributes can
be derived or noise can be added in order to test noise-immunity. Even fully
artificial data-bases can be generated from rules or any other mechanism,
controlling at the same time the level of noise to be added.

Fig. 3. Architecture for strategic bias selection

7 Current Work
The ideas and the architecture proposed in this paper are being implemented
at this moment in the project M2D2 ("Meta-Learning in Distributed Data Mining"), funded by CYCIT, the Spanish Council for Research and Development.
This approach has been successfully used, for instance, for the development of
the C45TOX system.

References
1. P. Chan and S. Stolfo. A comparative evaluation of voting and meta-learning on
partitioned data. In Prieditis and Russell [6], pages 90-98.
2. J. Gama and P. Brazdil. Characterization of classification algorithms. In E. Pinto-
Ferreira and N. Mamede, editors, Progress in Artificial Intelligence. Proceedings of
the 7th Portuguese Conference on Artificial Intelligence (EPIA-95), pages 189-200.
Springer-Verlag, 1995.
3. Melanie Hilario. Bias and knowledge in symbolic and connectionist induction. Tech-
nical report, Centre Universitaire d'Informatique, Université de Genève, Genève,
Switzerland, 1997.
4. R. Kohavi and G. John. Automatic parameter selection by minimizing estimated
error. In Prieditis and Russell [6], pages 304-312.

5. Donald Michie, David J. Spiegelhalter, and Charles C. Taylor, editors. Machine
Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
6. A. Prieditis and S. Russell, editors. Proceedings of the 11th International Conference
on Machine Learning, Tahoe City, CA, 1995. Morgan Kaufmann.
7. L. Rendell, R. Seshu, and D. Tcheng. Layered-concept learning and dynamically
variable bias management. In Proceedings of the 10th International Joint Conference
on Artificial Intelligence, pages 308-314, Milan, Italy, 1987. Morgan Kaufmann.
8. G. Widmer. Recognition and exploitation of contextual cues via incremental meta-
learning. Technical Report OFAI-TR-96-01, Austrian Research Institute for Artifi-
cial Intelligence, Vienna, Austria, 1996.
Optimal Discrete Recombination: Hybridising
Evolution Strategies with the A* Algorithm

Carlos Cotta and José M. Troya

Dept. of Lenguajes y CC.CC., University of Málaga,
Complejo Tecnológico (2.2.A.6), Campus de Teatinos,
E-29071 Málaga, Spain
{ccottap, troya}@lcc.uma.es

Abstract. This work studies a hybrid model in which an optimal search
algorithm intended for discrete optimisation (A*) is combined with a
heuristic algorithm for continuous optimisation (an evolution strategy).
The resulting algorithm is successfully evaluated on a set of functions
exhibiting different features such as multimodality, noise or epistasis. The
scalability of the algorithm in the presence of epistasis is an important
issue that is also studied.

1 Introduction

Evolutionary Algorithms [2] are powerful heuristics for optimisation based on the principles of natural evolution. One of the most stressed features of these techniques is robustness: even a simple evolutionary algorithm (related to the problem under consideration just by a fitness function for evaluating tentative solutions) was assumed to produce acceptable final solutions. As a matter of fact, their overall superiority to other techniques (either specialised or not) has been almost an axiom for a long time.
Despite this, many authors (especially L.D. Davis [5]) have advocated adapting the algorithm by using as much problem knowledge as is available. In its
widest sense, this use of problem knowledge is termed hybridisation. As Hart and
Belew [8] initially stated and Wolpert and Macready [13] later popularised, using
problem knowledge is not an optional mechanism for improving the performance
of the algorithm, but it is a requirement for ensuring a minimal quality of the
results (i.e., better than random search).
There exist a plethora of mechanisms for using problem knowledge in an
evolutionary algorithm. Cotta and Troya [4] consider strong hybrid models (in
which the knowledge is used in the internal elements of the algorithm such
as the representation or the operators) and weak hybrid models (in which dif-
ferent search algorithms are combined and collaborate by means of periodical
exchanges of information). This work studies a weak hybrid model in which an
optimal search algorithm intended for discrete optimisation is combined with a
heuristic algorithm for continuous optimisation. To be precise, an A* algorithm
is combined with an evolution strategy (ES). In this sense, this work differs from

previous models in which the two combined algorithms were adapted and applied
to discrete optimisation [3].
The remainder of this article is organised as follows: first, the weak hybrid
model is formalised in Sect. 2. Subsequently, its functioning is described in Sect.
3, considering scalability issues as well. Next, experimental results are reported
in Sect. 4. Finally, some conclusions are outlined in Sect. 5.

2 Random vs. Optimal Discrete Recombination

Before describing the functioning of the proposed recombination mechanism, it is convenient to state some previous definitions. Let $S$ be the search space. It will be assumed that $S \subseteq \{x = (x_1, \dots, x_n) \mid x_i \in [L_i, U_i],\ 1 \le i \le n\}$. Let $X : S \times S \times \mathbb{Z} \to S$ be a binary recombination operator¹. Then, the immediate dynastic span is defined as follows [11]:
Definition 1 (Immediate Dynastic Span). The immediate dynastic span of two individuals $x$ and $y$ with respect to a recombination operator $X$ is defined as $\Gamma_X(\{x,y\}) = \{w \mid \exists k \in \mathbb{Z} : X(x,y,k) = w\}$, i.e., the set of all feasible individuals that can be produced when recombining $x$ and $y$ using $X$.

Definition 1 allows classifying the several recombination mechanisms that can be used when working on continuous domains into discrete and non-discrete operators. For that purpose, it is firstly necessary to introduce the concept of discrete dynastic potential:
Definition 2 (Discrete Dynastic Potential). The discrete dynastic potential $\Delta(\{x,y\})$ of two individuals $x$ and $y$ is the set of all feasible individuals $z = (z_1, \dots, z_n)$ such that $z_i \in \{x_i, y_i\},\ 1 \le i \le n$.
This definition is based on the more general concept of dynastic potential [11]: while the discrete dynastic potential of two individuals $x$ and $y$ is the set of valid vertices of the hypercuboid they define in the search space, according to [10] their dynastic potential is the whole hypercuboid. Now, a discrete recombination operator can be defined as follows:
operator can be defined as follows:
Definition 3 (Discrete Recombination Operator). A recombination operator $X$ is discrete if, and only if, $\forall x, y \in S : \Gamma_X(\{x,y\}) \subseteq \Delta(\{x,y\})$.
Examples of non-discrete recombination operators are intermediate recombi-
nation and continuous random respectful recombination [10]. These two opera-
tors will be used for comparison purposes in the experimental part of this work.
The following definition shows an example of a discrete recombination operator:
Definition 4 (Random Discrete Recombination). Let $\delta : 2^S \times \mathbb{Z} \to S$ be a function such that $\delta(\Sigma, i)$ is the $j$th member ($j = i \bmod |\Sigma|$) of $\Sigma$ under an arbitrary enumeration. Thus, the random discrete recombination operator is a function $RDR : S \times S \times \mathbb{Z} \to S$ defined by $RDR(x, y, k) = \delta(\Delta(\{x, y\}), k)$.
¹ Some comments on multiparent recombination are given in Sect. 5.

Thus, if RDR is given a random parameter $k$, it returns a random member of the discrete dynastic potential of $x$ and $y$. Notice that, if the representation is orthogonal, all combinations of variables are valid and hence $\Delta(\{x,y\}) = \prod_{i=1}^{n} \{x_i, y_i\}$, i.e., the $n$-dimensional Cartesian product of all pairs $\{x_i, y_i\}$. In this case, RDR is equivalent to uniform crossover [12].
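Under the orthogonality assumption, RDR can be sketched in a couple of lines (a minimal illustration with parents represented as tuples of reals):

    import random

    def random_discrete_recombination(x, y, rng=random):
        # Pick each component from one of the two parents: a random member
        # of the discrete dynastic potential, i.e. uniform crossover.
        return tuple(rng.choice(pair) for pair in zip(x, y))

    print(random_discrete_recombination((1.0, 2.0, 3.0), (4.0, 5.0, 6.0)))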
As stated in Sect. 1, the random selection done by RDR may be inappropri-
ate if problem-dependent knowledge is available. This is specifically true when
there exist epistatic relations among variables. In this case, the value of a vari-
able may be meaningless outside the context of other variables. Moreover, even
if no epistasis is involved, an intelligent recombination may result in a consid-
erable speed-up of the algorithm. This use of problem-dependent knowledge is
formalised by means of the Optimal Discrete Recombination (ODR) operator.
More precisely, let r be the fitness function and let -4 be a partial order relation
in r such that x dominates (i.e., is a better solution than) y if, and only if,
r ~ r Then, ODR is defined as:

Definition 5 (Optimal Discrete Recombination). The Optimal Discrete Recombination operator is a function $ODR : S \times S \times \mathbb{Z} \to S$ defined by $ODR(x, y, k) = \delta(H(\{x,y\}), k)$, where $H(\{x,y\}) = \{w \mid w \in \Delta(\{x,y\}),\ \phi(w) \in \sup_{\prec}[\phi(\Delta(\{x,y\}))]\}$.

According to this definition, ODR returns the best individual (or one of the
best individuals) that can be built without introducing any new material. This
implies performing an implicit exhaustive search in a small subset of the solution
space (i.e., in the discrete dynastic potential of the recombined solutions). Such an exhaustive search can be efficiently performed by means of a subordinate A*-like algorithm, as described in the next section.

3 Implementation and Scalability of ODR

As mentioned above, ODR requires performing an exhaustive search in an implicitly defined set of solutions. For that purpose, partial knowledge about the fitness function must be available. Otherwise, the search would be reduced to a full enumerative search with the subsequent unaffordable computational effort (consider that, according to Definition 2, the size of the discrete dynastic potential of two $n$-dimensional vectors $x$ and $y$ is $|\Delta(\{x,y\})| = O(2^n)$).
To be precise, it is necessary to determine optimistic estimations $\phi^*(\Psi)$ of the fitness of partially specified solutions $\Psi$ (i.e., $\forall z \in \Psi : \phi(z) \preceq \phi^*(\Psi)$), in order to direct the search to promising regions, pruning suboptimal solutions. A partially specified solution is termed a macro-forma, and its optimistic evaluation $\phi^*(\Psi)$ is decomposed as $V(\Psi) + \xi(\Psi)$, where $V(\Psi)$ is the known final contribution of the variables already included in $\Psi$ to the fitness of any $z \in \Psi$, and $\xi(\Psi)$ is an optimistic estimation of the fitness contribution of the remaining underspecified variables. Although it is possible to set $\xi(\Psi)$ to a trivial optimistic bound, it is clear that the more accurate the fitness estimation, the more efficient the search will be.

Solutions are incrementally constructed in the following way: initially $\Psi^0 = \emptyset$; subsequently, the extensions

$\Psi^{i+1}_{2j} = \Psi^i_j \cup \langle x_{i+1} \rangle$   (1)

$\Psi^{i+1}_{2j+1} = \Psi^i_j \cup \langle y_{i+1} \rangle$   (2)

are considered. Whenever a macro-forma $\Psi$ is infeasible or $\phi^*(\Psi) \prec \phi^\bullet$ (where $\phi^\bullet$ is the fitness of the best-so-far solution, initially $\phi^\bullet = \inf_{\prec}\{\phi(S)\}$), $\Psi$ is closed (i.e., discarded). Otherwise, the process is repeated for the open macro-formae. Obviously, this mechanism is computationally more expensive than any classical blind recombination operator but, as will be shown in Sect. 4, the resulting algorithm performs better when compared to algorithms using blind recombination and executed for a computationally equivalent number of iterations.
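As an illustration of this search, the following sketch shows how a subordinate best-first (A*-like) exploration of the discrete dynastic potential might be organised for numerical minimisation. The split into partial_value (the $V$ term) and optimistic_rest (the $\xi$ term, assumed to return 0 for fully specified solutions) mirrors the decomposition above; the interfaces and names are our assumptions, not the authors' code:

    import heapq

    def odr(x, y, partial_value, optimistic_rest):
        # Best-first search over the discrete dynastic potential of x and y.
        # partial_value(prefix): known contribution V of the fixed variables.
        # optimistic_rest(prefix): optimistic (lower-bound) estimate of the
        # contribution of the variables not yet fixed; 0 for full solutions.
        n = len(x)
        best, best_val = None, float("inf")
        start = ()
        heap = [(partial_value(start) + optimistic_rest(start), start)]
        while heap:
            bound, prefix = heapq.heappop(heap)
            if bound >= best_val:            # prune: cannot beat incumbent
                continue
            if len(prefix) == n:             # full solution: bound is exact
                best, best_val = prefix, bound
                continue
            i = len(prefix)
            for v in (x[i], y[i]):           # extend the macro-forma
                child = prefix + (v,)
                b = partial_value(child) + optimistic_rest(child)
                if b < best_val:
                    heapq.heappush(heap, (b, child))
        return best

    # Toy separable example: minimise the sum of squares.
    pv = lambda p: sum(v * v for v in p)
    rest = lambda p: 0.0                     # trivial (but valid) lower bound
    print(odr((0.2, -2.0, 3.0), (0.5, 1.0, -1.0), pv, rest))
    # -> (0.2, 1.0, -1.0)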
Some comments must be done regarding the scalability of the above-defined
operator. First of all, notice that for non-epistatic representations 2, it is possible
to decompose the fitness function as:

r = r x,~)) = ~ r (3)
i=l

It is easy to see that, in this situation, ODR must simply scan x and y,
picking the variables that proportionate the best value for each r Hence, ODR
scales linearly and, subsequently, this case does not pose any problem.
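For this separable case the linear scan can be written directly (a sketch; phi_i(i, v) denotes the per-variable contribution $\phi_i$, and minimisation is assumed):

    def odr_separable(x, y, phi_i):
        # Non-epistatic case: keep, per position i, whichever parent value
        # has the lower (better, for minimisation) contribution phi_i.
        return tuple(xi if phi_i(i, xi) <= phi_i(i, yi) else yi
                     for i, (xi, yi) in enumerate(zip(x, y)))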
The scenario is different when epistasis is involved. In this situation, and since an A*-like mechanism for exploring $\Delta(\{x,y\})$ is used, ODR is sensitive to increments in the dimensionality of the problem and to the subsequent exponential growth of $|\Delta(\{x,y\})|$. An adjustment of the representation granularity is proposed to alleviate this problem. To be precise, recall that solutions are incrementally constructed by adding one variable at a time. If the computational cost of this procedure were too high, ODR could be modified so as to add $g$ variables at a time, i.e.,

$\Psi^{i+1}_{2j} = \Psi^i_j \cup \langle x_{i \cdot g+1}, \dots, x_{(i+1) \cdot g} \rangle$   (4)

$\Psi^{i+1}_{2j+1} = \Psi^i_j \cup \langle y_{i \cdot g+1}, \dots, y_{(i+1) \cdot g} \rangle$   (5)
It can be seen that increasing $g$ confines ODR to a smaller subset of $\Delta(\{x,y\})$ whose size is $O(2^{n/g})$, and thus the computational cost is reduced. However, a very high value of $g$ may render ODR ineffective, since the chances of combining valuable information are reduced as well. For this reason, intermediate granularity values represent a good trade-off between computational cost and quality.
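As a rough numerical illustration (the figures are ours, chosen only for scale): with $n = 40$ variables, the full discrete dynastic potential contains up to $2^{40} \approx 10^{12}$ vertices, whereas a granularity of $g = 4$ confines the subordinate search to at most $2^{40/4} = 2^{10} = 1024$ block combinations.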

² Since constrained optimisation is clearly a substantial topic in itself, we defer it to future work. Orthogonality is assumed for the remainder of the article.

4 Experimental Results

A large collection of experiments has been done to assess the quality of the proposed recombination mechanism in the context of several different continuous-optimisation problems. These problems are described in Subsection 4.1. Subsequently, the experimental results are reported and discussed in Subsection 4.2.

4.1 The Test Problems

The test suite used in this work is composed of four problems: the generalised
Rastrigin function, a weighted noisy matching function, the Rosenbrock func-
tion and the design of a brachystochrone. Each of these functions exhibits several
distinctive properties, thus providing a different scenario for evaluating and com-
paring different operators. These properties are described below in more detail.

Generalised Rastrigin Function The generalised Rastrigin function is a non-linear minimisation problem whose $n$-dimensional form is defined as:

$\phi(x) = n \cdot a + \sum_{i=1}^{n} [x_i^2 - a \cdot \cos(\omega \cdot x_i)]$   (6)

For high absolute values of $x_i$, this function behaves like an $n$-dimensional parabola. However, the sinusoidal term becomes dominant for small values. Hence, there exist many local maxima and minima around the optimal value (located at $x = 0$). Although not epistatic, this function is highly multimodal and hence difficult for gradient-based algorithms. The values $a = 10$, $\omega = 2\pi$, and $-5.12 \le x_i \le 5.12$ have been used in all experiments.
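Eq. (6) translates directly into code; a minimal sketch using the parameter values above:

    import math

    def rastrigin(x, a=10.0, omega=2.0 * math.pi):
        # Generalised Rastrigin function, Eq. (6): an n-dimensional parabola
        # plus a sinusoidal term creating many local optima around x = 0.
        return len(x) * a + sum(xi * xi - a * math.cos(omega * xi) for xi in x)

    print(rastrigin([0.0, 0.0]))   # -> 0.0, the global optimum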

Weighted Noisy Matching Function This function is also a non-linear minimisation problem, defined by the following expression:

$\phi(x) = \sum_{i=1}^{n} w_i \cdot [(x_i - v_i)^2 + N_i(0, \sigma_i)]$   (7)

where

$\sigma_i = \begin{cases} K \cdot \left(1 - \frac{|x_i - v_i|}{\max |x_i - v_i|}\right) & \text{if } |x_i - v_i| \ge \epsilon \\ 0 & \text{otherwise} \end{cases}$   (8)
If the noisy terms $N_i(0, \sigma_i)$ are discarded, this function is equivalent to a scaled, translated sphere function (the optimum being located at $x = v$). However, the presence of Gaussian noise makes this function harder. Moreover, it can be seen that the amplitude of the noisy terms increases as the reference values are approached, thus becoming stronger as the algorithm converges. The noise ceases within a small neighbourhood $\epsilon$ of each reference value.
The values $w_i = i$, $v_i = 5.12 \cdot \sin(4\pi i/n)$, $K = .5$, $\epsilon = .1$, and $-5.12 \le x_i \le 5.12$ have been considered in all the experiments.

Rosenbrock Function The Rosenbrock function is another classical nonlinear minimisation problem, defined as:

$\phi(x) = \sum_{i=1}^{n-1} [100 \cdot (x_{i+1} - x_i^2)^2 + (1 - x_i)^2]$   (9)

In this problem, there exist epistatic relations between any pair of adjacent
variables. Additionally, there exist non-epistatic terms as well. However, the
latter have a much lower weight, and hence the search is usually dominated
by the former. As a matter of fact, there exists a strong attractor located at
x = 0, where these terms become zero. The further evolution towards the global
optimum (x = 1) is usually very slow. As for the previous functions, the range
- 5 . 1 2 < xi < 5.12 has been considered.
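Similarly, a minimal sketch of Eq. (9):

    def rosenbrock(x):
        # Rosenbrock function, Eq. (9): epistatic terms between adjacent
        # variables dominate the weaker non-epistatic (1 - x_i)^2 terms.
        return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
                   for i in range(len(x) - 1))

    print(rosenbrock([1.0, 1.0, 1.0]))   # -> 0.0, the global optimum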

Brachystochrone Design The design of a brachystochrone is a classical problem of the calculus of variations. This problem involves determining the shape of a frictionless track along which a cart slides down by means of its own gravity, so as to minimise the time required to reach a destination point from a motionless state at a given starting point. To approach this problem by means of evolution strategies, the track is divided into a number of equally spaced pillars [9]. Subsequently, the algorithm determines the heights of each of these pillars.
As stated above, the objective is to minimise the time required by the cart to
traverse the track. This time can be calculated as the sum of the times required
for moving between any two consecutive pillars, i.e.,
$t = \sum_{i=1}^{n+1} t_i = \sum_{i=1}^{n+1} t(h_{i-1}, h_i, \lambda, v_{i-1})$   (10)

where $n$ is the number of pillars, $h_i$ is the height of the $i$th pillar ($h_0$ and $h_{n+1}$ are data of the problem), $\lambda$ is the distance between consecutive pillars (a problem parameter as well), and $v_i = v(v_{i-1}, h_{i-1}, h_i)$ is the velocity at the $i$th pillar ($v_0 = 0$).
As it can be seen, this is also an epistatic problem: the contribution of each
variable (i.e., pillar height) depends on the value of previous variables; but,
unlike the Rosenbrock function, there does not exist any non-epistatic term. The experiments with this function have been carried out using $h_0 = 2$, $h_{n+1} = 0$, $(n+1) \cdot \lambda = 4$, and $(2h_{n+1} - h_0) \le h_i \le h_0$.
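The text does not spell out the segment-time function $t(\cdot)$; the following sketch is one plausible instantiation of Eq. (10), assuming energy conservation between consecutive pillars and an average-speed approximation along each straight segment:

    import math

    def traversal_time(h, lam, g=9.81):
        # h: pillar heights h_0..h_{n+1}; lam: horizontal spacing lambda.
        # Velocities follow from energy conservation (v_0 = 0); each straight
        # segment is traversed at the mean of its endpoint speeds. These
        # modelling choices are our assumptions, not the paper's definition.
        t, v = 0.0, 0.0
        for i in range(1, len(h)):
            seg = math.hypot(lam, h[i - 1] - h[i])        # segment length
            v2 = v * v + 2.0 * g * (h[i - 1] - h[i])      # energy balance
            if v2 <= 0.0:                                  # cart stalls
                return float("inf")
            v_next = math.sqrt(v2)
            t += seg / (0.5 * (v + v_next))
            v = v_next
        return t

    print(traversal_time([2.0, 1.5, 1.0, 0.5, 0.0], lam=1.0))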

4.2 Empirical Evaluation


All experiments have been done utilising a standard (2,20)-ES with independent stepsizes for each variable. These stepsizes undergo self-adaptation using a global learning rate $\tau' = (2n)^{-1/2}$ and a local learning rate $\tau = (4n)^{-1/4}$ [1]. For all test problems, the lower bound for the local stepsizes has been set to $\sigma_{inf} = .01$, and the upper bound to $\sigma_{sup} = 2$ (except for the brachystochrone design problem, in which $\sigma_{sup} = 1$).

For comparison purposes, experiments have been also done both without
recombination and with several classical recombination operators such as inter-
mediate recombination (IR), random respectful recombination (R3), and random
discrete recombination (RDR). Furthermore, two different reproductive mecha-
nisms have been tried: mutation + recombination (i.e., the parents are mutated
before being recombined) and recombination + mutation (i.e., the parents are
recombined and the resulting child is then mutated). When using a non-discrete
recombination operator, stepsizes are always geometrically averaged. For each
test problem, dimensionality, reproductive mechanism and operator, twenty runs
have been performed and the mean value has been considered. Runs are termi-
nated after $10^5$ function evaluations. When using the ODR operator, the addi-
tional partial evaluations carried out during recombination are also considered
and hence fewer generations are performed in this case.
Table 1 shows the results for the generalised Rastrigin function. It can be
seen that using a recombination operator always improves the performance with
respect to a non-recombinative ES. As stated in [6], this is due to the regularity
in the arrangement of local optima and to the global structure of the function,
similar to a unimodal landscape. Also, notice that the results are generally better
mutating and then recombining that vice versa. Moreover, applying the ODR
operator after mutating provides the best results. The lower performance of
ODR when applied before mutation is a consequence of the disturbing effects of
mutation on the heuristically selected combination of variables.

Table 1. Results for the generalised Rastrigin function. All results are averaged for 20
runs. The best results for each dimensionality are shown in boldface.

# of variables | mutation only | Mutation + recombination (IR, R3, RDR, ODR) | Recombination + mutation (IR, R3, RDR, ODR)

The results for the weighted noisy matching function are more impressive.
As it can be seen in Table 2, all operators are relatively satisfactory for dimen-
sionalities up to 16. From that point, only ODR consistently finds quasi-optimal
solutions. The rest of operators are incapable of dealing with the high number of
independent and non-uniformly scaled noisy terms that are present in this func-
tion. Again, the use of recombination improves the results of a non-recombinative
ES (with the exception of RDR applied before mutation), and mutating after
applying ODR yields slightly worse results.

Table 2. Results for the weighted noisy matching function. All results are averaged
for 20 runs. The best results for each dimensionality are shown in boldface.

# of variables | mutation only | Mutation + recombination (IR, R3, RDR, ODR) | Recombination + mutation (IR, R3, RDR, ODR)

The results for the Rosenbrock function are shown in Table 3. As mentioned
in Sect. 3, the computational cost of ODR quickly grows when the dimension-
ality of this problem is increased. For that reason, the granularity factor g has
been modified. To be precise, the algorithm automatically adjusts g so as to
keep a constant dimensionality-to-granularity ratio p = n/g. In these experi-
ments, the value p ~ 10 has been used. This value seems to be robust for the
problems considered. Nevertheless, empirical evidence suggest that fine-tuning
of this parameter (usually in the interval 8 < p < 12) may yield better results.

Table 3. Results for the Rosenbrock function. All results are averaged for 20 runs.
The best results for each dimensionality are shown in boldface.

# of variables | mutation only | Mutation + recombination (IR, R3, RDR, ODR) | Recombination + mutation (IR, R3, RDR, ODR)

As can be seen, ODR exhibits a good but not outstanding performance for low dimensionalities. On the one hand, this may be due to an inappropriate $p$ ratio since, for a low number of variables, small changes in this parameter have a high influence. On the other hand, these instances may be too small to take advantage of the additional computational overhead of ODR. As a matter of fact, ODR outperforms all other operators on instances of higher dimensionality. Notice also that ODR performs significantly better when individuals are mutated and then recombined than vice versa. This indicates that mutation has a very

disturbing effect with respect to the locally optimal combination of variables arranged by ODR.

Table 4. Results for the brachystochrone design problem. All results are averaged for
20 runs. The best results for each dimensionality are shown in boldface.

# of variables | mutation only | Mutation + recombination (IR, R3, RDR, ODR) | Recombination + mutation (IR, R3, RDR, ODR)

Finally, the results for the brachystochrone problem are given in Table 4. As for the Rosenbrock function, ODR with $g = 1$ becomes computationally prohibitive for high dimensionalities, and hence the value $p \approx 10$ has been kept. The results are very satisfactory: ODR performs better than the other recombination operators do. Furthermore, it seems to scale much better. It can also be seen that, again, the results of ODR are worse when it is applied before mutation, although the difference is not very significant in this case. Also, and with the exception of RDR applied after mutation, the algorithms with recombination perform better than the non-recombinative algorithm.

5 Conclusions

A hybrid model that combines evolution strategies with the A* algorithm has been presented in this work. This model tries to exploit the available knowledge about the fitness function in order to intelligently combine valuable parts of solutions independently discovered. By using this model, recombination turns out to be a strongly exploitative operation (no new value is introduced in any variable), thus leaving the responsibility for exploration to the mutation operator, a very powerful element in evolution strategies.
The empirical evaluation of the hybrid algorithm has been very satisfac-
tory, outperforming other classical operators on a benchmark composed of mul-
timodal, noisy and epistatic functions. Moreover, the algorithm can be scaled by
tuning the representation granularity (i.e., the size of the blocks combined by the
subordinate A* algorithm). In fact, this parameter can be adjusted according to
the available computational resources to allow a finer exploration.
Notice that, although recombination has been restricted to a binary operation
in this work, the hybrid model can be straightforwardly upgraded to multiparent
recombination [6, 7]. In this sense, it is simply necessary to extend the concept

of the discrete dynastic potential of two individuals to an arbitrary number of them. Obviously, scalability would be a more important issue in this case. The study of this extension of ODR constitutes a line of future work.
Another very interesting extension is the use of mechanisms for adapting
or self-adapting the granularity used by ODR during the run. Intuitively, it
is possible to use a very coarse granularity in the early stages of evolution,
using a progressively finer granularity later. Nevertheless, a more careful study
is required. Finally, the evaluation of ODR on constrained optimisation problems
is another line of future work.

References
1. Th. Bäck. Evolutionary Algorithms in Theory and Practice. Oxford University
Press, New York, 1996.
2. Th. Bäck, D.B. Fogel, and Z. Michalewicz. Handbook of Evolutionary Computation.
Oxford University Press, New York NY, 1997.
3. C. Cotta, E. Alba, and J.M. Troya. Utilising dynastically optimal forma recombi-
nation in hybrid genetic algorithms. In A.E. Eiben, Th. Bäck, M. Schoenauer, and
H.-P. Schwefel, editors, Parallel Problem Solving From Nature - PPSN V, volume
1498 of Lecture Notes in Computer Science, pages 305-314. Springer-Verlag, Berlin
Heidelberg, 1998.
4. C. Cotta and J.M. Troya. On decision-making in strong hybrid evolutionary al-
gorithms. In A.P. Del Pobil, J. Mira, and M. Ali, editors, Tasks and Methods in
Applied Artificial Intelligence, volume 1416 of Lecture Notes in Computer Science,
pages 418-427. Springer-Verlag, Berlin Heidelberg, 1998.
5. L. Davis. Handbook of Genetic Algorithms. Van Nostrand Reinhold Computer
Library, New York, 1991.
6. A.E. Eiben and Th. Bäck. Empirical investigation of multiparent recombination
operators in evolution strategies. Evolutionary Computation, 5(3):347-365, 1997.
7. A.E. Eiben, P.-E. Raue, and Zs. Ruttkay. Genetic algorithms with multi-parent
recombination. In Y. Davidor, H.-P. Schwefel, and R. Männer, editors, Parallel
Problem Solving From Nature - PPSN III, volume 866 of Lecture Notes in Computer
Science, pages 78-87. Springer-Verlag, Berlin Heidelberg, 1994.
8. W.E. Hart and R.K. Belew. Optimizing an arbitrary function is hard for the genetic
algorithm. In R.K. Belew and L.B. Booker, editors, Proceedings of the Fourth
International Conference on Genetic Algorithms, pages 190-195, San Mateo CA,
1991. Morgan Kaufmann.
9. M. Herdy and G. Patone. Evolution strategy in action: 10 ES-demonstrations.
Technical Report TR-94-05, Technische Universität Berlin, 1994.
10. N.J. Radcliffe. Forma analysis and random respectful recombination. In R.K. Belew
and L.B. Booker, editors, Proceedings of the Fourth International Conference on
Genetic Algorithms, pages 222-229, San Mateo, CA, 1991. Morgan Kaufmann.
11. N.J. Radcliffe. The algebra of genetic algorithms. Annals of Mathematics and
Artificial Intelligence, 10:339-384, 1994.
12. G. Syswerda. Uniform crossover in genetic algorithms. In J.D. Schaffer, editor,
Proceedings of the Third International Conference on Genetic Algorithms, pages
2-9, San Mateo, CA, 1989. Morgan Kaufmann.
13. D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation, 1(1):67-82, 1997.
Extracting Rules from Artificial Neural
Networks with Kernel-Based Representations

José M. Ramírez *

Dpto. de Computación
Universidad Simón Bolívar
Apartado 89000, Caracas 1080-A, Venezuela
jramire@ldc.usb.ve

Abstract. In Neural Network models the knowledge synthesized from
the training process is represented in a subsymbolic fashion (weights,
kernels, combinations of numerical descriptions) that makes its interpretation
difficult. The interpretation of the internal representation of a successful
Neural Network can be useful to understand the nature of the
problem and its solution, and to use the Neural "model" as a tool that gives
insights about the problem solved and not just as a solving mechanism
treated as a black box. The internal representation used by the family
of kernel-based Neural Networks (including Radial Basis Functions,
Support Vector Machines, Coulomb potential methods, and some probabilistic
Neural Networks) can be seen as a set of positive instances of
classification and, thereafter, used to derive fuzzy rules suitable for explanation
or inference processes. The probabilistic nature of the kernel-based
Neural Networks is captured by the membership functions associated to
the components of the rules extracted. In this work we propose a method
to extract fuzzy rules from trained Neural Networks of the family mentioned,
comparing the quality of the knowledge extracted by different
methods using known machine learning benchmarks.

1 Motivation
At a certain level of abstraction, neural learning can be seen as case-based learning. The Neural Network stores only a selected set of examples (or a combination of examples) and uses them to find the best approximation for new instances of the problem, according to its generalization ability. The system learns from examples, not rules, but the examples are instances of the application of (partially) unknown rules in a given domain. A successful Neural Network synthesizes in a subsymbolic representation the rules needed to solve instances of a problem in a given domain; but its mechanism does not give significant feedback to the designer that could contribute to the understanding of the problem domain.
In Neural Network models the knowledge synthesized from the training process is represented in a subsymbolic fashion (weights, kernels, combinations of numerical descriptions) that makes its interpretation difficult.

* International postal address: José M. Ramírez, CCS 90996, 4440 N.W. 73 Av., Miami, FL 33166, USA.

The interpretation of the internal representation of a successful Neural Network can be useful to understand the nature of the problem and its solution, and to use the Neural "model" as a tool that gives insights about the problem solved and not just as a solving mechanism treated as a black box.
The internal representation used by the family of the kernel-based Neural
Networks, including Hopfield-like networks [10], Radial Basis Functions [14][16][18],
Support Vector Machines [4][6][27], Coulomb potential methods [2][20][19][22],
and some probabilistic Neural Networks [23], can be seen as a set of positive
instances of classification and, thereafter, used to derive fuzzy rules suitable for
explanation or inference processes.
The probabilistic nature of the kernel-based Neural Networks is captured by
the membership functions associated to the components of the rules extracted.
In this work we propose a method to extract fuzzy rules from trained Neural
Networks of the family mentioned; comparing the quality of the knowledge ex-
tracted by different methods using known machine learning benchmarks.

2 Rule extraction from Neural Networks

The use of weight strengths as the internal knowledge representation (Multi-layer Perceptrons in general) constitutes a very powerful scheme to learn function approximations, but the nonlinear nature of the transformation of the inputs using those weights makes the interpretation extremely difficult. The instances used for training are fused into a global representation whose reverse engineering requires a lot of knowledge about the problem domain, the training set used and the specific error surface that the training process followed.
A lot of effort has been devoted to the extraction of rules from MLP networks. Gallant in [8] and later [9] proposed semantics of Neural Networks as two-valued logic rules that can be used in an inference process.
Thrun in [25] analyzes the input-output behavior of backpropagation net-
works, attaching validity intervals to the activation range of each processing
unit, such that the network's activations must lie within these intervals. Once
the intervals are obtained, If-then rules are generated where the precondition is
given by a set of intervals and the output category. The rules are tested against
the network and refined using generalization (enlarging the validity intervals) or
specialization.
An interesting result was obtained by Benitez et al. [3], based on the evidence
provided by Roger Jang and Sun in [12] regarding the functional equivalence be-
tween Radial Basis Function Networks and Fuzzy Inference Systems, developing a method for the extraction of rules from trained Neural Networks using a powerful logic connective, the i-or, that encompasses the representation capabilities of "and" and "or".
McMillan defines a weight template as a parametrized region of weight space
corresponding to specific symbolic functions [15]. Later Alexander and Mozer
[1] used an extraction mechanism based on weight templates, translating each

unit's weight into discrete, symbolic descriptions. Some candidate templates are
generated and instantiated; the candidates that better fit with the actual weights
are selected. The activation of the units is assumed boolean and the transfer
function is restricted to sigmoidal.
Other approaches use domain theories to initialize the networks [26]. The
networks are trained using labeled examples and rules describing the behavior of
the networks, according to the theories, are extracted using an iterative clustering
algorithm. The rules generated are, in fact, mathematical descriptions of the
networks behavior and not symbolic rules.
Craven and Shavlik use a trained Neural Network to perform queries using
the training data to induce decision trees [7]. The trained network is used as a
black box to answer queries, making the method architecture independent, but
what is generated is not a symbolic translation of the Neural Network's internal
representation, but a decision tree that is functionally equivalent to the network.
This observation can be subtle, but the training process of the network derives a
compact subsymbolic representation that is lost, since the network is used only
as the oracle of the decision tree induction algorithm.
Deductive learning is another method for extracting rules from Neural Net-
works that modify the network architecture to simplify the representation to
better learning. One of the better explained methods of deductive learning is
presented by Ishikawa in [11] and is named structural learning with forgetting. The method interrupts the standard learning of the network at certain points and prunes the connections with weights smaller than a certain threshold; this action produces a "forgetting" effect in the network that contributes to generating a more gener-
alized representation. When the training converges, a symbolic interpretation of
the remaining connections values is generated in form of rules.
Kosko [13] proposes the use of clusters generated in the input-output prob-
lem space by competitive learning as the extensional description of fuzzy sets;
and sketches a method to generate rules from a discretization of the problem space
in a way that captures the spatial distribution of the clusters. This proposal is
closely related with our work, in the sense that, starting from a discretization of
the problem space and target classes, fuzzy implications are derived from mem-
bership functions associated to defined intervals (clusters in Kosko's proposal).
Clustering is also used by Sreerupa and Mozer in [24] and by Omlin and
Giles in [17] to induce Finite State Machines from Neural Networks. The result is not rules, but FSMs that can be seen as a discrete representation of the network behavior, closer to people's understanding.

3 Knowledge representation in kernel-based Neural Networks

Probably the first reference to a practical Neural Network architecture based on the storage of local memories instead of global representations was made by Hopfield in [10], inspired partially by works in vector mapping with thermodynamic models. Since the appearance of the Hopfield network, a family of

networks has grown upon the principle of internal representation of local memories in the form of weights [23], fields [14][16][18][2][20][19][22] or vectors [4][6][27].
In one way or another, all these methods select certain training vectors with
known classification to be stored and manipulated by distance metrics, inner
products or discriminant functions to determine the classification of new input
vectors. The networks are, in fact, applying some rules on prototypical repre-
sentations of the target classes. Each prototype is modeled by nodes in hidden
layers of the network, and can be seen as a template to be contrasted against an input vector to obtain a kind of "belonging degree". An arbitration is then applied to select the appropriate classification.
Obviously, each prototype, template or kernel represents an approximation of an instance of the solution. We are certain that the center of each kernel is
correctly classified, but the points around the region of influence of the kernel
(receptive field, basin of attraction, support vector, etc.) have a probability lower than 1 of belonging to the class of the kernel. To have a perfect classification we would need as many kernels as points can be defined in the feature space, which is intractable by nature; but we can use a probabilistic approach to deal with
the belonging degrees and the overlapping of kernels [27], in order to produce
reasonable classifications.
One method to define the probabilistic belonging or membership of a set of points to a given class is provided by fuzzy theory, in the form of fuzzy sets and their intensional interpretation, the fuzzy membership functions. Taking the membership degree of each input vector component to a given fuzzy set and the associated target class, it is possible to generate a fuzzy rule whose condition part is formed by the conjunction of all membership tests. In the next section we describe a method that accomplishes this task.
4 Our rule extraction approach

Due to the interval representation characteristics posed in the previous section, and the analogy established between kernels and fuzzy membership functions, our target rules will be fuzzy rules. The algorithm proposed generates conjunctive if-then rules with relative strengths. A revision of the algorithm, after an analysis of the experimental results, generates rules with confidence factors.
Our algorithm generates templates, similar to those used by Alexander and Mozer in [1], but based on the prototype vectors stored in kernels, and assigns relative strengths based on the number of kernels that share the same template:

1. The data available is discretized (fuzzy quantized) and membership functions are defined.
2. For all kernels...
   - Label each component of the prototype vector using the fuzzy intervals defined for the problem. Each component is replaced by the label of the interval with the maximum membership.
   - Generate conjunctive rules using each attribute membership and the target class associated to the kernel.

3. If a rule appears more than once, leave only one rule and attach the number
of coincident rules as a relative strength.
For example, suppose the prototype vector of one kernel is (1.5, 6.3), representing the values of two input features of the problem, namely x and y, and that a discretization criterion assigns labels to intervals of x and y:

        Low          Med           High          Very High
x   [0, 0.99]   [1.00, 3.99]   [4.00, 5.99]   [6.00, ∞)
y   [0, 2.99]   [3.00, 4.99]   [5.00, 8.99]   [9.00, ∞)
The number of intervals defined depends on the accuracy desired in the classification; a membership function is associated to each interval. The discretized prototype vector will be (Med, High). If the known class associated to the kernel is ClassA, then the rule generated could be:

if (x is Med) and (y is High) then ClassA with 4

where the 4 is the number of kernels that generated the same rule. This number represents a frequency factor that can be used as a certainty factor in the fuzzy inference process.
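The three steps above can be condensed into a short sketch (the data structures — lists of (label, membership function) pairs per attribute and (prototype, class) pairs per kernel — are our assumptions for illustration):

    from collections import Counter

    def extract_rules(kernels, intervals):
        # kernels: list of (prototype_vector, target_class) pairs.
        # intervals[i]: list of (label, membership_fn) pairs for attribute i.
        templates = Counter()
        for proto, cls in kernels:
            # Replace each component by the label of maximum membership.
            tmpl = tuple(max(intervals[i], key=lambda lm: lm[1](v))[0]
                         for i, v in enumerate(proto))
            templates[(tmpl, cls)] += 1     # coincident rules accumulate
        # One rule per template; the count is its relative strength.
        return [(tmpl, cls, n) for (tmpl, cls), n in templates.items()]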
Experimental results showed that the relative strength was inadequate as a selection criterion when contradictory rules are present due to the overlap between kernels. A modification to the algorithm was introduced, using a method proposed by Wang in [28] for learning fuzzy rules from numerical data. It is based on the fuzzy quantization of the prototype vectors to generate fuzzy rules with a confidence factor that is the product of the membership degrees μ of each attribute. In our previous example, if the μ of the feature values (1.5, 6.3) to the intervals are:

      Low    Med    High   Very High
x    0.60   0.90   0.00   0.00
y    0.00   0.10   0.75   0.00
The confidence of the rule generated will be 0.675, that is, the product of 0.90 and 0.75. This value captures the confidence in the classification suggested by a kernel, given that part of its area overlaps with other kernels associated to different classes. The rule will be:

if (x is Med) and (y is High) then ClassA with 0.675

The confidence factor is a kind of compound probability that is far more accurate than the certainty factor used previously. The confidence factors are used to resolve ambiguities and as a tie-breaking criterion during the inference process.
Given that this modification of the algorithm may produce rules with the same template but different confidence factors, the rules with the same template are aggregated and a compound confidence factor is associated to the aggregated rule. Finally, the rules are sorted using the confidence factor.
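The confidence computation of the revised algorithm can be sketched as follows (same hypothetical interval structure as above; the μ values come from the membership functions):

    def rule_confidence(proto, intervals):
        # Wang-style confidence: product of the maximum membership degree
        # of each attribute value (0.90 * 0.75 = 0.675 in the example above).
        labels, conf = [], 1.0
        for i, v in enumerate(proto):
            label, mu = max(((lab, fn(v)) for lab, fn in intervals[i]),
                            key=lambda t: t[1])
            labels.append(label)
            conf *= mu
        return labels, conf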

Sepal Length (SL)  [4.00, 4.99] [5.00, 5.99] [6.00, 6.99] [7.00, ∞)
Sepal Width (SW)   [2.00, 2.99] [3.00, 3.99] [4.00, 8.00]
Petal Length (PL)  [1.00, 1.99] [2.00, 2.99] [3.00, 3.99] [4.00, 4.99] [5.00, 5.99] [6.00, 7.00]
Petal Width (PW)   [0.00, 0.99] [1.00, 1.99] [2.00, 3.00]
Table 1. Intervals for the discretization of the Iris Plants Database attributes

5 Experimental results

To initially test the algorithm we used the well known Iris Plants Database [5].
The goal is to recognize the type of the iris plant to which a given individual
belongs. The data is composed of 150 instances, equally distributed between
three classes, 50 for each of the three types of plants: Setosa, Versicolor, and
Virginica. The first class is linearly separable from the other two; while the
latter are not linearly separable from each other. Each instance features four
attributes: petal length, petal width, sepal length and sepal width, which take
continuous values measured in centimeters.
The whole database was processed using an RBF network, a Coulomb potential-based network (RCE) and the algorithm presented in this work. 100 instances of the database were used as training data and 50 as testing data. The RBF network was trained using a kernel unit for each training vector, and σ was calculated for each kernel as the RMS distance to its neighbors, as suggested in [21]. The RCE network was trained using an initial θ (threshold or radius of each kernel) equal to half of the dimension of the feature space. Both networks used 4 input units (one for each attribute) and 3 output units (one for each class).
For the rule extraction algorithm, membership functions of type S (sigmoidal) and Π (bell) were used, and the attributes were discretized as shown in Table 1.
Tables 2 and 3 show the results of the application of the networks and of the rules extracted for the classification of the testing set. It can be seen that the accuracy was over 90%, and the number of rules extracted from the kernels created with the training set gives an idea of the compactness that the rule extraction achieves while maintaining good performance.
The "extra" rule generated from RBF was a very specialized rule for the class Virginica that was present in RBF due to the initialization strategy used, which created a kernel for each instance in the training set.
Finally, the alignment measures the functional equivalence of each rule set extracted with the corresponding network. This number was obtained comparing the output of the network and of the corresponding rule set for each instance of the testing set, and is proportional to the number of coincident outputs.
The rules extracted from the RCE network were:

                 Kernels or Rules
RBF                    100
RCE                     23
Rules from RBF           7
Rules from RCE           6
Table 2. Kernels and rules generated from each network using the 100-instance training set from the Iris Plants Database

                 Accuracy (%)   Alignment (%)
RBF                  96              -
RCE                  98              -
Rules from RBF       91             94.8
Rules from RCE       94             95.9
Table 3. Accuracy and alignment measured over the 50-instance testing set from the Iris Plants Database

1. if (SL is in interval 2) and (SW is in interval 2) and (PL is in interval 1) and (PW is in interval 1) then Setosa with 0.65
2. if (SL is in interval 1) and (SW is in interval 2) and (PL is in interval 1) and (PW is in interval 1) then Setosa with 0.31
3. if (SL is in interval 3) and (SW is in interval 1) and (PL is in interval 4) and (PW is in interval 2) then Versicolor with 0.65
4. if (SL is in interval 3) and (SW is in interval 2) and (PL is in interval 4) and (PW is in interval 2) then Versicolor with 0.35
5. if (SL is in interval 5) and (SW is in interval 3) and (PL is in interval 4) and (PW is in interval 2) then Virginica with 0.15
6. if (SL is in interval 3) and (SW is in interval 2) and (PL is in interval 5) and (PW is in interval 3) then Virginica with 0.85

Another database used was the Mushroom Database that consists of descrip-
tions of 8124 instances corresponding to 23 species of mushrooms, each identified
as edible (51.8 %) or poisonous (48.2 %) and described by 22 nominal-valued
attributes, with between 2 and 12 possible values.
In this case the discretization step is skipped given the nominal nature of
the attributes. RCE and RBF were trained using 5416 instances, the rules were
extracted and then tested against the remaining 2708 instances. As expected,
the rules extracted were perfect and the accuracy reached almost 100 % with 10
rules. Tables 4 and 5 show the results.

                 Kernels or Rules
RBF                   5416
RCE                     32
Rules from RBF          10
Rules from RCE          10
Table 4. Kernels and rules generated from each network using the 5416-instance training set from the Mushrooms Database

                 Accuracy (%)   Alignment (%)
RBF                 99.9             -
RCE                100               -
Rules from RBF      98.2            99.8
Rules from RCE      99.1            99.9
Table 5. Accuracy and alignment measured over the 2708-instance testing set from the Mushrooms Database

6 Discussion and future work

The method presented successfully extracts fuzzy rules from trained kernel-based Neural Networks, including RBF and RCE. It is expected that the method will behave well on the rest of the kernel-based networks, given the analogy established in terms of the internal representation.
The format of the rules extracted is extremely straightforward, allowing the understanding of the functionality of the Neural Network and the dynamics of the target problem. Moreover, the rules generated showed an outstanding comparative performance in the resolution of the same problems.
We are not interested, for the moment, in the complexity and scalability of the algorithm; the degradation in performance due to kernel addition is a well known drawback of kernel-based networks, but their precision and stability are preferred for certain tasks. The precision is maintained in the symbolic representation obtained.
The alignment of the rules generated with the networks gives us an idea of how accurate a study of the original problem based on the symbolic representation obtained can be. Over 94% was obtained, which is outstanding in terms of the target that is widely used for model alignment.
The analysis of the rules extracted can lead to the definition of strategies
to debug or expand Neural Networks; creating new training instances to cover
conditions that seem to be absent from the neural internal representation. We
will explore this issue elsewhere.

References

1. J.A. Alexander and M.C. Mozer. Template-based algorithms for connectionist rule
extraction. Advances in Neural Information Processing Systems, 7:609-616, 1995.

2. C. Bachmann, L. Cooper, A. Dembo, and O. Zeitouni. A relaxation model for
memory with high storage density. Proceedings of the National Academy of Science,
21:609-616, 1995.
3. J.M. Benitez, J.L. Castro, and I. Requena. Are artificial neural networks black
boxes? IEEE Transactions on Neural Networks, 8:1156-1164, 1997.
4. B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin
classification. In Proceedings 5th annual Workshop on Computational Learning
Theory, pages 144-152. ACM, 1992.
5. C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases,
1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
6. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297,
1995.
7. M.W. Craven and J.W. Shavlik. Extracting tree-structured representations of
trained networks. Advances in Neural Information Processing Systems, 8:24-30,
1996.
8. S. I. Gallant. Connectionist expert systems. Communications of the ACM, 31:152-
169, 1988.
9. M.J. Healy and T.P. Caudell. Acquiring rule sets as a product of learning in
a logical neural architecture. IEEE Transaction on Neural Networks, 8:461-474,
1997.
10. J. J. Hopfield. Neural networks and physical systems with emergent collective com-
putational abilities. In Proceedings of the National Academy of Sciences USA, volume 81,
pages 3088-3092, 1984.
11. M. Ishikawa. Neural network approach to rule extraction. In Proceedings of the
2nd New Zealand International Conference on Artificial Neural Networks and Fuzzy
Systems, pages 6-9. IEEE Computer Society, 1995.
12. J.S. Roger Jang and C.T. Sun. Functional equivalence between radial basis function
networks and fuzzy inference systems. IEEE Transaction on Neural Networks,
4:156-159, 1993.
13. B. Kosko. Neural Networks and Fuzzy Systems: A dynamical approach to machine
intelligence. Prentice-Hall, 1992.
14. S. Lee and R. Kil. A Gaussian potential function network with hierarchically self-
organizing learning. Neural Networks, 4(2):207-224, 1991.
15. C. McMillan, M.C. Mozer, and P. Smolensky. Rule induction through integrated
symbolic and subsymbolic processing. Advances in Neural Information Processing
Systems, 4:969-1497, 1992.
16. J.E. Moody and C. Darken. Fast learning in networks of locally-tuned processing
units. Neural Computation, 1(2):281-294, 1989.
17. C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural
networks. Technical Report 92-23, Department of Computer Science, Rensselaer
Polytechnic Institute, 1992.
18. T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings
of the IEEE, 78:1481-1497, 1990.
19. J. Ramirez. MRC: an evolutive connectionist model for hybrid training. In Lecture
Notes in Computer Science 686: New trends in neural computation, pages 223-229.
Springer-Verlag, 1993.
20. D. Reilly, L. Cooper, and C. Elbaum. A neural model for category learning. Bio-
logical Cybernetics, 45:35-41, 1982.
21. A. Saha and J. Keeler. Algorithms for better representation and faster learning
in radial basis function networks. Advances in Neural Information Processing Sys-
tems, 2:482-489, 1990.

22. C. Scofield. Learning internal representation in the Coulomb energy network. In
Proceedings of the IEEE International Conference on Neural Networks, 1988.
23. D. Specht. Generation of polynomial discriminant functions for pattern recogni-
tion. PhD thesis, Stanford University, 1966.
24. D. Sreerupa and M. Mozer. A unified gradient-descent/clustering architecture
for finite state machine induction. Advances in Neural Information Processing
Systems, 6:19-26, 1994.
25. S. Thrun. Extracting rules from artificial neural networks with distributed repre-
sentations. Advances in Neural Information Processing Systems, 7:505-512, 1995.
26. G. Towell and J. Shavlik. Interpretation of artificial neural networks: mapping
knowledge-based neural networks into rules. Advances in Neural Information Pro-
cessing Systems, 4:977-984, 1992.
27. V. Vapnik. The nature of Statistical Learning Theory. Springer-Verlag, 1995.
28. L.X. Wang. Adaptive Fuzzy Systems and Control: Design and Stability Analysis.
Prentice-Hall, Englewood Cliffs, 1994.
Rule Improvement Through Decision Boundary
Detection Using Sensitivity Analysis

AP Engelbrecht 1 and HL Viktor 2

1 Department of Computer Science, University of Pretoria, Pretoria, SOUTH AFRICA, engel@driesie.cs.up.ac.za
2 Department of Informatics, University of Pretoria, Pretoria, SOUTH AFRICA, hlviktor@econ.up.ac.za

Abstract. Rule extraction from artificial neural networks (ANN) provides
a mechanism to interpret the knowledge embedded in the numerical
weights. Classification problems with continuous-valued parameters create
difficulties in determining boundary conditions for these parameters.
This paper presents an approach to locate such boundaries using sensitivity
analysis. Inclusion of this decision boundary detection approach in
a rule extraction algorithm resulted in significant improvements in rule
accuracies.

1 Introduction

Artificial neural networks (ANN) have proved to be very efficient classification tools. Domain experts are, however, skeptical to base crucial decisions on the results obtained from ANNs, mainly due to the numerical representation used by ANNs. It is very difficult to interpret the knowledge encapsulated by the numerical weights of ANNs. Rule extraction from ANNs provides a mechanism to interpret this numerically encoded knowledge.
Several rule extraction algorithms have been developed, including [Craven et al 1993, Fu 1994, Towell 1994, Viktor 1998]. These algorithms have been shown to be efficient in the knowledge extraction process. The output of rule extraction algorithms is propositional DNF rules with attribute-value tests of the form Ai <rel_operator> boundary_value, e.g. petal-width < 49.50. Classification problems with continuous-valued attributes present difficulties in determining the boundary conditions for such attributes. Current solutions to this problem include discretizing the continuous attributes, using a brute-force approach to find boundaries, or implementing an algorithm to locate decision boundaries. Discretization may cause important information to be obscured, while decision boundary detection algorithms are computationally complex [Baum 1991, Cohn et al 1994, Hwang et al 1991].
This paper presents a computationally efficient approach to locate decision
boundaries, using sensitivity analysis. First-order derivatives of the ANN outputs
with regard to input patterns are used to find the position of boundaries. This
algorithm is used in conjunction with the ANNSER rule extraction algorithm

[Viktor et al 1995] to find the boundary values for continuous-valued attributes. Results of this approach showed significant improvements in rule accuracies.
This paper is outlined as follows: Section 2 overviews an attribute evaluation
approach to find boundary values, and presents a sensitivity analysis approach
to locate decision boundaries. Sections 3 and 4 respectively present results of the
sensitivity analysis approach on the iris and breast cancer problems, compared
to that of the attribute evaluation approach.

2 Decision boundary detection

The ANNSER rule extraction algorithm extracts rules from feedforward ANNs, using sensitivity analysis to prune the ANN prior to rule extraction [Viktor et al 1995]. Propositional DNF rules are produced. This section overviews two approaches to determine boundary values for continuous-valued attributes: the attribute evaluation approach, and the sensitivity analysis decision boundary detection approach.
Consider a training set that contains n values for continuous attribute Ai, namely the set of values {v1, ..., vn}. A decision boundary is a point xi in the input space where the attribute values of Ai are divided into two subsets: subset A with values {v1, ..., vi} where Ai < xi, and subset B containing {vi+1, ..., vn} for Ai > xi. The value of xi can be used as a threshold value (boundary value) in attribute-value tests to distinguish between output classes, where the test (Ai < xi) covers concepts of class Cj and the test (Ai > xi) covers those concepts that do not fall in class Cj.
The attribute evaluation method to determine these threshold values is applied to the original unscaled data set. In this approach, the minimum and maximum values of each attribute Ai with respect to each output class Cj are determined. That is, for each attribute Ai and output class Cj, the range min_Ai < Ai < max_Ai is determined. If min_Ai corresponds to the minimum value contained in the training set, the test is simplified to (Ai < max_Ai). Similarly, if the maximum value of the range is equal to the maximum value in the training set, the test is simplified to (min_Ai < Ai).
Note that the training examples possibly do not include values that reflect
the exact boundaries of the attribute-values. To improve the generalization of
the rules over unseen examples, the ranges can be extended to include those
values that fall on the boundaries. For example, consider a range a1 < Ai < a2. The thresholds can be modified according to the following equations:

a1 = a1 - (a2 - a1) r    (1)

a2 = a2 + (a2 - a1) r    (2)

where r is a domain-dependent constant, used to improve the test data classification by the rules obtained from the training data [Sestito et al 1994].
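As an illustration (a sketch under our own naming, not the authors' implementation), the attribute evaluation method with the extension of equations (1) and (2) can be written as follows; the value r = 0.03 is the one used later in Section 3, and all function and variable names are assumptions.

import math  # only the standard library is needed

def attribute_value_test(values, train_min, train_max, r=0.03):
    """Return (low, high) thresholds for one attribute and one output class.

    A threshold of None means the corresponding test is simplified away
    because it coincides with the training-set extreme."""
    a1, a2 = min(values), max(values)
    span = a2 - a1
    a1, a2 = a1 - span * r, a2 + span * r      # equations (1) and (2)
    low = a1 if a1 > train_min else None       # keep (low < Ai) only if useful
    high = a2 if a2 < train_max else None      # keep (Ai < high) only if useful
    return low, high

# Versicolor petal-length: range 13.0..46.50 in a 13..69 training set yields
# the simplified, extended test (petal-length < 47.50), as in Section 3.
print(attribute_value_test([13.0, 46.5], 13.0, 69.0))   # -> (None, 47.505)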
The sensitivity analysis decision boundary detection algorithm is based on the following assumption: Consider a continuous-valued attribute Ai. If a small perturbation dAi of Ai causes the ANN to change its classification from one class to another, then, according to [Engelbrecht et al 1998a,Engelbrecht 1998b], a decision boundary is located in the range [Ai, Ai + dAi]. That is, a decision boundary is located at the point in input space where a small perturbation to the value of an input parameter causes a change in the output class.
Sensitivity analysis of the ANN output with respect to input parameter Ai is used to assign a "measure of closeness" of an attribute value to the boundary value(s) of that attribute. That is, for each example p in the training set, the first-order derivative

∂Cj/∂Ai(p)    (3)

is calculated for each class (output) Cj and for each input Ai [Engelbrecht et al 1998a,Engelbrecht 1998b]. The higher the value of ∂Cj/∂Ai(p), the greater the chance that a small perturbation of Ai(p) will cause a different classification [Engelbrecht 1998b]. Therefore, patterns with high ∂Cj/∂Ai(p) values lie closest to decision boundaries.
A graph of ∂Cj/∂Ai(p), p = 1,...,P, reveals peaks at boundary points. A curve fitting algorithm can be used to fit a curve over the values ∂Cj/∂Ai(p) and to find the values of Ai where a peak is located. These values of Ai constitute decision boundaries. Sampling values to the left and right of the boundary peaks indicates whether an attribute should have a value less or greater than the boundary value to trigger a rule.
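The derivative in equation (3) is cheap to obtain for a feedforward network. The sketch below is our own hedged illustration (a single-hidden-layer sigmoid network with invented names, not the ANNSER code): it computes the sensitivity of every output with respect to every input in closed form and, in place of curve fitting, simply sorts the attribute values by derivative magnitude to propose boundary candidates.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def input_sensitivities(X, W1, b1, W2, b2):
    """dC_j/dA_i (equation 3) for a 1-hidden-layer sigmoid network.

    X: (P, n) scaled input patterns; W1: (n, h); W2: (h, m).
    Returns S of shape (P, m, n) with S[p, j, i] = dC_j/dA_i(p)."""
    H = sigmoid(X @ W1 + b1)          # hidden activations
    C = sigmoid(H @ W2 + b2)          # network outputs
    dC = C * (1.0 - C)                # sigmoid derivative at the outputs
    dH = H * (1.0 - H)                # sigmoid derivative at the hidden units
    # Chain rule: dC_j/dA_i = dC_j * sum_k W2[k, j] * dH_k * W1[i, k]
    return np.einsum('pj,kj,pk,ik->pji', dC, W2, dH, W1)

def boundary_candidates(X, S, i, j, top=3):
    """Attribute values of A_i whose |dC_j/dA_i| is largest: these patterns
    lie closest to a decision boundary."""
    order = np.argsort(-np.abs(S[:, j, i]))
    return X[order[:top], i]

# Tiny demonstration with random (untrained) weights; shapes mimic the
# pruned 2-2-3 Iris network of Section 3.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(105, 2))
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(3)
S = input_sensitivities(X, W1, b1, W2, b2)
print(boundary_candidates(X, S, i=0, j=1))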
The sensitivity analysis decision boundary detection algorithm is applied in
conjunction with the ANNSER rule extraction algorithm in the next sections.

3 The Iris data set

The aim of this section is to illustrate the sensitivity analysis decision boundary
detection algorithm and to compare the rules extracted to the rules extracted
when the attribute evaluation method is used.
The Iris classification problem concerns the classification of Irises into one
of three classes, namely Setosa, Versicolor and Virginica. Irises are described by
means of four continuous-valued inputs sepal-width, sepal-length, petal-width and
petal-length.
The original 150 instance Iris data set was randomly divided into a 105
instance training set and a 45 instance test set. The sensitivity analysis decision
boundary detection algorithm was executed against the 105 instance training
set.
Firstly, a 4-2-3 ANN was trained using sigmoid activation functions with steep slopes to approximate linear threshold functions. All input values were scaled to the range [-1, 1]. Training converged after 10 epochs, with a classification test accuracy of 98%. Next, the sensitivity analysis pruning algorithm in [Engelbrecht et al 1996,Engelbrecht 1998b] was executed to prune irrelevant ANN parameters. Sensitivity analysis showed the sepal-width and sepal-length attributes to be of low significance, and these two attributes were subsequently pruned [Engelbrecht 1998b]. After pruning, a reduced 2-2-3 ANN was trained. Again training converged after 10 epochs, with a classification test accuracy of 95.9%. The threshold values of the attribute-value tests were determined using the sensitivity analysis decision boundary detection algorithm, as discussed in Section 2. The decision boundary detection algorithm was applied to each of the three Iris types.

[Figure 1: two panels, petal-length vs. Iris Versicolor and petal-width vs. Iris Versicolor, showing the sensitivity peaks that mark the decision boundaries.]

Fig. 1. Petal-width and petal-length decision boundaries for Versicolor Iris

Figure 1 illustrates the decision boundary peaks formed for the petal-width and petal-length attributes with regard to the Versicolor Iris. These peaks were used to determine the actual unscaled attribute values that corresponded to these boundaries. The boundaries were located at 49.50 for the petal-length attribute and at 18.50 for the petal-width. The actual relational operators were determined by sampling values to the left and right of the boundaries. The boundaries produced two attribute-value tests describing the Versicolor Iris, namely the tests (petal-length < 49.50) and (petal-width < 18.50). The decision boundaries for the Setosa and Virginica Irises were located using the same approach as discussed above. The Setosa decision boundaries were detected at a petal-length of 19.50 and a petal-width of 6.50, producing attribute-value tests (petal-length < 19.50) and (petal-width < 6.50). The Virginica Iris type was described by the attribute-value tests (petal-length > 49.50) and (petal-width > 16.50).
Using these attribute-value tests, the ANNSER rule extraction algorithm extracted the following rules:

Rule 1: IF petal-length < 19.50 AND petal-width < 6.50 THEN Setosa
Rule 2: IF petal-length > 49.50 AND petal-width > 16.50 THEN Virginica
Rule 3: IF petal-length < 49.50 THEN Versicolor
Rule 4: IF petal-width < 18.50 THEN Versicolor

The test set accuracy of the rule set was 95.9%, with individual rule accuracies
ranging from 93.9% to 100%. The accuracy of the set of rules was equal to that
of the classification accuracy of the 2-2-3 ANN. This implies that the rule set
models the ANN to a comparable degree of fidelity, where the fidelity is measured
by comparing the classification performance of the rule set to that of the ANN
from which it was extracted [Craven et al 1993].
The attribute evaluation method was applied next, and is illustrated by con-
sidering the construction of the attribute-value tests of the rule that describes
the Versicolor Iris, as depicted in Table 1. This rule concerned the petal-length
attribute. For the Versicolor Iris, the petal-length attribute had values within
a range of (13.0 < petal-length < 46.50). The petal-length attribute values in
the training set ranged from 13 to 69. Therefore, the minimum attribute-value
test range value corresponded to the minimum value in the training data set.
The attribute-value test was simplified to (petal-length < 46.50). To improve
the generalization of the rule set, the value of r was set to 0.03. This value was
used to calculate a new threshold, using equation (2). A new attribute-value
test, namely (petal-length < 47.50), was produced.
The resultant rules were subsequently compared with the results of the deci-
sion boundary detection algorithm. Table 1 shows the attribute-value tests of the
two rule sets. Using the attribute evaluation approach, four rules with a test set
accuracy of 93.9% were extracted. The accuracy of the individual rules ranged
from 89.9% to 100.0%. Using the decision boundary threshold values obtained
from the sensitivity analysis approach, an improvement of 2.0% on the overall
accuracy was achieved. An improvement of 4.0% was achieved on the least accu-
rate rule. For this set of experiments, the decision boundary detection algorithm
produced an accurate, general set of rules.

4 The Breast Cancer data set

The aim of this section is to illustrate the sensitivity analysis decision boundary
detection algorithm in a noisy domain that contained incorrect values. The breast cancer data set, obtained from the UCI machine learning repository, was used
for this purpose. Originally, the breast cancer database was obtained from Dr.
William H. Wolberg of the University of Wisconsin Hospitals, Madison. The
data set contained 699 tuples and distinguished between benign (noncancerous)
breast diseases and malignant cancer. The data set concerned 458 (65.5%) benign
and 241 (34.5%) malignant cases. In practice, over 80 percent of breast lumps
are proven benign.
Technique              Iris type    Attribute-value test
Attribute evaluation   Setosa       petal-length < 19.50
                                    petal-width < 6.50
                       Virginica    petal-length > 44.50
                                    petal-width > 14.50
                       Versicolor   petal-length < 47.50
                                    petal-width < 17.50
Decision boundaries    Setosa       petal-length < 19.50
                                    petal-width < 6.50
                       Virginica    petal-length > 49.50
                                    petal-width > 16.50
                       Versicolor   petal-length < 49.50
                                    petal-width < 18.50

Table 1. Attribute evaluation versus decision boundaries

The data set contained missing values and the level of noise (incorrect values)
was unknown. There are 10 input attributes, including the redundant sample
code number. The other nine inputs concerned the results obtained from the
tissue samples that were pathologically analyzed.
A 10-10-2 ANN was trained, using sigmoid activation functions with a high
slope to approximate linear threshold functions. The sensitivity analysis pruning
algorithm reduced the ANN to a 3-3-2 network that produced six rules. The
classification test accuracy of this ANN was 95.2%. Next, the attribute-value
test thresholds were determined using the attribute evaluation method and the
sensitivity analysis decision boundary detection algorithm. The rule sets for both
methods were extracted. For the original attribute evaluation method, the rule
set accuracy was 79.6%. The individual rule accuracies ranged from 66.4% to
85.3%. The accuracy of the rule set that was produced after the results of the
sensitivity analysis decision boundary detection algorithm were incorporated was
94.3%, giving an improvement of 14.7%. The individual rule accuracies ranged
from 65.4% to 93.4%. The fidelity of the final rule set is high, since the rule set
accuracy of 94.3% is comparable to that of the original ANN (95.2%).

5 Conclusion

This paper presented an approach to rule extraction where a decision bound-


ary detection algorithm was used to find threshold values for continuous-valued
attributes in attribute-value tests. The decision boundary algorithm uses sensi-
tivity analysis to locate boundaries for each attribute. The sensitivity analysis
approach to detect decision boundaries is computationally feasible, since the


first-order derivatives are already calculated as part of the learning equations.
Results showed a significant improvement in rule accuracies compared to an
attribute evaluation approach to find threshold values.

References

[Baum 1991] EB Baum, Neural Net Algorithms that Learn in Polynomial Time from
Examples and Queries, IEEE Transactions on Neural Networks, 2(1), 1991, pp 5-19.
[Cohn et al 1994] D Cohn, L Atlas, R Ladner, Improving Generalization with Active
Learning, Machine Learning, Vol 15, 1994, pp 201-221.
[Craven et al 1993] MW Craven and JW Shavlik, 1993. Learning Symbolic Rules using
Artificial Neural Networks, Proceedings of the Tenth International Conference on
Machine Learning, Amherst: USA, pp.79-95.
[Engelbrecht et al 1996] AP Engelbrecht, I Cloete, A Sensitivity Analysis Algorithm
for Pruning Feedforward Neural Networks, IEEE International Conference in Neural
Networks, Washington, Vol 2, 1996, pp 1274-1277.
[Engelbrecht et al 1998a] AP Engelbrecht and I Cloete, 1998. Selective Learning us-
ing Sensitivity Analysis, 1998 International Joint Conference on Neural Networks
(IJCNN'98), Alaska: USA, pp.1150-1155.
[Engelbrecht 1998b] AP Engelbrecht, 1998. Sensitivity Analysis of Multilayer Neural
Networks, submitted PhD dissertation, Department of Computer Science, University
of Stellenbosch, Stellenbosch: South Africa.
[Fu 1994] LM Fu, Rule Generation from Neural Networks, IEEE Transactions on Sys-
tems, Man and Cybernetics, Vol 24, No 8, August 1994, pp 1114-1124.
[Hwang et al 1991] J-N Hwang, JJ Choi, S Oh, RJ Marks II, Query-Based Learning
Applied to Partially Trained Multilayer Perceptrons, IEEE Transactions on Neural
Networks, 2(1), January 1991, pp 131-136.
[Sestito et al 1994] S Sestito and TS Dillon, 1994. Automated Knowledge Acquisition,
Prentice-Hall, Sydney: Australia.
[Towell 1994] GG Towell and JW Shavlik, Refining Symbolic Knowledge using Neural
Networks, Machine Learning, Vol. 12, 1994, pp 321-331.
[Viktor et al 1995] HL Viktor, AP Engelbrecht and I Cloete, 1995. Reduction of Sym-
bolic Rules from Artificial Neural Networks using Sensitivity Analysis, IEEE Inter-
national Conference on Neural Networks (ICNN'95), Perth: Australia, pp.1788-1793.
[Viktor et al 1998a] HL Viktor, AP Engelbrecht, I Cloete, Incorporating Rule Extrac-
tion from ANNs into a Cooperative Learning Environment, Neural Networks & their
Applications (NEURAP'98), Marseilles, France, March 1998, pp 386-391.
[Viktor 1998] HL Viktor, 1998. Learning by Cooperation: An Approach to Rule Induc-
tion and Knowledge Fusion, submitted PhD dissertation, Department of Computer
Science, University of Stellenbosch, Stellenbosch: South Africa.
The Role of Dynamic Reconfiguration for Implementing Artificial Neural Networks Models in Programmable Hardware

J.M. Moreno, J. Cabestany, E. Cantó, J. Faura+, J.M. Insenser+

Technical University of Catalunya, Dept. of Electronic Engineering, Advanced Hardware Architectures Group, Building C4, Campus Nord, c/ Gran Capità s/n, 08034 - Barcelona - Spain
moreno@eel.upc.es
+ SIDSA, PTM, Torres Quevedo 1, 28760 - Tres Cantos (Madrid) - Spain
faura@sidsa.es

Abstract. In this paper we address the problems posed when Artificial Neural
Networks models are implemented in programmable digital hardware. Within
this context, we shall especially emphasise the realisation of the arithmetic
operators required by these models, since it constitutes the main constraint (due
to the required amount of resources) found when they are to be translated into
physical hardware. The dynamic reconfiguration properties (i.e., the possibility
to change the functionality of the system in real time) of a new family of
programmable devices called FIPSOC (Field Programmable System On a Chip)
offer an efficient alternative (both in terms of area and speed) for implementing
hardware accelerators. After presenting the data flow associated with a serial
arithmetic unit, we shall show how its dynamic implementation in the FIPSOC
device is able to outperform systems realised in conventional programmable
devices.

1 Introduction

The advances made during the last years in microelectronics fabrication processes have facilitated the advent of new families of FPGA (Field Programmable Gate Arrays) devices with increasing performance (in terms of both capacity, i.e., the number of implementable equivalent gates, and processing speed). This has motivated their popularity in the implementation of complex embedded systems for industrial applications.
Due to their inherent capability of tackling complex, highly non-linear optimisation tasks (like classification, time series prediction, etc.), Artificial Neural Networks models have been incorporated progressively as a functional section of the final system. As a consequence, there have been several approaches, [1], [2], [3], [4], dealing with the digital implementation of different neural models in programmable hardware. However, due to the amount of resources required by the arithmetic operations (especially digital multiplication), these realisations have been limited to small models or alternatively have required many programmable devices.
During the last years the programmable hardware community has evidenced a trend towards the integration of dynamic reconfiguration properties in conventional FPGA architectures [5]. As a consequence, there have already been several proposals, coming from both the academic [6] and the industrial [7], [8], [9] communities. The
term dynamic reconfiguration means the possibility to change, totally or partially, the
functionality of a system using a transparent mechanism, so that the system does not
need to be halted while it is being reconfigured. This feature was not available in
early FPGA devices, whose reconfiguration time is usually several orders of
magnitude larger than the execution delay of the system. In this paper we shall
concentrate our attention on the device presented in [9], which constitutes a new
concept of programmable devices, since it includes a programmable digital section
with dynamic reconfiguration properties, a configurable analog section and a
microcontroller, thus constituting an actual system on a chip. Through a careful use of
the dynamic configuration properties of the programmable digital section we shall
provide efficient arithmetic strategies which could assist in the development of
customisable neural coprocessors for real world applications.
The paper is organised as follows: In the next section we shall briefly explain the main features of the FIPSOC device, paying special attention to those related to its dynamic reconfiguration properties. Then we shall evaluate some efficient arithmetic strategies capable of handling the data flow associated with neural models. Bearing in mind the intrinsic characteristics of the FIPSOC family, we shall then present an efficient serial scheme for implementing digital multipliers, providing throughput estimates obtained from the first physical samples. Finally, the conclusions and future work will be outlined.

2 Architectural Overview of the FIPSOC Device

Figure 1 depicts the global organisation of the FIPSOC device.

Fig. 1. Global organisation of the FIPSOC device.

As it can be seen, the internal architecture of the FIPSOC device is divided into five main sections: the microcontroller, the programmable digital section, the configurable analog part, the internal memory and the interface between the different functional blocks.
Because the initial goal of the FIPSOC family is to target general purpose mixed signal applications, the microcontroller included in the first version of the device is a fully compliant 8051 core, including also some peripherals like a serial port, timers, parallel ports, etc. Apart from running general-purpose user programs, it is in charge of handling the initial setup of the device, as well as the interface and configuration of the remaining sections.
The main function of the analog section is to provide a front-end able to perform
some basic conditioning, pre-processing and acquisition functions on external analog
signals. This section is composed of four major sections: the gain block, the data
conversion block, the comparators block and the reference block. The gain block
consists of twelve differential, fully balanced, programmable gain stages, organised as
four independent channels. Furthermore, it is possible to have access to every input and output of the first amplification stage in two channels. This feature makes it possible to construct additional analog functions, like filters, using external passive components. The comparators block is composed of four comparators, each one at the
output of an amplification channel. Each two comparators share a reference signal
which is the threshold voltage to which the input signal is to be compared. The
reference block is constructed around a resistor divider, providing nine internal
voltage references. Finally, the data conversion block is configurable, so that it is
possible to provide a 10-bit DAC or ADC, two 9-bit DAC/ADCs, four 8-bit
DAC/ADCs, or one 9-bit and two 8-bit DAC/ADCs. Since nearly any internal point
of the analog block can be routed to this data conversion block, the microprocessor
can use the ADC to probe in real time any internal signal by dynamically
reconfiguring the analog routing resources.
Regarding the programmable digital section, it is composed of a two-dimensional
array of programmable cells, called DMCs (Digital Macro Cell). The organisation of
these cells is shown in figure 2.
As it can be deduced from this figure, it is a large-granularity, 4-bit wide programmable cell. The sequential block is composed of four registers, whose functionality can be independently configured as a mux-, E- or D-type flipflop or latch. Furthermore, it is also possible to define the polarity of the clock (rising/falling edge) as well as the set/reset configuration (synchronous/asynchronous). Finally, two main macro modes (counter and shift register) have been provided in order to allow for compact and fast realisations.
The combinational block of the DMC has been implemented by means of four
16xl-bit dual port memory blocks (Look Up Tables - LUTs - in figure 2). These
ports are connected to the microprocessor interface (permitting a flexible management
of the LUTs contents) and to the DMC inputs and outputs (allowing for their use as
either RAM or combinational functions). Furthermore, an adder/subtractor macro
mode has been included in this combinational block, so as to permit the efficient
implementation of arithmetic functions.
A distinguishing feature of this block is that its implementation permits its use
either with a fixed (static mode) or with two independently selectable (dynamic
reconfigurable mode) functionalities. Each 16-bit LUT can be accessed as two
independent 8-bit LUTs. Therefore it is possible to use four different 4-LUTs in static
mode, sharing two inputs every two LUTs, as depicted in figure 2, or four
independent 3-LUTs in each context in dynamic reconfigurable mode. Table 1
summarises the operating modes attainable by the combinational block of the DMC in
static mode and in each context in dynamic reconfigurable mode.
Furthermore, since the operating modes indicated in table 1 are implemented in
two independent 16x2-bit RAMs (8x2-bit RAMs in dynamic reconfigurable mode), it
is possible to combine the functionalities depicted in this table. For instance, it is
possible to configure the combinational block in order to provide one 5-LUT and one
16x2-bit RAM in static mode or two 3-LUTs and one 4-LUT in dynamic
reconfigurable mode.


Fig. 2. Organisation of the basic cell (DMC) in the programmable digital section.

Table 1. Functionalities of the combinational block in static and dynamic reconfigurable modes.

Static mode                             Dynamic reconfigurable mode
- 4 x 4-LUTs (sharing 2 inputs          - 4 x 3-LUTs
  every two LUTs)                       - 2 x 4-LUTs
- 2 x 5-LUTs                            - 1 x 5-LUT
- 1 x 6-LUT                             - 1 x 4-bit adder
- 1 x 4-bit adder                       - 2 x 8x2-bit RAMs
- 2 x 16x2-bit RAMs

The multicontext dynamic reconfiguration properties have been provided also for
the sequential block of the DMC. For this purpose, the data stored in each register has
been physically duplicated. In addition, an extra configuration bit has been provided
in order to include the possibility of saving the contents of the registers when the
context is changed and recover the data when the context becomes active again.
In order to enhance the overall flexibility of the system, an isolation strategy has
been followed when implementing the configuration scheme of the FIPSOC device.
This strategy, depicted in figure 3(a), provides an effective separation between the
actual configuration bit and the mapped memory through an NMOS switch. This
switch can be used to load the information stored in the memory to the configuration
cell, so that the microprocessor can only read and write the mapped memory. This
implementation is said to have one mapped context (the one mapped in the
microprocessor memory space) and one buffered context (the actual configuration
memory which directly drives the configuration signals).
The benefits of this strategy are clear. First, the mapped memory can be used to
store general-purpose user programs or data, once its contents have been transferred
to the configuration cells. Furthermore, the memory cells are safer, since their output
does not drive directly the other side of the configuration bit. Finally, at the expense
of increasing the required silicon area, it is possible to provide more than one mapped
context to be transferred to the buffered context, as depicted in figure 3(b). This is the
actual configuration scheme which has been implemented in the FIPSOC device, and
it permits to change the configuration of the system just by issuing a memory write
command. Furthermore, the programmable hardware has also access to the resources
which implement this context swap process. In this way, it is even possible to change
the actual configuration of the DMCs in just one clock cycle. As it will be explained
in the following sections, this constitutes in fact the basis of the strategy we shall use
to implement efficiently arithmetic operators for artificial neural networks models.
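To make the mapped/buffered separation concrete, here is a minimal software model of one configuration bit. It is an illustrative sketch only: the class, its granularity and its method names are our assumptions, not the FIPSOC silicon or its programming interface.

# Minimal model of the mapped/buffered configuration scheme (assumed names).
class ConfigBit:
    def __init__(self):
        self.mapped = [0, 0]   # two mapped contexts in the memory space
        self.buffered = 0      # the bit that actually drives the hardware

    def write(self, context, value):
        """Ordinary memory write: only the mapped memory is touched."""
        self.mapped[context] = value

    def swap(self, context):
        """Context swap: a load strobe copies one mapped context into the
        buffered context, without halting the configured hardware."""
        self.buffered = self.mapped[context]

bit = ConfigBit()
bit.write(0, 1)
bit.write(1, 0)
bit.swap(0)
assert bit.buffered == 1
bit.swap(1)                    # reconfigure in a single operation
assert bit.buffered == 0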

Fig. 3. (a) Configuration scheme. (b) Multicontext configuration.

In addition to this configuration scheme, an efficient interface between the


microcontroller and the configuration memory has been included in the FIPSOC
device, as depicted in figure 4.
As it can be seen, the microcontroller can select any section in the array of DMCs
(the shaded rectangle depicted in figure 4), and, while the rest of the array is in
normal operation, modify its configuration just by issuing a memory write command.
Therefore, the dynamic configuration strategy included in the FIPSOC device shows
two main properties: it is transparent (i.e., it is not necessary to stop the system while
it is being reconfigured) and time-efficient (since only two memory write cycles are
required to complete the reconfiguration, one to select the logical rectangle of DMCs
to be reconfigured and one to establish the actual configuration for these elements).
Regarding the routing resources, the array of DMCs which constitutes the
programmable digital section of the FIPSOC device is provided with 24 vertical
channels per column and 16 horizontal channels per row. The routing channels are not
identical, and have different lengths and routing patterns. Switch matrices are also
provided to connect vertical and horizontal routing channels. There are also special
nets (two per row and column) which span the whole length or height of the array,
and whose goal is to facilitate the clock distribution.
In the next section we shall first analyse some alternatives which have been
proposed for implementing arithmetic functions in programmable hardware. Then we
shall exploit the intrinsic features of the digital programmable section included in the
FIPSOC device in order to construct fast and compact realisations of digital


multipliers for neural accelerators to be used in real-world applications.

Fig. 4. Microcontroller interface for dynamic reconfiguration.

3 Arithmetic strategies in programmable hardware

Multiplication and addition are among the most common operators found in the data
flow associated with Artificial Neural Networks models. For instance, they are found
in the synaptic function of the neurons constituting a Multilayer Perceptron network
or in the distance calculation process inherent to Learning Vector Quantization (LVQ)
or Radial Basis Function (RBF) models. Since most commercial FPGA devices
include specific hardware macros devoted to the realisation of fast and compact adder
units, addition does not usually represent a serious limitation when a digital
implementation for these neural models is envisioned.
On the contrary, the implementation of a digital multiplier usually requires too
many physical resources or a large latency, thus penalising the performance (in terms
of area and/or execution delay) of the final system. The advent of programmable
devices with dynamic reconfiguration properties has resulted in new strategies for the
physical realisation of multiplier units. In this way, the alternative presented in [10] is
based on what has been termed partial evaluation. The term partial evaluation refers
to the possibility of simplifying certain functions when some operands are fixed. This
is the case of artificial neural networks models during the recall phase, since the
neurons' weights have been already established during the learning phase. For
instance, if we consider the multiplication of two 4-bit numbers, there are 16 8-bit
possible results if one of the operands is fixed. As a consequence, the multiplier could
be implemented in this case by means of 8 4-input LUTs. This is the approach which
was introduced in [10], which is represented in figure 5 for the case of 8-bit numbers.
Figure 5(a) shows how the multiplication of an 8-bit constant (A) by an arbitrary 8-bit number (B) can be constructed as the overlapped addition of two 12-bit numbers (resulting from the partial products A x B1 and A x B2, respectively), both of them obtained from 24 4-LUTs, as indicated in figure 5(b). Since the combinational part of the DMC included in the FIPSOC device allows, in static mode, for the realisation of up to 4 4-LUTs (sharing two inputs every two LUTs) or one 4-bit adder, it can be easily deduced that the multiplier depicted in figure 5 can be implemented using 9 DMCs. The execution delay of this multiplier could be quite low, since it is given by one LUT access (i.e., the time associated with a read cycle in a SRAM) plus the propagation delay of the 12-bit adder.

Fig. 5. (a) Partial evaluation principle (b) Implementation with 4-LUTs of an 8-bit multiplier
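The partial-evaluation scheme can be emulated in software to check the arithmetic; the following sketch is our own illustration (not a synthesis netlist): it precomputes the 16 possible 12-bit partial products for a fixed weight A, each entry standing for the contents of the 4-LUTs of figure 5(b), and reconstructs A x B by the overlapped addition of figure 5(a).

# Software model of the constant-coefficient (partially evaluated) multiplier.
def build_tables(A):
    """The 16 possible 12-bit partial products A * nibble."""
    return [A * nibble for nibble in range(16)]

def lut_multiply(tables, B):
    """A x B as the overlapped addition of two table look-ups."""
    low, high = B & 0xF, (B >> 4) & 0xF
    return tables[low] + (tables[high] << 4)

tables = build_tables(0xA5)                  # weight A fixed after learning
assert lut_multiply(tables, 0x3C) == 0xA5 * 0x3C

Changing a weight amounts to rewriting the 16 table entries, which is exactly the LUT-overwrite cost discussed next.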

Therefore, following the microcontroller-driven dynamic configuration strategy


depicted in figure 4, it could be possible to implement several such multipliers in the
programmable digital section of the device, one for each synaptic connection of a
neuron. Although this alternative is quite attractive in terms of overall system throughput, its main limitation lies in the fact that each time one weight is changed the contents of the 6 DMCs which provide the functionality of the 24 4-LUTs have to be overwritten. Though this process can be done transparently (i.e., one multiplier can be modified while the others are still working), thanks to the microprocessor-driven dynamic reconfiguration depicted in figure 4, it may take a long time (since 16 memory write cycles are required to change the contents of the LUTs in one DMC) in comparison with the execution delay of the multiplier. As a consequence, this strategy may be useful only for implementing low complexity networks or when the flexibility of the system (i.e., the possibility of changing a specific weight just by issuing several memory write commands) dominates over its attainable throughput.
Another alternative for implementing a digital multiplier consists of considering a serial data flow, instead of the parallel scheme used previously. Figure 6 depicts the global structure corresponding to a basic 8 x 8-bit serial multiplier.
As it can be deduced from this figure, the array of AND gates produces in each clock cycle an 8-bit partial product (resulting from the operand B and the corresponding bit of the operand A, a_i, obtained from the serial output of a shift register). This partial product is then added to the current partial result stored in the output shift register, thus producing in each clock cycle a valid bit of the product, p_i. If this structure is to be implemented in the DMCs included in the FIPSOC device (ignoring the 8-bit input registers), we obtain the realisation depicted in figure 7.
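The data flow of figure 6 is easy to model in software. The sketch below is an illustrative bit-serial shift-and-add multiplier (our own model, not the DMC netlist): each clock cycle ANDs B with the current bit a_i, adds the partial product to the accumulator, and shifts one valid product bit out.

def serial_multiply(A, B, n=8):
    """Model of an n x n bit-serial multiplier: one product bit per cycle."""
    acc, product = 0, 0
    for cycle in range(2 * n):
        a_i = (A >> cycle) & 1 if cycle < n else 0
        acc += B * a_i                    # AND array + n-bit adder
        product |= (acc & 1) << cycle     # p_i leaves via the output register
        acc >>= 1                         # shift the partial result
    return product

assert serial_multiply(0xA5, 0x3C) == 0xA5 * 0x3C   # 2n = 16 cycles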
[Figure 6: an input register for B and a shift register for A, an array of AND gates, an 8-bit adder and the output shift register producing p_i serially.]
Fig. 6. Global structure of an 8 x 8-bit serial multiplier

[Figure 7: bits b0-b7 and a_i feeding the 3-LUTs of DMC 1 and DMC 2, with the remaining DMCs accumulating the partial products.]
Fig. 7. Implementation of an 8 x 8-bit serial multiplier in the FIPSOC device.

As it can be deduced from this figure, two DMCs (DMC 1 and 2) of the total four DMCs required are used just for generating the 8-bit partial product to be accumulated each clock cycle. However, we can further optimise this implementation by using the dynamic configuration properties of the FIPSOC device. As it has been explained in the previous section, each configuration bit of the DMCs is attached to two mapped configurations (i.e., there are two mapped contexts for each buffered context). Furthermore, there is an input signal in each DMC which permits to switch between both contexts. Therefore, since the routing structure used for the inputs attached to each DMC is based on a multiplexer (which, in addition to attaching each input to a given routing channel, is able to fix it to a logic level 1 or 0), we can emulate the AND function required to obtain the partial products in the serial multiplier by means of a context swap, which is controlled by the a_i bit. In this way, if the a_i bit equals 1, it selects the context where each input is connected to the corresponding b_i bit of the second operand, while in the case the a_i bit equals 0, it activates the context where all the inputs are tied to ground. This context swap governed by the value of a_i is possible since the state of the registers can be saved when a context swap is produced, and furthermore the 4-bit adder functionality is available in both contexts.
This scheme thus permits the implementation of an 8 x 8-bit digital multiplier using just 2 DMCs. Furthermore, since all the signals (except the carry signals transferred between the DMCs, which have fast dedicated routing channels) are propagated locally (i.e., inside a DMC), the overall execution delay can be kept very small (operation with a clock frequency of 96 MHz has already been qualified for the FIPSOC device), since the propagation delays incurred when traversing routing resources between DMCs are removed.
Finally, since the result produced by the multiplier is obtained serially, it is
possible to combine each multiplier with a shift register and a 1-bit adder in order to
accumulate the results and provide finally the activation potential of the neuron. In
this way, we could construct an array of processing elements organised following a
Broadcast Bus Architecture, as depicted in figure 8.

[Figure 8: processing elements, each combining the serial multiplier with a 20-bit shift register accumulator, sharing a broadcast bus that distributes the inputs x_j.]
Fig. 8. Array of processing elements organised as a Broadcast Bus Architecture.

As it can be seen, there is a global bus shared by all the processing elements (PEs in the figure), which is in charge of providing the inputs (x_j in the figure) to all the neurons, where they will be multiplied by the corresponding synaptic weights (w_k in the figure). An array composed of 12 such units could be mapped in the first device of the FIPSOC family (which includes an array of 8 x 12 DMCs). Therefore, since a maximum clock frequency of 96 MHz could be used, the maximum throughput of the system is 70 MCPS (Millions of Connections Per Second), thus offering an efficient alternative for the implementation of neural accelerators in programmable hardware.

4 Conclusions

In this paper we have addressed the implementation of Artificial Neural Networks


models in programmable hardware, using the dynamic reconfiguration properties
offered by new FPGA families.
After presenting the main features and the global organisation of the FIPSOC (Field Programmable System On a Chip) devices, we have reviewed some strategies for implementing digital multipliers, which are the core of the arithmetic unit used to physically realise neural models. By improving a serial multiplication scheme with the dynamic reconfiguration properties of the FIPSOC devices, we have derived an architecture which provides an efficient solution for the implementation of parallel processing systems in programmable hardware.
Our current efforts are concentrated on the exhaustive qualification of the samples corresponding to the first member of the FIPSOC family, as well as on the implementation and characterisation of the proposed architecture.

Acknowledgements

This work is being carried out under the ESPRIT project 21625 and the Spanish CICYT project TIC-96-2015-CE.

References
1. Cox, C., Blanz, W.E.: GANGLION: A fast field-programmable gate array implementation of a connectionist classifier. IEEE Journal on Solid-State Circuits, Vol. 27, no. 3 (1992) 288-299
2. Beiu, V., Taylor, J.G.: Optimal Mapping of Neural Networks Onto FPGAs - A New Constructive Algorithm. In: Mira, J., Sandoval, F. (eds.): From Natural to Artificial Neural Computation. Lecture Notes in Computer Science, Vol. 930. Springer-Verlag, Berlin Heidelberg New York (1995) 822-829
3. Hartmann, G., Frank, G., Schäfer, M., Wolff, C.: SPIKE 128K - An Accelerator for Dynamic Simulation of Large Pulse-Coded Networks. In: Klar, H., König, A., Ramacher, U. (eds.): Proceedings of the 6th International Conference on Microelectronics for Neural Networks, Evolutionary & Fuzzy Systems. University of Technology Dresden (1997) 130-139
4. Pérez-Uribe, A., Sanchez, E.: FPGA Implementation of Neuronlike Adaptive Elements. In: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (eds.): Artificial Neural Networks - ICANN'97. Lecture Notes in Computer Science, Vol. 1327. Springer-Verlag, Berlin Heidelberg New York (1997) 1247-1252
5. Becker, J., Kirchbaum, A., Renner, F.-M., Glesner, M.: Perspectives of Reconfigurable Computing in Research, Industry and Education. In: Hartenstein, R., Keevallik, A. (eds.): Field-Programmable Logic and Applications. Lecture Notes in Computer Science, Vol. 1482. Springer-Verlag, Berlin Heidelberg New York (1998) 39-48
6. DeHon, A.: Reconfigurable Architectures for General-Purpose Computing. A.I. Technical Report No. 1586. MIT Artificial Intelligence Laboratory (1996)
7. Churcher, S., Kean, T., Wilkie, B.: The XC6200 FastMap Processor Interface. Field Programmable Logic and Applications, Proceedings of FPL'95. Springer-Verlag (1995) 36-43
8. Hesener, A.: Implementing Reconfigurable Datapaths in FPGAs for Adaptive Filter Design. Field Programmable Logic, Proceedings of FPL'96. Springer-Verlag (1996) 220-229
9. Faura, J., Horton, C., Van Duong, P., Madrenas, J., Insenser, J.M.: A Novel Mixed Signal Programmable Device with On-Chip Microprocessor. Proceedings of the IEEE 1997 Custom Integrated Circuits Conference (1997) 103-106
10. Kean, T., New, B., Slous, B.: A Fast Constant Coefficient Multiplier for the XC6200. Field-Programmable Logic. Lecture Notes in Computer Science, Vol. 1142. Springer-Verlag, Berlin Heidelberg New York (1996) 230-236
An Associative Neural Network and Its Special
Purpose Pipeline Architecture in Image Analysis

Ibarra Pico, F.; Cuenca Asensi, S.

Departamento de Tecnología Informática y Computación
Campus de San Vicente
Universidad de Alicante
03080, Alicante, Spain
email: ibarra@dtic.ua.es, sergio@dtic.ua.es

Topics: Computer vision, neural nets, texture recognition, real-time quality control

Abstract.- There are several approaches to texture analysis and classification. Most have limitations in accurate discrimination or in computational complexity. A first phase is the extraction of texture features, which are then classified. Texture features should have the following properties: be invariant under the transformations of translation, rotation, and scaling; have a good discriminating power; and take the non-stationary nature of texture into account. In our approach we use Orthogonal Associative Neural Networks for texture identification, both in the feature extraction and in the classification phase (where its energy function is minimized). Due to its low computational cost and regular computational structure, the implementation of a real-time texture classifier based on this algorithm is feasible. There are several platforms to implement Artificial Neural Networks (VLSI chips, PC accelerator cards, multiboard computers, ...). The choice depends on the type of neural model, the application, the response time, the storage capacity, the type of communications, and so on. In this paper we present a pipeline architecture, where precision, cost and speed are optimally traded off. In addition we propose CPLD (Complex Programmable Logic Device) chips for the complete realization of the system. CPLD chips have a reasonable density and performance at low cost.

1. Introduction

Texture segmentation is one of the most important tasks in the analysis of texture images [1]. It is at this stage that different texture regions within an image are isolated for subsequent processing, such as texture recognition. The major problem of texture analysis is the extraction of texture features. Texture features should have the following properties: be invariant under the transformations of translation, rotation, and scaling; have a good discriminating power; and take the non-stationary nature of texture into account. There are two basic approaches for the extraction of texture features: structural and statistical [2]. The structural approach assumes the texture is characterized by some primitives following a placement rule. In this view, to describe a texture one needs to describe both the primitives and the placement rule. This approach is restricted by complications encountered in determining the primitives and the placement rules that operate on these primitives. Therefore, textures suitable for structural analysis have been confined to quite regular textures rather than the more natural textures found in practice. In the statistical approach, texture is regarded as a sample from a probability distribution on the image space and defined by a stochastic model or characterized by a set of statistical features. The most common features used in practice are based on the pattern properties. They are measured from first and second order statistics and have been used as discriminators between textures.
For real-time image analysis, for example in the detection of defects in textile fabric, the complexity of calculations has to be reduced in order to limit the system costs [3]. Additionally, algorithms which are suitable for migration into hardware have to be chosen. Both the extraction method of texture features and the classification algorithm must satisfy these two conditions. Moreover, the extraction method of texture features should have the following properties: be invariant under the transformations of translation, rotation, and scaling; have a good discriminating power; and take the non-stationary nature of texture into account. We choose the Morphologic Coefficient [8] as a feature extractor that is adequate for implementation by associative memories and dedicated hardware.
On the other hand, the classification algorithm should be able to store all of the patterns, have a high correct classification rate and a real time response. There are many models of classifier based on artificial neural networks. Hopfield [11] and [12] introduced a first model of one-layer autoassociative memory. The Bi-directional Associative Memory (BAM) was proposed by Kosko [14] and generalizes the model to be bidirectional and heteroassociative. The BAMs have storage capacity problems [17]. Several improvements have been proposed (Adaptive Bidirectional Associative Memories [15], multiple training [17] and [18], guaranteed recall, and more). One-step models without iteration have been developed too (Orthonormalized Associative Memories [9] and Hao's associative memory [10], which uses a hidden layer). In this paper, we propose a new model of associative memory which can be used in bidirectional or one-step mode.
Artificial neural networks need a high number of computations and data interchanges [5]. So, parallel and high integration techniques (multiprocessors, array processors, superscalar chips, pipelining, VLSI chips, ...) have been used for their implementation. Neural models come in many ways and flavors; implementations include analog, digital and hybrids. However, in some cases, when we are looking for an adequate platform to map a neural model and its application, we choose the one most suitable for both. In our case, we use Complex Programmable Logic Device (CPLD) chips to implement a small associative memory, and we use it for texture characterization and classification. These CPLD devices combine gate-array flexibility and desktop programmability, so we can design a circuit, test and probe it quickly (avoiding fabrication cycle times). On the other hand, a CPLD only has some thousands of gates, so its use is only adequate for specific neural models and applications.

2. Feature Extraction for Texture Analysis

The Hausdorff Dimension (HD) was first proposed in 1919 by the mathematician Hausdorff and has been used mainly in fractal studies [4]. One of the most attractive features of this measure when analyzing images is its invariance under isometric transformations. We will use the HD when extracting features. Given an image I belonging to R² and S a set of points in that image, that is, S ⊂ I, the HD of that set is defined as follows.
The HD is invariant to isometric and similar transforms of the image. This property makes it appropriate for object recognition. It is difficult to calculate the dimension from the definition. Because of that, some alternative methods have been proposed, like mass dimension, box dimension, etc. Our proposal is to prove that the calculation of the HD is an NP-complete problem, and to propose a heuristic based on neural networks that allows its computation.
One of the HD main problems is its difficult computation from its definition, and that is why, in general, approximative box-counting methods are used. Now we will see how the HD is equivalent to the calculation of a semicover and we will use this as a result for its computation.
Definition I. A packing or semicover of a set S is a collection of sets sm(S) = {Ai}, i = 1..n, verifying that Ai ∩ Aj = ∅ ∀ i ≠ j and ∪ Ai ⊆ S.

Definition VI. We call a δ-semicover of a set S (δ-sm(S)) a semicover of S formed by a finite collection of sets {Ai} having a diameter of δ or less.

Theorem I. The Hausdorff dimension of a set S (Dh(S)) can be calculated from its semicovers (δ-sm(S)) with the following expression:

Dh(S) = inf { h : sm_h(S) = 0 }

with

sm_h(S) = lim_{δ→0} smδ(S)_h

and

smδ(S)_h = inf { Σ_i |Ai|^h : Ai ∈ δ-sm(S) }

Proof. Definitions II, III and IV express the HD from a δ-cover of the set S. We only have to consider that in the limit (δ→0) it follows that δ-cover(S) = δ-sm(S).
Theorem I allows us to express the calculation of the HD as a semicover calculation problem. This implies that its computation with semicovers inherits the invariant properties of the dimension. And inversely, the characterization of semicovers as an NP-complete problem allows us to estimate the complexity of evaluating the HD.
We can approximate the HD by semicovers, so we define the morphologic coefficient, which can be used for feature extraction. We call the morphologic coefficient of the semicover of a set S over a morphologic element Ai of diameter δ = |Ai| the quantity

CM(S) = lim_{δ→0} log(δ-sm(S)) / (-log δ)

The morphologic coefficient of the semicover converges to the dimension entropy when the diameter of the morphologic element tends to zero, and therefore it can be a good estimate of the HD [8]. In practice, the entropy calculation is made for some discrete values of δ (1, 2, 3, ..., D) instead of calculating the limit. From these values an estimation of the morphologic coefficient is established by different heuristics, such as

CM(S) ≈ ( Σ_{i=1}^{D} log(δi-sm(S)) ) / ( Σ_{i=1}^{D} log δi )
It is at the level of discretization where the goodness of the semicover method can be seen in comparison with set cover approaches (box-dimension). For discrete values of δ, the δ-semicover is much more restrictive than the δ-cover of the set S, and this allows us to capture its topological characteristics much better [8]. Therefore, in practice it allows a better feature extraction. The δ-semicover offers output patterns with more Hamming distance than the δ-cover and therefore allows a better classification process.
Characterization of the texture
In order to extract the invariant characteristics of an image, we divide it into several planes according to the intensity level of each point. Then we define the multidimensional morphologic coefficient as the vector formed by the CM of each one of these planes, and we characterize the texture with its CM vector:
r = [CM1, CM2, ..., CMp] ; p = number of planes into which the image is partitioned
The CM vectors of the patterns will be employed in the learning process of the classifier that is described below.
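A hedged software model of the per-plane feature follows (our own illustration of the 2x2 mask count that the hardware of Section 4 accumulates; names, thresholds and image are assumptions). A strict δ-semicover would use disjoint windows; the sliding window here matches the vertical/horizontal mask chain described later.

import numpy as np

def plane_mask_count(img, lo, hi, mask=2):
    """Count the (mask x mask) windows lying entirely inside the intensity
    interval [lo, hi): the per-plane count accumulated by the CM unit."""
    inside = (img >= lo) & (img < hi)
    h, w = inside.shape
    return sum(inside[i:i + mask, j:j + mask].all()
               for i in range(h - mask + 1)
               for j in range(w - mask + 1))

def cm_vector(img, thresholds):
    """CM feature vector r: one component per intensity plane (four planes
    from five thresholds, as in the TAS front end)."""
    return np.array([plane_mask_count(img, lo, hi)
                     for lo, hi in zip(thresholds, thresholds[1:])])

img = np.random.default_rng(1).integers(0, 256, size=(32, 32))
print(cm_vector(img, [0, 64, 128, 192, 256]))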

3. Associative Orthogonal Memory (MAO)

In this paper, we propose a new model of associative memory which can be used in bidirectional or one-step mode. This model uses a hidden layer, proper filters and orthogonality to increase the storage capacity and reduce the noise effect of linear dependencies between patterns. Our model, which we call Bidirectional Associative Orthogonal Memory (MAO), goes beyond the BAM capacity. The BAM and MAON models are particular cases of it.

3.1 Topology and Learning Process


Let there be a set of q pairs of patterns (ai, bi) from the vectorial spaces R^n and R^m. We build two learning matrices as shown below:
A = [aij] and B = [bik] for i ∈ {1,..,q}, j ∈ {1,..,n}, k ∈ {1,..,m}
The MAO memory is built as a neural network with two synaptic matrices (Hebbian correlations) W and V, which are computed as W = A·Q^t and V = Q·B^t, where Q is an intermediate orthogonal matrix (Walsh, Householder, and so on) of dimensions qxq. The qi vectors of Q are an orthogonal base of the vectorial space R^q. This characteristic of the qi vectors is very important to make accurate associations, even under noise conditions [16].

3.2 Recalling Process and Basic Filters


The associations between patterns can be made in one-step (forward or backward) or in bi-directional mode. One-step recall:
- Let ai be the input pattern; the output bi is bi = f1[f2(ai^t · W) · V] = F(ai)
- Let bi be the input pattern; the output ai is ai = f1[f2(bi^t · V^t) · W^t] = F^(-1)(bi)
In bi-directional mode, the patterns are fed forward and backward (feedback) into the MAO in a similar style to the BAM, while the energy falls to a minimum of its energy surface. The process continues until a maximum number of iterations or a desired convergence grade is reached. In the input and output layers, the net uses the classical bipolar filter f1 (patterns are coded in bipolar mode). In the hidden layer, the MAO computes the filter f2 (where q1 and q2 are the two possible values of the Q components).
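As a numeric sketch of the MAO construction (our own illustration, with a Hadamard matrix as the orthogonal Q and, so that the final check is exact, mutually orthogonal bipolar inputs; patterns are stored as matrix rows, so the paper's W = AQ^t and V = QB^t appear transposed):

import numpy as np

def hadamard(k):
    """Sylvester-construction Hadamard (orthogonal +/-1) matrix, k a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < k:
        H = np.block([[H, H], [H, -H]])
    return H

q, n, m = 4, 16, 8
rng = np.random.default_rng(0)
A = hadamard(n)[:q]                        # q orthogonal bipolar input patterns
B = rng.choice([-1.0, 1.0], size=(q, m))   # associated bipolar output patterns

Q = hadamard(q)                            # hidden orthogonal codes q_i (rows)
W = A.T @ Q                                # row-stored analogue of W = A Q^t
V = Q.T @ B                                # row-stored analogue of V = Q B^t

def recall(a, f1=np.sign, f2=np.sign):     # b = f1[ f2(a^t W) V ]
    return f1(f2(a @ W) @ V)

assert all(np.array_equal(recall(A[i]), B[i]) for i in range(q))

With non-orthogonal bipolar patterns the recall becomes approximate, which is where the thresholding filter f2 over the hidden code earns its keep.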

However, this is a particular representation of a more general model [8]. So, when we use this neural network as a classifier, the particular values of the Q matrix and the filters are different. For example, the filter in the hidden layer will select the maximum response in forward classification mode.

4. Pipelined Architecture for Real-Time Texture Analysis

Usually the quality of the fabric is controlled by visual inspection by humans. In order to substitute this work by automatic visual inspection, fast image processing systems are required [3]. If we consider that the fabric is processed at 100m-180m per minute, then 5-15MB of image data have to be processed per second. We propose a pipelined architecture that carries out this job in real-time, and suggest CPLD or FPGA chips to implement it due to their adequate cost/performance ratio.
Figure 1 shows the block diagram of the texture analysis system (TAS). The TAS inputs are the eight bits of the pixel and the five thresholds that determine the intervals of each one of the four planes into which we divide the image to analyze. These thresholds are predefined or may be programmed depending on the application.
TAS is divided into two modules: the Analysis module and the Classification module. The first module performs the feature extraction of the image using the CM. The second module performs the classification of the texture using the MAON algorithm. The Classification module finds the minimum distance pattern to the texture. This is equivalent to maximizing the expression [8]
2·(CMx · CMi) - ||CMi||²
where CMx is the CM vector of the texture to classify and ||CMi||² is the squared module of the CM vector of the i-th pattern. So in the learning mode, the patterns (ri) and their squared modules (||CMi||²) must be stored in the Classification module. And in the recognition mode the Classification module has to calculate the dot product and the subtraction.
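As a sketch of this recognition data path (illustrative values and names, not the hardware itself), the stored squared modules let the nearest pattern be found with one dot product and one subtraction per candidate:

import numpy as np

# Store mode: pattern CM vectors and their squared modules ||CM_i||^2.
patterns = np.array([[120., 80., 40., 10.],
                     [ 90., 95., 60., 30.]])    # illustrative CM vectors r_i
sq_mod = (patterns ** 2).sum(axis=1)            # precomputed at learning time

def classify(cm_x):
    """Index of the nearest stored pattern. Maximising
    2 (CM_x . CM_i) - ||CM_i||^2 is equivalent to minimising
    ||CM_x - CM_i||^2, since ||CM_x||^2 is common to all candidates."""
    scores = 2.0 * (patterns @ cm_x) - sq_mod
    return int(np.argmax(scores))

print(classify(np.array([118., 82., 41., 12.])))   # -> 0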

The CM unit
The CM unit is designed to calculate the 4-dimensional CM (four-plane partitioning) using a 2x2 pixel morphologic element (also named mask). The CM calculations for each of the four planes are performed in parallel, so four circuits like the one shown in Figure 3 are necessary. Notice that the column image data is fed to the CM unit serially via an 8-bit bus, and the filter module (Figure 3) produces a "1" if the pixel intensity level belongs to the interval [Thmin, Thmax].

[Figure 1: block diagram of the TAS; thresholds from the host and the pixel stream feed the Analysis module (4xCM unit), followed by the Classification module (Dot Product and Classifier units).]

Fig. 1. Block diagram of the TAS
The 2-bit shift register and the first AND gate perform the vertical mask, producing a "1" if two consecutive pixels (in the same column) belong to the interval. The results of each column must be stored for comparison with the two next columns. The FIFO array and the second AND gate (Figure 3) perform this job (horizontal mask). Notice that to use an n x n pixel mask we only need n-input AND gates and n chained FIFOs in the FIFO array.

[Figure 2: datapath of the CM unit; threshold comparators (Thmin/Thmax), 2-bit shift register, mask AND gates and the EndWindow-gated mask counter, clocked by clk1.]

Fig. 2. CM unit datapath
Let the image have the size of m x n . To provide the data for the horizontal mask, m+l
filtered pixels values (lbit) must be stored. Figure 3 shows also the FIFO array. Every
time the horizontal mask produces a "1" the mask count is incrementing. When the last
pixel has been processed the CM is stored in the counter.
The CM unit is pipelined in four stages:
First stage: read the pixel and filter.
Second stage: shift and vertical mask.
Third stage: shift into the FIFO array and horizontal mask.
Fourth stage: increment the mask counter.
Each pixel is processed in one clock cycle (clk1). The number of cycles needed to compute the CM of an m×n frame is [4 + (m×n) − 1].
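The following sketch models in software what one plane of the CM unit computes: the filter, vertical-mask, horizontal-mask and counter stages collapse into counting the 2×2 windows whose four pixels all belong to the plane's interval (the disjoint-interval convention for the five thresholds is an assumption, based on Table 1):

```python
def cm_plane(image, th_min, th_max):
    """Count the 2x2 windows whose four pixels all fall in
    [th_min, th_max]: filter -> vertical mask -> horizontal mask -> counter."""
    m, n = len(image), len(image[0])
    f = [[1 if th_min <= image[i][j] <= th_max else 0 for j in range(n)]
         for i in range(m)]                       # filter stage
    cm = 0
    for i in range(m - 1):                        # vertical mask (rows i, i+1)
        for j in range(n - 1):                    # horizontal mask (cols j, j+1)
            if f[i][j] and f[i + 1][j] and f[i][j + 1] and f[i + 1][j + 1]:
                cm += 1                           # mask counter increment
    return cm

def cm_vector(image, th):
    """4-dimensional CM; th holds the five thresholds defining the four
    intervals [th[0], th[1]-1], ..., [th[3], th[4]] (assumed disjoint)."""
    bounds = [(th[k], th[k + 1] - 1) for k in range(3)] + [(th[3], th[4])]
    return [cm_plane(image, lo, hi) for lo, hi in bounds]

img = [[12, 40, 41, 200], [13, 42, 43, 201],
       [90, 91, 150, 202], [92, 93, 151, 203]]
print(cm_vector(img, [1, 33, 65, 129, 256]))      # -> [0, 1, 1, 1]
```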
The Dot Product unit
This unit performs the dot product (CM_x · CM_i). The patterns are stored in four 256×8 RAM modules. The memories are organized so that the first components of all CM vectors are in module RAM0, the second components in module RAM1, and so on. Thus, when the address is "00000000", the RAM modules output the four components of the first pattern. The dot product is then computed in parallel with the four multipliers and the partial adders.

The unit can work in two modes: store mode and recognition mode. In the first, the CM vectors of the patterns are stored in the RAM memories (RAM0..RAM3), the squared moduli (||CM_i||²) in the RAMmod (256×18), and the number of patterns in the address counter. The data come from the host computer or may be calculated by the system itself. The second mode is the normal operation mode.

Fig. 3. Filter and FIFO array.

Like the previous unit, the Dot Product unit is also pipelined; the number of clock cycles (clk2) needed to perform one dot product is 3:
First stage: address generation and memory access.
Second stage: products of the individual components.
Third stage: partial adders and, in parallel, RAMmod access and 2's complement.

The address generator is an 8-bit down counter. The address is also used as the index of the product, so the dot product and its index travel together to the Classifier unit.
Classifier unit
This unit produces the index of the pattern most similar to the texture (a bit-level sketch is given after the stage list below). First, the dot product is doubled simply by appending a zero at the LSB, producing a 19-bit number; then the sign bit is added (a zero appended at the MSB), giving a 20-bit number. In the same way, two bits (set to one) are added to ||CM_i||² to convert it into a 20-bit negative number. The unit then performs the addition, and the result is inverse 2's-complemented if its sign is negative. Finally, it is compared with the previous data, and the larger value, together with its index, is stored in the auxiliary registers. When all comparisons are completed, the most similar pattern and its index are held in the auxiliary registers.
The stages of the Classifier unit are:
First stage: adder and inverse 2's complement.
Second stage: magnitude and sign comparison.
Third stage: store the winner.
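Here is a bit-level sketch of this score computation; the 18-bit dot-product and squared-modulus widths are inferred from the 256×18 RAMmod and the 20-bit adder described above, so treat the exact widths as assumptions:

```python
def classifier_score(dot_product, sq_modulus):
    """Signed score 2*dot - ||CM_i||^2, computed with 20-bit arithmetic
    as in the Classifier unit description (a sketch, not the netlist)."""
    doubled = (dot_product << 1) & 0x7FFFF   # zero appended at LSB -> 19 bits
    positive = doubled & 0xFFFFF             # zero sign bit at MSB -> 20 bits
    negative = (-sq_modulus) & 0xFFFFF       # 20-bit negative (top bits = 1)
    total = (positive + negative) & 0xFFFFF  # 20-bit adder
    if total & 0x80000:                      # negative result:
        return -((-total) & 0xFFFFF)         # inverse 2's complement, signed
    return total

print(classifier_score(100, 150))            # -> 50   (2*100 - 150)
print(classifier_score(50, 150))             # -> -50  (2*50 - 150)
```

The comparison stage then simply keeps the running maximum of these scores together with its index.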

Fig. 4. Classifier unit.

Let p be the number of previously stored patterns; the total number of cycles (clk2) of the recognition process (Dot Product unit + Classifier unit) is then [6 + p − 1]. Therefore, the total latency of texture recognition (analysis + classification) for an m×n pixel window using p texture patterns is:
RLat = [4 + (m×n) − 1] clk1 + [6 + p − 1] clk2
The TAS can be considered as a two-stage pipeline (feature extraction and recognition), so the processing of different windows is overlapped. The total latency of the texture recognition of k windows is:
Tk = [2 + k − 1] clk ;  clk = max{[4 + (m×n) − 1] clk1, [6 + p − 1] clk2}
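As a quick numeric check of these formulas, using the clk1 = 30 ns and clk2 = 100 ns periods proposed in the VLSI Design subsection below:

```python
def recognition_latency(m, n, p, clk1_ns=30.0, clk2_ns=100.0):
    """Single-window latency RLat and the duration of the slower of the
    two pipeline stages (feature extraction vs. recognition)."""
    analysis = (4 + m * n - 1) * clk1_ns          # CM unit cycles * clk1
    classification = (6 + p - 1) * clk2_ns        # Dot Product + Classifier
    return analysis + classification, max(analysis, classification)

# 70x70 window, 400 stored patterns:
r_lat, stage = recognition_latency(70, 70, 400)
print(r_lat / 1e3, "us per window;", stage / 1e3, "us per pipeline stage")
# -> ~187.6 us per window; ~147.1 us per pipeline stage
```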

5. Performance of the Algorithm

To test the texture analysis algorithm (feature extraction and classifier), we considered the problem of defect detection in textile fabric. Real-world 512×512 images (jeans texture) with and without defects (Figures 5a and 5b) were employed in the learning process of the MAO classifier. We considered windows of 70×70 pixels with 256 gray levels, and the parameters of the algorithm were adjusted to obtain high precision and low response time. They are shown in Tables 1 and 2.

(a) Some images of jeans textile fabric defects before classification. (b) Window detection of defects.
Fig. 5. Image target analysis with defects.
Among the different possibilities (image partition, diameter of the morphologic element, etc.), we consider three basic configurations especially interesting: C-I, 8-plane partitioning with 32 gray levels in each plane; C-II, 4-plane partitioning with 64 gray levels in each plane; and C-III, problem-oriented partitioning. In all cases the size of the morphologic element was 2×2 (δ = 2√2). The classification criteria were the Euclidean distance (f1 filter) and the maximum response (f2 filter). The number of patterns was 400 (hence 400 neurons in the MAO) and the recognition mode was non-iterative.
In addition, we compared the algorithm with two methods: Laws masks [7] and Center-Symmetric AutoCorrelation (SAC) [6]. The best results were found for 40×40 (Laws) and 64×64 (SAC) window sizes. The implementation was made as a C program. Different images were employed in the test process and in the learning process; in both cases there were 1,200 images with defects and 1,000 without defects. The results (Table 2) show that in all cases our algorithm is two orders of magnitude faster than the others. In addition, the hit rate is close to 90% for the recognition of textures both with and without defects (notice that for C-III, the ad-hoc partitioning, it is over 95%). The conclusion is that it is feasible to implement a real-time system with a high precision level based on our algorithm; the corresponding architectural proposal was presented in Section 4.

Image partitioning:
  C-I: [1,32] [33,64] [65,96] [97,128] [129,160] [161,192] [193,224] [225,256]
  C-II: [1,64] [65,128] [129,192] [193,256]
  C-III: [1-80] [81-120] [121-180] [181-210] [211-220] [221-256]
Learning: adaptive (q_initial = 50 patterns and q_final < 401 patterns)
Operation mode: non-iterative
Distance: Euclidean
f1 filter: Euclidean
f2 filter: maximum response determination

Table 1. Experiment and method parameters

Algorithm  pixels  hit rate without defects  hit rate with defects  response time
C-I    70x70  92.23%  87.14%  0.081 sec
C-II   70x70  96.12%  93.32%  0.055 sec
C-III  70x70  97.81%  94.42%  0.070 sec
Laws   40x40  93.71%  64.69%  1.5 sec
SAC    64x64  95.12%  84.34%  1.1 sec
Table 2. Simulation results

VLSI Design
The gate-level structure of the architecture has been simulated (using a VHDL timing simulator) in order to verify the functionality of the TAS. The parameter values were chosen to match those used during the algorithm testing. The results show that there is no performance degradation.
We propose CPLD or FPGA chips for the implementation. Using these technologies it is possible to achieve moderate processing frequencies at a low cost. For example, there is no problem in employing clk1 = 30 ns (33 MHz) and clk2 = 100 ns (10 MHz); in this case the time to process 5-15 MB of image data is 0.15-0.46 sec. These results comply fully with the specifications.

6. Conclusion

A real-time system for texture analysis has been successfully applied to solve the problem of defect segmentation in textile fabric. The system combines a statistical method for feature extraction with a neural classifier.

The method for the extraction of texture features is based on the Hausdorff dimension; its most important properties are that it is easy to compute and that it is invariant under geometrical mappings such as rotation, translation and scaling.
An Associative Neural Model is used as the classifier. In this extension, the neurons have an output value that is updated at the same time as the neuron weights. From this output value we can easily calculate the distance between the neuron and the cluster and obtain the probability that a neuron belongs to a cluster, that is, the probability that the system works well.
The system works in real time and produces a correct-classification rate of about 96.44%. This suggests that a system based on a CCD camera inspecting textile fabric with our pipeline proposal is a good and low-cost tool for texture analysis in the textile industry.

7. References

[1] N.R. Pal and S.K. Pal, A review on image segmentation techniques, Pattern Recognition, Vol. 26, No. 9, pp. 1277-1294, 1993.
[2] R.M. Haralick, Statistical and structural approaches to texture, Proc. IEEE, Vol. 67, pp. 786-804, 1979.
[3] C. Neubauer, Segmentation of defects in textile fabric, Proc. IEEE, pp. 688-691, 1992.
[4] S.G. Hoggar, Mathematics for Computer Science, Cambridge University Press, 1993.
[5] J.M. Zurada, Introduction to Artificial Neural Systems, West Publishing Company, 1992.
[6] D. Harwood et al., Texture Classification by Center-Symmetric Auto-Correlation, using Kullback Discrimination of Distributions, Pattern Recognition Letters, Vol. 16, pp. 1-10, 1995.
[7] K.I. Laws, Texture Image Segmentation, Ph.D. Thesis, University of Southern California, January 1980.
[8] F. Ibarra Picó, Análisis de Texturas mediante Coeficiente Morfológico. Modelado Conexionista Aplicado, Ph.D. Thesis, Universidad de Alicante, 1995.
[9] J.M. García-Chamizo, A. Crespo-Llorente (1992) "Orthonormalized Associative Memories", Proceedings of the IJCNN, Baltimore, Vol. 1, pp. 476-481.
[10] J. Hao, J. Vandewalle (1992) "A new model of neural associative memory", Proceedings of the IJCNN92, Vol. 2, pp. 166-171.
[11] J.J. Hopfield (1984a) "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences, Vol. 79, pp. 2554-2558.
[12] J.J. Hopfield (1984b) "Neural networks with graded response have collective computational properties like those of two-state neurons", Proceedings of the National Academy of Sciences, Vol. 81, pp. 3088-3092.
[13] F. Ibarra-Picó, J.M. García-Chamizo (1993) "Bidirectional Associative Orthonormalized Memories", Actas AEPIA, Vol. 1, pp. 20-30.
[14] B. Kosko (1988a) "Bidirectional Associative Memories", IEEE Trans. on Systems, Man & Cybernetics, Vol. 18.
[15] B. Kosko (1988b) "Competitive adaptive bidirectional associative memories", Proceedings of the IEEE First International Conference on Neural Networks, eds. M. Caudill and C. Butler, Vol. 2, pp. 759-766.
[16] Y.-H. Pao (1989) "Adaptive Pattern Recognition and Neural Networks", Addison-Wesley Publishing Company, Inc., pp. 144-148.
[17] Wang, Cruz F.J., Mulligan (1990a) "On Multiple Training for Bidirectional Associative Memory", IEEE Trans. on Neural Networks, 1(5), pp. 275-276.
[18] Wang, Cruz F.J., Mulligan (1990b) "Two Coding Strategies for Bidirectional Associative Memory", IEEE Trans. on Neural Networks, pp. 81-92.
Effects of Global Perturbations on Learning Capability in a CMOS Analogue Implementation of Synchronous Boltzmann Machine

Kurosh Madani, Ghislain de Tremiolles

Division Réseaux Neuronaux
LERISS - Université PARIS XII, I.U.T. DE SENART
Avenue Pierre POINT
F-77127 LIEUSAINT - FRANCE
E-mail: madani@univ-paris12.fr

Abstract: All published implementations of Artificial Neural Networks (A.N.N.) have been supposed to work in ideal conditions; however, real applications are subject to local and global perturbations. Since 1994, we have investigated the behaviour modelling of electronic A.N.N. under global perturbation conditions. We have scrutinised the behaviour of a CMOS analogue implementation of the synchronous Boltzmann Machine model under both ambient temperature and electrical perturbation. In this paper we present, using our model, the analysis of the effects of these global perturbations on the learning capability of the above-mentioned CMOS-based analogue implementation. Simulation and experimental results are exposed, validating our concepts.

Key Words: Global perturbations; Neural network; Learning capability; Modelling; Experimental validation.

1. INTRODUCTION

A very large number of works concerning the area of Artificial Neural Networks (A.N.N.) deal with the implementation of these models. Especially digital and analogue implementations of A.N.N. as CMOS integrated circuits show several attractive features [1]. During the last two decades, numerous papers have shown that small-size analogue A.N.N. operate correctly [2] [3]. However, today, efforts are focused on real-size industrial applications of A.N.N., which may require large networks [4] [5] [6].

On the other hand, all implementations (analogue or mixed digital/analogue) of A.N.N. have been supposed to work in ideal conditions (without perturbations): the natural analogy between biological systems and implementations of such models has led to the supposition that these implementations are as robust as biological systems. However, reliability and robustness are among the key points for the success of A.N.N.-based approaches and their effective application in the industrial world, especially when a hardware implementation of such models is needed. Of course, numerous research works have pointed out the tolerance and the robustness of A.N.N. models when the perturbation is a local one: by "local perturbation" we identify the case where one or a few neurones of the network are faulty [7] but the majority of them operate correctly. Even if these studies show some structural robustness of A.N.N. models compared to classical computing systems, several points concerning such studies remain unrealistic. One of these points concerns the fact that all of these studies are based on the inhibition of a relatively small number of neurones (or synapses) of the network (the other neurones or synapses of the neural network are supposed to operate perfectly). In the real world, a system (and so, a neural network) operates by interacting in a global manner with its environment: thus, it will be subject to some global perturbation. A global perturbation means that a large number (or all) of the system's modules (neurones or synapses) operate out of their nominal (or correct) mode. Another point is related to the fact that these works do not take into account any physical parameter of the environment in which the neural network will operate: only the mathematical structure (in a large number of cases, a graph-theory-based analysis) is considered. For example, in the case of a thermal perturbation, all neurones of the neural network will be influenced by some temperature gradient. In such a case, a global perturbation will affect each unit (neurone) of the system: all neurones work, but they do not operate correctly. Even if the natural tendency is to consider that the redundancy of operation units and the distributed nature of the information encoded in the synapses will lead to some system robustness, it is essential to evaluate the impact of such global perturbations on the system's operation capabilities (learning capability, synaptic activity, etc.). Unfortunately, very few works have been interested in the behaviour modelling and analysis of analogue implementations of neural networks or in their limitations [8] [9] [10] [11] [12].

Since 1994, we have investigated the behaviour modelling of electronic A.N.N. under global perturbation conditions. We have scrutinised the behaviour of a CMOS analogue implementation of the synchronous Boltzmann Machine model [14] [15] [16] under both thermal and electrical (supply voltage) perturbations [12] [17] [21] [22]. The reason for our interest in the Boltzmann Machine results, on the one hand, from the availability of an electronic implementation of the synchronous model [15] [16] and, on the other hand, from the availability of a learning example with experimental results given by [16], useful to confirm our investigations. Section 2 of the present paper introduces the synchronous Boltzmann Machine model and its analogue implementation. In Section 3, we expose our approach to model the dependence of the analogue implementation presented in Section 2 on physical parameters: simulation and experimental results relative to temperature and electrical perturbation effects are reported. We also discuss the possibility of compensating the mentioned global perturbation effects. In Section 4 of the paper we analyse, using our model, the influence of such global perturbations on the learning capabilities of the neural network: the learning speed (convergence speed) and the synaptic activity have been studied. Finally, in Section 5 we give the conclusion and the perspectives of the present work.

2. ABOUT SYNCHRONOUS BOLTZMANN MACHINE AND ITS ANALOGUE IMPLEMENTATION

In the Synchronous Boltzmann Machine model [14], by opposition to the asynchronous model [13], neurones update their states simultaneously. Let u_i be the i-th neurone of the network, x_i^n be the state of u_i at instant n (which may take the values 1 or 0), and w_ij the weight between neurones u_i and u_j, with n ∈ {1, 2, ..., N−1, N}. V_i^n, the action potential of u_i after instant n, is computed according to (1). Then the state of the neurone u_i at discrete time step (n+1) is tossed at random with the probability given by relation (2).
V_i^n = Σ_{j≠i} w_ij x_j^n    (1)

P(x_i^{n+1} = 1) = 1 / (1 + exp(−V_i^n / T))    (2)

T is a positive control parameter, also called the "Boltzmann parameter"; it is analogous to the absolute temperature of Boltzmann's model in statistical physics. The weight update process is repeatedly performed for all the pattern associations and, for each of them, it consists of a clamped and a free phase. During the clamped phase, a pattern is imposed on both the input and output neurones while the hidden neurones are left free, and the co-occurrence p_ij^+ is computed, whereas during the free phase, the input pattern is presented to the input neurones while the output and hidden neurones are left free, and the co-occurrence p_ij^− is computed. The weights are updated according to the gradient rule given by relation (3) (where η is a positive parameter and T is the control parameter mentioned previously). One can remark that the Boltzmann parameter (T) appears as a key parameter of this neural model.

Δw_ij = (η / T) (p_ij^+ − p_ij^−)    (3)
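As an illustration of relations (1)-(3), here is a minimal software sketch of one synchronous update step and of the weight update rule; the clamped/free-phase estimation of the co-occurrences p_ij^+ and p_ij^− is not reproduced, and all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def synchronous_step(x, w, T):
    """All neurone states are tossed at random simultaneously with
    probability 1/(1 + exp(-V/T)), per relations (1) and (2)."""
    v = w @ x                                  # action potentials V_i
    p = 1.0 / (1.0 + np.exp(-v / T))
    return (rng.random(len(x)) < p).astype(int)

def weight_update(w, p_plus, p_minus, eta, T):
    """Gradient rule (3): dw_ij = (eta/T) * (p_ij^+ - p_ij^-)."""
    return w + (eta / T) * (p_plus - p_minus)

# Tiny demo: 3 neurones, symmetric weights, T = 0.5
w = np.array([[0.0, 1.0, -0.5], [1.0, 0.0, 0.3], [-0.5, 0.3, 0.0]])
print(synchronous_step(np.array([1, 0, 1]), w, T=0.5))
```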
[15] has investigated its analogue implementation, and [16] has realised a mixed digital/analogue synaptic circuit for a mixed digital/analogue implementation of [15]. The electronic board realised by [16] includes two main integrated circuits: the MBAT2 chip and the MBAT11 chip. MBAT2 is an analogue/digital circuit containing 32 neurones and MBAT11 is a mixed analogue/digital synaptic circuit containing 16 synapses. The prototype includes two MBAT11 synaptic chips, one MBAT2 neurone circuit and some standard control logic. The neurone cell includes:

- a Cellular Automaton Random Number generator (C.A.R.N.): this bloc produces random numbers according to a uniform probability law;
- a Current to Voltage Converter (C.V.C.) with a T parameter input, which produces a voltage representation of the V_i^n / T quantity;
- a Boltzmann parameter control bloc (B.T.C.B., represented in Figure 1);
- a Sigmoidal Function Bloc (S.F.B.), performing the f(V_i^n) operation and realizing a hyperbolic tangent (th(.)), which gives a good approximation of relation (2) of this paper;
- a comparator circuit that compares f(V_i^n) (the S.F.B. bloc's output) to the C.A.R.N. bloc's output and decides on the neurone state change.

Figure 1: Left side: bloc diagram of a neurone cell of the MBAT2 chip as described by [15]. Right side: a) Boltzmann Temperature parameter Control Bloc circuit; b) Rct resistor's realisation.

Figure 1 reproduces the bloc diagram of a neurone cell and the B.T.C.B. circuit (Boltzmann Temperature parameter Control Bloc) of the MBAT2 chip. Each neurone cell of the MBAT2 chip performs the following operations: it collects and adds up the synaptic currents; then converts the total synaptic current into a voltage; computes the action potential and gives a voltage representation of f(V_i^n); at the same time, a random number distributed according to the uniform probability law is generated (as a voltage) by the C.A.R.N. circuit of the neurone cell; finally, the generated random voltage is compared to the proportional voltage representation of f(V_i^n), performing the neurone state update [15], [18].

3. MODELLING THE BEHAVIOUR OF THE BOLTZMANN PARAMETER WITH THE PHYSICAL PARAMETERS

The synchronous Boltzmann Machine model and its analogue implementation, previously presented, point out that the C.V.C. (Current to Voltage Converter), the B.T.C.B. (Boltzmann Temperature parameter Control Bloc) and the S.F.B. (Sigmoidal Function Bloc) are the main functional blocs of the neurone cell. The range of variation of the Boltzmann parameter (T) is related to the number of synaptic connections of the neurone and to the synaptic weight dynamics. For example, for a neurone chip with 128 synaptic inputs per neurone and a [−1, +1] synaptic weight dynamic, the T parameter is around 0.5 (between 0.2 and 0.8) [18].

Let Vp be the S.F.B. input voltage; Vp is thus an electrical representation of the V_i^n / T quantity. Let Kp be the transformation coefficient, given in volts⁻¹ (Kp depends on the electrical realisation of the sigmoidal function in the S.F.B. bloc), and let I_M be the maximum value of the total synaptic current (1 μA in the case of the MBAT2 circuit). Finally, let Rct be the conversion ratio between the current and voltage representations of the synaptic potential. The Boltzmann parameter can then be obtained as a function of Rct, Kp and I_M [12]; Rct and Kp depend on structural and physical parameters.

T = 1 / (Rct · Kp · I_M)    (4)
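A one-line numeric sketch of relation (4); I_M = 1 μA is the MBAT2 value quoted above, while the Rct and Kp figures are purely illustrative assumptions chosen to land in the typical T ≈ 0.5 range:

```python
def boltzmann_parameter(rct_ohm, kp_per_volt, im_amp=1e-6):
    """Relation (4): T = 1 / (Rct * Kp * I_M)."""
    return 1.0 / (rct_ohm * kp_per_volt * im_amp)

# Illustrative values only: Rct = 200 kOhm, Kp = 10 1/V, I_M = 1 uA
print(boltzmann_parameter(200e3, 10.0))   # -> 0.5
```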
Considering Figure 1, one can remark that the Rct resistor is realised by a set of MOS gates. Two control signals perform the Rct variation: VRc, which controls the MOS transistor current, and Sk (with k ∈ {2, 3, 4}), which selects a different MOS transistor geometry (realised by different MOS grid areas). Thus the value T0 of the T parameter, corresponding to the resistor value Rct0, is obtained for S2 = S3 = S4 = 0. Taking into account the Rct structure and the dependence of Kp on the physical parameters, the behaviour of the T parameter with the physical temperature τ and the supply voltage is given by (5) [12], [17] (all voltages are referred to a reference offset Vref).

T(τ, Vdd) = [μ(τ0) Cox / (I_M Kp(τ0))] (τ/τ0)^(1−K1) × { L / (W1 [Vdd − Vref − VT(τ0) + K2 (τ − τ0)]) + L / (W2 [VRc − Vref − VT(τ0) + K2 (τ − τ0)]) }^(−1)    (5)


where Cox is the thin oxide capacitance, VT is the MOS gate threshold voltage, W is the transistor grid width, L is the transistor grid length, μ is the electron mobility and, finally, K1 and K2 are technology-dependent parameters; Δτ = τ − τ0. Their values are given in [17], [18] and [20]. Taking into account the numerical values, this expression can be written as:

T(τ, Vdd) = 1.48 (τ/300)^(1−K1) [ 2.4 / (VRc − 3.64 + K2 (τ − 300)) + 0.25 / (Vdd − 3.64 + K2 (τ − 300)) ]^(−1)    (6)

The VRc analogue control line used to adjust the Boltzmann parameter can be used to compensate the temperature effects. Its expression as a function of the Boltzmann parameter, the physical temperature and the supply voltage is given by (7).

VRc(T, τ, Vdd) = 2.4 [ (1.48 / T) (τ/300)^(1−K1) − 0.25 / (Vdd − 3.64 + K2 (τ − 300)) ]^(−1) + 3.64 − K2 (τ − 300)    (7)
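Relations (6) and (7), as reconstructed here, are mutually consistent: substituting the VRc of (7) back into (6) returns the target Boltzmann parameter. The sketch below illustrates the compensation principle; the K1 and K2 values are placeholders (not taken from the paper), and all constants should be read as approximate:

```python
K1, K2 = 1.5, -2.4e-3   # technology-dependent parameters: placeholder values

def t_model(tau, vdd, vrc):
    """Relation (6) as reconstructed above."""
    d = K2 * (tau - 300.0)
    s = 2.4 / (vrc - 3.64 + d) + 0.25 / (vdd - 3.64 + d)
    return 1.48 * (tau / 300.0) ** (1.0 - K1) / s

def vrc_compensation(t_target, tau, vdd):
    """Relation (7): the VRc keeping T at t_target despite drift."""
    d = K2 * (tau - 300.0)
    inv = (1.48 / t_target) * (tau / 300.0) ** (1.0 - K1) \
          - 0.25 / (vdd - 3.64 + d)
    return 2.4 / inv + 3.64 - d

# Consistency check: the compensated VRc reproduces t_target.
vrc = vrc_compensation(0.5, tau=330.0, vdd=4.8)
print(round(t_model(330.0, 4.8, vrc), 6))   # -> 0.5
```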

Figure 2: Model-based simulated Boltzmann parameter as a function of both temperature and supply voltage.

Figure 3: Simulated VRc analogue control line as a function of both temperature and supply voltage.

Figures 2 and 3 show simulation results based on the presented model. Figure 2 plots the Boltzmann parameter as a function of both ambient temperature and supply voltage. Figure 3 gives the simulated VRc analogue control line, as a function of both temperature and supply voltage, needed to keep the Boltzmann parameter value constant. As one can remark, this control line varies in a quasi-linear way with the physical temperature [17], which opens the possibility of easily using this analogue control input for temperature effects compensation. In the case of supply voltage perturbations, this control line can also be used as an external compensation parameter, even if the behaviour is more non-linear. Figure 4 shows the experimental evolution of the Boltzmann parameter (T) with the physical temperature and the supply voltage when VRc is controlled to compensate the temperature and electrical perturbation effects. The VRc behaviour needed for this compensation has been computed using the model presented in the previous section. As one can remark, the model makes it possible to compensate the perturbation effects and to stabilise the Boltzmann parameter.

Figure 4: Boltzmann parameter (T) evolution with both temperature and supply voltage when VRc is controlled to compensate the perturbation effects.

4. EFFECTS ON NETWORK'S LEARNING CAPABILITIES

After the experimental validation of our model, we investigated, using the model, the prediction of the influence of the considered global perturbations on neural network performance. As an indicator of neural network performance, we considered the convergence speed during the learning phase. The convergence speed can be defined as an indicator proportional to the number of iterations necessary to learn a set of patterns. In our case, the learning example was the XOR function.

Figure 5 shows the evolution of the convergence speed of the neural net with the Boltzmann parameter (T). Looking at this figure, one can remark that the convergence speed (measuring the learning speed of the network) decreases when the Boltzmann parameter value grows. Indeed, for high values of the Boltzmann parameter, the decision probability law (updating the neurones' output states) tends toward a uniformly distributed probabilistic decision law (representing a soft transition). Thus the system approximates a noisy system, needing more time to be stabilised. On the other hand, referring to Figure 2, one can remark that the Boltzmann parameter value decreases when the ambient temperature increases. So, at low temperatures the neural net's convergence speed will be affected (a result pointed out by Figure 5). It is possible to use the VRc analogue control line to compensate these effects.

Figure 5: Convergence speed, during the learning phase, as a function of the Boltzmann parameter.
Figure 6: Synaptic weights value evolution, after the learning phase, as a function of the Boltzmann parameter.

Figure 6 shows the evolution of the implemented neural network's synaptic weights (after the network's training) as a function of the Boltzmann parameter. Using the model, we established the influence of both thermal and electrical perturbation effects on the synaptic weight dynamics (Figure 7); the learning example was the same as mentioned previously. One can see that the synaptic weight dynamics can be affected by the considered global perturbations, even if the variations of the synaptic weight values (after learning a pattern) can be considered relatively small. This means that even if the implemented neural network may tolerate, after the learning phase, some perturbation of the environment in which it operates, these perturbations can affect the learning performance of the network during the learning phase by reducing the synaptic weight dynamics of the implemented model.

Figure 7: Synaptic weights dynamics evolution, after the learning phase, as a function of both ambient temperature and supply voltage variation.

5. CONCLUSION

We have scrutinised the behaviour of a CMOS analogue implementation of the synchronous Boltzmann Machine model under both ambient temperature and electrical (supply voltage) perturbations. Simulation and experimental results have been reported and discussed. We have also shown and discussed how such global perturbation effects can be compensated using the structure of the neural processor. Finally, we have analysed, using our model, the influence of the considered global perturbations on neural network performance (learning parameters of the neural network). We have established the influence of both thermal and electrical perturbation effects on the neural network's convergence speed and on the synaptic weight dynamics. Our future directions concern, on the one hand, a similar behaviour analysis considering electronic implementations of other neural models and, on the other hand, using our results and models to design a simulation tool dedicated to the design and analysis of neural-based applications under real-world and real-complexity conditions.

REFERENCES

[1] C.A. Mead, "Analogue VLSI and Neural Systems", Addison-Wesley, 1989.
[2] R.F. Lyon and C.A. Mead, "An Analog Electronic Cochlea", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 36, pp. 1119-1134, 1988.
[3] L. Jackel, "Electronic neural networks", in NATO ARW, Neuro-algorithms, architecture and applications, Les Arcs, 1989.
[4] M. Mayoubi, M. Schafer, S. Sinsel, "Dynamic Neural Units for Non-linear Dynamic Systems Identification", From Natural to Artificial Neural Computation, LNCS Vol. 930, Springer Verlag, pp. 1045-1051, 1995.
[5] M. Chiaberge, L.M. Reyneri, "Cintia: A Neuro-Fuzzy Real-Time Controller for Low-Power Embedded Systems", IEEE Micro, Vol. 15, pp. 40-47, June 1995.
[6] G. Mercier, K. Madani, "CMAC Real-Time Adaptive Control Implementation on a DSP Based Card", From Natural to Artificial Neural Computation, LNCS Vol. 930, Springer Verlag, pp. 1114-1120, 1995.
[7] G. Bugmann, P. Sojka, M. Reiss, M. Plumbly, J. Taylor, "Direct Approaches to Improving the Robustness of Multilayer Neural Networks", Artificial Neural Networks 2, Elsevier Science Pub., 1992.
[8] J.J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurones", Proceedings of the National Academy of Sciences of the U.S.A., Vol. 81, pp. 3088-3092, 1984.
[9] J.L. Wyatt and D.L. Standley, "Circuit design criteria for stable lateral inhibition neural networks", in IEEE International Symposium on Circuits and Systems, IEEE, pp. 997-1000, June 1988.
[10] M.A. Sivilotti, M.R. Emerling, and C.A. Mead, "VLSI Architectures for Implementation of Neural Networks", in AIP Conference Proceedings on Neural Networks for Computing, J.S. Denker (ed.), American Institute of Physics, Snowbird, Utah, pp. 408-413, 1986.
[11] M. Verleysen and P. Jespers, "Precision of sum-of-product in Analog Neural Networks", in Proceedings of the First International Workshop on Microelectronics for Neural Networks, Dortmund, FRG, June 1990.
[12] K. Madani, I. Berechet, "Temperature Perturbation Effects on Image Processing Dedicated Stochastic Artificial Neural Networks", SPIE Symposium on Electronic Imaging: Science and Technology, San Jose, California, U.S.A., February 6-10, 1994.
[13] G.E. Hinton and T.J. Sejnowski, "Learning in Boltzmann machines", in Cognitiva 85, Paris, pp. 283-290, 1985.
[14] R. Azencott, "Synchronous Boltzmann Machines and their learning algorithms", in NATO ARW, Springer-Verlag, Les Arcs, February 1989.
[15] P. Garda and E. Belhaire, "An Analog Chip Set with Digital I/O for Synchronous Boltzmann Machine", in VLSI for Artificial Intelligence and Neural Networks, J.G. Delgado-Frias and W.R. Moore (eds.), Kluwer Academic, Boston, 1990.
[16] V. Lafargue, "Contribution à la réalisation électronique de Réseaux de Neurones formels : Intégration mixte de l'apprentissage des machines de Boltzmann", Ph.D. Report, thèse de doctorat en science de l'université Paris XI, Orsay, January 1993.
[17] K. Madani, I. Berechet, G. de Tremiolles, "Analysis of Limitations in Analog Implementation of Stochastic Artificial Neural Networks", Orlando, Florida, U.S.A., 4-8 April 1994.
[18] E. Belhaire, "Contribution à la réalisation électronique de réseaux de Neurones Formels : Intégration Analogique d'une machine de Boltzmann", Ph.D. report, thèse de doctorat en science de l'université Paris XI, Orsay, February 1992.
[19] Y.P. Tsividis, "Operation and Modelling of the MOS Transistor", McGraw-Hill, 1988, p. 148.
[20] S.M. Sze, "Physics of Semiconductor Devices", Wiley, 1981.
[21] K. Madani, G. de Tremiolles, "Perturbation Effects Analysis in Analogue Implementation of a Stochastic Artificial Neural Network", SPIE International AeroSense'96 Symposium - Applications and Science of Artificial Neural Networks, Orlando, Florida, USA, 08-12 May 1996.
[22] K. Madani, G. de Tremiolles, "Global Perturbation Effects Analysis in a CMOS Analogue Implementation of Synchronous Boltzmann Machine", 3rd International Workshop on Thermal Investigations of Integrated Circuits and Microstructures, IEEE-CNRS, Cannes - Côte d'Azur, September 21-23, 1997.
Beta-CMOS Artificial Neuron and Implementability Limits

Victor Varshavsky¹ and Vyacheslav Marakhovsky²

¹ The University of Aizu, Hardware Department,
Aizu-Wakamatsu City, 965-8580 Japan
victor@u-aizu.ac.jp
² The University of Aizu, Software Department,
Aizu-Wakamatsu City, 965-8580 Japan
marak@u-aizu.ac.jp

Abstract. The paper is focused on the functional possibilities (class of representable threshold functions), parameter stability and learnability of an artificial learnable neuron implemented on the base of a CMOS β-driven threshold element. A neuron β-comparator circuit is suggested with a very high sensitivity to input current change, which allows us to sharply increase the threshold value of the realizable functions. The SPICE simulation results confirm that the neuron can learn to realize threshold functions of 10, 11 and 12 variables with maximum threshold values of 89, 144 and 233, respectively. A number of experiments were conducted to determine the limits within which the working parameters of the neuron can change while providing its stable functioning after learning the functions for each of these threshold values. MOSIS BSIM3v3.1 0.8 μm transistor models were used in the SPICE simulation.

1 Introduction

Hardware implementation of an artificial neuron has a number of well-known advantages over software implementation [1-4]. In its turn, a hardware artificial neuron can be implemented as a special-purpose programmable controller or as a digital/analog circuit (device). Each of these implementations has its advantages/drawbacks and fields of application. Commercially available neurochips can be of either of these two types. A comparative analysis of the characteristics and application fields of various neurochips is beyond the scope of this article. We will just note that digital/analog implementation has one obvious advantage over all other implementations, which is high performance.
On the other hand, digital/analog implementation, due to its internal analog nature, has fairly rigid limitations on the class of realizable threshold functions. These limitations considerably decrease the functional possibilities of neural nets with a fixed number of neurons.
The functional power of a neurochip equally depends on the number of neurons that can be placed on one VLSI and on the functional possibilities of a single neuron. Unfortunately, it has not been properly studied yet how much these parameters affect the functional power of the neurochip. However, it is evident that decreasing the area per synapse and extending the functional possibilities of a neuron are primary aims when creating new neurochips.
We have suggested a new type of CMOS threshold element (β-driven threshold element, βDTE) [5,6] and a CMOS artificial learnable neuron on its base [7-10]. The βDTE has a noticeable feature: its implementability depends only on the threshold value and does not depend on the number of inputs and their weights. This fact and its low complexity (5 transistors per learnable synapse) make this artificial learnable neuron a very attractive candidate for usage in the next generations of digital/analog neurochips.
The goal of this work is the circuit development of a CMOS artificial neuron on the base of the βDTE, studying its characteristics and the dependence of its behavior on the parameters.

2 Controllable βDTE

The conventional mathematical model of a neuron, starting from the work by McCulloch and Pitts [11], is the threshold function:

F = Sign(Σ_{j=1}^{n} w_j x_j − T);  Sign(A) = 0 if A < 0, 1 if A ≥ 0    (1)

where w_j is the weight of the j-th input and T is the threshold value.
Representing a threshold function as in (1) has led to the threshold element traditionally being implemented by the structure shown in Fig. 1.

Fig. 1. General structure of the neuron threshold model.

It is shown in [5,6] that any threshold function can be represented in ratio form, as follows:

F = Sign(Σ_j w_j x_j − T) = Sign( (Σ_{j∉S} w_j x_j) / (Σ_{j∈S} w_j x̄_j) − 1 ) = Rt( (Σ_{j∉S} w_j x_j) / (Σ_{j∈S} w_j x̄_j) );
Rt(A/B) = 0 if A < B, 1 if A ≥ B    (2)

where S is a certain subset of indexes³ such that Σ_{j∈S} w_j = T. From (2) it immediately follows that a CMOS implementation of the βDTE can be like that in Fig. 2. The voltage Vout at the β-comparator output is determined by the ratio of the steepnesses (β_n and β_p) of the n- and p-circuits. The steepnesses are formed by connecting transistors of respective widths in parallel.

Fig. 2. β-driven threshold element.
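A minimal sketch checking this equivalence on an example function, with S chosen so that Σ_{j∈S} w_j = T (see the footnote on constructing S below); the weights are illustrative:

```python
from itertools import product

def sign_form(x, w, T):
    """Relation (1): F = Sign(sum w_j x_j - T)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= T else 0

def ratio_form(x, w, S):
    """Relation (2): compare the n-circuit drive (j not in S) with the
    p-circuit drive on the inverted inputs (j in S)."""
    a = sum(w[j] * x[j] for j in range(len(w)) if j not in S)
    b = sum(w[j] * (1 - x[j]) for j in range(len(w)) if j in S)
    return 1 if a >= b else 0

w, T, S = [1, 2, 3, 5], 7, {1, 3}       # w[1] + w[3] = 2 + 5 = 7 = T
assert all(sign_form(x, w, T) == ratio_form(x, w, S)
           for x in product((0, 1), repeat=4))
```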

In [7,8], to build a threshold element with controllable input weights, a reduced ratio form is introduced:

F = Sign(Σ_{j=1}^{n} w_j x_j − T) = Rt( Σ_{j=1}^{n} (w_j / T) x_j ) = Rt( Σ_{j=1}^{n} ω_j x_j );  ω_j = w_j / T    (3)

which leads to the β-comparator circuit shown in Fig. 3a, where β_nj = ω_j β; β_p = β; β_n = Σ_{j=0}^{n−1} β_nj x_j.

Fig. 3. β-comparator: CMOS implementation (a); equivalent circuit (b).

³ To construct S it is sufficient to take any hypercube vertex that lies in the separating hyperplane and to include in S the indexes of the variables having the value 1 at that vertex.

In Fig. 3b, a circuit is shown that is equivalent to the one in Fig. 3a. The output voltage of the β-comparator is determined by the value α = β_n / β_p in the following way:

Vout > Vdd/2 if α < 1;  Vout ≤ Vdd/2 if α ≥ 1.

If the output voltage of the CMOS couple (Fig. 3b) is Vout ≈ Vdd/2, this means that both transistors are in the non-saturated mode, since both of them meet the condition Vth < Vout < Vgs − Vth⁴, with Vgs = Vdd. Hence, I_n + I_p = 0.
In [5] these equations were analyzed, and it was shown that the suggested comparator circuit has a sensitivity dVout/dα ≈ −2 V at the point α = β_n/β_p = 1. Hence, at the threshold level (Vout = Vdd/2), the reaction of the β-comparator to a unit change of the weighted sum is ΔVout ≈ |2/T| V, i.e., it decreases linearly as the threshold grows.
The analysis of the stability of the βDTE to parameter variations made in [5] showed that only βDTEs with small thresholds (<3-4) can be stably implemented. However, an artificial neuron is a learnable object, and variations of many parameters (for example, technological ones) can be compensated during the learning.
The learnable artificial neuron on the base of the βDTE [7,8] has a sufficiently simple control over the input weights (Fig. 4): the control voltage changes the equivalent β of the respective synapse.

Fig. 4. β-driven learnable neuron.

Since the synapse can be in one of two states, conducting and non-conducting, the output voltage Vout of the β-comparator is formed only by the synapses that are conducting at the given moment. On the other hand, after the threshold is reached, adding new synapses does not change the neuron output state.

⁴ For simplicity, let us assume that the threshold voltage is the same for both transistors.

It follows from this that the implementability of the βDTE and, hence, of the neuron on its base, depends only on the threshold value and does not depend on the number of inputs and the sum of their weights (this fact was established in [5]). The essential aspect is the sensitivity of the β-comparator to the current change at the threshold point. Since the range of the β-comparator output voltage is limited within (0-Vdd), the only way of increasing the β-comparator steepness at the threshold point is to increase the non-linearity of the dependence of the β-comparator output voltage on the ratio β_n/β_p.
Below we discuss the problems of increasing the sensitivity of the β-comparator at the threshold point and the parametric sensitivity of the artificial neuron.

3 Increasing β-Comparator Sensitivity

To increase the sensitivity of the β-comparator, its transistors should be in the saturated mode when the output voltage is in the threshold zone of the output amplifier switching. This can be demonstrated by the example of the equivalent circuit (Fig. 3b).
Let the gates of both transistors be fed not by ground and the supply voltage but by voltages Vgs^n and Vgs^p such that both transistors are in the saturated mode when Vout = Vdd/2. Let us assume for simplicity that Vgs^n = Vgs^p = Vgs, Vth^n = Vth^p = Vth and 0 < Vgs − Vth < Vdd/2. Then the equations for the currents flowing through the transistors can be represented as

I_n = B_n (Vgs − Vth)² (1 + λ_n Vout),
I_p = −B_p (Vgs − Vth)² [1 + λ_p (Vdd − Vout)],
I_n + I_p = 0

where the parameters λ_n and λ_p reflect the small increase of the transistor currents that takes place when Vds grows. From these equations we find

Vout = (1 − α + λ_p Vdd) / (λ_p + λ_n α),  α = B_n / B_p

and

dVout/dα = −(λ_n + λ_p + λ_n λ_p Vdd) / (λ_p + λ_n α)²

Let λ_n = 0.03 V⁻¹ and λ_p = 0.11 V⁻¹.⁵ It is easy to calculate that for Vout = Vdd/2, α = 1.15. The parameter α is not equal to one at this point since the values of λ_n and λ_p are different. When Vdd = 5 V and α = 1.15, dVout/dα = −7.5 V. Thus, the sensitivity of the β-comparator has increased by 3.75 times. The smaller λ_n and λ_p, the higher the sensitivity.
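These expressions are easy to evaluate numerically. The sketch below recomputes the balance point and the sensitivity for the λ values given above; it lands at α ≈ 1.19 and dVout/dα ≈ −7.4 V, close to the 1.15 and −7.5 V quoted in the text (the small difference is presumably due to rounding):

```python
def v_out(alpha, lam_n=0.03, lam_p=0.11, vdd=5.0):
    """Output voltage of the saturated-mode beta-comparator (see above)."""
    return (1.0 - alpha + lam_p * vdd) / (lam_p + lam_n * alpha)

def dv_dalpha(alpha, lam_n=0.03, lam_p=0.11, vdd=5.0):
    """Sensitivity dVout/dalpha of the saturated-mode comparator."""
    return -(lam_n + lam_p + lam_n * lam_p * vdd) / (lam_p + lam_n * alpha) ** 2

def balance_alpha(lam_n=0.03, lam_p=0.11, vdd=5.0):
    """Closed-form solution of v_out(alpha) = vdd/2 for alpha."""
    return (1.0 + lam_p * vdd - (vdd / 2.0) * lam_p) / (1.0 + (vdd / 2.0) * lam_n)

a = balance_alpha()
print(round(a, 3), round(dv_dalpha(a), 2))   # -> 1.186 -7.38
```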
In the learnable neuron circuit (Fig. 4), every synapse consists of two transistors. The gate of one transistor is fed by the input variable x_j; the gate of the other one is fed by the voltage Vc_j that controls the variable weight (the current in the synapse).

⁵ The values of these parameters were found from the used transistor models.
Let us first consider the lower part of the neuron β-comparator, where the synapse currents are summed. Let us replace the transistor couples that form the synapses by equivalent transistors with the characteristics shown in Fig. 5. These characteristics were obtained by SPICE simulation.

Fig. 5. Characteristics of the transistor that is equivalent to the transistor couple.

To the left of the mode-switching line, the transistors are in the non-saturated mode; to the right, in the saturated mode. We can see from these characteristics that when Vout = 2.5 V, the equivalent transistors are in the saturated mode if the control voltage Vc < 2.5 V and in the non-saturated mode if Vc > 2.5 V. Thus, the saturated-mode condition restricts the range of control voltage change. Breaking this restriction leads to a decrease of the comparator output signal, because the currents are re-distributed among the synapses.
Indeed, let the smallest weight correspond to the synapse current I_min, and let adding this current to the total current of the other synapses cause the switching of the neuron. If the synapse with the biggest current is not saturated, the decrease of Vout caused by the total current increase makes the current of this synapse smaller. The currents of the other non-saturated synapses also decrease. As a result, the total current increases by a value that is considerably smaller than I_min. This leads to a decrease of the comparator output signal.
The range in which the control voltages of the synapses change can be extended if an extra n-channel transistor is incorporated into the circuit, as shown in Fig. 6. The gate of this transistor is fed by a voltage Vref1 such that, when the current provides Vout ≈ Vdd/2, the transistor is saturated under the action of the voltage Vgs = Vref1 − VΣ. Increasing the total current through the synapses by adding the synapse with the smallest current makes VΣ smaller, so that Vgs becomes bigger. The extra transistor opens, and the extra increase of the total current compensates the change in VΣ. Thus, due to the negative voltage feedback, the extra transistor stabilizes VΣ and therefore stabilizes the currents through the synapses.

Fig. 6. Modified β-comparator.

Fig. 7. Dependence of the synapse current on Vout when Vc = 5 V.

In Fig. 6, when the control voltage of the synapse has its maximum value (Vc = 5 V), the current through the synapse depends on Vout as shown in Fig. 7. It looks like a transistor characteristic with two zones: the linear zone and the saturation zone. It is easy to see that when Vout ≈ 2.5 V, the synapse is in the saturated mode. When Vref1 gets smaller, the synapse current stabilization starts at a smaller value of Vout and the value of the stabilized current decreases. This is unwanted because the range of synapse current change narrows down. When Vref1 increases, the synapse current grows and the zone of its stabilization shifts to the right, which may cause the loss of current stabilization at the working point. Thus, there is an optimum value of Vref1.
Now let us consider the p-channel part of the β-comparator. At the working point (Vout ≈ Vdd/2) it should provide a current corresponding to the maximum threshold value of the realized functions. For this purpose, one p-channel transistor can be used, with an offset Vref providing its saturation at the working point. In Fig. 8, the dependence Vout(I) (Curve 1) obtained using such a transistor is shown.

Fig. 8. Curve 1: dependence Vout(I) when the comparator has one p-channel transistor; Curve 2: dependence Vout(I) when the comparator has two p-channel transistors; Curve 3: dependence VdM1(I).

The steepness of this characteristic at the working point is obviously not sufficient for a good stabilization of the threshold value of the current.
In the modified β-comparator circuit (Fig. 6), the p-channel part of the comparator consists of two transistors, M1 and M2, referenced by the voltages Vref2 and Vref3 respectively. These reference voltages are selected so that, as the comparator current grows, transistor M1 gets saturated first, followed by M2. The dependence of the voltage VdM1 at the drain of M1 on the current is shown in Fig. 8 (Curve 3). As soon as M1 enters the saturation zone, the voltage Vgs of M2 begins to change at a higher speed, because Vgs = Vref3 − VdM1. The voltage drop on M2 grows sharply, increasing the steepness of Vout(I) (Curve 2 in Fig. 8).

Fig. 9. Comparator characteristics: Curve 1 for the old comparator; Curve 2 for the new one.

For comparison, Fig. 9 contains the experimental characteristics of the old and new β-comparators adjusted to the function threshold T = 89. In this experiment, we studied how the comparator output Vout depends on the number of switched-on synapses whose control inputs were fed by the voltage Vc_min corresponding to the smallest weight of a variable. For the old comparator (Curve 1), the leap of the output voltage at the threshold point is 32 mV. The characteristic of the new comparator has a much higher steepness in the threshold zone; the voltage leap at the threshold point is 1 V.

4 Results of SPICE Simulation

In order to study the functional power of the neuron, a number of experiments were carried out with SPICE simulation of its behavior. We used MOSIS BSIM3v3.1 models of 0.8 μm transistors.
For all experiments with learnable neurons, the issue of choosing the threshold functions is crucial. The threshold function should meet the following demands:

- to have a short learning sequence;
- to cover a wide range of input weights;
- to have the biggest threshold for the given number of variables.

Monotonous Boolean functions representable by Horner's scheme meet all these demands. For such functions, the sequence of input weights and the threshold form a Fibonacci sequence. The length of the shortest learning (checking) sequence for a function of n variables is n+1 sets of input variable values. Proving these facts is beyond the scope of this article. Experiments were made with three threshold functions, for n = 10, 11 and 12:
F10 = Sign(x1 + x2 + 2x3 + 3x4 + 5x5 + 8x6 + 13x7 + 21x8 + 34x9 + 55x10 − 89),
F11 = Sign(x1 + x2 + 2x3 + 3x4 + 5x5 + 8x6 + 13x7 + 21x8 + 34x9 + 55x10 + 89x11 − 144),
F12 = Sign(x1 + x2 + 2x3 + 3x4 + 5x5 + 8x6 + 13x7 + 21x8 + 34x9 + 55x10 + 89x11 + 144x12 − 233).
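The weight/threshold pattern of these functions is straightforward to generate; the sketch below builds the Fibonacci weights for any n and evaluates the function (the construction of the n+1-vector checking sequence is not reproduced here):

```python
def horner_threshold(n):
    """Fibonacci weights 1, 1, 2, 3, 5, ... of the n-variable test
    function; the threshold is the next Fibonacci number."""
    fib = [1, 1]
    while len(fib) < n + 1:
        fib.append(fib[-1] + fib[-2])
    return fib[:n], fib[n]

def f_threshold(x, weights, T):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= T else 0

w, T = horner_threshold(12)
print(w, T)                         # weights of F12 and threshold 233
print(f_threshold([1] * 12, w, T))  # all inputs at 1: 376 >= 233 -> 1
```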
Since the learning process was not the object of our experiments, we set the optimum values of the control voltages on the synapses. The logical inputs of the neuron were fed by a checking (learning) sequence.
In the first series of experiments, we found max min ΔVout, the maximum of the smallest change of the β-comparator output voltage at the threshold level of 2.7 V. The results of the experiments are given in the second column of Table 1.

Table 1. Results of SPICE simulation

Neuron type  ΔVout    (min ÷ max) Vth   δVdd
F10          1 V      1.88 ÷ 3.7 V      ±0.3%
F11          0.525 V  1.9 ÷ 3.68 V      ±0.2%
F12          0.325 V  1.97 ÷ 3.65 V     ±0.1%

In the second series of experiments, for fixed parameters of the comparator, we tried to find in what range of threshold voltages the control voltages on the synapses existed that provided min ΔVout > 100 mV. In other words, we tried to find in what range the threshold Vth of the output amplifier may change (for example, because of technological parameter variations). This range can be compensated during the learning. The results are given in the third column of Table 1. During the learning, the neuron can be adjusted to any threshold of the output amplifier from these ranges.
The other experiments were associated with the question: with what precision should the voltages be maintained for normal functioning of the neuron after the learning?
First of all, we were interested in the stability of the neuron to supply voltage variations. With constant values of the reference voltages and changes of the supply voltage within ±0.1% (±5 mV), the dependence of the output voltage Vout on the currents flowing through the p-transistors of the comparator shifts along the current axis by ±1.5%, as shown in Fig. 10.

Fig. 10. Behavior of the dependency Vout(Ip) when the voltage Vdd changes in the interval ±0.1%.

For neuron F12, the current at the working point is about 233·I_min; 1.5% of this value is 3.5·I_min, i.e., the shift of the characteristic is 3.5 times larger than the minimum synapse current. Evidently, the neuron will not function properly when the working current changes like that.
On the other hand, taking into account the way the reference voltages are produced, it is natural to assume that the reference voltages change proportionally to the changes of the supply voltage. The effect of the reference voltage change opposes the effect of the supply voltage change, partially compensating it. The experiments carried out under these conditions showed that the learned neurons F10, F11 and F12 can function properly in the respective ranges of supply voltage change shown in the fourth column of Table 1. To fix the borders of the ranges, the following condition was used: the signal ΔVout should exceed or fall below the output amplifier threshold by a value of not more than 50 mV.

The control voltages of the synapses were set up with an accuracy of 1 mV. With what accuracy should they be maintained after the learning? Evidently, the neuron will not function properly if, with the same threshold of the output amplifier, the total current of the synapses drifts by I_min/2 in one or the other direction. Experiments were conducted in which we determined the permissible range in which the control voltage δVc of one of the synapses (with minimum and maximum currents) can change while the control voltages of the other synapses remain constant. The condition for fixing the range borders was the same as in the previous series of experiments. The obtained results are given in Table 2.
Table 2. Results of SPICE simulation

Type  δI_syn  δVc^min         δVc^max
F10   ±…%     ±…% (±… mV)     ±…% (±17 mV)
F11   ±4.7%   ±…% (±… mV)     ±…% (±27 mV)
F12   ±…%     ±…% (±… mV)     ±…% (±… mV)
The second column of the table shows the permissible ranges of synapse current change. The third and fourth columns contain the limits of change of the control voltages that define the corresponding changes of current in the synapses with minimum and maximum weights.
Based on the Table 2 data, we can draw the following conclusion: since all the control voltages of the synapses in the neuron should be maintained simultaneously, they should be maintained with an accuracy of units of mV.

5 Conclusion
The suggested neuron with the improved β-comparator has a number of attractive features. It is very simple for hardware implementation and can be implemented in CMOS technology. Its β-comparator has a very high sensitivity, which provides a minimum comparator output signal as small as 325 mV for a threshold value as big as T = 233. Its implementability does not depend on the sum of the input weights, being determined only by the threshold value. Such a neuron can perform very complicated functions, for example, all logical threshold functions of 12 variables. There is no doubt that it can learn any threshold function of 12 variables, because the dispersions of all technological and functional parameters of the circuit are compensated during the learning.
The drawbacks of the suggested neuron are the very high demands on the stability of the supply voltage after the learning. This drawback seems to be peculiar to all circuits with high resolution, for example, digital-to-analog and analog-to-digital converters. If these demands cannot be met over the interval of neuron functioning, one should reduce the threshold value until they are met, or carry out additional research in order to study whether it is possible to compensate the influence of an unstable supply voltage.
This work does not deal with the problems of teaching the neuron threshold logical functions and of maintaining it in the learned state. These issues are of special interest and should be the object of a separate research.

References

1. Mead, C.: Analog VLSI and Neural Systems. Addison-Wesley (1989)
2. Fakhraie, S.M., Smith, K.C.: VLSI-Compatible Implementations for Artificial Neural Networks. Kluwer, Boston-Dordrecht-London (1997)
3. Shibata, T., Ohmi, T.: Neuron MOS binary-logic integrated circuits: Part 1, Design fundamentals and soft-hardware logic circuit implementation. IEEE Trans. Electron Devices, Vol. 40, No. 5 (1993) 974-979
4. Ohmi, T., Shibata, T., Kotani, K.: Four-Terminal Device Concept for Intelligence Soft Computing on Silicon Integrated Circuits. Proc. of IIZUKA'96 (1996) 49-59
5. Varshavsky, V.: Beta-Driven Threshold Elements. Proceedings of the 8th Great Lakes Symposium on VLSI, IEEE Computer Society, Feb. 19-21 (1998) 52-58
6. Varshavsky, V.: Threshold Element and a Design Method for Elements. Filed to Japan's Patent Office, Jan. 30 (1998), JPA H10-54079 (under examination)
7. Varshavsky, V.: Simple CMOS Learnable Threshold Element. International ICSC/IFAC Symposium on Neural Computation, Vienna, Austria, Sept. 23-25 (1998)
8. Varshavsky, V.: CMOS Artificial Neuron on the Base of Beta-Driven Threshold Elements. IEEE International Conference on Systems, Man and Cybernetics, San Diego, CA, October 11-14 (1998)
9. Varshavsky, V.: Synapse, Threshold Circuit and Neuron Circuit. Filed to Japan's Patent Office on Aug. 7 (1998), JPA H10-224994 (under examination)
10. Varshavsky, V.: Threshold Element. Filed to Japan's Patent Office on Aug. 12 (1998), JPA H10-228398 (under examination)
11. McCulloch, W.S., Pitts, W.: A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics 5 (1943) 115-133
Using On-Line Arithmetic and Reconfiguration
for Neuroprocessor Implementation

Jean-Luc Beuchat and Eduardo Sanchez

Logic Systems Laboratory, Swiss Federal Institute of Technology,


CH-1015 Lausanne, Switzerland
E-mail: {name.surname}@di.epfl.ch

Abstract. Artificial neural networks can solve complex problems such as time series prediction, handwritten pattern recognition or speech processing. Though software simulations are essential when one sets about to study a new algorithm, they cannot always fulfill the real-time criteria required by some practical applications. Consequently, hardware implementations are of crucial import.
The appearance of fast reconfigurable FPGA circuits brings about new
paths for the design of neuroprocessors. A learning algorithm is divided
into different steps that are associated with specific FPGA configura-
tions. The training process then consists of alternating computing and
reconfiguration stages. Such a method leads to an optimal use of hard-
ware resources.
This paradigm is applied to the design of a neuroprocessor implementing
multilayer perceptrons with on-chip training and pruning. All arithmetic
operations are carried out with on-line operators. We also describe the
principles of the hardware architecture, focusing in particular on the
pruning mechanisms.

1 Introduction

Modern digital computers perform very complex arithmetic calculations at the nanosecond time scale. As we easily make mistakes when computing complex expressions, we cannot approach such capabilities. However, we daily perform simple tasks like reading or talking. In spite of their computational potential, digital computers encounter many difficulties when facing these tasks. In fact, the brain is a "computer" which exhibits characteristics such as robustness, distributed memory and calculation, and interpretation of imprecise or noisy sensory information.
The design of electronic devices exhibiting such characteristics would be very
interesting from the engineering point of view. Applications of such circuits in-
clude artificial vision, autonomous robotics or speech processing. Artificial neural
networks, which are briefly described in section 2 (we will especially focus on
multilayer perceptrons and supervised learning which are extensively used in
this paper), constitute a possible way to realize such devices. Section 3 gives an
overview of classic paradigms allowing the design of neuroprocessors. However,

the appearance of fast reconfigurable Field-Programmable Gate Arrays (FPGAs) offers new paths for neuroprocessor implementation (section 4). A learning algo-
rithm is split into several steps executed sequentially, each of which is associated
with a specific FPGA configuration. Such an approach leads to an optimal use
of hardware resources. The reconfiguration paradigm also allows the implemen-
tation of multiple algorithms on the same hardware. We apply these principles
to the design of an on-line arithmetic-based reconfigurable system able to run
different learning rules for multilayer perceptrons (section 5). Finally, section 6
presents our concluding remarks and some future extensions of our work.

2 Artificial Neural Networks

Artificial neural network models are widely used for the design of adaptive, intel-
ligent systems since they offer an attractive property: the capability of learning
in order to solve problems from examples. These models achieve good perfor-
mance via massively parallel networks composed of non-linear computational
elements, often referred to as units or neurons. A value, referred to as activity
(or activation value) is associated with each neuron. Similarly, a synaptic weight
is associated with each connection between neurons. A neuron's activity depends
on the activity of the neurons connected to it and the weights. Each neuron com-
putes the weighted sum of its inputs. This value is called net input. The activity
is obtained by the application of an activation function (e.g. sigmoid, gaussian
or linear function) to the net input.
Many network architectures have been described in the literature. Multilayer
perceptrons, which are used in our project, are composed of several layers of
neurons: an input layer simply holding input signals, one or more hidden layers
of neurons and an output layer, from where the response comes. Connections
are only possible between two adjacent layers. Let us further introduce some notation. $N_m$ designates the number of neurons in layer $m$. $w_{n_i m_j}$ is the weight between neuron $i$ in layer $n$ and neuron $j$ in layer $m$. $a^p_{m,j} = \varphi(h^p_{m,j})$ denotes the activity of neuron $j$ in layer $m$, where $\varphi$ is the activation function. Finally, $h^p_{m,j} = \sum_{i=1}^{N_{m-1}} w_{m-1_i m_j} \cdot a^p_{m-1,i}$ is the net input. Other interconnection schemes include recurrent or competitive networks.
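To make this notation concrete, the following minimal NumPy sketch (our own illustration, not part of the original design) computes the layerwise forward pass just described; the layer sizes, random weights and the sigmoid activation are arbitrary example choices:

    import numpy as np

    def sigmoid(h):
        # Activation function phi applied to the net input
        return 1.0 / (1.0 + np.exp(-h))

    def forward(weights, x):
        """Propagate an input vector x through a multilayer perceptron.
        weights[m] is an (N_m x N_{m-1}) matrix of the coefficients
        between layer m-1 and layer m."""
        activities = [x]
        for W in weights:
            h = W @ activities[-1]          # net input: weighted sum
            activities.append(sigmoid(h))   # activity a_{m,j} = phi(h_{m,j})
        return activities

    # Example: a 5-3-2 network with random weights
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(3, 5)), rng.normal(size=(2, 3))]
    print(forward(weights, rng.normal(size=5))[-1])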

2.1 Learning Algorithms

We distinguish three classes of learning algorithms: supervised, reinforcement and unsupervised learning. The major characteristic of supervised learning is the availability of a teacher having knowledge of the environment. This knowledge is represented as a set of input-output examples, called the training set. When the network is exposed to a training vector $\xi^p$ (where $p$ denotes the index of the vector in the training set), it computes an output $o^p$ which is compared with the desired response $d^p$ provided by the teacher. The resulting error signal $E^p := \|d^p - o^p\|^2$ obviously depends on the weights associated with the connections. Thus, learning consists in determining the weights $w_{m-1_i m_j}$ minimizing $E^p$.

The Backpropagation rule [1] is perhaps the most popular supervised algorithm for multilayer perceptrons. It iteratively computes the values of the weights using a gradient descent algorithm. (1) Initially, all weights are initialized to small random values. (2) An input vector is then presented and propagated layerwise through the network. (3) We compute the error signal, (4) which is back-propagated through the network; this process makes it possible to assign errors to hidden neurons. (5) Finally, the computed errors and neuron activities determine the weight change. Steps (2) to (5) are carried out for all training vectors. This process is repeated until the output error signal falls below a given threshold. Supervised algorithms solve a wide range of complex problems including image processing, speech recognition and prediction of stock prices.
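A compact sketch of one such training step, reusing the forward() helper from the previous sketch, might look as follows (plain stochastic gradient descent with sigmoid activations; the learning rate is an arbitrary example value):

    def train_step(weights, x, d, eta=0.1):
        """One backpropagation step: forward pass, error back-propagation,
        weight update (steps (2) to (5) above)."""
        acts = forward(weights, x)                            # step (2)
        delta = (acts[-1] - d) * acts[-1] * (1.0 - acts[-1])  # steps (3)-(4)
        for m in range(len(weights) - 1, -1, -1):
            grad = np.outer(delta, acts[m])   # gradient of E^p for layer m
            if m > 0:                         # back-propagate the error signal
                delta = (weights[m].T @ delta) * acts[m] * (1.0 - acts[m])
            weights[m] -= eta * grad          # step (5): weight change
        return float(np.sum((d - acts[-1]) ** 2))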
In reinforcement learning, the system tries an action on its environment and
receives an evaluative reward, indicating whether the action is right or wrong.
Reinforcement learning algorithms try to maximize the received reward over
time. Such algorithms are efficient in autonomous robotics.
There is no teacher available in unsupervised learning. These algorithms try
to cluster or categorize input vectors. Similar inputs are classified within the
same category and activate the same output unit. Applications of unsupervised
learning include data compression, density approximation or feature extraction.

2.2 Pruning Algorithms


When we train a multilayer perceptron, it is generally illusory to provide every
possible input pattern. Therefore, an important issue of training is the capabil-
ity of the network to generalize, that is, cope with previously unseen patterns.
However, generalization depends on the network topology. A rule of thumb for
obtaining a good generalization is to use the smallest network able to learn the
training data [2]. Training successively smaller networks is a time-consuming
approach. Among the efficient processes to determine a good topology, one can
cite genetic algorithms, growing methods, and pruning algorithms.
With pruning algorithms, we train a network that is larger than necessary and delete superfluous elements (units or connections). These algorithms can be classified into two general categories: sensitivity estimation and penalty-term methods. Algorithms within the first category measure the sensitivity of the error to the removal of a connection or a unit; the elements with the smallest sensitivities are then pruned. Methods belonging to the second category suggest new error functions which drive weights to zero during training.
Ishikawa [3] has suggested a penalty-term algorithm based on the following error function:

$$\tilde{E}^p = E^p + \lambda \cdot \sum_{m,i,j} |w_{m-1_i m_j}| \qquad (1)$$

Differentiating Eq. (1) with respect to the synaptic coefficient $w_{m-1_i m_j}$ leads to a new update rule:

$$\Delta w_{m-1_i m_j} = -\eta \cdot \frac{\partial \tilde{E}^p}{\partial w_{m-1_i m_j}} \qquad (2)$$

Equation 2 drives synaptic coefficients to zero. Weights are removed when they
decrease below a given threshold. Pruning connections sometimes leads to a
situation where some neurons have no more inputs or outputs. Such neurons,
called dead units, can be deleted.
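A minimal sketch of this penalty-term update and pruning test, reusing train_step() from the previous sketch, could read as follows (λ, η and the pruning threshold are arbitrary example values, and pruning_masks are assumed to be boolean arrays of the same shapes as the weight matrices):

    LAMBDA, THRESHOLD = 1e-4, 1e-3

    def prune_step(weights, pruning_masks, x, d, eta=0.1):
        """One step of structural learning with forgetting (Eqs. 1-2)."""
        train_step(weights, x, d, eta)          # gradient of E^p
        for W, mask in zip(weights, pruning_masks):
            W -= eta * LAMBDA * np.sign(W)      # decay term from the |w| penalty
            mask &= np.abs(W) > THRESHOLD       # prune weights below threshold
            W *= mask                           # pruned connections stay at zero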

3 Hardware Implementation

Though software simulations are essential when one sets about to study a new
algorithm, they can't always fulfill real-time criteria required by some practi-
cal applications. In order to exploit the inherent parallelism of artificial neural
networks, hardware implementations are essential.
Analog implementations allow the design of extremely fast and compact low-power circuits. This approach has been successful in the design of signal processing neural networks, like the Hérault-Jutten model [4], or bio-inspired systems like silicon retinas [5]. The main drawback of analog circuits lies perhaps in their limited accuracy. Consequently, they cannot implement the backpropagation algorithm, which requires a resolution from 8 bits to more than 16 bits [6], depending on several factors, such as the complexity of the problem to be solved.
Among the many digital neuroprocessors described in the literature, we distinguish two main design philosophies. The first approach involves the design of a highly parallel computer and a programming language dedicated to neural networks. It allows the implementation of multiple algorithms in the same environment. [7] gives an interesting overview of different academic and commercial systems. However, programming such computers is often arduous.
The second approach involves the design of a specialized chip for a given algorithm, thus avoiding the tedious programming task. [8] describes such circuits and presents the benefits of this method: "resource efficiency in respect to speed, compactness and power consumption". However, the main drawback lies in the need for a different hardware device for each algorithm.
Besides the analog and digital approaches, the literature describes other design paradigms. Let us mention two examples:
• F. N. Sibai and S. D. Kulkarni have proposed a neuroprocessor combining digital weight storage and analog processing [9].
• The optical neural network paradigm "promises to enable the design of highly parallel, analog-based computers for applications in sensor signal processing and fusion, pattern recognition, associative memory, and robotic control" [10]. The VLSI implementation of a fully connected network of N units requires area O(N²). Optics allows the implementation of connections in a third dimension, reducing the chip area to O(N). However, additional research is indispensable to embed such a system in a small chip.

4 Reconfigurable Hardware

Fast reconfigurable FPGAs offer new paths for neuroprocessor implementation. Figure 1b depicts an FPGA-based neuroprocessor implementing four learning

algorithms. It consists of an FPGA board and a set of configurations describing specialized architectures for different algorithms. This solution provides the advantages of both design philosophies previously discussed.

Fig. 1. (a) A neuroprocessor made of four special-purpose chips. (b) An FPGA-based neuroprocessor.

Furthermore, reconfigurable systems offer interesting perspectives in the design of neuroprocessors:
• Optimal use of hardware resources. A learning algorithm consists of several stages, each of them requiring specific functionalities. For instance, the initialization step of the backpropagation learning rule makes use of a linear feedback shift register, which remains unused in the following steps. Consequently, implementing a complete algorithm on an FPGA board leads to a waste of resources. To avoid this drawback, we divide learning algorithms into several sequentially executed stages, each of which is associated with an FPGA configuration. Figure 2 depicts a possible decomposition of the backpropagation learning rule. Note that the reconfiguration time is of crucial import: if this process needs more time than the computation, such an approach is not appropriate.
• Hardware component reuse. Suppose you have already implemented the backpropagation learning rule. The implementation of the pruning algorithm described in section 2.2 is then straightforward: the only difference between the two algorithms lies in the weight update rule. Therefore, you can reuse most of the previously designed configurations and simply need to develop a new configuration which computes equation 2 (Fig. 2).

5 Implementation of a Multilayer Perceptron


We apply the principles described above to implement a multilayer perceptron,
some supervised learning rules (backpropagation, resilient backpropagation and

Fig. 2. A possible decomposition of backpropagation-like algorithms: a configuration database holds the stages of each algorithm (forward propagation, error computation, backward propagation, weight update).

an algorithm using a weighted error function discussed by Sakaue et al. in [6]), and pruning algorithms. As FPGAs are not well suited for floating-point computation, we use fixed-point numbers to carry out all arithmetic operations. A series of experiments has demonstrated the efficiency of such a limited-precision system [11]. We found that Ishikawa's algorithm is especially suitable for our hardware implementation. Equation 2 prevents an important growth of the synaptic coefficients, thus bounding the number of bits required for the integral part of the numbers.

5.1 On-line Arithmetic

A neural network contains numerous connections, thus requiring a substantial amount of wires. As parallel arithmetic requires large buses, it is not well suited for such implementations. Bit-serial communications seem to be an effective solution.
Bit-serial data transmission may begin with the least significant bit (LSB) or with the most significant bit (MSB). Though the LSB paradigm seems more natural (the digits of the result are generated from right to left when we carry out an addition or a multiplication using the "paper and pencil" method), it does not allow the design of algorithms for division or square root. The on-line mode, introduced in [12], performs all computations in MSB mode thanks to a redundant number representation. It avoids carry propagation within addition.
Usually, a number $a \in \mathbb{R}$ is written in radix $r$ as $\sum_i a_i r^{-i}$, where $a_i \in \mathcal{D}_r = \{0, 1, \ldots, r-1\}$. $\mathcal{D}_r$ is the digit set. On-line algorithms described in the literature use Avizienis' signed-digit systems [13], where numbers are represented in radix $r$ with digits belonging to $\mathcal{D}_r = \{-\alpha, -\alpha+1, \ldots, \alpha-1, \alpha\}$, $\alpha \le r-1$.
Our neuroprocessor exploits on-line arithmetic in radix 2, i.e. the digit set is $\{-1, 0, 1\}$. It uses a bit-level representation of the digits, called Borrow-Save, defined as follows: two bits, $a_i^+$ and $a_i^-$, represent the $i$th digit of $a$, such that

$a_i = a_i^+ - a_i^-$. Consequently, digit 1 is encoded by (1, 0), digit -1 is encoded by (0, 1), while digit 0 is represented by (0, 0) or (1, 1).
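To make the encoding concrete, here is a small sketch (the function name is ours) that evaluates a radix-2 Borrow-Save digit stream, taken MSB first with fractional weights 2^-1, 2^-2, ...:

    def borrow_save_value(digits):
        """Evaluate a Borrow-Save fraction given MSB-first (a+, a-) bit pairs."""
        value = 0.0
        for i, (plus, minus) in enumerate(digits, start=1):
            d = plus - minus            # digit in {-1, 0, 1}
            value += d * 2.0 ** (-i)    # weight 2^-i of the i-th digit
        return value

    # 0.5 - 0.25 + 0 = 0.25; note that digit 0 may be (0, 0) or (1, 1)
    print(borrow_save_value([(1, 0), (0, 1), (1, 1)]))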

Fig. 3. (a) Delay of an on-line operator (schedule for an operator of delay δ = 2). (b) Pipeline with on-line operators (delays δ = 3 and δ = 2, aligned by a register).

Many on-line operators are described in the literature. Each of them is characterized by a delay δ, which indicates the number of clock cycles required to compute the MSB of the result. Figure 3a depicts the schedule diagram of an operator of delay 2. An input signal i is provided at time t = 0. Two clock cycles are required to elaborate the MSB of the output signal o. A new bit of the result is then produced at each clock cycle.
We have designed a VHDL library of on-line operators dedicated to our neuroprocessor. Table 1 summarizes the available operators and their delays. Notice that it is possible to implement multipliers and squarers of delay 2. However, the size of these operators depends on the length of the operands [14]. The implementation of activation functions is obtained with Horner's scheme.

Table 1. Some on-line operators and their respective delays.

Operator                    Delay
n-input adder               …
Multiplier                  3
Multiplier by a constant    2
Squarer                     3
Binomier                    3

Suppose now that the result of a first operator F is needed for further computations. The second operation may begin as soon as F has generated the MSB of the partial result. Figure 3b shows how to chain different on-line operators. The addition and the first multiplication are carried out in parallel. As the delay of the adder is smaller than that of the multiplier, we use a register to synchronize the inputs of the second multiplier.
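As a rough illustration of this scheduling rule (our own sketch, with example delay values rather than the exact figures of Table 1), the number of one-cycle synchronization registers needed to align two streams feeding a common operator is simply the difference of their delays:

    def sync_registers(delay_a, delay_b):
        """Registers needed to align two on-line operands whose producers
        emit their MSBs after different delays."""
        return abs(delay_a - delay_b)

    # e.g. an adder path of delay 2 and a multiplier path of delay 3
    print(sync_registers(2, 3))   # -> 1 register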

5.2 General Architecture

When designing the hardware architecture of our neural network, we first observed that a time-multiplexed interconnection scheme provides a good trade-off between speed and scalability (Fig. 4). The main idea is to connect all the outputs of hidden layer m and the inputs of hidden (or output) layer m + 1 to a common bus; the same hardware is reused for all layers of the network. The multiplexer allows the network to be fed either with an input signal or with the activation value of a hidden unit. The neuroprocessor is basically made of FPGAs, each of them embedding N neurons (this number depends on the FPGA family). Furthermore, each FPGA is connected to its own memory. Memory module m stores the synaptic weights between all neurons implemented by FPGA m and their inputs (Fig. 4).
We will focus herein on the forward propagation of a signal (the backward propagation obeys the same principles). The first neuron in layer m places its activity $a^p_{m,1}$ on the bus. All neurons in layer m + 1 read it, multiply it by the appropriate synaptic weight $w_{m_1 m+1_j}$, and finally store the result. This process is sequentially repeated for every neuron in layer m. Each processing element in layer m + 1 accumulates the results of the successive multiplications. Due to this interconnection scheme, each neuron is a very simple processing element.

Fig. 4. General architecture of a reconfigurable neuroprocessor.

Figure 5a illustrates the architecture used during the forward propagation step. An on-line binomier iteratively computes the net input:

$$w_{m_{k+1} m+1_j} \cdot a^p_{m,k+1} + \sum_{i=1}^{k} w_{m_i m+1_j} \cdot a^p_{m,i} \qquad (3)$$

The FIFO allows the synchronization of the inputs of the binomier. A special bit,
called Pruning, is associated with each weight and indicates whether a connection
has been pruned (in which case, this bit is set to 0) or not. This information is
stored in a flip-flop and combined with an input of the on-line operator.
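In software terms, the effect of the Pruning bit on the accumulated net input of equation 3 can be sketched as follows (a behavioural model only, not the bit-serial hardware; all names are ours):

    def net_input(weights, pruning_bits, activities):
        """Accumulate the net input of one neuron, skipping pruned synapses.
        A Pruning bit of 0 forces the corresponding product to zero, just as
        the flip-flop gates the input of the on-line operator."""
        total = 0.0
        for w, p, a in zip(weights, pruning_bits, activities):
            total += w * a if p else 0.0
        return total

    print(net_input([0.5, -0.3, 0.8], [1, 0, 1], [1.0, 1.0, 0.5]))  # -> 0.9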

Fig. 5. (a) Architecture of a neuron. (b) Dead unit detection mechanism.

We now have to provide our neuroprocessor with a means of detecting dead units. This problem is solved by the simple mechanism depicted in Fig. 5b. Assume that a neuron j in layer m has no more inputs. All $w_{m-1_i m_j}$ coefficients are loaded when a signal is forward-propagated through the network. As the Pruning bits associated with the $w_{m-1_i m_j}$ are set to zero, the flip-flop output remains zero as well. Consider now a neuron with no outputs. The backward-propagation process involves all weights $w_{m_j m+1_k}$ whose Pruning bits are equal to zero. Consequently, the detection of such dead units occurs during this step. Once a dead unit has been detected, a signal is sent to a global controller which manages the network topology. As the activities of the neurons are sequentially placed on the bus, the deletion of dead units increases the learning speed.
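The detection itself amounts to an OR over the Pruning bits seen while a unit's weights are streamed, mirroring the flip-flop of Fig. 5b (a behavioural sketch with our naming):

    def is_dead(pruning_bits):
        """A unit is dead when all of its remaining synapses are pruned,
        i.e. the OR of its Pruning bits (the flip-flop output) stays 0."""
        return not any(pruning_bits)

    print(is_dead([0, 0, 0]))  # True: no live connections, unit can be deleted
    print(is_dead([0, 1, 0]))  # False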

6 Conclusions

This paper has presented an attractive paradigm for the design of neuropro-
cessors. The reconfigurable approach allows the implementation of circuits ded-
icated to different algorithms on the same board. Furthermore, it leads to an

optimal use of hardware resources. We then discussed some hardware implementation details. An interesting issue is the pruning system. Dead unit removal
and training are executed concurrently. Our solution is more efficient than the
one used in some software simulators, where the removal of dead units and the
training stages are done sequentially.
We now have to complete the hardware implementation and to evaluate the
performance of our neuroprocessor. As previously mentioned, the reconfiguration
time is of crucial import. To implement this first prototype, we use Renco [15]
(REconfigurable Network COmputer), an FPGA board designed to study re-
configurable systems. This board is composed of a commercial processor and
four Altera Flex 10K130 FPGAs, whose reconfiguration process is quite slow.
Thus, Renco only allows the design of a first prototype. The design of a new
board providing another FPGA family will be indispensable. Context switching
FPGAs [16] are another promising alternative for our implementation.
Finally, reconfigurable systems offer some other interesting prospects. The
architecture depicted in Fig. 4 is well suited for the learning process. However, it
should be interesting to increase the parallelism when training is over. Therefore,
we plan to design special FPGA configurations for the recall process.

References
1. B. Widrow and M. A. Lehr. 30 Years of Adaptive Neural Networks: Perceptron, Madaline and Backpropagation. Proc. IEEE, 78(9):1415-1442, September 1990.
2. Russell Reed. Pruning Algorithms - A Survey. IEEE Transactions on Neural Networks, 4(5):740-747, September 1993.
3. Masumi Ishikawa. Structural Learning with Forgetting. Neural Networks, 9(3):509-521, 1996.
4. J. Hérault and C. Jutten. Réseaux neuronaux et traitement du signal. Hermès, 1994.
5. C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, May 1989.
6. Shigeo Sakaue, Toshiyuki Kodha, Hiroshi Yamamoto, Susumu Maruno, and Yasuharu Shimeki. Reduction of Required Precision Bits for Back-Propagation Applied to Pattern Recognition. IEEE Transactions on Neural Networks, 4(2):270-275, March 1993.
7. Paolo Ienne. Digital Connectionist Hardware: Current Problems and Future Challenges. In José Mira, Roberto Moreno-Díaz, and Joan Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, pages 688-713. Springer, 1997.
8. Ulrich Rückert and Ulf Witkowski. Silicon Artificial Neural Networks. In L. Niklasson, M. Bodén, and T. Ziemke, editors, ICANN 98, Perspectives in Neural Computing, pages 75-84. Springer, 1998.
9. Fadi N. Sibai and Sunil D. Kulkarni. A Time-Multiplexed Reconfigurable Neuroprocessor. IEEE Micro, 17(1):58-65, 1997.
10. B. Keith Jenkins and Armand R. Tanguay, Jr. Optical Architectures for Neural Network Implementations. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 673-677. The MIT Press, 1995.
11. Jean-Luc Beuchat and Eduardo Sanchez. A Reconfigurable Neuroprocessor with On-chip Pruning. In L. Niklasson, M. Bodén, and T. Ziemke, editors, ICANN 98, Perspectives in Neural Computing, pages 1159-1164. Springer, 1998.
12. Kishor S. Trivedi and Milos D. Ercegovac. On-line Algorithms for Division and Multiplication. IEEE Transactions on Computers, C-26(7), July 1977.
13. Algirdas Avizienis. Signed-Digit Number Representations for Fast Parallel Arithmetic. IRE Transactions on Electronic Computers, 10, 1961.
14. Jean-Claude Bajard, Jean Duprat, Sylvanus Kla, and Jean-Michel Muller. Some Operators for On-Line Radix-2 Computations. Journal of Parallel and Distributed Computing, 22:336-345, 1994.
15. Eduardo Sanchez, Moshe Sipper, Jacques-Olivier Haenni, Jean-Luc Beuchat, André Stauffer, and Andrés Perez-Uribe. Static and Dynamic Configurable Systems. To appear.
16. Stephen M. Scalera, John J. Murray, and Steve Lease. A Mathematical Benefit Analysis of Context Switching Reconfigurable Computing. In José Rolim, editor, Parallel and Distributed Processing, number 1388 in Lecture Notes in Computer Science, pages 73-78. Springer, 1998.
Digital Implementation of Artificial Neural Networks:
From VHDL Description to FPGA Implementation

N. Izeboudjen, A. Farah*, S. Titri, H. Boumeridja

Development Center of Advanced Technologies, Microelectronic Laboratory
128, Chemin Mohamed Gacem, B.P. 245 El Madania, 16075 Algiers, Algeria.
E-mail: nizeboudjen@hotmail.com
Fax: 213 02 27 59 37
*Ecole Nationale Polytechnique, Laboratoire Techniques Digitales et Systèmes.
10, Avenue Hassen Badi, El Harrach, Algiers, Algeria.
E-mail: farah@ist.cerist.dz

Abstract
This paper deals with a top-down design methodology for an artificial neural network (ANN) based upon a parametric VHDL description of the network. To succeed early in the design process, a highly regular architecture was devised. Then, the parametric VHDL description of the network was written. The description has the advantage of being generic and flexible, and can easily be changed at the user's demand. To validate our approach, an ANN for electrocardiogram (ECG) arrhythmia classification is passed through a synthesis tool, GALILEO, for FPGA implementation.

Key words
ANN, top down design, VHDL, parametric description, FPGA implementation.

Introduction
Engineers have long been fascinated by how efficiently and how fast biological neural networks are capable of performing complex tasks such as recognition. Such networks are capable of recognizing input data from any of the five senses with the accuracy and speed necessary to allow living creatures to survive. Machines which perform such complex tasks, with similar accuracy and speed, were difficult to implement until the technological advances of VLSI circuits and systems in the late 1980's [1]. Since then, VLSI implementation of artificial neural networks (ANNs) has witnessed an exponential growth. Today, ANNs are available as microelectronics components.
The benefit of using such implementations is well described in a paper by R. Lippmann [2]: "The great interest of building neural networks remains in the high speed processing that could be provided through massively parallel implementation". In [3], P. Trealeven and others have also reported that the important design issues of VLSI ANNs are parallelism, performance, flexibility and their relationship to silicon area. To cope with these properties, [3] reported that a good VLSI ANN should exhibit the following architectural properties:
• Design simplicity that leads to an architecture based on copies of a few simple cells.
• Regularity of the structure that reduces wiring.
• Expandability and design scalability that allow many identical units by packing a number of processing units on a chip and interconnecting many chips for a complete system.
Historically, the development of VLSI implementation of artificial neural networks has
been widely influenced by the development in technology as well as in VLSI CAD tools.

Hardware implementation of ANNs can make use of analog or digital technology. A natural question is how to choose between these two technologies. Selection between digital and analog implementation depends on many factors such as speed, precision, flexibility, programmability and memory elements.
Analog implementations have the potential for high densities and fast operation. Unfortunately, they are sensitive to noise, crosstalk, temperature effects and power supply variations. Also, long-term weight storage requires special fabrication techniques. Another major drawback, which is very critical in ANNs, is that conventional analog implementations are fixed (i.e. no programmability can be achieved).
Digital integrated technology, on the other hand, offers very desirable features such as design flexibility, learning, expandable size and precision. Another advantage is that mature and powerful CAD tools support the design of digital VLSI circuits.
Digital implementation of ANNs can make use of full custom VLSI, semi custom, ASICs
(application specific integrated circuits) and FPGAs (Field programmable gate arrays) [4],
[5], [6].
Particularly, FPGA implementation of ANNs is very attractive because of the high flexibility that can be achieved through the reprogrammable nature of these circuits. One might assume that the neural network models developed in computational neuroscience could be directly implemented in silicon. This assumption is false, because when implementing a neural network the designer is confronted with specific problems related to the characteristics of these algorithms, such as processing speed, precision, large memory requirements, parallelism, regularity and flexibility of the architecture. In addition, the designer must fulfil design constraints related to the target application: area and power consumption. Another supplementary constraint which adds, today, to the complexity of these circuits is the quick turnaround design.
Nowadays, with the increasing complexity of VLSI circuits, state-of-the-art design is focused on high level synthesis, which is a top-down design methodology that transforms an abstract description, such as the VHDL language (acronym for Very high speed integrated circuits Hardware Description Language), into a physical implementation level [7], [8], [9].
VHDL-based synthesis tools have become very popular mainly for these reasons: the need to get a correctly working system the first time, technology-independent design, design reusability, the ability to experiment with several alternatives of the design, and economic factors such as time to market. In addition, synthesis tools allow designers with limited knowledge of low-level implementation details to analyze and trade off between alternative implementations without actually implementing the target architecture [9]. Besides this, the VHDL language is well suited to highly regular structures like neural networks.
However, despite all these advantages, little attention has been paid to the use of synthesis for ANN implementations.
In this paper, a new design methodology of ANNs based upon a VHDL synthesis of the
network is applied. The novelty is the introduction of the parametric VHDL description of
the network.
The intended objective is to realize an architecture that takes into account the parallelism, performance, flexibility and their relationship to silicon area, as requested in [3]. After synthesis, the resulting netlist file is mapped into the FPGA XILINX XC4000E family circuits for physical implementation [10]. The paper is organized as follows: in section II, the theoretical background of artificial neural networks is given. Section III describes the design methodology followed. In section IV, the parametric VHDL description

of the ANN is introduced. Section V presents an application to an ECG arrhythmia classifier. Finally, a discussion and conclusion are given in section VI.

II. Theoretical background


An artificial neural network (ANN) is a computing system that combines a network of highly interconnected processing elements (PEs) or neurons (Fig. 1). Connections between neurons are called synapses or connection weights.
Inspired by the physiology of the human brain, the traditional view holds that a neuron performs a simple threshold function on weighted input signals: if the result exceeds a certain threshold, the neuron emits a signal. Fig. 1(a) represents a biological neuron model and Fig. 1(b) an artificial neuron model.
Many different types of ANNs exist: the single layer perceptron, the multilayer perceptron, the Hopfield net and Boltzmann machine, the Hamming net, the Carpenter/Grossberg classifier and Kohonen's self-organizing maps [2]. Each type of ANN exhibits its own architecture (topology) and learning algorithm. From all these types of ANNs, we have chosen to implement the three-layer feed-forward back propagation network (Fig. 1c). This choice is motivated by its highly regular structure, its simple (unidirectional) connections and the great number of problems that can be solved by this kind of neural network, ranging from classification, pattern recognition and image processing to robotics and control applications.

Fig. 1. (a) Biological model neuron. (b) Artificial neuron model. (c) Three layer artificial neural network.

The ANN computation can be divided in two phases: learning phase and recall phase. The
learning phase performs an iterative updating of the synaptic weights based upon the error
back-propagation algorithm [2]. It teaches the ANN to produce the desired output for a set
of input patterns. The recall phase computes the activation values of the neurons from the
output layer according to the weighted values (computed in the learning phase).
Mathematically, the function of the processing elements can be expressed as:

$$x_j^l = \sum_i w_{ij}^l \cdot s_i^{(l-1)} + \theta \qquad (1)$$

where $w_{ij}^l$ is the real-valued synaptic weight between element $i$ in layer $l-1$ and element $j$ in layer $l$, $s_i^{(l-1)}$ is the current state of element $i$ in layer $l-1$, and $\theta$ is the bias value. The current state

of the node is determined by applying the activation function to $x_j^l$. For our implementation, we have selected the logistic activation function:

$$s_j^l = \frac{1}{1 + e^{-x_j^l}} \qquad (2)$$
Training (learning) of an ANN is carried out as follows:
i) Initialize the weights and bias,
ii) Compute the weighted sum of all processing elements from the input to the output layer,
iii) Starting from the output layer and going backward to the input layer, adjust the weights and bias recursively until the weights are stabilized.

It must be mentioned that our aim is to implement the recall phase of a neural network,
which has been previously trained on a standard digital computer where the final synaptic
weights are obtained, i.e. "off- chip training".

III. Design methodology
The proposed approach for the ANN implementation follows a top-down design methodology. As illustrated in Fig. 2, an architecture is first fixed for the ANN. This phase is followed by the VHDL description of the network at the register transfer level (RTL) [8], [13]. Then this VHDL code is passed through a synthesis tool which performs logic synthesis and optimization according to the target technology. The result is a netlist ready for place and route using an automatic FPGA place and route tool. At this level, verification is required before the final FPGA implementation.

Fig.2 Design methodology of the ANN

In the following sections the digital architecture of the ANN is derived, followed by the proposed parametric VHDL description. Synthesis results, placement and routing will be discussed through an application.

III.1 Digital architecture of the ANN

As mentioned in section I, the requirements of ANNs are parallelism, performance, flexibility and their relationship to silicon area (in our case, the number of CLBs). The parallelism of the network is discussed in this section.
Designing a fully parallel ANN requires:
• Layer parallelism, which means that at least one multiplier is needed per layer.
• Processing element (PE) or neuron parallelism, which requires one multiplier per neuron.
• Connection parallelism, which means that all synaptic connections of a neuron are calculated at the same time. In this case, the neuron needs as many multipliers as it has connections to the previous layer.
Connection parallelism is the highest degree of parallelism that can be reached in an ANN. This parallelism leads to very high network performance in terms of processing speed. However, building a large number of multipliers and a large number of connections is a severe penalty for FPGAs because of their limited resources and the excessive delay inherent to FPGAs. To avoid this problem, we consider only the neuron's parallelism. Consequently, data transfer between layers should be serial, because one neuron is chosen to compute only one connection at a time.
Based upon the above ANN hardware requirements, the FPGA-equivalent architectural model of the neuron of Fig. 1b is represented in Fig. 3a. The hardware model is mainly based on:
• a memory circuit (ROM) where the final values of the synaptic weights are stored,
• a multiply-accumulate circuit (MAC) which computes the weighted sum and,
• a look-up table (LUT) which implements the sigmoid activation function.
The resulting ANN architecture of Fig. 1c is represented in Fig. 3b (note that only the second and output layers are represented in this figure), with the following features:
• For the same neuron, only one MAC is used to compute the product sum.
• Each MAC has its own ROM of weights. The depth of each ROM is equal to the number of nodes constituting its input layer.
• For the same layer, neurons are computed in parallel.
• Computation between layers is done serially.
• The whole network is controlled by a control unit.
As we can see in Fig. 3b, the resulting architecture exhibits a high degree of parallelism, simplicity, regularity and repetitiveness.
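As a behavioural illustration of this ROM/MAC/LUT organization (not the actual VHDL of the design), the following sketch mimics the recall phase of one neuron with fixed-point weights and a precomputed 8-bit sigmoid look-up table; the word size and scaling factor are example assumptions:

    import math

    FRAC_BITS = 4                      # example fixed-point scaling (x * 2^4)
    LUT = [round(255 / (1 + math.exp(-(i - 128) / 2 ** FRAC_BITS)))
           for i in range(256)]        # 8-bit sigmoid table, input offset by 128

    def neuron_recall(rom_weights, inputs):
        """Serial MAC over ROM weights, then a LUT lookup (model of Fig. 3a)."""
        acc = 0
        for w, x in zip(rom_weights, inputs):   # multiply-accumulate
            acc += w * x
        addr = max(0, min(255, (acc >> FRAC_BITS) + 128))  # saturate to table
        return LUT[addr]                        # 8-bit activation value

    print(neuron_recall([3, -2, 5], [10, 4, 7]))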

Fig. 3. (a): Neuron hardware model. (b) ANN architecture.



IV. Parametric VHDL description of the ANN

Having fixed the architecture, the next phase is the VHDL description of the ANN. Flexibility is the parameter of interest in this section.
The capabilities of the VHDL language to support parameterized design are the key to providing flexible ANNs that can adapt to different specifications.
Besides its emergence as an industry standard for hardware design, VHDL supports additional features such as encapsulation, inheritance, and reuse within the representation [11].
Encapsulation reduces the number of details that a designer has to deal with, through the
representation of the design as a set of interacting cores. Thus the designer doesn't have
to know how these cores work inside, but rather should focus on defining the appropriate
interfaces between them. Encapsulation is reinforced through the use of packages,
functions, procedures and entity declaration.
Inheritance in VHDL is realized through parameter passing. The general structure of
components and architectures is inherited by new designs. The parameters are passed to
instantiate the design to the specifications of the target application. Inheritance is also
reinforced through component instantiation. Reuse can be realized by constructing
parameterized libraries of cores, macro-cells and packages.
Our approach to the hierarchical VHDL description of the ANN is illustrated in Fig. 4. The VHDL description of the network begins by creating a component neuron; then a component layer is created and finally a network is described.
• Component neuron is composed of a MAC component, a ROM component and a LUT component.
• Component layer is composed of a set of component neurons and multiplexers.
• A network is composed of a set of component layers (input layer, hidden layer and output layer).

Fig. 4 Top view of an artificial neural network

In Fig. 5(a) the VHDL description of the neuron is illustrated. Fig. 5(b) illustrates the layer description. Fig. 5(c) illustrates the network description.
First, a VHDL description of the MAC circuit, the ROM and the LUT memories was done. In order to achieve flexibility, the word size (nb_bits) and the memory depths (nb_addr and nb_add) are kept as generic parameters (Fig. 5(a)).
Second, a VHDL description of the neuron was achieved. The parameters that introduce the flexibility of the neuron are the word size (nb_bits) and the component instantiation. A designer can change the performance of the neuron by choosing other pre-described components stored in a library, without changing the VHDL description of the neuron (Fig. 5(b)).

Third, a layer is described. The parameters that introduce the design flexibility and genericity of the layer are the word size (nb_bits) and the number of neurons (nb_neuron). The designer can easily modify the number of neurons in a layer through simple modifications of the layer's VHDL description (Fig. 5(b)).
Finally, a VHDL description of the network is achieved. The parameters that introduce the flexibility of the network are the neurons' word sizes (n), the number of neurons in each layer (nb_neuron) and the component instantiation of each layer (component layer1, component layer2 and component layer3). The designer can easily modify the size of the network simply by making small changes in the layer descriptions. The designer can also change the performance of the network by using other pre-designed layers (Fig. 5(c)).
Fig. 5. Parametric VHDL description. (a): Neuron description (entity neuron, with the word size nb_bits as a generic, instantiating the MAC, ROM and LUT components). (b): Layer description (entity layer_n, with generics nb_neuron and nb_bits, instantiating component neurons). (c): Network description (entity network, instantiating the layer components).
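The generic-parameter idea can be paraphrased outside VHDL; the following Python sketch (our construction, not the paper's code) mirrors how the word size and neuron counts parameterize neuron, layer and network instances:

    def make_neuron(rom_weights):
        """Analogue of the neuron entity: the ROM contents fix the fan-in."""
        return lambda inputs: sum(w * x for w, x in zip(rom_weights, inputs))

    def make_layer(weight_table):
        """Analogue of the layer entity: one neuron per row of weights."""
        neurons = [make_neuron(row) for row in weight_table]
        return lambda inputs: [n(inputs) for n in neurons]

    def make_network(tables):
        """Analogue of the network entity: one layer instance per table."""
        layers = [make_layer(t) for t in tables]
        def run(x):
            for layer in layers:
                x = layer(x)
            return x
        return run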

V. Case Study: ANN arrhythmia classifier synthesis

V.1 Classifier description


To validate our approach, our first application is an electrocardiogram (ECG) neural network classifier used to distinguish between normal sinus rhythm (NS) and supraventricular tachycardia (SVT).
The system is composed of two cascaded stages: a morphological classifier and a temporal classifier (Fig. 6). The morphological classifier is designed to distinguish between normal (N) and abnormal (A) P and QRS wave patterns of the ECG signal, respectively. The temporal classifier takes the first stage outputs and the PP, PR and RR interval durations of the ECG signal's rhythm, and outputs a classification into two categories: NS or SVT [12]. First, a program was written in C to train the two networks using the back propagation learning algorithm, whereby the final weights are obtained. After training, only the timing classifier, which is composed of an input layer of 5 neurons, a hidden layer of 3 neurons and an output layer of 2 neurons, i.e. a (5-3-2) classifier, was synthesized for FPGA hardware implementation.

Fig. 6. Neural network arrhythmia classifier architecture (a morphological classifier cascaded with a temporal classifier taking the PP, PR and RR intervals of the ECG signal; outputs NS and SVT).

V.2 Synthesis and simulation results


For IC design, the architecture was synthesized using the synthesis tool GALILEO [13]. Before synthesis, simulation is required until the ANN meets the functional specifications. Fig. 7(a) shows the input-output pins of the (5-3-2) classifier. For this application, the data word size is fixed to 8 bits. As high precision is not required, scaling was done in order to reduce the sizes of the look-up tables in each layer. Thus the network outputs are 8 bits wide.
Fig. 7(b) represents the functional simulation results of the ANN. The results show that the required functionality is well achieved.
Once the functionality is verified, the VHDL RTL code is used for synthesis. At this level, and depending on the target technology, which is in our case the FPGA Xilinx, GALILEO transforms the RTL description into a netlist in terms of configurable logic blocks (CLBs). The synthesis tool proceeds to estimate area in terms of CLBs. The output of GALILEO is a table summarizing synthesis results from individual passes as well as the best result for the (5-3-2) network based on the desired performance optimization (speed/area). In this application, we selected area optimization because the ECG signal is slow (0.8 sec per cycle).
Fig. 8 shows the synthesis results of the (5-3-2) network with the XC4000E as target technology. In addition, GALILEO outputs a netlist file (xnf file format) which will be used in the next phase for placement and routing.

Fig. 7. (a): ANN input-output connections. (b): Functional simulation results of the (5-3-2) ANN.

V.3 FPGA implementation


This phase deals with the automatic placement and routing using the XACT tool. At this level, XACT takes as input the netlist file (xnf format) generated by GALILEO. The resulting structure of the ANN is shown in Fig. 9. The ANN is mapped into the 4020EPG223 package. For clarity, only the placed CLBs are shown in Fig. 9. As we can see, the (5-3-2) network is mapped into only one FPGA.

Fig. 8. Galileo synthesis results. Fig. 9. Top view of the ANN FPGA structure.

VI. Discussion and Conclusion


Through this paper we have presented a synthesis methodology for FPGA implementation
of a digital ANN classifier.
The proposed VHDL description is based on a simple, regular and parallel architecture.
The use of the parametric VHDL description offers a high flexibility to the designer
because the same code can be reused to cover a wide range of applications and
performances depending on the pre-designed ANN library.
In addition, the advantage of using synthesis is that the designer can target the circuit for
different libraries (XC3000, XC4000, XC700, Actel ACT2, MAX5000, and ASICs) from
different vendors (Xilinx, Actel, Altera, etc.). After comparing, the designer can choose
the best technology that meets the requested input specifications.
The primary results are very successful since the whole network can be mapped into only
one FPGA.
Our next objective is to test the FPGA circuit in the whole ECG system. In the future, our objective is to include the training phase in the proposed architecture (on-chip training), to extend the proposed ANN description to other application domains (image processing, character recognition, etc.) and to extend the approach to other ANN algorithms (Hopfield, Kohonen, etc.).

References
[1] M. I. Elmasry, "VLSI Artificial Neural Networks Engineering", Kluwer Academic Publishers.
[2] Richard P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, pp. 4-22, April 1987.
[3] Philip Trealeven, Marco Pacheco and Marley Vellasco, "VLSI Architectures for Neural Networks", IEEE MICRO, pp. 8-27, December 1989.
[4] Y. Arima, K. Mashiko, K. Okada, "A Self-Learning Neural Network Chip with 125 Neurons and 10K Self-Organization Synapses", Symposium on VLSI Circuits, pp. 63-64, 1990, IEEE.
[5] H. Ossoing, "Design and FPGA-Implementation of Neural Networks", ICSPAT'96, pp. 939-943.
[6] Charles E. Cox and W. Ekkehard Blanz, "GANGLION - A Fast Field Programmable Gate Array Implementation of a Connectionist Classifier", IEEE JSSC, Vol. 27, No. 3, pp. 288-299, March 1992.
[7] R. Airiau, J. M. Berge, V. Olive, J. Rouillard, "VHDL du langage à la modélisation", Presses Polytechniques et Universitaires Romandes et CNET-ENST.
[8] R. Airiau, J. M. Berge, V. Olive, "Circuit Synthesis with VHDL", Kluwer Academic Publishers.
[9] Daniel Gajski, Nikil Dutt, Allen Wu, Steve Lin, "High Level Synthesis: Introduction to Chip and System Design", Kluwer Academic Publishers.
[10] XACT user manual.
[11] M. S. Ben Romdhane, V. K. Madisetti and J. W. Hines, "Quick-Turnaround ASIC Design in VHDL: Core-Based Behavioral Synthesis", Kluwer Academic Publishers.
[12] N. Izeboudjen and A. Farah, "A New Neural Network System for Arrhythmia's Classification", NC'98, International ICSC/IFAC Symposium on Neural Computation, Vienna, September 23-25, pp. 208-212.
[13] GALILEO HDL Synthesis Manual.
Hardware Implementation Using DSP's of the Neurocontrol of a Wheelchair

P. Martin, M. Mazo, L. Boquete, F.J. Rodriguez, I. Fernández, R. Barea, J.L. Lázaro

Electronics Department, University of Alcalá, Spain. boquete@depeca.alcala.es

Abstract. This paper describes the implementation of a neural network control system for guiding a wheelchair, using an architecture based on a digital signal processor (DSP). The control algorithm is based on a radial basis function network model, the main advantage of which is learning speed. The hub of the architecture is a Texas Instruments DSP (TMS320C31). The board has complete autonomy of action and is specially designed for executing control algorithms in real time. The wheelchair prototype forms part of the SIAMO project, currently being developed by the Electronics Department of the University of Alcalá. Stability conditions are obtained for the correct functioning of the system, and various simulations are conducted to verify its correct behaviour when governing the output of the wheelchair.

1. Introduction

In the field of wheelchairs to aid the mobility of handicapped persons and/or the elderly, there is a need for systems affording users smooth, safe driving conditions with a quick response (above all in situations that are especially dangerous for the user). An important aspect of this aim is the development of a control system able to respond to the user's commands in the shortest possible time and with the greatest accuracy. Two requisites to this end are high-performance hardware and reliable control algorithms governed by a real-time operating system.

A recent computer search revealed that in the period between 1990 and 1995, 9,955 papers were published in which the words "neural network" appear [Narendra, 1996]. 8,000 of them deal with the approximation of functions and recognition of patterns (static systems). Approximately 1,960 papers discuss control problems, only 353 of which deal with their possible applications, and within this group, 45% are theoretical. Of the rest, 28% are computer-based simulations and only 14 papers deal with real applications (it may safely be said that many industrial applications are not published for reasons relating to patent rights). This paper presents a practical case in which a wheelchair is guided by using an inverse control system, where the neurocontroller must generate the control signals which allow the output of the controlled plant to be the appropriate one. One of the main advantages of using a neural network as a controller is that neural networks are universal function approximators which learn on the basis of examples and may be immediately applied in an adaptive control system, due to their capacity to adapt in real time.

When using an inverse control system in which the controller is a neural network, the problem is how to propagate the control error to the adjustable coefficients of the neurocontroller in such a way that the latter vary in the right direction, so that the error is reduced. In short, the problem is how to obtain the sensitivity of each plant output with respect to each input. This problem has been solved in different ways: [Hunt & Sbarbaro, 1991], [Ku & Lee 1995], [Noriega & Wang 1998] and [Boquete et al, 1998] use a neuroidentifier in parallel with the

Figure 1. Wheelchair architecture (safety and environment detection, LonWorks buses, navigation and sensory integration).

physical system to be controlled, which serves as a path for the propagation of the error. This neuroidentifier may be a recurrent neural network or a "feed-forward" network with inputs from different moments of time. Another possibility is that used by [Maeda et al., 1997], who obtain said sensitivity by perturbing each one of the neurocontroller's adjustable coefficients and observing the corresponding variations in each one of the outputs of the plant to be controlled, thereby estimating the Jacobian of the plant.

There are also techniques or adjustment algorithms in which this problem does not arise, since
a stochastic algorithm is used, e.g. Alopex [Venugopal, 1993], where in order to adjust the
controller's coefficients, the correlation between the error function to be minimized and the
variations produced in each coefficient is used.

The use of DSPs is justified because their combination with neural networks results in a powerful system for control tasks [Bona et al, 1997], thanks to their computing power in multiplication and addition operations, and also because their peripherals allow additional hardware to be controlled. The DSP receives the commands, implements the network and sends the commands on to the motors. To do so, it uses an FPGA-implemented odometric system, a dual-port memory to receive the commands (coming from a joystick, voice commands, eye movements, breath expulsion, etc.) and a LonWorks bus (based on a Neuron-Chip) for setting up the communication protocol with the motors [Garcia et al. 1997].

The architecture of the prototype wheelchair in the tests carried out is shown in figure 1. As
may be seen, it comprises: a) an environment-detection and safety system (including infrared
and ultrasound sensors), b) a wheelchair-user interface including such output devices as a
display and voice synthesiser and input devices that allow the user to guide the wheelchair using
different strategies, including joystick, voice commands, breath expulsion and eye movements,
c) low-level control mainly in charge of regulating the motor speed and the speed (angular and
linear) of the wheelchair, d) navigation and sensory integration, whose main functions are dead
reckoning (working from information provided by the encoders in both drive wheels), path
generation (using splines) and avoidance of obstacles and integration of the information from
the various sensors [Mazo et al, 1998].

Communication between the various modules is established through two LonWorks buses and
a parallel bus, which guarantee the response speed and communication reliability required in
these types of applications.

This paper is divided into the following sections: firstly, the control scheme is described.
Section 3 shows the neural model (neurocontroller), the formulas for the adjustment of the
neurocontroller and the system's stability conditions. Section 4 focuses on the hardware
implementation of the system, and lastly section 5 shows the results of the different tests and
the main conclusions of this work.

2. Control scheme

The control scheme used is shown in figure 2. The shaded blocks are implemented in the DSP
board. The input commands $[V\ \Omega]^T$ can either be sent by a manual system implemented in the
joystick or automatically by voice, breath expulsion or eye movements.

The neural control is implemented using a neural network model based on radial basis functions
(RBF). The controller outputs are the speeds of the right-hand and left-hand wheels $[\omega_r\ \omega_l]^T$.
These outputs do not act directly on the wheels but are sent to an electronic system that directly
actuates the drivers of the wheelchair's motors, with a classic PID control loop. The latter
ensures that the speed of each wheel corresponds as closely as possible to that sent by the neural
controller ($\omega_r' \approx \omega_r$ and $\omega_l' \approx \omega_l$). The control scheme is thus divided into two levels: the
PID that implements the low-level control and the second control level ("high level")
implemented by the neural controller, which makes sure the wheelchair's linear and angular
speeds $[V_{RE}\ \Omega_{RE}]^T$ correspond to the command speeds.

The feedback loop of the high-level control is set up via a reading of the wheel speeds
$[\omega_r\ \omega_l]^T$, using an odometric system implemented in a field-programmable gate array (FPGA).
Once the equations have been worked out for modelling the wheelchair, a calculation is made
of the linear and angular speeds $[V_{RE}\ \Omega_{RE}]^T$. The difference between the real and desired speeds
(error function $e$) is used to adjust the coefficients by the gradient descent technique.

Figure 2. Control scheme

3. Neurocontroller model

Figure 3 shows the neurocontroller model. The neural network used is a model with an
architecture based on radial basis functions (RBF). The gradient descent technique is used to
adjust the inter-neuron synaptic connections and the network outputs. The model equations are:

$$y_{Np}(k) = \sum_{i=1}^{M} w_{ip}\, g_i(k), \qquad p = 1, 2 \tag{1}$$

$$g_i(k) = \exp\!\left(-\frac{\lVert x(k) - c_i \rVert^2}{\sigma^2}\right) \tag{2}$$

Figure 3. Neural net used in the controller
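As a reading aid, here is a minimal NumPy sketch of the forward pass of equations (1)-(2); the input vector x, the centres c_i and the width sigma are left generic because the text does not fully specify the controller inputs.

```python
import numpy as np

def rbf_forward(x, centres, sigma, W):
    """Equations (1)-(2): M Gaussian units feeding two linear outputs
    (the right- and left-wheel speed commands).
    x: input vector; centres: (M, dim) array; W: (M, 2) output weights."""
    g = np.exp(-np.sum((centres - x) ** 2, axis=1) / sigma ** 2)  # eq. (2)
    y = g @ W                                                     # eq. (1), p = 1, 2
    return y, g
```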

In the scheme shown in figure 2, the neurocontroller coefficients have to be adjusted on line,
the error (equation 3) being reduced to the minimum by the error backpropagation technique:

$$e(k) = \frac{1}{2}\left(V_{ref} - V_{RE}\right)^2 + \frac{1}{2}\left(\Omega_{ref} - \Omega_{RE}\right)^2 \tag{3}$$

The problem posed is the propagation of the error through the wheelchair dynamics. In this case
a behaviour study of the wheelchair is made, thereby obtaining its Jacobian to adjust the
neurocontroller. With this alternative the following is obtained:

$$V = \frac{R}{2}\left(\omega_r + \omega_l\right), \qquad \Omega = \frac{R}{D}\left(\omega_r - \omega_l\right) \tag{4}$$

where R is the radius of the drive wheels and D the inter-wheel distance (R = 16 cm, D = 54 cm).

From these equations we obtain:

$$J = \begin{pmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial V}{\partial \omega_r} & \dfrac{\partial V}{\partial \omega_l} \\[2mm] \dfrac{\partial \Omega}{\partial \omega_r} & \dfrac{\partial \Omega}{\partial \omega_l} \end{pmatrix} = \begin{pmatrix} \dfrac{R}{2} & \dfrac{R}{2} \\[2mm] \dfrac{R}{D} & -\dfrac{R}{D} \end{pmatrix} \tag{5}$$

Thus, the neurocontroller adjustment equations are:

$$\Delta w_{i1} = -\alpha\left(V_{ref} - V_{RE}\right) J_{11}\, g_i - \alpha\left(\Omega_{ref} - \Omega_{RE}\right) J_{21}\, g_i$$
$$\Delta w_{i2} = -\alpha\left(V_{ref} - V_{RE}\right) J_{12}\, g_i - \alpha\left(\Omega_{ref} - \Omega_{RE}\right) J_{22}\, g_i \tag{6}$$
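The following is a sketch of one on-line adjustment step under these equations; the sign convention follows (6) as printed, and the function interface is an assumption.

```python
import numpy as np

R, D = 0.16, 0.54                       # drive-wheel radius and inter-wheel distance (m)
J = np.array([[R / 2.0, R / 2.0],       # eq. (5): kinematic Jacobian
              [R / D, -R / D]])

def adjust_weights(W, g, v_err, omega_err, alpha):
    """Equation (6): delta_w_ip = -alpha * (e1*J_1p + e2*J_2p) * g_i,
    with e1 = Vref - VRE and e2 = OmegaRef - OmegaRE."""
    e = np.array([v_err, omega_err])
    W += -alpha * np.outer(g, J.T @ e)  # (M, 2) update, one column per output
    return W
```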

Analysis of stability

In this section we find a maximum value of the learning factor ($\alpha$) which ensures that the
training error decreases, or at least does not increase, at all times. For this, a vector $W$
containing all the adjustable coefficients of the network is considered. The variation of vector
$W$ is:

$$\Delta W(k) = -\alpha\left(V_{ref}(k) - V(k)\right)\frac{\partial V(k)}{\partial W(k)} - \alpha\left(\Omega_{ref}(k) - \Omega(k)\right)\frac{\partial \Omega(k)}{\partial W(k)} = -\alpha\, e_1(k)\frac{\partial V(k)}{\partial W(k)} - \alpha\, e_2(k)\frac{\partial \Omega(k)}{\partial W(k)} \tag{7}$$

The increase in the function $E(k)$ is:

$$\Delta E(k) = E(k+1) - E(k) = \frac{1}{2}\left[e_1(k+1)^2 + e_2(k+1)^2 - e_1(k)^2 - e_2(k)^2\right] = \Delta e_1(k)\left[e_1(k) + \tfrac{1}{2}\Delta e_1(k)\right] + \Delta e_2(k)\left[e_2(k) + \tfrac{1}{2}\Delta e_2(k)\right] \tag{8}$$

The above expression can be made negative in the following way:

$$0 < \alpha < \frac{1}{\left\lVert \dfrac{\partial V(k)}{\partial W} \right\rVert^2}, \qquad 0 < \alpha < \frac{1}{\left\lVert \dfrac{\partial \Omega(k)}{\partial W} \right\rVert^2} \tag{9}$$

where the vector $W$ is made up of all the coefficients adjusted in each sampling cycle.

Applying the results indicated by equations (9), and considering that the neurocontroller + chair
unit is a single neural network, the following conditions must be fulfilled in the control system
of Figure 2:

$$0 < \alpha < \frac{1}{\left\lVert \dfrac{\partial V_{RE}(k)}{\partial W} \right\rVert^2}, \qquad 0 < \alpha < \frac{1}{\left\lVert \dfrac{\partial \Omega_{RE}(k)}{\partial W} \right\rVert^2} \tag{10}$$
with:

$$W = \left[w_{11}, w_{12}, \ldots, w_{M1}, w_{M2}\right]^T \tag{11}$$

it results in:

$$\left\lVert \frac{\partial V_{RE}(k)}{\partial w_{i1}} \right\rVert = J_{11}\, g_i(k) \le J_{11} = \frac{R}{2} \tag{12}$$

$$\left\lVert \frac{\partial V_{RE}(k)}{\partial w_{i2}} \right\rVert = J_{12}\, g_i(k) \le J_{12} = \frac{R}{2} \tag{13}$$

$$\left\lVert \frac{\partial \Omega_{RE}(k)}{\partial w_{i1}} \right\rVert = J_{21}\, g_i(k) \le J_{21} = \frac{R}{D} \tag{14}$$

$$\left\lVert \frac{\partial \Omega_{RE}(k)}{\partial w_{i2}} \right\rVert = \left|J_{22}\right| g_i(k) \le \left|J_{22}\right| = \frac{R}{D} \tag{15}$$

With the physical values of the chair, the most unfavourable case is that indicated by equations
(14) or (15). In short, the maximum value of the learning factor which may be used in the
control system of Figure 2 is:

$$0 < \alpha < \frac{1}{2M\left(\dfrac{R}{D}\right)^2} \tag{16}$$
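As a quick numeric check (our own arithmetic, not taken from the paper): with the chair's physical values and the 16-neuron network used later, the bound of equation (16) evaluates to roughly 0.36, comfortably above the learning rate of 0.1 employed in the tests.

```python
M, R, D = 16, 0.16, 0.54                  # neurons, wheel radius (m), wheel spacing (m)
alpha_max = 1.0 / (2 * M * (R / D) ** 2)  # equation (16)
print(round(alpha_max, 3))                # -> 0.356
```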

4. Description of the board

Figure 4 shows the board's block diagram. A description is given below of the components
therein. The hub of the board is the DSP from Texas Instruments (TMS320C31). The use
of a digital signal processor rather than a general-purpose one is justified by the fact that its
architecture is specially designed to tackle the computational type and load needed for the
execution of the proposed algorithms. The most important of the devices complementing the
DSP are the 256K x 32 RAM and the software-loading storage, which allows the board to
work with complete autonomy. A high-performance, user-programmable device is used for
implementing the other functions: the encoder reader and the calculation of the position and
speed of the drive wheels.

Figure 4. Block Diagram of the Board

Communication with the exterior can be effected using two different communication protocols.
The control stage, for example, sends the orders to the motor actuators through the LonWorks
bus. The central unit and the path generator make use of a second communication protocol
(the parallel bus in figure 1) by means of mailboxes in a double-port RAM, through which
commands (angular and linear wheelchair speeds) are sent to the control stage.

5. Results and conclusions

A debugging tool of the DSP family TMS320C3X was used for measuring the execution times
of the neural control algorithms. The algorithm execution period was established at 100 ms,
conditioned by the response of the wheelchair motors. The total execution time of the control
algorithms for a 16-neuron network was 0.937 ms. In figure 5 an example of adaptive control
is shown, in which at certain moments a person sits in the chair (t=50 s), stands up
(t=90 s) and sits down again (t=140 s). The wheelchair is following a circle with a
radius equal to 1 m. The parameters used in the neurocontroller are: M=16 neurons, $\sigma$=1.8
and $\alpha$=0.1.

Figure 5. Example of adaptive control

6. Acknowledgements

This work has been carried out thanks to the grants received from CICYT (Interministerial
Science and Technology Committee, Spain), project TER96-1957-C03-01.

References

Bona, B., Carabelli, S., Chiaberge, M., Miranda, E. and Reyneri, L. M. "Neuro-Fuzzy hardware
and DSPs: a promising marriage for control of complex systems". MICRONEURO'97,
Dresden, September 1997.

Boquete, L., et al. "Control with Reference Model Using Recurrent Neural Networks". In:
International ICSC/IFAC Symposium on Neural Computation, September 1998, pp. 506-511.

Garcia, J.C. et al. "An Autonomous Wheelchair with a LonWorks Network based Distributed
Control System". The Spring 97 LonUsers International Conference and Exhibitions, May
1997, Santa Clara.

Hunt, K. J. and Sbarbaro, D. "Neural networks for nonlinear internal model control". IEE
Proceedings-D, Vol. 138, No. 5, 1991.

Ku, C. C. and Lee, K. Y. "Diagonal Recurrent Neural Networks for Dynamic Systems
Control". IEEE Transactions on Neural Networks, Vol. 6, No. 1, January 1995.

Maeda, Y. and Figueiredo, R. J. P. "Learning Rules for Neuro-Controller via Simultaneous
Perturbation". IEEE Transactions on Neural Networks, Vol. 8, No. 5, September 1997.

Mazo, M. et al. "Integral System for Assisted Mobility". 2nd International Workshop on
Intelligent Control (IC'98), JCIS'98 Proceedings, pp. 361-364, 1998.

Narendra, K. "Neural Networks for Control: Theory and Practice". Proceedings of the IEEE,
Vol. 84, No. 10, October 1996.

Noriega, J. R. and Wang, H. "A Direct Adaptive Neural-Network Control for Unknown
Nonlinear Systems and its Application". IEEE Transactions on Neural Networks, Vol. 9, No. 1,
January 1998.

TMS320C3X Technical Handbook, Texas Instruments, 1997.

Venugopal, K. "Learning in Connectionist Networks Using the Alopex Algorithm". PhD
Thesis, Florida Atlantic University, Boca Raton, Florida, April 1993.
Forward-Backward Parallelism in On-Line
Backpropagation
Rafael Gadea Gironés and Antonio Mocholí Salcedo
Dpto. Ing. Electrónica, U.P.V.

Abstract
The paper describes the implementation of a systolic array for a multilayer perceptron on ALTERA FLEX10KE
FPGAs with a hardware-friendly learning algorithm. A pipelined adaptation of the on-line backpropagation
algorithm is shown. It better exploits the parallelism because both the forward and backward phases can be
performed simultaneously. As a result, a combined systolic array structure is proposed for both phases. Analytic
expressions show that the pipelined version is more efficient than the non-pipelined version. The design is
implemented and simulated using VHDL at different levels of abstraction and finally mapped on FPGAs.

1. Introduction
In recent years it has been shown that neural networks are capable of providing solutions to
many problems in the areas of pattern recognition, signal processing, time series analysis, etc.
While software simulations are very useful for investigating the capabilities of neural network
models and creating new algorithms, hardware implementations are essential to take full
advantage of the inherent parallelism of neural networks.
To organize the ideas described below, a careful examination of the parallelism inherent in
artificial neural networks (ANN) is useful. A casual inspection of the standard equations used to
describe backpropagation reveals two obvious degrees of parallelism in an ANN. Firstly, there is
parallel processing by the many nodes in each layer. Secondly, there is parallel processing
across the many training examples.
The former comes to mind most easily when parallel aspects of ANNs are being considered.
The network may be partitioned by distributing the synaptic coefficients and neurons throughout
a processor network. Again, there are two variations of this technique: "neuron-oriented
parallelism" and "synapse-oriented parallelism".
In the first variation, the neurons are distributed among the available processors. However, it is
difficult to place the neurons in such a way as to produce efficient implementations, which
require both an evenly distributed computational load (easy) and reduced data communications
(difficult) [1].
The second variation is based on the fact that the computations in a neural network are
basically matrix products [2],[3],[4] and [5]. The advantage of this approach is that the amount
of data communicated between processors is moderate and evenly distributed, although in a
multilayer perceptron the synaptic matrix is lower triangular. It can be more interesting to
perform the matrix product in an implementation distributed by layers (matrix-vector representation) [6].
The latter, which we will refer to as "training set parallelism", is perhaps the most useful. In
backpropagation, this parallel aspect is the result of the linear combination of the individual
contributions made by each training pattern to the adjustment of the network weights. The
linearity implies that the patterns can be processed independently and hence, simultaneously.
However, this implementation requires that the weights be updated after all the parallel-processed
training patterns have been seen: so-called "batch" updating [7].
A third, but less obvious, aspect of backpropagation stems from the fact that forward and
backward passes of different training patterns can be processed in parallel. In a work by
Rosemberg and Belloch [8] with the Connection Machine, the authors noted this possibility in their
implementation, though it remained unimplemented. Later, A. Petrowski et al. [9] described a
theoretical analysis and experimental results with transputers. However, only the batch-line
version of the backpropagation algorithm was shown. The possibility of an on-line version was noted
by the authors in general terms, but it was not pursued with systematic experiments and
theoretical investigations. In [10] we show that this parallelism, which we will refer to as
"forward-backward parallelism", has a good performance in convergence time and generalization
rate, and we begin to show the better hardware performance of the pipelined on-line backpropagation
in terms of speed of learning. In this paper our main purpose is to quantify this
improvement in speed in a hardware implementation on the Altera FLEX10K50 and to show the
hardware costs of this pipelined on-line backpropagation, always compared to standard
backpropagation.
In section 2 pipelined on-line backpropagation is presented and proposed. Section 3 studies the
latency, throughput and efficiency of this algorithm compared to a non-pipelined algorithm. An
alternating orthogonal systolic array is used for these measurements of hardware performance.
The methodology of design using VHDL is described in Section 4. Also in this section, the
implementation properties when compiling on FLEX10K FPGAs from Altera are evaluated.

2. Pipeline and Backpropagation Algorithm

2.1 Initial point

The starting point of this study is the backpropagation algorithm in its on-line version. We
assume we have a multilayer perceptron with three layers: two hidden layers and the output layer.

The phases involved in backpropagation, taking one pattern m at a time and updating the
weights after each pattern (on-line version), are as follows:

a) Forward phase. Apply the pattern to the input layer and propagate the signal forwards
through the network until the final outputs $a_i^L$ have been calculated for each $i$ and $l$:

$$a_i^l = f\!\left(u_i^l\right), \qquad u_i^l = \sum_{j=0}^{N_{l-1}} w_{ij}^l\, a_j^{l-1}, \qquad 1 \le i \le N_l,\ 1 \le l \le L \tag{1}$$

b) Error calculation step. Compute the $\delta$'s for the output layer $L$ (where $t_i$ is the target
output) and compute the $\delta$'s for the preceding layers by propagating the errors backwards using

$$\delta_i^L = f'\!\left(u_i^L\right)\left(t_i - a_i^L\right), \qquad \delta_i^{l-1} = f'\!\left(u_i^{l-1}\right)\sum_{j=1}^{N_l} w_{ji}^l\, \delta_j^l, \qquad 1 \le i \le N_l,\ 1 \le l < L \tag{2}$$

c) Weight update step. Update the weights using

$$w_{ij}^{l,(m)} = w_{ij}^{l,(m-1)} + \Delta w_{ij}^{l,(m)}, \qquad \Delta w_{ij}^{l,(m)} = \eta\, \delta_i^l\, a_j^{l-1}, \qquad 1 \le i \le N_l,\ 1 \le l \le L \tag{3}$$

All the elements in (3) are available at the same time as the elements necessary for the error
calculation step; therefore it is possible to perform these two last steps simultaneously (during the
same clock cycle) in this on-line version and to reduce the number of steps to two: the forward
step (1) and the backward step, (2) and (3). However, in the batch-line version the weight update is
performed at the end of an epoch (set of training patterns) and this approximation would be
impossible.
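For reference, a compact NumPy sketch of one on-line step implementing (1)-(3); the sigmoid non-linearity and the bias handling are common choices rather than details fixed by the text.

```python
import numpy as np

def f(u):  return 1.0 / (1.0 + np.exp(-u))   # sigmoid activation
def df(u): return f(u) * (1.0 - f(u))        # its derivative

def online_bp_step(weights, x, target, eta):
    """One pattern: forward phase (1), error calculation (2), update (3).
    weights: list of (N_l, N_{l-1}+1) matrices, last column = bias."""
    a, us = [np.asarray(x, float)], []
    for W in weights:                         # forward phase, eq. (1)
        u = W @ np.append(a[-1], 1.0)
        us.append(u)
        a.append(f(u))
    delta = df(us[-1]) * (np.asarray(target, float) - a[-1])   # output deltas
    for l in range(len(weights) - 1, -1, -1): # backward phase, eqs. (2)-(3)
        prev = df(us[l - 1]) * (weights[l][:, :-1].T @ delta) if l > 0 else None
        weights[l] += eta * np.outer(delta, np.append(a[l], 1.0))
        delta = prev
    return a[-1]
```

Note that the back-propagated deltas are formed before the layer's weights are overwritten, which mirrors the simultaneity of steps b) and c) described above.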

2.2 Pipeline versus non-pipeline

Non-pipeline: The algorithm takes one training pattern m. Only when the forward
step is finished in the output layer can the backward step for this pattern occur. When this step
reaches the input layer, the forward step for the following training pattern can start (Figure 1).

Figure 1. Non-pipeline
Figure 2. Pipeline

In each step only the neurons of each layer can operate simultaneously, and so this is the
only degree of parallelism for one pattern. However, this disadvantage means we can share the
hardware resources for both phases, because these resources are practically the same (matrix-vector
multiplication).

Pipeline: The algorithm takes one training pattern m and starts the forward phase in the first
layer. Figure 2 shows what happens at this moment (in this step) in all the layers of the
multilayer perceptron: in each step, every neuron in each layer is busy working simultaneously,
using two degrees of parallelism: synapse-oriented parallelism and forward-backward
parallelism. Of course, in this type of implementation, the hardware resources of the forward and
backward phases cannot be shared. In the following section we will see how, in spite of this
problem, the pipeline version of the proposed systolic array is more efficient than the non-pipeline
version.
Evidently, the pipeline carries an important modification of the original backpropagation
algorithm [11],[12]. This is clear because the alteration of the weights at a given step interferes with
the computation of the states $a_i^l$ and errors $\delta_i^l$ for patterns taken at different steps in the network. For
example, let us observe what happens with a pattern m on its way through the network during
the forward phase (from input to output). In particular, we will take into account the last
pattern that has modified the weights of each layer. We can see:

1. For the first layer, the last pattern to modify the weights of this layer is the pattern m-5.
2. When our pattern m passes the second layer, the last pattern to modify the weights of this layer
will be the pattern m-3.
3. Finally, when the pattern reaches the layer L, the last pattern to modify the weights of this
layer will be the pattern m-1.

Of course, the other patterns also contribute. The patterns which had modified the weights
before patterns m-5, m-3 and m-1 are patterns m-6, m-4 and m-2 for the three layers
respectively. In the non-pipeline version, the pattern m-1 is always the last pattern to modify the
weights of all the layers. It is curious to note that when we use the momentum variation of the
backpropagation algorithm with the pipeline version, the last six patterns before the current
pattern contribute to the weight updates, while with the non-pipeline version, only the last two
patterns before the current pattern contribute.
It is important that the equations for the two phases perform in the same manner as in the non-pipeline
version. For this, it will be necessary to store the values of the sigmoids and their derivatives
(see the following section). Therefore, we have a variation of the original on-line
backpropagation algorithm that consists basically of a modification of the contribution of the
different patterns of a training set to the weight updates, along the same lines as the momentum
variation.
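The staleness pattern described above can be written down directly; the following sketch (our illustration) returns, for a three-layer network, the index of the pattern whose backward pass most recently updated a given layer when pattern m traverses it forward.

```python
L = 3  # number of weight layers in the perceptron

def last_updater(m, layer):
    """Pipelined version: layer 1 was last updated by pattern m-5,
    layer 2 by m-3 and layer L by m-1 (non-pipelined: always m-1)."""
    return m - (2 * (L - layer) + 1)

print([last_updater(100, l) for l in (1, 2, 3)])  # -> [95, 97, 99]
```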
3. Hardware performance of the pipeline systolic architecture
The aim of this section is to characterize the hardware performance of the pipelined on-line
backpropagation algorithm compared with the non-pipelined version. For this purpose we employ
the "alternating orthogonal systolic array" [6] and use the following metrics:

1) Throughput rate (measured as the number of clock cycles between two processed
patterns)
2) Latency (clock cycles required to process one pattern)
3) Array efficiency.
The array efficiency metric provides a measure of PE (processing element) and pipeline
usage over the period required to process one pattern. We measure the efficiency of a parallel
algorithm by the common ratio S/P [13], where S is the speedup and P is the number of
processors in the network. The speedup is given by the ratio $S = t_{ser}/t_{par}$, where $t_{ser}$ and $t_{par}$ are the
sequential and the parallel computation times, respectively.

3.1 Initial point

We suppose that we have an MLP (multilayer perceptron) with three layers and with the
following characteristics:
NE = number of inputs.
N1O = number of neurons in the first hidden layer.
N2O = number of neurons in the second hidden layer.
NS = number of outputs.
The dimensions of the hidden layers are considered more flexible than those of the input or
output layers, which are often restricted by the format of the data used in the neural application.
If we implement this MLP by means of the "alternating orthogonal systolic array", we have a
structure as shown in Figure 3 for the concrete case NE = 2, N1O = 5, N2O = 2, NS = 2 plus bias
neurons.

Figure 3

We call the white PEs "synapse units" and the black PEs "neuron units". The former units
perform the MAC (multiply-and-accumulate) operation as well as the weight update operation;
the latter units perform the sigmoid function and its derivative, and the multiplication of the
learning rate $\eta$ by the error. They also calculate the error term at the output layer.

3.2 Analytical expressions

To quantify this, we assume that the "synapse units" and the "neuron units" perform their
operations in one clock cycle.

non-pipeline: For the two phases of the backpropagation algorithm we have the following time costs:
- Forward phase: (NE + N1O + N2O + 5) cycles
- Backward phase: (NS + N2O + N1O + NE + 5) cycles

Therefore the latency is simply the sum of these costs:

Latency = (2NE + 2N1O + 2N2O + NS + 10) cycles    (4)

The throughput rate in the non-pipeline version of the BP algorithm is practically the same as the
latency, because a pattern needs the previous pattern to have finished its two phases of training:

Throughput = (NE + 2N1O + 2N2O + NS + 9) cycles    (5)

To measure the efficiency, we consider our object of analysis to be the MLP
performing one epoch of training (b being the number of patterns). The duration t_par of the
presentation of an epoch for the non-pipelined BP algorithm is given by:

t_par = Latency (one pattern) + Throughput (b-1) = (2NE + 2N1O + 2N2O + NS + 10) cycles
+ (NE + 2N1O + 2N2O + NS + 9)(b-1) cycles    (6)

If b is very high we can approximate:

t_par = (NE + 2N1O + 2N2O + NS + 9) b cycles

It can easily be shown that:

t_ser = [2(NE+1)N1O + 2(N1O+1)N2O + 2(N2O+1)NS + 2N1O + 2N2O + NS] b cycles    (7)

It is evident that our alternating orthogonal systolic array has the following number of processing units
(taking away the bias PEs):

P = NE + 3N2O + 4    (8)

Therefore the efficiency is given by:

Efficiency = (t_ser / t_par) x (1/P) =
= [2(NE+1)N1O + 2(N1O+1)N2O + 2(N2O+1)NS + 2N1O + 2N2O + NS] /
[(NE + 2N1O + 2N2O + NS + 9)(NE + 3N2O + 4)]    (9)

pipeline: For the two phases of the backpropagation algorithm we have these time costs:
- Forward phase: (NE + N1O + N2O + 5) cycles
- Backward phase: (NS + N2O + N1O + NE + 5) cycles

Therefore the latency is simply the sum of these costs:

Latency = (2NE + 2N1O + 2N2O + NS + 10) cycles    (10)

This expression is exactly the same as for the non-pipeline version (4), which shows that the
latency does not improve with the pipeline.
The throughput rate is, however, greatly affected by the application of this variation. In the
particular case of an alternating orthogonal systolic array, with three layers distributed as vertical
layer - horizontal layer - vertical layer, the throughput is given by:

Throughput = (N1O + 1) cycles    (11)


The simulation results validate equation (11), showing that the throughput depends directly on
the number of neurons in the first hidden layer.
In the same way as with the non-pipeline version, we give the expression of the
efficiency for one epoch (b being the number of patterns). The duration t_par of the presentation of
an epoch for the pipelined BP algorithm is given by:

t_par = Latency (one pattern) + Throughput (b-1) = (2NE + 2N1O + 2N2O + NS + 10) cycles
+ (N1O + 1)(b-1) cycles    (12)

It is evident that for our alternating orthogonal systolic array in the pipeline version the number of
processing units is:

P = NE + 3N2O + 4 + (NE + 2N2O + 2)    (13)

This equation shows an increase in the number of processing units because the MAC
operations of the synapse units for the forward and backward phases cannot work simultaneously, and
so we duplicate the quantity of synapse units.

Therefore the efficiency is given by:

Efficiency = (t_ser / t_par) x (1/P) =
= [2(NE+1)N1O + 2(N1O+1)N2O + 2(N2O+1)NS + 2N1O + 2N2O + NS] b /
{[(2NE + 2N1O + 2N2O + NS + 10) + (N1O + 1)(b-1)] [NE + 3N2O + 4 + (NE + 2N2O + 2)]}    (15)

If we compare the expressions obtained, the better efficiency and the clear improvement in the
throughput of the pipeline version stand out. We must remember that the number of
connections updated per second will be directly proportional to the frequency of our
implementation and inversely proportional to this throughput.
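The comparison can be reproduced numerically; this sketch evaluates equations (4)-(6) and (10)-(12) for the network dimensions used in Section 4 (the function name and packaging are ours).

```python
def cycle_counts(NE, N1O, N2O, NS, b):
    """Latency, throughputs and epoch durations (in cycles) for the
    non-pipelined and pipelined versions, equations (4)-(6), (10)-(12)."""
    latency = 2 * NE + 2 * N1O + 2 * N2O + NS + 10   # eqs. (4) and (10)
    thr_nonpipe = NE + 2 * N1O + 2 * N2O + NS + 9    # eq. (5)
    thr_pipe = N1O + 1                               # eq. (11)
    epoch_nonpipe = latency + thr_nonpipe * (b - 1)  # eq. (6)
    epoch_pipe = latency + thr_pipe * (b - 1)        # eq. (12)
    return latency, thr_nonpipe, thr_pipe, epoch_nonpipe, epoch_pipe

# 3 inputs, 20 and 10 hidden neurons, 4 outputs, one epoch of 1000 patterns:
print(cycle_counts(3, 20, 10, 4, 1000))  # throughput 76 vs 21 cycles/pattern
```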

4. Implementation and Verification

This section directly compares the implementation properties of pipelined on-line BP with the
standard BP algorithm when using the same technology: ALTERA FLEX10KE FPGAs.
4.1 Design entry with VHDL
The design entry of the pipelined on-line BP, and of the classical on-line BP used for
comparison, is accomplished in VHDL. It is very important to make these system descriptions
independent of the physical hardware, because our objective in the future will be to test our
descriptions on other FPGAs and even on ASICs.
We have made eight VHDL testbenches to perform the simulations shown in Figure 5: four for
the pipeline version and four for the non-pipeline version. The VHDL description of the
"alternating orthogonal systolic array" (always the unit under test) is totally configurable by
means of generics and generate statements whose values are obtained from three ASCII files:
- Database file: number of inputs, number of outputs, the training and validation patterns.
- Learning file: number of neurons in the first hidden layer, number of neurons in the second
hidden layer, type of learning (on-line, batch-line or BLMS), value of the learning rate $\eta$, value
of the momentum rate, type of sigmoid (binary or bipolar), etc.
- Resolution file: resolution of weights (integer and decimal part), resolution of activations,
resolution of accumulator (integer and decimal part), etc.

4.2 Speed and resource usage

Table 1. Speed. Measured clock periods (ns) and operating frequencies (MHz) of the neuron
and synapse units in each layer (vertical, horizontal, vertical), for the pipeline and non-pipeline
versions. Total performance: 198.9 MCPS and 198.9 MCUPS at 12 MHz for the pipeline
version, against 261.7 MCPS and 77.84 MCUPS at 17 MHz for the non-pipeline version.

Table 1 shows how the implementation of the pipeline version affects the frequency of operation.
This effect is evident in the synapses because, in the pipeline version, it is necessary to perform two read
and one write operations on the embedded dual-port RAM which stores the weights. Although the
FLEX10KE permits simultaneous read and write operations, the cycle period must be shared among
these three operations. However, this increase in period does not prevent the speed performance
of the pipeline version from being much better than that of the non-pipeline (standard) version in the number of
Connections Updated Per Second. The results in the last row were obtained for a multilayer
perceptron as in Figure 4 but with the following parameters: 3 inputs, 4 outputs, 20 neurons in
the first hidden layer and 10 in the second hidden layer.

Table 2. Area. Primitive counts (CARRY, DFFE, LUT, LUT CARRY, lpm_mult multipliers and
dual-port RAM megafunctions) and resource usage (I/O pins, logic cells and embedded cells) of
the EPF10K50EQC240-1 FLEX10KE device, as reported by FPGA Express and MAX+PLUS II,
for the neuron and synapse units of each layer, for the pipeline and non-pipeline versions.

Table 2 shows the resource usage for the two versions, supposing that the number of neurons of the
first hidden layer is less than 32. We have used the FASTEST style for the implementation and
optimization, and we have mapped all the memory elements (FIFO and RAM) onto embedded array
blocks (EABs) of the FPGA by means of ALTERA megafunctions. We can observe that the
hardware cost of pipelining the backpropagation algorithm is higher in the synapses than in the
neurons, and arises fundamentally because the pipeline version needs separate multipliers
and accumulators for the forward and backward phases.

5. Conclusions
This paper evaluates the hardware performance of the pipelined on-line backpropagation
algorithm. This algorithm removes some of the drawbacks that traditional backpropagation
suffers when implemented on VLSI circuits. It may go on to offer considerable improvements,
especially with respect to hardware efficiency and speed of learning, although the circuitry is
more complex.

We believe this paper contributes new data to the classical contention between researchers
who update network weights continuously (on-line) and those updating weights only after some
subset, or often after the entire set, of training patterns has been presented to the network (batch-line).
Until now, batch updating after the entire training set has been processed (i.e. after each
epoch) was preferred in order to best exploit "training set parallelism" and "forward-backward
parallelism". Now, we can see that to exploit all the degrees of parallelism, we can use the on-line
version of backpropagation without degradation of its properties.

[1] A. Singer, "Implementations of artificial neural networks on the Connection Machine",
Parallel Computing, vol. 14, 1990, pp. 305-315.
[2] S. Shams and J.L. Gaudiot, "Implementing Regularly Structured Neural Networks on the
DREAM Machine", IEEE Transactions on Neural Networks, vol. 6, no. 2, March 1995, pp. 408-421.
[3] W.-M. Lin, V. K. Prasanna, and K. W. Przytula, "Algorithmic Mapping of Neural Network
Models onto Parallel SIMD Machines", IEEE Transactions on Computers, vol. 40, no. 12,
December 1991, pp. 1390-1401.
[4] S.R. Jones, K.M. Sammut, and J. Hunter, "Learning in Linear Systolic Neural Network
Engines: Analysis and Implementation", IEEE Transactions on Neural Networks, vol. 5, no. 4, July
1994, pp. 584-593.
[5] D. Naylor, S. Jones, and D. Myers, "Backpropagation in Linear Arrays - A Performance
Analysis and Optimization", IEEE Transactions on Neural Networks, vol. 6, no. 3, May 1995,
pp. 583-595.
[6] P. Murtagh, A.C. Tsoi, and N. Bergmann, "Bit-serial array implementation of a multilayer
perceptron", IEE Proceedings-E, vol. 140, no. 5, September 1993, pp. 277-288.
[7] X. Zhang, M. McKenna, J.P. Mesirov, and D. Waltz, "An efficient implementation of the
backpropagation algorithm on the Connection Machine CM-2", Advances in Neural Information
Processing Systems 2, D.S. Touretzky, Ed., San Mateo, CA: Morgan Kaufmann, 1990, pp. 801-809.
[8] C.R. Rosemberg and G. Belloch, "An implementation of network learning on the Connection
Machine", Connectionist Models and their Implications, D. Waltz and J. Feldman, eds., Ablex,
Norwood, NJ, 1988.
[9] A. Petrowski, G. Dreyfus, and C. Girault, "Performance Analysis of a Pipelined
Backpropagation Parallel Algorithm", IEEE Transactions on Neural Networks, vol. 4, no. 6,
November 1993, pp. 970-981.
[10] R. Gadea, A. Mocholí, "Systolic Implementation of a Pipelined On-Line Backpropagation",
Proc., April 1998.
[11] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning internal representations by error
backpropagation", Parallel Distributed Processing, vol. 1, MIT Press, Cambridge, MA, 1986, pp.
318-362.
[12] S.E. Fahlman, "Faster learning variations on backpropagation: An empirical study", Proc.
1988 Connectionist Models Summer School, 1988, pp. 38-50.
[13] M.R. Zargham, Computer Architecture: Single and Parallel Systems, Prentice Hall
International Inc., 1996.
A VLSI Approach for Spike Timing Coding

E. Ros, F.J. Pelayo, I. Rojas, F.J. Fernández, A. Prieto

Departamento de Arquitectura y Tecnología de Computadores

Universidad de Granada, 18071 Granada, Spain. E-mail: eduardo@atc.ugr.es

Abstract

The paper describes a VLSI-viable integrate-and-fire neuron model with an easily
controllable firing threshold that can be used to induce synchronization processes. The
circuits are intended to exploit both rate and spike time coding schemes, taking
advantage of these synchronization processes to accelerate processing tasks. In this way
the temporal domain can be exploited in neural computation architectures. A simple
neural structure is also discussed, providing simulation results to illustrate how these
time-coded signals can be combined to perform a simple processing task such as
coherent input detection.

1. Introduction

The way in which biology manages to perform complex processing tasks in a very short
time is still unclear. An efficient exploitation of the temporal domain may be the key.
Most spiking neuron models use rate coding, but several biological studies have revealed
the important role that single spike timing may play in biological processing schemes.
In fact, although some processing tasks in biological nervous systems might be carried
out making use of a pure rate coding, this is not the case for other processing pathways
where complex processing tasks are performed in times of a few milliseconds, through
several neural layers [THO96]. This is hard to make compatible with the use of a rate
coding computing scheme based on biological neurons, which exhibit interspike times
in the range of milliseconds. This fact has motivated different research groups to study
alternative coding schemes, such as rank order coding [THO98] and temporal coding
[HOP95]. These models are compatible with the rapid processing observed in some
sensory pathways of the biological reference.

Synchronization between groups of spiking neurons seems to be a good way of
exploiting the temporal domain in neural populations. Furthermore, stimulus-dependent
synchronization processes have been found in biological systems [ECK88, GRA89,
ENG91] and they seem to play an important role in specific tasks like binding coherent
features of sensory inputs and pattern segmentation [ECK94, FRE87, GRA90, MAL86].
Once a neuron population is synchronized, the specific spike timing of individual
neurons within this population may convey information about the relative intensities of
the input stimuli to which they exhibit a more or less selective response [HOP95, KON95].
Synchronization may be induced by several (non-exclusive) mechanisms: coupling
between groups of firing neurons [FRE87, GRO91, MIR90], feedback from other layers
[ECK94, SIL94, TON92] or some intrinsic modulation of the neural activity [ALO89,
HOP95]. Biological systems seem to make efficient use of both firing rates and the
relative timing of individual spikes to code neural information [SEJ95].

This paper describes basic circuits, and neural configurations based on them, intended
to exploit both rate and spike time coding schemes. Their functionality is illustrated by
SPICE simulations taking into account the parameters of the 1.2 μm CMOS fabrication
process of AMS, in which some of the circuits have already been implemented and
tested [PEL97, ROS97a].

The computational primitives described here are still far from any solution that could be
directly applied to a real problem. Nevertheless, in order to reach this point, it is
necessary to develop the basic time coding circuits and also to study how the neural
information processing capabilities offered by these cells can be used collectively in
massively parallel architectures. Basic circuits like the one proposed here may motivate
the search for new neural configurations that take full advantage of time-coded signals
to perform complex processing tasks efficiently.

Section 2 of the paper presents the integrate-and-fire neuron model; Sections 3 and 4
describe briefly the synaptic circuit and the time coding circuit, respectively. In Section
5 a simple neural structure is discussed, with simulation results illustrating how coherent
input detection can be carried out with the proposed cells. In Section 6 some concluding
remarks are made.

2. Neuron Model

The circuits proposed in this paper implement a spiking neuron model [GER98]. The
neuron state is represented by a variable ($V_x$) called the membrane potential. Each time
that $V_x$ reaches a certain threshold ($V_{th}$) the neuron fires an output spike. Two processes
affect the value of $V_x$ according to expression (1). First, $V_x$ falls to its minimum value
each time an output pulse is fired. Second, $V_x$ integrates the contribution of all the
presynaptic neurons.

$$V_{x_i}(t) = \sum_{t_i^{(f)} \in \Phi_i} \eta_i\!\left(t - t_i^{(f)}\right) + \sum_{j \in \Gamma_i}\ \sum_{t_j^{(f)} \in \Phi_j} w_{ij}\, \varepsilon_{ij}\!\left(t - t_j^{(f)}\right) \tag{1}$$

In expression (1), $w_{ij}$ represents the weight of the synaptic connection and $\varepsilon_{ij}$ is a
function that describes in time how the synaptic contributions of individual spikes are
integrated in the membrane potential ($V_x$). Using a similar nomenclature to that in
[GER98], for a particular neuron $i$, $\Gamma_i$ denotes its receptive field, that is, all its
presynaptic connections, and $\Phi_i$ represents the set of its firing times, as indicated in
expression (2).

$$\Phi_i = \left\{ t_i^{(f)};\ 1 \le f \le n \right\} = \left\{ t \mid V_{x_i}(t) = V_{th} \right\} \tag{2}$$
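A discrete-time sketch of this behaviour follows (illustrative constants and interface; the real circuit works in continuous time):

```python
def integrate_and_fire(i_exc, v_th, c_x=1e-12, dt=1e-6, n_steps=5000):
    """Expressions (1)-(2) in caricature: Vx integrates the incoming
    current and resets to its minimum on each threshold crossing.
    v_th is a function of time, so a periodic threshold can be passed in."""
    vx, spikes = 0.0, []
    for k in range(n_steps):
        vx += (i_exc / c_x) * dt          # synaptic integration
        if vx >= v_th(k * dt):            # firing condition of (2)
            spikes.append(k * dt)
            vx = 0.0                      # reset after the output spike
    return spikes

# Constant threshold: the firing rate is proportional to i_exc.
print(len(integrate_and_fire(2e-9, lambda t: 1.0)))   # ~10 spikes in 5 ms
```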

3. Synaptic Circuits

The synaptic circuits are described in detail in [ROS97a, PEL97]. For the sake of
clarity, in this paper we consider a particular configuration of the synaptic circuits that

induces a linear behaviour in the synapse model (see Figure 1.a). In this way the
membrane potential variation induced by a single spike does not depend on the actual
value of the membrane potential ($V_x$). Therefore $V_x$ will rise (or fall) linearly with the
number of excitatory (or inhibitory) input pulses, respectively (see Figure 1.b).

The weight of the excitatory synapse circuit ($w_{ij}^+$) is controlled by a reference voltage
($V_{ref_{ij}}$) according to equation (3):

$$w_{ij}^{+} = \frac{K^{+}\, C^{+}}{C_x} \tag{3}$$


Figure 1: (a) Schematic circuit of an excitatory synapse. Each time a spike reaches the
circuit, a charge packet is injected into the membrane capacitance $C_x$. (b) The membrane
potential ($V_x$) rises in response to spikes received by an excitatory synapse. Each time
an input pulse reaches the synapse, a charge packet is injected into $C_x$ for a time that
depends on $V_{ref_{ij}}$.

4. Time coding circuit

The proposed time coding circuit can be seen basically as an integrate-and-fire neuron
with an external firing threshold that can be modulated for synchronization purposes
(see Fig. 2).

Figure 2: Schematic circuit of the time coding module.

Each time that the membrane potential ($V_x$) reaches the firing threshold ($V_{th}$), the
comparator circuit switches, producing a fast charge of the intermediate capacitance
($C_i$), and generates a spike through an output stage similar to the one proposed by
Carver Mead [MEA89]. While the pulse is being fired, the membrane capacitance $C_x$ is
completely depleted and the intermediate capacitance $C_i$ is partially discharged to a
fixed value below the transition threshold of the first inverter (I1). This is done by two
specific depletion transistors. An additional current source may be
implemented to complete the depletion of $C_i$ in order to avoid undesirable charges
caused by leakage currents at the output of the comparator circuit (see Figure 3).

Figure 3: CMOS version of the time coding circuit.

This circuit behaves like an integrate-and-fire neuron. $V_{th}$ is the firing threshold and $i_{exc}$
represents the global incoming charge received by the membrane capacitance $C_x$
through all the synapses. The time of the next spike is described by expression (4), and
depends on the global excitation received from the whole presynaptic tree and on the
time of the previously fired spike.

$$t_n = \frac{V_{th}\, C_x}{i_{exc}(t)} + t_{n-1} \tag{4}$$
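For a constant excitation current, expression (4) reduces to a fixed interspike interval; a one-line check with assumed component values:

```python
def next_spike_time(t_prev, i_exc, v_th=1.0, c_x=1e-12):
    """Expression (4): t_n = Vth*Cx / i_exc + t_{n-1} (constant current;
    v_th and c_x values are illustrative assumptions)."""
    return v_th * c_x / i_exc + t_prev

# Doubling the excitation halves the interspike interval (rate coding):
print(next_spike_time(0.0, 1e-9), next_spike_time(0.0, 2e-9))  # 1e-3 s, 5e-4 s
```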

When a constant firing threshold ($V_{th}$) is used, the output frequency of spikes ($F_{out}$) is
proportional to the incoming excitation current ($i_{exc}$). On the other hand, a periodic
threshold signal ($V_{th}$) can be used to induce synchronization processes [HOP95,
ROS97b]. This periodic signal, applied as the common reference voltage for a
population of neurons, can be seen as an artificial way of inducing synchronization in a
set of neurons receiving similar inputs. In biological systems the synchronous
oscillation of populations of neurons leads to subthreshold variations in the membrane
potentials, which may play a similar role to this common reference signal. Global
periodic oscillation signals have been observed in biological systems and may act as
intrinsic modulation signals. Therefore a simple external threshold applied to the circuit
emulates these inherent synchronization properties, the circuit responding with similar
spike timing to similar inputs.

The power consumption of the above circuits is described in detail in [ROS97a]. All
circuits have low power consumption, with typical values in the order of 0.3 μW for
each synapse and 0.5 μW for each time coding circuit.

5. A simple collecting neuron as coherence detector

A single collecting neuron with a converging synaptic tree works as a coherence
detector in specific receptive fields ($r_i$), responding to synchronized neuron populations
in previous layers. The receptive field senses an input pattern that is conveyed to an
input layer ($N_i$), where this information is coded by means of spike timing. The input
layer uses a periodic signal as the firing threshold and therefore synchronizes the spikes
produced by neurons receiving similar excitation currents. The outputs of a set of
neurons of a specific receptive field are collected by a second-layer unit ($N_c$) that works
as a coherence detector (see Figure 4).

Figure 4: Neural configuration for coherence detection. An input layer encodes the input
stimuli through individual spike timing. A collecting neuron ($N_c$) with a strong passive
decay term fires spikes if it receives synchronized bursts of pulses.

This neuron ($N_c$) has a strong passive decay term and therefore only strong excitation
phases (i.e. a high number of input spikes in a short time period) will be able to raise the
membrane potential over the threshold and fire output pulses. The passive decay term of
$N_c$ is caused by an inhibitory synapse receiving a constant spike frequency ($F_R$). The
output pulses of the collecting neuron are fired by a circuit that codes the membrane
potential as a spike frequency [PEL97]. The strong passive decay term of this unit limits
the number of output pulses (the greater the synchronization in previous layers, the
more output pulses are fired) in response to a suprathreshold phase.
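The qualitative behaviour of Figures 4 and 5 can be reproduced with a toy simulation; all constants below are illustrative and do not correspond to the SPICE-level circuit parameters.

```python
import numpy as np

def coherence_demo(currents, n_steps=2000, period=200,
                   amp=2.0, floor=0.3, decay=0.5, w=0.25, theta=1.2):
    """Input neurons integrate their stimulus currents and fire against a
    periodic ramp threshold, so units with similar inputs fire at similar
    times; a leaky collecting neuron (strong passive decay) only crosses
    its own threshold when the incoming spikes arrive as a dense burst."""
    vx = np.zeros(len(currents))
    vc, out_spikes = 0.0, []
    for k in range(n_steps):
        v_th = floor + amp * (1.0 - (k % period) / period)  # periodic threshold
        vx += currents                                      # membrane integration
        fired = vx >= v_th
        vx[fired] = 0.0                                     # reset fired input units
        vc = decay * vc + w * fired.sum()                   # collecting neuron Nc
        if vc >= theta:
            out_spikes.append(k)
            vc = 0.0
    return out_spikes

# A homogeneous stimulus synchronizes the input layer and makes Nc fire;
# a heterogeneous one spreads the spikes out and Nc stays subthreshold.
print(len(coherence_demo(np.full(10, 0.01))))             # many output spikes
print(len(coherence_demo(np.linspace(0.002, 0.02, 10))))  # few or none
```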

Simulation results of such a neural configuration, with ten neurons in the input layer,
are shown in Fig. 5. The input signals (Fig. 5.a) evolve through time and are quite
heterogeneous until t=60 ms, when all inputs converge to a similar value. From this time
on, the output spikes of the input layer gradually become synchronized. The collecting
neuron receives pulse streams: the more synchronized the firing by the input neurons,
the narrower are the pulse streams generated. Due to the strong passive decay term of
this neuron ($N_c$), only very abrupt excitation processes (concentrated pulse streams) will
be able to dominate it and raise the membrane potential over the firing threshold. In Fig.
5.b it can be seen that when the input spikes received are synchronized in a short
interval of time, the membrane potential rises over the threshold, producing output
spikes in the collecting neuron ($N_c$).


Figure 5: (a) Evolution of input stimuli. (b) Upper trace: spikes fired by the input layer,
beginning to be synchronized from t=60 ms, when the input pattern becomes
homogeneous. Middle trace: spikes fired by the collecting neuron ($N_c$). Lower trace:
membrane potential of the collecting neuron. Only dense pulse streams coming from $N_i$
lead to abrupt excitation phases able to dominate the passive decay term of the
collecting neuron.

The input layer produces spikes at a fixed frequency. The particular time of a spike
within the period of the threshold reference signal ($V_{th}$) depends on the excitation
current ($i_{exc}$); weakly excited neurons will fire delayed pulses with respect to more
strongly excited units.

Fig. 5.b (lower trace) shows that the collecting neuron $N_c$ exhibits subthreshold
oscillations in response to non-coherent input stimuli sensed by $N_i$. On the other hand,
when the input layer receives a homogeneous stimulus, the membrane potential of the
collecting neuron exhibits coherent oscillations over the threshold and generates output
pulse streams (see middle trace in Fig. 5.b).

If different groups of input neurons in the receptive field of a collecting neuron ($N_c$)
receive homogeneous stimuli, but of different values, then this collecting neuron will
respond with several excitation phases per reference cycle, each one corresponding to a
different region of homogeneous input stimuli.

Fig. 6 shows simulation results of the same configuration with twenty input neurons,
where the input pattern converges at time t=60 ms. From this time on, two populations
of neurons receive different values and therefore they synchronize their output spikes to
different firing times. This spatial pattern sensed in the receptive field generates an
output time signature consisting of two output pulse streams. If the populations of
synchronized input neurons are significant, the collecting neuron ($N_c$) will produce
output spikes for each abrupt excitation received. In this way such a neural
configuration could produce a particular output sequence in response to simple patterns
within the receptive field or even to textures (sensed as two input values distributed
through the receptive field).

Figure 6: (a) Input stimuli evolution. (b) Upper trace: spikes fired by the collecting
neuron ($N_c$). Lower trace: $N_c$ membrane potential. Two high-activity phases can be
observed in each $V_{th}$ cycle (20 ms) once the input stimuli have converged to
homogeneous values.

If the receptive field of $N_c$ has a specific shape and orientation within the input
processing layer (see Figure 7), different collecting neurons may respond with
oscillations to stimuli of a certain shape and orientation. The intensity level of the input
stimulus is coded as oscillations at a specific time within the time interval defined by
the reference oscillation signal. On the other hand, the amplitude of the oscillations of
the membrane potential of the collecting neurons gives a measure of the number of
synchronized neurons in the previous layer.

If the receptive field of neuron $N_c$ is selective to a particular orientation, suboptimally
oriented patterns will excite only part of the input units, producing weaker oscillations
in the membrane potential of this collecting neuron. Furthermore, lateral inhibition
between disjoint receptive fields increases the selectivity to certain primitives
such as bars of a specific width and orientation.

/'"7

~ N ~ N~
(a) (b)
Figure 7: (a) An oriented bar stimulates only a few neurons (dark units) in a specific
receptive field. (b) In this case an oriented bar stimulates most of the neurons (dark
units) of a receptive field. The number of excited neurons represents the degree of
simila,ity between the input stimulus and the receptive field characteristics. A single
collecting neuron Nc can be used to detect this degree of matching and to code it in
spike streams.

6. Conclusion

Many aspects of biological processing systems have inspired new computational
concepts and principles. The contribution of this paper is focussed on the way that the
temporal domain can be exploited by VLSI circuits. The proposed time coding circuit
uses a periodic reference signal to code the neural activity through specific spike firing
times. As these basic circuits have been developed, they are being used as a reference to
explore ways of taking full advantage of synchronization processes in new neural
architectures.

J. J. Hopfield [HOP95] illustrated how a simple periodic firing threshold
induces synchronization within integrate-and-fire neuron populations receiving similar
inputs. This can be used to exploit the temporal domain, coding the neural information
in the spike timing rather than in the spike rates. The circuits described in this paper
represent a VLSI approach to this concept that is a starting point for the study of VLSI
neural structures able to take full advantage of this time coding scheme.

References

[ALO89] A. Alonso, R.R. Llinás, "Subthreshold Na+-dependent theta-like rhythmicity
in stellate cells of entorhinal cortex layer II", Nature, vol. 342, pp. 175-177, 1989.
[ECK88] R. Eckhorn, R. Bauer, W. Jordan, M. Brosch, W. Kruse, M. Munk, H.J.
Reitboeck, "Coherent oscillations: A mechanism of feature linking in the visual
cortex?", Biol. Cybern., vol. 60, pp. 121-130, 1988.
[ECK94] R. Eckhorn, "Oscillatory and non-oscillatory synchronizations in the visual
cortex and their possible roles in associations of visual features", Progress in Brain
Research, vol. 102, pp. 405-426, 1994.
[ENG91] A.K. Engel, P. König, A.K. Kreiter, W. Singer, "Interhemispheric
synchronization of oscillatory neural responses in cat visual cortex", Science, vol. 252,
pp. 1177-1179, 1991.
[FRE87] W. Freeman, B. W. Dijk, "Spatial patterns of visual cortical fast EEG during
conditioned reflex in a rhesus monkey", Brain Res., vol. 422, pp. 267-276, 1987.
[GER98] W. Gerstner, "Spiking Neurons", Pulsed Neural Networks, W. Maas and
C.M. Bishop (Editors), MIT Press, pp. 3-54, 1998.
[GRA89] C.M. Gray, P. König, A.K. Engel, W. Singer, "Oscillatory responses in cat
visual cortex exhibit intercolumnar synchronization which reflects global stimulus
properties", Nature, vol. 338, pp. 334-337, 1989.
[GRA90] C. M. Gray, A. K. Engel, P. König and W. Singer, "Stimulus dependent
neuronal oscillations in cat visual cortex: receptive field properties and feature
dependence", Eur. J. Neurosci., vol. 2, pp. 607-619, 1990.
[GRO91] S. Grossberg, D. Somers, "Synchronized oscillations during cooperative
feature linking in a cortical model of visual perception", Neural Networks, vol. 4, pp.
453-466, 1991.
[HOP95] J.J. Hopfield, "Pattern recognition computation using action potential timing
for stimulus representation", Nature, vol. 376, pp. 33-36, 1995.
[KON95] P. König, A. K. Engel, P. R. Roelfsema, W. Singer, "How precise is
neuronal synchronization?", Neural Computation, vol. 7, pp. 469-485, 1995.
[MAL86] C. von der Malsburg, W. Schneider, "A neural cocktail party processor",
Biol. Cybern., vol. 54, pp. 29-40, 1986.
[MEA89] C.A. Mead, "Analog VLSI and neural systems", Addison Wesley, Reading,
MA, 1989.
[MIR90] R. E. Mirollo, S. H. Strogatz, "Synchronization of pulse-coupled biological
oscillators", SIAM J. Appl. Math., vol. 50, no. 6, pp. 1645-1662, 1990.
[PEL97] F.J. Pelayo, E. Ros, X. Arreguit, A. Prieto, "VLSI implementation of a
neural model using spikes", Analog Integrated Circuits and Signal Processing, Kluwer
Academic Publishers, vol. 13, no. 1/2, pp. 111-121, 1997.
[ROS97a] E. Ros, "Implementación VLSI de Estructuras Neuronales Inspiradas en la
Biología", PhD Dissertation, University of Granada, 1997.
[ROS97b] E. Ros, F.J. Pelayo, B. Pino, and A. Prieto, "Firing Rate and Phase Coding
Circuits for Neural Computation using Spikes", MicroNeuro'97, Microelectronics for
Neural Networks, Evolutionary & Fuzzy Systems, pp. 305-311, September 1997.
[SEJ95] T. J. Sejnowski, "Time for a new neural code?", Nature, vol. 376, pp. 21-22,
1995.
[SIL94] A. M. Sillito, H. E. Jones, G. L. Gerstein, D. C. West, "Feature-linked
synchronization of thalamic relay cell firing induced by feedback from the visual
cortex", Nature, vol. 369, pp. 479-482, 1994.
[THO96] S.J. Thorpe, D. Fize, C. Marlot, "Speed of processing in the human visual
system", Nature, vol. 381, pp. 520-522, 1996.
[THO98] S.J. Thorpe, J. Gautrais, "Rank Order Coding: A new coding scheme for
rapid processing in neural networks", Computational Neuroscience: Trends in
Research, J. Bower (Ed.), New York: Plenum Press.
[TON92] G. Tononi, O. Sporns, G. M. Edelman, "Reentry and the problem of
integrating multiple cortical areas: simulation of dynamic integration in the visual
system", Cerebral Cortex, vol. 2, pp. 310-335, 1992.
An Artificial Dendrite Using Active Channels

Eelco Rouw, Jaap Hoekstra, and Arthur H.M. van Roermund

Delft University of Technology, Faculty of Information Technology and Systems/DIMES,
Electronics Research Laboratory, Mekelweg 4, 2628 CD Delft, The Netherlands
E-mail: E.Rouw@ITS.TUDelft.NL

Abstract. Since their introduction, neural networks have become an accepted object of research in various disciplines. Most of these neural networks are implemented using digital hardware consisting of computers or dedicated processors.
Analogue implementations of artificial neurons, the elementary processing units, could be smaller than their digital counterparts, thus enabling more complex networks on a single chip. Conventional methods of learning cannot be used directly in these networks, due to practical limitations regarding on-chip interconnections. In order to achieve such complexity, it is necessary to refine the neural networks.
This article proposes an artificial dendrite, a model of one of the most important parts of the neuron. The artificial dendrite uses principles found in biology, such as active propagation and shaping of action potentials using active channels. A brief introduction to neurophysiology is given in order to explain the underlying mechanisms. The model is simulated in SPICE using models of conventional analogue electronic devices.

1 Introduction

Conventional (digital) computers are an integral part of our lives and are becoming ever more powerful. Nowadays, it is possible to perform several million calculations per second even with a modest PC. This computational power is achieved by high clock rates, pipelining and some degree of parallelism. Despite this computational power, a conventionally programmed computer is not able to recognize images (for example) as well as we do; this task would require a very high clock rate to make real-time operation possible. An alternative is to use other computational methods or structures. One of these structures is an artificial neural network [1], a computational structure inspired by the (human) central nervous system.

1.1 Analogue neural nets

Artificial neural networks implemented in digital hardware suffer from the problem of needing complex circuits for executing the necessary arithmetic, thus reducing the possible number of neurons. With an enormous amount of memory, software could be capable of simulating large neural networks; however, most of the parallelism found in a neural network has to be translated into sequential programs, making real-time processing of large neural nets nearly impossible. Because some of the basic relations found in analogue electronics already consist of summations and threshold functions, it should be possible to create neurons consisting of only a few analogue devices. It has been shown by Carver Mead [2] that it is possible to mimic specific behavior of sensory systems using simple analogue circuits. His replica of the human retina [3] was able to detect edges and motion using a photo-sensitive array of neurons.
Due to the possible reduction in complexity of the neuron, it is possible to fit a large number of neurons on a single chip. In order to use the neural net, learning is required. When the conventional supervised learning rules are applied, practical problems arise. A supervised algorithm would be implemented using a complex circuit centrally placed between the neurons and connected to each neuron. In the case of tens of neurons this would be feasible, but the limited number of metal layers on chip would not be sufficient when a few thousand or even a few million neurons are used. In order to create networks of such magnitudes it is necessary to divide and decentralize the supervising algorithm, resulting in more localized learning circuitry. Good examples of localized learning algorithms can be found in the human brain itself: each individual human neuron possesses the ability to learn.
The goal of this paper is to model the most important part of a neuron, the dendrite. These dendrites could prove to be the key to local learning, because most of the processing (if not all) is done by the dendrite. The resulting artificial dendrite serves as a suitable vehicle for experiments concerning local learning.
The last aspect of this introduction is the information coding. In conventional computers, the information passed from one element to another is coded using a sequence of patterns. The human brain uses a different information encoding, based on the temporal relations in a series of spikes or action potentials. One can compare this modulation with FM: the more spikes, the higher the intensity. This temporal coding could prove to be less sensitive to parasitics than other forms of coding.

2 The biological neuron

It is necessary to look at the biological neuron in order to model the dendrite. Different aspects of the neuron are described briefly in this section; more information about the biological neuron and its function can be found in [4]. First the anatomy and the functions are described. After this general introduction, the dendrite and its important parts, the (active) channels and the passive membrane, are reviewed.

2.1 Anatomy
Some segments are similar for each neuron, though most neurons can differ greatly in size and shape. Roughly speaking, a neuron has three distinct kinds of segments: dendrites, an axon and a cell body. The neurons are connected to other neurons by the dendrites and the axons. The axons excite other dendrites or cell bodies using small bulb-like terminations called synapses. A typical neuron, a Purkinje cell, can be seen in fig. 1.

Fig. 1. A Purkinje cell

The dendrites act as inputs of the neuron, while the axons are used as outputs. The dendrites and axons have in common a structure consisting of a cylindrical membrane formed by a bilayer of lipid molecules. Large molecules, called channels, form passages through the membrane between the cell's interior and exterior. Figure 2 shows a typical membrane with some channels.

Fig. 2. Membrane of the biological dendrite

The terminations of the axon, the synapses, do not connect physically to the dendrite (by merging the membranes) but lie close to the dendrite. These synapses mainly use chemical reactions to transfer information from the axon to the dendrite.

2.2 Function

A biological neuron is best described as a highly nonlinear filter. Several effects can be ascribed to the dendrite, for example threshold behavior and a delay. It has been stated earlier that the information coding in the dendrite uses the temporal relation between individual spikes or groups of spikes. These spikes are transient potentials (potential differences between the cell's interior and exterior) along the membrane of the dendrites and axons.
A more proper name for these spikes is action potentials. Though the amplitude and the timing can differ, each action potential has the same characteristic periods. A typical action potential as well as its characteristic periods can be seen in fig. 3.

Fig. 3. A typical action potential

A generic action potential has four characteristic periods: the

resting period, the depolarization and the two refractory periods (partly repolarization/depolarization). During the resting period, the dendrite is in a state of equilibrium with a resulting membrane potential of approx. -60 mV (the resting potential). When the membrane potential is raised above a certain level, the depolarization begins and the membrane potential rises quickly to approx. 90 mV. When this point is reached, the membrane potential ceases to rise and starts to drop quickly below the resting potential (undershoot). After this undershoot, the membrane potential slowly rises to its resting state (repolarization). The last two events are called the refractory period, in which the potential drop is called absolute refractory (the membrane cannot be excited) and the slow return to the resting potential is called relative refractory (the membrane is less sensitive to excitations).
The information is coded in the number of spikes per second, though recent research suggests that the shape of an action potential also contains information [5]. This kind of information encoding is best described as pulse density modulation.

2.3 The dendrite


The fact that almost all processing takes place in the dendrite is a good reason to model this part of the human neuron. The dendrite is not only capable of the processing but also of the propagation of action potentials. The propagation can be achieved in two different manners, active and passive. In the case of passive propagation the dendrite behaves electrically like the RC transmission line shown in fig. 4. The resistances R_m are the resistances through the membrane and the resistances R_i are the resistances along the membrane. When a pulse is applied to a dendrite, measurements at different distances show that the shape and the amplitude of the pulse change along the membrane.

Fig. 4. The dendrite seen as an RC-line

A simulation of an RC-line shows

that the pulse is attenuated and that the RC-line behaves like a low-pass filter. Fig. 5 shows the six membrane potentials measured at the six points of a six-section RC-line. Within a section the values of R_i, R_m, C_m and the membrane potential V_m are considered to be constant.

Fig. 5. Simulation of a 6-section passive RC-line

As one can see, the amplitude gradually decays and the shape changes (the shape gets smoother due to the low-pass nature of the RC-line) when the potential is measured at a point further away. Although passive propagation is sufficient for short dendrites, it is necessary to use some kind of active propagation for long distances (e.g. from the brain to a foot). Active propagation is achieved using channels, large tunnel-shaped molecules through the membrane. There are several different kinds of channels [6]. These channels are used to transport ions (or molecules) from the interior to the cell's exterior and vice versa.
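The low-pass behavior of such a passive RC-line is easy to reproduce numerically. The following Python sketch (with invented, merely illustrative component values; the paper itself uses SPICE) integrates a lumped six-section cable with forward Euler and shows how a current pulse injected at one end is attenuated and smoothed along the line, as in fig. 5:

import numpy as np

# Lumped passive cable: six sections, each with axial resistance Ri,
# membrane resistance Rm and membrane capacitance Cm (illustrative values).
N = 6
Ri, Rm, Cm = 1e7, 1e8, 1e-10          # ohm, ohm, farad
dt, T = 1e-5, 0.05                    # time step and total time, in seconds

v = np.zeros(N)                       # membrane potentials of the sections
peaks = np.zeros(N)

for k in range(int(T / dt)):
    i_stim = 1e-9 if k * dt < 0.01 else 0.0   # 10 ms current pulse into section 0
    v_new = v.copy()
    for n in range(N):
        i_axial = 0.0
        if n > 0:
            i_axial += (v[n - 1] - v[n]) / Ri   # current from the left neighbour
        if n < N - 1:
            i_axial += (v[n + 1] - v[n]) / Ri   # current from the right neighbour
        i_leak = -v[n] / Rm                     # leak through the membrane
        i_inj = i_stim if n == 0 else 0.0
        v_new[n] = v[n] + dt * (i_axial + i_leak + i_inj) / Cm
    v = v_new
    peaks = np.maximum(peaks, v)

print(peaks)    # the peak amplitude decays monotonically along the line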

2.4 Axons

Axons also use active channels to transport action potentials. In contrast to dendrites, an axon can be covered by insulating fat cells (Schwann cells), leaving only small openings (see fig. 1). These Schwann cells are responsible for the reduction of propagation times, making fast responses possible.
2.5 Channels
As stated above, channels are large molecules through the membrane. These channels use passive (diffusion) or active mechanisms to transport particles through the membrane. Most of the active channels exhibit voltage-dependent behavior and memory effects. The ionic flow is determined by the membrane voltage and its derivative (different behavior on a rising/falling edge). These dependencies on the membrane potential are depicted using the hysteresis graphs in fig. 6. The arrows show the possible trajectories of the membrane potential and the ionic current. Three ionic currents are important for the shape of the action potential. These flows (electrically seen as currents) are the inflow of Na+ ions (Na+ influx) and the flows of K+ ions (K+ efflux and influx).

Fig. 6. Three hysteresis diagrams of the biological channels (K+ influx, K+ efflux and Na+ influx as functions of the membrane potential)

3 The artificial dendrite

This section deals with the translation of the biological dendrite to an artificial dendrite. The first subsection gives some considerations regarding the conducted research. The model is a simplified version of the dendrite; the simplifications are explained in the second subsection. The next subsection describes the translation of the hysteresis graphs to ideal electrical schematics. The final phase of the modeling is the translation of the ideal schematics to a circuit consisting of conventional components, described in the last subsection.

3.1 Considerations
The biological dendrite is capable of processing information using addition, multiplication, delay and threshold behavior. The artificial dendrite should be capable of the same processes except multiplication (this could be a target for further research). The ionic flows found in the biological dendrite can be modeled using currents. The channels consist of circuits controlled by the membrane potential that drive the membrane with a current source or sink. The advantage of this layout is the fact that both input (membrane potential) and output (driving current) can be connected to the same node, without undesired influences. The artificial dendrite described in this paper is capable of propagating action potentials in both directions (to and from the cell body). It is, however, difficult to model the dendrite completely, because this would result in many circuits.

3.2 Model constraints

One of the simplifications of the model concerns the characteristics of the channels. Biological channels have a variable aperture, so the flow of ions can vary. The artificial channels described here can only be opened or closed. Fig. 6 showed the membrane potential dependency of the biological channels; fig. 7 shows the effect of these simplifications on the membrane potential dependency of the artificial channels. This figure is also a hysteresis diagram using arrows to show the possible trajectories.

Fig. 7. Three hysteresis diagrams of the artificial channels

The modulation of the ionic flow in the biological channels results in a smoother action potential. The values V_Ti are threshold values triggering certain events in the artificial channels; the indexes correspond to the ones used in the ideal schematics later in this section. In fig. 7, V_T5 and V_T3 have the same value; in the practical implementation these values have been chosen to be different, to prevent undesired equilibria. When the artificial channels function according to fig. 7, the resulting action potential will be like the one in fig. 8.

Fig. 8. An action potential generated using artificial channels

Another simplification in the model is the structure used. The structure of a biological dendrite can be seen as a continuous transmission line with all resistances and capacitances distributed along the membrane. Simulating continuous transmission lines would result in time-consuming simulations; the artificial dendrite therefore uses a lumped approach [7]. Each dendrite is divided into sections consisting of optional nonlinear parts, two resistors and a capacitor (the resistors and capacitor represent the passive transmission line used as a backbone for the artificial dendrite). A picture of a passive transmission line can be found in fig. 9.

Fig. 9. Passive RC-line modeled by three RC sections
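The behavior of figs. 7 and 8 can also be captured in a small event-driven sketch. The Python code below is only an illustration of the open/closed channel principle, with made-up threshold voltages and currents (the actual circuits are the SPICE schematics in the Appendix); it drives a single RC membrane section with three switched current sources and produces the rectangular action potential of fig. 8:

# All thresholds and currents are illustrative, not taken from the paper.
V_REST, V_FIRE, V_PEAK, V_UNDER = 0.5, 1.5, 4.0, 0.2   # volt
I_NA, I_K_EX, I_K_IN = 5e-6, 4e-6, 0.5e-6              # ampere
RM, CM, dt = 1e6, 1e-9, 1e-6                           # ohm, farad, second

v, na_open, k_ex_open, trace = 0.0, False, False, []
for k in range(200_000):
    t = k * dt
    i = -v / RM                               # passive membrane leak
    if 0.05 < t < 0.051:
        i += 2e-6                             # 1 ms stimulus pulse

    if v > V_FIRE and not k_ex_open:
        na_open = True                        # depolarization: Na+ influx opens
    if v > V_PEAK:
        na_open, k_ex_open = False, True      # peak: Na+ closes, K+ efflux opens
    if k_ex_open and v < V_UNDER:
        k_ex_open = False                     # end of undershoot (absolute refractory)

    if na_open:
        i += I_NA
    if k_ex_open:
        i -= I_K_EX
    if v < V_REST:
        i += I_K_IN                           # K+ influx restores the resting level
    v += dt * i / CM
    trace.append(v)

print(f"resting level ~{trace[40_000]:.2f} V, spike peak ~{max(trace):.2f} V")

The trace first shows the charging of the membrane from 0 V to V_REST by the K+ influx source, then one stimulus-triggered rectangular spike followed by the undershoot and recovery.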

3.3 Derivation of the ideal model

The main information used to derive the ideal circuits is the graph shown in fig. 7. The basic elements used in the artificial dendrite are comparators (for the threshold behavior), logic gates (used to combine different thresholds), latches (for the memory behavior, to determine the direction) and switched current supplies/sinks. When these four elements are combined it is possible to construct the different hysteresis diagrams. The resulting circuits are shown in figs. 10, 11 and 12.

3.4 Translation of the ideal model to an electrical circuit

Fig. 10. Ideal circuit of the Na+ influx channel

Fig. 11. Ideal circuit of the K+ efflux channel

Fig. 12. Ideal circuit of the K+ influx channel

It is possible to extract circuits from the three different ideal circuits using conventional devices. These circuits can be used to simulate the artificial dendrite

with the circuit simulator SPICE. Each ideal component can be replaced by a circuit built from conventional components; these small basic circuits can be found in any general textbook on electronics [8], [9]. The resulting schematics are included with this article in the Appendix.

4 Simulation and Results

Having electrical models of the biological channels gives the opportunity to simulate the artificial dendrite. These experiments were used to verify whether the model had the desired functionality. This section describes the results of the conducted simulations.

4.1 Simulation and experimental setup

The simulations have been run in three different categories. The first category tests the input/output behavior of the three different channels. The second category verifies the behavior of the channels connected to an RC-pair (representing the passive line). It is sufficient to mention that the results of both categories agree with the desired behavior. The values of the different components have been chosen to make the results of the simulations clear.

4.2 Verification

The experiments of the third category are used to determine whether the artificial dendrite functions properly. These experiments verify whether the artificial dendrite is capable of producing and propagating an action potential. Two different dendrites are verified, a three-section dendrite and a nine-section dendrite. Four consecutive stimuli (injected currents) have been applied. The passive RC-line is charged in the beginning: the membrane potential rises from 0 V to V_rest, and the K+ influx channel is responsible for this charging. The first stimulus is in both cases not sufficient to excite the dendrite above the threshold voltage V_T2, due to the fact that the membrane potential is still rising to the resting potential. The next three stimuli are able to excite the dendrite in both cases because the membrane potential rises above the threshold potential, and for each dendrite three consecutive action potentials are generated.

Fig. 13. Simulation of a 3-section dendrite

Fig. 14. Simulation of a 9-section dendrite

4.3 Review

The experiments from the former section show that action potentials can be generated in the artificial dendrite. Several properties found in biological dendrites are also implemented in the artificial dendrite, such as threshold behavior and refractory periods. The refractory periods of the artificial dendrite have a sequential nature instead of the timed behavior of the biological dendrite. The action potentials generated by the artificial dendrite are also more rectangular than their biological counterparts, due to the discrete nature of the artificial channels (in contrast to the continuous nature of the biological channels). The artificial dendrite is capable of summation and threshold behavior: summation is accomplished by the injection of different currents into one point, and the threshold behavior can be ascribed to the active channels. It is thus possible to simulate a conventional artificial neuron with temporal behavior.

5 Conclusions and Recommendations

In the first subsection, conclusions are drawn regarding the simulations and the implementation of the artificial dendrite. The last subsection gives recommendations regarding further research.

5.1 Conclusions

This research shows the possibility of mimicking specific behavior of the biological dendrite using conventional electronic components. The approach used in this article resulted in analogue circuits that perform different functions found in biological dendrites. The simulation experiments were used for verification of the desired behaviour. This verification has been conducted on different levels: the first category verified the response of the different channels to transient input voltages; the second category verified the separate channels in a membrane-like environment; finally, the active membrane sections were concatenated in the third category to simulate small artificial dendrites. These three categories show that the artificial dendrite as well as its subcircuits function properly. With this behavior, the artificial dendrite could be a good underlying structure for neural networks processing information coded in the temporal domain.

5.2 Recommendations

The research described in this article proposed a structure that can be used in the next generation of analogue neural networks. However, a lot of research still needs to be done. Targets of specific interest are temporal information coding and local learning. The nonlinear circuits need to be simplified and models of synapses should be made. When all of these elements are combined in a suitable technology, it will be possible to create integrated circuits for continuous speech recognition and processing or image recognition. These applications could narrow the gap between humans and computers, making more natural means of communication possible.

References
1. Simon Haykin, Neural Networks, A Comprehensive Foundation, Prentice Hall, New Jersey, 1994.
2. Carver Mead, Analog VLSI and Neural Systems, Addison Wesley, 1989.
3. C.A. Mead and M.A. Mahowald, "A silicon model of early visual processing," Neural Networks, vol. 1, no. 1, pp. 91-97, 1988.
4. R.H.S. Carpenter, Neurophysiology - Third edition, Arnold, London, 1996.
5. C. Koch, "Computation and the single neuron," Nature, vol. 385, pp. 207-210, 1997.
6. Editorial, "Making sense of channel diversity," Neuroscience, vol. 1, no. 1, pp. 169-170, 1998.
7. J. Hoekstra, "Single and multiple compartment models in neural networks," in Computing Anticipatory Systems - CASYS'97 conference proceedings, Daniel M. Dubois, Ed., Liege, Belgium, 1997, CHAOS, pp. 626-641, AIP.
8. T. Bogart, Electronic devices and circuits - Third edition, Merril Prentice Hall, Columbus, Ohio, 1993.
9. P. Horowitz and W. Hill, The art of electronics - Second edition, Cambridge University Press, 1996.
Analog Electronic System for Simulating
Biological Neurons

Vincent Douence, Arnaud Laflaquière, Sylvie Le Masson,
Thierry Bal (1), Gwendal Le Masson (2)

IXL Microelectronics Laboratory, CNRS UMR 1126, ENSERB-Université Bordeaux 1,
351 cours de la Libération, 33405 Talence Cedex, France.
(1) Institut Alfred Fessard, CNRS UPR 2212, av. de la Terrasse, 91198 Gif-sur-Yvette, France.
(2) Institut de Neurosciences François Magendie, INSERM U378, 1 rue Camille Saint-Saëns, 33077 Bordeaux Cedex, France.

ABSTRACT
This paper deals with the implementation of an analog electronic system capable of emulating and/or characterizing the electrical activity of biological neurons. We detail the main characteristics and performances of the system, and point out its fitness as an experimentation tool:
• high level of modeling accuracy, validated by simple and hybrid experiments;
• analog modeling principle, and the possibility to emulate in real time a large range of neurons or neural networks, thanks to a set of programmable parameters;
• model implementation simplicity, owing to a dedicated hardware and software interface.

I. INTRODUCTION

The study of biological neural networks, which are made of complex and highly non-linear elements, is limited by classical experimental approaches. A way to overcome these limitations is to study those networks from a theoretical point of view, using mathematical models of the neurons, such as those following the classical Hodgkin-Huxley formalism [1], [2], [3]. That statement is a key point justifying the development of research in computational neuroscience. But software that numerically solves the model equations is also somewhat limited by the computation time: in that case, analog computation appears as an alternative solution for neural computation.
Following initial studies we developed in the 90s [4], [5], we present an analog electronic model for the implementation of artificial neurons and neural networks, based on the Hodgkin-Huxley formalism. Equations are computed by analog circuits, integrated on ASIC (Application Specific Integrated Circuit) chips in a BiCMOS technology, which reproduce in real time the electrical activity of the neuron, i.e. its membrane potential, with a high level of accuracy. Each module of the model chip acts as an ionic current generator, and a neuron or a synapse is built by summing a number of these generators on a neural membrane capacitance. Each ionic current generator can be independently configured to follow the dynamics and gain of the ionic activity, using voltage parameters introduced in the circuitry via external inputs on pads of the chip. Then, by simply externally tuning voltages on these inputs, one can configure the desired neural network.
We have shown in previous publications [6], [7], [8] the validity of this mode of implementation for artificial neurons. Those circuits represent a powerful tool to test and
validate results from computational neuroscience with neurophysiology experiments. In this case they are used by non-electronics specialists and therefore must be integrated in a user-friendly system, compatible with the electrophysiology environment. Starting from the isolated chip computing the neuron analog model, we then built a complete analog simulation system, which supports the artificial neurons and allows their programming using a dedicated instrumentation and software interface.
After briefly mentioning the implementation principles of the analog artificial neurons, we present the structure of the system and its main characteristics. Results of two different applications of the system are then exposed: first the system is considered as a standalone simulator for computational neuroscience, second as a tool for conducting hybrid experiments where electronic and biological neurons interact in real time [9].

II. ANALOG MODEL CIRCUITS FUNCTIONALITIES

The integrated circuits we designed compute, in analog mode, the equations of the Hodgkin-Huxley formalism [1]. That formalism describes single neuron or neural network electrical activity (membrane potential or synaptic current) as the result of the sum of ionic currents on a membrane capacitance (Figure 1.A). These currents express the membrane permeability to various ionic species (Sodium, Potassium, Calcium...), and are time and membrane voltage dependent [3]. Those variations are explicited by a set of mathematical equations that we consider as the generic operations of an ionic current generator (see Figure 1.B, equations (1) to (5)). The parameters of these equations (Figure 1.B, see the list of parameters) are specific to the ionic species considered and to the modeled neuron. Those parameters are originally determined experimentally, using classical neurophysiology methods such as voltage-clamp. That experimental source of the models justifies the admitted assumption that Hodgkin-Huxley models closely match the activity of biological neurons.
The analog ASICs we developed include a set of electronic blocks, each emulating an ionic current generator that follows the equations of Figure 1.B. The only approximation is found in the kinetic expressions (equation (3)), where we neglect the voltage dependence phenomenon. The ionic current generator outputs are summed on an external capacitance (representing the membrane capacitance): that current sum is realized by a simple connection, outside the chip. Synaptic connections that imply the application of membrane voltages are also made externally. The topology of a neural network and the constitution of a single neuron are then set by the user when interconnecting the chips. For example, it is admitted that a simple spiking neuron activity can be described by the flowing of two ionic species (typically Sodium and Potassium), whereas cells with a more complex activity are built with up to 6 ionic currents.
Additional functionalities have been introduced among the implemented blocks. The first one expresses interdependencies of ionic currents that cannot be directly explicited by the Hodgkin-Huxley formalism, and that happen to be very important for some neuron activities.
As shown in equations (6) and (7), internal variables of the model then get an
additional dependency: the most common is the Calcium-dependence phenomenon,
where the Calcium concentration (computed from the Calcium current) balances the
membrane permeability to another ion [10]. To better describe the neural activity, we
added a function that is capable of expressing what is called a regulation process:
observations on living neurons have shown that, if physiological perturbations appear in a neuron's environment, the cell may respond with a change in the intrinsic structure of its ionic channels, in order to maintain its original activity. We use in the model a mathematical expression defined by G. Le Masson [11], who chose the Calcium concentration as the feedback parameter for the regulation process, and thereby reproduced phenomena measured during experiments on living neurons (equations (8) and (9)). Other expressions could be considered, but this approach nevertheless covers a large part of the well-known regulation processes in single neuron computation.

Figure 1
A neuron model. A: electrical equivalent schematic (ionic current generator blocks, each contributing a current I_i through a conductance g_i, in parallel with the membrane capacitance C_mem between the inside and the outside of the membrane). B: generic equations:

C_mem * dV_mem/dt = -SUM_i I_i - I_s                              (1)
I_i = g_i * m_i^p * h_i^q * (V_mem - V_equi,i)                    (2)
tau_mi * dm_i/dt = m_inf,i - m_i                                  (3)
m_inf,i = 1 / (1 + e^((V_mem - V_offset,i) / V_slope,i))          (4)
I_syn = g_syn * (V_mem - V_equi,syn)                              (5)
tau_Ca * d[Ca++]/dt = k * I_Ca - [Ca++]                           (6)

Equations (7) to (9) express the Calcium dependence and the regulation process; the legible part of (9) shows the regulated steady-state value as a sigmoidal function of the Calcium concentration, of the form 1 / (1 + e^(([Ca++] - C_T) / A)).

Variables: V_mem, the membrane potential; I_i, the i-th ionic current; m_i and h_i, the i-th activation and inactivation state variables; m_inf,i, the m_i steady-state value; I_syn, a synaptic current, similar to I_i; m_reg,i, the regulated m_i steady-state value; [Ca++], the Calcium concentration.
Parameters: C_mem, the membrane capacitance; I_s, a synaptic or stimulation current; g_i, the i-th maximal conductance; p, q, constant integers; V_equi,i, the equilibrium potential; tau_mi, the i-th activation kinetic; V_offset,i, the i-th activation sigmoid offset; V_slope,i, the i-th activation sigmoid slope; K, C, G, A, constants; tau_Ca, the Calcium kinetic; tau_g, the regulation kinetic; C_T, the Calcium regulation target.
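As a purely software illustration of equations (1) to (4), the following Python sketch integrates a minimal neuron built from two such generic ionic current generators plus a leak. All parameter values are invented for the demonstration and are not those programmed on the ASICs:

import numpy as np

def x_inf(v, v_off, v_slope):
    # steady-state value of a gating variable: the sigmoid of equation (4)
    return 1.0 / (1.0 + np.exp((v - v_off) / v_slope))

# Invented parameters (mV, ms, arbitrary conductance units), two currents + leak:
C = 1.0
g_na, e_na = 40.0, 55.0                 # fast inward (Na-like) current, p=3, q=1
g_k,  e_k  = 12.0, -90.0                # delayed outward (K-like) current, p=4
g_l,  e_l  = 0.15, -65.0                # leak
dt = 0.01

v, m, h, n, vs = -65.0, 0.0, 1.0, 0.0, []
for step in range(int(100.0 / dt)):
    t = step * dt
    i_s = 10.0 if 50.0 < t < 55.0 else 0.0       # 5 ms stimulation pulse
    # equation (3): first-order relaxation toward the steady-state values
    m += dt / 0.2 * (x_inf(v, -35.0, -4.0) - m)  # activation (negative slope)
    h += dt / 2.0 * (x_inf(v, -45.0, 4.0) - h)   # inactivation
    n += dt / 4.0 * (x_inf(v, -35.0, -4.0) - n)
    # equation (2): I_i = g_i * m_i^p * h_i^q * (V_mem - V_equi)
    i_na = g_na * m**3 * h * (v - e_na)
    i_k  = g_k * n**4 * (v - e_k)
    i_l  = g_l * (v - e_l)
    # equation (1), the stimulation current entering with a positive sign
    v += dt / C * (-(i_na + i_k + i_l) + i_s)
    vs.append(v)

print("action potentials:", sum(a < 0.0 <= b for a, b in zip(vs, vs[1:])))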
One important specificity of our process for implementing the neuron model equations is that none of the equation parameters is definitely set. For each of the parameters that appear in an ionic current generator expression, a dedicated chip input is reserved; the effective parameter value will depend on the voltage applied to that input by the user. The voltage range is adapted to the concerned parameter, to match its variation range for neuron models. Using identical chips, we thus had the opportunity to model different types of neurons, of both vertebrate and invertebrate species [7], [8]. Some of these models will appear in the applications of section IV.
However, the model chips by themselves can only be exploited for the processing of simple models. In that case, the many voltage inputs used to fix the parameters are manually driven, using for example a set of variable resistors. But the circuits cannot easily be used in more complex or systematic experiments, in which the equation parameters have to be accurately managed and often modified.

III. ARCHITECTURE OF THE ANALOG SIMULATION SYSTEM

We chose to develop a system behaving as a neural simulator, but based on analog computation. That system comprises our model chips and a computer interface to program, modify or store the model parameters. It provides a framework to construct artificial neuronal systems. When running a simulation, facilities are also provided to use the system as an oscilloscope, and possibly store the membrane voltage. That information feedback is important when studying complex neural behavior, and will be necessary when we implement signal processing algorithms, as we have already done with an optimization process [12]. A final characteristic to underline is that specific functionalities have been added to the system, to make it the artificial counterpart in hybrid system experiments. Artificial synapses are computed, and provide inputs and outputs compatible with an Axoclamp amplifier. A hybrid experiment application will be presented in section IV.
The whole set-up comprises an electronic rack, connected to a computer system including an acquisition board (see Figure 2). The rack is organized around a mother board, supporting a data bus of analog and digital signals. Daughter boards (up to 8) supporting the artificial neurons on ASICs can be plugged into the mother board, and then connected to the bus. The latest boards we designed include 4 ASICs, i.e. 8 ionic current generators. These generators can be wired together on any of 32 membrane lines available on the data bus, which may be considered as membrane voltages. In that case, the line is loaded by a membrane capacitance, and an artificial neuron is created. Some ionic currents can be configured as synaptic currents, and are then added on the postsynaptic neuron membrane line. Compensation and stimulation currents may also be injected into the artificial neurons. A compensation current is directly issued from a VCCS (voltage-controlled current source) block implemented on the ASICs, to possibly compensate fabrication technological defects, whereas a stimulation current represents a synaptic current or a global influence on the neural network; the stimulation current may present various waveforms and is then provided by an external source.
On the daughter boards, one sets the topology of each neuron and of the network by configuring the connections of the membrane lines with manual switches. Among the 32 membrane lines, 8 may be output from the rack, to an oscilloscope, or to the computer acquisition board. They represent the analog part of the communication bus.
Figure 2
Architecture of the analog simulation system

The other part is composed of 3 digital lines, driving the configuration protocol of the chip parameters. These data are first treated by a programmable logic circuit, which decodes the addressed chip and an associated digital-to-analog converter (DAC). The DAC converts the serial 12-bit data of the parameter value, and continuously applies the analog result to the corresponding ASIC parameter input.
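In software terms, this parameter path amounts to a quantization followed by a reconstruction. The sketch below uses a hypothetical 0-5 V input range and invented helper names, simply to make the 12-bit configuration step concrete:

def encode_param(v_param: float, v_min: float, v_max: float) -> int:
    """Quantize a desired parameter voltage to a 12-bit DAC code."""
    code = round((v_param - v_min) / (v_max - v_min) * 4095)
    return max(0, min(4095, code))           # clamp to the 12-bit range

def dac_output(code: int, v_min: float, v_max: float) -> float:
    """Analog voltage the DAC continuously applies to the ASIC input."""
    return v_min + code / 4095 * (v_max - v_min)

# Example: a sigmoid-offset parameter mapped onto an assumed 0-5 V input range.
code = encode_param(1.234, 0.0, 5.0)
print(code, dac_output(code, 0.0, 5.0))      # 1011, ~1.234 V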
As we already mentioned, the set-up is adapted for use in hybrid experiments. In those experiments, hybrid neural networks are constructed by interconnecting in real time artificial neurons and living ones in an in vitro preparation. This construction is intended to give the opportunity to access and tune parameters of ionic or synaptic conductances, and then to infer their individual role in shaping the entire network behavior. The hybrid systems principle implies that the model runs in real time: the experiments, which were initially conducted using numerically computed neuron models, were confined to simple neurons due to the limited numerical processing speed. Analog computation is a way to definitely overcome that difficulty. Given that point, it was then easy to extend the functionalities of the analog simulation system to use it as a tool for hybrid experiments with the on-chip analog models as artificial neurons. A specific board was added in parallel with the whole system, supporting ionic current generators configured as artificial synapses (see Figure 1, equation (5)). When constructing a hybrid network, the real neurons connected to models are impaled with an intracellular electrode which is used, first, to record the biological neuron membrane potential and, second, to inject an artificial synaptic current. Both tasks are done by an Axoclamp amplifier in discontinuous current clamp mode. In the first case, our synaptic board receives a presynaptic membrane potential, and computes an artificial synaptic current added on the membrane line of the postsynaptic artificial neuron. In the second configuration, where the artificial neuron is the presynaptic one, the synaptic current is computed on the board and output as a proportional voltage value, to drive the Axoclamp current injection command.
The computer executes software that drives the whole set-up. A graphical interface is provided to the user, who directly sets the modeled neuron parameters in dialog boxes. Artificial networks programmed in that way can be freely saved and reloaded. After the configuration phase, where the commands are applied to the boards via the digital bus, the user can start the simulation and acquire each of the 8 membrane lines available on the analog bus. Waveforms can be saved or directly processed in macro functions such as optimization algorithms.

IV. APPLICATION EXAMPLES

An interesting application for the analog artificial neurons consists in the study of the thalamus relay structure. This small neural network is located in the thalamus, often called the gateway to the cerebral cortex, and acts as an interface transmitting information from the optic nerve to the cortex. It comprises two populations of neurons, called TC for the relay thalamo-cortical cells, and nRt for the reticular thalamic neurons. An interesting function of that intra-thalamic network is that it selectively controls the flow of visual information from the retina, during the various states of the sleep-wake cycle and arousal [13]. In a wake or attentive state, the neurons of the network are depolarized and present a general tonic activity, whereas during the sleep phase, synchronized oscillations (called spindle waves) appear, which are the result of reciprocal synaptic interaction between nRt and TC cells [14]. The study of these interactions is important for understanding the mechanisms of the transitions from sleep to waking, and more generally for explaining how the thalamocortical system may control the state of activity of the brain.
As a first example, we present the implementation of the TC cell model; its activity is described by the following equation:

C_mem * dV_mem/dt = -g_leak * (V_mem - V_leak) - I_T - I_A - I_h - I_Na - I_K    (10)

This model expression was defined by A. Destexhe et al. in 1993 [15], and has often been validated in experiments since [16]. It means that five ionic currents and a leak conductance are the source of the neuron activity. Those current equations follow the description given in Figure 1. I_Na (fast sodium current) and I_K (delayed-rectifier current) are responsible for the action potential generation; I_T (low threshold calcium current) represents the slow calcium activity; I_h (hyperpolarization-activated cation current) appears when the neuron is hyperpolarized; I_A (transient potassium current) represents a second-term potassium effect.

Figure 3
Modeling the different activity states of a TC neuron, using a ramp stimulation current.
A: bursting activity (delta waves). B: bursting activity vanishes.
C: in a depolarization phase, the neuron presents action potentials.

We programmed that model on an artificial neuron, and added a ramp stimulation current. The resulting measurement of the model membrane voltage is shown in figure 3.
The ramp stimulation allows the visualization of the different states of activity of the cell: first, a bursting activity where low-kinetic currents are activated; second, a silent phase when the neuron gets depolarized; third, a tonic phase of simple action potentials for a high stimulation level. Those three types of activity are characteristic of the TC cell, and are necessary for the efficiency of the TC-nRt loop, which will be illustrated in the next example.
The neurophysiological experimental approach for studying the thalamus relay structure is quite complicated: in in vitro preparations of vertebrate thalamus slices, due to slice thickness, the synaptic connections between the neurons are generally cut, and are then difficult to characterize. The TC-nRt loop effect on a visual stimulation is then impossible to evaluate. A hybrid experiment allows to solve that problem, by artificially reconstructing the thalamus relay structure. In the experiment presented in Figure 4, the TC cell is a living one, impaled in an in vitro preparation of a thalamus slice of a vertebrate animal. The nRt cells and the synaptic connections are artificial, constructed with our analog simulation system; the TC->nRt synapse is an excitatory one, whereas the nRt->TC synapse has an inhibitory effect. The experiment intends to prove that this synaptic combination is a key point to obtain the production of bursts which are the characteristics of the network awake state.

C_mem * dV_mem/dt = -g_leak * (V_mem - V_leak) - I_T - I_KCa - I_Can - I_Na - I_K    (11)

The nRt model [17] comprises five ionic currents and a leak conductance (equation (11)). Some currents are of identical types to those of the TC neuron; I_Can (non-specific cation current) is an additional fast current; an important calcium-dependence effect exists, and is expressed in I_KCa (calcium-dependent potassium current). The neurons of the network are in an awake state, where their membrane potentials are depolarized, and close to the value triggering action potentials. An external stimulation on the living TC neuron, simulating the optic nerve stimulation, should produce a long rhythmic bursting activity. The measurements of Figures 4-A and 4-B clearly show that such an activity is only present when both synaptic connections exist, and prove the validity of the synaptic model.

V. CONCLUSION

We have presented in this paper an electronic simulation system for modeling


biological neurons and networks. That system handles a configurable set of specitic
integrated circuits, designed to compute in analog mode and in real time a neuron model
cqt,ations. The programmability at a chip level gave us the opportunity to develop on the
system an interface, that results for an external user in an interactive software to set the
modeled neurons parameters and topology.
Applications have shown that the simulation system can be used for an isolated
computation of neural models, but also that a it is particularly intended for the
construction of hybrid networks. We can sum up that argumentation by stating that the
simulation system has been designed to be an tool, that can be used outside an electronics
laboratory environment, for biologists that intend to link the experimental approach of
neurophysiology experiments and the power of computational neurosciences.
Figure 4
Hybrid experiment with the analog artificial neurons, handling a thalamus relay network.
A: one synaptic connection is made, from TC to nRt; an external stimulation produces only one spike. B: reciprocal synapses are connected; the network responds to the stimulation with a burst sequence.

REFERENCES

[1] A.L. Hodgkin and A.F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve, Journal of Physiology, vol. 117, pp. 500-544, 1952.
[2] W. Softky, C. Koch, Single cell models, in M. Arbib, editor, The handbook of brain theory and neural networks, pp. 879-884, MIT Press, Boston, MA, 1995.
[3] C. Koch and I. Segev, Editors, Methods in neuronal modeling: from synapses to networks, MIT Press, Cambridge, MA, 1989.
[4] M. Mahowald, R.J. Douglas, A silicon neuron, Nature, vol. 354, pp. 515-518, 1991.
[5] R.J. Douglas, M. Mahowald, A construction set for silicon neurons, in S.F. Zornetzer et al., editors, Neural and Electronics Networks, pp. 277-296, Academic Press, Arlington, 1995.
[6] D. Dupeyron, S. Le Masson, Y. Deval, G. Le Masson and J.P. Dom, A BiCMOS implementation of the Hodgkin-Huxley formalism, Proc. of MicroNeuro'96, Lausanne, IEEE Computer Society Press, pp. 311-316, 1996.
[7] A. Laflaquière, S. Le Masson, G. Le Masson and J.P. Dom, Accurate analog VLSI model of Calcium-dependent bursting neurons, International Conference on Neural Networks (ICNN'97, Houston, Texas), 1997.
[8] S. Le Masson, A. Laflaquière, D. Dupeyron, T. Bal, G. Le Masson, Analog circuits for modeling biological neural networks: design and applications, IEEE Transactions on Biomedical Engineering, in press.
[9] G. Le Masson, S. Le Masson and M. Moulins, From conductances to neural networks properties: analysis of simple circuits using the hybrid networks method, Progress in Biophysics and Molecular Biology, vol. 64, pp. 201-220, 1995.
[10] R.W. Meech, Calcium-dependent activation in nervous tissues, Annual Review of Biophysics and Bioengineering, vol. 7, pp. 1-18, 1978.
[11] G. Le Masson, E. Marder and L.F. Abbott, Activity-dependent regulation of conductances in model neurons, Science, vol. 259, pp. 1915-1917, 1993.
[12] G. Le Masson, Stabilité fonctionnelle des réseaux de neurones : étude expérimentale et théorique dans le cas d'un réseau simple, Thèse de l'Université Bordeaux 1, 1998.
[13] D.A. McCormick, T. Bal, Sensory gating mechanisms of the thalamus, Current Opinion in Neurobiology, vol. 4, pp. 550-556, 1994.
[14] T. Bal, D.A. McCormick, Mechanisms of oscillatory activity in guinea-pig nucleus reticularis thalami in vitro: a mammalian pacemaker, Journal of Physiology, vol. 486, pp. 669-691, 1993.
[15] A. Destexhe, A. Babloyantz, T. Sejnowski, Ionic mechanisms for intrinsic slow oscillations in thalamic relay neurons, Biophysical Journal, vol. 65, pp. 1538-1552, 1993.
[16] T. Bal, M. von Krosigk, D.A. McCormick, Synaptic and membrane mechanisms underlying synchronized oscillations in the ferret lateral geniculate nucleus in vitro, J. of Physiology, vol. 483.3, pp. 641-663, 1995.
[17] M. von Krosigk, T. Bal, D.A. McCormick, Cellular mechanisms of a synchronized oscillation in the thalamus, Science, vol. 261, pp. 361-364, 1993.
Neural Addition and Fibonacci Numbers

Valeriu Beiu *

RN2R LLC, 14850 Montfort Drive, Suite 181, Dallas, Texas 75240, USA
E-mail: vbeiu@rose-research.com

Abstract. This paper presents an intriguing relation between neural networks having as weights the Fibonacci numbers and the ADDITION of (two) binary numbers. The practical application of interest is that such 'Fibonacci' networks are VLSI-optimal with respect to the area of the circuit. We shortly present the state-of-the-art, and detail a class of multilayer solutions for ADDITION. For this class we will prove constructively that the weights of the threshold gates implementing the Boolean functions are the Fibonacci numbers. As the weights are the smallest integers (by construction), the area of the VLSI circuit (estimated as the sum of the digits needed to represent the weights) is minimised. Therefore this class of solutions is VLSI-optimal. Conclusions and open questions end the paper.

1 Introduction

In this paper we shall consider feedforward artificial neural networks for computing addition. Formally, a network is a graph having several input nodes, and some (at least one) output nodes. If a synaptic weight is associated with each edge, and each node i computes the weighted sum of its inputs, to which a nonlinear activation function is then applied (i.e., an artificial neuron, or simply a neuron):

f_i(x_i) = f_i(x_i,1, ..., x_i,k) = sigma_i( SUM_{j=1..k} w_j * x_i,j + theta_i ),    (1)

the network is a neural network (NN), with the synaptic weights w_j in R, theta_i in R known as the threshold, k in N being the fan-in, and sigma_i a non-linear activation function. If the underlying graph is acyclic, the network does not have feedback connections, and can be layered, being known as a multilayer feedforward neural network; it is commonly characterised by two cost functions: depth (i.e., number of layers) and size (i.e., number of neurons). We shall first discuss Fibonacci numbers, then shortly present known results for ADDITION, before introducing and proving the VLSI-optimality of a NN having Fibonacci numbers as its weights. Conclusions and open questions end the paper.

2 Fibonacci Numbers

Leonard of Pisa (Leonardo Pisano, 1170-1240) is better known by his nickname: Fibonacci. This is short for filius Bonacci, which means son of Bonacci, and may mean "lucky son" (literally, "son of good fortune"). He played an important role in reviving ancient mathematics and made significant contributions of his own. Liber Abaci (published in 1202) introduced the Hindu-Arabic place-valued decimal system and the use

The author is 'on leave of absence' from the "Politehnica" University of Bucharest, Computer Science Department, Spl. Independenţei 313, RO-77206 Bucharest, Romania.
of Arabic numerals in Europe. It is the famous Rabbit Problem described in Liber Abaci that leads to the Fibonacci numbers and the Fibonacci sequence, for which Fibonacci is in fact best remembered today: "A certain man put a pair of rabbits in a place surrounded on all sides by a wall. How many pairs of rabbits can be produced from that pair in a year if it is supposed that every month each pair begets a new pair which from the second month on becomes productive?" Supposing that our rabbits never die and that the female always produces one new pair (one male, one female) every month, the answer involves the series of numbers 1, 1, 2, 3, 5, 8, ..., where each number is the sum of the previous two. It was the French mathematician Edouard Lucas who gave the name Fibonacci numbers to this series.
Definition. For any i >= 2, the Fibonacci sequence Fib_i is defined by the recurrence
Fib_i = Fib_{i-1} + Fib_{i-2},    (2)
with the initial conditions Fib_0 = 0 and Fib_1 = 1.
A wealth of mathematics has arisen from this sequence, and today a journal (the Fibonacci Quarterly) is devoted to topics related to it. Besides, the numbers or their ratios appear quite frequently in nature: shell spirals, pine cones, branching plants, petals on flowers, arrangement of seeds on flowerheads, leaf arrangements, etc. The reason seems to be the same for the arrangement of leaves as for seeds and petals, and is given by the golden section (or golden ratio) Φ, defined by the property "to square it you just add 1" (Φ^2 = Φ + 1), and its reciprocal τ = 1/Φ. If there are τ leaves per turn, then we have the best packing, so that each leaf gets the maximum exposure to light, casting the least shadow on the others. The explanation is that the best number is an irrational number that never settles to a rational approximation for very long. The simplest such number can be expressed as 1 + 1/(1 + 1/(...)) (i.e., a continued fraction), or x = 1 + 1/x, whose solution is Φ; this gives another possible definition of τ = 1/Φ = (sqrt(5) - 1)/2. The rational approximations of τ are given by 1/1, 1/2, 2/3, 3/5, 5/8, ..., which are the ratios of successive Fibonacci numbers.
Binet's formula (probably discovered previously by de Moivre) gives the n-th Fibonacci number directly. It involves the golden section number Φ and its reciprocal τ: Fib_n = [Φ^n - (-τ)^n] / sqrt(5). A simpler formula can easily be derived:

Fib_n = round(Φ^n / sqrt(5)),    (3)

while the n-th power of the golden section can be obtained using the Fibonacci numbers as Φ^n = τ^(-n) = Fib_{n-1} + Fib_n * Φ. Binet's formula can also be used to extend the Fibonacci numbers to negative n, and gives 1, -1, 2, -3, 5, ..., i.e. Fib_{-n} = (-1)^(n+1) * Fib_n.
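These identities are easy to check numerically. The short Python sketch below compares the recurrence (2) with the rounded Binet formula (3), the negative-index extension, and the power identity for Φ:

from math import sqrt

PHI = (1 + sqrt(5)) / 2          # golden section
TAU = 1 / PHI                    # its reciprocal, (sqrt(5) - 1) / 2

def fib(n: int) -> int:
    """Fibonacci numbers by the recurrence (2), extended to negative n."""
    if n < 0:
        return (-1) ** (-n + 1) * fib(-n)
    a, b = 0, 1                  # Fib_0, Fib_1
    for _ in range(n):
        a, b = b, a + b
    return a

for n in range(1, 20):
    assert fib(n) == round(PHI ** n / sqrt(5))           # formula (3)
    assert fib(-n) == (-1) ** (n + 1) * fib(n)           # negative-index identity
    assert abs(PHI ** n - (fib(n - 1) + fib(n) * PHI)) < 1e-9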

3 ADDITION

The ADDITION of two n-bit numbers, an augend X = x_{n-1} x_{n-2} ... x_1 x_0 and an addend Y = y_{n-1} y_{n-2} ... y_1 y_0, is defined as the unsigned sum of the addend added to the augend, S = s_n s_{n-1} ... s_1 s_0. A well established method [9, 17, 22, 36] for this computation is: c_i = (x_i ∧ y_i) ∨ (x_i ∧ c_{i-1}) ∨ (y_i ∧ c_{i-1}), c_{-1} = 0, s_i = x_i ⊕ y_i ⊕ c_{i-1}, for i = 0, 1, ..., n-1 (or s_i = (x_i ∧ y_i) ∨ (x_i ∧ c_{i-1}) ∨ (y_i ∧ c_{i-1}) as in [13]), and s_n = c_{n-1}. The c_i are known as the "carry" bits, and it is clear that ADDITION reduces to computing the carries.
Table 1. Two-input AND-OR adders (constructive solutions).

Author(s)               Depth (delay)                          Size (# gates)      Remarks
Wegener 1996            3                                      O(n^2)              [39] (also in [27])
Chandra et al. 1984     4                                      O(n^2)              [11] (also in [37])
School method           2n-1                                   5n-3                From [37]
Chang et al. 1992       4logn                                  35n-6               Carry-lookahead [12]
Brent & Kung 1982       4logn                                  14n-logn-10         Prefix algorithm [9]
Kelliher et al. 1992    3logn                                  3nlogn+25n/2-8      Conditional sum [18]
Ladner & Fischer 1980   2logn+2k+2, for 0 <= k <= logn         n(8+6/2^k)          Prefix algorithm [21]
Wei & Thompson 1990     (2+ε)logn                              (2+ε)nlogn+5n       Conditional sum (ε > 0) [40]
Wegener 1987            2logn+1                                3nlogn+10n-6        Conditional sum [37]
Kelliher et al. 1992    2logn                                  (5/2)nlogn+5n-1     Conditional sum [18]
Krapchenko 1967         2logn+7*sqrt(2logn)+16                 9n                  Prefix algorithm [19]
Ong & Atkin 1983,       2logn/logΔ-1                           O(nlogn/logΔ)       Prefix algorithm with limited fan-in [25, 26]
  Ngai & Irwin 1985
Han et al. 1987         2(logn/logΔ+k), for                    2kn+n^2/Δ^k         Hybrid prefix with limited fan-in and fan-out [15]
                          log_Δ n - log_Δ log_Δ n <= k <= log_Δ n - 1
Montoye 1981            2logn                                  O(n^2)              Prefix algorithm (fan-in <= n/2+1) [24]
Kogge & Stone 1973      2logn/logΔ                             O(n^2)              Prefix algorithm with limited fan-in [20]

Historically much attention has been paid to the tradeoff between delay (depth) and the number of gates (size), but later attention switched and focused on the VLSI area complexity, by looking at how to connect the gates in simple and regular ways so as to minimise it.
Some of the well known adders built out of AND-OR bounded fan-in logic gates (we use delay instead of depth, and gates instead of size, as in most of the original articles) are shown in Table 1 (here n is the number of bits needed to represent one input).
It has also been proven that a depth-2 circuit of AND-OR logic gates for ADDITION must have exponential size [33]. Some authors [12, 40] have formulated the problem of minimising the latency in carry-skip and block carry-lookahead adders [14, 29] as multidimensional dynamic programming. Others [23] have investigated implementations based on spanning trees. But on the whole a lot of effort has been devoted to practical implementations [18, 23, 28, 40]. Out of these, at least two papers made the remark that a way to reduce the number of logic levels (and correspondingly the circuit latency) is to increase the fan-in [40], or equivalently to group more bits [23]. But they mentioned that "no practical method for doing this has been presented in the literature." Still, some very interesting results using fan-ins larger than two, building on [9, 20], have been reported in [15, 25, 26, 34]. They mention that increasing the fan-in affects the time performance of the circuitry in three different ways:
• the depth decreases from O(logn) to O(logn/logΔ);
• the delay of each processing element increases due to the need to implement more complex logic;
• an increase in delay time is caused by the larger fan-out capacitance (the number of gates the outputs of a cell must drive grows as Δ^2).
Extensions of the algorithm for prefix computation of Brent and Kung [9] to larger fan-ins (Δ = 3, 4) have been presented by Ong and Atkins [26], and later by Ngai and Irwin [25], while similar extensions for a hybrid prefix algorithm were detailed in [15]. The hybrid prefix algorithm exhibits the lowest depth in optimal area O(nlogn), as can be seen in Table 1. Han et al. [15] detail two implementation enhancements for reducing the area: a folding method and the use of hierarchical leaf cells. Their very thorough analysis has led to the conclusion that for all operand lengths n <= 1024, the optimal fan-in for the hybrid prefix algorithm is either 3 or 4.
From the completely different point of view of neural networks, ADDITION has been considered as a challenging function (as it implies computing the nonlinearly separable exclusive-OR) and a useful one [1, 2, 31, 39]. A depth-2 threshold gate (TG, i.e., the simplest artificial neuron, having the sign function as nonlinearity) circuit of size O(n^4) has been presented in [1, 2]. Two other constructions for ADDITION based on AND-OR logic gates have been detailed in [33]. Because AND and OR gates can be simulated by TGs, they have been considered as unbounded fan-in TG circuits, the results being:
• a depth-3 TG circuit of size O(n^2) (Theorem 7, [33], p. 1410);
• a depth-7 TG circuit of size O(nlogn) (Lemma 4, [33], p. 1408).
It becomes clear that going for a lower depth (from 7 to 3) increases the size complexity from O(nlogn) to O(n^2).
A general class of solutions was detailed in [3, 4]. It is based on a class F_Δ of linearly separable functions, defined as the class of functions f_Δ of Δ input variables, with Δ even, f_Δ = f_Δ(g_{Δ/2-1}, p_{Δ/2-1}, ..., g_0, p_0), computing:

f_\Delta = \bigvee_{j=0}^{\Delta/2-1} \Big[ g_j \wedge \Big( \bigwedge_{k=j+1}^{\Delta/2-1} p_k \Big) \Big]

By convention, we consider the empty conjunction \bigwedge_{k=\Delta/2}^{\Delta/2-1} p_k = 1. One restriction is that the input variables are pair-dependent, meaning that we can group the Δ input variables into Δ/2 pairs of two input variables each, (g_{Δ/2-1}, p_{Δ/2-1}), ..., (g_0, p_0), and that in each such pair one variable is "dominant" (i.e., when a dominant variable is 1, the other variable forming the pair will also be 1). This can be explained if the generate and propagate variables are defined as g_i = x_i ∧ y_i and p_i = x_i ∨ y_i. Because the Boolean functions from Step 3 and Step 4 of Lemma 4 from [33] are F_Δ functions, the depth-7 construction can immediately be shrunk to depth-5 by allowing threshold gates instead of AND-OR gates in the intermediate layers. This depth-5 TG circuit still has O(nlogn) size [3, 4].
From Brent and Kung [9] it is known that the carry chain can be computed based on an associative operator "∘" defined as:

(g, p) \circ (g', p') = [\, g \vee (p \wedge g'),\; p \wedge p' \,], \qquad (G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})   (4)

for 2 ≤ i ≤ n. It has been proven that c_i = G_i. In these equations, g_i is the "carry generate", p_i is the "carry propagate", G_i can be imagined as a "block carry generate" (also known as "G-functions" or "triangles" [37]), and P_i can be imagined as a "block carry propagate". The carry generate is computed as g_i = x_i ∧ y_i; for the carry propagate we

can use either p_i = x_i ⊕ y_i or p_i = x_i ∨ y_i [21, 33]. As "∘" is associative, the computation of the G_i and P_i can take place in any order. This result was used to constructively prove a class of TG solutions spanning depths from 4 to 3 + logn, and sizes from 7n to 2nlogn + 5n (later this class of adders was used to prove inclusions amongst many classes of circuit complexity [7]).
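A minimal sketch (ours; the helper names are illustrative) of carry computation with the operator "∘" of eqn (4), evaluated here by a simple linear scan; it is the associativity of "∘" that additionally permits the O(logn)-depth tree evaluation of Brent and Kung:

def op(a, b):
    # (g, p) o (g', p') = (g OR (p AND g'), p AND p')  -- eqn (4)
    g, p = a
    g2, p2 = b
    return (g | (p & g2), p & p2)

def carries(x_bits, y_bits):
    # x_bits, y_bits: little-endian 0/1 lists; returns c_i = G_i per position
    gp = [(x & y, x | y) for x, y in zip(x_bits, y_bits)]  # (g_i, p_i)
    G, P = gp[0]
    out = [G]
    for g, p in gp[1:]:
        G, P = op((g, p), (G, P))  # (G_i, P_i) = (g_i, p_i) o (G_{i-1}, P_{i-1})
        out.append(G)
    return out

print(carries([1, 1, 0, 1], [1, 0, 1, 1]))  # 11 + 13 -> carries [1, 1, 1, 1]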
Finally, another depth-3 solution has been presented in [10], followed by three more depth-3 constructions in [35] and [13]:
• one of O(n²/(klogn)) size (for any 1 ≤ k ≤ n/logn) having polynomially bounded weights O(nᵏ); by varying k, the sizes obtained cover the interval from 10n to 2n²/logn + 8n, but for those solutions close to 10n the weights are no longer polynomially bounded (n^{n/(c·logn)} = 2^{n/c});
• two other solutions achieving linear size O(n), but having weights bounded by 2^⌈√n⌉ (i.e., exponential).¹
For an easier comparison, all these solutions are presented in a compact form in
Table 2. They show interesting depth-size tradeoffs which can be clearly related to
the (allowed) range of weights and fan-ins.

4 ADDITION Revisited

The TGs for implementing the f_Δ functions have as inputs generate and propagate values, and output a group-generate value (for the group-propagate, an AND gate is the simplest implementation). As an example, consider f₆ = g₂ + p₂(g₁ + p₁g₀), where g₀ = C_in, g₁ = a₁·b₁, g₂ = a₂·b₂, p₁ = a₁ + b₁, and p₂ = a₂ + b₂. We will prove that the resulting functions:

C_{out} = g_k + p_k (g_{k-1} + \dots + p_2 (g_1 + p_1 g_0) \dots ) = f_{2(k+1)}   (5)

are always linearly separable. In fact, these are f_Δ functions without the restriction of dominant input variables. We will show recursively that any such function (i.e., having arbitrary fan-in Δ, see eqn (5)) can be implemented by one TG (having fan-in Δ − 1):
• f₄ can be implemented by one TG;
• f_{Δ+2} can be implemented by one TG having the same weights as the TG implementing f_Δ, by adding two weights and modifying the threshold.
Proposition 1. F_Δ is a class of linearly separable functions without any restriction on the input variables.
Proof. For Δ = 4, eqn (5) becomes:

f_4(g_1, p_1, g_0, p_0) = g_1 \vee (p_1 \wedge g_0)

and from [16] it is known that this is a linearly separable function (it equals 1 exactly when 2g₁ + p₁ + g₀ ≥ 2):

g_1 \vee (p_1 \wedge g_0) = \mathrm{sgn}(2g_1 + p_1 + g_0 - 2).   (6)
Refining eqn (5), we can determine the following recursive version, eqn (7) below (we increment by 2, as Δ has to be even).

¹ In this paper ⌊x⌋ is the floor of x, i.e. the largest integer less than or equal to x, and ⌈x⌉ is the ceiling of x, i.e. the smallest integer greater than or equal to x; all logarithms are to base 2 (unless otherwise explicitly mentioned).

Table 2. Two-input threshold gate adders (constructive solutions).

Author(s)                     Depth (delay)         Size (#gates)                    Remarks
Siu & Bruck 1990              2                     n^c                              weights <= n^c, fan-in <= n^c
Alon & Bruck 1991             2                     O(n^4)                           weights in {-1, 0, +1}, fan-in <= n^4
Siu et al. 1991               3                     (n^2 + 7n - 2)/2                 weights in {-1, 0, +1}, fan-in <= n
                              7                     O(nlogn)                         weights in {-1, 0, +1}, fan-in <= n
Beiu et al. 1994              5                     O(nlogn)                         weights <= 2^{n/2}, fan-in <= n
Beiu et al. 1994              3 + [logn/logD - 1]   5n + 2[logn/logD - 1]            weights <= 2^{D/2}
Beiu & Taylor 1996            4 <= depth <= 3+logn  7n <= size <= 2nlogn + 5n        fan-in <= D
Vassiliadis et al. 1996 a)    3                     O(n^2/(klogn)) (1 <= k <= n/logn) weights <= O(n^k), fan-in <= 2n
                              3                     6n + 2[n/[sqrt(n)]]              weights <= 2^[sqrt(n)], fan-in <= 2[n/[sqrt(n)]] + 3
Cotofana & Vassiliadis 1997   3                     4n                               weights <= 2^[sqrt(n)], fan-in <= 2[n/[sqrt(n)]]

(D denotes the fan-in Delta; [x] denotes the ceiling of x.)

a) The exact value for size is size(n, k) = 2klogn·[n/(klogn)]·([n/(klogn)] - 1) + 8klogn·[n/(klogn)], spanning 10n <= size <= 2n²/logn + 8n, while the fan-in, spanning 2logn <= fan-in <= 2n, is: Δ(n, k) = max{2klogn, 4([n/(klogn)] + 2)}.

f_{\Delta+2} = \bigvee_{j=0}^{\Delta/2-1} \Big\{ g_j \wedge \Big[ \Big( \bigwedge_{k=j+1}^{\Delta/2-1} p_k \Big) \wedge p_{\Delta/2} \Big] \Big\} \vee (g_{\Delta/2} \wedge 1)
            = g_{\Delta/2} \vee [\, p_{\Delta/2} \wedge f_\Delta(g_{\Delta/2-1}, p_{\Delta/2-1}, \dots, g_0, p_0) \,] = g_{\Delta/2} \vee (p_{\Delta/2} \wedge f_\Delta).   (7)

Suppose that the claim is true for Δ (i.e., f_Δ is linearly separable); then

f_\Delta = \mathrm{sgn}\Big( \sum_{i=0}^{\Delta/2-1} v_i g_i + \sum_{i=0}^{\Delta/2-1} w_i p_i + t_\Delta \Big) = \mathrm{sgn}(\Sigma)   (8)

holds. Here, we use Σ as shorthand for the weighted sum \sum_{i=0}^{\Delta/2-1} v_i g_i + \sum_{i=0}^{\Delta/2-1} w_i p_i + t_\Delta.
As hypothesis for the recursion, we also consider that all the weights are positive (non-negative) integers, with only the thresholds being negative integers (easy to verify for the particular case Δ = 4 simply by looking at eqn (6): v₁ = 2, w₁ = 1, v₀ = 1, w₀ = 0, while t₄ = −2).
To constructively prove that f_{Δ+2} is linearly separable, we build it in three steps:
• copy all the corresponding weights from f_Δ;
• add two additional weights v_{Δ/2} and w_{Δ/2} (for the variables g_{Δ/2} and p_{Δ/2}):

v_{\Delta/2} = 1 + \sum_{i=0}^{\Delta/2-1} v_i + \sum_{i=0}^{\Delta/2-1} w_i   (9)

w_{\Delta/2} = \sum_{i=0}^{\Delta/2-1} v_i   (10)

• change the threshold to t_{Δ+2}:

t_{\Delta+2} = -v_{\Delta/2} = -1 - \sum_{i=0}^{\Delta/2-1} v_i - \sum_{i=0}^{\Delta/2-1} w_i   (11)

These lead to:

f_{\Delta+2} = \mathrm{sgn}\Big[ \Big( v_{\Delta/2}\, g_{\Delta/2} + \sum_{i=0}^{\Delta/2-1} v_i g_i \Big) + \Big( w_{\Delta/2}\, p_{\Delta/2} + \sum_{i=0}^{\Delta/2-1} w_i p_i \Big) + t_{\Delta+2} \Big].   (12)
We shall verify that eqns (8) and (12) satisfy the recursion, i.e., eqn (7). Three cases have to be considered:
• If g_{Δ/2} = 1 then f_{Δ+2} = 1 regardless of the other input variables (see eqn (7)). Equation (12) becomes:

f_{\Delta+2} = \mathrm{sgn}\Big[ \Big( v_{\Delta/2} + \sum_{i=0}^{\Delta/2-1} v_i g_i \Big) + \Big( w_{\Delta/2}\, p_{\Delta/2} + \sum_{i=0}^{\Delta/2-1} w_i p_i \Big) + t_{\Delta+2} \Big].

The worst case, due to the fact that all the weights are positive, is when all the other input variables are 0. By substituting eqns (9), (10) and (11) into the previous equation we obtain:

f_{\Delta+2} = \mathrm{sgn}(v_{\Delta/2} + t_{\Delta+2}) = \mathrm{sgn}(0) = 1.

• If g_{Δ/2} = 0 we have to analyse two cases.
First, suppose that p_{Δ/2} = 0. This makes f_{Δ+2} = 0 regardless of the other input variables (see again eqn (7)). By replacing g_{Δ/2} = 0 and p_{Δ/2} = 0, eqn (12) becomes:

f_{\Delta+2} = \mathrm{sgn}\Big( \sum_{i=0}^{\Delta/2-1} v_i g_i + \sum_{i=0}^{\Delta/2-1} w_i p_i + t_{\Delta+2} \Big)

and even if all the (other) input variables are 1, the value of t_{Δ+2} (see eqn (11)) is large enough:

f_{\Delta+2} = \mathrm{sgn}\Big( \sum_{i=0}^{\Delta/2-1} v_i + \sum_{i=0}^{\Delta/2-1} w_i - 1 - \sum_{i=0}^{\Delta/2-1} v_i - \sum_{i=0}^{\Delta/2-1} w_i \Big) = \mathrm{sgn}(-1) = 0.

Second, suppose that p_{Δ/2} = 1. This is in fact the most complicated case. We remember that g_{Δ/2} = 0. In this case f_{Δ+2} = f_Δ (see eqn (7)). Starting again from eqn (12), we replace g_{Δ/2} = 0 and p_{Δ/2} = 1:

f_{\Delta+2} = \mathrm{sgn}\Big[ \sum_{i=0}^{\Delta/2-1} v_i g_i + \Big( w_{\Delta/2} + \sum_{i=0}^{\Delta/2-1} w_i p_i \Big) + t_{\Delta+2} \Big]

and by substituting eqns (10) and (11) we obtain:

f_{\Delta+2} = \mathrm{sgn}\Big[ \sum_{i=0}^{\Delta/2-1} v_i g_i + \Big( \sum_{i=0}^{\Delta/2-1} v_i + \sum_{i=0}^{\Delta/2-1} w_i p_i \Big) - 1 - \sum_{i=0}^{\Delta/2-1} v_i - \sum_{i=0}^{\Delta/2-1} w_i \Big]
            = \mathrm{sgn}\Big[ \Big( \sum_{i=0}^{\Delta/2-1} v_i g_i + \sum_{i=0}^{\Delta/2-1} w_i p_i \Big) - 1 - \sum_{i=0}^{\Delta/2-1} w_i \Big].

The first two sums are larger or smaller than −t_Δ if f_Δ = 1, or respectively f_Δ = 0 (see eqn (8)). Let these two sums (i.e., between the round parentheses) be −t_Δ + ε, with ε non-negative if f_Δ = 1, and respectively negative if f_Δ = 0 (see eqn (7)). Then:

f_{\Delta+2} = \mathrm{sgn}\Big[ (-t_\Delta + \varepsilon) - 1 - \sum_{i=0}^{\Delta/2-1} w_i \Big]

and replacing t_Δ as given by eqn (11):

f_{\Delta+2} = \mathrm{sgn}\Big[ \Big( 1 + \sum_{i=0}^{\Delta/2-2} v_i + \sum_{i=0}^{\Delta/2-2} w_i + \varepsilon \Big) - 1 - \sum_{i=0}^{\Delta/2-1} w_i \Big]
            = \mathrm{sgn}\Big( \sum_{i=0}^{\Delta/2-2} v_i + \varepsilon - w_{\Delta/2-1} \Big).

Finally, we use eqn (10), which gives w_{\Delta/2-1} = \sum_{i=0}^{\Delta/2-2} v_i, to obtain:

f_{\Delta+2} = \mathrm{sgn}(\varepsilon) = f_\Delta.

The fact that the recursion (eqn (7)) is verified concludes the proof. □
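The recursive construction of eqns (9)-(11) is easy to check mechanically. The sketch below (ours; the helper names are illustrative) builds the weights starting from the f₄ base case of eqn (6) and verifies, by exhausting all 2^Δ input assignments, that the single threshold gate equals the Boolean definition of eqn (5):

from itertools import product

def build_tg(delta):
    # returns (v, w, t) with f_delta = sgn(sum(v_i*g_i) + sum(w_i*p_i) + t)
    v, w, t = [1, 2], [0, 1], -2      # base case f_4, from eqn (6)
    while 2 * len(v) < delta:         # grow f_d -> f_{d+2}
        new_w = sum(v)                # eqn (10)
        new_v = 1 + sum(v) + sum(w)   # eqn (9)
        v.append(new_v)
        w.append(new_w)
        t = -new_v                    # eqn (11)
    return v, w, t

def f_bool(g, p):
    # eqn (5), little-endian: g_k + p_k(g_{k-1} + ... + p_1 g_0)
    acc = g[0]
    for gi, pi in zip(g[1:], p[1:]):
        acc = gi | (pi & acc)
    return acc

def f_tg(g, p, v, w, t):
    s = sum(vi * gi for vi, gi in zip(v, g)) \
        + sum(wi * pi for wi, pi in zip(w, p)) + t
    return 1 if s >= 0 else 0         # the paper's convention: sgn(0) = 1

for delta in (4, 6, 8, 10):
    v, w, t = build_tg(delta)
    m = delta // 2
    assert all(f_bool(b[:m], b[m:]) == f_tg(b[:m], b[m:], v, w, t)
               for b in product((0, 1), repeat=delta))
    print(delta, v, w, t)

Printing v and w already hints at Proposition 2 below: the sequences are 1, 2, 5, 13, ... and 0, 1, 3, 8, ..., i.e. the odd- and even-indexed Fibonacci numbers.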

Proposition 2. The sequences of weights w_k and v_k are the even- and respectively odd-indexed Fibonacci numbers: w_k = Fib_{2k} and v_k = Fib_{2k+1}.
Proof. The initial conditions show the following correspondence between w₀, v₀, w₁, v₁ and the Fibonacci numbers:

Fib_0 = w_0 = 0,   Fib_1 = v_0 = 1,   Fib_2 = w_1 = 1,   Fib_3 = v_1 = 2.

Let us suppose that w_k and v_k are the even and respectively odd Fibonacci numbers. We will prove that v_k = Fib_{2k+1} and w_k = Fib_{2k} satisfy eqns (9) and (10).
Because v_k and w_k are Fibonacci numbers, eqn (9) becomes:

v_k = \mathrm{Fib}_{2k+1} = 1 + \sum_{i=0}^{2k-1} \mathrm{Fib}_i = 1 + \mathrm{Fib}_{2k-1} + \mathrm{Fib}_{2k-2} + \mathrm{Fib}_{2k-3} + \dots

but from the definition Fib_{2k+1} = Fib_{2k} + Fib_{2k−1}:

\mathrm{Fib}_{2k} + \mathrm{Fib}_{2k-1} = 1 + \mathrm{Fib}_{2k-1} + \mathrm{Fib}_{2k-2} + \mathrm{Fib}_{2k-3} + \dots
\mathrm{Fib}_{2k} = 1 + \mathrm{Fib}_{2k-2} + \mathrm{Fib}_{2k-3} + \dots = 1 + \sum_{i=0}^{2k-2} \mathrm{Fib}_i

which recursively proves eqn (9).
Because v_k and w_k are Fibonacci numbers, eqn (10) becomes:

w_k = \mathrm{Fib}_{2k} = \mathrm{Fib}_{2k-1} + \mathrm{Fib}_{2k-3} + \mathrm{Fib}_{2k-5} + \dots

but from the definition Fib_{2k} = Fib_{2k−1} + Fib_{2k−2}:

\mathrm{Fib}_{2k-1} + \mathrm{Fib}_{2k-2} = \mathrm{Fib}_{2k-1} + \mathrm{Fib}_{2k-3} + \mathrm{Fib}_{2k-5} + \dots
\mathrm{Fib}_{2k-2} = \mathrm{Fib}_{2k-3} + \mathrm{Fib}_{2k-5} + \dots

which recursively proves eqn (10). □

Proposition 2 implies that the weights are bounded as φ^{2k+1}/√5 (see eqn (3)). Solutions with small fan-ins and small weights are of interest [5, 6, 13, 30, 32, 39] because the area of a VLSI implementation is considered to be proportional to the sum of the digits needed to represent the weights. The weights we have determined are the smallest integers (by construction); therefore the solution is VLSI-optimal in the sense of minimising this area.
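For reference, this bound follows from Binet's closed form for the Fibonacci numbers, with φ = (1 + √5)/2 the golden ratio:

\mathrm{Fib}_m = \frac{\varphi^m - (-\varphi)^{-m}}{\sqrt{5}}, \qquad v_k = \mathrm{Fib}_{2k+1} \approx \frac{\varphi^{2k+1}}{\sqrt{5}}.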

5 Conclusions

The paper has presented a class of NNs computing the ADDITION of two binary numbers. The interesting result is that the weights of such a VLSI-optimal solution are the Fibonacci numbers. Open questions remain on why the weights are exactly the Fibonacci numbers, and whether such 'Fibonacci' TG circuits could compute other (useful) functions.

References
1. Alon, N., & Bruck, J. (1991). Explicit Construction of Depth-2 Majority Cir-
cuits for Comparison and Addition. IBM Technical Report RJ 8300 (75661).
San Jose, CA: IBM Almaden Research Center.
2. Alon, N., & Bruck, J. (1994). Depth-2 Threshold Logic Circuits for Logic and
Arithmetic Functions. Patent US 5357528.
3. Beiu, V., Peperstraete, J.A., Vandewalle, J., & Lauwereins, R. (1994a). Area-
Time Performances of Some Neural Computations. In P. Borne, T. Fukuda &
S.G. Tzafestas (Eds.): Proc. IMACS Intl. Symp. on Signal Proc., Robotics and
Neural Networks, Lille, France (pp. 664-668). Lille: GERF EC.
4. Beiu, V., Peperstraete, J.A., Vandewalle, J., & Lauwereins, R., (1994b). On
the Circuit Complexity of Feedforward Neural Networks. In M. Marinaro &
P.G. Morasso (eds.): Proc. Intl. Conf. on Artif. Neural Networks, Sorrento, It-
aly (pp. 521-524). Springer-Verlag.
5. Beiu, V., Peperstraete, J.A., Vandewalle, J., & Lauwereins, R. (1994c). Opti-
mal Parallel ADDITION Means Constant Fan-In Threshold Gates. In Proc. Intl.
Conf. on Technical Informatics, Timisoara (vol. 5, pp. 166-177). Timisoara:
Technical University of Timisoara Press.
6. Beiu, V. (1996). Constant Fan-In Discrete Neural Networks Are VLSI-Opti-
mal. In S.W. Ellacott, J.C. Mason, & I.J. Anderson (eds.): Mathematics of Neu-
ral Networks: Models, Algorithms and Applications (pp. 89-94). Kluwer
Academic.
7. Beiu, V., & Taylor, J.G. (1996). On the Circuit Complexity of Sigmoid Feed-
forward Neural Networks. Neural Networks, 9(7), 1155-1171.
8. Beiu, V. (1998). On the Circuit and VLSI Complexity of Threshold Gate COM-
PARISON. Neurocomputing, 19(1), 77-98.
9. Brent, R.P., & Kung, H.T. (1982). A Regular Layout for Parallel Adders. IEEE
Trans. on Comp., 31(3), 260-264.
10. Cannas, S.A. (1995). Arithmetic Perceptrons. Neural Computation, 7(1), 173-
181.
11. Chandra, A.K., Stockmeyer, L.J., & Vishkin, U. (1984). Constant Depth Re-
ducibility. SIAM J. Comput., 13(2), 423-439.
12. Chan, P.K., Schlag, M.D.F., Thomborson, C.D., & Oklobdzija, V.G. (1992).
Delay Optimization of Carry-Skip Adders and Block Carry-Lookahead Adders
Using Multidimensional Dynamic Programming. IEEE Trans. on Comp., 41(8), 920-930.
13. Cotofana, S., & Vassiliadis, S. (1997). Low Weight and Fan-In Neural Net-
works for Basic Arithmetic Operations. In Proc. IMACS World Congress on
Sei. Comput., Modelling and Appl. Maths. (vol. 4, pp. 227-232).
14. Doran, R.W. (1988). Variants of an Improved Carry Look-Ahead Adder. IEEE
Trans. on Comp., 37(9), 1110-1113.
15. Han, T., Carlson, D.A., & Levitan, S.P. (1987). VLSI Design of High-Speed,
Low-Area Addition Circuitry. In Proc. Intl. Conf. on Circuit Design (pp. 418-
422). IEEE Press.
16. Hu, S. (1965). Threshold Logic. Berkeley, Los Angeles: University of Califor-
nia Press.
17. Hwang, K. (1979). Computer Arithmetic: Principles, Architecture and Design.
New York: John Wiley & Sons.
18. Kelliher, T.P., Owens, R.M., Irwin, M.J., & Hwang, T.-T. (1992). ELM - A Fast
Addition Algorithm Discovered by a Program. IEEE Trans. on Comp., 41(9),
1181-1184.

19. Khrapchenko, V.M. (1967). Asymptotic Estimation of Addition Time of a Par-


allel Adder. Problemy Kibernetiki, 19, 107-122 (in Russian); English transla-
tion (1970) in System Theory Research, 19, 105-122.
20. Kogge, P.M., & Stone, H.S. (1973). A Parallel Algorithm for the Efficient
Solution of a General Class of Recurrence Equations. IEEE Trans. on Comp.,
22(8), 783-791.
21. Ladner, R.E., & Fischer, M.J. (1980). Parallel Prefix Computations. J. ACM,
27(4), 831-838.
22. Ling, H. (1981). High Speed Binary Adder. IBM J. Res. Develop., 25(2 & 3),
156-166.
23. Lynch, T., & Swartzlander, E.E., Jr. (1992). A Spanning Tree Carry Lookahead
Adder. IEEE Trans. on Comp., 41(8), 931-939.
24. Montoye, R.K. (1981). Area-Time Efficient Addition in Charge Based Tech-
nology. Proc. Intl. Design Automation Conf. (pp. 862-872). IEEE Press.
25. Ngai, T.F., & Irwin, M.J. (1985). Regular Area-Efficient Carry-Lookahead Ad-
ders. Proc. Intl. Symp. on Comp. Arithmetic (pp. 9-15). ACM Press.
26. Ong, S., & Atkins, D.E. (1983). A Comparison of ALU Structures for VLSI
Technology. Proc. Intl. Symp. on Comp. Arithmetic (pp. 10-16). ACM Press.
27. Parberry, I. (1994). Circuit Complexity and Neural Networks. Cambridge, MA:
MIT Press.
28. Quach, N.T., & Flynn, M.J. (1992). High-Speed Addition in CMOS. IEEE
Trans. on Comp., 41(12), 1612-1615.
29. Rhyne, T. (1984). Limitations on Carry Lookahead Networks. IEEE Trans. on
Comp., 33(4), 373-374.
30. Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1990). Computing with Almost
Optimal Size Threshold Circuits. (Technical Report, Info. Sys. Lab., Stanford
University).
31. Siu, K.-Y., & Bruck, J. (1990). Neural Computation of Arithmetic Functions.
Proc. IEEE, 78(10), 1669-1675.
32. Siu, K.-Y., & Bruck, J. (1991). On the Power of Threshold Circuits with Small
Weights. SIAM J. Disc. Maths., 4(3), 423-435.
33. Siu, K.-Y., Roychowdhury, V.P., & Kailath, T. (1991). Depth-Size Tradeoffs
for Neural Computations. IEEE Trans. on Comp., 40(12), 1402-1412.
34. Sugla, B. (1985). Parallel Computation with Limited Resources. PhD Disser-
tation, Dept. ECE, Univ. of Massachusetts, Amherst, MA.
35. Vassiliadis, S., Cotofana, S., & Bertels, K. (1996). 2-1 Addition and Related
Arithmetic Operations with Threshold Logic. IEEE Trans. on Comp., 45(9),
1062-1068.
36. Waser, S., & Flynn, M.J. (1982). Introduction to Arithmetic of Digital Systems.
Holt, Rinehart and Winston.
37. Wegener, I. (1987). The Complexity of Boolean Functions. Chichester: Wiley-
Teubner.
38. Wegener, I. (1996). Unbounded Fan-In Circuits. In L.L. Keener (ed.): Ad-
vances in the Theory of Computation and Computational Mathematics, 123-
153. Ablex.
39. Wegener, I. (1993). Optimal Lower Bounds on the Depth of Polynomial Size
Threshold Circuits for Some Arithmetic Functions. Information Proc. Lett., 46,
85-87.
40. Wei, B.W.Y., & Thompson, C.D. (1990). Area-Time Optimal Adder Design.
IEEE Trans. on Comp., 39(5), 666-675.
Adaptive Cooperation Between Processors in a
Parallel Boltzmann Machine Implementation
J. Ortega; L. Parrilla; J.L. Bernier; C. Gil; B. Pino; M. Anguita
Departamento de Arquitectura y Tecnologia de Computadores
Universidad de Granada, 18071 Granada

Abstract
The fine-grain, data-driven parallelism shown by neural models such as the Boltzmann machine
cannot be implemented in an entirely efficient way either in general-purpose multicomputers
or in networks of computers, which are nowadays the most common parallel computer
architectures.
In this paper we present a parallel implementation of a modified Boltzmann machine
where the processors, each allocated a disjoint subset of neurons, asynchronously compute
the evolution of their neurons by using values that might not be up to date for the remaining
neurons, thus reducing interprocessor communication requirements. An evolutionary
algorithm is used to learn the rules that allow the processors to cooperate by interchanging
the local optima that they find while concurrently exploring different zones of the Boltzmann
machine state space. Thus, the way the processors interact changes dynamically during
execution of the algorithm, adapting to the problem at hand. Good figures for speedup with
respect to the Boltzmann machine computation on a uniprocessor computer have been
obtained experimentally.

Key words: Boltzmann machines, combinatorial optimization, evolutionary computation, multicomputers and networks of computers, parallel processing.

1. Introduction
The resolution of combinatorial optimization problems can greatly benefit from the parallel and
distributed processing which is characteristic of neural network paradigms. Nevertheless, although artificial
neural networks (ANNs) process the information in a distributed and massively parallel way, the fine grain
data-driven parallelism of neural models, such as the Boltzmann Machine (BM), cannot be implemented
in an entirely efficient way either in general-purpose multicomputers or in networks of computers, which
are nowadays the most common parallel computer architectures because they represent a good choice
in terms of cost/performance and scalability [1]. In these architectures, each processor has its own local
memory and communicates with the other processors of the system through an interconnection network or a local
area network, thus corresponding to a coarse-grained architecture with the memory distributed among the
processing nodes. As the cost of communication between processing nodes is high, a large volume of
computation must be performed between subsequent communications in order to achieve
appropriate efficiency.
Some parallel implementations of BM in general-purpose multicomputers have been proposed [4-
6]. In the scheme described in [6], the neurons are distributed among the processors, and as each
processor computes the changes in its subset of neurons while considering the remaining to be clamped,
the set of processors searches in different and smaller zones of the solution space in order to find a local
optimum. The solution found is communicated by each processor to the others, and any processor
receiving this information uses it to guide its search within new subspaces where better solutions could be
found. This method can be included in the class of large-step optimization methods [11,12], which are
defined by a procedure to perform the local search, a procedure to perform the large-step transitions to

non-local solutions, and an accept/reject test. In the present case, the use of several processors and the
characteristics of the BM would make it possible to exploit the work done by the remote processors in order
to drive the large-step transitions of each processor and speed up the search. Several alternatives to allow
the processors to work cooperatively are analyzed and their performance detailed in [6]. Among the
proposed schemes, one is identified that allows the corresponding BM to converge to solutions of
high quality and provides a high acceleration over the execution of the BM on uniprocessor
computers.
Nevertheless, it has been shown [2] that if an algorithm performs well on a particular class of
problems, it shows degraded performance on another class. This implies that each parallel Boltzmann
machine (corresponding to a particular optimization problem) would require a specific rule, allowing the
cooperation between the processors, for the machine to attain its best performance. Thus, it is more effective
to devise a procedure that automatically determines the best rule by adapting it dynamically while the
parallel Boltzmann machine is processed than to try to determine the best procedure beforehand, simply because such
a procedure might not exist. In this paper, we propose the use of a genetic algorithm to learn, while the
optimization procedure is running, the best way to improve the local solution by extracting information from
the solutions received from other processors. Thus, a hybrid optimization procedure which mixes neural
and evolutionary techniques is proposed.
Section 2 presents a parallel implementation of the BM in which the processors alternate
phases of local optimization with phases in which the local solutions are interchanged and used by each
receiving processor to update its own local solution according to an Update Rule (UR). The space of
possible Update Rules that allow the processors to work cooperatively through interactions is presented
in Section 3, along with the genetic procedures used to find the best one for a given problem and also to
reach an adaptive behaviour. Finally, Section 4 gives the conclusions of the paper.

2. A Parallel Boltzmann Machine Implementation

As is well known, the BM is a stochastic feedback neural network consisting of binary neurons
connected by symmetric weights. If the BM has N neurons, the weights associated with their
interconnections define an NxN matrix, called the weight matrix, in which the element W(i,j) (i,j = 1,2,..,N)
corresponds to the connection between neurons i and j. The state of neuron i is indicated as S(i) (i = 1,...,N)
and can take the values 0 or 1. The set of values for all the neurons in the machine, (S(1),S(2),...,S(N)),
is called the configuration of the BM, S[m], where m indexes the 2^N possible configurations. With a BM, it is possible
to associate a consensus function, C_{S[m]}, in order to evaluate the overall desirability of the activated
connections in a given configuration, which is defined as

C_{S[m]} = \sum_{\{i,j\}} W(i,j)\, S(i)\, S(j)   (1)

Given the configuration S[m], the difference in the consensus, dC(i), when the state of neuron i
changes while the states of the remaining neurons are unchanged is

dC(i) = (1 - 2S(i)) \Big( \sum_{j \ne i} W(i,j)\, S(j) + W(i,i) \Big)   (2)

and the probability of accepting this state transition in neuron i is

A_T(dC(i)) = \frac{1}{1 + e^{-dC(i)/T}}   (3)

where T denotes the value of a control parameter usually called temperature. Equation (3) describes the
'heat-bath' algorithm for a BM. It is also possible to use the Metropolis algorithm [3], in which a change in
a neuron with dC(i) > 0 is always accepted, whereas a change with dC(i) < 0 is accepted with probability
A_T(dC(i)) = exp(dC(i)/T). In a minimization problem, a change with dC(i) < 0 is always accepted, whereas if
dC(i) > 0, the change is accepted with probability A_T(dC(i)) = exp(-dC(i)/T).
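As a concrete illustration, here is a minimal sketch (ours; names are illustrative) of one heat-bath sweep of a sequential BM over its N neurons, directly applying eqns (2) and (3); W is the symmetric NxN weight matrix and S the 0/1 configuration:

import math, random

def sweep(W, S, T):
    # attempt one heat-bath update per neuron, in random order
    N = len(S)
    for i in random.sample(range(N), N):
        # consensus change if neuron i flips, others unchanged (eqn (2))
        dC = (1 - 2 * S[i]) * (sum(W[i][j] * S[j]
                                   for j in range(N) if j != i) + W[i][i])
        # heat-bath acceptance probability (eqn (3))
        if random.random() < 1.0 / (1.0 + math.exp(-dC / T)):
            S[i] = 1 - S[i]
    return S

Annealing then repeats such sweeps while lowering the temperature (T := a*T with 0 < a < 1), as in the parallel algorithm of Figure 2 below.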
The BM has been implemented in both general-purpose and specific-purpose parallel architectures
[4,5]. These are able to speed up the evolution of the BM with respect to the number of neurons, but due to
synchronization and communication requirements, the use of several processors is not efficient when
either the neurons are highly interconnected or it is not possible to find an adequate clustering of neurons
that allows the reduction of communications among the processors where the BM has been located. In [6]
a procedure with low communication and synchronization requirements, designed to take advantage of the
availability of local networks of general-purpose architectures, is proposed. The goal of this procedure is
similar to that of [7], which analyzes the effect of reducing communications in a parallel implementation of
the simulated annealing algorithm, or [8], which considers the possibility of accelerating, by a factor greater
than the number of processors used, the obtaining of sufficiently good solutions for NP-complete
problems.

Figure 1. (a) Distribution of the neurons among the processors and (b) functional
blocks implemented by a given processor

Thus, in the parallel BM implementation presented in [6], the neurons of the BM are distributed
among the P processors (k = 1,2,..,P) as shown in Figure 1(a), using as an example a BM with eight neurons
and four processors. The neurons associated with processor k are called the neurons of k. Nevertheless, to
compute the changes in its neurons (using (2)), a processor also needs the values of all the neurons
connected to them. Thus, each processor k stores in its local memory a local configuration denoted as
S_k[m_k] = (S_k(1), S_k(2),.., S_k(N)), where S_k(j) is the state of neuron j in the local configuration of processor k. In
this way, the local configuration of processor k has two kinds of components: those corresponding to the
neurons of k and those corresponding to the neurons assigned to the remaining processors, called the
remote neurons of k. When required, the components of a given local configuration are noted with
superindices. Thus, S_k^q(i) (q = 1,2,..,P) refers to the state, in the local configuration of processor k, of neuron
i (which is a neuron of processor q). If q = k, S_k^k(i) is the state of neuron i, which is one of the neurons of
processor k; otherwise i is a remote neuron of k. As we will see, the values S_k^q(i) at a given instant do not
necessarily coincide with S_q^q(i); thus they are not necessarily up to date.

Parallel_Boltzmann_Machine (N neurons, P processes);
/* k is the process (or processor) index */
if (k is the Root_process)
begin
  Spawn P processes; /* A process is assigned to each processor */
  For (i=1 to P) send(Neurons, Weights, Initial_configuration, T, T0, a, L, Inter, to process i);
  For (i=1 to P) receive(local_configuration from process i);
  Select the best local_configuration as Solution; /* End */
end
else
begin
  receive(Neurons, Weights, Initial_configuration, T, T0, a, L, Inter, from Root_process);
  /* The local_configuration is initialized with Initial_configuration received from Root_process; */
  /* the neurons of k, the corresponding weights, and the rest of the parameters are also received */
  Compute(L_local from L);
  while (T>T0) do
  begin
    /* Step 1. Local evolution in process k */
    while (L_local>0)
    begin
      /* Compute_local_evolution (select a neuron of process k) */
      Random_Selection(neuron i of processor k); L_local := L_local - 1;
      if ((dC_k(i)<0) or (random()<A_T(dC_k(i)))) then S_k(i) := 1 - S_k(i);
      /* The neurons of remote processes are clamped */
    end;
    send(local_configuration S_k[m_k]);

    /* Step 2. Updating of remote neurons in local_configuration */
    For (i=1 to Inter) /* 0 < Inter < P */
    begin
      receive(remote_configuration S_q[m_q] from remote processor q);
      Update_Rule(local_configuration S_k[m_k], remote_configuration S_q[m_q], T);
    end;
    T := a*T; /* 0<a<1 */
  end;
  send(local_configuration to Root_process); /* End of process (not Root) */
end;

Figure 2. Evolution of the modified Parallel Boltzmann Machine

The P processors implementing the evolution of the parallel Boltzmann machine interact at some
instants and alternate two processing phases or steps, as indicated in the algorithm shown in Figure 2,
whose functional blocks are depicted in Figure 1(b). In the first phase (Step 1), each processor k evolves
(as a sequential Boltzmann machine) by changing, in the local configuration S_k[m_k], only its subset of
neurons (values S_k^k(i)) and keeping its remote neurons (values S_k^q(i), with k ≠ q) clamped and acting as
parameters. In this way, in Step 1 each process searches within a reduced subspace defined by the
clamped states of its remote neurons. In the second phase (Step 2), the processor receives remote
configurations from other processors and interacts with them through an Update Rule (UR) that defines
the way in which the remote neurons of k (values of S_k^q(i), with k ≠ q, in S_k[m_k]) are modified according to
the states of the neurons of S_q[m_q], which come from processor q. In this step, the local neurons in S_k[m_k]
are clamped while the remote neurons in S_k[m_k] change according to their state in the remote configuration
and the UR used. Each processor takes advantage of the search carried out by the remote processors in
their Step 1 through an interaction among configurations, which takes place in a given processor according
to the UR used. This should produce a diversification of the local configurations, impelling each processor
towards a different solution subspace, which is explored afterwards in a new Step 1.
The four URs of Table 1 correspond to different alternatives according to the usual evolution of a
sequential Boltzmann machine (other URs can also be defined, as shown in Section 3). In [6], experimental
results are provided and explained for these four rules and different examples of BMs with up to 1024
neurons. Some URs provide good solutions with a reduced number of iterations, thus being able to achieve
a significant speedup with respect to the sequential execution.
The speedup (S) that may be attained with this scheme can be evaluated from the following
expression

S = \frac{T_1}{T_P} = \frac{N_{iter}^{1} \cdot T^{1}_{step1}(N)}{N_{iter}^{P} \cdot \big[ T^{P}_{step1}(N,P) + T^{P}_{step2}(N,P) \big]}   (4)

where T^1_step1(N) is the time required by each iteration in the sequential execution (SEQ) of a Boltzmann
machine with N neurons; T^P_step1(N,P) is the time per iteration of Step 1 of the parallel scheme when its N
neurons are distributed among P processors; and T^P_step2(N,P) is the time required by Step 2. The values
N_iter^1 and N_iter^P correspond, respectively, to N_iter for the sequential and the parallel schemes.
The value of T^1_step1(N) is proportional to the length of the Markov chain, L, used at each
temperature. In our case, we have used L = N as in [3], so T^1_step1(N) = N·t_comp, with t_comp being the time required
to compute a transition in a neuron. The value of T^P_step1(N,P) is equal to the product L_local·t_comp, and as L_local
has been assumed to be proportional to (N/P), L_local = K_step1·(N/P), thus T^P_step1 = K_step1·(N/P)·t_comp. Finally,
T^P_step2(N,P) depends on the time required for communication between the interacting processors, and on the time
required to compute the interaction, which is also considered proportional to N/P. Thus,
T^P_step2(N,P) = Inter(P)·{(N/P)·t_comp + t_comm(P)}, where Inter(P) is the number of processors interacting in Step 2.
In this way,
T^P_step2(N,P) = (N/P)·t_comp·K_step2(N,P)
where K_step2(N,P) = (1 + (P/N)·(t_comm(P)/t_comp))·Inter(P), and the efficiency (E = S/P) can be expressed as:

E = \frac{S}{P} = \frac{T_1}{P \cdot T_P} = \frac{N_{iter}^{1}}{N_{iter}^{P}} \cdot \frac{1}{K_{step1}(N,P) + K_{step2}(N,P)}   (5)

Whenever the communication can be overlapped with the computation, the value of K_step2 can be
considered approximately equal to Inter(P). Thus, for schemes such as UR4, where N_iter^1/N_iter^P does not
change with the number of processors P, if the complexity of K_step1 and K_step2 is less than P, the speedup
obtained increases with the number of processors. Moreover, whenever (K_step1 + K_step2) < (N_iter^1/N_iter^P) is verified,
efficiencies higher than 1 are obtained (corresponding to superlinear speedups). If Inter(P) is set
proportional to the number of processors, the speedup tends to N_iter^1/N_iter^P as P grows.
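Plugging numbers into eqn (5) shows how the superlinear regime arises; all parameter values in this sketch (ours) are made up for illustration:

def efficiency(niter_ratio, k_step1, inter_p, p_over_n, tcomm_over_tcomp):
    # E = (Niter_1/Niter_P) / (K_step1 + K_step2), eqn (5), with
    # K_step2 = (1 + (P/N)*(t_comm/t_comp)) * Inter(P)
    k_step2 = (1.0 + p_over_n * tcomm_over_tcomp) * inter_p
    return niter_ratio / (k_step1 + k_step2)

# e.g. N=256, P=8, Inter(P)=2, Niter_1/Niter_P=8, K_step1=1 (all assumed):
print(efficiency(8, 1.0, 2, 8 / 256, 4.0))  # about 2.46, i.e. superlinear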
Figure 3 shows the average efficiencies experimentally obtained by running different Boltzmann
machines (with UR4) that correspond to the vertex cover problem applied to randomly selected graphs of
up to 256 nodes. For clarity, Figure 3 only provides the results for N=64 and N=256. The experiments were
carried out on a PARAMID multicomputer, with nodes based on the TTM200 board equipped with an Intel
i860/XP processor, 16 Mbytes of memory, and a T805 Transputer with four bidirectional links and 4 Mbytes
of memory. Up to eight processors were available for our experiments. As shown in Figure 3, it is possible
to obtain efficiencies higher than one (superlinear speedups). For N=64 and P=2, the experimental data
present a high deviation with respect to expression (5). This can be explained by taking into account
that, as the number of processors is low, the effect of cooperation among processors also decreases, with
a corresponding reduction in N_iter^1/N_iter^P for this small number of neurons.

Table 1. Examples of possible Update Rules

Rule  Description

UR1   For all neurons i of processor q do S_k(i) := S_q(i);

UR2   dC_{q,k} = C_{S_q[m_q]} - C_{S_k[m_k]};
      if ((dC_{q,k} < 0) or (random() < A_T(dC_{q,k})))
      then for all neurons i of processor q do S_k(i) := S_q(i);

UR3   For j := 1 to L_local do
      begin
        Random_Selection(neuron i of remote processor q);
        /* i is a neuron of q such that S_q(i) != S_k(i) */
        dC_{k,q}(i) = (1 - 2S_k(i)) ( sum_{j != i} W(j,i) S_k(j) + W(i,i) );
        if ((dC_{k,q}(i) < 0) or (random() < A_T(dC_{k,q}(i)))) then S_k(i) := 1 - S_k(i);
      end

UR4   For j := 1 to L_local do
      begin
        Random_Selection(neuron i of remote processor q);
        /* i is a neuron of q such that S_q(i) != S_k(i) */
        dC_{k,q}(i) = (1 - 2S_k(i)) sum_{j != i} W(j,i) S_k(j)   /* approximate: no W(i,i) term */
        if ((dC_{k,q}(i) < 0) or (random() < A_T(dC_{k,q}(i)))) then S_k(i) := 1 - S_k(i);
      end
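For concreteness, here is a sketch (ours) of UR1 and UR4 operating on plain Python lists; it mirrors our reading of Table 1, with A_T taken from eqn (3) and UR4 using the approximate consensus change without the W(i,i) term:

import math, random

def A_T(dC, T):
    return 1.0 / (1.0 + math.exp(-dC / T))  # eqn (3)

def ur1(S_k, S_q, neurons_q):
    # UR1: copy all of q's neurons into the local configuration
    for i in neurons_q:
        S_k[i] = S_q[i]

def ur4(S_k, S_q, neurons_q, W, T, L_local):
    # UR4: repeatedly pick a remote neuron whose state differs and flip it
    # probabilistically, judging the flip by the approximate cost change
    diff = [i for i in neurons_q if S_q[i] != S_k[i]]
    for _ in range(L_local):
        if not diff:
            break
        i = random.choice(diff)
        dC = (1 - 2 * S_k[i]) * sum(W[j][i] * S_k[j]
                                    for j in range(len(S_k)) if j != i)
        if dC < 0 or random.random() < A_T(dC, T):
            S_k[i] = 1 - S_k[i]
            diff.remove(i)  # now agrees with the received configuration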

3. Evolutionary approach to learn the Update Rules

A genetic algorithm has been included in the parallel procedure described in [6] to determine the
UR that each processor applies when it receives a solution from a remote processor. The goal for the URs
is to achieve the fastest convergence to the optimal solution in any of the working processors.
The possible Update Rules that can be applied by a given processor are implemented by
programs with three common decision-taking steps: (a) the way to select the bits corresponding to remote
neurons of a given processor when a solution arrives at the local processor; (b) the state required in each
of the selected bits to allow a change in it; and (c) the sign required in the cost change in order to accept
a transformation in a selected bit. With respect to (a), four alternatives are considered: (a1) to select all the
bits in the local configuration which correspond to neurons of the remote processor whose solution has
arrived; (a2) to select a clump of bits whose limits and size are randomly taken; (a3) to select a random
number of bits, which are also randomly determined; and (a4) to select only one bit, randomly. There are
three alternatives for (b): (b1) the possible change is applied if the bit in the local solution and the
corresponding bit in the received solution are different; (b2) it is applied if they are equal; and (b3) it is
applied irrespective of whether they are equal or not. Finally, five possibilities are considered for (c): (c1)
the value of a selected bit is always changed (the sign of the cost change is not taken into account); (c2)
it is only changed if this change decreases the cost computed with expression (2); (c3) it is only changed
if the cost decreases according to the approximate expression (used in UR4 in Table 1); (c4) and (c5) are
used in UR3 and UR4, respectively, thus being similar to (c2) and (c3), except that changes that
determine a cost increase can also be accepted with the probability given by expression (3).
In this way, associated with each local solution there are seven bits, (g1,g2,g3,g4,g5,g6,g7), that codify the
program implementing the UR. Bits g1 and g2 codify the four possibilities for step (a), bits g3 and g4 the
three possibilities for step (b), and bits g5, g6, and g7 the five possibilities for step (c).
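A small sketch of decoding the seven bits into the three decision steps; the paper only states which bits encode which step, so the exact bit-to-option mapping below (folding the unused codes with a modulo) is our assumption:

def decode_ur(g):
    # g: list of seven bits (g1..g7)
    a = 2 * g[0] + g[1]                   # four options a1..a4
    b = (2 * g[2] + g[3]) % 3             # three options b1..b3
    c = (4 * g[4] + 2 * g[5] + g[6]) % 5  # five options c1..c5
    return ('a%d' % (a + 1), 'b%d' % (b + 1), 'c%d' % (c + 1))

print(decode_ur([0, 0, 0, 0, 1, 0, 0]))   # -> ('a1', 'b1', 'c5')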

[Figure 3 plot: efficiency (y-axis, roughly 0.8 to 2.6) versus number of processors (x-axis, 2 to 8), with curves for N=64 and N=256 together with the predictions of eq. (5).]

Figure 3. Experimental efficiencies obtained on the PARAMID multicomputer compared with the theoretical ones (equation (5) in [6]).

A genetic algorithm has been devised by which each processor uses a population of URs defining
the different ways the local solution could change when a solution is received from a remote processor.
This procedure is included in the optimization algorithm and is applied while the optimization is running
(Figure 5); it is called the On-Line (ONL) learning procedure. In the procedure
Evaluate_Fitness_Update_Rule_ji() in Figure 5, the fitness of UR_ji of the population is obtained from the
reduction in the consensus function after applying a small number of iterations of the local optimization
procedure to the solution provided by the corresponding UR_ji. In each generation, the half of the population of
URs with the better fitness values is selected for the new generation. The second half of the population
for the next generation is obtained by applying the two-point crossover and the bit-flip mutation operators
to the half of the URs previously selected. The procedure Selection_Mutation_Crossover() in Figure 5
applies these transformations. The best solution obtained during Step 2 is used to start the next iteration
of Step 1. Thus, after each iteration of the parallel Boltzmann machine (Step 1 plus Step 2), the population
of Update Rules in each processor, and so the way the processors cooperate, evolves and is improved
at the same time as the optimization procedure proceeds.
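One generational step as just described can be sketched as follows (ours; it assumes a population of at least four 7-bit codes, as in the experiments, which use 10 per processor):

import random

def next_generation(population, fitness, p_mut=0.1):
    # keep the fitter half, refill with two-point crossover + bit-flip mutation
    ranked = [ur for _, ur in sorted(zip(fitness, population),
                                     key=lambda t: t[0], reverse=True)]
    elite = ranked[:len(ranked) // 2]
    children = []
    while len(elite) + len(children) < len(population):
        p1, p2 = random.sample(elite, 2)
        a, b = sorted(random.sample(range(1, 7), 2))  # two crossover points
        child = p1[:a] + p2[a:b] + p1[b:]
        child = [bit ^ (random.random() < p_mut) for bit in child]
        children.append(child)
    return elite + children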
Table 2 summarizes the experimental results obtained. The set of weight matrices of Table 2 (Boltz.
Mach. column) corresponds to different levels of connection between neurons and different weight magnitudes,
with two different randomly selected matrices for each set of conditions. Thus, in the code mX_YYY.ZZ in
Table 2, YYY indicates that the weights take values between -YYY.0 and +YYY.0; ZZ means that there is
a probability of 0.ZZ for two given neurons to be connected; and X is an index which identifies different
randomly selected weight matrices with the same values for YYY and ZZ. The EXH column of Table 2
shows, for each weight matrix, the cost of the minimum found by the parallel procedure of Figure 2 when
the best UR, found by an exhaustive search, is applied. The code describing the UR is given in brackets,
[ai,bi,ci], and the number of iterations, Niter, required to obtain a solution which is less than 1% worse than
the best solution found appears in parentheses, (Niter). Results are provided considering 8 and 16
processors (column Proc. in Table 2) and N=64 neurons. The asterisk means that the best solution found
is more than 1% worse than the optimum.
The exhaustive search referred to above determines the UR that, applied during the whole
optimization process by all the processors, provides the best figures for solution quality and convergence
speed. As the number of possible URs in this case is only 60, it is possible to find the best one for each
Boltzmann machine by analyzing the performance of every rule.
As is to be expected from [2], the optimal UR obtained depends on the characteristics of the weight
matrix, although some alternatives occur more frequently than others, and there are alternatives, such as
(a3), (a4), (b2), (c1) and (c3), that do not appear in the EXH column.

Table 2. Cost of the solution obtained, (number of iterations, Niter) and [Update Rule]

Boltz. Mach.  Proc.  EXH Procedure                ONL Procedure   Random Sel. UR
m1_025.25       8    -1019.75 (6) [a1,b1,c4]      -1019.75 (5)    -1019.75 (22)
               16    -1019.75 (5) [a1,b1,c5]      -1019.75 (5)    -1019.75 (21)
m2_025.25       8    -1113.10 (7) [a1,b1,c5]      -1113.10 (6)    -1113.10 (16)
               16    -1102.46 (6) [a2,b1,c5]      -1107.32 (7)    -1087.35 (12)*
m1_025.80       8    -2039.25 (5) [a1,b1,c2]      -2044.25 (5)    -2044.25 (7)
               16    -2043.25 (5) [a1,b1,c2]      -2044.25 (5)    -2044.25 (19)
m2_025.80       8    -2331.72 (6) [a1,b1,c5]      -2333.10 (7)    -2333.10 (10)
               16    -2327.69 (5) [a1,b1,c2]      -2333.10 (7)    -2327.69 (14)
m1_100.25       8    -4063.00 (4) [a1,b1,c2]      -4063.00 (7)    -3861.05 (6)*
               16    -4063.00 (4) [a1,b1,c4]      -4063.00 (7)    -3861.05 (6)*
m2_100.25       8    -4325.69 (6) [a1,b1,c2]      -4325.69 (5)    -4325.69 (13)
               16    -4227.19 (5)* [a1,b1,c5]     -4325.69 (7)    -4325.69 (9)
m1_100.80       8    -7183.00 (5) [a1,b1,c2]      -7183.00 (10)   -7163.00 (18)
               16    -7183.00 (8) [a1,b1,c2]      -7183.00 (6)    -7078.01 (21)*
m2_100.80       8    -7414.30 (5) [a1,b1,c2]      -7353.89 (7)    -7223.19 (11)*
               16    -7414.30 (4) [a1,b3,c4]      -7414.30 (10)   -7414.30 (15)

The ONL Procedure column of Table 2 gives the cost of the best solution found and the value of
Niter obtained for the different weight matrices, with a population of 10 UR codes in each processor
and a mutation probability of 0.1. The Random Sel. column shows the values of the cost and Niter when
a randomly selected UR is used. Comparing these two columns, it is clear that the ONL procedure
represents an improvement, providing better solutions with fewer iterations (a large reduction in Niter in
most cases).
From the ONL Procedure and EXH Procedure columns, it can be seen that the use of the Update
Rule obtained by the exhaustive search procedure allows a reduction in Niter in most cases, although
sometimes Niter is slightly higher than the value corresponding to the use of the ONL procedure. However,
except for m1_100.80 and m2_100.80 with 16 processors, this reduction is not very important, and in most
cases the differences in Niter are similar to the experimental error (+/-1 iteration). With respect to the quality
of the solutions obtained, the ONL procedure provides similar or better solutions in all cases except for
m2_100.80 with 8 processors. Figure 4 shows the evolution of the local solution found by a processor for
each of the three cases considered in Table 2, and four different weight matrices.
These results are understandable, remembering that the ONL procedure allows the use of different
URs in each processor and iteration, which represents a more general situation with respect to the use of
only one UR in all the processors during the whole optimization procedure. Indeed, in the execution of the
ONL procedure, it has been observed that the processors use populations of URs with different individuals
across different processors.

[Figure 4 plots: four panels of cost versus iterations (0 to 20): m1_025.25 (16 proc.), m1_025.80 (16 proc.), m2_100.25 (8 proc.), m1_100.80 (8 proc.).]

Figure 4. Cost vs. iterations in a processor, when using the ONL procedure (solid line), the UR obtained by the exhaustive search procedure (dash-dot line) and a randomly selected UR (dashed line).

4. Conclusions
The cooperation of several processors can improve the resolution of combinatorial optimization
problems by using parallel computer architectures. The parallel Boltzmann machine implementation
considered uses independent search processes, implemented by the processors in different solution space
zones, and interaction among processors exchanging solutions, in order to take advantage of the work
done by the other processors. The problem is to determine the way a processor extracts information from
the solutions received. In this paper we propose an evolutionary strategy to learn the interaction between
processors.
The proposed strategy allows an efficient implementation of Boltzmann machines in coarse grain
parallel computer architectures. It is based on the procedure in Figure 2 [6], in which each processor
alternates two computation phases (Step 1 and Step 2). In Step 1, the processor improves the consensus
of the Boltzmann machine by only changing the states of the neurons assigned to it while its remote
neurons are considered to be clamped. Thus, these clamped neurons act as parameters or constraints,
which are updated in Step 2 through interactions between the processors according to a UR that guides
the optimization process by taking into account the work done by the remote processors in their
corresponding Step 1. The procedure was modified by including a genetic algorithm which operates at the
same time as the execution of the parallel optimization procedure, as shown in Figure 5. Thus, each
processor uses a different population of URs, which can change dynamically, to cooperate with the other
processors. This population of rules changes during the optimization process.

Parallel_Boltzmann_Machine (N neurons, P processes);
/* k is the process (or processor) index */
if (k is the Root_process)
begin
  Spawn P processes; /* A process is assigned to each processor */
  For (i=1 to P) send(Neurons, Weights, Initial_configuration, T, T0, a, L, Inter, to process i);
  For (i=1 to P) receive(local_configuration from process i);
  Select the best local_configuration as Solution; /* End */
end
else
begin
  receive(Neurons, Weights, Initial_configuration, T, T0, a, L, Inter, from Root_process);
  /* The local_configuration is initialized with Initial_configuration received from Root_process; */
  /* the neurons of k, the corresponding weights, and the rest of the parameters are also received */
  Compute(L_local from L);
  Initialize(Population of Update_Rules); /* Random selection of a set of N_individuals Update_Rules */
  while (T>T0) do
  begin
    /* Step 1. Local evolution in process k */
    while (L_local>0)
    begin
      /* Compute_local_evolution (select a neuron of process k) */
      Random_Selection(neuron i of processor k); L_local := L_local - 1;
      if ((dC_k(i)<0) or (random()<A_T(dC_k(i)))) then S_k(i) := 1 - S_k(i);
      /* The neurons of remote processes are clamped */
    end;
    send(local_configuration S_k[m_k]);

    /* Step 2. Updating of remote neurons in local_configuration */
    For (i=1 to Inter) /* 0 < Inter < P is the number of messages received from remote processors */
    begin
      receive(remote_configuration S_q[m_q] from remote processor q);
      For (j=1 to N_Generations) /* N_Generations is the chosen number of generations */
      begin
        For (ji=1 to N_individuals) /* N_individuals is the number of URs in the population */
        begin
          Update_Rule_ji(local_configuration S_k[m_k], remote_configuration S_q[m_q], T);
          Evaluate_Fitness_Update_Rule_ji;
        end
        Selection_Crossover_Mutation(Population of Update_Rules);
      end;
      Select_the_Solution_obtained_by_the_best_Update_Rule;
    end;
    T := a*T; /* 0<a<1 */
  end;
  send(local_configuration to Root_process); /* End of process (not Root) */
end;

Figure 5. Parallel Boltzmann Machine with evolutionary Update Rules

The complexity of the optimization procedure when the number of neurons grows is not increased
by this evolutionary selection of URs, because the number of individuals in the population of rules
associated with each processor does not change with the number of neurons.
An exhaustive search procedure has also been applied to determine the best UR to use.
Nevertheless, it has been proven [2] that for any optimization procedure, the high-quality performance
obtained when it is applied to a given class of problems is balanced by poor results on another class. Thus,
the EXH column in Table 2 provides different optimal URs for different problems.
The results provided in this paper show that the proposed evolutionary computation method
performs well in both convergence speed and quality of the solutions obtained. Although this paper is
devoted to describing a parallel implementation of a Boltzmann machine, similar evolutionary strategies can
be applied to allow processor cooperation in other parallel combinatorial optimization methods. For
example, in [10] a procedure is proposed in which multiple populations evolve independently by a genetic
algorithm. Each population determines local solutions that represent components of the global solution and
which are combined to build the whole solution by assigning a credit to each local solution, according to
how well it collaborates in achieving such a global solution. In the procedure proposed here, each
processor determines a local optimum through Step 1, and then this local solution is combined in Step 2
with the solutions coming from other processors. The matrices of weights used here to obtain the
experimental results correspond to situations in which there is a high number of interconnections among
the neurons assigned to different processors. In this way, the procedure proposed here is tested by using
optimization problems with many interdependent variables and highly multimodal cost functions, thus
corresponding to problems that are more difficult to solve than others considered in works where the
variables of the function to be optimized are reasonably independent.

Acknowledgements. This paper has been supported by project TIC97-1149 (CICYT, Spain).

References
[1] Anderson, T.E.; Culler, D.E.; Patterson, D.A.; and the NOW team: "A Case for NOW (Networks of
Workstations)". IEEE Micro, pp. 54-64. February, 1995.
[2] Wolpert, D.H.; Macready, W.G.: "No Free Lunch Theorems for Optimization". IEEE Trans. on
Evolutionary Computation, Vol.1, No.1, pp. 67-82. April, 1997.
[3] Aarts, E.H.L.; Korst, J.H.M.: "Simulated Annealing and Boltzmann Machines". Wiley, 1988.
[4] Oh, D.H.; Nang, J.H.; Yoon, H.; Maeng, S.R.: "An efficient mapping of Boltzmann Machine
computations onto distributed-memory multiprocessors". Microprocessing and Microprogramming,
Vol. 33, pp. 223-236, 1991/92.
[5] De Gloria, A.; Faraboschi, P.; Ridella, S.: "A dedicated Massively Parallel Architecture for the
Boltzmann Machine". Parallel Comp., Vol.18, No.1, pp. 57-75, 1993.
[6] Ortega, J.; Rojas, I.; Diaz, A.F.; Prieto, A.: "Parallel Coarse Grain Computing of Boltzmann
Machines". Neural Processing Letters, Vol.7, No.3, pp. 1-16, 1998.
[7] Hong, C.-E.; McMillin, B.M.: "Relaxing Synchronization in Distributed Simulated Annealing". IEEE
Trans. on Parallel and Distributed Systems, Vol.6, No.2, pp. 189-195. February, 1995.
[8] Pramanick, I.; Kuhl, J.G.: "An Inherently Parallel Method for Heuristic Problem-Solving: Part I -
General Framework". IEEE Trans. on Parallel and Distributed Systems, Vol.6, No.10, pp. 1006-
1015. October, 1995.
[9] Zissimopoulos, V.; Paschos, V.T.; Pekergin, F.: "On the approximation of NP-complete problems
by using the Boltzmann Machine method: The cases of some covering and packing problems".
IEEE Trans. on Computers, Vol.40, No.12, pp. 1413-1418. December, 1991.
[10] Potter, M.A.; De Jong, K.A.: "A Cooperative Coevolutionary Approach to Function Optimization". In
Third Conference on Parallel Problem Solving from Nature, Y. Davidor and H.P. Schwefel (Eds.),
Lecture Notes in Computer Science, Vol.866, Springer-Verlag, pp. 249-257, 1994.
[11] Martin, O.; Otto, S.W.; Felten, E.W.: "Large-step Markov chains for the TSP incorporating local
search heuristics". Operations Research Letters, 11, pp. 219-224. May, 1992.
[12] Lourenço, H.R.: "Job-shop scheduling: Computational study of local search and large-step
optimization methods". European J. of Operational Research, 83, pp. 347-364, 1995.
Adaptive Brain Interfaces

J. del R. Millán (a), J. Mouriño (a), J. Heikkonen (b), K. Kaski (b),
F. Babiloni (c), M.G. Marciani (c), F. Topani (d), I. Canale (d)

(a) Joint Research Centre of the EC, 21020 Ispra (VA), Italy. E-mail: jose.millan@jrc.it
(b) Helsinki University of Technology, PO Box 9400, 02015 HUT, Finland
(c) ... di Riabilitazione S. Lucia, Via Ardeatina 306, 00179 Roma, Italy
(d) Fase Sistemi Srl, Via Ildebrando Vivanti 12, 00144 Roma, Italy

Abstract. This paper presents first results of an Adaptive Brain Interface suitable for deployment outside controlled laboratory settings. It robustly recognizes three cognitive mental states from on-line spontaneous EEG signals, and these states may be associated to simple commands. Three commands allow interacting intelligently with a computer-based system through task decomposition. Our approach seeks to develop individual interfaces, since no two people are the same either physiologically or psychologically. Thus the interface adapts to its owner as its neural classifier learns user-specific filters.

1 Introduction

Physiological studies indicate that EEG signals are a reliable mirror of mental activity.
In addition, the combination of EEG, MRI and PET are providing gradually better
maps of brain functions (i.e., which cortical areas are responsible for specific mental
activities). Thus, it is quite appealing to try to use EEG signals as an alternative means
of interaction with computers. This paper describes a recent European research effort
whose objective is to build Adaptive Brain Interfaces (ABI) suitable for deployment
outside controlled laboratory settings. The immediate application is to extend the ca-
pabilities of physically-disabled people (e.g., select items from a computer screen,
explore virtual worlds, or guide a motorized wheelchair).
We aim to recognize from three to five mental states (e.g., relaxation, visualization,
music composition, arithmetic, verbal) from on-line spontaneous EEG signals¹ by
means of artificial neural networks, and to associate them with simple commands such
as "move wheelchair straight", "turn left" and so on. Thus, users will be able to oper-
ate computer-based systems by composing sequences of these patterns.
An ABI requires users to be conscious of their thoughts and to concentrate suffi-
ciently on those few mental tasks associated to the commands. Any other EEG pattern
different from those corresponding to these mental tasks will be associated with the
command "nothing", which has no effect on the computer-based system.

¹ We will also refer to them as "EEG patterns", or just "patterns."



The current ABI prototype is built upon the experience of the different partners in
the whole spectrum of areas covering the multidisciplinary nature of the project:
• expertise in the neurological basis of EEG signals (e.g., [7]),
• design of advanced EEG signal processing techniques (e.g., [2]),
• off-line feature extraction and classification of EEG signals (e.g., [11]),
• development of neural classifiers for the robust recognition of mental states from
on-line EEG signals [9], and
• design of a portable and easy-to-use EEG system as well as other biomedical de-
vices for rehabilitation (e.g., [3]).

An obstacle to the achievement of the ABI project is the robust recognition of EEG
patterns outside laboratory settings. This presumes the existence of appropriate
EEG equipment that is compact, easy to use, and suitable for deployment in
real-world environments. No commercial product fulfilling these requirements exists.
We have set up a first prototype for the acquisition of high-quality EEG signals. We
are also able to robustly recognize three cognitive mental states from on-line sponta-
neous EEG signals.

2 Related Work

In the last years several other research groups have begun to develop EEG-based brain
interfaces (BI). Several companies are also commercializing basic mind-controlled
devices. By basic devices we mean that they can only recognize two patterns or use
muscular activity.
Two groups are developing BIs based on the recognition of mental states associated
to motor activities. McFarland and Wolpaw's approach relies completely on user
training (users must control their mu rhythm on each brain hemisphere) and looks for
a fixed EEG pattern that should be present in a large majority of individuals (e.g., [7]).
The ABI project adopts an opposite approach: rather than putting all the training re-
quirements on the user, who has to learn to generate a fixed EEG pattern, it makes the
brain interface adapt to the user. This approach is partially followed by Pfurtscheller's
group, who seek to recognize the motor readiness potential generated while people are
planning movements. They are using artificial neural networks with the aim of devel-
oping universal BIs (e.g., [3]). That is, they gather EEG signals from a given number
of users in well-controlled laboratory conditions and learn a classification function that
should be valid for everybody. They have obtained good results with a few healthy
subjects, but there is no definite evidence that motor readiness also happens in motor-
impaired people. Instead of using brain activities associated with motor-related tasks,
the ABI project seeks also to recognize cognitive mental tasks outside controlled labo-
ratory settings. Anderson's group is also using artificial neural networks to build uni-
versal BIs (e.g., [1]). They are using pre-recorded EEG signals and try to derive in-
variant information from cognitive tasks. Their results, however, are mixed.

This universal BI approach suffers, in our opinion, from a major limitation. We
cannot expect a neural classifier built with EEG data from a few persons to generalize
across individuals, since no two people are the same, either physiologically or psy-
chologically. Our approach seeks to develop individual BIs rather than universal ones
valid for everybody. This means that the interface adapts to its owner as the artificial
neural network learns user-specific filters that classify the incoming EEG signals into
different categories. Furthermore, since users can choose their most natural strategy to
undertake a given mental task, they can regularly generate those individual EEG
patterns that are best distinguished by their personal interface.

3 Experimental Setup and Protocol

One of our concerns is the acquisition of high-quality EEG signals by means of robust
and easy-to-use equipment, suitable for deployment outside controlled laboratory
environments. To this end, we have built a first prototype (see Fig. 1). The EEG sys-
tem is made of a standard PC running LabVIEW and C++, a commercial signal acqui-
sition board, a cap with integrated electrodes, and dedicated hardware for the acqui-
sition of EEG signals. This hardware is a stand-alone, fully isolated, portable system
that gathers analog brain-wave voltages from up to eight scalp electrodes, amplifies
and filters them, converts them to digital values, and transmits them via the acquisition
board to the PC for analysis. This prototype is very easy to operate (healthy users can
run it without external assistance), which greatly reduces the preparation time for the
acquisition of good signals.

Fig. 1. ABI current prototype at work.

Figure 1 shows the ABI current prototype at work. In this picture, the user holds the
cap with integrated electrodes located according to the International 10-20 system.

Eight of these electrodes are directly plugged into amplifiers before sending the sig-
nals to the dedicated hardware (left). On the computer screen one can see two of the
bipolar EEG signals being processed (left windows), their corresponding power spec-
tra (top right window), and a circle (bottom right) indicating that one of the mental
tasks has been recognized.
EEG potentials are measured on the 8 channels F3, F4, C3, C4, P3, P4, O1, and O2,
with a reference electrode located in between Fz, Fp1, and Fp2. Ground is applied to one
of the ear lobes. The sampling rate is 128 Hz and data is preprocessed in temporal
windows of half a second. This preprocessing consists of a Hanning windowing, a
Butterworth bandpass filtering (4-30 Hz), on-line removal of temporal windows cor-
rupted by ocular artifacts, and computation of either the energy of 5 differential chan-
nels (F3C3, C3P3, F4C4, C4P4, O1O2) or the coherence between 10 pairs of channels (6
intra- and 4 inter-hemisphere). The energy or coherence features are fed to the neural
classifier. In this paper we only report experiments with energy features.
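As an illustration of this preprocessing chain, the following minimal Python/NumPy
sketch computes the energy features (half-second windows at 128 Hz, Hanning
windowing, 4-30 Hz Butterworth band-pass, and per-window energy of the five bipolar
differential channels). The filter order, function names, and channel ordering are our
own assumptions, and the on-line ocular-artifact removal step is omitted.

import numpy as np
from scipy.signal import butter, filtfilt

FS, WIN = 128, 64   # 128 Hz sampling rate, half-second windows

def energy_features(eeg):
    """eeg: (n_samples, 8) array of F3, F4, C3, C4, P3, P4, O1, O2."""
    F3, F4, C3, C4, P3, P4, O1, O2 = eeg.T
    # bipolar differential channels F3C3, C3P3, F4C4, C4P4, O1O2
    diff = np.stack([F3 - C3, C3 - P3, F4 - C4, C4 - P4, O1 - O2])
    # 4-30 Hz Butterworth band-pass (order 4 is an assumption)
    b, a = butter(4, [4, 30], btype="band", fs=FS)
    diff = filtfilt(b, a, diff, axis=1)
    win = np.hanning(WIN)
    feats = []
    for s in range(0, diff.shape[1] - WIN + 1, WIN):
        seg = diff[:, s:s + WIN] * win        # Hanning-windowed segment
        feats.append((seg ** 2).sum(axis=1))  # energy of each channel
    return np.array(feats)                    # (n_windows, 5)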
We want our experimental protocol to fit the real conditions in which users would
work. This means that recognition cannot rely on perfect synchronization or on
external events, and that the time window must be short. Another critical aspect
of the experimental protocol is the set of mental tasks to recognize (and differentiate
from each other). In this respect, we are considering a relatively large number of them,
consisting of both cognitive tasks (e.g., arithmetic) and motor-related ones (e.g.,
imagination of left-hand movement). Tasks are chosen so that the cortical areas
involved are quite localized and, in particular, so as to invoke hemispheric lateralisation.
From this list, each individual user selects the 3-5 tasks most comfortable to him/her.
The subject is seated and spontaneously concentrates on a mental task. The subject
performs the selected task for 10 to 15 seconds, choosing when to stop doing it and
which task to undertake next. Each recording session lasts about 5 minutes.
For the training and testing of the neural classifier, the subject informs an operator of
the task he/she will perform; then 2 seconds before and 2 seconds after are removed
from the recording to eliminate the artifacts introduced by this "communication".
The mental tasks used in the study reported in this paper are "relaxation", "cube
rotation", and "subtraction".² Relaxation is done with closed eyes, and all other tasks
with open eyes. Relaxation is used to switch the neural classifier on and off in order
to facilitate the recognition task. The rationale is that if users inform the ABI when
they do (or do not) intend to use it, the remaining desired mental tasks have only to be
distinguished from each other (and from relaxation), but not from any possible back-
ground mental activity. An additional advantage is that users will probably concentrate
better on the associated mental tasks, as they will not be so stressed while not operat-
ing the ABI (e.g., by trying to avoid thinking about those mental tasks). It follows that
users will be able to use the ABI for longer periods of time. Of course, relaxation must
be distinguished from every other task.

² The tasks consist of relaxing, visualizing a three-dimensional cube rotating around one
of its axes, and performing successive subtractions of a fixed number (e.g., 64-3=61, 61-3=58,
58-3=55, ...), respectively.

4 Results

In this section we present the first results we have obtained with two users. For each of
them we have recorded 4 sessions to train and test the neural classifier. In particular,
one of the sessions is used for training, one for validation, and the remaining two for
testing. We have adopted this unusual splitting of the available data, where training is
done over one fourth of the data while generalization is tested over half the patterns,
to probe our approach under realistic conditions from the very beginning.
Recognizing mental states from on-line spontaneous EEG signals is a complex task
where we cannot expect to reach recognition rates near to 100%. But a practical ABI
doesn't require such a high performance; on the contrary, it is our view that it suffices
a recognition rate in between 70% and 80% provided that it has an insignificant pro-
portion (less than 2%) of false positives. This is what we mean by robust recognition.
tn other words, the neural classifier (almost) never takes a relevant pattern for another
(what, for example, would make the wheelchair move in the wrong direction), but
doesn't eventually recognize EEG patterns corresponding to the desired mental tasks
(which is associated with the command "nothing" and thus doesn't have any conse-
quence except delay).
To illustrate the hard task faced by the neural classifier, Figure 2 shows the PCA
projection onto the first two eigendirections of the energy features of the five bipolar
channels recording EEG signals while a user carried out five mental tasks according to
the experimental protocol above (see [9] for details). Every sampled EEG pattern is
indicated with the number of its associated mental task (from 0 to 4). The figure
shows a high degree of overlap among the classes. Thus compact networks, such as
classical multi-layer perceptrons, will fail since they cannot compute different outputs
for very similar inputs. Their performance will not improve even if the data is pre-
processed with self-organizing maps [5], since every unit of the map will codify EEG
patterns of different categories. We have confirmed this suspicion experimentally [9].
On the other hand, one could use a single local network, such as an RBF network
(e.g., [10]) or LVQ [5]. Our results, however, show that local networks achieve good
results during training but generalize poorly. For example, Table 1 reports the results³
we have obtained with Platt's RAN algorithm [10] for the classification of the task
"cube rotation". Similar results are obtained for the task "subtraction", whereas the
task "relaxation" is better classified.

Table 1. Performance of the RAN algorithm for the mental task "cube rotation".

               Good     False Positive   N Units
Training Set   92.0 %   2.4 %            444
Testing Set    68.3 %   12.4 %

³ The results of this section refer to the personal brain interface of one of the users.
Similar levels of performance are obtained for the second user.

Fig. 2. PCA projection of the power spectral energy features onto the first two eigendirections.

Fig. 3. Hierarchical committee of incremental networks for the classification of EEG patterns.

The alternative we adopt is a hierarchical committee of networks (see Fig. 3). At
the top level, there is a committee where each network tries to classify a given task
against all the others. Then, for each task, there exists a network per EEG channel.
The output of any of the classifiers should be 1.0 if the input pattern belongs to the
mental task it has to recognize and 0.0 otherwise. Each of these networks is built in-
crementally using a variation of the RAN algorithm. That is, the number of units is not
fixed. Rather, units are created dynamically, as they are required to better cover the
input space. The networks consist of radial basis function (RBF) units in the first layer
that have weighted connections to the output unit. RBF units fire only when the input
pattern lies within their receptive field (or width). Their responses are propagated
forward to the output unit, which simply computes a weighted sum of all contribu-
tions. Initially, there is no RBF unit in a network. The RBF units may overlap.

A new unit is added only if two conditions hold. First, an EEG pattern corre-
sponding to the mental task to be recognized is incorrectly classified. Second, the
current input does not sufficiently activate any existing RBF unit. In this case, the cen-
ter of the new unit corresponds to the current EEG pattern, and its width is initialized
to a fixed value. The weight of the connection from this new unit to the output unit is
set to the difference between the desired output (i.e., 1.0) and the actual output of the
network.
If either of the above conditions is not satisfied, the learning algorithm simultane-
ously adjusts the centers and widths of the active RBF units as well as the weighted con-
nections to decrease the output error. The resulting gradient descent rules have intui-
tive interpretations. Unit centers are pulled toward EEG patterns of the desired mental
task, while being pushed away from EEG patterns of other tasks. On the other hand, unit
widths grow to cover as many desired EEG patterns as possible, but shrink to avoid
negative EEG patterns.
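A schematic sketch of this incremental growth rule follows. The thresholds, learning
rate, and initial width below are illustrative assumptions, and the center/width
adaptation of the full algorithm [9] is reduced here to a weight update only.

import numpy as np

class IncrementalRBF:
    """One incremental RBF network of the committee (a sketch)."""
    def __init__(self, err_thresh=0.5, act_thresh=0.3, width=1.0, lr=0.05):
        self.centers, self.widths, self.weights = [], [], []
        self.err_thresh, self.act_thresh = err_thresh, act_thresh
        self.init_width, self.lr = width, lr

    def _activations(self, x):
        return np.array([np.exp(-np.sum((x - c) ** 2) / (2 * w ** 2))
                         for c, w in zip(self.centers, self.widths)])

    def output(self, x):
        return float(self._activations(x) @ self.weights) if self.centers else 0.0

    def train_step(self, x, target):
        y = self.output(x)
        act = self._activations(x) if self.centers else np.empty(0)
        novel = act.size == 0 or act.max() < self.act_thresh
        if target == 1.0 and abs(target - y) > self.err_thresh and novel:
            # grow: center the new unit on the pattern, weight = residual error
            self.centers.append(np.asarray(x, float).copy())
            self.widths.append(self.init_width)
            self.weights.append(target - y)
        elif self.centers:
            # otherwise, gradient descent on the output error
            self.weights = list(np.asarray(self.weights)
                                + self.lr * (target - y) * act)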
An important feature of this kind of neural classifier is that units are moved so as to
find clusters of those EEG patterns corresponding to the mental task to be recognized.
After training, it turns out that some of the units have learned quite robust user-
specific filters, whereas some other units are tuned to EEG patterns that are still too
similar to patterns of different mental tasks. Our approach is, then, to label the output
of the classifier as unknown if one of the latter units is the closest to the observed EEG
pattern. In this way, the neural classifier doesn't make risky decisions for uncertain
EEG patterns, which are thus associated with the command "nothing". Furthermore,
users can take this "no answer" of the brain interface as a "warning" that they should
either concentrate more intensively on the desired mental task or choose another strat-
egy to undertake it. Indeed, initial observations seem to indicate that, with practice,
users learn to generate those individual EEG patterns that are better distinguished by
their personal brain interface. But more extensive testing of the approach is needed
before confirming this hypothesis.
Even though the output of a neural classifier is a real number, the ABI makes dis-
crete decisions to classify the incoming EEG patterns all across the hierarchical com-
mittee. To this end, a network classifies an EEG pattern as:

• belonging to the desired class if the output is higher than a given threshold, κ,
• belonging to the remaining classes if the output is smaller than 1−κ, or
• unknown if the output is in between.

This procedure is another key factor for enhancing the robustness of the ABI, for the
same reasons stated before.
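A one-function sketch of this three-way decision (the value of the threshold κ is our
own illustrative choice; the paper does not specify it):

def decide(output, kappa=0.8):
    """Three-way decision of a one-vs-rest network in the committee."""
    if output > kappa:
        return "desired class"
    if output < 1 - kappa:
        return "other classes"
    return "unknown"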
Figure 4 shows two user-specific filters discovered from the differential channel
F4C4 that classify quite robustly the mental task "cube rotation". Table 2 reports the
performance of our approach for the classification of the task "cube rotation". Similar
results are obtained for the task "subtraction", whereas the task "relaxation" is better
classified. It is worth noting that the channel O1O2 is irrelevant for the classification of
the three mental tasks of interest.

Fig. 4. Two user-specific filters for the robust classification of the mental task "cube rotation"
using energy features.

Table 2. Performance of the hierarchical committee of networks for the mental task "cube
rotation".

                          Good     Unknown   False Pos.   N Units
Network F4C4   Train Set  39.8 %   55.7 %    0.0 %        24
               Test Set   28.3 %   62.1 %    0.4 %
Network C4P4   Train Set  25.3 %   65.1 %    0.0 %        12
               Test Set   26.5 %   66.3 %    0.0 %
Network F3C3   Train Set  36.1 %   50.4 %    0.0 %        17
               Test Set   27.3 %   63.6 %    0.0 %
Network C3P3   Train Set  41.6 %   49.4 %    0.0 %        13
               Test Set   46.2 %   40.9 %    0.3 %
Top Network    Train Set  86.1 %   10.0 %    0.2 %        23
               Test Set   75.1 %   14.3 %    1.6 %

5 Discussion

Three elementary commands constitute the minimum set needed to interact with a
computer-based system in an intelligent way, provided that we adopt the principle of
task decomposition: the user addresses the task at a high level and all the low-level
details are handled separately. In other words, the user just tells the ABI to execute an
elementary action (e.g., move the pointer up or the wheelchair forward), but does not
worry about its implementation (e.g., how far to move so as to reach the next item
above, or obstacle avoidance). The implementation of the elementary commands
associated with the EEG patterns of interest will depend on the application. For
example, we can use these three patterns to guide a motorized wheelchair. The first
pattern, relaxation, switches the ABI on or off. Turning on the ABI makes the
wheelchair move forward, while turning it off makes the wheelchair stop. The
remaining two patterns are used to make the wheelchair turn right or left, respectively.
These elementary commands (i.e., move forward, turn right, and turn left) are sent to a
second learning system (e.g., [8]) that uses the on-board sensors to drive the
wheelchair in the desired direction in a safe (collision-avoiding) and smooth way.
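As a sketch, the command mapping just described could look as follows; the class
labels and method names are our own illustrative choices, not part of the actual ABI
software.

class WheelchairController:
    """Maps recognized mental states to elementary wheelchair commands."""
    def __init__(self):
        self.active = False              # ABI toggled on/off by relaxation

    def on_pattern(self, pattern):
        if pattern == "relaxation":
            self.active = not self.active
            return "move forward" if self.active else "stop"
        if not self.active or pattern == "nothing":
            return None                  # no effect while the ABI is off
        return {"cube rotation": "turn right",
                "subtraction": "turn left"}.get(pattern)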
These preliminary results have been obtained with a couple of users. Before going
on to recognize a larger set of patterns, we are now testing the ABI with more
users and trying to improve its robustness. In this latter respect we are exploring alter-
native feature extraction methods (e.g., autoregressive models, wavelets, etc.). This
on-going work is partially built upon previous studies with off-line EEG signals. One
of them confirms that an artificial neural network distinguishes EEG patterns better if
it exploits the temporal dynamics of brain activity [11]. This is not surprising, since EEG
signals carry temporal information. We deal with the time dimension (or history) by
means of a novel recurrent self-organizing map.

References

1. Anderson, C.W., Sijercic, Z.: Classification of EEG signals from four subjects during five
mental tasks. Int. Conf. on Engineering Applications of Neural Networks (1996) 407-414.
2. Babiloni, F., et al.: Improved realistic Laplacian estimate of highly-sampled EEG potentials
with regularization techniques. Electroencephalography and Clinical Neurophysiology 106
(1998) 336-343.
3. Bernardi, M., Canale, I., et al.: Ergonomy of paraplegic patients working with a reciprocat-
ing gait orthosis. Paraplegia 33 (1995) 458-463.
4. Kalcher, J., et al.: Graz brain-computer interface II. Medical & Biological Engineering &
Computing 36 (1996) 382-388.
5. Kohonen, T.: Self-Organizing Maps. 2nd ed. Springer-Verlag, Berlin (1995).
6. Marciani, M.G., et al.: Quantitative EEG evaluation in normal elderly subjects during men-
tal processes. International Journal of Neurosciences 76 (1994) 131-140.
7. McFarland, D.J., et al.: Spatial filter selection for EEG-based communication. Electroen-
cephalography and Clinical Neurophysiology 103 (1997) 386-394.
8. Millán, J. del R.: Rapid, safe, and incremental learning of navigation strategies. IEEE Trans.
on SMC-Part B, 26 (1996) 408-420.
9. Millán, J. del R., Mouriño, J., et al.: Incremental networks for the robust recognition of
mental states from EEG. Technical Report, Joint Research Centre of the EC, Italy (1998).
10. Platt, J.: A resource allocating network for function interpolation. Neural Computation 3
(1991) 213-225.
11. Varsta, M., Millán, J. del R., Heikkonen, J.: A recurrent self-organizing map for temporal
sequence processing. 7th Intl. Conf. on Artificial Neural Networks (1997) 421-426.
Identifying Mental Tasks from Spontaneous
EEG: Signal Representation and Spatial Analysis
Charles W. Anderson
Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA,
anderson@cs.colostate.edu,
WWW home page: http://www.cs.colostate.edu/~anderson

Keywords: electroencephalogram, pattern recognition, autoregressive models, brain-
computer interface

Abstract. Feedforward neural networks are trained to classify half-second segments of six-
channel EEG data into one of five classes corresponding to five mental tasks performed by one
subject. Two- and three-layer neural networks are trained on a 128-processor SIMD computer
using 10-fold cross-validation and early stopping to limit over-fitting. Four representations of
the EEG signals, based on autoregressive (AR) models and Fourier Transforms, are compared.
Using the AR representation and averaging over consecutive segments, an average of 72% of
the test segments are correctly classified; for some test sets 100% are correctly classified.
Cluster analysis of the resulting hidden-unit weight vectors suggests which electrodes and
representation components are the most relevant to the classification problem.

1 Introduction
Automatic classification of electroencephalogram, or EEG, signals can lead to signif-
icant advances in studies of psychiatric diagnosis [21], human-computer interfaces,
aids for disabled persons [14], and cognitive workload [15]. The state-of-the-art, how-
ever, is very limited; usually a small number of mental states are discriminated in
any given experiment. For example, previous work by Keirn and Aunon studied the
discrimination of pairs of mental tasks [12]. We have repeated their work with pairs
of tasks [1-4] and a much larger data set. We have also extended their work to the
discrimination of three tasks [5]. In this article, we describe the procedures and results
of attempting to discriminate between five mental tasks.
A critical component of any automatic classification scheme is the representation
with which the information for each case is encoded. For EEG signal classification, a
representation is desired for which an accurate classifier can be trained with a reason-
able number of known examples and that is relatively invariant over time. The lack
of comparative studies in EEG classification makes it difficult to draw useful con-
clusions. Some studies have reported comparisons between conventional classification
methods and neural networks (e.g., [22]), but it is relatively rare to find comparisons
among different signal representations.
This article reports the results of a comparison of EEG signal representations
judged by the performance of neural-network classifiers. We compared representations
based on AR models and Fourier Transforms, and reduced-dimensional versions of
both based on the Karhunen-Loève (KL) Transform [10]. Our best results were obtained
with a sixth-order AR representation and a feedforward neural network having one
hidden layer of 20 units, trained with error backpropagation [17]. By averaging the
output of the classifier over approximately five seconds of consecutive, half-second
windows, we found that 72% of the test segments were classified correctly, averaged

over all cross-validation repetitions. An indication of relevant features in the AR
representation is obtained by clustering the hidden-unit weight vectors of trained
neural networks.
Work related to our EEG classification problem is reviewed in Section 2. The
representations and training procedure are defined in Section 3. Results are presented
in Section 4, and Section 5 contains a description and results of the cluster analysis
performed on trained networks. Section 6 summarizes the conclusions and limitations
of the classification experiments.

2 Related Work
Since the early days of automatic EEG processing, representations based on a Fourier
Transform have been most commonly applied. This approach is based on earlier ob-
servations that the EEG spectrum contains some characteristic waveforms that fall
primarily within four frequency bands: delta (1-3 Hz), theta (4-7 Hz), alpha (8-13
Hz), and beta (14-20 Hz). Such methods have proved beneficial for various EEG char-
acterizations, but the Fourier Transform and its discrete version, the FFT, suffer from
large noise sensitivity. Numerous other techniques from the theory of signal analy-
sis have been used to obtain representations and extract the features of interest for
classification purposes. Gevins and Rémond [6] summarize many of these techniques.
Yunck and Tuteur [22] describe an ambitious comparative study of a variety of
classifiers, all using the same representation, for the discrimination of EEG recorded
from 40 subjects performing the following seven tasks: resting, mental arithmetic,
listening to music, performing verbal exercises, listening to speech, performing pic-
torial exercises, and viewing a film. They compared four parametric classifiers based
on Gaussian assumptions and four nonparametric k-nearest neighbor classifiers. Their
representation consisted of 320 features based on the power in several frequency bands
from the signals recorded simultaneously at four electrodes [9]. The nonparamet-
ric classifiers were found to be superior to the Gaussian-based classifiers, suggesting
that the majority of published work on EEG classification, which is based on linear-
discriminant analysis and related Gaussian-based methods, could be improved by
using nonparametric methods, such as neural networks.
Others also find AR models to be fruitful ways of characterizing EEG segments.
Sanderson, et al., [18] describe a multiple-stage procedure whereby single channels
of EEG are adaptively divided into relatively stationary segments, modeled using AR
models, and the AR coefficients are clustered. Tseng, et al., [20] evaluated different
parametric models on a fairly large database of EEG segments. Using inverse filtering,
white noise tests, and one-second EEG segments, they found that AR models of orders
between 2 and 32 yielded the best EEG estimation. For a method which avoids the
use of signal segmentation and provides an on-line AR parameter estimation that fits
nonstationary signals, like EEG, see [7].
In a problem of discriminating EEG of normal subjects from those with psychiatric
disorders, Tsoi, et al., [21] used AR models to represent one-second EEG segments
and trained neural networks to perform the classification. Their best classification
results were obtained on data averaged over 250-second intervals.
Inouye, et al., [8] used EEG to localize activated areas and determined directional
patterns in activity changes during mental arithmetic. They considered two EEG rep-
resentations based on information-theoretic measures. Signals from 18 electrodes were

represented by first calculating FFT's of each one-second segment, then averaging the
FFT's over four consecutive segments, and finally calculating the entropy of the re-
sulting power spectra. Differences in the entropy at a number of electrode locations
were found during rest versus during the performance of a mental arithmetic task.
They also studied a mutual information measure based on two-dimensional, 15th-order
AR models fitted to each of the 153 pairwise combinations of electrodes. Their results
showed a significant difference in information flow between electrodes for the resting
and mental arithmetic tasks.

3 Method
3.1 EEG Data Acquisition and Representation
Subjects were seated in an Industrial Acoustics Company sound-controlled booth
with dim lighting and noiseless fans for ventilation. An Electro-Cap elastic electrode
cap was used to record from positions C3, C4, P3, P4, O1, and O2, defined by the
10-20 system of electrode placement [9] and shown in Figure 1. The electrodes were

Fig. 1. Placement of the electrodes according to the 10-20 system.

connected through a bank of Grass 7P511 amplifiers and bandpass filtered from 0.1-
100 Hz. Data was recorded at a sampling rate of 250 Hz with a Lab Master 12-bit
A/D converter mounted in an IBM-AT computer. Eye blinks were detected by
means of a separate channel of data recorded from two electrodes placed above and
below the subject's left eye.
For this paper, the data from one subject performing the following five mental
tasks was analyzed. These tasks were chosen by Keirn and Aunon to invoke hemi-
spheric brainwave asymmetry [16]. The five tasks are:

Baseline Task: The subjects were not asked to perform a specific mental task, but
to relax as much as possible and think of nothing in particular.
Letter Task: The subjects were instructed to mentally compose a letter to a friend
or relative without vocalizing.
Math Task: The subjects were given nontrivial multiplication problems, such as 49
times 78, and were asked to solve them without vocalizing or making any other
physical movements.
Visual Counting Task: The subjects were asked to imagine a blackboard and to visualize
numbers being written on the board sequentially, with the previous number being
erased before the next number was written.

Geometric Figure Rotation Task: The subjects were given 30 seconds to study a draw-
ing of a complex three-dimensional block figure, after which the drawing was re-
moved and the subjects were instructed to visualize the object being rotated about an
axis.

Data was recorded for 10 seconds during each task and each task was repeated five
times per session. Most subjects attended two such sessions recorded on separate
weeks, resulting in a total of 10 trials for each task. With a 250 Hz sampling rate,
each 10-second trial produces 2,500 samples per channel. These were divided into half-
second segments that overlap by one quarter-second, producing at most 39 segments
per trial; segments containing eye blinks are discarded.
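A small sketch of this segmentation step (the blink-mask interface is our own
assumption about how discarded segments would be flagged):

import numpy as np

FS = 250                          # sampling rate (Hz)
SEG, STEP = FS // 2, FS // 4      # half-second segments, quarter-second step

def segment_trial(trial, blink_mask=None):
    """trial: (2500, 6) array holding one 10-second, six-channel recording.
    Yields up to 39 half-second segments overlapping by a quarter second."""
    for start in range(0, trial.shape[0] - SEG + 1, STEP):
        if blink_mask is not None and blink_mask[start:start + SEG].any():
            continue                          # discard eye-blink segments
        yield trial[start:start + SEG]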
3.2 Representations of EEG Signals
The first representation studied is composed of just the AR coefficients. Let $a_{i,c}$ be
the $i$-th coefficient of the AR model for channel $c$, where $c \in \{C3, C4, P3, P4, O1, O2\}$
and $i = 1, \ldots, n$, with $n$ being the order of the model. The prediction, $\hat{x}_c(t)$, of the
order-$n$ AR model is given by
$$\hat{x}_c(t) = \sum_{i=1}^{n} a_{i,c}\, x_c(t - i).$$
The coefficients that minimize the squared error of this prediction were estimated
using the Burg method [11].¹ The AIC criterion is minimized for orders of two and
three [19], but based on previous results by Keirn and Aunon, an order of six was
used.
The 36 coefficients (6 channels × 6 orders) for each segment are concatenated into
one feature vector consisting of the six coefficients for the C3 channel, then for the
C4, P3, P4, O1, and O2 channels. A total of 1,385 half-second windows compose
the 10 trials, with 277 windows from each of the five tasks. Each trial contains the
same number of windows from each task, though the trials contain a different total
number of windows, ranging from 100 to 175.
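As an illustration of the feature construction, the following sketch fits an order-6 AR
model per channel and concatenates the coefficients; a plain least-squares fit is used
here as a stand-in for the Burg method employed in the paper.

import numpy as np

ORDER = 6                               # AR model order used in the paper

def ar_coeffs(x, order=ORDER):
    """Least-squares AR fit: predict x(t) from its previous `order` samples."""
    X = np.column_stack([x[order - i - 1:len(x) - i - 1] for i in range(order)])
    y = x[order:]                       # columns of X are x(t-1), ..., x(t-order)
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a                            # a[i-1] multiplies x(t-i)

def ar_features(segment):
    """segment: (125, 6) half-second window; returns the 36-dimensional
    feature vector (6 coefficients per channel, channels concatenated)."""
    return np.concatenate([ar_coeffs(segment[:, c]) for c in range(6)])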
To compare with the performance of the AR representation, a power spectral
density (PSD) representation was implemented using the same data window of 125
samples, or one-half second, with a quarter-second overlap. Data segments were win-
dowed with the Hanning window and a 125-point FFT was applied², resulting in a
63-point power spectral density spanning 0 to 125 Hz with a resolution of 2 Hz. For
each segment, the 63 points per channel were concatenated to form a feature vector
of 378 components (63 × 6). The ordering of channels in the feature vector is the same
as the ordering for the AR representation.
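A corresponding sketch of the PSD feature construction (the paper used MATLAB's
psd function; here we show an equivalent computation directly with an FFT):

import numpy as np

def psd_features(segment, nfft=125):
    """segment: (125, 6) half-second window sampled at 250 Hz.  Hanning
    windowing followed by a 125-point FFT gives 63 power values per
    channel (0-125 Hz in 2 Hz steps), concatenated into 378 components."""
    win = np.hanning(segment.shape[0])
    feats = [np.abs(np.fft.rfft(segment[:, c] * win, n=nfft)) ** 2
             for c in range(segment.shape[1])]
    return np.concatenate(feats)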
The dimensionality of the AR and PSD representations was reduced via a Karhunen-
Loève (KL) transformation [10], in which the eigenvectors of the covariance matrix of
all AR or PSD vectors are determined and the AR or PSD vectors are projected onto
a subset of the eigenvectors having the highest eigenvalues. The key parameter of this
transformation is the number of eigenvectors onto which each vector is projected. A
common way to choose this number is to set it equal to the global Karhunen-Loève
estimate, given by the smallest index $i$ for which $\lambda_i/\lambda_{max} \le 0.01$, where the $\lambda_i$ are
the eigenvalues in decreasing order for $i = 1, 2, \ldots$. For the AR representation of all
¹ The Burg method was implemented using the MATLAB function ar. See the Mathworks, Incorporated,
web page at http://www.mathworks.com for more information.
² The FFT was implemented using the MATLAB psd function.

segments from the five tasks, the global KL estimate is 31, a small reduction from
the original 36 dimensions of the representation. For the PSD representation, the
global KL estimate is 21. This is a large reduction from the 378 dimensions of the
PSD representation.
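A sketch of the KL (PCA) reduction with the global estimate described above:

import numpy as np

def kl_transform(X, ratio=0.01):
    """X: (n_segments, n_features) matrix of AR or PSD vectors.  Projects
    onto the leading eigenvectors of the covariance matrix, keeping the
    smallest number k for which lambda_k / lambda_max <= ratio."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1]              # decreasing eigenvalues
    evals, evecs = evals[order], evecs[:, order]
    below = evals / evals[0] <= ratio
    k = int(np.argmax(below)) + 1 if below.any() else len(evals)
    return Xc @ evecs[:, :k]                     # reduced representation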

3.3 Neural Network Classifier

The classifier implemented for this work is a standard feedforward neural network
with one or two hidden layers and one output layer with five units, one for each task.
It was found that the classification accuracy obtained with two hidden layers was not
better than the accuracy obtained with one hidden layer; all results reported here
are for one hidden layer. For the five-task experiments, the target values were set to
1,0,0,0,0 for the baseline task, 0,1,0,0,0 for the letter task, 0,0,1,0,0 for the math task,
0,0,0,1,0 for the counting task, and 0,0,0,0,1 for the rotation task. The standard error
backpropagation algorithm was used to train the network. Different learning rates
were used for the hidden layers and the output layer. After trying a large number of
different values, we found that a learning rate of 0.1 for the hidden layers and 0.01
for the output layer produced the best performance.
To limit the amount of over-fitting during training, the following 10-fold cross-
validation, early-stopping procedure was performed. Eight of the ten trials were used
for the training set, one of the remaining trials was selected for validation, and the
last trial was used for testing. The error of the network on the validation data was
calculated after every pass, or epoch, through the training data. After 3,000 epochs,
the network state (its weight values) at the epoch for which the validation error is
smallest was chosen as the network that will most likely perform well on novel data.
This best network was then applied to the test set; the result indicates how well the
network will generalize to novel data. With 10 trials, there are 90 ways of choosing
the validation and test trials, with the remaining eight trials combined for the training
set. Results described in the next section are reported as the classification
accuracy on the test set averaged over all 90 partitions of the data. Each of the 90
repetitions started with different random initial weights. The neural networks were
trained using a CNAPS Server II, a parallel SIMD architecture with 128 20-MHz
processors. An experiment of 90 repetitions required 4.8 hours on the CNAPS and 30
hours on a SparcStation. Each input component was transformed so that its mean
was 0.5 and its standard deviation was 0.5/3. Then all components greater than 1
were set to 1 and all components less than 0 were set to 0.
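A sketch of the partition enumeration and the input scaling just described:

import numpy as np
from itertools import permutations

def partitions(n_trials=10):
    """Enumerate the 90 ordered (validation, test) choices among 10
    trials; the remaining eight trials form the training set."""
    for val, test in permutations(range(n_trials), 2):
        train = [t for t in range(n_trials) if t not in (val, test)]
        yield train, val, test

def normalize(X):
    """Per-component scaling: mean 0.5, standard deviation 0.5/3,
    then clipping to [0, 1]."""
    Z = 0.5 + (X - X.mean(axis=0)) / X.std(axis=0) * (0.5 / 3)
    return np.clip(Z, 0.0, 1.0)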

4 Results
Figure 2 summarizes the average percent of test segments classified correctly for
various-sized networks using each of the four representations, which will be called
AR-KL and AR for the representations based on AR coefficients, with and without
dimensionality reduction by the Karhunen-Loève transform, and PSD-KL and PSD
for the representations based on the power spectral density. 90% confidence intervals
are included in the plots. For one hidden unit, the PSD representations perform better
than the AR representations. With two hidden units, the PSD-KL representation
performs about 10% better than the other three. With 20 hidden units, the KL
representations perform worse than the non-KL representations, though the difference
is not statistically significant.


Fig. 2. Average percent of test segments correctly classified. Error bars show 90% confidence intervals.

Inspection of how the network's classification changes from one segment to the next
suggests that better performance might be achieved by averaging the network's output
over consecutive segments. To investigate this, a 20-unit network trained with the AR
representation is studied. The left column of graphs in Figure 3 shows the output values
of the network's five output units for each segment of test data from one trial. On each

[Figure 3 panels: No Averaging (Percent Correct = 54); Averaging Over 10 Consecutive
Segments (Percent Correct = 82); Averaging Over 20 Consecutive Segments (Percent
Correct = 96)]

Fig. 3. Network output values and desired values for one test trial. The first five rows of graphs show
the values of the five network outputs over the 175 test segments. The sixth row of graphs plots the task
determined by the network outputs and the true task. The first column of graphs is without averaging over
consecutive segments, the second is for averaging the network output over ten consecutive segments, while
the third column is for averaging over twenty segments.

graph the desired value for the corresponding output is also drawn. The bottom graph
shows the true task and the task predicted by the network. For this trial, 54% of the
segments are classified correctly when no averaging across segments is performed. The
other two columns of graphs show the network's output and predicted classification
that result from averaging over 10 and 20 consecutive segments. Confusions that the

classifier made can be identified by the relatively high responses of an output unit for
test segments that do not correspond to the task represented by that output unit. For
example, in the third graph in the right column, the output value of the math unit
is high during math segments, as it should be, but it is also relatively high during
count segments. Also, the output of the count unit, shown in the fourth graph, is high
during count segments, but is also relatively high during letter segments.
For this trial, averaging over 20 segments results in 96% correct, but performance
is not improved this much on all trials. The best classification performance for the
20-hidden-unit network, averaged over all 90 repetitions, is achieved by averaging
over all segments. Figure 4 shows how the fraction correct varies with the number
of consecutive segments averaged for each representation. All trials contain at least
20 segments, but very few contain 35, so the statistical significance of the averages
plotted in Figure 4 quickly decreases above 20 segments. The AR representation
performs the best whether averaging over 10 or 20 segments, but when averaging
over 20 segments, the AR and AR-KL representations perform equally well. The PSD
and PSD-KL representations do consistently worse than the AR representations.
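The averaging itself amounts to a sliding mean over the classifier outputs before the
winning class is taken, as in this sketch:

import numpy as np

def averaged_predictions(outputs, n_avg=20):
    """outputs: (n_segments, 5) network outputs for the consecutive
    segments of one trial.  Averages each output unit over n_avg
    consecutive segments before taking the winning class."""
    kernel = np.ones(n_avg) / n_avg
    smoothed = np.column_stack(
        [np.convolve(outputs[:, c], kernel, mode="valid")
         for c in range(outputs.shape[1])])
    return smoothed.argmax(axis=1)   # predicted task per averaged block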


Fig. 4. The fraction of averaged windows classified correctly versus the number of consecutive windows
averaged over.

5 Analysis of the Neural Network Classifier


One approach to dealing with this quantity of information is to cluster the weight
vectors from all 90 repetitions. Vectors to be clustered are formed by concatenating
the input weights of a hidden unit with its output weights, i.e., the weights with which
the unit is connected to the five output units. The result of applying the k-means
clustering algorithm with k = 20, i.e., for 20 clusters, is shown in Figure 5a. The k-
means algorithm was initialized by randomly selecting k hidden-unit weight vectors as
the initial cluster centers. Positive weights are drawn as filled boxes, negative weights
as unfilled boxes. The width and height of a box is proportional to the weight's
magnitude. The weights of the hidden layer are drawn as the upper matrix of boxes
and the weights of the output layer are drawn as the lower matrix. The weights of
the first hidden unit appear in the left-most column of the upper matrix, while the
weights of the first output unit, the one corresponding to the baseline task, are drawn
as the first row of the lower matrix.
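A sketch of this clustering step (the matrix shapes are our own assumptions):

import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_hidden_units(nets, k=20):
    """nets: list of (W_in, W_out) pairs from the 90 trained networks,
    with W_in of shape (n_inputs, n_hidden) and W_out of shape
    (n_hidden, 5).  Each hidden unit contributes one vector made of its
    input weights concatenated with its five output weights."""
    vecs = np.vstack([np.hstack([W_in.T, W_out]) for W_in, W_out in nets])
    # minit="points" starts from randomly selected weight vectors,
    # matching the initialization described in the text
    centers, labels = kmeans2(vecs, k, minit="points")
    return centers, labels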


Fig. 5. a. Results of k-means clustering for 20 clusters with the AR representation; b. Results of k-means
clustering for 10 clusters with the PSD representation.

Cluster 2 suppresses (is connected negatively to) the math task output unit and
Cluster 3 suppresses all but the math task unit. Other clusters also contain significant
weights for the O1 and O2 channels. One cluster that includes large weights in other
channels is Cluster 18, for which the first-order weights are relatively large, positive
values for C3, P3, and O1. As Figure 1 shows, these electrodes record from the
left hemisphere. This cluster has positive output weights for the baseline, letter, and
math tasks, and negative for the counting and rotation tasks, suggesting a hemispheric
asymmetry in the EEG signals related to the first three tasks. Recall that prior
to training all representation components were normalized to have the same mean
and variance. This removes biases that would arise from differing input component
variances, allowing the direct comparison of the magnitudes of the weights in these
clusters.
A similar cluster analysis can be applied to other signal representations. Figure 5b
shows the results of a cluster analysis of the PSD representation. As before, vectors
of hidden-unit input weights and output weights were clustered, this time into ten
clusters. There are too many components to display as boxes, so they are simply
plotted versus a component index. The left column of graphs in Figure 5b shows the
input components of the ten clusters. Components are grouped by channel and the
components for each channel correspond to the power at frequencies ranging from 1
to 125 Hz. The right column of graphs displays the output weights for each cluster,
with one value corresponding to each task.
EEG signals arising from brain activity are typically characterized by their power
spectrum from 0 to 30 or 40 Hz, with higher frequencies being attributed to muscle
activity or sensor noise. Consideration of the third and fifth clusters suggests that,
whatever the cause of the high-frequency signals, the high-frequency components

are correlated with task. Cluster 3 contains a positive connection to the math task
and negative connections to the others, while Cluster 5 contains a negative math
connection and positive or near-zero connections to the others, i.e., the inverse of
Cluster 3. The input weights of these clusters are also approximately negatives of
each other: Cluster 3 has negative weights for O1 components and positive for O2
components, and this pattern is reversed for Cluster 5.
Another very interesting observation is that Cluster 8 contains large weights in
the P3 and O1 channels at a frequency very close to 60 Hz. This is most likely due
to interference from the 60 Hz power supply during the recording process. The EEG
recording amplifier used to gather this data supposedly filters out 60 Hz, but the
cluster analysis clearly shows the presence of a 60 Hz signal and, not only its presence,
but that it is correlated with the letter task. Even though all tasks were repeated on
two different days, there may be more 60 Hz noise in the letter task data than in
other data. This demonstrates how the cluster analysis of a large number of resulting
weight vectors can lead to an understanding of what relationships the networks have
extracted from the data. It also shows how assumptions about the data, such as the
removal of known noise sources, can be verified.

6 Conclusion
The correct task out of five was identified correctly for 64% of the EEG test patterns
when the output of the network was averaged over five consecutive, half-second seg-
ments and each segment was represented by either an AR model or a power spectral
density (PSD). This level of performance was achieved with a neural network of one
hidden layer containing 20 units for the AR representation and 40 units for the PSD
representation. Performance was increased to 72% by averaging over 20 consecutive
segments (approximately five seconds of data), but only for the AR case.
Karhunen-Loève transforms were applied to both the AR and PSD representations
to investigate the possibility of reducing the dimensionality of the input representation
without sacrificing performance. Results show that the AR representation could not
be significantly reduced without decreasing performance. The dimension of the PSD
representation could be greatly reduced with little loss in performance on individual
half-second segments, but performance was considerably lower when averaging over
consecutive segments.
Cluster analysis was applied to learned weight vectors, revealing some of the ac-
quired relationships between representation components and mental tasks and also
revealing unexpected characteristics of the data, such as the presence of 60 Hz noise.
The results of clustering can be used both for the construction of lower-dimensional
representations and for investigating hypotheses regarding differences in brain activity
related to different cognitive behavior.
Many issues remain to be solved before this approach can be developed into a
reliable, portable EEG-computer interface. Portable EEG acquisition devices are not
generally available, but are being developed. Current EEG electrodes are very in-
convenient to use. A primary limitation of work to date is the lack of generalization
studies across subjects. Lin, Tsai, and Liou [13] did test multi-subject generalization
using data very similar to that used in this article, but met with little success.
Acknowledgments: This work was supported by the National Science Foundation through grants IRI-
9202100 and OISE-9422007.

References
1. C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. Determining mental state from EEG signals using
neural networks. Scientific Programming, 4(3):171-183, Fall 1995.
2. C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. EEG signal classification with different signal
representations. In F. Girosi, J. Makhoul, E. Manolakos, and E. Wilson, editors, Neural Networks for
Signal Processing V, pages 475-483. IEEE Service Center, Piscataway, NJ, 1995.
3. C. W. Anderson, E. A. Stolz, and S. Shamsunder. Discriminating mental tasks using EEG represented by
AR models. In Proceedings of the 1995 IEEE Engineering in Medicine and Biology Annual Conference,
Montreal, Canada, 1995.
4. C. W. Anderson, E. A. Stolz, and S. Shamsunder. Multivariate autoregressive models for classification of
spontaneous electroencephalogram during mental tasks. IEEE Transactions on Biomedical Engineering,
45(3):277-286, 1998.
5. Charles W. Anderson. Effects of variations in neural network topology and output averaging on the
discrimination of mental tasks from spontaneous electroencephalogram. Journal of Intelligent Systems,
7(1-2):165-190, 1997.
6. A. S. Gevins and A. Rémond. Methods of Analysis of Brain Electrical and Magnetic Signals, volume 1
of Handbook of Electroencephalography and Clinical Neurophysiology (revised series). Elsevier Science
Publishers B.V., New York, NY, 1987.
7. S. Goto, M. Nakamura, and K. Uosaki. On-line spectral estimation of nonstationary time series based
on AR model parameter estimation and order selection with a forgetting factor. IEEE Transactions on
Signal Processing, 43(6):1519-1522, June 1995.
8. T. Inouye, K. Shinosaki, A. Iyama, and Y. Matsumoto. Localization of activated areas and direc-
tional EEG patterns during mental arithmetic. Electroencephalography and Clinical Neurophysiology,
86(4):224-230, 1993.
9. H. Jasper. The ten twenty electrode system of the international federation. Electroencephalography and
Clinical Neurophysiology, 10:371-375, 1958.
10. I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
11. S. M. Kay. Modern Spectral Estimation: Theory and Application. Prentice-Hall, Englewood Cliffs, NJ,
1988.
12. Z. A. Keirn and J. I. Aunon. A new mode of communication between man and his surroundings. IEEE
Transactions on Biomedical Engineering, 37(12):1209-1214, December 1990.
13. Shiao-Lin Lin, Yi-Jean Tsai, and Cheng-Yuan Liou. Conscious mental tasks and their EEG signals.
Medical & Biological Engineering & Computing, 31:421-425, 1993.
14. H. S. Lusted and R. B. Knapp. Controlling computers with neural signals. Scientific American, pages
82-87, October 1996.
15. Scott Makeig, Tzyy-Ping Jung, and Terrence J. Sejnowski. Using feedforward neural networks to monitor
alertness from changes in EEG correlation and coherence. In D. S. Touretzky, M. C. Mozer, and M. E.
Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 931-937. The MIT
Press, Cambridge, MA, 1996.
16. M. Osaka. Peak alpha frequency of EEG during a mental task: Task difficulty and hemispheric differ-
ences. Psychophysiology, 21:101-105, 1984.
17. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propaga-
tion. In D. E. Rumelhart, J. L. McClelland, and The PDP Research Group, editors, Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, volume 1. Bradford, Cambridge, MA, 1986.
18. A. C. Sanderson, J. Segen, and E. Richey. Hierarchical modeling of EEG signals. IEEE Transactions
on Pattern Analysis and Machine Intelligence, PAMI-2(5):405-414, September 1980.
19. E. Stolz. Multivariate autoregressive models for classification of spontaneous electroencephalogram
during mental tasks. Master's thesis, Electrical Engineering Department, Colorado State University,
Fort Collins, CO, 1995.
20. S-Y. Tseng, R-C. Chen, F-C. Chong, and T-S. Kuo. Evaluation of parametric methods in EEG signal
analysis. Med. Eng. Phys., 17:71-78, January 1995.
21. A. C. Tsoi, D. S. C. So, and A. Sergejew. Classification of electroencephalogram using artificial neural
networks. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information
Processing Systems 6, pages 1151-1158. Morgan Kaufmann, San Francisco, CA, 1994.
22. T. P. Yunck and F. B. Tuteur. Comparison of decision rules for automatic EEG classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(5):420-428, September 1980.
Independent Component Analysis
of Human Brain Waves

Ricardo Vigário and Erkki Oja

Lab. of Computer & Info. Science
Helsinki University of Technology
P.O. Box 5400, FIN-02015 HUT, Finland
{Ricardo.Vigario, Erkki.Oja}@hut.fi

Abstract. Recent years have seen a considerable increase in knowledge
about the human brain, both in the understanding of some of the most
basic human processing systems, and in the elaboration of efficient com-
putational neuroscience models. In a bootstrapping (reinforced) manner,
the discoveries made on the human brain are leading to the formulation
of more efficient computational methods which in turn make it possible
to design new signal processing tools for better extracting information
from brain data.
In this paper, we review one such signal processing tool, independent
component analysis (ICA), which belongs to the class of artificial
neural networks. It will be shown how this technique suits the problem
of artifact detection, and removal, both in electroencephalographic and
magnetoencephalographic recordings. Furthermore, when applied to the
evoked field paradigm, ICA separates the complex brain responses into
simpler components than the conventional principal component analysis
(PCA) approach. This sparse division may lead to an improvement in
the interpretation of such event-related signals.

1 Introduction

Without doubt, the brain is among the most intriguing and complex systems ever
studied by humankind. In an attempt to give a plausible explanation to the
whys and hows of human perception and cognition, many conjectures have been
formulated and theories tested throughout the centuries. The end of this
century, in particular, sees an impressive explosion of knowledge about the brain,
both in the understanding of some of the most basic human processing systems,
and in the elaboration of efficient computational neuroscience models.
In a bootstrapping (reinforced) manner, the discoveries made on the human
brain are leading to the formulation of more efficient computational methods
which in turn make it possible to design new signal processing tools for better
extracting information from brain data. Some of the most promising of such tools
are in the field of artificial neural networks, of which this paper's independent
component analysis (ICA) algorithm is a good example.

Over the last 3 or 4 years, ICA techniques have proven to be effective in
helping to solve the problem of the extraction of artifacts from electroencephalo-
graphic and magnetoencephalographic recordings (EEG and MEG, respectively).
This paper will summarize our experience in this field, as well as in the analysis
of event-related studies.
In the next section a brief introduction to the ICA problem, as well as the
FastICA fixed-point algorithm used throughout the paper, will be presented. In
the following sections we will successively go from the identification and extrac-
tion of artifacts from EEG and MEG data [22, 23] to the analysis of auditory and
somatosensory evoked fields [19, 24, 25]. For clarity, we will present
examples only in MEG. Further reading on EEG studies can be found in, e.g., [13,
17, 22].

2 Independent Component Analysis

In blind source separation, the original independent sources are assumed to be
unknown, and we only have access to their weighted sum. In this model, the
signals recorded in an MEG study are denoted as $x_k(i)$ ($i$ ranging from 1 to $L$, the
number of sensors used, and $k$ denoting discrete time). Each $x_k(i)$ is expressed
as the weighted sum of $M$ independent signals $s_k(j)$, following the vector ex-
pression:
$$x_k = \sum_{j=1}^{M} a(j)\, s_k(j) = A s_k, \qquad (1)$$
where $x_k = [x_k(1), \ldots, x_k(L)]^T$ is an $L$-dimensional data vector, made up of the
$L$ mixtures at discrete time $k$. The $s_k(1), \ldots, s_k(M)$ are the $M$ zero-mean inde-
pendent source signals, and $A = [a(1), \ldots, a(M)]$ is a constant mixing matrix,
whose elements $a_{ij}$ are the unknown coefficients of the mixtures. In MEG, the
columns of $A$ give the magnetic flux produced by the sources across the scalp. In
order to perform ICA, it is necessary to have at least as many mixtures as there
are independent sources ($L \ge M$). When this relation is not fully guaranteed,
which we believe may be the case in these experiments, and the dimensionality of
the problem is high enough, we should expect the first independent components
to present clearly the most strongly independent or non-Gaussian signals, while
the last components still consist of mixtures of the remaining source signals.
The problem is now to estimate the independent signals $s_k(j)$ from their
mixtures, or the equivalent problem of finding the separating matrix $B$ that
satisfies (see Eq. 1)
$$s_k = B x_k. \qquad (2)$$

Several approaches to the solution of the ICA problem are available in the
literature [2, 4, 6, 7, 10, 14]. A good tutorial on neural ICA implementations is
given in [15]. The particular algorithm used in this study is discussed in [10,
18].

2.1 The algorithm

The initial step in source separation, using the method described in this arti-
cle, is whitening, or sphering. This projection of the data is used to achieve
uncorrelatedness between the solutions found, which is a prerequisite of statistical
independence [10]. The whitening can as well be seen to ease the separation of
the independent signals [15]. In [11], it has been shown that a well-chosen com-
pression, during this stage, may be necessary in order to reduce the overlearning
(overfitting) typical of ICA methods. The result of a poor compression choice
is the production of solutions that are practically zero almost everywhere, except at
the point of a single spike or bump.
The whitening may be accomplished by PCA projection: $v = Vx$, with
$E\{vv^T\} = I$. The whitening matrix $V$ is given by $V = \Lambda^{-1/2}\Sigma^T$, where
$\Lambda = \mathrm{diag}[\lambda(1), \ldots, \lambda(M)]$ is a diagonal matrix with the eigenvalues of the data
covariance matrix $E\{xx^T\}$, and $\Sigma$ a matrix with the corresponding eigenvectors
as its columns.
Consider a linear combination $y = w^T v$ of a sphered data vector $v$, with
$\|w\| = 1$. Then $E\{y^2\} = 1$ and $\mathrm{kurt}(y) = E\{y^4\} - 3$, whose gradient with
respect to $w$ is $4E\{v(w^T v)^3\}$.
The fixed-point algorithm [10], calculated over sphered zero-mean vectors $v$,
finds one of the rows of the separating matrix $B$ (denoted $w$) and so identifies one
independent source at a time; the corresponding independent source signal can
then be found using Eq. 2. Each iteration of this algorithm, a gradient descent
over the kurtosis, is defined for a particular iteration $l$ as
$$w_l^* = E\{v (w_{l-1}^T v)^3\} - 3 w_{l-1}, \qquad w_l = w_l^* / \|w_l^*\|. \qquad (3)$$

In order to estimate more than one solution, and up to a maximum of $M$,
the algorithm may be run as many times as required. It is, nevertheless, nec-
essary to remove the information contained in the solutions already found, to
estimate each time a different independent component. This can be achieved by
simply subtracting the estimated solution $\hat{s} = w^T v$ from the unsphered data $x_k$.
The solutions thus found are defined up to a multiplication and a permutation.
Therefore, the subtracted vector must be multiplied by a vector containing the
regression coefficients over each vector component of $x_k$. These regression values
are given by the columns of the matrix $A$, which we can now estimate from the
separating matrix $B$ as $\hat{A} = B^{-1}$. Any other explicit orthogonalization of the
solutions may be used as well.
This study was made using MATLAB code, based on the FastICA package [1].
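
A compact re-implementation of the fixed-point iteration of Eq. 3 might look as follows (Python/NumPy; a sketch under our own choices, not the FastICA package itself). In particular, deflation is done here by Gram-Schmidt orthogonalization of successive weight vectors, one of the explicit orthogonalization alternatives mentioned above, rather than by the regression-based subtraction from the unsphered data.

    import numpy as np

    def fastica_kurtosis(v, n_components, n_iter=200, tol=1e-8, seed=0):
        # Kurtosis-based fixed-point iteration of Eq. (3) on data v of shape
        # (L, n_samples), assumed sphered and zero mean. Each pass estimates
        # one row of the separating matrix B; rows are kept orthogonal by
        # Gram-Schmidt projection (deflation).
        rng = np.random.default_rng(seed)
        L, n = v.shape
        B = np.zeros((n_components, L))
        for c in range(n_components):
            w = rng.standard_normal(L)
            w /= np.linalg.norm(w)
            for _ in range(n_iter):
                # w* = E{v (w^T v)^3} - 3 w   (valid for sphered data)
                w_new = (v * (w @ v) ** 3).mean(axis=1) - 3 * w
                # remove projections onto the components already found
                w_new -= B[:c].T @ (B[:c] @ w_new)
                w_new /= np.linalg.norm(w_new)
                done = abs(abs(w_new @ w) - 1) < tol  # converged up to sign
                w = w_new
                if done:
                    break
            B[c] = w
        return B

Applied to whitened data v, the estimated source signals are then obtained as s = B v, in the sense of Eq. 2.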

3 Brain wave studies

The challenges presented to the signal processing community by electro- and magnetoencephalographic recordings from the human brain may be divided into two classes: one dealing with the identification and removal of artifacts from the recordings, and another with the understanding of the brain signals themselves (see Table 1). The amplitude of the artifactual disturbances may well exceed that of the brain signals, making the analysis of brain activity a very hard process. Moreover, artifacts may present a strong resemblance to some physiological brain responses, leading to an erroneous interpretation of the recording [9].

Table 1. Some signal processing problems encountered in EEG and MEG studies.

Artifacts                    | Brain signals
-----------------------------|-------------------------------------------------------------
Ocular artifacts             | Evoked responses (e.g. auditory, somatosensory, visual, ...)
Myographic activity          | Spontaneous rhythmic activity
Externally induced artifacts | Abnormal brain behavior (e.g. epileptic seizures, infarction, ...)

Typical artifacts, present in most EEG and MEG measurements, include eye and muscle activity; the heart's electrical activity, captured at the lowest sensors of a whole-scalp magnetometer array; and externally induced artifacts. The relevance of identifying such artifacts can be seen in the analysis of the cardiac QRS complex, followed by the repolarising T wave, which may be misinterpreted as the spike and slow wave associated with some epileptic seizures.
As for the analysis of the functioning of the human brain, it is common to use event-related activity as an entry level to this study. This activity is time-locked to a particular stimulus, which may be of auditory, somatosensory or visual type [16]. Brain responses to the stimulation present minimal inter-individual differences for a particular set of stimulus parameters. In order to understand the physiological origins of the event-related activity, it may be desirable to decompose the complex brain response into simpler elements, which would be easier to model, and to localize their neural sources. In addition, the separation of multi-modal responses to complex stimuli may represent a hard task for conventional methods, but it is surely of capital importance, due to the diversity of stimuli in the perception of the real world.

3.1 Experimental setup


The magnetoencephalographic data used in the experiments reported in this paper were collected in a magnetically shielded room with a 122-channel whole-scalp Neuromag-122(TM) neuromagnetometer. This device collects data at 61 locations over the scalp, using orthogonal figure-of-eight pick-up coils that couple strongly to a local dipolar source just underneath the sensor [8].

3.2 Handling artifacts

When performing EEG or MEG measurements, physicians often have to deal with considerable amounts of artifacts, which may render the extraction of valuable information impossible. The simplest, and probably most commonly used, artifact correction method is rejection: through visualization of the recorded signals, the portions of the recordings corresponding to high levels of disturbance are simply discarded [3, 21]. This method may lead to a significant loss of information from the data, as well as leave the remaining data unrepresentative of the study. Furthermore, some artifacts of smaller amplitude may remain in the data, leading to an erroneous appreciation of the remaining signals.
Other methods, often based on mathematical models of the sources of the
artifacts, have been introduced to lessen the effects of these undesirable signals [5,
20] (see [22] for a review of some of these methods).
The approach presented in this paper, initially reported in [22, 23], assumes
that the brain and artifact activities are anatomically and physiologically sepa-
rate processes, and that their independence is reflected in the statistical relation
between the electromagnetic signals generated by those processes.

Fig. 1. A sample of the 122-channel MEG recordings, showing artifacts produced by saccades, blinking, muscle activity and the cardiac cycle. For each of the 6 positions shown, the two orthogonal directions of the sensors are plotted.

In this experiment, the measured subject was asked to bite his teeth, to move his eyes horizontally, and to blink. This activity ensured the presence of strong eye and myographic artifacts. In order to augment the number of possible artifacts, a digital watch was placed inside the shielded room [23]. In Fig. 1 a sample of the MEG signals is depicted, showing clear periods of eye and muscle activity. Both the watch and the cardiac cycle can be guessed from some of the sensor signals. Vertical and horizontal electro-oculograms (VEOG and HEOG) and an electrocardiogram (ECG) were recorded simultaneously with the MEG, in order to guide and ease the identification of the independent components.

Figure 2 presents six independent components found in the recorded data. The first two ICs, with a broad band spectrum, are clearly due to the muscular activity originating from the biting. Their separation into two components seems to correspond, on the basis of the field patterns, to two different sets of muscles that were activated during the process. IC3 and IC5 show, respectively, the horizontal eye movements and the blinks. IC4 represents the cardiac artifact, which is very clearly extracted. In IC6 the digital watch is completely isolated.

Fig. 2. Six independent components extracted from the MEG data. For each component the left, back and right views of the field patterns are shown; full lines stand for magnetic flux coming out of the head, and dotted lines for the flux going inwards.

3.3 Electric and magnetic evoked responses


To show the use of ICA in event-related studies, we used simultaneous somatosensory and auditory stimulation. The vibrotactile stimuli, generated by a bass-reflex loudspeaker, were conveyed to the subject via a balloon, coupled to the loudspeaker through a non-magnetic tube. Both the vibrotactile and the concomitant auditory stimuli were thus elicited in the same experiment [12, 24]. Figure 3 presents the averaged responses to the vibrotactile stimuli, and the inserts zoom in on a subset of the channels with the strongest responses in amplitude.

Fig. 3. Recorded MEG in a simultaneous somatosensory and auditory event-related study. A sample of the 122 averages is enlarged on the right side. Each tick represents 100 ms.

The latencies of the two different evoked responses are clear in some MEG channels (compare e.g. MEG58 with MEG61, over the auditory and somatosensory primary cortices, respectively). Nevertheless, in most of the recorded signals this separation is far from accomplished. Figure 4 a) shows the results obtained by PCA, where we may see that the confusion has not been resolved. In b) we see

Fig. 4. Principal a) and independent b) components of the data. Field patterns corresponding to the first two independent components in c). In d) the superposition of the localizations of the dipoles originating IC1 (black circles, corresponding to the auditory cortex activation) and IC2 (white circles, corresponding to the SI cortex activation) onto magnetic resonance images (MRI) of the subject. The bars illustrate the orientation of the source net current.

the auditory and somatosensory responses clearly separated in the first two independent components. The corresponding field patterns c), together with the superimposition of the localizations of the sources on MRI slices d), allow us to conclude that there is a satisfactory agreement between the ICs and the conventional locations for this type of brain response.
A final experiment, using only averaged auditory evoked fields, illustrated the decomposition capabilities of ICA in such setups. The stimuli consisted of 200 tone bursts presented to the subject's right ear, with a 1 s interstimulus interval. These bursts had a duration of 100 ms and a frequency of 1 kHz [25].

Fig. 5. Principal a) and independent b) components found in the auditory evoked field study. Each tick in a) and b) corresponds to 100 ms, going from 100 ms before stimulation onset to 500 ms after. In c) and d) the four ICs are plotted, after scaling, against one left and one right original MEG signal.

As in the previous experiment, we can see from Fig. 5 a) and b) that PCA is unable to resolve the complex brain response, whereas the new ICA technique produces cleaner and sparser responses. From frames c) and d) it is visible that IC1 and IC2 correspond to responses typically labeled N1m, with the characteristic latency of around 100 ms after the onset of the stimulation. Another component, IC4, exhibiting a longer latency (around 180 ms), fully explains the later responses in the contra-lateral brain hemisphere (see [25] for the field patterns associated with these ICs).

4 Discussion

In this paper we have seen how to apply the recently developed statistical technique of independent component analysis to the processing of biomagnetic brain recordings. In particular, we have seen that it is very well suited for extracting different types of artifacts from EEG and MEG data, even in situations where the magnitude of these disturbances is lower than that of the background brain activity.

We often employ more than one sensing modality to perceive the world. ICA has been shown to be able to differentiate between somatosensory and auditory brain responses in the case of a complex vibrotactile stimulation. The results obtained augur the appearance of new sets of effective modality-sensitive applications and studies. In addition to these findings, the experiment showed as well that the independent components, found with no other modeling assumption than the independence of the sources, exhibit field patterns that agree with conventional dipolar source modeling. In fact, when we adopted that model, the localization of the equivalent source dipole of the independent sources fell on the expected brain regions for the particular stimulus.
Finally, in addition to the above result, the application of ICA to averaged auditory evoked responses isolates the main response, with a latency of about 100 ms, from subsequent components. Furthermore, it discriminated between the ipsi- and contralateral main responses of the brain. These decompositions may lead to an increase in the understanding of the functioning of the human brain, as a finer mapping of the brain's responses may be achieved.

Acknowledgment
The authors thank Professor Riitta Hari and Dr. Veikko Jousmäki, from the Brain Research Unit of Helsinki University of Technology, for the MEG data, and for very valuable discussions on the results reported in this paper. We express as well our gratitude to Mr. Jaakko Särelä for his help in some of the experiments.

References
1. FastICA MATLAB package. Available at the WWW address: http://www.cis.hut.fi/projects/ica/fastica.
2. S. Amari. Blind source separation - mathematical foundations. In S. Amari and N. Kasabov, editors, Brain-like Computing and Intelligent Information Systems, pages 153-166. Springer, Singapore, 1997.
3. J. S. Barlow. Computerized clinical electroencephalography in perspective. IEEE Trans. Biomed. Eng., 26:377-391, 1979.
4. A. Bell and T. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.
5. P. Berg and M. Scherg. A multiple source approach to the correction of eye artifacts. Electroenceph. clin. Neurophysiol., 90:229-241, 1994.
6. A. Cichocki and R. Unbehauen. Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Trans. on Circuits and Systems, 43(11):894-906, 1996.
7. P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287-314, 1994.
8. M. Hämäläinen, R. Hari, R. Ilmoniemi, J. Knuutila, and O. V. Lounasmaa. Magnetoencephalography - theory, instrumentation, and applications to noninvasive studies of the working human brain. Reviews of Modern Physics, 65(2):413-497, April 1993.
9. R. Hari. Magnetoencephalography as a tool of clinical neurophysiology. In E. Niedermeyer and F. L. da Silva, editors, Electroencephalography. Basic principles, clinical applications, and related fields, pages 1035-1061. Baltimore: Williams & Wilkins, 1993.
10. A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9:1483-1492, 1997.
11. A. Hyvärinen and E. Oja. Independent component analysis by general non-linear Hebbian-like learning rules. Signal Processing, 64(3):301-313, 1998.
12. V. Jousmäki and R. Hari. Somatosensory evoked fields to large-area vibrotactile stimuli. Electroenceph. clin. Neurophysiol., 1998. Submitted.
13. T.-P. Jung, C. Humphries, T.-W. Lee, S. Makeig, M. J. McKeown, V. Iragui, and T. Sejnowski. Extended ICA removes artifacts from electroencephalographic recordings. In Neural Information Processing Systems 10 (Proc. NIPS'97). MIT Press, 1998.
14. C. Jutten and J. Herault. Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.
15. J. Karhunen, A. Cichocki, W. Kasprzak, and P. Pajunen. On neural blind separation with noise suppression and redundancy reduction. Int. J. Neural Systems, 8(2):219-237, 1997.
16. F. Lopes da Silva. Event-related potentials: Methodology and quantification. In E. Niedermeyer and F. Lopes da Silva, editors, Electroencephalography. Basic principles, clinical applications, and related fields, pages 877-886. Baltimore: Williams & Wilkins, 1993.
17. S. Makeig, T.-P. Jung, A. Bell, D. Ghahremani, and T. Sejnowski. Blind separation of auditory event-related brain responses into independent components. Proc. Natl. Acad. Sci. USA, 94:10979-10984, 1997.
18. E. Oja, J. Karhunen, A. Hyvärinen, R. Vigário, and J. Hurri. Neural independent component analysis - approaches and applications. In S. Amari and N. Kasabov, editors, Brain-like Computing and Intelligent Information Systems, pages 167-188. Springer, Singapore, 1997.
19. J. Särelä, R. Vigário, V. Jousmäki, R. Hari, and E. Oja. ICA for the extraction of auditory evoked fields. In 4th International Conference on Functional Mapping of the Human Brain (HBM'98), Montreal, Canada, 1998.
20. M. A. Uusitalo and R. J. Ilmoniemi. The signal-space projection (SSP) method for separating MEG or EEG into components. Medical & Biological Engineering & Computing, 35:135-140, 1997.
21. R. Verleger. Valid identification of blink artifacts: are they larger than 50 µV in EEG records? Electroenceph. clin. Neurophysiol., 87:354-363, 1993.
22. R. Vigário. Extraction of ocular artifacts from EEG using independent component analysis. Electroenceph. clin. Neurophysiol., 103:395-404, 1997.
23. R. Vigário, V. Jousmäki, M. Hämäläinen, R. Hari, and E. Oja. Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Neural Information Processing Systems 10 (Proc. NIPS'97), Cambridge MA, 1998. MIT Press.
24. R. Vigário, J. Särelä, V. Jousmäki, and E. Oja. Independent component analysis in decomposition of auditory and somatosensory evoked fields. In Proc. Int. Workshop on Independent Component Analysis and Blind Separation of Signals (ICA'99), Aussois, France, January 1999.
25. R. Vigário, J. Särelä, and E. Oja. Independent component analysis in wave decomposition of auditory evoked fields. In Proc. Int. Conf. on Artificial Neural Networks (ICANN'98), Skövde, Sweden, September 1998.
EEG-based Brain-Computer Interface Using
Subject-Specific Spatial Filters

G. Pfurtscheller, C. Guger, H. Ramoser

Department of Medical Informatics, Institute of Biomedical Engineering
and
Ludwig Boltzmann Institute of Medical Informatics and Neuroinformatics,
University of Technology, Graz, Austria
Inffeldgasse 16a, 8010 Graz
e-mail: pfu@dpmi.tu-graz.ac.at
Telephone: +43-316-873-5300
Fax: +43-316-812964

Key Words: Brain-Computer Interface (BCI), single-trial EEG classification, common spatial filter, motor imagery, event-related desynchronization

Abstract. Sensorimotor EEG rhythms are affected by motor imagery and can, therefore, be used as input signals for an EEG-based brain-computer interface (BCI). Satisfactory classification rates of imagery-related EEG patterns can be achieved when multiple EEG recordings and the method of common spatial patterns are used for parameter estimation. Data from 3 BCI experiments with and without feedback are reported.

1 Motor imagery and brain waves

Sensorimotor EEG rhythms such as mu and central beta rhythms display an event-
related desynchronization (ERD) not only with execution of hand movement but also with
imagination of the same or a similar type of movement [8]. Imagination of right and left
hand movement can therefore be used as a mental strategy to realize an EEG-based brain
computer interface (BCI) [10]. Examples of high-resolution ERD maps based on a
realistic head model obtained from magnetic resonance imaging (MRI) during left and
right hand movement imagery are displayed in Fig. 1. It can be seen that the ERD is
circumscribed and localized over the contralateral sensorimotor hand area.

Fig. 1. ERD maps calculated for a realistic head model during imagination of left and right hand
movement. The ERD focus is indicated by dense "isopotential" lines

Although the imagery-related ERD forms a focus close to the hand representation area,
one or two EEG signals recorded either from one or both hemispheres are insufficient to
describe the state of brain activation during motor imagery. Therefore, it is understandable
that the BCI system, using either 1 or 2 EEG channels for parameter estimation and
control of cursor movement in 2 directions (e.g. cursor up and down), can achieve only an
accuracy of 80-90% after about 10 sessions [5,7,10]. It can be expected that the analysis
and classification of a large number of EEG signals recorded over sensorimotor areas may
improve the classification accuracy of a BCI.
It was shown recently by off-line analysis of 56-channel EEG data from a motor
imagery experiment that EEG patterns during left and right motor imagery could be
discriminated in 3 healthy subjects with an accuracy of 90.8%, 92.7% and 99.7%,
respectively [11]. For this discrimination the common spatial pattern (CSP) method was
used [3,6]. With this CSP-method variance-related feature vectors from 2 populations of
EEG patterns are extracted and used for classification. It is therefore of interest, whether
the CSP-method can be used for on-line BCI sessions with continuous feedback [7] and
what classification accuracy can be achieved after e.g. only 3 days of training.

2 Common Spatial Filter

The CSP method leads to new time series that are optimal for discriminating 2 populations of EEG patterns related to right and left motor imagery. The method is based on the simultaneous diagonalization of 2 covariance matrices [1]. The imagery-related EEG pattern (E) recorded from m electrodes is multiplied by a mapping matrix W. The first two and last two rows (time series) of the resulting matrix Z (Z = WE) are best suited to discriminate the 2 populations of EEG patterns and are used to construct the weight vector for the classifier. The components (features) used for classification are the logarithms of the normalized variances of the time series obtained by spatial filtering (for details see [6]).
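
A sketch of this computation in Python/NumPy follows; the function names, the trace normalization of the trial covariances and the use of a plain eigendecomposition are our own assumptions about implementation details not spelled out here.

    import numpy as np

    def csp_filters(trials_a, trials_b):
        # Common spatial patterns via simultaneous diagonalization of the two
        # class covariance matrices. trials_a and trials_b are lists of
        # (m, n_samples) EEG trials, one list per imagery class. Returns the
        # mapping matrix W; the first and last rows of Z = W E carry the most
        # discriminative time series.
        def mean_cov(trials):
            covs = [np.cov(e) for e in trials]
            return sum(c / np.trace(c) for c in covs) / len(covs)

        Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
        evals, U = np.linalg.eigh(Ca + Cb)          # whiten composite covariance
        evals = np.clip(evals, 1e-12, None)
        P = np.diag(evals ** -0.5) @ U.T
        # eigenvectors of the whitened class-a covariance diagonalize both classes
        _, Psi = np.linalg.eigh(P @ Ca @ P.T)
        return Psi.T @ P

    def csp_features(W, trial, n_keep=2):
        # Log of the normalized variances of the first/last n_keep CSP series.
        Z = W @ trial
        rows = np.r_[Z[:n_keep], Z[-n_keep:]]
        var = rows.var(axis=1)
        return np.log(var / var.sum())

The first and last rows of W maximize the variance ratio between the two imagery classes, which is why the first two and last two time series of Z are the ones used to build the classifier's feature vector.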

3 On-line EEG Classification

Three students, all experienced with the BCI, participated in the experiment (subjects g3, g7, i2). Each student imagined 80 left and 80 right hand movements per session, whereby the side of imagination was indicated by an arrow on a monitor pointing either to the left or to the right (for details see [2,10]). The experimental paradigm is shown in Fig. 2.

Fig. 2. Experimental paradigm for EEG data collection during motor imagery without feedback.

EEG was recorded from 27 electrodes closely spaced over left and right sensorimotor
areas. Amplified EEG signals filtered between 8-30 Hz, sampled at 128 Hz and cleared of
artifacts were used for calculating subject-specific common spatial filters and weight
vectors. All sessions with and without feedback were performed within only 3 days. A
typical example for one subject (g7) is given in Fig. 3.
Feedback (FB) was given in the form of the outline of a rectangle. Immediately after the arrow (cue) disappeared, the feedback stimulus appeared in the center of the screen and began to extend horizontally toward the right or left side. The subject's task was to extend this feedback bar toward the left or right boundary of the screen, depending on the direction of the arrow (cue stimulus; see also Fig. 2) presented before. During a 3.75-second period the bar moved to the right or left side of the screen according to the results of the on-line analysis (linear distance function as described before).

Fig. 3. Flowchart of 6 BCI sessions with and without feedback for subject g7 within 3 days.
Altogether 3 CSP filters and 4 weight vectors (WV) were calculated.

As an example, the procedure used with subject g7 is described in detail. The experiment was started (1st day) without FB in session 1. The subject imagined 80 right and 80 left hand movements according to the paradigm shown in Fig. 2. From these 27-channel EEG data a first common spatial filter (CSP1) and a first weight vector (WV1) were calculated. CSP1 and WV1 were used to classify the EEG data on-line in the following session 2 with FB on the next day (2nd day). As a result of the classification between the 2 imagination classes, no discrimination (accuracy about 50%) was achieved. Therefore, on the same day (2nd day) another session (session 3 in Fig. 3) without FB was performed and a new weight vector (WV2) was calculated with CSP1. Using CSP1 and WV2 in session 4 with FB, a classification accuracy of 68% was obtained. Repeating the update procedure twice (see Fig. 3), in session 6 with FB on the 3rd day a classification accuracy of 94% was achieved. The time courses of the on-line classification for all subjects are displayed in Fig. 4.

[Figure 4: three panels, one per subject (g3, g7, i2), plotting the classification error against the classification time point in seconds.]

Fig. 4. Time courses of the on-line classification error over a period of 6 seconds, starting 1 second before visual cue presentation (from second 3 to 4.25). Summarized data of all 3 subjects are shown. Subjects g7 and g3 participated in 4 and subject i2 in 5 sessions with FB. Instead of the classification accuracy, the error rate (100% minus the classification accuracy) is displayed.

Subject i2 started, similar to subject g7, without any classification power (50%
classification accuracy) in session 2. After calculation of 3 spatial filters and 5 weight
vectors a classification accuracy of 96% was achieved in session 7 with FB.
In contrast to subjects g7 and i2, subject g3 started in the first FB-session with an
accuracy close to 80%. In the last FB-session the classification rate was 98%.

4 Conclusion

The application of subject-specific spatial filters is a suitable method for on-line classification of multichannel EEG data recorded during imagination of hand movement. It is important that not only the spatial filters but also the classifier are updated in the course of the sessions. For example, the reason for the 50% accuracy in session 2 with FB in subject g7 was a biased classification, meaning that the feedback bar on the monitor was always moving in one direction. After calculation of a new classifier (weight vector) the accuracy increased from 50% to 68%.
The CSP method is sensitive to electrode positions. Therefore, it is recommended to use the same electrode montage for the calculation of the spatial filters, for the setup of the classifier and for the next FB session. In this sense it has to be remembered that the ERD pattern of sensorimotor rhythms can be completely different at 2 electrode positions over the hand representation area when the electrode distance over the scalp is smaller than 2.5 cm [9]. A further disadvantage is the large number of electrodes needed for the CSP method. The problem of a large number of electrodes and of their precise positioning is solved, however, when implanted electrode arrays are used for recording sensorimotor rhythms. Such electrode arrays will be available in the near future for BCI applications in patients with severe motor disabilities [4].
Most importantly, it was shown for the first time that a high classification rate can be achieved within only 3 days of training when multichannel EEG data are used in connection with a BCI.

Acknowledgements

This research was supported by the "Fonds zur Förderung der wissenschaftlichen Forschung" project P11208MED, the "Steiermärkische Landesregierung" and the "Allgemeine Unfallversicherungsanstalt, AUVA" in Austria.

References

1. Fukunaga, K.: Introduction to statistical pattern recognition, Academic Press, (1972)
2. Guger, C., Schloegl, A., Walterspacher, D., Pfurtscheller, G.: Design of an EEG-based Brain-Computer Interface (BCI) from standard components running in real-time under Windows. Biomed. Technik, (1999) in press.
3. Koles, Z.J.: The quantitative extraction and topographic mapping of the abnormal components in the clinical EEG. Electroenceph. Clin. Neurophysiol. 79 (1991) 440-447.
4. Maynard, E.M., Nordhausen, C.T., Normann, R.A.: The Utah intracortical electrode array: a recording structure for potential brain-computer interfaces. Electroenceph. Clin. Neurophysiol. 102 (1997) 228-239.
5. McFarland, D.J., McCane, L.M., Wolpaw, J.R.: EEG-based communication and control: short-term role of feedback. IEEE Trans. Rehab. Engng., 6 (1998) 7-11.
6. Müller-Gerking, J., Pfurtscheller, G., Flyvbjerg, H.: Designing optimal spatial filters for single-trial EEG classification in a movement task. Electroenceph. Clin. Neurophysiol. (1999) in press.
7. Neuper, C., Schlögl, A., Pfurtscheller, G.: Enhancement of left-right sensorimotor EEG differences during feedback-regulated motor imagery. J. Clin. Neurophysiol. (1999) in press.
8. Pfurtscheller, G., Neuper, C.: Motor imagery activates primary sensorimotor area in humans. Neuroscience Letters, 239 (1997) 65-68.
9. Pfurtscheller, G., Neuper, C., Berger, J.: Source localization using event-related desynchronization (ERD) within the alpha band. Brain Topography, 6/4 (1994) 269-275.
10. Pfurtscheller, G., Neuper, Ch., Flotzinger, D., Pregenzer, M.: EEG-based discrimination between imagination of right and left hand movement. Electroenceph. clin. Neurophysiol. 103 (1997) 642-651.
11. Ramoser, H., Müller-Gerking, J., Pfurtscheller, G.: Optimal spatial filtering of single-trial EEG during imagined hand movements, IEEE Trans. Rehab. Engng. (1999) submitted.
Multi-neural Network Approach for
Classification of Brainstem Evoked
Response Auditory
Anne-Sophie DUJARDIN*, Véronique AMARGER*, Kurosh MADANI*, Olivier ADAM**, Jean-François MOTSCH**

* Université Paris XII - Val de Marne, I.U.T. de Sénart-Fontainebleau
Laboratoire d'Etude et de Recherche en Instrumentation Signaux et Systèmes
Division de Recherche Réseaux Neuronaux
Avenue Pierre POINT, F-77127 LIEUSAINT, FRANCE
Phone: +331 64 13 46 85 Fax: +331 64 13 45 07

** Université PARIS XII - Val de Marne
Laboratoire d'Etude et de Recherche en Instrumentation Signaux et Systèmes
Division Traitement du Signal et Instrumentation Médicale
61, av du Général de Gaulle, F-94010 CRETEIL Cedex, FRANCE
Phone: +331 45 17 14 93 Fax: +331 45 17 14 92

E-Mail: {bellanger, amarger, madani, adam, motsch}@univ-paris12.fr

ABSTRACT

For about twenty years, functional exploration in otoneurology has had experimental techniques to objectively analyze the state of nervous conduction along the auditory pathway: the brainstem evoked response auditory (BERA). In this paper, we present a new classification approach based on a hybrid neural network technique, focusing on this biomedical application with the aim of developing a diagnostic tool. We have used two models of artificial neural networks: Learning Vector Quantization and Radial Basis Function ones. In our approach, these two neural networks are used to achieve the classification in a serial multi-neural network configuration. A case study and experimental results are reported and discussed.

Keywords: Multi-Neural Network, Brainstem Evoked Response Auditory, Classification, Learning Vector Quantization, Radial Basis Function.

1. INTRODUCTION

Artificial Neural Networks are information processing systems which allow the elaboration of many original techniques covering a large field of applications. Among their most appealing properties, we can quote their ability to learn and generalize and, for some of them, their ability to classify. On the other hand, classification problems cover a large domain of applications such as signal processing, image processing, biomedical diagnosis, etc. The problem of non-linear classification, or of classification with an incomplete database or with a database presenting a high degree of resemblance between classes, is difficult. Over the past decades, new approaches based on Artificial Neural Networks have been proposed to solve this class of problems [1]. Some studies have addressed the classification of electrical signals in the field of test and diagnosis of analog circuits [2][3][4][5]. ANN techniques have also performed well for classification tasks in the biomedical field [6][7].

In this paper, we propose an original approach to the classification of electrical signals which come from a medical test: the Brainstem Evoked Response Auditory (BERA). Indeed, functional exploration in otoneurology has experimental techniques to objectively analyze the state of nervous conduction along the auditory pathway. The classification of BERA signals is a first step in the development of a diagnostic tool assisting the medical expert. The classification of these signals presents some problems, because of the difficulty of distinguishing one class of signals from the others. The results can differ between test sessions for the same patient. Considering these difficulties, we have developed a serial multi-neural network approach that involves both Learning Vector Quantization (LVQ) [8][9] and Radial Basis Function (RBF) [10] ANNs. These two models of ANNs are particularly well adapted to classification tasks.

If it is admitted that techniques based on a single neural network show a number of attractive features for solving problems for which classical solutions have been of limited use, it is also admitted that a flat neural structure does not represent the most appropriate way to approach "intelligent behavior". The approach we propose uses a multi-neural network (MNN) architecture.

This paper is structured as follows. In the next section, we present the BERA signals. Then we expose the approach based on the Multi-Neural Network (MNN) structure. In section 4, we present the classification results obtained on a database of 213 Brainstem Evoked Response Auditory waveforms. A comparison study with single RBF and LVQ ANNs has been made. Finally, we conclude and give the prospects that follow from our work.

2. BRAINSTEM EVOKED RESPONSE AUDITORY (BERA)

When a sense organ is stimulated, it generates a string of complex neurophysiological processes. Evoked potentials are electrical responses caused by the brief stimulation of a sense system. The stimulus gives rise to a string of action potentials that can be recorded along the nerve's course, or at a distance from the activated structures.

Brainstem Evoked Response Auditory (BERA) is generated as follows: the patient hears clicking noises or tone bursts through earphones. The use of auditory stimuli evokes an electrical response. In fact, the stimulus triggers a number of neurophysiological responses along the auditory pathway. An action potential is conducted along the eighth nerve, the brainstem and finally to the brain. A short time after the initial stimulation, the signal evokes a response in the area of the brain where sounds are interpreted. These response signals have small amplitudes, and so they are frequently masked by the background noise of electrical activity in the brain. However, the average of this noise equals zero; the response is therefore obtained by extracting it from the noise by the principle of averaging. The response waveform consists of a series of five peaks numbered with Roman numerals (waves I to V). Figure I (extracted from [11]) represents a perfect BERA. This test provides an effective measure of the integrity of the auditory pathway up to the upper brainstem level.
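
The averaging principle invoked here is simple enough to state in code. The sketch below (Python/NumPy, with hypothetical names) averages stimulus-locked sweeps; since the background EEG noise is zero mean, averaging n sweeps attenuates it by roughly a factor of sqrt(n) while the time-locked BERA waveform is preserved.

    import numpy as np

    def average_response(sweeps):
        # sweeps: array of shape (n_sweeps, n_samples), one row per stimulus
        # repetition, each aligned on the stimulus onset.
        return np.asarray(sweeps).mean(axis=0)

    # e.g., the 800-acquisition average described in the text:
    # bera = average_response(acquisitions)    # acquisitions: (800, n_samples)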

A technique of extraction, presented in [11], allows, following 800 acquisitions such as described above, the visualization of BERA estimates computed on averages of 16 acquisitions. Thus, a surface of 50 estimates, called the Temporal Dynamics of the Cerebral trunk (TDC), can be visualized. The software developed for the acquisition and the processing of the signal is called ELAUDY. It provides the average signal, which corresponds to the average of the 800 acquisitions, and the TDC surface. Figure II (extracted from [11]) shows two typical surfaces, one for a patient with normal audition (II-A) and the other for a patient who suffers from an auditory disorder (II-B). This figure shows the large variety of BERA signals, even for a single patient. Moreover, this software automatically determines, from the average signal, the five significant peaks and gives the latency of these waves. It also allows us to record a file for each patient, which contains administrative information (address, age, ...), the results of auditory tests and the doctor's conclusions (pathology, cause, confidence index of the pathology, ...).

Figure I - Perfect BERA. Figure II - TDC surfaces (A - normal patient; B - patient with an auditory disorder).

BERA signals and the TDC technique are important for diagnosing auditory pathologies. However, medical experts still have to visually inspect all auditory test results before making a diagnosis.

Today, taking into account the progress accomplished in the area of intelligent computation and artificial intelligence, it becomes conceivable to develop a diagnostic tool assisting the medical expert. One of the first steps in the development of such a tool is the classification of BERA signals.

3. MULTI-NEURAL NETWORK BASED APPROACH

The approach we propose to solve the posed problem is based on the Multi-Neural Network (MNN) concept. A MNN could be seen as a neural structure including a set of similar neural networks (homogeneous MNN architecture) or a set of different neural nets (heterogeneous MNN architecture). On the other hand, both of the above-mentioned architectures (homogeneous and heterogeneous MNNs) could be organized in different manners. From a general point of view, three topologies [11] could characterize the MNN's organization:
- parallel organization: in this case, the ANNs are not inter-connected. The MNN input is dispatched to all neural networks composing the structure.
- serial organization: in this case, the output of a given ANN composing the structure is the input of the following ANN.
- serial/parallel organization: which combines the two above-mentioned structures.

The problem (application) on which our efforts have been focused concerns the classification of signals (signatures), where the signals to be classified may show a high degree of resemblance. In such a class of problems, a very fine separation has to be performed in the feature space (parameter space). The use of single neural structures could thus lead, on the one hand, to a large number of neurons, and on the other hand, to a long learning process, especially when the application has real-time execution constraints. In our case, the BERA signal classification is intended to be used as part of the processing in a computer-aided medical diagnosis tool, and so the execution time constraint has to be taken into account.

As mentioned and shown in the previous sections, the main difficulty in the classification of BERA signals is related, on the one hand, to the large variety of such signals for the same diagnosis result (the variation panel of the corresponding BERA signals can be very large), and on the other hand, to the close resemblance between such signals for two different diagnosis results. A serial homogeneous MNN is equivalent to a single neural structure with a greater number of layers with different neuron activation functions, so the use of a homogeneous MNN with a serial organization is of little interest here. In the parallel homogeneous MNN configuration, each neural net operates as an "expert" (learning a specific characteristic of the feature space). The interest of the parallel homogeneous MNN therefore appears when a decision stage, processing the results pointed out by the set of such "experts", is associated with the MNN structure. In this case, the structure becomes a serial/parallel MNN, needing an optimization procedure to determine the number of neural nets to be used.

We propose an intermediate solution: a two-stage serial heterogeneous MNN structure combining an RBF based classifier (operating as the first processing stage) with an LVQ based decision-classification stage. Figure III represents the proposed serial heterogeneous MNN based architecture.

The RBF model we use is not a weighted RBF model but a standard one, and so it performs the feature space mapping associating a set of "categories" (in our case, a category corresponds to a possible pathological class) to a set of "areas" of the feature space. The LVQ neural model belongs to the class of competitive neural network structures. It includes one hidden layer, called the competitive layer. Even though the LVQ model has essentially been used for classification tasks, the competitive nature of its learning strategy (based on a "winner takes all" strategy) makes it usable as a decision-classification operator. On the other hand, the weighted nature of the transfer functions between the input layer and the hidden one, and between the hidden layer and the output one, gives this model a non-linear approximation capability, making such a neural net a function "approximation operator".

Taking into account the above analysis, the proposed serial MNN structure can be seen as a structure associating a neural decision operator with a neural classifier. Moreover, the proposed structure can also be seen as a global neural structure with two hidden layers. The association of the two neural models thus improves the global order of the non-linear approximation capability of the resulting global neural operator, compared to each single neural structure (here RBF or LVQ) constituting the MNN system. This technique allows us to fill the gap left by the RBF ANN, and thus to refine the classification.

Figure III - Proposed serial Multi-Neural Network based structure (an RBF stage followed by an LVQ stage).
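
The serial composition of Figure III can be sketched as follows (Python/NumPy). This is only a minimal illustration of the RBF-then-LVQ idea: the random center selection, the Gaussian widths, the least-squares read-out and the LVQ1 updates below are generic textbook choices, not the exact models or training procedures used by the authors.

    import numpy as np

    class SerialRbfLvq:
        # RBF stage mapping an input signal to one activation per class,
        # followed by an LVQ stage acting as a competitive decision operator
        # on those activations.

        def __init__(self, n_centers=20, n_prototypes=8, sigma=1.0):
            self.n_centers, self.n_protos, self.sigma = n_centers, n_prototypes, sigma

        def _rbf(self, X):
            d2 = ((X[:, None, :] - self.centers[None]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * self.sigma ** 2))

        def fit(self, X, y, n_classes=3, epochs=30, lr=0.05, seed=0):
            # X: (N, n_features) signal matrix; y: integer labels (NumPy array).
            rng = np.random.default_rng(seed)
            # RBF stage: centers drawn from the training set, linear read-out
            self.centers = X[rng.choice(len(X), self.n_centers, replace=False)]
            H = self._rbf(X)
            T = np.eye(n_classes)[y]                   # one-hot class targets
            self.Wout, *_ = np.linalg.lstsq(H, T, rcond=None)
            A = H @ self.Wout                          # RBF outputs feed the LVQ
            # LVQ stage: prototypes in the class-activation space, LVQ1 updates
            idx = rng.choice(len(A), self.n_protos)
            self.protos, self.labels = A[idx].copy(), y[idx].copy()
            for _ in range(epochs):
                for a, t in zip(A, y):
                    w = np.argmin(((self.protos - a) ** 2).sum(1))  # winner
                    sign = 1.0 if self.labels[w] == t else -1.0
                    self.protos[w] += sign * lr * (a - self.protos[w])
            return self

        def predict(self, X):
            A = self._rbf(X) @ self.Wout
            d = ((A[:, None, :] - self.protos[None]) ** 2).sum(-1)
            return self.labels[np.argmin(d, axis=1)]

With n_centers=20 and n_prototypes=8, the sketch matches the hidden-layer sizes reported in section 4.B below.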

4. CASE STUDY AND EXPERIMENTAL RESULTS

4.A Experimental environment and database construction

In order to carry out this work, we have at our disposal a database which contains the BERA signals and the associated pathologies. This database contains the files of 11 185 patients. We decided to work on the average signal presented in section 2 and we chose three categories of patients according to the type of their auditory disorder:
- normal: these patients have normal audition
- endocochlear: these patients suffer from disorders concerning the part of the ear before the cochlea
- retrocochlear: these patients suffer from disorders concerning the part of the ear at the level of the cochlea and after the cochlea, like acoustic neuroma.

We selected 213 signals: 92 belong to the normal class, 83 to the endocochlear class and 38 to the retrocochlear class. Generally speaking, for a patient who has normal audition, the result of the TDC test is a regular surface. The waves are well synchronized, and stay stable in latency, amplitude and form. The results of the TDC test for patients who suffer from an endocochlear disorder are much the same as for normal ones. The latencies stay normal, and the morphology of the TDC surface is not altered. One parameter allows the medical expert to conclude that there is an endocochlear disorder: the auditory level. Finally, the retrocochlear disorders are characterized by an extension of the latencies and non-synchronized waves, except for wave V, which can appear well synchronized with amplitude modulation.

But, in reality, it is not so easy to distinguish one class from the others. The BERA signals can be different for different test sessions for the same patient, because they depend on the relaxation of the person, the background, the test conditions, the signal-to-noise ratio, and so on. The aim of this case study is to classify these signals using the MNN technique presented in the above section.

The aim of classification by ANNs is to link a set of input vectors to a set of specific output vectors. In our case, the components of the input vectors are the samples of the BERA average signals and the output vectors correspond to the different classes. The signals corresponding to:
- a retrocochlear disorder are associated with class 1,
- an endocochlear disorder are associated with class 2,
- a normal case are associated with class 3.

In order to build our training database, we chose signals that come from patients whose pathology is given as certain. All BERA signals come from the same experimental system. After the learning phase, if signal vectors that have not been learned are presented to the ANN, the corresponding class (type of disorder) must be designated.

4.B Results relative to the RBF-LVQ based multi-neural network approach

We used the RBF-LVQ based serial heterogeneous MNN described in the previous section (Figure III).
Concerning the RBF ANN, the number of input neurons (88) corresponds to the number of components of the input vectors. The output layer contains 3 neurons. The number of neurons of the hidden layer (in this case, 20 neurons) has been determined by learning.
For the LVQ ANN, the number of input cells is equal to the number of output cells of the RBF ANN. The output layer of the LVQ ANN contains as many neurons as classes (3). The number of neurons in the hidden layer (in this case, 8 neurons) has been determined by considering the number of subclasses we can count within the 3 classes.

The learning database contains 24 signals, 11 of which correspond to retrocochlear disorders, 6 to endocochlear disorders and 7 to normal hearing. For the generalization phase, we use the full database, including the learning database. Table I gives the results of this experiment.

Real class →           | Retrocochlear | Endocochlear | Normal
Class given by the ANN |               |              |
Retrocochlear          | 27            | 4            | 10
Endocochlear           | 8             | 45           | 19
Normal                 | 4             | 34           | 63

Table I - RBF+LVQ results
The learning database has been learnt successfully. All of the learnt vectors are well classified in the generalization phase. We can see that this network correctly classifies 63% of the full database (including the learnt vectors), with a rate of correct classification of:
- 71% for the retrocochlear class,
- 55% for the endocochlear class,
- 69% for the normal class.

The behavior of the MNN concerning the retrocochlear and the normal classes permits high rates of correct classification for these classes. However, the classification rate of the endocochlear signals is not satisfactory: only about 55% of the vectors are well classified.

One can remark that when the network makes a mistake on an endocochlear vector, it classifies it more often as a normal one than as a retrocochlear one. In the same way, a misclassified normal vector is preferentially labeled endocochlear. This can be explained by the fact that these results were obtained without taking into account the auditory threshold, which is among the key parameters used to distinguish normal hearing from an endocochlear disorder. The results obtained when this parameter is considered are presented in Table II.

Real class →           | Retrocochlear | Endocochlear | Normal
Class given by the ANN |               |              |
Retrocochlear          | 27            | 4            | 10
Endocochlear           | 7             | 72           | 5
Normal                 | 4             | 7            | 77

Table II - RBF+LVQ results considering the auditory threshold


Thus, the correct classification rate is equal to 83% among all the database with a rate of
correct classification of:
9 71% for the retrocochlear class,
9 87% for the endocochlear class,
9 84% for the normal class.

To evaluate our MNN approach to single RBF or LVQ ANNs based techniques, we have
compared the obtained results with the results relative to these two cases.

4.C Comparison study with single RBF and LVQ ANN approaches

The structures of the RBF and LVQ ANNs for the respective single approaches are composed as follows:
- the number of input neurons for the RBF and LVQ ANNs corresponds to the number of components of the input vectors,
- the output layer of each RBF and LVQ ANN contains 3 neurons, corresponding to the 3 classes,
- for the RBF ANN, the number of neurons of the hidden layer (in this case, 22 neurons) has been determined by learning,
- for the LVQ ANN, the number of hidden neurons (in this case, 10 neurons) has been determined by considering the number of subclasses we can count within the 3 classes.

For the RBF ANN, the learning database contains 24 signals, 11 of them retrocochlear, 6 endocochlear and 7 normal. For the LVQ ANN, the learning database contains 20 signals, 6 of them retrocochlear, 7 endocochlear and 7 normal. The results we obtained are given in the following table (Table III).

Real class →           | Retrocochlear | Endocochlear | Normal
Class given by the ANN | RBF | LVQ     | RBF | LVQ    | RBF | LVQ
Retrocochlear          | 23  | 27      | 2   | 6      | 9   | 5
Endocochlear           | 10  | 4       | 48  | 47     | 21  | 35
Normal                 | 5   | 7       | 33  | 30     | 62  | 52

Table III - RBF and LVQ ANN approaches results

In both cases, the learning database has been learnt successfully. All of the learnt vectors are well classified in the generalization phase. The RBF network correctly classifies 62.5% of the full database (including the learnt vectors), with a rate of correct classification of:
- 61% for the retrocochlear class,
- 58% for the endocochlear class,
- 68% for the normal class.

The LVQ network correctly classifies 59% of the full database (including the learnt vectors), with a rate of correct classification of:
- 72% for the retrocochlear class,
- 57% for the endocochlear class,
- 57% for the normal class.

Comparing these two single ANN based approaches with our proposed MNN technique, one can remark:
- similar performance has been obtained for the normal class by the MNN technique and the single RBF one,
- similar performance has been obtained for the retrocochlear class by the MNN technique and the single LVQ one,
- performance is improved for the normal class by the MNN technique compared to that obtained by the single LVQ approach,
- performance is improved for the retrocochlear class by the MNN technique compared to that obtained by the single RBF approach.

Therefore, the MNN structure combines the advantages of both the LVQ and RBF ANNs. Moreover, the high classification rates of our MNN technique are achieved with a low number of neurons in the ANN architectures, taking into account the specificity of our problem.

5. CONCLUSION

In this paper, we have presented an original approach based on a multi-neural network technique for Brainstem Evoked Response Auditory (BERA) classification. The main difficulty is related to the classification of very similar vectors corresponding to different types of disorders, together with the large variety of vectors within the same class.

The MNN we propose involves Learning Vector Quantization (LVQ) and Radial Basis Function (RBF) neural models. The first neural net (RBF ANN) is used as a classifier, and the second one (LVQ ANN) as a competitive decision processor. The RBF model performs the feature space mapping, associating a set of "pathological classes" to a set of "areas" of the feature space. Because of the competitive nature of the LVQ model's learning strategy, this ANN is used, in our case, as a decision-classification operator. Moreover, the proposed structure can also be seen as a global neural structure with two hidden layers. The association of the RBF and LVQ neural models thus improves the global order of the non-linear approximation capability of this global neural operator, compared to each single neural structure constituting the MNN system.

To evaluate the capability of this technique, we have classified BERA average signals for three categories of patients according to the type of their auditory disorder: normal hearing, endocochlear and retrocochlear disorders. Our proposed Multi-Neural Network architecture allows us to keep the advantages of both the RBF (classification rate equal to 68% for the normal class) and LVQ (classification rate equal to 72% for the retrocochlear class) ANNs, and improves the classification rate in a fine classification problem (classification rates equal to 71% for the retrocochlear class, 84% for the normal class and 87% for the endocochlear class). Moreover, these results are achieved with a low number of neurons in the ANN architectures, taking into account the specificity of our classification problem.

The results we obtained are encouraging and show the feasibility of a neural network based tool for diagnostic assistance. The field of BERA classification remains wide open and this work should be carried on.

ACKNOWLEDGEMENTS

The database we have used belongs to the CREFON (Center of Research and Functional Investigation in Oto-Neurology). We wish to thank its members, especially Dr. M. OHRESSER, for her help.

REFERENCES

[1] WIDROW B., LEHR M.A., "30 years of adaptive Neural Networks: Perceptron, Madaline, and Backpropagation", Proceedings of the IEEE, Vol. 78, pp. 1415-1441, 1990.

[2] BENGHARBI A., "Contribution au test et diagnostic des circuits analogiques par des approches basées sur des techniques neuronales", PhD thesis report, University of Creteil - Paris XII, 1997.

[3] AMARGER V., BENGHARBI A., MADANI K., "A New Approach to Fault Diagnosis of Analog Circuits using Neural Networks Based Techniques", IEEE European Test Workshop 96, Montpellier, June 12-14, 1996.

[4] BENGHARBI A., AMARGER V., MADANI K., "Multi-Fault Diagnosis of Analog non Linear Circuits by Smart Classification Technique", IEEE European Test Workshop 97, Torino, May 1997.

[5] MADANI K., BENGHARBI A., AMARGER V., "Neural Fault Diagnosis Techniques for Non Linear Analogue Circuits", SPIE'97, Orlando, 1997.

[6] BAZOON M., STACEY D.A., CUI C., "A Hierarchical Artificial Neural Network System for the Classification of Cervical Cells", IEEE International Conference on Neural Networks, Orlando, July 1994.

[7] ALPSAN D., "Auditory Evoked Potential Classification by Unsupervised ART 2-A and Supervised Fuzzy ARTMAP Networks", IEEE International Conference on Neural Networks (ICNN), Orlando, July 1994.

[8] KOHONEN T., "Learning Vector Quantization", Neural Networks, vol. 1, suppl. 1, p. 303, 1988.

[9] KOHONEN T., "Self-Organization and Associative Memory", 3rd ed., Springer-Verlag, Germany, 1989.

[10] NIRANJAN M., FALLSIDE F., "Neural Networks and Radial Basis Functions in classifying static speech patterns", Report CUED/F-INFENG/TR22, Cambridge University, England, 1988.

[11] MOTSCH J-F., "La dynamique temporelle du tronc cérébral : recueil, extraction et analyse optimale des potentiels évoqués auditifs du tronc cérébral", PhD thesis report (thèse d'état), University of Créteil - Paris XII, 1987.
EEG-based Cognitive Task Classification with ICA and Neural Networks

David A. Peterson 1 and Charles W. Anderson 1

1 Department of Computer Science
{petersod,anderson}@cs.colostate.edu
http://www.cs.colostate.edu/~{petersod,anderson}
970.491.7184 and 970.491.7491
970.491.2466 fax
Colorado State University
Fort Collins, CO 80521

Abstract. Electroencephalography (EEG) has been used extensively for classifying cognitive tasks. Many investigators have demonstrated classification accuracies well over 90% for some combinations of cognitive tasks, signal transformations, and classification methods. Unfortunately, EEG data is prone to significant interference from a wide variety of artifacts, particularly eye blinks. Most methods for classifying cognitive tasks with EEG data simply discard time windows containing eye blink artifacts. However, future applications of EEG-based cognitive task classification should not be hindered by eye blinks. The value of an EEG-controlled human-computer interface, for instance, would be severely diluted if it did not work in the presence of eye blinks. Fortunately, recent advances in blind signal separation algorithms and their applications to EEG data mitigate the artifact contamination issue. In this paper, we show how independent components analysis (ICA) and its extension for sub-Gaussian sources, extended ICA (eICA), can be applied to accurately classify cognitive tasks with eye blink contaminated EEG recordings.

1 Introduction

1.1 Cognitive Task Classification With EEG


Many investigators have demonstrated moderate success in classifying cognitive tasks with EEG using a wide variety of signal transformations and classifiers, including neural networks (a summary can be found in [2, 1]). However, most studies have tried to classify only blink-free data; time windows during which the subject blinks are usually excluded. Such eye blink artifact-contaminated windows are typically detected by crude measures such as thresholds on the magnitude of the EEG or electrooculogram (EOG) signals. An open question, however, is whether the successes in EEG-based cognitive task classification can be extended to signals that include those periods of time during which the subject blinks.
Recently, Jung et al. have shown that various artifacts including eye blinks can be separated from the remaining EEG signals with eICA [5]. In this paper we study the effect of applying ICA and eICA to EEG data on classification performance, using standard power spectral density (PSD) signal representations and feedforward neural network classifiers.
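
As an illustration of such a PSD representation (Python with NumPy and SciPy; Welch's method, the 40 Hz cut-off and the log scaling are our own generic choices, since the estimator is not specified at this point), one multi-channel EEG window can be turned into a feature vector as follows.

    import numpy as np
    from scipy.signal import welch

    def psd_features(window, fs=250.0, fmax=40.0):
        # window: array (n_channels, n_samples) holding one EEG time window.
        # Welch's method is one standard PSD estimator; the paper's exact
        # representation may differ.
        freqs, pxx = welch(window, fs=fs, nperseg=min(window.shape[1], 256))
        keep = freqs <= fmax                         # EEG rhythms of interest
        return np.log(pxx[:, keep] + 1e-12).ravel()  # one flat feature vector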

1.2 ICA

ICA is a method for blind source separation. It assumes that the observed signals are produced by a linear mixture of source signals. Thus, the original source signals could, in principle, be recovered from the observed signals by running the observed signals back through the inverted mixing matrix. Computationally intensive matrix inversions can be avoided, however, with recent relaxation-based ICA algorithms [3]. These algorithms derive maximally independent components u by maximizing the joint entropy of the ui, which is equivalent to minimizing the components' mutual information. The joint entropy is maximized with respect to the unmixing matrix W. The result is a simple rule for evolving W in an iterative, gradient-based algorithm.
It is reasonable to apply ICA to EEG data because EEG signals measured on the scalp are the result of linear filtering of the underlying cortical activity [5, 7]. However, ICA assumes that all of the underlying sources have similar super-Gaussian probability density functions. It is unknown how well EEG "sources" follow this assumption, but it is reasonable to assume that some may not. A recent extension to ICA, extended ICA, takes a first step toward addressing this issue.

1.3 Extended ICA

Extended ICA [5] provides the same type of source separation as ICA, but also
allows some sources to have sub-Gaussian distributions. The learning rule for
the unmixing matrix W is modified to be a function of the data's normalized
4th-order cumulant, or kurtosis:

$$\Delta W \propto [I + \hat{p}\, u^T]\, W \qquad (1)$$

$$\hat{p}_i = -\mathrm{sgn}(k_4)\tanh(u_i) - u_i \qquad (2)$$

where $k_4$ is the kurtosis and $u_i$ is the $i$th activation. Periodically during the course
of learning, the kurtosis is calculated and the learning rule adjusted according
to the sign of the kurtosis. Positive kurtosis is indicative of super-Gaussian distribu-
tions, and negative kurtosis of sub-Gaussian distributions. By accommodating
sub-Gaussian distributions in the data, eICA should provide a more accurate
decomposition of multi-channel EEG data, particularly if different underlying
sources follow different distributions.

2 Methods

2.1 Data Collection

Ten 10-second trials were given to each subject for each of three tasks:

- base: baseline task: try to relax and not think of anything specific
- letter: mentally compose a letter to a friend
- math: sub-vocally multiply two non-trivial numbers

Subjects kept their eyes open during the trials and were asked to avoid
blinking. EEG data was collected from six channels of the International 10-20
System: C3, C4, P3, P4, O1, O2, referenced to linked mastoids. EOG data was
also collected to provide a reference for eye blinks. All signals were sampled at
250 Hz. Further details are provided in [6].

2.2 Eye Blink Removal

Despite instructions to avoid eye blinks, many of the trials contain one or more
eye blinks. Two categories of schemes were used for handling the eye blinks: 1)
the 'threshold' approach and 2) ICA. With the threshold approach, eye blinks
were detected by at least a 100 µV change in less than 100 msec in the EOG
channel. The subsequent 0.5 sec window of the trial was removed from further
consideration.
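A minimal sketch of this detection rule, assuming an EOG trace in microvolts
sampled at 250 Hz (the function and parameter names are illustrative):

import numpy as np

def blink_mask(eog, fs=250, dv_uv=100.0, dt_ms=100, drop_s=0.5):
    """True where the trial should be discarded: any >= 100 uV change
    within 100 msec in the EOG marks the following 0.5 s window."""
    lag = int(fs * dt_ms / 1000)                     # samples spanning the test interval
    jump = np.abs(eog[lag:] - eog[:-lag]) >= dv_uv
    drop = np.zeros(len(eog), dtype=bool)
    for i in np.flatnonzero(jump):                   # mark the subsequent 0.5 s for removal
        drop[i:i + int(drop_s * fs)] = True
    return drop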
With the ICA approach, eye blinks are "subtracted" rather than explicitly
detected, and no portion of the trials is thrown out. ICA is performed on the
combination of the EOG and six EEG channels. The number of activations spec-
ified was the same as the number of input channels: seven. As a result, activity in
the EEG channels that is closely correlated with the activity of the EOG channel
is separated and placed in one activation, as illustrated in Figure 1 for the first
five seconds of one trial of the base task. Notice that the eye blinks in the EOG
channel influence even the most posterior EEG recordings at channels O1 and
O2. The ICA activations show the eye blink activity in only one component.
Thus, eye blink activity reflected in the EEG channels is "subtracted" from
those EEG channels. The activation containing the EOG activity can be trans-
parently detected, because it is the one with the highest correlation to the
original EOG data. The remaining activations are retained as the "eye-blink
subtracted" independent components of the EEG data. Thus, with the ICA ap-
proach, the full trial of EEG data is used for all trials, regardless of the number
or distribution of eye blinks in those trials. Within the ICA-based category of
eye blink removal schemes, three specific forms of ICA were used:

- ICA
- Extended ICA (i.e. the algorithm chooses the number of sub-Gaussian com-
ponents to use)
- Extended ICA with fixed number of sub-Gaussian components
Fig. 1. Eye blink subtraction with ICA

Thus, a total of four different schemes were used to remove eye blinks and rep-
resent the "blink-free" signals: thresh (for eye blink removal using threshold
detection, as described above), ICA, eICA, and eICA_f (for eICA with a fixed
number of sub-Gaussian components). Our objectives were not only to see how
cognitive task classification performance varies as a function of the eye blink
removal approach, but also to see how it varies as a function of the number of
sub-Gaussian components in the ICA representation.
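The ICA-based subtraction described above can be sketched as follows; the
unmixing matrix W is assumed to have been learned beforehand by one of the
three ICA variants, and the helper name is hypothetical:

import numpy as np

def blink_subtracted_components(eeg, eog, W):
    """Unmix the 7-channel (EOG + 6 EEG) data, drop the activation most
    correlated with the EOG trace, and keep the rest as the
    "eye-blink subtracted" independent components."""
    x = np.vstack([eog[None, :], eeg])               # 7 x n_samples
    u = W @ x                                        # seven activations
    corr = [abs(np.corrcoef(ui, eog)[0, 1]) for ui in u]
    blink = int(np.argmax(corr))                     # transparently detected blink component
    return np.delete(u, blink, axis=0)               # six retained components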

2.3 Signal Transformation


All trials were divided into 0.5 second windows with 0.25 sec overlap, as in [2].
The power spectral density of each channel in every window was computed and
summed over the five primary EEG frequency bands: δ (0-4 Hz), θ (4-7 Hz), α
(8-12 Hz), β (13-35 Hz), and γ (> 35 Hz). The power spectral density was used
because it has been a popular and successful signal representation for many types
of EEG analyses for decades. Finally, because the PSD values were so heavily
weighted toward the lower frequencies, the log10 of this vector was computed. Thus,
each window was represented by a feature vector of length 30 (i.e. six channels
x five frequency bands).
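A sketch of this feature computation, using scipy's Welch estimator as a
stand-in for whatever spectral estimator was actually used; band edges follow
the text, with the γ band capped at the 125 Hz Nyquist frequency:

import numpy as np
from scipy.signal import welch

BANDS = [(0, 4), (4, 7), (8, 12), (13, 35), (35, 125)]  # delta, theta, alpha, beta, gamma (Hz)

def window_features(win, fs=250):
    """30-element feature vector for one 0.5 s window (win: 6 channels x 125
    samples): log10 of the PSD summed over the five bands, per channel."""
    feats = []
    for ch in win:
        f, p = welch(ch, fs=fs, nperseg=len(ch))     # per-channel PSD estimate
        feats += [p[(f >= lo) & (f < hi)].sum() for lo, hi in BANDS]
    return np.log10(np.asarray(feats) + 1e-12)       # log compresses the low-frequency weighting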

2.4 Neural Network-Based Classification


The cognitive tasks were classified in two pairwise task comparisons: base versus
math and letter versus math. By analyzing two pairwise classifications, we hoped
to assess how well the classification scheme would generalize to different task
pairs.
Supervised learning and simple feedforward neural networks were used to
classify the feature vectors into one of the two tasks. The networks had one
linear output node. The number of sigmoidal hidden nodes was varied over [0, 1,
2, 3, 5, 10]. By including zero hidden nodes as one of the network architectures,
we effectively assess how well a simple linear perceptron can classify the
data. Network inputs were given not only to the hidden layer, but also to the
output node, in a cascade-forward configuration. Thus, network classifications
were based on a combination of the non-linear transformation of the input fea-
tures provided by the hidden layer and a linear transformation of the input
features given directly to the output node.
The networks were given input feature vectors normalized so that each feature
has an N(0,1) distribution. The networks were trained with Levenberg-Marquardt-
optimized backpropagation [4]. Training was terminated with early stopping,
with the data set partitioned into 80, 10, and 10% portions for training, valida-
tion, and test sets, respectively. The mean and standard deviation of classifica-
tion accuracy reported in the results section reflect the statistics of 20 randomly
chosen partitions of the data and initial network weights.
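For concreteness, a PyTorch sketch of the cascade-forward topology just
described; this only reproduces the architecture, not the Levenberg-Marquardt
training of [4], and all names are illustrative:

import torch
import torch.nn as nn

class CascadeForward(nn.Module):
    """Cascade-forward topology: the linear output unit sees both the raw
    normalized features and the sigmoidal hidden layer; with n_hidden = 0
    the network reduces to a linear perceptron."""
    def __init__(self, n_in=30, n_hidden=5):
        super().__init__()
        self.hidden = nn.Linear(n_in, n_hidden) if n_hidden > 0 else None
        self.out = nn.Linear(n_in + (n_hidden if n_hidden > 0 else 0), 1)

    def forward(self, x):
        if self.hidden is None:
            return self.out(x)                       # purely linear classification
        h = torch.sigmoid(self.hidden(x))
        return self.out(torch.cat([x, h], dim=1))    # direct plus hidden connections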

3 Results

The best classification accuracy for each eye blink removal scheme over
all network architectures is shown in Figure 2. For the eICA_f scheme, the per-
formance shown is for the best number of sub-Gaussian components. The per-
formance is statistically similar across the different schemes. In all cases except
ICA on the letter v. math pair, mean classification accuracies are over 90%. For
both task pairs, eICA and eICA_f perform statistically as well as the thresh
scheme.
Figure 3 shows how classification accuracy varies with the size of the neural
network's hidden layer. For the thresh scheme, the linear neural networks (i.e.
zero non-linear hidden layer nodes) perform about as well as the non-linear net-
works. Thus, the simple thresh scheme seems to represent the data's features in
a linearly-separable fashion. However, with all three of the ICA-based schemes,
performance tends to improve with the size of the hidden layer, then decrease
again as the number of hidden nodes is increased from five to ten. Notice that
eICA and eICA_f perform about as well as thresh when networks of sufficient
hidden layer size are used for the classification. Apparently the eICA repre-
sentations produce feature vectors whose class distinctions fall along non-linear
feature space boundaries. Notice that for the base v. math task pair, the mean
performance with eICA_f is greater than that of thresh for all of the non-linear
networks.
So are there specific numbers of sub-Gaussian components for which perfor-
mance is better than others? We explored this question, analyzing task pair clas-
sification accuracy while varying the number of fixed sub-Gaussian components
used in the eICA_f scheme. The results are summarized in Figure 4. Notice that
for both task pairs, classification performance is indeed a function of the number
of sub-Gaussian components. Also, the variability in performance is consistent
across different size networks. For both task pairs, performance is about
Fig. 2. Best classification performance as a function of eye blink removal schemes.
(Error bars are one σ above and below the mean.)

maximum when the number of sub-Gaussians is four, and decreases steadily with
additional sub-Gaussian components. However, the classification performance
differs markedly between the task pairs when the number of sub-Gaussian com-
ponents is less than four. Perhaps with the base task the underlying sources have
fewer sub-Gaussian components, making the choice of fewer fixed sub-Gaussian
components in our representation helpful for classification.

4 Discussion

We have shown that eICA can be used to subtract eye blinks from EEG data and
still provide a signal representation conducive to accurate cognitive task classi-
fication. We have also provided preliminary evidence that eICA-based schemes
can generalize across different cognitive tasks. In both cases, however, it was
necessary to use non-linear neural networks to achieve the same performance
as was attained with a simple thresholding eye blink removal scheme and linear
neural network classifiers. Further work needs to be done to assess the sensitivity
of these results to different cognitive tasks.
By using a combination of ICA and artifact-correlated recording channels
(e.g. the EOG channel) for artifact removal, eye blinks were removed without a
hard-coded definition of an eye blink such as magnitude thresholds. This approach
could generalize to other artifact sources. If, for example, specific muscle activity
is interfering with EEG signals in a specific cognitive task monitoring setting,
Fig. 3. Classification performance as a function of hidden layer size. (Error bars
omitted for clarity. For most data points, σ < 4.)

Fig. 4. Classification performance as a function of the number of sub-Gaussian
components. (Results were similar for larger networks, and not plotted here for clarity.)

then this approach could be used to subtract the myographic activity from the
EEG signals by including the appropriate electromyographic (EMG) reference
channel in the ICA decomposition.

Acknowledgements. Initial EEG experiments were supported through NSF
grant IRI-9202100.

References

1. C. W. Anderson, E. A. Stolz, and S. Shamsunder. Multivariate autoregressive models
   for classification of spontaneous electroencephalogram during mental tasks. IEEE
   Transactions on Biomedical Engineering, 45(3):277-286, 1998.
2. Charles W. Anderson. Effects of variations in neural network topology and output
   averaging on the discrimination of mental tasks from spontaneous electroencephalo-
   gram. Journal of Intelligent Systems, 7(1-2):165-190, 1997.
3. A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind sep-
   aration and blind deconvolution. Neural Computation, 7(6):1129-1159, November
   1995.
4. M. T. Hagan and M. Menhaj. Training feedforward networks with the Marquardt
   algorithm. IEEE Transactions on Neural Networks, 5(6):989-993, 1994.
5. Tzyy-Ping Jung, Colin Humphries, Te-Won Lee, Scott Makeig, Martin J. McKe-
   own, Vicente Iragui, and Terrence J. Sejnowski. Extended ICA removes artifacts
   from electroencephalographic recordings. In Advances in Neural Information
   Processing Systems 10. The MIT Press, Cambridge, MA, 1998.
6. Z. A. Keirn. Alternative modes of communication between man and machine. Mas-
   ter's thesis, Purdue University, West Lafayette, IN, 1988.
7. Scott Makeig, Anthony J. Bell, Tzyy-Ping Jung, and Terrence J. Sejnowski. Inde-
   pendent component analysis of electroencephalographic data. In D. S. Touretzky,
   M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Pro-
   cessing Systems 8, pages 145-151. The MIT Press, Cambridge, MA, 1996.
Local Pattern of Synchronization in
Extrastriate Networks during Visual Attention

Liset Menéndez de la Prida 1, Francisco Barceló 1,2, Miguel A. Pozo 1
and Francisco J. Rubia 1

1 Unidad de Cartografía Cerebral, Instituto Pluridisciplinar, Universidad Complutense,
Paseo Juan XXIII, 1, 28040 Madrid, SPAIN.
2 Faculty of Psychology, Complutense University of Madrid,
Madrid, SPAIN.

We analyzed the activity of the human electroencephalogram (EEG) during the
execution of a visual attention test. By analyzing event-related potentials
(ERP) associated with the onset of attended and unattended stimuli, we
investigated the patterning of interactions between distant brain areas. Our
results suggest that focalized attention to visual stimuli is associated with an
increase in the spectral coherence of EEG signals recorded from temporal
and parieto-occipital brain areas. This outcome supports the notion of a
network of synchronous brain activation during selective visual attention.

INTRODUCTION
Inspired by recent physiological studies in animals, which report synchronized activity of
cortical neurons during processing of visual stimuli (Corbetta, 1998; Goldman-Rakic, 1988;
LaBerge et al., 1992; Lopes da Silva, 1991; Mesulam, 1990; McIntosh et al., 1994; Rees et
al., 1997; Webster et al., 1993; Wright & Liley, 1996), we hypothesized that the analysis of
the coherence of phase-locked ERP activity would carry information about the interactions
between distant brain areas involved in visual attention. Event-related brain potentials
(ERP) are averages of the electroencephalogram (EEG) which are time-locked to the
presentation of a sensory event. ERPs can measure activity of distant areas of the cortex
with high temporal resolution. Because coherence describes the phase-locked component
shared by two signals, large-scale cortical interactions can be detected over the whole
cortex (Sarnthein et al., 1998; Wright & Liley, 1996). Following this rationale, we
performed a coherence analysis on human scalp EEG recorded while subjects accomplished
a visual attention task in order to test two hypotheses:

1) If anterior and posterior areas become functionally interrelated (i.e. synchronized)
during the phasic deployment of attention to visual stimulation, then there should be a
significant increment in the amount of coherence among interrelated areas after
stimulus onset compared to the period prior to the onset of the attended stimulus.
2) If the increment in coherence is specifically associated with the filtering of sensory
information (attention), then more efficient filtering should be associated with larger
increases in coherence. That is, larger coherence values are expected while attention is
focused on a stimulus than when it is unfocused or shifting between stimuli.

* Corresponding author: Faculty of Psychology, Complutense University, Somosaguas
28223, Madrid, Spain. Email: fbarcelo@psi.ucm.es

In this report we present evidence that focalized attention to objects in the visual scene
brings about a corresponding increase in the spectral coherence of the EEG signal recorded
from temporo-parietal and parieto-occipital brain areas. This pattern of synchronization
involved left temporal-parietal rather than frontal regions.

METHODS
Sixteen right-handed young volunteers (8 women and 8 men; age range 19-28 years, mean
= 20.8 years) with normal or corrected vision and no history of neurological or psychiatric
problems were recruited from colleges in the University campus. Subjects were informed
of all aspects of the research and signed a consent form approved by the Ethical Committee
of the Brain Mapping Unit. Subjects were paid for their participation.

The task protocol consisted of a computer adaptation of a well-known test of visual
attention, the Wisconsin Card Sorting Test (WCST) (Barceló et al., 1997; Barceló &
Rubia, 1998; Milner, 1963). This task involves sorting a key-card according to the color,
the shape or the number of its elements. Therefore, the task alternates periods in which the
subject has to shift attention between stimulus dimensions (SHIFT) with other periods in
which the subject focuses attention on only one stimulus dimension (ATTEND). All other
task and stimulus parameters remain identical in both conditions.

Each trial began with the onset of a compound stimulus containing the four WCST key-
cards on top of one choice-card, all centered on the screen. The compound stimulus
subtended a visual angle of 4° horizontally and 3.5° vertically. Subjects were instructed to
match the choice-card with one of the four key-cards following one of three possible
sorting principles: number, color, or shape. The correct sorting principle could be
determined on the basis of feedback which was delivered 1900 ms after each response
through a computer-generated tone (2000 Hz for correct, 500 Hz for incorrect). Responses
were made with a 4-button panel. The length of the WCST series varied randomly between
6 and 9 trials. The inter-trial interval varied randomly between 3000 and 4000 ms. The task
consisted of two blocks of 18 series each. The order of choice-cards within the series was
determined on a semi-random basis so that the first four sorts in the series could be made
unambiguously. Elimination of ambiguity eased the correction of the test, and improved
the signal-to-noise ratio in the ERPs. The average duration of each block was 12 min, with
a 5 min rest period between blocks.

The electroencephalogram (EEG) was recorded from tin electrodes at positions Fp1, Fp2,
T7, T8, C3, C4, PO7, PO8, O1, and O2 of the extended 10-20 system (American
Electroencephalographic Society, 1994) and referenced to the left mastoid (Figure 1). The
EEG was amplified with a band pass from DC to 30 Hz (12 dB/octave roll-off), and
digitized at 250 Hz over a 1700 ms epoch with a 200 ms baseline. Impedances were kept
below 5 kΩ. The electrooculogram (EOG) was also recorded for blink correction. Trials
with remaining muscle or movement artifacts were discarded. Separate averages were
computed for early and late WCST trials. The second and third trials across series were
averaged into a 'SHIFT' waveform, and the last two trials were averaged into an
'ATTEND' waveform. A linked-mastoid reference was computed off-line for the averaged
data.

Figure 2. Event-related brain potentials recorded during the SHIFT and ATTEND
conditions from the left temporal (T7) and occipital (O1) areas. The coherence function
(Cxy) as a function of time is plotted in the lower panel.

Coherence for two signals, x and y (Cxy), is equal to the average cross power spectrum
normalized by the averaged powers of the compared signals. Coherence is the frequency
domain equivalent of the cross-covariance function and is a measure of the similarity of
two signals. Its value lies between zero and one, and it estimates the degree to which
phases at the frequency of interest are dispersed. Coherence estimates were computed on
the pre-stimulus and post-stimulus periods of the averaged stimulus-locked ERP signal
among all possible pairs of electrodes. Separate coherence estimates were obtained for the
'SHIFT' and 'ATTEND' conditions (Figure 2). The first hypothesis will be tested by
computing the mean coherence for each electrode with every other electrode. The second
hypothesis will be tested by computing the difference in coherence between the SHIFT
and ATTEND conditions. The significance level of the increase in mean coherence
between the pre-stimulus and post-stimulus periods, as well as between the SHIFT and
ATTEND conditions, was evaluated with a series of paired t-tests.
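As a sketch, the mean coherence and the paired comparison could be computed as
follows, using scipy's magnitude-squared coherence estimator as a stand-in;
the segment length and frequency cut-off are illustrative assumptions:

import numpy as np
from scipy.signal import coherence
from scipy.stats import ttest_rel

def mean_coherence(x, y, fs=250, fmax=30.0):
    """Magnitude-squared coherence between two ERP traces, averaged over
    frequencies up to fmax; values lie between zero and one."""
    f, cxy = coherence(x, y, fs=fs, nperseg=min(len(x), 128))
    return cxy[f <= fmax].mean()

# paired comparison across subjects, e.g. pre- vs post-stimulus periods:
# t, p = ttest_rel(post_values, pre_values)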

RESULTS

For the SHIFT condition, the mean coherence of the pre-stimulus period did not differ
significantly from the mean coherence of the post-stimulus period. This was so for all
electrodes tested. For the ATTEND condition, there was an increase in mean coherence
between the pre-stimulus and post-stimulus periods at central, temporal, and parieto-
occipital electrodes (P < 0.02), but not at frontal sites.

A summary of results for the test of differences in coherence between the SHIFT and
ATTEND conditions is presented in Table 1. Figure 3 displays a line connecting all
electrode pairs which showed a significant increase in coherence (P < 0.01 or better).

Table 1. Summary of the mean increment of intra-hemispheric and inter-hemispheric
coherence between pairs of electrodes when comparing the attention and control
conditions.

INTRAHEMISPHERIC COHERENCE
From:  T3                  T4                  C3           C4           P3     P4
To:    C3    P3    O1      C4    P4    O2      P3    O1     P4    O2     O1     O2
       0.03  0.13* 0.13*   0.00  0.00  0.00    0.06  0.07   0.10  0.07   0.01   0.01

INTERHEMISPHERIC COHERENCE
From:  T3                  T4                  C3           C4           P3     P4
To:    C4    P4    O2      C3    P3    O1      P4    O2     P3    O1     O2     O1
       0.08* 0.15* 0.12*   0.03  0.04  0.05    0.14* 0.04   0.11  0.10*  0.10*  0.16*

* P < 0.01

DISCUSSION

We have observed enhanced EEG coherence specifically associated with the phasic
deployment of attention to visual stimuli. This enhancement was not present while the
subject was in the process of shifting attention between stimulus dimensions, but appeared
while the person's attention was concentrated on one aspect of the stimulation. This
outcome indicates that the increase in coherence is a phenomenon specifically associated
with attention, rather than with any other physical property of the stimulation. As
expected, the pattern of coherence was larger during the ATTEND condition than during
the SHIFT condition (Barceló & Rubia, 1998; Berman et al., 1995). Figure 3 illustrates the
pattern of connectivity between those temporal and parieto-occipital areas that registered
the larger increment in coherence. The left temporal area experiences the largest increases
in

coherence, particularly with areas in the opposite hemisphere. Such an increase in
coherence reflects the fact that the left temporal area is the one which experiences the
largest increment in coherence after the shift in attention. This outcome can be assimilated
to the left temporal asymmetry described in previous works (Barceló & Rubia, 1998;
Berman et al., 1995), which has been associated with the processes of attentional shifting
among stimulus dimensions.

Extensive research in humans and nonhuman primates supports the notion of a network for
selective attention involving posterior association areas (Goldman-Rakic, 1988; LaBerge
et al., 1992; McIntosh et al., 1994; Sarnthein et al., 1998; Webster et al., 1993). A larger
involvement of frontal areas was expected for a task commonly used in the assessment of
frontal lobe function. However, frontal areas did not show any significant pattern of
coherence with posterior association areas. Instead, the largest coherence values were
centered over the left temporal region. This result is consistent with a large number of
reports that lesions in the left mesial temporal lobes impair performance of the WCST
(Horner et al., 1996), and suggests an important contribution of non-frontal regions to the
modulation of visual attention. Conversely, a failure to find any significant pattern of
frontal coherence does not rule out a possible implication of frontal cortex in visual
attention (Corbetta, 1998; Mesulam, 1990). This outcome could be due to the relatively
coarse window of analysis adopted, since the modulation of visual attention by the frontal
cortex has been described as a phasic mechanism, and its effects are fast and short-lived
(Barceló & Rubia, 1998; McIntosh et al., 1994). Future research should accomplish a
decomposition of the temporal pattern of coherence into smaller time windows.

REFERENCES

American Electroencephalographic Society (1994) Guidelines for standard electrode
position nomenclature. Journal of Clinical Neurophysiology, 11, 111-113.
Barceló, F. & Rubia, F.J. (1998) Non-frontal P3b-like activity evoked by the Wisconsin
Card Sorting Test. Neuroreport, 9, 747-751.
Barceló, F., Sanz, M., Molina, V. & Rubia, F.J. (1997) The Wisconsin card sorting test and
the assessment of frontal function: A validation study with event-related potentials.
Neuropsychologia, 35, 399-408.
Berman, K.F., Ostrem, J.L., Randolph, C., Gold, J., Goldberg, T.E., Coppola, R., Carson,
R.E., Herscovitch, P. & Weinberger, D.R. (1995) Physiological activation of a cortical
network during performance of the Wisconsin card sorting test: a positron emission
tomography study. Neuropsychologia, 33, 1027-1046.
Corbetta, M. (1998) Frontoparietal cortical networks for directing attention and the eye to
visual locations: Identical, independent, or overlapping neural systems? Proc. Natl.
Acad. Sci. USA, 95, 831-838.
Getting, P.A. (1989) Emerging principles governing the operation of neural networks.
Annual Review of Neuroscience, 12, 185-204.
Goldman-Rakic, P.S. (1988) Topography of cognition: Parallel distributed networks in
primate association cortex. Annual Review of Neuroscience, 11, 137-156.
Horner, M.D., Flashman, L.A., Freides, D., Epstein, C.M. & Bakay, R.A. (1996) Temporal
lobe epilepsy and performance on the Wisconsin Card Sorting Test. Journal of
Clinical and Experimental Neuropsychology, 18, 310-313.
LaBerge, D., Carter, M. & Brown, V. (1992) A network simulation of thalamic circuit
operations in selective attention. Neural Computation, 4, 318-331.
Lopes da Silva, F. (1991) Neural mechanisms underlying brain waves: from neural
membranes to networks. Electroencephalography and Clinical Neurophysiology, 79,
81-93.
McIntosh, A.R., Grady, C.L., Ungerleider, L.G., Haxby, J.W., Rapoport, S.I. & Horwitz, B.
(1994) Network analysis of cortical visual pathways mapped with PET. J. Neurosci.,
14, 655-666.
Mesulam, M.M. (1990) Large-scale neurocognitive networks and distributed processing for
attention, language, and memory. Annals of Neurology, 28, 597-613.
Milner, B. (1963) Effects of different brain lesions on card sorting. Archives of Neurology,
9, 90-100.
Rees, G., Frackowiak, R. & Frith, C. (1997) Two modulatory effects of attention that
mediate object categorization in human cortex. Science, 275, 835-838.
Sarnthein, J., Petsche, H., Rappelsberger, P., Shaw, G.L. & von Stein, A. (1998)
Synchronization between prefrontal and posterior association cortex during human
working memory. Proceedings of the National Academy of Sciences USA, 95,
7092-7096.
Webster, M.J., Bachevalier, J. & Ungerleider, L.G. (1993) Connections of inferior temporal
areas TEO and TE with parietal and frontal cortex in macaque monkeys. Cerebral
Cortex, 5, 470-483.
Wright, J.J. & Liley, D.T.J. (1996) Dynamics of the brain at global and microscopic scales:
Neural networks and the EEG. Behavioral and Brain Sciences, 19, 285-320.
A Bioinspired Hierarchical System for Speech
Recognition

Ferrández J.M. 1, Rodellar V. 2, Gómez P. 2

1 Instituto de Bioingeniería, U. Miguel Hernández, Alicante, Spain
2 Depto. Arquitectura y Tecnología de Computadores, Univ. Politécnica de Madrid, Spain
Corresponding Author: jm.ferrandez@umh.es

Abstract. Artificial speech recognition systems lack certain characteristics
needed for maintaining their performance under usual conditions (background
noise, continuous speech, etc.). The human auditory system has been able to solve
these problems through neural evolution. In this paper a bioinspired speech
recognition system, which mimics the hierarchical auditory processing, is
proposed. It presents the desired robustness, accuracy and spectro-temporal
generalization.

1. Introduction
The artificial speech recognition area is evolving continuously because its application
areas involve very different systems, like bank transactions, friendly interactive user
information, bioengineering, forensic evaluations, automatic translation, language
assistance, aids for handicapped people and so on. The proposed systems must be user
independent, they must be able to recognize a considerable amount of words, they
must handle continuous speech and they must keep their performance under adverse
environments.

Human listeners can recognize the speech of different talkers, with different rates, distinct
accents and even under noisy conditions. A detailed understanding of how speech
is processed by humans could help the design of bio-inspired systems which share the
main characteristics of biological systems, accuracy and robustness. If speech signals
are coded in the same way as in the auditory periphery, the information can later be
extracted by a model inspired by the central auditory system with the desired properties.

This paper proposes a hierarchical bioinspired speech recognition system based on
auditory processing, which consists of two main modules adopted from standard
automatic ones: a parametric extraction module, formed by the concatenation of
different physiological models of the relevant pre-processing centers of the auditory
system, which will provide the precision and robustness inherent to biological
systems, and a recognition module composed of a time-delay self-organizing neural
network which will group and classify in an automatic way the different component
combinations provided, capturing the spectro-temporal variability of the speech signal in
its structure. This may be checked by visually analyzing the temporal evolution of the
weights of the map's different nodes, which will code in their internal organization the
characteristic temporal evolution of the speech units. This kind of network is also
biologically plausible.

2. Speech Production

In order to study the main components of speech sounds, first we must understand
how speech is produced. The process of generating speech begins in the lungs, where
constrictions force air out through the trachea to the larynx, which contains the vocal
cords and the glottis. The air can follow two ways: it can reach the outside through the
vocal tract, which begins at the vocal cords and ends at the lips, or it can pass through
the nasal cavity. The air flow is regulated by the velum.

The air can arrive at the tract in the form of vibration of the vocal cords (this process
will produce voiced phonemes) or in the form of breath noise (unvoiced phonemes will
be heard). The periodicity of this vibration is called the fundamental frequency or pitch.
However, this flow of air is affected by many physical factors in the vocal tract,
including the position of the tongue, the dental effects, the position of the velum, the
movement of the lips, etc. All these physical processes act as a resonator. The natural
resonances of the vocal tract correspond to the poles of the transfer function and they
are called formants. They provide the most important cue for recognizing phonemes.
Formants are identified by a number in order of increasing frequency: F1, F2, etc. F1
is the first resonant frequency and, for voiced speech, it generally is in the range of
250 to 800 Hz. F2 has a wider range, from 600 to 3600 Hz. Formants F3, F4 and F5
may be present in voiced speech; however, the lowest two formants are usually
sufficient to identify specific phonemes.
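To make the F1-F2 clustering idea concrete, here is a toy nearest-centroid
vowel labeller; the centroid values are illustrative round numbers, not
measurements from this work, since the exact positions shift with age, sex,
language and talker:

import numpy as np

# illustrative F1/F2 centroids (Hz); only the clustering pattern is stable
CENTROIDS = {'a': (700, 1300), 'e': (450, 1900), 'i': (280, 2300),
             'o': (450, 900), 'u': (300, 750)}

def classify_vowel(f1, f2):
    """Label a vowel by the nearest centroid in the F1-F2 plane."""
    return min(CENTROIDS, key=lambda v: np.hypot(f1 - CENTROIDS[v][0],
                                                 f2 - CENTROIDS[v][1]))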

When spectrograms of speech signals are represented, three distinct components can
be observed. There exist horizontal bars defined precisely at certain characteristic
frequency (CF) components that correspond to static formants. Some oblique
bars appear at the beginning or ending of prior elements. They correspond to
transitions between formants and they are called frequency modulated (FM)
components. The last elements that can be observed are certain bands of energy at
frequencies above 2 kHz corresponding to noise burst (NB) components. These three
elements are shared by human speech and animal sounds for communication [1].
There exist specific groups of neurons for detecting each one of these components
and combinations of them [2][3][4]. In humans, CF elements correspond to vowels
and vowel-like (e.g. nasal) sounds, stops are identified by FM elements, while
fricatives have characteristic NB components (Figure 1A).

Vowels are steady-state voiced sounds. This fact produces a quasiperiodic waveform
with fixed formants during the vowel duration. If we set up a co-ordinate system using
F1 and F2 as a basis, vowels lie in specific regions. The exact positioning in the F1-F2
space varies with age, sex, language, and from one talker to another, but the overall

clustering pattern does not vary (Figure 1C). However, most of the consonants (stops)
require precise dynamic movements of the vocal-tract articulators for their
production. These articulatory movements make the formant tracks change
continuously, so in the formant spectrum these consonants will be recognized by
different transitions (FM components) to the steady states that identify vowels, and by
NB elements at the beginning, which correspond to the initial scatter of sound
energy. Consonants will be recognized by the transition part of both formants, not the
steady state, which is responsible for discriminating between vowels. But
detecting an ascending/descending transition in formants is not enough to recognize
consonants. We must detect the exact slope of the transition in formants in order to
identify such phonemes. For Spanish [ba], [da] and [ga], the only difference is in the
slope of the transition of formant F2: while the transition in F1 is the same for all of
them, a high slope for the ascending formant F2 identifies [ba], a high slope in the
descending formant identifies [ga], and a low descending transition in F2 represents
[da] (Figure 1D). The same spectrograms but with certain delays in the emission of
voice may be used for the stops [pa], [ta] and [ka] (Figure 1B).

Fricatives are produced by exciting the vocal tract with a steady air stream that becomes
turbulent at a certain point of constriction. This point of constriction is used to
distinguish between different fricatives. In the spectrogram there exist certain NB
elements centred around certain characteristic frequencies.

Fig. 1. Acoustical components of speech signals. Taken from [4].



3. The Human Auditory System

Signals arrive at the cochlea through the outer and middle ear, where no relevant
frequency computation is made; these centers just amplify the signal level. The first
important processing is produced in the basilar membrane, inside the cochlea. The
basilar membrane has cross striations, much like the strings of a piano, and its apical
end is much wider than the basal one, so striations resonate at different
frequencies, but the overall behavior is in the form of a travelling wave, so a single-
frequency stimulation causes a very broad area displacement of the basilar membrane.
Low frequencies are represented by peaks in the apical end of the membrane, while
higher ones are represented towards the basal area in a topologically ordered way. It is
important to note that different frequencies of sound produce different travelling
waves with different peak amplitudes. These peak locations code different frequency
stimuli in the basilar membrane. These peak amplitudes will also excite different hair
cells at different positions in the cochlea, which are responsible for the mechanical-
to-neural transduction process that propagates electrical impulses to higher neural
centers through a tonotopically organized array of auditory nerve fibers. Each auditory
nerve fiber is specialized in the transmission of a different characteristic frequency,
and the rate of the pulses transmitted along these pathways codes not only the frequency
and intensity information, but also certain features of the signal relevant for
discrimination purposes. Fibers with characteristic frequencies below 3 kHz fire in
synchrony with the stimulus, so signals will be coded by the temporal firing. On the
other hand, high-CF fibers lose this phase locking in a linear way, so stimuli will be
coded by the relative position of the peak along the basilar membrane, a place-coding
mechanism.

One important aspect of transduction is adaptation. It is important for sensory
systems to ignore strong and continuous stimuli, and to be very sensitive to small
ones, which will be more relevant for recognition. The set of auditory nerve fibers,
jointly with the vestibular fibers, is grouped in the eighth cranial nerve, which
conducts the set of pulses to the central nervous system. Fibers tend to respond to
each of the spectral components (F0, F1, F2, ...) of speech signals in distinct groups,
and within each group the interpeak intervals represent the period of the corresponding
spectral component [5], so formants will be coded in the peaks of these fibers and
their intervals. The next processing centre is the cochlear nucleus (CN), where
different kinds of neurons are specialized in different kinds of processing: some of
them segment the signals (chopper units), others detect the onset of a stimulus in order
to locate it by inter-aural differences (onset cells), others delay the information to
detect temporal relationships (pauser units), while others just pass the information on
(primary-like units). Also, the CN sends information back to the cochlea in order to
sharpen the response and attenuate and protect the organ of Corti from overstimulation.
Its last function is to feed information forward to the Olivary Complex, where sounds
are located by interaural differences, temporal or intensity ones depending on the
frequency perceived. The Inferior Colliculus (IC) is organized in spherical layers, with
isofrequency bands orthogonal to one another. Certain delay lines of up to 12 msec are
created in its structure, and its function is to detect temporal elements coded in
acoustic signals (CF and FM components). This center sends information to the
thalamus (medial geniculate nucleus), which acts as a relay station for prior
representations (some neurons exhibit delays of a hundred milliseconds, storing in this
way some delayed information which may be used for detecting sequential
acoustical events), and there also exist neurons sensitive to noisy stimuli. Finally,
synaptic plasticity has been detected in this center, which permits the labelling of the
units with their perceptual meanings. A considerable amount of input to the thalamus
comes from the cortex, implementing in this way circuit recurrence and forming
neural attractors.

The high-level processing is done in the cortex. Different anatomical experiments
have confirmed the existence of coarse specialized areas organized according to their
responsibility for processing information coming from the sensory receptors (visual,
auditory, etc.) and for generating the information for different actuators (speech, eye
movement, motor functions, etc.). That is, it seems that the neural tissue in the brain is
organized as ordered feature maps [6] according to its sensory specialization.

The exact location of the area in the human cortex responsible for speech
processing and understanding is not well defined, due to the fact that the subjects of
experimentation have mainly been animals such as cats, bullfrogs, squirrel monkeys,
guinea pigs, goldfish, etc. There have been detected in cats some neurons that fire with
a certain slope of frequency transitions (FM elements) [2], in the macaque some
neurons that respond to specific noise bursts (NB components) [3], and also, in
auditory cortex, some neurons which are able to detect combinations of these elements
(CF-CF, FM-FM, with different delays between them) [4]. The information that is
used by bats for locating objects may be used by human beings for communication,
because they share the same neurobiological principles. Finally, signals arrive at
Wernicke's area, where the possibility of a word or concept map has been speculated
upon from the observation of different cognitive dysfunctions.


Fig. 2. Hierarchical Auditory Processing



4. A Hierarchical Speech Recognition System Based on Auditory Processing

The system consists of two main modules adopted from standard automatic speech
recognition systems: a parametric extraction module and a recognition module. Each
module is composed of bioinspired algorithms, which provide the desired
functionality at an acceptable computational cost while following the biological
processes very closely. The parametric extraction module incorporates a cochlear
model based on gammatone filtering [7], which supplies the frequency analysis and
the temporal response observed in physiological registers; a mechanical-to-neural
transduction process based on Meddis' hair cell model [8], which includes adaptation,
compression and half-wave rectification processes and even the loss of phase locking
with stimulus components for fibers with high characteristic frequencies; a temporal
integration stage that emphasizes static components, aligning the different fibers'
information colored by the cochlear transmission delays; and a component extraction
module which uses a spatio-temporal strategy for obtaining the robustness and level
independence provided by temporal approaches and the energy estimation achieved
by spatial methods [9].
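A minimal FIR sketch of one gammatone channel follows; the AIM software used
here [7] implements a recursive filterbank, so this direct sampling of the
impulse response is only an approximation, and the sampling rate and duration
are our own assumptions:

import numpy as np

def gammatone_ir(fc, fs=16000, dur=0.025, order=4):
    """Sampled impulse response of a gammatone filter centred at fc (Hz):
    g(t) = t^(n-1) exp(-2 pi b ERB(fc) t) cos(2 pi fc t), with b ~ 1.019."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)          # equivalent rectangular bandwidth
    g = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb * t) * np.cos(2 * np.pi * fc * t)
    return g / np.abs(g).sum()                       # crude gain normalization

# one cochlear section per channel, e.g.:
# outputs = [np.convolve(sig, gammatone_ir(fc), mode='same') for fc in center_freqs]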

The recognition module consists of a time-delay self-organizing map [10], which
groups and classifies in an automatic way the different component combinations
provided, capturing the spectro-temporal variability of speech signals in its structure.
The modular and hierarchical system design allows each module to be validated
separately, and permits a more efficient evolution, because any new published
algorithm can be used just by inserting it in the appropriate module without affecting
the whole hierarchical system.
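A sketch of one update step of such a time-delay self-organizing map, where the
input is a concatenation of delayed feature frames so that temporal context is
stored in the weights; map size, learning rate and neighbourhood width are
illustrative:

import numpy as np

def tdsom_step(weights, frames, lr=0.1, sigma=1.0):
    """One time-delay SOM update. weights: (rows, cols, n_frames * frame_dim);
    frames: list of consecutive (delayed) feature vectors."""
    x = np.concatenate(frames)                       # delayed frames form one input vector
    dist = np.linalg.norm(weights - x, axis=2)       # distance of x to every map unit
    bmu = np.unravel_index(dist.argmin(), dist.shape)
    rows, cols = np.indices(dist.shape)
    h = np.exp(-((rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2) / (2 * sigma ** 2))
    weights += lr * h[..., None] * (x - weights)     # neighbourhood-weighted pull toward x
    return weights, bmu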
Fig. 3. The proposed bio-inspired speech recognition system



5. Results

To test the parametric extraction module, synthetic speech was used instead of real
speech in order to assess precisely the accuracy of this stage. Initially, vowels
were generated because of their static spectrum. All of them had the same
fundamental frequency F0 = 100 Hz and only the first two formants were included. As
an example, results for the vowel /a/ are shown. The vowel /a/ was generated using
F1 = 640 Hz and F2 = 1190 Hz. The first stimulus was a synthetic /a/ under clean
conditions. The same vowel was then corrupted with white noise at 100% of the vowel
level.

Figure 4 shows the interspike interval histogram of the fibers' response. For the clean
vowel /a/ (left), three peaks appear. The lowest one is centred on 10 msec, so it
codes a fundamental frequency F0 of 100 Hz. The second one is located around 1.65
msec, which corresponds to 645 Hz, and the highest one is at 0.85 msec, or 1176
Hz. These values approximate with certain accuracy the formants provided. On the
other hand, it can be observed in the noisy vowel histogram that white noise is
transformed into a high-frequency component, due to the mechanical cochlear
filtering, observed as a huge peak on the left of the temporal axis. The other three
peaks (marked with arrows) lie at the same locations as the clean data peaks, with
lower magnitude caused by the noise inserted. The extraction ignores temporal peaks
lower than 0.3 msec (high-frequency components will be detected by spatial methods,
identifying them as NB elements), so the formant positions (CF) are not affected
by white noise when using the bioinspired parametric extraction module.
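A sketch of how such an interspike interval histogram can be computed and read
out, for a single fiber's spike train; in the experiments the intervals are
pooled over fibers, and the names and bin sizes here are our own assumptions:

import numpy as np

def isi_peak_to_hz(spike_times_ms, bin_ms=0.05, min_ms=0.3, max_ms=12.0):
    """Interspike interval histogram; a peak at interval T (ms) indicates a
    spectral component near 1000/T Hz. Intervals below min_ms are ignored,
    as described in the text."""
    isis = np.diff(np.sort(np.asarray(spike_times_ms, dtype=float)))
    isis = isis[isis >= min_ms]
    edges = np.arange(0.0, max_ms, bin_ms)           # 12 ms covers F0 = 100 Hz
    hist, _ = np.histogram(isis, bins=edges)
    peak_ms = edges[hist.argmax()] + bin_ms / 2      # centre of the tallest bin
    return 1000.0 / peak_ms, hist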

Fig. 4. Interspike interval histogram of a clean (left) and a noisy (right) vowel /a/.

The system's precision for dynamic stimuli was checked using synthetic
consonants. The results for the Spanish phonemes /b/ and /g/ are shown in Figure 5.
The first formant consists of an ascending frequency from 500 to 700 Hz during the
first 10 msec until reaching the steady state. The second formant varies from 900 to
1200 Hz for phoneme /b/ and decreases from 2000 to 1200 Hz for phoneme /g/. The
third formant provided slides from 2100 to 2400 Hz for phoneme /b/ and from 3200 to
2400 Hz for phoneme /g/. This data is the same that Secker used for his analysis of
auditory fibers [5] and matches the plosive behaviour described in section two.

The extraction was performed every 5 msec using a 16 msec overlapping window.
The figure shows the provided formants (solid lines) together with the formant
estimations obtained by the bioinspired parametric extraction module (the first formant
is dotted with diamonds, the second with crosses and the third with squares). Sharp
precision can be observed for all estimations. The deviations occur mainly in the
transition from the slope to the steady state, and they are caused mainly by the temporal
integration stage and the loss of synchrony (phase locking) with the stimulus for
high-characteristic-frequency fibers. The estimation thus tracks the temporal evolution
of the components of dynamic stimuli with high accuracy.

Fig. 5. Formant estimations for synthetic phonemes /b/ (left) and /g/ (right).

One important aspect of this method is computational cost. In the prior figures, 75
sections were used in each stage. If we obtain the interval histogram using only 20
sections, the peaks lie at the same locations. This permits reducing the complexity of
the model by 2/3 without losing accuracy or robustness.

For the recognition module, static and dynamic phonemes were used to test the
incorporation of the spectro-temporal variability into the structure (weights) of the
time-delay self-organizing map. In the vocalic map, each vowel lies in a specific
region, placing very closely those which share certain characteristics (/i/-/e/ and
/o/-/u/) and very distant those classes with severe dissimilarities (/a/-/i/-/u/, which are
the vowels that form the vocalic triangle). The dynamic map consists of the
distribution of the plosives /b/, /d/ and /g/. Their representation is more complex than
the previous one, because there exist phonemes with more than one area in the map
(phoneme /d/) and there exist certain areas (lower left row) which fire with two
different phonemes (/d/ and /g/), due to the similarity of their spectral representations.

To analyze how the map incorporates the spectro-temporal variability into its
structure, the spectrogram of each phoneme was compared with the visual evolution
of the weights of the unit labeled with this phoneme in the map. The three delayed
components were aligned in consecutive columns, with NB elements (estimation of
the energy) on the upper side. Figure 6 shows the spectrogram and the visual analysis
of the weights of unit /a/. It can be seen that the static behavior of the components is
reflected in the weight evolution. Phoneme /a/ has its first two formants in the lower
part of the spectrum, while the third is on the upper side. This is coded in the unit
weights with static information on the lower side and a band reflecting the third

Fig. 6. Spectrogram of phoneme /a/ with the visualization of the weights of its unit in the map.

However, phoneme /b/ has a dynamic behavior, with its first two formants slightly
ascending in the spectrum. This dynamic behavior is reflected in the unit weight
evolution seen in Figure 7, where the first two CF components ascend while the third
is static. Some energy can also be observed in the NB elements, related to the
plosive characteristic of this kind of phoneme, capturing the initial plosion.

Fig. 7. Spectrogram of phoneme /b/ with the visualization of the weights of its unit in the map.

The global results, compared with other similar models (Payton [11] and Patterson
[7]), are equivalent for static speech: recognition for vowels, fricatives and nasals was
about 90, 70 and 50% respectively, while for dynamic speech performance increased
compared with the other two models: for plosives and glides it was about 50 and 65%.
The inclusion of temporal delays in the network increases the discriminative
performance for dynamic data.

6. Conclusions

The obtained results show the system's precision in the estimation of static, dynamic
and noisy components, its robustness in the extraction of corrupted phonemes, and its
computational efficiency, achieved just by using a limited number of sections in the
model. The recognition module is independent of the order of the provided data, and
it captures the spectro-temporal variability of the speech components in its weight
structure, obtaining recognition rates which match other similar models for static
phonemes, while improving their results for dynamic data.

The bio-inspired approach will allow the construction of new engineering systems
from a biological point of view, and the proposed models could be used for validating
hypotheses about the functionality of the different neural centers involved.

Acknowledgements

We would like to thank Dr. Roy Patterson for providing AIM software. This research is funded
by NATO CRG-960053

References
1. Suga, N.: "Basic Acoustic Patterns and Neural Mechanisms Shared by Humans and Animals
for Auditory Perception: A Neuroethologist's View". Proceedings of the Workshop on the
Auditory Bases of Speech Perception, ESCA, pp. 31-38, July 1996.
2. Mendelson, J.R., Cynader, M.S.: "Sensitivity of Cat Primary Auditory Cortex (AI) Neurons to
the Direction and Rate of Frequency Modulation". Brain Research, 327, pp. 331-335, 1985.
3. Rauschecker, J.P., Tian, B., Hauser, M.: "Processing of Complex Sounds in the Macaque
Nonprimary Auditory Cortex". Science, vol. 268, pp. 111-114, 7 April 1995.
4. Suga, N.: "Cortical Computational Maps for Auditory Imaging". Neural Networks, 3, pp.
3-21, 1990.
5. Secker, H. and Searle, C.: "Time domain analysis of auditory-nerve fibers firing rates". J.
Acoust. Soc. Am. 88 (3), pp. 1427-1436, 1990.
6. Schreiner, C.E.: "Order and Disorder in Auditory Cortical Maps". Curr. Op. Neurobiol., 5,
pp. 489-496, 1995.
7. Patterson, R.D., Anderson, T.R., Allerhand, M.: "The Auditory Image Model as a
Pre-processor for Spoken Language". ICSLP, pp. 1395-1398, 1994.
8. Meddis, R.: "Simulation of mechanical to neural transduction in the auditory receptor". J.
Acoust. Soc. Am. 79 (3), pp. 702-711, 1986.
9. Ferrández, J.M.: "Estudio y Realización de una Arquitectura Jerárquica Bio-Inspirada para el
Reconocimiento del Habla". Ph.D. Thesis, Universidad Politécnica de Madrid, June 1998.
10. McDermott and Katagiri: "Shift-Invariant Multicategory Phoneme Recognition using
Kohonen LVQ2". Proceedings of ICASSP, pp. 81-84, Glasgow, 1989.
11. Payton, K.L.: "Vowel processing by a model of the auditory periphery: A comparison to
eighth-nerve responses". J. Acoust. Soc. Am. 83 (1), pp. 145-162, January 1988.
A Neural Network Approach for the Analysis of
Multineural Recordings in Retinal Ganglion Cells

Ferrández J.M. 1, Bolea J.A. 1, Ammermüller J. 2, Normann R.A. 3, Fernández E. 1

1 Instituto de Bioingeniería, U. Miguel Hernández, Alicante, Spain,
2 Dept. Neurobiologie, Univ. Oldenburg, Germany,
3 Dept. Bioengineering, University of Utah, Salt Lake City, USA,
Corresponding Author: jm.ferrandez@umh.es

Abstract. In this paper the coding capabilities of individual retinal ganglion
cells are compared with the coding capabilities of small populations of
cells using different neural networks. This approach allows not only the
identification of the most discriminating cells, but also the detection of the
parameters that are most important for the discrimination task. Our results show
that the spike rate, together with the exact timing of the first spike at light-ON,
were the most important parameters for encoding stimulus features.
Furthermore, we found that whereas single ganglion cells are poor classifiers of
visual stimuli, a population of only 15 cells can distinguish stimulus color and
intensity reasonably well. This demonstrates that visual information is coded as
the overall set of activity levels across neurons rather than by single cells.

1. Introduction
Our perception of the world, our sensations of light, color, music, speech, taste and
smell, are coded as raw data by the peripheral sensory systems and sent, by the
corresponding nerves, to the brain, where this code is interpreted and colored with
emotions. The raw or binary sensory data consist of sequences of identical voltage
peaks, called action potentials. Seeing implies decoding the patterns of spike trains
that are sent to the brain, via the optic nerve, by the visual transduction element, the
retina. Thus, the external world's object features, such as size, color and intensity, are
transformed by the retina into a myriad of parallel spike sequences, which must
describe with precision and robustness all the characteristics perceived.
Understanding this population code is, nowadays, a basic question for visual science.

Understanding the code means quantifying the amount of information each cell
carries, and studying the possible parameters that are used by the cells for transmitting
the data. The system has to assign meaning to this population code. Thus, for a given
pattern of action potentials, the brain has to estimate the stimulus that has produced it.
The encoding has to be unequivocal and fast in order to ensure object recognition for
any single stimulus presentation.

A considerable number of studies have focused on single ganglion cell responses
[1][2]. Traditionally, the spiking rate, or even the spontaneous firing rate, has been
used as an information carrier due to its close correlation with the stimulus intensity in
all sensory systems [3][4]; however, single neurons produce only a few spikes in
response to different presentations and they must code a huge spectrum in their
firings. The exact temporal sequence of action potentials in only one cell may also
code the main stimulus features, as occurs in other systems (e.g. auditory coding [5]);
however, the response of a single cell to repetitions of the same stimulus often shows
considerable variability across presentations and cannot unequivocally describe
the stimulus. Furthermore, the timing sequence differs not only in the times of events
but also in the number of spikes, producing uncertainty in the decoding. Finally, the
same sequence of neural events may be produced by different stimuli, introducing
ambiguity in the neural response. It is therefore a complex task to "understand" the
neural coding just by analyzing a single ganglion cell's response.

New recording techniques and the emergence of new electrode array technologies
allow simultaneous recordings from populations of neuronal cells. However, there are
still many difficulties associated with collecting and analyzing activity from many
individual cells simultaneously. FitzHugh [6] proposed a statistical analyzer that,
applied to the neural data, estimates the characteristics of the stimulus. Different
approaches have been used in the construction of such a decoder, including
information theory [7], linear filters [8], and discriminant analysis [9].

In this paper we used two different artificial neural networks, one trained by
backpropagation and the other based on self-organizing maps, to estimate how an
ensemble of retinal ganglion cells can encode the characteristics of the light incident
on the retina. Our results show that artificial neural networks are useful tools for
analyzing multineuronal recordings and that visual information is coded as the overall
set of activity levels across neurons rather than by single cells.

2. Methods

Experiments were performed on isolated turtle (Trachemys scripta elegans) retinas.

Retina isolation has been described in detail before [9]. Briefly, the turtle was
sacrificed by decapitation conforming to ECC rules. The eye was then enucleated and
hemisected under dim illumination, and the retina was removed under bubbled Ringer
solution, taking care to keep the photoreceptor outer segments intact. The retina
was then placed flat onto a beam splitter with the photoreceptor side facing down (Figure
1).

Light stimuli were produced from a tungsten lamp. Flashes with a duration of 0.2
seconds, followed by a 0.24 second period of darkness, were used as typical stimuli.
Wavelength selection (400, 450, 488, 514, 546, 577, 600, 633 and 694 nm) was
achieved with narrow band filters, and intensities were controlled with neutral density
filters. Different spot sizes (ranging from 0.195 to 2.6 mm) were also used throughout
this study in order to learn how well recordings from a network of ganglion cells
could be used to predict the shape, color and intensity of the visual stimulus. Each set
of stimuli was presented 7 times. Responses were amplified with a differential
amplifier and stored in a Pentium-based computer. A custom analysis program
sampled the incoming data at 20 kHz, plotted the waveforms on the screen, and stored
the record for later analysis.

(Figure: experimental setup, with labels for the Ringer solution, the retina on the beam splitter, the light stimulus, and the acquisition computer.)

Fig. 1. Recording method

Extracellular multielectrode recordings were performed using the Utah


microelectrode array (UEA). It consists of an array of 100 isolated, 1.5 mm long
silicon needles with platinized tips. The electrodes are arranged in a regular square
pattern, spaced 400 microns from each other. In each experiment we recorded
neural activity from about 80-90 electrodes. We selected the electrodes with the
highest signal-to-noise ratio and isolated 1 to 2 units from the multiunit responses by
setting the thresholds to high levels [9]. For each electrode and presentation, the time
of the first and second spike, the number of spikes, and the interspike interval during
light-ON were stored for further analysis. Figure 2 shows an example of
simultaneously recorded responses from 15 electrodes to 8 consecutive flashes of 546
nm, 2.6 mm diameter.

3. Analysis

In this study we selected the signals from those electrodes that had the highest signal-to-noise
ratios. In general, multi-unit signals were obtained from most of the
electrodes, and often single-unit separation was difficult, so we selected those 13-15
prototypes which were unequivocal in terms of both amplitude and shape. For each
electrode, a 4-element vector was constructed using the number of spikes, the relative
time of the first and second spike, and the interspike interval of these firings. A 60-element
vector (4 variables × 15 cells) was used as the input matrix for our different
neural network approaches.
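As an illustration, the assembly of such an input vector could look like the following sketch (Python; the function names and the spike-train representation are our assumptions, not the authors' code):

    import numpy as np

    def response_vector(spike_times):
        """4 firing parameters for one electrode and one presentation:
        number of spikes, time of first and second spike, and the
        interspike interval between them (0 if fewer than 2 spikes)."""
        t = np.sort(np.asarray(spike_times))
        n = len(t)
        t1 = t[0] if n >= 1 else 0.0
        t2 = t[1] if n >= 2 else 0.0
        return [n, t1, t2, t2 - t1 if n >= 2 else 0.0]

    def input_vector(population):
        # Concatenate the 4 parameters of each of the 15 cells -> 60-element vector
        return np.concatenate([response_vector(cell) for cell in population])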

Two different neural networks were used. The first one was a three-layer
backpropagation network [10], with 20 nodes in the hidden layer. The output layer consisted of
as many neurons as classes to be recognized. Using this architecture,
each neuron of the output layer fires only for a certain stimulus, while the rest of the
output neurons have no activation (winner-take-all network). The
activation function used for all neurons, including the output layer, was the hyperbolic
tangent sigmoid transfer function given by:

f(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (1)

using as initial momentum and adaptive learning rate the default values of the
Matlab Neural Network Toolbox. The initial weights were randomly initialized
and the network was trained to a sum-squared-error goal of 1, in order to provide
more generality in the estimation stage.
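For illustration, a minimal sketch of this architecture and training loop in Python/NumPy follows (the original work used the Matlab Neural Network Toolbox; the layer sizes, transfer function and error goal come from the text, while everything else, including the plain gradient-descent update without momentum, is an assumption):

    import numpy as np

    def tansig(x):
        # Hyperbolic tangent sigmoid of Eq. (1): 2 / (1 + e^(-2x)) - 1 == tanh(x)
        return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

    def train_bp(X, T, hidden=20, lr=0.01, sse_goal=1.0, max_epochs=5000, seed=0):
        """X: (n_samples, 60) firing parameters; T: (n_samples, n_classes) targets,
        one output neuron per stimulus class (winner-take-all after training)."""
        rng = np.random.default_rng(seed)
        W1 = 0.1 * rng.standard_normal((X.shape[1], hidden))
        W2 = 0.1 * rng.standard_normal((hidden, T.shape[1]))
        for _ in range(max_epochs):
            H = tansig(X @ W1)                  # hidden layer
            Y = tansig(H @ W2)                  # output layer
            E = T - Y
            if np.sum(E ** 2) <= sse_goal:      # SSE goal of 1, as in the text
                break
            dY = E * (1.0 - Y ** 2)             # derivative of tanh
            dH = (dY @ W2.T) * (1.0 - H ** 2)
            W2 += lr * (H.T @ dY)               # plain gradient descent
            W1 += lr * (X.T @ dH)               # (momentum/adaptive rate omitted)
        return W1, W2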

The other network used was the Kohonen supervised Learning Vector Quantization
(LVQ) [11], with 16 neurons in the competitive map and a learning rate of 0.05. This
is a competitive network, in which the neurons with weights most similar to the
input increase their strength in response to that input, while the rest of the nodes,
except those in a close neighborhood, are weakened. This establishes a topological relation in the
map. The main advantage of using learning vector quantization is that it takes less
time to reach the convergence criterion.
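A minimal sketch of the competitive update follows (Python; this implements the basic LVQ1 rule and omits the neighborhood mechanism described above, so it illustrates the idea rather than reproducing the authors' exact network):

    import numpy as np

    def train_lvq(X, labels, n_codebook=16, lr=0.05, epochs=100, seed=0):
        """LVQ1 rule: the winning codebook vector moves toward the input when
        its class matches the target, and away from it otherwise."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=n_codebook, replace=False)
        W, cls = X[idx].copy(), labels[idx].copy()   # codebook vectors and classes
        for _ in range(epochs):
            for x, y in zip(X, labels):
                k = np.argmin(np.sum((W - x) ** 2, axis=1))   # competitive winner
                step = lr * (x - W[k])
                W[k] += step if cls[k] == y else -step
        return W, cls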

Once the network was trained, estimation with extended data was performed, and the
correlation coefficients between the stimuli and their estimations were computed.
Other studies use their own measures, such as mutual information [8], to assess the
overall quality of the reconstruction, but there is no common agreement
on the measure that best estimates the goodness of the prediction.
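In code, the score used throughout the following figures could be computed as simply as this (a sketch; `estimation_score` is a hypothetical helper name):

    import numpy as np

    def estimation_score(stimuli, estimates):
        """Correlation coefficient between the true stimulus values and the
        network's estimations, the goodness-of-prediction measure used here."""
        return np.corrcoef(stimuli, estimates)[0, 1]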

4. Results

For many stimulus conditions and many cells only a few spikes were produced in
response to light-ON. Figure 2 shows a raster plot of the response time stamps of 15
cells to several identical presentations of a full-field flash, using a wavelength of 546 nm,
log relative intensity = -0.5. The stimulus is indicated in channel 1, so that 8 different
flashes are shown. It can be seen that most of the cells are ON-OFF, and that they
only fire a few spikes in response to the stimulus. Another characteristic is that, for a
given cell, different presentations of the same, identical stimulus evoke different
responses. These responses differ not only in the number of spikes but also in their
relative timing, manifesting variability in their spiking behavior. This variability
produces uncertainty in recognizing the right stimulus using only one individual cell,
because there is no unequivocal function that associates the firing variables with the
provided visual information.

Ambiguity is another noticeable aspect: a single cell can have exactly the same
response to different stimuli, making the stimulus estimation task much more
difficult. These aspects are presented in detail in Ammermüller et al. [9], and they
point to population coding as the strategy used to represent information in the visual
system.

Fig. 2. Multielectrode response raster plot.

Figure 3 shows the correlation between the output of a trained backpropagation neural
network and the correct stimuli, which in this case consisted of 8 different intensities.
The three wavelengths chosen were those where discrimination by the population was
worst (633 nm), intermediate (546 nm) and best (450 nm). It can be seen that the
scores show variability depending on the cell and the wavelength studied. On average,
all single cells were far below ideal discrimination, although to a varying degree. The
cells with the highest estimation scores were cells 8, 10, 11 and 12. On the other hand, the
performance of all the cells taken together ("All" column) exceeded 0.95 for all
wavelengths.


Fig. 3. Intensity estimation scores for individual cells and for all the
cells taken together ("All" column) using a BP network

Color estimation is more complex, and the estimation rates for single cells were
considerably lower (Figure 4). For this kind of study the intensity was fixed and
we asked the network to correctly classify nine different wavelengths. Again the
population discrimination was fairly good, with correlation coefficients ranging from
0.95 to 0.97, values that clearly surpassed all the individual cell coefficients for all
kinds of stimuli.

Fig. 4. Color estimation scores for individual cells and for all the
cells taken together ("All" column) using a BP network

To validate the above-mentioned results, the same data was presented to another
kind of neural network, a supervised learning vector quantization (LVQ) with 20
nodes in the competitive layer. This network converged faster than the
backpropagation (BP) network, and again the cells with the highest estimation scores were
cells 8, 10, 11 and 12. The results obtained by using all the cells together were nearly
the same as those obtained using the BP algorithm (Figure 5).

Fig. 5. Intensity estimation scores for isolated cells and for all the cells
taken together ("All" column) using an LVQ network

The wavelength discrimination using competitive networks behaved similarly to the
prior feedforward network. Lower estimation scores were obtained, even for the whole
population. This could be due to the difficulty the network has in fixing a decision
border which divides the different clusters; however, these values were clearly higher
than the correlation rates obtained by using only individual cells (Figure 6).
Fig. 6. Color estimation scores for individual cells and for all the
cells taken together ("All" column) using an LVQ network

In order to get some insight into the relative importance of each of the variables
for the discrimination task, we used a BP network with 20 nodes in the hidden layer.
The input to this network was only the spike rate (N), only the timing of the first spike
(T1), only the timing of the second spike (T2), or only the time difference between
spike one and spike two (Interval), for the entire population of 15 cells. We also used
all these parameters taken together. Figure 7 shows the correlation indexes between
the real stimuli and the network estimations. The spike rate (N in figure 7) was the most
important variable, followed by the exact timing of the first spike (T1). The timing of
the second spike (T2) and the interspike interval carried less information, and were
poor coding elements. When all the variables from the ensemble of cells were used
the correlation coefficients were close to 1.
(Figure: bar chart of estimation scores for N, T1, T2, Interval and All; separate bars for 633, 546 and 450 nm.)
Fig. 7. Intensity estimation scores for the population using different variables

Basically the same results were obtained for color discrimination, although the overall
performance was not as good as in the case of intensity discrimination (Figure 8).

(Figure: bar chart of estimation scores for N, T1, T2, Interval and All.)
Fig. 8. Color estimation scores for the population using different variables

5. Conclusions

In this paper, a connectionist method has been used to investigate how color and
intensity can be estimated from single cells and from populations of retinal ganglion
cells. Two different neural networks, a feedforward backpropagation network and a competitive
LVQ, have been used for determining the coding capabilities of individual cells
versus a group of neurons. The correlation between the estimations of the networks and
the real stimuli was used for quantifying the transmitted information. Both networks
indicate that the brain could potentially deduce reliable information about stimulus
features from the response patterns of ganglion cell populations, but not from single
ganglion cell responses.

The spike rate together with the exact timing of the first spike at light-ON were the
most important parameters encoding stimulus features, as has been shown for
different systems [4,5]. The fact that the number of spikes, or the first spike's relative
timing, obtains the same estimation index as all the parameters together, approximately
0.95, could imply redundancy in the transmitted information, and could be related to
the robustness of the data transfer inherent to this system.

A more refined data set will help to provide more accuracy in our analysis. Thus,
new physiological techniques which decrease the level of background noise in the
recorded responses, and an efficient separation of the action potential prototypes
recorded with a single electrode [12], will help to isolate the firing parameters from
artifacts which contaminate our present recordings.

Finally, while the quality of the different coding parameters could be assessed by
using this neural network approach, we have no idea whether the brain indeed focuses
on these variables. Once the visual code is understood, the construction of spiking
retina models which reflect the physiological recordings with accuracy will become
possible.

Acknowledgements
This research is being funded by CICYT SAF98-0098-CO2-02, DFG SFB 517 and
NSF grant #IBN 9424509.

References

1 Ammermüller J., Kolb H.: "The Organization Of The Turtle Inner Retina. I. ON- and OFF-
Center Pathways." J. Comp. Neurol. 358(1), pp. 1-34, 1995.

2 Ammermüller J., Weiler R., Perlman I.: "Short-term Effects Of Dopamine On Photoreceptors,
Luminosity- And Chromaticity-Horizontal Cells In The Turtle Retina." Vis. Neurosci. 12(3),
pp. 403-412, 1995.

3 Kuffler S. W.: "Discharge Patterns And Functional Organization Of Mammalian Retina". J.
Neurophysiol. 16, pp. 37-68, 1953.

4 Berry M. J., Warland D. K., Meister M.: "The Structure and Precision of Retinal Spike
Trains" Proc. Natl. Acad. Sci USA 94(10), pp. 5411-5416, 1997.

5 Secker H. and Searle C.: "Time Domain Analysis of Auditory-Nerve Fibers Firing Rates". J.
Acoust. Soc. Am. 88 (3) pp. 1427-1436, 1990.

6 Fitzhugh, R. A.: "A Statistical Analyzer for Optic Nerve Messages". J. Gen. Physiology 41,
pp. 675-692, 1958.

7 Rieke F., Warland D., van Steveninck R., Bialek W.: "Spikes: Exploring the Neural Code".
MIT Press. Cambridge, MA, 1997.

8 Warland D., Reinagel P., Meister M.: " Decoding Visual Information from a Population of
Retinal Ganglion Cells". J. Neurophysiology 78, pp. 2336-2350, 1997.

9 Ammermüller J., Fernández E., Ferrández J.M., Normann R.A.: "Color and Intensity
Discrimination of Retinal Ganglion Cells in the Turtle Retina". J. Physiology (under revision).

10 McClelland J., Rumelhart D.: "Explorations in Parallel Distributed Processing", vol. 1 and 2.
MIT Press, Cambridge MA, 1986.

11 Kohonen T.: "Self Organization and Associative Memory" vol. 8. Springer Series in
Information Sciences, Springer-Verlag NY, 1984.

12 Ohberg F., Johansson H., Bergenheim M., Pedersen J., Djupsjobacka M.: "A Neural Network
Approach to Real-Time Spike Discrimination during Simultaneous Recordings from several
Multi-Unit Filaments." Journal of Neuroscience Methods 64 (1996), pp. 181-187.
Challenges for a Real-World Information
Processing by Means of Real-Time Neural
Computation and Real-Conditions Simulation
J.C. Herrero
Software AG
c/Ronda de la Luna, 4. - 28760 Tres Cantos (MADRID), SPAIN
e-mail: jcherrer@arrakis.es

Summary
Should we consider the dimensions of natural neural computation as they are known as a result of
scientific research, we realize there is a long tomorrow before us, interested in neural computation, for the
simple reason that we can only handle a relatively low number of units and connections nowadays. All
along this century we have significantly improved our knowledge of natural neural nets, coming to realize that
huge number of cells and connections and beginning to understand some of the brain's signal processing and the
repetitive structures which support it. However, even in the most developed cases, such as auditory
pathway modelling, there is no neural computational device which can involve a real-time response and
follow the facts already known or plausibly postulated about some brain processes (e.g. by McCulloch and
Pitts), with the unavoidably great number of processing elements involved too; besides, neither suitable
models regarding those kinds of real-look nets have been designed, nor their corresponding real-conditions
simulations have been carried out. That means there is a lack of connectionistically computable models and
also of reduction methods by which we can obtain a connectionistic implementation design, given the
knowledge level model.
Therefore, we would like to ask: what is within reach? In order to answer this question we are going
to present a restricted auditory pathway modelling case, where we shall be able to see the realistic
challenges we are facing up to. By trying to propose a consistent implementation for it, based on parallel,
modular, distributed and self-programming computation, we shall see the kind of methods, equipment,
software and simulations required and desirable.

1. Introduction.
Natural neural computation is a result of natural evolution. Following Charles
Darwin [Darwin 1859], natural neural things are there because they represent the most
feasible way for them to be in their environment, as a result of such evolution, and in such
terms we may try to understand them. Natural neural nets have evolved to process real-world
information. Thus, when we try to understand what those nets are for, why those
connections, shapes and quantities, it may be helpful to begin with real-world
information processing modelling and then construct the corresponding circuits whose
input is some interaction with physical events which happen in the real world
[Churchland 1992] [Hawkins 1996].
We have discovered that the computational features of one neuron are not as simple as
some modelling trends pretended and, unlike the models we are accustomed to managing,
the structures involving neurons are very complex and consist of a huge number of
components [DeFelipe 1997] (that should not be surprising, since it is about the number
of cells in a body), as complex as the computation they carry out [Moreno-Diaz 1997,
1998]. However, the more we know about them and their functionalities, the more it
seems this is the better way of processing the involved signals [Mira&Delgado 1995a,
1997], and the more we wish to emulate their features [Churchland 1992] [Hawkins 1996].
Thus, parallel, modular, distributed, and self-programming computation appears in
the pathway. However, it is no less true that this processing has to be understood in a causal
relationship with well-known facts at a different level [Mira&Delgado 1995a, 1995b,
1997], i.e. the knowledge level [Marr 1982] [Newell 1981]. For instance, in auditory
pathway modelling, the whole signal processing is causally related to the auditory sensations
and perceptions (that some person has), like timbre and chord recognition, i.e. the
psychoacoustics [Delgutte 1996]. This causality between levels is admitted although
unknown in the case of natural neural nets (in the brain), and completely impossible in
artificial neural nets nowadays.
This does not mean one cannot eventually model the whole auditory pathway in the
future, make either a design or a physical implementation, and objectively interpret the
results in terms of well-known components of the input that the circuit would receive
from the real world; rather, this is the connectionistic long-term aim. In the meantime, in
this paper we are going to present a restricted auditory pathway synthesis modelling for
timbre recognition, processing real-world information by emulating the features of natural
neural nets, basically those of parallel, modular, distributed and self-programming
computation, which means real-time processing too. Such a synthesis modelling
exercise will show us the magnitude of the problem, even in a restricted case like this, and
it will suggest the kind of tools we would need in order to tackle this kind of
problem, as well as the necessity of explicit reduction methods which should eventually
be used in order to obtain the computational design, given the knowledge level model of
analysis, as in the symbolic computation counterpart [Mira 1997, 1998] [Herrero 1998,
1999]; methods which explicitly justify the causal relationship between computational
levels.
It is usually said that the aim of computational analysis should be the development of
formal models, sufficiently explicit, internally consistent and complete, which
conceptual models, in natural language, are not [Hawkins 1996] [Benjamins 1997].
However, this must not mislead us. Firstly, because besides the computational analysis aim
we cannot forget there is a computational synthesis aim, or else there would not be any
computational aim at all. Secondly, because there are two kinds of causality we cannot
forget either: the model's causality and the reduction method's causality. As to the
former, formalisms are intended to express things in a powerful manner [Russell 1959]
[Whitehead 1913] and they have their own formal causality, based on abstract
relationships (usually mathematical ones) for handling abstract entities (like elements of
sets, etc.). Of course, the formalism can be expressed in natural language, albeit in a long-winded
way, or we could never understand what it means; but what it means properly
has to do with those abstract entities and relationships, which have nothing to do
with the real problem's causality and entities, unless someone interprets the formalism in
these other terms. Then, on the one hand, this knowledge level interpretation of the
precise descriptions of the formalism can be expressed in natural language, and therefore
the fact that conceptual models are imprecise is rather a custom than an intrinsic
characteristic. But, on the other hand, if knowledge level models talk about facts of a real
world, this disables the possibility of any formalism, completeness, internal consistency,
etc., at least at the knowledge level, since those facts are the only possible justification of
the relationship between the model entities; we can always ask some "why?" about the
model whose only possible answer is "because of the observed facts", and then it is a
scientific model. As to the reduction method's causality, we have to justify why the
implementation level model has to do with the knowledge level model, and we must
explain it explicitly. While it is possible to describe reduction methods for obtaining the
program code, given the knowledge level description of a problem [Mira 1997, 1998]
[Herrero 1998, 1999], analogous reduction methods are not yet available for
connectionistic implementations. Anyway, the interpretation of the (electronic level of
the) implementation in terms of the knowledge level is arbitrary in any computation
[Mira&Delgado 1995a] [Mira 1995], unless the implementation has objective
relationships with real-world events [Turing 1950]. Besides, we find intelligence in living
creatures as a result of evolution, brought about by the interaction between those
creatures' lineage and their environment [Darwin 1859], i.e. the real-world facts, so there
must be a causal relationship between knowledge and real-world facts, just as most of
our knowledge is ultimately based on senses and other perceptions, if not all of it. And we
think these are good reasons which support the idea of beginning with implementations
whose inputs interact with the real world, if eventually it has to be said that artificial
neural nets objectively embody any knowledge.

2. From physical events to neocortex, by the auditory pathway.
We begin with a very brief summary of known facts about the auditory pathway, picking
only those which are (we estimate) significant for our purposes, disregarding the
description of a good deal of wonderful details already known to science.
Sounds are physically described as a kind of vibration usually transmitted by the air,
as very fast cyclic changes in pressure in the direction of the sound propagation. We hear
sounds because of a causal chain or line of events [Lyon 1996] [Russell 1948], starting at
the physical event which produces the air vibrations that eventually reach our outer ear
and then the tympanic membrane. The tympanic membrane transmits the vibrations to the
middle ear ossicles, through which the vibrations get to the inner ear and then enter the
neocortex, and then we hear, although maybe we do not listen.
There are several noteworthy components in the inner ear, contained in the cochlea.
The shape of the cochlea looks like a snail shell, as is well known, and it is filled
with a fluid (endolymph) and divided along the shell spiral into three compartments by
the basilar membrane and the Reissner's membrane. The basilar membrane gets thicker
the nearer we move to the spiral center or apex. The vibrations which get to the
cochlea are transmitted by the fluid to the basilar membrane, which responds to them
depending on the frequency. In the 19th century, Hermann von Helmholtz modelled the
basilar membrane as a series of mechanical oscillators. The fact is that, depending on the
vibration frequency, a different point of a zone of the basilar membrane vibrates
maximally, the thinner the zone the higher the frequency, the thicker the lower. On the
basilar membrane sits the organ of Corti, which undergoes the vibrations of the
membrane. In the organ of Corti there are two kinds of cells: the inner hair cells (about
3,500 in humans) and the outer hair cells (about 12,000). The inner hair cells seem to be
primarily responsible for our hearing, since "almost 95% of the afferent fibers of the cochlear
division of the eighth nerve (auditory nerve) originate at the base of these cells, while
most of the efferent input to the cochlea from the central nervous system reaches the
bases of the outer hair cells", a really astonishing phenomenon [Dnuw 1996] [Delgutte
1996] [Hanavan 1996] [Lyon 1996]. However, there are interactions with the outer hair
cells that join the filtering, resonance, amplifying, and other effects of the outer, middle,
and rest of the inner ear [Mountain 1996].
The inner ear transmits to the cortex two features of the sound we are interested in:
frequency and intensity. As we have said, frequency is identified by the inner cells
corresponding with a basilar membrane zone, and this information is transmitted by the
corresponding nerve fibers, these being like labelled lines in the very causal lines we
referred to previously. No less interesting is the fact that the sound intensity is coded
and travels along the same fibers, carried by the rate or frequency of the discharge of the
neurons, besides cooperative processing under saturation conditions [Delgutte 1996]. The
final destiny of both pieces of information, including the psychoacoustic aspect of
perception or the sound sensation we consciously perceive, remains a great mystery of
the human brain [Russell 1950] [Darwin 1859].
There are some successful models to explain different components of the outer, middle
and inner ear, and especially the cochlea, like the basilar membrane, the organ of Corti and
its components, etc., as well as surgical experience with cochlear implants in deaf persons,
with different degrees of success [Hanavan 1996] [Mountain 1996] [Delgutte 1996] [Lyon
1996]. However, the modelling ceases beyond the cochlea, precisely where our interest
begins and both features of the sound, namely coded intensity and labelled line of
frequency, are available and enter the cortex. From this viewpoint, the whole set of outer,
middle, and inner ear can be seen as a device which primarily filters some of the
surrounding environment features, say kinds of events, and typifies them by some device-specific
and event-specific characteristics, i.e. discharge frequency and labelled line
(respectively, intensities and frequencies of the sound). These features can be understood
from a physical viewpoint, since it is known that for any sound, represented by its
corresponding wave shape, we can calculate a decomposition into a sum of simple sine
and cosine waves with different frequencies and amplitudes, by Fourier's theorem;
that means we may consider the sound really consists of these simpler components.
Regarding those two features and from the knowledge level psychoacoustic
viewpoint, we know that human hearing ranges approximately from 20 Hz to 20 kHz, and
pure frequency tones are distinguished following a logarithmic scale of 2, as is well
known by anyone who loves music. We can recognize 12 frequencies between a given
frequency f and 2f, in the musical scale, but we can also distinguish not less than 25
different "out of tune" frequencies between these twelve intervals. That yields more than
3000 different frequencies we can distinguish when heard independently. The sound
intensity is also distinguished following a logarithmic scale, over a range of more than
100 dB.
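A quick back-of-the-envelope check of that figure (the 12 semitones and the roughly 25 distinguishable steps per semitone come from the text; the number of octaves is derived from the stated 20 Hz-20 kHz range):

    import math

    f_low, f_high = 20.0, 20_000.0
    octaves = math.log2(f_high / f_low)       # ~9.97 audible octaves
    steps_per_octave = 12 * 25                # 12 semitones x ~25 distinguishable steps
    print(round(octaves * steps_per_octave))  # ~2990, i.e. on the order of 3000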

3. A restricted auditory pathway modelling case: a modular model for
timbre recognition.
Based on these facts, we are going to propose a model which processes the two basic
features in order to memorize and subsequently recognize the timbres of different sounds.
First of all, we have to say that we have to model some aspects of the auditory pathway in the
cortex which are not yet well known, although there are plausible approaches for some of
them. That means we are not trying to simulate the functioning of several parts of the
human brain, but only trying to obtain known functionalities. That is, we are going to
follow a constructive method, where functional modules with known natural counterparts
(e.g. columns) are used in the building [Mira&Delgado 1995b], in order to obtain the
global functionality. Second, and in consequence, we shall design a net whose
architecture embodies the problem structure [Mira&Delgado 1995b]. As to this, there is
plenty of evidence which points to the fact that the problem structure is preserved in natural
neural nets from the input through the cortex layers [Mira&Delgado 1995b], so we shall
apply this to the auditory case. Third, we shall take into account that the computation
performed by every artificial neuron, namely local computation, is not as simple as a sum
of weight-input products, but a program in general, which is different in each
module, although not too complex [Mira 1995]. Fourth, we also have to consider that
although computation, as we understand it nowadays, can impose some restrictions, it
also can provide us with some features in order to implement the functionalities observed in
the natural counterpart. For instance, digital computation cannot use frequency to
code anything, though the natural counterpart uses frequency to code intensity, but digital
computation can code intensity by means of numbers. Following Marr [Marr 1982], if we
have to understand the behaviour of neurons, i.e. the natural processing of information, at
some other level, there is no need for an exact neuron-by-neuron, synapse-by-synapse
artificial synthesis. Albeit we could choose this way, we can also consider that a local
computation program represents the computation of several neurons and synapses, or
conversely, as long as we preserve the problem structure and we reproduce the suitable global
functionalities with regard to our problem.
Figure 1 represents the modules of our model. The reception module is a transductor
whose mission is to pick up the sound waves from the environment as input and return a
complete feature map as output, the wave spectrum for each time interval Δt. That
complete feature map may be represented by a bidimensional chart, where the X axis holds
the frequency values and the Y axis holds the intensity values, both on a logarithmic scale,
as we can see in the same figure. The next module detects intensity variations in time for
each frequency, parses the wave spectrum and returns one or more sub-spectra
components. This module uses columns for computing the suitable output. The final one
is the memory-recognition module, which handles normalized forms, where spectrum
frequencies and their respective intensities are relative.
Next we are going to see these modules in detail, except for the reception module,
and we shall discuss some alternatives.

(Figure: block diagram. The reception module transforms the input wave into a wave spectrum, intensity vs. frequency, for each interval Δt; the intensity variations in time for frequency module (IVTM) and the memory-recognition module (MRM) process it further, with no trace of absolute frequency or intensity at the output.)

Figure 1. Modular model for timbre recognition, a restricted auditory pathway model.

3.1. Reception.
In this restricted model, we can see the reception module (RM) as a low frequency
electronic device, with a total bandwidth of 20 Hz-20 kHz, which consists of one or more
electroacoustic components for transforming the sound waves into electric signals,
followed by band-pass filters tuned to m different frequencies, spread over the total
bandwidth on a logarithmic scale, as usual, and finally followed by half-wave rectifiers and
analog-to-digital converters, so that the output of such a device is m labelled lines, each one
corresponding with a definite frequency, each one carrying a coded number a
representing the intensity associated with that frequency. Note that m is over 3000. While
this device's analog input may vary continuously, the device changes the digital output a
discretely every Δt0, i.e. it renders a complete feature map every Δt0, the wave spectrum
for this time interval, as we saw in figure 1.
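A minimal software sketch of such a reception module follows (Python/SciPy; the function names, the filter order and the small m are illustrative assumptions, since the text describes an analog device with m > 3000 bands):

    import numpy as np
    from scipy.signal import butter, lfilter

    def reception_module(signal, fs, m=32, f_low=20.0, f_high=20_000.0):
        """Return one labelled line per band: the mean rectified output of m
        band-pass filters with logarithmically spaced band edges."""
        edges = np.geomspace(f_low, min(f_high, 0.45 * fs), m + 1)
        lines = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            y = lfilter(b, a, signal)
            lines.append(np.maximum(y, 0.0).mean())  # half-wave rectify, average
        return np.asarray(lines)                     # coded intensities, one per band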

3.2. Intensities variation in time.

The intensities variation in time module (IVTM) is based on one of the already
known common structures in the brain, namely the columns [Mira&Delgado 1995b]. In
this case, the columns are used to calculate the relative intensity variation for each
frequency f_i in an interval Δt, i.e. Δa_i/a_i, where Δa_i = a_i(t) - a_i(t-Δt) and a_i means a_i(t-Δt);
note that a is constant in Δt0. As we can see in figure 2, we have q columns, each with m
units, so in column j we calculate Δa_i/a_i (i=1,m) and so we do for the rest of the columns
(j=1,q). That yields an m×q bidimensional net. However, each unit's output depends on the
column, i.e. in column j, the output of unit i is not 0 only if ε_j > Δa_i/a_i > ε_{j-1}, where ε_j
depends on column j and does not depend on frequency f_i (we consider ε_0 = -∞). These
calculations are equivalent to a situation where each unit extends its receptive field in
time by Δt = n0Δt0, i.e. n0 intervals in time, since unit calculations involve the input at
time t and the input at time t-n0Δt0. Each input is the value coded in the labelled line
which comes from the corresponding reception module output; each output is 0 or the
coded intensity. Thus, each unit receives only 1 input and gives only 1 output, the
number of units is mq and the number of connections is mq inputs plus mq outputs. Given
m, the greater q the better the precision. For m=3000 and q=50 we are talking about 150,000
units, 150,000 inputs, and 150,000 outputs.
Figure 2. Computing intensities variation in time, after the reception module outputs (1st version).

We can computationally implement this module in a slightly different way, say the
second version, functionally equivalent although it seems unlikely that real neurons operate
like this. We can arrange m units, one per frequency, with q outputs per unit, and assign ε_j
to the jth output. We can preserve an order for each output in each unit, consistent with
the corresponding output assignments of the rest of the units, so that the set which
consists of the jth output of each unit corresponds with the jth column; now columns are
made up just of output connections, corresponding with the computations performed in
the m units. Each unit i calculates Δa_i/a_i = ε'_i and, by comparing ε'_i to ε_j and ε_{j-1} (j=1,q),
the unit decides which of the q outputs is not 0 and will transmit the coded information to
the next module. Since the number of units is m, the number of inputs is m too, and the
number of outputs is mq, we are talking about 3,000 units, 3,000 inputs and 150,000
outputs. We can see that, for the same number of outputs, the numbers of inputs and units
are considerably smaller than those of the first version, so this is a better choice to be
implemented.
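A minimal sketch of this second version (Python; the handling of values outside the threshold range is our assumption, and the ε_j values would in practice be fixed per column):

    import numpy as np

    def ivtm_step(a_prev, a_now, eps):
        """One unit per frequency. eps is the increasing sequence (eps_1..eps_q),
        with eps_0 = -inf implied. Unit i routes its coded intensity to the single
        column j with eps_{j-1} < da_i/a_i <= eps_j; all other outputs stay 0."""
        m, q = len(a_now), len(eps)
        out = np.zeros((m, q))
        rel = (a_now - a_prev) / np.where(a_prev != 0, a_prev, 1.0)  # da_i / a_i
        edges = np.concatenate(([-np.inf], eps))
        for i in range(m):
            j = np.searchsorted(edges, rel[i], side="left") - 1      # column index
            j = min(max(j, 0), q - 1)        # clamp out-of-range values (assumption)
            out[i, j] = a_now[i]             # output is the coded intensity
        return out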

The purpose of this module is to detect groups of frequencies which vary together. As
we said previously in this paper, we are trying to recognize timbres. Although this does
not mean that we restrict ourselves to the case of harmonic sounds, suppose we hear
several instruments playing a melody together on the radio (which is not stereo). We can
distinguish each one because they are not too many, they do not play continuously
together all the time, and the attack-decay-sustain-release series is different for each
instrument; if not for this, the timbre alone could not help us to distinguish
between them (this happens when we listen to an orchestra). So we can imagine each
instrument's spectrum in frequency varies in a different way during a given Δt and also
from one Δt to another as time goes by. Because of the latter, we can recognize the
melody that each instrument plays, but this is not the basic functionality of the module.
Because of the former, the module gathers in the same column all the frequencies whose
intensities vary in the same proportion, so they probably have the same origin, e.g. the
same instrument. Besides, if we think in terms of evolution, suppose some living being
has been hearing a sound with the same frequency spectrum, say the soft breeze on the
savannah. Probably, this living being will associate this sound with a unique mild origin.
But a quite different case happens if some part of the frequency spectrum varies, though
softly, at a different rate from the rest: it will probably be successful to associate all the
frequencies of this spectrum which vary at a different rate and identify them with a
different origin, in the struggle for life.

3.3. Memory and recognition.

Due to the IVTM characteristics, its output for a sound with a unique origin may
appear in different columns in different time intervals, just as we may have outputs in
different columns in the same interval as a result of sounds with different origins and
different spectrum evolutions in time. On the other hand, we are interested in timbre
recognition, i.e., in a frequency spectrum disregarding the absolute values of frequencies
and intensities, and only regarding the relative values between them. For instance, we can
identify that the sound we hear belongs to a violin, independently of either the pitch within the
octave or the strength of the bow rubbing the strings. Thus, if a module must be useful to
memorize and recognize timbres, it must take into account these two circumstances.
We are going to see that it is possible to design a unit to memorize and subsequently
recognize the spectrum of a sound, so that different spectra will be memorized and
recognized by different units. Once again, we do not pretend to identify a unit with a
natural neuron. As we can see in figure 3, a memory unit consists of only one output and
two kinds of input connections: the MR connections and the RO connections. There are as
many MR connections as outputs come from one given IVTM column. There are as
many RO connections as outputs come from the rest of the q-1 IVTM columns. There are
as many memory units as we need, each of them with MR and RO connections. Note that,
depending on the unit, the MR inputs are connected to one of the q IVTM columns or
another. However, at a given time, there are only q units that can memorize, q being the
number of IVTM columns. Inspired by nature, we can think of lateral inhibitory processes
which enable only those units to memorize, among the whole population.
Computationally, we can think of other mechanisms to bring one of the units to the
memorizing function, without them necessarily being fully interconnected to each other.
Anyway, when a unit becomes available for memorizing purposes, this is the only
function it can perform. Once a unit has memorized, the only available function is
recognizing, although the memorized form could be slightly modified by local learning in
subsequent recognition operations in this unit, for very similar patterns, this
similarity being determined by the local computation and learning algorithm.
Suppose the ith unit has its MR connections with the kth IVTM column. When this
column renders output, we have a series of output values

(a^p_{i1k}, a^p_{i2k}, ..., a^p_{imk}),

where m is the number of IVTM outputs per column (as many as labelled lines, one per
frequency, in a given column k). Every one of these a^p_{ijk} can be either 0 or the intensity
coded value. A normalization over the absolute intensity coded values turns the series into
relative intensity ones. Another normalization over the absolute frequencies turns them
into relative ones, simply by displacement. We call the normalized series a "form", so
units memorize forms. Once these normalizations end, the MR connections are never
again used to memorize, but to recognize, i.e., the MR connections start working like RO
connections.
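A sketch of these two normalizations (Python; shifting the lowest active frequency to index 0 and scaling by the maximum intensity is our reading of "displacement" and "relative intensities"):

    import numpy as np

    def normalize_form(series):
        """Turn one column's output series into a 'form': intensities made
        relative, frequencies made relative by displacement."""
        series = np.asarray(series, dtype=float)
        active = np.nonzero(series)[0]
        if active.size == 0:
            return series
        shifted = np.roll(series, -active[0])  # frequency normalization (shift)
        return shifted / shifted.max()         # relative intensities in (0, 1]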
(Figure: the three modules in cascade; one memory-recognition unit is drawn with its MR connections to one IVTM column and RO connections to the rest.)

Figure 3. The three modules. In the memory-recognition module, only 1 unit is shown. At a given time,
there are q units ready to memorize, one per IVTM column, each one with its MR connections with a
different column. In general, there are many other units like them; some have already memorized
their own form, and are therefore capable of recognizing it, some others have not memorized anything yet,
but only q of them are ready to memorize.

The recognition process is available when a unit has memorized a form. Then, whatever
input series it receives will be normalized and compared with the memorized form.
The RO connections are arranged so that the unit considers that inputs which come from
different IVTM columns belong to different series. So we have a series like

(a^p_{i1r}, a^p_{i2r}, ..., a^p_{imr}),

where m is the number of IVTM outputs per column and r the specific column whose
outputs are now inputs to unit i for recognition purposes. Computation acts so as to
consider only one of the q possible series in the normalization and comparison process. If
the comparison fails, then another series is considered, until recognition is achieved or no
match at all is found. If recognition is successful, the unit output is not 0, being 0
otherwise. The comparison criterion is not necessarily an exact-match one, since we can admit
small differences in the values of the series. Note that, since this unit operates in parallel
with the rest of the units for recognition purposes, recognition is achieved in parallel and in a
few steps. Note also that, first of all, the memory-recognition module (MRM) tries to
recognize the form, by inhibiting the MR connections. If no output is obtained, then the
inhibitions cease, and memorization is permitted.
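A sketch of this recognition step, reusing normalize_form from the sketch above (the tolerance-based comparison stands in for the approximate, non-exact-match criterion described in the text):

    import numpy as np

    def recognize(unit_form, column_series, tol=0.05):
        """Try each column's normalized series against the unit's memorized
        form; an approximate match within tolerance counts as recognition."""
        for series in column_series:                   # one candidate per column
            form = normalize_form(series)
            if np.allclose(form, unit_form, atol=tol): # admit small differences
                return True                            # unit output "not 0"
        return False                                   # output 0: memorization may proceed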

If not for the displacement and normalization, the underlying idea is that connections
support the memory function. The natural neural nets analogy is to straightforwardly
interpret that memory would imply synaptic changes: when the memory function is on, the
synapses which received impulses become excitatory ones, while those which did not
receive any impulse in the same process become inhibitory ones. But this process is not
meant to be accomplished by only one neuron, but by a group of them, working
cooperatively, i.e. a neural net.
(Figure: the manifold matrix of the memory-recognition unit, with normalizing units, multiplexed units and unit X.)

Figure 4. The memory-recognition unit inspired by the Pitts-McCulloch research. Grey lines come from
IVTM column outputs for the corresponding frequency; these input lines converge in units X^1 and, after
being normalized, cross the manifolds slantwise and enter the units on a multiplexed-by-manifold basis. On
the other hand, the black lines represent the MR connections that come from only one IVTM column j,
whose intensity was also normalized.
In order to preserve this very interesting feature in our design to the utmost, we could
think of a second version of our MRM, based on the neural structures studied by
W. Pitts and W.S. McCulloch [Pitts&McCulloch 1947]. As we can see in figure 4, for
each IVTM column there are m manifolds, each representing the m possible
displacements in frequency, so that we have an m×m matrix of units. The m possible
outputs of an IVTM column k, previously normalized (in intensity), are sent to each of the
inputs corresponding with the same frequency but in a different manifold. These are the
MO connections (memorizing-only), so memorization takes place in all the manifolds in
the corresponding inputs of the matrix, and also in a unit X (on the right). As to the
recognition connections RO, they are as follows: the m possible outputs of each IVTM
column k are sent firstly to multiplexed units whose output is controlled by a multiplexed
signal, then converge in units X^1, then are normalized and afterwards sent slantwise to the
manifolds, so that the output corresponding to the lowest frequency crosses the diagonal of
the matrix, and the highest theoretically arrives at only one input. Each IVTM column
output is examined when the multiplexed signal is on for the respective column, and the
multiplexor's units render output only when the multiplexed signal is on for them and
there is input coming from the corresponding frequency and column. Recognition takes
place only when the RO connections match a manifold with the suitable input inhibitory
characteristics and a multiplexed signal (not represented in the figure) is on for this
manifold; then this manifold's units render outputs different from zero, such that,
entering the unit X as inputs, they coincide with this unit's memorized form (these connections
for unit X, being the manifold units' outputs, are not represented in the figure; unit X
considers these connections ordered by manifold). Normalization for MO as well as RO
connections is performed as follows, as we can see in figure 4 (for RO connections only):
m units X^1 receive the RO inputs and send them to other m units X^{n+1}, which keep them
during nΔt, while signals are compared r at a time in n layers (represented by the dotted
triangle), so that the nth layer consists of only one unit X^n; this unit's output is the greatest
input value received in the X^1 units. So, the units X^{n+1} divide their inputs by the value of unit
X^n's output. That quotient is sent to the manifold's units as input.
If we calculate the dimensions of the MRM for this second version, we have that one
MRM unit has only 1 output, m MO inputs and m^2 RO inputs corresponding with unit X;
m^2 MO inputs and m^2/2 RO inputs plus m^2/2 multiplexed-by-manifold inputs
corresponding with the manifolds; 2mq inputs corresponding with the multiplexed units;
mq inputs corresponding with the units X^1; 4m-2 inputs corresponding with the RO normalizing
units and 5m-2 inputs corresponding with the MO normalizing units (if r=2). That yields
6m^2/2 + m(10+3q) - 3 connections per MRM unit. Under the same circumstances
considered in the IVTM case, that means we are talking about 2.7×10^7 connections per
MRM unit; please note once more that we do not mean this MRM unit is a neuron. This
MRM unit consists of simpler units: 1 unit X, m^2 units in the manifolds, 2(3m-1) normalizing
units (one set for MO connections, one for RO connections), and mq multiplexed units. That
yields m^2 + m(3+q) simpler units per MRM unit, i.e. we are talking about 9.1×10^6 simpler
units. Then, if the MRM has 1000 MRM units, we are talking about 2.7×10^10 connections and
9.1×10^9 simpler units per MRM, following the Pitts-McCulloch postulates.

However, per MRM unit, the first version of the MRM has 1 output, m MR inputs, and
mq RO inputs (since the MR inputs have to change to RO ones after memorization), so
we are talking about 1.53×10^5 connections per MRM unit, and there are no simpler units,
but the unit's local computation is more complex than that of the second version.
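A quick numeric check of these counts for m = 3000 and q = 50, using the formulas just derived (a sketch):

    m, q = 3000, 50

    # Second version: connections and simpler units per MRM unit
    conn_v2  = 6 * m**2 // 2 + m * (10 + 3 * q) - 3   # ~2.7e7 connections
    units_v2 = m**2 + m * (3 + q)                     # ~9.1e6 simpler units

    # First version: 1 output + m MR inputs + m*q RO inputs
    conn_v1 = 1 + m + m * q                           # ~1.53e5 connections

    print(f"{conn_v2:.2e}, {units_v2:.2e}, {conn_v1:.2e}")
    # 2.75e+07, 9.16e+06, 1.53e+05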
Multiplexing can be made unnecessary if we think of a third version of this module, at a
higher cost. We could have q matrices of units (one per IVTM column), one for the MO
connections and the rest for RO connections. When an MRM unit memorizes in the matrix
of units where the MO connections arrive, there are outputs of every unit of this matrix
which are sent to the corresponding units of the rest of the q-1 matrices, where memorization
takes place too. In each manifold, units are fully interconnected to each other, so that if
one of the units which has not memorized receives input, it inhibits the rest of its
manifold's units, so they do not produce any output. Besides, there are m units which
receive the outputs which come from each manifold's corresponding same frequency, and whose
outputs are sent to unit X's inputs. This third version of the MRM does not significantly
increase the number of units with respect to the second version. However, note that the
full interconnection per manifold yields m^2(m-1) connections per matrix, which means we
are talking about m^3 connections per MRM unit, i.e. 2.7×10^10 connections, and therefore
2.7×10^13 connections per MRM if we wanted 1000 memory-recognition units; anyway,
human technology cannot afford it nowadays.

4. Implementation issues.
The restricted synthesis modelling we have just presented, based on the processing of
real-world coded information carried by labelled lines, is an example of parallel, modular,
distributed and self-programming computation for real-world information processing.
There is a need for parallelism because of the real-world processing, since inputs are
provided in parallel and real-time processing is required. As to modularity, it is a correct
methodological design procedure, but in this case it is a consequence of two special
circumstances: first, disregarding the modules that pick up the real-world information, the
rest of the modules process feature maps, i.e. abstract information, therefore they can be used
in a wider kind of problems, i.e. they are generic, just as their natural analogues are frequently
found all over the brain; and second, the high number of units and connections with
which we have to deal means that, if a real implementation has to be made, we cannot
think in terms of interconnected units any more, but in terms of interconnected modules,
i.e. prefabricated standard objects. These standard modules (or objects) are only
characterized by their functionality, the number and characteristics of their inputs and the
number and characteristics of their outputs.
As we have seen, the simpler the computation per unit, the greater the number of
units and connections needed to achieve the same global functionality. If we take into account
the current technological resources, by increasing the complexity of each unit's
computation we can jeopardize the real-time response of the net: thus, we must estimate
what we can implement, if anything, for a real-time response, and what we can only simulate,
outside any real-time response but still under real-world conditions.
The estimation of the number of units and connections for the IVTM 2nd version and
the MRM 1st version leads us to an implementation which involves thousands of units and
hundreds of thousands of connections. Due to the characteristics of this kind of
processing, simulations can be and should be done simply by sequencing the parallel
computation module by module, and unit by unit, regarding the net layers in each module.
In spite of such numbers of units (about 3,000) and connections (about 300,000), this kind
of simulation is totally within reach, all the more so because very powerful GHz processors
and Gb RAM chip memories will be available at low cost in the short term. It would be
desirable for programming interfaces to be available for performing those simulations,
such that researchers could define the net in terms of standard modules, picked from
either a standard library or a user library (such that researchers could define their own
modules). A very important characteristic would be that those simulations would process
real-world information, so the feature maps should be provided as input. That means that
reception modules (RM) should be built in order to record those real-world
characteristics. That is a great difference with regard to the current simulations, which
seldom perform computations with inputs under real conditions. Most use analytical
mathematical models, which are not translated into synthesis models, so the real
computation is not described and therefore simulations cannot display any information
about the real implementation behaviour.
As to real implementations, we can distinguish two cases: specific and universal
machines, both including modules for picking up real-world information and rendering
feature maps. Specific machines are built for a definite purpose, with fixed modules and
fixed connections between and within them. In this case, units execute specific
programming which could be either software or hardware. The former means that units
must be general purpose processors; the latter means that units can be simpler and more
specific circuits. The latter is the preferred option, and therefore, as programs are Turing
machines (TM's), we call these specific machines "neural Turing machines" (NTM's).
These NTM's are made of standard modules, all of which are also NTM's. (We can also
think of machines which could build these NTM's following CAD/CAM design
specifications, due to the high number of connections to be made, as well as of symbolic
languages which express the modules' characteristics and interconnections.) Finally,
starting from the knowledge level model, described in terms of members of a PSM's
library, a reduction methods library is needed for building them, following a suitable
methodology [Mira 1997, 1998] [Herrero 1998, 1999].
Universal machines are justified just as sequential universal machines (i.e., von Neumann
ones) are, but for neural computation instead. We call these machines "neural
universal Turing machines" (NUTM's). In these machines, units must be general purpose
processors (i.e. UTM's, or universal Turing machines), and the connections between them
will be made depending on the case. Thus, any kind of module could be defined, as well
as the same kind of module with different dimensions, and connections between them
and also within them. A von Neumann machine could support the configuration of this
NUTM, the NUTM being a peripheral of the von Neumann machine. In the von Neumann
machine, we could perform simulations and design; there would be the libraries of
modules, and once the design is achieved, this machine would load the suitable modules
in the NUTM, carrying out their configuration and interconnection. This would be a
totally flexible neural computer. We can think of a NUTM as a net of computers, like the
current ones which support the management of big corporations, like banks, well-known
computer manufacturers, etc., which involve hundreds and even thousands of computers,
i.e. processors, all over the country or even the world. In general, these machines are
interconnected through telephone lines, which means connections capable of
configuration, even software configuration, and sometimes these lines are high-speed
ones. Usually, there is a central (von Neumann) machine which supports the system
management and maintenance of the net, while software distribution utilities maintain
each system's software. Therefore, NUTM's are really possible in the world today,
although at a very expensive cost, even though they would not have the long-distance
communication handicaps that the mentioned kind of computer nets actually have.
Nevertheless, the main problem is the lack of models, both at the knowledge level and at
the implementation level, which could be executed in these machines, and therefore the
corresponding reduction methods are not yet available.

5. Conclusions.
The lack of computational designs, which should be provided by reduction methods,
defers any implementation: the available models, based on accurate analytical
formulations, can only be solved by mathematical means, which are computable but do not
describe any neural net by themselves, and thus do not provide any idea about the
structure and dimensions of the implementation, the required tools (if available), the
required resources, and so on. Thus, inspiration from natural neural nets is desirable,
i.e. more research has to be done to find out more and more about brain structures and
their functioning, as well as about PSM's and the reduction methods which translate those
models to the connectionist implementation level. These reduction methods' results
should be neural nets, i.e. parallel, modular, distributed and self-programming
computation, where the modules consist of components also inspired by already known
and widely found structures throughout the brain (lateral inhibitory nets, columns,
slanted scan, etc.). PSM's describe functionalities observed at the knowledge level,
beginning with the sense pathways, and simulations of them should be done under real
conditions with real-world inputs. By simulations under real conditions (i.e. by
sequencing local computations module by module, unit by unit, following the net design
and taking inputs as they are received in the real implementation), we are preparing
the design of future resources and virtual machines which will then carry out the neural
computation, although we do not know whether new materials, components and
conceptions will necessarily replace the currently available ones in order to achieve
real-time responses.
6. Acknowledgements.
To J. Mira and A. E. Delgado for their support and underlying ideas of a suitable
viewpoint on computation and computational problem solving and reduction methods.
7. References.
[Benjamins 1997] V.R. Benjamins & M. Aben. Structure-preserving knowledge-based system
development through reusable libraries: a case study in diagnosis. International Journal of Human-
Computer Studies, 47 (1997) 259-288.
[Churchland 1992] P.S. Churchland & T.J. Sejnowski. The Computational Brain. (The MIT Press, Cambridge, MA, 1992).
[Darwin 1859] C. Darwin. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. (London: John Murray, Albemarle Street, 1859).
[DeFelipe 1997] J. de Felipe. Microcircuits in the Brain. In Biological and Artificial Computation: From Neuroscience to Technology. Mira, Moreno-Diaz & Cabestany (Eds.) (Springer, 1997) 1-14.
[Delgutte 1996] B. Delgutte. Physiological Models for Basic Auditory Percepts. In Auditory Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 157-220.
[Dnuw 1996] Department of Neurophysiology, University of Wisconsin - Madison. Hearing and Balance. http://www.neurophys.wisc.edu/~ychen/textbook/textindex.html
[Hanavan 1996] P.C. Hanavan. Virtual Tour of the Ear. http://ctl.augie.edu/perry/frames.htm Augustana College, SD.
[Hawkins 1996] H.L. Hawkins & T.A. McMullen. Auditory Computation: An Overview. In Auditory Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 1-14.
[Herrero 1998] J.C. Herrero & J. Mira. In Search of a Common Structure Underlying a Representative Set of Generic Tasks and Methods: The Hierarchical Classification and Therapy Planning Cases Study. In Methodology and Tools in Knowledge Based Systems. Mira, del Pobil & Ali (Eds.) (Springer, 1998) 21-36.
[Herrero 1999] J.C. Herrero & J. Mira. SCHEMA: A Knowledge Edition Interface for Obtaining Program Code from Structured Descriptions of PSM's. Two Cases Study. Applied Intelligence (accepted).
[Lyon 1996] R. Lyon & S. Shamma. Auditory Representations of Timbre and Pitch. In Auditory Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 221-270.
[Marr 1982] D. Marr. Vision. (Freeman, New York, 1982).
[Mira 1995] J. Mira et al. Aspectos básicos de la inteligencia artificial. (Sanz y Torres, 1995).
[Mira 1997] J. Mira, J.C. Herrero & A.E. Delgado. A Generic Formulation of Neural Nets as a Model of Parallel and Self-Programming Computation. In Biological and Artificial Computation: From Neuroscience to Technology. Mira, Moreno-Diaz & Cabestany (Eds.) (Springer, 1997) 195-206.
[Mira 1998] J. Mira, J.C. Herrero & A.E. Delgado. Where is Knowledge in Computational Intelligence? On the Reduction of the Knowledge Level to the Level Below. Proceedings of the 24th Euromicro Conference. (IEEE, 1998) 723-732.
[Mira&Delgado 1995a] J. Mira & A.E. Delgado. Reverse Neurophysiology: the "Embodiments of Mind" Revisited. In Proceedings of the International Conference on Brain Processes, Theories and Models. R. Moreno & J. Mira (Eds.) (The MIT Press, Cambridge, MA, 1995) 37-49.
[Mira&Delgado 1995b] J. Mira & A.E. Delgado. Computación neuronal avanzada: fundamentos biológicos y aspectos metodológicos. In Computación Neuronal. Senén Barro & José Mira (Eds.) (Universidad de Santiago de Compostela, 1995) 125-178.
[Mira&Delgado 1997] J. Mira & A.E. Delgado. Some Reflections on the Relationships between Neuroscience and Computation. In Biological and Artificial Computation: From Neuroscience to Technology. Mira, Moreno-Diaz & Cabestany (Eds.) (Springer, 1997) 15-26.
[Moreno-Diaz 1997] R. Moreno-Diaz. Systems Models of Retinal Cells: A Classical Example. In Biological and Artificial Computation: From Neuroscience to Technology. Mira, Moreno-Diaz & Cabestany (Eds.) (Springer, 1997) 178-194.
[Moreno-Diaz 1998] R. Moreno-Diaz. Neurocybernetics, Codes and Computation. In Tasks and Methods in Applied Artificial Intelligence. Mira, del Pobil & Ali (Eds.) (Springer, 1998) 1-14.
[Mountain 1996] D.C. Mountain & A.E. Hubbard. Computational Analysis of Hair Cell and Auditory Nerve Processes. In Auditory Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 121-156.
[Newell 1981] A. Newell. The Knowledge Level. AI Magazine 2 (2) (Summer 1981) 1-20, 33.
[Pitts&McCulloch 1947] W. Pitts & W.S. McCulloch. How we know universals: the perception of auditory and visual forms. Bulletin of Mathematical Biophysics, Vol. 9, pp. 127-147. (University of Chicago Press, 1947).
[Russell 1948] B. Russell. Human Knowledge: Its Scope and Limits. (George Allen & Unwin, 1948).
[Russell 1950] B. Russell. An Inquiry into Meaning and Truth. The William James lectures for 1940, delivered at Harvard University. (George Allen & Unwin, 1950).
[Russell 1959] B. Russell. My Philosophical Development. (George Allen & Unwin, 1959).
[Turing 1950] A.M. Turing. Computing Machinery and Intelligence. Mind 59 (1950) 433-460.
[Whitehead 1913] A.N. Whitehead & B. Russell. Principia Mathematica. (Cambridge University Press, 1913).
A Parametrizable Design of the Mechanical-Neural Transduction System of the Auditory Brainstem
José Antonio Macías Iglesias¹ and María Victoria Rodellar Biarge²

¹ Dpto. Ingeniería Informática. Univ. Autónoma de Madrid. E.T.S.I. Informática.
Carretera de Colmenar Viejo, Km. 15, E-28049 Madrid. Spain
macias@ii.uam.es
http://www.ii.uam.es/-jamacias/
² Dpto. Arquitectura y Tecnología de Sistemas Informáticos. Univ. Politécnica de Madrid.
Facultad de Informática. Campus de Montegancedo, E-28660 Madrid. Spain
victoria@pino.datsi.fi.upm.es
http://tamarisco.datsi.fi.upm.es/PEOPLE/victoria.html

Abstract. We present an implementation of a key component of the mammalian
auditory brainstem: the inner hair cell. This cell is a key element in the
mechanical-neural transduction system of the auditory brain. The cell will be
implemented in VLSI technology using the VHDL language, with parametrizable
characteristics. Furthermore, details of the cell structure and design issues will be
given.

1 Introduction

1.1 Goal of the work

The main goal of this work is the design of a dedicated digital circuit that is easily
modifiable in word width, in the modules implemented, and in the number of hair cells.
To this end, we design the system according to the mechanical-neural transduction
device present in mammals, using high-level techniques and accomplishing the design
of each cell in a parametrizable way. Hierarchical and structural decomposition will be
used to verify each item in the design, together with refinement and optimization in
area and time of the resulting structure.
1.2 Auditory brain
Biological speech processing is produced through a set of specialized mechanisms,
each of which transforms and refines the information obtained in the previous stage. The
processing of the information is accomplished, as a first step, in the cochlea, since the
middle and external ear are limited to amplifying the signal and providing certain
information on the location of the source that originates the sound. There, the
vibrations coming from the chain of bones are transformed into a wave that is
displaced along the basilar membrane, from the basal to the apical zone, and through
the amplitude and frequency of the vibration the frequency components of the
stimulus are identified. The frequency of each zone is transformed into a set of
trains of electrical pulses (mechanical-neural transduction) which codify, using
time-space information, the items previously detected, and which are driven to the
central nervous system.

2 The Meddis mechanical-neural transduction system

2.1 Description

In most of the models [1,3], the flow of the transmission fluid is modeled through
different reservoirs which can be found in the outer and the inner cytoplasm. These
models are better adjusted to the real behavior with respect to the adaptation process and
phase-locking. Van Schaik [4] gives an analogue implementation of the model, which
contains only simplified top-level characteristics of the more general hair cell model.
Lyon [5] and Kumar [6] also developed some portions of the auditory system in
analogue hardware. We have decided to apply the Meddis physiological model [1,2] for
several reasons:

- To begin with, it develops the biological process in a more specific way than the
rest of the models.
- It reproduces in a realistic way some of the basic properties of the auditory nerve
fibers, such as adaptation through two components or phase-locking with the
stimulus for fibers of high characteristic frequency.
- It is a nonlinear model, and the relationship between the linearity of a system and
its performance under adverse conditions has been assessed.
- Finally, its computational cost is not very high compared with other similar
models.
2.2 Structure
The Meddis model simulates the electrical activity of the auditory nerve fibres
starting from a stimulus, in this case the displacement of the basilar membrane.
As commented previously, the onset of the stimulus provokes an increase in the
firing rate, followed by an adaptation process that depends on the intensity of the
stimulus. However, the decay rate towards the adaptation level is independent of the
amplitude of the perceived signal. The first models developed had difficulty with this
aspect, due to the fact that the quantity of transmitter released (or the probability of
its being released) was determined directly from the intensity of the stimulus.
Furthermore, the cell was thoroughly emptied of transmitter substance in the presence
of a moderately intense signal. The firing rate must therefore depend on the amount of
neurotransmitter available; however, it could then happen that a stronger stimulus
would fail to be propagated due to the lack of transmitter substance in the cell. In
consequence, the model should keep a certain amount of neurotransmitter in stock to
answer increases in the stimulation, so a certain quantity of this substance should not
be affected by the intensity of the signal.
With all this, the properties of the model can be summarized in Table 1.
Table 1. Main characteristics of the Meddis model


Property Meddis
Number of reservoirs 3
Number of parameters 8
Intensity function behavior Yes
Adaptation rate Yes
Instant activity recovery Yes
Activity No
Masking process No
Phase-Locking Yes
Computational efficiency Yes

2.3 Hair cell modeling


The inner hair cell in the mammalian cochlea is a key component in the process of
hearing. It converts acoustic vibrations into an electrical signal which is used to drive
action potentials (spikes) via the auditory nerve to the brain. The system also
processes the signal so that the electrical output is quite unlike the mechanical input.
The behaviour of the cell can be modelled in terms of the flow of transmitter
substance between three reservoirs. One reservoir, the free transmitter pool, holds the
transmitter inside the cell, ready to be released into the cleft in response to acoustic
stimulation. The transmitter in the cleft is represented by the second reservoir, and it is
this amount that determines the rate of spike activity in the auditory nerve fibre. The
amount of transmitter in the cleft can be considered as the probability of the cell
transmitting a spike at any one time. Transmitter in the cleft is taken back into the cell
and reprocessed in the third, reprocessing reservoir. The system is represented
diagrammatically in Figure 1.

[Figure 1: transmitter flow between three reservoirs - a factory (with loss) replenishes the free transmitter pool (qt) at rate y·[m-qt]; transmitter is released into the cleft (ct) at rate kt·qt; cleft contents are lost at rate l·ct or returned through the reprocessing store (wt) at rates r·ct and x·wt; the output is ct (the cleft value).]

Fig. 1. Diagram of the hair cell design. Taken from [1,2]

Some transmitter is lost from the cleft (and from the system) by diffusion. This
would lead to a gradual run-down of the system if it were not for a gradual
replenishment of the transmitter from a "factory" within the cell that slowly replaces
the lost transmitter.
$$k(t) = \begin{cases} \dfrac{g \cdot dt \cdot [s(t) + A]}{s(t) + A + B} & \text{for } [s(t) + A] > 0 \\ 0 & \text{for } [s(t) + A] \le 0 \end{cases} \qquad (1)$$

$$\frac{dw}{dt} = r \cdot c(t) - x \cdot w(t) \qquad (2)$$

$$\frac{dq}{dt} = y \cdot (m - q(t)) + x \cdot w(t) - k(t) \cdot q(t) \qquad (3)$$

$$\frac{dc}{dt} = k(t) \cdot q(t) - l \cdot c(t) - r \cdot c(t) \qquad (4)$$

Equations (1)-(4) give a complete mathematical account of the model. The flow
parameters y (replenishment rate), l (loss-in-cleft rate), r (reprocessing rate) and x
(adaptation rate) are constant; k(t) is a variable function of s(t), the instantaneous
amplitude of the acoustic stimulus. The k(t) equation represents a saturation function
with a maximum value of g·dt. We can also see that with no stimulus, s(t)=0, we get a
non-zero value of k(t). This non-zero value is determined by the parameters A and B
(permeability constants). If we develop equations (1)-(4) according to the
design graph in Figure 1, we obtain the following in/out relationships, equations
(5)-(7), determined by the discrete form of the previously referenced differential
equations.

$$w(t+1) = -x \cdot w(t) + r \cdot c(t) + w(t) \qquad (5)$$

$$q(t+1) = x \cdot w(t) - q(t) \cdot k(t) + (m - q(t)) \cdot y + q(t) \qquad (6)$$

$$c(t+1) = -r \cdot c(t) - l \cdot c(t) + q(t) \cdot k(t) + c(t) \qquad (7)$$

In these equations we can see, as the last addend of the right-hand side, the reservoirs w(t),
q(t) and c(t). This is due to the intrinsic feedback in the model and in the
development of the differential equations (1)-(4).
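As an illustration only, the discrete equations (5)-(7) translate directly into code; in this minimal Python sketch the rate parameters are assumed to be already scaled by the sampling step (the paper gives no numerical values, so none are asserted here):

```python
def meddis_step(s, q, w, c, g, y, l, r, x, A, B, m):
    """One update of equations (5)-(7); all rates pre-scaled by dt."""
    # eq. (1): saturating permeability, zero when the drive is non-positive
    k = g * (s + A) / (s + A + B) if (s + A) > 0 else 0.0
    w_next = -x * w + r * c + w                 # eq. (5): reprocessing store
    q_next = x * w - q * k + (m - q) * y + q    # eq. (6): free transmitter pool
    c_next = -r * c - l * c + q * k + c         # eq. (7): cleft; output = c
    return q_next, w_next, c_next
```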

3 Computational analysis
The hair cell was initially modeled at the structural level in VHDL. This model is
used as a reference to verify the subsequent stages of the design. All the code was
converted to VHDL. Once the synthesized netlist had been verified, we proceeded
to obtain the values of the critical path, the number of ports and other factors that
determine the performance of the structural model.
3.1 Design
The data flow architecture used to implement the hair cell, modeled according to
equations (5)-(7), is represented below. To carry out this task, the operations have
been structured in 7 stages, attending to the precedence between operations and their
possible parallelism, and the stages have been optimized to minimize computation
time. Figure 2 shows the complete model.

[Figure 2: data flow graph - input registers S(t), Q(t), W(t), C(t) processed through stages 1 to 7, producing the outputs q(t+1), w(t+1) and c(t+1).]

Fig. 2. Design flow graph by stages

3.2 Hardware architecture


To develop the cell model described above, VHDL designs for 24-bit numbers have
been used. In total, 17 functional units are used: 11 adders, 5 multipliers and one divider.
In view of the needs established for the final design, we opted to develop a generic
n-bit adder model implemented from half-adder and full-adder cells. This functional
unit has been designed in the VHDL language, using Synopsys software for SunOS,
as a generic, reusable design in which the number of bits of the functional unit can be
extended at any moment. Table 2 shows some synthesis characteristics for the adder
units (a behavioural sketch follows the table).
Table 2. Characteristics of the generic adder of 24 bits

Concept                        Result obtained
Library used                   0.7 µ
Number of bits (dimension)     24 bits
Number of ports                73
Number of cells                24
Area                           59903.199219 µ²
Time per operation             36.63 ns => 27.30 MHz
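As a behavioural cross-check of the generic adder just described (our own Python sketch, not the synthesized VHDL), the n-bit ripple-carry composition of half-adder and full-adder cells can be written as:

```python
def half_adder(a, b):
    return a ^ b, a & b                      # (sum, carry)

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, cin)
    return s2, c1 | c2                       # (sum, carry-out)

def ripple_adder(x, y, n=24):
    """Generic n-bit adder built from half/full adder cells."""
    carry, out = 0, 0
    for i in range(n):                       # bit i of each operand
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out, carry

assert ripple_adder(1234567, 7654321)[0] == 1234567 + 7654321
```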

The multiplier used to implement the model is based on a structural model
implemented through a two-dimensional iterative net that directly reflects in its
structure the manual algorithm of binary multiplication. That is to say, what we want to
implement is a fast 24-bit array multiplier. Taking into account the structural
symmetry of this kind of multiplier, its VHDL implementation is based on the
design of a cell called mulsum that multiplies two bits (x and y) and adds the result
to z, producing the product bit and a carry.
This functional unit has also been designed in the VHDL language, using the same
philosophy and tools commented above. Table 3 shows some characteristics
of the synthesis of the multiplier (a behavioural sketch of the mulsum net follows the table).

Table 3. Characteristics of the generic multiplier of 24 bits

Concept                        Result obtained
Library used                   0.7 µ
Number of bits (dimension)     24 bits
Number of ports                144
Number of cells                577
Area                           2145335.5 µ²
Time per operation             172.78 ns => 5.787 MHz
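A behavioural Python sketch (ours, not the VHDL source) of the mulsum cell and of the iterative net built from it may help fix ideas:

```python
def mulsum(x, y, z, cin):
    """One mulsum cell: multiply bits x and y, add the result to z with carry."""
    p = x & y
    s = p ^ z ^ cin
    cout = (p & z) | (p & cin) | (z & cin)
    return s, cout

def array_multiply(a, b, n=24):
    """n x n array multiplier as a two-dimensional iterative net of mulsum
    cells: row i adds (a AND b_i), shifted by i, into the running product."""
    acc = 0
    for i in range(n):
        bi = (b >> i) & 1
        carry, row = 0, 0
        for j in range(n):                   # ripple through one row of cells
            s, carry = mulsum((a >> j) & 1, bi, (acc >> (i + j)) & 1, carry)
            row |= s << (i + j)
        mask = ((1 << n) - 1) << i           # replace the bits the row rewrote
        acc = (acc & ~mask) | row | (carry << (i + n))
    return acc

assert array_multiply(3000, 5000) == 15000000
```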

As can be expected, the corresponding design of a 24-bit multiplier contains
about 577 cells. An approximation of this design can be observed in Figure 3.
Fig. 3. Scheme of the generic multiplier synthesized for 24 bits

Finally, the divider used in the design is a barrel-shifter implementation, running at 50
MHz (20 ns). This model is still under development, and its performance is being
evaluated.
4 Result analysis
Let us now examine the results obtained for the scheme of Figure 2 using the hardware
architecture of Section 3.2. First we study the timing of each stage, and then
the occupation (in area terms) of each stage, for two different solutions.

Table 4. Time obtained according to the proposed design

Time (ns)
Stages Solution 1 Solution 2
Stage 1 345.56 172.78
Stage 2 345.56 172.78
Stage 3 36.63 36.63
Stage 4 172.78 172.78
Stage 5 73.26 36.63
Stage 6 73.26 36.63
Stage 7 36.63 36.63
Sum 1083.68 664.86

Solution 1 involves only one functional unit of each type per stage, while
solution 2 involves two functional units of each type per stage. The resulting clock
frequencies are 0.92 MHz and 1.5 MHz for solutions 1 and 2 respectively. It is
important to emphasize that solution 2 implies an increase in the
speed of the circuit, the difference between the two solutions being 418.82 ns.
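These figures can be cross-checked with a throwaway Python script (not part of the design flow): the cycle time of each solution is the sum of its stage latencies, and the clock frequency is its reciprocal.

```python
stages_1 = [345.56, 345.56, 36.63, 172.78, 73.26, 73.26, 36.63]   # ns, Table 4
stages_2 = [172.78, 172.78, 36.63, 172.78, 36.63, 36.63, 36.63]   # ns, Table 4

for name, stages in (("solution 1", stages_1), ("solution 2", stages_2)):
    total = sum(stages)
    print(f"{name}: {total:.2f} ns -> {1e3 / total:.2f} MHz")
# solution 1: 1083.68 ns -> 0.92 MHz
# solution 2: 664.86 ns -> 1.50 MHz   (difference: 418.82 ns)
```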
We are now going to study the occupation, in area terms, of both solutions described above.

Table 5. Area, in port terms, obtained according to the proposed design

Area (ports)
Stages Solution 1 Solution 2
Stage 1 217 434
Stage 2 217 434
Stage 3 73 73
Stage 4 217 217
Stage 5 73 146
Stage 6 73 146
Stage 7 73 73
Sum 943 1523

As we can see, solution 2 involves a considerable increase in the area used to design
the circuit, having 580 ports more than solution 1. Taking into account the results of
Table 4, it is important to reach a compromise between speed and the number of ports
used in the design.
5 Conclusions
We have presented a hardware implementation of the mechanical-neural
transduction model present in mammals. The implementation of this system is
based on the parametrizable VLSI development of a fundamental component - the
hair cell - modeled through standard cells, starting from the scheme presented in Section 3.
The contribution of this design is the ability to make parametrizable and reusable
items using high-level design techniques. In this way, we have described the design of
a structural block - the hair cell - that will be used for the linear development of an
array of cells to simulate the function of the cochlea as the mechanical transduction
mechanism of the auditory system. Furthermore, this design will make it possible to build
modular libraries which can be used for the construction of a bio-inspired system for
human speech recognition.

Acknowledgements
This research is supported by PRONTIC 97-1011 and NATO CRG-960053 grants.

References
1. Meddis, R. (1986). "Simulation of mechanical to neural transduction in the auditory receptor." J. Acoust. Soc. Am. 79, pp. 702-711.

2. Meddis, R. (1988). "Simulation of auditory-neural transduction: Further studies." J. Acoust. Soc. Am. 83, pp. 1056-1063.

3. Payton, K.L. (1988). "Vowel processing by a model of the auditory periphery: A comparison to eighth-nerve responses." J. Acoust. Soc. Am. 83 (1), pp. 145-162.

4. Van Schaik, A., Fragniere, E., Vittoz, E. (1996). "A silicon model of amplitude modulation detection in the auditory brainstem." Advances in Neural Information Processing Systems 9, MIT Press.

5. Lyon, R.F., Mead, C. (1988). "An analog electronic cochlea." IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1119-1134.

6. Kumar, N., Himmelbauer, W., Cauwenberghs, G., Andreou, A.G. (1997). "An analog VLSI front-end for auditory signal analysis." IEEE International Conference on Neural Networks 1997, pp. 876-881.
Development of a New Space Perception System for Blind People, Based on the Creation of a Virtual Acoustic Space
¹González-Mora, J.L., ¹Rodríguez-Hernández, A., ²Rodríguez-Ramos, L.F., ²Díaz-Saco, L., ²Sosa, N.

¹ Department of Physiology, University of La Laguna and ² Department of Technology, Institute of Astrophysics, La Laguna, Tenerife. 38071. Spain; e-mail: jlgonzal@ull.es

Abstract. The aim of the project is to give blind people more information about their immediate
environment than they get using traditional methods. We have developed a device which captures the
form and the volume of the space in front of the blind person and sends this information, in the form of a
sound map, to the blind person through headphones in real time. The effect produced is comparable to
perceiving the environment as if the objects were covered with small sound sources which are
continuously and simultaneously emitting signals. An experimental working prototype has been
developed, which has allowed us to validate the idea that it is possible to perceive the spatial
characteristics of the environment. The validation experiments have been carried out with the
collaboration of blind people and, to a large extent, the sound perception of the environment has been
accompanied by simultaneous visual evocation, namely the visualisation of luminous points
(phosphenes) located at the same positions as the virtual sound sources.
This new form of global and simultaneous perception of three-dimensional space, via a sense other
than vision, will improve the user's immediate knowledge of his/her interaction with the
environment, giving the person more independence in orientation and mobility. It also paves the way for
an interesting line of research in the field of sensory rehabilitation, with immediate applications in
the psychomotor development of children with congenital blindness.

1 Introduction
From both a physiological and a psychological point of view, the existence of three
senses capable of generating the perception of space (vision, hearing and touch) can be
considered. They all use comparative processes between the information received in
spatially separated sensors; complex neural integration algorithms then allow the three
dimensions of our surroundings to be perceived and "felt" [2]. Therefore, not only light
but also sound can be used for carrying spatial information to the brain, thus creating
the psychological perception of space [14].
The basic idea of this project can be intuitively imagined as trying to emulate, using
virtual reality techniques, the continuous stream of information flowing to the brain
through the eyes, coming from the objects which define the surrounding space, and
carried by the light which illuminates the room. In this scheme, two slightly different
images of the environment are formed on the retina with the light reflected by
surrounding objects, and processed by the brain in order to generate its perception. The
proposed analogy consists of simulating the sounds that all objects in the surrounding
space would generate, these sounds being capable of carrying enough information, in
addition to source position, to allow the brain to create a three-dimensional perception of the objects
in the environment and their spatial arrangement, after modelling their position,
orientation and relative depth.
This simulation will generate a perception which is equivalent to covering all
surrounding objects (doors, chairs, windows, walls, etc.) with small loudspeakers emitting
sounds according to their physical characteristics (colour, texture, light level, etc.). In this
situation, the brain can access this information together with the sound source position,
using its natural capabilities. The overall hearing of all the sounds will allow blind persons
to form an idea of what their surroundings are like and how they are organised, up to
the point of being capable of understanding them and moving in them as though they could see.
A lot of work has been done on the application of technical aids for the
handicapped, and particularly for the blind. This work can be divided into two broad
categories: orientation providers (both at city and building level) and obstacle detectors.
The former have been investigated everywhere in the world, a good example being the
MOBIC project, which supplies positional information obtained from both a GPS satellite
receiver and a computerised cartography system. There are also many examples of the
latter group, using all kinds of sensing devices for identifying obstacles (ultrasonic, laser,
etc.) and informing the blind user by means of simple or complex sounds. The "Sonic
Path Finder" prototype developed by the Blind Mobility Research Group, University of
Nottingham, should be specifically mentioned here.
Our system fulfils the criteria of the first group because it can provide its users with
an orientation capability, but goes much further by building a perception of space itself at
the neuronal level [20,18], which can be used by the blind person not only as a guide for
moving, but also as a way of creating a brain map of how his surrounding space is
organised.
A very successful precedent of our work is the KASPA system [8],
developed by Dr. Leslie Kay and commercialised by SonicVisioN. This system uses an
ultrasonic transmitter and three receivers with different directional responses. After
suitable demodulation, acoustic signals carrying spatial information are generated, which
can be learnt, after some training, by the blind user. Other systems have also tried to
perform the conversion between image and sound, such as the system invented by Mr.
Peter Meijer (PHILIPS), which scans the image horizontally in a temporal sequence;
every pixel of a vertical column contributes a specific tone with an amplitude proportional
to its grey level.
The aim of our work is to develop a prototype capable of capturing a three-dimensional
description of the surrounding space, as well as other characteristics such as
colour, texture, etc., in order to translate them into binaural sonic parameters,
virtually allocating a sound source to every position of the surrounding space, and
performing this task in real time, i.e. fast enough in comparison with the brain's
perception speed, to allow training with simple interaction with the environment.

2 Material and Methods

2.1 Developed system
[Figure 1: a) example environment (a room with a window, a half-open door and a corridor, with the user near the window); b) the field of view divided into stereopixels; c) the virtual sound sources perceived around the user.]
Fig. 1.- Two-dimensional example of the system behaviour

A two-dimensional example of the way in which the prototype can work in
order to perform the desired transformation between space and sound is shown in
Figure 1. In the upper part (drawing a) there is a very simple example environment: a room with a
half-open door and a corridor; the user is standing near the window, looking at the
door. Drawing b shows the result of dividing the field of view into 32
stereopixels, which represent the horizontal resolution of the vision system
(however the equipment could work with
an image of 16 x 16 and 16 depths), providing more detail at the centre of the field in the
same way as human vision. The description of the surroundings is obtained by
calculating the average depth (or distance) of each stereopixel. This description will be
virtually converted into sound sources, located at every stereopixel distance, thus
producing the perception depicted in drawing c, where the major components of the
surrounding space can be easily recognised (the room itself, the half-open door, the
corridor, etc.).
This example contains the equivalent of just one acoustic image, constrained to two
dimensions for ease of representation. The real prototype will produce about ten such
images per second, and include a third (vertical) dimension, enough for the brain to build
a real (neuronal based) perception of the surroundings.
Two completely different signal processing areas are needed for the
implementation of a system capable of performing this simulation. First, it is necessary to
capture information of the surroundings, basically a depth map with simple attributes such
as colour or texture. Secondly, every depth has to be converted into a virtual sound
source, with sound parameters coherently related to the attributes and located in the
spatial position contained in the depth map. All this processing has to be completed in
real time with respect to the speed of human perception, i.e. approximately ten times per
second.
[Figure 2: the Vision Subsystem (colour video micro-cameras JAI CV-M 1050, MATROX GENESIS frame grabber, based on a Pentium II 300 MHz) connected through an Ethernet link (TCP-IP) to the Acoustic Subsystem (Huron bus cards carrying 56002 DSPs and A/D-D/A converters, based on a Pentium 166 MHz), which drives SENNHEISER HD-580 professional headphones.]

Fig. 2.- Conceptual diagram of the developed prototype.

Figure 2 shows a conceptual diagram of the technical solution we have chosen for
the prototype development. The overall system has been divided into two subsystems:
vision and acoustic. The former captures the shape and characteristics of the surrounding
space, and the second simulates the sound sources as if they were located where the
vision system has measured them. Their sounds depend on the selected parameters, both
reinforcing the spatial position indication and also carrying colour, texture, or light-level
information. The two subsystems are linked using a TCP-IP Ethernet link.
The Vision Subsystem
A stereoscopic machine vision system has been selected for the surrounding data
capture [12]. Two miniature colour cameras are glued to the frame of conventional
spectacles, which will be worn by the blind person using the system. The set will be
calibrated in order to calculate absolute depths. In the prototype system, a feature-based
method is used to calculate a disparity map. First of all, the vision subsystem obtains a set
of corner features all over each image, and the matching calculation is based on the
epipolar restriction and the similarity of the grey levels in the neighbourhood of the
selected corners.
The map is sparse, but it can be obtained in a short time and contains enough
information for the overall system to behave correctly.
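A much simplified sketch of this matching step (assuming rectified cameras, so that the epipolar line is the image row, and using plain SSD as the grey-level similarity measure; the prototype's exact measure and thresholds are not detailed here):

```python
import numpy as np

def match_corner(left, right, y, x, win=5, max_disp=64):
    """Match one left-image corner along the epipolar line (same row)."""
    h = win // 2
    patch = left[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    best_d, best_ssd = 0, np.inf
    for d in range(max_disp):                    # candidate disparities
        if x - d - h < 0:
            break
        cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1].astype(float)
        ssd = np.sum((patch - cand) ** 2)        # grey-level similarity
        if ssd < best_ssd:
            best_d, best_ssd = d, ssd
    return best_d

def depth_from_disparity(d, focal_px, baseline_m):
    """Triangulation for a calibrated pair: depth = f * B / disparity."""
    return np.inf if d == 0 else focal_px * baseline_m / d
```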
The vision subsystem hardware is based on a high-performance PC computer,
(PENTIUM II, 300 MHz), with a frame grabber board from MATROX, model GENESIS
featuring a C80 DSP.

2.2 The Acoustic Subsystem


The virtual sound generator uses the Head Related Transfer Function (HRTF)
technique to spatialize sounds [5]. For each position in space, a set of two HRTFs is
needed, one for each ear, so that the interaural time and intensity difference cues, together
with the behaviour of the outer ear, are taken into account. In our case, we are using a
reverberating environment, so the measured impulse responses also include
information about the echoes in the room. HRTFs are measured as the responses of
miniature microphones (placed in the auditory channel) to a special measurement
signal (MLS) [1]. The transfer function of the headphones is also measured in the same way,
in order to equalise its contribution.
Having measured these two functions, the HRTF and the Headphone Equalizing
Data, properly selected or designed sounds (Dirac deltas) can be filtered and presented to
both ears, the same perception being achieved as if the sound sources were placed in the
same position from where the HRTF was measured.
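The filtering chain can be sketched as follows; this Python fragment is only an illustration (the HRIR pair and the headphone equalisation filter are assumed to be available as 1-D impulse-response arrays, whereas the real system runs the filters on DSP hardware):

```python
import numpy as np

def spatialize(source, hrir_left, hrir_right, headphone_eq):
    """Filter a mono sound with the impulse responses measured at the
    target position, then with the headphone equalisation filter."""
    left = np.convolve(np.convolve(source, hrir_left), headphone_eq)
    right = np.convolve(np.convolve(source, hrir_right), headphone_eq)
    return left, right

# e.g. a Dirac delta as the source sound, as used later in the experiments
delta = np.zeros(256)
delta[0] = 1.0
```

For a whole scene, one such pair is produced per virtual source and all left (respectively right) signals are summed before playback, as described below.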
Two approaches are available for the acoustic subsystem. In the first one, sounds
can be processed off-line, using HRTF information measured with reasonable spatial
resolution, and stored in the memory system ready to be played. The second method is to
store only the original sounds and to perform real-time filtering using the available DSP
processing power. This second approach has the advantage of allowing the use of a much
larger variety of sounds, making it possible to include colour, texture, grey-level, etc.
information in the sound, at the expense of requiring a number of DSPs directly related
to the number of sound sources to be simulated. In both cases, all the
sounds are finally added together in each ear.
The acoustic subsystem hardware is based on a HURON workstation, (Lake DSP,
Australia), an industrial range PC system (PENTIUM 166) featuring both an ISA bus plus
a very powerful HURON Bus, which can handle up to 256 channels, using time division
multiplex at a sample rate of up to 48 kHz, 24 bits per channel. The HURON bus is
accessed by a number of boards containing four 56002 DSPs each, and also by input and
output devices (A/D, D/A) connected to selected channels. We have configured our
HURON system with eight analogue inputs (16 bits), forty analogue outputs (18 bits), and
2 DSP boards.

2.3 Subjects and experimental conditions


The experiments were carried out on 6 blind subjects and 6 sighted volunteers, whose
ages ranged between 16 and 52. All 6 blind subjects were completely blind (absence of light
perception) as the result of peripheral lesion, but were otherwise neurologically normal.
They all lost their sight as adults, having had normal visual function before. The results
obtained from late blind subjects were compared to each other as well as to measurements
taken from the 6 healthy, sighted young volunteers, with closed eyes under all the
experimental conditions. All the subjects included in both experimental groups described
above were selected according to the results of an audiometric control. The acoustic
experimental stimulus generated was a burst of 6 Dirac deltas spaced at 100 msec, and the
subjects indicated the apparent spatial position by calling out numerical estimates of
apparent azimuth and elevation, using standard spherical coordinates. These acoustic
stimuli were generated to simulate a set of five virtual positions covering a 90-deg range
of azimuths and elevations from 30 deg below the horizontal plane to 60 deg above it. The
depth (Z) was studied by placing the virtual sound at different distances from the subjects
of up to 4 meters, divided into five intermediate positions in a logarithmic arrangement.
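For illustration, the five virtual positions along each coordinate can be generated as below; the ±45 deg azimuth span and the 0.25 m inner distance are our assumptions, since the paper only states a 90-deg azimuth range and distances of up to 4 meters:

```python
import numpy as np

azimuths = np.linspace(-45, 45, 5)      # deg, five positions over a 90-deg range
elevations = np.linspace(-30, 60, 5)    # deg, from -30 to +60
distances = np.geomspace(0.25, 4.0, 5)  # m, logarithmic arrangement up to 4 m
```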
2.4 Data analysis
The data obtained from both experimental groups (blind people as well as sighted
subjects) were evaluated by analysis of variance (ANOVA), comparing the changes in
the response following the change of virtual sound sources. This was followed by post-
hoc comparisons of both group values using Bonferroni's Multiple Comparison Test.
3 Results
Having first established that it is practically impossible to distinguish between real sound
sources and their corresponding virtual ones, both for blind subjects and for visually enabled
controls, we tried to determine the capability of blind people to localise virtual sound
sources with regard to sighted controls. Without any previous experience, we
carried out localisation tests of spatialized virtual sounds in both groups, each test lasting 4
seconds. We found significant differences in blind people as well as in the sighted group
when the sound came from different azimuthal positions (see Figure 3). However, as can
be observed in this graph, blind people detected the position of the source with more
accuracy than people with normal vision.

Fig. 3.- Mean percentages (with standard deviations) of accuracy in response to the virtual sound localisation generated through headphones in azimuth, for sighted controls and blind subjects. ** = p<0.005

Fig. 4.- Mean percentages (with standard deviations) of accuracy in response to the virtual sound localisation generated through headphones in elevation. ** = p<0.005
When the virtual sound sources were arranged in a vertical position, to evaluate
the discrimination capacity in elevation, there were significant
differences in the blind group which did not exist in the control group (see Figure 4).

Fig. 5.- Mean percentages (with standard deviations) of accuracy in response to the virtual sound localisation generated through headphones in distances (Z axis). ** = p<0.005

Figure 5 shows that both groups can distinguish the distances well; nevertheless,
only the group of blind subjects showed significant differences. The results in the initial
tests using simultaneous multiple virtual or real sounds showed that, fundamentally in
blind subjects, it is possible to generate the perception of a spatial image from the spatial
information contained in sounds. The subjects can perceive complex tridimensional
aspects from this image, such as: form, azimuthal and vertical dimensions, surface
sensation, limits against a silent background, and even the presence of several spatial
images related to different objects. This perception seems to be accompanied by an
impression of reality, which is a vivid constancy of the presence of the object we have
attempted to reproduce. It might be interesting to mention that, in some subjects, the
tridimensional pattern of sound-evoked perceptions had mental representations which
were subjectively described as being more similar to visual images than to
auditory ones. Presented in a general way, and considering that the objects to be
perceived are punctual shapes or change from punctual shapes into mono-, bi- and
three-dimensional shapes (which include horizontal or vertical lines, concave or convex,
isolated or grouped flat and curved surfaces composing figures, e.g. squares, or columns
or parallel rows, etc.), the following observed aspects stand out:
- An object located in the field of the user's perception, generated from the received
sound information, can be described, and therefore perceived, in significant spatial aspects
such as its position, its distance and its dimensions in the horizontal and vertical axes,
and even in the z axis of depth.
- Two objects separated by a certain distance, each one inside the perceptual field
captured by the system, can be perceived in their exact positions, regardless of their
relative distances from each other.
- After a brief period of time, which is normally immediate, the objects in the
environment are perceived in their own spatial disposition in a global manner, and the
final perception is that all the objects appear to be inside a global scene.
This suggests that the blind can, with the help of this interface, recognise the
presence of a panel or rectangular surface in its position, at its correct distance, and with
its dimensions of width and height. The spatial continuity of the surface structure
(e.g. door, window, gap, etc.) is also perceived. Two parallel panels forming the shape of a corridor
are perceived as two objects, one on each side, with their vertical dimensions and depth,
and with a space between them where one can go through.
In an attempt to simulate the everyday tasks of the blind, we created a dummy and a
very simple experimental room. It was possible for the blind person to move in this space
without relying on touch, and he/she could extract enough information to then give a
verbal global image, graphically described (see Figure 6), including the general disposition
with respect to the starting point, the presence of the walls, his/her relative position, the existence of a
gap simulating a window in one of them, the position of the door, and the existence of a
central column, perceived in its vertical and horizontal dimensions. In summary, it was
possible to move freely everywhere in the experimental room.

[Figure 6: panel A shows the room layout (walls, door, window gap, central column, starting point); panel B shows the blind person's drawing of the same scene.]

Fig. 6.- A. Schematic representation of the experimental room, with a particular distribution of objects. B. Drawing made by a blind person after a very short exploration, using the developed prototype, without relying on touch.

It is very important to remark that in several blind people the sound perception of
the environment has been accompanied by simultaneous visual evocation, consisting of
punctate spots of light (phosphenes) located in the same positions as the virtual sound
sources. The phosphenes did not flicker, so this perception gives a great impression of reality
and is described, by the blind, as visual images of the environment.

4 Discussion
Do blind people develop the capacities of their other remaining senses to a higher
level than those of sighted people? This has been a very important question of debate for
a long time. Anecdotal evidence in favour of this hypothesis abounds, and a number of
systematic studies have provided experimental evidence for compensatory plasticity in
blind humans [15], [19], [16]. Other authors have often argued that blind individuals
should also have perceptual and learning disabilities in their other senses, such as the
auditory system, because vision is needed to instruct them [10], [17]. Thus, the
question of whether intermodal plasticity exists has remained one of the most vexing
problems in cognitive neuroscience. In the last few years, results of PET and MRI in blind
humans indicate activation of areas that are normally visual during auditory stimulation
[23], [4] or Braille reading [19]. In most of the cases, a compensatory expansion of
auditory areas at the expense of visual areas was observed [14]. In principle this would
suggest that this would result in a finer resolution of auditory behaviour rather than in a
reinterpretation of auditory signals as visual ones. However, these findings pose several
interesting questions: What kind of percept does a blind individual experience when
a 'visual' area becomes activated by an auditory stimulus? Does the co-activation of
'visual' regions add anything to the quality of this sound that is not perceived normally, or
does the expansion of auditory territory simply enhance the accuracy of perception for
auditory stimuli?
According to this, our findings suggest that, at least in our sample, blind people
present a significantly higher spatial capability of acoustic localisation than visually
enabled subjects. This capability, as one would expect, is more important in
azimuth than in elevation and in distance; nevertheless, in the latter the differences are
statistically significant. These results allow us to sustain the idea of a possible use of the
auditory system as a substratum to transport spatial information in visually disabled
people and, in fact, the system we have developed using multiple virtual sounds suggests
that the brain can generate an image of the spatial occupation of an object, with its shape, size
and three-dimensional location. To form this image, the brain needs to receive spatial
information about the characteristics of the object's spatial disposition, and this
information needs to arrive fast enough so that the flow is not interrupted, regardless of
the sensorial source it comes through.
It seems plausible that neighbouring cortical areas share certain functional
aspects, defined partly by their common projection targets. In agreement with our results,
several authors think that the function shared by all sensory modalities seems to be spatial
processing [14]. Therefore, a common code for spatial information that can be interpreted
by the nervous system has to be used, and probably the parietal areas, in conjunction with
the prefrontal areas, form a network involved in sound spatial perception and selective
attention [6].
Thus, to explain our results, it is necessary to consider that signals from many
different modalities need to be combined in order to create an abstract representation of
space that can be used, for instance, to guide movements. Many authors [3], [6] have
shown evidence that the posterior parietal cortex combines visual, auditory, eye position,
head position, eye velocity, vestibular and proprioceptive signals in order to perform
spatial operations. These signals are combined in a systematic fashion by using the gain
field mechanism. This mechanism can represent space in a distributed format that is quite
powerful, allowing inputs from multiple sensory systems with discordant spatial frames
and sending out signals for action in many different motor co-ordinate frames. Our
holistic impression of space, independent of sensory modality, may be embodied in this
abstract, distributed representation of space in the posterior parietal cortex.
These spatial representations generated in the posterior parietal cortex are related to other
higher cognitive neuronal activities, including attention.
In conclusion, our results suggest a possible amodal treatment of spatial information
which, in situations such as after the plastic changes that follow sensorial
deficits, could have practical implications in the field of sensorial substitution and
rehabilitation. Furthermore, contrary to the results obtained from other lines of research
into sensorial substitution [8], [4], the results of this project have been spontaneous and
did not follow any protocol of previous learning, which suggests the high potential of the
auditory system and of the human brain, provided the stimuli are presented in the most
complete and coherent way possible.
Regarding the appearance of the evoked visual stimuli that we have found when
blind people are exposed to spatialized sounds, the use of Dirac deltas is very important in
this context, since it demonstrates that the proposed method can, without direct
stimulation of the visual pathways or visual cortex, generate visual information
(phosphenes) which bears a close relationship to the spatial position of the generated
acoustic stimuli. The evoked appearance of phosphenes, which has also been found by
other authors after the exposition of auditory stimuli, although under other experimental
conditions [11], [13], shows that, in spite of their spectacular appearance, this is not an
isolated and unknown fact. In most of their cases, the evocation was transitory, with a
duration of a few weeks to a few months. Our results are interesting because, in all our
cases the evocation has lasted until the present moment, and the phosphenes are
perceived by the subject in the same spatial position as the virtual or real sound source
position.
As regards the nature of this phenomenon, there are several possible explanations:
a) Hyperactive neuronal activity may exist, due to visual deafferentation, in neurones which
are able to respond to visual stimuli as well as auditory stimuli. Several cases that
support this hypothesis have been reported, and this probably happens when these
neurones receive sounds [11] in certain circumstances in early blindness. It is known that
glucose utilisation in the human visual cortex is abnormally elevated in blindness of early
onset but decreased in blindness of late onset [23]; there is also evidence, found in
experimental animals, that in the first few weeks of blindness there is an increase in the
number of synapses and in synaptic density in the visual cortex [24]. However, as one of our cases is a
woman who has been blind for 6 years, explaining her case according to this theory will
require additional data.
b) The auditory evoked phosphenes could be generated in the retina or in the
damaged optic nerve. Page and collaborators [13] suggest the hypothesis that subliminal
action potentials passing through the lateral geniculate nuclei (LGN) would
facilitate the auditorily evoked phosphenes. The LGN is the convergence point with other
paths of the central nervous system, and especially those which influence other high
cognitive neuronal activities.
c) It is necessary to consider the possibility of stimulation by a direct connection
from the auditory path to the visual one. In this sense, the development of projections
from primary and secondary auditory areas to the visual cortex has been observed in
experimental animals [7]. Furthermore, other authors have described the generation
of phosphenes after the stimulation of areas not directly related to visual
perception [22]. It is possible to hypothesise that the convergence of auditory stimuli
as well as visual stimuli in the posterior inferoparietal area is directly involved in the
generation of a spatial representation of the environment perceived through the different
sensorial modalities, which suggests, as mentioned above, the possibility that at that level
the auditory-visual contact is carried out and the subsequent visual evocation occurs.
For this conclusion to be completely valid, neurobiological investigations, including
functional neuroimaging studies of the above-mentioned subjects, need to be
performed to clarify this possibility.
The enhanced non-visual abilities of the blind are hardly capable of fully replacing the
lost sense of vision, because of the much higher information capacity of the visual
channel. Nevertheless, they can provide partial compensation for the lost function by
increasing the spatial information incoming through the auditory system.
Now, our future objectives will be focused on a better delimitation of the observed
capabilities, the study of the developed system in dynamic conditions, and the exploration
of the possible cortical brain areas involved in this process, using functional techniques.
Acknowledgements
This work was supported by grants from the Government of the Canary Islands, the European Community and IMSERSO (Piter Grants).

References
1. Albert S. Bregman. Auditory Scene Analysis. The MIT Press (1990).
2. Alho, K., Kujala, T., Paavilainen, P., Summala, H. and Näätänen, R. Auditory processing in visual areas of the early blind: evidence from event-related potentials. Electroenceph. and Clin. Neurophysiol. 86 (1993) 418-427.
3. Andersen, R., Snyder, L.H., Bradley, D.C., Xing, J. (1997). Multimodal representation of space in posterior parietal cortex and its use in planning movements. Annu. Rev. Neurosci. 20, 303-330.
4. Bach-y-Rita, P. Vision Substitution by Tactile Image Projection. Nature, Vol. 221 (1969) 963-964.
5. Frederic L. Wightman & Doris J. Kistler. "Headphone simulation of free-field listening. I: Stimulus synthesis"; "II: Psychophysical validation". J. Acoust. Soc. Am. 85 (2), Feb. 1989.
6. Griffiths, T., Rees, G., Green, G., Witton, C., Rowe, D., Büchel, C., Turner, R., Frackowiak, R. (1998). Right parietal cortex is involved in the perception of sound movement in humans. Nature Neuroscience 1, 74-77.
7. Innocenti, G.M., Clarke, S. (1984). Bilateral transitory projection to visual areas from auditory cortex in kittens. Develop. Brain Research 14: 143-148.
8. Kay, L. Air sonars with acoustical display of spatial information. In Busnel, R.-G. and Fish, J.F. (Eds.), Animal Sonar Systems, 769-816. New York: Plenum Press.
9. Kujala, T. (1992). Neural plasticity in processing of sound location by the early blind: an event-related potential study. Electroencephalogr. Clin. Neurophysiol. 84, 469-472.
10. Locke, J. An Essay Concerning Human Understanding (Reprinted 1991, Turtle).
11. Lessell, S. and M.M. Cohen. Phosphenes induced by sound. Neurology 29: 1524-1526, 1979.
12. Nitzan, David. "Three-Dimensional Vision Structure for Robot Applications". IEEE Trans. Patt. Analysis & Mach. Intell., 1988.
13. Page, N.G., J.P. Bolger, and M.D. Sanders. Auditory evoked phosphenes in optic nerve disease. J. Neurol. Neurosurg. Psychiatry 45: 7-12, 1982.
14. Rauschecker, J.P., Korte, M. (1993). Auditory compensation of early blindness in cat cerebral cortex. Journal of Neuroscience, 13(10): 4538-4548.
15. Rauschecker, J.P. (1995). Compensatory plasticity and sensory substitution in the cerebral cortex. TINS 18(1), 36-43.
16. Rice, C.E. (1995). Early blindness, early experience, and perceptual enhancement. Res Bull Am Found Blind 22: 1-22.
17. Rock, I. (1966). The Nature of Perceptual Adaptation. Basic Books.
18. Rodríguez-Ramos, L.F., Chulani, H.M., Díaz-Saco, L., Sosa, N., Rodríguez-Hernández, A., González-Mora, J.L. (1997). Image And Sound Processing For The Creation Of A Virtual Acoustic Space For The Blind People. Signal Processing and Communications, 472-475.
19. Sadato, N., Pascual-Leone, A., Grafman, J., Ibáñez, V., Deiber, M.P., Dold, G., Hallett, M. (1996). Activation of primary visual cortex by Braille reading in blind people. Nature 380, 526-527.
20. Takahashi, T.T., Keller, C.H. (1994). "Representation of Multiple Sound Sources in the Owl's Auditory Map." Journal of Neuroscience, 14(8): 4780-4793.
21. Takeo Kanade, Atsushi Yoshida. A Stereo Matching for Video-rate Dense Depth Mapping and Its New Applications (Carnegie Mellon University). Proceedings of 15th Computer Vision and Pattern Recognition Conference.
22. Tasker, R.R., L.W. Organ, and P. Hawrylyshyn. Visual phenomena evoked by electrical stimulation of the human brain stem. Appl. Neurophysiol. 43: 89-95, 1980.
23. Veraart, C., De Volder, A.G., Wanet-Defalque, M.-C., Bol, A., Michel, Ch., Goffinet, A.M. (1990). Glucose utilisation in visual cortex is abnormally elevated in blindness of early onset but decreased in blindness of late onset. Brain Res. 510, 115-121.
24. Winfield, D.A. The postnatal development of synapses in the visual cortex of the cat and the effects of eyelid closure. Brain Res. 1981, 206: 166-171.
Application of the Fuzzy Kohonen Clustering Network
to Biological Macromolecules Images Classification

Alberto Pascual 1, Montserrat Bárcena 1, J.J. Merelo 2, José-María Carazo 1

1 Centro Nacional de Biotecnología-CSIC, Universidad Autónoma, 28049 Madrid, Spain.
Tel: +34-91 585 4543; Fax: +34-91 585 4506; e-mail: {pascual, carazo}@cnb.uam.es
2 GeNeura Team, Dpto. Arquitectura y Tecnología de las Computadoras, Facultad de
Ciencias, Campus Fuentenueva, s/n, 18071 Granada, Spain; e-mail: geneura@kal-el.ugr.es

Abstract. In this work we study the effectiveness of the Fuzzy Kohonen
Clustering Network (FKCN) in the unsupervised classification of electron
microscopic images of biological macromolecules. The algorithm combines
Kohonen's Self-Organizing Feature Maps (SOM) and the Fuzzy c-means
clustering technique (FCM) in order to obtain a powerful clustering method
that inherits the best properties of both. Two different data sets obtained from
the G40P helicase from B. subtilis bacteriophage SPP1 have been used for
testing the proposed method, one composed of 2458 rotational power spectra
of individual images and the other composed of 338 images from the same
macromolecule. Results of FKCN are compared with Self-Organizing Maps
(SOM) and manual classification. Experimental results prove that this new
technique is suitable for working with large, high-dimensional and noisy data
sets. This method is proposed as a classification tool in Electron Microscopy.

1. Introduction.
Image classification is a very important step in the three-dimensional study of
biological macromolecules using Electron Microscopy (EM), because three-
dimensional reconstruction methods need a homogeneous set of projections, that is,
different projection views (two-dimensional images) of the same biological
specimen. Obtaining such a set is a very complicated task due to several factors: the
low signal/noise ratio of the images obtained in the electron microscope and the
intrinsic heterogeneity of the set of images. The latter arises because a biochemically
homogeneous population does not necessarily produce a homogeneous set of images,
since different 2D views of the same 3D structure may exist, and a large number of
projections is usually obtained from a large set of particles of the same specimen.
In the context of Pattern Recognition and Classification in Electron Microscopy,
different approaches have been used previously: classical statistical methods,
clustering techniques and Neural Networks. Multivariate Statistical Analysis (MSA)
[1][2] was first proposed as a way to reduce the number of variables characterizing
an image, and in some cases a visual inspection was enough to enable the
identification of the clusters in the data set under analysis. Visual inspection,
however, is not suitable for all kinds of data, so more objective clustering methods

were applied: hierarchical ascendant classification [3], a hybrid k-means and
ascendant classification approach [4], and Fuzzy c-means [5].
Neural Networks (Kohonen's self-organizing feature maps) have also been
successfully applied to the classification of Electron Microscopy images [6], creating a
new and powerful methodology for classification in this field. One of the advantages
of this approach is that it does not need any prior knowledge of the number of clusters
present in the data set; however, several parameters, such as topology, size,
dimension, the size of the update neighborhood and the learning-rate strategy, must be
defined by the user and have a direct impact on the result [7]. A bad choice of
parameters, especially the map size and the update neighborhood, can lead to poor
results, which is why several attempts to optimize the selection of these parameters
have also been reported [8][9].
In the present work we describe the application to Electron Microscopy image
classification of a hybrid approach integrating Kohonen's Self-Organizing Feature
Maps (SOM) [7] and the Fuzzy c-means clustering algorithm (FCM) [10]: the
Fuzzy Kohonen Clustering Network (FKCN). This method takes advantage of the
self-organizing structure of SOM and the fuzzy clustering model of FCM. It was first
proposed by Bezdek [11] and has also been successfully applied in the field of image
processing [12][13].

2. Materials and Methods.

2.1 Experimental data sets

We have used a set of images of negatively stained hexamers of the SPP1 G40P
helicase obtained in the electron microscope [14]. 2458 images were translationally
and rotationally aligned and their rotational power spectra were calculated [15]. For
experimental purposes we created two data sets: one composed of 2458 rotational
power spectra (up to 15 harmonics) and the other composed of 338 images of 50x50
pixels that were extracted from an apparently homogeneous 6-fold and 3-fold
symmetry population. The original images have a very low signal/noise ratio, making
visual classification impossible (Figure 1).

Fig. 1. Two examples of the images used for the experiments. As can be seen, they have a low
signal/noise ratio, making visual classification impossible.

2.2 Kohonen's self-organizing feature maps

The Kohonen model is a neural network that simulates the hypothesized self-
organization process carried out in the human brain when some input data are
presented [7]. The algorithm can be interpreted as a nonlinear projection of the
probability density function of the n-dimensional input onto an output array of nodes.
The functionality of the algorithm can be described as follows: when an input vector is
presented to the net, the neurons in the output layer compete among themselves and
the winner (whose weight has the minimum distance from the input), as well as a
predefined set of neighbors, update their weights. This process is continued until some
stopping criterion is met, usually when the weight vectors "stabilize" or when a number
of iterations is completed. The update rule of this algorithm is:

    v_{i,t} = v_{i,t-1} + α_t h_{r,t} (x_k − v_{i,t-1})        (1)

where the learning rate α_t is a decreasing function that controls the magnitude of the
changes with time, and h_{r,t} is a sigmoidal function that controls the neighborhood
of the winning node to be updated during training.
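As an illustration only, the following minimal NumPy sketch (not the authors' code; the Gaussian neighborhood and the linear decay schedules are our own assumptions) performs one training step of update rule (1):

```python
import numpy as np

def som_step(V, x, t, t_max, alpha0=0.5, sigma0=2.0):
    """One SOM step. V: (rows, cols, d) float weights; x: d-vector input.
    alpha0/sigma0 are assumed initial learning rate and neighborhood radius."""
    rows, cols, _ = V.shape
    # Winner: the node whose weight has minimum distance from the input.
    dist = np.linalg.norm(V - x, axis=2)
    r0, c0 = np.unravel_index(np.argmin(dist), dist.shape)
    # Decreasing learning rate alpha_t and shrinking neighborhood h_{r,t}.
    frac = 1.0 - t / t_max
    alpha, sigma = alpha0 * frac, sigma0 * frac + 1e-3
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    h = np.exp(-((rr - r0) ** 2 + (cc - c0) ** 2) / (2 * sigma ** 2))
    # Update (1): v_{i,t} = v_{i,t-1} + alpha_t * h_{r,t} * (x_k - v_{i,t-1}).
    V += alpha * h[..., None] * (x - V)
    return V
```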

2.3 Fuzzy c-means

Fuzzy c-means clustering is a process of grouping similar objects into the same class,
but the resulting partition is fuzzy, which means that in this case images are not
assigned exclusively to a single class, but partially to all classes. The goal is to
optimize the clustering criterion in order to achieve a high intracluster similarity and a
low intercluster similarity using n-dimensional feature vectors. The theoretical basis
of these methods has been reported in detail elsewhere [10][11] and will only be
briefly reviewed here.
Let X = {x_1, x_2, x_3, ..., x_n} denote a set of n feature vectors x_k. The data set X
is to be partitioned into c fuzzy clusters, where 1 < c < n, c being the number of
clusters to be found. A c-partition of X can be represented by u_i(x_k) or u_{ik},
where u_{ik} is a continuous function in the [0,1] interval and represents the
membership of x_k in cluster i, 1 ≤ i ≤ c, 1 ≤ k ≤ n. The Fuzzy c-means algorithm
consists of an iterative optimization of an objective function:

    J_m(U, v) = Σ_{k=1}^{n} Σ_{i=1}^{c} (u_{ik})^m (D_{ik})^2        (2)

where v = (v_1, v_2, v_3, ..., v_c), with v_i being the cluster center of class i,
1 ≤ i ≤ c, and D_{ik} is the inner-product norm

    (D_{ik})^2 = ||x_k − v_i||_A^2        (3)

and A is a positive definite matrix. The parameter m determines the "fuzziness" of the
result, with m ∈ [1, ∞). The choice of m depends on the data under analysis.
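As a hedged sketch (our own minimal implementation, with A = I so that D_{ik} is the Euclidean distance), the standard alternating optimization of (2) looks as follows:

```python
import numpy as np

def fcm(X, c, m=1.5, iters=100, seed=0):
    """Minimal Fuzzy c-means, A = I (Euclidean norm) assumed.
    X: (n, p) data. Returns centers V (c, p) and memberships U (c, n)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]   # initial centers
    for _ in range(iters):
        # Squared distances D_ik^2 between every center and every sample.
        D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1) + 1e-12
        # Membership update minimizing (2): u_ik proportional to D_ik^(-2/(m-1)).
        U = D2 ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)
        # Center update: weighted means with weights u_ik^m.
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
    return V, U
```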

2.4 Fuzzy Kohonen Clustering Network (FKCN)

The Fuzzy Kohonen Clustering Network [11] is a type of Neural Network that combines
both methods described above: SOM and Fuzzy c-means. The structure of this self-
organizing feature-mapping model consists of two layers: input and output. The
input layer is composed of p nodes, where p is the number of features, and the output
layer is formed by c nodes, where c is the number of clusters to be found. Every
single input node is fully connected to all output nodes, with an adjustable weight v_i
(cluster center) assigned to each connection. Given an input vector, the neurons in the
output layer update their weights based on a predefined learning rate α. This
approach integrates the fuzzy membership u_{ik} from FCM in the following update
rule (a code sketch combining these steps is given after the property list below):

    v_{i,t} = v_{i,t-1} + α_{ik,t} (x_k − v_{i,t-1})        (4)

where the learning rate α is defined as

    α_{ik,t} = (u_{ik,t})^{m_t};   m_t = m_0 − t·Δm   and   Δm = (m_0 − 1)/t_max,

and u_{ik,t} is the fuzzy membership matrix calculated by Fuzzy c-means, m_0 is any
positive constant greater than one, t is the current iteration and t_max is the iteration
limit.

This method possesses several interesting properties:

• The learning rate is a function of the iteration t, and its effect is to distribute the
  contribution of each input vector x_k to the next update of the neuron weights
  inversely proportionally to their distance from x_k. The winner node (whose weight
  has the minimum distance from the input) updates its weight favored by the
  learning rate as the iterations increase; in this way the Kohonen concepts of
  neighborhood size and neighborhood updating are embedded in this new learning
  rate.
• FKCN is not sequential; the code vectors are updated after each pass through X
  (the input vectors). Hence, it is not label dependent.
• In the limit (m_t = 1), the update rule reverts to Hard c-means (winner take all).
• For a fixed m_t > 1 (that is, Δm = 0), FKCN is truly a Fuzzy c-means algorithm.
• FKCN is truly a Kohonen-type algorithm, because it possesses a well-defined
  method for adjusting both the learning-rate distribution and the update neighborhood
  as functions of time. Hence, FKCN inherits the "self-organizing" structure of
  SOM-type algorithms and, at the same time, is stepwise optimal with respect to a
  widely used fuzzy clustering model.
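The sketch below (our own batch formulation under the same Euclidean assumption as before; all names are hypothetical) combines the FCM membership computation with the decreasing exponent m_t of (4); memberships are computed in log-space to avoid overflow as m_t approaches 1:

```python
import numpy as np

def fkcn(X, c, m0=1.5, t_max=500, seed=0):
    """Minimal FKCN sketch. X: (n, p) data; returns centers V and memberships U."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]
    dm = (m0 - 1.0) / t_max                       # Delta_m = (m0 - 1)/t_max
    for t in range(t_max):
        mt = m0 - t * dm                          # m_t = m0 - t*Delta_m
        D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1) + 1e-12
        # FCM memberships at fuzziness m_t, via a softmax in log-space.
        logu = -np.log(D2) / (mt - 1.0 + 1e-9)
        logu -= logu.max(axis=0, keepdims=True)
        U = np.exp(logu)
        U /= U.sum(axis=0, keepdims=True)
        # Learning rates alpha_{ik,t} = u_{ik,t}^{m_t}; batch form of update (4).
        A = U ** mt
        den = A.sum(axis=1, keepdims=True) + 1e-12
        V = V + (A @ X - den * V) / den
    return V, U
```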

3. Results.

We have tested file proposed method in two different data sets: one composed by the
rotational power spectra of a large set of images and the other composed by a subset
of these images. In this way we attempt to demonstrate that FKCN is suitable for
working with large, noisy and high dimensional data, very common in Electron
Microscopy Image Analysis.

3.1 Experiment using Rotational Power Spectra.

In this example, 2458 images were used for analysis. The rotational power spectrum
[15] of each particle was calculated, yielding a data set of 2458 15-dimensional vectors.
A 7x7 SOM was applied to the data set and the resulting map was manually clustered
into four classes as described in [14]. The results are shown in Figure 2. Group A shows
a predominant 6-fold symmetry with a small but noticeable component on harmonic
3. Group B represents 3-fold symmetry images. Group C is closely related to 2-fold
symmetry particles, and Group D shows only a prominent harmonic 1, which can
be interpreted as a lack of symmetry in this group of particles.
FKCN was applied to the whole set of particle spectra with the following
configuration: 15 input nodes (each node representing a component of the spectrum)
and 4 nodes in the output layer (representing four clusters). For comparison purposes
with results already obtained using this data set [14], 4 clusters were used. The fuzzy
constant m was set to 1.5 and 500 iterations were used. The resulting code vectors
(cluster centers) are shown in Figure 3. As can be seen from the cluster centers, the four
groups visualized by the SOM were also successfully extracted by FKCN.
Quantitative results of coincidence (with respect to the SOM groups) are shown in
Table 1; however, a major difference in the sets extracted by the two algorithms (SOM
and FKCN) should be noticed. When the SOM output was manually clustered, a set of
code vectors lying on the boundaries between the groups was not considered, in order
to avoid erroneous classification at the borders. FKCN considered the whole data set,
so an unavoidable difference is reflected in the results. Cluster 2 obtained by FKCN
seems to be composed of the spectra associated with the code vectors that were
eliminated from the SOM for being part of the borders between the four hypothetical
clusters. Furthermore, a large set of noisy spectra, as well as the non-symmetric ones,
is also included in this cluster.

Table 1. Comparison between FKCN and SOM. m = 1.5, t (iterations) = 500.

    SOM        FKCN        Coincidence
    Group A    Cluster 1   92.87%
    Group D    Cluster 2   90.97%
    Group B    Cluster 3   85.86%
    Group C    Cluster 4   96.24%


Fig. 2. 7x7 SOM output manually clustered into four regions. Group A: particles with a prominent 6-fold
component and a small but noticeable 3-fold component. Group B: 3-fold symmetry. Group C: 2-fold
symmetry. Group D: lack of a predominant symmetry. The number of elements assigned to each code
vector is printed in the upper-right corner of each spectrum.


Fig. 3. Four cluster centers obtained from FKCN. The number of elements assigned to each cluster
is printed in the upper-right corner of each spectrum.

3.2 Experiment using 3-fold and 6-fold symmetry images.

In this experiment 338 images were used for testing the algorithms in the presence of
high-dimensional and very noisy data. These images share a similar rotational
symmetry (6-fold with a minor 3-fold component); see [14] for details. Classification
of this kind of particle is very difficult because of the low signal/noise ratio and the
apparent structural similarity of the population. Two examples of the images forming
the data set are shown in Figure 1.

For comparison purposes, a 7x7 SOM was applied to the data set and the resulting
map was manually clustered into two different classes that apparently exhibited
opposite handedness [14]. Figure 4 shows the map clustered into two classes that seem
to reflect essentially the same type of macromolecular structure.
FKCN was applied to this data set, using a circular mask of the images (Area of
Interest) as input nodes. In the first experiment we clustered the data into 2 groups;
however, the results (not shown here) indicated that the method could not find the
subtle variations in handedness of the particles. A small but noticeable difference in
the rotational spectra was detected, but it was not enough to draw conclusions from.
We then clustered the data into 3 groups using m = 1.5 and 500 iterations. The cluster
centers obtained are shown in Figure 5. Classification accuracy is also shown in Table 2.
Analyzing the results of the clustering algorithm, it is clear that FKCN correctly
clustered Group A of the SOM (Cluster 3); however, Group B needs further analysis. It is
obvious that both clusters 1 and 2 belong to Group B in the SOM, but a main question
arises: why did FKCN need three clusters to "find" this structure in the data? The
answer can be found by analyzing the particles assigned to these clusters. The images
from the three classes were independently realigned to obtain their averages. Figure 6
shows the average image and rotational spectrum of the subset belonging to each cluster.
In the case of cluster 3, a clockwise handedness is clearly observed, as expected from
Group A of the SOM. In the case of clusters 1 and 2, the average images show the same
handedness (counterclockwise) as expected from Group B of the SOM; however, the
differences in rotational symmetry between the two clusters are also clear. Both of them
have a predominant 6-fold symmetry, but Cluster 2, as opposed to Cluster 1, is
influenced by a noticeable 3-fold component. This very subtle difference was not very
clear in the SOM. This small symmetry variation in the data set was probably the cause
of the misclassification when using two clusters: symmetry had more influence than
handedness.

Table 2. Comparison between FKCN and SOM in image classification. m = 1.5, t (iterations) = 500.

    FKCN       SOM       Coincidence
    Cluster 1  Group B   93.23%
    Cluster 2  Group B   96.15%
    Cluster 3  Group A   89.3%

Fig. 4. 7x7 SOM output manually clustered into two regions with different
handedness. The number of particles assigned to each code vector is printed in
the lower-right corner of each image.

Fig. 5. Three cluster centers obtained from FKCN. The number of particles
assigned to each cluster is printed in the lower-right corner of each image.

4. Discussion and Conclusion.

In this paper, a fuzzy classification technique has been applied to the study
of biological specimens by Electron Microscopy. This technique uses a special type of
Neural Network, named the Fuzzy Kohonen Clustering Network (FKCN), that
successfully combines the well-known Self-Organizing feature maps (SOM) and the
Fuzzy c-means clustering technique (FCM). The need for classification tools suitable
for working with large sets of noisy images is evident in the electron microscopy field.
Here we have proposed an approach that lies somewhere between two methods already
applied in this context: SOM and Fuzzy c-means. The proposed
method combines the idea of fuzzy membership values as learning rates, the
parallelism of FCM and the structure of the update rules of SOM, producing a robust
clustering technique with a self-organizing structure.
It is important to note that FKCN can also be considered an optimization algorithm,
like FCM, so the possibility of falling into a local minimum is theoretically present;
however, the experiments carried out in this work showed that FKCN properly
converged in all the analyzed cases. On the contrary, FCM apparently fell into a local
minimum in the presence of this kind of data.
FKCN has been fully tested in this work using two kinds of data sets that are very
common in any Electron Microscopy laboratory: the rotational power spectra and the
images of individual particles of a protein (in this case the G40P helicase). In both
cases FKCN was able to discriminate not only evident but also subtle variations in
the data set. The results demonstrate the suitability of this method for working with
this kind of high-dimensional and noisy data set. Comparing this clustering approach
with others previously proposed in the field for structure-based classification, we
should emphasize that this method directly performs classification (assignment of
data to clusters), while at the same time offering direct visualization through
inspection of the cluster centers.
A number of future research topics remain open, especially the automatic
determination of the number of clusters. In our opinion this topic can be addressed by
means of exploratory data analysis capable of faithfully revealing the Probability
Density Function of the data set under analysis. We think that this type of neural
computation approach (self-organizing networks) can be successfully employed for
the exploration of data in the Electron Microscopy field.

Fig. 6. Average images and rotational spectra of clusters 1, 2 and 3.



5. References.

1. Van Heel, M., Frank, J.: Use of multivariate statistics in analysing the images of biological
macromolecules. Ultramicroscopy 6 (1981) 187-194.
2. Frank, J., Van Heel, M.: Correspondence analysis of aligned images of biological
particles. J. Mol. Biol. 161 (1982) 134-137.
3. Van Heel, M.: Multivariate statistical classification of noisy images (randomly oriented
biological macromolecules). Ultramicroscopy 13 (1984) 165-184.
4. Frank, J., Bretaudiere, J.P., Carazo, J.M., Verschoor, A., Wagenknecht, T.: Classification
of images of biomolecular assemblies: a study of ribosomes and ribosomal subunits of
Escherichia coli. J. Microsc. 150 (1988) 99-115.
5. Carazo, J.M., Rivera, F.F., Zapata, E.L., Radermacher, M., Frank, J.: Fuzzy set based
classification of electron microscopy images of biological macromolecules with an
application to ribosomal particles. J. Microsc. 157 (1990) 187-203.
6. Marabini, R., Carazo, J.M.: Pattern Recognition and Classification of Images of
Biological Macromolecules using Artificial Neural Networks. Biophysical Journal 66
(1994) 1804-1814.
7. Kohonen, T.: Self-Organizing Maps, 2nd Edition, Springer-Verlag (1997).
8. Siemon, H.P.: Selection of Optimal Parameters for Kohonen Self-Organizing Feature Maps.
Artificial Neural Networks 2 (1992) 1573-1577.
9. Villmann, T., Der, R., Herrmann, M., Martinetz, T.M.: Topology Preservation in Self-
Organizing Feature Maps: Exact Definition and Measurement. IEEE Transactions on
Neural Networks 8 (1997) 256-266.
10. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum
Press, New York (1984).
11. Tsao, E.C.-K., Bezdek, J.C., Pal, N.R.: Fuzzy Kohonen Clustering Networks.
Pattern Recognition 27 (1994) 757-764.
12. Jin-Shin Chou, Chin-Tu Chen, Wei-Chung Lin: Segmentation of Dual-echo MR Images
using Neural Networks. Image Processing 1898 (1993) 220-227.
13. Diago, L.A., Pascual, A., Ochoa, A.: A Genetic Algorithm for Automatic Determination of
the Cup/Disc Ratio in Eye Fundus Images. Proceedings TIARP'98, Mexico (1998) 461-472.
14. Bárcena, M., San Martín, C., Weise, F., Ayora, S., Alonso, J.C., Carazo, J.M.: Polymorphic
quaternary organization of the Bacillus subtilis bacteriophage SPP1 replicative helicase
(G40P). Journal of Molecular Biology (1998) (in press).
15. Crowther, R.A., Amos, L.A.: Harmonic analysis of electron microscope images with
rotational symmetry. J. Mol. Biol. 60 (1971) 123-130.
16. Rivera, F.F., Zapata, E.L., Carazo, J.M.: Cluster validity based on the hard tendency of
the fuzzy classification. Pattern Recognition Letters 11 (1990) 7-12.
Bayesian VQ Image Filtering Design with
Fast Adaption Competitive Neural Networks

A.I. González 1, M. Graña 1, I. Echave 1, J. Ruiz-Cabello 2

1 Dpto. Ciencias de la Computación e IA, UPV/EHU
Apdo. 649, 20080 San Sebastián, Spain
e-mail: ccpgrrom@si.ehu.es

2 Unidad de RMN, Universidad Complutense
Paseo Juan XXIII, 1, 28040 Madrid, Spain

Abstract. Vector Quantization (VQ) is a well-known technique for
signal compression and codification. In this paper we propose the
filtering of images based on the codebooks obtained from Vector
Quantization design algorithms under a Bayesian framework. The
Bayesian VQ filter consists of the substitution of each image pixel by
the central pixel of the codevector that encodes the pixel and its
neighborhood. This process can be interpreted as a Maximum A
Posteriori restoration based on the codebook estimated from the image.
We apply the VQ filter to noise removal in images from micro-
magnetic resonance. We compare our approach with the more
conventional approach of applying VQ compression as a noise-removal
filter. Some visual results show the improvement introduced by our
approach.

1. Introduction.

The Self-Organizing Map and other Competitive Neural Networks [5] are applied to
Vector Quantization (VQ) [3,4], which is a widely used technique for signal
compression and pattern recognition. VQ is used as an encoding mechanism or to
provide the dimensionality reduction needed by classifiers. In the field of digital
image processing [1] it has mainly been proposed for lossy image compression,
because of its nice rate-distortion properties. However, the problem of efficiently
computing the codebook and the image quantization is still an issue for research,
where Competitive Neural Networks are among the salient approaches. We apply fast
learning variants [7] of the SOM to obtain good approximations to the optimal
codebooks with affordable amounts of computation. This approach makes sense for
big problems such as the one tackled here. The practical application presented here
deals with the restoration and noise removal of 3D images produced by micro-
magnetic resonance imaging devices. The 3D image is a 128x256x128 matrix of 32

bits/pixel elements. Thus, the computation of the codebooks for either the filtering or
the compression of these images requires very efficient VQ design techniques.
The Bayesian VQ filtering approach consists of two steps: (1) determine the
codevector that encodes the vector given by the pixel and its neighborhood, and (2)
substitute the pixel by the central pixel of that codevector. The probabilistic
interpretation of this process comes from the interpretation of the encoding process
as a Maximum A Posteriori (MAP) classification. Under this interpretation, the VQ
filter performs a simplified Bayesian restoration [1,2], where (1) the stochastic model
and its estimation are, respectively, the codebook and the search for the optimal
codebook; and (2) the restoration process does not involve complex and lengthy
relaxation procedures (like simulated annealing in [2]), only the search for the nearest
codevector. The problem of estimating the optimal codebook remains as difficult as
in conventional VQ. Our approach is related to other techniques that involve the
application of compression algorithms to image filtering. Among them, the Occam
filters [8,9] are of special interest for future developments. The definition of Occam
filters based on VQ demands very efficient codebook design algorithms; in this
respect our fast-learning application of the SOM could be a first step towards this goal.
In Section 2 we present the Bayesian VQ filter more formally. Section 3 presents
the results of the application to an image, compared with the Compression VQ
approach. Finally, Section 4 gives some concluding remarks.

2. VQ for image filtering.

In this section we will review the approach to signal noise removal based on the so-
called Occam filters. Afterwards we will present our approach, relating the VQ-based
filter to the Bayesian framework for image restoration. Let us first recall some basic
definitions about Vector Quantization.
A conventional definition of Vector Quantization [4] is as follows: given a
stochastic process {X_t; t > 0} whose state space lies inside the d-dimensional
(Euclidean) real space R^d, the Vector Quantizer is given by a set of codevectors that
form a codebook Y = {y_1, ..., y_c}, where c is the codebook size; the encoding
operation E : R^d → {1, ..., c} that maps each vector in R^d to the nearest codevector
in the sense of some defined distance (most usually the Euclidean distance); and the
decoding operation C^{-1} : {1, ..., c} → R^d that reconstructs the encoded signal
using the codebook. Vector Quantization design consists of the estimation of the
codebook from a sample X = {x_1, ..., x_n}.
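A minimal sketch of the encoding and decoding operations (our own illustration; Euclidean distance assumed):

```python
import numpy as np

def vq_encode(X, Y):
    """E: map each d-vector in X (n, d) to the index of its nearest
    codevector in the codebook Y (c, d); Euclidean distance assumed."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def vq_decode(idx, Y):
    """C^{-1}: reconstruct the encoded signal from the codebook."""
    return Y[idx]
```

The design problem is then to choose Y so as to minimize the average of ||x − y_E(x)||^2 over the sample.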
For image compression, the image is decomposed into non-overlapping blocks and
each block is treated as an independent vector: F = {F_{1,1}, ..., F_{n,n}}, where
n = N/√d, N^2 is the dimension of the square image, and each
F_{i,j} = {F_{i+k,j+l} : 0 ≤ k,l ≤ √d − 1} can be considered a d-dimensional vector.
The codebook is therefore a set of image blocks
y_i = (y^i_{k,l} : 0 ≤ k,l ≤ √d − 1), the codification process produces a reduction of
the data proportional to d, and the distortion of the decoded image (measured either
by the mean square error or the signal-to-noise ratio) increases accordingly. The
complexity of the search for optimal codebooks also grows with the size of the
blocks considered.

2.1 VQ compression for image filtering

The so-called Occam filters were introduced by Natarajan in [8,9]. The essence of
the technique is that when a lossy data compression algorithm is applied to a noisy
signal with the allowed loss set equal to the noise strength, the loss and the noise tend
to cancel rather than add. A lossy compression algorithm C is a program that takes as
input a sequence f_n and a loss tolerance δ > 0 and produces as output a string s
representing an encoded sequence g_n; if ||f_n − g_n|| ≤ δ, the algorithm is said to
obey the norm ||·||. The decompression algorithm C^{-1} produces the output
sequence g_n on input the string s.
Let f'_n be a sequence corrupted with noise: f'_n = f_n + v_n, where v_n is a sample
sequence of a random variable representing the noise. With respect to a metric ||·||,
the strength of the noise is denoted ||v|| = lim_{n→∞} ||v_n||. The Occam filter
algorithm is defined as:
    Let ||v|| be the strength of the noise.
    Run C(f'_n, ||v||) to obtain s.
    Run C^{-1}(s) to obtain the filtered sequence g_n.

Thus, the general definition of the Occam filters requires the estimation of the rate-
distortion response of the compression algorithm. The knee-point of this plot, i.e.,
the point at which its second difference attains a maximum, is taken as the optimal
rate-distortion relationship. The filtered sequence g_n is the one obtained by
compression/decompression with the loss tolerance set to the optimal distortion.
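As a hedged illustration of the knee-point selection (our own sketch; the sampling of the rate-distortion curve is assumed to be given):

```python
import numpy as np

def knee_index(distortions):
    """Index of the knee of a sampled rate-distortion curve: the point at
    which the second difference of the distortion attains its maximum."""
    d = np.asarray(distortions, dtype=float)
    second_diff = d[:-2] - 2.0 * d[1:-1] + d[2:]
    return 1 + int(np.argmax(second_diff))

# Hypothetical usage: run the codec at several settings, pick the knee,
# and keep the decompressed signal produced at that setting.
distortions = [9.0, 5.0, 2.0, 1.7, 1.5]   # assumed measurements
print(knee_index(distortions))            # -> 2
```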
The applications of Vector Quantization to digital image processing are discussed in
[3], where it is suggested that the encoding/decoding process introduces a nonlinear
smoothing of the image that removes some kinds of noise, especially speckle noise.
Therefore, VQ can be considered an instance of the Occam filters. However, some
specific characteristics of VQ must be taken into account. The application of VQ as a
compression algorithm is usually guided by the compression rate; therefore, to obtain
a prescribed distortion, an exhaustive exploration of the codebook dimension and
size must be performed. The computational complexity of VQ design is very high
and grows exponentially with the codebook size and dimension. Therefore, the
general Occam approach is of little practical application, a situation that worsens for
image processing. We apply the well-known Self-Organizing Map [5] to the VQ
design task, employing some fast learning strategies [7]. Nevertheless, the
computational cost precludes the exhaustive computation of the rate-distortion curve.

2.2. VQ and the Bayesian Restoration of images.

The Bayesian approach to the restoration of images [1,2] is one of the least
restrictive approaches in its assumptions. The observed image is modeled as a
degraded image G of the form Φ(H(F)) ⊕ N, where F is the original image, H is the
blurring operation, Φ is a possibly nonlinear transformation, N is an independent
noise field and ⊕ denotes any suitably invertible operation. The a posteriori
conditional density given by Bayes' rule,

    p(F = f | G = g) = p(G = g | F = f) p(F = f) / p(G = g),        (1)

has been used to find different types of estimates of the true image F from the
observed image G. The maximum a posteriori (MAP) and maximum likelihood (ML)
estimates are the modes of p(F = f | G = g) and p(G = g | F = f), respectively. When
the observation model is nonlinear, it is difficult to obtain the marginal density
p(G = g). However, the MAP and ML estimates do not require it, and are easier to
obtain than minimum mean square estimates. Both estimation methods need to
postulate some image model. In [1, ch. 8] it is shown that, under the assumption of
Gaussian distributions of both the original image and the noise process, and given
knowledge of the smoothing filter H, equations can be deduced to obtain the ML
and MAP estimates of F. In the famous paper by Geman and Geman [2], it was
shown that the posterior probability distribution is Gibbsian under the assumption
of a Gibbsian distribution of the original image, modeled through the definition of a
Markov Random Field. The tools of statistical physics were introduced to produce a
probabilistic model of the image based on local interactions. We will discuss below
that the VQ filter in this paper assumes an image model given by a mixture of
Gaussians.
In the Bayesian application of VQ to image filtering, the codevectors are considered
to be centered around their middle pixels, much like the traditional definition of
convolution masks. They are thus redefined as
y_i = (y^i_{k,l} : −√d/2 ≤ k,l ≤ √d/2). The image is not decomposed into blocks;
rather, we consider for each pixel a neighborhood
F_{i,j} = {F_{i+k,j+l} : −√d/2 ≤ k,l ≤ √d/2}. The filtering process substitutes each
pixel by the central pixel of the codevector that encodes its neighborhood; denoting
the filtered image F' = [F'_{i,j} : 1 ≤ i,j ≤ N], it can be formalized as follows:

    F'_{i,j} = y^{E(F_{i,j})}_{0,0}        (2)

It must be noted that in this application of the codebook, ideas relevant to
compression, such as the relationship between distortion, signal-to-noise ratio and the
dimensionality of the codevectors, are no longer meaningful. The codevectors become
probabilistic models of the pixel neighborhoods. The filtering application of the
codebook must be interpreted as realizing the following approximation of the
posterior probabilities:

    P(F_{i,j} = y^k_{0,0} | G_{i,j} = g) = δ_{k,E(g)},   1 ≤ i,j ≤ N, 1 ≤ k ≤ c        (3)
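The following minimal 2D sketch (our own illustration; the authors also use 3D neighborhoods, and the border handling here is an assumption) implements the pixel substitution of (2):

```python
import numpy as np

def bayesian_vq_filter(img, Y, w):
    """Replace each pixel by the central pixel of the codevector that
    encodes its w x w neighborhood. Y: (c, w*w) codebook; w odd."""
    r = w // 2
    pad = np.pad(img.astype(float), r, mode="edge")   # border handling assumed
    out = np.empty(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = pad[i:i + w, j:j + w].ravel()        # neighborhood F_{i,j}
            k = ((Y - patch) ** 2).sum(axis=1).argmin()  # MAP class: nearest codevector
            out[i, j] = Y[k].reshape(w, w)[r, r]         # central pixel y^k_{0,0}
    return out
```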

To put the VQ filter in the framework of Bayesian image restoration, we recall the
probabilistic model embodied by the codebook. In our work we consider that the
codebook design performed by the Self-Organizing Map intends to minimize the
Euclidean distortion, and as such it has been applied in the experiments described
below:

    D = Σ_i ||x_i − y_{E(x_i)}||^2        (4)

A well-known interpretation [6] of the minimization of (4), in terms of statistical
decision theory, is as follows. Given a number of classes, i.e. c, and feature vectors
whose probability density follows a mixture of conditional densities
p(x) = Σ_{j=1}^{c} P(ω_j) p(x | ω_j), if we assume that the conditional densities are
Gaussian with identical unit covariance matrices, p(x | ω_j) = N(y_j, I), and that the
classes are equiprobable, then the minimization of (4) is equivalent to maximum
log-likelihood estimation of the parameters of the model, the class means. Based on
these parameters, the MAP decision max_j p(ω_j | x) is the Bayesian minimum-risk
decision. Thus, the filtering realized by (3) corresponds to a MAP classification and
restoration process, in which the classes are the gray levels of the central pixels of the
representative neighborhoods extracted from the image. We can state the model of
the dependency of each pixel on its neighborhood as:

    p(F_{i,j} = f_{0,0} | F_{i,j} = f) = Σ_{j=1}^{c} (1/c) (2π)^{−d/2} exp(−||f − y_j||^2 / 2)        (5)

where f_{0,0} is the central pixel of an image block f.

3. Experiments and Results.

In this paper we present the visual results of the application of the filtering based on
the VQ codebook computed by the SOM [5] over a 3D image. We have applied some
fast learning strategies that we have already tested elsewhere [7]. We have tested
several sample sizes, but the visual results shown in this paper are obtained with a
sample that consists of 20% of the 3D image pixels and their 2D or 3D
neighborhoods. We have also tested several numbers of classes: 32, 64 and 128. The
compression rate is determined by the number of classes and the neighborhood sizes.
More detailed results with other numbers of classes and sample sizes can be found at
(http://sizx01.si.ehu.es/resultados.html). As these images are ultimately intended for
medical-biological inspection, visual evaluation is the prime concern. Therefore, we
present the visual results of the application of the Bayesian VQ filter and of the
conventional compression/decompression application, as in the Occam filter approach.
The computational process is as follows:
1. Image samples are extracted randomly from the 3D image grid, with the specified
neighborhood sizes.
2. A one-pass learning SOM [7] is applied to the sample to obtain the codebook.
3. The whole 3D image is filtered based on (a) the Bayesian VQ approach and (b)
the Occam filter approach.

3.1. Experimental Data.

The data upon which we have performed the experiments is a sequence of images of
micro-magnetic resonance obtained by the research group of the Unit of Magnetic
Resonance of the Universidad Complutense. The images have been obtained with an
experimental magnet of 4.7 Teslas. The sequence corresponds to 128 cuts of a human
embryo. Each image is 128x256 pixels of 32 bits/pixel. The reduction to 8 bits/pixel
has been done by ad-hoc manipulations of the intensities based on the statistics over
the whole sequence. In Figure 1 we show image #80 of the sequence as it appears
after the manipulation. One of the main interests of the Magnetic Resonance group is
the removal of artifacts in the background that corresponds to the empty space. Also,
the noise removal must preserve some very small classes of pixels for eventual
segmentation.

Fig. 1. Original frame #80 after manual intensity range reduction to 8 bits/pixel, before
processing by the SOM Bayesian VQ filter.

3.2 Experimental Results.

In this section we present some visual results consisting of frame #80 of the
sequence after being filtered with the Bayesian and compression approaches. The
complete sequence results can be obtained as MPEG movies from
(http://sizx01.si.ehu.es/resultados.html). We plan to perform volumetric rendering of
the sequence and present the results as MPEG movies, but at present the movies show
the sequence of cuts after processing with the Bayesian VQ filter and the compression
VQ filter.
Figures 2 and 3 show the Bayesian and Compression VQ filtering results on
frame #80 of the sequence, using 3x3x3 spatial neighborhoods, with 32 and 64
codevectors respectively. That gives compression ratios of 216:5 and 216:6,
respectively. It can be appreciated that the Bayesian VQ filter provides meaningful
results, whereas the results from the Compression VQ filter are obviously affected by
the big distortion produced by the high compression rates (very low coding rates).
Figures 4 and 5 show the Bayesian and Compression VQ filtering results using
5x5x5 spatial neighborhoods, again with 32 and 64 codevectors respectively. The
compression ratios are 1000:5 and 1000:6. The results of the Compression VQ are
almost unrecognizable, while the Bayesian VQ provides a strong smoothing effect that
preserves the boundaries of the image regions. We note that the Bayesian VQ does not
involve a proper compression of the image; nevertheless, the compression ratio is an
indication of the reduction in terms of the number of potential models applied to the
MAP filtering.
Figures 6 and 7 show the Bayesian and Compression VQ filtering results with 128
codevectors using 3x3x1 and 5x5x1 spatial neighborhoods, respectively. The
compression ratios are 72:7 and 200:7, respectively. Again, the degradation introduced
by the Compression VQ distorts the image to the point of making it impossible to
process, whereas the Bayesian VQ improves over the previous figures and over the
Compression VQ for the same setting. The improvement of the Bayesian VQ at the
72:7 compression ratio over the 200:7 compression ratio is significant.

Fig. 2. Filtering with a 3x3x3 neighborhood and 32 codevectors: (a) Bayesian VQ filter
and (b) Compression VQ filter.

Fig. 3. Filtering with a 3x3x3 neighborhood and 64 codevectors: (a) Bayesian VQ filter
and (b) Compression VQ filter.

Fig. 4. Filtering with a 5x5x5 neighborhood and 32 codevectors: (a) Bayesian VQ filter
and (b) Compression VQ filter.

Fig. 5. Filtering with a 5x5x5 neighborhood and 64 codevectors: (a) Bayesian VQ filter
and (b) Compression VQ filter.

Fig. 6. Filtering with a 3x3x1 neighborhood and 128 codevectors: (a) Bayesian VQ filter
and (b) Compression VQ filter.

Fig. 7. Filtering with a 5x5x1 neighborhood and 128 codevectors: (a) Bayesian VQ filter
and (b) Compression VQ filter.

4. Concluding Remarks.

We have proposed the application of Vector Quantizers computed by the Self-
Organizing Map as Bayesian filtering mechanisms. In this paper we have stated the
idea and discussed its interpretation as a Maximum A Posteriori image filtering
process. The problem of codebook design remains the same as in conventional
applications of VQ, and the same approaches can be applied. In this paper we have
applied a fast learning version of the Self-Organizing Map of Kohonen for the
estimation of the codebook in the experiment. We have shown the results of this
approach against the results obtained by the Compression application of VQ for
image filtering. The Bayesian VQ preserves the boundaries of the image regions even
for very large compression ratios. Future work involves the direct application of the
Self-Organizing Map to the 32 bits/pixel original data, without any manual range
reduction. We also plan to perform extensive experimentation trying to establish the
relation between our approach and the Occam filters.

Acknowledgments.
This work was partially supported by the Departments of Education and Industry of
the Gobierno Vasco under project UE96/9. Ana Isabel González has a predoctoral
grant from the Departamento de Educación, and Imanol Echave has a predoctoral
grant from the Departamento de Industria. The work is also partially supported by the
CICYT project TAP-98-0294-C02-02.

References

[1] A.K. Jain, "Fundamentals of Digital Image Processing", Englewood Cliffs: Prentice-Hall
(1989).
[2] S. Geman, D. Geman, "Stochastic Relaxation, Gibbs Distributions and the Bayesian
Restoration of Images", IEEE Trans. PAMI (1984), vol. 6, pp. 721-741.
[3] P.C. Cosman, K.L. Oehler, E.A. Riskin, R.M. Gray (1993), "Using Vector Quantization
for Image Processing", IEEE Proceedings 81(4), pp. 1326-1341.
[4] A. Gersho, R.M. Gray, "Vector Quantization and Signal Compression", Kluwer (1992).
[5] T. Kohonen, "Self Organizing Maps", Springer-Verlag (1995).
[6] R.O. Duda, P.E. Hart, "Pattern Classification and Scene Analysis", Wiley (1973).
[7] A.I. Gonzalez, M. Graña, A. D'Anjou, F.X. Albizuri, M. Cottrell, "A sensitivity
analysis of the Self Organizing Map as an Adaptive One-pass Non-stationary Clustering
algorithm: the case of Color Quantization of image sequences", Neural Processing Letters
(1997), vol. 6, pp. 77-89.
[8] B.K. Natarajan, "Filtering Random Noise via Data Compression", Proc. IEEE Data
Compression Conference, Snowbird, Utah, pp. 60-69 (1993).
[9] B.K. Natarajan, "Filtering Random Noise from Deterministic Signals via Data
Compression", IEEE Trans. on Signal Processing, vol. 43, no. 11, pp. 2595-2605 (1995).
Neural Networks for Coefficient Prediction
in Wavelet Image Coders

Cindy Daniell and Roy Matic

Abstract
We present a unique method for estimating the upper frequency band coefficients
solely from the low frequency information in a subband multiresolution
decomposition. First, a Bayesian classifier predicts the significance or
insignificance of the high frequency coefficients. A neural network then estimates
the sign and magnitude of the visually significant information. This prediction
model allows us to construct an image coder which can exclude transmission of the
upper subbands and reconstruct this information at the decoder. We demonstrate
results for a two level subband decomposition.

INTRODUCTION

Multiresolution image representations resulting from wavelet transforms are often
used to encode images for data compression in subband image coders. Subband analysis
of images involves separating them into several frequency bands and processing each band
separately. Wavelet subband image coders apply the analysis transform recursively to the
low frequency subband, resulting in a multiresolution, or pyramid, decomposition. The
signal is downsampled between levels of recursion. This loss of information is accounted
for at the decoder by cleverly designed synthesis filters that are applied recursively to
reconstruct the original image. A typical level of the multiresolution decomposition has
four subbands of information, one composed of only low frequency components and
three containing high frequency components.
Although the wavelet transform is an orthogonal decomposition, the visual patterns
among coefficients in adjacent subbands exhibit a striking array of interrelated activity, as
shown in Figure 1.

Figure 1. Multiresolution Image Decomposition. The original image is decomposed
into four subbands as shown in (b). The subband labels are defined in (a) and denote the
horizontal or vertical direction of the low and high pass wavelet filters. The decomposition
process is carried out recursively, as displayed in (c), on each successively smaller LL
subband image.

Current subband coding techniques, however, make little or no use of the apparent
relationships between coefficients in neighboring subbands. This paper investigates the
use of neural networks to predict the coefficients between adjacent frequency bands, given
the challenges that 1) the coefficients are produced by orthogonal basis functions, and 2)
the adjacent frequency bands have been downsampled, resulting in a loss of half of the
original information.
It is well known that neural networks excel at recognizing patterns and that missing
data can be reconstructed from local neighbors with reasonable fidelity. Specifically, our
goal is to learn about the behavior of image structure (e.g., edges and texture) across
frequency bands by exploiting the inherent abilities of neural networks. First, we
empirically model the intraband relationships characteristic of natural images in the
wavelet transform domain. Using these models, a Bayesian neural network, constrained to
minimize total entropy, classifies the high frequency signal as significant or insignificant.
A second neural network with analog outputs then learns the nonlinear mappings of the
patterns existing between the low and high frequency subbands. The algorithm estimates
upper frequency coefficients solely from lower ones within the same band and then
performs this estimation recursively. This prediction ability allows us to exclude
transmission of the upper subbands and reconstruct the information at the decoder.

PREDICTION MODEL

Specifically, this algorithm attempts to predict the value of upper frequency
coefficients solely from lower frequency coefficients within the same level of the
multiresolution decomposition. As exhibited in Figure 2, the LH and HL coefficients are
predicted from the LL coefficients, and the HH coefficients are predicted from the
predicted values of the LH and HL coefficients.

Figure 2. Prediction Within One Level. First, information in the LL subband is used to
predict coefficients in both the HL and LH subbands, which are in turn used jointly to
predict the HH subband coefficients. Note that each predicted coefficient corresponds
physically with the center of its input neighborhood.

The prediction process consists of two steps, as illustrated in Figure 3. Using a Bayesian
classifier, we first classify whether a particular upper frequency coefficient is significant
or insignificant. Coefficients classified as insignificant will have a predicted value of zero.
Next, coefficients that are classified as significant will have their value (both sign and
magnitude) estimated by a neural network. Both steps are performed to predict upper
frequency coefficients within a level and then performed recursively over multiple levels.


Figure 3. Prediction Summary. Within the prediction of each subband, the algorithm
consists of two steps: 1) separation of the significant and insignificant coefficients, and 2)
prediction of the value of the significant coefficients. The process is carried out
recursively over each level of the decomposition.

Bayesian Classifier

To classify whether an upper subband coefficient is significant or insignificant, we
first empirically model the statistical relationships between it and its unique input field to
determine the intraband coefficient relationships characteristic of natural images in the
wavelet transform domain. The model relates the change in value (local slope) across the
input field of coefficients to the probability that a given upper frequency coefficient is
significant or insignificant, i.e., above a defined threshold value.
We then estimate the following probabilities for the three upper subbands (HL, LH,
and HH) in each individual image:
    P(nz) - the probability that a coefficient is significant,
    P(z) - the probability that a coefficient is insignificant,
    p(m|nz) - the conditional probability of an input slope value given that the
    coefficient is significant, and
    p(m|z) - the conditional probability of an input slope value given that the
    coefficient is insignificant.
Next we employ the following equations from Bayes' rule:

    p(m, z) = p(m|z) P(z)   and   p(m, nz) = p(m|nz) P(nz),

where p(m, z) is the joint probability of a particular slope value generating an insignificant
coefficient, and p(m, nz) is the joint probability of a particular slope value generating a
significant coefficient.
In addition to the probabilities described above, we also calculate the cost of
misclassification, defined as the number of bits necessary for transmitting corrections when
the significance classification is not correct. We define λ(z|nz) as the cost of classifying a
coefficient as insignificant when the true state is significant, and λ(nz|z) as the cost of
classifying a coefficient as significant when the true state is insignificant. The minimum
number of transmission bits for corrections is defined as

    λ_min = λ(z|nz) + λ(nz|z).

The value λ_min is determined experimentally for each image using a variety of common
coding techniques such as a bit string code or a run length code. The components of λ_min
then define the ratio

    τ = λ(z|nz) / λ(nz|z).

Now, according to Bayes decision theory, we select class 'z' (the coefficient is
insignificant) if

    p(m, z) / p(m, nz) > τ;

otherwise, we select class 'nz' (the coefficient is significant). The ratio in the above
equation is known as the likelihood ratio, and the two joint probabilities in the above
equation are shown in the graph of Figure 4.
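A hedged sketch of this decision rule (our own illustration; the histogram density estimates and all names are assumptions, not the authors' code):

```python
import numpy as np

def slope_threshold(m_vals, significant, tau, bins=64):
    """Estimate p(m|z), p(m|nz) and the priors from labeled training slopes,
    then return the slope M at which the likelihood ratio p(m, z)/p(m, nz)
    first drops to tau; slopes below M are classed as 'z' (insignificant)."""
    m_vals = np.asarray(m_vals, float)
    sig = np.asarray(significant, bool)
    edges = np.histogram_bin_edges(m_vals, bins=bins)
    p_m_nz, _ = np.histogram(m_vals[sig], bins=edges, density=True)
    p_m_z, _ = np.histogram(m_vals[~sig], bins=edges, density=True)
    P_nz = sig.mean()
    ratio = (p_m_z * (1.0 - P_nz)) / np.maximum(p_m_nz * P_nz, 1e-12)
    centers = 0.5 * (edges[:-1] + edges[1:])
    below = np.where(ratio <= tau)[0]
    return centers[below[0]] if below.size else centers[-1]
```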

[Graph: the a priori conditional probability densities p(m|z) and p(m|nz) versus the
slope m; M is the slope value at which p(m|z)P(z) / p(m|nz)P(nz) = τ.]

Figure 4. Separating Significance Classes. Using a measure of slope across the input
field, m, the requisite joint probabilities are computed, as displayed in the graph above.
Next, we experimentally calculate the point on the graph, defined as τ, which minimizes
the cost of transmitting corrections for any misclassifications. The slope (x-axis) value
corresponding to the y-axis value of τ is defined as M. This value of M is then used as a
threshold to define the two classes of coefficients: significant and insignificant.

From such a plot, we determine the slope, M, where the likelihood ratio equals the value
τ, and select this slope, M, as our threshold for classification. Classification is based on
the slope calculated from the input field to the Bayes classifier; the above classification
scheme is outlined in Figure 4.
The necessary probabilities have to be estimated for each subband individually, and
thus a different threshold value is used for each subband. In addition, unique threshold
values are currently determined for each image; however, a series of training images
could be used to generate a global threshold for each subband.

CODER IMPLEMENTATION

A block diagram of the image coder model is illustrated in Figure 5.

Figure 5. Proposed Image Coder Model. In the proposed system, a prediction model,
which can recursively estimate the upper subband coefficients from the LL subband
coefficients, is duplicated at both the encoder and decoder. Thus, the smallest LL subband
in the decomposition is the only subband information necessary for lossy image
compression. Image fidelity improves with transmission of prediction errors, yielding
lossless compression upon transmission of all errors.

Both the encoder and decoder are augmented by replicate learning models, which are
used to predict the upper subband coefficient data within a wavelet decomposition. Only
the LL subband coefficients of the smallest level within the wavelet decomposition must
be transmitted verbatim. All of the remaining upper subbands are predicted by the
recursive coefficient prediction algorithm described below. For the latter, the errors
(which are defined by the learning model at the encoder) may be transmitted for use at the
decoder. All of the errors must be transmitted for lossless compression, while only some
of the errors must be transmitted for lossy compression.
The data rate reduction is based on the variance and shape of the error distribution.
The savings can be generalized, for the general case, to approximately

    H_e = H_o − log(a),   where a = σ_o / σ_e.

The values H_e and H_o are defined as the entropies of the error and original
distributions, respectively. Likewise, the values σ_e and σ_o are the standard
deviations of the error and original distributions, respectively.
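As a hedged numeric illustration (base-2 logarithms assumed, so entropies are in bits; the figures below are invented for the example):

```python
import math

H_o = 6.0                      # hypothetical entropy of the original coefficients (bits)
sigma_o, sigma_e = 24.0, 3.0   # hypothetical standard deviations
a = sigma_o / sigma_e          # a = 8: prediction shrinks the spread 8-fold
H_e = H_o - math.log2(a)       # H_e = H_o - log(a)
print(H_e)                     # -> 3.0 bits per coefficient
```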
Both the Bayesian Classifier and the neural network can produce errors. The Bayesian
Classifier can incorrectly classify the significance of a coefficient, which leads to a binary
error string. Errors in the sign of the predicted coefficient at the output of the neural
network also lead to a binary string. The binary errors are currently encoded by a
positional code, a bit string code, or a run length code, with the method selected
experimentally for each image to provide a minimum transmission of data.


Neural Network

Once a coefficient has been classified as significant or insignificant, the values of the
significant coefficients are estimated with a three-layer, feedforward neural network, with
a different network specified for each subband. The data input to the neural network is
normalized between values of -1.0 and 1.0 and is exactly the same as the data input to the
Bayesian Classifier in each case. The number of neurons in the middle layer varies for
each subband and level, and is selected experimentally.
The neural network architecture contains two output neurons: one for positive
coefficient values, node A (the other output node is held at zero during this part of
training), and one for negative coefficient values, node B (the positive output node is held
at zero during this part of training). These output nodes are only allowed to vary between
0.0 and 0.9. During operation, the maximum of the two output nodes is accepted as the
valid estimate of the magnitude of the predicted coefficient,

    predicted coefficient magnitude = max(y_A, y_B),

where y_A is the output of node A and y_B is the output of node B. The larger of the two
output nodes also denotes the sign of the predicted coefficient:

    predicted coefficient sign = +1 if y_A > y_B,
                                 -1 if y_B > y_A.

Each neural network is trained over the same set of several images with the standard
back propagation training algorithm. Once training is complete, the weight values are
stored and operation of the algorithm then remains fixed over any test image.
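A small sketch of the output decoding described above (our own illustration; the network itself and its training are not reproduced here):

```python
import numpy as np

def decode_output(y_a, y_b):
    """Combine the two output nodes into a signed estimate:
    magnitude = max(y_A, y_B); sign = +1 if y_A > y_B, else -1.
    Rescaling from the [0, 0.9] output range is omitted (an assumption)."""
    y_a, y_b = np.asarray(y_a), np.asarray(y_b)
    magnitude = np.maximum(y_a, y_b)
    sign = np.where(y_a > y_b, 1.0, -1.0)
    return sign * magnitude
```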

Reeursive Operation
The procedure described above to predict the coefficients in one level of a subband
decomposition is applied recursively to predict multiple levels of upper subband
coefficients. In multilevel subband decompositions, the algorithm is performed as follows.
1. Define n as the number of levels in the given subband decomposition.
2. Estimate the coefficients of the three upper subbands of level n as outlined in the
algorithm above and illustrated in Figures 2 and 3.
3. Reconstruct the low frequency subband of level n-1 with the synthesis filters
designed in the subband analysis. (not part of the prediction algorithm)
4. Replace n with n-1 and go back to step 2. Perform this recursion until n equals
0, i.e., the original image size has been reconstructed. (A sketch of this loop follows.)
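A compact sketch of this recursion (the function and parameter names are ours; the
prediction and synthesis steps are supplied by the caller):

def recursive_reconstruction(ll_band, n, predict_upper, synthesize):
    # predict_upper runs the Bayes Classifier and neural network to estimate
    # the three upper subbands of the current level (step 2); synthesize
    # applies the synthesis filters of the subband analysis (step 3).
    while n > 0:
        lh, hl, hh = predict_upper(ll_band, n)
        ll_band = synthesize(ll_band, lh, hl, hh)
        n -= 1                        # step 4: recurse until n equals 0
    return ll_band                    # original image size reconstructed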

Progressive Transmission
To progressively improve image reconstruction at the decoder, we currently use the
following transmission techniques in succession.
1. Transmit only the sign flip and significance map errors. This method can be used
effectively for Very Low Bit Rate applications.
2. Transmit all magnitude error terms that are greater than 10% of the true
coefficient value for the significant coefficients, resulting in lossy compression.
3. Transmit all remaining magnitude error terms for the significant coefficients.
This again results in a lossy reconstruction of the image.
4. Transmit all the insignificant coefficient values. This results in a lossless
reconstruction of the image. (A sketch of these stages follows.)
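The succession of stages might be outlined as follows (a hypothetical sketch; the
dictionary keys and the "rel" field for the relative magnitude error are our assumptions):

def progressive_transmission(err):
    # Yield the four transmission stages above in succession.
    yield err["sign_flip"] + err["significance_map"]          # 1: VLBR
    yield [e for e in err["magnitude"] if e["rel"] > 0.10]    # 2: lossy
    yield [e for e in err["magnitude"] if e["rel"] <= 0.10]   # 3: lossy
    yield err["insignificant"]                                # 4: lossless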

RESULTS

The following results were obtained for a two level decomposition over a test set of
twelve images displayed in Table 1. A separate set of twelve images was used to train the
neural network. Additionally, a third set of twelve images became a validation set for the
neural network. Neither the training set nor validation set are displayed in this paper, but
they contained a varied collection of images.

Network Performance

We first looked at the performance of the separate components of the prediction
module. The Bayesian Classifier achieved only a 3% incorrect classification rate of whether the
upper band coefficients were significant or insignificant. An 8% misclassification of the
sign of the significant coefficients, which could be predicted as positive or negative, was
achieved by the neural network. Errors in the prediction of the significant coefficients'
magnitude are best quantified as the mean absolute error at the output of the neural
network. Over all the networks currently employed the mean absolute error averaged
0.102 with a standard deviation of 0.0179.

Lossy Compression

Table 1 compares the rates of the Coefficient Prediction algorithm to a standard


entropy coder for lossy transmission over a test set of twelve images. The two techniques
are compared for the same peak signal to noise ratio (PSNR), which was achieved after
transmitting all of the significant coefficient information. These results were achieved with
a two level wavelet decomposition.

Very Low Bit Rate Applications

In Figure 6 we show three examples of image reconstruction for Very Low Bit Rate
applications. That is, only the significance map errors and sign flip errors were transmitted
and employed at the decoder. For these three images, coefficient prediction was employed
for two levels of the wavelet decomposition. The starting point, or LL subband of level
two is shown in the top of Figure 6. By employing the coefficient prediction algorithm we
were able to reconstruct the original size image at the rates given in Figure 6 with the
visual quality exhibited therein.

Table 1. Lossy Compression Results. A comparison of the transmission rates for all
significant coefficients using prediction and non-prediction methods.

Figure 6. Reconstruction with Prediction. Reconstruction of the original image from the 1/8
size LL subband is illustrated for the above three images.

Figure 7 also illustrates how well the coefficient prediction algorithm performs at
very low bit rates. We have enlarged (four times) the swimmer's hat from the swimmer
image in Figure 6 to more closely observe the details. The image on the far right is shown
for reference and displays reconstruction after transmission of all the wavelet coefficients.
In this case, the rate is computed as simply the entropy of all the coefficient data. The
image in the middle of Figure 7 exhibits reconstruction with the coefficient prediction
algorithm when only sign flip errors and significance map errors have been transmitted,
which results in a very minimal cost of 350 bytes.
The image on the left is reconstructed after transmitting the significant coefficients in
descending order along with their corresponding geographical locations (no significance
map is transmitted) until a rate equivalent to the Coefficient Prediction rate (350 bytes) is
achieved. The rate is calculated by computing entropy.
Now image quality, measured as peak signal to noise ratio (PSNR), is compared for
the same bit rate. The Coefficient Prediction technique achieves a 26.2 PSNR, a 13.4%
increase over the 23.1 PSNR of the standard image coding technique. This, coupled with
the visual similarity to the reference image noted in the flag and letters, illustrates the
accuracy of the prediction of the significant coefficient magnitudes.

(a) Coefficient Transmission Without Prediction: 0.061 bpp, 23.1 PSNR.
(b) Coefficient Transmission With Prediction: 0.061 bpp (350 bytes), 26.2 PSNR.
(c) Lossless Transmission (for Reference): 6.04 bpp (31K bytes).

Figure 7. Reconstruction Comparisons. Figure (b) utilizes the coefficient prediction
algorithm when only sign flip errors and significance map errors have been transmitted.
The same number of bits was used in (a) to transmit the upper subband coefficients in rank
order. The original image (c) is shown for reference. All images are magnified four times
the original size.

Figures 6 and 7 demonstrate the accuracy of the prediction algorithm when used in
an image coding application. In both figures, the reconstructed images rely purely on the
prediction precision of the neural network for the significant coefficients' magnitude
values, as no magnitude error terms are transmitted. This precision is illustrated by the
fidelity and visual quality of the reconstructed images in both figures. Furthermore, the
low bit rates demonstrate the accuracy of the significance classification by the Bayesian
Classifier and the accuracy of the sign value prediction for the significant coefficients by
the neural network.

DISCUSSION

Combining the compression power of the wavelet transform with the pattern
recognition power of the neural network allows enhanced visual perception, especially at
very low bit rates. For Very Low Bit Rate applications we can perform lossy image
reconstruction with near zero transmission because only the initial low frequency subband
is transmitted along with sign errors of the upper coefficient predicted values. Magnitude
error terms are sent for higher fidelity lossy image reconstruction and for the lossless case.
Additionally, the off-line training of the prediction weights, which can vary with
initialization techniques and network architecture, facilitates data encryption in secure
transmission environments.
This work presents predictive models for multiresolution image representations,
such as wavelets. These models are adaptive to different classes of imagery and their
application facilitates image reconstruction, image compression, and image enhancement.

A Neural Network Architecture for Trademark
Image Retrieval

Sujeewa Alwis and Jim Austin

Advanced Computer Architecture Group


Department of Computer Science
University of York
York, Y010 5DD, UK
{sujeewa|austin}@cs.york.ac.uk

Abstract. This paper describes a novel massively parallel connectionist


architecture for image retrieval. The proposed search engine of the system
consists of associative memory nodes connected by information channels
which convey symbolic messages. Symbolic information stored inside the
system is obtained using gestalt feature extraction methods which cap-
ture multiple representations of images. In this paper, we summarise our
feature extraction method and then describe the connection schemata
of the system and its training process, as well as how such a system can be
utilised to capture perceptual similarity of trademark images. Finally,
we present results obtained during evaluation of the system.

1 Introduction

There has been considerable progress in the area of content based image retrieval
during the last two decades. However, capturing perceptual similarity of images
is a relatively under-explored area of research [San96]. Trademark image retrieval
provides a good avenue of investigation in this regard since an effective trademark
retrieval system should necessarily be able to retrieve images which humans
perceive as similar.
Trademarks play an important role in providing unique identity for products
and services in the marketing environment and trademark classification systems
should be able to ensure that existing trademarks are distinct, to avoid confusion.
Traditionally, classification of trademarks is based on limited-vocabulary
descriptions. Most of the patent offices use manually assigned codes to represent
these descriptions such as human beings, animals or geometrical figures. But it
has been shown that these methods suffer from a number of problems. The assignment
of classes to trademarks is subjective, the classes become either too specific or
too broad depending on the use of classes, there is no mechanism to handle the
generation of new classes, and there is a large fraction of images with little or no
representational meaning making such a classification extremely difficult. This
motivates the need to investigate the potential of content based image retrieval
techniques to solve this problem.

In this study, we investigate a new trademark image retrieval system based


on features extracted using gestalt feature extraction methods. During retrieval,
our framework utilises alternative feature interpretations and the communication
strategy combines evidence from different interpretations in a mutually beneficial
manner. Though this framework may be able to capture perceptual similarity of
trademark images, the high computational requirements creates the need for an
efficient and low cost computational platform.
There have been numerous attempts to solve a range of problems using neural
networks. However, many neural network architectures suffer from long training
times or inefficient hardware implementations. Associative memory architectures
perform better than many other methods in this respect. Pattern matching ca-
pabilities offered by correlation matrix memory networks (CMM) [Mic97] under
the framework of AURA [Jim95] provide a number of features that have been ex-
ploited to obtain an efficient search engine for the proposed system. Apart from
a fast, low-cost hardware implementation of the network, it offers the ability
to parallelise the search mechanism by presenting input patterns and obtaining
output patterns simultaneously. The outcome of this integration is a massively
parallel connectionist architecture for trademark image retrieval which uses con-
cepts from visual cognitive psychology.
In the next section, we describe the pre-processing stage of the system which
performs gestalt feature extraction. We then summarise the basic mechanisms
of AURA which provide the computational platform for the search engine of the
system. In section 4, we describe the connectionist architecture of the search en-
gine and its functionality. Finally, we present results obtained during evaluation
of the system.

2 Feature Extraction

During feature extraction, we aim to extract features from different analytical


levels of images. One of our recent experiments conducted with subjects justifies
this approach as we have observed different human interpretations for the same
image. In deciding the features to be extracted we are motivated by the findings
of visual cognitive psychology. Biederman [Bid87] argues that edge extraction is
the initial phase of human object recognition, followed by the detection of
the initial phase of human object recognition and followed by the detection of
a number of non-accidental properties identified by Witkin and Tenenbaum as
follows: co-linearity of points or lines, co-curvilinearity of points or arcs, paral-
lelism of lines and arcs, symmetry under reflection or rotation, convergence of
lines or arcs at vertices. The gestalt psychologists observed and emphasised the
importance of organization in vision. They demonstrated that shapes have some
illusive, immeasurable collective properties that do not appear when analysed
in terms of their constituent parts. The gestalt perceptual organization phenomena
are based on proximity and similarity of features. It can be seen that the
gestalt feature extraction methods can be used to extract the above mentioned
non-accidental properties and the closure feature, as well as to group the image
into perceptually significant segments. In our system, the above mentioned non-accidental
relationships are used in local feature representation, while a widely
used set of features is utilised in closed-figure based feature representation.

[Figure 1 shows the processing pipeline: edge extraction; contour decomposition;
local perceptual feature extraction from the raw image; feature extraction of closed
figures from the raw image; preparation of the gestalt image; local perceptual feature
extraction from the gestalt image; feature extraction of closed figures from the
gestalt image.]

Fig. 1. The overview of the feature extraction phase.

During feature extraction (as summarised in figure 1), we first perform edge
extraction using the original image (raw image) and segment edge pixels into
constant curvature edge segments using the method proposed by Wuescher and
Boyer [Wus91]. These two steps give straight line segments and curve segments
having the following properties: end points, orientation, curvature and pixel
points on the segment. Using this information we extract different perceptual
features (end-point proximity, parallelism, co-linearism and co-curvilinearism)
using both Lowe's methods [Low87] and Sarkar and Boyer's [Sar94] methods
for perceptual feature extraction [Suj98]. Figure 2 shows some of these features
extracted using an example image. We group images based on co-linearism and
co-curvilinearism to obtain a new image (the gestalt image); this step gives
some new segments derived from segments of the raw image. We also store
the relationships between antecedents and new segments. The gestalt image is
then subjected to the earlier process of extraction of end-point proximity and
parallelism. We also obtain closed figures by grouping the segments of the image
based on end-point proximity and continuity. This method extracts alternative

interpretations of closed figures which may not be obtainable using standard


pixel based linking methods as shown in figure 3. In the next step, we extract
features of closed figures, namely circularity, directionality, straightness, com-
plexity, right-angleness, aspect-ratio, sharpness and stuffedness. Figure 4 sum-
marises the features extracted during this process. More information about this
process can be found in [Suj98].

(a) (b) (c)

Fig. 2. Figure 2(b) shows the co-linear and co-curvilinear segments while figure
2(c) shows parallel segments extracted using the image in figure 2(a).

(a) (b) (c) (d)

Fig. 3. Some of the closed figures extracted using the image in figure 3(a). Using pixel
based linking methods it is possible to extract half circle shapes but not the perceived
circle or triangle.

3 The AURA system

The AURA system is aimed at fast combinatorial searching and high perfor-
mance knowledge base system design. The main building block of AURA is a
simple one layer neural network called correlation matrix memory (CMM) that
utilises binary weights. The information processing methods of AURA exploit the
threshold logic and distributed processing capabilities of binary neural networks.

The CMMs used in AURA have binary weights and binary input and output
vectors which make the training process a one shot learning process of binary
associations.
The training process of a CMM (M) using an input vector (I) and an output
vector (O) can be summarised as follows: first obtain the vector product of
I and O, which gives a matrix M'. Then perform a logical OR operation
between M' and M, which superimposes the patterns in M' onto M.
During the recall phase, an input pattern (I) is presented to the CMM to
recall the stored output pattern(s) (O). This produces a vector of summed values
at the output of the CMM. The internal operation behind the generation of
this output pattern (R) can be expressed as R = M I^T. This is followed by a
thresholding process which generates the subsequent output vector(s), called
separator(s) in the context of AURA. During thresholding, each bit in the output
pattern (R) is set to zero if it is below a previously determined threshold, and
otherwise set to one. Several strategies for determining the threshold are available,
including L-max thresholding and Willshaw thresholding [Jim95]. AURA
includes the pre-processing and post-processing modules required to support
symbolic pattern processing using CMMs.
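A minimal NumPy sketch of this train/recall cycle (dimensions, patterns and function
names are our own; real AURA CMMs are of course far larger):

import numpy as np

def train(M, I, O):
    # One-shot training: OR the outer product of the binary output and input
    # vectors (M' = O I^T) onto the binary weight matrix M.
    return M | np.outer(O, I)

def recall(M, I, k):
    # Recall followed by L-max thresholding: keep the k highest summed values.
    R = M @ I
    return (R >= np.sort(R)[-k]).astype(np.uint8)

M = np.zeros((8, 12), dtype=np.uint8)
I = np.array([0,0,0,0,0,1,0,0,0,0,0,0], dtype=np.uint8)
O = np.array([0,1,0,0,0,0,1,0], dtype=np.uint8)   # separator with k = 2 bits set
M = train(M, I, O)
print(recall(M, I, k=2))                          # recovers O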

3.1 Pre-processing

During pre-processing, all the inputs (in symbolic form) are converted into
binary vectors with exactly k bits set (where k is a constant for a given CMM)
by the pre-processor. The features offered by the architecture allow simultaneous
presentation of more than one input pattern. The input patterns can be
superimposed to present them as a single pattern vector as follows:
X1 = 000001000000; X2 = 000100000000; X3 = 100000000000
super-imposed input pattern: 100101000000
This is an important feature of this architecture as it allows parallelising the
search by presenting multiple data items at once. It also removes the information
about the order of inputs, preserves memory space, and makes the size of the
input query to the network independent of the number of variables that the
input contains.
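In code, the superposition is a bitwise OR (a trivial sketch of the example above):

import numpy as np

X1 = np.array(list("000001000000"), dtype=np.uint8)
X2 = np.array(list("000100000000"), dtype=np.uint8)
X3 = np.array(list("100000000000"), dtype=np.uint8)
superimposed = X1 | X2 | X3
print("".join(superimposed.astype(str)))   # -> 100101000000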

3.2 Post-processing

The output vector may contain more than one trained pattern. To separate
these patterns a method called MBI (Middle Bit Indexing) is used [F194]. In
this method, the relationship between a separator and the symbolic output is
stored in the MBI database using the middle bit of the separator as the index to
the database. This reduces the search process to dealing only with bits in the
output data that could belong to the middle bit of the code.
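One possible reading of this scheme in code (our interpretation of the description
above; [F194] gives the actual details):

mbi_db = {}   # middle set bit of a separator -> [(separator, symbol), ...]

def middle_bit(separator):
    ones = [i for i, b in enumerate(separator) if b]
    return ones[len(ones) // 2]

def store(separator, symbol):
    mbi_db.setdefault(middle_bit(separator), []).append((separator, symbol))

def separate(output):
    # Only set bits of the recalled output that could be the middle bit of a
    # stored separator need to be examined.
    return [sym for i, b in enumerate(output) if b
            for sep, sym in mbi_db.get(i, [])
            if all(output[j] for j, s in enumerate(sep) if s)]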

4 Connectionist Architecture

In designing the search engine for the image retrieval system, four factors have to
be considered. First, the strategy for mapping feature information into symbolic
data has to be determined. Second, the message passing strategy between CMM
nodes has to be determined. Third, the training strategy to store the above mentioned
symbolic data has to be determined. Fourth, a method must be established by
which all the evidence can be combined to evaluate similarity between a query
image and images in the database.

4.1 Data Representation

Feature information (extracted as described in section 2) has to be stored as


symbolic associations inside associative memories. As a result, communication
between memories can be performed by transferring symbolic patterns. In the
system, closed figures are represented by symbolic tags (type 1) which represent
image and figure number while constant curvature segments are represented by
tags (type 2) which represent image number and segment number. Another set
of symbols is used to represent elements of the feature vectors (circularity, di-
rectionality etc.). Since these elements consist of continuous values, they are
quantised in order to assign symbolic values for them (quantisation is performed
using uniform scales). All the symbolic associations take the form of input pat-
tern - output pattern where patterns are single or a combination of symbolic
data. The feature information in figure 4 can be re-represented in symbolic form
as in figure 5.
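As an illustration of the quantisation step (the bin count, range and tag format are
our assumptions, not the system's):

def quantise(value, n_bins=8, lo=0.0, hi=1.0):
    # Uniform-scale quantisation of a continuous feature element
    # (e.g. circularity) to a symbolic bin identifier.
    b = int((value - lo) / (hi - lo) * n_bins)
    return "Q%d" % min(max(b, 0), n_bins - 1)

print([quantise(v) for v in (0.82, 0.10, 0.55)])   # ['Q6', 'Q0', 'Q4']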

[Figure 4 is a tree rooted at the image representation: it branches into features in
the raw image, relationships between raw and gestalt image (antecedents - new segments),
and features in the gestalt image; within each image, local perceptual features between
segments (end-point proximity, parallelism, co-linearism, co-curvilinearism),
relationships between segments and closed figures, and features of closed figures
(circularity, aspect-ratio, etc.).]

Fig. 4. Information extracted during the feature extraction process.



[Figure 5 mirrors the tree of figure 4, with each relationship re-expressed as a
symbolic association: tag2 - tag2 for segment-segment relationships and antecedents -
new segments, tag2 - tag1 for segments - closed figures, and feature vector - tag1
for the features of closed figures.]

Fig. 5. Symbolic representation of features in figure 4.

4.2 Connection Schemata

The architecture of the image retrieval system is inherently distributed: nodes
consist of a collection of CMMs. There are two types of nodes, which represent
closed figures (type 1 nodes) and constant curvature segments (type 2 nodes).
Each node of a given type has the same information at every location; this feature of
the system ensures its location invariance. Communication between the nodes is
facilitated through information channels which connect nodes in a non-uniform
fashion. Type 2 nodes are connected to nodes of the same type through information
channels which represent the following perceptual relationships: end-point
proximity, parallelism, co-linearity or co-curvilinearity, and antecedents - new
segments. Type 2 nodes are connected to type 1 nodes through information channels
which represent segments - closed figures relationships, while type 1 nodes are
connected to all other type 1 nodes derived from the same image representation
(i.e. raw image or gestalt image), as shown in figure 6.
All information channels facilitate bi-directional communication, while the symbolic
information transferred in each channel corresponds to possible candidates
at the neighbouring node connected by the particular perceptual relationship.

[Figure 6 legend: open circles are type 2 nodes, filled circles are type 1 nodes; the
line styles denote end-point proximity, parallelism, antecedents - new segments,
segments - closed figures, and closed figures in the same representation (raw or
gestalt image).]

Fig. 6. Connection schemata of the system.



4.3 Training the system

The training phase is aimed at storing all feature information inside CMMs in order
to utilise them effectively and efficiently during retrieval. During the training
phase, symbolic associations, as summarised in figure 5, are generated to represent
feature patterns in images. Memories at nodes of type 2 are trained with
different perceptual relationships (end-point proximity, parallelism, co-linearity
or co-curvilinearity, antecedents - new segments and segments - closed figures).
For example, the memory for end-point proximity is trained with associations in the
form tag - tag which represent two segments connected by an end-point proximity
relationship. According to the feature arrangement in figure 7(a), the following
symbolic associations are generated for the node I1 S1, which represents
a type 2 node (constant curvature segment):
end point proximity relationships: I1 S1 - I1 S2, I1 S1 - I1 S3
parallelism relationships: I1 S1 - I1 S4
co-linear relationships: I1 S1 - I1 S5
antecedents - new segment relationships: I1 S1 - GI1 S1
segments - closed figures relationships: I1 S1 - I1 C2
During training, each symbolic tag is assigned a unique binary token and as
a result, association between a pair of tags becomes an association between a
pair of binary vectors.
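A minimal sketch of this token assignment (the vector length n and weight k are our
assumptions):

import numpy as np

rng = np.random.default_rng(0)
tokens = {}

def token(tag, n=256, k=4):
    # Assign each symbolic tag a unique sparse binary vector with k bits set.
    if tag not in tokens:
        v = np.zeros(n, dtype=np.uint8)
        v[rng.choice(n, size=k, replace=False)] = 1
        tokens[tag] = v
    return tokens[tag]

# e.g. the end-point proximity association I1 S1 - I1 S2 becomes a pair of
# binary vectors trained into the corresponding CMM:
pair = (token("I1 S1"), token("I1 S2"))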
Memories at nodes of type 1 are trained with relationships between feature
vectors and tags, as well as relationships between tags which represent closed figures
and constant curvature segments. Each feature vector consists of eight
feature elements as explained in section 2 (circularity, directionality etc.), which
are quantised to obtain eight different symbolic patterns. First, binary patterns
are generated to represent these symbolic patterns and then superimposed to
generate a single pattern. As a result, associations between feature vectors and
tags for closed figures become associations between binary vectors.
According to the feature arrangement in figure 7(a), the following symbolic
associations are generated to represent inter-relationships for the closed figure
I1 C1:
closed figure - closed figure relationships: I1 C1 - I1 C2
closed figure - segment relationships: I1 C1 - I1 S2, I1 C1 - I1 S3
The training phase is completed when all the feature information in the
database is stored within the relevant CMMs. Addition of a new image into the
database can be performed at any time, since the addition of new associations
does not affect the existing ones.

4.4 Similarity assessment

The similarity assessment framework is aimed at retrieving images in order of
similarity to a given query image. In doing so, we expect the system to find
possible candidates based on features of constant curvature segments and closed
figures, as well as relationships among them. The whole process can be seen as a
three stage process.

[Figure 7(a) sketches a typical feature arrangement with closed figures I1 C1 and
I1 C2 and segments I1 S1 - I1 S5 and GI1 S4, using the legend of figure 6;
figure 7(b) sketches the information fusion and pruning process.]

Fig. 7. (a) Typical feature arrangement of an image (b) Information fusion and pruning
process.

First, the message passing structure for the particular query image has to be
determined. The considerations are the number of nodes allocated for each type
of node and the nature of the connectivity between these allocated nodes. There has
to be a type 1 node for each closed figure and a type 2 node for each constant
curvature segment in the query image. As a result, the allocated number of type 1
nodes is equal to the number of closed figures, and the allocated number of type 2
nodes is equal to the number of constant curvature segments in the query image.
The connectivity pattern is determined according to the perceptual relationships
within the query image, as described in section 4.2.
Second, each node obtains a set of initial candidates for similarity, based
on its internal (context free) information (initialisation stage). The nature
of the information used for this task depends on the type of the node. For type 2
nodes, a feature vector of three elements is used, consisting of length, orientation
(for lines) and curvature (for curves). For type 1 nodes, a feature vector
of eight elements (circularity, directionality, straightness, complexity, right-angleness,
aspect-ratio, sharpness and stuffedness) is used, and we also allow
partial matching during this process. This can easily be implemented under the
AURA framework by first obtaining the superimposed binary vector for the feature
vector at node n and using it as the input vector (In) to obtain the output vector
(Cn) from the CMM (Mn) which has been used to store the relevant information
(i.e. feature vector - tag relationships for the whole database), as shown below.

Cn = Mn In^T   (1)

During this process, each node i obtains its candidate vector Ci, and the
third step is aimed at exchanging feature information between nodes to achieve
contextual consistency.
During the third step, each node obtains evidence from other nodes to support
the existing candidates. This information can be used to prune implausible
candidates. During the process, each node plays its role by providing its possible
support for the candidates at other nodes connected to itself by information
channels. In doing that, they use different CMMs stored within the nodes. These
CMMs contain feature information between nodes for the entire database (end-point
proximity, parallelism etc.). For example, to obtain supporting candidates
for nodes connected by relationship x to node n, the candidate vector for
n (Cn) is presented to the CMM (Mx) as follows.

On = Mx Cn^T   (2)
where Mx is the CMM which contains feature information on relationship
x at node n. Then On becomes one of the support vectors (denoted Ci_m) for
node m. Node m accumulates all the support vectors (i.e. C'm = sum_i Ci_m) received
through information channels (this can be viewed as an information fusion
process), and the vector in the accumulator is subjected to thresholding,
which prunes the entries for candidates below a threshold t (currently we use
t = 1). To guarantee convergence, we perform a simple "and" operation between
C'm and Cm to obtain the new candidate vector Cm at node m. This process is
illustrated in figure 7(b).
Due to the smaller feature vector used during initialisation at nodes of type
2, we continue this information fusion and pruning process at all nodes of type 2
as an iterative process, which helps minimise ambiguities. The iterative nature of the
process ensures propagation of constraints throughout the connectionist structure.
This process is halted when all type 2 nodes obtain stability (i.e. no more
pruning). The information fusion process at nodes of type 1 is a non-iterative process
which is started after type 2 nodes obtain stability.
The theoretical foundation for the information fusion and pruning process is
obtained from the relaxation by elimination (RBE) framework [Jim97].
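One fusion-and-pruning step might be sketched as follows (a simplification under our
own naming; the full iterative RBE machinery is in [Jim97]):

import numpy as np

def fuse_and_prune(C_m, supports, t=1):
    acc = np.sum(supports, axis=0)         # accumulate support vectors (fusion)
    pruned = (acc >= t).astype(np.uint8)   # prune candidates below threshold t
    return C_m & pruned                    # "and" guarantees convergence

C_m = np.array([1, 1, 0, 1], dtype=np.uint8)
supports = [np.array([1, 0, 0, 1], dtype=np.uint8),
            np.array([0, 0, 0, 1], dtype=np.uint8)]
print(fuse_and_prune(C_m, supports))       # -> [1 0 0 1] with t = 1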

5 Experiments

During the following experiments we used a collection of 1000 trademark images.
We used 10 query images for which we had similarity judgement data from
trademark officers. Figure 8 shows the query images we used for this task. In
calculating the retrieval effectiveness we use the widely cited recall-precision graphs,
and the averaging of graphs is based on the macro-evaluation method suggested in
[Van77].
For a given query image x, we can calculate the recall and precision as
recall (x) = number of objects found and relevant to x / the total number of
objects relevant to x

[Ten query trademark images, labelled (a) to (j).]

Fig. 8. Query images used during the experiments.

precision (x) = number of objects found and relevant to x / the total number
of objects found
According to these criteria, we can obtain pairs of recall-precision values which
indicate the fraction of relevant items retrieved and the fraction of retrieved items
that are relevant, respectively, as we traverse from the top to the bottom of the
list.
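In code, the two measures are straightforward (a trivial sketch with hypothetical
identifiers):

def recall_precision(found, relevant):
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)
    return hits / len(relevant), hits / len(found)

# hypothetical retrieval: 3 of 4 relevant images among 10 retrieved
print(recall_precision(found=range(10), relevant=[2, 5, 7, 99]))  # (0.75, 0.3)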
During the experiment, we compared two different methods of combining
evidence from different feature interpretations within our image retrieval frame-
work. We compared the communication strategy we describe in this paper against
the external combination strategy we proposed earlier in [Suj99]. In obtaining re-
sults published in this section, we simulated the behaviour of CMMs using linked
lists. During our work on external evidence combination, we considered the fea-
tures of closed figures and constant curvature segments of raw and gestalt images
in four separate modules. In contrast to the model in this paper, there was no
inter-communication between these modules. We allowed the modules to generate
different sets of results, and we observed that combining these results using the
Dempster-Shafer mechanism gave the best performance.
Figure 9 shows average recall-precision distribution of retrieval performance
over the ten queries under our external combination method using the Dempster-Shafer
mechanism [Suj99] and our new model for image retrieval presented in
this paper. Results show that the new model which utilises inter-communication
between different interpretations performs better than the external combination
of modules. These results give evidence to argue that better performance can be
achieved by facilitating granular level communication in order to achieve global
consensus rather than attempting it at the modular level, within our image
retrieval framework.

6 Conclusions

We have presented a novel massively parallel connectionist architecture for trade-


mark image retrieval. It is an integrated framework using multiple feature inter-
pretations obtained using gestalt principles. We have compared this framework
with our earlier proposed model which combined evidence at the decision level.
Results show that better performance can be obtained by delaying this deci-
sion making process and facilitating granular level communication, within the
framework.

[Plot: precision (y-axis, roughly 0.4-0.9) against recall (x-axis, roughly 0.1-0.9),
one curve per evidence combination method.]

Fig. 9. Average recall-precision distribution of the system over the 10 queries.

Acknowledgements
We would like to thank Dr. John Eakins at the University of Northumbria at
Newcastle and Dr. Philip Quinlan at the University of York for the substantial
benefits obtained from the discussions with them. Financial support for this
project comes from the Association of Commonwealth Universities.

References
[San96] Santini.S and Jain.R: Similarity matching. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 1996.
[Mic97] Turner.M and Austin.J: Matching Performance of Binary Correlation Matrix
Memories. Neural Networks, 1997.
[Jim95] Austin.J, Kennedy.J, Lees.K: The Advanced Uncertain Reasoning Architec-
ture. In Proc. Artificial Neural Networks and Expert Systems Conference, 1995.
[Bid87] Biederman.I: Recognition by components - A theory of human image under-
standing. Psychological Review, vol 94, no 2, pages 115-147, 1987.
[Wus91] Wuescher.D and Boyer.K.L: Robust contour decomposition using a constant
curvature criterion. IEEE PAMI, vol 13, pages 41-51, 1991.
[Low87] Lowe.D.G: Three-dimensional object recognition from single two-dimensional
images. Artificial Intelligence, vol 31, pages 355-395, 1987.
[Sar94] Sarkar.S and Boyer.K: Computing perceptual organization in computer vision.
World Scientific Publishers, 1994.
[Suj98] Alwis.S and Austin.J: A novel architecture for trademark image retrieval sys-
tems. In Proceedings of the challenge of image retrieval, 1998.
[F194] Filer.R: Symbolic reasoning in an associative neural network. MSc Thesis,
University of York, UK, 1994.
[Jim97] Turner.M and Austin.J: A neural relaxation technique for chemical graph
matching. In Proceedings of the Fifth International Conference on Artificial
Neural Networks, Cambridge, UK, Editor. M Niranjan, IEE Publishers, July
1997.
[Van77] Van Rijsbergen.C.J: Information retrieval. Butterworths, London, 1979.
[Suj99] Alwis.S and Austin.J: Trademark image retrieval using multiple features. To
be published in Proceedings of the challenge of image retrieval, February 1999.
Improved Automatic Classification of Biological
Particles from Electron-Microscopy Images
Using Genetic Neural Nets

J. J. Merelo 1, V. Rivas 2, G. Romero 1, P. Castillo 1, A. Pascual 3, J. M. Carazo 3

1 GeNeura Team, Depto. Arquitectura y Tecnología de las
Computadoras, Facultad de Ciencias,
Campus Fuentenueva, s/n,
18071 Granada (Spain)
E-mail: geneura@kal-el.ugr.es, http://kal-el.ugr.es/
2 Departamento de Informática,
Universidad de Jaén
3 Grupo de Biocomputación, CNB,
28040 Madrid (Spain)
E-mail: {carazo|pascual}@cnb.uam.es

Abstract. In this paper several neural network classification algorithms


have been applied to a real-world data case of electron microscopy data
classification. Using several labeled sets as a reference, the parameters
and architecture of the classifiers, LVQ (Learning Vector Quantization)
trained codebooks and BP (backpropagation) trained feedforward neural-
nets were optimized using a genetic algorithm. The automatic process
of training and optimization is implemented using a new version of the
G-LVQ (genetic learning vector quantization) and G-PROP (genetic back-
propagation) algorithms, and compared to a non-optimized version of
the algorithms, Kohonen's LVQ and an MLP trained with QuickProp.
Dividing all the available samples into three sets, for training, testing and
validation, the results presented here show a low average error for unknown
samples. In this problem, G-PROP outperforms G-LVQ, but G-LVQ
obtains codebooks with fewer parameters than the perceptrons obtained
by G-PROP. The implication of this kind of automatic classification algorithm
for the determination of the three dimensional structure of biological
particles is finally discussed.

1 Introduction

DNA helicases are ubiquitous enzymes with fundamental roles in all aspects of
nucleic acid metabolism: DNA replication, repair, recombination, and conjugation.
Their activity leads to disruption of the hydrogen bonds between the two
strands of duplex DNA. Presumably all species in nature contain a collection of
helicases that participate in many, if not all, facets of DNA metabolism. In spite
of their critical role, and the amount of biochemical knowledge in the field, little
is known about their structure. Hexameric helicases constitute a very important
structural subgroup of the helicases family, but so far their high-resolution

structure remains unknown. In the last few years, electron microscopy has pro-
vided low resolution structural information for some of these hexameric proteins
[1, 2, 3, 4]. Based on these studies, a general picture of the structure of the
hexameric helicases has emerged, featuring the homohexamer as a ring-like self-
assembly of protomers arranged around a central channel. Recently, Bárcena
et al. [3] have addressed the structural characterization of the gene 40 product
(G40P) of SPP1, a Bacillus subtilis double-stranded DNA bacteriophage, by
means of electron microscopy of negatively stained samples of the protein, image
processing techniques and unsupervised classification methods (Self-Organizing
Maps [5, 6]). They proposed a new approach for the analysis of rotational symmetry
heterogeneities, which allowed the detection of different quaternary organizations
of the protein. Normally, images produced by electron microscopy
present a low signal to noise ratio, which hides the subtle heterogeneities in the
population. In such cases classical classification techniques may fail. Rotational
symmetry analysis [7] has proved to be useful for the detection of possible
differences in the symmetry of the population, as shown in the structural study
of G40P. Here we propose to study the structure of this hexameric helicase by
a different approach: instead of using non-supervised techniques, a supervised
classification scheme is proposed.
In general, we are interested in two kinds of final image processing approaches.
The first one is an increase of the signal-to-noise ratio of the images
by a process of averaging. The second one is to use the views of a specimen
coming from different directions as input to a 3D reconstruction algorithm whose
mathematical principles are very similar to the ones behind the familiar
"Computerized Axial Tomography" in medicine. Our biological goal is set to reach
high resolution (subnanometer resolution), and to this end it is necessary to
have very significant image statistics, forcing us to process thousands of images.
However, it is obvious that prior to any of these processes it is vital to be able to
separate and classify these images into their main views in as fast, reliable,
automatic and objective a way as possible.
Our strategic aim would then be to provide a reliable method to classify
large quantities of images (in the order of thousands or tens of thousands) in an
automatic manner, as a way to obtain high resolution structural results in spite
of the low signal to noise ratio of the individual images.
Neural network techniques have already been applied successfully to biological
particle classification and reconstruction [8, 6], showing results that are more
robust, and sometimes faster, than traditional statistical techniques. In this paper,
we will apply to the above mentioned problem several supervised learning
algorithms: G-LVQ [9], based on a two-level genetic algorithm (GA) operating on
variable-size chromosomes, which codify the initial weights and labels for an LVQ
network, and G-PROP, which acts in a similar way on MLPs trained with BP;
results will be compared to Kohonen's Learning Vector Quantization [10] (LVQ)
algorithm for codebook training and to QuickProp. We will first present the
state of the art in evolutionary codebook design in section 2. The new version of
the G-LVQ algorithm, used in this paper to classify helicases, is briefly described

in section 3, to be followed by the results obtained in section 4 and a brief discussion
(section 5).

2 State of the art

Codebooks are sets of labeled vectors {wi} (called codevectors), which, acting as
classifiers, offer more information than, for instance, the usual feed-forward neural
network classifiers: a codebook not only provides a method for predicting the classes of
unknown patterns, but is also in itself a sample of the training universe; each
codevector wi belongs to the input space and can thus be visualized and evaluated
in the same way as the original sample, giving the user a clue as to why
an unknown sample has been classified one way and not another. Designing a
codebook usually involves deciding in advance its size, class label distribution,
initial weights, and the iterative training algorithm used to set the codevectors.
LVQ is one of the possible codebook design and training algorithms, proposed
originally by Kohonen [5]. In this algorithm, the initial weights (initial codevec-
tor values) and the codevector labels are set in advance, as well as a learning
constant, which determines the vector change rate; then the codevectors are
changed by gradient descent, making them as close as possible to the vectors
in the training sample. LVQ, along with the multilayer perceptron and Radial
Basis Functions, are the most widely used techniques in the neural net realm.
The task of finding the correct number and values of a classifier's parameters,
that is, what is known in the neural network world as "training the weights of a
neural net", is called parameter estimation by statisticians. Statisticians face the
same problem as neural net researchers: finding a combination of parameters
which gives optimal accuracy results with maximum parsimony, that is, a minimum
number of parameters. So far, several global optimization procedures have
been tested: simulated annealing, stochastic approximation, and genetic algorithms.
In either case, the usual approach to codebook design optimization is
to optimize one of the three metrics for its performance; for instance, maximizing
classification accuracy while leaving size fixed. Several size values are usually
tested, and the best is chosen. However, the technique used in this paper, G-LVQ,
intends to optimize several codebook parameters at the same time: codevector
labels, codebook size (or number of levels) and codevector initial values.
Some other methods for optimizing LVQ have been based on incremental
approaches (for a review, see [11]), which are still local error-gradient descent
search algorithms on the space of networks or dictionaries of different sizes.
Other methods use genetic algorithms [12] to set the initial weights of a codebook
with an implicit maximum size; this maximum length limits the search
space, which might be a problem for finding the correct codebook, whose number
of levels might be higher. Since then, several papers combining GAs and vector
quantization have been published; for instance, Johnson and collaborators in
[13] use GAs to select the optimal set of descriptors that are fed to a vector
quantization algorithm; Fränti and collaborators in [14] use a genetic algorithm
for codebook generation.

An incremental methodology proposed by Perez and Vidal [15] seems to
offer the best results of this kind. This method adds or removes
codevectors after presentation of the whole training sample, eliminating those
that have not been near any vector in the training sample, and adding as new
codevector a vector from the training sample that has been incorrectly classified,
and is the most distant from all codevectors belonging to its same class. This
approach has the advantage of not relying on threshold parameters, but it still
has the problems of being a local search procedure that optimizes size step by
step and of relying on heuristics for the initial weights.
There are many more attempts to optimize the size of other neural nets, like
MLPs trained with backpropagation [16]. Especially noteworthy is G-PROP in its
different variants [17, 18], whose results will be compared with G-LVQ in this
paper.
Previously, a method for global optimization of LVQ was proposed by the
authors in [9, 19]. That method has been streamlined and improved in this
paper, giving the best results so far.
With respect to the problem of classifying biological particles, usual pattern
recognition techniques are a mixture of neural networks and personal experience,
with some statistical techniques thrown in; particles are classified in an unsuper-
vised way using principal component analysis or Kohonen's self-organizing map,
and then labeled by hand [6]. This method has been used to label the samples
used in this paper, which means that its labelling is inherently unaccurate. In
any case, usual statistical techniques can be used to assess the accuracy of the
automatic classification and the previous manual classification. Obtaining a neu-
ral network for automatic classification that objectivizes the expert knowledge
without needing to make it explicit would also be an achievement. Recently, a
new unsupervised approach which overcome the problems inherent, to the manual
labeling has been proposed [20].

3 Method

The method used here for obtaining optimal biological particle classifiers is based
on several pillars, which sometimes lean on each other: hybridization of local and
global search operators (an idea originally proposed by Ackley [21]), a variable-length
chromosome genetic algorithm (which is not too common in the GA
literature, but has already been used, for instance, by Harvey [22]) and a vectorial
fitness (not very popular either, but proposed, for instance, by Schaffer [23]). The
genetic algorithm used in this work, based on the previous considerations, can
be described in the following way:

1. Initialize randomly a chromosome population with lengths ranging from half
the number of classes to twice the number of classes.
2. For each member of the population:
(a) Decode the chromosome to the initial codebook values, create a classifier
and train it using Kohonen's LVQ. Then evaluate accuracy and distortion
in classification over the test set. Set the vectorial fitness to the triplet
(accuracy, size, distortion).
(b) With a certain probability, apply the chromosome-length-changing operators,
suppressing the genetic representation of the codevector which responds
to the lowest number of input space vectors, or duplicating the one which
responds to the highest number.
3. Select the P best chromosomes in the population, according to their fitness
vector, and make them reproduce, with mutation and 2-point crossover
(acting over the length of the smallest chromosome of the couple).
4. Finish after a predetermined number of generations, keeping the best trained
codebook, and apply it to the validation set.

Symmetry               Absolute number of samples   Sample frequency
No symmetry or noisy   218                          23%
Symmetry 2             178                          19%
Symmetry 3             328                          35%
Symmetry 4             124                          13%
Symmetry 6              88                           9%

Table 1. Distribution of patterns in the training set; the most frequent patterns were
those with symmetry 3; frequencies are rounded and thus do not add up to 100%.

The other algorithm used in this paper, G-PROP, described in [17, 18], is basically
the same as G-LVQ, except that the classifiers and training algorithm used are
multilayer perceptrons trained with backpropagation.
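A compact skeleton of the shared evolutionary loop (our own simplification: the
vectorial fitness is compared lexicographically, and the length-changing and mutation
operators are omitted for brevity):

import random

def two_point_crossover(a, b):
    # 2-point crossover acting over the length of the smaller chromosome.
    i, j = sorted(random.sample(range(min(len(a), len(b)) + 1), 2))
    return a[:i] + b[i:j] + a[j:]

def evolve(population, fitness, p_best, generations):
    # fitness(c) returns the triplet (error, size, distortion).
    for _ in range(generations):
        population.sort(key=fitness)          # best (lowest) triplets first
        parents = population[:p_best]         # select the P best chromosomes
        population = parents + [
            two_point_crossover(random.choice(parents), random.choice(parents))
            for _ in range(len(population) - p_best)]
    return min(population, key=fitness)       # best chromosome found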

4 Results

The whole set of samples available consisted of 933 spectra, obtained after a
process of segmentation, translational and rotational alignment, manual labeling
by visual inspection of the rotational spectra, and preprocessing to obtain the circular
symmetry spectrum of the electron-microscopy images. After obtaining the spectra,
only those with physical meaning were used for training and testing, that is,
only those with symmetry equal to 1, 2, 3, 4 and 6, and those classified as noisy, which
were assigned the same class label as those with symmetry 1 (or no symmetry).
Absolute numbers of samples and frequencies for each class are shown in table
1. Each particle is represented by a 15-component vector, corresponding to the
spectrum components. Class averages are shown in figure 1; each class represents
vectors with a strong symmetry, so there should be a peak value in one of the
components, as is seen in the figure.

Using these spectra, three files were created: one for training, another one
for testing, and another one for validation. Training files were used to train

[Plot: class averages of the spectra, one curve per class (S1 and noise, S2, S3, S4,
S6); x-axis: spectrum component (0-16), y-axis: amplitude (0-30).]

Fig. 1. Class averages for each class. As can be seen, each class has a sharp peak
at one component, along with some harmonics at components with double and half
value; for instance, objects with symmetry 6 have a small harmonic at 12.

all neural nets, test files to select one among all trained nets, and the validation
file to check accuracy for that network; the validation file has never been shown
before to the chosen network. Each file contained 311 samples, with random
class distribution. Only the training and validation files were used for the LVQ
algorithm, since instead of selecting one net, the average and lowest errors are
taken.
Several classification algorithms were then tested on these sets: Kohonen's
classical Learning Vector Quantization, or LVQ; G-LVQ [9], which is basically, as
pointed out above, a genetic optimization of Kohonen's LVQ that, at the same
time, discovers the optimal size of the LVQ network; QuickProp [24]; and G-PROP,
a genetically optimized version of backpropagation which is also able
to discover the correct size for the hidden layer [18].
LVQ was run 1000 times with several preset codebook sizes, using the training
file to train and the validation file as the test file; since no further selection is made on
the training parameters, the test file was not considered necessary. The LVQ algorithm
has no method to select the codebook size in advance; thus, two different sizes
were tested: 8 and 16 levels. Weight vectors and labels were initialized with
vectors from the training file, which usually gives good results, but with so few
codevectors it might happen that many codevectors get the same label.

Algorithm                   Error ± StdDev   Lowest   Net size (number of parameters)
LVQ, 8-level                16 ± 11          0        8 (128)
LVQ, 16-level               4 ± 5            0        16 (256)
QuickProp                   7 ± 3.2          -        16 (272)
G-LVQ, population = 200     10 ± 4           7.37     8 ± 2 (128 ± 32)
G-LVQ, 50 generations       14.7 ± 0.6       13.46    9.4 ± 1.1 (150.4 ± 17.6)
G-LVQ, 100 generations      7.9 ± 0.8        7.37     9 ± 2 (144 ± 32)
G-PROP                      1.0 ± 0.6        0.32     16 ± 2 (272 ± 85)

Table 2. Results of evaluating LVQ, G-LVQ, QuickProp and G-PROP on the classification
of the G40 virus spectra. Parameters for LVQ were as follows: 1000 training steps, and
gain parameter set to 0.1; initialization from the training file. In the case of G-LVQ, each
network received 1000 training steps. G-PROP and QuickProp were trained as described
in [18].

The gain factor α was set to 0.1, decreasing by α/epochs each step; and they
were trained for 1000 epochs. The program was written using S-VQ (Simple
Vector Quantization), a C++ class library for programming vector quantization
applications, which is available from the author. The whole test took several
minutes on a Silicon Graphics O2. 100 QuickProp networks were also trained,
with the hidden layer size and gain factor set to the best ones obtained by G-PROP.
All settings for G-PROP were the same as in [18].
The genetic algorithm used for G-LVQ is a steady-state algorithm, which
means that a part of the population is kept in each generation. In this case,
40% of the population was substituted each generation with the offspring of
the 40% best. Population was fixed to 100 networks for each generation (or
200 in one case). Variable-length chromosome operators were used, with 20%
probability for the gene-duplication operator, and 10% probability for the gene-elimination
operator. Bit-flip mutation is applied with 10% probability, and crossover with 40%
probability. The GA was run for several generations, using a vectorial fitness as shown in
[9]; the main criterion for optimization was minimization of the misclassification
rate, followed by length. G-LVQ was run 5 times with each of the different
parameter settings, and different random initializations; each network was trained
for a number of epochs equal to twice the size of the training set (around 600
vector presentations in this case). Running the test took around one hour on an old
Silicon Graphics R4000 Entry (similar to a Pentium 100 MHz). G-PROP used the
same parameters as described in [18].
The best algorithm is G-PROP, which obtains an outstanding error level, with
a low standard deviation, and an acceptable size. 16-level LVQ initialized with samples
from the training file also obtains acceptable values and, if enough networks
are trained (1000 in this case), is able to find one which generalizes perfectly (0
error).

[Plot: the trained codebook, codevector values against spectrum component (0-16),
one curve per codevector; classes 0 (noisy/no symmetry), 2, 3, 4 and 6.]

Fig. 2. Trained codebook obtained in one of the G-LVQ runs. The genetic algorithm has
assigned 1 codevector to each class, except the class with symmetry 3 (3 codevectors)
and symmetry 2 (2 codevectors). This particular run obtained a 9% error on the
validation sample. Usually, each codevector has higher values in the component
corresponding to its symmetry and its harmonics.

G-LVQ yields disappointing results with respect to these, although they
are also acceptable, probably due to an incomplete run of the algorithm;
more generations would find better results, as shown in the 100-generations
case. More tests will have to be made on this point.
With respect to training time, all tests took a similar time; G-PROP was
trained for fewer generations and with a smaller population, but each BP training
takes longer than LVQ. On the other hand, the non-genetic versions of the
neural nets were trained for more epochs than the ones trained within the genetic
algorithm.

5 Discussion and future work

This paper shows that a very efficient classification of spectra obtained from
electron microscopy images is possible, which paves the way for high-speed and accurate
electron microscopy image processing. For this problem, genetically optimized
versions of LVQ and Backprop obtained better results than the standalone
algorithms and, besides, were able to find relevant parameters (initial weights,
size and learning parameters). G-PROP outperforms all the other algorithms,
and finds a perceptron with a small size, less than 1% of the number of variables
involved in training. Probably, G-PROP is much more efficient than G-LVQ when
more than 2 classes are involved and class frequencies are not the same for all
classes, since the latter sometimes finds codebooks with one or several classes missing
(and, depending on class frequencies, it could still have a good error rate). Future
work will include improvements to G-LVQ, so that it obtains better results on
this kind of problem, and application to other electron microscopy problems.

Acknowledgements
This work has been supported in part by CICYT (Spain) grants 1FD97-0439-TEL1
and BIO98-076.

References
1. M.C. San Martín, N.P.J. Stamford, N. Dammerova, N.E. Dixon, and J.M.
Carazo. A structural model for the Escherichia coli DnaB helicase based on electron
microscopy data. J. Struct. Biol., (114):167-176, 1995.
2. X. Yu, M.J. Jezewska, W. Bujalowski, and E.H. Egelman. The hexameric E. coli
DnaB helicase can exist in different quaternary states. J. Mol. Biol., (259):7-14,
1996.
3. M. Bárcena, M.C. San Martín, F. Weise, S. Ayora, J.C. Alonso, and J.M. Carazo.
Polymorphic quaternary organization of the Bacillus subtilis bacteriophage SPP1
replicative helicase (G40P). J. Mol. Biol., (283):809-819, 1998.
4. C. San Martín, C. Gruss, and J.M. Carazo. Six molecules of SV40 large T antigen
assemble in a propeller-shaped particle around a channel. Journal of Molecular
Biology, (269), 1997.
5. Teuvo Kohonen. The self-organizing map. Proc. IEEE, 78:1464-1480, 1990.
6. R. Marabini and J.M. Carazo. Pattern recognition and classification of images
of biological macromolecules using artificial neural networks. Biophysical Journal,
66:1804-1814, 1994.
7. R.A. Crowther and L.A. Amos. Harmonic analysis of electron microscope images
with rotational symmetry. J. Mol. Biol., (60):123-130, 1971.
8. Jose-Jesus Fernandez and Jose-Maria Carazo. Analysis of structural variability
within two-dimensional biological crystals by a combination of patch averaging
techniques and self-organizing maps. Ultramicroscopy, 65:81-93, 1996.
9. J.J. Merelo and A. Prieto. G-LVQ, a combination of genetic algorithms and LVQ.
In D.W. Pearson, N.C. Steele, and R.F. Albrecht, editors, Artificial Neural Nets and
Genetic Algorithms, pages 92-95. Springer-Verlag, 1995.

10. T. Kohonen. The self-organizing map. Procs. IEEE, 78:1464 ff., 1990.
11. Ethem Alpaydın. GAL: Networks that grow when they learn and shrink when they
forget. Technical Report TR-91-032, International Computer Science Institute,
May 1991.
12. Enrique Monte, D. Hidalgo, J. Mariño, and I. Hernáez. A vector quantization
algorithm based on genetic algorithms and LVQ. In NATO-ASI Bubión, page 231
ff., 1993.
13. S.R. Johnson, J.M. Sutter, H.L. Engelhardt, P.C. Jurs, J. White, J.S. Kauer,
T.A. Dickinson, and D.R. Walt. Identification of multiple analytes using an optical
sensor array and pattern recognition neural networks. Anal. Chem., (69):4641,
1997.
14. P. Fränti, J. Kivijärvi, T. Kaukoranta, and O. Nevalainen. Genetic algorithms for
codebook generation in VQ. In Proc. 3rd Nordic Workshop on Genetic Algorithms,
Helsinki, Finland, pages 207-222, 1997.
15. Juan-Carlos Perez and Enrique Vidal. Constructive design of LVQ and DSM classifiers.
In J. Mira, J. Cabestany, and A. Prieto, editors, New Trends in Neural Computation,
Lecture Notes in Computer Science No. 686, pages 335-339. Springer,
1993.
16. Xin Yao and Yong Liu. Towards Designing Artificial Neural Networks by Evolution.
Applied Mathematics and Computation, 91(1):83-90, 1998.
17. P.A. Castillo, J. González, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. G-Prop:
Global Optimization of Multilayer Perceptrons using GAs. Submitted to Neurocomputing,
1998.
18. P.A. Castillo, J. González, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. SA-Prop:
Optimization of Multilayer Perceptron Parameters using Simulated Annealing.
Submitted to IWANN'99, 1998.
19. J.J. Merelo, A. Prieto, F. Morán, R. Marabini, and J.M. Carazo. Automatic classification
of biological particles from electron-microscopy images using conventional
and genetic-algorithm optimized learning vector quantization. Neural Processing
Letters, 8:55-65, 1998.
20. A. Pascual, M. Bárcena, and J.M. Carazo. Application of the fuzzy Kohonen clustering
network to biological macromolecules images classification. Submitted to
IWANN'99, 1999.
21. David H. Ackley. A connectionist algorithm for genetic search. In John J.
Grefenstette, editor, Proceedings of the First International Conference on Genetic
Algorithms and their Applications, pages 121-135, Hillsdale, New Jersey, 1985.
Lawrence Erlbaum Associates.
22. I. Harvey. Species adaptation genetic algorithms: a basis for a continuing SAGA.
In F.J. Varela and P. Bourgine, editors, Proceedings of the First European Conference
on Artificial Life. Toward a Practice of Autonomous Systems, pages 346-354,
Paris, France, 11-13 December 1991. MIT Press, Cambridge, MA.
23. J.D. Schaffer and J.J. Grefenstette. Multi-objective learning via genetic algorithms.
In Procs. of the 9th International Joint Conference on Artificial Intelligence,
pages 593-595, 1985.
24. S.E. Fahlman. Faster-Learning Variations on Back-Propagation: An Empirical
Study. Proceedings of the 1988 Connectionist Models Summer School, Morgan
Kaufmann, 1988.
Pattern Recognition
Using Neural Network Based on
Multi-valued Neurons

Igor N. Aizenberg, Naum N. Aizenberg

Department of Cybernetics, University of Uzhgorod (Ukraine)


Scientific advisors to the company Neural Networks Technologies Ltd. (Israel)
For communications: Minaiskaya 28, kv. 49, Uzhgorod, 294015, Ukraine
E-mail: ina@karpaty.uzhgorod.ua
The company Neural Network Technologies Ltd. (Israel) supports the presented work.

Abstract

Multi-valued neurons are neural processing elements with complex-valued weights, huge functionality
(it is possible to implement on a single neuron an arbitrary mapping described by a partially defined
multiple-valued function), and quickly converging learning algorithms. Such features of the multi-valued
neurons may be used for the solution of different kinds of problems.
A neural network with multi-valued neurons for image recognition will be considered in the paper. Such a
network analyzes the spectral coefficients corresponding to low frequencies. Simulation results are presented
on the example of face recognition.

1. INTRODUCTION

The multi-valued neural element (MVN), which is based on the ideas of multiple-valued
threshold logic [1], has been introduced in [2]. Different kinds of networks based
on MVN have been proposed [2-7]. Successful application of these networks to
the simulation of associative memory [2-4, 6-7], image recognition and segmentation [3-4,
6], and time-series prediction [3-4] confirms their high efficiency. Highly effective, quickly
converging learning algorithms for MVN and neural networks based on them have been
elaborated [2-4, 5]. We would like to concentrate here on the problem of pattern
recognition. It will be considered on the example of image recognition.
Solution of the image recognition problem using neural networks has become very popular
during the last years. Many corresponding examples are available [7-10]. On the other hand,
many authors reduce image recognition to the analysis of orthogonal spectrum
coefficients using different neural networks [10-11]. We would like to propose here a new
approach to image recognition based on the following: 1) the high functionality of the
multi-valued neurons and the quick convergence of their learning algorithm; 2) the well-known
fact of the concentration of the signal energy in the low-frequency part of
orthogonal spectra [11].
Different MVN-based neural networks have already been used for the solution of the image
recognition problem. Several papers devoted to different types of associative memory
should be mentioned. An MVN based cellular neural network has been proposed in [2] as
associative memory. An MVN based neural network with random connections has been
proposed [3-4] as an associative memory alternative to the Hopfield one. The MVN based
network with random connections requires a much smaller number of connections than a
fully connected Hopfield network. The quickly converging learning algorithm is another
important useful property of the MVN based network with random connections. On the
other hand, a Hopfield-like MVN-based neural network has been proposed as associative
memory in [7]. A disadvantage of these three networks is the impossibility of recognizing
shifted or rotated images, as well as images with a changed dynamic range. To overcome these
disadvantages and to use the features of multi-valued neurons effectively, we would like to
propose here a new type of network, learning strategy and data representation
(the frequency domain will be used instead of the spatial one). An idea of image recognition
on a single-layered MVN based neural network using analysis of the orthogonal spectra
coefficients has been proposed in [6]. It will be considerably developed here.

2. MULTI-VALUED NEURONS AND THEIR LEARNING

The multi-valued neuron (MVN), introduced in [2], was considered later in many papers; e.g.,
some important theoretical aspects have been presented in [3] and [5]. We would like to
recall here some key points of the MVN theory (the mathematical model of the
MVN and its learning).
MVN [2, 3, 5] performs a mapping between n inputs and a single output. The performed
mapping is described by a multiple-valued (k-valued) function of n variables $f(x_1, \ldots, x_n)$ via
its representation through n+1 complex-valued weights $w_0, w_1, \ldots, w_n$:

$f(x_1, \ldots, x_n) = P(w_0 + w_1 x_1 + \cdots + w_n x_n)$,   (1)

where $x_1, \ldots, x_n$ are the variables on which the performed function depends. Values of the
function and of the variables are coded by complex numbers which are k-th roots
of unity: $\varepsilon^j = \exp(i 2\pi j/k)$, $j \in \{0, \ldots, k-1\}$, where i is the imaginary unity; in other words, the values of
the k-valued logic are represented as k-th roots of unity: $j \rightarrow \varepsilon^j$. P is the activation
function of the neuron:

$P(z) = \exp(i 2\pi j/k)$, if $2\pi (j+1)/k > \arg(z) \ge 2\pi j/k$,   (2)

or, with integer output:

$P(z) = j$, if $2\pi (j+1)/k > \arg(z) \ge 2\pi j/k$,   (2a)

where $j = 0, 1, \ldots, k-1$ are the values of the k-valued logic, $z = w_0 + w_1 x_1 + \cdots + w_n x_n$ is the
weighted sum, and $\arg(z)$ is the argument of the complex number z. So, if z belongs to the j-th
sector, into which the complex plane is divided by (2), the neuron's output is equal to $\varepsilon^j$, or j
in the integer form (Fig. 1).

Fig. 1. Definition of the MVN activation function.

MVN has some remarkable properties that make it much more powerful than traditional
artificial neurons. The representation (1)-(2) makes possible the implementation of
input/output mappings described by arbitrary partially defined multiple-valued functions.
Such a possibility to implement arbitrary mappings on a single neuron gives an
opportunity to develop networks not merely to perform complicated mappings, but to directly
solve complicated applied problems.
Another important property of the MVN is the simplicity of its learning. Theoretical
aspects of the learning, which are based on motion within the unit circle, have been
considered in [1-2]. If we consider the learning of the MVN as a generalization of
perceptron learning, we obtain the following. If the perceptron output for some element of the
learning set is incorrect (1 instead of -1, or -1 instead of 1), then the weights should be
corrected by some rule to ensure an inversion of the sign of the weighted sum. Therefore, it is
necessary to move the weighted sum to the opposite subdomain (respectively from "positive"
to "negative", or from "negative" to "positive"). For MVN, which performs a mapping
described by a k-valued function, we have exactly k domains. Geometrically they are the
sectors on the complex plane (Fig. 1). If the desired output of MVN on some element from
the learning set is equal to $\varepsilon^q$ then the weighted sum should fall into the sector number
q. But if the actual output is equal to $\varepsilon^s$ then the weighted sum has fallen into sector
number s (see Fig. 2). A learning rule should correct the weights to move the weighted
sum from the sector number s to the sector number q.

Fig. 2. Problem of the MVN learning

The following correction rule for the learning of the MVN has been proposed in [2]:

$W_{m+1} = W_m + \omega C_m \varepsilon^q \bar{X}$,   (3)

where $W_m$ and $W_{m+1}$ are the current and the next weighting vectors, $\bar{X}$ is the complex-conjugated
vector of the neuron's input signals (the current vector from the learning set), $\varepsilon^q$ is the
desired neuron's output (in the complex-valued form), $C_m$ is the scale coefficient, and $\omega$ is the
correction coefficient. Such a coefficient must be chosen from the point of view that the
weighted sum should move exactly into the desired sector, or at least as close to it as possible,
after the correction of the weights according to rule (3).
Another effective, quickly converging learning algorithm for the multi-valued neuron has
been proposed in [3] and then developed in [6]. It is based on the error-correction rule:

$W_{m+1} = W_m + \frac{C_m}{n+1} \, (\varepsilon^q - \varepsilon^s) \, \bar{X}$,   (4)

where $W_m$ and $W_{m+1}$ are the current and the next weighting vectors, $\bar{X}$ is the vector of the neuron's
input signals (complex-conjugated), $\varepsilon$ is a primitive k-th root of unity (k is chosen from
(2)), $C_m$ is a scale coefficient, q is the number of the desired sector on the complex plane, s
is the number of the sector into which the actual value of the weighted sum has fallen, and n is
the number of neuron inputs. Learning algorithms based on both rules (3) and (4) converge very
quickly. It is possible to implement them in truly integer arithmetic [3]; also, it is
always possible to find such a value of k in (2) that (1) will be true for a given function f,
which describes the mapping between the neuron's inputs and output [3, 6].

3. MVN BASED NEURAL NETWORK FOR IMAGE RECOGNITION

As mentioned above, a disadvantage of the networks used as associative memories
is the impossibility of recognizing shifted or rotated images, as well as images with a changed dynamic
range. The MVN based neural network with random connections, which has been proposed as
associative memory in [3-4], has the same disadvantage. This network is oriented towards the
storage of gray-scale images of size n x m. It contains exactly n x m neurons. Each
neuron is connected with a limited number of other ones. Connections are defined by
some random function. An example of such a network is shown in Fig. 3.

Fig. 3. Fragment of the neural network with random connections.

The ij-th neuron is connected with 8 other neurons, and with itself. The numbers of the neurons
from which the ij-th neuron receives the input signals are chosen randomly.
To use the MVN features more effectively, and to overcome the disadvantages mentioned
above, we would like to consider here a new type of network, learning strategy and
data representation (the frequency domain will be used instead of the spatial one).
Consider N classes of objects, which are presented by images of n x m pixels. The
problem is formulated in the following way: we have to create a recognition system based
on a neural network, which makes possible the successful identification of the objects by fast
learning on a minimal number of representatives from all classes.
To make our method invariant to rotation and shifting, and to make possible the
recognition of other images of the same objects, we will move to a frequency domain
representation of the objects. It has been observed (see e.g., [11]) that objects belonging to the
same class must have similar coefficients corresponding to the low spectral coefficients. For
different classes of discrete signals (of different nature and length from 64 up to 512), the
sets of the lowest (quarter to half) coefficients are very close to each other for signals from
the same class, from the point of view of learning and analysis on the neural network [11].
This observation is true for different orthogonal transformations. It should be mentioned
that the neural network proposed in [11] for the solution of a similar problem was based on
ordinary threshold elements, and only two classes of objects were considered. In the
terms of neural networks, to classify objects we have to train a neural network with a
learning set containing the spectra of representatives of our classes. Then the weights
obtained by learning will be used for the classification of unknown objects.

Fig. 4. MVN based neural network for image recognition (N classes of objects - N neurons).

We propose the following structure of the MVN based neural network for the solution
of our problem. It is a single-layer network, which has to contain the same number of
neurons as the number of classes we have to identify (Fig. 4). Each neuron has to
recognize the patterns belonging to its class and to reject any pattern from any other class.
Taking into account that a single MVN can perform an arbitrary mapping, it is easy to conclude
that exactly such a structure of the network is optimal and the most
effective.
To ensure a more precise representation of the spectral coefficients in the neural network,
they have to be normalized; their new dynamic range after normalization will be [0,
511]. More exactly, they will take discrete values from the set {0, 1, 2, ..., 511}. We will
use two different models for the frequency domain representation of our data. The first
one uses the low part of the Cosine transformation coefficients. The second one uses the
phases of the low part of the Fourier transformation coefficients. In the last case we used the
property of the Fourier transformation that the phase contains more information about the signal
than the amplitude (this fact is investigated e.g., in [12]).
The best results for the first model were obtained experimentally when we reserved
the first l=k/4 (from the k=512) sectors on the complex plane (see (2) and Fig. 1) for the
classification of a pattern as belonging to the given class. The other 3k/4 sectors correspond to
rejected patterns (Fig. 5). The best results for the second model were also obtained
experimentally, when for the classification of a pattern as belonging to the given class we
reserved the first l=k/2 (from the k=512) sectors on the complex plane (see (2) and Fig. 1).
The other k/2 sectors correspond to rejected patterns (Fig. 6).

Fig. 5. Reservation of the domains for recognition - 1st model (sectors 0, ..., k/4-1: domain for the patterns from the given class; sectors k/4, ..., k-1: domain for rejected patterns).



Fig. 6. Reservation of the domains for recognition - 2nd model (sectors 0, ..., k/2-1: domain for the patterns from the given class; sectors k/2, ..., k-1: domain for rejected patterns).

Thus, for both models, output values 0, ..., l-1 for the i-th neuron correspond to the
classification of an object as belonging to the i-th class. Output values l, ..., k-1 correspond to the
classification of an object as rejected for the given neuron and class, respectively.
Hence, there are three possible results of recognition after the training: 1) the output of the
neuron number i belongs to {0, ..., l-1} (it means that the network classified the pattern as
belonging to class number i), and the outputs of all other neurons belong to {l, ..., k-1}; 2) the outputs
of all neurons belong to {l, ..., k-1}; 3) the outputs of several neurons belong to {0, ..., l-1}.
Case 1 corresponds to exact (or wrong) recognition. Case 2 means that a new class of
objects has appeared, or that the learning was insufficient, or the learning set was not representative.
Case 3 means that the number of neuron inputs is too small or, inversely, too large, or that the learning
has been performed on a non-representative learning set.
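The three-case decision just listed can be sketched as follows (reusing the hypothetical mvn_sector function from the sketch in Section 2; l is the boundary between accepting and rejecting sectors):

    def classify(weights_per_class, x, k, l):
        # Apply the three-case decision rule over the N class neurons.
        accepting = [i for i, w in enumerate(weights_per_class)
                     if mvn_sector(w, x, k) < l]   # outputs in {0, ..., l-1}
        if len(accepting) == 1:
            return accepting[0]   # case 1: exactly one neuron accepts the pattern
        if not accepting:
            return None           # case 2: new class / insufficient learning
        return accepting          # case 3: ambiguous, several neurons accept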

4. SIMULATION RESULTS

The proposed structure of the MVN based network and the approach to the solution of the
recognition problem have been evaluated on the example of face recognition. Experiments
have been performed on a software simulator of the neural network.

Fig. 7. Testing image data base (one image shown per class, classes 1-20).

We used the MIT faces data base [13], which was supplemented by some images from the
data base used in our previous work on associative memories (see [3-4]). So our testing
data base contained 64 x 64 portraits of 20 people (27 images per person, with different
dynamic range, light conditions, and position in the field). So, our task was to train the
neural network to recognize twenty classes. A fragment of the data base is presented in
Fig. 7 (each class is presented by a single image within this fragment).
According to the structure proposed above, our single-layer network contains twenty
MVNs (the same number as the number of classes). For each neuron we have the following
learning set: 16 images from the class corresponding to the given neuron, and 2 images from
each other class (so 38 images from other classes). Let us describe the results obtained for
both models.
Model 1 (Cosine transformation).
According to the scheme presented in Fig. 5, sectors 0, ..., 127 have been reserved for the
classification of an image as belonging to the current class, and sectors 128, ..., 511 have
been reserved for the classification of the images from other classes. The learning algorithm
with the rule (4) has been used. Thus for each neuron q=63 for patterns from the current
class, and q=319 for other patterns in the learning rule (4).
The best results have been obtained for 20 inputs of the network, that is, for 20 spectral
coefficients which are the inputs of the network. More exactly, these are the 20 low coefficients
(from the second to the sixth diagonal; the zero-frequency coefficient has not been used). The choice
of the spectral coefficients from the diagonals of the spectrum is based on a property of 2-D
frequency ordered spectra: each diagonal contains the coefficients corresponding to the
same 2-D frequency ("zigzag", see Fig. 8).

Fig. 8. Choice of the spectral coefficients, which are the inputs of the neural network ("zigzag" along the spectrum diagonals).
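A sketch of this "zigzag" selection, using SciPy's DCT as an assumed stand-in for the Cosine transformation of the paper; the diagonals from the second to the sixth contain 2+3+4+5+6 = 20 coefficients, matching the number of inputs of model 1:

    import numpy as np
    from scipy.fftpack import dct

    def zigzag_low_coefficients(image, first_diag=1, last_diag=5):
        # 2-D Cosine transform, then collect the coefficients along the
        # low-frequency diagonals i + j = d; the (0, 0) zero-frequency
        # coefficient is skipped, as in the text.
        spec = dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')
        coeffs = []
        for d in range(first_diag, last_diag + 1):
            for i in range(d + 1):        # all (i, j) with i + j = d
                coeffs.append(spec[i, d - i])
        return np.array(coeffs)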

We obtained quick convergence of the learning for all neurons. The computing time of the
software simulator implemented on a Pentium-133 is about 5-15 seconds per neuron. It
corresponds to 2000-3000 iterations. It is necessary to make an important remark: if it is
impossible to obtain convergence of the learning for the given k in (2), it is necessary to
change it and to repeat the process.
For testing, twelve images per person, which were not present in the learning set,
and are other or corrupted photos of the same people, have been shown to the neural network
for recognition. For classes 1, 2, and 17 the testing images are presented, respectively, in
Fig. 9-11. The results are the following. The number of incorrectly identified images for all
classes (neurons) is from 0 (for 15 classes out of 20) to 2 (8%), excepting classes No 2 and
13. For both classes No 2 and 13 this number increased to 3-4. This may be due to the influence of
the same background, against which the photos have been made, and the very similar glasses of both
persons (see Fig. 7). To improve the results of recognition in such a case the learning set
should be expanded. From our point of view this is not a problem, because additional
learning is very simple. On the other hand, increasing the number of classes to be
identified is also not a problem, because it is always possible to add the necessary
number of neurons to the network (Fig. 4), and to repeat the learning process beginning from
the previous weighting vectors.
Model 2 (Fourier transformation).
The results corresponding to model 2 are better. According to the scheme presented in
Fig. 6, sectors 0, ..., 255 have been reserved for the classification of an image as belonging to
the current class, and sectors 256, ..., 511 have been reserved for the classification of the
images from other classes. The learning algorithm with the learning rule (3) has been
used. So, for each neuron q=127 for patterns from the current class, and q=383 for other
patterns in the learning rule (3).
The results of recognition improved steadily with an increasing number of network
inputs. It should be noted that such a property was not noticed in model 1. The results of
recognition were stable for numbers of coefficients greater than 20.
The best results have been obtained for 405 inputs of the network, that is, for 405 spectral
coefficients which are the inputs of the network; beginning from this number the results
stabilized. The phases of the spectral coefficients have again been chosen according to the
"zigzag" rule (Fig. 8).

Class "1": 100%successful recognition


Fig. 9.
Class "2", model 1:9 out of 12 images (75 %) are
recognized, lncorrecllyrecognized images
are marked by "*"
Class '~2", model 2: 100% successlid
recognition
Fig. I 0.

I 2 3 4 5 6

8 9 10 11 12
Class "17" - 100% successful recognition
Fig. i 1.
391

For all classes, 100% successful recognition has been obtained. For classes "2" and "13", 2
images from the other class ("13" for "2", and "2" for "13") were also identified as belonging
to the class, but this mistake has been easily corrected by additional learning. The reason for this
mistake is, evidently, again the same background of the images and the very similar glasses of
both persons whose portraits establish the corresponding classes.
To compare both methods, and to estimate the margin of precision ensured by the learning,
Table 1 contains the numbers of the sectors (out of 512) into which the weighted sum has
fallen for images from class No 17 (see Fig. 11).
It should be mentioned that using the frequency domain data representation it is very easy
to recognize noisy objects (see Fig. 11, Table 1). Indeed, we use the low frequency
spectral coefficients for the data representation. At the same time the noise is concentrated in
the high frequency part of the spectral coefficients, which is not used.
We hope that the considered examples are convincing, and show both the efficiency of the
proposed solution for the image recognition problem and the high potential of MVN and of neural
networks based on them.

Table 1. Number of the sector into which the weighted sum has fallen during recognition of the
images presented in Fig. 11 (class 17).

Image:                                    1    2    3    4    5    6    7    8    9   10   11   12
Method 1, sector (borders are 0, 127):   60   62   62  102   40   34   65   45   99   65   35   46
Method 2, sector (borders are 0, 255):  126  122  130  129  120  135  118  134  151  126  107  119

5. CONCLUSIONS AND FUTURE WORK

A new MVN based neural network for the solution of pattern recognition problems has
been proposed in the paper. This single-layered network contains a minimal number of
neurons: this number is equal to the number of classes to be recognized. The
orthogonal spectra coefficients (Cosine and Fourier) are used for the representation of the
objects to be recognized. The proposed solution of the recognition problem has
been tested on the example of face recognition. The simulation results confirmed the high
efficiency of the proposed solution: the probability of correct recognition of the images
from the testing set is close to 100%. The obtained results may be generalized from face
recognition to image recognition and pattern recognition in general. Future
work on developing the obtained results will be directed to the minimization of
the number of neural network inputs and to the search for the best orthogonal basis for the
representation of the data describing the analyzed objects.

REFERENCES

1. N.N. Aizenberg, Yu.L. Ivaskiv, Multiple-Valued Threshold Logic. Kiev: Naukova
Dumka, 1977 (in Russian).
2. N.N. Aizenberg, I.N. Aizenberg, "CNN based on multi-valued neuron as a model of
associative memory for gray-scale images", Proc. of the 2nd IEEE International
Workshop on Cellular Neural Networks and their Applications, Munich, 1992, pp. 36-41.

3. N.N. Aizenberg, I.N. Aizenberg, G.A. Krivosheev, "Multi-Valued Neurons: Learning,
Networks, Application to Image Recognition and Extrapolation of Temporal Series",
Lecture Notes in Computer Science, Vol. 930 (J. Mira, F. Sandoval - Eds.), Springer-Verlag,
1995, pp. 389-395.
4. N.N. Aizenberg, I.N. Aizenberg, G.A. Krivosheev, "Multi-Valued Neurons:
Mathematical model, Networks, Application to Pattern Recognition", Proc. of the 13th
Int. Conf. on Pattern Recognition, Vienna, August 25-30, 1996, Track D, IEEE
Computer Soc. Press, pp. 185-189, 1996.
5. I.N. Aizenberg, N.N. Aizenberg, "Universal binary and multi-valued neurons
paradigm: conception, learning, applications", Lecture Notes in Computer Science,
Vol. 1240 (J. Mira, R. Moreno-Diaz, J. Cabestany - Eds.), Springer-Verlag, 1997, pp. 463-472.
6. I.N. Aizenberg, N.N. Aizenberg, "Application of the neural networks based on multi-valued
neurons in image processing and recognition", SPIE Proceedings, Vol. 3307,
1998, pp. 88-97.
7. S. Jankowski, A. Lozowski, J.M. Zurada, "Complex-Valued Multistate Neural Associative
Memory", IEEE Trans. on Neural Networks, Vol. 7, pp. 1491-1496, 1996.
8. N. Petkov, P. Kruizinga, T. Lourens, "Motivated Approach to Face Recognition",
Lecture Notes in Computer Science, Vol. 686 (J. Mira, F. Sandoval - Eds.), Springer,
pp. 68-77, 1993.
9. S. Lawrence, C. Lee Giles, Ah Chung Tsoi and A.D. Back, "Face Recognition: A
Convolutional Neural-Network Approach", IEEE Trans. on Neural Networks, Vol. 8,
pp. 98-113, 1997.
10. R. Foltyniewicz, "Automatic Face Recognition via Wavelets and Mathematical
Morphology", Proc. of the 13th Int. Conf. on Pattern Recognition, Vienna, August 25-30,
1996, Track B, IEEE Computer Soc. Press, pp. 13-17, 1996.
11. N. Ahmed, K.R. Rao, "Orthogonal Transforms for Digital Signal Processing", Springer,
1975.
12. A.V. Oppenheim and J.S. Lim, "The importance of phase in signals", Proc. IEEE,
Vol. 69, pp. 529-541, 1981.
13. M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive
Neuroscience, Vol. 3, 1991.
Input Pre-processing for Transformation Invariant Pattern Recognition

Guido Tascini, Anna Montesanto, Giammarco Fazzini, Paolo Puliti


Istituto di Informatica, Università di Ancona,
via Brecce Bianche, 60131 Ancona (Italy)
e-mail: tascini@inform.unian.it
Abstract
This article describes a pattern classifier based on a pre-processing system, located at the input of a recognition
system using a Hopfield neural net, which recognises patterns transformed by translation, rotation and scaling. After a
detailed description of the components forming the chain of the pre-processing system, we present some results obtained
by supplying the system input with handwritten characters deformed by rotation, scaling, and translation. The
patterns obtained from the pre-processing module are supplied as input to the recognition net in order to evaluate
the effectiveness of the pre-processing system itself. The well-known problems deriving from the scarce
memorisation ability of the Hopfield net are faced by a strategy that foresees the subdivision of the training patterns
into minimally correlated groups.

1. Introduction
The majority of pattern recognition systems based on neural nets are very sensitive to
transformations such as rotation, scaling, and translation. In the last years, some researchers have built
systems based on higher-order neural nets that are insensitive to translation, rotation and
scaling, but in practical applications these suffered from a combinatorial explosion of units.
Although many other invariant neural recognisers have been proposed, there still does not exist a useful
system that could be considered insensitive to translation, rotation and scaling. This work is oriented
particularly to the development of a pattern pre-processing system, which is able to make any
pattern invariant to the aforesaid transformations. The effect of the transformations is typical when
the acquisition device, for example a television camera, changes its orientation or distance
from the model. Systems having the ability to recognise patterns in a transformation invariant
manner have practical applications in a great variety of fields, from checking the existence of
simple objects up to guiding robots in their exploration space. Besides, mandatory
characteristics of the system are: independence from the recognition approach used; autonomy, that is, it alone
extracts the pattern from a generic binary image and, through various elaborations like expansion,
translation, normalisation, rotation, and finally scaling, reaches complete transformation
invariance, adapting the extracted pattern to the dimensions demanded by the recogniser. In the
present work, the recogniser has been implemented as a Hopfield neural network constituted of a fully
interconnected matrix of 13x13 neurones.

2. Pre-processing
The first phase of the pre-processing consists of the acquisition through a scanner with optical
resolution equal to 100 dpi, getting a representation of the image inside the computer as a
bit-map in graphic format (PCX). This representation does not facilitate the
subsequent elaborations on the image, for which conversion to the text format is necessary. The
second phase foresees the extraction of the pattern from the image, now in text format,
which is then subjected to the morphological dilation operation, with the double purpose of
filling the holes inside the pattern, introduced by the low sampling resolution, and of
conferring lower mutual correlations to the patterns. At this point begins the process of adaptation or
normalisation, which consists of a sequence of functional transformations (translation, normalisation,
rotation, and scaling), each drawing a measure on the input pattern that allows leading it to a
canonical form.

2.1. Translation
The translation of an object consists of its shift to a new position, keeping its
dimensions and orientation unchanged. This process allows obtaining invariance to position, by calculating the
centre of gravity of the pattern and then translating it so that the centre of gravity
coincides with the centre of the new window. For instance (see figure 1):

Figure 1. Translation of a 5x8 window into a 9x9 window.

The mapping or transformation function for this process is:

$f_t(x, y) = f(x + x_c, y + y_c)$

where $x_c$ and $y_c$ are the co-ordinates of the barycentre of the pattern, and f(x, y) gives the value of
the pixel of the pattern at the co-ordinates (x, y); for binary two-dimensional digitised objects, this
function can only return the values 0 or 1.

/" v]~-I Block that effects Ihe translation I ~ /'

The centre of area in a binary image is the same as the centre of mass if we consider the
intensity of a point as the mass of the same point. The position of the pattern is given by [3]:

$x_c = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} j \cdot f[i,j]}{\sum_{i=1}^{n} \sum_{j=1}^{n} f[i,j]}, \qquad y_c = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} i \cdot f[i,j]}{\sum_{i=1}^{n} \sum_{j=1}^{n} f[i,j]}$
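A minimal sketch of this step, assuming the binary pattern is stored as a NumPy array (np.roll wraps around, which is harmless while the pattern stays well inside the window):

    import numpy as np

    def translate_to_centre(f):
        # Shift the pattern so that its barycentre falls on the window centre.
        rows, cols = np.nonzero(f)                   # active pixels (value 1)
        yc, xc = int(rows.mean()), int(cols.mean())  # barycentre co-ordinates
        cy, cx = f.shape[0] // 2, f.shape[1] // 2    # centre of the window
        return np.roll(np.roll(f, cy - yc, axis=0), cx - xc, axis=1)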

2.2 Normalisation
The first step of this process consists of effecting a measure of size on the pattern, calculating
the mean radius with the following formula [1]:

$r_m = \frac{\sum_{x=1}^{n} \sum_{y=1}^{n} \max\{|x - x_c|, |y - y_c|\} \cdot u(x,y)}{\sum_{x=1}^{n} \sum_{y=1}^{n} u(x,y)}$

where n is the dimension of the window, $(x_c, y_c)$ are the co-ordinates of the centre of the window
and u(x, y) is the matrix of the pattern to normalise:

[Block diagram: u(x, y) → block that effects the normalisation → u_n(x, y)]

The performed normalisation preserves the form of the pattern, scaling it by means of a coefficient
named scale factor, expressed by:

$s = \frac{r_m}{r}, \qquad r = \frac{n}{3}$

The mapping function for the translation invariant pattern is the following:

$u_n(x, y) = u(s \cdot x, s \cdot y)$

For instance, figure 2 shows a factor 2 pattern enlargement.

Figure 2. Factor 2 pattern enlargement.

2.3 Rotation
The first computation for carrying the orientation vector into the canonical direction, so realising
rotation invariance, consists in the calculation of a vector by means of the Karhunen-Loeve
transformation; from this we have that, given a set of vectors, the eigenvector related to the biggest
eigenvalue of the covariance matrix, derived (see below) from the set of vectors, points
in the direction of maximum variance [1, 3]. The mean co-ordinates of the pattern are:

$m_x = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} x_i \cdot u_n(x_i, y_j)}{\sum_{i=1}^{n} \sum_{j=1}^{n} u_n(x_i, y_j)}, \qquad m_y = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} y_j \cdot u_n(x_i, y_j)}{\sum_{i=1}^{n} \sum_{j=1}^{n} u_n(x_i, y_j)}$

The inclination of the eigenvector corresponding to the biggest of the eigenvalues of the covariance
matrix, which allows determining the orientation of the pattern, is the following:

$\frac{y}{x} = \frac{T_{yy} - T_{xx} + \sqrt{(T_{yy} - T_{xx})^2 + 4 T_{xy}^2}}{2 T_{xy}}$

where:

$T_{xx} = \sum_{i=1}^{n} \sum_{j=1}^{n} u_n(x_i, y_j) \cdot x_i^2, \qquad T_{yy} = \sum_{i=1}^{n} \sum_{j=1}^{n} u_n(x_i, y_j) \cdot y_j^2, \qquad T_{xy} = \sum_{i=1}^{n} \sum_{j=1}^{n} u_n(x_i, y_j) \cdot x_i \cdot y_j$
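A sketch of this orientation measure (taking the co-ordinates about the barycentre, an assumption consistent with the covariance interpretation; the axis-aligned case is handled separately because the closed form divides by T_xy):

    import numpy as np

    def orientation_angle(u):
        # Angle of the eigenvector associated with the largest eigenvalue
        # of the covariance matrix: the direction of maximum variance.
        ys, xs = np.nonzero(u)
        x = xs - xs.mean()              # co-ordinates about the barycentre
        y = ys - ys.mean()
        Txx, Tyy, Txy = (x * x).sum(), (y * y).sum(), (x * y).sum()
        if Txy == 0:                    # axis-aligned pattern
            return 0.0 if Txx >= Tyy else np.pi / 2
        slope = (Tyy - Txx + np.sqrt((Tyy - Txx) ** 2 + 4 * Txy ** 2)) / (2 * Txy)
        return np.arctan(slope)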

Subsequently the pattern is rotated towards the canonical direction, chosen as coincident with the y
axis in the "south" direction, by the rotation algorithm. This algorithm has been built so that it
minimises the approximation errors, guarantees less complex elaborations and finally allows a
possible implementation inside a neural network. The purpose is reached with the
introduction of the following hypotheses:
1. we use the chess-board metric, that is: $d = \max(|x - x_b|, |y - y_b|)$, where $(x_b, y_b)$ are the co-ordinates
of the barycentre of the figure.
2. an oblique line is approximated by means of a broken one; for example see figure 3:

Figure 3. Representation of an oblique line with an angle of 22.5°.

3. the points equidistant from the centre of rotation, according to the metric of point 1, form a
circumference which constitutes the domain into which the result of the rotation of such points
can fall. For instance, as in figures 4a and 4b:

Figure 4a. Pattern before rotation by 22.5°. Figure 4b. Pattern after rotation by 22.5°.

The rotation by an angle α of a matrix of points implies that all the points belonging to the
circumference of radius d undergo a shift along the same circumference, equal to the integer part of the
value (d·α)/45.

2.4 Scaling
The scaling allows adjusting the dimension of an image to that of the input layer of the
recogniser neural network. It does not effect a measure of size on the pattern, but it guarantees a
small distortion. It determines a scale factor, 'rap', which is equal to the ratio
between the side of the window to be scaled and the side of the scaled window. The window to be scaled
is sampled with a grid of dimension equal to the side of the scaled window. The weighted
sum (where the weights are constituted by the values of the contribution factors) of the
pixels that fall inside the sampling grid takes place; this sum is compared with a value called the
decision threshold, dependent on the dimensions of the two windows. This threshold value is:

$\theta = \frac{rap^2}{2}$ : the least value that the sum of contributions must reach so that the pixel under examination can be considered active.

This value corresponds to half of the sum of all the weighted contributions in the case in which
all the pixels are active. In figure 5 we show, with the thinner layer, the pixels of the scaled
window, to which the contributions of the pixels of the window to be scaled will be destined.

Figure 5. The value of this pixel is conditioned by the pixel at top left with weight 0.11, by the
pixel at top right with weight 0.11, by the pixel at bottom left with weight 0.11, and by the pixel at
bottom right with weight 0.11 (rap = 2/3).

2.5 Dilation
This process performs the morphological operation 'dilation', which allows widening the geometric
size of the pattern in the window and filling the "holes" inside the same pattern. The dilation
is based on the Minkowski sum, defined as [7]:

$A \oplus B = \{ t \in \mathbb{R}^2 : t = a + b, \; a \in A, \; b \in B \}$   (Minkowski formula)



For example, for the binary patterns used in this work, the procedure happens as in figure 6.

Figure 6. Dilation of the letter E rotated by 22.5° (b), using the structuring element (a).

2.6 Extraction of objects
The operation of extracting the pattern from any image is necessary because it allows
extracting both the patterns that must be used for training and those for recognition; the low
sampling resolution of the scanner involves the presence of "holes" inside the pattern,
making the known contour following algorithm unusable [3, 4]. Therefore, for filling the holes we
use an algorithm that in image processing theory is defined by the term
"expansion" [4]. When pixels of an object change their value such that some background pixel
is converted to 1, the operation is called "expansion".
The implementation of the expansion is the following (a sketch is given below):

* For each pixel of the image:
- If the pixel is background (0) and has in its 8-connected neighbourhood
at least two pixels at 1, then it is converted to 1.

Figure 7. The 8-connected neighbourhood.

The expansion process eliminates the "holes"; now the image can be supplied to the contour
following algorithm without further problems. Along the contour of the expanded object, the
vertexes of the minimal rectangle containing the object are determined. Then the non-expanded
object is extracted, leaving all the characteristics of the object unchanged.
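A sketch of one expansion pass as described above, on a binary NumPy image (border pixels are left untouched for simplicity):

    import numpy as np

    def expand(img):
        # 'Expansion': a background pixel with at least two active pixels
        # in its 8-connected neighbourhood is converted to 1.
        out = img.copy()
        h, w = img.shape
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                if img[y, x] == 0 and img[y-1:y+2, x-1:x+2].sum() >= 2:
                    out[y, x] = 1
        return out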

2.7 Results
In this paragraph we report the results obtained, from the extraction process to the
normalisation process, on the patterns representing the 26 handwritten letters acquired through the
scanner. For instance:

Figure 8. The 26 handwritten letters of the alphabet.

Figure 9a. Handwritten letters C B M A U Y Z after extraction and dilation. Figure 9b. The same patterns normalised by the pre-processing.

Figure 10. Letters from the set {A J T}. On the right, the extracted and dilated letters; on the left, the corresponding normalised characters.

Figure 11. Letters from the set {X C G}. On the right, the extracted and dilated letters; on the left, the corresponding normalised characters.

Figure 12. Letters from the set {H U S}. On the right, the extracted and dilated letters; on the left, the corresponding normalised characters.

3. The Hopfield model
We want to test the effectiveness of the pre-processing by applying it to a recognition system
based on the Hopfield neural net, with synaptic weights given by the Hebb law:

$w_{ij} = \sum_{\mu=1}^{p} x_i^{\mu} x_j^{\mu}$

where $x_i^{\mu}$ and $x_j^{\mu}$ can only have the values 1, -1, and p is the number of patterns to memorise. This law leads to a
scarce memorisation ability: from statistical mechanics considerations, under the hypothesis of
zero mean value and zero mutual correlation between patterns, the maximum
memorisation capacity is 0.145 x (neurone number). For instance, if the number of neurones is equal to
64 (see figure 13), it could then memorise 0.145 x 64 = 9.28 patterns [5, 6].

Figure 13a. Training patterns.

But the computer simulation has shown the error-free recovery of all sixteen patterns
shown in figure 13, highlighting a memorisation ability exceeding the predicted one by as many as 7
patterns. Unfortunately, in practical cases, the conditions of zero mean value and zero mutual
correlation do not always hold; therefore we have divided the training patterns into groups, in order to train an
equivalent number of Hopfield nets, as in figure 14.
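For reference, a minimal sketch of the Hebbian weight construction and of the deterministic recall used later, for bipolar (±1) patterns (synchronous updating is an assumption of this sketch):

    import numpy as np

    def hebb_weights(patterns):
        # w_ij = sum over the p patterns of x_i * x_j, zero self-connections.
        X = np.array(patterns)          # p x n matrix of +/-1 values
        W = X.T @ X
        np.fill_diagonal(W, 0)
        return W

    def recall(W, x, n_iter=50):
        # Deterministic recovery of a stored pattern.
        for _ in range(n_iter):
            x_new = np.where(W @ x >= 0, 1, -1)
            if np.array_equal(x_new, x):   # fixed point reached
                break
            x = x_new
        return x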

Figure 14. Schema of the subdivision process into minimally correlated groups: the set of training patterns is divided into groups 1, ..., n, and each group trains its own Hopfield net.

The proposed strategy must guarantee, inside each group, mutual correlations between
patterns that are as small as possible, for the purpose of getting a correct recall in the recovery phase. In fact, if we
have a great number of patterns, as in our case, we are not able to divide them into minimally
correlated groups without the help of a process that guides us. For this purpose we introduced a
threshold which, chosen by the user, allows a first selection of those patterns whose
mutual correlation values lie under it. The mutual correlation or overlap is defined by [5]:

$C_{sr} = \sum_{i=1}^{n} x_i^{s} x_i^{r}$

where $C_{sr}$ represents the value of the mutual correlation between pattern s and pattern r, n is the
number of neurones, and p is the number of patterns. The correlation matrix $C_{ij}$ is built as in the
following schema:

              pattern n°1   pattern n°2   ...   pattern n°p
pattern n°1       C11           C12       ...      C1p
pattern n°2       C21           C22       ...      C2p
...               ...           ...       ...      ...
pattern n°p       Cp1           Cp2       ...      Cpp

Subsequently, by thresholding, the correlation matrix becomes the binary matrix Cbin_ij, as in the following:

              pattern n°1   pattern n°2   ...   pattern n°p
pattern n°1        0             1        ...       1
pattern n°2        1             0        ...       0
...               ...           ...       ...      ...
pattern n°p        1            ...       ...       0

The element of row i and column j of the binary correlation matrix is equal to 1 if
$C_{ij}$ < threshold; otherwise it is 0. The pattern of row index i is compatibly correlated, under the
value of the selected threshold, with the pattern of column index j, if the element of row i and column j
is 1; otherwise it is incompatible. In short: pattern i is compatibly correlated to pattern j if
$Cbin_{ij} = 1$. If pattern i is compatibly correlated with pattern j, pattern k, and pattern z, which
belong to the same row, it is not true in general that the patterns of index j,
k and z are compatibly correlated among themselves. For instance, as seen in the following
chart, pattern 1 is compatible with patterns 2, 3, 4, while pattern 2 is compatible with
pattern 3 but not with pattern 4.

              pattern n°1   pattern n°2   pattern n°3   pattern n°4
pattern n°1        0             1             1             1
pattern n°2        1             0             1             0
pattern n°3        1             1             0             0
pattern n°4        1             0             0             0

The following phase foresees the elimination, in each row, of the patterns incompatible among themselves,
building p groups of minimally correlated elements and obtaining the matrix denominated C'.
For instance, the following chart results:

              pattern n°1   pattern n°2   pattern n°3   pattern n°4
pattern n°1        0             1             1             0
pattern n°2        1             0             1             0
pattern n°3        1             1             0             0
pattern n°4        1             0             0             0

From the chart it is deduced that pattern 1, pattern 2 and pattern 3 form one group, while the
other is given by pattern 1 and pattern 4. The minimally correlated group given by C'_ij is the one that has
the smaller sum of the mutual correlation values between the patterns of the group in the matrix
$C_{ij}$. For instance, suppose we have a net with 36 neurones and the following correlation matrix $C_{ij}$:

              pattern n°1   pattern n°2   pattern n°3   pattern n°4
pattern n°1       36            -1            11            23
pattern n°2       -1            36             1             0
pattern n°3       11             1            36             0
pattern n°4       23             0             0            36

We have: total correlation of group {1, 2, 3} = C12 + C13 + C23 = 11; total correlation of group {1, 4} = C14 =
23; min{11, 23} = 11. $C_{ij} \ll 0$ indicates little correlated patterns; $C_{ij} \gg 0$ indicates very correlated
patterns. The minimally correlated group so obtained is extracted and the process is restarted afresh,
until complete exhaustion of the patterns. The n groups obtained from this process train n
Hopfield nets.
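The whole grouping procedure can be sketched as follows (a greedy reading of the steps above; the tie-breaking and the order in which candidates join a group are assumptions of this illustration):

    import numpy as np

    def minimally_correlated_groups(patterns, threshold):
        # Split bipolar patterns into minimally correlated groups.
        X = np.array(patterns)        # p x n matrix of +/-1 values
        C = X @ X.T                   # overlaps C_sr = sum_i x_i^s * x_i^r
        remaining = list(range(len(patterns)))
        groups = []
        while remaining:
            best_group, best_cost = None, None
            for i in remaining:       # build one candidate group per row
                kept = [i]
                for j in remaining:
                    # j joins if compatible (C < threshold) with all members
                    if j != i and all(C[j, k] < threshold for k in kept):
                        kept.append(j)
                cost = sum(C[a, b] for a in kept for b in kept if a < b)
                if best_cost is None or cost < best_cost:
                    best_group, best_cost = kept, cost
            groups.append(best_group)   # group with the smallest total overlap
            remaining = [r for r in remaining if r not in best_group]
        return groups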

Char. "A" normalised Recognised Patterns

Char. "E" normalised Recognised Patt.

Char."J" normalised. RecoL,nised Patt.

Char. "R" Normalised Reco~nised Patt.

Char. "Y" Normalised Reco~nised Patt.


Figure 15.Normalised characters and recognised patterns

3.1 Results
In this paragraph we visualise (see figure 15) the results concerning the recognition of some patterns
normalised by the pre-processing and representing the handwritten characters {A, E, J, R, Y}. Each
normalised letter has been given as input to the net trained with the group containing the same normalised
letter; in fact, only with this net can the correct recognition be obtained. The recall of the pattern has
been performed by deterministic recovery.

4. Conclusions
In this work, the problem of how to build a pre-processing system that allows
transformation invariant pattern recognition with a neural net has been faced. Such a pre-processing system
must guarantee invariance to position, size and rotation of each pattern, since all
recognition systems based on neural nets are very sensitive to the aforesaid transformations. To
get the invariance, the object has first been centred inside a square window with
dimensions equal to 59x59 pixels, reaching invariance to translation; then the object has been
normalised inside the same window, getting invariance to size; and finally the orientation angle
has been determined by measuring the direction of maximum variance, to get
rotation invariance. The pattern then requires a scaling for adapting itself to the dimensions of
the neural net, equal to 13x13 pixels. The results have shown the success of this pre-processing
system, which is able to autonomously extract the pattern from an image and subsequently, through
various stages, adapt it to the dimensions of any recognition system. The choice of the
Hopfield net as recogniser presents a scarce memorisation capability, which is
partially resolved by the choice of the minimally correlated groups; this kind of net, however, has
the undeniable advantage of being able to recover patterns seriously damaged by noise and by the
elaborations that realise the invariance.

References
[1] C. Yüceer, K. Oflazer, "A rotation, scaling, and translation invariant pattern classification
system", Pattern Recognition, vol. 26, no. 5, pp. 687-710, (1993).
[2] S.O. Belkasim, M. Shridhar and M. Ahmadi, "Pattern recognition with moment invariants:
a comparative study and new results", Pattern Recognition, vol. 24, no. 12, pp. 1117-1138,
(1991).
[3] W. Pratt, "Digital Image Processing", second edition, Wiley, New York, pp. 629-647, (1978).
[4] R. Jain, R. Kasturi, B.G. Schunck, "Machine Vision", McGraw-Hill, (1995).
[5] E. Pessa, "Reti neurali e processi cognitivi", Di Renzo Editore, (1993).
[6] J.J. Hopfield, "Neural networks and physical systems with emergent collective
computational abilities", Proceedings of the National Academy of Sciences USA, vol. 79,
pp. 2554-2558, (1982).
[7] V. Cantoni, S. Levialdi, "La visione delle macchine", Tecniche Nuove, (1989).
[8] M. Fukumi, S. Omatu, Y. Nishikawa, "Rotation-Invariant Neural Pattern Recognition
System Estimating a Rotation Angle", IEEE Transactions on Neural Networks, vol. 8, no. 3,
May 1997.
[9] Cho-Huak Teh, Roland T. Chin, "On Image Analysis by the Methods of Moments", IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 4, July (1988).
[10] Michael Reed Teague, "Image analysis via the general theory of moments", J. Optical
Society of America, vol. 70, no. 8, August 1980.
[11] R.P.N. Rao, D.H. Ballard, "Localized Receptive Fields May Mediate Transformation-Invariant
Recognition in the Visual Cortex", Technical Report 97.2, National Resource
Laboratory for the Study of Brain and Behavior, Department of Computer Science,
University of Rochester, May (1997).
Method for Automatic Karyotyping of
Human Chromosomes Based on the
Visual Attention System
J.F. Díez Higuera & F.J. Díaz Pernas
Department of Signal Theory, Communications and Telematics Engineering
School of Telecommunications Engineering, University of Valladolid
Campus Miguel Delibes. Camino del Cementerio, s/n. 47011 Valladolid, Spain
josdie@tel.uva.es

Abstract: The present article constitutes a contribution to computer-aided
diagnosis, concretely in the field of chromosome classification. This task plays
an important role in such outstanding questions as infantile prediagnosis and the
cytogenetics of cancer. The proposed architecture is biologically inspired by the
behavior of the human visual system. Thus, operation in the preattentive mode is
modeled by means of a module in charge of performing figure-ground segregation,
using features of the visual attention system. In the same way, the attentive
operation is modeled by means of a module that sequentially attends to the regions of interest
(possible objects) segregated by the preattentive module. For each region
it extracts the emergent features, and it adapts them before sending them to the
recognition module. This last module receives information from both the
preattentive module and the attentive module, and performs the identification of the
attended object. The attentive process is iterated until no more
interesting regions are left in the image. The proposed model is applied to the analysis of
chromosomes, in an attempt to automate a tedious and time-consuming process. In
addition, it tries to avoid the user having to take part in any of the stages of the
recognition.

1 Introduction

The cytogenetic techniques have made it possible to identify each human chromosome by
means of its pattern of bands. Technically, karyotyping is the process by which the
chromosomes of a cell in division (see Fig. 1), properly stained, are identified and assigned
to a certain group [27]. In Fig. 2, the typical aspect of a karyotype is shown. This process
is very important, since the inspection of human chromosomes is an important and
complex task used mainly in clinical diagnosis and biological research [18, 24]. This task
is expensive, in time and money, and imprecise when it is performed manually.
An expert in cytology can produce karyotypes with a small error of about 0.1% [19]. It is a
tedious and time-consuming process. It is necessary to photograph the selected
metaphase and to cut out the chromosomes in order to classify them in a karyotype. During the last
30 years there have been diverse attempts to automate some or all of the procedures involved
in the analysis of chromosomes [8, 9, 16, 18, 20]. The automation of this task presents
several difficulties, due partly to the deviation that the chromosomes present with respect
to the standard pattern of bands, and also because the chromosomes have random
orientation and can be bent, overlapped and/or touching each other.
All the efforts to automate the analysis of chromosomes have had limited
success, with poor classification results compared with those obtained by a skilled
cytotechnician [10, 24]. Some of the reasons for the poor operation are the inadequate use
of the knowledge and experience of the expert and the insufficient ability to make
comparisons and/or eliminations between chromosomes of the same metaphase. In
addition, the systems require the interaction of the operator to separate touching and/or
overlapped chromosomes and to verify the classification results [24].

Fig. 1. Mitotic cell, in the metaphase stage, in which the chromosomes are visible.

Fig. 2. Pairs of chromosomes are grouped and labeled in order to build the karyotype.



2 The visual attention


This section briefly describes selective visual attention, one of the most important
contributions of the proposed model. The human brain uses this feature to focus
attention on a given point of the scene, processing only the information contained
in the center of attention [5]. Most pattern recognition systems are divided into
two stages: feature extraction and classification. These two blocks usually operate in an
independent and sequential way: once the feature extraction system has obtained
the corresponding vector of characteristics, it sends this vector to the classification system,
which processes and categorizes it. Nevertheless, there is much evidence for the
existence, in the human brain, of a process of interactive collaboration that allows us to
extract the features of an object with the aid of the previous knowledge we may
have about it. The selective attention system is controlled by internally generated
signals (based on previous knowledge) [22]. These signals move the attention
center towards the zone of interest, causing the segmentation system to extract the
relevant information about that zone and send it to the classification system (bottom-up
signals). Based on the similarity between the received information and the learned
information, the classification system sends the corresponding signals to the attentive
system (top-down signals) [5]. This process is iterated until the recognition of
the object in question is completed. These three systems (segmentation, attentive and classification)
make up the fundamental structure of the proposed architecture (see Fig. 3).
The segmentation system is based on the Neural Dynamics Theory of S. Grossberg et
al. [11]. The attentive system is based on several works of R. Milanese [22]. The
classification system is based on the Adaptive Resonance Theory of S. Grossberg et al.
[2]. The ideas of S. Kosslyn on the mechanisms involved in object recognition have
also been helpful [17].
The following section describes the proposed neural architecture and the results
obtained when processing, by means of this architecture, a chromosome
database widely used as a benchmark for several classification systems. A comparative
study with the results obtained by other methods is also made.

3 System description

The present architecture has been designed with the aim of solving the problem of the
analysis and classification of human chromosomes, while trying at all times to maintain
as much biological plausibility as possible. As in the human visual system, the analysis made
by the proposed architecture is divided into a first preattentive level and a later attentive
level. Treisman [26] suggests two different processes in visual perception. A
preattentive process acts as a fast tracking system and is only concerned with object
detection. This process checks the global object features and codifies the useful
elementary properties of the scene: color, orientation, size or movement direction. At this
point, an edge or contour can be discerned owing to the variation in a single property, but
complex differences in combinations of properties are not detected. The different
properties are coded in different feature maps, in different regions of the brain. The later
attentive processing initially directs the attention to the specific features of an object,
selecting and emphasizing the characteristics segregated in the independent maps. Also, a
saliency map must exist that codifies only the key aspects of the image. This map receives

inputs from all the feature maps, but it only abstracts those features that distinguish the
object of attention from the background. In this way the saliency map selects the details that
are essential for attentive recognition. Recognition takes place when the emergent
positions in different feature maps are associated.
The modular diagram of the architecture is shown in Fig. 3. As can be observed, there
are three main blocks: dorsal module, ventral module and recognition module. This
diagram is based on the recognition model proposed by Kosslyn [17]. The dorsal module
is in charge of the preattentive processing of spatial features and of the integration of the
feature maps: it determines the position, orientation and size of the object to identify. The
ventral module is in charge of the attentive extraction of figure features in the attentional
window selected by the dorsal module. Finally, the information generated by both systems
is sent to the recognition module, where the recognition takes place.
Specifically, in the case of chromosomes, the dorsal module must isolate each
of the chromosomes, so that the ventral module analyzes each chromosome
individually, and the recognition module, from the size information generated in the
dorsal module and the information on the formal structure of the chromosome
produced by the ventral module, identifies it and classifies it into one of the 24 groups.
Next, a functional description of each of the blocks is given, proposing a possible
anatomical location for each of them, together with the basic models used for their design.

Fig. 3. Modular diagram of the proposed architecture.

4 Description of the neuronal architecture

The sequence of operation of the proposed architecture begins with the image coming
from a CCD camera connected to a microscope. The visual memory corresponds
retinotopically to an assembly of mapped areas in which knowledge about the objects that can
affect the processing is stored [21]. These areas constitute a functionally characterized
structure and, therefore, they do not need to be anatomically contiguous. There are

several visual areas that are components of this functional structure, including areas
V1, V2, V3 and V4 [6, 28].

Dorsal module

The block diagram of the dorsal module appears in Fig. 4. The image enters the dorsal
module, which preattentively processes spatial properties such as position,
orientation and size [28]. In this module the luminance and orientation feature maps are
extracted. These maps are then integrated by means of a relaxation process, in order to
generate a saliency map. This module has been called dorsal because its operation is
similar to that of the human dorsal system [17], an assembly of cerebral areas extending
from the occipital lobe to the parietal lobe. In the proposed architecture, this
system is based on Grossberg's BCS model [15]. In this case, receptive fields larger than
those of the ventral module are used, since the objective is to detect objects, i.e. regions of interest
in the scene. The integration of the feature maps follows the model proposed by Milanese
[22], using the contrast maps generated in the LGN (ON and OFF channels) and the
texture maps generated by the BCS.
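As a purely illustrative aside, ON and OFF contrast channels of this kind are commonly abstracted as half-wave rectified difference-of-Gaussians responses. The following Python sketch (the function name, sigma values, and the use of NumPy/SciPy are our own illustrative assumptions, not part of the original system) shows one minimal way to compute such maps:

import numpy as np
from scipy.ndimage import gaussian_filter

def on_off_contrast(image, sigma_center=1.0, sigma_surround=3.0):
    """Center-surround (difference-of-Gaussians) contrast maps."""
    center = gaussian_filter(image, sigma_center)
    surround = gaussian_filter(image, sigma_surround)
    dog = center - surround
    on = np.maximum(dog, 0.0)    # ON channel: bright regions on dark ground
    off = np.maximum(-dog, 0.0)  # OFF channel: dark regions on bright ground
    return on, off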

Fig. 4. Block diagram of the dorsal module.

As a result of the preattentive processing, a binary map is generated in which the emergent
regions have been detected. The region of greatest area is selected, and two signals
are generated: one towards the attentional control, and another towards the recognition
module. The first signal indicates to the ventral module the region that has to be processed
attentively. Biologically speaking, it corresponds to the attention shift. Computationally, it
reduces to a rotation (orientation information) and a zoom (size information). In this way,
the region of interest is placed in the attention window. This first processing stage provides
the three types of invariance, since the region of interest can be translated, rotated and
resized in agreement with the information provided by the dorsal module. The

second signal, sent towards the recognition module, will be the input to one of the recognition
channels, the one that classifies the input pattern based on its size.

Fig. 5. Block diagram of the ventral module.

Ventral module

The block diagram of the ventral module appears in Fig. 5. The ventral module receives
its name from the ventral system, an assembly of cerebral areas that extends
from the occipital lobe to the inferior temporal lobe, IT, and whose cells typically respond to
object properties such as shape, color and texture. The ventral module processes,
in an attentive way, the region selected by the dorsal module. This processing generates
the contour maps coming from two types of receptive fields (symmetrical and anti-
symmetrical). Since the object is rotated and resized before the attentive
processing begins, the orientation of the contour maps is fixed, and in the case of the
chromosomes it corresponds to 0°. This map contains information referring to the banded
pattern. Therefore, the ventral module sends two signals to the recognition module. These
two signals (contour maps), along with the size information sent by the dorsal module,
constitute the input pattern that the recognition module has to identify.
In the proposed architecture, the generation of features in the ventral module is
carried out by means of a model based on Grossberg's BCS. In the case of the
chromosomes, the receptive fields are small, adapted to the size of the chromosomes, with
the objective of detecting the transitions between the bands of the chromosome.

Recognition Module
The block diagram of the recognition module appears in Fig. 6. The outputs from the
ventral module (pattern of bands) and from the dorsal module (size) arrive at the
recognition module, where they are compared with the stored information. This module
corresponds to the associative memory described by Kosslyn, which seems to be
implemented partly in the superior and posterior temporal lobes [17].
In the proposed architecture, the recognition module is implemented by means of a
multisensorial ART network, composed of 3 Fuzzy ARTMAP networks [3] and one
ART1 network [2]. The first three networks receive information from the dorsal module
(size information) and from the ventral module (information on band transitions). ART1
receives as input the outputs from the 3 Fuzzy ARTMAP networks and generates a single
output that indicates the identification of the chromosome. There are other models with a
similar philosophy, like the Fusion ART network proposed by Asfour et al. [1]. If the
chromosome cannot be identified, it is set aside so that an expert can analyze it.
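To make the operation of the ART-based channels concrete, the following Python sketch shows the standard Fuzzy ART category choice and vigilance test from the Carpenter-Grossberg family of models; it is a minimal illustration, not the authors' implementation, and the parameter values are arbitrary:

import numpy as np

def complement_code(a):
    # Fuzzy ART/ARTMAP inputs are usually complement coded: I = (a, 1 - a)
    return np.concatenate([a, 1.0 - a])

def fuzzy_art_select(I, W, alpha=0.001, rho=0.75):
    """Index of the first category that wins the choice competition and
    passes the vigilance test, or None (which would create a new category).
    W is an (n_categories, dim) array of nonnegative weight vectors."""
    fuzzy_and = np.minimum(I, W)                         # per-category I AND w_j
    T = fuzzy_and.sum(axis=1) / (alpha + W.sum(axis=1))  # choice function
    for j in np.argsort(T)[::-1]:                        # highest choice first
        if fuzzy_and[j].sum() / I.sum() >= rho:          # vigilance criterion
            return int(j)
    return None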
Once the object selected preattentively by the dorsal module, and analyzed attentively
by the ventral module, has been identified (or set aside), the region corresponding to the
analyzed object is inhibited in the saliency map. And so on, until all the interesting
regions of the image have been analyzed.
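The scanning-with-inhibition loop just described can be illustrated with a short, self-contained Python sketch; the threshold and window radius are arbitrary placeholders of our own:

import numpy as np

def scan_saliency(saliency, threshold=0.1, radius=20):
    """Sequentially select salient regions, inhibiting each region in the
    saliency map once it has been analyzed (inhibition of return)."""
    s = np.asarray(saliency, dtype=float).copy()
    fixations = []
    while s.max() > threshold:
        i, j = np.unravel_index(np.argmax(s), s.shape)
        fixations.append((i, j))
        # (here the ventral and recognition modules would process the window)
        s[max(0, i - radius):i + radius + 1,
          max(0, j - radius):j + radius + 1] = 0.0  # inhibit analyzed region
    return fixations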

Fig. 6. Block diagram of the recognition module.

5 Results

This section presents the results obtained when applying the proposed architecture to a
chromosome database widely used as a benchmark for several classification methods.
For the experimentation with the proposed model, three extensive databases of G-
banded chromosomes are available, coming from Copenhagen, Edinburgh and Philadelphia. These data
have been used in previous classification studies [7, 24, 25]. Each lot contains a great
number of chromosomes extracted from images of cells in the metaphase stage of
cellular division. The first database was created in Copenhagen in 1976-1978 and
consists of 180 G-banded metaphase cells with TRG, coming from blood samples. The
second database was obtained at the MRC, Edinburgh, in 1984, and contains 125 G-
banded blood cells prepared with the ASG method. The third database was obtained at the
Jefferson Medical College, Philadelphia, in 1987, and contains 130 cells of chorionic
villus coming from routine and crossed laboratory analyses with Giemsa staining.
It must be pointed out that all the chromosomes come from normal human cells and
therefore do not contain abnormalities. It is also necessary to emphasize that there are
no cases of overlapped chromosomes, and very few chromosomes have a
considerable curvature. This fact has not allowed the architecture to be extended to all the
possible cases. Of all the chromosomes available in the database, those with
an excessive curvature have been rejected, because there are not enough of
them to carry out trustworthy learning.
On the other hand, it is necessary to point out that the classification results are
compared with those obtained by means of other methods that use the
same database as a benchmark. Nevertheless, the information used is not the same in all
the cases. In our tests, the chromosome images themselves have been used as the source, extracting from
them all the information necessary for the classification. The methods that participate in
the comparative study use as source the image features extracted by experts in
chromosome recognition. These features are the one-dimensional band profile, the
centromeric index, geometric parameters, etc. Therefore, when comparing results, it is
necessary to consider not only the goodness of the classification method, but also the
capacity to extract the relevant information from the image. The following table shows the
results obtained by the proposed architecture, in comparison with other classification
models.

Table 1. Comparative study of the classification results using diverse methods

Author                  | Method/Features                          | Error rate
Piper & Granum [24]     | Centromeric index, polarity              | 5.9%
Piper                   | Parametric classifier                    | 6.5%
Tso et al. [27]         | Transport algorithm                      | 4.4%
Errington & Graham [7]  | Backpropagation, 1-D banded profile      | 5.8%
Proposed architecture   | ART-multichannel modules, texture, size  | 3.5%

It is worth emphasizing that the results of the proposed architecture with respect to
the units used in the learning are highly satisfactory, since the network recognizes them
with a rate of 100%. With respect to the units that the network has not learned, the
rate drops slightly, owing to chromosomes with lower contrast or an excessive bend.

6 Conclusions

The present paper describes a neural architecture for segmenting and recognizing
textured monochrome images in general, and one especially oriented towards the classification
of human chromosomes.
The initial objective of trying to solve the problem of the analysis and classification of human
chromosomes has been complemented with the maintenance of biological plausibility,
which confers a general-purpose character on the proposed architecture. Each particular
application will require a recognition module adapted to the features of the scene and objects.
In the present paper, a specific recognition module for objects with a texture of
parallel bands, as is the case of the chromosomes, has been designed.
In relation to the biological plausibility, we have tried to model both modes of
operation of the visual system, the preattentive mode and the attentive mode, as well as the
mechanism of visual attention. In the preattentive mode, the processing is highly parallel
and of low resolution, owing to the extension of the receptive fields. Its field of action extends
to the whole scene, and its function is to select the regions with relevant features, ignoring
the rest of the image. The attentive mode sequentially analyzes each of the regions
selected by the preattentive module. In the analysis of chromosomes, the preattentive
mode, represented in the dorsal module, analyzes the whole image and tries to isolate each
of the chromosomes. Next, already processing in attentive mode, the individual
identification of each chromosome is carried out.
Finally, the emergent properties that define the contribution of the present article are:
• The proposed architecture is a first approach to the behavior of the visual system in the
processing and recognition of visual stimuli. In the development of the neural
architecture, all its stages have been justified by suggesting their location within the
structure of the visual system.
• It proposes a transformation invariant to translation, rotation and scale, based on the
information provided by the dorsal module of preattentive segmentation, so that the
objects are presented to the attentive module with the same orientation and size, since
they are adapted to the attentional window.

7 References
[1] Asfour, Y.R., G.A. Carpenter, S. Grossberg and G.W. Lesher. 1993a. Fusion ARTMAP: a neural network
architecture for multi-channel data fusion and classification. Technical Report CAS/CNS-TR-96-006, Boston
University, January 1993.
[2] Carpenter, G.A. and S. Grossberg. 1987a. A massively parallel architecture for a self-organizing neural pattern
recognition machine. Computer Vision, Graphics, and Image Processing, 37: 54-115.
[3] Carpenter, G.A., S. Grossberg, and J.H. Reynolds. 1991. ARTMAP: Supervised real-time learning and classification
of nonstationary data by a self-organizing neural network. Neural Networks, vol. 4, No. 5, pp. 565-588.
[4] Cohen, M.A. and S. Grossberg. 1984. Neural dynamics of brightness perception: Features, boundaries, diffusion,
and resonance. Perception and Psychophysics, Vol. 36, pp. 428-456.
[5] Desimone, R. 1992. Neural circuits for visual attention in the primate brain. In G.A. Carpenter and S. Grossberg,
editors, Neural Networks for Vision and Image Processing. Cambridge, MA: MIT Press, pages 343-364.

[6] Desimone, R. and L.G. Ungerleider. 1989. Handbook of Neuropsychology, 2: 267.
[7] Errington, P.H. and J. Graham. 1993. Classification of chromosomes using a combination of neural networks. In
Proceedings of the IEEE International Conference on Neural Networks, volume III, pages 1236-1241, San
Francisco, CA, March 28-April 1, 1993.
[8] Graham, James. 1987. Automation of routine clinical chromosome analysis I: Karyotyping by machine. Analytical
and Quantitative Cytology and Histology, vol. 9, pp. 383-390.
[9] Groen, F.C.A. and M. van der Ploeg. 1979. DNA cytophotometry of human chromosomes. The Journal of
Histochemistry and Cytochemistry, vol. 27, pp. 436-440.
[10] Groen, F.C.A., T.K. ten Kate, A.W.M. Smeulders, and I.T. Young. 1989. Human chromosome classification based
on local band descriptors. Pattern Recognition Letters, vol. 9, pp. 211-222, April 1989.
[11] Grossberg, S. 1976a. Adaptive Pattern Classification and Universal Recoding, I: Parallel development and coding
of neural feature detectors. Biological Cybernetics, vol. 23, pp. 121-134.
[12] Grossberg, S. 1976b. Adaptive Pattern Classification and Universal Recoding, II: Feedback, Expectation,
Olfaction, and Illusions. Biological Cybernetics, 23: 187-202.
[13] Grossberg, S. 1993. Neural dynamics of motion perception, recognition learning, and spatial attention. Technical
Report CAS/CNS-TR-93-001, Boston University, January 1993.
[14] Grossberg, S. and E. Mingolla. 1985a. Neural dynamics of form perception: Boundary completion, illusory figures,
and neon color spreading. Psychological Review, vol. 92, pp. 173-211.
[15] Grossberg, S. and E. Mingolla. 1985b. Neural dynamics of perceptual grouping: Textures, boundaries, and
emergent segmentations. Perception and Psychophysics, vol. 38, No. 2, pp. 141-171.
[16] Jennings, A. 1990. Chromosome classification using neural nets. Master's thesis, University of Manchester, U.K.
[17] Kosslyn, S.M. 1994. Image and Brain: The Resolution of the Imagery Debate. MIT Press, Cambridge, MA.
[18] Lerner, B., H. Guterman, and I. Dinstein. 1993. Classification of human chromosomes by two-dimensional Fourier
transform components. In Proceedings of the WCNN, volume III, pages 793-796, Portland, Oregon, July 11-15,
1993.
[19] Lundsteen, C., A.M. Lind, and E. Granum. 1976. Visual classification of banded human chromosomes I:
Karyotyping compared with classification of isolated chromosomes. Ann. Hum. Genet., vol. 40, pp. 87-97.
[20] Lundsteen, C., T. Gerdes, E. Granum, J. Philip, and K. Philip. 1981. Automatic chromosome analysis II:
Karyotyping of banded human chromosomes using band transition sequences. Clinical Genetics, vol. 19, pp. 26-36.
[21] Marr, David. 1982. Vision. Freeman, San Francisco.
[22] Milanese, R. 1993. Detecting salient regions in an image: from biological evidence to computer implementation.
Ph.D. thesis, University of Geneva.
[23] Piper, J. and E. Granum. 1989. On Fully Automatic Feature Measurement for Banded Chromosome Classification.
Cytometry, vol. 10, pp. 242-255.
[24] Piper, J., E. Granum, D. Rutovitz, and H. Ruttledge. 1980. Automation of chromosome analysis. Signal
Processing, vol. 2, No. 3, pp. 203-221, July 1980.
[25] Thomason, M.G. and E. Granum. 1986. Dynamically programmed inference of Markov networks from finite sets
of sample strings. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, pp. 491-501.
[26] Treisman, A. 1985. Preattentive processing in vision. In A. Rosenfeld, editor, Human and Machine Vision II,
pages 313-334. Academic Press.
[27] Tso, M., P. Kleinschmidt, I. Mitterreiter and J. Graham. 1991. An efficient transportation algorithm for automatic
chromosome karyotyping. Pattern Recognition Letters, vol. 12, pp. 117-126, February 1991.
[28] Van Essen, D.C., and C.H. Anderson. 1990. Information processing strategies and pathways in the primate retina
and visual cortex. In S.F. Zornetzer, J.L. Davis, and C. Lau, editors, An Introduction to Neural and Electronic
Networks. Academic Press, Inc., San Diego, California, pp. 43-72.
Adaptive Adjustment of the CNN Output Function to Obtain
Contrast Enhancement
M. A. Jaramillo Morán, J. A. Fernández Muñoz

Dpto. Electrónica e Ingeniería Electromecánica. Escuela de Ingenierías Industriales.

Avda. de Elvas s/n. Universidad de Extremadura. Badajoz. Spain.
Tlf.: 34-24-289628. E-mail: miguel@unex.es

Abstract

In this paper we propose an adaptive modification of the output function of the CNN
(Cellular Neural Network) model to perform contrast enhancement of an image. First,
we define the output function to operate in the interval [0,1] with variable saturation
limits in order to adapt the behaviour of the network to the grey levels in the
neighbourhood of every cell. Then we propose a three-layer CNN where the mean
value of the neighbourhood of a pixel is obtained by the first layer and the calculation of
the mean deviation of the pixel values from the mean in the same neighbourhood is
carried out by the second one. These parameters are control signals that define the
saturation limits of the piecewise linear output function of each cell in the third layer,
the output of the network, adapting it to the neighbourhood of each cell. Some examples
are presented to demonstrate the capabilities of the model.

1.- Introduction

The use of neural networks as image processing structures is a growing research field in
the neural network community because, just as the capabilities of the brain justified
the first applications of these structures being devoted to pattern learning and recognition, the
visual processing carried out by the visual neural system of living beings justifies the
application of neural networks to those tasks. However, the existence of a well-
developed theory of image processing that constitutes a whole scientific field is the
reason why the attention of the neural network community to these tasks has not been
very intense, apart from being rather recent. It has been the need to look for new
applications for neural network models that has caused this growth. Furthermore,
as many of the recognition tasks neural networks are devoted to are performed with
optical patterns, it is natural to pay some attention to pure image processing. The
possibility of having a pattern recognition system with a previous image processing unit
enhancing the image quality, both developed with neural networks, is a very attractive
idea.
On the other hand, the existence of a highly developed image processing theory,
which at first glance could be a handicap, can turn out to be helpful for neural image
processing research, since the study of techniques of proven efficiency can help to
develop neural models to be used as such structures. If one can obtain with neural
networks results analogous to those provided by the standard image processing tools, this
would justify their potential as image processing systems.

Many works have appeared in which neural networks are used to process
images with the goal of aiding pattern and shape recognition tasks. However, there
is a model whose main aim was to allow an easy VLSI implementation which
seems to be better adapted to these tasks: Cellular Neural Networks (CNN). They were
originally proposed by Chua and Yang [1][2] as a unification of some aspects of Neural
Networks and Cellular Automata. They have a neuron model that is very similar to the
Hopfield one, but with the difference that each cell is only connected with those
surrounding it. These connections are the same for every cell, defining a repetitive
structure usually named a "cloning template". This repetitive synaptic scheme
represents the main feature of the model, providing a local processing of the input
signal that makes it especially appropriate for use in image processing, which has
become one of the main applications of CNNs [3][4][5]. For this reason, it seems
reasonable to use them as an image processing system, leaving aside their possible VLSI
implementation.
To describe the model we will consider an image u as the network input, and,
assuming that both input and output have the structure of an m×n matrix, the equation
that governs the dynamic behaviour of the neural activity v(t) is:

C \frac{dv_{ij}(t)}{dt} = -\frac{1}{R} v_{ij}(t) + \sum_{C(k,l) \in N_r(i,j)} A(i,j;k,l)\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} B(i,j;k,l)\, u_{kl}(t) + I    (1)

where A(i,j;k,l) and B(i,j;k,l) represent, respectively, the synaptic weights of each neuron
with other cells and with pixels in the input image. They are the same for each neuron,
and are what we have defined as "cloning templates" that define the connectivity of each
neuron with its neighbourhood:

N_r(i,j) = \{ C(k,l) : \max\{|k-i|, |l-j|\} \le r,\ 1 \le k \le m;\ 1 \le l \le n \}    (2)

y(t) is the output of each neuron and is defined by the equation:

y_{ij}(t) = \frac{1}{2}\left( |v_{ij}(t) + 1| - |v_{ij}(t) - 1| \right)    (3)

which represents a piecewise linear function (Fig. 1). The output function may also be a
radial basis function or a sigmoid one [6].


Figure 1. Piecewise linear function.



As the model was designed for an electronic implementation, the various
network parameters have a corresponding physical meaning. So, C is an input capacitor, R an
input resistance, and I a bias current that acts as a threshold for the neural activity. They
are the same for every cell. Nevertheless, as no VLSI implementation will be carried
out here, they can be regarded as simple network parameters.
Notice that as A(i,j;k,l) represents the connection of every cell with those
surrounding it and B(i,j;k,l) the connections with pixels in their neighbourhood, the two
together define the network behaviour, since they establish how the local processing of the
image (B(i,j;k,l)) and the interaction between neighbouring neurons (A(i,j;k,l)) are
performed. In this way it is useful to refer to A(i,j;k,l) as the feedback operator and B(i,j;k,l)
as the control operator.
As the network will be devoted to image processing, it is convenient to approximate
equation (1) by a difference equation of the form [1][2]:

v_{ij}(t+1) = v_{ij}(t) + \frac{h}{C}\left[ -\frac{1}{R} v_{ij}(t) + \sum_{C(k,l) \in N_r(i,j)} A(i,j;k,l)\, y_{kl}(t) + \sum_{C(k,l) \in N_r(i,j)} B(i,j;k,l)\, u_{kl}(t) + I \right]    (4)

where h is a constant that defines the time step of the simulation.
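As an illustration, one Euler step of equation (4) can be written in a few lines of Python; the use of a 2-D convolution to implement the template sums assumes symmetric cloning templates (for non-symmetric templates a correlation would be needed), and all numerical values are placeholders of our own:

import numpy as np
from scipy.signal import convolve2d

def f_out(v):
    # Piecewise linear CNN output of equation (3)
    return 0.5 * (np.abs(v + 1.0) - np.abs(v - 1.0))

def cnn_step(v, u, A, B, I=0.0, h=0.1, C=1.0, R=1.0):
    """One discrete-time step of equation (4) over the whole cell grid."""
    y = f_out(v)
    feedback = convolve2d(y, A, mode='same')  # sum of A(i,j;k,l) y_kl
    control = convolve2d(u, B, mode='same')   # sum of B(i,j;k,l) u_kl
    return v + (h / C) * (-v / R + feedback + control + I)

Iterating cnn_step until v no longer changes yields the steady-state network activity.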

2.- Neural Transfer Function

As stated previously, the piecewise linear function (3) is used in the CNN as
the cell output, so its value will lie within the interval [-1,1].


Figure 2. (a) Adaptive piecewise linear function. (b) Expansion of the grey interval

However, as a pixel in an image has a value between 0 and 1, it is necessary to
convert the cell output into this interval, although the simulation is performed between
the values -1 and 1. Moreover, the inverse transformation must be carried out when an
image is taken as a network input. To avoid these conversions, a new function that only
takes values inside the interval [0,1] is proposed. Furthermore, to obtain a more general
expression, the saturation limits may be changed, defining a minimum C and a
maximum D (Fig. 2 (a)). This modification endows the model with more flexibility,

allowing the definition of different gains that can modify the relation between the neural
activity and the neuron output. The equation so obtained has the form:

y_{ij} = \frac{1}{2}\,\frac{|v_{ij} - C| - |v_{ij} - D|}{|D - C|} + \frac{1}{2}    (5)

This definition of the output function allows it to adapt to the features of the task
to be performed. So, if C = D, a binary response will be obtained. At the other extreme, if C = 0
and D = 1, the neuron activity is provided directly as the neuron output. Between these extreme
definitions a great variety of possibilities appears. Moreover, fixing the difference
between C and D, which implies a fixed slope, their values may vary between 0 and 1 so
that the neural response is adapted to the cell input, avoiding its saturation. So, defining
the values of the elements of A(i,j;k,l) and B(i,j;k,l) so that the cell activity is always
inside the interval [0,1], the neural response may be controlled with C and D and the
maximum possible resolution in the neural output can be obtained. The definition of
these parameters may be carried out before running the simulation, adapting the output
function to the specific task determined by A(i,j;k,l) and B(i,j;k,l).
However, there is another possibility: to look for an automatic adaptation of
these parameters to the input neighbourhood of each cell, so that, instead of being defined
as fixed values, they may vary according to the environment in which every
neuron is placed. Thus, while A(i,j;k,l) and B(i,j;k,l) determine the task the network will
perform and are defined before running the simulation, C and D may be adapted to
the input neighbourhood during the simulation to obtain an optimal response of the
network.
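For reference, equation (5) can be implemented directly; the sketch below is our own illustration, assuming C < D, and also shows an equivalent clipping form:

import numpy as np

def adaptive_output(v, C, D):
    """Output function of equation (5): 0 below C, 1 above D, linear in
    between; C and D may be scalars or per-cell arrays (with C < D)."""
    return 0.5 * (np.abs(v - C) - np.abs(v - D)) / np.abs(D - C) + 0.5

def adaptive_output_clipped(v, C, D):
    # Numerically equivalent form: clip the linear ramp to [0, 1]
    return np.clip((v - C) / (D - C), 0.0, 1.0)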

3.- Adaptation of t h e Neural Output Function

As the adaptation of the output function must be performed by varying the
values of C and D while running the simulation of the neural network, it is necessary to
obtain them with a network different from the one performing the image processing. In
addition, this structure should be a CNN to preserve the homogeneity of the whole
system. So, the multilayer structure proposed in the original model [1] may be used,
defining two hidden layers to provide the desired values for those parameters.
The calculation of C and D as elements that adapt to the features of the input
image needs to be carried out through the definition of larger synaptic masks than those
commonly used in CNNs, because as they are assumed to represent a sort of "mean
values" of the pixels surrounding each cell, they must be obtained statistically, and then
the number of pixels considered must be large enough to obtain meaningful values.
The local adaptation of the output function may thus provide an
image enhancement, since small differences in the grey level in a certain zone of the
image may be increased by giving C and D the appropriate values to obtain a relation
between inputs and outputs like that represented in Fig. 2 (b), where a small grey interval
is expanded into a bigger one. So differences in the first one are augmented when
transformed into the second. Therefore, taking C and D as the extreme grey
values in a neuron neighbourhood, the output function will expand the original grey
interval into another ranging from 0 to 1, so that the contrast of that zone of the image
will be increased, allowing details that are not clear enough to be easily detected

now. In this way, a generalized contrast enhancement of the image is obtained, as every
neuron performs the same transformation.
Intuitively, the easiest way to obtain C and D is to define them as the minimum
and the maximum values in the input neighbourhood of each neuron. However, this
definition would need the use of two new relations between the inputs: the maximum
and the minimum of a set of values. Some neural models have been defined where the
relations between inputs are different from their weighted sum, usually the sum of their
weighted products [7] or logic functions [8]. Therefore, the use of the maximum and
minimum functions to obtain C and D would be feasible. However, their use is not
the best solution, since these functions are very sensitive to noise:
as noise usually appears as an extreme value (black or white), it will be taken
as the maximum or minimum in the neighbourhood, defining a larger grey interval for
the input, and therefore degrading the model performance.


Figure 3. Plot of function (6) with α = 1. The continuous line is obtained for ε = 0.
The dashed line plots the function for ε = 0.2.

To avoid the influence of noise, we propose the use of an input function that is a
particular case of the extended absolute value proposed in [9]:

\text{abs}_\varepsilon(x) = (x^2 + \varepsilon^2)^{\alpha/2}, \quad \alpha \in [1, \infty), \ \varepsilon > 0    (6)

This expression represents a generalization of the input function in which the
weighted sum appears as a particular case, with the multiplication of each input by a
synaptic constant avoided by the use of an appropriate modification of function (6). In
this way, an important saving in area consumption is obtained when it is implemented in
VLSI. However, we are concerned here with the definition of the absolute value function
from (6). For this purpose, if we select α = 1, an approximation to it is obtained when
ε → 0 (Fig. 3). So, taking ε = 0 and considering the difference between two variables as
the argument of the function, the absolute value of their distance is obtained:

\text{abs}_0(m - x) = ((m - x)^2)^{1/2} = |m - x|    (7)

This function provides a new type of input relation that allows us to obtain
statistically the values of the parameters C and D for the output function of every cell, so that
the influence of noise is highly reduced.
They may be obtained in the following way. The two hidden layers have a neural
activity function where A(i,j;k,l) and I are assumed to be zero, while B(i,j;k,l) will define

the layer behaviour. The output function is (5) with C = 0 and D = 1. The first hidden
layer, which we will call L_C, will provide the mean value of the neighbourhood of every
pixel if we assume that every element in B(i,j;k,l) has a value of 1/r² when B(i,j;k,l) has
dimension r×r. So the output of every cell will have the form:

m_{ij} = \sum_{C(k,l) \in N_r(i,j)} \frac{u_{kl}}{r^2}    (8)

This value is taken as the central point of the [C_{ij}, D_{ij}] interval. The second
hidden layer, named L_D, will calculate the mean deviation from m_{ij} in the same
neighbourhood. So, using expression (7), we have:

d_{ij} = \sum_{C(k,l) \in N_r(i,j)} \frac{|u_{kl} - m_{ij}|}{r^2}    (9)

Now, the values of C_{ij} and D_{ij} can be obtained as:

C_{ij} = m_{ij} - d_{ij}, \qquad D_{ij} = m_{ij} + d_{ij}    (10)

However, as the average value of the deviation is relatively small, the output
function slope could be too high, and some grey levels could fall into the saturation
limits. To avoid this effect, d_{ij} is multiplied by δ, a constant greater than unity.
Therefore, equations (10) take the form:

C_{ij} = m_{ij} - \delta d_{ij}, \qquad D_{ij} = m_{ij} + \delta d_{ij}    (11)

and the size of the intervals [C_{ij}, D_{ij}] is widened. The value of δ must be fitted by
simulation.
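A compact illustration of equations (8), (9) and (11) in Python follows; using a uniform filter for the deviation map is only a close approximation of (9) (it averages |u_kl − m_kl| rather than |u_kl − m_ij|), and the values of r and δ are placeholders of our own:

import numpy as np
from scipy.ndimage import uniform_filter

def saturation_limits(u, r=7, delta=2.0):
    """Local mean (8), local mean absolute deviation (close to (9)),
    and the widened saturation limits of equation (11)."""
    m = uniform_filter(u, size=r)              # layer L_C, equation (8)
    d = uniform_filter(np.abs(u - m), size=r)  # layer L_D, approximates (9)
    return m - delta * d, m + delta * d        # C and D maps, equation (11)

Combined with the adaptive_output sketch of section 2, C, D = saturation_limits(u) followed by adaptive_output(u, C, D) reproduces the behaviour of the whole three-layer structure on an image u with values in [0,1].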

Figure 4. Multilayer structure.

These values now represent the two saturation limits that define the output function of
every neuron in the processing layer, and they are provided as control signals. The neural
activity function is represented by equation (4), where A(i,j;k,l) and I are zero and
B(i,j;k,l) is 1×1 with its single element equal to 1.
So, the system structure is as follows (Fig. 4):

• Layers L_C and L_D: they process a neighbourhood of the input image.

• Processing layer: it receives the outputs from L_C and L_D as neural output function
control parameters. Each neuron connects with only one pixel in the input image.

Therefore, we may say that the network performs the processing of each pixel in
the input image, assigning it a new value that will depend on its neighbourhood. The
new values form the output image. As connections in the three layers are purely
feedforward, the system stability is guaranteed.
Since the described structure will increase the contrast of the image, it can also
make excessively noticeable details that were already clear enough by themselves. In order to
compensate for this effect, the processing layer can be provided with a smoothing ability
that diminishes it, without damaging the overall capability of the system.
This effect can be obtained if B(i,j;k,l) is defined as a mean filter analogous to that used
in layer L_C but with a smaller size. A 3×3 dimension will be enough.
An added problem that may appear is the fact that this structure can also produce
details that do not really exist. In areas with small variations in their grey levels, these
variations may have no special meaning. They could be just small faults or noise present
in the image, but the net will amplify them. So, the network performance may be
degraded by noise amplification or by new noise generation. This unwanted effect can
be even more harmful when the grey level in an area of the image does not
vary, providing a near-zero mean deviation. Then C_{ij} and D_{ij} will be close together and
the corresponding slope will be too high. Therefore, small differences in the grey levels
will be stretched to the maximum range (black and white), presenting as important
details things that are nothing but noise.
The problem may be solved by imposing a lower bound on the mean
deviation value computed by layer L_D, while modifying as little as possible the
values obtained when there is a meaningful variation in the grey scale. This effect may
be obtained from (6) assuming α = 1 and ε > 0:

\text{abs}_\varepsilon(u_{kl} - m_{ij}) = ((u_{kl} - m_{ij})^2 + \varepsilon^2)^{1/2}, \quad \varepsilon > 0    (12)

As can be seen in Fig. 3, the minimum value of the deviation is bounded, so
that areas with a uniform grey level will have a mean deviation very close to ε, while in
others where the contrast is high the deviation is also high, and the response of the
function is very close to that of the actual absolute value (7) if ε is small enough.
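In code, the bounded deviation simply replaces the plain absolute value in the sketch above; the value ε = 0.1 is the one used later in the simulations:

import numpy as np

def abs_eps(x, eps=0.1):
    # Equation (12): smooth absolute value bounded below by eps
    return np.sqrt(x * x + eps * eps)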

4.- Simulation

We present three different pictures in Figures 5, 6 and 7 to test the behaviour of the
proposed model. In order to reduce the computation time we use a simplified

expression for (4). As the model has no feedback, A(i,j;k,l) = 0, the output only depends
on the fixed values of the input image, and the neural activity may be obtained in one
time step. So, assuming R = C = h = 1 and I = 0 in (4), the neural activity function is now:

v_{ij} = \sum_{C(k,l) \in N_r(i,j)} B(i,j;k,l)\, u_{kl}    (13)

The neural output is provided by (5), and the parameters controlling it are obtained as
described above with δ = 2.
The original images are presented in (a). They are processed in (b), with formula
(10) defining the input interval of the output function. A general contrast enhancement
appears, and details that were hardly perceptible are now clearly visible. Nevertheless,
many extreme values (black or white) also appear because, as the interval [C_{ij}, D_{ij}] is
statistically obtained, some of the pixel values inside each neighbourhood may fall
outside this interval, saturating the output function. To compensate for this effect, a 3×3
mean filter is added in (c) to obtain a smoothing of the images. As we can see, many
extreme values have disappeared, but a blurred image is also obtained. On the other hand,
we can also see in Fig. 5 (b) and Fig. 7 (b) that those areas with a uniform grey level in
the original image present details that actually do not exist. They are produced by the
amplification of small differences in the pixel values caused by a very high value of
the slope of the output function, as previously mentioned. This effect appears in the
girl's cheek and in her hat in Fig. 5 (c), and at the bottom of Fig. 7 (c). To avoid it,
function (11), with ε = 0.1, is used instead of (10) to obtain C_{ij} and D_{ij}. We can see in
Fig. 5 (d) and Fig. 7 (d) that these artifacts have been removed, while the presence of extreme
values has also decreased in the three figures. So a contrast enhancement has
been obtained only in those areas where it was necessary.

5.- Conclusions

We have proposed an adaptive model of the CNN output function that provides a
contrast enhancement capability in those areas of an image where details are not clear
enough. This result was obtained only with the use of the adaptive output function. A
small lowpass filter was used to provide a slight softening to compensate for the
excessive contrast obtained in some zones of the output image. The use of the
adaptive function with different types of filters will thus probably improve their
performance. It could be interesting to study the effect of endowing neural networks
devoted to image processing with an adaptive output function in order to
increase their performance and flexibility.

Figure 5.- (a) Original image; (b) result with limits (10); (c) result with the 3×3 mean filter added; (d) result with the bounded deviation (ε = 0.1).

Figure 6.- Panels (a)-(d): same processing sequence for the second test image.

Figure 7.- Panels (a)-(d): same processing sequence for the third test image.

References

[1] L. O. Chua, L. Yang. "Cellular Neural Networks: Theory". IEEE Trans. on Circuits
and Systems. Vol. 35, No. 10. October 1988. pp. 1257-1272.
[2] L. O. Chua, L. Yang. "Cellular Neural Networks: Applications". IEEE Trans. on
Circuits and Systems. Vol. 35, No. 10. October 1988. pp. 1273-1290.
[3] T. Matsumoto, L. O. Chua, R. Furukawa. "CNN Cloning Template: Hole-Filler".
IEEE Trans. on Circuits and Systems. Vol. 37, No. 5. May 1990. pp. 635-638.
[4] T. Matsumoto, L. O. Chua, T. Yokohama. "Image Thinning with a Cellular Neural
Network". IEEE Trans. on Circuits and Systems. Vol. 37, No. 5. May 1990. pp. 638-640.
[5] B. E. Shi, T. Roska, L. O. Chua. "Design of Linear Cellular Neural Network for
Motion Sensitive Filtering". IEEE Trans. on Circuits and Systems II: Analog and
Digital Signal Processing. Vol. 40, No. 5. May 1993. pp. 320-331.
[6] L. O. Chua, T. Roska. "The CNN Paradigm". IEEE Trans. on Circuits and Systems
I: Fundamental Theory and Applications. Vol. 40, No. 3. March 1993. pp. 147-156.
[7] E. B. Kosmatopoulos, M. M. Polycarpou, M. A. Christodoulou, P. A. Ioannou.
"High-Order Neural Network Structures for Identification of Dynamical Systems". IEEE
Transactions on Neural Networks, Vol. 6, No. 2, pp. 422-431.
[8] F. J. López Aligué, M. I. Acevedo Sotoca, M. A. Jaramillo Morán. "A High Order
Neural Model". Lecture Notes in Computer Science, No. 686, "New Trends in Neural
Computation". pp. 108-113, Springer-Verlag, Berlin. June 1993.
[9] R. Dogaru, K. R. Crounse, L. O. Chua. "An Extended Class of Synaptic Operators
with Applications for Efficient VLSI Implementation of Cellular Neural Networks".
IEEE Transactions on Circuits and Systems, Vol. 45, No. 7, July 1998, pp. 745-755.
Application of ANN Techniques to Automated
Identification of Bovine Livestock

Horacio M. González Velasco, F. Javier López Aligué, Carlos J. García Orellana,

Miguel Macías Macías, M. Isabel Acevedo Sotoca

Departamento de Electrónica e Ingeniería Electromecánica

Universidad de Extremadura
Av. de Elvas, s/n. 06071 Badajoz - SPAIN
horacio@nemet.unex.es, aligue@unex.es, carlos@nemet.unex.es

Abstract: In this work a classification system is presented that, taking lateral
images of cattle as inputs, is able to identify the animals and classify them by
breed into previously learnt classes. The system consists of two fundamental
parts. In the first one, a deformable-model-based preprocessing of the image is
carried out, in which the contour of the animal in the photograph is sought, extracted,
and normalized. Next, a neural classifier is presented that, supplemented with a
decision-maker at its output, performs the distribution into classes. In the last part,
the results obtained in a real application of this methodology are presented.

1. Introduction

For the control and conservation of the purity of certain breeds of bovine livestock,
one of the fundamental tasks is the morphological evaluation of the animals. This
process consists of scoring a series of very well defined characteristics [10, 11] in the
morphology of the animal, such as head or back and loins, and forming a final score
from a weighted sum of these partial scores. Evidently the process should be carried
out by people with great experience in this task, so that the number of qualified
people is very small. This, together with the high degree of subjectivity involved in
the whole process, leads one to think of the utility of a semiautomatic system of
morphological evaluation based on images of the animal.
In the publications on the topic it is suggested that most of the morphological
information about the animals involved in the process can be obtained by analysing their
different profiles. In the present work we try to corroborate this statement by means
of the study of similarities between the profiles of different images taken of the same
animal, and the similarities between the profiles of animals of the same breed, as well
as the degree of difference between animals of different breeds. To carry out this
study we developed an image-based classifier with a conventional structure [5] that
takes lateral images of cows as inputs (i.e. in profile) and that, after a first processing stage
for the extraction and normalization of contours, processes them with a neural classifier
which associates the image with one of the animals that it has previously learned, and
also relates it to one of the breeds that are the object of the study. That is, we will be
describing a classifier that identifies the individual animal as well as performing the
classification by breed, simply using the information contained in the profile.

In section 2 we will describe the classification system in detail, considering
separately input image processing and contour classification. In section 3 we will
describe the trials we have made as well as the results that have been obtained. Lastly,
in section 4 the overall conclusions of the work will be presented and discussed.

2. General description: material and method

As noted above, our classification system consists of two clearly differentiated parts.
In the first, using a lateral image of a cow, we extract its contour and represent it in an
appropriate way for use as input to the neural classifier. In this phase we have mainly
used deformable model techniques [6], in particular those known as active shape
models [1, 2, 3], combined with strategies for within-image searching.
For the neural classifier we have used a type of network known as SIMIL [7, 8, 9],
a model which has been developed in our laboratory and that we have already applied
with success to other classification tasks. In the sections that follow we describe these
two parts in detail.

2.1. Preprocessing system


As we remarked above, the material we are using consists of digital colour images,
where there is always a cow in transverse position, fairly well centred and occupying
most of the image (fig. 1,a). These photographs were taken directly in the field by
means of a digital camera, so that neither the lighting conditions nor the backgrounds
that appear in them were controllable.

2.1.1. Shape modelling


The search for the cow's contour in these pictures can be considered as a
segmentation problem in which one knows a priori, in an approximate way, the shape
of the fundamental region that one is looking for, i.e. the cow. In order to use this
information in the process of searching for the contour as well as in the later
classification, we have used an approach based on point distribution models (PDM)
[1, 2]. These consist of deformable models that represent the contours by means of
ordered groups of points located at specific positions (fig. 1,b), which are constructed
statistically, based on a set of examples. Using principal component analysis (PCA),
we can describe the main modes of shape variation observed in the training set by a
small number of parameters. We will thus have modelled the average shape of our
object, as well as the allowed deviations from this average shape, based on the set
used in the training process. Through an appropriate choice of the elements used for
the training, this technique has the advantage of not needing any heuristic
assumption about which contours have an acceptable shape and which do not.
Another fundamental advantage of the PDM technique is to be found in
its construction mechanism. In order to make an efficient statistical analysis of the
positions of the points in the training contours, they should be transformed (by
translations, rotations and scaling, preserving the shape) to a normalized space in

which the alignment of equivalent points is the most appropriate. Each contour will
then be represented mathematically by a vector x such that

x = \bar{x} + P b    (1)

where \bar{x} is the average shape, P is the matrix of eigenvectors of the covariance
matrix, and b is a vector containing the weights for each eigenvector, and is what
properly defines the contour in our description. Considering only a few eigenvectors
corresponding to the largest eigenvalues of the covariance matrix, we will be able to
describe practically all the variations that take place in the training set.


Fig. 1. In figure (a) a representative example is shown of the input images to our classifier. The
model of the cow contour, formed by 73 points, the most representative of which are numbered,
is plotted in figure (b).

In our specific case we have used a description of the cow's contour, not including
the limbs, that consists of 73 points (fig. 1,b), and the model was constructed using a
training set of 20 photographs, distributed evenly over the considered breeds, and
in which the animals are in similar poses, since our purpose is to study the variations due
to different animals and not those due to the pose. Once the principal component
analysis was made, we only needed to use 12 eigenvalues to represent practically the
entirety of the possible variations. Hence, each normalized contour is perfectly
defined by a vector b of 12 components, to which we have to add a translation t, a
rotation θ and a scale s to transform it to the space of the image.
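As an illustration of the construction just described, the following Python sketch builds a PDM from aligned training contours and reconstructs a shape from equation (1); the array layout and n_modes = 12 follow the description above, while the function names are our own:

import numpy as np

def build_pdm(shapes, n_modes=12):
    """shapes: (N, 2k) array, each row the aligned landmark coordinates
    (x1, y1, ..., xk, yk) of one training contour."""
    x_mean = shapes.mean(axis=0)
    cov = np.cov(shapes, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_modes]  # keep the largest modes
    return x_mean, evecs[:, order]             # mean shape and matrix P of (1)

def shape_from_b(x_mean, P, b):
    # Equation (1): x = x_mean + P b
    return x_mean + P @ b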

2.1.2. Search within the image


Having defined the model of the shape we want to find in the image, the search
process consists of adjusting one of those shapes to the profile of the cow that appears
in the image, starting with an initial contour b (duly transformed to the space of the
image). The technique used to perform this search is known as an active shape model
[1, 3], and is an iterative process with a series of steps. For each contour point,
what we shall call the best displacement is calculated, so that we get a good fit
somewhere in the image. These displacements are transformed into variations of the
components of vector b, on which we impose restrictions to ensure
that our shape does not differ excessively from those of the training set. This

process is repeated in an iterative manner until convergence is reached in some area
of the image.
In the previous description we must emphasize three fundamental points that
require a detailed description, since they are very important in the process. These are
the calculation of the best displacements for the points of the object, the initial shape
we use to start the process, and the criteria used to determine convergence. These
aspects are fundamental because they are decisive in determining the evolution of
the contour. We will describe our specific approach to these problems, as well as
some examples of the results obtained in the case we are dealing with.
• The normal method [1] used to calculate the best displacement corresponding to a
point of the object consists of looking for the strongest edge located on the
perpendicular to the contour at that particular point, considering a not too large
search window. Evidently, we have to preprocess the original image to extract the
edges, generating an accessory image that we call the potential. In our present case the
approach taken was similar, with some variations that have improved the
performance and the results.
To extract the potential, we have tried to take full advantage of the fact that we are
dealing with colour images. In order to use that information, instead of applying a
conventional edge detection method on the luminance coordinate, we worked with
the three colour components, in the system proposed by the CIE, L*a*b*. This
colour space has the advantage of being perceptually uniform, i.e. a small
perturbation of a component value is approximately equally perceptible across the
whole range of that value. As our goal is to base our edges on colour differences
between areas, the perceptual uniformity allows us to treat the three components of
colour linearly, i.e., to extract edges in the images corresponding to the colour
coordinates by a conventional method (Sobel, Canny, ...) and subsequently to put
them together using a linear combination (see the sketch after this list). Figure 2 shows a comparison of the
results using only the luminance and using the method we propose.
Once the potential has been extracted, we must approach the calculation of the best
displacements for the points of the model. In our case we used the method
described above, but using the whole image, and considering not only the intensity
of the edge but also the distance between that edge and the point.
• Another important aspect in the process is the initial contour. Due to the
characteristics of the photographs, although the position of the cow can be
considered "quite predictable", it is still complicated to locate it in detail. In
order to cover as great a field as possible we considered various initial positions
(always using b = 0, i.e. the contour \bar{x}), which we use successively if we cannot
reach convergence with the previous one.
• The last point we have to explain is the method used to determine whether
convergence has been reached. To do so it is not enough to analyze
the parameters at a specific moment; one has to evaluate their evolution over a
certain number of iterations. We mainly used the average of the best displacements
calculated for each point. Observing their mean over the last T
iterations, we can determine whether the value is low enough. Observing
their variance, we can decide whether the contour has stabilized. Then, when
both are small, convergence is assured.
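The colour-based potential described in the first point above can be sketched as follows; the Sobel operator, the equal channel weights, and the use of scikit-image for the RGB to L*a*b* conversion are illustrative choices on our part, not the authors' implementation:

import numpy as np
from scipy import ndimage
from skimage import color

def colour_potential(rgb, weights=(1.0, 1.0, 1.0)):
    """Edge potential combining the Sobel gradient magnitudes of the
    three CIE L*a*b* components through a linear combination."""
    lab = color.rgb2lab(rgb)
    pot = np.zeros(lab.shape[:2])
    for channel, w in enumerate(weights):
        gx = ndimage.sobel(lab[..., channel], axis=1)
        gy = ndimage.sobel(lab[..., channel], axis=0)
        pot += w * np.hypot(gx, gy)
    return pot / pot.max()  # normalized edge strength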


Fig. 2. Representation of the final edge determined from the colour information, compared with
the edge of the coordinate L*, for which only the luminance has been used.

To apply this set of techniques, a series of computer programs has been developed
which allows us to automate all the tasks without user intervention. The photographs
are taken in the field, and then read into a database which is consulted by the
programs that have to process them. All these computer applications were
developed in C, except the one dedicated to the PDM and ASM, which uses OCTAVE
due to the great amount of matrix calculation involved.

2.2. Neural classifier

Once an image has been preprocessed, we have the information concerning its
contour as a set of 12 parameters b_i that form the vector b. However, this space does
not seem well suited for the classification process, because various vectors b can exist
that give very similar contours but are quite far from each other. Accordingly, as our
objective is to classify the contours, it seems more appropriate to use the contours themselves as inputs
to the classifier. To do so, a bitmap representing the contour can be
generated making use of the vector b and a transformation that must be the same for
all the cases, so that the contours are comparable independently of the position or the
size of the cow in the original image.
To classify this kind of input we used a type of neural network known as SIMIL
[7, 8, 9], which has given very good results in classification problems similar to
the present one [9]. This network was conceived from its origin to be integrated into a
classification system. In the learning process it uses a series of prototypes of the
classes into which the group of inputs will be classified. This learning process is based
on the direct assignation of prototype values to the weights of the neurons, which makes
it fast and effective. Also, a neural function that detects similarities between its inputs
and the information in its weights is used, based on ideas similar to those of
Kohonen's self-organizing maps. All this is integrated into a feedforward network,
which permits high performance in classification problems.
As output of the network for each input we obtain the membership rates d_p for each
of the p classes that the network learned. To produce a final result we introduced
another element into the process, which we have called the decision-maker [4, 7],

and whose purpose consists in, given the membership rates, to indicate either the class
whose membership rate is the largest, or a state of indecision. This decision-maker is
based on two rules:
9 For the class with the largest membership rate to be the final result, this should
surpass a minimum value. If we define
d =max{D} where O = { d , ..... dp} (2)
one must have that d > Vm, where Vmis one of the decision-maker parameters.
9 Also, we should require the network to be able to select "sufficiently" one of the
classes, i.e. that the difference between the largest of the membership rates and the
second is sufficiently large. To quantify this, if we write

d, = max{ O - max{ O } } (3)


we can define the separation index as

s = 1 - d_s / d_m   (4)

which rises as the distance between d_m and d_s increases. Hence, class m must satisfy
s > V_s to be the output of the decision-maker, where V_s is the other parameter that
defines this processing block.
It should be noticed that d_m and s can take completely different values, i.e., we can
find cases of maximum separation with very small values of d_m and, vice versa, cases
of minimum separation with high values of d_m. For this reason, having established the
classification function, we were interested in defining a parameter that measures the
quality of the classification for the cases in which classification is possible. That
parameter is called the classification index and we define it as:
parameter is called the classification index and we define it as:

I = (d_m·s - V_m·V_s) / (1 - V_m·V_s)   (5)

As we can see, in the worst classification case, d_m = V_m and s = V_s, we will have
I = 0. In the rest of the cases, variations of the d_m and s values have the same
importance.
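For concreteness, the two rules and the classification index can be written out as a short program. The following is a minimal sketch in Python (chosen purely for illustration; the authors' applications were written in C), using eqs. (2)-(5) as reconstructed above. The function name is ours, and its defaults are the V_m and V_s values adopted later in the text:

def decide(rates, Vm=0.1, Vs=0.05):
    """Decision-maker sketch for membership rates d_1..d_p (p >= 2).
    Returns (class_index, I) when a class is selected, or (None, None)
    for a state of indecision."""
    d_m = max(rates)                                     # largest rate
    k = rates.index(d_m)
    d_s = max(r for i, r in enumerate(rates) if i != k)  # second largest, eq. (3)
    s = 1.0 - d_s / d_m                                  # separation index, eq. (4)
    if d_m > Vm and s > Vs:                              # both rules satisfied
        I = (d_m * s - Vm * Vs) / (1.0 - Vm * Vs)        # classification index, eq. (5)
        return k, I
    return None, None                                    # indecision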
Regarding the application of the entire technique to our problem, we must
comment on the following points:
• As inputs to the neural network we used bitmaps of 400x300 pixels, generated with
the contours obtained in preprocessing. We used three contours with a thickness of
5 pixels, centred on the same point and with different sizes (see fig. 3b), to
minimize the number of neurons that do not receive information.
• The SIMIL network we used is composed of a single processing layer with
400x300 neurons and a random feedforward connection scheme, with 400 inputs
per neuron in a neighbourhood of radius 200. As output function of the neurons we
used a sigmoid with parameters 1, 0.4 and 0.1. To simulate this network we used a
parallel system with 6 processors (3 Pentium 200, 2 Pentium 233 and one Pentium
Pro), running the large neural network simulator NeuSim [4], which has been
developed in our laboratory. With this system we obtained recognition times of
approximately 3 to 4 seconds per image.


Fig. 3. One of the photographs used, with its snake fitting the contour, is shown in figure (a),
and next to it we see the real input to our neural network (b). We have inverted the image to
facilitate the presentation.

• With respect to the decision-maker, taking into account the trial simulations we
had made, it seemed reasonable to require, in order to establish a definitive
classification, that a minimum of 10% of the neurons associated with the
corresponding prototype be activated (V_m = 0.1), and also that a minimum separation
of 5% exist between the activated neurons of the chosen prototype and the second
(i.e. that d_s be smaller than 95% of d_m), i.e. V_s = 0.05.

3. Results

The system described in the previous section was tested with a total of 95 pictures
corresponding to 45 different animals, distributed among the 5 breeds considered in
the present study (Retinta, Blanca Cacereña, Morucha, Limusín and Avileña). Once
the photographs had been preprocessed and the input bitmaps for the classifier
obtained, we ran the neural network learning process, using one contour for each of
the animals. After learning, we had the network recognize the 95 pictures, classifying
them into the 46 corresponding classes (considering indecision as a separate class).
For the classification into breeds, these were considered to be superclasses formed
by those classes corresponding to animals of one specific breed. Hence, the animal
obtained as a result of the classification also determines the breed.
Given the number of images processed, it is impossible to describe all the results
obtained in the classification. In table 1 an overall summary of those results is shown,
corresponding to the classification into animals and into breeds. Also, table 2 presents
an example of results for a specific animal of which four photographs (fig. 4) were
used. In that table separation and classification indexes are presented, as well as the
two largest membership rates, for both classification processes.

Table 1. Summary of the final results, showing successes, mistakes and indecisions, of our
identification and classification system.

                        Successes    Mistakes    Indecisions
Class. by Animals       75.79%       11.58%      12.63%
Class. by Breeds        91.58%        1.05%       7.37%

The data in table 2 are quite representative of the cases that may occur. As one can
see, in the classification into animals there is an erroneous assignment (with quite a
low classification index), in which the system has related the input image with
another animal of the same breed. There was also a case of indecision due to the low
separation index between the first and the second membership rates, although, as one
can see, the classification would have been correct. One should also notice that,
although the classification into animals for these images is not very good, all are
correctly assigned to breeds.

Table 2. Classification results, by animals and by breeds, for the 4 photographs of cow dx501.
The separation and classification indices are shown.

                      Photograph   s     I     Classification
Class. by Animals     dx501_1      0.18  0.09  WRONG [1º dx303 (0.333); 2º 01416 (0.273)]
                      dx501_2      0.06  0.18  RIGHT [1º dx501 (0.440); 2º gb516 (0.411)]
                      dx501_3      0.01  0.40  IND.  [1º dx501 (0.638); 2º dx201 (0.630)]
                      dx501_4      0.44  0.55  RIGHT [1º dx501 (0.994); 2º 014005 (0.560)]
Class. by Breeds      dx501_1      0.18  0.09  RIGHT [1º Blanca (0.333); 2º Avileña (0.273)]
                      dx501_2      0.06  0.18  RIGHT [1º Blanca (0.440); 2º Retinta (0.411)]
                      dx501_3      0.20  0.32  RIGHT [1º Blanca (0.638); 2º Avileña (0.511)]
                      dx501_4      0.44  0.55  RIGHT [1º Blanca (0.994); 2º Avileña (0.560)]

4. Conclusions and future work

In light of the results described in the previous section, we can state that the
classification results are excellent, especially in the case of the classification into
breeds, where indecisions are minimal and mistakes are very few. It must be
emphasized that, when the system makes a mistake in the classification into animals,
the wrong choice is usually an animal of the same breed. This supports one of our
premises: that a great part of the morphological characteristics of a breed is
reflected in the contour.


Fig. 4. The 4 photographs of cow dx501 with the fitted snake are presented, corresponding to
the data in table 2.

To be able to perform the classification based on such different images (referring
preprocessing, especially in normalization. However, many of the mistakes and
indecisions are due to this part of the system, because in some situations the fit of the
deformable model is not precise enough. There also exists the problem of variations
in position. Although small, these usually exist, and require either a second
normalization process by areas, or the use of several prototypes per animal.
In sum, the very good results obtained indicate the feasibility of approaching the
process of morphological evaluation using a similar method, on several photographs
for each animal.

Acknowledgements
This work has been supported in part by the Junta de Extremadura (project
PR19606D007, and the doctoral scholarship of D. Horacio M. González Velasco) and
the CICYT (project TIC 97-0268).
We also wish to express our gratitude to the personnel of the Centro de Selección y
Reproducción Animal (CENSYRA) for the technical help with everything related to
cattle, and for aiding us in the process of taking photographs.

References

1. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models: Their Training and
Application. Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59. Jan. 1995.
2. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Training Models of Shape from Sets of
Examples. Proc. British Machine Vision Conference, pp. 9-18. 1992.
3. Cootes, T.F., Taylor, C.J.: Active Shape Models: Smart Snakes. Proc. British Machine
Vision Conference, pp. 266-275. 1992.
4. García, C.J.: Modelado y Simulación de Grandes Redes Neuronales. Doctoral Thesis,
Universidad de Extremadura. 1998.
5. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice Hall, 1989.
6. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. International
Journal of Computer Vision, vol. 1, no. 4, pp. 321-331. 1988.
7. López, F.J., González, H.M., García, C.J., Macías, M.: SIMIL: Modelo neuronal para
clasificación de patrones. Proc. Conferencia de la Asociación Española para la Inteligencia
Artificial, pp. 187-196. 1997.
8. López, F.J., Macías, M., Acevedo, I., González, H.M., García, C.J.: Red neuro-fuzzy para
clasificación de patrones. Proc. Congreso Español sobre Tecnologías y Lógica Fuzzy, pp.
225-232. 1998.
9. Macías, M.: Diseño y realización de un Neurocomputador Multi-CPU. Doctoral Thesis,
Universidad de Extremadura, 1997.
10. Reglamentación específica del Libro Genealógico y de Comprobación de Rendimientos de
la raza bovina Retinta. Boletín Oficial del Estado, España. 05/04/1977.
11. Sánchez-Belda, A.: Razas Bovinas Españolas. Manual Técnico, Ministerio de Agricultura,
Pesca y Alimentación, España. 1984.
An Investigation into Cellular Neural Networks
Internal Dynamics Applied to Image Processing

David Monnin 1,2, Lionel Merlat 1, Axel Köneke 1, and Jeanny Hérault 2

1 French-German Research Institute of Saint-Louis,


PO Box 34, 68301 Saint-Louis Cedex, France
monnin@lis.inpg.fr, monnin@ece.fr
2 LIS Laboratory-INPG,
46, avenue Viallet, 38031 Grenoble Cedex, France

Abstract. Interesting perspectives in image processing with cellular
neural networks can be emphasized from an investigation into the inter-
nal states dynamics of the model. Most of the cellular neural networks
design methods intend to control internal states dynamics in order to
get a straight processing result. The present one involves some kind of
internal states preprocessing so as to finally achieve processing other-
wise unrealizable. Applications of this principle to the building of com-
plex processing schemes, gray level preserving segmentation and selective
brightness variation are presented.

1 Introduction

Cellular Neural Networks (CNNs) [1] are lattices of analog locally connected
cells conceived for an implementation in VLSI technology and perfectly suitable
for analog image processing. The operation of a cell (i, j) is described by the
following dimensionless equations:

dx_{i,j}/dt = -x_{i,j} + A ⊗ y_{i,j} + B ⊗ u_{i,j} + I   (1)

y_{i,j}(t) = (1/2)(|x_{i,j} + 1| - |x_{i,j} - 1|)   (2)

where ⊗ denotes a two-dimensional discrete spatial convolution such that
A ⊗ y_{i,j} = Σ_{(k,l)∈N(i,j)} A_{k,l}·y_{i+k,j+l}, for k and l in the neighborhood N(i,j) of
cell (i,j), which is generally restricted to the 8-connected cells. A and B are the
so-called feedback and feedforward weighting matrices, and I is the cell bias.
u_{i,j}, x_{i,j} and y_{i,j} are the input, internal state and output of a cell, respectively.
The same set of parameters A, B and I, also called cloning template, is re-
peated periodically for each cell over the whole network, which implies a reduced
set of at most 19 control parameters, but nevertheless a large number of possible
processing operations [2]. It was shown that numerous traditional operators for
binary and gray level image processing, among which are all linear convolution
filters as well as morphological and Boolean operators, can be designed for un-
coupled CNNs, i.e. CNNs with no feedback interconnection [3]. In the case of

uncoupled CNNs and according to equations (1) and (2), the CNN dynamics
can be defined by a set of three differential equations, each valid over one of the
three domains of linearity of (2):

dx_{i,j}/dt = -x_{i,j} - a + B ⊗ u_{i,j} + I ,  for x_{i,j} ∈ ]-∞, -1]   (3a)

dx_{i,j}/dt = (a - 1)·x_{i,j} + B ⊗ u_{i,j} + I ,  for x_{i,j} ∈ [-1, 1]   (3b)

dx_{i,j}/dt = -x_{i,j} + a + B ⊗ u_{i,j} + I ,  for x_{i,j} ∈ [1, +∞[   (3c)

It is known from [3] that gray level output operators can be obtained when
a < 1, while only binary output operators are obtained when a > 1. In terms
of the dynamical internal behavior of a cell, a < 1 implies only one stable equilib-
rium point xq0_{i,j}, whereas a > 1 leads to two possible stable equilibrium points
xq-_{i,j} and xq+_{i,j}, respectively located in ]-∞, -1] and ]1, +∞[. The values of the
different possible equilibrium points are derived from (3a-c) when the derivative
is canceled and are given by:

xq-_{i,j} = B ⊗ u_{i,j} + I - a   (4a)

xq0_{i,j} = (B ⊗ u_{i,j} + I)/(1 - a)   (4b)

xq+_{i,j} = B ⊗ u_{i,j} + I + a   (4c)

The usual way of designing cloning templates consists in acting on the CNN
dynamics to set prescribed equilibrium points and thus get the expected process-
ing operator in a straight way. The byroad presented in this paper investigates
the internal states dynamics to prescribe particular state configurations which
are finally used to derive operations not realizable directly. After a short overview
of the design of CNNs for image processing, the processing of internal states
will be introduced, and applications to the composition of complex processing
schemes, gray level preserving segmentation and selective brightness variation
will be presented.

2 Background of Cellular Neural Networks Design


As the main results of this principle must be known for a better understanding of
the following sections, the current section gives a short overview of the method
presented in [3] for the design of uncoupled CNNs for binary and gray level image
processing.
Whether it is called convolution mask or structuring element, depending on
whether the processing intended is a convolution filter or a morphological op-
erator, the feedforward matrix B involved in a CNN image processing operator
is here chosen according to the values of an equivalent conventional digital im-
age processing filter, while parameters a and I are determined from the design
method according to the CNN dynamics.

As stated before, two primary categories of CNN image processing operators
can be defined depending on the value of the feedback parameter a: those
providing a gray level output and those providing a binary output.

2.1 Gray Level Output Image Processing Operators


A CNN can perform a linear convolution filtering using the feedforward matrix B
as a convolution mask, and can even simultaneously rescale the original range
[m, n] of the input image to a desired range [M, N]. This is made possible by
determining the parameters a and I as follows:

a = 1 - ||B||_1·(m - n)/(M - N) ,   I = ||B||_1·(N·m - M·n)/(M - N)   (5a-b)

where ||B||_1 is the sum of the absolute values of the matrix B coefficients.

Furthermore, a "reverse video" effect can be obtained by simply reversing
the sign of B and using a new original range which is symmetrical to the old one
with respect to the origin, which yields a new bias constant I' = I + n + m.
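As a worked illustration, the following minimal Python sketch (illustrative only; the function name is ours, and eq. (5b) is used as reconstructed above) computes a and I for a given mask and ranges:

def rescale_template(B, m, n, M, N):
    """Template parameters for convolution with mask B plus rescaling of
    the input range [m, n] to the output range [M, N], eq. (5a-b).
    B is a 2-D list of feedforward coefficients."""
    B1 = sum(abs(c) for row in B for c in row)    # ||B||_1
    a = 1.0 - B1 * (m - n) / (M - N)
    I = B1 * (N * m - M * n) / (M - N)
    return a, I

# Identity mask, [-1, 1] mapped onto itself: gives a = 0, I = 0.
print(rescale_template([[0, 0, 0], [0, 1, 0], [0, 0, 0]], -1, 1, -1, 1))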

2.2 Binary Output Image Processing Operators


In the case of binary output image processing operators, the convolution operation
mentioned before remains and the role of the feedforward matrix B is
maintained, but the result of the processing is now thresholded. The determination
of parameters a and I then has to deal with the value of one or two
thresholds, as will be outlined in the following subsections, where several variants
of the same principle are tackled.

Single Threshold Processing. The purpose of the simplest variant of the
method is to threshold the result of a linear convolution filter at a desired threshold
Th. Thus parameters a and I are such that:

a > 1 ,   I = (1 - a)·x(0) - Th   (6a-b)

where the initial state x(0) ∈ [-1, 1] is the same for all cells of the network. In
addition, an inversion effect is obtained by reversing the sign of B and Th.

Two Thresholds Processing. The aim of this second variant of the method is
to threshold the result of a linear convolution filter with two different thresholds
assigned to particular cells according to their input state. In this case parameters
a and I are expressed as:

a = 1 + (Th- - Th+)/(x+(0) - x-(0)) ,   I = (x-(0)·Th+ - x+(0)·Th-)/(x+(0) - x-(0))   (7a-b)

where Th- applies to cells with an initial state x-(0) ∈ [-1, 1], and Th+ to cells
with an initial state x+(0) ∈ [-1, 1], such that x-(0) < x+(0) and Th- > Th+.
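A minimal sketch of this computation (hypothetical helper, Python used only for illustration); the example call anticipates step 4 of the application in section 4.1 and reproduces the parameters listed with fig. 5:

def two_threshold_template(th_minus, th_plus, x0_minus, x0_plus):
    """Parameters a and I for two thresholds processing, eq. (7a-b);
    requires x0_minus < x0_plus and th_minus > th_plus."""
    dx0 = x0_plus - x0_minus
    a = 1.0 + (th_minus - th_plus) / dx0
    I = (x0_minus * th_plus - x0_plus * th_minus) / dx0
    return a, I

# Binary initial states -1/+1 with Th- = 0.40, Th+ = -0.81:
print(two_threshold_template(0.40, -0.81, -1.0, 1.0))   # (1.605, 0.205)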
435

Single Threshold Processing and Boolean Operators. This is an adaptation
of the previous method which makes it possible to combine a binary initial state
with the result of a thresholded convolution filter.
"OR" Boolean operators are obtained when:

Th- = Th ,   Th+ ≤ -||B||_1   (8a-b)

"AND" Boolean operators are obtained when:

Th- ≥ ||B||_1 ,   Th+ = Th   (9a-b)

Once again, an inversion effect can be obtained by simply reversing the sign
of B and of the threshold value Th.

3 Internal States Processing

It is obvious from (2) that for x ∈ [-1, 1] the value of the output y reflects that
of the internal state x, which can hence be straightforwardly observed from
the output of the CNN. However, when y = ±1, the only information on the
internal state provided by the output is that |x| ≥ 1. In the latter case, it is
understood that a binary output does not imply binary internal states. The use
of internal states histograms as investigation tools allows an insight into
the CNN behavior beyond the [-1, 1] range. As a meaningful example, fig. 1
shows an image and its internal states histogram before and after thresholding.
It is clear from this representation that even if the output is binary, it is not
necessarily so for the internal states; thus the gray level information is not
really lost but merely hidden, and can be processed in order to complete specific
operations. To achieve this aim it is first interesting to focus on the internal
states location after a binary output image processing.

Fig. 1. a) original image; b) its corresponding histogram; c) image thresholded at
Th = -0.56; d) resulting histogram; parameters used: A = a = 2, B = b = 1, I = 0.56

3.1 Internal States Location After a Binary Output Image Processing

As seen before, in the case of a binary output image processing, the steady
internal states are located either in a subset D- of ]-∞, -1[ or in a subset D+
of ]1, ∞[. Hence, according to (3a) and (3c), it is possible to determine D- and
D+ as:

D- = [min_{i,j}(xq-_{i,j}), max_{i,j}(xq-_{i,j})[ ,   D+ = ]min_{i,j}(xq+_{i,j}), max_{i,j}(xq+_{i,j})]   (10a-b)

Internal States Location After a Single Threshold Processing. For single
threshold processing, the value of B ⊗ u_{i,j} in max_{i,j}(xq-_{i,j}) and min_{i,j}(xq+_{i,j})
is equal to the threshold value Th, while for min_{i,j}(xq-_{i,j}) and max_{i,j}(xq+_{i,j}) the
value of B ⊗ u_{i,j} is respectively equal to -||B||_1 and ||B||_1. This leads to the
following expression of D- and D+:

D- = [-||B||_1 + I - a, Th + I - a[ ,   D+ = ]Th + I + a, ||B||_1 + I + a]   (11a-b)

Internal States Location After a Two Thresholds Processing. Deriving
the previous approach for two thresholds processing leads, for Th-, to:

D-(Th-) = [-||B||_1 + I - a, Th- + I - a[ ,   D+(Th-) = ]Th- + I + a, ||B||_1 + I + a]   (12a-b)

and for Th+ to:

D-(Th+) = [-||B||_1 + I - a, Th+ + I - a[ ,   D+(Th+) = ]Th+ + I + a, ||B||_1 + I + a]   (13a-b)

Considering Th- and Th+ simultaneously, the overall expression of D- and D+ is:

D- = D-(Th-) ∪ D-(Th+) ,   D+ = D+(Th-) ∪ D+(Th+)   (14a-b)

Remembering that Th- > Th+, this finally yields:

D- = [-||B||_1 + I - a, Th- + I - a[ ,   D+ = ]Th+ + I + a, ||B||_1 + I + a]   (15a-b)

3.2 Internal States Binarization

It was stressed that the binary output image processing operators presented in
subsection 2.2 preserve the convolution information in the internal states even if the
CNN output is binary. But there is another way of thresholding an image, in
such a manner that the internal states become binary too. This can be done by
canceling the convolution process B ⊗ u_{i,j} when the input image is applied to the
CNN initial state. The values of the different possible equilibrium points are then
derived from (4a-c) and given by:

xq- = I - a   (16a)

xq0 = I/(1 - a)   (16b)

xq+ = I + a   (16c)

As a > 1, xq0 is an unstable equilibrium point which acts here as a threshold
Th. Therefore, equation (16b) can also be written:

Th = I/(1 - a)   (17)

Hence, when a cell initial state is less than Th it leads to xq-, and to xq+ when
it is greater. The method then allows, by choosing parameters a and I according
to (17), to design a threshold operator which operates on an image stored in the
CNN initial state and results in a binary image for both the output and the internal
state image. The possible values for the outputs are then of course -1 and 1,
whereas they are xq- and xq+ for the internal states.
The choice of parameters a and I in the equation of Th (17) allows one to set
either the value of xq- or that of xq+, but not both at the same time. However,
if a threshold operation is not useful because the internal state already results
from a previous threshold operation, it is possible to binarize the internal state
and to fix both the value of xq- and that of xq+. This is done by solving the
following set of equations for a and I:

xq- = I - a ,   xq+ = I + a   (18)

which yields:

a = (xq+ - xq-)/2 ,   I = (xq+ + xq-)/2   (19a-b)

It must be clearly noticed that, while the latter method can binarize internal
states to prescribed values, whether the internal states have already been binarized
or not, it cannot modify any CNN output, i.e. it cannot move an internal
state from ]-∞, -1[ to ]1, ∞[ or from ]1, ∞[ to ]-∞, -1[. The only way of
changing the CNN output consists in fact in dealing with an initial state in
[-1, 1].
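Equations (19a-b) translate directly into code; a minimal sketch (hypothetical helper, Python for illustration), whose example call anticipates step 2 of section 4.1 and matches the parameters listed with fig. 3:

def binarize_template(xq_minus, xq_plus):
    """a and I pinning binary internal states to prescribed equilibrium
    values xq- and xq+, eq. (19a-b)."""
    a = (xq_plus - xq_minus) / 2.0
    I = (xq_plus + xq_minus) / 2.0
    return a, I

# Binarization at xq- = -2, xq+ = 2: gives a = 2, I = 0.
print(binarize_template(-2.0, 2.0))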

3.3 Internal States Shifting

As all the internal states processing operations involved in section 3 are regarded
as a kind of preprocessing for new CNN image processing operators, the
internal states involved should not get stuck in ]-∞, -1[ and ]1, ∞[, and
it should be possible to shift them even into [-1, 1]. This implies the use of
cloning templates for which a < 1, which paradoxically generates CNN operators
whose steady state is independent of the CNN internal state [3]. Fortunately,
this paradox can be solved if the CNN convergence is stopped before the steady
state is reached. The following subsections will establish the relation between
internal state value and transient time and expose the principle of internal states
shifting.

Relation between Internal State Value and Transient Time. The determination
of the relation between internal state value and transient time can
be done by solving the differential equations (3a-c). Even if more complex cases
could be considered, for clarity it is convenient to set a = 0. The equations in (3a-c)
can then be gathered into a single one:

dx_{i,j}(t)/dt = -x_{i,j}(t) + B ⊗ u_{i,j} + I   (20)

Solving differential equation (20) leads to:

x_{i,j}(t) = (x_{i,j}(0) - B ⊗ u_{i,j} - I)·e^{-t} + B ⊗ u_{i,j} + I   (21)

which finally yields the expression of the transient time t for a given value of
x_{i,j} = x_{i,j}(t):

t_{i,j} = -ln[ (x_{i,j} - B ⊗ u_{i,j} - I) / (x_{i,j}(0) - B ⊗ u_{i,j} - I) ]   (22)
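The closed-form solution lends itself to a direct numerical check. The following sketch (hypothetical helper names, Python chosen for illustration) evaluates eqs. (21) and (22) for a single cell with a constant convolution term:

from math import exp, log

def state_at(t, x0, conv, I):
    """Closed-form internal state for a = 0, eq. (21); conv is the
    constant convolution term B (x) u of the cell."""
    return (x0 - conv - I) * exp(-t) + conv + I

def transient_time(x_target, x0, conv, I):
    """Time at which the state reaches x_target, eq. (22)."""
    return -log((x_target - conv - I) / (x0 - conv - I))

# Round trip: the state reached at the computed time is the target.
t = transient_time(0.5, -2.0, 0.0, 1.0)
print(state_at(t, -2.0, 0.0, 1.0))   # 0.5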

Shift of Binary Internal States. The most elementary internal states shifting
operation consists in shifting the two values x-(0) and x+(0) of a binary internal
state to two given values x- and x+, where x-(0) < x+(0) and x- < x+. As
the internal states must remain binary, the convolution process B ⊗ u_{i,j} must
be canceled. To find the value of I, it is assumed that there is a time t at which
the prescribed values x- and x+ are both reached, which can be expressed as:

-ln[ (x- - I) / (x-(0) - I) ] = -ln[ (x+ - I) / (x+(0) - I) ]   (23)

Solving (23) for parameter I yields:

I = (x-(0)·x+ - x+(0)·x-) / (x-(0) + x+ - x+(0) - x-)   (24)

Once parameter I is found, the effective transient time can be directly computed
from (22).
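A sketch of this computation (again with hypothetical names, in Python); the example anticipates step 3 of section 4.1, where the binary states (-2, 2) are shifted to (-1, 1), and reproduces the t ≈ 0.69 listed with fig. 4:

def binary_shift_bias(x0m, x0p, xm, xp):
    """Bias I moving binary states x-(0), x+(0) to x-, x+ at the same
    instant, eq. (24); the stopping time then follows from eq. (22)."""
    return (x0m * xp - x0p * xm) / (x0m + xp - x0p - xm)

I = binary_shift_bias(-2.0, 2.0, -1.0, 1.0)      # 0.0
print(I, transient_time(-1.0, -2.0, 0.0, I))     # t = 0.693...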

Shift of Multi-Valued Internal States. Another internal states shifting operation
consists in shifting the internal states information contained in D- and
D+ after a single or two thresholds processing. The idea of the method is, given
the original range [m, n] of the thresholded image, to prescribe a desired translated
range [M, N] to which D- and D+ will be shifted. The processing, derived
from that of subsection 2.1, must be stopped after an effective time for which
D- and D+ have reached the desired location. As no processing other than
shifting is involved here, only the central element b of the feedforward matrix B is
useful. For clarity, a = 0 has already been chosen. Thus according to (5a), and
given that (M - N)/(m - n) = 1, it leads to b = 1. Hence, only the parameter I and
the effective transient time t have to be determined.

The transition speed, or derivative of a cell state, expressed in (20) is here:

dx_{i,j}(t)/dt = -x_{i,j}(t) + u_{i,j} + I   (25)

Hence, it is a function of Δ(t) = u_{i,j} - x_{i,j}(t), which has the same value Δ-(t)
for all cells initially in D- and the same value Δ+(t) for all cells initially in D+.
This means that all cell states initially in D- move at the same speed, and so do
all cell states initially in D+. Thus, it is possible to find I and t from only one
representative cell state of D- and another of D+. The process is then the same
as that of the previous subsection, but involving now u- and u+, respectively
the input values of the cells chosen in D- and D+. The following equation can
thus be given:

-ln[ (x- - u- - I) / (x-(0) - u- - I) ] = -ln[ (x+ - u+ - I) / (x+(0) - u+ - I) ]   (26)

Solving (26) for parameter I now yields:

I = (x-(0)·x+ - x+(0)·x- - u-·[x+ - x+(0)] + u+·[x- - x-(0)]) / (x-(0) + x+ - x+(0) - x-)   (27)

Once parameter I is found, the effective transient time can again be directly
computed from (22).
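A sketch of eq. (27) (hypothetical helper, Python for illustration); the representative inputs u-, u+ depend on the image, so no numerical example is attempted here:

def multival_shift_bias(x0m, x0p, xm, xp, um, up):
    """Bias I shifting D- and D+ to prescribed targets, eq. (27), from
    one representative cell of each range with inputs u-, u+ (b = 1,
    a = 0).  The denominator vanishes only in the trivial no-shift case."""
    num = (x0m * xp - x0p * xm
           - um * (xp - x0p) + up * (xm - x0m))
    den = x0m + xp - x0p - xm
    return num / den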

4 Applications

The applications proposed here are based on the processing of the image in
fig. 1a. This image has a particular histogram which makes the segmentation
of objects easy and does not require complex segmentation schemes, which are
beyond the scope of this paper. In fact, the sky has a histogram included in
[-1, -0.81], the balloon in [-0.81, -0.56], the landscape in [-0.56, 0.40] and the
helicopter in [0.40, 1].

4.1 Composition of Complex Processing Schemes

The first application of internal states processing is the composition of complex
processing schemes in the form of delay-type cloning templates [4]. Different
elementary tasks can here be linked together by means of two-pass internal
states processing operations. This concerns especially processing for which the
previous binary output must serve as an internal state for the following one.
The aim of the following example is the segmentation of both the helicopter and
the balloon in fig. 1a. In the following operation, steps 2 and 3 constitute the
internal states processing which links steps 1 and 4:

- 1. A single threshold with Th = -0.56 is applied to the image and the output
is reversed to respect Th- > Th+ in the following step 4 (fig. 2).

- 2. Internal states are binarized at xq- = -2 and xq+ = 2 (fig. 3).
- 3. Internal states are shifted to xq- = -1 and xq+ = 1 (fig. 4).
- 4. A two thresholds processing with Th- = 0.40 and Th+ = -0.81 is applied
(fig. 5).
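Expressed with the helper sketches from sections 2 and 3 (hypothetical code, not the authors' implementation), the four steps yield the following template settings, which agree with the parameters listed with figs. 2-5:

a1 = 2.0                                       # step 1: single threshold with
I1 = (1 - a1) * 0.0 - 0.56                     # reversed output (b = -1,
                                               # Th = 0.56), eq. (6): I = -0.56
a2, I2 = binarize_template(-2.0, 2.0)          # step 2: a = 2, I = 0 (b = 0)
I3 = binary_shift_bias(-2.0, 2.0, -1.0, 1.0)   # step 3: I = 0 (a = 0, b = 0)
t3 = transient_time(-1.0, -2.0, 0.0, I3)       # run for t = 0.69
a4, I4 = two_threshold_template(0.40, -0.81, -1.0, 1.0)   # step 4: (1.605, 0.205)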

4.2 Gray Level Preserving Segmentation

The segmentation processing presented here erases the background and keeps
only the objects, but with their original gray levels. The processing starts with
steps 1 to 4 of subsection 4.1, after which D- = [-2.4, -1] and D+ = [1, 2.81].
Then the following step is:

- 5. D+ is shifted to [-0.81, 1], whereas D- is kept unchanged at [-2.4, -1]
(fig. 6).

4.3 Selective Brightness Variation

The selective brightness variation makes it possible to modify the brightness of a
segmented object without modification of the background. The following example
illustrates how to modify the brightness of the balloon in fig. 1a. The processing
starts with steps 1 to 3 of subsection 4.1, and is completed by two more steps:

- 4'. A single threshold processing with a Boolean "AND" operation is applied,
with Th- = 1 and Th+ = -0.81, after which D- = [-3, -1] and
D+ = [1, 2.81] (fig. 7).
- 5'. D- and D+ are shifted to [-1, 1] and [-0.5, 1.31] (fig. 8).

5 Conclusion

Interesting perspectives in CNN-based image processing have been emphasized
from an investigation into CNN internal states dynamics. Applications to the
building of complex processing schemes, gray level preserving segmentation and
selective brightness variation were presented. This original approach may
open the way to other new processing operations, such as selective contrast variation.

References

1. L. O. Chua and L. Yang, Cellular Neural Networks: Theory, IEEE T-CAS vol. 35 (1988)
1257-1272.
2. L. Merlat, A. Köneke, J. Mercklé, A Tutorial Introduction to Cellular Neural Net-
works, Proc. of Workshop in Electrical Engineering and Automatic Control (1997),
ESSAIM, Mulhouse, France.
3. D. Monnin, L. Merlat, A. Köneke and J. Hérault, Design of Cellular Neural Networks
for Binary and Gray Level Image Processing, Proc. of ICANN 98 (1998) 743-748.
4. T. Roska and L. O. Chua, CNN with Non-linear and Delay-type Template Elements
and Non-uniform Grids, in Cellular Neural Networks, J. Wiley & Sons (1993) 31-43.

Figures

[Histogram plots of the internal states for the processing steps of section 4; only
the captions are reproduced here.]

Fig. 2. Single threshold processing, parameters: a = 2, b = -1, I = -0.56, x(0) = 0
Fig. 3. Internal states binarization, parameters: a = 2, b = 0, I = 0
Fig. 4. Internal states shifting, parameters: a = 0, b = 0, I = 0, t = 0.69
Fig. 5. Two thresholds processing, parameters: a = 1.605, b = 1, I = 0.205
Fig. 6. Internal states shifting, parameters: a = 0, b = 1, I = -1.4, t = 0.83
Fig. 7. Single threshold processing & AND, parameters: a = 1.905, b = 1, I = -0.095
Fig. 8. Internal states shifting, parameters: a = 0, b = 1, I = 0.19, t = 2.44
Autopoiesis and Image Processing: Detection of Structure and
Organization in Images

Mario Köppen 1, Javier Ruiz-del-Solar 2

1 Dept. of Pattern Recognition, Fraunhofer IPK-Berlin,
Pascalstr. 8-9, 10587 Berlin, Germany.
Email: mario.koeppen@ipk.fhg.de.
2 Dept. of Electrical Eng., Universidad de Chile,
Casilla 412-3, 6513027 Santiago, Chile.
Email: j.ruizdelsolar@computer.org.

Abstract

The theory of Autopoiesis describes what the living systems are and not what they do.
Instead of investigating the behavior of systems exhibiting autonomy and the concrete
implementation of this autonomy (i.e. the system structure), the study addresses the
reason why such behavior is exhibited (i.e. the abstract system organization). This article
explores the use of autopoietic concepts in the field of Image Processing. Two different
approaches are presented. The first approach assumes that the organization of an image is
represented only by its grayvalue distribution. In order to identify autopoietic
organization inside an image's pixel distribution, the steady state Xor-operation is
identified as the only valid approach for an autopoietic processing of images. The effect
of its application on images is explored and discussed. The second approach makes use of
a second space, the A-space, as the autopoietic-processing domain. This allows for the
formulation of adaptable recognition tasks. Based on this second approach, the concept of
autopoiesis as a tool for the analysis of textures is explored.

Keywords: Autopoiesis, Steady State Image Processing, Auto-projective Operators,


Texture Analysis, Texture Retrieval Systems, Autopoietic-Agents

1 Introduction

The theory of Autopoiesis, developed by the Chilean biologists Humberto Maturana and
Francisco Varela, attempts to give an integrated characterization of the nature of a living
system, which is framed purely with respect to the system in and of itself. The term
autopoiesis was coined some twenty-five years ago by combining the Greek auto (self-)
and poiesis (creation; production). The concept of autopoiesis is defined as [Varela,
1979, p. 13]:

'An autopoietic system is organized (defined as a unity) as a network of
processes of production (transformation and destruction) of components that
produce the components that:

1. through their interactions and transformations continuously regenerate and
realize the network of processes (relations) that produced them; and

2. constitute it (the machine) as a concrete unity in the space in which they [the
components] exist by specifying the topological domain of its realization as
such a network.'

In other words, an autopoietic system produces its own components in addition to
conserving its organization. A network of local transformations produces elements which
maintain a boundary. This boundary captures the domain in which the local
transformations take place. In this context, life can be defined as autopoietic organization
realized in a physical space. The autopoietic theory describes what living systems are
and not what they do. Instead of investigating the behavior of systems exhibiting
autonomy and the concrete implementation of this autonomy (i.e. the system structure),
the study addresses the reason why such behavior is exhibited (i.e. the abstract system
organization). Complete material concerning the autopoietic theory (tutorials, study
plan, bibliography, Internet links, etc.) can be found on the Internet site Autopoiesis and
Enaction: The Observer Web [Whitaker, 1996].
The autopoietic theory has been applied in diverse fields such as biology,
sociology, psychology, epistemology, software engineering, artificial intelligence and
artificial life. In this context, this article tries to explore the use of autopoietic concepts in
the field of Image Processing.
Two different approaches will be presented. These approaches differ in their
interpretation of the domain of the processes. The first approach, presented in section 2,
assumes that the domain of an image is represented only by its grayvalue distribution. In
order to identify autopoietic organization inside an image's pixel distribution, the steady
state Xor-operation is identified as the only valid approach for an autopoietic processing
of images. The effect of its application on images is explored. The second approach makes
use of a second space, the A-space, as autopoietic processing domain. This allows the
formulation of adaptable recognition tasks. Based on this second approach, the concept of
autopoiesis as a tool for the analysis of textures is explored in section 3. As a concrete
example, a Texture Retrieval System based on the use of an autopoietic-agent is presented.
Finally, in section 4 some conclusions are given.

2 Recognition of image structures by using auto-projective operators

In a first attempt, the question arises whether images by themselves preserve some kind
of autopoietic organization. Because images are generally considered static
representations of real-world objects, while autopoiesis is constituted by a network of
dynamic transformations, the image must be processed by suitable operators in order to
reveal possible organizational principles. Thereby, the original image appears to be like a
"frozen" state of its intrinsic dynamical processes. Two approaches are possible from

now on: the first approach assumes no relation between these dynamics and the real-
world objects pictured in the image, in contrast to the second approach, which names the
kind of features of real-world objects by which the dynamics are driven.

This section is concerned with the first approach, i.e., image dynamics are restricted to the
distribution of colors or grayvalues in the image. No reference is made to the pictured real-
world objects. In order to "melt" the image according to a possible intrinsic dynamic, an
operator is sought with the following two essential properties:

1. The operator should be applied point-wise. Normally, image operators like the
Laplacian, the Sobel or the median operator are applied to all image pixels at once. But, as
was mentioned in the introduction, autopoietic systems constitute the domain of their
realization. For an effective search for these domains, their size cannot be
predicted. Hence, the application domain of the operator must be balanced between purely
local analysis (a pixel and its neighborhood) and global analysis (all image pixels). Pre-
defined image operators do not offer such a choice. By repetitively applying the local
operator point-wise, the effect of a local operation is spread out over the image and more
complicated patterns of interaction are possible. As a reminder of a similar procedure in
genetic algorithms, we refer to this manner of image operator application as steady state
image processing.

2. The requirements of autopoietic organization detection include the requirements for
rejecting this hypothesis as well. The operators we are interested in must be able to
recover the original image, at least to some degree, or with a probability much greater than
zero. If the image dynamics appear not to be autopoietic, the process must be able to
fail, i.e. to leave the image as it was at the beginning of the processing.

Speaking more formally, let ⊕ be the image operation which is applied point-wise. A
sequence of points p(T) is generated randomly, where p(T) is the point chosen at time
step T. Let g(p) be the grayvalue of point p in the image. In this article, all points are
equally probable in the random sequence. Non-adjacent points in the sequence could be
neighbors in the image. Assume p(T1) and p(T2) are such a pair of points, with T1 < T2.
Then, while applying the operator ⊕ to the point p(T1) and its neighbors, p(T2) is also
affected. But later, at step T2, the application of the operator onto the modified p(T2) also
affects p(T1). Our demand is to have a non-zero probability of reproducing p(T1)'s original
value by this procedure. This demand can be fulfilled using the Xor-operation. This can
be verified by considering the following three properties of the Xor-operation:

Commutativity: a ⊕ b = b ⊕ a
Associativity: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
Auto-projection: (a ⊕ b) ⊕ b = a

The third property explains the fundamental role of the Xor-operation in data coding, as
well as for sprite algorithms in computer animations. It can easily be shown that the Xor-
operation and its negation are the only binary operations fulfilling all three of these
properties. Hence, for detecting organized autopoietic structures in an image's grayvalue
distribution, it is necessary to apply the Xor-operation in a steady state manner. The
resulting algorithm is as simple as the following:

Repeat:

1. Take an arbitrary pixel p1 and choose one of its neighbors p2.

2. Replace g(p2) with g(p1) ⊕ g(p2).
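The procedure is short enough to state as a program. A minimal sketch in Python (used here only for illustration; the names are ours, and the paper does not fix how many point-wise updates make up one "generation"):

import random

def steady_state_xor(img, steps):
    """Steady state Xor processing: at each step pick a random pixel p1,
    pick one of its 8-neighbors p2 at random, and replace g(p2) by
    g(p1) XOR g(p2).  img is a list of rows of grayvalues (0..255),
    modified in place."""
    h, w = len(img), len(img[0])
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    for _ in range(steps):
        y1, x1 = random.randrange(h), random.randrange(w)
        dy, dx = random.choice(neighbors)
        y2, x2 = y1 + dy, x1 + dx
        if 0 <= y2 < h and 0 <= x2 < w:      # ignore off-image neighbors
            img[y2][x2] ^= img[y1][x1]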

Alternatively, operations similar to the Xor-operation can be designed based on the
theory of periodic Abelian groups (Xor gives a 2-periodic Abelian group). The
periodicity ensures the possible reproduction of the operands. However, the larger the
period, the smaller the reproduction probability.

2.1 Discussion

For a further understanding of the effect of this operation consider figure 1. There, a face
image (a), the result of the repetitive application of the above algorithm after 1000
"generations" (b), and a dilated version of the second image (c) are shown. A white
circular contour around the phong-pattern on the forehead can be seen. This phong-
pattern is a result of the lighting conditions during image acquisition. Phong-patterns are a
major problem for facial recognition tasks. By using the Xor-operator, they can be easily
detected. But where does this circle around the phong-pattern come from?
To understand this, consider the effect of the same procedure on a gradient
image (figure 2). Again, the original image, the image after 1000 steps and its dilated version
are shown. A white line appears in the middle of the image, around the grayvalue 128, but
not exactly in this position. The explanation of this effect recalls another famous role
of the Xor-operation, as a benchmark for neural networks: the Xor-operation is not
linearly separable, and a neural network needs a hidden layer to learn it. The
gradient image helps to give an intuition of this fact. If p1 has grayvalue 0, then g(p1) ⊕
g(p2) gives g(p2), i.e. for low grayvalues the Xor-operator tends to be the identity
transformation. If g(p1) is the maximum grayvalue (255 in our case), g(p1) ⊕ g(p2) gives
the inverse of g(p2), i.e. for high grayvalues it tends to be the inverting transformation.
But there is no linear descent from identity to inverse! Hence, we must have a non-linear
anomaly between these two extremes. The white line represents this anomaly.
Grayvalues around 128 tend to complete each other to the maximum grayvalue 255. The
white line appears to be a boundary between a gradual descent from the maximum to the
minimum grayvalue. In this way boundary exchange processes can be identified; i.e. the
boundary must be a closed one to prevent the Xor-operations from pocketing it. Hence,

the Xor-operation is able to detect the gradient-descending organization of grayvalue
distributions in an image, no matter how this descent is structured (line-likeness,
circularity, ascending linearly or faster). But also, the white line is a boundary separating
interior from outside.

Figure 1. A face image (a), the result of the repetitive application of the above algorithm after 1000
"generations" (b), and a dilated version of the second image (c).

Figure 2. A gradient image (a), the result of the repetitive application of the above algorithm after 1000
"generations" (b), and a dilated version of the second image (c).

To summarize the foregoing discussion: for the detection of the autopoietic
organization of a grayvalue distribution, or better, of the actual grayvalue distribution as a
"frozen" state of a possible autopoietic organization, the Xor-operator must be applied in
a steady state manner, i.e. on a sequence of randomly chosen image points. Only the Xor-
operator has the property of auto-projection, which ensures a probability much greater than
zero of regenerating the original image. This is true as long as binary numbers
represent grayvalues. The application of the Xor-operator onto images yields phong-like
structures, which prove to be the only organizational issues of intensity ordering in an
autopoietic manner.
These first results are encouraging enough to continue this work. It has been
shown that the search for autopoietic organization in grayvalue distributions of images
reveals new structural properties of them, which are hard to find by means of other
image processing operations. Further research on the Xor-operator should explore the role
of the probability distribution for the random sequence of pixel positions. Also, other

ordering relations in the image than the conventional intensity ordering should offer new
application tasks for the Xor-operator.

3 Autopoiesis and Texture Analysis

3.1 Textures and Autopoiesis

Texture perception plays an important role in human vision. It is used to detect and
distinguish objects, to infer surface orientation and perspective, and to determine shape in
3D scenes. Even though texture is an intuitive concept, there is no universally accepted
definition for it. Despite this fact we can say that textures are homogeneous visual
patterns that we perceive in natural or synthetic images. They are made of local
micropatterns, repeated somehow, producing the sensation of uniformity [Ruiz-del-Solar,
1997]. It is important to point out that textures cannot be characterized only by their
structure, because the same texture, viewed under different conditions, is perceived as
having different structures.
In the framework of the theory of autopoiesis, Maturana and Varela make a
complementary definition of the concepts of organization and structure of a system. The
organization of a system defines its identity as a unity, while the structure determines
only an instance of the system organization. In other words, the organization of a system
defines its invariant characteristics. The concept of autopoiesis captures the key idea that
living systems are systems that self-maintain their organization (see introduction). In the
context of texture analysis, the systems to be analyzed are the textures. As was
established, the concept of organization must be used to characterize a system, and in our
case to characterize a texture. For this reason, in this section the concept of autopoiesis is
explored as a tool for texture identification, which corresponds to an important task in the
field of texture analysis. The analogy between the process of autopoietic organization (i.e.
life) in a chemical medium and the process of texture identification is used.
Before applying the concept of autopoiesis as a tool for texture identification, a
computational model of autopoiesis must be defined. Varela et al. developed the first
model that was capable of supporting autopoietic organization [Varela et al., 1974].
Recently, McMullin developed the SCL model, which corresponds to an improvement of
the model presented by Varela [McMullin, 1997a and 1997b]. The SCL model from
McMullin is used here.

3.2 The SCL Model

SCL involves three different chemical elements (or particles): Substrate (S), Catalyst (K)
and Link (L). These particles move in random walks in a discrete, two-dimensional space.
In this space, each position is occupied by a single particle, or is empty. Empty positions
are managed by introducing a fourth class of particles: a Hole (H). SCL supports six
distinct reactions among particles [McMullin, 1997b]:

1. Production:
K + S + S --> K + L + H

2. Disintegration:
L --> S + S

3. Bonding:
Adjacent L particles bond into indefinitely long chains

4. Bond decay:
Individual bonds can decay, breaking a chain

5. Absorption:
L + S --> L*

6. Emission:
L* --> L + S

The autopoietic organization is produced when a chain of L elements forms a
boundary, which defines a concrete unity in the space. Of course, this boundary must be
continuously regenerated (see introduction). The L elements are produced only in the
presence of a catalyst (Production reaction). For this reason, we can say that in this
model an autopoietic organization is produced only in the presence of a catalyst.

3.3 The modified SCL Model

The original SCL model was modified to allow the identification of textures, by
introducing the idea of a texture-dependent catalyst. That means a catalyst that is tuned
to a defined texture and that produces an autopoietic organization only in this texture.
To implement this idea, an autopoietic image A(i,j) is defined for each texture image T(i,j).
Each pixel of A(i,j) has a corresponding position in T(i,j) and is represented by 2 bits
(enough to represent four particles). A T-Space is associated with the texture images
T(i,j) and an A-Space is associated with the autopoietic images A(i,j) (see figure 3). The
reactions defined by the SCL model, that is the possible autopoietic organization, take
place in the A-Space, but by taking into account information from the T-Space (textures).
Figure 3. The A-Space, where the autopoietic organization is created, and the T-Space, where the
convolution between the texture and the Gabor-Filter is performed, are shown.

Gabor-Filters are able to characterize textures by decomposing them into different
orientations and frequencies (scales) [Ruiz-del-Solar, 1997]. Here, a Gabor-Filter is
associated with the catalyst, to allow it (the catalyst) to be tuned to a particular
texture. The Gabor-Filter interacts directly with the textures in the T-Space (convolution
operation) and the result of this interaction is used to modulate the reactions in the A-
Space.
Of all the reactions defined by the SCL model only the Production reaction was
modified, because it is the only one in which the catalyst operates and the L particles are
created. The new Production reaction is defined by:

Production:

K + S1 + S2 --> K + L + H

C1 = N1 * G_K
C2 = N2 * G_K

if (C1 > TH and C2 > TH) {
    if (C1 > C2) {
        S1 --> L
        S2 --> H
    }
    else {
        S1 --> H
        S2 --> L
    }
}
where G_K is the Gabor-Filter associated with the catalyst K; N1 and N2 are the
neighborhoods, in the T-Space, of S1 and S2, respectively (see figure 3); C1 and C2 are the
results of the convolution (performed in the T-Space); and TH is a threshold value. If, in
the A-Space of a given texture, a chain of elements forms a boundary after an interaction
time, then the catalyst K has identified the texture (in its T-Space) as corresponding to the
class of textures characterized by the Gabor-Filter G_K.
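The conditional core of this reaction is compact. The following sketch (a hypothetical helper in Python, for illustration only) takes the two precomputed Gabor responses and returns the reaction outcome:

def modified_production(c1, c2, TH):
    """Outcome of the modified Production reaction, given the Gabor
    responses c1 = N1 * G_K and c2 = N2 * G_K of the two substrate
    neighborhoods and the threshold TH.  Returns the new particle types
    for (S1, S2), or None when the reaction does not fire."""
    if c1 > TH and c2 > TH:
        if c1 > c2:
            return ('L', 'H')      # S1 becomes the link, S2 the hole
        return ('H', 'L')          # otherwise the roles are exchanged
    return None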

3.4 An autopoietic-agent for Texture Retrieval

To illustrate the idea of texture identification by using a computational model of
autopoiesis, a system for the retrieval of textures in image databases is proposed (see figure
4). The system is based on the use of an autopoietic-agent (the texture-dependent catalyst
described in section 3.3), which is generated by using the texture description contained in
the query. This autopoietic-agent is tuned to this description, which means it can
interact (produce autopoietic organization) only with the texture that corresponds to this
description. The autopoietic-agent is sent to every texture of the database and allowed to
interact with the substrate particles of the A-Space of the textures. After an interaction
time, the texture where an autopoietic organization was produced (in its A-Space) is
retrieved.

Figure 4. Proposed Texture Retrieval System (A3G: Automatic Autopoietic-Agent Generator; TA2T:
Textural Autopoietic-Agent Tester).

4 Conclusions

The use of autopoietic concepts in the field of Image Processing was explored. Two
different approaches were presented. The first approach, presented in section 2, assumes
that the organization of an image is represented only by its grayvalue distribution. In
order to identify autopoietic organization inside an image's pixel distribution, the steady
state Xor-operation was identified as the only valid approach for an autopoietic
processing of images. The application of the Xor-operator onto images yields phong-like
structures, which prove to be the only organizational issues of intensity ordering in an
autopoietic manner. These first results are encouraging enough to continue this work. It
was shown that the search for autopoietic organization in grayvalue distributions of
images reveals new structural properties of them, which are hard to find by means of
other image processing operations. Further research on the Xor-operator should explore
the role of the probability distribution for the random sequence of pixel positions. Also,
other ordering relations in the image than the conventional intensity ordering should offer
new application tasks for the Xor-operator.
The second approach, presented in section 3, makes use of a second space, the A-
space, as the autopoietic processing domain. This allows the formulation of adaptable
recognition tasks. Based on this second approach, the concept of autopoiesis as a tool for
the analysis of textures was explored. The SCL model, a computational model of
autopoiesis, was modified to allow the identification of textures, by introducing the idea
of a texture-dependent catalyst. As a demonstrating example, a Texture Retrieval System
based on the use of an autopoietic-agent, the texture-dependent catalyst, was presented.
Further research must be performed to apply this concept to the solution of real-world
problems.

References

McMullin, B. (1997a). Computational Autopoiesis: The original algorithm. Working
Paper 97-01-001, Santa Fe Institute, Santa Fe, NM 87501, USA.
http://www.santafe.edu/sfi/publications/Working-Papers/97-01-001/

McMullin, B. (1997b). SCL: An artificial chemistry in Swarm. Working Paper 97-01-002,


Santa Fe Institute, Santa Fe, NM 87501, USA.
http://www.santafe.edu/sfi/publications/Working-Papers/97-01-002/

Ruiz-del-Solar, J. (1997). TEXSOM: A new Architecture for Texture Segmentation. Proc.


of the Workshop on Self-Organizing Maps - WSOM 97, pp. 227-232, June 4-6, Espoo,
Finland.

Varela, F.J. (1979). Principles of Biological Autonomy, New York: Elsevier (North
Holland).

Varela, F.J., Maturana, H.R., and Uribe, R. (1974). Autopoiesis: The organization of
living systems, its characterization and a model. BioSystems 5: 187-196.

Whitaker, R. (1996). Autopoiesis and Enaction: The Observer Web.


http://www.informatik.umu.se/~rwhit/AT.html
Preprocessing of Radiological Images:
Comparison of the Application of Polynomic
Algorithms and Artificial Neural Networks
to the Elimination of Variations
in Background Luminosity

Arcay Varela, Bernardino 1; Alonso Betanzos, Amparo 1; Castro Martínez,
Alfonso 1,2; Seijo García, Concepción 1; Suárez Bustillo, Jesús 3

1 LIDIA (Laboratorio de Investigación y Desarrollo en Inteligencia Artificial),
Departamento de Computación, Facultade de Informática, Universidade da Coruña, Spain
{cibarcay, ciamparo, alfonso}@udc.es, conchi@mail2.udc.es
2 Instituto Universitario de Ciencias da Saúde, Universidade da Coruña, Spain
3 Complexo Hospitalario Juan Canalejo, A Coruña, Spain

Abstract. One of the major difficulties arising in the analysis of a radiological image is
that of non-uniform variations in luminosity in the background. This problem urgently
requires a solution, given that differing areas of the image have the same values
attributed to them, which may potentially lead to grave errors in the analysis of an image.
This article describes the application of two different methods to the solution of this
problem: polynomial algorithms and artificial neural networks. The results obtained
using each method are described and compared, the advantages and drawbacks of each
method are commented on, and reference is made to areas of potential interest from the
point of view of future research.

1 Introduction

Within the field of digital image processing in medicine, one of the areas to which
most effort is dedicated is the analysis of radiological images [1]. Any
improvement in either the quality of these images or the process of analyzing them
would guarantee an important improvement in patient care.
Moreover, this area of investigation is particularly interesting in terms of
developing new support systems for specialists in a particular image field, given that
there is generally available a good supply of images both for development and for
system tests.
In digital analysis of radiological images one of the problems that occurs most
frequently is the problem of variations in luminosity [2]. This problem occurs as a
consequence of curvature in the exposed surface or an intrusion of some kind between
the image acquisition apparatus and the object. The consequence is that the non-
uniform illumination causes the elements making up the image to have different

luminosity values depending on the area of the image and these values, for the
different elements, are similar for different areas of the image.
This is a problem that needs to be resolved before proceeding to a detailed analysis
of the image. Not doing so could cause grave errors to occur during the segmentation
phase given the impossibility of establishing a criterion that delimits, with a sufficient
margin of error, the different elements that make up the radiograph.
The traditional approach to this problem is based on statistical methods [3]. Using
a set of images, a series of probabilities is calculated, on the basis of which a function
is applied to the luminosity value of each point of the image so as to obtain the correct
values. However, this kind of method has two major drawbacks:

1. Large quantities of images are required to calculate the probabilities used a
posteriori.
2. Satisfactory results are not usually obtained for images that present characteristics
different from those used for the calculation of the probabilities.

This article presents the results obtained in the preprocessing of radiological images
corresponding to an orthopedic service. The aim of the research is to endeavour to
eliminate problems of variations in luminosity and to make an in-depth analysis of the
images, with a view to creating a valuable tool for specialists to employ in their
diagnoses. In view of this aim, the two methods selected as most
appropriate were polynomial algorithms [4] and artificial neural networks. In order
to evaluate the quality of the results a segmentation of each image obtained from
applying the two methods was carried out using different clustering algorithms. The
results obtained using both methods along with the advantages and drawbacks in the
use of either are described below.
A solution to the problem that is the concern of this research would mean
significant progress in the development of an automatic process for the examination
of radiological images, given that the value of the radiographs available depends
greatly on the extent to which this flaw is corrected.
As a longer-term aim, it is hoped to extend the research so as to develop a system
that assists specialists in the fitting of prostheses as well as in the assessment of screw
implants.

2 Characteristics of the Radiological Images

To start with, the characteristics that best defined the image were identified (Fig.
1), in order to select those techniques that would produce the best results. For this
characterization of the image, standard digital processing techniques were used.

Fig. 1. Image used in the study

It was observed that the borders between the different elements are both close to
each other and fuzzy or blurred.
The histogram of the radiograph was also examined (Fig. 2), with a view to
obtaining a precise idea of the distribution of the values, and this confirmed that the
borders were blurred. In addition, the radiograph presented a non-uniform variation
in luminosity: the intensity of the bone and the screw in the upper and lower portions
of the image is very different.

Fig. 2. Histogram of the image (number of pixels vs. intensity)



3 The Methods Analysed

The polynomial algorithm and artificial neural network techniques were applied
to the analysis of the problem, with the aim of comparing the results of both. The
former is a linear algorithmic technique whereby it is only possible to adjust a fixed
number of parameters; the ANN technique, on the other hand, is a non-linear one
whereby, after training, it is expected to be capable of generalizing, i.e. of
adapting to images with totally different characteristics to those of the images used for
training.

3.1 Polynomial Algorithms

The least squares method consists of the construction of an image reflecting the
variations occurring in the background of the original image, by means of a
bidimensional polynomial p(x,y) calculated by way of the least squares method, and
subtracting it from the original image with a view to eliminating the variation.
The calculation of the polynomial is based on the assumption that the values for
background luminosity in an image are spatially continuous, it being possible to make
the calculations using a polynomial of arbitrary degree based on the Weierstrass
approximation theorem [5].

Hence a polynomial of degree n, which minimizes the squared error, is constructed as follows:

p(x,y) = a_00 x^0 y^0 + a_10 x^1 y^0 + a_20 x^2 y^0 + ... + a_n0 x^n y^0 + a_01 x^0 y^1 + a_02 x^0 y^2 + ... + a_0n x^0 y^n   (1)

The values for the polynomial image are subsequently determined by calculating the
value of the polynomial for (x,y) with x = 1...N, y = 1...M, where N and M represent
the range of each dimension of the image.
Bearing in mind that this technique has the drawback of being time-consuming in
computational terms, the degree of the polynomial and the number of points used to
calculate it should be limited as far as possible.
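As an illustration, the background fit and subtraction can be sketched as follows (a minimal Python example; the normalized coordinates, the default degree and the sampling step are assumptions of the sketch, not values from the system described here):

```python
import numpy as np

def remove_background(img, degree=3, step=8):
    """Fit p(x,y) = a00 + a10*x + ... + an0*x^n + a01*y + ... + a0n*y^n
    by least squares on a subsampled grid, then subtract it.

    Subsampling every `step`-th pixel limits the cost of the fit, and the
    degree is kept low, as recommended in the text."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    xs = xs.ravel() / w                      # normalize for conditioning
    ys = ys.ravel() / h
    # Monomials of equation (1): pure powers of x and of y, no cross terms
    terms = [(i, 0) for i in range(degree + 1)] + \
            [(0, j) for j in range(1, degree + 1)]
    A = np.stack([xs**i * ys**j for i, j in terms], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, img[::step, ::step].ravel(), rcond=None)
    # Evaluate the fitted polynomial over the full grid and subtract it
    fy, fx = np.mgrid[0:h, 0:w]
    fx, fy = fx / w, fy / h
    background = sum(c * fx**i * fy**j for c, (i, j) in zip(coeffs, terms))
    return img - background                  # corrected image
```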

3.2 Artificial Neural Networks

The type of neural network utilized was the feed-forward type [6], with an input
layer composed of 25 process elements, plus a hidden layer and an output layer with
one process element each. Connectivity is total between all the process
elements of the network. The input layer was defined at a size of 5x5 pixels, with a
view to a fragmented processing of the image, simulating a convolution [7].

Training. A supervised learning process using the backpropagation algorithm was
selected in order to train the network.

Input pattern. A synthetic image in a range of grey tones was used, with dimensions
of 368 by 360 pixels, taking values in the interval [0,255]. A total of 9,975 fragments
of 5x5 pixels were extracted, the values of which were administered to the network as
input.

Output pattern. One fragment of 5x5 pixels was extracted from the output image for
each fragment of the input image. The expected output of the network would be the
mean value of the pixels in each fragment of the image taken as the output model.

Activation. Act-Logistic, defined as:

actv = 1 / (1 + e^(-x))   (2)

where actv is the activation of each process element and x the input to the neuron.

Output function. Identity.

Updating function. Topological Order (the most appropriate for feed-forward
networks). The neurons calculate their new activation in order and in accordance with
the network topology (first the input layer, then the hidden layer and finally the output
layer).

Initialization function. Randomized Weights initializes weights and biases with
randomly distributed values; in this case, in the interval [-1,1].

The processing of the image is carried out by displacing a sliding window of
size 5x5 pixels; the values of the pixels over which the window is fixed in each
iteration constitute the input to the network, and this gives us the expected value for
each fragment. This resulting pixel will be the centre of the corresponding window in
the output image (Fig. 3).
The network was implemented using the freely distributed SNNS program, together
with several routines which had to be implemented in C [8].
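The fragment extraction and the window displacement just described can be sketched as follows (Python; the routines below are illustrative stand-ins for the SNNS/C implementation, and `predict` stands for the trained 25-input network):

```python
import numpy as np

def training_pairs(inp, target, win=5):
    """Pair each 5x5 input fragment with the mean grey value of the
    corresponding fragment of the image taken as the output model."""
    X, y = [], []
    h, w = inp.shape
    for r in range(0, h - win + 1, win):
        for c in range(0, w - win + 1, win):
            X.append(inp[r:r + win, c:c + win].ravel() / 255.0)
            y.append(target[r:r + win, c:c + win].mean() / 255.0)
    return np.array(X), np.array(y)

def process(inp, predict, win=5):
    """Slide the 5x5 window over the image; the network's value for each
    fragment becomes the centre pixel of the window in the output image."""
    h, w = inp.shape
    out = np.zeros((h, w))
    for r in range(h - win + 1):
        for c in range(w - win + 1):
            patch = inp[r:r + win, c:c + win].ravel() / 255.0
            out[r + win // 2, c + win // 2] = predict(patch)
    return out * 255.0
```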


Fig. 3. Network Processing of the Pixels



4 Comparison of Results

With a view to comparing the outputs produced by the polynomial algorithm on
the one hand, and the artificial neural network on the other, it was decided to subject
the image to a clustering algorithm.
The aim is for the clustering algorithm to segment the image into the different
elements that compose it. The output of the algorithm for each resulting image is
compared with a mask made by hand, in order to determine the degree of accuracy
obtained.
It was decided to use the MFCM (Modified Fuzzy c-Means) clustering algorithm, a
fuzzy variation of the c-Means algorithm and the outcome of the research of Young Won
Lim and Sang Uk Lee [9]. These authors describe an algorithm for the segmentation
of colour images by means of the study of the histograms of each one of the colour
bands.

The MFCM algorithm is composed of two parts, as follows:


1. A hard part responsible for the study of the histograms of an image in order
both to establish a number of classes and to make an initial global classification of
the image. Study of the image requires an initial softening of the same. Lim and
Lee recommend the employment of the scale-space developed by Witkin [10].
The term scale-space describes the bidimensional surface obtained on convoluting
a unidimensional signal with a Gaussian function in which the parameter σ² is
successively varied.
Once the cut-off points are localized, the RGB space is divided into a series of
independent 'pre-clusters'. Each one of these pre-clusters is separated from its
neighbours by a security zone, the width of which is a configurable parameter of
the algorithm. These zones are part of the fuzzy area which is classified in the
second part of the algorithm. A configurable threshold determines how many
and which of these pre-clusters pass on to the fuzzy stage. If any one of the pre-
clusters possesses fewer pixels than required by the threshold then it is
eliminated and its pixels go to the fuzzy area. As the lesser populated pre-
clusters are eliminated, the class centres are recalculated for the surviving pre-
clusters. These centres remain unchanged from this moment on.
2. A fuzzy part that deals with the classification of the pixels for which it was more
difficult to determine the class to which they belong. In this stage, the pixels
stored in the fuzzy area (i.e. pixels from the initial borders between pre-clusters
and the discarded pre-clusters) are assigned to the final surviving clusters. In
order to determine the membership factor of each pixel to a cluster, the
standard fuzzy c-means membership function was used:

u_ik = [ Σ_{j=1}^{c} ( ||x_k − v_i|| / ||x_k − v_j|| )^(2/(m−1)) ]^(−1)

where c is the number of clusters; m is a weighting factor that permits the
evaluation, to a greater or lesser extent, of the distance of an individual element from
the class examples; v_i and v_j are the centroids of the i-th and j-th classes respectively;
and x_k is a pixel from the fuzzy area.
In the absence of labelled samples from each class, the centres of gravity of the
clusters are used to calculate the membership factors of a pixel.
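For reference, the membership computation for one pixel can be sketched as follows (Python, assuming the standard fuzzy c-means form with Euclidean distances; the function and argument names are illustrative):

```python
import numpy as np

def fcm_membership(x, centroids, m=2.0):
    """Membership factors of pixel x to each of the c clusters,
    computed from the distances to the cluster centroids v_i."""
    d = np.linalg.norm(np.asarray(centroids, dtype=float) - x, axis=1)
    d = np.maximum(d, 1e-12)                 # guard against a zero distance
    ratio = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)           # factors sum to 1 over clusters

# e.g. fcm_membership(np.array([120.0]), [[30.0], [110.0], [200.0]])
```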
The results of the segmentation algorithm demonstrate a greater degree of accuracy
for the neural network than for the polynomial algorithm (Fig. 4). The fact that the
polynomial algorithm shows greater accuracy in the detection of iron is entirely due to
the fact that, when segmenting the polynomial-corrected image, the clustering algorithm
classifies almost all bone as iron, thus committing a very grave classification error
indeed.

Fig. 4. Results for the Clustering Algorithm in the Segmentation of the Different Images.

The ANN results are more satisfactory, particularly in comparison to the
polynomial algorithm in classifying screw-bone. This greater accuracy is due to the
generalization capacity of the network, which means that it is capable, on the basis
of the training set, of assessing the different patterns in order to correct the problem
of variations in luminosity in the radiological images, as well as being able to solve
the problems in the different zones. For its part, the polynomial algorithm calculates
the coefficients for the entire image, not being capable of adapting to each of the
zones of the image; and even when it manages to adapt, the details for the other
zones are lost as a consequence of an 'excess' of adjustment. This is not to mention
another drawback: the fact that the calculation time required is extremely lengthy.

5 Conclusions and Results

There appears to be a case for claiming that the ANNs produce quite an improved
result over the polynomial algorithms. Nevertheless, there still remain various
adjustments to be made to the training of the network so as to obtain optimum results,
given that there are certain patterns that the network is not yet capable of treating
optimally. For example, in the radiographs it can be appreciated that there is an elevated level of

noise, conducive to error in the segmentation phase, for which reason its elimination
during the pre-processing phase is desirable.
Finally, another interesting modification would be the design of a non-supervised
training network that would permit the detection of patterns of interest in the images,
thus facilitating segmentation and characterization of a radiological image.

6 Acknowledgements

Our thanks to the Computing Service of the Juan Canalejo Hospital (A Coruña,
Spain) for their collaboration in this research.

References

1. Todd-Pokropek, Andrew E.; Viergever, Max A.: Medical Images: Formation,
Handling and Evaluation. Springer-Verlag, NATO Series (1994).
2. Gonzalez, Rafael C.; Woods, Richard E.: Digital Image Processing, 2nd edn.
Addison-Wesley Publishing Company (1992).
3. Sonka, Milan; Hlavac, Vaclav; Boyle, Roger: Image Processing, Analysis and
Machine Vision. Chapman & Hall (1994).
4. Castro Martínez, Alfonso; Alonso Betanzos, Amparo; Arcay Varela, Bernardino:
Aplicación de Algoritmos Polinómicos al Preprocesado de Imágenes Radiológicas.
CASEIB 98, September 1998.
5. Chandrasekar, Ramachandran; Attikiouzel, Yianni: Gross Segmentation of
Mammograms Using a Polynomial Model. Proceedings of the IEEE-EMBS, Vol.
16 (1994).
6. Haykin, Simon: Neural Networks: A Comprehensive Foundation. Prentice Hall
International.
7. Kulkarni, Arun D.: Artificial Neural Networks for Image Understanding. VNR
Computer Library (1993).
8. Masters, Timothy: Signal and Image Processing with Neural Networks: a C++
Sourcebook. John Wiley & Sons (1994).
9. Lim, Y. W.; Lee, S. U.: On the Color Image Segmentation Algorithm Based on the
Thresholding and the Fuzzy c-Means Techniques. IEEE Press, Fuzzy Models for
Pattern Recognition (1990).
10. Witkin, A. P.: Scale-Space Filtering: A New Approach to Multi-scale Description.
Proc. 8th Int'l Joint Conf. Artificial Intelligence (August 1983) 1019-1022.
Feature Extraction with an
Associative Neural Network and Its Application
in Industrial Quality Control
Ibarra Picó, F.; Cuenca Asensi, S.; García-Chamizo, J.M.
Departamento de Tecnología Informática y Computación
Campus de San Vicente
Universidad de Alicante
03080, Alicante, Spain
email: ibarra@dtic.ua.es

Topics: Image Processing, neural nets, industrial automation, texture recognition, real-time quality
control

Abstract. There are several approaches to quality control in industrial processes. This work focuses on
artificial vision applications for defect detection, classification and control. In particular, we focus
on textile fabric and the use of texture analysis for discrimination and classification. Most
previous methods have limitations in discrimination accuracy or in computation time, so we
apply parallel and signal processing techniques. Our algorithm is divided into two phases: a first phase
extracts the texture features and a second phase classifies them. Texture features should have the following
properties: be invariant under the transformations of translation, rotation, and scaling; have a good
discriminating power; and take the non-stationary nature of texture into account. In our approach we use
Orthogonal Associative Neural Networks for texture identification and the extraction of features with the
previous properties. The same network is used in the classification phase (where its energy
function is minimized), and the whole method was applied to defect detection in textile fabric. Several
experiments have been carried out comparing the proposed method with other paradigms. In response time and
quality of response our proposal obtains the best results.

1. Introduction

For real-time image analysis, for example in the detection of defects in textile fabric, the
complexity of calculations has to be reduced in order to limit the system costs [3].
There are several approaches to quality control in industrial processes [1][2][7].
Additionally, algorithms which are suitable for migration into hardware have to be
chosen. Both the extraction method of texture features and the classification algorithm
must satisfy these two conditions. Moreover, the extraction method of texture features
should have the following properties: be invariant under the transformations of
translation, rotation, and scaling; have a good discriminating power; and take the non-
stationary nature of texture into account. We choose the Morphologic Coefficient [8] as a
feature extractor because it is adequate for implementation by associative memories and
dedicated hardware.
On the other hand, the classification algorithm should be able to store all of the patterns,
have a high correct classification rate and a real-time response. There are many
models of classifier based on artificial neural networks [5][13][16]. Hopfield [11], [12]
introduced a first model of one-layer autoassociative memory. The Bi-directional
Associative Memory (BAM) was proposed by Kosko [14] and generalizes the model
to be bidirectional and heteroassociative. BAMs have storage capacity problems
[17].
Several improvements have been proposed (Adaptive Bidirectional Associative
Memories [15], multiple training [17], [18], guaranteed recall, and many more
besides). One-step models without iteration have been developed too (Orthonormalized
Associative Memories [9] and Hao's associative memory [10], which uses a
hidden layer). In this paper, we propose a new model of associative memory which
can be used in bidirectional or one-step mode.

2. Feature Extraction for Texture Analysis

The Hausdorff Dimension (HD) was first proposed in 1919 by the mathematician
Hausdorff and has been used mainly in fractal studies [4]. One of the most attractive
features of this measure when analyzing images is its invariance under
isometric transformations. We will use the HD when extracting features.
Definition I. The Hausdorff dimension of order h of a set S, with S ⊆ R^n, h ≥ 0 and δ > 0,
is defined as follows:

H^h(S) = lim_{δ→0} H^h_δ(S)   (1)

with

H^h_δ(S) = inf { Σ_i |U_i|^h : S ⊆ ∪_i U_i, |U_i| ≤ δ }   (2)

Definition II. The Hausdorff dimension of a set S is the value of h at which H^h(S)
jumps between infinity and 0. Formally:

dim_H(S) = inf { h : H^h(S) = 0 } = sup { h : H^h(S) = ∞ }   (3)


We can approximate the HD by semicovers, so we define the morphologic coefficient,
which can be used for feature extraction. We call morphologic coefficient of the
semicover of a set S over a morphologic element A_i of diameter δ = |A_i| the quantity

CM(S) = lim_{δ→0} log|sm_δ(S)| / (−log δ)   (4)

Characterization of the texture


In order to extract the invariant characteristics of an image we divide it into several
planes according to the level of intensity of each point. Then we define the
multidimensional morphologic coefficient as the vector formed by the CM of each
one of these planes. We can characterize the texture by its CM vector:

CM = [CM_1, CM_2, ..., CM_p]   (5)

where p is the number of planes into which the image is partitioned.
The CM vectors of the patterns will be employed in the learning process of the
classifier that is described below.
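A discrete approximation of this CM vector can be sketched as follows (Python; the uniform grey-level bands, the window size and the use of 1/win as the diameter δ are assumptions of the sketch):

```python
import numpy as np

def cm_vector(img, planes=8, win=7):
    """Split the image into grey-level planes; in each plane count the
    win x win windows fully contained in the plane (the semicover norm)
    and take log(count) / -log(delta) with delta = 1/win as the CM."""
    edges = np.linspace(img.min(), img.max() + 1, planes + 1)
    h, w = img.shape
    cms = []
    for i in range(planes):
        mask = (img >= edges[i]) & (img < edges[i + 1])
        count = 0
        for r in range(0, h - win + 1, win):
            for c in range(0, w - win + 1, win):
                count += mask[r:r + win, c:c + win].all()
        cms.append(np.log(max(count, 1)) / -np.log(1.0 / win))
    return np.array(cms)       # the texture signature CM = [CM_1 ... CM_p]
```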

3. Associative Orthogonal Memory (MAO)

In this paper, we use a new model of associative memory which can be used in
bidirectional or one-step mode. This model uses a hidden layer, proper filters and
orthogonality to increase the storage capacity and reduce the noise effect of linear
dependencies between patterns. Our model, which we call Bidirectional Associative
Orthogonal Memory (MAO), goes beyond the BAM capacity. We use this neural
network to implement the feature extractor (Morphologic Coefficient) and to classify
the image.

Figure 1. Image decomposition in several planes

Figure 2. P-CM analysis

Topology and Learning Process


Let there be a set of q pairs of patterns (a_i, b_i) of the vectorial spaces R^n and R^m. We
build two learning matrices as shown below:

A = [a_ij] and B = [b_ik] for i ∈ {1,..,q}, j ∈ {1,..,n}, k ∈ {1,..,m}

The MAO memory is built as a neural network with two synaptic matrices (Hebbian
correlations) W and V, which are computed as W = AQ^t and V = QB^t, where Q is an
intermediate orthogonal matrix (Walsh, Householder, and so on) of dimensions qxq.
The q_i vectors of Q are an orthogonal base of the vectorial space R^q. This
characteristic of the q_i vectors is very important to make accurate associations,
even under noise conditions [16].
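A minimal sketch of this construction and of one-step recall (Python; the use of a Hadamard matrix as Q, which requires q to be a power of two, and the column-wise pattern layout are assumptions of the sketch):

```python
import numpy as np
from scipy.linalg import hadamard

def build_mao(A, B):
    """W = A Q^t and V = Q B^t, with A (n x q) holding the patterns a_i
    as columns, B (m x q) the associated b_i, and Q a q x q orthogonal
    matrix whose rows form an orthogonal base of R^q."""
    q = A.shape[1]
    Q = hadamard(q) / np.sqrt(q)     # orthonormal Walsh-type matrix
    return A @ Q.T, Q @ B.T

def recall(W, V, x):
    """One-step mode: x is mapped to the orthogonal hidden code q_i of
    its stored pattern, and from there to the associated output b_i."""
    h = W.T @ x                      # = Q A^t x  (hidden layer)
    return V.T @ h                   # = B Q^t h  (output layer)
```

Recall is exact when the stored a_i are orthonormal; otherwise the orthogonal hidden codes still reduce the cross-talk caused by linear dependencies between patterns.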
For feature extraction, a heuristic implementation of the Morphologic Coefficient is used.
In a first step the image is divided into several layers (using the grey level as parameter).

Let I be an image, I ⊆ R^2, and let P(I) be a partition P(I) = {I(λ_1), I(λ_2), ..., I(λ_P)};
the semicover in each plane λ_i, for i = 1..P, is obtained by a bipolar filter.
The norm of the δ-semicover of the image in the plane λ_i, for i = 1..P, and window V_j,
for j = 1..N/|V_j|, is

V_j-sm(I/λ_i, V_j) = 1 ⟺ f(x) = 1 ∀x ∈ V_j, x ∈ I(λ_i)   (6)

It is calculated by a specific neuron in each decomposition window:

Figure 3. V_j-semicover in window V_j

The g filter uses the reference signal (V_j − 1):

g(y_j) = { 1 if y_j > 0 ; 0 if y_j < 0 }
Finally, the CM (Morphologic Coefficient) in a plane λ_i, for i = 1..P, is calculated
from the several windows V_j, j = 1..N/|V_j|, in each plane, that is

CM(λ_i, V_j) = Σ_{l=1}^{K} log[g(y_l)] / (−log|V|)   (7)

So, we need a neuron that represents the output of a window.

Figure 4. Morphologic Coefficient in the plane λ_i

where the filter f_k(x) is given by f_k(x) = log(x)/(−log k).


So in the neural model the W and V synaptic matrices are

W = ( +1   +1   ...  +1  )      V = ( +1 )
    ( +1   +1   ...  +1  )          ( +1 )
    ( ...  ...  ...  ... )          ( ...)   (8)
    ( k^-1 k^-1 ... k^-1 )          ( +1 )

4. Experiments

To test the texture analysis algorithm (feature extraction and classifier) we consider
the problem of defect detection in textile fabric. Real-world 512x512 images (jeans
texture) with and without defects (Fig. 5a and 5b) were employed in the learning
process of the MAO classifier. We considered windows of 70x70 pixels with 256 grey
levels, and the parameters of the algorithm were adjusted to obtain high precision and
low response time. The results are shown in Table I.

(a) Image of jeans textile fabric without defects

(b) Windows of jeans textile fabric containing defects

Figure 5. Example of application

The implementation was made as a C program. Different images were employed in the
test process and in the learning process; in both cases there were 1,200 images with defects
and 1,000 without defects. The results show that in all the cases our algorithm is two
orders of magnitude faster than the others. In addition, the hit rate is close to 90% for
texture recognition both with and without defects (notice that for C-III, with ad-hoc
partitioning, it is over 95%). The conclusion is that it is feasible to implement a
real-time system with a high precision level based on our algorithm, so an architectural
proposal will be made.
465

Algorithm   Window size   Hit rate without defects   Hit rate with defects   Response time
C-I         70x70         92.23%                     87.14%                  0.081 s
C-II        70x70         96.12%                     93.32%                  0.055 s
C-III       70x70         97.81%                     94.42%                  0.070 s
Laws        40x40         93.71%                     64.69%                  1.5 s
SAC         64x64         95.12%                     84.34%                  1.1 s

Table I. Simulation results

5. Conclusion

A new method of texture analysis is successfully applied to solve the problem of
defect segmentation in textile fabric by a neural network model. The system presents
a statistical method for feature extraction and a neural classifier.
The method for the extraction of texture features is based on the Hausdorff dimension
and its most important properties are: it is easy to compute and it is invariant under
geometrical mappings such as rotation, translation and scaling.
An Associative Neural Model is used as a classifier. In this extension the neurons have
an output value that is updated at the same time as the neuron weights. From this
output value we can easily calculate the distance between the neuron and the cluster
and obtain the probability that a neuron belongs to a cluster, that is, the probability
with which the system works well. This system works in real time, produces a correct
classification rate of about 96.44% and compares favourably with other methods.

References

[1] N.R. Pal and S.K. Pal, A review on image segmentation techniques, Pattern
Recognition, Vol. 26, No. 9, pp. 1277-1294, 1993.
[2] R.M. Haralick, Statistical and structural approaches to texture, Proc. IEEE, Vol.
67, pp. 786-804, 1979.
[3] C. Neubauer, Segmentation of defects in textile fabric, Proc. IEEE, pp. 688-691,
1992.
[4] Hoggar, S.G. Mathematics for Computer Science. Cambridge University
Press. 1993.
[5] J.M. Zurada, Introduction to Artificial Neural Systems, West Publishing Company,
1992.
[6] Harwood, D. et al. Texture Classification by Center-Symmetric Auto-Correlation,
using Kullback Discrimination of Distribution. Pattern Recognition Letters. Vol. 16,
pp. 1-10. 1995.
[7] Laws, K.Y. Texture Image Segmentation. Ph.D. Thesis. University of Southern
California. January 1980.
[8] Francisco Ibarra Picó. Análisis de Texturas mediante Coeficiente Morfológico.
Modelado Conexionista Aplicado. Ph.D. Thesis. Universidad de Alicante. 1995.
[9] García-Chamizo J.M., Crespo-Llorente A. (1992) "Orthonormalized Associative
Memories". Proceedings of the IJCNN, Baltimore, vol 1, pp. 476-481.
[10] Hao J., Vandewalle J. (1992) "A new model of neural associative memory".
Proceedings of the IJCNN92, vol 2, pp. 166-171.
[11] Hopfield J.J. (1982) "Neural networks and physical systems with emergent
collective computational abilities". Proceedings of the National Academy of Sciences,
vol 79, pp. 2554-2558.
[12] Hopfield J.J. (1984) "Neurons with graded response have collective
computational properties like those of two-state neurons". Proceedings of the National
Academy of Sciences, vol 81, pp. 3088-3092.
[13] Ibarra-Picó F., García-Chamizo J.M. (1993) "Bidirectional Associative
Orthonormalized Memories". Actas AEPIA, vol 1, pp. 20-30.
[14] Kosko, B. (1988a) "Bidirectional Associative Memories". IEEE Trans. on Systems,
Man & Cybernetics, vol 18.
[15] Kosko, B. (1988b) "Competitive adaptive bidirectional associative
memories". Proceedings of the IEEE First International Conference on Neural Networks,
eds. M. Caudill and C. Butler, vol 2, pp. 759-766.
[16] Pao You-Han. (1989) "Adaptive Pattern Recognition and Neural Networks".
Addison-Wesley Publishing Company, Inc. pp. 144-148.
[17] Wang, Cruz F.J., Mulligan (1990a) "On Multiple Training for Bidirectional
Associative Memory". IEEE Trans. on Neural Networks, 1(5), pp. 275-276.
[18] Wang, Cruz F.J., Mulligan (1990b) "Two Coding Strategies for Bidirectional
Associative Memory", IEEE Trans. on Neural Networks, pp. 81-92.
Genetic Algorithm Based Training for Multilayer
Discrete-Time Cellular Neural Networks

P. López, D.L. Vilariño, and D. Cabello

Department of Electronics and Computer Science.
University of Santiago de Compostela.
15706 Santiago de Compostela
Tel.: +34 981 563100 Ext. 13559; Fax: +34 981 599412
E-mail: paula@dec.usc.es; dlv@dec.usc.es; diego@dec.usc.es

Abstract. Genetic Algorithms are applied to optimize the synaptic couplings
of multilayer Discrete-Time Cellular Neural Networks for image
processing. The only information required during the training phase of
the network is the global input and the corresponding desired output.
Therefore all the coefficients of the different layers are optimized simultaneously,
without using a priori knowledge of the behaviour of each layer.

1 Introduction

Cellular Neural Networks (CNN) [1] are a neural network model encompassed by
the dynamic network category. They are characterized by the parallel computing
of simple, locally interconnected processing elements (so-called cells).
On the other hand, many image processing tasks consist of simple operations
restricted to the neighbourhood of each pixel in the image under processing.
Therefore, they are directly mapped onto a CNN architecture. This fact, along
with the possible implementation of the CNN as an integrated circuit, makes
these architectures an interesting choice for those image processing applications
needing high processing speeds.
In order to approach a given task by means of a CNN architecture, the
weights of the connections among cells must be determined. This is usually
achieved after a heuristic design, which requires a good definition of the problem
under consideration, or through the use of learning algorithms [2]. Most of these
algorithms consist of adaptations of classical learning algorithms and lead to
good solutions for those applications projected onto single layer CNNs. However,
many of them fail when multiple CNN operations are required.
Multiple CNN operations, which are needed for the resolution of complex problems,
can be implemented using the CNN Universal Machine (CNN-UM) [3].
The CNN-UM consists of an algorithmically programmable analog array computer
which makes it possible to approach complex problems by splitting them up into
simpler operations (many of which are even implemented in existing libraries
and subroutines [4]).

Another way to approach those complex tasks is given by the use of the
discrete-time extension of the CNN (DTCNN) [5]. Due to the synchronous
processing in DTCNNs, a robust control over the propagation velocity is possible,
facilitating the extension to multilayer structures [6]. This allows the global
problem to be approached directly. However, the high complexity of the dynamical
behaviour of this kind of structure makes most of the learning algorithms applied
to single layer structures unsuitable. Usually the learning process in multilayer
systems is tackled by considering the optimization of each layer independently,
either heuristically or by means of single-layer learning algorithms. However, a
global training process where all the weights of the different layers are optimized
at the same time can be of interest.
In this work we present a global learning strategy for multilayer DTCNN
architectures. We apply a stochastic optimization method, namely Genetic
Algorithms (GA), to simultaneously optimize all the weights of the different layers
of the system. To prove the generality of the method, we applied it to different
image processing tasks projected onto multilayer DTCNNs. First of all we tackled
the problem of training a system to perform the skeletonization of arbitrary
binary images. Next, edge detection in general grayscale images was considered.
Finally a novel active contour based technique for image segmentation is
approached using this learning strategy.
In Section 2 the notions of multilayer DTCNN architectures are briefly recalled.
Section 3 describes general GA characteristics and discusses the specific GA used.
Application examples of the GA-based training process are given in Section 4,
and the final conclusions and discussion in Section 5.

2 Multilayer Discrete Time Cellular Neural Networks

Single layer DTCNNs [5] have been shown to be an efficient tool in image processing
and pattern recognition tasks. They are completely described by a recursive
algorithm and their dynamic behaviour is based on the feedback of clocked,
binary outputs.
The equations which govern the behaviour of a multilayer DTCNN with
time-variant templates are [6]:

x_l^c(k) = Σ_{d ∈ N_r(c)} a_l^{c,d}(k) · y_l^d(k) + Σ_{d ∈ N_r(c)} b_l^{c,d}(k) · u_l^d + i_l^c(k)   (1)

y_l^c(k+1) = f(x_l^c(k)) = { +1 if x_l^c(k) > 0 ; −1 if x_l^c(k) < 0 }   (2)

where u_l^c, x_l^c(k) and y_l^c(k) are the input, internal state and output of the
c-th cell in layer l respectively. The inputs u_l^d have continuous values, and the
outputs y_l^c(k) are binary valued. The summations are performed within the
neighbourhood N_r(c) of a cell c, which is defined as the set of all cells within
distance r, including cell c itself. The feedback coefficients a_l^{c,d}(k) ∈ R, the control
coefficients b_l^{c,d}(k) ∈ R and the thresholds i_l^c(k) are called templates. For

our purpose they are time variant and translation invariant. Thus, the set of
weights which characterizes the topology of the network is greatly reduced, making
the learning process easier.
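One synchronous update step of equations (1)-(2) for a single layer can be sketched as follows (Python; the 3x3 neighbourhood and zero padding outside the array are assumptions of the sketch):

```python
import numpy as np

def dtcnn_step(y, u, A, B, i_bias):
    """y: current binary outputs (+1/-1); u: continuous input image;
    A, B: 3x3 feedback and control templates; i_bias: threshold.
    Returns y(k+1) = f(x(k)) with f the sign-like activation of (2)."""
    h, w = y.shape
    yp, up = np.pad(y, 1), np.pad(u, 1)
    x = np.full((h, w), float(i_bias))
    for dr in range(3):                 # accumulate template contributions
        for dc in range(3):
            x += A[dr, dc] * yp[dr:dr + h, dc:dc + w]
            x += B[dr, dc] * up[dr:dr + h, dc:dc + w]
    return np.where(x > 0, 1, -1)
```

A multilayer network would iterate such steps per layer, with the time-variant templates selected at each k.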

3 Genetic Algorithm Description

To find a set of parameters so that a network performs according to a given task
is one of the core problems in the field of neural networks. Deterministic methods
offer good results when the problem has a limited complexity. For more complex
tasks they usually become slow and have the same limitation as the gradient-based
methods: they can be easily misled by local minima.
Stochastic learning methods are especially well suited for problems with a
high degree of complexity where there is little previous knowledge about the
function to be optimized. Genetic Algorithms (GA) [7] are one of the most promising
techniques in this category. Although they lack a rigorous mathematical
background, they have proved to be suitable for complex optimization problems where
an analytical solution is not directly available. Their independence of the initial
conditions and the domain of application, combined with their implicit parallelism,
are other advantages over classical optimization methods.
A GA consists of a statistical search method for optimal or quasi-optimal solutions
that uses some kind of codification to represent the possible solutions. It
has often been referred to as a blind, codificated and multiple search method. It
is said to be blind because the searching process is independent of the particular
domain of application. It is codificated due to the need of coding any possible
candidate solution into a string (chromosome). And it is multiple because a
complete population of candidate solutions to the problem is evaluated during
each iteration of the algorithm. A GA begins with an initial population (usually
randomly generated) of possible solutions, where each individual in the population
represents a point in the search space of the problem. After an initial
population is generated, it evolves by means of the recursive application of a set
of genetic operators, mainly crossover and mutation. The new populations will
evolve in the sense of minimizing a given criterion. The fitness of each individual
will be a measure of the degree of adaptation of that solution to the minimization
problem.
In the next section we show some examples of application of GAs to the
optimization of complex multilayer DTCNN architectures. In all the cases, we made
use of elitism strategies, that is, the best individual found in each generation was
preserved in the next population. The selection mechanism was the well known
roulette-wheel method and the crossover operator was the two-point crossover.
In the training phase of the network the input (u^c) and the corresponding desired
output (y_d^c) for each cell c are given. The objective of the training process
will be to find the set of parameters (synaptic couplings of the network) that best
fits the desired output. The fitness value that measures the degree of adaptation
of each individual in the population will be a function of the following expression:

error^c = (y^c(k_end) − y_d^c)²   (3)

where y^c(k_end) is the output of cell c once convergence is reached at time
interval k_end and y_d^c is the desired output value. The total error value is
computed over all the cells of the network.
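The training loop described here can be sketched as follows (Python; the population size, operator rates and the Gaussian mutation step are illustrative, and `evaluate` stands for running the DTCNN to convergence and scoring the output against the target, e.g. the inverse of the total error of (3)):

```python
import random

def roulette(pop, fits):
    """Roulette-wheel selection; fits are positive, higher is better."""
    r, acc = random.uniform(0, sum(fits)), 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def two_point(a, b):
    """Two-point crossover of two chromosomes (lists of coefficients)."""
    i, j = sorted(random.sample(range(len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def evolve(pop, evaluate, gens=100, pc=0.6, pm=0.01, sigma=0.1):
    for _ in range(gens):
        fits = [evaluate(ind) for ind in pop]
        best = pop[max(range(len(pop)), key=fits.__getitem__)]
        nxt = [list(best)]                     # elitism: keep the best
        while len(nxt) < len(pop):
            a, b = roulette(pop, fits), roulette(pop, fits)
            if random.random() < pc:
                a, b = two_point(a, b)
            for child in (a, b):               # Gaussian mutation per gene
                nxt.append([g + random.gauss(0, sigma)
                            if random.random() < pm else g for g in child])
        pop = nxt[:len(pop)]
    return pop
```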

4 Application Examples

4.1 Skeletonization of Binary Images


As a first example of the usefulness of GAs as a general learning strategy we
tackled the problem of optimizing the templates of a multilayer DTCNN system
for skeletonizing binary images. Some CNN skeletonization solutions have been
published, but many of them use complex multilayer structures [8], or nonuniform
architectures [6]. Using current VLSI technologies it is very difficult to build
a CNN chip which implements these solutions.
The approach presented here finds the skeleton of binary images using cyclic,
linear, 3x3 templates, making it well suited for implementation on a VLSI
chip. The algorithm that we have used is based on that in [9]. However, unlike
that approach, the template coefficients were determined by a global
learning algorithm.
The skeleton of an object can be defined as a stick figure where each pixel
is connected to only two neighbours, except the ones at the end and branch
points of the figure. Mapping this onto a DTCNN architecture that peels pixels
circularly from the object implies the use of eight layers, one for each direction
(considering both cardinal and diagonal processing directions). Since the
processing in any direction must be equivalent due to the symmetry of the problem,
it is possible to perform each step using the same layer with cyclic templates.
The input image will be the initial state of the layer and each subsequent step
will receive as an input the previous output of the layer. So, it will only be
necessary to optimize the weights corresponding to the feedback template (A)
and the current bias (I). Note that each step of the algorithm peels off black
pixels having three white and two black neighbours in the appropriate positions. For
example, the pattern to peel off a pixel when processing in the northwestern
direction will have the form:

          a a c
A_{N-W} = a b d
          c d c

The b coefficient corresponds to the pixel under processing, and the positions
marked a and d to white and black neighbours respectively. The value of the
remaining three neighbours is don't care, because they do not take part in the
decision of whether a pixel belongs to the northwestern edge or not; they correspond
to the positions marked c. In fact, this coefficient could be set to 0, but we will
also optimize its value during the training process. The processing pattern in

another direction can be found similarly. The current bias term I is considered
to be independent of the processing direction. So we have only five coefficients
to optimize. The image used during the training phase is shown in Fig. 1.

Fig. 1. Skeletonization training phase: Input and Target images respectively

A solution with a 0% error percentage in the training phase was found
after 1027 iterations of the algorithm using a population of 500 individuals. The
genetic operators used were the standard mutation and two-point crossover with
rates of 0.01 and 0.6 respectively. The coefficients obtained were the following:

a=0.5, b=3.93, c=-0.04, d=-0.46, I=-1.97

As can be noticed, the value for coefficient c, which measures the influence of
the don't care neighbours, is nearly 0. Applying the corresponding templates
to a general test image with multiple objects, the result in Fig. 2 was found.

Fig. 2. Skeletonization test phase: Input and Output images respectively

4.2 Contour Extraction on Grayscale Images

Now we consider the template optimization of a multilayer DTCNN system for
edge detection in grayscale images. In this case the input of the network during
the learning process will be the grayscale image whose edges we want to detect.
So we have a continuously valued input, in contrast with the binary valued one
of the previous example. This complicates the learning process because we
will have an infinite number of possible combinations as input patterns.
In [10], Harrer et al. used a single layer 3x3 DTCNN to perform the edge
detection of grayscale images. These templates were found heuristically and offer
good results when applied to simple images, but fail when they are used

with general complex grayscale images. Nossek in [2] proposed another DTCNN
architecture for edge detection. Here a 5x5 single layer structure is used and
its results clearly outperform the previous ones. Although the 5x5 approach
presents good behaviour even with complex images, it has the disadvantage of
its large neighbourhood size, which makes its hardware implementation difficult.
Thus, any reduction in the neighbourhood size required to perform a given task
will ease the implementation process. Obviously this should be done without
limiting the quality of the results obtained.
It is easy to see that the correlation of two 3x3 templates leads to a 5x5 one.
Unfortunately the opposite is not always true, that is, a 5x5 template cannot
always be exactly replaced by the correlation of two 3x3 templates. Despite this,
it is clear that using two 3x3 iterations implies the consideration of second order
neighbours. Considering this, we propose an edge detection system consisting of
two layers, each of them with a first order neighbourhood between cells, so that
any cell will only be directly connected to its adjacent cells. The first layer will
receive as an external input the initial grayscale image, and its previous outputs
will be fed back to its input. The second layer has as external input the output
of the previous one, and its output will be the global output of the network. We
will not make any additional assumption about the problem, so we will have
to optimize 38 coefficients, 19 per layer.
For the training phase we use a grayscale image corrupted by Gaussian noise
(σ = 100). The locations of the edges of that object are assumed to be well known
and will be used as the desired output pattern (Fig. 3).

Fig. 3. Edge detection training phase: Input and target images respectively

After 2445 iterations of the GA with a population size of 50, a solution
with a 1.6% error percentage over the training image was found. The genetic
operators used were the standard mutation and two-point crossover with rates
of 0.01 and 0.6 respectively. The templates obtained are as follows:
A1 = (  4.63  1.71  2.91 )      B1 = (  4.35  0.84 -0.72 )
     (  4.26  0.91 -2.57 )           ( -3.25 -2.63 -3.02 )
     ( -1.41  3.33  2.31 )           (  4.55 -3.26  3.74 )
I1 = -9.04

A2 = (  1.74 -2.73 -0.38 )      B2 = (  4.15  0.42 -3.93 )
     (  1.69 -9.86  3.90 )           ( -0.68 -9.26  0.37 )
     (  0.57  2.58 -1.01 )           ( -1.14 -3.47  3.09 )
I2 = -0.15
The result of applying these values to a complex grayscale image is shown in
Fig. 4.

Fig. 4. Edge detection test phase: Input and Output images respectively

4.3 Image Segmentation by Active Contours


Image segmentation techniques by means of active contours (so-called snakes)
represent an interesting approach among segmentation strategies. They usually
consist of elastic curves that, located over an image, evolve from their initial
shape and position in order to adapt themselves to the salient characteristics
of the scene. The snake evolution is usually guided by internal forces (from the
snake) that model the elasticity of the curves, as well as external forces (from
the image under consideration) that lead the snake toward features of the image.
The solution to the problem of detecting the contour is found in the minimization
of an energy function which includes both the internal and external
forces. In order to numerically compute a minimal energy solution it is necessary to
discretize the expression of the energy. Different procedures to approach both the
discretization and the minimization of the energy function can be used. However
most of them require a high computational cost. Also, due to their parametric
nature, they cannot split a contour or merge two of them into one. This limits
their application to segmentation tasks where the number of interesting objects
and their approximate location are known a priori.
The development of strategies based on active contours by means of CNNs
could become an alternative to classical active contour techniques. The main
motivation in pursuing this is the possible implementation of the proposals as a
specific integrated circuit, which allows the use of massively parallel processing
to reduce processing time. Thus, in [11], [12] a CNN-based strategy for the
segmentation of an image on the basis of active contours has been proposed.
However, because of the need for maintaining the connectivity of the snakes,
breaking of the contours is not allowed. Therefore, as in classical active contour
techniques, it is constrained to applications with previous knowledge of the
approximate location of the objects in the scene.
In the following we propose a CNN-based alternative strategy to the aforementioned
techniques in order to allow the topological transformation of the snake.
We begin with an initial region of arbitrary form which comprises the objects

of interest. The region edges will be eroded based on certain local information
until the final contours coincide with those of the objects under consideration.
Unlike the classical techniques, now the optimum location of the final contour
is not achieved by means of a global minimization process of the energy of the
snake, but as a result of the local processing of an external energy. Furthermore,
this strategy easily allows the splitting of an initial contour into several ones in
order to delimit different objects in the image under processing. A block
diagram of the proposed structure is shown in Fig. 5.

Fig. 5. Pixel-level block diagram of the multilayer DTCNN structure

In the following, a description of the behaviour of each layer is given:

1. The EP layer (Energy Processing) makes the decision of whether a pixel belonging
to the region contour should be peeled off. This decision is made by
taking into account the current location of the region (which corresponds to
the output of the ER layer) and a given information (energy image) that
acts as an external input of the layer. This energy image has real values and
remains constant during all the processing steps. The output of the layer
will be a binary image whose activated pixels will correspond to those inside
the region that should not be eroded.
2. The ER layer (Erosion Layer) effectively erodes those pixels belonging to the
region contour whose locations coincide with deactivated pixels in the output
of the previous layer. It also prevents changes in the state of pixels that do not
belong to the region frontier. The external input will be the output image
from the EP layer.
3. The edge detection layer detects the contour of the output image of the erosion
layer, and its output corresponds to the output of the overall system.

The first two layers act consecutively for each of the four cardinal directions
until reaching convergence, that is, until the erosion layer output remains
unchanged after two consecutive iterations. The number of iterations needed to
reach convergence depends on the shape and size of the initial region. For the
training phase we only considered the problem of the optimization of the first
two layers. The third layer performs an edge detection task on binary images in

order to have the contour of the region in the ER output as the global output of
the system. Templates for this simple task can be found in the literature about
CNNs.
In order to determine valid templates for this application we have carried out
a learning procedure using the training pattern in Fig. 6. The input represents
the energy image. In this case, it only includes external energy forces, in such a
way that the grey level associated with each pixel in the image is a function of
the distance to the closest border. The desired output is represented by a binary
image where the objects are perfectly defined.

Fig. 6. Training Phase: energy and target images respectively

To optimize the coefficients of the different layers a GA with Gray codification
and a population size of 100 was used. After 159 iterations of the algorithm a
solution with only a 0.2% error over the training image was found. The templates
found are:
found are:

A_EPN = (0, -0.16, -0.47, ...)   B_EPN = (0, 0.54, ...)   I_EPN = -0.15

A_ER = (0, 2, 0.010, 0.22, ...)   B_ER = (0, 9.43, 0, 0, ...)   I_ER = -6.55

Figure 7 shows the evolution of the region contour from its initial state until
its final location, superposed on the energy image. Note that the evolution starts
from a single contour which is broken in order to delimit two different objects.

5 Conclusions
In this paper we have shown that GAs can be successfully used as a global learning
strategy for multilayer DTCNN architectures that perform complex tasks
over general grayscale images. This method allows all the coefficients of the
different layers of the network to be optimized simultaneously, instead of applying a
different learning strategy to each one. Due to the global nature of the training
process, the only information required is the global input and output of the
overall structure.

Fig. 7. Example of the evolution of the contour superposed on the energy image

References
1. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory and Applications. IEEE
Transactions on Circuits and Systems. 35 (1988) 1257-1290
2. Nossek, J.A.: Design and learning with Cellular Neural Networks. International
Journal of Circuit Theory and Applications. 24 (1996) 15-24
3. Roska, T., Chua, L.O.: The CNN Universal Machine: An analogic array computer.
IEEE Transactions on Circuits and Systems. 40 (1993) 163-173
4. Roska, T., Kék, L., Nemes, L., Zarándy, Á., Brendel, M.: CSL-CNN Software
Library, Version 7.1. DNS-CADET-15, Analogical and Neural Computing Laboratory,
Computer and Automation Institute, Hungarian Academy of Sciences.
5. Harrer, H., Nossek, J.A.: Discrete-Time Cellular Neural Networks. International
Journal of Circuit Theory and Applications. 20 (1992) 453-467
6. Harrer, H.: Multiple layer Discrete-Time Cellular Neural Networks using time
variant templates. IEEE Transactions on Circuits and Systems-II: Analog and Digital
Signal Processing. 40 (1993) 191-199
7. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley Publishing Company. (1989)
8. Matsumoto, T., Chua, L.O., Yokohama, T.: Image Thinning with Cellular Neural
Networks. IEEE Transactions on Circuits and Systems 37 (1990) 638-640
9. Venetianer, P.L., Werblin, F., Roska, T., Chua, L.O.: Analogic CNN Algorithms for
Some Image Compression and Restoration Tasks. IEEE Transactions on Circuits
and Systems-I: Fundamental Theory and Applications. 42 (1995) 278-283
10. Harrer, H., Venetianer, P.L., Nossek, J.A., Roska, T., Chua, L.O.: Some Examples
of Preprocessing Analog Images with Discrete-Time Cellular Neural Networks. IEEE
International Workshop on Cellular Neural Networks and their Applications. Rome,
Italy. (1994) 18-21
11. Vilariño, D.L., Brea, V.M., Cabello, D., Pardo, J.M.: Discrete-Time CNN for Image
Segmentation by Active Contours. Pattern Recognition Letters. 19 (1998) 721-734
12. Vilariño, D.L., Cabello, D., Balsi, M., Brea, V.M.: Image Segmentation Based on
Active Contours Using Discrete-Time Cellular Neural Networks. IEEE International
Workshop on Cellular Neural Networks and their Applications. London, England.
(1998) 331-336
How to Select the Inputs for a Multilayer Feedforward
Network by Using the Training Set

Mercedes Fernández Redondo 1, Carlos Hernández Espinosa 1.

Universidad Jaume-I. Departamento de Informática. Campus Riu Sec. Edificio TI.
Castellón. Spain.
E-mail: espinosa@inf.uji.es

Abstract. In this paper, we present a review of feature selection methods based
on an analysis of the training set which have been applied to neural networks.
This type of method uses information theory concepts, interclass and intraclass
distances or an analysis of fuzzy regions. Furthermore, a methodology that
allows feature selection methods to be evaluated and compared is carefully
described. This methodology is applied to the 7 reviewed methods in a total of
15 different real world classification problems. We present an ordination of
methods according to their performance, and it is clearly concluded which method
performs better and should be used. We also discuss the applicability and
computational complexity of the methods.

1 Introduction

Neural networks (NNs) are used in quite a variety of real world applications, in which
one can usually measure a large number of variables that can be used as potential
inputs. One clear example is the extraction of features for object recognition [1]:
many different types of features can be utilized, such as geometric features,
morphological features, etc. However, usually not all variables that can be collected are
equally informative: they may be noisy, irrelevant or redundant.
Feature selection is the problem of choosing a small subset of features ideally
necessary and sufficient to perform the classification task, from a larger set of
candidate features. Feature selection has long been one of the most important topics in
pattern recognition and it is also an important issue in NNs. If one could select a
subset of variables one could reduce the size of the NN, the amount of data to process,
the training time, and possibly increase the generalization performance. This last
result is known in the bibliography and ratified in our results.
Feature selection is also a complex problem: we need a criterion to measure the
importance of a subset of variables, and that criterion will depend on the classifier. A
subset of variables could be optimal for one system and very inefficient for another.
In the bibliography there are several potential ways to determine the best subset of
features: analyze all subsets, genetic algorithms, a heuristic stepwise analysis and
direct estimations.
In the case of NNs, direct estimation methods are preferred because of the
computational complexity of training a NN. Inside this category we can make a further
classification: methods based on the analysis of the training set [2], [3], [4],
[5], [6], [7], [8], methods based on the analysis of a trained multilayer feedforward
network [9] and methods based on the analysis of other specific architectures [10].
The purpose of this paper is to make a brief review of the methods based on an
analysis of the training set, present a methodology to compare them and present the
first empirical comparison among them.
In the next section we review the methods; in Section 3 we present the comparison
methodology, the experimental results and an ordination of methods according to their
performance; and we finally conclude in Section 4.

2 Theory

The methods considered use concepts of information theory, measurements of


interclass and intraclass distances or analysis of fuzzy regions.

2.1 Information Theory

The first method reviewed was proposed by Battiti [4] (from here on named BA). The
algorithm is:
1. (Initialization) Set F to the whole set of p features and S to an empty set.
2. Compute the mutual information I(C,f) for each feature f ∈ F and the full set of
classification classes C = (c_1, c_2, ..., c_M).
3. Find the feature f that maximizes I(C,f). Then include f in S and extract f from F:
S = S ∪ {f}, F = F − {f}.
4. Repeat until the cardinal of S is k.
4.1 Compute the mutual information I(f,s) between f ∈ F and s ∈ S.
4.2 Choose g as the feature that maximizes the following equation:

I(C,f) − β · Σ_{s ∈ S} I(f,s)   (1)

where β is a parameter in the interval [0.5,1]; following the recommendations
of the author, we have tried several values (0.5, 0.6, 0.7, 0.8, 0.9, 1) in our
experiments.
4.3 Include g in S and delete g from F: S = S ∪ {g}, F = F − {g}.
This algorithm provides a subset of k features from the whole original set of candidate
features. There is no simple way of choosing the appropriate value of k, and this is a
general characteristic of all input selection methods. They provide an ordination of
inputs according to their importance, and the appropriate value of k should be chosen
experimentally by measuring the performance of several subsets, as described in the
section on experimental results.
The algorithm BA allows an ordination of features according to their importance to be
obtained: the most important is the first selected and the last selected is the least
important.

The calculation of the mutual information is as follows:

I(C,f) = Σ_{i=1..M, k=1..N} p((f ∈ c_i) ∧ (f ∈ r_k(f))) · log_2 [ p((f ∈ r_k(f)) ∧ (f ∈ c_i)) / (p(f ∈ c_i) · p(f ∈ r_k(f))) ]   (2)

where r_1(f), r_2(f), ..., r_N(f) is a partition resulting from dividing the range of f values into
equal parts, and the above sum is over all these parts of f and all the classes c_1, c_2, ...,
c_M. The appropriate number of parts is usually between 16 and 32; we have used 24 in
our experiments. In the equations, p denotes probability.
Analogously:

I(f,s) = Σ_{k,j=1..N} p((f ∈ r_k(f)) ∧ (s ∈ r_j(s))) · log_2 [ p((f ∈ r_k(f)) ∧ (s ∈ r_j(s))) / (p(s ∈ r_j(s)) · p(f ∈ r_k(f))) ]   (3)
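The BA loop with histogram-based estimates of (2) and (3) can be sketched as follows (Python; 24 bins as in the text; function names are illustrative):

```python
import numpy as np

def mutual_info(a, b, bins=24):
    """Histogram estimate of the mutual information I(a; b) in bits."""
    joint = np.histogram2d(a, b, bins=bins)[0] / len(a)
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz]
                                      / np.outer(pa, pb)[nz])).sum())

def battiti_select(X, c, k, beta=0.7, bins=24):
    """Greedy selection of k features maximizing equation (1)."""
    F = list(range(X.shape[1]))
    rel = {f: mutual_info(X[:, f], c, bins) for f in F}   # I(C, f)
    S = [max(F, key=rel.get)]                             # step 3
    F.remove(S[0])
    while len(S) < k:                                     # step 4
        score = {f: rel[f] - beta * sum(mutual_info(X[:, f], X[:, s], bins)
                                        for s in S) for f in F}
        g = max(F, key=score.get)
        S.append(g)
        F.remove(g)
    return S
```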

Another method was proposed by Chi [7] (from here on named CHI). He defined an
entropy measurement for a feature f as:

CH(f) = − Σ_{k=1}^{N} p(f ∈ r_k(f)) · Σ_{i=1}^{M} p((f ∈ c_i) ∧ (f ∈ r_k(f))) · log_2 p((f ∈ c_i) ∧ (f ∈ r_k(f)))   (4)

This magnitude is always positive, and a feature is considered more important as its
CH measurement decreases. This method also allows an ordination of features
according to their importance.
Setiono [6] proposed another method (from here on called SET) based on the use of the
normalized information gain G'_i of feature f_i to estimate the importance of a feature.
The normalized information gain can be calculated with the following equations:

I(S) = − Σ_{j=1}^{M} (n_j/n) · log_2 (n_j/n)     I(S_ik) = − Σ_{j=1}^{M} (n_ikj/n_ik) · log_2 (n_ikj/n_ik)   (5)

where n_j is the number of samples x belonging to class c_j (x ∈ c_j), n is the total number
of samples in the training set S, n_ik is the number of samples for which feature f_i takes
a value inside r_k(f_i), and n_ikj is the number of samples x for which f_i ∈ r_k(f_i) and x ∈ c_j.
And finally:

E_i = Σ_{k=1}^{N} (n_ik/n) · I(S_ik)     I_i = − Σ_{k=1}^{N} (n_ik/n) · log_2 (n_ik/n)     G_i = I(S) − E_i     G'_i = G_i / I_i   (6)

A feature is considered more important as its normalized information gain is larger.
This method also allows an ordination of features according to their importance.
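The normalized gain of equations (5)-(6) can be computed as sketched below (Python; N equal-width intervals per feature, as in the other methods; names are illustrative):

```python
import numpy as np

def normalized_gain(f, c, n_bins=24):
    """G'_i = (I(S) - E_i) / I_i for one feature f with class labels c."""
    n = len(f)
    classes = np.unique(c)
    p_cls = np.array([(c == cj).mean() for cj in classes])
    I_S = -np.sum(p_cls * np.log2(p_cls))          # class entropy I(S)
    edges = np.linspace(f.min(), f.max(), n_bins + 1)[1:-1]
    k = np.digitize(f, edges)                      # interval index per sample
    E = I_i = 0.0
    for kk in range(n_bins):
        in_bin = (k == kk)
        n_ik = in_bin.sum()
        if n_ik == 0:
            continue
        p_bin = n_ik / n
        I_i -= p_bin * np.log2(p_bin)              # split information I_i
        pj = np.array([(c[in_bin] == cj).mean() for cj in classes])
        pj = pj[pj > 0]
        E -= p_bin * np.sum(pj * np.log2(pj))      # conditional entropy E_i
    return (I_S - E) / I_i if I_i > 0 else 0.0
```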
Finally, the last method is based on the GD distance [8] (from here on named GD DST).
The algorithm is quite similar to Battiti's:
1. (Initialization) Set F to the whole set of p features and S to an empty set.
2. For all features f ∈ F compute the GD distance between the set of features F − {f} and
the set of classification classes C, d_GD(F − {f}, C).
3. Find the feature g that minimizes the GD distances of step 2. Include g in S and extract g
from F: S = S ∪ {g}, F = F − {g}.
4. Repeat steps 2 and 3 until the cardinal of S is k.

The method also yields a ranking of features similar to Battiti's method: the first feature selected is considered the most important and the last selected the least important.
The GD distance between a set of features F and the classes C is calculated as

d_{GD}(F,C) = D(F,C)^T \cdot T^{-1} \cdot D(F,C)    (7)

where T is the transinformation matrix and D(F,C) a vector of Mántaras distances:

T = \begin{pmatrix} I(f_1,f_1) & I(f_1,f_2) & \cdots & I(f_1,f_p) \\ I(f_2,f_1) & I(f_2,f_2) & \cdots & I(f_2,f_p) \\ \vdots & & & \vdots \\ I(f_p,f_1) & I(f_p,f_2) & \cdots & I(f_p,f_p) \end{pmatrix} \qquad D(F,C) = \begin{pmatrix} d(f_1,C) \\ d(f_2,C) \\ \vdots \\ d(f_p,C) \end{pmatrix}    (8)
The mutual information between features, I(f_i,f_j), can be calculated as described for method BA, eq. (3), and the Mántaras distance is

d(f_i,C) = H(f_i,C) - I(f_i,C)    (9)

where H(f,C) is the joint entropy, calculated as

H(f,C) = -\sum_{k=1}^{N} \sum_{j=1}^{M} p((f \in r_k(f)) \wedge (f \in c_j)) \cdot \log_2 p((f \in r_k(f)) \wedge (f \in c_j))    (10)
and the mutual information between a feature fi and the classes C can be calculated as
described for the method BA, eq. (2).
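Under the joint-entropy reading of H above, the GD distance of eq. (7) can be sketched as follows (mi and joint_entropy are assumed histogram estimators as before; this is an illustration, not the implementation of [8]):

import numpy as np

def joint_entropy(a, b, n_bins=24):
    # Histogram estimate of the joint entropy H(a, b) in bits.
    joint, _, _ = np.histogram2d(a, b, bins=n_bins)
    pxy = joint / joint.sum()
    nz = pxy > 0
    return float(-(pxy[nz] * np.log2(pxy[nz])).sum())

def gd_distance(X, c, mi):
    # Eq. (7): T is the transinformation matrix of pairwise feature MIs,
    # D the vector of Mantaras distances d(f_i, C) = H(f_i, C) - I(f_i, C).
    p = X.shape[1]
    T = np.array([[mi(X[:, i], X[:, j]) for j in range(p)] for i in range(p)])
    D = np.array([joint_entropy(X[:, i], c) - mi(X[:, i], c) for i in range(p)])
    return float(D @ np.linalg.inv(T) @ D)  # raises LinAlgError if T is singular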

2.2 Interclass and Intraclass Distances

The first method in this group is quite popular in pattern recognition and is called Relief [3] (from here RLF). We review here a simpler version for two-class problems; see the reference for a more complete method which supports multiclass problems and missing attributes. The algorithm is basically the following:
1. Let S be the training set, c_1 and c_2 the two classification classes, and p the number of inputs or features in every instance of the training set.
2. Initialize a vector of feature relevance weights W = (w_1, ..., w_p) randomly around 0.5.
3. Repeat the following an appropriate number of steps, m:
   3.1 Choose at random an instance x = (x_1, x_2, ..., x_p) from S.
   3.2 Choose the two instances z, y closest to x, with z ∈ c_1 and y ∈ c_2.
   3.3 If x ∈ c_1 then N_hit = z; N_miss = y; else N_hit = y; N_miss = z.
   3.4 Update the weights: for i = 1 to p, w_i = w_i − diff(x_i, N_hit_i)² + diff(x_i, N_miss_i)².
4. Normalize the relevance of the features: Relevance = (1/m)·W.
The difference diff between two features f_i and s_i of two samples f and s is, for nominal values,

diff(f_i,s_i) = \begin{cases} 0, & \text{if } f_i \text{ and } s_i \text{ are the same} \\ 1, & \text{if } f_i \text{ and } s_i \text{ are different} \end{cases}    (11)

and for numerical values

diff(f_i,s_i) = (f_i - s_i)/m_k    (12)

where m_k is a factor that normalizes the values of diff to [0,1]. A feature is considered more important as its relevance increases, and the method again yields a ranking of features.
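A compact sketch of this two-class version (features assumed numeric and already scaled to [0,1], so diff reduces to a plain difference; names are ours):

import numpy as np

def relief(X, labels, m_steps=100, seed=0):
    # Simplified two-class Relief: X is (n, p), labels the class of each row.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.full(p, 0.5)                           # step 2: weights around 0.5
    for _ in range(m_steps):
        i = rng.integers(n)                       # step 3.1: random instance x
        x = X[i]
        same = labels == labels[i]
        same[i] = False                           # exclude x itself
        d = np.abs(X - x).sum(axis=1)             # distances to every instance
        hit = X[np.flatnonzero(same)[d[same].argmin()]]
        miss = X[np.flatnonzero(~same)[d[~same].argmin()]]
        w += -(x - hit) ** 2 + (x - miss) ** 2    # step 3.4, eqs. (11)-(12)
    return w / m_steps                            # step 4: relevance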
Scherf [5] proposed another method (from here SCH). He defined a weighted distance between two instances of the training set:

d_{ij} = \sqrt{\sum_{q=1}^{p} w_q \cdot (f_q^i - f_q^j)^2}    (13)

where q = 1, ..., p indexes the features of the instances y^i = (f_1^i, f_2^i, ..., f_p^i) and y^j = (f_1^j, f_2^j, ..., f_p^j), and there is a weight w_q for each feature.
He also used the definitions

\delta_S(i,j) = \begin{cases} 1, & \text{if } y^i \text{ and } y^j \text{ are of the same class} \\ 0, & \text{otherwise} \end{cases}    (14)

\delta_D(i,j) = \begin{cases} 1, & \text{if } y^i \text{ and } y^j \text{ are of different class} \\ 0, & \text{otherwise} \end{cases}    (15)

A relation between interclass and intraclass distances is then

J = \frac{1}{N_S} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \delta_S(i,j) \cdot d_{ij} - \frac{1}{N_D} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \delta_D(i,j) \cdot d_{ij}    (16)

where

N_S = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \delta_S(i,j) \qquad N_D = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \delta_D(i,j)    (17)

and N is the number of instances in the training set S.


With this formulation we look for a set of weights w_q that minimizes J; by doing so we minimize intraclass distances and maximize interclass distances. The minimization is performed by gradient descent: we initialize the weights randomly around 0.5 and modify them by

w_q(t+1) = w_q(t) - \gamma \frac{\partial J}{\partial w_q}    (18)

The expression for the derivative can be found in the reference:

\frac{\partial J}{\partial w_q} = \frac{1}{N_S} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{\delta_S(i,j)\,(f_q^i - f_q^j)^2}{2\sqrt{\sum_{r=1}^{p} w_r (f_r^i - f_r^j)^2}} - \frac{1}{N_D} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{\delta_D(i,j)\,(f_q^i - f_q^j)^2}{2\sqrt{\sum_{r=1}^{p} w_r (f_r^i - f_r^j)^2}}    (19)

After convergence we obtain a ranking of features: a feature is considered more important if its weight is larger.
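A gradient-descent sketch of this search, using the analytic gradient of eq. (19) over all instance pairs (step size and iteration count are our own choices, not values from [5]):

import numpy as np

def scherf_weights(X, labels, steps=200, gamma=0.01, seed=0):
    # Minimize J (eq. 16) by gradient descent on the feature weights w_q.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = 0.5 + 0.1 * rng.standard_normal(p)           # init around 0.5
    iu, ju = np.triu_indices(n, k=1)                 # all instance pairs i < j
    sq = (X[iu] - X[ju]) ** 2                        # (f_q^i - f_q^j)^2 per pair
    same = labels[iu] == labels[ju]
    Ns, Nd = same.sum(), (~same).sum()               # eq. (17)
    for _ in range(steps):
        d = np.sqrt(np.clip(sq @ w, 1e-12, None))    # weighted distances, eq. (13)
        dd = sq / (2.0 * d[:, None])                 # derivative of d_ij w.r.t. w_q
        grad = dd[same].sum(0) / Ns - dd[~same].sum(0) / Nd   # eq. (19)
        w -= gamma * grad                            # eq. (18): descend on J
    return w                                         # larger weight = more important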

2.3 Analysis of Fuzzy Regions

Finally, we review a method proposed by Thawonmas [2] (from here FUZ). Fuzzy rules are generated as follows. Let X_i be the set of samples belonging to class c_i; we can generate an activation hyperbox of level 1, A_ii(1), i = 1, ..., M, by finding the maximum and minimum values of each input variable over X_i. Then, if A_ii(1) and A_jj(1) overlap, we can define an inhibition hyperbox I_ij(1) of level 1 as the intersection of the hyperboxes A_ii(1) and A_jj(1). The fuzzy rule would be:
- If the sample x is in A_ii(1) and is not in I_ij(1) for every j, j ≠ i, then x belongs to class c_i.
After that, we can define new activation hyperboxes of level 2, A_ij(2) and A_ji(2), by taking into account the samples included in the inhibition hyperbox I_ij(1) which belong to classes i and j respectively. If A_ij(2) and A_ji(2) overlap we can define a new inhibition hyperbox of level 2. This process continues until no more inhibition hyperboxes are found.
Then, we can define the exception ratio o_ij(F) for a set of features F as the sum over all levels of the ratio between the volume of the inhibition hyperbox of level n, I_ij(n), and the volume of the activation hyperbox A_ij(n), multiplied by the probability of finding a sample inside the inhibition hyperbox. See the reference for a more complete explanation and the exact equations.
The total exception ratio O(F) is the sum of o_ij(F) over all i, j with i ≠ j.
After all these calculations, we apply the following algorithm:
1. (Initialization) Set F to the whole set of p features and S to the empty set.
2. Compute O(F − {f}) for every feature f ∈ F.
3. Find the feature g that minimizes the total exception ratios calculated above.
4. Include g in S and remove g from F: F = F − {g}, S = S ∪ {g}.
5. Repeat steps 2 to 4 until the cardinal of S is k.
This method yields a ranking of features, as does method BA.

We pointed out above that all the methods yield a ranking of inputs according to their importance; beyond that, there is no simple way to choose the cardinal k of the subset of inputs that should be selected for an application. Furthermore, the ranking of inputs depends on the method: two rankings obtained from two different methods will in general differ, and so will their performance.
Every method is based on some heuristic principle, and because of the complexity of the problem the only way to compare them is empirical. This comparison is described and carried out in the following section.

3 Experimental Results

In order to compare the 7 methods we applied them to 15 different classification problems from the UCI repository of machine learning databases. They are: Abalone (AB), Balance Scale (BL), Cylinder Bands (BN), Liver Disorders (BU), Credit Approval (CR), Display 1 (DI), Glass Identification (GL), Heart Disease (HE),

Mushroom (LE), The Monk's Problems (M1, M2, M3), Pima Indians Diabetes (PI), Voting Records (VO) and Wisconsin Breast Cancer (WD). The complete data and a full description of each problem can be found in the UCI repository.
In all problems we included an additional useless input, generated at random in the interval [0,1], as the first input. It is interesting to see how important each method considers this input to be.
For each problem and method we obtained a ranking of feature importance. For example, the rankings of all methods for problem AB are shown in Table 1 (BA 0.5 means Battiti's method with β parameter equal to 0.5).

Table 1. Input importance ranking for all methods and problem AB.
Method   Least Important  ...  Most Important
BA 0.5   6 3 4 8 7 5 1 2 9
BA 0.6   6 3 4 8 7 5 1 2 9
BA 0.7   6 4 3 8 7 5 1 2 9
BA 0.8   6 4 3 8 7 5 1 2 9
BA 0.9   6 4 3 8 7 5 2 1 9
BA 1     6 4 3 8 7 5 2 1 9
SET      1 7 3 6 4 8 9 2 5
CHI      1 2 7 5 3 8 6 4 9
GD DST   1 2 7 8 5 6 9 3 4
RLF      Not applicable
SCH      2 4 6 9 3 8 7 5 1
FUZ      2 3 5 4 8 7 9 1 6

After that, we obtained several input subsets by successively deleting the least important input, down to a final subset of one input. For example, using the results of Table 1 for method BA 0.5, the first subset is obtained by deleting input {6}, the following subset by deleting inputs {6,3}, and the final subset of one input is {9}.
For every subset we wanted to measure the performance of a classifier to see how good the subset is; the classifier of interest is Multilayer Feedforward. We trained several multilayer feedforward networks to obtain a mean performance independent of the initial conditions (initial weights), together with an error for the mean obtained by standard error theory [11]. The performance criterion was the percentage of correct classifications on the test set. The number of networks trained for each subset was a minimum of ten; in many cases we trained many more in order to reduce the error in the mean, the maximum permitted error in a measurement being 3%.
Table 2 shows the results of all methods for problem AB.
Then, for each method and problem we obtain what we call the optimal subset. This is the subset which provides the best performance or, in the case of two subsets with indistinguishable performance, the one with the lower number of inputs, because it yields a simpler neural network model.
In order to see whether two performances are distinguishable we performed t-tests. The hypothesis tested was the null hypothesis μ_A = μ_B, i.e. that the two mean performances are indistinguishable.

If this null hypothesis can be rejected, we conclude that the difference between the two measurements is significant. The significance level of the tests, α, was 0.1.
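Such a comparison can be reproduced from the reported means and standard errors; a sketch (the Welch approximation and the sample sizes are our assumptions, since the text only states that at least ten networks were trained per subset; scipy is assumed available):

from scipy import stats

def distinguishable(mean_a, se_a, n_a, mean_b, se_b, n_b, alpha=0.1):
    # Test H0: mu_A == mu_B given means and standard errors of the means.
    t = (mean_a - mean_b) / (se_a ** 2 + se_b ** 2) ** 0.5
    # Welch-Satterthwaite degrees of freedom
    df = (se_a ** 2 + se_b ** 2) ** 2 / (se_a ** 4 / (n_a - 1) + se_b ** 4 / (n_b - 1))
    return 2 * stats.t.sf(abs(t), df) < alpha     # True: difference significant

# e.g. BA 0.5 (55.4 +- 0.4) vs SCH (58.7 +- 0.7) on problem AB, ten nets each:
# distinguishable(55.4, 0.4, 10, 58.7, 0.7, 10)  -> True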

Table 2. Performance of all methods in problem AB for all subsets of inputs.

                         Number of Omitted Inputs
Method     1         2         3         4       5         6         7         8
BA 0.5     50±2      56.7±1.3  48±3      52±2    53±2      55.4±0.3  55.4±0.4  46±2
BA 0.6     50±2      56.7±1.3  48±3      52±2    53±2      55.4±0.3  55.4±0.4  46±2
BA 0.7     50±2      53±2      48±3      52±2    53±2      55.4±0.3  55.4±0.4  46±2
BA 0.8     50±2      53±2      48±3      52±2    53±2      55.4±0.3  55.4±0.4  46±2
BA 0.9     50±2      53±2      48±3      52±2    53±2      55.4±0.3  45±3      46±2
BA 1       50±2      53±2      48±3      52±2    53±2      55.4±0.3  45±3      46±2
SET        51±3      48±2      49±2      53±2    52±3      52±3      51±2      41±2
CHI        51±3      55±2      58±1      58±1    50±2      55±2      52±2      46±2
GD DST     51±3      55±2      58±1      51±2    53±3      59±1      49±3      51±2
RLF        Not applicable
SCH        58.7±0.7  53±2      58.7±0.7  48±2    51.4±1.9  46±3      43±3      32.3±0.4
FUZ        59±1      57±1      54±2      54±2    55±2      51±2      46±2      44±1

For example, from the results of Table 2, the performance of method BA 0.5 for the subset with 2 omitted inputs is the best mean performance, but it is indistinguishable (according to the described t-test) from that of the subsets with 6 and 7 omitted inputs. Since the performance differences among subsets 2, 6 and 7 are not significant, we select the subset with the lower number of inputs, the subset with 7 omitted inputs, as the optimal one. The inputs in this subset are the most appropriate for designing an application for problem AB according to method BA 0.5.
Likewise, for method SCH there are two subsets, 1 and 3, with the same performance, which is better than and distinguishable from that of the rest of the subsets. Again, we select the subset with the lower number of inputs, subset 3, as the optimal one.
After obtaining the optimal subsets for each method and problem, we can compare the performance of different methods on the same problem by comparing the performance of their optimal subsets, and the number of inputs in the case of indistinguishable performance. Again, we performed t-tests to detect significant differences.
For example, the results for method BA 0.5 (55.4±0.4) and SCH (58.7±0.7) in Table 2 are distinguishable and we can conclude that the performance of method SCH is better for problem AB.
By comparing all the methods two by two, we can build another table showing whether one method is better or worse than another for a concrete problem. An extract of that table is given in Table 3.
extract of that table is in Table 3.

Table 3. Performance comparison of methods BA 0.5 and SCH in all problems.

         BA 0.5
SCH      Best BA:  BL, BN, CR, DI, GL, HE, M1, M3, PI, VO
         Equal:    M2
         Best SCH: AB, BU, LE, WD

For example, we can see that method BA 0.5 is better than SCH in problems BL, BN, CR, DI, GL, HE, M1, M3, PI and VO. The number of problems where BA 0.5 performs better is larger, so we can expect its performance to be better than that of SCH.
Following this methodology and this type of comparison with the full results (not presented here for lack of space) we obtain the following ranking:

GD DST > BA 0.5 > RLF = BA 0.7 = BA 0.9 = BA 1.0 > BA 0.6 = BA 0.8 = SET > CHI > FUZ > SCH

The best method is the GD distance, followed by Battiti with a low value of β (β = 0.5); the differences between methods RLF and SET are not very important, and the worst methods are clearly SCH and FUZ.
However, we can and should further discuss the applicability of every method. For example, to apply the GD distance the transinformation matrix T must not be singular, yet we found a singular matrix (within the working precision, double floating point) in 3 of the 15 problems. This method is the best, but because of its limited applicability one may instead use Battiti with a low value of β.
Another method with limited applicability is FUZ: consider the case where the activation hyperboxes of level 1, A_ii(1), i = 1, ..., M, do not overlap. In that case there are no inhibition hyperboxes and the method is not applicable. We found this situation in 3 of the 15 problems.
Finally, the method Relief (RLF) was not applicable, for several reasons, in 4 of the 15 problems.
Another important question is the computational complexity. The methods based on an analysis of the training set are usually characterized by a low computational cost. This is true for all the methods reviewed except Scherf (SCH), which performs a gradient descent search with a computational cost larger than the training of a neural network.

4 Conclusions

We have presented a review of the feature selection methods based on an analysis of the training set which have been applied to neural networks. We have also carefully presented a methodology that allows selecting an optimal subset of inputs and evaluating and comparing feature selection methods. This methodology was applied to the 7 reviewed methods on a total of 15 different real-world classification problems. Finally, we presented a ranking of the methods according to their performance, from which it was clearly concluded which method performs better and should be used. We also discussed the applicability and computational complexity of the methods.

References

1. Devena, L.: Automatic selection of the most relevant features to recognize objects. Proc. of the Int. Conf. on Artificial NNs, vol. 2, pp. 1113-1116, 1994.
2. Thawonmas, R., Abe, S.: Feature reduction based on analysis of fuzzy regions. Proc. of the 1995 Int. Conf. on Neural Networks, vol. 4, pp. 2130-2133, 1995.
3. Kira, K., Rendell, L.A.: The feature selection problem: Traditional methods and a new algorithm. Proc. of 10th Nat. Conf. on Artif. Intellig., pp. 129-134, 1992.
4. Battiti, R.: Using mutual information for selecting features in supervised neural net learning. IEEE Trans. on Neural Networks, vol. 5, n. 4, pp. 537-550, 1994.
5. Scherf: A new approach to feature selection. Proc. of the 6th Conf. on Artificial Intelligence in Medicine (AIME'97), pp. 181-184, 1997.
6. Setiono, R., Liu, H.: Improving backpropagation learning with feature selection. Applied Intellig.: The Int. Journal of Artif. Intellig., NNs, and Complex Problem-Solving Technologies, vol. 6, n. 2, pp. 129-139, 1996.
7. Chi, Jabri: An entropy based feature evaluation and selection technique. Proc. of the 4th Australian Conf. on NNs (ACNN'93), pp. 193-196, 1993.
8. Lorenzo, Hernández, Méndez: Attribute selection through a measurement based on information theory (in Spanish). 7a Conferencia de la Asociación Española para la Inteligencia Artificial (CAEPIA 1997), pp. 469-478, 1997.
9. Tetko, I.V., Villa, A.E.P., Livingstone, D.J.: Neural network studies 2. Variable selection. Journal of Chem. Inf. Comput. Sci., vol. 36, n. 4, pp. 794-803, 1996.
10. Watzel, R., Meyer-Bäse, A., Meyer-Bäse, U., Hilberg, H., Scheich, H.: Identification of irrelevant features in phoneme recognition with radial basis classifiers. Proc. of 1994 Int. Symp. on Artificial NNs, pp. 507-512, 1994.
11. Bronshtein, I., Semendyayev, K.: Mathematics Handbook for Engineers and Students (in Spanish). MIR, Moscow, 1977.
Neural Implementation of the JADE-Algorithm

Christian Ziegaus and Elmar W. Lang

Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany


Christian.Ziegaus@stud.uni-regensburg.de

Abstract. The Joint Approximative Diagonalization of Eigenmatrices (JADE) algorithm [6] is an algebraic approach for Independent Component Analysis (ICA), a recent data analysis technique. The basic assumption of ICA is a linear superposition model in which unknown source signals are mixed together by a mixing matrix. The aim is to recover the sources, respectively the mixing matrix, from the mixtures with only minimal or no knowledge about the sources. We present a neural extension of the JADE algorithm, discuss the properties of this extension, and apply it to an arbitrary mixture of real-world images.

1 Introduction

Principal Component Analysis (PCA) is a well-known tool for multivariate data analysis and signal processing. PCA finds the orthogonal set of eigenvectors of the covariance matrix and therefore responds to second-order information of the input data. One frequent application of PCA is dimensionality reduction. But second-order information is only sufficient to describe data that are gaussian or close to gaussian. In all other cases higher-order statistical properties must be considered to describe the data appropriately. A recent technique that includes PCA and that uses higher-order statistics of the input is Independent Component Analysis (ICA).
The basic assumption in performing an ICA is a linear mixture model representing an n-dimensional real vector x = [x_0, ..., x_{n-1}]^T as a superposition of m linearly independent but otherwise arbitrary n-dimensional signatures a^{(p)}, 0 ≤ p < m, forming the columns of an n × m-dimensional mixing matrix A = [a^{(0)}, ..., a^{(m-1)}]. The coefficients of the superposition, interpreted as an m-dimensional vector s = [s_0, ..., s_{m-1}]^T, lead to the basic equation of linear ICA:

x = As.    (1)

The influence of an additional noise term is assumed to be negligible and will not be considered here. The components of s are often called source signals, those of x mixtures. This reflects the basic assumption that x is given as a mixture of the source signals s; x is the quantity that can be measured. It is often assumed that the number of mixtures equals the number of sources (n = m).
A few requirements on the statistical properties of the sources have to be met for ICA to be possible [4]. The source signals are assumed to be statistically independent and stationary processes, with at most one of the sources following a normal distribution, i.e. having zero kurtosis. Additionally, for the sake of simplicity it can be taken for granted that all source signals are zero mean, E{s_i} = 0, 0 ≤ i < m.
The implementation of an ICA can in principle be seen as the search for an m × n-dimensional linear filter matrix W = [w^{(0)}, ..., w^{(m-1)}]^T whose output

y = Wx, \qquad y_i = \sum_{j=0}^{n-1} w_j^{(i)} x_j

reconstructs the source signals s. Ideally the problem could be solved by choosing W according to

WA = I_m,

where I_m represents the m-dimensional unit matrix. But it is clear that the source signals can only be recovered arbitrarily permuted and with a scaling factor, possibly leading to a change of sign, because there is a priori no predetermination of which filter leads to which source signal. This means it is impossible to distinguish As from Ãs̃ with Ã = A(PS) and s̃ = (PS)^{-1} s, where P represents an arbitrary permutation matrix and S a scaling matrix with nonzero diagonal elements [2].
The determination of an arbitrary mixing matrix A can be reduced to the problem of finding an orthogonal matrix U by using second-order information of the input data [3][9][12]. This can be done by whitening or sphering the data via an m × n-dimensional whitening matrix W_s, obtained from the correlation matrix R_x (R_x^{ij} := E{x_i x_j}) of x, leading to

z = W_s x, \qquad R_z = E{zz^T} = I_m

with I_m the m-dimensional unit matrix. The m × m-dimensional orthogonal matrix U that has to be determined after this preprocessing is then given by

z = W_s x = W_s A s = U s.    (2)
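A minimal numpy sketch of this sphering step (assuming n = m and a full-rank correlation matrix; names are ours):

import numpy as np

def whiten(x):
    # x is (n, T): n mixtures, T samples.  Returns W_s and z = W_s x
    # with E{z z^T} = I_m (eigendecomposition of the correlation matrix).
    x = x - x.mean(axis=1, keepdims=True)
    Rx = x @ x.T / x.shape[1]                 # R_x = E{x x^T}
    lam, E = np.linalg.eigh(Rx)
    Ws = E @ np.diag(1.0 / np.sqrt(lam)) @ E.T
    return Ws, Ws @ x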

2 Determination of the mixing matrix


Several algorithms for ICA have been proposed recently (see e.g. [11] and references therein). We focus here on an algebraic approach called JADE (Joint Approximative Diagonalization of Eigenmatrices) [6].

2.1 Basic Definitions

Consider z as an m-dimensional zero-mean, real-valued random vector. The second- and fourth-order moment and correlation tensors of z are given by Corr(z_i, z_j) = E{z_i z_j} and Corr(z_i, z_j, z_k, z_l) = E{z_i z_j z_k z_l} respectively, where E denotes the expectation. The corresponding cumulant tensors of z are defined as the coefficients of the Taylor expansion of the cumulant generating function [10] Φ(k) = log(φ(k)), where φ(k) is the Fourier transform of the probability density function (pdf) p(z). Relations exist between moment and cumulant tensors, which for the fourth-order cumulant read

Cum(z_i, z_j, z_k, z_l) = E{z_i z_j z_k z_l} - E{z_i z_j}E{z_k z_l} - E{z_i z_k}E{z_l z_j} - E{z_i z_l}E{z_j z_k},    (3)

whereby E{z_i} = 0, 0 ≤ i < m is assumed. Often the autocumulants play a decisive role; they are given by

σ_i = Cum(z_i, z_i)  (variance), \qquad κ_i = Cum(z_i, z_i, z_i, z_i)  (kurtosis).

The Fourth Order Signal Subspace (FOSS) is defined as the range of the linear mapping

M → Q_z(M), \qquad [Q_z(M)]_{ij} = \sum_{k,l=0}^{m-1} Cum(z_i, z_j, z_k, z_l) M_{kl},

where M denotes an arbitrary m × m-dimensional real matrix. The matrices Q_z(M) will be called cumulant matrices in the following.
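A sample-based sketch of Q_z(M) that uses eq. (3) to turn fourth-order moments into cumulants (our own vectorized arrangement, assuming z is zero mean):

import numpy as np

def cumulant_matrix(z, M):
    # z is (m, T) zero-mean data, M an arbitrary (m, m) real matrix.
    m, T = z.shape
    q = np.einsum('kt,kl,lt->t', z, M, z)        # z_t^T M z_t for each sample
    moment = (z * q) @ z.T / T                   # E{ z_i z_j (z^T M z) }
    Rz = z @ z.T / T                             # second-order correlations
    # subtract the three pairings of eq. (3), each summed against M_kl:
    return moment - Rz * np.trace(M @ Rz.T) - Rz @ (M + M.T) @ Rz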

2.2 Representations of the cumulant matrices
For the determination of the orthogonal mixing matrix U according to the
whitened model (2) it will be necessary to represent the cumulant matrices first
by the orthogonal mixing matrix and second by an eigendecomposition of the
fourth-order cumulant of z.

Mixing matrix A. Using the multilinearity property of the fourth-order cumulant [5] leads to

Cum(z_i, z_j, z_k, z_l) = \sum_{\alpha,\beta,\gamma,\delta=0}^{m-1} u_i^{(\alpha)} u_j^{(\beta)} u_k^{(\gamma)} u_l^{(\delta)} \kappa_{\alpha\beta\gamma\delta}^s,    (4)

with κ^s_{αβγδ} = Cum(s_α, s_β, s_γ, s_δ). This yields a representation of the cumulant matrices in terms of the orthogonal mixing matrix U:

Q_z(M) = U \Lambda(M) U^T = \sum_{\alpha,\beta=0}^{m-1} [\Lambda(M)]_{\alpha\beta} \, u^{(\alpha)} u^{(\beta)T}    (5)

[\Lambda(M)]_{\alpha\beta} := \sum_{\gamma,\delta=0}^{m-1} \kappa_{\alpha\beta\gamma\delta}^s \sum_{k,l=0}^{m-1} u_k^{(\gamma)} M_{kl} u_l^{(\delta)}

At this point nothing has been assumed about the statistical structure of the sources. From the statistical independence of the components of s it follows that the cumulants of all orders of s are diagonal, leading to [Λ(M)]_{ij} = δ_{ij} κ_i u^{(i)T} M u^{(j)}. The FOSS is thus given as

Range(Q_z) = Span(u^{(0)} u^{(0)T}, ..., u^{(m-1)} u^{(m-1)T})    (6)
           = \{ M \mid M = \sum_{p=0}^{m-1} c_p u^{(p)} u^{(p)T} \}
           = \{ M \mid M = U \Lambda U^T, \ \Lambda \text{ diagonal} \}.

This means that the dimensionality of the FOSS equals m, the number of sources.

Eigenmatrices of the fourth-order cumulant. The fourth-order cumulant tensor of z is (super)symmetric, which means that Cum(z_i, z_j, z_k, z_l) is invariant under every permutation of z_i, z_j, z_k, z_l. By resorting the m⁴ elements of the fourth-order cumulant tensor into an m² × m²-dimensional symmetric matrix (stacking-unstacking, see [6]), an eigenvector decomposition can be performed, leading to an eigenmatrix decomposition of Cum(z_i, z_j, z_k, z_l) (after rearranging the resulting eigenvectors) such that

Cum(z_i, z_j, z_k, z_l) = \sum_{p=0}^{m^2-1} \lambda^{(p)} M_{ij}^{(p)} M_{kl}^{(p)}    (7)

holds, with symmetric m × m-dimensional eigenmatrices M^{(p)}, 0 ≤ p < m². We will assume that there exists an m × m-dimensional orthogonal matrix D = [d^{(0)}, ..., d^{(m-1)}] jointly diagonalizing all eigenmatrices of Cum(z_i, z_j, z_k, z_l):

D^T M^{(p)} D = \Lambda^{(p)} = Diag(\mu_1^{(p)}, ..., \mu_m^{(p)}), \quad 0 ≤ p < m^2,    (8)

where Diag(·) denotes the m × m-dimensional diagonal matrix with the m arguments as diagonal elements. The joint diagonalizer D can be found by a maximization of the joint diagonality criterion [7]

c(V) = \sum_{p=0}^{m^2-1} |Diag(V^T M^{(p)} V)|^2,    (9)

where |Diag(·)| is the norm of the vector of diagonal matrix elements. This is equivalent to a minimization of

\sum_{p=0}^{m^2-1} off(V^T M^{(p)} V),    (10)

where off(W) for an arbitrary m × m-dimensional matrix W = (w_{ij})_{0≤i,j<m} is defined as

off(W) := \sum_{i \neq j} |w_{ij}|^2.

It can be shown [5] that (9) is equivalent to

c(V) = \sum_{i,k,l=0}^{m-1} |Cum(h_i, h_i, h_k, h_l)|^2,    (11)

where h = V^T z. Consequently the maximization of c(V) is the same as the minimization of the squared (cross-)cumulants with distinct first and second indices. For V = U, h is equivalent to the source signals s.
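These two criteria translate directly into a small sketch (names ours):

import numpy as np

def off(W):
    # off(W): sum of the squared off-diagonal elements, eq. (10).
    return float((W ** 2).sum() - (np.diag(W) ** 2).sum())

def joint_diagonality(V, eigenmatrices):
    # c(V) of eq. (9); maximizing it over orthogonal V is equivalent to
    # minimizing sum_p off(V^T M^(p) V).
    return float(sum((np.diag(V.T @ M @ V) ** 2).sum() for M in eigenmatrices))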
According to (8), for each M^{(p)} there exists an eigenvector decomposition

M^{(p)} = \sum_{i=0}^{m-1} \mu_i^{(p)} d^{(i)} d^{(i)T}.    (12)

Using (7) and (12) leads to a new representation of the cumulant matrices,

Q_z(M) = D \Gamma(M) D^T, \qquad \Gamma(M) = Diag(\gamma_0^{(M)}, ..., \gamma_{m-1}^{(M)}),    (13)

with γ_r^{(M)}, 0 ≤ r < m, defined as

\gamma_r^{(M)} = \sum_{p=0}^{m^2-1} \lambda^{(p)} \mu_r^{(p)} \left( \sum_{k,l=0}^{m-1} M_{kl}^{(p)} M_{kl} \right).    (14)

The eigenmatrix decomposition of the fourth-order cumulant leads to m² symmetric matrices M^{(p)}. On the other hand, the set of cumulants of order d of a real m-dimensional random vector z forms a real vector space of dimension

\mathcal{D}(m,d) = \binom{m+d-1}{d}.

In the generic case, as defined in [8], the dimension or generic width \mathcal{G}(m,d) is even smaller than \mathcal{D}(m,d). A few values of \mathcal{G}(m,d) are given in Table 1, compared to \mathcal{D}(m,d) and m². Additionally, it follows from (6) that only m of the m² possible eigenvalues λ^{(p)} are nonzero, which means that only m eigenmatrices should really be important.

Determination of the mixing matrix. From (5) and (13) we see that the cumulant matrices are diagonalized both by the orthogonal mixing matrix U and by the joint diagonalizer D of the set of eigenmatrices of the fourth-order cumulant. Because an eigenvector decomposition is only unique up to an arbitrary orthogonal matrix, the question arises whether the orthogonal mixing matrix U can be identified with the joint diagonalizer of the eigenmatrices M^{(p)}, 0 ≤ p < m. The answer is yes, and the reason is given by Theorem 2 in [12], which states that D is equal to the transpose (= inverse) of U up to a sign permutation and a rescaling if two conditions are fulfilled:
1. Cum(h_i, h_j) = δ_{ij},
2. Cum(h_i, h_j, h_k, h_l) = 0 for at least two nonidentical indices,
with h = D^T z. While the first condition is fulfilled by our orthogonal model (2), condition two is ensured by the way the joint diagonalizer D is determined through (11).

2.3 Neural learning of eigenmatrices


Recently [13] an extension of Oja's learning algorithm for PCA has been devised to account for higher-order correlations within any given input data. The main idea is that capturing higher-order statistics of the input space implies the use of a neuron which is capable of accepting input from two or more channels at once. In the single-neuron case, the learning rule is given as

\Delta w_{ij}(t) = \eta(t)\, y(t) \{ z_i(t) z_j(t) - y(t) w_{ij}(t) \}    (15)

with

y(t) = \sum_{i,j=0}^{m-1} w_{ij}(t) z_i(t) z_j(t).    (16)
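One stochastic update of this rule can be written as follows (a sketch; the learning rate schedule is left abstract and the function name is ours):

import numpy as np

def higher_order_oja_step(W, z, eta=1e-3):
    # Single-neuron rule (15)-(16): the neuron receives the pair products
    # z_i z_j and performs an Oja-style Hebbian step on them.
    y = z @ W @ z                                   # eq. (16)
    return W + eta * y * (np.outer(z, z) - y * W)   # eq. (15)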

Averaging over the ensemble of input data used for training the network leads to an eigenequation of the fourth-order correlation tensor,

\sum_{k,l=0}^{m-1} Corr(z_i, z_j, z_k, z_l) M_{kl} = \mu M_{ij}

\mu = \sum_{i,j,k,l=0}^{m-1} M_{ij} Corr(z_i, z_j, z_k, z_l) M_{kl}.

From equation (3) it can be seen that the main difference between the fourth-order cumulant and the corresponding correlation tensor is an explicit suppression of two-point correlations. Taking the latter into account, we propose a new

weight update rule

\Delta w_{ij}(t) = \eta(t)\, y(t) \{ (z_i(t) z_j(t) - \delta_{ij}) - w_{ij}(t) (y(t) - Tr(W)) \}

for which the weights M after convergence satisfy an eigenequation of the fourth-order cumulant:

\sum_{k,l=0}^{m-1} Cum(z_i, z_j, z_k, z_l) M_{kl} = \lambda M_{ij}

\lambda = \sum_{i,j,k,l=0}^{m-1} M_{ij} Corr(z_i, z_j, z_k, z_l) M_{kl} - 2 - Tr^2(M).    (17)

Tr(·) denotes the trace of the matrix argument. The corresponding weight update rule in the case of m output neurons reads

\Delta w_{ij}^{(p)}(t) = \eta(t)\, y_p(t) \{ (z_i(t) z_j(t) - \delta_{ij}) - \sum_q w_{ij}^{(q)}(t) (y_q(t) - Tr(W^{(q)})) \}    (18)

where w_{ij}^{(p)} denotes the weight matrix of output neuron p connecting to input neurons i and j. The upper bound of the sum over q representing the decay term is intentionally left unspecified. The learning rule can be implemented in two different ways, namely with 0 ≤ q < m (Oja-type) and 0 ≤ q ≤ p (Sanger-type). While in the first case the resulting weight matrices belong to approximately equal eigenvalues, the second case leads to weights whose corresponding eigenvalues are obtained in decreasing order. The latter can thus give information about the number of eigenmatrices necessary to span the FOSS.
Finding an (approximative) orthogonal joint diagonalizer D by maximizing (9) can be interpreted as determining something like an average eigenstructure [7]. Since the criterion (10) can only be minimized but cannot generally be driven to zero, this notion corresponds only to an approximate simultaneous diagonalization, though the average eigenstructure nevertheless is well defined.

3 Experimental Results

To investigate the properies of the neural implementation of the JADE-algorithm,


we applied it to the problem of separating arbitrary mixtures of real-world
grayscale images also known as Blind Source Separation (BSS). The image en-
semble can be found in figure 1.
For simplicity we took the same number of sources, mixtures and sources
that have to be recovered (n = m). The source signals si (x, y), where i denotes
the image number and 0 < x, y < 256 the position within the image are given as
the pixel values of the images. The components of the mixing matrix have been
chosen normal distributed from the interval [-1.0, 1.0], but the special choice of
the interval proofed not to be too important. For the results obtained using the
494

Fig. 1. Image ensemble used to evaluate the algorithm developed within this paper. It
consists of 1. the three letters ICA, 2. a painting by Franz Marc titeled 'Der Tiger', 3.
an urban image, 4. normal distibuted noise, 5. the Lena image and 6. a natural image
gather from the natural image ensemble used in [2]. They are all 256 x 256 pixels in
size with pixel values in the intervall [0,..., 255]. The images have been normalized to
yield unit variance and the mean pixel value has been substracted from each picture.

Oja- respectively Sanger-type learning rule for each m under consideration the
same mixing matrix has been used.
Since statistical independence of the source signals is an important condition for separating them from the mixtures, we calculated the source correlation matrix (0 ≤ i, j < m)

S_{ij} := \frac{1}{256^2} \sum_{x,y=0}^{255} s_i(x,y)\, s_j(x,y).

We found many components to be (slightly) different from zero, indicating inter-image correlations which violate the statistical independence assumption of the sources. ICA has been realized with m = 4, 5, 6 of the images using the learning rule with the Oja- and the Sanger-type decay term. After convergence of the neural network the joint diagonalizer for the set of weight matrices has been calculated using an extension of the Jacobi technique [7]. The resulting value of the cross-talking error E of the product P = D^T U with

E = \sum_{i=0}^{m-1} \left( \sum_{j=0}^{m-1} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j=0}^{m-1} \left( \sum_{i=0}^{m-1} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right)    (19)

has been calculated to get a measure of how well the demixing or separation has been performed. The closer E is to zero the better the separation; a value E ≈ 1-3 usually indicates good demixing.
Table 2 summarizes the experimental results. It can be seen that the average eigenstructure can be determined better with the Oja-type learning rule, leading to much better separation results (see also [15]). Figure 2 shows the eigenvalues obtained using the Sanger-type learning rule. For the determination of the joint diagonalizer D only the first 10, 15, 22 (m = 4, 5, 6) eigenmatrices have been used. After convergence the weights with numbers greater than 10, 15, 22 (m = 4, 5, 6) have died away, which means that their norm converged to zero (see table 1).
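Eq. (19), as reconstructed above, translates into a few lines (a sketch; the function name is ours):

import numpy as np

def cross_talk(P):
    # Cross-talking error E of eq. (19) for P = D^T U; close to zero when
    # P is a scaled permutation, i.e. for a perfect separation.
    A = np.abs(P)
    rows = (A.sum(axis=1) / A.max(axis=1) - 1).sum()
    cols = (A.sum(axis=0) / A.max(axis=0) - 1).sum()
    return float(rows + cols)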

Table 2. Summary of the simulation results with m = 4, 5, 6. The table shows the cross-talking error E as defined in (19) obtained using the Oja-type and the Sanger-type learning rule (18).

m   E (Oja)   E (Sanger)
4   2.05      7.95
5   1.92      16.69
6   3.98      25.25


Fig. 2. Eigenvalues λ + 2 according to (17) with m = 4, 5, 6 using the Sanger-type learning rule, plotted against the weight number. The whole set of possible eigenmatrices has been calculated, but only a subset has been used for the determination of the joint diagonalizer.

4 Discussion

Use of nonlinearities. Many ICA algorithms incorporate nonlinearities in some more or less specified way. The main idea behind them is that nonlinearities provide a way to introduce higher-order statistical properties of the input data into the calculations. Details about how they influence the determination of the mixing matrix A are rarely given. With this deficiency in mind, the choice of the nonlinearity has been called a 'black art' in [1]. To overcome this unsatisfactory situation we tried to avoid any kind of 'arbitrary' parameter within our model. The kind of higher-order information that is used in the ICA algorithm we propose is well known as fourth-order correlations.
Computational effort. The original implementation of the JADE algorithm as found in [6] incorporates the calculation of the fourth-order cumulant from the data samples together with an eigenvector decomposition. But severe problems arise with high input dimensionalities: the original JADE algorithm could never be applied to a problem like the one in [2], where the input dimensionality is 144. Experiments showed that it is often not necessary to calculate all possible eigenmatrices of the fourth-order cumulant; it is sufficient to determine the average eigenstructure from a subset of all eigenmatrices.
Statistical properties of the input data. Higher-order statistical properties of high-dimensional data are hard to investigate because of the difficulty of visualizing the results, which often remain unimaginable. In the case of fourth-order information (correlation and cumulant tensors), the eigenmatrix decomposition leads to two-dimensional structures that can easily be visualized. One example where this could be useful is natural images and the influence of their statistical properties on the development of our visual system [2][14].

Acknowledgement. This work has been supported by a grant from the Claussen Stiftung, Stifterverband für die deutsche Wissenschaft, Essen, Germany.

References
1. Anthony J. Bell and Terrence J. Sejnowski. An information-maximisation approach
to blind separation and blind deconvolution. Neural Computation, 7:1129-1159,
1995.
2. Anthony J. Bell and Terrence J. Sejnowski. The 'independent components' of
natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997.
3. Jean-Francois Cardoso. Source separation using higher order moments. In Pro-
ceedings of the ICASSP, pages 2109-2112, Glasgow, 1989.
4. Jean-Francois Cardoso. Fourth-order cumulant structure forcing, application to
blind array processing. In Proceedings of the 6th workshop on statistical signal and
array processing (SSAP 1992), pages 136-139, Victoria, Canada, 1992.
5. Jean-Francois Cardoso and Pierre Comon. Independent component analysis, a
survey of some algebraic methods. In Proceedings ISCAS 1996, pages 93-96, 1996.
6. Jean-Francois Cardoso and Antoine Souloumiac. Blind beamforming for non-
gaussian signals. IEE Proceedings - Part F, 140(6):362-370, 1993.
7. Jean-Francois Cardoso and Antoine Souloumiac. Jacobi angles for simultaneous
diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161-
164, 1996.
8. P. Comon and B. Mourrain. Decomposition of quantics in sums of powers of linear
forms. Signal Processing, 53(2):96-107, 1996.
9. Pierre Comon. Independent component analysis, a new concept? Signal Processing,
36:287-314, April 1994.
10. Gustavo Deco and Dragan Obradovic. An information-theoretic approach to neu-
ral computing. Perspectives in Neural Computing. Springer, New York, Berlin,
Heidelberg, 1996.
11. Te-Won Lee, Mark Girolami, Anthony J. Bell, and Terrence J. Sejnowski. A uni-
fying information-theoretic framework for independent component analysis. Inter-
national Journal on Mathematical and Computer Modeling (in press), 1998.
12. Jean-Pierre Nadal and Nestor Parga. Redundancy reduction and independent
component analysis: Conditions on cumulants and adaptive approaches. Neural
Computation, 9:1421-1456, 1997.
13. J. G. Taylor and S. Coombes. Learning higher order correlations. Neural Networks,
6:423-427, 1993.
14. Christian Ziegaus and Elmar W. Lang. Statistics of natural and urban images. Lec-
ture Notes in Computer Science (Proceedings ICANN 1997, Lausanne), 1327:219-
224, 1997.
15. Christian Ziegaus and Elmar W. Lang. Independent component extraction of
natural images based on fourth-order cumulants. In Proceedings of the ICA (Inde-
pendent Component Analysis) 1999, in press.
Variable Selection by Recurrent Neural
Networks. Application in Structure Activity
Relationship Study of Cephalosporins

Nancy Lopez 1, Roberto Cruz 2, and Belsis LLorente 3

1 University of Antioquia, Medellin, Colombia. nlopez@matematicas.udea.edu.co
  Institute of Cybernetics, Mathematics and Physics, La Habana, Cuba. nancylr@cidet.inf.cu
2 Institute of Nuclear Sciences and Technology, La Habana, Cuba. rcruz@rsrch.isctn.edu.cu
3 Center of Pharmaceutical Chemistry, La Habana, Cuba. cqf@sld.cu

Abstract. Two methods for variable selection which are efficiently implemented by Hopfield-like neural networks are described. Qualitative SAR models are built using the variables selected by both methods. The biological activity of cephalosporins against Staphylococcus aureus was used as the dependent variable. The final correlations between observed and predicted activity values are good, indicating that the informative weight of the favored variables is high and providing a sound basis for selecting a good variable set in qualitative structure-activity relationship (SAR) modeling.

1 Introduction

Structure-activity relationship (SAR) work using Artificial Neural Networks (ANN) is not new; a number of papers in the literature emphasize the advantages of using these parallel distributed processors. Their nonlinear nature makes them perfect candidates for working out classification problems where separability between classes is a predicament [1].
On the other hand, the problem of determining a set of variables that describes a set of compounds accurately has been a long-lasting headache. This problem can be split in two parts: the determination of the best number of variables and the calculation of the information content of each variable. Too few variables do not discriminate between different compounds, and too many can lead to overfitting and consequently to models with poor predictive ability [2]. This part of the problem has been studied at length, but the most widely used method, the analysis of principal components [3], has the well-known drawback of losing the identity of the original variables. Some Multilayer Perceptrons (MLP) have already been used to treat the dimensionality reduction problem [4].
Recurrent neural networks (RNN) have been used in optimization problems [5], [6]. The parallel approach has demonstrated the ability to approximate the minimum of an objective function better than other methods. In this article, two variable selection methods based on recurrent neural models are proposed. The first model searches for the best variable subset by looking for the maximal independent set of minimum cardinality of a graph. The second one builds clusters of analogous variables and chooses the best one of each cluster to form the most relevant subset. The analogy function measures the capacity of the variables to keep the class distribution of the data.
We build SAR models for cephalosporins using the selected variables. Cephalosporins are antibacterial compounds belonging to the β-lactam family. Their basic structure has a β-lactam ring bound to a dihydrothiazine ring, known as the cephem nucleus.
The second section describes the recurrent neural network methods for variable selection, and the third presents the application of the proposed methods in a SAR study.

2 Variable Selection using Recurrent Neural Networks

The approach to the discovery of relevant variables needs an accurate search for the best solution of a minimization problem [6]. Recurrent neural networks have been used in such optimization problems.
The proposed RNN approach to variable selection has two main components: the analogy matrix and the recurrent neural network search. Following the structure of the variable selection models, the similarity matrix can be seen as an evaluation function. In this article, the variable selection is done using two recurrent models, which we call the independent set model (VSIS) and the clustering model (VSCA). Both use the same relevance function but perform the search using different dynamics (energy functions).

2.1 The variable similarity matrix


The two methods for variable selection described in this work are based on the construction of a variable similarity matrix. Assume that m objects are given, and denote by a_ij the i-th measurement of the j-th variable (i ∈ N_m = {1, 2, ..., m}, j ∈ N_n = {1, 2, ..., n}), where n is the number of variables. Before applying any variable selection method, the given data are preprocessed using the formula

b_{ij} = \frac{a_{ij} - \min_i a_{ij}}{\max_i a_{ij} - \min_i a_{ij}}    (1)

to obtain a new data matrix [b_{ij}]_{m \times n} with entries ranging from 0 to 1. Let us assign to each variable j ∈ N_n a vector v^j ∈ {0,1}^{m(m-1)/2}, with components calculated by the formula

v_{ik}^j = \begin{cases} 1, & \text{if } c_i = c_k \text{ and } 1 - |b_{ij} - b_{kj}| > b_{max} \\ 1, & \text{if } c_i \neq c_k \text{ and } 1 - |b_{ij} - b_{kj}| < b_{min} \\ 0, & \text{in any other case} \end{cases}    (2)

where i, k ∈ N_m, the vector [c]_m is the class distribution of the object collection, b_max ∈ [0,1] is the similarity threshold between objects and b_min ∈ [0,1] is the dissimilarity threshold. As can be seen, v_{ik}^j = 1 if objects i and k have similar measurements of the j-th variable and belong to the same class, or if they have different measurements of the j-th variable and belong to different classes. It means that v_{ik}^j = 1 if objects i and k are "well classified" by the j-th variable.
The similarity matrix [s_{jl}]_{n \times n} is calculated using the formula

s_{jl} = \frac{\sum_{i=1}^{m-1} \sum_{k=i+1}^{m} v_{ik}^j v_{ik}^l}{m(m-1)/2}    (3)

and each element s_{jl} of the similarity matrix equals the number of object pairs which are well classified by both variables j and l, divided by the total number of object pairs.
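Equations (1)-(3) vectorize readily; a sketch (the b_max and b_min defaults are illustrative choices of ours, not values from the paper):

import numpy as np

def similarity_matrix(A, c, b_max=0.9, b_min=0.1):
    # A is (m objects, n variables), c the class vector of length m.
    B = (A - A.min(0)) / (A.max(0) - A.min(0))          # eq. (1)
    iu, ku = np.triu_indices(A.shape[0], k=1)           # object pairs i < k
    closeness = 1 - np.abs(B[iu] - B[ku])               # per pair and variable
    same = (c[iu] == c[ku])[:, None]
    V = np.where(same, closeness > b_max, closeness < b_min).astype(float)  # eq. (2)
    return V.T @ V / len(iu)                            # eq. (3)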

2.2 Variable selection and minimum independent sets: The VSIS model

Let s be a similarity level such that min s_{jl} ≤ s ≤ max s_{jl}, where s_{jl} is the similarity value between variables calculated using equation (3). For a similarity level s, we intend to select a subset of variables V^s ⊆ N_n such that:

a) ∀ j, l ∈ V^s: s_{jl} < s.
b) ∀ k ∉ V^s, ∃ j ∈ V^s such that s_{jk} ≥ s.
c) The number of selected variables, |V^s|, is minimum.

Let G_s(V,E), |V| = n, E ⊆ V × V, be the similarity graph associated with the similarity level s. Each vertex v_j ∈ V is associated with the variable j, and two vertexes are adjacent, (v_j, v_l) ∈ E, if s_{jl} ≥ s. Then the required subset of variables is equivalent to a maximal independent set of minimum cardinality of the similarity graph G_s(V,E). As can be seen, the density of the similarity graphs decreases as the value of the similarity level s increases, and consequently the cardinality of the minimum independent sets also increases. The problem of finding the maximal independent set of minimum cardinality for the graph G_s(V,E) is equivalent to the problem of finding the maximal clique of minimum cardinality for the complementary graph Ḡ_s(V,Ē) and is known to be NP-complete [7].
In order to find the minimum clique of the graph Ḡ_s(V,Ē) we use a modified version of the neural network based algorithm developed by Cruz and López for the Maximum Clique Problem [8].
We considered a Hopfield-like neural network model with n neurons, where n is the number of vertexes in the graph Ḡ_s, and x ∈ {0,1}^n represents the state vector of the neurons at a given time t. The initial state of the system is x^0 = e_i, where e_i is the i-th unit vector. Since the initial state vector can be interpreted as a clique which contains only the vertex i, we use a transfer function which adds only one vertex to the clique in each iteration, until the clique becomes maximal. It means that at each iteration we select only one neuron to fire, among all the

neurons corresponding to vertexes adjacent to the current clique. T h e original


algorithm for the Maximum Clique Problem was designed to find maximal clique
as large as possible, hence we selected, among candidate neurons, that one which
guarantees the largest number of candidates in the next iteration. In this case,
we need to find maximal clique of minimum cardinality and at each iteration we
select a neuron which guarantees the minimum number of candidate neurons in
the next iteration. Let N t be the number of candidates at iteration t, it means
the number of vertexes adjacent to all vertexes belonging to the current clique
C t, and N t+l the number of candidates if we chose Xm to be fired at the iteration
t + 1. Hence, we must select the neuron, which has the minimum value of Ntm+1 .
The algorithm can be described as follows:
1. t +-- 0, x ~ 4-- ei, C O = i, N ~
2. while N t ~ 0 do
(a) For all candidate neurons m update Ntm+1
(b) select neuron l such that N [ +1 = m i n m ( N t+l)
(c) z ~+1 e - x t + e~
(d) C t+l +-- C t u
(e) N t+l e-- N t+l{/}
(f) t e - t + l
3. end

This algorithm provides a discrete descent dynamics that approximately finds the maximal clique of minimum cardinality in the subgraph of Ḡ_s(V,Ē) containing only vertexes adjacent to the vertex i. In order to find the minimum size clique of the graph Ḡ_s, we run this algorithm from each vertex which does not yet belong to a clique found. For each level of similarity s, several minimum cliques of the graph Ḡ_s(V,Ē), and consequently several subsets of variables with the same cardinality, can be obtained.
The algorithm for variable selection using the VSIS model is applied for each similarity level s equal to each different element of the similarity matrix s_{jl} from equation (3), taken in increasing order. This means applying the neural network algorithm to n(n-1)/2 similarity graphs in the worst case, where n is the number of variables. The solution is updated whenever a similarity level yields a different cardinality of the selected subsets.
The effectiveness of the method for each similarity level s was evaluated by using the K-Nearest Neighbor classifier with K = 1 (1NN). This means that, to simulate the class distribution of the pattern collection with a selected subset, each pattern is included in the class of its nearest neighbor, using the Euclidean distance over the variables belonging to the selected subset. The percentage of well classified patterns, i.e. the percentage of patterns for which the simulated and observed classes are the same, is used as the validation criterion for the selected subsets of variables.
As pointed out above, for each level of similarity s several subsets of variables with the same cardinality can be obtained. In order to keep only one of them, we selected the one with the greatest percentage of well classified patterns among the subsets with the same cardinality.
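A plain-Python sketch of the clique-growing dynamics (adj is the boolean adjacency matrix of the complementary graph, with zero diagonal; per the text, the routine is run from every vertex and the smallest maximal clique found is kept):

import numpy as np

def min_clique_from(adj, i):
    # Grow a maximal clique from vertex i, firing at each step the candidate
    # neuron that leaves the fewest candidates N_m^{t+1}.
    clique = [i]
    cand = set(np.flatnonzero(adj[i]))      # vertexes adjacent to the clique
    while cand:
        nxt = {m: len(cand & set(np.flatnonzero(adj[m])) - {m}) for m in cand}
        l = min(nxt, key=nxt.get)           # minimum N_l^{t+1}
        clique.append(l)
        cand = cand & set(np.flatnonzero(adj[l])) - {l}
    return clique                           # maximal: no common neighbour left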

2.3 Variable selection and cluster analysis: The VSCA model

The clustering model uses a recurrent neural network to build clusters of variables given a similarity level. Given a set of objects of some kind and a relevant measure of similarity between these objects, the purpose of cluster analysis is to partition the set into several clusters (subsets) in such a way that objects in each cluster are highly similar to one another, while objects assigned to different clusters have low degrees of similarity. Cluster analysis can be used to perform variable selection if a measure of similarity between variables is given. After the clustering of variables is performed, we have to select one variable from each cluster according to a certain criterion. Using the similarity matrix from equation (3), for a similarity level s, min s_{jl} ≤ s ≤ max s_{jl}, we use a neural network algorithm developed by Cruz and López [9] to perform the cluster analysis.
A Hopfield-like neural network with n² neurons was considered. The differential equation system expressing the state of the network at time t is

\frac{dx_{ij}}{dt} = A \sum_{k=1, k \neq j}^{n} (s_{jk} - s) y_{ik}, \qquad y_{ij} = f(x_{ij}), \qquad i,j = 1, ..., n.    (4)

In this system y_{ij} is the state of the ij-th neuron at a given time; y_{ij} = 1 if the j-th variable is placed in the i-th cluster and y_{ij} = 0 otherwise. The function f(x_{ij}) is the transfer function of the neural network. In this model the Takefuji maximum transfer function was used:

f(x_{ij}) = \begin{cases} 1, & \text{if } x_{ij} = \max(x_{1j}, ..., x_{nj}) \\ 0, & \text{otherwise} \end{cases}    (5)

The Liapunov energy function associated with system (4) is

E = -\frac{A}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1, k \neq j}^{n} (s_{jk} - s)\, y_{ik} y_{ij}.    (6)

The energy function is closer to a minimum if y_{ij} y_{ik} = 1 and s_{jk} > s (the similarity value between variables j and k is greater than the threshold); this means that variables j and k are placed in the same cluster. When s_{jk} < s, the minimum value is reached for y_{ij} y_{ik} = 0; this means that variables j and k belong to different clusters. As in the first method, with the increment of s the number of clusters, and consequently the cardinality of the selected subsets, increases.
In order to select one variable from each cluster, a value

d_j = \frac{\sum_{i=1}^{m-1} \sum_{k=i+1}^{m} v_{ik}^j}{m(m-1)/2}    (7)

is assigned to each variable j ∈ N_n. The variable j with the maximum value of d_j within its cluster is selected.

As in the first method, the algorithm for variable selection using cluster analysis is applied for each similarity level s equal to each different element of the similarity matrix s_{jl} from equation (3), taken in increasing order. This means solving n(n-1)/2 differential systems (4) in the worst case. The solution is updated for every different clustering of the variables. The effectiveness of the VSCA algorithm for each similarity level s was also evaluated by using the 1NN classifier.
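A discretized simulation of the system (4)-(5) can be sketched as follows (step size, gain A, iteration count and initialization are our own choices):

import numpy as np

def vsca_clusters(S, s, A=1.0, dt=0.1, steps=500, seed=0):
    # S is the n x n variable similarity matrix, s the similarity level.
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    x = 0.01 * rng.standard_normal((n, n))       # neuron inputs x_ij
    off = S - s
    np.fill_diagonal(off, 0.0)                   # (s_jk - s) for k != j
    for _ in range(steps):
        y = (x == x.max(axis=0)).astype(float)   # eq. (5): max neuron per column
        x += dt * A * y @ off                    # eq. (4)
    return x.argmax(axis=0)                      # cluster index of each variable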

3 Application of Variable Selection Methods to a SAR Study

3.1 Dataset: compounds, biological activity and descriptors

A set of 105 compounds originated from various substitutions at positions C3 and C7 of the cephalosporin basic structure was assembled [10][11][12][13][14][15][16][17][18]. The minimum inhibitory concentration (MIC) in mg·mL⁻¹ against S. aureus was used to measure biological activity. For this qualitative structure-activity study the set of compounds was split into two classes, active and inactive; compounds that showed no activity at a concentration below 0.78 mg·mL⁻¹ were considered inactive.
The 3D structures were built using the InsightII Builder module (InsightII program, Biosym/Molecular Simulation Technologies, San Diego, CA). The coordinates of these structures were used for energy optimizations with the quantum mechanical method AM1 [19] incorporated in the molecular orbital package MOPAC/AMPAC version 6.0 [20].
43 molecular descriptors containing topological and electronic information of the compounds were calculated. MOPAC output files were used to calculate the topographical indexes introduced by Estrada et al. [21][22] (variables 3-8, 9-14, 15-20 and 21-26). Despite their recent introduction, these indexes have been successfully correlated with molecular volume [21] and boiling points of alkenes [22]. The HOMO and LUMO energies (variables 1 and 2 respectively) were also calculated. Additionally, the VX1 valence index (variables 28-35) and the electrotopological index (variables 36-43) for 8 important atoms in cephalosporins, both introduced by Kier and Hall [23][24], were included in the study.

3.2 Neural networks for SAR studies

SAR studies were carried out by means of MLPs with a v-x-1 architecture, where v is the number of descriptors and x the number of neurons in the hidden layer. The neuron in the output layer corresponds to the biological activity class. In this qualitative study, the target values of biological activity presented to the networks were 0.1 and 0.9 for compounds belonging to the inactive and active classes, respectively. ANN training was performed using the backpropagation algorithm with the SNNS package [25] running on an Indigo2 R4400 workstation.

3.3 Results

The similarity matrix of the 43 molecular descriptors describing the 105 compounds was calculated and used to apply the variable selection methods. The selected variable subsets obtained with the VSIS and VSCA models, with cardinality N_s up to 10, are shown in Table 1. In general, all subsets selected by both methods allowed classification with 100% well classified patterns. Moreover, the variables 1 and 2, which form the subsets of cardinality 1, separate the pattern collection into the two studied classes according to the 1NN classifier. As can be seen, both methods perform similarly, selecting in general the same subsets of variables, particularly for subsets of low cardinality.
The most favored variables in the selection were variables 1 and 2 (HOMO and LUMO) and variables 38, 40, 41, 42 and 43, corresponding to electrotopological indexes calculated on carbon atoms of the cephem nucleus. The values of these electrotopological indexes depend on the substituent at C-3 of the cephem nucleus, and it has been reported that the in vitro activity and bioavailability of cephalosporins are affected by the hydrophobic and electronic characteristics of this group.
N_s  Model VSIS: selected subsets            Model VSCA: selected subsets
1    1                                       2
2    2, 38                                   2, 38
3    2, 38, 41                               2, 38, 41
4    2, 38, 41, 42                           2, 38, 41, 42
6    2, 38, 40, 41, 42, 43                   2, 38, 40, 41, 42, 43
7    1, 21, 38, 40, 41, 42, 43               1, 2, 38, 40, 41, 42, 43
9    1, 2, 29, 37, 38, 40, 41, 42, 43        1, 2, 36, 37, 38, 40, 41, 42, 43
10   1, 21, 29, 36, 37, 38, 40, 41, 42, 43   1, 2, 29, 36, 37, 38, 40, 41, 42, 43

Table 1. Selected subsets of variables.


Due to the nonlinear nature of SARs, and in order to determine the minimum number of variables for effective SAR models, we trained 5 MLPs with architecture N_s-8-1, where N_s = 3, 4, 6, 7. In the case of N_s = 7, two different subsets of variables were used as input to the classifying network, corresponding to the subsets selected by the VSIS and VSCA models. The networks were trained on the 105 patterns; the results are shown in Table 2, where r is the correlation coefficient, M.A.E. is the maximum absolute error between observed and calculated activity, and N_p is the number of misclassified patterns. We accept that a pattern is well classified when the absolute error between observed and predicted activity is less than 0.2, taking into account that the observed activity was presented to the network with value 0.1 for inactive compounds and 0.9 for active ones. As can be seen, for the subset of cardinality 6 only one misclassified pattern was obtained, even though the absolute error for this pattern was 0.23. In the case of the subsets of cardinality 7 selected by both methods, all patterns were well classified, indicating that the learning ability of the network was very high. For the subsets of lower cardinality the learning capacity of the MLPs was not as high, showing that
504

although with 1NN classifier all patterns were well classified, 3 or 4 variables are
not enough to build an effective SAR model.

Ns         r      M.A.E.   Np
3          0.86   0.73     17
4          0.93   0.78      5
6          0.99   0.23      1
7 (VSIS)   0.999  0.13      0
7 (VSCA)   0.998  0.17      0

Table 2. Results of the MLP models trained with the selected variables.

4 Conclusions

Two variable selection methods based on recurrent neural models were described.
The first model selects the best variable subset by looking for the maximal indepen-
dent set of a graph with minimum cardinality. The second one builds clusters of
analogous variables and chooses the best one of each cluster to form the most
relevant subset.
Both methods were applied to a sample of 105 cephalosporins described by 43
molecular descriptors and distributed in two classes: active and inactive against
S. aureus. All the selected subsets of variables showed the capacity to keep the
distribution of the pattern collection. Both algorithms performed similarly.
SAR NN models for S. aureus, using the selected variables, were built. The
obtained SAR models provide good classifications of the compounds and show
the strong dependence of activity on the electronic and hydrophobic parameters of
cephalosporins.

5 Acknowledgments

This work has been supported by the University of Antioquia under the Research
Project "Development of Heuristics to the Combinatorial Optimization NP-
Problem". The authors also thank the Third World Academy of Sciences for financial
support (TWAS R.G.A. No 97-144 RG/CHE/LA).

References

1. Rose V.S., Wood J. and MacFie H.J.H., Analysis of Embedded Data: k-Nearest
Neighbor and Single Class Discrimination in Advanced Computer-Assisted Tech-
niques in Drug Discovery (Methods and Principles in Medicinal Chemistry, vol
III), Mannhold R. and Krogsgaard-Larsen H., van de Waterbeemd H., ed., VCH,
1995, pp 229-242.
2. Tetko I.V., Luik A.I. and Poda G.I., J. Med. Chem., 36, 811-814 (1993).

3. Lin C.T., Pavlick P.A. and Martin Y.C., Tetr. Comput. Methodol., 3, 723-738
(1990).
4. Wikel J.H. and Dow E.R., BioMed. Chem. Lett., 3, 645-651 (1993).
5. Hopfield J.J. and Tank D.W. Biological Cybernetics, 52, 141-152 (1985)
6. Takefuji Y. Neural Network Parallel Computing. KLUWER Acad. Pu. 1992
7. Garey M. R. and Johnson D. S. , "Computers and Intractability : A Guide to the
Theory of NP-Completeness". Freeman, San Francisco, 1979.
8. Cruz R. and Lopez N. Proceedings of the V European Congress on Intelligent
Techniques and Soft Computing, Eufit'97, V 1, 465-470 (1997)
9. Cruz R., Lopez N., Quintero M. and Rojas G. Journal of Mathematical Chemistry,
20, 385-394 (1996)
10. Ishikura K., Kubota T., Minami K., Hamashima Y., Nakashimizu H., Motokawa
K. and Yoshida T. The Journal of Antibiotics, 47, 453-465 (1994).
11. Lee Y.S., Lee J.Y., Jung S.H., Woo E., Suk D.H., Seo S.H. and Park H., The
Journal of Antibiotics, 47, 609-612 (1994).
12. Negi S., Yamanaka M., Sugiyama I., Komatsu Y., Sasho M., Tsuruoka A., Kamada
A., Tsukada I., Hiruma R., Katsu K. and Machida Y., The Journal of Antibiotics,
47, 1507-1525 (1994).
13. Negi S., Sasho M., Yamanaka M., Sugiyama I., Komatsu Y., Tsuruoka A., Kamada
A., Tsukada I., Hiruma R., Katsu K. and Machida Y. The Journal of Antibiotics,
47, 1526-1540 (1994).
14. Ishikura K., Kubota T., Minami K., Hamashima Y., Nakashimizu H., Mo-
tokawa K., Kimura Y., Miwa H. and Yoshida T., The Journal of Antibiotics, 47,
466-477 (1994).
15. Park H., Lee J.Y., Lee Y.S., Park J.O., Koh S.B. and Ham, W., The Journal of
Antibiotics, 47, 606-608 (1994).
16. Yokoo C., Onodera A., Fukushima H., Numata K., Nagate T. The Journal of
Antibiotics, 45, 932-939 (1992).
17. Yokoo C., Onodera A., Fukushima H., Numata K. and Nagate T., The Journal of
Antibiotics, 45, 1533-1539 (1992).
18. Yokoo C., Got M., Onodera A., Fukushima H. and Nagate T., The Journal of
Antibiotics, 44, 1422-1431 (1991).
19. Dewar M.J.S., Zoebisch E.V., Healy E.F. and Stewart J.J.P., J. Am. Chem. Soc.,
107, 3902-3909 (1985).
20. Stewart J.J.P., MOPAC 6.0 User Manual, Frank J. Seiler Research Laboratory, US
Air Force Academy, 1990.
21. Estrada E., J. Chem. Inf. Comput. Sci., 35, 31-33 (1995).
22. Estrada E., J. Chem. Inf. Comput. Sci., 35, 708-713 (1995).
23. Kier L.B. and Hall L.H., J. Pharm. Sci., 72, 1170-1173 (1983).
24. Kier L.B. and Hall L.H., Pharmaceutical Research,7, 801-807 (1990).
25. Stuttgart Neural Network Simulator (SNNS), Version 4.1, Institute for Parallel
and Distributed High Performance Systems. 1995, Report No. 6/95.
Optimal Use of a Trained Neural Network for Input Selection

Mercedes Fernández Redondo, Carlos Hernández Espinosa

Universidad Jaume-I. Departamento de Informática. Campus Riu Sec. Edificio TI.
Castellón. Spain.
E-mail: espinosa@inf.uji.es

Abstract. In this paper, we present a review of feature selection methods based
on the analysis of a trained multilayer feedforward network, which have been
applied to neural networks. Furthermore, a methodology that allows evaluating
and comparing feature selection methods is carefully described. This
methodology is applied to the 19 reviewed methods in a total of 15 different
real-world classification problems. We present an ordination of the methods
according to their performance, and it is clearly concluded which method performs
best and should be used. We also discuss the applicability and computational
complexity of the methods.

1 Introduction

Neural networks (NNs) are used in quite a variety of real-world applications in which
one can usually measure a large number of variables that can be used as potential
inputs. One clear example is the extraction of features for object recognition [1], where
many different types of features can be utilized: geometric, morphological, etc.
However, usually not all variables that can be collected are equally informative:
they may be noisy, irrelevant or redundant.
Feature selection is the problem of choosing a small subset of features, ideally
necessary and sufficient to perform the classification task, from a larger set of
candidate features. Feature selection has long been one of the most important topics in
pattern recognition, and it is also an important issue in NNs. If one could select a
subset of variables, one could reduce the size of the NN, the amount of data to process
and the training time, and possibly increase the generalization performance. This last
result is known in the literature and is confirmed by our results.
Feature selection is also a complex problem: we need a criterion to measure the
importance of a subset of variables, and that criterion will depend on the classifier. A
subset of variables could be optimal for one system and very inefficient for another.
In the literature there are several potential ways to determine the best subset of
features: analyzing all subsets, genetic algorithms, a heuristic stepwise analysis and
direct estimations.
In the case of NNs, direct estimation methods are preferred because of the
computational complexity of training a NN. Inside this category we can make
another classification: methods based on the analysis of the training set [2], methods
based on the analysis of a trained multilayer feedforward network [1], [3-16], and
methods based on the analysis of other specific architectures [18].
The purpose of this paper is to make a brief review of the methods based on the
analysis of a trained multilayer feedforward network and to present the first empirical
comparison among them.
In the next section we briefly review the 19 different methods; in section 3 we
present the comparison methodology, the experimental results and an ordination of
the methods according to their performance; and we finally conclude in section 4.

2 Theory

Many methods based on the analysis of a trained multilayer feedforward network try
to define what is called the relevance S_i of an input unit: one input I_i is considered
more important if its relevance S_i is larger. They also define the relevance s_{ij} of a
weight w_{ij} connecting the input unit i and the hidden unit j. The relation
between S_i and s_{ij} is:

    S_i = \sum_{j=1}^{N_h} s_{ij}                                            (1)

where N_h is the number of hidden units.

The criteria for defining weight relevance are varied. Some of them are based on
the direct weight magnitude. For instance, the criterion proposed by Belue [3] (from
here on named BL2) is:

    s_{ij} = (w_{ij})^2                                                      (2)

And the one proposed by Tetko [9] (from here on named TEKA) is:

    s_{ij} = |w_{ij}|                                                        (3)

These criteria are based on the heuristic principle that, as a result of the learning
process, the weights of an important input should have a larger magnitude than other
weights connected to a useless, and maybe random, input.
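As an illustration of these magnitude-based criteria, the following Python sketch computes the input relevances of equation (1) combined with equations (2)-(3) from a trained input-to-hidden weight matrix; the absolute-value reading of equation (3) is our assumption.

# Sketch of the magnitude-based relevance criteria; W1 is the input-to-hidden
# weight matrix of a trained network, shape (inputs, hidden units).
import numpy as np

def relevance_bl2(W1):
    # BL2: s_ij = w_ij^2, summed over the Nh hidden units (eq. 1 with eq. 2)
    return (W1 ** 2).sum(axis=1)

def relevance_teka(W1):
    # TEKA: s_ij taken as the weight magnitude (our reading of eq. 3)
    return np.abs(W1).sum(axis=1)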
Other criteria of weight relevance are based on an estimation of the change in the
m.s.e. (mean square error), E, when setting the weight to 0; this estimation is
calculated by using the Hessian matrix H, as a result of considering the Taylor
expansion of the m.s.e., E, with respect to the weight. One example is the method
proposed by Cibas [8] (from here on CIB), where we denote by w_k the weight w_{ij}:

    s_k = s_{ij} = \frac{1}{2} h_{kk} w_k^2                                  (4)

In the above expression h_{kk} is a diagonal element of the Hessian matrix H.
And the criterion proposed by Tetko [9] (from here on TEKE) is:

    s_k = \frac{w_k^2}{2 [H^{-1}]_{kk}}                                      (5)

The Hessian matrix can be exactly calculated with the algorithm and expressions
described in [17].
Another method of estimating weight relevance is based on an estimation of the
change in E when setting w_{ij} equal to 0, but it does not use the Hessian matrix. It was
proposed by Tetko [9] (named TEKC from here on), and the value of the weight
relevance is:

    s_{ij} = \sum_{t=0}^{T} \frac{\partial E(t)}{\partial w_{ij}} \, \Delta w_{ij}(t) \, \frac{w^c_{ij}}{w^c_{ij} - w^0_{ij}}    (6)

where the sum over t represents a sum over all iteration steps of the training process,
from the initialization t=0 until the iteration of convergence T, w^0_{ij} is the initial
value of the weight w_{ij}, w^c_{ij} is the value of that weight at the iteration of
convergence, and \Delta w_{ij}(t) is the change in the weight calculated by the learning
algorithm at iteration t.
In order to apply the method, we should keep a record of the appropriate information
during the learning process for calculating the weight relevance.
Other methods define the relevance S_i of input i by a calculation related to the
variance of the weights w_{ij} of input i; they are based on the heuristic that an input unit
with small variance will behave as a threshold and therefore will have little
importance. One example is the criterion defined by Devena [1] (from here on DEV):

    S_i = \frac{1}{N_h}\sum_{j} w_{ij}^2 - \left(\frac{1}{N_h}\sum_{j} w_{ij}\right)^2    (7)

Another example is the criterion proposed by Deredy [10] (from here on DER3):

    S_i = \frac{N_h \cdot \mathrm{var}_i}{\sum_j w_{ij}}                     (8)
J
Another way to define the relevance S_i of input i is by using the sensitivity of the
outputs o_j with respect to the input I_i. It is based on the heuristic that a higher
sensitivity means a larger variation of the output values with respect to a change in the
input, and therefore we can suppose that the input is more important. For example,
Belue [3] (from here on BL1) uses the following definition:

    S_i = \frac{1}{N} \sum_{x \in S \cup D} \sum_{j=1}^{N_o} \frac{\partial o_j}{\partial I_i}(x, w)    (9)

where S is the training set, D is a set of points in the input space equally distributed
along the range of variability of the inputs, N_o is the number of outputs and N is the
sum of the cardinals of S and D.
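The following Python sketch approximates this kind of output-input sensitivity by central finite differences; the forward function and the choice of evaluation points are illustrative assumptions, not the exact computation of [3].

# Finite-difference sketch of the sensitivity-based relevance (eq. 9): average,
# over points x and outputs j, of d(o_j)/d(I_i). `forward` is any trained
# network's forward pass returning the output vector; names are illustrative.
import numpy as np

def relevance_sensitivity(forward, points, eps=1e-4):
    points = np.asarray(points, dtype=float)
    n_in = points.shape[1]
    S = np.zeros(n_in)
    for x in points:
        for i in range(n_in):
            xp, xm = x.copy(), x.copy()
            xp[i] += eps
            xm[i] -= eps
            # central difference approximation of the output sensitivities
            S[i] += np.sum((forward(xp) - forward(xm)) / (2 * eps))
    # eq. 9 also sums over a grid D of input-space points; here `points` can be
    # the union of the training set S and such a grid
    return S / len(points)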

A similar method was proposed by Cloete [4] (from here on named CLO):

    S_i = \max_j (A_{ij})                                                    (10)

    A_{ij} = \sqrt{\frac{1}{N_s}\sum_{k=1}^{N_s}\left(\frac{\partial o_j}{\partial I_i}(x_k, w)\right)^2}    (11)

where N_s is the number of training samples.
Priddy [5] also proposed a method (from here on called PRI) based on sensitivities:

    S_i = \sum_{x \in S \cup D} \sum_{j=1}^{N_o} \sum_{k \neq j} \frac{\partial o_k}{\partial I_i}(x, w)    (12)

which tries to estimate the variation of the probability of classification error with
respect to the variation of the input I_i. See the reference for more details.
The method proposed by Sano [16] (from here on named SAN) also uses sensitivities:

    D(k,i) = \max\left\{ \left|\frac{\partial o_k}{\partial I_i}(x, w)\right| : x \in S \right\}    (13)

This method gives a matrix of sensitivities D(k,i) with k=1, ..., number of outputs, and
i=1, ..., number of inputs, and the relevance S_i of input i is considered to be larger than
the relevance S_j of input j if D(k,i) > D(k,j) for a number of values of k greater than
N_o/2.
Finally, Deredy [10] proposes the use of logarithmic sensitivities (from here on
named DER2):

    S_i = \max_k (B_{ik})                                                    (14)

    B_{ik} = \frac{\partial \ln|t_k - o_k|}{\partial \ln I_i}                (15)

where t_k is the target of output o_k. The purpose of this logarithmic sensitivity is to
avoid the saturation term o_k·(1-o_k), where o_k is an output value, which appears in
the calculation of the other sensitivities.
Some other methods try to evaluate the contribution of one input to the output, taking
into account the values of all the weights in the NN structure. For example, the
following equation proposed by Tetko [9] (we call this method TEKB):

    S_i^s = \sum_{j=1}^{M} \left(\frac{w_{ij}}{\max_a w_{aj}}\right)^2 S_j^{s+1}    (16)

tries to estimate the overall importance S_i^s of unit i in layer s over the units in the
next layer s+1; w_{ij} is the weight between unit i in layer s and unit j in layer s+1,
and M is the number of units in layer s+1. The equation is recursive: we set S_j equal
to 1 for all outputs and calculate S_i for all hidden units; applying the equation again,
we calculate the input relevance.
An analogous method, also proposed by Tetko [7] (named TEK from here on), is based
on the equation:

    S_i^s = \sum_{j=1}^{M} \left(w_{ij} \, E[a_i^s]\right)^2 S_j^{s+1}       (17)

where E[a_k^s] is the mean value of the output of unit k in layer s. We also set S_j equal
to one for the outputs and recursively calculate S_i for the inputs.
Another method, proposed by Mao [12] (named MAO from here on), calculates an
estimation of the m.s.e. increase when deleting input i; the estimation is made again
by a Taylor expansion of the m.s.e. with respect to the value of the input I_i. That
value is used as the relevance S_i of the input. The equations are:

    S_i = \sum_{k=1}^{N_s} \Delta E_k(I_i)                                   (18)

where the sum is over all N_s patterns in the training set, and:

    \Delta E_k(I_i) = \frac{\partial E_k}{\partial I_i}\,\Delta I_i + \frac{1}{2}\,\frac{\partial^2 E_k}{\partial I_i^2}\,(\Delta I_i)^2    (19)

where \Delta I_i should be 0 - I_i (which is the effect of setting I_i equal to 0), and the
derivatives can be calculated recursively. For the output units:

    \frac{\partial E_k}{\partial o_i} = o_i - t_i, \qquad \frac{\partial^2 E_k}{\partial o_i^2} = 1    (20)

and the relationship between the derivatives of two layers l+1 and l is:

    \frac{\partial E_k}{\partial y_i^l} = \sum_{j=1}^{N^{(l+1)}} \frac{\partial E_k}{\partial y_j^{l+1}} \, g' \, w_{ij}    (21)

    \frac{\partial^2 E_k}{\partial (y_i^l)^2} = \sum_{j=1}^{N^{(l+1)}} \frac{\partial^2 E_k}{\partial (y_j^{l+1})^2} (g' w_{ij})^2 + \sum_{j=1}^{N^{(l+1)}} \frac{\partial E_k}{\partial y_j^{l+1}} \, g'' \, w_{ij}^2    (22)

where g denotes the sigmoid function and its first and second derivatives are:

    g' = y_j^{l+1}(1 - y_j^{l+1}), \qquad g'' = y_j^{l+1}(1 - y_j^{l+1})(1 - 2 y_j^{l+1})    (23)

There are two very simple methods that calculate the effect of substituting an input by
its mean value in the training set. They are based on the heuristic that, if this
substitution has little effect on the performance, the input behaves almost as a
threshold and has little importance.

The first one, proposed by Lee [6] (called LEE in this paper), calculates the percentage
correct in the test set with one input substituted by its mean value; the input is
considered more relevant if the value of the tested percentage is lower, because the
performance decrease is larger.
The second one, proposed by Utans [13] (called UTA in this paper), focuses on the
m.s.e., E, and calculates its increment when substituting an input by its mean value.
One input is considered more relevant if the increment of E is higher.
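A minimal Python sketch of the UTA criterion, assuming a batch forward function; names and data layout are illustrative.

# Sketch of the UTA criterion: increase of the m.s.e. when input i is replaced
# by its training-set mean. `forward`, X (patterns x inputs) and T (targets)
# are illustrative assumptions.
import numpy as np

def relevance_uta(forward, X, T):
    base = np.mean((forward(X) - T) ** 2)     # m.s.e. of the intact network
    S = np.zeros(X.shape[1])
    means = X.mean(axis=0)
    for i in range(X.shape[1]):
        Xs = X.copy()
        Xs[:, i] = means[i]                   # substitute input i by its mean
        S[i] = np.mean((forward(Xs) - T) ** 2) - base
    return S   # larger increase of E => more relevant input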
Bowles [15] proposed another method (called BOW here) that also has to keep an
information record of the training process. The relevance S_i of one input is defined as
the following sum over all iteration steps until the convergence point T:

    S_i = \sum_{t=0}^{T} \left|\sum_{j=1}^{N_h} \delta_j \, w_{ij}\right|    (24)

where w_{ij} is the weight between input i and hidden unit j, and \delta_j is the
backpropagated error of hidden unit j.
Finally, Younes [14] proposed another method that we have used in the comparison
(we call it YOU), but we will not describe it because it is rather complex, its
computational cost is high and its applicability is limited: we got division-by-zero
errors in 6 of the 15 problems.
It is very important to point out that every method reviewed allows obtaining an
ordination of the inputs or features according to their importance or relevance.
Obviously, the ordinations of two methods will not, in general, be the same, and
therefore their performance will also be different.
Furthermore, we can get an ordination of inputs and we will know which inputs
should be discarded first from the training set (the least important ones), but there is
no simple and efficient procedure to know the cardinality, k, of the final subset of
inputs. We do not know the optimal number of inputs that should be kept in the
training set.
As we saw before, every method is based on heuristic principles, and the only way to
compare them may be empirical, because of the complexity of the problem. This will
be described and accomplished in the following section.

3 Experimental Results

In order to compare the 19 methods, we have applied them to 15 different
classification problems from the UCI repository of machine learning
databases. They are: Abalone (AB), Balance Scale (BL), Cylinder Bands (BN), Liver
Disorders (BU), Credit Approval (CR), Display 1 (DI), Glass Identification (GL),
Heart Disease (HE), Mushroom (LE), the Monk's Problems (M1, M2, M3), Pima
Indians Diabetes (PI), Voting Records (VO) and Wisconsin Breast Cancer (WD). The
complete data of the problems and a full description of them can be found in the UCI
repository.
In all problems, we have included a first useless input generated at random inside
[0,1]. It is interesting to see how important this input is considered by each method.

Furthermore, we have normalized the range of variability of every input to the
interval [0,1]. This is important because the range of variability of an input influences
the magnitude of the weights connected to the input, and thereby almost all the
relevance measurements described in the theory section.
In order to apply the methods, we should use at least one neural network trained for
each problem. In fact, we have trained 30 different neural networks (with different
initialization weights) for every problem and applied the methods to the 30
neural networks, obtaining 30 relevance measurements S_i for each input and problem.
We have then obtained a final value of relevance S_i' by averaging the 30 values of S_i.
We have followed this procedure because we wanted to obtain what we can call the
general performance of the method and avoid relevance results biased by a
concrete neural network, which obviously depends on the initialization weights.
Furthermore, the number of hidden units of the neural network for each problem was
carefully determined by a trial-and-error procedure before training the 30 neural
networks.
Following the above methodology, for each problem and method we have obtained an
ordination of feature importance according to the relevance value. For example, the
ordinations of methods UTA and CIB for problem PI are shown in Table 1.
Table 1. Input importance ordination for problem PI and methods UTA and CIB.

Method   Least Important  ------------------------->  Most Important
UTA      6   1   5   8   9   4   2   7   3
CIB      6   5   8   1   2   9   4   7   3
After that, we obtained several input subsets by successively removing the least
important input, down to a final subset of one input. For example, using the results of
Table 1 for method UTA, the first subset is obtained by deleting input {6}, the
following subset is obtained by deleting inputs {6,1}, and the final subset of one
input is {3}.
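The generation of these nested subsets can be sketched as follows; the ordination list is the UTA row of Table 1.

# Sketch: generate the nested input subsets from an ordination sorted from
# least to most important, as in the {6}, {6,1}, ... example for method UTA.
def nested_subsets(ordination):
    subsets = []
    remaining = list(ordination)
    while len(remaining) > 1:
        remaining = remaining[1:]          # drop the least important input
        subsets.append(list(remaining))    # next candidate subset
    return subsets

print(nested_subsets([6, 1, 5, 8, 9, 4, 2, 7, 3]))   # last subset is [3]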
For every subset we wanted to obtain the performance of a classifier to see how good
the subset is; the classifier of interest is multilayer feedforward. We trained several
multilayer feedforward networks to get a mean performance independent of the initial
conditions (initial weights), and also an error for the mean by using standard error
theory [19].
The performance criterion was the percentage correct in the test set. The number of
trained neural networks for each subset was a minimum of ten; in many cases we
trained many more than ten neural networks in order to diminish the error in the
mean. The maximum permitted error in the measurement was 3%. This value is an
appropriate one, resulting from a tradeoff between precision and computational cost.
In Table 2, we can see the results of all methods for problem PI.
Then, for each method and problem we can obtain what we call the optimal subset.
This subset is the one which provides the best performance and, in the case of two
subsets with indistinguishable performance, the one with a lower number of inputs,
because it provides a simpler neural network model.
In order to see whether the performances are distinguishable we have performed
t-tests. The hypothesis tested was the null hypothesis μ_A = μ_B, i.e., that the two
mean performances are indistinguishable. In the case that this null hypothesis can be
rejected, we conclude that the difference between the two measurements is significant.
The significance level of the test, α, was 0.1.
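With only the mean, its error and the number of trained networks available, such a test can be sketched with SciPy's ttest_ind_from_stats; the sample statistics below are invented purely for illustration.

# Sketch of the significance test used to compare two mean performances.
from scipy.stats import ttest_ind_from_stats

# mean percentage correct, standard deviation and number of trained networks
# (all three values per group are invented for this example)
stat, p = ttest_ind_from_stats(mean1=77.32, std1=0.6, nobs1=10,
                               mean2=76.16, std2=0.5, nobs2=10)
if p < 0.1:   # significance level alpha = 0.1, as in the paper
    print("difference is significant, t =", stat)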

Table 2. Performance of all methods in problem PI for all subsets of inputs.

         Number of omitted inputs
Method   1           2           3           4           5           6           7           8
BL1      75.5±0.3    75.72±0.18  76.76±0.15  77.0±0.2    76.56±0.09  75.52±0.10  75.76±0.14  63.3±0.3
BL2      74.8±1.1    76.1±0.2    76.76±0.15  76.16±0.17  77.20±0.18  77.0±0.3    75.76±0.14  74.4±0
CLO      75.5±0.3    75.72±0.18  76.76±0.15  77.0±0.2    76.56±0.09  75.52±0.10  75.76±0.14  63.3±0.3
PRI      75.5±0.3    75.72±0.18  76.76±0.15  77.0±0.2    76.56±0.09  75.52±0.10  75.76±0.14  63.3±0.3
DEV      74.8±1.0    76.1±0.2    76.16±0.18  76.16±0.17  77.20±0.18  77.0±0.3    75.76±0.13  63.3±0.3
LEE      66.3±0.6    69.7±0.7    66.9±0.5    67.9±0.7    67.8±0.5    66.9±0.3    64.44±0.13  65.2±0
TEK      75.24±0.17  76.1±0.2    76.16±0.18  73.6±1.4    75.1±0.3    75.52±0.10  75.76±0.14  74.4±0
SAN      75.5±0.3    75.72±0.18  76.76±0.15  76.16±0.17  76.56±0.09  75.52±0.10  75.76±0.14  63.3±0.3
BOW      75.24±0.17  76.04±0.13  73.7±1.4    76.52±0.12  75.1±0.3    73.4±1.3    75.76±0.14  74.4±0
TEKC     75.24±0.17  75.1±0.3    74.8±0.4    75.5±0.3    74.6±0.3    73.7±1.0    65.00±0.09  65.2±0
TEKA     74.8±1.1    76.1±0.2    76.76±0.15  76.16±0.17  77.20±0.18  77.0±0.3    75.76±0.14  74.4±0
TEKB     75.24±0.17  76.1±0.2    76.76±0.15  76.16±0.17  77.20±0.18  77.0±0.3    75.76±0.14  63.3±0.3
DER2     75.5±0.3    75.72±0.18  74.4±1.1    77.0±0.2    76.56±0.09  75.52±0.10  75.76±0.14  63.3±0.3
DER3     76.3±0.3    75.44±0.14  75.8±0.2    64.8±0.5    60.40±0.17  65.9±0.2    65.00±0.09  65.60±0.13
MAO      75.24±0.17  76.1±0.2    76.16±0.18  73.6±1.4    75.7±0.4    75.52±0.10  75.76±0.14  74.4±0
UTA      75.24±0.17  75.72±0.18  76.80±0.15  76.16±0.17  76.60±0.09  77.32±0.18  74.80±0.14  74.4±0
CIB      75.24±0.17  76.1±0.2    76.16±0.18  76.16±0.17  75.7±0.4    75.52±0.10  75.76±0.14  74.4±0
TEKE     66.3±0.6    65.3±0.7    70.8±0.4    70.8±0.9    69.1±0.8    65.9±0.2    66.08±0.13  65.2±0
YOU      75.24±0.17  76.1±0.2    76.16±0.18  76.52±0.12  75.1±0.3    75.52±0.10  75.76±0.14  74.4±0
For example, from the results of Table 2, the performance of method UTA for the
subset with 6 omitted inputs is distinguishable from the rest and it is the best, so it is
the optimal subset. Also, for CIB there are three subsets with the best performance,
indistinguishable among them: the subsets of 2, 3 and 4 omitted inputs. The optimal
subset is the one of 4 omitted inputs because it has a lower number of inputs.
After obtaining the optimal subsets for each method and problem, we can compare the
performance of different methods on the same problem by comparing the performance
of their optimal subsets. Again, we performed t-tests to check for significant differences.

For example, the results for methods UTA and CIB in Table 2 are distinguishable and
we can conclude that the performance of method UTA is better for problem PI.
By comparing all the methods, two by two, we can obtain another table where we can
find whether one method is better or worse than another for a concrete problem. An
extract of that table is given in Table 3.

Table 3. Performance comparison of UTA and CIB.

Best UTA:  AB, BN, BU, CR, GL, VO, LE, PI, WD
Equal:     BL, DI, M1, M2, M3
Best CIB:  HE
We can see that method UTA is better than CIB in a larger number of problems, and we
conclude that method UTA performs better than CIB.
Following this methodology and this type of comparison with the full results (we do
not present them because of the lack of space), we obtain the following ordination:

UTA > TEKA > BL2 > DEV > TEKB = DER2 = MAO > CLO > BL1 =
= BOW = YOU > PRI = SAN > CIB = TEKE > TEK > TEKC > DER3 > LEE

The best method is UTA and the worst is LEE. We can further discuss the
applicability of every method. The only method with limited applicability was
YOU, as commented in the theory section.
Another important question is the computational complexity. The methods with a
higher computational cost were CIB and TEKE, because of the calculation of the
Hessian matrix, and also YOU, which requires an iteration over all samples of the
training set and the calculation of two integrals for each iteration.

4 Conclusions

We have presented a review of the feature selection methods based on the analysis of a
trained multilayer feedforward network, which have been applied to neural networks.
We have also carefully presented a methodology that allows selecting an optimal
subset and evaluating and comparing feature selection methods. This methodology was
applied to the 19 reviewed methods in a total of 15 different real-world classification
problems. We presented an ordination of the methods according to their performance
and it was clearly concluded which method performs best and should be used. We have
also discussed the applicability and computational complexity of the methods.

References

1. Devena, L.: Automatic selection of the most relevant features to recognize objects.
Proc. of the Int. Conf. on Artificial NNs, vol.2, pp.1113-1116, 1994.
2. Battiti, R.: Using mutual information for selecting features in supervised neural net
learning. IEEE Trans. on Neural Networks, vol. 5, n. 4, pp. 537-550, 1994.

3. Belue, L.M., Bauer, K.W.: Determining input features for multilayer perceptrons.
Neurocomputing, vol. 7, n. 2, pp. 111-121, 1995.
4. Engelbrecht, AP., Cloete, I.: A sensitivity analysis algorithm for pruning
feedforward neural networks. Proc. of the Int. Conf. on Neural Networks, vol. 2,
pp. 1274-1277, 1996.
5. Priddy, K.L., Rogers, S.K., Ruck D.W., Tarr G.L., Kabrisky, M.: Bayesian
selection of important features for feedforward neural networks. Neurocomputing,
vol. 5, n. 2&3, pp. 91-103, 1993.
6. Lee, H., Mehrotra, K., Mohan, C., Ranka, S.: Selection procedures for redundant
inputs in neural networks. Proc. of the World Congress on Neural Networks, vol. 1,
pp. 300-303, 1993.
7. Tetko, I.V., Tanchuk, V.Y., Luik, A.I.: Simple heuristic methods for input
parameter estimation in neural networks. Proc. of the IEEE Int. Conf. on Neural
Networks, vol. 1, pp. 376-380, 1994.
8. Cibas, T., Soulié, F.F., Gallinari, P., Raudys, S.: Variable selection with neural
networks. Neurocomputing, vol. 12, pp. 223-248, 1996.
9. Tetko, I.V., Villa, A.E.P., Livingstone, D.J.: Neural network studies. 2. Variable
selection. Journal of Chemical Information and Computer Sciences, vol. 36, n. 4,
pp. 794-803, 1996.
10. El-Deredy, W., Branston, N.M.: Identification of relevant features in HMR tumor
spectra using neural networks. Proc. of the 4th Int. Conf. on Artificial Neural
Networks, pp. 454-458, 1995.
11. Steppe, J.M., Bauer, K.W.: Improved feature screening in feedforward neural
networks. Neurocomputing, vol. 13, pp. 47-58, 1996.
12. Mao, J., Mohiuddin, K., Jain, A.K.: Parsimonious network design and feature
selection through node pruning. Proc. of the 12th IAPR Int. Conf. on Pattern
Recognition, vol. 2, pp. 622-624, 1994.
13. Utans, J., Moody, J., Rehfuss, S., Siegelmann, H.: Input variable selection for
neural networks: Application to predicting the U.S. business cycle. Proc. of
IEEE/IAFE 1995 Comput. Intellig. for Financial Eng., pp. 118-122, 1995.
14. Younes, B., Fabrice, B.: A neural network based variable selector. Proc. of the
Artificial Neural Network in Engineering, (ANNIE'95), pp. 425-430, 1995.
15. Bowles, A.: Machine learns which features to select. Proc. of the 5th Australian
Joint Conf. on Artificial Intelligence, pp. 127-132, 1992.
16. Sano, H., Nada, A., Iwahori, Y., Ishii, N.: A method of analyzing information
represented in neural networks. Proc. of 1993 Int. Joint Conf. on Neural Networks,
pp. 2719-2722, 1993.
17. Bishop, C.: Exact calculation of the Hessian matrix for the multilayer perceptron.
Neural Computation, vol. 4, pp. 494-501, 1992.
18. Watzel, R., Meyer-Bäse, A., Meyer-Bäse, U., Hilberg, H., Scheich, H.:
Identification of irrelevant features in phoneme recognition with radial basis
classifiers. Proc. of 1994 Int. Symp. on Artificial NNs, pp. 507-512, 1994.
19. Bronshtein, I., Semendyayev, K.: Mathematics Handbook for Engineers and
students (in Spanish). MIR, Moscow, 1977.
Applying Evolution Strategies to Neural Networks Robot
Controller

Antonio Berlanga, José M. Molina, Araceli Sanchis, Pedro Isasi

Sca-Lab. Departamento de Informática.
Universidad Carlos III de Madrid, Spain.
Avda. Universidad 30, 28911-Leganés (Madrid).
e-mail: aberlan@ia.uc3m.es

Abstract - In this paper an evolution strategy (ES) is introduced to learn the weights of a
neural network controller in autonomous robots. An ES is used to learn high-performance
reactive behavior for navigation and collision avoidance. The learned behavior is able to solve
the problem in different environments, so the learning process has proven the ability to obtain
a specialized behavior. All the behaviors obtained have been tested in a set of environments,
and the capability of generalization is shown for each learned behavior. No subjective
information about "how to accomplish the task" has been included in the fitness function. A
simulator based on the mini-robot Khepera has been used to learn each behavior.

I. Introduction
Autonomous robots are sometimes viewed as reactive systems; that is, as systems
whose actions are completely determined by current sensorial inputs. This is the basis
of the subsumption architecture [1], where finite state machines are used to
implement robot behaviors. Other systems use fuzzy logic controllers instead [2]. The
rules of these behaviors can be designed by a human expert, designed "ad hoc" for
the problem, or learned using different artificial intelligence techniques [3]. The
control architecture used to evolve the reaction (adaptation) is based on a neural
network.
The neural network controller has several advantages [4]: (1) NNs are resistant
to the noise that exists in real environments and are able to generalize their ability to
new situations; (2) the primitives manipulated by the evolution strategy are at the
lowest level, in order to avoid undesirable choices made by the human designer; (3) a
NN can easily exploit several ways of learning during its lifetime. The use of a
feedforward network with eight input units and two output units directly connected
to the motors appears in previous works [4] as an efficient way to learn the behavior
"avoid obstacles" using Genetic Algorithms. In this work the NN has to learn a more
complex behavior, "navigation". This task requires more environmental information,
and the sensors have been grouped using only five input units.
In the proposed model, the robot starts without information about the right
associations between environmental signals and the actions responding to those
signals. From this situation, the robot is able to learn through experience to reach the
highest adaptability grade to the sensor information. The number of inputs (robot
sensors), the range of the sensors, the number of outputs (number of robot motors)
and their description is the only prior information.

In this paper, we present the results of research aimed at learning reactive
behaviors in an autonomous robot using an ES. In section 2, we outline the general
theory of Evolution Strategies. Section 3 is related to the experimental environment
and the goals of the work. The experimental features are described in Section 4. The
experimental results are shown in Section 5. The last section contains some
concluding remarks.

2. Evolution Strategies

Evolution strategies (ES), developed by Rechenberg [5] and Schwefel [6], have been
traditionally used for optimization problems with real-valued vector representations.
Like Genetic Algorithms (GA) [7], ES are heuristic search techniques based on the
building block hypothesis. Unlike GA, however, the search is basically focused on
gene mutation. This mutation is adaptive, based on how well the individual
represents the problem solution. Recombination also plays an important role in
the search, mainly in the adaptive mutation.

Figure 1: Schema of an evolution strategy (initialize population P; evaluate P with the
fitness function; select parents; recombination; mutation; children; evaluate and select
survivors among children + parents; repeat until the end condition).

Figure 1 shows a typical evolution strategy. First, it is necessary to codify each
solution of the problem as a real-valued vector. Each vector represents a solution and
also an individual. The method consists in evolving solution sets, called populations,
in order to find better solutions. The evolution of populations is performed by
selecting pairs of individuals (parents) that produce new individuals (children) via
recombination, which are further perturbed via mutation. The best individual (μ+1
selection) or the best individuals (μ+λ selection), in the set composed of parents and
children, are selected to form the next population [8].
An individual is represented by a = (x_1, ..., x_n, σ_1, ..., σ_n), that is, the
n real values (x_i) and their corresponding deviations (σ_i) used in the mutation
process for the (μ+λ) ES. The mutation is represented by equations (1) and (2).

    σ_i' = σ_i · exp(N(0, Δσ))                                       (1)

    x_i' = x_i + N(0, σ_i')                                          (2)

where x_i' and σ_i' are the mutated values and N(μ, σ) denotes a normal distribution.

However, when a (μ+1) ES is used, the mutation process follows the 1/5 rule [8].
In both cases, the recombination follows the canonical GA approach [7].
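A minimal Python sketch of one (μ+λ)-ES generation with the self-adaptive mutation of equations (1)-(2); the fitness function, Δσ, the minimization convention and the omission of recombination are all illustrative assumptions, not the exact setup of this paper.

# Minimal sketch of a (mu+lambda)-ES step: individuals are (x, sigma) pairs
# of NumPy arrays; eq. (1) mutates the deviations, eq. (2) the values.
import numpy as np

rng = np.random.default_rng(0)

def es_step(pop, fitness, mu=6, lam=6, d_sigma=0.2):
    children = []
    for _ in range(lam):
        x, s = pop[rng.integers(len(pop))]                   # pick a parent
        s2 = s * np.exp(rng.normal(0.0, d_sigma, s.shape))   # eq. (1)
        x2 = x + rng.normal(0.0, s2)                         # eq. (2)
        children.append((x2, s2))
    everyone = pop + children                 # (mu+lambda) survival pool
    everyone.sort(key=lambda ind: fitness(ind[0]))   # assumes minimization
    return everyone[:mu]                      # keep the mu best individuals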

3. Experimental Environment

The task faced by the autonomous robot is to reach a goal in a complex environment,
avoiding the obstacles found in the path. Different environments have been used to
find the connections of the NN. The system has been developed using a simulator to
test different characteristics of the system. Finally, a real robot has been used to test
the proposed solution.
A simulator developed in a previous work [10] has been used as complete
software for the simulation of the mobile robot. Working with a simulation offers the
possibility to evaluate several systems in different environments while controlling the
execution parameters. The simulator is based on the mini-robot Khepera [9], a
commercial robot developed at LAMI (EPFL, Lausanne, Switzerland). The robot is
circular, 5.5 cm in diameter and 3 cm high, and weighs 70 g. The robot has two
wheels controlled by two motors that allow any type of movement. The ES should
specify the wheel velocity, which can be read later by an odometer. Eight infrared
sensors supply two kinds of incoming information: proximity to the obstacles and
ambient light. Instead of using the eight sensors individually, to reduce the amount of
information six sensors are used and grouped (as Figure 2 shows), giving a unique
value, the average, from two input values. Representing the goal by a light source,
the ambient information lets the robot know the angle (the angular position on the
robot of the ambient sensor receiving more light) and the distance (the amount of
light in the sensor).

Figure 2: Sensors considered in the real robot.

The simulated world consists of a rectangular map of user-defined dimensions,
where particular objects are located. In this world it is possible to define a final
position for the robot (the goal to reach) (Figure 3 (a)). In this case, the robot is
represented with three proximity sensors and two special sensors to measure the
distance and the angle to the goal.

Figure 3: (a) SimDAI Simulator (example of one simulated environment). (b)
Example of a real experimental environment.

Different simulated worlds that resemble real ones have been defined before
being implemented in the real world. An example of these environments is shown in
Figure 3 (a) and Figure 3 (b). The controller developed is the same in both cases
(simulated and real) except for differences in the treatment of the sensors.

4. Evolving NN Connections by Means of Evolution Strategies

It has been proved that, by means of connections between sensors and actuators, a
controller is able to solve any autonomous navigation robotic behavior [11]. This
theoretical approach is based on the possibility of finding the right connections of a
feed-forward NN without hidden layers for each particular problem. The input
sensors considered in this approach are the ambient and proximity sensors of Figure
2. The NN outputs are the wheel velocities. The velocity of each wheel is calculated
by means of a linear combination of the sensor values using those weights (Figure 4):

    v_j = f\left(\sum_{i=1}^{5} w_{ij} \cdot s_i\right)              (3)

where w_{ij} are the searched weights, s_i are the sensor input values and f is a
function constraining the maximum velocity values of the wheels.
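A sketch of this controller in Python; the clipping form of f and the maximum speed value are assumptions.

# Sketch of the controller of eq. (3): each wheel velocity is a clipped linear
# combination of the five sensor values; weight layout is illustrative.
import numpy as np

V_MAX = 1.0   # assumed maximum wheel speed

def wheel_velocities(sensors, W):
    """sensors: 5 values; W: 5x2 weight matrix (one column per wheel)."""
    v = np.asarray(sensors) @ W
    return np.clip(v, -V_MAX, V_MAX)   # f constrains the maximum velocity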

Figure 4: Connections between sensors and actuators in the Braitenberg
representation of a Khepera robot (sensor values s_i, weights w_ij).

Weight values depend on the problem features. To find them automatically, an ES is
proposed. In this approach each individual is composed of a 20-dimensional real-
valued vector, representing each of the above-mentioned weights and their
corresponding variances. The individual represents one robot behavior, the
consequence of applying the weights in equation (3). The evaluation of behaviors is
used as the fitness function.
In order to make the problem more realistic, no information about the location of
the goal, neither direction nor distance, has been included in the evaluation function.

5. Experimental Results

Different experiments have been done, all of them over the same set of environments.
The environments have been generated by changing the goal position and the number
and location of obstacles, looking for a generalized environment. In a set of
preliminary comparisons, it was found that results obtained with the software model
did not differ significantly from the results obtained with the physical robot.
An exploratory set of experiments was performed in simulation to adjust the
quality measures used in the fitness function as well as the parameters of the
Evolution Strategy. A (μ+λ)-ES, with μ=6 and λ=6, was used.
The quality measures used to calculate the fitness value of a controller were the
following:
- Number of collisions. (Collisions)
- Number of stops: cycles of the simulation in which the robot stays in the
  same location. (Stand)
- Time needed to reach the goal. (Time)
- Length of the robot trajectory from the starting point to the final location.
  (Path_Length)
The global evaluation depends linearly on these measures: 10·Collisions +
10·Stand + 20·Time - 1.5·Path_Length. The evaluation of a robot behavior over an
environment ends when the goal has been reached or the time exceeds a time-out.
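The global evaluation can be sketched directly; the simulator that would supply these four quality measures is assumed.

# Sketch of the paper's linear fitness (lower is better):
def fitness(collisions, stand, time, path_length):
    return 10 * collisions + 10 * stand + 20 * time - 1.5 * path_length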

Five evolutionary runs of 70 generations each have been performed for eight
different environments, each one starting with a different seed for initializing the
computer random functions.

Figure 5. Eight environments used to evolve the controller. Dark shapes are the
obstacles, the big point is the starting location of the robot and the small point is the
goal. The environments are closed.

The evolution of the quality measures used to calculate the fitness value shows a
similar behavior over all environments. All the quality measures evolve towards the
optimal robot behavior (see Figures 6-10).

Figure 6. Evolution of the "Path_Length" versus generations in each environment.
Figure 7. Evolution of the "Time" needed to reach the goal versus generations in each
environment.

Figure 8. Evolution of the "Stand" versus generations in each environment.
Figure 9. Evolution of the "Collisions" versus generations in each environment.

Figure 10. Evolution of the fitness value of the population's best controller versus
generations in each environment (environments E0-E7).

Figures 11 and 12 show the evolution of the quality measures for environments 1
and 3.

Figure 11. Evolution of the quality measures versus generations in environment 1.
Figure 12. Evolution of the quality measures versus generations in environment 3.

Very different behaviors are observed. For environment 1, which consists of an
initial configuration without close objects, exploratory behaviors appear initially. The
robot covers long distances but without avoiding obstacles. On the contrary, for
environment 3, due to the proximity of obstacles to the starting point, the controller
will not be able to explore the environment searching for the goal until it acquires the
ability to avoid obstacles. The environment guides the learning process.
The obtained controllers are valid for the environment in which they were trained.
Figure 13 shows the behavior of a controller in the environment where it has been
trained, as well as in other new environments.

Figure 13. The fitness value of the solution (S_n) obtained in environment n is
measured in all environments. The point shows the fitness value calculated in the
training environment.

Neural networks trained with an ES adjust their weights precisely to the training
environment. This is an advantage when we want to obtain a good solution within a
short processing time, but a drawback for obtaining generalized solutions. This
behavior is displayed in Figure 14: the solution trained in environment 3 is validated
in environments 1, 2 and 6.

Figure 14. The solution trained in environment 3 shows its behavior in environments
3, 1, 2 and 6, respectively.

6. Conclusions and further work

The experiments prove the possibility of learning behaviors in an autonomous robot
by means of an ES. The process has been applied to a simple NN where the direct
associations between sensors and motors allow solving a navigation problem. It can
also be extended to other, more complex NNs. It is important to remark that the
fitness function does not include any subjective information about "how to
accomplish the task", only objective information about "how the task has been
accomplished".
As a consequence, the learning process can be easily modified in order to
consider new problems that could appear, such as surrounding an obstacle or hiding
from the light. The adaptation to new problems does not require much effort, because
no local information about the problem is included in the fitness function.

7. References

[1] Brooks R. A. "Intelligence without Representation". Artificial Intelligence, 47,
139-159, (1991).
[2] Ishikawa S. "A Method of Autonomous Mobile Robot Navigation by using Fuzzy
Control". Advanced Robotics, vol. 9, No. 1, 29-52, (1995)

[3] Matell~m V., Molina J.M., Sanz J., Fem~indez C. "Learning Fuzzy Reactive
Behaviors in Autonomous Robots". Proceedings of the Fourth European
Workshop on Learning Robots, Germany, (1995).
[4] Miglino O., Hautop H., Nolfi S. "Evolving Mobile Robots in Simulated and Real
Environment". Artificial Life 2:417-434 (1995)..
[5] Rechenberg, I. Evolutionsstrategie: Optimierung Technischer Systeme nach
Prinzipien der Biologischen Evolution. Frommann-Holzboog, Stuttgart (1973).
[6] Schwefel, H. P. Numerical Optimization of Computer Models. New York: John
Wiley & Sons (1981).
[7] Goldberg D., Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley, New York, (1989).
[8] Rechenberg I., Evolution strategy: Nature's Way of Optimization. In H. W.
Bergmann, editor, "Optimization: Methods and Applications, Possibilities and
Limitations", Lecture Notes in Engineering, pp. 106-26, Springer, Bonn (1989).
[9] Mondada F. and Franzi P.I. "Mobile Robot Miniaturization: A Tool for
Investigation in Control Algorithms". Proceedings of the Second International
Conference on Fuzzy Systems. San Francisco, USA, (1993).
[10] Sommaruga L., Merino I., Matellán V. and Molina J. "A Distributed Simulator
for Intelligent Autonomous Robots", Fourth International Symposium on
Intelligent Robotic Systems-SIRS96, Lisboa (Portugal), (1996).
[11] Braitenberg V. Vehicles: Experiments on Synthetic Psychology. MIT Press,
Cambridge, Massachusetts (1984).
On Virtual Sensory Coding: An Analytical
Model of the Endogenous Representation

José R. Álvarez, Félix de la Paz, and José Mira

Dpto. Inteligencia Artificial - UNED, Senda del Rey, s/n. E-28040 Madrid, Spain,
{jras, delapaz, jmira}@dia.uned.es

Abstract. In this paper we present a constructive analytical process
used to synthesize a net capable of transforming poor sensory data into
a richer internal representation invariant to rotations and local dis-
placements. This model could be of interest both to understand the sen-
sory code and the internalization of external geometries into the nervous
system, and to build internal models of the environment in autonomous
and connectionistic robots.

1 Introduction

In spite of the great efforts in experimental neurology and in artificial intelligence
(AI), it is not yet clear how to model the perception of the external environment,
nor how to render this model operational by electronic means or via a computer
program.
How can we translate the spatio-temporal data patterns into an internal
representation with more precision and centered around the animal or robot?
How can we recognize an object despite drastic changes in size, rotational dis-
placement, or position? How can we achieve these invariances to translations,
rotations or points of view? Walter Pitts and Warren S. McCulloch, in "How
we know universals" [1], suggest that a biologically plausible method could be
to apply a whole set of transformations to features of the sensed environment,
and to average over the resulting ensemble to extract invariant representations
that allow the brain to produce standard versions centered on its center of grav-
ity. Michael Arbib has also pointed out the relevance of this problem statement
[2] because it stresses a method of hierarchical coding of information by topo-
graphically organized networks arranged into layers. Pellionisz and Llinas [3] also
intended to formulate a set of transformations from "pure" sensory data to an
internal and autonomous representation of the external world, but the problem
still remains open [4].
On the other hand, many scientists in the field of robotics have dealt with
the problem of internal representation in different ways [5], even without
any representation, that is, fully reactive to stimuli through the sensors without
a map [6], or using different mathematical abstractions for the representation
of the external environment in the agent memory, with partially satisfactory
results, such as, for example, the configuration space [7] and the repulsive
potential method [8].

Those methods have problems caused by the lack of mechanical precision of the
robot (dead reckoning), related to local minima, and, in general, problems due to
the discrepancies between the model and the real environment. Other, qualitative
methods are used to solve part of those problems [9], but the solution seems to
lie in the use of a hybrid strategy combining qualitative and geometric methods.
Again, the problem of building an internal model of the external environment,
allowing the robot to navigate or to perform other tasks involving an efficient
use of the inner representation of the external geometry, has not been solved in
a satisfactory way.
In this paper a very modest, but analytically complete, example has been
worked out. We deal with the task of creating a computable structure
at the analytical level for a simple set-up, namely a circular system with a
limited set of distance sensors, arranged with planar radial symmetry, that
can move as a whole (rotation relative to the system and displacements with it).
We also assume that the system has other sensors for inner perception, such as
the angle rotated by the sensor set or the displacement of the center (direction
and amount). The codification of the sensors can be absolute, relative or as a
rate of change. These inner sensors can have dead-reckoning errors, which must
be taken into account.
The system can move around measuring distances in an environment filled
with two-dimensional obstacles (from the point of view of the system). The
obstacles are fixed (or they move very slowly relative to the system movement).
The sensors can also have sporadic errors (wrong measurements) that must be
compensated.
The rest of the paper is structured in the following way. In section 2 we de-
scribe the solution method at the knowledge level, starting with the data structures
(distance sensors, system movement and inner sensors) and giving the diagram
of transformations for the successive representations in the model. Section 3 de-
scribes the first transformation (the way we use the system movement to increase
the sensor resolution and introduce rotation invariance and adaptation to dis-
placements). Then, in section 4, we describe the second transformation (sensory
representation independent of the position). Finally, in section 5, we conclude by
noting the usefulness of this design method.

2 Knowledge level description of the method

The task is to build a "navigation-oriented" internal representation with greater
wealth of information than the primary data on distance values. For this, the
method uses movement and memory, together with the hypothesis of slow changes
in the environment.

2.1 Distance sensors

The system is composed of a collection of distance sensors which can move as
a whole and are organized with planar radial symmetry. The sensors can be of

several types with different properties. From a formal point of view these sensors
are characterized by the following properties (figure 1):

1. Each sensor is fixed at a point at distance R^t (where t denotes the type) from
   the system center. That distance is the same for all the sensors of the same
   type.
2. The sensory field of each sensor faces outwards from the system. The position
   of a sensor i (of type t) is determined by an angle θ_i^t (relative to the system)
   in the same direction as the axis of its sensory field. As a first approximation,
   we suppose that the sensors of each type are distributed uniformly around
   the system such that θ_i^t = i · Δθ^t, where Δθ^t = 2π/N^t, with N^t being the
   total number of sensors of type t in the system.
3. The sensor has a sensitivity sector defined by the angle δ^t centered on its
   axis.
4. The sensor can detect objects within its sensory field, between a minimum
   distance (d^t_min) and a maximum one (d^t_max) from its position. The value
   given by the sensor represents the distance from it to the closest object within
   range. The sensor can inform about saturation (all objects are out of range,
   too far away or too close). The precision of the returned value can be limited;
   this can be represented by the value belonging to a finite set only. The most
   common case is values distributed over a linear range.
5. Each type of sensor has an accuracy given by a function depending on the
   distance and the angular position of the object relative to the sensor. There
   is also a minimum size of detectable object (i.e., of its projection), depending
   again on the distance and the angle.

Figure 1. Sensory field and geometrical characterization of a distance sensor.



2.2 System movement and inner sensors


The system can move in a two-dimensional space in any direction relative
to the sensor orientation. The set of sensors can rotate around the center of the
system, independently of it, but without changing the relative positions of the
sensors among themselves.
Apart from the set of main sensors, the system has a way to measure its own
displacements and rotations. This means the system also has a linear displace-
ment sensor and a register indicating the direction (angle of movement), plus a
sensor to know the angle between the system reference and the main sensor set.
These three sensors form the inner sensor set.
The inner sensors are defined by the set of returned values (precision)
and the error due to dead reckoning (i.e., the value can be smaller or bigger
than the true displacement distance or angle).

2.3 The inference structure


The diagram in figure 2 shows the set of successive transformations over the pri-
mary data provided by the inner and distance sensors. The rectangles represent
data structures and the ellipses represent (roughly) inference steps. The suc-
cessive transformations are represented by the nested discontinuous rectangles
and correspond to the functional specification of what we (external observers)
consider the system "needs to know" to navigate in an environment. That is,
an endogenous representation in a system of reference centered on the moving
system and invariant under navigation changes.
Level "0" corresponds to the real sensors. Level "1" is the first level of
virtual sensors, which embodies the spatio-temporal expansions of the sensory
field and the rotation invariance. The successive levels take care of the invariances
under displacements, the space of features and the use of these features in the
identification of homogeneous zones and in the build-up of a topological map of
the environment [10]. This paper describes in certain detail levels "0", "1",
and "2"; it partially sketches levels "3", "4", and "5" and gives some hints for
their development.

3 First transformation (virtual sensors)

A way to use the system movement to improve the sensors' resolution consists of
accumulating the instantaneous values of the information at the primary sen-
sors corresponding to many different coordinates in successive sampling intervals.
This expansion is developed in two parts: 1) rotation of the sensors, without dis-
placements (static virtual sensors), and 2) displacement in one direction, without
rotation, which gives us the corrected or dynamic virtual sensors.
Given that the information received by the system has to be transformed
to represent the environment from the endogenous "point of view", the first
representation, relative to the system position and independent of the direction,
defines the properties of the virtual sensors (VS's):

Figure 2. Diagram of the successive transformations of the primary representations in
the build-up of a model of the environment centered on the moving system and
invariant under navigation changes (from the real sensors, system rotations and system
displacements at the lowest level, through the corrected (dynamic) virtual sensors and
position-independent sensors, up to a topological map of zones with metric information
and mobile environment objects).

1. The VS's are "placed" in the center of the system. That means the distance value stored by them is relative to that center.
2. Every VS is assigned a two-dimensional spatial sector around the system to represent the distance to the closest object in that sector.
3. The VS's receive unified values from the different kinds of real sensors only when one of them is in range (not saturated) and faces the direction assigned to the VS.

4. The VS's change the stored information concerning distances when the system rotates or when it moves, thus keeping the representation of the external obstacles approximately invariant.

3.1 From raw sensory data to virtual sensors

In the first stage of the sensory precision extension (accumulation in rotations), the data from the real sensors are distributed to the VS's as figure 3 shows. This distribution of data guarantees independence from the rotations of the sensors relative to the external space and a finer representation (more points).

Figure 3. Distribution of raw sensory data to the virtual sensors (real sensors, group distributors, virtual sensors, and switch connections).

The main function in this stage is the data routing, which depends on the angular direction of the sensors. Data distribution is done by intermediate elements which group virtual sensors into zones, to allow more modularity and fault tolerance. There are as many groups as real sensor sectors for the first step in the distribution. The sector covered by the virtual sensors of a group belongs to the group. Every real sensor is connected to the group whose assigned sector corresponds to the facing direction of the real sensor at the sampling moment, when the measurement is taken.

The second step in the first stage of the transformation is demultiplexing. That is to say, the distribution of the value from each real sensor to a group and from that group to the VS that corresponds to this exact orientation.

3.2 Temporal and spatial accumulation

Once the valid data (sensor in range) is in the correct place, each VS accumulates the new incoming value onto the previously stored values with a weight. This temporal accumulation also includes a continuous "forgetting" (incrementing the distance) if the VS is not activated frequently. The accumulation is also spatial, by lateral interaction with the neighboring virtual sensors. Small contributions from the neighbors are added to the stored value. The contributions are related to the dispersion in the real sensors, depending on the overlap between sensory fields.
The "forgetting" consists of a periodic increment of the stored distance, inversely related to it. This function allows the correction of isolated erroneous data. The following expression for the distance increment due to forgetting includes all these functional specifications:

$$\Delta d = K\,(d_{max} - d) \qquad (1)$$

It is null when $d = d_{max}$ and it is $K \cdot (d_{max} - d_{min})$ when $d = d_{min}$. The constant $K$ must be the part of the complete range corresponding to the times it is activated in a sampling period of the real sensors, so it must be $K = \frac{N f_r}{\tilde{N} f_a}$, where $N$ is the number of real sensors of one kind, $\tilde{N}$ is the number of virtual sensors, $f_r$ is the sampling frequency of the slowest real sensors, and $f_a$ is the activation frequency of the forgetting increments in the virtual sensor. Normally $f_a > f_r$ and $\tilde{N} > N$, so $K$ is a small value.
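In code, one update cycle of a virtual sensor can be condensed into a few lines. The following Python sketch is ours: the blending weights w and w_lat, and the linear form of the forgetting term, are assumptions consistent with the boundary conditions of eq. (1), not the authors' implementation.

    import numpy as np

    def update_virtual_sensors(stored, new_value, idx, K, d_max,
                               w=0.7, w_lat=0.05):
        # One update cycle of the ring of virtual sensors (sketch).
        # stored: float array of distances kept by the VS's; new_value:
        # valid reading routed to the VS with index idx; K, d_max:
        # forgetting constant and upper end of the distance range.
        n = len(stored)
        # temporal accumulation: blend the new reading into the stored value
        stored[idx] = w * new_value + (1.0 - w) * stored[idx]
        # spatial accumulation: small contributions from the two neighbors
        stored[idx] += w_lat * (stored[(idx - 1) % n] + stored[(idx + 1) % n]
                                - 2.0 * stored[idx])
        # forgetting: every VS drifts towards d_max, faster when close (eq. 1)
        stored += K * (d_max - stored)
        return stored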

3.3 Displacement corrections in the virtual sensors

The virtual sensors must correct the stored value when the system changes its position, thus reflecting the changes of the objects relative to the system. That correction has two components, depending on the angle of the displacement relative to the facing direction of the virtual sensor:

- A longitudinal correction corresponding to an increment or decrement of the distance due to the represented objects moving away or approaching.
- A transversal correction due to part of the neighboring sensor field coming into the sensory field and part of the stored value going to the other neighbor.

The two corrections are proportional to the projection of the displacement over the facing angle of the sensor. To avoid a global calculation depending on the angle of each sensor relative to the displacement, we distribute the computations and the connections between the virtual sensors in a local and modular way. To simplify the calculations we also suppose that the displacement is along the nearest direction corresponding to one virtual sensor.

The process starts at the sensor facing the same direction as the displacement. This sensor has a longitudinal correction equal to the displacement and a null transversal correction. The sensor transmits that information to its two neighboring sensors, activating them. The sensors receiving information from one side compute their corrections and transmit them to the other side. This cascade process ends with the last sensor (pointing in the direction opposite the displacement), which receives two activations equilibrating the transversal corrections (the number of virtual sensors must be even). This way of computing allows all the sensors to use the same formulae, independent of the angle relative to the displacement.
We now compute the correction of the sensor at place number $k$, counting from the first activated sensor (in the same direction as the displacement), which has index 0. We call $d_k$ the distance stored before the displacement in the $k$-th sensor and $d'_k$ the corrected distance stored after the displacement; it will be the interpolation (or extrapolation) between $d_k$ and $d_{k-1}$. The correction depends on the angle of the sensor relative to the direction of displacement, $\theta_k$, which can be substituted by $\theta_k = k \cdot \Delta\theta$, where $\Delta\theta = \frac{2\pi}{N}$ with $N$ the number of virtual sensors. We will call $a$ the distance advanced by the system.
The diagrams of fig. 4 can help us develop the expressions for the new corrected value. There are two possible geometric configurations, depending on the advance and the value of the previous sensor ($d_{k-1}$); the first one is calculated by interpolation and the second one by extrapolation. The results are the same in both cases, as we will prove. We use a shortened notation, calling $S \equiv \sin(\Delta\theta)$, $C \equiv \cos(\Delta\theta)$, $s_k \equiv a \sin(\theta_k)$ and $c_k \equiv a \cos(\theta_k)$.
The interpolation correction gives the new value of the distance $d'_k = d_{k-1}C - c_k + p$, where the last term $p$ can be solved by similar triangles (see fig. 4) as

$$p = \frac{d_k - d_{k-1}C}{d_{k-1}S}\,(d_{k-1}S - s_k) \qquad (2)$$

The extrapolation correction is similar, except that now the last term is $-q$ (see the lower part of fig. 4), $d'_k = d_{k-1}C - c_k - q$, where the term $q$ by similar triangles is

$$q = \frac{d_k - d_{k-1}C}{d_{k-1}S}\,(-d_{k-1}S + s_k) \qquad (3)$$

being equal to $p$ with opposite sign.


We substitute $p = -q$ into either of the previous expressions to obtain the same correction expression for both the interpolation and extrapolation cases,

$$d'_k = d_{k-1}C - c_k + \frac{d_k - d_{k-1}C}{d_{k-1}S}\,(d_{k-1}S - s_k) \qquad (4)$$

and simplification with term reordering yields

$$d'_k = d_k - \frac{s_k\,(d_k - d_{k-1}C)}{S\,d_{k-1}} - c_k \qquad (5)$$

Figure 4. Diagram of the projections used in the calculations of the longitudinal and transversal corrections of the virtual sensors in a system displacement. The two possible configurations are represented (interpolation above and extrapolation below).

where the second term on the right side is the transversal correction and the last term is the longitudinal correction.
Expression (5) in the form shown, using only the definitions of $s_k$ and $c_k$, depends directly on $\theta_k$, which varies with the displacement direction. We seek an expression depending only on values in the sensor and in the neighbor activating it. We use the trigonometric identities for the angle sum and the fact that $\theta_k = \theta_{k-1} + \Delta\theta$, which with $s_k$ and $c_k$ gives the result:

$$s_k = C\,s_{k-1} + S\,c_{k-1}$$
$$c_k = -S\,s_{k-1} + C\,c_{k-1} \qquad (6)$$

where both expressions depend only on the previous sensor values and on constants ($S$ and $C$). That is to say, the corrections in every virtual sensor are done using the three values sent by the adjacent sensor, $s_{k-1}$, $c_{k-1}$ and $d_{k-1}$, and using formulae (6) and (5). Every sensor sends the three values $s_k$, $c_k$ and $d_k$ to the next one. In the first activated sensor the initial values are $s_0 = 0$ (null transversal correction) and $c_0 = a$ (full longitudinal correction).
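Since each sensor only needs the three values $(s_{k-1}, c_{k-1}, d_{k-1})$ sent by its neighbor, the whole cascade reduces to a short local loop. The following Python sketch is ours (function name and index conventions are assumptions, and the two sides of the ring are propagated sequentially here rather than in parallel as the hardware-like description suggests):

    import numpy as np

    def correct_for_displacement(d, a, N):
        # d[0] is the distance stored by the VS facing the displacement;
        # indices grow around the ring; a is the distance advanced.
        dtheta = 2.0 * np.pi / N
        S, C = np.sin(dtheta), np.cos(dtheta)
        d_new = d.astype(float).copy()
        d_new[0] = d[0] - a                  # s_0 = 0, c_0 = a: pure advance
        for side in (+1, -1):                # cascade towards both neighbors
            s_prev, c_prev = 0.0, a
            for k in range(1, N // 2 + 1):
                s_k = C * s_prev + S * c_prev            # eq. (6)
                c_k = -S * s_prev + C * c_prev
                i, j = (side * k) % N, (side * (k - 1)) % N
                # eq. (5): transversal term plus longitudinal term
                d_new[i] = d[i] - s_k * (d[i] - d[j] * C) / (S * d[j]) - c_k
                s_prev, c_prev = s_k, c_k
        return d_new

At the sensor opposite the displacement ($\theta = \pi$) the transversal term vanishes, so both cascades meet there with the same value, as stated above.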

4 Second transformation (position independent sensors)

From the representation obtained in the first transformation (independent of rotations of the sensor set with respect to the center of the system, and adapting to the displacements), we obtain another representation that, in addition, is independent of the position of the system within the zone of reach of its sensors (zone of local homogeneity). For this, 1) we calculate a position that is "centered" (relative to the obstacles detected around and represented in the virtual sensors), and then 2) we build up a second set of virtual sensors that represent the external environment from this previously computed centered position.

4.1 Local computation of the center of area

At this stage we have a representation of the obstacles around the system in terms of a set of radial distances. We can suppose that the distances stored in the virtual sensors represent the vertices of a polygon of free space around the system. Figure 5 shows the imaginary triangle formed by the distances stored in two neighboring VS's ($r_a$ and $r_b$) and the union between them, with angles $\alpha_a$ and $\alpha_b$ relative to a common origin. The difference between these angles remains constant, $\alpha_a - \alpha_b = \Delta\theta$, as in the previous section.
To obtain the center of areas of all the triangular sectors, it is convenient to use an intermediate layer of computing elements, where each element of the layer is connected with two adjacent VS's and receives from both the information concerning the local values of area and center of area. Well-known geometrical considerations give us the following expressions for the sector area ($S_c$) and the coordinates ($x_c$, $y_c$):

$$S_c = \tfrac{1}{2}\, r_a r_b \sin \Delta\theta$$
$$x_c = \tfrac{1}{3}\,(r_a \cos\alpha_a + r_b \cos\alpha_b) \qquad (7)$$
$$y_c = \tfrac{1}{3}\,(r_a \sin\alpha_a + r_b \sin\alpha_b)$$

Figure 5. Diagram used to illustrate the calculus of the center of areas in triangular sectors. The left part shows the panorama of virtual sectors of a free space around the system; the right part illustrates the calculus of $r_c$.

where we have taken into account that $\sin \Delta\theta$, $\alpha_a$ and $\alpha_b$ are constant for each pair of neighboring VS's and that $x_a = r_a \cos\alpha_a$, $y_a = r_a \sin\alpha_a$, $x_b = r_b \cos\alpha_b$, and $y_b = r_b \sin\alpha_b$.
For the local and accumulative computation of the whole center of area we need a module that receives as inputs the data ($s_1, x_1, y_1; s_2, x_2, y_2$) concerning the centers of area from two VS's or other modules, and which computes the weighted sum

$$s_c = s_1 + s_2$$
$$x_c = \frac{s_1 x_1 + s_2 x_2}{s_1 + s_2} \qquad (8)$$
$$y_c = \frac{s_1 y_1 + s_2 y_2}{s_1 + s_2}$$

If we pay attention to the symmetry of these equations we can easily get a repetitive and local formulation of the calculus. Each module receives and transmits each coordinate times the corresponding area ($\bar{x}_1 \equiv s_1 x_1$, $\bar{y}_1 \equiv s_1 y_1$, $\bar{x}_2 \equiv s_2 x_2$, $\bar{y}_2 \equiv s_2 y_2$, $\bar{x}_c \equiv s_c x_c$, $\bar{y}_c \equiv s_c y_c$), and only at the end of the process (binary tree) is the normalization performed. In this way, equation (8) becomes just additions,

$$s_c = s_1 + s_2, \quad \bar{x}_c = \bar{x}_1 + \bar{x}_2, \quad \bar{y}_c = \bar{y}_1 + \bar{y}_2 \qquad (9)$$
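A minimal sequential emulation of this computation might look as follows (a Python sketch of eqs. (7)-(9); the names are ours, and the binary tree of modules is replaced here by a simple running sum, which yields the same result):

    import math

    def center_of_areas(r, alpha):
        # r[k], alpha[k]: distance and angle stored by the k-th VS.
        # Each triangular sector contributes its area and its area-weighted
        # centroid (eq. 7); contributions are merged with plain sums (eq. 9)
        # and the normalization of eq. (8) is applied only once at the end.
        n = len(r)
        dtheta = 2.0 * math.pi / n
        s_c = x_bar = y_bar = 0.0
        for k in range(n):
            a, b = k, (k + 1) % n
            s = 0.5 * r[a] * r[b] * math.sin(dtheta)        # sector area
            x = (r[a] * math.cos(alpha[a]) + r[b] * math.cos(alpha[b])) / 3.0
            y = (r[a] * math.sin(alpha[a]) + r[b] * math.sin(alpha[b])) / 3.0
            s_c, x_bar, y_bar = s_c + s, x_bar + s * x, y_bar + s * y
        return s_c, x_bar / s_c, y_bar / s_c                # final normalization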

4.2 Position independent virtual sensors

The last transformation (from $d'_k$ to $\hat{d}_k$) completes the change of representation, as summarized in figure 6. Starting from the virtual sensors ($d'_k$) computed in

paragraph 3.3, and changing the point of reference to the center of areas previously computed, we obtain a representation independent of local displacements. This change is accomplished by a correction (interpolation or extrapolation) of the VS's corresponding to an imaginary displacement of the system to the center of areas. This correction is formally identical to the one computed in section 3.3 for the real displacements of the system, and the local calculation procedure also coincides. We assume the system "moves" in the direction given by the angle of the closest sensor and by an amount equal to the distance to the center of areas. All the adjacent sensors perform the same computation as in equations (6) and (5).

Figure 6. Connectionist view of the virtual sensor transformations: $d^r_k$ are the real sensor inputs, $d_k$ are the virtual sensors, $d'_k$ are the corrected virtual sensors (eq. 5), $\hat{d}_k$ are the position-independent virtual sensors (eq. 5 again), and $s_c$, $x_c$, $y_c$ are the area and coordinates of the center of area from eq. 9. (Layers, bottom to top: real sensors, data routing, virtual sensors, center of area, position-independent sensors.)

5 Usefulness of endogenous representations

The analytic model proposed in this paper is connectionist. That is to say (figure 6), it is modular, fine-grained, parametric, open to learning, and layered with parallel processing in each layer. Also, it uses only elementary processes at each node of local computation (weighted adders and analog multipliers).
The model starts the input representation space from raw sensory data of the free space around the system and goes to a representation that is more detailed and endogenous, in the sense of invariance to rotations and displacements. These invariance properties increase the efficiency of the subsequent differential processes, like lateral inhibition, removing the influences of the system's own movement, which are now used to increase the discriminant power in the identification process [10] of open and closed zones, edges and external objects.
Now these algorithms with kernels based on differences can use information relative to 1) mobility of the center of areas with regard to displacements, 2) number of obstacles (barriers), gaps (borders and holes), and sequence of sizes, 3) minimum distance to the closer and farther centers of areas and its relation with the equivalent radius, and 4) connections with other adjacent zones.
The correlation between the movement of the center of areas and the displacement of the system suggests clues about open space (exploration tasks). The value of the total area (section 4.1) or the equivalent radius ($r = \sqrt{S_c/\pi}$) is useful for distinguishing between different zones. Finally, this information, together with the values $\hat{d}$ and $d$, can be used to build up more efficient topological and metrical maps.

Acknowledgements
This work has been partially supported by the project TIC-97-0604 of the Comisión Interministerial de Ciencia y Tecnología (CICYT) of Spain.

References

1. Pitts, W.H., McCulloch, W.S.: How we know universals: the perception of auditory and visual forms. Bulletin of Mathematical Biophysics 9 (1947) 127-147
2. Arbib, M.: The Metaphorical Brain. J. Wiley (1972)
3. Pellionisz, A., Llinás, R.: Tensor network theory of the metaorganization of functional geometries in the CNS. Neuroscience 16, 2 (1985) 245-273
4. Somjen, G.: Sensory Coding in the Mammalian Nervous System. Plenum (1975)
5. Kuipers, B.J., Byun, Y.T.: A Robot Exploration and Mapping Strategy Based on a Semantic Hierarchy of Spatial Representations. Robotics & Autonomous Systems 8 (1991) 47-63
6. Braitenberg, V.: Vehicles: Experiments in Synthetic Psychology. MIT Press (1984)
7. Lozano-Pérez, T.: Automatic Planning of Manipulator Transfer Movements. IEEE Trans. on Systems, Man and Cybernetics (1988) 122-128

8. Borenstein, J., Koren, Y.: The vector field histogram - fast obstacle avoidance for mobile robots. IEEE Journal of Robotics and Automation 7, 3 (1991) 278-288
9. Levitt, T.S., Lawton, D.T., Chelberg, D.M., Nelson, P.C.: Qualitative Navigation. Proc. DARPA Image Understanding Workshop. Morgan Kaufmann, Los Altos (1987) 447-465
10. Romo, J., de la Paz, F., Mira, J.: Incremental Building of a Model of Environment in the Context of the McCulloch-Craik Functional Architecture for Mobile Robots. Tasks and Methods in Applied Artificial Intelligence. Springer (1998) 339-352
Using Temporal Information in ANNs for the Implementation of Autonomous Robot Controllers

J.A. Becerra¹, J. Santos¹ and R.J. Duro²

¹ Dpto. Computación, Universidade da Coruña, Spain
{ronin, santos}@dc.fi.udc.es
² Dpto. Ingeniería Industrial, Universidade da Coruña, Spain
richard@udc.es

Abstract

In this work we study a way of introducing temporal information into the structure of artificial neural networks that will be used as behavioral controllers for real mobile robots operating in unstructured environments. We introduce networks with delays in their synapses as the building block for these controllers and the evolutionary methodology employed for obtaining them in simulation. The effects of different types of noise added during evolution on the robustness of the controllers in the real robot are discussed. Two examples of behaviors that require temporal reasoning in our robot implementation are presented: wall following and homing.

1 Introduction

In the last few years, there has been an important trend towards obtaining robot controllers with the emphasis on the behavior desired rather than on the knowledge required by the robot in order to carry out its functions [1]. Most of the implementations have made use of artificial neural networks as their basic building block due to their tolerance to noise and their adequacy for automatic implementation [2][3], although normally based on static architectures. The results of these approaches have permitted the generation of controllers able to perform simple tasks in uncomplicated environments, but have been difficult to scale to more complex problems, among them those that must handle temporal information. Obviously, handling temporal information is necessary for performing certain functions and very useful in others. The use of this type of information becomes even more necessary when the robot suffers from undersensorization, as the data perceived by the sensors in a given instant may be the same for substantially different situations, and the only way to avoid this ambiguity is to increase the dimensionality of the sensing space. In an analogy with the Embedding Theorem [4], the dimensionality may be increased by considering the data sensed in previous instants. It thus seems necessary to employ an ANN structure that permits this type of temporal processing.
Several methods have been proposed in order to incorporate into ANNs the capacity of processing temporal information: recurrences [5][6], unfolding of the temporal dimension into a fixed spatial window [7] and variable delays in the synaptic connections [8]. Recurrences permit obtaining behaviors in which it is necessary to maintain a state that somehow summarizes the history of the network in previous instants, but do not easily permit using particular previous values of given inputs. Temporal windows do permit using particular values, but the network must process all the data within the window, even when some of them are not necessary. Variable delays in the synaptic connections permit the

network to learn to process only the values of those instants of time it requires, thus reducing the number of connections needed.
In this work we will make use of variable delays in the connections of the ANNs in order to obtain controllers for a real autonomous robot that needs to perform tasks for which the use of temporal information is necessary. The ANNs will be obtained using an evolutionary algorithm. The evolution will be carried out in a simulated environment and the final controller will be downloaded to the real robot. In the examples we will show how handling temporal information permits obtaining functions that would not be possible otherwise. We will also show that the networks support very high levels of noise that would make it very difficult to obtain the behaviors with structures different from ANNs.

2 Temporal information in ANNs

It is evident that handling temporal information is important for performing some tasks, but in addition, it may be very useful in order to improve the performance of other tasks. When the inputs of the ANN are very noisy, if the information of previous instants is available the network may choose to average the input data and thus reduce the effect of noise, increasing the robustness of the behavior. In other cases, identical sets of input values may be differentiated if the data from previous instants are considered, thus reducing ambiguities. In addition, temporal data may be employed for predictions, as in [9], where a robot predicts the trajectory a mobile target is going to follow and finds a fast interception point for chasing it in the minimum possible time, or predicts situations where a mobile target is going to crash into the robot in order to prevent them.
Note that if we add a new dimension to the ANN (time), in order to verify the criteria established by [10] for the transference of behaviors from simple simulations to complex real worlds, we must also add noise to this new dimension. This will lead to the behavior generated by the network being robust. The noise present at the inputs of an ANN usually has zero mean and consists of larger or smaller variations around the ideal value of the input. Thus, for example, in the case of the robots, this noise corresponds to imperfections in the operation of the sensors. When the ANN must learn a given temporal pattern present at its inputs, there is a high probability of the pattern not always being exactly the same. That is, there may be slight differences between the points that make up the pattern: differences not only with respect to the values themselves, but also with respect to the spacing of the samples in time. The ANN must tolerate reasonable differences in the input values as well as errors in the spacing between them.
For example, in the case of a robot, the tolerance to temporal noise is fundamental. The distance between the different events of a temporal pattern, even when it does not vary from the viewpoint of an external observer (which is a very strong assumption), may be altered from the point of view of the robot for different reasons. The robot may perceive events with a different temporal distance simply because of a change in the controller (for example, an increase or decrease in the number of neurons of the network) or even of the compiler employed to obtain the object code, as any of these reasons may imply longer or shorter times between two consecutive processings of the input values to the ANN. If some type of temporal noise is not employed that makes the ANN tolerant to the temporal variations in the duration and/or separation between the events, we will have to obtain new controllers whenever any change of this type occurs.
This noise, which we may call temporal noise, may be addressed in different ways. As in the case of additive noise, care must be taken when it is introduced in the network. Zero-mean noise changed in each evaluation of the ANN in the evolutionary process does not always correspond to reality. It may be necessary to employ the same amount of noise for a large number of executions of the ANN, changing it later to another value and preserving this value for another number of executions, and so on. This is the case, for instance, if we want to make an ANN tolerant to variations in the time the robot may need to execute it due to the previously mentioned reasons, as in these cases, the execution time of the controller will not change significantly from one step to the next, but will change between evaluations of the controller.
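A minimal sketch of this kind of "frozen per evaluation" temporal noise might look as follows (all names, including the controller.step simulation call, are hypothetical):

    import random

    def evaluate(controller, n_steps, dt_range=(0.8, 1.2)):
        # Temporal noise frozen per evaluation: the sampling interval is
        # drawn once and then kept for every step of this evaluation.
        dt = random.uniform(*dt_range)
        fitness = 0.0
        for _ in range(n_steps):
            # amplitude (sensor/actuator) noise would be redrawn every step,
            # whereas dt stays fixed until the next evaluation
            fitness += controller.step(dt)
        return fitness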

3 Architecture

Recurrences, consisting of the existence of at least one return path for the output information of at least one neuron through connections between neurons of the same layer or from one layer to a previous one, are useful for summarizing the history of previous activation states, but do not permit a simple storage of fixed temporal patterns, which is necessary for a large number of applications. On the other hand, temporal windows, consisting of the presence of several inputs (each one usually with a different weight) corresponding to consecutive temporal values of the same sensor at a given neuron, permit storing these temporal patterns, but present the drawback that connections for all the temporal instants within the window must exist, even when they are not necessary. This leads to a large number of connections and a long processing time, and obscures the processing the network must perform. In some applications, such as mobile robotics, the processing time is very important, as it determines the reaction speed of the robot. The importance of this must be stressed, as the robot may be faced with dangerous situations and processing speed is important, especially when noise may cause a delay in the perception of the dangers, or in the case of simple robots, whose processing capacity is very small.

Figure 1: ANN with synaptic delays

In order to prevent these drawbacks, the architecture for the ANNs we employ (figure 1) consists of several layers of neurons interconnected as a multiple layer perceptron where the synapses, in addition to the weights, include a delay term that indicates the time an event takes to traverse them. These delays, as well as the weights, are trainable, allowing the network to obtain from its interaction with the world a model of the temporal processes required. A fact that must be taken into account is that, in general, for the processes we are going to consider, having delays only in the first layer is equivalent to having them in all the layers.
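As an illustration, a forward pass through such a layer can be sketched as follows (a minimal Python sketch under our own assumptions about buffering and activation; it is not the authors' implementation):

    import numpy as np

    class DelayLayer:
        # A layer whose synapses carry a trainable weight and a trainable
        # integer delay (in sampling steps); a buffer keeps past inputs.
        def __init__(self, n_in, n_out, max_delay,
                     rng=np.random.default_rng(0)):
            self.w = rng.normal(0.0, 0.5, (n_out, n_in))
            self.delay = rng.integers(0, max_delay + 1, (n_out, n_in))
            self.bias = np.zeros(n_out)
            self.buffer = np.zeros((max_delay + 1, n_in))

        def forward(self, x):
            # push the newest input sample; buffer[t] is the input t steps ago
            self.buffer = np.roll(self.buffer, 1, axis=0)
            self.buffer[0] = x
            # synapse (j, i) reads input i as it was delay[j, i] steps ago
            delayed = self.buffer[self.delay, np.arange(self.buffer.shape[1])]
            return np.tanh((self.w * delayed).sum(axis=1) + self.bias)

Placing one such layer in front of otherwise static layers would match the remark above that, for the processes considered, first-layer delays suffice.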

4 Obtaining the controllers

In order to obtain the weights and delays of the synaptic connections of the ANN, we have made use of an evolutionary algorithm. The reason for using this type of algorithm is the difficulty of determining, in general, how good a single action of the robot is towards the achievement of its objectives. This is the credit apportioning problem, which precludes the use of a supervised learning algorithm or even a reinforcement learning scheme. In most cases we cannot decide a priori what the best motion or sequence of motions is, especially because the

motion may imply a compromise among several cases that are perceptually identical to the robot but in reality are different. When the behavioral complexity increases or when the noise level is so high that the designer cannot choose the optimal strategy, it becomes very difficult to make use of learning. For this reason, the selection of an evolutionary algorithm as a method for obtaining the parameters of the ANNs seems more adequate.
The type of algorithm employed is basically an evolutionary strategy with some changes in order to adapt it to the nature of the problem. The selection of an evolutionary strategy rather than a genetic algorithm is motivated by the large dependence between the weights of an ANN, leading to a high level of epistasis, which can cause the problem to be deceptive and slow down the process of obtaining a solution, as indicated in [11]. Thus, an evolutionary strategy where more emphasis is put on mutation than on crossover seems better than a genetic one.
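A bare-bones version of such a strategy might be sketched as follows (a (mu + lambda) scheme; the population sizes, mutation rate, and real-coded treatment of the delays are assumptions, since the authors' exact variant is not specified in the text):

    import numpy as np

    def evolve(fitness, genome_len, mu=15, lam=60, sigma=0.3,
               generations=600, rng=np.random.default_rng(0)):
        # Flat genome holding the ANN weights and synaptic delays;
        # delays would be rounded to integer steps when decoding.
        pop = [rng.normal(0.0, 1.0, genome_len) for _ in range(mu)]
        for _ in range(generations):
            offspring = [pop[rng.integers(mu)]
                         + rng.normal(0.0, sigma, genome_len)
                         for _ in range(lam)]
            pool = pop + offspring
            pool.sort(key=fitness, reverse=True)   # fitness is noisy, see text
            pop = pool[:mu]
        return pop[0]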

5 Robot and Environment Used

The robot employed for testing these strategies is a Rug Warrior. It is a small, simple circular robot, built around the MC68HC11A1 microcontroller. It has two DC motors that power two wheels, and as sensors it includes two infrared emitters and one receiver, two light sensors, three contact sensors and two speed sensors (one per wheel). The size of the robot is large enough for it to be able to operate freely in a human environment, such as an office or laboratory. The sensors, on the other hand, are of very low quality, very noisy and imprecise. Thus, for instance, the infrared receiver is binary, that is, it returns 1 if it detects an object and 0 if it does not. It does not return any information regarding the distance at which the object is located.
The behaviors are obtained in a simulation/evolution environment whose simulation part is based on the Khepera Simulator [12]. This is done due to the problems presented by evolution in a real physical robot, such as slowness of the process, limitation of the evolution environments to those that are available, etc. The behaviors have been tested in the real robot and environment.

Figure 2: Rug Warrior

6 Wall following

The wall following behavior is one of the most usual behaviors in the autonomous robotics literature. The behavior consists in the robot finding and following the walls of an enclosure at the highest speed possible, minimizing the distance to the wall it is following at each instant of time and avoiding collisions. It is usually implemented in robots where the sensors employed in this task provide values in a range that is large enough for the robot to be able to distinguish between approaching a wall and going away from it. The biggest problem found when obtaining these behaviors is caused by the presence of noise in the sensors. The infrared sensors of the Rug Warrior, which are the only ones we can use for this task, are binary, as mentioned before. This fact makes it impossible to decide whether we are approaching or going away from an object without taking into account the previous instants. An additional problem found when obtaining this behavior is that the Rug Warrior has a single receiver for the two

emitters, located at an intermediate point between both sensors. This particular arrangement of the emitters and receiver compounds the noise problem.
In order to guide the evolutionary process towards obtaining an ANN that implements the desired behavior, we have implemented the following procedure. The fitness function of the robot is the amount of energy it possesses at the end of an evaluation period. The robot increases its energy level by eating the food it finds stuck on the walls of the simulated environment. In order to eat, the robot must simply sense, with one of the infrared sensors, a point on the wall. Once the robot has sensed a brick of the wall, the food disappears from the brick, forcing the robot to follow the walls of the environment in order to continue eating and thus increase its energy. The reasons for using this strategy, as opposed to engineering the fitness function, are studied in detail in [13].
The environment employed is a world enclosed by walls, where the walls present a large number of angles and shapes, so that the robot evolves a behavior that follows walls of whatever shape without colliding with them.
If the controller is evolved without considering temporal information, we see (figure 3) that there are some types of curves, whatever the number of hidden layers employed, that the robot cannot handle in a satisfactory manner. This is due to what we commented on before: the robot without temporal information cannot differentiate some situations from others, and thus adopts a simple strategy of turning one way when a wall is detected and the other way when it is not. As the robot must not collide with the wall, the turning radius must be large enough for this not to happen. This turning radius must be even larger if the robot starts its evaluation far from the walls, as it must be capable of reaching them.

Figure 3: Wall following without delays

Using an ANN with delays we obtain the behavior of figure 4, in which it is easy to see how the robot is capable of making all the turns, even when it starts its evaluation far from the walls.

Figure 4: Following walls with delays

Note that these behaviors have been obtained with very high levels of noise. Smoother behaviors may be obtained in the simulator with lower levels of noise, but they are less robust and will not work adequately on the real robot. Thus, the range of the infrared sensors may be reduced by up to 50% (in order to simulate different surfaces) and their orientation may also vary between theoretical_position−π/8 and theoretical_position+π/8 in order to simulate the change in orientation due to collisions. These types of noise are applied on top of the usual 5% noise level on the values that reach sensors and actuators. In the case of temporal noise, the time elapsed between two data samplings is taken randomly from a range of values determined at the beginning of each evaluation, and is maintained until the next evaluation. This noise is

useful, in addition to the general reasons already mentioned, for simulating other circumstances, such as a decrease in the battery load (which leads to a smaller distance advanced by the robot, which is equivalent to a smaller time interval between consecutive data samplings), or the different friction coefficients of the surfaces and robot wheels. The main problem observed with these very high levels of noise is the large oscillations that may arise in the fitness of the individuals through the generations, as the controller may have obtained a very good fitness with some given levels of noise and be bad for other, different levels. In order to minimize this problem it becomes necessary to increase the number of evaluations of each individual above the number that would be required without these types of noise.

7 Homing

Another typical behavior is homing. It consists in finding a given object in the environment; the robot will interpret this object as its home and will go towards it. In this case we have prepared the following scenario. The environment has no walls, facilitating the search for home, but there is an object that is very similar to home and that acts as a trap. Thus, home is represented by a flashing light, whereas the trap is represented by a static light. The robot will necessarily require temporal information in order to distinguish one object from the other. The flashing light has a period of 0.6 seconds, and it is on for half of the period and off for the other half.

Figure 5: ANN for the wall following behavior with delays. For each synapse the integer value corresponds to the delay, the real value is the weight, and the values on top of the neurons are the biases applied.
The fitness function employed is also based on the concept of energy. In this case, the robot starts with a given energy level. As time passes, the robot starts to lose energy, and the only way this loss can be stopped is by reaching home. If the robot by mistake falls into the trap, it loses all the energy it has left.
During the evaluation of the robot, both the trap and home are randomly positioned, but always within the two areas shown in figure 6, and in such a way that the two objects are always in different areas. This is done in order to prevent the static light from masking the flashing light, saturating the light sensors and preventing its identification by the robot.

Figure 6: Areas in which home and the trap will be placed

Using the architecture described above, the behaviors of figures 7 and 8 are obtained.
In this case we have applied the same types of noise as in the previous example, and in addition we have applied noise to the ambient light value in the same way as was done for the infrared sensors. Clearly, the

robots learn to perform the tasks adequately in these environments and many others where they were tested. The evolutions took around 600 generations and each robot was evaluated 16 times each generation. An order of magnitude fewer generations are required if no temporal noise is used, but the results are not useful in real robots.

Figure 7: Homing behavior, initial position (light at the bottom is the trap, light at the top is home, and the circle is the robot). Figure 8: Homing behavior, final position.

8 Conclusions

In this work we have studied the use of ANNs with temporal delays in the synapses for the generation of behavioral robot controllers. The controllers were obtained using an evolutionary process, and it was seen that in a relatively small number of generations the results obtained were very adequate in tasks that could not be performed without the use of temporal information. We also ascertained the need to include a different type of noise in the evolutionary process when temporal information is employed. This noise acts on the temporal positions of the events perceived by the robot and helps to make the robot robust with respect to time-dependent phenomena. It is also shown how the combination of ANNs and evolutionary algorithms is capable of autonomously generating structures that can operate in environments and real robots where huge amounts of noise are present.

Acknowledgments

This work was funded by the Universidade da Coruña and the CICYT under project TAP98-0294-C02-01.

References

1. Arkin, R.C.: Behavior-Based Robotics. MIT Press, Cambridge, MA (1998)
2. Nolfi, S., Floreano, D., Miglino, O., Mondada, F.: How to Evolve Autonomous Robots: Different Approaches in Evolutionary Robotics. In Brooks, R., Maes, P. (eds.): Proceedings of the Fourth International Conference on Artificial Life. MIT Press, Cambridge, MA (1994)
3. Cliff, D.T., Harvey, I., Husbands, P.: Explorations in Evolutionary Robotics. Adaptive Behavior, Vol. 2 (1993) 73-110
4. Takens, F.: On the Numerical Determination of the Dimension of an Attractor. In Rand, D., Young, L.S. (eds.): Dynamical Systems and Turbulence, Warwick 1980. Lecture Notes in Mathematics, Vol. 898. Springer Verlag (1981) 366-381

5. Elman, J.L.: Finding Structure in Time. CRL Technical Report. La Jolla: University of California, San Diego (1988)
6. Jordan, M.I.: Attractor Dynamics and Parallelism in a Connectionist Sequential Machine. Proceedings of the 1986 Cognitive Science Conference. Erlbaum, Hillsdale, N.J. (1986) 531-546
7. Waibel, A., Hanazawa, T., Hinton, G., Lang, K., Shikano, K.: Phoneme Recognition Using Time-Delay Neural Networks. IEEE Trans. Acoust. Speech Signal Processing, Vol. 37 (1989) 328-339
8. Duro, R.J., Santos, J.: Fast Discrete Time Backpropagation for Adaptive Synaptic Delay Based Neural Networks. Submitted for publication in IEEE Trans. on Neural Networks (1998)
9. Santos, J., Duro, R.J.: Evolving Neural Controllers for Temporally Dependent Behaviors in Autonomous Robots. In del Pobil, A.P., Mira, J., Ali, M. (eds.): Tasks and Methods in Applied Artificial Intelligence. Lecture Notes in Artificial Intelligence, Vol. 1416. Springer-Verlag, Berlin (1998) 319-328
10. Jakobi, N.: Evolutionary Robotics and the Radical Envelope of Noise Hypothesis. Adaptive Behavior, Vol. 6, No. 2 (1997) 325-368
11. Salomon, R.: The Evolution of Different Neuronal Control Structures for Autonomous Agents. Robotics and Autonomous Systems, Vol. 22 (1997) 199-213
12. Michel, O.: Khepera Simulator Package version 2.0, Freeware Mobile Robot Simulator, downloadable from http://wwwi3s.unice.fr/om7khep-sim.html. University of Nice Sophia-Antipolis, France (1996)
13. Becerra, J.A., Crespo, J.L., Santos, J., Duro, R.J.: Incremental Design of Neural Controllers for an Infrasensorized Autonomous Robot. Las Palmas International Conference Wiener's Cybernetics: 50 Years of Evolution, Las Palmas de Gran Canaria (1999)
Learning Symbolic Rules with a Reactive with Tags Classifier System in Robot Navigation

Araceli Sanchis, José M. Molina, Pedro Isasi, Javier Segovia*

Sca-Lab. Departamento de Informática, Universidad Carlos III de Madrid, Spain.
Avda. Universidad 30, 28911-Leganés (Madrid).
e-mail: masm@ia.uc3m.es
* Departamento de Lenguajes y Sistemas, Facultad de Informática, UPM
Campus de Montegancedo, Boadilla del Monte (Madrid)

Abstract - Classifier Systems are special production systems where conditions and actions are codified in order to learn new rules by means of Genetic Algorithms (GA). These systems combine the execution capabilities of symbolic systems and the learning capabilities of Genetic Algorithms. The Reactive with Tags Classifier System (RTCS) is able to learn symbolic rules that allow generating sequences of actions, chaining rules across different time instants, and to react to new environmental situations, considering the last environmental situation to take a decision. The capacity of the RTCS to learn good rules has been proven in a robot navigation problem. Results show the suitability of this approach to the navigation problem and the coherence of the extracted rules.

1. Introduction
A Classifier System (CS), proposed by John Holland [1, 2, 3, 4, 5, 6, 7], is a kind of production system. In general, a production system is a set of rules that trigger one another and accomplish certain actions. Rules consist of a condition and an action. An action can activate the condition of another rule, and thus some rules interact with others. Classifier Systems are parallel production systems, while traditional expert systems generally are not. In a parallel production system several rules can be activated at the same time, while in non-parallel ones only one rule can be activated in each action. Together with the parallel activation capacity of rules, CS's have the property of learning syntactically simple rule chains to guide their behavior in changing environments; therefore they are considered learning systems.
In traditional production systems, the value of a rule with respect to the others is fixed by the programmer in conjunction with an expert or group of experts in the matter being emulated. In a CS this advantage does not exist. The relative value of the different rules is one of the key pieces of information that must be learnt. To facilitate this learning, the CS forces rules to coexist in an information-based service economy. A competition is held among rules, where the right to answer the activation goes to the highest bidders, which pay the value of their bids to those rules responsible for their activation. The competitive nature of the economy assures that good (useful) rules survive and bad rules disappear.
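For intuition, one cycle of this rule economy can be sketched as follows in Python (a generic bucket-brigade-style step; the data layout and the bid fraction are our assumptions, not the authors' implementation):

    def auction_step(matched, bid_rate=0.1):
        # Each matched rule bids a fraction of its strength; the winner
        # pays its bid to the rules whose firing activated it.
        winner = max(matched, key=lambda rule: bid_rate * rule["strength"])
        bid = bid_rate * winner["strength"]
        winner["strength"] -= bid               # pay for the right to act
        suppliers = winner.get("activated_by", [])
        for supplier in suppliers:
            supplier["strength"] += bid / len(suppliers)
        return winner                           # the winner's action is posted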
When a CS is employed for learning reactive behaviors, an additional problem is detected with respect to the action chains: these action chains blind the system, making it insensitive to the environment for the duration of the chain, since the system can

not manage any new input during the decision process. If, furthermore, the environment where the learning is accomplished is dynamical, the system would have to read the sensors (input, situation of the environment) in each decision step, since this is the principal characteristic of reactive systems. To solve this problem, a Reactive with Tags Classifier System, RTCS, is proposed [8], [9]. For example, in the problem studied here, the navigation of an autonomous robot through a dynamical environment (where the obstacles can be mobile), the robot must not remain blind at any moment; therefore each movement must be the result of the application of a decision process over the last reading of the sensors [10]. Control rules could be designed by a human expert, designed "ad hoc" for the problem, or learnt through some artificial intelligence technique. Some approaches have employed Genetic Algorithms to evolve fuzzy controllers [11], Evolutionary Strategies to evolve connection weights in a Braitenberg approach [12], or Neural Nets for behavior learning [13].
In the proposed learning system, the only previous system information is related to the number of inputs (in the robot, the number of sensors), the domain, the number of outputs (in the robot, the number of motors) and their description. Thus, the robot controller (the RTCS) starts without information about correct associations between sensor inputs and motor velocities. From this situation, the system (robot + controller) must be capable of learning to reach the greatest degree of fitness to the sensor information. The robot has to discover a set of effective rules, employing the experience of past situations, and must extract information from each situation when it is produced. In this way, the system learns in an incremental way, and the past experience remains implicitly represented through the evolved rules.

2. Classifier Systems

A Classifier System consists of three principal components, which can be considered as activity levels. The first level (Action) is responsible for giving answers (adequate or not) for the resolution of the outlined problem. In this level the rules of the system are found, codified as strings of characters over a restricted alphabet. The Action level produces a response to a given situation. The appropriateness of the given response to the problem to solve is measured through the reward that the rule receives from the environment. The second level (Credit Assignment) evaluates the results obtained in the previous level, distributing the rewards received by the rules that provide the output among the rules that have contributed to the activation of each of the final rules (which give the output). As in other reinforcement learning methods, this evaluation can be adjusted by applying a reward or payment from the environment, with a high value if the solution is profitable and a punishment or negative value if it is not. In this level it is not possible, however, to modify the behavior of the system by means of changes in its rules, but it is possible to adjust their values and to establish, to a certain degree, a hierarchy of good and bad rules. The task of the third level (Discovery) is to find new rules that allow the system to discover new solutions. In this level a Genetic Algorithm is applied.

Although the search for new rules in a CS is based on the application of Genetic Algorithms over a set of rules, a fundamental difference is the capacity of the CS to generate isolated rules that are injected into a set of previously existing rules. Genetic Algorithms provide good results in many problems. However, as in the analogous Genetic Programming and Evolutionary Strategies, the evaluation is accomplished on the complete system, without discriminating between different internal parts. If the system is composed of a set of rules, as in the case of the CS, an evaluation of the complete set without individualizing each of the rules does not permit generating new isolated rules. Besides, the application of Genetic Algorithms over the rules of Classifier Systems requires an intermediate representation, "codified rules", for the genetic operators to act on. Classifier Systems are a specialized form of production system that have been designed to be amenable to the use of Genetic Algorithms [7].
The way Classifier Systems operate presents some problems in execution time, in the learning of complex strategies, in the definition of the instant to call the GA and, as in many other learning systems, in the presentation of the examples to the system. Focusing on the first two problems, they are due to the existence of internal cycles. These cycles permit the interrelationship among rules in order to produce elaborate solutions. While a CS executes internal cycles, it remains isolated from the environmental information. This problem can be described as the necessity for a CS to be capable of "reacting" to the stimuli of the environment. The attempts at seeking "reactivity" in Classifier Systems have been approached from two different perspectives: the increase in the processing speed of the system, with the ICS and hierarchic CS systems of Dorigo [14], and, on the other hand, the execution of one rule per input, without internal cycles and thus without rule sequences, the HCA of Weiß [15], based on the works of Wilson [16] and Grefenstette [17].
In this work, a Reactive with Tags Classifier System (RTCS) has been applied [9]. This RTCS works in the sense of Weiß [15], but at the same time allows elaborating complex strategies. For this, it is necessary to recall the definition of reactivity: a reactive system must decide an action for each input, each action being determined by an input, and in a CS this must hold without losing the capacity of chaining rules across different time instants. To obtain a RTCS, the operation of the action level has been modified. The solution proposed, therefore, must unite the capacity of learning without previous knowledge with the capacity of generating some kind of internal subdivision within the CS to allow the existence of rule categories. To carry out this solution, the codification of the rules (classifiers) has been modified and a field representing the type or group to which each classifier belongs, named a Tag, has been included [8, 9].

3. Input and Output Codification

The codification of information in the CS (the design of environmental and output messages) is based on the specific problem to which the CS will be applied. In this work, the CS is used as the controller of an autonomous robot named Khepera [13]. The sensory inputs come from eight infra-red proximity/ambient sensors. The robot has two wheels controlled by two independent DC motors with incremental encoders that allow any type of movement. Each wheel's velocity can be read by a speedometer.

The sensors (proximity, ambient and speedometer) supply three kinds of incoming information: proximity to the obstacles, ambient light and velocity. Instead of using the eight infra-red sensors individually, they have been grouped, giving a unique value from two sensor input values (Figure 1a) and reducing the amount of information received by the CS. Representing the goal by a light source, the ambient information lets the robot know the angle (the angular position on the robot of the ambient sensor receiving more light) and the distance (the amount of light in the sensor) (Figure 1b).

Fig. 1: (a) Sensors in the real robot (ambient sensors and proximity sensors). (b) Input information to the system (light source as goal).

The input to the CS consists of three proximity sensors, angle and goal distance (given by the ambient sensors) and velocity values obtained by the speedometer. The outputs are the velocity values. The composition of the message can be seen in figure 2.

[Message fields: Sensor 1 | Sensor 2 | Sensor 3 | Angle | Distance | Velocity 1 | Velocity 2 — near-environment description (AVOID), goal description (FOLLOW), internal robot situation description.]

Fig. 2: Composition of the environmental message.

The distance information of the proximity sensors is obtained from the response curve of the sensors, which is a sigmoidal function defined over the domain of intensity values. The distance domain is transformed, translating it into a simpler domain in which to codify the values. This transformation allows both the CS and the robot to be independent, so the CS could be adapted to any robot by changing the transformation function. The input domain has been partitioned into four crisp sets. The maximum distance value "seen" by one sensor is 40 units, and it is divided into four equal sets. The angle sets are of different sizes to allow a fine fitting of the trajectory, avoiding big oscillations when the robot follows the right direction (the sets near 0 and 2π are smaller than the "<π" and ">π" ones). To keep the independence between robot and CS, the distance values are translated from the real sensor values to a domain defined from 0 to ∞, which has also been partitioned into four crisp sets.

Velocity values flow as input to the classifier system and as the decision from the CS to the robot. The values are bounded by the maximum and minimum velocities (10, −10). This range is divided into four equal sets. All these sets must be codified to build the message from the environment; two binary digits are needed to represent each set. The codified inputs to the robot are displayed in the table:
Code | Proximity       | Angle           | Distance       | Velocities
00   | Very Near (VN)  | Near 0 (0)      | (0,25) (VN)    | Slow Forward (F)
01   | Near (N)        | <PI (0-PI)      | (25,100) (N)   | Fast Forward (FF)
11   | Far (F)         | >PI (PI-2PI)    | (100,200) (F)  | Backward (Bc)
10   | Very Far (VF)   | Near 2PI (2PI)  | (200,∞) (VF)   | Stop (ST)
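A minimal sketch of this codification could be the following (the distance-to-goal thresholds follow the table above; the proximity thresholds, assuming four equal sets over the 0-40 range, are our own assumption):

    def encode(value, thresholds, codes=("00", "01", "11", "10")):
        # Return the 2-bit code of the crisp set a reading falls into;
        # thresholds are the three boundaries separating the four sets.
        for bound, code in zip(thresholds, codes):
            if value < bound:
                return code
        return codes[-1]

    # e.g. a proximity reading of 12 on the 0-40 scale falls in "Near" -> "01",
    # and a goal distance of 150 falls in the (100,200) set -> "11";
    # concatenating the seven encoded fields builds the environmental message.
    proximity_code = encode(12, (10, 20, 30))
    distance_code = encode(150, (25, 100, 200))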

4. Analysis of Learned Rules in RTCS Applied to Navigation of an Autonomous Robot

The results obtained with the RTCS are due, on the one hand, to the introduction of Internal Tags (IT), and additionally to the introduction of the mechanism that allows the CS to be reactive (RTCS). Evidently, it is the coexistence of the two mechanisms that permits obtaining such good results when applying the CS to the navigation problem. In this section, the influence and contribution of Internal Tags will be analyzed. When a RTCS is left to evolve in the simulator, at a certain moment the RTCS is able to solve the navigation problem. Then it is considered that the robot has learnt. This RTCS has been carried over to the real robot, and its efficiency in navigation has been proven.
The analysis of the meaning of the symbolic rules obtained has been done by organizing them into groups. Each group contains a different number of rules, which share some condition values that reflect similar situations. Table 1 - Table 6 collect the different groups; the symbolic values of the condition part can be observed, represented by the concepts s1, s2, s3, A (angle), d (distance), v1 and v2 (left and right wheel velocity values). Below the condition values, the message values (v1 and v2 in the message part) are found.

Table 1: Group 1 rules (condition: s1 | s2 | s3 | A | d | v1 | v2 — message: v1 | v2).
F or VF | VN or N | VF | 0 | N | Bc | ST — F | F
VF | MN | N or VF | 0 | VF | F | F — F | F
VF | MN or L | N or VF | 0 | VN or F | F | F — FF | FF
F or VF | F or VF | MN | 0-PI | VF | Bc | ST — F | F
F or VF | N or VF | VN or N | 2PI or 0-PI | VF | FF | FF or Bc — Bc | ST
VF | VF | MN or L | 0-PI | VF | FF or Bc | All — Bc | ST
F or VF | VN or N | VF | 0-PI | N | FF | FF — ST | Bc
VF | VN or N | VF | 0-PI | F or VF | ST or F | Bc — ST | Bc
VF | VN or N | VF | 0-PI | F | F or FF | FF — ST | Bc
VF | N or VF | VN or N | 2PI or 0-PI | F | F or FF | FF — Bc | ST
F or VF | VF | MN | 0 or PI-2PI | VF | ST or Bc | ST or Bc — Bc | ST
F or VF | MN | VF | 2PI or PI-2PI | VF | ST or F | Bc — F | F
VF | VN or N | VF | PI-2PI | N or VF | ST or Bc | ST — F | F
MN | F or VF | VF | 0 or PI-2PI | F or VF | Bc | ST — Bc | ST
VN or N | VF | VF | 0-PI | VF | All | ST — F | F
VN or N | VF | VF | 2PI | N | F | ST or F — ST | Bc
VN or N | VF | VF | 0-PI | VF | All | F — F | F
VF | MN | VF | N | VF | ST | Bc — F | F
VF | VF | MN | 0-PI | F | All | ST or Bc — Bc | ST
F or VF | VF | MN | 0 | F | FF or Bc | ST or Bc — F | F
Group 1 consists of 20 of the 119 rules that the RTCS contains. Analyzing the sensor values s1, s2 and s3 in the rules of group 1, this group seems to have in common that it represents situations of collision danger in some of the sensors. It does not seem to be a group that answers solely to this characteristic, since collision danger frequently appears in only one sensor. Furthermore, if the velocity values decided for each situation are analyzed, turnings and advances of the robot are observed, though the number of rules that make the robot turn is not very high. The angle values are what compel the robot to advance without turning when collision risk appears on the lateral sensors (values s2 and s3).

Table 2: Group 2 rules.
CONDITION MESSAGE
s1 s2 s3 A d v1 v2 v1 v2
VN VN VF 0-PI N or VF FF FF ST Bc
N or VF VN or N F or VF 0 or 0-PI F F or FF FF ST Bc
N or VF VN VF 0 or PI-2PI F FF FF or Bc FF FF
VN VN or F VN 0-PI VF F or FF FF or Bc Bc ST
N F VN 0-PI VF Bc ST Bc ST
VN or N N or VF VN or F 0 VF Bc all Bc ST
N or VF VN or N VF 0-PI VF F F ST Bc
VN or N VN or F VN or F 2PI VF ST all Bc ST
N F VN or F PI-2PI VF FF FF Bc ST
VN or F VF F PI-2PI F or VF Bc ST Bc ST
N N or VF VN 2PI or PI-2PI F or VF Bc ST FF FF
F VN or N VF PI-2PI F or VF F or FF F F F
VN or N VN or F VF 0-PI F ST or F F FF FF
VN or F F N 2PI or PI-2PI VF ST or F Bc Bc ST
VN F VN 0-PI VF ST Bc Bc ST
VN N N all VF ST or F Bc ST Bc
VN VN or N VN or N 0-PI VF ST or Bc Bc ST Bc
VN or N F VF 2PI or 0-PI VN or F FF FF or Bc ST Bc
F VN N or VF PI-2PI VF FF FF ST Bc
N or VF F or VF VN or F 2PI VF all ST F F
N VN or N VF 0 or 0-PI VF FF FF ST Bc
N or VF VF VN PI-2PI N or VF FF FF Bc ST
N N or VF VN 2PI or PI-2PI F or VF Bc ST Bc ST
VN VN or N VF 0 or 0-PI VF F F ST Bc
N or VF VN VF PI-2PI VF F ST or F ST Bc
N VN VF 0 VF ST or Bc ST or Bc ST Bc
N or VF VN VF PI-2PI VF FF or Bc FF ST Bc
F VN or F VF 0 or 0-PI N ST or F all ST Bc
N or VF VN VF PI-2PI VF F ST or F ST Bc
Group 2 contains 29 of the 119 rules that form the RTCS, about 50% more rules
than the previous group. The most important characteristic observed in this group is
that many Near or Very Near values appear in sensors s1, s2 or s3, which represent
situations where the robot is in danger of collision. It can therefore be concluded
that this group is the one specially entrusted with avoiding obstacles. In this case,
the values of the decided speeds are such that nearly all the rules produce a turn
of the robot.

Table 3: Group 3 rules.
CONDITION MESSAGE
s1 s2 s3 A d v1 v2 v1 v2
VF VF VF 0-PI VN or F All ST or F F F
VF F or VF F or VF 0 or PI-2PI VF ST Bc Bc ST
F or VF VF VF 0 VF ST Bc FF FF
VF VF VF 2PI or 0-PI All FF or Bc ST or Bc ST Bc
F or VF VF F or VF 0-PI VF FF FF FF FF
F or VF F or VF VF 2PI N FF FF or Bc FF FF
VF VF F or VF 2PI N F ST or F FF FF
VF F or VF VF 0 VF F F F F
VF VF VF 0 F F F F F
VF VF VF 0 F Bc ST F F
VF VF F or VF 2PI or 0-PI VN or N ST FF or Bc F F
Group 3 consists of 11 of the 119 rules that the RTCS contains; there are fewer
rules in this group than in the previous ones. In contrast with the two previous
groups, the represented situations contain no collision situation; rather, the rules of
this group seem to be related to angle values. Analyzing the angle values of all its
rules, the robot mostly appears quite aligned with the objective. As a result of the
inference of each rule, the messages sent to the robot compel it to advance in a
straight line, which corresponds to the situation where the robot is located forming
a 0 or 2PI angle with the objective. Distance values to the objective do not seem to
be determinant in taking decisions.

Table 4: Group 4 rules.
CONDITION MESSAGE
s1 s2 s3 A d v1 v2 v1 v2
N N or VF VF 0 VN or N ST or Bc Bc FF FF
VF N VF 2PI or PI-2PI VF Bc ST F F
VF N N or VF PI-2PI VF F or FF F F F
F VF N 0-PI VF Bc ST Bc ST
N VF F 0-PI VF FF or Bc ST Bc ST
N N or VF N 0 or 0-PI All Bc ST or F Bc ST
VF VF N All VF ST ST or Bc FF FF
N VF All 2PI VF F All FF FF
N or VF N VF 2PI or 0-PI VF FF FF FF FF
F or VF N N or VF 2PI or PI-2PI VF ST or Bc FF or Bc F F
F N VF 0-PI F F F F F
F N VF 2PI or 0-PI VF All F ST Bc
F N F or VF 0-PI VF FF FF or Bc ST Bc
N or VF N VF 0-PI VF FF FF FF FF
F F or VF N 0 or PI-2PI N or VF FF or Bc ST F F
All VF N 2PI or PI-2PI N or VF ST or F Bc Bc ST
F N VF 2PI VF FF F or FF ST Bc
VF N VF 0 VF F F F F
F VF N PI-2PI VF FF FF Bc ST
F or VF VF N 0 or PI-2PI VN or F F F F F
VF F or VF N 0-PI F F F F F
Group 4 consists of 21 of the 119 rules that the RTCS contains. In this group,
no clear tendency appears with respect to the general behaviors "straight to objective"
and "avoid obstacles". Analyzing the rules, some of the proximity sensors s1, s2 and
s3 have values of Near, but this value appears in rules joined to Far or Very Far
values, so these are not danger situations. Attending to the angle and distance values,
it is observed that, in almost all rules, the robot is far from the objective and in
general not aligned with it. These circumstances cause the rules to compel the robot
to turn toward the objective and to advance, so that, later on, some danger situation
will arise that groups 1 and 2 can resolve.

Table 5: Group 5 rules.
CONDITION MESSAGE
s1 s2 s3 A d v1 v2 v1 v2
VF N or VF VF All L ST or Bc ST or Bc FF FF
VF All VF 0 All FF or Bc ST FF FF
VF All VF 0 or 0-PI N ST or F F F F
VF All VF 0 All FF or Bc FF FF FF
VF N or VF F 0-PI VF ST or Bc Bc F F
VF N or VF F 0-PI VF All F or FF ST Bc
N or VF F or VF F 0-PI VF ST Bc Bc ST
VF VF N or VF 0-PI N or VF ST Bc FF FF
VF All All PI-2PI VF ST ST or F Bc ST
VF VF N or VF 2PI or 0-PI VF F F F F
VF VF N or VF PI-2PI VN or N FF FF or Bc FF FF
F or VF VF N or VF 2PI VF F F F F
N or VF F F or VF 2PI VF F F or FF F F
VF N or VF N or VF 2PI F or VF F ST or F F F
N or VF F All 2PI L ST or F ST or F F F
VF VF N or VF 2PI L FF FF FF FF
VF VF N or VF 2PI L F F FF FF
VF All VF 0-PI L F F F F
N or VF N or VF F or VF 0 or 0-PI L F F FF FF
VF F or VF N or VF 2PI VF FF FF FF FF
VF VF N or VF 2PI or PI-2PI L ST or Bc ST Bc ST
VF F or VF N or VF 2PI or PI-2PI L ST Bc FF FF
N or VF VF VF 0 N FF or Bc ST F F
VF F N or VF 0-PI N ST FF or Bc ST Bc
Group 5 is composed of 24 of the 119 rules that form the RTCS. This group is
similar to the previous group with respect to the values of s1, s2 and s3, but opposite
with respect to the angle and distance values. In this case, the angle values define, in
almost all rules, situations where the robot is aligned with the objective. As the
distance value, in most of the rules, corresponds to a Far or Very Far distance
situation, the combined effect of the angle and distance values makes the rules
compel the robot to advance straight to the objective.

Table 6: Group 6 rules.


CONDITION MESSAGE
s1 s2 s3 A d v1 v2 v1 v2
VF F VF 2PI VN or N F F or FF F F
VF F or VF F 0 VF F F F F
F F F or VF 2PI N ST or F All ST Bc
VF VF F PI-2PI VF FF or Bc ST or F Bc ST
VF F VF 0-PI VF FF or Bc FF ST Bc
F VF VF 0-PI N or VF Bc ST ST Bc
VF F F 2PI or 0-PI F or VF ST or F All F F
VF F VF 0-PI VF ST or Bc ST or Bc F F
F or VF F F or VF 0 F or VF FF F or Bc ST Bc
VF VF F 0 or PI-2PI VF FF FF ST Bc
VF F or VF F PI-2PI VF F ST or F F F
F VF F or VF 2PI N Bc ST or Bc Bc ST
F VF VF PI-2PI F or VF Bc ST Bc ST
VF F VF 0 F or VF F F F F
Group 6 consists of 14 of the 119 rules that the RTCS contains. This group is similar
to Group 3, but the represented angle situations require rules that make the robot turn.
Thus, groups 3 and 6 seem to be responsible for approaching the robot to the objective,
while groups 1 and 2 seem to be responsible for avoiding obstacles.
The analysis of the groups shows the capacity of Classifier Systems to evolve
coherent groups of rules. This coherence can be seen from two points of view: on one
hand, all rules of each group represent similar situations and give similar outputs; on
the other hand, independent behaviors are discriminated, so the groups are quite
independent. Furthermore, only the groups necessary to solve the problem are generated,
not all the possible ones. This allows us to conclude that the number of bits used to
represent possible groups must permit the evolution of all necessary groups, and the
RTCS will learn the number of groups that are actually necessary.

5. Conclusions

This work has centered on the application of a new CS, named Reactive with
Tags Classifier System (RTCS), to learn symbolic rules in a navigation problem.
Navigation of a robot can be defined as a complex behavior that requires movement
in a world with obstacles, where the robot's goal is to reach a predefined point. From
the point of view of learning, this problem is considered sufficiently complex if the
decision must be obtained in real time, since the environment keeps changing while
a decision is being taken or, from another point of view, the robot keeps moving
while the decision is taken.
The RTCS contains a set of mechanisms that allow the incorporation of new
environmental information into the decision-taking process. This process allows
sequences of rules (chaining rules over different execution instants) and can break the
sequence to provide a reactive output. The RTCS has proven its capacity to learn both
reactions and strategies, so the dilemma between reactive and planned systems can be
overcome.

Small Sample Discrimination and Professional Performance Assessment

David Aguado, José R. Dorronsoro, Beatriz Lucía, Carlos Santa Cruz

Instituto de Ingeniería del Conocimiento
Universidad Autónoma de Madrid, 28049 Madrid, Spain

Abstract. Class overlapping and small sample class sizes, a situation not infrequent
in practical settings, can make the successful construction of a classification system
very difficult. In this paper we will address this question by means of a new procedure
for classifier construction in such cases, which we call Nonlinear Discriminant
Analysis (NLDA), and which combines the excellent approximation properties of the
well known Multilayer Perceptrons with the target-free classical discrimination
technique of Fisher's Analysis. Besides a short description of the NLDA fundamentals,
we will give an illustration of its use in a practical problem, the assessment of the
professional performance of insurance salespersons.

1 Introduction

A relatively frequent situation in pattern recognition is that of having to build a
system to discriminate between two classes $\Omega_0$, $\Omega_1$ that present a certain degree
of overlapping and where at least one of the class samples to be used has a rather
small pattern number. A good measure of the degree of class overlapping can be
given by the mean error probability $\mathrm{MEP}(\delta_B)$ of the Bayes classifier

$$\delta_B(X) \;=\; \arg\max{}_{0,1}\,\{P(\Omega_0|X),\ P(\Omega_1|X)\},$$

which assigns a given pattern $X$ to the class to which it has the larger probability
of belonging. For instance, for classes having equal prior probabilities, its value can
be computed as

$$\mathrm{MEP}(\delta_B) \;=\; \frac{1}{2}\int \min\big(f(P|\Omega_0),\, f(P|\Omega_1)\big)\, dP,$$

where $f(P|\Omega_i)$ denotes the density function of the patterns in class $\Omega_i$. A value of
$\mathrm{MEP}(\delta_B)$ close to 0 means that the classes involved are well separated, whereas
if a considerable degree of overlapping is present, its value will tend to 0.5.
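As a minimal illustration of this formula, the following Python sketch (ours, not the authors') estimates $\mathrm{MEP}(\delta_B)$ by Monte Carlo for the two densities used in the synthetic example below: a bivariate normal centered at (1, 1) with covariance $0.25 I_2$, and the exponential density $e^{-(x+y)}$ on the positive quadrant. Sampling from the equal-prior mixture $g = (f_0 + f_1)/2$ gives $\mathrm{MEP} = E_g[\min(f_0, f_1)/(f_0 + f_1)]$.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def f0(p):                         # N((1,1), 0.25*I) density
    return np.exp(-np.sum((p - 1.0) ** 2, axis=1) / 0.5) / (0.5 * np.pi)

def f1(p):                         # e^{-(x+y)} on the positive quadrant
    inside = np.all(p > 0, axis=1)
    return np.where(inside, np.exp(-p.sum(axis=1)), 0.0)

# draw half the points from each class (equal priors), i.e. from the mixture g
p0 = 1.0 + 0.5 * rng.standard_normal((n // 2, 2))
p1 = rng.exponential(size=(n // 2, 2))
p = np.vstack([p0, p1])

mep = np.mean(np.minimum(f0(p), f1(p)) / (f0(p) + f1(p)))
print(mep)   # should land near the Bayes value of 0.263 quoted in the text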
It is evident that no classifier will be able to discriminate between patterns
coming from the overlapping region between the classes involved.* However, even
if a rather large overlapping is present, it could be desirable to be able to build
classifiers that can still tell apart patterns from areas where the two classes are
markedly different. As an example, figure 1 shows the histogram of the x-sections
of a sample of 10000 data derived from a bidimensional normal distribution (solid
line) centered at (1, 1) with covariance matrix $\Sigma = 0.25 I_2$ and from the negative
exponential with density $e^{-(x+y)}$, $x > 0$, $y > 0$. Even if good discrimination is not
possible in the central region around the common mean (1, 1), it certainly should
be so for patterns coming from both distribution tails. For instance, when classes
have relatively large and comparable sizes, classifiers built using the well known
multilayer perceptrons (MLPs) have mean error probabilities not too far from the
Bayes optimum value of 0.263. More concretely, figure 2 shows (dashed line) the
evolution, according to relative class sizes, of the mean error probability
$\mathrm{MEP}(\delta_{MLP})$ of a classifier $\delta_{MLP}$ built from an MLP with targets (1, 0) and (0, 1)
that assigns a pattern $P = (x, y)$ to the class whose target vector is closer to the
output of the MLP transfer function for that pattern. This is not surprising at all;
MLPs provide excellent classifiers, something due, among other things, to their
excellent universal approximation properties [6].

Fig. 1. Histograms of the x-sections of a sample of 10000 normal (solid line) and
exponential (dashed line) data.

* J. Dorronsoro and C. Santa Cruz were partially supported by Spain's CICyT under
grant TIC 98-0247. Both are also members of the Department of Computer
Engineering, Universidad Autónoma de Madrid.
When class sizes are not too uneven, the $\mathrm{MEP}(\delta_{MLP})$ value remains near
the Bayes optimum (as seen in the figure). However, this behavior may change
when sample sizes become rather small (as will be the case in the application
example of section 3), or when one class size becomes much smaller than the
other. For instance, if in the above synthetic example exponential patterns become
scarcer, $\mathrm{MEP}(\delta_{MLP})$ starts degrading rather fast towards the 0.5 limit.
This behavior is certainly due to the increasing difficulty of the discrimination
problem. Nevertheless, another poor performance factor is present here, one having
to do with the fact [11] that the habitual target vector labeling for C-class
problems in terms of the C-dimensional unit vectors $e_i = (0, \ldots, 1, \ldots, 0)$ implicitly
incorporates class sizes into the training process, strongly favoring the larger
classes. Some alternative coding procedures can be used to try to correct this
fact. However, they do not essentially improve the MLP performance depicted
above.

Fig. 2. Evolution of the mean probability errors of NLDA (solid line) and MLP (dashed
line) classifiers with respect to the fraction of normal data in the training set.
It seems thus desirable to avoid any incorporation of class size information
into network training through target coding. Notice that a discriminant construction
procedure that does not depend on the necessity of using a concrete target
scheme certainly exists: in fact, probably the best known discrimination method,
Fisher's Analysis, does not rely on any targeting scheme at all. In the following
section we will discuss what we may call Nonlinear Discriminant Analysis
(NLDA), a classifier construction method that combines the target-free nature
of Fisher's Analysis with the approximating properties of MLPs. Although the
training of NLDA networks has a larger complexity than that of ordinary MLPs,
they appear to be more robust than MLPs on pattern recognition problems such
as the above synthetic example. In fact, figure 2 also shows the evolution of
the MEP values for NLDA classifiers built upon training sets with a decreasing
fraction of exponential sample data. It can be seen that at the beginning both
are similar and close to the Bayes optimum of 0.263, but once the fraction of
normal data in the sample is about 70%, MLP performance degrades faster than
that of the NLDA net, which remains below 0.3 even when 90% of the sample
are normal data, and below 0.35 when that proportion reaches 98% (notice that
when the sample is made up of just normal data, any classification method must
give a MEP value of 0.50).
The robustness that this target-coding-free training gives to NLDA classifiers
may make them suitable for use in a number of practical situations. For instance,
they have been used successfully in credit card fraud prevention [2], a problem in
which classes tend to have overlapping and size characteristics similar to those
of the above synthetic problem. NLDA network training will be briefly reviewed
in the next section, and in section 3 an example of their application to a
psychological classification problem, the assessment of the professional performance
of insurance salespersons, will be given. This is a situation where the presence
of borderline individuals leads to significant class overlapping and where usually
both class samples (good and bad salespersons) have small sizes. Although
clearly of great practical interest, it is thus a difficult problem. This difficulty
is increased by the fact that overall individual evaluations require the
application of test batteries with a large number of items, which result in rather
big input pattern sizes. Moreover, the combination of small samples with large
input dimension makes classical dimensionality reduction techniques, such as
principal component analysis, rather difficult or impractical to use. In this case,
NLDA network performance compares favorably to that of standard multilayer
perceptrons (MLPs). However, all the above circumstances imply that, for this
problem, numerical classifier results, even if they appear to be fairly good, are
to be used in practice as another indicator, potentially useful, but to be
complemented with a number of alternative ones. The paper will close with a brief
discussion.

Fig. 3. Architecture of a NLDA net and notational conventions used in the paper.

2 Nonlinear Discriminant Analysis

Contrary to what happens in MLP construction, which is based on the minimization
of an error function between desired targets and network outputs, in
Fisher's discriminant analysis a certain linear projection of the input patterns is
sought that minimizes a concrete, target-free criterion function. A usual choice
is the ratio $J(W) = |S_W|/|S_B|$ of the determinants of the within and between
class variances, $S_W$ and $S_B$ respectively, of the projected patterns; $W$ denotes the
vector of projecting weights (see for instance [5] for more details).
Although it can give good results in a number of situations, the linear nature
of Fisher's pattern transformation makes it unsuitable for problems where
nonlinear class boundaries are to be expected. However, a natural way to over-
come this difficulty is to combine Fisher's projection with a previous, nonlinear
transformation of the input patterns. Figure 3 shows a simple network with this
effect for a C class problem. It has essentially the same architecture as a single
hidden layer MLP, the difference being the function that the network weights have
to minimize. This function will now be

$$\mathcal{J}(W^I, W^H) \;=\; \frac{|S_W|}{|S_B|} \;=\; \frac{\left|(W^H)^t\, S^H_W\, W^H\right|}{\left|(W^H)^t\, S^H_B\, W^H\right|} \qquad (1)$$

where $S^H_B$ and $S^H_W$ now denote the between and within class scatter matrices of
the transforms of the input patterns at the hidden layer level. These matrices are
thus affected by the concrete values of the input weights $W^I$. Therefore, the
criterion function $\mathcal{J}$ now depends on the pair of weight vectors $(W^I, W^H)$, and
its minimization is to be done with respect to both.
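The following sketch shows how the criterion (1) can be evaluated for given weights. It is only an illustration under stated assumptions: tanh hidden units and the usual definitions of the scatter matrices are assumed, since the paper does not spell them out, and all names are ours.

import numpy as np

def hidden_outputs(X, W_I):
    # (N, M) hidden-layer transforms of the input patterns (tanh assumed)
    return np.tanh(X @ W_I)

def scatter_matrices(H, y):
    # Between (S_B) and within (S_W) class scatter of the hidden outputs.
    m = H.mean(axis=0)
    S_B = np.zeros((H.shape[1],) * 2)
    S_W = np.zeros_like(S_B)
    for c in np.unique(y):
        Hc = H[y == c]
        mc = Hc.mean(axis=0)
        S_B += len(Hc) * np.outer(mc - m, mc - m)
        S_W += (Hc - mc).T @ (Hc - mc)
    return S_B, S_W

def criterion(X, y, W_I, W_H):
    # Determinant ratio of criterion (1); scalars for a two class problem.
    H = hidden_outputs(X, W_I)
    S_B, S_W = scatter_matrices(H, y)
    return (np.linalg.det(W_H.T @ S_W @ W_H)
            / np.linalg.det(W_H.T @ S_B @ W_H))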
A simple way to do this (see [9]) is to iteratively generate a sequence $(W^I_k, W^H_k)$
of minimizing weights in a two-step fashion. For the $k+1$ weights, we first
compute $W^H_{k+1}$ by keeping $W^I_k$ fixed and performing multiple C class
discriminant analysis on the pattern vectors provided by the hidden unit outputs
that these fixed weights give. This is done by the well known Fisher eigenvalue
and eigenvector computations (see [5], pp. 115-121). Once this is done,
we keep the just computed $W^H_{k+1}$ fixed and obtain the new $W^I_{k+1}$
by optimizing against it the corresponding version of the criterion function
$\mathcal{J}^I(W^I) = \mathcal{J}(W^I, W^H_{k+1})$.
Several choices are now available; in [9] a quasi-Newton procedure is used,
for which the gradients $\nabla \mathcal{J}^I = (\partial \mathcal{J}^I/\partial w^I_l)$ have to be computed. This can be
done using the vectors $\mathrm{out}^h = (\mathrm{out}^h_{ij})$ as intermediate variables (see figure 3 for
the concrete variable labeling). More precisely, if $M$ denotes the number of hidden
units, $C$ the number of classes and $N_i$ that of sample patterns in class $i$, we have

$$\frac{\partial \mathcal{J}^I}{\partial w^I_l} \;=\; \sum_{i=1}^{C} \sum_{j=1}^{N_i} \sum_{h=1}^{M} \frac{\partial \mathcal{J}^I}{\partial \mathrm{out}^h_{ij}}\; \frac{\partial \mathrm{out}^h_{ij}}{\partial \mathrm{act}^h_{ij}}\; \frac{\partial \mathrm{act}^h_{ij}}{\partial w^I_l}, \qquad (2)$$
which, for instance, in the simplest case of a 2 class problem, reduces to

$$\frac{\partial \mathcal{J}^I}{\partial w^I_l} \;=\; \sum_{i=1,2} \sum_{j=1}^{N_i} \sum_{h=1}^{M} \frac{1}{s_B}\left( \frac{\partial s_W}{\partial \mathrm{out}^h_{ij}} \;-\; \mathcal{J}\, \frac{\partial s_B}{\partial \mathrm{out}^h_{ij}} \right) \frac{\partial \mathrm{out}^h_{ij}}{\partial \mathrm{act}^h_{ij}}\; \frac{\partial \mathrm{act}^h_{ij}}{\partial w^I_l},$$

where the two scalars $s_B$ and $s_W$ now denote the between and within class
scatter of the network outputs. Notice that here the output layer has a single unit
(in general, and as happens with classical Fisher analysis, it has $C-1$ units for
a C class problem) and, therefore, the determinant ratio of the general criterion
function reduces to a simple quotient of scalars.
For the practical application of the above formula, the partials $\partial s_B/\partial \mathrm{out}^h_{ij}$
and $\partial s_W/\partial \mathrm{out}^h_{ij}$ have to be computed. A further analysis shows that
$$\frac{\partial s_B}{\partial \mathrm{out}^h_{ij}} \;=\; 2 \sum_{k=1}^{M} w^H_{h1} w^H_{k1}\,(m^k_i - m^k) \;-\; \frac{2}{N} \sum_{k=1}^{M} \sum_{s=1,2} N_s\, w^H_{h1} w^H_{k1}\,(m^k_s - m^k),$$

$$\frac{\partial s_W}{\partial \mathrm{out}^h_{ij}} \;=\; 2 \sum_{k=1}^{M} w^H_{h1} w^H_{k1}\,(\mathrm{out}^k_{ij} - m^k_i) \;-\; \frac{2}{N_i} \sum_{k=1}^{M} \sum_{r=1}^{N_i} w^H_{h1} w^H_{k1}\,(\mathrm{out}^k_{ir} - m^k_i),$$

where $m^k_i$ and $m^k$ denote the components of the class means $m_i$ and of the total
mean $m$ respectively, and $N = \sum_s N_s$ is the total number of sample patterns.

Further details on NLDA networks can be found in [9]. We just mention here
that, with respect to NLDA complexity, the more complicated criterion function
used in NLDA obviously implies a costlier model training than that of
MLPs. The simplest way of comparing both is to estimate the cost of the gradient
computations for each model. For a two class problem, it can be easily derived
from the above estimates that the cost of a single full network gradient estimation
is $O(NDM^2)$, with $M$ and $N$ as before, and where $D$ denotes the input pattern
dimension. In contrast, gradient computations in backpropagation have a
cost of $O(NDM)$, that is, they are $M$ times faster. In a general C class problem
the relative costs are the same if traces are used instead of determinants in the
definition of the NLDA criterion function; if the latter are used, the cost of a
NLDA gradient estimate shoots up to $O(NDM^2(C-1)^5) = O(NDM^2C^5)$. In any
case, when tried on problems in which both methods converge, NLDA training
tends to require fewer iterations than backpropagation, substantially alleviating
its greater cost.
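Putting the two steps together, a rough sketch of the alternating minimization (reusing the helpers of the previous listing) might look as follows. SciPy's BFGS routine is used here merely as a stand-in for the quasi-Newton procedure of [9]; the iteration count, initialization and regularization term are assumptions.

import numpy as np
from scipy.linalg import eigh
from scipy.optimize import minimize

def fisher_step(H, y):
    # Step 1: with W_I fixed, Fisher's generalized eigenproblem gives W_H;
    # the leading eigenvector of S_B w = lambda S_W w (one column for 2 classes).
    S_B, S_W = scatter_matrices(H, y)
    vals, vecs = eigh(S_B, S_W + 1e-8 * np.eye(S_W.shape[0]))
    return vecs[:, -1:]                                  # (M, 1)

def train_nlda(X, y, M, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    W_I = 0.1 * rng.standard_normal((X.shape[1], M))
    for _ in range(iters):
        W_H = fisher_step(hidden_outputs(X, W_I), y)     # step 1: fixed W_I
        res = minimize(                                  # step 2: fixed W_H
            lambda w: criterion(X, y, w.reshape(W_I.shape), W_H),
            W_I.ravel(), method="BFGS")
        W_I = res.x.reshape(W_I.shape)
    return W_I, W_H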

We now turn to a concrete application of NLDA to classifier construction in
a human resources management problem: the assessment of the professional
performance of insurance salespersons. These NLDA based classifiers are integrated
into a larger application, developed by the Instituto de Ingeniería del Conocimiento
(IIC) for a major international insurance company. This application handles several
chores, such as interactive test administration, answer collecting, and test
result analysis and psychological evaluation. As mentioned before, several factors
make this a rather difficult problem, whose solution cannot be expected as the
result of any single tool. Accordingly, the NLDA classifiers are used in the above
application as a support tool integrated with a number of other analysis and
diagnostic procedures.

3 NLDA and professional performance assessment

When psychological evaluation is used in companies and organizations, one of the
key problems is obviously the detection, among all the members of a potential
hiring pool, of those candidates capable of the best professional performance.
This is of course a major issue in such psychological studies, and a number of
large batteries of questions, tests and tasks have been devised for this goal. The
analysis of the results derived from these batteries is one of the key factors in
the predictive inference process that finally leads to, say, a hiring or not hiring
decision.
Of course, a first condition for the success of the whole process is the scientific
guarantees (reliability and validity) of the predicting instrument [1, 8]. However,
given the large input sizes of the data collected, numerical techniques also have
to be used to analyze these data and to derive predictive features. Among these
techniques we can find linear discriminants, multivariate regression, and principal
components and factor analysis [10]. While useful in some instances, a drawback
common to all these techniques is their linear nature, which forces on the
practitioner the assumption of a linear relationship among the detection variables. Of
course, this is not always the case, and hence the interest in detection methods
able to introduce nonlinear relations among the variables used.
In our example, the data gathering for the assessment of professional performance
was based on a questionnaire with a large set of psychological test items
aiming to measure variables ranging from sociodemographical characteristics, to
personality traits based on Big-Five models [3], to reasoning related abilities,
and to motivational and capability characteristics. This battery of questions
results in a 398 item test which, when applied to a concrete individual, makes up
the data vector upon which subsequent analysis is to be done. Given this input
size, a first natural idea is to try to reduce the data dimensionality. However, this
is not an easy task. To begin with, the items on the test battery to be applied
usually reflect a long and established field work expertise, which makes the
straight elimination of single test items rather difficult. Moreover, standard
numerical dimensionality reduction techniques, such as Principal Component Analysis
[7], are also hard to apply here. In fact, a substantial amount of sample variance
can be captured by a relatively small number of components. The problem is
that components that are relevant for the whole sample or for a given subset
may not be so for other sizeable sample subsets. Thus, a posteriori, uniform
sample-wide dimensionality reduction is very often bound to give poor results. It may
be worth observing that among the numerical techniques being investigated in
psychology to reduce the number of characteristic traits to be derived from a
given person, possibly the most promising one, Adaptive Item Response Theory
(see for instance [4]), tries to reduce the number of items in a fashion individually
suited to each subject and while the test is actually being taken. In any case,
this approach, which may very well yield different measure items for different test
subjects, therefore not giving uniform, sample wide dimensionality reduction,
has not been considered here.

                Training                   Test
            MLP        NLDA          MLP        NLDA
 class     0    1     0    1        0    1     0    1
   0      51    1    47    5       17   13    22    8
   1       2   56     8   50       12   18    10   20

Table 1. Classification results for the training and test sets using MLP and NLDA
networks.

The results reported below are derived from a 170 pattern initial sample
obtained in an early phase of the project. The sample individuals were divided
into two performance categories, with 82 and 88 individuals respectively. Observe
that small sample sizes are unavoidable in most performance assessment
applications, because of factors such as the specificity of the task abilities to be
measured, the usually not too large sizes of hiring or training person groups, or
the concrete company testing requirements that make the aggregation of samples
obtained through different test procedures difficult.
Several test sets were formed with 60 patterns, 30 from each class, and
the training sets were made up of the remaining 110 patterns, 52 of them
corresponding to the first class and 58 to the other. For each set, various MLP
and NLDA networks were trained, with some of them being discarded because
of either poor training convergence or poor test results (notice that local minima
convergence or overfitting are bound to appear, given the above mentioned
pattern input dimension and sample sizes). Table 1 contains the training and test
set confusion matrices for both NLDA and MLP networks. The MLP has two
output units, with the usual target coding values of (1,0) and (0,1), while the NLDA
networks yielded one dimensional outputs (see section 2). As can be seen from
the table, training set classification was slightly better for the MLP net than for
the NLDA one. However, the MLP net has a test classification error percentage
of 42% while that of the NLDA network was just 30%. Among other plausible
interpretations of this fact, it seems to imply a more robust behavior of NLDA
networks with respect to overfitting.
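For reference, the error percentages just quoted can be recomputed directly from the test confusion matrices of Table 1; the snippet below is merely an illustration.

import numpy as np

mlp_test  = np.array([[17, 13], [12, 18]])   # rows: true class, cols: predicted
nlda_test = np.array([[22,  8], [10, 20]])

for name, cm in [("MLP", mlp_test), ("NLDA", nlda_test)]:
    err = 1 - np.trace(cm) / cm.sum()        # off-diagonal fraction
    print(f"{name}: {err:.0%}")              # MLP: 42%, NLDA: 30%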
The overall application in which the NLDA classifiers will be incorporated is now
in mid development. A second testing phase, very nearly completed, will increase
the initial 170 pattern sample with about 200 more patterns, which later on will be
further augmented in subsequent validation and maintenance phases. These new data
will be used to validate the initial models and to derive, if possible, new, more
efficient classifiers.

4 Conclusions

The target free nature of NLDA network training seems to make the resulting
classifiers rather robust in situations where classes have a high degree of
overlapping and sample sizes are very uneven. This has been observed in synthetic
classification problems (such as the one reported in the first section) and
also in practical applications (such as credit card fraud detection). In this work
their use for professional performance assessment has been reported. This is a
difficult classification task with a priori high class overlapping but where usually
all class sample sizes are rather small, a difficulty which is compounded by
large input pattern sizes, not reducible through standard dimensionality reduction
techniques. In any case, the NLDA results over a first field sample are very
encouraging. Subsequent work will concentrate, on the one hand, on enlarging
the samples available, and, on the other, on an empirical study of the general effects
of small overlapping samples on the discrimination capabilities of NLDA networks,
and their comparison with those of other classifier construction techniques.

References

1. American Educational Research Association, American Psychological Association


and National Council of Measurement in Education, "Standards for Educational
and Psychological Testing", American Psychological Association, Washington,
1985.
2. J. Dorronsoro, F. Ginel, C. Sánchez, C. Santa Cruz, "Neural Fraud Detection in
Credit Card Operations", IEEE Trans. in Neural Networks 8 (1997), 827-834.
3. P.T. Costa and R.R. McCrae, "The NEO Personality Inventory Manual", Psycho-
logical Assessment Resources, Odessa, FL, 1985.
4. B.G. Dodd, R.J. De Ayala, W.R. Koch, "Computerized Adaptive Testing with
Polytomous Items", Applied Psychological Measurement 19 (1995), 5-22.
5. R.O. Duda, P.E. Hart, "Pattern classification and scene analysis", Wiley, 1973.
6. K. Hornik, M. Stinchcombe, H. White, "Universal Approximation of an Unknown
Mapping and Its Derivatives Using Multilayered Feedforward Networks", Neural
Networks 3 (1990), 551-560.
7. I.T. Jolliffe, "Principal Component Analysis", Springer, 1986.
8. S. Messick, "Test Validity and the Ethics of Assessment", American Psychologist
35 (1980), 1012-1027.
9. C. Santa Cruz, J. Dorronsoro, "A nonlinear discriminant algorithm for data projec-
tion and feature extraction", IEEE Trans. in Neural Networks 9 (1998), 1370-1376.
10. L.J. Cronbach, G.C. Gleser, H. Nanda, N. Rajaratnam, "The Dependability of Be-
havioral Measurements. Theory of Generalizability for Scores and Profiles", Wiley,
1972.
11. A.R. Webb, D. Lowe, "The optimized internal representation of multilayer classifier
networks performs nonlinear discriminant analysis", Neural Networks 3 (1990),
367-375.
SOM Based Analysis of Pulping Process Data

Olli Simula and Esa Alhoniemi

Laboratory of Computer and Information Science
Helsinki University of Technology
P.O. Box 5400, Finland

Abstract. Data driven analysis of complex systems or processes is necessary
in many practical applications where analytical modeling is not possible.
The Self-Organizing Map (SOM) is a neural network algorithm that has been
widely applied in the analysis and visualization of high-dimensional data. It
carries out a nonlinear mapping of the input data onto a two-dimensional grid. The
mapping preserves the most important topological and metric relationships
of the data. The SOM has turned out to be an efficient tool in data exploration
tasks in various engineering applications: process analysis in the forest
industry, steel production, and analysis of telecommunication networks and
systems. In this paper, SOM based analysis of complex process data is
discussed. As a case study, the analysis of a continuous pulp digester is presented.
The SOM is used to form visual presentations of the data. By interpreting
the visualizations, complex parameter dependencies can be revealed. By
concentrating on the significant measurements, reasons for digester faults
can be determined.

1 Introduction
Modeling and control of industrial processes usually requires that an analytic system
model can be built. However, in large industrial systems, global models cannot
always be defined. In such cases, discovering complex relationships between
system variables is often problematic and modeling should be based on experimental
knowledge. Modern automation systems produce masses of measurement data
which, however, may be very difficult or even impossible to interpret. In many
practical situations, even minor knowledge about the characteristic behavior of the
system might be useful. For this purpose, the measurements need to be converted
into some simple and comprehensive display which would reduce the dimensionality
of the measurements and simultaneously preserve the most important metric
relationships between the data.
Artificial neural networks have successfully been used to build system models
directly based on process data. They provide a means to analyze the process
without an explicit physical model. The Self-Organizing Map (SOM) [7] is one of the most
popular neural network models. It is especially suitable for system analysis due
to its unsupervised learning and topology preserving properties. The SOM algorithm
implements a nonlinear mapping from the high-dimensional input data space onto a
two-dimensional grid or net of neurons. The mapping preserves the most important
topological and metric relationships of the input data. The net roughly
approximates the probability density function of the data and, thus, inherently clusters
the data. Various visualization alternatives of the SOM are helpful, e.g., in hunting
for correlations between measurements and in investigating the cluster structure of
the data.
The SOM based data exploration has been applied in various engineering
applications such as pattern recognition, text and image analysis, financial data analysis,
process monitoring and modeling as well as control and fault diagnosis [8, 12]. In
addition, the SOM has been used in the analysis of telecommunications environments,
e.g., in discrete-signal detection and adaptive resource allocation problems [10, 13].
The ordered signal mapping property of the SOM algorithm has proven to be
especially powerful in the analysis of complex industrial processes [1]. In this paper,
the analysis of a pulping process is considered. The SOM based approach has been
utilized to determine the reasons for situations where the pulp quality variable,
the kappa number, becomes too low. A similar approach has also been used in the
analysis of a steel rolling process.

2 Process Data Analysis

In modern process automation systems, it is possible to collect and store huge
amounts of measurement data. On the other hand, the development of computer
capacity has made it possible to efficiently carry out analysis of the data. In Figure 1,
different approaches to the treatment of process data are presented.

[Diagram: treatment of process data, branching from the data into black-box modeling and data analysis.]

Fig. 1. Different possibilities of process data treatment.

Black-box models build a regression model between the inputs and outputs of a
system based on measurement data. After that, they can be used to predict the system
outputs if the inputs are known. For example, in [11, 2, 9], artificial feed-forward
neural networks are suggested for the prediction of pulping processes. The
disadvantage of black-box models, however, is that they give little information about the
dependencies of the variables, and it is difficult to get a general view of the system
behavior.
In data analysis, useful information is extracted from process data without
modeling the system. The approach needs to be distinguished from process experiments,
where inputs are intentionally varied to find out the effect of the changes on the
output. Data analysis could merely be used to find out reasonable setups for the
experiments.
The application areas of data analysis are such processes or phenomena which,
due to their complexity or nature, are impossible to model analytically. The
advantage of the analysis is that neither process modifications nor experiments are
required, and general methods independent of the application can be used.
Two different types of information may be acquired. Qualitative information
can be obtained using a data representation that is easy to understand and
interpret. This is usually done using data visualization techniques. An example of
quantitative, i.e., numerical information is the correlation coefficient depicting the
strength of the linear dependency between two variables.
Because the whole analysis is based on data, the measurements have to be
reliable. Signal noise is usually not a problem, because it may be reduced using
filtering techniques. Real problems are due to sensor faults. A good example is signal
non-stationarity caused by the slow drifting of dirt on the surface of the measurement
sensor.
Analysis of data originating from a complex system is an iterative process. In
Figure 2, the different stages of data analysis are presented. After data acquisition, the
variables of interest are selected. Common preprocessing operations include the removal
of erroneous measurements, noise reduction, and compensation of delays between
different variables. The analysis may directly lead to conclusions, but typically several
variable sets need to be considered; also the preprocessing may need to be altered.
Usually the data analysis is started using as many variables as possible, which are
then reduced to the most significant ones.

[Diagram: data acquisition -> variable selection -> preprocessing -> analysis -> conclusions.]

Fig. 2. A schematic illustration of data analysis.

It should be emphasized that the whole analysis process requires the presence of
a process expert, whose assistance is valuable in the selection of variables, in choosing
the methods and their parameters for preprocessing, as well as in the interpretation and
checking of the results. Expertise is, of course, also needed in drawing the
correct conclusions from the results.

3 The Self-Organizing Map in Data Analysis

As stated earlier, the SOM is a neural network structure consisting of neurons
arranged on a two-dimensional regular grid. Each neuron is represented by a model
(prototype) vector. The SOM algorithm carries out a nonlinear, topology preserving
mapping from a high-dimensional input space onto the model vectors. In other words,
the algorithm simultaneously does two things: vector quantization and projection
of the prototype vectors into two dimensions.
The topology preserving property of the mapping makes it possible to apply
several different visualization methods to study the dependencies between variables
in different parts of the input space. In the following, the visualizations used in the
case study are discussed in detail.
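A minimal sketch of the on-line SOM algorithm just summarized is given below; it is an illustration only, and the learning-rate and neighborhood-radius schedules are assumptions of ours, not those of [7].

import numpy as np

def train_som(X, rows=10, cols=10, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = rng.standard_normal((rows * cols, X.shape[1]))      # model vectors
    T = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(X):
            alpha = 0.5 * (1 - t / T)                       # learning rate
            sigma = max(1.0, (rows / 2) * (1 - t / T))      # neighborhood radius
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))     # best-matching unit
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)      # grid distances^2
            h = np.exp(-d2 / (2 * sigma ** 2))              # Gaussian neighborhood
            W += alpha * h[:, None] * (x - W)               # pull units towards x
            t += 1
    return W.reshape(rows, cols, -1)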

3.1 SOM visualization

The greatest advantage of the SOM in the analysis of measurement data is efficient
visualization of the data [4, 6, 16]. The visual presentation can be obtained by
visualizing the model vectors of the map or the training data using the model vectors.
In this research project, the following visualization methods have been applied¹
(a small illustrative sketch follows the list):
- Component plane representation shows the distribution of one model vector
  component, i.e. one variable, on the map. By displaying all component planes
  (distributions) at the same time, one may roughly observe relationships between
  variables and distinguish different operating states of the process (for example,
  see Figure 4). In our visualizations, the magnitude of each variable is presented
  using a color scale.
- Ordering of the component planes makes it easier to simultaneously investigate
  several component planes [14, 15]. The basic idea is to arrange the component
  planes in such a way that similar planes (correlating variables) lie close to each
  other. This reorganization is also carried out using the SOM algorithm. For
  example, in Figure 5, the component planes of Figure 4 have been rearranged.
- In continuous coloring, each map unit is assigned a color, which is similar in
  neighboring units. In our experiments, the HSV color system is used so that the
  value of component H varies with the direction from the map center, S is constant
  and V is proportional to the map unit distance from the map center.
  The following techniques based on the coloring are used in this article:
  - Two variables are selected. The corresponding components of the model vectors
    are drawn in a scatter plot, and each point is dyed using the corresponding
    map unit color. The color coding makes it possible to study the dependence of
    two variables in different process states. For an illustration, see Figure 7.
  - The coloring can also be used in time series visualization: the best-matching
    unit (BMU) of each sample is determined, and the corresponding points in
    the series are dyed using the color of the BMU (see Figure 8).
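As an illustration of the component plane representation described in the first item above, the following sketch draws one plane per variable from the (rows, cols, dim) array returned by the previous listing. It assumes matplotlib and is not the authors' visualization code.

import numpy as np
import matplotlib.pyplot as plt

def component_planes(som, names):
    # som: (rows, cols, dim) array of model vectors; one heat map per variable.
    dim = som.shape[2]
    fig, axes = plt.subplots(1, dim, figsize=(3 * dim, 3))
    for k, ax in enumerate(np.atleast_1d(axes)):
        im = ax.imshow(som[:, :, k])   # distribution of variable k over the map
        ax.set_title(names[k])
        fig.colorbar(im, ax=ax)
    plt.show()

# usage (with train_som from the previous sketch):
#   component_planes(train_som(X), names=["var1", "var2", ...])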

4 Case Study: Analysis of a Continuous Digester

4.1 The Continuous Digester

In the case study, the behavior of a continuous pulp digester of a pulp mill was
studied. An illustration of the digester and the separate impregnation vessel is
shown in Figure 3.
¹ The color images of this article can be viewed at URL
http://www.cis.hut.fi/projects/ide/publications/fulldetails.html#iwann99
[Diagram labels: continuous digester, impregnation vessel, steam, top screen, extraction screens, wash screen, black liquor, wash liquor, white liquor.]

Fig. 3. The continuous digester and the impregnation vessel. The cooking and wash liquor
flows are marked by thin lines and the chip flow by a thick line. The four square-shaped
symbols are heat exchangers.

Presteamed wood chips together with cooking liquor are fed into the impregnation
vessel. After the impregnation, the chips are fed into the digester. At the
top of the digester, they are heated to cooking temperature using steam, and the
pulping reaction, the removal of lignin, begins. During the cooking, the chips move
down the digester. The cooking ends at the extraction screens by displacement of
the hot cooking liquor by cooler wash liquor, which is injected into the digester through
bottom nozzles and the bottom scraper. The liquor moves counter-current to the chip
flow and performs washing of the chips.

4.2 Analysis of Digester Measurements

Problems in the digester operation, where the pulp consistency in the digester outlet
dropped, were the starting point for the analysis. In those situations, the values of the
end product quality variable, the kappa number, were smaller than the target value.
The test material consisted of measurements made in the digester during one
week with constant production speed. At the end of the period, the digester ended
up in such a faulty situation that the production speed of the line had to be
dropped in order to restore control of the digester.
Fig. 4. Component planes of the SOM trained using 24 measurement signals of the digester.

The signals depicting the operation of the digester were collected from the
automation system of the mill. The test period was selected so that there were no
significant errors in the measurements. In preprocessing, the signals were delayed with
respect to each other using known digester delays. Because the signal values were already
averages of ten measurements made once a minute, no further noise reduction by
filtering was required.
In Figure 4, the component planes of a SOM trained using 24 measured signals
are presented. In the lower part of the map, especially in the right corner, the value
of the kappa number is low. This means that the problematic states are mapped into
that part of the map.
The component planes of Figure 4 are rearranged in Figure 5 so that component
planes resembling each other, i.e., the correlating ones, lie near each other; this aids
in the interpretation of the numerous component planes.
Using the reorganization of the component planes in Figure 5, the 11 variables
best correlating with the kappa number were selected for further investigation while
the 12 others were rejected. In Figure 6, the component planes of another SOM
trained using only the selected variables are shown. Also in this case, the problematic
process states with a low kappa number were mapped into the bottom right corner
of the map. To illustrate correlations between the kappa number and several other
variables, each map node is assigned a hue (see Figure 7, top left corner). Then,
Fig. 5. Rearranged component planes. The variables selected for further investigation are
surrounded by black line. The output variable of interest, the kappa number, is indicated
by black arrow.

xy-plots with the kappa number on the y-axis and the other ten variables on the x-axis
were produced using the map weight vector values.
The scatter plots indicate that in the faulty states there seems to be very little
correlation between kappa and the H-factor, which is the variable used to control
the kappa number. On the other hand, the variables "Dig_chip_r", "Dig_liq_l",
"Black_liq", "Screens", and "Press_d" seem to correlate with the kappa number.
The explanation for this is that in a faulty situation, the downward movement
of the chip plug in the digester slows down. This is due to the fact that the plug
is so tightly packed at the extraction screens that the wash liquor cannot pass it.
Fig. 6. Component planes of the selected variables.

Fig. 7. Color coding on the SOM and 11 scatter plots.



There are two consequences: the wash liquor slows down the downward movement
of the plug, and the pulping reaction does not stop.
Because the cooking continues, the kappa number becomes too small. In addition,
the H-factor based digester control fails, because in the H-factor computation,
the assumed constant cooking times for the chips become longer due to the slowing
down of the chip plug movement.
Finally, in Figure 8, the signals that were used to train the SOM of Figure 6
are colored using the coding presented in Figure 7. Now it can be clearly seen
that the process operates normally until sample 700. From then on, problematic
situations dyed by yellow and green hues appear every now and then. The last
variations in the kappa number are so alarming that the operators have to slow
down the production rate of the digester (not shown).

Fig. 8. Color coding of signals used to train the SOM with 12 variables.

5 Conclusions

The data analysis approach is useful in problems where the system or phenomenon of
interest is difficult to deal with analytically due to its complexity or nature. In
normal operation, the behavior of the continuous digester can be modeled based on
first principles [5, 3]. In the faulty states, however, the analysis of the pulping data
is the only way to investigate the behavior of the system.
The Self-Organizing Map can be effectively used to find and visualize correlations
between process variables in different operational states of the process. In the faulty
states of the continuous digester, variables that normally do not have much effect
on the pulp quality (chip level, etc.) seemed to affect the pulping. This was due to the
fact that they correlated with the chip plug movement in the digester. In the problematic
situations, the H-factor based kappa number control failed due to the increased residence
time of the chips in the digester, which is assumed to be approximately constant at
constant production speed. In other words, the H-factor was bigger than the control
system expected it to be.

5.1 Acknowledgments
The authors wish to thank UPM-Kymmene Wisaforest pulp mill for the pulping
data and UPM-Kymmene Pulp Center for aid in the interpretation of the results. Fi-
nancial support by Technology Development Centre of Finland and UPM-Kymmene
is gratefully acknowledged.

References
1. E. Alhoniemi, J. Hollmén, O. Simula, and J. Vesanto. Process Monitoring and Model-
ing Using the Self-Organizing Map. Integrated Computer-Aided Engineering, 6(1):3-14,
1999.
2. B.S. Dayal, J. F. MacGregor, P. A. Taylor, and S. Marcikic. Application of feedforward
neural networks and partial least squares for modelling kappa number in a continuous
kamyr digester. Pulp & Paper Canada, 95(1):26-32, 1994.
3. R. R. Gustafson, C. A. Sleicher, W. T. McKean, and B. A. Finlayson. Theoretical model
of the kraft pulping process. Industrial & Engineering Chemistry Process, 22(1):87-96,
Jan. 1983.
4. J. Himberg. Enhancing SOM-based data visualization by linking different data pro-
jections. In L. Xu, L. W. Chan, and I. King, editors, Intelligent Data Engineering and
Learning, pages 427-434. Springer, 1998.
5. E. Härkönen. A mathematical model for two-phase flow in a continuous digester. Tappi
Journal, 70:122-126, Dec. 1987.
6. S. Kaski, J. Venna, and T. Kohonen. Tips for Processing and Color-Coding of Self-
Organizing Maps. In G. Deboeck and T. Kohonen, editors, Visual Explorations in
Finance, Springer Finance, chapter 14, pages 195-202. Springer-Verlag, 1998.
7. T. Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sci-
ences. Springer, Berlin, Heidelberg, 1995.
8. T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas. Engineering Applications of
the Self-Organizing Map. Proceedings of the IEEE, 84(10):1358 - 1384, 1996.
9. M. T. Musavi, D. H. Coughlin, and M. Qiao. Prediction of wood pulp K# with
radial basis function neural network. In Proceedings of the 1995 IEEE International
Symposium on Circuits and Systems, volume 3, pages 1716-1719, Piscataway, 1995.
IEEE.
10. K. Raivio, J. Henriksson, and O. Simula. Neural detection of QAM signal with strongly
nonlinear receiver. Neurocomputing, 21:159-171, 1998.
11. J. B. Rudd. Prediction and control of pulping processes using neural network mod-
els. In 80th Annual Meeting, Technical Section, volume B, pages 169-173, Montreal,
Quebec, Canada, Feb. 1994. Canadian Pulp & Paper Association.
12. O. Simula and J. Kangas. Neural Networks for Chemical Engineers, volume 6 of
Computer-Aided Chemical Engineering, chapter 14: Process monitoring and visualiza-
tion using self-organizing maps, pages 371-384. Elsevier, Amsterdam, 1995.
13. H. Tang and O. Simula. The optimal utilization of multi-service SCP. In Intelligent
Networks and New Technologies, pages 175-188. Chapman & Hall, 1996.
14. J. Vesanto. SOM-Based Data Visualization Methods. Intelligent Data Analysis, 1998.
Accepted for publication.
15. J. Vesanto and J. Ahola. Hunting for Correlations in Data Using the Self-Organizing
Map. Accepted for publication in International ICSC Symposium on Advances in
Intelligent Data Analysis.
16. J. Vesanto, J. Himberg, M. Siponen, and O. Simula. Enhancing SOM Based Data
Visualization. In T. Yamakawa and G. Matsumoto, editors, Proceedings of the 5th In-
ternational Conference on Soft Computing and Information/Intelligent Systems, pages
64-67. World Scientific, 1998.
Gradient Descent Learning Algorithm for
Hierarchical Neural Networks: A Case Study
in Industrial Quality
Daniela Baratta, Francesco Diotalevi, Maurizio Valle and Daniele D. Caviglia

Department of Biophysical and Electronic Engineering


University of Genoa - via Opera Pia 11/a, I-16145
Genoa - Italy
{baratta, diotalevi, valle, caviglia}@dibe.unige.it

Abstract. This paper deals with the training procedure for a
hierarchical neural network (Tree of Multi-Layer Perceptrons,
TMLP) aimed at classifying surface defects in flat rolled strips. Due to
the difficulties in collecting large data bases it is necessary to
exploit the available knowledge as well as possible. A comparison between
techniques derived from both the Back-Propagation and Weight-
Perturbation algorithms is carried out, and experimental results are
reported.

1 Introduction
Artificial Neural Networks (NNs) are an efficient solution for solving many real
world problems. At present there is a growing interest in applications like Optical
Character Recognition (OCR), remote-sensing image classification, industrial
quality control analysis and many others in which Neural Networks (NNs) can be
effectively employed [1]-[3].
Among learning algorithms, the most widespread are supervised and adopt the
error function gradient descent technique [4]: Back Propagation (BP), for example,
is one of the most widely used and reliable, but its implementation in analog VLSI
requires precise and complex circuitry [5]. The Weight Perturbation
(WP) algorithm, on the other hand, was originally developed to simplify the circuit
implementation [6] and, although it looks more attractive than BP for analog
VLSI implementation, its efficiency in solving real world problems has not yet been
thoroughly investigated.
Usually, in pattern recognition problems where the input data can be separated
into categories, classification trees are a popular approach (see, among others,
[7] and [8]). On the other hand, Neural Networks (NNs) also constitute a cost-
efficient solution [9]. Comparisons between classification trees and Multi Layer
Perceptrons (MLPs) show that MLPs feature similar or even better classification
and generalisation performances [10], [11]. To take advantage of both the NN and
classification tree approaches, some authors have tried to use NNs together with
classification trees (see, among others, [12], [13]).

In this perspective, we want to exploit the capabilities of classification trees and
MLPs, taking into account the inherent data structure of many industrial application
problems (e.g. quality analysis and control in the steel, textile, wood, etc.
industries). In fact, classification problems very often feature data which are
structured in a hierarchical way. Sample clusters can exhibit complex decision
boundaries.
The goal of our research activity is the application of hierarchical neural
networks (i.e. tree structured neural networks) to the supervised classification of
"hierarchically structured" input patterns.
In a previous paper [14], we proposed a neural architecture based on
classification trees with only two hierarchical levels, in which each node is a
single-hidden-layer, multiple-output MLP. We named such an architecture Tree of
MLPs (TMLP). The advantages of using TMLPs with respect to MLPs in real-
world problems are basically the following: 1) to reduce the training complexity of
large unstructured MLPs; 2) to increase the classification accuracy by exploiting
the "inherent data structure" of the problem; 3) to make it easier to determine the
network topology (e.g. the number of hidden neurons of each MLP, or in other
words, the overall number of weights); 4) to increase the reliability because of the
decrease of the overall network complexity.
This paper deals with the training procedure for a hierarchical neural network
(Tree of Multi-Layer Perceptrons - TMLP) aimed at classifying surface defects in
flat rolled strips. A comparison between techniques derived from both the Back-
Propagation and Weight-Perturbation algorithms is made, and experimental results
are reported.

2 The Application Problem


Surface defect recognition plays a significant role in the quality management of
steel makers. Surface defects are the most important quality metric for customers,
as they are visible and they determine the quality (and price) of the products. Better
information on defects may provide valuable direct feedback for process control, to
reduce the costs of quality (internal costs of scrap and rework) and to increase
manufacturing productivity and yield. Surface defects are numerous, and may be
organized in families such as inclusions, scratches, stains, etc. Fig. 1 shows a
sketch of defect classes and families for surface defect detection in flat rolled
mills. Defects belonging to different families may have similar visual
characteristics and, as such, are very difficult to classify. Current approaches
rely on human inspectors specifically trained for the job.

(Tree diagram of defect classes, labeled by letters, grouped into the families Scratches, Seams, Transverse, Stains, Dirtiness and Marks.)

Fig.1. A sketch of surface defect classes and families in flat rolled mills.

3 The TMLP Architecture


Since the application we are considering concerns the classification of samples
belonging to hierarchically organised classes, it seemed natural to use a hierarchical
network, reflecting the same hierarchy as the data to be classified. In principle, the
hierarchical classification scheme, like the associated TMLP network, could
span many levels. In this paper we shall limit ourselves to second-order trees (in
practice, it is difficult to find real-world problems with more than two levels),
although the proposed model can be easily extended to trees of any order.
In Fig. 2 the structure of the TMLP network for the mentioned problem is
shown: each node is a single-hidden-layer, multiple-output MLP. The MLP at the
first level (root MLP) classifies the input data with respect to superclasses (i.e.
defect families, e.g. seams, marks, stains, etc.) and selects only one MLP at the
second level (leaf MLPs). The selected MLP classifies the input sample, within the
chosen family, into classes (e.g. d1, d30, d92, etc.). Both the first and the second
level MLPs have the same inputs.
During classification, the behaviour of the TMLP network is the following:
• the root MLP identifies the family the input sample belongs to and activates the
corresponding leaf MLP;
• the second MLP (leaf MLP) is configured (in terms of hidden units, output
units and synaptic weights) using the information about the current family
supplied by the root MLP;
• the leaf MLP recognises the class of the input sample within the family.

Concerning training, each MLP is trained independently from the others (e.g.
using either the Back Propagation algorithm [15], [20] or the Weight Perturbation
algorithm [6]) on the corresponding data sub-set.
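
As a minimal illustration of this two-level scheme, the following Python sketch shows how a TMLP would route an input sample during classification. All names here are hypothetical placeholders; root_mlp and the leaf_mlps are assumed to be already trained callables returning one score per output class:

import numpy as np

def tmlp_classify(x, root_mlp, leaf_mlps, family_names, class_names):
    # Root MLP: one output per defect family (superclass); winner-take-all.
    family_idx = int(np.argmax(root_mlp(x)))
    # Only the leaf MLP of the winning family is evaluated; both levels
    # receive the same input vector, as described above.
    class_idx = int(np.argmax(leaf_mlps[family_idx](x)))
    return family_names[family_idx], class_names[family_idx][class_idx]

Note that only two MLPs are ever evaluated per sample, which is one source of the reduced overall complexity claimed above.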

(Input sample fed to the root MLP; the leaf MLPs output the defect classes d1, d10, d83, d90, d30, d66, d92, d16, d33, d62, d72, d73, d77, d79, d82, d60, d70, d26, d27, grouped under the families Scratches, Seams, Transverse, Stains, Dirtiness and Marks.)

Fig.2. Structure of the TMLP network



4 The Weight Perturbation Learning Algorithm with Local and Adaptive Learning Rate
Using gradient descent optimization techniques, the learning task is
accomplished by minimizing, with respect to the synaptic weight values, the
output error function ε [4]; the weight update learning rule is:

\Delta w_{ij} = -\eta \, \frac{\partial \varepsilon}{\partial w_{ij}}        (1)
where η is the learning rate, ε represents the output error function that must be
minimized, and w_ij is the synaptic weight that connects the i-th neuron to the j-th
neuron. The main computational issue is the computation of ∂ε/∂w_ij.
The WP algorithm estimates, rather than calculates, the gradient of the
output error function. This method is able to estimate the gradient simply through
its incremental ratio: if the weight perturbation p_ij^(n) is small enough, we can
neglect the higher-order terms and write:

\frac{\partial \varepsilon}{\partial w_{ij}} \simeq \frac{\varepsilon(w_{ij} + p_{ij}^{(n)}) - \varepsilon(w_{ij})}{p_{ij}^{(n)}}        (2)

so:

\Delta w_{ij} = -\eta \, \frac{\varepsilon(w_{ij} + p_{ij}^{(n)}) - \varepsilon(w_{ij})}{p_{ij}^{(n)}}        (3)

where p_ij^(n) is the perturbation injected into the synaptic weight w_ij at the n-th
iteration and Δw_ij is the value used to update the weight w_ij.
The difference between the output error function before and after the
perturbation of the generic weight w_ij is used to estimate the gradient with
respect to w_ij [6].
This is the simplest form of the WP algorithm; because only one synapse's
weight is perturbed at a time, we call this technique sequential [6].
For circuit implementation reasons, we consider every weight perturbation p_ij^(n) as
equal in magnitude but random in sign [16]:

p_{ij}^{(n)} = pert_{ij}^{(n)} \cdot step        (4)

where step is the magnitude of every perturbation of every synaptic weight w_ij, while
pert_ij^(n) can assume +1 or -1 with equal probability.
We can rewrite eq. (3) as follows:

\Delta w_{ij} = -\eta \, \frac{\varepsilon(w_{ij} + p_{ij}^{(n)}) - \varepsilon(w_{ij})}{pert_{ij}^{(n)} \cdot step} = -\frac{\eta}{step} \, \Delta\varepsilon \; pert_{ij}^{(n)}        (5)

We can absorb the term step into the η value, i.e.:

\Delta w_{ij} = -\eta' \, \Delta\varepsilon \; pert_{ij}^{(n)}, \qquad \eta' = \eta / step        (6)

pert_{ij}^{(n)} = \pm 1 \quad with equal probability        (7)

To update the synaptic weight w_ij, we only need to compute Δε and to know
pert_ij^(n).
With the term "learning strategies" we mean the way through which the
synaptic weights are updated [17]. The two main learning strategies are:
By pattern: with the by-pattern approach the pattern examples are sequentially
and usually randomly given in input to the network; the synaptic weight values are
updated at each example presentation following the direction of the negative
gradient of the output error function e.
By epoch: with the by-epoch approach, the synaptic weight values are updated
when all the pattern examples have been given in input to the network following the
direction of the negative gradient of the output error function e.
With respect to the by-epoch approach, the by-pattern presentation
procedure introduces some randomness into the learning process, which often
helps in escaping from local minima of the output error function ε. Moreover,
this technique is usually faster and more effective when the training set is composed
of thousands of pattern examples (e.g. in hand-written character recognition,
speech recognition, etc.). On the other hand, the by-epoch approach usually gives
better results when high precision is required (e.g. function approximation).
To accelerate the learning process we adopt an adaptive and local learning
rate strategy [18]: each synapse has its own local learning rate η_ij, and the value of
each η_ij is changed adaptively following the behaviour of the local gradient of the
error function (∂ε/∂w_ij).
More precisely, η_ij is increased when the sign of the term ∂ε/∂w_ij stays the
same during at least two successive iterations, and is decreased when the signs
during two consecutive iterations differ.
The final version of the WP learning algorithm that we adopted is:

for (each epoch)
{
    set each η_ij to η_min;
    set each pert_ij^(0) to a random value;
    for (each pattern of the training set)
    {
        choose a pattern at random and present it to the network;
        feed-forward phase;
        compute ε(w_ij);
        weight perturbation;
        feed-forward phase;
        compute ε(w_ij + step·pert_ij);
        compute Δw = -η'·[ε(w_ij + step·pert_ij) - ε(w_ij)]·pert_ij;
        adaptively update each η_ij;
        weight update;
    }
}
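
A compact NumPy rendering of one by-pattern step of this algorithm is sketched below; the perturbation magnitude step, the learning-rate bounds and the up/down adaptation factors are illustrative choices not fixed by the text, and forward_error stands for a feed-forward pass returning ε for the current pattern:

import numpy as np

def wp_step(W, eta, prev_grad, forward_error, step=0.01,
            eta_min=1e-4, eta_max=0.1, up=1.3, down=0.5):
    # Random perturbation signs, eq. (7).
    pert = np.random.choice([-1.0, 1.0], size=W.shape)
    e0 = forward_error(W)                    # epsilon(w_ij)
    e1 = forward_error(W + step * pert)      # epsilon(w_ij + step*pert_ij)
    grad_est = (e1 - e0) * pert              # delta-epsilon * pert_ij, eq. (5)
    # Local adaptive learning rates: grow eta_ij where the estimated
    # gradient keeps its sign over two iterations, shrink it otherwise.
    same_sign = np.sign(grad_est) == np.sign(prev_grad)
    eta = np.clip(np.where(same_sign, eta * up, eta * down), eta_min, eta_max)
    return W - eta * grad_est, eta, grad_est  # weight update, eq. (6)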

5 Simulation results
Though neural approaches are very effective in dealing with problems whose
specifications cannot be explicitly defined (e.g. using "rules" as in Expert
Systems), they nevertheless need a large database for the training task. The
database that we received from technicians working on the steel-industry plant is
relatively small, due to the high cost of collecting data (1725 patterns). Moreover,
the number of samples is not equally distributed among classes and families (see
Fig. 3).

The measurements were collected through the following system (Fig. 4):
• an on-line CCD camera;
• an on-line image acquisition/pre-processing system which performs
filtering and feature extraction tasks on images of the surface of the flat
rolled strip.

Fig.4. Schematic drawing of the real time quality control system



The Data Base (DB) was obtained with on-line and on-plant measurements; it
includes samples of steel-ribbon impurities and flaws. Each sample consists of 16
features, some representing the geometrical properties of the imperfections on the
strip, others providing information about illumination, width and thickness of the
strip, etc., and a code number which identifies the type of defect.
Most of the input features come from the on-line image acquisition/pre-
processing system. The complete list of features of the input samples is detailed
in Table 1.

As reported in Section 4, the WP algorithm best suited to an analog VLSI
implementation is the by-pattern one. The DB just presented has a very
different number of patterns for each class, and also for each superclass, so the
tests made directly on such a DB gave classification results too poor compared
with those obtained with by-epoch BP.
For this reason we have enlarged the original DB by increasing the number of
patterns with noise, to obtain a nearly homogeneous distribution of superclass
elements (see Fig. 3). Then the DB has been partitioned into three data sets with
the dimensions reported in Table 2 [19].

The use of the DB enlarged with noise allows a better comparison of the
classification performances of the TMLP trained with either the BP or the WP
learning algorithm. In Table 3, the most significant features of the two algorithms
are reported.
Fig. 5 and Fig. 6 report, respectively, the classification results for the
superclasses and for the defect classes. Fig. 7 reports the overall performances of
the TMLP with the two learning techniques.

Table 1. Complete list of the features of the input samples

Number of feature   Description
1    proximity to the strip edge;
2    number of pixels of the enclosing box not covered by the defect;
3    squaring of the enclosing box;
4    width of the defect;
5    principal direction of the defect;
6    contour smoothness of the defect along the x axis;
7    contour smoothness of the defect along the y axis;
8    direction of the defect;
9    logarithm of the defect area;
10   logarithm of the x side of the enclosing box;
11   logarithm of the y side of the enclosing box;
12   maximum luminance;
13   minimum luminance;
14   sum of all luminances;
15   width of the strip;
16   thickness of the strip.

Table 2. Number of patterns for the Training Set, Test Set and Validation Set

Data Sets        Number of patterns
Training set     3172
Validation set   1578
Test set         1573

Table 3. Most significant features of the BP and WP learning algorithms

                       BP                              WP
Weight update          by epoch                        by pattern
Learning rate update   Vogl's acceleration technique   local learning rate adaptation strategy
Stop criteria          - the number of epochs is       - when the minimum of the
                         greater than an upper limit     Validation Set error is reached
                       - when ABS(gradient) is less
                         than a given threshold
Activation function    hyperbolic tangent              hyperbolic tangent

Fig.5. Classification results for the superclasses

Fig.6. Classification results for the defect classes



Fig.7. Overall TMLP classification results.

6 Conclusions
It is worth noting that even if the available Data Base was of limited
dimensions and of poor quality with respect to example distribution, the use of the
TMLP architecture makes it possible to reach reasonable classification performances.
The comparison between the BP algorithm with Vogl's acceleration and the WP
algorithm is favorable to the former in this case. More extensive experiments with
larger Data Bases will be necessary to give a better insight into the problem.

7 References
1. B. E. Boser, E. Sackinger, J. Bromley, Y. Le Cun, and L. D. Jackel, "An
Analog Neural Network Processor with Programmable Topology," IEEE
Journal of Solid State Circuits, Vol. 26, No. 12, pp. 2017-2025, 1991.
2. D.Baratta, G.M. Bo, D.D. Caviglia, M. Valle, G. Canepa, R. Parenti and C.
Penna, "A Hardware Implementation of Hierarchical Neural Networks for Real
Time Quality Control Systems in Industrial Applications," In Proc. of the
International Conference on Artificial Neural Networks, ICANN'97, p.p. 1229-
1234, Lausanne, Switzerland, 1997.
3. G.M. Bo, D.D. Caviglia, and M. Valle, "An Analog VLSI Neural Architecture
for Handwritten Numeric Character Recognition," In Proc. of the International
Conference on Artificial Neural Networks, ICANN'95, Paris, France, 1995.
4. J. Hertz, A. Krogh, and R.G. Palmer, "Introduction to the Theory of Neural
Computation," Addison-Wesley Publishing Company, 1991.
5. M. Valle, D.D. Caviglia and G.M. Bisio, "An Experimental Analog VLSI
Neural Network with On-Chip Back-Propagation Learning," Journal of Analog
Integrated Circuits and Signal Processing, Kluwer Academic Publisher Vol. 9,
1996, pp. 25-40, Dordrecht (N).
6. M. Jabri and B. Flower, "Weight Perturbation: An Optimal Architecture and
Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer
Networks," IEEE Trans. Neural Networks, vol. 3 (1), pp. 154-157, 1992.

7. J.R. Quinlan, "Induction of decision trees", in Machine Learning, 1:81 - 106,


1986.
8. L. Breiman, J.H. Friedman, and C.J. Stone, "Classification and regression
trees", Wadsworth International Group, Belmont, CA, 1984.
9. R. P. Lippmann, "An introduction to computing with neural nets", IEEE ASSP
Magazine, pp. 4 - 22, April 1987.
10. C. Tsoi, R. A. Pearson, "Comparison of three classification techniques, CART,
C4.5 and Multi-Layer Perceptrons", in R. P. Lippmann, J. E. Moody, D. S.
Touretzky (Eds.), Ad. in Neural Information Processing System, Morgan
Kaufmann, Vol. 3, pp. 963-969, 1990.
11. L. Atlas, et al., "A performance comparison of trained multilayer perceptrons
and trained classification trees", Proc. of the IEEE, Vol. 78, No 10, pp. 1614 -
1619, Oct. 1990.
12. A. Sankar, R. Mammone, "Tree Structured Neural Networks", Technical Report
CAIP-TR 122, Rutgers University, 1990.
13. H. Guo, S. B. Gelfand, "Classification Trees with Neural Network Feature
Extraction", IEEE Transaction on Neural Networks, Vol 3, pp. 923-933,
November 1992.
14. D. D. Caviglia, M. Marchesi, M. Valle, V. Baiardo and D. Baratta, "A Digital
MLP Architecture for Real-Time Hierarchical Classification", Proceedings of
the 7th Italian Workshop on Neural Nets WIRN VIETRI-95, World Scientific
Publishing Co, pp. 183-188, 1995.
15. D. E. Rumelhart, G. E. Hinton, R. J. Williams, "Learning internal representations
by error propagation," in Parallel Distributed Processing, MIT Press, Vol. 1,
pp. 362-381, 1986.
16. J. Alspector and D. Lippe, "A Study of Parallel Perturbative Gradient Descent,"
in Advances in Neural Information Processing Systems, pp. 803-810, 1996
17. A. Cichocki and R. Unbehauen, "Neural Networks for Optimization and Signal
Processing," John Wiley & Sons, 1993.
18. G.M. Bo, D.D. Caviglia, H. Chiblé and M. Valle, "A Circuit Architecture for
Analog On-Chip Back Propagation Learning with Local Learning Rate
Adaptation," Accepted for publication in Analog Integrated Circuits and
Signal Processing, 1998.
19. Frequently Asked Questions about NNs: ftp://ftp.sas.com/pub/neural/FAQ.html.
20. T. P. Vogl, J. K. Mangis, W. T. Zink, D. L. Alkon, "Accelerating the convergence
of the Back Propagation method," Biological Cybernetics, Vol. 59, pp. 257-263,
1988.
Application of Neural Networks for Automated
X-Ray Image Inspection in Electronics
Manufacturing

Andreas König¹, Andreas Herenz², and Klaus Wolter³

University of Technology Dresden,
¹ Chair of Electronic Devices and Integrated Circuits,
² Center of Microtechnical Manufacturing (ZµP),
³ Chair of Procedure Technology of Electronics,
01062 Dresden, Germany

Abstract. Artificial neural networks have proven to be valuable tools in
industrial problems, e.g. for image processing and classification in visual
inspection tasks. Typically, today's successful systems have a heterogeneous
structure, applying small and specialised neural networks together
with classical and heuristic methods in a hybrid framework. This paper
reports on the practical application of such a system, developed in prior
work, which efficiently employs selected neural networks in an innovative
framework. In particular, the application in electronics manufacturing
with advanced sensor technology was the subject of investigation.

1 Introduction

Artificial neural network algorithms are currently applied as valuable tools and
system components in a variety of application domains, e.g. control, prediction,
optimisation, OCR, image and signal processing, computer vision, and pattern
recognition. Application examples are, e.g., medical imaging, cheque and credit
card slip reading, surveillance, identification and access control, image coding,
intelligent cruise control tasks, and visual and multisensorial quality control in
industrial manufacturing [4], on which this paper focuses.
However, since advanced cognitive models, e.g. selective attention mechanisms,
complex feature maps and feature binding schemes etc., are out of reach for the
majority of applications due to their inherent computational complexity, for
reasons of costs and hardware capability, successful state-of-the-art systems
typically employ hybrid structures incorporating modules from the disciplines of,
e.g., image processing, artificial intelligence, artificial neural networks, and
statistical pattern recognition. Further, to achieve industrial acceptance,
transparency, ease-of-use, rapid configuration, and robust classification have to
be provided by such inspection system development tools.
In previous work (cf. e.g. [4]) such a system has been developed, denoted as
QuickCog. QuickCog is both a development system and a run-time platform on PC
as well as on the industrial PC/104+ standard (MS-Windows 95, 98, NT). Instant
deployment of developed inspection systems on these platforms is feasible. The key

Fig. 1. QuickCog PC-system applied to electronics manufacturing tasks

features of QuickCog are visual programming, menu-driven sample set compilation,
preclassification, sample set oriented processing as well as a wide range of
proven image processing, pattern recognition, artificial neural network, and data
analysis methods. Fig. 1 shows the current QuickCog PC-system with typical
inspection applications related to electronics manufacturing. Key components, such
as the project explorer for method, workspace, and sample set management and
access, as well as several graphical result displays, including a special interactive
neuro-inspired feature map for multidimensional scaling and visualisation
of feature space, can be observed in the figure.
In this work, QuickCog will be applied and adapted to advanced problems
of electronics manufacturing. In the following, the chosen neural network algo-
rithms will be briefly presented. Then an application example of neural networks
in electronics manufacturing will be given.

2 Selected Neural Networks

From the large number of neural network algorithms, we have picked a few
according to their desirable properties with regard to ease-of-use, speed,
transparency and performance. Though backpropagation networks in the hands of an
experienced user can be very powerful tools, it is well known that the appropriate
definition of network topology and learning parameters is not an easy task, and
for each application or modification this burden is again imposed on the user.
Today's situation in industrial manufacturing does not leave any room for such
time-consuming processes. Thus, we focused on neural algorithms for powerful
nonparametric classification that autonomously find their topology in the learning
process, tailored to the problem, and have no critical learning parameters.
One convenient method is the well-known and proven LVQ3-method of Kohonen
[1], which adapts a set of reference vectors according to the following equations
for the two nearest neighbors w_i(t), w_l(t) of a pattern x_j (here ω_i, ω_l, ω_j
denote the classes of w_i(t), w_l(t) and x_j):

w_i(t+1) = w_i(t) - \alpha(t) \, [x_j - w_i(t)]   with \omega_i \neq \omega_j
w_l(t+1) = w_l(t) + \alpha(t) \, [x_j - w_l(t)]   with \omega_l = \omega_j        (1)

for vectors x_j that fall in the window (see [1]). For \omega_i = \omega_l = \omega_j:

w_k(t+1) = w_k(t) + \epsilon \, \alpha(t) \, [x_j - w_k(t)]   for k \in \{i, l\}.        (2)
However, LVQ3 needs an a priori definition and initialisation of the weight
vectors. In our approach, this is achieved by applying the iterative Reduced-
Nearest-Neighbor algorithm (RNN) of Gates [3], which finds a minimum number
of vectors
(Scatter plot; annotations: evaluation measure (separability), approximation of the class borders by Voronoi tessellation.)

Fig. 2. Bold vectors were selected by RNN-algorithm as reference vectors

from the original data set to achieve perfect resubstitution. These vectors
perfectly serve as an initialisation for LVQ3, which carries out a fine tuning step
to improve generalisation of the neural classifier¹. For instance, for Iris data,
4 errors occurred using 10 reference vectors of RNN; a fine tuning with LVQ3
improved the error rate to 2. Further, Radial-Basis-Function-type (RBF) neural
networks with similar salient properties are implemented in our system, e.g.
the Probabilistic-Neural-Network (PNN), closely related to the Parzen-Window-
Classifier, or an RBF-network with dynamic kernel selection by a hypersphere
classifier approach [2].

¹ In fact, the RNN initialisation in many simpler cases is already sufficient for reliable
classification.

All classifiers available in QuickCog provide an optional rejection mechanism for
improved system development.
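
A minimal sketch of one LVQ3 step implementing equations (1)-(2) follows; the values of alpha, eps and the window width are illustrative, and the reference vectors are assumed to have been initialised by RNN beforehand:

import numpy as np

def lvq3_step(x, y, W, labels, alpha=0.05, eps=0.2, window=0.3):
    d = np.linalg.norm(W - x, axis=1)
    i, l = np.argsort(d)[:2]                 # two nearest reference vectors
    # Window test: update only if x falls near the midplane of w_i and w_l.
    s = (1.0 - window) / (1.0 + window)
    if min(d[i] / d[l], d[l] / d[i]) < s:
        return W
    if labels[i] == y and labels[l] == y:    # eq. (2): both of correct class
        for k in (i, l):
            W[k] += eps * alpha * (x - W[k])
    elif y in (labels[i], labels[l]) and labels[i] != labels[l]:
        c, w = (i, l) if labels[i] == y else (l, i)
        W[c] += alpha * (x - W[c])           # eq. (1): pull correct reference
        W[w] -= alpha * (x - W[w])           # eq. (1): push wrong reference
    return W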

3 Application in Electronics Manufacturing

With the continuous significant increase of electronic component complexity,
automated inspection in manufacturing is of increasing importance. Though visual
inspection, based on camera and image processing systems of various sophistication,
covers a wide range of problems, improved packaging and mounting technologies
are no longer amenable to visual inspection. To cope with this challenge,
advanced multisensor data acquisition must be combined with powerful
information processing technology to achieve adequate inspection systems.
One example of such a demanding inspection task is given by an improved
package type for mounting on printed circuit boards (PCB). The package type,

Fig. 3. PCB area for BGA-mount (left), mounted BGA-package (right)

Fig. 4. Cross-section of BGA package (left); BGA solder-joint images generated by
X-ray microscopy (U=140 kV, I=25 µA) (right)

denoted as ball-grid-array (BGA), features an array of solder balls instead of
the pins met in other package types. The PCB area where the package shall be
mounted provides a corresponding array. After the placement and soldering steps,
the individual solder joints are no longer amenable to visual inspection. Instead,
X-ray microscopy is employed to get an image of the joints and make them
amenable to assessment and inspection. Fig. 4 shows part of such a BGA array
obtained by X-ray imaging.
As can be observed from Fig. 4, some of the joints have grayish speckles which
correspond to enclosed gas bubbles. Though ideal joints should have none of
these, and should just feature a homogeneous, dark coloured area in the X-ray image,
the presence of bubbles in this particular application does not necessarily imply a
defect for the inspected BGA package. The task of an appropriate algorithm for
an inspection system is now to provide one or several features with values
proportional to the size and number of these bubbles, which can be thresholded
according to user preference by corresponding rules. Alternatively, chosen examples
can be categorized or preclassified by personnel of the production line according to
their experience and context knowledge. For the latter option, a neural network
based inspection system has been elaborated employing QuickCog.

Individual pins of the BGA package are extracted as regions of interest (ROI)
and, for the supervised learning, a preclassification by the responsible production
personnel is specified. For the convergence of the neural network learning process

Fig. 5. Multiple ROI selection for creation of BGA sample sets



and the generalisation capability of the trained network, the appropriate collection
of training, validation, and test data sets is both a crucial and a tedious,
error-prone task. The same holds for the organisation and preclassification of
larger pattern numbers, and thus demands efficient and ergonomic support
by the development system. Thus, QuickCog offers an interactive tool, the sam-
ple set editor, tailored to this problem. Fig. 5 shows the results of QuickCog
interactive multiple ROI selection. The selected ROI are extracted and affiliated

Fig. 6. QuickCog sample set preclassification

to training and test sets. In the regarded case we tentatively defined three classes
for the problem, i.e. correct pins (Pin_ok), barely visible bubbles (Weak_Bubbles),
and prominently visible bubbles (Strong_Bubbles). This tentative class affiliation
can be swiftly changed according to context knowledge and constraints of the
regarded production line. Based on the extracted sample sets, a classification
system for training and test was designed. As a first step of treatment, all pin
images were subjected to histogram equalization. For ensuing preprocessing,
segmentation based on multi-level thresholding already provides useful results. This
is illustrated in Fig. 7, where only pixels with gray-values in [100;160] have a
corresponding output pixel set in the segmentation image. Evidently, the amount
of bubbles is proportional to the accumulated segmentation area. From the available
X-ray images, 48 characteristic pins were extracted and preclassified; 33 were
used for training and 15 for testing. Based on the segmentation result and an
ensuing masking operation, which eliminates the ring that can be observed in
Fig. 7, the bubble regions were isolated and the mean, standard deviation, and
gray-value histograms were computed, concatenated and normalized to serve as

Fig. 7. BGA joint free of (left) and with bubbles (right)

Fig. 8. Feature space visualisation for BGA inspection training set

feature vectors for ensuing classification. Fig. 8 gives an insight into the resulting
feature space, using a technique basically similar to SOFM [1] but more
convenient for industrial applications [4]. The resulting feature map is sensitive,
i.e. from each projection point in the map, the corresponding pin image can be
invoked by mouse click from the sample set database. This transparent property
considerably alleviates system design and analysis [4]. The WeightWatcher
supports interactive navigation in feature space, which is especially useful when
feature data with considerably varying density is met. Fig. 9 shows the confusion
matrices for test data using the RNN/LVQ3 neural classifier. The selection
principle of the RNN achieves correct resubstitution by default. Generalisation
only achieved an 86% recognition rate, but this becomes less worrisome if Fig. 9
is observed. The resulting errors, which are due to two misclassified patterns, are
a confusion between weak and strong bubbles. For the corresponding pin images,
a unique affiliation in preclassification was indeed hard to make. Good and
potential defect pins, however, have been correctly identified. With this application,
one of the cases has been met where LVQ fine-tuning did not accomplish
a better solution. With the PNN neural classifier the same recognition rate of
86% has been achieved.
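
The preprocessing and feature extraction chain described above can be sketched as follows; the gray-value band [100;160] comes from the text, while the circular ring mask and the histogram size are illustrative assumptions:

import numpy as np

def bubble_features(pin_img, lo=100, hi=160, mask_frac=0.8, bins=16):
    # Histogram equalization of the 8-bit pin image via its CDF.
    img = pin_img.astype(int)
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    cdf = hist.cumsum() / hist.sum()
    eq = cdf[img] * 255.0
    # Multi-level thresholding: keep only pixels in the bubble band.
    seg = (eq >= lo) & (eq <= hi)
    # Mask out the outer ring visible around each joint (cf. Fig. 7).
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    seg &= np.hypot(yy - h / 2, xx - w / 2) < mask_frac * min(h, w) / 2
    vals = eq[seg] if seg.any() else np.zeros(1)
    bub_hist, _ = np.histogram(vals, bins=bins, range=(0, 256))
    # Mean, standard deviation and histogram, concatenated and normalized.
    feat = np.concatenate(([vals.mean(), vals.std()], bub_hist.astype(float)))
    return feat / (np.linalg.norm(feat) + 1e-12)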

Fig. 9. Classification results for BGA X-ray inspection test set

4 Summary and Conclusions
In this paper, we applied an innovative inspection system development platform,
the also commercially available QuickCog PC-system [4], [5], for the first time
in the domain of automated quality control in electronics and microelectronics
production. Due to the nonparametric properties of the applied neural classifiers,
the proposed approach is viable, but we will investigate improved preprocessing
techniques and regard larger sample sets. Especially, image segmentation
based on recurrent neural networks could help to refine bubble segmentation at
the fringes. We will proceed with a similar approach to other relevant sensor
technologies, e.g. ultrasonic inspection for chip and smart card production (cf.
Fig. 1).
For future work, one focus of our research will be on the application of neural
networks with time-dependent processing for quality data analysis and predic-
tive diagnosis. An ambitious long term research objective is the realization of
a quality control loop to optimize production output. QuickCog will serve as
platform in this research and will be enhanced by the required methods and
modules.

References
1. Kohonen, T.: Self-Organization and Associative Memory. Springer Verlag Berlin
Heidelberg London Paris Tokyo Hong Kong, 1989.
2. Elbaum, C., Reilly, D.L., Cooper, L.N.: A Neural Model for Category Learning.
Biological Cybernetics 45 (1982) 35
3. Gates, G.W.: The Reduced Nearest Neighbour Rule. IEEE Transactions on Infor-
mation Theory, vol. IT-18, (1972), 431-433
4. König, A., Eberhardt, M., Wenzel, R.: A Transparent and Flexible Development En-
vironment for Rapid Design of Cognitive Systems. Proc. Int. EUROMICRO Conf.,
Workshop CI, Västerås, Sweden, (1998), 655-662
5. König, A., Eberhardt, M., Wenzel, R.: QuickCog - Cognitive Systems De-
sign Environment. QuickCog home page: http://www.iee.et.tu-dresden.de/~koeniga/QuickCog.html, (1999)
Forecasting Financial Time Series
through Intrinsic D i m e n s i o n Estimation
and Non-linear Data Projection

M. Verleysen¹, E. de Bodt², A. Lendasse³

¹ Université catholique de Louvain, CERTI,
3 pl. du Levant, 1348 Louvain-la-Neuve, Belgium
verleysen@dice.ucl.ac.be
² Université catholique de Louvain, IAG-FIN,
1 pl. des Doyens, 1348 Louvain-la-Neuve, Belgium
debodt@fin.ucl.ac.be
³ Université catholique de Louvain, CESAME-AUTO,
4 av. G. Lemaître, 1348 Louvain-la-Neuve, Belgium
lendasse@auto.ucl.ac.be

Abstract. A crucial problem in non-linear time series forecasting is to
determine the auto-regressive order of the series, in particular when the
prediction method is non-linear. We show in this paper that this problem is
related to the fractal dimension of the time series, and suggest using Curvilinear
Component Analysis (CCA) to project the data in a non-linear way onto a space
of adequately chosen dimension, before the prediction itself. The performances
of this method are illustrated on the SBF 250 index.

1. Introduction

Time series forecasting is a problem encountered in many industrial (electrical load,
river flow...) and economic (exchange rates, stock exchange...) tasks. Often,
prediction must be done without indication about the (unknown) underlying process;
input values to the prediction method must thus be chosen by trial and error. In some
situations, a priori information can be fed into the prediction method, but this remains
an exception: as an example, weekly and monthly past values are obviously good
candidates to predict the electrical load.

In most situations however, information about the underlying process is hardly
available. Forecasting with non-linear methods is then usually achieved through one
of the two following approaches:
• linear prediction models (for example ARX) are built; the best auto-regressive
order of the linear model is used for the non-linear prediction method too;
• non-linear prediction models only are used: many possible auto-regressive orders
are investigated and the best one is chosen by trial and error.

These methods are often the only possible ones, but they have serious drawbacks.
The first one over-estimates the necessary autoregressive order (because it does not
take account of non-linear dependencies between the data) and leads to overfitting.
The second one is very heavy to implement and often not very reliable; indeed, the
various trainings can be affected by errors caused by the prediction method itself,
such as the presence of local minima in the optimization of Multilayer Perceptrons.

The method suggested in this paper tries to overcome these disadvantages. It will be
presented in the second section and then applied to a simple artificial example. In the
fourth section, we will try to predict the successive fluctuations of the SBF 250 Stock
Market Index.

2. Forecasting method

2.1. Autoregressive order and vector

The non-linear autoregressive order can be defined as the optimal number of past
values to use in a time series for a good prediction. The autoregressive vector includes
these past values. Using a non-linear method to evaluate the autoregressive order makes
it possible to take into account the non-linear relations between past values of
the series; a traditional linear method to estimate the autoregressive order only takes
into account the correlation (linear dependence) between past values.

One can choose the autoregressive vector in two ways. The first one consists in
estimating the optimal autoregressive order n, and looking for the best n past values in
the series to use for the prediction [5, 9]. Another possibility is to look for an n-
dimensional vector built with non-linear mixings of the past values of the series,
instead of the raw values themselves.

In the following we will use the second possibility. We will first look for a way to
estimate the non-linear autoregressive order, and secondly we will build the auto-
regressive vector with a projection method.

2.2. Intrinsic dimension

In order to determine the non-linear autoregressive order, we will use the notion of
"intrinsic" dimension of a set of points. Without going into mathematical details, the
intrinsic dimension of a data set can be defined as the minimum number of coordinates
that would be necessary to describe the data without loss of information, if these
coordinates were measured on curved axes. For example, the intrinsic dimension of a

set of points forming a string in dimension 2 (or higher) is 1, and the intrinsic
dimension of a set of points forming a non-planar surface in dimension 3 (like the
well-known horseshoe distribution) is 2.

First we build an autoregressive vector of size m from the last past values of the raw
time series. This vector has to be sufficiently large to contain all the information
necessary for a good prediction. One possible solution is to take the optimal
autoregressive vector for an ARX model [5]; indeed this one is built in such a way that it
contains "sufficient" information when used with a linear prediction method, and will
thus obviously contain enough information when used with a non-linear prediction
method. Larger vectors can be taken for more security, but they would make the
subsequent work more difficult. An autoregressive vector is built at each time step;
these vectors are laid out as rows in a matrix called the autoregressive matrix, as
sketched below.
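
A minimal sketch of this construction (m is the initial order, e.g. the one chosen from the linear ARX model):

import numpy as np

def autoregressive_matrix(x, m):
    # Row t is [x_t, x_{t-1}, ..., x_{t-m+1}]; the prediction target
    # associated with row t is x_{t+1}.
    x = np.asarray(x)
    return np.column_stack([x[m - 1 - k: len(x) - 1 - k] for k in range(m)])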

Since it is supposed that there is an excess of information in the autoregressive
vectors, we will try to reduce their dimension. This goes through a first step which
consists in estimating an optimal reduced dimension, which will be identified with the
fractal dimension of the set of points (the autoregressive vectors) in an m-dimensional
space. This value will further be referred to as the fractal dimension of the
autoregressive matrix. It can be interpreted as the number of "non-linearly
independent" columns of this matrix: there is a non-linear transformation which makes it
possible to entirely rebuild the matrix from d columns.

To estimate the fractal dimension of the autoregressive matrix, we use the Grassberger
and Procaccia method [4]; many other methods can however be used to estimate a
fractal dimension [1, 6, 7]. It must be mentioned that the concept itself of non-linear
dependency is difficult to define. Therefore the fractal dimension found by these
methods can vary; in difficult situations, it may be worthwhile to use several methods
in order to assess their results. The intrinsic dimension can also be a non-integer
value; in the following, we will use the integer value nearest to the intrinsic dimension
as an approximation of the non-linear autoregressive vector size defined below.
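
A minimal sketch of the Grassberger and Procaccia estimator applied to the rows of the autoregressive matrix: the correlation integral C(r) is computed for a range of radii, and the slope of log C(r) versus log r approximates the correlation (fractal) dimension. The number of radii, and the use of the full radius range for the fit, are simplifying assumptions; in practice the slope should be taken over the scaling region only:

import numpy as np

def correlation_dimension(X, n_radii=20):
    # Pairwise distances between all points (suitable for small data sets).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = D[np.triu_indices(len(X), k=1)]
    radii = np.logspace(np.log10(d[d > 0].min()), np.log10(d.max()), n_radii)
    C = np.array([(d < r).mean() for r in radii])   # correlation integral
    keep = C > 0
    slope, _ = np.polyfit(np.log(radii[keep]), np.log(C[keep]), 1)
    return slope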

2.3 Non-linear autoregressive vector

The following step consists in building a non-linear autoregressive vector of size d
from each of the m-dimensional autoregressive vectors.

The set of points defined by the rows of the autoregressive matrix forms a d-surface in
an m-dimensional space. If we could unfold this d-surface by projecting the m-
dimensional space onto a d-dimensional one, keeping the topology of the initial set,
we would obtain a d-dimensional non-linear autoregressive matrix that could be used
for further prediction.

Many non-linear "projection" methods exist. Kohonen's self-organizing map is


probably the most widely known example. Yet in our experiments we will use
another method, the Curvilign Component Analysis (CCA) [3]; unlike the Kohonen
maps, this method doe not make any assumption on the shape of the projection space,
and was found to give better results in our application.

2.4 Non-linear forecasting

After this projection, we obtain the required non-linear autoregressive matrix. Its rows
will be used as input vectors to any non-linear forecasting method. We used in our
experiments the standard multi-layer perceptron (MLP) and radial-basis function
(RBF) networks as prediction cores.

Obviously, the prediction method could also use the initial m-dimensional
autoregressive vectors extracted from the raw series. Nevertheless, it must be
remembered that even if neural networks are known to be good candidates (compared to
other non-linear interpolators) when dealing with the curse of dimensionality, it
remains that, for a fixed number of training vectors, their performance decreases with
the dimension of their input vectors. The interest of our method is precisely here: we
expect that the little information lost in the non-linear projection will be largely
compensated by the gain of performance in the forecasting itself. This will be
illustrated in the examples below.

3. Artificial time series example

In order to test the above method, we built a chaotic artificial time series from the
following non-linear equation:
x_{t+1} = a \, x_{t-1}^2 + b \, x_{t-2}^2 + \varepsilon        (1)

Obviously, the non-linear autoregressive order of this time series is 2 (it is generated
from 2 past values). Let us note the lack of an x_t term, as well as the presence of a
noise ε (about 10% of the maximum value of the series).

This series is represented in Figure 1.
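
A generator for such a series might look as follows; the coefficients a and b, the initial conditions and the exact noise law are not given in the text, so the values below are placeholders only:

import numpy as np

def generate_series(n=2000, a=0.7, b=-0.8, noise=0.1, seed=0):
    # x_{t+1} = a*x_{t-1}^2 + b*x_{t-2}^2 + e, cf. equation (1);
    # a, b and the noise amplitude are illustrative choices only.
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[:3] = rng.uniform(-0.5, 0.5, 3)
    for t in range(2, n - 1):
        e = noise * rng.uniform(-1.0, 1.0)   # ~10% of the series amplitude
        x[t + 1] = a * x[t - 1] ** 2 + b * x[t - 2] ** 2 + e
    return x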



Fig. 1. Artificial time series generated according to equation (1).

The first step of our method consists in the search for the optimal autoregressive
matrix for a linear ARX prediction model.

Figure 2 shows the sum (over 1000 test points) of the quadratic errors obtained if one
uses a standard ARX model of increasing size; the x-coordinate of the figure is the
autoregressive order.

Fig. 2. Sum of quadratic errors (on 1000 test points) obtained with an ARX model for different
values of the autoregressive order.

To make sure the whole dynamics of the series is collected, we will build an initial
autoregressive matrix of order 6. The estimation of the fractal dimension of this matrix
gives 2.12, which is very close to reality.

The following step of the method is the projection of the set of points (rows of the
autoregressive matrix) from R^6 to R^2. Note that in the simulations we added the x_t term
to the two coordinates found by this projection, in order to improve the results. The
final autoregressive vector dimension is thus equal to 3.

In a next step we used this 3-dimensional autoregressive vector as input to a non-
linear prediction model. We used a Multi-Layer Perceptron with one hidden layer.
The sum of quadratic errors obtained with this MLP is around 5 (on 1000 points),
which is significantly lower than the errors illustrated in Figure 2 (linear model).

We also compared this result to the error obtained with a similar Multi-Layer
Perceptron, where the input vector is the set of p last values from the raw series.
Figure 3 shows this error for different values of p. The horizontal line corresponds to
the error obtained with our method; we conclude that we obtain (for this example) an
error similar to a result obtained by trial and error on several non-linear models, which
was the goal of our investigation. This ease of implementation will be valuable
when dealing with a "real-size" dataset for which the non-linear autoregressive order
is unknown.

Fig. 3. Sum of quadratic errors (on 1000 points) obtained with an MLP network for different
values of the autoregressive order. The horizontal line corresponds to the result of the proposed
method.

4. Application to the SBF250 Stock Market Index

An interesting example of a time series in the field of finance is the SBF 250¹ index.
The application of time series forecasting procedures to financial market data is a real

¹ The SBF 250 is one of the reference indexes of the French stock market. As suggested by its
name, it is based on a representative sample of 250 individual stocks.

challenge. The efficient market hypothesis (EMH) remains up to now the most
generally admitted one in the academic community, while essentially challenged by
practitioners. Under the EMH, one of the classical econometric tools used to model the
behavior of stock market prices is the geometric Brownian motion². If it does
represent the true generating process of stock returns, the best prediction that we can
obtain of the future value is the actual one. Results presented in this section must
therefore be analyzed with a lot of caution.

To succeed in determining the variations of the SBF250, other variables that may
influence its fluctuations are included as inputs (extrinsic variables). We selected
three international stock price indexes (S&P500, Topix and FTSE100,
respectively American, Japanese and English), two exchange rates (Dollar/Mark
and Dollar/Yen), and two American interest rates (T-Bills 3 months and US Treasury
Constant Maturity 10 years). We used daily data over 5 years (from 01/06/92 to
01/12/97), to have a significant data set.

The problem considered here is the forecasting of the SBF250 index at time t+1, from
available data at time t.

To capture the relations existing between the French (non-stationary) index and the
other chosen variables, a co-integration is necessary. The result of this co-integration
is the (stationary) residue of the SBF250 index, defined by the difference between the
true value SBF_{t+1} and the approximation \widehat{SBF}_{t+1} given by the model:

R_t = SBF_{t+1} - \widehat{SBF}_{t+1} = SBF_{t+1} - \Big( s_t + \sum_{i=1}^{7} P_{t,i} \, I_{t,i} \Big)        (2)

where I_{t,i} (1 ≤ i ≤ 7) are the 7 selected variables at time t.

In the following, we will focus on the forecast of these residues, or more exactly on
the forecast of the daily return of these residues. Indeed, it is more useful for somebody
eager to play on the market to forecast its fluctuations rather than its level. To predict
that the level of the SBF index tomorrow is close to the level today is trivial. On the
contrary, to determine if the market will rise or fall is much more complex and
interesting.

The daily return ρ_t of the residue R_t at time t is defined by:

\rho_t = \frac{R_t - R_{t-1}}{R_{t-1}}        (3)

² Stock prices would follow the diffusion process \frac{dS}{S} = \mu \, dt + \sigma \, dz, where dz = \epsilon \sqrt{dt} and \epsilon \sim N(0,1). S is the stock price, μ is the drift rate per unit of time and σ is the instantaneous volatility.

According to Refenes et al. [12], we will use technical indicators directly computed
from the residues:
• ρ_t, ρ_{t-10}, ρ_{t-20}, ρ_{t-30}: returns;
• ρ_t - ρ_{t-5}, ρ_{t-5} - ρ_{t-10}, ρ_{t-10} - ρ_{t-15}, ρ_{t-15} - ρ_{t-20}: differences of returns;
• K(20), K(40): oscillators;
• MM(10), MM(50): moving averages;
• MME(10), MME(50): exponential moving averages;
• ρ-MME(10), ρ-MME(50): return and moving average differences;
• MME(10)-MME(50): moving average difference.

If we carry out a Principal Component Analysis (PCA) on these 17 indicators, we note
that 99.72% of the original variance is kept with the first eleven principal components:
6 technical indicators can be removed without loss of information.
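
This selection step can be reproduced along the following lines (a sketch; the 17-column indicator matrix is assumed to be already built, one row per trading day):

import numpy as np

def pca_reduce(X, var_kept=0.9972):
    # Keep the smallest number of principal components retaining var_kept
    # of the variance (eleven components in the experiment reported here).
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, var_kept)) + 1
    return Xc @ Vt[:k].T, k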

The target variable, whose sign has to be predicted, is a forecast variable over 5 days:

C_t = \sum_{\delta=1}^{5} \rho_{t+\delta}        (4)

The time series of this variable is illustrated in Figure 4.

Fig. 4. Time series of the target variable according to equation (4).

This variable has to be predicted using the 11 indicators selected after PCA. The
interpolator we used is a Radial-Basis Function (RBF) network with the learning
algorithm presented in [13]. The network is trained with 1000 points and tested on 100
other points. We are interested in the sign of the prediction only, which is compared
to the real sign of the target variable.

The best results we obtained are 60.2% correct approximations of the sign of the series
on the training set, and 48% on the test set. This result is obviously bad: it is worse
than a pure random guess on the test set!

On the other hand, if we use the proposed method and estimate the fractal dimension
of the data set, we obtain an approximate value of 5. We then use the CCA method to
project the 11-dimensional data (after PCA) onto a 5-dimensional space. Thereafter, we
use another RBF network to approximate the variable to predict. We obtain 61%
correct sign predictions on the training set and 57% on the test set. This result seems to
be significantly better than the result that we could get by using a purely naive
approach (for example, by always predicting a + sign). A lot of simulation work
remains however to be done to validate it (for example, by constructing a bootstrap
estimator).

Still better results were obtained using an MLP instead of an RBF network (more than
62% correct sign predictions on the validation set). Unfortunately, the results obtained
with an MLP are difficult to reproduce for various initial conditions, convergence
parameters, etc. We prefer to restrict our performances to those obtained with an RBF
network, because they are much less parameter-dependent.

5. Conclusion

The proposed method for the determination of the best autoregressive vector gives
satisfactory results on a financial series. Indeed, the quality of the prediction obtained
is comparable to the quality obtained with other methods (slightly higher on a
real-world financial time series, and equivalent on an artificial data set). The
advantage of our method mainly comes from the systematization of the procedure:
there is no need for many trials and errors for the determination of the variables to use
at the input of the predictor and of its parameters. Moreover, the determination of the
autoregressive vector is completely independent from the prediction method.
Improvements of the proposed method could be sought in alternative ways to
estimate the fractal dimension of the series or to project the data in a non-linear way.

The question of the predictability of a series such as the SBF250 index remains. The
results presented in this paper are promising, but could certainly be improved. We
must also keep in mind that predicting a complex, mostly stochastic time series such
as the SBF250 must be done with several prediction methods, in order to cross-validate
their results. It must also be noted that the simple fact of being able to forecast, at a
certain level of confidence, a financial time series is not in itself sufficient to
invalidate the EMH. The problem is to see if it is possible to exploit the prediction
algorithm to obtain abnormal returns, that is to say returns that take into account the
level of the risk generated by the trading strategy as well as the associated transaction
costs.

References

1. Alligood K. T., Sauer T. D., Yorke J. A.: Chaos: An Introduction to Dynamical Systems.
Springer Verlag, New York (1997), pp. 537-556
2. Box G.E.P., Jenkins G.: Time Series analysis: Forecasting and Control. Cambridge
University Press (1976)
3. Demartines P., Hérault J.: Curvilinear Component Analysis: A self-organizing neural
network for nonlinear mapping of data sets. IEEE Trans. on Neural Networks 8(1) (1997)
148-154
4. Grassberger P., Procaccia I.: Measuring the Strangeness of Strange Attractors. Physica D 9
(1983) 189-208
5. Ljung L.: System Identification - Theory for the User. Prentice-Hall (1987)
6. Takens F.: On the numerical determination of the dimension of an attractor. In: Lecture
Notes in Mathematics Vol. 1125, Springer-Verlag (1985) 99-106
7. Theiler J.: Statistical Precision of Dimension Estimators. Phys. Rev. A41 (1990) 3038-3051
8. Weigend A. S., Gershenfeld N.A.: Time Series Prediction: Forecasting the Future and
Understanding the Past. Addison-Wesley Publishing Company (1994)
9. Xiangdong He, Haruhiko Asada: A New Method for Identifying Orders of Input-Output
Models for Nonlinear Dynamic Systems. In: Proc. of the American Control Conf., San
Francisco (CA) (1993) 2520-2523
10. Burgess A.N.: Non-linear Model Identification and Statistical Significance Tests and their
Application to Financial Modelling. In: Artificial Neural Networks, Inst. Elect. Eng. Conf.,
June (1995)
11. Fama E.: Efficient Capital Markets: A Review of Theory and Empirical Work. Journal of
Finance XXV No 2 (1970) 383-417
12. Refenes A. N., Burgess A.N. and Bentz Y.: Neural Networks in Financial Engineering: A
Study in Methodology. IEEE Transactions on Neural Networks 8(6) (1997) 1222-1267
13. Verleysen M., Hlaváčková K.: An Optimized RBF Network for Approximation of
Functions. In: Proc. of European Symposium on Artificial Neural Networks, Brussels
(Belgium), April 1994, D facto publications (Brussels).
Parametric Characterization of Hardness Profiles
of Steels with Neuro-Wavelet Networks

V. Colla¹, L.M. Reyneri², and M. Sgarbi³


¹ Scuola Superiore Sant'Anna, Pisa (I), e-mail: vale@mousebuster.sssup.it
² Politecnico di Torino (I), e-mail: reyneri@polito.it
³ Scuola Superiore Sant'Anna, Pisa (I), e-mail: mirkosg@tin.it

Abstract. This work addresses the problem of extracting the Jominy hardness profiles of steels
directly from the chemical composition. Wavelet and Neural networks provide very interesting
results, especially when compared with classical methods. A hierarchical architecture is proposed,
with a first network used as a parametric modeler of the Jominy profile, and a second
one estimating its parameters from the steel chemical composition. Suitable data preprocessing
helps to reduce the network size.

1 Introduction
Hardenability is a basic feature of steels: in order to characterize it, manufacturers
usually perform the so-called Jominy end-quench test [1], which consists in measuring
the hardness along a specimen of a heat-treated steel at predefined positions; the
measured values form the Jominy hardness profile.
Hardenability depends on the chemical composition in a partially unknown fashion;
therefore, black-box models have been developed to predict the shape of Jominy profiles
directly from the chemical analysis. Most of them are linear, but this affects accuracy,
especially when a wide variety of steels is considered.
Neural Networks (NNs) seem to cope well with such a modeling problem, as they
are good approximators of strongly non-linear functions. An attempt to apply NNs
to predict Jominy profiles has been made in [2] by using a standard Multi-Layer
Perceptron (MLP) with one hidden layer, but there is no reported attempt to use
Wavelet Networks (WNs) for the same task.
Unfortunately, most methods based on NNs alone suffer from several caveats.
For instance, their initialization and training require a large amount of data, which
are seldom easily and rapidly available. In addition, simple NNs may often predict
profiles which are not physically plausible, unless very complex networks are used and
long training processes are employed. It is therefore mandatory to accurately select
the network structure, in order to obtain good performance, to reduce as much as
possible the number of free parameters, and consequently to reduce the required size
of the training set.
Another drawback of NNs alone is that no information related to the physical
characteristics of the steel can be extracted from the trained network; this means that
NNs can only be used to predict the profiles themselves, but not any other steel
characteristic.
This paper presents some more powerful methods based on two combined Neuro-
Wavelet Networks (NWNs), where one network provides a parametric model of the
Jominy profile, while the second one predicts the parameters as a function of the
chemical composition. The extracted parameters have a strong relationship with the
Jominy profile, of which they are a compact representation.

2 Neuro-Wavelet Unification

Radial Wavelet Networks are based on Wavelet decomposition and use radial Mother
Wavelets Φ(‖X‖) ∈ L²(ℝᴺ), suitably dilated and translated. Such networks are based
on Radial Wavelons (WAVs), which have a model based on the Euclidean distance
between the input vector X and a translation vector E, where each distance component
is weighted by a component of a dilation vector T:

y = \Phi\left( \sqrt{ \sum_j \left( \frac{X_j - E_j}{T_j} \right)^2 } \right)        (1)

A function Φ(·) is admissible as a radial Wavelet only if its Fourier transform satisfies
a few constraints not discussed here [5]. A commonly used function is the Mexican
hat Φ(x) = (1 - 2x²) · e^{-x²}.
Radial Wavelet Networks, as well as many other neural and fuzzy paradigms, can
be viewed in a unified perspective by means of the Weighted Radial Basis Functions
(WRBF) [4].
Each layer (array) of WRBF neurons is associated with a set of parameters: an
order n ∈ ℝ, defining the neuron's metric (mostly n ∈ {0, 1, 2}), a weight matrix W, a
center matrix C, a bias vector θ and an activation function F(z). The mathematical
model of a WRBF neuron of order n (or WRBF-n) is:

y_i = F\left( \sum_j W_{ji} \, \mathcal{D}_n(X_j - C_{ji}) + \theta_i \right)        (2)

where F(z) can be any function (although in most cases monotonic functions or
Wavelets or linear or polynomial functions are used) and the distance function \mathcal{D}_n(\cdot)
is defined as:

\mathcal{D}_n(X_j - C_{ji}) \triangleq \begin{cases} (X_j - C_{ji}) & \text{for } n = 0 \\ |X_j - C_{ji}|^n & \text{for } n \neq 0 \end{cases}        (3)

All the NWN paradigms used here have been recast as WRBF networks in order
to have common paradigms, methodologies, initialization strategies and learning rules,
which is the main advantage of unification. Radial Wavelons are WRBF-2 neurons
with W_ji = (1/T_ji)² and C = E (i.e. the matrix made of one translation vector
E per neuron), while the activation function comes from the radial Mother Wavelet:
F(z) = Ψ(√z). Details on the unification of other neural paradigms can be found
in [7, 4].
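As a minimal illustration of this unification (names, shapes and the NumPy formulation
are our own assumptions, not taken from [4]), the following sketch implements a layer
of WRBF-n neurons following eqs. (2)-(3) and recovers the Radial Wavelon of eq. (1)
as its WRBF-2 special case:

    import numpy as np

    def mexican_hat(z):
        # Radial Mexican hat quoted above: Psi(z) = (1 - 2*z**2) * exp(-z**2)
        return (1.0 - 2.0 * z**2) * np.exp(-z**2)

    def wrbf_layer(x, W, C, theta, n, F):
        # Layer of WRBF-n neurons, eq. (2): z_i = F(sum_j W_ji*D_n(x_j - C_ji) + theta_i)
        # x: (J,) input; W, C: (J, I) weight and center matrices; theta: (I,) biases.
        d = x[:, None] - C                      # componentwise distances, shape (J, I)
        Dn = d if n == 0 else np.abs(d) ** n    # distance function D_n, eq. (3)
        return F((W * Dn).sum(axis=0) + theta)

    def wavelon_layer(x, T, E):
        # Radial Wavelons as WRBF-2 neurons: W_ji = (1/T_ji)**2, C = E and
        # F(z) = Psi(sqrt(z)), so the layer computes Psi(||(x - E)/T||), eq. (1).
        W = (1.0 / T) ** 2
        return wrbf_layer(x, W, E, np.zeros(T.shape[1]), 2,
                          lambda z: mexican_hat(np.sqrt(z)))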
As far as initialization of NWNs is concerned, in this work we have used three
forms of initialization:

- Fixed initialization: all weights, biases and centers are initialized to a predefined
value (or set of values). This has been used for all the networks of the parametric
model described in section 4.
- Random initialization: the parameters are initialized to random values (uniform
distribution). This has been used for the WRBF-0 networks of the parameter
estimator described in section 5.

A." ; '~ ~ ~.,,.~-~ a : : . B . ' ' ' ,~...~,- . . . . . . C . - ' ~

F i g . 1. A) A few e x a m p l e s of J o m i n y profiles. B) E i g e n v a l u e s of t h e i n p u t d a t a c o v a r i a n c e m a t r i x , C) B l o c k
d i a g r a m of t h e p a r a m e t r i c n e u r o - w a v e l e t e s t i m a t o r .

- Orthogonal Least Squares algorithm (OLS): a fast initialization method already
applied to RBF networks [10] and Wavelet Networks [5]. It chooses, among all the
basis functions (i.e. the dilated and translated activation functions), those which
give the greatest contribution to the approximation. This has been used for the
WRBF-2 networks of the parameter estimator described in section 5.
For the training, the basic generalized learning rule based on gradient descent,
developed for WRBF-n networks [4], has been applied to all the NWNs used for this work.
This is a very simple training algorithm, with fixed learning rate and momentum.
As a performance index for all NWNs, we adopt the Normalized Square Root Mean
Square Error (NSRMSE) [11] in a version suitable for multi-output networks, defined
as:

    NSRMSE = √( Σ_{p=1..M} Σ_{j=1..N} (Y_j^p − Ŷ_j^p)² / Σ_{p=1..M} Σ_{j=1..N} (Y_j^p − Ȳ_j)² )    (4)

M and N are, respectively, the number of samples in the training (or validation) set
and the number of network outputs; Y_j^p is the j-th component of the p-th target vector
Y^p in the training (or validation) set, Ŷ_j^p is the corresponding network estimate,
and Ȳ_j = (1/M) Σ_p Y_j^p is the average of the j-th target component.
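Since the printed formula is only partly legible, the following NumPy helper shows our
reconstruction of such an index (total squared error normalized by the spread of the
targets, in the spirit of the normalization used for wavelet networks in [11]); Y and
Y_hat are (M, N) arrays of targets and network estimates:

    import numpy as np

    def nsrmse(Y, Y_hat):
        # Normalized Square Root Mean Square Error for multi-output networks, eq. (4).
        Y, Y_hat = np.asarray(Y, float), np.asarray(Y_hat, float)
        num = np.sum((Y - Y_hat) ** 2)            # total squared estimation error
        den = np.sum((Y - Y.mean(axis=0)) ** 2)   # spread of the targets around their mean
        return np.sqrt(num / den)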

3 Preliminary Analysis of the Jominy Profiles Data

In the Jominy end-quench test, a cylindrical specimen of small dimensions (diameter
of 1 in. and length of 3 or 4 in.) is firstly maintained at austenitizing temperature for
about 30 min. Afterwards, one end of the specimen is cooled by quenching it for at
least 10 min. in a water stream with a temperature of 5 to 30 °C; the other specimen
end is cooled in air. This treatment causes a cooling rate gradient to develop over the
length of the specimen, with the highest cooling rate obviously corresponding to the
quenched end. This affects the steel micro-structure along the length of the specimen.
A curve is built by measuring the specimen hardness on the Rockwell C scale at
increasing distance from the quenched end. This curve is the Jominy hardness profile,
which characterizes steel hardenability.
We have real industrial data for three particular qualities of steel, named, re-
spectively, A, B and C (see some examples in Figure 1.A). The chemical analysis
associated with each Jominy profile indicates the content of several micro-alloying
elements present in the steel.
All these steels have a relatively high content of Boron (greater than 10 ppm),
such that, from a metallurgical point of view, they can be considered as "Boron
steels". The influence of Boron on hardenability is known to be important, but a deep
comprehension of the Boron effects on steel properties has not yet been achieved [3].
Most models obtained so far to predict the Jominy profiles fail particularly in the case
of Boron steels; thus the considered application is of great interest for the iron and
steel industry.
Throughout this work, we consider as input variables the content of 17 chemical
components: C, Mn, Si, P, S, Cr, Ni, V, Mo, Cu, Sn, Al, Ti, B, N, and soluble Al and
B. We will call:

- Q ∈ ℝ¹⁷ the vector of chemical composition;
- N = {Q_1/Q_1,max, ..., Q_17/Q_17,max} ∈ ℝ¹⁷ the vector of normalized chemical
composition, where Q_i,max is the maximum absolute value of Q_i over the whole training
set; normalization avoids the problems that arise in NWNs when input variables
have different physical dimensions;
- J(x) ∈ ℝ the Jominy hardness profile as a function of distance x from the quenched
end;
- J ∈ ℝ¹⁵ the Jominy vector containing the values of J(x) at 15 (sometimes 18 or
19) predefined positions (often 1.5, 3, 5 mm, etc.).

To reduce the size of the NWNs, we tried to reduce as much as possible the number
of input variables to the network (without losing significant information) by applying
Principal Component Analysis [8] to the vectors N of the training set.
Figure 1.B plots the computed eigenvalues in decreasing order. The eigenvectors
associated with the largest eigenvalues span a subspace containing most of the
information of the training set. The 6 largest eigenvalues have been retained, as a good
compromise between complexity and performance.
The projection of the input data onto the subspace spanned by the corresponding 6
eigenvectors (properly normalized) maintains most of the original information and
constitutes a new input vector V ∈ ℝ⁶ to be fed into the network. This vector is
obtained as V = N·M, where M is a matrix containing as columns the 6 principal
eigenvectors.
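A minimal sketch of this preprocessing step (function and variable names are ours; the
rows of N_train are assumed to be the normalized composition vectors of the training
set):

    import numpy as np

    def pca_projection_matrix(N_train, k=6):
        # Eigen-decompose the covariance of the normalized inputs and keep the
        # k eigenvectors with the largest eigenvalues (k = 6 in the paper).
        cov = np.cov(N_train, rowvar=False)        # (17, 17) covariance matrix
        eigval, eigvec = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
        order = np.argsort(eigval)[::-1]
        return eigvec[:, order[:k]]                # matrix M, one eigenvector per column

    # New 6-dimensional network inputs, obtained as V = N . M:
    # V = N_train @ pca_projection_matrix(N_train, k=6)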

4 Parametric Estimation of Jominy Profiles

The aim of our work was to determine which NWN performs better in modeling the
Jominy hardness profiles. At first, we drew some preliminary considerations:
1. In traditional approaches [3, 2], the number of network outputs equals the number
of measured points of the Jominy profiles, namely 15. But from Figure 1.A it
can be observed that Jominy profiles are relatively slowly varying, especially in
the initial and the final parts. There is usually little difference between two
neighboring points, and thus the information conveyed by these values is somehow
redundant. The statistical correlation of adjacent elements of J approaches unity
(its average value is 0.93), as does, consequently, the correlation between weights
of adjacent neurons.
2. In traditional approaches, several approximation errors can produce estimates of
the Jominy profiles which are physically not plausible (for instance, small local
increases instead of a continuous decrease of the hardness along the specimen).

3. The 15 positions where the hardness is measured are not evenly distributed and
often differ among manufacturers, therefore Jominy profiles cannot always
be compared directly. In addition, even the number of points can often be varied
(for instance, up to 18 or 19 points can be measured).
4. Hardness measurement is often affected by large errors, therefore the Jominy vector
J is usually affected by relatively large quantities of noise.

We tried a completely different approach to the problem of Jominy profile estimation
and classification. The proposed system (Figure 1.C) is composed of 3 interacting
blocks: the small NWN, network A, is a parametric model of the Jominy profile,
considered as a function J(x) of the distance x from the quenched end. J is a vector
of samples of J(x) at predefined positions. The set of free parameters of network A
(weights, centers and biases) constitutes a vector P of parameters of the model, which
uniquely identifies an estimate Ĵ(x) of the Jominy profile J(x) and, consequently, of
J. The vector P can be evaluated by simply training the NWN with any learning
rule. We used standard backpropagation for 1,000 to 5,000 epochs. The larger NWN,
network B, is used as a parameter estimator which predicts the parameter vector
P (instead of the profile itself) as a function of chemical composition Q (after
dimensional reduction, through vector V). In order to reduce the approximation error,
we introduce a further block named a-posteriori model corrector, described in
section 4.1. This approach has the following advantages:

1. The size of the parameter vector P is smaller than that of J (see Table 1), thus
network B is smaller than a network predicting J would be, for comparable
accuracy. As a consequence, a smaller training set will be enough and, at the same
time, a considerable saving in computational time and memory can be achieved
during both training and relaxation (namely, during nominal operation).
2. If network A is properly chosen, P is less sensitive to measurement noise than J
(see section 4.1), therefore steel characterization will be more robust.
3. P can be computed also when some measurements of J are missing.
4. P is almost independent of the number and position of hardness measurements.
5. By properly selecting network A, P can be made representative of the physical
process, therefore it can also be used to classify (more robustly) steel quality.
6. Network A can also be used to reduce the effects of measurement noise.
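As a concrete, hypothetical illustration of the parametric model, the MLP-1 version of
network A discussed in section 4.1 (one hidden tanh neuron plus a linear output, i.e.
4 free parameters) can be fitted to one measured profile; the fitted parameter vector is
then the compact representation P that network B learns to predict. The paper trains
network A by backpropagation; the sketch below uses SciPy least-squares fitting as a
stand-in, and the initial guess p0 is our own assumption:

    import numpy as np
    from scipy.optimize import curve_fit

    def mlp1(x, w1, b1, w2, b2):
        # Network A of type MLP-1: one hidden tanh neuron and a linear output.
        return w2 * np.tanh(w1 * x + b1) + b2

    def profile_to_params(x_pos, J):
        # Fit network A to one measured Jominy profile J sampled at positions x_pos;
        # the fitted parameters form the vector P representing that profile.
        P, _ = curve_fit(mlp1, x_pos, J, p0=[-0.1, 1.0, 20.0, 40.0], maxfev=10000)
        return P

Network B is then any small regressor trained on the pairs (V, P), so that a new
chemical analysis yields an estimate P_hat = B(V) and, from it, the profile estimate
mlp1(x, *P_hat).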

4.1 Choice of the Parametric Modeler

The choice of the best NWN for network A (parametric model) is by itself not a simple
problem, due to the need of reducing as much as possible the number of tunable
parameters while maintaining a good estimation and classification accuracy. We have
tested a set of very small two-layer WRBF networks, with one or two hidden neurons,
different activation functions in the hidden layer, and a linear output layer (see
Table 1). Such networks are very easy to train using generalized backpropagation [4].
Jominy curves (Figure 1.A) are monotonically decreasing; if one wishes to
approximate them by means of a WRBF-2 network with either exponential or Mexican
hat activation function (WAV-x, RBF-x), the center vector in the first layer could be
fixed to 0 and need not be trained. Moreover, all WRBF-0 (MLP-x) have a null center
vector, as bias and center are somehow redundant in WRBF-0 networks.

type  | 1st layer        | M | 2nd layer      | # of P | in mm: e_nc(%) e_co(%) | in pts: e_nc(%) e_co(%)
MLP-1 | WRBF-0, hyp.tg.  | 1 | WRBF-0, linear | 4      | 10.16   4.89           | 6.27   3.86
MLP-2 | WRBF-0, hyp.tg.  | 2 | WRBF-0, linear | 7      |  6.76   3.90           | 4.33   3.39
LIN   | WRBF-0, hyp.+lin | 2 | WRBF-0, linear | 5      |  9.37   4.85           | 6.14   3.89
WAV-1 | WRBF-2, Mex.hat  | 1 | WRBF-0, linear | 3      | 16.40   8.53           | 6.79   7.74
WAV-2 | WRBF-2, Mex.hat  | 2 | WRBF-0, linear | 5      | 16.08   8.14           | 7.43   7.29
RBF-1 | WRBF-2, Gauss.†  | 1 | WRBF-0, linear | 3      | 11.61   6.11           | 7.29   7.32
RBF-2 | WRBF-2, Gauss.†  | 2 | WRBF-0, linear | 5      |  9.29   4.97           | 8.95   5.62

Table 1. NSRMSE of network A: columns "in pts" and "in mm" refer to the estimation of the
Jominy profile as J_i = f(i) and J(x), respectively. † F(z) is exponential, as the square is
provided by n = 2. M is the number of neurons.

The LIN network in Table 1 takes into account the slight linear trend superimposed
on the nearly-sigmoidal shape of the Jominy profiles; it is similar to an MLP-2
network, but the hidden layer is composed of a linear neuron and a neuron with
hyperbolic tangent F(z). This network has 5 parameters, as the linear activation function
of the second hidden neuron allows merging 2 weights and biases.
Table 1 (column e_nc "in mm") compares the different models in terms of NSRMSE.
The values given are an average over the whole training plus validation sets (800
different specimens). There is no need to distinguish between training and validation
sets, as each profile is trained independently of the others.
We observed that the estimation error of Jominy profiles predicted by each network
A has a non-null average, which varies with the distance x. We therefore subtract
this average modelization error (in tabular form) from the output of network A
(a-posteriori correction), as shown in Figure 1.C. This has reduced the modelization error
roughly by a factor of 2, as shown in Table 1 (column e_co "in mm").
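In code, the correction reduces to one tabulated value per measured position; a small
sketch (array shapes are our assumption: one row per specimen, one column per
position) follows:

    import numpy as np

    def average_model_error(J_measured, J_estimated):
        # Average modelization error at each position, kept in tabular form.
        return np.mean(np.asarray(J_estimated) - np.asarray(J_measured), axis=0)

    def corrected_estimate(J_estimate, error_table):
        # A-posteriori correction: subtract the tabulated average error from
        # the raw output of network A.
        return J_estimate - error_table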
In both cases (with or without correction) the MLP-x, the LIN and the RBF-2
networks clearly outperform the other networks, thanks to the particular, nearly
sigmoidal shape of the Jominy profile. This similarity between a sigmoidal function and
the Jominy curves is further enhanced by approximating J_i = f(i) instead of J(x)
(namely, the vector elements as a function of their index i ∈ [1, 15] instead of their
distance x). The results of such approximation are listed in Table 1 under the columns
"in pts".
Now, we can choose one of the seven types of network A according to the following
criteria:

1. To have the smallest approximation error. Networks MLP-1, MLP-2, LIN and
RBF-2 are the best in this respect.
2. To have as few parameters as possible. This reduces both the training time for
network A and the size of network B. Networks WAV-1, RBF-1 and MLP-1 are the
best in this respect.
3. To have a set of parameters which is as representative as possible of the physical
process of hardening. The degree of representativity has been assessed by
analyzing the correlation between pairs of Jominy vectors J and the corresponding
parameter vectors P. Very representative models should have a roughly linear
relationship between ΔJ = ||J¹ − J²|| and ΔP = ||P¹ − P²||, where J¹ and
J² are two Jominy vectors (taken randomly from the data sets), while P¹ and
P² are the corresponding parameter vectors. Figure 2.A plots ΔP versus ΔJ for

Fig. 2. A) Parameter vector distance as a function of Jominy vector distance, for networks A,
for 1,800 random pairs of specimens. B) σ_Ĵ (plain line) and σ_P (dotted line) versus σ_J (on
normalized axes). C) Comparison of estimated Jominy profiles.

several pairs of specimens ("in pts"; similar plots have been obtained for estimation
"in mm") and for the 4 neural paradigms which provide the best results (WAV-2,
RBF-2 and MLP-2 perform poorly, thus their graphs are not reported). The closer
the points are to the main diagonal, the more representative is the model and the
easier it will be to train network B. MLP-1 and LIN networks are the best in this
respect.
4. To have a set of parameters which provides the smallest noise sensitivity, namely
the smallest sensitivity of the model to the noise affecting hardness measurements.
Noise sensitivity is assessed by means of simulations, as described below.
Consider a measured Jominy vector J^p. This is associated with a parameter vector
P^p by training. A Jominy vector estimate Ĵ^p is obtained from network A with
the parameters P^p. By definition, Ĵ^p is the best estimate of J^p compatible with
the given model. Namely:

    J^p  --training-->  P^p  --model evaluation-->  Ĵ^p    (5)

When some noise ΔJ^p ∈ ℝ¹⁵ is added to the original profile, the associated
parameter vector estimated via training is corrupted by an error ΔP^p, which increases
the error on the reconstructed profile:

    J^p + ΔJ^p  --training-->  P^p + ΔP^p  --model evaluation-->  Ĵ'^p = Ĵ^p + ΔĴ^p    (6)

The noise standard deviations are related to each other; σ_Ĵ = √E(||ΔĴ^p||²) and
σ_P = √E(||ΔP^p||²) increase almost linearly with σ_J = √E(||ΔJ^p||²), and σ_Ĵ
is smaller than σ_J.
The noise sensitivity of the parametric model is defined as the average slope of the
curve σ_Ĵ = f(σ_J); when it is smaller than one, the estimated profile is less affected
by noise than the original profile. Most networks are good enough (except MLP-2),
yet networks LIN, WAV-1, MLP-1 and EXP-1 are slightly better; Figure 2.B plots
the results obtained by an average over 40 specimens, with 0 ≤ σ_J ≤ 13 HRc¹.
5. To have a model which is as independent as possible of the number and position
of hardness measurements. All the networks "in mm" are appropriate in this
respect.

In the end, we have chosen two networks, namely MLP-1 and LIN, "in mm", because
the increased flexibility given by the dependency on the distance has been considered
important.
¹ HRc: Rockwell hardness unit.

We have also outlined a method to obtain performance similar to that of the networks
"in pts", while maintaining the flexibility of the approach "in mm". This method consists
in processing the distance x through an appropriate non-linear function before passing
it to network A. This method is currently under investigation, but no final result is
yet available.

5 Parameter Estimation

From the 6 principal components V of the chemical composition N, network B should
predict a set of parameters P characterizing the corresponding Jominy profile. The
issue of choosing the most suitable network to perform this association is a problem
of approximating an unknown non-linear function ℝ⁶ → ℝ⁴, or ℝ⁶ → ℝ⁵, respectively
for an MLP-1 and a LIN network A.
We test different kinds of two-layer networks (WAVs, MLPs and RBFs with Mexican
hat, hyperbolic tangent and Gaussian activation functions) and several numbers
of neurons in the hidden layer.
Training and validation sets contain, respectively, 615 and 152 samples of the three
considered qualities, selected because of the similarity of their chemical compositions
and their Jominy profiles. We had to choose a training set larger than the validation set,
because we needed at least 500-600 samples in the training set in order to
train the proposed NWNs with sufficient accuracy (see Table 2). As the work is still
ongoing, we are collecting more data that will be inserted only in the validation set.
WAV and RBF networks need less than 1,000 epochs to train, thanks to the
initialization, and most of them reach the minimum error within a few epochs, while
MLP networks require from 2,000 to 5,000 epochs. CPU time for training is always
less than 15 min (average of 10 min) on a Pentium 133.
Table 2 shows the results on the validation set for different types of network B.
For either an MLP-1 or a LIN network A, the table gives: the NSRMSE on parameter
estimation e_par, which depends on network B alone; the NSRMSE on Jominy vector
estimation, with (e_co) and without (e_nc) the a-posteriori model correction shown in
Figure 1.C. The latter two depend on both networks (A and B). The last column of the
table gives, as a comparison, the NSRMSE obtained with only one network (with 15
outputs) predicting the Jominy vector directly as a function of chemical composition
(traditional approach).
WAV networks with 4 or 8 Wavelons, with the corrected LIN network A, give
the overall best results. Almost comparable results can be achieved with the RBF
networks; the MLP is the worst of all. The single network ("one net" in Table 2) also
gives comparable results, thus our approach gives no worse performance than existing
methods, but offers the aforementioned advantages, which are very valuable in real
industrial applications.
It is also interesting to compare our results with those achieved with the linear
model for Boron steels proposed in [3], in which the points of the Jominy profile are
estimated by (different) linear combinations of the contents of 3 chemical components:
C, Mn, Cr. This model provides an estimate only for the first 10 points of the Jominy
profile, instead of the 15 points of the proposed approach. As shown in Table 2, the
resulting NSRMSE is far higher than our values. Furthermore, the validity of the
linear model is guaranteed only when the concentrations of C, Mn, Si, Ni, Mo, Cr, B

Network B | hidden  | MLP-1:                     | LIN:                       | one net
          | neurons | e_par(%)  e_nc(%)  e_co(%) | e_par(%)  e_nc(%)  e_co(%) | e(%)
WAV       | 2       | 3.66      18.4     15.2    | 3.42      18.20    15.8    |  -
WAV       | 4       | 3.53      18.3     15.1    | 3.49      16.39    14.5    | 13.9
WAV       | 8       | 3.45      17.5     13.7    | 3.49      16.24    13.6    | 13.4
WAV       | 16      | 3.59      16.8     13.6    | 3.59      16.46    13.8    | 13.7
RBF       | 2       | 3.43      18.5     15.9    | 3.40      18.3     16.0    |  -
RBF       | 4       | 3.48      18.5     15.6    | 3.52      16.7     14.5    | 14.1
RBF       | 8       | 3.42      18.7     16.2    | 3.43      16.3     14.0    | 14.0
RBF       | 16      | 3.50      17.0     13.9    | 3.47      16.5     14.0    | 14.1
MLP       | 2       | 3.46      19.6     17.1    | 3.37      19.0     17.0    |  -
MLP       | 4       | 3.53      19.4     16.8    | 3.40      42.0     36.5    | 14.3
MLP       | 8       | 3.46      71.6     66.8    | 3.42      47.5     43.6    | 13.8
MLP       | 16      | 3.36      39.5     37.6    | 3.99      26.5     25.3    | 15.0
linear model                                                                  | e = 39.4%

Table 2. Performance (NSRMSE) of the proposed system for different combinations of NWNs.
The last column lists the results obtained with traditional neuronal methods (no parametric
estimation; only one network). The last row gives the error achievable with a known linear model.

are within specified ranges, which are in our case too restrictive. This confirms that,
in practical cases, the neuro-wavelet parametric approach can be a more effective and
reliable alternative to traditional models. Figure 2.C compares a measured Jominy
profile with the corresponding predictions, for the proposed and the linear methods.

Acknowledgments
The authors wish to thank Dr. Qinghua Zhang, Dr. Benedetto Allotta, Dr. Renzo
Valentini and Prof. Giorgio Buttazzo for their fruitful discussions.
This work has been partially supported by the National Research Council project
MADESS-II "Architectures and VLSI devices for embedded Neuro-Fuzzy control".

References
1. "Standard Method for End-Quench Test for Hardenability of Steel," A255, Annual Book of ASTM
Standards, pp. 27-44, ASTM, 1989.
2. W.G. Vermeulen, P.J. van der Wolk, A.P. de Weijer, S. van der Zwaag: "Prediction of Jominy Hardness
Profiles of Steels Using Artificial Neural Networks," Journ. Material Eng. and Performance, Vol. 5,
No. 1, February 1996.
3. D.V. Doane, J.S. Kirkaldy (eds.): "Hardenability Concepts with Applications to Steel," TMS-AIME,
1978.
4. L.M. Reyneri: "Unification of Neural and Wavelet Networks and Fuzzy Systems," to be printed in IEEE
Trans. Neural Networks, 1999.
5. Q. Zhang: "Using Wavelet Network in Non-parametric Estimation," IEEE Trans. Neural Networks,
Vol. 8, No. 2, pp. 227-236, March 1997.
6. I. Daubechies: "Ten Lectures on Wavelets," Society for Industrial and Applied Mathematics,
Philadelphia, Pennsylvania, 1992.
7. V. Colla, M. Sgarbi, L.M. Reyneri: "A Comparison Between Weighted Radial Basis Functions and
Wavelet Networks," in Proc. of ESANN 98, Bruges, Belgium, 22-24 April 1998, pp. 13-19.
8. W.W. Cooley, P.R. Lohnes: "Multivariate Data Analysis," John Wiley & Sons Inc., USA, 1971.
9. M. Riedmiller: "Advanced Supervised Learning in Multi-layer Perceptrons - From Backpropagation to
Adaptive Learning Algorithms," Int'l Journ. Computers Standards and Interfaces, No. 5, 1994.
10. S. Chen, C.F.N. Cowan, P.M. Grant: "Orthogonal Least Squares Learning Algorithm for Radial Basis
Function Networks," IEEE Trans. Neural Networks, Vol. 2, No. 2, pp. 302-309, March 1991.
11. Q. Zhang, A. Benveniste: "Wavelet Networks," IEEE Trans. Neural Networks, Vol. 3, No. 6, pp. 889-898,
November 1992.
Study of Two ANN Digital Implementations of a Radar Detector
Candidate to an On-Board Satellite Experiment
R. Velazco 1, Ch. Godin 2,4, Ph. Cheynet 1,
S. Torres-Alegre 3, D. Andina 3, M. B. Gordon 2
1 Laboratoire TIMA
46, Av. Félix Viallet - 38031 Grenoble - FRANCE
2 Commissariat à l'Energie Atomique (CEA)
Département de Recherche Fondamentale sur la Matière Condensée
17 Av. des Martyrs, 38054 Grenoble Cedex 9 - FRANCE
3 Universidad Politécnica de Madrid
ETS Ingenieros de Telecomunicación, 28040 Madrid - SPAIN
4 Commissariat à l'Energie Atomique - Division d'Applications Militaires
(CEA-DAM) Bruyères-le-Châtel - FRANCE

Abstract
The Microelectronics and Photonics Testbed (MPTB) is a scientific satellite carrying
twenty-four experiments on-board in a high radiation orbit since November 1997. The
first objective of this paper is to summarize one year of flight results, telemetered from
one of its experiments, a digital "neural board" programmed to perform texture
analysis by means of an Artificial Neural Network (ANN). One of the attractive
features of the MPTB neural board is its possibility of re-programming from the earth.
The second objective of this paper is to present two new ANN architectures, devoted
to radar or sonar detection, intended to be telecommanded to the MPTB neural
board. Their characteristics (performances and potential robustness with respect to
parameter deviations due to the interaction with charged particles) are compared in
order to predict their behavior under radiation.

1. Introduction
It is expected that neural hardware will provide attractive tools for automatic pattern
recognition and data classification. In particular, the application of neural networks
has been considered [1-4] to be relevant in automatic target recognition, speech
recognition, seismic signal processing and sonar signal processing.
Like most information processing devices, ANNs may be implemented
following three different modalities: (i) software simulation on a general-purpose
computer, (ii) hardware emulation, which mimics the ANN on some particular hardware
architecture, possibly including dedicated processors to accelerate the response, and
(iii) physical implementation, where there is, at least in principle, a one-to-one
correspondence between virtual and physical neurons. Only the last two strategies
can cope with the timing constraints imposed by real-time applications. The main
difference between hardware emulation and physical implementation resides in the
way the network response is obtained. In the latter, the physical implementations
attempt to take advantage of the network's structure and principles, by means of
either a computer with multiprocessing or parallelism capabilities or dedicated
hardware in which the neurons are individually implemented (analog or digital neural
processors). In the former, neuron responses are calculated, sequentially or with some
parallelism, by a processor running a suitable program. The emulation program has
some loops that calculate the responses of the network's neurons as a function of the
state of other neurons and/or the input values. Thus, neurons exist only virtually,
corresponding to a piece of program during a particular period of time. This is
typically the case when microprocessor-based architectures, like those considered in
this paper, are programmed to produce the response of an ANN.
It is usually believed that ANNs are intrinsically fault tolerant, due to both the
network's redundancy and the non-linear nature of neuron responses. This should
make their digital implementation robust with respect to some of the errors resulting
from the interaction with the environment (radiation, electromagnetic perturbations, ...).
This paper deals with a particular class of errors, called soft errors, bit-flips or Single
Event Upsets (SEUs) in different contexts, which result in a temporary inversion of the
content of a memory cell within a digital integrated circuit. The consequences of this
type of error at the system level are strongly related to both the nature of the stored
information and the instant when the error occurs. For instance, in digital
implementations, (i) a parameter may suffer a change whose magnitude depends on
the position of the affected bit, (ii) the error may arise on an already used piece of
information, having then no effect, (iii) the sequencing flow of a program can be altered,
provoking a wrong output, a loss of control (infinite loops) or even more serious
consequences easy to imagine (progressive program corruption, for instance), as the
result of a transient "mutation" of the pre-established instructions that emulate the ANN.
In particular, for some applications requiring on-board operation, the flexibility and
the (practically unlimited) accuracy offered by digital hardware emulations built on
presently available processors (microprocessors, digital signal processors, specific
co-processors) constitute a clear advantage with respect to other possible
implementations. Nevertheless, a thorough study under the effects of the environment
is necessary to evaluate the error rates expected during operation and to design
mechanisms for recovering from critical errors. For applications devoted to
operating under space or nuclear environments, ground simulations using radiation
facilities (particle accelerators) are usually performed to achieve these goals. In the last
5 years, several experiments [5-7] on the behavior of ANN digital implementations
under radiation showed that hardware emulations present a significant fault tolerance
to bit-flips arising on the network inputs and parameters (weights and thresholds).
Although this fault tolerance depends on the hardware used and on the particular
application on which the ANNs were evaluated, we found that a significant percentage
of bit-flips arising on the memory area where relevant data is stored, between 50%
and 95%, was tolerated. However, ground simulations cannot take into account some
aspects of the final application environment. Indeed, the beams offered by
particle accelerators cannot cope with the range of energies of galactic cosmic rays.
On the other hand, ground experiments are limited to a few hours, so that beam fluxes
of much higher orders of magnitude than the ones encountered in space have to be
used. Therefore, error rate predictions derived from ground experiments generally
over-estimate the error rate in flight. They may sometimes lead to discarding circuits or
architectures that would be interesting solutions. Experiments operating on board
space vehicles (satellites, probes, space stations) attempt to overcome these drawbacks
of ground tests, and also provide flight data which allow adjusting the environment
models used by prediction techniques.
The background of this paper is an experiment running on board a satellite
presently in orbit, within the Microelectronics and Photonics Testbed (MPTB) project. We
present results on the fault tolerance of emulated ANNs running a pattern recognition
task, and we analyze the reliability of two different ANNs devoted to radar detection,
intended to be telecommanded to MPTB in the next months. In Section 2 the MPTB
project is briefly described. Details about the neural network emulation, as well as
flight data obtained after one year of mission of a Multi-Layer Perceptron designed to
identify textures on pre-stored satellite images, are summarized in Section 3. The two
different ANN implementations for a radar detector are described in Section 4. One of
them is based on a constructive strategy, called NetSpheres, while the other is an MLP
trained with the Backpropagation algorithm (BP). Their performances are compared in
this section. Conclusions and future work are presented in Section 5.

2. The MPTB project

The Microelectronics and Photonics Testbed [8] is a space experiment designed by the
Naval Research Laboratories (N.R.L., Washington) to study the effects of radiation on
a variety of microelectronic and photonic devices and circuits. The orbit of the satellite
carrying MPTB is highly elliptical, extending from near geosynchronous altitude,
where the radiation environment is dominated by cosmic rays, to below the radiation
belts, in which there is a large flux of protons and electrons. Such a highly elliptical
orbit is ideal for a space radiation experiment such as the MPTB. The specific goals
of MPTB are to improve the prediction accuracy of radiation-induced performance
degradation, to develop a database of ground tests and space data, and to promote the
insertion of new technologies into space systems. To meet these goals, extensive
ground testing of all parts was required. The resulting data make it possible to predict the
performance in space using models for both the radiation environment and the
interaction of radiation with matter. The predictions will be compared with the actual
data from space (including the actual radiation environment) to identify sources of
error in the prediction. The MPTB monitors 24 experimental boards dispatched on 3
panels and continuously relays the status of each board, together with information on
the radiation environment and the temperature. The devices on board are sensitive to
radiation to different extents: their performances are degraded by single event effects
(bit-flips or latchups), Total Ionizing Dose (TID) effects, displacement effects, or
all three. The MPTB monitors the occurrence of single event effects continuously
with a one-second accuracy. Once per orbit, it measures the effects of TID and
displacement effects on the parts, and transmits all this to the ground station. Many of
the settings on the boards can be changed by telecommand.

3. The Neural Boards on MPTB


3.1 Architecture
One of the experiments on board MPTB is devoted to the study of the operation
under radiation of ANN hardware emulations. This experiment, the so-called MPTB
"neural board", was designed by the TIMA Laboratory in collaboration with CNES
(French Space Agency) and CEA (French Atomic Energy Agency). Fig. 1 gives the main
features of the neural board architecture. It consists of a Transputer (a RISC
microprocessor from INMOS/SGS-Thomson with parallelism capabilities), a
dedicated neural coprocessor (the L-Neuro 1.0 chip from Philips), various memory
devices, and the circuits needed to recover in case of critical single event effects
(watchdog and anti-latchup systems). The T225 is the main core of the board, being
in charge of all the operations related to data transfer to/from the satellite and the
implementation of the DUT test strategies. The very small number of T225 internal
registers gives it potential SEU robustness [9]. It executes a program which mimics
the network structure, preparing the data needed to compute the state of each neuron. The
weighted sum calculations are performed by the L-Neuro 1.0 coprocessor, having
previously loaded its internal RAM with the necessary data (weights, thresholds) and
the internal registers with the state of the input neurons. In that way, the calculation of
the output states of neuron sets (up to 64) can be completed without reloading the internal
memory. The main SEU-sensitive area is the memory region where the network program,
parameters or inputs are stored. Detailed information on the board structure and the
ANN computing method can be found in [7].
Two neural boards, differing only in the presence or absence of the L-Neuro coprocessor,
have been implemented. After ground qualification tests, this neural coprocessor
turned out to be the most sensitive component. These twin boards are called below
board A (the one with the neural coprocessor) and board B (the one without it). After
a suitable telecommand operation, these two boards may run in parallel two different
ANN versions of the same problem.

[Block diagram: T225 processor, L-Neuro 1.0 coprocessor, 32 KB Hitachi SRAM,
32 KB MHS SRAM, connected by the T225 bus.]
Figure 1: Experiment block diagram of MPTB neural board A

3.2 On-board ANN


As an initial problem, we considered that of recognizing four types of
textures within SPOT satellite images (Industrial Area, Residential Area, Scrubland
and Sea). The trained neural network was a partially connected MLP composed of
three hidden layers (Fig. 2).

Figure 2: MLP structure for texture analysis


The whole SPOT image (6,000 x 6,000 pixels in grayscale) has been decomposed into
small squares of 24 x 24 pixels, which are the network's inputs. This input layer is
followed by the three layers of hidden neurons clustered in bidimensional structures,
and an output layer of four neurons. The network was trained with the BP algorithm
using a training set of 10,000 input images. To increase the network's robustness, an
additive Gaussian noise was injected on the weights during training. A detailed
description of this network can be found in [5]. The generalization test (classification
of 10,000 new pictures with the resulting network) was successful on 78% of the
images with integer weights. This constitutes a good performance, considering the
chosen selection criterion. Indeed, we have decided that a texture is correctly
recognized if the corresponding neuron is active (output value > 0) while the others are
inactive (output values < 0). The global performance might be higher if other criteria
were used (for instance: the recognized texture is given by the maximum of the values
of the output neurons). In order to reduce the hardware size, the transformation carried
out by the first hidden layer (a learned image compression) was done on the ground. Thus,
the input of the ANN on board, which has thus only two hidden layers, is layer 2 of Fig. 2.
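The paper only states that additive Gaussian noise was injected on the weights during
training; one plausible reading of that step (the noise level sigma and the exact injection
point are our assumptions) is to evaluate the gradients at a randomly perturbed copy of
the weights:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_weight_step(W, grad_fn, lr=0.01, sigma=0.02):
        # One BP step with additive Gaussian noise injected on the weights:
        # evaluating the gradient at a perturbed copy of W favours solutions
        # that keep working when the stored weights deviate slightly.
        W_noisy = W + rng.normal(0.0, sigma, size=W.shape)
        return W - lr * grad_fn(W_noisy)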

3.3. Flight Results


In this section, flight results of the ANN for texture analysis are briefly presented. The
information collected by telemetry up to now has not provided enough flight data to
derive statistically relevant conclusions. Therefore, the data presented here aim only at
showing the capabilities of the neural experiment, which is re-programmable from the
ground by telecommanding appropriate data and instruction code sequences. Board A ran
an effective time of about 141 days, while board B ran about 203 days. In order to avoid
too many SEUs being accumulated in the memory, an automatic refresh is done every
day by a power off-on of the two boards. Concerning the bit-flips that occurred in the ANN
parameters, 79 of them occurred on board A while 147 arose on board
B. This corresponds to an average of 0.56 upsets/day and 0.72 upsets/day, respectively.
Concerning bit-flip accumulation in the memory, we noticed a maximum of 5
(respectively 7) for board A (respectively board B).
The effects of bit-flips on the neural boards were analyzed by comparing the obtained
ANN response of the four output neurons to the expected one. During the whole
mission time analyzed, only a few times were the outputs faulty as the result of
radiation effects: 7 times for board A, 15 times for board B. It is important to note
that the ANN performances of the neural boards were affected in different ways by
bit-flips: the recognition rate never dropped under 60% for boards A and B.
Nevertheless, as a consequence of a bit-flip affecting the T225, the performance of board
B dropped 43% (from the 78% initial recognition rate to 35%). It should also be
noticed that some bit-flips resulted in an improvement of the recognition rate (up to 80%),
confirming previous simulation results [7]. This high fault tolerance is partly due to
the redundancy of the ANN architecture considered.

4. Radar Detection with Neural Networks


Neural networks may also have interesting robustness capabilities when applied as
binary classifiers. Their aim is to decide if a given input belongs to one of two types, 0
or 1. One usual way to treat this problem is to consider networks trained with BP,
which need neurons coded with real numbers, whose states are given by a sigmoidal
function, and to discretize the output. However, as the complexity of binary neurons is
much lower than that of the MLP's units, making them potentially more robust [10],
it is worth considering the performances of networks built of binary units from the
start. In the following we present the problem of detection of noisy radar signals,
which is a classification task currently being considered for the next on-board
experiment, and the two ANNs designed to implement it.

4.1 Neural Detector


The detector under consideration is a modified envelope detector [11-13], as shown
in Fig. 3.

[Block diagram: the input r(t) is mixed with cos(ω_c t) and sin(ω_c t), low-pass
filtered and sampled at kT_0 to produce x_c(kT_0) and x_s(kT_0), which feed the
neural detector.]
Figure 3: The Neural Detector.

The binary detection problem is reduced to deciding if an input complex value (the
complex envelope of the input, involving signal and noise) has to be classified as one
of two outputs, 0 (noise) or 1 (noisy signal). The need of processing complex signals
with an all-real-coefficient NN requires splitting the input into its real and imaginary
parts (the number of inputs doubles the number of integrated pulses); then, a
threshold T is established at the NN output.
The input r(t) is a band-pass signal, and the complex envelope x(t) = x_c(t) + j·x_s(t) is
sampled every T_0 seconds. Then:

    x(kT_0) = x_c(kT_0) + j·x_s(kT_0),  k = 1, ..., M.   (j = √−1)    (1)

At the neural network output, values in (0, 1) are obtained. A threshold value T within
the interval (0, 1) is chosen so that output values in (0, T) will be considered as binary
output 0 (decision D0) and values in [T, 1) will represent 1 (decision D1). The two
hypotheses H0 (target absent) and H1 (target present) are defined as follows:

    H0: x(kT_0) = n(kT_0)    (2a)
    H1: x(kT_0) = S(kT_0)·e^(jΘ) + n(kT_0)    (2b)

where T_0 is the pulse repetition period, k varies from 1 to the number of integrated
pulses (M), S(kT_0) is the signal amplitude sequence, Θ is the signal phase and n(kT_0)
is the complex envelope of the noise sequence, i.e. n(kT_0) = n_c(kT_0) + j·n_s(kT_0).
More details of the design and optimization can be found in [14].
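A hedged sketch of the data generation under the two hypotheses (we assume unit-power
complex Gaussian noise and a constant-amplitude, random-phase target, consistent with
the Marcum model used in section 4.2; all names are ours):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_batch(M=8, snr_db=13.0, n_samples=1000, target_present=True):
        # Complex envelopes under H0 (noise only) or H1 (signal plus noise),
        # eqs. (2a)-(2b), split into the 2*M real inputs of the network.
        noise = (rng.normal(size=(n_samples, M)) +
                 1j * rng.normal(size=(n_samples, M))) / np.sqrt(2.0)  # unit power
        x = noise
        if target_present:
            amplitude = np.sqrt(10.0 ** (snr_db / 10.0))  # constant amplitude
            theta = rng.uniform(0.0, 2.0 * np.pi, size=(n_samples, 1))
            x = amplitude * np.exp(1j * theta) + noise    # eq. (2b)
        return np.concatenate([x.real, x.imag], axis=1)   # real and imaginary parts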

4.2 Results for an MLP with continuous sigmoidal units


After a thorough study of different neural solutions, it appears that the results of MLPs
with continuous sigmoidal units trained with BP can be very close to the optimal
performance [13]. In this section we present typical results for this kind of neural
network.
The results are presented through the detection curves, under the classical Marcum
model [15]. These curves present the probability of detection Pd (probability of deciding
D1 when hypothesis H1 is true) vs. the signal-to-noise ratio (signal power divided by
noise power, in dB) of the input signal, for a given false alarm probability Pfa
(probability of deciding D1 under the hypothesis H0). When an MLP is used, the
results, depending on the parameter Training Signal-to-Noise Ratio (TSNR), can be
quasi-optimal (Fig. 4).

[Two panels of detection curves: Pd vs. SNR (dB) for TSNR = 3, 6, 13, 14, 15 and
the optimum detector, at Pfa = 0.01 (a) and Pfa = 0.001 (b).]
Figure 4: Detection Probability (Pd) vs. Signal-to-Noise Ratio for an MLP of structure 16/8/1
and different Training Signal-to-Noise Ratios. (a) Pfa = 0.01 and (b) Pfa = 0.001.

Aiming at evaluating the SEU-sensitive surface of a digital implementation of the MLP-based
neural detector, we have performed software simulations. The network
performance (detection capability) in the presence of a unique bit-flip fault on its
parameters (synaptic weights and neuron offsets) was calculated. These bit-flips
were successively (and exhaustively) injected to get all possible neural detector
mutations due to SEUs, before running a C program emulating the neural detector.
The network parameters were coded as 32-bit floating point numbers (IEEE 754-1985
standard) with a sign bit, a one-byte signed exponent and a 23-bit mantissa.
The parameters to be corrupted occupy 4640 bits. We have studied the degradation of the
detector performance for the particular case Pfa = 0.1. We have considered that a
bit-flip is critical when the Pfa increases more than 5% or when the Pd decreases more
than 10% for any of the studied values of SNR (between -10 dB and +9 dB). The
results of these simulations can be summarized as follows:
- The synaptic weights of the hidden layer neurons all have the same SEU-sensitive
bit, which is the sign of the exponent. Its modification is critical: the Pfa grows from
0.1 to 0.5 when this bit is modified. This effect concerns 136 bits. We have also shown
that bit-flips on another bit of the exponent lead to slight modifications of
Pfa (up to a maximum of 0.15). This affects another 136 bits.
- Corrupting the synaptic weights of the output neuron is more critical. Bit-flips of
practically all of the bits of the exponent lead to a serious loss of detection
performance. There are 98 bits in this situation.
- Bit-flips of only 30 bits of the weights in the hidden layer can be considered as
beneficial: their inversion improves the Pd without modifying the Pfa.
- Modifying the remaining 4240 bits has no significant effect on the neural detector
performance.
Obviously, this study takes into account only bit-flips on the memory area used to
store the network parameters. The study of the effects of bit-flips on other memory
regions needs to be done on the final digital implementation. It is also important to
notice that, to study the effects of SEU-like faults on the neural detector, not only the
detection probability must be checked, but also the false alarm probability. Also, let
us remark that, owing to the chosen format, these figures correspond to a worst case.
In the final digital implementation, the network's parameters will be coded as 16-bit
integers, leading to a minimization of the sensitive memory area that avoids the
proliferation of critical bits.
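The exhaustive injection campaign described above can be mimicked in a few lines; the
sketch below (our own illustration, not the authors' C program) flips one bit at a time
in the IEEE 754 single-precision encoding of each parameter:

    import numpy as np

    def flip_bit(value, bit):
        # Invert one bit (0 = mantissa LSB, 23-30 = exponent, 31 = sign) in the
        # 32-bit floating point representation of a parameter, emulating one SEU.
        raw = np.float32(value).view(np.uint32)
        return (raw ^ np.uint32(1 << bit)).view(np.float32)

    def all_mutants(params):
        # Enumerate every single-bit mutation of the parameter vector
        # (len(params) * 32 mutants), to be evaluated one by one.
        for i, p in enumerate(params):
            for bit in range(32):
                mutant = np.array(params, dtype=np.float32)
                mutant[i] = flip_bit(p, bit)
                yield i, bit, mutant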

4.3 ANNs of binary units

For the sake of comparison, we generated ANNs composed exclusively of binary
units. To this end, we used the constructive algorithms NetLines and NetSpheres [16,
17, 18], which are incremental training algorithms. They proceed by including hidden
units upon learning, one after the other, until the number of training errors reaches a
user-dependent upper bound. The resulting network architectures are similar to
those of completely connected MLPs, but the hidden neurons are binary units that
implement either linear or quadratic (hyper-spherical) discriminating surfaces,
depending on the learning algorithm considered. With NetLines, the hidden units
determine hyperplanes in input space, whereas NetSpheres generates hyper-spherical
discriminating surfaces. It is worth pointing out that hyper-spherical surfaces have the
same number of parameters as linear ones, so that the complexity of networks built
with NetLines or NetSpheres having the same number of neurons is the same.
Although not considered in this paper, hybrid networks containing both kinds of
neurons may be constructed.
NetLines and NetSpheres are presented in [16, 17]. We describe here their main
characteristics. The heuristics of both algorithms is similar, differing only in the kind
of unit (linear or spherical) included at each growth step. They proceed as follows: a
first unit is trained to separate the patterns of the training set belonging to one class
from the other. If this succeeds, only one neuron suffices, and the algorithm stops.
Otherwise, this unit becomes the first neuron of a hidden layer. New hidden neurons
are successively added and trained to separate (either linearly or spherically) the
remaining errors. After training each hidden unit, an output (linear) neuron attempts
to learn the training set. If its training error is lower than the accepted bound, the
algorithm stops, and the last trained output neuron is kept. Otherwise the output
neuron is removed and the algorithm goes back to add and train a new hidden unit.
The algorithm used to train the linear discriminant units is Minimerror [19, and
references therein]. Minimerror-S, a generalization of Minimerror, is used for
hyper-spherical discriminations. Both algorithms are based on the minimization of suitable
cost functions that depend on two hyper-parameters called "temperatures". The final
weights minimize the number of errors close to the discriminating surfaces. The
algorithms have three adjustable parameters: the learning rate, the ratio between the
two "temperatures", and an annealing rate.

4.4 Results for ANNs of binary units


We present results obtained with a training set of 500 patterns, evenly distributed in
two classes. Class -1 corresponds to noise, Class +1 to noisy signals with a signal-to-noise
ratio (SNR) of 13 dB. It turned out that in the present problem, NetSpheres
stopped with only one neuron, whose output is:

    σ = sign( ||x − w|| − w_0 )    (3)

where the vector x is the input, and the vector w and the scalar w_0 are the weights and the
threshold of the radial neuron, respectively. This result means that the training set is
threshold of the radial neuron, respectively. This result means that the training set is
spherically separable. The generalization error, determined with an independent test
set of 10000 patterns, also vanishes. This explains why so many neurons are needed
when the hidden units implement linear discriminations: as shown on Fig. 5, we were
unable to improve the classification performance using NetLines: neither the
recognition error of signals nor the fraction of false alarms decrease upon adding
hidden neurons beyond three. Comparison with the results obtained with BP show
clearly that in this task, the performance reached with real valued neurons is better
than with binary linear traits.
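A sketch of the single hyper-spherical unit, and of the threshold adjustment used below
to set the tolerated false alarm level, follows; since the sign convention of eq. (3) is
partly illegible in the source, we assume here that Class +1 (signal) lies outside the
sphere:

    import numpy as np

    def spherical_neuron(x, w, w0):
        # Binary hyper-spherical unit, eq. (3): the decision depends on whether
        # the input x falls inside or outside the sphere of center w, radius w0.
        return np.sign(np.linalg.norm(x - w) - w0)

    def threshold_for_pfa(noise_patterns, w, target_pfa):
        # Adjust the radius w0 on an independent set of noise patterns so that
        # the fraction classified as signal (Class +1) equals the desired Pfa.
        d = np.linalg.norm(noise_patterns - w, axis=1)
        return np.quantile(d, 1.0 - target_pfa)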

[Two panels of error vs. number of hidden units.]
Figure 5: Performance of NetLines.
a) Training error vs. number of hidden units for networks trained with 500 patterns.
b) Corresponding generalization errors estimated with a test set of 10,000 patterns.

As small network implementations use less physical memory area, they should be
less sensitive to radiation effects. Thus we investigated the performance of the single
hyper-spherical neuron trained with Minimerror-S on the detection of noisy signals
with SNRs different from the ones used for training. As with BP, we determined the
detection curves, i.e. the fraction of signals correctly detected by the neuron. The
results are shown in Fig. 6, where the Pd is displayed as a function of the SNR for
three different levels of tolerated false alarms. This level is controlled by adjusting the
threshold (i.e. the weight w_0) using an independent set of noise patterns (Class -1).
The fraction of these that the network classifies as signal (Class +1) is the Pfa.
Clearly, the networks trained with BP have the best performances. However, under
radiation conditions the quality of their response may be degraded. As they are more
complex, they may be more sensitive to SEUs than the hyper-spherical neuron.
Moreover, as the latter needs only a few bytes to be implemented, it would be possible to
replicate it in order to increase the robustness of the solution, and still use less
memory space.

[Detection curves, two panels: Pd vs. SNR for Pfa = 0.01 and Pfa = 0.001, for
networks trained with BP and with NetSpheres.]
Figure 6: Pd vs. SNR for different Pfa levels, for the two investigated networks
trained with BP and with NetSpheres, respectively.

5. Conclusions and perspectives

We have presented flight results of neural experiments presently operating in space as
part of a project devoted to studying the operation under radiation of novel
microelectronics and photonics circuits (the MPTB project). These results show that
ANNs present significant fault-tolerance properties with respect to one of the critical
errors due to radiation: the Single Event Upset phenomenon, responsible for bit-flips.
These neural experiments are re-programmable from the ground. Radar detectors based
upon ANNs have been selected as the next application candidate to run on the MPTB
neural boards. Two different solutions were presented, the first one based on a
Multi-Layer Perceptron trained with Backpropagation, which showed quasi-optimal
performances, the second one an ANN composed of binary neurons trained with
incremental training algorithms. Although the latter network showed worse detection
performances, we expect that the small memory surface needed to implement it should
make it less sensitive to transient errors due to the interaction with the radiation
environment. Thus, its minimal structure should make it more suitable for space
applications. Future work includes the digital implementation of these neural radar
detectors to evaluate both the detection performance and the robustness against
transient perturbations. The telecommand of these neural detectors to the MPTB neural
boards is scheduled for June 1999.

6. References
[1] S.E. Decatur, "Application of neural networks to terrain classification," Proc. Int. Conf. Neural
Networks, pp. 283-288, 1989.
[2] N. Miller, M.W. McKenna, T.C. Lau, "Office of Naval Research Contributions to Neural Networks
and Signal Processing in Oceanic Engineering," IEEE Journal of Oceanic Engineering, Vol. 17, No. 4,
Oct. 1992.
[3] M.W. Roth, "Survey of neural network technology for automatic target recognition," IEEE Trans.
Neural Networks, Vol. 1, No. 1, pp. 28-43, March 1990.
[4] D. Andina, J.L. Sanz-González, "Optimization of a Neural Network Applied to Pulsed Radar
Detection," Proc. of VIII European Signal Processing Conference, EUSIPCO-96, Trieste, Italy,
pp. 851-854, September 1996.
[5] J.D. Muller, P. Cheynet, R. Velazco, "Analysis and improvement of network robustness for on-board
satellite image processing," Int. Conference on Artificial Neural Networks ICANN'97, Lausanne,
Suisse, 8-10 Oct. 1997.
[6] A. Assoum, N.E. Radi, R. Velazco, F. Elie & R. Ecoffet, "Robustness Against SEU of an Artificial
Neural Network Space Applications," Special Issue IEEE Trans. on Nuclear Science, Vol. 43, No. 3,
Part I, pp. 973-978, June 1996.
[7] R. Velazco, Ph. Cheynet, J-D. Muller, R. Ecoffet, "Artificial Neural Network Robustness for on-board
satellite image processing: Results of SEU simulations and ground tests," IEEE Transactions on
Nuclear Science, Part I, Vol. 44, pp. 2337-2344, 1997.
[8] J.C. Ritter, "Microelectronics and Photonics Test Bed," 20th Annual AAS Guidance and Control
Conference, Breckenridge, Colorado, Feb. 5-9, 1997.
[9] F. Bezerra, D. Benezech, R. Velazco, "Study of the sensitivity of Transputers with respect to SEU
and latchup phenomenons," Proc. of Radiation Effects on Components and Systems (RADECS'95),
Arcachon, 18-23 Sept. 1995.
[10] A. Assoum et al., "Robustness against single event upsets of digital implementations of neural
networks," Proceedings International Conference on Artificial Neural Networks, Session 9, Paris,
October 9-13, 1995.
[11] D. Andina, J.L. Sanz-González, "Quasi-Optimum Detection Results Using a Neural Network,"
Proc. of IEEE Int. Conf. on Neural Networks, ICNN'96, Washington DC, USA, pp. 1929-1932,
June 1996.
[12] W.L. Root, "An Introduction to the Theory of the Detection of Signals in Noise," Proc. of the
IEEE, Vol. 58, pp. 610-622, May 1970.
[13] D. Andina, J.L. Sanz-González and J.A. Jiménez-Pajares, "A Comparison of Criterion Functions
for a Neural Network Applied to Binary Detection," Proc. of Int. Conf. Neural Networks, ICNN,
Perth, Australia, 1995.
[14] D. Andina, J.L. Sanz-González, "Design and Performance Analysis of Neural Binary Detectors,"
Novel Intelligent Automation and Control Systems, Vol. I, pp. 59-78, Germany, 1998.
[15] J.L. Marcum, "A Statistical Theory of Target Detection by Pulsed Radar," IRE Trans. on
Information Theory, Vol. IT-6, No. 2, pp. 59-144, Apr. 1960.
[16] J.M. Torres-Moreno, "Apprentissage et généralisation par des réseaux de neurones: étude de
nouveaux algorithmes constructifs," Ph.D. Thesis, Institut Nat. Polytechnique de Grenoble, 1997.
[17] B. Raffin, M.B. Gordon, "Learning and generalization with Minimerror, a temperature dependent
learning algorithm," Neural Computation 7, pp. 1206-1224, 1995.
[18] J.-M. Torres-Moreno, M. Gordon, "Characterization of the Sonar Signals Benchmark," Neural
Processing Letters 7, 1-4, 1998.
[19] J.M. Torres-Moreno, M.B. Gordon, "Efficient adaptive learning for classification tasks with binary
units," Neural Computation 10, pp. 1017-1040, 1998.
Curvilinear Component Analysis for
High-Dimensional Data Representation:
I. Theoretical Aspects and Practical
Use in the Presence of Noise

Jeanny Hérault, Claire Jausions-Picaud and Anne Guérin-Dugué

INPG-LIS, 46 Ave. Félix Viallet, F-38031 Grenoble Cedex, France

{jeanny.herault, claire.jausions, anne.guerin}@inpg.fr

Starting from a recall of the theoretical framework, this paper presents the
conditions and the strategy of implementation of CCA, a recent algorithm
for non-linear mapping. Initially developed in a basic form for non-linear
and high-dimensional data sets, the algorithm is here adapted to the general,
and more realistic, case of noisy data. This algorithm, which finds the
manifold (in particular, the intrinsic dimension) of the data, has proved to
be very efficient in the representation of highly folded data structures. We
describe here how it can be tuned to find the average manifold and how
robust the convergence is. A companion paper (this issue) presents various
applications using this property.

1. Data analysis and non-linear mapping


The general problem with high-dimensional data sets is, using the inherent redundancy
of the data, first to obtain a reduction of dimension, and second to obtain a representation
of the intrinsic dimension of these data, that is, to provide a "picture" which can be
used to give a meaningful interpretation of the data. This can be done through
Multidimensional Scaling or through Non-Linear Mapping techniques [Mardia et al.
(1979), Borg (1997)].

These algorithms are based on the point mapping of n-dimensional vectors to a
lower-dimensional space such that the inherent structure of the data is approximately
preserved. The input data can be either vectors from a set of measurements (the input
space is known), or an inter-point Euclidean distance matrix or a similarity matrix
(the dimension of the input space is unknown). In some cases the data are non-metric:
only the rank orders of the distances are known [Kruskal (1964)]. For a similarity
matrix $\{s_{ij}\}$, often used in psychometry, a conversion into a distance matrix $\{d_{ij}\}$ is
required: $d_{ij} = \sqrt{2 - s_{ij} - s_{ji}}$ [D'Aubigny (1989)].

The main idea is the following: for every couple of distinct points (i, j), take every
interpoint distance $X_{ij} = \|x_i - x_j\|$ in the input space and find the corresponding
interpoint distance $Y_{ij} = \|y_i - y_j\|$ in a lower-dimensional output representation space.
This can be done in different manners [Siedlecki et al. (1988)]. Basically, one of them
is obtained by minimising the following quadratic form:

$$E = \sum_{i,j} \left(X_{ij} - Y_{ij}\right)^2 \qquad (1)$$

for example by means of some gradient descent algorithm. If the dimensions of the
input and output spaces are the same, the cost function E can be made null. It can be
normalised according to the input distribution by:

$$E = \frac{1}{\sum_{i,j} X_{ij}^2} \sum_{i,j} \left(X_{ij} - Y_{ij}\right)^2 \quad \text{or by:} \quad E = \frac{1}{\sum_{i,j} Y_{ij}^2} \sum_{i,j} \left(X_{ij} - Y_{ij}\right)^2 \frac{X_{ij}^2}{Y_{ij}^4}$$

due to Shepard [Shepard (1962)], but the last one is very computationally demanding
and the equilibrium point is difficult to find. Another cost function, automatically
normalised, can be:

$$E_r = \sum_{i,j} \left(\frac{X_{ij} - Y_{ij}}{X_{ij}}\right)^2$$
But in the case of non-linear and folded data structures, this cost function is not
really suitable because it works according to the relative error. The idea here is to
favour the mapping of small distances in the input space with respect to the mapping
of large distances, leading thus (intuitively) to some local topology preservation,
which is the aim. However, in the case of strongly folded data, the unfolding is
difficult or impossible to obtain, simply because when the input is folded, the
extreme points of the input distribution have a small distance $X_{ij}$; thus the
algorithm, which favours it, prevents the desired unfolding.

Some compromise has been given by the so-called Sammon Mapping [Sammon
(1969)], by giving less importance to the relative error:

$$E_S = \frac{1}{\sum_{i,j} X_{ij}} \sum_{i,j} \frac{\left(X_{ij} - Y_{ij}\right)^2}{X_{ij}}$$

The behaviour of unfolding, though slightly better, has been shown to fail with
strongly folded data manifolds [Demartines and Hérault (1997)].
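To make these cost functions concrete, the following minimal sketch (Python with NumPy; our illustration, not code from the paper, and the random data and the naive 2-D configuration are purely illustrative) evaluates both the quadratic form of eq. (1) and Sammon's normalised stress for a candidate mapping:

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix for a set of row-vector points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def quadratic_stress(X, Y):
    """Eq. (1): sum over pairs of (Xij - Yij)^2."""
    iu = np.triu_indices_from(X, k=1)        # each pair (i, j) once
    return ((X[iu] - Y[iu]) ** 2).sum()

def sammon_stress(X, Y):
    """Sammon's stress: (1 / sum Xij) * sum (Xij - Yij)^2 / Xij."""
    iu = np.triu_indices_from(X, k=1)
    return ((X[iu] - Y[iu]) ** 2 / X[iu]).sum() / X[iu].sum()

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5))              # illustrative 5-D input samples
proj = data[:, :2]                           # naive 2-D configuration to evaluate
X, Y = pairwise_distances(data), pairwise_distances(proj)
print(quadratic_stress(X, Y), sammon_stress(X, Y))
```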

2. CCA: Curvilinear Component Analysis


2.1. Principle

In order to circumvent these drawbacks, we have derived a new algorithm for data
representation called "Curvilinear Component Analysis". In this algorithm, the
strategy of processing is as important as the equations themselves. The goal is to find
the dimension of the average manifold of the data and to map it onto a space of lower
dimension (representation space). In summary, we proceed to a Global Unfolding
followed by a Local Projection onto the average manifold (See fig. 1).
Hypothesis: the input consists of $N_s$ samples belonging to some theoretically p-
dimensional manifold, embedded in an n-dimensional input space $X = \{x_{ik}\}$, $i = 1..N_s$,
$k = 1..n$. But, because of noise, the manifold has some "thickness", being thus also of
dimension n.

The aim is to find the average manifold and to map it onto a p-dimensional output
space Y. To do this, we use $N_n$ neurons with n-dimensional input weights and p-
dimensional output weights. If the number of samples is low, we use one neuron per
sample: $N_n = N_s$. If $N_s$ is too large, we first proceed to a vector quantization [Gersho
and Gray (1992)] of the input space, and the number of neurons is equal to the
number of prototypes: $N_n < N_s$ [Demartines (1994)].

Figure 1. Principle of the CCA algorithm. The input weights first proceed to a vector
quantization (VQ) of the input data space (X) in n dimensions. Then, the output weights
map the local topology of the input average manifold by projecting it (P) into an output
representation space (Y) of dimension p < n. This way, tasks like classification and
recognition are highly facilitated in an unfolded and lower-dimensional output space.

Then, each neuron i is associated to one input sample (or prototype) and its input
weights are made equal to the components of the sample (or prototype): $w_{ik} = x_{ik}$,
$k = 1..n$. Contrary to Kohonen's Self-Organising Maps [Kohonen (1989)], the
neurons have no a priori pre-defined neighbourhood, but they have p-dimensional
output weights $y_{iq}$, $q = 1..p$, pointing in the output space Y. They will find their
neighbourhood themselves by adapting their output weights to the local topology of the
input samples.

2.2. Choice of the cost function

Let us come back to the basic cost function, without normalisation for the sake of clarity:

$$E = \sum_{i,j} E_{ij}, \quad \text{with } E_{ij} = \left(X_{ij} - Y_{ij}\right)^2 \qquad (2)$$

The input interpoint distances $X_{ij} = \|x_i - x_j\|$ are given, and for every point $y_i$ in
the output space we move the points $y_j$ so that the terms $E_{ij}$ are minimised, for
example by means of a gradient descent algorithm. In order to map the average
manifold of the data, two cases are to be considered (see figure 2): first, we need a
global unfolding of the average manifold of the data, and second, we need a local
projection of these data onto their average manifold.

Let us consider the first case alone (Unfolding: figure 2, top). In order to unfold the
data, only some of the $E_{ij}$ terms in formula 2 need to be minimised: those for which
the distance $Y_{ij}$ is smaller than some pre-defined distance $\lambda$. Thus, allowing the
matching for only short distances is a way to respect the local topology. It has been
proved that this condition (applied on the output distances) ensures a global unfolding,
much better than other mapping techniques, which apply it to the input distances
[Demartines (1994)]. In this case, the general term to be minimised becomes:

$$E^u_{ij} = \left(X_{ij} - Y_{ij}\right)^2 F_\lambda(Y_{ij}) \qquad (3)$$

with $F_\lambda(Y_{ij}) = 1$ for $Y_{ij} < \lambda$, and $F_\lambda(Y_{ij}) = 0$ for $Y_{ij} > \lambda$.

The choice of $\lambda$ strongly depends on the data structure (e.g. curvature of the
average manifold, spreading of the data around this manifold). As the data structure is
in most cases unknown, some strategy should be defined in order to find the best
value of $\lambda$; see section 4.
We should remark that, apart from the desired global unfolding, there is also some
tendency to make a local projection. Look at the input distribution in figure 2:
because we ask for the mapping of $X_{14}$ simultaneously with the mapping of $X_{12}$, $X_{23}$
and $X_{34}$, the resulting compromise will lead to $Y_{12} < X_{12}$, $Y_{23} < X_{23}$ and $Y_{34} <
X_{34}$, which is an approximate projection. This property will be used hereafter.

Figure 2. Illustration of the problem of data representation, in two cases: either only an
unfolding is desired, or only a local projection is desired (see text).
Let us now consider the second case (Projection: figure 2, bottom). This situation
is the opposite of the preceding one: let us suppose that we have already projected the
data onto their average manifold; the interpoint distances $X'_{ij}$ of the projected data
will locally minimise the quadratic error $(X_{ij}^2 - X'^{\,2}_{ij})$. Then, the
output vectors should map this local projection; that is, translated into a cost function
problem, they should minimise:

$$E^p_{ij} = \left(X_{ij}^2 - Y_{ij}^2\right)^2 \qquad (4)$$

It is important to notice that this is quite equivalent to making a local Principal
Component Analysis in the local subspace of the data. Because of the projection, this
should apply only when $Y_{ij} < X_{ij}$, a situation which is initiated by the above-
mentioned tendency to make a local projection. Conversely, when $Y_{ij} > X_{ij}$, we are
in the condition of unfolding. Hence, the two situations (unfolding or projection) do
not overlap, and the global cost function can merge formulae (3) and (4), provided that
the continuity between them is assured at $Y_{ij} = X_{ij}$.
Let us remark that, with such a cost function, there are some degrees of freedom: it
is invariant under transformations like translations, rotations, or inversions of axes.
This property can be exploited by adding constraints suitable for various conveniences
of data representation; refer to the companion paper [Guérin et al. (1999), this issue].
In particular, various constraints may be added, for example:
- smoothness constraints in the case of sparse distance matrices,
- constraints to let the axis of maximum variance be horizontal,
- addition of a term containing the information relative to one given factor in
factorial discriminant analysis,
- choosing one axis to minimise intra-class variance while maximising inter-class
variance in the case of supervised learning.

3. Choice of the learning strategy


Instead of minimising globally the cost function according to a classical gradient
descent, we use a special, more powerful strategy:

1. initialise at random the $y_i$ values,
2. select at random every $y_i$ and move every $y_j$ in the direction opposite to the
gradient of the partial cost function $E_{ij}$,
3. repeat step 2 until a good mapping quality is obtained (see sect. 4).

This is equivalent on average to a classical gradient descent of E, but has proved to be
better in the sense of escaping from local minima [Demartines and Hérault (1997)];
this is mainly because, during convergence, it allows the global cost function E to
temporarily increase. In other words, minimising E term by term is more likely to
lead to the global minimum than minimising E globally.
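As a hedged illustration of this strategy, the sketch below (Python/NumPy; our reading of the unfolding rule that follows from the gradient of the partial cost (3), with assumed 1/t schedules for the learning rate and for $\lambda$) draws one point at random and moves all the others around it:

```python
import numpy as np

def cca_unfold(X, p=2, n_iter=5000, alpha0=0.5, seed=0):
    """Stochastic CCA unfolding. X: (Ns x Ns) matrix of input distances X_ij;
    p: output dimension. Returns the output configuration y (Ns x p)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    y = rng.normal(size=(n, p))                 # step 1: random initialisation
    lmbda0 = X.max()
    for t in range(1, n_iter + 1):
        alpha = alpha0 / t                      # assumed decreasing learning rate
        lmbda = lmbda0 / t                      # lambda decreases as 1/t (remark 1, sect. 2.2)
        i = rng.integers(n)                     # step 2: draw one point at random...
        d = y - y[i]                            # ...and move all the others around it
        Yij = np.linalg.norm(d, axis=1)
        Yij[i] = np.inf                         # exclude y_i itself
        idx = np.flatnonzero(Yij < lmbda)       # F_lambda: adapt short output distances only
        step = alpha * (X[i, idx] - Yij[idx]) / Yij[idx]
        y[idx] += step[:, None] * d[idx]        # descend the partial cost E_ij of eq. (3)
    return y
```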
In order to assure a continuity between the two partial cost functions around
Yij = Xij, we need to equate their second order differentials. Let us now calculate the
first and second order differentials of these cost functions.

In the case of unfolding (when $Y_{ij} > X_{ij}$), we have:

$$E^u_{ij} = \left(X_{ij} - Y_{ij}\right)^2 = \left(X_{ij} - \sqrt{(y_i - y_j)^T (y_i - y_j)}\right)^2 \qquad (5)$$

and, with respect to the variation $dy_j$, we have:

$$dE^u_{ij} = dy_j^T\, \nabla_j E^u_{ij} = -2\left(X_{ij} - Y_{ij}\right) \frac{dy_j^T (y_j - y_i)}{Y_{ij}} \qquad (6)$$

The gradient is a vector in the direction of $(y_i - y_j)$, and its norm is proportional to the
distance error. The second order differential is:

$$d^2 E^u_{ij} = 2\,\frac{X_{ij}}{Y_{ij}^3}\, dy_j^T (y_j - y_i)(y_j - y_i)^T dy_j - 2\left(\frac{X_{ij}}{Y_{ij}} - 1\right) dy_j^T dy_j \qquad (7)$$

The Hessian matrix is positive definite for $Y_{ij} = X_{ij}$, and this point is a
minimum with a quadratic behaviour for $E^u_{ij}$. It is also positive definite for every
$Y_{ij} > X_{ij}$, and because the direction of the gradient does not change for $Y_{ij} < X_{ij}$,
down to $Y_{ij} = 0$, we have a very wide basin of attraction around the minimum.

In the case of projection (when $Y_{ij} < X_{ij}$), we have:

$$E^p_{ij} = \left(X_{ij}^2 - Y_{ij}^2\right)^2 = \left(X_{ij}^2 - (y_j - y_i)^T (y_j - y_i)\right)^2 \qquad (8)$$

and, with respect to the variation $dy_j$, we have:

$$dE^p_{ij} = dy_j^T\, \nabla_j E^p_{ij} = -4\left(X_{ij}^2 - Y_{ij}^2\right) dy_j^T (y_j - y_i) \qquad (9)$$

The gradient is a vector in the direction of $(y_i - y_j)$, and its norm is proportional to the
squared projection error and to $\|y_i - y_j\|$. The second order differential is:

$$d^2 E^p_{ij} = 8\, dy_j^T (y_j - y_i)(y_j - y_i)^T dy_j - 4\left(X_{ij}^2 - Y_{ij}^2\right) dy_j^T dy_j \qquad (10)$$

As previously, the Hessian matrix is positive definite at the same point $Y_{ij} = X_{ij}$,
which is also a minimum for $E^p_{ij}$. For the same reason as previously, the basin of
attraction is quadratic and wide.
In order to have the same cost functions in both cases around $Y_{ij} = X_{ij}$, we need to
normalise them so that their second order derivatives at this point are equal. The
global function to be minimised is then:

$$E = \sum_{i,j} \Big[\left(X_{ij} - Y_{ij}\right)^2 F_\lambda(Y_{ij})\Big]_{Y_{ij} > X_{ij}} + \sum_{i,j} \left[\frac{\left(X_{ij}^2 - Y_{ij}^2\right)^2}{4\,X_{ij}^2}\right]_{Y_{ij} < X_{ij}} \qquad (11)$$

Remarks:
1. At the beginning of the learning procedure, we need a global organisation of the
output vectors and hence it is better first to run the minimisation of the $E_{ij}$'s with only
the unfolding term. During this procedure, the parameter $\lambda$ of the weighting function
$F_\lambda(Y_{ij})$ starts with a maximal value and decreases as the inverse of the iteration
number. Afterwards, the projection term is introduced, as a refinement for local
organisation.
2. In the neighbourhood of the minimum, the basins of attraction are interestingly
quite wide. To show this, let us rewrite the expressions of $d^2E^u_{ij}$ and $d^2E^p_{ij}$ in a
more convenient form, by expressing the scalar product $dy_j^T (y_j - y_i)$ as
$\|dy_j\|\, Y_{ij} \cos(\theta_{ij})$, $\theta_{ij}$ being the angle between the directions of $dy_j$ and $(y_j - y_i)$:

$$d^2E^u_{ij} = 2\,\|dy_j\|^2 \left(1 - \frac{X_{ij}}{Y_{ij}} \sin^2(\theta_{ij})\right), \qquad d^2E^p_{ij} = 4\,\|dy_j\|^2 \left(3\,Y_{ij}^2 - X_{ij}^2 - 2\,Y_{ij}^2 \sin^2(\theta_{ij})\right),$$

and by taking their mathematical expectation under the condition of one given value
of $Y_{ij}$, we find:

$$E\!\left[d^2E^u_{ij} \mid Y_{ij}\right] = 2\,\|dy_j\|^2 \left(1 - \frac{X_{ij}}{2\,Y_{ij}}\right) \quad\text{and}\quad E\!\left[d^2E^p_{ij} \mid Y_{ij}\right] = 8\,\|dy_j\|^2\, Y_{ij}^2 \left(1 - \frac{X_{ij}^2}{2\,Y_{ij}^2}\right).$$

The first one is positive for $Y_{ij} \geq X_{ij}/2$, and the second one is positive for
$Y_{ij} \geq X_{ij}/\sqrt{2}$. This means that the basins of attraction around $Y_{ij} = X_{ij}$ are wide
enough to allow a robust convergence of the algorithm, either in the unfolding or in
the projection situations.

4. Quality of the output mapping

Figure 3. Evaluation of the quality of the mapping. a) Example of a 2-dimensional data
space with a 1-dimensional average manifold. b) 1-dimensional output representation. c)
The dx/dy joint distribution showing the regions where unfolding and projection occur.
If the output space has the same dimension as the input one, all the input
interpoint distances are equal to the corresponding output interpoint distances: the
joint distribution of input distances (dx) versus output ones (dy) lies on the first
quadrant bisector dx = dy. If the dimension of the output space is lower than that of
the input space (see fig. 3), the joint distribution dx/dy presents two aspects: in the
case of unfolding, the points lie on the dy > dx side of the first diagonal and, in the case
of projection, they lie on the dy < dx side. A "good" mapping is obtained when there is
an unfolding for large dy values and a projection for small values. Then, the aspect of
the joint distribution dx/dy should be the one of figure 3.
The visual analysis of this dx/dy graph is very useful [Demartines (1992)] (a minimal
sketch of how to compute it follows the list):
1. When searching for the (unknown) intrinsic dimension of the input data, we
choose the output dimension by dichotomy: if the distribution lies on the first
diagonal, we can lower the output dimension, and if the distribution becomes
thicker, the output dimension is too small.
2. Once in the good dimension, playing on the minimum value of $\lambda$ to reduce the
scattering around dx = dy for medium values of dy will improve the quality of
the mapping.
3. Moreover, looking at the maximum of dx near dy = 0 gives an idea of the spreading
of the data near the average manifold.
4. In the case of multimodal input data distributions, it can be interesting to
provide one dx/dy representation for each modality [Teissier et al. (1998)].
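The diagnostic itself is simple to collect; here is a minimal sketch (Python with NumPy and matplotlib; our own helper, not the authors' tool):

```python
import numpy as np
import matplotlib.pyplot as plt

def dxdy_plot(X, Y, ax=None):
    """Scatter every input distance dx against its output distance dy.
    Points above the bisector dx = dy indicate unfolding, points below projection."""
    iu = np.triu_indices_from(X, k=1)            # each pair (i, j) once
    dx, dy = X[iu], Y[iu]
    ax = ax or plt.gca()
    ax.scatter(dx, dy, s=2)
    lim = float(max(dx.max(), dy.max()))
    ax.plot([0.0, lim], [0.0, lim])              # first quadrant bisector dx = dy
    ax.set_xlabel("dx (input distances)")
    ax.set_ylabel("dy (output distances)")
    return ax
```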
As an example of a difficult problem of non-linear mapping, see figure 4. Here we
try to map a 3-D data set of two interlaced rings onto a 2-D representation space. The
problem has no solution, but the CCA algorithm finds the best compromise
satisfying the 2-D constraint: it breaks the two rings so that the local topology is at
best preserved.

Figure 4. Mapping of a 3-D data set of two interlaced rings onto a 2-D representation
space. a) Input space. b) Output space where the two rings are broken in order to satisfy at
best the 2-D representation. c) The dx/dy distribution showing the local projection and the
complexity of the unfolding.

5. Conclusion: examples of application domains

CCA is a Self-Organising neural network which has been developed as an alternative
solution to Self-Organising Maps and Non-Linear Mapping algorithms. It overcomes
most of the well known drawbacks of these algorithms and is acquiring a sound
theoretical foundation. For example, Demartines [Demartines (1994)] has shown its
possible applications in various fields of data or system analysis like:
- graphical representation of unknown state-variable automata,
- routing of messages in telecommunication networks,
- classification of vowels for speech recognition.
And Cirrincione et al. [Cirrincione et al. (1994)] have applied it to the detection of faults
in electrical circuits.

More recently, CCA has been successfully applied to difficult problems of audio-
visual fusion for vowel recognition in a noisy environment [Teissier et al. (1998)],
and of nuclear physics for the calibration of detectors [Vigneron et al. (1997)].
Some new extensions are given in the companion paper [Guérin-Dugué et al. (1999),
this issue].

Figure 5. Two-dimensional representation of the 20-dimensional data space obtained by
the energies, in 4 orientations and 5 spatial frequency bands, of a set of 72 images. We see
clearly that CCA reveals some semantic organisation: Natural/Artificial scenes on each
side of the grey line, and Open/Closed landscapes along this line.

Another difficult problem has been approached, that of scene categorisation
from spatial statistics of the energy distribution of an image in various frequency
bands and orientations [Hérault et al. (1997)]. An image is analysed by a bank of
spatial filters, according to 4 orientations and 5 frequency bands, ranging from very
low spatial frequencies to medium ones. The global energies of the 20 filters' outputs
constitute the 20-dimensional measure space, and each image is a 20-vector in this
space. By CCA, we have found that a 2-dimensional representation was possible and
that, in this space, the organisation of the data was surprisingly in accordance with
some semantic meaning (see figure 5).

6. References

Borg I. and Groenen P. (1997). Modern Multidimensional Scaling: Theory and
Applications. Springer Series in Statistics.
Cirrincione G., Cirrincione M., Vitale G. (1994). "Diagnosis of Three-Phase Converters
Using the VQP Neural Network". 2nd IFAC Workshop on Computer Software Structures
Integrating AI/KBS Systems in Process Control, Lund, Sweden, 11-13 August 1994, 5 pages.
D'Aubigny G. (1989). L'analyse Multidimensionnelle des Données de Dissimilarités. Thèse
d'état, Université Grenoble I.
Demartines P. (1992). Mesures d'organisation du réseau de Kohonen. In M. Cottrell, editor,
Congrès Satellite du Congrès Européen de Mathématiques: Aspects Théoriques des
Réseaux de Neurones.
Demartines P. (1994). Analyse de données par réseaux de neurones auto-organisés. PhD
thesis, Institut National Polytechnique de Grenoble.
Demartines P. and Hérault J. (1997). Curvilinear Component Analysis: a Self-Organising
Neural Network for Non-Linear Mapping of Data Sets. IEEE Trans. on Neural Networks,
8, 1, 148-154.
Gersho A. and Gray R. M. (1992). Vector Quantization and Signal Compression. Kluwer
Academic Publishers, London.
Guérin-Dugué A., Teissier P., Delso-Gafaro G. and Hérault J. (1999). Curvilinear
Component Analysis for High-dimensional Data Representation: II. Examples of
introducing additional mapping constraints for specific applications. Proceedings of
IWANN'99, Alicante, Spain.
Hérault J., Oliva A., Guérin-Dugué A. (1997). Scene Categorisation by Curvilinear
Component Analysis of Low Frequency Spectra. European Symposium on Artificial
Neural Networks, Bruges, BE.
Kohonen T. (1989). Self-Organisation and Associative Memory. Springer-Verlag, Berlin,
3rd edition.
Kruskal J.B. (1964). Non-metric multidimensional scaling: a numerical method.
Psychometrika, 29:115-129.
Mardia K.V., Kent J.T., and Bibby J.M. (1979). Multivariate Analysis. Academic Press,
London.
Sammon J.W. (1969). A non-linear mapping algorithm for data structure analysis. IEEE
Trans. Computers, C-18(5):401-409.
Shepard R. N. (1962). The analysis of proximities: multidimensional scaling with an
unknown distance function. Psychometrika, vol. 27, pp. 125-139.
Siedlecki W., Siedlecka K., and Sklansky J. (1988). An overview of mapping techniques
for exploratory pattern analysis. Pattern Recognition, 21(5):411-429.
Teissier P., Guérin-Dugué A., Schwartz J.L. (1998). Models for Audiovisual Fusion in a
Noisy-Vowel Recognition Task. Journal of VLSI Signal Processing, vol 20, pp. 25-44.
Vigneron V., Maiorov V., Berndt R., Sanz-Ortega J. J. and Schillebeeckx P. (1997). Neural
network application to enrichment measurements with NaI detectors. VCCSR
Proceedings, Vienna, November 1997.
Curvilinear Component Analysis for
High-Dimensional Data Representation:
II. Examples of Additional Mapping
Constraints in Specific Applications

Anne Guérin-Dugué 1, Pascal Teissier 1,2, Gaspar Delso Gafaro 3, Jeanny Hérault 1

1 LIS-INPG, 46 avenue Félix Viallet, F-38031 Grenoble, France
{Anne.Guerin, Jeanny.Herault}@inpg.fr
2 ICP-INPG, 46 avenue Félix Viallet, F-38031 Grenoble, France
teissier@icp.inpg.fr
3 C/. Creu, 34, SP-17002 Girona, Spain
delso@lix.intercom.es

Abstract. Using a recent algorithm for non-linear mapping, Curvilinear
Component Analysis, we show through three applications how a priori
knowledge can be introduced into the CCA framework, and we translate this
knowledge in terms of mapping constraints. This a priori knowledge can be
introduced to constrain the convergence of the algorithm toward a data
structure having a better interpretation according to the physical process of
input data generation. The three applications concern geographical data
representation, speech recognition and fMRI image processing.

1 Introduction

Efficient feature extraction, non-linear transformation and representation of
high-dimensional data are crucial points in data analysis, as they concern most
practical applications. The theoretical frameworks are usually exploratory statistics
or pattern recognition. On the one hand (MultiDimensional Scaling [Borg & Groenen
1997]), we search for a Euclidean representation of data from their relationships
through a distance matrix (or, more generally, a dissimilarity matrix). On the other
hand (Non-Linear Mapping [Sammon 1969]), these techniques are used to provide a
non-linear dimension reduction (sample coordinates in the input space are known).
In both cases, the choice of the output dimension is a main problem for correctly
configuring the algorithm. Nevertheless, some a priori knowledge can be introduced to
constrain the convergence of the algorithm towards a data structure having a better
interpretation according to the physical process of input data generation.

In this article, starting from a recent algorithm for non-linear mapping, Curvilinear
Component Analysis [Demartines & Hérault 1997], we show how such a priori
knowledge can be introduced into the CCA framework, and we translate this
knowledge in terms of mapping constraints. A companion paper ([Hérault et al.
1999] in this issue) describes a new version of this algorithm adapted to the general
and more realistic case of noisy data. The basic principles are recalled in section 2.
The theoretical constraints are discussed in section 3 and illustrated with three
specific applications in sections 4, 5 and 6, respectively.

2 Presentation of Curvilinear Component Analysis

Curvilinear Component Analysis is a powerful non-linear mapping algorithm
which efficiently unfolds a high-dimensional data structure towards its mean
manifold. The principle is simple and based on the minimisation of a cost function
between the input distances $X_{ij}$ and the output distances $Y_{ij}$ (eq. 1). By mapping
distances, the output representation is invariant under transformations like
translation, rotation, or inversion of axes.

$$E = \sum_{i,j} \left(X_{ij} - Y_{ij}\right)^2 F_\lambda(Y_{ij}), \quad \text{with } F_\lambda(Y_{ij}) = 1 \text{ if } Y_{ij} < \lambda, \text{ else } F_\lambda(Y_{ij}) = 0. \qquad (1)$$

This decreasing function $F_\lambda(Y_{ij})$ allows first a global and then a local unfolding,
with $\lambda$ decreasing progressively in time. The companion paper (in this issue [Hérault
et al. 1999]) describes more precisely a new version of this algorithm where two
data transformation processes are clearly identified (unfolding and projection). A
more robust algorithm is derived in the case of noisy data.

3 Introduction of data constraints

Usually two kinds of constraints on the output data structure are considered
[d'Aubigny 1989]: (i) constraints on the configuration of the output representation
(section 3.1) and (ii) constraints on the relationships between data (section 3.2). In
the following, we present how these two constraints can be taken into account inside
the CCA framework, on both theoretical and experimental aspects.

3.1 Configuration Constraints

For most applications, the output representation is a plane or a hyperplane
("flat" representation). With no a priori knowledge, building the output space with
Euclidean geometry is very useful and convenient. Let us consider the distance
matrix between cities inside a country, or distances between cities all around the
world. In the first case, the data structure can be mapped on a plane. In the second one,
mapping on a sphere would be more convenient ("curved" representation).
To capture this 3D structure, different strategies may be employed:
1. Add a new dimension to the output space to capture the curved structure.
2. Add a penalty term to impose a constant distance (radius) between all the input
samples and an additional sample at the center of the input structure [Borg &
Groenen 1997]. These two terms can be weighted in the cost function.
3. Change the coordinate system from the Cartesian system to the spherical system
and impose a constant radius for all the output samples [Cox & Cox, 1991]. The
output distances are evaluated at the surface of the output sphere. The two free
parameters are the angles for the position in the spherical coordinate system.

From strategies 1 to 3, the spherical constraint on the output representation is
increasingly strongly imposed. For example, some perceptual data coming from
psychological experiments fit well with circular or spherical structures [Drösler 1981,
Ekman 1954, Rogowitz et al. 1998].
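A hedged sketch of strategy 3 (our own, assuming a latitude/longitude parameterisation and a fixed radius): the output distances $Y_{ij}$ are evaluated as great-circle arc lengths on the sphere, and this distance matrix replaces the Euclidean one in the CCA cost.

```python
import numpy as np

def greatcircle_distances(lat, lon, radius=1.0):
    """Geodesic distance matrix between points given by latitude/longitude
    (in radians) on a sphere of constant radius, as required by strategy 3."""
    coslat = np.cos(lat)
    # 3-D Cartesian embedding of the spherical coordinates
    xyz = radius * np.stack([coslat * np.cos(lon),
                             coslat * np.sin(lon),
                             np.sin(lat)], axis=1)
    cosang = np.clip(xyz @ xyz.T / radius**2, -1.0, 1.0)
    return radius * np.arccos(cosang)   # arc length along the sphere surface
```

The two free parameters per sample, here (lat, lon), are then adapted exactly as the Cartesian coordinates are in the flat case, with $Y_{ij}$ taken from this matrix at each step.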

3.2 Relation constraints

3.2.1 Weighting the distance matrix

With the CCA algorithm, the decreasing weighting function $F_\lambda(Y_{ij})$ makes it
possible to discard the mapping of long range distances and to realise, near the
convergence step, only a local mapping of short range distances. Then, starting from
a full distance matrix, this matrix becomes more and more sparse during the
convergence process. Now, let us consider that not all the relationships must be
taken into account (unknown, missing, imprecise measures or non-significant
links...). In the general case, to take these missing values into account, a weighting
coefficient (eq. 2) for each data pair (i, j) is introduced into the cost function. These
coefficients may be definitively set according to a priori information on the input
database, or adaptively estimated according to a predefined strategy (for example a
decreasing function of the output distance $Y_{ij}$ and/or of the mismatch $(Y_{ij} - X_{ij})$).

$$E = \sum_{i,j} w_{ij} \left(X_{ij} - Y_{ij}\right)^2 F_\lambda(Y_{ij}), \quad 0 \leq w_{ij} \leq 1 \qquad (2)$$

The output representation is built from a partial knowledge of the input
relationships. By setting the weighting coefficients of some distances to zero, the
distances between the associated points are never considered in the mapping, and
then there are more degrees of freedom to find the output data structure. This
weighting process is very powerful and makes it possible to take missing or
imprecise data into account. However, it must be handled carefully to avoid a bad
mapping if the distance matrix is too sparse. Figure 1 illustrates this phenomenon
on a very simple example of a 3D input database (fig. 1a) lying on a 2D manifold, a
folded "paper sheet" (800 samples). For this unfolding, we suppose that only the
short range distances are known. Here, for each sample, the 32 nearest distances are
known. That represents only 8% of all the possible distances. Figure 1b shows the
CCA representation on a plane: the global structure of the manifold is not revealed.
If we add more global information by way of some long range distances, this 3D
structure can be unfolded (see figure 1c-d-e-f). For these experiments, the number of
samples for which all the distances are known increases from 4 to 16. These samples
fix the global structure and are called "anchor points". In this example, a new
"anchor point" (selected by vector quantization in the input database) provides only
0.01% supplementary distances.

Z~;4",
" ~--o ~ s,;+?)-'.,

(a) (b) (c)

x:=

-).~: ~:: "~?-i,.~'§ ~,;"/i'~':'r.

y:-..= :~§

(d) (e) (0
Fig. 1. (a) Original 3D data set, (b) Unfolding from a sparse matrix, only with short range
distances, 8% of distances are known, (c) Unfolding with short range distances and 4 "anchor
points", (d) 8 "anchor points", (e) 12 "anchor points", (f) 16 "anchor points".

In figures 1c-d-e, twists appear in the output data representation: the proportion of
long range distances remains too small. In figure 1f, the structure is completely
unfolded considering only 16 "anchor points" uniformly distributed in the input
database. In sections 5 and 6, we present two applications using this specific
weighting of the distance matrix (a minimal sketch of the sparse weighting follows).
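The sketch below (Python/NumPy; the choice of 32 nearest distances plus a few anchor points mirrors the experiment of fig. 1, but the construction itself is our illustrative assumption) builds the binary weight matrix of eq. (2):

```python
import numpy as np

def sparse_weight_matrix(X, k=32, anchors=()):
    """w_ij = 1 for the k shortest distances of each sample and for every
    pair involving an 'anchor point' (all its distances known); 0 elsewhere."""
    n = X.shape[0]
    w = np.zeros((n, n))
    nearest = np.argsort(X, axis=1)[:, 1:k + 1]   # skip the zero self-distance
    rows = np.repeat(np.arange(n), k)
    w[rows, nearest.ravel()] = 1.0
    w[list(anchors), :] = 1.0                     # anchor points fix the global structure
    w[:, list(anchors)] = 1.0
    np.fill_diagonal(w, 0.0)
    return np.maximum(w, w.T)                     # keep the weight matrix symmetric
```

Inside the minimisation, a pair (i, j) with $w_{ij} = 0$ is then simply never visited, which realises eq. (2) with binary weights.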

3.2.2 Supervised unfolding of a complex data structure: a sequential approach

When the data structure combines different overlapped structures, they can be
separately enhanced if sufficient a priori information is known to define an optimal
procedure. For example, let us consider the simple example of artificial data in
figure 2a where two structures coexist: a global one on a 1D parabolic shape, and a
local one for each cluster on a 3D spherical shape (fig. 2a). Firstly, the aim of the
non-linear mapping is the capture of the cluster organization. Secondly, the non-linear
mapping flattens the global organization. The justification of this procedure
is, for example, to optimally preprocess the data in order to use a simpler classifier
[Teissier et al. 1998] (see section 5). In order to realize this preprocessing by CCA,
a sequential approach is defined (a minimal sketch follows this discussion):
1. Flatten the global structure (fig. 2b) by choosing the output dimension as the
intrinsic dimension of this global manifold (here 1). The intrinsic structure of
each cluster is then lost.
2. From this organization, process a second CCA with a supplementary dimension,
up to the intrinsic dimension of each cluster (here 2 and then 3). For each new
stage, the initialization step keeps the configuration of the final previous step for
the first dimensions and initializes the new dimension at random. Figure 2c
illustrates this process in two dimensions: the clusters are circular and the global
manifold is flattened. For the third CCA, in three dimensions, the clusters are
spherical on the same flattened global manifold.

Through the joint distribution dx/dy after the three CCAs, we globally observe the
same behavior: a distribution (fig. 2d) in eight packets (first packet: within-class
distances; other seven packets: between-class distances) and an unfolding process for
long range input distances. But the distribution of the within-class distances shows
differences. By CCA with 1 dimension, projection mainly occurs on the within-class
distances (fig. 2e). By CCA with 1 then 2 dimensions, both behaviors (projection
and unfolding) coexist (fig. 2f). Finally, by CCA with 1 then 2 then 3 dimensions,
the matching is almost done (fig. 2g). An illustration of this procedure is given in
section 5 on a real application in speech recognition.
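The sequential schedule itself can be sketched as follows (Python; `cca_refine` stands for any CCA optimiser taking an initial output configuration, for instance the unfolding sketch of the companion paper; the name and the small offset scale are our assumptions):

```python
import numpy as np

def sequential_cca(X, cca_refine, dims=(1, 2, 3), seed=0):
    """Run CCA with increasing output dimension: each stage inherits the
    coordinates of the previous one and initialises the new axis at random."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    y = rng.normal(size=(n, dims[0]))            # random start in the first dimension
    for p in dims:
        if y.shape[1] < p:                       # append the supplementary dimension(s)
            offset = rng.normal(scale=0.01, size=(n, p - y.shape[1]))
            y = np.hstack([y, offset])
        y = cca_refine(X, y)                     # refine from the inherited state
    return y
```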


Fig. 2. (a) 2D view of the 3D original data, (b) Organization after CCA(1D), (c)
Organization after CCA(1 then 2D), (d) dx/dy distribution after CCA(1 then 2D), (e) Zoom
on the intra-class dx/dy distribution after CCA(1D), (f) idem after CCA(1 then 2D), (g) idem
after CCA(1 then 2 then 3D).

4 Spherical data structure for unfolding of geographical data

In order to illustrate the spherical representation, let us consider the distances (as
the crow flies) between towns all around the world. A flat representation on a plane is
not convenient for this database. Figure 3a-b illustrates the result of the CCA
algorithm in two dimensions. Unfolding occurs for long range distances, as is
visible on the dx/dy graph (fig. 3a). In fact, in this database, the samples are not
uniformly distributed on the manifold. There are large regions without towns
(oceans, deserts...). So, there are many cuts where the matching constraint has been
relaxed, i.e. when long range distances are superior to the threshold of the weighting
function $F_\lambda(Y_{ij})$ (fig. 3b, cuts inside the Pacific Ocean and the Indian Ocean). Towns
inside areas which are more uniformly sampled are globally well organized (for
example European countries). There are some accidental twists (for example for the
position of "Manille"). As we know that the optimal solution lies on a sphere, this
plane representation is obviously inappropriate.
Fig. 3. CCA in 2 dimensions from the distances between towns all around the world. (a)
dx/dy distribution, (b) Town positions on the output plane.

Here, the true positions of the towns are known, so the quality of the
representation can be estimated by the residue of a Procrustes rotation [in Borg &
Groenen 1997] fitting this representation to the true one. This mean
squared error is 12, and falls down to 3·10⁻⁴ with the constraint of a spherical
output space implemented with strategy 3 (see section 3.1).
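A sketch of this quality measure (an orthogonal Procrustes alignment in NumPy; our own rotation-only variant, which may differ in detail from the procedure of [Borg & Groenen 1997]):

```python
import numpy as np

def procrustes_residual(Y, Z):
    """Mean squared error between a configuration Y and a target Z after
    centring both and applying the optimal rotation (orthogonal Procrustes)."""
    Yc, Zc = Y - Y.mean(axis=0), Z - Z.mean(axis=0)
    U, _, Vt = np.linalg.svd(Zc.T @ Yc)
    R = (U @ Vt).T                    # rotation aligning Yc onto Zc
    return ((Yc @ R - Zc) ** 2).mean()
```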

5 Sequential unfolding of a noisy data structure in speech recognition

5.1 Data set description

The data set is composed of 100 repetitions of each of the 10 French oral vowels
[a, i, e, ɛ, u, o, ɔ, y, œ, ø] pronounced in isolation by a single speaker. Noisy
acoustic signals were obtained by adding various amounts of white Gaussian noise to
the temporal stimuli (24 dB, 12 dB, 6 dB, 0 dB, -6 dB, -12 dB and -24 dB). Acoustic
data are the normalized spectral components of the speech signal in 20 frequency
bands. The acoustic observations thus lie in a 20-dimensional space. But it is well
known in phonetics that vowels can be well represented in a 2D triangular shape
called the "vowel triangle", organized according to the first two formants (fig. 4a).
Moreover, with noise, the vocalic triangular shape is distorted according to a
progressive shrinking of the convex shape of the vowel clusters.
Fig. 4. (a) Location of the vowels in a 2D acoustic space (first two formants F1/F2, in Hz),
(b) Output organization after CCA(2D), (c) Output organization after constrained CCA(3D)
(see section 5.2) [Teissier et al. 1998].

5.2 Sequential Unfolding of the Acoustic Data Set

In this application, the unfolding process concerns two data structures which are
linked together. The first one is the organization of the vowel structure at each noise
level, and the second one is the evolution of this organization through the noise.
This evolution is seen as a trajectory for each cluster, which depends on the
interaction between each vowel and the noise. In [Teissier et al. 1998], we have
shown that CCA firstly reveals the intrinsic audio data structure (fig. 4b), and
secondly can be constrained to unfold the trajectory of each cluster disturbed by
noise (fig. 4c). This is done by combining four constraints:
1. Supervised data representation: data are sequentially presented to the network
from the level "without noise" to the noisiest level (-24 dB).
2. Output space configuration: two dimensions are enough to unfold the level
"without noise" (fig. 4a-b). For the following levels, a third dimension is added in
order to capture this new degree of freedom (see section 3.2.2).
3. Initialization: the initialization of the output samples for a given level is set
from the coordinates of the output samples of the immediately inferior level, adding
a positive random offset on the third coordinate. Then, for a new level i, the
initialization state is the final state of the level i-1 (see section 3.2.2).
4. Sparse distance matrix: with this sequential process (level 0 "without noise",
level 1, ..., level i, ...), the number of output samples increases at each level, and
consequently so does the dimension of the input and output distance matrices. These
two matrices $X_{ij}$ and $Y_{ij}$ are not full (see section 3.2.1): only distances between
samples inside the same level of noise are known, as well as distances with samples
inside the immediately inferior level. These matrices are then structured by blocks.

5.3 Results Presentation

In this section, we illustrate this preprocessing for a recognition task of the ten
vowels with several levels of noise. Two preprocessing stages and two classifiers are
tested (Principal Component Analysis in 3D, fig. 5b, and constrained CCA in 3D,
fig. 4c; a simple Gaussian classifier, legend 'SG' in fig. 5a, and a mixture of Gaussians
classifier, legend 'MG' in fig. 5a). The regularization of the cluster trajectories in the
noise allows the use of a simpler classifier, as illustrated in figure 5a.
(a) (b)
Fig. 5. (a) Cross comparison between two classifiers (simple Gaussian classifier, SG, and
mixture of Gaussians classifier, MG) and two preprocessings (constrained CCA 3D and PCA
3D), (b) Output representation after PCA(3D).

6 Unfolding or flattening? Cortical flattening for fMRI

6.1 Database description

The ability to measure cortical activity in addition to anatomical structure is an
important breakthrough for studying the activity of human brains at relatively high
spatial resolution [Teo et al. 1997, Tootell et al. 1996]. For this purpose, functional
Magnetic Resonance Images (fMRI) are a very interesting investigation tool. These
maps are computed from a 3D image of the cortex, but this tissue is highly
convoluted. So, firstly, the visualization inside the various gyri and sulci is difficult,
and secondly, a 3D representation by 3D reconstruction or by a stack of 2D slices is
not a convenient tool for an easy investigation of the cortical structure. For these
reasons, flattening and unfolding of the cortical structures are presented as suitable
representation means, balancing on the one hand an easy 2D or 3D view, and on the
other hand the distortions due to the non-linear geometric space transformation
[Drury et al. 1996].

The database consists of 3D samples at the cortical surface obtained by image
segmentation (fig. 6a-b). The selected samples are on the edges between the cerebral
spinal fluid and the gray matter. The number of samples depends on the size of the
cortical surface to be represented.

6.2 Cortical unfolding versus flattening

In this framework, the words "Unfolding" and "Flattening" have a precise
meaning [Drury et al. 1996]:
Unfolding: this process is similar to inflating a crumpled balloon, except that the
surface is not stretched. This process reduces the curvature while preserving
local area.
Flattening: to completely flatten the cortical tissue, unfolding is not sufficient,
due to the intrinsic curvature of the cortical ribbon. It is necessary to introduce
artificial discontinuities by cutting off specific relationships inside identified sulci.

The link with multidimensional scaling is now evident: for flattening, we use a
"flat" Euclidean geometry and for unfolding, a "curved" geometry (see section 3.1).
In both cases, the distance matrices are sparse (local mapping). Furthermore, the
introduction of the necessary discontinuities for the flattening process is simply
realized by not considering the associated input and output distances.

6.3 Results presentation

In the CCA framework, we show that the cortical flattening representation can be
obtained from a very sparse input distance matrix:
1. Local distances are considered, to preserve the local topology.
2. Some long range distances evaluated at the cortical surface are
considered, to take into account the global structure ("anchor points")
and to avoid the "twists" phenomenon, as explained in section 3.2.1.

Fig. 6. (a) Original image: 2D slice, (b) Edge detection between scalp, cerebral spinal fluid,
gray matter and white matter, (c) Example of flattening of an area in the occipital part of the
cortex.

For this example, the number of nodes is 4203 (approximately 40 cm²). This flattened
representation is obtained with sparse distance matrices (mean number of local
distances per sample: 32; number of "anchor" points: 16) in 200 iterations.

Conclusion

CCA is a self-organizing neural network, efficient at unfolding complex high-
dimensional data structures toward their mean manifold. A companion paper [Hérault
et al. 1999] describes the theoretical principles and its extension to the case of noisy
data. Here, we have presented several strategies for taking additional
mapping constraints into account, in order to converge toward the best output
representation according to a priori knowledge of the physical application. These
strategies have been illustrated with three different applications (geography, speech
recognition and fMRI analysis).

References

d'Aubigny G., L'analyse Multidimensionnelle des Données de Dissimilarités, Thèse d'état,
Université Grenoble I, 1989.
Borg I., Groenen P., Modern Multidimensional Scaling: Theory and Applications, Springer
Series in Statistics, 1997.
Cox T.F., Cox M.A.A., Multidimensional Scaling on the Sphere, Communications in
Statistics, 20, pp. 2943-2953, 1991.
Demartines P. and Hérault J., Curvilinear Component Analysis: a Self-Organising Neural
Network for Non-Linear Mapping of Data Sets, IEEE Trans. on Neural Networks, vol 8,
no. 1, pp. 148-154, 1997.
Drösler J., The empirical validity of multidimensional scaling, in Borg I. (Ed.),
Multidimensional data representation: when and why, pp. 627-651, Ann Arbor,
MI: Mathesis Press, 1981.
Drury H.A., Van Essen D.C. et al., Computerized Mappings of the Cerebral Cortex: A
Multiresolution Flattening Method and a Surface-Based Coordinate System, Journal of
Cognitive Neuroscience, vol 8, no. 1, pp. 1-28, 1996.
Ekman G., Dimensions of color vision, Journal of Psychology, vol 38, pp. 467-474, 1954.
Hérault J., Jausions-Picaud C., Guérin-Dugué A., Curvilinear Component Analysis for high
dimensional data representation: I. Theoretical aspects and practical use in the presence
of noise, submitted to IWANN'99, June 2-4, Alicante, Spain, 1999.
Sammon J.W., A nonlinear mapping algorithm for data structure analysis, IEEE Trans.
Computers, vol C-18, no. 5, pp. 401-409, 1969.
Rogowitz B.E., Frese T., Smith J.R., et al., Perceptual Image Similarity Experiments, in
Human Vision and Electronic Imaging, B.E. Rogowitz and T.N. Pappas (Eds.),
Proceedings of the SPIE, no. 3299, San Jose, CA, USA, January 26-29, 1998.
Teissier P., Guérin-Dugué A., Schwartz J.L., Models for Audiovisual Fusion in a Noisy-
Vowel Recognition Task, Journal of VLSI Signal Processing, vol 20, pp. 25-44, 1998.
Teo P.C., Sapiro G., Wandell B.A., Creating Connected Representations of Cortical Gray
Matter for Functional MRI Visualisation, IEEE Trans. on Med. Imaging, vol 16, no. 6,
pp. 852-863, 1997.
Tootell R., Dale A., Sereno M., Malach R., New images from human visual cortex, Trends in
Neurosciences, vol 19, no. 11, pp. 481-489, 1996.
Image Motion Analysis Using Scale Space
Approximation and Simulated Annealing

Vicenç Parisi Baradad 1, Hussein Yahia 2, Jordi Font 3, Isabelle Herlin 2, and
Emili Garcia-Ladona 3

1 AHA, Dept. Enginyeria Electrònica,
UPC, Colom no. 1, 08222 Terrassa, Spain
2 INRIA Rocquencourt, BP 105,
78153 Le Chesnay Cedex, France
3 Dept. Geologia Marina i Oceanografia Física,
Institut de Ciències del Mar, CSIC, Barcelona, Spain

Abstract. This paper addresses the problem of motion estimation in
sequences of remotely sensed images of the sea. When the temporal
sampling rate is low, the estimation of the velocity field can be done by
finding the correspondence between structures detected in the images.
The scale space approximation of these structures using the wavelet
multiresolution is presented. The correspondence is solved using a simulated
annealing technique which assures the convergence to high quality solutions.

1 Introduction

The study of oceanographic phenomena like vortices, dipole rings and fronts
involves processing sequences of remotely sensed images, like the Sea Surface
Temperature images obtained with the AVHRR sensor [7], to analyze the dynamics
using mathematical spatial modelling, and estimation of the motion field to model
the temporal evolution [5].

The first attempts at computing image motion in oceanographic images [4],
to get sea surface currents, consist in identifying the maximum cross correlation
(MCC) between extracted windows in consecutive images. As pointed out
in [3], this gives poor results in zones of high rotational motion. They propose to solve
this insensitivity to rotation by formulating the problem as a correspondence
between selected image tokens located in consecutive images. This correspondence
is found using a Hopfield neural network that minimizes a cost function which
quantifies the differences between the tokens.

Each one of the tokens in an image has to be described in a manner that makes it
possible to compare it with all the tokens in successive images. This description
is made through features that quantify its characteristics (area, eccentricity,
curvature, ...). They characterize the token locally when they result from an
analysis of a small region around it, or globally, as the region of analysis grows.
In fact, token selection is made after an analysis of the image, looking for
those regions which present conservative features. Thus, when object occlusion is
probable, local features are preferred, as the number of tokens to track will be
higher.

Fig. 1. Geometric construction used for the estimation of the features of a corner
Côté [3] looks for tokens in contours of strong spatial gradients of temperature.
They select as tokens those points of the gradient with highly curved
shapes. Selection of these tokens avoids the aperture problem, which refers to the
ambiguity in interpreting the translation of a point without salient characteristics,
located on a moving edge seen through an aperture.

Though these tokens make the method robust against the aperture problem,
and it performs well when dealing with cloudy images that can occlude parts of
the contours, it meets the problem cited above: the velocity field computed is
very sparse. So, in order to obtain a denser field, it is necessary to look for
global descriptors which include information to better distinguish those close
points that are very similar locally.

It is proposed to characterize each point on the contours using the segments
at each side of the point, analyzing the geometric characteristics of this corner
when the contour is approximated at different scales. The lower scales will take
into account local information, and as the scale rises, more global information is
used.

This representation obtains, for each point of the contour, the features of the
corner formed by the point and the two segments of the contour leaving it in
opposite directions, when approximated at different scales.

At the lowest scale the closest points correspond to the neighbouring pixels
of each point, and at higher scales they correspond to an approximation of the
evolution of the contour.

The features used will be the local position (x, y) of each point on the contour,
the angle $\varphi$ between segments, and the orientation $\theta$ of BK, appearing in Fig. 1,
at each scale of representation.

The positions of the most similar pairs of tokens in successive images indicate
the apparent motion. Let $p_i$ be the i-th token in an image and $p_j$ the j-th token
in the successive image; their associated features are $f_{i,k}$ and $f_{j,k}$ respectively,
where the index k varies from 1 to $N_f$. The difference between tokens is

$$\mathrm{diff}(p_i, p_j) = \sum_{k=1}^{N_f} \left(f_{i,k} - f_{j,k}\right)^2 \qquad (1)$$
The Hopfield neural network is useful to solve combinatorial optimization
problems when the initial state of the network is adequately chosen in order to
arrive at a global minimum [6], but these initial settings are difficult to achieve
when the number of variables in the function to minimize grows.

To overcome this drawback, it is proposed to use the simulated annealing scheme,
which allows convergence to a global minimum regardless of the initial
state of the neurons, thanks to the possibility of increasing the energy of the network
when it arrives at a local minimum.

This paper is organized as follows: section 2 gives the basics of the scale space
approximation using multiresolution analysis; in section 3 the principles of
multiresolution analysis are applied to find features of the points in discrete
curves. The cost function that represents the correspondence problem and its
minimization using simulated annealing appear in section 4. Section 5 shows an
exemplification of the proposed method using curves that have been displaced,
rotated and deformed. Finally, the conclusions and future work appear in section 6.

2 Multiresolution analysis

In this section we give the fundamentals of multiresolution analysis in the
space $L^2(\mathbb{R})$ of square-integrable functions of one real variable.
Detailed insight can be found in [8].

An approximation of a function $f \in L^2(\mathbb{R})$ at the resolution j is obtained
by the action of a linear operator $A_j$ such that $A_j f \in V_j$, where $V_j$ is a subspace
of $L^2(\mathbb{R})$.

A multiresolution analysis consists in the approximation of a function in
different subspaces $V_j, V_{j+1}, V_{j+2}, \ldots$ embedded one into each other, such that
the change from one subspace to the other is the result of a scale change.

There exists a function $\varphi(x) \in L^2(\mathbb{R})$, called the scaling function, that by
dilations and translations generates an orthonormal basis of $V_j$, $\varphi_{j,n}(x)$ being the
basis functions constructed with the relation:

$$\varphi_{j,n}(x) = 2^{-j/2}\, \varphi\!\left(2^{-j} x - n\right), \quad n \in \mathbb{Z} \qquad (2)$$

Different scaling functions $\varphi(x)$ can be used to construct by translation a
basis of the subspace $V_0$, provided they satisfy mutual orthogonality.

The approximation $A_j f$ of the function f at the scale j is then given by

$$A_j f = \sum_n \left\langle f, \varphi_{j,n} \right\rangle \varphi_{j,n} = \sum_n a^j_n\, \varphi_{j,n} \qquad (3)$$
It can be defined, for each $V_j$, its orthogonal complement $W_j$ such that

$$V_{j-1} = V_j \oplus W_j \qquad (4)$$

and there exists a function $\psi(x) \in L^2(\mathbb{R})$, called the wavelet, that can generate
an orthonormal basis of $W_j$. The basis functions are $\psi_{j,n}(x)$ and are constructed
using the relation

$$\psi_{j,n}(x) = 2^{-j/2}\, \psi\!\left(2^{-j} x - n\right), \quad n \in \mathbb{Z} \qquad (5)$$

The passage from the approximation of function f at the scale j to the scale j-1
can be done with the relation

$$A_{j-1} f = A_j f + \sum_n \left\langle f, \psi_{j,n} \right\rangle \psi_{j,n} \qquad (6)$$

The notation $d^j_n = \langle f, \psi_{j,n} \rangle$ and $D_j f = \sum_n d^j_n\, \psi_{j,n}$ lets us write the
reconstruction expression as

$$A_{j-1} f = A_j f + D_j f \qquad (7)$$

One of the oldest wavelets is the Haar wavelet. The approximation subspaces
$V_j$ in this decomposition are the sets of functions of constant amplitude over intervals
of length $2^j$:

$$V_j = \left\{ f \in L^2(\mathbb{R}) \mid \forall k \in \mathbb{Z},\ f|_{[2^j k,\, 2^j(k+1)[} = \text{constant} \right\} \qquad (8)$$
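As a hedged illustration of eqs. (7) and (8) (our own minimal NumPy sketch, not the paper's implementation), the Haar approximation $A_j f$ of a sampled signal replaces each block of $2^j$ consecutive samples by its average, and the detail $D_j f$ is the remainder:

```python
import numpy as np

def haar_approx(f, j):
    """Piecewise-constant Haar approximation A_j f of a sampled signal:
    every block of 2**j consecutive samples is replaced by its mean (eq. 8)."""
    block = 2 ** j
    f = np.asarray(f, dtype=float)
    means = f[: len(f) // block * block].reshape(-1, block).mean(axis=1)
    return np.repeat(means, block)

f = np.sin(np.linspace(0, np.pi, 32))   # illustrative sampled signal
A1 = haar_approx(f, 1)
D1 = f - A1                             # detail term: A_{j-1} f = A_j f + D_j f (eq. 7)
```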

3 Description of a discrete curve as constant slope segments

Waku [10] performs multiresolution analysis of discrete curves using the
algorithms described above iteratively. Though similar, we aim to analyze each point
of the curve and obtain, at each scale, the approximation of the segments
emanating from the point. In the analysis proposed in [10], when the resolution
decreases, the curve has fewer points, due to the undersampling effects of the
iterative algorithms used, and the information of some points is embedded in
the approximated segments; so an alternative is proposed.

Given the i-th pixel $p_{ij}$, located at $\{x_{ij}, y_{ij}\}$, in a discrete curve $c_j$ composed
of $N_j$ pixels, the segments $pc^0_{ij}$ and $nc^0_{ij}$ leaving the pixel at opposite sides are
approximated at different scales (p means 'positive' and identifies the segment whose
pixels are obtained by incrementing the index i up to $N_j$, and n means 'negative'
and identifies the segment whose pixels are obtained by decrementing the index
i down to 1):

$$pc^0_{ij} = \{x_{kj}, y_{kj},\ i+1 \leq k \leq N_j\} \qquad (9)$$
$$nc^0_{ij} = \{x_{kj}, y_{kj},\ 1 \leq k \leq i-1\} \qquad (10)$$
Superindex 0 symbolizes the highest resolution, given by the spatial sampling
of the images. A continuous curve interpolates each pixel at its center for each
segment. These continuous interpolating curves are $pc^0_{ij} = \{\vec{x}^{\,0}_{ij}(s), \vec{y}^{\,0}_{ij}(s)\}$ and
$nc^0_{ij} = \{\bar{x}^{\,0}_{ij}(s), \bar{y}^{\,0}_{ij}(s)\}$. As the pixels are on a grid, these coordinate
functions can be described using piecewise constant slope curves built from
elementary pieces of the form

$$\Gamma\!\left(\frac{s - l\,\Delta p/2}{l\,\Delta p}\right) m\, s, \quad \text{with } l \in \{1, \sqrt{2}\} \text{ and } m \in \left\{0, \pm\tfrac{\sqrt{2}}{2}, \pm 1\right\} \qquad (11)$$

where $\Gamma(s) = 1$ if $-S/2 \leq s < S/2$ and $\Gamma(s) = 0$ if $s < -S/2$ or $s > S/2$.
The evolution of x and y over all the pixels, from $\{x_{ij}, y_{ij}\}$ to both extremes of the
contour, is then expressed as

$$\sum_{k=i+1}^{N_j} \Gamma\!\left(\frac{s - l_k\,\Delta p/2}{l_k\,\Delta p}\right) m_k\, s \qquad (12)$$

for the segment between $\{x_{ij}, y_{ij}\}$ and $\{x_{N_j j}, y_{N_j j}\}$, and

$$\sum_{k=i-1}^{1} \Gamma\!\left(\frac{s - l_k\,\Delta p/2}{l_k\,\Delta p}\right) m_k\, s \qquad (13)$$

for the segment between $\{x_{ij}, y_{ij}\}$ and $\{x_{1j}, y_{1j}\}$, with $l_k \in \{1, \sqrt{2}\}$ and
$m_k \in \left\{0, \pm\tfrac{\sqrt{2}}{2}, \pm 1\right\}$.

Approximation at each interesting point. The segments $pc^0_{ij} = \{x_{kj}, y_{kj}\}$,
$i+1 \leq k \leq N_j$, and $nc^0_{ij} = \{x_{kj}, y_{kj}\}$, $1 \leq k \leq i-1$, leaving each interesting
point have to be approximated at different scales, and to do so it has to be
decided which are the basis functions of the approximation subspaces. As the
angle between these segments has to be calculated at all the scales, it seems
reasonable to look for bases which always lead to a geometric configuration like
that shown in Fig. 1. The approximation is done by projecting each of the four
coordinate functions $\vec{x}^{\,0}_{ij}(s)$, $\vec{y}^{\,0}_{ij}(s)$ and $\bar{x}^{\,0}_{ij}(s)$, $\bar{y}^{\,0}_{ij}(s)$
onto the scaling-function basis, e.g.

$$A\vec{x}^{\,0}_{ij}(s) = \sum_m \left\langle \vec{x}^{\,0}_{ij}(s), \varphi_{m,n} \right\rangle \varphi_{m,n} \qquad (14\text{-}17)$$

and similarly for the three other coordinate functions.

So the approximations are made up of the scaled basis functions, and we will
get the same kind of geometric configuration if $\varphi_{m,n}$ has the same form as
$\vec{x}^{\,0}_{ij}(s)$, $\vec{y}^{\,0}_{ij}(s)$ and $\bar{x}^{\,0}_{ij}(s)$, $\bar{y}^{\,0}_{ij}(s)$. As these are piecewise
constant slope functions, the Haar scaling function is chosen.
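As a hedged sketch of the resulting scale-dependent corner features (our own simplification: the scale-j approximation of each segment is reduced to the chord towards the pixel 2^j steps away, and the orientation θ is taken on the corner bisector, since the exact definition of BK in Fig. 1 is not reproduced here):

```python
import numpy as np

def corner_features(curve, i, j):
    """Corner features of pixel i of a discrete curve at scale j: position,
    angle phi between the two segments leaving the pixel (approximated by
    the chords to the pixels 2**j steps away) and orientation theta of the
    corner bisector (our stand-in for the orientation of BK in Fig. 1)."""
    p = np.asarray(curve, dtype=float)           # curve as an (N, 2) pixel array
    span = 2 ** j
    fwd = p[min(i + span, len(p) - 1)] - p[i]    # 'positive' segment direction
    bwd = p[max(i - span, 0)] - p[i]             # 'negative' segment direction
    nf = max(np.linalg.norm(fwd), 1e-12)         # guards against the curve extremes,
    nb = max(np.linalg.norm(bwd), 1e-12)         # where a segment degenerates
    phi = np.arccos(np.clip(fwd @ bwd / (nf * nb), -1.0, 1.0))
    bis = fwd / nf + bwd / nb
    theta = np.arctan2(bis[1], bis[0])
    return p[i, 0], p[i, 1], phi, theta
```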

4 Solution of the correspondence problem using simulated annealing

Given a sequence of images, there is a set of $N_{p_{i+1}}$ interesting points in image
i+1 that has to be identified with a set of $N_{p_i}$ interesting points in image i, where
$i \in \{1..M-1\}$, M being the number of images in the sequence.

A matrix of $N_{p_i}$ rows by $N_{p_{i+1}}$ columns represents the correspondence
problem. An element (k, l) of the matrix set to $v_{k,l} = 1$ indicates that interesting
point $p_{i,k}$ matches interesting point $p_{i+1,l}$; the rest of the elements in row k and
column l have to be set to 0.

The cost function that quantifies the differences between the pairs of pixels
in consecutive images is

$$C = w_1 \sum_{k=1}^{N_{p_i}} \sum_{l=1}^{N_{p_{i+1}}} v_{k,l}\, \mathrm{diff}(p_{i,k}, p_{i+1,l}) + w_2 \sum_{l=1}^{N_{p_{i+1}}} \left(\sum_{k=1}^{N_{p_i}} v_{k,l} - 1\right)^2 + w_3 \sum_{k=1}^{N_{p_i}} \left(\sum_{l=1}^{N_{p_{i+1}}} v_{k,l} - 1\right)^2 \qquad (18)$$

The first term in this expression quantifies the similarity between pixels; the
second and third terms represent the constraint that a pixel in an image can
be associated with just one pixel in the next image, or not associated at all.
In [9] a Hopfield neural network is utilized to minimize this equation, using
the trajectories of a set of drifting buoys as calibration data. The synaptic interconnections
and external inputs of the neurons are calculated by identifying the
above cost function with the energy function of the neural network,

$E = -\frac{1}{2} \sum_{k=1}^{N_{p_i}} \sum_{l=1}^{N_{p_{i+1}}} \sum_{m=1}^{N_{p_i}} \sum_{n=1}^{N_{p_{i+1}}} T_{kl,mn}\,v_{k,l}\,v_{m,n} - \sum_{k=1}^{N_{p_i}} \sum_{l=1}^{N_{p_{i+1}}} I_{k,l}\,v_{k,l}$   (19)

As the only accepted transitions for the neurons are those which decrease the
energy, the initial state of the network has to be set carefully in order to arrive
at a good solution and not get trapped in a local minimum.

Simulated annealing [1] permits escaping from these local minima and arriving
at a high-quality solution without depending on the choice of the initial
state.

The neurons change their state according to a probability P that favours
energy-decreasing transitions and depends on a 'temperature' parameter T as

$P(v_{k,l} \to v'_{k,l}) = \begin{cases} 1 & \text{if } \Delta E < 0 \\ e^{-\Delta E / T} & \text{otherwise} \end{cases}$   (20)

At the initial stage T is high, allowing changes that increase the energy, thus
escaping from local minima, and it is cooled until it reaches zero or some small
positive value, where only the energy-decreasing transitions are accepted.

A polynomial-time cooling schedule [2] is used to decrease the temperature
slowly, assuring the convergence to a global minimum.
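The following sketch illustrates the annealing loop on the correspondence matrix; it uses a simple geometric cooling schedule in place of the polynomial-time schedule of [2], and diff stands for the precomputed feature distances of the first term of equation (18). All identifiers are illustrative.

    import numpy as np

    def anneal_correspondence(diff, T0=1.0, Tmin=1e-3, cooling=0.95, sweeps=50):
        # diff[k, l]: feature distance between interesting point k of image i
        # and point l of image i+1. The binary matrix v is explored by
        # flipping single entries and accepting with the Metropolis rule.
        K, L = diff.shape
        v = np.zeros((K, L), dtype=int)
        rng = np.random.default_rng(0)

        def cost(m):
            c = np.sum(m * diff)                      # similarity term
            c += np.sum((m.sum(axis=1) - 1) ** 2)     # one match per row
            c += np.sum((m.sum(axis=0) - 1) ** 2)     # one match per column
            return c

        T, current = T0, cost(v)
        while T > Tmin:
            for _ in range(sweeps):
                k, l = rng.integers(K), rng.integers(L)
                v[k, l] ^= 1                          # propose flipping one neuron
                dE = cost(v) - current
                if dE < 0 or rng.random() < np.exp(-dE / T):
                    current += dE                     # accept the transition
                else:
                    v[k, l] ^= 1                      # reject: undo the flip
            T *= cooling                              # cool the temperature
        return v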

5 Results

To exemplify this method, this section shows the analysis of two images consisting
of three curves. The features at each pixel of the curves are computed using
different scales. Fig. 2 and Fig. 3 show the features found for two adjacent
pixels; they illustrate the capability of this characterization to distinguish
neighbouring pixels in cases where the curves have no especially salient local
features.

The correspondence between the pixels, found by the simulated annealing, is
shown in Fig. 4. Note that the method fails at the pixels located at the extremes
of the curves; this is due to the fact that each pixel is analyzed at increasing
scales until they involve the extreme of the curve closest to the pixel, so for
pixels located at the extremes the only information used is their position.

6 Conclusions

Estimating motion in sequences of images by solving the correspondence between
pixels requires a good characterization of the pixels to be tracked and a method
to minimize the cost function that quantifies the accuracy of the correspondences
found.

A method to obtain features that characterize pixels in discrete curves has been
presented. This method has the ability to characterize the pixel both locally and
globally; thus, computing the apparent motion as a correspondence between these
pixels increases its robustness against occlusions, and the capability to distinguish
between adjacent pixels yields a denser velocity vector field.

A high-quality solution is obtained by using simulated annealing to minimize
the cost function; this overcomes the problems present in the Hopfield neural
network when it gets stuck in a local minimum.

Though this method solves the correspondence in polynomial time, it seems
necessary to speed up the minimization procedure, so future work calls for
incorporating faster methods, like mean field annealing.

By now the scale space approximation is obtained at all the possible scales,
the largest one using all the pixels in the contour. A study of the conservation
of the features at different scales would help to describe each pixel better and
improve the estimation of the motion field.
Fig. 2. Features of a point in a curve when this is approximated at different scales

Fig. 3. Features of a point in a curve when this is approximated at different scales

Fig. 4. Correspondence between several curves in an image

7 Acknowledgments

This work has been undertaken in the framework of the Mediterranean Targeted
Project (MTP phase II-MATER). We acknowledge the support from the
European Commission's Marine Science and Technology Programme (MAST
III) under contract MAS-CT96-0051.

References

1. Aarts, E., Korst, J.: Simulated annealing and Boltzmann machines: a stochastic
approach to combinatorial optimization and neural computing. John Wiley & Sons
(1989)
2. Aarts, E., Van Laarhoven, P.: A new polynomial time cooling schedule. Proc. IEEE
Int. Conf. on Computer Aided Design. Santa Clara (1985) 206-208
3. Côte, S., Tatnall, A.R.L.: Estimation of ocean surface currents from satellite imagery
using a Hopfield neural network. Third Thematic Conference on Remote
Sensing for Marine and Coastal Environments I, Seattle (1995) 538-548
4. Emery, W.J.: An objective method for computing advective surface velocities from
sequential infrared satellite images. Journal of Geophysical Research, 91, (1986)
12865-12878
5. Herlin, I.L., Cohen,I., Bouzidi S.: Image processing for sequences of oceanographic
images. J. Visualization and Computer Animation 7 (1996) 169-176
6. Hopfield, J.J., Tank, D.W.: Neural computation of decisions in optimization problems.
Biological Cybernetics 7 (1985) 141-152
7. Ikeda, M.: Mesoscale variability revealed with sea surface temperature imaged by
AVHRR on NOAA satellites. Oceanographic applications of remote sensing. CRC
Press. (1995) 3-14
8. Mallat, S.: A theory for multiresolution signal decomposition: the wavelet repre-
sentation. IEEE Trans. on PAMI 11 (1989) 674-693
9. Parisi, V., et al.: A Hopfield neural network to track drifting buoys in the ocean.
Proc. of the Oceans'98 Conference II (1998) 1010-1018
10. Waku, J., Chassery, J.M.: Wavelets and multi-scale representation of discrete
boundary. Proceedings of 11th ICPR, The Hague (1992) 680-683.
Blind Inversion of Wiener Systems

Anisse Taleb¹, Jordi Solé², and Christian Jutten¹,³

¹ LIS-INPG, 46 Avenue Félix Viallet - 38031 Grenoble Cedex, France
² Universitat de Vic, c/ Sagrada Família, 7 - 08500 Vic, Catalunya, Spain.
³ ISTG-UJF, B.P. 53 - 38041 Grenoble Cedex 9, France

Abstract. A system in which a linear dynamic part is followed by a nonlinear
memoryless distortion, a Wiener system, is blindly inverted. This
kind of system can be modelled as a postnonlinear mixture and, using
some results about these mixtures, an efficient algorithm is proposed.
Results in a hard situation are presented, and illustrate the efficiency of
this algorithm.

1 Introduction and assumptions

In many areas of signal processing, nonlinear systems are present. Much research
has been done on the identification and/or the inversion of such systems. These works
assume that the input of the distortion is available. One can get an estimate of
the nonlinearity, or its inverse, and then the compensation of the distortion is
straightforward.

However, in a real world situation, one often does not have access to the
input. In this case, blind identification of the nonlinearity becomes the only way
to solve such a problem. It is well known that, unlike the case of linear systems,
prior knowledge of the model is necessary for nonlinear system identification
[11].

Traditional nonlinear system identification methods have relied on higher-order
cross-correlations of the input and the output [1]. The Bussgang and Price
theorems [2], [8] have been applied to the identification of nonlinear models with real
and complex Gaussian inputs. Though higher order statistics of the output signal
have been used in the detection of nonlinearities [9], [10], blind identification
of nonlinear systems has remained an intractable problem, except for the very
restricted class of Gaussian inputs.

Fig. 1. Nonlinear Wiener system and its inversion structure

This paper is concerned with a particular class of nonlinear systems, composed
of a linear subsystem (filter h) and a memoryless nonlinear function f
(Figure 1, left). This class of nonlinear systems, also known as Wiener systems,
is not only another nice and mathematically attractive model, but also a model
found in various areas, such as biology: study of the visual system [4], relation
between the muscle length and tension [6]; industry: description of a distillation
plant; sociology and psychology; see also [7] and the references therein. Despite
its interest, to our knowledge, no blind procedure exists for the identification of
such systems.

We suppose that the input of the system S = {s(t)} is an unknown non-Gaussian
independent and identically distributed (iid) process, and that both
subsystems h, f are unknown and invertible. We are concerned with the restitution
of s(t) by only observing the output of the system. This implies that we will
blindly design an inverse structure g, w (Figure 1, right). The nonlinear part g
is concerned with the compensation of the distortion f without access to its input,
while the linear part w is a linear deconvolution filter.

2 Design of the cost function

The following notation will be adopted throughout the paper. For each process
Z = {z(t)}, z denotes a vector of infinite dimension whose t-th entry is z(t).
Following this notation, the input-output transfer can be written as:

$e = f(Hs)$   (1)

where:

$H = \begin{pmatrix} \ddots & \vdots & \vdots & \\ \cdots & h(t-1) & h(t) & h(t+1) & \cdots \\ \cdots & h(t-2) & h(t-1) & h(t) & \cdots \\ & \vdots & \vdots & \ddots \end{pmatrix}$   (2)

denotes a square Toeplitz matrix of infinite dimension and represents the action
of the filter h on s(t). This matrix is nonsingular provided that the filter h is
invertible.
invertible.
One can recognise in equation (1) the postnonlinear (pnl) model [12]. However,
this model has been studied only in the finite dimensional case, in which it
has been shown that, under mild conditions, the system is separable provided
that the input s has independent components, and that the matrix H has at least
two nonzero entries per row or per column.

We conjecture that this will remain true in the infinite dimensional case. Here
the first separability condition is fulfilled since s has independent components
due to the iid assumption. Moreover, due to the particular structure of the matrix H,
the second condition of separability will always hold except if h is proportional
to a pure delay.

The output of the inversion structure can be written in the same way as
(1):

$y = Wx$   (3)

with x(t) = g(e(t)). Following [12], to invert such a system, the inverse system
g, w is estimated by minimizing the output mutual information.

The mutual information of a random vector of dimension n is defined by:

$I(z) = \sum_{i=1}^{n} H(z_i) - H(z_1, z_2, \ldots, z_n)$   (4)

Since we are interested in the mutual information of infinite dimensional random
vectors, a natural question to ask is "how does this quantity grow with n?".
This is answered by using the notion of entropy rates of stochastic processes [3]. The
entropy rate of a stochastic process Z = {z(t)} is defined as:

$H(Z) = \lim_{T \to \infty} \frac{1}{2T+1}\, H(z(-T), \ldots, z(T))$   (5)

when the limit exists. Theorem 4.2.1 of [3] states that this limit exists for a
stationary stochastic process. We shall then define the mutual information rate of a
stationary stochastic process by:

$I(Z) = \lim_{T \to \infty} \frac{1}{2T+1} \Bigl[ \sum_t H(z(t)) - H(z(-T), \ldots, z(T)) \Bigr] = H(z(\tau)) - H(Z)$   (6)

Here τ is arbitrary due to the stationarity assumption. We shall notice that I(Z)
is always positive and vanishes when Z is iid.
Now, since S is stationary, and the filters h, w are time-invariant filters, then
y is also stationary, and I(y) is well defined by:

$I(y) = H(y(\tau)) - H(y)$   (7)


From (3), one can write:

$(y(T), \ldots, y(-T))^\top = W_T\,\bigl( x(T)+v(T), \ldots, x(-T)+v(-T) \bigr)^\top = W_T (x_T + v_T)$   (8)

where $W_T$ is the $(2T+1) \times (2T+1)$ Toeplitz matrix built from the coefficients of w,
and $v_T$ is a random vector which contains the remaining terms corresponding
to the convolution truncation.

$H(y(-T), \ldots, y(T)) = H(W_T(x_T + v_T)) = \log|\det W_T| + H(x_T + v_T)$   (9)


The entropy rate of y can then be expressed as:

$H(y) = \lim_{T \to \infty} \frac{1}{2T+1}\, H(y(-T), \ldots, y(T)) = \lim_{T \to \infty} \frac{1}{2T+1} \log|\det W_T| + H(X)$   (10)

since, as $T \to \infty$, $x(t) + v(t) \xrightarrow{m.s.} x(t)$. The first term of this last equation is:

$\lim_{T \to \infty} \frac{1}{2T+1} \log|\det W_T| = \frac{1}{2\pi} \int_0^{2\pi} \log\Bigl| \sum_{t=-\infty}^{+\infty} w(t) e^{-jt\theta} \Bigr|\, d\theta$   (11)

In fact, as $T \to \infty$, the eigenvalues of $W_T$ tend to the Fourier coefficients of w.


Finally, by the stationarity of E, one can write:

$H(X) = \lim_{T \to \infty} \frac{1}{2T+1} \Bigl[ H(e(-T), \ldots, e(T)) + \sum_{t=-T}^{T} E[\log g'(e(t))] \Bigr] = H(E) + E[\log g'(e(\tau))]$   (12)

Combining (11) and (12) in (7) leads to:

$I(y) = H(y(\tau)) - \frac{1}{2\pi} \int_0^{2\pi} \log\Bigl| \sum_{t=-\infty}^{+\infty} w(t) e^{-jt\theta} \Bigr|\, d\theta - E[\log g'(e(\tau))] - H(E)$   (13)

3 Theoretical derivation of the inversion algorithm


To derive the optimization algorithm we need the derivatives of I(y) (13) with
respect to the linear part w and with respect to the nonlinear function g.

3.1 Linear subsystem


For the linear subsystem w, this is quite easy since the filter w is well parameterized
by its coefficients. For the coefficient w(t) corresponding to the t-th lag
we have:

$\frac{\partial H(y(\tau))}{\partial w(t)} = -E[x(\tau - t)\, \psi_{y(\tau)}(y(\tau))]$   (14)

where $\psi_{y(\tau)}(u) = (\log p_{y(\tau)})'(u)$. Since, by stationarity, it is independent of τ,
it will be denoted simply $\psi_y$. The second term of interest is:

$\frac{\partial}{\partial w(t)} \frac{1}{2\pi} \int_0^{2\pi} \log\Bigl| \sum_{u=-\infty}^{+\infty} w(u) e^{-ju\theta} \Bigr|\, d\theta = \frac{1}{2\pi} \int_0^{2\pi} \frac{e^{-jt\theta}}{\sum_{u=-\infty}^{+\infty} w(u) e^{-ju\theta}}\, d\theta$   (15)

One recognises the (−t)-th coefficient of the inverse of the filter w, which we
denote $\tilde w(-t)$. The derivatives of the other terms with respect to the w coefficients are
null, which leads, by combining (14) and (15), to:

$\frac{\partial I(y)}{\partial w(t)} = -E[x(\tau - t)\, \psi_y(y(\tau))] - \tilde w(-t)$   (16)

Equation (16) is the gradient of I(y) with respect to w(t). Consider a small
relative variation of w, expressed in terms of a convolution by a small filter ε:

$w \to w + \varepsilon * w$   (17)

The first order variation of I(y) writes as:

$\Delta I(y) = -\{(\gamma_{y,\psi_y} + \delta) * \varepsilon\}(0)$   (18)

where $\gamma_{y,\psi_y}(t) = E[y(\tau - t)\, \psi_y(y(\tau))]$ and δ is the identity filter. One immediately
notices that taking:

$\varepsilon = \mu\, (\gamma_{y,\psi_y} + \delta)$   (19)

where μ is a small positive¹ real constant, insures a continuous decrease of I(y).

It then provides the following gradient descent algorithm:

$w \leftarrow w + \mu\, (\gamma_{y,\psi_y} + \delta) * w$   (20)

3.2 Nonlinear subsystem

For this subsystem, we use a nonparametric approach. We make no parametric-type
restriction concerning its functional form. In consequence, and since the
family of all possible characteristics is so wide, the only possible parametrisation
of g is g itself. This may seem confusing, but the consequences are simple.
The same technique as for the linear subsystem is used here. In fact, consider a small
relative deviation of g, expressed in terms of composition by a "small" function:

$g \to g + \varepsilon \circ g$   (21)

In this case, we have:

$\Delta E[\log g'(e(\tau))] = E[\log(1 + \varepsilon' \circ g(e(\tau)))] \approx E[\varepsilon'(x(\tau))]$   (22)

and,

$\Delta H(y(\tau)) = -E[\psi_y(y(\tau))\, \{w * \varepsilon(x)\}(\tau)]$   (23)

which gives the variation of I(y):

$\Delta I(y) = -E[\psi_y(y(\tau))\, \{w * \varepsilon(x)\}(\tau)] - E[\varepsilon'(x(\tau))]$   (24)

Now let us write:

$\varepsilon'(x) = \int_{\mathbb{R}} \varepsilon(v)\, \delta'(x - v)\, dv$   (25)

$\varepsilon(x) = \int_{\mathbb{R}} \varepsilon(v)\, \delta(x - v)\, dv$   (26)
¹ Small enough to insure the validity of the first order variation approximation.

then:

$\Delta I(y) = -\int_{\mathbb{R}} \underbrace{E[\psi_y(y(\tau))\, \{w * \delta(x - v)\}(\tau) + \delta'(x(\tau) - v)]}_{J(v)}\, \varepsilon(v)\, dv$   (27)

To make a gradient descent, we may take:

$\varepsilon(v) = \mu\, Q * J(v)$   (28)

where Q is any function such that:

$\int_{\mathbb{R}} J(v)\, Q * J(v)\, dv > 0$   (29)

Using the Parseval equality, this condition becomes

$\int_{\mathbb{R}} |\hat J(\nu)|^2\, \Re\{\hat Q(\nu)\}\, d\nu > 0$   (30)

It suffices to take $\Re\{\hat Q(\nu)\} > 0$ to insure this condition. Based on the gradient
descent, the algorithm then writes as:

$g \leftarrow g + \mu\, \{Q * J\} \circ g$   (31)

4 Practical issues

It is clear that (20) and (31) are unusable in practice as they stand. This section is concerned
with adapting these algorithms to an actual situation. We consider then a finite
discrete sample E = {e(1), e(2), ..., e(T)}. The first question of interest is the
estimation of the quantities involved in equations (20) and (31). We assume
that we have already computed the output of the inversion system, i.e. X =
{x(1), x(2), ..., x(T)} and Y = {y(1), y(2), ..., y(T)}.

Estimation of $\psi_y$: Since we are concerned with nonparametric estimation, we
will use a kernel density estimator [5]. This estimator is easy to implement and
has a very flexible form, but suffers from the difficulty of the choice of the kernel
bandwidth. Formally, we estimate $p_y$ by:

$\hat p_y(u) = \frac{1}{Th} \sum_{t=1}^{T} K\Bigl( \frac{u - y(t)}{h} \Bigr)$   (32)

from which we get an estimate of $\psi_y$ by $\hat\psi_y(u) = \hat p_y'(u) / \hat p_y(u)$. Many kernel shapes
can be good candidates; for our experiments we used the Gaussian kernel. A
"quick and dirty" method for the choice of the bandwidth consists in using the
rule of thumb $h = 1.06\, \hat\sigma\, T^{-1/5}$. Better estimators may be found, and used, but
experimentally we noticed that the proposed estimator works fine.
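A minimal sketch of this estimator, assuming a Gaussian kernel and the rule-of-thumb bandwidth from the text; the function name is ours, and the kernel's normalization constants are dropped since they cancel in the ratio.

    import numpy as np

    def score_estimate(y, u):
        # Kernel estimate of psi_y(u) = p_y'(u) / p_y(u) at the points u,
        # from the sample y, using a Gaussian kernel and the bandwidth
        # h = 1.06 * sigma_hat * T**(-1/5).
        y = np.asarray(y, dtype=float)
        u = np.atleast_1d(np.asarray(u, dtype=float))
        T = y.size
        h = 1.06 * y.std() * T ** (-0.2)
        d = (u[:, None] - y[None, :]) / h       # pairwise scaled differences
        k = np.exp(-0.5 * d ** 2)               # unnormalized Gaussian kernel
        p = k.sum(axis=1)                       # proportional to p_y(u)
        dp = (-d / h * k).sum(axis=1)           # proportional to p_y'(u)
        return dp / p                           # normalization constants cancel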

Estimation of $\gamma_{y,\psi_y}$: Provided we have an estimator of $\psi_y$, we can compute
$\psi_y(y(t))$, t = 1, ..., T. Then:

$\gamma_{y,\psi_y}(t) = E[y(\tau - t)\, \psi_y(y(\tau))]$   (33)

is estimated by:

$\hat\gamma_{y,\psi_y}(t) = \frac{1}{T} \sum_{\tau=1}^{T} y(\tau - t)\, \psi_y(y(\tau))$   (34)

assuming ergodicity. Since $\gamma_{y,\psi_y}(0) = -1$, $\hat\gamma_{y,\psi_y}(0)$ may be set to −1 without
computing it.

Estimation of Q ∗ J: This function is necessary to adapt the output of the
nonlinear subsystem, and can be estimated by:

$\widehat{Q * J}(v) = \frac{1}{T} \sum_{t=1}^{T} \bigl[ -Q'(v - x(t)) + \psi_y(y(t))\, \{w * Q(v - x)\}(t) \bigr]$   (35)

Nonlinear subsystem parametrisation and estimation: No parametrisation
of g is used. One might ask the intriguing question "How would I compute
the output of the nonlinear subsystem without g?". In fact, applying equation
(31) to the t-th element of the sample E, and using x(t) = g(e(t)), one gets:

$x(t) \leftarrow x(t) + \mu\, \{Q * J\}(x(t))$   (36)

This equation will then compute the output of g without having a particular
form of this function. A possible choice of Q is:

$Q(u) = \begin{cases} -u & \text{if } u \ge 0 \\ 0 & \text{otherwise} \end{cases}$   (37)

which is very simple from a computational point of view.

Filter parametrisation and estimation: In practical situations, the filter w
is of finite length (FIR). We also suppose that w has equal length in its causal
and anti-causal parts. The result of the convolution of w with $\hat\gamma_{y,\psi_y} + \delta$ should be
truncated to fit the size of w. A smooth truncation, e.g. the use of a Hamming
window, is preferable to avoid overshooting.

Indeterminacies: The output of the nonlinear subsystem x(t), t = 1, ..., T
should be centered and normalized. In fact, the inverse of the nonlinear distortion
can be restored only up to a linear function. For the linear subsystem, the output
y(t), t = 1, ..., T should also be normalized.
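Putting the practical estimators together, one iteration of the inversion loop could look like the sketch below. It assumes the score_estimate helper from the previous sketch, the particular Q of equation (37), wrap-around convolutions as a crude stand-in for the infinite ones, and a step size mu small enough for the first-order approximations to hold; it is an illustration, not the authors' implementation.

    import numpy as np

    def inversion_step(x, w, mu=0.01):
        # x: current estimate of g(e), initialized from the centered and
        # normalized observations e(t); w: FIR filter with equal causal and
        # anti-causal parts (odd length).
        T, half = len(x), len(w) // 2
        y = np.convolve(x, w, mode='same')
        psi = score_estimate(y, y)                  # psi_y at the samples y(t)

        # Nonlinear part, equations (35)-(36): x <- x + mu * (Q * J)(x).
        QJ = np.empty(T)
        for t in range(T):
            d = x[t] - x                            # v - x(tau) for v = x(t)
            Qp = np.where(d >= 0, -1.0, 0.0)        # Q'(v - x), eq. (37)
            Qv = np.where(d >= 0, -d, 0.0)          # Q(v - x), eq. (37)
            QJ[t] = np.mean(-Qp + psi * np.convolve(Qv, w, mode='same'))
        x = x + mu * QJ
        x = (x - x.mean()) / x.std()                # fix the indeterminacies

        # Linear part, equations (34) and (20), with a Hamming-window
        # smoothing of the truncated update as suggested in the text.
        y = np.convolve(x, w, mode='same')
        psi = score_estimate(y, y)
        lags = np.arange(-half, half + 1)
        gamma = np.array([np.mean(np.roll(y, t) * psi) for t in lags])
        gamma[half] = -1.0                          # gamma(0) = -1, not computed
        delta = np.zeros_like(gamma)
        delta[half] = 1.0                           # identity filter
        update = np.convolve(gamma + delta, w, mode='same') * np.hamming(len(w))
        return x, w + mu * update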

5 Experimental results

To test the previous algorithm, we simulate a hard situation. The iid input
sequence s(t), shown in figure 3, is generated by applying a cubic distortion to
an iid Gaussian sequence. The filter h is FIR, with the coefficients:
h = [0.826, -0.165, 0.851, 0.163, 0.810]
Its frequency response is shown in figure 2. The nonlinear distortion is a hard
saturation f(u) = tanh(10u). The observed sequence is shown in figure 3.

Fig. 2. h frequency domain response (magnitude and phase vs. normalized frequency, Nyquist == 1).

Fig. 3. From left to right: Original input sequence s(t), Observed sequence e(t), Restored sequence y(t).

The algorithm was provided with a sample of size T = 1000. The size of the impulse

response of w was set to 51. The estimation results, shown in figures 3, 4 and 5, prove
the good behavior of the proposed algorithm. The phase of the filter w, figure 4, is
composed of a linear part, which corresponds to an arbitrary uncontrolled but
constant delay, and of a nonlinear part, which compensates the phase of h.

Fig. 4. Estimated inverse of h: w frequency domain response.

Fig. 5. Estimated inverse of the nonlinear characteristic f: x(t) vs. e(t)

6 Final remarks and conclusion

In this paper a blind procedure for the inversion of a nonlinear Wiener system
was proposed. This procedure is based on a relative gradient descent of the
mutual information rate of the inversion system output.

One may notice that some quantities involved in the algorithm can be efficiently
estimated by resorting to the FFT, which dramatically reduces the computational
cost. The estimation of g is done implicitly; only the values of
x(t) = g(e(t)), t = 1, ..., T are estimated. One can further use any regression
algorithm based on these data to estimate g, e.g. neural networks, splines, etc.
The relation between the choice of Q and the performance of the algorithm is
not well understood and is currently under investigation.
The proposed procedure shows good performance on simulated data, and is
now applied to real data. Extension to multichannel Wiener systems is currently
under investigation.

Acknowledgement: This work has been in part supported by the Direcció
General de Recerca de la Generalitat de Catalunya.

References

1. S. A. Billings and S. Y. Fakhouri. Identification of a class of nonlinear systems


using correlation analysis. Proc. IEEE, 66:691-697, July 1978.
2. E. D. Boer. Cross-correlation function of a bandpass nonlinear network. Proc.
IEEE, 64:1443-1444, September 1976.
3. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in
Telecommunications, 1991.
4. A. C. den Brinker. A comparison of results from parameter estimations of impulse
responses of the transient visual system. Biol. Cybern., 61:139-151, 1989.
5. W. Härdle. Smoothing Techniques, with Implementation in S. Springer-Verlag,
1990.
6. I. W. Hunter. Frog muscle fiber dynamic stiffness determined using nonlinear
system identification techniques. Biophys. J., 49:81a, 1985.
7. I. W. Hunter and M. J. Korenberg. The identification of nonlinear biological
systems: Wiener and Hammerstein cascade models. Biol. Cybern., 55:135-144, 1985.
8. G. Jacovitti, A. Neri, and R. Cusani. Methods for estimating the autocorrela-
tion function of complex stationary processes. IEEE Trans. ASSP., 35:1126-1138,
August 1987.
9. C. L. Nikias and A. P. Petropulu. Higher-Order Spectra Analysis - A Nonlinear
Signal Processing Framework. Englewood Cliffs, NJ: Prentice-Hall, 1993.
10. C. L. Nikias and M. R. Raghuveer. Bispectrum estimation: A digital signal pro-
cessing framework. Proc. IEEE, 75:869-890, July 1987.
11. M. Schetzen. Nonlinear system modeling based on the Wiener theory. Proc. IEEE,
69:1557-1573, December 1981.
12. A. Taleb and C. Jutten. Source separation in postnonlinear mixtures. Jan 1998.
Submitted to IEEE Trans. S.P., under revision.
Separation of Speech Signals for Nonlinear Mixtures

C.G. Puntonet, M.R. Alvarez, A. Prieto, B. Prieto

Departamento de Arquitectura y Tecnología de Computadores
Universidad de Granada. 18071-Granada. Spain
E-mail: carlos@atc.ugr.es

Abstract. This paper shows an approach to recover original speech signals from
their nonlinear mixtures. Using a geometric method that makes a piecewise linear
approximation of the nonlinear mixing space, and the fact that the speech
distributions are Laplacian or Gamma type, a set of slopes is obtained as a set of
linear mixtures.

1 Introduction

The problem of blind separation of sources [1] involves obtaining the signals generated
by p sources, $s_j$, j = 1, ..., p, from the mixtures detected by p sensors, $e_i$, i = 1, ..., p. The mixture
of the signals takes place in the medium in which they are propagated, and:

$e_i(t) = F_i(s_1(t), \ldots, s_j(t), \ldots, s_p(t)),\quad i = 1, \ldots, p$   (1)

where $F_i: \mathbb{R}^p \to \mathbb{R}$ is a function of p variables from the s-space to the e-space, represented
by one matrix $A_{p \times p}$. The goal of source separation is to obtain p functions, $L_j$, such that:

$s_j(t) = L_j(e_1(t), \ldots, e_i(t), \ldots, e_p(t)),\quad j = 1, \ldots, p$   (2)

where $L_j: \mathbb{R}^p \to \mathbb{R}$ is a function from the e-space to the s-space. The source separation is
considered solved when the signals $y_j(t)$ are obtained from a matrix $W_{p \times p}$ (similar to A) [2], and:

$W^{-1} \cdot A = D \cdot P;\quad D \in \{\text{diagonal matrices}\},\ P \in \{\text{permutation matrices}\}$   (3)


We have proposed various procedures that are based on geometrical properties of source
vectors, S (t), and of mixtures, E(t), from the hypothesis that the sources are bounded [3,4],
since the real signals (speech, biomedical) are limited in amplitude. The present paper
aims to extend this method to a type of nonlinear mixture that approximately models the
non linearities introduced in sensors. We believe, in agreement with other authors [5], that
an adequate mixture model is the post-nonlinear (PNL) model. Thus, (1) may be expressed
as"

ei(t ) = Fi( f~ aij.sj(t )) ; i = l ..... p (4)


j=l

Nevertheless, there exists a great variety of sensors [6] whose transfer characteristics are
modelled by diverse functions. Thus, we can also consider a more general nonlinear model
whenever the $F_i$ transformation is a continuous nonlinear function, since in this way it is
possible to achieve a piecewise linear approximation of $F_i$. Since each sensor (i) is
sensitive at least to its associated source (i), the following hypothesis is verified:

$a_{ii} \neq 0,\quad \forall i \in \{1, \ldots, p\}$   (5)

2 Basis of procedure

In previous papers [3,4] we have shown that, for linear mixtures, the set of all the images,
E(t), forms a hyperparallelepiped in the E-space; by taking p vectors, $(w_1, \ldots, w_p)$, each one
located at one of the edges of the cone that contains the mixing space, as column vectors
of a matrix $W_{p \times p}$, this matrix is similar to $A_{p \times p}$. This can be performed as follows:

$w_{ij} = \min_t \Bigl[ \frac{e_i(t)}{e_j(t)} \Bigr];\ e_j(t) > 0 \;\Rightarrow\; w_{ij} \approx \frac{a_{ij}}{a_{jj}},\quad i, j \in \{1, \ldots, p\}$   (6)

Recently, for linear mixtures of two speech signals [1], we used the property that speech
signal distributions are Laplacian or Gamma type and symmetrical; then, normalizing the
mixing space, it is possible to determine the distribution of the points in the unit circle,
obtaining two maxima that correspond to the slopes $w_{12}$ and $w_{21}$ or, in the same way, the
independent components, because, due to the linearity of the $F_i$ transformations, the
mixtures are distributed with maxima of probability in directions parallel to the edges of
the parallelepiped (distribution axes). Given the values of $w_{ij}$ and $c_i = \det(W)$, the sources,
X, may be obtained. Thus, for p = 2 we have:

$x_i(t+1) = s_i(t+1) = c_i^{-1} \cdot (e_i(t) - w_{ij}(t) \cdot e_j(t));\quad i, j \in \{1, 2\},\ i \neq j$   (7)
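As a sketch of this linear step, under the assumption of a 2×2 mixing matrix with unit diagonal, the slopes of equation (6) can be read off the edges of the observed cloud and the sources recovered with equation (7); the function names and the eps guard are ours.

    import numpy as np

    def edge_slopes(e1, e2, eps=1e-12):
        # Equation (6)-style estimate: the slope of each edge of the mixing
        # parallelepiped is the minimum ratio between the observations.
        m1, m2 = e2 > eps, e1 > eps
        w12 = np.min(e1[m1] / e2[m1])
        w21 = np.min(e2[m2] / e1[m2])
        return w12, w21

    def separate_two(e1, e2, w12, w21):
        # Equation (7) for p = 2 with unit diagonal: det(W) = 1 - w12 * w21.
        c = 1.0 - w12 * w21
        return (e1 - w12 * e2) / c, (e2 - w21 * e1) / c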

3 Piecewise Linearization

When a normalization of the mixing space is performed, as in the case of the previous
method, a loss of information occurs for the case of nonlinear mixtures of two speech signals,
since the irregularities in the point density of the two-dimensional mixing space
are projected onto the unit circle. The method proposed in this paper considers the
distribution of the points observed in the E(t) space by sectorizing the latter by means of
radial and angular parameters. In this way, each sector is addressed by two numbers: the
radius and the angle, as shown in Figure 1. Then, for each circle (or radius) there are two
sectors (or angles) with the maximum distribution of points, corresponding to the
independent axes (or slopes), as if a linear mixture of signals were made in each span
between two circles. Thus, for each circle we obtain a $W_\rho$ matrix as in the linear case. If
the $F_i$ nonlinear function is continuous, a piecewise linearization can be done in order to
approximate $F_i$. In some cases, when the nonlinear mixing function is not continuous,
good approximations can be obtained if the gap between two successive slopes is low, i.e.,
if the distance between two sectors is not excessive. Clearly, a high number of sectors
provides greater accuracy in the piecewise linearization. This procedure can be applied
not only to Gamma distribution type signals, but also to all kinds of sources presenting a
probability distribution with a maximum at the centre and that are symmetrical around this
centre, such as Gaussian, Laplacian and Poisson functions. Furthermore, the method is
valid, in general, even in the presence of additive noise produced in the medium itself (or
in the mixing sensors), as the usual noise models do not alter the relative centres or the
distribution symmetries.

Fig. 1. Sectorization of the mixture space (E1, E2).

4 Adaptive processing

The piecewise linearization procedure for the separation of two signals can be
implemented in a recursive artificial neural network. The number of processing elements
is proportional to the number of radii ($\rho_{max}$) used in the observed signals map and will
depend on the number of sources to be separated. For the case of two signals, this number
is $2\rho_{max}$, irrespective of the number of angular sectors used. The structure of the recursive
network (Hopfield or Herault-Jutten) allows us to separate the sources, s(t), as follows:

$s(t+1) = e(t, \rho) - W_\rho \cdot s(t)\quad \forall \rho \in \{1, \ldots, \rho_{max}\}$   (8)

where $e(t, \rho)$ represents the value of an observation vector belonging to the sector
identified by the radius ρ, and $W_\rho = (w_{ij})$ is the weight matrix associated with this radius.
Note that, without loss of generality and for two signals, the elements $(w_{11}, w_{22})$ of $W_\rho$ are
equal to 1, and that the two slopes $(w_{12}, w_{21})$ have the value $w_{ij} = \tan(\theta_k)$, with $\theta_k$
representing the angles of the two winning sectors in each circle of radius ρ; in other
words, $\theta_1$ and $\theta_2$ are the angles formed by the $w_1$ and $w_2$ weight vectors with the $(e_1, e_2)$ axes
respectively. The adaptive rule for the weights is the recursive expression used in the

context of competitive learning [8] since, geometrically, the two weight vectors $(w_1, w_2)$
that are representative of each circle of radius ρ are shifted towards a new vector e(t), i.e.:

$w_1(t+1, \rho) = w_1(t, \rho) + \alpha\,(e(t, \rho) - w_1(t, \rho))$
$w_2(t+1, \rho) = w_2(t, \rho) + \alpha\,(e(t, \rho) - w_2(t, \rho))$   (9)
$\forall \rho \in \{1, \ldots, \rho_{max}\}$

where α is the classical learning rate, which must be a suitably monotonically decreasing
scalar-valued coefficient, 0 < α < 1. Initially, the weights are located on the $(e_1, e_2)$ axes
with zero value, i.e., $w(t=0, \rho) = 0$. After convergence, the two weight vectors of equation
(9) will be located on the two maximum distributions of points for each circle,
respectively.

5 Simulation results

We simulated this adaptive procedure for linear and nonlinear mixtures. Four simulations
were made with synthetic and speech signals. In the figures, we show the input space, the
observed space, the sectorization of the latter, with the lines corresponding to the $(w_1, w_2)$
vectors, and the separated signals space. For the sake of clarity, the radial sectors have not
been plotted. In simulations 3 and 4, we show the window of the separated speech signals
at convergence.

Simulation 1. In order to validate the proposed method we initially simulated a linear


mixture from an orthogonal input space. The mixing matrix used was the following:

a/01'0:/
There were 4 circles (radii), and 16 sectors (angles). The crosstalk values for each
separated signal, with 2000 samples, were c(1)=-35 dB and c(2)=-34 dB.

Simulation 2. In this simulation, a nonlinear mixture was generated from linear mixtures
in each circular sector, i.e., four matrices depending on the radius were used, as follows:

:/ /0120:/ A3,/0404) 4 /06 ~ A 1

The crosstalk values for each separated signal, with 2000 samples, were c(1)=-26 dB and
c(2)=-25 dB.

Simulation 3. The third simulation used speech signals as source signals, namely the
Spanish words "mano" (hand) and "muñeca" (doll). The crosstalk values for each
separated signal, with 5000 samples, were c(1)=-22 dB and c(2)=-27 dB. There were 4
circles, and the nonlinear mixing applied was as follows:

A(o.,)=(10 ~) A(p=2)=(10~) A(p,3)=(01.5 015) A(p~4)=(01.5 0~5)

Simulation 4. The fourth simulation also used speech signals as source signals, namely the
Spanish words "mano" (hand) and "muñeca" (doll). The crosstalk values for each
separated signal, with 5000 samples, were c(1)=-25 dB and c(2)=-28 dB. There were 5
circles and the nonlinear mixing applied was as follows:

A(0=l)=(~ ~) A(p:2)=(0.;5 0.~5) A(p.3)=(0.;5 0.~5)

a,p
5,=toi5o5)
6 Conclusions

This paper presents an adaptive procedure for the demixing of linear and nonlinear
mixtures of signals with probability distributions that are symmetrical with respect to their
centres and non-uniform. The main idea is that it is possible to perform a piecewise
linearization in the case of nonlinear mixtures (and in the linear case) in order to obtain
the distribution axes of probability that are parallel to the slopes of the hyperparallelepiped
(a parallelepiped for two sources), or independent components, for each circle of radius ρ.
Although this paper describes the application of the algorithm to two signals, the
operations are performed in vectorial form, and future work will concern the application
of this method to the separation of more than two sources and the study of the influence
of noise.

Acknowledgments

This work has been supported in part by the Spanish CICYT project TIC98-0982.

References

1. A. Prieto, B. Prieto, C.G. Puntonet, P.M. Smith, A. Cañas, "Geometric separation
of linear mixtures of sources: application to speech signals". First International
Congress on Independent Component Analysis and Signal Separation (ICA-99),
1999, 11-15 January, Aussois, France.
2. C.G. Puntonet, A. Prieto, C. Jutten, M.R. Alvarez and J. Ortega, "Separation of
sources: a geometry-based procedure for reconstruction of n-valued signals".
Signal Processing, 1995, 46, No. 3, pp. 267-284.

3. C.G. Puntonet and A. Prieto, "Geometric approach for blind separation of
signals", Electronics Letters, 1997, 33, No. 10, pp. 835-836.
4. C.G. Puntonet and A. Prieto, "Neural net approach for blind separation of
sources based on geometric properties". Neurocomputing, 1998,18, pp. 141-164.
5. A. Taleb and C. Jutten, "Nonlinear source separation: the post-nonlinear
mixtures". Proceedings of ESANN'97, 1997, pp. 279-284. D Facto, Brussels.
6. P.A. Paratte and P. Robert, "Systèmes de mesure", Traité d'Electricité, 1996, 17,
Presses Polytechniques et Universitaires Romandes, Lausanne.
7. C. Jutten, A. Guerin and N.L. Nguyen Thi, "Adaptive optimization of neural
algorithms", Lecture Notes in Computer Science, 1991, 540, pp.54-61, Springer-
Verlag.
8. T. Kohonen, "The self-organizing map", Proceedings of the IEEE, vol. 78, no. 9, pp.
1464-1480, Sept. 1990.

Fig. 2. Simulation 1. Orthogonal input space, Linear mixing, Sectorized mixing and Output
space

Fig. 3. Simulation 2. Orthogonal input space, Non-linear mixing, Sectorized mixing and
Output space

Fig. 4. Simulation 3. Original, mixed and separated real speech signals.


Fig. 5. Simulation 3. Input space, Nonlinear mixing, Sectorized mixing and Output space.

Fig. 6. Simulation 4. Original, mixed and separated real speech signals


Fig. 7. Simulation 4. Input space, Nonlinear mixing, Sectorized mixing and Output space
Nonlinear Blind Source Separation by Pattern Repulsion*

Luís B. Almeida** and Gonçalo C. Marques***

INESC, R. Alves Redol, 9
1000 Lisboa, Portugal
luis.almeida@inesc.pt, goncalo.marques@inesc.pt

Abstract. Blind source separation has been a topic of great interest


for researchers in the last few years, and applications are starting to
appear. Until now, most of the research and applications have focused
on the separation of linear mixtures. In this paper we briefly discuss the
problem of separation of nonlinear mixtures, and present a method for
performing this kind of separation. We also present some experimental
results.

1 Introduction

Blind source separation has been a topic of growing interest for researchers in the
last few years [6, 3, 2]. Most of the work that has been done until now has focused
on the separation of linear mixtures. A few authors have started to address the
more general problem of separation of nonlinear mixtures [4, 7, 5, 9, 8, 10] (see
further references in [6]). In this paper we address that problem. We present a
method for performing nonlinear separation, together with some examples.
We shall denote vectors by bold lowercase letters and vector functions by
bold uppercase letters. Superscripts shall denote vector components, as in $s^i$. To
denote an exponent, we shall enclose the base in parentheses, as in $(s^i)^2$, unless
the meaning becomes clear from the context. Subscripts shall be used to index
vectors (patterns) within a set.

We shall consider the following setting. A set of sources $s^i$, statistically independent
from one another, forming the source vector s, are passed through
a nonlinear mixing system M, whose output forms the vector of observations
o = M(s). Our aim is to recover (i.e. separate) the original sources from the
nonlinear mixtures $o^i$. The separation is blind, meaning that little is known
about the sources $s^i$ or the mixing system M. Regarding the sources, we only
assume that they are independent from one another. Regarding the mixing system,
we have to assume, first of all, that it is invertible. If nothing else is assumed,

* This work was partially supported by PRAXIS project TIT/1585/95.


** Also with ISEL.
*** Also with IST.

the separation problem will be ill-posed, having an infinite number of solutions
[8]. To remove this indetermination we shall further assume that the nonlinear
mixture is smooth, and we shall use adequate regularization in the separation
process.

In this nonlinear separation setting, besides the well known degree of indetermination
that exists in the linear case (i.e. permutation and scaling), we have
to accept another form of indetermination, corresponding to an invertible nonlinear
warping of each source. If we designate by $x^i$ the outputs of the separation
process, we can only expect to obtain $x^i = f^i(s^i)$ (apart from possible permutations),
$f^i$ being invertible. In fact, if the sources are independent from one
another, so will be the $f^i(s^i)$. A criterion that enforces independence of the separation
outputs can't be expected to recover the original sources exactly. But each
separated component $x^i$ will contain the same information as the corresponding
source $s^i$.

The separation technique that we describe here was first presented by
these authors in [8]. An independent derivation of this technique, starting from
information theoretic concepts, can be found in [11].

2 The separation technique

A well known method for linear separation is the information maximization


technique of Bell and Sejnowski [2]. It essentially consists of passing each of
the separated signals through a suitable saturating, monotonic nonlinearity, and
forcing the resulting joint distribution to be as uniform as possible within a
hypercube with edges parallel to the coordinate axes¹. A simple justification of
the technique is that if two or more signals have a jointly uniform distribution
in a hypercube, then those signals are mutually independent, and so the signals
before the nonlinearities must also be independent from one another.
Our method is based on the same principle: we try to obtain, at the output of
the nonlinear separator, a uniform joint distribution of the outputs within a hy-
percube. If the distribution is uniform, the outputs will be independent from one
another. Bell and Sejnowski approach the uniform distribution by maximizing
the joint entropy of the outputs. This can be done in the case of their separating
system, which consists of a single linear layer followed by componentwise non-
linearities, but that method cannot be easily extended to a general nonlinear
separation system (see [10], however, for an extension to a restricted class of
nonlinear separators).
We'll start by an analogy that will help us to explain the basic principle of our
separation technique. This will then be formalized in precise mathematical terms.
We make an analogy between output patterns of the separator and physical
particles in space. The intuitive idea behind our separation method is that of
repulsion among electrically charged particles. The repulsion will be strongest in
the zones of highest concentration of particles. Therefore these zones will have
¹ In this paper we shall refer only to hypercubes with this orientation, and shall designate
them simply as "hypercubes".

a stronger tendency to spread than those of low concentration, thus tending to
uniformize the particle distribution. We cannot use the decay law of electrostatic
repulsion, however, since this would lead all particles to locate themselves at the
periphery of the hypercube, as happens with real charges in a conductor. The
decay with distance will have to be faster than the inverse quadratic law of the
electrostatic force.

More formally, consider a set of n-dimensional patterns $x_i$ which are i.i.d.
samples from a distribution with density p(x), and consider also a potential
energy,

$W = \frac{1}{2} \sum_{i,j} E(\|x_i - x_j\|)$,   (1)

where $E(\|x_i - x_j\|)$ is the potential energy corresponding to the pair of patterns
$(x_i, x_j)$, and depends only on the distance between the patterns in the pair.

The repulsion force $F(x_i, x_j)$, exerted by pattern $x_i$ on pattern $x_j$, will be
minus the gradient, relative to $x_j$, of the potential of that pair of patterns,

$F(x_i, x_j) = -\frac{\partial E(\|x_i - x_j\|)}{\partial x_j}$,   (2)

and its magnitude will depend only on the distance between the patterns.
As noted above, the force, and thus also the potential, will have to decay
faster than the electrostatic ones. We shall assume that E(x) is finite everywhere,
and that its integral is also finite, $\int_{\mathbb{R}^n} E(x)\, dx = K$. We shall also assume that E
is concentrated around the origin, so that, for any ξ, $E(\|x - \xi\|)$, considered as
a function of x, has significant values only in a small region around ξ, in which
the density p(x) can be considered constant. Therefore

$\int_{\mathbb{R}^n} E(\|x - \xi\|)\, p(x)\, dx \approx K\, p(\xi)$   (3)

and we can use E(x)/K as a Parzen kernel for estimating the probability density
of the patterns p from the samples $x_i$,

$\hat p(x) = \frac{1}{KN} \sum_i E(\|x_i - x\|) \approx p(x)$,   (4)

where N is the number of patterns. We can then express the total energy as

W = 1 E (llx, - xjll) (5)


i,j
= KN ~/3 (xj) (6)
2
J
KN2 f
--~ JR"/3 (x) p ( x ) d x (7)

gg2 I p2(x)dx " (8)


~' T J I R n

If we restrict the distribution p to be zero outside some finite region R of
$\mathbb{R}^n$, the integral in the last equation will be minimal if p(x) is uniform within
R (see the proof in the Appendix). We can therefore uniformize the density of
the patterns by minimizing the energy W.

Assume then that we have a set of observations $o_i$ and that we want to
separate the sources that generated them. We do that by means of a nonlinear
transformation S (implemented through an MLP, for example). The dimension
of the output of S is assumed to be the same as the number of sources. Designate
the output patterns of S by $x_i = S(o_i)$. The transformation is optimized so that
its output patterns have a distribution that is as close as possible to uniform
within a hypercube. The optimization is performed by minimizing, by gradient
descent, the total energy W of the output patterns. If we designate by w the
vector of parameters of S, we have

$\frac{\partial W}{\partial w} = \sum_i \frac{\partial W}{\partial x_i} \cdot \frac{\partial x_i}{\partial w} = -\sum_{j,i} F(x_j, x_i) \cdot \frac{\partial x_i}{\partial w}$   (9)

If S is implemented by means of an MLP, the terms $F(x_j, x_i) \cdot \partial x_i / \partial w$ can be
computed by the backpropagation rule, using the components of $F(x_j, x_i)$ as the
inputs of the backpropagation network. To compute $F(x_j, x_i)$ we need $x_j$ and
$x_i$ simultaneously, and therefore we need to process, at each time, both $o_j$ and
$o_i$ through the MLP. Having computed $F(x_j, x_i)$ and $F(x_i, x_j) = -F(x_j, x_i)$,
we can then compute the two terms $F(x_j, x_i) \cdot \partial x_i / \partial w$ and $F(x_i, x_j) \cdot \partial x_j / \partial w$
by means of two separate backpropagations.

3 Application examples

We present two examples of nonlinear separation, the first one with images (see
Fig. 1-a) and the second one with time-domain signals (see Fig. 3-a). The nonlinear
mixtures corresponded to nonlinear analytical expressions given ahead.
These expressions were chosen somewhat arbitrarily, but taking into consideration
that they shouldn't be too unsmooth (cf. Sect. 1). The separation was
implemented, in both cases, by MLPs with 10 tanh hidden units and with two
linear output units. The MLPs had full connectivity between consecutive layers,
and also had direct connections from the inputs to the output units.

The images had a total of 200 × 200 = 40,000 pixels each, and the signals were
both 1000 samples in length. In both cases the sum in (9) would have involved
too large a number of terms to be usable in practice. We dealt with this, in both
cases, by strongly subsampling the set of pairs of patterns that we used in the
computation. In the case of the images we started by randomly sampling, without
replacement, a set of 1000 patterns from the 40,000 observations. The remaining
procedure was the same for both the images and the signals. Designating by
$o_i$, i = 1, ..., 1000, the training patterns, and setting by convention $o_{1001} = o_1$,
we then further subsampled the set of pairs of output patterns to be used in (9)
by using only the pairs of the form $(x_i, x_{i+1})$ with i = 1, ..., 1000.
For the computation of W and of its gradient we used, for the potential,
$E(\|x_1 - x_2\|) = 0.05\, e^{-2\|x_1 - x_2\|^2}$. The objective function J that we minimized
had two extra terms, besides the energy, J = W + B + R. The term B bounded
the distribution of output patterns within the square $[-1, 1]^2$, and R was a
regularization term. For B we used

$B = \sum_{i=1}^{1000} \bigl\{ \max[(|x_i^1| - 1), 0]^2 + \max[(|x_i^2| - 1), 0]^2 \bigr\}$   (10)

For regularization we used weight decay,

$R = \lambda \sum_{w \in \mathcal{W}} (w)^2$,   (11)

where $\mathcal{W}$ was the set of all weights except those of the direct connections from
inputs to outputs and the unit biases, which did not affect the smoothness of the
mapping. Training was performed in batch mode. We used adaptive step sizes
and error control (cf. [1], Sects. C.1.2.4.2 and C.1.2.4.3), which were essential in
getting the training to converge quickly. The training converged in roughly 1000
epochs, in both cases.
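For concreteness, the two auxiliary terms could be computed as in the sketch below; this is an illustration only, and 'weights' stands for whatever container holds the penalized MLP weights.

    import numpy as np

    def bounding_term(X):
        # Equation (10): quadratic penalty on output components that leave
        # the square [-1, 1]**2; X has one output pattern per row.
        excess = np.maximum(np.abs(X) - 1.0, 0.0)
        return np.sum(excess ** 2)

    def weight_decay(weights, lam):
        # Equation (11): lambda times the sum of squared penalized weights
        # (direct input-output connections and biases are excluded).
        return lam * sum(np.sum(w ** 2) for w in weights)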
For the images we used the mixture equations

$o^1 = s^1 (s^2 + 1)/2$   (12)
$o^2 = s^2 / (s^1 + 1)$   (13)

where $s^i$ and $o^i$ designate pixel intensities, with 0 corresponding to black and
1 to white. We set λ = 10⁻³ in the regularization term. Figures 1-b and 1-c
show the nonlinear mixtures and the nonlinear separation results, respectively.
We see that the method was able to recover the original sources without visually
noticeable error. Figure 1-d shows a result of linear separation, obtained by using
a linear network instead of the MLP, and following the same training procedure.
Note that, since the mixture is not linearly separable, different linear separation
methods would probably have yielded different results, but none would have
been able to recover the original sources.
To give a better view of how the nonlinear separation performed, we processed
a square grid of 30 x 30 points, from the source space, through the nonlinear
mixture and separation. Figure 2 shows the grid after the nonlinear mixture and
after the separation. We see that the separation was able to recover the original
square grid with relatively small error.
In the second example, the sources were a triangular waveform with 4 complete
periods within the 1000 samples, and a sine wave with 23 complete periods
within that interval. Both sources were first rescaled in amplitude to the interval
[0, 1]. We used the mixture equations

$o^1 = (s^1)^2 / \log(s^2 + 2)$   (14)
$o^2 = (s^2)^2 / \sqrt{s^1 + 1}$.   (15)

Fig. 1. Nonlinear separation of images. (a) originals, (b) nonlinear mixture, (c) nonlinear
separation, (d) linear separation.

This mixture is more strongly nonlinear than the one of the first example,
especially due to the quadratic terms. Accordingly, we had to use a smaller
coefficient in the regularization term, λ = 5 × 10⁻⁵. The order of the 1000
mixture observations was randomized, so that in the pairs of the form $(x_i, x_{i+1})$
there would be no correlation between the two patterns. Apart from these details,
the training was performed in the same way as in the first example. Figure 3
shows the results, and we can see that the nonlinear system was able to separate
the sources relatively well, while a linear system was not. Figure 4 shows scatter
plots of the various signals, to give a better view of the transformations involved
in this experiment.

4 Conclusions

We described a method for separation of nonlinear mixtures of independent
sources. The method is based on the concept of repulsion among patterns. We
also presented two examples that illustrate the application of the method to the
separation of nonlinear mixtures.

From the form of operation of this separation method, it is clear that it can
only separate mixtures that are not too strongly "twisted". Otherwise the output
distribution will probably still become approximately uniform, but without
recovering the original sources. An issue that needs to be clarified concerns the
kinds of distributions that the method can separate. In our examples we used
source distributions that were subgaussian and had well defined edges. It is still
unclear how well the method will handle other kinds of distributions.
Fig. 2. Mappings. (a) nonlinear mixture, (b) nonlinear separation.

Appendix

Consider a finite region $R \subset \mathbb{R}^n$, and designate by u(x) = u the uniform
density within that region. Let p(x) be any density within that region, and let
e(x) = p(x) − u. Since both u and p are probability densities, we have

$\int_R e(x)\, dx = 0$   (16)

Therefore,

$\int_R p^2(x)\, dx = \int_R \bigl( u^2(x) + 2u(x)\, e(x) + e^2(x) \bigr)\, dx$   (17)

$= \int_R u^2(x)\, dx + 2u \int_R e(x)\, dx + \int_R e^2(x)\, dx$   (18)

$= \int_R u^2(x)\, dx + \int_R e^2(x)\, dx$   (19)

$\ge \int_R u^2(x)\, dx$,   (20)

with equality only if p = u. The uniform density is therefore the absolute minimizer
of $\int_R p^2(x)\, dx$.

References
[1] Almeida, L. B.: Multilayer perceptrons. Handbook of Neural Computation.
Fiesler, E. and Beale, R., eds. Institute of Physics and Oxford University Press
(1997) available at http://www.oup-usa.org/acadref/nccl_2.pdf.
[2] Bell, A. and Sejnowski, T.: An information-maximization approach to blind sep-
aration and blind deconvolution. Neural Computation 7 (1995) 1129-1159.
Fig. 3. Nonlinear separation of time-domain signals. (a) originals, (b) nonlinear mixture,
(c) nonlinear separation, (d) linear separation.

[3] Comon, P.: Independent component analysis - A new concept? Signal Processing
36 (1994) 287-314.
[4] Deco, G. and Brauer, W.: Nonlinear higher-order statistical decorrelation by
volume-conserving neural architectures. Neural Networks 8 (1995) 525-535.
[5] Hochreiter, S. and Schmidhuber, J.: LOCOCODE performs nonlinear ICA without
knowing the number of sources. Proc. First Int. Worksh. Independent Component
Analysis and Signal Separation. Aussois, France. Cardoso, J. F., Jutten,
C., and Loubaton, P., eds. (1999) 277-282.
[6] Lee, T.-W., Girolami, M., Bell, A., and Sejnowski, T.: A unifying information-theoretic
framework for independent component analysis. International Journal
on Mathematical and Computer Modeling (1998).
[7] Marques, G. C. and Almeida, L. B.: An objective function for independence. Proc.
International Conference on Neural Networks. Washington DC (1996) 453-457.
[8] Marques, G. C. and Almeida, L. B.: Separation of nonlinear mixtures using pattern
repulsion. Proc. First Int. Worksh. Independent Component Analysis and Signal
Separation. Aussois, France. Cardoso, J. F., Jutten, C., and Loubaton, P., eds.
(1999) 277-282.

Fig. 4. Scatter plots. (a) originals, (b) nonlinear mixture, (c) nonlinear separation, (d)
linear separation.
[9] Pajunen, P.: Nonlinear independent component analysis by self-organizing maps.
Proc. Int. Conf. on Artificial Neural Networks. Bochum, Germany (1996) 815-819.
[10] Palmieri, F., Mattera, D., and Budillon, A.: Multi-layer independent component
analysis (MLICA). Proc. First Int. Worksh. Independent Component Analysis and
Signal Separation. Aussois, France. Cardoso, J. F., Jutten, C., and Loubaton, P.,
eds. (1999) 93-97.
[11] Xu, D., Principe, J., Fisher, J., and Wu, H.-C.: A novel measure for independent
component analysis. Proc. IEEE Int. Conf. Acoust., Speech and Sig. Processing.
Seattle, WA 2 (1998) 1161-1164.
Text-to-Text Machine Translation
Using the RECONTRA Connectionist Model*

M.A. Castaño¹, F. Casacuberta²

¹Dpto. de Informática. Universitat Jaume I de Castellón. Spain.
castano@inf.uji.es
²Dpto. Sistemas Informáticos y Computación.
Universidad Politécnica de Valencia. Spain.
fcn@iti.upv.es

Abstract. Encouragingly accurate translations have recently been obtained using a
connectionist translator called RECONTRA (Recurrent Connectionist Translator). In
contrast to traditional Knowledge-Based systems, this model is built from training data,
resulting in an Example-Based approach. It directly carries out the translation between
the source and target languages and employs a simple (recurrent) connectionist
topology and a simple training scheme. This paper extends previous work exploring
the capabilities of this RECONTRA model to perform text-to-text translations in
limited-domain tasks.

1 Introduction

In comparison with traditional Knowledge-Based Machine Translation (MT) systems, in
the last years Example-Based (EB) techniques (so-called inductive techniques) have led to
successful limited-domain applications. In this paradigm, systems are automatically built
from training sets of examples which are large enough, resulting in lower development
costs. There are several studies that directly aim at placing MT within the EB framework
[1,2,3,15]. In this direction, Neural Networks can be considered an encouraging approach
to MT, as the translation schemes presented in [13] and [18] have empirically shown.
Nevertheless, the connectionist system in [13] employs static topologies which are not
appropriate to approach a real MT task, and that in [18] is based on a translation model
which is quite complex.

In contrast to these approaches, using a simple EB recurrent connectionist translator,


encouraging results have been obtained for text-to-text limited domain applications. The
translator, called RECONTRA (Recurrent Connectionist Translator), was recently
presented in [4]. It directly carries out the translation between both the input and the
output languages (with no intermediate items) and, at the same time, automatically learns
the semantic and syntax implicit in both languages. This connectionist system was tested

* Partially supported by the Spanish CICYT, project TIC-97-0745-CO2-02.



in preliminary MT experiments [4] on a simple pseudo-natural task (with small
vocabularies) involving descriptions of visual scenes [7]. The translation rates obtained
with this academic task were quite close to 100%. This paper evaluates the above
RECONTRA translator on a more realistic task with larger vocabularies. In addition, the
accuracies provided are also compared to those obtained using other EB MT techniques to
approach the same task.

Section 2 presents the MT task employed in the experimentation. In Section 3 the
RECONTRA connectionist translator and the training algorithm are briefly described.
Section 4 details the experiments performed and reports the results achieved by the
RECONTRA translator and the other MT approaches. The conclusions of the work are
finally discussed in Section 5.

2 The Traveller Task

2.1 Description of the General Task

The task chosen for this paper is the Traveller task, which was defined within the
first phase of the EuTrans project [1]. This task can be considered a more realistic test for
our connectionist RECONTRA translator than the one we previously employed [7]. The
framework adopted for the task is that of a traveller (tourist) at the reception desk of a hotel in a
country whose language he/she does not speak. The vocabularies of the languages
considered in the project (Spanish, English, German and Italian) ranged from 500 to 700
words. Taking into account the great difference between the 30-word vocabularies
of the task previously approached using the RECONTRA model and the 700 words of the
Traveller task, we chose a subtask of the Traveller task to test our translator. This subtask
includes sentences in which the tourist notifies the reception of his departure, asks for the
bill, asks and complains about the bill, and asks for his luggage to be moved.

2.2 Categorized and Non-categorized Tasks

In order to decrease the sizes of the vocabularies and the complexity of the chosen
Traveller (sub)task, the grouping of some words and word sequences into categories was
introduced. Specifically, two categories labelled $FECHA and $HORA were used,
which represented generic dates and hours respectively. In a
first experiment we considered pairs of categorized Spanish-into-English sentences and
later, non-categorized Spanish-into-English sentences. In what follows we will refer to
them as the categorized and non-categorized Traveller tasks respectively.

No additional process to automatically categorize the source sentences or to
automatically expand the categories produced by the translator was considered in the
categorized task. The only objective of introducing this task in our experiments was to
study the behaviour of the RECONTRA translator on a task simpler than the non-
categorized Traveller task. However, both categorization and expansion processes could
also be carried out through appropriate connectionist mechanisms and integrated with the
connectionist translation process.

The Spanish vocabulary of the non-categorized task had 178 different words, which was
decreased to 132 words after categorizing the sentences. The English vocabulary had 140
and 82 words in the corresponding non-categorized and categorized tasks. Figure 1 shows
some examples of both non-categorized and categorized Spanish-into-English translations.

PAIRS OF NON-CATEGORIZED SENTENCES

Spanish: ¿ Les importaría bajar el equipaje a recepción ?
English: Would you mind sending the luggage down to reception ?
Spanish: ¿ Podemos abonar en efectivo ?
English: Can we pay in cash ?
Spanish: He de marcharme el día veintisiete de febrero a las siete y media de la tarde .
English: I should leave on February the twenty-seventh at half past seven in the afternoon .

PAIRS OF CATEGORIZED SENTENCES

Spanish: Me voy a ir el día $FECHA a $HORA de la mañana .
English: I am leaving on $FECHA at $HORA in the morning .
Spanish: ¿ Está incluido el recibo del teléfono en la factura ?
English: Is the phone bill included in the bill ?
Spanish: Nos marchamos hoy mismo a $HORA por la noche .
English: We are leaving today at $HORA in the evening .

Figure 1. Some examples of pairs of sentences of the non-categorized and categorized Traveller
tasks.
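To illustrate the categorization step, the following sketch (in Python, which is not used in the paper; both patterns are our own rough assumptions, since the real task can rely on the closed set of date and hour expressions occurring in the corpora) replaces surface date and hour expressions with the category labels of Figure 1:

    import re

    # Illustrative only: Spanish number words that may start an hour expression.
    NUM = r'(una|dos|tres|cuatro|cinco|seis|siete|ocho|nueve|diez|once|doce)'
    # "a las siete y media" -> "a $HORA" (article absorbed into the category).
    HOUR_PAT = re.compile(r'\blas? ' + NUM + r'( y (media|cuarto))?\b')
    # "el día veintisiete de febrero" -> "el día $FECHA" ("día" stays outside).
    DATE_PAT = re.compile(r'(?<=\bdía )\w+( de \w+)?')

    def categorize(sentence):
        """Replace date and hour expressions by $FECHA / $HORA labels."""
        sentence = DATE_PAT.sub('$FECHA', sentence)   # dates first, so their
        sentence = HOUR_PAT.sub('$HORA', sentence)    # numbers are consumed
        return sentence

    # categorize('me voy a ir el día dos de mayo a las siete y media de la mañana')
    # -> 'me voy a ir el día $FECHA a $HORA de la mañana'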

3 The RECONTRA Translator

3.1 Network Topology

The basic architecture adopted for the connectionist RECONTRA translator is the simple
recurrent network presented in [9]. In addition, it includes "delayed" inputs, which
reinforce the preceding and the following contexts of the input signal. The resulting neural
topology is shown in Figure 2.

Let us now see how the RECONTRA model runs. The words of the sentence to be translated
are presented sequentially at the input layer of the net, while the model has to provide the
successive words of the corresponding translated sentence. That is, the translator sees the
input sentence through a window of n words which is shifted word by word, and generates
the successive words of the output sentence one after the other.

It should be noted that the window should be wide enough that RECONTRA has seen
enough information at the input layer before providing the appropriate translated output
word; that is, the net cannot translate something it has not yet seen. However, it is not
strictly necessary that the input word(s) related to the translated output word be inside
the current input window, since the net has memory and is able to remember (some) past
events.

[Figure 2 sketches the topology: the output units, coding the English word, are fed by a hidden layer, which in turn receives the context units and the delayed inputs coding Spanish words i-1, i and i+1.]

Figure 2. The RECONTRA translator.

In order to mark the end of the sentence translated by the net, an additional output word is
included in the target vocabulary. Consequently, the presentation of the input sentence
finishes after the translator provides this special word or, failing this, after the
whole input sentence and a certain number of trailing empty input words have been presented to the network.
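To make the topology concrete, here is a minimal numpy sketch of one time cycle of an Elman-style net with delayed inputs. This is our own rendering, not the authors' code; all names, and the passing of weights as explicit arguments, are assumptions of the sketch:

    import numpy as np

    def srn_step(window_words, context, Wx, Wc, Wo, bh, bo):
        """One time cycle of an Elman-style net with delayed inputs.

        window_words : list of n input word codes (the current window),
                       each a boolean vector; they are concatenated to
                       form the input layer.
        context      : copy of the previous hidden activations.
        """
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
        x = np.concatenate(window_words)        # n delayed inputs side by side
        hidden = sigmoid(Wx @ x + Wc @ context + bh)
        output = sigmoid(Wo @ hidden + bo)      # code of the next target word
        return output, hidden                   # hidden becomes the new context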

3.2 Codification of the Source and Target Vocabularies

In order to approach MT between languages which involve large (or even medium)
vocabularies using our RECONTRA translator, a local representation of these
vocabularies cannot be employed. It would lead to networks with an excessive (and thus
unmanageable) number of connections to be trained. Consequently, a distributed
representation of both source and target vocabularies is required. Previous studies on the
most appropriate type of distributed codification to represent the vocabularies in our
RECONTRA translator were carried out in [6]. They suggested employing similar
(boolean) codifications for those words in the vocabulary which appeared in similar
syntactic contexts. These experiments also showed that learning convergence
improved when the same codification was adopted (as far as possible) for the source word
to be translated and the corresponding translated target word. Finally, these
studies revealed that significantly better translation performances were obtained by using
coarse representations in contrast to severely subsymbolic distributed representations.
Consequently, in the experiments presented in this paper we adopted boolean coarse
codifications for both the source and target vocabularies of the Traveller tasks; in addition,
similar codifications were assigned to words in the same vocabulary that appeared in the
same syntactic context and to words in different vocabularies that were translations of
each other.

Theoretical concepts on local and distributed representations in connectionist models can
be found in [8] and [17].
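One simple way to realize such codes, sketched below in Python, is to give every group of words sharing a syntactic context a common pseudo-random base code and perturb it slightly per word. This is our own illustrative scheme under invented names; the paper's exact construction is the one described in [6]:

    import random

    def coarse_codes(words_by_context, n_bits=61, n_flips=2, seed=0):
        """Assign pseudo-random boolean coarse codes so that words occurring
        in the same syntactic context receive similar codes."""
        rng = random.Random(seed)
        codes = {}
        for context_class, words in words_by_context.items():
            base = [rng.randint(0, 1) for _ in range(n_bits)]  # shared base code
            for w in words:
                code = base[:]
                for i in rng.sample(range(n_bits), n_flips):   # small perturbation
                    code[i] ^= 1
                codes[w] = code
        return codes

A translated word pair can then simply be assigned the same code in both vocabularies, as far as the vocabulary sizes allow.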

3.3 Training Procedure

The RECONTRA translator described above is trained using an on-line version of the
Backward-Error Propagation algorithm [16]. This means that the recurrent connections of
the net are ignored in the process of adjusting the weights and that they are considered
additional external inputs to the architecture (although the net is not unfolded in time as in
the Back-Propagation-through-Time learning method). Consequently, the gradient of the
error is truncated in the estimation of the weights; that is, it is not exactly computed.
However, this learning method works well in practice, as shown in [5], where it is
compared to (more computationally costly) methods which exactly follow the gradient.

The resulting training algorithm is as follows: after the inputs and target units are updated,
the forward step is computed, the error is back-propagated through the net and the weights
are modified. Then, the hidden unit activations are copied onto the corresponding context
units. This time cycle is continuously repeated until the target values mark the end of the
translated sentence. A sigmoid function (0,1) is assumed as the non-linear activation
function, and context activations are initialized to 0.5 at the beginning of every input-
output pair. The updating of the weights requires estimating appropriate values for the
learning rate and momentum. The choice of these parameters is carried out inside the
unitary bidimensional space which they define, by analyzing the residual mean squared
error of a network trained for 10 random presentations of the learning corpus (10 epochs).
Training continues with the learning rate and momentum which led to the lowest mean
squared error, and the training process stops when a certain established criterion is
verified.
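A compact Python rendering of this training cycle follows; it is a sketch of our own, where `net.backprop_step` is a hypothetical method standing for one ordinary forward pass, back-propagation and weight update in which the context is treated as a constant external input:

    import numpy as np

    def train_on_pair(net, source_windows, target_codes):
        """Present one (source, target) sentence pair, truncating the
        gradient at the recurrent connections as described above."""
        context = np.full(net.hidden_size, 0.5)   # contexts start at 0.5
        for window, target in zip(source_windows, target_codes):
            # One forward step, back-propagation and weight update; the
            # context vector is NOT differentiated through (truncated gradient).
            hidden = net.backprop_step(window, context, target)
            context = hidden.copy()               # copy hidden onto context units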

With regard to the translated message provided by the RECONTRA model, the network
continuously generates output activations. In order to interpret the activations provided at
a given time cycle, the word of the target vocabulary whose pre-established codification is
nearest to these activations is searched for.
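This nearest-codeword decoding can be written in a few lines of Python (a sketch under our own naming; the paper does not state which distance is used, so Euclidean distance is an assumption here):

    import numpy as np

    def decode(output_activations, codebook):
        """Return the target word whose pre-established code is nearest
        to the activations produced at this time cycle; `codebook` maps
        each target word to its boolean code."""
        return min(codebook,
                   key=lambda w: np.linalg.norm(
                       output_activations - np.asarray(codebook[w], dtype=float)))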

4 Experimental Results

First, the Spanish-into-English categorized Traveller task was approached using the
RECONTRA translator described in the previous section. Later, the non-categorized MT task
was learned in a second experiment using both the RECONTRA model and other recent
inductive MT approaches1.

1 All the connectionist experiments presented in the paper were trained and tested using the SNNS neural simulator
[19].

4.1 Training and Test Corpora

The corpora adopted for the two tasks approached in the paper were sets of text-to-text
pairs, each of which consisted of a sentence in the Spanish input language and the
corresponding translation in the English output language. From the Spanish-into-English
non-categorized pairs of sentences considered in the EuTrans project (related to the
subtask considered in this paper), we randomly chose 5,000 samples to train the
connectionist translator and 1,000 different pairs to test the resulting learned models.
These training and test corpora were later categorized and employed to learn and
recognize the categorized task.

3,425 of the 5,000 non-categorized training pairs were different; after categorizing these
5,000 samples, the number of different pairs decreased to 2,687. In the test corpus, 991
out of the 1,000 non-categorized pairs were different, and after categorization this
number dropped to 771.

There was no overlapping between the non-categorized training and test corpora;
however, 54% of the pairs in the categorized test set were included in the categorized
learning set.

The length of the non-categorized Spanish sentences ranged from 3 to 20 and the length of
the non-categorized English sentences, from 3 to 17. The number of words of the
categorized sentences ranged from 3 to 13 for the Spanish ones and from 3 to 12 for the
English ones.

4.2 Criterion Assessing Correct Translations

A source test sentence supplied to a connectionist architecture was considered to be
correctly translated if the output provided by the model exactly coincided with the
expected translation for this source sentence. In order to determine word accuracy, the
obtained and expected translations corresponding to every source sentence in the test
sample were compared using a conventional Edit-Distance (Dynamic Programming)
procedure [14]. In this way, the number of insertion, deletion and substitution errors was
obtained. The word accuracies reported here correspond to the ratio of the total number of
non-errors to the total number of edit (total error + correct) operations.
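The following Python sketch (our own rendering of this criterion, not the authors' code) computes that ratio by aligning the obtained and expected word sequences with a standard Levenshtein dynamic program and backtracing to count the correct matches:

    def word_accuracy(hyp, ref):
        """Non-errors / (errors + correct) along the optimal edit alignment."""
        m, n = len(hyp), len(ref)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1): d[i][0] = i
        for j in range(n + 1): d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i-1][j] + 1,                          # deletion
                              d[i][j-1] + 1,                          # insertion
                              d[i-1][j-1] + (hyp[i-1] != ref[j-1]))   # sub / match
        # Backtrace to count matches along one optimal alignment.
        i, j, matches = m, n, 0
        while i > 0 and j > 0:
            if d[i][j] == d[i-1][j-1] + (hyp[i-1] != ref[j-1]):
                matches += hyp[i-1] == ref[j-1]
                i, j = i - 1, j - 1
            elif d[i][j] == d[i-1][j] + 1:
                i -= 1
            else:
                j -= 1
        errors = d[m][n]
        return matches / (matches + errors)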

4.3 Results for the Categorized Task

The RECONTRA translators employed to approach the categorized Traveller task had 50
input units and 37 outputs, which respectively coded the 132 words of the Spanish source
vocabulary and the 83 words of the English target vocabulary (including the special word
which marks the end of the translated sentence). The codifications adopted were pseudo-
random boolean coarse codings with the features specified in Section 3.2.

Six Spanish words were presented simultaneously to the net, so that 3 and 2 words
constituted the corresponding (balanced) right and left contexts of the input word. This
input context was adopted after studying some examples of the task and verifying that the
source word(s) corresponding to the target word to be translated at every time cycle had
been previously presented to the input of the net. In order to avoid translators with an
excessive number of trainable connections, larger input contexts were not considered.

The next step in our approach to the categorized task was to estimate an adequate value for
the number of hidden units. With this objective in mind, translators with the above
features and with a single hidden layer ranging from 130 to 160 units were designed.
Appropriate values for the learning rate and momentum were found for every model. Each
of these models was trained for up to 500 epochs using the 5,000 categorized pairs
corresponding to the learning corpus of the task. The resulting trained translators were
then tested on the 1,000 recognition samples. The best test performances were obtained for
the network with 140 hidden units. Table 1 shows the (test) sentence accuracy translation
rates and the word accuracies achieved for that topology after both 100 and 500 training
epochs. These results reveal that, in spite of the low number of training epochs, the
translation performances obtained were quite good.

Table 1. Sentence accuracy translation rates and word accuracy rates for the categorized and non-
categorized Traveller tasks.

TRAVELLER TASK    FEATURES OF THE NET                  TRAINING  SENTENCE   WORD
                                                       EPOCHS    ACC. RATE  ACC. RATE
CATEGORIZED       50 input units, 37 output units,     100       98.0%      99.7%
                  6 delayed inputs, 140 hidden units   500       98.8%      99.8%
NON-CATEGORIZED   61 input units, 52 output units,     100       91.1%      98.6%
                  8 delayed inputs, 160 hidden units

Let us now see the behaviour of the RECONTRA translator on the same pairs of sentences
without categorization.

4.4 Results for the Non-categorized Task

Taking into account that a RECONTRA translator with 140 hidden units provided good
accuracy rates for the categorized Traveller task, this time we employed a model with 160
hidden units, since the vocabularies of the non-categorized task were larger. The
178 words of the Spanish vocabulary were coded into 61 boolean units using pseudo-
random coarse representations. The 140 words of the English vocabulary, together with the
word which marks the end of the translation, were coded into 52 units in a similar way.
After studying some pairs of examples of the task, we noticed that at least 8 delayed inputs
(with 4 words for the left context and 3 words for the right context) were required.

Summarizing, the RECONTRA translator considered for tackling the non-categorized
Traveller task was a network with 61 input units, 52 outputs, 160 hidden units and 8
(4+1+3) delayed inputs. After estimating appropriate values for the learning rate and
momentum, this model was trained for 100 epochs using the 5,000 pairs of the non-
categorized learning corpus. The trained translator was then tested on the 1,000 test
sentences. Table 1 shows the corresponding translation rates reached. Looking at these
results, it can be observed that the accuracies are good enough to support preliminary
experimentation on non-categorized MT tasks with vocabularies of medium size. It
should be noted that these performances cannot be directly compared with those achieved
on the previous categorized task; since neither categorization nor expansion of the
categorized sentences was additionally carried out, the semantic domains of the
categorized and the non-categorized tasks did not coincide.

4.5 Comparison with Other Inductive MT Approaches

In order to compare the results achieved using our connectionist RECONTRA translator,
the experiments on the non-categorized Traveller task presented in the previous section
were repeated using a translation model based on subsequential transducers similar to
that presented in [2]. The scheme combined subsequential transducers with language
models of both the input and output languages (built using 3-grams [12]) and with an
error-correcting model based on the Levenshtein distance [10]. The resulting
translator was trained and tested employing the same learning and test samples
as those considered with the RECONTRA translator. Table 2 shows the test sentence
performances reached.

Table 2. Sentence translation rates achieved using different inductive techniques to approach the
non-categorized Traveller task.

TRANSLATION MODEL                        SENTENCE ACC. RATE
Probabilistic alignments                 77.6%
Grammar association with perceptrons     58.3%
Grammar association with LOCO model      79.5%
Subsequential transducers                27.2%
RECONTRA                                 91.1%

The non-categorized Traveller task was also approached by García and Prat using their
respective (developing) techniques based on probabilistic alignments and grammar
association. The first technique [11] was inspired by translation Model 2, previously
developed at IBM [3]; the second technique estimated the association
probabilities through a multilayer perceptron and through a model called LOCO [15].
Table 2 summarizes the sentence translation rates achieved using the same corpora
employed in our non-categorized experiment to train and test these last translation models.
More details about these experiments can be found in [15].

Looking at the comparative results shown in Table 2, we can observe that the best
translation performance was provided by the RECONTRA connectionist model. However,
our translator required the largest storage space and learning time. On the other hand, the
accuracies obtained using subsequential transducers could be improved by providing
larger training samples.

5 Conclusions and Future Work

In this paper, the Example-Based RECONTRA translator recently presented in [4] is tested on a new
text-to-text MT task. The task, which is a subtask of the Traveller task [1],
has a wider (restricted) semantic domain than the tasks previously approached
using the RECONTRA model, and the vocabularies involved have nearly 200 words each. In
order to decrease the complexity of the task and the sizes of the vocabularies, a first
experiment in which both the sentence to be translated and the corresponding translated
sentence were partially categorized was carried out. Later, non-categorized pairs of sentences
were considered. The results showed that translation accuracies close to 100% were
achieved for the simpler (categorized) task, while 91% of the more complex non-
categorized sentences were correctly translated. In addition, this last performance was
higher than those obtained using other inductive MT techniques to approach
the same task.

Considering these encouraging results, it seems feasible that future work could deal with
more complex limited-domain translations and with larger vocabularies. However, in
order to avoid translators of unmanageable size, more experimentation with
effective compact (coarse or distributed) representations of the vocabularies is required.
Destructive training methods can also be employed to reduce the size of the networks (and
thus the learning time). New connectionist architectures which further lower this
learning time should also be considered. Automatic categorization of non-categorized
source sentences and translation of instances of translated categories will be added to
the process of translating categorized sentences. Finally, the integration of our translator
with a module which recognizes voice input is still pending.

References

1. J.C. Amengual, J.M. Benedí, K. Beulen, F. Casacuberta, M.A. Castaño, A. Castellanos, D.
Llorens, A. Marzal, H. Ney, F. Prat, E. Vidal, J.M. Vilar. Speech Translation based on
Automatically Trainable Finite-State Models. Procs. of the 5th European Conference on
Speech Communication and Technology (EUROSPEECH-97), vol. 1, Rhodes,
Greece. 1997.
2. J.C. Amengual, J.M. Benedí, F. Casacuberta, M.A. Castaño, A. Castellanos, D. Llorens, A.
Marzal, F. Prat, E. Vidal, J.M. Vilar. Error-Correcting Parsing for Text-to-Text Machine
Translation using Finite State Models. Procs. of the 7th International Conference on
Theoretical and Methodological Issues in Machine Translation (TMI-97), pp. 135--142. Santa
Fe, USA. 1997.

3. P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, R.L. Mercer. The Mathematics of Statistical
Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pp.
263--311. 1993.
4. M.A. Castaño, F. Casacuberta. A Connectionist Approach to Machine Translation. Procs. of
the 5th European Conference on Speech Communication and Technology (EUROSPEECH-
97), vol. 1, pp. 91--94, Rhodes, Greece. 1997.
5. M.A. Castaño, F. Casacuberta. Training Simple Recurrent Networks through Gradient Descent
Algorithms. In "Biological and Artificial Computation: From Neuroscience to Technology",
"Lecture Notes in Computer Science", vol. 1240, pp. 493--500. Eds. J. Mira, R. Moreno-Díaz,
J. Cabestany. Springer-Verlag. 1997.
6. M.A. Castaño. Redes Neuronales Recurrentes para Inferencia Gramatical y Traducción
Automática. Ph.D. dissertation, Dpto. de Sistemas Informáticos y Computación, Universidad
Politécnica de Valencia. 1998.
7. A. Castellanos, I. Galiano, E. Vidal. Application of OSTIA to Machine Translation Tasks. In
"Lecture Notes in Computer Science", vol. 862, pp. 93--105, R.C. Carrasco and J. Oncina
(Eds.), Springer-Verlag. 1994.
8. G. Dorffner. A step towards sub-symbolic language models without linguistic representations.
Connectionist Approaches to Language Processing, vol. 1. Eds. R. Reilly, N. Sharkey.
Erlbaum. 1990.
9. J.L. Elman. Finding Structure in Time. Cognitive Science, vol. 14, no. 2, pp. 179--211. 1990.
10. K.S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall. 1982.
11. I. García. Traducción Automática basada en Métodos Estadísticos. Final year project. Dpto.
de Sistemas Informáticos y Computación. Universidad Politécnica de Valencia. 1996.
12. F. Jelinek. Language Modelling for Speech Recognition. Procs. of the 12th European
Conference on Artificial Intelligence (ECAI-96), pp. 26--32, Hungary. 1996.
13. N. Koncar, G. Guthrie. A Natural Language Translation Neural Network. Procs. of the Int.
Conf. on New Methods in Language Processing, pp. 71--77, Manchester, UK. 1994.
14. A. Marzal, E. Vidal. Computation of Normalized Edit Distance and Applications. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 9. 1993.
15. F. Prat. Traducción Automática en Dominios Restringidos: Algunos Modelos Estocásticos
Susceptibles de ser Aprendidos a partir de Ejemplos. Ph.D. dissertation, Dpto. de Sistemas
Informáticos y Computación, Universidad Politécnica de Valencia. 1998.
16. D.E. Rumelhart, G. Hinton, R. Williams. Learning sequential structure in simple recurrent
networks. In "Parallel distributed processing: Experiments in the microstructure of cognition",
vol. 1. Rumelhart D.E., McClelland J.L. and the PDP Research Group (Eds.), MIT Press,
Cambridge. 1986.
17. N.E. Sharkey. Connectionist Representations for Natural Language: Old and New. Procs. of
the VI SEPLN, Donostia. 1990.
18. A. Waibel, A.N. Jain, A.E. McNair, H. Saito, A.G. Hauptmann, J. Tebelskis. JANUS: A
Speech-to-Speech Translation System using Connectionist and Symbolic Processing
Strategies. Procs. ICASSP-91, pp. 793--796. 1991.
19. A. Zell et al. SNNS: Stuttgart Neural Network Simulator. User manual, Version 4.1. Technical
Report no. 6/95, Institute for Parallel and Distributed High Performance Systems, University
of Stuttgart. 1995.
An Intelligent Agent for Brokering
Problem-Solving Knowledge

V. Richard Benjamins1, Bob Wielinga1, Jan Wielemaker1 and Dieter Fensel2

1 Dept. of Social Science Informatics (SWI), University of Amsterdam, Roetersstraat 15, 1018 WB
Amsterdam, The Netherlands, richard@swi.psy.uva.nl, http://www.swi.psy.uva.nl/
2 University of Karlsruhe, Institute AIFB, 76128 Karlsruhe, Germany, dfe@aifb.uni-karlsruhe.de,
http://www.aifb.uni-karlsruhe.de/WBS/dfe/

Abstract
We describe an intelligent agent (a broker) for the configuration and execution of knowl-
edge systems in response to customer requests. The knowledge systems are configured from reusable
problem-solving methods that reside in digital libraries on the Internet. The approach
followed amounts to solving two subproblems: (i) the configuration problem, which im-
plies that we have to reason about problem-solving components, and (ii) the execution of
heterogeneous components. We use CORBA as the communication infrastructure.

1 Introduction and motivation


We think that software reuse will play an increasingly important role in the next century,
for general software components as well as for so-called knowledge components. Knowl-
edge components are an object of study in the knowledge engineering community and include
problem-solving methods and ontologies. In this paper, we are concerned with problem-
solving methods (PSMs). Nowadays, many PSM repositories exist at different locations
[4, 24, 7, 29, 2, 31, 8, 20], which opens, in principle, the way to large-scale reuse. There
are, however, at least two problems that hamper widespread reuse of these problem-solving
components: they are neither accessible nor interoperable. In this paper, we present an ap-
proach aimed at remedying these two problems. We will present a software agent -a broker-
that is able to configure PSMs into an executable reasoner. The work presented here forms
part of an ESPRIT project whose aim is to make knowledge-system technology more widely
available at lower costs.
Our approach is based on the integration of different technologies: knowledge modeling,
interoperability standards and ontologies. PSMs are made accessible by describing them
in the product description language UPML (Unified Problem-solving Method description
Language), whose development is based on a unification of current knowledge-modeling ap-
proaches [25, 9, 28, 1, 23, 27]. For letting heterogeneous PSMs work together, we use CORBA
[22, 15]. Ontologies are used to describe the different worlds of the agents involved, which
have to be mapped onto each other.
In a nutshell, the two tasks we aim to solve are the following (illustrated in Figure 1). A
broker program configures individual PSMs -which reside in different libraries on the Internet-
into a coherent problem solver. This task is carried out at the UPML level and involves:
interaction with a customer to establish the customer requirements, matching the require-
ments with PSMs, checking the applicability of the identified PSMs, deriving components
for glueing the PSMs to the customer's knowledge base, and imposing a control regime on
the selected components (System1 in Figure 1). The other task we have to deal with is to
actually execute the configured problem solver on the customer's problem (using its KB). It
does not really matter whether the PSMs are retrieved from the libraries and migrated to
the broker or the customer's site, or whether they remain in the respective libraries and are
executed distributively. CORBA makes these options transparent. The only requirement is

[Figure 1 depicts System1 (the broker configuring a problem solver from the PSM libraries) and System2 (the running KBS built from the customer's KB, the selected PSMs and the glue).]

FIGURE 1: Distinction between two systems: (1) the broker configures a problem solver by reasoning with
UPML, and (2) the output of the broker is a knowledge system, which consists of executable code fragments
corresponding to the selected PSMs, along with "glue" for their integration to make them interoperate. The
arrows in System1 denote UPML expressions, whereas the arrows in System2 stand for CORBA structures.

that the site where a PSM is executed should support the language in which the PSM is
implemented (System2 in Figure 1).
In Section 2, we briefly review the ingredients needed to explain our approach. Section 3
describes the configuration task of the broker. In Section 4, we outline how the configured
problem solver is executed, and in Section 5 we sketch the CORBA architecture that implements
our approach. Finally, Section 6 concludes the paper.

2 Ingredients
Before we explain our approach in detail, we first briefly explain its ingredients: PSMs,
ontologies, UPML and CORBA.

2.1 Problem-solving methods

The components we broker are problem-solving methods, which are domain-independent de-
scriptions of reasoning procedures. PSMs are usually described as having an input/output
description, a competence description (what they can deliver), and assumptions on domain
knowledge (what they require before they can deliver their competence). We distinguish be-
tween two kinds of PSMs: primitive and composite ones. Composite PSMs comprise several
subtasks that together achieve the competence, along with an operational description speci-
fying the control over the subtasks. Primitive PSMs are directly associated with executable
code.

2.2 Ontologies
An ontology is a shared and common understanding of some domain that can be commu-
nicated across people and computers [17, 32, 30]. Most existing ontologies are domain
ontologies, reflecting the fact that they capture (domain) knowledge about the world inde-
pendently of its use [18]. However, one can also view the world from a "reasoning" (i.e.
use) perspective [19, 14, 10]. For instance, if we are concerned with diagnosis, we will talk
about "hypotheses", "symptoms" and "observations". We say that those terms belong to
the task ontology of diagnosis. Similarly, we can view the world from a problem-solving
point of view. For example, Propose & Revise sees the world in terms of "states", "state
transitions", "preferences" and "fixes" [14, 20]. These terms are part of the method or PSM
ontology [16] of Propose & Revise.
Ontologies can be used to model the different agents involved in our scenario (illus-
trated in Figure 3). So we have task ontologies, PSM ontologies and domain ontologies to
characterize respectively the type of task the customer wants to solve, the PSM, and the
application domain for which a customer wants a KBS to be built. These different ontologies
are related to each other through what we call bridges.

FIGURE 2: The class hierarchy of the UPML language (left), and the attributes of a UPML specification
(right).

2.3 A product description language


In order to reason about the different components involved, we need a component-description
language. The idea is that providers of PSMs (library builders) characterize their products
(i.e. PSMs) using a standard language. Note that providers are free to use any particular
implementation language. The broker understands this product language and reasons with
it. The language we developed is the Unified Problem-solving Method description Language
(UPML), which integrates notions from various existing knowledge modeling approaches [12].
UPML allows one to describe in an integrated way task ontologies, task specifications, domain
ontologies, PSM ontologies, and bridges between these components. The syntax of the

language is specified in the ProtegeWin3 tool [21], which allows one to write
down a meta-description of a language. Figure 2 gives the class hierarchy of UPML (left part
of the figure). A UPML specification consists of, among others, tasks, PSMs, domain models,
ontologies and bridges (see right part of Figure 2). Bridges have to fill the gap between
different ontologies by renaming and mapping.
Having specified the structure and syntax of UPML, ProtegeWin can automatically
generate a knowledge acquisition tool for it, which can be used to write instances in UPML
(i.e. actual model components). For describing the competence of PSMs, FOL formulas can
be used. Typically, library providers use a subset (the part related to PSMs) of UPML to
characterize their PSMs, using the generated KA tool.
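Purely to illustrate the kind of information such a characterization carries, the following Python literal is an invented, much-simplified PSM descriptor; it is emphatically not UPML syntax, and every field name is our own:

    # NOT actual UPML: an invented, simplified characterization showing the
    # kind of content a library provider would supply for the "prune" PSM.
    prune_psm = {
        "name": "prune",
        "kind": "primitive",                  # primitive vs. composite
        "input":  {"input-set": "set of classes",
                   "properties": "set of properties"},
        "output": {"output-set": "set of classes"},
        "competence": "forall x: in(x, output-set) => "
                      "in(x, input-set) and test(x, properties)",
        "assumptions": ["has_property(x, p) => (true(x) => true(p))"],
        "implementation": "prolog://library1/prune",  # handle to executable code
    }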

2.4 Interoperability standard


CORBA stands for the Common Object Request Broker Architecture [22] and allows for
network-transparent communication and component definition. It enables distributed exe-
cution of heterogeneous programs. Each of the participating programs needs to be provided
with a so-called IDL description (Interface Definition Language), which defines a set of
common data structures. Programs can then exchange data that comply with the IDL
definition.
In our approach, we use CORBA both for exchanging data during execution of the
problem solver and for exchanging UPML specifications between the broker, the
libraries and the customer during the configuration task (note that, for clarity, in
Figure 6 the use of CORBA is only depicted for the execution part, not for the configuration
part).

[Figure 3 depicts the broker mediating between the task ontology, the PSM libraries, the PSM-domain bridge and the customer's KB.]
FIGURE 3: The steps the broker needs to make for selecting a PSM.

3 Configuration task of the broker


In order to configure a problem solver, the broker reasons with characterizations of compo-
nents written in UPML. In Section 5, we will explain how the broker gets access to UPML
descriptions of PSMs. Figure 3 illustrates the different steps the broker takes. The cur-
rent version of the broker is implemented in Prolog4. Therefore, we have built a parser
that generates Prolog from UPML specifications written with the KA tool generated by
ProtegeWin.
3 http://smi-web.stanford.edu/projects/prot-nt/
4 SWI-Prolog [33].

Broker-customer interaction  The first task to be carried out is to elicit the customer's
requirement, that is, what kind of task s/he wants to have solved. We use the notion
of a task ontology for this. A task ontology describes the terms and relations that always
occur in that task, described by a signature. When a task is applied to a specific domain, it
imports the corresponding domain ontology. A task ontology additionally describes axioms
that define the terms of the signature.
With a particular task ontology, a whole variety of specific instances of the task
can be defined. In other words, a customer can construct a specific goal s/he wants to
have achieved by combining terms of the task ontology. For example, in a classification
task, the task ontology would define that solution classes5 need to satisfy several proper-
ties. Additional requirements on the goal are also possible, like complete classification and
single-solution classification. A specific goal would consist of some combination of a sub-
set of these axioms, along with the input and output specification (i.e. observations and
classes, respectively). Goals can be specified in FOL (of which the customer does not need
to be aware), such as: ∀ x: class  in(x, output-set) ⇒ in(x, input-set) ∧ test(x,
properties), which says that the output class is a valid solution if it was in the original
input and if its properties pass some test (namely, that they are observed).

Broker-library interaction and broker-customer's KB interaction  Given the goal
of the customer, it is the broker's task to locate relevant and applicable PSMs. Two problems
need to be solved here:

• Matching the goal with PSM competences and finding a suitable renaming of terms
(the ontology of the task and the ontology of the PSMs may have different signatures).
• Checking the assumptions of the PSM against the customer's knowledge base, and gener-
ating the needed PSM-domain bridge (for mapping different signatures).

These tasks are closely related to matching software components in Software Engineering
[34, 26], where theorem proving techniques have been shown to be interesting candidates. For the
current version of our broker, we use the leanTAP [3] theorem prover, which is an iterative-
deepening theorem prover for Prolog that uses tableau-based deduction.
For matching the customer's goal with the competence of a PSM, we try to prove the
task goal given the PSM competence. More precisely, we want to know whether the goal
logically follows from the conjunction of the assumptions of the task, the postcondition of
the PSM and the assumptions of the PSM. Figure 4 illustrates the result of a successful
proof for a classification task (set-pruning) and one specific PSM (prune). In Figure 4,
Formula (1) represents the task goal to be proven (explained in the paragraph on "broker-
customer interaction"), Formula (2) denotes the assumption of the task, Formula (3) the
postcondition of the PSM and Formula (4) the assumption of the PSM. The
generated substitution represents the PSM-task bridge needed to map the output roles of
the task and PSM onto each other. The output of the whole matching process, if successful,
is a set of PSMs whose competences match the goal, along with a renaming of the input and
output terms involved. If more than one match is found, the best6 needs to be selected. If no
match is found, then a relaxation of the goal might be considered or additional assumptions
could be made [5].
5 Note that "class" is used in the context of classification, and not in the sense of the OO-paradigm.
6 In the current version, we match only on competence (thus on functionality). However, UPML has a slot
for capturing non-functional, pragmatic factors, such as how often the component has been retrieved, whether that
was successful or not, for what application, etc. Such non-functional aspects play an important role in practical
component selection.

9 ?- match_psm('set-pruning', prune, Substitution, 10).
The goal to be proven is:
formula(forall([var(x, class)]),                                        (1)
    implies(in(x, 'output-set'),
        and(in(x, 'input-set'), test(x, properties)))).
The theory is:
and(formula(forall([var(x, class)]),                                    (2)
        equivalent(test(x, properties),
            formula(forall([var(p, property)]),
                implies(in(p, properties),
                    implies(true(x), true(p)))))),
    and(formula(forall([var(x, class)]),                                (3)
            implies(in(x, output),
                and(in(x, input),
                    formula(forall([var(p, property)]),
                        and(in(p, properties),
                            has_property(x, p)))))),
        formula(forall([var(x, element), var(p, property)]),            (4)
            implies(has_property(x, p), implies(true(x), true(p)))))).

Substitution = ['input-set'/input, properties/properties, 'output-set'/output]

Yes
......................................................................................

FIGURE 4: Matching the task goal of the "set-pruning" task with the competence description of the "prune"
PSM using a theorem prover. The task provides the goal to be proven (1). The theory from which to
prove the goal is constituted by the assumptions of the task (2), the postcondition of the PSM (3) and the
assumptions of the PSM (4). The "10" in the call of the match denotes that we allow the theorem prover to
search 10 levels deep. The resulting substitution constitutes the PSM-task bridge.

10 ?- bridge_pd(prune, 'apple-classification', B).
The goal to be proven is:
formula(forall([var(x, element), var(p, property)]),                    (1)
    implies(has_property(x, p), implies(true(x), true(p)))).
The theory is:
and(formula(forall([var(c, class), var(f, feature)]),                   (2)
        implies(has_feature(c, f), implies(true(c), true(f)))),
    forall([var(x, class), var(y, feature)],                            (3)
        equivalent(has_feature(x, y), has_property(x, y)))).

Limit = 1
Limit = 2
Limit = 3
Limit = 4

B = [forall([var(x, class), var(y, feature)], equivalent(has_feature(x, y),
has_property(x, y)))]

Yes
......................................................................................

FIGURE 5: Deriving the PSM-domain bridge: ∀ x:class, y:feature (has-feature(x,y) ⇔ has-property(x,y)), at
the fourth level.

Once a PSM has been selected, its assumptions need to be checked against the customer's
knowledge base. Because the signatures of the KB ontology and PSM ontology are usually
different, we may need to find a bridge to make the required proof possible. Figure 5
illustrates the result of a successful proof for deriving a PSM-domain bridge. In the figure,
we ask to derive a PSM-domain bridge to link together the prune PSM and a KB for
apple classification. Formula (1) represents the PSM assumptions (the same as Formula (4)
in Figure 4). We want to prove Formula (1) from the assumption of the KB (Formula
(2)) and some PSM-domain bridge (if needed). In our prototype, a PSM-domain bridge is
automatically constructed, based on an analysis of the respective signatures. This involves
pairwise comparison of the predicates used by the PSM and the KB that have the same
arity. A match is found if the respective predicate domains can be mapped onto each other.
The constructed bridge is added to the theory (Formula (3)) and then the theorem prover
tries to prove the PSM assumption from this theory (i.e., from the conjunction of Formulas
(2) and (3)), which succeeds at the fourth level of iteration (Figure 5). Note that, in
general, it is not possible to check every assumption automatically in the KB; some of them
just have to be believed true [6].
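The signature analysis just described can be pictured as follows. This is a Python sketch of our own, not the broker's Prolog code; signatures are assumed to be given as name -> tuple of argument domains, and `domain_maps` is a placeholder predicate deciding whether two domains can be mapped onto each other:

    def propose_bridges(psm_sig, kb_sig, domain_maps):
        """Pairwise comparison of PSM and KB predicates of equal arity."""
        bridges = []
        for p, p_domains in psm_sig.items():
            for q, q_domains in kb_sig.items():
                if len(p_domains) == len(q_domains) and \
                   all(domain_maps(a, b) for a, b in zip(p_domains, q_domains)):
                    bridges.append((p, q))   # candidate equivalence, e.g.
                                             # has_property <-> has_feature
        return bridges

    # propose_bridges({'has_property': ('element', 'property')},
    #                 {'has_feature':  ('class', 'feature')},
    #                 lambda a, b: True)
    # -> [('has_property', 'has_feature')]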
Figure 3 shows the case for primitive PSMs. In the case of composite PSMs, the following
happens. When a composite PSM has been found to match the task goal, its comprising
subtasks are considered as new goals for which PSMs need to be found. Thus, the broker
consults the libraries again to find PSMs. This continues recursively until only primitive
PSMs are found.

[Figure 6 sketches the overall picture: the broker's configuration task, drawing on task structures, the PSM libraries, the task assumptions and the customer's task and KB, produces a problem-solver program whose statements (do-task(In,Out) :- solve(...), ...) are executed over the CORBA bus on top of TCP/IP.]
FIGURE 6: The whole picture.

Integration of selected PSMs  The result of the process described above is a set of
PSMs to be used for solving the customer's problem. In order to turn these into a coherent
reasoner, they need to be put together. In the current version, we simply chain PSMs based
on common inputs and outputs, taking into account the types of the initial data and the
final solution. This means that we only deal with sequential control and not with iteration
and branching. We plan to extend this by considering the control knowledge specified in the
operational descriptions of composite PSMs (controlling the execution of their subtasks).
This knowledge needs to be kept track of during PSM selection, and can then be used to glue
the primitive PSMs together. The same type of control knowledge can be found explicitly in
existing task structures for modeling particular task-specific reasoning strategies [9, 2, 11].
Task structures include task/sub-task relations along with control knowledge, and represent
knowledge-level descriptions of domain-independent problem solvers. If the collection of
PSMs selected by the broker matches an existing task structure (this can be a more or less
strict match), then we can retrieve the corresponding control structure and apply it.
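A minimal sketch of the sequential chaining follows (in Python, under our own naming; each PSM is assumed to be a (name, input type, output type) triple, and only sequential control is handled, matching the current version of the broker):

    def chain_psms(psms, initial_type, goal_type):
        """Greedily chain PSMs on common inputs/outputs, from the type of
        the initial data to the type of the final solution."""
        chain, current = [], initial_type
        remaining = list(psms)
        while current != goal_type:
            step = next((p for p in remaining if p[1] == current), None)
            if step is None:
                raise ValueError("no PSM consumes type %r" % current)
            chain.append(step[0])     # statement for the problem-solver program
            remaining.remove(step)
            current = step[2]         # its output feeds the next PSM
        return chain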

Output of the configuration task of the broker  The output of the broker is thus a
program in which each statement corresponds to a PSM (with the addition of the derived
PSM-domain bridge to relate the PSM predicates to the needed KB predicates), as illus-
trated -in an oversimplified way- in Figure 6. The next step is to execute this program,
which may consist of heterogeneous parts.

module ibrow
{
  typedef string atom;

  enum simple_value_type
  { int_type,
    float_type,
    atom_type
  };

  enum value_type
  { simple_type,
    compound_type,
    list_type
  };

  union simple_value switch (simple_value_type)
  { case int_type:   long int_value;
    case float_type: float float_value;
    case atom_type:  atom atom_value;
  };

  union value switch (value_type)
  { case simple_type:   simple_value simple_value_value;
    case compound_type: sequence<value> name_and_arguments;
    case list_type:     sequence<value> list_value;
  };

  interface psm
  { value solve(in value arg);
  };
};

FIGURE 7: The IDL description for list-like data structures.

4 Execution of the problem solver


Once we have selected the PSMs, checked their asSUml)tions and integrated them into a
specification of a problem solver, the next step is to execute the problem solver applied to

the customer's knowledge base. Figure 6 situates the execution process in the context of
the overall process.
Since we use CORBA, we need to write an IDL in which we specify the data structures
through which the PSMs, the KB and the broker communicate [15]. Figure 7 shows the
IDL. In principle, this IDL can then be used to make interoperable PSMs written in any
language (as long as a mapping can be made from the language's internal data structures
to the IDL-defined data structures). In our current prototype, we experiment with Prolog
and Lisp, and our IDL provides definitions for list-like data structures with simple and
compound terms. This IDL is good for languages based on lists, but might not be the best
choice for including object-oriented languages such as Java. An IDL based on attribute-
value pairs might be an alternative. Figure 8 illustrates the role of IDL in the context of
heterogeneous programs and CORBA. Given the IDL, compilers generate language-specific
wrappers that translate statements that comply with the IDL into structures that go onto
the CORBA bus. The availability of such compilers7 depends on the particular CORBA
version/implementation used (we used ILU and Orbix).

[Figure 8 shows Prolog and Lisp programs connected to the CORBA bus through their respective wrappers.]

FIGURE 8: The role of IDL.

The last conversion to connect a particular language to the CORBA bus is performed
by a wrapper (see the left wrappers in Figure 8) constructed by a participating partner (e.g.
a library provider of Prolog PSMs). This wrapper translates the internal data structures
used by the programmer of the PSM or KB (e.g. pure Prolog) into statements accepted by
the automatically generated wrapper (e.g. "IDL-ed" Prolog). Figure 9 shows an example
of a simple PSM written in Prolog and wrapped to IDL. Wrapping is done by the module
convert, which is imported into the PSM and activated by the predicates in_value and
out_value.

5 Architecture

In the context of CORBA, our PSMs are servers and the statements in the problem solver
(the program configured by the broker, see Figure 6) are the clients. This means that each
PSM is a separate server, the advantage being modularity. If we add a new PSM to the
library, and assuming that the PSMs run distributively at the library's site, then we can
easily add new PSMs without side effects. The customer's KB is also a server, to which the
broker and the PSMs can send requests.
During the execution of the problem solver, the broker remains in charge of the overall
control. Execution means that when a statement in the problem solver program is called (a
client is activated), a request is sent out to the CORBA bus and picked up by the appropriate
7 The compiler for Prolog has been developed in-house.

:- module(prune, [
       psm_solve/3
   ]).
:- use_module(server(client)).
:- use_module(convert).

% Wrapper: decode the incoming IDL value, run the PSM, encode the result.
psm_solve(_Self, Arg, Return) :-
    in_value(Arg, prune(Classes, Features)), !,
    prune(Classes, Features, Out),
    out_value(Out, Return).

% Keep the classes that exhibit every required feature.
prune([], _, []) :- !.
prune(Classes, Features, Candidates) :-
    setof(Class,
          ( member(Class, Classes),
            forall(member(Feature, Features),
                   has_property(Class, Feature))),
          Candidates).
...........................................................

FIGURE 9: A simple "prune" PSM implemented in Prolog.

PSM - a server (through a unique naming service). Execution of the PSM may mean that
the PSM itself becomes a client, which sends requests to the customer's knowledge base (a
server). Once the PSM has finished running and has generated an output, this is sent back to
the broker program, which then continues with the next statement.
Another issue is that typically a library offers several PSMs. Our approach is that each
library needs a meta-server that knows which PSMs are available and that starts up their
corresponding servers when needed. The same meta-server is also used for making UPML
descriptions available to the broker. In our architecture, PSMs are considered objects with
two properties: (i) their UPML description and (ii) their executable code. The meta-servers
of the various libraries thus have a dual function: (i) providing UPML descriptions of their
contained PSMs, and (ii) providing a handle to the appropriate PSM implementation.
Interaction with the broker takes place through a Web browser. We use a common
gateway interface to Prolog (called PLCGI8) to connect the broker with the Web.

6 Conclusions
We presented an approach for brokering problem-solving knowledge on the Internet. We
argued that this implied solving two problems: (i) configuration of a problem solver from
individual problem-solving methods, and (ii) execution of the configured, possibly hetero-
geneous, problem solver. For the configuration problem, we developed a language to char-
acterize problem-solving methods (UPML), which can be considered a proposal for a
standard product-description language for problem-solving components in the context of
electronic commerce. We assume that library providers of PSMs characterize their products
in UPML. Moreover, we also assume that the customer's knowledge base is either charac-
terized in UPML, or that a knowledgeable person is capable of answering all questions the
broker might ask concerning the fulfillment of PSM assumptions. For matching customers'
goals with the competences of PSMs, we used a theorem prover, which worked satisfactorily
for the experiments we did.
8 PLCGI is only for internal use.

With respect to the execution problem, we use CORBA to make interoperability of
distributed programs network-transparent. The use of CORBA for our purpose has turned out
to be relatively straightforward. Our current IDL describes a list-like data structure, making
interoperability between, for example, Prolog and Lisp easy. For including object-oriented
languages such as Java, we may have to adapt the current IDL. We assume that the PSMs
and the customer's knowledge base come with wrappers for converting their language-specific
data structures into those defined in the IDL.
In this paper, we presented our approach by demonstrating the core concepts for a simple
case, but we have to do additional work to see how the approach scales up. For example, the
leanTAP theorem prover was convenient for our small example, but might not scale up to more
complex proofs. Other possible directions to scale up include (i) making the matching of the
customer's goal with the competence of the PSMs less strict (partial match), (ii) including
other aspects than functionality in this matching process (non-functional requirements),
(iii) including other control regimes than sequencing (branching, iteration), (iv) dealing
with sets of (possibly interacting) assumptions, and (v) tackling more complicated tasks than
classification.
The interface through which customers interact with the broker is an important issue.
Currently, the broker takes the initiative in a guided dialogue, asking the customer for
information when needed. Our plan is to extend this to allow more flexibility, in the sense
that the customer can browse through the libraries (using Ontobroker's hyperbolic views
[13]), select PSMs, check the consistency of her/his selection, and ask for suggestions on which
PSMs to add to the current selection.

Acknowledgment
This work is carried out in the context of the IBROW project9 with support from the
European Union under contract number EP: 27169.

References
[1] J. Angele, D. Fensel, D. Landes, S. Neubert, and R. Studer. Model-based and incremental
knowledge engineering: the MIKE approach. In J. Cuena, editor, Knowledge Oriented Software
Design, IFIP Transactions A-27, Amsterdam, 1993. Elsevier.
[2] L. Nunes de Barros, J. Hendler, and V. R. Benjamins. Par-KAP: a knowledge acquisition
tool for building practical planning systems. In M. E. Pollack, editor, Proc. of the 15th IJCAI,
pages 1246-1251, Japan, 1997. International Joint Conference on Artificial Intelligence, Morgan
Kaufmann Publishers, Inc. Also published in Proceedings of the Ninth Dutch Conference on
Artificial Intelligence, NAIC'97, K. van Marcke, W. Daelemans (eds), University of Antwerp,
Belgium, pages 137-148.
[3] B. Beckert and J. Posegga. leanTAP: Lean tableau-based deduction. Journal of Automated
Reasoning, 15(3):339-358, 1995.
[4] V. R. Benjamins. Problem-solving methods for diagnosis and their role in knowledge acquisition.
International Journal of Expert Systems: Research and Applications, 8(2):93-120, 1995.
[5] V. R. Benjamins, D. Fensel, and R. Straatman. Assumptions of problem-solving methods and
their role in knowledge engineering. In W. Wahlster, editor, Proc. ECAI-96, pages 408-412. J.
Wiley & Sons, Ltd., 1996.
[6] V. R. Benjamins and C. Pierret-Golbreich. Assumptions of problem-solving methods. In
N. Shadbolt, K. O'Hara, and G. Schreiber, editors, Lecture Notes in Artificial Intelligence,
9 http://www.swi.psy.uva.nl/projects/IBROW3/home.html

1076, 9th European Knowledge Acquisition Workshop, EKAW-96, pages 1-16, Berlin, 1996.
Springer-Verlag.
[7] J. Breuker and W. van de Velde, editors. CommonKADS Library for Expertise Modeling. IOS
Press, Amsterdam, The Netherlands, 1994.
[8] B. Chandrasekaran. Design problem solving: A task analysis. AI Magazine, 11:59-71, 1990.
[9] B. Chandrasekaran, T. R. Johnson, and J. W. Smith. Task-structure analysis for knowledge
modeling. Communications of the ACM, 35(9):124-137, 1992.
[10] B. Chandrasekaran, J. R. Josephson, and V. R. Benjamins. The ontology of tasks and methods.
In B. R. Gaines and M. A. Musen, editors, Proceedings of the 11th Workshop on Knowledge
Acquisition, Modeling and Management, Banff, Alberta, Canada, 1998. SRDG Publications,
University of Calgary.
[11] D. Fensel and V. R. Benjamins. Key issues for automated problem-solving methods reuse. In
H. Prade, editor, Proc. of the 13th European Conference on Artificial Intelligence (ECAI-98),
pages 63-67. J. Wiley & Sons, Ltd., 1998.
[12] D. Fensel, V. R. Benjamins, S. Decker, M. Gaspari, R. Groenboom, W. Grosso, M. Musen,
E. Motta, E. Plaza, A. Th. Schreiber, R. Studer, and B. J. Wielinga. The component model
of UPML in a nutshell. In Proceedings of the First Working IFIP Conference on Software
Architecture (WICSA1), San Antonio, Texas, 1999.
[13] D. Fensel, S. Decker, M. Erdmann, and R. Studer. Ontobroker: The very high idea. In Proceed-
ings of the 11th International Flairs Conference (FLAIRS-98), Sanibel Island, Florida, 1998.
[14] D. Fensel, E. Motta, S. Decker, and Z. Zdrahal. Using ontologies for defining tasks, problem-
solving methods and their mappings. In E. Plaza and V. R. Benjamins, editors, Knowledge
Acquisition, Modeling and Management, pages 113-128. Springer-Verlag, 1997.
[15] J. H. Gennari, H. Cheng, R. Altman, and M. A. Musen. Reuse, CORBA, and knowledge-based
systems. International Journal of Human-Computer Studies, 49(4):523-546, 1998. Special issue
on Problem-Solving Methods.
[16] J. H. Gennari, S. W. Tu, T. E. Rothenfluh, and M. A. Musen. Mapping domains to methods in
support of reuse. International Journal of Human-Computer Studies, 41:399-424, 1994.
[17] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisi-
tion, 5:199-220, 1993.
[18] N. Guarino. Formal ontology, conceptual analysis and knowledge representation. International
Journal of Human-Computer Studies, 43(5/6):625-640, 1995. Special issue on The Role of
Formal Ontology in the Information Technology.
[19] M. Ikeda, K. Seta, and R. Mizoguchi. Task ontology makes it easier to use authoring tools.
In Proc. of the 15th IJCAI, pages 342-347, Japan, 1997. International Joint Conference on
Artificial Intelligence, Morgan Kaufmann Publishers, Inc.
[20] E. Motta and Z. Zdrahal. A library of problem-solving components based on the integration
of the search paradigm with task and method ontologies. International Journal of Human-
Computer Studies, 49(4):437-470, 1998. Special issue on Problem-Solving Methods.
[21] M. A. Musen, J. H. Gennari, H. Eriksson, S. W. Tu, and A. R. Puerta. PROTEGE-II: Computer
support for development of intelligent systems from libraries of components. In Proceedings of
the Eighth World Congress on Medical Informatics (MEDINFO-95), pages 766-770, Vancouver,
B.C., 1995.
[22] R. Orfali, D. Harkey, and J. Edwards, editors. The Essential Distributed Objects Survival Guide.
John Wiley & Sons, New York, 1996.
[23] A. Puerta, S. W. Tu, and M. A. Musen. Modeling tasks with mechanisms. In Workshop on
Problem-Solving Methods, Stanford, July 1992. GMD, Germany.
705

[24] F. Puppe. Knowledge reuse among diagnostic problem-solving methods in the shell-kit D3.
International Journal of Human-Computer Studies, 49(4):627-649, 1998. Special issue on
Problem-Solving Methods.
[25] A. Th. Schreiber, B. J. Wielinga, and J. A. Breuker, editors. KADS: A Principled Approach
to Knowledge-Based System Development, volume 11 of Knowledge-Based Systems Book Series.
Academic Press, London, 1993.
[26] J. Schumann and B. Fischer. NORA/HAMMR making deduction-based software component
retrieval practical. In 12th IEEE International Conference on Automated Software Engineering,
pages 246-254. IEEE Computer Society, 1997.
[27] N. Shadbolt, E. Motta, and A. Rouge. Constructing knowledge-based systems. IEEE Software,
10(6):34-39, November 1993.
[28] L. Steels. Components of expertise. AI Magazine, 11(2):28-49, Smmner 1990.
[29] A. ten Teije, F. van Harmelen, A. Th. Schreiber, and B. Wielinga. Construction of l)roblem-
solving methods as parametric design. International Journal of Human-Computer Studies,
49(4):363-389, 1998. Special issue on Problem-Solving Methods.
[30] M. Usehold and M. Gruninger. Ontologies: principles, methods, and applications. Knowledge
Engineering Review, 11(2):93-155, 1996.
[31] A. Valente and C. Liickenhoff. Organization as guidance: A library of assessment models. In
Proceedings of the Seventh European Knowledge Acquisition Workshop (EKAW'93), Lecture
Notes in Artificial Intelligence, LNCS 723, pages 243-262, 1993.
[32] G. van Heijst, A. T. Schreiber, and B. J. Wielinga. Using explicit ontologies in KBS develop-
ment. International Journal of Human-Computer Studies, 46(2/3):183-292, 1997.
[33] J. Wielemaker. SWI-Prolog ~.9: Reference Manual. SWI, University of Amsterdam,
Roetersstraat 15, 1018 WB Amsterdam, The Netherlands, 1997. E-mail: jan@swi.psy.uva.nl.
[34] A. M. Zaremski and J. M. Wing. Specification matching of software components. ACM Trans-
actions on Software Engineering and Methodology, 6(4):333-369, 1997.
A System for Facilitating and Enhancing Web Search

Steffen Staab¹, Christian Braun², Ilvio Bruder⁴, Antje Düsterhöft⁴,
Andreas Heuer⁴, Meike Klettke⁴, Günter Neumann², Bernd Prager³,
Jan Pretzel³, Hans-Peter Schnurr¹, Rudi Studer¹, Hans Uszkoreit², Burkhard Wrenger³

¹AIFB, Univ. Karlsruhe, D-76128 Karlsruhe   ²DFKI, Saarbrücken
³GECKO mbH, Rostock   ⁴Informatik, Univ. Rostock
http://www.getess.gecko.de

Abstract

We present a system that uses semantic methods and natural language processing capabilities in order to provide comprehensive and easy-to-use access to tourist information in the WWW. The system is designed such that its benefits improve as background knowledge and linguistic coverage increase, while it guarantees state-of-the-art information and database retrieval capabilities as its bottom line.

1 Introduction
Due to the vast amount of information in the WWW, its users have more and more difficulties finding the information they are looking for among the many heterogeneous information resources. Therefore, methods for comfortable and intelligent access are in the primary focus of a number of research communities these days. Currently, syntactic methods of information retrieval prevail in realistic scenarios (cf., e.g., Ballerini et al. (1996)), such as in general search engines like AltaVista, but the limits inherent in these approaches often make finding the proper information a nuisance. At the other end of the methodological spectrum, semantic methods could provide just the right level for finding information, but they rely on explicitly annotated sources (cf., e.g., (Fensel et al., 1998)) or on complete and correct natural language understanding systems, neither of which can be expected in the near future.
Therefore our system, GETESS, uses the semantics of documents in the WWW, as far as it is provided explicitly or as it can be inferred by an incomplete natural language understanding system, but relies on syntactic retrieval methods once the methods at the semantic level fail to fulfill their task. In particular, we consider an information finding system that, (i), has semantic knowledge for supporting the retrieval task, (ii), partially, but robustly, understands natural language, (iii), allows for several ways of interaction that appear natural to the human user, and, (iv), combines knowledge from unstructured and semi-structured documents with knowledge from relational database systems.
In our project, we decided to aim at an information system that provides information finding and filtering methods for a restricted domain, viz. for prospective tourists who may travel in a certain region and are looking for all kinds of information, such as housing, leisure activities, sights, etc. This information cannot be found within a narrowly restricted format, neither in a single database nor on a single web site. Rather, the information agent must gather information that is stored on many different web servers, often in unstructured text, and even in some databases, such as a booking database of a hotel chain. In order to improve on common information retrieval systems, at least part of what is stated in the (HTML) texts must be made available semantically. However, since automatic text understanding is still far from perfect, we pursue a fail-soft approach that is based on extracting knowledge from text with a robust parser, but also integrates and falls back onto common information retrieval mechanisms when the more elaborate understanding component fails.

In the following, we draft the architecture of the GETESS system with its overall sharing of the work load. From this outline we will then motivate and describe some key issues of the major subsystems of GETESS.

2 Architecture
The front end of the GETESS system (cf. a depiction of its architecture in Figure 1) provides a user interface that is embedded in a dialogue system controlling the history of interactions (cf. Section 3). Single interactions are handed to the query processor that selects the corresponding analysis methods, viz. the natural language processing module (NLP system; also cf. Section 5) or the information retrieval and database query mechanisms (cf. Section 4). While the latter ones can be directly used as input to the search system, the natural language processing module first translates the natural language query into a corresponding database query, before it sends this formal query to the search system.
In order to process queries and search for results, three kinds of resources are provided by the back end of the GETESS system. First, archived information is available in several content databases (the abstract DB, the index DB and the DB repository), the function of which is explained below. Second, the lexicon and the ontology provide metaknowledge about the queries, viz. about the grammatical status of words and their conceptual denotations. Third, a database incorporating dialogue sequences and user profiles gives control over dialogue interactions.
While dialogue sequences and user profiles are acquired during the course of interactions and the metaknowledge is provided by the human modeller with the help of knowledge acquisition tools (KA Tools), the content databases must be filled automatically, since the contents of typical web sites change almost on a daily basis. For this task the gatherer searches regularly through relevant XML/HTML pages and specified databases in order to generate corresponding entries in the abstract database, the index database and the database repository.
The content in the abstract database is derived from a robust, though incomplete natural language understanding module that parses documents and extracts semantic information, building a so-called "abstract" for each document. These abstracts are sets of facts, i.e. tuples, like hasChurch(Alicante, Church-1), that could be extracted from natural language text, like "Alicante's major church was built during medieval times". The index generator builds access information for full text search with information retrieval methods, while the DB repository offers relevant views onto external databases.
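To make this concrete, here is a minimal sketch (in Python; the data layout and names are our own illustration, not GETESS internals) of an abstract database that maps each gathered document to its set of extracted fact tuples:

abstract_db = {}

def add_fact(doc_url, relation, arg1, arg2):
    # Record one extracted fact in the abstract of the given document.
    abstract_db.setdefault(doc_url, set()).add((relation, arg1, arg2))

# Fact extracted from "Alicante's major church was built during medieval times":
add_fact("http://example.org/alicante.html", "hasChurch", "Alicante", "Church-1")

Selecting tuples by relation name or argument value is then the kind of operation the formal queries of Section 4 build on.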

Figure 1: The GETESS system architecture



Subsequently, we will first introduce the front end, the dialogue system. The key issues here are concerned with facilitating user interaction at different levels of expertise (Section 3). At the back end of the system, the tools for gathering, database management and information retrieval provide the technical platform for efficiently updating and accessing the system's information repositories (Section 4). The natural language processing component in GETESS is employed by the dialogue system as well as by the back end in order to understand natural language queries and extract information from natural language texts, respectively, and, thus, enhance the quality of the web search (Section 5). Finally, we outline the function of the ontology that constitutes the "glue" of the system at the semantic level (Section 6).

3 Dialogue System
The dialogue system constitutes the interface between the human user and the data in the repositories of the GETESS system. In order to facilitate the user's task of finding the information he is looking for, users should be able to express queries conveniently at their level of expertise. This means the system should allow for intuitive interaction by natural language queries as well as for formal queries that may be the preferred mode of interaction for a human expert user. Independent of the concrete mode of interaction, the system should react quickly and accurately while exploiting the capabilities of the different modes of interaction.
In the GETESS system we allow for four types of interaction, viz. natural language, graphical interface, keyword search and formal database query. Since the methods for natural language processing as well as for keyword search and formal database queries form major components of the system, their description has been delegated to subsequent sections (5 and 4, respectively).
Thus, this chapter serves the following three goals: First, it describes how reasoning about user interactions may support the user's goal of quickly finding the appropriate information. Second, it sketches how single interactions are treated as elements of a complex dialogue. Hence, the user does not have to start from scratch every time he initiates a new query, but can instead refer to his previous queries, e.g. by requests like "Show me information related to this matter." where "this matter" relates to the last query. Finally, we give a glimpse of the use of the graphical interface.
The Knowledge Base of the Dialogue System. Knowledge is crucial for all modes of interaction, because we want the system to give appropriate responses to the user when problems arise. For instance, when a query results in an abundance of hits, the system must reason about why this problem might have occurred and how it might be solved. Knowledge that allows for this type of reasoning is encoded in the knowledge base of the dialogue system (KBD).
The KBD includes all the definitions available in the ontology (cf. Section 6). These definitions help in explaining to the user why a query was too unspecific or giving him hints how he might rephrase the query such that he gets the information he is looking for. For example, if the user seeks information on the local offers of "entertainment", the hit rate for a database query can be reduced by the choice of one of the refined search terms "music events", "theater events" and "sport events". Vice versa, a too specific choice like "folk music event" might result in no hits, but the hint towards more general search terms like "folk culture presentation" might bring up an event that also includes live demonstrations of "folk music". Further help is also provided through important terminological links such as synonyms, homonyms, antonyms and terms that may be parts of other terms; e.g., the show of a magician may be part of a circus show and, hence, the circus show might be a viable entertainment alternative to a magician's show.
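As an illustration of this refinement strategy, consider the following sketch (Python; the toy taxonomy, hit counts and threshold are invented for the example): too many hits trigger proposals of refined subconcepts, while zero hits trigger a proposal of the more general superconcept.

subconcepts = {
    "entertainment": ["music event", "theater event", "sport event"],
    "music event": ["folk music event"],
}
superconcept = {c: p for p, cs in subconcepts.items() for c in cs}

def suggest(term, hit_count, too_many=100):
    # Too many hits: offer the refined search terms below the concept;
    # no hits at all: offer the more general concept above it.
    if hit_count > too_many and term in subconcepts:
        return ["narrow to: " + c for c in subconcepts[term]]
    if hit_count == 0 and term in superconcept:
        return ["broaden to: " + superconcept[term]]
    return []

print(suggest("entertainment", 500))   # -> three refined search terms
print(suggest("folk music event", 0))  # -> ['broaden to: music event']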
In addition to the definitions of the ontology, the KBD features definitions and rules about dialogue concepts. At the moment, this part is tuned to map different interactions onto common requests to the database.¹ For example, the user input "I am looking for the station", which may also be supplemented by restrictions from the graphical interface, has the same meaning, i.e.

¹In the linguistic literature this mapping is defined by the way natural language propositions, requests or questions can be considered as so-called speech acts (Austin, 1962).

it constitutes the same speech act, as "Where is the station?". Therefore, both inputs must be mapped onto the same query to the database.
Both types of knowledge provide user support that reduces the number of inquiries the user has to pose to the system and, hence, accelerates the dialogue compared to common keyword retrieval interactions.
Complex Dialogues. As indicated above, information finding rarely produces an instantaneous hit after the user has formulated just a single query. This is true for syntactic methods, and it will improve only to a limited extent with semantic methods. However, we believe that when a user's sequence of interactions is perceived as being executed in order to achieve a goal, then this task of finding the proper information can be substantially facilitated. For this purpose, we provide a query processor that analyses not only the single interactions, but also views them as being embedded into a more global structure.
The methodology we use is based on work done by Ahrenberg et al. (1996), who structure the dialogue hierarchically into segments that are opened by a request and closed by the appropriate answer. The assumption in our scenario is that users typically have a request for a certain piece of information and give related information in order to succeed. For example, they give a topic², which here boils down to a type restriction, like "sightseeing tour", and temporal information when they want to take part in a sightseeing tour during a particular time frame. The task of the dialogue system lies in zooming in or out on relevant information according to the interaction initiated by the user. For example, two user interactions³ like, (i), "Show me all theater events.", and, (ii), "No, just the ones in August." return a large set of documents first (with feedback such as described in the previous subsection), but a much smaller set of data after the second interaction has narrowed down the focus.
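A minimal sketch of this zooming behaviour (Python; a deliberate simplification of the segment/interaction/topic model described here) keeps one open segment and accumulates restrictions from follow-up interactions:

segment = {"topic": None, "restrictions": []}

def interact(kind, content):
    # A new request opens a segment; a refinement narrows the open one.
    if kind == "request":            # e.g. "Show me all theater events."
        segment["topic"] = content
        segment["restrictions"].clear()
    elif kind == "refine":           # e.g. "No, just the ones in August."
        segment["restrictions"].append(content)
    return segment["topic"], list(segment["restrictions"])

interact("request", "type = theater event")
print(interact("refine", "month = August"))
# -> ('type = theater event', ['month = August'])

The accumulated topic and restrictions are what the query processor would translate into the narrowed database query.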
Hence, we here identify dialogue segments, interactions and topics as the major parameters (though not the only ones) that the dialogue system keeps track of. In this way, the user's single interactions may all contribute towards the common information finding goal and, thus, facilitate the human-computer interaction.
The Graphical Interface. Besides the natural language query capabilities and the possibility of directly composing a formal query, the GETESS system features a graphical interface. This interface constitutes an intermediate level of access to the system between the most professional (and fastest) one, viz. the formal query, and the most intuitive one (that requires a somewhat more elaborate interaction), viz. the natural language access. The graphical interface does not require the user to learn the syntax of a particular query language or the concepts that are available in the ontology, but expects some basic understanding of formal systems from the user. This interface visualizes the ontology in a manner suited for selecting appropriate classes and attributes and, thus, allows the assembly of a formal query through simple mouse clicks. For this purpose, the ontology is visualized by a technology based on hyperbolic geometry (Lamping & Rao, 1996): classes in the center of the visualization are represented with a big circle; surrounding classes are represented with smaller circles. This technique allows fast navigation to distant classes and a clear illustration of each class and its neighboring concepts.

4 Gathering, Database Management and Information Retrieval


In this section, we outline the back end of GETESS that gathers data from the web and stores it in a way that allows for efficient retrieval mechanisms as far as keyword search and formal queries are concerned.

:"File "topic" in a dialogue corresponds tt) what the dialogue is about. Usually, it is given only implicitly in
nalural language statements. In our setting it may also be given explicitly,e.g. through the graphical interface.
3Each interaction corresponds I(~sely to a speech act as introduced above, but might also be an act in the
graphical user interface. Examples are up&tte(users provide informationto the system), question, answer, assertion
or directive.

The back end of GETESS employs a typical gatherer-broker structure, viz. a Harvest search service (Bowman et al., 1995) with a database interface. Though we use the tools provided by another project, SWING (Heuer et al., 1997), the setting of GETESS puts additional demands on the gatherer-broker system: (i), the GETESS search engine has to work with facts contained in the abstracts; (ii), ontology knowledge must be integrated into the process of analysing internet information as well as answering user queries; (iii), internet information can be of different types (e.g., HTML, XML texts); and, (iv), data collections such as information stored in databases must also be accessible via the GETESS search engine. These different requirements must be met both during the main process of gathering data and during the querying process (broker).
The Gatherer process. Periodically, internet information is analysed via internet agents in order to build a search index for the GETESS search engine. Information (e.g. HTML texts, PostScript files, ...) is checked to find keywords. Additionally, the GETESS-Gatherer has to build abstracts from this information. The two kinds of index data ('simple' keywords and abstracts) are stored in databases.
The Broker process. As indicated above, the dialogue system maps the user's queries (with the help of the natural language processing module and the definitions in the ontology) onto formal or keyword queries in IRQL, the Information Retrieval Query Language. The IRQL language combines different kinds of queries, both database and information retrieval queries, thus providing access to the index data. The query result set is ranked with a user-centered ranking function. The ranked result set is then presented to the user via the dialogue system.
The integration of different types of information (full text, abstracts, relational database facts) during gathering and querying has posed, and still poses, demands for research; at the same time, however, it opens up new possibilities for posing queries, because:

1. Conventional search engines support an efficient search for keywords or combinations of keywords over a whole document. This is still possible, but the GETESS abstracts also relate information in the document to attributes. Exploiting this type of information, we can realise attributed queries. That means users have the possibility to search for terms in specific attributes.

2. Searching for particular integer values, for instance prices or distances, is nearly impossible with a conventional information retrieval approach. In GETESS, it will be possible to compare integer and real values, e.g. to search for all prices that fall below a threshold. In addition, one may also determine minimum, maximum and average values as well as sort and group results by particular values.

3. Database functionality brings up answers from the abstract database that are composed of different abstracts. That means, for answering a user's query we may refer to and exploit facts derived from different websites. Thereby, it is not even necessary that these websites are connected by links, and all the algebraic operations given through database functionality can be employed in order to deduce information. For instance, today's cinema events may be announced on one web site, while the corresponding reviews are found on another one. Database technology allows for retrieving all movies that are shown today and that received a good rating in the corresponding review (cf. the sketch below).

This functionality is implemented in an object-relational database system. It will be employed in a distributed database solution, which provides for data storage at different local servers. Having made available different languages for accessing this repository of information, we are now researching a common language level, the IRQL described above, for accessing structured information (abstracts) and unstructured information (for instance HTML or XML) with the same interface. This language will then reduce the burden on the dialogue system, because the dialogue system will no longer have to distinguish between formal and keyword search. Thus, IRQL will enhance the overall robustness of the system.
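The following sketch illustrates, on toy data in Python (relation names and values are invented; the real system evaluates IRQL against the object-relational repository), the value-comparison and cross-site join capabilities from the list above:

facts = [
    ("hasPrice",   "http://site-a/hotel1.html",  80),
    ("hasPrice",   "http://site-b/hotel2.html", 120),
    ("shows",      "http://site-c/cinema.html", "Movie-1"),
    ("goodReview", "http://site-d/reviews.html", "Movie-1"),
]

# (2) value comparison over attributed facts: all prices below a threshold
cheap = [(src, val) for rel, src, val in facts
         if rel == "hasPrice" and val < 100]

# (3) a join across web sites that need not be linked to each other:
# movies that are shown today and were reviewed well elsewhere
shown    = {val for rel, src, val in facts if rel == "shows"}
reviewed = {val for rel, src, val in facts if rel == "goodReview"}
print(cheap, shown & reviewed)
# -> [('http://site-a/hotel1.html', 80)] {'Movie-1'}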

5 Natural Language Processing


In the GETESS system, the natural language processing (NLP) component is used in order to, (i), linguistically analyse user queries specified in the dialogue system, (ii), generate the linguistic basis for the extraction of facts from NL documents, and, (iii), generate natural language responses from facts in the abstract database.
The design of this component is based on two major design criteria: First, the GETESS system requires a high degree of robustness and efficiency in order that the system may be applied in a real-world setting. The reason is that we must be able to process arbitrary sequences of strings efficiently, because "broken" documents appear on real web sites, and that the number of documents to be processed is too large to allow for response times of several minutes per sentence. Second, we employ the same shallow NL core components and linguistic data managing tools for processing texts and extracting information as well as for analysing a user's query. Thus, we can reduce the amount of redundancy as far as possible and keep the system in a consistent state with regard to its language capabilities. For instance, when the internal linguistic representation of an NL query and the abstracts use the same data sources, and if we also use the same knowledge sources for NL-based generation, inconsistencies as a result of unshared data can be reduced.
For the purpose of a short presentation here, we abstract from two major parts of the natural language processing component in GETESS. We do not elaborate on the natural language generation part, which also includes features for summarizing facts from the abstract database. Moreover, we are well aware that our project serves an international tourist community and, therefore, we will have to add multi-lingual access as well as multi-lingual presentations of the query results. However, at the current state of the project, we focus on parts of Germany as the tourist destination that we want to provide information about and, hence, focus on the analysis of German documents only.
Shallow text processing. The shallow text processor (STP) of GETESS is based on and extends SMES, an IE core system developed at DFKI (see (Neumann et al., 1997; Neumann & Mazzini, 1998)). One of the major advantages of STP is that it makes a clean separation between domain-independent and domain-dependent knowledge sources. Its core components include: (i), a text scanner, which recognizes, e.g., number, date, time and word expressions, as well as text structure, like sentence markers and paragraphs; (ii), a very fast component for morphological processing which performs inflectional analyses including processing of compounds and robust lexical access (e.g. analysing "houses" as the plural of "house"); (iii), a chunk parser based on a cascade of weighted finite-state transducers. The chunk parser performs recognition of phrases (specific phrases like complex date expressions and proper names, and general phrases, like nominal phrases and verb groups), collection of phrases into sentences, and determination of the grammatical functions (like the deep subject, which describes the acting person in "Tourist groups are led by a native guide.").
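For illustration only, a drastically simplified cascade in this spirit might look as follows (Python; the toy lexicon and the rule-based stages stand in for STP's large lexica and weighted finite-state transducers):

import re

LEXICON = {"tourist": "N", "groups": "N", "guide": "N", "native": "A",
           "a": "D", "by": "P", "are": "V", "led": "V",
           "house": "N", "houses": "N"}

def scan(text):
    # Text scanner: word tokens and sentence markers.
    return re.findall(r"\w+|[.!?]", text.lower())

def tag(tokens):
    # Stand-in for morphological processing: plain lexicon lookup.
    return [(t, LEXICON.get(t, "UNK")) for t in tokens]

def chunk_nps(tagged):
    # One cascade stage: group determiner/adjective/noun runs into
    # nominal phrases; any other tag closes the current chunk.
    chunks, current = [], []
    for word, pos in tagged:
        if pos in ("D", "A", "N"):
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_nps(tag(scan("Tourist groups are led by a native guide."))))
# -> ['tourist groups', 'a native guide']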
STP has large linguistic knowledge sources (e.g., 120,000 general stem entries, more than 20,000 verb-frame entries). The system is fast, and can process 400 words in about one second running all components. In order to adapt STP for GETESS, we have begun to evaluate STP's coverage on a corpus provided by our industrial partner. Though the evaluation is blind, because the current knowledge sources have not been specified using any part of this corpus, we could analyse over 90% of all word forms and found that a majority of the remaining forms can be covered by domain-specific lexica.
Extraction of facts. Finally, a word on the extraction of facts: The STP generates a linguistic analysis, i.e. it determines syntactic relations between words, e.g. between a verb and its subject. How these linguistic cues are exploited in order to go from natural language to a formal description is explained in the following section, which elaborates on the semantic level of GETESS.

6 Ontology
As already mentioned, the gathering, use and querying of information with syntactic methods is very limited and in many cases not successful. A semantic reference model, an ontology, which structures the content and describes relationships between parts of the content, helps to overcome these limitations. With the ontology in GETESS, we aim at two major purposes: First, it offers inference facilities that are exploited by the other modules; e.g., as described in Section 3, the dialogue module may ask for the types a particular instance belongs to in order to present alternative query options to the user. Second, the ontology acts as a mediator between the different modules. This latter role is explained here in more detail, since it illustrates how ontological design influences the working of the GETESS system and, in particular, the extraction of facts from natural language texts.
The text processing (cf. Section 5) of natural language documents and queries delivers syntactic relations between words and phrases. Whether and how such a syntactic relation can be translated into a meaningful semantic relation depends on how the tourism domain is conceptualized in the ontology. For example (cf. Fig. 2), the natural language processing system finds syntactic relations between the words "church" and "Alicante" in the phrase "Alicante's main church". The word "Alicante" refers to Alicante, which is known as an instance of the class city in the database. The database refers to the ontology for the description of the classes city and church. Querying the ontology for semantic relations between church and city results in hasBuilding and hasChurch. Both relations are inherited from the class location to the class city. Since hasChurch is the more specific one, a corresponding entry between Alicante and the church is added to the abstract, i.e. the set of extracted facts, of the currently processed document in the abstract database.
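A minimal sketch of this mediation step (Python; the toy class hierarchy and relation signatures mirror the example, while the specificity heuristic, preferring the relation with the deepest range class, is our own simplification):

relations = {  # relation -> (domain class, range class)
    "hasBuilding": ("location", "building"),
    "hasChurch":   ("location", "church"),
}
superclasses = {"city": ["location"], "church": ["building"]}

def ancestors(cls):
    # The class itself plus all transitive superclasses.
    seen, todo = {cls}, list(superclasses.get(cls, []))
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo.extend(superclasses.get(c, []))
    return seen

def candidates(domain_cls, range_cls):
    # All relations whose signature is compatible with the two classes.
    doms, rngs = ancestors(domain_cls), ancestors(range_cls)
    return [r for r, (d, g) in relations.items() if d in doms and g in rngs]

def most_specific(cands):
    return max(cands, key=lambda r: len(ancestors(relations[r][1])))

print(most_specific(candidates("city", "church")))   # -> hasChurch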
This example shows that the design of the ontology determines the facts which may be extracted from texts, the database schema that must be used to store these facts and, thus of course, what information is made available at the semantic level. Hence, the ontology might constitute an engineering bottleneck. However, we try to overcome this problem by using the linguistic and statistical analyses of the text processing component for indicating frequent, though unmodelled, concepts and relations to the knowledge engineer.

Figure 2: Interaction of Ontology with NLP system and Database Organization

7 Related Work
The GETESS project builds on and extends a lot of earlier work in various domains. In the natural language community, research like (Grosz et al., 1987; Wahlster et al., 1978) fostered the use of natural language applications to databases, though these applications never reached the high precision and generality required in order to access typical databases, e.g. for accounting. Here, our approach seems better suited, since the imponderabilities of general natural language understanding are counterbalanced by information retrieval facilities and an accompanying graphical interface.
Only few researchers, e.g. Hahn et al. (1999), have elaborated on the interaction between natural language understanding and the corresponding use of ontologies. We think this to be an important point, since underlying ontologies cannot only be used as submodules of text understanding systems, but can also be employed for a more direct access to the knowledge base and for providing an intermediate layer between text representation and external databases, an interesting topic that has not been raised so far, to the best of our knowledge.

As far as such querying of conceptual structures is concerned, we agree with McGuinness & Patel-Schneider (1998) that usability issues play a vital role in determining whether a semantic layer can be made available to the user and, hence, we elaborated on this topic early on (Fensel et al., 1998). We thereby keep in mind that regular users may find lengthy natural language questions too troublesome to deal with and, therefore, prefer an interface that allows fast access, but which is still more comfortable than any formal query language.
Projects that compare directly to GETESS are, e.g., Paradime (Neumann et al., 1997)⁴, MULINEX (Capstick et al., 1998) and MIETTA (Buitelaar et al., 1998). However, none of these projects combines information extraction with similarly rich interactions at the semantic layer. Hence, to the best of our knowledge we are the only project integrating unstructured, semi-structured and highly-structured data with a variety of easy-to-use facilities for human-computer interaction.

8 Conclusion
In the project GETESS (GErman Text Exploitation and Search System) we decided to build an intelligent information finder that relies on current techniques for information retrieval and database querying as its bottom line. The support for finding information is enhanced through an additional semantic layer that is based on ontological engineering and on a partial text understanding tool.
In order to facilitate web search, the dialogue is considered a complex entity. The analysis of sequences of interactions allows for refining, rephrasing or refocusing succeeding queries, and thus eliminates the burden of starting from scratch with every single interaction. Thereby, several modes of interaction are possible: besides keyword search and SQL queries one can mix natural language queries with clicking in the graphical query interface.
Having built the single modules for our system, the next task in the GETESS project is bringing these components together. Given the design methodology of achieving entry-level features first and then working towards "the high ceiling" (viz. complete text understanding and representation), we expect benefits for economic and research interests early in the project. The system is general enough to be applied to many realistic scenarios, e.g. as an intelligent interface to a company's intranet, even though it is still far from offering a general solution for the most general information finding problems in the WWW. Further research will have to show an evaluation of how a user's performance in finding particular pieces of information loses or (hopefully) gains from using this information agent.

References
Ahrenberg, L., Dahlbäck, N., Jönsson, A., & Thurée, Å. (1996). Customizing interaction for natural language interfaces. Computer and Information Science, 1(1).
Austin, J. (1962). How to Do Things with Words. Oxford University Press.
Ballerini, J., Büchel, M., Knaus, D., Mateev, B., Mittendorf, M., Schäuble, P., Sheridan, P., & Wechsler, M. (1996). SPIDER retrieval system at TREC 5. In Proc. of TREC-5, Gaithersburg, Maryland, November 20-22, 1996. http://www-nlpir.nist.gov/pubs.
Bowman, C., Danzig, P., Hardy, R., Manber, U., & Schwartz, M. (1995). The Harvest information discovery and access systems. Networks and ISDN Systems. ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z.
Buitelaar, P., Netter, K., & Xu, F. (1998). Integrating different strategies for cross-language information retrieval in the MIETTA project. In Hiemstra, D., de Jong, F., & Netter, K. (Eds.), Language Technology in Multimedia Information Retrieval: Proceedings of the 14th Twente Workshop on Language Technology, TWLT 14, pages 9-17. Universiteit Twente, Enschede.

⁴Actually, GETESS uses the same linguistic core machinery as Paradime.



Capstick, J., Diagne, A. K., Erbach, G., Uszkoreit, H., Cagno, F., Gadaleta, G., Hernandez, J. A., Korte, R., Leisenberg, A., Leisenberg, M., & Christ, O. (1998). MULINEX: Multilingual web search and navigation. In Proceedings of Natural Language Processing and Industrial Applications.
Fensel, D., Decker, S., Erdmann, M., & Studer, R. (1998). Ontobroker: The very high idea. In FLAIRS-98: Proceedings of the 11th International Flairs Conference, Sanibel Island, Florida, May 1998.
Grosz, B., Appelt, D., Martin, P., & Pereira, F. (1987). TEAM: An experiment in the design of transportable natural-language interfaces. Artificial Intelligence, 32(2):173-243.
Hahn, U., Romacker, M., & Schulz, S. (1999). How knowledge drives understanding: Matching medical ontologies with the needs of medical language processing. AI in Medicine, 15(1):25-51.
Heuer, A., Meyer, H., Düsterhöft, A., & Langer, U. (1997). SWING: Der Anfrage- und Suchdienst des Regionalen Informationssystems MV-Info. In Tagungsband IuK-Tage Mecklenburg-Vorpommern, Schwerin, 27./28. Juni 1997.
Lamping, J. & Rao, R. (1996). The hyperbolic browser: A focus + context technique for visualizing large hierarchies. Journal of Visual Languages & Computing, 7.
McGuinness, D. & Patel-Schneider, P. (1998). Usability issues in knowledge representation systems. In Proc. of AAAI-98, pages 608-614.
Neumann, G., Backofen, R., Baur, J., Becker, M., & Braun, C. (1997). An information extraction core system for real world German text processing. In 5th International Conference on Applied Natural Language Processing, pages 208-215, Washington, USA.
Neumann, G. & Mazzini, G. (1998). Domain-adaptive information extraction. Technical report, DFKI, Saarbrücken.
Wahlster, W., Jameson, A., & Hoeppner, W. (1978). Glancing, referring and explaining in the dialogue system HAM-RPM. American Journal of Computational Linguistics, pages 53-67.
Applying Ontology to the Web: A Case Study

Jeff Heflin, James Hendler, and Sean Luke

Department of Computer Science


University of Maryland
College Park, MD 20742
{heflin, hendler, seanl}@cs.umd.edu

Abstract
This paper describes the use of Simple HTML Ontology Extensions (SHOE) in a real world internet application. SHOE allows authors to add semantic content to web pages and to relate this content to common ontologies that provide contextual information about the domain. Using this information, query systems can provide more accurate responses than are possible with the search engines available on the Web. We have applied these techniques to the domain of Transmissible Spongiform Encephalopathies (TSEs), a class of diseases that includes "Mad Cow Disease". We discuss our experiences and provide lessons learned from the process.

1. Introduction
The "Mad Cow Disease" epidemic in Great Britain and the apparent link to Creutzfeldt-
Jakob disease (CJD) in humans generated an international interest in these diseases.
Bovine Spongiform Encephalopathy (BSE), the technical name for "Mad Cow Disease",
and CJD are both Transmissible Spongiform Encephaiopathies (TSEs), brain diseases
that cause sponge-like abnormalities in brain cells. Concern about the risks of BSE to
humans continues to spawn a number of websites on the topic; some of these sites
provide valuable information, while others are simply sources of rumors. The reliable
sites range in content from epidemiology of the diseases, to scientific studies on
inactivation, to regulations by various agencies. It is difficult for users to locate relevant
information with the standard web search engines because these tools match on individual
words instead of their meanings. As such, they cannot take the relationship between
words into account, map between the terminology of different communities, or use any
contextual information to differentiate between terms with many meanings.
The Joint Institute for Food Safety and Nutrition (JIFSAN), a partnership between the Food and Drug Administration (FDA) and the University of Maryland, is attempting to rectify this situation. They wish to provide a clearinghouse for information on TSEs. This site must be able to serve a diverse group of users, including the general public, researchers, risk assessors, and policy makers. However, the diversity of data, the constant appearance of new information, and the distribution of ownership make it difficult to manually maintain an accurate index. Additionally, the nature of the target user community means the retrieval tools must be able to respond to general queries and very specialized queries with the appropriate level of detail to inform the user.
We have built a suite of tools to address these problems, with the basis for these tools being an internet-compatible knowledge representation language called Simple HTML Ontology Extensions (SHOE). The underlying philosophy of SHOE is that intelligent agents will be able to better perform tasks on the Internet if the most useful information on web pages is provided in a structured manner. To this end, SHOE extends HTML with a set of knowledge-oriented tags that, unlike HTML tags, provide structure for knowledge acquisition as opposed to information presentation. In addition to providing explicit knowledge, SHOE sanctions the discovery of implicit knowledge

through the use of taxonomies and inference rules available in reusable ontologies that are referenced by SHOE web pages. This allows information providers to encode only the necessary information on their web pages, and to use the level of detail that is appropriate to the context. SHOE-enabled web tools can then process this information in novel ways to provide more intelligent access to the information on the Internet.
This paper describes the first application of SHOE to a large-scale, real world
domain. In Section 2, we lay out the architecture of the system and detail the efforts to put
each piece in place. Section 3 discusses what we have learned from the process. Sections
4 and 5 discuss related and future work, respectively. Finally, Section 6 presents our
conclusions.

2. Building the System


This section describes the procedural and technical aspects of the TSE application. We
also explain our design choices based on the features of the TSE problem domain. The
system architecture can be summarized as follows:

• A single, comprehensive ontology is available on the TSE Risk Website.
• Knowledge providers who wish to make material available to the TSE Risk Website use a tool called the Knowledge Annotator to mark up their pages with SHOE. The instances within these pages are described using elements from the TSE Ontology.
• The knowledge providers then place the pages on the Web and notify JIFSAN.
• JIFSAN reviews the site and, if it meets their standards, adds it to the list of sites that Exposé, the SHOE web crawler, is allowed to visit.
• Exposé crawls along the selected sites, searching for more SHOE-annotated pages with relevant TSE information. It will also look for updates to pages.
• SHOE knowledge discovered by Exposé is loaded into a Parka knowledge base.
• Java applets on the TSE Risk Website access the knowledge base to respond to users' queries or update displays. These applets include the TSE Path Analyzer and the Parka Interface.

The following subsections describe how we created our ontology, how SHOE tags were
added to web pages, how new SHOE information is discovered, and how users access
information that is relevant to them.

2.1 Ontology Design


The fundamental component of SHOE is the ontology. In SHOE, an ontology can extend
one or more existing ontologies by adding its own category hierarchies, relations, and
inference rules. The excerpts from the TSE ontology shown in Figure 1 give a sample of
the SHOE syntax. A complete description of the syntax can be found in the SHOE
Specification (Luke and Heflin 1997).
An important problem when designing an ontology is setting an appropriate scope. We asked the following questions to set an initial scope for the TSE ontology:
• What kinds of pages will be annotated?
• What sorts of queries can the pages be used to answer?
• Who will be the users of the pages?
• What kinds of objects are of interest to these users?
• What are the interesting relationships between these objects?

<BODY>
<ONTOLOGY ID="TSE Ontology" VERSION="1.0">
<USE-ONTOLOGY ID="Base Ontology" VERSION="1.0" PREFIX="base">

<DEF-CATEGORY NAME="Disease_Agent" ISA="base.SHOEEntity">
<DEF-CATEGORY NAME="BSE" ISA="Disease_Agent">
<DEF-CATEGORY NAME="CJD" ISA="Disease_Agent">
<DEF-CATEGORY NAME="NV-CJD" ISA="Disease_Agent">

<RELATION NAME="hasInput">
  <ARG POS=1 TYPE="Process">
  <ARG POS=2 TYPE="Material">
</RELATION>
<RELATION NAME="hasOutput">
  <ARG POS=1 TYPE="Process">
  <ARG POS=2 TYPE="Material">
</RELATION>

</ONTOLOGY>
</BODY>

Figure 1. Excerpts from the TSE Ontology

Note that the motivation for web ontologies is slightly different from that of traditional
ontologies. People rarely query the web searching for abstract concepts or similarities
between very disparate concepts, and as such, complex upper ontologies are not
necessary. Since most pages with SHOE annotations will tend to have tags that categorize
the concepts, there is no need for complex inference rules to perform automatic
classification. In many cases, rules that identify the symmetric, inverse, and transitive
relationships will provide sufficient inference.
The initial TSE ontology was fleshed out in a series of meetings that included
members of the FDA and the Maryland Veterinarian School. Since one of the key goals
was to help risk assessors gather information, the ontology focused on the three main
concerns for TSE Risks: source material, processing, and end-product use. Source
materials are described using the concepts of Animal, Tissue, and DiseaseAgent.
Processing focused on the types of Processes, and relations to describe inputs, outputs,
duration, etc. Finally, end-product use categorized the types of Products and dealt with
the RouteOfExposure. We also defined a number of general concepts such as People, Organizations, Events, and Locations.
Currently, the ontology has 73 categories and 88 relations. It is stored as a file on
a web server with an HTML section that presents a human-readable description and a
machine-readable section with SHOE syntax. In this way, the file can serve the purpose
of educating users in addition to being understandable to machines.

2.2 Annotation
Annotation is the process of adding SHOE semantic markup to a web page. A SHOE web
page describes one or more instances, each representing an entity or concept. An instance
is uniquely identified by a key, which is usually formed from the URL of the web page.
The description of an instance consists of ontologies that it references, categories that
classify it, and relations that describe it. A sample instance is shown in Figure 2.
Determining what concepts in a page to annotate can be complicated. First, if the
document represents or describes a real world object, then an instance whose key is the

<HTML>
<BODY>
<INSTANCE KEY="http://www.cs.umd.edu/projects/plus/SHOE/tse/rendering.html">
<USE-ONTOLOGY ID="TSE-Ontology" VERSION="1.0" PREFIX="tse"
   URL="http://www.cs.umd.edu/projects/plus/SHOE/tse/tseont.html">
<CATEGORY NAME="tse.Process">
<RELATION NAME="tse.name">
  <ARG POS="TO" VALUE="Rendering">
</RELATION>
<RELATION NAME="tse.hasInput">
  <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/offal.html">
</RELATION>
<RELATION NAME="tse.hasInput">
  <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/bones.html">
</RELATION>
<RELATION NAME="tse.hasOutput">
  <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/mbm.html">
</RELATION>
<RELATION NAME="tse.hasOutput">
  <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/tallow.html">
</RELATION>
<RELATION NAME="tse.hasOutput">
  <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/gelatin.html">
</RELATION>
</INSTANCE>
</BODY>
</HTML>

Figure 2. Sample Instance

document's URL should be created. Second, hyperlinks are often signs that there is some relation between the object in the document and another object represented by the hyperlinked URL. If a hyperlinked document does not have SHOE annotations, it may also be useful to make claims about its object. Third, one can create an instance for every proper noun, although in large documents this may be excessive. If these concepts have a web presence, then that URL should be used as the key; otherwise, unique keys can be created by appending a "#" and a unique string to the end of the document's URL.
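The key-formation rules above can be summarized in a short sketch (Python; the helper names are ours, not part of SHOE):

def key_for_document(doc_url):
    # Case 1: the page itself represents or describes the object.
    return doc_url

def key_for_concept(doc_url, name, web_presence_url=None):
    # Case 2: a concept with its own home page keeps that URL as its key.
    if web_presence_url:
        return web_presence_url
    # Case 3: otherwise append "#" and a unique string to the page URL.
    return doc_url + "#" + name.replace(" ", "_")

print(key_for_concept("http://example.org/bse.html", "North America"))
# -> http://example.org/bse.html#North_America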
Since manually annotating a page can be time-consuming and prone to error, we have developed the Knowledge Annotator, a tool that makes it easy to add SHOE knowledge to web pages by making selections and filling in forms. As can be seen in Figure 3, the tool has an interface that displays instances, ontologies, and claims. Users can add, edit or remove any of these objects. When creating a new object, users are prompted for the necessary information. In the case of claims, a user can choose the source ontology from a list, and then choose categories or relations from a corresponding list. The available relations will automatically filter based upon whether the instances entered can fill the argument positions. A variety of methods can be used to view the knowledge in the document. These include a view of the source HTML, a logical notation view, and a view that organizes claims by subject and describes them using simple English. In addition to prompting the user for inputs, the tool performs error checking to ensure correctness¹ and converts the inputs into legal SHOE syntax. For these reasons, only a rudimentary understanding of SHOE is necessary to mark up web pages.
We selected pages to annotate with two goals in mind: provide information on the
processing of animal-based products and provide access to existing documents related to
TSEs. We were unable to locate web pages relevant to the first goal, and therefore had to
create a set of pages describing many important source materials, processes and products.
To achieve the second goal we selected relevant pages from sites provided by the FDA,
United States Department of Agriculture (USDA), the World Health Organization and

¹Here correctness is with respect to SHOE's syntax and semantics. The Knowledge Annotator cannot verify whether the user's inputs properly describe the page.

Figure 3. The Knowledge Annotator


others. For the pages that we created, we added the SHOE tags inline. Since we did not
have the authority to modify the other pages, we created summary pages that basically
consisted of the SHOE information and pointers to the originals.

2.3 Information Gathering


The vastness of the Internet and bandwidth limitations make it difficult for a system to perform direct queries on it efficiently. However, if the relevant data is already stored in a knowledge base, then it is possible to respond to queries very quickly. For this reason, we have designed Exposé, a softbot that searches for web pages with SHOE markup and interns the knowledge. However, since a web crawler can only process information so quickly, there is a tradeoff between coverage of the Web and freshness of the data: if the system revisits pages frequently, then there is less time for discovering new pages. Since we are only concerned with information on TSEs for this project, we chose to limit the sites Exposé may visit, so that it does not waste time exploring pages where there is no relevant information.
In order to use Exposé, we had to choose a knowledge base system for storing the information. The selection of such a system depends on a number of criteria. First, many knowledge base systems cannot handle the volume of data that would be discovered by the web crawler. Second, the knowledge base system must support the kinds of inference that will be needed by the application. Third, since SHOE allows for n-ary relations, it is useful, though not absolutely necessary, to choose a knowledge base that can support

them². We chose Parka (Evett, Andersen, and Hendler 1993; Stoffel, Taylor, and Hendler 1997) as our knowledge base because evaluations have shown it to be very scalable, there is an n-ary version, and parallel processing can be used to improve query execution time. Since we were not interested in performing complex inferences on the data at the time, the fact that Parka's only inference mechanism is inheritance was of no consequence.
An important aspect of the Internet is that its distributed nature means that all information discovered must be treated as claims rather than facts. Parka, as well as most other knowledge base systems, does not provide a mechanism for attaching sources to assertions or facilities for treating these assertions as claims. To represent such information, one must create an extra layer of structure using the existing representation. Parka uses categories, instances and n-ary predicates to represent the world. A natural representation of SHOE information would be to treat each declaration of a SHOE relation as an assertion where the relation name is the predicate, and each category declaration as an assertion where instanceof is the predicate. To represent the source of the information, we could add an extra term to each predicate. Thus, an n-ary predicate would become an (n+1)-ary predicate. However, the structural links (i.e., isa and instanceof) are default binary predicates in Parka. Thus, this approach could not be used without changing the internal workings of the knowledge base. We opted for a simpler approach, and instead made two assertions for each claim. The first assertion ignores the claimant, and can be used normally in Parka. The second assertion uses a claims predicate to link the source to the first assertion. When the source of information is important, it can be retrieved through the claims predicate. Although this results in twice as many assertions being made to the knowledge base, it preserves classification while keeping queries straightforward.
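A minimal sketch of this double-assertion scheme (Python over plain tuples; Parka's actual API is not shown):

assertions = []

def assert_claim(source, predicate, *args):
    fact = (predicate,) + args
    assertions.append(fact)                      # usable in normal queries
    assertions.append(("claims", source, fact))  # records the claimant

def sources_of(fact):
    # Retrieve the claimants when the origin of a fact matters.
    return [a[1] for a in assertions if a[0] == "claims" and a[2] == fact]

assert_claim("http://site-a/page.html", "hasInput", "Rendering", "Offal")
print(sources_of(("hasInput", "Rendering", "Offal")))
# -> ['http://site-a/page.html']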
As designed, the agent will only visit websites that have registered with JIFSAN. This allows JIFSAN to review the sites so that Exposé will only be directed to search sites that meet a certain level of quality. Note that this does not restrict the ability of approved sites to get current information indexed. Once a site is registered, it is considered trusted and Exposé will revisit it periodically.

2.4 User Interfaces


The most important aspect of the system is the ability to provide users with the
information they need. Since we are dealing with an internet environment, it is important
that users can access this information through their web browsers. For this reason, the
tools we have created are Java applets that are available from the TSE website. We
currently provide a general purpose query tool and a custom tool built to meet the needs of the TSE community.
The Java Parka Interface for Queries (PIQ), as shown in Figure 4, is a graphical
tool that can be used to query any Parka knowledge base. This interface gives users a new
way to browse the web by allowing them to submit complex queries and open documents
by clicking on the URLs in the results. A user inputs a query by drawing frames and the
relations between them. This specifies a conjunctive query in which the frames are either
constants or variables and the relations can be a string matching function, a numerical
comparison or a relation defined in an ontology. The answers to the query are displayed

²A binary knowledge base can represent the same data as an n-ary knowledge base, but requires an intermediate processing step to convert an n-ary relation into a set of binary relations. This is inefficient in terms of storage and execution time.

Figure 4. The Parka Interface for Queries (PIQ)

as a table of the possible variable bindings. If the user double-clicks on a binding that is a
URL, then the corresponding web page will be opened in a new window of the user's web
browser.
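The following sketch shows, on toy data, how such a conjunctive query over frames and relations can be evaluated by accumulating variable bindings (Python; our own simplification of what the PIQ sends to the Parka server):

kb = [("hasOutput", "Rendering", "Tallow"),
      ("hasOutput", "Rendering", "Gelatin")]

def match(query, kb):
    # query: list of (relation, arg1, arg2); terms starting with "?" are
    # variables. Each triple filters and extends the candidate bindings.
    bindings = [{}]
    for rel, a1, a2 in query:
        extended = []
        for b in bindings:
            for r, x, y in kb:
                if r != rel:
                    continue
                trial, ok = dict(b), True
                for term, val in ((a1, x), (a2, y)):
                    if term.startswith("?"):
                        if trial.setdefault(term, val) != val:
                            ok = False
                    elif term != val:
                        ok = False
                if ok:
                    extended.append(trial)
        bindings = extended
    return bindings

print(match([("hasOutput", "Rendering", "?product")], kb))
# -> [{'?product': 'Tallow'}, {'?product': 'Gelatin'}]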
It is widely believed that the outbreak of BSE in Great Britain was the result of
changes in rendering practices. Since processing can lead to the inactivation or spread of
a disease, JIFSAN expressed a desire to be able to visualize and understand the
processing of animal materials from source to end-product. To accommodate this, we
built the TSE Path Analyzer, a graphical tool which allows the user to pick a source,
process and/or end product and view all possible pathways that match their queries. The
input choices are derived from the taxonomies of the ontology, allowing the user to
specify the query at the level of generality that they wish. This display, which can be seen
in Figure 5, is created dynamically based on the semantic information in the SHOE web
pages. As such, it is automatically updated as new information becomes available,
including information that has been made available elsewhere on the web.
Since both these interfaces are applets, they are executed on the machine of each user who opens them. This client application communicates with the central Parka knowledge base through a Parka server that is located on the JIFSAN website. When a user starts one of these applets on their machine, the applet sends a message to the Parka server. The server responds by creating a new process and establishing a socket for communication with the applet.

Figure 5. The Path Analyzer

3. Lessons Learned
This research has given us many insights into the use of ontologies in providing access to
internet information. The first insight is that it is worthwhile to spend time getting the
ontology "right". By "right", we mean that it must cover the concepts in the types of
pages that are to be used and the ways in which these pages will be accessed. We often
had to extend our ontology to accommodate concepts in pages that we were annotating,
and this slowed the annotation process.
Second, real world web pages often refer to shared entities such as BSE or the
North American continent. Such concepts may be described in many web pages, none of
which should have the authority to assign a key to them. In such cases, we revise the
appropriate ontologies to include a constant for the shared object. However, this may
result in frequent updates if the ontology is used extensively.
Third, ordinary web-users do not have the time or desire to learn to use complex
tools. Although the PIQ is easy to use once one has gained a little experience with it, it
can be intimidating to the occasional user. On the other hand, users liked the Path
Analyzer, even though it can only be used to answer a restricted set of queries, because it
presents the results in a way that makes it easy to explore the problem. It seems web users
are often willing to sacrifice power for simplicity.

Finally, the knowledge base must be able to perform certain complex operations
as a single unit. For example, the Path Analyzer needs to display certain descendant
hierarchies. Although such lists can be built by recursively asking for the immediate
children of the categories retrieved in the last step, this requires many separate queries. In
a client-server situation this is expensive, since each query requires its own
communication overhead and internet transmission delays can be significant. To improve
performance, we implemented a special server request that returns the complete set of
parent-child pairs that form a hierarchy. Although this requires the same amount of
processing by the knowledge base, it results in a significant speedup of the client
application.
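
A hedged Java sketch of the client side of this optimization: the complete set of
parent-child pairs arrives in a single response, and the descendant hierarchy is then
assembled locally with no further round trips. The pair data and category names are
illustrative only.

import java.util.*;

public class HierarchyBuilder {
    public static void main(String[] args) {
        // What one special server request might return for a small taxonomy.
        String[][] pairs = {
            {"Material", "AnimalMaterial"}, {"Material", "PlantMaterial"},
            {"AnimalMaterial", "MBM"}, {"AnimalMaterial", "Tallow"}};

        // Build a child list per parent in a single pass.
        Map<String, List<String>> children = new HashMap<>();
        Set<String> nonRoots = new HashSet<>();
        for (String[] p : pairs) {
            children.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
            nonRoots.add(p[1]);
        }
        // Roots are parents that never appear as a child.
        for (String parent : children.keySet())
            if (!nonRoots.contains(parent)) print(parent, 0, children);
    }

    static void print(String node, int depth, Map<String, List<String>> children) {
        System.out.println("  ".repeat(depth) + node);
        for (String c : children.getOrDefault(node, List.of()))
            print(c, depth + 1, children);
    }
}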

4. Related Work
The World-Wide Web Consortium (W3C) has proposed the Extensible Markup Language
(XML) (Bray, Paoli, and Sperberg-McQueen 1998) as a standard that is a simplified
version of SGML (ISO 1986) intended for the Internet. XML allows web authors to create
customized sets of tags for their documents. Style sheets can then be used to display this
information in whatever format is appropriate. SHOE is a natural fit with XML: XML
allows SHOE to be added to web pages without creating an HTML variant, while SHOE
adds to XML a standard way of expressing semantics within a specified context. The
Resource Description Framework (RDF) (Lassila and Swick 1998) is another work in
progress by the W3C. RDF uses XML to specify semantic networks of information on
web pages, but has no inferential capabilities and is limited to binary relations.
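The binary restriction is easy to see with a small example: an n-ary fact can only be
stored in a triple-based model by reifying it around an intermediate node, which is the
conversion footnote 2 refers to. The Java sketch below, with invented relation and node
names, shows a ternary source/process/product fact flattened into three binary
statements.

import java.util.List;

public class Reification {
    record Triple(String subject, String predicate, String object) {}

    public static void main(String[] args) {
        // The n-ary fact processes(cattle, rendering, MBM) ...
        // ... becomes one fresh node plus three binary statements:
        List<Triple> triples = List.of(
            new Triple("event17", "hasSource", "cattle"),
            new Triple("event17", "hasProcess", "rendering"),
            new Triple("event17", "hasProduct", "MBM"));
        triples.forEach(System.out::println);
    }
}

The extra node and the joins needed to reassemble the original relation are exactly the
storage and execution-time overhead mentioned in footnote 2.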
There are many other projects that are using ontologies with the Web. The World
Wide Knowledge Base (WebKB) project (Craven et al. 1998) is using ontologies and
machine learning to attempt automatic classification of web pages. The Ontobroker
(Fensel et al. 1998) project has resulted in a language which, like SHOE, is embedded in
HTML. Although the syntax of this language is more compact, it is not as easy to
understand as SHOE. Also, Ontobroker does not have a mechanism for pages to use
multiple ontologies and those who are not members of the community have no way of
discovering the ontology information.

5. Future Work
The JIFSAN TSE Website is a work in progress, and we will continue to annotate pages,
refine the ontology, and improve the tool set. When we have accumulated a significantly
large and diverse set of annotated pages, we will systematically evaluate the performance
of SHOE relative to other methods. We also plan to develop a set of reusable ontologies
for concepts that appear commonly on the Web, so that future ontologies may be
constructed more quickly and will have a commonality that allows for queries across
subject areas when appropriate.
To gain acceptance by the web community, a new language must have intuitive
tools. We plan to create an ontology design tool that simplifies the ontology development
process. We also plan to improve the Knowledge Annotator so that more pages can be
annotated more quickly. We are particularly interested in including lightweight natural
language processing techniques that suggest annotations to the users. Finally, we are
investigating other query tools with the goal of reducing the learning curve while still
providing the full capabilities of the underlying knowledge base.

6. Conclusion
The TSE Risk Website is the first step in developing a clearinghouse on food safety risks
that serves both the general public and individuals who assess risk. SHOE allows this
information to be accessed and processed in powerful ways without constraining the
distributed nature of the sources. Since SHOE does not depend on keyword matching, it
prevents the false hits that occur with ordinary search engines and finds other matches
that they cannot. Additionally, the structure of SHOE allows intelligent agents to process
the information from many sources and combine or present it in novel ways.
We have demonstrated that SHOE can be used in large domains without clear
boundaries. The methodology and tools we have described in this paper can be applied to
other subject areas with little or no modification. We have determined that the hardest
part of using SHOE in new domains is creating the ontology, but we are convinced that as
high quality ontology components are made available, this process will be simplified. We
are encouraged by the interest that our initial efforts have generated in the TSE
community, and believe that improvements in our tools and the availability of basic
ontologies will lead to an internet where the right data is always available at the right
time.

Acknowledgments
This work is supported in part by grants from ONR (N00014-J-91-1451), ARPA
(N00014-94-1090, DABT-95-C0037, F30602-93-C-0039) and the ARL
(DAAH049610297).

References
Bray, T., J. Paoli and C.M. Sperberg-McQueen. 1998. Extensible Markup Language (XML). W3C (World-
Wide Web Consortium). (At http://www.w3.org/TR/1998/REC-xml-19980210.html)

Craven, M., D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. 1998. Learning
to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the AAAI-98 Conference on
Artificial Intelligence. AAAI/MIT Press.

Evett, M.P., W.A. Andersen and J.A. Hendler. 1993. Providing Computationally Effective Knowledge
Representation via Massive Parallelism. In Parallel Processing for Artificial Intelligence. L. Kanal, V.
Kumar, H. Kitano, and C. Suttner, Eds. Amsterdam: Elsevier Science Publishers.

Fensel, D., S. Decker, M. Erdmann, and R. Studer. 1998. Ontobroker: How to enable intelligent access to
the WWW. In AAAI-98 Workshop on AI and Information Integration. Madison, WI.

ISO (International Organization for Standardization). 1986. ISO 8879:1986(E). Information Processing --
Text and Office Systems -- Standard Generalized Markup Language (SGML). First edition -- 1986-10-15.
[Geneva]: International Organization for Standardization.

Lassila, O. and R.R. Swick. 1998. Resource Description Framework (RDF) Model and Syntax. W3C
(World-Wide Web Consortium). At http://www.w3.org/TR/WD-rdf-syntax-19980216.html.

Luke, S. and J. Heflin. 1997. SHOE 1.0. Proposed Specification. At
http://www.cs.umd.edu/projects/plus/SHOE/spec.html

Stoffel, K., M. Taylor and J. Hendler. 1997. Efficient Management of Very Large Ontologies. In
Proceedings of American Association for Artificial Intelligence Conference (AAAI-97). AAAI/MIT Press.
How to Find Suitable Ontologies
Using an Ontology-Based WWW Broker
Julio César Arpírez Vega 1, Asunción Gómez-Pérez 1,
Adolfo Lozano Tello 2 and Helena Sofia Andrade N. P. Pinto 3†
{arpirez, asun, alozano}@delicias.dia.fi.upm.es, sofia@gia.ist.utl.pt

Abstract. Knowledge reuse by means of ontologies now faces three important


problems: (1) there are no standardized identifying features that characterize
ontologies from the user point of view; (2) there are no web sites using the same
logical organization, presenting relevant information about ontologies; and (3) the
search for appropriate ontologies is hard, time-consuming and usually fruitless. To
solve the above problems, we present: (1) a living set of features that allow us to
characterize ontologies from the user point of view and have the same logical
organization; (2) a living domain ontology about ontologies (called Reference
Ontology) that gathers, describes and has links to existing ontologies; and (3)
(ONTO)2Agent, the ontology-based www broker about ontologies that uses the
Reference Ontology as a source of its knowledge and retrieves descriptions of
ontologies that satisfy a given set of constraints. (ONTO)2Agent is available at
http://delicias.dia.fi.upm.es/REFERENCE_ONTOLOGY/

1 INTRODUCTION AND MOTIVATION
Nowadays, it is easy to get information from organizations that have ontologies using the
WWW. There are even specific points that gather information about ontologies and have
links to other web pages containing more explicit information about such ontologies (see
The Ontology Page 4, also known as TOP) and there are also ontology servers like The
Ontology Server 5 [8, 9], Cycorp's Upper CYC Ontology Server 6 [29] or Ontosaurus 7 [36]
that collect a huge number of very well-known ontologies.
When developers search for candidate ontologies for their application, they face a
complex multi-criteria choice problem. Apart from the dispersion of ontologies over
several servers: (a) ontology content formalization differs depending on the server at
which it is stored; (b) ontologies on the same server are usually described with different
detail levels; and (c) there is no common format for presenting relevant information
about the ontologies so that users can decide which ontology best suits their purpose.
Choosing an ontology that does not match the system needs properly or whose usage is
expensive (people, hardware and software resources, time) may force future users to stop
reusing the ontology already built and oblige them to formalize the same knowledge
again. It would be very useful for the knowledge reuse market to prepare a kind of yellow
pages of ontologies that provides classified and updated information about ontologies.
These living yellow pages would help future users to locate candidate ontologies for a
given application. A broker specialized in the ontology field can help in this search,
1 Grupo de reutilización. Laboratorio de Inteligencia Artificial. Facultad de Informática. Universidad Politécnica de Madrid. España
2 Área de Lenguajes y Sistemas Informáticos. Departamento de Informática. Universidad de Extremadura. España
3 Grupo de Inteligência Artificial. Departamento de Engenharia Informática. Instituto Superior Técnico. Lisboa. Portugal
† This work was partially supported by JNICT grant PRAXIS XXI/BD/11202/97 (Sub-Programa Ciência e Tecnologia do Segundo
Quadro Comunitário de Apoio).
4 http://www.medg.lcs.mit.edu/doyle/top
5 http://www-ksl.stanford.edu:5915
6 http://www.cyc.com
7 http://indra.isi.edu:8000/Loom

Identifying features:
- about the ontology: name, server-site, mirror-sites, Web pages, FAQs available,
  mailing lists, NL descriptions, built date.
- about the developers: name, Web page, e-mail, contact name, telephone, FAX,
  postal address.
- about the distributors: name, Web page, e-mail, contact people name, telephone,
  FAX, postal address.

Descriptive features:
- general: type of ontology, subject, purpose, ontological commitments, list of higher
  level concepts, implementation status, on-line and hard-copy documentation.
- scope: number of concepts representing classes, number of concepts representing
  instances, number of explicit axioms, number of relations, number of functions,
  number of class concepts at first, second and third levels, number of class leaves,
  average branching factor, average depth, highest depth level.
- design: building methodologies, steps followed, level of formality of the
  methodology, building approach, level of specification formality, types of knowledge
  sources, reliability of knowledge sources, knowledge acquisition techniques,
  formalism paradigms, list of integrated ontologies, list of languages in which the
  ontology is available.
- requirements: hardware and software support.
- cost: price of use, maintenance cost, estimated price of required software,
  estimated price of required hardware.
- usage: number of applications, list of main applications.

Functional features: description of use tools, documentation quality, training courses,
on-line help, operating instructions, availability of modular use, possibility of adding
new knowledge, possibility of dealing with contexts, availability of PSMs.

Figure 1. Feature taxonomy.

speeding up the search and selection process, by supplying the engineer with a set of
ontologies that totally/partially meet the identified requirements. As a first step to solving
the problem of searching for candidate ontologies, we present (ONTO)2Agent, an
ontology-based WWW broker on the field of ontologies that spreads information about
existing ontologies, helps to search appropriate ontologies, and reduces the search time
for the desired ontology. (ONTO)2Agent uses as a source of its knowledge an ontology
about ontologies (called Reference Ontology) that plays the role of a yellow pages of
ontologies.
In this paper, we will firstly present an initial set of features that allow us to
characterize, evaluate and assess ontologies from the user point of view. Secondly, we
will show how we have built the Reference Ontology at the knowledge level [32] using
the METHONTOLOGY framework [5, 11, 16] and the Ontology Design Environment
(ODE) [5], and how we have incorporated the Reference Ontology into the (KA)2
initiative [4]. Finally, we will present the technology we have used to build ontology-
based WWW brokers and how it has been instantiated in (ONTO)2Agent. (ONTO)2Agent
is capable of answering questions like: give me all the ontologies in the domain D that
are implemented in languages L1 and L2.

2 FEATURES FOR COMPARING ONTOLOGIES


The goal of this section is to provide an initial set of features that allows us to
characterize the ontologies from the user point of view by identifying the main attributes
and their values. The kind of questions we are trying to answer are, for example: Which
are the languages in which an ontology is available? Which are the mechanisms for
interacting with the ontology? Is the knowledge represented in a frame-based formalism?
What is the cost of the hardware and software infrastructure needed to use the ontology?
What is the cost of the ontology? Is the ontology well documented? Was it evaluated [17]
from a technical point of view?
Although Software Engineering and Knowledge Engineering provide detailed features
for evaluating and assessing Software Engineering and Knowledge Engineering products
[26, 33, 34], the literature reviewed in the field of ontologies shows that there are few
papers about identifying features for describing, comparing and assessing ontologies. The
taxonomy presented by Hovy [23] for comparing ontologies for natural language
processing (divided into form, content and use) is insufficient for comparing ontologies
in other domains. Fridman and Hafner [14] studied a small set of features for comparing
well-known and representative ontologies.
To be able to answer the above questions, we have made a detailed study of the
ontologies available at ontology servers on the web (Ontology Server, Cyc Server,
Ontosaurus) and also other ontologies found in the literature (PhysSys [6], EngMath
[18]). Our aim is twofold: first, to identify the more representative features of these
ontologies (developers, ontology-server, type, purpose,...); second, to define a shared
domain ontology about ontologies (the Reference Ontology) and relate each individual
ontology to that shared ontology. This Reference Ontology could help future users to
select the most adequate and suitable ontology for the application they have in mind.
To ease and speed up the process of searching for the features of the ontology, they are
grouped in the following categories: identifying, descriptive and functional features, as
shown in Figure 1. A preliminary set of features is proposed for each category. Since not
all the features are equally important, the essential features, i.e., features which are
indispensable in order to distinguish each ontology, are given in italics. It is compulsory
to fill in these features. We also stress that: (1) some features cannot be used to
characterize certain ontologies; (2) the ontology builder may not know the values of
some features; and (3) this is a living list of features to be improved and completed with
new features as required.

2.1 Identifying features


They provide information about the ontology itself, its developers and distributors. We
consider it important to specify:
* About the ontology: Its name, server-site, mirror-sites, Web pages, FAQs available,
mailing lists, natural language description and built date.
* About the main developers and distributors: their names, Web pages, e-mails, contact
names, telephone and fax numbers and postal addresses.
2.2 Descriptive features
They provide information about the content and form of the ontology. They have been
divided into six categories: general, scope, design, requirements, cost and usage.
General features describe basic content issues. Users will frequently consult this kind
of information, since these features are crucial for looking up other features. We
considered the following properties: type of ontology [22], subject of the ontology,

purpose [37], ontological commitments [19], list of higher level concepts,


implementation status, and on-line and hard-copy documentation.
Scope features describe measurable attributes proper to the ontology. They give an idea
of the content and depth of the ontology. Properties to be taken into account are: number
of concepts representing classes, number of concepts representing instances, number of
explicit axioms, number of relations and functions, number of class concepts at first,
second and third levels, number of class leaves, average branching factor, average depth,
highest depth level.
Design features describe the method followed to build the ontology, the activities
carried out during the whole process and how knowledge is organized and distributed in
the ontology 8.
1. It is important to mention the methodology used, the steps [5, 11, 16] taken to build
the ontology (mainly planning, specification, knowledge acquisition,
conceptualization, implementation, evaluation, documentation and maintenance)
according to the selected methodology, its level of formality [37], and the
construction approach [37].
2. Depending on the methodology, the specification may be formal, informal or semi-
formal.
3. With regard to knowledge acquisition, it is important to state the types of knowledge
sources, how reliable such knowledge sources are and the techniques used in the
process.
4. With respect to formalism paradigms, a frame-based formalism, a first order logic
approach, a semantic network, like conceptual graphs, or even a hybrid knowledge
representation paradigm can be selected. It is important to state here that the chosen
formalism places constraints on the knowledge representation ontology in which the
current ontology is going to be implemented. For example, if we select a frame-based
formalism paradigm, one major candidate would be the frame-ontology at the
Ontology Server. The formalism paradigm also plays a major role in ontology
integration. For example, if you want to integrate an ontology built using a first order
language into a frame-based paradigm a lot of knowledge will be lost due to the
weaker expressive power of the latter.
5. As far as integration is concerned, a list of the integrated ontologies should be given.
6. Finally, from the implementation point of view, we need to know the source
languages in which the ontology is supplied and the list of formal KR languages
supported by available translators.
Requirement features identify the minimal hardware (swap and hard disk space, RAM,
processor, operating system) and software support requirements (knowledge
representation languages and implementation language underneath the KR language) for
using the ontology. All these features will greatly influence costs.
Cost features help to assess the estimated cost of using the ontology in a given
organization. Since the hardware and software costs vary widely and depend on the
existing computer infrastructure, the total cost should be calculated by adding the cost of
use and maintenance to the features identified above (estimated prices of the hardware
and software required).
The usage feature refers to the applications that use this ontology as a source of their
knowledge. The number of known applications and their names are the features to be
filled in by the informant.
8 The ontology can be divided into several ontologies.

2.3 Functional features


These properties give clues on how the ontology can be used in applications. We have
identified the following features: description of use tools (taxonomical browsers, editors,
evaluators, translators, remote access modules, ...), quality of documentation, training
courses available, on-line help available, how to use the ontology (including the steps
followed to access, manipulate, display and update knowledge from remote and on-site
applications), availability of modular use, possibility of addition of new knowledge,
possibility of dealing with contexts, availability of integrating PSMs, etc.

3 DESIGN OF AN ONTOLOGY ABOUT ONTOLOGIES: THE REFERENCE


ONTOLOGY
Having presented a living set of features that describe each ontology and differentiate one
ontology from another, the goal of this section is to present how we have built the
Reference Ontology using the features identified in section 2. As stated above, the
Reference Ontology is a domain ontology about ontologies that plays the role of a kind
of yellow pages of ontologies. Its aims are to gather, describe and have links to existing
ontologies, using a common logical organization.
The development of this Reference Ontology was divided into two phases. The first
phase is concerned with the development of its conceptual structure, and the
identification of its main concepts, taxonomies, relations, functions and axioms. This
phase was carried out using the METHONTOLOGY framework and the Ontology
Design Environment. As one of the research topics of the KA community is ontologies,
we decided to incorporate the Reference Ontology into the Product ontology of the (KA) 2
initiative that is currently being developed by the KA community. The second phase
corresponds to the addition of knowledge about specific ontologies that act as instances
in this Reference Ontology. Ontology developers will enter such knowledge using a
WWW form also based on the features previously presented in section 2. So, the effort
made to collect information about specific ontologies is distributed among ontology
developers. It should be stressed that this is a first attempt at building a living ontology in
the domain of ontologies. In this section we only present issues related to the first phase.

3.1 METHONTOLOGY
The METHONTOLOGY framework enables the construction of ontologies at the
knowledge level. It includes: the identification of the ontology development process, a
proposed life cycle and the methodology itself. The ontology development process
identifies which tasks should be performed when building ontologies (planning, control,
quality assurance, specification, knowledge acquisition, conceptualization, integration,
formalization, implementation, evaluation, maintenance, documentation and
configuration management). The life cycle (based on evolving prototypes) identifies the
stages through which the ontology passes during its lifetime. Finally, the methodology
itself, specifies the steps to be taken to perform each activity, the techniques used, the
products to be outputted and how they are to be evaluated. The main phase in the
ontology development process using the METHONTOLOGY approach is the
conceptualization phase. Its aims are: to organize and structure the acquired knowledge
in a complete and consistent knowledge model, using external representations (glossary
of terms, concept classification trees, "ad hoc" binary relation diagrams, concept
dictionary, table of "ad hoc" binary relations, instance attribute table, class attribute table,
logical axiom table, constant table, formula table, attribute classification trees and an
instance table) that are independent of implementation languages and environments. As a

result of this activity, the domain vocabulary is identified and defined. For detailed
information on building ontologies using this approach, see [16].

3.2 (KA)2 Ontological Reengineering Process


(KA)2 is an initiative that models the Knowledge Acquisition Community (its
researchers, research topics, products, events, publications, etc.) in an ontology that is
called the (KA)2 Ontology. Initially, the (KA)2 ontology was formalized in Flogic [28].
A WWW broker called Ontobroker [10] uses this Flogic ontology to infer new
information that is not explicitly stored in the ontology.

Figure 2. Ontological Reengineering Process of the (KA)2 Ontology.

To make this ontology accessible to the entire community, it was decided to translate
this Flogic ontology to Ontolingua [20] and to make it accessible through the Ontology
Server. Since all the knowledge had been represented in a single ontology, the option of
directly translating from Flogic to Ontolingua was ruled out (since it transgressed the
modularity criterion), and it was decided to carry out an ontological reengineering
process of the (KA)2 ontology, as shown in Figure 2. First, we obtained a (KA)2
conceptual model attached to the Flogic ontology, manually, by a reverse engineering
process. Second, we restructured it using ODE conceptualization modules. After this, we
got a new (KA)2 conceptual model, composed of eight smaller ontologies: People,
Publications, Events, Organizations, Research-Topics, Projects, Research-Products and
Research-Groups. Finally, we converted the restructured (KA)2 conceptual model into
Ontolingua using forward ODE translators.
Figure 3. Concept Classification Tree in (KA)2.


Figure 3 shows the main concepts identified in the domain grouped in Concept
Classification Trees 9. Figure 4 shows the most representative "ad hoc" binary
relationships described in the Diagram of Binary Relations 10 of the new (KA)2 ontology
conceptual model; for instance, the relation Affiliation, between an Employee and an
Organization; its inverse, the relation Employs; and the relation Cooperates-with,
between two Researchers. It should be noted that multiple inheritance among concepts
represented in the ontology is allowed, since for example a PhD Student is both a Student
9 These trees identify the main taxonomies of a domain. Each tree will produce an independent ontology.
10 The goal of this diagram is to establish relationships between concepts from the same or different ontologies.

and a Researcher. For a detailed explanation of the new (KA)2 ontology conceptual
model built after restructuring the Flogic (KA)2 ontology, see [5].

Figure 4. Diagram of Binary "Ad hoc" Relations in (KA)2.

3.3 Incorporating the Reference Ontology into (KA)2


As starting points for developing our Reference Ontology, we took three sources of
knowledge. The first source was the set of features presented earlier in section 2. The
second source was the restructured (KA) 2 conceptual model. The third source was the set
of properties identified for the Research-Topic ontology, which were established during
the KEML workshop held at Karlsruhe, on January 23, 1998 and distributed by R.
Benjamins to the KA-coordinators-list. The properties identified were: Name,
Description, Approaches, Research-groups, Researchers, Related-topics, Sub-topics,
Events, Journals, Projects, Application-areas, Products, Bibliographies, Mailing-lists,
Webpages, International-funding-agencies and National-funding-agencies. All these
properties describe the field of ontologies and differentiate it from other fields of research.
However, the properties we presented in section 2 characterize each ontology and
differentiate one ontology from another. Some of the features presented in section 2 lead
to some minor changes and extensions to the (KA) 2 ontology. For instance, information
concerning distributors and developers was associated to Product and not exclusively to
Ontology.
The design criteria used to incorporate the Reference Ontology into the (KA) 2 ontology
were:
* Modularity: we sought to build a module-based ontology to allow more flexibility and
varied uses.
* Specialize: we identified general concepts that were specialized into more specific
concepts until domain instances were reached. Our goal was to classify concepts by
similar features and to guarantee inheritance of such features.

Figure 5. Some of the relations and concepts added to the (KA)2 ontology.

* Diversify each hierarchy to increase the power provided by multiple inheritance
mechanisms. By representing enough knowledge in the ontology and using as many
different classification criteria as possible, it is easier to enter new concepts (since
they can be easily specified from the pre-existing concepts and classification criteria).
* Minimize the semantic distance between sibling concepts: similar concepts are
grouped and represented as subclasses of one class and should be defined using the
same primitives, whereas concepts which are less similar are represented further apart
in the hierarchy.
* Maximize relationships between taxonomies: in this sense, "ad hoc" relations and slots
were filled in as concepts in the ontology.
* We have not taken into account ontology server, ontologies and language releases to
build our ontology. For instance, in our ontology, the Ontology Server is an instance
of servers and we do not keep records of its latest and future releases.
* Standardize names: whenever possible we specified that a relation should be named
by concatenating the name of the ontology (or the concept representing the first

element of the relation), the name of the relation and the name of the target concept;
for instance, the relation Ontology-Formalized-in-Language between the class of
ontologies and one Language.
Based on the previous criteria, our analysis of the conceptual model of the (KA) 2
ontology showed that:
* about the classes: from the viewpoint of the Reference Ontology, some important
classes were missing; for instance, the classes Servers and Languages, subclasses of
Computer-Support at the Product ontology. The subclass of the class Servers is the
class Ontology-Servers, whose instances are the Ontology-Server, the Ontosaurus and
the CycServer. The subclass of the class Languages is the class Ontology-Languages,
whose instances are Ontolingua, CycL [29] and LOOM [30].
* about the relations: from the viewpoint of the Reference Ontology, some important
relations were missing; for instance, the relation Research-Topic-Products between a
research topic and a product, or the relation Distributed-by between a product and an
organization or the relation Ontology-Located-at-Server that relates an ontology to a
server.
* about the properties: from the viewpoint of the Reference Ontology, some important
properties were missing; for instance, Research-Topic-Webpages, Developers-Web-
Pages, Type-of-Ontology or Product-Name.
So, we introduced the classes, relations and properties needed. The most representative
appear highlighted in bold lettering in Figure 5.
All the changes, the entry of new relations and properties and the entry of new concepts
were guided by the features that were presented in section 2. Essentially, the (KA) 2
ontology was extended using new concepts and some knowledge previously represented
in the (KA) 2 ontology was specialized in order to represent the intbrmation that we found
was of use and of interest for comparing different ontologies with a view to reuse or use
as a basis for further applications.

Figure 6. OntoAgent architecture.



4 ONTOAGENT ARCHITECTURE
Having identified the relevant features of ontologies and built the conceptual structure of
the Reference Ontology using the Ontology Design Environment, the problem of
entering, accessing and updating the information about each individual ontology arises.
Ontology developers will enter such knowledge using a WWW form based on the
features identified in section 2. A broker specialized in the ontology field, called
(ONTO)2Agent, can help in this search. In this section, we describe domain-independent
technology for building and maintaining ontology-based WWW brokers. The broker
uses ontologies as a source of its knowledge and interactive WWW user interfaces to
collect information that is distributed among ontology developers.
The approach taken to build ontology-based WWW brokers is based on the architecture
presented in Figure 6. It consists of different modules, each of which carries out a major
function within the system. These modules are:

A. A world-wide web domain model builder broker, whose main capability is to
instantiate the conceptual structure of an ontology about the broker domain
expertise. This domain model builder needs:
A.1. An Ontology Information Collector: an easy-to-use interactive WWW user interface
that eases data input by distributed agents (both programs and humans);
A.2. An Instance Conceptualizer: for transforming the data from the WWW user
interface into instances of the ontology specified at the knowledge level;
A.3. Ontology Generators/Translators: for generating or translating the ontology
specified at the knowledge level into several target languages used to formalize
ontologies and thus allow access from heterogeneous applications.
B. A world-wide web domain model retrieval broker, whose aim is to provide help in
accessing the information in an ontology warehouse and show it nicely. It is divided
into:
B.1. A query builder to help to build queries using the broker vocabulary, as well as to
reformulate and refine a query given by the user; the queries will be formulated
upon a set of ontologies previously selected from the ontology pool available in the
architecture;
B.2. A query translator that transforms the user query into a query representation
compatible with the language in which the ontology is implemented;
B.3. An inference engine that searches for the answer to the query; as shown in Figure 6,
knowledge sources can be represented in several formats;
B.4. An answer builder that presents to the client the answers to the query obtained by
the inference engine module in an easy and human-readable manner. The answers
are presented for each ontology that has been searched. Thus, one query may be
answered in several domains, depending on the domains of the ontologies. (A
rough sketch of these modules as interfaces is given below.)
This technology has already been instantiated in two applications: (ONTO)2Agent and
Chemical OntoAgent.
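
A minimal, illustrative Java sketch of this module decomposition follows; every type and
method name here is an assumption made for exposition, not the actual OntoAgent code.

import java.util.List;
import java.util.Map;

interface OntologyInformationCollector {       // A.1: WWW form front-end
    FormData collect();
}
interface InstanceConceptualizer {             // A.2: form data -> knowledge-level instances
    List<Instance> conceptualize(FormData data);
}
interface OntologyTranslator {                 // A.3: knowledge level -> target language
    String translate(List<Instance> instances, String targetLanguage);
}
interface QueryBuilder {                       // B.1: build/refine user queries
    Query build(String userInput);
}
interface QueryTranslator {                    // B.2: e.g. to SQL
    String toTargetLanguage(Query query);
}
interface InferenceEngine {                    // B.3: evaluate against the warehouse
    List<String[]> answer(String translatedQuery);
}
interface AnswerBuilder {                      // B.4: human-readable presentation
    String render(List<String[]> rows);
}

// Placeholder data carriers so the sketch compiles.
record FormData(Map<String, String> fields) {}
record Instance(String concept, Map<String, String> attributes) {}
record Query(String text) {}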

4.1 (ONTO)2Agent
In the ontological engineering context, using the Reference Ontology as a source of its
knowledge, the broker locates and retrieves descriptions of ontologies that satisfy a given
set of constraints. For example, when a knowledge engineer is looking for ontologies
written in a given language applicable to a particular domain, (ONTO)2Agent can help in
the search, supplying the engineer with a set of ontologies that totally/partially comply
with the requirements identified.

The above abstract architecture has been instantiated as follows:

A. The WWW-based domain model builder broker uses:
A.1. A world-wide web form based on the identified ontology features previously
discussed in this paper. Its main aim is to gather information about ontologies and
thus distribute the effort made in collecting this data among ontology developers.
Part of this form (http://delicias.dia.fi.upm.es/REFERENCE_ONTOLOGY) is
shown in Figure 7. Note that the different categories are divided into groups. There
are compulsory options that ontology developers must fill in (e.g., the ontology
name, the language of the ontology), while others are optional and offer a more
detailed view of the ontology (e.g., number of nodes at the first level). The form
contains proper questions to get the values of the features of an ontology. Besides,
it contains help to guide the ontology developers filling in the form. A set of
possible values is also identified for some questions, so the user merely has to click
on a radio button or check box.

Figure 7. HTML ontology questionnaire form.
A.2. The data are used to fill in the instances of the concepts identified in the ontology
described in section 3, which was built using ODE, thus ensuring full compatibility
with this tool. Furthermore, we prefer to store the ontologies in a relational database
rather than as implementations of other knowledge representation languages.
A.3. This database representation of the ontology specification is generated
automatically using ODE forward translation modules. Knowledge can also be
represented using other formats. Indeed, the translation languages we currently
support include Ontolingua and SFK [13]. In the future, other languages such as
Flogic or LOOM will be supported.

B. With regard to the WWW-based domain model retrieval broker:


B.1. Two query builders have been implemented, both similar in their conception but
not implemented in the same manner. The first is a Java applet and the second, a
Java standalone application. The main goal of the former is to get a fast applet
download time to a web browser, limited by the Internet current transfer speed. Its
functionality is smaller than the standalone application. This however, is due to the
strict security restrictions applied to Java applets [25] and the above-mentioned
speed limitation. Both elements seek to provide easy and quick access to ontologies.
They possess a graphical user interface from which the user can build queries to any
ODE ontology stored in the relational database. The query system is in fact domain-
independent, although it has actually only been tested with two ontologies:
Reference and CHEMICALS.

Both query builders allow users to formulate simple and complex queries. Simple
queries can be made using the predefined queries present in the agent. They are
based on ODE intermediate representations and include: definition of a concept,
instances of a concept, comparison between two concepts, etc. They are used to get
answers, loaded with information, easily and quickly. The query procedure is similar
to the one used by Yahoo 11 or AltaVista 12, so anyone used to working with these
Internet search tools is unlikely to have any problems using the interface. Complex
queries can be formulated by using a query builder wizard that works with AND/OR
trees and the vocabulary obtained from the ontologies we are querying. It allows
us to build a more restrictive and detailed query, as shown in Figure 8, where we
are looking for all the ontologies in the engineering domain, with Standard Units as
a defined term and whose language is either Ontolingua, LOOM or SFK. Before
the query is translated to the proper query language, it is checked semantically for
inconsistencies (syntactic correctness is implicit), thanks to the query building
method. If it is all right, it is refined, eliminating any redundancies.

Figure 8. (ONTO)2Agent is asked to provide all the ontologies in the engineering
domain, written in Ontolingua, LOOM or SFK, with Standard Units as a defined
term, using a query expressed by means of an AND/OR tree.
B.2. The resulting query is then translated into the SQL language in order to match
the ontology specification at the knowledge level, using the implementation of the
ontology stored in a database (see the sketch after this list). For the Ontolingua
implementation of a similar agent, an OKBC-capable [39] builder would be required.
B.3. The SQL query is sent to the server by means of an OntoAgent-specific protocol
built on top of the TCP/IP stack. Therefore, the applications will be able to contact
the server by means of this protocol. The inference engine used is the search engine
provided with MS-Access plus some add-ins.
B.4. Once the query is sent to the server, the results will be returned and will be
graphically visualized by the system. This representation will be different depending
on whether or not natural language generation was requested. These results can be
saved in HTML format for later consultation using a common web browser.
Apart from this querying capability, we can also download ontologies from the server or
upload them to it. So, we can take an ontology to our own workstation, work with it
employing ODE, and modify and/or enlarge it as desired.
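
The B.2 translation step can be sketched in Java as follows, under the assumption of a
simple relational layout with one row per (ontology, feature, value) triple; the real
(ONTO)2Agent database schema is not reproduced here. The example query is the one of
Figure 8.

import java.util.List;

public class AndOrToSql {
    sealed interface Node permits Leaf, And, Or {}
    record Leaf(String feature, String value) implements Node {}
    record And(List<Node> children) implements Node {}
    record Or(List<Node> children) implements Node {}

    static String toSql(Node n) {
        if (n instanceof Leaf l)
            // Each leaf becomes an EXISTS test on the hypothetical feature table.
            return "EXISTS (SELECT 1 FROM features f WHERE f.ontology = o.name"
                 + " AND f.feature = '" + l.feature() + "'"
                 + " AND f.value = '" + l.value() + "')";
        List<Node> kids = (n instanceof And a) ? a.children() : ((Or) n).children();
        String op = (n instanceof And) ? " AND " : " OR ";
        return "(" + String.join(op, kids.stream().map(AndOrToSql::toSql).toList()) + ")";
    }

    public static void main(String[] args) {
        // "Engineering-domain ontologies with Standard Units as a defined term,
        //  written in Ontolingua, LOOM or SFK" (the Figure 8 query).
        Node query = new And(List.of(
            new Leaf("subject", "engineering"),
            new Leaf("defined-term", "Standard Units"),
            new Or(List.of(new Leaf("language", "Ontolingua"),
                           new Leaf("language", "LOOM"),
                           new Leaf("language", "SFK")))));
        System.out.println("SELECT o.name FROM ontologies o WHERE " + toSql(query));
    }
}

Running the sketch prints a single SELECT statement whose WHERE clause mirrors the
AND/OR tree, so the whole query can be answered by the database in one request.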

4.2 Chemical OntoAgent


Chemical OntoAgent is the other broker to which this technology has been applied. It is a
chemistry teaching broker that allows students to learn chemistry in a very
straightforward manner, providing the necessary domain knowledge and helping students
to test their skills. To make the answers more understandable to students, this technology
is able to interact with a system called OntoGeneration [1]. OntoGeneration is a system
that uses a domain ontology (CHEMICALS [12]) and a linguistic ontology (GUM [2]) to
11 http://www.yahoo.com
12 http://www.altavista.com

Figure 9. Search results in natural language and in tabular form. Sodium definition: Sodium is an element that belongs to the
alkali metal group and has an atomic number of 11, an atomic weight of 22.98977 and a valency of 1. The table also shows the
Chemicals instance attributes table.

generate Spanish text descriptions in response to the queries in the domain of chemistry.
This is shown in Figure 9, where we queried the definition of sodium and the instance
attributes table of the Chemicals ontology using a predefined query.
Chemical OntoAgent does not have the modules described for the world-wide web
domain model builder broker, since the Chemicals ontology was built entirely using
ODE, and needed no further dynamic updating after its completion.

5 CONCLUSIONS
In this paper we presented (ONTO)2Agent, an ontology-based WWW broker to select
ontologies for a given application. This application seeks to solve some important
problems:
1. To solve the problem of the absence of standardized features for describing
ontologies, we have presented a living and domain-independent taxonomy of 70 features
to compare ontologies using the same logical organization. This framework differs from
Hovy's approach, which was built exclusively for comparing natural language processing
ontologies. This framework also extends the limited number of features proposed by
Fridman and Hafner for comparing well-known and representative ontologies, like: CYC
[29], Wordnet [31], GUM [3], Sowa's Ontology [35], Dahlgren's Ontology [7], UMLS
[24], TOVE [21], GENSIM [27], Plinius [38] and KIF [15].
2. To solve the problem of the dispersion of ontologies over several servers, and the
absence of common formats for representing relevant information about ontologies using
the same logical organization, we built a living Reference Ontology (a domain ontology

about ontologies) that gathers, describes using the same logical organization and has
links to existing ontologies. We built this ontology at the knowledge level using the
METHONTOLOGY framework and the Ontology Design Environment. We also
presented the design choices we made to incorporate the Reference Ontology into the
(KA) 2 initiative ontology after carrying out an Ontological Reengineering Process.
3. To solve the problem of searching for and locating candidate ontologies over several
servers, we built (ONTO)2Agent, an ontology-based WWW broker that retrieves the
ontologies that satisfy a given set of constraints using the knowledge formalized in the
Reference Ontology. (ONTO)2Agent is an instantiation of the OntoAgent Architecture.
OntoAgent and Ontobroker have several key points in common. Both are distributed,
joint efforts by the community, they use an ontology as the source of their knowledge,
they use the web to collect information, and they have a query language for formulating
queries. However, the main differences between them are:
* OntoAgent architecture uses: (1) a SQL database to formalize the ontology, (2) a
WWW form and an ontology generator to store the captured knowledge, and (3) simple
and complex queries based on ODE intermediate representations and AND/OR trees to
retrieve information from the ontology.
* Ontobroker uses: (1) a Flogic ontology, (2) Ontocrawler for searching WWW annotated
documents with ontological information, and (3) a Flogic-based syntax to formulate
queries.
We hope that (ONTO)2Agent and the Reference Ontology will ease the search for
ontologies to be used in other applications.

6 ACKNOWLEDGEMENTS
We would like to thank Mariano Fernandez and Juanma Garcia for their help in using
ODE.

7 REFERENCES
1. Aguado G., Bateman J., Bañón A., Bernardos S., Fernández M., Gómez-Pérez A., Nieto E., Olalla A., Plaza R., Sánchez A.,
ONTOGENERATION: Reusing domain and linguistic ontologies for Spanish. Workshop on Applications of Ontologies and
PSMs. Brighton, England. August 1998.
2. Bateman J.A., Magnini B., Fabris G., The Generalized Upper Model Knowledge Base: Organization and Use. In Towards Very
Large Knowledge Bases. Pages 60-72. IOS Press. 1995.
3. Bateman J.A., Magnini B., Rinaldi F., The Generalized Italian, German, English Upper Model, Proceedings of ECAI94's
Workshop on Comparison of Implemented Ontologies, Amsterdam, 1994.
4. Benjamins R., Fensel D., Community is Knowledge! in (KA)2, Knowledge Acquisition Workshop, KAW98, Banff, 1998.
5. Blázquez M., Fernández M., García-Pinar J.M., Gómez-Pérez A., Building Ontologies at the Knowledge Level using the
Ontology Design Environment, Knowledge Acquisition Workshop, KAW98, Banff, 1998.
6. Borst P., Benjamins J., Wielinga B., Akkermans H., An Application of Ontology Construction, Workshop on Ontological
Engineering, ECAI96, Budapest, PP. 5-16, 1996.
7. Dahlgren K., Naive Semantics for Natural Language Understanding. Boston: MA, Kluwer Academic, 1988.
8. Farquhar A., Fikes R., Rice J., The Ontolingua Server: A Tool for Collaborative Ontology Construction, Proceedings of the 10th
Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Alberta, Canada, PP. 44.1-44.19, 1996.
9. Farquhar A., Fikes R., Pratt W., Rice J., Collaborative Ontology Construction for Information Integration, Technical Report
KSL-95-10, Knowledge Systems Laboratory, Stanford University, CA, 1995.
10. Fensel D., Decker S., Erdmann M., Studer R., Ontobroker: The Very High Idea. In Proceedings of the 11th International FLAIRS
Conference (FLAIRS-98), Sanibel Island, Florida, May 1998.
11. Fernández M., Gómez-Pérez A., Juristo N., METHONTOLOGY: From Ontological Art Towards Ontological Engineering. Spring
Symposium Series on Ontological Engineering. AAAI97. Stanford, USA. March 1997.
12. Fernández M., CHEMICALS: ontología de elementos químicos. Proyecto fin de carrera. Facultad de Informática. Universidad
Politécnica de Madrid. December 1996.
13. Fisher D., Rust K., SFK: A Smalltalk Frame Kit. Technical Report, GMD/IPSI, Darmstadt, Germany, 1993.
14. Fridman N., Hafner C., The State of the Art in Ontology Design, AI MAGAZINE, Fall 1997, PP. 53-74, 1997.
15. Genesereth M., Fikes R., Knowledge Interchange Format, Technical Report, Computer Science Department, Stanford University,
Logic-92-1, 1992.
16. Gómez-Pérez A., Knowledge Sharing and Reuse, The Handbook of Applied Expert Systems, Edited by J. Liebowitz, CRC Press,
1998.
17. Gómez-Pérez A., Towards a Framework to Verify Knowledge Sharing Technology, Expert Systems with Applications, Vol. 11,
No. 4, PP. 519-529, 1996.
18. Gruber T. and Olsen R., An Ontology for Engineering Mathematics, Technical Report KSL-94-18, Knowledge Systems
Laboratory, Stanford University, CA, 1994.
19. Gruber T., Toward Principles for the Design of Ontologies Used for Knowledge Sharing, Technical Report KSL-93-04,
Knowledge Systems Laboratory, Stanford University, CA, 1993.
20. Gruber T., ONTOLINGUA: A Mechanism to Support Portable Ontologies, KSL-91-66, Knowledge Systems Laboratory, Stanford
University, 1992.
21. Gruninger M., Fox M., Methodology for the Design and Evaluation of Ontologies, Proceedings of IJCAI95's Workshop on Basic
Ontological Issues in Knowledge Sharing, 1995.
22. van Heijst G., Schreiber A.Th., Wielinga B.J., Using explicit ontologies in KBS development, International Journal of Human-
Computer Studies, 45, PP. 183-292, 1997.
23. Hovy E., What Would It Mean to Measure an Ontology?, unpublished, 1997.
24. Humphreys B.L., Lindberg D.A.B., UMLS project: making the conceptual connection between users and the information they
need, Bulletin of the Medical Library Association, 81(2), 1993.
25. JavaSoft. Java Security FAQ. http://java.sun.com/sfaq. October 1997.
26. Kan S.K., Metrics and Models in Software Quality Engineering. Ed. Addison-Wesley Publishing Company, MA, USA, 1995.
27. Karp P.D., A Qualitative Biochemistry and its Application to the Regulation of the Tryptophan Operon, Artificial Intelligence and
Molecular Biology, L. Hunter (ed.), 289-325, AAAI Press/MIT Press, 1993.
28. Kifer M., Lausen G., Wu J., Logical Foundations of Object-Oriented and Frame-Based Languages, Journal of the ACM, 1995.
29. Lenat D.B., CYC: Toward Programs with Common Sense, Communications of the ACM, 33(8), PP. 30-49, 1990.
30. Loom Users Guide Version 1.4. ISI Corporation. 1991.
31. Miller G.A., WordNet: An On-line Lexical Database, International Journal of Lexicography 3, 4: 235-312, 1990.
32. Newell A., The Knowledge Level, Artificial Intelligence (18), PP. 87-127, 1982.
33. Pressman R., Software Engineering: A practitioner's approach, McGraw-Hill, 1997.
34. Slagle J., Wick M., A Method for Evaluating Candidate Expert System Applications, AI MAGAZINE, Winter 88, PP. 44-53,
1988.
35. Sowa J.F., Knowledge Representation: Logical, Philosophical, and Computational Foundations, Boston: MA, PWS Publishing
Company, Forthcoming, 1997.
36. Swartout B., Patil R., Knight K., Russ T., Towards Distributed Use of Large-Scale Ontologies, AAAI97 Spring Symposium Series
on Ontological Engineering, 1997.
37. Uschold M., Gruninger M., ONTOLOGIES: Principles, Methods and Applications, Knowledge Engineering Review, Vol. 11,
N. 2, June 1996.
38. Van der Vet P.E., Speel P.-H., Mars N.J.I., The Plinius ontology of ceramic materials. Proceedings of ECAI94's Workshop on
Comparison of Implemented Ontologies, Amsterdam, 1994.
39. Chaudhri V.K., Farquhar A., Fikes R., Karp P.D., Rice J.P., The Generic Frame Protocol 2.0, July 21, 1997.
Towards Personalized Distance Learning
on the Web

Jesus G. Boticario and Elena Gaudioso

Universidad Nacional de Educacion a Distancia,


Departamento de Inteligencia Artificial,
28040 Senda del Rey s/n,
Madrid, Spain
{jgb, elena}@uned.es
http://www.dia.uned.es/~jgb

Abstract. The widespread use of the Web in distance learning could
help to satisfy the need for information and to mitigate the isolation
that characterizes the student in this domain. It can be observed that
the diverse nature of these students and the dispersion of the relevant
information make the effective use of the available resources more
difficult. In order to improve this situation, we are developing an
interactive system to support education on the Web which is able to
adapt to the information and communication needs of each student.

1 Introduction and Motivation

One of the features that characterizes distance learning (DL) is the "systematic
use of communication media and technical support" [7] as alternatives to mediate
in learning experiences. Any theory about learning insists that the quality of the
communication between teacher and student is a decisive factor in the process.
Therefore, taking advantage of resources such as the Internet, which can significantly
improve the information sources and the quality of the communication with the
students, should be seen as an obligation (the natural evolution of this kind of
education will eventually lead to its imposition).
In the near future it is likely that a distance learning student will contact
his/her classmates, teachers, advisor, and the University administration, as well
as make use of common university facilities, through the Internet. Telematic
services can be used by any student, for example, in clearing up doubts together
with fellow students or the teacher, regardless of his/her degree of isolation, or
in lightening the administration involved in compiling his/her academic record.
Considering the student diversity which characterizes this kind of education
(workers with family responsibilities, disabled people, teachers with a permanent
need to bring up to date their background knowledge, teenagers coming from
technical schools and secondary education ... ) as well as the dispersion of the
information sources (news, mailing lists, web pages of different kinds: those of
the institution or other institutions, pages for the different courses, FAQ's, the
lecturers' pages, practical exercises, continuous remote assessments ... ) the

development of any kind of interactive system able to adapt to the information
and communication needs of each student would be of great help.
Our purpose in this paper is to describe the development of an interactive
personal learning apprentice that operates by adapting the use of World Wide
Web (WWW) services to student needs. To date, this personal assistant uses
a set of complementary information sources: predefined access paths, traces of
student choices, hyperlinks added by the student and available historical data
(subjects for which the student has already passed the appropriate exam, sub-
jects for which the student has registered, tutorials which the student attends,
the preferred means of communication with other teachers and students, ...).
This application, which is already being used in the personalizing of the
problem classes of the machine-learning courses at the Computer Science School
(CSS) and the postgraduate courses of the Artificial Intelligence Department
(AID) of the Universidad Nacional de Educacion a Distancia (UNED), will be
especially useful when the use of this medium becomes the main channel of com-
munication for the diverse agents involved in the process (lecturers at the Central
Site, local tutors and students). In fact the increasing use of the Internet in the
CSS is generally welcomed. The fact that the three main CSS departments with
lecturing responsibilities have their own servers is good proof of this, as is the
proliferation of non-official web sites created and managed by students of this
University, where they interchange material, create news and mailing lists for
the subjects, and even organize tutorials with students that have already passed
the exam for that subject (This is the case of the non-official web site of the CSS
students: WAINU 1, the Web site for fans of applied mathematics 2 ... )

2 Specifications and objectives of the system

This application combines objectives of a diverse nature:

1. From the point of view of the psycho-pedagogical model of teaching-learning,
the main objective is to stimulate significant and active learning, where the
main protagonist is the student. So far, it makes use of the natural model of
learning, which consists in [11]:
(a) Raising interesting objectives.
(b) Generating questions that are useful to respond to the established goals.
(c) Processing answers to the questions raised.
2. It is a support system for DL which seeks fast, efficient and personalized
access to the relevant information.
3. Its architecture follows the model which in the machine learning literature is
called personal learning apprentice. More concretely, our approach is based
on the following principle:

1 www.geocities.com/Athens/Forum/5889/index.html
2 usuarios.iponet.es/jastorga/matematicas

Proposition 1. One can construct interactive assistants that, even when
they are initially uninformed, are both sufficiently useful to entice users and
able to capture useful training examples as a by-product of their use [5].

4. It works through access to the educational services available on the
Internet; it is transparent to the student and no additional specific software
is required.
5. The system is based on a specific education management proposal [1] for
distance education over the Internet. The didactic material supplied follows
the guidelines that are considered appropriate for this kind of education [2].

3 State of the Art

After making an exhaustive analysis of other distance education centers (for
instance: the Open University, www.open.ac.uk; the University of Wisconsin-
Extension, www.uwex.edu; Penn State University, www.cde.psu.edu; Bell Labs
Distance Educational Center, www.lucent.com/cedl/; la Universitat Oberta de
Catalunya, www.uoc.es), after considering the tendencies of current interactive
learning environments [9] and after reviewing the available software, we have
reached the following conclusions:

- The larger part of the commercial educational software currently on the market
is mainly concerned with primary and secondary education; in university
education, the student has more freedom to choose a learning method and
should be allowed to investigate on his own, so the student's individual
efforts must be particularly encouraged.
- The educational software is based on multimedia technologies, which, at the
present time, means that distance education systems over the Internet are
very slow.
- Much educational software is designed to be similar to computer games. It
is called edutainment (from education and entertainment).
- The student is not free to choose the path he wants (in the software that
Schank described, the student is given more freedom and control over the
program [11]).
- Most courses offered on the Internet are limited to collections of HTML
pages with hyperlinks to "guide" the student.

Educational software can be found on many different web sites 3. All these
applications are closed systems, implemented especially for specific contents and
for a specific level. Then there are the so-called authoring tools, working
environments used in the creation of Internet courses. These tools mainly provide
3 www.edsoft.com,
www.gcse.com/maths,
node.on.ca/tile,
curriculum.qued.qld.gov.au/lisc/edsw/dossoft.htm,
www.telelearn.ca/conference/demos.html

communication software (e-mail, ftp ...), facilities to create hypermedia courses,
calendar/planner software ... Among all these tools we can name: LEARNING
WEB 4, LEARNING SPACE 5, IMDL 6, Real Education Active Learning (REAL)
System 7 ... More sophisticated systems using artificial intelligence techniques
to facilitate the publishing of teaching material on the Internet have also been
developed. One such example is the Interbook system, which is able to maintain
a model for each student 8. There are other similar systems: DGC: Dynamic
Course Generation on the WWW 9, ISLE: The Intensely Supportive Learning
Environment 10, EXTOL3 11, ELF 12 ...
In conclusion, we have verified that most of the available software is very
limited or too inflexible to fulfil our initial objectives (sect. 2).

4 Distance Learning Interactive System

4.1 The program

We have finally opted to make use of a Web server implemented in Lisp,
CL-HTTP (Common Lisp Hypermedia Server), which was also used in the
development of the Interbook system (sect. 3).

CL-HTTP is an HTTP server developed in the Artificial Intelligence Laboratory
of the Massachusetts Institute of Technology (www.ai.mit.edu) in order to
facilitate exploratory programming in the interactive hypermedia domain and
to provide access to complex research programs, particularly artificial
intelligence systems [8] 13. CL-HTTP is a full-featured server for the Internet
Hypertext Transfer Protocol (HTTP 1.1, HTML 2.0, HTML 3.2, pre-HTML 4.0) that
comes complete with free source code. It enables the HTML page to be processed
when the client requests it, thus allowing a personalized response from the server.
Additional modules for this software, implemented by the user community of
the server, are also available; one such example is the html-parser module
(implemented by Sunil Mishra (smishra@cc.gatech.edu)), which allows us to
construct our own HTML parser.
We have also taken into account the specifications of the World Wide Web
Consortium (W3C, www.w3.org) (this consortium develops the specifications
of the protocols, languages, ... that are used on the Internet) concerning meta-data

4 www.learning-web.com
5 www.lotus.com/learningspace
6 www.educom.edu/program/nlii/articles/moshwils.html
7 www.realeducation.com/products/index.html
8 www.contrib.andrew.cmu.edu/~plb/InterBook.html
9 www.contrib.andrew.cmu.edu/~plb/AIED97_workshop/Vassileva/Vassileva.html
10 www.icbl.hw.ac.uk/projects/isle/Doc.html
11 curriculum.qed.qld.gov.au/lisc/edsw/d-ctools.htm
12 www.icbl.hw.ac.uk
13 www.ai.mit.edu/projects/iiip/doc/cl-http/home-page.html

(that is, the data that refer to the page, its content, referring pages,
identification of the author ...); this specification is called RDF (Resource
Description Framework, www.w3.org/TR/WD-rdf-syntax) and it will eventually be
included in the formal specification of HTML.

4.2 Project description


The students interact with the system when they explore the HTML pages, so
they need no additional software in order to obtain personalized access
to the server.
These pages are generated according to the student's needs, that is, the server
creates each page dynamically, concatenating static information (like a contents
page or a practical exercise) with any piece of information that could be relevant
to the student (for instance annotations of the user, or hyperlinks that the system
knows/predicts to be important for the student ...). This also permits us to
respond to a request for a page, not only by supplying the page in question but also
by updating the user model, modifying other existing pages, recording the trace
of the user activity during the session or performing any other operation that
may be of use to us. For each page the HTML parser described above (sect. 4.1)
allows us to treat and analyze the meta-data about the student or about the
page itself, for instance, any rule about the way the system must behave, any
description of the content of the page or the URLs of related documents ...
The fact that we can use response functions when the client makes a request
allows us to include in these functions calls to external modules that make use of
artificial intelligence techniques, in order to give the student a more personalized
answer.
With this system we can also include, without adding any external application,
other common Internet services that are very useful in a distance education
system, such as e-mail, news or mailing lists.
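
To make this mechanism concrete, the following Python sketch is our own illustration (the actual system is built on CL-HTTP in Lisp, and the names STATIC_CONTENT, UserModel, recommended_links and build_page are hypothetical): a response function concatenates static content with personalized hyperlinks and updates the user model as a side effect of serving the request.

# Minimal sketch of the dynamic-page idea described above; the real system
# runs on CL-HTTP (Lisp), and all names here are illustrative, not its API.

STATIC_CONTENT = {
    "exercise-1": "<h1>Exercise 1</h1><p>Classify the training examples ...</p>",
}

class UserModel:
    def __init__(self, login):
        self.login = login
        self.trace = []          # pages visited this session
        self.added_links = []    # hyperlinks contributed by the student

def recommended_links(model):
    # Placeholder for the real prediction step: re-offer the student's own
    # links plus one fixed recommendation.
    return model.added_links + ["/course/objectives.html"]

def build_page(page_id, model):
    # Serve a page: concatenate static content with the personalized part and
    # record the visit in the user model (the "side effects" of the request).
    model.trace.append(page_id)
    links = "".join('<li><a href="%s">%s</a></li>' % (u, u)
                    for u in recommended_links(model))
    return STATIC_CONTENT[page_id] + "<ul>" + links + "</ul>"

user = UserModel("student42")
print(build_page("exercise-1", user))
print("trace:", user.trace)

The point is only that page assembly and model updating happen inside the same request handler, which is what keeps the personalization transparent to the student.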

4.3 Experimentation

At the moment, access to the exercises of the machine-learning courses at the
CSS and the postgraduate courses of the AID has been personalized.
The system maintains a model for each user who interacts with it; when
the student first starts a session with the server, he must register; in this way,
his model is automatically initialized in the system (Figure 1).
The student can follow whatever path he/she wants while doing the exercises,
irrespective of the pages that the system recommends.
The personalization of the exercises module is basically focused on:

1. Recommendations about the pages, the aim of which is to help the student
to understand the purpose of the exercise. For instance, in the exercise of
Figure 2 the system advises the student to study the objectives of the exer-
cise among the course contents.

Fig. 1. The login page where the user is asked to introduce a login identifier and a
security password

After the student has visited the objectives page, the system will then re-
construct the original page, recommending a different hyperlink (Figure 3).
2. The system allows the student to add new hyperlinks in documentation
pages. For example, if the system presents a page with interesting hyper-
links, the user could add a new hyperlink using a form, as in Figure 4.

To carry out the personalization task, the information sources we consider
at present are:

- Predefined access paths: The pages include goals, questions and answers for
each concept previously selected by the tutor. The student always has the
option, however, not to follow the structure initially anticipated. In fact,
every available piece of information is accessible with or without the aid of
the system.
- The access trace: The hyperlink paths followed in each session are annotated.
- Hyperlinks provided by the student: The user may introduce new active links
in the dynamically-created information pages.
- The historical data available for each student: subjects for which the student
has already passed the appropriate exam, subjects for which the student has
registered, class attendance record, preferred channel of communication with
other teachers and students (e-mail, telephone, post), the study centers
where he/she is registered, the projects in which he/she has participated (with
the department, with the telematic lab at the CSS, with any other organism
of the UNED), employment situation, ...

Fig. 2. Exercise with several possible links to follow together with the system's recommendations

The success of the learning task of the Web personal assistant depends cru-
cially on the quality of its knowledge. The first design choice is to select a stable
set of attributes for describing training examples. The selected attributes must
satisfy some of the following requirements:

- They are correlated.
- There are causal dependencies between them.
- There are hierarchical dependencies between attributes and classes.
- They cover a significant portion of the training examples.
- They are based on measurement or objective judgments.
- Their values can discriminate between the training examples.

Another critical decision is to calibrate the degree of coverage of the values
of the selected features. Structured attributes (e.g., students can be
undergraduate or postgraduate, and the former can be classified depending on their
academic year) offer more information than attributes divided into a predefined
set of interval values (e.g., student age can be divided into young people, adults
and the elderly). In turn, these attributes offer more information than continuous
attributes. On the continuous attributes, a threshold-finding process is applied
in order to discover intervals with greater information gain [10]. Finally,
discrete attributes provide the smallest information gain. It turns out that almost
all the attributes needed (subjects, classes, preferred channel of communication,
...) belong to the latter category. Sometimes the nature of the problem forces
the selection of discrete attributes even though this causes an information gain
reduction.
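
As an illustration of the threshold-finding step on continuous attributes, the sketch below applies the standard entropy-based cut-point search in the style of C4.5 [10]; the function names and the toy data are ours, not the system's.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Scan candidate cut points (midpoints between consecutive sorted values)
    # and return the threshold with the highest information gain.
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - rem > best_gain:
            best_gain, best_t = base - rem, t
    return best_t, best_gain

# e.g. discretizing a hypothetical student-age attribute against a pass/fail label
ages = [19, 21, 24, 30, 35, 42, 55]
passed = ["no", "no", "yes", "yes", "yes", "no", "no"]
print(best_threshold(ages, passed))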
Consequently, it is convenient to analyze the input data either to come up with
a discrete number of intervals, to run a threshold-finding process or to set up
some structure of feature values. The objective of this process is to decrease the
dispersion of training values, improving the predictive quality of the learning
task. However, there is a tradeoff between the usefulness of these clusters of
feature values and the quality of the program results.

Fig. 3. The system recommends the next hyperlink for the student

5 Conclusions and Future Work

In this paper we have described the usefulness of interactive systems which
are able to adapt to the web-based information and communication needs of
the students in a distance learning model. In short, we have described the
development of an interactive system applied in the UNED (the Spanish National
Distance-learning University) to the personalization of the problem classes of the
machine-learning courses of the Computer Science School and the postgraduate
courses of the Artificial Intelligence Department.
Our experiments have shown that it is possible to design a personalized
interaction with users in a transparent (with no specific software requirements) and
efficient way. It is based on dynamic HTML pages that are able to ask for data
directly over the HTTP protocol. The adaptation to the user is performed using
a student model that is updated each time the user interacts with the server.
The system predicts the needs of each user using complementary information
sources, such as: predefined access paths, traces of student choices, hyperlinks
added by the student and available historical data.
The goal of this application is to facilitate access to, and interaction with,
any of the services supplied on the Internet in a distance education model.
The initial design is therefore being updated in order to promote its wider
use. In the first phase, this system will be applied to all the available material
of the machine-learning courses at the CSS and at the AID. Later on, we intend to
extend the system to the whole CSS. However, this system will only become
really useful when WWW resources become the main support for distance
learning education [1].

Fig. 4. Form where the student is asked to introduce the data for the new hyperlink
With respect to the performance of the personalization task, in an extended
design of the system we decided to apply an ensemble of classifiers to improve
its learning accuracy. In addition, content-based information filtering techniques
are applied in the representation of the Web pages [4]. Two information sources
are combined: academic reports and available data from user activity on the
web, including information directly introduced by the student and items which
he/she has selected (web pages, added hyperlinks, news groups, e-mail lists ...).
Finally, the classification model is constructed from the overlapping training sets
of the cross-validation sampling method [6]. The final system will go beyond the
identification of relevant items for the student, to find out the preferred channel
of communication with other teachers and students. For example, it is quite
possible that some students will prefer to contact their companions through news
groups, instead of looking at the Web pages of registered students. Additionally,
the unstructured nature of the information sources (web pages, information
associated with hyperlinks ...) requires the application of representation techniques
that summarize the relevant features of domain objects (there is an interesting
proposal in [3]).

6 Acknowledgements

The authors would like to acknowledge the helpful comments of Simon Pickin,
arising in the course of his language revision of this article. We also thank the en-
tire Artificial Intelligence Department of the Spanish National Distance-learning
University (UNED) for providing support for this project.

References

1. Jesús G. Boticario. Internet y la universidad a distancia. A Distancia, pages 64-69, 1997.
2. Jesús G. Boticario. Material didáctico y servicios para la educación a distancia en Internet. A Distancia, pages 70-76, 1997.
3. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
4. H. C. M. de Kroon, Tom M. Mitchell, and E. J. H. Kerckhoffs. Improving learning accuracy in information filtering. In Proceedings of the Thirteenth International Conference on Machine Learning - Workshop on Machine Learning Meets HCI (ICML-96), 1996.
5. L. Dent, J. G. Boticario, J. McDermott, T. M. Mitchell, and D. T. Zabowski. A personal learning apprentice. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 96-103, San Jose, CA, 1992. MIT Press.
6. Tom G. Dietterich. Machine learning research: Four current directions. AI Magazine, 18(4):97-136, 1997.
7. J. Desmond Keegan. From New Delhi to Vancouver: trends in distance education. In Learning at a Distance. A World Perspective, Athabasca University, Edmonton, 1982. International Council for Correspondence Education.
8. John C. Mallery. A common lisp hypermedia server. In Proceedings of the First International Conference on the World-Wide Web, 1994.
9. D. McArthur, M.W. Lewis, and M. Bishay. The roles of artificial intelligence in education: current progress and future prospects. Technical Report DRU-472-NSF, RAND, Santa Monica, CA, 1993.
10. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
11. Roger C. Schank and Chip Cleary. Engines for Education. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1995.
Visual Knowledge Engineering as a Cognitive Tool

Tatiana Gavrilova 1, Alexander Voinov 2, Ekaterina Vasilyeva 2

1 St. Petersburg State Technical University, Politechnicheskaya 29/9,
195251, St. Petersburg, Russia
gavr@spb.limtu.su
2 Artificial Intelligence Lab, Institute for High Performance Computing and Data Bases,
194291, P.O. Box 71, St. Petersburg, Russia
vki@~.csa.ru

Abstract. This paper presents a research framework based on a methodology of
knowledge acquisition via visual structured analysis of the domain. The
methodology includes a formal procedure and special techniques of knowledge
stratification and detailing. The described approach is implemented in computer
programs that may be used as special cognitive tools, helping domain experts
to investigate domain knowledge through the visual design of concept maps of
knowledge bases. The paper also discusses how ontologies can be specified at
the knowledge level using a set of graphical intermediate representations.
We present CAKE, a software tool implementing these visual knowledge
engineering techniques and principles, which allows ontologies and concept maps
to be specified at the knowledge level. Its multilingual generator module
automatically translates the visual specification into targeted knowledge
representation languages. CAKE may also be effectively used for visual hypertext
design and the development of hypermedia applications on the WWW.

1 Introduction

Knowledge based system (KBS) designers and hypertext system developers contend
that information structures may reflect the semantic structures of human memory.
Further, they believe that mapping the semantic structure of an expert onto a
knowledge hypertext information structure, and explicitly illustrating that
structure in the hypertext, will result in improved comprehension, because the
knowledge structures of the users will then reflect the knowledge structures of the
expert to a greater degree [13]. This paper reviews techniques for ascertaining an
expert's knowledge structure and mapping it onto visual representations. The
studies show that generating a semantic network through structured knowledge
acquisition improves the development phase significantly.
The short history of knowledge engineering (KE) techniques and tools
(including knowledge acquisition, conceptual structuring and representation models),
an overall overview of which is presented in [5, 27], is a path towards a
methodology that can bridge the gap between the remarkable capacity of the human
brain as a knowledge store and the efforts of knowledge engineers to materialise this
compiled experience of specialists in their domain of skill.
Beginning with the first research that exposed the "bottleneck" [1] in expert
system development, and up to the present day, AI (artificial intelligence)
investigators and designers have been only slightly guided by cognitive science. As
a result, a major part of KE methodology suffers from fragmentation, incoherence
and shallowness.
The highlights in this area relate to early works in the 1980s on the reconstruction
of the semantic space of human expertise [3] and to the notable success of
repertory-grid-centred tools such as the Expertise Transfer System (ETS) [4],
AQUINAS [3], KSS0 and others. All these programs can be regarded as the first
generation of KE tools.
The next advance in knowledge acquisition refinement is concerned with visual
knowledge engineering [5], which developed novel techniques aimed at knowledge
engineers. These so-called second generation KE tools [7] bring the ideas of CASE
technology to AI [2]. They help to traverse and organise visually an emerging
knowledge store and the semantic space of the domain in the most natural form, for
example as an "image panel" or a sketchpad for concept maps, diagrams and
pictures.
Although the popular methods described above are rather powerful and versatile,
the knowledge engineer is in fact weakly supported at the most important and critical
stage in the knowledge engineering life cycle - the transition from elicitation to
conceptualisation through understanding and realisation of the domain structure and
the expert's way of reasoning. A mindtool is needed to help and assist at this stage.
Over the last 5-7 years the main interest of researchers in this field has
concerned special tools that help knowledge capture and structurisation. Many KA
tools have appeared that help to cut down the revise-and-review cycle time and to
refine, structure and test human knowledge and expertise [1, 24].
In this paper the new technology called CAKE (Computer Aided Knowledge
Engineering) is described. CAKE may also be effectively used for concept mapping
and ontology development.
Like KBS development, ontology development faces the knowledge acquisition
bottleneck problem. However, unlike KBS, the ontology developer comes up against
the additional problem of not having any sufficiently tested and generalised
methodologies recommending what activities to perform and at what stage of the
ontology development process these activities should be performed. That is, each
development team usually follows their own set of principles, design criteria and steps
in the ontology development process. The absence of structured guidelines and
methods hinders the development of shared and consensual ontologies within and
between teams, the extension of a given ontology by others and its reuse in other
ontologies and final applications [6].
Until now, few domain-independent methodological approaches have been
reported for building ontologies. Uschold's methodology [25], Gruninger and Fox's
methodology [12] and METHONTOLOGY [6] are the most representative. These
methodologies have in common that they start from the identification of the purpose
of the ontology and the need for domain knowledge acquisition. However, having
acquired a significant amount of knowledge, Uschold proposes codification in a
formal language, expressing the idea as a set of intermediate representations and
then generating the ontology using translators. These representations bridge the gap
between how people see a domain and the languages in which ontologies are
formalised. The conceptual models are implicit in the implementation codes. A
reengineering process is usually required to make the conceptual models explicit.
Ontological commitments and design criteria [11] are implicit in the ontology code.
Ontology developers' preferences in a given language condition the implementation
of the acquired knowledge. So, when people code ontologies directly in a target
language, they are omitting the minimal encoding bias criterion defined by Gruber
[11].
Ontology developers (who are unfamiliar with or simply inexperienced in the
languages in which ontologies are coded) may find it difficult to understand
implemented ontologies or even to build a new ontology.
Therefore visual development techniques may be very helpful and successful for
this process. The implementation of visual technologies could also change the entire
design cycle of hypertext tutorials and database development. They force the
designer to follow a top-down technology rather than a bottom-up one. The graphical
approach works as a cognitive tool for a transparent and effective design procedure
[17].

2 Concept Maps, Ontologies and Knowledge Bases as Cognitive Tools

Cognitive tools have been around for thousands of years. Cognitive tools refer to
technologies, tangible or intangible, that enhance the cognitive powers of human
beings during thinking, problem solving, and learning. Cognitive tools represent
formalisms for thinking about ideas. They constrain the ways people organise and
represent ideas, so they necessarily engage different kinds of thinking [16].
Today, computer software programs are examples of exceptionally powerful
cognitive tools. As computers have become more and more common in education,
training, and performance contexts, the effectiveness and impact of software as
cognitive tools have begun to grow.
Although many types of software can be used as cognitive tools for learning (e.g.,
databases, spreadsheets, expert system shells, abductive reasoning tools, multimedia
authoring systems, micro-worlds, and dynamic modelling tools), this article focuses
on the effectiveness of such visual techniques as concept mapping, ontologies and
knowledge base design software employed as intellectual partners in learning.
Concept maps, which are very similar to semantic networks, are spatial
representations of concepts and their interrelationships that are intended to represent
the knowledge structures that humans store in their minds [14]. Concept maps are
graphs consisting of nodes representing concepts and labelled lines representing
relationships between the concepts. Concept mapping is the process of constructing
concept maps - of identifying important concepts, arranging those concepts spatially,
identifying relationships between concepts, and labelling the nature of the semantic
relationship between concepts. Although concept maps can be drawn by hand or with
other simple artefacts such as cards and string, computer-based concept mapping
software enables much easier production of concept maps [15, 23].
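
Since a concept map is essentially a labelled graph, a minimal data-structure sketch (our own illustration, not the representation used by any of the tools cited here) makes the idea concrete:

# A concept map as a labelled graph: nodes are concepts, edges carry
# the name of the semantic relation between them. Purely illustrative.

class ConceptMap:
    def __init__(self):
        self.relations = []  # (concept, relation-label, concept) triples

    def relate(self, source, label, target):
        self.relations.append((source, label, target))

    def neighbours(self, concept):
        return [(l, t) for s, l, t in self.relations if s == concept]

cmap = ConceptMap()
cmap.relate("operating system", "has-example", "LINUX")
cmap.relate("LINUX", "has-part", "kernel")
cmap.relate("kernel", "performs", "process scheduling")
print(cmap.neighbours("LINUX"))  # [('has-part', 'kernel')]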
Building concept maps as a study strategy in a course resulted in consistent,
hierarchical and coherent knowledge structures [14, 21].
Concept maps help to increase the total quantity of formal content knowledge
because they encourage learners and developers to use the skill of searching for
patterns and relationships. This organisational knowledge and the total quantity of
formal content knowledge facilitate meaningful learning and development. Concept
mapping and its effects on domain knowledge are also predictive of problem solving
performance.
Concept and ontology mapping provides a powerful tool for learning, for the
assessment of that learning, and for the design of different complex applications like
expert systems and hypertext tutorials. However, it represents only one method for
doing both, a method that students are differentially disposed toward and capable of.
It is useful and illuminating to allow users to create multiple representations
(perhaps using different tools) of the same content. Some people are better able to
express themselves through concept maps, while others benefit more from more
concrete tools such as multimedia authoring software. Many authors [16, 19, 22]
hypothesise that the integration of concept mapping software programs as one of a
suite of knowledge representation tools embedded in constructivist learning
environments will be much more successful than their use in the context of
traditional teacher-centred pedagogies.

3 Visual design of knowledge bases and hypertexts

The proposed CAKE (Computer Aided Knowledge Engineering) approach and
software tool suggest an analysis procedure that should be carried out before the
visual design and concept mapping [9]. This procedure is intended to split the domain
knowledge into different levels or strata. Object-structured analysis is based on the
decomposition of the subject domain into (at least) eight strata [10]:
s1 WHAT-FOR-Knowledge: Strategic Analysis of the System, its Intention and
Functioning.
s2 WHO-Knowledge: Organisational Analysis of the System Developers Team.
s3 WHAT-Knowledge: Conceptual Analysis of the Subject Domain, Revealing
Concepts and the Relationships between them.
s4 HOWTO-Knowledge: Functional Analysis: Hypotheses and the Models of
Decision Making.
s5 WHERE-Knowledge: Spatial Analysis: Environment, Communications, etc.
s6 WHEN-Knowledge: Temporal Analysis: Schedules, Time Constraints, etc.
s7 WHY-Knowledge: Causal Analysis: Explanation System.
s8 HOW-MUCH-Knowledge: Economical Analysis: Resources, Losses, Incomes,
Revenue, etc.
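
A minimal sketch of this decomposition, assuming nothing about CAKE's internal format, is a table of strata, each of which eventually holds the relation triples of its own concept map:

# Illustrative only: the eight strata of object-structured analysis, each to
# be filled with the relation triples of its own concept map.
STRATA = {
    "s1": "WHAT-FOR: strategic analysis",
    "s2": "WHO: organisational analysis",
    "s3": "WHAT: conceptual analysis",
    "s4": "HOWTO: functional analysis",
    "s5": "WHERE: spatial analysis",
    "s6": "WHEN: temporal analysis",
    "s7": "WHY: causal analysis",
    "s8": "HOW-MUCH: economical analysis",
}
domain_model = {sid: [] for sid in STRATA}
domain_model["s3"].append(("LINUX", "is-a", "operating system"))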
The number of strata could be increased if necessary. After this structuring
procedure is finished, each stratum may be presented as a concept map. Fig. 1
illustrates how CAKE helps to structure the S3 stratum for one software domain.
The definition of the domain is taken from [26]. The presented screenshot is part
of a future knowledge base for an expert system consulting LINUX users.
The stratification process helps the system analyst or knowledge engineer to
grasp the basics of the domain knowledge. For expert system design there are
three main strata:
- WHAT-Knowledge (S3),
- HOWTO-Knowledge (S4),
- WHY-Knowledge (S7).
The other strata are complementary.

Fig. 1. WHAT-knowledge Structure

The same approach may be applied to hypertext design. Many modern
Internet hypertext tools, such as Explorer and Netscape, are intended to serve
as graphical browsers for a global hyperlinked mediaspace. In reality, however,
every user of a more or less complex hypertext structure is usually frustrated by
a chaotic labyrinth of crosslinks. This is especially true for the World Wide Web
as a distributed hypermedia system, where the sort of the associated information
is usually unavailable to the local node.

Imposing a knowledge structure on such amorphous hyperlink spaces can
dramatically shorten the conceptual apprehension of the corresponding flow of
information. In this way, the CAKE technology, even in the described
implementation, appears to be useful for this scope of problems, because it offers key
functionality for elucidating the basic logical skeleton of the domain. Even the
plain visualising of the logical schemata of the domain has a powerful cognitive
impact both on the user and on the designer.
For example, fig. 2 shows a draft of one hypertext tutorial chapter. This tutorial is
based on the course in intelligent system development and is intended for distance
learning [8].

Fig. 2. Visual Design of Hypertext Tutorial

The last but not least contribution of the CAKE technology to this scope of
problems lies in the possibility for the end user to consciously navigate through
the hypermedia space, while gradually building up the knowledge structure of the
path left behind. Such a structure may generalise the primitive apparatus of
bookmarks and index files.
The active browsing support currently implemented in CAKE allows the user of
the system to automate both the analysis and synthesis procedures of these activities.
The proof of a framework's value is how much time and cost one saves when
developing and modifying the knowledge base and hypertext environment. The
framework of CAKE is a modern design environment with the openness and the
tool and data integration capabilities one needs to:

- Provide an easy-to-use strategy of visual knowledge acquisition from
heterogeneous sources of expertise.
- Significantly lower the cognitive efforts of both the designer and the user of a
knowledge/data based system.
- Increase the designer's productivity through visual browsing support.
- Create an environment that optimises the development of both knowledge
based and hypertext products.
The bottom line is that the described approach helps to navigate both in the
materialised, logically linked spaces and in the imaginary ones that were usual in
the traditional forms of expertise transfer.

4 Discussion

This paper presents a rationale for the application of visual knowledge engineering
software as cognitive tools in education and industrial development of intelligent
systems.
Higher order thinking, especially problem solving, relies on well-organised,
domain-specific knowledge. The approach described in this paper facilitates the
development and representation of domain knowledge. Therefore, visual tools are
predictive of different forms of higher order thinking.
They help in organising knowledge and data by integrating information into a
progressively more complex conceptual framework. When learners construct concept
maps or ontologies to represent their understanding of a domain, they may
reconceptualise the content domain by constantly using new propositions to elaborate
and refine concepts that are already known, based on decontextualised knowledge [16,
18, 20]. The cross links, which connect different sub-domains of the conceptual
structure, enhance the anchorage of the concepts in the cognitive structure.
However, the research described above is limited and there is a great need for
sustained research regarding the implementation and effects of visual tools as
cognitive tools.

5 Acknowledgements

The presented research was partially supported by Russian Foundation for Basic
Research (grant 98-01-00081).

References

1. Adeli H. Knowledge Engineering. McGraw-Hill, New York (1995)
2. Aussenac-Gilles, N., Natta, N. Making the Method Solving Explicit with MACAO: SISYPHUS Case-Study. In: Linster, M. (Ed.): Sisyphus'92: Models of problem solving, GMD - Arbeitspapiere (1993) 1-21
3. Boose J.H., Shema D.B., Bradshaw J.M. Recent progress in AQUINAS: a Knowledge Acquisition Workbench. Knowledge Acquisition, Vol. 1, N 1. (1995) 185-214
4. Boose J.H. ETS: a PSP-Based Program for Building Knowledge-Based Systems. Proc. WESTEX-86: IEEE West. Conf. Knowledge-Based Engineering and Expert Systems, Anaheim, Calif., June 24-26. Washington, D.C. (1986)
5. Eisenstadt M., Domingue J., Rajan T., Motta E. Visual Knowledge Engineering. IEEE Transactions on Software Engineering, Vol. 16, No. 10 (1990) 1164-1177
6. Fernandez, M.; Gomez-Perez, A.; Juristo, N. METHONTOLOGY: From Ontological Art Towards Ontological Engineering. Spring Symposium Series. Stanford (1997) 33-40
7. Gaines B.R. Second Generation Knowledge Acquisition Systems. Proceedings of the Second European Knowledge Acquisition Workshop, Bonn, 17 (1986) 1-14
8. Gavrilova T., Chernigovskaya T., Voinov A., Udaltsov S. Intelligent Development Tool for Adaptive Courseware on WWW. 4th International Conference on Computer Aided Learning and Instruction in Science and Engineering, June 15-17, Chalmers University of Technology, Göteborg, Sweden (1998) 464-467
9. Gavrilova T., Voinov A. Visualized Conceptual Structuring for Heterogeneous Knowledge Acquisition. Proceedings of the International Conference on Education and Multimedia EDMED'96, MIT, Boston (1996)
10. Gavrilova T., Voinov A. Work in Progress: Visual Specification of Knowledge Bases. 11th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems IEA-98-AIE, Benicassim, Spain, Springer (1998) 717-726
11. Gruber, T. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. Technical Report KSL-93-04. Knowledge Systems Laboratory, Stanford University, CA (1992)
12. Gruninger, M.; Fox, M.S. Methodology for the Design and Evaluation of Ontologies. IJCAI Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, Quebec, Canada (1995)
13. Jonassen, D. H. Representing the expert's knowledge in hypertext. Impact Assessment Bulletin, 9(1) (1991) 1-13
14. Jonassen, D.H., Beissner, K., & Yacci, M.A. Structural knowledge: Techniques for representing, conveying, and acquiring structural knowledge. Hillsdale, NJ: Lawrence Erlbaum (1993)
15. Jonassen, D.H., & Marra, R.M. Concept mapping and other formalisms as Mindtools for representing knowledge. Alt-J: Association for Learning Technology Journal, 2(1) (1994) 50-56
16. Jonassen, D.H. Computers in the Classroom: Mindtools for Critical Thinking. Prentice Hall (1996)
17. Joseph, R.L.: Graphical Knowledge Acquisition. Proceedings of the Fourth Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, October, 18.1-16
18. Kremer, R. Visual Languages for Knowledge Representation. Proceedings of the 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop
19. Liu, X. The validity and reliability of concept mapping as an alternative science assessment when item response theory is used for scoring. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA (ERIC Document No. 370992) (1994)
20. Musen, M.: Conceptual Models of Interactive Knowledge Acquisition Tools. Knowledge Acquisition, Vol. 1, No. 1 (1994) 73-88
21. Nosek, J.T. & Roth, I. A comparison of Formal Knowledge Representation Schemes as Communication Tools: Predicate Logic vs. Semantic Network. International Journal of Man-Machine Studies, 33, 227-239
22. Shavelson, R.J., Lang, H., & Lewin, B. On concept maps as potential "authentic" assessments in science (CSE Technical Report No. 388). Los Angeles, CA: National Centre for Research on Evaluation, Standards, and Student Testing (CRESST), UCLA (1994)
23. Sowa, J. F. Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, Mass. (1984)
24. Tuthill, S. Knowledge Engineering. TAB Professional and Reference Books
25. Uschold, M., Gruninger, M. ONTOLOGIES: Principles, Methods and Applications. Knowledge Engineering Review, Vol. 11, N. 2 (1996)
26. Welsh M. http://www.linux.org/LDP (1995)
27. Wielinga B., Schreiber G., Breuker J. A Modelling Approach to Knowledge Engineering. Knowledge Acquisition, 4(1). Special Issue (1992)
Optimizing Web Newspaper Layout Using
Simulated Annealing

J. González 1, J.J. Merelo 1, P.A. Castillo 1, V. Rivas 2 and G. Romero 1

1 Department of Architecture and Computer Technology
University of Granada
Campus de Fuentenueva
E-18071 Granada (Spain)
2 Department of Computer Science
University of Jaén
Avda. Madrid, 35
E-23071 Jaén (Spain)

e-mail: geneura@kal-el.ugr.es    phone: +34 958 24 31 62

Abstract. This paper presents a new approach to the pagination problem.
This problem has traditionally been solved offline for a variety of
applications like the pagination of Yellow Pages or newspapers, but since
these services have appeared on the Internet, a new approach is needed to
solve the problem in real time. This paper is concerned with the problem
of paginating a selection of articles from web newspapers that match a
query sent to a personalized news site by a user. The result should look
like a real newspaper and adapt to the client's computer configuration
(font faces and sizes, screen size and resolution, etc.). A combinatorial
approach based on Simulated Annealing and written in JavaScript is
proposed to solve the problem online in the client's computer. Experiments
show that the SA achieves real time layout optimization for up to
50 articles.

1 Introduction

Simulated Annealing (SA) [1, 8] is a Monte Carlo approach for optimization tasks
inspired by the roughly analogous physical process of heating and then slowly
cooling a substance to obtain a strong crystalline structure. The simulated
annealing process lowers the temperature by slow stages until the system "freezes"
and no further changes occur. At each temperature the simulation must proceed
long enough for the system to reach a steady state or equilibrium. This
is known as thermalization. The sequence of temperatures and the number of
iterations applied to thermalize the system at each temperature comprise an
annealing schedule. To apply simulated annealing, the system is initialized with
a particular configuration; a new configuration is built by imposing a random
displacement. If the energy of this new state is lower than that of the previous
one, the change is accepted unconditionally and the system is updated. If the
energy is higher, the new configuration is accepted probabilistically. This
procedure allows the system to move consistently towards lower energy states, yet
still 'jump' out of local minima due to the probabilistic acceptance of some
upward moves. This paper describes an approach to a pagination problem based on SA
where a set of newspaper articles have to be displayed on a web page, as the
result of a query sent to a personalized news site.
The difference between an online problem and an offline one has been clear
since De Jong established it in [3], but nowadays this difference is blurring. With
the arrival of the World Wide Web, lots of problems that were previously solved
with an offline approach must now be solved in real time to be useful on the
Internet, becoming online problems in this environment. One of these problems is
pagination. This problem has been automated by several firms [5, 9] who were
able to use an offline approach while typesetters had long enough to compose
the Yellow Pages or the newspaper, but since these kinds of services appeared on
the Internet, as is the case of web newspapers and news query services, the layout of
the articles that the user wants to read must be performed taking into account
the client's machine configuration, maximizing the amount of information
displayed in the browser window and avoiding scroll bars if possible. Every user has
different font faces and sizes, and a different screen size and resolution, and the
layout process should take these parameters into account and optimize a personal
layout for each user in real time. Furthermore, when push technologies start to
become more fashionable (and standard), laying out all user windows and channels
on the screen will be a challenge, and this will have to be done in real time too.
If the process can be downloaded to the client's machine, the problem will
not overload the server machine either.
This paper describes a SA-based approach to the web pagination problem
that optimizes the layout of a web newspaper in real time taking into account the
size of the browser window and the face and size of fonts in the client's machine,
producing a layout that adapts itself to the user's computer characteristics.
The layout of this paper is as follows: the particulars of the problem are
detailed in section 2; the state of the art is described in section 3; the proposed
approach is discussed in section 4; the results obtained are analyzed in section
5 and some conclusions are drawn in section 6.

2 The problem

After a user sends a query to a news server site, a set of articles related to his
query is obtained. These articles are page segments extracted from web newspa-
pers that may contain headers, text and even images. The fact is that the client
does not know exactly what kind or amount of information will be received.
As the user's query is sent via a web browser, the results should be presented
as a web page containing all the articles extracted by the server in a correct way,
that is, without overlapping between articles, occupying the smallest possible
area and with no empty gaps between articles.

It would be convenient for the optimization process to take place inside the
client machine to avoid server overload due to several queries being made at the
same time and because the results depend on the client's computer configuration,
such as the face and size of fonts being used and the size and resolution of the
screen.
As described in [4], the best way to manage the above constraints is to
program the optimization process as a JavaScript [2] script to be sent within the
web page containing the articles to be laid out and which will be interpreted by
the web browser when the page is loaded. Such a script is able to change the
appearance of a web page dynamically and to lay out all the articles, taking into
account the face and size of fonts and the size of the browser window, avoiding
scroll bars if possible. Thus, the server only has to find the articles that satisfy
the query and send them to the user, while the rest of the work is done at the
client's end.

3 State of the art

The pagination problem is not new, and automated procedures to paginate
Yellow Pages or fax newspapers, such as the YPSS++ system [5] in Germany, have
already been proposed by several firms.
A group of workers at the Finnish Research Institute VTT applied simulated
annealing optimization of page layout to paginate fax newspapers and the Yellow
Pages of several countries [9]. In their paper, one heuristic and two simulated
annealing methods are presented. The best simulated annealing algorithm selects
which articles are going to be included and situates them on the page at the
same time. Overlapping is allowed and sometimes a slight overlap of articles is
observed in the final result.
The above approaches provide an offline solution to the pagination problem
(even with a certain time restriction) after which the Yellow Pages or newspapers
are distributed.
One example of an online approach to this problem is the Krakatoa Project
[6], which is a personalized newspaper presented in Java applet form. It cus-
tomizes the layout for each user, but the newspaper layout does not depend on
the size of each article or the available surface of the window but rather, on
the user and community preferences; thus, it does not really optimize layout:
it typesets the newspaper in two columns, with available surface area divided
among articles depending on the user and user community profile.
We presented another approach in [4] where a GA was used to optimize
the layout of articles extracted from several newspapers. The position of each
article was encoded in the chromosome storing the x and y coordinates of its
left-top corner and articles could span several columns. This approach finds
good solutions, but due to the representation used, it is very difficult to detect
gaps and overlappings between articles, so the objective function is very time
consuming because it has to compare all the coordinates of all the articles to
check whether the layout is legal; thus, the optimization process becomes too
slow to be very useful.

4 Proposed approach

The surface of the window is divided into columns with a fixed width and an
infinite height (if the number of articles is such that they do not fit inside the
window, a vertical scroll of the window is allowed). The number of columns in
the layout depends on the size of the browser window. In this paper, each article
has the same width as a column and a height that depends on the amount of
information, but the system will shortly be able to deal with articles with a
width of several columns, although if an article originally takes up more than
one column, it can be fitted to one column without loss of information. The
problem is then how to fill all the columns with articles to get the heights of the
columns as close as possible.
This problem is very similar to a bin packing problem [10], in which the goal
is to minimize the amount of fixed size bins used to pack a number of objects;
however, in this problem, the number of bins (columns) is fixed and what has
to be minimized is the used capacity difference between bins (columns).

4.1 Problem representation

The most intuitive representation is to encode each article as a pair of integer
values (x, y) representing the position of the article's top-left corner in the window
[4]. This representation is close to the problem, but it makes it very difficult to
detect gaps and overlapping in the layout, which makes the objective function
very time consuming; taking into account that the objective function is called
for every new configuration generated, the whole optimization process becomes
very slow. This is not desirable in a web environment, where the user has to
wait, first for the page to be loaded and afterwards for the optimization process
to proceed, which could take another minute or two.
However, it turns out that coding a solution as a permutation of the order
in which the articles are going to be laid out is much faster. The permutation is the
same length as the number of articles to lay out, so if there are n articles,
possible solutions will be permutations of the values from 0 to n-1. To get the
layout encoded by the permutation, a decoder is used. This decoder implements
a greedy algorithm using the following heuristic: the next article to be placed
must be allocated to the least occupied column. If there is more than one column
with minimum used capacity, the leftmost column is chosen. This representation
avoids gaps and overlapping between articles, making the objective function very
simple because, as all the permutations encode legal solutions, it does not have
to deal with any constraint satisfaction.
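
A direct rendering of this decoder (the paper's implementation is in JavaScript; this Python sketch and its names are ours) allocates each article of the permutation to the least occupied column, breaking ties to the left:

def decode(permutation, heights, n_columns):
    # Greedy decoder: place articles in permutation order, each into the
    # least occupied column (index() returns the leftmost minimum on ties).
    columns = [[] for _ in range(n_columns)]
    used = [0] * n_columns
    for article in permutation:
        col = used.index(min(used))
        columns[col].append(article)
        used[col] += heights[article]
    return columns, used

heights = [120, 80, 200, 60, 150]           # article heights in pixels
print(decode([2, 0, 4, 1, 3], heights, 3))  # ([[2], [0, 1], [4, 3]], [200, 200, 210])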

4.2 The mutation operator

To mutate a configuration, a transposition of two numbers in the permutation
that encodes the possible solution is used in the following way. Given a
permutation, two points are chosen randomly (here, the third and the eighth):

old = (1 2 3 4 5 6 7 8 9)

The new generated permutation is a copy of the old one, but with the numbers
at the marked positions swapped:

new = (1 2 8 4 5 6 7 3 9)
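
In code, the operator is a single random transposition; a minimal Python sketch (the paper's code is JavaScript) is:

import random

def mutate(permutation):
    # Return a copy of the permutation with two randomly chosen
    # positions swapped (a single transposition).
    new = list(permutation)
    i, j = random.sample(range(len(new)), 2)
    new[i], new[j] = new[j], new[i]
    return new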

4.3 The Algorithm

The algorithm used in this approach needs only two parameters: the number
of iterations required by the search process numIt and the number of changes
k necessary to reach thermal equilibrium. Its implementation is detailed in
the following pseudocode:

n = 0
T = T0
select a configuration i_old at random
evaluate i_old
repeat
    for j = 1 to k
        select a new configuration i_new in the neighbourhood
            of i_old by mutating i_old
        Δf = f(i_new) - f(i_old)
        if (Δf < 0) OR (random(0, 1) < e^(-Δf / T))
            then i_old = i_new
    end for
    T = f_T(T0, n)
    n = n + 1
until (T < T_min)
use the last configuration to obtain the layout

The initial temperature is calculated following Kirkpatrick's suggestion [7]:

    T_0 = -Δf* / ln(p_0)    (1)

where Δf* is the average objective increase observed in a random change,
and p_0 is the initial acceptance probability (0.8 is usually used).
For the freezer function (f_T) this approach uses:

    f_T(T_0, n) = T_0 / (1 + n)    (2)

This function lowers the temperature, and thus the acceptance probability,
quickly at first, and later starts a more controlled descent until the minimum
temperature is reached.
The minimum temperature is calculated on the basis of the desired number
of iterations numIt as follows:

    T_min = f_T(T_0, numIt)    (3)
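
Equations (1)-(3) translate directly into code; the helpers below are a minimal Python sketch of this cooling schedule, with function names of our own choosing:

import math

def initial_temperature(delta_f_avg, p0=0.8):
    # Eq. (1): T_0 = -Δf* / ln(p_0); ln(p_0) < 0 for p_0 < 1, so T_0 > 0.
    return -delta_f_avg / math.log(p0)

def freezer(t0, n):
    # Eq. (2): fast descent at first, then a more controlled one.
    return t0 / (1 + n)

def minimum_temperature(t0, num_it):
    # Eq. (3): the temperature reached after numIt cooling steps.
    return freezer(t0, num_it)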

4.4 The objective function

Two different objective functions are tested in this approach. The first one is the
sum, over all columns, of the difference between the used capacity of each column
and that of the most filled column:

    f_1 = Σ_{i=0}^{n-1} (C - c_i)    (4)

where c_i is the used capacity of the i-th column and C is that of the
most filled column (C = max(c_i)). This function measures the unused area in
the layout, but implies a lot of calculation, making the algorithm slower, so a
different objective function was designed and tested.
The final objective function measures the difference in capacity taken up
between the most filled and the least filled column:

    f_2 = C - c    (5)

where c is the used capacity of the least filled column (c = min(c_i)). The
optimal layout (if it exists with the given articles) is reached when this difference
f_2 is zero. This means that all columns are equally filled.
This objective function is easier and faster to calculate than the first one and
guides the search better because the first one cannot distinguish between two
different layouts having the same total unused surface area but in which one
has all the columns with unused capacity equally filled while the other has some
columns with a little unused capacity and other columns that are almost empty.
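
Both objective functions need only the vector of used column capacities produced by the decoder; a minimal Python sketch (our notation) is:

def f1(used):
    # Eq. (4): total unused area below the most filled column.
    C = max(used)
    return sum(C - c for c in used)

def f2(used):
    # Eq. (5): height difference between the most and least filled columns;
    # zero means all columns are equally filled.
    return max(used) - min(used)

print(f1([380, 350, 410]), f2([380, 350, 410]))  # 90 60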

5 Results

To determine how the number of articles influences the time spent by the SA
in the optimization process, the algorithm was tested with 25, 50, 100 and
200 real articles extracted from the Spanish web newspaper EL MUNDO
(http://www.el-mundo.es).
Table 1 shows the minimum, maximum, average and standard deviation of the
time (T) in seconds and the cost (C) in pixels, measured over 10 runs for each
number of articles, executed within Netscape Communicator 4.5 running on a
233MHz Intel Pentium MMX. The parameters used in all the algorithm runs
were numIt=80 and k=10.

Articles |    T ± σ (s) | Tmin  | Tmax  | C ± σ (px) | Cmin | Cmax
      25 |   5.22 ± 0.3 |  4.84 |  5.96 |      2 ± 0 |    2 |    2
      50 |  10.54 ± 1.2 |  8.80 | 12.74 |      2 ± 0 |    2 |    2
     100 |  24.15 ± 1.7 | 22.86 | 28.70 |  2.2 ± 0.6 |    2 |    4
     200 |  51.27 ± 3.3 | 44.98 | 57.62 |      2 ± 0 |    2 |    2

Table 1. Minimum, maximum, average and standard deviation of time and cost optimizing 25, 50, 100 and 200 articles

Appropriate solutions are found independently of the number of articles to
be laid out, so the quality of the final result is independent of the size of the
problem. Another important issue is that the time spent in the optimization
process increases almost linearly with the size of the problem, as shown in
figure 1. This is desirable because with traditional exhaustive methods this time
grows exponentially with the problem size, making large instances intractable.

Fig. 1. Minimum, maximum and average time in optimizing 25, 50, 100 and 200 articles

Taking into account that the program is written in JavaScript and that every
execution of the algorithm must be interpreted by a JavaScript engine inside the
browser, the times obtained are acceptable. If the algorithm were written in C
and compiled, every execution would be much faster, but it would not be able
to optimize web pages dynamically in the client's computer. Moreover, in
a usual-size browser window there is only room for 10 articles without scrolling
bars, and the usual number of articles returned from the server is
no greater than 25, so usual times are between 2 seconds (with 10 articles) and
5 seconds (with 25 articles) in most cases, which is a really short time compared
with the time spent loading the web page.
An example of a final result is shown in figure 2, where 25 articles using a
very small (unreadable) font are displayed. An 8-point font size was used in the
execution in the figure to allocate as many articles as possible in the window
without scrolling. With a normal font size, i.e. 10-12 points, no more than 10
articles can be fitted into a window.

Fig. 2. Final look of a simulated newspaper page with 25 articles

6 Conclusions

This paper presents a different approach from the one presented in a previous
paper [4] based on SA to solve the pagination problem where the code to solve
the problem is sent by the server within the same web page to be optimized. With
this approach, the server only has to look up the information the user requests
and, as the optimization process runs at the client's end, it knows the exact
configuration of the client's computer and adapts to it easily, always obtaining
a personalized result for each user.
The time required for optimization is acceptable; for example, in a 233MHz
Intel Pentium MMX it is usually between 2 seconds (optimizing 10 articles) and 5
seconds (optimizing 25 articles). With current processors, the optimization time
should be better, so this is a very good time if we consider that the code that
performs the optimization is interpreted within a web browser slower than an
normal optimization application compiled for a particular computer architecture.
The proposed approach is available at http://kal-el.ugr.es/~jesus/layout.
In the near future this application will be able to handle articles having different
widths, so a long article, which at present is restricted to fit in a single column,
could occupy more than one column and thus have a squarer shape; this is not
really a restriction, since the shape of the article can be altered to occupy as
many columns as necessary.
Another interesting improvement would be to allocate related articles as close
together as possible. This would make the layout easier to read and understand
for the user, but would involve tagging or an understanding of the articles by
the machine, which is a much more complex and completely different problem.

7 Acknowledgements
This work has been supported in part by the projects CICYT BIO96-0895
(Spain), DGICYT PB-95-0502 and FEDER 1FD97-0439-TEL1.

References
1. E.H.L. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. John
   Wiley & Sons, 1989.
2. Netscape Communications Corporation. JavaScript developer central. Web ad-
   dress: http://developer.netscape.com/tech/javascript.
3. K. de Jong. An analysis of the behavior of a class of genetic adaptive systems. PhD
   thesis, Dept. of Computer and Communications Sciences, University of Michigan,
   Ann Arbor, 1975.
4. J. González and J.J. Merelo. Optimizing web page layout using an annealed genetic
   algorithm as client-side script. In A. E. Eiben, T. Bäck, M. Schoenauer, and H. P.
   Schwefel, editors, Proceedings of the 5th Conference on Parallel Problem Solving
   from Nature, volume 1498 of Lecture Notes in Computer Science, pages 1018-1027,
   Amsterdam, The Netherlands, September 1998. Springer-Verlag.
5. W. H. Graf. Graf's home page. Web address: http://www.dfki.de/~graf/.
6. T. Kamba, K. Bharat, and M. C. Albers. The Krakatoa Chronicle - an interactive,
   personalized newspaper on the web. Technical Report 95-25, Graphics, Visualiza-
   tion and Usability Center, Georgia Institute of Technology, USA, 1995.
7. S. Kirkpatrick. Optimization by simulated annealing - quantitative studies. J.
   Stat. Phys. 34, 975-986, 1984.
8. S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. Optimization by simulated anneal-
   ing. Science 220, 671-680, 1983.
9. K. Lagus, I. Karanta, and J. Ylä-Jääski. Paginating the generalized newspaper
   - a comparison of simulated annealing and a heuristic method. In Hans-Michael
   Voigt, Werner Ebeling, Ingo Rechenberg, and Hans-Paul Schwefel, editors, Pro-
   ceedings of the 4th Conference on Parallel Problem Solving from Nature, volume
   1141 of Lecture Notes in Computer Science, pages 595-603, Dortmund, Germany,
   September 1996. Springer-Verlag.
10. S. Martello and P. Toth. Bin Packing Problem, Chapter 8 in Knapsack Problems:
    Algorithms and Computer Implementations. John Wiley & Sons Ltd., 1990.
Artificial Neural Network-Based Diagnostic
System Methodology
Mario Reyes de los Mozos, David Puiggrós, Albert Calderón
Soft Computing Application Group. Unitat de Microelectrònica.
Enginyeria Informàtica. Escola Tècnica Superior d'Enginyeria
08190 Bellaterra, Cerdanyola, Barcelona (Spain)
e-mail: mario@microelec.uab.es

Abstract: In this paper we propose a development methodology for ANN-based
diagnostic assistance systems, from initial data collection to the final analysis
of the results. The proposed methodology is divided into three phases:
(1) a basic pre-processing of the collected data; (2) training the ANN and
evaluating its performance; (3) studying the criteria used by the ANN, for which
we have used the Trepan algorithm. Finally, we present three medical
applications developed by members of our group.

I. Introduction

In recent years the number of biomedical applications based on artificial
neural networks has increased considerably. ANN-based systems are used both in
hardware applications, such as devices that adjust the doses of medication given to a
patient in an Intensive Care Unit, and in software applications, such as diagnostic and
monitoring assistance systems. In this paper we focus the discussion on diagnostic
assistance systems. Specifically, we propose a development methodology for this kind
of system, from data collection to the final analysis of the results.
The design process of this kind of system is divided into three phases. The first is a
basic pre-processing of the data collected by doctors. With this analysis we want to
detect errors made during data collection and to carry out a first study of the nature of
the data. This phase allows non-significant variables to be filtered out. The last step of
this phase consists of generating training and test data sets for the ANN training process.
In the second phase the ANN is trained and its performance evaluated. The third
phase consists of studying the criteria used by the ANN to give the final diagnosis. If
we know the criteria of the ANN, we can increase the medical reliance on the diagnostic
assistance system. Trepan is the algorithm used for this purpose.
Next, we describe the three phases in detail, and we present three medical
applications developed by the Soft Computing Applications Group of the Universitat
Autònoma de Barcelona, in collaboration with different hospitals.

2. Data Pre-processing

The main goal of data pre-processing is to detect and, if possible, correct
abnormalities in the original data set, so as to present the neural network with a
learning set that contains all the information in a simplified form, improving both the
time needed for the learning process and the internal neural network architecture. This
process is divided into three phases: descriptive statistical analysis, data transformation
and data validation using Trepan.

All this process is performed automatically by an AWK validation program. This
program, using the original data set and a set of validation rules, detects data
abnormalities and, if the original data set is correct, generates two data subsets: one
is used to train the network and the other to validate the learning process. The
number of cases and the class balance can be selected by the user as part of the
validation rules.
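As an illustration of this step, the sketch below (written in Python rather than the authors' AWK, and with invented rule and field names) checks each record against simple range rules and splits the valid records into training and validation subsets:

```python
import random

# Hypothetical validation rules: variable name -> (lowest, highest) allowed value.
RULES = {"age": (0, 110), "albumin": (0.0, 6.0)}

def validate(record):
    """Return True if every ruled field lies within its domain."""
    return all(lo <= record[f] <= hi for f, (lo, hi) in RULES.items())

def split(records, train_fraction=0.7, seed=0):
    """Reject abnormal records, then split the rest into train/validation sets."""
    valid = [r for r in records if validate(r)]
    random.Random(seed).shuffle(valid)
    cut = int(len(valid) * train_fraction)
    return valid[:cut], valid[cut:]

data = [{"age": 72, "albumin": 3.1}, {"age": 200, "albumin": 2.8}]  # 2nd is abnormal
train, validation = split(data)
```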

2.1. Descriptive statistical analysis

In order to detect abnormalities due to the data input process or a wrongly
selected data set, a simple statistical study is made. This study reports the parameters
depicted in Table 1.

Present values  | Standard deviation
Missing values  | Standard error
Mean            | Lowest value
Median          | Highest value
First quartile  | Third quartile

Table 1. Descriptive statistical analysis parameters.

The number of missing values can be used to assess data validity. Patterns with
missing data can be discarded or completed with mean values, typical values or, if
possible, with correlated values. Nevertheless, if the data set contains enough examples,
patterns with missing data are simply rejected.
The standard deviation is used to detect noise in the data. A standard deviation
larger than expected suggests errors in the data input. On the other hand, a standard
deviation within the expected range does not guarantee that the input process was
correct; the whole data set could be shifted. This kind of error can be corrected in
later steps through data transformation. To ensure that every variable lies within its
domain, the statistical study also reports the lowest and highest values reached by
each variable.

2.2. Data transformation

Various transformations are applied to the original data set in order to improve the
learning process and reduce the training time. The data transformation depends on the
variable type. Three variable types can be identified:
• Nominal variables. Those that present one or more exclusive states; no degree
  ordering can be established among the adopted values.
• Ordinal variables. They have different states among which a degree ordering can
  be established, but the distance between degrees is undetermined.
• Absolute variables. They have different states among which a degree ordering can
  be established, and the distance between degrees can also be determined.
Nominal variables can be coded in two ways (see the sketch after this list):
1. Using a one-neurone-per-state configuration, in which there are as many
   neurones as states and exactly one neurone is active for each state.
2. If the variable is binary, it can be coded with a single neurone, enabled or
   disabled to encode the presence or absence of the characteristic.
Ordinal variables are coded as cumulative indicators, using one neurone fewer than
the number of states; each state is coded by enabling as many neurones as the rank it
occupies in the ordered scale. Absolute variables are normalised into the interval
[0.1, 0.9].
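A minimal sketch of the three codings just described (one neurone per state for nominal variables, cumulative "thermometer" coding for ordinal ones, and [0.1, 0.9] normalisation for absolute ones); the function names and example values are illustrative:

```python
def encode_nominal(value, states):
    """One neurone per state: exactly one active input."""
    return [1.0 if value == s else 0.0 for s in states]

def encode_ordinal(rank, n_states):
    """Cumulative coding with n_states - 1 neurones: enable as many
    neurones as the rank occupied in the ordered scale (rank 0..n-1)."""
    return [1.0 if i < rank else 0.0 for i in range(n_states - 1)]

def encode_absolute(x, lo, hi):
    """Normalise a continuous value into the interval [0.1, 0.9]."""
    return 0.1 + 0.8 * (x - lo) / (hi - lo)

encode_nominal("moderate", ["none", "light", "moderate", "high"])  # [0, 0, 1, 0]
encode_ordinal(2, 4)                # [1, 1, 0]
encode_absolute(37.2, 34.0, 42.0)   # 0.42
```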

2.3. Data validation using Trepan

Trepan performs a statistical separation of the data set according to the variable or
variables that give the most information for classifying the resulting class. This study
can reveal deviations in the data that would be very difficult to see using traditional
statistical techniques. The algorithm can be used to test the validity of the protocol,
giving information in several senses:
• It gives information about the more important and the less important variables.
• If the generated tree is very unbalanced, it reveals a lack of variables in the
  protocol definition.
• The Trepan trees can be contrasted with the specialist's criteria in order to debug
  the protocol definition.
In addition, the complexity of the node decision rules gives information about the
global complexity of the problem.

3. ANN-based Diagnostic System

Artificial Neural Networks (ANNs) are a good option for building a diagnostic
assistance system. ANNs are attractive for medical applications due to several
characteristics:
• If the input data consists of human opinions or ill-defined categories, or is subject
  to possibly large errors, the robust behaviour of neural networks is important.
• A neural network presents the ability to discover patterns in data that are so
  obscure as to be imperceptible to human researchers and standard statistical
  methods.
• Medical data exhibits significant unpredictable nonlinearity, whereas traditional
  time-series models for predicting future values are based on strictly defined models.
• A neural network acquires information, 'knowledge', concerning a problem by
  means of a learning/training process, extracting the knowledge directly from the
  data. This information is stored in a compact way, and access to it is simple and
  fast.
• A neural network presents a high degree of precision (generalisation) when giving
  a solution to new input data in the same problem domain.

The design process of an ANN-based diagnostic assistance system can be divided into
five phases:
• Determining the structure of the neural network. At this point it is necessary to
  answer the following questions:
  • How many hidden layers do we need? It is known that there is no theoretical
    reason ever to use more than two hidden layers.
  • How many neurones do we need in the hidden layer? Choosing an appropriate
    number of hidden neurones is extremely important. Using too few will starve the
    network of the resources it needs to solve the problem. Using too many will
    increase the training time, perhaps so much that it becomes impossible to train
    the network adequately in a reasonable period of time. Also, an excessive number
    of hidden neurones may cause a problem called overfitting.

• Determining the training and test pattern sets. If we want an effective ANN-based
  system, the training set must be complete enough to satisfy several goals: (1)
  every class must be represented; (2) within each class, statistical variation must be
  adequately represented; (3) the training set must have approximately twice as
  many examples as the ANN has free parameters (internal connections), to avoid
  the overfitting problem; (4) training and test sets must be balanced.
• Training the neural network.
• Evaluating the performance of the neural network by means of Receiver Operating
  Characteristic (ROC) curves. With this step we evaluate whether the training
  process has been carried out correctly, i.e. whether the ANN gives a good solution
  in relation to the input data. (A sketch of these two steps is given below.)
• Validating the diagnostic assistance system. A group of specialists evaluates the
  performance of the ANN-based system, comparing the ANN diagnosis with the
  specialist diagnosis.
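As a sketch of the training and ROC-evaluation phases, assuming scikit-learn and synthetic data standing in for the clinical data sets (all names and sizes here are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((400, 7))                      # e.g. seven input variables
y = (X.sum(axis=1) + rng.normal(0, 0.5, 400) > 3.5).astype(int)

X_train, X_test = X[:300], X[300:]            # balanced train/test split assumed
y_train, y_test = y[:300], y[300:]

# One hidden layer; backpropagation-trained feedforward network.
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)

scores = net.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))  # area under the ROC curve
```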
Once the ANN has been trained, the next step is to study the criteria followed by the
ANN to reach the final result. If we know the criteria of the ANN, we can increase the
medical reliance on the diagnostic assistance system.

4. ANN criteria extraction with Trepan

The internal representation of the knowledge acquired by an ANN after its learning
process is not easily understandable. Several parameters of the ANN take part in this
internal representation, for example weight values, bias values, and the activation and
output functions. This aspect is a great drawback for the use of ANNs in medical
applications.
But why do we want to gain access to the internal representation in an
understandable and easy manner? Several answers to this question follow:
• The deduced criteria can explain how the net reaches the final diagnosis. This
  is the major obstacle for several ANN-based systems, above all in medical
  applications.
• If the knowledge of the ANN-based system can be expressed by a rule set, it can
  be embedded in other intelligent systems, for example an expert system. This is
  possible because we can handle and express the ANN knowledge in an easy
  manner.
• Thanks to the ANN knowledge we can explore the collected data and evaluate the
  ANN conclusions. With this process it is possible to give the specialist more
  information about the problem.
Learning techniques that use rules as knowledge representation solve the problem
just described in a direct manner, that is to say, the acquired knowledge is easy to
work with. But there are applications where ANN systems give better solutions than
other learning algorithms.
Several ANN knowledge extraction algorithms have been proposed, each of them
with different characteristics. The selected algorithm is Trepan, which generates a
decision tree from a trained ANN and the pattern set used to train the net. In fact,
Trepan does not need an ANN; it only needs an oracle or teacher that answers the
questions made by the algorithm, and an instance distribution model.

We can easily understand Trepan by comparison with a classic algorithm such as
ID3. ID3 is a symbolic learning algorithm that learns concepts. ID3 generates a
decision tree (DT) from a set of examples classified by a teacher. This DT is composed
of rule-nodes, where a rule separates the example set into two classes: one class
complies with the rule and the other does not. In a recursive way, rule-nodes are
selected to classify the example set. The algorithm finishes when the decision tree
completely classifies the initial example set. The behaviour of Trepan is similar to the
ID3 algorithm, with the difference that Trepan generates new examples from a data
model (the model is deduced from the example set). Trepan uses the trained ANN as
the oracle, so we can conclude that the resulting decision tree reflects the
ANN knowledge.
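Trepan itself uses m-of-n splits and an instance-distribution model; the simplified sketch below captures only the oracle idea, relabelling sampled instances with a trained network and fitting an ordinary decision tree to them. The function names and the uniform sampling are assumptions, not the actual Trepan procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def extract_tree(oracle, n_features, n_samples=5000, seed=0):
    """Query the trained network (the oracle) on sampled instances and
    fit a decision tree that mimics its input-output behaviour."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, n_features))   # crude stand-in for Trepan's
    y = oracle(X)                             # instance-distribution model
    return DecisionTreeClassifier(max_depth=4).fit(X, y)

# 'net' would be a previously trained classifier with a predict() method:
# tree = extract_tree(net.predict, n_features=7)
# print(export_text(tree))                    # readable rule set for specialists
```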
Thanks to Trepan we can:
• Study the knowledge acquired by the net. From this information we can observe
  useful characteristics of the problem, characteristics detected by the net.
  Afterwards, the specialist can evaluate this knowledge.
• Generate a rule-based expert system from the ANN knowledge. It is possible to
  complete expert systems with the deduced rule set.
• Study the weight of the different variables or attributes in the ANN solution. If a
  variable does not appear in the DT, that attribute is probably not important
  for the net. In the same way, we can detect variables that are very important for
  the final diagnosis.
• Study possible ANN performance problems. If there are some unclassified
  examples, we can suppose that some attribute is not present in the protocol, and
  maybe those attributes are important for the net.
Trepan maximises the readability and understandability of the ANN decision
tree, generating trees that are more compact and useful than trees generated by means
of ID3-style algorithms. The rules generated by Trepan present a great semantic
expressiveness, greater than the rules reached by ID3.

5. ANN-based medical applications

The methodology explained above has been used in several medical projects carried
out by members of our group in collaboration with different hospitals. Next, we
present the three most important ones.

5.1 Interpretation of the Open Angle Chronic Glaucoma visual field

This project presents the study of Open Angle Chronic Glaucoma (OACG). OACG
is a frequent and serious eye disease (0.4-2.1% prevalence among the population older
than 40 years) because it can produce great damage to the visual function, being one of
the main causes of blindness in developed countries. At present there are two tests that
are considered the pillars of glaucoma diagnosis: the study of the atrophy of the fibre
layer and the optic nerve head, and the exploration of the visual field. The study of
the visual field is still the main data source in glaucoma diagnosis, and the absence of
campimetric defects excludes the diagnosis.
The diagnosis system is based on artificial neural networks, specifically feedforward
networks trained by means of the backpropagation learning algorithm. The
network has seven input units, which are the zones defined in the campimetry, and the
response is whether the patient's visual field presents glaucomatous defects.

The specificity (>82%) and sensitivity (>90%) values are higher than the indices
obtained by other methods of visual field interpretation. To sum up, from the results
obtained it can be deduced that artificial neural networks are a good solution for
developing a diagnostic assistance system.
Another positive aspect of this approach is the possibility of knowing the
criteria followed by the net to reach the final diagnosis, and for this job we have used
the Trepan algorithm. The glaucoma application has the particular characteristic that
its variables are continuous, and this aspect demands a great effort from Trepan,
because the process of determining rule conditions over continuous variables is very
hard. Ophthalmologists of the Glaucoma Unit of IOBA (Instituto de Oftalmobiología
Aplicada) have evaluated and accepted the final rule set. With this we can increase the
credibility of the ANN solution.

5.2 Mammography radio-guided biopsies

The aim of this project is to determine whether the use of an ANN for the
indication of radio-guided biopsies can reduce the percentage of negative biopsies in
the diagnosis of breast cancer. Between 15 and 30% of mammography-detected
abnormalities are breast carcinomas. Hence, radio-guided biopsy is indicated for
outlining the suspected breast zone and confirming or refuting the presence of breast
carcinoma. Nevertheless, the percentage of negative biopsies is, fortunately for
patients, extremely high (up to 85%); this represents extraordinary expenses of time
and money for the hospital. Objective methods designed to reduce the percentage of
negative biopsies would not only alleviate hospital budgets but also lessen the
understandable patient fears and nuisances when facing the doubt of cancer. An
additional goal of the project is to study the weight of every attribute that characterises
a mammography, with the purpose of determining the quality of the protocol.
Mammography is the gentlest breast cancer detection technique, and it is the
first breast exploration performed. After that, if necessary, the patient is subjected to
other, more aggressive explorations. A mammography can present two types of
characteristics: microcalcifications and nodules. These characteristics can appear
together, but this is a very rare case; they usually appear alone.
As in the previous application, an ANN-based system is proposed. In the initial
analysis, before the training process, we detected the aspect commented on above, the
separation between microcalcifications and nodules. For this reason we decided to
divide the problem into two parts: (1) detection of dangerous microcalcifications, and
(2) detection of dangerous nodules (high risk of breast cancer). With this solution we
achieve a complexity reduction in the two new problems.
An ANN-based system has been designed to solve the microcalcifications
problem. After that, we obtained a rule set from the ANN by means of the Trepan
algorithm, which was evaluated and validated by specialists.
For the nodules problem we decided to use another approach, different from an
ANN-based system. We decided to design a rule-based system, basically because it is
very simple to deduce. The Trepan algorithm has been used to deduce the rule set,
and afterwards, if necessary, breast specialists complete the rule-based system.
In conclusion, the final system for solving the mammography radio-guided
biopsies problem has a hybrid nature. It presents two blocks, a rule-based block and an
ANN-based block, improving the performance of the complete system in relation to a
unique ANN-based solution.

5.3 Prediction index for advanced age patients

In this case we want to obtain a prediction index of evolution and a classification
(two classes, success and no-success) for advanced age patients (>65 years old). The
data collection is carried out at the time of admission of the patient to the Intern
Medicine emergency services of the Bellvitge and Viladecans Hospitals. With this
application we want to show the use of Trepan to study the importance of patient
attributes.

[Figure: root of the Trepan tree. The root node is a categorical rule on Cognitive
function (No damage; Light damage; Moderate damage; High damage); its branches
lead either to class labels, e.g. F (95, 39.6 F) and F (94.1, 20.4 F), or to m-of-n rules
such as 2 of 3 (sex=male, age<0.29, albumin>0.3) and 1 of 6 (respiratory
frequency<0.66, sex=male, axillary temp>0.512, bronchoplegy=yes, albumin<0.28,
urea<0.115).]

Figure 1. First nodes of a Trepan tree.

Figure 1 shows the first nodes of a Trepan tree. It shows two kinds of rules:
categorical rules and simple rules. The first node is a categorical rule on a variable
with four possible states. For every state of this variable, Trepan either generates a
new rule set or gives a classification into the set with true result (T) or the set with
false result (F). It can be seen that 39.6% of the examples have the cognitive function
set to no damage, and of these 95% have a false final evolution, so this kind of
example is classified as false. Likewise, 20.4% of the examples show moderate
damage and 94.1% of these have a false evolution, so they are also classified as false
examples. On the other hand, the examples with light damage and high damage need a
more complex rule, which is generated in the next tree nodes.
It can be seen that the variable cognitive function carries the most information in
the set, because it is at the top of the tree; on the other hand, this variable does not
have sufficient clinical sense to predict the patient evolution by itself. So in this case
the Trepan tree helps to discover that the data set is biased by this variable.

6. Conclusions

In the present article a methodology to develop ANN-based diagnostic assistance
systems in the medical field has been proposed. This methodology has been used to
design different ANN-based systems, such as the interpretation of the Open Angle
Chronic Glaucoma visual field, mammography radio-guided biopsies, and the
prediction index for advanced age patients. We have established that ANN-based
systems are a good solution for medical applications, obtaining better results than
other solutions, such as discriminant analysis or logistic discriminant analysis.

We have shown that a good analysis of the collected data can help us to know
the nature of the problem better. Specifically, in mammography radio-guided biopsies,
thanks to this analysis we detected several aspects of the problem that we did not
know. In the last application (prediction index for advanced age patients), we used the
Trepan algorithm to study the solution given by the ANN-based system.
The methodology shown is divided into three phases: (1) carrying out a basic pre-
processing of the collected data and generating the training and test pattern sets;
(2) training the ANN and evaluating its performance; (3) obtaining and studying the
criteria followed by the ANN to reach the final solution. Thanks to the third step we
can increase the medical reliance on the diagnosis given by the ANN-based assistance
system.

7. Acknowledgements

We thank the members of the IOBA (Instituto de Oftalmobiología Aplicada de
Valladolid) for their assistance in the Glaucoma application. We also thank the
members of Hospital Duran y Reynals for their collaboration in the Mammography
problem, and the members of the Intern Medicine emergency services of the Bellvitge
and Viladecans Hospitals for their collaboration in this research. Finally, we want to
thank the IMB-CNM for their generous collaboration in this work.

Neural Networks in Automatic Diagnosis of
Malignant Brain Tumors
Francisco Morales, Paloma Ballesteros

UNED. Facultad de Ciencias. Dpto. Química Orgánica y Biología.

E-mail: fmorales@bec.uned.es, pballesteros@ccia.uned.es
Sebastián Cerdán

CSIC. Instituto de Investigaciones Biomédicas.

E-mail: scerdan@iib.uam.es

Abstract. Automatic Proton Nuclear Magnetic Resonance (1H NMR) spectrum-
based diagnosis of malignant brain tumors is analyzed, using Neural Networks for
classification purposes. The Pattern Recognition task has been split into its principal
parts or subproblems, adapting each of them to the special task of analyzing
1H NMR spectra. We study the principal algorithms needed for solving those
problems, and finally a distributed object-oriented classification system is
proposed, which attempts to solve the principal problems mentioned in each
section.

Introduction

Nuclear Magnetic Resonance (NMR) spectroscopy is becoming one of the most
important tools for biochemical analysis of living tissue, since it allows the
simultaneous observation of a large number of metabolites without making any
preselection among them.
Its use has become widely known and it is applied as a diagnostic method in many
hospitals in developed countries. On the other hand, its application to the study of the
Human Central Nervous System represents a great advance, not only in the knowledge
of the different pathologies but also in the recognition of existing relationships
between lesions and the neighboring structures (Pascual, J. Ma., Carceller, F.,
Cerdán, S., Roda, J. M. 1998).
NMR spectroscopy is considered a non-invasive method for the study of living
tissue's metabolism, since it allows quantitative and qualitative observation of a large
number of metabolites in each culture. This feature has allowed us to
characterize diverse types of tumors through their NMR spectra.
However, precisely this richness of information obtained from the spectrum
becomes a major problem when we try to interpret the changes associated with more
than two or three metabolites in a specific clinical situation. This is due mainly to the
following reasons:
• Spectra from biopsies with the same medical diagnosis can vary notably from one
  another, as figure 1 shows.
• Spectra from samples with different medical diagnoses can overlap, producing an
  impression of homogeneity or common causality.
• The identification of metabolites with low concentrations becomes very difficult
  or almost impossible with the common techniques.
• The presence of noise, or a traditional statistical analysis, can cause a loss of
  information related to small biochemical changes, which may represent highly
  important facts from a clinical point of view.

[Figure: two 1H NMR spectra plotted over the 2.5-1.3 ppm range, with peaks
labelled Glu, Gln, NAA, Ala.]

Fig. 1. An example1 of the spectra's variability. The image shows perchloric acid proton
spectra, extracted from a brain tumor biopsy, with the diagnosis of a high-grade
Glioma. The differences between both can be noticed on the marked metabolites (Ala, NAA,
etc.).

The first part of this work intends to analyze the principal problems responsible
for that kind of behavior. The second part is dedicated to the study of artificial Neural
Networks as a computational tool for solving those difficulties and for spectrum
classification. In the third part we end with the proposal of a distributed object-
oriented system for automated diagnosis.

NMR spectroscopy

Characterization of Proton spectra

Proton NMR spectra are characterized by a series of peaks called Resonances.


Each of these corresponds to a specific proton in a particular metabolite.
That proton is characterized by:
• Its position in the spectrum, which represents the chemical shift, expressed in
  parts per million (ppm). It depends on the chemical environment and is equivalent
  to the abscissa x in the XY plane.
• Its height, which represents its intensity and indicates the metabolite's
  concentration in the studied sample. It is equivalent to the ordinate y in the XY
  plane.

1 This sample spectrum has been taken from (El-Deredy, W. 1997).

Building the Spectrum

Starting from the spectrum of a normal brain, shown in figure 2, one can determine
typical metabolite resonances, which can serve as a criterion for later classification
purposes.
The determined metabolites are:
1. Lactate's (Lac) H3 protons.
2. Alanine's (Ala) H3 protons.
3. N-Acetylaspartate's (NAA) H6 protons.
4. The H4 protons of Glutamate and Glutamine (Glu and Gln).
5. Creatine's and phosphocreatine's CH3 and CH2 groups respectively.
6. The trimethylammonium groups of Choline and derivatives.
7. Taurine's H2 and H3 protons.
8. The H2 proton of Myoinositol.
The spectrum's formation process is very simple, and can be easily explained
through any two metabolites, for example Lactate and N-Acetylaspartate, which
produce the spectra shown in figure 3. The resulting spectrum, built on the basis of a
50:50 mixture of both metabolites, could be obtained by overlapping the former
spectra, as figure 4 shows.

Fig. 2. Proton NMR spectrum from a normal brain, with the main metabolites.

Following this approach it is easy to conclude that the more metabolites are
present in a spectrum, the harder it will be to read and classify in order to produce a
safe diagnosis.
All this leads us to the conclusion that the application of Pattern Recognition
techniques to the classification of NMR spectra could produce quite interesting
results for this and the other formerly mentioned problems. This approach is not
completely new, since signal processing has always been associated with NMR
spectrum analysis; but in our work we would like to show the significance that an
integrated, completely automated, Neural Network based classification system would
have.
Pattern recognition based spectrum analysis and classification started in the
80's with the works of Jeremy Nicholson and John Lindon. However, in spite of the
success that NMR techniques have obtained in the biochemical area, their use has
not been extended to many research groups in that field, mainly due to the lack of
experience and, basically, of effective real-time software systems. That is the main
reason why we think that an Integrated Classification Environment would be of great
importance for this field.

Fig. 3. Lactate's and N-Acetylaspartate's spectra.

Fig. 4. Spectrum of Lactate and N-Acetylaspartate, in a 50:50 mixture.

Pattern Recognition

Pattern recognition can be defined as the capacity to identify, analyze and interpret
a set of regularities, previously defined and characterized, within a collection of
objects, for example our metabolites, which are described through a set of
measurements. Those measurements are commonly affected by noise and other less
important elements of the complex environment.
Our main task is the combination of these techniques with NMR
spectroscopy knowledge in order to achieve the following goals:
• To detect subtle differences in the metabolites present in spectra with the same
  diagnosis. This ability would allow us to refine our final classification.
• To apply modern pattern recognition techniques to the analysis of noisy spectra,
  where it is very difficult to distinguish important metabolites from the noise.

Measurement device -> Signal preprocessing -> Pattern representation ->
Classifying algorithm -> Class

(In our case: NMR scanner producing a FID* signal -> FID preprocessing with Fast
Fourier Transform algorithms -> Feature extraction -> Classifying algorithm.)

* FID stands for Free Induction Decay, which represents the measurement acquisition process.
Fig. 5. Flow of control in any Pattern Recognition System.

Figure 5 depicts the flow of control in any pattern recognition system, which
we have adapted to the problem of spectral classification.
In the following sections we will study the different steps of that process, with the
exception of the first and second ones, since they do not provide any important
information for our computational process.

Representing the pattern

The representation of the pattern we are trying to classify constitutes one of
the fundamental steps in the classification process, since its goal is the reduction of
the data vector's dimension; in other words, we are preparing the spectral data
in a comprehensive and reduced form for the neural network.
The main characteristic of the resulting vector, which we call the Feature Vector
(FV), is that it only contains the spectrum's relevant components from a
biochemical and classificatory point of view. That is, its components are free of
noise and non-relevant data.
This FV can be obtained through any of the following approaches:
1. Feature Selection: This method is very simple; it consists of directly selecting,
   from the original vector, a subset of components which represent the spectrum's
   main features. It commonly relies on the experience of a specialist, for example a
   doctor or a biochemist.
2. Feature Extraction: In this method the original measurements are "combined"
   with the help of an "extraction algorithm", for example Principal Components
   Analysis, Wavelet transformation, Factor analysis, etc.
3. Mixed Approach: This consists of a combination of the former approaches.
An extensive study of the available algorithms for reducing the data vector's
dimensionality lies outside the scope of this communication. The interested reader
will find in other papers, for example (El-Deredy, W. 1997), an excellent review of
these methods.
In our case, we have started with the first approach, direct feature selection, based
on the knowledge of experienced specialists. Table 1 shows the selected metabolites
and their resonances2 (a sketch of this selection step follows the table).

Metabolite                                               | Resonance value (in ppm)
---------------------------------------------------------|--------------------------
Lactate's (Lac) H3 protons                               | 1.35
Alanine's (Ala) H3 protons                               | 1.48
N-Acetylaspartate's (NAA) H6 protons                     | 2.03
The H4 protons of Glutamate and Glutamine (Glu and Gln)  | 2.35 and 2.45 respectively
Creatine's and phosphocreatine's CH3 and CH2 groups      | 3.05
The trimethylammonium groups of Choline and derivatives  | 3.20
Taurine's H2 and H3 protons                              | 3.42
The H2 proton of Myoinositol                             | 4.07

Table 1. Selected metabolites and their resonances.
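As an illustration of this direct feature selection, the sketch below samples a digitised spectrum at the resonance positions of Table 1, taking the peak height as the maximum intensity inside a small window around each position (the window width and the synthetic spectrum are assumptions):

```python
import numpy as np

# Resonance positions (ppm) from Table 1; Glu and Gln are listed separately.
RESONANCES = [1.35, 1.48, 2.03, 2.35, 2.45, 3.05, 3.20, 3.42, 4.07]

def feature_vector(ppm, intensity, window=0.05):
    """Peak height around each selected resonance -> low-dimensional FV."""
    fv = []
    for r in RESONANCES:
        mask = np.abs(ppm - r) <= window
        fv.append(intensity[mask].max() if mask.any() else 0.0)
    return np.array(fv)

# Example with a synthetic 1H spectrum sampled on 4096 points.
ppm = np.linspace(0.5, 4.5, 4096)
intensity = np.exp(-((ppm - 2.03) / 0.02) ** 2)   # lone NAA-like peak
print(feature_vector(ppm, intensity))
```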


However, although this approach represents a valid solution to this problem, we
think that a computational analysis of more metabolites would result in a safer and
more exact diagnosis. For this reason, we think that an automated classification
system should offer the possibility of feature extraction through other algorithms.
As a representative of this second approach we have selected the Wavelet
Transform, described as follows.

Wavelet Transform

This method starts from a function named the "mother wavelet"3, which acts as a
prototype. This "wave" is translated and scaled in such a way that the N
spectrum components, or points, are transformed into N coefficients with the
help of N Wavelets. Those Wavelets form an orthogonal basis of the N-dimensional
spectral space. The resulting signal contains the minimum of noise when the
transformation and the mother wavelet are correct.

2 The studied proton spectra were obtained on a BRUKER AM-360 spectrometer.
3 A Wavelet can be defined as a function formed through translations and scalings applied to a
mother wavelet or prototype. The term comes from Morlet et al.; the interested reader is
referred to (Morlet, J., Arens, G., Fourgeau, I., y Giard, D. 1982) and (Tate, Anne
Rosemary. 1996).
Each one of the mentioned coefficients is calculated through the dot product of the
data vector with one of the basis functions.
The set of basis functions, as mentioned earlier, can be obtained from the mother
wavelet $g_{basic}(t)$ through transformations, as equation (1) (Tate, Anne Rosemary.
1996) shows:

$$g_{a,b}(t) = \frac{1}{\sqrt{a}}\, g_{basic}\!\left(\frac{t-b}{a}\right) \qquad (1)$$

The process of analyzing a function, our spectrum, through wavelets can be
viewed as the representation of this function at different approximation levels (Tate,
Anne Rosemary. 1996). The first level is a very general one, with an orthonormal
basis of two functions; the next one determines a less general approximation with
four functions, and the last level represents the original function as it is.
In our case, the problem lies in determining the approximation level at
which the signal does not lose any important biochemical, and therefore medical,
information, while the dimension of the data vector is smaller than the original one.
The gain of this approach lies in the fact that the process runs completely
automatically, without the need for interaction with a human operator.
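A sketch of this second approach using the PyWavelets library; the choice of the Daubechies-4 mother wavelet and of the truncation level are assumptions, since in the real system the level would be chosen so that no clinically relevant information is lost:

```python
import numpy as np
import pywt

def wavelet_features(spectrum, wavelet="db4", level=4):
    """Multilevel discrete wavelet transform of an N-point spectrum.
    Keeping only the coarse approximation coefficients yields a data
    vector much shorter than the original, with high-frequency noise
    confined to the discarded detail coefficients."""
    coeffs = pywt.wavedec(spectrum, wavelet, level=level)
    return coeffs[0]                 # approximation at the chosen level

spectrum = np.random.rand(4096)      # stand-in for a digitised 1H NMR spectrum
fv = wavelet_features(spectrum)
print(len(fv))                       # ~260 coefficients instead of 4096
```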
There are many other algorithms for feature extraction, which are currently under
evaluation for inclusion in our system. The interested reader is referred to the
work of Waley El-Deredy in (El-Deredy, W. 1997).

The classification process

Once the principal features are extracted, one can start with the classification task,
through a study of the neural network topologies most frequently applied to the last
step of the pattern recognition process.
In the literature we found that Backpropagation Neural Networks (NNs) constitute
the most frequently applied model; that is, an input layer, one or more hidden layers
and an output one. Among the more frequent applications are:
• Interpretation of:
  - Radiographs.
  - Electrocardiograms.
  - Dementia states.
  - Blood analysis.
• Diagnosis of:
  - Lung tumors.
  - Breast tumors.
The main cause for this model's frequent application to medical problems lies in
the typical overlap that exists among sets of values reporting malignant and
benign diagnoses.

The Neural Network's model

For this project's first phase we have chosen the Backpropagation model of NNs,
on the basis of the following reasons:
• Backpropagation networks, as their name implies, learn by example, a model very
  frequently applied in medicine.
• New knowledge can easily be added to the network by including new examples in
  the training set and retraining the network. This is a very important fact, since the
  final system should be usable by operators and other non-specialized staff.
• The input data do not need to follow any specific probabilistic distribution.
• Once the network is trained, it can be applied to real-time problems.

Model features

Input layer
It contains the input neurons, whose role is to connect the hidden layer with the
input data determined in the former process. In this layer we find as many neurons
as input variables; in our case we have ten.

Hidden layer
This represents the main processing component of the NN. The elements to take into
account at this point are the activation function, denoted AF(x), where x represents
the weighted sum of each neuron's inputs, since it decides the possible activation of
each neuron, and the total number of neurons to place in this layer.
In the case of the activation function, the end user is able to choose among
the following:
• The Sigmoid function, the most widely used transfer function,
• The Piecewise Step function,
• The Unit Step or Hard Limiter function, and finally
• The Gaussian.
Regarding the total number of neurons, it is generally determined empirically
through a trial-and-error approach; in our case we leave this decision to the end user,
allowing him, through an interactive process, to determine the number of neurons in
this layer.

Output layer
Generally this layer contains one neuron per classification class; for our problem we
have six possible classes: Astrocytomas, Meningiomas, Glioblastomas,
Oligodendrogliomas, Medulloblastomas, and non-malignant tumors, which represent
the different tumor classes to diagnose. A sketch of the resulting topology follows.
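Putting the three layers together, a minimal sketch of the topology described above (ten inputs, a user-chosen number of sigmoid hidden neurones, six output classes), with scikit-learn standing in for the project's own implementation:

```python
from sklearn.neural_network import MLPClassifier

def build_classifier(n_hidden):
    """10 inputs -> n_hidden sigmoid ('logistic') neurones -> 6 classes.
    n_hidden is left to the end user, as in the interactive process
    described above."""
    return MLPClassifier(hidden_layer_sizes=(n_hidden,),
                         activation="logistic",   # the sigmoid option
                         max_iter=2000)

# Hypothetical usage, with X_train of shape (n_cases, 10) and y_train
# labelling one of the six tumor classes:
# net = build_classifier(n_hidden=12)
# net.fit(X_train, y_train)
```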

Developing an Automated Distributed Diagnosing System

The previous sections have developed the background for this last part, where we
briefly describe how to use the methods and techniques mentioned earlier to design
and implement a distributed diagnostic system.

Tasks to be accomplished by the system

Such a system should be able to accomplish the following tasks:

1. To read and process, in a completely automated way, a FID signal as it comes
   from the NMR scanner. The main goal of this phase is the elimination or filtering
   of noise and secondary signals, producing a pattern vector for the NN.
2. The former pattern should be processed by the previously designed and trained
   NN. In order to accomplish this, the end user should first specify:
   • The number of layers in the net.
   • For hidden layers, the quantity of neurons in each layer.
   • The transfer or activation function to use.
   and finally, he should:
   • Train the network.
3. The NN produces a classification and a reliability rate for this diagnosis.

Why distributed?

Since the practical application of our system will be, in its last phase, on real
cases, we planned it, looking for simplicity in its design, on the basis of a distributed
CORBA4-compliant architecture.
The main reason why we chose CORBA as the architecture for distributed objects
is its independence with respect to:
• Implementation language, and
• Working platform.
On the basis of this architecture we have designed a thin-client component, whose
main role is the interaction with the end user, leaving the hard processing
work to the server component. Figure 6 depicts this basic idea.

4 CORBA stands for Common Object Request Broker Architecture, a methodology for the
design and implementation of distributed applications for the net.

Fig. 6. Application Architecture

Under global services we grouped, basically, the first and third previously mentioned
steps. Local services deal with the user-system interaction, storing locally the ready-
to-classify NNs, that is, NNs already modeled and trained, developed in the second
step.
In the near future we are planning to allow communication among local clients,
in order to permit resource interchange, thereby eliminating redundant operations.

References

El-Deredy, W. 1997. "Pattern Recognition Approaches in Biomedicine and Clinical Magnetic
  Resonance Spectroscopy: A Review." NMR in Biomedicine. Vol 10, 99-124, 1997.
Astion, M.L., Wilding, P. 1992. "Application of neural networks to the interpretation of
  laboratory data in cancer diagnosis". Clinical Chemistry (US), Vol. 38, pp. 34-38.
Astion, M.L., Wilding, P. 1992. "The application of Backpropagation Neural Networks to
  problems in Pathology and Laboratory Medicine". Arch. Pathol. Lab Med, Vol. 116, pp.
  995-1001.
Pascual, J. Ma., Carceller, F., Cerdán, S., Roda, J. M. 1998. "Diagnóstico Diferencial de
  tumores cerebrales "in vitro" por espectroscopia de resonancia magnética de protón. Método
  de los cocientes espectrales". Neurocirugía 9, pp. 4-10.
Morlet, J., Arens, G., Fourgeau, I., y Giard, D. 1982. "Wave propagation and sampling theory".
  Geophysics 47, pp. 203-236.
Tate, Anne Rosemary. 1996. "Pattern Recognition Analysis of In Vivo Magnetic Resonance
  Spectra". PhD. Thesis.
Holmes, E., Nicholls, A. W., Lindon, J. C., Ramos, S., Spraul, M., Neidig, P., Connor, S. C.,
  Connelly, J., Damment, S. J. P., Haselden, J., Nicholson, J. K. 1998. "Development of a
  model for classification of toxin-induced lesions using 1H NMR spectroscopy of urine
  combined with pattern recognition". NMR in Biomedicine 11, pp. 235-244.
Nikulin, A. E., Dolenko, B., Bezabeh, T., Somorjai, R. 1998. "Near-optimal region selection
  for feature space reduction: novel preprocessing methods for classifying MR spectra". NMR
  in Biomedicine 11, pp. 209-216.
A New Evolutionary Diagram: Application to
BTGP and Information Retrieval

J.L. Fernández

Future Technologies, BT Laboratories

Martlesham Heath, Ipswich IP5 3RE, United Kingdom
jfernand@bt-sys.bt.co.uk
http://www.labs.bt.com/people/fernanji

Abstract. A series of measurements of factors in evolutionary processes
is carried out for the application of an evolutionary algorithm, BTGP,
to the problem of information retrieval. A new evolutionary diagram is pro-
posed that allows us to study the performance of, in principle, any evo-
lutionary algorithm on a task. Taking inspiration from the HR diagram
in Astrophysics, algorithms are classified using their degree of variation
and the logarithmic ratio of exploitation versus exploration. The latter is
measured as the ratio of the inverse of the mean population fitness to the
fitness variance.

1 Introduction

Evolutionary methods have been the focus of much attention in computer sci-
ence, principally because of their potential for performing partially directed
search in very large combinatorial spaces. Evolutionary algorithms (EAs) have
the potential to balance exploration of the search space with exploitation of
useful features of that search space. However the correct balance is difficult to
achieve and places limits on what can be predicted about the algorithm's be-
haviour. In addition, EAs are often implemented in system-specific ways, making
it very difficult to compare results on different implementations.
A similar problem exists in evolutionary biology, and substantial progress
has been made in this area by choosing the proper levels of abstraction at which
to study natural systems (see, for instance [1] [2] and [3]). This suggests that
abstracting away from the comprehensive detail of EAs may generate rewards
in terms of our understanding of the evolutionary processes.
Several attempts have been made at establishing a methodology that deals
with measures of evolutionary processes in EAs (see, for instance, [4] [5]). The
justification of these approaches is, amongst others, to measure the present and
past performance of EAs, compare their current performance and predict its fu-
ture behaviour; it can also help in specifying the characteristics of the proposed
EAs, understand the reasons for observed EA performance and provide the k n o w
how to tackle fundamental problems in EAs (i.e. scaling, transferability, flexibil-
ity, evolvability).

The latter reason is based on the assumption that all EAs face fundamental
problems to do with their use for large scale applications. In general, computa-
tional EAs do not scale well from small to large problems, do not transfer well
from one problem domain to another and are not very flexible in response to
changing test problems. Biological EAs are arguably better, but we have not
been able to work out how to implement them in a feasible manner outside
their natural context. By developing and using measures on evolutionary sys-
tems we are likely to be able to quantify and learn more about how to solve
these problems.
It may be that these fundamental problems are all aspects of evolvability - the
capacity of systems to evolve - and it has been argued elsewhere (i.e. [6] [7]) that
we may be able to measure aspects of evolvability. Understanding evolvability
would yield substantial benefits in the application of EAs to real problems.
Measures of evolutionary processes in EAs have been derived from a number
of different sources: theory of animal breeding and theoretical genetics ( [6] [8] [9]),
study of natural selection ([1]), adaptive landscape theory ([10]) and ALife
modelling ([11]).
In summary, the advantages to be gained from developing measures of evolu-
tionary processes in EAs strongly suggest that research incorporating this area is
essential to the development of EAs for real-world applications. As Mitchell and
Forrest [12], discussing the relation of genetic algorithms to Artificial Life, write:
"... the formulation of macroscopic measures of evolution and adaptation, as
well as descriptions of the microscopic mechanisms by which the macroscopic
quantities emerge, is essential if artificial life is to be made into an explanatory
science ..." and "... we consider it an open problem to develop adequate criteria
and methods for evaluating artificial life systems.". Their comments still apply
strongly to the whole fields of evolutionary computation and Artificial Life and
should be acted upon.
In this paper we present one specific evolutionary measure applied to a par-
ticular EA based on Genetic Programming, BTGP, and to a specific problem
(filtering of Boolean query trees for a classification problem). The outline of the
paper is as follows: after this introduction, Section 2 will give a brief description
of BTGP while Section 3 will describe the real-world application. In Section 4
we will implement the evolutionary measure, and a new evolvability diagram will
be introduced and applied to the information retrieval task with BTGP. Finally,
Section 5 will contain the conclusions of our work and some future directions of
research.

2 The algorithm: BTGP

The genetic programming system (BTGP) maintains a population of phenotypes
(decision trees) which it operates on directly. In this sense the genotype and
phenotype can be considered to be one and the same.
After generating the initial population, the BTGP performs the genetic pro-
gramming cycle of fitness evaluation, selection of parents and reproduction with
application of the genetic operators to produce the children of the next gener-
ation. The BTGP has many configuration options (see [13]) but for the experi-
ments described in this paper the following options were used:

- " R a m p e d growth" of the initial population's trees.


- Fitness proportionate (roulette wheel) selection.
- Genetic operators: copy, crossover, mutation.

In addition to the above settings, the following p a r a m e t e r s can be experi-


mentally varied:

- Rates at which each genetic operator is applied


- M a x i m u m tree depth
- Node branching factor

R a m p e d growth means t h a t the generated trees are uniformly distributed


in depth up to the m a x i m u m tree depth. Crossover is performed by randomly
choosing nodes from each parent and exchanging them, but avoiding exchanges
which would exceed the m a x i m u m tree depth. Mutation consists of replacing a
node with a randomly grown sub-tree up to the m a x i m u m depth. Further details
regarding the B T G P and the information retrieval task are given elsewhere [13].
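A compact sketch of the initialisation and mutation operators just described, with trees as nested Python lists; the representation, node-choice probabilities and function names are assumptions, not BTGP's actual implementation:

```python
import random

FUNCTIONS = ["AND", "OR", "NAND", "NOR"]
KEYWORDS = [f"kw{i}" for i in range(16)]          # 16 leaf variables

def grow(depth, max_depth):
    """Randomly grow a Boolean tree up to max_depth (drawing the target
    depth uniformly gives the 'ramped' initial population)."""
    if depth >= max_depth or random.random() < 0.3:
        return random.choice(KEYWORDS)            # leaf: a keyword test
    return [random.choice(FUNCTIONS),
            grow(depth + 1, max_depth), grow(depth + 1, max_depth)]

def mutate(tree, max_depth, depth=0, p=0.1):
    """Replace a randomly selected node with a freshly grown subtree,
    never exceeding the maximum tree depth."""
    if random.random() < p:
        return grow(depth, max_depth)             # chosen node is replaced
    if not isinstance(tree, list):
        return tree                               # unchanged leaf
    op, left, right = tree
    return [op, mutate(left, max_depth, depth + 1, p),
            mutate(right, max_depth, depth + 1, p)]

random.seed(1)
print(mutate(grow(0, 4), max_depth=4))
```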

3 The problem: Information Retrieval

The task to which BTGP has been applied is to evolve a Boolean decision tree capable of discriminating between two document classes: those sought in a retrieval task and those which are of no interest. The data used are generated in a pre-processing step from Internet documents which have been labelled by a user as either of interest (positive) or of no interest (negative). Pre-processing consists of the extraction of a set of keywords across all the documents, and then recording for each document whether it is a positive or negative example, and whether each keyword is present or absent. The resulting data records, one per document, are then separated into training and test sets.

3.1 The Phenotype and Fitness Functions

The phenotypic representation is a Boolean decision tree. Each node of this tree is either a function node taking one of the values AND, OR, NOR, NAND, or a leaf node variable which references a particular keyword. For a given training or test case each keyword variable is instantiated to the value 1 or 0, denoting the presence or absence (respectively) of the corresponding keyword for that case. A tree which evaluates TRUE for a positive case or FALSE for a negative case has thus correctly classified that case.
The fitness function is evaluated over a set of training or test cases. It is parameterised by the following values: the number of correctly identified positives $n_{pos}$, the number of negatives falsely identified as positive $n_{neg}$, the total number of positives $N_{pos}$, and the total number of negatives $N_{neg}$. The fitness function is designed to minimise both the number of missed positives and the number of false positives:

$$f = \alpha\,\frac{N_{pos} - n_{pos}}{N_{pos}} + \beta\,\frac{n_{neg}}{N_{neg}}$$

Note that $\alpha$, $\beta$, and the function lie in the range [0, 1], with 0 being the best possible fitness and 1 the worst. The aim is therefore to minimise its value.
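A direct transcription of this fitness function, together with the Boolean tree evaluation it relies on, might look as follows (the tree encoding matches the sketch in Section 2; the default weights alpha = beta = 0.5 are our assumption, since the paper does not state the values used):

def evaluate(tree, case):
    """Evaluate a Boolean decision tree on one case (a tuple of 0/1 keyword flags)."""
    if not isinstance(tree, list):
        return bool(case[tree])                     # leaf: keyword presence
    op, args = tree[0], [evaluate(c, case) for c in tree[1:]]
    if op == "AND":  return all(args)
    if op == "OR":   return any(args)
    if op == "NAND": return not all(args)
    return not any(args)                            # NOR

def fitness(tree, cases, labels, alpha=0.5, beta=0.5):
    """f = alpha*(Npos-npos)/Npos + beta*nneg/Nneg; 0 is best, 1 is worst.
    Assumes both classes are present, as in the balanced sets described below."""
    n_pos = sum(1 for c, y in zip(cases, labels) if y and evaluate(tree, c))
    n_neg = sum(1 for c, y in zip(cases, labels) if not y and evaluate(tree, c))
    N_pos = sum(labels)
    N_neg = len(labels) - N_pos
    return alpha * (N_pos - n_pos) / N_pos + beta * n_neg / N_neg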
The data set was generated from a known decision tree, illustrated in Fig. 1. It has 16 keywords, a training set of 200 cases and a test set of 50 cases. The training and test cases were chosen randomly from the $2^{16}$ possible keyword configurations such that each set contained an equal number of positive and negative cases.

[Fig. 1. Decision tree corresponding to the data set. The root is an OR node, and the leaf keywords include: vintage, antique, collector, car, vehicle, transport, design, programming, construction, database, tutorial, beans.]

4 Evolutionary Measure

The performance of an EA is a balance between the exploration of its search space and the exploitation of features of that space, such as already-found solution regions and local minima. In an EA this can be seen as the balance maintained between the mutational variance exerted upon the population and the constraining effect of selection, which reduces this variance. We propose a ratio of parameters representative of the evolutionary process which helps to sustain the balance between exploration and exploitation. These are the inverse of the mean fitness of the population, $\bar{f}$, as an exploitation indicator, and the fitness variance, $\sigma_f^2$, as its exploration counterpart.
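A minimal sketch of these two statistics, as we read the definitions above (recall that fitness is minimised, so a low mean fitness indicates strong exploitation):

import statistics

def evolvability_point(fitnesses):
    """(exploitation, exploration) for one generation of fitness values.

    Fitness is minimised (0 = best), so the inverse of the mean fitness grows
    as the population concentrates in good regions (exploitation); the fitness
    variance measures how spread out the population is (exploration).
    """
    mean_f = statistics.fmean(fitnesses)
    exploitation = float("inf") if mean_f == 0 else 1.0 / mean_f
    return exploitation, statistics.variance(fitnesses)

Collecting such a point per generation and per mutation rate yields a diagram of the kind shown in Fig. 2 below.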

A new diagram for measuring and comparing the performance of one or several algorithms on the same task is proposed here. Taking inspiration from astrophysics, the Hertzsprung-Russell diagram [14] [15], otherwise known as the HR diagram, allows us to track the evolution of every star in the Universe in a simple two-dimensional diagram where temperature is plotted against the star's magnitude, the logarithm of its luminosity referred to a standard star, generally the Sun. In this diagram, stars with different chemical compositions and masses evolve along well-studied paths that are the result of their internal physical phenomena, guided by the laws of physics. Different temperatures alter the nuclear reactions in their interiors, while their luminosity reflects a balance between the radiation pressure trying to escape the outer layers of the star and the gravitational collapse of the latter, which increases the optical depth and thus traps the photons.
The simile with EAs appears if we consider the mutation rate to play the role of temperature (or level of agitation), while the ratio between exploitation and exploration resembles the luminosity (actually the inverse of the luminosity). Exploitation can be measured as the inverse of the mean fitness of the population, while exploration can be realised as the fitness variance. Therefore we can study the time dependence of an algorithm while changing the degree of mutation, and draw conclusions about the capabilities of the algorithm to explore and exploit.
The proposed evolutionary parameters have been measured for our algorithm (BTGP) and task (information retrieval). Some of the measures' specifications will depend upon the nature of the algorithm itself and its representation of solutions (Boolean trees), where phenotype and genotype are the same. Furthermore, some will be influenced by the definition of fitness and the sampling of the fitness landscape derived from the task.
In the remainder of this section we present some preliminary descriptive results and discuss them. For testing purposes we have fixed a few BTGP parameters: 100 individuals, 100 generations, a maximum tree depth of 4 levels, roulette wheel selection (unless otherwise indicated), ramped half-and-half tree generation, and a branching factor between 2 and 4. The data set and fitness function used were discussed in Section 3.
With these settings, BTGP was executed for 13 different values of the mutation rate: 0, 0.01, 0.05, and then 10 values evenly spaced between 0.1 and 1.
Mean fitness, $\bar{f}$, slowly decreases with generations for rate 0.2, but as soon as mutation is increased to 0.5 or 0.8, $\bar{f}$ oscillates around a constant value and does not decrease further. For smaller values of the mutation rate (e.g. $Mut_{rate} = 0.05$), the mean fitness decreases even further.
Fitness variance, $\sigma_f^2$, does not change significantly in the course of the run for mutation rates $Mut_{rate} > 0.1$. Nevertheless, when the rate is decreased to small values (e.g. $Mut_{rate} = 0.05$), the variance seems to increase steadily as generations go by.
Statistics on the degrees of exploitation and exploration and different muta-
tion rates were gathered for BTGP. The result is the diagram shown in Fig. 2.
[Fig. 2. Evolvability diagram for BTGP and information retrieval (exploitation/exploration ratio plotted against the mutation rate, 0.0 to 1.0).]

Small mutation rates produce a wider range of exploitation-versus-exploration ratios as the run proceeds; the behaviour is generally one of high exploitation and very little exploration. This situation is repeated to a lesser degree at very high mutation rates. It is in the region $0.4 < Mut_{rate} < 0.6$ that we get less exploitation and more exploration, and thus a better balance for the ratio. Our particular requirements on the performance of the algorithm may lead us to choose different degrees of mutation, or to change the rate as the algorithm progresses. We consider Fig. 2 an example of an evolvability diagram.

5 Conclusions and future work

In this paper we have proposed an initial evolvability diagram that allows us to track the progress of an algorithm under different rates of mutation, monitoring the degrees of exploitation and exploration of the solution landscape, much as we can monitor the evolutionary paths of stars in the HR diagram; the mutation rate plays the role of temperature, and the exploit/explore ratio resembles the balance between radiation pressure and gravitational collapse. We do not intend to carry this simile to a literal extreme, but we believe that cross-fertilisation between well-established scientific disciplines can bring, and has in the past brought, advances in scientific research.

Directions for future work could include the extension of these measures and the evolvability diagram to other algorithms applied to the same and different problems. In particular, the exploration of algorithms with a non-trivial mapping between genotype and phenotype could lead us to establish some conclusions on the suitability of algorithms to tasks and to compare different algorithms' performances, linking these to evolvability.

References

1. Endler, J.A., "Natural Selection in the Wild", Princeton, N.J., Princeton University Press, 1986.
2. Hofbauer, J. and Sigmund, K., "The Theory of Evolution and Dynamical Systems",
Cambridge, Cambridge University Press, 1988.
3. Roff, D., "Evolutionary Quantitative Genetics", London, Chapman and Hall, 1998.
4. Fernández-Villacañas, J.L., Marrow, P., Shackleton, M., submitted to GECCO'99, 1999.
5. Bedau, M.A., Snyder, E., Brown, C.T. and Packard, N.H., A comparison of evo-
lutionary activity in artificial evolving systems and in the biosphere, in "Fourth
European Conference on Artificial Life", P. Husbands and I. Harvey (Eds.), pp.
125-134, Cambridge, MA, MIT Press, 1997.
6. Altenberg, L., The evolution of evolvability in genetic programming, in "Advances in Genetic Programming", K.E. Kinnear Jr. (Ed.), pp. 47-74, Cambridge, MA, MIT Press, 1994.
7. Wagner, G.P., Altenberg, L., Complex adaptations and the evolution of evolvabil-
ity, "Evolution" 50, 967-976, 1996.
8. Falconer, D.S., "Introduction to Quantitative Genetics", 3rd ed., Harlow, Longman, 1994.
9. Mühlenbein, H., The equation for the response to selection and its use for prediction, "Evolutionary Computation" 5, 303-346, 1998.
10. Hordijk, W., A measure of landscapes, "Evolutionary Computation" 4, 335-360,
1997.
11. Bedau, M.A. and Packard, N.H., Measurement of evolutionary activity, teleology and life, "Artificial Life II", C.G. Langton, C. Taylor, J.D. Farmer and S. Rasmussen (Eds.), pp. 431-461, Redwood City, CA, Addison-Wesley, 1991.
12. Mitchell, M. and Forrest, S., Genetic algorithms and Artificial Life, "Artificial Life"
1, 267-289, 1995.
13. Fernández-Villacañas, J.L. and Exell, J., BTGP and information retrieval, in "Proceedings of the Second International Conference ACEDC'96", PEDC, University of Plymouth, 1996.
14. Hertzsprung, E., Ueber die Verwendung photographischer effektiver Wellenlaengen zur Bestimmung von Farbenaequivalenten, "Publikationen des Astrophysikalischen Observatoriums zu Potsdam", 22. Bd., 1. Nr. 63, 1911.
15. Russell, H.N., Nature, no. 93, 252, 1914.
Artificial Neural Networks as Useful Tools for the Optimization of the Relative Offset between Two Consecutive Sets of Traffic Lights*

Secundino López, Pedro Hernández, Alejandro Hernández, and Marco García

Artificial Intelligence Centre of the University of Oviedo at Gijón (www.aic.uniovi.es)
Campus de Viesques s/n, 33271 Gijón, Spain
{secun, pedro, alex, marco}@aic.uniovi.es

Abstract. In this paper we present the most important results of our experimentation with artificial neural networks for correcting the relative offset error between two consecutive sets of traffic lights. Neural networks allow us to estimate the length of the queue of vehicles stopped in front of the stop line waiting for the red light to change to green. We will show that this length is an essential parameter for solving the offset problem. Training data and test data for the ANN are provided by a simulator specifically built for this purpose. The performance of the simulator is tested with real data. An algorithm to improve the offset, based on the queue length provided by the ANN, is proposed. Finally, it is shown that its proposals provide a path to the optimal offset.

1 Introduction

One of the most difficult problems in urban traffic control is to decide the optimal
offset between two consecutive sets of traffic lights. To illustrate the basic concepts of
the problem, suppose that A and B are two consecutive traffic lights and that the
vehicles drive from B to A.

[Fig. 1. A sketch of the essential characteristics of the problem: signals A and B, the detector, and the approaching vehicles. Our aim is to adjust the offset between signals A and B in such a way that the vehicle BFirst reaches the vehicle ALast just when the latter reaches the stop line at A.]

* This paper is based upon data provided by the Traffic Control Department of the city of Gijón (Spain). The authors would like to thank the Gijón City Council for its helpful collaboration.

Let QA and QB be the queues of vehicles stopped in front of the signals A and B. Let ALast and BFirst be the last vehicle in queue QA and the first one in queue QB, respectively. The difference between the instants at which the green phases of signals A and B start is called the relative offset between A and B. The problem is how to coordinate A and B in such a way that the vehicle BFirst reaches the vehicle ALast just as the latter is crossing the stop line of signal A. See Figure 1.
The early researchers in urban traffic control tried to solve the problem by searching for an offset value near the optimum, computed for an average queue length and a statistically estimated rate of queue output. These solutions were obtained independently of the real length of the queue QA. This is not surprising, since computing this length is a hard task. Neural networks have proved to be a powerful tool for computing it.
To improve the offset we need extra information about the traffic behavior in the network. An induction loop detector buried in the road provides this information. The detector is able to report the number of vehicles crossing over it every 5 seconds and the length of time it was driven over. These data are cyclically recorded in two patterns called the flow and occupancy profiles (see Fig. 2). The detector is assumed to be placed upstream, close to the stop line.

[Fig. 2. A pair of flow and occupancy profiles represented as integer arrays, spanning the green and red phases. Each component corresponds to data of flow and occupancy measured over 5 seconds, respectively. Some of the first rectangles included in the green window could correspond to vehicles starting from the queue.]

Obviously, the joint observation of both profiles should provide a piece of information about the instantaneous speed of the vehicles as they cross the loop. Taking into account that the loop length is known and that the average length of the vehicles can be statistically estimated, the flow/occupancy quotient relates space and time, and is therefore an approximation of the speed of the vehicles (see the sketch after this list). Two guesses are made:
• There is a slight variation in speed between the vehicles leaving a queue and those circulating freely. That is, it may be possible to split each profile into two different blocks: one associated with the vehicles that were stopped waiting for the green light, the other with the vehicles circulating freely.
• If the splitting point between both blocks exists, then we can compute it through a neural network; it will be denoted by t0. When there is no queue, or the queue is not cleared during the green phase, we say that t0 does not exist.
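As a rough illustration of the flow/occupancy quotient (a sketch under our own assumptions: the 2 m loop length and 4 m average vehicle length are illustrative values, not taken from the paper):

def speed_estimates(flow, occupancy, loop_len=2.0, veh_len=4.0):
    """Approximate speed (m/s) per 5-second slot from a flow/occupancy profile.

    Each vehicle occupies the loop while travelling (veh_len + loop_len) metres,
    so the total occupied time is roughly n * (veh_len + loop_len) / v, giving
    v ~ flow * (veh_len + loop_len) / occupancy.
    """
    return [f * (veh_len + loop_len) / occ if occ > 0 else None
            for f, occ in zip(flow, occupancy)]

A drop in this estimate at the start of the green phase would mark the block of vehicles leaving the queue.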
To train the neural network we need a wide collection of pairs of profiles representing all the possible traffic settings. We also need, for each profile, the instant, if it exists, corresponding to the value t0. Unfortunately, the task of collecting this information directly from real traffic networks is very complicated, since we would have to force some situations close to congestion, constituting a critical risk, and there are other special situations quite difficult to obtain because the traffic demand would have to be modified.
To eliminate these difficulties we have built a realistic traffic simulator, able to provide accurate flow and occupancy profiles and capable of managing the queue output efficiently. Thus we can generate many profiles under many different traffic situations.
Since our simulator manages the position of the vehicles and their state (stationary or driving) every hundredth of a second, the part of the profile corresponding to vehicles in the queue is known, as well as the value t0. With these data we have trained an ANN. In this way an algorithm has been generated to decide whether t0 exists and to determine its value in the positive case. These results were successfully tested with a collection of real data.
Once the problems of generating reliable profiles and of obtaining the value t0 have been solved, it was necessary to compute the optimal offset in terms of the value t0 and to build an algorithm to implant it in the network. However, a question arises immediately: how can we reach the optimal offset from the current offset? The problem of searching for a set of decisions to optimize the offset (that is, what kind of actions could or should be taken) is not easy to solve. In most cases it will not be possible to correct the offset deviation in only one step. Moreover, each action taken to move the offset can modify the previously calculated value t0. Thus, before the process of optimizing the offset concludes, the value t0 may already have been modified. An efficient method to reach the optimal offset is proposed; it is based on the supposition that there is always green to spare, that is, the green supplied is long enough to satisfy the demand of vehicles.
Finally, we present the results of a collection of trials carried out on the simulator in many different situations, as well as some important conclusions.

2 The Simulator

When designing the simulator we took into account two main characteristics. On the one hand, it was necessary to move the vehicles in a realistic way in order to obtain reliable profiles, so we needed to model the behavior of two different processes: the queue output (ahead) and the queue input (behind). On the other hand, it was also necessary to be able to implant the actions required to optimize the offset.
To solve the first problem (queue output) we designed a street experiment consisting of the observation of queues (more specifically, of the vehicles leaving the queue) under different traffic conditions and in different lanes, always containing at least ten vehicles. In this way we obtained a sample of 150 observations representing the behavior of the vehicles when they start leaving a queue. A recording was made for each vehicle. It contains the vehicle's position in the queue as well as two noticeable instants, starting and reaching the stop line, taking as reference the instant at which the green phase begins.
With these data, and assuming that a vehicle is separated from its neighbors by an average distance of 1.5 meters and that vehicles undergo uniformly accelerated motion, the statistical distributions of two random variables were obtained for each vehicle in the queue, allowing us to simulate the queue output:
• Starting instant, measured from the beginning of the green phase. A simple examination of the real data shows that the starting instant of the first vehicle in the queue approaches a uniform distribution with a mean of 1.97 and a standard deviation of 1.056, and that the difference between the starting instants of two consecutive vehicles also approaches a uniform distribution, with a mean of 1.44 and a standard deviation of 0.56.
• Vehicle acceleration. Considering that the vehicles start from a stationary state, the equation for uniformly accelerated motion reduces to $e = a t^2/2$. Since we know the instant at which every vehicle crossed the stop line as well as its position in the queue, and considering this position as a measure of the distance to the stop line, we can deduce the uniform acceleration associated with each vehicle. In this way we are able to estimate the acceleration of each vehicle depending only on its position within the queue.

Table 1. Average accelerations associated with the sample data

Place in queue   1     2     3     4     5     6     7     8     9
Acceleration     5.95  2.97  2.20  1.73  1.54  1.31  1.12  1.02  0.95

These data have been submitted to the method Curve Estimation of SPSS [10] to fit
an inverse model.

Table 2. Results of the Curve Estimation method to fit a model

Independent: p (place in queue)


Dependent         Mth  Rsq   d.f.  F        Sigf  b0     b1
a (Acceleration)  INV  .998  7     4105.91  .000  .3439  5.5447

Table 2 allows us to state that the equation

a = 0.3439 + 5.5447/p    (1)

(where a is the acceleration and p the place in the queue; a is the dependent variable of the inverse fit, and the model closely reproduces the averages in Table 1) is a suitable model to predict the value of the acceleration from the queue position.
So, our simulator takes the vehicles out of the queue with the following criterion: the first vehicle starts up ξ seconds after the beginning of the green phase, and each remaining one with a delay of η seconds with respect to the previous vehicle. ξ and η are randomly chosen from uniform distributions having as mean and standard deviation the empirical values mentioned above, that is, Uniform[0.14, 3.8] and Uniform[0.44, 2.4], respectively. Moreover, each vehicle starts up with a uniform acceleration that depends on its position in the queue, according to equation (1). Thus, moving the vehicles according to both criteria, we can expect the simulator to provide an accurate estimation of the instant at which each of them crosses the stop line. Once a battery of runs was carried out, a set of sequences corresponding to the virtual instants at which the vehicles in the queue passed the stop line was collected. These data were compared with those obtained in the real experiment
using tests of means comparison, the Friedman two-way ANOVA, Kendall's coefficient of concordance and the Wilcoxon test, it being proven at the .95 confidence level that both samples come from the same population in all the cases and for all the positions. The next table shows the results of the Friedman two-way ANOVA test.

Table 3. A set of results of the Friedman Two-Way Anova test, where Pi is the position in the
queue of the vehicle i.
Position in queue P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
Significance 0.1 0.8 0.8 0.6 0.4 1 1 0.6 0.8 0.4

In this way the behavior of the simulator with regard to the queue output process has been validated.
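As an illustration, the queue-output model just described fits in a few lines (a sketch under the stated distributions and equation (1); the 1.5 m spacing used as a distance model and the function name are our simplifications):

import random

def queue_output_times(n_vehicles, spacing=1.5):
    """Simulated stop-line crossing times (s from green onset) for a queue."""
    times, start = [], 0.0
    for p in range(1, n_vehicles + 1):
        # starting instant: xi for the first vehicle, eta after the predecessor
        delay = random.uniform(0.14, 3.8) if p == 1 else random.uniform(0.44, 2.4)
        start = delay if p == 1 else start + delay
        a = 0.3439 + 5.5447 / p              # equation (1): acceleration by place
        dist = p * spacing                   # rough distance to the stop line
        times.append(start + (2 * dist / a) ** 0.5)   # e = a t^2/2  =>  t = sqrt(2e/a)
    return times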
To simulate the queue-input model we have supposed that every vehicle is driven at the average speed until the distance to the previous one becomes less than 30 meters. At that moment it begins to brake until its speed equals that of the previous vehicle. To compute the deceleration and the time spent braking, we have again used the equations of uniformly accelerated motion:
$v_f = v_0 + a t$ and $e = v_0 t + a t^2/2$    (2)

The variables $v_f$ (speed of the previous vehicle), $v_0$ (speed of the objective vehicle) and $e$ (30 meters) are known, and from these we can compute $a$ and $t$.
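Solving the pair (2) explicitly (a small step the paper leaves implicit): substituting $a = (v_f - v_0)/t$ from the first equation into the second gives $e = t\,(v_0 + v_f)/2$, hence $t = 2e/(v_0 + v_f)$ and $a = (v_f - v_0)/t$. For example, with $v_0 = 14$ m/s, $v_f = 10$ m/s and $e = 30$ m, this yields $t = 2.5$ s and $a = -1.6$ m/s².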
To complete the simulator, it had to be equipped with the possibility of modifying the signal timing. We have therefore implemented and included in the simulator a virtual traffic regulator capable of receiving control messages and putting them into action on the street.

3 The Artificial Neural Network

Our approach to finding the value t0 is based on the idea that the speed of the vehicles in the queue is slightly different from that of those circulating without any restriction. So we go through the profiles supplied by the simulator searching for that instant, if it exists. Unfortunately, this is a very difficult decision if only one measuring instant is considered; to detect this difference it is necessary to consider a window of three consecutive measuring instants.
Since the detector is placed very close to the stop line, the queue usually extends beyond the detector. In fact, there are three different possibilities depending on the queue length: if there is no queue, or if clearing it needs an output time longer than the green phase, then the instant we are searching for does not exist; otherwise it must lie within the green phase. In this way the problem space can be restricted to the instants associated with the green phase, and a collection of all the sets comprising three consecutive pairs of flow and occupancy values is generated. Attached to every set, we record whether the intermediate instant corresponds to the passage over the detector of the last vehicle in the queue or not. This process is carried out on each profile.
Data provided by the simulator are used to train an artificial neural network. As in [4], our aim is to compute a function like this:
last_vehicle_in_queue?(measure_instant): truth_value.
That is to say, the problem is to decide which of the measuring instants corresponds to the passage over the stop line of the last vehicle in the queue (the t0 instant). Moreover, given a pair of flow and occupancy profiles where the queue extended beyond the detector, we can estimate the queue by testing all possible t0 instants within the green phase with the function last_vehicle_in_queue? until we obtain a positive answer.
A function where_is_the_queue_end(flow_profile, occupancy_profile): Integer can easily be implemented from the function last_vehicle_in_queue?, allowing us to compute the measuring instant at which the queue ends.
Finally, the function compute_queue(flow_profile, occupancy_profile): Integer provides the number of vehicles in the queue; it can be obtained by adding up the flow profile from the start of the green phase to the where_is_the_queue_end(flow_profile, occupancy_profile) instant.
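A minimal sketch of how these three functions might compose (the classifier call stands in for the trained ANN described below; the parameter names are ours):

def where_is_the_queue_end(flow, occupancy, green_slots, is_last_vehicle):
    """Scan the green phase with a 3-instant window; return the t0 index or None."""
    for t in range(1, green_slots - 1):
        window = (flow[t-1], occupancy[t-1], flow[t], occupancy[t],
                  flow[t+1], occupancy[t+1])      # the ANN's 6 inputs
        if is_last_vehicle(window):               # trained classifier
            return t
    return None                                   # no queue, or not cleared in green

def compute_queue(flow, occupancy, green_slots, is_last_vehicle):
    t0 = where_is_the_queue_end(flow, occupancy, green_slots, is_last_vehicle)
    return None if t0 is None else sum(flow[:t0 + 1])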
320 cycles were run in the simulator, changing the number of vehicles per cycle every 10 cycles as well as the shape of the input, covering a wide range of traffic conditions. In all the cases there was a queue. These data provided us with 2890 training instances.
The ANN used for this purpose was a simple 3-layered feedforward network with 6 input units, 1 output unit and 15 hidden units. The training was carried out with the SNNS [11] system. The method used was standard backpropagation with a learning rate of 0.2 and a maximum allowed output error of 0.05.
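An equivalent setup can be reproduced outside SNNS; for instance, a sketch with scikit-learn (our choice of library, not the authors'; only the 6-15-1 topology and the 0.2 learning rate come from the text):

from sklearn.neural_network import MLPClassifier

# X: (n_samples, 6) windows of three consecutive (flow, occupancy) pairs
# y: 1 if the middle instant is the last vehicle of the queue, else 0
clf = MLPClassifier(hidden_layer_sizes=(15,),   # 6 inputs -> 15 hidden -> 1 output
                    activation='logistic',      # sigmoid units, classic backprop style
                    solver='sgd',
                    learning_rate_init=0.2)
# clf.fit(X_train, y_train); clf.predict(X_test)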
In 96% of the cycles the t0 instant was detected. In the remaining 4% the queue end was not found, and those cycles were therefore misclassified. Where t0 was detected, the estimation was exactly right 53% of the time.
To be precise, the estimation error is computed in terms of measuring instants (the index of the array recording the profile) and also in terms of the number of vehicles making up the queue. Furthermore, the cases where the estimated value t0 is greater than the real t0 are separated from those in which the estimate is lower than the real value. This information about the ANN performance is shown in Table 4.
It is important to emphasize that in 38% of the cases the estimated value was wrong by exactly one measuring instant. Overall, in only 5% of the cases was the error greater than 1 measuring instant.

Table 4. Other important results derived from the ANN

Cases                 Criterion to evaluate the error     Average error value
Cases with            On the estimated instant            0.2
negative error        On the estimated queue length       8.2%
Cases with            On the estimated instant            0.4
positive error        On the estimated queue length       15.1%
Total number          On the estimated instant            0.3
of cases              On the estimated queue length       11.6%

Even when a profile was misclassified, or when the estimation error was greater than 1, the traffic conditions were close to saturation. That is, the saturation level (the percentage of the green phase required to clear the queue) is close to 100%, and therefore the error becomes less important in terms of traffic control, because it is very difficult to improve the offset in this situation, as noted above. This fact is also confirmed by two other parameters: the spare time of the green phase and the time with maximum occupancy. See Table 5.

Table 5. Relation between the error of the estimated value t0 and the saturation level of the green phase

Error of estimated t0      Average time of the cycle    Average time of the cycle with     Saturation level
(in measuring instants     with flow and occupancy      flow equal to 0 and occupancy      of the green
of the profile)            equal to 0 (spare time)      equal to 5 (detector               phase
                                                        continuously stepped on)
0 or 1                     14%                          41%                                80%
>= 2                       15%                          35%                                86%
Misclassification          3%                           46%                                93%

4 The algorithm to optimize the relative offset

At this point we are able to compute the value t0 through an ANN. To complete our task we need to compute the optimal relative offset and to design an algorithmic solution to reach it. To do this it is necessary to introduce some previous concepts. Figure 3 shows a schematic representation of the problem, describing the traffic evolution throughout a cycle.

[Fig. 3. Evolution of traffic (which has a cyclical and therefore repetitive character) throughout a cycle, with the queue, the platoon, and the green and red phases marked. To simplify the notation, the beginning of the cycle and the beginning of the green phase are made to coincide; this restriction entails no loss of generality in the system representation. The location of the vehicles at the beginning of the cycle is represented on the horizontal axis. Oblique lines represent the instants at which the vehicles reach the stop line, and their slope corresponds to the reciprocal of the traffic speed. The queue of vehicles formed in front of the signal has an output speed lower than the free circulation speed, so its line has a larger slope than the remaining ones. The origin represents both the beginning of the cycle (vertical axis) and the location of the controlled signal (horizontal axis).]

First of all, we need to define exactly the term optimal relative offset. Let D be the location, at the beginning of the cycle, of the first vehicle coming from signal B. Then tp is the instant at which such a vehicle would reach signal A on the hypothesis that it circulates along the street without any restriction; tp would be the optimal relative offset between the signals A and B if there were no queue. The instant at which the last vehicle in the queue at signal A arrives at the stop line of A is called the output time of the queue; it was already denoted t0 in the Introduction. According to the situation represented in Fig. 1, we can see that the platoon located at D reaches signal A at an instant tp later than t0. Consequently there exists an offset deviation Δ = tp − t0. This deviation can also be interpreted in the following sense: the head of the platoon D should have been at position P at the beginning of the cycle. We say that the relative offset is optimal when Δ = 0.
When Δ ≠ 0, our objective is to design and implement actions that reduce the magnitude of Δ until it becomes 0. From this viewpoint, in the situation of Fig. 1 it seems reasonable to delay the opening of the green phase at signal A by Δ seconds, or to advance it by Δ seconds at signal B. Both solutions are symmetrical, and from now on we suppose that actions to modify the offset are taken only at signal A. Unfortunately, the decision to delay the beginning of the green phase by Δ seconds implies a temporary increase of the cycle length. This fact constitutes a critical point for any offset strategy: as a first consequence, the junction could become jammed; moreover, the delay Δ can become non-optimal. Let us examine both risks.
• Possibility of congesting the junction. To describe this circumstance it is necessary to introduce some further terms. Let us call Supply the current length of the green phase, and Demand the length of the green phase required to clear a queue containing all the vehicles entering the lane each cycle. In our approach the restriction Supply > Demand is assumed for all cycles (the non-saturation hypothesis); otherwise there would be no strategy able to improve the offset, since the queue would grow continuously until it reached the saturation point of the green phase, with no possibility of reducing it. A brief analysis of the problem of implanting a new offset shows that a modification of Δ seconds in the cycle might be too large to guarantee the restriction Supply > Demand. In fact, the violation of the inequality could be temporary, in which case the non-saturation hypothesis would be fulfilled again some cycles later. But this possibility is not fully satisfactory: although the profiles provide enough information to detect saturation, it is difficult to check whether this saturation is going to be temporary or whether it hides a new state of permanent saturation, unless we decide to wait (maybe indefinitely) until a situation of non-saturation is detected again. Obviously this risk is critical.
• Possibility of generating an infinite loop. In any case, the action of modifying the cycle length could temporarily modify the relationship between supply and demand, which would have consequences on the queue length and therefore on the value t0. A paradoxical effect could then take place: by the time the solution is reached, it has already stopped being optimal, so we find ourselves pursuing a moving objective and must restart the process with a new goal. This process could continue indefinitely. Our only chance consists of achieving a convergent displacement of the goal.
In conclusion, any offset strategy should keep the intersection in a non-saturated state and should converge to the optimum. To this end we have implemented an algorithm to compute the maximum offset variation (upper-bounded by Δ) compatible with the non-saturation state. Thus, any feasible decision has to satisfy the inequality t0 < GL (see Fig. 3), where GL is the length of the green phase. In this way, the offset also converges to the optimum.
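Our reading of this per-cycle decision rule can be sketched as follows (a sketch only: the function name, the sign convention and the safety margin are our assumptions, not the authors' implementation):

def offset_step(t0, tp, green_len, margin=1.0):
    """One cycle of offset correction at signal A.

    The desired change is the deviation delta = tp - t0, but the applied
    change is clipped so that the non-saturation condition t0 < green_len
    keeps holding while the cycle is temporarily stretched.
    """
    delta = tp - t0                          # offset deviation
    if delta <= 0:
        return delta                         # advance the green onset (Fig. 4 case)
    headroom = max(green_len - t0 - margin, 0.0)
    return min(delta, headroom)              # delay the green onset (Fig. 3 case)

Repeated over successive cycles, the clipped steps produce the convergent displacement of the goal described above.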
The simulator is a useful tool for evaluating this strategy. To test the efficiency of the algorithm we fed the simulator with a wide collection of traffic situations representing many different relative offset problems. In 92% of the cases the offset error was reduced to below 5 seconds in at most 4 cycles. Moreover, when the offset error could not be corrected, the queue was too long (occupying almost the whole green phase), and then the offset loses importance in terms of traffic control.
Figure 3 shows a situation where the offset is not optimal. As stated above, the offset can be optimized by increasing the cycle length. Figure 4 describes another non-optimal situation, requiring a decrease of the cycle to correct the offset. The action is now the complete opposite, but our algorithm maintains its effectiveness, because its performance is independent of the action as long as the non-saturation restriction is continuously satisfied.
Although the scope of this paper is more restricted, it is important to point out two other considerations. On the one hand, either of the situations described in Figures 3 and 4 could be optimized by advancing or delaying the beginning of the green phase; this fact is due to the recurrent behavior of traffic when it is regulated by traffic lights. On the other hand, to increase or decrease the cycle it is necessary to modify the length of the green and/or red phases. Nevertheless, the consequences of the actions are not the same, depending on the order in which the phases are considered. All the possibilities were included in the algorithm; the results were always analogous.

[Fig. 4. Another typical situation with a non-optimal offset. The offset deviation is now negative.]

5 Conclusions

In [4] we had proved that the value of t0 could be accurately estimated through an artificial neural network in some particular situations depending on the location of the detector in the lane. Now, with the aid of the simulator, we can make reliable estimations of the value t0 in any case, with errors of less than 11.6% of the queue.

To compute queues, some traffic controllers ([7], [8], [9]) use accumulative algorithms based on the idea that the queue length in a cycle t+1 can be obtained by adding to the queue in cycle t the estimated number of vehicles entering the lane in cycle t+1, minus the number of vehicles leaving the lane during the green phase of cycle t+1. This method may sometimes fail, since these parameters are difficult to compute accurately, and the error in the estimation of the queue length could then grow indefinitely. Our approach avoids this risk, because in every cycle the value t0 is estimated independently of the former cycle.
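The contrast fits in a few lines (a sketch; 'entered' and 'left' stand for the per-cycle estimates used by the controllers cited above):

def accumulative_queue(q_prev, entered, left):
    """Recurrence used by the controllers in [7]-[9]: any bias in the
    per-cycle estimates 'entered' and 'left' accumulates over time."""
    return q_prev + entered - left

# In contrast, our estimate is recomputed from the current cycle alone:
#   q = compute_queue(flow, occupancy, ...)   # see Section 3
# so an error made in one cycle cannot propagate to the next.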
Additionally, we have built an algorithm to correct the relative offset error that only needs the value t0 as input. In this way, the actions required to correct the error become independent both of the actions taken in the former cycle and of their consequences. The results of this strategy were completely successful.

6 References

[1] Bahamonde, A.; López-García, S.; Hernández-Arauzo, P.; Bilbao, A.; Vela, C.R.: ITACA: An Intelligent Urban Traffic Controller. Proceedings of IFAC Symposium on Intelligent Components and Instruments for Control Applications, SICICA'92. Málaga (1992) 787-792
[2] Bell, M.; Scemama, G.; Ibbetson, L.: CLAIRE: an expert system for congestion management. Proceedings of the Drive Conference. Brussels (1991)
[3] Forasté, B.; Scemama, G.: Surveillance and congested traffic control in Paris by expert system. Proceedings of 2nd International Conference on Road Traffic Control. London (1986) 333-337
[4] Hernández-Arauzo, P.; López-García, S.; Bahamonde, A.: Artificial Neural Networks for the computation of traffic queues. Biological and Artificial Computation: From Neuroscience to Technology. LNCS, Vol. 1240. Springer-Verlag, Berlin (1997) 1288-1297
[5] Hernández-Arauzo, P.; Bahamonde, A.; López-García, S.: Sobre la calculabilidad del tiempo de desalojo de una cola de vehículos [On the computability of the clearing time of a queue of vehicles]. Proceedings of VI Conferencia de la Asociación Española para la Inteligencia Artificial, CAEPIA-95. Alicante, Spain (1995) 449-458
[6] Hernández-Arauzo, P.: Traffic queues computation. A virtual problems model. Ph.D. dissertation. Universidad de Oviedo at Gijón (1996) p. 104 + ii
[7] Hunt, P.B.; Robertson, D.I.; Bretherton, R.D.; Winton, R.I.: SCOOT: a traffic responsive method of coordinating signals. TRRL Report LR1014, Transport and Road Research Laboratory. Crowthorne (1981)
[8] Institute of Transportation Engineers Australian section: Management and Operation of
Traffic signals in Melbourne. Technical report. Melbourne. (1985)
[9] Lowrie, P.R.: The Sydney co-ordinated adaptive traffic system. Principles, Methodology
and algorithms. Proceedings of the International Conference on Road Traffic Signalling.
London. (1982). 67-70
[10] SPSS Inc.: SPSS-X User's Guide. McGraw-Hill, New York (1983)
[11] Zell, A. et al.: SNNS: Stuttgart Neural Network Simulator. User Manual, Version 4.1. Institute for Parallel and Distributed High Performance Systems. Technical Report No. 6/95 (1995)
ASGCS: A New Self-Organizing Network for Automatic Selection of Feature Variables

J. Ruiz-del-Solar, D. Kottow
Department of Electrical Engineering
Universidad de Chile
Casilla 412-3, Santiago, CHILE
Ph.: +56-2-6784207 / Fax: +56-2-6953881
E-mail: j.ruizdelsolar@computer.org

Abstract

The automatic selection of invariant feature variables is very important in pattern recognition systems. Recently, neural models have begun to be employed for this task. Among other models, the ASSOM stands out because of its simplicity and biological plausibility. However, the main drawback of applying the ASSOM in image processing systems is that a priori information is necessary to choose a suitable network size and topology in advance. The main purpose of this article is to present the Adaptive-Subspace Growing Cell Structures (ASGCS) network, a further improvement of the ASSOM that overcomes its main drawbacks. The ASGCS network introduces some GCS (Growing Cell Structures) concepts into the ASSOM model. The ASGCS network is described and some examples of automatic Gabor-like feature filter generation are given.

Keywords: Adaptive-Subspace Self-Organizing Map (ASSOM); Growing Cell Structures (GCS); Adaptive-Subspace Growing Cell Structures (ASGCS); Gabor filters; Automatic feature extraction

1. Introduction

A fundamental task in almost any pattern recognition system is the extraction of invariant features, which are then used to perform a classification process. Every problem needs a careful selection of feature variables, which so far is mostly done by hand. Neural networks have been used successfully as classifiers for a long time; only recently have neural models begun to be employed for the automatic selection of feature variables.
In the visual system of higher mammals the extraction of invariant features is carried out by the simple and complex cells of the primary visual cortex [Wilson et al., 1990]. The shape of the receptive fields of these cells and their organization are the result of unsupervised learning during the development of the visual system in the first few months of life [Van Sluyters et al., 1990]. This example-based learning is performed through the act of repeatedly seeing different real-world scenes, which produces activity-dependent synaptic modification. The shape and organization of the receptive fields emerge gradually by means of the refinement of an initially diffuse set of connections

[Van Sluyters et al., 1990]. The receptive fields of these cells can be seen as feature detectors, which are then modeled as Gabor functions [Daugman, 1980]. These functions have often been used as filters in technical systems.
In this context, it seems natural to follow this example-based learning strategy to automatically select the feature variables or, in other words, to automatically generate the invariant feature detectors (Gabor-like filters). Different approaches have been used to generate this kind of detector using neural models [Kohonen, 1995a; Sanger, 1989; Sirosh, 1995]. Among them, the adaptive-subspace SOM (ASSOM), proposed by Kohonen, stands out because of its simplicity and biological plausibility. TEXSOM, the first image processing architecture based on the ASSOM model, was recently proposed [Ruiz-del-Solar and Köppen, 1996 and 1997; Ruiz-del-Solar, 1998]. This architecture is suitable for performing texture segmentation and defect identification on textured images.
The main drawback of applying the ASSOM model in image processing systems is that a priori information is necessary to choose a suitable network size (the number of feature variables, i.e. the number of Gabor-like filters) and topology in advance. Moreover, in some cases the lack of flexibility in the selection of the network topology (rectangular or hexagonal grids) makes it very difficult to cover some areas of the (two-dimensional) frequency domain with the filters.
As a first improvement of the ASSOM, the Supervised ASSOM (SASSOM), proposed in [Ruiz-del-Solar and Köppen, 1996], automatically sets the number of neurons (or filters) in the network equal to the number of classes under consideration in the classification process. However, the lack of flexibility in the selection of the network topology remains an important drawback.
On the other hand, the Growing Cell Structures (GCS) network, proposed in [Fritzke, 1994], improves existing SOM models by selecting the network size and topology automatically from the input data. GCS corresponds to a self-organizing network that grows until a performance criterion is met.
The main purpose of this article is to present the Adaptive-Subspace GCS (ASGCS) network, which corresponds to a further improvement of the ASSOM. The ASGCS network introduces some GCS concepts into the ASSOM model. These new concepts make it possible to automatically select the number of feature variables (the number of filters or neurons in the network) and the topology (not only rectangular or hexagonal grids) of the network. The article is organized as follows. The ASSOM model is explained in Section 2. The proposed ASGCS network is presented in Section 3. The generation of Gabor-like feature filters using the ASGCS and the ASSOM models is shown and compared in Section 4. Finally, in Section 5, a summary of this work and some conclusions and projections are given.

2. The ASSOM Model

The Adaptive-Subspace Self-Organizing Map (ASSOM) corresponds to a further development of the SOM architecture [Kohonen, 1995a] which allows one to generate invariant-feature detectors. In this network, a neuron is not described by a single parametric reference vector, but by basis vectors that span a linear subspace. The comparison of the orthogonal projections of every input vector onto the different subspaces is used as the matching criterion by the network. If one wants these subspaces to correspond to invariant-feature detectors, one must define an episode (a group of vectors) in the training data, and then locate a representative winner for this episode. The training data are made of randomly displaced input patterns. The generation of the input vectors belonging to an episode differs depending on whether translation-, rotation-, or scale-invariant feature detectors are to be obtained [Kohonen, 1995a, 1995b, 1996]. In either case, the learning rule of the ASSOM architecture is given by [Kohonen, 1995b]:

1. Locating the representative winner c, i.e. the neuron in whose subspace the projected "energy" is maximum:

$$c = \arg\max_i \Big\{ \sum_{t_p \in S} \|\hat{x}^{(i)}(t_p)\|^2 \Big\} \qquad (1)$$

with $\hat{x}^{(i)}(t_p)$ being the orthogonal projection of the input vector $x(t_p)$ onto the subspace spanned by the basis vectors $b_h^{(i)}(t_p)$ of neuron i, at instant $t_p$ of the episode S. For orthonormal basis vectors this projection is given by:

$$\hat{x}^{(i)}(t_p) = \sum_h \big( b_h^{(i)T} x(t_p) \big)\, b_h^{(i)} \qquad (2)$$

2. Updating the basis vectors of the representative winner and its neighbors as follows:

$$b_h^{(i)\prime} = \prod_{t_p \in S} \Big( I + \alpha(t_p)\, \frac{x(t_p)\, x^T(t_p)}{\|\hat{x}^{(i)}(t_p)\|\,\|x(t_p)\|} \Big)\, b_h^{(i)} \qquad (3)$$

with $\alpha(t_p)$ being the time-variable learning rate.

3. Orthonormalizing the basis vectors. First, the vectors are orthogonalized by using the Gram-Schmidt process, as follows:

$$b_1^{(i)\prime\prime} = b_1^{(i)\prime}, \qquad b_h^{(i)\prime\prime} = b_h^{(i)\prime} - \sum_{j=1}^{h-1} \frac{b_h^{(i)\prime T}\, b_j^{(i)\prime\prime}}{\|b_j^{(i)\prime\prime}\|^2}\, b_j^{(i)\prime\prime}, \quad h = 2, \ldots, n \qquad (4)$$

Secondly, the orthogonalized vectors are normalized.
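For concreteness, the three steps can be condensed into a numerical sketch (our own transcription of equations (1)-(4), not Kohonen's code; bases[i] holds the basis vectors of neuron i as rows, and for brevity only the winner is updated, whereas the ASSOM also updates its neighbors):

import numpy as np

def assom_episode(bases, episode, alpha):
    """One ASSOM update over an episode (a list of input vectors x(t_p))."""
    # Step 1: winner = neuron with maximal projected energy over the episode
    def energy(B, x):                      # rows of B assumed orthonormal
        return np.sum((B @ x) ** 2)
    c = max(range(len(bases)), key=lambda i: sum(energy(bases[i], x) for x in episode))

    # Step 2: rotate the winner's basis vectors towards the episode, eq. (3)
    B = bases[c]
    for x in episode:
        xhat = B.T @ (B @ x)               # projection, eq. (2)
        norm = np.linalg.norm(xhat) * np.linalg.norm(x)
        if norm > 0:
            B = B @ (np.eye(len(x)) + alpha * np.outer(x, x) / norm)

    # Step 3: Gram-Schmidt orthonormalization, eq. (4)
    Q, _ = np.linalg.qr(B.T)               # QR is equivalent to Gram-Schmidt + normalize
    bases[c] = Q.T
    return c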



3. The ASGCS Network

The Growing Cell Structures (GCS) network corresponds to a self-organizing network which has a flexible, problem-dependent structure, a variable number of elements, and a k-dimensional topology, where k can be arbitrarily chosen. The main advantages of the GCS over existing SOM models are [Fritzke, 1994]: (a) the network structure is determined automatically from the input data; (b) it is not necessary to choose a network size in advance, since the network grows until a performance criterion is met; (c) all the parameters of the model are constant and it is not necessary to define a decay factor. A complete description of the GCS model and its properties can be found in [Fritzke, 1994].
In this section the so-called Adaptive-Subspace Growing Cell Structures (ASGCS) network is presented. The ASGCS network corresponds to a further improvement of the ASSOM, which introduces some GCS concepts into the ASSOM model. These new concepts allow one to select automatically the number of feature variables (the number of filters or neurons in the network) and the topology of the network. As in the ASSOM model, in the ASGCS each neuron is not described by a single parametric reference vector, but by basis vectors that span a linear subspace. The comparison of the orthogonal projections of every input vector onto the different subspaces is used as the matching criterion by the network. As before, if one wants these subspaces to correspond to invariant-feature detectors, one must define an episode (a group of vectors) in the training data, and then locate a representative winner for this episode. The training data are made of randomly displaced input patterns. The network grows freely until a performance criterion is met.
Let us consider an ASGCS network with a k-dimensional topological (variable) structure G. An adaptation step of this network can be formulated as follows:

1. Locating the representative winner s, i.e. the neuron in whose subspace the projected "energy" is maximum, using (1) and (2).

2. Updating the basis vectors of the representative winner and its direct topological neighbors $N_s$ as follows:

$$b_h^{(i)\prime} = \prod_{t_p \in S} \Big( I + \alpha'(t_p)\, \frac{x(t_p)\, x^T(t_p)}{\|\hat{x}^{(i)}(t_p)\|\,\|x(t_p)\|} \Big)\, b_h^{(i)} \qquad (5)$$

with

$$\alpha'(t_p) = \varepsilon_s \ \text{for the winner}, \qquad \alpha'(t_p) = \varepsilon_c \ \ \forall c \in N_s \qquad (6)$$

3. Orthonormalizing the basis vectors by using the Gram-Schmidt process (see (4)).



4. Increment the signal counter of s:

$$\Delta\tau_s = 1 \qquad (7)$$

5. Decrease all signal counters by a fraction $\beta$:

$$\Delta\tau_c = -\beta\,\tau_c \quad \forall c \in G \qquad (8)$$

6. Every L adaptation steps do:

6.1. Compute the relative signal frequency of a cell c as:

$$h_c = \frac{\tau_c}{\sum_{j \in G} \tau_j} \qquad (9)$$

6.2. Determine the cell q with the property

$$h_q \ge h_c \quad \forall c \in G \qquad (10)$$

6.3. Look for the direct neighbor of q with the largest distance in input space. This is a cell f satisfying the condition that the sum of the projections of the basis vectors of q onto the subspace of f is minimum:

$$f = \arg\min_{c \in N_q} \sum_h \big\| \hat{b}_h^{(q),(c)} \big\|^2 \qquad (11)$$

where $\hat{b}_h^{(q),(c)}$ denotes the projection of $b_h^{(q)}$ onto the subspace of cell c.

6.4. Insert a new cell r between q and f in such a way that one again has a structure consisting only of simplex structures of dimension k.

6.5. Initialize the basis vectors of r as

$$b_h^{(r)} = \tfrac{1}{2}\big( b_h^{(q)} + b_h^{(f)} \big) \qquad (12)$$

and orthonormalize them by using the Gram-Schmidt process (see (4)).

6.6. Perform a redistribution of the counter variables $\tau_c$ of the topological neighbors of r, each neighbor ceding a fraction of its counter to the new cell (the redistribution fraction follows the GCS rule of [Fritzke, 1994]):

$$\Delta\tau_c = -\gamma_c\,\tau_c \quad \forall c \in N_r \qquad (13)$$

and initialize the counter variable of r as:

$$\tau_r = -\sum_{c \in N_r} \Delta\tau_c \qquad (14)$$
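A compact sketch of the counter bookkeeping in steps 4-6 (the basis-vector handling reuses the ASSOM sketch above; the redistribution fraction gamma is our own placeholder for the GCS rule, not a value from the paper):

import numpy as np

def gcs_counters_step(tau, winner, beta):
    """Steps 4-5: reward the winner, then decay every signal counter."""
    tau[winner] += 1.0
    tau *= (1.0 - beta)
    return tau

def insert_cell(tau, neighbors_of_new, gamma=0.2):
    """Step 6.6 sketch: neighbors cede a fraction gamma of their counters
    to the newly inserted cell r (gamma is a placeholder, see text)."""
    ceded = gamma * tau[neighbors_of_new]
    tau[neighbors_of_new] -= ceded
    return np.append(tau, ceded.sum())      # tau_r = sum of the ceded amounts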

4. Gabor-type Filter Generation using the ASGCS and the ASSOM

In this section, the automatic generation of Gabor-type spatial feature filters using the ASGCS and the ASSOM is shown. As in [Kohonen, 1995a], artificial two-dimensional sinusoidal waves with random orientation and random frequency are used to train the networks. Sampling lattices with 169 (13×13) points were used.

[Fig. 1. The growing process of the ASGCS, shown in four stages (a)-(d). Only the even component (b1) of the generated filters is shown.]

[Fig. 2. The Gabor-like feature filters generated by the ASGCS after: (a) 1000; (b) 2500; (c) 5000; (d) 10000; and (e) 20000 iterations.]

[Fig. 3. The Gabor-like feature filters generated by the ASSOM after: (a) 1000; (b) 5000; (c) 10000; (d) 20000; (e) 30000; (f) 40000 iterations. Only the even component (b1) of the filters is shown.]

First, the results obtained with the ASGCS network are presented. The growing process of the ASGCS is shown in Figure 1. Filters which were close neighbors in the underlying growing graph structure are shown near each other in these pictures; this is done in the manner described in [Fritzke, 1994]. As can be seen, the growing of the network is performed in parallel with the frequency and orientation tuning of the filters. It should be noted that already in the early phases of the simulation the ASGCS network has basically its final shape, with fewer neurons (filters); this behavior is described as fractal growth. In Figure 2 the generated feature filters are shown as a function of the number of iterations, ranging from 1000 to 20000. It can be observed that the generated filters exhibit a Gabor-type structure very quickly (after 5000 iterations).
Finally, in Figure 3 the feature filters generated by the ASSOM are shown as a function of the number of iterations (ranging from 1000 to 40000). It can be observed that it takes a large number of iterations until the filters show a Gabor-like structure. It should also be mentioned that the ASGCS algorithm is about five times faster than the ASSOM algorithm. In addition, the setting of the parameters of the ASGCS is easier and more robust than the setting of the ASSOM parameters.

5. Summary and Conclusions

The ASGCS network was introduced in this article. This network corresponds to a further improvement of the ASSOM that introduces some GCS concepts into the ASSOM model. These new concepts make it possible to select automatically the number of feature variables (the number of filters or neurons in the network) and the topology (not only rectangular or hexagonal grids) of the network. Examples of the automatic generation of Gabor-type spatial feature filters using the ASGCS and the ASSOM were shown.
It can be concluded that the proposed network is adequate for automatically generating Gabor-like feature detectors, although its properties must still be analyzed. Moreover, the ASGCS algorithm is about five times faster than the ASSOM algorithm, and the setting of its parameters is easier and more robust.
Further extensions of this work include:
• Study of the dynamics and the properties of the ASGCS.
• Introduction of the ASGCS network into an image processing architecture, e.g. the TEXSOM architecture.
• Experimentation with the ASGCS in one-dimensional signal processing.

References

Daugman, J.G. (1980). Two-dimensional spectral analysis of cortical receptive field


profiles. Vision Research, 20, 847-856.

Fritzke, B. (1994). Growing Cell Structures - A self-organizing network for unsupervised


and supervised learning. Neural Networks, Vol. 7, No. 9, 1441-1460.

Kohonen, T. (1995a). Self-Organizing Maps. Springer-Verlag, Heidelberg.

Kohonen, T. (1995b). The Adaptive-Subspace SOM (ASSOM) and its use for the
implementation of invariant feature detection. Proc. of the Int. Conf. on Artificial Neural
Networks. ICANN 95, October 9-13, Paris, France.

Kohonen, T. (1996). Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biol. Cybern., Vol. 75, No. 4, 281-291.

Ruiz-del-Solar, J., and Köppen, M. (1996). Automatic generation of Oriented Filters for Texture Segmentation. Proc. of the Int. Workshop on Neural Networks for Identification, Control, Robotics & Signal/Image Processing - NICROSP 96, August 21-23, Venice, Italy.

Ruiz-del-Solar, J., and Köppen, M. (1997). A Texture Segmentation Architecture based


on automatically generated Oriented Filters. Journal of Microelectronic Systems
Integration, Vol. 5, No. 1, 43-52.

Ruiz-del-Solar, J. (1998). TEXSOM: Texture segmentation using self-organizing maps.


Neurocomputing 21, 7-18.

Sanger, T.D. (1989). Optimal unsupervised learning in a single-layer linear feed-forward


neural network. Neural Networks, Vol. 2, No. 6, 459-473.

Sirosh, J. (1995). A Self-Organizing neural network model of the primary visual cortex,
Ph.D. Thesis, The University of Texas at Austin, USA.

Van Sluyters, R.C., Atkinson, J., Banks, M.S., Held, R.M., Hoffmann, K.-P., and Shatz,
C.J. (1990). The Development of Vision and Visual Perception. In L. Spillman and J.
Werner (Eds.), Visual Perception: The Neurophysiological Foundations, Academic Press.

Wilson, H.R., Levi, D., Maffei, L., Rovamo, J., and DeValois, R. (1990). The Perception
of form: Retina to Striate Cortex. In L. Spillman and J. Werner (Eds.), Visual Perception:
The Neurophysiological Foundations, Academic Press.
Adaptive Hybrid Speech Coding with an MLP/LPC Structure

Marcos Faúndez-Zanuy

Escola Universitària Politècnica de Mataró
Universitat Politècnica de Catalunya
Avda. Puig i Cadafalch 101-111, 08303 MATARÓ (BARCELONA), SPAIN
e-mail: faundez@eupmt.es  http://www.eupmt.es/veu
Tel: +34-93 757 44 04  Fax: +34-93-7570524

ABSTRACT
In recent years there has been a growing interest in nonlinear speech models. Several works have been published revealing the better performance of nonlinear techniques, but little attention has been dedicated to the implementation of the nonlinear model in real applications. This work focuses on the study of the behaviour of a combined linear/nonlinear predictive model, based on linear predictive coding (LPC-10) and neural nets, in a speech waveform coder. Our novel scheme obtains an improvement in SEGSNR of between 1 and 2.5 dB for an adaptive quantization ranging from 2 to 5 bits.

1. Introduction

Speech applications usually require the computation of a linear prediction model for the
vocal tract. This model has been successfully applied during the last thirty years, but it
has some drawbacks. Mainly, it is unable to model the nonlinearities involved in the
speech production mechanism, and only one parameter can be fixed: the analysis order.
With nonlinear models, the speech signal is better fit, and there is more flexibility to
adapt the model to the application.
In recent years there has been growing interest in nonlinear models applied to speech. This interest is based on the evidence of nonlinearities in the speech production mechanism. Several arguments justify this fact:
a) Residual signal of predictive analysis [1].
b) Correlation dimension of the speech signal [2].
c) Physiology of the speech production mechanism [3].
d) Probability density functions [4].
e) High order statistics [5].
Despite this evidence, few applications have been developed so far, mainly due to the high computational complexity and the difficulty of analyzing nonlinear systems. The applications of nonlinear predictive analysis have focused on speech coding, because it achieves greater prediction gains than LPC. The most relevant systems are [6] and [7], which propose CELP coders with different nonlinear predictors that improve the SEGSNR of the decoded signal.
The main approaches proposed for the nonlinear predictive analysis of speech are:
a) Nonparametric prediction: it does not assume any model for the nonlinearity. It is a quite simple method, but the improvement over linear predictive methods is lower than with nonlinear parametric models.
b) Parametric prediction: it assumes a model of prediction. The main approaches are Volterra series and neural nets.
Recently several contributions have appeared in the context of neural nets. In this paper we propose a novel ADPCM speech waveform coder for the following bit rates: 16 kbps, 24 kbps, 32 kbps and 40 kbps, with a hybrid (linear/nonlinear) predictor. With this structure a significant improvement in SEGSNR of between 1 and 2.5 dB is achieved over the equivalent coders based on MLP and LPC alone.

2. Adaptive ADPCM with hybrid predictor scheme


A significant number of proposals found in the literature use Volterra series with a quadratic nonlinearity (higher-order nonlinear functions imply a high number of coefficients and a high computational burden for estimating them), or radial basis function (RBF) nets, which also imply a quadratic nonlinear model. We propose the use of a multilayer perceptron net, because it has more flexibility in the nonlinearity. It is easy to show that an MLP with a sigmoid transfer function can model cubic nonlinearities (Taylor series expansion of the sigmoid function). We believe that this is an important fact, because the nonlinearity present in the human speech production mechanism is due to a saturation
phenomenon in the vocal cords.

Figure 1: quadratic and cubic nonlinearities and the saturation function.

Figure 1 shows that it is possible to model a saturation function with a cubic function, but it is not possible with a quadratic function.
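As a quick check of this claim, one can expand the logistic sigmoid around the origin: apart from the constant, only odd powers appear, so a cubic term is naturally available while a pure quadratic one is not:

$$\sigma(x) = \frac{1}{1+e^{-x}} = \frac{1}{2} + \frac{x}{4} - \frac{x^{3}}{48} + O(x^{5}).$$

The same argument holds if a tanh activation is used instead, since $\tanh(x) = x - x^{3}/3 + O(x^{5})$.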

A more detailed explanation of the nonlinear predictive model based on neural nets can be found in [8] and [9]. This paper is focused on the speech coding application.
In a preliminary work we studied the behaviour of the linear (LPC) and nonlinear multilayer perceptron (MLP) predictors alone. This study reveals that the optimal solution is an adaptive LPC/MLP prediction selection. We propose a linear/nonlinear switched predictor in order to always choose the best predictor and to increase the SEGSNR of the decoded signal. Figure 2 represents the implemented scheme.
For each frame the outputs of the linear and nonlinear predictors are computed simultaneously with the coefficients obtained from the previous encoded frame. Then a logical decision chooses the output with the smaller prediction error. This implies an overhead of 1 bit for each frame, which represents only 1/100 bits more per sample (in our simulations the frame size is 100 samples). It is referred to in the tables as the hybrid predictor, because it combines linear and nonlinear technologies. The percentage of use of each predictor is shown in table 1.

PREDICTOR   Nq=2     Nq=3     Nq=4     Nq=5
LPC-10      60.54%   54.07%   54.13%   52.75%
MLP         39.46%   45.93%   45.87%   47.25%

Table 1. Percentages of use of LPC-10/MLP in the adaptive ADPCM backward speech coder.
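The per-frame selection just described can be summarized in a few lines. This is a simplified open-loop sketch (the paper computes the predictions with coefficients from the previous encoded frame, inside a closed quantization loop), and `lin_pred` / `nl_pred` are hypothetical callables, not the authors' code:

```python
import numpy as np

def select_predictor(frame, lin_pred, nl_pred):
    """Choose the predictor (LPC-10 or MLP) with the smaller prediction
    error for this frame; the choice itself is the 1-bit side information."""
    lin_res = frame - lin_pred(frame)
    nl_res = frame - nl_pred(frame)
    use_mlp = np.sum(nl_res ** 2) < np.sum(lin_res ** 2)
    return use_mlp, (nl_res if use_mlp else lin_res)  # flag + residual to quantize
```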

Fig. 2. Adaptive ADPCM-B hybrid coder (1 bit/frame of side information). LP: linear predictor, NLP: nonlinear predictor, SW: switch.

2.1 System overview

Predictor coefficients updating
• The coefficients are updated once every frame.
• To avoid the transmission of the predictor coefficients, an ADPCM backward (ADPCM-B) configuration is adopted. That is, the coefficients of the predictor are computed over the previously decoded frame, because it is already available at the receiver, which can therefore compute the same coefficient values without any additional information. The results obtained with forward unquantized predictor coefficients (ADPCM-F) are also provided for comparison purposes.
• The nonlinear analysis consists of a multilayer perceptron with 10 input neurons, 2 hidden neurons and 1 output neuron. The network is trained with the Levenberg-Marquardt algorithm.
• The linear prediction analysis of each frame consists of 10 coefficients obtained with the autocorrelation method (LPC-10).

Residual prediction error quantization
• The prediction error has been quantized with Nq = 2 to 5 bits (bit rates of 16 kbps to 40 kbps).
• The quantizer step is adapted with multiplier factors obtained from [10]; Δmin and Δmax are set empirically [11]. A sketch of this adaptation loop is given below.

Database
• The results have been obtained with the following database: 8 speakers (4 males and 4 females), sampled at 8 kHz and quantized at 12 bits/sample.
Additional details about the predictor and the database were reported in [8] and [9].
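One plausible shape of the backward-adaptive loop is sketched below. The multiplier values, step limits and mid-rise quantizer layout are illustrative assumptions in the spirit of [10], not the paper's actual parameters, and `predict` stands for either the LPC or the MLP predictor:

```python
import numpy as np

def adpcm_backward(x, predict, nq=3, step=0.02, step_min=1e-4, step_max=1.0):
    """Sketch of a backward-adaptive ADPCM loop: predict each sample from
    the decoded past, quantize the residual with a uniform mid-rise
    quantizer, and rescale the step with a per-level multiplier."""
    levels = 2 ** (nq - 1)                      # half the 2**nq quantizer levels
    mult = np.linspace(0.9, 1.6, levels)        # illustrative step multipliers
    decoded, codes = [], []
    for n in range(len(x)):
        pred = predict(decoded)                 # uses already-decoded samples only
        q = int(np.clip(np.floor((x[n] - pred) / step), -levels, levels - 1))
        codes.append(q)
        decoded.append(pred + (q + 0.5) * step) # reconstruction, same at the decoder
        idx = q if q >= 0 else -q - 1           # magnitude level of the code
        step = float(np.clip(step * mult[idx], step_min, step_max))
    return codes, np.array(decoded)
```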

Figure 3: MSE vs. epochs for the multilayer perceptron (trained with backpropagation and Levenberg-Marquardt) and the Elman net.

2.2 Parameter selection

a) Linear predictor
For the linear predictor the parameters are:
• Prediction order: LPC-10 (same number of input samples as the MLP 10x2x1) and LPC-25 (same number of prediction coefficients as the MLP 10x2x1) are studied.
• Frame length: sizes from 10 to 300 samples with a step of 10 samples are evaluated. Notice that the bigger the frame size, the smaller the number of frames for a given speech signal; but if the frame length is large, the assumption of a stationary signal inside the analysis window is no longer valid and the behaviour degrades. If the frame length is short, the parameter estimation is not robust enough and the behaviour also degrades.
b) Nonlinear predictor
For the nonlinear predictor based on neural nets, the number of parameters that must be optimized is greater. The selected network architecture is a multilayer perceptron with 10 input neurons, 2 hidden neurons with a sigmoid transfer function, and one output neuron with a linear transfer function, trained with the Levenberg-Marquardt (L-M) algorithm, based on our previous results [8]. We also evaluated a recurrent Elman net, but found that its behaviour was worse than that of the MLP trained with L-M. Fig. 3 shows the mean square error as a function of the number of epochs for a typical voiced frame of the database. It can be seen that the L-M algorithm presents fast convergence and a small MSE. The MLP 10x4x1 was also tested, but it has more coefficients and greater computational complexity. Also, a larger number of random initializations must be done for the 10x4x1 structure, because the probability of achieving the greatest prediction gain for a random initialization is lower than for the 10x2x1 structure (fig. 4).

Figure 4: histograms of the prediction gain for 500 random initializations of the neural net weights (10x2x1 and 10x4x1 structures).
The adjusted parameters of the predictor inside the closed-loop ADPCM scheme are:
• Number of training epochs: this is a critical parameter. To encode a given frame, the neural net is trained over the previous frame in the backward scheme and over the current frame in the forward configuration. In both cases special attention must be paid to avoid the problem of overtraining (the network must have a good generalization capability to manage inputs not used for training). Although consecutive frames are normally very similar, there are significant changes in the waveform that must be seen as perturbations of the input; and even if the neural net is applied over the same frame used for training, the conditions are different, because the predictor is trained in an open-loop scheme and tested in closed loop, so the input signal is actually corrupted by the quantization noise. This is all the more important the smaller the number of quantizer bits. Making the neural net as robust as possible to these small changes implies the optimization of training conditions such as:
a) the number of epochs used for training;
b) the number of random initializations of the weights (a multi-start algorithm is used).
Fig. 5. SEGSNR vs. frame length for ADPCM forward.

To achieve a good initialization, a multi-start algorithm is used, which consists in computing several random initializations (experimentally fixed to 5) and choosing the one that achieves the highest SEGSNR. For selecting the number of epochs, the optimal condition would be to evaluate for each frame the number of epochs that maximizes the SEGSNR. This is impractical because the decoder needs to know the number of epochs in order to track the encoder; this would imply the transmission of the number of training epochs, and so the bit rate would be increased. The adopted solution consists of a statistical study for choosing the best average number of epochs, which reveals that the optimal number of epochs is 6 (see [9]).
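The multi-start strategy itself is generic and fits in a few lines; `train_fn` and `score_fn` below are placeholders for the Levenberg-Marquardt training run and the SEGSNR evaluation, not the authors' code:

```python
def multistart(train_fn, score_fn, n_starts=5):
    """Run `train_fn()` (which performs its own random weight initialization)
    several times and keep the model that `score_fn` rates highest."""
    return max((train_fn() for _ in range(n_starts)), key=score_fn)
```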
• Frame length: the same comments as for the linear predictor apply here. Experimental results show that the linear predictor has a similar behaviour over a wider range of frame sizes than the nonlinear predictor, but there is some range for which the nonlinear predictor

is better than the linear predictor.

Fig. 6. SEGSNR vs. frame length for ADPCM backward (hybrid, MLP and LPC-10 predictors; Nq = 2 to 5 bits).

Figures 5 and 6 show the SEGSNR (computed with a 200-sample analysis window) for frame lengths ranging from 10 to 300 samples for the MLP 10x2x1, LPC-10, LPC-25 and hybrid predictors with Nq = 2 to 5 bits, averaged over the frames of one sentence. For the hybrid predictor an overhead of 1 bit/frame must be sent, so if the frame length is reduced the compression ratio is also reduced. For these reasons the block size has been set to 100 samples/frame, because it offers a good compromise.
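For reference, segmental SNR averages per-segment SNRs in dB; the definition below is the common one and is an assumption, since the paper does not spell out its exact formula (NumPy sketch, 200-sample segments as in the text):

```python
import numpy as np

def segsnr(original, decoded, seg_len=200, eps=1e-10):
    """Segmental SNR in dB: mean of per-segment SNRs (common definition)."""
    snrs = []
    for i in range(0, len(original) - seg_len + 1, seg_len):
        sig = np.sum(original[i:i + seg_len] ** 2)
        err = np.sum((original[i:i + seg_len] - decoded[i:i + seg_len]) ** 2)
        snrs.append(10 * np.log10(sig / (err + eps)))
    return float(np.mean(snrs))
```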

3. Results

The results have been evaluated using subjective criteria (listening to the original and decoded files) and the SEGSNR.
Table 2 shows the SEGSNR obtained with the ADPCM configuration for the whole database with the following predictors: LPC-10, LPC-25 and MLP 10x2x1. The results of the ADPCM forward coder (with unquantized predictor coefficients) are also provided as a reference for the backward configuration.
These results reveal the superiority of the nonlinear predictor in the forward configuration (approximately 3.5 dB over LPC-25, except for the 2-bit quantizer). This superiority is greater when the quantizer has a high number of levels.
In the backward configuration there is a small SEGSNR decrease with the linear predictor versus the forward configuration. For the nonlinear predictor the decrease is more significant (nearly 3 dB), but the SEGSNR is still better than LPC-10, except for Nq = 2 bits. Also, the variance of the SEGSNR is greater than for the linear predictor, because in the stationary portions of speech the neural net works satisfactorily, while for the unvoiced parts it generalizes poorly. Therefore, we propose a hybrid predictor.

METHOD           Nq=2 bits     Nq=3 bits     Nq=4 bits     Nq=5 bits
                 SEGSNR  std   SEGSNR  std   SEGSNR  std   SEGSNR  std
ADPCMF-LPC-10    15.35   5.8   21.18   6.4   25.86   6.9   30.52   7.1
ADPCMF-LPC-25    15.65   5.6   21.46   6.4   26.26   6.9   30.79   7.2
ADPCMF-MLP       15.5    7.4   24.12   7.3   29.35   7.6   34.14   8.4
ADPCMB-LPC-10    14.92   5.1   20.59   5.9   25.38   6.6   30.02   7.1
ADPCMB-LPC-25    14.88   5.1   20.95   5.5   25.2    6     30.1    6.2
ADPCMB-MLP       14.35   6.9   21.48   7.5   26.76   7.6   31.5    8.4
ADPCMB-HYBRID    16.1    4.8   22.38   5.8   27.51   6.1   32.53   6.4

Table 2. SEGSNR (mean and standard deviation, in dB) for ADPCM forward and backward with linear, nonlinear and hybrid predictors.

We have also evaluated the computational complexity of the studied systems. Table 3 summarizes the number of flops required for encoding the whole database with the different schemes. For comparison purposes, the computational complexity has been referred to that of the ADPCM LPC-10 system; thus, the numbers in table 3 show how many times greater the computational burden is. The evaluated systems are:
• B: ADPCM with backward adaptation of the prediction coefficients.

• F: ADPCM with forward adaptation of unquantized prediction coefficients.
• L-10: linear predictive analysis of the same order as the MLP 10x2x1.
• L-25: linear predictive analysis with the same number of coefficients as the MLP 10x2x1.
• MLP: nonlinear predictive analysis with the multilayer perceptron 10x2x1.
• H: hybrid prediction (the best predictor, MLP 10x2x1 or LPC-10).

100 and 200 indicate the frame length in the block-adaptive prediction system.

frame length   BL10   BL25   BMLP   H      FL25   FMLP
100            1      1.4    27     29.8   1.4    24
200            2.1    2.6    27.5   32.9   2.6    26.2

Table 3. Computational burden relative to the ADPCM LPC-10 system.

4. Conclusions and comparison with previously published work

The only work we have found that deals with ADPCM with nonlinear prediction is the one proposed by Mumolo et al. [12]. It was based on Volterra series and had instability problems, which were overcome with a switched linear/nonlinear predictor. Our novel nonlinear scheme has always been stable in our experiments, although we also propose a switched predictor in order to increase the SEGSNR of the decoded signals. The results of our novel scheme show an increase of between 1 and 2.5 dB over classical LPC-10 for quantizers ranging from 2 to 5 bits, while the scheme of Mumolo [12] is 1 dB over classical LPC for quantizers ranging from 3 to 4 bits, also with a hybrid predictor. On the other hand, the computational complexity is increased approximately thirty times in the hybrid structure.
A statistical test was done in order to check whether the results are statistically significant. The selected test is ANOVA (analysis of variance), and it proves that the proposed adaptive hybrid speech coder is significantly better than the ADPCM-B LPC-10 and LPC-25 schemes for all studied bit rates.
In this paper we have obtained the same conclusion as in our speaker recognition application of nonlinear predictive models based on MLP [13]: the best results are achieved with a combination of linear and nonlinear predictive models. In [14] we obtained the same conclusion (also in speaker recognition) for a combination of an MLP trained as a classifier for each speaker and a codebook of cepstral parameters derived from a linear parametrization.

Acknowledgements

This work has been supported by the CICYT TIC97-1001-C02-02.



References

[1] J. Thyssen, H. Nielsen and S.D. Hansen, "Non-linear short-term prediction in speech coding", ICASSP-94, pp. I-185 to I-188.
[2] B. Townshend, "Nonlinear prediction of speech", ICASSP-91, Vol. 1, pp. 425-428.
[3] H.M. Teager, "Some observations on oral air flow during phonation", IEEE Trans. ASSP, vol. 28, pp. 599-601, October 1980.
[4] G. Kubin, "Nonlinear processing of speech", chapter 16 of Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal (eds.), Elsevier, 1995.
[5] J. Thyssen, H. Nielsen and S.D. Hansen, "Non-linearities in speech", Proceedings of the IEEE Workshop on Nonlinear Signal and Image Processing, NSIP'95, June 1995.
[6] A. Kumar and A. Gersho, "LD-CELP speech coding with nonlinear prediction", IEEE Signal Processing Letters, Vol. 4, No. 4, April 1997, pp. 89-91.
[7] L. Wu, M. Niranjan and F. Fallside, "Fully vector quantized neural network-based code-excited nonlinear predictive speech coding", IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, October 1994.
[8] M. Faúndez, E. Monte and F. Vallverdú, "A comparative study between linear and nonlinear speech prediction", Biological and Artificial Computation: From Neuroscience to Technology, IWANN'97, pp. 1154-1163, September 1997.
[9] M. Faúndez, F. Vallverdú and E. Monte, "Nonlinear prediction with neural nets in ADPCM", International Conference on Acoustics, Speech and Signal Processing, ICASSP-98, SP11.3, USA, May 1998.
[10] N.S. Jayant and P. Noll, "Digital Coding of Waveforms: Principles and Applications", Prentice Hall, 1984.
[11] M. Faúndez, "Modelado predictivo no lineal de la señal de voz aplicado a codificación y reconocimiento de locutor", Ph.D. Thesis, UPC, November 1998. Available at http://www.eupmt.es/veu in pdf format.
[12] E. Mumolo, A. Carini and D. Francescato, "ADPCM with nonlinear predictors", European Signal and Image Processing Conference, EUSIPCO-94, pp. 387-390.
[13] M. Faúndez and D. Rodríguez, "Speaker recognition using residual signal of linear and nonlinear prediction models", International Conference on Spoken Language Processing, ICSLP'98, Vol. 2, pp. 121-124, Dec. 1998, Sydney.
[14] D. Rodríguez and M. Faúndez, "Speaker recognition with a MLP classifier and LPCC codebook", accepted for publication at ICASSP'99, Phoenix, USA, March 1999.
Neural Predictive Coding for Speech Signal

C. Chaw (chaw@ccr.jussieu.fr), B. Gas (gas@ccr.jussieu.fr), J.L. Zarader (jlz@ccr.jussieu.fr)

Université Paris VI, Laboratoire des Instruments et Systèmes,
4 place Jussieu, 75252 Paris cedex 05, France

ABSTRACT

In this paper, we present a new speech coding scheme named NPC (Neural Predictive Coding), obtained with an MLP (multilayer perceptron) used in prediction. The system is designed to predict the samples of a signal window from the previous ones. The goal of this coding is to extract the characteristics of the signal window relative to the database from which it is extracted. After a precise description of our coding, we compare the results it obtains with those obtained by classic codings (MFCC, FFT, LAR, LPC and LPCC) on phoneme recognition. The NPC coding allows an improvement of the recognition rate with respect to the other codings.

INTRODUCTION

The first goal of a coding is to reduce the number of data to process, while preserving the maximum of discriminating information. There are several coding types: frequential codings (FFT, MFCC) and predictive codings (LPC, LAR, LPCC). These codings are effective, but not adaptive. The objective of the NPC coding is to adapt itself to the database to be coded. The classic coding closest to NPC is the LPC.
In the first part we describe the proposed model, first in a qualitative way and then in a formal one. In the second part we present the parameters chosen for the coding system and give explanations for these choices. We also present the results obtained by our coding and by the classic codings already mentioned on phoneme recognition. Finally, we discuss possible new research axes.

1 NPC CODING

1.1 Qualitative description

The coding system is a two-layer perceptron. It possesses n1 inputs, n2 neurons on the hidden layer, and one output. It is trained to predict a signal sample from the n1 previous ones. The (n1+1)·n2 weights of the first layer are common to all windows and constitute the fixed part of the system. On the other hand, the n2+1 weights of the second layer are proper to each window and constitute the coding coefficients. The process is decomposed in two phases: first the first-layer adjustment phase, and then the coding one.

1.1.1 First layer adjustment

We choose b signal windows of N samples each. A second layer is associated with each window. We present the first n1 samples of a window to the MLP composed of the common first layer and of the second layer associated with this window. The MLP is adapted to predict sample n1+1 of the window. Then we shift input and output forward by one sample, i.e. the system has to predict sample n1+2 from samples 2 to n1+1 placed at the input, and so on, up to predicting the last sample from samples N-n1 to N-1. Therefore each window provides N-n1-1 examples. Modifications of the second layer are then executed; on the other hand, modifications of the common first layer are executed only after the passage of all the windows. Thus, the weights of the first layer are modified by the N-n1-1 examples of the b windows of the database, while the weights of each second layer are modified by the N-n1-1 examples of the associated window. Once this weight optimization is done, we obtain a first layer that constitutes the fixed part of the coding system, and the system is ready to code. From a connectionist viewpoint, the first layer captures the common information; from a signal viewpoint, the first layer performs transformations that are optimal for the prediction of the windows.
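A compact sketch of this two-phase adjustment (NumPy; plain gradient descent stands in for the paper's backpropagation, the output neuron is taken as linear, the second layer is updated per example for simplicity, and all sizes and the learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adjust_first_layer(windows, n1=20, n2=10, epochs=50, lr=0.01, seed=0):
    """Sketch of the NPC first-layer adjustment: one shared first layer
    (W1, B1) and one second layer (w2, b2) per window.  Second layers are
    updated while sweeping each window; the shared first layer is updated
    only after all windows have been presented."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(n2, n1))
    B1 = np.zeros(n2)
    seconds = [[rng.normal(scale=0.1, size=n2), 0.0] for _ in windows]
    for _ in range(epochs):
        gW1, gB1 = np.zeros_like(W1), np.zeros_like(B1)
        for i, win in enumerate(windows):
            w2, b2 = seconds[i]
            for k in range(n1, len(win)):
                xv = np.asarray(win[k - n1:k])        # n1 previous samples
                v = sigmoid(W1 @ xv + B1)             # hidden activations
                err = (w2 @ v + b2) - win[k]          # linear-output prediction error
                delta = err * w2 * v * (1.0 - v)      # gradient routed to layer 1
                w2 = w2 - lr * err * v                # per-example second-layer update
                b2 = b2 - lr * err
                gW1 += np.outer(delta, xv)            # accumulate first-layer gradient
                gB1 += delta
            seconds[i] = [w2, b2]
        W1 -= lr * gW1                                # first layer updated after all windows
        B1 -= lr * gB1
    return W1, B1, seconds
```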

1.1.2 Coding

Each window to be encoded is presented to the MLP constituted of the previously calculated first layer and a second layer that we initialize at random. We optimize the weights of the second layer to minimize the prediction error at the output. The weights of this second layer then constitute the coding coefficients of the window. We reiterate this procedure for all the windows. The objective is to put the information common to all windows (not discriminative) on the first layer and the information proper to each window on the second one.

We also propose a variant of this coding, in which a second layer is associated not with each window but with each class. That is, during the first-layer adjustment, when we present a window to the system we do not use its associated second layer, but the second layer associated with its class. Modifications of this second layer are executed only after the passage of all the windows of this class. The objective is to put on the second layer the information common to the associated class (discriminative).

Figure 1: x window prediction by the MLP.

1.2 Formal description

1.2.1 First layer adjustment

Let $e^{x,c}(k)$ be the $k$-th sample of the $x$ window of the $c$ class, and let
$X^{x,c}_{i,j} = (e^{x,c}(i), e^{x,c}(i+1), \ldots, e^{x,c}(j-1), e^{x,c}(j))$ be the vector composed of samples $i$ to $j$ of the $x$ window of the $c$ class.
The output of the hidden-layer neurons is:

$$V^{x,c}(k) = \sigma\big(W_1 \cdot X^{x,c}_{k-n_1,\,k-1} + B_1\big) \qquad (1)$$

where $\sigma$ is the sigmoid function, $W_1$ the $n_1 \times n_2$ matrix of first-layer weights, and $B_1$ the $1 \times n_2$ vector of first-layer biases.
Then, the prediction of the $k$-th sample of the $x$ window of the $c$ class is:

$$\hat{e}^{x,c}(k) = \sigma\big(W_2^{x,c} \cdot V^{x,c}(k) + b_2^{x,c}\big) \qquad (2)$$

where $W_2^{x,c}$ is the $n_2 \times 1$ vector of second-layer weights associated to the $x$ window of the $c$ class, and $b_2^{x,c}$ the bias.
The quadratic error committed on this prediction is therefore:

$$\varepsilon^{x,c}(k) = \big(e^{x,c}(k) - \hat{e}^{x,c}(k)\big)^2 \qquad (3)$$

The criterion to minimize for the second layer associated with the $x$ window of the $c$ class is:

$$J^{x,c} = \sum_{k=n_1+1}^{N} \varepsilon^{x,c}(k) \qquad (4)$$

i.e. the sum of (3) over all examples of the $x$ window. On the other hand, for the first layer the criterion to minimize is:

$$J_1 = \sum_{c=1}^{n_c} \sum_{x=1}^{n_x(c)} \sum_{k=n_1+1}^{N} \varepsilon^{x,c}(k) \qquad (5)$$

i.e. the sum of (3) over all examples of all windows of all classes, where $n_c$ is the number of classes and $n_x(c)$ the number of windows of the $c$ class. In the case of the variant with a second layer for each class, the criterion to minimize for the second layer associated with the $c$ class becomes:

$$J^{c} = \sum_{x=1}^{n_x(c)} \sum_{k=n_1+1}^{N} \varepsilon^{x,c}(k) \qquad (6)$$

i.e. the sum of (3) over all examples of all windows of the $c$ class. We modify the weights to minimize these criteria using the backpropagation algorithm. Once this minimization is done, we obtain the fixed part of the coding system, $W_1$ and $B_1$.

1.2.2 Coding

For each window we initialize $W_2^{x,c}$ at random. Then we minimize with the backpropagation algorithm the criterion $J^{x,c} = \sum_{k=n_1+1}^{N} \varepsilon^{x,c}(k)$, modifying only $W_2^{x,c}$. $W_2^{x,c}$ then constitutes the coding coefficients of the $x$ window of the $c$ class.

As we can see, the NPC is, like the LPC, a predictive coding where the coding coefficients are optimized for the prediction. But there are two fundamental differences with the LPC. First, the NPC has a fixed part which depends on the database to be coded; the goal of this fixed part is to capture the information that is necessary for the prediction but useless for a classification. Making some parameters of the coding system depend on the data has been used in several studies [1], where it is often the parameters of a filter bank. Second, the prediction is a nonlinear prediction; the nonlinearity of the speech signal has been shown and exploited in several studies [2,3,4].

2 TEST ON PHONEME RECOGNITION

An application of our system is phoneme recognition. We have carried out several experiments to test the performance of our system on this application. We describe the general conditions of these experiments, then we present the results obtained by our coding and by the classic ones. Finally we discuss these results.

2.1 General parameters

The database is composed of six phoneme classes (s, z, ah, ih, aa, iy) of approximately 1000 windows each. Windows are extracted from dialect region 1 (New England) of the NTIMIT database (telephone version of the DARPA TIMIT database) [5]. These classes have been chosen so as to include some classes which are easily separable (s and ah for example) and others far more difficult (s and z for example). The database has been divided into two sub-databases of 3000 windows each. One serves for the first-layer adjustment and for the classification learning; the other serves for the generalization test. A cross-validation database has not been constituted because the goal is not to estimate the NPC coding performance in the absolute, but to compare it to the other existing codings. In each experiment the database was exactly the same, in the choice of windows as in their distribution between the two databases.
We filter the signal with a Chebyshev filter, then we downsample to 8 kHz. Finally, the signal undergoes a preemphasis by a filter of transfer function 1 - 0.95z^{-1}.
Each phoneme is split into several windows of 128 samples (16 ms). Extracted windows of a same phoneme are all in the test database or all in the learning one. After coding, each window provides one example for the MLP classifier. We have fixed the number of coefficients to 12 for faster experiments.
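The preprocessing chain fits in a few lines. Only the 8 kHz rate, the Chebyshev filtering, the 1 - 0.95z^{-1} preemphasis and the 128-sample windows are stated in the text, so the filter order, ripple, cutoff and the 16 kHz input rate below are illustrative assumptions:

```python
import numpy as np
from scipy.signal import cheby1, lfilter

def preprocess(signal, fs_in=16000, fs_out=8000):
    """Low-pass filter, downsample to 8 kHz, then preemphasize with 1 - 0.95 z^-1."""
    b, a = cheby1(N=8, rp=0.5, Wn=0.45 * fs_out, fs=fs_in)  # anti-alias (assumed specs)
    filtered = lfilter(b, a, signal)
    decimated = filtered[:: fs_in // fs_out]                # downsample
    return lfilter([1.0, -0.95], [1.0], decimated)          # preemphasis

def split_windows(signal, size=128):
    """Split into non-overlapping 128-sample (16 ms at 8 kHz) windows."""
    n = len(signal) // size
    return np.asarray(signal[: n * size]).reshape(n, size)
```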

2.2 Classification with a MLP

The classifier that we have used to estimate the performances of the different codings is a classic MLP 12-10-6. An example is classified in the class whose output gives the best score. Each class has an associated output (6 outputs for the 6 classes). The desired output for an example of class 1 is therefore [+1 -1 -1 -1 -1 -1].
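A sketch of this classification stage; scikit-learn's MLPRegressor is used here as a stand-in trained on the ±1 target codes (an assumption: the paper does not name a toolkit or training algorithm):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def one_per_class_targets(labels, n_classes=6):
    """Desired output: +1 on the true class's output, -1 elsewhere."""
    t = -np.ones((len(labels), n_classes))
    t[np.arange(len(labels)), labels] = 1.0
    return t

# 12 NPC coefficients in, 10 hidden units, 6 outputs; class = best-scoring output
clf = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh", max_iter=2000)
# clf.fit(X_train, one_per_class_targets(y_train))
# y_pred = clf.predict(X_test).argmax(axis=1)
```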

2.3 Results of the classic codings

We have tested 5 codings to evaluate NPC. We chose these codings because they are often used for phoneme recognition. The results obtained by the different codings are presented in Table 1. The indicated scores are the recognition rates obtained on the generalization database.

Coding   Maximum generalization recognition rate (%)
FFT      59.14
MFCC     58.58
LAR      57.95
LPCC     57.88
LPC      56.69

Table 1: Recognition rates obtained by the different codings.

One observes that the recognition rates are relatively close. The best rate is obtained by the FFT coding, the worst by the LPC coding. The frequential codings appear to be better than the predictive ones.

2.4 NPC parameters

The transfer function of all hidden neurons is the sigmoid function. We have carried out several experiments on the number of inputs of the predictive MLP (n1). Figure 1 shows the sum of the absolute connection weights connecting input i to the neurons of the hidden layer; n1 is fixed to 20 in this experiment, and i varies from 1 to 20, i=1 representing the most delayed input and i=20 the most recent one. One observes that this sum decreases when one goes back into the past. We can therefore imagine that the prediction is made mainly from the most recent samples (which seems natural). If we add too many inputs, they bring more noise than information; if we suppress too many inputs, we suppress information. The optimal number for n1 that we found is 20; two additional experiments with n1 equal to 18 and 22 gave worse recognition rates.

Figure 1: sum of the absolute connection weights connecting input i to the neurons of the hidden layer.

We have carried out 6 experiments. For each one we have performed classifications for several iteration numbers: iterations for the first-layer adjustment part and for the coding part. During the first-layer adjustment phase, the differences between the 6 experiments were:
• the function of the output neuron was the sigmoid function or the linear one;
• there was a bias for the second layer or there was not;
• the second layer was associated with each window or with each class.

2.5 Results of the NPC coding

Table 2 presents an experiment led with a second layer for each window, without bias and with a linear function for the output neuron during the first-layer adjustment phase. The scores are the recognition rates on the test database. Each line corresponds to an iteration number of the first-layer adjustment phase (indicated on the left), and each column corresponds to an iteration number of the coding phase (indicated on top).

        1      5      10     20     40     100    250    500    1000
100     59.24  60.36  59.87  60.53  59.70  60.26  59.93  59.40  58.18
400     59.80  60.79  59.74  59.44  60.20  59.77  59.34  59.14  58.45
1600    59.70  60.40  60.83  59.57  60.46  59.90  58.45  58.21  58.41
3700    60.10  61.16  61.19  60.03  59.64  59.77  59.74  59.67  59.44

Table 2: Recognition rates for several iteration numbers.

Figures 3 and 4 present the results of the experiment that gives the best recognition rate. This rate is 62.25 for 7000 iterations of the first-layer adjustment phase and 40 iterations of the coding phase. In figure 3 we can see the mean of the recognition rates over all iteration numbers of the coding phase (1, 5, 10, 20, 40, 100) as a function of the iteration number of the first-layer adjustment phase. Conversely, in figure 4 we can see the mean of the recognition rates over all iteration numbers of the first-layer adjustment phase (200, 800, 3200, 7000) as a function of the iteration number of the coding phase.

Figure 3: means of the recognition rates in generalization for several iteration numbers of the first-layer adjustment phase.
Figure 4: means of the recognition rates in generalization for several iteration numbers of the coding phase.

2.6 Results analysis

• According to Table 2 and figure 4, one observes that the iteration number of the coding phase is very important. Too many iterations deteriorate the recognition rate (even if the prediction error is lower): overfitting appears after 100 iterations. The optimal number seems to be between 5 and 40. This point was noted in all the experiments led.
• According to Table 2 and figure 3, one observes that the larger the iteration number during the first-layer adjustment phase, the better the recognition rate. This point was also noted in the other experiments. But after 3000 iterations the recognition rate does not grow significantly. We never went far enough in the experiments to observe overfitting there.
• To end this analysis, we note that our best score is 3 points higher than the best score obtained by the classic codings, and 5.5 points higher than the LPC coding.

3 FURTHER WORK

• The MLP used for the classification is a basic classifier. It would be interesting to test this coding with other, more evolved classifiers. Moreover the database used, even if it is balanced, is small, and deserves to be expanded to a greater number of phoneme classes. Finally, the number of coefficients used to code the speech signal is very often greater than 12, so it would be interesting to increase it.
• Another application of our coding would be speech compression. Exactly as is done with the LPC coding, we could compress and decompress a speech signal. It would be interesting to test this point.
• Currently the first layer is optimized for the prediction, but what we want is a good recognition rate rather than a good prediction. We are therefore thinking about ways to couple the prediction and the classification. A study is in progress at the laboratory.

CONCLUSIONS

In this paper we have presented the NPC coding that we have developed. NPC is original because it has a part which depends on the database to be coded (the first layer of a perceptron). It is also original because the prediction is a nonlinear prediction. The experiments that we have presented show that our coding obtains a better score in generalization than the FFT coding (which was the most effective classic coding among the 5 that we tested). However, a study of greater scope remains to be led, and is in progress at the laboratory.

REFERENCES

[1] A. Biem, S. Katagiri, "Filter bank design based on discriminative feature extraction", ICASSP'94, I-485-488.
[2] J. Thyssen, H. Nielsen, S. Duus Hansen, "Non-linear short-term prediction in speech coding", ICASSP'94, I-185-188.
[3] F. Diaz-de-Maria, A. R. Figueiras-Vidal, "Nonlinear prediction for speech coding using radial basis functions", ICASSP'95, I-788-791.
[4] B. Townshend, "Nonlinear prediction of speech", ICASSP'91, I-425-428.
[5] C. Jankowski, K. Ashok, S. Basson, J. Spitz, "NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database", ICASSP'90.
Support Vector Machines for Multi-class Classification

Eddy Mayoraz and Ethem Alpaydın

IDIAP - Dalle Molle Institute for Perceptual Artificial Intelligence, CP 592, CH-1920 Martigny, Switzerland
Dept. of Computer Engineering, Bogazici University, TR-80815 Istanbul, Turkey
Abstract: Support vector machines (SVMs) are primarily designed for 2-class classification problems. Although several papers mention that the combination of K SVMs can be used to solve a K-class classification problem, such a procedure requires some care. In this paper, the scaling problem of different SVMs is highlighted. Various normalization methods are proposed to cope with this problem and their efficiencies are measured empirically. The simplest way of using SVMs to learn a K-class classification problem consists in choosing the class corresponding to the maximum of the outputs of K SVMs solving a one-per-class decomposition of the general problem. In the second part of this paper, more sophisticated techniques are suggested. On the one hand, a stacking of the K SVMs with other classification techniques is proposed. On the other hand, the one-per-class decomposition scheme is replaced by more elaborate schemes based on error-correcting codes. An incremental algorithm for the elaboration of pertinent decomposition schemes is mentioned, which exploits the properties of SVMs for an efficient computation.

1 Introduction
Automated classification addresses the general problem of finding an approximation $\hat F$ of an unknown function $F$ defined from an input space $\Omega$ onto an unordered set of classes $\{\omega_1,\ldots,\omega_K\}$, given a training set $T = \{(x^p, y^p = F(x^p))\}_{p=1}^{P} \subset \Omega \times \{\omega_1,\ldots,\omega_K\}$.
Among the wide variety of methods available in the literature to learn classification problems, some are able to handle many classes (e.g. decision trees [2,12], feedforward neural networks), while others are specific to 2-class problems, also called dichotomies. This is the case of perceptrons or of support vector machines (SVMs) [1,4,14]. When the former are used to solve K-class classification problems, K classifiers are typically placed in parallel and each one of them is trained to separate one class from the K-1 others. The same idea can be applied with SVMs [13]. This way of decomposing a general classification problem into dichotomies is known as a one-per-class decomposition, and is independent of the learning method used to train the classifiers.

In a one-per-class decomposition scheme, each classifier k trained on the dichotomy $\{(x^p, y^p = f_k(x^p))\}_{p=1}^{P} \subset \Omega \times \{-1,+1\}$ produces an approximation $\hat f_k$ of $f_k$ of the form $\hat f_k = \mathrm{sgn}(g_k)$, where $g_k : \Omega \to \mathbb{R}$. The class $\omega_k$ picked by the global system for an input $x$ will then be the one maximizing $g_k(x)$. This supposes, however, that the outputs of all $g_k$ are in the same range.
As long as each of the learning algorithms used to solve the dichotomies outputs probabilities, their answers are comparable. When a dichotomy is learned by a criterion such as the minimization of the mean square error between $g_k(x^p)$ and $y^p \in \{-1,+1\}$, it is reasonable to expect (if the model learning the dichotomy is sufficiently rich) that for any data drawn from the same distribution as the training data, the output of the classifier will have its modulus around ±1. Thus, in this case again, one can more or less assume that the answers of the K classifiers are comparable.
The output scale of a SVM is determined so that the outputs for the support vectors are ±1. This scale is not robust, since it depends on just a few points, often including outliers. Therefore, it is generally not safe to decompose a classification problem into dichotomies learned by SVMs whose outputs are compared as such to provide the final output. In this paper, different alternatives are proposed to circumvent this problem. The simplest ones are based on a renormalization of the SVM outputs. Another approach consists in stacking a first level of one-per-class dichotomies solved by SVMs with other classification methods. More elaborate solutions are based on other types of decomposition schemes, in which SVMs can be involved either as basic classifiers, i.e. to solve the dichotomies, or in recombining the answers of the basic classifiers, or both.

2 Illustrative example

To illustrate the normalization problem of the SVM outputs and to get some insight on possible solutions, let us consider the artificial example of Figure 1. The data, partitioned into three classes, are drawn according to three Gaussian distributions with exactly the same covariance matrix and different mean vectors, indicated by stars in Figure 1.
Fig. 1. A 3-class example.

Since the three covariance matrices are identical and the a priori probabilities are equal, the boundaries of the decision regions based on an exact Bayesian classifier are three lines intersecting in one point [7], represented by continuous lines in Figure 1. The 50 data of each class are linearly separable from the data of the other two classes. However, the maximal margin of a linear separator isolating Class 3 from Classes 1 and 2 is much larger than the margin of the other two linear separators. Thus, when using 3 linear SVMs to solve the three dichotomies, the norm of the optimal hyperplane found by the SVM algorithm is much smaller in one case than in the other two. When the output class is selected as the one corresponding to the SVM with the largest output, the decision region obtained is shown in Figure 1 by dashed lines, and is quite different from the optimal Bayes decision.
For comparison, the dash-dotted lines (with cross-point marked by a square) correspond to the boundaries of the decision regions obtained by three linear perceptrons trained by the pseudo-inverse method, i.e. the linear separators minimizing the mean square error [7]. This matches the optimal one closely.
Two different ways of normalizing the outputs of the SVMs are also illustrated in Figure 1, and the boundaries of the corresponding decision regions are shown with dotted lines. In one case, the parameters $(w^k, b_k)$ of each of the K separating hyperplanes $\{x \mid x^T w^k + b_k = 0\}$ are divided by the Euclidean norm of $w^k$ (the cross-point of the boundaries is a circle). In the other case, $(w^k, b_k)$ are divided by the estimate of the standard deviation of the output of the SVM (the cross-point of the boundaries is a triangle that superposes the circle).

3 SVM output normalization
The first normalization technique considered has a geometrical interpretation. When a linear classifier $f_k : \mathbb{R}^d \to \{-1,+1\}$ of the form

$$\hat f_k(x) = \mathrm{sgn}(g_k(x)) = \mathrm{sgn}(x^T w^k + b_k) \qquad (1)$$

is normalized such that the Euclidean norm $\|w^k\|_2$ is 1, $g_k(x)$ gives the Euclidean distance from $x$ to the boundary of $\hat f_k$.
Non-linear SVMs are defined as linear separators in a high-dimensional space $\mathcal{H}$ in which the input space $\mathbb{R}^d$ is mapped through a non-linear mapping $\Phi$ (for more details on SVMs, see for example the very good tutorial [3], from which our notations are borrowed). Thus, the same geometrical interpretation holds in $\mathcal{H}$. The parameter $w^k$ of the linear separator $f_k$ in $\mathcal{H}$ of the form (1) is never computed explicitly (its dimension may be huge or infinite), but is known as a linear combination of the images through $\Phi$ of the support vectors (input data with indices in $N_s^k$):

$$w^k = \sum_{p \in N_s^k} \alpha_k^p y^p \Phi(x^p). \qquad (2)$$

The normalization factor $\pi_w^k$ used in this work will thus be defined by

$$\left(\frac{1}{\pi_w^k}\right)^{2} = \|w^k\|^2 = \sum_{p,p' \in N_s^k} \alpha_k^p \alpha_k^{p'} y^p y^{p'} \Phi(x^p)^T \Phi(x^{p'}) \qquad (3)$$

$$= \sum_{p,p' \in N_s^k} \alpha_k^p \alpha_k^{p'} y^p y^{p'} K(x^p, x^{p'}), \qquad (4)$$

where $K$ is the kernel function allowing an easy computation of dot products in $\mathcal{H}$.
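Equation (4) maps directly to code; the sketch below computes $\|w^k\|$ from the dual coefficients and the kernel matrix of the support vectors, then uses it to rescale the K outputs before the argmax (NumPy; array names are assumptions):

```python
import numpy as np

def norm_w(alpha, y, K):
    """||w||^2 = sum_{p,p'} a_p a_p' y_p y_p' K(x_p, x_p'), Eq. (4).
    `alpha`, `y`: dual coefficients and labels of the support vectors;
    `K`: kernel matrix restricted to the support vectors."""
    ay = alpha * y
    return float(np.sqrt(ay @ K @ ay))

def normalized_decision(g_values, norms):
    """Divide each SVM's raw output by ||w^k|| so that the outputs measure
    distances to the separating hyperplanes, then pick the argmax class."""
    scaled = np.asarray(g_values) / np.asarray(norms)
    return int(np.argmax(scaled))
```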

One way to normalize is to scale the output of each support vector machine such that

$$E_p[y\, g_k(x)] = 1.$$

The scaling factor $\pi_k$ is defined as the mean over the samples of $y^p g_k(x^p)$, again estimated on the training set or on new data.
Each normalization factor can also be chosen as the optimal solution of an optimization problem. The factor $\pi_{k*}$ minimizes the mean square error over the samples between the normalized output $\pi_{k*} g_k(x^p)$ and the target output $y^p \in \{-1,+1\}$:

$$E(\pi_{k*}) = \sum_p \big(\pi_{k*}\, g_k(x^p) - y^p\big)^2, \qquad (5)$$

whose optimal solution is

$$\pi_{k*} = \frac{\sum_p y^p g_k(x^p)}{\sum_p g_k(x^p)^2}. \qquad (6)$$

4 Stacking SVMs and single-layer perceptrons

So far, the output class is determined by choosing the maximum of the outputs of all SVMs. However, the responses of the SVMs other than the winner also carry some information. Moreover, when a SVM is trained to separate one class $\omega_k$ from the $K-1$ others, it may happen that the mean of $g_k$ varies significantly from one class to another. For example, if class $\omega_2$ lies somewhere "in-between" class $\omega_1$ and class $\omega_3$, the function $g_1$ separating class $\omega_1$ from $\omega_2$ and $\omega_3$ is likely to have a stronger negative answer on $\omega_3$ than on $\omega_2$. This knowledge can be used to improve the overall recognition.
A simple way to aggregate the answers of all the K SVMs into a score for each of the classes is by a linear combination. If $g = (g_1,\ldots,g_K)^T$ denotes the output of the system of K SVMs, the idea suggested here is to replace the former decision function

$$\hat F = \arg\max_k (g)$$

by

$$\hat F = \arg\max_k (Mg),$$

where $M$ is a $K \times K$ mixture matrix. The classical way of solving a K-class classification problem by one-per-class decomposition corresponds to using the identity mixture matrix. The technique given in Section 3 with $\pi_{k*}$ corresponds to a diagonal $M$ with the $\pi_{k*}$ as diagonal elements. If sufficiently many data are available to estimate more parameters, a full mixture matrix can provide a finer way of recombining the outputs of the different SVMs.
This way of stacking a set of K classifiers with a single-layer neural network provides a solution to the normalization problem as long as the network (i.e. the mixture matrix $M$) is designed to minimize the mean square error between $Mg(x^p)$ and $y^p = (-1,\ldots,+1,\ldots,-1)$. Generalizing Equation (5), we get

$$E(M) = \sum_p \big\| M g(x^p) - y^p \big\|^2. \qquad (7)$$
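Minimizing (7) is an ordinary least-squares problem with a closed-form solution; a sketch (NumPy), where G stacks one vector g(x^p) per row and Y the corresponding ±1 target codes:

```python
import numpy as np

def fit_mixture_matrix(G, Y):
    """Least-squares fit of the K x K mixture matrix M minimizing
    sum_p ||M g(x^p) - y^p||^2 (Eq. 7).  G: P x K SVM outputs,
    Y: P x K targets with +1 for the true class and -1 elsewhere."""
    M_t, *_ = np.linalg.lstsq(G, Y, rcond=None)  # solves G @ M^T ~= Y
    return M_t.T

def predict(M, g):
    """Class of a new input from its K raw SVM outputs g."""
    return int(np.argmax(M @ g))
```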

5 Numerical experiments
All the experiments reported in this section are based on datasets of the Machine Learning repository at Irvine [10]. The values listed are percentages of classification errors, averaged over 10 experiments. For glass and dermatology, one time 10-fold cross-validation was done, while for vowel and soybean, the ten runs correspond to 5 times 2-folding. We used SVMs with polynomial kernels of degrees 2 and 3.

database      deg   no normal.     π_w^k          π_k*           M
glass         2     35.7 ± 13.5    31.6 ± 10.3    31.9 ± 12.3    39.0 ± 12.5
glass         3     37.6 ± 12.8    33.3 ± 11.4    35.7 ± 10.6    45.2 ± 10.8
dermatology   2     3.9 ± 1.9      4.1 ± 2.0      3.9 ± 1.9      4.2 ± 2.0
dermatology   3     3.9 ± 2.7      4.4 ± 2.7      3.9 ± 2.7      4.4 ± 2.7
vowel         2     70.3 ± 39.7    69.8 ± 40.7    69.9 ± 40.5    24.2 ± 1.6
vowel         3     62.1 ± 44.5    61.4 ± 45.4    61.8 ± 44.9    10.5 ± 3.2
soybean       2     71.6 ± 34.7    71.6 ± 34.8    71.6 ± 34.9    29.2 ± 11.2
soybean       3     71.6 ± 34.8    71.4 ± 35.1    71.6 ± 34.8    28.8 ± 11.1

We notice that on the four datasets, the two normalization techniques of dividing by π_w^k or using π_k* do not improve accuracy, except on glass, where a small improvement is seen. Using stacking with a linear model on vowel and soybean significantly improves accuracy, which demonstrates the useful effect of postprocessing SVM outputs. Overtraining certainly explains the deterioration of this stacking approach on glass, as this is a very small dataset. One can use more sophisticated learners instead of a linear model, whereby accuracy can be further improved. One interesting possibility is to use another SVM to combine the outputs of the first-layer SVMs.
We are currently experimenting with larger databases, other types of kernels and other combining strategies, and we expect to have more extensive support of this approach in the near future.

6 Robust decomposition/reconstruction schemes
Lately, some work has been devoted to the issue of decomposing a K-class classification problem into a set of dichotomies. Note that all the research we refer to was carried out independently of the method used to learn the dichotomies, and consequently all the techniques can be applied right away with SVMs.
The one-per-class decomposition scheme can be advantageously replaced by other schemes. If there are not too many classes, the so-called pairwise-coupling decomposition scheme is a classical alternative, in which one classifier is trained to discriminate between each pair of classes, ignoring the other classes. This method is certainly more efficient than one-per-class, but it has two major drawbacks. First, the number of dichotomies is quadratic in the number of classes. Second, each classifier is trained with data coming from two classes only, but in the using phase, the outputs for data from any class are involved in the final decision [11].
A more sophisticated decomposition scheme, proposed in [6,5], is based on error-correcting code theory and will be referred to as ECOC. The underlying idea of the ECOC method is to design a set of dichotomies so that any two classes are discriminated by as many dichotomies as possible. This provides robustness to the global classifier, as long as the errors of the simple classifiers are not correlated. For this purpose, every two dichotomies must also be as distinct as possible.
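The ECOC idea reduces to assigning each class a code word, learning one dichotomy per column, and decoding by nearest code word; a minimal sketch with a toy coding matrix (not one of the matrices of [6,5]):

```python
import numpy as np

# toy coding matrix: 4 classes x 6 dichotomies, entries in {-1, +1};
# every pair of rows differs in 4 of the 6 positions
CODE = np.array([[+1, +1, +1, -1, -1, -1],
                 [+1, -1, -1, +1, +1, -1],
                 [-1, +1, -1, +1, -1, +1],
                 [-1, -1, +1, -1, +1, +1]])

def ecoc_decode(dichotomy_outputs):
    """Pick the class whose code word is closest (Hamming distance on
    the signs) to the vector of dichotomy decisions."""
    signs = np.sign(dichotomy_outputs)
    distances = np.sum(CODE != signs, axis=1)
    return int(np.argmin(distances))
```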
In this pioneering work, the set of dichotomies was designed a priori, i.e. without looking at the data. The drawback of this approach is that each dichotomy may gather classes very far apart and is thus likely to be hard to learn. Our contribution to this field [8] was to elaborate algorithms constructing the decomposition matrix a posteriori, i.e. by taking into account the organization of the classes in the input space as well as the classification method used to learn the dichotomies. Thus, once again, the approach is immediately applicable with SVMs.
The algorithm constructs the decomposition matrix iteratively, adding one column (dichotomy) at a time. At each iteration, it chooses a pair of classes $(\omega_k, \omega_{k'})$ at random among the pairs of classes that are so far the least discriminated by the system. A classifier (e.g. a SVM) is trained to separate $\omega_k$ from $\omega_{k'}$. Then, the performance of this classifier is tested on the other classes, and a class $\omega_l$ is added to the dichotomy under construction as a positive (resp. negative) class if a large part of it is classified as positive (resp. negative). The classifier is finally retrained on the augmented dichotomy. The iterative construction is complete either when all the pairs of classes are sufficiently discriminated or when a given number of dichotomies is reached.
Although each of these general and robust decomposition techniques is applicable to SVMs and should in any case be preferred to the one-per-class decomposition, they do not solve the normalization problem. When choosing a general decomposition scheme composed of L dichotomies providing a mapping from the input space $\Omega$ into $\{-1,+1\}^L$ or $\mathbb{R}^L$, one also has to select a mapping $m : \mathbb{R}^L \to \mathbb{R}^K$, called the reconstruction strategy, on which the $\arg\max_k$ operator will finally be applied.
Among the large set of possible reconstruction strategies explored in [9], one distinguishes the a priori reconstructions from the a posteriori reconstructions. In the latter, the mapping $m$ can be basically any classification technique (neural networks, decision trees, nearest neighbor, etc.). It is learned from new data and thus solves the normalization problem.
Reconstruction mappings $m$ composed of L SVMs have also been investigated in [9] and provided excellent results, especially for degree 2 and 3 polynomial kernels. Note that in this case, the normalization problem occurs again at the output of the mapping $m$, and in our
experiments we cope with it using the normalization factors $\pi_w^l$, $l = 1,\ldots,L$.
When the decomposition scheme is constructed iteratively by the algorithm described above and the reconstruction mapping is based on SVMs, a considerable amount of computation time can be saved as follows. At the end of each iteration constructing a new dichotomy, the mapping $m$ must be elaborated based on the current number of dichotomies, say L, in order to determine (in the next iteration) the pair of classes $(\omega_k, \omega_{k'})$ for which the global classifier makes the worst confusion. But the optimal mapping $m : \mathbb{R}^L \to \mathbb{R}^K$ has some similarities with the mapping $m' : \mathbb{R}^{L-1} \to \mathbb{R}^K$ constructed at the previous iteration. It has been observed that the quadratic program determining the $l$-th SVM of the mapping $m$ is solved much faster when initialized with the optimal solution (the $\alpha$'s indicating the support vectors and their weights) of the quadratic program corresponding to the $l$-th SVM of the mapping $m'$.

7 Conclusions

In this paper, the problem of normalizing the outputs of several SVMs, for the sake of comparison, is highlighted. Different normalization techniques are proposed and experimented with. More elaborate methods allowing the usage of binary classifiers for the resolution of multi-class classification problems are briefly presented. The experimentation of these approaches with SVMs as well as with other learning techniques is a large-scale ongoing work and will be presented in the final version of this paper.

References

1. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Conference on Learning Theory, COLT'92, pages 144-152, 1992.
2. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
3. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, to appear. Available at http://svm.research.bell-labs.com/SVMdoc.html.
4. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
5. T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
6. T. G. Dietterich and G. Bakiri. Error-correcting output codes: a general method for improving multiclass inductive learning programs. In Proceedings of AAAI-91, pages 572-577. AAAI Press / MIT Press, 1991.
7. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.
8. E. Mayoraz and M. Moreira. On the decomposition of polychotomies into dichotomies. In D. H. Fisher, editor, The Fourteenth International Conference on Machine Learning, pages 219-226, 1997.
9. A. Merchan and E. Mayoraz. Combination of binary classifiers for multi-class classification. IDIAP-Com 02, IDIAP, 1998. Paper 22 in the Proceedings of Learning'98, Madrid, September 1998, http://learn98.tsc.uc3m.es/~learn98/papers/abstracts.
10. C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. Machine-readable data repository http://www.ics.uci.edu/~mlearn/mlrepository.html, Irvine, CA: University of California, Department of Information and Computer Science, 1998.
11. M. Moreira and E. Mayoraz. Improved pairwise coupling classification with correcting classifiers. IDIAP-RR 9, IDIAP, 1997. To appear in the Proceedings of the European Conference on Machine Learning, ECML'98.
12. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
13. B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 252-257. AAAI Press, 1995.
14. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
Self-Organizing Yprel Network Population for Distributed Classification Problem Solving

Emmanuel STOCKER, Arnaud RIBERT, Yves LECOURTIER

Université de Rouen, Lab. PSI-LA3i, UFR des Sciences et Techniques,
F-76821 Mont Saint Aignan cedex, France
e-mail: Emmanuel.Stocker@univ-rouen.fr

Abstract: This paper deals with a new scheme of distributed classifier based on a particular formal neuron named "yprel". The main characteristics of the proposed approach are: (i) a classifier is a set of interconnected and cooperating networks; (ii) the distributed resolution strategy emerges from the individual network classification behaviors during the incremental building phase of the classifier; (iii) each neuron is able to come to classification decisions about some elements and to communicate them; (iv) the network architectures and the interconnection links between the networks are not chosen a priori, but organize themselves thanks to an incremental and competitive learning between the decision-making neurons.

I. INTRODUCTION

One of the main problems raised by any pattern recognition work is solving the classification task. Nowadays, neural methodologies have proved their ability to treat complex learning and classification problems, but some critical points remain unsolved. Two of them are dealt with in this paper: (i) how to build a network architecture adapted to a given task; (ii) how to share out a complex problem resolution among a set of cooperating networks. These points have been acknowledged as key problems by several authors during the last years [1], [2], [3], [4], [5].

In this paper, we describe the main principles of the yprel methodology, which puts forward the self-organization of a distributed solution for supervised classification problems. The two following sections detail the formal neuron used and the way to determine a network architecture adapted to a given goal. The third section describes the automatic task decomposition and the distribution process which emerge from the learning phase. Preliminary results obtained for handwritten digit recognition on the NIST database are given in the last section.

II. THE YPREL: A DECISION-MAKING NEURON

This section presents a particular neuron named "yprel", the abbreviated form of "Y-PRocessing-ELement". The "Y" character symbolizes the neuron structure, which possesses at most two inputs and one output. Figure 1 gives an example of an yprel network. Two kinds of neurons can be distinguished: the one-input neurons, linked to the real features extracted from the shape to identify, and the standard neurons with their two inputs. In an yprel network, each neuron output can be connected to one or several other neurons, without any layer restraint for the global structure.

Fig. 1. Example of a yprel network (input features on the left, network output on the right).

The aim of the yprel methodology is to provide a simple decision-making neuron associated with a given class. In the proposed approach, each network of the distributed classifier is linked to a particular class, and each neuron is able to take some classification decisions. Thus, according to the element to identify and its own classification task, a yprel of a particular network will be able to:

- recognize the element as belonging to the concerned class;
- reject the element as not belonging to the concerned class;
- give no conclusion about this element.

When a recognition or rejection decision is taken by a yprel, it is simply transmitted to the following neurons in the network. This decision propagation scheme allows very fast learning of each neuron: each standard yprel only tries to reduce the set of non-decided elements coming from its two parents. This mechanism leads to a clear encoding of the information computed by the neurons: each neuron normalizes its output values by following the same decision encoding scheme.

The role of the one-input yprels is not only to normalize the data coming from the feature vector; they are also able to take some classification decisions. Each extracted feature is linked to a particular neuron, which uses a linear separation to determine two homogeneous decision domains on both sides of a remaining non-decided area. This decision scheme is illustrated in figure 3.

Fig. 3. One-input yprel decision function.

Different decisions can be formulated outside the interval [Fmin, Fmax] according to the initial data distribution on the feature. The final output encoding requires knowing precisely what kind of decision is taken on each side of this interval. The supervised learning finds the two frontiers Fmin, Fmax of the mixed zone and the corresponding decision domain labels (recognition or rejection). A reduced set of simple

rules allows them to be deduced from the respective limits of the studied class elements and the other classes' elements. Only the six cases listed in figure 4 are possible.

Fig. 4. Decision domains determination. Legend: d0, d1: homogeneous decision domain labels, (2) recognition, (-1) rejection; filled marks: studied class elements; open marks: other classes' elements; Fmin, Fmax: limits of the mixed zone (non-decision area).

In order to improve the generalization performance, the length of the non-decision interval [Fmin, Fmax] is increased by a few per cent. The one-input yprel simulation is done by computing the following function:

$$y_p = k_1\, x + k_2$$

with x the extracted feature component. The two parameters k1, k2 are determined during the learning phase; they normalize the enlarged interval [Fmin, Fmax] onto the segment [0,1]. Thus, if (yp < 0) or (yp > 1), the one-input yprel can directly conclude about the element. These decision values are then updated according to the respective domain labels d0, d1 in order to respect the output encoding used. The final yp output becomes nonlinear and is defined by the following rules:

if (yp < 0 and d0 = -1) then yp = yp      ⇒ rejection decision.
if (yp < 0 and d0 = 2)  then yp = 2 - yp  ⇒ recognition decision.
if (0 ≤ yp ≤ 1)         then yp = yp      ⇒ no decision.
if (yp > 1 and d1 = 2)  then yp = 1 + yp  ⇒ recognition decision.
if (yp > 1 and d1 = -1) then yp = 1 - yp  ⇒ rejection decision.
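The decision rule above can be summarized by the following sketch (a minimal Python illustration with hypothetical names; the frontiers fmin, fmax and the domain labels d0, d1 are the learned quantities, here passed in directly, and this is not the authors' implementation):

```python
def one_input_yprel(x, fmin, fmax, d0, d1):
    """One-input yprel decision rule. fmin, fmax: learned frontiers of
    the (already enlarged) mixed zone; d0, d1: domain labels with
    -1 = rejection and 2 = recognition. Illustrative sketch only."""
    # k1, k2 normalize the enlarged interval [fmin, fmax] onto [0, 1].
    k1 = 1.0 / (fmax - fmin)
    k2 = -fmin / (fmax - fmin)
    yp = k1 * x + k2
    if yp < 0:                               # element in the d0 domain
        return yp if d0 == -1 else 2 - yp    # rejection / recognition
    if yp > 1:                               # element in the d1 domain
        return 1 + yp if d1 == 2 else 1 - yp # recognition / rejection
    return yp                                # inside [0, 1]: no decision
```

With this encoding, a value below 0 signals a rejection, a value above 2 a recognition, and a value inside [0, 1] a non-decision to be resolved by the following neurons.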

The standard yprel's role is also to perform, in its own input space, a kind of linear separation between the studied class elements and all the others. First, it ensures the propagation mechanism for all the decisions coming from its two inputs and verifies their coherence. Let yp be a standard yprel with two inputs e0 and e1, which refer to other neuron outputs. The yp input values depend on the encoding scheme used, while the data distribution in this space is due to the classification behaviors of e0 and e1. The input space of the standard yprel is illustrated in figure 5. The encoding used makes it possible to differentiate in this space several regions which correspond to the different decision-making situations.

The decisions provided by the two inputs e0, e1 are compatible if the two sub-nets have come to the same conclusion. In this case, the corresponding decision is simply transmitted to the following yprels through the yp neuron output.

Fig. 5. Standard yprel input space. Fig. 6. Standard yprel decision function.

The two input values are also compatible if only one input ei provides a decision and the other ej gives no conclusion (i ≠ j; i, j ∈ {0,1}). In this case, the yp neuron output takes the value of the classification decision. This situation ensures a real cooperation process between the two sub-nets by taking into account the complementarity of the decisions coming from the two inputs.
When no decision is taken by the two inputs (0 ≤ e0 ≤ 1 and 0 ≤ e1 ≤ 1), the standard yprel calculates a linear combination of its inputs. This combination determines the direction of a projection straight line, which again makes it possible to find two homogeneous decision domains on both sides of a remaining non-decided area. Figure 6 illustrates this standard neuron decision mechanism.

Once the projection direction is found, the standard neuron is determined exactly like the one-input yprel. The supervised learning deduces the two frontiers Fmin, Fmax of the mixed zone and the corresponding decision domain labels d0, d1 from the respective element positions on the projection straight line, as described in figure 4. The obtained non-decision interval is also enlarged and normalized onto the segment [0,1], and the same rules perform the final output encoding. The standard neuron simulation is done by computing:

$$y_p = k_1\,( a_0\, e_0 + a_1\, e_1 ) + k_2$$

with e0, e1 the two input components, a0, a1 the synaptic weights, and k1, k2 the normalization parameters. The expression (a0.e0 + a1.e1) computes the orthogonal projection of the element (e0, e1) on a straight line whose direction vector components are (a0 = cos μ, a1 = sin μ), μ being the angle between the projection straight line and the horizontal axis.

The use of only two inputs makes it possible to search for the projection direction exhaustively, by varying the angle μ in 1-degree steps. For each value of μ, we determine the two decision domains according to the element projections on the corresponding straight line. The number of decisions inside one domain acts as a selection criterion to

find the final direction. The selected straight line corresponds to the domain which provides the greatest possible number of decisions. To ensure a better generalization behavior, we only keep a domain for a given straight line if its number of decisions is above a certain threshold determined during the learning phase. Otherwise, the corresponding frontier is moved and the normalization parameters are updated to include this domain in the non-decision area. This frontier shifting mechanism is illustrated in figure 7; a schematic sketch of the exhaustive search procedure is given after the figure.

Fig. 7. The frontier shifting mechanism.
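The exhaustive 1-degree search can be illustrated as follows (a simplified Python sketch, not the authors' code: the six frontier cases of figure 4 are collapsed into counting the elements that fall outside the mixed zone, taken here as the overlap of the two class intervals on the projection line):

```python
import math

def best_projection(points, labels):
    """points: (e0, e1) pairs of undecided inputs in [0, 1];
    labels: True for studied-class elements, False for the others.
    Returns the angle (in degrees) whose decision domains decide the
    most elements, together with that number of decisions."""
    best_deg, best_decided = 0, -1
    for deg in range(180):                     # 1-degree steps for mu
        mu = math.radians(deg)
        a0, a1 = math.cos(mu), math.sin(mu)
        proj = [a0 * e0 + a1 * e1 for e0, e1 in points]
        pos = [p for p, lab in zip(proj, labels) if lab]
        neg = [p for p, lab in zip(proj, labels) if not lab]
        if not pos or not neg:
            continue
        lo = max(min(pos), min(neg))   # lower frontier of the mixed zone
        hi = min(max(pos), max(neg))   # upper frontier of the mixed zone
        decided = sum(1 for p in proj if p < lo or p > hi)
        if decided > best_decided:
            best_deg, best_decided = deg, decided
    return best_deg, best_decided
```

In the full method, the threshold test and the frontier shifting of figure 7 would then be applied to each candidate domain before it is validated.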

If no domain is finally kept, we select the projection direction which minimizes the length of the non-decision interval computed before the frontier movements. This case has to be considered since a standard neuron can be very efficient simply by combining the decisions coming from two complementary sub-nets. The same shifting mechanism is used to validate the decision domains of the one-input yprels. Its role is very important for the generalization behavior: it eliminates from the network all the non-representative decision domains which contain only a few elements of the learning set.

We have described the formal neuron used in the yprel networks and the way it is able to take some decisions according to a given class. Nevertheless, no hypothesis has been made about the network structure. A network has to combine the decision abilities of several yprels to solve its particular classification problem. The aim of the following section is to present the method used to obtain a self-organizing architecture adapted to each network goal.

III. A SELF-ORGANIZING NETWORK STRUCTURE

In the yprel methodology, the network structure is not chosen a priori; it is determined step by step during the learning process. The learning strategy combines the main advantages of incremental building methods with the competition process of genetic algorithms, but without requiring a gene encoding phase. This learning algorithm is based on a cooperative and competitive process between the deciding neurons.

The first step of the network building phase consists in creating the set of all the one-input yprels linked to the extracted features. Then, we incrementally generate a set of standard yprels. Each generated neuron becomes the terminal element of a particular sub-net which takes some classification decisions. All the created sub-nets attempt to reach the

same goal, but each determines its own solution space owing to the decision distribution and the propagation mechanism. Thus, each sub-net becomes a potential candidate for the particular problem to solve. The total number of decisions taken inside each sub-net acts as a selection criterion to find the winning sub-net. Only the sub-net with the best performance will be kept as the final network. This competition principle is illustrated in figure 8.

Fig. 8. The yprel competition process.

With this competitive learning algorithm, the parameters of each yprel are calculated only once, when the neuron is used as the terminal element of a new sub-net; they are never modified later. The calculations necessary to generate a new sub-net are thus limited to evaluating the parameters of a single yprel. The computation cost of generating one standard neuron decreases as we progress in building the network structure, since we only use the remaining non-decided elements to determine its parameters. The learning procedure being very fast, it allows a great number of candidate neurons to be generated, testing different possible architectures simultaneously. This competition mechanism is essential to ensure the search for a "good" solution in the combination space. It limits overlearning phenomena, since it does not always extend the same network structure: a biased or overfitted architecture can be completely given up if other tested combinations lead to better solutions.

The main problem raised by this learning strategy is the way to choose the two inputs of a new yprel. The selection rule must avoid systematically exploring all the possible combinations, since the total number of different network structure possibilities grows exponentially with the number of created neurons. The selection rule must highly constrain the competition process to ensure the system convergence and to limit the combinatorial explosion. At the same time, it has to remain sufficiently flexible to get out of a biased solution by testing other network structure possibilities. The proposed selection rule has this double-acting behavior.

At first, we use the individual standard neuron efficiency, which measures the number of decisions taken by the neuron itself in comparison with the best of its two parents. During the generation phase, only the yprels with individual efficiencies above a certain threshold are kept as potential candidates. The others are eliminated and the corresponding sub-net combinations are marked so that they are never tested again. The same threshold value is used to validate a decision domain and to keep a candidate. This threshold mechanism strongly limits the combinatorial explosion by keeping only the most reliable candidates, those which really make the problem solving progress by a certain number of decisions compared to their two parents.

During the generation phase, the system updates a two-dimensional array which sorts the selected population according to the performance (number of decisions) and the

size (number of yprels) of the obtained sub-nets. Each input of a new yprel is chosen by performing two biased random draws from this table. The first one determines the size category of the candidates; the second selects one element among these candidates. Each pseudo-random selection is made by computing the following function, which returns the position of the selected element in the considered list:

$$(n - 1)\cdot r^{\,tmp+1}$$

with n the number of elements in the list, r a random value in [0,1], and tmp a temperature parameter. This function associates a selection probability with each element according to its position in the ordered list. These probabilities follow an exponential law whose curvature is set by the temperature parameter. The first pseudo-random selection favours the candidates with the smallest sizes and the one-input yprel selection, while the second increases the chances of the most efficient sub-nets for a given size. The random component of this selection function allows new neuron combinations to be tested and local minima to be avoided, while the bias component constrains the process to ensure the incremental structure building and the system convergence. In the proposed approach, the two temperature parameters have been fixed experimentally at satisfactory values, close to a breadth-first search which favours the neuron competition more than the incremental building scheme. This is possible since the convergence of our system is mainly due to the threshold mechanism which selects the candidates and to the strategy used to distribute the problem solving.
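Under the reconstruction of the selection function adopted above (which is an assumption, given the state of the source), the biased draw can be sketched as:

```python
import random

def biased_pick(n, tmp):
    """Pseudo-random position in an ordered list of n elements,
    computing position = (n - 1) * r**(tmp + 1) with r uniform in
    [0, 1]. Larger tmp biases the draw more strongly toward the
    front of the list while keeping every position reachable."""
    r = random.random()
    return int(round((n - 1) * r ** (tmp + 1)))
```

Applied to a list ordered by size (or, for the second draw, by number of decisions), the front positions are favoured, which matches the behavior described in the text.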

During the standard neuron generation phase, this biased random selection is applied to the ordered list of the one-input yprels. Only the input features used by the sub-net winning the competition process will be present in the final network architecture. In this way, the yprel methodology provides an automatic structure determination combined with an active input feature selection adapted to each particular network goal.

IV. EMERGENCE OF A DISTRIBUTED PROBLEM SOLVING

Unfortunately, the first trials on real complex problems with just one network per class showed all the difficulties of finding satisfactory network structures, and the importance of the threshold value applied to the number of decisions taken by each yprel. A high value limits the number of candidates by selecting only the most reliable neurons, but does not allow all the learning base elements to be classified. On the contrary, a small value increases the number of neurons with non-representative decision domains, leading to overfitting phenomena. Moreover, these trials showed that the threshold values had to differ according to the treated classes and the problems to solve, that is to say, an adaptive parameter is absolutely necessary.

The maximum possible value for this parameter corresponds to a decision domain which contains all the elements not belonging to the network class (i.e. the concerned class is linearly separable from the others inside a single yprel). The neuron generation starts with this maximum value. After a certain number of trials, if no standard yprel is kept, the threshold value is decreased. The process is repeated until the system can select some candidates within the number of available trials. The generated network takes classification decisions about a part of the learning data. All the non-decided elements

make up the learning base of a new network linked to the same class. This one reduces the non-decision number again, and so on. Figure 9 illustrates this strategy based on successive learning base filterings. This task decomposition depends on the created networks and their individual classification behaviors. It is entirely linked to the self-organization of the network structures and thus emerges from the learning phases.

Fig. 9. Intra-class task decomposition (the initial learning base is successively filtered by the networks of a class).

This strategy based on the emergence of the task decomposition is applied to all classes. Each class will be treated by a different number of networks. The networks linked to a particular class have to come to rejection decisions about other classes' elements. At the same time, the networks associated with other classes try to reach recognition decisions about the same elements. The different network goals can therefore be treated in an interacting way to simplify the learning phase and the structure of a particular network under construction. To obtain this inter-class cooperation scheme, the whole classifier is built layer by layer. One layer represents one step of each intra-class decomposition. Inside a layer, the networks are built class after class. A given layer will contain at most one network per class, and fewer if certain class treatments have been achieved in the previous layers. A network under construction can then consider the outputs of all the previously built networks as a supplementary set of possible input features. The proposed methodology leads to a real self-organization of the internal cooperation links between the networks, since they emerge from the learning phase thanks to the selection rule working on this new set of one-input yprels. Figure 10 gives an example of a possible distributed architecture for a 3-class problem c0, c1, c2, where the networks are built in the class order (0,1,2) and the classifier layer after layer. A schematic sketch of this layer-by-layer building loop is given after the figure.

Fig. 10. Example of distributed strategy (dashed arrows: intra-class task decomposition; solid arrows: internal feature cooperation links).
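The overall construction can be summarized by the following control-flow sketch (in Python; the helper passed as train_network stands for the whole network-building process of section III and is a placeholder, so this is a schematic view rather than the authors' implementation):

```python
def build_classifier(classes, learning_bases, train_network, max_layers):
    """Layer-by-layer construction of the distributed classifier.

    learning_bases maps each class to its current (filtered) learning
    base; train_network(c, base, extra_features) is assumed to build
    one yprel network for class c, possibly reusing the outputs of
    previously built networks as supplementary input features."""
    built = []                           # all previously built networks
    for layer in range(max_layers):
        for c in classes:                # at most one network per class
            base = learning_bases[c]
            if not base:                 # class already fully treated
                continue
            # Inter-class cooperation: earlier networks' outputs become
            # candidate one-input features for the new network.
            net = train_network(c, base, extra_features=built)
            built.append(net)
            # Intra-class decomposition: only the elements left
            # undecided feed the next network of the same class.
            learning_bases[c] = [x for x in base if not net.decides(x)]
    return built
```

The filtering step implements the successive learning base reductions of figure 9, while the extra_features argument carries the internal cooperation links of figure 10.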



V. PRELIMINARY RESULTS

Some preliminary results have been obtained with the previous methodology on the classical problem of handwritten digit recognition. The learning base contains 5,000 elements, constituted by the first 500 examples of each class extracted from the NIST database without any sorting of the population, while the test base contains 53,383 digits. A feature vector of 124 components has been extracted from each character. This vector contains classical features in the field of OCR (invariant moments, number and position of intersections with vertical and horizontal straight lines, projection profiles, etc.). The total recognition rate is 95.5% for a 4.5% error rate, which is reasonable given the size and the unsorted population of the learning base.

The whole classifier structure is shared out among a set of 117 networks, for a total number of 445 neurons. The average network size is 3.8 yprels. The smallest networks are made of only 3 neurons (1 standard yprel combining 2 one-input yprels); the biggest has 9 neurons. The different cooperation links are summarized in the following table. The number of networks per class illustrates the intra-class task decomposition. The whole system contains 112 inter-class cooperation links corresponding to the generation of internal features (i.e. the concerned one-input yprels are linked to other network outputs).

class               0   1   2   3   4   5   6   7   8   9
nb networks         7  11  15  14  10  15   6  10  18  11
nb neurons         26  43  57  60  46  49  26  38  64  36
internal features   4  10  20  16   3  12   6   8  24   9

We can notice that the strategy used generates a vast number of very small networks linked by the two cooperation schemes. The global classifier can thus be compared to a single network which contains very localized areas dedicated to explicit treatments of each class.

Except for the first network of each class, which works on the whole learning base, all the created networks only try to reduce previously filtered bases. There is thus a real reduction of the task to solve for each class as we progress through the classifier layers. Figure 11 indicates the decision rates obtained for each class inside the first layers. The simulated elements will not use the same network sub-sets, according to the places where the recognition and rejection decisions are taken for each class. Figure 12 gives the percentages of networks used.

Fig. 11. Decision rates by class. Fig. 12. Percentages of networks used.
We have seen that the elements do not use the same network sets according to the decisions made. As each network selects its own input features, different elements will use different input feature sub-sets. Thus, in the yprel methodology, the input feature extraction could be managed by the element itself according to the networks to simulate. The two following figures illustrate the use of the extracted input features. The first one gives the number of input features used by the elements, while the second shows the use frequencies of the extracted features.

Fig. 13 a-b. The use of the extracted input features.

VI. CONCLUSION
This paper shows that good performance can be obtained for supervised classification problems by using a completely self-organizing yprel network population. The proposed methodology can be very useful for large-scale problems (high numbers of classes, elements and input features) thanks to the distributed resolution strategy, which emerges from the learning phase and allows a partial simulation of the generated system and an active input feature extraction according to the networks to simulate. A way to restart the learning phases with new elements by freezing some reliable networks is currently under study, in order to obtain a real self-organized incremental data learning scheme.

References
[1] R.K. Powalka, N. Sherkat, R.J. Whitrow, (1996), "Multiple recognizer combination topologies", in Handwriting and drawing research: basic and applied issues, M.L. Simner, C.G. Leedham, A.J.W.M. Thomassen (Eds.), IOS Press, pp. 329-342.
[2] C. Jutten, (1995), "Learning in evolutive architecture: an ill-posed problem?", Proc. IWANN'95, From natural to artificial neural computation, J. Mira, F. Sandoval (Eds.), Lecture Notes in Computer Science No. 930, Springer-Verlag, pp. 361-373.
[3] D. Ackley, M. Littman, (1991), "Interactions between learning and evolution", Artificial Life II, SFI studies in the sciences of complexity, vol. X, Addison-Wesley, pp. 487-509.
[4] J. Sietsma, R.J.F. Dow, (1991), "Creating artificial networks that generalize", Neural Networks, Vol. 4, No. 1, pp. 67-79.
[5] S.A. Harp, T. Samad, A. Guha, (1989), "Towards the genetic synthesis of neural networks", Proc. 3rd Int. Conf. on Genetic Algorithms, pp. 360-369.
An Accurate Measure for Multilayer Perceptron
Tolerance to Additive Weight Deviations

Jose L. Bernier, J. Ortega, M.M. Rodríguez, I. Rojas, A. Prieto

Dpto. Arquitectura y Tecnología de Computadores
Universidad de Granada, E-18071 Granada (Spain)

Abstract. The inherent fault tolerance of artificial neural networks (ANNs) is usually assumed, but several authors have claimed that ANNs are not always fault tolerant and have demonstrated the need to evaluate their robustness by quantitative measures. For this purpose, various alternatives have been proposed. In this paper we show the direct relation between the mean square error (MSE) and the statistical sensitivity to weight deviations, defining a measure of tolerance based on statistical sensitivity that we have called Mean Square Sensitivity (MSS); this allows us to accurately predict the degradation of the MSE when the weight values change, and so constitutes a useful parameter for choosing between different configurations of MLPs. The experimental results obtained for different MLPs are shown and demonstrate the validity of our model.

1 Introduction

Based on the analogy between Artificial Neural Networks (ANNs) and natural ones, it is frequently assumed that ANNs are inherently tolerant to faults. This assumption has been shown to be false [1,2]; consequently, some configurations of weights for a fixed ANN structure are more tolerant than others, although they provide similar performance with respect to learning ability. Thus it would be useful to have an accurate means of measuring this tolerance.
Different measurements have been proposed. In [3] the probability of errors in Multilayer Perceptrons (MLPs) that use the simple-step function as activation is used to study the tolerance of such structures, and in [4] the study is extended to neurons with a multiple-step activation function. The authors found that the probability of error is not affected by an increase in the number of neurons per layer.
In [1] a simulation-derived quantitative measure making use of the worst case hypothesis to consider the number of neurons that present a stuck-at fault is used, while in [2] a procedure based on the replication of hidden units is proposed to achieve fault-tolerant ANNs, providing metrics for tolerance as a function of redundancy.
The authors of [5] show that the insertion of synaptic noise during the learning process increases the fault tolerance of ANNs. They also showed that enlarging the networks does not improve fault tolerance at all, but on the contrary makes them more susceptible to faults.

Hence, it is clear that the supposed fault tolerance of MLPs is relative, and that the degree of fault tolerance can and should be measured, as is the learning performance. This becomes significant when the learning process is performed in media different from where the MLP is implemented, as there may be differences in the accuracy used to store the values of the weights or other magnitudes. These differences may seriously degrade the performance of the MLP. A discussion of hardware error sources is presented in [5].
Choi et al. [7] proposed statistical sensitivity as a measure of tolerance in MLPs, and thus as a criterion for selecting a configuration of weights from different alternatives presenting similar performance with respect to learning. Statistical sensitivity measures the output changes of the MLP when the values of the weights change, a low value implying less degradation of the learning performance. This fact is implicitly proved by the results obtained.
In this work, an explicit relation between the statistical sensitivity to weight deviations and the mean square error (MSE) between the desired and obtained outputs of the MLP is shown, such that it is possible to accurately predict the degradation of the MSE for a particular value of this deviation. A new figure that we call Mean Square Sensitivity (MSS) is proposed as a measurement of the fault tolerance to weight deviations. The MSS can be easily computed after training and is shown to constitute a good measurement of the tolerance of the network.
The statistical sensitivity can be computed locally for each neuron, so different alternatives arise for future work and research; for example, the study of tolerance to weight perturbations of particular elements in the network, or the modification of the training algorithm in order to reduce the MSS. Other measures proposed affect the whole network [1,2,3,4] and so their use to study tolerance is more limited. Furthermore, as the MSS is directly related to the MSE degradation, an accurate and quantitative measure of MLP tolerance is obtained, without making use of any hypotheses.

2 The statistical sensitivity of a multilayer perceptron

A multilayer perceptron (MLP) is composed of M layers, where each layer m (m = 1...M) has N_m neurons. Neuron i of layer m is connected to the N_{m-1} neurons of the previous layer by a set of weights w_{ij}^m (j = 1, ..., N_{m-1}).
The output y_i^m of a neuron i belonging to layer m is a function f_i^m (activation function) of the weighted sum z_i^m of the outputs coming from the neurons in the previous layer:

$$y_i^m = f_i^m(z_i^m) = f_i^m\left( \sum_{j=1}^{N_{m-1}} w_{ij}^m\, y_j^{m-1} \right) \qquad (1)$$

where y)"~ constitutes the outputs of the N,,.I neurons of the previous layer. Specifically,
yi~ ..... N()) are the inputs to the network.
During learning, the weights are adapted in order to minimize the MSE by
using a gradient descent algorithm known as backpropagation. Depending on the initial
values of the weights and other parameters of the algorithm, different values are reached
after training. These posible solutions may present a similar MSE, but they differ with
respect to the fault tolerance obtained. Fault tolerance is related to a unilbrm
distribution of learning, such that the saliency of the weights is regular [6].
The values of the weights in an electronic implementation can be changed if
a circuit presents a defect. Moreover, the backpropagation algorithm is often executed
in a general purpose computer and the weights obtained are physically implemented;
it is thus possible Ibr differences in the loaded values to occur. The statistical
sensitivity Sim allows us to measure the degradation of the expected output of a neuron
i in layer m in a quantitative way when the values of the weights change. The statistical
sensitivity is defined in [7] by the following expression:

$$S_i^m = \lim_{\sigma \to 0} \frac{\sqrt{\operatorname{var}(\Delta y_i^m)}}{\sigma} \qquad (2)$$

where σ represents the standard deviation of the changes in the weights, and var(Δy_i^m) is the variance of the deviation in the output (with respect to the output of the MLP in the absence of perturbations) due to these changes; it can be computed as:

$$\operatorname{var}(\Delta y_i^m) = E[(\Delta y_i^m)^2] - (E[\Delta y_i^m])^2 \qquad (3)$$

where E[.] is the expected value of [.].

Statistical sensitivity is fundamentally different from the output sensitivity used in [6]. The output sensitivity is defined as the derivative of the output of the neuron with respect to the value of a weight, and so it measures the dependence of the output on this particular weight value. The statistical sensitivity measures the variation range of the output of a neuron due to changes in the weights.
To compute expression (2), an additive model of weight deviations is assumed that satisfies:

(a) E[Δw_{ij}^m] = 0
(b) E[(Δw_{ij}^m)^2] = σ^2
(c) E[Δw_{ij}^m Δw_{i'j'}^{m'}] = 0 if i ≠ i' or j ≠ j' or m ≠ m'

i.e. each weight w_{ij}^m is changed to w_{ij}^m + δ_{ij}, where δ_{ij} is a random variable with average equal to zero and variance equal to σ². Moreover, perturbations of different weights are supposed not to be statistically correlated.
If the deviations of the weights of a neuron i in layer m are small enough, the corresponding deviation in the output can be approximated as:

N_I av" ,q,, Ill N _I


I~'IA
~y,"~ ~ ("'~ iii
~w o + ..~ iii - |
,,yj ) : ay, tYi zawq .i w O tayj
;11 - |
) (4)
j. I OWij vyj .i" I

where Of/m---df/"
dzi m

Proposition 1: if E[Δw_{ij}^m] = 0 ∀ i,j,m then E[Δy_i^m] = 0 ∀ i,m.

Proof 1: it will be shown by induction over m. For m = 1:

$$E[\Delta y_i^1] = E\left[ \partial f_i^1 \sum_{j=1}^{N_0} y_j^0\, \Delta w_{ij}^1 \right] = \partial f_i^1 \sum_{j=1}^{N_0} y_j^0\, E[\Delta w_{ij}^1] = 0 \qquad (5)$$

Assuming that E[Δy_i^{m-1}] = 0, for layer m:

$$E[\Delta y_i^m] = E\left[ \partial f_i^m \sum_{j=1}^{N_{m-1}} \left( y_j^{m-1}\, \Delta w_{ij}^m + w_{ij}^m\, \Delta y_j^{m-1} \right) \right] = \partial f_i^m \sum_{j=1}^{N_{m-1}} \left( y_j^{m-1}\, E[\Delta w_{ij}^m] + w_{ij}^m\, E[\Delta y_j^{m-1}] \right) = 0 \qquad (6)$$

□

Proposition 2: the statistical sensitivity to additive perturbations of a neuron i in layer m can be expressed as:

$$S_i^m = \partial f_i^m \sqrt{ \sum_{j=1}^{N_{m-1}} (y_j^{m-1})^2 + \sum_{j=1}^{N_{m-1}} \sum_{k=1}^{N_{m-1}} w_{ij}^m\, w_{ik}^m\, C_{jk}^{m-1} } \qquad (7)$$

where the terms C_{jk}^m can be recursively computed as:

$$C_{jk}^m = \begin{cases} (\partial f_j^m)^2 \left( \sum_{r=1}^{N_{m-1}} (y_r^{m-1})^2 + \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} w_{jr}^m\, w_{js}^m\, C_{rs}^{m-1} \right) & \text{if } j = k \\[2mm] \partial f_j^m\, \partial f_k^m \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} w_{jr}^m\, w_{ks}^m\, C_{rs}^{m-1} & \text{otherwise} \end{cases} \qquad (8)$$

Proof 2: making use of Proposition 1, var(Δy_i^m) = E[(Δy_i^m)^2]. The terms E[Δy_j^m Δy_k^m] ∀ m > 0 can be obtained as:

$$E[\Delta y_j^m\, \Delta y_k^m] = E\left[ \sum_{r=1}^{N_{m-1}} \partial f_j^m \left( y_r^{m-1} \Delta w_{jr}^m + w_{jr}^m \Delta y_r^{m-1} \right) \sum_{s=1}^{N_{m-1}} \partial f_k^m \left( y_s^{m-1} \Delta w_{ks}^m + w_{ks}^m \Delta y_s^{m-1} \right) \right]$$
$$= \partial f_j^m\, \partial f_k^m \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} \left( y_r^{m-1} y_s^{m-1}\, E[\Delta w_{jr}^m \Delta w_{ks}^m] + w_{jr}^m w_{ks}^m\, E[\Delta y_r^{m-1} \Delta y_s^{m-1}] \right.$$
$$\left. +\, y_r^{m-1} w_{ks}^m\, E[\Delta w_{jr}^m \Delta y_s^{m-1}] + y_s^{m-1} w_{jr}^m\, E[\Delta w_{ks}^m \Delta y_r^{m-1}] \right) \qquad (9)$$

where the last two expectations vanish, since the weight perturbations are not correlated with the output deviations of the previous layer.

If C_{jk}^m is defined through σ² C_{jk}^m = E[Δy_j^m Δy_k^m] and the model for perturbations considered is applied in (9), it is straightforward to obtain equation (8). The initial condition for (8) is that C_{jk}^0 = 0 ∀ j,k, as the inputs y_i^0 are supposed to be free of errors.
At this point, taking into account that E[(Δy_i^m)^2] = σ² C_{ii}^m and substituting in (2), Proposition 2 is proved.
□

In the particular case of MLPs with only one hidden layer, expression (7) can be computed in a more compact form:

$$S_i^m = \partial f_i^m \sqrt{ \sum_{j=1}^{N_{m-1}} \left[ (y_j^{m-1})^2 + (w_{ij}^m\, S_j^{m-1})^2 \right] } \qquad \forall i,\ \forall m = 1, 2 \qquad (10)$$

taking into account that the statistical sensitivity of the inputs is zero, i.e., S_i^0 = 0 ∀ i.
In this way, as the above expression (10) shows, the computation of the statistical sensitivity can be performed in a relatively easy way for each neuron of the MLP and for each input pattern in the case of MLPs with one hidden layer, because there are no cross terms as in expression (7).
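A direct transcription of expression (10) can illustrate this computation (a Python sketch under the assumptions stated: one hidden layer, logistic activations, bias inputs omitted for brevity; the names are illustrative and this is not the authors' code):

```python
import numpy as np

def sensitivities_one_hidden(x, W1, W2):
    """Statistical sensitivities of expression (10) for one input
    pattern x; W1, W2 are the hidden and output weight matrices
    (one row per neuron)."""
    y1 = 1.0 / (1.0 + np.exp(-(W1 @ x)))   # hidden outputs y^1
    df1 = y1 * (1.0 - y1)                  # logistic derivative df^1/dz
    y2 = 1.0 / (1.0 + np.exp(-(W2 @ y1)))  # output-layer outputs y^2
    df2 = y2 * (1.0 - y2)                  # df^2/dz
    # Layer m = 1: the weight term vanishes because S^0 = 0.
    S1 = df1 * np.sqrt(np.sum(x ** 2))
    # Layer m = 2: sum over j of (y_j^1)^2 + (w_ij^2 * S_j^1)^2.
    S2 = df2 * np.sqrt(np.sum(y1 ** 2 + (W2 * S1) ** 2, axis=1))
    return S1, S2
```

Each output-layer value S2[i] corresponds to S_i^M for the given pattern, which is exactly the quantity needed in the next section.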

3 The Mean Square Sensitivity

The goal of the backpropagation algorithm is to reduce the mean square error (MSE), which can be expressed as:

$$E = \frac{1}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} \left( d_i(p) - y_i^M(p) \right)^2 \qquad (11)$$

where N_p is the number of input patterns considered, N_M is the number of neurons in the output layer, and d_i(p) and y_i^M(p) are the desired and obtained outputs of neuron i of the output layer for the input pattern p, respectively.
If the weights of the MLP suffer any deviation, the MSE is altered; by developing expression (11) with a Taylor expansion near the nominal MSE found after learning, E_o, it is obtained that:

$$\varepsilon' \approx E_o + \frac{1}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} \left[ -2 \left( d_i(p) - y_i^M(p) \right) \Delta y_i^M(p) + \left( \Delta y_i^M(p) \right)^2 \right] \qquad (12)$$

Now, if we compute the expected value of ε' and take into account that E[Δy_i^M] = 0 and that E[(Δy_i^M)^2] can be obtained from expressions (2) and (3) as E[(Δy_i^M)^2] = σ² (S_i^M)², the following expression is obtained:

$$E[\varepsilon'] = E_o + \frac{\sigma^2}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} (S_i^M)^2 \qquad (13)$$

By analogy with the definition of the MSE, we define the following figure as the Mean Square Sensitivity (MSS):

$$MSS = \frac{1}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} (S_i^M)^2 \qquad (14)$$

The MSS can be computed from the statistical sensitivity of the neurons belonging to the output layer, as expression (14) shows. Combining expressions (13) and (14), the expected degradation of the MSE, E[ε'], can be computed as:

$$E[\varepsilon'] = E_o + \sigma^2\, MSS \qquad (15)$$

Thus, (15) shows the direct relation between the MSE degradation and the MSS. As the MSS can be directly computed after training, it is possible to predict the degradation of the MSE when the weights are deviated from their nominal values within a range with standard deviation equal to σ. Moreover, as can be observed in the expression obtained, a lower value of the MSS implies a lower value of the degradation of the MSE, so we propose using the MSS as a suitable measure of the tolerance of MLPs to weight deviations. Note that as the statistical sensitivity of a particular neuron can be computed independently, several lines of research are open to study the tolerance of particular elements or to develop new training algorithms that take into account the MSS as another term to minimize during learning. In [7] it is proposed to use the average statistical sensitivity as a criterion to select a weight configuration from

different possibilities which present similar MSE after training. However, the MSS is directly related to the MSE degradation, as expression (15) shows, and thus constitutes a better measurement of MLP tolerance against weight perturbations.
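Expressions (14) and (15) translate directly into code (a short Python sketch reusing the sensitivity function given after expression (10); patterns is assumed to be a list of input vectors, and E0 the nominal MSE after training):

```python
import numpy as np

def mean_square_sensitivity(patterns, W1, W2):
    """MSS of expression (14), averaging the squared output-layer
    sensitivities over all input patterns."""
    total = 0.0
    for x in patterns:
        _, S2 = sensitivities_one_hidden(x, W1, W2)  # S_i^M per pattern
        total += np.sum(S2 ** 2)
    return total / (2 * len(patterns))

def predicted_mse(E0, sigma, mss):
    """Expected degraded MSE of expression (15)."""
    return E0 + sigma ** 2 * mss
```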

4 Results

In order to validate expression (15), we compared the results obtained for the MSE when the MLPs are subject to additive deviations with the predicted value obtained by using this expression. Two MLPs were considered: an approximator of the sine function [8] and a predictor of the Mackey-Glass temporal series [9]. The approximator had 1 input neuron, 11 neurons in the hidden layer and 1 output neuron, and the predictor consisted of 3 input neurons, 11 neurons in the hidden layer and 1 output neuron. All the neurons considered contained a bias input. Table 1 shows the values of MSE and MSS obtained after training with the test patterns (different from those used for training).

Table 1. MSE and MSS obtained after training.

        Approximator   Predictor
MSE     0.001          0.0004
MSS     0.1922         0.1347

All the weights obtained after learning were deviated from their nominal values by using the additive model, such that each weight w_{ij}^m takes a value equal to (w_{ij}^m + δ_{ij}), where δ_{ij} is a random variable with standard deviation equal to σ and average equal to zero. Table 2 shows the values of the MSE predicted and obtained experimentally for different values of σ. For each value of σ considered, the experimental values of MSE are averaged over 100 tests, where each test consists of a random deviation of all the weights of the MLP. The confidence interval at 95% is also presented in Table 2.
Expression (15) is shown to be valid; it accurately predicts the degradation of the MSE when the weights present perturbations. It is also proven that the lower the value of the MSS, the lower the degradation of the MSE. Thus, even if a particular configuration presents a lower MSE after training, if its MSS is high this nominal MSE is strongly degraded when deviations are present, and so the MSS must be considered when a weight configuration is to be chosen.
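The Monte Carlo procedure described here can be sketched as follows (in Python; mse_fn is a placeholder mapping a weight vector to the MSE on the test patterns, so this is an outline of the experiment rather than the authors' code):

```python
import numpy as np

def empirical_degradation(mse_fn, weights, sigma, trials=100, seed=0):
    """Average MSE over `trials` random additive deviations of all
    weights (zero mean, standard deviation sigma), together with an
    approximate 95% confidence half-width, to be compared against
    the prediction E0 + sigma**2 * MSS of expression (15)."""
    rng = np.random.default_rng(seed)
    errs = [mse_fn(weights + rng.normal(0.0, sigma, size=weights.shape))
            for _ in range(trials)]
    half = 1.96 * np.std(errs, ddof=1) / np.sqrt(trials)
    return float(np.mean(errs)), float(half)
```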

Table 2. Comparison between E[ε'] predicted and E[ε'] experimental.

              Approximator                   Predictor
  σ      Predicted    Experimental      Predicted    Experimental
         (×1e-3)      (×1e-3)           (×1e-3)      (×1e-3)
 0.01    1.029        1.030 ± 0.011     0.444        0.444 ± 0.015
 0.02    1.087        1.097 ± 0.028     0.484        0.499 ± 0.032
 0.03    1.183        1.296 ± 0.094     0.552        0.593 ± 0.056
 0.04    1.317        1.363 ± 0.090     0.646        0.683 ± 0.091
 0.05    1.49         1.632 ± 0.145     0.767        1.068 ± 0.185
 0.06    1.702        1.864 ± 0.203     0.915        1.168 ± 0.264
 0.07    1.952        1.993 ±           1.09         1.274 ±
 0.08    2.24         2.218 ± 0.241     1.293        1.585 ± 0.284

Figures 1 and 2 show the degradation of the MSE for different values of σ. The values predicted and obtained experimentally are represented for the approximator and the predictor, respectively. Each experimental value is plotted with its respective confidence interval at 95%, obtained with 100 samples. As in Table 2, the predicted values of the MSE accurately fit those obtained experimentally. The matching between the predicted and the experimental values of MSE is better when the weight deviations are smaller; however, for greater deviations it constitutes an upper bound for the MSE degradation.

5 Conclusions

In this letter we have presented the relation between the mean square error (MSE) and the statistical sensitivity. As the statistical sensitivity measures the deviation in the output of an MLP when its weights are perturbed, this relation allows us to obtain a useful criterion to evaluate the fault tolerance of the network. To compare different weight configurations, we propose the use of the mean square sensitivity (MSS), which is computed from the statistical sensitivity. Lower values of MSS imply lower degradations of MSE. The results show the correctness of the expressions obtained. What distinguishes the MSS from other measures proposed to assess the tolerance of MLPs is that it is directly related to the MSE degradation; also, as the statistical sensitivity can be computed for each neuron of the MLP, new research possibilities are opened for the study of related aspects. As future work, a new backpropagation algorithm that includes the objective of minimizing the MSS jointly with the MSE will be developed, in order to obtain weight configurations that maximize fault tolerance while maintaining learning performance. As the MSS is an accurate measure of MSE degradation, the performance of such an algorithm will probably be better than that described in [10] for a similar training algorithm based on average statistical sensitivity minimization.

Figure 1. Predicted and experimental MSE for the approximator of the sine function (MSE vs. standard deviation of weight perturbations, 0.1 to 0.7).

Figure 2. Predicted and experimental MSE for the predictor of the Mackey-Glass temporal series (MSE vs. standard deviation of weight perturbations, 0.1 to 0.7).

References

[1] B.E. Segee, M.J. Carter, "Comparative Fault Tolerance of Parallel Distributed Processing Networks", IEEE Trans. on Computers, vol. 43, no. 11, pp. 1323-1329, Nov 1994.
[2] D.S. Pathak, I. Koren, "Complete and Partial Fault Tolerance of Feedforward Neural Nets", IEEE Trans. on Neural Networks, vol. 6, no. 2, pp. 446-456, Mar 1995.
[3] M. Stevenson, R. Winter, B. Widrow, "Sensitivity of Neural Networks to Weight Errors", IEEE Trans. on Neural Networks, vol. 1, no. 1, pp. 71-80, Mar 1990.
[4] C. Alippi, V. Piuri, M. Sami, "Sensitivity to Errors in Artificial Neural Networks: a Behavioral Approach", in Proc. IEEE Int. Symp. on Circuits & Systems, pp. 459-462, May 1994.
[5] P.J. Edwards, A.F. Murray, "Fault Tolerance via Weight-noise in Analogue VLSI Implementations - a Case Study with EPSILON", IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 9, pp. 1255-1262, Sep 1998.
[6] P.J. Edwards, A.F. Murray, "Can Deterministic Penalty Terms Model the Effects of Synaptic Weight Noise on Network Fault-Tolerance?", Int. Journal of Neural Systems, vol. 6, no. 4, pp. 401-416, 1995.
[7] J.Y. Choi, C. Choi, "Sensitivity Analysis of Multilayer Perceptron with Differentiable Activation Functions", IEEE Trans. on Neural Networks, vol. 3, no. 1, pp. 101-107, Jan 1992.
[8] T. Sudkamp, R. Hammell, "Interpolation, Completion and Learning Fuzzy Rules", IEEE Trans. on Systems, Man & Cybernetics, vol. 24, no. 2, pp. 332-342, Feb 1994.
[9] L. Wang, Adaptive Fuzzy Systems and Control: Design and Stability Analysis. Englewood Cliffs: Prentice Hall, 1994.
[10] J.L. Bernier, J. Ortega, A. Prieto, "A Modified Backpropagation Algorithm to Tolerate Weight Errors", Lecture Notes in Computer Science, vol. 1240, pp. 763-771, Springer-Verlag, June 1997.
Fuzzy Inputs and Missing Data in Similarity-Based
Heterogeneous Neural Networks

Lluís A. Belanche and Julio J. Valdés

Secció d'Intel·ligència Artificial.
Dept. de Llenguatges i Sistemes Informàtics.
Universitat Politècnica de Catalunya.
c/ Jordi Girona Salgado 1-3
08034 Barcelona, Spain.
{belanche, valdes}@lsi.upc.es
Phone: +34 93 401 56 44
Fax: +34 93 401 70 14

Abstract. Fuzzy heterogeneous networks are recently introduced neural network models composed of neurons of a general class whose inputs and weights are mixtures of continuous variables (crisp and/or fuzzy) with discrete quantities, also admitting missing data. These networks have net input functions based on similarity relations between the inputs and the weights of a neuron. They thus accept heterogeneous (possibly missing) inputs, and can be coupled with classical neurons in hybrid network architectures, trained by means of genetic algorithms or other evolutionary methods. This paper compares the effectiveness of the fuzzy heterogeneous model based on similarity with the classical feed-forward one, in the context of an investigation in the field of environmental sciences, namely, the geochemical study of natural waters in the Arctic (Spitzbergen). Classification performance, the effect of working with crisp or fuzzy inputs, the use of traditional scalar-product vs. similarity-based functions, and the presence of missing data are studied. The results obtained show that, from these standpoints, fuzzy heterogeneous networks based on similarity perform better than classical feed-forward models. This behaviour is consistent with previous results in other application domains.

1 Introduction

The notion of heterogeneous neurons was introduced in [11] as a model accepting as inputs vectors composed of a mixture of continuous real-valued and discrete quantities, possibly also containing missing data. The other feature of this model departing from the classical one was its definition as a general mapping from which different instance models can be derived. In particular, when the model is constructed as the composition of two mappings, different instance models can be derived by making concrete choices of the net input and activation functions, mimicking the classical neuron model. In this special case, whereas the classical neuron model uses the dot product as net input, and the sigmoid (or hyperbolic tangent) as squashing function for activation, the heterogeneous model uses, respectively, a similarity or proximity relation [4] between the input and the weight tuples, and a sigmoid-like bijection of the reals in [0, 1].
The choice of the specific similarity function should account for the heterogeneous nature of neuron inputs and the presence of missing data. This was shown to be a reasonable brick for constructing layered network architectures mixing heterogeneous with classical neurons, since the outputs of these neurons can be used as inputs for the classical ones. Such a type of hybrid network is composed of one hidden layer of heterogeneous neurons and one output layer of classical neurons. In this case the heterogeneity of the solution space makes genetic algorithms a natural choice for a

training procedure, and indeed, these networks were able to learn from non-trivial data sets with an effectiveness comparable to, and sometimes better than, that of classical methods. They also exhibited a remarkable robustness when information degrades due to the increasing presence of missing data. One step further in the development of the heterogeneous neuron model was the inclusion of fuzzy quantities within the input set, extending the former use of real-valued quantities of crisp character. In this way, uncertainty and imprecision (in inputs and weights) can be explicitly considered within the model, making it more flexible. In the context of a real-world application example in geology [12], it was found that hybrid networks using fuzzy heterogeneous neurons perform better by treating the same data with its natural imprecision than by considering it as crisp quantities, as is usually done. Moreover, in the same study it was found that hybrid networks with heterogeneous neurons in general (i.e. with or without fuzzy inputs) outperform feed-forward networks with classical neurons, even when trained with sophisticated procedures like a combination of gradient techniques with simulated annealing.
In this paper, the possibilities of this kind of neuron are illustrated by comparison to fully classical architectures in a real-world problem. The paper is organized as follows. Section 2 reviews the concept of heterogeneous neurons and their use in configuring hybrid neural networks for classification tasks. Section 3 describes the example application at hand, fruit of an environmental research project in the Arctic, while Section 4 covers the different experiments performed: description, settings and discussion. Finally, Section 5 presents the conclusions.

2 The Fuzzy Heterogeneous Neuron Model

A fuzzy heterogeneous neuron was defined in [12] as a mapping h : Ĥⁿ → R_out ⊆ R. Here R denotes the reals and Ĥⁿ is a cartesian product of an arbitrary number n of source sets. These source sets may be extended reals R̂_i = R_i ∪ {X}, extended families of (normalized) fuzzy sets F̂_i = F_i ∪ {X}, and extended finite sets of the form Ô_i = O_i ∪ {X}, M̂_i = M_i ∪ {X}, where each of the O_i has a full order relation, while the M_i have not. In all cases, the extension is given by the special symbol X, which denotes the unknown element (missing information); it behaves as an incomparable element w.r.t. any ordering relation. Consider now the collection of n_f extended fuzzy sets of the form F̂_i = F_i ∪ {X} and their cartesian product F̂^{n_f} = F̂_1 × F̂_2 × ... × F̂_{n_f}. The resulting input set will then be Ĥⁿ = R̂^{n_r} × F̂^{n_f} × Ô^{n_o} × M̂^{n_m}, where the cartesian products for the other kinds of source sets (R̂^{n_r}, Ô^{n_o}, M̂^{n_m}) are constructed in a similar way from their respective cardinalities n_r, n_o, n_m, with n = n_r + n_f + n_o + n_m and n > 0. According to this definition, neuron inputs are vectors composed of n elements among which there might be reals, fuzzy sets, ordinals, nominals and missing data.
An interesting particular class of heterogeneous submodels is constructed by considering h as the composition of two mappings h = f ∘ s, such that s : Ĥⁿ → R_s ⊆ R and f : R_s → R_out ⊆ R. The mapping h can be seen as an n-ary function parameterized by an n-ary vector ŵ ∈ Ĥⁿ representing the neuron's weights, i.e. h(x̂, ŵ) = f(s(x̂, ŵ)). Within this framework, several of the most common artificial neuron models can be derived. For example, the classical scalar-product driven model is obtained by making

n = n_r (and thus n_f = n_o = n_m = 0), assuming no missing data at all, taking s(x̂, ŵ) = x̂ · ŵ, and choosing some suitable sigmoidal for f. However, there are many possible choices for the function s, and some of them are currently under investigation. In particular, from its very beginning [11], the function s represents a similarity index, or proximity relation (where transitivity considerations are put aside): a binary, reflexive and symmetric function s(x, y) with image in [0, 1] such that s(x, x) = 1 (strong reflexivity). The semantics of s(x, y) > s(x, z) is that object x is more similar to object y than it is to object z. The function f takes the familiar form of a squashing non-linear function with domain in [0, 1]. That is, the neuron is sensitive to the degree of similarity between its input and its weights, both composed in general of a mixture of continuous and discrete quantities, possibly with missing data. It has been our postulate that such a family of functions is, in general, better suited for pattern recognition devices than the classical scalar product and derived measures.
The concrete instance of the model used in the present paper uses a Gower-like similarity index [7], in which the computation for heterogeneous entities is constructed as a weighted combination of partial similarities over subsets of variables; these subsets were singletons in the original definition, although any problem-specific partition is conceivable. This coefficient has its values in the real interval [0, 1] and, for any two objects i, j given by tuples of cardinality n, is given by the expression

$$s_{ij} = \frac{\sum_{k=1}^{n} g_{ijk}\, \delta_{ijk}}{\sum_{k=1}^{n} \delta_{ijk}}$$

where g_{ijk} is a similarity score for objects i, j according to their value for variable k. These scores are in the interval [0, 1] and are computed according to different schemes for numeric and qualitative variables. In particular, for a continuous variable k and any two objects i, j the following similarity score is used:

$$g_{ijk} = 1 - \frac{|v_{ik} - v_{jk}|}{\operatorname{range}(v_{.k})}$$
Here, v_{ik} denotes the value of object i for variable k and range(v_{.k}) = max_{i,j}(|v_{ik} - v_{jk}|) (see [7] for details on other kinds of variables). The term δ_{ijk} is a binary function expressing whether both objects are comparable or not according to their values w.r.t. variable k. It is 1 if and only if both objects have values different from X for variable k, and 0 otherwise. In this way, in the model considered here, Gower's original definitions for real-valued and discrete variables are kept. For variables representing fuzzy sets, similarity relations from the point of view of fuzzy theory have been defined elsewhere [5], [15], and different choices are possible. In our case, if F_i is an arbitrary family of fuzzy sets from the source set, and Ã, B̃ are two fuzzy sets such that Ã, B̃ ∈ F_i, the following similarity relation is used:

$$g(\tilde{A}, \tilde{B}) = \max_x \left( \mu_{\tilde{A} \cap \tilde{B}}(x) \right)$$

where μ_{Ã∩B̃}(x) = min(μ_Ã(x), μ_B̃(x)). As for the activation function, a modified

input function, and the presence of missing information prevent the use of gradient-
based techniques. The resulting heterogeneous neuron can be used for configuring feed-
forward network architectures in several ways. In this paper it is slmwn how layered
feed-forward structures with a hidden layer composed of heterogeneous neurons and
an output layer of classical units are natural choices better suited for the data than
the fully classical counterparts.
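The Gower-like combination of partial scores in the presence of missing data can be illustrated with a small sketch (in Python, restricted to continuous variables; the all-missing fallback value of 0 is an assumption, and discrete or fuzzy scores would plug into the same weighted combination):

```python
def gower_similarity(u, v, ranges, missing=None):
    """Gower-like similarity s_ij between two tuples of continuous
    values. `missing` plays the role of the unknown element X;
    `ranges` holds range(v_.k) for each variable k."""
    num = den = 0.0
    for uk, vk, rk in zip(u, v, ranges):
        if uk is missing or vk is missing:
            continue                      # delta_ijk = 0: not comparable
        num += 1.0 - abs(uk - vk) / rk    # g_ijk for a continuous variable
        den += 1.0                        # delta_ijk = 1
    return num / den if den else 0.0      # all-missing case: an assumption
```

For instance, gower_similarity((7.1, None, 0.3), (6.8, 2.0, 0.5), ranges=(10.0, 5.0, 1.0)) compares only the first and third variables, which is exactly how missing values are absorbed by the δ_{ijk} terms.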

3 An example of application: environmental research in the Arctic

During the scientific expedition Spitzbergen'85, organized by the University of Silesia (Poland), a scientific team composed of specialists from this university, the National Center for Scientific Research (Cuba), and the Academy of Sciences of Cuba performed glaciological and hydrogeological investigations in several regions of the Spitzbergen island (Svalbard archipelago, about 76°N to 80°N). The purpose was to determine the mass and energy balance within experimental hydrogeological basins, to study the interaction between natural waters and rock-forming minerals in the severe conditions of the polar climate, and to compare these with similar processes developed in tropical conditions. This has been a long-term research effort of several Polish universities (Silesia, Warsaw and Wroclaw) and the Polish Academy of Sciences since the First Geophysical Year in 1957, and represents an important contribution to the evaluation of the impact of global climatic changes. Complex interactions take place due to peculiar geological, geomorphological and hydrogeological conditions which, in the end, are reflected in the water geochemistry.
In this study, a collection of water samples was taken from different hydrogeological zones in two Spitzbergen regions. They were representative of many different zones: subglaciar, supraglaciar, endoglaciar, springs (some hydrothermal), lakes, streams, snow, ice, the tundra and coastal. Among the physical and chemical parameters determined for these water samples, the following nine were used for the present study: temperature, pH, electrical conductivity, hydrocarbonate, chloride, sulphate, calcium, magnesium and sodium-potassium. Geochemical and hydrogeological studies of these data [8], [9] have shown a relation between the different hydrogeological conditions present in Spitzbergen and the chemical composition of their waters, reflecting the existence of several families. That is, an indirect assessment of their hydrogeological origin is in principle possible from the information present in the geochemical parameters, thus enabling the use of a learning algorithm.

4 Experiments
4.1 General Information

The available set of N = 114 water samples from Spitzbergen, corresponding to c = 5 hydrogeological families of waters, was used for comparative studies of supervised classification performance (error and accuracy) using the different neural architectures described below. To express the distribution of samples among classes, we introduce the notation n_k to denote that there are n samples of class k. This

way, the actual distribution was 37₁, 29₂, 10₃, 11₄, 27₅. The default accuracy (relative frequency of the most common class) is then 37/114, or 32.5%. The entropy, calculated as $-\sum_{k=1}^{c} (n_k/N) \log_2 (n_k/N)$, is equal to 2.15 bits. There were no missing data, and all measurements were considered to have a maximum of 5% imprecision w.r.t. the reported value. This aspect will be taken into account when considering uncertainty in the form of fuzzy inputs, since the fact that the physical parameters characterizing the samples, as well as their chemical analysis, were determined in situ (in the extremely hard climatic and working conditions of the Arctic environment) makes them particularly suited to a kind of processing in which uncertainty and imprecision are an explicit part of the models used. Accordingly, hybrid feed-forward networks composed of a first (hidden) layer of heterogeneous neurons, mixed with an output layer of classical ones, constitute the basic architectural choice for this case study. These hybrid architectures will be compared to their fully classical counterparts (under the same experimental settings) in order to assess their relative merits. To this end, the following notation is introduced: let q_x denote a single layer of q neurons of type x, where the possibilities for x are:

n Classical: real inputs, scalar-product net input and logistic activation.
h Heterogeneous: real inputs, similarity-based net input and (adapted) logistic activation.
f Fuzzy heterogeneous: triangular fuzzy inputs (converted from the original crisp reported value by adding a 5% imprecision, see fig. 1), similarity-based net input and (adapted) logistic activation.

Fig. 1. A triangular fuzzy number constructed out of the reported crisp value r, with support from r-5% to r+5%.

Accordingly, pxqy denotes a feed-forward network composed of a hidden layer of p neurons of type x and an output layer of q neurons of type y. For example, 4h5n is a network composed of a hidden layer of 4 neurons of type h and an output layer of 5 neurons of type n. All units use the logistic as activation. Shortcut (direct input to output) connections are not considered.
All neural architectures will be trained using a standard genetic algorithm (SGA) with the following characteristics: binary-coded values, probability of crossover: 0.6, probability of mutation: 0.01, number of individuals: 52, linear rank scaling with factor: 1.5, selection mechanism: stochastic universal, replace procedure: worst. The algorithm was stopped unconditionally after 5,000 generations, or if there was no improvement for the last 1,000. This last criterion helps to evaluate the goodness of the architecture being trained and saves useless computing time.
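As an illustration, a minimal Python sketch of a SGA with this configuration follows; the chromosome length and the fitness function are placeholders (the encoding of the network parameters is not specified here), and the replace-worst procedure is approximated by keeping the best individuals of the merged parent/offspring pool:

```python
# Minimal SGA sketch with the stated configuration; L and fitness() are
# placeholders (e.g. fitness would decode a network and return -MSE).
import random

POP, L = 52, 128              # 52 individuals; L is an assumed length
P_CROSS, P_MUT = 0.6, 0.01    # crossover / per-bit mutation probabilities
RANK_FACTOR = 1.5             # linear rank scaling factor

def fitness(bits):            # dummy objective so the sketch runs
    return -sum(bits)

def rank_weights(n):
    # Linear rank scaling: worst gets 2 - 1.5 = 0.5, best gets 1.5.
    return [2 - RANK_FACTOR + 2 * (RANK_FACTOR - 1) * i / (n - 1)
            for i in range(n)]

def sus(ranked, weights, k):
    # Stochastic universal sampling: k equally spaced pointers on the wheel.
    step = sum(weights) / k
    pointer = random.uniform(0, step)
    picks, cum, i = [], weights[0], 0
    for _ in range(k):
        while cum < pointer:
            i += 1
            cum += weights[i]
        picks.append(ranked[i])
        pointer += step
    return picks

def crossover(a, b):
    if random.random() < P_CROSS:          # one-point crossover
        cut = random.randrange(1, L)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]

def mutate(bits):
    return [1 - v if random.random() < P_MUT else v for v in bits]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
best, stale = max(map(fitness, pop)), 0
for gen in range(5000):                    # unconditional stop at 5,000
    ranked = sorted(pop, key=fitness)      # worst first, for rank scaling
    parents = sus(ranked, rank_weights(POP), POP)
    offspring = []
    for a, b in zip(parents[::2], parents[1::2]):
        c1, c2 = crossover(a, b)
        offspring += [mutate(c1), mutate(c2)]
    pop = sorted(pop + offspring, key=fitness, reverse=True)[:POP]
    current = fitness(pop[0])
    best, stale = (current, 0) if current > best else (best, stale + 1)
    if stale >= 1000:                      # no improvement for 1,000 gens
        break
```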

4.2 Experiment Settings

In the present study, all models (including the classical feed-forward one) were trained using exactly the same procedure and parameters in order to exclude this source of variation from the analysis. Of course, fully classical architectures need not be trained using the SGA. They could instead be trained using any standard (or more sophisticated) algorithm using gradient information. However, this would have made direct comparison much more difficult, since one could not attribute differences in performance exclusively to the different neuron models, but also to their training algorithms. The experiment settings were the following:

Training regime The training set was composed of 32 representative samples (28% of the whole data set), whereas the remaining 82 (72%) constituted the test set, a deliberately chosen hard split for generalization purposes. Class distribution is 8_1, 7_2, 5_3, 5_4, 7_5 in training and 29_1, 22_2, 5_3, 6_4, 20_5 in test. Default accuracies are 25.0% and 35.4%, respectively.
Architectures We will explore the following architectures: 5x, 2x5n, 4x5n, 6x5n and 8x5n, for x in n, h, f. Note that the output layer is always composed of five units, one for each water class.
Number of runs Every architecture was allowed R = 5 runs varying the initial population. All of them were included in the results.
Weight range The weights concerning units of type n were limited to the range [-10.0, 10.0], to prevent saturation, whereas heterogeneous weights adopt (by definition of the heterogeneous neuron) the same range as their corresponding input variable.
Error functions The target error function to be minimized by the training algorithms is the usual least squared error, defined as follows:

$$\mathrm{LSE} = \sum_{i=1}^{p} \sum_{j=1}^{m} \left(y_j^i - t_j^i\right)^2$$

where $y_j^i$ is the j-th component of the output vector $y^i$ computed by the network at a given time, when the input vector $x^i$ is presented, and $t_j^i = \chi_j(x^i)$ is the target for $x^i$, where $\chi_j$ represents the characteristic function for class j. The error displayed will be the mean squared error, defined as $\mathrm{MSE} = \frac{1}{mp}\,\mathrm{LSE}$, where m is the number of outputs and p the number of patterns.
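Both measures translate directly into code; a short NumPy sketch, with the outputs and targets held as p x m arrays:

```python
# LSE and MSE as defined above; `outputs` and `targets` are (p, m) arrays
# of network outputs and class-indicator targets for the p patterns.
import numpy as np

def lse(outputs, targets):
    return np.sum((outputs - targets) ** 2)

def mse(outputs, targets):
    p, m = outputs.shape
    return lse(outputs, targets) / (m * p)
```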

4.3 Presentation of the Results (I)

Let the classification accuracy for training (TR) and test (TE) sets, calculated with a winner-take-all strategy, be denoted CA_TR(r) and CA_TE(r), respectively, for a given run r. The errors MSE_TR(r) and MSE_TE(r) are similarly defined. For each neural architecture, the following data is displayed:

Accuracy: Mean classification accuracy on training $\mathrm{MCA_{TR}} = \frac{1}{R}\sum_{run=1}^{R}\mathrm{CA_{TR}}(run)$, the same on test $\mathrm{MCA_{TE}} = \frac{1}{R}\sum_{run=1}^{R}\mathrm{CA_{TE}}(run)$, and the best classification accuracy (BCA), defined as the pair $\langle \mathrm{CA_{TR}}(r), \mathrm{CA_{TE}}(r)\rangle$ with the highest $\mathrm{CA_{TE}}(r)$.

Error: Mean MSE in training, defined as $\mathrm{MMSE_{TR}} = \frac{1}{R}\sum_{run=1}^{R}\mathrm{MSE_{TR}}(run)$, and sample variance in training, defined as

$$\mathrm{SVMSE_{TR}} = \frac{1}{R-1}\sum_{run=1}^{R}\left[\mathrm{MSE_{TR}}(run) - \mathrm{MMSE_{TR}}\right]^2$$

and similarly defined values $\mathrm{MMSE_{TE}}$ and $\mathrm{SVMSE_{TE}}$ for the test set.
The results are collectively shown in table 1. As an additional reference measure of performance, the k-nearest neighbours algorithm (with k = 5) is also run on the data, with the same train/test partition, yielding an accuracy in test equal to 58.5%.

Architecture | Training: MCA_TR / MMSE_TR / SVMSE_TR | Test: MCA_TE / MMSE_TE / SVMSE_TE | BCA (TR / TE)
5n   | 66.3% / 0.1084 / 8.0e-06 | 67.1% / 0.1202 / 1.6e-05 | 75.0% / 76.8%
5h   | (illegible in source)    |                          |
5f   | 99.4% / 0.0338 / 3.0e-06 | 69.3% / 0.0917 / 1.1e-05 | 100% / 75.6%
2h5n | 71.9% / 0.0968 / 2.0e-04 | 69.5% / 0.1088 / 2.6e-04 | 81.3% / 85.4%
2f5n | 86.3% / 0.0635 / 1.2e-04 | 71.7% / 0.0995 / 9.3e-05 | 81.3% / 81.7%
4h5n | 90.0% / 0.0614 / 1.0e-05 | 79.0% / 0.0786 / 2.9e-05 | 93.8% / 82.9%
4f5n | 98.1% / 0.0201 / 1.4e-04 | 81.2% / 0.0620 / 1.3e-04 | 100% / 86.6%
6h5n | 91.3% / 0.0508 / 5.0e-05 | 83.7% / 0.0803 / 5.6e-05 | 93.8% / 87.8%
6f5n | 100%  / 0.0106 / 3.0e-06 | 84.9% / 0.0553 / 1.1e-05 | 100% / 90.2%
8h5n | 93.8% / 0.0456 / 1.9e-05 | 86.6% / 0.0603 / 4.0e-05 | 93.8% / 90.2%
8f5n | 100%  / 0.0064 / 4.0e-06 | 80.5% / 0.0541 / 4.3e-05 | 100% / 84.1%

Table 1. Results of the experiments. See text for an explanation of entries.

4.4 Analysis of the Results (I)

As stated, the experiments were oriented to reveal the influence of several factors:
a) the kind of neural model used (heterogeneous vs. classical),
b) the effect of considering imprecision (fuzzy inputs vs. crisp inputs), and
c) the effect of missing data in the test set.
The effect of factor (a) can be assessed by comparison, for all the architectures, of the first entry against the other two, column by column. The effect of (b) is reflected in the difference between the second vs. the third.

Single-layer architectures Let us begin by analysing the results for the architectures with no hidden units, that is, the first three rows of table 1. The interpolation capabilities of the three neuron models can be seen by comparing the value of MCA_TR. The mean error MMSE_TR is also a good indicator. The robustness (in the sense of expected variability) can also be assessed by the value of SVMSE_TR. It can be seen how the heterogeneous neurons are in general better and much more robust than the classical one. Especially, the fuzzy neuron can learn the data set to almost perfection, very robustly. Similar results are achieved in the test set. Again, an increasing accuracy and decreasing errors and variance indicate an overall better performance. However, the f units are clearly overfitting the data, a fact that shows in the highly unbalanced TR and TE accuracy ratios (both in average and in the best pair BCA).

Multi-layer architectures For the four groups of architectures selected (the px5n), there are two aspects amenable to discussion. First, the relative behaviour of elements of the form px5n for a fixed p. Second, their relative behaviour for a fixed x. These two dimensions will collectively shed light on any coherent behaviour present in the results.
To begin with, it can be seen that for all the architectures 2x5n, 4x5n, 6x5n and 8x5n, as we go through the sequence n, h, f, the behaviour is consistent: mean accuracies increase, and mean errors and their variances decrease, both in training and in test, with the only exception of the error variance in the case 4x5n. This shows a general superior performance of h neurons over n neurons, and of f neurons over h. The absolute differences between neuron models are also noteworthy. In all training respects, the pf5n families show very good interpolation capabilities, explaining 100% of the TR set starting from p = 4 in BCA and from p = 6 in MCA_TR. This trend is followed, to a lesser extent, by the ph5n. The same consistent behaviour is observed in all test indicators. Here the two heterogeneous families show a similar behaviour, with the f neurons slightly above the h ones, until, for p = 8, the architectures pf5n end up overfitting the data so strongly that their performance in test begins to fall.
As for the second aspect, px5n fixing x, it can be checked that all neuron models show an increasing performance when the number of hidden neurons is increased, as can reasonably be expected. In conclusion, for all of the architectures it is clear that the use of heterogeneous neuron models leads to higher accuracy rates in the training and test sets. Moreover, when imprecision is allowed by accepting that each value is endowed with the above-mentioned uncertainty, the fuzzy heterogeneous model also outperforms its crisp counterpart.

4.5 Presentation of the Results (II)

The neural nets obtained in the previous experiment can now be used to assess the effect of factor (c), the influence of missing values in the data. The purpose of this experiment is twofold: first, it is useful to study to what extent missing information degrades performance. This is an indication of robustness and is important from the point of view of the methods. Second, in this particular problem, studying the effect of missing data is very interesting, because it can give an answer to the following questions:
1. What predictive performance could we expect if we do not supply all the information, but just a fraction of it?
2. What would have happened had we presented incomplete training information to the net from the outset?
This scenario makes sense in our case study, for which a rich set of complete data may be impossible to obtain, because of lack or damage of resources, physical or practical unrealizability, lack of time, climatic conditions, etc. Note that it is not that a particular variable cannot be measured (we could readily remove it) but that some realizations of (potentially) all variables may be missing. These experiments were performed with the same nets found in the previous section. This time, however, they were each run on different test sets, obtained by artificially and randomly (with


a uniform distribution) adding different percentages of missing information. These percentages range from 10% to 90%, in intervals of 10%. The results are presented, for the whole set of heterogeneous architectures† displayed in table 1, in graphical form, through figs. 2 (a) to 2 (e). The x-axis represents the total percentage of missing values in the test set, while the y-axis stands for the MCA_TE (that is, again, data shown for each point is the average over R = 5 runs). The horizontal line represents the size of the major class (35.4%), to be taken as a reference, and the same k-nearest neighbours algorithm is run and shown in fig. 2 (a).

Fig. 2. Increasing presence of missing data in test. Mean test classification accuracy for the heterogeneous (ph5n) and fuzzy heterogeneous (pf5n) families: (a) 5h and 5f, (b) 2h5n and 2f5n, (c) 4h5n and 4f5n, (d) 6h5n and 6f5n, (e) 8h5n and 8f5n, (f) mean test classification accuracy for 6h5n and 6f5n when trained with 30% of missing information. See text for an explanation of axes.
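The dilution procedure itself is straightforward; a sketch, assuming missing values are encoded as NaN:

```python
# Randomly dilute a test set: each entry is independently replaced by a
# missing-value marker with probability `fraction` (10%..90% above).
import numpy as np

def dilute(X, fraction, seed=0):
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    X[rng.random(X.shape) < fraction] = np.nan   # NaN as the missing marker
    return X

# e.g. test sets with 10%, 20%, ..., 90% of missing values:
# diluted = [dilute(X_test, f / 100) for f in range(10, 100, 10)]
```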

4.6 Analysis of the Results (II)

Both neuron models h, f are very robust, a fact that shows in the curves, which follow a quasilinear decay. The accuracies are consistently higher for the fuzzy model than for the crisp counterpart for all the network architectures, again showing that allowing imprecision increases effectiveness and robustness. Performance, in general, is well above the default accuracy until 50%-60% of missing information is introduced. In many cases, mean classification accuracy is still above it for as much as 70%-90% of missing information, which is very remarkable. This graceful degradation of fuzzy heterogeneous models should not be overlooked, since it is a very desirable feature in any model intended to be useful in real-world problems.
The last figure, fig. 2 (f), shows the effect of a different training outset. Choosing what seems to be the best group of architectures for the given problem, the 6h5n and 6f5n, these networks were trained again, this time with a modified training set: adding to it 30% of missing information, in the same way it was done for the test set, and using them again to predict the increasingly diluted test sets. As usual, the horizontal line represents the size of the major class, and k-nearest neighbours performance is also shown. Training and test accuracies were this time lower (as one should expect), the training ones being equal to MCA_TR = 88.8% for 6h5n and MCA_TR = 96.3% for 6f5n. However, the differences with previous performance are relatively low. Some simple calculations show that, although the amount of data is 70% that of the previous situation, the new accuracies are 97.3% and 96.3% of those obtained with full information for 6h5n and 6f5n, respectively. Performance in test sets is also noteworthy: although the new curves begin at a lower point than before, the degradation is still quasilinear. What is more, the slope of this linear trend is lower (in absolute value), resulting in a slight raising of both curves.

5 Conclusions

Experiments carried out with data coming from a real-world problem in the domain of environmental studies have shown that allowing imprecise inputs, and using fuzzy heterogeneous neurons based on similarity, yields much better prediction indicators (mean accuracies, mean errors and their variances, and absolute best models found) than those from classical crisp real-valued models. These results for heterogeneous

networks confirm the features observed in other studies [1], [2], [3], [11], [12] concerning their mapping effectiveness and their robustness with respect to the presence of uncertainty and missing data. Their ability to consider imprecise data directly and their performance under those circumstances deserve closer attention, due to their implications for real-world problems from the point of view of neurofuzzy systems. However, the study of these networks is still in its initial stage. Several other architectures are possible, along with different (partial) similarity measures, and further investigations are being made in order to explore their properties in more extent, and to make the scope of their application more precise.

† These experiments could not be performed for the pn5n architectures, for they do not accept missing information. Although there are estimation techniques, they are not an integrated part of the models, and would have introduced a bias.

References
1. Ll. Belanche and J.J. Valdés: "Using Fuzzy Heterogeneous Neural Networks to Learn a Model of the Central Nervous System Control". In Procs. of EUFIT'98, 6th European Congress on Intelligent Techniques and Soft Computing, pp. 1858-62, Elite Foundation, Aachen, Germany, 1998.
2. Ll. Belanche, J.J. Valdés and R. Alquézar: "Fuzzy Heterogeneous Neural Networks for Signal Forecasting". In Procs. of ICANN'98, Intl. Conf. on Natural and Artificial Neural Networks (Perspectives in Neural Computing), pp. 1089-94, Skövde, Sweden. Springer-Verlag, 1998.
3. Ll. Belanche, J.J. Valdés, J. Comas, I.-R. Roda and M. Poch: "Modeling the Input-Output Behaviour of Wastewater Treatment Plants using Soft Computing Techniques". In Procs. of BESAI'98, Binding Environmental Sciences and AI, held as part of ECAI'98, European Conference on Artificial Intelligence, pp. 81-94, Brighton, UK, 1998.
4. Chandon, J.L., Pinson, S.: Analyse Typologique. Théorie et Applications. Masson, 1981.
5. Dubois D., Esteva F., García P., Godo L., Prade H.: A logical approach to interpolation based on similarity relations. Instituto de Investigación en Inteligencia Artificial, Consejo Superior de Investigaciones Científicas, Barcelona, España. Research Report IIIA 96/07, 1996.
6. Dubois D., Prade H., Esteva F., García P., Godo L., López de Mántaras R.: Fuzzy set modelling in case-based reasoning. Int. Journal of Intelligent Systems (to appear) (1997).
7. Gower, J.C.: A General Coefficient of Similarity and some of its Properties. Biometrics 27, 857-871, 1971.
8. Fagundo, J.R., Valdés, J.J., Rodríguez, J.E.: Karst Hydrochemistry (in Spanish). Research Group of Water Resources and Environmental Geology, University of Granada, Ediciones Osuna, pp 212, Granada, Spain, 1996.
9. Fagundo, J.R., Valdés, J.J., Pulina, M.: Hydrochemical investigations in extreme climatic areas, Cuba and Spitzbergen. In: Water Resources Management and Protection in Tropical Climates, pp 45-54, Havana, Stockholm, 1990.
10. Klir, G.J., Folger, T.A.: Fuzzy Sets, Uncertainty and Information. Prentice Hall Int. Editions, 1988.
11. Valdés, J.J., García, R.: A model for heterogeneous neurons and its use in configuring neural networks for classification problems. In Procs. of IWANN'97, International Work-Conference on Artificial and Natural Neural Networks. Lecture Notes in Computer Science 1240, pp. 237-246. Springer-Verlag, 1997.
12. Valdés, J.J., Belanche, Ll., Alquézar, R.: Fuzzy heterogeneous neurons based on similarity. International Journal of Intelligent Systems (accepted for publication, 1999). Also in Procs. of CCIA'98: Congrés Català per a la Intel·ligència Artificial (Catalan Congress for Artificial Intelligence), Tarragona, Spain, 1998. Also in LSI Research Report LSI-98-33-R, Universitat Politècnica de Catalunya, Barcelona (1998).
13. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley (1989).
14. Davis, L.D.: Handbook of Genetic Algorithms. Van Nostrand Reinhold (1991).
15. Zimmermann, H.J.: Fuzzy set theory and its applications. Kluwer Academic Publishers (1992).
A Neural Network Approach for Generating Solar Irradiation Artificial Series

P. J. Zufiria†, A. Vázquez-López†, J. Riesco-Prieto†, J. Aguilera‡ and L. Hontoria‡

† Grupo de Redes Neuronales
Dpto. de Matemática Aplicada a las Tecnol. de la Inform.
E.T.S. Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Ciudad Universitaria s/n
E-28040 Madrid, Spain

‡ Grupo Jaén de Técnica Aplicada
Dpto. de Electrónica
Universidad de Jaén
Avda. de Madrid 35
23071 Jaén, Spain

ABSTRACT

In this paper a relevant problem in the photovoltaic solar energy field is considered: the generation of artificial series of hourly solar irradiation. The proposed methodology artificially generates series following the average tendency of the hourly radiation series kt in a given place. This is obtained by making use of a set of historical values of this series in such a place (for training purposes) as well as the daily clarity index KT of the year to be generated. This information is employed for the supervised training of a proposed neural network model. The neural model employs a well-known paradigm, called the Multilayer Perceptron (MLP), in a feedback architecture. The generation method is based on the MLP ability to extract, from a sufficiently general training set, the existing relationships between variables whose interdependence is unknown a priori. This way, the presented design methodology can implicitly include all the available information. Simulation results show the good performance of the irradiation series generator, and the general applicability of this methodology in the estimation of highly complex temporal series.

1 Introduction
The design and analysis of photovoltaic converters is usually performed via numerical simulations which require as input data large time sequences of hourly or daily irradiation* values [Grah 90, Lore 91]. Nevertheless, these historic radiation measurements do not exist in most of the world's countries, and, if any, their quality is questionable or they have plenty of missing values.
In 1988 Graham proposed the substitution of these historical measurements by synthetic sequences of irradiation values generated using mathematical models of the irradiation process. These generated sequences should preserve the statistical properties of the historical measurements. The proposed methodology was based on autoregressive time series theory for generating sets of daily values of solar irradiation.
The work described in [Grah 90] extends such methodology to the generation of hourly solar irradiation series making use of daily values. These daily values can be obtained from historical measurements (which are more common than hourly measurements) or via some daily values generation methods (which are more validated than hourly methods). This is a stochastic disaggregation method (very typical in Hydrology: to separate the annual flow estimation into monthly estimations). The hourly radiation series are very useful when studying photovoltaic systems with one or two-hour response time such as peak plants or photovoltaic plants which return energy to the network at maximum charge instants.
The main criticisms of Graham's method are the high computing requirements for obtaining each series value, and the geographical dependency of the method on the place where the data used for constructing the model were retrieved.
In this work, we propose a neural network approach, making use of the Multilayer Perceptron (MLP) [Lipp 87, Rume 86, Werb 74] in a feedforward-feedback architecture [Nare 90] for generating hourly solar radiation series. The main attractive property of our method is the MLP capability for approximating any continuous function defined on a compact set within a prescribed error margin. Existence results prove that it suffices to employ a MLP with a hidden layer, a required number of neurons and an appropriate training procedure [Horn 89]. In practice, selection of appropriate topology as well as training algorithms may become a big challenge.
One important aspect addressed in this paper is the possibility of employing the presented architecture with a reduced knowledge of the problem to be considered. In that sense the paper defines a simple design methodology with quite general applicability.
The paper is organized as follows. Section 2 presents some basic aspects concerning the use of a MLP based architecture for time series processing; in addition, specific aspects related with the generation of irradiation series are also considered. The specific proposed model for the generation of time series is presented in section 3. Concluding remarks are outlined in section 4.
*The term solar radiation refers to the physical phenomenon in a generic sense, whereas the term irradiation refers to the incident energy on a horizontal surface over a given period of time (hourly, daily irradiation, etc). Therefore, the irradiation units are kW·h/m².

Figure 1: Prediction via network evolution.

2 Methodology for design of MLP based architecture

2.1 The Multilayer Perceptron
For several years now, neural networks have been increasingly used in different scientific and technical fields [Agar 97, Hayk 94, Hush 93, Koho 95, Lipp 87]. For instance, as a computation and learning paradigm, they can be used for many types of applications. One of the most appealing properties of some neural network paradigms is their potential use for functional approximation purposes [Horn 89]. The Multilayer Perceptron (MLP) is the most widely used type of neural network for approximation tasks; it is classified as a feedforward type neural network, whose topology defines several layers of neurons. The MLP, in static contexts, is usually trained via a supervised procedure, one of the great advantages of the MLP being the existence of a very efficient training method for it: the backpropagation algorithm [Rume 86, Werb 74].
Also, in dynamic contexts, the identification and control of nonlinear plants, as well as time series prediction, have been successfully addressed via feedback architectures of supervised neural models [Nare 90, Nare 91, Weig 90]. This work can be framed in such a context, as shown below.

2.2 MLP for time series prediction

The methodology employed for Time Series Prediction (TSP) and system identification via MLP [Lape 87, Nare 90, Nare 91, Weig 90, Vazq 92] is the framework of the method developed for generating hourly solar radiation series.
The problem of TSP via MLP makes use of the time series $\{s_n\}$ for obtaining the function G (in case such a function exists) which relates each series value with the previous p values:

$$s_{n+1} = G[s_{n-p+1}, \ldots, s_n] \approx \mathrm{MLP}[s_{n-p+1}, \ldots, s_n] \quad (1)$$

By training a MLP with p inputs and 1 output, with a training set representative enough, the MLP will be able to find the desired relationship (in case it exists)

just by approximating the function G. Once the approximation is performed, future values can be computed via feedback of the predictions whenever they are available. Such a method is called prediction by network evolution (see Figure 1).
One of the great advantages of employing a MLP based methodology for generating radiation series is that most of the computational resources are required during the training procedure, as opposed to the generation procedure. In addition, once the method is developed from historical data of a prescribed place, it can be applied to new places just by repeating the training procedure with new data corresponding to such new places.
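A minimal sketch of prediction by network evolution; here mlp stands for any trained one-output approximator of G:

```python
# Prediction by network evolution (Figure 1): the window of the last p
# values, real at first and then predicted, is fed back as the next input.
def evolve(mlp, seed_values, n_steps, p):
    window = list(seed_values[-p:])      # the p most recent known values
    predictions = []
    for _ in range(n_steps):
        s_next = mlp(window)             # s_{n+1} ~ MLP[s_{n-p+1},...,s_n]
        predictions.append(s_next)
        window = window[1:] + [s_next]   # feed the prediction back
    return predictions
```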

2.3 The nature of the information

The procedure shown in this paper makes use of atmospheric transmittance or transparency values (as in Graham's work) instead of a direct use of solar irradiation values. This transmittance (also called clarity index) is represented as kt for hourly values and KT for daily values. The extraatmospheric solar irradiation behaves in a deterministic way, and it is the clarity index which induces randomness in the solar irradiation measured on earth. More precisely:

$$k_t = \frac{G_h}{B_{0h}} \in [0,1] \quad (2)$$

where $B_{0h}$ is the extraatmospheric irradiation on a horizontal surface during hour h, and $G_h$ is the irradiation during hour h on the earth surface. Also, the solar irradiation variable is specific for a given place, whereas the random properties of the clarity index behave in a quasi-universal manner.
The progression of kt values could be described from a probability distribution function if such hourly events were to be independent. Since this is not the case, additional information is required on the correlation between different hourly values. Nevertheless, the progression of kt cannot be described via a stationary stochastic process (stationarity being a necessary condition for applying ARMA models [Prie 88]), because the probability associated with daily events changes in a monthly manner (i.e., there exists some monthly stationarity) and the probability associated with hourly events changes every hour (there exists hourly stationarity). In addition, the probability of a given kt to happen depends on the clarity index KT of the referred day.
In our computational experiments we made use of a set of hourly irradiation values kt measured in Madrid between 1978 and 1986. Such a set of data corresponds to 9 (years) × 365 (days per year) × 16 (measured hours per day) values of kt and its corresponding 9 × 365 daily values KT.
As a first approach to the problem, due to the limitations for evaluating the quality of a generated series, we considered the first 8 years as a training set, and employed the 9th year for testing the validity of the generated series. We measured such validity with the parameter Mean Relative Variance (MRV), which quantifies the relative error and is frequently employed in the Digital Signal Processing community. The MRV defines an estimation of the quotient between the prediction error signal power and the AC power of the signal to be predicted:

$$\mathrm{MRV} = \frac{\sum_{i=1}^{l} (s_i - \hat{s}_i)^2}{\sum_{i=1}^{l} (s_i - \bar{s})^2} \quad (3)$$
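In code, the MRV of eq. (3) is a one-liner; a NumPy sketch:

```python
# MRV: prediction-error power over the AC power of the real signal.
import numpy as np

def mrv(real, predicted):
    real, predicted = np.asarray(real), np.asarray(predicted)
    return np.sum((real - predicted) ** 2) / np.sum((real - real.mean()) ** 2)
```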

2.3.1 Information inclusion

The proposed method for generating hourly radiation series has been developed via a step-by-step inclusion of the available associated information. The great advantage of this MLP based methodology is that explicit knowledge of the relationship among all the information sources is not needed. Such information sources can be progressively incorporated in different steps of the proposed method. The details of this step-by-step procedure can be found in [Vazq 93].

3 Proposed generation method

The generation procedure proposed in this paper can be seen in Figure 2. A MLP is employed in a mixed feedback-feedforward configuration.
As a first step, the series of 9 × 365 × 16 values was considered as a numeric sequence, without making use of the meaning of its indexes (which hour of the day or which day of the year they refer to). A MLP was trained with the series corresponding to the 8 first years and the 5840 values corresponding to the 9th year were generated in an iterative manner by application of prediction by network evolution (see Figure 1). As expected, the results were not satisfactory; this first approach does not provide good results, showing the need for employing additional information.
Consequently, daily information was considered. A day by day prediction method was employed, as well as a dependency of any hourly value on the three previous hours of the same day. Therefore, in order to generate the 16 hourly values {kt} of a given day, we started a method of prediction by network evolution with window p = 3. This implies the need of using the kt values of the 3 first hours of such a day. Since these three initial values are 0 (or close to 0) for most of the days of a year, it is reasonable to assume that they do not provide meaningful information. In addition, these values can be modeled in a probabilistic framework.
In order to keep the monthly stationarity, a new input was added to the MLP containing the distance (days) between the value to be generated and the day with maximum value in the {kt} annual distribution. The normalized day input was defined as

$$d_n = 1 - \frac{|N_d - 163|}{163}$$

where $N_d$ is the day number within the year.
In a following step a new MLP input was created indicating the value of the daily clarity index KT corresponding to the day to which the hourly to-be-generated value belongs. These 365 KT values were taken from the year to be generated (9th year of our data). In a real application of the method, a method for generating those 365 KT values would be employed first, generating the 5840 hourly values afterwards.
The final step of our method added a new input to the MLP indicating the hour order number of the kt value to be generated. This value, ranging from 4 to 16 for a size-3 window, is normalized as

$$hour_{norm} = \frac{hour - p}{16 - p}$$

where p is the prediction window size.
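Putting the above together, the five-input pattern fed to the MLP can be sketched as follows; note that the constant 163 and the exact form of both normalizations are reconstructed from partially garbled formulas, so they should be read as assumptions:

```python
# Sketch of the final input pattern for hour `hour` of day Nd; the day of
# maximum annual kt (163) and the normalizations are assumptions.
P = 3                                     # prediction window size

def input_vector(KT, hour, Nd, kt_prev3):
    d_n = 1 - abs(Nd - 163) / 163         # normalized day input
    hour_norm = (hour - P) / (16 - P)     # hour ranges from 4 to 16
    return [KT, hour_norm, d_n] + list(kt_prev3)
```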
It is important to note that different optimization schemes were employed for the supervised training of the networks studied. Although some schemes did improve the performance, the proper selection of the neural model inputs showed to be the most relevant design issue.
In order to test the quality of the method, hourly value series were generated for the 9th year, having employed the previous 8 years in the training procedure.

[Figure 2 depicts a MLP with inputs KT, i (hour), n (day), kt(i-3), kt(i-2) and kt(i-1), and output kt(i).]

Figure 2: Proposed generation method.

In order to generate it, the 365 KT (daily clarity index) values of such year were needed as inputs, as well as the 3 initial values of the hourly clarity index kt of each day. The MRV obtained was 0.0943, proving that the method emulates quite well the deterministic component of the series.
The obtained series generator can be favourably compared in some aspects with the computation of the average tendency ktm performed by Graham's method. Our proposed method can be employed for generating series corresponding to any locality, if the corresponding training data set is available, i.e. a set of hourly and daily clarity indexes measured over several years. Also, Graham's method requires the same training set for computing the nonlinear regressions corresponding to each locality, which link each hourly ktm value and the KT value of the corresponding day. On the other hand, the use of a MLP does not assume any a priori model, being advantageous versus a nonlinear regression approach.
From an academic point of view it is very interesting to note the MLP capability for finding relationships among variables of different nature. In our example, making use of an appropriate training set, the MLP was able to relate information from the hour of the day, the daily clarity index value, and 3 previous values of the hourly clarity index in order to generate a new kt index value.
Nevertheless, the shape of the resulting series does not have the characteristic rippling of the real series. This is due to the fact that the employed training set (8 years of kt and KT values) was large in relation to the MLP 5-x-1 topology. It is possible that such a training set may have input/output pairs such that different desired output values may be linked with the same input value. Therefore, after training the MLP, an averaging effect might have occurred among such different output values. Hence, this could justify that our proposed method does not generate radiation hourly series with the characteristic stochastic rippling of the real series (the generated series are smooth, as can be seen in Figure 3).
Figure 3: Real series versus generated one without noise. Days 5-13.

Figure 4: Real series versus generated one with additive noise. Days 5-13.

Figure 5: Real series versus generated one with additive noise. Days 355-361.

For the sake of emulation completeness, the stochastic rippling was emulated, as a first approach, via a generated set of random Gaussian variables corresponding to the 16 hours of the day, excluding the initial p and the last 2 (that is, 16 - p - 2 random variables). The means and variances of these random variables were estimated from an error signal between the 9th year real series and the series generated by our proposed method. Hence, we added to each of the generated hourly series values {kt} one realization value of the random variable corresponding to such hour. The initial 3 hours of each day and the last 2 did not suffer such perturbation. In Figures 4 and 5 we show the series corresponding to the 9th year, generated by the described method after adding the noise, corresponding to the hourly values from day 5 to 13, and from day 355 to 361, respectively.
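A sketch of this perturbation step, with mu and sigma being the per-hour estimates mentioned above:

```python
# Rippling emulation: per-hour Gaussian noise, estimated from the error
# between the real and generated 9th-year series, added to every hour
# except the first p = 3 and the last 2 of each day.
import numpy as np

def add_rippling(kt_days, mu, sigma, p=3, seed=0):
    # kt_days: (n_days, 16) generated values; mu, sigma: per-hour estimates
    rng = np.random.default_rng(seed)
    out = kt_days.copy()
    for h in range(p, 16 - 2):            # the 16 - p - 2 perturbed hours
        out[:, h] += rng.normal(mu[h], sigma[h], size=out.shape[0])
    return out
```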

4 Concluding Remarks
A methodology based on neural networks has been presented for generating time series following the average tendency of the hourly radiation series kt in a given place. Such methodology is based on the possibility of implicitly employing information associated with the problem, without knowing the existing relationships between different variables and sources of information.
The proposed methodology makes use of both a set of historical values of the series (for training purposes) as well as the daily clarity index KT of the year to be generated in a straightforward manner: the whole information has been employed for the supervised training of a MLP based feedforward-feedback architecture. A proper selection of the model has proved to be more critical than the training method selected.
Although the quality of the developed method needs further testing, one can conclude that the generation can be performed with little knowledge of the problem. This

is due to the MLP capability for finding relationships among variables with an a priori unknown relationship. Nevertheless, a proper MLP topology and training set must be selected for such a purpose.
The proposed method does not assume any a priori model, as opposed to the standard approximation techniques where polynomial regression techniques are employed.

Acknowledgments
This work has been financially supported by Proyecto Multidisciplinar de Investigación y Desarrollo 14908 of the Universidad Politécnica de Madrid and Project PB97-0566-C02-01 of the Programa Sectorial de PGC of Dirección General de Enseñanza Superior e Investigación Científica in the MEC.
The authors want to thank Professor Eduardo Lorenzo and Dr. Mario Macagnan, from the Instituto de Energía Solar in the UPM, for their helpful comments and suggestions, as well as for providing the data on radiation series employed in this work.

References
[Agar 97] M. Agarwal, A Systematic Classification of Neural-Network-Based Control. IEEE Control Systems Magazine, vol. 17, n. 2, pp. 75-93, April 1997.
[Gold 96] R. Golden, Mathematical Methods for Neural Network Analysis and Design, MIT Press, Cambridge, 1996.
[Grah 90] V. A. Graham and K. G. T. Hollands, A Method to Generate Synthetic Hourly Solar Radiation Globally. Solar Energy, Vol. 44, 1990.
[Hayk 94] S. Haykin, Neural networks. A comprehensive foundation, Macmillan Publishing Company, 1994.
[Horn 89] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators. Neural Networks, 2 (5), 359-366.
[Hush 93] D. R. Hush and B. G. Horne, Progress in Supervised Neural Networks. What's new since Lippmann?, IEEE S.P. Magazine, pp. 8-39, January 1993.
[Koho 95] T. Kohonen, Self-Organizing Maps, Springer Verlag, Berlin Heidelberg, 1995.
[Lape 87] A. S. Lapedes and R. M. Farber, Non linear signal processing using neural networks: prediction and system modeling. Technical Report, Los Alamos National Laboratory, 1987.
[Lipp 87] R. P. Lippmann, An Introduction to Computing with Neural Nets. IEEE ASSP Magazine, pp. 4-22, April 1987.
[Lore 91] E. Lorenzo, Electricidad Solar Fotovoltaica. ETSI Telecomunicación (U.P.M. Madrid), 1991.
[Lowe 91] D. Lowe and A. R. Webb, Time series prediction by adaptive networks: a dynamical systems perspective. IEEE Proceedings-F, February 1991.
[Nare 90] K. S. Narendra and K. Parthasarathy, Identification and Control of Dynamical Systems Using Neural Networks, IEEE Transactions on Neural Networks, vol. 1, n. 1, pp. 4-27, March 1990.
[Nare 91] K. S. Narendra and K. Parthasarathy, Gradient methods for the Optimization of Dynamical Systems Containing Neural Networks, IEEE Transactions on Neural Networks, vol. 2, n. 2, pp. 252-262, March 1991.
[Prie 88] M. B. Priestley, Non-linear and non-stationary time series analysis. Academic Press, 1988.
[Rume 86] D. Rumelhart and J. L. McClelland, Learning internal representations by error backpropagation. Chapter 8 from Parallel Distributed Processing. Vol. 1: Foundations. The MIT Press, 1986.
[Vazq 92] A. Vázquez-López, Identificación de Sistemas mediante Redes Neuronales para Control de Robots. ETSI Telecomunicación, Madrid 1992.
[Vazq 93] A. Vázquez-López and P. J. Zufiria, Generación artificial de series de radiación solar mediante Perceptrón Multicapa, Actas V Conferencia de la Asociación Española para la Inteligencia Artificial (CAEPIA 93), pp. 196-205, 16-18 Noviembre 1993.
[Weig 90] A. S. Weigend, D. E. Rumelhart and B. A. Huberman, Back-Propagation, Weight-Elimination and Time Series Prediction. Chapter in Proceedings of the 1990 Connectionist Models Summer School. Morgan Kaufmann, 1990.
[Werb 74] P. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral sciences, Ph.D. dissertation, Committee on Appl. Math., Harvard Univ., Cambridge, MA, Nov. 1974.
Color Recipe Specification in the Textile Print Shop Using Radial Basis Function Networks

Rautenberg, Sandro¹
Todesco, José Leomar²
Engenharia de Produção e Sistemas
Universidade Federal de Santa Catarina - UFSC - Brasil
¹ dinho@eps.ufsc.br
² tite@eps.ufsc.br

Abstract

Color recipe specification in the textile print shop requires a great deal of human experience. There is an intrinsic knowledge that makes the computational modeling a difficult task. One of the main issues is human color perception. A small variation in the intensity of colorants can lead to very different results. In this paper, we propose to use Radial Basis Function Networks (RBFN) for color recipe specification in the textile print shop. The method has been applied in a real environment with the following results: it allowed the modeling of the intuitive nature of color perception; it made it possible to simulate the color mixing process on a computer; and it became a suitable means for training on color recipe specification.
Keywords: Textile print shop, Color Recipe, RBFN, Artificial Neural Network.

1. Introduction
One of the most important processes in the textile industry is the development of an appropriate color to print a certain kind of fabric. Issues such as product, esthetic beauty and art creativity, among other things, are directly dependent on this development. In general, these issues are the main points observed by customers, causing direct impact on sales [9].
Despite being so important, in the majority of the industries the color recipe process is still very primitive. It basically comes down to repeated attempts at reaching a desired color. In this process, the person in charge uses experience, mixing the colorants to obtain the target color [10]. In many cases, this process leads to unsatisfactory results, with a high number of failures and a considerable amount of wasted material. The nature of the method may lead to failures even when the colorist is a color recipe expert [12].
To deal with this situation, many companies invest in the acquisition of a spectrophotometer and in a computerized recipe formulation system. Nevertheless, such

investments occur without a previous study regarding the adaptation of the company environment. In this case, the company may face some problems, especially after the system has been implemented [3].
In the current state of the art, color development is intrinsically dependent on individual perception, making color recipe a highly specialized task. The colorist uses his/her experience, comparing and prescribing colors with different amounts of colorants.
In this article we propose a system to simulate human color perception in the textile print shop. The system is able to capture and store the required knowledge and to deal with the knowledge inherent in the process. The main result is the system's capability of helping individuals in the development of a certain color. It is also a source of training for others involved in the process of stamping the fabric.

2. The Color Recipe Specification

Color has always been an object of human fascination in the most diverse areas of science. Color is studied in arts, psychology, anthropology, medicine, religion and others [5]. It should not be a surprise that a science exclusively dedicated to the study of color exists [9, 11].
The word "color" can be defined by the following concepts: "Body appearance as the way they reflect or dissipate the light". "Particular impression caused in the sense of eyesight by different luminous rays, single or combined, when reflected by the body" [13].
From these definitions one can conclude that the color impression depends on the spectrometric characteristics of:
• the luminous source;
• the observed object;
• the observer's eyes.
The last item makes color identity subjective and dependent on the textile colorist's view. Modeling such subjectiveness can turn the color recipe process into a computationally treatable phenomenon.
Color has always represented something special to humans. It is an inherent characteristic of common objects (houses, ware, adornments, clothes, among others) [1]. This makes color correctness a crucial issue, directly related to product quality and customer satisfaction. Colorists are always facing a challenge when dealing with new color mixes.
The paint preparation for the textile print shop, technically known as "print paste", requires knowledge of textile chemistry [7]. The main paste element is the colorant. The correct choice and appropriate handling of colorants are the most complex tasks for the colorist.
The process is based on the measurement of how much each colorant participates in the mix (i.e., in the new color). Besides the color mix, the colorist has to consider the kind of fabric that will be painted [10]. For instance, pigment colorants are used in smooth fabric while reactive colorants are used in fabrics such as velvet.
The colorist has to take into account another set of variables related to color consistency. The illumination conditions of the environment where the colors are being created and the quality of the colorants, for instance, have a direct impact on the resultant color.

In the current process, the colorist's decisions are based mainly on his/her professional experience with color development. The amount of each colorant in the mixture is determined intuitively, based on practical experience with different mixes [7]. The lack of an explicit methodology may cause a large amount of re-work or waste of raw material.
Besides, the subjectivism of color evaluation makes it difficult to reach a complete agreement on how close a color is to the initial target. Variables such as age, tiredness, visual defects, opinions, taste, etc. can make color perception differ among observers [9].
Another factor is how sensitive the mixture is to an increase in the amount of a colorant. There are colorants that, when mixed in a small quantity, interfere a lot with the final result. In Table 1 we present the quantity (in grams) of each colorant for the production of one kilogram of paint. The pantones identify the colors according to industry standards.
PANTONE | yellow | golden yellow | orange | magenta | cyan | royal blue | black
130     | 24.00  | -             | 0.20   | -       | 0.70 | -          | -
137     | 16.00  | 0.60          | 0.10   | -       | -    | -          | -
210     | 1.00   | -             | -      | -       | -    | -          | -
472     | 1.80   | -             | 0.90   | -       | 0.15 | -          | -
632     | 0.30   | -             | 10.5   | 4.00    | -    | -          | -

Table 1: Textile color recipes (grams of colorant per kilogram of paint).
Another aspect is the visualization of the paste before and after it is printed on fabric. It is common to reach unexpected results on fabric even when the paste seems to be ready.

3. Research on Color Recipe Prediction

Colorant manufacturers invest considerable amounts of money in order to develop color recipes by computational systems. Ciba-Geigy, for instance, developed a system that predicts recipes using its reactive colorants. The main problem with such systems is that their solution is highly specialized, that is, it works only in a confined environment (only for the manufacturer's colorants). The system does not address crucial issues such as adaptability to the textile manufacturer's color policy, or the combination of colorants from different suppliers. The textile industry usually applies this system as the starting point for color development. Current research shows that colorant manufacturers keep working on their own color recipe systems. Ciba-Geigy, for instance, is developing a system for recipe prediction with its pigment colorants.
In 1990, three companies, including the textile manufacturer Coats Viyella, formed a consortium. One of the objectives was to develop a system for color recipe prediction in textile, paper and surface coating [5]. Table 2 shows the results of tests.
Artificial Intelligence techniques were also proposed for color recipe prediction. In order to model the specialized knowledge of the process, Bishop, Bushnell, and Westland developed a neural network for color recipe prediction [2]. The system was able to imitate the colorist behavior. The neural net developed consists of a backpropagation network with the architecture 3-8-16-3. The authors emphasize that the neural network easily identified the relationship between colorant and color, determining the amount of each colorant in the desired mix. The results were considered satisfactory, with 60% of the predictions resulting in an error (ΔE) smaller than 1. Another interesting result was the system response with respect to the amount of colorant. The predictions that should employ a single colorant produced the

biggest errors. However, when the mix involved more than one colorant, the system answered very well. In 78% of the cases the error was smaller than 0.8.
                   | Total of tests | ΔE
textile            | 22             | 1.0
surface coating    | 11             | 1.1
paper: transparent | 12             | 1.2
paper: opaque      |  5             | 1.1

Table 2: Test Results in Color Recipe Prediction.

4. RADIAL BASIS FUNCTION NETWORK (RBFN)

There exists a variety of different ways in which artificial neural networks (ANNs) can be used in pattern classification or approximation [6]. The backpropagation algorithm for training a MLP (supervised) could be seen as an application of a method of optimization known in statistics as stochastic approximation [4, 6]. We can visualize the design of an ANN as the approximation problem of finding the best result (curve-fitting) in a space of high dimension. Looking at the problem this way, learning is equivalent to finding a surface in a multidimensional space provided by the best adaptation of the training parameters, with the criterion of "the best adaptation" measured by some statistical method [6]. Correspondingly, generalization is equivalent to using this multidimensional surface to interpolate the test data. This is the approach used in RBFN, which was initially introduced for the solution of the real multivariate interpolation problem.
Broomhead and Lowe, in 1988, were among the first to explore the use of RBFN in the neural network field. The RBFN basically consists of an input layer, a hidden layer, and the output layer. Figure 1 shows the basic form of a RBFN. Each node in the hidden layer employs radial basis functions to produce a localized output with respect to the input signals. The outputs are combinations of weighted inputs that are mapped by an activation function that is radially symmetric. Each activation function requires a "center" and a scale parameter. The most common radial basis function is the Gaussian function, so that given an input vector x the output of a single node will be

$$y = f(x - c) \quad (4.1)$$

where the function f could be

$$f(x - c) = \frac{1}{(2\pi)^{n/2}\,\sigma_1\sigma_2\cdots\sigma_n} \exp\left\{-\sum_{j=1}^{n}\frac{(x_j - c_j)^2}{2\sigma_j^2}\right\} \quad (4.2)$$

The values of $\sigma_1, \sigma_2, \ldots, \sigma_n$, $j \in [1, n]$, are used in the same manner as with "normal" probability densities to provide "dispersion" scales in each component direction.
Another common variation on the basis functions is to increase their functionality using the Mahalanobis distance in the Gaussian function. The above equation becomes:

$$f(x - c) = \frac{1}{(2\pi)^{n/2}\,|K|^{1/2}} \exp\left\{-\frac{1}{2}(x - c)^T K^{-1}(x - c)\right\} \quad (4.3)$$

where $K^{-1}$ is the inverse of the covariance matrix of X associated with hidden node c.

Given p exemplar n-vectors, representing p classes, the network can be initiated with knowledge of the centers (locations of the exemplars). If $c_j$ represents the j-th exemplar vector, then we can define the weight matrix C as follows:

$$C = [c_1\; c_2\; \ldots\; c_p] \quad (4.4)$$

such that the weights in hidden node j are the components of the "center" $c_j$. Thus, a hidden-layer node calculates the expression of Eq. (4.2).
The output layer is a weighted sum of the hidden-layer outputs. When presenting an input vector x to the network, the network implements

$$y = W \cdot f(\lVert x - c \rVert) \quad (4.5)$$

where f represents the vector of functional outputs from the hidden layer, and c the corresponding center vector. Given some training data with desired responses, the output weights W can be found using the LMS rule iteratively or non-iteratively, via gradient descent and pseudo-inverse techniques, respectively.

Learning in the hidden layer is performed using an unsupervised method, typically a clustering algorithm, a heuristic clustering algorithm, or a supervised algorithm to find the cluster centers (hidden node c). The most common clustering algorithm used to train the hidden layer of a RBFN is the generalized Lloyd algorithm, or K-means clustering algorithm [16, 22]. Some studies have also used supervised learning of the locations of the centers and self-organizing learning of the centers [17, 18, 23].

Figure 1 - Basic structure of radial basis function networks.

A simple way of choosing the scaling factors for the Gaussian functions is to set them equal to the average distance between the training data and the corresponding center:

$$\sigma_j^2 = \frac{1}{M_j} \sum_{x \in \Theta_j} (x - c_j)^T (x - c_j) \quad (4.6)$$

where $\Theta_j$ is the set of training patterns grouped with cluster center $c_j$, and $M_j$ is the number of patterns in $\Theta_j$.
Another manner of choosing the $\sigma^2$ parameters is to calculate the distances between the centers in each dimension and use some percentage of this distance for the scaling factor. In this way, the p-nearest neighbor algorithm has been used. Sometimes, to improve

The objective is to increase the radius and consequently the amplitude or range of the
neuron [19].
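A compact sketch of such a network, using K-means for the unsupervised hidden layer, a single isotropic scale per center computed as in eq. (4.6) (a simplification of the per-dimension scales of eq. (4.2)), and a pseudo-inverse fit of the output weights:

```python
# Minimal RBFN sketch: K-means centers, one isotropic Gaussian scale per
# center (eq. (4.6)), output weights by pseudo-inverse (least squares).
import numpy as np
from scipy.cluster.vq import kmeans2

def sq_dists(X, C):
    # (n_samples, n_centers) matrix of squared Euclidean distances
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def train_rbfn(X, Y, n_centers):
    centers, labels = kmeans2(X, n_centers, minit='++')
    # eq. (4.6): mean squared distance of each cluster's patterns
    sigma2 = np.array([
        sq_dists(X[labels == j], centers[j:j + 1]).mean()
        if np.any(labels == j) else 1.0
        for j in range(n_centers)])
    H = np.exp(-0.5 * sq_dists(X, centers) / sigma2)  # hidden activations
    W = np.linalg.pinv(H) @ Y                         # least-squares fit
    return centers, sigma2, W

def rbfn_predict(X, centers, sigma2, W):
    return np.exp(-0.5 * sq_dists(X, centers) / sigma2) @ W
```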

5. The Application

The first step towards implementing the solution was the normalization of the environment, in order to obtain good working conditions. The steps were:
• to form a team and a strategy to evaluate the colors;
• definition of the colorants used;
• the type of fabric to be considered; and
• configuration of the textile print shop machine.
After the normalization of the environment, data acquisition was initiated. To do that, the pantone system was simulated on the fabric. The pantone system is formed by approximately a thousand diversified color samples, and is considered a good vehicle of color communication by colorist professionals.
When a certain color was reached, three spectral measurements were made with the X-rite 978 spectrophotometer. The development of a color recipe prediction system requires the utilization of technological resources, particularly the spectrophotometer. This equipment quantifies the human perception regarding a certain color [3, 13]. The spectrophotometer output is a color evaluation in the so-called Lab scale. The first step is to convert this scale to another parameter, more intuitive to the colorist. An example is the CMYK (Cyan, Magenta, Yellow, Black) scale. By working with this scale, the knowledge acquisition process becomes easier. The spectral data were converted to other scales of representation, XYZ and xyz [11].
The best results were obtained using two RBFN, according to Figure 2, where each net has its own functionality. The two stages were:
• composition: a RBFN to predict which colorants must be in the recipe. This network has the following features:
  • 9 inputs (Lab Lab Lab);
  • 7 outputs (the 7 colorants used by the industry);
• quantity: a RBFN to predict the amount (in grams) of each colorant identified in the first stage. The topology of this network was:
  • 16 inputs (Lab XYZ xyz and composition);
  • 7 outputs.

Figure 2: Implemented System.
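A sketch of how the two stages chain together at prediction time; composition_net and quantity_net stand for the two trained RBFNs, and the 0.5 selection threshold is an assumption:

```python
# Two-stage pipeline of Figure 2: composition_net maps the 9 Lab inputs to
# 7 colorant-presence scores; quantity_net maps the 16 inputs (Lab, XYZ,
# xyz and the composition) to 7 amounts in grams. The 0.5 threshold is an
# assumed choice, not taken from the paper.
import numpy as np

def predict_recipe(lab9, spectral9, composition_net, quantity_net):
    presence = composition_net(lab9)             # stage 1: which colorants
    selected = (presence > 0.5).astype(float)    # 7-dim 0/1 selection
    x16 = np.concatenate([spectral9, selected])  # stage 2 input (16 values)
    grams = quantity_net(x16) * selected         # zero out absent colorants
    return selected, grams
```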

To test the system, 21 colors were selected from the pantone system that had not been produced before, mainly because they were difficult to obtain. For the composition stage, the system presented the following results:
• 17 excellent compositions, resulting in 81% success;
• 2 compositions partially correct, where it was possible to reach the desired color with a small adjustment (9.5% of the compositions);
• 2 compositions completely wrong (error of 9.5%).
For the second stage, quantity, the system presented the following results:
• 11 excellent recipes;
• 8 recipes close to the target, needing small corrections; and
• 2 recipes completely wrong.

6. The Experience at the Textile Manufacturer


The system described in this paper is under development in the print shop laboratory of a textile manufacturer. The preliminary results have encouraged the managers to invest in further research on a system for color recipe determination. They see three major advantages in having a system for this task: (a) the system can reduce the time spent on recipe elaboration; (b) it can reduce the company's dependence on specialized staff (currently, the chief colorist is about to retire and there is great concern about losing know-how); and (c) it can be a suitable means of training new professionals.
The first results have already met these objectives to some extent. The system produces a response in simulated tests far faster than practical evaluation does. When the system response was used as the first recipe, the whole process was reduced from 24 to 2 hours, including the recipe composition and the tests applied to fabric samples.
The current experiments are being conducted by a novice colorist engaged in the knowledge acquisition process. He has used the system knowledge base in his own learning and has helped to develop new tests for performance evaluation.

7. Conclusions and Future Work


In this paper we described RBFNs applied to color recipe specification in a textile print shop. The intrinsic subjectivity of human color perception makes computational modeling a difficult task, particularly when conventional tools are applied. The proposed system offers a two-fold benefit: first, it captures the nature of the colorist's knowledge, making it usable in different environments and under different manufacturers' color policies; second, it makes color recipe formulation an automatic process, saving time and resources when new colors have to be set up for a certain fabric.
The system has already shown some economic benefits to textile manufacturers. The time and amount of raw material necessary to reach a certain color can be significantly reduced. The system yields a response free from the environmental and cognitive variables that influence the colorist in practical experiments. The most noticeable result is a faster product with smaller material losses.
Another feature discussed is the use of the system as a tutoring system. The system knowledge base is transparent and shows exactly the way the colorist approaches the problem. A new interface can be built using this knowledge to evaluate simulated responses and to show more appropriate recipes to beginners.

Although these results have already been observed in the field, there is still room for other approaches. In fact, the most difficult step, knowledge acquisition, can be notably improved by the adoption of automatic knowledge extraction techniques (e.g., rule extraction [15], fuzzy neural networks [20], or hybrid learning techniques [21]). Such methods can elicit rules directly from a set of samples composed of (color target, colorant mix) pairs, avoiding most of the steps of the laborious knowledge acquisition task needed to design a fuzzy system.

8. References
[1] Araújo, M., and Castro, E. M. M., Manual de Engenharia Têxtil (Textile Engineering Manual), Fundação Calouste Gulbenkian, Lisboa, September 1984.
[2] Bishop, J. M., Bushnell, M. J., and Westland, S., "Application of Neural Networks to Computer Recipe Prediction", Color Research and Application, John Wiley & Sons, New York, February 1991, pp. 3-9.
[3] Hirschler, R., Almeida, L. C. R., and Araújo, K. S., "Formulação computadorizada de receitas de cores de tingimento e estamparia têxtil: como obter sucesso na indústria" ("Computerized color recipes in textile dyeing and print shops: how to obtain success in industry"), Química Têxtil, Associação Brasileira de Químicos e Coloristas Têxteis, Barueri - São Paulo, September 1995, pp. 61-67.
[4] Moody, J., and Darken, C. J., "Fast Learning in Networks of Locally-Tuned Processing Units", Neural Computation, vol. 1, 281-294, 1989.
[5] Luo, R., Rhodes, P., Xin, J., and Scrivener, S., "Effective colour communication for industry", JSDC, Society of Dyers and Colourists, Bradford, December 1992, pp. 516-520.
[6] Haykin, S., Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, New York, 1994.
[7] Ribeiro, E. G., Como iniciar uma estamparia em silk-screen (How to Start a Silk-Screen Print Shop), CNI, Rio de Janeiro, 1987.
[8] Welstead, S. T., Neural Network and Fuzzy Logic Applications in C/C++, John Wiley & Sons, New York, 1994.
[9] Farina, M., Psicodinâmica das cores em comunicação (Psycho-dynamics of Colors in Communication), Editora Edgard Blücher Ltda, São Paulo, 1990.
[10] Vigo, T., Textile Processing and Properties: Preparation, Dyeing, Finishing and Performance, Elsevier, Amsterdam, 1994.
[11] Billmeyer, F. W. Jr., and Saltzman, M., Principles of Color Technology, John Wiley & Sons, New York, 1981.
[12] Ingamells, W., Colour for Textiles, Society of Dyers and Colourists, Bradford, 1993.
[13] Costa, M. R., "Princípios básicos da colorimetria" ("Basic Principles of Colorimetry"), Química Têxtil, Associação Brasileira de Químicos e Coloristas Têxteis, Barueri - São Paulo, June 1996, pp. 36-71.
[14] Lammens, J. M. G., A Computational Model of Color Perception and Color Naming, doctoral dissertation, Faculty of the Graduate School, State University of New York at Buffalo, New York, June 1994.

[15] Abe, S., and Lan, M.-S., "A method for fuzzy rule extraction directly from numerical data and its application to pattern classification", IEEE Transactions on Fuzzy Systems, vol. 3, no. 1, pp. 18-28, 1995.
[16] Hush, D. R., and Horne, B. G., "Progress in Supervised Neural Networks: What's New Since Lippmann?", IEEE Signal Processing Magazine, 8-39, January 1993.
[17] Lee, S., and Kil, R. M., "A Gaussian Potential Function Network with Hierarchically Self-Organizing Learning", Neural Networks, vol. 4, 207-224, 1991.
[18] Wettschereck, D., and Dietterich, T., "Improving the Performance of Radial Basis Function Networks by Learning Center Locations", Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann (eds.), 1133-1140, 1992.
[19] Saha, A., and Keeler, J. D., "Algorithms for Better Representation and Faster Learning in Radial Basis Function Networks", Advances in Neural Information Processing Systems 2, D. S. Touretzky (ed.), 482-489, 1990.
[20] Ishibuchi, H., Kwon, K., and Tanaka, H., "A learning algorithm of fuzzy neural networks with triangular fuzzy weights", Fuzzy Sets and Systems, vol. 71, pp. 277-293, 1995.
[21] Bonarini, A., "Evolutionary Learning of Fuzzy Rules: Competition and Cooperation", in Fuzzy Modeling: Paradigms and Practice, W. Pedrycz (ed.), Kluwer Academic Press, 1996.
[22] Todesco, J. L., Reconhecimento de padrões usando rede neuronal artificial com uma função de base radial: uma aplicação na classificação de cromossomos humanos (Pattern Recognition Using an Artificial Neural Network with a Radial Basis Function: An Application to the Classification of Human Chromosomes), doctoral thesis, Engenharia de Produção e Sistemas, UFSC, Florianópolis, 1995.
[23] Tontini, G., Automatização da identificação de padrões em gráficos de controle estatístico de processos (CEP) através de redes neurais com lógica difusa (Automation of Pattern Identification in Statistical Process Control (SPC) Charts Using Neural Networks with Fuzzy Logic), doctoral thesis, Engenharia Mecânica, UFSC, Florianópolis, 1995.
Predicting the Speed of Beer Fermentation
in Laboratory and Industrial Scale

Juho Rousu 1, Tapio Elomaa 2, and Robert Aarts 1*

1 VTT Biotechnology and Food Research, P.O. Box 1500
FIN-02044 VTT, Finland, juho.rousu@vtt.fi
2 Department of Computer Science, P.O. Box 26
FIN-00014 University of Helsinki, Finland, elomaa@cs.helsinki.fi

Abstract. Characteristic of the beer production process is the uncertainty caused by the complex biological raw materials and the yeast, a living organism. This uncertainty is exemplified by the fact that predicting the speed of the beer fermentation process is a non-trivial task.
We employ neural network and decision tree learning to predict the speed of the beer fermentation process. We use two data sets: one that comes from laboratory-scale experiments and another that has been collected from an industrial-scale brewing process. In the laboratory-scale experiments a neural network that employs characteristics of the ingredients and the condition of the yeast could predict the fermentation speed within 2% of the true value. Decision trees for classifying whether the speed of fermentation will be slow or fast were constructed from the same data. Astonishingly simple decision trees were able to predict the classes with 95%-98% accuracy. In contrast to the neural net experiment, even the highest accuracy could be reached by utilizing only standard brewery analyses.
We then set out to check the utility of these methods in a real brewery environment. The setting in the brewery is more complex and unpredictable than the laboratory in several ways. Regardless, reasonably good results were obtained: the neural network could, on average, predict the duration of the fermentation process within a day of the true value, an accuracy that is sufficient for today's brewery logistics. The accuracy of the decision tree in detecting slow fermentation was around 70%, which is also a useful result.

1 Introduction

The art of producing beers has developed over 5000-8000 years. Nevertheless, the complexity of the process still provides challenges to the brewers. Both the complexity of the ingredients and the unpredictable nature of the yeast, a living organism, contribute to the uncertainty that the breweries are forced to live with.
* Current address: Nokia Telecommunications, P.O. Box 370, FIN-00045 Nokia Group, Finland, robert.aarts@nokia.com.

From the production management point of view, the ability to predict the duration of the fermentations would be a useful one [3]. In practice, the fermentation times in seemingly equivalent settings can vary considerably, which hinders efficient scheduling of the plants. Moreover, the breweries are forced to make daily measurements to observe the course of the fermentations, in order to decide when to stop the process. With a good predictor for the fermentation speed, one could manage with fewer measurements.
In this paper we study how two predictor families, neural nets and decision trees, suit this problem. The task of the neural net is to predict the fermentation time and the task of the decision tree is to classify the batches as slow or fast. The neural net gives a continuous-valued prediction, while the decision tree is understandable even to the brewers. We perform two sets of tests. The first set is performed with data from laboratory tests. The second data set is collected from a real brewery.
The rest of this paper is organized as follows. First, Section 2 briefly explains
the beer fermentation process. Section 3 reviews the results that were obtained on
the laboratory-scale data. Section 4 goes through the results that were obtained
on the brewery data. Finally, Section 5 presents the conclusions of the current
work.

2 The beer fermentation process

The main ingredients of beer are malt, water and hops. The main phases of the
brewing process are wort production and fermentation.
The wort production starts with crushing the malt into coarse flour, which is
then mixed with water. The resulting porridge-like mash is heated according to
a carefully selected temperature program which encourages the malt enzymes to
partially solubilize the ground malt. The resulting sugar-rich aqueous extract,
wort, is then separated from the solids and boiled with hops. The wort is then
clarified and cooled.
The fermentation process starts with aerating the cooled wort and adding
yeast to it. The yeast starts to consume the nutrients contained in wort, in
order to stay alive and grow. At the same time, the yeast produces alcohols
and esters. Fermentation is controlled by regulating the temperature, oxygen
content, and the pitch rate; i.e., the amount of yeast put into the fermentation
tank. Temperature has a great effect on both the speed of fermentation and the
flavour of beer. The growth of yeast can be controlled by the oxygen content. The
pitch rate affects the fermentation speed but not as much as the temperature.
However, the effects of pitch rate on flavor are small, which permits larger changes without altering the flavor profile.
In addition, the course of fermentation is affected by other factors, such as
the wort composition and the yeast condition. Ideally, these factors should be
constant, so that the predictability of fermentation is maintained. In practice,
neither the wort composition nor the yeast condition is static. The natural variation of malt induces some variation to the wort composition, although such
variations can be diminished by re-planning the mashing recipes [1, 2].
The condition of the yeast is a more complicated issue. Traditionally, the
breweries have observed the viability, i.e. the percentage of live cells in the batch, by laboratory analyses. However, these methods do not tell anything about the
vitality of the yeast, i.e. the fermentation rate of the cells. The yeast used in
brewing is grown by the brewery and recycled many times before disposal. The
ability of the yeast to ferment is greatly dependent on the history of the yeast.
For example, new yeast typically behaves differently from yeast that has been recycled many times. Also, yeast that has been stored for long periods between fermentations is often less vital.
Ideally, the brewery should be able to modify the fermentation recipes so that
the variability of the yeast and wort would be canceled out. So, if the vitality
of the yeast is low, the brewery could increase the pitch rate or elevate the
temperature or oxygen content slightly. A fermentation recipe planner, such as
the Sophist system [8] is well suited to this task. A reliable estimate of the yeast
vitality is needed for such an approach, though. However, as one can expect from
the above introduction, no single analysis exists that would permit predicting
the time of fermentations to any reasonable degree.

3 Results using laboratory-scale data

A set of 100 fermentations [4] was used for both the artificial neural net (ANN)
and the decision tree experiments. This data set contains fermentations with
recycled yeast (up to 4 cycles) and fermentations with freshly propagated yeast.
The worts used in these experiments were all made according to one recipe using a single lot of malt extract. Hence the worts were all very similar indeed.
Yeast viability was assessed by methylene blue (MB) and methylene violet (MV)
staining, both at the end and at the start of a fermentation. In addition, the
trehalose content of the yeast, which is a stress indicator, was measured before
pitching. The pitching rate was constant. As a fourth yeast condition measure-
ment the acidifying power (AP) was recorded. Cropped yeast was aerated for
0, 3 or 5 hours before pitching. The percentage of apparent fermentation--the
percentage of sugars consumed--was calculated from daily measurements of the
specific gravity (SG) of the wort. A review of these measurements is given, e.g.,
by Londesborough [5].

3.1 Neural net results

The first approach was to train an ANN on this data. In the work presented here an ANN was trained to predict the relative degree of fermentation at 72 and 130 hours. Several sets of inputs were used, in order to see which analyses contribute to the quality of prediction.
A number of neural nets to estimate the apparent degree of fermentation at
72 and 130 hours were trained. For each net approximately 75% of the available

Table 1. The error of prediction of the degree of fermentation of neural nets using different measurements. The errors are given in absolute percentages, i.e. the difference between the predicted value and the actual measured value was never more than the given error. "Prev. adf" means the measured degree of fermentation of the batch that the yeast was cropped from. This value is not available when freshly propagated yeast is used.

MB  MV  Trehalose  Aerat. time  SG  Wort O2/pH/temp  Prev. adf | Error ADF72 | Error ADF130
X   X   X   X   X   X                                          | ±           | ±
X   X   X   X   X                                              | ±           | ±
X   X   X   X   X   X                                          | ±           | ±
X   X   X                                                      | ±8.1%       | ±
X   X   X   X   X                                              | ±           | ±

data was used for training and 25% for validation. The nets differed in the input
measurements used. Table 1 lists the inputs to the nets that were constructed
and the prediction errors of these nets.
It can be seen that information about the behavior of the yeast in previous batches is rather useful; inclusion of this data reduces the error of prediction significantly. For freshly propagated yeast such data is not available, and it is
therefore more difficult to predict the behavior of such yeast. Adding informa-
tion about the physiological condition of the yeast in the form of the trehalose
measurement helped prediction in this case.

3.2 Decision tree results

Another approach to predicting the starting speed of fermentation is to classify a batch as slow, normal or fast, based on the descent of specific gravity in the first 72 hours. We assigned these classes according to "natural" clusters that were seen in the data. These kinds of classification tasks are particularly suitable for symbolic learning methods such as decision tree or rule inducers. A benefit of these kinds of methods is that the predictor, i.e. the rule, is understandable to humans. We used the latest version of the well-known C4.5 decision tree learning package [6, 7] to build the classification rules.
In the decision tree experiments, the predictive accuracies of the trees were estimated by the 10×10-fold cross-validation method, which works by dividing the data into 10 subsets and then using each subset in turn as the test set while the other 9 subsets are used for training. The whole process was repeated 10 times, and the accuracies and standard deviations reported below are averages over the 10 iterations. Cross-validation accuracies of this kind are considered reliable estimates of the performance of the prediction method on unseen cases; a sketch of this protocol follows this paragraph. In our first test, all measurements of yeast condition and wort quality were available for the learning algorithm to choose from. In addition, the performance of the yeast in the previous batch was included, that is, whether the start of the fermentation in the previous batch was slow, normal or fast.
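As an illustration, a modern sketch of this 10×10-fold protocol could look as follows, substituting scikit-learn's CART-style decision tree for C4.5 (an assumption made for runnability; C4.5 itself is not available as a Python library).

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def repeated_cv_accuracy(X, y, repeats=10, folds=10, seed=0):
    """Mean accuracy and standard deviation over 10 repetitions of 10-fold CV."""
    cv = RepeatedStratifiedKFold(n_splits=folds, n_repeats=repeats, random_state=seed)
    scores = cross_val_score(DecisionTreeClassifier(random_state=seed), X, y, cv=cv)
    # Average within each repetition, then summarize over the 10 repetitions.
    per_rep = scores.reshape(repeats, folds).mean(axis=1)
    return per_rep.mean(), per_rep.std()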

Of this set only two measurements appeared in the tree (Table 2) induced from the whole data, namely the methylene blue measurement and, somewhat surprisingly, the specific gravity at the start of the fermentation. The training accuracy of the depicted tree, as well as that of the trees in Tables 3 and 4, is 98%. The cross-validation accuracy (i.e. the estimated performance on unseen cases) of this scheme is 97.8% ± 0.4%, meaning that circa 2% of new batches would be misclassified using this rule.

Table 2. Rule induced from the whole data.

if MB > 8.8 then slow
else if SG <= 1.04353 then normal
else if MB <= 3 then fast
else normal
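Because the tree is so small, it transcribes directly into code; the following function is a literal rendering of the rule in Table 2, with the thresholds printed there.

def classify_batch(mb, sg):
    """MB: methylene blue staining value; SG: specific gravity at fermentation start."""
    if mb > 8.8:
        return "slow"
    elif sg <= 1.04353:
        return "normal"
    elif mb <= 3:
        return "fast"
    return "normal"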

The next experiment was to exclude the measurements that were chosen in the first round, one at a time, in order to see how dependent the predictability was on the two preferred measurements. Excluding the methylene blue measurement had the effect of bringing the acidifying power (AP) measurement into the tree (Table 3). The original gravity was present in this tree also. The cross-validation accuracy of this setting dropped to 95%, suggesting that methylene blue is a more robust measurement for this task than the acidifying power. A minor surprise was that the methylene violet measurement did not appear in the rule, even though it was deemed useful in the neural net experiments.

Table 3. Rule induced from the data where the methylene blue measurement was excluded.

if AP <= 2.1425 then slow
else if SG <= 1.04353 then normal
else if AP <= 2.515 then normal
else fast

Our third experiment was to exclude the original specific gravity from the set of available measurements. The effects were parallel to the second experiment: the specific gravity was replaced with the pH of the original wort. The cross-validation accuracy was 94.8% ± 0.7%, again pointing out that the original gravity is the more informative measurement in this setting.
The immediate conclusion to be drawn from these decision tree experiments is that predicting whether a batch will be slow or not can be done with surprisingly little information about the wort quality and the yeast condition. Only two measurements, one of each kind, appear in each of the three decision trees, which, moreover, all have very high accuracy.

Table 4. Rule induced from the data where the original specific gravity measurement was excluded.

if MB > 8.8 then slow
else if pH > 5.53 then normal
else if AP <= 2.515 then normal
else fast

One important question arose from these experiments: why does the original specific gravity of the wort seem to be necessary in predicting the speed of fermentation? This finding seems peculiar since the wort was of very even quality in the different batches. Another question was why replacing the SG measurement with pH gives almost as good results. A technical answer to this question is that the two measurements are quite strongly inversely correlated (r = -0.7818) in this data. Still, the answer to the fundamental question of why either of these measurements is relevant remains unclear.

4 Results on brewery data

We set out to validate the laboratory results on an industrial scale. To that end, we collected data on 118 fermentations from a brewery.
The set of attributes in the data was different from the one used in the laboratory tests: the brewery uses an online capacitance measuring device for assessing the viability of the yeast mass rather than the staining methods. The benefit of the former approach is that the whole yeast mass is measured instead of a small sample. In addition, the volume of the pitched yeast was used as an additional measurement.
Since the history of the yeast was found important in the laboratory data set, the fermentation time of the previous round of the yeast was included. Also, the length of the history of the yeast, as the number of fermentations, was included. The propagations from which the yeasts originated were more numerous, which contributed variation that could not be coded into the data.
The fermentation tank was filled with two brews that entered the tank at intervals of varying length (several hours in each case). To manage this complication, we found it necessary to include in the data set the first SG measurement performed on the full tank, in addition to the average of the original specific gravities of the two brews. The interim time between the two brews was also included in the data set; a sketch of the resulting encoding is given below.
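As an illustration of how one brewery batch might be encoded from the attributes listed above, a single record could look as follows; the field names are our assumptions, not the brewery's actual schema.

from dataclasses import dataclass, astuple

@dataclass
class BreweryBatch:
    viability_capacitance: float   # online capacitance measurement of the yeast mass
    pitched_yeast_volume: float    # volume of the pitched yeast
    prev_fermentation_time: float  # fermentation time of the yeast's previous round
    yeast_generation: int          # number of fermentations in the yeast's history
    first_full_tank_sg: float      # first SG measured from the full tank
    avg_original_sg: float         # average original SG of the two brews
    interim_hours: float           # interim time between the two brews

def to_features(batch: BreweryBatch) -> list[float]:
    """Flatten one batch record into a numeric input vector for a predictor."""
    return [float(v) for v in astuple(batch)]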

4.1 Neural net results

We trained a neural network using the backpropagation learning algorithm with a momentum term to avoid local minima. The data was split into training (70%) and test sets (30%). The size of the network was decided by manually checking

Fig. 1. Neural network prediction results. Each dot represents one prediction: the squares correspond to data items that were included in the training set and the circles represent predictions on fresh cases. The solid line corresponds to the correct prediction and the dashed lines are the one-day error margins.

the predictive accuracy on the test set. A network of 2×4 hidden units was found to give a good result when all yeast strains were present in the data. In contrast, a network as small as 1×3 units was found to generalize well when the data was restricted to include just one strain.
The predictions given by the network of 1×3 hidden units are depicted in Figure 1, which plots the fermentation speed predictions given by the network on the training and test data against the measured duration. A correct prediction falls on the solid line. The dashed lines represent error margins of one day. It can be seen that most predictions are within the one-day error limit. The average deviation of the predictions is 0.6 days (14.4 hours), which is clearly worse than the best results (1 hour and 6.5 hours) obtained on the laboratory-scale data. However, taking the more complicated real-world setting into account, the result is satisfactory.
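A minimal NumPy sketch of such a backpropagation loop with a momentum term is given below; the one-hidden-layer topology, activation choice and learning constants are illustrative assumptions, not the brewery's actual settings.

import numpy as np

def train_mlp(X, y, hidden=3, lr=0.05, momentum=0.9, epochs=2000, seed=0):
    """Backpropagation for a regression MLP, with momentum on every update."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1));          b2 = np.zeros(1)
    vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
    vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)
    for _ in range(epochs):
        # Forward pass: tanh hidden layer, linear output (a regression target).
        h = np.tanh(X @ W1 + b1)
        err = (h @ W2 + b2) - y.reshape(-1, 1)
        # Backward pass: mean-squared-error gradients.
        gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
        # The momentum term keeps a running velocity, helping to roll past
        # shallow local minima instead of settling into them.
        vW2 = momentum * vW2 - lr * gW2; W2 += vW2
        vb2 = momentum * vb2 - lr * gb2; b2 += vb2
        vW1 = momentum * vW1 - lr * gW1; W1 += vW1
        vb1 = momentum * vb1 - lr * gb1; b1 += vb1
    return lambda Xn: np.tanh(Xn @ W1 + b1) @ W2 + b2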

4.2 Decision tree results


Again the ten times repeated cross-validation testing was applied with the C4.5 decision tree learning algorithm [6, 7]. The intention was not to predict the actual

speed of fermentation, but simply to classify whether the fermentation is slow or fast. The axis-parallel, single-attribute-value categorization carried out by C4.5 does not have the chance of producing as good results in the very noisy real-world data as neural networks do. Nevertheless, the intelligible classifiers produced are of interest.
Depending very much on the attribute set used, quite different-looking decision trees were produced. Quite heavy pruning was needed to delete the many apparent dependencies produced by noise. Despite all this, the prediction accuracy level using decision trees is quite stable: in repeated cross-validation, 67-70% prediction accuracy was obtained with many different settings. In contrast, the overall prediction accuracy of the neural network on the brewery data is approximately 73%.

5 Conclusions

In light of the laboratory-scale experiments, it is quite possible to predict the behavior of beer fermentations. If accurate predictions are desired, it is necessary to have detailed information about the wort and the history of the yeast. Neural nets can be trained on such data, probably one net for each strain. If an early warning for exceptionally slow (or fast) fermentations suffices, it is possible to use a simple decision tree that employs only a very small set of routine measurements.
In the brewery data, however, there exists variation that cannot be directly encoded into the input variables. For example, the propagations from which the yeast originated were distinctly different from each other. Therefore, the results cannot reach the same exactness as in the laboratory-scale experiments. Nevertheless, the obtained results would seem to be applicable in the breweries.

References

1. Aarts, R., Sjöholm, K., Home, S., Pietilä, K.: Computer-Planned Mashing. In: Proceedings of the Twenty-Fourth Congress, European Brewery Convention. IRL Press, Oxford (1993) 655-662
2. Aarts, R., Rousu, J.: Towards CBR for Bioprocess Planning. In: Smith, I., Faltings, B. (eds.): Advances in Case-Based Reasoning, Proceedings of the Third European Workshop, EWCBR-96. Lecture Notes in Artificial Intelligence, Vol. 1168. Springer-Verlag, Berlin Heidelberg New York (1996) 16-27
3. Cummins, S., Plant, N., Kelleher, P., O'Connor, J.B.: Optimisation of Brewery Operations Using Fuzzy Logic and Simulation Tools. Proceedings of the International Symposium on Automatic Control of Food and Biological Processes. SIK, Göteborg, Sweden (1998) 459-467
4. Kataja, K.: Yeast Recycling in Main Fermentation of Beer (in Finnish). Master's thesis, Department of Chemical Technology, Helsinki University of Technology, Finland (1997)
5. Londesborough, J.: The Measurement of Yeast Viability in Breweries (in Finnish). Mallas ja Olut 5 (Oct. 1998) 139-148
6. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, Calif. (1993)
7. Quinlan, J.R.: Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 4 (1996) 77-90
8. Rousu, J., Aarts, R.: Case-Based Planning Methods in Biotechnical and Food Processes. Proceedings of the International Symposium on Automatic Control of Food and Biological Processes. SIK, Göteborg, Sweden (1998) 215-224
