
Neural Networks and Statistical Learning

This textbook introduces neural networks and machine learning in a statistical framework. The contents cover almost all the major popular neural network models and statistical learning approaches, including the multilayer perceptron, the Hopfield network, the radial basis function network, clustering models and algorithms, associative memory models, recurrent networks, principal component analysis, independent component analysis, nonnegative matrix factorization, discriminant analysis, probabilistic and Bayesian models, support vector machines, kernel methods, fuzzy logic, neurofuzzy models, hardware implementations, and some machine learning topics. Applications of these approaches to biometrics/bioinformatics and data mining are finally given. This book is the first of its kind that gives a very comprehensive, yet in-depth introduction to neural networks and statistical learning. This book is helpful for all academic and technical staff in the fields of neural networks, pattern recognition, signal processing, machine learning, computational intelligence, and data mining. Many examples and exercises are given to help the readers understand the material covered in the book.

K.-L. Du received his PhD in electrical engineering from Huazhong University of Science and Technology, Wuhan, China, in 1998. He is the Chief Scientist at Enjoyor Inc, Hangzhou, China. He has been an Affiliate Associate Professor at Concordia University since June 2011. He was on the research staff of the Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada from 2001 to 2010. Prior to 2001, he was on the technical staff of Huawei Technologies, the China Academy of Telecommunication Technology, and the Chinese University of Hong Kong. He worked as a researcher with Hong Kong University of Science and Technology in 2008. Dr. Du's research interests cover signal processing, wireless communications, machine learning, and intelligent systems. He has coauthored two books (Neural Networks in a Softcomputing Framework, Springer, London, 2006; Wireless Communication Systems, Cambridge University Press, New York, 2010) and has published over 45 papers. Dr. Du is a Senior Member of the IEEE. Presently, he is on the editorial boards of IET Signal Processing, British Journal of Mathematics and Computer Science, and Circuits, Systems & Signal Processing.

M.N.S. Swamy received the B.Sc. (Hons.) degree in Mathematics from Mysore University, India, in 1954, the Diploma in Electrical Communication Engineering from the Indian Institute of Science, Bangalore, in 1957, and the M.Sc. and Ph.D. degrees in Electrical Engineering from the University of Saskatchewan, Saskatoon, Canada, in 1960 and 1963, respectively. In August 2001 he was awarded a Doctor of Science in Engineering (Honoris Causa) by Ansted University "In recognition of his exemplary contributions to the research in Electrical and Computer Engineering and to Engineering Education, as well as his dedication to


the promotion of Signal Processing and Communications Applications". He was bestowed the title of Honorary Professor by the National Chiao Tung University, Taiwan, in 2009.

He is presently a Research Professor and the holder of a Concordia Tier I chair, and was the Director of the Center for Signal Processing and Communications from 1996 till 2011 in the Department of Electrical and Computer Engineering at Concordia University, Montreal, Canada, where he served as the founding Chair of the Department of Electrical Engineering from 1970 to 1977, and as Dean of Engineering and Computer Science from 1977 to 1993. He has published extensively in the areas of circuits, systems and signal processing, and holds five patents. He is the co-author of six books: Graphs, Networks and Algorithms (New York, Wiley, 1981), Graphs: Theory and Algorithms (New York, Wiley, 1992), Switched Capacitor Filters: Theory, Analysis and Design (Prentice Hall International UK Ltd., 1995), Neural Networks in a Softcomputing Framework (Springer, 2006), Modern Analog Filter Analysis and Design (Wiley-VCH, 2010) and Wireless Communication Systems (Cambridge University Press, New York, 2010). A Russian translation of the first book was published by Mir Publishers, Moscow, in 1984, while a Chinese version was published by the Education Press, Beijing, in 1987. Dr. Swamy is a Fellow of many societies, including the IEEE, IET (UK) and EIC (Canada). He has served the IEEE CAS Society in various capacities, such as President in 2004, Vice-President (Publications) during 2001-2002, Vice-President in 1976, and Editor-in-Chief of the IEEE Transactions on Circuits and Systems I from June 1999 to December 2001. He is the recipient of many IEEE-CAS Society awards, including the Education Award in 2000, the Golden Jubilee Medal in 2000, and the 1986 Guillemin-Cauer Best Paper Award. Since 1999, he has been the Editor-in-Chief of the journal Circuits, Systems and Signal Processing. Recently, Concordia University instituted a Research Chair in his name in recognition of his research contributions.

Neural Networks and Statistical Learning

KE-LIN DU and M. N. S. SWAMY
Enjoyor Labs, Enjoyor Inc., China
Concordia University, Canada
April 28, 2013


In memory of my grandparents

To my family

K.-L. Du

M.N.S. Swamy

To all the researchers with original contributions to neural networks and machine learning

K.-L. Du, M.N.S. Swamy

Contents

List of Abbreviations

1 Introduction
  1.1 Major events in neural networks research
  1.2 Neurons
    1.2.1 The McCulloch-Pitts neuron model
    1.2.2 Spiking neuron models
  1.3 Neural networks
  1.4 Scope of the book
  References

2 Fundamentals of Machine Learning
  2.1 Learning methods
  2.2 Learning and generalization
    2.2.1 Generalization error
    2.2.2 Generalization by stopping criterion
    2.2.3 Generalization by regularization
    2.2.4 Fault tolerance and generalization
    2.2.5 Sparsity versus stability
  2.3 Model selection
    2.3.1 Crossvalidation
    2.3.2 Complexity criteria
  2.4 Bias and variance
  2.5 Robust learning
  2.6 Neural network processors
  2.7 Criterion functions
  2.8 Computational learning theory
    2.8.1 Vapnik-Chervonenkis dimension
    2.8.2 Empirical risk-minimization principle
    2.8.3 Probably approximately correct (PAC) learning
  2.10 Neural networks as universal machines
    2.10.1 Boolean function approximation
    2.10.2 Linear separability and nonlinear separability
    2.10.3 Continuous function approximation
    2.10.4 Winner-takes-all
  2.11 Compressed sensing and sparse approximation
    2.11.1 Compressed sensing
    2.11.2 Sparse approximation
    2.11.3 LASSO and greedy pursuit
  2.12 Bibliographical notes
  References

3 Perceptrons
  3.1 One-neuron perceptron
  3.2 Single-layer perceptron
  3.3 Perceptron learning algorithm
  3.4 Least mean squares (LMS) algorithm
  3.5 P-delta rule
  3.6 Other learning algorithms
  References

4 Multilayer perceptrons: architecture and error backpropagation
  4.1 Introduction
  4.2 Universal approximation
  4.3 Backpropagation learning algorithm
  4.4 Incremental learning versus batch learning
  4.5 Activation functions for the output layer
  4.6 Optimizing network structure
    4.6.1 Network pruning using sensitivity analysis
    4.6.2 Network pruning using regularization
    4.6.3 Network growing
  4.7 Speeding up learning process
    4.7.1 Eliminating premature saturation
    4.7.2 Adapting learning parameters
    4.7.3 Initializing weights
    4.7.4 Adapting activation function
  4.8 Some improved BP algorithms
    4.8.1 BP with global descent
    4.8.2 Robust BP algorithms
  4.9 Resilient propagation (RProp)
  References

5 Multilayer perceptrons: other learning techniques
  5.1 Introduction to second-order learning methods
  5.2 Newton's methods
    5.2.1 Gauss-Newton method
    5.2.2 Levenberg-Marquardt method
  5.3 Quasi-Newton methods
    5.3.1 BFGS method
    5.3.2 One-step secant method
  5.4 Conjugate-gradient methods
  5.5 Extended Kalman filtering methods
  5.6 Recursive least squares
  5.7 Natural-gradient descent method
  5.8 Other learning algorithms
    5.8.1 Layerwise linear learning
  5.9 Escaping local minima
  5.10 Complex-valued MLPs and their learning
    5.10.1 Split complex BP
    5.10.2 Fully complex BP
  References

6 Hopfield networks, simulated annealing and chaotic neural networks
  6.1 Hopfield model
  6.2 Continuous-time Hopfield network
  6.3 Simulated annealing
  6.4 Hopfield networks for optimization
    6.4.1 Combinatorial optimization problems
    6.4.2 Escaping local minima for combinatorial optimization problems
    6.4.3 Solving other optimization problems
  6.5 Chaos and chaotic neural networks
    6.5.1 Chaos, bifurcation, and fractals
    6.5.2 Chaotic neural networks
  6.6 Multistate Hopfield networks
  6.7 Cellular neural networks
  References

7 Associative memory networks
  7.1 Introduction
  7.2 Hopfield model: storage and retrieval
    7.2.2 Pseudoinverse rule
    7.2.3 Perceptron-type learning rule
    7.2.4 Retrieval stage
  7.3 Storage capability of the Hopfield model
  7.4 Increasing storage capacity
  7.5 Multistate Hopfield networks for associative memory
  7.6 Multilayer perceptrons as associative memories
  7.7 Hamming network
  7.8 Bidirectional associative memories
  7.9 Cohen-Grossberg model
  7.10 Cellular networks
  References

8 Clustering I: Basic clustering models and algorithms
  8.1 Introduction
    8.1.1 Vector quantization
    8.1.2 Competitive learning
  8.2 Self-organizing maps
    8.2.1 Kohonen network
    8.2.2 Basic self-organizing maps
  8.3 Learning vector quantization
  8.4 Nearest-neighbor algorithms
  8.5 Neural gas
  8.6 ART networks
    8.6.1 ART models
    8.6.2 ART 1
  8.7 C-means clustering
  8.8 Subtractive clustering
  8.9 Fuzzy clustering
    8.9.1 Fuzzy C-means clustering
    8.9.2 Other fuzzy clustering algorithms
  References

9 Clustering II: topics in clustering
  9.1 The underutilization problem
    9.1.1 Competitive learning with conscience
    9.1.2 Rival penalized competitive learning
    9.1.3 Softcompetitive learning
  9.2 Robust clustering
    9.2.1 Possibilistic C-means
  9.3 Supervised clustering
  9.4 Clustering using non-Euclidean distance measures
  9.5 Partitional, hierarchical and density-based clustering
  9.6 Hierarchical clustering
    9.6.1 Distance measures, cluster representations and dendrograms
    9.6.2 Minimum spanning tree (MST) clustering
    9.6.3 BIRCH, CURE, CHAMELEON and DBSCAN
    9.6.4 Hybrid hierarchical/partitional clustering
  9.7 Constructive clustering techniques
  9.8 Cluster validity
    9.8.1 Measures based on compactness and separation of clusters
    9.8.2 Measures based on hypervolume and density of clusters
    9.8.3 Crisp silhouette and fuzzy silhouette
  9.9 Projected clustering
  9.10 Spectral clustering
  9.11 Coclustering
  9.12 Handling qualitative data
  9.13 Bibliographical notes
  References

10 Radial basis function networks
  10.1 Introduction
    10.1.1 RBF network architecture
    10.1.2 Universal approximation of RBF networks
    10.1.3 RBF networks and classification
    10.1.4 Learning for RBF networks
  10.2 Radial basis functions
  10.3 Learning RBF centers
  10.4 Learning the weights
    10.4.1 Least squares methods for weight learning
  10.5 RBF network learning using orthogonal least squares
    10.5.1 Batch orthogonal least squares
    10.5.2 Recursive orthogonal least squares
  10.6 Supervised learning of all parameters
    10.6.1 Supervised learning for general RBF networks
    10.6.2 Supervised learning for Gaussian RBF networks
    10.6.3 Discussion on supervised learning
    10.6.4 Extreme learning machines
  10.7 Various learning methods
  10.8 Normalized RBF networks
  10.9 Optimizing network structure
    10.9.2 Resource-allocating networks
    10.9.3 Pruning methods
  10.10 Complex RBF networks
  10.11 A comparison of RBF networks and MLPs
  10.12 Bibliographical notes
  References

11 Recurrent neural networks
  11.1 Introduction
  11.2 Fully connected recurrent networks
  11.3 Time-delay neural networks
  11.4 Backpropagation for temporal learning
  11.5 RBF networks for modeling dynamic systems
  11.6 Some recurrent models
  11.7 Reservoir computing
  References

12 Principal component analysis
  12.1 Introduction
    12.1.1 Hebbian learning rule
    12.1.2 Oja's learning rule
  12.2 PCA: conception and model
    12.2.1 Factor analysis
  12.3 Hebbian rule based PCA
    12.3.1 Subspace learning algorithms
    12.3.2 Generalized Hebbian algorithm
  12.4 Least mean squared error-based PCA
    12.4.1 Other optimization-based PCA
  12.5 Anti-Hebbian rule based PCA
    12.5.1 APEX algorithm
  12.6 Nonlinear PCA
    12.6.1 Autoassociative network-based nonlinear PCA
  12.7 Minor component analysis
    12.7.1 Extracting the first minor component
    12.7.2 Self-stabilizing minor component analysis
    12.7.3 Oja-based MCA
    12.7.4 Other algorithms
  12.8 Constrained PCA
    12.8.1 Sparse PCA
  12.9 Localized PCA, incremental PCA and supervised PCA
  12.11 Two-dimensional PCA
  12.12 Generalized eigenvalue decomposition
  12.13 Singular value decomposition
    12.13.1 Crosscorrelation asymmetric PCA networks
    12.13.2 Extracting principal singular components for nonsquare matrices
    12.13.3 Extracting multiple principal singular components
  12.14 Canonical correlation analysis
  References

13 Nonnegative matrix factorization
  13.1 Introduction
  13.2 Algorithms for NMF
    13.2.1 Multiplicative update algorithm and alternating nonnegative least squares
  13.3 Other NMF methods
    13.3.1 NMF methods for clustering
  References

14 Independent component analysis
  14.1 Introduction
  14.2 ICA model
  14.3 Approaches to ICA
  14.4 Popular ICA algorithms
    14.4.1 Infomax ICA
    14.4.2 EASI, JADE, and natural-gradient ICA
    14.4.3 FastICA algorithm
  14.5 ICA networks
  14.6 Some ICA methods
    14.6.1 Nonlinear ICA
    14.6.2 Constrained ICA
    14.6.3 Nonnegativity ICA
    14.6.4 ICA for convolutive mixtures
    14.6.5 Other methods
  14.7 Complex-valued ICA
  14.8 Stationary subspace analysis and slow feature analysis
  14.9 EEG, MEG and fMRI
  References

15 Discriminant analysis
    15.1.1 Solving small sample size problem
  15.2 Fisherfaces
  15.3 Regularized LDA
  15.4 Uncorrelated LDA and orthogonal LDA
  15.5 LDA/GSVD and LDA/QR
  15.6 Incremental LDA
  15.7 Other discriminant methods
  15.8 Nonlinear discriminant analysis
  15.9 Two-dimensional discriminant analysis
  References

16 Support vector machines
  16.1 Introduction
  16.2 SVM model
  16.3 Solving the quadratic programming problem
    16.3.1 Chunking
    16.3.2 Decomposition
    16.3.3 Convergence of decomposition methods
  16.4 Least-squares SVMs
  16.5 SVM training methods
    16.5.1 SVM algorithms with reduced kernel matrix
    16.5.2 ν-SVM
    16.5.3 Cutting-plane technique
    16.5.4 Gradient-based methods
    16.5.5 Training SVM in the primal formulation
    16.5.6 Clustering-based SVM
    16.5.7 Other methods
  16.6 Pruning SVMs
  16.7 Multiclass SVMs
  16.8 Support vector regression
  16.9 Support vector clustering
  16.10 Distributed and parallel SVMs
  16.11 SVMs for one-class classification
  16.12 Incremental SVMs
  16.13 SVMs for active, transductive and semi-supervised learning
    16.13.1 SVMs for active learning
    16.13.2 SVMs for transductive or semi-supervised learning
  16.14 Probabilistic approach to SVM
    16.14.1 Relevance vector machines
  References

17 Other kernel methods
  17.1 Introduction
  17.2 Kernel PCA
  17.3 Kernel LDA
  17.4 Kernel clustering
  17.5 Kernel autoassociators, kernel CCA and kernel ICA
  17.6 Other kernel methods
  17.7 Multiple kernel learning
  References

18 Reinforcement learning
  18.1 Introduction
  18.2 Learning through awards
  18.3 Actor-critic model
  18.4 Model-free and model-based reinforcement learning
  18.5 Temporal-difference learning
  18.6 Q-learning
  18.7 Learning automata
  References

19 Probabilistic and Bayesian networks
  19.1 Introduction
    19.1.1 Classical vs. Bayesian approach
    19.1.2 Bayes' theorem
    19.1.3 Graphical models
  19.2 Bayesian network model
  19.3 Learning Bayesian networks
    19.3.1 Learning the structure
    19.3.2 Learning the parameters
    19.3.3 Constraint-handling
  19.4 Bayesian network inference
    19.4.1 Belief propagation
    19.4.2 Factor graphs and the belief propagation algorithm
  19.5 Sampling (Monte Carlo) methods
    19.5.1 Gibbs sampling
  19.6 Variational Bayesian methods
  19.7 Hidden Markov models
  19.8 Dynamic Bayesian networks
  19.9 Expectation-maximization algorithm
  19.10 Mixture models
    19.10.1 Probabilistic PCA
    19.10.2 Probabilistic clustering
    19.10.3 Probabilistic ICA
  19.11 Bayesian approach to neural network learning
  19.12 Boltzmann machines
    19.12.1 Boltzmann learning algorithm
    19.12.2 Mean-field-theory machine
    19.12.3 Stochastic Hopfield networks
  19.13 Training deep networks
  References

20 Combining multiple learners: data fusion and ensemble learning
  20.1 Introduction
    20.1.1 Ensemble learning methods
    20.1.2 Aggregation
  20.2 Boosting
    20.2.1 AdaBoost
  20.3 Bagging
  20.4 Random forests
  20.5 Topics in ensemble learning
  20.6 Solving multiclass classification
    20.6.1 One-against-all strategy
    20.6.2 One-against-one strategy
    20.6.3 Error-correcting output codes (ECOCs)
  20.7 Dempster-Shafer theory of evidence
  References

21 Introduction of fuzzy sets and logic
  21.1 Introduction
  21.2 Definitions and terminologies
  21.3 Membership function
  21.4 Intersection, union and negation
  21.5 Fuzzy relation and aggregation
  21.6 Fuzzy implication
  21.7 Reasoning and fuzzy reasoning
    21.7.1 Modus ponens and modus tollens
    21.7.2 Generalized modus ponens
    21.7.3 Fuzzy reasoning methods
  21.8 Fuzzy inference systems
    21.8.1 Fuzzy rules and fuzzy inference
    21.8.2 Fuzzification and defuzzification
    21.9.1 Mamdani model
    21.9.2 Takagi-Sugeno-Kang model
  21.10 Complex fuzzy logic
  21.11 Possibility theory
  21.12 Case-based reasoning
  21.13 Granular computing and ontology
  References

22 Neurofuzzy systems
  22.1 Introduction
    22.1.1 Interpretability
  22.2 Rule extraction from trained neural networks
    22.2.1 Fuzzy rules and multilayer perceptrons
    22.2.2 Fuzzy rules and RBF networks
    22.2.3 Rule extraction from SVMs
    22.2.4 Rule generation from other neural networks
  22.3 Extracting rules from numerical data
    22.3.1 Rule generation based on fuzzy partitioning
    22.3.2 Other methods
  22.4 Synergy of fuzzy logic and neural networks
  22.5 ANFIS model
  22.6 Fuzzy SVMs
  22.7 Other neurofuzzy models
  References

23 Neural circuits and parallel implementation
  23.1 Introduction
  23.2 Hardware/software codesign
  23.3 Topics in digital circuit designs
  23.4 Circuits for neural-network models
    23.4.1 Circuits for MLPs
    23.4.2 Circuits for RBF networks
    23.4.3 Circuits for clustering
    23.4.4 Circuits for SVMs
    23.4.5 Circuits of other models
  23.5 Fuzzy neural circuits
  23.6 Graphic processing unit (GPU) implementation
  23.7 Implementation using systolic algorithms
  23.8 Implementation using parallel computers
  23.9 Implementation using cloud computing
  References

24 Pattern recognition for biometrics and bioinformatics
  24.1 Biometrics
    24.1.1 Physiological biometrics and recognition
    24.1.2 Behavioral biometrics and recognition
  24.2 Face detection and recognition
    24.2.1 Face detection
    24.2.2 Face recognition
  24.3 Bioinformatics
    24.3.1 Microarray technology
    24.3.2 Motif discovery, sequence alignment, protein folding, and coclustering
  References

25 Data mining
  25.1 Introduction
  25.2 Document representations for text categorization
  25.3 Neural network approach to data mining
    25.3.1 Classification-based data mining
    25.3.2 Clustering-based data mining
    25.3.3 Bayesian network based data mining
  25.4 Personalized search
  25.5 XML format
  25.6 Web usage mining
  25.7 Association mining
  25.8 Ranking search results
    25.8.1 Surfer models
    25.8.2 PageRank algorithm
    25.8.3 Hypertext induced topic search (HITS)
  25.9 Data warehousing
  25.10 Content-based image retrieval
  25.11 Email anti-spamming
  References

Appendix A: Mathematical preliminaries
  A.1 Linear algebra
  A.2 Linear scaling and data whitening
  A.3 Gram-Schmidt orthonormalization transform
  A.4 Stability of dynamic systems
  A.5 Probability theory and stochastic processes
  A.6 Numerical optimization techniques
  References

Appendix B: Benchmarks and resources
  B.1 Face databases
  B.2 Some machine learning databases
  B.3 Data sets for data mining
  B.4 Databases and tools for speech recognition
  B.5 Data sets for microarray and for genome analysis
  B.6 Software
  References

Preface

The human brain, consisting of nearly $10^{11}$ neurons, is the center of human intelligence. Human intelligence has been simulated in various ways. Artificial intelligence (AI) pursues exact logical reasoning based on symbol manipulation. Fuzzy logics model the highly uncertain behavior of decision making. Neural networks model the highly nonlinear infrastructure of brain networks. Evolutionary computation models the evolution of intelligence. Chaos theory models the highly nonlinear and chaotic behaviors of human intelligence.

Softcomputing is an evolving collection of methodologies for the representation of the ambiguity in human thinking; it exploits the tolerance for imprecision and uncertainty, approximate reasoning, and partial truth in order to achieve tractability, robustness, and low-cost solutions. The major methodologies of softcomputing are fuzzy logic, neural networks, and evolutionary computation.

Conventional model-based data-processing methods require experts' knowledge for the modeling of a system. Neural network methods provide a model-free, adaptive, fault-tolerant, parallel and distributed processing solution. A neural network is a black box that directly learns the internal relations of an unknown system, without guessing functions for describing cause-and-effect relationships. The neural network approach is a basic methodology of information processing. Neural network models may be used for function approximation, classification, nonlinear mapping, associative memory, vector quantization, optimization, feature extraction, clustering, and approximate inference. Neural networks have wide applications in almost all areas of science and engineering.

Fuzzy logic provides a means for treating uncertainty and computing with words. This mimics human recognition, which skillfully copes with uncertainty. Fuzzy systems are conventionally created from explicit knowledge expressed in the form of fuzzy rules, which are designed based on experts' experience. A fuzzy system can explain its action by fuzzy rules. Neurofuzzy systems, as a synergy of fuzzy logic and neural networks, possess both learning and knowledge-representation capabilities.

This book is our attempt to bring together the major advances in neural networks and machine learning, and to explain them in a statistical framework. While some mathematical details are needed, we emphasize the practical aspects of the models and methods rather than the theoretical details. To us, neural networks are merely some statistical methods that can be represented by graphs


and networks. They can iteratively adjust the network parameters. As a statistical model, a neural network can learn the probability density function from the given samples, and then predict, by generalization according to the learnt statistics, outputs for new samples that are not included in the learning sample set. The neural network approach is a general statistical computational paradigm.

Neural network research solves two problems: the direct problem and the inverse problem. The direct problem employs computer and engineering techniques to model biological neural systems of the human brain. This problem is investigated by cognitive scientists and can be useful in neuropsychiatry and neurophysiology. The inverse problem simulates biological neural systems for their problem-solving capabilities for application in scientific or engineering fields. Engineering and computer scientists have conducted extensive investigations in this area. This book concentrates mainly on the inverse problem, although the two areas often shed light on each other. The biological and psychological plausibility of the neural network models has not been seriously treated in this book, though some background material is discussed.

This book is intended to be used as a textbook for advanced undergraduates and graduate students in engineering, science, computer science, business, arts, and medicine. It is also a good reference book for scientists, researchers, and practitioners in a wide variety of fields, and assumes no previous knowledge of neural network or machine learning concepts.

This book is divided into twenty-five chapters and two appendices. It contains almost all the major neural network models and statistical learning approaches. We also give an introduction to fuzzy sets and logic, and neurofuzzy models. Hardware implementations of the models are discussed. Two chapters are dedicated to the applications of neural network and statistical learning approaches to biometrics/bioinformatics and data mining. Finally, in the appendices, some mathematical preliminaries are given, and benchmarks for validating all kinds of neural network methods and some web resources are provided.

First and foremost we would like to thank the supporting staff from Springer London, especially Anthony Doyle and Grace Quinn, for their enthusiastic and professional support throughout the period of manuscript preparation. K.-L. Du also wishes to thank Jiabin Lu (Guangdong University of Technology, China), Jie Zeng (Richcon MC, Inc., China), Biaobiao Zhang and Hui Wang (Enjoyor, Inc., China), Zongnian Chen (Hikvision, Inc., China), and many of his graduate students, including Na Shou, Shengfeng Yu, Lusha Han and Xiaoling Wang (Zhejiang University of Technology, China), for their consistent assistance. In addition, we should mention at least the following names for their help:

Omer Morgul (Bilkent University, Turkey), Yanwu Zhang (Monterey Bay Aquarium Research Institute, USA), Chi Sing Leung (City University of Hong Kong, Hong Kong), M. Omair Ahmad and Jianfeng Gu (Concordia University, Canada), Li Yu, Limin Meng, Jingyu Hua, Zhijiang Xu and Luping Fang (Zhejiang University of Technology, China). Last, but not least, we would like to thank


our families for their support and understanding during the course of writing this book.

A book of this length is certain to have some errors and omissions. Feedback is welcome via email at kldu@ieee.org or swamy@encs.concordia.ca. Due to restrictions on the length of this book, we have placed the two appendices, namely, Mathematical preliminaries, and Benchmarks and resources, on the website of this book. MATLAB code for the worked examples is also downloadable from the website of this book.

K.-L. Du

Enjoyor Inc.

Hangzhou, China

M. N. S. Swamy
Concordia University
Montreal, Canada

List of Abbreviations

A/D analog-to-digital
adaline adaptive linear element
AI artificial intelligence
AIC Akaike information criterion
ALA adaptive learning algorithm
ANFIS adaptive-network-based fuzzy inference system
AOSVR accurate online SVR
APCA asymmetric PCA
APEX adaptive principal components extraction
API application programming interface
ART adaptive resonance theory
ASIC application-specific integrated circuit
ASSOM adaptive-subspace SOM
BAM bidirectional associative memory
BFGS Broyden-Fletcher-Goldfarb-Shanno
BIC Bayesian information criterion
BIRCH balanced iterative reducing and clustering using hierarchies
BP backpropagation
BPTT backpropagation through time
BSB brain-states-in-a-box
BSS blind source separation
CBIR content-based image retrieval
CCA canonical correlation analysis
CCCP constrained concave-convex procedure
cdf cumulative distribution function
CEM classification EM
CG conjugate gradient
CMAC cerebellar model articulation controller
COP combinatorial optimization problem
CORDIC coordinate rotation digital computer
CPT conditional probability table
CPU central processing unit
CURE clustering using representatives
DBSCAN density-based spatial clustering of applications with noise
DCS dynamic cell structures
DCT discrete cosine transform
DFP Davidon-Fletcher-Powell
DFT discrete Fourier transform
ECG electrocardiogram
ECOC error-correcting output code
EEG electroencephalogram
EKF extended Kalman filtering
ELM extreme learning machine
EM expectation-maximization
ERM empirical risk minimization
E-step expectation step
ETF elementary transcendental function
EVD eigenvalue decomposition
FCM fuzzy C-means
FFT fast Fourier transform
FIR finite impulse response
fMRI functional magnetic resonance imaging
FPGA field programmable gate array
FSCL frequency-sensitive competitive learning
GAP-RBF growing and pruning algorithm for RBF
GCS growing cell structures
GHA generalized Hebbian algorithm
GLVQ-F generalized LVQ family algorithms
GNG growing neural gas
GSO Gram-Schmidt orthonormalization
HWO hidden weight optimization
HyFIS hybrid neural fuzzy inference system
ICA independent component analysis
iid independently drawn and identically distributed
i-or interactive-or
KKT Karush-Kuhn-Tucker
k-NN k-nearest neighbor
k-WTA k-winners-take-all
LASSO least absolute selection and shrinkage operator
LBG Linde-Buzo-Gray
LDA linear discriminant analysis
LM Levenberg-Marquardt
LMAM LM with adaptive momentum
LMI linear matrix inequality
LMS least mean squares
LMSE least mean squared error
LMSER least mean square error reconstruction
LP linear programming
LS least-squares
LSI latent semantic indexing
LTG linear threshold gate
LVQ learning vector quantization
MAD median of the absolute deviation
MAP maximum a posteriori
MCA minor component analysis
MDL minimum description length
MEG magnetoencephalogram
MFCC Mel frequency cepstral coefficient
MIMD multiple instruction multiple data
MKL multiple kernel learning
ML maximum-likelihood
MLP multilayer perceptron
MSA minor subspace analysis
MSE mean squared error
MST minimum spanning tree
M-step maximization step
NARX nonlinear autoregressive with exogenous input
NEFCLASS neurofuzzy classification
NEFCON neurofuzzy controller
NEFLVQ non-Euclidean FLVQ
NEFPROX neurofuzzy function approximation
NIC novel information criterion
NOVEL nonlinear optimization via external lead
OBD optimal brain damage
OBS optimal brain surgeon
OLAP online analytical processing
OLS orthogonal least squares
OMP orthogonal matching pursuit
OWO output weight optimization
PAC probably approximately correct
PAST projection approximation subspace tracking
PASTd PAST with deflation
PCA principal component analysis
PCM possibilistic C-means
pdf probability density function
PSA principal subspace analysis
QP quadratic programming
QR-cp QR with column pivoting
RAN resource-allocating network
RBF radial basis function
RIP restricted isometry property
RLS recursive least squares
RPCCL rival penalized controlled competitive learning
RPCL rival penalized competitive learning
RProp resilient propagation
RTRL real-time recurrent learning
RVM relevance vector machine
SDP semidefinite programs
SIMD single instruction multiple data
SLA subspace learning algorithm
SMO sequential minimal optimization
SOM self-organization map
SPMD single program multiple data
SRM structural risk minimization
SVD singular value decomposition
SVDD support vector data description
SVM support vector machine
SVR support vector regression
TDNN time-delay neural network
TDRL time-dependent recurrent learning
TLMS total least mean squares
TLS total least squares
TREAT trust-region-based error aggregated training
TRUST terminal repeller unconstrained subenergy tunneling
TSK Takagi-Sugeno-Kang
TSP traveling salesman problem
VC Vapnik-Chervonenkis
VLSI very large scale integrated
WINC weighted information criterion
WTA winner-takes-all
XML extensible markup language

 

1 Introduction


1.1 Major events in neural networks research

The discipline of neural networks models the human brain. The average human brain consists of nearly $10^{11}$ neurons of various types, with each neuron connecting to up to tens of thousands of synapses. As such, neural network models are also called connectionist models. Information processing takes place mainly in the cerebral cortex, the outer layer of the brain. Cognitive functions, including language, abstract reasoning, and learning and memory, represent the most complex brain operations to define in terms of neural mechanisms.

In the 1940s, McCulloch and Pitts [27] found that a neuron can be modeled as a simple threshold device to perform a logic function. In 1949, Hebb [14] proposed the Hebbian rule to describe how learning affects the synapses between two neurons. In 1952, based upon the physical properties of cell membranes and the ion currents passing through transmembrane proteins, Hodgkin and Huxley [15] incorporated neural phenomena such as neuronal firing and action potential propagation into a set of evolution equations, yielding quantitatively accurate spikes and thresholds. This work brought Hodgkin and Huxley a Nobel Prize in 1963. In the late 1950s and early 1960s, Rosenblatt [32] proposed the perceptron model, and Widrow and Hoff [39] proposed the adaline (adaptive linear element) model, trained with a least mean squares (LMS) method.

In 1969, Minsky and Papert [28] proved mathematically that the perceptron cannot be used for complex logic functions; this substantially dampened interest in the field of neural networks. During the same period, the adaline model as well as its multilayer version, called the madaline, was successfully used in many problems; however, these models cannot solve linearly inseparable problems, due to their use of linear activation functions.

In the 1970s, Grossberg [12, 13], von der Malsburg [38], and Fukushima [9] conducted pioneering work on competitive learning and self-organization, inspired by the connection patterns found in the visual cortex. Fukushima proposed his cognitron [9] and neocognitron [10], [11] models under the competitive learning paradigm. The neocognitron, inspired by the primary visual cortex, is a hierarchical multilayered neural network specially designed for robust visual pattern recognition. Several linear associative memory models were also proposed in that period [22]. In 1982, Kohonen proposed the self-organization map (SOM) [23].


The SOM adaptively transforms incoming signal patterns of arbitrary dimensions into one- or two-dimensional discrete maps in a topologically ordered fashion. Grossberg and Carpenter [13, 4] proposed the adaptive resonance theory (ART) model in the mid-1980s. The ART model, also based on competitive learning, is recurrent and self-organizing.

The Hopfield model introduced in 1982 [17] ushered in the modern era of neural network research. The model works at the system level rather than at a single neuron level. It is a recurrent neural network working with the Hebbian rule. This network can be used as an associative memory for information storage and for solving optimization problems. The Boltzmann machine [1] was introduced in 1985 as an extension to the Hopfield network by incorporating stochastic neurons. Boltzmann learning is based on a method called simulated annealing [20]. In 1987, Kosko proposed the adaptive bidirectional associative memory (BAM) [24]. The Hamming network proposed by Lippman in the mid-1980s [25] is based on competitive learning, and is the most straightforward associative memory. In 1988, Chua and Yang [5] extended the Hopfield model by proposing the cellular neural network model. The cellular network is a dynamical network model and is particularly suitable for two-dimensional signal processing and VLSI implementation.

The most prominent landmark in neural network research is the backpropagation (BP) learning algorithm, proposed for the multilayer perceptron (MLP) model in 1986 by Rumelhart, Hinton, and Williams [34]. Later on, the BP algorithm was discovered to have already been invented in 1974 by Werbos [40]. In 1988, Broomhead and Lowe proposed the radial basis function (RBF) network model [3]. Both the MLP and the RBF network are universal approximators.

In 1982, Oja proposed the principal component analysis (PCA) network for classical statistical analysis [30]. In 1994, Comon proposed independent component analysis (ICA) [6]. ICA is a generalization of PCA, and it is usually used for feature extraction and blind source separation (BSS). Since then, many neural network algorithms for classical statistical methods, such as Fisher's linear discriminant analysis (LDA), canonical correlation analysis (CCA), and factor analysis, have been proposed.

In 1985, Pearl introduced the Bayesian network model [31]. The Bayesian network is the best-known graphical model in AI. It possesses the characteristic of being both a statistical and a knowledge-representation formalism. It establishes the foundation for inference in modern AI.

Another landmark in the machine learning and neural network communities is the support vector machine (SVM), proposed by Vapnik et al. in the early 1990s [37]. The SVM is based on statistical learning theory and is particularly useful for classification with small sample sizes. The SVM has been used for classification, regression and clustering. Thanks to its successful application in the SVM, the kernel method has aroused wide interest.

In addition to neural networks, fuzzy logic and evolutionary computation are two other major softcomputing paradigms. Softcomputing is a computing


work that can tolerate imprecision and uncertainty instead of depending on exact mathematical computations. Fuzzy logic [41] can inco rporate the human knowledge into a system by means of fuzzy rules. Evolutionar y computation [16, 35] originates from Darwin’s theory of natural selection, and can optimize in a domain that is difficult to solve by other means. These techniques are now widely used to enhance the interpretability of the neural networks or to select optimum architecture and parameters of neural networks. In summary, the brain is a dynamic information processing system that evolves its structure and functionality in time through information processing at differ- ent hierarchical levels: quantum, molecular (genetic), single neuron, ensemble of neurons, cognitive, and evolutionary [19]:

- At a quantum level, particles, which constitute every molecule, move continuously, being in several states at the same time that are characterized by probability, phase, frequency, and energy. These states can change following the principles of quantum mechanics.
- At a molecular level, RNA and protein molecules evolve in a cell and interact in a continuous way, based on the information stored in the DNA and on external factors, and affect the functioning of a cell (neuron).
- At the level of a single neuron, the internal information processes and the external stimuli change the synapses and cause the neuron to produce a signal to be transferred to other neurons.
- At the level of neuronal ensembles, all neurons operate together as a function of the ensemble through continuous learning.
- At the level of the whole brain, cognitive processes take place in a life-long incremental multiple task/multiple modalities learning mode, such as language and reasoning, and global information processes are manifested, such as consciousness.
- At the level of a population of individuals, species evolve through evolution via changing the genetic DNA code.

Building computational models that integrate principles from different information levels may be efficient for solving complex problems. These models are called integrative connectionist learning systems [19]. Information processes at different levels in the information hierarchy interact and influence each other.

1.2 Neurons

Among the $10^{11}$ neurons in the human brain, about $10^{10}$ are in the cortex. The cortex is the outer mantle of cells surrounding the central structures, e.g., the brainstem and thalamus. Cortical thickness varies mostly between 2 and 3 mm in the human, and the cortex is folded, with an average surface area of about 2200 cm$^2$ [42]. The neuron, or nerve cell, is the fundamental anatomical and functional unit of the nervous system, including the brain.



Figure 1.1 Schematic drawing of a prototypical neuron.

A neuron is an extension of the simple cell, with two types of appendages: multiple dendrites and an axon. A neuron possesses all the internal features of a regular cell. A neuron has four components: the dendrites, the soma (cell body), the axon, and the synapse. The soma contains a cell nucleus. Dendrites branch into a bushy network around the cell to receive input from other neurons, whereas the axon stretches out for a long distance, typically a centimeter and as far as a meter in extreme cases. The axon is an output channel to other neurons; it branches into strands and substrands to connect to the dendrites and cell bodies of other neurons. The connecting junction is called a synapse. Each cortical neuron receives $10^4$–$10^5$ synaptic connections, with most inputs coming from distant neurons. Thus connections in the cortex are said to exhibit long-range excitation and short-range inhibition.

A neuron receives signals from other neurons through its soma and dendrites, integrates them, and sends output signals to other neurons through its axon. The dendrites receive signals from several neighboring neurons and pass them onto the cell body, where they are processed, and the resulting signal is transferred through the axon. A schematic diagram is shown in Fig. 1.1.

Like any other cell, neurons have a membrane potential, that is, an electric potential difference between the intracellular and extracellular compartments, caused by the different densities of sodium (Na) and potassium (K) ions. The neuronal membrane is endowed with relatively selective ionic channels that allow some specific ions to cross the membrane. The cell membrane has an electrical resting potential of −70 mV, which is maintained by pumping positive ions (Na+) out of the cell. Unlike an ordinary cell, the neuron is excitable. Because of inputs from the dendrites, the cell may not be able to maintain the −70 mV resting potential, resulting in an action potential, a pulse transmitted down the axon.

Signals are propagated from neuron to neuron by a complicated electrochemical reaction. Chemical transmitter substances pass the synapses and enter the dendrite, changing the electrical potential of the cell body.



Figure 1.2 The mathematical model of McCulloch-Pitts neuron.

When the potential is above a threshold, an electrical pulse or action potential is sent along the axon. After releasing the pulse, the neuron returns to its resting potential. The action potential causes the release of certain biochemical agents that transmit messages to the dendrites of nearby neurons. These biochemical transmitters may have either an excitatory or an inhibitory effect on neighboring neurons. A synapse that increases the potential is excitatory, whereas a synapse that decreases it is inhibitory.

Synaptic connections exhibit plasticity: long-term changes in the strength of connections in response to the pattern of stimulation. Neurons also form new connections with other neurons, and sometimes entire collections of neurons can migrate from one place to another. These mechanisms are thought to form the basis for learning in the brain. Synaptic plasticity is a basic biological mechanism underlying learning and memory. Inspired by this, a large number of learning rules, specifying how activity and training experience change synaptic efficacies [14], have been advanced.

1.2.1 The McCulloch-Pitts neuron model

A neuron is a basic processing unit in a neural network. It is a node that processes all fan-in from other nodes and generates an output according to a transfer function called the activation function. The activation function represents a linear or nonlinear mapping from the input to the output and is denoted by φ(·). The variable synapses are modeled by weights. The McCulloch-Pitts neuron model [27], which employs the sigmoidal activation function, was inspired biologically. Figure 1.2 illustrates the simple McCulloch-Pitts neuron model. The output of the neuron is given by

$$\mathrm{net} = \sum_{i=1}^{J_1} w_i x_i - \theta = \boldsymbol{w}^T \boldsymbol{x} - \theta, \qquad (1.1)$$

$$y = \phi(\mathrm{net}), \qquad (1.2)$$

where $x_i$ is the $i$th input, $w_i$ is the link weight from the $i$th input, $\boldsymbol{w} = (w_1, \ldots, w_{J_1})^T$, $\boldsymbol{x} = (x_1, \ldots, x_{J_1})^T$, $\theta$ is a threshold or bias, and $J_1$ is the number of inputs.



Figure 1.3 VLSI model of a neuron.


The activation function φ(·) is usually some continuous or discontinuous function, mapping the real numbers into the interval (−1, 1) or (0, 1).

Neural networks are suitable for VLSI circuit implementations. The analog approach is extremely attractive in terms of size, power, and speed. A neuron can be realized with a simple amplifier, and a synapse with a resistor. The memristor is a two-terminal passive circuit element that acts as a variable resistor whose value can be varied by varying the current passing through it [2]. The circuit of a neuron is given in Fig. 1.3. Since weights from the circuits can only be positive, an inverter can be applied to the input voltage so as to realize a negative synaptic weight. By Kirchhoff's current law, the output voltage of the neuron is derived as

$$y = \phi\!\left(\frac{\sum_{i=1}^{J_1} w_i x_i}{\sum_{i=0}^{J_1} w_i} - \theta\right), \qquad (1.3)$$

where $x_i$ is the $i$th input voltage, $w_i$ is the conductance of the $i$th resistor, $\theta$ is the bias voltage, and φ(·) is the transfer function of the amplifier. The bias voltage of a neuron in a VLSI circuit is caused by device mismatches, and is difficult to control.

The McCulloch-Pitts neuron model is known as the classical perceptron model, and it is used in most neural network models, including the MLP and the Hopfield network. Many other neural networks are also based on the McCulloch-Pitts neuron model, but use other activation functions. For example, the adaline [39] and the SOM [23] use linear activation functions, and the RBF network adopts a radial basis function (RBF).
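As a concrete reading of (1.1) and (1.2), the following MATLAB sketch evaluates a single McCulloch-Pitts-style neuron with a logistic sigmoid activation; the weight, threshold, and input values are illustrative assumptions, not values from the book.

    % A single neuron following (1.1)-(1.2); all numeric values are illustrative.
    w = [0.5; -0.3; 0.8];            % link weights w_i
    theta = 0.2;                     % threshold (bias)
    x = [1; 0; 1];                   % input vector
    net = w'*x - theta;              % net activation, Eq. (1.1)
    phi = @(v) 1./(1 + exp(-v));     % logistic sigmoid, maps into (0, 1)
    y = phi(net)                     % neuron output, Eq. (1.2)

Replacing phi with a hard threshold, a linear map, or a radial basis function yields the other neuron variants mentioned above.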

1.2.2 Spiking neuron models

Many of the intrinsic properties seen within the brain were not included in the classical perceptron, limiting its functionality and use to linear discrimination tasks. A single classical perceptron is not capable of solving nonlinear problems, such as the XOR problem. Spiking neuron and spiking neural network models mimic the spiking activity of neurons in the brain when processing information. Spiking neurons tend to gather in functional groups firing together during strict


time intervals, forming coalitions or assemblies [18], also referred to as events. Spiking neural networks represent information as trains of spikes. This results in a much higher number of patterns stored in a model and more flexible processing.

During training of a spiking neural network, the weight of a synapse is modified according to the timing difference between the pre-synaptic spike and the post-synaptic spike. This synaptic plasticity is called spike-timing-dependent plasticity [26]. In a biological system, a neuron integrates the excitatory post-synaptic current, which is produced by a presynaptic stimulus, to change the voltage of its soma. If the soma voltage is larger than a defined threshold, an action potential (spike) is produced.

The integrate-and-fire neuron [36], FitzHugh-Nagumo neuron [8], and Hodgkin-Huxley neuron model all incorporate more of the dynamics of actual biological neurons. Whereas the Hodgkin-Huxley model describes the biophysical mechanics of neurons, both the integrate-and-fire and FitzHugh-Nagumo neurons model key features of biological neurons such as the membrane potential, excitatory postsynaptic potential, and inhibitory postsynaptic potential. A single neuron incorporating these key features has a higher dimension to the information it processes, in terms of its membrane threshold, firing rate and postsynaptic potential, than a classical perceptron. The output of the integrate-and-fire neuron model is binary on a short time scale: it either fires an action potential or does not. A spike train $s \in S(\mathcal{T})$ is a sequence of ordered spike times $s = \{t_m \in \mathcal{T} : m = 1, \ldots, N\}$ corresponding to the time instants in the interval $\mathcal{T} = [0, T]$ at which a neuron fires.

The FitzHugh-Nagumo model is a simplified version of the Hodgkin-Huxley model, which models in a detailed manner the activation and deactivation dynamics of a spiking neuron. The Hodgkin-Huxley model [15, 21] incorporates the principal neurobiological properties of a neuron in order to understand phenomena such as the action potential. It was obtained by casting an empirical investigation of the physiological properties of the squid axon into a dynamical system framework. The model is a set of conductance-based coupled ordinary differential equations, incorporating sodium (Na), potassium (K) and chloride (Cl) ion flows through their respective channels. These equations are based upon the physical properties of cell membranes and the ion currents passing through transmembrane proteins. Chloride channel conductances are static (not voltage dependent) and hence leaky.

According to the Hodgkin-Huxley model, the dynamics of the membrane potential $V(t)$ of the neuron can be described by

$$C\,\frac{dV}{dt} = -g_{\mathrm{Na}} m^3 h\,(V - V_{\mathrm{Na}}) - g_{\mathrm{K}} n^4 (V - V_{\mathrm{K}}) - g_{\mathrm{L}} (V - V_{\mathrm{L}}) + I(t), \qquad (1.4)$$

where the first three terms on the right-hand side correspond to the sodium, potassium, and leakage currents, respectively, and $g_{\mathrm{Na}} = 120$ mS/cm$^2$, $g_{\mathrm{K}} = 36$ mS/cm$^2$ and $g_{\mathrm{L}} = 0.3$ mS/cm$^2$ are the maximal conductances of sodium, potassium and leakage, respectively. The membrane capacitance $C = 1$ µF/cm$^2$; $V_{\mathrm{Na}}$, $V_{\mathrm{K}}$, and $V_{\mathrm{L}}$ are the reversal potentials


Figure 1.4 Parameters of the Hodgkin-Huxley model for a neuron.

of sodium, potassium, and leakage currents, respectively. $I(t)$ is the injected current. The stochastic gating variables $n$, $m$ and $h$ represent the activation term of the potassium channel, and the activation term and the inactivation term of the sodium channel, respectively. The factors $n^4$ and $m^3 h$ are the mean portions of the open potassium and sodium ion channels within the membrane patch. To take into account the channel noise, $m$, $h$ and $n$ obey Langevin equations. When the stimuli S1 and S2 occur at 15 ms and 40 ms of an 80 ms interval, the simulated results for $V$, $m$, $h$ and $n$ are plotted in Fig. 1.4; this figure was generated by a Java applet (http://thevirtualheart.org/HHindex.html).

1.3 Neural networks

A neural network is characterized by the network architecture, node characteristics, and learning rules.

Architecture

The network architecture is represented by the connection weight matrix $W = [w_{ij}]$, where $w_{ij}$ denotes the connection weight from node $i$ to node $j$. When $w_{ij} = 0$, there is no connection from node $i$ to node $j$. By setting some $w_{ij}$'s to zero, different network topologies can be realized, as the sketch below illustrates. Neural networks can be grossly classified into feedforward neural networks, recurrent neural networks, and their hybrids. Popular network topologies are fully connected layered feedforward networks, recurrent networks, lattice networks, layered feedforward networks with lateral connections, and cellular networks, as shown in Fig. 1.5.
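As a minimal illustration of this encoding (assuming a hypothetical 3-2-1 layered network with nodes numbered 1-6; the values are arbitrary), the following MATLAB fragment realizes a feedforward topology by leaving all other entries of $W$ at zero:

    % Encode a 3-2-1 layered feedforward topology in W = [w_ij]:
    % w_ij = 0 means there is no connection from node i to node j.
    N = 6;                         % 3 input nodes, 2 hidden neurons, 1 output neuron
    W = zeros(N, N);
    W(1:3, 4:5) = randn(3, 2);     % layer 1 -> layer 2 weights
    W(4:5, 6)   = randn(2, 1);     % layer 2 -> layer 3 weights
    % All other entries remain zero, so there are no intra-layer, lateral,
    % or feedback connections; nonzero entries feeding back to earlier
    % nodes would instead give a recurrent topology.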

In a feedforward network, the connections between neurons are in one direction. A feedforward network is usually arranged in the form of layers. In such a layered feedforward network, there is no connection between the neurons in the same layer, and there is no feedback between layers. In a fully connected layered feedforward network, every node in any layer is connected to every



Figure 1.5 Architecture of neural networks. (a) Layered feedforward network. (b) Recurrent network. (c) Two-dimensional lattice network. (d) Layered feedforward network with lateral connections. (e) Cellular network. The big numbered circles stand for neurons and the small ones for input nodes.

node in its adjacent forward layer. The MLP and the RBF network are fully connected layered feedforward networks.

In a recurrent network, there exists at least one feedback connection. The Hopfield model and the Boltzmann machine are two examples of recurrent networks.

A lattice network consists of a one-, two- or higher-dimensional array of neurons. Each array has a corresponding set of input nodes. The Kohonen network [23] uses a one- or two-dimensional lattice architecture.

A layered feedforward network with lateral connections has lateral connections between the units at the same layer of its layered feedforward architecture. A competitive learning network has a two-layered network of such an architecture: the feedforward connections are excitatory, while the lateral connections in the same layer are inhibitory. Some PCA networks using the Hebbian/anti-Hebbian learning rules [33] also employ this kind of network topology.

A cellular network consists of regularly spaced neurons, called cells, which communicate only with the neurons in their immediate neighborhood. Adjacent cells are connected by mutual interconnections. Each cell is excited by its own signals and by signals flowing from its adjacent cells [5].

In this book, we use the notation $J_1$-$J_2$-$\cdots$-$J_M$ to represent a neural network with a layered architecture of $M$ layers, where $J_i$ is the number of nodes in the $i$th layer. Notice that the input layer is counted as layer 1, and the nodes at this layer are not neurons. Layer $M$ is the output layer.
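The following sketch (a hypothetical helper, with a 3-4-2 network as the example) makes the notation concrete by counting the neurons and connection weights of such a fully connected layered network; the one-bias-weight-per-neuron count is a common convention assumed here, not something implied by the notation itself.

```python
def describe(J):
    """J = [J1, J2, ..., JM]: nodes per layer in the J1-J2-...-JM notation."""
    M = len(J)
    neurons = sum(J[1:])   # the J1 input nodes are not neurons
    # Fully connected layered feedforward network, assuming one bias
    # weight per neuron in addition to the layer-to-layer weights.
    weights = sum((J[i] + 1) * J[i + 1] for i in range(M - 1))
    return M, neurons, weights

# A 3-4-2 network: 3 input nodes, 4 hidden neurons, 2 output neurons.
print(describe([3, 4, 2]))   # (3, 6, 26)
```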


Operation

The operation of a neural network is divided into two stages: learning (training) and generalization (recalling). Network training is typically accomplished using examples, and the network parameters are adapted by a learning algorithm in an online or offline manner. Once the network is trained to the desired performance, the learning process is terminated, and the network can then be used directly to replace the complex system dynamics. The trained network operates in a static manner: it emulates an unknown dynamics or nonlinear relationship.

For real-time applications, a neural network is required to have a constant processing delay regardless of the number of input nodes and to have a minimum number of layers. As the number of input nodes increases, the size of the network layers should grow at the same rate without additional layers.

Adaptive neural networks are a class of neural networks that do not need to be trained with a separate training pattern set; they can learn while they are performing. For adaptive neural networks, unsupervised learning methods are usually used. For example, the Hopfield model uses a generalized Hebbian learning rule when implemented as an associative memory: any time a pattern is presented to it, the Hopfield network updates its connection weights. After the network has been trained with standard patterns and is to be used for generalization, this learning capability should be disabled; otherwise, when an incomplete or noisy pattern is presented, the network will search for the closest match while the memorized pattern is overwritten by the new pattern, as the sketch below illustrates. Reinforcement learning is also naturally adaptive, since the environment acts as a teacher; supervised learning is not adaptive in nature.
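The sketch below shows this behavior in miniature for a Hopfield-style associative memory: the Hebbian (outer-product) rule stores two made-up bipolar patterns, and recall then runs with the weight matrix frozen, i.e., with learning disabled. Synchronous sign updates are used for brevity, whereas the classical model updates neurons asynchronously.

```python
import numpy as np

def store(patterns):
    # Hebbian (outer-product) storage of bipolar (+/-1) patterns.
    X = np.asarray(patterns, dtype=float)     # shape (P, d)
    W = X.T @ X / X.shape[1]
    np.fill_diagonal(W, 0.0)                  # no self-connections
    return W

def recall(W, probe, steps=20):
    # The weights W stay frozen during recall, so the noisy probe
    # cannot overwrite the stored memories.
    x = np.asarray(probe, dtype=float)
    for _ in range(steps):
        x_new = np.sign(W @ x)
        x_new[x_new == 0] = 1.0               # break ties
        if np.array_equal(x_new, x):          # reached a fixed point
            break
        x = x_new
    return x

W = store([[1, -1, 1, -1, 1, -1],
           [1, 1, 1, -1, -1, -1]])
print(recall(W, [1, -1, 1, -1, 1, 1]))        # last bit corrupted; recall
                                              # settles on the first pattern
```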

Properties

Neural networks are biologically motivated. Each neuron is a computational node representing a nonlinear function. Neural networks possess the following advantages [7]:

Adaptive learning: They can adapt themselves to a changing environment by adjusting the network parameters.

Generalization: A trained neural network has a superior generalization capability.

General-purpose nonlinear nature: They perform like a black box.

Self-organizing: Some neural networks, such as the SOM [23] and competitive learning based neural networks, have a self-organization property.

Massive parallelism and simple VLSI implementations: Each basic processing unit usually has a uniform structure. This parallel structure allows for highly parallel software and hardware implementations.

Robustness and fault tolerance: A neural network can easily handle imprecise, fuzzy, noisy, and probabilistic information. It is a distributed information system, where information is stored in the whole network in a distributed manner by the network structure such as W. Thus, the overall performance does not degrade significantly when the information at some nodes is lost or some connections in the network are damaged. The network repairs itself, and thus possesses a fault-tolerant capability.

Applications

Neural networks can be treated as a general statistical tool for almost all disciplines of science and engineering. Applications include modeling and system identification, classification, pattern recognition, optimization, control, industrial applications, communications, signal processing, image analysis, bioinformatics, and data mining. Pattern recognition is central to both biological and artificial intelligence; it is a complete process that gathers observations, extracts features from the observations, and classifies or describes the observations. Pattern recognition is one of the most fundamental applications of neural networks. More specifically, some neural network models have the following functions.

Function approximation: This capability is generally used for modeling and system identification, regression and prediction, control, signal processing, pattern recognition and classification, and associative memory. Image restoration is also a function approximation problem. The MLP and RBF networks are universal approximators for nonlinear functions. Some recurrent networks are universal approximators of dynamical systems. Prediction is an open-loop problem, while control is a closed-loop problem.

Classification: Classification is the most fundamental application of neural networks. Classification can be based on the function approximation capability of neural networks.

Clustering and vector quantization: Clustering groups together similar objects, based on some distance measure. Unlike in classification problems, the class membership of a pattern is not known a priori. Vector quantization is similar to clustering.

Associative memory: An association is an input-output pair. Associative memory, also known as content-addressable memory, is a memory organization that accesses memory by its content instead of by its address. It picks up a desirable match from all stored prototypes when an incomplete or corrupted sample is presented. Associative memories are useful for pattern recognition, pattern association, or pattern completion.

Optimization: Some neural network models, such as the Hopfield model and the Boltzmann machine, can be used to solve combinatorial optimization problems (COPs).

Feature extraction and information compression: Coding and information compression are essential tasks in the transmission and storage of speech, audio, image, video, and other information. PCA, ICA, and vector quantization can achieve the objective of feature extraction and information compression.


1.4 Scope of the book

This book contains twenty-five chapters and two appendices:

Chapter 2 describes some fundamental topics in neural networks and machine learning.

Chapter 3 is dedicated to the perceptron.

The MLP is the topic of Chapters 4 and 5. The MLP with BP learning is introduced in Chapter 4, and structural optimization of the MLP is also described in this chapter. The MLP with second-order learning is introduced in Chapter 5.

Chapter 6 treats the Hopfield model, its application to solving COPs, simulated annealing, chaotic neural networks, and cellular networks.

Chapter 7 describes associative memory models and algorithms.

Chapters 8 and 9 are dedicated to clustering. Chapter 8 introduces Kohonen networks, ART networks, C-means, subtractive, and fuzzy clustering. Chapter 9 introduces many advanced topics in clustering.

In Chapter 10, we elaborate on the RBF network model.

Chapter 11 introduces the learning of general recurrent networks.

Chapter 12 deals with PCA networks and algorithms. Minor component analysis (MCA), crosscorrelation PCA networks, generalized eigenvalue decomposition (EVD), and CCA are also introduced in this chapter.

Nonnegative matrix factorization (NMF) is introduced in Chapter 13.

ICA and BSS are introduced in Chapter 14.

Discriminant analysis is described in Chapter 15.

Probabilistic and Bayesian networks are introduced in Chapter 19. Many topics, such as the EM algorithm, the HMM, sampling (Monte Carlo) methods, and the Boltzmann machine, are treated in this framework.

SVMs are introduced in Chapter 16.

Kernel methods other than SVMs are introduced in Chapter 17.

Reinforcement learning is introduced in Chapter 18.

Ensemble learning is introduced in Chapter 20.

Fuzzy sets and logic are introduced in Chapter 21. Neurofuzzy models are described in Chapter 22. Transformations between fuzzy logic and neural networks are also discussed.

Implementation of neural networks in hardware is treated in Chapter 23.

In Chapter 24, we give an introduction to neural network applications in biometrics and bioinformatics.

Data mining, as well as the application of neural networks to the field, is introduced in Chapter 25.

Mathematical preliminaries are included in Appendix A. Some benchmarks and resources are included in Appendix B.

Examples and exercises are included in most of the chapters.


Problems

1.1 List the major differences between the neural-network approach and classical information-processing approaches.

1.2 Formulate a McCulloch-Pitts neuron for four variables: white blood count, systolic blood pressure, diastolic blood pressure, and pH of the blood.

1.3 Derive Equation (1.3) from Fig. 1.3.

References

[1] D.H. Ackley, G.E. Hinton & T.J. Sejnowski, A learning algorithm for Boltzmann machines. Cognitive Sci., 9 (1985), 147–169.
[2] S.P. Adhikari, C. Yang, H. Kim & L.O. Chua, Memristor bridge synapse-based neural network and its learning. IEEE Trans. Neural Netw. Learn. Syst., 23:9 (2012), 1426–1435.
[3] D.S. Broomhead & D. Lowe, Multivariable functional interpolation and adaptive networks. Complex Syst., 2 (1988), 321–355.
[4] G.A. Carpenter & S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, Image Process., 37 (1987), 54–115.
[5] L.O. Chua & L. Yang, Cellular neural network–Part I: Theory; Part II: Applications. IEEE Trans. Circ. Syst., 35 (1988), 1257–1290.
[6] P. Comon, Independent component analysis—A new concept? Signal Process., 36:3 (1994), 287–314.
[7] K.-L. Du & M.N.S. Swamy, Neural Networks in a Softcomputing Framework (London: Springer, 2006).
[8] R. FitzHugh, Impulses and physiological states in theoretical models of nerve membrane. Biophysical J., 1 (1961), 445–466.
[9] K. Fukushima, Cognitron: A self-organizing multilayered neural network. Biol. Cybern., 20 (1975), 121–136.
[10] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36 (1980), 193–202.
[11] K. Fukushima, Increasing robustness against background noise: Visual pattern recognition by a neocognitron. Neural Netw., 24:7 (2011), 767–778.
[12] S. Grossberg, Neural expectation: Cerebellar and retinal analogues of cells fired by unlearnable and learnable pattern classes. Kybernetik, 10 (1972), 49–57.
[13] S. Grossberg, Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors; II. Feedback, expectation, olfaction, and illusions. Biol. Cybern., 23 (1976), 121–134 & 187–202.
[14] D.O. Hebb, The Organization of Behavior (New York: Wiley, 1949).


[15] A.L. Hodgkin & A.F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117 (1952), 500–544.
[16] J. Holland, Adaptation in Natural and Artificial Systems (Ann Arbor, MI: University of Michigan Press, 1975).
[17] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci., 79 (1982), 2554–2558.
[18] E.M. Izhikevich, Polychronization: Computation with spikes. Neural Comput., 18:2 (2006), 245–282.
[19] N. Kasabov, Integrative connectionist learning systems inspired by nature: Current models, future trends and challenges. Nat. Comput., 8 (2009), 199–218.
[20] S. Kirkpatrick, C.D. Gelatt Jr & M.P. Vecchi, Optimization by simulated annealing. Science, 220 (1983), 671–680.
[21] W.M. Kistler, W. Gerstner & J.L. van Hemmen, Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9 (1997), 1015–1045.
[22] T. Kohonen, Correlation matrix memories. IEEE Trans. Computers, 21 (1972), 353–359.
[23] T. Kohonen, Self-organized formation of topologically correct feature maps. Biol. Cybern., 43 (1982), 59–69.
[24] B. Kosko, Adaptive bidirectional associative memories. Appl. Optics, 26 (1987), 4947–4960.
[25] R.P. Lippmann, An introduction to computing with neural nets. IEEE ASSP Mag., 4:2 (1987), 4–22.
[26] H. Markram, J. Lubke, M. Frotscher & B. Sakmann, Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275:5297 (1997), 213–215.
[27] W.S. McCulloch & W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophysics, 5 (1943), 115–133.
[28] M.L. Minsky & S. Papert, Perceptrons (Cambridge, MA: MIT Press, 1969).
[29] K.R. Muller, S. Mika, G. Ratsch, K. Tsuda & B. Scholkopf, An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw., 12:2 (2001), 181–201.
[30] E. Oja, A simplified neuron model as a principal component analyzer. J. Math. Biology, 15 (1982), 267–273.
[31] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (San Mateo, CA: Morgan Kaufmann, 1988).
[32] F. Rosenblatt, Principles of Neurodynamics (New York: Spartan Books, 1962).

[33] J. Rubner & P. Tavan, A self-organizing network for principal-component analysis. Europhysics Lett., 10 (1989), 693–698.


[34] D.E. Rumelhart, G.E. Hinton & R.J. Williams, Learning internal representations by error propagation. In: D.E. Rumelhart & J.L. McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1: Foundations, 318–362 (Cambridge, MA: MIT Press, 1986).
[35] H.P. Schwefel, Numerical Optimization of Computer Models (Chichester: Wiley, 1981).
[36] H.C. Tuckwell, Introduction to Theoretical Neurobiology (Cambridge, UK: Cambridge University Press, 1988).
[37] V.N. Vapnik, Statistical Learning Theory (New York: Wiley, 1998).
[38] C. von der Malsburg, Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14 (1973), 85–100.
[39] B. Widrow & M.E. Hoff, Adaptive switching circuits. IRE WESCON Convention Record, 4 (1960), 96–104.
[40] P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD Thesis, Harvard University, Cambridge, MA, 1974.
[41] L.A. Zadeh, Fuzzy sets. Inf. & Contr., 8 (1965), 338–353.
[42] K. Zilles, Cortex. In: G. Paxinos, Ed., The Human Nervous System (New York: Academic Press, 1990).

2 Fundamentals of Machine Learning


2.1 Learning methods

Learning is a fundamental capability of neural networks. Learning rules are algorithms for finding suitable weights $W$ and/or other network parameters. Learning of a neural network can be viewed as a nonlinear optimization problem: find a set of network parameters that minimizes a cost function over the given examples. This kind of parameter estimation is also called a learning or training algorithm.

Neural networks are usually trained by epoch. An epoch is a complete run in which all the training examples are presented to the network and processed by the learning algorithm exactly once. After learning, a neural network represents a complex relationship and possesses the ability for generalization. To control a learning process, a criterion is defined to decide when to terminate the process. The complexity of an algorithm is usually denoted as $O(m)$, indicating that the number of floating-point operations is of order $m$.

Learning methods are conventionally divided into supervised, unsupervised, and reinforcement learning; these schemes are illustrated in Fig. 2.1. Here $x_p$ and $y_p$ are the input and output of the $p$th pattern in the training set, $\hat{y}_p$ is the network output for the $p$th input, and $E$ is an error function. From a statistical viewpoint, unsupervised learning learns the pdf of the training set, $p(x)$, while supervised learning learns the conditional pdf $p(y|x)$. Supervised learning is widely used in classification, approximation, control, modeling and identification, signal processing, and optimization. Unsupervised learning schemes are mainly used for clustering, vector quantization, feature extraction, signal coding, and data analysis. Reinforcement learning is usually used in control and artificial intelligence.

In logic and statistical inference, transduction is reasoning from observed, specific (training) cases to specific (test) cases. In contrast, induction is reasoning from observed training cases to general rules, which are then applied to the test cases. Machine learning thus falls into two broad classes: inductive learning and transductive learning. Inductive learning pursues the standard goal in machine learning, which is to accurately classify the entire input space. In contrast, transductive learning focuses on a predefined target set of unlabeled data, the goal being to label the specific target set.



Figure 2.1 Learning methods. (a) Supervised learning, where $e_p = \hat{y}_p - y_p$. (b) Unsupervised learning. (c) Reinforcement learning.

Multitask learning improves the generalization performance of learners by leveraging the domain-specific information contained in related tasks [30]. Multiple related tasks are learned simultaneously using a shared representation; in fact, the training signals for the extra tasks serve as an inductive bias [30].

In order to learn accurate models for rare cases, it is desirable to use data and knowledge from similar cases; this is known as transfer learning. Transfer learning is a general method for speeding up learning. It exploits the insight that generalization may occur not only within tasks, but also across tasks. The core idea of transfer is that experience gained in learning to perform one source task can help improve learning performance in a related, but different, target task [155]. Transfer learning is related in spirit to case-based and analogical learning. A theoretical analysis based on an empirical Bayes perspective shows that the number of labeled examples required for learning with transfer is often significantly smaller than that required for learning each target independently [155].

Supervised learning

Supervised learning adjusts network parameters by a direct comparison between the actual network output and the desired output. Supervised learning is a closed-loop feedback system, where the error is the feedback signal. The error measure, which shows the difference between the network output and the output from the training samples, is used to guide the learning process. The error measure is usually defined by the mean squared error (MSE):

$$E = \frac{1}{N} \sum_{p=1}^{N} \left\| y_p - \hat{y}_p \right\|^2, \tag{2.1}$$

where $N$ is the number of pattern pairs in the sample set, $y_p$ is the output part of the $p$th pattern pair, and $\hat{y}_p$ is the network output corresponding to the pattern pair $p$. The error $E$ is calculated anew after each epoch. The learning process is terminated when $E$ is sufficiently small or a failure criterion is met.

To decrease $E$ toward zero, a gradient-descent procedure is usually applied. The gradient-descent method always converges to a local minimum in a neighborhood of the initial solution of the network parameters. The LMS and BP algorithms are the two most popular gradient-descent-based algorithms.
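As an illustration (a toy sketch on made-up data, not an algorithm taken from the text), the LMS rule applies exactly this gradient descent to a single linear unit: the weights are updated from the error $e_p = \hat{y}_p - y_p$ after each pattern, and the MSE of (2.1) is recomputed after each epoch as the termination criterion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy supervised data: a noisy linear target (all values hypothetical).
X = rng.standard_normal((200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(200)

w = np.zeros(3)        # network parameters to be learned
eta = 0.01             # learning rate
for epoch in range(50):                  # one epoch = one pass over the set
    for x_p, y_p in zip(X, y):
        e_p = w @ x_p - y_p              # error signal, e_p = y_hat_p - y_p
        w -= eta * e_p * x_p             # LMS (delta-rule) update
    E = np.mean((y - X @ w) ** 2)        # MSE of (2.1), recomputed per epoch
    if E < 1e-3:                         # terminate when E is small enough
        break

print(epoch, E, w)     # w should approach w_true
```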

Second-order methods are based on the computation of the Hessian matrix.

Multiple-instance learning [46] is a variation of supervised learning. In multiple-instance learning, the examples are bags of instances, and the bag label is a function of the labels of its instances; typically, this function is the Boolean OR. A unified theoretical analysis for multiple-instance learning and a PAC-learning algorithm are introduced in [122].

Deductive reasoning starts from a cause to deduce the consequence or effects. Inductive reasoning allows us to infer possible causes from the consequence. Inductive learning is a special class of supervised learning techniques, where given a set of { x i