Chapter · December 2014
DOI: 10.1007/978-1-4471-5571-3_4
Neural Networks and Statistical Learning
This textbook introduces neural networks and machine learning in a statistical framework. The contents cover almost all the major popular neural network models and statistical learning approaches, including the multilayer perceptron, the Hopfield network, the radial basis function network, clustering models and algorithms, associative memory models, recurrent networks, principal component analysis, independent component analysis, nonnegative matrix factorization, discriminant analysis, probabilistic and Bayesian models, support vector machines, kernel methods, fuzzy logic, neurofuzzy models, hardware implementations, and some machine learning topics. Applications of these approaches to biometrics/bioinformatics and data mining are finally given. This book is the first of its kind that gives a very comprehensive, yet in-depth, introduction to neural networks and statistical learning.
This book is helpful for all academic and technical staff in the fields of neural networks, pattern recognition, signal processing, machine learning, computational intelligence, and data mining. Many examples and exercises are given to help the readers to understand the material covered in the book.
K.L. Du received his PhD in electrical engineering from Huazhong University of Science and Technology, Wuhan, China, in 1998. He is the Chief Scientist at Enjoyor Inc., Hangzhou, China. He has been an Affiliate Associate Professor at Concordia University since June 2011. He was on the research staff of the Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada, from 2001 to 2010. Prior to 2001, he was on the technical staff with Huawei Technologies, the China Academy of Telecommunication Technology, and the Chinese University of Hong Kong. He worked as a researcher with the Hong Kong University of Science and Technology in 2008.
Dr. Du's research interests cover signal processing, wireless communications, machine learning, and intelligent systems. He has coauthored two books (Neural Networks in a Softcomputing Framework, Springer, London, 2006; Wireless Communication Systems, Cambridge University Press, New York, 2010) and has published over 45 papers. Dr. Du is a Senior Member of the IEEE. Presently, he is on the editorial boards of IET Signal Processing, British Journal of Mathematics and Computer Science, and Circuits, Systems & Signal Processing.
M.N.S. Swamy received the B.Sc. (Hons.) degree in mathematics from Mysore University, India, in 1954, the Diploma in Electrical Communication Engineering from the Indian Institute of Science, Bangalore, in 1957, and the M.Sc. and Ph.D. degrees in electrical engineering from the University of Saskatchewan, Saskatoon, Canada, in 1960 and 1963, respectively. In August 2001 he was awarded a Doctor of Science in Engineering (Honoris Causa) by Ansted University "In recognition of his exemplary contributions to the research in Electrical and Computer Engineering and to Engineering Education, as well as his dedication to the promotion of Signal Processing and Communications Applications". He was bestowed with the title of Honorary Professor by the National Chiao Tung University in Taiwan in 2009.
He is presently a Research Professor and the holder of a Concordia Tier I chair, and was the Director of the Center for Signal Processing and Communications from 1996 till 2011 in the Department of Electrical and Computer Engineering at Concordia University, Montreal, Canada, where he served as the founding Chair of the Department of Electrical Engineering from 1970 to 1977, and Dean of Engineering and Computer Science from 1977 to 1993. He has published extensively in the areas of circuits, systems and signal processing, and holds five patents. He is the coauthor of six books: Graphs, Networks and Algorithms (New York, Wiley, 1981), Graphs: Theory and Algorithms (New York, Wiley, 1992), Switched Capacitor Filters: Theory, Analysis and Design (Prentice Hall International UK Ltd., 1995), Neural Networks in a Softcomputing Framework (Springer, 2006), Modern Analog Filter Analysis and Design (Wiley-VCH, 2010), and Wireless Communication Systems (Cambridge University Press, New York, 2010). A Russian translation of the first book was published by Mir Publishers, Moscow, in 1984, while a Chinese version was published by the Education Press, Beijing, in 1987.
Dr. Swamy is a Fellow of many societies, including the IEEE, IET (UK) and EIC (Canada). He has served the IEEE CAS Society in various capacities, such as President in 2004, Vice-President (Publications) during 2001-2002, Vice-President in 1976, and Editor-in-Chief of the IEEE Transactions on Circuits and Systems I from June 1999 to December 2001. He is the recipient of many IEEE CAS Society awards, including the Education Award in 2000, the Golden Jubilee Medal in 2000, and the 1986 Guillemin-Cauer Best Paper Award. Since 1999, he has been the Editor-in-Chief of the journal Circuits, Systems and Signal Processing. Recently, Concordia University instituted a Research Chair in his name in recognition of his research contributions.
Neural Networks and Statistical Learning
KELIN DU and M. N. S. SWAMY
Enjoyor Labs, Enjoyor Inc., China
Concordia University, Canada
April 28, 2013
In memory of my grandparents
To my family
K.L. Du
M.N.S. Swamy
To all the researchers with original contributions to neural networks and machine learning
K.L. Du, M.N.S. Swamy
Contents
List of Abbreviations   xx
1 Introduction   1
1.1 Major events in neural networks research   1
1.2 Neurons   3
1.2.1 The McCulloch-Pitts neuron model   5
1.2.2 Spiking neuron models   6
1.3 Neural networks   8
1.4 Scope of the book   12
References   13
2 Fundamentals of Machine Learning   16
2.1 Learning methods   16
2.2 Learning and generalization   21
2.2.1 Generalization error   22
2.2.2 Generalization by stopping criterion   23
2.2.3 Generalization by regularization   24
2.2.4 Fault tolerance and generalization   25
2.2.5 Sparsity versus stability   26
2.3 Model selection   27
2.3.1 Cross-validation   27
2.3.2 Complexity criteria   29
2.4 Bias and variance   31
2.5 Robust learning   32
2.6 Neural network processors   35
2.7 Criterion functions   38
2.8 Computational learning theory   40
2.8.1 Vapnik-Chervonenkis dimension   41
2.8.2 Empirical risk-minimization principle   43
2.8.3 Probably approximately correct (PAC) learning   45
2.9 No-free-lunch theorem   46
2.10 Neural networks as universal machines   46
2.10.1 Boolean function approximation   47
2.10.2 Linear separability and nonlinear separability   49
2.10.3 Continuous function approximation   50
2.10.4 Winner-takes-all   51
2.11 Compressed sensing and sparse approximation   52
2.11.1 Compressed sensing   53
2.11.2 Sparse approximation   54
2.11.3 LASSO and greedy pursuit   55
2.12 Bibliographical notes   57
References   61

3 Perceptrons   70
3.1 One-neuron perceptron   70
3.2 Single-layer perceptron   71
3.3 Perceptron learning algorithm   72
3.4 Least mean squares (LMS) algorithm   74
3.5 P-delta rule   77
3.6 Other learning algorithms   79
References   82

4 Multilayer perceptrons: architecture and error backpropagation   86
4.1 Introduction   86
4.2 Universal approximation   87
4.3 Backpropagation learning algorithm   88
4.4 Incremental learning versus batch learning   93
4.5 Activation functions for the output layer   97
4.6 Optimizing network structure   99
4.6.1 Network pruning using sensitivity analysis   99
4.6.2 Network pruning using regularization   102
4.6.3 Network growing   104
4.7 Speeding up learning process   106
4.7.1 Eliminating premature saturation   106
4.7.2 Adapting learning parameters   108
4.7.3 Initializing weights   111
4.7.4 Adapting activation function   113
4.8 Some improved BP algorithms   115
4.8.1 BP with global descent   117
4.8.2 Robust BP algorithms   118
4.9 Resilient propagation (RProp)   119
References   123

5 Multilayer perceptrons: other learning techniques   133
5.1 Introduction to second-order learning methods   133
5.2 Newton's methods   134
5.2.1 Gauss-Newton method   135
5.2.2 Levenberg-Marquardt method   136
5.3 Quasi-Newton methods   138
5.3.1 BFGS method   140
5.3.2 One-step secant method   141
5.4 Conjugate-gradient methods   142
5.5 Extended Kalman filtering methods   146
5.6 Recursive least squares   149
5.7 Natural-gradient descent method   150
5.8 Other learning algorithms   151
5.8.1 Layer-wise linear learning   151
5.9 Escaping local minima   152
5.10 Complex-valued MLPs and their learning   153
5.10.1 Split complex BP   153
5.10.2 Fully complex BP   154
References   158

6 Hopfield networks, simulated annealing and chaotic neural networks   165
6.1 Hopfield model   165
6.2 Continuous-time Hopfield network   167
6.3 Simulated annealing   171
6.4 Hopfield networks for optimization   174
6.4.1 Combinatorial optimization problems   175
6.4.2 Escaping local minima for combinatorial optimization problems   178
6.4.3 Solving other optimization problems   179
6.5 Chaos and chaotic neural networks   181
6.5.1 Chaos, bifurcation, and fractals   181
6.5.2 Chaotic neural networks   182
6.6 Multistate Hopfield networks   185
6.7 Cellular neural networks   186
References   189
7 Associative memory networks   194
7.1 Introduction   194
7.2 Hopfield model: storage and retrieval   196
7.2.1 Generalized Hebbian rule   196
7.2.2 Pseudoinverse rule   197
7.2.3 Perceptron-type learning rule   198
7.2.4 Retrieval stage   199
7.3 Storage capability of the Hopfield model   200
7.4 Increasing storage capacity   204
7.5 Multistate Hopfield networks for associative memory   207
7.6 Multilayer perceptrons as associative memories   208
7.7 Hamming network   210
7.8 Bidirectional associative memories   212
7.9 Cohen-Grossberg model   213
7.10 Cellular networks   214
References   218

8 Clustering I: Basic clustering models and algorithms   224
8.1 Introduction   224
8.1.1 Vector quantization   224
8.1.2 Competitive learning   226
8.2 Self-organizing maps   227
8.2.1 Kohonen network   229
8.2.2 Basic self-organizing maps   230
8.3 Learning vector quantization   237
8.4 Nearest-neighbor algorithms   240
8.5 Neural gas   243
8.6 ART networks   246
8.6.1 ART models   247
8.6.2 ART 1   248
8.7 C-means clustering   250
8.8 Subtractive clustering   253
8.9 Fuzzy clustering   256
8.9.1 Fuzzy C-means clustering   256
8.9.2 Other fuzzy clustering algorithms   259
References   262

9 Clustering II: topics in clustering   270
9.1 The underutilization problem   270
9.1.1 Competitive learning with conscience   271
9.1.2 Rival penalized competitive learning   272
9.1.3 Soft-competitive learning   274
9.2 Robust clustering   275
9.2.1 Possibilistic C-means   277
9.2.2 A unified framework for robust clustering   278
9.3 Supervised clustering   279
9.4 Clustering using non-Euclidean distance measures   280
9.5 Partitional, hierarchical and density-based clustering   282
9.6 Hierarchical clustering   283
9.6.1 Distance measures, cluster representations and dendrograms   283
9.6.2 Minimum spanning tree (MST) clustering   285
9.6.3 BIRCH, CURE, CHAMELEON and DBSCAN   286
9.6.4 Hybrid hierarchical/partitional clustering   290
9.7 Constructive clustering techniques   291
9.8 Cluster validity   294
9.8.1 Measures based on compactness and separation of clusters   294
9.8.2 Measures based on hypervolume and density of clusters   295
9.8.3 Crisp silhouette and fuzzy silhouette   296
9.9 Projected clustering   298
9.10 Spectral clustering   299
9.11 Co-clustering   300
9.12 Handling qualitative data   301
9.13 Bibliographical notes   302
References   303

10 Radial basis function networks   312
10.1 Introduction   312
10.1.1 RBF network architecture   313
10.1.2 Universal approximation of RBF networks   314
10.1.3 RBF networks and classification   315
10.1.4 Learning for RBF networks   315
10.2 Radial basis functions   316
10.3 Learning RBF centers   319
10.4 Learning the weights   321
10.4.1 Least squares methods for weight learning   321
10.5 RBF network learning using orthogonal least squares   323
10.5.1 Batch orthogonal least squares   323
10.5.2 Recursive orthogonal least squares   324
10.6 Supervised learning of all parameters   325
10.6.1 Supervised learning for general RBF networks   326
10.6.2 Supervised learning for Gaussian RBF networks   327
10.6.3 Discussion on supervised learning   328
10.6.4 Extreme learning machines   329
10.7 Various learning methods   330
10.8 Normalized RBF networks   332
10.9 Optimizing network structure   333
10.9.1 Constructive methods   333
10.9.2 Resource-allocating networks   334
10.9.3 Pruning methods   336
10.10 Complex RBF networks   337
10.11 A comparison of RBF networks and MLPs   338
10.12 Bibliographical notes   341
References   343

11 Recurrent neural networks   351
11.1 Introduction   351
11.2 Fully connected recurrent networks   353
11.3 Time-delay neural networks   354
11.4 Backpropagation for temporal learning   357
11.5 RBF networks for modeling dynamic systems   360
11.6 Some recurrent models   361
11.7 Reservoir computing   363
References   366

12 Principal component analysis   370
12.1 Introduction   370
12.1.1 Hebbian learning rule   371
12.1.2 Oja's learning rule   372
12.2 PCA: conception and model   373
12.2.1 Factor analysis   375
12.3 Hebbian rule based PCA   376
12.3.1 Subspace learning algorithms   376
12.3.2 Generalized Hebbian algorithm   380
12.4 Least mean squared error-based PCA   382
12.4.1 Other optimization-based PCA   386
12.5 Anti-Hebbian rule based PCA   387
12.5.1 APEX algorithm   388
12.6 Nonlinear PCA   392
12.6.1 Autoassociative network-based nonlinear PCA   393
12.7 Minor component analysis   395
12.7.1 Extracting the first minor component   395
12.7.2 Self-stabilizing minor component analysis   396
12.7.3 Oja-based MCA   397
12.7.4 Other algorithms   397
12.8 Constrained PCA   398
12.8.1 Sparse PCA   399
12.9 Localized PCA, incremental PCA and supervised PCA   400
12.10 Complex-valued PCA   402
12.11 Two-dimensional PCA   403
12.12 Generalized eigenvalue decomposition   404
12.13 Singular value decomposition   405
12.13.1 Cross-correlation asymmetric PCA networks   406
12.13.2 Extracting principal singular components for nonsquare matrices   408
12.13.3 Extracting multiple principal singular components   409
12.14 Canonical correlation analysis   410
References   413

13 Nonnegative matrix factorization   423
13.1 Introduction   423
13.2 Algorithms for NMF   424
13.2.1 Multiplicative update algorithm and alternating nonnegative least squares   425
13.3 Other NMF methods   427
13.3.1 NMF methods for clustering   430
References   431

14 Independent component analysis   435
14.1 Introduction   435
14.2 ICA model   436
14.3 Approaches to ICA   437
14.4 Popular ICA algorithms   439
14.4.1 Infomax ICA   439
14.4.2 EASI, JADE, and natural-gradient ICA   441
14.4.3 FastICA algorithm   442
14.5 ICA networks   447
14.6 Some ICA methods   449
14.6.1 Nonlinear ICA   449
14.6.2 Constrained ICA   450
14.6.3 Nonnegativity ICA   451
14.6.4 ICA for convolutive mixtures   452
14.6.5 Other methods   452
14.7 Complex-valued ICA   455
14.8 Stationary subspace analysis and slow feature analysis   457
14.9 EEG, MEG and fMRI   458
References   462

15 Discriminant analysis   469
15.1 Linear discriminant analysis   469
15.1.1 Solving small sample size problem   472
15.2 Fisherfaces   473
15.3 Regularized LDA   474
15.4 Uncorrelated LDA and orthogonal LDA   475
15.5 LDA/GSVD and LDA/QR   477
15.6 Incremental LDA   478
15.7 Other discriminant methods   478
15.8 Nonlinear discriminant analysis   481
15.9 Two-dimensional discriminant analysis   482
References   483

16 Support vector machines   489
16.1 Introduction   489
16.2 SVM model   492
16.3 Solving the quadratic programming problem   495
16.3.1 Chunking   496
16.3.2 Decomposition   496
16.3.3 Convergence of decomposition methods   500
16.4 Least-squares SVMs   501
16.5 SVM training methods   504
16.5.1 SVM algorithms with reduced kernel matrix   504
16.5.2 ν-SVM   505
16.5.3 Cutting-plane technique   506
16.5.4 Gradient-based methods   507
16.5.5 Training SVM in the primal formulation   508
16.5.6 Clustering-based SVM   509
16.5.7 Other methods   510
16.6 Pruning SVMs   513
16.7 Multiclass SVMs   515
16.8 Support vector regression   517
16.9 Support vector clustering   522
16.10 Distributed and parallel SVMs   525
16.11 SVMs for one-class classification   527
16.12 Incremental SVMs   528
16.13 SVMs for active, transductive and semi-supervised learning   530
16.13.1 SVMs for active learning   530
16.13.2 SVMs for transductive or semi-supervised learning   530
16.14 Probabilistic approach to SVM   534
16.14.1 Relevance vector machines   534
References   535
17 Other kernel methods   550
17.1 Introduction   550
17.2 Kernel PCA   552
17.3 Kernel LDA   556
17.4 Kernel clustering   558
17.5 Kernel autoassociators, kernel CCA and kernel ICA   560
17.6 Other kernel methods   561
17.7 Multiple kernel learning   563
References   565

18 Reinforcement learning   574
18.1 Introduction   574
18.2 Learning through awards   576
18.3 Actor-critic model   578
18.4 Model-free and model-based reinforcement learning   579
18.5 Temporal-difference learning   581
18.6 Q-learning   584
18.7 Learning automata   586
References   587

19 Probabilistic and Bayesian networks   589
19.1 Introduction   589
19.1.1 Classical vs. Bayesian approach   590
19.1.2 Bayes' theorem   590
19.1.3 Graphical models   591
19.2 Bayesian network model   592
19.3 Learning Bayesian networks   595
19.3.1 Learning the structure   596
19.3.2 Learning the parameters   601
19.3.3 Constraint handling   602
19.4 Bayesian network inference   603
19.4.1 Belief propagation   604
19.4.2 Factor graphs and the belief propagation algorithm   606
19.5 Sampling (Monte Carlo) methods   609
19.5.1 Gibbs sampling   611
19.6 Variational Bayesian methods   612
19.7 Hidden Markov models   614
19.8 Dynamic Bayesian networks   617
19.9 Expectation-maximization algorithm   618
19.10 Mixture models   620
19.10.1 Probabilistic PCA   621
19.10.2 Probabilistic clustering   622
19.10.3 Probabilistic ICA   624
19.11 Bayesian approach to neural network learning   625
19.12 Boltzmann machines   627
19.12.1 Boltzmann learning algorithm   629
19.12.2 Mean-field-theory machine   629
19.12.3 Stochastic Hopfield networks   632
19.13 Training deep networks   633
References   636

20 Combining multiple learners: data fusion and ensemble learning   650
20.1 Introduction   650
20.1.1 Ensemble learning methods   651
20.1.2 Aggregation   652
20.2 Boosting   653
20.2.1 AdaBoost   654
20.3 Bagging   657
20.4 Random forests   658
20.5 Topics in ensemble learning   659
20.6 Solving multiclass classification   662
20.6.1 One-against-all strategy   662
20.6.2 One-against-one strategy   662
20.6.3 Error-correcting output codes (ECOCs)   664
20.7 Dempster-Shafer theory of evidence   666
References   670

21 Introduction of fuzzy sets and logic   675
21.1 Introduction   675
21.2 Definitions and terminologies   676
21.3 Membership function   682
21.4 Intersection, union and negation   683
21.5 Fuzzy relation and aggregation   685
21.6 Fuzzy implication   687
21.7 Reasoning and fuzzy reasoning   688
21.7.1 Modus ponens and modus tollens   689
21.7.2 Generalized modus ponens   689
21.7.3 Fuzzy reasoning methods   690
21.8 Fuzzy inference systems   692
21.8.1 Fuzzy rules and fuzzy inference   693
21.8.2 Fuzzification and defuzzification   694
21.9 Fuzzy models   694
21.9.1 Mamdani model   694
21.9.2 Takagi-Sugeno-Kang model   697
21.10 Complex fuzzy logic   698
21.11 Possibility theory   699
21.12 Case-based reasoning   700
21.13 Granular computing and ontology   701
References   705

22 Neurofuzzy systems   708
22.1 Introduction   708
22.1.1 Interpretability   709
22.2 Rule extraction from trained neural networks   710
22.2.1 Fuzzy rules and multilayer perceptrons   710
22.2.2 Fuzzy rules and RBF networks   711
22.2.3 Rule extraction from SVMs   712
22.2.4 Rule generation from other neural networks   713
22.3 Extracting rules from numerical data   714
22.3.1 Rule generation based on fuzzy partitioning   715
22.3.2 Other methods   716
22.4 Synergy of fuzzy logic and neural networks   718
22.5 ANFIS model   719
22.6 Fuzzy SVMs   726
22.7 Other neurofuzzy models   728
References   732

23 Neural circuits and parallel implementation   738
23.1 Introduction   738
23.2 Hardware/software codesign   740
23.3 Topics in digital circuit designs   740
23.4 Circuits for neural-network models   742
23.4.1 Circuits for MLPs   742
23.4.2 Circuits for RBF networks   744
23.4.3 Circuits for clustering   745
23.4.4 Circuits for SVMs   746
23.4.5 Circuits of other models   747
23.5 Fuzzy neural circuits   748
23.6 Graphic processing unit (GPU) implementation   749
23.7 Implementation using systolic algorithms   751
23.8 Implementation using parallel computers   752
23.9 Implementation using cloud computing   753
References   755

24 Pattern recognition for biometrics and bioinformatics   761
24.1 Biometrics   761
24.1.1 Physiological biometrics and recognition   762
24.1.2 Behavioral biometrics and recognition   765
24.2 Face detection and recognition   767
24.2.1 Face detection   767
24.2.2 Face recognition   768
24.3 Bioinformatics   771
24.3.1 Microarray technology   773
24.3.2 Motif discovery, sequence alignment, protein folding, and co-clustering   776
References   778

25 Data mining   781
25.1 Introduction   781
25.2 Document representations for text categorization   782
25.3 Neural network approach to data mining   784
25.3.1 Classification-based data mining   784
25.3.2 Clustering-based data mining   786
25.3.3 Bayesian network based data mining   789
25.4 Personalized search   790
25.5 XML format   793
25.6 Web usage mining   794
25.7 Association mining   795
25.8 Ranking search results   796
25.8.1 Surfer models   797
25.8.2 PageRank algorithm   798
25.8.3 Hypertext induced topic search (HITS)   801
25.9 Data warehousing   801
25.10 Content-based image retrieval   803
25.11 E-mail anti-spamming   806
References   807

Appendix A: Mathematical preliminaries   816
A.1 Linear algebra   816
A.2 Linear scaling and data whitening   823
A.3 Gram-Schmidt orthonormalization transform   824
A.4 Stability of dynamic systems   825
A.5 Probability theory and stochastic processes   826
A.6 Numerical optimization techniques   830
References   833
Appendix B: Benchmarks and resources   834
B.1 Face databases   834
B.2 Some machine learning databases   837
B.3 Data sets for data mining   839
B.4 Databases and tools for speech recognition   840
B.5 Data sets for microarray and for genome analysis   841
B.6 Software   843
References   847
Preface
The human brain, consisting of nearly 10^11 neurons, is the center of human intelligence. Human intelligence has been simulated in various ways. Artificial intelligence (AI) pursues exact logical reasoning based on symbol manipulation. Fuzzy logics model the highly uncertain behavior of decision making. Neural networks model the highly nonlinear infrastructure of brain networks. Evolutionary computation models the evolution of intelligence. Chaos theory models the highly nonlinear and chaotic behaviors of human intelligence.
Soft computing is an evolving collection of methodologies for the representation of the ambiguity in human thinking; it exploits the tolerance for imprecision and uncertainty, approximate reasoning, and partial truth in order to achieve tractability, robustness, and low-cost solutions. The major methodologies of soft computing are fuzzy logic, neural networks, and evolutionary computation.
Conventional model-based data-processing methods require experts' knowledge for the modeling of a system. Neural network methods provide a model-free, adaptive, fault-tolerant, parallel and distributed processing solution. A neural network is a black box that directly learns the internal relations of an unknown system, without guessing functions for describing cause-and-effect relationships. The neural network approach is a basic methodology of information processing. Neural network models may be used for function approximation, classification, nonlinear mapping, associative memory, vector quantization, optimization, feature extraction, clustering, and approximate inference. Neural networks have wide applications in almost all areas of science and engineering.
Fuzzy logic provides a means for treating uncertainty and computing with words. This mimics human recognition, which skillfully copes with uncertainty.
Fuzzy systems are conventionally created from explicit knowledge expressed in the form of fuzzy rules, which are designed based on experts' experience. A fuzzy system can explain its action by fuzzy rules. Neurofuzzy systems, as a synergy of fuzzy logic and neural networks, possess both learning and knowledge representation capabilities.
This book is our attempt to bring together the major advances in neural networks and machine learning, and to explain them in a statistical framework. While some mathematical details are needed, we emphasize the practical aspects of the models and methods rather than the theoretical details. To us, neural networks are merely statistical methods that can be represented by graphs and networks. They can iteratively adjust the network parameters. As a statistical model, a neural network can learn the probability density function from the given samples, and then predict, by generalization according to the learnt statistics, outputs for new samples that are not included in the learning sample set. The neural network approach is a general statistical computational paradigm.
Neural network research solves two problems: the direct problem and the inverse problem. The direct problem employs computer and engineering techniques to model biological neural systems of the human brain. This problem is investigated by cognitive scientists and can be useful in neuropsychiatry and neurophysiology. The inverse problem simulates biological neural systems for their problem-solving capabilities for application in scientific or engineering fields; engineering and computer scientists have conducted extensive investigations in this area. This book concentrates mainly on the inverse problem, although the two areas often shed light on each other. The biological and psychological plausibility of the neural network models has not been seriously treated in this book, though some background material is discussed.
This book is intended to be used as a textbook for advanced undergraduates and graduate students in engineering, science, computer science, business, arts, and medicine. It is also a good reference book for scientists, researchers, and practitioners in a wide variety of fields, and assumes no previous knowledge of neural network or machine learning concepts.
This book is divided into twenty-five chapters and two appendices. It contains almost all the major neural network models and statistical learning approaches. We also give an introduction to fuzzy sets and logic, and neurofuzzy models. Hardware implementations of the models are discussed.
Two chapters are dedicated to the applications of neural network and statistical learning approaches to biometrics/bioinformatics and data mining. Finally, in the appendices, some mathematical preliminaries are given, and benchmarks for validating all kinds of neural network methods and some web resources are provided.
First and foremost, we would like to thank the supporting staff from Springer London, especially Anthony Doyle and Grace Quinn, for their enthusiastic and professional support throughout the period of manuscript preparation. K.L. Du also wishes to thank Jiabin Lu (Guangdong University of Technology, China), Jie Zeng (Richcon MC, Inc., China), Biaobiao Zhang and Hui Wang (Enjoyor, Inc., China), Zongnian Chen (Hikvision, Inc., China), and many of his graduate students, including Na Shou, Shengfeng Yu, Lusha Han and Xiaoling Wang (Zhejiang University of Technology, China), for their consistent assistance. In addition, we should mention at least the following names for their help: Omer Morgul (Bilkent University, Turkey), Yanwu Zhang (Monterey Bay Aquarium Research Institute, USA), Chi Sing Leung (City University of Hong Kong, Hong Kong), M. Omair Ahmad and Jianfeng Gu (Concordia University, Canada), and Li Yu, Limin Meng, Jingyu Hua, Zhijiang Xu and Luping Fang (Zhejiang University of Technology, China). Last, but not least, we would like to thank our families for their support and understanding during the course of writing this book.
A book of this length is certain to have some errors and omissions. Feedback is welcome via email at kldu@ieee.org or swamy@encs.concordia.ca.
Due to restriction on the length of this book, we have placed two appendices, namely Mathematical preliminaries and Benchmarks and resources, on the website of this book. MATLAB code for the worked examples is also downloadable from the website of this book.
K.L. Du
Enjoyor Inc.
Hangzhou, China
M. N. S. Swamy Concordia University Montreal, Canada
List of Abbreviations
adaline    adaptive linear element
A/D    analog-to-digital
AI    artificial intelligence
AIC    Akaike information criterion
ALA    adaptive learning algorithm
ANFIS    adaptive-network-based fuzzy inference system
AOSVR    accurate online SVR
APCA    asymmetric PCA
APEX    adaptive principal components extraction
API    application programming interface
ART    adaptive resonance theory
ASIC    application-specific integrated circuit
ASSOM    adaptive-subspace SOM
BAM    bidirectional associative memory
BFGS    Broyden-Fletcher-Goldfarb-Shanno
BIC    Bayesian information criterion
BIRCH    balanced iterative reducing and clustering using hierarchies
BP    backpropagation
BPTT    backpropagation through time
BSB    brain-states-in-a-box
BSS    blind source separation
CBIR    content-based image retrieval
CCA    canonical correlation analysis
CCCP    constrained concave-convex procedure
cdf    cumulative distribution function
CEM    classification EM
CG    conjugate gradient
CMAC    cerebellar model articulation controller
COP    combinatorial optimization problem
CORDIC    coordinate rotation digital computer
CPT    conditional probability table
CPU    central processing unit
CURE    clustering using representation
DBSCAN    density-based spatial clustering of applications with noise
DCS    dynamic cell structures
DCT    discrete cosine transform
DFP    Davidon-Fletcher-Powell
DFT    discrete Fourier transform
ECG    electrocardiogram
ECOC    error-correcting output code
EEG    electroencephalogram
EKF    extended Kalman filtering
ELM    extreme learning machine
EM    expectation-maximization
ERM    empirical risk minimization
E-step    expectation step
ETF    elementary transcendental function
EVD    eigenvalue decomposition
FCM    fuzzy C-means
FFT    fast Fourier transform
FIR    finite impulse response
fMRI    functional magnetic resonance imaging
FPGA    field programmable gate array
FSCL    frequency-sensitive competitive learning
GAP-RBF    growing and pruning algorithm for RBF
GCS    growing cell structures
GHA    generalized Hebbian algorithm
GLVQ-F    generalized LVQ family algorithms
GNG    growing neural gas
GSO    Gram-Schmidt orthonormal
HWO    hidden weight optimization
HyFIS    hybrid neural fuzzy inference system
ICA    independent component analysis
iid    independently drawn and identically distributed
i-or    interactive-or
KKT    Karush-Kuhn-Tucker
k-NN    k-nearest neighbor
k-WTA    k-winners-take-all
LASSO    least absolute selection and shrinkage operator
LBG    Linde-Buzo-Gray
LDA    linear discriminant analysis
LM    Levenberg-Marquardt
LMAM    LM with adaptive momentum
LMI    linear matrix inequality
LMS    least mean squares
LMSE    least mean squared error
LMSER    least mean square error reconstruction
LP    linear programming
LS    least-squares
LSI    latent semantic indexing
LTG    linear threshold gate
LVQ    learning vector quantization
MAD    median of the absolute deviation
MAP    maximum a posteriori
MCA    minor component analysis
MDL    minimum description length
MEG    magnetoencephalogram
MFCC    Mel frequency cepstral coefficient
MIMD    multiple instruction multiple data
MKL    multiple kernel learning
ML    maximum-likelihood
MLP    multilayer perceptron
MSA    minor subspace analysis
MSE    mean squared error
MST    minimum spanning tree
M-step    maximization step
NARX    nonlinear autoregressive with exogenous input
NEFCLASS    neuro-fuzzy classification
NEFCON    neuro-fuzzy controller
NEFLVQ    non-Euclidean FLVQ
NEFPROX    neuro-fuzzy function approximation
NIC    novel information criterion
NOVEL    nonlinear optimization via external lead
OBD    optimal brain damage
OBS    optimal brain surgeon
OLAP    online analytical processing
OLS    orthogonal least squares
OMP    orthogonal matching pursuit
OWO    output weight optimization
PAC    probably approximately correct
PAST    projection approximation subspace tracking
PASTd    PAST with deflation
PCA    principal component analysis
PCM    possibilistic C-means
pdf    probability density function
PSA    principal subspace analysis
QP    quadratic programming
QR-cp    QR with column pivoting
RAN    resource-allocating network
RBF    radial basis function
RIP    restricted isometry property
RLS    recursive least squares
RPCCL    rival penalized controlled competitive learning
RPCL    rival penalized competitive learning
RProp    resilient propagation
RTRL    real-time recurrent learning
RVM    relevance vector machine
SDP    semidefinite programs
SIMD    single instruction multiple data
SLA    subspace learning algorithm
SMO    sequential minimal optimization
SOM    self-organization map
SPMD    single program multiple data
SRM    structural risk minimization
SVD    singular value decomposition
SVDD    support vector data description
SVM    support vector machine
SVR    support vector regression
TDNN    time-delay neural network
TDRL    time-dependent recurrent learning
TLMS    total least mean squares
TLS    total least squares
TREAT    trust-region-based error aggregated training
TRUST    terminal repeller unconstrained subenergy tunneling
TSK    Takagi-Sugeno-Kang
TSP    traveling salesman problem
VC    Vapnik-Chervonenkis
VLSI    very large scale integrated
WINC    weighted information criterion
WTA    winner-takes-all
XML    extensible markup language
1 Introduction
1.1 Major events in neural networks research
The discipline of neural networks models the human brain. The average human brain consists of nearly 10^11 neurons of various types, with each neuron connecting to up to tens of thousands of synapses. As such, neural network models are also called connectionist models. Information processing takes place mainly in the cerebral cortex, the outer layer of the brain. Cognitive functions, including language, abstract reasoning, and learning and memory, represent the most complex brain operations to define in terms of neural mechanisms.

In the 1940s, McCulloch and Pitts [27] found that a neuron can be modeled as a simple threshold device to perform logic functions. In 1949, Hebb [14] proposed the Hebbian rule to describe how learning affects the synaptic connection between two neurons. In 1952, based upon the physical properties of cell membranes and the ion currents passing through transmembrane proteins, Hodgkin and Huxley [15] incorporated neural phenomena such as neuronal firing and action potential propagation into a set of evolution equations, yielding quantitatively accurate spikes and thresholds. This work brought Hodgkin and Huxley a Nobel Prize in 1963.

In the late 1950s and early 1960s, Rosenblatt [32] proposed the perceptron model, and Widrow and Hoff [39] proposed the adaline (adaptive linear element) model, trained with a least mean squares (LMS) method. In 1969, Minsky and Papert [28] proved mathematically that the perceptron cannot be used for complex logic functions. This substantially dampened interest in the field of neural networks. During the same period, the adaline model, as well as its multilayer version called the madaline, was successfully used in many problems; however, these models cannot solve linearly inseparable problems due to their use of a linear activation function.
In the 1970s, Grossberg [12, 13], von der Malsburg [38], and Fukushima [9] conducted pioneering work on competitive learning and self-organization, inspired by the connection patterns found in the visual cortex. Fukushima proposed his cognitron [9] and neocognitron models [10], [11] under the competitive learning paradigm. The neocognitron, inspired by the primary visual cortex, is a hierarchical multilayered neural network specially designed for robust visual pattern recognition. Several linear associative memory models were also proposed in that period [22]. In 1982, Kohonen proposed the self-organization map (SOM) [23].
The SOM adaptively transforms incoming signal patterns of arbitrary dimensions into one- or two-dimensional discrete maps in a topologically ordered fashion. Grossberg and Carpenter [13, 4] proposed the adaptive resonance theory (ART) model in the mid-1980s. The ART model, also based on competitive learning, is recurrent and self-organizing.

The Hopfield model, introduced in 1982 [17], ushered in the modern era of neural network research. The model works at the system level rather than at the single-neuron level. It is a recurrent neural network working with the Hebbian rule. This network can be used as an associative memory for information storage and for solving optimization problems. The Boltzmann machine [1] was introduced in 1985 as an extension of the Hopfield network by incorporating stochastic neurons. Boltzmann learning is based on a method called simulated annealing [20]. In 1987, Kosko proposed the adaptive bidirectional associative memory (BAM) [24]. The Hamming network, proposed by Lippman in the mid-1980s [25], is based on competitive learning and is the most straightforward associative memory. In 1988, Chua and Yang [5] extended the Hopfield model by proposing the cellular neural network model. The cellular network is a dynamical network model and is particularly suitable for two-dimensional signal processing and VLSI implementation.

The most prominent landmark in neural network research is the backpropagation (BP) learning algorithm, proposed for the multilayer perceptron (MLP) model in 1986 by Rumelhart, Hinton, and Williams [34]. Later, the BP algorithm was discovered to have already been invented in 1974 by Werbos [40]. In 1988, Broomhead and Lowe proposed the radial basis function (RBF) network model [3]. Both the MLP and the RBF network are universal approximators. In 1982, Oja proposed the principal component analysis (PCA) network for classical statistical analysis [30]. In 1994, Comon proposed independent component analysis (ICA) [6].
ICA is a generalization of PCA, and it is usually used for feature extraction and blind source separation (BSS). Since then, many neural network algorithms for classical statistical methods, such as Fisher's linear discriminant analysis (LDA), canonical correlation analysis (CCA), and factor analysis, have been proposed.

In 1985, Pearl introduced the Bayesian network model [31]. The Bayesian network is the best-known graphical model in AI. It possesses the characteristic of being both a statistical and a knowledge-representation formalism. It establishes the foundation for inference in modern AI.

Another landmark in the machine learning and neural network communities is the support vector machine (SVM), proposed by Vapnik et al. in the early 1990s [37]. The SVM is based on statistical learning theory and is particularly useful for classification with small sample sizes. The SVM has been used for classification, regression, and clustering. Thanks to its successful application in the SVM, the kernel method has aroused wide interest.

In addition to neural networks, fuzzy logic and evolutionary computation are two other major soft-computing paradigms. Soft computing is a computing framework that can tolerate imprecision and uncertainty instead of depending on exact mathematical computations. Fuzzy logic [41] can incorporate human knowledge into a system by means of fuzzy rules. Evolutionary computation [16, 35] originates from Darwin's theory of natural selection, and can optimize in domains that are difficult to handle by other means. These techniques are now widely used to enhance the interpretability of neural networks or to select the optimum architecture and parameters of neural networks.

In summary, the brain is a dynamic information processing system that evolves its structure and functionality in time through information processing at different hierarchical levels: quantum, molecular (genetic), single neuron, ensemble of neurons, cognitive, and evolutionary [19]:
- At the quantum level, particles that constitute every molecule move continuously, being in several states at the same time, characterized by probability, phase, frequency, and energy. These states can change following the principles of quantum mechanics.
- At the molecular level, RNA and protein molecules evolve in a cell and interact in a continuous way, based on the information stored in the DNA and on external factors, and affect the functioning of the cell (neuron).
- At the level of a single neuron, the internal information processes and the external stimuli change the synapses and cause the neuron to produce a signal to be transferred to other neurons.
- At the level of neuronal ensembles, all neurons operate together as a function of the ensemble through continuous learning.
- At the level of the whole brain, cognitive processes take place in a lifelong, incremental, multiple-task/multiple-modality learning mode, such as language and reasoning, and global information processes are manifested, such as consciousness.
- At the level of a population of individuals, species evolve through evolution via changes in the genetic DNA code.
Building computational models that integrate principles from different information levels may be efficient for solving complex problems. These models are called integrative connectionist learning systems [19]. Information processes at different levels of the information hierarchy interact and influence one another.
1.2 Neurons
Among the 10^11 neurons in the human brain, about 10^10 are in the cortex. The cortex is the outer mantle of cells surrounding the central structures, e.g., the brainstem and thalamus. Cortical thickness varies mostly between 2 and 3 mm in humans, and the cortex is folded, with an average surface area of about 2200 cm^2 [42]. The neuron, or nerve cell, is the fundamental anatomical and functional unit of the nervous system, including the brain.

Figure 1.1 Schematic drawing of a prototypical neuron.

A neuron is an extension of the simple cell with two types of appendages: multiple dendrites and an axon. A neuron possesses all the internal features of a regular cell. A neuron has four components: the dendrites, the soma (cell body), the axon, and the synapse. A soma contains a cell nucleus. Dendrites branch into a bushy network around the cell to receive input from other neurons, whereas the axon stretches out for a long distance, typically a centimeter and as far as a meter in extreme cases. The axon is an output channel to other neurons; it branches into strands and substrands to connect to the dendrites and cell bodies of other neurons. The connecting junction is called a synapse. Each cortical neuron receives 10^4–10^5 synaptic connections, with most inputs coming from distant neurons. Thus connections in the cortex are said to exhibit long-range excitation and short-range inhibition.

A neuron receives signals from other neurons through its soma and dendrites, integrates them, and sends output signals to other neurons through its axon. The dendrites receive signals from several neighboring neurons and pass these onto the cell body, where they are processed; the resulting signal is transferred through the axon. A schematic diagram is shown in Fig. 1.1.

Like any other cell, neurons have a membrane potential, that is, an electric potential difference between the intracellular and extracellular compartments, caused by the different densities of sodium (Na) and potassium (K) ions. The neuronal membrane is endowed with relatively selective ionic channels that allow some specific ions to cross the membrane. The cell membrane has an electrical resting potential of -70 mV, which is maintained by pumping positive ions (Na+) out of the cell. Unlike an ordinary cell, the neuron is excitable. Because of inputs from the dendrites, the cell may not be able to maintain the -70 mV resting potential, resulting in an action potential, a pulse transmitted down the axon.

Signals are propagated from neuron to neuron by a complicated electrochemical reaction. Chemical transmitter substances pass the synapses and enter the dendrite, changing the electrical potential of the cell body. When the potential is above a threshold, an electrical pulse or action potential is sent along the axon. After releasing the pulse, the neuron returns to its resting potential. The action potential causes the release of certain biochemical agents for transmitting messages to the dendrites of nearby neurons. These biochemical transmitters may have either an excitatory or inhibitory effect on neighboring neurons.

Figure 1.2 The mathematical model of the McCulloch-Pitts neuron.
A synapse that increases the potential is excitatory, whereas a synapse that decreases it is inhibitory. Synaptic connections exhibit plasticity: long-term changes in the strength of connections in response to the pattern of stimulation. Neurons also form new connections with other neurons, and sometimes entire collections of neurons can migrate from one place to another. These mechanisms are thought to form the basis for learning in the brain. Synaptic plasticity is a basic biological mechanism underlying learning and memory. Inspired by this, a large number of learning rules, specifying how activity and training experience change synaptic efficacies [14], have been advanced.
1.2.1 The McCulloch-Pitts neuron model
A neuron is a basic processing unit in a neural network. It is a node that processes all fan-in from other nodes and generates an output according to a transfer function called the activation function. The activation function represents a linear or nonlinear mapping from the input to the output and is denoted by \phi(\cdot). The variable synapses are modelled by weights. The McCulloch-Pitts neuron model [27], which employs the sigmoidal activation function, was biologically inspired. Figure 1.2 illustrates the simple McCulloch-Pitts neuron model. The output
of the neuron is given by
    \mathrm{net} = \sum_{i=1}^{J_1} w_i x_i - \theta = \mathbf{w}^T \mathbf{x} - \theta,        (1.1)

    y = \phi(\mathrm{net}),        (1.2)

where x_i is the ith input, w_i is the link weight from the ith input, \mathbf{w} = (w_1, \ldots, w_{J_1})^T, \mathbf{x} = (x_1, \ldots, x_{J_1})^T, \theta is a threshold or bias, and J_1 is the number of inputs. The activation function \phi(\cdot) is usually some continuous or discontinuous function mapping the real numbers into the interval (-1, 1) or (0, 1).

Neural networks are suitable for VLSI circuit implementations. The analog approach is extremely attractive in terms of size, power, and speed. A neuron can be realized with a simple amplifier, and a synapse is realized with a resistor. A memristor is a two-terminal passive circuit element that acts as a variable resistor whose value can be varied by varying the current passing through it [2]. The circuit of a neuron is given in Fig. 1.3. Since weights from the circuits can only be positive, an inverter can be applied to the input voltage so as to realize a negative synaptic weight.

Figure 1.3 VLSI model of a neuron.

By Kirchhoff's current law, the output voltage of the neuron is derived as

    y = \phi\left( \frac{\sum_{i=1}^{J_1} w_i x_i}{\sum_{i=0}^{J_1} w_i} - \theta \right),        (1.3)
where x_i is the ith input voltage, w_i is the conductance of the ith resistor, \theta the bias voltage, and \phi(\cdot) the transfer function of the amplifier. The bias voltage of a neuron in a VLSI circuit is caused by device mismatches and is difficult to control. The McCulloch-Pitts neuron model is known as the classical perceptron model, and it is used in most neural network models, including the MLP and the Hopfield network. Many other neural networks are also based on the McCulloch-Pitts neuron model, but use other activation functions. For example, the adaline [39] and the SOM [23] use linear activation functions, and the RBF network adopts a radial basis function (RBF).
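As a minimal illustration of Eqs. (1.1) and (1.2), the sketch below implements the neuron with a hard threshold for \phi (one common choice; the function and variable names are ours, not the book's):

```python
import numpy as np

# McCulloch-Pitts style neuron: net = w^T x - theta (Eq. 1.1),
# y = phi(net) (Eq. 1.2). Here phi is a hard threshold onto {0, 1};
# a sigmoid could be substituted without changing the template.
def mp_neuron(x, w, theta, phi=lambda net: 1.0 if net >= 0.0 else 0.0):
    net = np.dot(w, x) - theta      # weighted sum minus threshold
    return phi(net)

# Example: with w = (1, 1) and theta = 1.5, the neuron realizes
# the logical AND of its two binary inputs.
w = np.array([1.0, 1.0])
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    y = mp_neuron(np.array(x, dtype=float), w, theta=1.5)
    print(x, "->", y)               # only [1, 1] yields 1.0
```

Choosing theta between 1 and 2 makes the single linear threshold unit compute AND; no choice of weights lets it compute XOR, which is the linear-separability limitation discussed in Sect. 1.2.2.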
1.2.2 Spiking neuron models
Many of the intrinsic properties seen within the brain were not included in the classical perceptron, limiting its functionality and use to linear discrimination tasks. A single classical perceptron is not capable of solving nonlinear problems, such as the XOR problem. Spiking neuron and spiking neural network models mimic the spiking activity of neurons in the brain when processing information. Spiking neurons tend to gather in functional groups firing together during strict time intervals, forming coalitions or assemblies [18], also referred to as events. Spiking neural networks represent information as trains of spikes. This results in a much higher number of patterns stored in a model and more flexible processing.

During training of a spiking neural network, the weight of a synapse is modified according to the timing difference between the presynaptic spike and the postsynaptic spike. This synaptic plasticity is called spike-timing-dependent plasticity [26]. In a biological system, a neuron integrates the excitatory postsynaptic current, which is produced by a presynaptic stimulus, to change the voltage of its soma. If the soma voltage is larger than a defined threshold, an action potential (spike) is produced.

The integrate-and-fire neuron [36], FitzHugh-Nagumo neuron [8], and Hodgkin-Huxley neuron models all incorporate more of the dynamics of actual biological neurons. Whereas the Hodgkin-Huxley model describes the biophysical mechanics of neurons, both the integrate-and-fire and FitzHugh-Nagumo neurons model key features of biological neurons such as the membrane potential, excitatory postsynaptic potential, and inhibitory postsynaptic potential. A single neuron incorporating these key features processes information of a higher dimension, in terms of its membrane threshold, firing rate, and postsynaptic potential, than a classical perceptron. The integrate-and-fire neuron model has a binary output on a short time scale: it either fires an action potential or does not. A spike train s \in S(T) is a sequence of ordered spike times s = \{t_m \in T : m = 1, \ldots, N\} corresponding to the time instants in the interval T = [0, T] at which a neuron fires.

The FitzHugh-Nagumo model is a simplified version of the Hodgkin-Huxley model, which models in a detailed manner the activation and deactivation dynamics of a spiking neuron. The Hodgkin-Huxley model [15, 21] incorporates the principal neurobiological properties of a neuron in order to understand phenomena such as the action potential. It was obtained by casting empirical investigation of the physiological properties of the squid axon into a dynamical-system framework. The model is a set of conductance-based coupled ordinary differential equations, incorporating sodium (Na), potassium (K), and chloride (Cl) ion flows through their respective channels. These equations are based upon the physical properties of cell membranes and the ion currents passing through transmembrane proteins. Chloride channel conductances are static (not voltage dependent) and hence leaky. According to the Hodgkin-Huxley model, the dynamics of the membrane potential V(t) of the neuron can be described by
T = [0 , T ] at which a neuron ﬁres. The FitzHughNagumo model is a simpliﬁed version of the HodgkinHuxley model which models in a detailed manner the activation and deactivation dynamics of a spiking neuron. The HodgkinHuxley model [15, 21] incorporates the principal neurobiological properties of a neuron in order to understand phenomena such as the action potential. It was obtained from empirical investigation of the physiological prop erties of the squid axon into a dynamical system framework. The model is a set of conductancebased coupled ordinary diﬀerential equations, incorporating sodium (Na), potassium (K) and chloride (Cl) ion ﬂows throug h their respective channels. These equations are based upon the physical properties of cell mem branes and the ion currents passing through transmembrane proteins. Chloride channel conductances are static (not voltage dependent) and hence leaky. According to the HodgkinHuxley model, the dynamics of the membrane potential V (t ) of the neuron can be described by
_{C} dV
dt
= − g _{N}_{a} m ^{3} h (V − V _{N}_{a} ) − g _{K} n ^{4} (V − V _{K} ) − g _{L} (V − V _{L} ) + I (t ),
(1.4)
where the ﬁrst three terms on the righthand side correspond to the potassium, sodium, and leakage currents, respectively, and g _{N}_{a} = 120 mS/cm ^{2} , g _{K} = 36 mS/cm ^{2} and g _{L} = 0 .3 mS/cm ^{2} are the maximal conductances of sodium, potas sium and leakage, respectively. The membrane capacitance C = 1 mF/cm ^{2} ;
_{N}_{a} = 50 mV, V _{K} = − 77 mV, and V _{L} = − 54 .4 mV are the reversal potentials
V
8
Chapter 1. Introduction
Figure 1.4 Parameters of the HodgkinHuxley model for a neuron.
of sodium, potassium, and leakage currents, respectively. I (t ) is the injected cur
rent. The stochastic gating variables n , m and h represent the activation term
of the potassium channel, the activation term, and the inactivation term of the
sodium channel, respectively. The factors n ^{4} and m ^{3} h are the mean portions of the open potassium and sodium ion channels within the membra ne patch. To take into account the channel noise, m , h and n obey the Langevin equations. When the stimuli S1 and S2 occur at 15 ms and 40 ms of 80 ms, the simulated results for V , m , h and n are plotted in Fig. 1.4; this ﬁgure was generated by a Java applet (http://thevirtualheart.org/HHindex.html ).
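Equation (1.4) can be integrated numerically. The sketch below uses forward Euler with the conductances and reversal potentials quoted above; the alpha/beta rate functions for the gating variables, the deterministic (noise-free) gating equations, the initial state, and the stimulus protocol are standard textbook choices rather than details taken from this chapter:

```python
import numpy as np

# Hodgkin-Huxley parameters as quoted in the text (mS/cm^2, mV, uF/cm^2)
g_Na, g_K, g_L = 120.0, 36.0, 0.3
V_Na, V_K, V_L = 50.0, -77.0, -54.4
C = 1.0

def rates(V):
    """Standard voltage-dependent opening/closing rates for n, m, h."""
    a_n = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    b_n = 0.125 * np.exp(-(V + 65.0) / 80.0)
    a_m = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    b_m = 4.0 * np.exp(-(V + 65.0) / 18.0)
    a_h = 0.07 * np.exp(-(V + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    return (a_n, b_n), (a_m, b_m), (a_h, b_h)

def simulate(I_amp=10.0, t_on=5.0, t_off=30.0, T=50.0, dt=0.01):
    """Euler-integrate Eq. (1.4); I_amp (uA/cm^2) applied on [t_on, t_off)."""
    steps = int(round(T / dt))
    V, m, h, n = -65.0, 0.05, 0.6, 0.32   # approximate resting state
    trace = np.empty(steps)
    for k in range(steps):
        t = k * dt
        I = I_amp if t_on <= t < t_off else 0.0
        (a_n, b_n), (a_m, b_m), (a_h, b_h) = rates(V)
        dV = (-g_Na * m**3 * h * (V - V_Na)    # sodium current
              - g_K * n**4 * (V - V_K)         # potassium current
              - g_L * (V - V_L) + I) / C       # leakage + stimulus
        V += dt * dV
        m += dt * (a_m * (1.0 - m) - b_m * m)  # deterministic gating
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        trace[k] = V
    return trace

V_trace = simulate()
print("peak membrane potential: %.1f mV" % V_trace.max())
```

With a sustained stimulus of about 10 uA/cm^2 the membrane potential crosses 0 mV during each spike, reproducing the action potentials plotted in Fig. 1.4; a small step dt is needed because explicit Euler is only conditionally stable for these stiff equations.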
1.3 Neural networks
A neural network is characterized by its network architecture, node characteristics, and learning rules.
Architecture
The network architecture is represented by the connection weight matrix W = [w_ij], where w_ij denotes the connection weight from node i to node j. When w_ij = 0, there is no connection from node i to node j. By setting some of the w_ij's to zero, different network topologies can be realized. Neural networks can be grossly classified into feedforward neural networks, recurrent neural networks, and their hybrids. Popular network topologies are fully connected layered feedforward networks, recurrent networks, lattice networks, layered feedforward networks with lateral connections, and cellular networks, as shown in Fig. 1.5.
Figure 1.5 Architecture of neural networks. (a) Layered feedforward network. (b) Recurrent network. (c) Two-dimensional lattice network. (d) Layered feedforward network with lateral connections. (e) Cellular network. The big numbered circles stand for neurons and the small ones for input nodes.

- In a feedforward network, the connections between neurons are in one direction. A feedforward network is usually arranged in the form of layers. In such a layered feedforward network, there is no connection between the neurons in the same layer, and there is no feedback between layers. In a fully connected layered feedforward network, every node in any layer is connected to every node in its adjacent forward layer. The MLP and the RBF network are fully connected layered feedforward networks.
- In a recurrent network, there exists at least one feedback connection. The Hopfield model and the Boltzmann machine are two examples of recurrent networks.
- A lattice network consists of a one-, two-, or higher-dimensional array of neurons. Each array has a corresponding set of input nodes. The Kohonen network [23] uses a one- or two-dimensional lattice architecture.
- A layered feedforward network with lateral connections has lateral connections between the units at the same layer of its layered feedforward architecture. A competitive learning network is a two-layered network of such an architecture. The feedforward connections are excitatory, while the lateral connections in the same layer are inhibitory. Some PCA networks using the Hebbian/anti-Hebbian learning rules [33] also employ this kind of network topology.
- A cellular network consists of regularly spaced neurons, called cells, which communicate only with the neurons in their immediate neighborhood. Adjacent cells are connected by mutual interconnections. Each cell is excited by its own signals and by signals flowing from its adjacent cells [5].
In this book, we use the notation J_1-J_2-...-J_M to represent a neural network with a layered architecture of M layers, where J_i is the number of nodes in the ith layer. Notice that the input layer is counted as layer 1 and the nodes at this layer are not neurons. Layer M is the output layer.
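The weight-matrix view of topology can be made concrete with a small sketch: a 3-2-1 layered feedforward network (J_1 = 3 inputs, J_2 = 2 hidden neurons, J_3 = 1 output neuron) expressed as a single W = [w_ij] over all six nodes, with zero entries meaning "no connection". The layer sizes, random weights, and tanh activation are illustrative assumptions, not choices from the text:

```python
import numpy as np

J = [3, 2, 1]                       # a J_1 - J_2 - J_3 = 3-2-1 network
N = sum(J)                          # total number of nodes
rng = np.random.default_rng(0)

# w_ij = W[i, j] is the weight from node i to node j; nodes 0..2 are
# inputs, 3..4 the hidden layer, 5 the output. Zeroing everything else
# realizes the layered feedforward topology.
W = np.zeros((N, N))
W[0:3, 3:5] = rng.standard_normal((3, 2))   # layer 1 -> layer 2
W[3:5, 5:6] = rng.standard_normal((2, 1))   # layer 2 -> layer 3

def forward(W, x):
    act = np.zeros(N)
    act[0:3] = x                            # input nodes are not neurons
    act[3:5] = np.tanh(act @ W[:, 3:5])     # hidden-layer activations
    act[5:6] = np.tanh(act @ W[:, 5:6])     # output-layer activation
    return act[5]

y = forward(W, np.array([1.0, -0.5, 0.2]))
print("network output:", y)
```

A recurrent topology would simply correspond to a W with nonzero entries below the block diagonal (feedback connections), illustrating how one matrix encodes all the architectures of Fig. 1.5.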
Operation
The operation of neural networks is divided into two stages: learning (training) and generalization (recalling). Network training is typically accomplished using examples, with the network parameters adapted by a learning algorithm in an online or offline manner. Once the network is trained to accomplish the desired performance, the learning process is terminated, and the network can then be used directly to replace the complex system dynamics. The trained network can be used to operate in a static manner: to emulate an unknown dynamics or nonlinear relationship.

For real-time applications, a neural network is required to have a constant processing delay regardless of the number of input nodes and to have a minimum number of layers. As the number of input nodes increases, the size of the network layers should grow at the same rate without additional layers.

Adaptive neural networks are a class of neural networks that do not need to be trained by providing a training pattern set. They can learn while they are performing. For adaptive neural networks, unsupervised learning methods are usually used. For example, the Hopfield model uses a generalized Hebbian learning rule for implementation as an associative memory. Any time a pattern is presented to it, the Hopfield network updates the connection weights. After the network is trained with standard patterns and is prepared for generalization, the learning capability should be disabled; otherwise, when an incomplete or noisy pattern is presented to the network, it will search for the closest match, and meanwhile the memorized pattern will be replaced by this new pattern. Reinforcement learning is also naturally adaptive, where the environment is treated as a teacher. Supervised learning is not adaptive in nature.
Properties
Neural networks are biologically motivated. Each neuron is a computational node representing a nonlinear function. Neural networks possess the following advantages [7]:
- Adaptive learning: They can adapt themselves by changing the network parameters in a surrounding environment.
- Generalization: A trained neural network has a superior generalization capability.
- General-purpose nonlinear nature: They perform like a black box.
- Self-organizing: Some neural networks, such as the SOM [23] and competitive learning based neural networks, have a self-organization property.
- Massive parallelism and simple VLSI implementations: Each basic processing unit usually has a uniform property. This parallel structure allows for highly parallel software and hardware implementations.
- Robustness and fault tolerance: A neural network can easily handle imprecise, fuzzy, noisy, and probabilistic information. It is a distributed information system, where information is stored in the whole network in a distributed manner by the network structure, such as W. Thus, the overall performance does not degrade significantly when the information at some nodes is lost or some connections in the network are damaged. The network repairs itself, and thus possesses a fault-tolerant capability.
Applications
Neural networks can be treated as a general statistical tool for almost all disciplines of science and engineering. The applications can be in modeling and system identification, classification, pattern recognition, optimization, control, industrial applications, communications, signal processing, image analysis, bioinformatics, and data mining. Pattern recognition is central to biological and artificial intelligence; it is a complete process that gathers the observations, extracts features from the observations, and classifies or describes the observations. Pattern recognition is one of the most fundamental applications of neural networks. More specifically, some neural network models have the following functions.
- Function approximation: This capability is generally used for modeling and system identification, regression and prediction, control, signal processing, pattern recognition and classification, and associative memory. Image restoration is also a function approximation problem. The MLP and RBF networks are universal approximators of nonlinear functions. Some recurrent networks are universal approximators of dynamical systems. Prediction is an open-loop problem, while control is a closed-loop problem.
- Classification: Classification is the most fundamental application of neural networks. Classification can be based on the function approximation capability of neural networks.
- Clustering and vector quantization: Clustering groups together similar objects, based on some distance measure. Unlike in classification problems, the class membership of a pattern is not known a priori. Vector quantization is similar to clustering.
- Associative memory: An association is an input-output pair. Associative memory, also known as content-addressable memory, is a memory organization that accesses memory by its content instead of its address. It picks the most desirable match from all stored prototypes when an incomplete or corrupted sample is presented. Associative memories are useful for pattern recognition, pattern association, or pattern completion.
- Optimization: Some neural network models, such as the Hopfield model and the Boltzmann machine, can be used to solve combinatorial optimization problems (COPs).
- Feature extraction and information compression: Coding and information compression are essential tasks in the transmission and storage of speech, audio, image, video, and other information. PCA, ICA, and vector quantization can achieve the objective of feature extraction and information compression.
1.4 Scope of the book
This book contains twentysix chapters and two appendices:
- Chapter 2 describes some fundamental topics on neural networks and machine learning.
- Chapter 3 is dedicated to the perceptron.
- The MLP is the topic of Chapters 4 and 5. The MLP with BP learning is introduced in Chapter 4; structural optimization of the MLP is also described in this chapter.
- The MLP with second-order learning is introduced in Chapter 5.
- Chapter 6 treats the Hopfield model, its application to solving COPs, simulated annealing, chaotic neural networks, and cellular networks.
- Chapter 7 describes associative memory models and algorithms.
- Chapters 8 and 9 are dedicated to clustering. Chapter 8 introduces Kohonen networks, ART networks, C-means, subtractive, and fuzzy clustering.
- Chapter 9 introduces many advanced topics in clustering.
- In Chapter 10, we elaborate on the RBF network model.
- Chapter 11 introduces the learning of general recurrent networks.
- Chapter 12 deals with PCA networks and algorithms. Minor component analysis (MCA), cross-correlation PCA networks, generalized eigenvalue decomposition (EVD), and CCA are also introduced in this chapter.
- Nonnegative matrix factorization (NMF) is introduced in Chapter 13.
- ICA and BSS are introduced in Chapter 14.
- Discriminant analysis is described in Chapter 15.
- Probabilistic and Bayesian networks are introduced in Chapter 19. Many topics, such as the EM algorithm, the HMM, sampling (Monte Carlo) methods, and the Boltzmann machine, are treated in this framework.
- SVMs are introduced in Chapter 16.
- Kernel methods other than SVMs are introduced in Chapter 17.
- Reinforcement learning is introduced in Chapter 18.
- Ensemble learning is introduced in Chapter 20.
- Fuzzy sets and logic are introduced in Chapter 21.
- Neurofuzzy models are described in Chapter 22. Transformations between fuzzy logic and neural networks are also discussed.
- Implementation of neural networks in hardware is treated in Chapter 23.
- In Chapter 24, we give an introduction to neural network applications to biometrics and bioinformatics.
- Data mining, as well as the application of neural networks to that field, is introduced in Chapter 25.
- Mathematical preliminaries are included in Appendix A.
- Some benchmarks and resources are included in Appendix B.
Examples and exercises are included in most of the chapters.
Problems
1.1 List the major differences between the neural-network approach and classical information-processing approaches.
1.2 Formulate a McCulloch-Pitts neuron for four variables: white blood count, systolic blood pressure, diastolic blood pressure, and pH of the blood.
1.3 Derive Equation (1.3) from Fig. 1.3.
References
[1] D.H. Ackley, G.E. Hinton & T.J. Sejnowski, A learning algorithm for Boltzmann machines. Cognitive Sci., 9 (1985), 147–169.
[2] S.P. Adhikari, C. Yang, H. Kim & L.O. Chua, Memristor bridge synapse-based neural network and its learning. IEEE Trans. Neural Netw. Learn. Syst., 23:9 (2012), 1426–1435.
[3] D.S. Broomhead & D. Lowe, Multivariable functional interpolation and adaptive networks. Complex Syst., 2 (1988), 321–355.
[4] G.A. Carpenter & S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, Image Process., 37 (1987), 54–115.
[5] L.O. Chua & L. Yang, Cellular neural networks–Part I: Theory; Part II: Applications. IEEE Trans. Circ. Syst., 35 (1988), 1257–1290.
[6] P. Comon, Independent component analysis—A new concept? Signal Process., 36:3 (1994), 287–314.
[7] K.L. Du & M.N.S. Swamy, Neural Networks in a Softcomputing Framework (London: Springer, 2006).
[8] R. FitzHugh, Impulses and physiological states in theoretical models of nerve membrane. Biophysical J., 1 (1961), 445–466.
[9] K. Fukushima, Cognitron: A self-organizing multilayered neural network. Biol. Cybern., 20 (1975), 121–136.
[10] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36 (1980), 193–202.
[11] K. Fukushima, Increasing robustness against background noise: Visual pattern recognition by a neocognitron. Neural Netw., 24:7 (2011), 767–778.
[12] S. Grossberg, Neural expectation: Cerebellar and retinal analogues of cells fired by unlearnable and learnable pattern classes. Kybernetik, 10 (1972), 49–57.
[13] S. Grossberg, Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors; II. Feedback, expectation, olfaction, and illusions. Biol. Cybern., 23 (1976), 121–134 & 187–202.
[14] D.O. Hebb, The Organization of Behavior (New York: Wiley, 1949).
[15] A.L. Hodgkin & A.F. Huxley, A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117 (1952), 500–544.
[16] J. Holland, Adaptation in Natural and Artificial Systems (Ann Arbor, MI: University of Michigan Press, 1975).
[17] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci., 79 (1982), 2554–2558.
[18] E.M. Izhikevich, Polychronization: Computation with spikes. Neural Comput., 18:2 (2006), 245–282.
[19] N. Kasabov, Integrative connectionist learning systems inspired by nature: Current models, future trends and challenges. Nat. Comput., 8 (2009), 199–218.
[20] S. Kirkpatrick, C.D. Gelatt Jr. & M.P. Vecchi, Optimization by simulated annealing. Science, 220 (1983), 671–680.
[21] W.M. Kistler, W. Gerstner & J.L. van Hemmen, Reduction of Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9 (1997), 1015–1045.
[22] T. Kohonen, Correlation matrix memories. IEEE Trans. Computers, 21 (1972), 353–359.
[23] T. Kohonen, Self-organized formation of topologically correct feature maps. Biol. Cybern., 43 (1982), 59–69.
[24] B. Kosko, Adaptive bidirectional associative memories. Appl. Optics, 26 (1987), 4947–4960.
[25] R.P. Lippmann, An introduction to computing with neural nets. IEEE ASSP Mag., 4:2 (1987), 4–22.
[26] H. Markram, J. Lubke, M. Frotscher & B. Sakmann, Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275:5297 (1997), 213–215.
[27] W.S. McCulloch & W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. of Math. Biophysics, 5 (1943), 115–133.
[28] M.L. Minsky & S. Papert, Perceptrons (Cambridge, MA: MIT Press, 1969).
[29] K.R. Muller, S. Mika, G. Ratsch, K. Tsuda & B. Scholkopf, An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw., 12:2 (2001), 181–201.
[30] E. Oja, A simplified neuron model as a principal component analyzer. J. Math. Biology, 15 (1982), 267–273.
[31] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (San Mateo, CA: Morgan Kaufmann, 1988).
[32] F. Rosenblatt, Principles of Neurodynamics (New York: Spartan Books, 1962).
[33] J. Rubner & P. Tavan, A self-organizing network for principal-component analysis. Europhysics Lett., 10 (1989), 693–698.
[34] D.E. Rumelhart, G.E. Hinton & R.J. Williams, Learning internal representations by error propagation. In: D.E. Rumelhart & J.L. McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1: Foundations, 318–362 (Cambridge, MA: MIT Press, 1986).
[35] H.P. Schwefel, Numerical Optimization of Computer Models (Chichester: Wiley, 1981).
[36] H.C. Tuckwell, Introduction to Theoretical Neurobiology (Cambridge, UK: Cambridge University Press, 1988).
[37] V.N. Vapnik, Statistical Learning Theory (New York: Wiley, 1998).
[38] C. von der Malsburg, Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14 (1973), 85–100.
[39] B. Widrow & M.E. Hoff, Adaptive switching circuits. IRE WESCON Convention Record, 4 (1960), 96–104.
[40] P.J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD Thesis, Harvard University, Cambridge, MA, 1974.
[41] L.A. Zadeh, Fuzzy sets. Inf. & Contr., 8 (1965), 338–353.
[42] K. Zilles, Cortex. In: G. Paxinos, Ed., The Human Nervous System (New York: Academic Press, 1990).
2 Fundamentals of Machine Learning
2.1 Learning methods
Learning is a fundamental capability of neural networks. Learning rules are algorithms for finding suitable weights W and/or other network parameters. Learning of a neural network can be viewed as a nonlinear optimization problem: finding a set of network parameters that minimizes the cost function for the given examples. This kind of parameter estimation is also called a learning or training algorithm. Neural networks are usually trained by epoch. An epoch is a complete run in which all the training examples are presented to the network and processed by the learning algorithm exactly once. After learning, a neural network represents a complex relationship and possesses the ability to generalize. To control a learning process, a criterion is defined to decide when to terminate the process. The complexity of an algorithm is usually denoted as O(m), indicating that the number of floating-point operations is of order m. Learning methods are conventionally divided into supervised, unsupervised, and reinforcement learning; these schemes are illustrated in Fig. 2.1, where x_p and y_p are the input and output of the pth pattern in the training set, ŷ_p is the neural network output for the pth input, and E is an error function. From a statistical viewpoint, unsupervised learning learns the pdf of the training set, p(x), while supervised learning learns the conditional pdf p(y|x). Supervised learning is widely used in classification, approximation, control, modeling and identification, signal processing, and optimization. Unsupervised learning schemes are mainly used for clustering, vector quantization, feature extraction, signal coding, and data analysis. Reinforcement learning is usually used in control and artificial intelligence.
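The epoch-by-epoch view of training as cost minimization can be made concrete with a small sketch. This is an illustration invented for this text, not an algorithm from the book: a single linear neuron is fitted by gradient descent, one epoch presenting every example once, with the loop terminated by an error criterion. The data, learning rate, and function names are all arbitrary choices.

```python
# Minimal sketch of epoch-based learning as nonlinear optimization:
# fit a single linear neuron y = w*x + b to examples by gradient descent.

examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # pairs (x_p, y_p); true map is y = 2x + 1

def train_epoch(w, b, lr=0.05):
    """One epoch: present every training example once, update w and b."""
    sse = 0.0
    for x, y in examples:
        y_hat = w * x + b          # network output for the p-th input
        e = y_hat - y              # error signal
        w -= lr * e * x            # gradient-descent step on the squared error
        b -= lr * e
        sse += e * e
    return w, b, sse / len(examples)

w, b = 0.0, 0.0
for epoch in range(2000):          # terminate when the error criterion is met
    w, b, mse = train_epoch(w, b)
    if mse < 1e-8:
        break
print(round(w, 2), round(b, 2))    # approaches w = 2, b = 1
```

Note that the update here is applied per example (pattern mode); accumulating the gradients and updating once per epoch (batch mode) is the other common convention.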
In logic and statistical inference, transduction is reasoning from observed, specific (training) cases to specific (test) cases. In contrast, induction is reasoning from observed training cases to general rules, which are then applied to the test cases. Machine learning thus falls into two broad classes: inductive learning and transductive learning. Inductive learning pursues the standard goal in machine learning, which is to accurately classify the entire input space. In contrast, transductive learning focuses on a predefined target set of unlabeled data, the goal being to label that specific target set.
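The distinction can be illustrated with a toy one-nearest-neighbor sketch (invented for this text; the function names are not from any library): the inductive learner returns a rule applicable to any future input, while the transductive learner only emits labels for a fixed target set.

```python
# Toy 1-nearest-neighbor illustration of induction vs. transduction.
labeled = [(1.0, 'A'), (2.0, 'A'), (8.0, 'B'), (9.0, 'B')]  # (x, label) training cases

def induce(train):
    """Inductive learning: return a general rule for the entire input space."""
    def rule(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]   # label of nearest neighbor
    return rule

def transduce(train, targets):
    """Transductive learning: produce labels only for the predefined target set."""
    rule = induce(train)                 # here we simply reuse 1-NN on each target
    return {x: rule(x) for x in targets}

rule = induce(labeled)
print(rule(3.0))                         # 'A': the rule applies to any new point
print(transduce(labeled, [3.0, 7.5]))    # {3.0: 'A', 7.5: 'B'}: only these points are labeled
```

In this toy version the transductive learner just reuses the inductive rule; a genuine transductive method would additionally exploit the distribution of the unlabeled target points themselves.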
Figure 2.1 Learning methods. (a) Supervised learning, with e_p = ŷ_p − y_p. (b) Unsupervised learning. (c) Reinforcement learning.
Multitask learning improves the generalization performance of learners by leveraging the domain-specific information contained in related tasks [30]. Multiple related tasks are learned simultaneously using a shared representation; in effect, the training signals for the extra tasks serve as an inductive bias [30]. In order to learn accurate models for rare cases, it is desirable to use data and knowledge from similar cases; this is known as transfer learning. Transfer learning is a general method for speeding up learning. It exploits the insight that generalization may occur not only within tasks, but also across tasks. The core idea of transfer is that experience gained in learning to perform one source task can help improve learning performance in a related, but different, target task [155]. Transfer learning is related in spirit to case-based and analogical learning. A theoretical analysis based on an empirical Bayes perspective shows that the number of labeled examples required for learning with transfer is often significantly smaller than that required for learning each target independently [155].
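The shared-representation idea can be sketched as hard parameter sharing, the simplest multitask architecture. This is an illustrative construction, not the method of [30] or [155]: two related regression tasks are trained through one shared hidden layer, so each task's error signal also shapes the representation used by the other task. The data, dimensions, and learning rate are arbitrary.

```python
import numpy as np

# Multitask learning by hard parameter sharing: both tasks read the same
# hidden representation h = tanh(W x); each task has its own output head.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y_a = X @ np.array([1.0, -1.0, 0.0, 0.0])   # task A target
y_b = X @ np.array([1.0, -1.0, 2.0, 0.0])   # related task B target

W = rng.normal(scale=0.1, size=(4, 8))      # shared representation weights
head_a = np.zeros(8)                        # task-specific output heads
head_b = np.zeros(8)
lr = 0.05

for epoch in range(5000):
    H = np.tanh(X @ W)                      # shared hidden layer
    e_a = H @ head_a - y_a                  # task-wise error signals
    e_b = H @ head_b - y_b
    head_a -= lr * H.T @ e_a / n            # each head sees only its own task
    head_b -= lr * H.T @ e_b / n
    # The shared weights receive the summed training signals of both tasks,
    # so the extra task acts as an inductive bias on the representation.
    delta = (np.outer(e_a, head_a) + np.outer(e_b, head_b)) * (1.0 - H**2)
    W -= lr * X.T @ delta / n

H = np.tanh(X @ W)
mse_a = np.mean((H @ head_a - y_a) ** 2)    # far below var(y_a) after training
```

The design choice shown here (one shared trunk, per-task heads) is the standard way to realize a shared representation; soft sharing via regularization between per-task networks is an alternative.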
Supervised learning
Supervised learning adjusts network parameters by a direct comparison between the actual network output and the desired output. It is a closed-loop feedback system, in which the error is the feedback signal. The error measure, which shows the difference between the network output and the output from the training samples, is used to guide the learning process. The error measure is usually defined by the mean squared error (MSE)
$$E = \frac{1}{N} \sum_{p=1}^{N} \left\| y_p - \hat{y}_p \right\|^2, \qquad (2.1)$$
where N is the number of pattern pairs in the sample set, y_p is the output part of the pth pattern pair, and ŷ_p is the network output corresponding to pattern pair p. The error E is calculated anew after each epoch. The learning process is terminated when E is sufficiently small or a failure criterion is met. To decrease E toward zero, a gradient-descent procedure is usually applied. The gradient-descent method always converges to a local minimum in a neighborhood of the initial solution of the network parameters. The LMS and BP algorithms
are the two most popular gradient-descent-based algorithms. Second-order methods are based on computation of the Hessian matrix. Multiple-instance learning [46] is a variation of supervised learning. In multiple-instance learning, the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. A unified theoretical analysis for multiple-instance learning and a PAC learning algorithm are introduced in [122]. Deductive reasoning starts from a cause to deduce the consequence or effects. Inductive reasoning allows us to infer possible causes from the consequence. Inductive learning is a special class of supervised learning techniques, where, given a set of { x_i
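The bag-labeling rule of multiple-instance learning (the bag label as the Boolean OR of its instance labels) can be sketched as follows. The instance-level classifier here is a hypothetical threshold rule invented for the example, not the algorithm of [46] or [122]:

```python
# Multiple-instance learning setup: each example is a *bag* of instances,
# and the bag is positive iff at least one instance is positive (Boolean OR).

def instance_label(x, threshold=0.5):
    """Hypothetical instance-level classifier: a simple threshold rule."""
    return x > threshold

def bag_label(bag):
    """The bag label is the OR of its (typically hidden) instance labels."""
    return any(instance_label(x) for x in bag)

bags = [
    [0.1, 0.2, 0.9],   # positive: one instance exceeds the threshold
    [0.1, 0.3, 0.4],   # negative: no positive instance
    [0.7, 0.8, 0.2],   # positive
]
labels = [bag_label(bag) for bag in bags]
print(labels)  # [True, False, True]
```

The learning problem is harder than it looks from this sketch: only the bag labels are observed, so the learner must infer which instances inside a positive bag are responsible for the label.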