Intelligent Systems
2
Keywords
Artificial intelligence vs. computational
intelligence. Neural networks. Knowledge
representations. Machine learning. Rule-based
systems. Fuzzy Systems. Evolutionary
computation.
3
Syllabus
An introduction to the goals and objectives
of AI as a discipline and its milestones.
Approaches in AI. Major components in
intelligent systems.
Methods of knowledge acquisition and
representations. Associative memory.
Techniques on machine learning such as
supervised learning, unsupervised learning,
reinforcement learning, and deep learning.
Generalization.
Syllabus (cont’d)
Basic concepts of graph and tree search.
Optimization methods such as stochastic
annealing, neurodynamic optimization,
genetic algorithm, particle swarm
optimization, ant colony optimization, and
differential evolution.
5
References
R. Rojas, Neural Networks: A Systematic Introduction, Springer, 1996.
S. Haykin, Neural Networks and Learning Machines (3rd ed.), Prentice-Hall, 2009.
S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (3rd ed.), Prentice-Hall, 2009.
C.-T. Lin and C. S. G. Lee, Neural Fuzzy Systems, Prentice-Hall, 1996.
Intelligence
“Intelligence is a mental quality that
consists of the abilities to learn from
experience, adapt to new situations,
understand and handle abstract
concepts, and use knowledge to
manipulate one’s environment.”
Britannica
7
Definition of Intelligent Systems
A system is an intelligent system if
it exhibits some intelligent
behaviors.
8
Intelligent Behaviors
Inference: Deduction vs. Induction
(generalization); e.g., judgment and pattern
recognition
Learning and adaptation: Evolutionary
processes; e.g., learning from examples
Creativity: e.g., planning and design
9
Milestones of Intelligent System Development
1940s: Cybernetics by Wiener
1943: Threshold logic networks by McCulloch and Pitts
1950s-1960s: Perceptrons by Rosenblatt
1960s: Adaline by Widrow
1970s: AI and expert systems
1970s: Fuzzy logic by L. Zadeh
1974: Backpropagation algorithm by P. Werbos
1970s: Adaptive resonance theory by S. Grossberg
1970s: Self-organizing maps by T. Kohonen
1980s: Hopfield networks by J. Hopfield
1980s: Genetic algorithms by J. Holland
1980s: Simulated annealing by Kirkpatrick et al.
2006-: Deep learning by Hinton et al.
Milestones (cont’d)
Professor Warren S. McCulloch
Professor Bernard Widrow
Professor Lotfi A. Zadeh
Dr. Paul J. Werbos
Professor Teuvo Kohonen
Professor John J. Hopfield
Professor Geoffrey E. Hinton
Engineering Applications of
Intelligent Systems
Pattern recognition: e.g., image processing,
pattern analysis, speech recognition, etc.
Control and robotics: e.g., modeling and
estimation
Associative memory (content-addressable
memory)
Forecasting: e.g., in financial engineering
21
Computational Intelligence
Coined by the IEEE Neural Networks Council in 1994.
Represents a new generation of intelligent systems.
Consists of neural networks, fuzzy logic, and evolutionary computation techniques (e.g., genetic algorithms).
24
Soft Computing
“Soft computing based on
computational intelligence should
be the basis for the conception,
design, deployment of intelligent
systems rather than hard
computing based on artificial
intelligence.”
Lotfi Zadeh
CS5486 28
What are Neural Networks?
CS5486 31
Components of Neural Networks
A number of artificial neurons (also known
as nodes, processing units, or computational
elements)
Massive inter-neuron connections with
different strengths (also known as synaptic
weights).
Input and output channels
CS5486 32
Formalization of Neural Networks
CS5486 33
Architecture of Neural Networks
ARCH = (u, v, w, x, y)
Simple, alike neurons represented by u and v in N-dimensional space
Inter-neuron connection weights represented by w in M-dimensional space
External inputs and outputs represented respectively by x and y in n- and m-dimensional spaces
CS5486 34
Model of Neurons
Biological neurons: 10^10–10^11
Highly simplified
Firing activities are quantified by using state variables (also called activation states)
The net input to a neuron is usually a weighted sum of state variables from other neurons, input and/or output variables
The net input to a neuron usually goes through a nonlinear transformation called activation
Connections between Neurons
Adaptive: Synaptic connections with
adjustable weights
Excitatory (positive weight) vs.
inhibitory (negative weight)
Distributed knowledge representation,
different from digital computers
CS5486 36
Rules of Neural Networks
RULE = (E, F, G, H, L)
E: Evaluation rule mapped from v and/or y to a real
line; e.g., error function or energy function
F: Activation rule mapped from u to v; e.g.,
activation function
G: Aggregation rule mapped from v, w, and/or x to u; e.g., weighted sum
H: Output rule mapped from v to y; y is usually a subset of v
L: Learning rule mapped from v, w, and x to w, usually iterative
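As a minimal illustration of these rules (not from the slides), the Python sketch below assumes a weighted-sum aggregation rule G, a unipolar sigmoid activation rule F, a squared-error evaluation rule E, and the identity output rule H = v; all names and values are illustrative.

import numpy as np

def aggregate(w, x):
    """G: aggregation rule - weighted sum of the inputs (net input u)."""
    return w @ x

def activate(u):
    """F: activation rule - unipolar sigmoid mapping net inputs u to states v."""
    return 1.0 / (1.0 + np.exp(-u))

def evaluate(y, z):
    """E: evaluation rule - squared-error function of outputs y and targets z."""
    return 0.5 * np.sum((z - y) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))          # connection weights (3 neurons, 2 inputs)
x = np.array([0.5, -1.0])            # external input
v = activate(aggregate(w, x))        # neuron states; here the output rule H gives y = v
print("states:", v, "error:", evaluate(v, np.array([1.0, 0.0, 1.0])))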
Learning in Neural Networks
Goal: To improve performance
Means: interact with environment
A process by which the adaptable
parameters of an ANN are adjusted through an
iterative process of stimulation by the
environment in which the ANN is
embedded
Supervised vs. unsupervised
CS5486 38
On Learning
Continuous-time:
dw(t)/dt = L(v, w, x)
Two-Time Scale Dynamics
in Neural Networks
CS5486 41
Categories of Neural Networks
CS5486 42
Definition of Neural Networks
CS5486 43
Features of Neural Networks
Resemble the brain in two aspects:
1. Knowledge acquisition: knowledge is
acquired by neural networks thru learning
processes.
2. Knowledge representation: Inter-neuron
connections, known as synaptic weights are
used to store acquired knowledge
CS5486 44
Properties of Neural Networks
1 Nonlinearity
2 Input-output mapping
3 Adaptivity
4 Contextual information
5 Fault tolerance
6 Hardware implementability
7 Uniformity of analysis and design
8 Neurobiological analogy and plausibility
CS5486 45
McCulloch-Pitts Neuron
CS5486 46
McCulloch-Pitts Neurons
Binary values {0, 1}
Unit connection weights of +1 and –1
If any input to a neuron is 1 and the associated weight is –1, then the output of the neuron is 0
Otherwise, if the weighted sum of the inputs is not less than the threshold, the output is 1; if it is less than the threshold, the output is 0
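A minimal sketch of this rule (illustrative, not from the slides): a single McCulloch-Pitts unit realizes AND and OR simply by choosing the threshold.

def mcculloch_pitts(inputs, weights, threshold):
    """McCulloch-Pitts unit: weights are +1 (excitatory) or -1 (inhibitory)."""
    # Any active input with weight -1 forces the output to 0 (absolute inhibition).
    if any(x == 1 and w == -1 for x, w in zip(inputs, weights)):
        return 0
    s = sum(x * w for x, w in zip(inputs, weights) if w == 1)
    return 1 if s >= threshold else 0

for a in (0, 1):
    for b in (0, 1):
        and_out = mcculloch_pitts([a, b], [1, 1], threshold=2)  # AND: both inputs needed
        or_out = mcculloch_pitts([a, b], [1, 1], threshold=1)   # OR: at least one input
        print(a, b, "AND:", and_out, "OR:", or_out)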
Threshold Logic
AND & OR Gates
Other Logic Functions
Decoders
Another Decoder
Threshold Logic Units
Proposition: Any logical function F: {0,1}^n → {0,1} can be implemented with a two-layer McCulloch-Pitts network.
CS5486 53
Weighted Connections
CS5486 54
A Recurrent Network
CS5486 55
Finite Automata
An automaton is an abstract device capable of
assuming different states which change
according to the received input and previous
states.
A finite automaton can take only a finite set of
possible states and can react to only a finite
set of input signals.
CS5486 56
Finite Automata & Recurrent Networks
CS5486 57
Perceptron
A single adaptive layer of feedforward
network of pure threshold logic units.
Developed by Rosenblatt at Cornell University in the late 1950s.
Trained for pattern classification.
First working model implemented in
electronic hardware.
CS5486 58
Perceptron
Simple Perceptron
Perceptron Convergence Algorithm
Adapt weights: for all p,
w_i(t+1) = w_i(t) + η (z^p − y^p) x_i^p,  η > 0
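A minimal sketch of this update rule (illustrative, not the slides' code), training a single threshold unit on the OR data; the learning rate and bias handling are assumptions.

import numpy as np

def train_perceptron(X, z, eta=0.2, epochs=20):
    """Perceptron learning: w(t+1) = w(t) + eta*(z - y)*x, with x padded by 1 for the threshold."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # constant input plays the role of the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, z):
            y = 1 if w @ x >= 0 else 0              # hard-limiter output
            w += eta * (target - y) * x             # changes only for misclassified patterns
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
z = np.array([0, 1, 1, 1])                          # OR function (linearly separable)
w = train_perceptron(X, z)
print("weights:", w)
print("outputs:", [(1 if w @ np.append(x, 1) >= 0 else 0) for x in X])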
A Simple Case
Error Landscape
Learning Process
Threshold
Perceptron Convergence Theorem
Two-Variable Logic Functions
2D OR and AND
3D Logic Function
OR Function
Majority Function
Duality
2D Weight Space
XOR Problem
Limitations of Perceptrons
Bipolar vs. Unipolar State Variables
Unipolar: v ∈ {0, 1}
Bipolar: v ∈ {−1, 1}
Bipolar coding of state variables is better
than unipolar (binary) one in terms of
algebraic structure, region proportion
in weight space, etc.
CS5486 81
Monte Carlo Tests
CS5486 82
ADALINE
A single adaptive layer of feedforward
network of linear elements.
Full name: Adaptive linear elements.
Developed by Widrow and Hoff at Stanford
University in early 60’s.
Trained using a learning algorithm called
Delta Rule or Least Mean Squares (LMS)
Algorithm.
CS5486 83
LMS Learning Algorithm
3) Adapt weights: w_i(t+1) = w_i(t) + η Σ_p (z^p − y^p) x_i^p,  η > 0
4) Repeat until w converges
Gradient Descent Learning Algorithms
E = Σ_p E_p = Σ_p (z^p − y^p)²
CS5486 85
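A small sketch of batch LMS (delta-rule) training of a linear ADALINE-style unit under this error function (illustrative; the data and learning rate are assumptions).

import numpy as np

def train_lms(X, z, eta=0.05, epochs=200):
    """Batch LMS: w <- w + eta * sum_p (z_p - y_p) x_p for a linear unit."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # constant input for the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        y = Xb @ w                              # linear outputs for the whole batch
        w += eta * Xb.T @ (z - y)               # gradient step on E = sum_p (z_p - y_p)^2 / 2
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
z = np.array([1.0, 3.0, 5.0, 7.0])              # targets from z = 2x + 1 (noise-free)
print("learned slope and bias:", train_lms(X, z))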
Training Modes
Sequential mode: input training sample pairs one by one, in a fixed or random order.
Batch mode: input training sample pairs in
the whole training set at each iteration.
Perceptron learning: either sequential or
batch mode.
ADALINE training: batch mode only.
CS5486 86
Perceptron vs. Adaline
Architecture: Perceptron uses bipolar or
unipolar hardlimiter activation function,
Adaline uses linear activation function.
Learning rule: Perceptron learning
algorithm is not gradient-descent and can
operate in either sequential or batch training
mode, whereas Adaline learning (LMS)
algorithm is gradient descent, but can only
operate in batch mode.
CS5486 87
Weight Space Regions
Separated by Hyperplanes
Each plane is defined by one training
sample.
One plane separates two (2) half-spaces.
Two planes separate up to four (4) regions.
Three planes separate up to eight (8)
regions.
However, four planes separate up to
fourteen (14) regions only.
CS5486 88
Number of Weight Space Regions
R(m, n) = 2 Σ_{i=0}^{n−1} C(m−1, i) ≤ 2 m^n / n!
CS5486 89
Number of Weight Space Regions
CS5486 90
Number of Logic Functions vs.
Number of Threshold Functions
The number of threshold functions defined by hyperplanes grows as 2^{n(n−1)}, whereas the number of logical functions of n variables is 2^{2^n}.
CS5486 91
Learnability Problems
Solution existence in the weight space? Neither
Perceptron nor Adaline can classify patterns
with nonlinear distributions such as XOR. But
two-layer Perceptron can classify XOR data.
How to find the solution even if it exists in the weight space? It is known that multilayer Perceptrons can classify arbitrarily shaped data classes, but how can learning algorithms be designed to determine the weights?
CS5486 92
Multilayer Feedforward Network
XOR Problem Solved
Alternative Networks
Learning Process
Alternative Network
Two-layer Separation
Backpropagation Algorithm
Also known as generalized delta rule.
Invented and reinvented by many researchers,
popularized by the PDP group at UC San Diego in
1986.
A recursive gradient-descent learning algorithm for multilayer feedforward networks with sigmoid activation functions.
Computes errors backward from the output layer to the input layer.
Minimizes the mean squared error function.
CS5486 101
Sigmoid Activation Functions
CS5486 102
Sigmoid Activation Functions
Unipolar: f(u) = 1 / (1 + exp(−u))
df(u)/du = exp(−u) / (1 + exp(−u))² = [1/(1 + exp(−u))] · [exp(−u)/(1 + exp(−u))] = f(u)[1 − f(u)]
Bipolar: f(u) = (1 − exp(−2u)) / (1 + exp(−2u)) = tanh(u)
Related Activation Functions
CS5486 104
Backpropagation Algorithm
(cont’d)
Error function:
E = Σ_p E_p = (1/2) Σ_p Σ_{i=1}^{m} (z_i^p − y_i^p)²
General formula:
Δw_ij(t) = w_ij(t+1) − w_ij(t) = −η ∂E/∂w_ij = −η Σ_p ∂E_p/∂w_ij
Backpropagation Algorithm
(cont’d)
Output layer l:
∂E_p/∂w_ij^l = (∂E_p/∂y_i^p)(∂y_i^p/∂u_i^l)(∂u_i^l/∂w_ij^l) = −(z_i^p − y_i^p) y_i^p (1 − y_i^p) v_j^{l−1} = −δ_i^l v_j^{l−1},
where δ_i^l = (z_i^p − y_i^p) y_i^p (1 − y_i^p)
Backpropagation Algorithm
(cont’d)
Hidden layer l−1:
∂E_p/∂v_i^{l−1} = Σ_j (∂E_p/∂u_j^l)(∂u_j^l/∂v_i^{l−1}) = −Σ_j δ_j^l w_ji^l
δ_i^{l−1} = −(∂E_p/∂v_i^{l−1})(∂v_i^{l−1}/∂u_i^{l−1}) = (Σ_j δ_j^l w_ji^l) v_i^{l−1}(1 − v_i^{l−1})
∂E_p/∂w_ij^{l−1} = (∂E_p/∂u_i^{l−1})(∂u_i^{l−1}/∂w_ij^{l−1}) = −δ_i^{l−1} v_j^{l−2}
Backpropagation Algorithm
(cont’d)
Input layer 1:
δ_i^1 = (Σ_j δ_j^2 w_ji^2) v_i^1 (1 − v_i^1)
∂E_p/∂w_ij^1 = (∂E_p/∂u_i^1)(∂u_i^1/∂w_ij^1) = −δ_i^1 x_j^p
Backpropagation Algorithm
(cont’d)
1) Initialize weights and threshold randomly.
2) Calculate actual output of the MLP:
3) Adapt weights for all layers:
w_ij(t+1) = w_ij(t) − η Σ_p ∂E_p/∂w_ij
4) Repeat until w converges
CS5486 109
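A compact sketch of these steps (illustrative, not the slides' code) for one hidden layer of unipolar sigmoid units trained on XOR; the network size, learning rate, and initialization are assumptions.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Z = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)           # input -> hidden weights
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)           # hidden -> output weights
eta = 0.5

for _ in range(20000):
    V1 = sigmoid(X @ W1 + b1)                            # forward pass: hidden states
    Y = sigmoid(V1 @ W2 + b2)                            # outputs
    delta2 = (Z - Y) * Y * (1 - Y)                       # output-layer deltas
    delta1 = (delta2 @ W2.T) * V1 * (1 - V1)             # hidden-layer deltas (backpropagated)
    W2 += eta * V1.T @ delta2; b2 += eta * delta2.sum(axis=0)
    W1 += eta * X.T @ delta1;  b1 += eta * delta1.sum(axis=0)

# With enough iterations this typically converges to approximately [0, 1, 1, 0].
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel(), 2))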
Local Minima
CS5486 110
Momentum Term
To avoid oscillation, a momentum term is sometimes added:
Δw_ij(t) = −η ∂E/∂w_ij + α Δw_ij(t−1),  0 < α < 1
CS5486 111
Kolmogorov Theorem
Let f: [0,1]^n → [0,1] be a continuous function. There exist functions of one argument g and h_j for j = 1, 2, …, 2n+1, and constants w_i for i = 1, 2, …, n, such that
f(x_1, x_2, …, x_n) = Σ_{j=1}^{2n+1} g[ Σ_{i=1}^{n} w_i h_j(x_j) ].
CS5486 112
Universal Approximators
Multilayer feedforward neural networks are
universal approximators of continuous
functions.
A set of weights exists such that the approximation error can be arbitrarily small.
However, the BP algorithm is not
guaranteed to find such a set of weights.
CS5486 113
Overfitting Problem
CS5486 114
Radial Basis Functions
A radial basis function (RBF) is a real-
valued function whose value depends only
on the distance from its origin or center.
Related to kernel theory in statistical
learning.
Used as the means for approximating or
interpolating multivariate functions.
CS5486 115
Radial Basis Functions
y = Σ_{i=1}^{h} w_i φ(‖x − c_i‖)
Radial Basis Function Networks
Proposed first by Broomhead and Lowe in
1988.
A linear combination of a number of radial basis functions that play the role of hidden neurons.
Two-layer architecture: its output layer uses a linear activation function, as in ADALINE, and its hidden layer uses radial basis activation functions.
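A short sketch of this architecture (illustrative, not from the slides): Gaussian hidden units with fixed centers and a linear output layer fitted by least squares, applied to the XOR mapping revisited a few slides below. The centers and width are assumptions.

import numpy as np

def rbf_features(X, centers, sigma=1.0):
    """Hidden layer: Gaussian radial basis functions of the distance to each center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
z = np.array([0, 1, 1, 0], dtype=float)                    # XOR targets
centers = np.array([[0, 0], [1, 1]], dtype=float)          # one RBF per same-class corner

Phi = np.hstack([rbf_features(X, centers), np.ones((4, 1))])   # add a bias column
w, *_ = np.linalg.lstsq(Phi, z, rcond=None)                    # linear output layer (least squares)
print("outputs:", np.round(Phi @ w, 3))                        # close to the XOR targets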
Radial Basis Function Networks
CS5486 119
Cover’s Theorem
Cover’s Theorem (1965):
A dichotomy {X+, X−} is said to be φ-separable if there exists an m-dimensional vector w such that
– w^T φ(x) ≥ 0, if x ∈ X+
– w^T φ(x) < 0, if x ∈ X−
– The hyperplane defined by w^T φ(x) = 0 is the separating surface between the two classes.
CS5486 120
XOR Problem Revisited
CS5486 121
XOR Problem Revisited
Using a pair of Gaussian RBFs, the input
patterns are mapped onto the φ1- φ2 plane
and become linearly separable.
φ_1(x) = e^{−‖x − c_1‖²},   φ_2(x) = e^{−‖x − c_2‖²}
CS5486 122
XOR Problem Revisited
CS5486 123
Functional Link Network
Proposed by Yoh-Han Pao at CWRU in the late 1980s
One-layer feedforward architecture
Higher-order aggregation rule
Fast learning process
Local minima could be eliminated
Many success stories in applications
CS5486 124
Functional Link Network
CS5486 125
Extreme Learning Machine
Proposed by Guangbin Huang at NTU in the mid-2000s
One-layer feedforward architecture
Random connection weights from inputs to
hidden neurons
Fast learning process for weights in output
layer.
Local minima eliminated
CS5486 126
Support Vector Machine
V. Vapnik
CS5486 127
Linear Discriminant Function
g(x) is a linear function: g(x) = w^T x + b
A hyperplane in the feature space, with w^T x + b > 0 on one side and w^T x + b < 0 on the other
(Unit-length) normal vector of the hyperplane: n = w / ‖w‖
CS5486 128
Linear Discriminant Function
How would you classify these data using a linear discriminant function in order to minimize the error?
Infinite number of answers!
(Figure legend: one marker denotes +1, the other denotes −1.)
CS5486 129
Linear Discriminant Function
Which one is the best, with maximal generalization power?
CS5486 130
Maximal Margin Classifier
The linear discriminant function (classifier) with the maximum margin is the best.
The margin is defined as the width by which the boundary could be increased before hitting a data point.
Why is it the best? It is robust to outliers and thus has strong generalization ability.
CS5486 131
Maximal Margin Classifier
Given a set of data {(x_i, y_i)}, i = 1, 2, …, n, where
For y_i = +1, w^T x_i + b > 0
For y_i = −1, w^T x_i + b < 0
After rescaling w and b:
For y_i = +1, w^T x_i + b ≥ 1
For y_i = −1, w^T x_i + b ≤ −1
CS5486 132
Maximal Margin Classifier
Let x+ and x− be support vectors on the two margin boundaries:
w^T x+ + b = 1
w^T x− + b = −1
The margin width is:
M = (x+ − x−) · n = (x+ − x−) · w / ‖w‖ = 2 / ‖w‖
CS5486 133
Maximal Margin Classifier
Formulation:
maximize 2 / ‖w‖
such that
For y_i = +1, w^T x_i + b ≥ 1
For y_i = −1, w^T x_i + b ≤ −1
CS5486 134
Maximal Margin Classifier
Formulation:
minimize (1/2) ‖w‖²
such that
For y_i = +1, w^T x_i + b ≥ 1
For y_i = −1, w^T x_i + b ≤ −1
CS5486 135
Maximal Margin Classifier
Formulation:
minimize (1/2) ‖w‖²
subject to
y_i (w^T x_i + b) ≥ 1
CS5486 136
Quadratic Programming Problem
Quadratic programming with linear constraints:
minimize (1/2) ‖w‖²
s.t. y_i (w^T x_i + b) ≥ 1
Lagrangian function:
minimize L_p(w, b, α_i) = (1/2) ‖w‖² − Σ_{i=1}^{n} α_i [y_i (w^T x_i + b) − 1]
s.t. α_i ≥ 0
CS5486 137
Quadratic Programming Problem
minimize L_p(w, b, α_i) = (1/2) ‖w‖² − Σ_{i=1}^{n} α_i [y_i (w^T x_i + b) − 1],  s.t. α_i ≥ 0
∂L_p/∂w = 0  ⟹  w = Σ_{i=1}^{n} α_i y_i x_i
∂L_p/∂b = 0  ⟹  Σ_{i=1}^{n} α_i y_i = 0
CS5486 138
Quadratic Programming Problem
minimize L_p(w, b, α_i) = (1/2) ‖w‖² − Σ_{i=1}^{n} α_i [y_i (w^T x_i + b) − 1],  s.t. α_i ≥ 0
Lagrangian dual problem:
maximize Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
s.t. α_i ≥ 0, and Σ_{i=1}^{n} α_i y_i = 0
CS5486 139
Quadratic Programming Problem
From the KKT conditions, we know:
α_i [y_i (w^T x_i + b) − 1] = 0
so α_i ≠ 0 only for the support vectors, and
w = Σ_{i=1}^{n} α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i
g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b
CS5486 141
Soft Constraints
What to do if the data are not linearly separable (noisy data, outliers, etc.)?
Slack variables ξ_i can be added to allow misclassification of nonlinearly distributed or noisy data.
Soft Constraints
Problem formulation:
minimize (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i
subject to
y_i (w^T x_i + b) ≥ 1 − ξ_i
ξ_i ≥ 0
Parameter C is a weight that trades off margin maximization against misclassification minimization.
CS5486 143
Maximal Margin Classifier
Problem reformulation (Lagrangian dual problem):
maximize Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
subject to
0 ≤ α_i ≤ C
Σ_{i=1}^{n} α_i y_i = 0
CS5486 144
Nonlinear SVM
Datasets that are linearly separable (possibly with noise) work out great.
But what to do if the dataset is nonlinearly distributed?
How about mapping the data to a higher-dimensional space?
Feature Space
General idea: the original input space can be mapped
to a higher-dimensional feature space where the
training set is linearly separable
feature map
Φ: x → φ(x)
CS5486 146
The Kernel Trick
With this mapping, the discriminant function is expressed through inner products in the feature space, which define the kernel:
K(x_i, x_j) = φ(x_i)^T φ(x_j)
CS5486 147
An Example
2-dimensional vectors x = [x1, x2];
K(x_i, x_j) = (1 + x_i^T x_j)²
= 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
= [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]^T [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
= φ(x_i)^T φ(x_j), where φ(x) = [1, x1², √2 x1 x2, x2², √2 x1, √2 x2]
CS5486 148
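A quick numerical check of this identity (illustrative, not from the slides): the polynomial kernel evaluated directly matches the inner product of the explicit feature vectors; the sample inputs are arbitrary.

import numpy as np

def poly_kernel(xi, xj):
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2 dimensions."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, x1 ** 2, r2 * x1 * x2, x2 ** 2, r2 * x1, r2 * x2])

xi = np.array([1.0, 2.0])
xj = np.array([-0.5, 3.0])
print(poly_kernel(xi, xj), phi(xi) @ phi(xj))   # the two numbers agree (42.25)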
Common Kernel Functions
Linear kernel: K(x_i, x_j) = x_i^T x_j
Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p
Gaussian (radial-basis function, RBF) kernel: K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
Sigmoid kernel
With any kernel, the discriminant function is g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b, with Σ_{i=1}^{n} α_i y_i = 0
CS5486 152
Design Issues
Choice of kernel
- Gaussian or polynomial kernel is default
- if ineffective, more elaborate kernels are needed
- domain experts can give assistance in formulating
appropriate similarity measures
Choice of kernel parameters
- e.g. σ in Gaussian kernel
- σ is the distance between closest points with different
classifications
- In the absence of reliable criteria, applications rely on the use
of a validation set or cross-validation to set such parameters.
Optimization criterion – hard margin vs. soft margin
- a lengthy series of experiments in which various parameters
are tested
CS5486 153
SVM Summary
1. Maximal Margin Classifier
– Better generalization ability & less over-fitting
CS5486 155
Support Vector Regression
Problem reformulation
CS5486 156
Support Vector Regression
CS5486 157
Support Vector Regression
CS5486 158
Least-squares SVM
Proposed by Suykens and Vandewalle at
KUL in 1999.
Let
Primal problem formulation
CS5486 159
Least-squares SVM
Lagrangian
CS5486 160
Least-squares SVM
Final model
CS5486 161
Two-spiral Classification
CS5486 162
Short-term Prediction
CS5486 163
MAXNET
A sub-network for selecting the input with
maximum value - winner takes all.
By means of mutual inhibition, a MAXNET keeps the maximal input and suppresses the rest.
It is often used as the output layer in some existing neural networks.
CS5486 168
MAXNET
A recurrent neural network with self-excitatory connections and lateral inhibitory connections.
The weight of the self-excitatory connections is 1.
The weight of the lateral inhibitory connections is −w, where w < 1/m and m is the number of output neurons.
CS5486 169
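A minimal sketch of these dynamics (illustrative, not the slides' code): each neuron keeps its own activity and subtracts w times the activities of the others until only the maximum survives; the choice of w and the stopping rule are assumptions.

import numpy as np

def maxnet(x, w=None, max_iters=100):
    """Winner-take-all by mutual inhibition: self weight 1, lateral weights -w with w < 1/m."""
    v = np.array(x, dtype=float)
    m = len(v)
    w = 1.0 / (m + 1) if w is None else w          # any w < 1/m works
    for _ in range(max_iters):
        total = v.sum()
        v = np.maximum(0.0, v - w * (total - v))   # v_i <- max(0, v_i - w * sum_{j != i} v_j)
        if np.count_nonzero(v) <= 1:
            break
    return v

print(maxnet([0.3, 0.9, 0.5, 0.7]))   # only the neuron with the largest input stays positive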
kWTA Model
Desirable Properties
The kWTA model with Heaviside activation
function has been proven to be globally
stable and globally convergent to the kWTA
solutions in finite time.
Lower and upper bounds of the convergence time have been derived.
Simulation Results
Randomized Integer Inputs
Simulation Results
Low-Resolution Inputs
Clustering
Birds of a feather flock together; people of a kind group together. (物以类聚，人以群分)
CS5486 174
Clustering
Clustering or cluster analysis is to group
similar data based on a given similarity
measure.
It is subjective without unique solutions.
It is done via unsupervised learning.
CS5486 175
ART1 Network
Invented by Stephen Grossberg at Boston
University in 1970’s.
Used to cluster binary data with an unknown number of clusters.
A two-layer recurrent neural network.
MAXNET serves as its output layer.
Bidirectional adaptive connections called
bottom-up and top-down connections.
CS5486 176
ART1 Architecture
top down
bottom up (normalised)
CS5486 177
ART1 for Clustering
1) Initialize weights: w_ij^td(0) = 1,  w_ij^bu(0) = 1/(1 + n)
2) Compute the net input for an input pattern x^p:
u_i^p = Σ_{j=1}^{n} w_ij^bu(t) x_j^p,  i = 1, 2, …, m
CS5486 179
Vigilance Parameter
in ART1 Network
Vigilance sets the granularity of clustering; it defines the basin of attraction of each cluster.
Low threshold (small vigilance, imprecise):
– Large mismatch accepted
– Few large clusters
High threshold (large vigilance, fragmented):
– Small mismatch accepted
– Many small clusters
– Higher precision
CS5486 180
Illustrative Example
CS5486 181
Hopfield Networks
Invented by John Hopfield at Princeton
University in 1980’s.
Used as associative memories or
optimization models.
Single-layer recurrent neural networks.
The discrete-time model uses bipolar threshold logic units, and the continuous-time model uses unipolar sigmoid activation functions.
Hopfield Networks
The Hopfield networks are the classical recurrent
neural networks.
John Hopfield, “Neural networks and physical
systems with emergent collective computational
abilities,” PNAS, USA, vol. 79, 1982.
CS5486 184
Discrete-Time Hopfield Network
u(t+1) = W v(t) + x
u_i(t+1) = Σ_j w_ij v_j(t) + x_i,  ∀i
v_i(t) = sgn(u_i(t)) ∈ {−1, 1}
CS5486 185
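A small sketch (illustrative, not from the slides) of these update equations used as an auto-associative memory: weights from the outer-product (Hebbian) rule, asynchronous sign updates until a fixed point; the stored patterns and iteration count are assumptions.

import numpy as np

rng = np.random.default_rng(2)
patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
n = patterns.shape[1]

# Outer-product (Hebbian) storage: W = sum_p s^p (s^p)^T with zero diagonal.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)

def recall(probe, iters=50):
    v = probe.copy().astype(float)
    for _ in range(iters):
        i = rng.integers(n)                      # asynchronous update of one random neuron
        u = W[i] @ v                             # net input u_i (external input x = 0 here)
        v[i] = 1.0 if u >= 0 else -1.0           # v_i = sgn(u_i)
    return v

noisy = patterns[0].copy()
noisy[0] *= -1                                   # flip one bit of the first pattern
print(recall(noisy).astype(int))                 # should match the first stored pattern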
Stability Analysis
E[v(t)] = −(1/2) Σ_i Σ_{j≠i} w_ij v_i(t) v_j(t) − Σ_i x_i v_i(t)
With symmetric weights and asynchronous updating of one neuron k at a time,
ΔE = E[v(t+1)] − E[v(t)] = −[v_k(t+1) − v_k(t)][Σ_i w_ik v_i(t) + x_k] = −Δv_k(t) u_k ≤ 0
CS5486 186
Stability Conditions
Stability: lim_{t→∞} v(t) = v̄
Sufficient conditions: for example, …
Hetero-Associations
Input pattern → desired output:
(x_1^p, x_2^p, x_3^p, x_4^p) → (y_1^p, y_2^p, y_3^p, y_4^p),  p = 1, …, 4
(Figure example: ID–name pairs such as Albert Einstein and Charles Kao.)
Retrieval (recall):
v(0) = s^q,  u(1) = W v(0) = Σ_{p=1}^{P} s^p (s^p)^T s^q = s^q (s^q)^T s^q + Σ_{p≠q} s^p (s^p)^T s^q
CS5486 198
Discrete-Time Hopfield Network
as an Associative Memory
If {s^p} is orthonormal, i.e., (s^p)^T s^q = 0 for p ≠ q, then the second term in the recall formula (cross-talk or noise) is zero.
If (s^q)^T s^q = ‖s^q‖² = n ≫ P, then v(1) = s^q.
If {s^p} is not orthonormal, recall can still succeed for a small number of stored patterns P.
CS5486 199
Illustrative Example
Stored memory patterns and recall results (figure)
Limitations
Very limited capacity: n / (2 log n), where n is the memory length
Many spurious states; e.g., −s^q
CS5486 201
Discrete-Time Hopfield Network
as an Optimization Model
Formulate the energy function according to
the objective function and constraints of a
given optimization problem.
E(v) = −(1/2) v^T W v − x^T v,  v ∈ {−1, 1}^n
Form a Hopfield network, then update the
states asynchronously until convergence.
Shortcoming: slow convergence due to asynchrony.
Bidirectional Associative Memories
(BAM)
Also known as hetero-associative memories
and resonance networks.
A generalization of auto-associative
memories.
Proposed by Bart Kosko of University of
Southern California in 1988.
Using bipolar signum activation functions.
CS5486 203
BAM Architecture
CS5486 204
Bidirectional Associative Memories
(BAM)
x ∈ R^n,  y(t) = sgn(W x(t))
y ∈ R^m,  x(t+1) = sgn(W^T y(t))
W = Σ_{p=1}^{P} y^p (x^p)^T
sgn(W x^q) = sgn(Σ_{p=1}^{P} y^p (x^p)^T x^q) = sgn(y^q ‖x^q‖²₂ + Σ_{p≠q} y^p (x^p)^T x^q) = y^q
sgn(W^T y^q) = sgn(Σ_{p=1}^{P} x^p (y^p)^T y^q) = sgn(x^q ‖y^q‖²₂ + Σ_{p≠q} x^p (y^p)^T y^q) = x^q
CS5486 205
Continuous-time Hopfield Network
du/dt = −u + W v + x
du_i/dt = −u_i + Σ_j w_ij v_j + x_i,  ∀i
v_i = 1 / (1 + exp(−u_i))
CS5486 206
Continuous-time Hopfield Network
Continuous-time
model [Hopfield,
PNAS, vol. 81,
1984]
CS5486 207
Continuous-time Hopfield Network
CS5486 208
Stability Analysis
E = −(1/2) Σ_i Σ_j w_ij v_i v_j + Σ_j ∫_0^{v_j} f^{-1}(v) dv − Σ_j x_j v_j
dE/dt = Σ_i (∂E/∂v_i)(dv_i/dt) = −Σ_i (Σ_j w_ij v_j + x_i − u_i)(dv_i/dt)
      = −Σ_i (du_i/dt)(dv_i/dt) = −Σ_i (df(u_i)/du_i)(du_i/dt)² ≤ 0,  since df(u_i)/du_i ≥ 0
CS5486 209
High Gain Unipolar Sigmoid
Activation Function
f(u) = 1 / (1 + exp(−λu)),   u = f^{-1}(v) = (1/λ) ln(v/(1−v))
∫_0^v f^{-1}(x) dx = (1/λ)[v ln v + (1−v) ln(1−v)]
If λ → ∞, then ∫_0^v f^{-1}(x) dx → 0
CS5486 210
Continuous-Time Hopfield Network
as an Optimization Model
Formulate the energy function according to
the objective function and constraints of a
given optimization problem.
E(v) = −(1/2) v^T W v − x^T v,  v ∈ [0, 1]^n
Synthesize a continuous-time Hopfield network; then an equilibrium state is a local minimum of the energy function.
CS5486 211
Traveling Salesman Problem
Find a complete route by visiting each city once
and only once.
Checking all possible routes: (N − 1)!/2
For example, N = 30 ⟹ 4.4 × 10^30 routes
If we compute 10^12 routes per second, we need 10^18 seconds, or 31,709,791,984 years!
In most practical problems N ≫ 30
A continuous Hopfield network can compute a
good solution to the TSP in a parallel and
distributed manner.
CS5486 212
Vertex Path Representation
CS5486 218
Constructing the Hopfield Network
CS5486 219
Constructing the Hopfield Network
=-C
CS5486 220
Constructing the Hopfield Network
If , then
corresponding term is
δ_ij = 1 if i = j, else 0
Similarly,
CS5486 221
Constructing the Hopfield Network
D term contributes an amount to the Lyapunov
function only when or
So
CS5486 222
Parameter Selection
Appropriate values for the parameters A,B,C,D
and must be determined
Tank and Hopfield used A = B =D =250 , C =
1000 and = 50
Tank and Hopfield applied it to random 10-city
maps and found that, overall, in about 50% of
cases, the method found the optimum route from
among the 181,440 distinct paths.
CS5486 223
Ten-City Example
CS5486 224
Optimal Route
CS5486 225
Local Minima
CS5486 226
Simulated Annealing
Annealing is a metallurgical process in
which a material is heated and then slowly
brought to a lower temperature to let
molecules assume optimal positions.
Simulated annealing simulates the physical
annealing process mathematically for global
optimization of nonconvex objective
function.
CS5486 227
Updating Probability
P_ΔE = exp(−ΔE / T), if ΔE > 0
P_ΔE = 1, if ΔE ≤ 0
where T > 0, T decreases over time, and lim_{t→∞} T = 0
The tangent of the probability function intersects the horizontal axis at ΔE = T
CS5486 228
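A short sketch of this acceptance rule (illustrative, not from the slides) minimizing a simple one-dimensional nonconvex energy; the proposal step and geometric cooling schedule are assumptions.

import math, random

random.seed(0)

def energy(x):
    return x ** 4 - 3 * x ** 2 + 0.5 * x        # nonconvex, with two minima

x, T = 4.0, 5.0
best = x
for _ in range(5000):
    candidate = x + random.gauss(0.0, 0.5)      # random neighbour
    dE = energy(candidate) - energy(x)
    # Accept always if the energy decreases, otherwise with probability exp(-dE/T).
    if dE <= 0 or random.random() < math.exp(-dE / T):
        x = candidate
    if energy(x) < energy(best):
        best = x
    T = max(1e-3, T * 0.999)                    # gradually lower the temperature
print("best x found:", round(best, 3), "energy:", round(energy(best), 3))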
Updating Probability
P_ΔE = 1 / (1 + exp(ΔE / T))
where T > 0, T decreases over time, and lim_{t→∞} T = 0
CS5486 230
Descending Energy
CS5486 231
Sample TSP Solution
CS5486 232
Characteristics of
Simulated Annealing
The higher the temperature, the higher the
probability of an energy increase.
As the temperature approaches zero, the simulated annealing procedure becomes an iterative improvement procedure.
The temperature parameter has to be lowered gradually to avoid premature convergence.
CS5486 233
Boltzmann Machine
A stochastic recurrent neural network
invented by G. Hinton (Univ. of Toronto)
and T. Sejnowski (Salk Institute) in 1983.
It has binary state variables v ∈ {−1, 1}^n with the energy function E(v) = −(1/2) Σ_i Σ_j w_ij v_i v_j − Σ_i x_i v_i
ΔE(v_i) := E(−v_i) − E(v_i) = (∂E/∂v_i)(−2v_i) = (Σ_j w_ij v_j + x_i) 2v_i = 2 u_i v_i
P(v_i → −v_i) = 1 / (1 + exp(ΔE(v_i)/T)) = 1 / (1 + exp(2 u_i v_i / T))
P(v_i = −1 → 1) = 1 / (1 + exp(−2 u_i / T)),   P(v_i = 1 → −1) = 1 / (1 + exp(2 u_i / T))
CS5486 235
Mean Field Annealing Network
A deterministic recurrent neural network.
Based on mean-field theory.
Continuous state variables on [-1, 1]n.
Uses a bipolar sigmoid activation function.
Use a gradual decreasing temperature
parameter like simulated annealing.
Used for combinatorial optimization.
CS5486 236
Mean Field Annealing Network
u_i = Σ_j w_ij v_j + x_i,   v_i = (1 − exp(−2u_i/T)) / (1 + exp(−2u_i/T)) = tanh(u_i / T)
T > 0, dT/dt < 0, lim_{t→∞} T = 0.  As T → 0, v_i → {−1, 1}
E(v) = −(1/2) Σ_i Σ_j w_ij v_i v_j − Σ_i x_i v_i,  v_i ∈ [−1, 1]
CS5486 239
SOM Architecture
CS5486 240
Kohonen’s Learning Algorithm
1. (Initialization) Randomize w_ij(0) for i = 1, 2, …, n; j = 1, 2, …, m; set p = 1, t = 0.
2. (Distance) For datum x^p, compute d_j = Σ_{i=1}^{n} [x_i^p − w_ij(t)]² for each output neuron j.
3. (Update) Adjust the winner and its neighbours with learning rate η, where 0 < η < 1 and dη/dt < 0; set p ← p + 1 and go to Distance.
CS5486 241
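A compact 1-D SOM sketch along these lines (illustrative, not the slides' code): winner selection by minimum distance, with the neighbourhood radius and learning rate shrinking over time; all constants and the data distribution are assumptions.

import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(0, 1, size=(500, 2))          # 2-D input data
m = 10                                           # number of map neurons on a 1-D chain
W = rng.uniform(0, 1, size=(m, 2))               # weight vectors w_j

for t, x in enumerate(data):
    eta = 0.5 * (1 - t / len(data))              # decreasing learning rate
    radius = max(1.0, 3.0 * (1 - t / len(data))) # shrinking neighbourhood
    d = ((x - W) ** 2).sum(axis=1)               # squared distances to all neurons
    winner = int(np.argmin(d))                   # best-matching unit
    for j in range(m):
        h = np.exp(-((j - winner) ** 2) / (2 * radius ** 2))   # neighbourhood function
        W[j] += eta * h * (x - W[j])             # move winner and neighbours toward x

print(np.round(W, 2))                            # weights spread over the input square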
Neighborhood in SOMs
CS5486 242
A Simple Example
CS5486 243
Kohonen’s Example
CS5486 244
CS5486 245
Echo State Network
Proposed by Herbert Jaeger and Harald Haas at Jacobs University in 2004.
Also called reservoir computing.
It is a recurrent neural network with sparse
connections and random weights among
hidden neurons.
CS5486 246
ESN Architecture
(Figure: input units feeding a recurrent "dynamical reservoir", which drives the output units)
CS5486 247
State Equations
u(n) = (u_1(n), …, u_K(n))′
x(n) = (x_1(n), …, x_N(n))′
y(n) = (y_1(n), …, y_L(n))′
W^in = (w_ij^in),  W = (w_ij), …
CS5486 250
Membership Functions
CS5486 251
Membership Functions
Temp: {Freezing, Cool, Warm, Hot}
Degree of truth or "membership"
(Figure: membership functions for Freezing, Cool, Warm, and Hot over temperature from 10 to 110 °F)
Membership Functions
CS5486 253
Fuzzy Set
Fuzzy set A is the set of all pairs (x, uA(x)) where x
belongs to X; i.e.,
A = {(x, u_A(x)) | x ∈ X}
If X is discrete, A = Σ_i u_A(x_i) / x_i
If X is continuous, A = ∫_X u_A(x) / x
(Figure: an example membership function over age)
Fuzzy Set
CS5486 256
Fuzzy Set Terminology
Fuzzy singleton: a fuzzy set whose support contains only a single point x with u_A(x) = 1.
Crossover point: x ∈ X such that u_A(x) = 0.5
Kernel of a fuzzy set A: all x such that u_A(x) = 1; i.e., ker(A) = {x ∈ X | u_A(x) = 1}
Height of a fuzzy set A: supremum of u_A(x) over x; i.e., ht(A) = sup_{x∈X} u_A(x)
CS5486 257
Fuzzy Set Terminology
Normalized fuzzy set A: Its height is unity; i.e.,
ht(A)=1. Otherwise, it is subnormal.
α-cut of a fuzzy set A: a crisp set A_α = {x ∈ X | u_A(x) ≥ α}
Convex fuzzy set A: for all λ ∈ [0, 1] and x, y ∈ X,
u_A(λx + (1 − λ)y) ≥ min(u_A(x), u_A(y))
i.e., any α-cut is a convex set.
CS5486 258
Logic Operations on Fuzzy Sets
Union of two fuzzy sets: μ_{A∪B}(x) = max{μ_A(x), μ_B(x)}
Intersection of two fuzzy sets: μ_{A∩B}(x) = min{μ_A(x), μ_B(x)}
Complement of a fuzzy set: μ_{¬A}(x) = 1 − μ_A(x)
CS5486 259
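A tiny sketch of these operations on discrete membership functions (illustrative values, not from the slides).

# Discrete fuzzy sets over the same universe, as {element: membership} maps.
A = {"x1": 0.2, "x2": 0.7, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.4, "x3": 0.9}

union        = {x: max(A[x], B[x]) for x in A}      # mu_{A union B} = max
intersection = {x: min(A[x], B[x]) for x in A}      # mu_{A intersect B} = min
complement_A = {x: 1.0 - A[x] for x in A}           # mu_{not A} = 1 - mu_A

print(union)
print(intersection)
print(complement_A)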
Logic Operations on Fuzzy Sets
CS5486 260
Cardinality and Entropy of Fuzzy
Sets
Cardinality: |A| is defined as the sum of the
membership function values of all elements
in X; i.e.,
|A| = Σ_{x∈X} μ_A(x)   or   |A| = ∫_X μ_A(x) dx
CS5486 262
Entropy of Fuzzy Sets
CS5486 263
Entropy of Fuzzy Sets
CS5486 264
Logic Operations on Fuzzy Sets
Equality: for all x, u_A(x) = u_B(x)
Degree of equality: E(A, B) = deg(A = B) = |A ∩ B| / |A ∪ B|
Subset: A ⊆ B if u_A(x) ≤ u_B(x), ∀x ∈ X
Subsethood measure: S(A, B) = deg(A ⊆ B) = |A ∩ B| / |A|
CS5486 265
Properties of Fuzzy Sets
Union: A ⊆ A ∪ B, B ⊆ A ∪ B
Intersection: A ∩ B ⊆ A, A ∩ B ⊆ B
Double negation law: ¬¬A = A
De Morgan's laws: ¬(A ∪ B) = ¬A ∩ ¬B,  ¬(A ∩ B) = ¬A ∪ ¬B
However, A ∪ ¬A ≠ X and A ∩ ¬A ≠ ∅
CS5486 266
Fuzzy Relations
R(x_1, x_2, …, x_n) = ∫_{X_1 × X_2 × … × X_n} u_R(x_1, x_2, …, x_n) / (x_1, x_2, …, x_n)
CS5486 268
Typical Defuzzifiers
Centroid (also known as center of gravity or center of area) defuzzifier:
x* = ∫_X x μ_A(x) dx / ∫_X μ_A(x) dx   or   x* = Σ_i x_i μ_A(x_i) / Σ_i μ_A(x_i)
CS5486 269
Centroid Defuzzifier
CS5486 270
Linguistic Variables
Linguistic variables are important in fuzzy
logic and approximate reasoning.
Linguistic variables are variables whose
values are words or sentences in natural or
artificial languages.
For example, speed can be defined as a
linguistic variable and takes values of slow,
fast, and very fast.
CS5486 271
Fuzzy Inference Process
When imprecise information is input to a
fuzzy inference system, it is first fuzzified
by constructing a membership function.
Based on a fuzzy rule base, the fuzzy
inference engine makes a fuzzy decision.
The fuzzy decision is then defuzzified to
output for an action.
The defuzzification is usually done by using the centroid method.
CS5486 272
Fuzzy Inference Process
CS5486 273
Fuzzy Inference System
CS5486 274
An Electrical Heater Example
Rule Base:
R1: If temperature is cold, then increase
power.
R2: If temperature is normal, then maintain.
R3: If temperature is warm, then reduce
power.
At 12°, T = cold/0.5 + normal/0.3 + warm/0.0,
A = increase/0.5 + maintain/0.3 + reduce/0.0.
CS5486 275
An Electrical Heater Example
CS5486 276
An Electrical Heater Example
CS5486 277
Type-2 Fuzzy Logic
A generalization of type-1 fuzzy logic to handle the uncertainty of membership functions by using fuzzy membership functions.
Proposed by Prof. Zadeh in 1975 (ten years after type-1), but became popular only in recent years.
CS5486 278
CS5486 279
Type-2 Fuzzy System
CS5486 280
Evolutionary Computation
Population-based stochastic and meta-
heuristic search algorithms for global or
multi-objective optimization
Motivated by natural evolution based on Darwinian or other principles
Use collective wisdom to accomplish given
tasks via efforts of many generations.
CS5486 281
Swarm-Based Search
Swarm is better than individual
282
Evolutionary Algorithms
Evolutionary programming (Lawrence
Fogel, 1960’s)
Evolutionary strategies (Ingo Rechenberg and Hans-Paul Schwefel, 1960s)
Genetic algorithms (John Holland, 1970’s)
Genetic programming (John Koza,
1990’s)
CS5486 283
Genetic Algorithms
A stochastic search method simulating the
evolution of population of living species.
Optimize a fitness function which is not
necessarily continuous or differentiable.
A genetic algorithm maintains a population of candidate solutions (seeds) instead of the single candidate used in traditional algorithms.
The computation of the population can be
carried out in parallel.
CS5486 284
Elements in Genetic Algorithms
A coding of the optimization problem to produce
the required discretization of decision variables in
terms of strings.
A reproduction operator to copy individual strings
according to their fitness.
A set of information-exchange operators; e.g.,
crossover, for recombination of search points to
generate new and better population of points.
A mutation operator for modifying data.
CS5486 285
Reproduction Operator
1. Sum the fitness of all the population members and call the result the total fitness.
2. Generate a random number n between 0 and
total fitness under uniform distribution.
3. Return the first population member whose
fitness, added to the fitnesses of the preceding
population members (running total), is greater
than or equal to n.
CS5486 286
Crossover Operator
Select offspring from the population
after reproduction.
Two strings (parents) from the
reproduced population are paired with
probability Pc.
Two new strings (offspring) are
created by exchanging bits at a
crossover site.
CS5486 287
Crossover Operation
CS5486 288
Crossover Operation
CS5486 289
Mutation Operator
Reproduction and crossover produce new strings without introducing new information into the population at the bit level.
To inject new information into
offspring.
Invert chosen bits randomly with a
lower probability Pm
CS5486 290
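A minimal sketch combining these three operators (roulette-wheel reproduction, one-point crossover, bit-flip mutation), here maximizing the number of ones in a bit string; the fitness function, string length, and probabilities are illustrative assumptions, not from the slides.

import random

random.seed(0)
L, POP, GENS, PC, PM = 20, 30, 40, 0.8, 0.02

def fitness(s):
    return sum(s)                                # toy "one-max" fitness

def reproduce(pop):
    total = sum(fitness(s) for s in pop)
    r = random.uniform(0, total)
    running = 0
    for s in pop:                                # roulette-wheel selection
        running += fitness(s)
        if running >= r:
            return s[:]
    return pop[-1][:]

def crossover(a, b):
    if random.random() < PC:
        site = random.randrange(1, L)            # one-point crossover site
        return a[:site] + b[site:], b[:site] + a[site:]
    return a[:], b[:]

def mutate(s):
    return [1 - bit if random.random() < PM else bit for bit in s]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
for _ in range(GENS):
    nxt = []
    while len(nxt) < POP:
        c1, c2 = crossover(reproduce(pop), reproduce(pop))
        nxt += [mutate(c1), mutate(c2)]
    pop = nxt[:POP]
print("best fitness:", max(fitness(s) for s in pop), "out of", L)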
Mutation Operation
CS5486 291
Towers of Hanoi
Towers of Hanoi puzzle: move a number
of disks from one peg to another among
three pegs, to restore the original piling
order.
Any move must satisfy constraints: e.g.,
we can only move a disk that has no disk
above it.
CS5486 292
Towers of Hanoi
CS5486 293
Legal Moves
CS5486 294
Towers of Hanoi
CS5486 295
Move Representation
CS5486 296
Corresponding Moves
CS5486 297
GA Operations
CS5486 298
GA Flow Chart
CS5486 299
Optimization Process
CS5486 300
Encoder-Decoder Design
CS5486 301
Encoder-Decoder Design
CS5486 302
Swarm Intelligence
Introduced by Gerardo Beni and Jing Wang
in 1989 in the context of cellular robotic
systems
Typically made up of a population of
simple agents interacting locally with one
another and with their environment
Typical representatives include particle
swarm optimization, ant colony
optimization, etc.
CS5486 303
Particle Swarm Optimization
A robust stochastic optimization technique
based on the movement and intelligence of
swarms
Applies the concept of social interaction to
problem solving
It uses a number of agents (particles) that
constitute a swarm moving around in the
search space looking for the best solution
CS5486 304
Particle Swarm Optimization
Each particle is treated as a point in a multi-
dimensional space which adjusts its
“flying” according to its own flying
experience as well as the flying experience
of other particles
• Developed in 1995 by James Kennedy
(social-psychologist) and Russell Eberhart
(electrical engineering professor).
CS5486 305
Fathers of PSO
306
Psychosocial Compromise
(Figure: the particle at x is attracted toward p_i, its own best performance, and p_g, the best performance of its neighbours.)
CS5486 307
Search Direction
At each time step t, for each particle, for each component k:
update the velocity:
v_k(t+1) = v_k(t) + rand(0, c_1)·(p_ik − x_k(t)) + rand(0, c_2)·(p_gk − x_k(t))
then move:
x(t+1) = x(t) + v(t+1)
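A short sketch of this velocity/position update (illustrative, not the slides' code) minimizing a simple sphere function; the coefficients c1 and c2, the velocity limit, and the fitness function are assumptions following the common basic PSO form.

import numpy as np

rng = np.random.default_rng(4)

def f(x):                                        # toy fitness: sphere function (to minimize)
    return np.sum(x ** 2, axis=-1)

n_particles, dim, c1, c2 = 20, 2, 2.0, 2.0
x = rng.uniform(-5, 5, size=(n_particles, dim))  # positions
v = np.zeros_like(x)                             # velocities
p = x.copy()                                     # personal best positions p_i
g = x[np.argmin(f(x))].copy()                    # neighbourhood/global best p_g

for _ in range(100):
    r1 = rng.uniform(0, 1, size=x.shape)
    r2 = rng.uniform(0, 1, size=x.shape)
    v = v + c1 * r1 * (p - x) + c2 * r2 * (g - x)    # velocity update
    v = np.clip(v, -1.0, 1.0)                        # crude velocity limit for stability
    x = x + v                                        # move
    better = f(x) < f(p)                             # update personal and global bests
    p[better] = x[better]
    g = p[np.argmin(f(p))].copy()

print("best position:", np.round(g, 4), "fitness:", round(float(f(g)), 6))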
Random Proximity
(Figure: current position x with attractors p_i and p_g.)
Global Optimization
(Figures: fitness plotted over the search space at successive stages of the search.)
Schwefel’s function
f(x) = Σ_{i=1}^{n} x_i sin(√|x_i|)
where −500 ≤ x_i ≤ 500
Global minimum: f(x*) = −418.9829n at x_i = −420.9687, i = 1, 2, …, n
318
CS5486
Swarm Initialization
319
Evolution after 5 Iterations
320
Evolution after 10 Iterations
321
Evolution after 15 Iterations
322
Evolution after 20 Iterations
323
Evolution after 25 Iterations
324
Evolution after 100 Iterations
325
Evolution after 500 Iterations
326
Search Results
Iteration   Swarm best
0           416.245599
5           515.748796
10          759.404006
15          793.732019
20          834.813763
100         837.911535
5000        837.965771
Global      837.9658
CS5486
That’s all for this course.
CS5486 328