
CS5486

Intelligent Systems

Prof. Jun Wang


Department of Computer Science
Tel: 3442 9701
Email: jwang.cs@cityu.edu.hk
CS5486 1
Aims

This course aims to equip students with the skills of problem solving using artificial intelligence (AI) techniques, through demonstrable knowledge of a range of problem-solving methods and the associated knowledge representation and machine learning techniques.

2
Keywords
Artificial intelligence vs. computational
intelligence. Neural networks. Knowledge
representations. Machine learning. Rule-based
systems. Fuzzy Systems. Evolutionary
computation.

3
Syllabus
 An introduction to the goals and objectives
of AI as a discipline and its milestones.
Approaches in AI. Major components in
intelligent systems.
 Methods of knowledge acquisition and
representations. Associative memory.
Techniques on machine learning such as
supervised learning, unsupervised learning,
reinforcement learning, and deep learning.
Generalization. 4
Syllabus (cont’d)
 Basic concepts of graph and tree search.
Optimization methods such as stochastic
annealing, neurodynamic optimization,
genetic algorithm, particle swarm
optimization, ant colony optimization, and
differential evolution.

5
References
 R. Rojas, Neural Networks: A Systematic
Introduction, Springer, 1996.
 S. Haykin, Neural Networks and Learning
Machines (3rd Ed), Prentice-Hall, 2009.
 S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (3rd Ed), Prentice-Hall, 2009.
 C.-T. Lin and C.S. G. Lee, Neural Fuzzy
Systems, Prentice-Hall, 1996.
6
Intelligence
“Intelligence is a mental quality that
consists of the abilities to learn from
experience, adapt to new situations,
understand and handle abstract
concepts, and use knowledge to
manipulate one’s environment.”
Britannica
7
Definition of Intelligent Systems
A system is an intelligent system if
it exhibits some intelligent
behaviors.

For example, neural networks, fuzzy


systems, simulated annealing, genetic
algorithms, and expert systems.

8
Intelligent Behaviors
Inference: Deduction vs. Induction
(generalization); e.g., judgment and pattern
recognition
Learning and adaptation: Evolutionary
processes; e.g., learning from examples
Creativity: e.g., planning and design

9
CS5486 10
Milestones of Intelligent System
Development
 1940’s: Cybernetics by Wiener
 1943: Threshold logic networks by McCulloch and Pitts
 1950’s-1960’s: Perceptrons by Rosenblatt
 1960’s: Adaline by Widrow
 1970’s: AI and expert systems
 1970’s: Fuzzy logic by L. Zadeh
 1974: Backpropagation algorithm by P. Werbos
 1970’s: Adaptive resonance theory by S. Grossberg
 1970’s: Self-organizing map by T. Kohonen
 1980’s: Hopfield networks by J. Hopfield
 1980’s: Genetic algorithms by J. Holland
 1980’s: Simulated annealing by Kirkpatrick et al.
 2006-: Deep learning by Hinton et al.
CS5486 11
Milestones (cont’d)

Deep Blue (IBM), 1997


defeated chess world champion
Garry Kasparov
AlphaGo (Google), 2016
defeated Go nine-dan player Lee Sedol
12
Professor Norbert Wiener

13
Professor Warren S. McCulloch

14
Professor Bernard Widrow

15
Professor Lotfi A. Zadeh

16
Dr. Paul J. Werbos

17
Professor Teuvo Kohonen

18
Professor John J. Hopfield

19
Professor Geoffrey E. Hinton

20
Engineering Applications of
Intelligent Systems
 Pattern recognition: e.g., image processing,
pattern analysis, speech recognition, etc.
 Control and robotics: e.g., modeling and
estimation
 Associative memory (content-addressable
memory)
 Forecasting: e.g., in financial engineering

21
22
CS5486 23
Computational Intelligence
 Coined by IEEE Neural Networks Council
in 1994.
 Represents a new generation of intelligent systems.
 Consists of neural networks, fuzzy logic, and evolutionary computation techniques (e.g., genetic algorithms).

24
CS5486 25
26
CS5486 27
Soft Computing
“Soft computing based on
computational intelligence should
be the basis for the conception,
design, deployment of intelligent
systems rather than hard
computing based on artificial
intelligence.”
Lotfi Zadeh
CS5486 28
CS5486 29
CS5486 30
What are Neural Networks?

 Composed of a number of interconnected


neurons, resembling the human brain.
 Also known as connectionist models,
parallel distributed processing (PDP)
models, neural computers, and
neuromorphic systems.

CS5486 31
Components of Neural Networks
 A number of artificial neurons (also known
as nodes, processing units, or computational
elements)
 Massive inter-neuron connections with
different strengths (also known as synaptic
weights).
 Input and output channels

CS5486 32
Formalization of Neural Networks

ANN = (ARCH, RULE)


ARCH: architecture, refers to the
combination of components
RULE: rules, refers to the set of rules
that relate the components

CS5486 33
Architecture of Neural Networks
ARCH = (u, v, w, x, y)
Simple and similar neurons represented by u and v in N-dimensional space
Inter-neuron connection weights represented by w in M-dimensional space
External inputs and outputs represented respectively by x and y in n- and m-dimensional spaces
CS5486 34
Model of Neurons
 Biological neurons: 10^10-10^11
 Highly simplified
 Firing activities are quantified by using state variables (also called activation states)
 The net input to a neuron is usually a weighted sum of state variables from other neurons, input and/or output variables
 The net input to a neuron usually goes through a nonlinear transformation called activation
CS5486 35
Connections between Neurons
Adaptive: Synaptic connections with
adjustable weights
Excitatory (positive weight) vs.
inhibitory (negative weight)
Distributed knowledge representation,
different from digital computers

CS5486 36
Rules of Neural Networks
RULE = (E, F, G, H, L)
E: evaluation rule, mapping from v and/or y to the real line; e.g., an error function or energy function
F: activation rule, mapping from u to v; e.g., an activation function
G: aggregation rule, mapping from v, w, and/or x to u; e.g., a weighted sum
H: output rule, mapping from v to y; y is usually a subset of v
L: learning rule, mapping from v, w, and x to w, usually iterative
CS5486 37
Learning in Neural Networks
 Goal: To improve performance
 Means: interact with environment
 A process by which the adaptable
parameters of an ANN are adjusted thru an
iterative process of stimulation by the
environment in which the ANN is
embedded
 Supervised vs. unsupervised
CS5486 38
On Learning

“By three methods we may learn


wisdom: First, by reflection, which
is the noblest; second, by imitation,
which is the easiest, and third by
experience, which is the bitterest.”
Confucius 孔子
CS5486 39
General Incremental Learning Rule
 Discrete-time:
   w(t+1) = w(t) + Δw(t),   Δw(t) = L(v, w, x)
 Continuous-time:
   dw(t)/dt = L(v, w, x)
CS5486 40
Two-Time Scale Dynamics
in Neural Networks

 Faster dynamics in neuron activities represented by u and v, also called short-term memory
 Slower dynamics in connection weight activities represented by w, also called long-term memory

CS5486 41
Categories of Neural Networks

 Deterministic vs. stochastic, in terms of F


 Feedforward vs. recurrent, in terms of G
and H
 Semilinear vs. higher-order, in terms of G
 Supervised vs. unsupervised, in terms of L

CS5486 42
Definition of Neural Networks

Massively parallel distributed processors that have a natural property for storing experiential knowledge and making it available for use

CS5486 43
Features of Neural Networks
Resemble the brains in two aspects:
1. Knowledge acquisition: knowledge is
acquired by neural networks thru learning
processes.
2. Knowledge representation: Inter-neuron
connections, known as synaptic weights are
used to store acquired knowledge

CS5486 44
Properties of Neural Networks
1 Nonlinearity
2 Input-output mapping
3 Adaptivity
4 Contextual information
5 Fault tolerance
6 Hardware implementability
7 Uniformity of analysis and design
8 Neurobiological analogy and plausibility
CS5486 45
McCulloch-Pitts Neuron

CS5486 46
McCulloch-Pitts Neurons
 Binary values {0, 1}
 Connection weights of +1 and -1 only
 If an input to a neuron is 1 and the associated weight is -1, then the output of the neuron is 0 (absolute inhibition)
 Otherwise, if the weighted sum of the inputs is not less than the threshold, the output is 1; if it is less than the threshold, the output is 0. (See the sketch below.)
CS5486 47
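As a concrete illustration of these rules (a minimal sketch in Python, which the slides do not provide; the helper name mcculloch_pitts is mine), the AND and OR gates of the following slides can be realized with thresholds 2 and 1:

```python
# McCulloch-Pitts unit: binary inputs/outputs, weights restricted to +1 or -1,
# and any active input on a -1 (inhibitory) connection forces the output to 0.
def mcculloch_pitts(inputs, weights, threshold):
    if any(x == 1 and w == -1 for x, w in zip(inputs, weights)):
        return 0                              # absolute inhibition
    s = sum(x for x, w in zip(inputs, weights) if w == 1)
    return 1 if s >= threshold else 0         # threshold logic

AND = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))      # truth tables of AND and OR
```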
Threshold Logic

CS5486 48
AND & OR Gates

CS5486 49
Other Logic Functions

CS5486 50
Decoders

CS5486 51
Another Decoder

CS5486 52
Threshold Logic Units
Proposition: Any logical function F: {0, 1}^n -> {0, 1} can be implemented with a two-layer McCulloch-Pitts network.

Proposition: Uninhibited threshold logic units


of McCulloch-Pitts type can only
implement monotonic logical functions.

CS5486 53
Weighted Connections

CS5486 54
A Recurrent Network

CS5486 55
Finite Automata
An automaton is an abstract device capable of
assuming different states which change
according to the received input and previous
states.
A finite automaton can take only a finite set of
possible states and can react to only a finite
set of input signals.

CS5486 56
Finite Automata & Recurrent Networks

Proposition: Any finite automaton can be


simulated with a recurrent network of
McCulloch-Pitts units.

CS5486 57
Perceptron
 A single adaptive layer of feedforward
network of pure threshold logic units.
 Developed by Rosenblatt at Cornell University in the late 1950's.
 Trained for pattern classification.
 First working model implemented in
electronic hardware.

CS5486 58
Perceptron

CS5486 59
Perceptron

CS5486 60
Simple Perceptron

A simple perceptron is a computing device


with a threshold logic unit. When receiving
n real inputs thru connections with n
associated weights, a simple perceptron
outputs 1 if the net input of weighted sum
is not less than the threshold, and outputs 0
otherwise.
CS5486 61
Linear Separability

Two sets of data in an n-dimensional space are said to be linearly separable (absolutely linearly separable) if n+1 real weights (including a threshold) exist such that the weighted sum of every datum in one set is greater than or equal to (strictly greater than) the threshold, while that of every datum in the other set is less than the threshold.
CS5486 62
Absolute Linear Separability

If two finite sets of data are linearly


separable, they are also absolutely
linearly separable.

CS5486 63
Perceptron Convergence Algorithm

1) Initialize the weights and threshold randomly.
2) Calculate the actual output of the perceptron for all p:
   y^p = f( Σ_{i=1}^n w_i x_i^p - θ )
3) Adapt the weights for all p:
   w_i(t+1) = w_i(t) + η (z^p - y^p) x_i^p,   η > 0
4) Repeat until w converges.
(A minimal sketch of this procedure is given below.)
CS5486 64
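A minimal sketch of steps 1)-4) in Python with NumPy (the function name, learning rate, and the OR-function data are illustrative assumptions, not part of the slides):

```python
import numpy as np

def train_perceptron(X, z, eta=0.1, epochs=100, seed=0):
    """Perceptron convergence algorithm for binary targets z in {0, 1}."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])             # random initial weights
    theta = rng.normal()                        # random initial threshold
    for _ in range(epochs):
        for xp, zp in zip(X, z):
            yp = 1 if xp @ w - theta >= 0 else 0    # hard-limiter output
            w += eta * (zp - yp) * xp               # weight adaptation
            theta -= eta * (zp - yp)                # threshold adaptation
    return w, theta

# Linearly separable example: the OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
z = np.array([0, 1, 1, 1])
w, theta = train_perceptron(X, z)
print([int(xp @ w - theta >= 0) for xp in X])       # expected [0, 1, 1, 1]
```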
Iterative learning

CS5486 65
A Simple Case

CS5486 66
Error Landscape

CS5486 67
Learning Process

CS5486 68
Threshold

CS5486 69
Perceptron Convergence Theorem

If two sets of data are linearly separable, the perceptron learning algorithm converges to a set of weights and a threshold in a finite number of steps.

CS5486 70
Two-Variable Logic Functions

CS5486 71
2D OR and AND

CS5486 72
3D Logic Function

CS5486 73
OR Function

CS5486 74
Majority Function

CS5486 75
Duality

CS5486 76
2D Weight Space

CS5486 77
2D Weight Space

CS5486 78
XOR Problem

CS5486 79
Limitations of Perceptrons

 Only linearly separable data can be classified.
 The convergence rate may be low for high-dimensional data or a large number of samples.

CS5486 80
Bipolar vs. Unipolar State Variables

 Unipolar: v {0,1}
 Bipolar: v {1,1}
Bipolar coding of state variables is better
than unipolar (binary) one in terms of
algebraic structure, region proportion
in weight space, etc.

CS5486 81
Monte Carlo Tests

CS5486 82
ADALINE
 A single adaptive layer of feedforward
network of linear elements.
 Full name: Adaptive linear elements.
 Developed by Widrow and Hoff at Stanford
University in early 60’s.
 Trained using a learning algorithm called
Delta Rule or Least Mean Squares (LMS)
Algorithm.
CS5486 83
LMS Learning Algorithm

1) Initialize the weights and threshold randomly.
2) Calculate the actual output of the ADALINE:
   y = λ ( Σ_{i=1}^n w_i x_i - θ ),   λ > 0
3) Adapt the weights for all p:
   w_i(t+1) = w_i(t) + η (z^p - y^p) x_i^p,   η > 0
4) Repeat until w converges. (See the sketch below.)
CS5486 84
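A minimal batch-mode sketch of the LMS (delta) rule in Python with NumPy; the linear target, learning rate, and number of epochs are illustrative assumptions:

```python
import numpy as np

def train_adaline(X, z, eta=0.05, lam=1.0, epochs=1000, seed=0):
    """Batch LMS training of a linear unit y = lam * (w.x - theta)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    theta = rng.normal()
    for _ in range(epochs):
        y = lam * (X @ w - theta)        # linear outputs for the whole batch
        err = z - y
        w += eta * X.T @ err             # batch weight update
        theta -= eta * err.sum()         # batch threshold update
    return w, theta

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
z = np.array([0., 1., 1., 2.])           # a linear target: z = x1 + x2
w, theta = train_adaline(X, z)
print(np.round(X @ w - theta, 2))        # converges to approximately [0, 1, 1, 2]
```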
Gradient Descent Learning Algorithms

E   Ep   (z  y ) p p 2

p p

w(t  1)  w(t )  Ew


w(t )  Ew
Ew  (E / w1 , E / w2 ,..., E / wM ) .
T

CS5486 85
Training Modes
 Sequential mode: input training sample
pairs one by one orderly or randomly.
 Batch mode: input training sample pairs in
the whole training set at each iteration.
 Perceptron learning: either sequential or
batch mode.
 ADALINE training: batch mode only.

CS5486 86
Perceptron vs. Adaline
 Architecture: Perceptron uses bipolar or
unipolar hardlimiter activation function,
Adaline uses linear activation function.
 Learning rule: Perceptron learning
algorithm is not gradient-descent and can
operate in either sequential or batch training
mode, whereas Adaline learning (LMS)
algorithm is gradient descent, but can only
operate in batch mode.
CS5486 87
Weight Space Regions
Separated by Hyperplanes
 Each plane is defined by one training
sample.
 One plane separates two (2) half-space.
 Two planes separate up to four (4) regions.
 Three planes separate up to eight (8)
regions.
 However, four planes separate up to
fourteen (14) regions only.
CS5486 88
Number of Weight Space Regions

The number of different regions in weight space defined by m separating hyperplanes in n-dimensional weight space is a polynomial of degree n-1 in m:
   R(m, n) = 2 Σ_{i=0}^{n-1} C(m-1, i) ≈ 2 m^(n-1)/(n-1)!   (for large m)
CS5486 89
Number of Weight Space Regions

CS5486 90
Number of Logic Functions vs.
Number of Threshold Functions
 The number of threshold functions of n variables defined by hyperplanes grows like 2^(n(n-1)), whereas the number of logical functions of n variables is 2^(2^n).
 The learnability problem: when n is large, there are not enough classification regions in weight space to represent all logical functions.
CS5486 91
Learnability Problems
 Solution existence in the weight space? Neither
Perceptron nor Adaline can classify patterns
with nonlinear distributions such as XOR. But
two-layer Perceptron can classify XOR data.
 How to find the solution even though it exists in
the weight space? It is known that multilayer
Perceptron can classify arbitrary shape of data
classes. But how to design learning algorithms
to determine the weights?
CS5486 92
Multilayer Feedforward Network

CS5486 93
Multilayer Feedforward Network

CS5486 94
XOR Problem Solved

CS5486 95
Alternative Networks

CS5486 96
Learning Process

CS5486 97
Alternative Network

CS5486 98
Two-layer Separation

CS5486 99
Two-layer Separation

CS5486 100
Backpropagation Algorithm
 Also known as generalized delta rule.
 Invented and reinvented by many researchers,
popularized by the PDP group at UC San Diego in
1986.
 A recursive gradient-descent learning algorithm for
multilayer feedforward networks of sigmoid
activation function.
 Compute errors backward from the output layer to
input layer.
 Minimize the mean squared error function.
CS5486 101
Sigmoid Activation Functions

CS5486 102
Sigmoid Activation Functions
 Unipolar: f(u) = 1/(1 + exp(-u))
   df(u)/du = exp(-u)/(1 + exp(-u))^2
            = [1/(1 + exp(-u))] · [(1 + exp(-u) - 1)/(1 + exp(-u))]
            = f(u)[1 - f(u)]
 Bipolar: f(u) = (1 - exp(-u))/(1 + exp(-u)) = tanh(u/2)
CS5486 103
Related Activation Functions

CS5486 104
Backpropagation Algorithm
(cont’d)
Error function:
   E = Σ_p E^p = (1/2) Σ_p Σ_{i=1}^m (z_i^p - y_i^p)^2
General formula:
   Δw_ij(t) = w_ij(t+1) - w_ij(t) = -η ∂E/∂w_ij = -η Σ_p ∂E^p/∂w_ij
CS5486 105
Backpropagation Algorithm
(cont’d)
Output layer l:
   ∂E^p/∂w_ij^l = (∂E^p/∂y_i^p)(∂y_i^p/∂u_i^l)(∂u_i^l/∂w_ij^l)
                = -(z_i^p - y_i^p) y_i^p (1 - y_i^p) v_j^(l-1) = -δ_i^l v_j^(l-1)
where δ_i^l = -∂E^p/∂u_i^l = (z_i^p - y_i^p) y_i^p (1 - y_i^p).
CS5486 106
Backpropagation Algorithm
(cont’d)
Hidden layer l-1:
   ∂E^p/∂v_i^(l-1) = Σ_j (∂E^p/∂u_j^l)(∂u_j^l/∂v_i^(l-1)) = -Σ_j δ_j^l w_ji^l
   δ_i^(l-1) = -∂E^p/∂u_i^(l-1) = ( Σ_j δ_j^l w_ji^l ) v_i^(l-1) (1 - v_i^(l-1))
   ∂E^p/∂w_ij^(l-1) = (∂E^p/∂u_i^(l-1))(∂u_i^(l-1)/∂w_ij^(l-1)) = -δ_i^(l-1) v_j^(l-2)
CS5486 107
Backpropagation Algorithm
(cont’d)
Input layer 1:
   δ_i^1 = ( Σ_j δ_j^2 w_ji^2 ) v_i^1 (1 - v_i^1)
   ∂E^p/∂w_ij^1 = (∂E^p/∂u_i^1)(∂u_i^1/∂w_ij^1) = -δ_i^1 x_j^p
CS5486 108
Backpropagation Algorithm
(cont’d)
1) Initialize the weights and thresholds randomly.
2) Calculate the actual output of the MLP (forward pass).
3) Adapt the weights for all layers:
   w_ij(t+1) = w_ij(t) - η Σ_p ∂E^p/∂w_ij
4) Repeat until w converges. (A minimal sketch is given below.)
CS5486 109
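A minimal NumPy sketch of the algorithm for a single hidden layer with unipolar sigmoid units, trained on the XOR data; the layer sizes, learning rate, iteration count, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda u: 1.0 / (1.0 + np.exp(-u))        # unipolar sigmoid activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
Z = np.array([[0.], [1.], [1.], [0.]])                        # XOR targets

W1 = rng.normal(size=(3, 3))     # hidden layer: 3 units, 2 inputs + bias
W2 = rng.normal(size=(1, 4))     # output layer: 1 unit, 3 hidden + bias
eta = 0.5

for _ in range(20000):
    Xa = np.hstack([X, np.ones((4, 1))])            # augment inputs with bias
    V = f(Xa @ W1.T)                                 # hidden activations
    Va = np.hstack([V, np.ones((4, 1))])
    Y = f(Va @ W2.T)                                 # network outputs

    delta2 = (Z - Y) * Y * (1 - Y)                   # output-layer deltas
    delta1 = (delta2 @ W2[:, :3]) * V * (1 - V)      # back-propagated hidden deltas

    W2 += eta * delta2.T @ Va                        # gradient-descent updates
    W1 += eta * delta1.T @ Xa

print(np.round(Y.ravel(), 2))   # typically approaches [0, 1, 1, 0]; a rare run may hit a local minimum
```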
Local Minima

CS5486 110
Momentum Term
To avoid local oscillation, a momentum term is sometimes added:
   Δw_ij(t) = -η ∂E/∂w_ij + α Δw_ij(t-1),   0 ≤ α < 1
CS5486 111
Kolmogorov Theorem
Let f: [0, 1]^n -> [0, 1] be a continuous function. There exist functions of one argument g and h_j for j = 1, 2, ..., 2n+1 and constants w_i for i = 1, 2, ..., n such that
   f(x_1, x_2, ..., x_n) = Σ_{j=1}^{2n+1} g[ Σ_{i=1}^n w_i h_j(x_j) ].

CS5486 112
Universal Approximators
 Multilayer feedforward neural networks are
universal approximators of continuous
functions.
 A set of weights exists such that the approximation errors can be arbitrarily small.
 However, the BP algorithm is not
guaranteed to find such a set of weights.

CS5486 113
Overfitting Problem

CS5486 114
Radial Basis Functions
 A radial basis function (RBF) is a real-
valued function whose value depends only
on the distance from its origin or center.
 Related to kernel theory in statistical
learning.
 Used as the means for approximating or
interpolating multivariate functions.

CS5486 115
Radial Basis Functions
   y = Σ_{i=1}^h w_i φ(||x - c_i||)
 Polyharmonic spline: φ(r) = r^p,  p = 1, 3, 5, ...
 Gaussian: φ(r) = exp(-r^2/σ^2)
 Thin plate spline: φ(r) = r^2 log(r)
 Multiquadric: φ(r) = sqrt(r^2 + σ^2)
 Inverse multiquadric: φ(r) = 1/sqrt(r^2 + σ^2)
CS5486 116
Radial Basis Functions
 The most commonly used radial-basis
function is a Gaussian function
 In an RBF network, r is the distance from
the center.
 Graphically, it is a bell-shaped curve.

CS5486 117
Radial Basis Function Networks
 Proposed first by Broomhead and Lowe in
1988.
 A linear combination of a number of radial basis functions that play the role of hidden neurons.
 Two-layer architecture. Its output layer uses
a linear activation function as ADALINE.
Its hidden layer uses radial basis activation
functions. CS5486 118
Radial Basis Function Networks

CS5486 119
Cover’s Theorem
 Cover’s Theorem (1965):
 A dichotomy {X+, X-} is said to be φ-separable if there exists an m-dimensional vector w such that
   w^T φ(x) ≥ 0, if x ∈ X+
   w^T φ(x) < 0, if x ∈ X-
 The hyperplane defined by w^T φ(x) = 0 is the separating surface between the two classes.

CS5486 120
XOR Problem Revisited

An RBF network can transform


the linearly inseparable XOR
data in the input space to linearly
separable data in the hidden state
space.

CS5486 121
XOR Problem Revisited
 Using a pair of Gaussian RBFs, the input patterns are mapped onto the φ1-φ2 plane and become linearly separable:
   φ1(x) = exp(-||x - c1||^2),   φ2(x) = exp(-||x - c2||^2)
 Let c1 = [1, 1]^T and c2 = [0, 0]^T. (See the sketch below.)
CS5486 122
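A quick numeric check of this mapping (Python with NumPy assumed):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
c1, c2 = np.array([1., 1.]), np.array([0., 0.])

phi1 = np.exp(-np.sum((X - c1) ** 2, axis=1))   # Gaussian RBF centred at [1, 1]
phi2 = np.exp(-np.sum((X - c2) ** 2, axis=1))   # Gaussian RBF centred at [0, 0]

for x, p1, p2 in zip(X, phi1, phi2):
    print(x, round(p1, 3), round(p2, 3))
# (0,0) -> (0.135, 1.0) and (1,1) -> (1.0, 0.135), while (0,1) and (1,0) both map
# to (0.368, 0.368): one straight line in the phi1-phi2 plane now separates the classes.
```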
XOR Problem Revisited

CS5486 123
Functional Link Network
 Proposed by Yoh-Han Pao at CWRU in late
80’s
 One-layer feedforward architecture
 Higher-order aggregation rule
 Fast learning process
 Local minima could be eliminated
 Many success stories in applications
CS5486 124
Functional Link Network

CS5486 125
Extreme Learning Machine
 Proposed by Guangbin Huang at NTU in
mid 2000’s
 One-layer feedforward architecture
 Random connection weights from inputs to
hidden neurons
 Fast learning process for weights in output
layer.
 Local minima eliminated
CS5486 126
Support Vector Machine

 Proposed by Vladimir Vapnik based on statistical learning and kernel theory in the early 1990's.
 Minimization of structural risk.
 Maximal generalization power.

V. Vapnik

CS5486 127
Linear Discriminant Function
g(x) is a linear function:
   g(x) = w^T x + b
 A hyperplane in the feature space, with w^T x + b > 0 on one side and w^T x + b < 0 on the other.
 The (unit-length) normal vector of the hyperplane is n = w/||w||.
CS5486 128
Linear Discriminant Function
How would you classify these data (two classes, denoted +1 and -1) using a linear discriminant function in order to minimize the error?
There are an infinite number of answers!

CS5486 129
Linear Discriminant Function
Which one is the best, with maximal generalization power?

CS5486 130
Maximal Margin Classifier
The linear discriminant function (classifier) with the maximum margin is the best.
The margin is defined as the width by which the boundary could be increased before hitting a data point (the "safe zone").
Why is it the best? It is robust to outliers and thus has strong generalization ability.
CS5486 131
Maximal Margin Classifier
Given a set of data {(x_i, y_i)}, i = 1, 2, ..., n, where
   for y_i = +1, w^T x_i + b > 0
   for y_i = -1, w^T x_i + b < 0
With a scale transformation on both w and b, the above is equivalent to
   for y_i = +1, w^T x_i + b ≥ 1
   for y_i = -1, w^T x_i + b ≤ -1
CS5486 132
Maximal Margin Classifier
Let
   w^T x+ + b = 1
   w^T x- + b = -1
where x+ and x- are support vectors. The margin width is
   M = (x+ - x-) · n = (x+ - x-) · w/||w|| = 2/||w||
CS5486 133
Maximal Margin Classifier
 Formulation:
   maximize 2/||w||
   such that
     for y_i = +1, w^T x_i + b ≥ 1
     for y_i = -1, w^T x_i + b ≤ -1
CS5486 134
Maximal Margin Classifier
 Formulation:
   minimize (1/2)||w||^2
   such that
     for y_i = +1, w^T x_i + b ≥ 1
     for y_i = -1, w^T x_i + b ≤ -1
CS5486 135
Maximal Margin Classifier
 Formulation:
   minimize (1/2)||w||^2
   subject to
     y_i (w^T x_i + b) ≥ 1
CS5486 136
Quadratic Programming Problem
Quadratic programming with linear constraints:
   minimize (1/2)||w||^2
   s.t. y_i (w^T x_i + b) ≥ 1
Lagrangian function:
   minimize L_p(w, b, α_i) = (1/2)||w||^2 - Σ_{i=1}^n α_i [ y_i (w^T x_i + b) - 1 ]
   s.t. α_i ≥ 0
CS5486 137
Quadratic Programming Problem
   minimize L_p(w, b, α_i) = (1/2)||w||^2 - Σ_{i=1}^n α_i [ y_i (w^T x_i + b) - 1 ],   s.t. α_i ≥ 0
Setting the gradients to zero:
   ∂L_p/∂w = 0  =>  w = Σ_{i=1}^n α_i y_i x_i
   ∂L_p/∂b = 0  =>  Σ_{i=1}^n α_i y_i = 0
CS5486 138
Quadratic Programming Problem
   minimize L_p(w, b, α_i) = (1/2)||w||^2 - Σ_{i=1}^n α_i [ y_i (w^T x_i + b) - 1 ],   s.t. α_i ≥ 0
Lagrangian dual problem:
   maximize Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
   s.t. α_i ≥ 0 and Σ_{i=1}^n α_i y_i = 0
CS5486 139
Quadratic Programming Problem
From the KKT conditions, we know
   α_i [ y_i (w^T x_i + b) - 1 ] = 0
Thus, only the support vectors have α_i ≠ 0.
The solution has the form
   w = Σ_{i=1}^n α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i
Get b from y_i (w^T x_i + b) - 1 = 0, where x_i is a support vector.
CS5486 140
Linear Discriminant Function
The linear discriminant function is:
   g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b
It is a weighted dot product between the test point x and the support vectors x_i.
Solving the optimization problem involves computing the dot products x_i^T x_j between all pairs of training samples.
CS5486 141
Soft Constraints
What to do if the data are not linearly separable (noisy data, outliers, etc.)?
Slack variables ξ_i can be added to allow misclassification of nonlinearly distributed or noisy data.
CS5486 142
Soft Constraints
Problem formulation:
   minimize (1/2)||w||^2 + C Σ_{i=1}^n ξ_i
   subject to
     y_i (w^T x_i + b) ≥ 1 - ξ_i
     ξ_i ≥ 0
Parameter C is a weight between margin maximization and misclassification minimization.
CS5486 143
Maximal Margin Classifier
Problem reformulation (Lagrangian dual problem):
   maximize Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
   subject to
     0 ≤ α_i ≤ C
     Σ_{i=1}^n α_i y_i = 0
CS5486 144
Nonlinear SVM
Datasets that are linearly separable with noise work out great.
But what to do if the dataset is nonlinearly distributed?
How about mapping the data to a higher-dimensional space?
(Figures: 1-D data along the x axis, and the same data mapped into the (x1, x2) plane.)
CS5486 145
Feature Space
General idea: the original input space can be mapped
to a higher-dimensional feature space where the
training set is linearly separable

feature map
Φ: x → φ(x)

CS5486 146
The Kernel Trick
With this mapping, the discriminant function is now
   g(x) = w^T φ(x) + b = Σ_{i∈SV} α_i y_i φ(x_i)^T φ(x) + b
There is no need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
   K(x_i, x_j) = φ(x_i)^T φ(x_j)
CS5486 147
An Example
2-dimensional vectors x = [x1, x2]; let K(x_i, x_j) = (1 + x_i^T x_j)^2.
Need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):
   K(x_i, x_j) = (1 + x_i^T x_j)^2
     = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
     = [1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2]^T [1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2]
     = φ(x_i)^T φ(x_j), where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]
(A quick numeric check is sketched below.)
CS5486 148
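A quick numeric check of this identity (Python with NumPy assumed; the two vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map corresponding to K(a, b) = (1 + a.b)^2 in 2-D."""
    x1, x2 = x
    return np.array([1, x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = (1 + a @ b) ** 2      # kernel evaluated directly in the input space
rhs = phi(a) @ phi(b)       # dot product in the expanded feature space
print(lhs, rhs)             # both equal 4.0 (up to floating-point rounding)
```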
Common Kernel Functions
 Linear kernel: K(x_i, x_j) = x_i^T x_j
 Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p
 Gaussian (radial-basis function, RBF) kernel:
   K(x_i, x_j) = exp( -||x_i - x_j||^2 / (2σ^2) )
 Sigmoid: K(x_i, x_j) = tanh( β_0 x_i^T x_j + β_1 )
CS5486 149
Kernel Functions
For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.
Mercer's theorem: every semi-positive definite symmetric function is a kernel.
 Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix:
   K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xN)
   K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xN)
   …         …         …         …  …
   K(xN,x1)  K(xN,x2)  K(xN,x3)  …  K(xN,xN)
CS5486 150
Nonlinear SVM: Optimization
Problem reformulation (Lagrangian dual problem):
   maximize Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j)
   subject to 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0
The solution of the discriminant function is
   g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b
The optimization technique is the same.
CS5486 151
SVM Learning
1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming
problem (many software packages
available)
4. Construct the discriminant function
from the support vectors

CS5486 152
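A minimal sketch of these four steps, assuming scikit-learn is available (the slides do not prescribe any particular package); the ring-shaped dataset and the values of C and gamma are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # nonlinearly separable classes

clf = SVC(kernel="rbf", C=10.0, gamma=1.0)   # step 1: choose kernel; step 2: choose C
clf.fit(X, y)                                # step 3: the QP is solved internally
print(clf.n_support_)                        # step 4: support vectors define g(x)
print(clf.score(X, y))                       # training accuracy, close to 1.0
```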
Design Issues
 Choice of kernel
- Gaussian or polynomial kernel is default
- if ineffective, more elaborate kernels are needed
- domain experts can give assistance in formulating
appropriate similarity measures
 Choice of kernel parameters
- e.g. σ in Gaussian kernel
- σ is the distance between closest points with different
classifications
- In the absence of reliable criteria, applications rely on the use
of a validation set or cross-validation to set such parameters.
 Optimization criterion – hard margin vs. soft margin
- a lengthy series of experiments in which various parameters
are tested
CS5486 153
SVM Summary
 1. Maximal Margin Classifier
– Better generalization ability & less over-fitting

 2. The Kernel Trick


– Map data points to a higher dimensional space
to make them linearly separable.
– Since only dot product is used, we do not need
to represent the mapping explicitly.
CS5486 154
Support Vector Regression
Problem formulation

CS5486 155
Support Vector Regression
Problem reformulation

CS5486 156
Support Vector Regression

CS5486 157
Support Vector Regression

CS5486 158
Least-squares SVM
Proposed by Suykens and Vandewalle at
KUL in 1999.
Let
Primal problem formulation

CS5486 159
Least-squares SVM
Lagrangian

CS5486 160
Least-squares SVM
Final model

CS5486 161
Two-spiral Classification

CS5486 162
Short-term Prediction

CS5486 163
MAXNET
 A sub-network for selecting the input with the maximum value - winner takes all.
 By means of mutual inhibition, a MAXNET keeps the maximal input and suppresses the rest.
 It is often used as the output layer in some existing neural networks.

CS5486 168
MAXNET
 A recurrent neural network with self-excitatory connections and lateral inhibitory connections.
 The weight of each self-excitatory connection is 1.
 The weight of each lateral inhibitory connection is -w, where w < 1/m and m is the number of output neurons.

CS5486 169
kWTA Model
Desirable Properties
 The kWTA model with Heaviside activation
function has been proven to be globally
stable and globally convergent to the kWTA
solutions in finite time.
 Derived lower and upper bounds of
convergence time are respectively
Simulation Results
Randomized Integer Inputs
Simulation Results
Low-Resolution Inputs
Clustering

"Like attracts like; people of a kind flock together." (物以类聚，人以群分)
CS5486 174
Clustering
 Clustering or cluster analysis is to group
similar data based on a given similarity
measure.
 It is subjective without unique solutions.
 It is done via unsupervised learning.

CS5486 175
ART1 Network
 Invented by Stephen Grossberg at Boston
University in 1970’s.
 Used to cluster binary data w/ unknown
cluster number.
 A two-layer recurrent neural network.
 MAXNET serves as its output layer.
 Bidirectional adaptive connections called
bottom-up and top-down connections.
CS5486 176
ART1 Architecture

top down
bottom up (normalised)

CS5486 177
ART1 for Clustering
1) Initialize the weights: w_ij^td(0) = 1, w_ij^bu(0) = 1/(1 + n).
2) Compute the net input for an input pattern x^p:
   u_i^p = Σ_{j=1}^n w_ij^bu(t) x_j^p,   i = 1, 2, ..., m
3) Select the best match using the MAXNET: u_k^p = max_i { u_i^p }.
4) Vigilance test: if Σ_{j=1}^n w_kj^td(t) x_j^p / Σ_{j=1}^n x_j^p ≥ ρ, then go to step 5; otherwise, disable neuron k and go to step 2).
5) Adapt the weights:
   w_kj^td(t+1) = w_kj^td(t) x_j^p
   w_kj^bu(t+1) = w_kj^td(t) x_j^p / ( 0.5 + Σ_{j=1}^n w_kj^td(t) x_j^p )
(See the sketch below.)
CS5486 178
Vigilance Parameter
in ART1 Network
 Value ranges between 0 and 1.
 A user-chosen design parameter to
control the sensitivity of the clustering.
 The larger its value is, the more
homogenous the data are in each cluster.
 Determined in an ad hoc way.

CS5486 179
Vigilance Parameter
in ART1 Network
 Vigilance ρ sets the granularity of clustering; it defines the basin of attraction of each cluster.
 Low threshold (small ρ): large mismatches accepted; few large clusters; imprecise.
 High threshold (large ρ): small mismatches accepted; many small clusters; fragmented, but higher precision.
CS5486 180
Illustrative Example

CS5486 181
Hopfield Networks
 Invented by John Hopfield at Princeton
University in 1980’s.
 Used as associative memories or
optimization models.
 Single-layer recurrent neural networks.
 The discrete-time model uses bipolar
threshold logic units and the continuous-
time model uses unipolar sigmoid activation
function. CS5486 182
Hopfield Networks
The Hopfield networks are the classical recurrent
neural networks.
John Hopfield, “Neural networks and physical
systems with emergent collective computational
abilities,” PNAS, USA, vol. 79, 1982.

“As far as public visibility goes, the modern era


in neural networks dates from the publication of
this paper.” – J. A. Anderson and E. Rosenfeld
(1988).
CS5486 183
Hopfield Networks

CS5486 184
Discrete-Time Hopfield Network

u (t  1)  Wv(t )  x
ui (t  1)   wij v j (t )  xi , i
j

vi (t )  sgn( ui (t ))  {1,1}

CS5486 185
Stability Analysis
E[v(t)] = -(1/2) Σ_i Σ_{j≠i} w_ij v_i(t) v_j(t) - Σ_i x_i v_i(t)

ΔE = E[v(t+1)] - E[v(t)]
   = -(1/2) Σ_i Σ_{j≠i} w_ij v_i(t+1) v_j(t+1) - Σ_i x_i v_i(t+1)
     + (1/2) Σ_i Σ_{j≠i} w_ij v_i(t) v_j(t) + Σ_i x_i v_i(t)

With asynchronous updating (only neuron k changes its state) and symmetric weights with zero diagonal, this reduces to
   ΔE = -(1/2) Σ_i (w_ik + w_ki) v_i(t) [v_k(t+1) - v_k(t)] - x_k [v_k(t+1) - v_k(t)]
      = -[v_k(t+1) - v_k(t)] [ Σ_i w_ik v_i(t) + x_k ] = -Δv_k(t) u_k ≤ 0
CS5486 186
Stability Conditions
 Stability: lim_{t→∞} v(t) = v̄
 Sufficient conditions:
   1. w_ij = w_ji, w_ii = 0, for i, j = 1, 2, ..., n
   2. Activation is conducted asynchronously; i.e., the state updating from v(t) to v(t+1) is performed for one neuron per iteration.
CS5486 187
Stability Properties
 If W is symmetric with zero diagonal elements
and the activation is conducted asynchronously
(i.e., one neuron at one time), then the discrete-
time Hopfield network is stable (a sufficient
condition).
 If W is symmetric with zero diagonal elements
and the activation is conducted synchronously,
then the discrete-time Hopfield network is either
stable or oscillates in a limit cycle of two states.
CS5486 188
Associative Memory
 Learning and memory are the two most
important cognitive functions in brain-like
intelligence
 As important as learning
 Also known as content-addressable
memory
 Fundamentally different from the existing
computer memory
CS5486 189
Associative Memory
Memory is a process of acquiring and
storing information such that it will be
available when needed.
It is well known that human memory
is far more robust and fault tolerant
than existing computer memory.
Human Brain and Memory

Figure courtesy of the Frontotemporal dementia group (FRONTIER), Prince of


Wales Medical Research Institute, Sydney NSW 2031 Australia
Associative Memory

As the original source of human


intelligence, the brain has a powerful
ability of association.
Associative memories are content-
addressable mechanisms for storing
prototype patterns such that the stored
patterns can be retrieved with the
recalling probes (cues).
Associative Memory
When given a probe (e.g., noisy or corrupted
version of a prototype pattern), the retrieval
dynamics of an associate memory should
converge to an equilibrium representing the
prototype pattern.
In associative memories, the stored patterns are
associated with their retrieval probes internally
in a robust and fault-tolerant way.
Auto-association vs.
Hetero-association
There are two types of associative memories:
auto-associative and hetero-associative
memories.
An auto-associative memory retrieves a
previously stored pattern that closely
resembles the recalling probe.
In a hetero-associative memory, the retrieved
pattern is generally different from the probe in
content or format.
Auto-association vs.
Hetero-association
Hetero-associative: A → α, B → β.   Auto-associative: A' → A, B' → B.
For example, ID → Name (hetero-association), A → A (auto-association).
Hetero-Associations

Input patterns (e.g., images of Albert Einstein and Charles Kao) are mapped to desired output patterns (their names):
   X = [ x_i^p ],   Y = [ y_i^p ]
Such hetero-associations can be realized with
1. Feedforward neural networks
2. Recurrent neural networks
(Figure: images → names)
Memory Processes
 Storage  Retrieval
(Information (Information
encoding) decoding)
 Given a set of  Given any probe (key
prototype patterns or cue), recall the
to be memorized, corresponding
place them into prototype patterns in
the memory the memory.
indefinitely.
Discrete-Time Hopfield Network
as an Associative Memory
• Storage: outer-product weight matrix
   w_ij = Σ_{p=1}^P s_i^p s_j^p - P δ_ij,   W = Σ_{p=1}^P s^p (s^p)^T - P I,   s^p ∈ {-1, 1}^n
• Retrieval (recall):
   v(0) = s^q
   u(1) = W v(0) = Σ_{p=1}^P s^p (s^p)^T s^q - P s^q
        = s^q (s^q)^T s^q + Σ_{p≠q} s^p (s^p)^T s^q - P s^q
(A minimal sketch follows below.)
CS5486 198
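A minimal sketch of outer-product storage and asynchronous recall (Python with NumPy assumed; pattern length, number of patterns, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, P = 64, 3
S = rng.choice([-1, 1], size=(P, n))          # P bipolar prototype patterns

W = S.T @ S - P * np.eye(n)                   # outer-product storage, zero diagonal

def recall(probe, steps=5 * n):
    """Asynchronous recall: update one randomly chosen neuron per step."""
    v = probe.copy()
    for _ in range(steps):
        i = rng.integers(n)
        v[i] = 1 if W[i] @ v >= 0 else -1     # v_i = sgn(net input), no external input
    return v

noisy = S[0].copy()
noisy[rng.choice(n, size=8, replace=False)] *= -1   # corrupt 8 of the 64 bits
print(np.array_equal(recall(noisy), S[0]))          # usually True: prototype recovered
```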
Discrete-Time Hopfield Network
as an Associative Memory
 If {s^p} is orthogonal, i.e., (s^p)^T s^q = 0 for p ≠ q, then the second term in the recall formula (cross-talk or noise) is zero.
 If (s^q)^T s^q = ||s^q||_2^2 = n > P, then v(1) = s^q.
 If {s^p} is not orthogonal, then for a small variation of the probe patterns the Hopfield network can still recall the correct patterns.
CS5486 199
Illustrative Example

Stored Memory
PatternsCS5486 Recall 200
Limitations
 Very limited capacity: about n/(2 log n), where n is the memory length
 Many spurious states; e.g., -s^q

CS5486 201
Discrete-Time Hopfield Network
as an Optimization Model
 Formulate the energy function according to the objective function and constraints of a given optimization problem:
   E(v) = -(1/2) v^T W v - x^T v,   v ∈ {-1, 1}^n
 Form a Hopfield network, then update the states asynchronously until convergence.
 Shortcoming: slow convergence due to asynchrony.
CS5486 202
Bidirectional Associative Memories
(BAM)
 Also known as hetero-associative memories
and resonance networks.
 A generalization of auto-associative
memories.
 Proposed by Bart Kosko of University of
Southern California in 1988.
 Using bipolar signum activation functions.

CS5486 203
BAM Architecture

CS5486 204
Bidirectional Associative Memories
(BAM)
x  R n , y (t )  sgn(Wx (t ))
y  R m , x (t  1)  sgn(W T y (t ))
P
W  y
p 1
p
(x p
)T

P
sgn(Wx )  sgn(  y p ( x
q p
)T x q )
p 1

 sgn( y q || x q || 22   y p ( x p
)T x q )  y q ,
pq

P
sgn(W T
y )  sgn(  x
q p
( y p )T y q )
p 1

 sgn( x q || y q || 22   x p
( y p )T y q )  x q ,
pq
CS5486 205
Continuous-time Hopfield Network
   du/dt = -u/τ + W v + x
   du_i/dt = -u_i/τ + Σ_j w_ij v_j + x_i, for all i
   v_i = 1/(1 + exp(-λ u_i))
CS5486 206
Continuous-time Hopfield Network
Continuous-time
model [Hopfield,
PNAS, vol. 81,
1984]

CS5486 207
Continuous-time Hopfield Network

CS5486 208
Stability Analysis
1 1 v j 1
E    wij vi v j    f (v)dv   x j v j
2 i j j  0
j

dE  E dvi ui dvi
   ( wij v j   xi ) 
dt i  vi dt i j  dt
dui dvi df (ui ) dui 2 dui
   ( )  0,   0.
i dt dt i dui dt dt
CS5486 209
High Gain Unipolar Sigmoid
Activation Function
   f(u) = 1/(1 + exp(-λu)),   u = f^{-1}(v) = (1/λ) ln( v/(1 - v) )
   ∫_0^v f^{-1}(x) dx = (1/λ) [ v ln v + (1 - v) ln(1 - v) ]
   If λ → ∞, then ∫_0^v f^{-1}(x) dx → 0.
CS5486 210
Continuous-Time Hopfield Network
as an Optimization Model
 Formulate the energy function according to the objective function and constraints of a given optimization problem:
   E(v) = -(1/2) v^T W v - x^T v,   v ∈ [0, 1]^n
 Synthesize a continuous-time Hopfield network; an equilibrium state is then a local minimum of the energy function.
CS5486 211
Traveling Salesman Problem
Find a complete route by visiting each city once and only once.
Checking all possible routes: (N - 1)!/2.
For example, N = 30 => 4.4 x 10^30 routes.
If we compute 10^12 routes per second => we need 10^18 seconds, or about 31,709,791,984 years!
In most practical problems N >> 30.
A continuous Hopfield network can compute a good solution to the TSP in a parallel and distributed manner.
CS5486 212
Vertex Path Representation

The Hopfield network approach to the TSP involves arranging the neurons in such a way that they represent the decision variables; for an N-city problem we would require N^2 neurons.
N of them will be turned ON, with the remainder turned OFF.
CS5486 213
Vertex Path Representation
Objective Function and Constraints
 Constraints (permutation matrix):
– One neuron should be on in each row
– One neuron should be on in each column
 Objective function:
– Total distance should be minimized
 Determine the network weights and bias so that
the Lyapunov function is minimized when
constraints are met and objective function is
minimum
CS5486 215
Objective Function to be Minimized
V_ia: binary state variable of the neuron in position (i, a) in the matrix or 2-D array
d_ij: distance between city i and city j

Total distance of a route:


Lyapunov Function to be Minimized

 : for minimizing the tour length
 : is non-zero if more than one neuron is on in each row
 : is non-zero if more than one neuron is on in each column
 : ensures that there is a total of N neurons ON
CS5486 217
Constructing the Hopfield Network

We should select weights and currents so that two


following equations become equal

First we make output voltages double subscripts:

CS5486 218
Constructing the Hopfield Network

Note first that w_ia,jb multiplies the second-order terms V_ia V_jb and i_ia multiplies the first-order terms V_ia.
First order terms should be equal:

CS5486 219
Constructing the Hopfield Network

Let’s treat the four sets of second-order terms


separately

The second-order C terms are given by

 =-C

CS5486 220
Constructing the Hopfield Network

If , then

corresponding term is

where the Kronecker delta function δ_ij = 1 if i = j, and 0 otherwise.
Similarly,

CS5486 221
Constructing the Hopfield Network
D term contributes an amount to the Lyapunov
function only when or
So

Bringing together all four components of the


weights, we have

CS5486 222
Parameter Selection
 Appropriate values for the parameters A,B,C,D
and  must be determined
 Tank and Hopfield used A = B =D =250 , C =
1000 and  = 50
 Tank and Hopfield applied it to random 10-city
maps and found that, overall, in about 50% of
cases, the method found the optimum route from
among the 181,440 distinct paths.

CS5486 223
Ten-City Example

CS5486 224
Optimal Route

CS5486 225
Local Minima

CS5486 226
Simulated Annealing
 Annealing is a metallurgical process in which a material is heated and then slowly brought to a lower temperature to let the molecules assume optimal positions.
 Simulated annealing simulates the physical
annealing process mathematically for global
optimization of nonconvex objective
function.
CS5486 227
Updating Probability
   P_ΔE = exp(-ΔE/T),  if ΔE > 0
   P_ΔE = 1,  if ΔE ≤ 0
   where T > 0, T is gradually decreased, and lim_{t→∞} T = 0.
The tangent of the probability function intersects the horizontal axis at T.
CS5486 228
Updating Probability
   P_ΔE = 1/(1 + exp(ΔE/T))
   where T > 0, T is gradually decreased, and lim_{t→∞} T = 0.
The tangent of the probability function intersects the horizontal axis at 2T.
CS5486 229
Decreasing Temperature

CS5486 230
Descending Energy

CS5486 231
Sample TSP Solution

CS5486 232
Characteristics of
Simulated Annealing
 The higher the temperature, the higher the
probability of an energy increase.
 As the temperature approaches to zero, the
simulated annealing procedure becomes an
iterative improvement one.
 The temperature parameter has to be lowered gradually to avoid prematurity.
CS5486 233
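A minimal sketch of the acceptance rule and cooling schedule on a contrived one-dimensional nonconvex energy (the energy function, step size, and cooling rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    return x ** 2 + 10 * np.sin(3 * x)     # nonconvex: several local minima

x, T = 4.0, 10.0                           # initial state and temperature
for _ in range(5000):
    x_new = x + rng.normal(scale=0.5)                   # random perturbation
    dE = energy(x_new) - energy(x)
    if dE <= 0 or rng.random() < np.exp(-dE / T):       # accept uphill moves with prob. exp(-dE/T)
        x = x_new
    T *= 0.999                                          # lower the temperature gradually
print(round(x, 3), round(energy(x), 3))    # typically ends near the global minimum around x = -0.5
```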
Boltzmann Machine
 A stochastic recurrent neural network
invented by G. Hinton (Univ. of Toronto)
and T. Sejnowski (Salk Institute) in 1983.
 It has binary state variables in {-1, 1}^n with a probabilistic activation function.
 A parallel implementation of the simulated annealing procedure.
 It can be seen as a stochastic, generative counterpart of the Hopfield networks.
CS5486 234
Boltzmann Machine
   v_i ∈ {-1, 1},   u_i = Σ_j w_ij v_j + x_i,   w_ij = w_ji
   E(v) = -(1/2) Σ_i Σ_j w_ij v_i v_j - Σ_i x_i v_i
   ΔE(v_i) := E(-v_i) - E(v_i) = ( Σ_j w_ij v_j + x_i ) 2 v_i = 2 u_i v_i
   P(v_i → -v_i) = 1/(1 + exp(ΔE(v_i)/T)) = 1/(1 + exp(2 u_i v_i / T))
   P(v_i = -1 → 1) = 1/(1 + exp(-2 u_i / T)),   P(v_i = 1 → -1) = 1/(1 + exp(2 u_i / T))
CS5486 235
Mean Field Annealing Network
 A deterministic recurrent neural network.
 Based on mean-field theory.
 Continuous state variables on [-1, 1]^n.
 Uses a bipolar sigmoid activation function.
 Use a gradual decreasing temperature
parameter like simulated annealing.
 Used for combinatorial optimization.
CS5486 236
Mean Field Annealing Network
   u_i = Σ_j w_ij v_j + x_i,   v_i = [1 - exp(-2u_i/T)] / [1 + exp(-2u_i/T)] = tanh(u_i/T)
   T > 0, dT/dt < 0, lim_{t→∞} T = 0.  As T → 0, v_i → {-1, 1}.
   E(v) = -(1/2) Σ_i Σ_j w_ij v_i v_j - Σ_i x_i v_i,   v_i ∈ [-1, 1]
   Exp(v_i) = P(v_i = 1) - P(v_i = -1) = P(v_i = 1) - [1 - P(v_i = 1)] = 2 P(v_i = 1) - 1
            = 2/(1 + exp(-2u_i/T)) - 1 = [1 - exp(-2u_i/T)] / [1 + exp(-2u_i/T)] = tanh(u_i/T)
   v_i = tanh( -(1/T) ∂E(v)/∂v_i )
CS5486 237
Self-Organizing Maps (SOMs)
 Developed by Prof. T. Kohonen at Helsinki
University of Technology in Finland in
1970’s.
 A single-layer network with a winner-take-all layer, trained with an unsupervised learning algorithm.
 Formation of a topographic map through self-organization.
 Maps high-dimensional data to one- or two-dimensional feature maps.
CS5486 238
Data Clusters

CS5486 239
SOM Architecture

CS5486 240
Kohonen’s Learning Algorithm
1. (Initialization) Randomize w_ij(0) for i = 1, 2, ..., n; j = 1, 2, ..., m; set p = 1, t = 0.
2. (Distance) For datum x^p, compute d_j = Σ_{i=1}^n [ x_i^p - w_ij(t) ]^2.
3. (Minimization) Find k such that d_k = min_j d_j.
4. (Adaptation) For j ∈ N_k(t) and i = 1, 2, ..., n:
   Δw_ij(t) = η(t) [ x_i^p - w_ij(t) ],   0 < η < 1, dη/dt < 0;
   set p = p + 1 and go to step 2 (Distance).
(See the sketch below.)
CS5486 241
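A minimal sketch of these steps for a one-dimensional chain of map units on 2-D data (Python with NumPy assumed; the map size and the learning-rate and neighbourhood schedules are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 2))                  # 2-D data, uniform in the unit square
m = 10                                        # 10 units on a 1-D chain
w = rng.random((m, 2))                        # 1. random initial weight vectors

T = 2000
for t in range(T):
    x = data[rng.integers(len(data))]
    d = np.sum((w - x) ** 2, axis=1)          # 2. squared distances to all units
    k = int(np.argmin(d))                     # 3. best-matching (winning) unit
    eta = 0.5 * (1 - t / T)                   # decreasing learning rate
    radius = max(1, int(3 * (1 - t / T)))     # shrinking neighbourhood N_k(t)
    for j in range(max(0, k - radius), min(m, k + radius + 1)):
        w[j] += eta * (x - w[j])              # 4. adapt winner and its neighbours
print(np.round(w, 2))                         # units spread out in a topologically ordered way
```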
Neighborhood in SOMs

CS5486 242
A Simple Example

CS5486 243
Kohonen’s Example

CS5486 244
CS5486 245
Echo State Network
 Proposed by Herbert Jaeger and Harald Haas at Jacobs University in 2004.
 Also called reservoir computing.
 It is a recurrent neural network with sparse
connections and random weights among
hidden neurons.

CS5486 246
ESN Architecture
input units output units
...
...

recurrent "dynamical
reservoir"

CS5486 247
State Equations
   u(n) = (u_1(n), ..., u_K(n))'
   x(n) = (x_1(n), ..., x_N(n))'
   y(n) = (y_1(n), ..., y_L(n))'
   W = (w_ij),  W^in = (w_ij^in),  W^out = (w_ij^out),  W^back = (w_ij^back)
   x(n+1) = f( W^in u(n+1) + W x(n) + W^back y(n) )
   y(n+1) = f^out( W^out (u(n+1), x(n+1), y(n)) )
(A minimal sketch follows below.)
CS5486 248
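A minimal sketch of running the reservoir and training only the linear readout by least squares (Python with NumPy assumed); the output feedback W^back is omitted, and the sizes, sparsity, spectral radius, and signals are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 1, 100                                             # input and reservoir sizes
W_in = rng.uniform(-0.5, 0.5, (N, K))
W = rng.uniform(-0.5, 0.5, (N, N)) * (rng.random((N, N)) < 0.1)   # sparse random reservoir
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))           # keep spectral radius below 1

def reservoir_states(u_seq):
    x, states = np.zeros(N), []
    for u in u_seq:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)      # x(n+1) = f(W_in u(n+1) + W x(n))
        states.append(x.copy())
    return np.array(states)

u_seq = np.sin(0.2 * np.arange(300))                      # example input signal
target = np.sin(0.2 * np.arange(300) + 0.5)               # example teacher signal
X = reservoir_states(u_seq)
W_out = np.linalg.lstsq(X, target, rcond=None)[0]         # train only the readout
print(round(float(np.mean((X @ W_out - target) ** 2)), 5))   # training error, typically very small
```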
Fuzzy Logic
 Developed by Prof. Lotfi Zadeh at the University of California, Berkeley, in 1965.
 A generalization of classical logic.
 Fuzzy logic describes one kind of
uncertainty: impreciseness or ambiguity.
 Probability, on the other hand, describes the
other kind of uncertainty: randomness.
CS5486 249
Membership Function
Let X be a classical set. A membership function of fuzzy set A, u_A: X -> [0, 1], defines the fuzzy set A of X.
Crisp sets are a special case of fuzzy sets, where the values of the membership function are 0 and 1 only.

CS5486 250
Membership Functions

CS5486 251
Membership Functions
Temp: {Freezing, Cool, Warm, Hot}
(Figure: degree of truth or "membership", from 0 to 1, of Freezing, Cool, Warm and Hot over temperature from 10 to 110 °F.)
252
Membership Functions

CS5486 253
Fuzzy Set
Fuzzy set A is the set of all pairs (x, u_A(x)) where x belongs to X; i.e.,
   A = { (x, u_A(x)) | x ∈ X }
If X is discrete, A = Σ_i u_A(x_i)/x_i
If X is continuous, A = ∫_X u_A(x)/x
The support set of A is supp(A) = { x ∈ X | u_A(x) > 0 }
CS5486 254
Discrete Fuzzy Set

   A = μ_A(x_1)/x_1 + μ_A(x_2)/x_2 + μ_A(x_3)/x_3 + ... + μ_A(x_n)/x_n
Example: middle age
   A = 0/32 + 0.1/33 + 0.2/34 + ... + 0.8/39 + 1/40 + 0.8/41 + ... + 0.1/47 + 0/48
(Figure: membership μ_F(x) over age, rising from 0 at age 32 to 1 at age 40 and falling back to 0 at age 48.)
255
Fuzzy Set

CS5486 256
Fuzzy Set Terminology
 Fuzzy singleton: a fuzzy set whose support set contains a single point only, with u_A(x) = 1.
 Crossover point: x ∈ X such that u_A(x) = 0.5.
 Kernel of a fuzzy set A: all x such that u_A(x) = 1; i.e., ker(A) = { x ∈ X | u_A(x) = 1 }.
 Height of a fuzzy set A: the supremum of u_A(x) over x; i.e., ht(A) = sup_{x∈X} u_A(x).
CS5486 257
Fuzzy Set Terminology
 Normalized fuzzy set A: its height is unity, i.e., ht(A) = 1; otherwise, it is subnormal.
 α-cut of a fuzzy set A: a crisp set A_α = { x ∈ X | u_A(x) ≥ α }.
 Convex fuzzy set A: for all λ ∈ [0, 1] and x, y ∈ X,
   u_A(λx + (1-λ)y) ≥ min( u_A(x), u_A(y) );
 i.e., any α-cut is a convex set.
CS5486 258
Logic Operations on Fuzzy Sets
 Union of two fuzzy sets: μ_{A∪B}(x) = max{ μ_A(x), μ_B(x) }
 Intersection of two fuzzy sets: μ_{A∩B}(x) = min{ μ_A(x), μ_B(x) }
 Complement of a fuzzy set: μ_¬A(x) = 1 - μ_A(x)
(See the sketch below.)
CS5486 259
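A minimal sketch of these max/min/complement operations on two hypothetical membership functions over a temperature universe (Python with NumPy assumed):

```python
import numpy as np

x = np.arange(0, 101)                               # universe of discourse: 0..100
mu_cool = np.clip((50 - x) / 30, 0, 1)              # membership of "cool" (illustrative shape)
mu_warm = np.clip((x - 40) / 30, 0, 1)              # membership of "warm"

mu_union     = np.maximum(mu_cool, mu_warm)         # union: max
mu_intersect = np.minimum(mu_cool, mu_warm)         # intersection: min
mu_not_cool  = 1 - mu_cool                          # complement

print(mu_union[45], mu_intersect[45], mu_not_cool[45])   # memberships at x = 45
```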
Logic Operations on Fuzzy Sets

CS5486 260
Cardinality and Entropy of Fuzzy
Sets
 Cardinality: |A| is defined as the sum of the membership function values of all elements in X; i.e.,
   |A| = Σ_{x∈X} μ_A(x)   or   |A| = ∫_X μ_A(x) dx
 Entropy: E(A) measures fuzziness and is defined as
   E(A) = |A ∩ ¬A| / |A ∪ ¬A|
CS5486 261
Cardinality of Fuzzy Sets

CS5486 262
Entropy of Fuzzy Sets

CS5486 263
Entropy of Fuzzy Sets

CS5486 264
Logic Operations on Fuzzy Sets
 Equality: for all x, u_A(x) = u_B(x)
 Degree of equality: E(A, B) = deg(A = B) = |A ∩ B| / |A ∪ B|
 Subset: A ⊆ B if u_A(x) ≤ u_B(x) for all x ∈ X
 Subsethood measure: S(A, B) = deg(A ⊆ B) = |A ∩ B| / |A|
CS5486 265
Properties of Fuzzy Sets

Union: A ⊆ A∪B, B ⊆ A∪B
Intersection: A∩B ⊆ A, A∩B ⊆ B
Double negation law: ¬(¬A) = A
De Morgan's laws: ¬(A∪B) = ¬A ∩ ¬B,  ¬(A∩B) = ¬A ∪ ¬B
However, A ∪ ¬A ≠ X and A ∩ ¬A ≠ ∅.
CS5486 266
Fuzzy Relations
   R(x_1, x_2, ..., x_n) = ∫_{X_1×X_2×...×X_n} u_R(x_1, x_2, ..., x_n) / (x_1, x_2, ..., x_n)
   R(x_1, x_2, ..., x_n) = { ((x_1, ..., x_n), u_R(x_1, ..., x_n)) | (x_1, ..., x_n) ∈ X_1 × X_2 × ... × X_n }
Binary fuzzy relations are most common.
   Reflexive: u_R(x, x) = 1
   Symmetric: u_R(x, y) = u_R(y, x)
   Transitive: u_R(x, z) ≥ max_y min[ u_R(x, y), u_R(y, z) ]
CS5486 267
Fuzzifiers and Defuzzifiers
 Fuzzifier: a mapping from a real-valued set to a fuzzy set by means of a membership function.
 Defuzzifier: a mapping from a fuzzy set to a real-valued set.

CS5486 268
Typical Defuzzifiers
 Centroid (also known as center of gravity or center of area) defuzzifier:
   x* = ∫_X x μ_A(x) dx / ∫_X μ_A(x) dx   or   x* = Σ_i x_i μ_A(x_i) / Σ_i μ_A(x_i)
 Center average (mean of maximum) defuzzifier:
   x* = Σ_i x_i ht(x_i) / Σ_i ht(x_i)
(See the sketch below.)
CS5486 269
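A minimal sketch of the discrete centroid formula applied to a hypothetical aggregated output membership function (Python with NumPy assumed):

```python
import numpy as np

x = np.linspace(0, 10, 1001)
# aggregated output: a tall peak near 3 and a half-height peak near 7 (illustrative)
mu = np.maximum(np.clip(1 - np.abs(x - 3) / 2, 0, 1),
                0.5 * np.clip(1 - np.abs(x - 7) / 2, 0, 1))

x_star = np.sum(x * mu) / np.sum(mu)      # discrete centroid (centre of gravity)
print(round(float(x_star), 2))            # about 4.3: weighted toward the taller peak at x = 3
```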
Centroid Defuzzifier

CS5486 270
Linguistic Variables
 Linguistic variables are important in fuzzy
logic and approximate reasoning.
 Linguistic variables are variables whose
values are words or sentences in natural or
artificial languages.
 For example, speed can be defined as a
linguistic variable and takes values of slow,
fast, and very fast.

CS5486 271
Fuzzy Inference Process
 When imprecise information is input to a
fuzzy inference system, it is first fuzzified
by constructing a membership function.
 Based on a fuzzy rule base, the fuzzy
inference engine makes a fuzzy decision.
 The fuzzy decision is then defuzzified to
output for an action.
 The defuzzification is usually done by using the centroid method.
CS5486 272
Fuzzy Inference Process

CS5486 273
Fuzzy Inference System

CS5486 274
An Electrical Heater Example
Rule Base:
R1: If temperature is cold, then increase
power.
R2: If temperature is normal, then maintain.
R3: If temperature is warm, then reduce
power.
At 12°, T = cold/0.5 + normal/0.3 + warm/0.0,
A = increase/0.5 + maintain/0.3 + reduce/0.0.
CS5486 275
An Electrical Heater Example

CS5486 276
An Electrical Heater Example

CS5486 277
Type-2 Fuzzy Logic
 A generalization of type-1 fuzzy logic to
handle the uncertainty of membership
function by using fuzzy membership
function
 Proposed by Prof. Zadeh in 1975 (ten years after type-1), but became popular only in recent years

CS5486 278
CS5486 279
Type-2 Fuzzy System

CS5486 280
Evolutionary Computation
 Population-based stochastic and meta-
heuristic search algorithms for global or
multi-objective optimization
 Motivated by the natural evolution based
on Darwinist or other principles
 Use collective wisdom to accomplish given
tasks via efforts of many generations.

CS5486 281
Swarm-Based Search
Swarm is better than individual

282
Evolutionary Algorithms
 Evolutionary programming (Lawrence
Fogel, 1960’s)
 Evolutionary strategies (Ingo Rechenberg and Hans-Paul Schwefel, 1960's)
 Genetic algorithms (John Holland, 1970's)
 Genetic programming (John Koza, 1990's)
CS5486 283
Genetic Algorithms
 A stochastic search method simulating the
evolution of population of living species.
 Optimize a fitness function which is not
necessarily continuous or differentiable.
 A genetic algorithm generates a population
of seeds instead of one in traditional
algorithms.
 The computation of the population can be
carried out in parallel.
CS5486 284
Elements in Genetic Algorithms
 A coding of the optimization problem to produce
the required discretization of decision variables in
terms of strings.
 A reproduction operator to copy individual strings
according to their fitness.
 A set of information-exchange operators; e.g.,
crossover, for recombination of search points to
generate new and better population of points.
 A mutation operator for modifying data.

CS5486 285
Reproduction Operator
1. Sum the fitness of all the production members
and call the result total fitness.
2. Generate a random number n between 0 and
total fitness under uniform distribution.
3. Return the first population member whose
fitness, added to the fitnesses of the preceding
population members (running total), is greater
than or equal to n.

CS5486 286
Crossover Operator
 Select offspring from the population
after reproduction.
 Two strings (parents) from the
reproduced population are paired with
probability Pc.
 Two new strings (offspring) are
created by exchanging bits at a
crossover site.
CS5486 287
Crossover Operation

CS5486 288
Crossover Operation

CS5486 289
Mutation Operator
 Reproduction and crossover produce
new string without introducing new
information into the population at bit
level.
 To inject new information into
offspring.
 Invert chosen bits randomly with a
lower probability Pm
CS5486 290
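A minimal sketch combining the three operators (roulette-wheel reproduction, one-point crossover, bit mutation) on the simple "one-max" fitness function; the fitness function, string length, and rates are illustrative assumptions (Python with NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
L, POP, PC, PM = 20, 30, 0.7, 0.01               # string length, population size, Pc, Pm
fitness = lambda s: s.sum()                       # "one-max": maximise the number of 1 bits

pop = rng.integers(0, 2, size=(POP, L))
for gen in range(60):
    fit = np.array([fitness(s) for s in pop], dtype=float)
    parents = pop[rng.choice(POP, size=POP, p=fit / fit.sum())]   # roulette-wheel reproduction
    children = parents.copy()
    for i in range(0, POP - 1, 2):                # one-point crossover with probability PC
        if rng.random() < PC:
            site = rng.integers(1, L)
            children[i, site:], children[i + 1, site:] = \
                parents[i + 1, site:].copy(), parents[i, site:].copy()
    flips = rng.random(children.shape) < PM       # mutate each bit with probability PM
    pop = np.where(flips, 1 - children, children)
print(max(fitness(s) for s in pop))               # approaches the optimum L = 20
```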
Mutation Operation

CS5486 291
Towers of Hanoi
 Towers of Hanoi puzzle: move a number
of disks from one peg to another among
three pegs, to restore the original piling
order.
 Any move must satisfy constraints: e.g.,
we can only move a disk that has no disk
above it.

CS5486 292
Towers of Hanoi

CS5486 293
Legal Moves

CS5486 294
Towers of Hanoi

CS5486 295
Move Representation

CS5486 296
Corresponding Moves

CS5486 297
GA Operations

CS5486 298
GA Flow Chart

CS5486 299
Optimization Process

CS5486 300
Encoder-Decoder Design

CS5486 301
Encoder-Decoder Design

CS5486 302
Swarm Intelligence
 Introduced by Gerardo Beni and Jing Wang in 1989 in the context of cellular robotic systems
 Typically made up of a population of
simple agents interacting locally with one
another and with their environment
 Typical representatives include particle
swarm optimization, ant colony
optimization, etc.
CS5486 303
Particle Swarm Optimization
 A robust stochastic optimization technique
based on the movement and intelligence of
swarms
 Applies the concept of social interaction to
problem solving
 It uses a number of agents (particles) that
constitute a swarm moving around in the
search space looking for the best solution
CS5486 304
Particle Swarm Optimization
 Each particle is treated as a point in a multi-
dimensional space which adjusts its
“flying” according to its own flying
experience as well as the flying experience
of other particles
• Developed in 1995 by James Kennedy
(social-psychologist) and Russell Eberhart
(electrical engineering professor).
CS5486 305
Fathers of PSO

Russ Eberhart James Kennedy

306
Psychosocial Compromise
Each particle at its current position x ("Here I am!") is attracted toward its own best performance so far, p_i ("my best perf."), and toward the best performance of its neighbours, p_g.
CS5486 307
Search Direction
At each time step t, for each particle and for each component k:
   update the velocity
     v_k(t+1) = v_k(t) + rand(0, φ_1) · ( p_ik - x_k(t) ) + rand(0, φ_2) · ( p_gk - x_k(t) )
   then move
     x(t+1) = x(t) + v(t+1)
(A minimal sketch follows below.)
CS5486 308
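A minimal sketch of this update rule, applied to the Schwefel-type fitness used in the example slides further below; the swarm size, the values of φ1 and φ2, the velocity clamp, and the position clipping are illustrative assumptions (Python with NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -np.sum(x * np.sin(np.sqrt(np.abs(x))), axis=-1)   # fitness to be maximised

n_particles, dim, phi1, phi2 = 30, 2, 2.0, 2.0
x = rng.uniform(-500, 500, (n_particles, dim))       # positions
v = np.zeros((n_particles, dim))                      # velocities
p_best = x.copy()                                     # personal bests p_i
g_best = x[np.argmax(f(x))].copy()                    # swarm (neighbourhood) best p_g

for _ in range(200):
    r1 = rng.uniform(0, phi1, (n_particles, dim))
    r2 = rng.uniform(0, phi2, (n_particles, dim))
    v = np.clip(v + r1 * (p_best - x) + r2 * (g_best - x), -100, 100)   # velocity update
    x = np.clip(x + v, -500, 500)                     # move, staying in the search space
    better = f(x) > f(p_best)
    p_best[better] = x[better]                        # update personal bests
    g_best = p_best[np.argmax(f(p_best))].copy()      # update the swarm best
print(np.round(g_best, 2), round(float(f(g_best)), 2))
# typically finds one of the high peaks; the global maximum for dim = 2 is about 837.97
```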
Random Proximity

(Figure: the next position lies in a random proximity of the current position x, the personal best p_i, and the neighbourhood best p_g.)
309
Global Optimization
(Figures, slides 310-317: successive snapshots of the swarm on a fitness landscape, with the search space on the horizontal axis and fitness from min to max on the vertical axis, illustrating how the swarm gradually concentrates around the best region.)
CS5486 310-317
Schwefel’s function

   f(x) = -Σ_{i=1}^n x_i sin( √|x_i| )
   where -500 ≤ x_i ≤ 500
Global optimum (maximized here): f(x*) = 418.9829 n at x_i = -420.9687, i = 1, 2, ..., n
CS5486 318
Initialization Swarm

319
Evolution after 5 Iterations

320
Evolution after 10 Iterations

321
Evolution after 15 Iterations

322
Evolution after 20 Iterations

323
Evolution after 25 Iterations

324
Evolution after 100 Iterations

325
Evolution after 500 Iterations

326
Search Results
Iteration   Swarm best
0           416.245599
5           515.748796
10          759.404006
15          793.732019
20          834.813763
100         837.911535
5000        837.965771
Global      837.9658
CS5486 327
That’s all for this course.

See you next semester.


Have a nice break!

CS5486 328
