
Radial Basis Function Networks

Computer Science, KAIST

236875 Visual Recognition, 10/19/16

Contents

Introduction
Architecture
Designing
Learning strategies
MLP vs RBFN


Introduction
A completely different approach: the design of a neural network is viewed as a curve-fitting (approximation) problem in a high-dimensional space (cf. the MLP).


Introduction

In MLP


Introduction

In RBFN


Introduction

Radial Basis Function Network

A kind of supervised neural network
Design of the NN as a curve-fitting problem
Learning
find the surface in a multidimensional space that best fits the training data
Generalization
use this multidimensional surface to interpolate the test data

Introduction

Radial Basis Function Network

Approximates a function with a linear combination of radial basis functions:
F(x) = \sum_i w_i h_i(x)
h_i(x) is usually a Gaussian function


Architecture

[Figure: network diagram with inputs x1, x2, x3, ..., xn (input layer), radial basis units h1, h2, h3, ..., hm (hidden layer), weights W1, W2, W3, ..., Wm, and the output f(x) (output layer)]

Architecture

Three layers
Input layer
Source nodes that connect the network to its environment
Hidden layer
Hidden units provide a set of basis functions
High dimensionality
Output layer
Linear combination of the hidden functions

Architecture

Radial basis function

f(x) = \sum_{j=1}^{m} w_j h_j(x)

h_j(x) = \exp\left( -\frac{\|x - c_j\|^2}{r_j^2} \right)

where
c_j is the center of a region,
r_j is the width of the receptive field
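
To make this concrete, here is a minimal NumPy sketch of the forward pass defined by these formulas; the centers C, widths r, and weights w are assumed to be given already (they come from the learning strategies discussed later), and all names are illustrative.

```python
import numpy as np

def rbf_forward(X, C, r, w):
    """Evaluate f(x) = sum_j w_j * exp(-||x - c_j||^2 / r_j^2) for each row of X.
    X: (N, d) inputs, C: (m, d) centers, r: (m,) widths, w: (m,) output weights."""
    # squared Euclidean distance between every input and every center: (N, m)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / r**2)   # hidden-layer activations h_j(x)
    return H @ w             # linear combination in the output layer

# tiny usage example with made-up numbers
X = np.array([[0.0, 0.0], [1.0, 1.0]])
C = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.array([1.0, 0.5, 0.5])
w = np.array([0.3, 0.5, 0.2])
print(rbf_forward(X, C, r, w))
```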


Designing
Requires
Selection of the radial basis function width parameter
Number of radial basis neurons


Designing

Selection of the RBF width parameter
Not required for an MLP
Smaller width
alerts on untrained test data
Larger width
network of smaller size & faster execution


Designing

Number of radial basis neurons
Chosen by the designer
Maximum number of neurons = number of inputs
Minimum number of neurons: determined experimentally
More neurons
more complex network, but smaller tolerance


Learning strategies
Two levels of learning
Center and spread learning (or determination)
Output-layer weights learning

Keep the number of parameters as small as possible

Curse of dimensionality


Learning strategies

Various learning strategies
They differ in how the centers of the radial basis functions of the network are specified:
Fixed centers selected at random
Self-organized selection of centers
Supervised selection of centers


Learning strategies

Fixed centers selected at random (1)

Fixed RBFs of the hidden units
The locations of the centers may be chosen randomly from the training data set.
We can use different values of centers and widths for each radial basis function -> experimentation with the training data is needed.


Learning strategies

Fixed centers selected at random (2)

Only the output-layer weights need to be learned.
Obtain the output-layer weights by the pseudo-inverse method (see the sketch below).
Main problem
Requires a large training set for a satisfactory level of performance
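
A possible NumPy sketch of this strategy, assuming Gaussian basis functions with one common width r: the centers are drawn at random from the training set and np.linalg.pinv supplies the pseudo-inverse of the hidden-layer design matrix. Function and variable names are illustrative, not from the slides.

```python
import numpy as np

def train_rbf_fixed_centers(X, y, m, r, rng=np.random.default_rng(0)):
    """Pick m centers at random from the training data, then solve for the
    output-layer weights with the pseudo-inverse: w = pinv(H) @ y."""
    idx = rng.choice(len(X), size=m, replace=False)
    C = X[idx]                                            # fixed, randomly chosen centers
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / r**2)                                # design matrix of basis outputs
    w = np.linalg.pinv(H) @ y                             # least-squares output weights
    return C, w

# quick check: fit a noisy 1-D target
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.05 * np.random.default_rng(1).normal(size=50)
C, w = train_rbf_fixed_centers(X, y, m=10, r=1.0)
```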


Learning strategies

Self-organized selection of centers (1)
Hybrid learning
self-organized learning to estimate the centers of the RBFs in the hidden layer
supervised learning to estimate the linear weights of the output layer

Self-organized learning of centers by means of clustering
Supervised learning of output weights by the LMS algorithm

Learning strategies

Self-organized selection of centers (2)

k-means clustering (a NumPy sketch follows this list):
1. Initialization
2. Sampling
3. Similarity matching
4. Updating
5. Continuation
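
The NumPy sketch below follows these five steps in their online form (one randomly sampled vector per update); the learning rate and step count are arbitrary illustrative choices.

```python
import numpy as np

def kmeans_centers(X, m, lr=0.1, steps=1000, rng=np.random.default_rng(0)):
    """Online k-means for selecting RBF centers.
    1. Initialization: start from m randomly chosen data points.
    2. Sampling: draw one training vector x at a time.
    3. Similarity matching: find the nearest center.
    4. Updating: move that center a small step toward x.
    5. Continuation: repeat for a fixed number of steps."""
    C = X[rng.choice(len(X), size=m, replace=False)].copy()
    for _ in range(steps):
        x = X[rng.integers(len(X))]                    # sampling
        k = np.argmin(((C - x) ** 2).sum(axis=1))      # similarity matching
        C[k] += lr * (x - C[k])                        # updating
    return C
```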


Learning strategies

Supervised selection of centers

All free parameters of the network are adjusted by a supervised learning process.
Error-correction learning using the LMS algorithm.


Learning strategies

Learning formulas

Linear weights (output layer):

\frac{\partial E(n)}{\partial w_i(n)} = \sum_{j=1}^{N} e_j(n)\, G(\|x_j - t_i(n)\|_{C_i})

w_i(n+1) = w_i(n) - \eta_1 \frac{\partial E(n)}{\partial w_i(n)}, \qquad i = 1, 2, \ldots, M

Positions of centers (hidden layer):

\frac{\partial E(n)}{\partial t_i(n)} = 2 w_i(n) \sum_{j=1}^{N} e_j(n)\, G'(\|x_j - t_i(n)\|_{C_i})\, \Sigma_i^{-1} [x_j - t_i(n)]

t_i(n+1) = t_i(n) - \eta_2 \frac{\partial E(n)}{\partial t_i(n)}, \qquad i = 1, 2, \ldots, M

Spreads of centers (hidden layer):

\frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)} = -w_i(n) \sum_{j=1}^{N} e_j(n)\, G'(\|x_j - t_i(n)\|_{C_i})\, Q_{ji}(n), \qquad Q_{ji}(n) = [x_j - t_i(n)][x_j - t_i(n)]^T

\Sigma_i^{-1}(n+1) = \Sigma_i^{-1}(n) - \eta_3 \frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)}
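
To illustrate these update rules, here is a hedged NumPy sketch of a single gradient-descent step for a simplified isotropic network, with G_i(x) = exp(-||x - t_i||^2 / (2 sigma_i^2)) and a scalar width per unit instead of the full weighted norm ||.||_{C_i}; the error is e_j = d_j - f(x_j) and the learning rates are illustrative.

```python
import numpy as np

def supervised_rbf_step(X, d, w, T, sigma, eta=(0.01, 0.01, 0.01)):
    """One supervised step on weights, centers, and widths (isotropic simplification).
    X: (N, dim) inputs, d: (N,) targets, w: (M,), T: (M, dim) centers, sigma: (M,)."""
    eta_w, eta_t, eta_s = eta
    diff = X[:, None, :] - T[None, :, :]                  # x_j - t_i, shape (N, M, dim)
    dist2 = (diff ** 2).sum(axis=2)                       # (N, M)
    G = np.exp(-dist2 / (2 * sigma**2))                   # hidden activations
    e = d - G @ w                                         # errors e_j = d_j - f(x_j)

    grad_w = -(e[:, None] * G).sum(axis=0)                # dE/dw_i
    grad_t = -(e[:, None, None] * w[None, :, None] * G[:, :, None]
               * diff / (sigma**2)[None, :, None]).sum(axis=0)              # dE/dt_i
    grad_s = -(e[:, None] * w[None, :] * G * dist2 / sigma**3).sum(axis=0)  # dE/dsigma_i

    return w - eta_w * grad_w, T - eta_t * grad_t, sigma - eta_s * grad_s
```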


MLP vs RBFN
MLP                               | RBFN
Global hyperplane                 | Local receptive field
EBP learning                      | LMS learning
Local minima                      | Serious local minima
Smaller number of hidden neurons  | Larger number of hidden neurons
Shorter computation time          | Longer computation time
Longer learning time              | Shorter learning time


MLP vs RBFN

Approximation
MLP: global network
All inputs cause an output
RBF: local network
Only inputs near a receptive field produce an activation
Can give a "don't know" output


Gaussian Mixture
Given a finite number of data points x^n, n = 1, ..., N, drawn from an unknown distribution, the probability density p(x) of this distribution can be modeled by
Parametric methods
Assume a known density function (e.g., Gaussian) to start with, then
estimate its parameters by maximum likelihood.
For a data set of N vectors \chi = \{x^1, \ldots, x^N\} drawn independently from the distribution p(x \mid \theta), the joint probability density of the whole data set is given by

p(\chi \mid \theta) = \prod_{n=1}^{N} p(x^n \mid \theta) \equiv L(\theta)


Gaussian Mixture
L(\theta) can be viewed as a function of \theta for fixed \chi; in other words, it is the likelihood of \theta for the given \chi.
The technique of maximum likelihood sets the value of \theta by maximizing L(\theta).
In practice, the negative logarithm of the likelihood is often considered, and the minimum of E is found.
For a normal distribution, the estimated parameters can be found by analytic differentiation of E:

E = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x^n \mid \theta)

\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x^n, \qquad \hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} (x^n - \hat{\mu})(x^n - \hat{\mu})^T
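
For a single Gaussian these maximum-likelihood estimates are just the sample mean and sample covariance, as in this short NumPy sketch (X is assumed to be an N x d data array):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum-likelihood mean and covariance of a single Gaussian fitted to X (N, d)."""
    mu = X.mean(axis=0)                 # (1/N) sum_n x^n
    diff = X - mu
    cov = diff.T @ diff / len(X)        # (1/N) sum_n (x^n - mu)(x^n - mu)^T
    return mu, cov
```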


Gaussian Mixture
Non-parametric methods
Histograms
An illustration of the histogram approach to density estimation. The set of 30 sample data points is drawn from the sum of two normal distributions, with means 0.3 and 0.8, standard deviations 0.1, and amplitudes 0.7 and 0.3 respectively. The original distribution is shown by the dashed curve, and the histogram estimates are shown by the rectangular bins. The number M of histogram bins within the given interval determines the width of the bins, which in turn controls the smoothness of the estimated density.


Gaussian Mixture
Density estimation by basis functions, e.g., kernel functions, or k-nn

(a) kernel function, (b) k-nn
Examples of kernel and k-nn approaches to density estimation.

Gaussian Mixture
Discussions
The parametric approach assumes a specific form for the density function, which may be different from the true density, but
the density function can be evaluated rapidly for new input vectors.

Non-parametric methods allow very general forms of density functions; thus the number of variables in the model grows directly with the number of training data points.
The model cannot be rapidly evaluated for new input vectors.

A mixture model combines both: (1) it is not restricted to a specific functional form, and (2) the size of the model grows only with the complexity of the problem being solved, not with the size of the data set.

Gaussian Mixture
The mixture model is a linear combination of component densities p(x \mid j) in the form

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)

The P(j) are the mixing parameters, with

\sum_{j=1}^{M} P(j) = 1, \qquad 0 \le P(j) \le 1

The component density functions are normalized,

\int p(x \mid j)\, dx = 1,

and hence p(x \mid j) can be regarded as a class-conditional density.
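
The mixture density can be evaluated directly from this definition; the sketch below assumes isotropic Gaussian components of the form used a few slides later, with means mu (M x d), variances var (M,), and mixing coefficients P (M,). All names are illustrative.

```python
import numpy as np

def mixture_density(X, P, mu, var):
    """p(x) = sum_j P(j) p(x|j) with isotropic Gaussian components.
    X: (N, d) inputs, P: (M,) mixing coefficients, mu: (M, d), var: (M,) sigma_j^2."""
    d = X.shape[1]
    dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (N, M)
    comp = np.exp(-dist2 / (2 * var)) / (2 * np.pi * var) ** (d / 2)     # p(x|j)
    return comp @ P                                                      # (N,) values of p(x)
```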

Gaussian Mixture

The key difference between the mixture model representation and a true classification problem lies in the nature of the training data, since in this case we are not provided with any class labels to say which component was responsible for generating each data point.
This is the so-called representation of incomplete data.
However, the technique of mixture modeling can be applied separately to each class-conditional density p(x \mid C_k) in a true classification problem.
In this case, each class-conditional density p(x \mid C_k) is represented by an independent mixture model of the form

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)


Gaussian Mixture
Analogously to conditional densities, and using Bayes' theorem, the posterior probabilities of the component densities can be derived as

P(j \mid x) = \frac{p(x \mid j)\, P(j)}{p(x)}, \qquad \sum_{j=1}^{M} P(j \mid x) = 1.

The value of P(j \mid x) represents the probability that component j was responsible for generating the data point x.
Restricting attention to Gaussian distributions, the individual component densities are given by:

p(x \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\left( -\frac{\|x - \mu_j\|^2}{2\sigma_j^2} \right),

with mean \mu_j and covariance matrix \Sigma_j = \sigma_j^2 I.

The parameters of a Gaussian mixture can be determined by:
(1) maximum likelihood, (2) the EM algorithm.
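
Combining the Gaussian component density with Bayes' theorem gives the posteriors P(j|x); a minimal NumPy sketch under the same isotropic assumption (argument names are illustrative):

```python
import numpy as np

def posteriors(X, P, mu, var):
    """P(j|x) = p(x|j) P(j) / p(x); returns an (N, M) matrix whose rows sum to 1."""
    d = X.shape[1]
    dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    comp = np.exp(-dist2 / (2 * var)) / (2 * np.pi * var) ** (d / 2)     # p(x|j)
    joint = comp * P                                                     # p(x|j) P(j)
    return joint / joint.sum(axis=1, keepdims=True)                      # divide by p(x)
```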

Gaussian Mixture

Representation of the mixture model in terms of a network diagram. For a component density p(x \mid j), the lines connecting the inputs x_i to the component p(x \mid j) represent the elements \mu_{ji} of the corresponding mean vector \mu_j of component j.


Maximum likelihood
The mixture density contains adjustable parameters: P(j), \mu_j and \sigma_j, where j = 1, ..., M.
The negative log-likelihood for the data set \{x^n\} is given by:

E = -\ln L = -\sum_{n=1}^{N} \ln p(x^n) = -\sum_{n=1}^{N} \ln \left[ \sum_{j=1}^{M} p(x^n \mid j)\, P(j) \right]

Maximizing the likelihood is then equivalent to minimizing E.

Differentiating E with respect to the centres \mu_j:

\frac{\partial E}{\partial \mu_j} = \sum_{n=1}^{N} P(j \mid x^n)\, \frac{(\mu_j - x^n)}{\sigma_j^2}

and with respect to the variances \sigma_j:

\frac{\partial E}{\partial \sigma_j} = \sum_{n=1}^{N} P(j \mid x^n) \left( \frac{d}{\sigma_j} - \frac{\|x^n - \mu_j\|^2}{\sigma_j^3} \right)


Maximum likelihood
Minimizing E with respect to the mixing parameters P(j) must be subject to the constraints \sum_j P(j) = 1 and 0 \le P(j) \le 1. This can be handled by expressing P(j) in terms of a set of M auxiliary variables \{\gamma_j\} such that:

P(j) = \frac{\exp(\gamma_j)}{\sum_{k=1}^{M} \exp(\gamma_k)}

This transformation is called the softmax function, and for the minimization of E with respect to \gamma_j we use

\frac{\partial P(k)}{\partial \gamma_j} = \delta_{jk} P(j) - P(j) P(k)

together with the chain rule in the form

\frac{\partial E}{\partial \gamma_j} = \sum_{k=1}^{M} \frac{\partial E}{\partial P(k)} \frac{\partial P(k)}{\partial \gamma_j}

to obtain

\frac{\partial E}{\partial \gamma_j} = -\sum_{n=1}^{N} \left\{ P(j \mid x^n) - P(j) \right\}
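
The softmax reparameterization and the resulting gradient fit in a few lines of NumPy; this is only a sketch for checking the formulas numerically (post is assumed to be an (N, M) array of P(j|x^n)).

```python
import numpy as np

def softmax(gamma):
    """P(j) = exp(gamma_j) / sum_k exp(gamma_k); shifting by the max improves stability."""
    e = np.exp(gamma - gamma.max())
    return e / e.sum()

def grad_E_gamma(post, P):
    """dE/dgamma_j = -sum_n { P(j|x^n) - P(j) }."""
    return -(post - P).sum(axis=0)
```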


Maximum likelihood
Setting \frac{\partial E}{\partial \mu_j} = 0, we obtain

\hat{\mu}_j = \frac{\sum_n P(j \mid x^n)\, x^n}{\sum_n P(j \mid x^n)}

Setting \frac{\partial E}{\partial \sigma_j} = 0, we obtain

\hat{\sigma}_j^2 = \frac{1}{d}\, \frac{\sum_n P(j \mid x^n)\, \|x^n - \hat{\mu}_j\|^2}{\sum_n P(j \mid x^n)}

Setting \frac{\partial E}{\partial \gamma_j} = 0, we obtain

\hat{P}(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid x^n)

These formulae give some insight into the maximum likelihood solution, but they do not provide a direct method for calculating the parameters, i.e., the formulae are expressed in terms of P(j \mid x).
They do, however, suggest an iterative scheme for finding the minimum of E.

Maximum likelihood

We can make some initial guess for the parameters and use these formulae to compute revised values of the parameters:

- using \mu_j and \sigma_j, compute p(x^n \mid j), and

- using P(j), p(x^n \mid j), and Bayes' theorem, compute P(j \mid x^n).

Then use P(j \mid x^n) to estimate new parameters.

Repeat this process until it converges (a sketch of the full loop follows).
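
Putting the scheme together, an EM-style loop for an isotropic Gaussian mixture might look like the sketch below; the initialization, the stopping test on the change of the negative log-likelihood E, and all names are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def fit_gaussian_mixture(X, M, iters=100, tol=1e-6, rng=np.random.default_rng(0)):
    """Iterative maximum-likelihood fit of an isotropic Gaussian mixture."""
    N, d = X.shape
    mu = X[rng.choice(N, size=M, replace=False)].copy()   # initial guess: random data points
    var = np.full(M, X.var())                             # initial variances sigma_j^2
    P = np.full(M, 1.0 / M)                               # initial mixing coefficients
    prev_E = np.inf
    for _ in range(iters):
        # compute p(x^n | j) from the current mu and sigma
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        comp = np.exp(-dist2 / (2 * var)) / (2 * np.pi * var) ** (d / 2)
        # Bayes' theorem: P(j | x^n) = p(x^n | j) P(j) / p(x^n)
        joint = comp * P
        px = joint.sum(axis=1)
        post = joint / px[:, None]
        # re-estimate the parameters from P(j | x^n)
        Nj = post.sum(axis=0)
        mu = (post.T @ X) / Nj[:, None]
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (post * dist2).sum(axis=0) / (d * Nj)
        P = Nj / N
        # repeat until the negative log-likelihood stops decreasing
        E = -np.log(px).sum()
        if prev_E - E < tol:
            break
        prev_E = E
    return P, mu, var
```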


The EM algorithm
The iteration process consists of (1) expectation and (2) maximization steps, and is therefore called the EM algorithm.
We can write the change in the error E in terms of the old and new parameters as:

E^{new} - E^{old} = -\sum_n \ln \frac{p^{new}(x^n)}{p^{old}(x^n)}

Using p(x) = \sum_j p(x \mid j)\, P(j), we can rewrite this as follows:

E^{new} - E^{old} = -\sum_n \ln \left[ \sum_j \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)} \cdot \frac{P^{old}(j \mid x^n)}{P^{old}(j \mid x^n)} \right]

Using Jensen's inequality: given a set of numbers \lambda_j \ge 0 such that \sum_j \lambda_j = 1,

\ln \left( \sum_j \lambda_j x_j \right) \ge \sum_j \lambda_j \ln x_j

The EM algorithm
Taking P^{old}(j \mid x^n) as the \lambda_j, the change of E satisfies

E^{new} - E^{old} \le -\sum_n \sum_j P^{old}(j \mid x^n) \ln \left[ \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)} \right]

Let Q denote the right-hand side; then E^{new} - E^{old} \le Q, and E^{old} + Q is an upper bound on E^{new}.
As shown in the figure, minimizing Q leads to a decrease of E^{new}, unless E^{new} is already at a local minimum.

[Figure] Schematic plot of the error function E as a function of the new value \theta^{new} of one of the parameters of the mixture model. The curve E^{old} + Q(\theta^{new}) provides an upper bound on the value of E(\theta^{new}), and the EM algorithm involves finding the minimum value of this upper bound.

The EM algorithm
Let's drop the terms in Q that depend only on the old parameters, and rewrite Q as

\tilde{Q} = -\sum_n \sum_j P^{old}(j \mid x^n) \ln \left[ P^{new}(j)\, p^{new}(x^n \mid j) \right]

The smallest value for the upper bound is found by minimizing this quantity \tilde{Q}.
For the Gaussian mixture model, the quantity \tilde{Q} can be written as

\tilde{Q} = -\sum_n \sum_j P^{old}(j \mid x^n) \left\{ \ln P^{new}(j) - d \ln \sigma_j^{new} - \frac{\|x^n - \mu_j^{new}\|^2}{2 (\sigma_j^{new})^2} \right\} + \text{const.}

We can now minimize this function with respect to the new parameters, which gives:

\mu_j^{new} = \frac{\sum_n P^{old}(j \mid x^n)\, x^n}{\sum_n P^{old}(j \mid x^n)}, \qquad
(\sigma_j^{new})^2 = \frac{1}{d}\, \frac{\sum_n P^{old}(j \mid x^n)\, \|x^n - \mu_j^{new}\|^2}{\sum_n P^{old}(j \mid x^n)}
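
These minimizers translate directly into NumPy; in the sketch below, post is assumed to hold P^old(j|x^n) as an (N, M) array and X is (N, d), with all names illustrative.

```python
import numpy as np

def m_step(X, post):
    """Closed-form minimizers of Q~ for an isotropic Gaussian mixture.
    post[n, j] = P_old(j | x^n)."""
    N, d = X.shape
    Nj = post.sum(axis=0)                                           # sum_n P_old(j|x^n)
    mu_new = (post.T @ X) / Nj[:, None]                             # new means
    dist2 = ((X[:, None, :] - mu_new[None, :, :]) ** 2).sum(axis=2)
    var_new = (post * dist2).sum(axis=0) / (d * Nj)                 # new (sigma_j^new)^2
    return mu_new, var_new
```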


The EM algorithm

For the mixing parameters P^{new}(j), the constraint \sum_j P^{new}(j) = 1 can be taken into account by using a Lagrange multiplier \lambda and minimizing the combined function

Z = \tilde{Q} + \lambda \left( \sum_j P^{new}(j) - 1 \right)

Setting the derivative of Z with respect to P^{new}(j) to zero gives

0 = -\sum_n \frac{P^{old}(j \mid x^n)}{P^{new}(j)} + \lambda

Using \sum_j P^{new}(j) = 1 and \sum_j P^{old}(j \mid x^n) = 1, we obtain \lambda = N, and thus

P^{new}(j) = \frac{1}{N} \sum_n P^{old}(j \mid x^n)

Since only the P^{old}(j \mid x^n) terms appear on the right-hand side, these results are ready for iterative computation.
Exercise 2: shown on the nets.
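
The corresponding one-line update for the mixing coefficients, with post again assumed to hold P^old(j|x^n) as an (N, M) array:

```python
import numpy as np

def update_mixing(post):
    """P_new(j) = (1/N) sum_n P_old(j | x^n)."""
    return post.mean(axis=0)
```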
