Computer Science, KAIST
10/19/16
Contents
Introduction
Architecture
Designing
Learning strategies
MLP vs RBFN
Introduction
A completely different approach (compared with the MLP): the design of a neural network is viewed as a curve-fitting (approximation) problem in a high-dimensional space.
In MLP [figure]
In RBFN [figure]
Generalization
Use of this multidimensional surface to interpolate the test data
Architecture
[Network diagram: inputs x1, x2, ..., xn (input layer) feed hidden units h1, h2, ..., hm, whose outputs are combined through weights w1, w2, ..., wm to give the output f(x).]
Three layers
Input layer
Source nodes that connect the network to its environment
Hidden layer
Hidden units provide a set of basis functions
High dimensionality
Output layer
Linear combination of the hidden functions
f(x) = \sum_{j=1}^{m} w_j h_j(x)

where each h_j is a radial basis function with center c_j (the center of a region) and width r_j (the width of the receptive field).
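As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this forward pass, assuming Gaussian basis functions of the form exp(-||x - c_j||^2 / r_j^2); all names are illustrative:

```python
import numpy as np

def rbf_forward(x, centers, widths, weights):
    """Evaluate f(x) = sum_j w_j * h_j(x) for a single input vector x.

    centers: (m, d) array of basis-function centers c_j
    widths:  (m,) array of receptive-field widths r_j
    weights: (m,) array of output weights w_j
    """
    # Gaussian basis activations h_j(x) (one common choice of radial function)
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    h = np.exp(-sq_dist / widths ** 2)
    # Output layer: linear combination of the hidden activations
    return weights @ h

# Tiny usage example with made-up numbers
x = np.array([0.5, -1.0])
centers = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])
widths = np.array([1.0, 0.5, 2.0])
weights = np.array([0.3, -0.7, 1.2])
print(rbf_forward(x, centers, widths, weights))
```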
Designing
Requires:
Selection of the radial basis function width parameter
Number of radial basis neurons
A larger width gives a network of smaller size and faster execution.
Chosen by the designer:
Maximum number of neurons = number of inputs
Minimum number of neurons: experimentally determined
More neurons: more complex network, but smaller tolerance
Learning strategies
Two levels of learning:
Center and spread learning (or determination)
Output-layer weights learning
Self-organized selection of centers (1)
Hybrid learning:
Self-organized learning to estimate the centers of the RBFs in the hidden layer
Supervised learning to estimate the linear weights of the output layer (a sketch of this stage follows)
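The slides give no code for this hybrid scheme; the following is a minimal NumPy sketch of the second (supervised) stage, assuming the centers and widths have already been fixed and that the output weights are fitted by linear least squares. The Gaussian basis form and all names are illustrative assumptions:

```python
import numpy as np

def fit_output_weights(X, y, centers, widths):
    """Supervised stage of hybrid learning: solve for the linear output
    weights by least squares, with centers and widths already fixed."""
    # Hidden-layer design matrix H[n, j] = h_j(x_n), assuming Gaussian basis functions
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    H = np.exp(-sq_dist / widths[None, :] ** 2)
    # Least-squares solution of H w = y (pseudo-inverse)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w
```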
Self-organized selection of centers (2)
k-means clustering (see the sketch after the list):
1. Initialization
2. Sampling
3. Similarity matching
4. Updating
5. Continuation
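To make the five steps concrete, here is a minimal NumPy sketch of k-means used to pick the RBF centers. The slides' sequential sampling step is replaced by a batch assignment for brevity, and the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def kmeans_centers(X, m, n_iters=100, seed=0):
    """Select m RBF centers from data X (N, d) by k-means clustering."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick m distinct data points as initial centers
    centers = X[rng.choice(len(X), size=m, replace=False)]
    for _ in range(n_iters):                      # 5. Continuation
        # 2. Sampling + 3. Similarity matching: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. Updating: move each center to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(m)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```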
Learning formula (supervised gradient descent on the cost E(n)):

Linear weights of the output layer:
w_i(n+1) = w_i(n) - \eta_1 \frac{\partial E(n)}{\partial w_i(n)}, \quad i = 1, 2, \dots, M

Positions of the centers:
t_i(n+1) = t_i(n) - \eta_2 \frac{\partial E(n)}{\partial t_i(n)}, \quad i = 1, 2, \dots, M

Spreads of the centers:
\Sigma_i^{-1}(n+1) = \Sigma_i^{-1}(n) - \eta_3 \frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)}, \quad i = 1, 2, \dots, M
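A minimal NumPy sketch of one such gradient step for a single training pair (x, d), assuming Gaussian hidden units with scalar widths sigma_j in place of the full \Sigma_i^{-1} above; the learning rates correspond to \eta_1, \eta_2, \eta_3, and all names are illustrative:

```python
import numpy as np

def gradient_step(x, d, centers, sigmas, weights, lr_w=0.1, lr_c=0.01, lr_s=0.01):
    """One supervised gradient-descent step on E = 0.5 * (d - f(x))**2,
    assuming Gaussian units h_j(x) = exp(-||x - t_j||^2 / (2 sigma_j^2))."""
    diff = x - centers                              # (m, d) differences x - t_j
    sq = np.sum(diff ** 2, axis=1)                  # ||x - t_j||^2
    h = np.exp(-sq / (2 * sigmas ** 2))             # hidden activations
    e = d - weights @ h                             # output error
    # Gradients of E with respect to each parameter group
    grad_w = -e * h
    grad_c = -e * weights[:, None] * h[:, None] * diff / sigmas[:, None] ** 2
    grad_s = -e * weights * h * sq / sigmas ** 3
    # Parameter updates with separate learning rates (eta_1, eta_2, eta_3)
    weights -= lr_w * grad_w
    centers -= lr_c * grad_c
    sigmas -= lr_s * grad_s
    return weights, centers, sigmas
```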
MLP vs RBFN
MLP: global hyperplane; trained by error back-propagation (EBP); prone to local minima
RBFN: output weights trained by LMS
Approximation
MLP: a global network (all inputs cause an output)
Gaussian Mixture
Given a finite number of data points x_n, n = 1, ..., N, drawn from an unknown distribution, the probability density function p(x) of this distribution can be modeled by:
Parametric methods
Assume a known density function (e.g., Gaussian) to start with, then estimate its parameters by maximum likelihood.
For a data set of N vectors \chi = \{x_1, \dots, x_N\} drawn independently from the distribution p(x \mid \theta), the joint probability density of the whole data set is given by

p(\chi \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta) = L(\theta)
L(\theta) can be viewed as a function of \theta for fixed \chi; in other words, it is the likelihood of \theta for the given \chi.
The technique of maximum likelihood sets the value of \theta by maximizing L(\theta).
In practice, the negative logarithm of the likelihood is often considered, and the minimum of E is found.
For a normal distribution, the estimated parameters can be found by analytic differentiation of E:

E = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)
which yields

\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
\hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^{T}
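As a quick illustration (not from the slides), these maximum-likelihood estimates are simply the sample mean and sample covariance:

```python
import numpy as np

# Illustrative data: N points in d dimensions
X = np.random.default_rng(0).normal(size=(500, 2))

mu_hat = X.mean(axis=0)                    # (1/N) sum_n x_n
diff = X - mu_hat
sigma_hat = diff.T @ diff / len(X)         # (1/N) sum_n (x_n - mu)(x_n - mu)^T
print(mu_hat, sigma_hat)
```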
Non-parametric methods
Histograms
An illustration of the histogram approach to density estimation: a set of 30 sample data points is drawn from the sum of two normal distributions, with means 0.3 and 0.8, standard deviations 0.1, and amplitudes 0.7 and 0.3 respectively. The original distribution is shown by the dashed curve, and the histogram estimates are shown by the rectangular bins. The number M of histogram bins within the given interval determines the width of the bins, which in turn controls the smoothness of the estimated density.
Density estimation by basis functions, e.g., kernel functions, or k-NN
Discussions
The parametric approach assumes a specific form for the density function, which may be different from the true density, but the density function can be evaluated rapidly for new input vectors.
The mixture model is a linear combination of component densities p(x \mid j) in the form

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j), \qquad
\sum_{j=1}^{M} P(j) = 1, \quad 0 \le P(j) \le 1
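A minimal NumPy sketch of evaluating such a mixture with spherical Gaussian components (the component form matches the p(x | j) given below; all names are illustrative):

```python
import numpy as np

def gaussian_component(X, mu, sigma):
    """Spherical Gaussian density p(x | j) with mean mu and variance sigma**2."""
    d = X.shape[1]
    sq = np.sum((X - mu) ** 2, axis=1)
    return np.exp(-sq / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (d / 2)

def mixture_density(X, priors, mus, sigmas):
    """p(x) = sum_j P(j) p(x | j), evaluated for each row of X."""
    return sum(P_j * gaussian_component(X, mu_j, s_j)
               for P_j, mu_j, s_j in zip(priors, mus, sigmas))
```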
Analogous to conditional densities, and using Bayes' theorem, the posterior probabilities of the component densities can be derived as

P(j \mid x) = \frac{p(x \mid j)\, P(j)}{p(x)}, \qquad \sum_{j=1}^{M} P(j \mid x) = 1.

With spherical Gaussian components,

p(x \mid j) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left(-\frac{\|x - \mu_j\|^2}{2\sigma_j^2}\right)
Maximum likelihood
The mixture density contains adjustable parameters P(j), \mu_j and \sigma_j, where j = 1, \dots, M.
The negative log-likelihood for the data set \{x_n\} is given by

E = -\ln L = -\sum_{n=1}^{N} \ln p(x_n) = -\sum_{n=1}^{N} \ln \left\{ \sum_{j=1}^{M} p(x_n \mid j)\, P(j) \right\}

Differentiating with respect to the means \mu_j:

\frac{\partial E}{\partial \mu_j} = \sum_{n=1}^{N} P(j \mid x_n)\, \frac{\mu_j - x_n}{\sigma_j^2}

and with respect to the variances \sigma_j:

\frac{\partial E}{\partial \sigma_j} = \sum_{n=1}^{N} P(j \mid x_n) \left\{ \frac{d}{\sigma_j} - \frac{\|x_n - \mu_j\|^2}{\sigma_j^3} \right\}
Minimization of E with respect to the mixing parameters P(j) must be subject to the constraints \sum_j P(j) = 1 and 0 < P(j) < 1. This can be handled by expressing P(j) in terms of a set of M auxiliary variables \{\gamma_j\} such that

P(j) = \frac{\exp(\gamma_j)}{\sum_{k=1}^{M} \exp(\gamma_k)}

This transformation is called the softmax function, and the minimization of E with respect to \gamma_j uses

\frac{\partial P(k)}{\partial \gamma_j} = \delta_{jk} P(j) - P(j) P(k),
\qquad
\frac{\partial E}{\partial \gamma_j} = \sum_{k=1}^{M} \frac{\partial E}{\partial P(k)} \frac{\partial P(k)}{\partial \gamma_j} = -\sum_{n=1}^{N} \left\{ P(j \mid x_n) - P(j) \right\}
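For completeness (this step is not spelled out on the slides), the softmax derivative follows from the quotient rule:

\frac{\partial P(k)}{\partial \gamma_j}
= \frac{\delta_{jk} \exp(\gamma_k) \sum_l \exp(\gamma_l) - \exp(\gamma_k)\exp(\gamma_j)}{\left(\sum_l \exp(\gamma_l)\right)^2}
= \delta_{jk} P(k) - P(k) P(j)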
Setting \partial E / \partial \mu_j = 0, we obtain

\hat{\mu}_j = \frac{\sum_n P(j \mid x_n)\, x_n}{\sum_n P(j \mid x_n)}

Setting \partial E / \partial \sigma_j = 0, then

\hat{\sigma}_j^2 = \frac{1}{d}\, \frac{\sum_n P(j \mid x_n)\, \|x_n - \hat{\mu}_j\|^2}{\sum_n P(j \mid x_n)}

Setting \partial E / \partial \gamma_j = 0, then

\hat{P}(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid x_n)
In practice, we can make an initial guess for the parameters and use these formulas to compute revised values of the parameters.
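Read as update rules, the three re-estimation formulas take only a few lines; a minimal NumPy sketch, assuming the posteriors P(j | x_n) are already available as a matrix R (names are illustrative):

```python
import numpy as np

def m_step(X, R):
    """Re-estimation formulas, given responsibilities R[n, j] = P(j | x_n)."""
    N, d = X.shape
    Nj = R.sum(axis=0)                                   # sum_n P(j | x_n)
    mu = (R.T @ X) / Nj[:, None]                         # new means
    sq = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
    sigma2 = (R * sq).sum(axis=0) / (d * Nj)             # new variances
    priors = Nj / N                                      # new mixing coefficients
    return mu, sigma2, priors
```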
The EM algorithm
The iteration process consists of (1) expectation and (2) maximization steps, and is therefore called the EM algorithm.
We can write the change in the error E in terms of the old and new parameters:

E^{\text{new}} - E^{\text{old}} = -\sum_{n} \ln \frac{p^{\text{new}}(x_n)}{p^{\text{old}}(x_n)}

Using p(x) = \sum_j p(x \mid j)\, P(j),

E^{\text{new}} - E^{\text{old}} = -\sum_{n} \ln \left\{ \sum_{j} P^{\text{old}}(j \mid x_n)\, \frac{P^{\text{new}}(j)\, p^{\text{new}}(x_n \mid j)}{p^{\text{old}}(x_n)\, P^{\text{old}}(j \mid x_n)} \right\}

together with Jensen's inequality: for \lambda_j \ge 0 with \sum_j \lambda_j = 1,

\ln\!\left(\sum_j \lambda_j x_j\right) \ge \sum_j \lambda_j \ln x_j
Considering P^{\text{old}}(j \mid x_n) as \lambda_j, the change of E satisfies

E^{\text{new}} - E^{\text{old}} \le -\sum_{n} \sum_{j} P^{\text{old}}(j \mid x_n) \ln \left\{ \frac{P^{\text{new}}(j)\, p^{\text{new}}(x_n \mid j)}{p^{\text{old}}(x_n)\, P^{\text{old}}(j \mid x_n)} \right\}

Let Q be the right-hand side; then E^{\text{new}} \le E^{\text{old}} + Q, and E^{\text{old}} + Q is an upper bound of E^{\text{new}}.
As shown in the figure, minimizing Q will lead to a decrease of E^{\text{new}}, unless E^{\text{new}} is already at a local minimum.
Dropping the terms in Q that depend only on the old parameters, Q can be rewritten as

\tilde{Q} = -\sum_{n} \sum_{j} P^{\text{old}}(j \mid x_n) \ln \left\{ P^{\text{new}}(j)\, p^{\text{new}}(x_n \mid j) \right\}

For Gaussian components,

\tilde{Q} = -\sum_{n} \sum_{j} P^{\text{old}}(j \mid x_n) \left\{ \ln P^{\text{new}}(j) - d \ln \sigma_j^{\text{new}} - \frac{\|x_n - \mu_j^{\text{new}}\|^2}{2 (\sigma_j^{\text{new}})^2} \right\} + \text{const.}

Minimizing \tilde{Q} with respect to the new parameters gives

\mu_j^{\text{new}} = \frac{\sum_n P^{\text{old}}(j \mid x_n)\, x_n}{\sum_n P^{\text{old}}(j \mid x_n)}, \qquad
(\sigma_j^{\text{new}})^2 = \frac{1}{d}\, \frac{\sum_n P^{\text{old}}(j \mid x_n)\, \|x_n - \mu_j^{\text{new}}\|^2}{\sum_n P^{\text{old}}(j \mid x_n)}
For the mixing parameters P^{\text{new}}(j), the constraint \sum_j P^{\text{new}}(j) = 1 can be taken into account with a Lagrange multiplier, giving

P^{\text{new}}(j) = \frac{1}{N} \sum_{n} P^{\text{old}}(j \mid x_n)

Since only P^{\text{old}}(j \mid x_n) terms appear on the right-hand side, these results are ready for iterative computation.
Exercise 2: shown on the nets.
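Putting the E-step (posterior responsibilities) and the M-step (the re-estimation formulas above) together, a self-contained NumPy sketch of the resulting iteration for spherical Gaussians might look like this; it is an illustrative implementation, not code from the slides:

```python
import numpy as np

def em_spherical_gmm(X, M, n_iters=100, seed=0):
    """EM for a mixture of M spherical Gaussians, following the update rules above.
    All names and the initialization scheme are illustrative choices."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initial guess: random data points as means, unit variances, uniform priors
    mu = X[rng.choice(N, size=M, replace=False)]
    sigma2 = np.ones(M)
    priors = np.full(M, 1.0 / M)
    for _ in range(n_iters):
        # E-step: responsibilities R[n, j] = P(j | x_n)
        sq = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
        comp = np.exp(-sq / (2 * sigma2)) / (2 * np.pi * sigma2) ** (d / 2)
        joint = comp * priors                      # p(x_n | j) P(j)
        R = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimation formulas
        Nj = R.sum(axis=0)
        mu = (R.T @ X) / Nj[:, None]
        sq = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
        sigma2 = (R * sq).sum(axis=0) / (d * Nj)
        priors = Nj / N
    # Negative log-likelihood E = -sum_n ln p(x_n) under the final parameters
    sq = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)
    comp = np.exp(-sq / (2 * sigma2)) / (2 * np.pi * sigma2) ** (d / 2)
    E = -np.log((comp * priors).sum(axis=1)).sum()
    return mu, sigma2, priors, E
```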