They are intended to support both the instructors in the development and delivery
of course content and the learners in acquiring the ideas and techniques
presented in the book in a more pleasant way than just reading.
TOPICS in the INTRODUCTORY PART
Black-Box: No previous knowledge, but there are measurements, observations, records, data.
White-Box: Structured knowledge (experience, expertise or heuristics). IF-THEN rules are the most typical examples of structured knowledge.
[Figure: a neural network with inputs x1, ..., x5, hidden-layer weights V, output-layer weights W, and outputs o1, ..., o4.]
Example: Controlling the distance between two cars.
R1: IF the speed is low AND the distance is small THEN the force on the brake should be small.
R2: IF the speed is medium AND the distance is small THEN the force on the brake should be big.
R3: IF the speed is high AND the distance is small THEN the force on the brake should be very big.
This is the most common gray box situation covered by the paradigm of
Neuro-Fuzzy or Fuzzy-Neuro models.
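As an illustration only, the three braking rules above can be turned into a small computable model. This is a minimal sketch, not the book's method: the triangular membership functions, their breakpoints, and the weighted-average defuzzification are all assumed for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership function; flat shoulder when a == b or b == c."""
    left = (x - a) / (b - a) if b > a else 1.0
    right = (c - x) / (c - b) if c > b else 1.0
    return max(0.0, min(left, right, 1.0))

def brake_force(speed, distance):
    """Evaluate rules R1-R3 (AND realized as min) and defuzzify by a weighted average."""
    small_dist = tri(distance, 0, 0, 30)            # "distance is small"
    r1 = min(tri(speed, 0, 0, 50), small_dist)      # R1: speed is low
    r2 = min(tri(speed, 30, 60, 90), small_dist)    # R2: speed is medium
    r3 = min(tri(speed, 70, 120, 120), small_dist)  # R3: speed is high
    forces = [10.0, 60.0, 100.0]                    # small, big, very big (illustrative values)
    acts = [r1, r2, r3]
    s = sum(acts)
    return sum(a * f for a, f in zip(acts, forces)) / s if s > 0 else 0.0

print(brake_force(speed=45, distance=10))           # somewhere between "small" and "big"
```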
If we do not have any prior knowledge AND we do not have any measurements (by all accounts a very hopeless situation indeed), it may be hard to expect or believe that the problem at hand may be approached and solved easily. This is a no-color box situation.
Classical approximation techniques in NN graphical appearance
FOURIER SERIES
The AMPLITUDES and PHASES of the sine (cosine) waves are unknown, but the frequencies are known, because Mr Joseph Fourier has selected the frequencies for us -> they are INTEGER multiples of some pre-selected base frequency. And the problem is LINEAR!!! It is the same with POLYNOMIALS.
F(x) = Σk=1..N ak sin(kx), or Σk=1..N bk cos(kx), or both.
[Figure: an NN with input x, a hidden layer of prescribed sine units y1, ..., yJ (weights V prescribed), and output weights w giving o = F(x).]
BUT, what if we want to learn the frequencies? Then it is a NONLINEAR LEARNING PROBLEM!!!
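With the frequencies prescribed, fitting the amplitudes really is a one-shot linear least-squares problem. A minimal sketch (the target function, the noise level, and the number of terms N are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50)
# Noisy samples of an unknown 1-D dependency
y = 1.0 * np.sin(x) + 0.5 * np.sin(3 * x) + 0.05 * rng.standard_normal(x.size)

# Design matrix: sin(kx) for k = 1..N -- the frequencies are prescribed,
# so the model F(x) = sum_k a_k sin(kx) is LINEAR in the unknown amplitudes a_k.
N = 5
Phi = np.column_stack([np.sin(k * x) for k in range(1, N + 1)])

a, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # one-shot linear solution
print(np.round(a, 2))                         # amplitudes close to [1, 0, 0.5, 0, 0]
```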
Now, we want to find the Fourier series model y = w2 sin(w1 x) of the underlying dependency y = 2.5 sin(1.5x), known to us but not to the learning machine (algorithm).
[Figure: a single-neuron network: input x, hidden-layer weight w1, netHL, a sine activation in the hidden layer (HL), output-layer weight w2, output o, and error o - d.]
We know that the function is a sine but we don't know its frequency and amplitude. Thus, by using the training data set {x, d}, we want to model this system with the NN model consisting of a single neuron in the HL (having the sine as an activation function), as given above.
[Figure: the cost function J = Σ(d - w2 sin(w1 x))² and its dependence upon the amplitude A = w2 (dashed) and the frequency w1 (solid), together with the J(w1, w2) surface over frequency and amplitude.]
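This nonlinearity is easy to reproduce numerically: gradient descent on the frequency w1 and amplitude w2 reaches y = 2.5 sin(1.5x) only from a good enough starting point, because the cost surface J(w1, w2) is not convex. A minimal sketch (learning rate, number of epochs, and the initialization are illustrative assumptions; this is not the LEARNSC code):

```python
import numpy as np

x = np.linspace(-3, 3, 41)          # training inputs
d = 2.5 * np.sin(1.5 * x)           # desired outputs (unknown to the learner)

w1, w2 = 1.2, 1.0                   # initial frequency and amplitude (illustrative)
eta = 0.02                          # learning rate
for epoch in range(5000):
    o = w2 * np.sin(w1 * x)                         # model output
    e = o - d                                       # errors o - d
    # Gradients of J = mean(e**2) with respect to w1 and w2
    g1 = 2 * np.mean(e * w2 * np.cos(w1 * x) * x)
    g2 = 2 * np.mean(e * np.sin(w1 * x))
    w1 -= eta * g1
    w2 -= eta * g2

# w1, w2 approach 1.5 and 2.5 for this initialization;
# a poor initial frequency typically gets trapped in a local minimum of J.
print(round(w1, 3), round(w2, 3))
```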
Another classical approximation scheme is a
POLYNOMIAL SERIES:
F(x) = Σi=0..N wi xⁱ
[Figure: an NN with input x, a hidden layer of prescribed polynomial units y1, ..., yJ (weights V prescribed), and output weights w giving o = F(x).]
With prescribed (integer) exponents this is again a LINEAR APPROXIMATION SCHEME: linear in terms of the parameters to learn, and not in terms of the resulting approximation function, which is an NL function for i > 1.
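Because only the weights wi are learned, a polynomial fit is again a one-shot linear least-squares problem; a minimal sketch (the target polynomial and the degree N are illustrative choices):

```python
import numpy as np

x = np.linspace(-1, 1, 30)
y = 0.5 - 2.0 * x + 3.0 * x**3           # underlying dependency (any smooth target)

N = 4                                     # highest prescribed exponent
X = np.vander(x, N + 1, increasing=True)  # columns 1, x, x^2, ..., x^N
w, *_ = np.linalg.lstsq(X, y, rcond=None) # linear in the weights w_i
print(np.round(w, 2))                     # recovers [0.5, -2, 0, 3, 0]
```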
The following approximation scheme is a NOVEL ONE, called an NN or RBF network here.
Approximation of some NL 1-D function by Gaussian Radial Basis Functions (RBF):
F(x) = Σi=1..N wi φi(x, ci)
[Figure: measured points (stars) of a nonlinear 1-D function yk, approximated by F(x) as a weighted sum of Gaussian bumps φ1, ..., φN centred at c1, ..., cN; e.g. wi+1 φi+1 with wi+1 > 0 and wi+2 φi+2 with wi+2 < 0.]
In the 1-D case, forget these two inputs. They are here just to denote that the basic structure of the NN is the same for ANY-DIMENSIONAL INPUT.
[Figure: the same network structure with inputs x1, ..., xi, ..., xn+1, hidden-layer basis functions y1, ..., yJ (weights vji), output weights wj, and output o = F(x).]
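With the centres ci and the width fixed, the RBF weights wi are also obtained by a single linear least-squares step. A minimal 1-D sketch (centre placement, width, and target function are illustrative assumptions):

```python
import numpy as np

def gaussian(x, c, sigma):
    """Gaussian radial basis function phi(x, c)."""
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

x = np.linspace(0, 10, 25)                    # training inputs
y = np.sin(x) + 0.1 * x                       # some NL 1-D function

centers = x[::3]                              # centres placed on a subset of the data
sigma = 1.0                                   # fixed width
Phi = np.column_stack([gaussian(x, c, sigma) for c in centers])

w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear in the weights w_i
F = Phi @ w                                   # F(x) = sum_i w_i * phi_i(x, c_i)
print(np.max(np.abs(F - y)))                  # small approximation error on the data
```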
Approximation of some NL 2-D function by Gaussian Radial Basis Functions (RBF).
[Figure: a 2-D nonlinear surface reconstructed by a Gaussian RBF NN from the available data (measurements, images, records, observations).]
This is a Neural Network:
F(x) = Σj=1..J wj φj(x, cj, σj)
[Figure: network with inputs x1, ..., xi, ..., xn, hidden-layer basis functions y1, ..., yJ (weights vji), output weights wj, and output o = F(x).]
and, this is a Support Vector Machine.
AND AGAIN !!! This is a Neural Network, and, this is a Support Vector Machine.
There is no difference in representational capacity, i.e., in structure.
However, there is an important difference in LEARNING.
Therefore, let's talk about the basics of identification, estimation, regression, classification, pattern recognition, function approximation, curve or surface fitting, etc.
All these tasks have been solved before. Thus, THERE IS THE QUESTION:
Is there anything new with respect to the classical statistical inference?
The classical regression and (Bayesian) classification statistical
techniques are based on the very strict assumption that probability
distribution models or probability-density functions are known.
[Figure: dependency of the modeling error on the training data set size l. The error approaches a final error as l grows; small-sample, medium-sample, and large-sample regions are indicated.]
Glivenko-Cantelli-Kolmogorov results.
The Glivenko-Cantelli theorem states that the empirical distribution function converges to the true one:
Pemp(x) -> P(x) as the number of data l -> ∞.
However, for both regression and classification we need probability-density functions p(x), i.e., p(x|ωi), and not a distribution P(x).
Therefore, there is a question: whether pemp(x) -> p(x) as the number of data l -> ∞, where ∫ p(x) dx = P(x).
(Analogy with the classic problem Ax = y, x = A⁻¹y!)
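The different behaviour of the empirical distribution function and of an empirical density estimate can be seen in a few lines; a rough sketch (Gaussian data and a fixed-bin histogram estimate are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
for l in (50, 500, 5000):
    x = np.sort(rng.standard_normal(l))
    # Empirical distribution function P_emp at the sample points
    P_emp = np.arange(1, l + 1) / l
    cdf_err = np.max(np.abs(P_emp - norm.cdf(x)))        # Glivenko-Cantelli: -> 0
    # Histogram (empirical) density estimate p_emp on fixed bins
    p_emp, edges = np.histogram(x, bins=30, range=(-4, 4), density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    pdf_err = np.max(np.abs(p_emp - norm.pdf(mid)))      # noticeably larger error for the same l
    print(l, round(cdf_err, 3), round(pdf_err, 3))
```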
What is needed here is the theory of UNIFORM CONVERGENCE of the set of functions implemented by a model, i.e., by a learning machine (Vapnik and Chervonenkis, 1960s-70s).
Gender recognition problem: are these two faces female or male?
[Figure: two face photographs, each labelled "F or M?"]
There must be something in the geometry of our faces.
Just a few words about the data set size and dimensionality. Approximation and classification are the same for any dimensionality of the input space. Nothing (!) but the size changes. But the change is DRASTIC. High dimensionality means both an EXPLOSION in the number OF PARAMETERS to learn and a SPARSE training data set.
High dimensional spaces seem to be terrifyingly empty.
[Figure: to keep the same data density, a 1-dimensional problem with N data points requires N² data points in 2 dimensions and Nⁿ data points in n dimensions.]
High dimensional spaces are terrifyingly empty.
Many different remedies have been proposed to resolve this CURSE of DIMENSIONALITY and the SPARSITY OF DATA.
One of the proposals is to utilize the BLESSING of SMOOTHNESS: REGULARIZATION THEORY and RBF networks, which contain classical splines.
CLASSIFICATION or PATTERN RECOGNITION EXAMPLE
Assume Normally distributed classes with the same covariance matrices. The solution is easy: the decision boundary is linear and defined by the parameter vector w = X*D when there is plenty of data (infinitely many). X* denotes the PSEUDOINVERSE.
[Figure: two Gaussian classes in the (x1, x2) plane with desired values d1 = +1 and d2 = -1, separated by a linear decision boundary.]
Note that this solution follows from the last two assumptions in classical inference: Gaussian data and minimization of the sum-of-error-squares!
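A quick sketch of that plenty-of-data solution: with desired values d = +1 or -1 and a bias column appended to the data matrix X, the weight vector follows from the pseudoinverse (the two Gaussian classes below are generated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two normally distributed classes with the same covariance matrix
X1 = rng.standard_normal((200, 2)) + [2, 2]    # class 1, d = +1
X2 = rng.standard_normal((200, 2)) + [-2, -2]  # class 2, d = -1

X = np.vstack([X1, X2])
X = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias input
d = np.hstack([np.ones(200), -np.ones(200)])

w = np.linalg.pinv(X) @ d                      # w = X* d, X* = pseudoinverse
pred = np.sign(X @ w)
print(np.mean(pred == d))                      # classification accuracy on the data
```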
However, for a small sample this is no longer the case. The question is how to DEFINE and FIND the OPTIMAL SEPARATING HYPERPLANE GIVEN (scarce) DATA SAMPLES???
STATISTICAL LEARNING THEORY WAS DEVELOPED TO SOLVE this problem. The OPTIMAL SEPARATING HYPERPLANE is the one that has the LARGEST MARGIN on the given DATA SET.
Thus, let's show, a little more formally, the constructive part of statistical learning theory.
Vapnik and Chervonenkis introduced a nested set of hypothesis (approximating or decision) functions. Nested set means that
H1 ⊂ H2 ⊂ ... ⊂ Hn-1 ⊂ Hn.
Before presenting the basic ideas, some clarification of terminology is highly desirable, because a lot of confusion is caused by non-unified terminology:
Approximation, or training, error eapp ~ Empirical risk ~ Bias,
Estimation error eest ~ Variance ~ Confidence in the training error ~ VC confidence interval,
Generalization (true, expected) error egen ~ Bound on the test error ~ Guaranteed, or true, risk.
[Figure: risk curves showing the underfitting and overfitting regions of the error.]
Therefore, in order to reduce the level of confusion for first-time readers, i.e., beginners in the world of SVMs, we developed the LEARNSC software, which should elucidate all these mathematical objects, their relations, and the spaces they live in.
The software, which can be downloaded from the same site, will show the kind of figures below and slightly more complex ones.
[Figure: The decision boundary, or separating line, is the intersection of d(x, w, b) = wTx + b = 0 and the input plane (x1, x2). The desired values are y = +1 and y = -1, the indicator function is iF(x, w, b) = sign(d), and stars denote the support vectors. The optimal separating hyperplane d(x, w, b) is an argument of the indicator function.]
E = Σi=1..P (di - f(xi, w))²                Classical multilayer perceptron (closeness to data)
E = Σi=1..P (di - f(xi, w))² + ||Pf||²      Regularization (RBF) NN
3) Nonlinear Classifier.
The MARGIN M IS DEFINED by w as follows:
M = 2 / ||w||,
subject to yi[wTxi + b] ≥ 1, i = 1, ..., l,
where l denotes the number of training data and NSV stands for the number of SVs.
Note that maximization of M means a minimization of ||w||.
Minimization of the norm of the hyperplane's normal weight vector, ||w|| = sqrt(wTw) = sqrt(w1² + w2² + ... + wn²), leads to a maximization of the margin M.
Because sqrt(f) is a monotonic function, its minimization is equivalent to a minimization of f.
Consequently, a minimization of the norm ||w|| equals a minimization of wTw = w1² + w2² + ... + wn², and this leads to a maximization of the margin M.
Thus the problem to solve is:
minimize J = wTw = ||w||²
subject to the constraints
yi[wTxi + b] ≥ 1 (correct classification!).
[Figure: Class 1 (y = +1) and Class 2 (y = -1) in the (x1, x2) feature plane, separated by the maximum-margin line.]
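This QP can be solved with any SVM package; a minimal sketch using scikit-learn's SVC, where a linear kernel and a very large C approximate the hard-margin classifier (the toy data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in the (x1, x2) feature plane
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],    # class 1, y = +1
              [4.0, 4.0], [4.5, 3.0], [5.0, 4.5]])   # class 2, y = -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
M = 2.0 / np.linalg.norm(w)         # margin M = 2 / ||w||
print(w, b, M)
print(clf.support_vectors_)         # the "stars" lying on the margin
```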
Now, the SVM should be constructed.
[Figure: the separating plane f = 0 in a 3-dimensional feature space (x1, x2, x3), with the region f > 0 on one side.]
The last plane, in a 3-dimensional feature space, will be produced by
the following NN structure.
[Figure: the NN structure with INPUT, HIDDEN and OUTPUT LAYERS that produces this plane in the 3-dimensional feature space (x1, x2, x3).]
[Figure: a network mapping the input x = [x1 x2 x3] into a 9-dimensional feature space through the basis functions φ1(x), ..., φ9(x): the inputs x1, x2, x3, their squares (x1)², (x2)², (x3)², and the cross-products x1x2, x2x3, x1x3. The feature-space weights w1, ..., w9 and the bias b form d(x), and the indicator function is iF = sign(d(x)).]
Now, we apply the kernel trick.
Ld(α) = Σi=1..l αi - ½ Σi,j=1..l yi yj αi αj ziT zj,  or
Ld(α) = Σi=1..l αi - ½ Σi,j=1..l yi yj αi αj K(xi, xj),
and the constraints are
αi ≥ 0, i = 1, ..., l.
In a more general case, because of noise or generic class features, there will be an overlapping of training data points. Nothing but the constraints change, as for the soft margin classifier above. Thus, the nonlinear soft margin classifier will be the solution of the quadratic optimization problem given above, subject to the constraints
C ≥ αi ≥ 0, i = 1, ..., l, and
Σi=1..l αi yi = 0.
The decision hypersurface is given by
d(x) = Σi=1..l yi αi K(x, xi) + b.
We see that the final structure of the SVM is equal to the NN model. In essence, it is a weighted linear combination of some kernel (basis) functions. We'll show these (hyper)surfaces in the simulations later.
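The expansion d(x) = Σ yi αi K(x, xi) + b can be evaluated directly from the support vectors of a trained classifier. A small sketch (RBF kernel, gamma, and the toy data are illustrative; scikit-learn stores the products yi·αi in dual_coef_):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # an XOR-like, nonlinearly separable task

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

def d(x):
    """Decision hypersurface d(x) = sum_i y_i alpha_i K(x, x_i) + b."""
    sv = clf.support_vectors_
    coef = clf.dual_coef_[0]                      # y_i * alpha_i for each support vector
    K = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    return np.dot(coef, K) + clf.intercept_[0]

x_test = np.array([1.0, 1.0])
print(d(x_test), clf.decision_function([x_test])[0])   # the two values agree
```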
Regression by SVMs
Initially developed for solving classification problems, SV techniques can be successfully applied in regression, i.e., to functional approximation problems (Drucker et al., 1997; Vapnik et al., 1997). Unlike pattern recognition problems (where the desired outputs yi are discrete values, e.g., Boolean), here we deal with real-valued functions.
Now, the general regression learning problem is set as follows: the learning machine is given l training data from which it attempts to learn the input-output relationship (dependency, mapping or function) f(x).
A training data set D = {[x(i), y(i)] ∈ ℝⁿ × ℝ, i = 1, ..., l} consists of l pairs (x1, y1), (x2, y2), ..., (xl, yl), where the inputs x are n-dimensional vectors x ∈ ℝⁿ and the system responses y ∈ ℝ are continuous values. The SVM considers approximating functions of the form
f(x, w) = Σi=1..N wi φi(x).
Vapnik introduced a more general error (loss) function, the so-called ε-insensitivity loss function:
|y - f(x, w)|ε = 0 if |y - f(x, w)| ≤ ε,
|y - f(x, w)| - ε otherwise.
Thus, the loss is equal to 0 if the difference between the predicted f(x, w) and the measured value is less than ε. Vapnik's ε-insensitivity loss function defines an ε-tube around f(x, w). If the predicted value is within the tube, the loss (error, cost) is zero. For all other predicted points outside the tube, the loss equals the magnitude of the difference between the predicted value and the radius ε of the tube. See the next figure.
[Figure: the ε-tube around the predicted f(x, w) (solid line); a measured point yj outside the tube has a slack ξj*.]
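The ε-insensitivity loss itself is a one-liner; a minimal sketch (the numbers are only illustrative):

```python
import numpy as np

def eps_insensitive_loss(y, f, eps):
    """Vapnik's epsilon-insensitive loss: zero inside the eps-tube, linear outside."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

y = np.array([1.0, 1.3, 2.0])
f = np.array([1.1, 1.2, 1.5])                 # predicted values f(x, w)
print(eps_insensitive_loss(y, f, eps=0.2))    # [0.0, 0.0, 0.3]
```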
The primal Lagrangian is
Lp(w, b, ξi, ξi*, αi, αi*, βi, βi*) = ½ wTw + C Σi=1..l (ξi + ξi*) - Σi=1..l αi [wTxi + b - yi + ε + ξi] - Σi=1..l αi* [yi - wTxi - b + ε + ξi*] - Σi=1..l (βi* ξi* + βi ξi),
and the dual Ld(α, α*) is maximized subject to the constraints
Σi=1..l αi* = Σi=1..l αi,
0 ≤ αi* ≤ C, i = 1, ..., l,
0 ≤ αi ≤ C, i = 1, ..., l.
Note that the dual variables Lagrangian Ld(α, α*) is expressed in terms of the Lagrange multipliers α and α* only, and that the size of the problem, with respect to the size of an SV classifier design task, is now doubled. There are 2l unknown multipliers for linear regression, and the Hessian matrix H of the quadratic optimization problem in the case of regression is a (2l, 2l) matrix.
The standard quadratic optimization problem above can be expressed in matrix notation and formulated as follows:
maximize
Ld(α) = -0.5 αT H α + fT α,
z = f(x, w) = Gw + b.
[Figure: The influence of the insensitivity zone ε on modeling quality. A nonlinear SVM creates a regression function with Gaussian kernels and models a highly polluted (25% noise) sine function (dashed). 17 measured training data points (plus signs) are used. Left: ε = 0.1, 15 SVs are chosen (encircled plus signs). Right: ε = 0.5, the 6 chosen SVs produce a much better regression function.]
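The effect of the insensitivity zone on the number of SVs can be reproduced with any SVR implementation; a rough sketch with scikit-learn (noise level, kernel width, and the two ε values are illustrative and will not reproduce the figure exactly):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 17).reshape(-1, 1)                 # 17 training inputs
y = np.sin(x).ravel() + 0.25 * rng.standard_normal(17)    # noisy sine data

for eps in (0.1, 0.5):
    svr = SVR(kernel="rbf", gamma=0.5, C=10.0, epsilon=eps).fit(x, y)
    print(f"epsilon = {eps}: {len(svr.support_)} support vectors")
# A wider tube (larger epsilon) selects fewer SVs and smooths out the noise.
```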
Some of the constructive problems:
The SV training works almost perfectly for not too large data sets. However, when the number of data points is large (say, l > 2000), the QP problem becomes extremely difficult to solve with standard methods. For example, a training set of 50,000 examples amounts to a Hessian matrix H with 2.5*10⁹ (2.5 billion) elements. Using an 8-byte floating-point representation, we need 20,000 Megabytes = 20 Gigabytes of memory (Osuna et al., 1997). This cannot easily fit into the memory of present standard computers.
There are three approaches that resolve the QP for large data sets. Vapnik (1995) proposed the chunking method, which is a decomposition approach. Another decomposition approach is proposed in (Osuna et al., 1997). The sequential minimal optimization algorithm (Platt, 1997) is of a different character, and it can be seen as an analogue of error backpropagation for SVM learning.
There is an alternative approach to the calculation of the support vectors, presented in Section 5.3.4:
having the measurement pairs (x, y), place the basis functions (kernels) G at each data point and perform the approximation in the same way as in SVM regression.
However, unlike in the QP approach, one now minimizes the L1 norm!
Thus, instead of solving the equation y = Gw for the weights w directly, we minimize ||w||1 subject to
[ G  -G ] [w+]   [ y + 1ε ]
[-G   G ] [w-] ≤ [-y + 1ε],   w+ ≥ 0, w- ≥ 0,
where w = w+ - w-.
Although there is not yet a comparative analysis between the QP and LP approaches, our first simulation results show that the LP subset selection may be more than a good alternative to the QP based algorithms when faced with huge training data sets.
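A minimal sketch of such an LP-based subset selection, using scipy's generic LP solver (the data, the Gaussian kernel width, and ε are illustrative assumptions, not the book's exact setup): minimize ||w||1 with w split into w+ - w-, subject to the ε-tube constraints on Gw.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = np.sin(2 * x) + 0.05 * rng.standard_normal(30)
eps = 0.1

# Kernel (design) matrix G: one Gaussian basis function placed at each data point
sigma = 0.8
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

l = len(x)
# Variables z = [w+, w-], both nonnegative; w = w+ - w-; minimize sum(w+) + sum(w-) = ||w||_1
c = np.ones(2 * l)
A_ub = np.vstack([np.hstack([ G, -G]),     #  G w <=  y + eps
                  np.hstack([-G,  G])])    # -G w <= -y + eps
b_ub = np.hstack([y + eps, -y + eps])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
w = res.x[:l] - res.x[l:]
print(np.sum(np.abs(w) > 1e-6), "basis functions kept out of", l)   # a sparse subset
```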