
Summary: A Tutorial on SVM in Pattern Recognition
(a paper by J. C. Burges)
Anirban Sanyal

1 Introduction
The main topics of my discussion can be broadly classified into the following points:
1. a brief history of SVM;
2. separating hyperplane theory;
3. applications of SVM.

2 Brief history of SVM

The support vector machine was primarily developed by computer scientists and was later adopted for classification problems.
1. In 1979, the SVM was first introduced by Vapnik.
2. However, it was only in 1992 that the SVM was used by Vapnik for handwritten digit recognition. That was the starting point, and later, in 1995 and 1998, the SVM gained recognition in pattern classification and other classification problems.
3. Initially the SVM was used for two-class classification problems; later, around 2002-03, it was extended to multiclass classification problems.

3 Objective

Since this is a supervised classification problem, we have one training set and one test set of observations. We have to fit a boundary on the training set, and this boundary is then applied to the test set to classify the test observations.
Here the training set is (x_n, y_n), where y_n ∈ {−1, 1} and x_n ∈ R^p.
We call the observations with y_n = 1 the positive observations and those with y_n = −1 the negative observations.
Let us now consider the following three situations:
1. separable classes;
2. non-separable classes;
3. non-linear boundary.
We will develop the SVM algorithm for situation 1 and then modify it for the subsequent cases; a small sketch of the train/test setup follows below.
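The following is a minimal sketch of this setup, assuming scikit-learn is available; the two-dimensional data is synthetic and only illustrative, not taken from the paper.

# Fit a boundary on a training set and apply it to a test set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# training set (x_n, y_n) with y_n in {-1, +1} and x_n in R^2
X_train = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
y_train = np.array([-1] * 50 + [+1] * 50)
X_test = rng.normal(0, 2, (20, 2))

clf = SVC(kernel="linear")      # fit a linear boundary on the training set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)    # classify the test observations
print(y_pred)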
4 Completely separable classes

As we can see from the picture, the two classes are completely separable, and hence there can be a number of different boundaries (or hyperplanes) that separate these two classes, as shown in figure 2.

4.1 Separating hyperplane

By a separating hyperplane we mean a hyperplane that separates these two classes completely. As we can see from the figure, all the lines correspond to different separating hyperplanes. A general hyperplane can be represented by
w'x + a = 0 ......(*)
So for two completely separable classes there can be a number of choices for w and a.
The question that now arises is: "what will be the optimal separating hyperplane for these two classes?" But before deriving the optimal hyperplane, let me define what is meant by an optimal hyperplane.
Let us define
d+ = the minimum distance of the positive points from the hyperplane in (*), and d− is defined similarly for the negative points.
We define the optimal hyperplane as the hyperplane that keeps equal distance from the nearest points of either class, so for the optimal hyperplane d+ = d−.
Thus, relative to the optimal hyperplane in (*),
w'x + a ≥ d for y = +1 and w'x + a ≤ −d for y = −1, where d is the common value of d+ and d−. With a proper reparametrization we have w'x + a ≥ 1 for y = +1 and w'x + a ≤ −1 for y = −1, and equality holds for the points nearest to the hyperplane. Such points are called "support vectors", and this is where the name "support vector machine" comes from. In the later sections we will see that these are the vectors that help us derive the actual values of w and a.
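The following is a small numerical check of the reparametrized conditions, with a hand-picked hyperplane (w, a) and toy points that are not from the paper.

# Compute y_i * (w'x_i + a) for each point: values >= 1 satisfy the
# separation constraint, and values equal to 1 identify support vectors.
import numpy as np

def margins(X, y, w, a):
    return y * (X @ w + a)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, a = np.array([0.5, 0.5]), 0.0      # hand-picked hyperplane w'x + a = 0

print(margins(X, y, w, a))            # [2. 3. 2. 1.] -> the last point lies on the margin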

5 Derivation of the optimal hyperplane

For the support vectors, the perpendicular distance from the hyperplane is given by |w'x + a| / ||w||, i.e. 1/||w||. Since for the optimal hyperplane this margin width should be maximum, we have to minimize ||w||.
Thus our objective is to minimize ||w|| subject to the condition that y_n (w'x_n + a) ≥ 1.
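As a hedged sketch, this primal problem can be solved numerically with a general-purpose constrained optimizer; here scipy's SLSQP is used on the same toy data as above, minimizing the equivalent objective ||w||^2 / 2. This is only an illustration, not the solver discussed in the paper.

# Minimize ||w||^2 / 2 subject to y_n (w'x_n + a) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(theta):                  # theta = (w_1, w_2, a)
    w = theta[:2]
    return 0.5 * (w @ w)

constraints = [{"type": "ineq",
                "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:2] + theta[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w_opt, a_opt = res.x[:2], res.x[2]
print(w_opt, a_opt)                    # the maximum-margin hyperplane, roughly (1/3, 1/3) and -1/3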

6 Optimisation problem

We form the Lagrangian as
L = 0.5 ||w||^2 − Σ_i λ_i [y_i (w'x_i + a) − 1].
We have to minimize L with respect to w and a, where the λ_i are Lagrange multipliers and λ_i ≥ 0, with strict inequality holding if x_i corresponds to a support vector.
Setting ∂L/∂w = w − Σ_i λ_i y_i x_i = 0, we have
w = Σ_i λ_i y_i x_i,
where the sum is effectively taken over the support vectors. Similarly, ∂L/∂a = 0 gives the condition Σ_i λ_i y_i = 0.
Putting this value of w back into L, the dual of the problem reduces to
L_D = Σ_i λ_i − (1/2) Σ_i Σ_j λ_i λ_j y_i y_j (x_i'x_j).
This is a convex quadratic programming problem, since the objective function itself is convex, so we have to maximize L_D over the λ_i to find the optimal values of the Lagrange multipliers.
The value of a is then obtained from the relation that, for the support vectors,
w'x + a = +1 or −1.
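The following is a hedged illustration of this dual solution, using scikit-learn's SVC as the quadratic-programming solver on the earlier toy data; its dual_coef_ attribute stores λ_i y_i for the support vectors, so w = Σ_i λ_i y_i x_i can be recovered directly. None of this code is from the paper.

# Recover w from the dual variables of a linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -1.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6)            # a very large C approximates the hard margin
clf.fit(X, y)

w = clf.dual_coef_ @ clf.support_vectors_    # w = sum_i lambda_i y_i x_i
a = clf.intercept_
print(clf.support_vectors_)                  # the support vectors
print(w, a)                                  # agrees with clf.coef_ and clf.intercept_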

7 Overlapping classes

Now let us extend this concept to overlapping classes. Clearly we cannot extend these ideas straight away to overlapping classes, so we make a small modification: we introduce slack variables η_i, where η_i ≥ 0, so that we have
x_i'w + a ≥ 1 − η_i for y_i = +1 .......(1)
x_i'w + a ≤ η_i − 1 for y_i = −1 .......(2)
Thus a misclassified point has η_i > 1, and hence Σ_i η_i gives an upper bound on the number of misclassifications.
So our objective will be to minimize 0.5 ||w||^2 + C Σ_i η_i, where C is a chosen penalty parameter, subject to conditions (1) and (2).
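The following is a small sketch of the slack variables, assuming a soft-margin fit from scikit-learn; the slack η_i = max(0, 1 − y_i (w'x_i + a)) is then computed by hand. The overlapping toy data is invented for illustration.

# Fit a soft-margin linear SVM and compute the slack of each training point.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.5, (30, 2)), rng.normal(+1, 1.5, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, a = clf.coef_.ravel(), clf.intercept_[0]

eta = np.maximum(0.0, 1.0 - y * (X @ w + a))   # slack eta_i for each point
print("points with eta_i > 1 (misclassified):", int((eta > 1).sum()))
print("sum of slacks (error bound):", eta.sum())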

8 Optimisation problem

Here the Lagrangian would be
L_1 = 0.5 ||w||^2 + C Σ_i η_i − Σ_i λ_i [y_i (x_i'w + a) − 1 + η_i] − Σ_i μ_i η_i,
and we differentiate L_1 with respect to w to get the optimal value of w, so that ∂L_1/∂w = 0
⇒ w = Σ_i λ_i y_i x_i,
and the value of a can be found from the support vectors once the value of w is obtained.

9 Non-linear boundaries

There can still be situations with boundaries which cannot be approximated by linear boundaries. In those cases we transform the data into a new data set so that the boundary in the new data set is linear or approximately linear. So we transform the data x_n to z_n such that
z_n = φ(x_n), where φ is the required transformation, that is,
φ : R^p → H, where p ≤ dim(H). Here H may be a Hilbert space or a Euclidean space; for the sake of simplicity, let us consider H to be a Euclidean space.
So here w = Σ_i λ_i y_i φ(x_i), and the optimal hyperplane is
Σ_i λ_i y_i φ(x_i)'φ(x) + a = 0 .......(3)
Hence the transformation enters the optimal hyperplane only through an inner product. So finding a kernel that expresses this inner product would help us avoid computing the actual transformation. For the moment, suppose such kernels do exist; then we can easily use such a kernel to express the optimal hyperplane in (3). Let us call this kernel K(·,·). Then the optimal hyperplane is given by
Σ_i λ_i y_i K(x_i, x) + a = 0.
Two questions then arise (a small numerical check of the kernel idea follows after them):
1. What guarantees the existence of such kernels?
2. What would be an appropriate kernel in classification problems?
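The following check illustrates the kernel idea numerically: for the polynomial kernel K(x, y) = (x·y)^2 on R^2, the explicit map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) satisfies K(x, y) = φ(x)·φ(y), so the inner product in H never has to be computed explicitly. This particular φ is a standard textbook example and is not taken from the paper.

# Verify that the kernel equals the inner product of an explicit feature map.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, y):
    return (x @ y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y), phi(x) @ phi(y))   # both print 1.0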

10 Existence of kernels

10.1 Mercer's condition

There exist a kernel K and a mapping φ such that K(x, y) = φ(x)·φ(y) if and only if, for any g(x) such that
∫ g^2(x) dx < ∞,
we have
∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.
Such kernels K(·,·) can then be expressed as
K(x, y) = Σ_p c_p (x·y)^p,
where the c_p are positive constants and the series is uniformly convergent.
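A hedged finite-sample analogue of this condition: for a kernel satisfying Mercer's condition, the Gram matrix G_ij = K(x_i, x_j) computed on any finite sample is positive semi-definite. The sample points below are arbitrary and only illustrative.

# Check that the Gram matrix of a polynomial kernel has no negative eigenvalues.
import numpy as np

def poly_kernel(x, y, p=3):
    return (1.0 + x @ y) ** p

X = np.random.default_rng(2).normal(size=(10, 2))
G = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])
print(np.linalg.eigvalsh(G).min() >= -1e-9)   # True (up to numerical error)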

11 Some appropriate kernels

There are some kernels which are commonly used in classification problems:
1. K(x, y) = exp(−||x − y||^2 / (2σ^2)) (Gaussian kernel)
2. K(x, y) = (1 + x·y)^p (polynomial kernel)
These kernels satisfy Mercer's condition.
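The following end-to-end sketch ties sections 9 and 11 together: a Gaussian (RBF) kernel SVM separates data whose true boundary is non-linear, while a linear SVM cannot. The concentric-circles data comes from scikit-learn's toy generators and is not from the paper; note that scikit-learn parametrizes the Gaussian kernel as exp(−γ ||x − y||^2).

# Compare a linear SVM with a Gaussian-kernel SVM on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
y = 2 * y - 1                                   # relabel {0, 1} -> {-1, +1}

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)    # K(x, y) = exp(-gamma ||x - y||^2)
print("linear accuracy:", linear.score(X, y))   # poor: no separating hyperplane in R^2
print("rbf accuracy:   ", rbf.score(X, y))      # near 1.0: the boundary is linear in H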

12 Some important points about SVM

1. The optimum we obtain here is a global optimum rather than a local one, so there is no chance of getting stuck at a local optimum, which is often the case with neural networks.
2. In classification problems, polynomial kernels are found to be among the most useful.
3. Even if the boundary is non-linear, we do not have to think about the appropriate transformation, because once we have the kernel we can easily compute the optimal hyperplane.

13 Disadvantages of SVM

1. If the dimension of the data is too large, then the computation for the SVM is quite intensive.
2. Also, in some situations the appropriate kernel cannot be found easily.

14 References

1. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition".
2. C. Cortes and V. Vapnik, "Support-Vector Networks".
