Previously: Bayesian Decision Theory; Maximum-Likelihood & Bayesian Parameter Estimation; Nonparametric Density Estimation (Parzen-Window, kn-Nearest-Neighbor)
- A classifier derived from statistical learning theory by Vapnik et al. in 1992
- SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task
- Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
- Also used for regression (not covered today)
- Reading: Chapters 5.1, 5.2, 5.3, 5.11 (5.4*) in the textbook
[Photo: V. Vapnik]
Outline
- Linear Discriminant Function
- Large Margin Linear Classifier
- Nonlinear SVM: The Kernel Trick
- Demo of SVM
Discriminant Function
Assign $\mathbf{x}$ to class $i$ if
$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \text{for all } j \neq i$$
Discriminant Function
It can be an arbitrary function of $\mathbf{x}$, for example:
- Nearest Neighbor
- Decision Tree
- Linear Functions
- Nonlinear Functions
$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$
[Figure: the line $g(\mathbf{x}) = 0$ in the $(x_1, x_2)$ plane separating the region $\mathbf{w}^T\mathbf{x} + b > 0$ (denotes +1) from $\mathbf{w}^T\mathbf{x} + b < 0$ (denotes -1); $\mathbf{w}$ is normal to the boundary, with unit normal $\mathbf{n} = \mathbf{w}/\|\mathbf{w}\|$]
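As a concrete illustration (my own sketch, not from the slides), here is a minimal NumPy version of classifying points with a linear discriminant function; the weight vector `w` and bias `b` are made-up values:

```python
import numpy as np

# Hypothetical parameters of a linear discriminant g(x) = w^T x + b
w = np.array([2.0, -1.0])   # weight vector (normal to the decision boundary)
b = -0.5                    # bias term

def classify(X):
    """Return +1 where g(x) > 0 and -1 where g(x) < 0."""
    g = X @ w + b           # g(x) = w^T x + b for each row of X
    return np.where(g > 0, 1, -1)

X = np.array([[1.0, 0.0],   # g = 1.5  -> +1
              [0.0, 1.0]])  # g = -1.5 -> -1
print(classify(X))          # [ 1 -1]
```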
How would you classify these points using a linear discriminant function in order to minimize the error rate?
[Figures: four slides show the same linearly separable 2-D data (denotes +1, denotes -1), each with a different candidate separating line; infinitely many choices achieve zero training error]
The linear discriminant function (classifier) with the maximum margin is the best.
- Margin is defined as the width that the boundary could be increased by before hitting a data point
- Why is it the best? Robust to outliers and thus strong generalization ability
[Figure: the maximum-margin boundary with its margin ("safe zone") between the +1 and -1 points in the $(x_1, x_2)$ plane]
We know that
$$\mathbf{w}^T \mathbf{x}^+ + b = +1, \qquad \mathbf{w}^T \mathbf{x}^- + b = -1$$
where $\mathbf{x}^+$ and $\mathbf{x}^-$ are points on the two margin boundaries. The margin width is
$$M = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$
[Figure: $\mathbf{x}^+$ and $\mathbf{x}^-$ on the margin boundaries; the points on the boundaries are the support vectors]
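A quick numeric sanity check of the margin formula (illustrative values of my own, not from the slides): with $\mathbf{w} = (2, 0)$ and $b = 0$, the points $\mathbf{x}^+ = (0.5, 0)$ and $\mathbf{x}^- = (-0.5, 0)$ lie on the two margin boundaries, and both expressions for $M$ agree:

```python
import numpy as np

w = np.array([2.0, 0.0]); b = 0.0          # hypothetical separating hyperplane
x_pos = np.array([0.5, 0.0])               # satisfies w^T x + b = +1
x_neg = np.array([-0.5, 0.0])              # satisfies w^T x + b = -1
assert np.isclose(w @ x_pos + b, 1.0) and np.isclose(w @ x_neg + b, -1.0)

n = w / np.linalg.norm(w)                  # unit normal of the boundary
M_geometric = (x_pos - x_neg) @ n          # projection onto the normal
M_formula = 2.0 / np.linalg.norm(w)        # the closed form 2 / ||w||
print(M_geometric, M_formula)              # both 1.0
```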
Formulation:
$$\text{maximize} \quad \frac{2}{\|\mathbf{w}\|}$$
such that
$$\mathbf{w}^T \mathbf{x}_i + b \ge +1 \ \text{ for } y_i = +1, \qquad \mathbf{w}^T \mathbf{x}_i + b \le -1 \ \text{ for } y_i = -1$$
Formulation (equivalently):
$$\text{minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2$$
such that
$$\mathbf{w}^T \mathbf{x}_i + b \ge +1 \ \text{ for } y_i = +1, \qquad \mathbf{w}^T \mathbf{x}_i + b \le -1 \ \text{ for } y_i = -1$$
Formulation:
$$\text{minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2$$
such that
$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$$
This is a quadratic programming problem with linear inequality constraints:
$$\text{minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1$$
Lagrangian Function
$$\text{minimize} \quad L_p(\mathbf{w}, b, \alpha_i) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]$$
s.t. $\alpha_i \ge 0$

Setting the gradients to zero:
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{n} \alpha_i y_i = 0$$
s.t. $\alpha_i \ge 0$
The dual problem:
$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
s.t.
$$\alpha_i \ge 0, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
At the solution, complementary slackness holds:
$$\alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right] = 0$$
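To make the dual concrete, here is a hedged sketch that solves it with the `cvxopt` QP solver (my choice; the slides do not prescribe a solver, and the data are made up). `cvxopt` minimizes $\frac{1}{2}\alpha^T P \alpha + q^T \alpha$, so the dual objective is negated:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual: maximize sum(a) - 0.5 a^T Q a, with Q_ij = y_i y_j x_i^T x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T
P = matrix(Q)                      # cvxopt minimizes 0.5 a^T P a + q^T a
q = matrix(-np.ones(n))            # so q = -1 flips maximize into minimize
G = matrix(-np.eye(n))             # -a_i <= 0  encodes  a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y[None, :])             # equality constraint: sum_i a_i y_i = 0
b_eq = matrix(0.0)

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h, A, b_eq)
alpha = np.ravel(sol['x'])

# Recover w = sum_i a_i y_i x_i; get b from any support vector (a_i > 0)
w = (alpha * y) @ X
sv = alpha > 1e-6
b_val = np.mean(y[sv] - X[sv] @ w)
print("alpha:", alpha.round(4), "w:", w.round(4), "b:", round(b_val, 4))
```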
[Figure: only the points on the margin boundaries, the support vectors, have $\alpha_i > 0$]
$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$$
$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$$
- Notice that $g(\mathbf{x})$ relies on a dot product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$
- Also keep in mind that solving the optimization problem involved computing the dot products $\mathbf{x}_i^T \mathbf{x}_j$ between all pairs of training points
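As a hedged illustration of the support-vector expansion, scikit-learn's `SVC` (an assumption; any SVM package exposes something similar) stores exactly the quantities in this formula: `support_vectors_` holds the $\mathbf{x}_i$, `dual_coef_` holds the products $\alpha_i y_i$, and `intercept_` holds $b$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (made up for illustration)
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-0.5, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ hard margin

# g(x) = sum_{i in SV} (alpha_i y_i) x_i^T x + b, assembled by hand
x_test = np.array([1.0, 1.5])
g = clf.dual_coef_ @ clf.support_vectors_ @ x_test + clf.intercept_
print(g.item(), clf.decision_function([x_test]).item())  # identical values
```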
Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy data points.
[Figure: two points violating the margin, with slack distances $\xi_1$ and $\xi_2$]
Formulation:
$$\text{minimize} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
such that
$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
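The parameter $C$ trades margin width against slack: a small $C$ tolerates violations (wider margin, more support vectors), a large $C$ penalizes them. A hedged sketch with scikit-learn's `SVC` on noisy made-up data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)),   # class -1 cloud
               rng.normal(+1.5, 1.0, (50, 2))])  # class +1 cloud (overlapping)
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:<6} support vectors: {len(clf.support_)}")
# Typically, the smaller C is, the more margin violations are tolerated
# and the more points end up as support vectors.
```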
The dual of the soft-margin problem:
$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$
such that
$$0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
Non-linear SVMs
Datasets that are linearly separable with noise work out great:
[Figure: 1-D data points on the $x$ axis, separable by a single threshold near 0]
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:
$$\Phi: \ \mathbf{x} \mapsto \varphi(\mathbf{x})$$
With this mapping, the discriminant function becomes
$$g(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b = \sum_{i \in SV} \alpha_i y_i \, \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}) + b$$
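A tiny hedged example of such a mapping (my own illustration, not from the slides): 1-D points labeled by whether $|x| > 1$ are not linearly separable on the line, but the map $\varphi(x) = (x, x^2)$ makes them separable by the horizontal line $x^2 = 1$ in feature space:

```python
import numpy as np

x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])  # 1-D inputs
y = np.where(np.abs(x) > 1, 1, -1)        # +1 outside [-1, 1], -1 inside

phi = np.column_stack([x, x ** 2])        # feature map phi(x) = (x, x^2)

# In feature space, w = (0, 1) and b = -1 separate the classes perfectly:
g = phi @ np.array([0.0, 1.0]) - 1.0      # g(phi(x)) = x^2 - 1
print(np.all(np.sign(g) == y))            # True
```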
No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
$$K(\mathbf{x}_i, \mathbf{x}_j) \equiv \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$$
An example:
2-dimensional vectors $\mathbf{x} = [x_1 \ x_2]$; let $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$.
Need to show that $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$:
$$(1 + \mathbf{x}_i^T \mathbf{x}_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$$
$$= [1,\ x_{i1}^2,\ \sqrt{2}\,x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\,x_{i1},\ \sqrt{2}\,x_{i2}]^T \,[1,\ x_{j1}^2,\ \sqrt{2}\,x_{j1} x_{j2},\ x_{j2}^2,\ \sqrt{2}\,x_{j1},\ \sqrt{2}\,x_{j2}]$$
$$= \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j), \quad \text{where } \varphi(\mathbf{x}) = [1,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2]$$
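A quick numeric check of this identity (illustrative values of my own):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (1 + x_i^T x_j)^2 in 2-D."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print((1 + xi @ xj) ** 2, phi(xi) @ phi(xj))   # both 4.0
```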
Commonly-used kernel functions:
- Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
- Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^p$
- Gaussian (Radial-Basis Function) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right)$
- Sigmoid: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta_0 \, \mathbf{x}_i^T \mathbf{x}_j + \beta_1)$
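For instance, the Gaussian kernel matrix over a training set can be computed in a few NumPy lines (a sketch; $\sigma$ is a free parameter):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows of X."""
    sq = np.sum(X**2, axis=1)                       # ||x_i||^2
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T    # pairwise squared distances
    return np.exp(-d2 / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(gaussian_kernel_matrix(X, sigma=1.0).round(3))
```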
With a kernel function, the optimization problem becomes:
$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
such that
$$0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
and the discriminant function is
$$g(\mathbf{x}) = \sum_{i \in SV} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$$
1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages available)
4. Construct the discriminant function from the support vectors
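A hedged end-to-end sketch of these four steps using scikit-learn's `SVC` (my choice of package; the slides only say "many software packages available", and the data here are made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1.0, 1, -1)  # nonlinear ring labels

clf = SVC(kernel='rbf', C=1.0, gamma=0.5)  # steps 1-2: pick kernel and C
clf.fit(X, y)                              # step 3: solve the QP (LibSVM inside)
print("train accuracy:", clf.score(X, y))
print("num support vectors:", len(clf.support_))
# Step 4 is handled internally: g(x) = sum_i alpha_i y_i K(x_i, x) + b is
# evaluated by clf.decision_function / clf.predict from the stored support vectors.
```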
Some Issues
Choice of kernel
- Gaussian or polynomial kernel is the default
- if ineffective, more elaborate kernels are needed
- domain experts can give assistance in formulating appropriate similarity measures
Additional Resource
http://www.kernel-machines.org/
Demo of LibSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
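For reference, a minimal hedged sketch of the LibSVM Python interface (assuming the `libsvm-official` package; option strings follow `svm-train`, e.g. `-t 2` for the RBF kernel and `-c` for C; the tiny dataset is made up):

```python
from libsvm.svmutil import svm_train, svm_predict

# Labels and sparse {feature_index: value} dicts, LibSVM's native format
y = [1, 1, -1, -1]
x = [{1: 2.0, 2: 2.0}, {1: 2.5, 2: 3.0}, {1: 0.0, 2: 0.0}, {1: -0.5, 2: 1.0}]

model = svm_train(y, x, '-t 2 -c 1')          # RBF kernel, C = 1
labels, acc, vals = svm_predict(y, x, model)  # predict back on the training set
print(labels)
```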