Support Vector Machines
Jie Tang
25 July 2005
Introduction
• Support Vector Machine (SVM) is a learning methodology based on Vapnik's statistical learning theory
– Developed in the 1990s
– To solve problems in traditional statistical learning (overfitting, capacity control, …)
• It has achieved the best performance in many practical applications
– Handwritten digit recognition
– Text categorization
– …
Classification Problem
• Given a training set S = {(x1, y1), (x2, y2), …, (xl, yl)}, with xi ∈ X = R^n, yi ∈ Y = {1, -1}, i = 1, 2, …, l
• Learn a function g(x) such that the decision function f(x) = sgn(g(x)) can classify a new input x
• So this is a supervised batch learning method
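As a concrete illustration of this setup (with made-up numbers, not from the slides), such a training set could be represented in Python with NumPy as follows:

import numpy as np

# Toy training set S = {(x_i, y_i)}: l = 4 examples in X = R^2, labels in {+1, -1}.
# The numbers are invented purely for illustration.
X = np.array([[ 2.0,  2.0],
              [ 1.5,  3.0],
              [-1.0, -2.0],
              [-2.0, -0.5]])   # each row is one x_i in R^n (here n = 2)
y = np.array([1, 1, -1, -1])   # each y_i in {+1, -1}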
Linear classifier
g(x) = w^T x + b

sgn(g(x)) =  1,  if g(x) ≥ 0
            -1,  if g(x) < 0

f(x) = sgn(g(x))
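A minimal sketch of this linear classifier, assuming NumPy; the weight vector w and bias b below are placeholder values, not learned ones:

import numpy as np

def g(x, w, b):
    # g(x) = w^T x + b
    return np.dot(w, x) + b

def f(x, w, b):
    # f(x) = sgn(g(x)): +1 if g(x) >= 0, else -1
    return 1 if g(x, w, b) >= 0 else -1

w = np.array([1.0, -0.5])              # placeholder parameters
b = 0.2
print(f(np.array([2.0, 1.0]), w, b))   # prints 1, since g = 1.7 >= 0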
Maximum Margin Classifier
For a point x^(1) on the positive margin boundary and a point x^(2) on the negative margin boundary:

(w / ||w||)^T x^(1) + b / ||w|| = γ̂ / ||w||
(w / ||w||)^T x^(2) + b / ||w|| = -γ̂ / ||w||
Maximum Margin Classifier
The functional margin of example i, and of the whole training set:

γ̂^(i) = y^(i) (w^T x^(i) + b)
γ̂ = min_i γ̂^(i)

For points x^(1) and x^(2) on the two margin boundaries:

(w / ||w||)^T x^(1) + b / ||w|| = γ̂ / ||w||
(w / ||w||)^T x^(2) + b / ||w|| = -γ̂ / ||w||

Subtracting the two, the distance between the boundaries (the margin width) is 2γ̂ / ||w||.
Then

max_{w,b} 2γ̂ / ||w||   is equivalent to   min_{w,b} ||w|| / (2γ̂)

s.t. w^T x^(i) + b ≥ γ̂    for the positive examples (i = 1, …, k)
     w^T x^(j) + b ≤ -γ̂   for the negative examples (j = k+1, …, m)

which is equivalent to   y^(i) (w^T x^(i) + b) ≥ γ̂,  i = 1, …, m

Since (w, b) can be rescaled so that the functional margin γ̂ = 1, this becomes

min_{w,b} ||w||^2 / 2
s.t. y^(i) (w^T x^(i) + b) ≥ 1,  i = 1, …, m
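Just to make this optimization problem concrete, here is a sketch that hands the primal above to a general-purpose solver (scipy's minimize with inequality constraints); the toy data are made up, and real SVM training would instead go through the dual and SMO as described below:

import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

# Variables z = (w, b); objective (1/2) ||w||^2.
def objective(z):
    w = z[:n]
    return 0.5 * np.dot(w, w)

# One constraint per example: y_i (w^T x_i + b) - 1 >= 0.
constraints = [{'type': 'ineq',
                'fun': (lambda z, xi=xi, yi=yi: yi * (np.dot(z[:n], xi) + z[n]) - 1.0)}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(n + 1), constraints=constraints)
w, b = res.x[:n], res.x[n]
print("w =", w, "b =", b, "margin width =", 2.0 / np.linalg.norm(w))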
Lagrange duality
For the problem:

min_{w,b} ||w||^2 / 2
s.t. y^(i) (w^T x^(i) + b) ≥ 1,  i = 1, …, m

the Lagrangian is

L(w, b, α) = ||w||^2 / 2 - Σ_{i=1..m} α_i [ y^(i) (w^T x^(i) + b) - 1 ]
s.t. α_i ≥ 0,  i = 1, …, m
Let us review the generalized Lagrangian
For the problem:

min_w f(w)
s.t. g_i(w) ≤ 0,  i = 1, …, k
     h_j(w) = 0,  j = 1, …, l

the generalized Lagrangian is:

L(w, α, β) = f(w) + Σ_{i=1..k} α_i g_i(w) + Σ_{j=1..l} β_j h_j(w)
s.t. α_i ≥ 0
Let us consider

θ_P(w) = max_{α,β: α_i ≥ 0} L(w, α, β)

Note: the constraints must be satisfied; otherwise, max L is infinite.
Let us review the generalized Lagrangian
If the constraints are satisfied, then we must have

max_{α,β} L = f(w)

so θ_P(w) takes the same value as the objective of our problem, f(w).
Therefore, instead of minimizing θ_P(w) over w directly, we can consider the dual function

θ_D(α, β) = min_w L(w, α, β)

and maximize it over α, β with α_i ≥ 0. For our problem, the constraints are y^(i) (w^T x^(i) + b) ≥ 1, i = 1, …, m.
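The step that takes us from here to the dual on the next slide is not written out on these slides; it is the standard one from the notes cited in the references: set the derivatives of L(w, b, α) to zero.

∇_w L(w, b, α) = w - Σ_{i=1..m} α_i y^(i) x^(i) = 0   ⟹   w = Σ_{i=1..m} α_i y^(i) x^(i)
∂L/∂b = -Σ_{i=1..m} α_i y^(i) = 0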
Substituting w back into the Lagrangian, we obtain the dual maximization problem with respect to α:
max_α L(α) = Σ_{i=1..m} α_i - (1/2) Σ_{i,j=1..m} y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩
s.t. α_i ≥ 0,  i ∈ [1, m]
     Σ_{i=1..m} α_i y^(i) = 0

The decision value for a new input x is then

w^T x + b = (Σ_{i=1..m} α_i y^(i) x^(i))^T x + b
          = Σ_{i=1..m} α_i y^(i) ⟨x^(i), x⟩ + b
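A minimal sketch of evaluating this decision value from the dual solution, assuming NumPy; the multipliers alpha and the bias b are placeholders standing in for values returned by whatever solver maximized L(α):

import numpy as np

def decision_value(x, X, y, alpha, b):
    # w^T x + b = sum_i alpha_i y_i <x_i, x> + b
    return np.sum(alpha * y * (X @ x)) + b

def classify(x, X, y, alpha, b):
    return 1 if decision_value(x, X, y, alpha, b) >= 0 else -1

# Placeholder training data and dual variables, purely illustrative.
X = np.array([[2.0, 2.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.1, 0.1])
b = 0.0
print(classify(np.array([1.0, 1.0]), X, y, alpha, b))   # prints 1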
Non-separable case
What is the non-separable case? I will not give an example; I assume you already know it. Introducing slack variables ξ_i, the problem becomes:
min_{w,b} ||w||^2 / 2 + C Σ_{i=1..m} ξ_i
s.t. y^(i) (w^T x^(i) + b) ≥ 1 - ξ_i,  i = 1, …, m
     ξ_i ≥ 0,  i = 1, …, m

L(w, b, ξ, α, r) = ||w||^2 / 2 + C Σ_{i=1..m} ξ_i - Σ_{i=1..m} α_i [ y^(i) (w^T x^(i) + b) - 1 + ξ_i ] - Σ_{i=1..m} r_i ξ_i
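The slides jump from this Lagrangian straight to the dual below; the intermediate step (standard, as in the referenced notes) is setting the derivative with respect to each ξ_i to zero:

∂L/∂ξ_i = C - α_i - r_i = 0   ⟹   α_i = C - r_i

Since r_i ≥ 0 and α_i ≥ 0, this forces 0 ≤ α_i ≤ C, which is exactly the only change in the dual form below.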
Dual form
max_α L(α) = Σ_{i=1..m} α_i - (1/2) Σ_{i,j=1..m} y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩     (What is the difference from the previous form??!!)
s.t. 0 ≤ α_i ≤ C,  i ∈ [1, m]
     Σ_{i=1..m} α_i y^(i) = 0

KKT conditions:
α_i = 0      ⟹  y^(i) (w^T x^(i) + b) ≥ 1
α_i = C      ⟹  y^(i) (w^T x^(i) + b) ≤ 1
0 < α_i < C  ⟹  y^(i) (w^T x^(i) + b) = 1
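A small sketch, assuming NumPy, of checking these conditions for a candidate solution; the tolerance tol and the function name are my own, not from the slides (such a check is what SMO-style training uses to pick which α to optimize next):

import numpy as np

def kkt_violations(alpha, margins, C, tol=1e-3):
    # margins[i] = y_i * (w^T x_i + b); return indices of alphas violating the conditions.
    violated = []
    for i, (a, m) in enumerate(zip(alpha, margins)):
        if a < tol and m < 1 - tol:                     # alpha_i = 0 requires margin >= 1
            violated.append(i)
        elif a > C - tol and m > 1 + tol:               # alpha_i = C requires margin <= 1
            violated.append(i)
        elif tol < a < C - tol and abs(m - 1) > tol:    # 0 < alpha_i < C requires margin = 1
            violated.append(i)
    return violated

# Example with made-up numbers:
print(kkt_violations(np.array([0.0, 0.5, 1.0]), np.array([1.2, 1.0, 0.8]), C=1.0))   # []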
How to train an SVM = how to solve the optimization problem
Sequential minimal optimization (SMO) algorithm, due to John Platt.
Suppose we hold α_2, …, α_m fixed and try to optimize over α_1 alone. The constraint Σ_{i=1..m} α_i y^(i) = 0 gives

α_1 y^(1) = -Σ_{i=2..m} α_i y^(i)

and, multiplying both sides by y^(1) (since (y^(1))^2 = 1),

α_1 = -y^(1) Σ_{i=2..m} α_i y^(i)

Is it ok? No: α_1 is completely determined by the other α's, so we cannot make progress by updating a single α; we must update at least two at a time.
SMO
Change the algorithm as follows; this is exactly SMO:

Repeat until convergence
{
1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
2. Re-optimize L(α) with respect to α_i and α_j, while holding all the other α_k (k ≠ i, j) fixed.
}
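A structural sketch of this loop in Python, assuming NumPy; take_step is only a placeholder for the two-variable analytic update described on the next slides, and pair selection here is random rather than Platt's heuristics:

import numpy as np

def take_step(i, j, alpha, X, y, C):
    # Placeholder: re-optimize (alpha_i, alpha_j) analytically as on the next slides;
    # return True if either alpha actually changed.
    return False

def smo_outer_loop(X, y, C, max_passes=10):
    m = X.shape[0]
    alpha = np.zeros(m)
    passes = 0
    while passes < max_passes:
        num_changed = 0
        for i in range(m):
            j = np.random.choice([k for k in range(m) if k != i])   # crude pair selection
            if take_step(i, j, alpha, X, y, C):
                num_changed += 1
        if num_changed == 0:        # no pair could be improved in a full pass
            passes += 1
        else:
            passes = 0
    return alpha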
Since Σ_{i=1..m} α_i y^(i) = 0, holding α_3, …, α_m fixed gives

α_1 y^(1) + α_2 y^(2) = -Σ_{i=3..m} α_i y^(i) = ζ

so

α_1 = (ζ - α_2 y^(2)) y^(1)

and therefore

L(α) = L((ζ - α_2 y^(2)) y^(1), α_2, …, α_m)
SMO(2)
L(α) = L((ζ - α_2 y^(2)) y^(1), α_2, …, α_m) = a α_2^2 + b α_2 + c   for some constants a, b, c

Solving for α_2

a α_2^2 + b α_2 + c

For this quadratic function, we can simply solve by setting its derivative to zero. Let us write α_2^{new,unclipped} for the resulting value.
α_2^{new} = H                       if α_2^{new,unclipped} > H
            α_2^{new,unclipped}     if L ≤ α_2^{new,unclipped} ≤ H
            L                       if α_2^{new,unclipped} < L
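This clipping rule translates directly into a small function; the bounds L and H (which come from the box constraint 0 ≤ α ≤ C and the equality constraint, as in Platt's paper) are taken as given here:

def clip_alpha2(alpha2_unclipped, L, H):
    # Return alpha_2^{new}: H if above H, L if below L, otherwise the unclipped optimum.
    if alpha2_unclipped > H:
        return H
    if alpha2_unclipped < L:
        return L
    return alpha2_unclipped

# e.g. with box [L, H] = [0.0, 1.0]:
print(clip_alpha2(1.7, 0.0, 1.0))   # 1.0
print(clip_alpha2(0.3, 0.0, 1.0))   # 0.3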
Map the input through a feature mapping x → φ(x), and define the kernel function

K(x, z) = φ(x)^T φ(z)

Replacing inner products by the kernel, the dual becomes

max_α L(α) = Σ_{i=1..m} α_i - (1/2) Σ_{i,j=1..m} y^(i) y^(j) α_i α_j ⟨φ(x^(i)), φ(x^(j))⟩
s.t. α_i ≥ 0,  i ∈ [1, m]
     Σ_{i=1..m} α_i y^(i) = 0

and the decision value is

w^T φ(x) + b = (Σ_{i=1..m} α_i y^(i) φ(x^(i)))^T φ(x) + b
             = Σ_{i=1..m} α_i y^(i) ⟨φ(x^(i)), φ(x)⟩ + b
K(x, z) = φ(x)^T φ(z)

For example, the Gaussian kernel:

K(x, z) = exp( -||x - z||^2 / (2σ^2) )
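A minimal sketch of this Gaussian kernel and of the kernelized decision value, assuming NumPy; sigma, the training data, and the dual variables below are placeholders:

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def kernel_decision_value(x, X, y, alpha, b, kernel=rbf_kernel):
    # sum_i alpha_i y_i K(x_i, x) + b
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

# Placeholder values, purely illustrative:
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
b = 0.0
print(np.sign(kernel_decision_value(np.array([0.2, 0.1]), X, y, alpha, b)))   # 1.0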
References
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1998.
• Andrew Ng. CS229 Lecture Notes, Part V: Support Vector Machines. Lectures from 10/19/03 to 10/26/03.
• Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998). Kluwer Academic Publishers, Boston.
• Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
People
• Vladimir Vapnik.
• J. Platt
• J. Platt, N. Cristianini, J. Shawe-Taylor
• Shawe-Taylor, J.
• Burges, C. J. C.
• Thorsten Joachims
• Etc.