Javier Béjar
LSI - FIB
Term 2012/2013
Linear separators
In fact, any line that divides the examples can be an answer to the problem. The question we can ask is: which line is the best one?
Classification Margin
The distance from an example $x_i$ to the separator is

$$r = \frac{w^T x_i + b}{\|w\|}$$
The examples closest to the separator are the support vectors. The margin ($\rho$) of the separator is the distance between the support vectors.
So the length of the optimal margin is $m = \frac{2}{\|w\|}$
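As a quick numeric check of these two formulas, here is a minimal sketch (the separator $w$, $b$ and the test point are made-up values for illustration):

```python
import numpy as np

# Made-up separator w^T x + b = 0, in canonical scaling
# (support vectors satisfy |w^T x + b| = 1)
w = np.array([2.0, 1.0])
b = -3.0

# Signed distance of a point to the separator: r = (w^T x + b) / ||w||
def distance(x, w, b):
    return (w @ x + b) / np.linalg.norm(w)

# Length of the margin: m = 2 / ||w||
margin = 2.0 / np.linalg.norm(w)

x = np.array([2.0, 0.0])           # lies on w^T x + b = 1
print(distance(x, w, b))           # 1/||w|| ~= 0.447
print(margin)                      # 2/||w|| ~= 0.894
```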
This means that maximizing the margin is the same as minimizing the norm of the weights.
Requiring every example to lie on or outside the margin gives the constraints

$$y_i(w^T x_i + b) \geq 1, \quad \forall i$$

This allows us to define the problem of learning the weights as an optimization problem:

$$\min_{w,b} \; \frac{1}{2} w^T w$$

subject to $y_i(w^T x_i + b) \geq 1,\ \forall i$, or equivalently, subject to $1 - y_i(w^T x_i + b) \leq 0,\ \forall i$

(where $\|w\| = \sqrt{w^T w}$)
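A minimal sketch of solving this primal problem numerically (the toy data below is made up, and scipy's generic SLSQP solver stands in for a dedicated QP solver):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy data: two examples per class
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Variables packed as theta = [w1, w2, b]
def objective(theta):
    w = theta[:2]
    return 0.5 * (w @ w)            # (1/2) w^T w

# One constraint per example: y_i (w^T x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda t, i=i: y[i] * (t[:2] @ X[i] + t[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```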
Maximum Margin Classification
and

$$\sum_{i=1}^{n} \alpha_i y_i = 0$$

Substituting $w = \sum_{j=1}^{n} \alpha_j y_j x_j$ back into the Lagrangian:

$$L = \frac{1}{2} \sum_{i=1}^{n} \alpha_i y_i x_i^T \sum_{j=1}^{n} \alpha_j y_j x_j + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i \left( \sum_{j=1}^{n} \alpha_j y_j x_j^T x_i + b \right) \right]$$

$$= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i y_i \sum_{j=1}^{n} \alpha_j y_j x_j^T x_i - b \sum_{i=1}^{n} \alpha_i y_i \quad \left(\text{and given that } \sum_{i=1}^{n} \alpha_i y_i = 0\right)$$

$$= -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i \quad \text{(rearranging terms)}$$
This problem is known as the dual problem (the original one is the primal). This formulation depends only on the $\alpha_i$. Both problems are linked: given $w$ we can compute the $\alpha_i$, and vice versa. This means that we can solve one instead of the other.
Now this problem is a maximization problem over the $\alpha_i$:

$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i$$

subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$, $\alpha_i \geq 0\ \forall i$
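A minimal sketch of solving this dual with a generic QP solver (here cvxopt, whose qp routine minimizes $\frac{1}{2}x^T P x + q^T x$, so we negate the objective; the toy data is made up):

```python
import numpy as np
from cvxopt import matrix, solvers

# Made-up linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Maximizing sum(alpha) - (1/2) alpha^T Q alpha, Q_ij = y_i y_j x_i^T x_j,
# is the same as minimizing (1/2) alpha^T Q alpha - sum(alpha)
Yx = y[:, None] * X
Q = Yx @ Yx.T
P = matrix(Q + 1e-8 * np.eye(n))   # tiny ridge for numerical stability
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))             # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))       # equality: sum_i alpha_i y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

# Recover w from the support vectors (the alpha_i > 0)
w = ((alpha * y)[:, None] * X).sum(axis=0)
print("alpha =", alpha.round(4), "w =", w)
```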
Geometric Interpretation
In order to minimize the error, we can minimize $\sum_i \xi_i$, introducing the slack variables into the constraints of the problem:

$$w^T x_i + b \geq 1 - \xi_i \quad \text{if } y_i = 1$$
$$w^T x_i + b \leq -1 + \xi_i \quad \text{if } y_i = -1$$
$$\xi_i \geq 0$$
The new primal problem is to minimize $\frac{1}{2} w^T w + C \sum_i \xi_i$, subject to $y_i(w^T x_i + b) \geq 1 - \xi_i,\ \forall i$, $\xi_i \geq 0$
The corresponding dual problem is as before, but now subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$, $C \geq \alpha_i \geq 0\ \forall i$

We can recover the solution as $w = \sum_{i \in SV} \alpha_i y_i x_i$
This problem is very similar to the linearly separable case, except that there is an upper bound $C$ on the values of the $\alpha_i$. This is also a QP problem.
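In practice an off-the-shelf implementation handles this QP. A minimal sketch with scikit-learn, where the C parameter is the same upper bound on the $\alpha_i$ (the data and the value C = 1.0 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, slightly overlapping toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, -0.5],
              [-2.0, -2.0], [-3.0, -1.0], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Linear soft-margin SVM; C bounds the alpha_i
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)
print("w =", clf.coef_, "b =", clf.intercept_)
```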
In the problem that defines an SVM, only the inner products between examples are needed. This means that if we can define how to compute this product in the feature space, it is not necessary to build that space explicitly. Many geometric operations can be expressed as inner products, such as angles or distances.
We can define the kernel function as the inner product in the feature space: $K(x, y) = \phi(x)^T \phi(y)$
Kernels - Example
$$K(x, y) = (1 + x_1 y_1 + x_2 y_2)^2$$

Expanding shows that $K(x, y) = \phi(x)^T \phi(y)$ with $\phi(x) = (1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1 x_2)$, so we compute an inner product in a six-dimensional feature space while using only the features from the input space.
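A minimal sketch verifying this equivalence numerically (the two points are made-up values):

```python
import numpy as np

def K(x, y):
    # Kernel from the example: (1 + x1*y1 + x2*y2)^2
    return (1 + x[0]*y[0] + x[1]*y[1]) ** 2

def phi(x):
    # Explicit feature map whose inner product reproduces K
    s = np.sqrt(2)
    return np.array([1, s*x[0], s*x[1], x[0]**2, x[1]**2, s*x[0]*x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y))           # 4.0, computed in the 2-dimensional input space
print(phi(x) @ phi(y))   # 4.0, the same value via the 6-dimensional feature space
```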
Non-linear large margin classification
The kernel matrix is symmetric and positive semidefinite, so it can be decomposed as $K = V \Lambda V^T$. $V$ can be interpreted as a projection into the feature space (remember feature projection?)
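A minimal sketch of this decomposition with numpy (made-up points, using the quadratic kernel from the example above; eigh applies because the kernel matrix is symmetric):

```python
import numpy as np

# Made-up points and the kernel matrix K_ij = (1 + x_i^T x_j)^2
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]])
K = (1 + X @ X.T) ** 2

# Eigendecomposition K = V Lambda V^T
lam, V = np.linalg.eigh(K)
print(np.allclose(V @ np.diag(lam) @ V.T, K))   # True

# Scaling V by sqrt(Lambda) gives coordinates whose inner products reproduce K
Z = V * np.sqrt(np.clip(lam, 0, None))
print(np.allclose(Z @ Z.T, K))                  # True
```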
Kernel functions
The dual problem is now expressed in terms of the kernel:

$$\max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{n} \alpha_i$$

subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$, $C \geq \alpha_i \geq 0\ \forall i$
Since the training of the SVM only needs the values $K(x_i, x_j)$, there are no constraints on how the examples are represented. We only have to define the kernel as a similarity among examples, so we can define similarity functions for different representations (a sketch follows the list below):
Strings, sequences
Graphs/Trees
Documents
...
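A minimal sketch with scikit-learn's precomputed-kernel mode, where the SVM sees only the kernel matrix and never the strings themselves (the character-overlap kernel here is a made-up illustration, not a standard string kernel):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up "documents" with labels
docs = ["aab", "abb", "xxy", "xyy"]
y = [1, 1, -1, -1]

# Made-up similarity: number of shared characters (histogram intersection)
def string_kernel(s, t):
    return sum(min(s.count(c), t.count(c)) for c in set(s) | set(t))

# Precompute the kernel (Gram) matrix over the training examples
K = np.array([[string_kernel(s, t) for t in docs] for s in docs])

clf = SVC(kernel="precomputed")
clf.fit(K, y)

# Prediction only needs kernel values against the training examples
K_test = np.array([[string_kernel("aba", t) for t in docs]])
print(clf.predict(K_test))   # likely [1]: "aba" overlaps with the +1 class
```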