
Support Vector Machines

Javier Béjar

LSI - FIB

Term 2012/2013



Outline

1 Support Vector Machines - Introduction

2 Maximum Margin Classification

3 Soft Margin Classification

4 Non linear large margin classification

5 Application





Support Vector Machines - Introduction

Linear separators

As we saw with the perceptron, one way to perform classification is to find a linear separator (for binary classification)
The idea is to find the equation of a hyperplane w^T x + b = 0 that divides the set of examples into the target classes, so that:

    w^T x + b > 0 ⇒ positive class
    w^T x + b < 0 ⇒ negative class

This is exactly the problem that the single-layer perceptron solves (b is the bias weight); a small sketch of the decision rule follows
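A minimal sketch (not from the slides) of this decision rule, with a made-up weight vector and bias:

    import numpy as np

    # Hypothetical separator w^T x + b and the sign decision rule
    w = np.array([1.0, -2.0])
    b = 0.5

    def classify(x):
        # positive class if w^T x + b > 0, negative class otherwise
        return 1 if w @ x + b > 0 else -1

    print(classify(np.array([3.0, 0.0])))   # w^T x + b = 3.5 > 0  -> +1
    print(classify(np.array([0.0, 1.0])))   # w^T x + b = -1.5 < 0 -> -1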



Support Vector Machines - Introduction

Optimal linear separator

Actually, any line that separates the examples could be an answer to the problem
The question we can ask is: which line is the best one?



Support Vector Machines - Introduction

Optimal linear separator

Some boundaries seem a bad choice because they are likely to generalize poorly
We have to maximize the probability of classifying unseen instances correctly
We want to minimize the expected generalization loss (instead of the expected empirical loss)





Maximum Margin Classification

Maximum Margin Classification

Maximizing the distance of the separator to the examples seems the right choice (this is actually supported by PAC learning theory)
This means that only the instances nearest to the separator matter (the rest can be ignored)



Maximum Margin Classification

Classification Margin

We can compute the (signed) distance from an example x_i to the separator as:

    r = (w^T x_i + b) / ||w||

The examples closest to the separator are the support vectors
The margin (ρ) of the separator is the distance between the support vectors of the two classes (see the numeric check below)
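A small numeric check of this formula on a few made-up points (the separator here is hypothetical):

    import numpy as np

    w = np.array([2.0, 1.0])
    b = -4.0
    X = np.array([[3.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

    r = (X @ w + b) / np.linalg.norm(w)   # one signed distance per example
    print(r)   # the points with the smallest |r| would be the support vector candidates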



Maximum Margin Classification

Large Margin Decision Boundary

The separator has to be as far as possible from the examples of both classes
This means that we have to maximize the margin
We can normalize the equation of the separator so that w^T x + b is +1 or −1 at the support vectors
With this normalization each support vector is at distance 1/||w|| from the separator, so the width of the optimal margin is

    m = 2 / ||w||

This means that maximizing the margin is the same as minimizing the norm of the weights



Maximum Margin Classification

Computing the decision boundary

Given a set of examples {x_1, x_2, . . . , x_n} with class labels y_i ∈ {+1, −1}
A decision boundary that classifies all the examples correctly satisfies

    y_i (w^T x_i + b) ≥ 1, ∀i

This allows us to define the problem of learning the weights as an optimization problem



Maximum Margin Classification

Computing the decision boundary

The primal formulation of the optimization problem is:

    Minimize  (1/2) ||w||²
    subject to  y_i (w^T x_i + b) ≥ 1, ∀i

This problem is difficult to solve in this formulation, but we can rewrite it in a more convenient form



Maximum Margin Classification

Constrained optimization (a little digression)

Suppose we want to minimize f(x) subject to g(x) = 0
A necessary condition for a point x_0 to be a solution is

    ∂/∂x [ f(x) + α g(x) ] |_{x = x_0} = 0
    g(x) = 0

where α is the Lagrange multiplier
If we have multiple constraints, we just need one multiplier per constraint:

    ∂/∂x [ f(x) + Σ_{i=1}^{n} α_i g_i(x) ] |_{x = x_0} = 0
    g_i(x) = 0  ∀i ∈ 1 . . . n
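As a quick illustration (not in the original slides), a minimal worked example in LaTeX notation: minimize f(x, y) = x² + y² subject to x + y − 1 = 0.

    \mathcal{L}(x, y, \alpha) = x^2 + y^2 + \alpha\,(x + y - 1)

    \frac{\partial \mathcal{L}}{\partial x} = 2x + \alpha = 0, \qquad
    \frac{\partial \mathcal{L}}{\partial y} = 2y + \alpha = 0, \qquad
    x + y - 1 = 0
    \;\Rightarrow\; x = y = \tfrac{1}{2}, \quad \alpha = -1

The stationary point of the Lagrangian gives the constrained minimum.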



Maximum Margin Classification

Constrained optimization (a little digression)

For inequality constraints g_i(x) ≤ 0 we have to add the condition that the Lagrange multipliers are non-negative, α_i ≥ 0
If x_0 is a solution to the constrained optimization problem

    min_x f(x)  subject to  g_i(x) ≤ 0  ∀i ∈ 1 . . . n

then there must exist α_i ≥ 0 such that x_0 satisfies

    ∂/∂x [ f(x) + Σ_{i=1}^{n} α_i g_i(x) ] |_{x = x_0} = 0
    α_i g_i(x) = 0  ∀i ∈ 1 . . . n

The function f(x) + Σ_{i=1}^{n} α_i g_i(x) is called the Lagrangian
The solution is a point where the gradient of the Lagrangian is 0



Maximum Margin Classification

Computing the decision boundary (we are back)

The primal formulation of the optimization problem is:

    Minimize  (1/2) ||w||²
    subject to  1 − y_i (w^T x_i + b) ≤ 0, ∀i

The Lagrangian is (using ||w||² = w^T w):

    L = (1/2) w^T w + Σ_{i=1}^{n} α_i (1 − y_i (w^T x_i + b))
Maximum Margin Classification

Computing the decision boundary (we are back)

If we compute the derivatives of L with respect to w and b and set them to zero (we are computing the gradient):

    w + Σ_{i=1}^{n} α_i (−y_i) x_i = 0  ⇒  w = Σ_{i=1}^{n} α_i y_i x_i

and

    Σ_{i=1}^{n} α_i y_i = 0



Maximum Margin Classification

The dual formulation

If we substitute w in the Lagrangian L:

    L = (1/2) Σ_{i=1}^{n} α_i y_i x_i^T Σ_{j=1}^{n} α_j y_j x_j + Σ_{i=1}^{n} α_i (1 − y_i (Σ_{j=1}^{n} α_j y_j x_j^T x_i + b))

      = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j + Σ_{i=1}^{n} α_i − Σ_{i=1}^{n} α_i y_i Σ_{j=1}^{n} α_j y_j x_j^T x_i − b Σ_{i=1}^{n} α_i y_i
        (and given that Σ_{i=1}^{n} α_i y_i = 0)

      = −(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j + Σ_{i=1}^{n} α_i   (rearranging terms)



Maximum Margin Classification

The dual formulation

This problem is known as the dual problem (the original one is the primal)
This formulation only depends on the α_i
Both problems are linked: given w we can compute the α_i, and vice versa
This means that we can solve one instead of the other
Note that the dual is a maximization problem (with α_i ≥ 0)



Maximum Margin Classification

The dual problem


The formulation of the dual problem is:

    Maximize  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
    subject to  Σ_{i=1}^{n} α_i y_i = 0,  α_i ≥ 0  ∀i

This is a Quadratic Programming (QP) problem
This means that the objective is a quadratic (paraboloid-like) surface in the parameters, so an optimum can be found
We can obtain w by

    w = Σ_{i=1}^{n} α_i y_i x_i

b can be obtained from a positive support vector x_psv, knowing that w^T x_psv + b = 1 (a small numerical sketch follows)
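A minimal sketch (not from the slides) of solving this dual numerically on a tiny, made-up 2D dataset, using a general-purpose solver (scipy's SLSQP) instead of a dedicated QP solver; all variable names are my own:

    import numpy as np
    from scipy.optimize import minimize

    # Tiny, hypothetical linearly separable dataset
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    G = np.outer(y, y) * (X @ X.T)          # G_ij = y_i y_j x_i^T x_j

    def neg_dual(alpha):
        # negative dual objective: minimizing it maximizes the dual
        return 0.5 * alpha @ G @ alpha - alpha.sum()

    cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
    res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
                   bounds=[(0, None)] * len(y), constraints=[cons])

    alpha = res.x
    sv = alpha > 1e-6                                # the support vectors
    w = ((alpha * y)[:, None] * X).sum(axis=0)       # w = sum_i alpha_i y_i x_i
    b = np.mean(y[sv] - X[sv] @ w)                   # from y_i (w^T x_i + b) = 1
    print(w, b, alpha.round(3))

In practice one would use a dedicated QP or SMO solver, but the recovered w, b and the sparse α already show the structure of the solution.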
Maximum Margin Classification

Geometric Interpretation



Maximum Margin Classification

Characteristics of the solution

Many of the α_i are zero
w is a linear combination of a small number of examples
We obtain a sparse representation of the data (data compression)
The examples x_i with non-zero α_i are the support vectors (SV)
The vector of parameters can be expressed as:

    w = Σ_{i∈SV} α_i y_i x_i



Maximum Margin Classification

Classifying new instances

In order to classify a new instance z we just have to compute

    sign(w^T z + b) = sign( Σ_{i∈SV} α_i y_i x_i^T z + b )

This means that w does not need to be explicitly computed
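Continuing the toy dual sketch above (the array names come from that sketch, not from the slides), prediction can use only the support vectors, without ever forming w:

    import numpy as np

    def predict(z, X_sv, y_sv, alpha_sv, b):
        # sign( sum_{i in SV} alpha_i y_i x_i^T z + b ); w is never built explicitly
        return np.sign((alpha_sv * y_sv * (X_sv @ z)).sum() + b)

    # e.g. predict(np.array([2.5, 2.0]), X[sv], y[sv], alpha[sv], b)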



Maximum Margin Classification

Solving the QP problem

Quadratic programming is a well-known optimization problem
Many algorithms have been proposed
One of the most popular is Sequential Minimal Optimization (SMO)
SMO picks a pair of α variables and solves the resulting two-variable QP subproblem (which is trivial)
It repeats this procedure until convergence





Soft Margin Classification

Non separable problems

This algorithm works well for separable problems
Sometimes the data has errors and we want to ignore them to obtain a better solution
Sometimes the data is simply not linearly separable
We can obtain better linear separators by being less strict



Soft Margin Classification

Soft Margin Classification

We want to be permissive with certain examples, allowing their classification by the separator to diverge from the real class
This can be obtained by adding to the problem what are called slack variables (ξ_i)
These variables represent the deviation of the examples from the margin
Doing this we relax the margin: we are using a soft margin



Soft Margin Classification

Soft Margin Classification

We are allowing an error in the classification given by the separator w^T x + b
The sum of the ξ_i approximates the number of misclassifications (it is an upper bound on it)



Soft Margin Classification

Soft Margin Hyperplane

In order to minimize the error, we can minimize Σ_i ξ_i, introducing the slack variables into the constraints of the problem:

    w^T x_i + b ≥ 1 − ξ_i   if y_i = 1
    w^T x_i + b ≤ −1 + ξ_i  if y_i = −1
    ξ_i ≥ 0

ξ_i = 0 for all i if there are no errors (linearly separable problem)
The number of resulting support vectors and slack variables gives an upper bound on the leave-one-out error



Soft Margin Classification

The primal optimization problem

We need to introduce these slack variables into the original problem; we now have:

    Minimize  (1/2) ||w||² + C Σ_{i=1}^{n} ξ_i
    subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i, ∀i,  ξ_i ≥ 0

Now we have an additional parameter C that controls the trade-off between the error and the margin
We will need to adjust this parameter (see the sketch below for its effect)
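A hedged sketch (not from the slides) of how C behaves in practice, using scikit-learn's SVC on made-up noisy data: a small C tolerates more slack (softer margin), a large C penalizes errors more.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    # Hypothetical noisy two-class data: two overlapping Gaussian blobs
    X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
    y = np.array([1] * 50 + [-1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # smaller C -> softer margin, usually more support vectors
        print(C, clf.n_support_.sum(), clf.score(X, y))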



Soft Margin Classification

The dual optimization problem

Performing the transformation to the dual problem we obtain the following:

    Maximize  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
    subject to  Σ_{i=1}^{n} α_i y_i = 0,  0 ≤ α_i ≤ C  ∀i

We can recover the solution as w = Σ_{i∈SV} α_i y_i x_i
This problem is very similar to the linearly separable case, except that there is an upper bound C on the values of α_i
This is also a QP problem




Non linear large margin classification

Non linear large margin classification

So far we have only considered large margin classifiers that use a linear boundary
In order to obtain better performance we have to be able to use non-linear boundaries
The idea is to transform the data from the input space (the original attributes of the examples) to a higher-dimensional space using a function φ(x)
This new space is called the feature space
The advantage of the transformation is that linear operations in the feature space are equivalent to non-linear operations in the input space
Remember RBF networks?



Non linear large margin classification

Feature Space transformation



Non linear large margin classification

The XOR problem

The XOR problem is not linearly separable in its original definition, but we can make it linearly separable if we add a new feature x1 · x2

    x1  x2  x1 · x2  x1 XOR x2
     0   0     0         0
     0   1     0         1
     1   0     0         1
     1   1     1         0

The linear function h(x) = 2 x1 x2 − x1 − x2 + 1 takes value 0 on the examples with x1 XOR x2 = 1 and value 1 on the others, so thresholding it at 1/2 classifies all the examples correctly (a numeric check follows)
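A small check of the example above (the data and threshold are the ones in the table; the code itself is a sketch of mine):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    xor = X[:, 0] ^ X[:, 1]                       # the target: 0, 1, 1, 0

    h = 2 * X[:, 0] * X[:, 1] - X[:, 0] - X[:, 1] + 1
    pred = (h < 0.5).astype(int)                  # h < 1/2  <=>  x1 XOR x2 = 1
    print(h)                                      # [1 0 0 1]
    print((pred == xor).all())                    # True: the added feature separates XOR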



Non linear large margin classification

Transforming the data

Working directly in the feature space can be costly
We would have to explicitly create the feature space and operate in it
We may need infinitely many features to make a problem linearly separable
Instead, we can use what is called the kernel trick



Non linear large margin classification

The Kernel Trick

In the problem that defines an SVM, only the inner products between examples are needed
This means that if we can define how to compute this product in the feature space, then it is not necessary to explicitly build that space
Many geometric operations can be expressed as inner products, like angles or distances
We can define the kernel function as:

    K(x_i, x_j) = φ(x_i)^T φ(x_j)



Non linear large margin classification

Kernels - Example

We can show how this kernel trick works with an example
Let's assume a feature space defined as:

    φ(x1, x2) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2)

An inner product in this feature space is:

    φ(x1, x2)^T φ(y1, y2) = (1 + x1 y1 + x2 y2)²

So we can define a kernel function that computes inner products in this space as:

    K(x, y) = (1 + x1 y1 + x2 y2)²

and we are using only the features from the input space (verified numerically below)
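A quick numerical check of this identity (the feature map is the one above; the code is a sketch of mine):

    import numpy as np

    def phi(v):
        x1, x2 = v
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    rng = np.random.RandomState(1)
    x, y = rng.randn(2), rng.randn(2)
    print(np.isclose(phi(x) @ phi(y), (1 + x @ y) ** 2))   # True: same inner product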
Non linear large margin classification

Kernel functions - Properties

Given a set of examples X = {x_1, x_2, . . . , x_n} and a symmetric function k(x, z), we can define the kernel matrix K as:

    K = Φ^T Φ = [ k(x_1, x_1)  · · ·  k(x_1, x_n)
                      · · ·
                  k(x_n, x_1)  · · ·  k(x_n, x_n) ]

If K is a symmetric and positive semi-definite matrix, then k(x, z) is a kernel function
The justification is that a symmetric positive semi-definite K can be decomposed as:

    K = V Λ V^T

and V can be interpreted as a projection into the feature space (remember feature projection?)
Non linear large margin classification

Kernel functions

Examples of functions that satisfy this condition are:
Polynomial kernel of degree d: K(x, y) = (x^T y + 1)^d
Gaussian kernel with width σ: K(x, y) = exp(−||x − y||² / 2σ²)
Sigmoid (hyperbolic tangent) with parameters k and θ: K(x, y) = tanh(k x^T y + θ) (only for certain values of the parameters)
Kernel functions can also be interpreted as similarity functions
A similarity function can be turned into a kernel provided that it satisfies the positive semi-definite matrix condition (see the small check below)
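A hedged sketch checking this condition for the Gaussian kernel on a handful of random points (my own code, not from the slides):

    import numpy as np

    rng = np.random.RandomState(2)
    X = rng.randn(6, 2)
    sigma = 1.0

    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    K = np.exp(-d2 / (2 * sigma ** 2))                          # Gaussian kernel matrix

    print(np.allclose(K, K.T))                      # symmetric
    print(np.linalg.eigvalsh(K).min() >= -1e-10)    # eigenvalues non-negative (PSD)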



Non linear large margin classification

The dual problem

With the introduction of the kernel function, the optimization problem becomes:

    Maximize  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)
    subject to  Σ_{i=1}^{n} α_i y_i = 0,  0 ≤ α_i ≤ C  ∀i

For classifying new examples:

    h(z) = sign( Σ_{i∈SV} α_i y_i K(x_i, z) + b )

(an illustration on a non-linear dataset follows)
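A hedged illustration (my own, using scikit-learn, whose SVC solves this kernelized dual internally) of why the kernel matters on a problem with no linear separator:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Two concentric rings: not linearly separable in the input space
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

    linear = SVC(kernel="linear", C=1.0).fit(X, y)
    rbf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

    print(linear.score(X, y))   # poor: no separating hyperplane exists in the input space
    print(rbf.score(X, y))      # high: the Gaussian kernel yields a non-linear boundary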



Non linear large margin classification

The advantage of using kernels

Since training the SVM only needs the values K(x_i, x_j), there are no constraints on how the examples are represented
We only have to define the kernel as a similarity between examples
We can define similarity functions for different representations:
Strings, sequences
Graphs/Trees
Documents
...





Application

Splice-Junction gene sequences

Identification of gene sequence type (Bioinformatics)
60 attributes (discrete)
Attributes: DNA base at positions 1-60
3190 instances
3 classes
Method: Support vector machines
Validation: 10-fold cross validation (a possible setup with scikit-learn is sketched below)
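A hedged sketch of a comparable experimental setup (not the original one; the data here is a random placeholder and the class names are assumptions):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.choice(list("ACGT"), size=(300, 60))   # placeholder DNA-base attributes
    y = rng.choice(["EI", "IE", "N"], size=300)    # hypothetical class labels

    model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                          SVC(kernel="linear", C=1.0))
    print(cross_val_score(model, X, y, cv=10).mean())   # 10-fold cross validation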



Application

Splice-Junction gene sequences: Models

SVM (linear): accuracy 90.9%, 1331 support vectors
SVM (linear, C=1): accuracy 91.8%, 608 support vectors
SVM (linear, C=5): accuracy 91.5%, 579 support vectors
SVM (quadratic): accuracy 91.4%, 1305 support vectors
SVM (quadratic, C=5): accuracy 91.6%, 1054 support vectors
SVM (cubic, C=5): accuracy 91.9%, 1206 support vectors

