
Support Vector Machine

Shao-Chuan Wang
1
Support Vector Machine
1D Classification Problem: how would you
separate these data? (H1, H2, or H3?)
2
[Figure: 1D data along the x axis, with three candidate separators H1, H2, H3]
Support Vector Machine
2D Classification Problem: which H is better?
3
Max-Margin Classifier
Functional Margin
Training set: $S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{m}$, $y^{(i)} \in \{-1, +1\}$
$\hat{\gamma}^{(i)} = y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right)$
We feel more confident when the functional margin is larger.
Geometric Margin
$\gamma^{(i)} = y^{(i)}\left( \left(\frac{\mathbf{w}}{\|\mathbf{w}\|}\right)^T \mathbf{x}^{(i)} + \frac{b}{\|\mathbf{w}\|} \right)$
Separating hyperplane: $\mathbf{w}^T \mathbf{x} + b = 0$
[Figure: a point $(\mathbf{x}^{(i)}, y^{(i)})$ at geometric distance $\gamma^{(i)}$ from the hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$ in the $(x_1, x_2)$ plane, with normal vector $\mathbf{w}$]
Note that scaling $\mathbf{w}$ and $b$ won't change the plane.
4
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
Maximize margins
Optimization problem: maximize the minimal geometric margin under constraints.
Introduce a scaling factor such that the functional margin is 1 (see the sketch after this slide).
5
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
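A sketch of the standard formulation (following the cited notes; $\hat{\gamma}$ denotes the functional margin):
$\max_{\gamma, \mathbf{w}, b} \ \gamma \quad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \ge \gamma, \quad \|\mathbf{w}\| = 1$
Fixing the scaling so that $\hat{\gamma} = 1$ gives the equivalent problem
$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \ge 1, \quad i = 1, \ldots, m$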
Optimization problem subject to
constraints
Maximize f(x, y), subject to constraint g(x, y) = c
[Figure: contours of f(x, y) and the constraint curve g(x, y) = c]
Lagrange multiplier method:
$\Lambda(x, y, \lambda) = f(x, y) + \lambda\,\big(g(x, y) - c\big)$
$\frac{\partial \Lambda}{\partial x} = 0, \quad \frac{\partial \Lambda}{\partial y} = 0, \quad \frac{\partial \Lambda}{\partial \lambda} = 0$
6
Lagrange duality
Primal optimization problem:


Generalized Lagrangian method


Primal optimization problem (equivalent form)

Dual optimization problem:
7
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
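A sketch of the standard forms (following the cited notes), with $f$ the objective and $g_i$, $h_i$ the inequality and equality constraint functions:
Primal: $\min_{w} \ f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \ \ h_i(w) = 0$
Generalized Lagrangian: $\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w)$
Primal (equivalent form): $p^* = \min_{w} \ \max_{\alpha \ge 0,\, \beta} \ \mathcal{L}(w, \alpha, \beta)$
Dual: $d^* = \max_{\alpha \ge 0,\, \beta} \ \min_{w} \ \mathcal{L}(w, \alpha, \beta) \ \le \ p^*$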
Dual Problem

The conditions for equality to hold: f and the g_i are convex, and the h_i are affine.
Then the KKT conditions hold at the optimum (sketched below).
8
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
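For reference, a sketch of the KKT conditions at an optimum $(w^*, \alpha^*, \beta^*)$, in the notation of the cited notes:
$\frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0, \qquad \frac{\partial}{\partial \beta_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0$
$\alpha_i^* \, g_i(w^*) = 0$ (complementary slackness)
$g_i(w^*) \le 0, \qquad \alpha_i^* \ge 0$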
Optimal margin classifiers


Its Lagrangian

Its dual problem
9
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
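A sketch of the standard forms (following the cited notes):
Primal: $\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \ge 1$
Lagrangian: $\mathcal{L}(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) - 1 \right]$
Dual: $\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \rangle \quad \text{s.t.} \quad \alpha_i \ge 0, \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$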
Support Vector Machine (cont'd)
If not linearly separable, we can find a nonlinear solution.
Technically, it's a linear solution in a higher-order space.
[Figure: data that are not linearly separable in the input space become separable after mapping into a higher-dimensional feature space, e.g. $\Phi(x) = (x_1^2,\ x_2^2)^T$]
Kernel Trick: replace inner products $x_i^T x_j$ with $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$
10
Kernel and feature mapping
Kernel:
Positive semi-definite
Symmetric
For example:

Loose Intuition
similarity between features

11
$K(x, z) = \phi(x)^T \phi(z)$
$K(x, z) = (x^T z)^2$, with (for $x \in \mathbb{R}^3$)
$\phi(x) = (x_1 x_1,\ x_1 x_2,\ x_1 x_3,\ x_2 x_1,\ x_2 x_2,\ x_2 x_3,\ x_3 x_1,\ x_3 x_2,\ x_3 x_3)^T$
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
Positive semi-definite: $z^T K z \ge 0, \ \forall z \in \mathbb{R}^n, \ z \ne 0$
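As a quick sanity check of the example above, a short sketch (numpy assumed; the vectors are made up for illustration) verifying that the explicit feature map reproduces $(x^T z)^2$:

import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x^T z)^2 on R^3:
    # all 9 pairwise products x_i * x_j.
    return np.array([x[i] * x[j] for i in range(3) for j in range(3)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

kernel_value = (x @ z) ** 2         # computed directly from the inner product, O(n)
feature_value = phi(x) @ phi(z)     # same value via the explicit map, O(n^2)
print(kernel_value, feature_value)  # both are 20.25

The kernel evaluates the same quantity without ever forming the higher-dimensional vectors.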
Soft Margin (L1 regularization)
12
C = ∞ leads to the hard-margin SVM,
Rychetsky (2001)
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
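A sketch of the standard L1 soft-margin primal (following the cited notes), with slack variables $\xi_i$:
$\min_{\mathbf{w}, b, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0$
As C grows, nonzero slack becomes increasingly costly, which is why C = ∞ recovers the hard-margin SVM.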
Why doesn't my model fit well
on test data?
13
Bias/variance tradeoff

[Figure: underfitting (high bias) vs. overfitting (high variance)]
Training Error: $\hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^{m} 1\{h(x^{(i)}) \neq y^{(i)}\}$
Generalization Error: $\varepsilon(h) = P_{(x, y) \sim D}\left(h(x) \neq y\right)$
14
Andrew Ng. Part V Support Vector Machines. CS229 Lecture Notes (2008).
[Figure: in-sample (training) error and out-of-sample (test) error vs. model complexity, illustrating the bias/variance tradeoff]
15
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer
series in statistics. Springer, New York, 2001.
Is training error a good estimator
of generalization error?
16
Chernoff bound (|H|=finite)
Lemma: Assume $Z_1, Z_2, \ldots, Z_m$ are drawn i.i.d. from Bernoulli($\phi$), let
$\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i$,
and let $\gamma > 0$ be fixed. Then
$P\left(|\phi - \hat{\phi}| > \gamma\right) \le 2 \exp(-2\gamma^2 m)$
Based on this lemma, one can show, with probability $1 - \delta$ ($k$ = number of hypotheses),
$\varepsilon(\hat{h}) \le \left( \min_{h \in H} \varepsilon(h) \right) + 2 \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$
17
Andrew Ng. Part VI Learning Theory. CS229 Lecture Notes (2008).
Chernoff bound (|H|=infinite)
VC Dimension d: the size of the largest set that H can shatter.
e.g. H = linear classifiers in 2-D: VC(H) = 3
With probability at least $1 - \delta$,
$\varepsilon(\hat{h}) \le \varepsilon(h^*) + O\!\left( \sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}} \right)$
18
Andrew Ng. Part VI Learning Theory. CS229 Lecture Notes (2008).
Model Selection
Cross Validation: an estimator of the generalization error
K-fold: train on k-1 pieces, test on the remaining piece (this yields one test-error estimate).
Average the k test-error estimates, say 2%; then 2% is the estimate of the generalization error for this learner (see the sketch after this slide).
Leave-one-out cross validation (m-fold, with m = training sample size)
19
[Figure: one round of k-fold cross validation — train | train | validate | train | train | train]
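As a concrete sketch of K-fold cross validation (scikit-learn assumed as the SVM implementation; the dataset and parameter values are made up for illustration):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data, purely illustrative: 100 samples, 2 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 6-fold cross validation: train on 5 folds, validate on the held-out fold,
# repeat 6 times, then average the 6 accuracy estimates.
scores = cross_val_score(SVC(kernel="linear", C=2.0), X, y, cv=6)
print("estimated generalization accuracy:", scores.mean())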
Model Selection
Loop over possible parameters:
Pick one parameter setting, e.g. C = 2.0
Do cross validation to get an error estimate
Pick C_best (the setting with the minimal error estimate) as the parameter (see the sketch after this slide).


20
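A minimal sketch of this selection loop (scikit-learn assumed; the candidate C values and toy data are made up for illustration):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

best_C, best_score = None, -np.inf
for C in [0.1, 0.5, 1.0, 2.0, 4.0, 8.0]:    # candidate parameter values (made up)
    score = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=6).mean()
    if score > best_score:                  # keep the C with the best CV estimate
        best_C, best_score = C, score
print("C_best =", best_C, "cross-validated accuracy =", best_score)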
Multiclass SVM
One against one
There are $\binom{k}{2}$ binary SVMs (1 vs 2, 1 vs 3, ...).
To predict, each SVM votes between its 2 classes; the class with the most votes wins the poll (sketched after this slide).
[Figure: voting example — votes 1 3 5 3 2 1 for classes 1 2 3 4 5 6; the poll picks class 3 (K = 3)]
One against all
There are k binary SVMs (1 vs rest, 2 vs rest, ...).
To predict, evaluate $\mathbf{w}^T \mathbf{x} + b$ for each, and pick the largest.
Multiclass SVM by solving ONE optimization problem
21
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based
vector machines. JMLR, 2, 265-292.
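A minimal sketch of one-against-one voting built from binary SVMs (scikit-learn assumed; the function name and parameters are illustrative only — note that sklearn's SVC already implements this scheme internally):

from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def ovo_predict(X_train, y_train, X_test, C=1.0):
    # Train one binary SVM per pair of classes and let each one vote.
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for a, b in combinations(range(len(classes)), 2):
        mask = np.isin(y_train, [classes[a], classes[b]])
        clf = SVC(kernel="linear", C=C).fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)
        votes[:, a] += (pred == classes[a])
        votes[:, b] += (pred == classes[b])
    # The class with the most votes wins the poll.
    return classes[np.argmax(votes, axis=1)]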
Multiclass SVM (2/2)
DAGSVM (Directed Acyclic Graph SVM)
22
An Example: image classification
Process
23
[Figure: pipeline — raw images → formatted vectors (sample lines such as "1 0:49 1:25"), split 3/4 into training data and 1/4 into test data; K-fold cross validation (K = 6) on the training data selects the best C; the SVM with the best C is evaluated on the test data to report accuracy]
An Example: image classification
Results
Run the multi-class SVM 100 times for both kernels (linear/Gaussian).
[Figure: accuracy histogram]
24
