
Lecture 3: Linear SVM with slack variables

Stéphane Canu
stephane.canu@litislab.eu

Sao Paulo 2014

March 23, 2014


The non separable case

[Figure: a two-class data set in the plane that is not linearly separable.]


Road map
1 Linear SVM
The non separable case
The C (L1) SVM
The L2 SVM and others “variations on a theme”
The hinge loss

The non separable case: a bi criteria optimization problem
Modeling potential errors: introducing slack variables ξi

For each data point (xi , yi):
    no error:   yi (w⊤ xi + b) ≥ 1  ⇒  ξi = 0
    error:      ξi = 1 − yi (w⊤ xi + b) > 0

The bi-criteria problem:

    min_{w,b,ξ}   1/2 ‖w‖²
    min_{w,b,ξ}   C/p Σ_{i=1}^{n} ξi^p
    with   yi (w⊤ xi + b) ≥ 1 − ξi ,   i = 1, n
           ξi ≥ 0 ,                    i = 1, n

Our hope: almost all ξi = 0


Bi criteria optimization and dominance

The two criteria:

    L(w) = 1/p Σ_{i=1}^{n} ξi^p          (the loss)
    P(w) = ‖w‖²                          (the penalty)

Dominance: w1 dominates w2 if L(w1) ≤ L(w2) and P(w1) ≤ P(w2)

[Figure: the admissible set, an admissible (non dominated) solution (purple), a dominated point (red), the Pareto front (blue) and the point w = 0.]

Pareto front (or Pareto Efficient Frontier): the set of all non dominated solutions

Pareto frontier  ⇔  Regularization path



3 equivalent formulations to reach Pareto's front

    min_{w∈IR^d}   1/p Σ_{i=1}^{n} ξi^p + λ ‖w‖²

    min_{w}   1/p Σ_{i=1}^{n} ξi^p
    with      ‖w‖² ≤ k

    min_{w}   ‖w‖²
    with      1/p Σ_{i=1}^{n} ξi^p ≤ k′

it works for CONVEX criteria!
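In practice the front is traced by sweeping the trade-off parameter (λ, or equivalently C). A minimal sketch, assuming scikit-learn is available; LinearSVC with the hinge loss solves the penalized formulation, and each value of C gives one (penalty, loss) point:

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumption: scikit-learn is installed

def pareto_points(X, y, Cs=np.logspace(-3, 3, 13)):
    """Sweep C and record (||w||^2, total hinge loss): one point per model,
    tracing an approximation of the Pareto front / regularization path."""
    points = []
    for C in Cs:
        clf = LinearSVC(C=C, loss="hinge", max_iter=100000).fit(X, y)
        w, b = clf.coef_.ravel(), clf.intercept_[0]
        loss = np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
        points.append((float(w @ w), float(loss)))
    return points
```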


The non separable case
Modeling potential errors: introducing slack variables ξi

For each data point (xi , yi):
    no error:   yi (w⊤ xi + b) ≥ 1  ⇒  ξi = 0
    error:      ξi = 1 − yi (w⊤ xi + b) > 0

Minimizing also the slack (the error), for a given C > 0


    min_{w,b,ξ}   1/2 ‖w‖² + C/p Σ_{i=1}^{n} ξi^p
    with   yi (w⊤ xi + b) ≥ 1 − ξi ,   i = 1, n
           ξi ≥ 0 ,                    i = 1, n

Looking for the saddle point of the Lagrangian, with Lagrange multipliers αi ≥ 0 and βi ≥ 0:

    L(w, b, α, β) = 1/2 ‖w‖² + C/p Σ_{i=1}^{n} ξi^p − Σ_{i=1}^{n} αi [ yi (w⊤ xi + b) − 1 + ξi ] − Σ_{i=1}^{n} βi ξi
The KKT conditions (p = 1)

    L(w, b, α, β) = 1/2 ‖w‖² + C Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} αi [ yi (w⊤ xi + b) − 1 + ξi ] − Σ_{i=1}^{n} βi ξi

    stationarity           w − Σ_{i=1}^{n} αi yi xi = 0   and   Σ_{i=1}^{n} αi yi = 0
                           C − αi − βi = 0                          i = 1, . . . , n
    primal admissibility   yi (w⊤ xi + b) ≥ 1 − ξi                  i = 1, . . . , n
                           ξi ≥ 0                                   i = 1, . . . , n
    dual admissibility     αi ≥ 0                                   i = 1, . . . , n
                           βi ≥ 0                                   i = 1, . . . , n
    complementarity        αi [ yi (w⊤ xi + b) − 1 + ξi ] = 0       i = 1, . . . , n
                           βi ξi = 0                                i = 1, . . . , n
Let’s eliminate β!
KKT (p = 1), after eliminating β:

    stationarity           w − Σ_{i=1}^{n} αi yi xi = 0   and   Σ_{i=1}^{n} αi yi = 0
    primal admissibility   yi (w⊤ xi + b) ≥ 1 − ξi ,   ξi ≥ 0      i = 1, . . . , n
    dual admissibility     αi ≥ 0 ,   C − αi ≥ 0                   i = 1, . . . , n
    complementarity        αi [ yi (w⊤ xi + b) − 1 + ξi ] = 0      i = 1, . . . , n
                           (C − αi) ξi = 0                         i = 1, . . . , n

    set                  I0                     Iα                          IC
    αi                   0                      0 < αi < C                  C
    βi                   C                      C − αi                      0
    ξi                   0                      0                           1 − yi (w⊤ xi + b)
    yi (w⊤ xi + b)       > 1                    = 1                         < 1
                         useless                useful (support vector)     suspicious
The importance of being support
[Figure: two 2D examples (scatter plots).]

    data point       α                 constraint value              set
    xi useless       αi = 0            yi (w⊤ xi + b) > 1            I0
    xi support       0 < αi < C        yi (w⊤ xi + b) = 1            Iα
    xi suspicious    αi = C            yi (w⊤ xi + b) < 1            IC
Table: When a data point is « support » it lies exactly on the margin.

here lies the efficiency of the algorithm (and its complexity)!


sparsity: αi = 0
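As an illustration, a small numpy sketch (function and variable names are ours, not from the slides) that recovers the three sets of the table from the dual variables α:

```python
import numpy as np

def split_sets(alpha, C, tol=1e-8):
    """Split indices into I0 (useless), I_alpha (support) and IC (suspicious)
    from the dual variables alpha, up to a numerical tolerance."""
    I0 = np.where(alpha <= tol)[0]                           # alpha_i = 0
    IC = np.where(alpha >= C - tol)[0]                       # alpha_i = C
    Ialpha = np.where((alpha > tol) & (alpha < C - tol))[0]  # 0 < alpha_i < C: on the margin
    return I0, Ialpha, IC
```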
Optimality conditions (p = 1)
    L(w, b, α, β) = 1/2 ‖w‖² + C Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} αi [ yi (w⊤ xi + b) − 1 + ξi ] − Σ_{i=1}^{n} βi ξi

Computing the gradients:

    ∇w L(w, b, α)    = w − Σ_{i=1}^{n} αi yi xi
    ∂L(w, b, α)/∂b   = Σ_{i=1}^{n} αi yi
    ∇ξi L(w, b, α)   = C − αi − βi

no change for w and b

    βi ≥ 0  and  C − αi − βi = 0   ⇒   αi ≤ C

The dual formulation:

    min_{α∈IR^n}   1/2 α⊤ G α − e⊤ α
    with           y⊤ α = 0
    and            0 ≤ αi ≤ C ,   i = 1, n
SVM primal vs. dual

Primal:

    min_{w,b,ξ∈IR^n}   1/2 ‖w‖² + C Σ_{i=1}^{n} ξi
    with   yi (w⊤ xi + b) ≥ 1 − ξi ,   ξi ≥ 0 ,   i = 1, n

    d + n + 1 unknowns
    2n constraints
    classical QP
    to be used when n is too large to build G

Dual:

    min_{α∈IR^n}   1/2 α⊤ G α − e⊤ α
    with   y⊤ α = 0   and   0 ≤ αi ≤ C ,   i = 1, n

    n unknowns
    G Gram matrix (pairwise influence matrix)
    2n box constraints
    easy to solve
    to be used when n is not too large
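The dual above is a standard QP, so a generic solver is enough for small n. A minimal sketch, assuming cvxopt is available and taking Gij = yi yj xi⊤ xj (the usual convention for the Gram/influence matrix); all names are ours:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

solvers.options["show_progress"] = False

def linear_svm_dual(X, y, C):
    """Solve min 1/2 a'Ga - e'a  s.t.  y'a = 0, 0 <= a_i <= C, then recover (w, b)."""
    n = X.shape[0]
    Yx = y[:, None] * X
    G = Yx @ Yx.T                                    # G_ij = y_i y_j x_i' x_j
    P, q = matrix(G), matrix(-np.ones(n))
    Gc = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # -a <= 0  and  a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, Gc, h, A, b)['x'])
    w = Yx.T @ alpha                                 # w = sum_i alpha_i y_i x_i
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)  # points in I_alpha
    b_offset = float(np.mean(y[on_margin] - X[on_margin] @ w))
    return w, b_offset, alpha
```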
The smallest C

C small  ⇒  all the points are in IC :  αi = C

    −1 ≤ fj = C Σ_{i=1}^{n} yi (xi⊤ xj) + b ≤ 1

    fM = max(f) ,    fm = min(f)

    Cmax = 2 / (fM − fm)

[Figure: 2D illustration.]
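A small numpy sketch of that computation (our naming; b and the common factor C cancel in fM − fm, so f can be evaluated with C = 1 and b = 0):

```python
import numpy as np

def c_max(X, y):
    """C_max = 2 / (f_M - f_m) with f_j = sum_i y_i x_i'x_j (C = 1, b = 0):
    the threshold below which all points remain in IC, following the slide's formula."""
    f = X @ (X.T @ y)          # f_j = sum_i y_i x_i' x_j
    return 2.0 / (f.max() - f.min())
```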
Road map
1 Linear SVM
The non separable case
The C (L1) SVM
The L2 SVM and others “variations on a theme”
The hinge loss

L2 SVM: optimality conditions (p = 2)
    L(w, b, α) = 1/2 ‖w‖² + C/2 Σ_{i=1}^{n} ξi² − Σ_{i=1}^{n} αi [ yi (w⊤ xi + b) − 1 + ξi ]

Computing the gradients:

    ∇w L(w, b, α)    = w − Σ_{i=1}^{n} αi yi xi
    ∂L(w, b, α)/∂b   = Σ_{i=1}^{n} αi yi
    ∇ξi L(w, b, α)   = C ξi − αi

no need of the positivity constraint on the ξi
no change for w and b

    C ξi − αi = 0   ⇒   C/2 Σ_{i=1}^{n} ξi² − Σ_{i=1}^{n} αi ξi = − 1/(2C) Σ_{i=1}^{n} αi²

The dual formulation:

    min_{α∈IR^n}   1/2 α⊤ (G + 1/C I) α − e⊤ α
    with           y⊤ α = 0
    and            0 ≤ αi ,   i = 1, n

SVM primal vs. dual

Primal:

    min_{w,b,ξ∈IR^n}   1/2 ‖w‖² + C/2 Σ_{i=1}^{n} ξi²
    with   yi (w⊤ xi + b) ≥ 1 − ξi ,   i = 1, n

    d + n + 1 unknowns
    n constraints
    classical QP
    to be used when n is too large to build G

Dual:

    min_{α∈IR^n}   1/2 α⊤ (G + 1/C I) α − e⊤ α
    with   y⊤ α = 0   and   0 ≤ αi ,   i = 1, n

    n unknowns
    G Gram matrix is regularized
    n box constraints
    easy to solve
    to be used when n is not too large
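Compared with the L1 case, the only changes in the dual are the regularized matrix G + (1/C) I and the absence of the upper bound on α. A minimal cvxopt sketch under the same assumptions as before (our naming, Gij = yi yj xi⊤ xj):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumption: cvxopt is installed

def l2_svm_dual(X, y, C):
    """Solve min 1/2 a'(G + I/C)a - e'a  s.t.  y'a = 0, a_i >= 0."""
    n = X.shape[0]
    Yx = y[:, None] * X
    P = matrix(Yx @ Yx.T + np.eye(n) / C)              # G + (1/C) I
    q = matrix(-np.ones(n))
    Gc, h = matrix(-np.eye(n)), matrix(np.zeros(n))    # only a_i >= 0 remains
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)
    return np.ravel(solvers.qp(P, q, Gc, h, A, b)['x'])
```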
One more variant: the ν SVM

    max_{v,a}   m
    with        min_{i=1,n} |v⊤ xi + a| ≥ m
                ‖v‖² = k

    min_{v,a}   1/2 ‖v‖² − ν m + Σ_{i=1}^{n} ξi
    with        yi (v⊤ xi + a) ≥ m − ξi
                ξi ≥ 0 ,   m ≥ 0

The dual formulation:

    min_{α∈IR^n}   1/2 α⊤ G α
    with           y⊤ α = 0
    and            0 ≤ αi ≤ 1/n ,   i = 1, n
                   ν ≤ e⊤ α
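This variant is available off the shelf; for instance scikit-learn's NuSVC takes ν directly (a sketch, assuming scikit-learn; ν upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors):

```python
from sklearn.svm import NuSVC

# nu plays the role of the slide's ν; the value 0.2 is arbitrary
clf = NuSVC(nu=0.2, kernel="linear")
```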

The convex hull formulation

Minimizing the distance between the convex hulls:

    min_{α}   ‖u − v‖
    with      u = Σ_{i|yi=1} αi xi ,     v = Σ_{i|yi=−1} αi xi
    and       Σ_{i|yi=1} αi = 1 ,   Σ_{i|yi=−1} αi = 1 ,   0 ≤ αi ≤ C ,   i = 1, n

    w⊤ x = 2 (u⊤ x − v⊤ x) / ‖u − v‖      and      b = (‖u‖ − ‖v‖) / ‖u − v‖
SVM with non symmetric costs

Problem in the primal (p = 1):

    min_{w,b,ξ∈IR^n}   1/2 ‖w‖² + C+ Σ_{i|yi=1} ξi + C− Σ_{i|yi=−1} ξi
    with   yi (w⊤ xi + b) ≥ 1 − ξi ,   ξi ≥ 0 ,   i = 1, n

for p = 1 the dual formulation is the following:

    max_{α∈IR^n}   − 1/2 α⊤ G α + α⊤ e
    with           α⊤ y = 0   and   0 ≤ αi ≤ C+ or C− ,   i = 1, n

It generalizes to any cost (useful for unbalanced data)
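In practice this per-class cost is what libraries expose as class weights; for instance in scikit-learn (a sketch, assuming scikit-learn; the weights below are arbitrary and multiply C per class):

```python
from sklearn.svm import SVC

# effectively C+ = 5 C for the positive class and C- = C for the negative class
clf = SVC(kernel="linear", C=1.0, class_weight={1: 5.0, -1: 1.0})
```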


Road map
1 Linear SVM
The non separable case
The C (L1) SVM
The L2 SVM and others “variations on a theme”
The hinge loss

Eliminating the slack but not the possible mistakes
    min_{w,b,ξ∈IR^n}   1/2 ‖w‖² + C Σ_{i=1}^{n} ξi
    with   yi (w⊤ xi + b) ≥ 1 − ξi ,   ξi ≥ 0 ,   i = 1, n

Introducing the hinge loss:

    ξi = max( 1 − yi (w⊤ xi + b) , 0 )

    min_{w,b}   1/2 ‖w‖² + C Σ_{i=1}^{n} max( 0 , 1 − yi (w⊤ xi + b) )

Back to d + 1 variables, but this is no longer an explicit QP
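Even though it is no longer an explicit QP, the unconstrained problem can be attacked directly, e.g. by sub-gradient descent on the regularized hinge loss (a minimal numpy sketch; the step size, iteration count and names are our choices):

```python
import numpy as np

def hinge_svm_subgradient(X, y, C, n_iter=1000, step=1e-3):
    """Sub-gradient descent on 1/2 ||w||^2 + C sum_i max(0, 1 - y_i (w'x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                                     # points where the hinge is active
        gw = w - C * (y[active, None] * X[active]).sum(axis=0)   # a valid sub-gradient in w
        gb = -C * y[active].sum()                                # and in b
        w -= step * gw
        b -= step * gb
    return w, b
```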


Ooops! the notion of sub differential

Definition (Subgradient)

a subgradient of J : IR^d → IR at f0 is any vector g ∈ IR^d such that

    ∀f ∈ V(f0) ,   J(f) ≥ J(f0) + g⊤ (f − f0)

Definition (Subdifferential)

∂J(f), the subdifferential of J at f, is the set of all subgradients of J at f.

Examples (IR^d = IR):

    J3(x) = |x|               ∂J3(0) = {g ∈ IR | −1 ≤ g ≤ 1}
    J4(x) = max(0, 1 − x)     ∂J4(1) = {g ∈ IR | −1 ≤ g ≤ 0}
Regularization path for SVM

    min_{w}   Σ_{i=1}^{n} max( 1 − yi w⊤ xi , 0 ) + λo/2 ‖w‖²

Iα is the set of support vectors, s.t. yi w⊤ xi = 1;

    ∂w J(w) = Σ_{i∈Iα} αi yi xi − Σ_{i∈I1} yi xi + λo w     with  αi ∈ ∂H(1) = [−1, 0]

Let λn be a value close enough to λo to keep the sets I0, Iα and IC unchanged.

In particular, at a point xj ∈ Iα (wo⊤ xj = wn⊤ xj = yj):   ∂w J(w)(xj) = 0

    Σ_{i∈Iα} αio yi xi⊤ xj = Σ_{i∈I1} yi xi⊤ xj − λo yj
    Σ_{i∈Iα} αin yi xi⊤ xj = Σ_{i∈I1} yi xi⊤ xj − λn yj

    G (αn − αo) = (λo − λn) y     with  Gij = yi xi⊤ xj

    αn = αo + (λo − λn) d ,     d = G⁻¹ y
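The update above is linear in λ, which is what makes the path cheap to follow. A minimal numpy sketch of one step, restricted to the support set Iα (names are ours):

```python
import numpy as np

def path_step(G_alpha, y_alpha, alpha_old, lam_old, lam_new):
    """One regularization-path step: alpha_new = alpha_old + (lam_old - lam_new) d,
    with d = G^{-1} y and G, y, alpha restricted to the support set I_alpha."""
    d = np.linalg.solve(G_alpha, y_alpha)
    return alpha_old + (lam_old - lam_new) * d
```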
Solving SVM in the primal

    min_{w,b}   1/2 ‖w‖² + C Σ_{i=1}^{n} max( 0 , 1 − yi (w⊤ xi + b) )

What for: Yahoo!, Twitter, Amazon, Google (Sibyl), Facebook. . . : Big data
Data-intensive machine learning systems

    "on terascale datasets, with trillions of features, billions of training examples
    and millions of parameters in an hour using a cluster of 1000 machines"

How: hybrid online + batch approach, adaptive gradient updates (stochastic gradient descent)

Code available: http://olivier.chapelle.cc/primal/


Solving SVM in the primal

    J(w, b) = 1/2 ‖w‖₂² + C/2 Σ_{i=1}^{n} max( 1 − yi (w⊤ xi + b) , 0 )²
            = 1/2 ‖w‖₂² + C/2 ξ⊤ ξ      with   ξi = max( 1 − yi (w⊤ xi + b) , 0 )

    ∇w J(w, b) = w − C Σ_{i=1}^{n} max( 1 − yi (w⊤ xi + b) , 0 ) yi xi
               = w − C (diag(y) X)⊤ ξ

    Hw J(w, b) = Id + C Σ_{i∉I0} xi xi⊤

Optimal step size ρ in the Newton direction:

    w_new = w_old − ρ Hw⁻¹ ∇w J(w_old , b_old)
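A minimal numpy sketch of one such Newton step on the squared-hinge primal (bias omitted and the step size fixed to ρ = 1 for brevity; names are ours):

```python
import numpy as np

def primal_newton_step(X, y, w, C):
    """One Newton step on J(w) = 1/2 ||w||^2 + C/2 sum_i max(0, 1 - y_i w'x_i)^2."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w))        # the slacks
    grad = w - C * (X.T @ (y * xi))                # w - C (diag(y) X)' xi
    active = xi > 0                                # points not in I0
    H = np.eye(X.shape[1]) + C * X[active].T @ X[active]
    return w - np.linalg.solve(H, grad)
```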


The hinge and other losses

Square hinge (huber/hinge) and Lasso SVM:

    min_{w,b}   ‖w‖₁ + C Σ_{i=1}^{n} max( 1 − yi (w⊤ xi + b) , 0 )^p

Penalized Logistic regression (Maxent):

    min_{w,b}   ‖w‖₂² + C Σ_{i=1}^{n} log( 1 + exp(−2 yi (w⊤ xi + b)) )

The exponential loss (commonly used in boosting):

    min_{w,b}   ‖w‖₂² + C Σ_{i=1}^{n} exp( −yi (w⊤ xi + b) )

The sigmoid loss:

    min_{w,b}   ‖w‖₂² − C Σ_{i=1}^{n} tanh( yi (w⊤ xi + b) )

[Figure: the classification losses (0/1, hinge, squared hinge, logistic, exponential, sigmoid) as functions of the margin y f(x).]
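For reference, the losses of this slide written as functions of the margin z = y f(x), e.g. for plotting them side by side (a sketch; the factor 2 in the logistic loss follows the slide's parametrization):

```python
import numpy as np

def zero_one(z):     return (z <= 0).astype(float)
def hinge(z):        return np.maximum(0.0, 1.0 - z)
def square_hinge(z): return np.maximum(0.0, 1.0 - z) ** 2
def logistic(z):     return np.log1p(np.exp(-2.0 * z))
def exponential(z):  return np.exp(-z)
def sigmoid_loss(z): return -np.tanh(z)   # decreasing in the margin, as on the slide

z = np.linspace(-1, 1, 5)
print(hinge(z))   # [2.  1.5 1.  0.5 0. ]
```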
Choosing the data fitting term and the penalty
For a given C: controlling the tradeoff between loss and penalty

    min_{w,b}   pen(w) + C Σ_{i=1}^{n} Loss( yi (w⊤ xi + b) )

For a long list of possible penalties:

A. Antoniadis, I. Gijbels, M. Nikolova, Penalized likelihood regression for generalized linear models with non-quadratic penalties, 2011.

A tentative classification:
convex / non convex
differentiable / non differentiable

What are we looking for:
consistency
efficiency −→ sparsity
Conclusion: variables or data points?

seeking a universal learning algorithm
  ◮ no model for IP(x, y)

the linear case: data is separable
  ◮ the non separable case

double objective: minimizing the error together with the regularity of the solution
  ◮ multi objective optimisation

duality: variable – example
  ◮ use the primal when d < n (in the linear case) or when the matrix G is hard to compute
  ◮ otherwise use the dual

universality = nonlinearity
  ◮ kernels
Bibliography

C. Cortes & V. Vapnik, Support-vector networks, Machine Learning, 1995.

J. Bi & V. Vapnik, Learning with rigorous SVM, COLT, 2003.

T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, The entire regularization path for the support vector machine, JMLR, 2004.

P. Bartlett, M. Jordan, J. McAuliffe, Convexity, classification, and risk bounds, JASA, 2006.

A. Antoniadis, I. Gijbels, M. Nikolova, Penalized likelihood regression for generalized linear models with non-quadratic penalties, 2011.

A. Agarwal, O. Chapelle, M. Dudík, J. Langford, A reliable effective terascale linear learning system, 2011.

informatik.unibas.ch/fileadmin/Lectures/FS2013/CS331/Slides/my_SVM_without_b.pdf
http://ttic.uchicago.edu/~gregory/courses/ml2010/lectures/lect12.pdf
http://olivier.chapelle.cc/primal/

