[Figure: + and − points in the plane, with several candidate separating lines]
Which is the best linear separator (defined by w)?
[Figure: + and − points on either side of a separator, with three test points A, B, C at varying distances from it]
Distance from the separating hyperplane corresponds to the confidence of the prediction.
Example: we are more sure about the class of A and B than of C.
2/19/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 5
Margin γ: the distance of the closest example from the decision line/hyperplane.
The reason we define the margin this way is theoretical convenience and the existence of generalization error bounds that depend on the value of the margin.
Remember: Dot product
  A·B = ‖A‖ ‖B‖ cos θ
  ‖A‖ = √( Σ_j (A^(j))² )
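These identities can be checked numerically; a small sketch with made-up vectors:

```python
import numpy as np

# Check the dot-product identities on made-up vectors.
A = np.array([3.0, 4.0])
B = np.array([1.0, 0.0])

norm_A = np.sqrt((A ** 2).sum())          # ‖A‖ = √(Σ_j (A^(j))²) = 5
cos_theta = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
# A·B should equal ‖A‖ ‖B‖ cos θ
print(np.dot(A, B), norm_A * np.linalg.norm(B) * cos_theta)   # 3.0 3.0
```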
Distance from a point to a line

Let:
  Line L: w·x + b = w^(1)x^(1) + w^(2)x^(2) + b = 0
  w = (w^(1), w^(2))
  Point A = (x_A^(1), x_A^(2))
  Point M on the line = (x_M^(1), x_M^(2))
  H … projection of A onto L

d(A, L) = |AH|
        = |(A − M)·w|
        = |(x_A^(1) − x_M^(1)) w^(1) + (x_A^(2) − x_M^(2)) w^(2)|
        = |x_A^(1) w^(1) + x_A^(2) w^(2) + b|
        = |w·A + b|
Remember: x_M^(1) w^(1) + x_M^(2) w^(2) = −b, since M belongs to line L.
(This takes w to be a unit normal of L.)
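The derivation above can be sketched numerically; the vectors below are made-up example values, with w chosen as a unit vector so that |w·A + b| is exactly the distance:

```python
import numpy as np

# Made-up example: distance of point A from the line w·x + b = 0,
# with w a unit vector (0.36 + 0.64 = 1).
w = np.array([0.6, 0.8])
b = -2.0
A = np.array([5.0, 1.0])

d = abs(np.dot(w, A) + b)                               # |w·A + b|, valid since ‖w‖ = 1
d_general = abs(np.dot(w, A) + b) / np.linalg.norm(w)   # general formula |w·A + b| / ‖w‖
print(d, d_general)                                     # both ≈ 1.8
```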
Prediction = sign(w·x + b)
Confidence = (w·x + b) y
For the i-th datapoint: γ_i = (w·x_i + b) y_i
Want to solve: max_w min_i γ_i
Can rewrite as: max_{w,γ} γ   s.t.  ∀i, y_i(w·x_i + b) ≥ γ
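A minimal sketch of prediction and confidence for a linear classifier; w, b, and the datapoint are made-up values:

```python
import numpy as np

# Made-up w, b, and datapoint; y ∈ {+1, -1}.
w, b = np.array([1.0, -1.0]), 0.5

def predict(x):
    # class prediction: sign(w·x + b)
    return np.sign(np.dot(w, x) + b)

def confidence(x, y):
    # signed margin (w·x + b)·y: large and positive means a confident, correct prediction
    return (np.dot(w, x) + b) * y

x, y = np.array([2.0, 0.5]), 1.0
print(predict(x), confidence(x, y))   # 1.0 2.0
```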
[Figure: + and − points separated by a hyperplane with margin γ]
Maximize the margin:
  max_{w,γ} γ   s.t.  ∀i, y_i(w·x_i + b) ≥ γ
Good according to intuition, theory (VC dimension) & practice.
γ is the margin: the distance from the separating hyperplane.
[Figure: separating hyperplane w·x + b = 0 with support vectors on the margin planes]
Maximizing the margin:
  max_{w,γ} γ   s.t.  ∀i, y_i(w·x_i + b) ≥ γ
The separating hyperplane is defined by the support vectors:
  Points on the +/− planes from the solution
  If you knew these points, you could ignore the rest
  If there are no degeneracies, there are d+1 support vectors (for d-dimensional data)
Problem:
  Let y(w·x + b) = γ, then y(2w·x + 2b) = 2γ
  Scaling w increases the margin!
Solution:
  Work with a normalized w:  γ = y((w/‖w‖)·x + b)
  Let's also require the support vectors x_j to lie on the planes defined by:
    w·x_j + b = ±1
[Figure: support vectors x_1 and x_2 on the planes w·x + b = +1 and w·x + b = −1, margin γ on either side of the separating hyperplane]
Want to maximize the margin γ!
What is the relation between x_1 and x_2?
  x_1 = x_2 + 2γ (w/‖w‖)
We also know:
  w·x_1 + b = +1
  w·x_2 + b = −1
So:
  +1 = w·x_1 + b = w·(x_2 + 2γ w/‖w‖) + b
     = w·x_2 + b + 2γ (w·w)/‖w‖
     = −1 + 2γ ‖w‖
  ⇒ γ = 1/‖w‖
Note: w·w = ‖w‖²
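The relation γ = 1/‖w‖ can be sanity-checked numerically; w, b, and the support vector x_2 below are made-up values, with x_2 chosen to lie on the −1 plane:

```python
import numpy as np

# Made-up w and b; x2 is chosen on the plane w·x + b = -1.
w = np.array([3.0, 4.0])
b = 1.0
x2 = np.array([0.0, -0.5])                 # 3·0 + 4·(-0.5) + 1 = -1

gamma = 1.0 / np.linalg.norm(w)            # claimed margin: 1/‖w‖ = 1/5
x1 = x2 + 2 * gamma * w / np.linalg.norm(w)
print(np.dot(w, x1) + b)                   # x1 lands on the +1 plane
```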
We started with:
  max_{w,γ} γ   s.t.  ∀i, y_i(w·x_i + b) ≥ γ
But w can be arbitrarily large!
We normalized and then:
  max γ  ~  max 1/‖w‖  ~  min ‖w‖  ~  min ½‖w‖²
So we obtain:
  min_w ½‖w‖²   s.t.  ∀i, y_i(w·x_i + b) ≥ 1
This is called SVM with hard constraints.
If the data is not separable, introduce a penalty:
  min_{w,b} ½‖w‖² + C·(# of training mistakes)
Minimize ‖w‖² plus the number of training mistakes
  Set C using cross-validation
How to penalize mistakes?
  All mistakes are not equally bad!
[Figure: non-separable + and − points; some points fall on the wrong side of the margin]
Introduce slack variables ξ_i:
  min_{w,b,ξ_i≥0} ½‖w‖² + C·Σ_{i=1}^{n} ξ_i
  s.t.  ∀i, y_i(w·x_i + b) ≥ 1 − ξ_i
If point x_i is on the wrong side of the margin, it gets penalty ξ_i.
[Figure: point j on the correct side with margin > 1; point i on the wrong side of the margin, paying slack ξ_i]
For each datapoint:
  If margin ≥ 1, don't care
  If margin < 1, pay linear penalty
What is the role of the slack penalty C in
  min_{w,b} ½‖w‖² + C·(# of mistakes) ?
  C = ∞: only want w, b that separate the data
  C = 0: can set ξ_i to anything, then w = 0 (basically ignores the data)
[Figure: decision boundaries for big C, good C, and small C]
SVM in the natural form:
  argmin_{w,b} ½ w·w + C·Σ_{i=1}^{n} max{0, 1 − y_i(w·x_i + b)}
  ½ w·w … margin (regularization); the sum … empirical loss L (how well we fit the training data); C … regularization parameter
SVM uses the Hinge Loss. Equivalent constrained form:
  min_{w,b} ½‖w‖² + C·Σ_{i=1}^{n} ξ_i
  s.t.  ∀i, y_i(w·x_i + b) ≥ 1 − ξ_i;  ξ_i ≥ 0
[Figure: penalty vs. z = y_i(w·x_i + b); the 0/1 loss steps down at z = 0, the hinge loss is max{0, 1 − z}]
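The hinge loss and the 0/1 loss can be compared directly; a small sketch (treating z = 0, a point on the boundary, as a mistake for the 0/1 loss is an assumption here):

```python
import numpy as np

def hinge(z):
    # hinge loss: max{0, 1 - z}, where z = y_i (w·x_i + b)
    return np.maximum(0.0, 1.0 - z)

def zero_one(z):
    # 0/1 loss; here z = 0 (on the boundary) counts as a mistake
    return (z <= 0).astype(float)

z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge(z))      # hinge keeps penalizing until the margin reaches 1
print(zero_one(z))   # 0/1 loss only counts misclassifications
```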
Want to estimate w and b!
Standard way: use a solver!
  Solver: software for finding solutions to common optimization problems
Use a quadratic solver:
  Minimize a quadratic function: min_{w,b} ½ w·w + C·Σ_{i=1}^{n} ξ_i
  Subject to linear constraints: y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0
Problem: solvers are inefficient for big data!
Want to estimate w, b!
Alternative approach:
Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1}^{d} (w^(j))² + C·Σ_{i=1}^{n} max{0, 1 − y_i(Σ_{j=1}^{d} w^(j) x_i^(j) + b)}
How to minimize convex functions f(z)?
  Use gradient descent: min_z f(z)
  Iterate: z_{t+1} ← z_t − η ∇f(z_t)
[Figure: a convex curve f(z) with gradient steps descending toward the minimum]
Want to minimize f(w,b):
  f(w,b) = ½ Σ_{j=1}^{d} (w^(j))² + C·Σ_{i=1}^{n} max{0, 1 − y_i(Σ_{j=1}^{d} w^(j) x_i^(j) + b)}
  (the max{…} term is the empirical loss L(x_i, y_i))
Compute the gradient ∇f^(j) w.r.t. w^(j):
  ∇f^(j) = ∂f(w,b)/∂w^(j) = w^(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w^(j)
where
  ∂L(x_i, y_i)/∂w^(j) = 0             if y_i(w·x_i + b) ≥ 1
  ∂L(x_i, y_i)/∂w^(j) = −y_i x_i^(j)  otherwise
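A sketch of the batch gradient formula above with respect to w, on made-up toy data (the gradient w.r.t. b is omitted):

```python
import numpy as np

# Batch gradient of f w.r.t. w on made-up toy data (b held fixed).
def grad_w(w, b, X, y, C):
    margins = y * (X @ w + b)          # y_i (w·x_i + b) for every i
    active = margins < 1               # where the hinge loss is nonzero
    # ∂L/∂w = -y_i x_i on active points, 0 elsewhere; plus the regularizer term w
    return w + C * -(y[active, None] * X[active]).sum(axis=0)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w, b, C = np.zeros(2), 0.0, 1.0
print(grad_w(w, b, X, y, C))           # [-4. -2.]
```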
Gradient descent:
  Iterate until convergence:
    For j = 1 … d:
      Evaluate: ∇f^(j) = ∂f(w,b)/∂w^(j) = w^(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w^(j)
      Update: w^(j) ← w^(j) − η ∇f^(j)
  η … learning rate parameter
  C … regularization parameter
Problem: computing ∇f^(j) takes O(n) time!
  n … size of the training dataset
Stochastic Gradient Descent
Instead of evaluating the gradient over all examples, evaluate it for each individual training example.
We just had:
  ∇f^(j) = w^(j) + C·Σ_{i=1}^{n} ∂L(x_i, y_i)/∂w^(j)
Stochastic gradient descent:
  ∇f^(j)(i) = w^(j) + C·∂L(x_i, y_i)/∂w^(j)
  Iterate until convergence:
    For i = 1 … n:
      For j = 1 … d:
        Evaluate: ∇f^(j)(i)
        Update: w^(j) ← w^(j) − η ∇f^(j)(i)
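The SGD loop above can be sketched as follows; the toy data, the fixed learning rate η, and the number of epochs are all made-up choices (a real run would shuffle examples and decay η):

```python
import numpy as np

# SGD-SVM sketch on made-up, linearly separable toy data.
def sgd_svm(X, y, C=1.0, eta=0.1, epochs=100):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) < 1:   # hinge loss active
                w -= eta * (w - C * y[i] * X[i])
                b += eta * C * y[i]
            else:                                   # only the regularizer contributes
                w -= eta * w
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = sgd_svm(X, y)
print(np.sign(X @ w + b))    # predictions on the training points
```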
Example by Leon Bottou:
Reuters RCV1 document corpus
Predict a category of a document
One vs. the rest classification
n = 781,000 training examples (documents)
23,000 test examples
d = 50,000 features
One feature per word
Remove stop-words
Remove low frequency words
Questions:
(1) Is SGD successful at minimizing f(w,b)?
(2) How quickly does SGD find the min of f(w,b)?
(3) What is the error on a test set?
[Table: training time, value of f(w,b), and test error for Standard SVM, Fast SVM, and SGD-SVM]
(1) SGD-SVM is successful at minimizing the value of f(w,b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable
Optimization quality: |f(w,b) − f(w_opt, b_opt)|
[Figure: optimization quality vs. training time for Conventional SVM and SGD-SVM]
For optimizing f(w,b) within reasonable quality, SGD-SVM is super fast.
SGD on the full dataset vs. Batch Conjugate Gradient (BCG) on a sample of n training examples:
  Bottom line: doing a simple (but fast) SGD update many times is better than doing a complicated (but slow) BCG update a few times.
  Theory says: gradient descent converges in linear time in k; conjugate gradient converges in √k.
  k … condition number
Need to choose the learning rate η and t_0:
  w_{t+1} ← w_t − η_t (w_t + C·∂L(x_i, y_i)/∂w),   η_t = η_0 / (t + t_0)
Leon suggests:
  Choose t_0 so that the expected initial updates are comparable with the expected size of the weights
  Choose η_0:
    Select a small subsample
    Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
    Pick the one that most reduces the cost
    Use that η for the next 100k iterations on the full dataset
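The rate-selection recipe can be sketched as follows; the data, the candidate rates, and the one-epoch trial budget are all made-up choices:

```python
import numpy as np

# Try several rates on a small made-up subsample; keep the one with lowest cost.
def cost(w, b, X, y, C):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()

def one_epoch(w, b, X, y, C, eta):
    for i in range(len(y)):
        if y[i] * (np.dot(w, X[i]) + b) < 1:
            w = w - eta * (w - C * y[i] * X[i])
            b = b + eta * C * y[i]
        else:
            w = w - eta * w
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X[:, 0] + 0.1)        # toy labels: roughly the sign of the first feature
C = 1.0

best_eta, best_cost = None, np.inf
for eta in [10, 1, 0.1, 0.01]:
    w, b = one_epoch(np.zeros(3), 0.0, X, y, C, eta)
    c = cost(w, b, X, y, C)
    if c < best_cost:
        best_eta, best_cost = eta, c
print(best_eta)                   # an absurdly large rate like 10 should lose
```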
Sparse Linear SVM:
  Feature vector x_i is sparse (contains many zeros)
    Do not do: x_i = [0,0,0,1,0,0,0,0,5,0,0,0,0,0,0,…]
    But represent x_i as a sparse vector: x_i = [(4,1), (9,5), …]
  Can we do the SGD update more efficiently?
    w ← w − η(w + C·∂L(x_i, y_i)/∂w)
  Approximated in 2 steps:
    (1) w ← w − ηC·∂L(x_i, y_i)/∂w
        cheap: x_i is sparse, so only a few coordinates j of w will be updated
    (2) w ← w(1 − η)
        expensive: w is not sparse, all coordinates need to be updated
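A sketch of the (index, value) representation suggested above, with a dot product that touches only the nonzero coordinates (the vectors are made-up values):

```python
# x_i stored as (index, value) pairs for its nonzero coordinates.
x_sparse = [(4, 1.0), (9, 5.0)]           # hypothetical sparse x_i

def sparse_dot(w, x_sparse):
    # w·x touching only the nonzero coordinates of x
    return sum(w[j] * v for j, v in x_sparse)

w = [0.5] * 12                            # dense weight vector (made-up)
print(sparse_dot(w, x_sparse))            # 0.5·1 + 0.5·5 = 3.0
```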
Two-step update procedure:
  (1) w ← w − ηC·∂L(x_i, y_i)/∂w
  (2) w ← w(1 − η)
Solution 1: w = s·v
  Represent vector w as the product of a scalar s and a vector v
  Then the update procedure is:
    (1) v ← v − (ηC/s)·∂L(x_i, y_i)/∂w
    (2) s ← s(1 − η)
Solution 2:
  Perform only step (1) for each training example
  Perform step (2) with lower frequency and higher η
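Solution 1 can be sketched as follows; the dimension, learning rate, and the sparse gradient are made-up values, and the final check confirms that the (s, v) updates reproduce the direct two-step update:

```python
import numpy as np

# Keep w = s·v; the dense shrink w ← w(1-η) becomes one scalar multiply.
d, eta = 5, 0.1
s, v = 1.0, np.zeros(d)
grad = {2: -3.0}              # hypothetical sparse C·∂L/∂w: nonzero only at j = 2

# Step (1): v ← v - (η/s)·grad, touching only the nonzero coordinates
for j, g in grad.items():
    v[j] -= (eta / s) * g
# Step (2): s ← s(1 - η), one scalar multiply instead of d multiplies
s *= 1 - eta

# Cross-check against the direct two-step update on a dense w
w_direct = np.zeros(d)
w_direct[2] -= eta * grad[2]
w_direct *= 1 - eta
print(s * v, w_direct)        # identical vectors
```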
Stopping criteria: how many iterations of SGD?
  Early stopping with cross-validation:
    Create a validation set
    Monitor the cost function on the validation set
    Stop when the loss stops decreasing
  Early stopping:
    Extract two disjoint subsamples A and B of the training data
    Train on A, stop by validating on B
    The number of epochs is an estimate of k
    Train for k epochs on the full dataset
Idea 1: One against all
  Learn 3 classifiers:
    + vs. {o, −}
    − vs. {o, +}
    o vs. {+, −}
  Obtain: w_+, b_+;  w_−, b_−;  w_o, b_o
How to classify?
  Return class c: arg max_c w_c·x + b_c
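The one-against-all rule can be sketched as follows; the three weight vectors and biases are made-up values:

```python
import numpy as np

# Hypothetical weights and biases for classes "+", "-", "o".
W = np.array([[1.0, 0.0],      # w for "+"
              [-1.0, 0.0],     # w for "-"
              [0.0, 1.0]])     # w for "o"
b = np.array([0.0, 0.0, -0.5])
classes = ["+", "-", "o"]

def classify(x):
    scores = W @ x + b         # one linear score w_c·x + b_c per class
    return classes[int(np.argmax(scores))]

print(classify(np.array([2.0, 0.0])))   # +
```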
Idea 2: Learn 3 sets of weights simultaneously
  For each class c estimate w_c, b_c
  Want the correct class to have the highest margin:
    w_{y_i}·x_i + b_{y_i} ≥ 1 + w_c·x_i + b_c   ∀c ≠ y_i, ∀i
[Figure: a training example (x_i, y_i) and the three class boundaries]
Optimization problem:
  min_{w,b} ½ Σ_c ‖w_c‖² + C·Σ_{i=1}^{n} ξ_i
  s.t.  w_{y_i}·x_i + b_{y_i} ≥ w_c·x_i + b_c + 1 − ξ_i   ∀c ≠ y_i, ∀i
        ξ_i ≥ 0, ∀i
To obtain the parameters w_c, b_c (for each class c) we can use similar techniques as for the 2-class SVM.
SVM is widely perceived as a very powerful learning algorithm.