Stochastic Algorithms
Léon Bottou
NEC Laboratories America
Princeton
Summary
Assumption
Examples are drawn independently from
an unknown probability distribution P (x, y)
that represents the laws of Nature.
Loss Function
The loss function $\ell(\hat y, y)$ measures the cost of answering $\hat y$ when the true answer is $y$.
Expected Risk
We seek to find the function $f$ that minimizes
$$\min_f \; E(f) \;=\; \int \ell\big(f(x), y\big)\, dP(x, y)$$
Estimation
Not feasible to minimize the expectation E(f )
because P (x, y) is unknown.
Instead, we search for the function $f_n \in F$ that minimizes the Empirical Risk $E_n(f)$,
that is, the average loss over the training set examples.
$$\min_{f \in F} \; E_n(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)$$
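For concreteness, here is a minimal numerical sketch of empirical risk minimization, assuming a linear model and squared loss on synthetic data (all names and constants are illustrative, not from the talk):

```python
import numpy as np

# Empirical risk E_n(f_w) for a linear model f_w(x) = w . x
# under the squared loss l(yhat, y) = (yhat - y)^2 / 2.
def empirical_risk(w, X, Y):
    return 0.5 * np.mean((X @ w - Y) ** 2)

def empirical_risk_gradient(w, X, Y):
    return X.T @ (X @ w - Y) / len(Y)

# Toy training set; in practice the underlying distribution P(x, y) is unknown.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
Y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

# Minimize E_n over the class F of linear models by plain gradient descent.
w = np.zeros(5)
for _ in range(500):
    w -= 0.1 * empirical_risk_gradient(w, X, Y)
print("E_n(f_w) =", empirical_risk(w, X, Y))
```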
$$E(f_n) - E(f^*) \;=\; \underbrace{E(f_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f_F)}_{\text{Estimation error}}$$

where $f^*$ minimizes the expected risk over all functions, $f_F$ minimizes it over the class $F$, and $f_n$ minimizes the empirical risk over $F$.

[Figure: as the size of $F$ grows, the approximation error decreases while the estimation error increases.]
(Vapnik and Chervonenkis, Ordered risk minimization, 1974).
(Vapnik and Chervonenkis, Theorie der Zeichenerkennung, 1979)
Statistical Perspective:
It is good to optimize an objective function that ensures a fast
estimation rate when the number of examples increases.
Optimization Perspective:
To efficiently solve large problems, it is preferable to choose
an optimization algorithm with strong asymptotic properties, e.g.,
superlinear convergence.
Incorrect Conclusion:
To address large-scale learning problems, use a superlinear algorithm to
optimize an objective function with a fast estimation rate.
[Figures: test error versus computing time for training sets of 10,000, 100,000, and 1,000,000 examples, with the Bayes limit shown; curves for optimizers a, b, c combined with models I-IV trace out the trade-offs, and the lower envelope marks the good learning algorithms.]
$$E(\tilde f_n) - E(f^*) \;=\; \underbrace{E(f_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f_F)}_{\text{Estimation error}} \;+\; \underbrace{E(\tilde f_n) - E(f_n)}_{\text{Optimization error}}$$

where $\tilde f_n$ is the approximate minimizer returned by the optimization algorithm with accuracy $\rho$, i.e. $E_n(\tilde f_n) \le E_n(f_n) + \rho$.

Problem:
Choose $F$, $n$, and $\rho$ to make this sum as small as possible,
subject to budget constraints.
- Approximation error: approximation theory.
- Estimation error: Vapnik-Chervonenkis theory.
- Optimization error: algorithm dependent.
Definition 1: Small-scale Learning
The active budget constraint is the number of examples.
[Figure: estimation error and approximation error as functions of the size of $F$.]

Definition 2: Large-scale Learning
The active budget constraint is the computing time.
Executive Summary
[Figure: $\log(\rho)$ versus $\log(T)$; the best achievable trade-off curve.]
Asymptotics
$$E(\tilde f_n) - E(f^*) \;=\; \underbrace{E(f_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f_F)}_{\text{Estimation error}} \;+\; \underbrace{E(\tilde f_n) - E(f_n)}_{\text{Optimization error}}$$
Asymptotic Approach
All three errors must decrease at comparable rates.
Forcing one of the errors to decrease much faster
- costs computing time,
- but does not significantly improve the test error.
Asymptotics: Estimation
Uniform convergence bounds:
$$\text{Estimation error} \;\le\; O\!\left(\left[\frac{d}{n}\,\log\frac{n}{d}\right]^{\alpha}\right) \quad\text{with } \tfrac{1}{2} \le \alpha \le 1.$$
Localized bounds (variance, Tsybakov):
$$\text{Estimation error} \;\le\; O\!\left(\frac{d}{n}\,\log\frac{n}{d}\right)$$
Fast estimation rates: (Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2005; ...)
Asymptotics: Estimation+Optimization
Uniform convergence arguments give
$$\text{Estimation error} + \text{Optimization error} \;\le\; O\!\left(\left[\frac{d}{n}\,\log\frac{n}{d}\right]^{\alpha} + \rho\right).$$
Reaching a target accuracy $\varepsilon$ then amounts to choosing $n$ and $\rho$ so that
$$\left[\frac{d}{n}\,\log\frac{n}{d}\right]^{\alpha} + \rho \;=\; \varepsilon.$$
Asymptotics: Approximation + Estimation + Optimization
Gradient descent.
Second order gradient descent (Newton).
Stochastic gradient descent.
Stochastic second order gradient descent.
Quantities of Interest
Empirical Hessian at the empirical optimum $w_n$:
$$H \;=\; \frac{\partial^2 E_n}{\partial w^2}(f_{w_n}) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2 \ell(f_n(x_i), y_i)}{\partial w^2}$$
Gradient covariance at the empirical optimum:
$$G \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\partial \ell(f_n(x_i), y_i)}{\partial w}\;\frac{\partial \ell(f_n(x_i), y_i)}{\partial w}^{\!\top}$$
Condition number
We assume that there are $\lambda_{\min}$, $\lambda_{\max}$ and $\nu$ such that
- $\operatorname{tr}\!\left(G H^{-1}\right) \le \nu$,
- spectrum of $H \subset [\lambda_{\min}, \lambda_{\max}]$,
and we define the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$.
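As an illustration, a hedged sketch of how $H$, $G$, $\kappa$ and $\nu$ could be estimated numerically, assuming a linear model with logistic loss on synthetic data (the model and constants are placeholders, not the experiments of the talk):

```python
import numpy as np

# Empirical Hessian H and gradient second moment G at a weight vector w,
# for the logistic loss l(s, y) = log(1 + exp(-y s)) with linear score s = w . x.
def hessian_and_G(w, X, Y):
    n = len(Y)
    p = 1.0 / (1.0 + np.exp(-Y * (X @ w)))        # sigma(y * s)
    per_example_grads = (-(1.0 - p) * Y)[:, None] * X
    G = per_example_grads.T @ per_example_grads / n
    q = p * (1.0 - p)                              # per-example curvature
    H = (X * q[:, None]).T @ X / n
    return H, G

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
Y = np.sign(rng.normal(size=2000))
w = 0.1 * rng.normal(size=10)

H, G = hessian_and_G(w, X, Y)
eigs = np.linalg.eigvalsh(H)
kappa = eigs.max() / eigs.min()                    # condition number lambda_max / lambda_min
nu = np.trace(G @ np.linalg.inv(H))                # trace(G H^-1)
print("kappa =", kappa, " nu =", nu)
```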
GD
Iterate: $w_{t+1} \;\leftarrow\; w_t - \eta\, \dfrac{\partial E_n(f_{w_t})}{\partial w}$

Cost per iteration: $O(nd)$
Iterations to reach accuracy $\rho$: $O\!\left(\kappa \log\frac{1}{\rho}\right)$
Time to reach accuracy $\rho$: $O\!\left(n d \kappa \log\frac{1}{\rho}\right)$
Time to reach $E(\tilde f_n) - E(f_F) < \varepsilon$: $O\!\left(\dfrac{d^2 \kappa}{\varepsilon^{1/\alpha}} \log^2\frac{1}{\varepsilon}\right)$
2GD
Iterate: $w_{t+1} \;\leftarrow\; w_t - H^{-1}\, \dfrac{\partial E_n(f_{w_t})}{\partial w}$

Cost per iteration: $O\!\left(d(d+n)\right)$
Iterations to reach accuracy $\rho$: $O\!\left(\log\log\frac{1}{\rho}\right)$
Time to reach accuracy $\rho$: $O\!\left(d(d+n)\log\log\frac{1}{\rho}\right)$
Time to reach $E(\tilde f_n) - E(f_F) < \varepsilon$: $O\!\left(\dfrac{d^2}{\varepsilon^{1/\alpha}} \log\frac{1}{\varepsilon}\,\log\log\frac{1}{\varepsilon}\right)$
SGD
Iterate: $w_{t+1} \;\leftarrow\; w_t - \dfrac{\eta}{t}\, \dfrac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$

Cost per iteration: $O(d)$
Iterations to reach accuracy $\rho$: $\dfrac{\nu k}{\rho} + o\!\left(\dfrac{1}{\rho}\right)$, with $1 \le k \le \kappa^2$
Time to reach accuracy $\rho$: $O\!\left(\dfrac{d \nu k}{\rho}\right)$
Time to reach $E(\tilde f_n) - E(f_F) < \varepsilon$: $O\!\left(\dfrac{d \nu k}{\varepsilon}\right)$
2SGD
Replace the scalar gain $\frac{\eta}{t}$ by the matrix $\frac{1}{t}H^{-1}$:
$w_{t+1} \;\leftarrow\; w_t - \frac{1}{t}\, H^{-1}\, \dfrac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$

Cost per iteration: $O(d^2)$
Iterations to reach accuracy $\rho$: $\dfrac{\nu}{\rho} + o\!\left(\dfrac{1}{\rho}\right)$
Time to reach accuracy $\rho$: $O\!\left(\dfrac{d^2 \nu}{\rho}\right)$
Time to reach $E(\tilde f_n) - E(f_F) < \varepsilon$: $O\!\left(\dfrac{d^2 \nu}{\varepsilon}\right)$
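To make the four update rules concrete, here is a hedged toy sketch (regularized logistic regression on synthetic data; the gains and iteration counts are illustrative rather than the tuned choices assumed by the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5000, 5, 1e-2
X = rng.normal(size=(n, d))
Y = np.sign(X @ rng.normal(size=d) + 0.5 * rng.normal(size=n))

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gradient(w):                 # gradient of E_n(f_w) + (lam/2)|w|^2
    p = sigma(Y * (X @ w))
    return -((1 - p) * Y) @ X / n + lam * w

def full_hessian(w):
    q = sigma(Y * (X @ w))
    q = q * (1 - q)
    return (X * q[:, None]).T @ X / n + lam * np.eye(d)

def one_gradient(w, i):               # stochastic gradient on example i
    p = sigma(Y[i] * (X[i] @ w))
    return -(1 - p) * Y[i] * X[i] + lam * w

w_gd, w_2gd, w_sgd, w_2sgd = (np.zeros(d) for _ in range(4))
H_inv = np.linalg.inv(full_hessian(w_2sgd))       # crude fixed preconditioner for 2SGD

for _ in range(100):                  # GD: full gradient, constant gain
    w_gd -= 1.0 * full_gradient(w_gd)
for _ in range(10):                   # 2GD: Newton steps on E_n
    w_2gd -= np.linalg.solve(full_hessian(w_2gd), full_gradient(w_2gd))
for t in range(1, n + 1):             # one pass of SGD and 2SGD
    i = rng.integers(n)
    w_sgd -= 1.0 / (lam * (t + 100)) * one_gradient(w_sgd, i)    # scalar gain ~ 1/(lam t)
    w_2sgd -= (1.0 / t) * (H_inv @ one_gradient(w_2sgd, i))      # matrix gain (1/t) H^-1

for name, w in [("GD", w_gd), ("2GD", w_2gd), ("SGD", w_sgd), ("2SGD", w_2sgd)]:
    print(name, np.linalg.norm(full_gradient(w)))
```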
Task
Recognizing documents of category CCAT.
Minimize
$$\frac{\lambda}{2}\,\|w\|^2 \;+\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(w^\top x_i + b,\; y_i\big).$$
SGD update on example $(x_t, y_t)$:
$$w \;\leftarrow\; w - \eta_t \left(\lambda w + \frac{\partial \ell(w^\top x_t + b,\, y_t)}{\partial w}\right)$$
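A hedged sketch of this update for the hinge loss, using the decreasing gain $\eta_t = 1/(\lambda(t+t_0))$ that also appears in the pseudocode later in the talk (data loading is omitted and the constants are illustrative):

```python
import numpy as np

def svm_sgd(X, Y, lam=1e-4, t0=1e4, epochs=1):
    """Plain SGD for the L2-regularized linear SVM with hinge loss.

    Each step applies  w <- w - eta_t * (lam*w + d/dw max(0, 1 - y*(w.x + b))).
    """
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            eta = 1.0 / (lam * (t + t0))
            margin = Y[i] * (X[i] @ w + b)
            w *= 1.0 - eta * lam                  # weight-decay part of the step
            if margin < 1.0:                      # hinge loss is active
                w += eta * Y[i] * X[i]
                b += eta * Y[i]                   # the bias is usually not regularized
            t += 1
    return w, b
```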
$\lambda = 0.0001$:

            Training Time   Primal cost   Test Error
SVMLight    23,642 secs     0.2275        6.02%
SVMPerf     66 secs         0.2278        6.03%
SGD         1.4 secs        0.2275        6.02%
A second configuration reports primal costs of 0.18907, 0.18890, and 0.18893, with test errors of 5.68%, 5.70%, and 5.66%.
The Wall
[Figure: testing cost versus training time in seconds for SGD and TRON (LibLinear).]
Model
Conditional Random Field (all linear, log-loss).
Features are n-grams of words and part-of-speech tags.
1,679,700 parameters.
Same setup as (Vishwanathan et al., 2006) but plain SGD.
Results

          Training Time   Primal cost   Test F1 score
L-BFGS    4335 secs       9042          93.74%
SGD       568 secs        9098          93.75%
Notes
Computing the gradients with the chain rule runs faster than
computing them with the forward-backward algorithm.
Graph Transformer Networks are nonlinear conditional random fields
trained with stochastic gradient descent (Bottou et al., 1997).
Dataset        Examples   Features   Density
Reuters        781K       47K        0.1%
Translation    1000K      274K       0.0033%
SuperTag       950K       46K        0.0066%
Voicetone      579K       88K        0.019%

[Further columns of the table: 210,000 days; 31,650; 39,100; 3,930; 47,700; 905; 197; 153; 1,105; 210; 51; 7; 7; 1; 1.]
[Figure: log-loss problem, test cost versus time in seconds; batch Conjugate Gradient on training sets of various sizes (n = 10,000; 30,000; 100,000; 300,000; 781,265) compared with Stochastic Gradient on the full set.]
[Figure: the empirical risks $E_n(f_w)$ and $E_{n+1}(f_w)$ and their minima $w^*_n$ and $w^*_{n+1}$.]

$$w^*_{n+1} \;=\; w^*_n \;-\; \frac{1}{n+1}\, H_{n+1}^{-1}\, \frac{\partial\, \ell\big(f_{w^*_n}(x_{n+1}),\, y_{n+1}\big)}{\partial w} \;+\; O\!\left(\frac{1}{n^2}\right)$$

where $H_{n+1}$ is the empirical Hessian on $n + 1$ examples.
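For the squared loss the $O(1/n^2)$ term vanishes and the recursion is exact (it is recursive least squares). A hedged sketch, assuming a linear model and a tiny ridge term so that the early Hessians are invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
X = rng.normal(size=(200, d))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w = np.zeros(d)                        # w*_0
sum_outer = np.zeros((d, d))
for n in range(len(Y)):
    x, y = X[n], Y[n]
    sum_outer += np.outer(x, x)
    # Empirical Hessian on n+1 examples (tiny ridge keeps it invertible early on).
    H = sum_outer / (n + 1) + 1e-8 * np.eye(d)
    grad = (x @ w - y) * x             # gradient of the new example's loss at w*_n
    w -= np.linalg.solve(H, grad) / (n + 1)

# The recursion tracks the empirical optimum computed directly.
print(w)
print(np.linalg.lstsq(X, Y, rcond=None)[0])
```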
Study stochastic gradient with gains $\frac{1}{t}\,B_t + O\!\left(\frac{1}{t^2}\right)$, with $B_t \to B \succ 0$ and $BH \succ \frac{I}{2}$.

Lemma: study the real sequence
$$u_{t+1} \;=\; \left(1 - \frac{\alpha}{t} + o\!\left(\frac{1}{t}\right)\right) u_t \;+\; \frac{\beta}{t^2} + o\!\left(\frac{1}{t^2}\right).$$
When $\alpha > 1$, show $u_t = \frac{\beta}{(\alpha - 1)\, t} + o\!\left(\frac{1}{t}\right)$ (nasty proof!).
When $\alpha < 1$, show $u_t \sim t^{-\alpha}$ (up to log factors).
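A quick hedged numerical check of the lemma, simulating the recurrence with the $o(\cdot)$ terms dropped and illustrative values of $\alpha$ and $\beta$:

```python
# u_{t+1} = (1 - alpha/t) u_t + beta/t^2, starting from u_1.
def simulate(alpha, beta=1.0, u=1.0, T=1_000_000):
    for t in range(1, T):
        u = (1.0 - alpha / t) * u + beta / t ** 2
    return u

T = 1_000_000
# alpha > 1: expect u_T ~ beta / ((alpha - 1) T), so u_T * T should approach 1/(alpha-1).
print(simulate(2.0, T=T) * T)          # roughly 1.0
# alpha < 1: expect u_T ~ T^(-alpha) up to constants, so u_T * T^alpha stays bounded.
print(simulate(0.5, T=T) * T ** 0.5)
```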
$$\frac{\operatorname{tr}(HBGB)}{2\,\lambda_{\max}(BH) - 1}\;\frac{1}{t} + o\!\left(\frac{1}{t}\right)
\;\le\;
\mathbb{E}\big[\,E(f_{w_t}) - E(f_{w^*})\,\big]
\;\le\;
\frac{\operatorname{tr}(HBGB)}{2\,\lambda_{\min}(BH) - 1}\;\frac{1}{t} + o\!\left(\frac{1}{t}\right)$$

Interesting special cases: $B = \lambda_{\min}^{-1} I$ and $B = H^{-1}$.
After (Bottou & LeCun, 2003); see also (Fabian, 1973; Murata & Amari, 1998).
$$\lim_{n \to \infty} n\,\mathbb{E}\,\big\|w^* - w^*_n\big\|^2 \;=\; \lim_{t \to \infty} t\,\mathbb{E}\,\big\|w^* - w_t\big\|^2 \;=\; \operatorname{tr}\!\left(H^{-1} G\, H^{-1}\right)$$
[Figure: the best solution in $F$ ($w = w^*$), the empirical optima $w^*_n$ (best training set error), and one pass of second order stochastic gradient starting from $w_0 = w^*_0$; both sequences approach $w^*$ at rate $K/n$.]
[Figure: mean squared error versus number of examples (1,000 to 100,000) and versus training time in milliseconds (100 to 10,000), with reference levels Mse*+1e-2, Mse*+1e-3, and Mse*+1e-4.]
$$\mathbb{E}\big[\,E(f_{w_t}) - E(f_{w^*})\,\big] \;=\; \frac{\operatorname{tr}(G H^{-1})}{t} + o\!\left(\frac{1}{t}\right)$$

What happens when $B_t \to H^{-1}$ slowly?
The one-pass learning speed regime may not be reached in one pass...

$$w_{t+1} \;\leftarrow\; w_t - \frac{1}{t}\, B_t\, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$
Exploiting Duality
Averaged SGD
Good asymptotic constants (Polyak and Juditsky 1992)
The asymptotic regime can take long to settle in... possibly fixable (Wei Xu, to appear).

The SVM update splits into (i) a loss-gradient step (sparse, cheap) and (ii) a weight-decay step (not sparse, costly).
Solution
Perform only step (i) for each training example.
Perform step (ii) with lower frequency and a higher gain.
General Idea
Scheduling SGD operations according to their cost!
Design Principle
Start with simple SGD.
Incremental improvements of the constants.
(Bordes, Bottou, Gallinari, 2008)
SVMSGD2
Require: λ, w0, t0, T, skip

Plain SGD (full regularized update at every step):
 1: t = 0
 2: while t ≤ T do
 3:    w_{t+1} = w_t − 1/(λ(t+t0)) · (λ w_t + ℓ'(y_t w_t x_t) y_t x_t)
 4:    t = t + 1
 5: end while
 6: return w_T

SVMSGD2 (weight decay applied only every skip steps):
 1: t = 0, count = skip
 2: while t ≤ T do
 3:    w_{t+1} = w_t − 1/(λ(t+t0)) · ℓ'(y_t w_t x_t) y_t x_t
 4:    count = count − 1
 5:    if count < 0 then
 6:       w_{t+1} = w_{t+1} − (skip/(t+t0)) · w_{t+1}
 7:       count = skip
 8:    end if
 9:    t = t + 1
10: end while
11: return w_T
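A hedged, runnable Python transcription of the SVMSGD2 idea above: the sparse hinge-loss step runs at every iteration while the dense weight-decay step is applied only every skip iterations with a proportionally larger gain (dense arrays are used here only for brevity):

```python
import numpy as np

def svmsgd2(X, Y, lam=1e-4, t0=1e4, T=100_000, skip=16):
    n, d = X.shape
    w = np.zeros(d)
    count = skip
    for t in range(T):
        i = t % n
        eta = 1.0 / (lam * (t + t0))
        if Y[i] * (X[i] @ w) < 1.0:          # subgradient of the hinge loss
            w += eta * Y[i] * X[i]           # sparse, cheap step (i)
        count -= 1
        if count < 0:                        # delayed regularization, step (ii)
            w -= (skip / (t + t0)) * w       # equals skip * eta * lam * w
            count = skip
    return w
```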
SVMSGD2 versus SGD-QN
Require: λ, w0, t0, T, skip

SVMSGD2:
 1: t = 0, count = skip
 2: while t ≤ T do
 3:    w_{t+1} = w_t − 1/(λ(t+t0)) · ℓ'(y_t w_t x_t) y_t x_t
 4:    count = count − 1
 5:    if count < 0 then
 6:       w_{t+1} = w_{t+1} − skip (t+t0)^{-1} w_{t+1}
 7:       count = skip
 8:    end if
 9:    t = t + 1
10: end while
11: return w_T

SGD-QN:
 1: t = 0, count = skip
 2: B = λ^{-1} I, updateB = false, r = 0
 3: while t ≤ T do
 4:    w_{t+1} = w_t − (t + t0)^{-1} ℓ'(y_t w_t x_t) y_t B x_t
 5:    if updateB = true then
 6:       p_t = g_t(w_{t+1}) − g_t(w_t)
 7:       ∀i, B_ii = B_ii + (2/r) ([w_{t+1} − w_t]_i / [p_t]_i − B_ii)
 8:       ∀i, B_ii = max(B_ii, 10^{-2} λ^{-1})
 9:       r = r + 1, updateB = false
10:    end if
11:    count = count − 1
12:    if count < 0 then
13:       w_{t+1} = w_{t+1} − skip λ (t + t0)^{-1} B w_{t+1}
14:       count = skip, updateB = true
15:    end if
16:    t = t + 1
17: end while
18: return w_T

Here g_t(w) stands for the stochastic gradient of the regularized loss on example (x_t, y_t).
[Figures: test error (%) from 21.0 to 25.0 versus number of epochs (0 to 0.8) for SVMSGD2, SGD-QN, oLBFGS, and LibLinear.]
Field segmentation
Character segmentation
Character recognition
Syntactical interpretation.

Words embedded in a 50-100 dimensional space.
Five Time-Delay Multilayer networks:
- Part-Of-Speech Tagging (treebank, split 02-21 / 23): ERR 2.76% (state of the art: 2.75%).
- Chunking (treebank): F1 88.97% (state of the art: 89.31%).
- Language Model (wikipedia, 620M examples): WER ~14% (state of the art: ~13%).
Conclusions
Qualitatively different tradeoffs for small-scale and large-scale
learning problems.
Traditional performance measurements for optimization
algorithms do not apply very well to machine learning
problems.
Stochastic algorithms provide excellent performance, both
in theory and in practice.
There is room for improvement.
Engineering matters!
The End
Decreasing gains:
$$w_{t+1} \;\leftarrow\; w_t - \frac{\eta}{t + t_0}\, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$
Asymptotic Theory

SGD: $\dfrac{\nu k}{\rho} + o\!\left(\dfrac{1}{\rho}\right)$ iterations to reach accuracy $\rho$, with $1 \le k \le \kappa^2$.
2SGD: $\dfrac{\nu}{\rho} + o\!\left(\dfrac{1}{\rho}\right)$ iterations to reach accuracy $\rho$.