
Large Scale Machine Learning and

Stochastic Algorithms

Léon Bottou
NEC Laboratories America
Princeton

Machine Learning in the 1980s

A Path to Artificial Intelligence?


Emulate cognitive capabilities of humans.
Profound, but not immediately obvious, relation
with the foundations of statistics.
Solution does not involve a human expert,
unlike the practice of statistics.
Humans learn from abundant and diverse data:
vision, sound, speech, text, . . . .

More data than we can handle.

Machine Learning in the 2000s


A Path to Artificial Intelligence?
Still my ultimate goal . . . Opinions vary . . .

Automated Data Mining


Gain competitive advantages by
analyzing the masses of data that describe
the life of our computerized society.
- Placement of web advertisement.
- Customer relation management.
- Fraud detection.
Brute force approach.

More data than we can handle.

Linear Time Learning Algorithms?

The computing resources available for learning


do not grow faster than the volume of data.
The cost of data mining cannot exceed the revenues.
Intelligent animals learn from streaming data.
Most machine learning or optimization algorithms
demand resources that grow faster than the volume of data.
Matrix operations ($n^3$ time for $n^2$ coefficients).
Sparse matrix operations are worse.

Summary

I. Machine Learning Redux.
II. Machine Learning and Optimization.
III. Stochastic Algorithms.

ML Redux: The Experimental Paradigm

Variations: k-fold cross-validation, etc.


This is the main driver for progress in machine learning.

ML Redux: Mathematical Statement (i)

Assumption
Examples are drawn independently from
an unknown probability distribution P (x, y)
that represents the laws of Nature.
Loss Function
Function $\ell(\hat{y}, y)$ measures the cost
of answering $\hat{y}$ when the true answer is $y$.

Expected Risk
We seek the function $f^*$ that minimizes:
$$\min_f \; E(f) = \int \ell(f(x), y) \, dP(x, y)$$

Note: The test set error is an approximation of the expected risk.

ML Redux: Mathematical Statement (ii)


Approximation
Not feasible to search $f^*$ among all functions.
Instead, we search $f^*_F$, the function that minimizes
the Expected Risk $E(f)$ within some richly parametrized
family of functions $F$.

Estimation
Not feasible to minimize the expectation $E(f)$
because $P(x, y)$ is unknown.
Instead, we search $f_n$ that minimizes the Empirical Risk $E_n(f)$,
that is, the average loss over the training set examples:
$$\min_{f \in F} \; E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$

In other words, we optimize a surrogate problem!
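As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates the empirical risk of a linear model under a hinge loss; the synthetic data and all names are assumptions made for the example:

```python
import numpy as np

def hinge(y_hat, y):
    """Loss l(y_hat, y) = max{0, 1 - y * y_hat}."""
    return max(0.0, 1.0 - y * y_hat)

def empirical_risk(w, X, y, loss):
    """E_n(f_w): average loss of the linear model x -> w @ x over the training set."""
    return float(np.mean([loss(x @ w, yi) for x, yi in zip(X, y)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))               # n = 1000 examples, d = 10 features
y = np.sign(X @ rng.normal(size=10))          # labels from an unknown "law of Nature"
print(empirical_risk(np.zeros(10), X, y, hinge))  # at w = 0 every hinge loss is 1.0
```

Minimizing $E_n$ over $w$ is the surrogate problem the slide describes; the expected risk itself remains out of reach because $P(x, y)$ is unknown.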

ML Redux: The Tradeoff


$$E(f_n) - E(f^*) \;=\; \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}}$$

[Figure: the estimation error grows and the approximation error shrinks as the size of $F$ increases.]
(Vapnik and Chervonenkis, Ordered risk minimization, 1974).
(Vapnik and Chervonenkis, Theorie der Zeichenerkennung, 1979)

The Computational Problem (i)

Statistical Perspective:
It is good to optimize an objective function that ensures a fast
estimation rate when the number of examples increases.
Optimization Perspective:
To efficiently solve large problems, it is preferable to choose
an optimization algorithm with strong asymptotic properties, e.g.
superlinear.
Incorrect Conclusion:
To address large-scale learning problems, use a superlinear algorithm to
optimize an objective function with fast estimation rate.

The Computational Problem (ii)


Baseline large-scale learning algorithm
Randomly discarding data is the simplest
way to handle large datasets.

What are the statistical benefits of processing more data?


What is the computational cost of processing more data?

We need a theory that joins Statistics and Computation!


1967: Vapnik and Chervonenkis theory does not discuss computation.
1984: Valiant's learnability excludes exponential time algorithms,
but (i) polynomial time is already too slow, (ii) few actual results.
We propose a new analysis of approximate optimization. . .

Test Error versus Learning Time

[Figure: test error as a function of computing time, decreasing toward the Bayes limit.]

Test Error versus Learning Time

[Figure: test error versus computing time for 10,000, 100,000, and 1,000,000 examples, each curve approaching the Bayes limit.]

Vary the number of examples. . .

Test Error versus Learning Time

[Figure: test error versus computing time for optimizers a, b, c and models I-IV, for 10,000 to 1,000,000 examples, with the Bayes limit.]

Vary the number of examples, the statistical models, the algorithms,. . .

Test Error versus Learning Time

[Figure: same curves; the lower envelope of the cloud marks the good learning algorithms.]

Not all combinations are equal.

Test Error versus Learning Time

[Figure: log-log view of the good learning algorithms near the Bayes limit, down to the point where we should start working on something else?]

Changing the units along the axes. . .

Learning with Approximate Optimization

Computing $f_n = \arg\min_{f \in F} E_n(f)$ is often costly.

Since we already optimize a surrogate function,
why should we compute its optimum $f_n$ exactly?

Let's assume our optimizer returns $\tilde{f}_n$
such that $E_n(\tilde{f}_n) < E_n(f_n) + \rho$.
For instance, one could stop an iterative
optimization algorithm long before its convergence.

Decomposition of the Error (i)


$$E(\tilde{f}_n) - E(f^*) \;=\; \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} \;+\; \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$

Problem:
Choose $F$, $n$, and $\rho$ to make this as small as possible,
subject to budget constraints:
- max number of examples $n$,
- max computing time $T$.

Decomposition of the Error (ii)

Approximation error bound: (Approximation theory)
- decreases when $F$ gets larger.

Estimation error bound: (Vapnik-Chervonenkis theory)
- decreases when $n$ gets larger.
- increases when $F$ gets larger.

Optimization error bound: (Vapnik-Chervonenkis theory plus tricks)
- increases with $\rho$.

Computing time $T$: (Algorithm dependent)
- decreases with $\rho$,
- increases with $n$,
- increases with $F$.

Small-scale vs. Large-scale Learning

We can give rigorous definitions.

Definition 1:
We have a small-scale learning problem when the active
budget constraint is the number of examples $n$.

Definition 2:
We have a large-scale learning problem when the active
budget constraint is the computing time $T$.

Small-scale Learning
The active budget constraint is the number of examples.

To reduce the estimation error, take $n$ as large as the budget allows.
To reduce the optimization error to zero, take $\rho = 0$.
We need to adjust the size of $F$.

[Figure: the estimation error rises and the approximation error falls as the size of $F$ grows.]

See Structural Risk Minimization (Vapnik 74) and later works.

Large-scale Learning
The active budget constraint is the computing time.

More complicated tradeoffs.


The computing time depends on the three variables: $F$, $n$, and $\rho$.
Example:
If we choose $\rho$ small, we decrease the optimization error. But we
must also decrease $F$ and/or $n$, with adverse effects on the estimation
and approximation errors.
The exact tradeoff depends on the optimization algorithm.
We can compare optimization algorithms rigorously.

Executive Summary

[Figure: $\log(\rho)$ versus $\log(T)$.]

Good optimization algorithm (superlinear):
$\rho$ decreases faster than $\exp(-T)$.
Mediocre optimization algorithm (linear):
$\rho$ decreases like $\exp(-T)$.
Extraordinarily poor optimization algorithm:
$\rho$ decreases like $1/T$.

Asymptotics
$$E(\tilde{f}_n) - E(f^*) \;=\; \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} \;+\; \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$

Asymptotic Approach
All three errors must decrease at comparable rates.
Forcing one of the errors to decrease much faster
- costs in computing time,
- but does not significantly improve the test error.

Asymptotics: Estimation
Uniform convergence bounds:
$$\text{Estimation error} \;\leq\; O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right) \quad \text{with } \tfrac{1}{2} \leq \alpha \leq 1.$$

The value $d$ describes the capacity of our system.
The simplest capacity measure is the Vapnik-Chervonenkis dimension of $F$.

There are in fact three (four?) types of bounds to consider:
- Classical V-C bounds (pessimistic): $O\!\left( \sqrt{d/n} \right)$
- Relative V-C bounds in the realizable case: $O\!\left( \frac{d}{n} \log \frac{n}{d} \right)$
- Localized bounds (variance, Tsybakov): $O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right)$

Fast estimation rates: (Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2005; . . . )

Asymptotics: Estimation+Optimization
Uniform convergence arguments give:
$$\text{Estimation error} + \text{Optimization error} \;\leq\; O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} + \rho \right)$$

This is true for all three cases of uniform convergence bounds.

Scaling laws for $\rho$ when $F$ is fixed:
The approximation error is constant.
- No need to choose $\rho$ smaller than $O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right)$.
- Not advisable to choose $\rho$ larger than $O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right)$.

. . . Approximation+Estimation+Optimization

When $F$ is chosen via a $\lambda$-regularized cost
- Uniform convergence theory provides bounds for simple cases
  (Massart, 2000; Zhang, 2005; Steinwart et al., 2004-2007; . . . )
- Scaling laws for $n$, $\lambda$ and $\rho$ depend on the optimization algorithm.
- New: see (Shalev-Shwartz and Srebro, ICML 2008) for Linear SVMs.

When F is realistically complicated


Large datasets matter
because one can use more features,
because one can use richer models.
Bounds for such cases are rarely realistic enough.

Analysis of a Simple Case

Simple parametric setup


$F$ is fixed.
Functions $f_w(x)$ linearly parametrized by $w \in \mathbb{R}^d$.

Comparing four iterative optimization algorithms for $E_n(f)$:
1. Gradient descent.
2. Second order gradient descent (Newton).
3. Stochastic gradient descent.
4. Stochastic second order gradient descent.

Quantities of Interest
Empirical Hessian at the empirical optimum $w_n$:
$$H \;=\; \frac{\partial^2 E_n}{\partial w^2}(f_{w_n}) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \ell(f_n(x_i), y_i)}{\partial w^2}$$

Empirical Fisher Information matrix at the empirical optimum $w_n$:
$$G \;=\; \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial \ell(f_n(x_i), y_i)}{\partial w} \right] \left[ \frac{\partial \ell(f_n(x_i), y_i)}{\partial w} \right]^{\top}$$

Condition number
We assume that there are $\lambda_{\min}$, $\lambda_{\max}$ and $\nu$ such that
- $\operatorname{trace}(G H^{-1}) \approx \nu$,
- $\operatorname{spectrum}(H) \subset [\lambda_{\min}, \lambda_{\max}]$,
and we define the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$.
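For concreteness, a small sketch (an illustration, not from the slides) that estimates $H$, $G$, $\kappa$ and $\nu$ for an $\ell_2$-regularized logistic loss; the data and the loss choice are assumptions:

```python
import numpy as np

def hessian_and_fisher(w, X, y, lam=1e-4):
    """Empirical Hessian H and Fisher matrix G of the regularized logistic
    loss  log(1 + exp(-y w.x)) + (lam/2)|w|^2  at the point w."""
    n, d = X.shape
    margins = np.clip(y * (X @ w), -30.0, 30.0)
    sigma = 1.0 / (1.0 + np.exp(margins))            # |d loss / d margin|
    grads = -(sigma * y)[:, None] * X + lam * w      # per-example gradients (n x d)
    G = grads.T @ grads / n
    curv = sigma * (1.0 - sigma)                     # second derivative wrt margin
    H = (X * curv[:, None]).T @ X / n + lam * np.eye(d)
    return H, G

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sign(rng.normal(size=500))
H, G = hessian_and_fisher(np.zeros(5), X, y)
eigs = np.linalg.eigvalsh(H)
print("kappa =", eigs.max() / eigs.min())            # condition number
print("nu    =", np.trace(G @ np.linalg.inv(H)))     # trace(G H^-1)
```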

Gradient Descent (GD)


[Figure: gradient descent steps following the gradient of $E_n$ across its level sets.]

Iterate:
$$w_{t+1} \leftarrow w_t - \eta \, \frac{\partial E_n(f_{w_t})}{\partial w}$$

Best speed achieved with fixed learning rate $\eta = 1/\lambda_{\max}$
(e.g., Dennis & Schnabel, 1983).

GD:
- Cost per iteration: $O(nd)$
- Iterations to reach accuracy $\rho$: $O\!\left( \kappa \log \frac{1}{\rho} \right)$
- Time to reach accuracy $\rho$: $O\!\left( nd\kappa \log \frac{1}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d^2 \kappa}{\varepsilon^{1/\alpha}} \log^2 \frac{1}{\varepsilon} \right)$

In the last column, $n$ and $\rho$ are chosen to reach $\varepsilon$ as fast as possible.
Solve for $\varepsilon$ to find the best error rate achievable in a given time.
Remark: this abuses the $O(\cdot)$ notation.

Second Order Gradient Descent (2GD)


[Figure: Newton steps rescale the gradient with the inverse Hessian.]

Iterate:
$$w_{t+1} \leftarrow w_t - H^{-1} \, \frac{\partial E_n(f_{w_t})}{\partial w}$$

We assume $H^{-1}$ is known in advance.
Superlinear optimization speed (e.g., Dennis & Schnabel, 1983).

2GD:
- Cost per iteration: $O(d(d+n))$
- Iterations to reach accuracy $\rho$: $O\!\left( \log \log \frac{1}{\rho} \right)$
- Time to reach accuracy $\rho$: $O\!\left( d(d+n) \log \log \frac{1}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d^2}{\varepsilon^{1/\alpha}} \log \frac{1}{\varepsilon} \log \log \frac{1}{\varepsilon} \right)$

Optimization speed is much faster.
Learning speed only saves the condition number $\kappa$.

Stochastic Gradient Descent (SGD)


Iterate:
Draw a random example $(x_t, y_t)$.
$$w_{t+1} \leftarrow w_t - \frac{\eta}{t} \, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$

[Figure: the partial gradient $\frac{\partial J}{\partial w}(x, y, w)$ fluctuates around the total gradient $\langle \frac{\partial J}{\partial w}(x, y, w) \rangle$.]

Best decreasing gain schedule with $\eta = 1/\lambda_{\min}$
(see Bottou, 1991; Murata, 1998; Bottou & LeCun, 2004).

SGD:
- Cost per iteration: $O(d)$
- Iterations to reach accuracy $\rho$: $\frac{\nu k}{\rho} + o\!\left( \frac{1}{\rho} \right)$, with $1 \leq k \leq \kappa^2$
- Time to reach accuracy $\rho$: $O\!\left( \frac{d \nu k}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d \nu k}{\varepsilon} \right)$

Optimization speed is catastrophic.
Learning speed does not depend on the statistical estimation rate $\alpha$.
Learning speed depends on the condition number $\kappa$ but scales very well.
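A minimal sketch of this update rule, assuming a logistic loss and synthetic data (illustrative, not the benchmark code behind these slides):

```python
import numpy as np

def sgd(X, y, eta, steps, seed=0):
    """Plain SGD with the 1/t gain: w <- w - (eta/t) * grad of the example loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)                                   # draw a random example
        margin = np.clip(y[i] * (X[i] @ w), -30.0, 30.0)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))          # logistic loss gradient
        w -= (eta / t) * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 20))
y = np.sign(X @ rng.normal(size=20))
w = sgd(X, y, eta=10.0, steps=50000)
print("training error:", np.mean(np.sign(X @ w) != y))
```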

Second order Stochastic Descent (2SGD)


Iterate:
Draw a random example $(x_t, y_t)$.
$$w_{t+1} \leftarrow w_t - \frac{1}{t} \, H^{-1} \, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$

[Figure: the partial gradient $\frac{\partial J}{\partial w}(x, y, w)$ fluctuates around the total gradient $\langle \frac{\partial J}{\partial w}(x, y, w) \rangle$.]

Replace the scalar gain $\frac{\eta}{t}$ by the matrix $\frac{1}{t} H^{-1}$.

2SGD:
- Cost per iteration: $O(d^2)$
- Iterations to reach accuracy $\rho$: $\frac{\nu}{\rho} + o\!\left( \frac{1}{\rho} \right)$
- Time to reach accuracy $\rho$: $O\!\left( \frac{d^2 \nu}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d^2 \nu}{\varepsilon} \right)$

Each iteration is $d$ times more expensive.
The number of iterations is reduced by $\kappa^2$ (or less).
Second order only changes the constants.
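For intuition, a toy 2SGD run on a noisy quadratic where $H$ is known exactly; this is a deliberately simple assumption, since estimating $H^{-1}$ on real problems is precisely the difficulty discussed later:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 5
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                  # fixed positive definite Hessian
H_inv = np.linalg.inv(H)
w_star = rng.normal(size=d)

w = np.zeros(d)
for t in range(1, 100001):
    # noisy gradient of the quadratic 0.5 (w - w*)' H (w - w*)
    g = H @ (w - w_star) + rng.normal(size=d)
    w -= (1.0 / t) * (H_inv @ g)         # matrix gain (1/t) H^{-1}
print("distance to optimum:", np.linalg.norm(w - w_star))
```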

Benchmarking SGD in Simple Problems


The theory suggests that SGD is very competitive.
Many people associate SGD with trouble.

SGD historically associated with back-propagation.


Multilayer networks are very hard problems (nonlinear, nonconvex).
What is difficult, SGD or MLP?

Try PLAIN SGD on a simple learning problem.

Download from http://leon.bottou.org/projects/sgd.


These simple programs are very short.

Text Categorization with SVMs


Dataset
Reuters RCV1 document corpus.
781,265 training examples, 23,149 testing examples.
47,152 TF-IDF features.

Task
Recognizing documents of category CCAT.

Minimize:
$$\min_{w,b} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i + b, \, y_i)$$

Update:
$$w \leftarrow w - \eta_t \, \nabla(w_t, x_t, y_t) = w - \eta_t \left( \lambda w + \frac{\partial \ell(w^\top x_t + b, \, y_t)}{\partial w} \right)$$

Same setup as (Shalev-Shwartz et al., 2007) but plain SGD.
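A compact sketch of this plain SGD SVM on a dense toy problem (the RCV1 pipeline is not reproduced here; the data, the gain schedule and the $t_0$ constant are illustrative assumptions):

```python
import numpy as np

def svm_sgd(X, y, lam, epochs=5, t0=1e4, seed=0):
    """Plain SGD for the objective of this slide:
    (lam/2)|w|^2 + (1/n) sum_i max{0, 1 - y_i (w.x_i + b)}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / (lam * (t + t0))         # decreasing gain
            w *= 1.0 - eta * lam                 # regularization part
            if y[i] * (X[i] @ w + b) < 1.0:      # hinge active: loss part
                w += eta * y[i] * X[i]
                b += eta * y[i]
            t += 1
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 50))
y = np.sign(X @ rng.normal(size=50))
w, b = svm_sgd(X, y, lam=1e-4)
print("training error:", np.mean(np.sign(X @ w + b) != y))
```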

Text Categorization with SVMs


Results: Linear SVM
Hinge loss $\ell(\hat{y}, y) = \max\{0, 1 - y\hat{y}\}$, $\lambda = 0.0001$.

            Training Time   Primal cost   Test Error
SVMLight    23,642 secs     0.2275        6.02%
SVMPerf         66 secs     0.2278        6.03%
SGD            1.4 secs     0.2275        6.02%

Results: Log-Loss Classifier
Log loss $\ell(\hat{y}, y) = \log(1 + \exp(-y\hat{y}))$, $\lambda = 0.00001$.

                                          Training Time   Primal cost   Test Error
TRON (LibLinear, $\varepsilon = 0.01$)    30 secs         0.18907       5.68%
TRON (LibLinear, $\varepsilon = 0.001$)   44 secs         0.18890       5.70%
SGD                                       2.3 secs        0.18893       5.66%

The Wall

[Figure: testing cost (0.1 to 0.3) and training time in seconds versus optimization accuracy (trainingCost − optimalTrainingCost), from 0.01 down to 1e−09. SGD reaches its final testing cost in a couple of seconds; the TRON (LibLinear) training time grows steeply as the requested optimization accuracy tightens.]

Text Chunking with CRFs


Dataset
CoNLL 2000 Chunking Task:
Segment sentences into syntactically correlated chunks
(e.g., noun phrases, verb phrases).
106,978 training segments in 8,936 sentences.
23,852 testing segments in 2,012 sentences.

Model
Conditional Random Field (all linear, log-loss.)
Features are n-grams of words and part-of-speech tags.
1,679,700 parameters.
Same setup as (Vishwanathan et al., 2006) but plain SGD.

Text Chunking with CRFs

Results

          Training Time   Primal cost   Test F1 score
L-BFGS    4,335 secs      9,042         93.74%
SGD         568 secs      9,098         93.75%

Notes
Computing the gradients with the chain rule runs faster than
computing them with the forward-backward algorithm.
Graph Transformer Networks are nonlinear conditional random fields
trained with stochastic gradient descent (Bottou et al., 1997).

More SVM Experiments

From: Patrick Haffner


Date: Wednesday 2007-09-05 14:28:50
. . . I have tried on some of our main datasets. . .
I can send you the example, it is so striking!
Patrick

Dataset       Train size   Features   % non-0 features   LIBSVM (SDot)   LLAMA SVM   LLAMA MAXENT   SGDSVM
Reuters       781K         47K        0.1%               210,000         3,930       153            7
Translation   1000K        274K       0.0033%            days            47,700      1,105          7
SuperTag      950K         46K        0.0066%            31,650          905         210            1
Voicetone     579K         88K        0.019%             39,100          197         51             1

(The last four columns are training times, in seconds unless noted.)

More SVM Experiments


From: Olivier Chapelle
Date: Sunday 2007-10-28 22:26:44
. . . you should really run batch with various training set sizes . . .
[Figure: average test loss (0.1 to 0.4) versus training time in seconds (log scale, 0.001 to 1000) on the log-loss problem. Batch conjugate gradient curves for training set sizes n = 10,000, 30,000, 100,000, 300,000, and 781,265; stochastic gradient on the full set runs along the lower envelope of the batch curves.]

Why is SGD near the envelope?

Effect of one Additional Example (i)


Compare
$$w_n^* = \arg\min_w E_n(f_w)$$
$$w_{n+1}^* = \arg\min_w E_{n+1}(f_w) = \arg\min_w \left[ E_n(f_w) + \frac{1}{n} \, \ell\big(f_w(x_{n+1}), y_{n+1}\big) \right]$$

[Figure: the curves $E_n(f_w)$ and $\frac{n+1}{n} E_{n+1}(f_w)$ with their respective minima $w_n^*$ and $w_{n+1}^*$.]

Effect of one Additional Example (ii)

First Order Calculation
$$w_{n+1}^* = w_n^* - \frac{1}{n} \, H_{n+1}^{-1} \, \frac{\partial \ell\big(f_{w_n^*}(x_{n+1}), y_{n+1}\big)}{\partial w} + O\!\left( \frac{1}{n^2} \right)$$
where $H_{n+1}$ is the empirical Hessian on $n+1$ examples.

Compare with Second Order Stochastic Gradient Descent:
$$w_{t+1} = w_t - \frac{1}{t} \, H^{-1} \, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w}$$

Could they converge with the same speed?
$C^2$ regularity assumptions $\Rightarrow$ accurate speed estimates.

Speed of Scaled Stochastic Gradient


Study the recursion
$$w_{t+1} = w_t - \frac{1}{t} \, B_t \, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w} + O\!\left( \frac{1}{t^2} \right)$$
with $B_t \to B \succ 0$ and $2BH \succ I$.

Establish convergence a.s. via quasi-martingales (see Bottou, 1991, 1998).

Let $U_t = H \, (w_t - w^*)(w_t - w^*)^\top$.
Observe $E(f_{w_t}) - E(f_{w^*}) = \operatorname{tr}(U_t) + o(\operatorname{tr}(U_t))$.

Derive
$$\mathbb{E}_t(U_{t+1}) = \left( I - \frac{2BH}{t} + o\!\left( \frac{1}{t} \right) \right) U_t + \frac{HBGB}{t^2} + o\!\left( \frac{1}{t^2} \right)$$
where $G$ is the Fisher matrix.

Lemma: study the real sequence $u_{t+1} = \left( 1 - \frac{\alpha}{t} + o\!\left( \frac{1}{t} \right) \right) u_t + \frac{\beta}{t^2} + o\!\left( \frac{1}{t^2} \right)$.
- When $\alpha > 1$, show $u_t = \frac{\beta}{\alpha - 1} \frac{1}{t} + o\!\left( \frac{1}{t} \right)$ (nasty proof!).
- When $\alpha < 1$, show $u_t \sim t^{-\alpha}$ (up to log factors).

Bracket $\mathbb{E}(\operatorname{tr}(U_{t+1}))$ between two such sequences and conclude:
$$\frac{\operatorname{tr}(HBGB)}{2 \lambda_{\max}(BH) - 1} \, \frac{1}{t} + o\!\left( \frac{1}{t} \right) \;\leq\; \mathbb{E}\big[ E(f_{w_t}) - E(f_{w^*}) \big] \;\leq\; \frac{\operatorname{tr}(HBGB)}{2 \lambda_{\min}(BH) - 1} \, \frac{1}{t} + o\!\left( \frac{1}{t} \right)$$

Interesting special cases: $B = \frac{1}{\lambda_{\min}} I$ and $B = H^{-1}$.

After (Bottou & LeCun, 2003); see also (Fabian, 1973; Murata & Amari, 1998).

Chasing the constants pays.

Follow the leader: second-order SGD satisfies
$$\lim_{n\to\infty} n \, \mathbb{E}\big[ E(f_{w_n^*}) - E(f_F^*) \big] \;=\; \lim_{t\to\infty} t \, \mathbb{E}\big[ E(f_{w_t}) - E(f_F^*) \big] \;=\; \operatorname{tr}(G H^{-1})$$
$$\lim_{n\to\infty} n \, \mathbb{E}\big[ \|w^* - w_n^*\|^2 \big] \;=\; \lim_{t\to\infty} t \, \mathbb{E}\big[ \|w^* - w_t\|^2 \big] \;=\; \operatorname{tr}(H^{-1} G H^{-1})$$

[Figure: starting from $w_0 = w_0^*$, one pass of second order stochastic gradient tracks the sequence of empirical optima $w_n^*$ (best training set error) toward the best solution $w^*$ in $F$, with excess error $K/n$.]

Optimal Learning in One Pass


Given a large enough training set,
a Single Pass of Second Order Stochastic Gradient
generalizes as well as the Empirical Optimum.

Experiments on synthetic data

[Figure: test MSE as a function of the number of examples (1,000 to 100,000, left) and of training time in milliseconds (100 to 10,000, right), with reference levels Mse*+1e−1 down to Mse*+1e−4; MSE values range from 0.342 to 0.366.]

What matters is the Constant

Test set performance decreases in $K/t$.
The constant $K$ depends on:
- Second order versus first order optimization.
- How quickly we reach the asymptotic regime.
- Number of operations per iteration.
- Parallel programming.
- Programming language.
- Programmer's skill.

All on an equal footing!

Unfortunate Theoretical Issues

The dangers of asymptotic analysis!

Asymptotic one pass learning holds whenever $B_t \to H^{-1}$:
$$\mathbb{E}\big[ E(f_{w_t}) - E(f_{w^*}) \big] = \frac{\operatorname{tr}(G H^{-1})}{t} + o\!\left( \frac{1}{t} \right)$$

What happens when $B_t \to H^{-1}$ slowly?
The one-pass learning speed regime may not be reached in one pass. . .

Unfortunate Practical Issues

Second Order SGD is not that fast!
$$w_{t+1} \leftarrow w_t - \frac{1}{t} \, B_t \, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$

- Must estimate and store a $d \times d$ matrix $H^{-1}$.
- Must multiply the gradient for each example by the matrix $H^{-1}$.
- Sparsity tricks no longer work because $H^{-1}$ is not sparse.

Directions in Stochastic Algorithm Design


Limited storage approximations of $H^{-1}$
- Reduce the number of epochs.
- Rarely sufficient for fast one-pass learning.
- Diagonal approximation (Becker & LeCun, 1989; Bordes & Bottou, 2007).
- Low rank approximation (e.g., LeCun et al., 1998; LeRoux et al., 2007).
- Online L-BFGS approximation (Schraudolph, 2007).

Exploiting Duality
- Coordinate ascent in the dual is related to SGD.
- Dual Hessian sometimes more amenable to approximation.
- Some successes (Bordes and Bottou, 2005-2007).
- Long term perspective uncertain.

Averaged SGD (see the sketch below)
- Good asymptotic constants (Polyak and Juditsky, 1992).
- Asymptotic regime long to settle . . . possibly fixable (Wei Xu, to appear).
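To make the averaging idea concrete, a hedged sketch of Polyak-Juditsky averaging on a logistic loss; the gain schedule and the data are illustrative assumptions, not the variant studied in the cited works:

```python
import numpy as np

def averaged_sgd(X, y, eta0, steps, seed=0):
    """SGD on a logistic loss returning the last iterate and the running
    average of the iterates (Polyak-Juditsky averaging)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)
        margin = np.clip(y[i] * (X[i] @ w), -30.0, 30.0)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))
        w -= eta0 / np.sqrt(t) * grad            # slowly decreasing gain
        w_bar += (w - w_bar) / t                 # incremental average of iterates
    return w, w_bar

rng = np.random.default_rng(3)
X = rng.normal(size=(20000, 10))
y = np.sign(X @ rng.normal(size=10))
w_last, w_avg = averaged_sgd(X, y, eta0=1.0, steps=20000)
for name, w in (("last iterate", w_last), ("averaged    ", w_avg)):
    print(name, "error:", np.mean(np.sign(X @ w) != y))
```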

Engineering Algorithms - Sparsity

The very simple SGD update offers lots of engineering opportunities.

Example: Sparse Linear SVM
The update $w \leftarrow w - \eta \left( \lambda w + \frac{\partial \ell(w^\top x_i, y_i)}{\partial w} \right)$
can be performed in two steps:
 i) $w \leftarrow w - \eta \, \frac{\partial \ell(w^\top x_i, y_i)}{\partial w}$   (sparse, cheap)
 ii) $w \leftarrow w \, (1 - \eta\lambda)$   (not sparse, costly)

Solution
- Perform only step (i) for each training example.
- Perform step (ii) with lower frequency and higher gain (see the sketch below).

General Idea
Schedule SGD operations according to their cost!
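A minimal sketch of this scheduling trick using scipy sparse rows; the `skip` counter mirrors the SVMSGD2 pseudocode of the next slide, and the data and constants are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_svm_sgd(X, y, lam, t0=1e5, skip=16, epochs=3, seed=0):
    """Hinge-loss SGD touching only the non-zero features per example
    (step i) and applying the dense shrinkage w <- w (1 - eta lam) only
    every `skip` iterations with a `skip`-times larger gain (step ii)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t, count = np.zeros(d), 0, skip
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / (lam * (t + t0))
            row = X.getrow(i)                            # few non-zeros
            if y[i] * row.dot(w)[0] < 1.0:
                w[row.indices] += eta * y[i] * row.data  # step (i): sparse, cheap
            count -= 1
            if count < 0:                                # step (ii): rare, higher gain
                w *= 1.0 - skip * eta * lam
                count = skip
            t += 1
    return w

X = sparse_random(2000, 500, density=0.01, format="csr", random_state=4)
sums = np.asarray(X.sum(axis=1)).ravel()
y = np.where(sums > np.median(sums), 1.0, -1.0)
print("non-zero weights:", np.count_nonzero(sparse_svm_sgd(X, y, lam=1e-4)))
```

The design point is that the expensive dense operation runs `skip` times less often while its effect is preserved by the larger gain.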

Engineering Algorithms SGD-QN (i)


SGD-QN: A Stochastic Gradient SVM.
Official winner of the Pascal Large Scale Learning Challenge,
http://largescale.first.fraunhofer.de/workshop

Diagonal Hessian Approximation
Gains on the condition number at the lowest computational cost.

Schedule SGD operations according to their costs:
- regularization versus example updates,
- reestimation of the diagonal Hessian.

Careful engineering makes it run faster
(although we did not get any points for that).

Design Principle
Start with simple SGD.
Make incremental improvements of the constants.
(Bordes, Bottou, Gallinari, 2008)

Engineering Algorithms SGD-QN (ii)


SVMSGD0
Require: $\lambda$, $w_0$, $t_0$, $T$
 1: t = 0
 2: while t ≤ T do
 3:   w_{t+1} = w_t − (1/(λ(t+t_0))) (λ w_t + ℓ′(y_t w_t x_t) y_t x_t)
 4:   t = t + 1
 5: end while
 6: return w_T

SVMSGD2
Require: $\lambda$, $w_0$, $t_0$, $T$, skip
 1: t = 0, count = skip
 2: while t ≤ T do
 3:   w_{t+1} = w_t − (1/(λ(t+t_0))) ℓ′(y_t w_t x_t) y_t x_t
 4:   count = count − 1
 5:   if count < 0 then
 6:     w_{t+1} = w_{t+1} − (skip/(t+t_0)) w_{t+1}
 7:     count = skip
 8:   end if
 9:   t = t + 1
10: end while
11: return w_T

SGD-QN
Require: $\lambda$, $w_0$, $t_0$, $T$, skip
 1: t = 0, count = skip
 2: B = λ^{−1} I, updateB = false, r = 0
 3: while t ≤ T do
 4:   w_{t+1} = w_t − (t+t_0)^{−1} ℓ′(y_t w_t x_t) y_t B x_t
 5:   if updateB = true then
 6:     p_t = g_t(w_{t+1}) − g_t(w_t)
 7:     ∀i, B_{ii} = B_{ii} + (2/r) ( [w_{t+1} − w_t]_i [p_t]_i^{−1} − B_{ii} )
 8:     ∀i, B_{ii} = max(B_{ii}, 10^{−2} λ^{−1})
 9:     r = r + 1, updateB = false
10:   end if
11:   count = count − 1
12:   if count < 0 then
13:     w_{t+1} = w_{t+1} − skip λ (t+t_0)^{−1} B w_{t+1}
14:     count = skip, updateB = true
15:   end if
16:   t = t + 1
17: end while
18: return w_T

Here $g_t(w) = \lambda w + \ell'(y_t w^\top x_t) \, y_t x_t$ is the stochastic gradient on example $(x_t, y_t)$. SVMSGD2 performs the regularization step only every `skip` iterations with a `skip`-times larger gain, and SGD-QN additionally reestimates the diagonal matrix $B \approx H^{-1}$ on those iterations.

Engineering Algorithms SGD-QN (iii)

[Figure: test errors (in %, from 21.0 to 25.0) on the Pascal Large Scale Challenge Delta dataset for SVMSGD2, SGD-QN, oLBFGS and LibLinear, according to the number of epochs (left) and the training time in seconds, 0 to 0.8 (right).]

SGD in Real Life: A Check Reader.


Examples are pairs (image, amount).
- Field segmentation
- Character segmentation
- Character recognition
- Syntactical interpretation

Define differentiable modules.
Pretrain modules with hand-labelled data.
Assemble them to form a composite model.
Define a global cost function.
Train with SGD for a few weeks.
(Bottou, LeCun, Bengio, 1997; LeCun, Bottou, et al., 1998)

Industrially deployed; has processed billions of checks since 1996.

SGD in Real Life: Sentence Analysis


[Figure: architecture. Binary encoded sentence words; words embedded in a 50-100 dimensional space; five time-delay multilayer networks; positional information relative to the chosen predicate for semantic tagging.]

Tasks and results:
- Part Of Speech Tagging (treebank, split 02-21 / 23): ERR 2.76%; state of the art: 2.75%.
- Named Entity Recognition (treebank, Stanford NER): F1 88.97%; state of the art: 89.31%.
- Chunking (treebank): F1 92.7% (94.9% w/pos); state of the art: 94.4%.
- Semantic Role Labeling (propbank): WER ~14%; state of the art: ~13%.
- Language Model (wikipedia, 620M examples).

(Collobert and Weston, 2007, 2008) [not my work]

- No hand-tuned parsing tricks.
- No hand-tuned linguistic features.
- State-of-the-art accuracies.
- Analyzes a sentence in 50 milliseconds (instead of seconds).
- Trains with SGD in about 3 weeks.

Conclusions
Qualitatively different tradeoffs for small and large-scale
learning problems.
Traditional performance measurements for optimization
algorithms do not apply very well to machine learning
problems.
Stochastic algorithms provide excellent performance, both
in theory and in practice.
There is room for improvements.
Engineering matters!

The End

Insights on the Gain Schedule (i)

Decreasing gains:
$$w_{t+1} \leftarrow w_t - \frac{\eta}{t + t_0} \, \nabla(w_t, x_t, y_t)$$

Asymptotic Theory
- If $s = 2 \eta \lambda_{\min} < 1$, then slow rate $O\!\left( t^{-s} \right)$.
- If $s = 2 \eta \lambda_{\min} > 1$, then faster rate $O\!\left( \frac{s^2}{s-1} \, t^{-1} \right)$.

Example: the SVM benchmark
- Use $\eta = 1/\lambda$ because $\lambda \leq \lambda_{\min}$.
- Choose $t_0$ to make sure that the expected initial updates
  are comparable with the expected size of the weights.

Insights on the Gain Schedule (ii)

The sample size $n$ does not change the SGD mathematics!

Constant gain:
$$w_{t+1} \leftarrow w_t - \eta \, \nabla(w_t, x_t, y_t)$$

At any moment during training, we can:
- Pick a random subset of examples with moderate size.
- Try various gains on the subsample.
- Pick the gain that most reduces the cost.
- Use it for the next 100,000 iterations on the full dataset
  (see the sketch below).
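A sketch of this gain-selection heuristic under a hinge loss; the subsample size, trial length and candidate gains are illustrative assumptions:

```python
import numpy as np

def pick_gain(w, X, y, grad, cost, gains, subsample=1000, trial_steps=200, seed=0):
    """Try each candidate gain on a random subsample for a few SGD steps;
    return the gain that most reduces the cost."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_gain, best_cost = gains[0], np.inf
    for eta in gains:
        v = w.copy()                     # each trial starts from the current w
        for _ in range(trial_steps):
            j = rng.integers(len(Xs))
            v -= eta * grad(v, Xs[j], ys[j])
        c = cost(v, Xs, ys)
        if c < best_cost:
            best_gain, best_cost = eta, c
    return best_gain

def hinge_grad(w, x, y, lam=1e-4):
    g = lam * w
    if y * (x @ w) < 1.0:
        g = g - y * x
    return g

def hinge_cost(w, X, y, lam=1e-4):
    return lam / 2 * (w @ w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

rng = np.random.default_rng(5)
X = rng.normal(size=(20000, 20))
y = np.sign(X @ rng.normal(size=20))
eta = pick_gain(np.zeros(20), X, y, hinge_grad, hinge_cost,
                gains=[0.001, 0.01, 0.1, 1.0])
print("chosen gain:", eta)   # then use it for the next 100,000 iterations
```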

Insights on Stopping Criteria

Time to reach accuracy $\rho$
(number of epochs to reach the same test cost as the full optimization):
- 2SGD: $\frac{\nu}{\rho} + o\!\left( \frac{1}{\rho} \right)$
- SGD: $\frac{k\nu}{\rho} + o\!\left( \frac{k\nu}{\rho} \right)$, with $1 \leq k \leq \kappa^2$

There are many ways to make the constant $k$ smaller:
- Exact second order stochastic gradient descent.
- Approximate second order stochastic gradient descent.
- Simple preconditioning tricks.

Insights on Stopping Criteria

Early stopping with cross validation
- Create a validation set by setting some training examples apart.
- Monitor the cost function on the validation set.
- Stop when it stops decreasing.

Early stopping a priori
- Extract two disjoint subsamples of training data.
- Train on the first subsample; stop by validating on the second.
- The number of epochs is an estimate of $k$.
- Train by performing that number of epochs on the full set.

This is asymptotically correct and gives reasonable results in practice;
a sketch of the a priori protocol follows below.
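A sketch of the a priori protocol; `train_one_epoch` and the cost function are illustrative stand-ins, not code from the slides:

```python
import numpy as np

def estimate_epochs(train_one_epoch, cost, w0, X1, y1, X2, y2, max_epochs=50):
    """Train on subsample (X1, y1); return the epoch at which the validation
    cost on the disjoint subsample (X2, y2) is lowest."""
    w, best_cost, best_epoch = w0.copy(), np.inf, 0
    for epoch in range(1, max_epochs + 1):
        w = train_one_epoch(w, X1, y1, epoch)
        c = cost(w, X2, y2)
        if c < best_cost:
            best_cost, best_epoch = c, epoch
    return best_epoch

def train_one_epoch(w, X, y, epoch, lam=1e-4, t0=1e4):
    rng = np.random.default_rng(epoch)
    for step, i in enumerate(rng.permutation(len(X))):
        eta = 1.0 / (lam * (len(X) * (epoch - 1) + step + t0))
        w *= 1.0 - eta * lam
        if y[i] * (X[i] @ w) < 1.0:
            w += eta * y[i] * X[i]
    return w

def hinge_cost(w, X, y):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

rng = np.random.default_rng(6)
X = rng.normal(size=(12000, 20))
y = np.sign(X @ rng.normal(size=20))
half = len(X) // 2
k = estimate_epochs(train_one_epoch, hinge_cost, np.zeros(20),
                    X[:half], y[:half], X[half:], y[half:])
print("estimated number of epochs:", k, "-> now train that long on the full set")
```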
