
Large Scale Machine Learning and

Stochastic Algorithms

Léon Bottou
NEC Laboratories America
Princeton

Machine Learning in the 1980s

A Path to Artificial Intelligence?


Emulate cognitive capabilities of humans.
Profound, but not immediately obvious, relation
with the foundations of statistics.
Solution does not involve a human expert,
unlike the practice of statistics.
Humans learn from abundant and diverse data:
vision, sound, speech, text, . . . .

More data than we can handle.

Machine Learning in the 2000s


A Path to Artificial Intelligence?
Still my ultimate goal . . . Opinions vary . . .

Automated Data Mining


Gain competitive advantages by
analyzing the masses of data that describe
the life of our computerized society.
- Placement of web advertisement.
- Customer relation management.
- Fraud detection.
Brute force approach.

More data than we can handle.

Linear Time Learning Algorithms?

The computing resources available for learning


do not grow faster than the volume of data.
The cost of data mining cannot exceed the revenues.
Intelligent animals learn from streaming data.
Most machine learning or optimization algorithms
demand resources that grow faster than the volume of data.
Matrix operations ($n^3$ time for $n^2$ coefficients).
Sparse matrix operations are worse.

Summary

I. Machine Learning Redux.
II. Machine Learning and Optimization.
III. Stochastic Algorithms.

ML Redux: The Experimental Paradigm

Variations: k-fold cross-validation, etc.


This is the main driver for progress in machine learning.

ML Redux: Mathematical Statement (i)

Assumption
Examples are drawn independently from
an unknown probability distribution P (x, y)
that represents the laws of Nature.
Loss Function
Function $\ell(\hat{y}, y)$ measures the cost
of answering $\hat{y}$ when the true answer is $y$.

Expected Risk
We seek the function $f^*$ that minimizes:
$$\min_f \; E(f) = \int \ell(f(x), y) \, dP(x, y)$$

Note: The test set error is an approximation of the expected risk.

ML Redux: Mathematical Statement (ii)


Approximation
Not feasible to search $f^*$ among all functions.
Instead, we search $f^*_F$, the function that minimizes
the Expected Risk $E(f)$ within some richly parametrized
family of functions $F$.

Estimation
Not feasible to minimize the expectation $E(f)$
because $P(x, y)$ is unknown.
Instead, we search $f_n$ that minimizes the Empirical Risk $E_n(f)$,
that is, the average loss over the training set examples:
$$\min_{f \in F} \; E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$

In other words, we optimize a surrogate problem!
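As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates the empirical risk of a linear model under a hinge loss; the synthetic data and all names are assumptions made for the example:

```python
import numpy as np

def hinge(y_hat, y):
    """Loss l(y_hat, y) = max{0, 1 - y * y_hat}."""
    return max(0.0, 1.0 - y * y_hat)

def empirical_risk(w, X, y, loss):
    """E_n(f_w): average loss of the linear model x -> w @ x over the training set."""
    return float(np.mean([loss(x @ w, yi) for x, yi in zip(X, y)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))               # n = 1000 examples, d = 10 features
y = np.sign(X @ rng.normal(size=10))          # labels from an unknown "law of Nature"
print(empirical_risk(np.zeros(10), X, y, hinge))  # at w = 0 every hinge loss is 1.0
```

Minimizing $E_n$ over $w$ is the surrogate problem the slide describes; the expected risk itself remains out of reach because $P(x, y)$ is unknown.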

ML Redux: The Tradeoff


$$E(f_n) - E(f^*) \;=\; \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}}$$

[Figure: the estimation error grows and the approximation error shrinks as the size of $F$ increases.]
(Vapnik and Chervonenkis, Ordered risk minimization, 1974).
(Vapnik and Chervonenkis, Theorie der Zeichenerkennung, 1979)

The Computational Problem (i)

Statistical Perspective:
It is good to optimize an objective function that ensures a fast
estimation rate when the number of examples increases.
Optimization Perspective:
To efficiently solve large problems, it is preferable to choose
an optimization algorithm with strong asymptotic properties, e.g.
superlinear.
Incorrect Conclusion:
To address large-scale learning problems, use a superlinear algorithm to
optimize an objective function with fast estimation rate.

The Computational Problem (ii)


Baseline large-scale learning algorithm
Randomly discarding data is the simplest
way to handle large datasets.

What are the statistical benefits of processing more data?


What is the computational cost of processing more data?

We need a theory that joins Statistics and Computation!


1967: Vapnik and Chervonenkis theory does not discuss computation.
1984: Valiant's learnability excludes exponential time algorithms,
but (i) polynomial time is already too slow, (ii) few actual results.
We propose a new analysis of approximate optimization. . .

Test Error versus Learning Time

[Figure: test error as a function of computing time, decreasing toward the Bayes limit.]

Test Error versus Learning Time

[Figure: test error versus computing time for 10,000, 100,000, and 1,000,000 examples, each curve approaching the Bayes limit.]

Vary the number of examples. . .

Test Error versus Learning Time

[Figure: test error versus computing time for optimizers a, b, c and models I-IV, for 10,000 to 1,000,000 examples, with the Bayes limit.]

Vary the number of examples, the statistical models, the algorithms,. . .

Test Error versus Learning Time

[Figure: same curves; the lower envelope of the cloud marks the good learning algorithms.]

Not all combinations are equal.

Test Error versus Learning Time

[Figure: log-log view of the good learning algorithms near the Bayes limit, down to the point where we should start working on something else?]

Changing the units along the axes. . .

Learning with Approximate Optimization

Computing $f_n = \arg\min_{f \in F} E_n(f)$ is often costly.

Since we already optimize a surrogate function,
why should we compute its optimum $f_n$ exactly?

Let's assume our optimizer returns $\tilde{f}_n$
such that $E_n(\tilde{f}_n) < E_n(f_n) + \rho$.
For instance, one could stop an iterative
optimization algorithm long before its convergence.

Decomposition of the Error (i)


$$E(\tilde{f}_n) - E(f^*) \;=\; \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} \;+\; \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$

Problem:
Choose $F$, $n$, and $\rho$ to make this as small as possible,
subject to budget constraints:
- max number of examples $n$,
- max computing time $T$.

Decomposition of the Error (ii)

Approximation error bound: (Approximation theory)
- decreases when $F$ gets larger.

Estimation error bound: (Vapnik-Chervonenkis theory)
- decreases when $n$ gets larger.
- increases when $F$ gets larger.

Optimization error bound: (Vapnik-Chervonenkis theory plus tricks)
- increases with $\rho$.

Computing time $T$: (Algorithm dependent)
- decreases with $\rho$,
- increases with $n$,
- increases with $F$.

Small-scale vs. Large-scale Learning

We can give rigorous definitions.

Definition 1:
We have a small-scale learning problem when the active
budget constraint is the number of examples $n$.

Definition 2:
We have a large-scale learning problem when the active
budget constraint is the computing time $T$.

Small-scale Learning
The active budget constraint is the number of examples.

To reduce the estimation error, take $n$ as large as the budget allows.
To reduce the optimization error to zero, take $\rho = 0$.
We need to adjust the size of $F$.

[Figure: the estimation error rises and the approximation error falls as the size of $F$ grows.]

See Structural Risk Minimization (Vapnik 74) and later works.

Large-scale Learning
The active budget constraint is the computing time.

More complicated tradeoffs.


The computing time depends on the three variables: $F$, $n$, and $\rho$.
Example:
If we choose $\rho$ small, we decrease the optimization error. But we
must also decrease $F$ and/or $n$, with adverse effects on the estimation
and approximation errors.
The exact tradeoff depends on the optimization algorithm.
We can compare optimization algorithms rigorously.

Executive Summary

[Figure: $\log(\rho)$ versus $\log(T)$.]

Good optimization algorithm (superlinear):
$\rho$ decreases faster than $\exp(-T)$.
Mediocre optimization algorithm (linear):
$\rho$ decreases like $\exp(-T)$.
Extraordinarily poor optimization algorithm:
$\rho$ decreases like $1/T$.

Asymptotics
$$E(\tilde{f}_n) - E(f^*) \;=\; \underbrace{E(f^*_F) - E(f^*)}_{\text{Approximation error}} \;+\; \underbrace{E(f_n) - E(f^*_F)}_{\text{Estimation error}} \;+\; \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{Optimization error}}$$

Asymptotic Approach
All three errors must decrease at comparable rates.
Forcing one of the errors to decrease much faster
- costs in computing time,
- but does not significantly improve the test error.

Asymptotics: Estimation
Uniform convergence bounds:
$$\text{Estimation error} \;\leq\; O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right) \quad \text{with } \tfrac{1}{2} \leq \alpha \leq 1.$$

The value $d$ describes the capacity of our system.
The simplest capacity measure is the Vapnik-Chervonenkis dimension of $F$.

There are in fact three (four?) types of bounds to consider:
- Classical V-C bounds (pessimistic): $O\!\left( \sqrt{d/n} \right)$
- Relative V-C bounds in the realizable case: $O\!\left( \frac{d}{n} \log \frac{n}{d} \right)$
- Localized bounds (variance, Tsybakov): $O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right)$

Fast estimation rates: (Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2005; . . . )

Asymptotics: Estimation+Optimization
Uniform convergence arguments give:
$$\text{Estimation error} + \text{Optimization error} \;\leq\; O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} + \rho \right)$$

This is true for all three cases of uniform convergence bounds.

Scaling laws for $\rho$ when $F$ is fixed:
The approximation error is constant.
- No need to choose $\rho$ smaller than $O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right)$.
- Not advisable to choose $\rho$ larger than $O\!\left( \left[ \frac{d}{n} \log \frac{n}{d} \right]^{\alpha} \right)$.

. . . Approximation+Estimation+Optimization

When $F$ is chosen via a $\lambda$-regularized cost
- Uniform convergence theory provides bounds for simple cases
  (Massart, 2000; Zhang, 2005; Steinwart et al., 2004-2007; . . . )
- Scaling laws for $n$, $\lambda$ and $\rho$ depend on the optimization algorithm.
- New: see (Shalev-Shwartz and Srebro, ICML 2008) for Linear SVMs.

When F is realistically complicated


Large datasets matter
because one can use more features,
because one can use richer models.
Bounds for such cases are rarely realistic enough.

Analysis of a Simple Case

Simple parametric setup


$F$ is fixed.
Functions $f_w(x)$ linearly parametrized by $w \in \mathbb{R}^d$.

Comparing four iterative optimization algorithms for $E_n(f)$:
1. Gradient descent.
2. Second order gradient descent (Newton).
3. Stochastic gradient descent.
4. Stochastic second order gradient descent.

Quantities of Interest
Empirical Hessian at the empirical optimum $w_n$:
$$H \;=\; \frac{\partial^2 E_n}{\partial w^2}(f_{w_n}) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \ell(f_n(x_i), y_i)}{\partial w^2}$$

Empirical Fisher Information matrix at the empirical optimum $w_n$:
$$G \;=\; \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial \ell(f_n(x_i), y_i)}{\partial w} \right] \left[ \frac{\partial \ell(f_n(x_i), y_i)}{\partial w} \right]^{\top}$$

Condition number
We assume that there are $\lambda_{\min}$, $\lambda_{\max}$ and $\nu$ such that
- $\operatorname{trace}(G H^{-1}) \approx \nu$,
- $\operatorname{spectrum}(H) \subset [\lambda_{\min}, \lambda_{\max}]$,
and we define the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$.
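For concreteness, a small sketch (an illustration, not from the slides) that estimates $H$, $G$, $\kappa$ and $\nu$ for an $\ell_2$-regularized logistic loss; the data and the loss choice are assumptions:

```python
import numpy as np

def hessian_and_fisher(w, X, y, lam=1e-4):
    """Empirical Hessian H and Fisher matrix G of the regularized logistic
    loss  log(1 + exp(-y w.x)) + (lam/2)|w|^2  at the point w."""
    n, d = X.shape
    margins = np.clip(y * (X @ w), -30.0, 30.0)
    sigma = 1.0 / (1.0 + np.exp(margins))            # |d loss / d margin|
    grads = -(sigma * y)[:, None] * X + lam * w      # per-example gradients (n x d)
    G = grads.T @ grads / n
    curv = sigma * (1.0 - sigma)                     # second derivative wrt margin
    H = (X * curv[:, None]).T @ X / n + lam * np.eye(d)
    return H, G

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sign(rng.normal(size=500))
H, G = hessian_and_fisher(np.zeros(5), X, y)
eigs = np.linalg.eigvalsh(H)
print("kappa =", eigs.max() / eigs.min())            # condition number
print("nu    =", np.trace(G @ np.linalg.inv(H)))     # trace(G H^-1)
```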

Gradient Descent (GD)


[Figure: gradient descent steps following the gradient of $E_n$ across its level sets.]

Iterate:
$$w_{t+1} \leftarrow w_t - \eta \, \frac{\partial E_n(f_{w_t})}{\partial w}$$

Best speed achieved with fixed learning rate $\eta = 1/\lambda_{\max}$
(e.g., Dennis & Schnabel, 1983).

GD:
- Cost per iteration: $O(nd)$
- Iterations to reach accuracy $\rho$: $O\!\left( \kappa \log \frac{1}{\rho} \right)$
- Time to reach accuracy $\rho$: $O\!\left( nd\kappa \log \frac{1}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d^2 \kappa}{\varepsilon^{1/\alpha}} \log^2 \frac{1}{\varepsilon} \right)$

In the last column, $n$ and $\rho$ are chosen to reach $\varepsilon$ as fast as possible.
Solve for $\varepsilon$ to find the best error rate achievable in a given time.
Remark: this abuses the $O(\cdot)$ notation.

Second Order Gradient Descent (2GD)


[Figure: Newton steps rescale the gradient with the inverse Hessian.]

Iterate:
$$w_{t+1} \leftarrow w_t - H^{-1} \, \frac{\partial E_n(f_{w_t})}{\partial w}$$

We assume $H^{-1}$ is known in advance.
Superlinear optimization speed (e.g., Dennis & Schnabel, 1983).

2GD:
- Cost per iteration: $O(d(d+n))$
- Iterations to reach accuracy $\rho$: $O\!\left( \log \log \frac{1}{\rho} \right)$
- Time to reach accuracy $\rho$: $O\!\left( d(d+n) \log \log \frac{1}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d^2}{\varepsilon^{1/\alpha}} \log \frac{1}{\varepsilon} \log \log \frac{1}{\varepsilon} \right)$

Optimization speed is much faster.
Learning speed only saves the condition number $\kappa$.

Stochastic Gradient Descent (SGD)


Iterate:
Draw a random example $(x_t, y_t)$.
$$w_{t+1} \leftarrow w_t - \frac{\eta}{t} \, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$

[Figure: the partial gradient $\frac{\partial J}{\partial w}(x, y, w)$ fluctuates around the total gradient $\langle \frac{\partial J}{\partial w}(x, y, w) \rangle$.]

Best decreasing gain schedule with $\eta = 1/\lambda_{\min}$
(see Bottou, 1991; Murata, 1998; Bottou & LeCun, 2004).

SGD:
- Cost per iteration: $O(d)$
- Iterations to reach accuracy $\rho$: $\frac{\nu k}{\rho} + o\!\left( \frac{1}{\rho} \right)$, with $1 \leq k \leq \kappa^2$
- Time to reach accuracy $\rho$: $O\!\left( \frac{d \nu k}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d \nu k}{\varepsilon} \right)$

Optimization speed is catastrophic.
Learning speed does not depend on the statistical estimation rate $\alpha$.
Learning speed depends on the condition number $\kappa$ but scales very well.
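A minimal sketch of this update rule, assuming a logistic loss and synthetic data (illustrative, not the benchmark code behind these slides):

```python
import numpy as np

def sgd(X, y, eta, steps, seed=0):
    """Plain SGD with the 1/t gain: w <- w - (eta/t) * grad of the example loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)                                   # draw a random example
        margin = np.clip(y[i] * (X[i] @ w), -30.0, 30.0)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))          # logistic loss gradient
        w -= (eta / t) * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 20))
y = np.sign(X @ rng.normal(size=20))
w = sgd(X, y, eta=10.0, steps=50000)
print("training error:", np.mean(np.sign(X @ w) != y))
```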

Second order Stochastic Descent (2SGD)


Iterate:
Draw a random example $(x_t, y_t)$.
$$w_{t+1} \leftarrow w_t - \frac{1}{t} \, H^{-1} \, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$

[Figure: the partial gradient $\frac{\partial J}{\partial w}(x, y, w)$ fluctuates around the total gradient $\langle \frac{\partial J}{\partial w}(x, y, w) \rangle$.]

Replace the scalar gain $\frac{\eta}{t}$ by the matrix $\frac{1}{t} H^{-1}$.

2SGD:
- Cost per iteration: $O(d^2)$
- Iterations to reach accuracy $\rho$: $\frac{\nu}{\rho} + o\!\left( \frac{1}{\rho} \right)$
- Time to reach accuracy $\rho$: $O\!\left( \frac{d^2 \nu}{\rho} \right)$
- Time to reach $E(\tilde{f}_n) - E(f^*_F) < \varepsilon$: $O\!\left( \frac{d^2 \nu}{\varepsilon} \right)$

Each iteration is $d$ times more expensive.
The number of iterations is reduced by $\kappa^2$ (or less).
Second order only changes the constants.
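For intuition, a toy 2SGD run on a noisy quadratic where $H$ is known exactly; this is a deliberately simple assumption, since estimating $H^{-1}$ on real problems is precisely the difficulty discussed later:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 5
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                  # fixed positive definite Hessian
H_inv = np.linalg.inv(H)
w_star = rng.normal(size=d)

w = np.zeros(d)
for t in range(1, 100001):
    # noisy gradient of the quadratic 0.5 (w - w*)' H (w - w*)
    g = H @ (w - w_star) + rng.normal(size=d)
    w -= (1.0 / t) * (H_inv @ g)         # matrix gain (1/t) H^{-1}
print("distance to optimum:", np.linalg.norm(w - w_star))
```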

Benchmarking SGD in Simple Problems


The theory suggests that SGD is very competitive.
Many people associate SGD with trouble.

SGD historically associated with back-propagation.


Multilayer networks are very hard problems (nonlinear, nonconvex).
What is difficult, SGD or MLP?

Try PLAIN SGD on a simple learning problem.

Download from http://leon.bottou.org/projects/sgd.


These simple programs are very short.

Text Categorization with SVMs


Dataset
Reuters RCV1 document corpus.
781,265 training examples, 23,149 testing examples.
47,152 TF-IDF features.

Task
Recognizing documents of category CCAT.

Minimize:
$$\min_{w,b} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i + b, \, y_i)$$

Update:
$$w \leftarrow w - \eta_t \, \nabla(w_t, x_t, y_t) = w - \eta_t \left( \lambda w + \frac{\partial \ell(w^\top x_t + b, \, y_t)}{\partial w} \right)$$

Same setup as (Shalev-Shwartz et al., 2007) but plain SGD.
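A compact sketch of this plain SGD SVM on a dense toy problem (the RCV1 pipeline is not reproduced here; the data, the gain schedule and the $t_0$ constant are illustrative assumptions):

```python
import numpy as np

def svm_sgd(X, y, lam, epochs=5, t0=1e4, seed=0):
    """Plain SGD for the objective of this slide:
    (lam/2)|w|^2 + (1/n) sum_i max{0, 1 - y_i (w.x_i + b)}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / (lam * (t + t0))         # decreasing gain
            w *= 1.0 - eta * lam                 # regularization part
            if y[i] * (X[i] @ w + b) < 1.0:      # hinge active: loss part
                w += eta * y[i] * X[i]
                b += eta * y[i]
            t += 1
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 50))
y = np.sign(X @ rng.normal(size=50))
w, b = svm_sgd(X, y, lam=1e-4)
print("training error:", np.mean(np.sign(X @ w + b) != y))
```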

Text Categorization with SVMs


Results: Linear SVM
Hinge loss $\ell(\hat{y}, y) = \max\{0, 1 - y\hat{y}\}$, $\lambda = 0.0001$.

            Training Time   Primal cost   Test Error
SVMLight    23,642 secs     0.2275        6.02%
SVMPerf         66 secs     0.2278        6.03%
SGD            1.4 secs     0.2275        6.02%

Results: Log-Loss Classifier
Log loss $\ell(\hat{y}, y) = \log(1 + \exp(-y\hat{y}))$, $\lambda = 0.00001$.

                                          Training Time   Primal cost   Test Error
TRON (LibLinear, $\varepsilon = 0.01$)    30 secs         0.18907       5.68%
TRON (LibLinear, $\varepsilon = 0.001$)   44 secs         0.18890       5.70%
SGD                                       2.3 secs        0.18893       5.66%

The Wall

[Figure: testing cost (0.1 to 0.3) and training time in seconds versus optimization accuracy (trainingCost − optimalTrainingCost), from 0.01 down to 1e−09. SGD reaches its final testing cost in a couple of seconds; the TRON (LibLinear) training time grows steeply as the requested optimization accuracy tightens.]

Text Chunking with CRFs


Dataset
CoNLL 2000 Chunking Task:
Segment sentences into syntactically correlated chunks
(e.g., noun phrases, verb phrases).
106,978 training segments in 8,936 sentences.
23,852 testing segments in 2,012 sentences.

Model
Conditional Random Field (all linear, log-loss.)
Features are n-grams of words and part-of-speech tags.
1,679,700 parameters.
Same setup as (Vishwanathan et al., 2006) but plain SGD.

Text Chunking with CRFs

Results

          Training Time   Primal cost   Test F1 score
L-BFGS    4,335 secs      9,042         93.74%
SGD         568 secs      9,098         93.75%

Notes
Computing the gradients with the chain rule runs faster than
computing them with the forward-backward algorithm.
Graph Transformer Networks are nonlinear conditional random fields
trained with stochastic gradient descent (Bottou et al., 1997).

More SVM Experiments

From: Patrick Haffner


Date: Wednesday 2007-09-05 14:28:50
. . . I have tried on some of our main datasets. . .
I can send you the example, it is so striking!
Patrick

Dataset       Train size   Features   % non-0 features   LIBSVM (SDot)   LLAMA SVM   LLAMA MAXENT   SGDSVM
Reuters       781K         47K        0.1%               210,000         3,930       153            7
Translation   1000K        274K       0.0033%            days            47,700      1,105          7
SuperTag      950K         46K        0.0066%            31,650          905         210            1
Voicetone     579K         88K        0.019%             39,100          197         51             1

(The last four columns are training times, in seconds unless noted.)

More SVM Experiments


From: Olivier Chapelle
Date: Sunday 2007-10-28 22:26:44
. . . you should really run batch with various training set sizes . . .
[Figure: average test loss (0.1 to 0.4) versus training time in seconds (log scale, 0.001 to 1000) on the log-loss problem. Batch conjugate gradient curves for training set sizes n = 10,000, 30,000, 100,000, 300,000, and 781,265; stochastic gradient on the full set runs along the lower envelope of the batch curves.]

Why is SGD near the envelope?

Effect of one Additional Example (i)


Compare
$$w_n^* = \arg\min_w E_n(f_w)$$
$$w_{n+1}^* = \arg\min_w E_{n+1}(f_w) = \arg\min_w \left[ E_n(f_w) + \frac{1}{n} \, \ell\big(f_w(x_{n+1}), y_{n+1}\big) \right]$$

[Figure: the curves $E_n(f_w)$ and $\frac{n+1}{n} E_{n+1}(f_w)$ with their respective minima $w_n^*$ and $w_{n+1}^*$.]

Effect of one Additional Example (ii)

First Order Calculation
$$w_{n+1}^* = w_n^* - \frac{1}{n} \, H_{n+1}^{-1} \, \frac{\partial \ell\big(f_{w_n^*}(x_{n+1}), y_{n+1}\big)}{\partial w} + O\!\left( \frac{1}{n^2} \right)$$
where $H_{n+1}$ is the empirical Hessian on $n+1$ examples.

Compare with Second Order Stochastic Gradient Descent:
$$w_{t+1} = w_t - \frac{1}{t} \, H^{-1} \, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w}$$

Could they converge with the same speed?
$C^2$ regularity assumptions $\Rightarrow$ accurate speed estimates.

Speed of Scaled Stochastic Gradient


Study the recursion
$$w_{t+1} = w_t - \frac{1}{t} \, B_t \, \frac{\partial \ell\big(f_{w_t}(x_t), y_t\big)}{\partial w} + O\!\left( \frac{1}{t^2} \right)$$
with $B_t \to B \succ 0$ and $2BH \succ I$.

Establish convergence a.s. via quasi-martingales (see Bottou, 1991, 1998).

Let $U_t = H \, (w_t - w^*)(w_t - w^*)^\top$.
Observe $E(f_{w_t}) - E(f_{w^*}) = \operatorname{tr}(U_t) + o(\operatorname{tr}(U_t))$.

Derive
$$\mathbb{E}_t(U_{t+1}) = \left( I - \frac{2BH}{t} + o\!\left( \frac{1}{t} \right) \right) U_t + \frac{HBGB}{t^2} + o\!\left( \frac{1}{t^2} \right)$$
where $G$ is the Fisher matrix.

Lemma: study the real sequence $u_{t+1} = \left( 1 - \frac{\alpha}{t} + o\!\left( \frac{1}{t} \right) \right) u_t + \frac{\beta}{t^2} + o\!\left( \frac{1}{t^2} \right)$.
- When $\alpha > 1$, show $u_t = \frac{\beta}{\alpha - 1} \frac{1}{t} + o\!\left( \frac{1}{t} \right)$ (nasty proof!).
- When $\alpha < 1$, show $u_t \sim t^{-\alpha}$ (up to log factors).

Bracket $\mathbb{E}(\operatorname{tr}(U_{t+1}))$ between two such sequences and conclude:
$$\frac{\operatorname{tr}(HBGB)}{2 \lambda_{\max}(BH) - 1} \, \frac{1}{t} + o\!\left( \frac{1}{t} \right) \;\leq\; \mathbb{E}\big[ E(f_{w_t}) - E(f_{w^*}) \big] \;\leq\; \frac{\operatorname{tr}(HBGB)}{2 \lambda_{\min}(BH) - 1} \, \frac{1}{t} + o\!\left( \frac{1}{t} \right)$$

Interesting special cases: $B = \frac{1}{\lambda_{\min}} I$ and $B = H^{-1}$.

After (Bottou & LeCun, 2003); see also (Fabian, 1973; Murata & Amari, 1998).

Chasing the constants pays.

Follow the leader: second-order SGD satisfies
$$\lim_{n\to\infty} n \, \mathbb{E}\big[ E(f_{w_n^*}) - E(f_F^*) \big] \;=\; \lim_{t\to\infty} t \, \mathbb{E}\big[ E(f_{w_t}) - E(f_F^*) \big] \;=\; \operatorname{tr}(G H^{-1})$$
$$\lim_{n\to\infty} n \, \mathbb{E}\big[ \|w^* - w_n^*\|^2 \big] \;=\; \lim_{t\to\infty} t \, \mathbb{E}\big[ \|w^* - w_t\|^2 \big] \;=\; \operatorname{tr}(H^{-1} G H^{-1})$$

[Figure: starting from $w_0 = w_0^*$, one pass of second order stochastic gradient tracks the sequence of empirical optima $w_n^*$ (best training set error) toward the best solution $w^*$ in $F$, with excess error $K/n$.]

Optimal Learning in One Pass


Given a large enough training set,
a Single Pass of Second Order Stochastic Gradient
generalizes as well as the Empirical Optimum.

Experiments on synthetic data

[Figure: test MSE as a function of the number of examples (1,000 to 100,000, left) and of training time in milliseconds (100 to 10,000, right), with reference levels Mse*+1e−1 down to Mse*+1e−4; MSE values range from 0.342 to 0.366.]

What matters is the Constant

Test set performance decreases in $K/t$.
The constant $K$ depends on:
- Second order versus first order optimization.
- How quickly we reach the asymptotic regime.
- Number of operations per iteration.
- Parallel programming.
- Programming language.
- Programmer's skill.

All on an equal footing!

Unfortunate Theoretical Issues

The dangers of asymptotic analysis!

Asymptotic one pass learning holds whenever $B_t \to H^{-1}$:
$$\mathbb{E}\big[ E(f_{w_t}) - E(f_{w^*}) \big] = \frac{\operatorname{tr}(G H^{-1})}{t} + o\!\left( \frac{1}{t} \right)$$

What happens when $B_t \to H^{-1}$ slowly?
The one-pass learning speed regime may not be reached in one pass. . .

Unfortunate Practical Issues

Second Order SGD is not that fast!
$$w_{t+1} \leftarrow w_t - \frac{1}{t} \, B_t \, \frac{\partial \ell(f_{w_t}(x_t), y_t)}{\partial w}$$

- Must estimate and store a $d \times d$ matrix $H^{-1}$.
- Must multiply the gradient for each example by the matrix $H^{-1}$.
- Sparsity tricks no longer work because $H^{-1}$ is not sparse.

Directions in Stochastic Algorithm Design


Limited storage approximations of $H^{-1}$
- Reduce the number of epochs.
- Rarely sufficient for fast one-pass learning.
- Diagonal approximation (Becker & LeCun, 1989; Bordes & Bottou, 2007).
- Low rank approximation (e.g., LeCun et al., 1998; LeRoux et al., 2007).
- Online L-BFGS approximation (Schraudolph, 2007).

Exploiting Duality
- Coordinate ascent in the dual is related to SGD.
- Dual Hessian sometimes more amenable to approximation.
- Some successes (Bordes and Bottou, 2005-2007).
- Long term perspective uncertain.

Averaged SGD (see the sketch below)
- Good asymptotic constants (Polyak and Juditsky, 1992).
- Asymptotic regime long to settle . . . possibly fixable (Wei Xu, to appear).
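To make the averaging idea concrete, a hedged sketch of Polyak-Juditsky averaging on a logistic loss; the gain schedule and the data are illustrative assumptions, not the variant studied in the cited works:

```python
import numpy as np

def averaged_sgd(X, y, eta0, steps, seed=0):
    """SGD on a logistic loss returning the last iterate and the running
    average of the iterates (Polyak-Juditsky averaging)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)
        margin = np.clip(y[i] * (X[i] @ w), -30.0, 30.0)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))
        w -= eta0 / np.sqrt(t) * grad            # slowly decreasing gain
        w_bar += (w - w_bar) / t                 # incremental average of iterates
    return w, w_bar

rng = np.random.default_rng(3)
X = rng.normal(size=(20000, 10))
y = np.sign(X @ rng.normal(size=10))
w_last, w_avg = averaged_sgd(X, y, eta0=1.0, steps=20000)
for name, w in (("last iterate", w_last), ("averaged    ", w_avg)):
    print(name, "error:", np.mean(np.sign(X @ w) != y))
```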

Engineering Algorithms - Sparsity

The very simple SGD update offers lots of engineering opportunities.

Example: Sparse Linear SVM
The update $w \leftarrow w - \eta \left( \lambda w + \frac{\partial \ell(w^\top x_i, y_i)}{\partial w} \right)$
can be performed in two steps:
 i) $w \leftarrow w - \eta \, \frac{\partial \ell(w^\top x_i, y_i)}{\partial w}$   (sparse, cheap)
 ii) $w \leftarrow w \, (1 - \eta\lambda)$   (not sparse, costly)

Solution
- Perform only step (i) for each training example.
- Perform step (ii) with lower frequency and higher gain (see the sketch below).

General Idea
Schedule SGD operations according to their cost!
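A minimal sketch of this scheduling trick using scipy sparse rows; the `skip` counter mirrors the SVMSGD2 pseudocode of the next slide, and the data and constants are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_svm_sgd(X, y, lam, t0=1e5, skip=16, epochs=3, seed=0):
    """Hinge-loss SGD touching only the non-zero features per example
    (step i) and applying the dense shrinkage w <- w (1 - eta lam) only
    every `skip` iterations with a `skip`-times larger gain (step ii)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t, count = np.zeros(d), 0, skip
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / (lam * (t + t0))
            row = X.getrow(i)                            # few non-zeros
            if y[i] * row.dot(w)[0] < 1.0:
                w[row.indices] += eta * y[i] * row.data  # step (i): sparse, cheap
            count -= 1
            if count < 0:                                # step (ii): rare, higher gain
                w *= 1.0 - skip * eta * lam
                count = skip
            t += 1
    return w

X = sparse_random(2000, 500, density=0.01, format="csr", random_state=4)
sums = np.asarray(X.sum(axis=1)).ravel()
y = np.where(sums > np.median(sums), 1.0, -1.0)
print("non-zero weights:", np.count_nonzero(sparse_svm_sgd(X, y, lam=1e-4)))
```

The design point is that the expensive dense operation runs `skip` times less often while its effect is preserved by the larger gain.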

Engineering Algorithms SGD-QN (i)


SGD-QN: A Stochastic Gradient SVM.
Official winner of the Pascal Large Scale Learning Challenge,
http://largescale.first.fraunhofer.de/workshop

Diagonal Hessian Approximation
Gains on the condition number at the lowest computational cost.

Schedule SGD operations according to their costs:
- regularization versus example updates,
- reestimation of the diagonal Hessian.

Careful engineering makes it run faster
(although we did not get any points for that).

Design Principle
Start with simple SGD.
Make incremental improvements of the constants.
(Bordes, Bottou, Gallinari, 2008)

Engineering Algorithms SGD-QN (ii)


SVMSGD0
Require: $\lambda$, $w_0$, $t_0$, $T$
 1: t = 0
 2: while t ≤ T do
 3:   w_{t+1} = w_t − (1/(λ(t+t_0))) (λ w_t + ℓ′(y_t w_t x_t) y_t x_t)
 4:   t = t + 1
 5: end while
 6: return w_T

SVMSGD2
Require: $\lambda$, $w_0$, $t_0$, $T$, skip
 1: t = 0, count = skip
 2: while t ≤ T do
 3:   w_{t+1} = w_t − (1/(λ(t+t_0))) ℓ′(y_t w_t x_t) y_t x_t
 4:   count = count − 1
 5:   if count < 0 then
 6:     w_{t+1} = w_{t+1} − (skip/(t+t_0)) w_{t+1}
 7:     count = skip
 8:   end if
 9:   t = t + 1
10: end while
11: return w_T

SGD-QN
Require: $\lambda$, $w_0$, $t_0$, $T$, skip
 1: t = 0, count = skip
 2: B = λ^{−1} I, updateB = false, r = 0
 3: while t ≤ T do
 4:   w_{t+1} = w_t − (t+t_0)^{−1} ℓ′(y_t w_t x_t) y_t B x_t
 5:   if updateB = true then
 6:     p_t = g_t(w_{t+1}) − g_t(w_t)
 7:     ∀i, B_{ii} = B_{ii} + (2/r) ( [w_{t+1} − w_t]_i [p_t]_i^{−1} − B_{ii} )
 8:     ∀i, B_{ii} = max(B_{ii}, 10^{−2} λ^{−1})
 9:     r = r + 1, updateB = false
10:   end if
11:   count = count − 1
12:   if count < 0 then
13:     w_{t+1} = w_{t+1} − skip λ (t+t_0)^{−1} B w_{t+1}
14:     count = skip, updateB = true
15:   end if
16:   t = t + 1
17: end while
18: return w_T

Here $g_t(w) = \lambda w + \ell'(y_t w^\top x_t) \, y_t x_t$ is the stochastic gradient on example $(x_t, y_t)$. SVMSGD2 performs the regularization step only every `skip` iterations with a `skip`-times larger gain, and SGD-QN additionally reestimates the diagonal matrix $B \approx H^{-1}$ on those iterations.

Engineering Algorithms SGD-QN (iii)

[Figure: test errors (in %, from 21.0 to 25.0) on the Pascal Large Scale Challenge Delta dataset for SVMSGD2, SGD-QN, oLBFGS and LibLinear, according to the number of epochs (left) and the training time in seconds, 0 to 0.8 (right).]

SGD in Real Life: A Check Reader.


Examples are pairs (image, amount).
- Field segmentation
- Character segmentation
- Character recognition
- Syntactical interpretation

Define differentiable modules.
Pretrain modules with hand-labelled data.
Assemble them to form a composite model.
Define a global cost function.
Train with SGD for a few weeks.
(Bottou, LeCun, Bengio, 1997; LeCun, Bottou, et al., 1998)

Industrially deployed; has processed billions of checks since 1996.

SGD in Real Life: Sentence Analysis


[Figure: architecture. Binary encoded sentence words; words embedded in a 50-100 dimensional space; five time-delay multilayer networks; positional information relative to the chosen predicate for semantic tagging.]

Tasks and results:
- Part Of Speech Tagging (treebank, split 02-21 / 23): ERR 2.76%; state of the art: 2.75%.
- Named Entity Recognition (treebank, Stanford NER): F1 88.97%; state of the art: 89.31%.
- Chunking (treebank): F1 92.7% (94.9% w/pos); state of the art: 94.4%.
- Semantic Role Labeling (propbank): WER ~14%; state of the art: ~13%.
- Language Model (wikipedia, 620M examples).

(Collobert and Weston, 2007, 2008) [not my work]

- No hand-tuned parsing tricks.
- No hand-tuned linguistic features.
- State-of-the-art accuracies.
- Analyzes a sentence in 50 milliseconds (instead of seconds).
- Trains with SGD in about 3 weeks.

Conclusions
Qualitatively different tradeoffs for small and large-scale
learning problems.
Traditional performance measurements for optimization
algorithms do not apply very well to machine learning
problems.
Stochastic algorithms provide excellent performance, both
in theory and in practice.
There is room for improvements.
Engineering matters!

The End

Insights on the Gain Schedule (i)

Decreasing gains:
$$w_{t+1} \leftarrow w_t - \frac{\eta}{t + t_0} \, \nabla(w_t, x_t, y_t)$$

Asymptotic Theory
- If $s = 2 \eta \lambda_{\min} < 1$, then slow rate $O\!\left( t^{-s} \right)$.
- If $s = 2 \eta \lambda_{\min} > 1$, then faster rate $O\!\left( \frac{s^2}{s-1} \, t^{-1} \right)$.

Example: the SVM benchmark
- Use $\eta = 1/\lambda$ because $\lambda \leq \lambda_{\min}$.
- Choose $t_0$ to make sure that the expected initial updates
  are comparable with the expected size of the weights.

Insights on the Gain Schedule (ii)

The sample size $n$ does not change the SGD mathematics!

Constant gain:
$$w_{t+1} \leftarrow w_t - \eta \, \nabla(w_t, x_t, y_t)$$

At any moment during training, we can:
- Pick a random subset of examples with moderate size.
- Try various gains on the subsample.
- Pick the gain that most reduces the cost.
- Use it for the next 100,000 iterations on the full dataset
  (see the sketch below).
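A sketch of this gain-selection heuristic under a hinge loss; the subsample size, trial length and candidate gains are illustrative assumptions:

```python
import numpy as np

def pick_gain(w, X, y, grad, cost, gains, subsample=1000, trial_steps=200, seed=0):
    """Try each candidate gain on a random subsample for a few SGD steps;
    return the gain that most reduces the cost."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_gain, best_cost = gains[0], np.inf
    for eta in gains:
        v = w.copy()                     # each trial starts from the current w
        for _ in range(trial_steps):
            j = rng.integers(len(Xs))
            v -= eta * grad(v, Xs[j], ys[j])
        c = cost(v, Xs, ys)
        if c < best_cost:
            best_gain, best_cost = eta, c
    return best_gain

def hinge_grad(w, x, y, lam=1e-4):
    g = lam * w
    if y * (x @ w) < 1.0:
        g = g - y * x
    return g

def hinge_cost(w, X, y, lam=1e-4):
    return lam / 2 * (w @ w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

rng = np.random.default_rng(5)
X = rng.normal(size=(20000, 20))
y = np.sign(X @ rng.normal(size=20))
eta = pick_gain(np.zeros(20), X, y, hinge_grad, hinge_cost,
                gains=[0.001, 0.01, 0.1, 1.0])
print("chosen gain:", eta)   # then use it for the next 100,000 iterations
```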

Insights on Stopping Criteria

Time to reach accuracy $\rho$
(number of epochs to reach the same test cost as the full optimization):
- 2SGD: $\frac{\nu}{\rho} + o\!\left( \frac{1}{\rho} \right)$
- SGD: $\frac{k\nu}{\rho} + o\!\left( \frac{k\nu}{\rho} \right)$, with $1 \leq k \leq \kappa^2$

There are many ways to make the constant $k$ smaller:
- Exact second order stochastic gradient descent.
- Approximate second order stochastic gradient descent.
- Simple preconditioning tricks.

Insights on Stopping Criteria

Early stopping with cross validation
- Create a validation set by setting some training examples apart.
- Monitor the cost function on the validation set.
- Stop when it stops decreasing.

Early stopping a priori
- Extract two disjoint subsamples of training data.
- Train on the first subsample; stop by validating on the second.
- The number of epochs is an estimate of $k$.
- Train by performing that number of epochs on the full set.

This is asymptotically correct and gives reasonable results in practice;
a sketch of the a priori protocol follows below.
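A sketch of the a priori protocol; `train_one_epoch` and the cost function are illustrative stand-ins, not code from the slides:

```python
import numpy as np

def estimate_epochs(train_one_epoch, cost, w0, X1, y1, X2, y2, max_epochs=50):
    """Train on subsample (X1, y1); return the epoch at which the validation
    cost on the disjoint subsample (X2, y2) is lowest."""
    w, best_cost, best_epoch = w0.copy(), np.inf, 0
    for epoch in range(1, max_epochs + 1):
        w = train_one_epoch(w, X1, y1, epoch)
        c = cost(w, X2, y2)
        if c < best_cost:
            best_cost, best_epoch = c, epoch
    return best_epoch

def train_one_epoch(w, X, y, epoch, lam=1e-4, t0=1e4):
    rng = np.random.default_rng(epoch)
    for step, i in enumerate(rng.permutation(len(X))):
        eta = 1.0 / (lam * (len(X) * (epoch - 1) + step + t0))
        w *= 1.0 - eta * lam
        if y[i] * (X[i] @ w) < 1.0:
            w += eta * y[i] * X[i]
    return w

def hinge_cost(w, X, y):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

rng = np.random.default_rng(6)
X = rng.normal(size=(12000, 20))
y = np.sign(X @ rng.normal(size=20))
half = len(X) // 2
k = estimate_epochs(train_one_epoch, hinge_cost, np.zeros(20),
                    X[:half], y[:half], X[half:], y[half:])
print("estimated number of epochs:", k, "-> now train that long on the full set")
```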
