Deep Learning
Ian Goodfellow
Yoshua Bengio
Aaron Courville
Contents
Website vii

Acknowledgments viii

Notation xi

1 Introduction 1
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 11
I Applied Math and Machine Learning Basics 29

2 Linear Algebra 31
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . 31
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 34
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . 36
2.4 Linear Dependence and Span . . . . . . . . . . . . . . . . . . . . 37
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . 40
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 44
2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . 45
2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.11 The Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 48

3 Probability and Information Theory 53
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
II Deep Networks: Modern Practices 165

6 Deep Feedforward Networks 167
6.1 Example: Learning XOR . . . . . . . . . . . . . . . . . . . . . . . 170
6.2 Gradient-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 176
III Deep Learning Research 489

13 Linear Factor Models 492
13.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 493
13.2 Independent Component Analysis (ICA) . . . . . . . . . . . . . . 494
13.3 Slow Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . 496
13.4 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
13.5 Manifold Interpretation of PCA . . . . . . . . . . . . . . . . . . . 502

14 Autoencoders 505
14.1 Undercomplete Autoencoders . . . . . . . . . . . . . . . . . . . . 506
14.2 Regularized Autoencoders . . . . . . . . . . . . . . . . . . . . . . 507
14.3 Representational Power, Layer Size and Depth . . . . . . . . . . 511
14.4 Stochastic Encoders and Decoders . . . . . . . . . . . . . . . . . 512
14.5 Denoising Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 513
14.6 Learning Manifolds with Autoencoders . . . . . . . . . . . . . . . 518
14.7 Contractive Autoencoders . . . . . . . . . . . . . . . . . . . . . . 524
14.8 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 526
14.9 Applications of Autoencoders . . . . . . . . . . . . . . . . . . . . 527

15 Representation Learning 529
15.1 Greedy Layer-Wise Unsupervised Pretraining . . . . . . . . . . . 531
15.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . 539
15.3 Semi-Supervised Disentangling of Causal Factors . . . . . . . . . 544
15.4 Distributed Representation . . . . . . . . . . . . . . . . . . . . . 549
15.5 Exponential Gains from Depth . . . . . . . . . . . . . . . . . . . 556
15.6 Providing Clues to Discover Underlying Causes . . . . . . . . . . 557

16 Structured Probabilistic Models for Deep Learning 561
16.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . 562
16.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . 566
16.3 Sampling from Graphical Models . . . . . . . . . . . . . . . . . . 583
16.4 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 584
16.5 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . 585
16.6 Inference and Approximate Inference . . . . . . . . . . . . . . . . 586
16.7 The Deep Learning Approach to Structured Probabilistic Models 587

17 Monte Carlo Methods 593
17.1 Sampling and Monte Carlo Methods . . . . . . . . . . . . . . . . 593
17.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 595
17.3 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 598
17.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
17.5 The Challenge of Mixing between Separated Modes . . . . . . . . 602

18 Confronting the Partition Function 608
18.1 The Log-Likelihood Gradient . . . . . . . . . . . . . . . . . . . . 609
18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 610
18.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
18.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 620
18.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 622
18.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 623
18.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . 626

19 Approximate Inference 634
19.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 636
19.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 637
19.3 MAP Inference and Sparse Coding . . . . . . . . . . . . . . . . . 638
19.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 641
19.5 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 653

20 Deep Generative Models 656
20.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 656
20.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 658
20.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 662
20.4 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 665
20.5 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . 678
20.6 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 685
20.7 Boltzmann Machines for Structured or Sequential Outputs . . . . 687
20.8 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 688
20.9 Back-Propagation through Random Operations . . . . . . . . . . 689
20.10 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . 694
20.11 Drawing Samples from Autoencoders . . . . . . . . . . . . . . . . 712
20.12 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 716
20.13 Other Generation Schemes . . . . . . . . . . . . . . . . . . . . . . 717
20.14 Evaluating Generative Models . . . . . . . . . . . . . . . . . . . . 719
20.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721

Bibliography 723

Index 780
Website

www.deeplearningbook.org
Acknowledgments
This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Kari Pulli, Tapani Raiko, Anurag Ranjan, Johannes Roith, Halis Sak, César Salgado, Grigory Sapunov, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:
• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.

• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett, Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Colby Toland, Massimiliano Tomassoli, Alessandro Vitale and Bob Welland.

• Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Andre Simpelo, Alexey Surkov and Volker Tresp.

• Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer, and Hu Yuhuang.

• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Kee-Bong Song, Zheng Sun and Andy Wu.

• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.

• Chapter 7, Regularization for Deep Learning: Inkyu Lee, Sunil Mohan and Joshua Salisbury.

• Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens and Klaus Strobl.

• Chapter 9, Convolutional Networks: Martín Arjovsky, Eugene Brevdo, Eric Jensen, Asifullah Khan, Mehdi Mirza, Alex Paino, Eddie Pierce, Marjorie Sayer, Ryan Stout and Wentao Wu.

• Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gökçen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Mihaela Rosca, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang.

• Chapter 11, Practical Methodology: Daniel Beckstein.

• Chapter 12, Applications: George Dahl and Ribana Roscher.

• Chapter 15, Representation Learning: Kunal Ghosh.

• Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê and Anton Varfolom.

• Chapter 18, Confronting the Partition Function: Sam Bowman.
Notation

This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, this notation reference may seem intimidating. However, do not despair; we describe most of these ideas in chapters 2-4.

Numbers and Arrays
a          A scalar (integer or real)
a          A vector
A          A matrix
A          A tensor
I_n        Identity matrix with n rows and n columns
I          Identity matrix with dimensionality implied by context
e^(i)      Standard basis vector [0, . . . , 0, 1, 0, . . . , 0] with a 1 at position i
diag(a)    A square, diagonal matrix with diagonal entries given by a
a          A scalar random variable
a          A vector-valued random variable
A          A matrix-valued random variable
Sets and Graphs

(a, b]       The real interval excluding a but including b
A \ B        Set subtraction, i.e., the set containing the elements of A that are not in B
G            A graph
Pa_G(x_i)    The parents of x_i in G

Indexing
a_i          Element i of vector a, with indexing starting at 1
a_{-i}       All elements of vector a except for element i
A_{i,j}      Element i, j of matrix A
A_{i,:}      Row i of matrix A
A_{:,i}      Column i of matrix A
A_{i,j,k}    Element (i, j, k) of a 3-D tensor A
A_{:,:,i}    2-D slice of a 3-D tensor
a_i          Element i of the random vector a
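The book itself contains no code, but as an illustrative aside (not part of the original text) the indexing notation above maps directly onto NumPy array indexing; note that NumPy indexes from 0 while the notation above starts at 1, so the book's A_{1,2} corresponds to `A[0, 1]` below.

```python
import numpy as np

a = np.array([10.0, 20.0, 30.0])
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a[0])      # element a_1 of the vector: 10.0
print(A[0, 1])   # element A_{1,2}: 2.0
print(A[0, :])   # row A_{1,:}
print(A[:, 1])   # column A_{:,2}

# A 3-D tensor and one of its 2-D slices, A_{:,:,i}
T = np.arange(24.0).reshape(2, 3, 4)
print(T[:, :, 0].shape)  # (2, 3)
```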
Linear Algebra Operations

A^T          Transpose of matrix A
A^+          Moore-Penrose pseudoinverse of A
A ⊙ B        Element-wise (Hadamard) product of A and B
det(A)       Determinant of A
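As a sketch (not from the book), each of these operations has a direct NumPy counterpart; for an invertible matrix the pseudoinverse coincides with the ordinary inverse.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

At = A.T                    # transpose A^T
A_pinv = np.linalg.pinv(A)  # Moore-Penrose pseudoinverse A^+
had = A * B                 # element-wise (Hadamard) product A ⊙ B
d = np.linalg.det(A)        # determinant det(A), here 1*4 - 2*3 = -2

# A is invertible, so A^+ equals A^{-1}.
assert np.allclose(A_pinv, np.linalg.inv(A))
print(had)  # [[ 5. 12.], [21. 32.]]
```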
Calculus

dy/dx                  Derivative of y with respect to x
∂y/∂x                  Partial derivative of y with respect to x
∇_x y                  Gradient of y with respect to x
∇_X y                  Matrix derivatives of y with respect to X
∇_X y                  Tensor containing derivatives of y with respect to X
∂f/∂x                  Jacobian matrix J ∈ R^(m×n) of f : R^n → R^m
∇²_x f(x) or H(f)(x)   The Hessian matrix of f at input point x
∫ f(x) dx              Definite integral over the entire domain of x
∫_S f(x) dx            Definite integral with respect to x over the set S
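As a hedged illustration of the gradient notation (an aside, not part of the book), one can approximate ∇_x f numerically with central finite differences and compare against a known analytic gradient; here f(x) = x^T x, whose gradient is 2x.

```python
import numpy as np

def f(x):
    # f(x) = x^T x, with analytic gradient ∇_x f = 2x
    return x @ x

def numerical_gradient(f, x, eps=1e-6):
    # Approximate each partial derivative ∂f/∂x_i by a central difference.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))  # ≈ [ 2. -4.  6.], matching 2x
```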
Probability and Information Theory

a⊥b                       The random variables a and b are independent
a⊥b | c                   They are conditionally independent given c
P(a)                      A probability distribution over a discrete variable
p(a)                      A probability distribution over a continuous variable, or over a variable whose type has not been specified
a ∼ P                     Random variable a has distribution P
E_{x∼P}[f(x)] or E f(x)   Expectation of f(x) with respect to P(x)
Var(f(x))                 Variance of f(x) under P(x)
Cov(f(x), g(x))           Covariance of f(x) and g(x) under P(x)
H(x)                      Shannon entropy of the random variable x
D_KL(P ‖ Q)               Kullback-Leibler divergence of P and Q
N(x; µ, Σ)                Gaussian distribution over x with mean µ and covariance Σ
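As an illustrative sketch (not part of the original text), the Shannon entropy H(x) and the KL divergence D_KL(P ‖ Q) can be computed directly for small discrete distributions; note that D_KL is not symmetric in its two arguments.

```python
import numpy as np

def entropy(p):
    # Shannon entropy H(x) = -sum_i P(x_i) log P(x_i), measured in nats
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(x_i) log(P(x_i) / Q(x_i))
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(entropy(p))           # log 2 ≈ 0.6931 nats, the maximum for two outcomes
print(kl_divergence(p, q))  # positive, and different from D_KL(Q || P)
```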
Functions

f : A → B     The function f with domain A and range B
f ∘ g         Composition of the functions f and g
f(x; θ)       A function of x parametrized by θ. Sometimes we just write f(x) and ignore the argument θ to lighten notation.
log x         Natural logarithm of x
σ(x)          Logistic sigmoid, 1 / (1 + exp(−x))
ζ(x)          Softplus, log(1 + exp(x))
||x||_p       L^p norm of x
||x||         L^2 norm of x
x^+           Positive part of x, i.e., max(0, x)
1_condition   is 1 if the condition is true, 0 otherwise
Sometimes we use a function f whose argument is a scalar, but apply it to a vector, matrix, or tensor: f(x), f(X), or f(X). This means to apply f to the array element-wise. For example, if C = σ(X), then C_{i,j,k} = σ(X_{i,j,k}) for all valid values of i, j and k.
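As a small NumPy illustration of this convention (an aside, not from the book), broadcasting applies a scalar function element-wise exactly as described: each entry of C is the sigmoid of the corresponding entry of X.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid σ(x) = 1 / (1 + exp(-x)); NumPy evaluates it
    # element-wise over an array of any shape.
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[[0.0, 1.0],
               [-1.0, 2.0]]])  # a small 3-D tensor

C = sigmoid(X)                 # C_{i,j,k} = σ(X_{i,j,k})

print(C.shape)      # (1, 2, 2), the same shape as X
print(C[0, 0, 0])   # σ(0) = 0.5
```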
Datasets and distributions

p_data     The data generating distribution
p̂_data     The empirical distribution defined by the training set
X          A set of training examples
x^(i)      The i-th example (input) from a dataset
y^(i)      The target associated with x^(i) for supervised learning
X          The m × n matrix with input example x^(i) in row X_{i,:}
Chapter 1

Introduction
In
In
Inv
tro
ven
entors
duction
tors ha
hav
ve long dreamed of creating mac
machines
hines that think. This desire dates
bac
backk to at least the time of ancien ancientt Greece. The mythical figures Pygmalion,
Inventors ha
Daedalus, andve long dreamedma
Hephaestus ofy creating
may machines that
all be interpreted think. This
as legendary in
invvdesire
en tors,dates
entors, and
bac k to at least the time of ancien t Greece. The m ythical
Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, figures Pygmalion,
Daedalus,
2004 ; Spark
Sparkesand Hephaestus
es, 1996 ; Tandy, ma 1997y ).
all be interpreted as legendary inventors, and
Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin,
2004When
(Sparkes, 1996; Tandy, 1997). When programmable computers were first conceived, people wondered whether they might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.

In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.

This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these
CHAPTER 1. INTRODUCTION
concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.

Many of the early successes of AI took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM's Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not due to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of completely formal rules, easily provided ahead of time by the programmer.

Ironically, abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech. A person's everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer.

Several artificial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity "FredWhileShaving" contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.

The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction
of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail.

The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression was given an MRI scan of the patient, rather than the doctor's formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery.

This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms. For a simple visual example, see Fig. 1.1.

Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.
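The feature-based pipeline described in this section, hand-designed features fed to a simple learner such as logistic regression, can be sketched in a few lines of Python. Everything below is invented for illustration: the synthetic data, the use of the mean as a stand-in "hand-designed feature", and the plain gradient-descent training loop. It is a minimal sketch, not an implementation from the book.

```python
import math
import random

random.seed(0)

def make_example():
    # "Raw" measurement: 20 noisy numbers whose label depends only on a
    # shared shift; the shift plays the role of the underlying factor.
    shift = random.choice([-1.0, 1.0])
    raw = [random.gauss(0.0, 1.0) + shift for _ in range(20)]
    label = 1 if shift > 0 else 0
    return raw, label

def feature(raw):
    # Hypothetical hand-designed feature: the mean of the raw values.
    return sum(raw) / len(raw)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [make_example() for _ in range(200)]

# Train a one-feature logistic regression by gradient ascent on the
# log-likelihood: the simple learner only ever sees the feature.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(100):
    for raw, y in data:
        p = sigmoid(w * feature(raw) + b)
        w += lr * (y - p) * feature(raw)
        b += lr * (y - p)

accuracy = sum(
    (sigmoid(w * feature(raw) + b) > 0.5) == (y == 1) for raw, y in data
) / len(data)
print(f"training accuracy with an informative feature: {accuracy:.2f}")
```

Given the same learner but the twenty raw numbers shuffled into an uninformative representation, accuracy would collapse toward chance, which is the point the text makes about MRI pixels.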
that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker's age, their sex, their accent and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.

Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker's accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.

Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.

The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later instructions can refer back to the results of earlier instructions. According to this
[Figure 1.2 labels: output (object identity): CAR, PERSON, ANIMAL; visible layer (input pixels).]
[Figure 1.3 graph nodes: element and set nodes, ×, +, and logistic regression, over inputs w1, x1, w2, x2.]
Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output but depends on the definition of what constitutes a possible computational step. The computation depicted in these graphs is the output of a logistic regression model, σ(wᵀx), where σ is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the elements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one.
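The caption's depth-counting can be made concrete with a toy sketch. The tuple encoding of the computational graph and the depth function below are my own illustration, not the book's notation: counting multiplication, addition and the sigmoid as elements gives depth three, while counting logistic regression as a single element gives depth one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Encode a computational graph as nested tuples: (op_name, child, child, ...).
# Leaves (plain numbers or vectors) are the inputs to the graph.
w, x = [1.0, -2.0], [3.0, 0.5]

# sigma(w1*x1 + w2*x2) with *, + and sigmoid as the primitive elements:
fine_graph = ("sigmoid",
              ("add",
               ("mul", w[0], x[0]),
               ("mul", w[1], x[1])))

# The same function with logistic regression itself as one primitive element:
coarse_graph = ("logreg", w, x)

def depth(node):
    # Depth = length of the longest path from an input leaf to the output.
    if not (isinstance(node, tuple) and isinstance(node[0], str)):
        return 0  # a leaf contributes no operations
    return 1 + max(depth(child) for child in node[1:])

print(depth(fine_graph))    # 3: mul -> add -> sigmoid
print(depth(coarse_graph))  # 1: a single logistic-regression element
```

The same function thus reports two different depths depending purely on the chosen "language" of elements, which is the ambiguity the caption describes.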
view of deep learning, not all of the information in a layer's activations necessarily encodes factors of variation that explain the input. The representation also stores state information that helps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input specifically, but it helps the model to organize its processing.

There are two main ways of measuring the depth of a model. The first view is based on the number of sequential instructions that must be executed to evaluate the architecture. We can think of this as the length of the longest path through a flow chart that describes how to compute each of the model's outputs given its inputs. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. Fig. 1.3 illustrates how this choice of language can give two different measurements for the same architecture.

Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth of the flowchart of the computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves. This is because the system's understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers—a layer for eyes and a layer for faces—but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times.

Because it is not always clear which of these two views—the depth of the computational graph, or the depth of the probabilistic modeling graph—is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as "deep." However, deep learning can safely be regarded as the study of models that either involve a greater amount of composition of learned functions or learned concepts than traditional machine learning does.

To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Fig. 1.4 illustrates the relationship between these different AI disciplines. Fig. 1.5 gives a high-level schematic of how each works.
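The earlier eye-and-face example can be sketched as a small loop. The update rules and every number below are hypothetical, chosen only to show the counting argument: n rounds of mutual refinement between a two-layer concept graph yield 2n stages of computation.

```python
# Toy sketch of mutual refinement between two concept "layers".
# The belief updates are invented for illustration, not taken from the book.

def refine_face(eye_belief):
    # Hypothetical update: evidence of an eye raises belief in a face.
    return min(1.0, 0.5 + 0.6 * eye_belief)

def refine_eyes(face_belief):
    # Hypothetical update: a detected face implies a second eye is likely.
    return min(1.0, 0.4 + 0.7 * face_belief)

eye_belief, face_belief = 0.4, 0.0  # one eye partly visible, in shadow
n = 3        # rounds of mutual refinement
stages = 0   # layers in the resulting graph of computations
for _ in range(n):
    face_belief = refine_face(eye_belief)
    stages += 1
    eye_belief = refine_eyes(face_belief)
    stages += 1

print(stages)  # 2n = 6 computational layers from a 2-layer concept graph
print(round(eye_belief, 2), round(face_belief, 2))
```

Under the concept-graph view this model has depth two, while under the computational-graph view its depth grows linearly with the number of refinement rounds.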
1.1 Who Should Read This Book?

This book can be useful for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in many software disciplines including computer vision, speech and audio processing,
[Figure: nested Venn diagram labeled, from outermost inward: AI, machine learning, representation learning.]

Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.
[Figure 1.5: flowcharts comparing rule-based systems, classic machine learning, representation learning, and deep learning. Each column maps input to output through, respectively: a hand-designed program; hand-designed features; learned simple features; and learned simple features plus additional layers of more abstract features.]
1.2 Historical Trends in Deep Learning

It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has increased.

• Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved.

• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
[Figure: flowchart of the book's chapter organization and dependencies, including 1. Introduction; 2. Linear Algebra; 3. Probability and Information Theory; 6. Deep Feedforward Networks; 11. Practical Methodology; 12. Applications; 18. Partition Function; 19. Inference.]
[Figure: plot of frequency of word or phrase (0.000000 to 0.000250) versus year (1940–2000), with two series: “cybernetics” and “(connectionism + neural networks)”.]
Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases “cybernetics” and “connectionism” or “neural networks” according to Google Books (the third wave is too recent to appear). The first wave started with cybernetics in the 1940s–1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is just now appearing in book form as of 2016. The other two waves similarly appeared in book form much later than the corresponding scientific activity occurred.
The earliest predecessors of modern deep learning were simple linear models motivated from a neuroscientific perspective. These models were designed to take a set of n input values x1, . . . , xn and associate them with an output y. These models would learn a set of weights w1, . . . , wn and compute their output f(x, w) = x1w1 + · · · + xnwn. This first wave of neural networks research was known as cybernetics, as illustrated in Fig. 1.7.

The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of brain function. This linear model could recognize two different categories of inputs by testing whether f(x, w) is positive or negative. Of course, for the model to correspond to the desired definition of the categories, the weights needed to be set correctly. These weights could be set by the human operator. In the 1950s, the perceptron (Rosenblatt, 1958, 1962) became the first model that could learn the weights defining the categories given examples of inputs from each category.

The adaptive linear element (ADALINE), which dates from about the same time, simply returned the value of f(x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to predict these numbers from data.
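To make the linear model concrete, the following sketch implements f(x, w) and the perceptron's mistake-driven weight update. This is an illustrative sketch only: the helper names, toy data, and training loop are ours, not taken from Rosenblatt's papers.

```python
# Sketch of a linear threshold unit, f(x, w) = x_1 w_1 + ... + x_n w_n,
# classified by sign, with a perceptron-style learning rule.
# All names and data here are illustrative.

def f(x, w):
    """Linear model: f(x, w) = x_1 w_1 + ... + x_n w_n."""
    return sum(xi * wi for xi, wi in zip(x, w))

def predict(x, w):
    """Classify by testing whether f(x, w) is positive or negative."""
    return 1 if f(x, w) > 0 else -1

def train_perceptron(examples, n, epochs=100, lr=0.1):
    """Learn weights from labeled examples (x, y) with y in {-1, +1}."""
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            if predict(x, w) != y:
                # On a mistake, nudge the weights toward y * x.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

# A linearly separable toy problem: y = +1 iff x_1 > x_2.
# (The constant third input acts as a bias term.)
data = [([1.0, 0.0, 1.0], 1), ([0.0, 1.0, 1.0], -1),
        ([2.0, 1.0, 1.0], 1), ([1.0, 2.0, 1.0], -1)]
w = train_perceptron(data, n=3)
assert all(predict(x, w) == y for x, y in data)
```

On linearly separable data such as this, the perceptron convergence theorem guarantees that the mistake-driven updates eventually find a separating weight vector.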
These simple learning algorithms greatly affected the modern landscape of machine learning. The training algorithm used to adapt the weights of the ADALINE was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today.
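The ADALINE weight update can be written as one stochastic gradient descent step on the squared prediction error. The sketch below illustrates this; the function names, learning rate, and toy data are our own illustrative assumptions, not from Widrow and Hoff.

```python
# Sketch of stochastic gradient descent for an ADALINE-style linear model,
# which predicts the real value f(x, w) and is trained on squared error.
# The learning rate and data below are illustrative.

def f(x, w):
    """Linear model: f(x, w) = x_1 w_1 + ... + x_n w_n."""
    return sum(xi * wi for xi, wi in zip(x, w))

def sgd_step(x, y, w, lr=0.05):
    """One SGD step on the per-example loss (f(x, w) - y)^2 / 2.
    The gradient with respect to w_i is (f(x, w) - y) * x_i."""
    err = f(x, w) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

# Learn y = 2*x_1 - x_2 from a few examples (constant third input = bias).
data = [([1.0, 1.0, 1.0], 1.0), ([2.0, 0.0, 1.0], 4.0),
        ([0.0, 2.0, 1.0], -2.0), ([1.0, 3.0, 1.0], -1.0)]
w = [0.0, 0.0, 0.0]
for _ in range(2000):          # repeated passes over the training examples
    for x, y in data:
        w = sgd_step(x, y, w)
assert max(abs(f(x, w) - y) for x, y in data) < 0.01
```

The same update, applied to minibatches of examples and to the parameters of much deeper models via back-propagation, is essentially the workhorse of modern deep learning training.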
Models based on the f(x, w) used by the perceptron and ADALINE are called linear models. These models remain some of the most widely used machine learning models, though in many cases they are trained in different ways than the original models were trained.

Linear models have many limitations. Most famously, they cannot learn the XOR function, where f([0, 1], w) = 1 and f([1, 0], w) = 1 but f([1, 1], w) = 0 and f([0, 0], w) = 0. Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.

Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field.

The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide. To obtain a deep understanding of the actual algorithms used by the brain, we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously. Because we are not able to do this, we are far from understanding even some of the most simple and
neuroscience at all.
It is worth noting that the effort to understand how the brain works on an algorithmic level is alive and well. This endeavor is primarily known as “computational neuroscience” and is a separate field of study from deep learning. It is common for researchers to move back and forth between both fields. The field of deep learning is primarily concerned with how to build computer systems that are able to successfully solve tasks requiring intelligence, while the field of computational neuroscience is primarily concerned with building more accurate models of how the brain actually works.
In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism or parallel distributed processing (Rumelhart et al., 1986c; McClelland et al., 1995). Connectionism arose in the context of cognitive science. Cognitive science is an interdisciplinary approach to understanding the mind, combining multiple different levels of analysis. During the early 1980s, most cognitive scientists studied models of symbolic reasoning. Despite their popularity, symbolic models were difficult to explain in terms of how the brain could actually implement them using neurons. The connectionists began to study models of cognition that could actually be grounded in neural implementations (Touretzky and Minton, 1985), reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s (Hebb, 1949).
The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together. This insight applies equally to neurons in biological nervous systems and to hidden units in computational models.
Several key concepts arose during the connectionism movement of the 1980s that remain central to today’s deep learning.

One of these concepts is that of distributed representation (Hinton et al., 1986). This is the idea that each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs. For example, suppose we have a vision system that can recognize cars, trucks, and birds and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness
from images of cars, trucks and birds, not only from images of one specific category of objects. The concept of distributed representation is central to this book, and will be described in greater detail in Chapter 15.
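The neuron-counting argument above can be made concrete by comparing the two codes directly. This is a small illustrative sketch: the encoding functions are our own names, and the categories follow the example in the text.

```python
# Sketch contrasting a one-hot joint code (9 units) with a distributed
# code (3 color units + 3 object units = 6 units), per the text's example.
from itertools import product

colors = ["red", "green", "blue"]
objects = ["car", "truck", "bird"]

# One-hot joint code: one unit per (color, object) combination -> 9 units,
# and each unit must independently learn both color and object identity.
joint_units = list(product(colors, objects))
assert len(joint_units) == 9

def one_hot_code(color, obj):
    return [1 if (color, obj) == u else 0 for u in joint_units]

# Distributed code: 3 units describe color, 3 describe object -> 6 units.
def distributed_code(color, obj):
    return ([1 if color == c else 0 for c in colors] +
            [1 if obj == o else 0 for o in objects])

# The "redness" unit is shared by red cars, red trucks, and red birds, so
# whatever it learns about redness transfers across object categories.
assert distributed_code("red", "car")[0] == distributed_code("red", "bird")[0] == 1
assert len(distributed_code("red", "car")) == 6
assert sum(one_hot_code("red", "car")) == 1
```

With k attributes of v values each, the one-hot joint code needs v**k units while the distributed code needs only v*k, which is one way to see why distributed representations scale so much better.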
Another major accomplishment of the connectionist movement was the successful use of back-propagation to train deep neural networks with internal representations and the popularization of the back-propagation algorithm (Rumelhart et al., 1986a; LeCun, 1987). This algorithm has waxed and waned in popularity but as of this writing is currently the dominant approach to training deep models.
During the 1990s, researchers made important advances in modeling sequences with neural networks. Hochreiter (1991) and Bengio et al. (1994) identified some of the fundamental mathematical difficulties in modeling long sequences, described in Sec. 10.7. Hochreiter and Schmidhuber (1997) introduced the long short-term memory or LSTM network to resolve some of these difficulties. Today, the LSTM is widely used for many sequence modeling tasks, including many natural language processing tasks at Google.
The second wave of neural networks research lasted until the mid-1990s. Ventures based on neural networks and other AI technologies began to make unrealistically ambitious claims while seeking investments. When AI research did not fulfill these unreasonable expectations, investors were disappointed. Simultaneously, other fields of machine learning made advances. Kernel machines (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999) and graphical models (Jordan, 1998) both achieved good results on many important tasks. These two factors led to a decline in the popularity of neural networks that lasted until 2007.
During this time, neural networks continued to obtain impressive performance on some tasks (LeCun et al., 1998b; Bengio et al., 2001). The Canadian Institute for Advanced Research (CIFAR) helped to keep neural networks research alive via its Neural Computation and Adaptive Perception (NCAP) research initiative. This program united machine learning research groups led by Geoffrey Hinton at University of Toronto, Yoshua Bengio at University of Montreal, and Yann LeCun at New York University. The CIFAR NCAP research initiative had a multi-disciplinary nature that also included neuroscientists and experts in human and computer vision.
At this point in time, deep networks were generally believed to be very difficult to train. We now know that algorithms that have existed since the 1980s work quite well, but this was not apparent circa 2006. The issue is perhaps simply that these algorithms were too computationally costly to allow much experimentation with the hardware available at the time.
The third wave of neural networks research began with a breakthrough in 2006. Geoffrey Hinton showed that a kind of neural network called a deep belief network could be efficiently trained using a strategy called greedy layer-wise pretraining (Hinton et al., 2006), which will be described in more detail in Sec. 15.1. The other CIFAR-affiliated research groups quickly showed that the same strategy could be used to train many other kinds of deep networks (Bengio et al., 2007; Ranzato et al., 2007a) and systematically helped to improve generalization on test examples. This wave of neural networks research popularized the use of the term deep learning to emphasize that researchers were now able to train deeper neural networks than had been possible before, and to focus attention on the theoretical importance of depth (Bengio and LeCun, 2007; Delalleau and Bengio, 2011; Pascanu et al., 2014a; Montufar et al., 2014). At this time, deep neural networks outperformed competing AI systems based on other machine learning technologies as well as hand-designed functionality.

This third wave of popularity of neural networks continues to the time of this writing, though the focus of deep learning research has changed dramatically within the time of this wave. The third wave began with a focus on new unsupervised learning techniques and the ability of deep models to generalize well from small datasets, but today there is more interest in much older supervised learning algorithms and the ability of deep models to leverage large labeled datasets.
1.2.2 Increasing Dataset Sizes

One may wonder why deep learning has only recently become recognized as a crucial technology though the first experiments with artificial neural networks were conducted in the 1950s. Deep learning has been successfully used in commercial applications since the 1990s, but was often regarded as being more of an art than a technology and something that only an expert could use, until recently. It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human performance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s, though the models we train with these algorithms have undergone changes that simplify the training of very deep architectures. The most important new development is that today we can provide these algorithms with the resources they need to succeed.

Fig. 1.8 shows how the size of benchmark datasets has increased remarkably over time. This trend is driven by the increasing digitization of society. As more and more of our activities take place on computers, more and more of what we do is recorded. As our computers are increasingly networked together, it becomes easier to centralize these records and curate them into a dataset appropriate for machine learning applications. The age of “Big Data” has made machine learning much easier because the key burden of statistical estimation—generalizing well to new data after observing only a small amount of data—has been considerably lightened. As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning.
1.2.3 Increasing Model Sizes

Another key reason that neural networks are wildly successful today after enjoying comparatively little success since the 1980s is that we have the computational resources to run much larger models today. One of the main insights of connectionism is that animals become intelligent when many of their neurons work together. An individual neuron or small collection of neurons is not particularly useful.

Biological neurons are not especially densely connected. As seen in Fig. 1.10, our machine learning models have had a number of connections per neuron that was within an order of magnitude of even mammalian brains for decades.

In terms of the total number of neurons, neural networks have been astonishingly small until quite recently, as shown in Fig. 1.11. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. This growth is driven by faster computers with larger memory and by the availability of larger datasets. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades. Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s. Biological neurons may represent more complicated functions than current artificial neurons, so biological neural networks may be even larger than this plot portrays.

In retrospect, it is not particularly surprising that neural networks with fewer
neurons than a leech were unable to solve sophisticated artificial intelligence problems. Even today's networks, which we consider quite large from a computational systems point of view, are smaller than the nervous system of even relatively primitive vertebrate animals like frogs.
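The doubling trend described above lends itself to a back-of-the-envelope extrapolation. The sketch below is illustrative only: the starting network size and the roughly 1e11-neuron figure for the human brain are assumed round numbers, not data taken from Fig. 1.11.

```python
import math

# Back-of-the-envelope version of the growth trend described above: network
# size doubling roughly every 2.4 years. The starting size and the human-brain
# neuron count (~1e11) are illustrative assumptions, not data from Fig. 1.11.

def years_until(target, current, doubling_period=2.4):
    """Years until `current` reaches `target`, doubling every `doubling_period` years."""
    return math.log2(target / current) * doubling_period

# Suppose a large 2016-era network has ~1e7 units while the brain has ~1e11
# neurons: at this rate, brain scale is not reached until around mid-century.
print(round(2016 + years_until(target=1e11, current=1e7)))  # 2048 under these assumptions
```

Changing the assumed starting size by an order of magnitude shifts the answer by only about eight years, which is why the text can state the conclusion ("at least the 2050s") without committing to exact figures.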
The increase in model size over time, due to the availability of faster CPUs,
[Figure 1.8 plot: dataset size (number of examples, logarithmic scale) versus year, 1900–2015, marking datasets from Iris and MNIST through CIFAR-10, ImageNet, SVHN, Sports-1M, the Canadian Hansard and WMT.]
Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians studied datasets using hundreds or thousands of manually compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers of biologically inspired machine learning often worked with small, synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur low computational cost and demonstrate that neural networks were able to learn specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more statistical in nature and began to leverage larger datasets containing tens of thousands of examples such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers (LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009), continued to be produced. Toward the end of that decade and throughout the first half of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of millions of examples, completely changed what was possible with deep learning. These datasets included the public Street View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM's dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014 English to French dataset (Schwenk, 2014), are typically far ahead of other dataset sizes.
Figure 1.9: Example inputs from the MNIST dataset. The “NIST” stands for National Institute of Standards and Technology, the agency that originally collected this data. The “M” stands for “modified,” since the data has been preprocessed for easier use with machine learning algorithms. The MNIST dataset consists of scans of handwritten digits and associated labels describing which digit 0-9 is contained in each image. This simple classification problem is one of the simplest and most widely used tests in deep learning research. It remains popular despite being quite easy for modern techniques to solve. Geoffrey Hinton has described it as “the drosophila of machine learning,” meaning that it allows machine learning researchers to study their algorithms in controlled laboratory conditions, much as biologists often study fruit flies.
the advent of general purpose GPUs (described in Sec. 12.1.2), faster network connectivity and better software infrastructure for distributed computing, is one of the most important trends in the history of deep learning. This trend is generally expected to continue well into the future.
1.2.4 Increasing Accuracy, Complexity and Real-World Impact

Since the 1980s, deep learning has consistently improved in its ability to provide accurate recognition or prediction. Moreover, deep learning has consistently been applied with success to broader and broader sets of applications.
The earliest deep models were used to recognize individual objects in tightly cropped, extremely small images (Rumelhart et al., 1986a). Since then there has been a gradual increase in the size of images neural networks could process. Modern object recognition networks process rich high-resolution photographs and do not have a requirement that the photo be cropped near the object to be recognized (Krizhevsky et al., 2012). Similarly, the earliest networks could only recognize two kinds of objects (or in some cases, the absence or presence of a single kind of object), while these modern networks typically recognize at least 1,000 different categories of objects. The largest contest in object recognition is the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) held each year. A dramatic moment in the meteoric rise of deep learning came when a convolutional network won this challenge for the first time and by a wide margin, bringing down the state-of-the-art top-5 error rate from 26.1% to 15.3% (Krizhevsky et al., 2012), meaning that the convolutional network produces a ranked list of possible categories for each image and the correct category appeared in the first five entries of this list for all but 15.3% of the test examples. Since then, these competitions are consistently won by deep convolutional nets, and as of this writing, advances in deep learning have brought the latest top-5 error rate in this contest down to 3.6%, as shown in Fig. 1.12.
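The top-5 error rate described above is simple to state in code: an example counts as correct whenever the true category appears anywhere in the model's five highest-ranked guesses. A minimal sketch, with invented toy predictions rather than ILSVRC data:

```python
# Top-5 error: an example is a miss only if the true category is absent from
# the model's five highest-ranked guesses. Toy data, not ILSVRC results.

def top5_error(ranked_predictions, true_labels):
    """Fraction of examples whose true label is not among the top five guesses."""
    misses = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth not in ranked[:5]
    )
    return misses / len(true_labels)

predictions = [
    ["cat", "dog", "fox", "wolf", "lynx", "bear"],   # truth "cat" ranked 1st -> hit
    ["car", "truck", "bus", "van", "tram", "bike"],  # truth "bike" ranked 6th -> miss
]
print(top5_error(predictions, ["cat", "bike"]))  # 0.5
```

This is why top-5 error is always at most the ordinary (top-1) error: every top-1 hit is also a top-5 hit.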
Deep learning has also had a dramatic impact on speech recognition. After improving throughout the 1990s, the error rates for speech recognition stagnated starting in about 2000. The introduction of deep learning (Dahl et al., 2010; Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to speech recognition resulted in a sudden drop of error rates, with some error rates cut in half. We will explore this history in more detail in Sec. 12.3.

Deep networks have also had spectacular successes for pedestrian detection and image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance in traffic sign classification (Ciresan
[Figure 1.10 plot: number of connections per neuron (log scale) versus year, 1950–2015, with biological reference points including the fruit fly and mouse.]
Figure 1.10: Initially, the number of connections between neurons in artificial neural networks was limited by hardware capabilities. Today, the number of connections between neurons is mostly a design consideration. Some artificial neural networks have nearly as many connections per neuron as a cat, and it is quite common for other neural networks to have as many connections per neuron as smaller mammals like mice. Even the human brain does not have an exorbitant amount of connections per neuron. Biological neural network sizes from Wikipedia (2015).

1. Adaptive linear element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014a)
et al., 2012).

At the same time that the scale and accuracy of deep networks has increased, so has the complexity of the tasks that they can solve. Goodfellow et al. (2014d) showed that neural networks could learn to output an entire sequence of characters transcribed from an image, rather than just identifying a single object. Previously, it was widely believed that this kind of learning required labeling of the individual elements of the sequence (Gülçehre and Bengio, 2013). Recurrent neural networks, such as the LSTM sequence model mentioned above, are now used to model relationships between sequences and other sequences rather than just fixed inputs. This sequence-to-sequence learning seems to be on the cusp of revolutionizing another application: machine translation (Sutskever et al., 2014; Bahdanau et al., 2015).
This trend of increasing complexity has been pushed to its logical conclusion with the introduction of neural Turing machines (Graves et al., 2014a) that learn to read from memory cells and write arbitrary content to memory cells. Such neural networks can learn simple programs from examples of desired behavior. For example, they can learn to sort lists of numbers given examples of scrambled and sorted sequences. This self-programming technology is in its infancy, but in the future could in principle be applied to nearly any task.
Another crowning achievement of deep learning is its extension to the domain of reinforcement learning. In the context of reinforcement learning, an autonomous agent must learn to perform a task by trial and error, without any guidance from the human operator. DeepMind demonstrated that a reinforcement learning system based on deep learning is capable of learning to play Atari video games, reaching human-level performance on many tasks (Mnih et al., 2015). Deep learning has also significantly improved the performance of reinforcement learning for robotics (Finn et al., 2015).
Many of these applications of deep learning are highly profitable. Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.

Advances in deep learning have also depended heavily on advances in software infrastructure. Software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012), PyLearn2 (Goodfellow et al., 2013c), Torch (Collobert et al., 2011b), DistBelief (Dean et al., 2012), Caffe (Jia, 2013), MXNet (Chen et al., 2015), and TensorFlow (Abadi et al., 2015) have all supported important research projects or commercial products.

Deep learning has also made contributions back to other sciences. Modern convolutional networks for object recognition provide a model of visual processing
that neuroscientists can study (DiCarlo, 2013). Deep learning also provides useful tools for processing massive amounts of data and making useful predictions in scientific fields. It has been successfully used to predict how molecules will interact in order to help pharmaceutical companies design new drugs (Dahl et al., 2014), to search for subatomic particles (Baldi et al., 2014), and to automatically parse microscope images used to construct a 3-D map of the human brain (Knowles-Barley et al., 2014). We expect deep learning to appear in more and more scientific fields in the future.
In summary, deep learning is an approach to machine learning that has drawn heavily on our knowledge of the human brain, statistics and applied math as it developed over the past several decades. In recent years, it has seen tremendous growth in its popularity and usefulness, due in large part to more powerful computers, larger datasets and techniques to train deeper networks. The years ahead are full of challenges and opportunities to improve deep learning even further and bring it to new frontiers.
[Figure 1.12 plot: ILSVRC classification error rate (0.00–0.20) versus year, 2010–2015.]
Figure 1.12: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).
Part I

Applied Math and Machine Learning Basics
This part of the book introduces the basic mathematical concepts needed to understand deep learning. We begin with general ideas from applied math that allow us to define functions of many variables, find the highest and lowest points on these functions and quantify degrees of belief.

Next, we describe the fundamental goals of machine learning. We describe how to accomplish these goals by specifying a model that represents certain beliefs, designing a cost function that measures how well those beliefs correspond with reality and using a training algorithm to minimize that cost function.

This elementary framework is the basis for a broad variety of machine learning algorithms, including approaches to machine learning that are not deep. In the subsequent parts of the book, we develop deep learning algorithms within this framework.
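The model–cost–training-algorithm recipe described above can be made concrete on a deliberately tiny problem. The linear model, squared-error cost, learning rate and data below are illustrative choices made for this sketch, not ones prescribed by the text:

```python
# The three-part recipe above on a toy problem: a model that represents
# beliefs (y is approximately w * x), a cost function measuring how well those
# beliefs match reality (mean squared error), and a training algorithm that
# minimizes the cost (gradient descent). All numbers are illustrative.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]    # roughly y = 2x

w = 0.0                      # model parameter, initially a poor belief
lr = 0.01                    # learning rate for gradient descent

for _ in range(500):
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad           # step downhill on the cost surface

print(round(w, 2))           # close to 2.0
```

Every deep learning algorithm developed later in the book elaborates on this same loop: a richer model, a task-appropriate cost, and a more sophisticated minimization procedure.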
Chapter 2

Linear Algebra
Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter will completely omit many important linear algebra topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:
CHAPTER 2. LINEAR ALGEBRA
A = [A_{1,1} A_{1,2}; A_{2,1} A_{2,2}; A_{3,1} A_{3,2}]  ⇒  A⊤ = [A_{1,1} A_{2,1} A_{3,1}; A_{1,2} A_{2,2} A_{3,2}]

(A⊤)_{i,j} = A_{j,i}. (2.3)

Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we
define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x_1, x_2, x_3]⊤.

A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a⊤.

We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where C_{i,j} = A_{i,j} + B_{i,j}.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: D = a · B + c where D_{i,j} = a · B_{i,j} + c.

In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called broadcasting.
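The matrix-plus-vector shorthand C = A + b corresponds directly to the broadcasting behavior of numerical libraries. The following NumPy sketch is purely illustrative (the text itself assumes no particular library):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])   # a 3 x 2 matrix
b = np.array([10., 20.])   # one entry per column of A

# C = A + b: b is implicitly copied ("broadcast") to every row of A,
# so C[i, j] = A[i, j] + b[j].
C = A + b

# Equivalent explicit construction: stack b into a matrix first.
C_explicit = A + np.tile(b, (A.shape[0], 1))
assert np.array_equal(C, C_explicit)
```

Broadcasting avoids materializing the copied matrix, which matters when A is large.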
2.2 Multiplying Matrices and Vectors

One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices A and B is a third matrix C. In order for this product to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p. We can write the matrix product just by placing two or more matrices together, e.g.

C = AB. (2.4)

The product operation is defined by

C_{i,j} = Σ_k A_{i,k} B_{k,j}. (2.5)
Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as A ⊙ B.

The dot product between two vectors x and y of the same dimensionality is the matrix product x⊤y. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
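Eq. 2.5 can be checked directly against a library implementation. The sketch below (an illustration in NumPy, not part of the original text) contrasts the matrix product with the element-wise product and the vector dot product:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])   # shape m x n = 2 x 3
B = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])       # shape n x p = 3 x 2

# Matrix product from the definition: C[i, j] = sum_k A[i, k] * B[k, j].
m, n = A.shape
_, p = B.shape
C = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]

assert np.allclose(C, A @ B)   # matches the built-in matrix product

# The Hadamard product multiplies corresponding elements instead,
# and requires both operands to have the same shape.
assert np.allclose(A * A, A ** 2)

# The dot product of two vectors is the matrix product x^T y.
x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])
assert np.isclose(x @ y, 32.0)
```

The triple loop is only there to mirror Eq. 2.5; in practice one always uses the optimized `@` operator.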
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:

A(B + C) = AB + AC. (2.6)

It is also associative:

A(BC) = (AB)C. (2.7)
Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

x⊤y = y⊤x. (2.8)

The transpose of a matrix product has a simple form:

(AB)⊤ = B⊤A⊤. (2.9)

This allows us to demonstrate Eq. 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose:

x⊤y = (x⊤y)⊤ = y⊤x. (2.10)
Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.

We now know enough linear algebra notation to write down a system of linear equations:

Ax = b (2.11)
where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables. Each row of A and each element of b provide another constraint. We can rewrite Eq. 2.11 as:

A_{1,:} x = b_1 (2.12)
A_{2,:} x = b_2 (2.13)
. . . (2.14)
A_{m,:} x = b_m (2.15)
or, even more explicitly, as:

A_{1,1} x_1 + A_{1,2} x_2 + · · · + A_{1,n} x_n = b_1 (2.16)
A_{2,1} x_1 + A_{2,2} x_2 + · · · + A_{2,n} x_n = b_2 (2.17)
. . . (2.18)
A_{m,1} x_1 + A_{m,2} x_2 + · · · + A_{m,n} x_n = b_m. (2.19)

Matrix-vector product notation provides a more compact representation for equations of this form.

Figure 2.2: Example identity matrix: This is I_3 = [1 0 0; 0 1 0; 0 0 1].
2.3 Identity and Inverse Matrices

Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve Eq. 2.11 for many values of A.
To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ R^{n×n}, and

∀x ∈ R^n, I_n x = x. (2.20)

The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.

The matrix inverse of A is denoted as A^{−1}, and it is defined as the matrix such that

A^{−1}A = I_n. (2.21)

We can now solve Eq. 2.11 by the following steps:

Ax = b (2.22)
A^{−1}Ax = A^{−1}b (2.23)
I_n x = A^{−1}b (2.24)
x = A^{−1}b. (2.25)
equation. However, we cannot use the method of matrix inversion to find the solution.
So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:

AA^{−1} = I. (2.29)

For square matrices, the left inverse and right inverse are equal.
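The solution x = A^{−1}b can be sketched numerically. In this illustrative NumPy snippet (not part of the original text), note that in practice one calls a linear solver rather than explicitly forming the inverse:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([3., 5.])

A_inv = np.linalg.inv(A)
assert np.allclose(A_inv @ A, np.eye(2))   # A^{-1} A = I_n  (Eq. 2.21)

x = A_inv @ b                              # x = A^{-1} b    (Eq. 2.25)
assert np.allclose(A @ x, b)               # x indeed solves Ax = b

# Numerically preferable: solve the system without inverting A.
assert np.allclose(np.linalg.solve(A, b), x)
```

`solve` is both faster and more numerically stable than multiplying by an explicit inverse, which is why the inverse is mostly a theoretical tool.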
2.5 Norms

Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the L^p norm is given by

||x||_p = ( Σ_i |x_i|^p )^{1/p} (2.30)

for p ∈ R, p ≥ 1.
Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:

• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y) (the triangle inequality)
• ∀α ∈ R, f(αx) = |α| f(x)
which is analogous to the L^2 norm of a vector.

The dot product of two vectors can be rewritten in terms of norms. Specifically,

x⊤y = ||x||_2 ||y||_2 cos θ, (2.34)

where θ is the angle between x and y.
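The L^p definition and Eq. 2.34 are easy to verify numerically. An illustrative NumPy sketch (the helper name `lp_norm` is our own):

```python
import numpy as np

x = np.array([3., -4.])

# L^p norm straight from the definition (Eq. 2.30).
def lp_norm(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

assert np.isclose(lp_norm(x, 1), 7.0)                 # |3| + |-4|
assert np.isclose(lp_norm(x, 2), 5.0)                 # Euclidean length
assert np.isclose(lp_norm(x, 2), np.linalg.norm(x))   # library agrees

# Eq. 2.34: x^T y = ||x||_2 ||y||_2 cos(theta).
y = np.array([1., 0.])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(x @ y,
                  np.linalg.norm(x) * np.linalg.norm(y) * cos_theta)
```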
2.6 Special Kinds of Matrices and Vectors

Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices consist mostly of zeros and have non-zero entries only along the main diagonal. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for all i ≠ j. We have already seen one example of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector v. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. To compute diag(v)x, we only need to scale each element x_i by v_i. In other words, diag(v)x = v ⊙ x. Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, diag(v)^{−1} = diag([1/v_1, . . . , 1/v_n]⊤). In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.
Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply. For a non-square diagonal matrix D, the product Dx will involve scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.
A symmetric matrix is any matrix that is equal to its own transpose:

A = A⊤. (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if A is a matrix of distance measurements, with A_{i,j} giving the distance from point i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.
A unit vector is a vector with unit norm:

||x||_2 = 1. (2.36)
A vector x and a vector y are orthogonal to each other if x⊤y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In R^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

A⊤A = AA⊤ = I. (2.37)

This implies that A^{−1} = A⊤, so orthogonal matrices are of interest because their inverse is very cheap to compute.
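The special-matrix properties above are all cheap to check numerically. An illustrative NumPy sketch (the particular matrices are assumptions chosen for the example):

```python
import numpy as np

# diag(v) x scales element i of x by v_i: diag(v) x = v ⊙ x.
v = np.array([2., 3., 4.])
x = np.array([1., 1., 2.])
assert np.allclose(np.diag(v) @ x, v * x)

# Inverting a square diagonal matrix just inverts each diagonal entry.
assert np.allclose(np.linalg.inv(np.diag(v)), np.diag(1.0 / v))

# A distance-style matrix is symmetric: A = A^T.
A = np.array([[0., 1., 4.],
              [1., 0., 2.],
              [4., 2., 0.]])
assert np.array_equal(A, A.T)

# For an orthogonal matrix, A^T A = A A^T = I, so the inverse is the transpose.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 2-D rotation
assert np.allclose(R.T @ R, np.eye(2))
assert np.allclose(np.linalg.inv(R), R.T)
```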
2.7 Eigendecomposition

Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.

One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

Av = λv. (2.39)
[Figure: the same set of vectors shown before and after multiplication by A, with the eigenvectors v(1) and v(2) scaled by their eigenvalues λ1 and λ2.]
The eigendecomposition of A is given by

A = V diag(λ)V^{−1}. (2.40)

We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.

Not every matrix can be decomposed into eigenvalues and eigenvectors.
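For a matrix that does admit a real eigendecomposition (a symmetric one, for instance), Eq. 2.39 and the reconstruction A = V diag(λ)V^{−1} can be verified numerically. An illustrative NumPy sketch:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])          # symmetric, so its eigenvalues are real

lam, V = np.linalg.eig(A)         # columns of V are eigenvectors

# Each eigenvector satisfies A v = lambda v (Eq. 2.39).
for i in range(len(lam)):
    assert np.allclose(A @ V[:, i], lam[i] * V[:, i])

# Reconstruct A = V diag(lambda) V^{-1} (Eq. 2.40).
assert np.allclose(V @ np.diag(lam) @ np.linalg.inv(V), A)
```

For symmetric matrices, `np.linalg.eigh` is the preferable specialized routine, since it guarantees real output and orthonormal eigenvectors.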
2.8 Singular Value Decomposition

In Sec. 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However, the SVD is
more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as

A = V diag(λ)V^{−1}. (2.42)

The singular value decomposition is similar, except this time we will write A as a product of three matrices:

A = UDV⊤. (2.43)
Suppose that A is an m × n matrix. Then U is defined to be an m × m matrix, D to be an m × n matrix, and V to be an n × n matrix.

Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices. The matrix D is defined to be a diagonal matrix. Note that D is not necessarily square.

The elements along the diagonal of D are known as the singular values of the matrix A. The columns of U are known as the left-singular vectors. The columns of V are known as the right-singular vectors.

We can actually interpret the singular value decomposition of A in terms of the eigendecomposition of functions of A. The left-singular vectors of A are the eigenvectors of AA⊤. The right-singular vectors of A are the eigenvectors of A⊤A. The non-zero singular values of A are the square roots of the eigenvalues of A⊤A. The same is true for AA⊤.

Perhaps the most useful feature of the SVD is that we can use it to partially generalize matrix inversion to non-square matrices, as we will see in the next section.
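The factorization A = UDV⊤ and its relation to the eigendecompositions of AA⊤ and A⊤A can be checked on a small non-square matrix. Illustrative NumPy; note that `np.linalg.svd` returns the diagonal of D as a vector and returns V⊤ rather than V:

```python
import numpy as np

A = np.array([[1., 0., 1.],
              [0., 1., 1.]])            # m x n = 2 x 3, not square

U, s, Vt = np.linalg.svd(A)             # s holds the singular values

# Rebuild the m x n diagonal matrix D and recover A = U D V^T (Eq. 2.43).
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ D @ Vt, A)

# Singular values are square roots of the eigenvalues of A A^T (or A^T A).
eigvals = np.linalg.eigvalsh(A @ A.T)   # symmetric, so eigvalsh applies
assert np.allclose(np.sort(eigvals), np.sort(s ** 2))
```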
2.9 The Moore-Penrose Pseudoinverse

Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation

Ax = y (2.44)

by left-multiplying each side to obtain

x = By. (2.45)
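When A is not square, no exact inverse exists, but the Moore-Penrose pseudoinverse (computed via the SVD) plays the role of B. An illustrative NumPy sketch, with a tall example matrix of our own choosing:

```python
import numpy as np

# A tall matrix: more equations than unknowns.
A = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
y = np.array([1., 2., 3.])

B = np.linalg.pinv(A)        # Moore-Penrose pseudoinverse, shape 2 x 3

x = B @ y                    # least-squares solution to Ax ≈ y
# x minimizes ||Ax - y||_2; this particular system happens to be consistent.
assert np.allclose(A @ x, y)

# For this tall, full-rank A, B is a true left-inverse: BA = I.
assert np.allclose(B @ A, np.eye(2))
```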
2.10 The Trace Operator

The trace operator gives the sum of all of the diagonal entries of a matrix:

Tr(A) = Σ_i A_{i,i}. (2.48)
The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

||A||_F = √(Tr(AA⊤)). (2.49)
Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:

Tr(A) = Tr(A⊤). (2.50)

The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

Tr(ABC) = Tr(CAB) = Tr(BCA) (2.51)

or more generally,

Tr(∏_{i=1}^{n} F^{(i)}) = Tr(F^{(n)} ∏_{i=1}^{n−1} F^{(i)}). (2.52)
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ R^{m×n} and B ∈ R^{n×m}, we have

Tr(AB) = Tr(BA) (2.53)

even though AB ∈ R^{m×m} and BA ∈ R^{n×n}.

Another useful fact to keep in mind is that a scalar is its own trace: a = Tr(a).
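The trace identities above are straightforward to confirm numerically. An illustrative NumPy sketch with randomly generated matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# Frobenius norm via the trace (Eq. 2.49).
fro = np.sqrt(np.trace(A @ A.T))
assert np.isclose(fro, np.linalg.norm(A))   # matrix norm defaults to Frobenius

# Invariance to transposition (Eq. 2.50) and to cyclic permutation (Eq. 2.53),
# even though A @ B is 3 x 3 while B @ A is 4 x 4.
M = A @ B
assert np.isclose(np.trace(M), np.trace(M.T))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```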
2.11 The Determinant

The determinant of a square matrix, denoted det(A), is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation is volume-preserving.
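Both claims — that det(A) is the product of the eigenvalues, and that |det(A)| measures volume scaling — can be checked on small matrices. An illustrative NumPy sketch:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

# det(A) equals the product of the eigenvalues.
lam = np.linalg.eigvals(A)
assert np.isclose(np.linalg.det(A), np.prod(lam))   # 3 * 1 = 3

# A singular (determinant-zero) matrix collapses space along a dimension.
S = np.array([[1., 2.],
              [2., 4.]])                            # second row = 2 * first
assert np.isclose(np.linalg.det(S), 0.0)

# A rotation has determinant 1: it preserves volume.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.isclose(np.linalg.det(R), 1.0)
```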
2.12 Example: Principal Components Analysis

One simple machine learning algorithm, principal components analysis or PCA, can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of m points {x^(1), . . . , x^(m)} in R^n. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.

One way we can encode these points is to represent a lower-dimensional version of them. For each point x^(i) in R^n we will find a corresponding code vector c^(i) in R^l. If l is smaller than n, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, f(x) = c, and a decoding function that produces the reconstructed input given its code, x ≈ g(f(x)).

PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into R^n. Let g(c) = Dc, where D in R^{n×l} is the matrix defining the decoding.

Computing the optimal code for this decoder could be a difficult problem. To
keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other. (Note that D is still not technically “an orthogonal matrix” unless l = n.)
With the problem as described so far, many solutions are possible, because we can increase the scale of D_{:,i} if we decrease c_i proportionally for all points. To give the problem a unique solution, we constrain all of the columns of D to have unit norm.
In order to turn this basic idea into an algorithm we can implement, the first thing we need to do is figure out how to generate the optimal code point c* for each input point x. One way to do this is to minimize the distance between the input point x and its reconstruction, g(c*). We can measure this distance using a norm. In the principal components algorithm, we use the L^2 norm:

$$c^* = \underset{c}{\arg\min}\ \left\|x - g(c)\right\|_2. \tag{2.54}$$
We can switch to the squared L^2 norm instead of the L^2 norm itself, because both are minimized by the same value of c. This is because the L^2 norm is non-negative and the squaring operation is monotonically increasing for non-negative arguments.

$$c^* = \underset{c}{\arg\min}\ \left\|x - g(c)\right\|_2^2. \tag{2.55}$$
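The equivalence of the two objectives is easy to check numerically. In this sketch (not from the book; the point x, direction d, and toy one-dimensional decoder g(c) = cd are invented for illustration), both criteria pick the same minimizing code from a grid of candidates:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
d = np.array([0.5, 0.5, 0.7])           # toy decoder: g(c) = c * d
candidates = np.linspace(-10.0, 10.0, 2001)

# Evaluate the L^2 norm of the reconstruction error for each candidate code.
norms = np.array([np.linalg.norm(x - c * d) for c in candidates])
squared = norms ** 2

# Squaring is monotonically increasing on non-negative arguments,
# so the argmin is unchanged.
best_norm = candidates[np.argmin(norms)]
best_squared = candidates[np.argmin(squared)]
```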
The function being minimized simplifies to

$$\left(x - g(c)\right)^\top \left(x - g(c)\right) \tag{2.56}$$

(by the definition of the L^2 norm)

$$= x^\top x - x^\top g(c) - g(c)^\top x + g(c)^\top g(c) \tag{2.57}$$

(by the distributive property)

$$= x^\top x - 2x^\top g(c) + g(c)^\top g(c) \tag{2.58}$$

(because the scalar g(c)^⊤x is equal to the transpose of itself).
Minimizing this objective with respect to c, under the constraint that the columns of D are orthogonal with unit norm, yields the optimal code c = D^⊤x, so the encoding function is f(x) = D^⊤x. Using a further matrix multiplication, we can also define the PCA reconstruction operation:

$$r(x) = g\left(f(x)\right) = DD^\top x. \tag{2.67}$$
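To make the encode/decode pipeline concrete, here is a small NumPy sketch (not from the book). The orthonormal matrix D below is generated randomly purely for illustration; it is not the optimal PCA matrix, which is derived next:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 2

# Build a decoding matrix with orthonormal columns, as PCA requires
# (any n x l matrix satisfying D^T D = I_l works for this illustration).
D, _ = np.linalg.qr(rng.standard_normal((n, l)))

def f(x):
    """Encoding function: c = D^T x."""
    return D.T @ x

def g(c):
    """Decoding function: g(c) = D c."""
    return D @ c

def r(x):
    """PCA reconstruction: r(x) = D D^T x (Eq. 2.67)."""
    return g(f(x))

x = rng.standard_normal(n)
x_hat = r(x)
```

Because r(x) is the orthogonal projection of x onto the column space of D, reconstructing a reconstruction changes nothing: r(r(x)) = r(x).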
Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L^2 distance between inputs and reconstructions. However, since we will use the same matrix D to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

$$D^* = \underset{D}{\arg\min}\ \sqrt{\sum_{i,j}\left(x_j^{(i)} - r\!\left(x^{(i)}\right)_j\right)^2} \quad \text{subject to } D^\top D = I_l \tag{2.68}$$
To derive the algorithm for finding D*, we will start by considering the case where l = 1. In this case, D is just a single vector, d. Substituting Eq. 2.67 into Eq. 2.68 and simplifying D into d, the problem reduces to

$$d^* = \underset{d}{\arg\min}\ \sum_i \left\|x^{(i)} - dd^\top x^{(i)}\right\|_2^2 \quad \text{subject to } \|d\|_2 = 1. \tag{2.69}$$
The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value d^⊤x^(i) on the right of the vector d. It is more conventional to write scalar coefficients on the left of the vectors they operate on. We therefore usually write such a formula as

$$d^* = \underset{d}{\arg\min}\ \sum_i \left\|x^{(i)} - d^\top x^{(i)}\, d\right\|_2^2 \quad \text{subject to } \|d\|_2 = 1, \tag{2.70}$$
or, exploiting the fact that a scalar is its own transpose, as

$$d^* = \underset{d}{\arg\min}\ \sum_i \left\|x^{(i)} - x^{(i)\top} d\, d\right\|_2^2 \quad \text{subject to } \|d\|_2 = 1. \tag{2.71}$$

The reader should aim to become familiar with such cosmetic rearrangements.
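The three formulations in Eqs. 2.69 through 2.71 are algebraically identical, and a quick NumPy check confirms it (a sketch, not from the book; the point x and unit vector d are random choices made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
d = rng.standard_normal(4)
d /= np.linalg.norm(d)     # enforce the constraint ||d||_2 = 1

# d @ x is the scalar d^T x; multiplying d by it gives d d^T x.
v1 = np.linalg.norm(x - d * (d @ x)) ** 2   # Eq. 2.69: x - d d^T x
v2 = np.linalg.norm(x - (d @ x) * d) ** 2   # Eq. 2.70: x - d^T x d
v3 = np.linalg.norm(x - (x @ d) * d) ** 2   # Eq. 2.71: x - x^T d d
```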
At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let X in R^{m×n} be the matrix defined by stacking all of the vectors describing the points, such that X_{i,:} = x^{(i)⊤}. We can now rewrite the problem as

$$d^* = \underset{d}{\arg\min}\ \left\|X - X dd^\top\right\|_F^2 \quad \text{subject to } d^\top d = 1. \tag{2.72}$$
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:

$$\underset{d}{\arg\min}\ \left\|X - Xdd^\top\right\|_F^2 \tag{2.73}$$

$$= \underset{d}{\arg\min}\ \mathrm{Tr}\left(\left(X - Xdd^\top\right)^\top \left(X - Xdd^\top\right)\right) \tag{2.74}$$

(by Eq. 2.49)

$$= \underset{d}{\arg\min}\ \mathrm{Tr}\left(X^\top X - X^\top Xdd^\top - dd^\top X^\top X + dd^\top X^\top Xdd^\top\right) \tag{2.75}$$

$$= \underset{d}{\arg\min}\ \mathrm{Tr}\left(X^\top X\right) - \mathrm{Tr}\left(X^\top Xdd^\top\right) - \mathrm{Tr}\left(dd^\top X^\top X\right) + \mathrm{Tr}\left(dd^\top X^\top Xdd^\top\right) \tag{2.76}$$

$$= \underset{d}{\arg\min}\ -\mathrm{Tr}\left(X^\top Xdd^\top\right) - \mathrm{Tr}\left(dd^\top X^\top X\right) + \mathrm{Tr}\left(dd^\top X^\top Xdd^\top\right) \tag{2.77}$$

(because terms not involving d do not affect the arg min)

$$= \underset{d}{\arg\min}\ -2\,\mathrm{Tr}\left(X^\top Xdd^\top\right) + \mathrm{Tr}\left(dd^\top X^\top Xdd^\top\right) \tag{2.78}$$

(because we can cycle the order of the matrices inside a trace, Eq. 2.52)

$$= \underset{d}{\arg\min}\ -2\,\mathrm{Tr}\left(X^\top Xdd^\top\right) + \mathrm{Tr}\left(X^\top Xdd^\top dd^\top\right) \tag{2.79}$$

(using the same property again)
At this point, we re-introduce the constraint:

$$\underset{d}{\arg\min}\ -2\,\mathrm{Tr}\left(X^\top Xdd^\top\right) + \mathrm{Tr}\left(X^\top Xdd^\top dd^\top\right) \quad \text{subject to } d^\top d = 1 \tag{2.80}$$

$$= \underset{d}{\arg\min}\ -2\,\mathrm{Tr}\left(X^\top Xdd^\top\right) + \mathrm{Tr}\left(X^\top Xdd^\top\right) \quad \text{subject to } d^\top d = 1 \tag{2.81}$$

(due to the constraint)

$$= \underset{d}{\arg\min}\ -\mathrm{Tr}\left(X^\top Xdd^\top\right) \quad \text{subject to } d^\top d = 1 \tag{2.82}$$

$$= \underset{d}{\arg\max}\ \mathrm{Tr}\left(X^\top Xdd^\top\right) \quad \text{subject to } d^\top d = 1 \tag{2.83}$$

$$= \underset{d}{\arg\max}\ \mathrm{Tr}\left(d^\top X^\top Xd\right) \quad \text{subject to } d^\top d = 1 \tag{2.84}$$
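This constrained trace maximization is solved by the eigenvector of X^⊤X with the largest eigenvalue, a standard consequence of eigendecomposition (Sec. 2.7). A NumPy sketch (not from the book; the design matrix is synthetic and used only for illustration) confirms numerically that no random unit vector does better than the leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # synthetic design matrix, rows are points

def objective(d):
    """d^T X^T X d, the quantity maximized in Eq. 2.84 for a unit vector d."""
    return d @ X.T @ X @ d

# Candidate optimum: eigenvector of X^T X with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
d_star = eigvecs[:, -1]             # eigh returns eigenvalues in ascending order

# Compare against many random unit vectors: none should beat d_star.
for _ in range(1000):
    d = rng.standard_normal(5)
    d /= np.linalg.norm(d)
    assert objective(d) <= objective(d_star) + 1e-9
```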
Chapter 3

Probability and Information Theory

In this chapter, we describe probability theory and information theory.
Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty and axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

Probability theory is a fundamental tool of many disciplines of science and engineering. We provide this chapter to ensure that readers whose background is primarily in software engineering with limited exposure to probability theory can understand the material in this book.
While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability distribution.

If you are already familiar with probability theory and information theory, you may wish to skip all of this chapter except for Sec. 3.14, which describes the graphs we use to describe structured probabilistic models for machine learning. If you have absolutely no prior experience with these subjects, this chapter should be sufficient to successfully carry out deep learning research projects, but we do suggest that you consult an additional resource, such as Jaynes (2003).
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
3.1 Why Probability?

Many branches of computer science deal mostly with entities that are entirely deterministic and certain. A programmer can usually safely assume that a CPU will execute each machine instruction flawlessly. Errors in hardware do occur, but are rare enough that most software applications do not need to be designed to account for them. Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory.

This is because machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities. Uncertainty and stochasticity can arise from many sources. Researchers have made compelling arguments for quantifying uncertainty using probability since at least the 1980s. Many of the arguments presented here are summarized from or inspired by Pearl (1988).

Nearly all activities require some ability to reason in the presence of uncertainty. In fact, beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur.

There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.

2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.

3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.
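The Monty Hall scenario in item 2 is easy to simulate (a sketch, not from the book; the setup here only models the contestant's initial pick). Even though each trial is fully determined once the door assignments are fixed, from the contestant's limited point of view the win is a random event occurring about one third of the time:

```python
import random

random.seed(0)
trials = 100_000
wins = 0
for _ in range(trials):
    car = random.randrange(3)      # fixed in advance: deterministic to the host
    choice = random.randrange(3)   # but the contestant cannot observe it
    wins += (choice == car)

win_rate = wins / trials           # approximately 1/3
```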
In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule “Most birds fly” is cheap to develop and is broadly useful, while a rule of the form, “Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . . ” is expensive to develop, maintain and communicate, and after all of this effort is still very brittle and prone to failure.
Given that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all of the tools we want for artificial intelligence applications. Probability theory was originally developed to analyze the frequencies of events. It is easy to see how probability theory can be used to study events like drawing a certain hand of cards in a game of poker. These kinds of events are often repeatable. When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards) infinitely many times, then proportion p of the repetitions would result in that outcome. This kind of reasoning does not seem immediately applicable to propositions that are not repeatable. If a doctor analyzes a patient and says that the patient has a 40% chance of having the flu, this means something very different: we cannot make infinitely many replicas of the patient, nor is there any reason to believe that different replicas of the patient would present with the same symptoms yet have varying underlying conditions. In the case of the doctor diagnosing the patient, we use probability to represent a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient does not have the flu. The former kind of probability, related directly to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of certainty, is known as Bayesian probability.

If we list several properties that we expect common sense reasoning about uncertainty to have, then the only way to satisfy those properties is to treat Bayesian probabilities as behaving exactly the same as frequentist probabilities. For example, if we want to compute the probability that a player will win a poker game given that she has a certain set of cards, we use exactly the same formulas as when we compute the probability that a patient has a disease given that she has a certain symptom.
3.2 Random Variables

A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lower case letter in plain typeface, and the values it can take on with lower case script letters. For example, x_1 and x_2 are both possible values that the random variable x can take on. For vector-valued variables, we would write the random variable as x and one of its values as x. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are.

Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value. A continuous random variable is associated with a real value.
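As an illustration (a NumPy sketch, not from the book; the states and probabilities are invented for the example), a discrete random variable only becomes something we can sample once its set of states is paired with a distribution, and its states need not be numeric at all:

```python
import numpy as np

rng = np.random.default_rng(0)

# A discrete random variable: named states coupled with a distribution.
states = ["sunny", "cloudy", "rainy"]        # states with no numerical value
p = [0.6, 0.3, 0.1]                          # the distribution over those states
weather_samples = rng.choice(states, size=5, p=p)

# A continuous random variable is associated with a real value; here we
# sample from a normal distribution as one possible choice of distribution.
height_samples = rng.normal(loc=170.0, scale=10.0, size=5)
```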
3.3 Probability Distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions

A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability mass function, and the reader must infer which probability mass function to use based on the identity of the random variable, rather than the name of the function; P(x) is usually not the same as P(y).

The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x).

Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.

To be a probability mass function on a random variable x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

• Σ_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x (that is, make each of its states equally likely) by setting its probability mass function to

$$P(\mathrm{x} = x_i) = \frac{1}{k} \tag{3.1}$$

for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive
ositivee because k is a positiv
ositivee in
integer.
teger. We also see that
for all i. We can see that this fits the requirements for a probability mass function.
The value is positive bX X e1 integer.
ecause k is a positiv k We also see that
P (x = xi ) = = = 11,, (3.2)
k k
i i 1 k
P (x = x ) = = = 1, (3.2)
so the distribution is prop erly normalized. k
properly k
57
so the distribution is properly normalized.
X X
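The normalization check in Eqs. 3.1–3.2 can be reproduced numerically; a minimal sketch, where the choice k = 6 is arbitrary:

```python
# Uniform PMF over k states: P(x = x_i) = 1/k for all i (Eq. 3.1).
k = 6
pmf = {i: 1.0 / k for i in range(k)}

# Every probability lies in [0, 1], as the second property requires.
assert all(0.0 <= p <= 1.0 for p in pmf.values())

# The probabilities sum to k * (1/k) = 1 (Eq. 3.2), so the PMF is normalized.
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```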
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
3.3.2 Continuous Variables and Probability Density Functions

When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• ∫ p(x)dx = 1.

A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.

We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_{[a,b]} p(x)dx.

For an example of a probability density function corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The ";" notation means "parametrized by"; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
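The density u(x; a, b) described above can be written out and its normalization checked with a crude Riemann sum; a sketch, with an arbitrary interval [2, 5]:

```python
def uniform_pdf(x, a, b):
    """u(x; a, b): density 1/(b - a) inside [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Midpoint-rule approximation of the integral of u over [a, b].
a, b = 2.0, 5.0
n = 100000
dx = (b - a) / n
integral = sum(uniform_pdf(a + (i + 0.5) * dx, a, b) * dx for i in range(n))

assert abs(integral - 1.0) < 1e-9   # the density integrates to 1
assert uniform_pdf(10.0, a, b) == 0.0  # no mass outside [a, b]
```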
3.4 Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:

∀x ∈ x, P(x = x) = Σ_y P(x = x, y = y).    (3.3)
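The sum rule of Eq. 3.3 amounts to summing out the unwanted variable from a joint table; a minimal sketch on a made-up 2 × 3 joint distribution:

```python
# Hypothetical joint distribution P(x, y), stored as joint[x][y];
# the numbers are illustrative and sum to 1.
joint = [[0.10, 0.20, 0.10],
         [0.25, 0.15, 0.20]]

# Sum rule (Eq. 3.3): P(x = x) = sum over y of P(x = x, y = y).
marginal_x = [sum(row) for row in joint]

assert abs(marginal_x[0] - 0.40) < 1e-9
assert abs(marginal_x[1] - 0.60) < 1e-9
assert abs(sum(marginal_x) - 1.0) < 1e-9  # the marginal is itself normalized
```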
3.5 Conditional Probability

In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).    (3.5)

The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens.

It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.
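Eq. 3.5 can be applied directly to a joint table; a sketch using a made-up weather example (all numbers are hypothetical):

```python
# Conditional probability from a joint table (Eq. 3.5):
# P(y = y | x = x) = P(y = y, x = x) / P(x = x).
joint = {('rain', 'umbrella'): 0.3, ('rain', 'no_umbrella'): 0.1,
         ('sun', 'umbrella'): 0.1, ('sun', 'no_umbrella'): 0.5}

# P(x = rain), obtained by marginalizing over y (Eq. 3.3).
p_rain = sum(p for (x, _), p in joint.items() if x == 'rain')

p_umbrella_given_rain = joint[('rain', 'umbrella')] / p_rain
assert abs(p_umbrella_given_rain - 0.75) < 1e-9  # 0.3 / 0.4
```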
3.6 The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

P(x^(1), . . . , x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), . . . , x^(i−1)).    (3.6)

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability in Eq. 3.5.
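As a short worked instance of Eq. 3.6, applying the definition of conditional probability (Eq. 3.5) twice expands a three-variable joint distribution:

```latex
P(a, b, c) = P(a \mid b, c) \, P(b, c)
P(b, c)    = P(b \mid c) \, P(c)
P(a, b, c) = P(a \mid b, c) \, P(b \mid c) \, P(c)
```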
3.7 Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y).    (3.7)

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z).    (3.8)

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
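The factorization in Eq. 3.7 can be verified numerically for a joint distribution; a sketch in which the joint is built from two illustrative marginals and the factorization is then recovered from the joint alone:

```python
# An independent joint distribution: p(x, y) = p(x) p(y) by construction.
p_x = [0.2, 0.8]
p_y = [0.5, 0.3, 0.2]
joint = [[px * py for py in p_y] for px in p_x]

# Recover both marginals from the joint via the sum rule (Eq. 3.3) ...
marg_x = [sum(row) for row in joint]
marg_y = [sum(joint[i][j] for i in range(len(p_x))) for j in range(len(p_y))]

# ... and confirm Eq. 3.7 holds at every state pair.
for i in range(len(p_x)):
    for j in range(len(p_y)):
        assert abs(joint[i][j] - marg_x[i] * marg_y[j]) < 1e-12
```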
3.8 Expectation, Variance and Covariance

The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:

E_{x∼P}[f(x)] = Σ_x P(x)f(x),    (3.9)

while for continuous variables, it is computed with an integral:

E_{x∼p}[f(x)] = ∫ p(x)f(x)dx.    (3.10)
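The discrete case of Eq. 3.9 is a weighted sum; a minimal sketch with an illustrative distribution and f(x) = x²:

```python
# Discrete expectation (Eq. 3.9): E[f(x)] = sum_x P(x) f(x).
# A toy distribution over {0, 1, 2}; the probabilities are made up.
P = {0: 0.5, 1: 0.25, 2: 0.25}
f = lambda x: x ** 2

expectation = sum(P[x] * f(x) for x in P)
assert abs(expectation - 1.25) < 1e-12  # 0.5*0 + 0.25*1 + 0.25*4 = 1.25
```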
s. With probability 1/2, we choose the value of s to be 1. Otherwise, we choose the value of s to be −1. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.
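The construction above can be made exact by enumerating a small joint distribution; a sketch in which the support of x is an illustrative choice:

```python
# x uniform over {-2, -1, 1, 2}; s uniform over {-1, 1}, independent of x;
# y = s * x. Enumerate all 8 equally likely (x, y) outcomes exactly.
from itertools import product

states = [(x, s * x, 1.0 / 8) for x, s in product([-2, -1, 1, 2], [-1, 1])]

E_x = sum(p * x for x, y, p in states)
E_y = sum(p * y for x, y, p in states)
cov = sum(p * (x - E_x) * (y - E_y) for x, y, p in states)

assert abs(cov) < 1e-12                              # Cov(x, y) = 0 ...
assert all(abs(y) == abs(x) for x, y, p in states)   # ... yet |y| = |x| always
```

The covariance vanishes because the sign s is symmetric, even though y is completely determined by x up to sign: zero covariance does not imply independence.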
The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

Cov(x)_{i,j} = Cov(x_i, x_j).    (3.14)

The diagonal elements of the covariance give the variance:

Cov(x_i, x_i) = Var(x_i).    (3.15)
3.9 Common Probability Distributions

Several simple probability distributions are useful in many contexts in machine learning.
3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:

P(x = 1) = φ    (3.16)
P(x = 0) = 1 − φ    (3.17)
P(x = x) = φ^x (1 − φ)^{1−x}    (3.18)
E_x[x] = φ    (3.19)
Var_x(x) = φ(1 − φ)    (3.20)
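Eqs. 3.16–3.20 can be verified directly from the PMF in Eq. 3.18; a sketch with an arbitrary illustrative φ:

```python
# Bernoulli PMF (Eq. 3.18) and its moments (Eqs. 3.19-3.20).
phi = 0.3
pmf = lambda x: phi ** x * (1 - phi) ** (1 - x)

assert abs(pmf(1) - phi) < 1e-12          # Eq. 3.16
assert abs(pmf(0) - (1 - phi)) < 1e-12    # Eq. 3.17

mean = sum(x * pmf(x) for x in (0, 1))
var = sum((x - mean) ** 2 * pmf(x) for x in (0, 1))
assert abs(mean - phi) < 1e-12              # E[x] = phi
assert abs(var - phi * (1 - phi)) < 1e-12   # Var(x) = phi(1 - phi)
```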
3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.¹ The multinoulli distribution is

¹ "Multinoulli" is a term that was recently coined by Gustavo Lacerdo and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term "multinomial" to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case.
parametrized by a vector p ∈ [0, 1]^{k−1}, where p_i gives the probability of the i-th state. The final, k-th state's probability is given by 1 − 1⊤p. Note that we must constrain 1⊤p ≤ 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, etc. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.

The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. This is because they model discrete variables for which it is feasible to simply enumerate all of the states. When dealing with continuous variables, there are uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
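One common way to draw samples from a multinoulli distribution parametrized as above is to invert its cumulative distribution; a sketch, where the probability vector is illustrative and the final state receives the remaining mass:

```python
# Sample from a multinoulli by walking the cumulative probabilities.
import random

def sample_multinoulli(p, rng):
    """p: probabilities of states 0..k-2; state k-1 gets 1 - sum(p)."""
    u = rng.random()
    cumulative = 0.0
    for i, p_i in enumerate(p):
        cumulative += p_i
        if u < cumulative:
            return i
    return len(p)  # the final, k-th state

rng = random.Random(0)
p = [0.2, 0.5]                # the third state implicitly has probability 0.3
counts = [0, 0, 0]
for _ in range(20000):
    counts[sample_multinoulli(p, rng)] += 1

freqs = [c / 20000 for c in counts]
assert abs(freqs[0] - 0.2) < 0.02
assert abs(freqs[1] - 0.5) < 0.02
assert abs(freqs[2] - 0.3) < 0.02
```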
3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

N(x; µ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²)) (x − µ)²).    (3.21)

See Fig. 3.1 for a plot of the density function.

The two parameters µ ∈ R and σ ∈ (0, ∞) control the normal distribution. The parameter µ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ².

When we evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision or inverse variance of the distribution:

N(x; µ, β⁻¹) = √(β/(2π)) exp(−(1/2) β (x − µ)²).    (3.22)

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.

First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in
[Plot of p(x): the maximum is at x = µ and the inflection points are at x = µ ± σ.]

Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ²) exhibits a classic "bell curve" shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.

Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to Sec. 19.4.2.

The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ:

N(x; µ, Σ) = √(1/((2π)^n det(Σ))) exp(−(1/2) (x − µ)⊤ Σ⁻¹ (x − µ)).    (3.23)

The parameter µ still gives the mean of the distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for
many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β:

N(x; µ, β⁻¹) = √(det(β)/(2π)^n) exp(−(1/2) (x − µ)⊤ β (x − µ)).    (3.24)

We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
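In the univariate case, the variance form of Eq. 3.21 and the precision form of Eq. 3.22 are the same density; a sketch that evaluates both at a few points (the parameter values are arbitrary):

```python
import math

def normal_pdf_var(x, mu, sigma2):
    """Eq. 3.21: normal density parametrized by variance sigma2."""
    return math.sqrt(1.0 / (2 * math.pi * sigma2)) * math.exp(
        -((x - mu) ** 2) / (2 * sigma2))

def normal_pdf_prec(x, mu, beta):
    """Eq. 3.22: the same density parametrized by precision beta = 1/sigma2."""
    return math.sqrt(beta / (2 * math.pi)) * math.exp(
        -0.5 * beta * (x - mu) ** 2)

mu, sigma2 = 1.0, 4.0
beta = 1.0 / sigma2
for x in (-3.0, 0.0, 1.0, 2.5):
    assert abs(normal_pdf_var(x, mu, sigma2)
               - normal_pdf_prec(x, mu, beta)) < 1e-12

# At the peak x = mu the density equals 1 / sqrt(2 * pi * sigma2).
assert abs(normal_pdf_var(mu, mu, sigma2)
           - 1.0 / math.sqrt(2 * math.pi * sigma2)) < 1e-12
```

The precision form avoids the division by σ² inside the exponent, which is the efficiency point made in the text.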
3.9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

p(x; λ) = λ 1_{x≥0} exp(−λx).    (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution

Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ).    (3.26)
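Both densities can be written down directly from Eqs. 3.25 and 3.26 and checked for normalization with a crude Riemann sum; the parameter values below are arbitrary:

```python
import math

def exponential_pdf(x, lam):
    """Eq. 3.25; the indicator 1_{x>=0} zeroes out negative x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def laplace_pdf(x, mu, gamma):
    """Eq. 3.26: a sharp peak of mass at the arbitrary point mu."""
    return (1.0 / (2 * gamma)) * math.exp(-abs(x - mu) / gamma)

def riemann(f, lo, hi, n=100000):
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) * dx for i in range(n))

# Both integrate to (almost exactly) 1 over a wide enough interval.
assert abs(riemann(lambda x: exponential_pdf(x, 2.0), 0.0, 30.0) - 1.0) < 1e-3
assert abs(riemann(lambda x: laplace_pdf(x, 1.0, 0.5), -30.0, 30.0) - 1.0) < 1e-3
assert exponential_pdf(-1.0, 2.0) == 0.0  # no mass on negative values
```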
3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):

p(x) = δ(x − µ).    (3.27)
3.9.6 Mixtures of Distributions

It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:

P(x) = Σ_i P(c = i) P(x | c = i)    (3.29)

where P(c) is the multinoulli distribution over component identities.

We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.

The mixture model is one simple strategy for combining probability distributions to create a richer distribution. In Chapter 16, we explore the art of building complex probability distributions from simple ones in more detail.
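Sampling from Eq. 3.29 proceeds in two stages, first the component identity c, then x given c; a sketch with two Gaussian components whose weights, means, and standard deviations are made up for illustration:

```python
import random

rng = random.Random(0)
weights = [0.3, 0.7]                 # P(c = i), a multinoulli over components
means, stds = [-2.0, 3.0], [0.5, 1.0]

def sample_mixture():
    c = 0 if rng.random() < weights[0] else 1  # sample the component identity
    return rng.gauss(means[c], stds[c])        # then sample x ~ P(x | c)

samples = [sample_mixture() for _ in range(50000)]

# The mixture mean is the weighted mean of the component means:
# 0.3 * (-2) + 0.7 * 3 = 1.5.
expected_mean = sum(w * m for w, m in zip(weights, means))
empirical_mean = sum(samples) / len(samples)
assert abs(empirical_mean - expected_mean) < 0.1
```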
3.10 Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models.

One of these functions is the logistic sigmoid:

σ(x) = 1/(1 + exp(−x)).    (3.30)

The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli distribution because its range is (0, 1), which lies within the valid range of values for the φ parameter. See Fig. 3.3 for a graph of the sigmoid function. The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.
[Plot of σ(x) over x ∈ [−10, 10].]

Figure 3.3: The logistic sigmoid function.

[Plot of ζ(x) over x ∈ [−10, 10].]

Figure 3.4: The softplus function.
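The sigmoid of Eq. 3.30 and the softplus ζ(x) = log(1 + exp(x)) of Fig. 3.4 can be implemented so that they stay numerically stable for large |x|; a sketch of one standard way to do this:

```python
import math

def sigmoid(x):
    """Eq. 3.30, arranged so exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # for x < 0, use exp(x) / (1 + exp(x))
    return z / (1.0 + z)

def softplus(x):
    """zeta(x) = log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

assert abs(sigmoid(0.0) - 0.5) < 1e-12
assert 0.0 < sigmoid(-30.0) < 1e-12          # saturates near 0 ...
assert 1.0 - 1e-12 < sigmoid(30.0) < 1.0     # ... and near 1
assert abs(softplus(0.0) - math.log(2.0)) < 1e-12
assert abs(softplus(50.0) - 50.0) < 1e-12    # softplus(x) approaches x
```

The saturation checks mirror the flat regions visible in Fig. 3.3: far from zero, the sigmoid barely responds to changes in its input.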
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
3.11 Bayes’ Rule

We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity using Bayes’ rule:

    P(x | y) = P(x) P(y | x) / P(y).    (3.42)

Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = Σ_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).

Bayes’ rule is straightforward to derive from the definition of conditional probability, but it is useful to know the name of this formula since many texts refer to it by name. It is named after the Reverend Thomas Bayes, who first discovered a special case of the formula. The general version presented here was independently discovered by Pierre-Simon Laplace.
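As a concrete illustration of Bayes’ rule (Eq. 3.42) with invented numbers (not an example from the book): suppose a diagnostic test detects a disease with probability 0.99 when it is present, fires falsely with probability 0.05, and the disease has 1% prevalence. The evidence P(y) is obtained by summing over x, exactly as described above:

```python
# Hypothetical numbers for illustration only.
p_sick = 0.01                 # P(x = sick)
p_pos_given_sick = 0.99       # P(y = positive | x = sick)
p_pos_given_healthy = 0.05    # P(y = positive | x = healthy)

# Evidence via marginalization: P(y) = sum_x P(y | x) P(x)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Bayes' rule, Eq. 3.42: P(x | y) = P(x) P(y | x) / P(y)
p_sick_given_pos = p_sick * p_pos_given_sick / p_pos
print(round(p_sick_given_pos, 4))  # 0.1667
```

Despite the accurate test, a positive result implies only about a 17% chance of disease, because the prior P(x) is small.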
3.12 Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of a branch of mathematics known as measure theory. Measure theory is beyond the scope of this textbook, but we can briefly sketch some of the issues that measure theory is employed to resolve.

In Sec. 3.3.2, we saw that the probability of a continuous vector-valued x lying in some set S is given by the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For example, it is possible to construct two sets S₁ and S₂ such that p(x ∈ S₁) + p(x ∈ S₂) > 1 but S₁ ∩ S₂ = ∅.² These sets are generally constructed making very heavy use of the infinite precision of real numbers, for example by making fractal-shaped sets or sets that are defined by transforming the set of rational numbers. One of the key contributions of measure theory is to provide a characterization of the set of sets that we can compute the probability of without encountering paradoxes. In this book, we only integrate over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.

For our purposes, measure theory is more useful for describing theorems that apply to most points in Rⁿ but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small. Such a set is said to have “measure zero.” We do not formally define this concept in this textbook. However, it is useful to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, within R², a line has measure zero, while a filled polygon has positive measure. Likewise, an individual point has measure zero. Any union of countably many sets that each have measure zero also has measure zero (so the set of all the rational numbers
has measure zero, for instance).

Another useful term from measure theory is “almost everywhere.” A property that holds almost everywhere holds throughout all of space except for on a set of measure zero. Because the exceptions occupy a negligible amount of space, they can be safely ignored for many applications. Some important results in probability theory hold for all discrete values but only hold “almost everywhere” for continuous values.

Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables, x and y, such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)). This is actually not the case.

²The Banach-Tarski theorem provides a fun example of such sets.
As a simple example, suppose we have scalar random variables x and y. Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y) then p_y will be 0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

    ∫ p_y(y) dy = 1/2,    (3.43)

which violates the definition of a probability distribution.

This common mistake is wrong because it fails to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x space may have different volume in y space.

To see how to correct the problem, we return to the scalar case. We need to preserve the property

    |p_y(g(x)) dy| = |p_x(x) dx|.    (3.44)

Solving from this, we obtain

    p_y(y) = p_x(g⁻¹(y)) |∂x/∂y|    (3.45)

or equivalently

    p_x(x) = p_y(g(x)) |∂g(x)/∂x|.    (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix—the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

    p_x(x) = p_y(g(x)) |det(∂g(x)/∂x)|.    (3.47)
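The failure of the naive rule and the fix of Eq. 3.45 can be checked numerically for the example above (y = x/2, x ∼ U(0, 1), so g⁻¹(y) = 2y and |∂x/∂y| = 2). The sketch below (ours, not code from the book) integrates both candidate densities with a simple midpoint rule:

```python
def p_x(x):
    """Density of U(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def p_y_naive(y):
    """The incorrect rule p_y(y) = p_x(2y): no volume correction."""
    return p_x(2.0 * y)

def p_y(y):
    """Eq. 3.45: p_y(y) = p_x(g^{-1}(y)) * |dx/dy| = p_x(2y) * 2."""
    return p_x(2.0 * y) * 2.0

def integrate(f, a, b, n=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

print(integrate(p_y_naive, 0.0, 1.0))  # ≈ 0.5: not a valid density
print(integrate(p_y, 0.0, 1.0))        # ≈ 1.0: integrates to one
```

The factor of 2 is exactly the |∂x/∂y| term: g contracts the unit interval to [0, 1/2], so the density must be scaled up to conserve probability mass.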
3.13 Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally invented to study sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. In this context, information theory tells how to design optimal codes and calculate the expected length of messages sampled from
specific probability distributions using various encoding schemes. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply. This field is fundamental to many areas of electrical engineering and computer science. In this textbook, we mostly use a few key ideas from information theory to characterize probability distributions or quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

We would like to quantify information in a way that formalizes this intuition. Specifically,
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
In order to satisfy all three of these properties, we define the self-information of an event x = x to be

    I(x) = − log P(x).    (3.48)

In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.

When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
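Eq. 3.48 and the nats/bits rescaling can be sketched in a few lines (an illustrative snippet, not the book’s code):

```python
import math

def self_information_nats(p):
    """I(x) = -log P(x), in nats (natural log, as used in this book)."""
    return -math.log(p)

def self_information_bits(p):
    """The same quantity in bits: a rescaling by a factor of log 2."""
    return -math.log2(p)

# Two independent fair coin flips landing heads have probability 1/4.
# Information is additive over independent events: twice one flip.
print(self_information_bits(0.5))   # 1.0
print(self_information_bits(0.25))  # 2.0

# One nat is the information of an event with probability 1/e.
print(self_information_nats(1.0 / math.e))  # ≈ 1.0
```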
Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy. On the horizontal axis, we plot p, the probability of a binary random variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

    H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],    (3.49)

also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See Fig. 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

    D_KL(P‖Q) = E_{x∼P}[log (P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)].    (3.50)
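For discrete distributions represented as lists of probabilities, Eqs. 3.49 and 3.50 can be computed directly (an illustrative snippet, not the book’s code), using the convention that terms with P(x) = 0 contribute nothing:

```python
import math

def entropy_nats(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), in nats (Eq. 3.49)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)) (Eq. 3.50)."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

# The uniform distribution has maximal entropy, log(4) nats here.
print(entropy_nats(uniform))                         # ≈ 1.386
print(entropy_nats(skewed) < entropy_nats(uniform))  # True

# The KL divergence is non-negative and, as Fig. 3.6 stresses, asymmetric.
print(kl_divergence(skewed, uniform))  # > 0
print(kl_divergence(skewed, uniform) != kl_divergence(uniform, skewed))  # True
```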
3.14 Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very large number of random variables. Often, these probability distributions involve direct interactions between relatively few variables. Using a single function to describe the entire joint probability distribution can be very inefficient (both computationally and statistically).

Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: a, b and c. Suppose that a influences the value of b and b influences the value of c, but that a and c are independent given b. We can represent the probability distribution over all three
Figure 3.6: The KL divergence is asymmetric. Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x). We have the choice of minimizing either D_KL(p‖q) or D_KL(q‖p). We illustrate the effect of this choice using a mixture of two Gaussians for p, and a single Gaussian for q. The choice of which direction of the KL divergence to use is problem-dependent. Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability, while other applications require an approximation that rarely places high probability anywhere that the true distribution places low probability. The choice of the direction of the KL divergence reflects which of these considerations takes priority for each application. (Left) The effect of minimizing D_KL(p‖q). In this case, we select a q that has high probability where p has high probability. When p has multiple modes, q chooses to blur the modes together, in order to put high probability mass on all of them. (Right) The effect of minimizing D_KL(q‖p). In this case, we select a q that has low probability where p has low probability. When p has multiple modes that are sufficiently widely separated, as in this figure, the KL divergence is minimized by choosing a single mode, in order to avoid putting probability mass in the low-probability areas between modes of p. Here, we illustrate the outcome when q is chosen to emphasize the left mode. We could also have achieved an equal value of the KL divergence by choosing the right mode. If the modes are not separated by a sufficiently strong low probability region, then this direction of the KL divergence can still choose to blur the modes.
variables as a product of probability distributions over two variables:

    p(a, b, c) = p(a) p(b | a) p(c | b).
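A quick sketch with hypothetical conditional probability tables (the numbers are invented purely for illustration) confirms that such a product of normalized factors is itself a normalized joint distribution:

```python
from itertools import product

# Hypothetical conditional probability tables for binary a, b, c,
# illustrating the factorization p(a, b, c) = p(a) p(b|a) p(c|b).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # [a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # [b][c]

def p_joint(a, b, c):
    """Joint probability from the three smaller factors."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The eight entries of the joint table sum to 1: multiplying
# normalized factors yields a normalized distribution.
total = sum(p_joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(total)  # ≈ 1.0
```

Note the efficiency gain the section describes: the three tables hold 2 + 4 + 4 numbers, whereas a raw joint table over n binary variables needs 2ⁿ entries.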
Chapter 4

Numerical Computation

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula providing a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.
4.1 Overflow and Underflow

The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero (some software
en
environmen
vironmen
vironments ts will raise exceptions when this occurs, others will return a result
with a placeholder not-a-n not-a-num um
umb ber value) or taking the logarithm of zero (this is
en vironmen ts will
usually treated as −∞, whic raise exceptions
which h then when
becomes thisnot-a-n
occurs,um
not-a-num umbothers
ber if will
it is return
used for a result
many
with a placeholder
further arithmetic op not-a-n um
operations).
erations). b er value) or taking the logarithm of zero (this is
usually treated as , which then becomes not-a-number if it is used for many
Another highly damaging form of numerical error is overflow. Overflo Overflow w occurs
further arithmetic −∞ operations).
when num umbers
bers with large magnitude are appro ximated as ∞ or −∞. Further
approximated
Anotherwill
arithmetic highly damaging
usually changeform theseofinfinite
numerical error
values intois overflow
not-a-num
not-a-number. Overflo w occurs
ber values.
when numbers with large magnitude are approximated as or . Further
One example of a function that must be stabilized against underflow and
arithmetic will usually change these infinite values into not-a-num ∞ ber
−∞ alues.
v
overflo
erflow w is the softmax function. The softmax function is often used to predict the
One example
probabilities asso of a function
associated
ciated that must distribution.
with a multinoulli be stabilizedThe against
softmaxunderflow
function andis
odefined
verflowto is bthe
e softmax function. The softmax function is often used to predict the
probabilities associated with a multinoulli distribution. exp(
exp(x xi ) The softmax function is
softmax(x)i = Pn . (4.1)
defined to be j=1 exp(
exp(x xj )
exp(x )
Consider what happens when all of the x_i are equal to some constant c. Analytically, we can see that all of the outputs should be equal to 1/n. Numerically, this may not occur when c has large magnitude. If c is very negative, then exp(c) will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When c is very large and positive, exp(c) will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating softmax(z) where z = x − max_i x_i. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting max_i x_i results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.
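This stabilization is short to sketch in code. The following NumPy snippet is our own illustration, not the book's code, and the name `stable_softmax` is ours:

```python
import numpy as np

def stable_softmax(x):
    """Softmax evaluated as softmax(z), where z = x - max_i x_i."""
    z = x - np.max(x)        # largest argument to exp is now 0: no overflow
    e = np.exp(z)            # at least one entry equals exp(0) = 1
    return e / e.sum()       # denominator >= 1: no division by zero

# With very large equal inputs the naive formula overflows, but the
# shifted version recovers the analytic answer of 1/n:
x = np.array([1000.0, 1000.0, 1000.0])
print(stable_softmax(x))     # [1/3, 1/3, 1/3]
```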
There is still one small problem. Underflow in the numerator can still cause the expression as a whole to evaluate to zero. This means that if we implement log softmax(x) by first running the softmax subroutine then passing the result to the log function, we could erroneously obtain −∞. Instead, we must implement a separate function that calculates log softmax in a numerically stable way. The log softmax function can be stabilized using the same trick as we used to stabilize the softmax function.
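The same shift gives a stable log softmax directly, since log softmax(x)_i = z_i − log Σ_j exp(z_j) with z = x − max_i x_i. A minimal NumPy sketch of our own (the name `log_softmax` is ours, not the book's):

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log softmax via the same shift used for softmax."""
    z = x - np.max(x)                    # stabilizing shift
    return z - np.log(np.exp(z).sum())   # the sum is >= 1, so log never sees 0

# Computing log(softmax(x)) naively underflows to 0 and then yields -inf;
# the fused version stays finite:
x = np.array([-1000.0, 0.0])
print(log_softmax(x))
```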
For the most part, we do not explicitly detail all of the numerical considerations involved in implementing the various algorithms described in this book. Developers of low-level libraries should keep numerical issues in mind when implementing deep learning algorithms. Most readers of this book can simply rely on low-level libraries that provide stable implementations. In some cases, it is possible to implement a new algorithm and have the new implementation automatically stabilized.
CHAPTER 4. NUMERICAL COMPUTATION
4.2 Poor Conditioning

Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.
Consider the function f(x) = A^{-1}x. When A ∈ R^{n×n} has an eigenvalue decomposition, its condition number is

    \max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|.    (4.2)
This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input.

This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
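A small NumPy experiment (our own illustration, not from the book) makes the amplification visible: with a condition number of 10^8, a relative input error of 10^{-6} becomes a relative output error of 10^2:

```python
import numpy as np

# A matrix with eigenvalues 1e4 and 1e-4: condition number (Eq. 4.2) is 1e8.
A = np.diag([1e4, 1e-4])
lam = np.abs(np.linalg.eigvals(A))
cond = lam.max() / lam.min()

x = np.array([1.0, 0.0])
dx = np.array([0.0, 1e-6])           # tiny perturbation of the input

y = np.linalg.solve(A, x)            # f(x) = A^{-1} x
y_pert = np.linalg.solve(A, x + dx)

rel_in = np.linalg.norm(dx) / np.linalg.norm(x)
rel_out = np.linalg.norm(y_pert - y) / np.linalg.norm(y)
print(cond, rel_out / rel_in)        # the input error is amplified by ~1e8
```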
4.3 Gradient-Based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).
The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a superscript ∗. For example, we might say x∗ = arg min f(x).
[Figure: gradient descent illustrated on f(x) = ½x², with f′(x) = x. For x < 0, we have f′(x) < 0, so we can decrease f by moving rightward; for x > 0, we have f′(x) > 0, so we can decrease f by moving leftward.]

[Figure 4.3: plot titled "Approximate minimization".]

Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
Critical points are points where every element of the gradient is equal to zero.

The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, the directional derivative is the derivative of the function f(x + αu) with respect to α, evaluated at α = 0. Using the chain rule, we can see that \frac{\partial}{\partial \alpha} f(x + \alpha u) = u^\top \nabla_x f(x).

To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

    \min_{u,\, u^\top u = 1} u^\top \nabla_x f(x)    (4.3)
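This motivates gradient descent: repeatedly propose a new point x′ = x − ε ∇_x f(x) for a small step size ε. A minimal NumPy sketch of our own (not the book's code):

```python
import numpy as np

# Minimize f(x) = 1/2 ||x||^2, whose gradient is simply x.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x

x = np.array([2.0, -1.5])
eps = 0.1                      # step size (learning rate)
for _ in range(200):
    x = x - eps * grad_f(x)    # step in the direction of steepest descent

print(x, f(x))                 # x is driven toward the minimum at the origin
```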
4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices

Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function f : R^m → R^n, then the Jacobian matrix J ∈ R^{n×m} of f is defined such that J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i.
We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function f : R^n → R, the derivative with respect to x_i of the derivative of f with respect to x_j is denoted as \frac{\partial^2}{\partial x_i \partial x_j} f. In a single dimension, we can denote \frac{d^2}{dx^2} f by f''(x). The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second derivative as measuring curvature. Suppose we have a quadratic function (many functions that arise in practice are not quadratic but can be approximated well as quadratic, at least locally). If such a function has a second derivative of zero, then there is no curvature. It is a perfectly flat line, and its value can be predicted using only the gradient. If the gradient is 1, then we can make a step of size ε along the negative gradient, and the cost function will decrease by ε. If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than ε. Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than ε. See Fig. 4.4.
[Figure 4.4: three panels plotting f(x) against x, showing negative curvature, no curvature, and positive curvature.]
eigenvectors. The second derivative in a specific direction represented by a unit vector d is given by d^\top H d. When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue. For other directions of d, the directional second derivative is a weighted average of all of the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle with d receiving more weight. The maximum eigenvalue determines the maximum second derivative and the minimum eigenvalue determines the minimum second derivative.
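These facts are easy to check numerically. The sketch below (our own NumPy illustration) evaluates d^\top H d for an eigenvector and for an arbitrary unit vector of a small symmetric Hessian:

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # a symmetric Hessian
eigvals, eigvecs = np.linalg.eigh(H)    # eigenvalues ascending, orthonormal eigenvectors

# When d is an eigenvector, d^T H d equals the corresponding eigenvalue:
d_eig = eigvecs[:, 0]
print(d_eig @ H @ d_eig, eigvals[0])    # these agree

# For any other unit vector, d^T H d lies between the extreme eigenvalues:
v = np.array([0.6, 0.8])                # an arbitrary unit vector
second_deriv = v @ H @ v
print(eigvals[0] <= second_deriv <= eigvals[1])   # True
```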
The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function f(x) around the current point x^{(0)}:

    f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2} (x - x^{(0)})^\top H (x - x^{(0)}),    (4.8)

where g is the gradient and H is the Hessian at x^{(0)}. If we use a learning rate of ε, then the new point x will be given by x^{(0)} − εg. Substituting this into our approximation, we obtain

    f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2} \epsilon^2 g^\top H g.    (4.9)

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When g^\top H g is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f forever. In practice, the Taylor series is unlikely to remain accurate for large ε, so one must resort to more heuristic choices of ε in this case. When g^\top H g is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

    \epsilon^* = \frac{g^\top g}{g^\top H g}.    (4.10)

In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue λ_max, then this optimal step size is given by 1/λ_max. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.
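Equation 4.10 can be checked numerically. For a pure quadratic the second-order Taylor series is exact, so the ε* given by the formula should match the best step size found by a line search. The sketch below is our own illustration:

```python
import numpy as np

# A quadratic f(x) = 1/2 x^T H x, whose gradient at x is H x.
H = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ H @ x

x0 = np.array([1.0, 1.0])
g = H @ x0                          # gradient at x0

eps_star = (g @ g) / (g @ H @ g)    # Eq. 4.10: optimal step size

# Line search over a grid: for a quadratic, eps_star should be the best step.
eps_grid = np.linspace(0.0, 1.0, 1001)
values = [f(x0 - e * g) for e in eps_grid]
print(eps_star, eps_grid[np.argmin(values)])   # the two agree to grid precision
```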
The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. Recall that on a critical point, f′(x) = 0. When f′′(x) > 0, this means that f′(x) increases as we move to the right, and f′(x) decreases as we move to the left. This means f′(x − ε) < 0 and f′(x + ε) > 0 for small enough ε. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when f′(x) = 0 and f′′(x) > 0, we can conclude that x is a local minimum. Similarly, when f′(x) = 0 and f′′(x) < 0, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when f′′(x) = 0, the test is inconclusive. In this case x may be a saddle point, or a part of a flat region.
In multiple dimensions, we need to examine all of the second derivatives of the function. Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where ∇_x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that x is a local maximum on one cross section of f but a local minimum on another cross section. See Fig. 4.5 for an example. Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is inconclusive whenever all of the non-zero eigenvalues have the same sign, but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.
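The multidimensional test is short to sketch in code (our own NumPy illustration; the function name `classify_critical_point` is ours):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of the Hessian H."""
    eig = np.linalg.eigvalsh(H)          # H is symmetric, so eigvalsh applies
    if np.all(eig > tol):
        return "local minimum"           # positive definite
    if np.all(eig < -tol):
        return "local maximum"           # negative definite
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"            # eigenvalues of both signs
    return "inconclusive"                # some zero eigenvalues, rest same sign

# Hessian of f(x) = x1^2 - x2^2 at the critical point (0, 0):
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])
print(classify_critical_point(H_saddle))   # saddle point
```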
In multiple dimensions, there can be a wide variety of different second derivatives at a single point, because there is a different second derivative for each direction. The condition number of the Hessian measures how much the second derivatives vary. When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer. It also makes it difficult to choose a good step size. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature. See Fig. 4.6 for an example.

This issue can be resolved by using information from the Hessian matrix to guide the search.
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is f(x) = x_1^2 − x_2^2. Along the axis corresponding to x_1, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to x_2, the function curves downward. This direction is an eigenvector of the Hessian with negative eigenvalue. The name "saddle point" derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. In more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. We can think of a saddle point with both signs of eigenvalues as being a local maximum within one cross section and a local minimum within another cross section.
[Figure 4.6: a plot over axes x1 and x2 illustrating gradient descent on a function with a poorly conditioned Hessian.]
The simplest method for doing so is known as Newton's method. Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^{(0)}:

    f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top \nabla_x f(x^{(0)}) + \frac{1}{2} (x - x^{(0)})^\top H(f)(x^{(0)}) (x - x^{(0)}).    (4.11)

If we then solve for the critical point of this function, we obtain:

    x^* = x^{(0)} - H(f)(x^{(0)})^{-1} \nabla_x f(x^{(0)}).    (4.12)
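For a quadratic function, Eq. 4.12 jumps to the minimum in a single step, even when the Hessian is poorly conditioned and gradient descent would zigzag. A minimal NumPy sketch of our own:

```python
import numpy as np

# f(x) = 1/2 x^T H x with a poorly conditioned Hessian (condition number 1000).
H = np.diag([1.0, 1000.0])
grad = lambda x: H @ x               # gradient of f
x0 = np.array([1.0, 1.0])

# Newton step (Eq. 4.12): x* = x0 - H^{-1} grad f(x0)
x_star = x0 - np.linalg.solve(H, grad(x0))
print(x_star)                        # the exact minimum at the origin, in one step
```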
and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.
Perhaps the most successful field of specialized optimization is convex optimization. Convex optimization algorithms are able to provide many more guarantees by making stronger restrictions. Convex optimization algorithms are applicable only to convex functions—functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima. However, most problems in deep learning are difficult to express in terms of convex optimization. Convex optimization is used only as a subroutine of some deep learning algorithms. Ideas from the analysis of convex optimization algorithms can be useful for proving the convergence of deep learning algorithms. However, in general, the importance of convex optimization is greatly diminished in the context of deep learning. For more information about convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).
4.4 Constrained Optimization

Sometimes we wish not only to maximize or minimize a function f(x) over all possible values of x. Instead we may wish to find the maximal or minimal value of f(x) for values of x in some set S. This is known as constrained optimization. Points x that lie within the set S are called feasible points in constrained optimization terminology.

We often wish to find a solution that is small in some sense. A common approach in such situations is to impose a norm constraint, such as ||x|| ≤ 1.
One hsimple
approac approac
approach
in such h to constrained
situations is to imposeoptimization
a norm constrain is simply
t, suchto as
mo x gradient
modify
dify 1.
descentt taking the constraint into account. If we use a small constant step size ,
descen
Onemake
we can simple approac
gradien
gradient h to constrained
t descent steps, thenoptimization
pro ject the is
project simply
result bac
back to mo||dify
k into
|| ≤
S. Ifgradient
we use
descen t
a line searctaking
search, the constraint
h, we can search only ov into account. If we use a small
er step sizes that yield new x poin
over constant step
oints size ,
Sts that are
we can make
feasible, or wegradien
can pro t descent
ject each
project steps,
pointthen pro ject
on the line the
back result
into bac
thekconstraint
into . If we use
region.
a line searc
When h, wethis
possible, canmetho
searchd only
method can bov e er stepmore
made that yield
sizes efficient new
by pro x poin
projecting
jecting ts that
the are
gradient
feasible,
in
into or we can
to the tangen
tangent pro ject
t space each
of the point region
feasible on the blineeforeback into the
taking the step
constraint region.
or beginning
When p ossible, this metho
the line search (Rosen, 1960). d can b e made more efficient by pro jecting the gradient
into the tangent space of the feasible region before taking the step or beginning
the A more
line searchsophisticated
(Rosen, 1960 approach
). is to design a different, unconstrained opti-
mization problem whose solution can be conv converted
erted in into
to a solution to the original,
A more sophisticated approach is to
constrained optimization problem. For example, if we wan design a different,
wantt tounconstrained
minimize f( x)opti-for
mization2 problem whose
x ∈ R with x constrained to ha solution
hav can b e conv erted2 in to a solution
ve exactly unit L norm, we can instead minimize to the original,
constrained optimization problem. For example, if we want to minimize f( x) for
R
x with x constrained to have exactly unit L norm, we can instead minimize
93
∈
CHAPTER 4. NUMERICAL COMPUTATION
g(θ ) = f ([cos θ, sin θ]> ) with resp ect to θ, then return [ cos θ, sin θ] as the solution
respect
to the original problem. This approac approach h requires creativit
creativity; y; the transformation
gb(et
θ
etw) = f ([ cos θ, sin θ] ) with resp ect
ween optimization problems must be designed sp to θ , then return [ cos θ,
specifically
ecifically sin θfor
] aseac
theh solution
each case we
to the
encoun
encounter.original
ter. problem. This approac h requires creativit y; the transformation
between optimization problems must be designed specifically for each case we
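The projection-based procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the book: the feasible set S = {x : ||x||_2 ≤ 1} and the toy quadratic objective are our own choices.

```python
import numpy as np

def project_to_unit_ball(x):
    """Project x onto the feasible set S = {x : ||x||_2 <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def projected_gradient_descent(grad_f, x0, step_size=0.1, n_steps=100):
    """Take plain gradient steps, then project each result back into S."""
    x = x0
    for _ in range(n_steps):
        x = project_to_unit_ball(x - step_size * grad_f(x))
    return x

# Toy objective f(x) = ||x - c||^2 with c outside the unit ball; the
# constrained minimizer is the projection of c onto the ball, c / ||c||.
c = np.array([3.0, 4.0])
grad_f = lambda x: 2.0 * (x - c)
x_star = projected_gradient_descent(grad_f, np.zeros(2))
```

Projecting the gradient into the tangent space before stepping, as in Rosen's method, would avoid wasting part of each step on a direction the projection immediately undoes.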
The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.
To define the Lagrangian, we first need to describe S in terms of equations and inequalities. We want a description of S in terms of m functions g^(i) and n functions h^(j) so that S = {x | ∀i, g^(i)(x) = 0 and ∀j, h^(j)(x) ≤ 0}. The equations involving g^(i) are called the equality constraints and the inequalities involving h^(j) are called inequality constraints.

We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as

    L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x).        (4.14)
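Eq. 4.14 translates directly into code. In this sketch (our own convention, not the book's), the equality and inequality constraints are passed as lists of callables, and the multipliers as matching lists of numbers.

```python
def generalized_lagrangian(f, eq_constraints, ineq_constraints):
    """Build L(x, lam, alpha) = f(x) + sum_i lam_i*g_i(x) + sum_j alpha_j*h_j(x)."""
    def L(x, lam, alpha):
        penalty_eq = sum(l * g(x) for l, g in zip(lam, eq_constraints))
        penalty_ineq = sum(a * h(x) for a, h in zip(alpha, ineq_constraints))
        return f(x) + penalty_eq + penalty_ineq
    return L

# Example: minimize f(x) = x0^2 + x1^2 subject to x0 + x1 = 1.
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: x[0] + x[1] - 1.0          # equality constraint, g(x) = 0
L = generalized_lagrangian(f, [g], [])
# At the optimum x* = [0.5, 0.5] with lam* = -1, the gradient of L
# with respect to x vanishes: 2*x* + lam*[1, 1] = 0.
```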
We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one feasible point exists and f(x) is not permitted to have value ∞, then

    min_x max_λ max_{α, α≥0} L(x, λ, α)        (4.15)

has the same optimal objective function value and set of optimal points x as the original constrained problem min_{x∈S} f(x). For more information about the KKT approach, see Nocedal and Wright (2006).
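The equivalence in Eq. 4.15 can be checked numerically on a toy problem. The one-dimensional example below (minimize x^2 subject to x ≥ 1, written as h(x) = 1 − x ≤ 0) and the bounded grid standing in for the supremum over α are our own illustration, not from the book.

```python
import numpy as np

# Toy problem: f(x) = x^2, constraint h(x) = 1 - x <= 0 (i.e. x >= 1).
f = lambda x: x ** 2
h = lambda x: 1.0 - x
L = lambda x, alpha: f(x) + alpha * h(x)

xs = np.linspace(-2.0, 3.0, 501)
alphas = np.linspace(0.0, 50.0, 501)    # alpha >= 0; a large finite grid

# The inner maximization over alpha heavily penalizes infeasible x (h(x) > 0)
# and leaves f(x) unchanged for feasible x, so the outer minimization over x
# recovers the constrained minimizer.
inner = np.array([L(x, alphas).max() for x in xs])
x_opt = xs[np.argmin(inner)]            # approximately x* = 1
```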
4.5 Example: Linear Least Squares

Suppose we want to find the value of x that minimizes

    f(x) = (1/2) ||Ax − b||_2^2.

Introducing the constraint x⊤x ≤ 1 and differentiating the corresponding Lagrangian L(x, λ) = f(x) + λ(x⊤x − 1) with respect to x, we obtain

    A⊤Ax − A⊤b + 2λx = 0.        (4.25)

This tells us that the solution will take the form

    x = (A⊤A + 2λI)^{−1} A⊤b.        (4.26)
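Eq. 4.26 is a regularized form of the normal equations and can be sketched in NumPy. Choosing λ to satisfy the norm constraint lies outside this excerpt, so in this illustrative sketch (our own naming) λ is simply treated as given; λ = 0 recovers ordinary least squares when A⊤A is invertible.

```python
import numpy as np

def constrained_least_squares(A, b, lam):
    """Evaluate Eq. 4.26: x = (A^T A + 2*lam*I)^{-1} A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(n), A.T @ b)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)

x0 = constrained_least_squares(A, b, 0.0)    # unconstrained least squares
x1 = constrained_least_squares(A, b, 10.0)   # larger lam shrinks the solution
```

Increasing λ always shrinks the norm of the solution, which is why a large enough λ can force the solution inside the unit ball.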
Chapter 5

Machine Learning Basics
Deep learning is a specific kind of machine learning. In order to understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that will be applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to Sec. 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.
We begin with a definition of what a learning algorithm is, and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters that must be determined external to the learning algorithm itself; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components such as an
optimization algorithm, a cost function, a model, and a dataset to build a machine learning algorithm. Finally, in Sec. 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.
5.1 Learning Algorithms

A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides the definition "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." One can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we do not make any attempt in this book to provide a formal definition of what may be used for each of these entities. Instead, the following sections provide intuitive descriptions and examples of the different kinds of tasks, performance measures and experiences that can be used to construct machine learning algorithms.
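Mitchell's E/T/P framing can be made concrete with a toy sketch. The function names and the trivial majority-class learner below are our own illustration, not from the book: the experience E is a dataset, the task T is performed by the returned predictor, and the performance P is a scoring function.

```python
# A toy sketch of the E/T/P decomposition (illustrative names only).

def learn(experience):
    """E: a list of (input, label) pairs. Returns a majority-class predictor."""
    labels = [y for _, y in experience]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority               # T: assign a label to any input x

def performance(predict, examples):
    """P: accuracy, the proportion of examples labeled correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

E = [(0, "a"), (1, "a"), (2, "b")]
predict = learn(E)
acc = performance(predict, E)               # 2/3 for the majority-class rule
```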
5.1.1 The Task, T

Machine learning allows us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of machine learning entails developing our understanding of the principles that underlie intelligence.

In this relatively formal definition of the word "task," the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.

Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector x ∈ R^n where each entry x_i of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
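The representation of an example as a feature vector can be sketched directly; the tiny 2×2 "image" below is our own illustration.

```python
import numpy as np

# An example is a feature vector x in R^n. For image data, the features are
# typically the pixel values, flattened into a single vector.
image = np.array([[0.0, 0.5],
                  [1.0, 0.25]])       # a tiny 2x2 grayscale "image"
x = image.reshape(-1)                 # feature vector in R^4
# x[2] is the feature given by the pixel at row 1, column 0
```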
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:
• Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : R^n → {1, . . . , k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that allows computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and allow computers to interact more naturally with their users.
• Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but we only need to learn a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
• Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : R^n → R. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.

• Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies including Microsoft, IBM and Google (Hinton et al., 2012b).

• Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages, such as to translate from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).

• Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category, and subsumes the transcription and translation tasks described above, but also many other tasks. One example is parsing: mapping a natural language sentence into a tree that describes its grammatical structure and tagging nodes of the trees as being verbs, nouns, or adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category. For example, deep learning can be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010).
• Denoising: In this type of task, the machine learning algorithm is given in input a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).

• Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : R^n → R, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require that the learning algorithm has at least implicitly captured the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value x_i is missing and all of the other values, denoted x_{−i}, are given, then we know the distribution over it is given by p(x_i | x_{−i}). In practice, density estimation does not always allow us to solve all of these related tasks, because in many cases the required operations on p(x) are computationally intractable.

Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.
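The idea that one joint distribution stands in for all 2^n classifiers needed under missing inputs (and likewise supports the imputation p(x_i | x_{−i}) mentioned above) can be sketched for binary variables. The numbers in the joint table below are made up purely for illustration.

```python
import numpy as np

# One joint distribution p(x1, x2, y) over binary variables; to classify with
# any subset of inputs missing, marginalize the missing ones out.
joint = np.array([[[0.20, 0.05],     # p(x1=0, x2=0, y=0..1)
                   [0.10, 0.10]],    # p(x1=0, x2=1, y=0..1)
                  [[0.05, 0.15],     # p(x1=1, x2=0, y=0..1)
                   [0.05, 0.30]]])   # p(x1=1, x2=1, y=0..1)

def classify(x1=None, x2=None):
    """Predict y given any subset of the inputs, by marginalizing the rest."""
    p = joint
    p = p[x1] if x1 is not None else p.sum(axis=0)   # observe or sum out x1
    p = p[x2] if x2 is not None else p.sum(axis=0)   # observe or sum out x2
    return int(np.argmax(p))                          # argmax_y p(y | observed)

classify(x1=1, x2=1)   # both inputs observed
classify(x1=1)         # x2 missing: uses p(x1, y) = sum over x2
classify()             # everything missing: uses the prior p(y)
```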
5.1.2 The Performance Measure, P

In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain
equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.
on data that it has not seen before, since this determines ho how
w well it will work when
deploUsually
deploy we are interested in
yed in the real world. We therefore evho w w ell the mac
evaluatehine learning
aluate these performancealgorithm measures
performs
on data
using thatset
a test it has not seen
of data that b
isefore,
separatesincefrom
thisthe
determines
data used hofor
w well it willthe
training workmac when
machine
hine
deplo y ed in
learning system. the real w orld. W e therefore evaluate these p erformance measures
using a test set of data that is separate from the data used for training the machine
The choice of performance measure may seem straightforw straightforward ard and ob objectiv
jectiv
jective,
e,
learning system.
but it is often difficult to choose a performance measure that corresponds well to
the The choice
desired of performance
behavior measure may seem straightforward and ob jective,
of the system.
but it is often difficult to choose a performance measure that corresponds well to
In some cases, this is because it is difficult to decide what should b e measured.
the desired behavior of the system.
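Accuracy and error rate as defined above are simple to compute; a minimal sketch in Python (the function and variable names here are ours, invented for illustration, not from the book):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Proportion of examples for which the model produces the correct output.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def error_rate(y_true, y_pred):
    # Proportion of incorrect outputs: the expected 0-1 loss.
    return 1.0 - accuracy(y_true, y_pred)

# Always evaluate on held-out examples, never on the training data.
y_true = [0, 1, 1, 2, 2]
y_pred = [0, 1, 2, 2, 2]
print(accuracy(y_true, y_pred))    # 0.8
print(error_rate(y_true, y_pred))  # about 0.2
```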
In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.
In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.
5.1.3 The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.
Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as defined in Sec. 5.1.1. Sometimes we will also call examples data points.
One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each of the parts of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly as in density estimation or implicitly for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.
Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target. For example, the Iris dataset is annotated with the species of each iris plant. A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements.
Roughly speaking, unsupervised learning involves observing several examples of a random vector x, and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y | x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector x ∈ R^n, the joint distribution can be decomposed as

p(x) = ∏_{i=1}^{n} p(x_i | x_1, . . . , x_{i−1}).    (5.1)
This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we can solve the supervised learning problem of learning p(y | x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y) and inferring

p(y | x) = p(x, y) / Σ_{y′} p(x, y′).    (5.2)
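Eq. 5.2 can be checked numerically on a tiny discrete joint distribution; a sketch (the 2×2 table below is arbitrary, invented for illustration):

```python
import numpy as np

# Arbitrary joint distribution p(x, y) over two binary variables:
# rows are values of x, columns are values of y; entries sum to 1.
p_xy = np.array([[0.1, 0.3],
                 [0.4, 0.2]])

# Eq. 5.2: recover the "supervised" quantity p(y | x) from the joint
# by normalizing each row by p(x) = sum over y' of p(x, y').
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)

# Going the other direction, the joint factors as p(x, y) = p(x) p(y | x).
p_x = p_xy.sum(axis=1)
assert np.allclose(p_x[:, None] * p_y_given_x, p_xy)

print(p_y_given_x[0])  # p(y | x=0), i.e. [0.25, 0.75]
```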
Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help to roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled. For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).
Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.
Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.
One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix X ∈ R^{150×4}, where X_{i,1} is the sepal length of plant i, X_{i,2} is the sepal width of plant i, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.
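In code, a design matrix is simply a 2-D array with one example per row and one feature per column; a sketch with synthetic stand-in values (random numbers, not Fisher's actual measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Iris design matrix: 150 examples (rows) and
# 4 features (columns): sepal length, sepal width, petal length, petal width.
X = rng.uniform(low=0.1, high=8.0, size=(150, 4))

print(X.shape)   # (150, 4)
print(X[0, 0])   # first feature of the first example (X_{1,1} in 1-based notation)
```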
Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all of the photographs may be described with the same length of vector. Sec. 9.7 and Chapter 10 describe how to handle different types
of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with m rows, we will describe it as a set containing m elements: {x^(1), x^(2), . . . , x^(m)}. This notation does not imply that any two example vectors x^(i) and x^(j) have the same size.
In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations X, we also provide a vector of labels y, with y_i providing the label for example i.
Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.

Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.
5.1.4 Example: Linear Regression

Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program’s performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand its behavior.
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ R^n as input and predict the value of a scalar y ∈ R as its output. In the case of linear regression, the output is a linear function of the input. Let ŷ be the value that our model predicts y should take on. We define the output to be

ŷ = w^T x    (5.3)

where w ∈ R^n is a vector of parameters.
If a feature receives a positive weight, then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature’s weight is large in magnitude, then it has a large effect on the prediction. If a feature’s weight is zero, it has no effect on the prediction.
We thus have a definition of our task T: to predict y from x by outputting ŷ = w^T x. Next we need a definition of our performance measure, P.
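The model's output ŷ = w^T x is just an inner product; a toy illustration (all numbers invented):

```python
import numpy as np

# Toy weight and input vectors; the task T is to output y_hat = w^T x.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])

y_hat = w @ x  # inner product w^T x
print(y_hat)   # 4.5
```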
Suppose that we have a design matrix of m example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of y for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as X^(test) and the vector of regression targets as y^(test).
One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If ŷ^(test) gives the predictions of the model on the test set, then the mean squared error is given by

MSE_test = (1/m) Σ_i (ŷ^(test) − y^(test))_i^2.    (5.4)

Intuitively, one can see that this error measure decreases to 0 when ŷ^(test) = y^(test). We can also see that

MSE_test = (1/m) ||ŷ^(test) − y^(test)||_2^2,    (5.5)

so the error increases whenever the Euclidean distance between the predictions and the targets increases.
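Eqs. 5.4 and 5.5 describe the same quantity; a quick numerical check (toy vectors, invented for illustration):

```python
import numpy as np

def mse(y_hat, y):
    # Eq. 5.4: average of squared per-example differences.
    return np.mean((y_hat - y) ** 2)

y_test = np.array([1.0, 2.0, 3.0])
y_hat  = np.array([1.5, 2.0, 2.0])

# Eq. 5.5: the same quantity via the squared Euclidean (L2) norm.
via_norm = np.linalg.norm(y_hat - y_test) ** 2 / len(y_test)

assert np.isclose(mse(y_hat, y_test), via_norm)
print(mse(y_hat, y_test))  # 1.25 / 3, about 0.4167
```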
To make a machine learning algorithm, we need to design an algorithm that will improve the weights w in a way that reduces MSE_test when the algorithm is allowed to gain experience by observing a training set (X^(train), y^(train)). One intuitive way of doing this (which we will justify later, in Sec. 5.5.1) is just to minimize the mean squared error on the training set, MSE_train.

To minimize MSE_train, we can simply solve for where its gradient is 0:
[Figure 5.1: two panels. Left, “Linear regression example”: y plotted against x_1 with the fitted line y = w x. Right, “Optimization of w”: MSE(train) plotted against w_1.]
∇_w MSE_train = 0    (5.6)

⇒ ∇_w (1/m) ||ŷ^(train) − y^(train)||_2^2 = 0    (5.7)

⇒ (1/m) ∇_w ||X^(train) w − y^(train)||_2^2 = 0    (5.8)

⇒ ∇_w (X^(train) w − y^(train))^T (X^(train) w − y^(train)) = 0    (5.9)

⇒ ∇_w (w^T X^(train)^T X^(train) w − 2 w^T X^(train)^T y^(train) + y^(train)^T y^(train)) = 0    (5.10)

⇒ 2 X^(train)^T X^(train) w − 2 X^(train)^T y^(train) = 0    (5.11)

⇒ w = (X^(train)^T X^(train))^(−1) X^(train)^T y^(train)    (5.12)
The system of equations whose solution is given by Eq. 5.12 is known as the normal equations. Evaluating Eq. 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see Fig. 5.1.
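Eq. 5.12 translates directly into a few lines of NumPy; the data below is synthetic (invented for this sketch), and in practice one solves the linear system rather than forming the matrix inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: m = 50 examples, n = 3 features, known true weights.
X_train = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y_train = X_train @ w_true  # noiseless, so the solution should recover w_true

# Normal equations (Eq. 5.12): w = (X^T X)^{-1} X^T y, computed by solving
# the linear system X^T X w = X^T y instead of inverting X^T X.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print(w)  # close to [ 2.  -1.   0.5]
```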
It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter, an intercept term b. In this model

ŷ = w^T x + b    (5.13)

so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model’s predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter b, one can continue to use the model with only weights but augment x with an extra entry that is always set to 1.
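This augmentation trick can be sketched as follows (toy data, invented for illustration); the weight learned for the constant feature then plays the role of b:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy affine data: y = 1.5*x1 - 2.0*x2 + 3.0, so the intercept is b = 3.
X = rng.normal(size=(40, 2))
y = X @ np.array([1.5, -2.0]) + 3.0

# Augment each x with a constant 1 entry and fit weights only,
# using the normal equations from Eq. 5.12.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print(w)  # close to [1.5, -2.0, 3.0]; the last weight recovers b
```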
5.2 Capacity, Overfitting and Underfitting

The central challenge in machine learning is that we must perform well on new inputs, not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error, and we reduce this training error. So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well.

The generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
In our linear regression example, we trained the model by minimizing the training error,

(1/m^(train)) ||X^(train) w − y^(train)||_2^2,    (5.14)

but we actually care about the test error, (1/m^(test)) ||X^(test) w − y^(test)||_2^2.

How can we affect performance on the test set when we get to observe only the training set? The field of statistical learning theory provides some answers. If the
training and the test set are collected arbitrarily arbitrarily,, there is indeed little we can do.
If we are allo allowed
wed to make some assumptions about ho how w the training and test set
training and the test
are collected, then we can mak set are collected arbitrarily
makee some progress. , there is indeed little we can do.
If we are allowed to make some assumptions about how the training and test set
The train and test data are generated by a probability distribution over datasets called the data generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the train set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption allows us to describe the data generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data generating distribution, denoted p_data. This probabilistic framework and the i.i.d. assumptions allow us to mathematically study the relationship between training error and test error.

One immediate connection we can observe between the training and test error
is that the expected training error of a randomly selected model is equal to the expected test error of that model. Suppose we have a probability distribution p(x, y) and we sample from it repeatedly to generate the train set and the test set. For some fixed value w, then the expected training set error is exactly the same as the expected test set error, because both expectations are formed using the same dataset sampling process. The only difference between the two conditions is the name we assign to the dataset we sample.

Of course, when we use a machine learning algorithm, we do not fix the
parameters ahead of time, then sample both datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small.

2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning:
underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.

We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's capacity is its ability to fit a wide variety of
functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.

A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction

\hat{y} = b + wx. \qquad (5.15)
By introducing x^2 as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of x:

\hat{y} = b + w_1 x + w_2 x^2. \qquad (5.16)
parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task so it generalizes well to new data.
of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.

Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287-1347). This principle states that among competing hypotheses that explain known observations equally well, one should choose the "simplest" one. This idea was formalized and made more precise in the 20th century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).

Statistical learning theory provides various means of quantifying model capacity. Among these, the most well-known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.
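For linear classifiers in two dimensions this definition can be checked by brute force. The sketch below uses hypothetical point sets and the classic perceptron update (one standard way to find a separator when one exists, not part of the text): three non-collinear points can be labeled arbitrarily, while the XOR labeling of four points cannot.

```python
import numpy as np
from itertools import product

def separable(points, labels, epochs=1000):
    """Search for a strict linear separator via the perceptron update;
    returns True iff one is found (guaranteed for separable data)."""
    X = np.column_stack([points, np.ones(len(points))])  # absorb the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, labels):
            if yi * (w @ xi) <= 0:  # wrong side or on the boundary
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Three non-collinear points: every one of the 2^3 labelings is realizable,
# so the VC dimension of 2-D linear classifiers is at least 3.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shatters_three = all(separable(tri, np.array(lab))
                     for lab in product((-1.0, 1.0), repeat=3))

# The XOR labeling of four points is not realizable, consistent with the
# VC dimension of this family being exactly 3.
quad = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
xor_realizable = separable(quad, np.array([1.0, 1.0, -1.0, -1.0]))
```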
Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995). These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the very general non-convex optimization problems involved in deep learning.

We must remember that while simpler functions are more likely to generalize
(to have a small gap between training and test error) we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in Fig. 5.3.

To reach the most extreme case of arbitrarily high capacity, we introduce the concept of non-parametric models. So far, we have seen only parametric
models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Non-parametric models have no such limitation.
Sometimes, non-parametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical non-parametric models by making their complexity a function of the training set size. One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the X and y from the training set. When asked to classify a test point x, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, ŷ = y_i where i = arg min_i ||X_{i,:} − x||₂². The algorithm can also be generalized to distance metrics other than the L² norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed
to break ties by averaging the y_i values for all X_{i,:} that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.
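The lookup just described takes only a few lines. A sketch with a tiny hypothetical training set (note the duplicated input with two different outputs, which is exactly the case where the tie-breaking rule matters and training error stays above zero):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Nearest neighbor regression: y_hat = y_i, i = argmin ||X_i,: - x||^2.
    Ties are broken by averaging the y_i of all tied nearest points."""
    d2 = np.sum((X_train - x) ** 2, axis=1)   # squared L2 distances
    nearest = d2 == d2.min()                  # boolean mask of tied minima
    return y_train[nearest].mean()

X = np.array([[0.0], [1.0], [1.0], [3.0]])
y = np.array([0.0, 2.0, 4.0, 9.0])

nearest_neighbor_predict(X, y, np.array([2.9]))  # -> 9.0 (unique nearest)
nearest_neighbor_predict(X, y, np.array([1.0]))  # -> 3.0, the tie average (2+4)/2
```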
Finally, we can also create a non-parametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number
5.2.1 The No Free Lunch Theorem

Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set.
In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.
Unfortunately, even this does not resolve the entire problem. The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.
This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.
5.2.2 Regularization

The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better.

So far, the only method of modifying a learning algorithm we have discussed is
to increase or decrease the model's capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified.

The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions. The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input. These linear functions can be very useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform very well if we tried to use it to predict sin(x) from x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the amount of these functions.
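A quick illustration of this point, with hypothetical data: the same least-squares algorithm does poorly when its hypothesis space contains only linear functions of x, and perfectly when we allow it the feature sin(x).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-np.pi, np.pi, size=200)
y = np.sin(x)  # the true relationship is nonlinear

# Hypothesis space 1: linear functions of x (plus a bias term).
A = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
mse_linear = np.mean((A @ w - y) ** 2)   # substantial residual error

# Hypothesis space 2: the same algorithm, but allowed the feature sin(x).
B = np.column_stack([np.sin(x), np.ones_like(x)])
v, *_ = np.linalg.lstsq(B, y, rcond=None)
mse_sin = np.mean((B @ v - y) ** 2)      # essentially zero
```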
We can also give a learning algorithm a preference for one solution in its hypothesis space to another. This means that both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.

For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize
a sum comprising both the mean squared error on the training and a criterion J(w) that expresses a preference for the weights to have smaller squared L² norm. Specifically,

J(w) = \mathrm{MSE}_{\mathrm{train}} + \lambda w^\top w, \qquad (5.18)

where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we impose no preference, and larger λ forces the weights to become smaller. Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put weight on fewer of the features. As an example of how we can control a model's tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 5.5 for the results.
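Because J(w) in Eq. 5.18 is quadratic in w, its minimizer has a closed form: setting the gradient to zero gives w = (XᵀX + mλI)⁻¹Xᵀy when the MSE carries a 1/m factor. A sketch with hypothetical data, showing that larger λ does force the weights to become smaller:

```python
import numpy as np

def weight_decay_fit(X, y, lam):
    """Minimize (1/m)||Xw - y||^2 + lam * w^T w (Eq. 5.18).
    Zero gradient gives (X^T X + m*lam*I) w = X^T y."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 3.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

# Weight norms shrink monotonically as lambda grows.
norms = [np.linalg.norm(weight_decay_fit(X, y, lam))
         for lam in (0.0, 0.1, 10.0)]
```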
More generally, we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is Ω(w) = w^\top w. In Chapter 7, we will see that many other
5.3 Hyperparameters and Validation Sets

Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another learning algorithm).

In the polynomial regression example we saw in Fig. 5.2, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize. More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always
cho
hoose
ose the maxim
maximum um possible mo model
del capacity
capacity,, resulting in ov overfitting
erfitting (refer to
Fig. 5.3). For example, we can alw alwa ays fit the training set better with a higher
choose pthe
degree maximand
olynomial um possibleeightt mo
a weigh del setting
decay capacity of, λresulting
= 0 thaninweovcould erfitting with(refer
a lowto
low er
Fig. 5.3 ). F or example, we
degree polynomial and a positive weigh can alw a ys fit
eightt deca the
decay training
y setting. set better with a higher
degree polynomial and a weight decay setting of λ = 0 than we could with a lower
To solve this problem, we need a validation set of examples that the training
degree polynomial and a positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is used to “train” the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error. After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
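The train/validation split described above is straightforward to implement. A minimal sketch, assuming NumPy and an in-memory dataset (the helper name `split_train_validation` and the exact 80/20 ratio are just the convention mentioned in the text, not a fixed API):

```python
import numpy as np

def split_train_validation(X, y, validation_fraction=0.2, seed=0):
    """Shuffle the training data, then hold out a fraction for validation.

    No test-set example is ever placed here: the split is carved
    entirely out of the training data, as the text requires.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_val = int(len(X) * validation_fraction)
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Example: 100 training examples -> 80 to learn parameters, 20 to
# estimate generalization error while tuning hyperparameters.
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100, dtype=float)
X_tr, y_tr, X_val, y_val = split_train_validation(X, y)
```

The shuffle matters: if the training data is ordered (e.g. by class), a non-random 80/20 cut would give validation examples from a different distribution than the training subset.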
In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
5.3.1 Cross-Validation

Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not a serious issue. When the dataset is too small, there are alternative procedures, which allow one to use all of the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset. The most common of these is the k-fold cross-validation procedure, shown in Algorithm 5.1, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials. On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set. One problem is that there exist no unbiased estimators of the variance of such average error estimators (Bengio and Grandvalet, 2004), but approximations are typically used.
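The k-fold procedure can be sketched as follows. This is a simplified illustration of the idea behind Algorithm 5.1, not the book's exact pseudocode, and the `train_fn`/`loss_fn` interface is an assumption made for the example:

```python
import numpy as np

def k_fold_cv(X, y, train_fn, loss_fn, k=5, seed=0):
    """Estimate mean test error by averaging over k non-overlapping splits.

    On trial i, the i-th subset is the test set and the remaining
    data is the training set.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        errors.append(loss_fn(model, X[folds[i]], y[folds[i]]))
    return float(np.mean(errors))

# Toy usage: "training" just memorizes the mean of y; the loss is squared error.
train_fn = lambda X, y: y.mean()
loss_fn = lambda model, X, y: float(np.mean((y - model) ** 2))
err = k_fold_cv(np.zeros((20, 1)), np.ones(20), train_fn, loss_fn, k=5)
```

Every example is used for testing exactly once, which is why the procedure uses all the data at the price of training k models.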
5.4 Estimators, Bias and Variance

The field of statistics gives us many tools that can be used to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.
5.4.1 Point Estimation

Point estimation is the attempt to provide the single “best” prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example in Sec. 5.1.4, but it can also be a whole function.

In order to distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter θ by θ̂.

Let {x^{(1)}, . . . , x^{(m)}} be a set of m independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:

θ̂_m = g(x^{(1)}, . . . , x^{(m)}).    (5.19)
a good estimator is a function whose output is close to the true underlying θ that generated the training data.

For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore θ̂ is a random variable.

Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.

As we mentioned above, sometimes we are interested in performing function estimation (or function approximation). Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space. The linear regression example (discussed above in Sec. 5.1.4) and the polynomial regression example (discussed in Sec. 5.2) are both examples of scenarios that may be interpreted either as estimating a parameter w or estimating a function f̂ mapping from x to y.

We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
5.4.2 Bias

The bias of an estimator is defined as:

bias(θ̂_m) = E[θ̂_m] − θ    (5.20)
To determine the bias of the sample mean, we are again interested in calculating its expectation:

bias(µ̂_m) = E[µ̂_m] − µ    (5.31)
          = E[(1/m) ∑_{i=1}^m x^{(i)}] − µ    (5.32)
          = (1/m) ∑_{i=1}^m E[x^{(i)}] − µ    (5.33)
          = (1/m) ∑_{i=1}^m µ − µ    (5.34)
          = µ − µ = 0    (5.35)

Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.
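The derivation above can be checked by simulation: averaging the sample mean over many independently drawn datasets should recover µ up to sampling noise. A quick sketch (the particular µ, σ, sample size and trial count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m = 3.0, 2.0, 10   # true Gaussian parameters, small sample size
n_trials = 200_000            # number of independently drawn datasets

# Draw n_trials datasets of m points each and compute each sample mean.
sample_means = rng.normal(mu, sigma, size=(n_trials, m)).mean(axis=1)

# Estimate E[mu_hat] across datasets; the bias should be ~0 for any m,
# even a small m such as 10, because unbiasedness holds at every m.
bias = float(sample_means.mean() - mu)
```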
As an example, we compare two different estimators of the variance parameter σ² of a Gaussian distribution. We are interested in knowing if either estimator is biased. The first estimator of σ² we consider is known as the sample variance:

σ̂²_m = (1/m) ∑_{i=1}^m (x^{(i)} − µ̂_m)²    (5.36)

The sample variance is biased: E[σ̂²_m] = ((m − 1)/m) σ², so bias(σ̂²_m) = −σ²/m. The unbiased sample variance estimator

σ̃²_m = (1/(m − 1)) ∑_{i=1}^m (x^{(i)} − µ̂_m)²    (5.40)

provides an alternative approach. As the name suggests this estimator is unbiased. That is, we find that E[σ̃²_m] = σ²:

E[σ̃²_m] = E[(1/(m − 1)) ∑_{i=1}^m (x^{(i)} − µ̂_m)²]    (5.41)
         = (m/(m − 1)) E[σ̂²_m]    (5.42)
         = (m/(m − 1)) ((m − 1)/m) σ²    (5.43)
         = σ².    (5.44)

We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the “best” estimators. As we will see we often use biased estimators that possess other important properties.
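In NumPy the two estimators differ only in the `ddof` argument of `np.var` (divide by m for `ddof=0`, by m − 1 for `ddof=1`), which makes the bias of each easy to check by simulation (the parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, m, n_trials = 4.0, 5, 400_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, m))

# Biased sample variance: divides by m (ddof=0).
mean_biased = float(samples.var(axis=1, ddof=0).mean())
# Unbiased sample variance of Eq. 5.40: divides by m - 1 (ddof=1).
mean_unbiased = float(samples.var(axis=1, ddof=1).mean())

# Expected values: E[sigma_hat^2] = ((m-1)/m) * sigma^2 = 3.2,
# while E[sigma_tilde^2] = sigma^2 = 4.0.
```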
5.4.3 Variance and Standard Error

Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance

Var(θ̂)    (5.45)

where the random variable is the training set. For the Bernoulli mean estimator θ̂_m = (1/m) ∑_{i=1}^m x^{(i)}, where Var(x^{(i)}) = θ(1 − θ), the variance of the estimator is

Var(θ̂_m) = (1/m²) ∑_{i=1}^m Var(x^{(i)})    (5.49)
          = (1/m²) ∑_{i=1}^m θ(1 − θ)    (5.50)
          = (1/m²) m θ(1 − θ)    (5.51)
          = (1/m) θ(1 − θ)    (5.52)

The variance of the estimator decreases as a function of m, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency (see Sec. 5.4.5).
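The 1/m scaling in Eq. 5.52 can be verified by simulation for the Bernoulli mean estimator (θ and the two sample sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n_trials = 0.3, 200_000

def estimator_variance(m):
    """Empirical variance of theta_hat = mean of m Bernoulli(theta) draws."""
    theta_hat = rng.binomial(1, theta, size=(n_trials, m)).mean(axis=1)
    return float(theta_hat.var())

v10, v100 = estimator_variance(10), estimator_variance(100)
# Eq. 5.52 predicts theta*(1-theta)/m: 0.021 for m=10, 0.0021 for m=100,
# i.e. ten times more data gives ten times less estimator variance.
```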
5.4.4 Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in Fig. 5.2 and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?
The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:

MSE = E[(θ̂_m − θ)²]    (5.53)
    = Bias(θ̂_m)² + Var(θ̂_m)    (5.54)

The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter θ. As is clear from Eq. 5.54, evaluating the MSE incorporates both the bias and the variance. Desirable estimators are those with small MSE and these are estimators that manage to keep both their bias and variance somewhat in check.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in Fig. 5.6, where we see again the U-shaped curve of generalization error as a function of capacity.
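The decomposition in Eq. 5.54 can be checked numerically. Here we deliberately use a biased estimator of a Gaussian mean, a sample mean shrunk toward zero (the shrinkage factor 0.8 and the other parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, m, n_trials = 2.0, 8, 500_000
samples = rng.normal(mu, 1.0, size=(n_trials, m))

# A deliberately biased estimator: shrink the sample mean toward zero.
theta_hat = 0.8 * samples.mean(axis=1)

mse = float(np.mean((theta_hat - mu) ** 2))
bias_sq = float((theta_hat.mean() - mu) ** 2)
var = float(theta_hat.var())
# Eq. 5.54: for empirical moments the identity MSE = Bias^2 + Var
# holds exactly, up to floating-point rounding.
```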
5.4.5 Consistency

So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points m in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that

plim_{m→∞} θ̂_m = θ.    (5.55)

The symbol plim indicates that the convergence is in probability, i.e. for any ε > 0, P(|θ̂_m − θ| > ε) → 0 as m → ∞. The condition described by Eq. 5.55 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of θ̂ to θ.
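Convergence in probability can be illustrated by estimating P(|θ̂_m − θ| > ε) for the Bernoulli sample mean at two sample sizes (ε and the sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, eps, n_trials = 0.5, 0.05, 100_000

def prob_far(m):
    """Estimate P(|theta_hat_m - theta| > eps) for the Bernoulli sample mean."""
    theta_hat = rng.binomial(1, theta, size=(n_trials, m)).mean(axis=1)
    return float(np.mean(np.abs(theta_hat - theta) > eps))

p_small, p_large = prob_far(20), prob_far(2000)
# Consistency: the probability of an eps-deviation shrinks toward 0 as m grows.
```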
5.5 Maximum Likelihood Estimation

Previously, we have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

The most common such principle is the maximum likelihood principle.

Consider a set of m examples X = {x^{(1)}, . . . , x^{(m)}} drawn independently from the true but unknown data generating distribution p_data(x).

Let p_model(x; θ) be a parametric family of probability distributions over the same space indexed by θ. In other words, p_model(x; θ) maps any configuration x to a real number estimating the true probability p_data(x).
The maximum likelihood estimator for θ is then defined as

θ_ML = arg max_θ p_model(X; θ)    (5.56)
     = arg max_θ ∏_{i=1}^m p_model(x^{(i)}; θ)    (5.57)
This product over many probabilities can be inconvenient for a variety of reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product
into a sum:

θ_ML = arg max_θ ∑_{i=1}^m log p_model(x^{(i)}; θ).    (5.58)

Because the arg max does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution p̂_data defined by the training data:

θ_ML = arg max_θ E_{x∼p̂_data} log p_model(x; θ).    (5.59)
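As a sketch of Eqs. 5.58–5.59: for a unit-variance Gaussian model with unknown mean, maximizing the average log-likelihood over a coarse grid of candidate θ values (a crude stand-in for a real optimizer, used here only for illustration) recovers the sample mean, which is the closed-form MLE in this case:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(1.5, 1.0, size=1000)   # x^(i) drawn i.i.d. from p_data

def avg_log_likelihood(theta, x):
    """(1/m) * sum_i log p_model(x^(i); theta) for a unit-variance Gaussian."""
    return float(np.mean(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi)))

# Crude maximization of Eq. 5.59 over a grid of candidate means.
grid = np.linspace(-5.0, 5.0, 2001)
theta_ml = float(grid[np.argmax([avg_log_likelihood(t, data) for t in grid])])
# theta_ml lands on the grid point nearest the sample mean.
```

Note that the sum of log-probabilities avoids the numerical underflow that the raw product of 1000 densities in Eq. 5.57 would suffer.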
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution p̂_data defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by

D_KL(p̂_data ‖ p_model) = E_{x∼p̂_data}[log p̂_data(x) − log p_model(x)].    (5.60)

The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize

−E_{x∼p̂_data}[log p_model(x)]    (5.61)

which is of course the same as the maximization in Eq. 5.59.

Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution p̂_data. Ideally, we would like to match the true data generating distribution p_data, but we have no direct access to this distribution.

While the optimal θ is the same regardless of whether we are maximizing the likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when x is real-valued.
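The relationship between KL divergence, cross-entropy and entropy is easy to verify numerically for a discrete distribution (the two three-outcome distributions below are arbitrary illustrative choices):

```python
import numpy as np

p_hat = np.array([0.5, 0.3, 0.2])   # empirical distribution over 3 outcomes
q = np.array([0.4, 0.4, 0.2])       # model distribution

kl = float(np.sum(p_hat * (np.log(p_hat) - np.log(q))))   # Eq. 5.60
cross_entropy = float(-np.sum(p_hat * np.log(q)))         # Eq. 5.61
entropy = float(-np.sum(p_hat * np.log(p_hat)))

# The entropy of p_hat does not depend on the model, so
# KL = cross_entropy - entropy: minimizing the cross-entropy (the NLL)
# and minimizing the KL divergence pick out the same model.
```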
5.5.1 Conditional Log-Likelihood and Mean Squared Error

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x; θ) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning. If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is

θ_ML = arg max_θ P(Y | X; θ).    (5.62)
If the examples are assumed to be i.i.d., then this can be decomposed into

θ_ML = arg max_θ Σ_{i=1}^{m} log P(y^(i) | x^(i); θ).   (5.63)
Linear regression, introduced earlier in Sec. 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input x and produce an output value ŷ. The mapping from x to ŷ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction ŷ, we now think of the model as producing a conditional distribution p(y | x). We can imagine that with an infinitely large training set, we might see several training examples with the same input value x but different values of y. The goal of the learning algorithm is now to fit the distribution p(y | x) to all of those different y values that are all compatible with x. To derive the same linear regression algorithm we obtained before, we define p(y | x) = N(y; ŷ(x; w), σ²). The function ŷ(x; w) gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant σ² chosen by the user. We will see that this choice of the functional form of p(y | x) causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (Eq. 5.63) is given by

Σ_{i=1}^{m} log p(y^(i) | x^(i); θ)   (5.64)
= −m log σ − (m/2) log(2π) − Σ_{i=1}^{m} ||ŷ^(i) − y^(i)||² / (2σ²),   (5.65)
where ŷ^(i) is the output of the linear regression on the i-th input x^(i) and m is the number of training examples. Comparing the log-likelihood with the mean squared error,

MSE_train = (1/m) Σ_{i=1}^{m} ||ŷ^(i) − y^(i)||²,   (5.66)

we immediately see that maximizing the log-likelihood with respect to w yields the same estimate of the parameters w as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
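The shared optimum is easy to verify numerically. A minimal sketch on synthetic data (the dataset, the true weights, and the fixed σ = 1 are all illustrative assumptions): the weights that minimize MSE, found via the normal equations, also maximize the Gaussian log-likelihood of Eq. 5.65.

```python
import numpy as np

# Synthetic regression data (illustrative; not from the text).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

# The w that minimizes mean squared error, via the normal equations.
w_mse = np.linalg.solve(X.T @ X, X.T @ y)

def log_likelihood(w, sigma=1.0):
    """Conditional Gaussian log-likelihood of Eq. 5.65, with fixed variance."""
    m = len(y)
    resid = y - X @ w
    return -m * np.log(sigma) - m / 2 * np.log(2 * np.pi) - resid @ resid / (2 * sigma**2)

# Perturbing w away from the MSE solution can only lower the likelihood:
# the two criteria differ in value but share the location of their optimum.
w_other = w_mse + rng.normal(scale=0.1, size=3)
assert log_likelihood(w_other) < log_likelihood(w_mse)
```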
5.5.2 Properties of Maximum Likelihood

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples m → ∞, in terms of its rate of convergence as m increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see Sec. 5.4.5 above), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:

• The true distribution p_data must lie within the model family p_model(·; θ). Otherwise, no estimator can recover p_data.

• The true distribution p_data must correspond to exactly one value of θ. Otherwise, maximum likelihood can recover the correct p_data, but will not be able to determine which value of θ was used by the data generating process.
There are other whichinductiv
inductivevalue of θ wasb esides
e principles used by thethe dataum
maxim
maximum generating
lik
likelihoo
elihoo
elihood pro
d cessing.
estimator,
man
many y of which share the prop property
erty of being consistent estimators. Ho How wev
ever,
er, consis-
ten There are other inductiv e principles b esides the maxim um
tentt estimators can differ in their statistic efficiency, meaning that one consisten likelihoo d estimator,
consistentt
man y of
estimator ma which
may share
y obtain lowthe prop
lower erty of b eing consistent
er generalization error for a fixed num estimators.
umb Ho w ev er, consis-
ber of samples m,
ten t estimators
or equiv
equivalen
alen
alently
tly can
tly,, ma
may differ in their statistic efficiency
y require fewer examples to obtain a fixed lev , meaning that
el of generalizationt
level one consisten
estimator may obtain lower generalization error for a fixed number of samples m ,
error.
or equivalently, may require fewer examples to obtain a fixed level of generalization
Statistical efficiency is typically studied in the parametric case (like in linear regression) where our goal is to estimate the value of a parameter (and assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter
values, where the expectation is over m training samples from the data generating distribution. That parametric mean squared error decreases as m increases, and for m large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.
5.6 Bayesian Statistics

So far we have discussed frequentist statistics and approaches based on estimating a single value of θ, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of θ when making a prediction. The latter is the domain of Bayesian statistics.

As discussed in Sec. 5.4.1, the frequentist perspective is that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a random variable on account of it being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter θ is unknown or uncertain and thus is represented as a random variable.

Before observing the data, we represent our knowledge of θ using the prior probability distribution, p(θ) (sometimes referred to as simply “the prior”). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e. with high entropy) to reflect a high degree of uncertainty in the value of θ before observing any data. For example, one might assume a priori that θ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for “simpler” solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).

Now consider that we have a set of data samples {x^(1), . . . , x^(m)}. We can recover the effect of data on our belief about θ by combining the data likelihood p(x^(1), . . . , x^(m) | θ) with the prior via Bayes’ rule:

p(θ | x^(1), . . . , x^(m)) = p(x^(1), . . . , x^(m) | θ) p(θ) / p(x^(1), . . . , x^(m))   (5.67)
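Eq. 5.67 can be evaluated directly when θ ranges over a finite grid. A small sketch, assuming hypothetical coin-flip data (7 heads in 10 flips) and a broad uniform prior over θ = P(heads); the data and grid resolution are illustrative choices:

```python
# Candidate values for theta = P(heads), with a broad (uniform) prior.
thetas = [i / 100 for i in range(1, 100)]
prior = [1 / len(thetas)] * len(thetas)

heads, flips = 7, 10  # hypothetical observed data

# Likelihood of the data under each candidate theta (i.i.d. Bernoulli flips).
likelihood = [t**heads * (1 - t) ** (flips - heads) for t in thetas]

# Bayes' rule (Eq. 5.67): posterior = likelihood * prior / evidence.
unnorm = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnorm)  # the denominator p(x^(1), ..., x^(m))
posterior = [u / evidence for u in unnorm]

# The posterior is a proper distribution that concentrates near 0.7.
best = thetas[posterior.index(max(posterior))]
print(best)  # 0.7
```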
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.

Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of θ, the Bayesian approach is to make predictions using a full distribution over θ. For example, after observing m examples, the predicted distribution over the next data sample, x^(m+1), is given by

p(x^(m+1) | x^(1), . . . , x^(m)) = ∫ p(x^(m+1) | θ) p(θ | x^(1), . . . , x^(m)) dθ.   (5.68)
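Over a finite grid of θ values the integral in Eq. 5.68 becomes a posterior-weighted sum, and the resulting prediction genuinely differs from plugging a point estimate into p(x | θ). A sketch with hypothetical Bernoulli data (7 heads in 10 flips) under a uniform prior; the data and grid are illustrative assumptions:

```python
thetas = [i / 100 for i in range(1, 100)]
heads, flips = 7, 10  # hypothetical data

# Posterior over theta under a uniform prior (see Eq. 5.67).
unnorm = [t**heads * (1 - t) ** (flips - heads) for t in thetas]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

# Point-estimate prediction: plug in the maximum likelihood estimate.
theta_ml = heads / flips
p_point = theta_ml  # p(next = heads | theta_ML) = 0.7

# Bayesian prediction (discrete analogue of Eq. 5.68): average
# p(next = heads | theta) over the full posterior instead of one theta.
p_bayes = sum(t * p for t, p in zip(thetas, posterior))

# Averaging over the posterior pulls the prediction toward 1/2
# relative to the point estimate.
assert p_bayes < p_point
```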
ŷ = Xw.   (5.70)

Expressed as a Gaussian conditional distribution on y^(train), we have

p(y^(train) | X^(train), w) = N(y^(train); X^(train)w, I)   (5.71)
∝ exp(−(1/2)(y^(train) − X^(train)w)ᵀ(y^(train) − X^(train)w)),   (5.72)

where we follow the standard MSE formulation in assuming that the Gaussian variance on y is one. In what follows, to reduce the notational burden, we refer to (X^(train), y^(train)) as simply (X, y).

To determine the posterior distribution over the model parameter vector w, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution expressing a high degree of uncertainty about θ. For real-valued parameters it is common to use a Gaussian as a prior distribution:

p(w) = N(w; μ₀, Λ₀) ∝ exp(−(1/2)(w − μ₀)ᵀΛ₀⁻¹(w − μ₀))   (5.73)

where μ₀ and Λ₀ are the prior distribution mean vector and covariance matrix respectively.¹

With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters.

p(w | X, y) ∝ p(y | X, w) p(w)   (5.74)

¹ Unless there is a reason to assume a particular covariance structure, we typically assume a diagonal covariance matrix.
∝ exp(−(1/2)(y − Xw)ᵀ(y − Xw)) exp(−(1/2)(w − μ₀)ᵀΛ₀⁻¹(w − μ₀))   (5.75)
∝ exp(−(1/2)(−2yᵀXw + wᵀXᵀXw + wᵀΛ₀⁻¹w − 2μ₀ᵀΛ₀⁻¹w)).   (5.76)

We now define Λₘ = (XᵀX + Λ₀⁻¹)⁻¹ and μₘ = Λₘ(Xᵀy + Λ₀⁻¹μ₀). Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:

p(w | X, y) ∝ exp(−(1/2)(w − μₘ)ᵀΛₘ⁻¹(w − μₘ) + (1/2)μₘᵀΛₘ⁻¹μₘ)   (5.77)
∝ exp(−(1/2)(w − μₘ)ᵀΛₘ⁻¹(w − μₘ)).   (5.78)

All terms that do not include the parameter vector w have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1. Eq. 3.23 shows how to normalize a multivariate Gaussian distribution.
Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set μ₀ to 0. If we set Λ₀ = (1/α)I, then μₘ gives the same estimate of w as does frequentist linear regression with a weight decay penalty of αwᵀw. One difference is that the Bayesian estimate is undefined if α is set to zero; we are not allowed to begin the Bayesian learning process with an infinitely wide prior on w. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate μₘ.
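The closed-form posterior above amounts to a few lines of linear algebra. A sketch on synthetic data (the dataset, α = 3, and μ₀ = 0 are illustrative assumptions), checking the correspondence with weight decay: μₘ coincides with the ridge solution (XᵀX + αI)⁻¹Xᵀy.

```python
import numpy as np

# Synthetic regression data (illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=40)

alpha = 3.0                    # illustrative choice; Lambda_0 = (1/alpha) I
mu_0 = np.zeros(2)             # prior mean set to 0, as in the text
Lambda_0_inv = alpha * np.eye(2)

# Posterior parameters from Eqs. 5.77-5.78.
Lambda_m = np.linalg.inv(X.T @ X + Lambda_0_inv)
mu_m = Lambda_m @ (X.T @ y + Lambda_0_inv @ mu_0)

# Frequentist linear regression with weight decay gives the same point.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)
assert np.allclose(mu_m, w_ridge)

# Unlike the point estimate, the posterior also carries Lambda_m,
# a covariance matrix quantifying remaining uncertainty about w.
```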
5.6.1 Maximum A Posteriori (MAP) Estimation

While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter θ, it is still often desirable to have a single point estimate. One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal posterior probability.
5.7 Supervised Learning Algorithms

Recall from Sec. 5.1.3 that supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases the outputs y may be difficult to collect automatically and must be provided by a human “supervisor,” but the term still applies even when the training set targets were collected automatically.
5.7.1 Probabilistic Supervised Learning

Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).

We have already seen that linear regression corresponds to the family p(y | x; θ) = N(y; θᵀx, I). We can generalize to binary classification by using the logistic sigmoid to squash the output of the linear function into the interval (0, 1) and interpreting that value as a probability:

p(y = 1 | x; θ) = σ(θᵀx).   (5.81)

This approach is known as logistic regression (a somewhat strange name since we use the model for classification rather than regression).

In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult. There is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent.

This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
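Since the logistic regression NLL has no closed-form minimizer, gradient descent does the work. A minimal sketch on synthetic labels (the data, learning rate, and iteration count are illustrative choices, not prescriptions from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
true_theta = np.array([1.5, -1.0])

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Labels drawn from the model p(y = 1 | x; theta) = sigmoid(theta^T x).
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

# Gradient descent on the mean negative log-likelihood.
theta = np.zeros(2)
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / len(y)  # gradient of the mean NLL w.r.t. theta
    theta -= lr * grad

p = sigmoid(X @ theta)
nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
# nll is now well below log(2) ~ 0.693, the NLL at the all-zeros theta.
```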
5.7.2 Support Vector Machines

One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function wᵀx + b. Unlike logistic
Another type of learning algorithm that also breaks the input space into regions and has separate parameters for each region is the decision tree (Breiman et al., 1984) and its many variants. As shown in Fig. 5.7, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one sub-region for each child of the node (typically using an axis-aligned cut). Space is thus sub-divided into non-overlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Decision trees are usually trained with specialized algorithms that are beyond the scope of this book. The learning algorithm can be considered non-parametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice. Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression. For example, if we have a two-class problem and the positive class occurs wherever x₂ > x₁, the decision boundary is not axis-aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.

As we have seen, nearest neighbor predictors and decision trees have many limitations. Nonetheless, they are useful learning algorithms when computational resources are constrained. We can also build intuition for more sophisticated learning algorithms by thinking about the similarities and differences between sophisticated algorithms and k-NN or decision tree baselines.

See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine learning textbooks for more material on traditional supervised learning algorithms.
5.8 Unsupervised Learning Algorithms

Recall from Sec. 5.1.3 that unsupervised algorithms are those that experience only “features” but not a supervision signal. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined because there is no objective test for distinguishing whether a value is a feature or a target provided by a supervisor. Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples. The term is usually associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near, or clustering the data into groups of
CHAPTER 5. MACHINE LEARNING BASICS
related examples.
A classic unsup
unsupervised
ervised learning task is to find the “best” represen representation
tation of the
related examples.
data. By ‘b ‘best’
est’ we can mean differen
differentt things, but generally sp speaking
eaking we are lo looking
oking
A classic unsup ervised
for a representation that preserv learning
preserves task
es as mucmuchis to find the
h information ab “best”
about represen tation
out x as possible while of the
data.
ob
obeying
eyingBysome
‘best’penalty
we canor mean differenaimed
constraint t things, but generally
at keeping speaking we are
the representation looking
simpler or
for a representation
more accessible thanthat preserves as much information about x as possible while
x itself.
obeying some penalty or constraint aimed at keeping the representation simpler or
There are multiple wa ways
ys of defining a simpler represen representation.
tation. Three of the
more accessible than x itself.
most common include lo low
wer dimensional represen representations,
tations, sparse representations
There
and indep are
independen
enden m ultiple
endentt represen wa ys
tations. Low-dimensionalrepresen
representations.of defining a simpler tation. Three
representations attemptof the
to
most common
compress as minclude
uc
uch lower dimensional
h information ab out x as
about represen
possible tations, sparse representations
in a smaller represen
representation.
tation.
and indep
Sparse endentations
represen t represen
representations tations.
(Barlow , 1989Low-dimensional
; Olshausen and representations
Field, 1996; Hin attempt
Hinton
ton and to
compress as ,m1997
Ghahramani uch )information
embed the ab dataset x
out into as possible
a represenin atation
representation smaller represen
whose tation.
entries are
Sparse
mostly zerorepresen
zeroes tations (Barlow , 1989 ; Olshausen and Field
es for most inputs. The use of sparse representations typically requires, 1996 ; Hin ton and
Ghahramani
increasing , 1997
the ) embed theofdataset
dimensionality into a represen
the representation, so tation
that thewhose entries
represen
representation are
tation
mostly
b ecoming zero es for zero
mostly mostesinputs.
zeroes do
does
es notThe use ofto
discard sparse
too o muc
much representations
h information. This typically requires
results in an
increasing the dimensionality
overall structure of the represen of
representation the representation, so that the
tation that tends to distribute data along the axesrepresen tation
b ecoming mostly
of the represen
representation zero es do es
tation space. Indep not discard
endentttorepresen
Independen
enden o much tations
information.
representations attempt Thistoresults in an
disentangle
overall
the structure
sources of the represen
of variation underlying tationthethat
datatends to distribute
distribution suc
such data the
h that along the axes
dimensions
of the
of the representation
representation are space. Independen
statistically t representations attempt to disentangle
independent.
the sources of variation underlying the data distribution such that the dimensions
Of course these three criteria are certainly not mutually exclusive. Lo Low-
w-
of the representation are statistically independent.
dimensional representations often yield elements that hav havee fewer or weak eakerer de-
Of course
pendencies thanthese
the three
original criteria are certainly
high-dimensional not This
data. mutually exclusive.
is because one waLo y w-
to
dimensional representations often yield
reduce the size of a representation is to find and remov elements that hav e fewer or w
removee redundancies. Identifyingeak er de-
p endencies
and remo
removing than the original high-dimensional
ving more redundancy allows the dimensionality data. This reduction
is becausealgorithm
one way to to
reduce
ac
achiev
hiev the size of a representation is to find
hievee more compression while discarding less information. and remov e redundancies. Identifying
and removing more redundancy allows the dimensionality reduction algorithm to
The
achiev notioncompression
e more of representation is one of the
while discarding central
less themes of deep learning and
information.
therefore one of the central themes in this book. In this section, we dev develop
elop some
The notion of representation
simple examples of represenrepresentation is one of the central themes of
tation learning algorithms. Together, these example deep learning and
therefore one of the
algorithms show how to op central themes
operationalize in this b o ok. In this
erationalize all three of the criteria ab section, w
above
ov
ove.dev elop
e. Most of thesome
simple examples
remaining chapters of represen
in
intro
tro
troduce
ducetation learning
additional algorithms. T
representation ogether,algorithms
learning these example that
algorithms
dev
develop show how to op erationalize
elop these criteria in different ways or in all three
intro
tro of
troduce the criteria
duce other criteria. ab ove. Most of the
remaining chapters introduce additional representation learning algorithms that
develop these criteria in different ways or introduce other criteria.
5.8.1 Principal Components Analysis

In Sec. 2.12, we saw that the principal components analysis algorithm provides a means of compressing data. We can also view PCA as an unsupervised learning algorithm that learns a representation of data. This representation is based on two of the criteria for a simple representation described above. PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other. This is a first step toward the criterion of learning representations whose elements are statistically independent. To achieve full independence, a representation learning algorithm must also remove the nonlinear relationships between variables.

[Fig. 5.8: illustration of PCA learning a linear projection z = x^\top W of the data.]

PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z, as shown in Fig. 5.8. In Sec. 2.12, we saw that we could learn a one-dimensional representation that best reconstructs the original data (in the sense of mean squared error) and that this representation actually corresponds to the first principal component of the data. Thus we can use PCA as a simple and effective dimensionality reduction method that preserves as much of the information in the data as possible (again, as measured by least-squares reconstruction error). In the following, we will study how the PCA representation decorrelates the original data representation X.

Let us consider the m × n design matrix X. We will assume that the data has a mean of zero, E[x] = 0. If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.

The unbiased sample covariance matrix associated with X is given by:

Var[x] = \frac{1}{m-1} X^\top X.    (5.85)
The principal components of X may be obtained from the eigendecomposition of X^\top X, which gives

X^\top X = W \Lambda W^\top.    (5.86)

In this section, we exploit an alternative derivation of the principal components. The principal components may also be obtained via the singular value decomposition. Specifically, they are the right singular vectors of X. To see this, let W be the right singular vectors in the decomposition X = U \Sigma W^\top. We then recover the original eigenvector equation with W as the eigenvector basis:

X^\top X = (U \Sigma W^\top)^\top U \Sigma W^\top = W \Sigma^2 W^\top.    (5.87)
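Eq. 5.87 can be sanity-checked numerically. The short NumPy sketch below (toy random data; all variable names are ours, not from the text) confirms that the right singular vectors of X coincide with the eigenvectors of X^\top X, and that the squared singular values equal its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Eigendecomposition of X^T X (Eq. 5.86); eigh returns ascending eigenvalues.
evals, evecs = np.linalg.eigh(X.T @ X)

# SVD X = U Sigma W^T (Eq. 5.87); singular values come back in descending order.
U, s, Wt = np.linalg.svd(X, full_matrices=False)

# Squared singular values equal the eigenvalues of X^T X.
assert np.allclose(np.sort(s**2), np.sort(evals))

# Each right singular vector equals an eigenvector, up to sign.
for i in range(4):
    w_i = Wt[i]                                     # right singular vector for s[i]
    v_i = evecs[:, np.argmin(np.abs(evals - s[i]**2))]
    assert np.allclose(np.abs(w_i @ v_i), 1.0)      # |cosine| = 1 means same direction
```

The sign ambiguity in the last check is expected: eigenvectors and singular vectors are only defined up to a factor of −1.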
The SVD is helpful to show that PCA results in a diagonal Var[z]. Using the SVD of X, we can express the variance of X as:

Var[x] = \frac{1}{m-1} X^\top X    (5.88)
       = \frac{1}{m-1} (U \Sigma W^\top)^\top U \Sigma W^\top    (5.89)
       = \frac{1}{m-1} W \Sigma^\top U^\top U \Sigma W^\top    (5.90)
       = \frac{1}{m-1} W \Sigma^2 W^\top,    (5.91)

where we use the fact that U^\top U = I because the U matrix of the singular value decomposition is defined to be orthonormal. This shows that if we take z = x^\top W, we can ensure that the covariance of z is diagonal as required:

Var[z] = \frac{1}{m-1} Z^\top Z    (5.92)
       = \frac{1}{m-1} W^\top X^\top X W    (5.93)
       = \frac{1}{m-1} W^\top W \Sigma^2 W^\top W    (5.94)
       = \frac{1}{m-1} \Sigma^2,    (5.95)

where this time we use the fact that W^\top W = I, again from the definition of the SVD.
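As a quick numerical check of this derivation, the following NumPy sketch (toy correlated data of our own making) verifies that projecting centered data onto the right singular vectors yields a representation whose covariance is diagonal, with diagonal entries \sigma_i^2 / (m-1) as in Eq. 5.95:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated data: m examples, n features, mixed by a random matrix A.
m, n = 500, 3
A = rng.normal(size=(n, n))
X = rng.normal(size=(m, n)) @ A
X -= X.mean(axis=0)                      # center so that E[x] = 0

# Right singular vectors of X give the PCA directions (Eq. 5.87).
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

Z = X @ W                                # z = x^T W for every example

cov_z = Z.T @ Z / (m - 1)                # Eq. 5.92

# Off-diagonal entries vanish (up to floating point), matching Eq. 5.95 ...
off_diag = cov_z - np.diag(np.diag(cov_z))
assert np.allclose(off_diag, 0.0, atol=1e-8)
# ... and the diagonal equals sigma^2 / (m - 1).
assert np.allclose(np.diag(cov_z), s**2 / (m - 1))
```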
The above analysis shows that when we project the data x to z, via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by \Sigma^2), which immediately implies that the individual elements of z are mutually uncorrelated.

This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (described by W) that aligns the principal axes of variance with the basis of the new representation space associated with z.

While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation.
5.8.2 k-means Clustering

Another example of a simple representation learning algorithm is k-means clustering. The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other. We can thus think of the algorithm as providing a k-dimensional one-hot code vector h representing an input x. If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.

The one-hot code provided by k-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Later, we will develop other algorithms that learn more flexible sparse representations, where more than one entry can be non-zero for each input x. One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation. The one-hot code still confers some statistical advantages (it naturally conveys the idea that all examples in the same cluster are similar to each other), and it confers the computational advantage that the entire representation may be captured by a single integer.

The k-means algorithm works by initializing k different centroids {μ^(1), ..., μ^(k)} to different values, then alternating between two different steps until convergence. In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid μ^(i). In the other step, each centroid μ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
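The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data, not a production implementation; the data, the choice of k, and the fixed iteration count are all our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated 2-D blobs of 100 points each.
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])
k = 2

# Initialize the k centroids to distinct training examples.
mu = X[rng.choice(len(X), size=k, replace=False)].copy()

for _ in range(10):
    # Step 1: assign each example to the index of its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned examples
    # (keeping a centroid in place if its cluster happens to be empty).
    mu = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else mu[i]
                   for i in range(k)])

# The one-hot code h: h[j, i] = 1 iff example j belongs to cluster i.
h = np.eye(k)[assign]
```

Note that the entire representation of an example is the single integer `assign[j]`; the one-hot vector `h[j]` is just a sparse re-encoding of it.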
5.9 Stochastic Gradient Descent

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD. Stochastic gradient descent is an extension of the gradient descent algorithm introduced in Sec. 4.3.

A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

J(\theta) = E_{x, y \sim \hat{p}_{data}} L(x, y, \theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)    (5.96)

where L is the per-example loss L(x, y, \theta) = -\log p(y \mid x; \theta).

For these additive cost functions, gradient descent requires computing

\nabla_{\theta} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} L(x^{(i)}, y^{(i)}, \theta).    (5.97)

The computational cost of this operation is O(m). As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.

The insight of stochastic gradient descent is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x^(1), ..., x^(m′)} drawn uniformly from the training set. The minibatch size m′ is typically chosen to be a relatively small number of examples, ranging from 1 to a few hundred. Crucially, m′ is usually held fixed as the training set size m grows. We may fit a training set with billions of examples using updates computed on only a hundred examples.

The estimate of the gradient is formed as

g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_{\theta} L(x^{(i)}, y^{(i)}, \theta)    (5.98)

using examples from the minibatch B. The stochastic gradient descent algorithm then follows the estimated gradient downhill:

\theta \leftarrow \theta - \epsilon g,    (5.99)

where \epsilon is the learning rate.
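As a concrete sketch of the minibatch estimate (Eq. 5.98) and the update (Eq. 5.99), here is a toy example of our own: a linear model with squared error standing in for the per-example loss L, and made-up data, learning rate, and step count:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: y = x . w_true + noise, with m much larger than the batch.
m, n = 10_000, 5
w_true = rng.normal(size=n)
X = rng.normal(size=(m, n))
y = X @ w_true + 0.01 * rng.normal(size=m)

theta = np.zeros(n)   # parameters to learn
eps = 0.1             # learning rate epsilon
m_prime = 100         # minibatch size m', held fixed as m grows

for _ in range(500):
    # Sample a minibatch B uniformly from the training set.
    idx = rng.choice(m, size=m_prime, replace=False)
    Xb, yb = X[idx], y[idx]
    # Minibatch gradient estimate g (Eq. 5.98) for the mean squared error.
    g = (2.0 / m_prime) * Xb.T @ (Xb @ theta - yb)
    # Follow the estimated gradient downhill (Eq. 5.99).
    theta = theta - eps * g

# Each update costs O(m'), independent of the total training set size m.
```

After the loop, `theta` is close to `w_true` even though every update touched only 100 of the 10,000 examples.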
Gradient descent in general has often been regarded as slow or unreliable. In the past, the application of gradient descent to non-convex optimization problems was regarded as foolhardy or unprincipled. Today, we know that the machine learning models described in Part II work very well when trained with gradient descent. The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.

Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets. For a fixed model size, the cost per SGD update does not depend on the training set size m. In practice, we often use a larger model as the training set size increases, but we are not forced to do so. The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model's best possible test error. From this point of view, one can argue that the asymptotic cost of training a model with SGD is O(1) as a function of m.

Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an m × m matrix G_{i,j} = k(x^{(i)}, x^{(j)}). Constructing this matrix has computational cost O(m^2), which is clearly undesirable for datasets with billions of examples. In academia, starting in 2006, deep learning was initially interesting because it was able to generalize to new examples better than competing algorithms when trained on medium-sized datasets with tens of thousands of examples. Soon after, deep learning garnered additional interest in industry, because it provided a scalable way of training nonlinear models on large datasets.

Stochastic gradient descent and many enhancements to it are described further in Chapter 8.
5.10 Building a Machine Learning Algorithm

Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.

For example, the linear regression algorithm combines a dataset consisting of X and y, the cost function

J(w, b) = -E_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x),    (5.100)

the model specification p_{model}(y \mid x) = \mathcal{N}(y; x^\top w + b, 1), and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.

By realizing that we can replace any of these components mostly independently from the others, we can obtain a very wide variety of algorithms.

The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.

The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain

J(w, b) = \lambda \|w\|_2^2 - E_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x).    (5.101)

This still allows closed-form optimization.

If we change the model to be nonlinear, then most cost functions can no longer be optimized in closed form. This requires us to choose an iterative numerical optimization procedure, such as gradient descent.

The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning. Unsupervised learning can be supported by defining a dataset that contains only X and providing an appropriate unsupervised cost and model. For example, we can obtain the first PCA vector by specifying that our loss function is

J(w) = E_{x \sim \hat{p}_{data}} \|x - r(x; w)\|_2^2    (5.102)

while our model is defined to have w with norm one and reconstruction function r(x) = w^\top x \, w.

In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization so long as we have some way of approximating its gradients.

Most machine learning algorithms make use of this recipe, though it may not immediately be obvious. If a machine learning algorithm seems especially unique or hand-designed, it can usually be understood as using a special-case optimizer.
CHAPTER 5. MACHINE LEARNING BASICS
5.11 Challenges Motivating Deep Learning

The simple machine learning algorithms described in this chapter work very well on a wide variety of important problems. However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects.

The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.
5.11.1 The Curse of Dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
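This exponential growth can be made concrete with a few lines of arithmetic; the number of distinguishable values per variable and the example budget below are illustrative assumptions, not taken from the text:

```python
# With v distinguishable values per variable and d variables, there are v**d
# possible configurations, so a fixed budget of examples covers a vanishing
# fraction of them as d grows.
v, budget = 10, 1000
for d in range(1, 6):
    configs = v ** d
    print(f"d={d}: {configs} configurations, "
          f"fraction coverable by {budget} examples: {min(1.0, budget / configs):.3f}")
```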
[Figure 5.9: with v = 10 distinguishable values per dimension, the number of possible configurations grows as O(v^d); by d = 3 dimensions there are already 10 × 10 × 10 = 1000 regions.]
How could we possibly say something meaningful about these new configurations? Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point.
5.11.2 Local Constancy and Smoothness Regularization

In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Previously, we have seen these priors incorporated as explicit beliefs in the form of probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as directly influencing the function itself and only indirectly acting on the parameters via their effect on the function. Additionally, we informally discuss prior beliefs as being expressed implicitly, by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even possible to express) in terms of a probability distribution representing our degree of belief in various functions.

Among the most widely used of these implicit “priors” is the smoothness prior or local constancy prior. This prior states that the function we learn should not change very much within a small region.

Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AI-level tasks. Throughout this book, we will describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks. Here, we explain why the smoothness prior alone is insufficient for these tasks.
There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All of these different methods are designed to encourage the learning process to learn a function f∗ that satisfies the condition

    f∗(x) ≈ f∗(x + ε)                                              (5.103)

for most configurations x and small change ε. In other words, if we know a good answer for an input x (for example, if x is a labeled training example) then that answer is probably good in the neighborhood of x. If we have several good answers in some neighborhood we would combine them (by some form of averaging or interpolation) to produce an answer that agrees with as many of them as much as possible.
An extreme example of the local constancy approach is the k-nearest neighbors
family of learning algorithms. These predictors are literally constant over each region containing all the points x that have the same set of k nearest neighbors in the training set. For k = 1, the number of distinguishable regions cannot be more than the number of training examples.
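The piecewise-constant behavior is easy to see in one dimension; the training points and labels below are made up purely for illustration:

```python
import numpy as np

# A minimal 1-D 1-nearest-neighbor predictor: one constant output region
# per training point (illustrative data, not from the text).
X_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([10.0, 20.0, 30.0])

def nn_predict(x):
    # Copy the label of the nearest training point.
    return y_train[np.argmin(np.abs(X_train - x))]

# Every query whose nearest neighbor is the point at 1.0 gets the same output,
# so with 3 training points there are at most 3 distinct outputs.
print(nn_predict(0.9), nn_predict(1.1), nn_predict(1.49))  # all 20.0
print(nn_predict(1.51))                                    # 30.0
```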
While the k-nearest neighbors algorithm copies the output from nearby training examples, most kernel machines interpolate between training set outputs associated with nearby training examples. An important class of kernels is the family of local kernels where k(u, v) is large when u = v and decreases as u and v grow farther apart from each other. A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example x resembles each training example x^(i). Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b).
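A Gaussian (RBF) kernel is a standard instance of such a local kernel; the bandwidth here is an arbitrary illustrative choice:

```python
import numpy as np

# A Gaussian kernel: maximal at u = v, decaying as the points move apart.
def local_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

u = np.array([0.0, 0.0])
print(local_kernel(u, u))                     # 1.0 at u = v
print(local_kernel(u, np.array([1.0, 0.0])))  # smaller
print(local_kernel(u, np.array([3.0, 0.0])))  # near zero: no template match

# A kernel-machine prediction built from such a kernel is a weighted
# combination of training outputs, dominated by nearby (similar) examples.
```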
Decision trees also suffer from the limitations of exclusively smoothness-based learning because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree. A multiple of n is needed to achieve some level of statistical confidence in the predicted output.
In general, to distinguish O(k) regions in input space, all of these methods require O(k) examples. Typically there are O(k) parameters, with O(1) parameters associated with each of the O(k) regions. The case of a nearest neighbor scenario, where each training example can be used to define at most one region, is illustrated in Fig. 5.10.

Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples? Clearly, assuming only smoothness of the underlying function will not allow a learner to do that. For example, imagine that the target function is a kind of checkerboard. A checkerboard contains many variations but there is a simple structure to them. Imagine what happens when the number of training examples is substantially smaller than the number of black and white squares on the checkerboard. Based on only local generalization and the smoothness or local constancy prior, we would be guaranteed to correctly guess the color of a new point if it lies within the same checkerboard square as a training example. There is no guarantee that the learner could correctly extend the checkerboard pattern to points lying in squares that do not contain training examples. With this prior alone, the only information that an
example tells us is the color of its square, and the only way to get the colors of the entire checkerboard right is to cover each of its cells with at least one example.
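The checkerboard thought experiment is easy to simulate with a 1-nearest-neighbor learner; the board size and sample counts below are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: an 8x8 checkerboard coloring on [0, 8)^2.
def color(p):
    return int(np.floor(p[0]) + np.floor(p[1])) % 2

def accuracy(n_train, n_test=2000):
    train_X = rng.uniform(0, 8, size=(n_train, 2))
    train_y = np.array([color(p) for p in train_X])
    test_X = rng.uniform(0, 8, size=(n_test, 2))
    # 1-NN: copy the color of the nearest training point (pure local constancy).
    preds = [train_y[np.argmin(np.linalg.norm(train_X - p, axis=1))]
             for p in test_X]
    return np.mean([pred == color(p) for pred, p in zip(preds, test_X)])

# Far fewer examples than the 64 squares: accuracy near chance.
# Many more examples than squares: most squares contain training points.
print(accuracy(10), accuracy(1000))
```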
The smoothness assumption and the associated non-parametric learning algorithms work extremely well so long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned. This is generally true when the function to be learned is smooth enough and varies in few enough dimensions. In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples. If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?
The answer to both of these questions is yes. The key insight is that a very large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long as we introduce some dependencies between the regions via additional assumptions about the underlying data generating distribution. In this way, we can actually generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many different deep learning algorithms provide implicit or explicit assumptions that are reasonable for a broad range of AI tasks in order to capture these advantages.

Other approaches to machine learning often make stronger, task-specific assumptions. For example, we could easily solve the checkerboard task by providing the assumption that the target function is periodic. Usually we do not include such strong, task-specific assumptions into neural networks so that they can generalize to a much wider variety of structures. AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions.

The core idea in deep learning is that we assume that the data was generated by the composition of factors, or features, potentially at multiple levels in a hierarchy. Many other similarly generic assumptions can further improve deep learning algorithms. These apparently mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. These exponential gains are described more precisely in Sec. 6.4.1, Sec. 15.4, and Sec. 15.5. The exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality.
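One way to see how O(k) parameters can distinguish on the order of 2^k regions is to count the sign patterns produced by k hyperplanes, a rough sketch of the distributed-representation idea; the dimensionalities and sample counts below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# k random hyperplanes through the origin in R^k assign each input a k-bit
# sign pattern, so k normal vectors (O(k) parameters) can carve out up to
# 2**k distinct regions.
def n_distinct_patterns(k, n_points=50_000):
    W = rng.normal(size=(k, k))   # one hyperplane normal per row
    X = rng.normal(size=(n_points, k))
    codes = X @ W.T > 0           # boolean sign pattern per point
    return len(np.unique(codes, axis=0))

for k in [2, 4, 8]:
    print(k, n_distinct_patterns(k))  # grows toward 2**k
```

A nearest-neighbor learner, by contrast, needs a separate example for every region it distinguishes.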
5.11.3 Manifold Learning

An important concept underlying many ideas in machine learning is that of a manifold.

A manifold is a connected region. Mathematically, it is a set of points, associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.
world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.
of transformations that can be applied to mo movve on the manifold from one position
to aThe definitionone.
neighboring of aInneigh
the bexample
orhood surrounding
of the world’s each pointasimplies
surface the existence
a manifold, one can
of transformations that
walk north, south, east, or west.can b e applied to mo ve on the manifold from one position
to a neighboring one. In the example of the world’s surface as a manifold, one can
Although there is a formal mathematical meaning to the term “manifold,”
walk north, south, east, or west.
Although there is a formal mathematical meaning to the term “manifold,” in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation. See Fig. 5.11 for an example of training data lying near a one-dimensional manifold embedded in two-dimensional space. In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.
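The unit circle gives a minimal concrete instance of these ideas: a 1-D manifold embedded in 2-D space, where the single intrinsic coordinate (the angle) can be recovered exactly. This toy construction is an illustration, not an algorithm from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Points on the unit circle: each is stored as two ambient coordinates but
# has only one intrinsic degree of freedom, the angle theta.
theta = rng.uniform(0, 2 * np.pi, size=1000)          # manifold coordinate
X = np.column_stack([np.cos(theta), np.sin(theta)])   # ambient 2-D coordinates

# Moving along the manifold = changing theta; here we can invert the
# embedding exactly and recover the intrinsic coordinate.
recovered = np.arctan2(X[:, 1], X[:, 0]) % (2 * np.pi)
print(np.allclose(recovered, theta))  # True: one number per point suffices
```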
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of R^n. Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting: the key assumption remains that probability mass is highly concentrated.
The assumption that the data lies along a low-dimensional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is at least approximately correct. The evidence in favor of this assumption consists of two categories of observations.

The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise essentially never resembles structured inputs from these domains. Fig. 5.12 shows how, instead, uniformly sampled points look like the patterns of static that appear on analog television sets when no signal is available. Similarly, if you generate a document by picking letters uniformly at random, what is the probability that you will get a meaningful English-language text? Almost zero, again, because most of the long sequences of letters do not correspond to a natural language sequence: the distribution of natural language sequences occupies a very small volume in the total space of sequences of letters.
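The random-letters argument can be checked numerically; the tiny word list below is an illustrative stand-in for a real dictionary (a full dictionary would still be a vanishing fraction of all 26^3 three-letter strings):

```python
import random

random.seed(0)
# Illustrative stand-in word list, not a real dictionary.
words = {"cat", "dog", "the", "and", "you", "are", "not", "but", "can", "all"}

# Draw random three-letter strings and count how often they are "meaningful".
n_trials = 100_000
hits = sum(
    "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=3)) in words
    for _ in range(n_trials)
)
print(hits / n_trials)  # a tiny fraction: meaningful strings occupy almost no volume
```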
examples, with each example surrounded by other highly similar examples that may be reached by applying transformations to traverse the manifold. The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.

These thought experiments supporting the manifold hypothesis convey some intuitive reasons supporting it. More rigorous experiments (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004) clearly support the hypothesis for a large class of datasets of interest in AI.
When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in R^n. In everyday life, we can think of roads as 1-D manifolds embedded in 3-D space. We give directions to specific addresses in terms of address numbers along these 1-D roads, not in terms of coordinates in 3-D space. Extracting these manifold coordinates is challenging, but holds the promise to improve many machine learning algorithms. This general principle is applied in many contexts. Fig. 5.13 shows the manifold structure of a dataset consisting of faces. By the end of this book, we will have developed the methods necessary to learn such a manifold structure. In Fig. 20.6, we will see how a machine learning algorithm can successfully accomplish this goal.

This concludes Part I, which has provided the basic concepts in mathematics and machine learning which are employed throughout the remaining parts of the book. You are now prepared to embark upon your study of deep learning.
CHAPTER 5. MACHINE LEARNING BASICS
Part II

Deep Networks: Modern Practices
This part of the book summarizes the state of modern deep learning as it is used to solve practical applications.

Deep learning has a long history and many aspirations. Several approaches have been proposed that have yet to entirely bear fruit. Several ambitious goals have yet to be realized. These less-developed branches of deep learning appear in the final part of the book.

This part focuses only on those approaches that are essentially working technologies that are already used heavily in industry.
Modern deep learning provides a very powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity. Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples. Other tasks, that cannot be described as associating one vector to another, or that are difficult enough that a person would require time to think and reflect in order to accomplish the task, remain beyond the scope of deep learning for now.
This part of the book describes the core parametric function approximation technology that is behind nearly all modern practical applications of deep learning. We begin by describing the feedforward deep network model that is used to represent these functions. Next, we present advanced techniques for regularization and optimization of such models. Scaling these models to large inputs such as high resolution images or long temporal sequences requires specialization. We introduce the convolutional network for scaling to large images and the recurrent neural network for processing temporal sequences. Finally, we present general guidelines for the practical methodology involved in designing, building, and configuring an application involving deep learning, and review some of the applications of deep learning.

These chapters are the most important for a practitioner, someone who wants to begin implementing and using deep learning algorithms to solve real-world problems today.
Chapter 6

Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented in Chapter 10.
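This forward flow of information can be sketched as a short chain of computations. In the sketch below, the layer sizes, the use of ReLU for the intermediate computation, and the random parameters are all illustrative assumptions, not anything prescribed by the text:

```python
import numpy as np

# A minimal feedforward computation: information flows from x through an
# intermediate computation to the output y, with no feedback connections.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)

def f(x):
    h = np.maximum(0, x @ W1 + b1)  # intermediate computation (hidden layer)
    return h @ W2 + b2              # output y

x = rng.standard_normal((5, 3))     # a batch of 5 inputs
y = f(x)
print(y.shape)                      # (5, 1)
```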
Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length
The question is then how to choose the mapping φ.

1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)ᵀw. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic; we do so by using a very broad family φ(x; θ). This approach can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.
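A minimal sketch of the third strategy, with φ itself parametrized and learnable. The single ReLU layer, its width, and the random initialization are assumptions chosen for illustration:

```python
import numpy as np

# Strategy 3: y = f(x; theta, w) = phi(x; theta)^T w, where phi is learned.
rng = np.random.default_rng(0)
theta_W = 0.5 * rng.standard_normal((2, 8))  # parameters theta of phi
theta_b = np.zeros(8)
w = 0.5 * rng.standard_normal(8)             # maps phi(x) to the output

def phi(x):
    # The learned representation: a hidden layer of a deep feedforward network.
    return np.maximum(0, x @ theta_W + theta_b)

def f(x):
    return phi(x) @ w

x = np.array([[0.0, 1.0], [1.0, 0.0]])
print(f(x).shape)   # (2,)
```

Both theta_W, theta_b and w would be adjusted by the optimization algorithm; nothing about φ is fixed in advance beyond its functional form.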
This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all of the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic
approaches for modeling binary data.

Evaluated on our whole training set, the MSE loss function is

    J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))².    (6.1)

Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

    f(x; w, b) = xᵀw + b.    (6.2)
[Fig. 6.1: the XOR examples plotted in the original input space (axes x1, x2) and in the learned feature space (axes h1, h2).]
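As a quick numerical check, we can solve for the best linear model of Eq. 6.2 under the loss of Eq. 6.1. The assumption here, taken from this section's example, is that the training set consists of the four binary input points with targets f*(x) = XOR(x1, x2):

```python
import numpy as np

# Best linear model (Eq. 6.2) for the XOR example, under the loss of Eq. 6.1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)        # assumed targets: XOR(x1, x2)

A = np.hstack([X, np.ones((4, 1))])            # append a column of ones for b
theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit of [w1, w2, b]
w, b = theta[:2], theta[2]

J = 0.25 * np.sum((y - (X @ w + b)) ** 2)      # Eq. 6.1
print(w, b, J)                                 # w is ~[0, 0], b = 0.5, J = 0.25
```

The linear model can do no better than outputting 0.5 everywhere, which is why a nonlinear feature space is needed.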
We can now specify our complete network as

    f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b.    (6.3)

We can now specify a solution to the XOR problem. Let

    W = [ 1  1
          1  1 ],    (6.4)

    c = [  0
          −1 ],    (6.5)

    w = [  1
          −2 ],    (6.6)
and b = 0.

We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

    X = [ 0 0
          0 1
          1 0
          1 1 ].    (6.7)

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

    XW = [ 0 0
           1 1
           1 1
           2 2 ].    (6.8)

Next, we add the bias vector c, to obtain

    [ 0 −1
      1  0
      1  0
      2  1 ].    (6.9)

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [ 0 0
      1 0
      1 0
      2 1 ].    (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in Fig. 6.1, they now lie in a space where a linear model can solve the problem.

We finish by multiplying by the weight vector w:

    [ 0
      1
      1
      0 ].    (6.11)
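The whole walkthrough can be reproduced in a few lines, as a direct transcription of Eqs. 6.3 to 6.11:

```python
import numpy as np

# The complete XOR network, following the walkthrough step by step.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # design matrix (Eq. 6.7)
W = np.array([[1, 1], [1, 1]])                  # Eq. 6.4
c = np.array([0, -1])                           # Eq. 6.5
w = np.array([1, -2])                           # Eq. 6.6
b = 0

XW = X @ W                    # first-layer product (Eq. 6.8)
H = np.maximum(0, XW + c)     # add bias, then rectified linear (Eqs. 6.9-6.10)
y = H @ w + b                 # output layer (Eq. 6.11)
print(y)                      # [0 1 1 0], the XOR of each input pair
```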
6.2 Gradient-Based Learning

Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In Sec. 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory; in practice it is very robust but can encounter numerical problems). Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models will be described in detail in Chapter 8, with parameter initialization in particular discussed in Sec. 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another. The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in Sec. 4.3, and,
more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in Sec. 5.9.

We can of course, train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly. Sec. 6.5 will describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. We now revisit these design considerations with special emphasis on the neural networks scenario.
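As a concrete, deliberately tiny illustration of such iterative, gradient-based training, the sketch below runs plain stochastic gradient descent on a linear-regression cost. The data, model, learning rate, and step count are all illustrative assumptions:

```python
import numpy as np

# A minimal stochastic gradient descent loop on a linear-regression cost.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(64)

w = np.zeros(3)                       # small initial parameter values
lr = 0.1
for step in range(200):
    i = rng.integers(0, 64, size=8)   # draw a minibatch
    grad = 2 * X[i].T @ (X[i] @ w - y[i]) / len(i)  # gradient of the MSE cost
    w -= lr * grad                    # descend the cost function
print(np.round(w, 2))                 # approximately [ 1.  -2.   0.5]
```

Each step merely drives the cost a little lower; there is no closed-form solve, which is exactly the regime neural networks live in.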
An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in Sec. 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks will be described in Chapter 7.

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described
as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in Sec. 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||² + const,    (6.13)

up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.

An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function −log p(y | x).
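The equivalence between Eq. 6.12 and Eq. 6.13 can be checked numerically: with a unit-covariance Gaussian output distribution, the negative log-likelihood is half the squared error plus a constant that does not depend on θ. The toy targets and predictions below are assumptions made only for the check:

```python
import numpy as np

# Check of Eqs. 6.12-6.13 with p_model(y | x) = N(y; f(x; theta), I).
rng = np.random.default_rng(0)
d = 3
y = rng.standard_normal((100, d))   # targets
f = rng.standard_normal((100, d))   # stand-in model predictions f(x; theta)

# Average negative log of the Gaussian density N(y; f, I)  (Eq. 6.12).
nll = np.mean(0.5 * np.sum((y - f) ** 2, axis=1) + 0.5 * d * np.log(2 * np.pi))

# Half mean squared error plus the theta-independent constant  (Eq. 6.13).
mse_half = 0.5 * np.mean(np.sum((y - f) ** 2, axis=1))
const = 0.5 * d * np.log(2 * np.pi)
print(np.isclose(nll, mse_half + const))   # True
```

Nothing in the check depends on how f is produced, which is the point: the equivalence holds for any predictor of the Gaussian mean.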
One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Many output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit in Sec. 6.2.2.
One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in Chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.
Different cost functions give different statistics. A second result derived using calculus of variations is that

f* = arg min_f E_{x,y∼p_data} ||y − f(x)||₁ (6.16)

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.
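The contrast between the two statistics can be seen numerically: for a fixed x, the constant prediction minimizing the L1 cost sits at the median of the y samples, while the L2 cost is minimized at the mean. A small sketch, assuming NumPy and a deliberately skewed sample:

```python
import numpy as np

y = np.array([0.0, 0.0, 10.0])       # skewed samples of y for one fixed x
cands = np.linspace(-2, 12, 1401)    # candidate constant predictions

mae = np.array([np.abs(y - c).mean() for c in cands])   # L1 cost
mse = np.array([((y - c) ** 2).mean() for c in cands])  # L2 cost

# The L1 cost is minimized near the median, the L2 cost near the mean.
assert abs(cands[np.argmin(mae)] - np.median(y)) < 0.02
assert abs(cands[np.argmin(mse)] - y.mean()) < 0.02
```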
Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well. We revisit these units with additional detail about their use as hidden units in Sec. 6.3.

Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
One simple kind of output unit is an output unit based on an affine transformation with no nonlinearity. These are often just called linear units.

Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

p(y | x) = N(y; ŷ, I). (6.17)
Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

P(y = 1 | x) = max{0, min{1, wᵀh + b}}. (6.18)
ŷ = σ(wᵀh + b). (6.19)
J(θ) = ζ((1 − 2y)z). (6.26)

This derivation makes use of some properties from Sec. 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer—when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act to quickly correct a mistaken z.
When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.
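A minimal sketch of this advice, assuming NumPy: writing the loss as ζ((1 − 2y)z) (the softplus form derived above, with ζ implemented via the usual max/log1p identity) stays finite for extreme logits, while first computing ŷ = σ(z) and then taking its logarithm blows up once σ underflows:

```python
import numpy as np

def softplus(x):
    """zeta(x) = log(1 + exp(x)), computed without overflow."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def nll_from_logit(z, y):
    """-log P(y | x) written directly in terms of z: zeta((1 - 2y) z)."""
    return softplus((1 - 2 * y) * z)

def nll_from_yhat(z, y):
    """Same quantity via y_hat = sigmoid(z); fragile for extreme z."""
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# For moderate logits the two forms agree.
assert np.isclose(nll_from_logit(2.0, 1), nll_from_yhat(2.0, 1))

# For an extremely wrong logit, sigmoid underflows to exactly 0 and the
# naive form returns infinity, while the softplus form returns |z| = 800.
with np.errstate(over="ignore", divide="ignore"):
    assert np.isinf(nll_from_yhat(-800.0, 1))
assert np.isclose(nll_from_logit(-800.0, 1), 800.0)
```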
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.
In the case of binary variables, we wished to produce a single number

ŷ = P(y = 1 | x). (6.27)

can cause similar difficulties for learning if the loss function is not designed to compensate for it.
The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = Wᵀh + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)₁ with a two-dimensional z and z₁ = 0. Both the n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions, but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.
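The equivalence claimed here can be checked directly. A small sketch, assuming NumPy, comparing a sigmoid unit to a two-class softmax whose second logit is fixed at zero:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 1.7  # a single logit for the "y = 1" class

# Restricted two-class softmax: fix the other logit at 0.
p = softmax(np.array([z, 0.0]))

# softmax([z, 0])_1 = e^z / (e^z + 1) = sigmoid(z)
assert np.isclose(p[0], sigmoid(z))
assert np.isclose(p.sum(), 1.0)
```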
From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex. At the extreme (when the difference between the maximal a_i and the others is large in magnitude) it becomes a form of winner-take-all (one of the outputs is nearly 1 and the others are nearly 0).
The name “softmax” can be somewhat confusing. The function is more closely related to the argmax function than the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The argmax function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the argmax. The corresponding soft version of the maximum function is softmax(z)ᵀz. It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.
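This distinction is easy to see numerically. A short sketch, assuming NumPy: softmax(z)ᵀz behaves as a smooth stand-in for max(z), and softmax itself approaches a one-hot argmax as the logits are scaled up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 8.0])

# softmax(z)^T z: the "soft" version of the maximum function.
soft_max = softmax(z) @ z
# The winning entry dominates, so the soft maximum is close to max(z).
assert abs(soft_max - z.max()) < 0.05

# Scaling the logits sharpens softmax toward a one-hot argmax vector.
assert softmax(10 * z).argmax() == z.argmax()
```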
a good cost function for nearly any kind of output layer.

diag(β). (6.34)

This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition, and logarithm operations is well-behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in terms of standard deviation, the log-likelihood would still involve division, and would also involve squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared.
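The two parametrizations give the same likelihood; only the operations involved differ. A scalar sketch, assuming NumPy, of the Gaussian negative log-likelihood written in terms of precision β versus variance σ²:

```python
import numpy as np

def nll_precision(y, mu, beta):
    """Per-example Gaussian NLL parametrized by precision beta = 1/sigma^2.
    Only multiplication by beta and addition of log(beta) appear."""
    return 0.5 * (beta * (y - mu) ** 2 - np.log(beta) + np.log(2 * np.pi))

def nll_variance(y, mu, var):
    """Same NLL parametrized by variance: note the division by var."""
    return 0.5 * ((y - mu) ** 2 / var + np.log(var) + np.log(2 * np.pi))

y, mu, var = 1.3, 1.0, 0.25

# Identical values; the precision form just avoids division in the
# theta-dependent terms, which keeps the gradient well-behaved.
assert np.isclose(nll_precision(y, mu, 1 / var), nll_variance(y, mu, var))
```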
6.3 Hidden Units

So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model.

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice). We
describe here some of the basic intuitions motivating each type of hidden units. These intuitions can be used to suggest when to try out each of these units. It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.
Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly, as shown in Fig. 4.3. These ideas will be described further in Chapter 8. Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient.

Hidden units that are not differentiable are usually non-differentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0 and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that in practice one can safely disregard the non-differentiability of the hidden unit activation functions described below.
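These one-sided derivatives can be probed with finite differences. A small sketch, assuming NumPy (the convention of returning 0 for the gradient at z = 0 is one common choice, not the only valid one):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def one_sided_derivatives(g, z, eps=1e-6):
    """Finite-difference estimates of the left and right derivatives at z."""
    left = (g(z) - g(z - eps)) / eps
    right = (g(z + eps) - g(z)) / eps
    return left, right

left, right = one_sided_derivatives(relu, 0.0)
assert np.isclose(left, 0.0) and np.isclose(right, 1.0)

# Implementations simply pick one of the one-sided values at z = 0;
# returning 0 there (the left derivative) is a common convention.
def relu_grad(z):
    return (np.asarray(z) > 0).astype(float)

assert relu_grad(0.0) == 0.0
```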
Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = Wᵀx + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
of these groups:

g(z)_i = max_{j∈G^(i)} z_j (6.37)

where G^(i) is the indices of the inputs for group i, {(i − 1)k + 1, . . . , ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.

A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, absolute value rectification function, or the leaky or parametric ReLU, or can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.
so maxout units typically need more regularization than rectified linear units. They
can Eac
workh maxout
well without unit isregularization
now parametrized if the by k weight
training set visectors
large instead
and theofnumber just one, of
so maxout units typically
pieces p er unit is kept lo low need more regularization
w (Cai et al., 2013). than rectified linear units. They
can work well without regularization if the training set is large and the number of
Maxout
pieces units
p er unit is hav
have
kepte alow few(Cai
other b enefits.
et al. , 2013).In some cases, one can gain some sta-
tistical and computational adv advantages
antages by requiring few fewer er parameters. Sp Specifically
ecifically
ecifically,,
if theMaxout
features units have a few
captured by notherdifferenb enefits.
different t linear Infilters
some cases,
can b eone can gain some
summarized sta-
without
tisticalinformation
losing and computational by takingadv theantages
max ov byerrequiring
over each group fewof erkparameters.
features, then Sp ecifically
the next,
if
la the
layer features captured by n
yer can get by with k times fewer weights. differen t linear filters can b e summarized without
losing information by taking the max over each group of k features, then the next
layerBecause
can geteach unit kis times
by with drivenfewerby mw ultiple
eights.filters, maxout units hav havee some redun-
dancy that helps them to resist a phenomenon called catastr atastrophic
ophic for forgetting
getting in
whic
which Because
h neural netweach unit
orks forget how to p erform tasks that they were trainedredun-
networks is driven b y m ultiple filters, maxout units hav e some on in
dancy that
the past (Go Goo helps
o dfellothem
dfellow to
w et al. resist
al.,, 2014a). a phenomenon called catastr ophic for getting in
which neural networks forget how to p erform tasks that they were trained on in
the Rectified
past (Go olineardfellounits
w et al.and all of).these generalizations of them are based on the
, 2014a
principle that mo models
dels are easier to optimize if their behavior is closer to linear.
Rectified linear
This same general principle units andofall of these
using lineargeneralizations
b ehavior to obtain of them are optimization
easier based on the
principle that mo
also applies in other con dels are
contexts easier to optimize
texts b esides deep linear netw if their behavior
networks. is
orks. Recurrent netw closer to
networksorkslinear.
can
This
learnsame
from general
sequences principle
and pro of
produce using
duce a linear
sequence b ehavior
of states toandobtain easier
outputs. optimization
When training
also applies
them, one needsin other contexts b
to propagate esides deepthrough
information linear netwsev orks.time
several
eral Recurrent
steps, which networks can
is much
learn from
easier whensequences
some linear andcomputations
pro duce a sequence (with someof states and outputs.
directional deriv When btraining
derivatives
atives eing of
them, one needs
magnitude near 1) are invto propagate
involv
olv information
olved. through
ed. One of the best-p sev eral
best-performing time steps,
erforming recurren which
recurrentt net is
netwmwuch
ork
easier when some linear computations (with some directional derivatives b eing of
magnitude near 1) are involved. One 193 of the best-p erforming recurrent network
architectures, the LSTM, propagates information through time via summation—a particular straightforward kind of such linear activation. This is discussed further in Sec. 10.10.
    g(z) = σ(z)                                                    (6.38)

or the hyperbolic tangent activation function

    g(z) = tanh(z).                                                (6.39)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
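The relation tanh(z) = 2σ(2z) − 1 is easy to confirm numerically; a quick sketch:

```python
import math

def sigmoid(z):
    # logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

# check tanh(z) = 2*sigma(2z) - 1 at a few points
for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12
```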
We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain—they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.
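Saturation shows up directly in the sigmoid's derivative, σ′(z) = σ(z)(1 − σ(z)), which peaks at z = 0 and is nearly zero for large |z|. A small numerical sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # derivative of the logistic sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# near z = 0 the unit is sensitive; far from 0 it saturates and the gradient vanishes
print(sigmoid_grad(0.0))    # 0.25, the maximum
print(sigmoid_grad(10.0))   # roughly 4.5e-05, nearly zero
```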
When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to the identity near 0, training a deep neural network ŷ = w⊤ tanh(U⊤ tanh(V⊤x)) resembles training a linear model ŷ = w⊤U⊤V⊤x so long as the activations of the network can be kept small. This makes training the tanh network easier.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
Many other types of hidden units are possible, but are used less frequently. In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, the authors tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.

It would be impractical to list all of the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.
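Before turning to those, note that the cos(Wx + b) layer mentioned above is a one-liner. In this sketch the layer sizes and initialization scale are illustrative assumptions, not the authors' actual MNIST setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# a hidden layer h = cos(Wx + b); toy shapes standing in for an MNIST-sized input
n_inputs, n_hidden = 784, 128
W = rng.standard_normal((n_hidden, n_inputs)) * 0.01
b = np.zeros(n_hidden)

x = rng.standard_normal(n_inputs)
h = np.cos(W @ x + b)

assert h.shape == (n_hidden,)
assert np.all(np.abs(h) <= 1.0)   # cosine keeps activations in [-1, 1]
```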
One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W⊤x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer based on W. The factored approach is to compute h = g(V⊤U⊤x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low-rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.
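The parameter saving is easy to make concrete; the values of n, p, and q below are arbitrary illustrative choices.

```python
# Parameter counting for the factored linear layer: replacing an n-by-p weight
# matrix W with U (n-by-q) followed by V (q-by-p) costs (n + p)q instead of np.
def full_params(n, p):
    return n * p

def factored_params(n, p, q):
    return (n + p) * q

n, p = 1000, 1000
print(full_params(n, p))            # 1000000
print(factored_params(n, p, 10))    # 20000, a 50x saving for q = 10
```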
Softmax units are another kind of unit that is usually used as an output (as described in Sec. 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch. These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, described in Sec. 10.12.
6.4 Architecture Design

Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

    h^(1) = g^(1)(W^(1)⊤ x + b^(1)),                               (6.40)

the second layer is given by

    h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2)),                           (6.41)

and so on.
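The chain structure of Eqs. 6.40 and 6.41 can be sketched directly; the layer sizes and the use of ReLU for both g^(1) and g^(2) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b, g):
    """One layer of the chain: x -> g(W^T x + b), as in Eqs. 6.40-6.41."""
    return lambda x: g(W.T @ x + b)

relu = lambda z: np.maximum(0.0, z)

# a two-layer chain with illustrative sizes 5 -> 4 -> 3
W1, b1 = rng.standard_normal((5, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 3)), np.zeros(3)

x = rng.standard_normal(5)
h1 = layer(W1, b1, relu)(x)    # first layer, Eq. 6.40
h2 = layer(W2, b2, relu)(h1)   # second layer, Eq. 6.41

assert h1.shape == (4,) and h2.shape == (3,)
```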
A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want to learn nonlinear functions.
At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990). The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of R^n is Borel measurable and therefore may be approximated by a neural network. A neural network may also approximate any function mapping from any finite dimensional discrete space to another. While the original theorems were first stated in terms of units with activation functions that saturate both for very negative and for very positive arguments, universal approximation theorems have also been proven for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).
The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function. However, we are not guaranteed that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons. First, the optimization algorithm used for training
may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function due to overfitting. Recall from Sec. 5.2.1 that the “no free lunch” theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions, in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.
The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}^n is 2^(2^n) and selecting one such function requires 2^n bits, which will in general require O(2^n) degrees of freedom.
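The doubly exponential growth in the number of binary functions is easy to tabulate for small n:

```python
# For inputs v in {0,1}^n there are 2^n possible inputs, and a binary function
# assigns one of two outputs to each input, giving 2^(2^n) possible functions.
for n in range(1, 5):
    n_inputs = 2 ** n
    n_functions = 2 ** n_inputs
    print(n, n_inputs, n_functions)
# e.g. n = 3 gives 2^8 = 256 functions; n = 4 already gives 2^16 = 65536
```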
In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
There exist families of functions which can be approximated efficiently by an architecture with depth greater than some value d, but which require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n. Such results were first proven for models that do not resemble the continuous, differentiable neural networks used for machine learning, but have since been extended to these models. The first results were for circuits of logic gates (Håstad, 1986). Later work extended these results to linear threshold units with non-negative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency—they specify only that a sufficiently wide rectifier network could represent any function. Pascanu et al.
    O(k^((L−1)+d)).                                                (6.43)
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation, but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See Fig. 6.6 and Fig. 6.7 for examples of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x⊤w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λ ∑i wi².
The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then the chain rule states that

dz/dx = (dz/dy)(dy/dx).    (6.44)
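As an informal illustration (not from the text), Eq. 6.44 can be checked numerically; the particular f and g below are arbitrary choices made for this sketch:

```python
import math

# Arbitrary smooth functions chosen for this sketch.
def g(x):
    return x ** 2          # y = g(x)

def f(y):
    return math.sin(y)     # z = f(y)

def dz_dx_chain(x):
    """Chain rule, Eq. 6.44: dz/dx = (dz/dy)(dy/dx) = cos(g(x)) * 2x."""
    y = g(x)
    dz_dy = math.cos(y)
    dy_dx = 2 * x
    return dz_dy * dy_dx

def dz_dx_numeric(x, eps=1e-6):
    """Centered finite difference of the composition z = f(g(x))."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
```

The two computations agree to within finite-difference error at any test point.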
We can generalize this beyond the scalar case. Suppose that x ∈ R^m, y ∈ R^n, g maps from R^m to R^n, and f maps from R^n to R. If y = g(x) and z = f(y), then

∂z/∂xi = ∑j (∂z/∂yj)(∂yj/∂xi).    (6.45)

In vector notation, this may be equivalently written as

∇x z = (∂y/∂x)⊤ ∇y z,    (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g.
From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.

Usually we do not apply the back-propagation algorithm merely to vectors, but rather to tensors of arbitrary dimensionality. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.

To denote the gradient of a value z with respect to a tensor X, we write ∇X z, just as if X were a vector. The indices into X now have multiple coordinates; for example, a 3-D tensor is indexed by three coordinates. We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇X z)i gives ∂z/∂Xi. This is exactly the same as how for all possible integer indices i into a vector, (∇x z)i gives ∂z/∂xi. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then

∇X z = ∑j (∇X Yj) ∂z/∂Yj.    (6.47)
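As a numerical sketch of Eq. 6.46 (assuming NumPy; the particular f and g are arbitrary choices for this illustration), the gradient ∇x z is a Jacobian-transpose times ∇y z:

```python
import numpy as np

# Arbitrary example: g maps R^3 -> R^2, f maps R^2 -> R.
def g(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def f(y):
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([1.0, 2.0, 0.5])
y = g(x)

# Jacobian dy/dx of g, shape (n, m) = (2, 3), derived by hand.
J = np.array([[x[1], x[0], 0.0],
              [0.0,  1.0,  1.0]])

grad_y = np.array([2.0 * y[0], 3.0])   # grad of z w.r.t. y, for f above

grad_x = J.T @ grad_y                  # Eq. 6.46: grad_x z = (dy/dx)^T grad_y z
```

A centered finite difference of the composition z = f(g(x)) reproduces each entry of grad_x, which is exactly the sum in Eq. 6.45 carried out by the matrix product.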
The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that, in Algorithm 6.2, backprop visits each edge from node u(j) to node u(i) of the graph exactly once in order to obtain the associated partial derivative ∂u(i)/∂u(j). Back-propagation thus avoids the exponential explosion in repeated subexpressions.
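A sketch of this bookkeeping (with a hypothetical graph encoding, not the book's pseudocode): a grad_table memoizes each node's gradient, so each edge u(j) → u(i) contributes exactly one product:

```python
def backprop_grads(nodes, parents, local_grad):
    """Memoized backprop sketch.

    nodes: topologically ordered node ids u(1)..u(n), output last.
    parents[i]: tuple of j such that there is an edge u(j) -> u(i).
    local_grad(i, j): returns the partial derivative du(i)/du(j).
    """
    n = nodes[-1]
    grad_table = {n: 1.0}                 # dz/dz = 1 for the output node
    for i in reversed(nodes[:-1]):
        # dz/du(i) = sum over consumers k of dz/du(k) * du(k)/du(i);
        # each edge (i, k) is visited exactly once here.
        grad_table[i] = sum(grad_table[k] * local_grad(k, i)
                            for k in nodes if i in parents.get(k, ()))
    return grad_table
```

For the graph z = u4 = u3 + u1 with u3 = u1 * u2, this recovers dz/du1 = u2 + 1 and dz/du2 = u1 without ever re-deriving a shared subexpression.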
However, other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We will revisit these ideas after describing the back-propagation algorithm itself.
ŷ = h(l)
J = L(ŷ, y) + λΩ(θ)
is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how this approach works is illustrated in Fig. 6.10. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives in order to obtain higher derivatives. Computation of higher-order derivatives is described in Sec. 6.5.10.
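The symbol-to-symbol idea can be sketched with hypothetical expression classes (these are illustrative and are not Theano's or TensorFlow's actual API); grad returns a new expression graph rather than a number:

```python
# Hypothetical expression classes: differentiation produces expressions.
class Var:
    def __init__(self, name): self.name = name
    def grad(self, wrt): return Const(1.0) if self is wrt else Const(0.0)
    def __repr__(self): return self.name

class Const:
    def __init__(self, v): self.v = v
    def grad(self, wrt): return Const(0.0)
    def __repr__(self): return str(self.v)

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def grad(self, wrt):
        return Add(self.a.grad(wrt), self.b.grad(wrt))
    def __repr__(self): return f"({self.a} + {self.b})"

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def grad(self, wrt):  # product rule, expressed symbolically
        return Add(Mul(self.a.grad(wrt), self.b),
                   Mul(self.a, self.b.grad(wrt)))
    def __repr__(self): return f"({self.a} * {self.b})"

x = Var("x")
z = Mul(x, x)            # z = x * x
dz = z.grad(x)           # a new expression graph for dz/dx
d2z = dz.grad(x)         # the derivative graph can be differentiated again
```

Because dz is itself a graph in the same language, higher-order derivatives such as d2z come for free, which is the property the text attributes to the symbol-to-symbol approach.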
We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available.
The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number
Backward computation for the deep neural network of Algorithm 6.3, which uses in addition to the input x a target y. This computation yields the gradients on the activations a(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.
After the forward computation, compute the gradient on the output layer:
g ← ∇ŷ J = ∇ŷ L(ŷ, y)
for k = l, l − 1, . . . , 1 do
  Convert the gradient on the layer's output into a gradient into the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
  g ← ∇a(k) J = g ⊙ f′(a(k))
  Compute gradients on weights and biases (including the regularization term, where needed):
  ∇b(k) J = g + λ∇b(k) Ω(θ)
  ∇W(k) J = g h(k−1)⊤ + λ∇W(k) Ω(θ)
  Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
  g ← ∇h(k−1) J = W(k)⊤ g
end for
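A sketch of this backward pass in NumPy (an illustration, not the book's code), for a small MLP with sigmoid activations f, squared-error loss, and λ = 0; all function and variable names are choices made for this sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """Forward pass: h(0) = x, a(k) = b(k) + W(k) h(k-1), h(k) = f(a(k))."""
    h = x
    pre_acts, hiddens = [], [h]
    for W, b in zip(weights, biases):
        a = b + W @ h
        h = sigmoid(a)
        pre_acts.append(a)
        hiddens.append(h)
    return pre_acts, hiddens

def backward(pre_acts, hiddens, weights, y):
    """Backward pass following the loop above, with lambda = 0 and
    squared-error loss L = 0.5 * ||yhat - y||^2."""
    yhat = hiddens[-1]
    g = yhat - y                                     # g <- grad_yhat L
    grads_W, grads_b = [], []
    for k in reversed(range(len(weights))):
        a = pre_acts[k]
        g = g * sigmoid(a) * (1 - sigmoid(a))        # g <- g * f'(a(k))
        grads_b.insert(0, g.copy())                  # grad of J w.r.t. b(k)
        grads_W.insert(0, np.outer(g, hiddens[k]))   # g h(k-1)^T
        g = weights[k].T @ g                         # g <- W(k)^T g
    return grads_W, grads_b
```

The regularized case adds the λ-terms of the algorithm to the weight and bias gradients; the control flow is unchanged.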
approach does not expose the graph.
We assume that each variable V is associated with the following subroutines:

• get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, and the matmul function. Suppose we have a variable that is created by matrix multiplication, C = AB. Then get_operation(C) returns a pointer to an instance of the corresponding C++ class.

• get_consumers(V, G): This returns the list of variables that are children of V in the computational graph G.

• get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.

Each operation op is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by Eq. 6.47. This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments. If we call
the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix multiplication operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A⊤G. The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return

∑i (∇X op.f(inputs)i) Gi,    (6.54)

which is just an implementation of the chain rule as expressed in Eq. 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.

The bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of x to compute x², the bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.
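As an illustration (hypothetical classes, not any particular library's API), the two matmul rules and the distinct-inputs convention described above might look like the following; the arg_index parameter on the element-wise multiply is a simplification used here because an identity check cannot distinguish two copies of the same input:

```python
import numpy as np

class MatMul:
    @staticmethod
    def f(A, B):
        return A @ B

    @staticmethod
    def bprop(inputs, X, G):
        """Gradient of a scalar z w.r.t. input X, given G = grad of z
        w.r.t. the output C = AB."""
        A, B = inputs
        if X is A:
            return G @ B.T          # rule for the first argument: G B^T
        if X is B:
            return A.T @ G          # rule for the second argument: A^T G
        raise ValueError("X is not an input of this op")

class Mul:  # element-wise multiplication
    @staticmethod
    def f(a, b):
        return a * b

    @staticmethod
    def bprop(inputs, X, G, arg_index):
        # Pretend the inputs are distinct: argument 0 gets G * b,
        # argument 1 gets G * a, even if a and b are the same array.
        # Backprop sums the two contributions afterwards.
        a, b = inputs
        return G * b if arg_index == 0 else G * a
```

Passing the same array x as both inputs of Mul and summing the two returned gradients yields 2x, matching the total derivative of x².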
Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on. Software engineers who build a new implementation of back-propagation or advanced users who need to add their own operation to an existing library must usually derive the op.bprop method for any new operations manually.

The back-propagation algorithm is formally described in Algorithm 6.5.
In Sec. 6.5.2, we motivated back-propagation as a strategy for avoiding computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of very many arithmetic operations (for example, we might have a graph that treats matrix
The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW(1)}. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW(2). We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost J_MLE. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost

J = J_MLE + λ (∑i,j (W(1)i,j)² + ∑i,j (W(2)i,j)²)    (6.56)

consists of the cross-entropy and a weight decay term with coefficient λ. The computational graph is illustrated in Fig. 6.11.
The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.

We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in Fig. 6.11. To train, we wish to compute both ∇W(1) J and ∇W(2) J. There are two different paths leading backward from J to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute 2λW(i) to the gradient on W(i).

The other path through the cross-entropy cost is slightly more complicated.
Let G be the gradient on the unnormalized log probabilities U(2) provided by the cross_entropy operation. The back-propagation algorithm now needs to explore two different branches. On the shorter branch, it adds H⊤G to the gradient on W(2), using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes ∇H J = GW(2)⊤ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-
Figure 6.11: The computational graph used to compute the cost used to train our example of a single-layer MLP using the cross-entropy loss and weight decay.
propagation rule to zero out components of the gradient corresponding to entries of U(1) that were less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X⊤G′ to the gradient on W(1).

After these gradients have been computed, it is the responsibility of the gradient descent algorithm, or another optimization algorithm, to use these gradients to update the parameters.
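The hand-traced gradient computation above can be sketched in NumPy (an illustration under stated assumptions, not the book's code). The softmax form of the cross_entropy gradient G and the averaging over the minibatch are choices made for this sketch; y holds integer class labels:

```python
import numpy as np

def forward_backward(X, y, W1, W2, lam):
    """Forward cost and gradients for H = relu(X W1), U2 = H W2,
    with softmax cross-entropy plus weight decay (Eq. 6.56 style)."""
    m = X.shape[0]
    U1 = X @ W1
    H = np.maximum(0.0, U1)              # relu: H = max{0, X W1}
    U2 = H @ W2                          # unnormalized log probabilities
    # Softmax cross-entropy, averaged over the minibatch.
    P = np.exp(U2 - U2.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    J_mle = -np.mean(np.log(P[np.arange(m), y]))
    J = J_mle + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

    G = P.copy()
    G[np.arange(m), y] -= 1.0
    G /= m                               # gradient G on U2 from cross_entropy
    grad_W2 = H.T @ G + 2 * lam * W2     # shorter branch: H^T G, plus 2*lam*W2
    G1 = (G @ W2.T) * (U1 > 0)           # grad on H = G W2^T, zeroed where U1 < 0
    grad_W1 = X.T @ G1 + 2 * lam * W1    # longer branch: X^T G', plus 2*lam*W1
    return J, grad_W1, grad_W2
```

The two branches in the code mirror the trace in the text: the weight decay path contributes the 2λW terms, and the cross-entropy path contributes H⊤G and X⊤G′.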
MLP
MLP,, the computational cost is dominated by the cost of matrix
multiplication. During the forward propagation stage, we multiply by each weight
For resulting
matrix, the MLPin , the
O ( wcomputational
) multiply-adds,costwhereis w dominated
is the num
numb by
b er the
of wcost
eigh
eights. of During
ts. matrix
multiplication.
the backw
backward During the forward
ard propagation stage, we propagation
multiply stage,
by thewtranspose
e multiply of by eac
each
each weightt
h weigh
weight
matrix, resulting
matrix, which has O ( wsame
in the ) multiply-adds, where
computational w is The
cost. the num
main b ermemory
of weighcostts. During
of the
the backw ard propagation stage, we m ultiply by the transpose
algorithm is that we need to store the input to the nonlinearity of the hidden la of eac h weigh t
layer.
yer.
matrix,
This which
value has the
is stored same
from thecomputational cost. The
time it is computed un main
until
til the memory
backw
backward ardcostpassof has
the
algorithmto
returned is that we need
the same tot.store
p oin
oint. The the input cost
memory to theis nonlinearity
thus O(mnhof ), the
wherehidden
m islathe
yer.
This
numbervalue is stored in
of examples from
thethe time it and
minibatch is computed
nh is the num untilb er
numb theofbackw
hidden ard pass has
units.
returned to the same p oint. The memory cost is thus O(mn ), where m is the
number of examples in the minibatch and n is the numb er of hidden units.
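The gradient computation described above can be sketched numerically. This is an illustrative reimplementation, not the book's code: it uses a single ReLU hidden layer and, for brevity, a squared-error loss in place of the cross-entropy loss from the text, and it omits weight decay.

```python
import numpy as np

# Illustrative sketch (not the book's code): forward and backward pass of a
# one-hidden-layer ReLU MLP with a squared-error loss.
def forward_backward(X, y, W1, W2):
    U1 = X @ W1                        # pre-activation, stored for the backward pass
    H = np.maximum(U1, 0)              # ReLU hidden layer
    y_hat = H @ W2                     # network output
    # Backward pass: multiply by the transposes of the weight matrices.
    G_out = 2 * (y_hat - y) / len(X)   # gradient of the loss w.r.t. y_hat
    G_H = G_out @ W2.T                 # gradient on H
    G = G_H * (U1 > 0)                 # zero entries where U1 < 0, giving G'
    grad_W1 = X.T @ G                  # the "add X^T G' to the gradient on W(1)" step
    grad_W2 = H.T @ G_out
    return y_hat, grad_W1, grad_W2
```

Storing U1 from the forward pass until the backward pass reaches it is exactly the O(mnh) memory cost described above.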
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
and backward mode is analogous to the relationship between left-multiplying versus right-multiplying a sequence of matrices, such as

ABCD,    (6.58)

where the matrices can be thought of as Jacobian matrices. For example, if D is a column vector while A has many rows, this corresponds to a graph with a single output and many inputs, and starting the multiplications from the end and going backwards only requires matrix-vector products. This corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. However, if A has fewer rows than D has columns, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
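The cost asymmetry can be checked directly. This is a toy sketch (the random matrices stand in for Jacobians): both orders compute the same product, but multiplying right-to-left keeps every intermediate result a vector.

```python
import numpy as np

# Toy sketch of Eq. 6.58: multiplying the chain A B C d right-to-left
# (backward mode) needs only matrix-vector products, while left-to-right
# (forward mode) forms full matrix-matrix products first.
def chain_right_to_left(mats, d):
    v = d
    for M in reversed(mats):    # matrix-vector products only
        v = M @ v
    return v

def chain_left_to_right(mats, d):
    P = mats[0]
    for M in mats[1:]:          # matrix-matrix products
        P = P @ M
    return P @ d
```

For n×n factors, each step of the first routine costs O(n²) while each step of the second costs O(n³).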
In many communities outside of machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the methods for every operation and limiting the user of the library to only those operations that have been defined. However, the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, allowing the developer to improve speed or stability in non-obvious ways that an automatic procedure would presumably be unable to replicate.

Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a very practical method that continues to serve the deep learning community very well. In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.
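As a sketch of what such a customized back-propagation rule can look like, consider a hypothetical softplus operation whose backward pass is written by hand in a numerically stable form rather than derived mechanically from the forward expression. The class and method names here are invented for illustration; they do not come from any particular library.

```python
import numpy as np

# Sketch of the "specialized library" idea: each op bundles a forward
# computation with a hand-written back-propagation rule.
class Log1pExp:
    """softplus(x) = log(1 + exp(x)) with a customized backward rule."""
    def forward(self, x):
        self.x = x                      # stored for the backward pass
        return np.logaddexp(0.0, x)     # numerically stable log(1 + e^x)
    def backward(self, grad_out):
        # d softplus / dx = sigmoid(x), written in a stable form rather
        # than as exp(x) / (1 + exp(x)), which can overflow.
        return grad_out * (1.0 / (1.0 + np.exp(-self.x)))
```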
Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian
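A toy illustration of this idea, far simpler than Theano or TensorFlow: if derivatives are represented with the same expression data structure as the original function, the differentiator can be applied to its own output to obtain higher-order derivatives. The tuple-based representation below is invented purely for illustration.

```python
# Expressions are nested tuples; diff() returns the same kind of structure
# it consumes, so applying it twice yields a second derivative.
def diff(e):
    if e == 'x':
        return ('const', 1.0)
    if e[0] == 'const':
        return ('const', 0.0)
    if e[0] == 'add':
        return ('add', diff(e[1]), diff(e[2]))
    if e[0] == 'mul':  # product rule
        return ('add', ('mul', diff(e[1]), e[2]), ('mul', e[1], diff(e[2])))
    raise ValueError(e)

def evaluate(e, x):
    if e == 'x':
        return x
    if e[0] == 'const':
        return e[1]
    if e[0] == 'add':
        return evaluate(e[1], x) + evaluate(e[2], x)
    if e[0] == 'mul':
        return evaluate(e[1], x) * evaluate(e[2], x)

expr = ('mul', 'x', ('mul', 'x', 'x'))   # x**3
second = diff(diff(expr))                # same data structure, represents 6*x
```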
6.6 Historical Notes

Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation. From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.

The chain rule that underlies the back-propagation algorithm was invented in the 17th century (Leibniz, 1676; L'Hôpital, 1696). Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the 19th century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest
models were based on linear models. Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.

Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multi-layer neural networks. However, the ideas put forward by the authors of that book, and in particular by Rumelhart and Hinton, go much beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name of "connectionism" because of the importance given the connections between neurons as the locus of learning and memory. In particular, these ideas include the notion of distributed representation (Hinton et al., 1986).

Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.

The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same approaches to gradient descent are still in use. Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, due to more powerful computers and better software infrastructure. However, a small number of algorithmic changes have improved the performance of neural networks noticeably.

One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s, but was gradually replaced by cross-entropy losses and the
believed that feedforward networks would not perform well unless they were assisted by other models, such as probabilistic models. Today, it is now known that with the right resources and engineering practices, feedforward networks perform very well. Today, gradient-based learning in feedforward networks is used as a tool to develop probabilistic models, such as the variational autoencoder and generative adversarial networks, described in Chapter 20. Rather than being viewed as an unreliable technology that must be supported by other techniques, gradient-based learning in feedforward networks has been viewed since 2012 as a powerful technology that may be applied to many other machine learning tasks. In 2006, the community used unsupervised learning to support supervised learning, and now, ironically, it is more common to use supervised learning to support unsupervised learning.

Feedforward networks continue to have unfulfilled potential. In the future, we expect they will be applied to many more tasks, and that advances in optimization algorithms and model design will improve their performance even further. This chapter has primarily described the neural network family of models. In the subsequent chapters, we turn to how to use these models: how to regularize and train them.
Chapter 7
Regularization for Deep Learning

A central problem in machine learning is how to make an algorithm that will
perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization. As we will see, there are a great many forms of regularization available to the deep learning practitioner. In fact, developing more effective regularization strategies has been one of the major research efforts in the field.
Chapter 5 introduced the basic concepts of generalization, underfitting, overfitting, bias, variance and regularization. If you are not already familiar with these notions, please refer to that chapter before continuing with this one.

In this chapter, we describe regularization in more detail, focusing on regularization strategies for deep models or models that may be used as building blocks to form deep models.

Some sections of this chapter deal with standard concepts in machine learning. If you are already familiar with these concepts, feel free to skip the relevant sections. However, most of this chapter is concerned with the extension of these basic concepts to the particular case of neural networks.

In Sec. 5.2.2, we defined regularization as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error." There are many regularization strategies. Some put extra constraints on a machine learning model, such as adding restrictions on the parameter values. Some add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values. If chosen carefully, these extra constraints and penalties can lead to improved performance
on the test set. Sometimes these constraints and penalties are designed to encode specific kinds of prior knowledge. Other times, these constraints and penalties are designed to express a generic preference for a simpler model class in order to promote generalization. Sometimes penalties and constraints are necessary to make an underdetermined problem determined. Other forms of regularization, known as ensemble methods, combine multiple hypotheses that explain the training data.

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias. When we discussed generalization and overfitting in Chapter 5, we focused on three situations, where the model family being trained either (1) excluded the true data generating process, corresponding to underfitting and inducing bias, or (2) matched the true data generating process, or (3) included the generating process but also many other possible generating processes, the overfitting regime where variance rather than bias dominates the estimation error. The goal of regularization is to take a model from the third regime into the second regime.

In practice, an overly complex model family does not necessarily include the target function or the true data generating process, or even a close approximation of either. We almost never have access to the true data generating process so we can never know for sure if the model family being estimated includes the generating process or not. However, most applications of deep learning algorithms are to domains where the true data generating process is almost certainly outside the model family. Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family).

What this means is that controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters. Instead, we might find (and indeed in practical deep learning scenarios, we almost always do find) that the best fitting model, in the sense of minimizing generalization error, is a large model that has been regularized appropriately.

We now review several strategies for how to create such a large, deep, regularized model.
7.1 Parameter Norm Penalties

Regularization has been used for decades prior to the advent of deep learning. Linear models such as linear regression and logistic regression allow simple, straightforward, and effective regularization strategies.

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).    (7.1)
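Eq. 7.1 can be sketched directly in code. The concrete quadratic J below is invented purely for illustration; Ω is chosen here as an L² norm penalty, one of the options discussed in this section.

```python
import numpy as np

# Sketch of Eq. 7.1: regularized objective = J(theta) + alpha * Omega(theta).
# The specific J is a made-up quadratic, used only to have something concrete.
def J(theta):
    return np.sum((theta - 1.0) ** 2)       # unregularized objective

def omega(theta):
    return 0.5 * np.sum(theta ** 2)         # L2 parameter norm penalty

def J_tilde(theta, alpha):
    return J(theta) + alpha * omega(theta)  # Eq. 7.1
```

Setting α = 0 recovers the unregularized objective; larger α expresses a stronger preference for small parameter values.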
[Figure 7.1: contour plot in the (w1, w2) plane showing the unregularized optimum w∗ and the regularized solution w̃.]

Figure 7.1: An illustration of the effect of L² (or weight decay) regularization on the value of the optimal w. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L² regularizer. At the point w̃, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is small. The objective function does not increase much when moving horizontally away from w∗. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls w1 close to zero. In the second dimension, the objective function is very sensitive to movements away from w∗. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of w2 relatively little.
Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training.

So far we have discussed weight decay in terms of its effect on the optimization of an abstract, general, quadratic cost function. How do these effects relate to machine learning in particular? We can find out by studying linear regression, a model for which the true cost function is quadratic and therefore amenable to the same kind of analysis we have used so far. Applying the analysis again, we will be able to obtain a special case of the same results, but with the solution now phrased in terms of the training data. For linear regression, the cost function is
the sum of squared errors: (X w − y )> (X w − y ). (7.14)
When we add L2 regularization, the objective function changes to

(Xw − y)⊤(Xw − y) + (1/2) α w⊤w.    (7.15)

This changes the normal equations for the solution from

w = (X⊤X)⁻¹X⊤y    (7.16)

to

w = (X⊤X + αI)⁻¹X⊤y.    (7.17)
The matrix X⊤X in Eq. 7.16 is proportional to the covariance matrix (1/m) X⊤X. Using L2 regularization replaces this matrix with X⊤X + αI in Eq. 7.17. The new matrix is the same as the original one, but with the addition of α to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that L2 regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
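As a numerical check (the code and variable names below are illustrative, not part of the original text), we can compare the unregularized solution of Eq. 7.16 with the regularized solution of Eq. 7.17 and confirm that adding α to the diagonal shrinks the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # design matrix: 20 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# Eq. 7.16: ordinary least squares, w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Eq. 7.17: L2-regularized solution, w = (X^T X + alpha*I)^{-1} X^T y
alpha = 5.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# Adding alpha to the diagonal shrinks the solution toward zero.
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True
```

The strict shrinkage holds for any α > 0: in the singular-value basis, each coefficient of the solution is rescaled by a factor s²/(s² + α) < 1.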
7.1.2 L1 Regularization

While L2 weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use L1 regularization.

Formally, L1 regularization on the model parameter w is defined as:

Ω(θ) = ||w||₁ = Σᵢ |wᵢ|,    (7.18)
that is, as the sum of absolute values of the individual parameters.² We will now discuss the effect of L1 regularization on the simple linear regression model, with no bias parameter, that we studied in our analysis of L2 regularization. In particular, we are interested in delineating the differences between L1 and L2 forms of regularization. As with L2 weight decay, L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α. Thus, the regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α||w||₁ + J(w; X, y),    (7.19)
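The qualitative difference between the two penalties can be seen in one dimension, for a quadratic cost with a diagonal Hessian: the L2 penalty rescales each weight, while the L1 penalty subtracts a constant from its magnitude and truncates at zero, producing exact zeros (sparsity). A minimal sketch — the function names are ours, not from the text:

```python
def l2_shrink(w_star, alpha, h):
    # L2 penalty rescales the unregularized optimum w* (curvature h): never exactly zero
    return w_star * h / (h + alpha)

def l1_shrink(w_star, alpha, h):
    # L1 penalty shifts the magnitude by alpha/h and clips at zero (soft thresholding),
    # so sufficiently small weights become exactly zero -- the solution is sparse
    if abs(w_star) <= alpha / h:
        return 0.0
    return w_star - alpha / h if w_star > 0 else w_star + alpha / h

weights = [0.05, -0.3, 2.0]
print([l2_shrink(w, alpha=0.5, h=1.0) for w in weights])  # all nonzero, rescaled
print([l1_shrink(w, alpha=0.5, h=1.0) for w in weights])  # [0.0, 0.0, 1.5]
```

This is why L1 regularization is commonly used as a feature-selection mechanism, whereas L2 weight decay only shrinks.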
7.2 Norm Penalties as Constrained Optimization

Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).    (7.25)
Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).    (7.26)

The solution to the constrained problem is given by

θ* = arg min_θ max_{α, α≥0} L(θ, α).    (7.27)
As described in Sec. 4.4, solving this problem requires modifying both θ and α. Sec. 4.5 provides a worked example of linear regression with an L2 constraint. Many different procedures are possible—some may use gradient descent, while others may use analytical solutions for where the gradient is zero—but in all procedures α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. All positive α encourage Ω(θ) to shrink. The optimal value α* will encourage Ω(θ) to shrink, but not so strongly to make Ω(θ) become less than k.
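The procedure just described can be sketched as alternating updates: a gradient step on θ, then a multiplier step that raises α when the constraint is violated and lowers it (clipped at zero) when it is slack. Everything in this sketch — the choice Ω(θ) = θ², the step sizes, the function names — is an illustrative assumption, not from the text:

```python
def constrained_step(theta, alpha, grad_J, omega, k, lr=0.01, lr_alpha=0.1):
    """One alternating update for min_theta max_{alpha>=0} J(theta) + alpha*(Omega(theta) - k)."""
    # Gradient step on theta: the penalty contributes alpha * dOmega/dtheta.
    # Here Omega(theta) = theta^2 for concreteness, so dOmega/dtheta = 2*theta.
    g = grad_J(theta) + alpha * 2 * theta
    theta = theta - lr * g
    # Multiplier step: alpha rises whenever Omega(theta) > k, falls whenever
    # Omega(theta) < k, and is clipped so that alpha >= 0.
    alpha = max(0.0, alpha + lr_alpha * (omega(theta) - k))
    return theta, alpha

theta, alpha = 5.0, 0.0
for _ in range(2000):
    theta, alpha = constrained_step(theta, alpha,
                                    grad_J=lambda t: 2 * (t - 3.0),  # J = (theta - 3)^2
                                    omega=lambda t: t * t, k=1.0)
print(round(theta * theta, 2))  # settles near the constraint boundary k = 1.0
```

Because the unconstrained minimum (θ = 3, so Ω = 9) violates the constraint, the multiplier grows until the iterate is pushed onto the boundary Ω(θ) = k.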
consistently increase the size of the weights, then θ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection allow us to terminate this feedback loop after the weights have reached a certain magnitude. Hinton et al. (2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of parameter space while maintaining some stability.

In particular, Hinton et al. (2012c) recommend constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix, a strategy introduced by Srebro and Shraibman (2005). Constraining the norm of each column separately prevents any one hidden unit from having very large weights. If we converted this constraint into a penalty in a Lagrange function, it would be similar to L2 weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
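The explicit constraint with reprojection amounts to a projected update: after each gradient step, any column whose norm exceeds the limit is rescaled back onto the constraint surface, while columns that already satisfy it are left alone. A minimal sketch, assuming numpy (names are ours):

```python
import numpy as np

def project_columns(W, max_norm):
    # Reprojection: rescale each column of W whose L2 norm exceeds max_norm,
    # leaving columns that already satisfy the constraint untouched.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.1],
              [4.0, 0.2]])          # column norms: 5.0 and about 0.22
W = project_columns(W, max_norm=1.0)
print(np.linalg.norm(W, axis=0))    # first column clipped to norm 1.0, second unchanged
```

In a training loop, `project_columns` would be called on each layer's weight matrix immediately after every parameter update.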
7.3 Regularization and Under-Constrained Problems

In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix X⊤X. This is not possible whenever X⊤X is singular. This matrix can be singular whenever the data truly has no variance in some direction, or when there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting X⊤X + αI instead. This regularized matrix is guaranteed to be invertible.

These linear problems have closed form solutions when the relevant matrix is invertible. It is also possible for a problem with no closed form solution to be underdetermined. An example is logistic regression applied to a problem where the classes are linearly separable. If a weight vector w is able to achieve perfect classification, then 2w will also achieve perfect classification and higher likelihood. An iterative optimization procedure like stochastic gradient descent will continually increase the magnitude of w and, in theory, will never halt. In practice, a numerical implementation of gradient descent will eventually reach sufficiently large weights to cause numerical overflow, at which point its behavior will depend on how the programmer has decided to handle values that are not real numbers.

Most forms of regularization are able to guarantee the convergence of iterative methods applied to underdetermined problems. For example, weight decay will cause gradient descent to quit increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient.

The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems.
As we saw in Sec. 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. Recall that one definition of the pseudoinverse X⁺ of a matrix X is

X⁺ = lim_{α↘0} (X⊤X + αI)⁻¹X⊤.    (7.29)
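Eq. 7.29 can be checked numerically for a wide matrix, where X⊤X is singular and ordinary inversion fails: with a small finite α standing in for the limit, the regularized expression matches numpy's pseudoinverse. A sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # fewer rows than columns: X^T X (5x5) has rank 3, singular

alpha = 1e-10
# Eq. 7.29 with a small finite alpha in place of the limit:
# solve (X^T X + alpha*I) Z = X^T  gives  Z = (X^T X + alpha*I)^{-1} X^T
X_pinv_reg = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T)
print(np.allclose(X_pinv_reg, np.linalg.pinv(X), atol=1e-6))  # True
```

This is the same matrix X⊤X + αI that made the regularized regression problem well-defined above; the pseudoinverse is its α → 0 limit.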
7.4 Dataset Augmentation

The best way to make a machine learning model generalize better is to train it on more data. Of course, in practice, the amount of data we have is limited. One way to get around this problem is to create fake data and add it to the training set. For some machine learning tasks, it is reasonably straightforward to create new fake data.
This approach is easiest for classification. A classifier needs to take a complicated, high dimensional input x and summarize it with a single category identity y. This means that the main task facing a classifier is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by transforming the x inputs in our training set.

This approach is not as readily applicable to many other tasks. For example, it is difficult to generate new fake data for a density estimation task unless we have already solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a specific classification problem: object recognition. Images are high dimensional and include an enormous variety of factors of variation, many of which can be easily simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, even if the model has already been designed to be partially translation invariant by using the convolution and pooling techniques described in Chapter 9. Many other operations such as rotating the image or scaling the image have also proven quite effective.
One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between ’b’ and ’d’ and the difference between ’6’ and ’9’, so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.

There are also transformations that we would like our classifiers to be invariant to, but which are not easy to perform. For example, out-of-plane rotation cannot be implemented as a simple geometric operation on the input pixels.
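A label-preserving transform such as a small horizontal shift is a simple operation on the pixel array; a mirror flip, by contrast, would change the class for characters like ’b’ and ’d’, so it is deliberately excluded. A minimal sketch assuming numpy (function names are ours):

```python
import numpy as np

def shift_right(image, pixels=1):
    # Translate the image a few pixels to the right, padding the left edge with zeros.
    # The category identity y is unchanged, so (shifted_x, y) is a valid new example.
    shifted = np.zeros_like(image)
    shifted[:, pixels:] = image[:, :-pixels]
    return shifted

image = np.arange(9.0).reshape(3, 3)        # stand-in for a small training image
augmented = [image, shift_right(image, 1)]  # original plus one translated copy
# np.fliplr(image) is NOT included here: a mirror image would turn a 'b' into a 'd'.
print(augmented[1][:, 0])  # left column is zero padding
```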
Dataset augmentation is effective for speech recognition tasks as well (Jaitly and Hinton, 2013).
Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a form of data augmentation. For many classification and even some regression tasks, the task should still be possible to solve even if small random noise is added to the input. Neural networks prove not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms such as the denoising autoencoder (Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction. Poole et al. (2014) recently showed that this approach can be highly effective provided that the magnitude of the noise is carefully tuned. Dropout, a powerful regularization strategy that will be described in Sec. 7.12, can be seen as a process of constructing new inputs by multiplying by noise.
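Input noise injection of this kind is a one-line addition to a training loop: a fresh perturbation is drawn each time an example is presented, so the network never sees exactly the same input twice. A sketch assuming numpy; the noise scale is an illustrative hyperparameter, not a recommendation from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_batch(X, noise_std=0.1):
    # Fresh Gaussian noise per presentation: a form of dataset augmentation,
    # since each epoch effectively trains on new (x + eps, y) pairs.
    return X + noise_std * rng.normal(size=X.shape)

X = np.ones((4, 2))                    # stand-in for a minibatch of inputs
batch1 = noisy_batch(X)
batch2 = noisy_batch(X)
print(np.array_equal(batch1, batch2))  # False: every presentation differs
```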
When comparing machine learning benchmark results, it is important to take the effect of dataset augmentation into account. Often, hand-designed dataset augmentation schemes can dramatically reduce the generalization error of a machine learning technique. To compare the performance of one machine learning algorithm to another, it is necessary to perform controlled experiments. When comparing machine learning algorithm A and machine learning algorithm B, it is necessary to make sure that both algorithms were evaluated using the same hand-designed dataset augmentation schemes. Suppose that algorithm A performs poorly with no dataset augmentation and algorithm B performs well when combined with numerous synthetic transformations of the input. In such a case it is likely the synthetic transformations caused the improved performance, rather than the use of machine learning algorithm B. Sometimes deciding whether an experiment
7.5 Noise Robustness

Sec. 7.4 has motivated the use of noise applied to the inputs as a dataset augmentation strategy. For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b). In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. Noise applied to the hidden units is such an important topic as to merit its own separate discussion; the dropout algorithm described in Sec. 7.12 is the main development of that approach.
Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). This can be interpreted as a stochastic implementation of a Bayesian inference over the weights. The Bayesian treatment of learning would consider the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way to reflect this uncertainty (Graves, 2011).
This can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization. Adding noise to the weights has been shown to be an effective regularization strategy in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). In the following, we will present an analysis of the effect of weight noise on a standard feedforward neural network (as introduced in Chapter 6).
We study the regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar using the least-squares cost function between the model predictions ŷ(x) and the true values y:

J = E_{p(x,y)} [(ŷ(x) − y)²].    (7.30)

The training set consists of m labeled examples {(x⁽¹⁾, y⁽¹⁾), . . . , (x⁽ᵐ⁾, y⁽ᵐ⁾)}.
7.5.1 Injecting Noise at the Output Targets

Most datasets have some amount of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k−1) and 1 − ε, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge—the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since
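Label smoothing as described above replaces a one-hot target with soft values before the usual cross-entropy is applied. A minimal sketch for k classes and a smoothing constant ε (the function names and the example probabilities are ours):

```python
import math

def smooth_targets(hard_label, k, eps):
    # Replace the hard 0/1 targets: eps/(k-1) on each wrong class, 1-eps on the right one.
    targets = [eps / (k - 1)] * k
    targets[hard_label] = 1.0 - eps
    return targets

def cross_entropy(targets, probs):
    # Standard cross-entropy, now evaluated against the soft targets.
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

soft = smooth_targets(hard_label=2, k=4, eps=0.1)
print(soft)                    # [0.0333..., 0.0333..., 0.9, 0.0333...]
print(round(sum(soft), 12))    # 1.0 -- still a valid probability distribution

probs = [0.05, 0.05, 0.85, 0.05]               # an illustrative softmax output
print(cross_entropy(soft, probs) > 0.0)        # True
```

Unlike the hard-target loss, this objective is minimized when the softmax matches the soft targets, which is attainable with finite logits.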
7.6 Semi-Supervised Learning

In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x, y) are used to estimate P(y | x) or predict y from x.

In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x). The goal is to learn a representation so that examples from the same class have similar representations. Unsupervised learning can provide useful cues for how to group examples in representation space. Examples that cluster tightly in the input space should be mapped to similar representations. A linear classifier in the new space may achieve better generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the application of principal components analysis as a pre-processing step before applying a classifier (on the projected data).
Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y | x). One can then trade off the supervised criterion −log P(y | x) with the unsupervised or generative one (such as −log P(x) or −log P(x, y)). The generative criterion then expresses a particular form of prior belief about the solution to the supervised learning problem (Lasserre et al., 2006), namely that the structure of P(x) is connected to the structure of P(y | x) in a way that is captured by the shared parametrization. By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and Bengio, 2008).
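As an illustrative sketch of such a trade-off (our own toy construction, not a method from the book): a class-conditional Gaussian model whose means serve both the generative criterion −log P(x) and, via Bayes' rule, the discriminative criterion −log P(y | x):

```python
import numpy as np

# Toy model: 1-D inputs, two classes with equal priors and unit-variance
# class-conditional Gaussians. The means mu are the parameters *shared*
# between the generative model P(x, y) and the discriminative model
# P(y | x) derived from it.
def log_joint(mu, x):
    log_gauss = -0.5 * (x[None, :] - mu[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_gauss + np.log(0.5)                      # add the log class prior

def supervised_criterion(mu, x, y):                     # -log P(y | x)
    lj = log_joint(mu, x)
    log_px = np.logaddexp.reduce(lj, axis=0)
    return -(lj[y, np.arange(len(x))] - log_px).mean()

def generative_criterion(mu, x):                        # -log P(x)
    return -np.logaddexp.reduce(log_joint(mu, x), axis=0).mean()

def total_criterion(mu, x_lab, y_lab, x_unlab, lam=0.5):
    """Trade off -log P(y|x) on labeled data against -log P(x) on all data."""
    x_all = np.concatenate([x_lab, x_unlab])
    return supervised_criterion(mu, x_lab, y_lab) + lam * generative_criterion(mu, x_all)
```

The weight `lam` controls how much of the generative criterion enters the total criterion; `lam = 0` recovers the purely discriminative objective.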
In the context of scarcity of labeled data (and abundance of unlabeled data), deep architectures have shown promise as well. Salakhutdinov and Hinton (2008) describe a method for learning the kernel function of a kernel machine used for regression, in which the usage of unlabeled examples for modeling P(x) improves P(y | x) quite significantly.

See Chapelle et al. (2006) for more information about semi-supervised learning.
7.7 Multi-Task Learning

Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization.
[Figure 7.2: Multi-task learning with a shared input x and shared intermediate representation h(shared) feeding task outputs y(1) and y(2). Some factors, associated with none of the output tasks (h(3)), explain some of the input variations but are not relevant for predicting y(1) or y(2).]

Fig. 7.2 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting y(i) given x) share the same input x, as well as some intermediate-level representation h(shared) capturing a common pool of factors. The
model can generally be divided into two kinds of parts and associated parameters:

1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in Fig. 7.2.

2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in Fig. 7.2.

Improved generalization and generalization error bounds (Baxter, 1995) can be achieved because of the shared parameters, for which statistical strength can be greatly improved (in proportion with the increased number of examples for the shared parameters, compared to the scenario of single-task models). Of course this will happen only if some assumptions about the statistical relationship between the different tasks are valid, meaning that there is something shared across some of the tasks.

From the point of view of deep learning, the underlying prior belief is the following: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
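A minimal sketch of the shared-trunk architecture of Fig. 7.2 (the shapes, names, and random weights here are ours, purely for illustration): lower-layer weights shared by all tasks, with a separate head per task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic parameters: the lower layer, pooled across all tasks
W_shared = rng.normal(size=(4, 3))
# Task-specific parameters: one upper-layer head per supervised task
W_task = {"y1": rng.normal(size=(2, 4)),
          "y2": rng.normal(size=(5, 4))}

def forward(x, task):
    h_shared = np.maximum(0.0, W_shared @ x)   # common pool of factors h(shared)
    return W_task[task] @ h_shared             # task-specific output layer

x = rng.normal(size=3)
y1, y2 = forward(x, "y1"), forward(x, "y2")
```

Gradients from every task's loss flow into `W_shared`, which is how the pooled examples constrain the shared part of the model.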
7.8 Early Stopping

When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again. See Fig. 7.3 for an example of this behavior. This behavior occurs very reliably.

This means we can obtain a model with better validation set error (and thus, hopefully better test set error) by returning to the parameter setting at the point in time with the lowest validation set error. Instead of running our optimization algorithm until we reach a (local) minimum of validation error, we run it until the error on the validation set has not improved for some amount of time. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters. This procedure is specified more formally in Algorithm 7.1.

This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
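A minimal sketch of this procedure (our own simplification in the spirit of Algorithm 7.1; the `step` and `validation_error` callables stand in for a real training loop):

```python
import copy

def train_with_early_stopping(step, validation_error, params, patience=5):
    """Store a copy of the parameters whenever validation error improves;
    stop once it has not improved for `patience` consecutive evaluations,
    and return the best parameters rather than the latest ones."""
    best_err, best_params, fails = float("inf"), copy.deepcopy(params), 0
    while fails < patience:
        step(params)                      # one round of training updates
        err = validation_error(params)
        if err < best_err:
            best_err, best_params, fails = err, copy.deepcopy(params), 0
        else:
            fails += 1
    return best_params, best_err

# Toy demo: validation error is U-shaped in w, minimized at w = 0.5,
# so training past that point only makes validation error worse.
best, err = train_with_early_stopping(
    step=lambda p: p.update(w=p["w"] + 0.1),
    validation_error=lambda p: (p["w"] - 0.5) ** 2,
    params={"w": 0.0}, patience=3)
```

The returned parameters are those from the lowest point of the U-shaped validation curve, not the final ones.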
[Figure 7.3: Learning curves; loss (0.00 to 0.20) versus time in epochs (0 to 250).]
GPU memory, but storing the optimal parameters in host memory or on a disk drive). Since the best parameters are written to infrequently and never read during training, these occasional slow writes have little effect on the total training time.

Early stopping is a very unobtrusive form of regularization, in that it requires almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.

Early stopping may be used either alone or in conjunction with other regularization strategies. Even when using regularization strategies that modify the objective function to encourage better generalization, it is rare for the best generalization to occur at a local minimum of the training objective.

Early stopping requires a validation set, which means some training data is not
fed to the model. To best exploit this extra data, one can perform extra training
after the initial training with early stopping has completed. In the second, extra training step, all of the training data is included. There are two basic strategies one can use for this second training procedure.

One strategy (Algorithm 7.2) is to initialize the model again and retrain on all of the data. In this second training pass, we train for the same number of steps as the early stopping procedure determined was optimal in the first pass. There are some subtleties associated with this procedure. For example, there is not a good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset. On the second round of training, each pass through the dataset will require more parameter updates because the training set is bigger.

Another strategy for using all of the data is to keep the parameters obtained
from the first round of training and then continue training but now using all of the data. At this stage, we now no longer have a guide for when to stop in terms of a number of steps. Instead, we can monitor the average loss function on the validation set, and continue training until it falls below the value of the training set objective at which the early stopping procedure halted. This strategy avoids the high cost of retraining the model from scratch, but is not as well-behaved. For example, there is not any guarantee that the objective on the validation set will ever reach the target value, so this strategy is not even guaranteed to terminate. This procedure is presented more formally in Algorithm 7.3.
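A sketch of this second strategy (our own simplification in the spirit of Algorithm 7.3; the `max_steps` cap is our addition, added precisely because the loop is otherwise not guaranteed to terminate):

```python
def continue_training(step, validation_loss, params, target, max_steps=10_000):
    """Keep training on all data until the validation-set loss falls below
    `target`, the training objective value at which early stopping halted."""
    n = 0
    while validation_loss(params) > target and n < max_steps:
        step(params)
        n += 1
    return params, n

# Toy demo: the loss decreases as w moves toward 1.0, so the loop stops
# once the validation loss reaches the target value.
params, n = continue_training(
    step=lambda p: p.update(w=p["w"] + 0.1),
    validation_loss=lambda p: (p["w"] - 1.0) ** 2,
    params={"w": 0.5}, target=0.01)
```

If the validation loss never reaches `target`, the loop only ends when `max_steps` is exhausted, mirroring the non-termination caveat in the text.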
Early stopping is also useful because it reduces the computational cost of the
Figure 7.4: An illustration of the effect of early stopping. (Left) The solid contour lines indicate the contours of the negative log-likelihood. The dashed line indicates the trajectory taken by SGD beginning from the origin. Rather than stopping at the point w∗ that minimizes the cost, early stopping results in the trajectory stopping at an earlier point w̃. (Right) An illustration of the effect of L2 regularization for comparison. The dashed circles indicate the contours of the L2 penalty, which causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularized cost.
training procedure. Besides the obvious reduction in cost due to limiting the number of training iterations, it also has the benefit of providing regularization without requiring the addition of penalty terms to the cost function or the computation of the gradients of such additional terms.

How early stopping acts as a regularizer: So far we have stated that early stopping is a regularization strategy, but we have supported this claim only by showing learning curves where the validation set error has a U-shaped curve. What is the actual mechanism by which early stopping regularizes the model? Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θo. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) and with learning rate ε. We can view the product ετ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θo. In this sense, ετ behaves as if it were the reciprocal of the coefficient used for weight decay.

Indeed, we can show how—in the case of a simple linear model with a quadratic error function and simple gradient descent—early stopping is equivalent to L2
regularization.
In order to compare with classical L2 regularization, we examine a simple setting where the only parameters are linear weights (θ = w). We can model the cost function J with a quadratic approximation in the neighborhood of the empirically optimal value of the weights w∗:

    Ĵ(θ) = J(w∗) + (1/2)(w − w∗)⊤ H (w − w∗),    (7.33)

where H is the Hessian matrix of J with respect to w evaluated at w∗. Given the assumption that w∗ is a minimum of J(w), we know that H is positive semidefinite. Under a local Taylor series approximation, the gradient is given by:

    ∇w Ĵ(w) = H (w − w∗).    (7.34)

Writing H = QΛQ⊤ and following the gradient descent iterates in this eigenbasis, after τ steps starting from w(0) = 0 we obtain:

    Q⊤ w(τ) = [I − (I − εΛ)^τ] Q⊤ w∗.    (7.40)

Now, the expression for Q⊤ w̃ in Eq. 7.13 for L2 regularization can be rearranged as:

    Q⊤ w̃ = (Λ + αI)⁻¹ Λ Q⊤ w∗    (7.41)

3 For neural networks, to obtain symmetry breaking between hidden units, we cannot initialize all the parameters to 0, as discussed in Sec. 6.2. However, the argument holds for any other initial value w(0).
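Eq. 7.40 can be checked numerically (a sketch under the stated assumptions: a quadratic cost and gradient descent from w(0) = 0; the matrices below are arbitrary test data of our own):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
H = A.T @ A + 0.1 * np.eye(3)          # a positive definite Hessian
w_star = rng.normal(size=3)            # empirically optimal weights w*
lam, Q = np.linalg.eigh(H)             # H = Q diag(lam) Q^T

eps, tau = 0.01, 200                   # learning rate and number of steps
w = np.zeros(3)                        # w(0) = 0
for _ in range(tau):
    w = w - eps * H @ (w - w_star)     # gradient step on the quadratic cost

# Closed form, Eq. 7.40: Q^T w(tau) = [I - (I - eps*Lambda)^tau] Q^T w*
closed = Q @ ((1 - (1 - eps * lam) ** tau) * (Q.T @ w_star))

# Ridge analogue, Eq. 7.41, with alpha playing the role of 1/(eps*tau)
alpha = 1 / (eps * tau)
ridge = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))
```

The iterated trajectory matches the closed form exactly, and the ridge solution with α ≈ 1/(ετ) shrinks w∗ along the same eigendirections, which is the sense in which early stopping acts like weight decay.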
7.9 Parameter Tying and Parameter Sharing

Thus far, in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed region or point. For example, L2 regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero. However, sometimes we may need other ways to express our prior knowledge about suitable values of the model parameters.
Sometimes we might not know precisely what values the parameters should take but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

A common type of dependency that we often want to express is that certain parameters should be close to one another. Consider the following scenario: we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters w(A) and model B with parameters w(B). The two models map the input to two different, but related outputs: ŷ(A) = f(w(A), x) and ŷ(B) = g(w(B), x).

Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: ∀i, wi(A) should be close to wi(B). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form: Ω(w(A), w(B)) = ‖w(A) − w(B)‖₂². Here we used an L2 penalty, but other choices are also possible.
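As a sketch (a minimal version of our own), this penalty and its addition to the two task losses:

```python
import numpy as np

def tying_penalty(w_a, w_b):
    """Omega(w_A, w_B) = ||w_A - w_B||_2^2, pulling corresponding parameters together."""
    return np.sum((w_a - w_b) ** 2)

def total_loss(loss_a, loss_b, w_a, w_b, alpha=0.1):
    # Each model keeps its own task loss; the penalty couples their parameters.
    return loss_a + loss_b + alpha * tying_penalty(w_a, w_b)

w_a = np.array([1.0, 2.0])
w_b = np.array([1.0, 2.5])
penalty = tying_penalty(w_a, w_b)
```

The coefficient `alpha` (our name) sets how strongly the two parameter vectors are pulled toward each other during training.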
This kind of approach was proposed by Lasserre et al. (2006), who regularized the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures were constructed such that many of the parameters in the classifier model could be paired to corresponding parameters in the unsupervised model.

While a parameter norm penalty is one way to regularize parameters to be
close to one another, the more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, where we interpret the various models or model components as sharing a unique set of parameters. A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) need to be stored in memory. In certain models—such as the convolutional neural network—this can lead to significant reduction in the memory footprint of the model.
Convolutional Neural Networks   By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.

Natural images have many statistical properties that are invariant to translation. For example, a photo of a cat remains a photo of a cat if it is translated one pixel
to the right. CNNs take this property into account by sharing parameters across multiple image locations. The same feature (a hidden unit with the same weights) is computed over different locations in the input. This means that we can find a cat with the same cat detector whether the cat appears at column i or column i + 1 in the image.

Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture.

CNNs will be discussed in more detail in Chapter 9.
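As a rough, hypothetical illustration of the savings (toy sizes, not from the book): a fully connected layer mapping a 32x32 image to a same-sized map needs a weight per input-output pair, while a convolution shares one small kernel across every location.

```python
# Toy comparison of parameter counts: a dense pixel-to-pixel layer versus a
# single shared convolution kernel. The 32x32 image and 3x3 kernel sizes are
# arbitrary choices for illustration.
height = width = 32
dense_params = (height * width) ** 2   # one weight per (input pixel, output pixel) pair
conv_params = 3 * 3                    # one 3x3 kernel, shared across all locations
print(dense_params, conv_params)       # 1048576 vs 9
```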
7.10  Sparse Representations

Weight decay acts by placing a penalty directly on the model parameters. Another strategy is to place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse. This indirectly imposes a complicated penalty on the model parameters.

We have already discussed (in Sec. 7.1.2) how $L^1$ penalization induces a sparse parametrization—meaning that many of the parameters become zero (or close to zero). Representational sparsity, on the other hand, describes a representation where many of the elements of the representation are zero (or close to zero).
A simplified view of this distinction can be illustrated in the context of linear regression:

\[
\underbrace{\begin{bmatrix} 18 \\ 5 \\ 15 \\ -9 \\ -3 \end{bmatrix}}_{y \,\in\, \mathbb{R}^m}
=
\underbrace{\begin{bmatrix}
4 & 0 & 0 & -2 & 0 & 0 \\
0 & 0 & -1 & 0 & 3 & 0 \\
0 & 5 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 0 & -4 \\
1 & 0 & 0 & 0 & -5 & 0
\end{bmatrix}}_{A \,\in\, \mathbb{R}^{m \times n}}
\underbrace{\begin{bmatrix} 2 \\ 3 \\ -2 \\ -5 \\ 1 \\ 4 \end{bmatrix}}_{x \,\in\, \mathbb{R}^n}
\tag{7.46}
\]

\[
\underbrace{\begin{bmatrix} -14 \\ 1 \\ 19 \\ -2 \\ 23 \end{bmatrix}}_{y \,\in\, \mathbb{R}^m}
=
\underbrace{\begin{bmatrix}
3 & -1 & 2 & -5 & 4 & 1 \\
4 & 2 & -3 & -1 & 1 & 3 \\
-1 & 5 & 4 & 2 & -3 & -2 \\
3 & -1 & 2 & -3 & 0 & -3 \\
-5 & 4 & -2 & 2 & -5 & -1
\end{bmatrix}}_{B \,\in\, \mathbb{R}^{m \times n}}
\underbrace{\begin{bmatrix} 0 \\ 2 \\ 0 \\ 0 \\ -3 \\ 0 \end{bmatrix}}_{h \,\in\, \mathbb{R}^n}
\tag{7.47}
\]

In the first expression (Eq. 7.46) the parameter matrix $A$ is mostly zeros: the model is sparsely parametrized. In the second (Eq. 7.47) the representation $h$ is mostly zeros: the data has a sparse representation.
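The two systems can be verified numerically; the values below are taken directly from Eqs. 7.46 and 7.47.

```python
# Numeric check of the two linear-regression examples above: the first system
# has a sparse parameter matrix A, the second a sparse representation h.
import numpy as np

# Eq. 7.46: sparsely parametrized model -- most entries of A are zero.
A = np.array([[4, 0, 0, -2, 0, 0],
              [0, 0, -1, 0, 3, 0],
              [0, 5, 0, 0, 0, 0],
              [1, 0, 0, -1, 0, -4],
              [1, 0, 0, 0, -5, 0]])
x = np.array([2, 3, -2, -5, 1, 4])
print(A @ x)  # [18  5 15 -9 -3]

# Eq. 7.47: sparse representation -- most entries of h are zero, B is dense.
B = np.array([[3, -1, 2, -5, 4, 1],
              [4, 2, -3, -1, 1, 3],
              [-1, 5, 4, 2, -3, -2],
              [3, -1, 2, -3, 0, -3],
              [-5, 4, -2, 2, -5, -1]])
h = np.array([0, 2, 0, 0, -3, 0])
print(B @ h)  # [-14   1  19  -2  23]
```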
7.11  Bagging and Other Ensemble Methods

Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994). The idea is to train several different models separately, then have all of the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.

The reason that model averaging works is that different models will usually not make all the same errors on the test set.
Consider for example a set of $k$ regression models. Suppose that each model makes an error $\epsilon_i$ on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances $\mathbb{E}[\epsilon_i^2] = v$ and covariances $\mathbb{E}[\epsilon_i \epsilon_j] = c$. Then the error made by the average prediction of all the ensemble models is $\frac{1}{k}\sum_i \epsilon_i$. The expected squared error of the ensemble predictor is

\[
\mathbb{E}\left[\left(\frac{1}{k}\sum_i \epsilon_i\right)^{2}\right]
= \frac{1}{k^2}\,\mathbb{E}\left[\sum_i \left(\epsilon_i^2 + \sum_{j \neq i} \epsilon_i \epsilon_j\right)\right]
\tag{7.50}
\]
\[
= \frac{1}{k}v + \frac{k-1}{k}c.
\tag{7.51}
\]

In the case where the errors are perfectly correlated and $c = v$, the mean squared error reduces to $v$, so the model averaging does not help at all. In the case where the errors are perfectly uncorrelated and $c = 0$, the expected squared error of the ensemble is only $\frac{1}{k}v$. This means that the expected squared error of the ensemble decreases linearly with the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.
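Eq. 7.51 can be checked by Monte Carlo simulation; the values of k, v and c below are arbitrary choices.

```python
# Monte Carlo check of Eq. 7.51: with k models whose errors have variance v
# and pairwise covariance c, the ensemble's mean squared error is
# v/k + (k-1)c/k.
import numpy as np

rng = np.random.default_rng(0)
k, v, c = 10, 4.0, 1.0
cov = np.full((k, k), c) + np.eye(k) * (v - c)   # variances v, covariances c
errors = rng.multivariate_normal(np.zeros(k), cov, size=200_000)
ensemble_err = errors.mean(axis=1)               # error of the averaged prediction

empirical = np.mean(ensemble_err ** 2)
predicted = v / k + (k - 1) * c / k              # Eq. 7.51 -> 0.4 + 0.9 = 1.3
print(empirical, predicted)
```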
Different ensemble methods construct the ensemble of models in different ways. For example, each member of the ensemble could be formed by training a completely different kind of model using a different algorithm or objective function. Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.

Specifically, bagging involves constructing $k$ different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and also contains several duplicate examples (on average around 2/3 of the examples from the original dataset are found in the resulting training set, if it is the same size as the original).
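The dataset construction is easy to sketch: sampling n indices with replacement leaves about 1 - 1/e, roughly 63%, of the examples present in each bootstrap dataset, which is the "around 2/3" figure quoted above. Sizes are arbitrary.

```python
# Bootstrap dataset construction as used by bagging: each of the k datasets
# is drawn by sampling indices with replacement from the original dataset.
import numpy as np

rng = np.random.default_rng(0)
n_examples, k = 10_000, 5
bootstrap_datasets = [rng.integers(0, n_examples, size=n_examples) for _ in range(k)]

for indices in bootstrap_datasets:
    frac_unique = np.unique(indices).size / n_examples
    print(round(frac_unique, 3))   # each close to 1 - 1/e = 0.632
```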
7.12  Dropout

Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models. To a first approximation, dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks. Bagging involves training multiple models, and evaluating multiple models on each test example. This seems impractical when each model is a large neural network, since training and evaluating such networks is costly in terms of runtime and memory. It is common to use ensembles of five to ten neural networks—Szegedy et al. (2014a) used six to win the ILSVRC—but more than this rapidly becomes unwieldy. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network, as illustrated in Fig. 7.6. In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from a network by multiplying its output value by zero. This procedure requires some slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value. Here, we present the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially modified to work with other operations that remove a unit from the network.
Recall that to learn with bagging, we define $k$ different models, construct $k$ different datasets by sampling from the training set with replacement, and then train model $i$ on dataset $i$. Dropout aims to approximate this process, but with an exponentially large number of neural networks. Specifically, to train with dropout, we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent. Each time we load an example into a minibatch, we
[Figures 7.6 and 7.7 appeared here: the base network with the ensemble of sub-networks formed by removing non-output units, and forward propagation with a dropout mask µ.]
randomly sample a different binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently from all of the others. The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins. It is not a function of the current value of the model parameters or the input example. Typically, an input unit is included with probability 0.8 and a hidden unit is included with probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual. Fig. 7.7 illustrates how to run forward propagation with dropout.

More formally, suppose that a mask vector $\mu$ specifies which units to include, and $J(\theta, \mu)$ defines the cost of the model defined by parameters $\theta$ and mask $\mu$. Then dropout training consists in minimizing $\mathbb{E}_\mu J(\theta, \mu)$. The expectation contains exponentially many terms, but we can obtain an unbiased estimate of its gradient by sampling values of $\mu$.
alues of is µ.not quite the same as bagging training. In the case of
bagging, the mo models
dels are all indep independent.
endent. In the case of drop dropout,
out, the mo models
dels share
Drop out
parameters, with eac trainingeach is
h mo not
model quite the same as bagging training.
del inheriting a different subset of parameters from In the casethe
of
bagging,
paren
parent the mo
t neural netdels
network.
work. are This
all indep endent.sharing
parameter In the casemakes of it
drop out, the
possible to mo dels share
represent an
parameters,
exp
exponen
onen
onential
tial num with
umb eac h
ber of mo mo del
models inheriting a different
dels with a tractable amoun subset
amountt of memory of parameters from
memory.. In the case of the
paren t
bagging, eac neural
each net
h mo
modelwork. This parameter
del is trained to con sharing
convergence makes
vergence on its resp it
respectivpossible
ectiv to represent
ectivee training set. In thean
exp onen
case of drop tial
dropout,n um b er of mo
out, typically most mo dels with
modelsa tractable amoun t of memory
dels are not explicitly trained at all—usually . In the case
all—usually, of,
bagging,
the modeleac ishlarge
modelenough is trained
thattoitcon vergence
would on its resp
be infeasible toectiv e training
sample set. Insub-
all possible the
case
net
netw of drop
works out,the
within typically
lifetimemost of themouniv
delserse.
are Instead,
universe. not explicitlya tin
tiny ytrained
fractionatof all—usually
the possible,
the
sub-netmodel
sub-netw works is large
are eac enough
each h trainedthatforit awsingle
ould bstep,
e infeasible
and thetoparameter
sample allsharing
possible sub-
causes
net w orks within
the remaining sub-net the lifetime
sub-networks of the univ
works to arrive at go erse.
goo Instead, a tin y fraction
od settings of the parameters. Theseof the p ossible
sub-net
are the w orksdifferences.
only are each trained Bey ondforthese,
Beyond a single step, follows
dropout and thethe parameter sharing causes
bagging algorithm. For
the remaining sub-net works
example, the training set encountered by eac to arrive at go o
each d settings
h sub-netw
sub-network of the parameters.
ork is indeed a subset of These
are the only differences. Bey
the original training set sampled with replacemenond these, dropout
replacement. follows
t. the bagging algorithm. For
example, the training set encountered by each sub-network is indeed a subset of
To make a prediction, a bagged ensemble must accumulate votes from all of its members. We refer to this process as inference in this context. So far, our description of bagging and dropout has not required that the model be explicitly probabilistic. Now, we assume that the model's role is to output a probability distribution. In the case of bagging, each model $i$ produces a probability distribution $p^{(i)}(y \mid x)$. The prediction of the ensemble is given by the arithmetic mean of all of these distributions,

\[
\frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y \mid x).
\tag{7.52}
\]
In the case of dropout, each sub-model defined by mask vector $\mu$ defines a probability distribution $p(y \mid x, \mu)$. The arithmetic mean over all masks is given by

\[
\sum_{\mu} p(\mu)\, p(y \mid x, \mu)
\tag{7.53}
\]

where $p(\mu)$ is the probability distribution that was used to sample $\mu$ at training time.
Because this sum includes an exponential number of terms, it is intractable to evaluate except in cases where the structure of the model permits some form of simplification. So far, deep neural nets are not known to permit any tractable simplification. Instead, we can approximate the inference with sampling, by averaging together the output from many masks. Even 10-20 masks are often sufficient to obtain good performance.
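Mask-sampling inference can be sketched as follows (a one-layer softmax model with arbitrary sizes, assumed here purely for illustration): approximate the intractable sum of Eq. 7.53 by averaging the predictive distributions of a few sampled masks.

```python
# Approximate ensemble inference by sampling masks: average the predictive
# distributions produced under a handful of random dropout masks.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
x = rng.standard_normal(5)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(x, mask):
    return softmax((x * mask) @ W)

masks = rng.random((20, 5)) < 0.5              # 10-20 masks are often enough
p = np.mean([predict(x, m) for m in masks], axis=0)
print(p)                                        # approximate ensemble prediction
```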
However, there is an even better approach, that allows us to obtain a good approximation to the predictions of the entire ensemble, at the cost of only one forward propagation. To do so, we change to using the geometric mean rather than the arithmetic mean of the ensemble members' predicted distributions. Warde-Farley et al. (2014) present arguments and empirical evidence that the geometric mean performs comparably to the arithmetic mean in this context.
The geometric mean of multiple probability distributions is not guaranteed to be a probability distribution. To guarantee that the result is a probability distribution, we impose the requirement that none of the sub-models assigns probability 0 to any event, and we renormalize the resulting distribution. The unnormalized probability distribution defined directly by the geometric mean is given by

\[
\tilde{p}_{\text{ensemble}}(y \mid x) = \sqrt[2^d]{\prod_{\mu} p(y \mid x, \mu)}
\tag{7.54}
\]

where $d$ is the number of units that may be dropped. Here we use a uniform distribution over $\mu$ to simplify the presentation, but non-uniform distributions are also possible. To make predictions we must re-normalize the ensemble:

\[
p_{\text{ensemble}}(y \mid x) = \frac{\tilde{p}_{\text{ensemble}}(y \mid x)}{\sum_{y'} \tilde{p}_{\text{ensemble}}(y' \mid x)}.
\tag{7.55}
\]
A key insight (Hinton et al., 2012c) involved in dropout is that we can approximate $p_{\text{ensemble}}$ by evaluating $p(y \mid x)$ in one model: the model with all units, but with the weights going out of unit $i$ multiplied by the probability of including unit $i$. The motivation for this modification is to capture the right expected value of the output from that unit. We call this approach the weight scaling inference rule.
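A minimal numeric check of this rule, under the assumptions of inclusion probability 1/2 and a purely linear output layer (where matching the expected input is exact): averaging masked activations over many sampled masks approaches the prediction made with halved outgoing weights. Sizes are arbitrary.

```python
# Weight scaling inference sketch: with each hidden unit kept with probability
# 1/2, halving the outgoing weights reproduces the expected total input that
# the next layer sees under random masks.
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(50)            # hidden activations (pre-mask)
V = rng.standard_normal((50, 3))       # outgoing weights of the hidden layer

masked = np.mean([((rng.random(50) < 0.5) * h) @ V for _ in range(50_000)], axis=0)
scaled = h @ (V / 2)                   # weight scaling rule

print(masked, scaled)                  # the Monte Carlo average approaches h @ (V/2)
```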
There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear networks, but empirically it performs very well.

Because we usually use an inclusion probability of $\frac{1}{2}$, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training, and then using the model as usual. Another way to achieve the same result is to multiply the states of the units by 2 during training. Either way, the goal is to make sure that the expected total input to a unit at test time is roughly the same as the expected total input to that unit at train time, even though half the units at train time are missing on average.

For many classes of models that do not have nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with $n$ input variables represented by the vector $v$:

\[
P(\mathrm{y} = y \mid v) = \mathrm{softmax}\big(W^\top v + b\big)_y.
\tag{7.56}
\]

We can index into the family of sub-models by element-wise multiplication of the input with a binary vector $d$:

\[
P(\mathrm{y} = y \mid v; d) = \mathrm{softmax}\big(W^\top (d \odot v) + b\big)_y.
\tag{7.57}
\]

The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members' predictions:

\[
P_{\text{ensemble}}(\mathrm{y} = y \mid v) = \frac{\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v)}{\sum_{y'} \tilde{P}_{\text{ensemble}}(\mathrm{y} = y' \mid v)}
\tag{7.58}
\]

where

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v) = \sqrt[2^n]{\prod_{d \in \{0,1\}^n} P(\mathrm{y} = y \mid v; d)}.
\tag{7.59}
\]

To see that the weight scaling rule is exact, we can simplify $\tilde{P}_{\text{ensemble}}$:

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v) = \sqrt[2^n]{\prod_{d \in \{0,1\}^n} P(\mathrm{y} = y \mid v; d)}
\tag{7.60}
\]
\[
= \sqrt[2^n]{\prod_{d \in \{0,1\}^n} \mathrm{softmax}\big(W^\top (d \odot v) + b\big)_y}
\tag{7.61}
\]
\[
= \sqrt[2^n]{\prod_{d \in \{0,1\}^n} \frac{\exp\big(W_{y,:}^\top (d \odot v) + b_y\big)}{\sum_{y'} \exp\big(W_{y',:}^\top (d \odot v) + b_{y'}\big)}}
\tag{7.62}
\]
\[
= \frac{\sqrt[2^n]{\prod_{d \in \{0,1\}^n} \exp\big(W_{y,:}^\top (d \odot v) + b_y\big)}}{\sqrt[2^n]{\prod_{d \in \{0,1\}^n} \sum_{y'} \exp\big(W_{y',:}^\top (d \odot v) + b_{y'}\big)}}
\tag{7.63}
\]

Because $\tilde{P}_{\text{ensemble}}$ will be normalized, we can safely ignore multiplication by factors that are constant with respect to $y$:

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid v) \propto \sqrt[2^n]{\prod_{d \in \{0,1\}^n} \exp\big(W_{y,:}^\top (d \odot v) + b_y\big)}
\tag{7.64}
\]
\[
= \exp\left(\frac{1}{2^n} \sum_{d \in \{0,1\}^n} W_{y,:}^\top (d \odot v) + b_y\right)
\tag{7.65}
\]
\[
= \exp\left(\frac{1}{2} W_{y,:}^\top v + b_y\right)
\tag{7.66}
\]

Substituting this back into Eq. 7.58 we obtain a softmax classifier with weights $\frac{1}{2}W$.

The weight scaling rule is also exact in other settings, including regression networks with conditionally normal outputs, and deep networks that have hidden layers without nonlinearities. However, the weight scaling rule is only an approximation for deep models that have nonlinearities. Though the approximation has not been theoretically characterized, it often works well, empirically. Goodfellow et al. (2013a) found experimentally that the weight scaling approximation can work better (in terms of classification accuracy) than Monte Carlo approximations to the ensemble predictor. This held true even when the Monte Carlo approximation was allowed to sample up to 1,000 sub-networks. Gal and Ghahramani (2015) found that some models obtain better classification accuracy using twenty samples and the Monte Carlo approximation. It appears that the optimal choice of inference approximation is problem-dependent.
Srivastava et al. (2014) showed that dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization. Dropout may also be combined with other forms of regularization to yield a further improvement.
withOneotheradv
advantage
antage
forms of of drop
dropoutout is that
regularization to yield it isa very
furthercomputationally
improvement. cheap. Using
drop
dropout
out during training requires only O (n ) computation per example per update,
One advnantage
to generate random of binary
dropout num isbthat
umb ers and it is very computationally
multiply them by the state. cheap.
Dep Using
Depending
ending
drop
on outimplemen
the during training
implementation, tation, requires
it may also O (n ) computation
onlyrequire O (n) memory pertoexample per update,
store these binary
to
num generate
umbbers un n
until random
til the bac k-propagation stage. Running inference in the trained ending
binary
back-propagationn um b ers and multiply them by the state. Dep mo
model
del
on the implementation, it may also require O (n) memory to store these binary
numbers until the back-propagation stage. 263 Running inference in the trained mo del
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
has the same cost per-example as if dropout were not used, though we must pay
the cost of dividing the weights by 2 once before beginning to run inference on
examples.
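These costs can be sketched as follows (an illustrative toy, not a full network; the function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, keep_prob=0.5):
    # Training: O(n) extra work -- one random binary number per unit,
    # multiplied into the state.  The mask may be stored (O(n) memory)
    # until the back-propagation stage.
    mask = rng.binomial(1, keep_prob, size=h.shape)
    return h * mask, mask

def inference_weights(W, keep_prob=0.5):
    # Weight scaling rule: scale the weights by the keep probability once
    # (divide by 2 when keep_prob = 1/2); per-example inference then costs
    # the same as in a network trained without dropout.
    return W * keep_prob

h = np.ones(6)
dropped, mask = dropout_forward(h)
W_test = inference_weights(np.ones((6, 2)))
```

The one-time rescaling in `inference_weights` is why the trained network needs no mask sampling at test time.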
Another significant advantage of dropout is that it does not significantly limit
the type of model or training procedure that can be used. It works well with nearly
any model that uses a distributed representation and can be trained with stochastic
gradient descent. This includes feedforward neural networks, probabilistic models
such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent
neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). Many other
regularization strategies of comparable power impose more severe restrictions on
the architecture of the model.
Though the cost per-step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. Because dropout
is a regularization technique, it reduces the effective capacity of a model. To offset
this effect, we must increase the size of the model. Typically the optimal validation
set error is much lower when using dropout, but this comes at the cost of a much
larger model and many more iterations of the training algorithm. For very large
datasets, regularization confers little reduction in generalization error. In these
cases, the computational cost of using dropout and larger models may outweigh
the benefit of regularization.
When extremely few labeled training examples are available, dropout is less
effective. Bayesian neural networks (Neal, 1996) outperform dropout on the
Alternative Splicing Dataset (Xiong et al., 2011) where fewer than 5,000 examples
are available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.
Wager et al. (2013) showed that, when applied to linear regression, dropout
is equivalent to L^2 weight decay, with a different weight decay coefficient for
each input feature. The magnitude of each feature's weight decay coefficient is
determined by its variance. Similar results hold for other linear models. For deep
models, dropout is not equivalent to weight decay.
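This equivalence can be checked numerically for linear regression (a sketch assuming inverted-dropout masks on the inputs; the closed form restates the per-feature penalty described above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 5, 0.8            # examples, features, keep probability
X = rng.normal(size=(n, d)) * rng.uniform(0.5, 2.0, size=d)
w = rng.normal(size=d)
y = X @ w + 0.1 * rng.normal(size=n)

def expected_dropout_mse(w, X, y, p, n_samples=2000):
    # Monte Carlo average of the squared error when inputs are hit by
    # inverted-dropout masks (keep with probability p, scale by 1/p).
    total = 0.0
    for _ in range(n_samples):
        mask = rng.binomial(1, p, size=X.shape) / p
        total += np.mean((y - (X * mask) @ w) ** 2)
    return total / n_samples

# Closed form: ordinary squared error plus a per-feature L^2 penalty
# weighted by each feature's second moment.
plain = np.mean((y - X @ w) ** 2)
penalty = (1 - p) / p * np.sum(w ** 2 * np.mean(X ** 2, axis=0))
```

The Monte Carlo average and `plain + penalty` agree closely, illustrating that features with larger second moments receive larger effective weight decay.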
The stochasticity used while training with dropout is not necessary for the
approach's success. It is just a means of approximating the sum over all sub-
models. Wang and Manning (2013) derived analytical approximations to this
marginalization. Their approximation, known as fast dropout, resulted in faster
convergence time due to the reduced stochasticity in the computation of the
gradient. This method can also be applied at test time, as a more principled
(but also more computationally expensive) approximation to the average over all
sub-networks than the weight scaling approximation. Fast dropout has been used
to nearly match the performance of standard dropout on small neural network
problems, but has not yet yielded a significant improvement or been applied to a
large problem.
Just as stochasticity is not necessary to achieve the regularizing effect of
dropout, it is also not sufficient. To demonstrate this, Warde-Farley et al. (2014)
designed control experiments using a method called dropout boosting that they
designed to use exactly the same mask noise as traditional dropout but lack
its regularizing effect. Dropout boosting trains the entire ensemble to jointly
maximize the log-likelihood on the training set. In the same sense that traditional
dropout is analogous to bagging, this approach is analogous to boosting. As
intended, experiments with dropout boosting show almost no regularization effect
compared to training the entire network as a single model. This demonstrates that
the interpretation of dropout as bagging has value beyond the interpretation of
dropout as robustness to noise. The regularization effect of the bagged ensemble is
only achieved when the stochastically sampled ensemble members are trained to
perform well independently of each other.
Dropout has inspired other stochastic approaches to training exponentially
large ensembles of models that share weights. DropConnect is a special case of
dropout where each product between a single scalar weight and a single hidden
unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic
pooling is a form of randomized pooling (see Sec. 9.3) for building ensembles
of convolutional networks with each convolutional network attending to different
spatial locations of each feature map. So far, dropout remains the most widely
used implicit ensemble method.
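A minimal sketch of the DropConnect idea (shapes and names are illustrative): the random binary mask has the shape of the weight matrix rather than the shape of the hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_forward(x, W, keep_prob=0.5):
    # DropConnect: each scalar weight (each product between one weight and
    # one unit state) can be dropped individually, so the binary mask is
    # sampled with the shape of W rather than the shape of the activations.
    mask = rng.binomial(1, keep_prob, size=W.shape)
    return x @ (W * mask)

x = np.ones(4)
W = np.ones((4, 3))
h = dropconnect_forward(x, W)
```

With `keep_prob=1.0` this reduces to the ordinary matrix product, which is a convenient sanity check.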
One of the key insights of dropout is that training a network with stochastic
behavior and making predictions by averaging over multiple stochastic decisions
implements a form of bagging with parameter sharing. Earlier, we described
dropout as bagging an ensemble of models formed by including or excluding
units. However, there is no need for this model averaging strategy to be based on
inclusion and exclusion. In principle, any kind of random modification is admissible.
In practice, we must choose modification families that neural networks are able
to learn to resist. Ideally, we should also use model families that allow a fast
approximate inference rule. We can think of any form of modification parametrized
by a vector µ as training an ensemble consisting of p(y | x, µ) for all possible
values of µ. There is no requirement that µ have a finite number of values. For
example, µ can be real-valued. Srivastava et al. (2014) showed that multiplying the
weights by µ ∼ N(1, I) can outperform dropout based on binary masks. Because
E[µ] = 1, the standard network automatically implements approximate inference
in the ensemble, without needing any weight scaling.
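The real-valued case can be sketched as follows (a toy illustration of multiplicative Gaussian noise with mean 1, not the exact experimental setup of Srivastava et al.):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_multiplicative_noise(h, training=True):
    # Multiply each unit by mu ~ N(1, 1).  Since E[mu] = 1, the network
    # needs no weight scaling at test time: the identity is used as-is.
    if training:
        return h * rng.normal(loc=1.0, scale=1.0, size=h.shape)
    return h

h = np.full((1000, 4), 2.0)
noisy = gaussian_multiplicative_noise(h)
```

Averaging the noisy activations recovers the clean values because the noise has mean one, which is the approximate-inference property mentioned above.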
So far we have described dropout purely as a means of performing efficient,
approximate bagging. However, there is another view of dropout that goes further
than this. Dropout trains not just a bagged ensemble of models, but an ensemble
of models that share hidden units. This means each hidden unit must be able to
perform well regardless of which other hidden units are in the model. Hidden units
must be prepared to be swapped and interchanged between models. Hinton et al.
(2012c) were inspired by an idea from biology: sexual reproduction, which involves
swapping genes between two different organisms, creates evolutionary pressure for
genes to become not just good, but to become readily swapped between different
organisms. Such genes and such features are very robust to changes in their
environment because they are not able to incorrectly adapt to unusual features
of any one organism or model. Dropout thus regularizes each hidden unit to be
not merely a good feature but a feature that is good in many contexts. Warde-
Farley et al. (2014) compared dropout training to training of large ensembles and
concluded that dropout offers additional improvements to generalization error
beyond those obtained by ensembles of independent models.
It is important to understand that a large portion of the power of dropout
arises from the fact that the masking noise is applied to the hidden units. This
can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For
example, if the model learns a hidden unit h_i that detects a face by finding the nose,
then dropping h_i corresponds to erasing the information that there is a nose in
the image. The model must learn another h_i, either that redundantly encodes the
presence of a nose, or that detects the face by another feature, such as the mouth.
Traditional noise injection techniques that add unstructured noise at the input are
not able to randomly erase the information about a nose from an image of a face
unless the magnitude of the noise is so great that nearly all of the information in
the image is removed. Destroying extracted features rather than original values
allows the destruction process to make use of all of the knowledge about the input
distribution that the model has acquired so far.
Another important aspect of dropout is that the noise is multiplicative. If the
noise were additive with fixed scale, then a rectified linear hidden unit h_i with
added noise could simply learn to have h_i become very large in order to make
the added noise insignificant by comparison. Multiplicative noise does not allow
such a pathological solution to the noise robustness problem.
Another deep learning algorithm, batch normalization, reparametrizes the
model in a way that introduces both additive and multiplicative noise on the
[Figure 7.8 image: x, classified “panda” w/ 57.7% confidence; + .007 × sign(∇x J(θ, x, y)), classified “nematode” w/ 8.2% confidence; = x + .007 sign(∇x J(θ, x, y)), classified “gibbon” w/ 99.3% confidence]
Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet
(Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose
elements are equal to the sign of the elements of the gradient of the cost function with
respect to the input, we can change GoogLeNet's classification of the image. Reproduced
with permission from Goodfellow et al. (2014b).
hidden units at training time. The primary purpose of batch normalization is to
improve optimization, but the noise can have a regularizing effect, and sometimes
makes dropout unnecessary. Batch normalization is described further in Sec. 8.7.1.
7.13 Adversarial Training
In many cases, neural networks have begun to reach human performance when
evaluated on an i.i.d. test set. It is natural therefore to wonder whether these
models have obtained a true human-level understanding of these tasks. In order
to probe the level of understanding a network has of the underlying task, we can
search for examples that the model misclassifies. Szegedy et al. (2014b) found that
even neural networks that perform at human level accuracy have a nearly 100%
error rate on examples that are intentionally constructed by using an optimization
procedure to search for an input x′ near a data point x such that the model output
is very different at x′. In many cases, x′ can be so similar to x that a human
observer cannot tell the difference between the original example and the adversarial
example, but the network can make highly different predictions. See Fig. 7.8 for
an example.
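The construction in Fig. 7.8 can be sketched on a toy model (this is not GoogLeNet; the logistic model and its analytic gradient are purely illustrative):

```python
import numpy as np

def adversarial_perturb(x, grad_x, epsilon=0.007):
    # Add an imperceptibly small vector whose elements equal the sign of
    # the gradient of the cost with respect to the input, as in Fig. 7.8.
    return x + epsilon * np.sign(grad_x)

# Toy differentiable model: logistic regression with cost
# J = -log sigmoid(w.x) for a positive example, so
# grad_x J = -(1 - sigmoid(w.x)) * w  (computed analytically here).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.1, 0.2, -0.3])
p = 1.0 / (1.0 + np.exp(-(w @ x)))
grad_x = -(1.0 - p) * w
x_adv = adversarial_perturb(x, grad_x)
```

Each input coordinate moves by exactly ±.007, yet the step is taken in the direction that increases the cost, so the model's confidence in the correct label drops.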
Adversarial examples have many implications, for example, in computer security,
that are beyond the scope of this chapter. However, they are interesting in the
context of regularization because one can reduce the error rate on the original i.i.d.
test set via adversarial training—training on adversarially perturbed examples
Sec. 5.11.3.
One of the early attempts to take advantage of the manifold hypothesis is the
tangent distance algorithm (Simard et al., 1993, 1998). It is a non-parametric
nearest-neighbor algorithm in which the metric used is not the generic Euclidean
distance but one that is derived from knowledge of the manifolds near which
probability concentrates. It is assumed that we are trying to classify examples and
that examples on the same manifold share the same category. Since the classifier
should be invariant to the local factors of variation that correspond to movement
on the manifold, it would make sense to use as nearest-neighbor distance between
points x1 and x2 the distance between the manifolds M1 and M2 to which they
respectively belong. Although that may be computationally difficult (it would
require solving an optimization problem, to find the nearest pair of points on M1
and M2), a cheap alternative that makes sense locally is to approximate Mi by its
tangent plane at xi and measure the distance between the two tangents, or between
a tangent plane and a point. That can be achieved by solving a low-dimensional
linear system (in the dimension of the manifolds). Of course, this algorithm requires
one to specify the tangent vectors.
In a related spirit, the tangent prop algorithm (Simard et al., 1992) (Fig. 7.9)
trains a neural net classifier with an extra penalty to make each output f(x) of
the neural net locally invariant to known factors of variation. These factors of
variation correspond to movement along the manifold near which examples of the
same class concentrate. Local invariance is achieved by requiring ∇x f(x) to be
orthogonal to the known manifold tangent vectors v(i) at x, or equivalently that
the directional derivative of f at x in the directions v(i) be small by adding a
regularization penalty Ω:

    Ω(f) = Σ_i ((∇x f(x))⊤ v(i))².                    (7.67)
This regularizer can of course be scaled by an appropriate hyperparameter, and, for
most neural networks, we would need to sum over many outputs rather than the lone
output f(x) described here for simplicity. As with the tangent distance algorithm,
the tangent vectors are derived a priori, usually from the formal knowledge of
the effect of transformations such as translation, rotation, and scaling in images.
Tangent prop has been used not just for supervised learning (Simard et al., 1992)
but also in the context of reinforcement learning (Thrun, 1995).
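Eq. 7.67 can be sketched numerically (an illustration only: the directional derivative is approximated by finite differences rather than computed by back-propagation, and the tangent vector here is a stand-in):

```python
import numpy as np

def tangent_prop_penalty(f, x, tangents, eps=1e-5):
    # Omega(f) = sum_i ((grad_x f(x))^T v_i)^2, with each directional
    # derivative approximated by a central finite difference.
    penalty = 0.0
    for v in tangents:
        d = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
        penalty += d ** 2
    return penalty

# Sanity check: for the linear map f(x) = w.x the directional derivative
# along v is exactly w.v, so the penalty is (w.v)^2.
w = np.array([1.0, 2.0])
f = lambda x: w @ x
x = np.array([0.5, -0.5])
v = np.array([0.0, 1.0])   # hypothetical tangent direction
omega = tangent_prop_penalty(f, x, [v])
```

Adding `omega` (scaled by a hyperparameter) to the training loss penalizes output change along the specified tangent direction, which is the invariance the text describes.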
Tangent propagation is closely related to dataset augmentation. In both
cases, the user of the algorithm encodes his or her prior knowledge of the task
by specifying a set of transformations that should not alter the output of the
net
netw work. The difference is that in the case of dataset augmentation, the net netw work is
explicitly trained to correctly classify distinct inputs that were created by applying
network.
more thanThe difference is that
an infinitesimal amountin theof case
theseoftransformations.
dataset augmentation, Tangent thepropagation
network is
explicitly
do
doeses not trained
require to correctlyvisiting
explicitly classify adistinct
new inputinputs that wInstead,
point. ere created by applying
it analytically
more than
regularizes the moan infinitesimal
model amount of these transformations.
del to resist perturbation in the directions corresp T angent propagation
corresponding
onding to
doesspecified
the not require explicitly visiting
transformation. While a new
this input
analyticalpoint.approac
Instead,
approach h isitintellectually
analytically
regularizes
elegan
elegant, t, it hasthetwo model
ma
major to drawbac
jor resist perturbation
drawbacks. ks. First, it in onlytheregularizes
directionsthe corresp
mo
model
delonding to
to resist
the specified ptransformation.
infinitesimal erturbation. Explicit Whiledataset
this analytical
augmen
augmentation approac
tation h is intellectually
confers resistance to
larger perturbations. Second, the infinitesimal approach poses difficultiesdel
elegan t, it has two ma jor drawbac ks. First, it only regularizes the mo fortomo resist
models
dels
infinitesimal
based on rectifiedperturbation.
linear units. Explicit
These dataset
mo
models
delsaugmen
can only tation confers
shrink theirresistance
deriv
derivativ
ativto
atives
es
larger p erturbations. Second,
by turning units off or shrinking their weigh the infinitesimal
weights. approach p oses difficulties
ts. They are not able to shrink their for mo dels
based
deriv
derivativ on
ativ
atives rectified linear units. These
es by saturating at a high value with large mo dels can only weigh shrink
ts, as their
eights, sigmoidderiv orativ
tanhes
b y turning units off or shrinking their weigh ts. They
units can. Dataset augmentation works well with rectified linear units because are not able to shrink their
deriv
differen
differentativ es by saturating
t subsets of rectifiedatunitsa high canvalue
activ with
activate
ate forlarge weigh
differen
different ts, as sigmoid
t transformed ve or tanh
versions
rsions of
units
eac
each can. Dataset
h original input. augmentation works well with rectified linear units b ecause
different subsets of rectified units can activate for different transformed versions of
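As a concrete illustration, the tangent propagation regularizer penalizes the squared directional derivative of the model output along each specified tangent vector. Below is a minimal NumPy sketch for a toy two-layer tanh network; the network shape, the model itself, and the single translation-like tangent vector `v` are illustrative choices, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3)) * 0.1  # toy first-layer weights
W2 = rng.normal(size=(2, 4)) * 0.1  # toy second-layer weights

def model(x):
    # f(x) = W2 tanh(W1 x): a stand-in for any differentiable classifier.
    return W2 @ np.tanh(W1 @ x)

def tangent_prop_penalty(x, tangents):
    # Sum over tangent vectors v of ||(df/dx) v||^2: the model is pushed
    # to be locally invariant to infinitesimal moves along each v.
    h = np.tanh(W1 @ x)
    J = W2 @ np.diag(1.0 - h ** 2) @ W1  # Jacobian of the model at x
    return sum(float((J @ v) @ (J @ v)) for v in tangents)

x = np.ones(3)
v = np.array([1.0, 0.0, 0.0])  # hypothetical tangent direction
penalty = tangent_prop_penalty(x, [v])
```

In training, a penalty like this would be scaled by a hyperparameter and added to the classification loss; dataset augmentation would instead evaluate the loss at x + εv for non-infinitesimal ε.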
Tangent propagation is also related to double backprop (Drucker and LeCun, 1992) and adversarial training (Szegedy et al., 2014b; Goodfellow et al., 2014b). Double backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original inputs and trains the model to produce the same output on these as on the original inputs. Tangent propagation and dataset augmentation using manually specified transformations both require that the model should be invariant to certain specified directions of change in the input. Double backprop and adversarial training both require that the model should be invariant to all directions of change in the input so long as the change is small. Just as dataset augmentation is the non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version of double backprop.
The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to know the tangent vectors a priori. As we will see in Chapter 14, autoencoders can estimate the manifold tangent vectors. The manifold tangent classifier makes use of this technique to avoid needing user-specified tangent vectors. As illustrated in Fig. 14.10, these estimated tangent vectors go beyond the classical invariants that arise out of the geometry of images (such as translation, rotation and scaling) and include factors that must be learned because they are object-specific (such as moving body parts). The algorithm proposed with the manifold tangent classifier is therefore simple: (1) use an autoencoder to learn the manifold structure by unsupervised learning, and (2) use these tangents to regularize a neural net classifier as in tangent prop (Eq. 7.67).

This chapter has described most of the general strategies used to regularize neural networks.
Regularization is a central theme of machine learning and as such will be revisited periodically by most of the remaining chapters. Another central theme of machine learning is optimization, described next.

Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output function f(x). Each curve represents the manifold for a different class, illustrated here as a one-dimensional manifold embedded in a two-dimensional space. On one curve, we have chosen a single point and drawn a vector that is tangent to the class manifold (parallel to and touching the manifold) and a vector that is normal to the class manifold (orthogonal to the manifold). In multiple dimensions there may be many tangent directions and many normal directions. We expect the classification function to change rapidly as it moves in the direction normal to the manifold, and not to change as it moves along the class manifold. Both tangent propagation and the manifold tangent classifier regularize f(x) to not change very much as x moves along the manifold. Tangent propagation requires the user to manually specify functions that compute the tangent directions (such as specifying that small translations of images remain in the same class manifold) while the manifold tangent classifier estimates the manifold tangent directions by training an autoencoder to fit the training data. The use of autoencoders to estimate manifolds will be described in Chapter 14.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the "patience," the number of times to observe worsening validation set error before giving up.
Let θo be the initial parameters.
θ ← θo
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
        j ← 0
        θ* ← θ
        i* ← i
        v ← v′
    else
        j ← j + 1
    end if
end while
Best parameters are θ*, best number of training steps is i*
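A runnable Python rendering of Algorithm 7.1 may make the bookkeeping clearer. The training and validation routines here are hypothetical stand-ins (a scalar "parameter" that keeps drifting past the validation optimum, mimicking overfitting over time), not part of the book's algorithm.

```python
import copy

def early_stopping(theta0, train_n_steps, validation_error, n=1, patience=5):
    """Algorithm 7.1: return the best parameters and best number of steps."""
    theta = copy.deepcopy(theta0)
    i = 0             # training steps taken so far
    j = 0             # evaluations since the last improvement
    v = float("inf")  # best validation error seen so far
    best_theta, best_i = copy.deepcopy(theta), i
    while j < patience:
        theta = train_n_steps(theta, n)  # run the training algorithm for n steps
        i += n
        v_new = validation_error(theta)
        if v_new < v:
            j, v = 0, v_new
            best_theta, best_i = copy.deepcopy(theta), i
        else:
            j += 1
    return best_theta, best_i

# Hypothetical stand-ins: "training" keeps increasing a scalar parameter,
# overshooting the validation optimum at theta = 1.
def fake_train(theta, n):
    return theta + 0.1 * n

def valid_err(theta):
    return (theta - 1.0) ** 2

best_theta, best_i = early_stopping(0.0, fake_train, valid_err, n=1, patience=3)
```

After three consecutive evaluations without improvement, the loop stops and returns the parameters and step count from the best evaluation, not the last one.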
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X(train) and y(train) be the training set.
Split X(train) and y(train) into (X(subtrain), X(valid)) and (y(subtrain), y(valid)) respectively.
Run early stopping (Algorithm 7.1) starting from random θ using X(subtrain) and y(subtrain) for training data and X(valid) and y(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X(train) and y(train) for i* steps.
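The strategy of Algorithm 7.2 can be sketched end to end on a toy one-parameter regression. The dataset, model, learning rate, and patience below are invented for illustration, and the step cap on the early stopping loop is an extra safety guard that is not part of the algorithm.

```python
import random

random.seed(0)

# Hypothetical toy dataset: y = 2x plus noise, for a model y = w * x.
data = [(i / 10, 2.0 * (i / 10) + random.gauss(0.0, 0.1)) for i in range(20)]
subtrain, valid = data[:15], data[15:]

def train_steps(w, dataset, steps, lr=0.01):
    # Plain SGD on squared error, cycling through the dataset.
    for k in range(steps):
        x, y = dataset[k % len(dataset)]
        w -= lr * 2.0 * (w * x - y) * x
    return w

def mse(w, dataset):
    return sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)

# Step 1: early stopping (as in Algorithm 7.1) on the subtrain/valid split,
# evaluating every 10 steps with a patience of 3 evaluations.
w, i, j, best_v, best_i = 0.0, 0, 0, float("inf"), 0
while j < 3 and i < 10_000:
    w = train_steps(w, subtrain, 10)
    i += 10
    v = mse(w, valid)
    if v < best_v:
        best_v, best_i, j = v, i, 0
    else:
        j += 1

# Step 2: reinitialize and train on ALL the data for best_i steps.
w_final = train_steps(0.0, data, best_i)
```

The point of the second phase is that the validation examples are no longer held out, so the final model gets to learn from all of the data while still training for a number of steps chosen by validation.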
Chapter 8

Optimization for Training Deep Models

Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA involves solving an optimization problem. We often use analytical optimization to write proofs or design algorithms. Of all of the many optimization problems involved in deep learning, the most difficult is neural network training. It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem. Because this problem is so important and so expensive, a specialized set of optimization techniques has been developed for solving it. This chapter presents these optimization techniques for neural network training.

If you are unfamiliar with the basic principles of gradient-based optimization, we suggest reviewing Chapter 4. That chapter includes a brief overview of numerical optimization in general.

This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

We begin with a description of how optimization used as a training algorithm for a machine learning task differs from pure optimization. Next, we present several of the concrete challenges that make optimization of neural networks difficult. We then define several practical algorithms, including both optimization algorithms themselves and strategies for initializing the parameters. More advanced algorithms adapt their learning rates during training or leverage information contained in the second derivatives of the cost function.
8.1 How Learning Differs from Pure Optimization

Optimization algorithms used for training of deep models differ from traditional optimization algorithms in several ways. Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure P, which is defined with respect to the test set and may also be intractable. We therefore optimize P only indirectly. We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.
Typically, the cost function can be written as an average over the training set, such as

    J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y),    (8.1)
where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output. Throughout this chapter, we develop the unregularized supervised case, where the arguments to L are f(x; θ) and y. However, it is trivial to extend this development, for example, to include θ or x as arguments, or to exclude y as an argument, in order to develop various forms of regularization or unsupervised learning.

Eq. 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data generating distribution p_data rather than just over the finite training set:

    J*(θ) = E_{(x,y)∼p_data} L(f(x; θ), y).    (8.2)
8.1.1 Empirical Risk Minimization

The goal of a machine learning algorithm is to reduce the expected generalization error given by Eq. 8.2. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task
solvable by an optimization algorithm. However, when we do not know p_data(x, y) but only have a training set of samples, we have a machine learning problem.

The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

    E_{x,y∼p̂_data(x,y)}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))    (8.3)
where m is the number of training examples.

The training process based on minimizing this average training error is known as empirical risk minimization. In this setting, machine learning is still very similar to straightforward optimization. Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.

However, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible. The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.
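Eq. 8.3 is just an average of per-example losses, and minimizing it is an ordinary optimization problem. A short sketch with a one-parameter model, squared-error loss, and a made-up noisy dataset (all of which are illustrative inventions, not the book's example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy supervised data generated from a known rule y = 3x, observed with noise.
m = 100
x = rng.normal(size=m)
y = 3.0 * x + rng.normal(scale=0.5, size=m)

def f(x, theta):
    # Hypothetical one-parameter model.
    return theta * x

def empirical_risk(theta):
    # Eq. 8.3: (1/m) sum_i L(f(x_i; theta), y_i), with squared-error loss L.
    return float(np.mean((f(x, theta) - y) ** 2))

# Minimizing the empirical risk over a parameter grid recovers a value near
# the truth; the true risk would instead average over the data distribution,
# which we never observe directly.
thetas = np.linspace(0.0, 6.0, 601)
theta_hat = thetas[np.argmin([empirical_risk(t) for t in thetas])]
```

With enough samples the empirical risk minimizer lands close to the data-generating parameter, which is exactly the hope expressed above: driving the empirical risk down also drives the risk down.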
8.1.2 Surrogate Loss Functions and Early Stopping

Sometimes, the loss function we actually care about (say classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
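The contrast between the two losses is easy to see on a toy logistic model: the 0-1 loss is piecewise constant in the parameter, so it gives gradient descent nothing to work with, while the negative log-likelihood surrogate is smooth and still distinguishes classifiers the 0-1 loss cannot. The dataset and one-parameter model below are illustrative inventions.

```python
import math

# Toy 1-D binary classification data (label 1 when x > 0, with one
# mislabeled point so that zero training error is impossible).
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def zero_one_loss(w):
    # Fraction of misclassified points: flat almost everywhere in w, so its
    # derivative is zero or undefined and provides no descent signal.
    return sum(1 for x, y in data
               if (sigmoid(w * x) >= 0.5) != (y == 1)) / len(data)

def nll(w):
    # Negative log-likelihood surrogate: smooth in w, so it keeps improving
    # (the classifier grows more confident) even where 0-1 loss is constant.
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(data)

# Several values of w that classify every point identically:
losses = [(w, zero_one_loss(w), nll(w)) for w in (0.5, 1.0, 2.0)]
```

All three values of w yield the same 0-1 loss, yet the NLL keeps decreasing as w grows, which is the "pushing the classes apart" effect discussed next.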
In some cases, a surrogate loss function actually results in being able to learn more. For example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by further pushing the classes apart from each other, obtaining a more confident and reliable classifier, thus extracting more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.

A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum. Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion based on early stopping (Sec. 7.8) is satisfied. Typically the early stopping criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur. Training often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.
8.1.3 Batch and Minibatch Algorithms

One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.

For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

    θ_ML = arg max_θ Σ_{i=1}^{m} log p_model(x^(i), y^(i); θ).    (8.4)

Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

    J(θ) = E_{x,y∼p̂_data} log p_model(x, y; θ).    (8.5)
Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the
than one but less than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods and it is now common to simply call them stochastic methods.

The canonical example of a stochastic method is stochastic gradient descent, presented in detail in Sec. 8.3.1.

Minibatch sizes are generally driven by the following factors:

• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

• Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.

• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.

• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
rate and
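The less-than-linear returns from larger batches can be seen in a small simulation. The sketch below (a constructed example, not from the text; NumPy) estimates the gradient of a squared-error loss at a fixed parameter value from minibatches of different sizes. The spread of the estimates shrinks only as 1/√m, so growing the batch from 1 to 100 examples buys far more accuracy than growing it from 100 to 10,000.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y = theta * x + noise, loss L = (theta*x - y)^2.
# The per-example gradient of L with respect to theta is 2*(theta*x - y)*x.
theta = 0.5
x = rng.normal(size=1_000_000)
y = 2.0 * x + rng.normal(scale=0.1, size=x.shape)

per_example_grad = 2.0 * (theta * x - y) * x
true_grad = per_example_grad.mean()  # "exact" gradient over the whole set

for m in [1, 100, 10_000]:
    # Draw many minibatches of size m and measure the spread of the estimates.
    estimates = [rng.choice(per_example_grad, size=m).mean() for _ in range(200)]
    # The standard deviation shrinks roughly as 1/sqrt(m).
    print(m, np.std(estimates))
```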
Different kinds of algorithms use different kinds of information from the minibatch in different ways. Some algorithms are more sensitive to sampling error than others, either because they use information that is difficult to estimate accurately with few samples, or because they use information in ways that amplify sampling errors more. Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes like 100. Second-order methods, which use also the Hessian matrix H and compute updates such as H^{-1}g, typically require much larger batch sizes like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of H^{-1}g. Suppose that H is estimated perfectly but has a poor condition number. Multiplication by H or its inverse amplifies pre-existing errors, in this case, estimation errors in g. Very small changes in the estimate of g can thus cause large changes in the update H^{-1}g, even if H were estimated perfectly. Of course, H will be estimated only approximately, so the update H^{-1}g will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of g.
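The amplification by a poorly conditioned H can be checked directly. In the sketch below (a constructed example with a fixed diagonal H, not from the text), a 1% estimation error in g, placed along the low-curvature direction, is magnified by the full condition number when the update H^{-1}g is formed:

```python
import numpy as np

# A perfectly known but ill-conditioned Hessian: condition number 1e4.
H = np.diag([1e4, 1.0])
H_inv = np.linalg.inv(H)

g = np.array([1.0, 0.0])            # gradient estimate
g_noisy = g + np.array([0.0, 0.01])  # small error in the low-curvature direction

update = H_inv @ g
update_noisy = H_inv @ g_noisy

rel_err_g = np.linalg.norm(g_noisy - g) / np.linalg.norm(g)
rel_err_update = np.linalg.norm(update_noisy - update) / np.linalg.norm(update)

# The 1% error in g becomes a 10,000% error in the update H^{-1} g:
print(rel_err_g, rel_err_update)
```

The worst-case amplification of relative error by H^{-1} is exactly the condition number of H, which is why second-order updates need much lower-variance (larger-batch) gradient estimates.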
It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other. Many datasets are most naturally arranged in a way where successive examples are highly correlated. For example, we might have a dataset of medical data with a long list of blood sample test results. This list might be arranged so that first we have five blood samples taken at different times from the first patient, then we have three blood samples taken from the second patient, then the blood samples from the third patient, and so on. If we were to draw examples in order from this list, then each of our minibatches would be extremely biased, because it would represent primarily one patient out of the many patients in the dataset. In cases such as these where the order of the dataset holds some significance, it is necessary to shuffle the examples before selecting minibatches. For very large datasets, for example datasets containing billions of examples in a data center, it can be impractical to sample examples truly uniformly at random every time we want to construct a minibatch. Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion. This will impose a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use, and each individual model will be forced to reuse this ordering every time it passes through the training data. However, this deviation from true random selection does not seem to have a significant detrimental effect. Failing to ever shuffle the examples in any way can seriously reduce the effectiveness of the algorithm.
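The one-time shuffle can be sketched as follows (a hypothetical patient-ordered data layout, NumPy). Reading the shuffled array sequentially then yields minibatches that mix patients, at the cost of fixing one ordering for all later training runs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1000 blood-sample records ordered by patient,
# five consecutive records per patient.
patient_ids = np.repeat(np.arange(200), 5)

# Drawing the first minibatch in the original order covers only one patient:
print(np.unique(patient_ids[:5]).size)  # a single patient -> heavily biased batch

# Shuffle once, store the data in shuffled order, then always read sequentially.
perm = rng.permutation(patient_ids.size)
shuffled = patient_ids[perm]
print(np.unique(shuffled[:5]).size)  # almost surely several distinct patients
```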
Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes J(X) for one minibatch of examples X at the same time that we compute the update for several other minibatches. Such asynchronous parallel distributed approaches are discussed further in Sec. 12.1.3.
An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (Eq. 8.2) so long as no examples are repeated. Most implementations of minibatch stochastic gradient descent shuffle the dataset once and then pass through it multiple times. On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data generating distribution.
The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where examples or minibatches are drawn from a stream of data. In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example at each instant, with every example (x, y) coming from the data generating distribution p_data(x, y). In this scenario, examples are never repeated; every experience is a fair sample from p_data.

The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error (Eq. 8.2) can be written as a sum

J^*(\theta) = \sum_x \sum_y p_{\text{data}}(x, y) \, L(f(x; \theta), y),    (8.7)

with the exact gradient

g = \nabla_\theta J^*(\theta) = \sum_x \sum_y p_{\text{data}}(x, y) \, \nabla_\theta L(f(x; \theta), y).    (8.8)
We have already seen the same fact demonstrated for the log-likelihood in Eq. 8.5 and Eq. 8.6; we observe now that this holds for other functions L besides the likelihood. A similar result can be derived when x and y are continuous, under mild assumptions regarding p_data and L.

Hence, we can obtain an unbiased estimator of the exact gradient of the generalization error by sampling a minibatch of examples \{x^{(1)}, \dots, x^{(m)}\} with corresponding targets y^{(i)} from the data generating distribution p_data, and computing the gradient of the loss with respect to the parameters for that minibatch:

\hat{g} = \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)}).    (8.9)
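Eq. 8.9 is what the inner loop of most training code computes. As a minimal sketch (a made-up linear regression example, with the per-batch gradient written out by hand in NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta, x, y):
    """Minibatch gradient of the squared error L = (x @ theta - y)^2,
    averaged over the m examples in the batch, as in Eq. 8.9."""
    m = x.shape[0]
    residual = x @ theta - y             # shape (m,)
    return (2.0 / m) * (x.T @ residual)  # (1/m) * sum_i grad of L_i

# Data drawn from a known generating distribution.
true_theta = np.array([2.0, -3.0])
x = rng.normal(size=(100_000, 2))
y = x @ true_theta + rng.normal(scale=0.1, size=100_000)

theta = np.zeros(2)
idx = rng.choice(x.shape[0], size=128, replace=False)  # sample one minibatch
g_hat = loss_grad(theta, x[idx], y[idx])               # unbiased gradient estimate
theta = theta - 0.1 * g_hat                            # one step in the direction of -g_hat
```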
Updating θ in the direction of \hat{g} performs SGD on the generalization error.

Of course, this interpretation only applies when examples are not reused. Nonetheless, it is usually best to make several passes through the training set, unless the training set is extremely large. When multiple such epochs are used, only the first epoch follows the unbiased gradient of the generalization error, but
of course, the additional epochs usually provide enough benefit due to decreased training error to offset the harm they cause by increasing the gap between training error and test error.

With some datasets growing rapidly in size, faster than computing power, it is becoming more common for machine learning applications to use each training example only once or even to make an incomplete pass through the training set. When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns. See also Bottou and Bousquet (2008) for a discussion of the effect of computational bottlenecks on generalization error, as the number of training examples grows.
8.2 Challenges in Neural Network Optimization

Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general optimization by carefully designing the objective function and constraints to ensure that the optimization problem is convex. When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimization for training deep models.
8.2.1 Ill-Conditioning

Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H. This is a very general problem in most numerical optimization, convex or otherwise, and is described in more detail in Sec. 4.3.1.
The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get "stuck" in the sense that even very small steps increase the cost function.

Recall from Eq. 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of -\varepsilon g will add

\frac{1}{2} \varepsilon^2 g^\top H g - \varepsilon g^\top g    (8.10)

to the cost. Ill-conditioning of the gradient becomes a problem when \frac{1}{2} \varepsilon^2 g^\top H g exceeds \varepsilon g^\top g. To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm g^\top g and the g^\top H g term.
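The two terms of Eq. 8.10 can be checked numerically. In the sketch below (a constructed quadratic cost, not from the text), the second-order expansion is exact, and for a learning rate where \frac{1}{2}\varepsilon^2 g^\top H g exceeds \varepsilon g^\top g, the gradient step increases the cost even though it moves "downhill":

```python
import numpy as np

# Quadratic cost J(theta) = 0.5 * theta^T H theta with ill-conditioned curvature.
H = np.diag([100.0, 1.0])

def J(theta):
    return 0.5 * theta @ H @ theta

theta = np.array([1.0, 1.0])
g = H @ theta  # gradient of J at theta

for eps in [0.01, 0.025]:
    predicted = 0.5 * eps**2 * (g @ H @ g) - eps * (g @ g)  # Eq. 8.10
    actual = J(theta - eps * g) - J(theta)
    # For a quadratic the expansion is exact, so predicted == actual;
    # at eps = 0.025 the curvature term wins and the step *increases* the cost.
    print(eps, predicted, actual)
```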
8.2.2 Local Minima

One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum. Any local minimum is
guaranteed to be a global minimum. Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within such a flat region is an acceptable solution. When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind.

With non-convex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local minima. However, as we will see, this is not necessarily a major problem.
Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima because of the model identifiability problem. A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters. Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent variables with each other. For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit i with the incoming weight vector for unit j, then doing the same for the outgoing weight vectors. If we have m layers with n units each, then there are n!^m ways of arranging the hidden units.
This kind of non-identifiability is known as weight space symmetry.

In addition to weight space symmetry, many kinds of neural networks have additional causes of non-identifiability. For example, in any rectified linear or maxout network, we can scale all of the incoming weights and biases of a unit by α if we also scale all of its outgoing weights by 1/α. This means that, if the cost function does not include terms such as weight decay that depend directly on the weights rather than the models' outputs, every local minimum of a rectified linear or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
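The α-scaling symmetry is easy to verify numerically for a single rectified linear unit. In the sketch below (a made-up one-hidden-unit network fragment, NumPy), scaling the unit's incoming weights and bias by α and its outgoing weight by 1/α leaves the output, and hence the cost, unchanged:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # one input vector
w_in = rng.normal(size=4)           # incoming weights of one hidden unit
b = 0.3                             # its bias
w_out = 1.7                         # its outgoing weight

alpha = 5.0
original = w_out * relu(w_in @ x + b)
rescaled = (w_out / alpha) * relu(alpha * w_in @ x + alpha * b)

# relu(alpha * z) = alpha * relu(z) for alpha > 0, so the outputs match:
print(np.isclose(original, rescaled))
```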
These model identifiability issues mean that there can be an extremely large or even uncountably infinite amount of local minima in a neural network cost function. However, all of these local minima arising from non-identifiability are equivalent to each other in cost function value. As a result, these local minima are not a problematic form of non-convexity.
Local minima can be problematic if they have high cost in comparison to the global minimum. One can construct small neural networks, even without hidden units, that have local minima with higher cost than the global minimum (Sontag and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992). If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms.
It remains an open question whether there are many local minima of high cost
for networks of practical interest and whether optimization algorithms encounter them. For many years, most practitioners believed that local minima were a common problem plaguing neural network optimization. Today, that does not appear to be the case. The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than to find a point in parameter space that has low but not minimal cost (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014).
Many practitioners attribute nearly all difficulty with neural network optimization to local minima. We encourage practitioners to carefully test for specific problems. A test that can rule out local minima as the problem is to plot the norm of the gradient over time. If the norm of the gradient does not shrink to insignificant size, the problem is neither local minima nor any other kind of critical point. This kind of negative test can rule out local minima. In high dimensional spaces, it can be very difficult to positively establish that local minima are the problem. Many structures other than local minima also have small gradients.
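The negative test requires nothing more than logging the squared gradient norm during training. A minimal sketch (a hypothetical training loop on a toy quadratic objective, standing in for a real network's loss):

```python
import numpy as np

# Toy objective stand-in: a quadratic, so here the norm does shrink.
# On a real network, a norm that stays large rules out local minima
# (and any other critical point) as the cause of slow training.
A = np.diag([10.0, 1.0])

def grad(theta):
    return A @ theta

theta = np.array([5.0, 5.0])
grad_norms = []
for step in range(200):
    g = grad(theta)
    grad_norms.append(g @ g)   # log the squared gradient norm over time
    theta = theta - 0.05 * g

print(grad_norms[0], grad_norms[-1])
```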
8.2.3 Plateaus, Saddle Points and Other Flat Regions

For many high-dimensional non-convex functions, local minima (and maxima) are in fact rare compared to another kind of point with zero gradient: a saddle point. Some points around a saddle point have greater cost than the saddle point, while others have a lower cost. At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated with positive eigenvalues have greater cost than the saddle point, while points lying along eigenvectors associated with negative eigenvalues have lower value. We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum along another cross-section. See Fig. 4.5 for an illustration.
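The simplest saddle point is the origin of f(x, y) = x² − y² (a standard textbook example, not from this section): the gradient vanishes there, but the Hessian has one eigenvalue of each sign, so one cross-section sees a minimum and the other a maximum. A small check:

```python
import numpy as np

def f(x, y):
    return x**2 - y**2

# Hessian of f, constant everywhere: a minimum along x, a maximum along y.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)  # sorted ascending
print(eigvals)  # mixed signs -> the critical point at the origin is a saddle

# Moving along the positive-eigenvalue direction raises f;
# moving along the negative-eigenvalue direction lowers it.
print(f(0.1, 0.0) > f(0.0, 0.0) > f(0.0, 0.1))
```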
Many classes of random functions exhibit the following behavior: in low-dimensional spaces, local minima are common. In higher dimensional spaces, local minima are rare and saddle points are more common. For a function f : R^n → R of this type, the expected ratio of the number of saddle points to local minima grows exponentially with n. To understand the intuition behind this behavior, observe that the Hessian matrix at a local minimum has only positive eigenvalues. The Hessian matrix at a saddle point has a mixture of positive and negative eigenvalues. Imagine that the sign of each eigenvalue is generated by flipping a coin. In a single dimension, it is easy to obtain a local minimum by tossing a coin and getting heads once. In n-dimensional space, it is exponentially unlikely that all n coin tosses will be heads.
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
[Figure: visualization of the cost J(θ) along projection 1 and projection 2 of θ.]
Gradient descent is designed to move "downhill" and is not explicitly designed to seek a critical point. Newton's method, however, is designed to solve for a point where the gradient is zero. Without appropriate modification, it can jump to a saddle point. The proliferation of saddle points in high dimensional spaces presumably explains why second-order methods have not succeeded in replacing gradient descent for neural network training. Dauphin et al. (2014) introduced a saddle-free Newton method for second-order optimization and showed that it improves significantly over the traditional version. Second-order methods remain difficult to scale to large neural networks, but this saddle-free approach holds promise if it could be scaled.
There are other kinds of points with zero gradient besides minima and saddle points. There are also maxima, which are much like saddle points from the perspective of optimization: many algorithms are not attracted to them, but unmodified Newton's method is. Maxima become exponentially rare in high dimensional space, just like minima do.

There may also be wide, flat regions of constant value. In these locations, the gradient and also the Hessian are all zero. Such degenerate locations pose major problems for all numerical optimization algorithms. In a convex problem, a wide, flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the objective function.
8.2.4 Cliffs and Exploding Gradients

Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in Fig. 8.3. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff structure altogether.
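The text later notes that these cliff structures motivate gradient clipping. A minimal sketch of norm clipping (the function name and threshold value are illustrative, not from the text):

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale grad so its L2 norm never exceeds threshold.

    On the face of a cliff, the clipped update keeps the descent
    direction but limits how far a single step can travel.
    """
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# A huge "cliff" gradient gets rescaled to the threshold norm.
g = np.array([300.0, -400.0])          # norm = 500
g_clipped = clip_gradient_norm(g, threshold=5.0)
print(np.linalg.norm(g_clipped))       # 5.0, direction preserved
```

Small gradients pass through unchanged; only the step length near a cliff is affected.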
[Figure: cost J(w; b) plotted over parameters w and b.]
8.2.5 Long-Term Dependencies

Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs. So do recurrent networks, described in Chapter 10, which construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties.
For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. Suppose that W has an eigendecomposition W = V diag(λ) V^{-1}. In this simple case, it is straightforward to see that

    W^t = (V diag(λ) V^{-1})^t = V diag(λ)^t V^{-1}.    (8.11)

Any eigenvalues λ_i that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude.
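Eq. 8.11 and the explode-or-vanish behavior can be checked numerically. The matrix below is an arbitrary illustration with one eigenvalue above 1 and one below:

```python
import numpy as np

# A diagonalizable matrix with eigenvalues 1.1 (explodes) and 0.9 (vanishes).
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])
lam = np.array([1.1, 0.9])
W = V @ np.diag(lam) @ np.linalg.inv(V)

t = 50
# Eq. 8.11: W^t = V diag(lambda)^t V^{-1}
Wt_eig = V @ np.diag(lam ** t) @ np.linalg.inv(V)
Wt_pow = np.linalg.matrix_power(W, t)
assert np.allclose(Wt_eig, Wt_pow)

# The eigenvalue-1.1 component has grown by 1.1^50 (about 117), while the
# eigenvalue-0.9 component has shrunk by 0.9^50 (about 0.005).
print(lam ** t)
```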
The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to diag(λ)^t. Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.

The repeated multiplication by W at each time step described here is very similar to the power method algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this point of view it is not surprising that x^T W^t will eventually discard all components of x that are orthogonal to the principal eigenvector of W.
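This power-method behavior is easy to verify: repeatedly multiplying by W and normalizing aligns a generic vector with the principal eigenvector. The symmetric matrix here is an arbitrary example, not from the text:

```python
import numpy as np

W = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # eigenvalues 3 and 1; principal
                             # eigenvector proportional to [1, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=2)       # generic starting vector
for _ in range(50):
    x = W @ x
    x = x / np.linalg.norm(x)  # normalize so the iterate stays finite

# After many steps, x has discarded the component orthogonal to [1, 1].
principal = np.array([1.0, 1.0]) / np.sqrt(2)
print(abs(x @ principal))    # approximately 1.0
```

The component orthogonal to the principal eigenvector shrinks by a factor of 1/3 per step, which is exactly the vanishing behavior described above.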
Recurrent networks use the same matrix W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014).

We defer a further discussion of the challenges of training recurrent networks until Sec. 10.7, after recurrent networks have been described in more detail.
8.2.6 Inexact Gradients

Most optimization algorithms are primarily motivated by the case where we have exact knowledge of the gradient or Hessian matrix. In practice, we usually only have a noisy or even biased estimate of these quantities. Nearly every deep learning algorithm relies on sampling-based estimates, at least insofar as using a minibatch of training examples to compute the gradient.

In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise
8.2.7 Poor Correspondence between Local and Global Structure

Many of the problems we have discussed so far correspond to properties of the loss function at a single point: it can be difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle point hiding the opportunity to make progress downhill from the gradient.

It is possible to overcome all of these problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.

Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution. Fig. 8.2 shows that the learning trajectory spends most of its time tracing out a wide arc around a mountain-shaped structure.

Much of research into the difficulties of optimization has focused on whether training arrives at a global minimum, a local minimum, or a saddle point, but in practice neural networks do not arrive at a critical point of any kind. Fig. 8.1 shows that neural networks often do not arrive at a region of small gradient. Indeed, such critical points do not even necessarily exist. For example, the loss function −log p(y | x; θ) can lack a global minimum point and instead asymptotically approach some value as the model becomes more confident. For a classifier with discrete y and p(y | x) provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero. Likewise, a model of real values p(y | x) = N(y; f(θ), β^{-1}) can have negative log-likelihood that asymptotes to negative infinity: if f(θ) is able to correctly predict the value of all training set y targets, the learning algorithm will increase β without bound. See Fig. 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any local minima or saddle points.

Future research will need to develop further understanding of the factors that influence the length of the learning trajectory and better characterize the outcome of the process.
8.2.8 Theoretical Limits of Optimization

Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on the use of neural networks in practice.
Some theoretical results apply only to the case where the units of a neural network output discrete values. However, most neural network units output smoothly increasing values that make optimization via local search feasible. Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to tell whether a particular problem falls into that class. Other results show that finding a solution for a network of a given size is intractable, but in practice we can find a solution easily by using a larger network for which many more parameter settings correspond to an acceptable solution. Moreover, in the context of neural network training, we usually do not care about finding the exact minimum of a function, but only in reducing its value sufficiently to obtain good generalization error. Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. Developing more realistic bounds on the performance of optimization algorithms therefore remains an important goal for machine learning research.
8.3 Basic Algorithms

We have previously introduced the gradient descent (Sec. 4.3) algorithm that follows the gradient of an entire training set downhill. This may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill, as discussed in Sec. 5.9 and Sec. 8.1.3.
8.3.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. As discussed in Sec. 8.1.3, it is possible to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data generating distribution.

Algorithm 8.1 shows how to follow this estimate of the gradient downhill.

Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ε_k.
Require: Initial parameter θ
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Apply update: θ ← θ − ε ĝ
end while
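Algorithm 8.1 maps directly onto a few lines of NumPy. The sketch below applies SGD to a linear least-squares problem; the data, model, and constants are illustrative stand-ins, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + small noise.
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

theta = np.zeros(2)
epsilon = 0.1          # fixed learning rate, for simplicity
m = 32                 # minibatch size

for step in range(500):
    # Sample a minibatch of m examples with corresponding targets.
    idx = rng.integers(0, len(X), size=m)
    Xb, yb = X[idx], y[idx]
    # Gradient estimate of the mean squared error: (2/m) X^T (X theta - y).
    g_hat = (2.0 / m) * Xb.T @ (Xb @ theta - yb)
    # Apply update: theta <- theta - epsilon * g_hat
    theta = theta - epsilon * g_hat

print(theta)  # close to [2, -3]
```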
A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as ε_k.

This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum. By comparison, the true gradient of the total cost function becomes small and then 0 when we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate. A sufficient condition to guarantee convergence of SGD is that

    Σ_{k=1}^∞ ε_k = ∞,    (8.12)

and

    Σ_{k=1}^∞ ε_k^2 < ∞.    (8.13)
In practice, it is common to decay the learning rate linearly until iteration τ:

    ε_k = (1 − α) ε_0 + α ε_τ    (8.14)

with α = k/τ. After iteration τ, it is common to leave ε constant.
til iteration τ :
and Bousquet (2008) argue that it therefore may not be worth worthwhile
while to pursue
an optimization algorithm that conv erges faster than O (k1 ) for machine learning
converges
and Bousquetconv
tasks—faster (2008 ) argue
convergence
ergence that it therefore
presumably corresp may
onds not
corresponds berfitting.
to ov e worthwhile
overfitting. to pursue
Moreov
Moreover, er, the
an optimization algorithm that
asymptotic analysis obscures many adv conv erges
advanan faster
antages than O
tages that sto ( ) for
stochastic machine learning
chastic gradient descent
tasks—faster conv
has after a small num ergence
numb ber of steps. With large datasets, theerfitting.
presumably corresp onds to ov ability ofMoreov
ability SGD toer, the
make
asymptotic
rapid initialanalysis
progressobscures
while ev many advthe
evaluating
aluating antages that for
gradient stochastic
only very gradient descent
few examples
haswafter
out
outw eighsa its
small
slownum ber of steps.
asymptotic conWith
conv large Most
vergence. datasets, thealgorithms
of the ability of SGD to make
describ
describeded in
rapid initial progress while ev
the remainder of this chapter achievaluating the gradient for only very few examples
achievee benefits that matter in practice but are lost
out weighs its slow asymptotic
in the constant factors obscured convergence.
by the O( 1k )Most of the algorithms
asymptotic analysis. Onedescrib
caned in
also
the remainder
trade of this chapter
off the benefits of both achiev
batch eand
benefits
sto that matter
stochastic
chastic gradient in descen
practice
descentt bybut are lost
gradually
in the constant
increasing factors hobscured
the minibatc
minibatch by the
size during the O ( ) asymptotic
course of learning.analysis. One can also
trade off the benefits of both batch and stochastic gradient descent by gradually
For more information on SGD, see Bottou (1998).
increasing the minibatch size during the course of learning.
For more information on SGD, see Bottou (1998).
8.3.2 Momentum

While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. The effect of momentum is illustrated in Fig. 8.5.

Formally, the momentum algorithm introduces a variable v that plays the role of velocity: it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion. Momentum in physics is mass times velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay. The update rule is given by:

    v ← αv − ε ∇_θ ( (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)) ),    (8.15)
    θ ← θ + v.    (8.16)

The velocity v accumulates the gradient elements ∇_θ ( (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)) ). The larger α is relative to ε, the more previous gradients affect the current direction. The SGD algorithm with momentum is given in Algorithm 8.2.
Previously
Previously,, the size of the step was simply the norm of the gradient multiplied
by the learning rate. No Now,w, the size of the step dep depends
ends on how large and ho how
w
Previously , the size of the step was simply the norm
aligned a sequence of gradients are. The step size is largest when man of the gradient
many m ultiplied
y successive
b y the
gradien
gradientslearning rate. No w, the size of the step
ts point in exactly the same direction. If the momen dep ends
momentumon how large
tum algorithm andalwahoys
alwaysw
aligned
observ
observes a sequence of gradients are. The step size is largest when
es gradient g , then it will accelerate in the direction of −g, until reac man y successive
reaching
hing a
gradients velocity
terminal point inwhere
exactly the
the same
size direction.
of eac
eachh step isIf the momentum algorithm always
observes gradient g , then it will accelerate in the direction of g, until reaching a
terminal velocity where the size of each||gstep || is
. − (8.17)
1−α
g
. yperparameter 1 (8.17)
It is thus helpful to think of the momentum α hyp
1 || || erparameter in terms of 1−α . For
example, α = .9 corresp
corresponds
onds to multiplying the maxim maximum um spspeed
eed by 10 relative to
It is thus helpful to think of the momentum− h yp erparameter in terms of . For
the gradient descen
descentt algorithm.
example, α = .9 corresponds to multiplying the maximum speed by 10 relative to
Common values of α used in practice include .5, . 9, and .99 99.. Lik
Likee the learning
the gradient descent algorithm.
rate, α ma
may y also be adapted ov over
er time. Typically it begins with a small value and
Common v alues of α
is later raised. It is less imp used
ortantpractice
in
important to adapt α over.5time
include , . 9, and
than.99
to. shrink
Like the
ovlearning
er time.
rate, α may also be adapted over time. Typically it begins with a small value and
is later raised.
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x(1), . . . , x(m)} with corresponding targets y(i).
    Compute gradient estimate: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
end while
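Algorithm 8.2 can be sketched directly in code. The following is a minimal illustration, not the book's reference implementation: `grad_fn` stands in for the minibatch gradient estimate, and the ill-conditioned quadratic objective is an assumed toy problem.

```python
def sgd_momentum(grad_fn, theta, epsilon=0.02, alpha=0.9, n_steps=300):
    # Algorithm 8.2: v <- alpha*v - epsilon*g ; theta <- theta + v
    v = [0.0] * len(theta)                 # initial velocity
    for _ in range(n_steps):
        g = grad_fn(theta)                 # stand-in for the minibatch gradient estimate
        v = [alpha * vi - epsilon * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

# Toy quadratic J(theta) = 0.5*(theta_1^2 + 50*theta_2^2); its gradient is
# [theta_1, 50*theta_2], with one gentle and one steep direction.
theta = sgd_momentum(lambda th: [th[0], 50.0 * th[1]], [1.0, 1.0])
```

If the gradient were constant, the accumulated step size of this loop would approach the terminal velocity ε‖g‖/(1 − α) of Eq. 8.17.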
We can view the momentum algorithm as simulating a particle subject to
continuous-time Newtonian dynamics. The physical analogy can help to build
intuition for how the momentum and gradient descent algorithms behave.

The position of the particle at any point in time is given by θ(t). The particle
experiences net force f(t). This force causes the particle to accelerate:

    f(t) = ∂²θ(t)/∂t².    (8.18)

Rather than viewing this as a second-order differential equation of the position,
we can introduce the variable v(t) representing the velocity of the particle at time
t and rewrite the Newtonian dynamics as a first-order differential equation:

    v(t) = ∂θ(t)/∂t,    (8.19)
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
    f(t) = ∂v(t)/∂t.    (8.20)

The momentum algorithm then consists of solving the differential equations via
numerical simulation. A simple numerical method for solving differential equations
is Euler's method, which simply consists of simulating the dynamics defined by
the equation by taking small, finite steps in the direction of each gradient.

This explains the basic form of the momentum update, but what specifically are
the forces? One force is proportional to the negative gradient of the cost function:
−∇θ J(θ). This force pushes the particle downhill along the cost function surface.
The gradient descent algorithm would simply take a single step based on each
gradient, but the Newtonian scenario used by the momentum algorithm instead
uses this force to alter the velocity of the particle. We can think of the particle
as being like a hockey puck sliding down an icy surface. Whenever it descends a
steep part of the surface, it gathers speed and continues sliding in that direction
until it begins to go uphill again.

One other force is necessary. If the only force is the gradient of the cost function,
then the particle might never come to rest. Imagine a hockey puck sliding down
one side of a valley and straight up the other side, oscillating back and forth forever,
assuming the ice is perfectly frictionless. To resolve this problem, we add one
other force, proportional to −v(t). In physics terminology, this force corresponds
to viscous drag, as if the particle must push through a resistant medium such as
syrup. This causes the particle to gradually lose energy over time and eventually
converge to a local minimum.

Why do we use −v(t) and viscous drag in particular? Part of the reason to
use −v(t) is mathematical convenience: an integer power of the velocity is easy
to work with. However, other physical systems have other kinds of drag based
on other integer powers of the velocity. For example, a particle traveling through
the air experiences turbulent drag, with force proportional to the square of the
velocity, while a particle moving along the ground experiences dry friction, with a
force of constant magnitude. We can reject each of these options. Turbulent drag,
proportional to the square of the velocity, becomes very weak when the velocity is
small. It is not powerful enough to force the particle to come to rest. A particle
with a non-zero initial velocity that experiences only the force of turbulent drag
will move away from its initial position forever, with the distance from the starting
point growing like O(log t). We must therefore use a lower power of the velocity.
If we use a power of zero, representing dry friction, then the force is too strong.
When the force due to the gradient of the cost function is small but non-zero, the
constant force due to friction can cause the particle to come to rest before reaching
a local minimum. Viscous drag avoids both of these problems: it is weak enough
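These dynamics can be simulated directly. The following is a minimal sketch, not part of the text's derivation: it assumes a one-dimensional cost J(θ) = θ²/2 and a drag coefficient `mu` chosen for illustration, and integrates the gradient force plus viscous drag with small Euler steps.

```python
def simulate_particle(theta=2.0, v=0.0, dt=0.01, mu=1.0, n_steps=5000):
    # Integrate f(t) = -dJ/dtheta - mu*v(t) with J(theta) = theta**2 / 2.
    # The viscous drag term -mu*v bleeds off energy, so the particle settles
    # at the minimum instead of oscillating across the valley forever.
    for _ in range(n_steps):
        force = -theta - mu * v   # gradient force plus viscous drag
        v += dt * force           # dv/dt = f(t)
        theta += dt * v           # dtheta/dt = v(t)
    return theta, v

theta, v = simulate_particle()
```

Setting `mu = 0.0` in this sketch reproduces the frictionless hockey puck: the position oscillates indefinitely rather than coming to rest.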
8.3.3 Nesterov Momentum

Sutskever et al. (2013) introduced a variant of the momentum algorithm that was
inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The
update rules in this case are given by:

    v ← αv − ε ∇θ [ (1/m) Σi L(f(x(i); θ + αv), y(i)) ],    (8.21)
    θ ← θ + v,    (8.22)

where the parameters α and ε play a similar role as in the standard momentum
method. The difference between Nesterov momentum and standard momentum is
where the gradient is evaluated. With Nesterov momentum the gradient is evaluated
after the current velocity is applied. Thus one can interpret Nesterov momentum
as attempting to add a correction factor to the standard method of momentum.
The complete Nesterov momentum algorithm is presented in Algorithm 8.3.

In the convex batch gradient case, Nesterov momentum brings the rate of
convergence of the excess error from O(1/k) (after k steps) to O(1/k²) as shown
by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov
momentum does not improve the rate of convergence.

Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x(1), . . . , x(m)} with corresponding labels y(i).
    Apply interim update: θ̃ ← θ + αv
    Compute gradient (at interim point): g ← (1/m) ∇θ̃ Σi L(f(x(i); θ̃), y(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
end while
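Algorithm 8.3 differs from the standard momentum loop only in where the gradient is taken. A minimal sketch follows; as before, the quadratic `grad_fn` is an illustrative stand-in for the minibatch gradient estimate.

```python
def sgd_nesterov(grad_fn, theta, epsilon=0.02, alpha=0.9, n_steps=300):
    # Algorithm 8.3: evaluate the gradient at the interim point theta + alpha*v,
    # i.e. after the current velocity has been applied.
    v = [0.0] * len(theta)
    for _ in range(n_steps):
        interim = [ti + alpha * vi for ti, vi in zip(theta, v)]  # theta~ <- theta + alpha*v
        g = grad_fn(interim)                                     # gradient at the interim point
        v = [alpha * vi - epsilon * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

# Same toy quadratic as before: gradient [theta_1, 50*theta_2].
theta = sgd_nesterov(lambda th: [th[0], 50.0 * th[1]], [1.0, 1.0])
```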
8.4 Parameter Initialization Strategies

Some optimization algorithms are not iterative by nature and simply solve for a
solution point. Other optimization algorithms are iterative by nature but, when
applied to the right class of optimization problems, converge to acceptable solutions
in an acceptable amount of time regardless of initialization. Deep learning training
algorithms usually do not have either of these luxuries. Training algorithms for deep
learning models are usually iterative in nature and thus require the user to specify
some initial point from which to begin the iterations. Moreover, training deep
models is a sufficiently difficult task that most algorithms are strongly affected by
the choice of initialization. The initial point can determine whether the algorithm
converges at all, with some initial points being so unstable that the algorithm
encounters numerical difficulties and fails altogether. When learning does converge,
the initial point can determine how quickly learning converges and whether it
converges to a point with high or low cost. Also, points of comparable cost
can have wildly varying generalization error, and the initial point can affect the
generalization as well.

Modern initialization strategies are simple and heuristic. Designing improved
initialization strategies is a difficult task because neural network optimization is
not yet well understood. Most initialization strategies are based on achieving some
nice properties when the network is initialized. However, we do not have a good
understanding of which of these properties are preserved under which circumstances
after learning begins to proceed. A further difficulty is that some initial points
may be beneficial from the viewpoint of optimization but detrimental from the
viewpoint of generalization. Our understanding of how the initial point affects
generalization is especially primitive, offering little to no guidance for how to select
the initial point.

Perhaps the only property known with complete certainty is that the initial
parameters need to "break symmetry" between different units. If two hidden
units with the same activation function are connected to the same inputs, then
these units must have different initial parameters. If they have the same initial
parameters, then a deterministic learning algorithm applied to a deterministic cost
and model will constantly update both of these units in the same way. Even if the
model or training algorithm is capable of using stochasticity to compute different
updates for different units (for example, if one trains with dropout), it is usually
best to initialize each unit to compute a different function from all of the other
units. This may help to make sure that no input patterns are lost in the null
space of forward propagation and no gradient patterns are lost in the null space
of back-propagation. The goal of having each unit compute a different function
motivates random initialization of the parameters. We could explicitly search
for a large set of basis functions that are all mutually different from each other,
but this often incurs a noticeable computational cost. For example, if we have at
most as many outputs as inputs, we could use Gram-Schmidt orthogonalization
on an initial weight matrix, and be guaranteed that each unit computes a very
different function from each other unit. Random initialization from a high-entropy
distribution over a high-dimensional space is computationally cheaper and unlikely
to assign any units to compute the same function as each other.

Typically, we set the biases for each unit to heuristically chosen constants, and
initialize only the weights randomly. Extra parameters, for example, parameters
encoding the conditional variance of a prediction, are usually set to heuristically
chosen constants much like the biases are.
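The symmetry argument above can be made concrete with a toy model. In the sketch below (the two-unit "network", the fixed input, and the squared-error signal are all illustrative assumptions), two units feeding a shared output always receive identical gradients, so identical initialization leaves them permanent copies of each other, while random initialization preserves their difference.

```python
import random

def train_two_unit_net(w1, w2, steps=100, lr=0.1):
    # A toy layer of two units feeding one output: y = w1*x + w2*x.
    # Both units see the same input and the same error signal, so their
    # gradients are always identical; only the initialization can differ.
    for _ in range(steps):
        x = 0.5                              # fixed input (illustrative)
        err = (w1 + w2) * x - 1.0            # shared squared-error signal
        g = err * x                          # identical gradient for both units
        w1 -= lr * g
        w2 -= lr * g
    return w1, w2

# Identical initialization: the units remain exact copies forever.
a, b = train_two_unit_net(0.3, 0.3)

# Random initialization breaks the symmetry from the first step.
random.seed(0)
c, d = train_two_unit_net(random.gauss(0, 1), random.gauss(0, 1))
```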
We almost always initialize all the weights in the model to values drawn
randomly from a Gaussian or uniform distribution. The choice of Gaussian
or uniform distribution does not seem to matter very much, but has not been
exhaustively studied. The scale of the initial distribution, however, does have a
large effect on both the outcome of the optimization procedure and on the ability
of the network to generalize.

Larger initial weights will yield a stronger symmetry breaking effect, helping
to avoid redundant units. They also help to avoid losing signal during forward or
back-propagation through the linear component of each layer: larger values in the
matrix result in larger outputs of matrix multiplication. Initial weights that are
too large may, however, result in exploding values during forward propagation or
back-propagation. In recurrent networks, large weights can also result in chaos
(such extreme sensitivity to small perturbations of the input that the behavior
of the deterministic forward propagation procedure appears random). To some
extent, the exploding gradient problem can be mitigated by gradient clipping
(thresholding the values of the gradients before performing a gradient descent step).
Large weights may also result in extreme values that cause the activation function
to saturate, causing complete loss of gradient through saturated units. These
competing factors determine the ideal initial scale of the weights.

The perspectives of regularization and optimization can give very different
insights into how we should initialize a network. The optimization perspective
suggests that the weights should be large enough to propagate information
successfully, but some regularization concerns encourage making them smaller. The use
of an optimization algorithm such as stochastic gradient descent that makes small
incremental changes to the weights and tends to halt in areas that are nearer to
the initial parameters (whether due to getting stuck in a region of low gradient, or
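The sensitivity of forward propagation to the initial weight scale can be illustrated by pushing a signal through a stack of random linear layers. This is a sketch only; the depth, width, and candidate scales below are illustrative choices, not values from the text.

```python
import math
import random

def forward_norm(scale, depth=20, width=50, seed=1):
    # Push a fixed input through `depth` random linear layers whose weights
    # are drawn from N(0, scale**2), and return the final activation norm.
    rng = random.Random(seed)
    x = [1.0] * width
    for _ in range(depth):
        w = [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        x = [sum(w[i][j] * x[i] for i in range(width)) for j in range(width)]
    return math.sqrt(sum(xi * xi for xi in x))

small = forward_norm(0.5 / math.sqrt(50))   # signal shrinks toward zero
good  = forward_norm(1.0 / math.sqrt(50))   # magnitude roughly preserved
large = forward_norm(2.0 / math.sqrt(50))   # signal explodes
```

With width n, a standard deviation of 1/√n keeps the expected squared activation norm constant from layer to layer, which is why common initialization schemes scale the weights this way.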
initial activations throughout. If learning is still too slow at this point, it can be
useful to look at the range or standard deviation of the gradients as well as the
activations. This procedure can in principle be automated and is generally less
computationally costly than hyperparameter optimization based on validation set
error because it is based on feedback from the behavior of the initial model on a
single batch of data, rather than on feedback from a trained model on the validation
set. While long used heuristically, this protocol has recently been specified more
formally and studied by Mishkin and Matas (2015).

So far we have focused on the initialization of the weights. Fortunately,
initialization of other parameters is typically easier.

The approach for setting the biases must be coordinated with the approach
for setting the weights. Setting the biases to zero is compatible with most weight
initialization schemes. There are a few situations where we may set some biases to
non-zero values:

• If a bias is for an output unit, then it is often beneficial to initialize the bias to
  obtain the right marginal statistics of the output. To do this, we assume that
  the initial weights are small enough that the output of the unit is determined
  only by the bias. This justifies setting the bias to the inverse of the activation
  function applied to the marginal statistics of the output in the training set.
  For example, if the output is a distribution over classes and this distribution
  is a highly skewed distribution with the marginal probability of class i given
  by element ci of some vector c, then we can set the bias vector b by solving
  the equation softmax(b) = c. This applies not only to classifiers but also to
  models we will encounter in Part III, such as autoencoders and Boltzmann
  machines. These models have layers whose output should resemble the input
  data x, and it can be very helpful to initialize the biases of such layers to
  match the marginal distribution over x.

• Sometimes we may want to choose the bias to avoid causing too much
  saturation at initialization. For example, we may set the bias of a ReLU
  hidden unit to 0.1 rather than 0 to avoid saturating the ReLU at initialization.
  This approach is not compatible with weight initialization schemes that do
  not expect strong input from the biases though. For example, it is not
  recommended for use with random walk initialization (Sussillo, 2014).

• Sometimes a unit controls whether other units are able to participate in a
  function. In such situations, we have a unit with output u and another unit
  h ∈ [0, 1], then we can view h as a gate that determines whether uh ≈ 1 or
  uh ≈ 0. In these situations, we want to set the bias for h so that h ≈ 1 most
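The output-bias rule softmax(b) = c from the first situation above has a closed-form solution: b = log c, up to an additive constant. A quick numerical check, using an illustrative skewed class distribution:

```python
import math

def softmax(b):
    # Numerically stable softmax.
    m = max(b)
    e = [math.exp(bi - m) for bi in b]
    s = sum(e)
    return [ei / s for ei in e]

# Skewed marginal class probabilities (illustrative values).
c = [0.90, 0.05, 0.03, 0.02]

# Setting the output biases to log(c) solves softmax(b) = c exactly,
# since the softmax is invariant to adding a constant to every bias.
b = [math.log(ci) for ci in c]
recovered = softmax(b)
```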
8.5 Algorithms with Adaptive Learning Rates

Neural network researchers have long realized that the learning rate was reliably one
of the hyperparameters that is the most difficult to set because it has a significant
impact on model performance. As we have discussed in Sec. 4.3 and Sec. 8.2, the
cost is often highly sensitive to some directions in parameter space and insensitive
to others. The momentum algorithm can mitigate these issues somewhat, but
does so at the expense of introducing another hyperparameter. In the face of this,
it is natural to ask if there is another way. If we believe that the directions of
sensitivity are somewhat axis-aligned, it can make sense to use a separate learning
rate for each parameter, and automatically adapt these learning rates throughout
the course of learning.
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
8.5.1 AdaGrad

The AdaGrad algorithm, shown in Algorithm 8.4, individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011). The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. The net effect is greater progress in the more gently sloped directions of parameter space.

In the context of convex optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models.
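The update described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than Algorithm 8.4 verbatim; the function name `adagrad_update`, the learning rate, the toy quadratic, and the small stabilizing constant δ are assumptions made for the example:

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.1, delta=1e-7):
    """One AdaGrad step: accumulate the historical sum of squared
    gradients, then scale each parameter's step inversely by the
    square root of that sum."""
    accum += grad ** 2
    theta -= lr * grad / (delta + np.sqrt(accum))
    return theta, accum

# A quadratic toy problem whose curvature differs sharply per axis:
# J(theta) = 0.5 * (1 * theta_1^2 + 100 * theta_2^2)
theta = np.array([1.0, 1.0])
accum = np.zeros_like(theta)
for _ in range(500):
    grad = np.array([1.0, 100.0]) * theta   # per-axis gradient
    theta, accum = adagrad_update(theta, grad, accum)
```

Because the accumulated squared gradients only grow, the effective learning rate shrinks monotonically, which is exactly the premature-decay behavior criticized above for deep models.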
8.5.2 RMSProp

The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure. RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl.
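A minimal sketch of the corresponding RMSProp step follows, on the same style of toy quadratic one might use for AdaGrad. The decay rate ρ = 0.9 and constant δ are typical but illustrative values, and the sketch omits the momentum variant:

```python
import numpy as np

def rmsprop_update(theta, grad, avg, lr=0.05, rho=0.9, delta=1e-6):
    """One RMSProp step: the squared-gradient accumulator is an
    exponentially weighted moving average, so gradients from the
    extreme past are discarded rather than summed forever."""
    avg = rho * avg + (1.0 - rho) * grad ** 2
    theta -= lr * grad / np.sqrt(delta + avg)
    return theta, avg

theta = np.array([1.0, 1.0])
avg = np.zeros_like(theta)
for _ in range(300):
    grad = np.array([1.0, 100.0]) * theta   # gradient of a stiff toy quadratic
    theta, avg = rmsprop_update(theta, grad, avg)
```

Unlike AdaGrad's ever-growing sum, `avg` can also shrink again when recent gradients are small, so the effective learning rate does not decay monotonically.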
8.5.3 Adam

Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is presented in Algorithm 8.7. The name “Adam” derives from the phrase “adaptive moments.” In the context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and momentum with a few important distinctions. First, in Adam, momentum is incorporated directly as an estimate of the first order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients. The use of momentum in combination with rescaling does not have a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin.
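The two moment estimates and their bias corrections can be sketched as follows. This is a simplified rendering of the idea, not Algorithm 8.7 line for line; the constants (ρ₁ = 0.9, ρ₂ = 0.999, δ = 10⁻⁸) follow common practice and the toy problem is illustrative:

```python
import numpy as np

def adam_update(theta, grad, s, r, t, lr=0.01, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step: exponentially weighted first moment s and
    (uncentered) second moment r, each bias-corrected because both
    are initialized at zero."""
    s = rho1 * s + (1.0 - rho1) * grad          # first-moment (momentum) estimate
    r = rho2 * r + (1.0 - rho2) * grad ** 2     # second-moment estimate
    s_hat = s / (1.0 - rho1 ** t)               # correct the initialization bias
    r_hat = r / (1.0 - rho2 ** t)
    theta -= lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

theta = np.array([1.0, 1.0])
s, r = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):                        # t starts at 1 for the corrections
    grad = np.array([1.0, 100.0]) * theta
    theta, s, r = adam_update(theta, grad, s, r, t)
```

Note that the bias corrections matter most early on: at t = 1, s equals (1 − ρ₁)·grad, and dividing by (1 − ρ₁ᵗ) restores the intended scale.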
8.5.4 Choosing the Right Optimization Algorithm

In this section, we discussed a series of related algorithms that each seek to address the challenge of optimizing deep models by adapting the learning rate for each model parameter. At this point, a natural question is: which algorithm should one choose?

Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.

Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user's familiarity with the algorithm (for ease of hyperparameter tuning).
8.6 Approximate Second-Order Methods

In this section we discuss the application of second-order methods to the training of deep networks. See LeCun et al. (1998a) for an earlier treatment of this subject. For simplicity of exposition, the only objective function we examine is the empirical risk:

J(θ) = E_{x,y∼p̂_data(x,y)} [L(f(x; θ), y)] = (1/m) ∑_{i=1}^{m} L(f(x^{(i)}; θ), y^{(i)}).    (8.25)

However the methods we discuss here extend readily to more general objective functions that, for instance, include parameter regularization terms such as those discussed in Chapter 7.
8.6.1 Newton's Method

In Sec. 4.3, we introduced second-order gradient methods. In contrast to first-order methods, second-order methods make use of second derivatives to improve optimization. The most widely used second-order method is Newton's method. We now describe Newton's method in more detail, with emphasis on its application to neural network training.

Newton's method is an optimization scheme based on using a second-order Taylor series expansion to approximate J(θ) near some point θ_0, ignoring derivatives
of higher order:

J(θ) ≈ J(θ_0) + (θ − θ_0)^⊤ ∇_θ J(θ_0) + (1/2)(θ − θ_0)^⊤ H (θ − θ_0),    (8.26)

where H is the Hessian of J with respect to θ evaluated at θ_0. If we then solve for the critical point of this function, we obtain the Newton parameter update rule:

θ* = θ_0 − H^{-1} ∇_θ J(θ_0).    (8.27)

Thus for a locally quadratic function (with positive definite H), by rescaling the gradient by H^{-1}, Newton's method jumps directly to the minimum. If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton's method, given in Algorithm 8.8.

For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton's method can be applied iteratively. This implies a two-step iterative procedure.
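For a quadratic objective, the update of Eq. 8.27 reaches the minimum in a single step, which a short NumPy check makes concrete. The matrix A and the starting point below are arbitrary illustrative choices, and a linear solve is used instead of forming H^{-1} explicitly:

```python
import numpy as np

# Quadratic objective J(theta) = 0.5 theta^T A theta - b^T theta,
# so grad J = A theta - b and the Hessian H = A is constant.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])      # positive definite
b = np.array([1.0, -1.0])

def newton_step(theta):
    grad = A @ theta - b
    # theta* = theta - H^{-1} grad J(theta)  (Eq. 8.27), via a linear solve
    return theta - np.linalg.solve(A, grad)

theta1 = newton_step(np.array([10.0, -10.0]))
# One step lands on the stationary point, where A theta = b.
```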
number of elements in the Hessian is squared in the number of parameters, so with k parameters (and for even very small neural networks the number of parameters k can be in the millions), Newton's method would require the inversion of a k × k matrix, with computational complexity of O(k³). Also, since the parameters will change with every update, the inverse Hessian has to be computed at every training iteration. As a consequence, only networks with a very small number of parameters can be practically trained via Newton's method. In the remainder of this section, we will discuss alternatives that attempt to gain some of the advantages of Newton's method while side-stepping the computational hurdles.
8.6.2 Conjugate Gradients

Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions. The inspiration for this approach follows from a careful study of the weakness of the method of steepest descent (see Sec. 4.3 for details), where line searches are applied iteratively in the direction associated with the gradient. Fig. 8.6 illustrates how the method of steepest descent, when applied in a quadratic bowl, progresses in a rather ineffective back-and-forth, zig-zag pattern. This happens because each line search direction, when given by the gradient, is guaranteed to be orthogonal to the previous line search direction.

Let the previous search direction be d_{t−1}. At the minimum, where the line search terminates, the directional derivative is zero in direction d_{t−1}: ∇_θ J(θ) · d_{t−1} = 0. Since the gradient at this point defines the current search direction, d_t = ∇_θ J(θ) will have no contribution in the direction d_{t−1}. Thus d_t is orthogonal to d_{t−1}. This relationship between d_{t−1} and d_t is illustrated in Fig. 8.6 for multiple iterations of steepest descent. As demonstrated in the figure, the choice of orthogonal directions of descent does not preserve the minimum along the previous search directions. This gives rise to the zig-zag pattern of progress, where by descending to the minimum in the current gradient direction, we must re-minimize the objective in the previous gradient direction. Thus, by following the gradient at the end of each line search we are, in a sense, undoing progress we have already made in the direction of the previous line search. The method of conjugate gradients seeks to address this problem.

In the method of conjugate gradients, we seek to find a search direction that is conjugate to the previous line search direction, i.e. it will not undo progress made in that direction. At training iteration t, the next search direction d_t takes the form:

d_t = ∇_θ J(θ) + β_t d_{t−1}    (8.29)
[Figure 8.6: the method of steepest descent progresses in a zig-zag pattern in a quadratic bowl.]
Two directions d_t and d_{t−1} are defined as conjugate if

d_t^⊤ H d_{t−1} = 0.    (8.30)

The straightforward way to impose conjugacy would involve calculation of the eigenvectors of H to choose β_t, which would not satisfy our goal of developing a method that is more computationally viable than Newton's method for large problems. Can we calculate the conjugate directions without resorting to these calculations? Fortunately the answer to that is yes.

Two popular methods for computing the β_t are:

1. Fletcher-Reeves:

   β_t = (∇_θ J(θ_t)^⊤ ∇_θ J(θ_t)) / (∇_θ J(θ_{t−1})^⊤ ∇_θ J(θ_{t−1}))    (8.31)

2. Polak-Ribière:

   β_t = ((∇_θ J(θ_t) − ∇_θ J(θ_{t−1}))^⊤ ∇_θ J(θ_t)) / (∇_θ J(θ_{t−1})^⊤ ∇_θ J(θ_{t−1}))    (8.32)
For a quadratic surface, the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. We therefore stay at the minimum along the previous directions. As a consequence, in a k-dimensional parameter space, conjugate gradients only requires k line searches to achieve the minimum. The conjugate gradient algorithm is given in Algorithm 8.9.

Algorithm 8.9 Conjugate gradient method
Require: Initial parameters θ_0
Require: Training set of m examples
  Initialize ρ_0 = 0
  Initialize g_0 = 0
  Initialize t = 1
  while stopping criterion not met do
    Initialize the gradient g_t = 0
    Compute gradient: g_t ← (1/m) ∑_i ∇_θ L(f(x^{(i)}; θ), y^{(i)})
    Compute β_t = ((g_t − g_{t−1})^⊤ g_t) / (g_{t−1}^⊤ g_{t−1})    (Polak-Ribière)
    (Nonlinear conjugate gradient: optionally reset β_t to zero, for example if t is a multiple of some constant k, such as k = 5)
    Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
    Perform line search to find: ε* = argmin_ε (1/m) ∑_{i=1}^{m} L(f(x^{(i)}; θ + ε ρ_t), y^{(i)})
    (On a truly quadratic cost function, analytically solve for ε* rather than explicitly searching for it)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
    t ← t + 1
  end while
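On a quadratic objective, Algorithm 8.9 can be exercised directly: with an exact line search and the Polak-Ribière β, a 2-dimensional quadratic is minimized in k = 2 line searches. The quadratic below and its closed-form line search are illustrative choices for the demonstration:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # Hessian of the quadratic objective
b = np.array([1.0, -1.0])
grad = lambda th: A @ th - b             # gradient of 0.5 th^T A th - b^T th

theta = np.array([10.0, -10.0])
g_prev = grad(theta)
rho = -g_prev                            # first search direction: steepest descent
for t in range(2):                       # k = 2 dimensions -> 2 line searches
    # Exact line search on the quadratic: argmin_e J(theta + e * rho)
    eps = -(rho @ grad(theta)) / (rho @ A @ rho)
    theta = theta + eps * rho
    g = grad(theta)
    beta = ((g - g_prev) @ g) / (g_prev @ g_prev)   # Polak-Ribière (Eq. 8.32)
    rho = -g + beta * rho                # conjugate search direction
    g_prev = g
```

After the second line search, theta satisfies A theta = b, the exact minimum, which is the k-line-searches property stated above.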
Nonlinear Conjugate Gradients: So far we have discussed the method of conjugate gradients as it is applied to quadratic objective functions. Of course, our primary interest in this chapter is to explore optimization methods for training neural networks and other related deep learning models where the corresponding objective function is far from quadratic. Perhaps surprisingly, the method of conjugate gradients is still applicable in this setting, though with some modification. Without any assurance that the objective is quadratic, the conjugate directions are no longer assured to remain at the minimum of the objective for previous directions. As a result, the nonlinear conjugate gradients algorithm includes occasional resets where the method of conjugate gradients is restarted with line search along the unaltered gradient.

Practitioners report reasonable results in applications of the nonlinear conjugate
gradients algorithm to training neural networks, though it is often beneficial to initialize the optimization with a few iterations of stochastic gradient descent before commencing nonlinear conjugate gradients. Also, while the (nonlinear) conjugate gradients algorithm has traditionally been cast as a batch method, minibatch versions have been used successfully for the training of neural networks (Le et al., 2011). Adaptations of conjugate gradients specifically for neural networks have been proposed earlier, such as the scaled conjugate gradients algorithm (Moller, 1993).
8.6.3 BFGS

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm attempts to bring some of the advantages of Newton's method without the computational burden. In that respect, BFGS is similar to CG. However, BFGS takes a more direct approach to the approximation of Newton's update. Recall that Newton's update is given by

θ* = θ_0 − H^{-1} ∇_θ J(θ_0),    (8.33)

complexity of the update is O(n²). The derivation of the BFGS approximation is given in many textbooks on optimization, including Luenberger (1984).

Once the inverse Hessian approximation M_t is updated, the direction of descent ρ_t is determined by ρ_t = M_t g_t. A line search is performed in this direction to determine the size of the step, ε*, taken in this direction. The final update to the parameters is given by:

θ_{t+1} = θ_t + ε* ρ_t.    (8.36)

The complete BFGS algorithm is presented in Algorithm 8.10.

Algorithm 8.10 BFGS method
Require: Initial parameters θ_0
  Initialize inverse Hessian M_0 = I
  while stopping criterion not met do
    Compute gradient: g_t = ∇_θ J(θ_t)
    Compute φ = g_t − g_{t−1}, Δ = θ_t − θ_{t−1}
    Approx H^{-1}: M_t = M_{t−1} + (1 + (φ^⊤ M_{t−1} φ)/(Δ^⊤ φ)) (Δ Δ^⊤)/(Δ^⊤ φ) − (Δ φ^⊤ M_{t−1} + M_{t−1} φ Δ^⊤)/(Δ^⊤ φ)
    Compute search direction: ρ_t = −M_t g_t
    Perform line search to find: ε* = argmin_ε J(θ_t + ε ρ_t)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
  end while
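The inverse-Hessian update above can be run end to end on a small quadratic. Everything below other than the update formula itself (the quadratic, the exact line search, and the gradient-norm stopping criterion) is illustrative scaffolding:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # Hessian of the quadratic objective
b = np.array([1.0, -1.0])
grad = lambda th: A @ th - b

theta = np.array([10.0, -10.0])
M = np.eye(2)                            # inverse-Hessian approximation M_0 = I
g_prev = grad(theta)
for t in range(10):
    if np.linalg.norm(g_prev) < 1e-10:   # stopping criterion
        break
    rho = -M @ g_prev                    # search direction
    eps = -(rho @ g_prev) / (rho @ A @ rho)   # exact line search on the quadratic
    delta = eps * rho
    theta = theta + delta
    g = grad(theta)
    phi = g - g_prev
    dphi = delta @ phi                   # curvature pair Delta^T phi
    # BFGS inverse-Hessian update (the "Approx" step of Algorithm 8.10);
    # M stays symmetric, so phi^T M equals (M phi)^T
    M = (M
         + (1.0 + (phi @ M @ phi) / dphi) * np.outer(delta, delta) / dphi
         - (np.outer(delta, M @ phi) + np.outer(M @ phi, delta)) / dphi)
    g_prev = g
```

With exact line searches on a quadratic, the loop terminates after n iterations; in general, an approximate line search suffices, which is the advantage over conjugate gradients discussed next.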
Like the method of conjugate gradients, the BFGS algorithm iterates a series of line searches with the direction incorporating second-order information. However unlike conjugate gradients, the success of the approach is not heavily dependent on the line search finding a point very close to the true minimum along the line. Thus, relative to conjugate gradients, BFGS has the advantage that it can spend less time refining each line search. On the other hand, the BFGS algorithm must store the inverse Hessian matrix, M, that requires O(n²) memory, making BFGS impractical for most modern deep learning models that typically have millions of parameters.
Limited Memory BFGS (or L-BFGS) The memory costs of the BFGS algorithm can be significantly decreased by avoiding storing the complete inverse Hessian approximation M. Alternatively, by replacing the M_{t−1} in Eq. 8.35 with an identity matrix, the BFGS search direction update formula becomes:

ρ_t = −g_t + bΔ + aφ,    (8.37)
8.7 Optimization Strategies and Meta-Algorithms

Many optimization techniques are not exactly algorithms, but rather general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.
8.7.1 Batch Normalization

Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent innovations in optimizing deep neural networks and it is actually not an optimization algorithm at all. Instead, it is a method of adaptive reparametrization, motivated by the difficulty of training very deep models.

Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously. When we make the update, unexpected results can happen because many functions composed together are changed simultaneously, using updates that were computed under the assumption that the other functions remain constant. As a simple example, suppose we have a deep neural network that has only one unit per layer and does not use an activation function at each hidden layer: ŷ = x w_1 w_2 w_3 … w_l. Here, w_i provides the weight used by layer i. The output of layer i is h_i = h_{i−1} w_i. The output ŷ is a linear function of the input x, but a nonlinear function of the weights w_i. Suppose our cost function has put a gradient of 1 on ŷ, so we wish to decrease ŷ slightly. The back-propagation algorithm can then compute a gradient g = ∇_w ŷ. Consider what happens when we make an update w ← w − εg. The
The first-order Taylor series approximation of ŷ predicts that the value of ŷ will decrease by εg⊤g. If we wanted to decrease ŷ by .1, this first-order information available in the gradient suggests we could set the learning rate ε to .1/(g⊤g). However, the actual update will include second-order and third-order effects, on up to effects of order l. The new value of ŷ is given by

x(w1 − εg1)(w2 − εg2) . . . (wl − εgl).    (8.40)

An example of one second-order term arising from this update is ε²g1g2 ∏_{i=3}^{l} wi. This term might be negligible if ∏_{i=3}^{l} wi is small, or might be exponentially large if the weights on layers 3 through l are greater than 1. This makes it very hard to choose an appropriate learning rate, because the effects of an update to the parameters for one layer depend so strongly on all of the other layers. Second-order optimization algorithms address this issue by computing an update that takes these second-order interactions into account, but we can see that in very deep networks, even higher-order interactions can be significant. Even second-order optimization algorithms are expensive and usually require numerous approximations that prevent them from truly accounting for all significant second-order interactions. Building an n-th order optimization algorithm for n > 2 thus seems hopeless. What can we do instead?
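As a numerical sketch of these higher-order effects (the layer count, weight range, and the larger target decrease of 1.0 are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

l = 10                                # number of layers, one unit each
w = rng.uniform(1.1, 1.3, size=l)     # weights slightly greater than 1
x = 1.0

def forward(w):
    return x * np.prod(w)

y = forward(w)
g = y / w        # d yhat / d w_i = x * (product of the other weights)

# First-order theory: the update w <- w - eps*g decreases yhat by eps * g.g,
# so this eps "should" decrease yhat by exactly 1.0.
eps = 1.0 / g.dot(g)
y_new = forward(w - eps * g)

print("predicted new value:", y - 1.0)
print("actual new value:   ", y_new)  # differs, due to the higher-order terms
```

The realized change falls short of the first-order prediction, and the gap grows rapidly with depth and with weights further from 1.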
Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers. Batch normalization can be applied to any input or hidden layer in a network. Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix, with the activations for each example appearing in a row of the matrix. To normalize H, we replace it with
H′ = (H − µ) / σ,    (8.41)

where µ is a vector containing the mean of each unit and σ is a vector containing the standard deviation of each unit. The arithmetic here is based on broadcasting the vector µ and the vector σ to be applied to every row of the matrix H. Within each row, the arithmetic is element-wise, so Hi,j is normalized by subtracting µj and dividing by σj. The rest of the network then operates on H′ in exactly the same way that the original network operated on H.

At training time,

µ = (1/m) Σi Hi,:    (8.42)

and

σ = √( δ + (1/m) Σi (H − µ)²i ),    (8.43)
where δ is a small positive value such as 10⁻⁸ imposed to avoid encountering the undefined gradient of √z at z = 0. Crucially, we back-propagate through these operations for computing the mean and the standard deviation, and for applying them to normalize H. This means that the gradient will never propose an operation that acts simply to increase the standard deviation or mean of hi; the normalization operations remove the effect of such an action and zero out its component in the gradient. This was a major innovation of the batch normalization approach. Previous approaches had involved adding penalties to the cost function to encourage units to have normalized activation statistics or involved intervening to renormalize unit statistics after each gradient descent step. The former approach usually resulted in imperfect normalization and the latter usually resulted in significant wasted time as the learning algorithm repeatedly proposed changing the mean and variance and the normalization step repeatedly undid this change. Batch normalization reparametrizes the model to make some units always be standardized by definition, deftly sidestepping both problems.

At test time, µ and σ may be replaced by running averages that were collected during training time. This allows the model to be evaluated on a single example, without needing to use definitions of µ and σ that depend on an entire minibatch.
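The equations above can be sketched directly in NumPy (a minimal illustration; the momentum constant for the running averages is an assumed value, since the text only calls for running averages):

```python
import numpy as np

def batch_norm_train(H, running, delta=1e-8, momentum=0.9):
    """Normalize a minibatch H (one example per row), as in Eqs. 8.41-8.43,
    and maintain running averages of mu and sigma for test time."""
    mu = H.mean(axis=0)                                    # Eq. 8.42
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # Eq. 8.43
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["sigma"] = momentum * running["sigma"] + (1 - momentum) * sigma
    return (H - mu) / sigma                                # Eq. 8.41

def batch_norm_test(H, running):
    """At test time the running statistics are used, so even a single
    example can be normalized without a minibatch."""
    return (H - running["mu"]) / running["sigma"]

rng = np.random.default_rng(0)
H = 3.0 + 2.0 * rng.standard_normal((64, 5))       # minibatch: 64 examples, 5 units
running = {"mu": np.zeros(5), "sigma": np.ones(5)}
H_prime = batch_norm_train(H, running)
print(H_prime.mean(axis=0))   # approximately 0 for every unit
print(H_prime.std(axis=0))    # approximately 1 for every unit
```

In a real implementation the gradient would also flow through `mu` and `sigma`, which an autodiff framework handles automatically when these lines appear inside the computation graph.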
Revisiting the ŷ = xw1w2 . . . wl example, we see that we can mostly resolve the difficulties in learning this model by normalizing hl−1. Suppose that x is drawn from a unit Gaussian. Then hl−1 will also come from a Gaussian, because the transformation from x to hl is linear. However, hl−1 will no longer have zero mean and unit variance. After applying batch normalization, we obtain the normalized ĥl−1 that restores the zero mean and unit variance properties. For almost any update to the lower layers, ĥl−1 will remain a unit Gaussian. The output ŷ may then be learned as a simple linear function ŷ = wlĥl−1. Learning in this model is now very simple because the parameters at the lower layers simply do not have any effect in most cases; their output is always renormalized to a unit Gaussian. In some corner cases, the lower layers can have an effect. Changing one of the lower layer weights to 0 can make the output become degenerate, and changing the sign of one of the lower weights can flip the relationship between ĥl−1 and y. These situations are very rare. Without normalization, nearly every update would have an extreme effect on the statistics of hl−1. Batch normalization has thus made this model significantly easier to learn. In this example, the ease of learning of course came at the cost of making the lower layers useless. In our linear example,
the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect. This is because we have normalized out the first and second order statistics, which is all that a linear network can influence. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful. Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change.
Because the final layer of the network is able to learn a linear transformation, we may actually wish to remove all linear relationships between units within a layer. Indeed, this is the approach taken by Desjardins et al. (2015), who provided the inspiration for batch normalization. Unfortunately, eliminating all linear interactions is much more expensive than standardizing the mean and standard deviation of each individual unit, and so far batch normalization remains the most practical approach.

Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH′ + β rather than simply the normalized H′. The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless: why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH′ + β is determined solely by β. The new parametrization is much easier to learn with gradient descent.

Most neural network layers take the form of φ(XW + b) where φ is some fixed nonlinear activation function such as the rectified linear transformation. It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW + b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW + b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparametrization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.
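Returning to the γH′ + β reparametrization above, its key property is easy to check numerically (a sketch; the particular γ and β values are arbitrary):

```python
import numpy as np

def bn_reparametrized(H, gamma, beta, delta=1e-8):
    """Replace H with gamma * H' + beta, where H' is the standardized batch."""
    mu = H.mean(axis=0)
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))
    H_prime = (H - mu) / sigma
    return gamma * H_prime + beta

rng = np.random.default_rng(1)
# Whatever complicated statistics the layers below happen to produce...
H = 7.0 + 5.0 * rng.standard_normal((128, 3))
gamma = np.array([2.0, 0.5, 1.0])
beta = np.array([-1.0, 3.0, 0.0])
out = bn_reparametrized(H, gamma, beta)
print(out.mean(axis=0))   # equals beta: the mean is determined solely by beta
print(out.std(axis=0))    # equals |gamma|, up to the tiny delta term
```

No matter how the layers below change H, the output mean is β and the output scale is γ, which is what makes this parametrization easy for gradient descent.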
8.7.2 Coordinate Descent

In some cases, it may be possible to solve an optimization problem quickly by breaking it into separate pieces. If we minimize f(x) with respect to a single variable xi, then minimize it with respect to another variable xj and so on, repeatedly cycling through all variables, we are guaranteed to arrive at a (local) minimum. This practice is known as coordinate descent, because we optimize one coordinate at a time. More generally, block coordinate descent refers to minimizing with respect to a subset of the variables simultaneously. The term “coordinate descent” is often used to refer to block coordinate descent as well as the strictly individual coordinate descent.

Coordinate descent makes the most sense when the different variables in the optimization problem can be clearly separated into groups that play relatively isolated roles, or when optimization with respect to one group of variables is significantly more efficient than optimization with respect to all of the variables. For example, consider the cost function

J(H, W) = Σi,j |Hi,j| + Σi,j (X − W⊤H)²i,j.    (8.44)
This function describes a learning problem called sparse coding, where the goal is to find a weight matrix W that can linearly decode a matrix of activation values H to reconstruct the training set X. Most applications of sparse coding also involve weight decay or a constraint on the norms of the columns of W, in order to prevent the pathological solution with extremely small H and large W.
The function J is not convex. However, we can divide the inputs to the training algorithm into two sets: the dictionary parameters W and the code representations H. Minimizing the objective function with respect to either one of these sets of variables is a convex problem. Block coordinate descent thus gives us an optimization strategy that allows us to use efficient convex optimization algorithms, by alternating between optimizing W with H fixed, then optimizing H with W fixed.
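A minimal sketch of this alternation for Eq. 8.44 (ISTA for the convex H-step and plain least squares for the W-step are illustrative solver choices; the tiny ridge term is an added numerical stabilizer, not part of the objective):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 8, 5, 50                   # data dim, code dim, number of examples
X = rng.standard_normal((d, m))
W = rng.standard_normal((k, d))      # dictionary
H = np.zeros((k, m))                 # codes

def J(H, W):
    R = X - W.T @ H                  # reconstruction residual
    return np.abs(H).sum() + (R ** 2).sum()    # Eq. 8.44

def update_H(H, W, n_steps=50):
    """Convex H-step (a lasso problem), solved here with ISTA."""
    eta = 1.0 / (2.0 * np.linalg.norm(W @ W.T, 2))   # safe step size
    for _ in range(n_steps):
        G = 2.0 * W @ (W.T @ H - X)                  # gradient of the smooth term
        Z = H - eta * G
        H = np.sign(Z) * np.maximum(np.abs(Z) - eta, 0.0)  # soft-threshold (L1 prox)
    return H

def update_W(H):
    """Convex W-step: ordinary least squares in closed form."""
    return np.linalg.solve(H @ H.T + 1e-8 * np.eye(k), H @ X.T)

j_start = J(H, W)
for _ in range(10):                  # block coordinate descent
    H = update_H(H, W)
    W = update_W(H)
j_end = J(H, W)
print(j_start, "->", j_end)          # the objective decreases
```

Each step solves one convex subproblem exactly or monotonically, so the nonconvex objective J never increases across the alternation.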
Coordinate descent is not a very good strategy when the value of one variable strongly influences the optimal value of another variable, as in the function f(x) = (x1 − x2)² + α(x1² + x2²) where α is a positive constant. The first term encourages the two variables to have similar value, while the second term encourages them to be near zero. The solution is to set both to zero. Newton's method can solve the problem in a single step because it is a positive definite quadratic problem. However, for small α, coordinate descent will make very slow progress because the first term does not allow a single variable to be changed to a value that differs significantly from the current value of the other variable.
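This slowdown is easy to verify, since each coordinate minimization here has the closed form x1 = x2/(1 + α) (a small illustrative script):

```python
def coordinate_descent(alpha, x1, x2, n_sweeps):
    """Exact coordinate minimization of f(x) = (x1 - x2)^2 + alpha*(x1^2 + x2^2).
    Setting df/dx1 = 0 with x2 fixed gives x1 = x2 / (1 + alpha), and
    symmetrically for x2, so each update only shrinks toward the other variable."""
    for _ in range(n_sweeps):
        x1 = x2 / (1.0 + alpha)
        x2 = x1 / (1.0 + alpha)
    return x1, x2

# Newton's method jumps to the solution (0, 0) in one step for this
# positive definite quadratic. Coordinate descent crawls when alpha is small:
for alpha in (1.0, 0.01):
    x1, x2 = coordinate_descent(alpha, x1=1.0, x2=1.0, n_sweeps=10)
    print(alpha, x1, x2)
```

Each update multiplies the iterate by 1/(1 + α), so for α = 0.01 the point has barely moved toward (0, 0) even after many sweeps.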
8.7.3 Polyak Averaging

Polyak averaging (Polyak and Juditsky, 1992) consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm. If t iterations of gradient descent visit points θ(1), . . . , θ(t), then the output of the Polyak averaging algorithm is θ̂(t) = (1/t) Σi θ(i). On some problem classes, such as gradient descent applied to convex problems, this approach has strong convergence guarantees. When applied to neural networks, its justification is more heuristic, but it performs well in practice. The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near the bottom of the valley. The average of all of the locations on either side should be close to the bottom of the valley though.
In non-convex problems, the path taken by the optimization trajectory can be very complicated and visit many different regions. Including points in parameter space from the distant past that may be separated from the current point by large barriers in the cost function does not seem like a useful behavior. As a result, when applying Polyak averaging to non-convex problems, it is typical to use an exponentially decaying running average:

θ̂(t) = αθ̂(t−1) + (1 − α)θ(t).    (8.45)

The running average approach is used in numerous applications. See Szegedy et al. (2015) for a recent example.
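Eq. 8.45 in code, on a toy oscillating trajectory (the valley location, oscillation, and α value are all illustrative):

```python
def ema_update(theta_hat, theta, alpha=0.999):
    """One step of the exponentially decaying running average in Eq. 8.45."""
    return alpha * theta_hat + (1.0 - alpha) * theta

# A 1-D trajectory that leaps back and forth across a valley centered
# at 2.0 without ever visiting the bottom:
theta_hat = 0.0
for t in range(5000):
    theta = 2.0 + (0.5 if t % 2 == 0 else -0.5)
    theta_hat = ema_update(theta_hat, theta)
print(theta_hat)   # close to 2.0, the bottom of the valley
```

Because α < 1, distant past points are forgotten geometrically, which is the property that makes this variant suitable for non-convex trajectories.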
8.7.4 Supervised Pretraining

Sometimes, directly training a model to solve a specific task can be too ambitious if the model is complex and hard to optimize or if the task is very difficult. It is sometimes more effective to train a simpler model to solve the task, then make the model more complex. It can also be more effective to train the model to solve a simpler task, then move on to confront the final task. These strategies that involve training simple models on simple tasks before confronting the challenge of training the final model to perform the desired task are collectively known as pretraining.
[Figure: illustration of greedy supervised pretraining. Panels (a)–(b) show a network with input x, hidden layer h(1), and output weights U(1); panels (c)–(d) show the network extended with a second hidden layer h(2) and output weights U(2).]
then jointly trained to perform a different set of tasks (another subset of the 1000
ImageNet ob object
ject categories), with fewer training examples than for the first set of
then jointly trained
tasks. Other approaches to perform
to transfer a different
learning setwith
of tasks
neural (another
netw
networks subset
orks of the 1000
are discussed in
ImageNet
Sec. 15.2. ob ject categories), with fewer training examples than for the first set of
tasks. Other approaches to transfer learning with neural networks are discussed in
Another related line of work is the FitNets (Romero et al., 2015) approach. This approach begins by training a network that has low enough depth and great enough width (number of units per layer) to be easy to train. This network then becomes a teacher for a second network, designated the student. The student network is much deeper and thinner (eleven to nineteen layers) and would be difficult to train with SGD under normal circumstances. The training of the student network is made easier by training the student network not only to predict the output for the original task, but also to predict the value of the middle layer of the teacher network. This extra task provides a set of hints about how the hidden layers should be used and can simplify the optimization problem. Additional parameters are introduced to regress the middle layer of the 5-layer teacher network from the middle layer of the deeper student network. However, instead of predicting the final classification target, the objective is to predict the middle hidden layer of the teacher network. The lower layers of the student network thus have two objectives: to help the outputs of the student network accomplish their task, as well as to predict the intermediate layer of the teacher network. Although a thin and deep network appears to be more difficult to train than a wide and shallow network, the thin and deep network may generalize better and certainly has lower computational cost if it is thin enough to have far fewer parameters. Without the hints on the hidden layer, the student network performs very poorly in the experiments, both on the training and test set. Hints on middle layers may thus be one of the tools to help train neural networks that otherwise seem difficult to train, but other optimization techniques or changes in the architecture may also solve the problem.
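The hint objective can be sketched in a few lines of numpy: a small regressor projects the thinner student's middle layer onto the teacher's (wider) middle layer, and the squared regression error is added to the task loss. All names, layer sizes, and the weighting `alpha` below are illustrative placeholders, not FitNets' actual architecture:

```python
import numpy as np

def hint_loss(h_student, h_teacher, W_r):
    """Squared error of regressing the student's middle layer onto the teacher's.

    h_student: (d_s,) activation of the thin student's middle layer
    h_teacher: (d_t,) activation of the wide teacher's middle layer
    W_r:       (d_t, d_s) extra regressor introduced only for this auxiliary task
    """
    guide = W_r @ h_student  # project the student layer up to the teacher's width
    return 0.5 * np.sum((guide - h_teacher) ** 2)

def student_objective(task_loss, h_student, h_teacher, W_r, alpha=1.0):
    # Total objective: the original task loss plus the weighted hint term.
    return task_loss + alpha * hint_loss(h_student, h_teacher, W_r)

rng = np.random.default_rng(0)
h_s = rng.standard_normal(4)              # thin student layer (4 units, invented)
h_t = rng.standard_normal(16)             # wide teacher layer (16 units, invented)
W_r = 0.1 * rng.standard_normal((16, 4))  # hypothetical regressor weights
total = student_objective(task_loss=1.25, h_student=h_s, h_teacher=h_t, W_r=W_r)
```

In a real training loop both the student's weights and the regressor `W_r` would be updated by gradient descent on this combined objective; the sketch only shows how the two terms combine.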
8.7.5 Designing Models to Aid Optimization

To improve optimization, the best strategy is not always to improve the optimization algorithm. Instead, many improvements in the optimization of deep models have come from designing the models to be easier to optimize.

In principle, we could use activation functions that increase and decrease in jagged non-monotonic patterns. However, this would make optimization extremely difficult. In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm. Most of the advances in neural network learning over the past 30 years have been
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
8.7.6 Continuation Methods and Curriculum Learning

As argued in Sec. 8.2.7, many of the challenges in optimization arise from the global structure of the cost function and cannot be resolved merely by making better estimates of local update directions. The predominant strategy for overcoming this problem is to attempt to initialize the parameters in a region that is connected to the solution by a short path through parameter space that local descent can
discover.

Continuation methods are a family of strategies that can make optimization easier by choosing initial points to ensure that local optimization spends most of its time in well-behaved regions of space. The idea behind continuation methods is to construct a series of objective functions over the same parameters. In order to minimize a cost function J(θ), we will construct new cost functions {J^(0), . . . , J^(n)}. These cost functions are designed to be increasingly difficult, with J^(0) being fairly easy to minimize, and J^(n), the most difficult, being J(θ), the true cost function motivating the entire process. When we say that J^(i) is easier than J^(i+1), we mean that it is well behaved over more of θ space. A random initialization is more likely to land in the region where local descent can minimize the cost function successfully because this region is larger. The series of cost functions are designed so that a solution to one is a good initial point of the next. We thus begin by solving an easy problem, then refine the solution to solve incrementally harder problems until we arrive at a solution to the true underlying problem.

Traditional continuation methods (predating the use of continuation methods for neural network training) are usually based on smoothing the objective function. See Wu (1997) for an example of such a method and a review of some related methods. Continuation methods are also closely related to simulated annealing, which adds noise to the parameters (Kirkpatrick et al., 1983). Continuation methods have been extremely successful in recent years. See Mobahi and Fisher (2015) for an overview of recent literature, especially for AI applications.

Continuation methods traditionally were mostly designed with the goal of overcoming the challenge of local minima. Specifically, they were designed to reach a global minimum despite the presence of many local minima. To do so, these continuation methods would construct easier cost functions by "blurring" the original cost function. This blurring operation can be done by approximating

    J^(i)(θ) = E_{θ′ ∼ N(θ′; θ, σ^(i)2)} J(θ′)    (8.46)

via sampling. The intuition for this approach is that some non-convex functions become approximately convex when blurred. In many cases, this blurring preserves enough information about the location of a global minimum that we can find the global minimum by solving progressively less blurred versions of the problem. This approach can break down in three different ways. First, it might successfully define a series of cost functions where the first is convex and the optimum tracks from one function to the next arriving at the global minimum, but it might require so many incremental cost functions that the cost of the entire procedure remains high. NP-hard optimization problems remain NP-hard, even when continuation methods
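The expectation in Eq. 8.46 can be estimated by plain Monte Carlo sampling: draw θ′ from a Gaussian centered on θ and average the cost over the draws. A minimal sketch; the rippled toy cost `J` below is an invented example, not from the text:

```python
import numpy as np

def blurred_cost(J, theta, sigma, n_samples=10000, seed=0):
    """Monte Carlo estimate of E_{theta' ~ N(theta, sigma^2)} J(theta')  (Eq. 8.46)."""
    rng = np.random.default_rng(seed)
    samples = theta + sigma * rng.standard_normal(n_samples)
    return np.mean(J(samples))

# A toy non-convex cost: a narrow ripple on top of a quadratic bowl.
J = lambda t: t ** 2 + 0.5 * np.cos(8.0 * t)

# Heavy blurring averages away the ripple, leaving an approximately
# quadratic (convex) function; light blurring approaches J itself.
smooth = blurred_cost(J, theta=1.0, sigma=2.0)
sharp = blurred_cost(J, theta=1.0, sigma=0.01)
```

A continuation scheme would minimize the heavily blurred estimate first, then re-solve with progressively smaller `sigma`, using each solution to initialize the next stage.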
Zaremba and Sutskever (2014) found that much better results were obtained with a stochastic curriculum, in which a random mix of easy and difficult examples is always presented to the learner, but where the average proportion of the more difficult examples (here, those with longer-term dependencies) is gradually increased. With a deterministic curriculum, no improvement over the baseline (ordinary training from the full training set) was observed.

We have now described the basic family of neural network models and how to regularize and optimize them. In the chapters ahead, we turn to specializations of the neural network family, that allow neural networks to scale to very large sizes and process input data that has special structure. The optimization methods discussed in this chapter are often directly applicable to these specialized architectures with little or no modification.
Chapter 9

Convolutional Networks
Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

In this chapter, we will first describe what convolution is. Next, we will explain the motivation behind using convolution in a neural network. We will then describe an operation called pooling, which almost all convolutional networks employ. Usually, the operation used in a convolutional neural network does not correspond precisely to the definition of convolution as used in other fields such as engineering or pure mathematics. We will describe several variants on the convolution function that are widely used in practice for neural networks. We will also show how convolution may be applied to many kinds of data, with different numbers of dimensions. We then discuss means of making convolution more efficient. Convolutional networks stand out as an example of neuroscientific principles influencing deep learning. We will discuss these neuroscientific principles, then conclude with comments about the role convolutional networks have played in the history of deep learning. One topic this chapter does not address is how to choose the architecture of your convolutional network. The goal of this chapter is to describe the kinds of tools that convolutional networks provide, while Chapter 11
describes general guidelines for choosing which tools to use in which circumstances. Research into convolutional network architectures proceeds so rapidly that a new best architecture for a given benchmark is announced every few weeks to months, rendering it impractical to describe the best architecture in print. However, the best architectures have consistently been composed of the building blocks described here.
9.1 The Convolution Operation

In its most general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with examples of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real-valued, i.e., we can get a different reading from the laser sensor at any instant in time.

Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

    s(t) = ∫ x(a) w(t − a) da    (9.1)
argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.

In our example, the idea of a laser sensor that can provide measurements at every instant in time is not realistic. Usually, when we work with data on a computer, time will be discretized, and our sensor will provide data at regular intervals. In our example, it might be more realistic to assume that our laser provides a measurement once per second. The time index t can then take on only integer values. If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

    s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a)    (9.3)
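With x and w stored as finite arrays (implicitly zero everywhere else), Eq. 9.3 becomes a short double loop. A minimal sketch; the sensor readings and 3-tap weighting function below are invented for illustration:

```python
import numpy as np

def discrete_conv(x, w):
    """Discrete convolution s(t) = sum_a x(a) w(t - a)  (Eq. 9.3),
    for finite arrays x and w assumed zero outside their stored range."""
    s = np.zeros(len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):  # skip terms where w is implicitly zero
                s[t] += x[a] * w[t - a]
    return s

# Smoothing a position signal with a 3-tap weighting function that
# favors the most recent measurement (weights chosen arbitrarily).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # sensor readings
w = np.array([0.5, 0.3, 0.2])            # w(0) = 0.5 weights the newest reading most
s = discrete_conv(x, w)
```

This is the "full" convolution; `numpy.convolve(x, w)` computes the same result directly, which is a quick way to check the loop.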
In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. This means that in practice we can implement the infinite summation as a summation over a finite number of array elements.

Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

    S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).    (9.4)
Convolution is commutative, meaning we can equivalently write:

    S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).    (9.5)
Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.

The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property
Figure 9.1: An example of 2-D convolution without kernel flipping, restricted to output positions where the kernel lies entirely within the image.

Input:          Kernel:
a b c d         w x
e f g h         y z
i j k l

Output:
aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz
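The double sum over a 2-D grid can be written out directly. The sketch below slides the kernel without flipping, which reproduces the arithmetic of the worked table above, and recovers true convolution per Eq. 9.4 by flipping the kernel first; the example image and kernel values are arbitrary:

```python
import numpy as np

def cross_correlate2d(I, K):
    """Slide K over I without flipping, keeping only positions where K
    fits entirely inside I (the same rule as the worked table above)."""
    h = I.shape[0] - K.shape[0] + 1
    w = I.shape[1] - K.shape[1] + 1
    S = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            S[i, j] = np.sum(I[i:i + K.shape[0], j:j + K.shape[1]] * K)
    return S

def convolve2d(I, K):
    """Eq. 9.4: true convolution flips the kernel on both axes first."""
    return cross_correlate2d(I, K[::-1, ::-1])

I = np.arange(12, dtype=float).reshape(3, 4)  # stand-in for the a..l image
K = np.array([[1.0, 2.0], [3.0, 4.0]])        # stand-in for the w, x, y, z kernel
S = cross_correlate2d(I, K)                   # 2x3 output, as in the table
```

Flipping only changes which kernel entry multiplies which pixel; as the text notes, it is needed solely for the commutative property.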
9.2 Motivation

Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m. For graphical demonstrations of sparse connectivity, see Fig. 9.2 and Fig. 9.3. In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input, as shown in Fig. 9.4. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited. As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere. In a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn
CHAPTER 9. CONVOLUTIONAL NETWORKS
Figure 9.2: Sparse connectivity, viewed from below: We highlight one input unit, x3, and also highlight the output units in s that are affected by this unit. (Top) When s is formed by convolution with a kernel of width 3, only three outputs are affected by x3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x3.
Figure 9.4: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution (Fig. 9.12) or pooling (Sec. 9.3). This means that even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
Figure 9.5: Parameter sharing: Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Due to parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once.
only one set. This does not affect the runtime of forward propagation—it is still O(k × n)—but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude less than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency. For a graphical depiction of how parameter sharing works, see Fig. 9.5.
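Tied weights can be seen directly in code: a convolution is the same linear map as multiplication by a weight matrix in which every row reuses the same k kernel values. A minimal sketch (illustrative sizes, not from the book):

```python
import numpy as np

# A 1-D convolution written as a matrix product. The matrix has n x m
# entries, but only k = 3 distinct parameters: the shared kernel.
kernel = np.array([0.5, 1.0, -0.5])
m = 8
k = kernel.size
n = m - k + 1

W = np.zeros((n, m))
for i in range(n):
    W[i, i:i + k] = kernel        # each row is the same kernel, shifted: tied weights

x = np.arange(m, dtype=float)
by_matrix = W @ x
by_conv = np.correlate(x, kernel, mode="valid")
```

Storing W explicitly costs n × m numbers; storing the kernel costs only k, which is the reduction described above.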
As an example of both of these first two principles in action, Fig. 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.
In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one image function to another image function, such that I′ = g(I) is
the image function with I′(x, y) = I(x − 1, y). This shifts every pixel of I one unit to the right. If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution to I, then applied the transformation g to the output. When processing time series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later in time. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful for when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations. For example, when processing images, it is useful to detect edges in the first layer of a convolutional network. The same edges appear more or less everywhere in the image, so it is practical to share parameters across the entire image. In some cases, we may not wish to share parameters across the entire image. For example, if we are processing images that are cropped to be centered on an individual's face, we probably want to extract different features at different locations—the part of the network processing the top of the face needs to look for eyebrows, while the part of the network processing the bottom of the face needs to look for a chin.
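The identity f(g(x)) = g(f(x)) can be checked numerically. The sketch below uses circular convolution and a circular shift so that the identity holds exactly at the boundaries; this is an assumption made for the demonstration, since ordinary zero-padded convolution is only equivariant away from the edges.

```python
import numpy as np

def circ_conv(x, h):
    # circular (wrap-around) convolution: y[i] = sum_j x[(i - j) mod n] * h[j]
    n = len(x)
    return np.array([sum(x[(i - j) % n] * h[j] for j in range(len(h)))
                     for i in range(n)])

def g(x):
    # translate the signal one step to the right (with wrap-around)
    return np.roll(x, 1)

h = np.array([1.0, -1.0, 0.5])
x = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])

lhs = circ_conv(g(x), h)      # shift the input, then convolve
rhs = g(circ_conv(x, h))      # convolve, then shift the output
```

The two orders of operations produce the same signal, which is exactly the equivariance property described above.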
Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image. Other mechanisms are necessary for handling these kinds of transformations.

Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data. We discuss this further in Sec. 9.7.
9.3 Pooling

A typical layer of a convolutional network consists of three stages (see Fig. 9.7). In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the layer further.

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling (Zhou and Chellappa, 1988) operation reports the maximum output within a rectangular
Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all of the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating point operations to compute. The matrix would still need to contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small, local region across the entire input. (Photo credit: Paula Goodfellow)
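The operation counts in the caption of Fig. 9.6 are easy to reproduce:

```python
# Reproducing the arithmetic from the caption of Fig. 9.6.
out_w, out_h = 319, 280          # output image size
in_w = 320                       # input image width

conv_flops = out_w * out_h * 3                    # 2 mults + 1 add per output pixel
dense_entries = (in_w * out_h) * (out_w * out_h)  # full weight matrix
sparse_entries = 2 * out_w * out_h                # nonzero entries only
```

This gives 267,960 floating point operations for convolution, just over eight billion matrix entries for the dense description, and 178,640 entries even when only nonzeros are stored.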
(Figure 9.7 diagram: the components of a typical convolutional network layer. Panel labels: convolutional layer; detector stage: nonlinearity, e.g., rectified linear.)
neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
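These summary statistics can be sketched in 1-D (illustrative sizes and values, not the book's code):

```python
import numpy as np

def pool_1d(x, width=3, stat=np.max):
    # report a summary statistic over each rectangular neighborhood
    return np.array([stat(x[i:i + width]) for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
max_pooled = pool_1d(detector)                 # max pooling
avg_pooled = pool_1d(detector, stat=np.mean)   # average pooling
```

Swapping the `stat` argument changes which summary statistic the pooling stage reports.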
In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. See Fig. 9.8 for an example of how this works. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of the face and an eye on the right side of the face. In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
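The kind of invariance Fig. 9.8 illustrates can be sketched numerically: shift every detector output one step, and the max-pooled values barely change. The values below are hypothetical, chosen for illustration.

```python
import numpy as np

def max_pool_1d(x, width=3):
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([1.0, 1.0, 1.0, 0.2, 0.1])
shifted = np.array([0.3, 1.0, 1.0, 1.0, 0.2])   # input translated right by one

before = max_pool_1d(detector)
after = max_pool_1d(shifted)
# every detector value changed, yet the pooled outputs are identical here
```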
Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see Fig. 9.9).
Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See Fig. 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication), this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.
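Spacing the pooling regions k pixels apart is just a strided loop; a sketch with k = 2 (illustrative sizes, not the book's code):

```python
import numpy as np

def strided_max_pool(x, width=3, stride=2):
    # summary statistics reported every `stride` pixels instead of every pixel
    return np.array([x[i:i + width].max()
                     for i in range(0, len(x) - width + 1, stride)])

detector = np.arange(12, dtype=float)
pooled = strided_max_pool(detector)      # roughly half as many outputs as inputs
```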
For many tasks, pooling is essential for handling inputs of varying size. For example, if we want to classify images of variable size, the input to the classification layer must have a fixed size. This is usually accomplished by varying the size of an offset between pooling regions so that the classification layer always receives the same number of summary statistics regardless of the input size. For example, the final pooling layer of the network may be defined to output four sets of summary statistics, one for each quadrant of an image, regardless of the image size.
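The four-quadrant scheme mentioned above can be sketched by sizing the pooling regions relative to the image. The helper below is an illustrative assumption, not the book's code:

```python
import numpy as np

def quadrant_max_pool(image):
    # one summary statistic per image quadrant, whatever the image size
    h, w = image.shape
    return np.array([image[:h // 2, :w // 2].max(),
                     image[:h // 2, w // 2:].max(),
                     image[h // 2:, :w // 2].max(),
                     image[h // 2:, w // 2:].max()])

small = quadrant_max_pool(np.random.randn(6, 8))
large = quadrant_max_pool(np.random.randn(50, 70))
# both images yield exactly four summary statistics
```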
Some theoretical work gives guidance as to which kinds of pooling one should
Figure 9.11: Examples of architectures for classification with convolutional networks. The specific strides and depths used in this figure are not advisable for real use; they are designed to be very shallow in order to fit onto the page. Real convolutional networks also often involve significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A convolutional network that processes a fixed image size. After alternating between convolution and pooling for a few layers, the tensor for the convolutional feature map is reshaped to flatten out the spatial dimensions. The rest of the network is an ordinary feedforward network classifier, as described in Chapter 6. (Center) A convolutional network that processes a variable-sized image, but still maintains a fully connected section. This network uses a pooling operation with variably-sized pools but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the fully connected portion of the network. (Right) A convolutional network that does not have any fully connected weight layer. Instead, the last convolutional layer outputs one feature map per class. The model presumably learns a map of how likely each class is to occur at each spatial location. Averaging a feature map down to a single value provides the argument to the softmax classifier at the top.
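The rightmost architecture's classification head can be sketched in a few lines; this is the average-then-softmax idea from the caption, with illustrative sizes rather than the book's code:

```python
import numpy as np

n_classes, h, w = 10, 16, 16
feature_maps = np.random.randn(n_classes, h, w)  # one map per class

logits = feature_maps.mean(axis=(1, 2))          # average each map to one value
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over classes
```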
equivariant to translation. Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.

Of course, implementing a convolutional net as a fully connected net with an infinitely strong prior would be extremely computationally wasteful. But thinking of a convolutional net as a fully connected net with an infinitely strong prior can give us some insights into how convolutional nets work.
One key insight is that convolution and pooling can cause underfitting. Like any prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate. If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error. Some convolutional network architectures (Szegedy et al., 2014a) are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect. When a task involves incorporating information from very distant locations in the input, then the prior imposed by convolution may be inappropriate.
Another key insight from this view is that we should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance. Models that do not use convolution would be able to learn even if we permuted all of the pixels in the image. For many image datasets, there are separate benchmarks for models that are permutation invariant and must discover the concept of topology via learning, and models that have the knowledge of spatial relationships hard-coded into them by their designer.
9.5 Variants of the Basic Convolution Function

When discussing convolution in the context of neural networks, we usually do not refer exactly to the standard discrete convolution operation as it is usually understood in the mathematical literature. The functions used in practice differ slightly. Here we describe these differences in detail, and highlight some useful properties of the functions used in neural networks.

First, when we refer to convolution in the context of neural networks, we usually actually mean an operation that consists of many applications of convolution in parallel. This is because convolution with a single kernel can only extract one kind of feature, albeit at many spatial locations. Usually we want each layer of our network to extract many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a grid of vector-valued observations. For example, a color image has a red, green and blue intensity at each pixel. In a multilayer convolutional network, the input to the second layer is the output of the first layer, which usually has the output of many different convolutions at each position. When working with images, we usually think of the input and output of the convolution as being 3-D tensors, with one index into the different channels and two indices into the spatial coordinates of each channel. Software implementations usually work in batch mode, so they will actually use 4-D tensors, with the fourth axis indexing different examples in the batch, but we will omit the batch axis in our description here for simplicity.

Because convolutional networks usually use multi-channel convolution, the linear operations they are based on are not guaranteed to be commutative, even if kernel-flipping is used. These multi-channel operations are only commutative if each operation has the same number of output channels as input channels.
Assume we have a 4-D kernel tensor K with element $K_{i,j,k,l}$ giving the connection strength between a unit in channel $i$ of the output and a unit in channel $j$ of the input, with an offset of $k$ rows and $l$ columns between the output unit and the input unit. Assume our input consists of observed data V with element $V_{i,j,k}$ giving the value of the input unit within channel $i$ at row $j$ and column $k$. Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n} \tag{9.7}$$
where the summation over $l$, $m$ and $n$ is over all values for which the tensor indexing operations inside the summation are valid. In linear algebra notation, we index into arrays using a 1 for the first entry. This necessitates the $-1$ in the above formula. Programming languages such as C and Python index starting from 0, rendering the above expression even simpler.
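The multi-channel operation of Eq. 9.7 can be sketched in a few lines of NumPy (a hand-rolled illustration using the simpler 0-based indexing mentioned above; the function name is ours, not from the book):

```python
import numpy as np

def multi_channel_conv(V, K):
    """Eq. 9.7 with 0-based indexing: "valid" multi-channel convolution
    without kernel flipping (i.e., cross-correlation).

    V: input, shape (in_channels, rows, cols)
    K: kernel, shape (out_channels, in_channels, k_rows, k_cols)
    """
    n_out, n_in, kr, kc = K.shape
    _, r, c = V.shape
    Z = np.zeros((n_out, r - kr + 1, c - kc + 1))
    for i in range(n_out):            # each output channel...
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                # ...sums over every input channel l and offsets m, n
                Z[i, j, k] = np.sum(V[:, j:j + kr, k:k + kc] * K[i])
    return Z
```

Applying a (4, 2, 2, 2) kernel to a (2, 3, 3) input yields a (4, 2, 2) output: one feature map per output channel, each mixing all input channels.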
We may want to skip over some positions of the kernel in order to reduce the computational cost (at the expense of not extracting our features as finely). We can think of this as downsampling the output of the full convolution function. If we want to sample only every $s$ pixels in each direction in the output, then we can define a downsampled convolution function $c$ such that
$$Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} V_{l,\,(j-1)\times s+m,\,(k-1)\times s+n}\, K_{i,l,m,n}. \tag{9.8}$$
We refer to $s$ as the stride of this downsampled convolution. It is also possible to define a separate stride for each direction of motion. See Fig. 9.12 for an illustration.
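The equivalence that Eq. 9.8 and Fig. 9.12 describe, that strided convolution computes the same values as full convolution followed by keeping every $s$-th output, is easy to check numerically (a one-dimensional, single-channel sketch; the names are ours):

```python
import numpy as np

def conv1d_valid(x, k):
    # full ("valid") 1-D convolution without flipping, unit stride
    return np.array([np.dot(x[j:j + len(k)], k)
                     for j in range(len(x) - len(k) + 1)])

def conv1d_strided(x, k, s):
    # direct strided convolution: compute only every s-th output position
    return np.array([np.dot(x[j:j + len(k)], k)
                     for j in range(0, len(x) - len(k) + 1, s)])

x = np.arange(8.0)
k = np.array([1.0, 1.0])
# striding the full output gives the same result as striding the computation,
# but the direct version never computes the discarded values
assert np.allclose(conv1d_strided(x, k, 2), conv1d_valid(x, k)[::2])
```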
[Figure 9.12: a strided convolution computed directly in a single operation (top), and the equivalent full convolution followed by downsampling (bottom).]
Figure 9.13: The effect of zero padding on network size: Consider a convolutional network with a kernel of width six at every layer. In this example, we do not use any pooling, so only the convolution operation itself shrinks the network size. (Top) In this convolutional network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional. The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive and some shrinking is inevitable in this kind of architecture. (Bottom) By adding five implicit zeroes to each layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep convolutional network.
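The shrinkage in the top network of the figure is just repeated application of the "valid"-convolution output-size formula (a tiny sketch reproducing the caption's numbers):

```python
def valid_out_width(w_in, kernel_width):
    # output width of a convolution layer with no zero padding
    return w_in - kernel_width + 1

widths = [16]
while valid_out_width(widths[-1], 6) >= 1:
    widths.append(valid_out_width(widths[-1], 6))
print(widths)  # [16, 11, 6, 1]: three layers fit, and in the last one the
               # kernel can never move
```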
the forward propagation operation, as well as the size of the forward propagation's output map. In some cases, multiple sizes of input to forward propagation can result in the same size of output map, so the transpose operation must be explicitly told what the size of the original input was.
These three operations (convolution, backprop from output to weights, and backprop from output to inputs) are sufficient to compute all of the gradients needed to train any depth of feedforward convolutional network, as well as to train convolutional networks with reconstruction functions based on the transpose of convolution. See Goodfellow (2010) for a full derivation of the equations in the fully general multi-dimensional, multi-example case. To give a sense of how these equations work, we present the two dimensional, single example version here.
Suppose we want to train a convolutional network that incorporates strided convolution of kernel stack K applied to multi-channel image V with stride $s$ as defined by $c(K, V, s)$ as in Eq. 9.8. Suppose we want to minimize some loss function $J(V, K)$. During forward propagation, we will need to use $c$ itself to output Z, which is then propagated through the rest of the network and used to compute the cost function $J$. During back-propagation, we will receive a tensor G such that $G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K)$.

To train the network, we need to compute the derivatives with respect to the weights in the kernel. To do so, we can use a function
$$g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n}\, V_{j,\,(m-1)\times s+k,\,(n-1)\times s+l}. \tag{9.11}$$
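Eq. 9.8 and Eq. 9.11 can be checked against finite differences in a short NumPy sketch (0-based indexing; the helper names c and g mirror the text, but the extra kshape argument and all array shapes are our choices):

```python
import numpy as np

def c(K, V, s):
    # strided multi-channel convolution, Eq. 9.8 (0-based indexing)
    n_out, n_in, kr, kc = K.shape
    _, r, cols = V.shape
    Z = np.zeros((n_out, (r - kr) // s + 1, (cols - kc) // s + 1))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j*s:j*s + kr, k*s:k*s + kc] * K[i])
    return Z

def g(G, V, s, kshape):
    # gradient of J with respect to the kernel, Eq. 9.11 (0-based indexing)
    dK = np.zeros(kshape)
    kr, kc = kshape[2], kshape[3]
    for i in range(G.shape[0]):
        for m in range(G.shape[1]):
            for n in range(G.shape[2]):
                # output unit (i, m, n) saw the input patch starting at (m*s, n*s)
                dK[i] += G[i, m, n] * V[:, m*s:m*s + kr, n*s:n*s + kc]
    return dK

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 5, 5))
K = rng.standard_normal((3, 2, 2, 2))
s = 2
G = np.ones_like(c(K, V, s))         # gradient of the toy loss J(V, K) = sum(Z)

analytic = g(G, V, s, K.shape)
eps = 1e-6                           # finite-difference check of one element
K2 = K.copy(); K2[0, 1, 0, 1] += eps
numeric = (c(K2, V, s).sum() - c(K, V, s).sum()) / eps
assert abs(analytic[0, 1, 0, 1] - numeric) < 1e-4
```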
If this layer is not the bottom layer of the network, we will need to compute the gradient with respect to V in order to back-propagate the error farther down. To do so, we can use a function
$$h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K) \tag{9.12}$$
$$= \sum_{\substack{l,m \\ \text{s.t.} \\ (l-1)\times s+m=j}} \;\; \sum_{\substack{n,p \\ \text{s.t.} \\ (n-1)\times s+p=k}} \;\sum_{q} K_{q,i,m,p}\, G_{q,l,n}. \tag{9.13}$$
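A matching sketch for Eqs. 9.12–9.13, again with 0-based indexing and finite-difference verification (c is repeated so the snippet stands alone; the vshape argument is our addition):

```python
import numpy as np

def c(K, V, s):
    # strided multi-channel convolution, Eq. 9.8 (0-based indexing)
    n_out, _, kr, kc = K.shape
    _, r, cols = V.shape
    Z = np.zeros((n_out, (r - kr) // s + 1, (cols - kc) // s + 1))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j*s:j*s + kr, k*s:k*s + kc] * K[i])
    return Z

def h(K, G, s, vshape):
    # gradient of J with respect to the input, Eqs. 9.12-9.13 (0-based):
    # every output unit (q, l, n) scatters K[q] * G[q, l, n] back onto the
    # input patch it was computed from
    dV = np.zeros(vshape)
    _, _, kr, kc = K.shape
    for q in range(G.shape[0]):
        for l in range(G.shape[1]):
            for n in range(G.shape[2]):
                dV[:, l*s:l*s + kr, n*s:n*s + kc] += K[q] * G[q, l, n]
    return dV

rng = np.random.default_rng(1)
V = rng.standard_normal((2, 5, 5))
K = rng.standard_normal((3, 2, 2, 2))
s, eps = 2, 1e-6
G = np.ones_like(c(K, V, s))         # gradient of the toy loss J(V, K) = sum(Z)
dV = h(K, G, s, V.shape)

V2 = V.copy(); V2[1, 2, 2] += eps    # finite-difference check of one element
numeric = (c(K, V2, s).sum() - c(K, V, s).sum()) / eps
assert abs(dV[1, 2, 2] - numeric) < 1e-4
```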
Autoencoder networks, described in Chapter 14, are feedforward networks trained to copy their input to their output. A simple example is the PCA algorithm, that copies its input $x$ to an approximate reconstruction $r$ using the function $W^\top W x$. It is common for more general autoencoders to use multiplication by the transpose of the weight matrix just as PCA does. To make such models
convolutional, we can use the function $h$ to perform the transpose of the convolution operation. Suppose we have hidden units H in the same format as Z and we define a reconstruction
$$R = h(K, H, s). \tag{9.14}$$
In order to train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by $g(H, E, s)$. To train the encoder, we need to obtain the gradient with respect to H. This is given by $c(K, E, s)$. It is also possible to differentiate through $g$ using $c$ and $h$, but these operations are not needed for the back-propagation algorithm on any standard network architectures.
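That $h$ performs the transpose of the linear map from V to $c(K, V, s)$ can be verified through the adjoint identity, sum(c(K, V, s) * G) == sum(V * h(K, G, s)), using the same hand-rolled helpers as in the earlier sketches (a numerical check, not library code):

```python
import numpy as np

def c(K, V, s):
    # strided multi-channel convolution, Eq. 9.8 (0-based indexing)
    n_out, _, kr, kc = K.shape
    _, r, cols = V.shape
    Z = np.zeros((n_out, (r - kr) // s + 1, (cols - kc) // s + 1))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j*s:j*s + kr, k*s:k*s + kc] * K[i])
    return Z

def h(K, G, s, vshape):
    # transpose of convolution: scatter each output's contribution back
    dV = np.zeros(vshape)
    _, _, kr, kc = K.shape
    for q in range(G.shape[0]):
        for l in range(G.shape[1]):
            for n in range(G.shape[2]):
                dV[:, l*s:l*s + kr, n*s:n*s + kc] += K[q] * G[q, l, n]
    return dV

rng = np.random.default_rng(2)
V = rng.standard_normal((2, 7, 7))
K = rng.standard_normal((4, 2, 3, 3))
s = 2
Z = c(K, V, s)
G = rng.standard_normal(Z.shape)
# the adjoint (transpose) identity holds for any V and G
assert np.isclose(np.sum(Z * G), np.sum(V * h(K, G, s, V.shape)))
```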
Generally, we do not use only a linear operation in order to transform from the inputs to the outputs in a convolutional layer. We generally also add some bias term to each output before applying the nonlinearity. This raises the question of how to share parameters among the biases. For locally connected layers it is natural to give each unit its own bias, and for tiled convolution, it is natural to share the biases with the same tiling pattern as the kernels. For convolutional layers, it is typical to have one bias per channel of the output and share it across all locations within each convolution map. However, if the input is of known, fixed size, it is also possible to learn a separate bias at each location of the output map. Separating the biases may slightly reduce the statistical efficiency of the model, but also allows the model to correct for differences in the image statistics at different locations. For example, when using implicit zero padding, detector units at the edge of the image receive less total input and may need larger biases.
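The parameter-count difference between tied (per-channel) and untied (per-location) biases is easy to see with broadcasting (a sketch; the array names and sizes are ours):

```python
import numpy as np

channels, rows, cols = 16, 8, 8
feature_map = np.zeros((channels, rows, cols))

# one bias per output channel, shared across all locations in its map
tied_bias = np.zeros((channels, 1, 1))
# a separate bias at every location: only possible for known, fixed input size
untied_bias = np.zeros((channels, rows, cols))

assert (feature_map + tied_bias).shape == feature_map.shape   # broadcasts
print(tied_bias.size, untied_bias.size)  # 16 vs. 1024 parameters
```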
9.6 Structured Outputs

Convolutional networks can be used to output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task. Typically this object is just a tensor, emitted by a standard convolutional layer. For example, the model might emit a tensor S, where $S_{i,j,k}$ is the probability that pixel $(j, k)$ of the input to the network belongs to class $i$. This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects.
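A per-pixel probability tensor S like the one described here is typically produced by a softmax over the channel axis of the final layer's output (a sketch; the function name is ours):

```python
import numpy as np

def pixelwise_softmax(logits):
    # logits: (classes, rows, cols); softmax over the class axis gives
    # S[i, j, k] = probability that pixel (j, k) belongs to class i
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

S = pixelwise_softmax(np.random.randn(5, 4, 4))
assert np.allclose(S.sum(axis=0), 1.0)   # a distribution at every pixel
```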
One issue that often comes up is that the output plane can be smaller than the input plane, as shown in Fig. 9.13. In the kinds of architectures typically used for classification of a single object in an image, the greatest reduction in the spatial dimensions of the network comes from using pooling layers with large stride.
The general idea is to assume that large groups of contiguous pixels tend to be associated with the same label. Graphical models can describe the probabilistic relationships between neighboring pixels. Alternatively, the convolutional network can be trained to maximize an approximation of the graphical model training objective (Ning et al., 2005; Thompson et al., 2014).
9.7 Data Types

The data used with a convolutional network usually consists of several channels, each channel being the observation of a different quantity at some point in space or time. See Table 9.1 for examples of data types with different dimensionalities and number of channels.

For an example of convolutional networks applied to video, see Chen et al. (2010).
So far we have discussed only the case where every example in the train and test data has the same spatial dimensions. One advantage to convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix multiplication-based neural networks. This provides a compelling reason to use convolutional networks even when computational cost and overfitting are not significant issues.
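The point about varying spatial extents can be made concrete: a fixed-size kernel applies to inputs of any size, and the output simply scales with the input (a single-channel sketch; names are ours):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # "valid" 2-D convolution without flipping, unit stride
    kr, kc = kernel.shape
    r, c = image.shape
    return np.array([[np.sum(image[i:i + kr, j:j + kc] * kernel)
                      for j in range(c - kc + 1)]
                     for i in range(r - kr + 1)])

kernel = np.ones((3, 3))          # the same nine parameters for every input
# the kernel is applied a different number of times depending on input size
assert conv2d_valid(np.zeros((8, 8)), kernel).shape == (6, 6)
assert conv2d_valid(np.zeros((14, 10)), kernel).shape == (12, 8)
```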
For example, consider a collection of images, where each image has a different width and height. It is unclear how to model such inputs with a weight matrix of fixed size. Convolution is straightforward to apply; the kernel is simply applied a different number of times depending on the size of the input, and the output of the convolution operation scales accordingly. Convolution may be viewed as matrix multiplication; the same convolution kernel induces a different size of doubly block circulant matrix for each size of input. Sometimes the output of the network is allowed to have variable size as well as the input, for example if we want to assign a class label to each pixel of the input. In this case, no further design work is necessary. In other cases, the network must produce some fixed-size output, for example if we want to assign a single class label to the entire image. In this case we must make some additional design steps, like inserting a pooling layer whose pooling regions scale in size proportional to the size of the input, in order to maintain a fixed number of pooled outputs. Some examples of this kind of strategy are shown in Fig. 9.11.

Note that the use of convolution for processing variable sized inputs only makes sense for inputs that have variable size because they contain varying amounts
of observation of the same kind of thing: different lengths of recordings over time, different widths of observations over space, etc. Convolution does not make sense if the input has variable size because it can optionally include different kinds of observations. For example, if we are processing college applications, and our features consist of both grades and standardized test scores, but not every applicant took the standardized test, then it does not make sense to convolve the same weights over both the features corresponding to the grades and the features corresponding to the test scores.
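The strategy mentioned earlier, pooling regions that scale in proportion to the input so that the number of pooled outputs stays fixed, can be sketched in one dimension (the function name is ours):

```python
import numpy as np

def pool_to_fixed_size(features, n_bins=4):
    # split the input into n_bins regions (larger inputs -> larger regions)
    # and max-pool each, so the output size is independent of the input size
    return np.array([chunk.max() for chunk in np.array_split(features, n_bins)])

# inputs of length 12 and 20 both produce exactly 4 pooled outputs
assert pool_to_fixed_size(np.arange(12.0)).shape == (4,)
assert pool_to_fixed_size(np.arange(20.0)).shape == (4,)
```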
9.8 Efficient Convolution Algorithms

Modern convolutional network applications often involve networks containing more than one million units. Powerful implementations exploiting parallel computation resources, as discussed in Sec. 12.1, are essential. However, in many cases it is also possible to speed up convolution by selecting an appropriate convolution algorithm.
Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
When a $d$-dimensional kernel can be expressed as the outer product of $d$ vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient. It is equivalent to compose $d$ one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one $d$-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is $w$ elements wide in each dimension, then naive multidimensional convolution requires $O(w^d)$ runtime and parameter storage space, while separable convolution requires $O(w \times d)$ runtime and parameter storage space. Of course, not every convolution can be represented in this way.

Devising faster ways of performing convolution or approximate convolution without harming the accuracy of the model is an active area of research. Even techniques that improve the efficiency of only forward propagation are useful because in the commercial setting, it is typical to devote more resources to deployment of a network than to its training.
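The Fourier-domain route described above can be checked directly against naive discrete convolution (a one-dimensional sketch; the function name is ours):

```python
import numpy as np

def fft_conv(x, k):
    # convolution theorem: transform both signals (zero-padded to the full
    # linear-convolution output length), multiply point-wise, transform back
    n = len(x) + len(k) - 1
    return np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(k, n)))

rng = np.random.default_rng(3)
x, k = rng.standard_normal(64), rng.standard_normal(9)
# matches the direct (naive) implementation of discrete convolution
assert np.allclose(fft_conv(x, k), np.convolve(x, k))
```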
us the opportunity to take the pretraining strategy one step further than is possible with multilayer perceptrons. Instead of training an entire convolutional layer at a time, we can train a model of a small patch, as Coates et al. (2011) do with k-means. We can then use the parameters from this patch-based model to define the kernels of a convolutional layer. This means that it is possible to use unsupervised learning to train a convolutional network without ever using convolution during the training process. Using this approach, we can train very large models and incur a high computational cost only at inference time (Ranzato et al., 2007b; Jarrett et al., 2009; Kavukcuoglu et al., 2010; Coates et al., 2013). This approach was popular from roughly 2007–2013, when labeled datasets were small and computational power was more limited. Today, most convolutional networks are trained in a purely supervised fashion, using full forward and back-propagation through the entire network on each training iteration.

As with other approaches to unsupervised pretraining, it remains difficult to tease apart the cause of some of the benefits seen with this approach. Unsupervised pretraining may offer some regularization relative to supervised training, or it may simply allow us to train much larger architectures due to the reduced computational cost of the learning rule.
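The patch-based recipe above can be sketched in a few lines. This is a toy illustration, not the exact procedure of Coates et al.: the tiny k-means implementation and all sizes below are our own choices. Random patches are harvested from unlabeled images, clustered, and the centroids are reshaped into convolution kernels.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Plain k-means; returns k centroids of the flattened patch vectors."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each patch to its nearest centroid, then recompute centroids.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
images = rng.standard_normal((10, 32, 32))   # stand-in for an unlabeled image set
p, k = 5, 8                                  # patch width, number of kernels

# Harvest random p x p patches from the unlabeled images.
patches = []
for _ in range(500):
    img = images[rng.integers(len(images))]
    i, j = rng.integers(32 - p, size=2)
    patches.append(img[i:i + p, j:j + p].ravel())
patches = np.array(patches)

# The k centroids, reshaped, define the kernels of a convolutional layer.
kernels = kmeans(patches, k).reshape(k, p, p)
print(kernels.shape)   # (8, 5, 5)
```

Note that no convolution is performed anywhere in this training loop; convolution with the learned kernels is only needed at inference time.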
9.10 The Neuroscientific Basis for Convolutional Networks

Convolutional networks are perhaps the greatest success story of biologically inspired artificial intelligence. Though convolutional networks have been guided by many other fields, some of the key design principles of neural networks were drawn from neuroscience.
The history of convolutional networks begins with neuroscientific experiments long before the relevant computational models were developed. Neurophysiologists David Hubel and Torsten Wiesel collaborated for several years to determine many of the most basic facts about how the mammalian vision system works (Hubel and Wiesel, 1959, 1962, 1968). Their accomplishments were eventually recognized with a Nobel prize. Their findings that have had the greatest influence on contemporary deep learning models were based on recording the activity of individual neurons in cats. They observed how neurons in the cat's brain responded to images projected in precise locations on a screen in front of the cat. Their great discovery was that neurons in the early visual system responded most strongly to very specific patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns.
nicknamed "grandmother cells"—the idea is that a person could have a neuron that activates when seeing an image of their grandmother, regardless of whether she appears in the left or right side of the image, whether the image is a close-up of her face or a zoomed out shot of her entire body, whether she is brightly lit, or in shadow, etc.

These grandmother cells have been shown to actually exist in the human brain, in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers tested whether individual neurons would respond to photos of famous individuals. They found what has come to be called the "Halle Berry neuron": an individual neuron that is activated by the concept of Halle Berry. This neuron fires when a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing the words "Halle Berry." Of course, this has nothing to do with Halle Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer Aniston, etc.
These medial temporal lobe neurons are somewhat more general than modern convolutional networks, which would not automatically generalize to identifying a person or object when reading its name. The closest analog to a convolutional network's last layer of features is a brain area called the inferotemporal cortex (IT). When viewing an object, information flows from the retina, through the LGN, to V1, then onward to V2, then V4, then IT. This happens within the first 100ms of glimpsing an object. If a person is allowed to continue looking at the object for more time, then information will begin to flow backwards as the brain uses top-down feedback to update the activations in the lower level brain areas. However, if we interrupt the person's gaze, and observe only the firing rates that result from the first 100ms of mostly feedforward activation, then IT proves to be very similar to a convolutional network. Convolutional networks can predict IT firing rates, and also perform very similarly to (time limited) humans on object recognition tasks (DiCarlo, 2013).

That being said, there are many differences between convolutional networks and the mammalian vision system. Some of these differences are well known to computational neuroscientists, but outside the scope of this book. Some of these differences are not yet known, because many basic questions about how the mammalian vision system works remain unanswered. As a brief list:

• The human eye is mostly very low resolution, except for a tiny patch called the fovea. The fovea only observes an area about the size of a thumbnail held at arms length. Though we feel as if we can see an entire scene in high resolution, this is an illusion created by the subconscious part of our brain, as it stitches together several glimpses of small areas. Most convolutional networks actually receive large full resolution photographs as input. The human brain makes several eye movements called saccades to glimpse the most visually salient or task-relevant parts of a scene. Incorporating similar attention mechanisms into deep learning models is an active research direction. In the context of deep learning, attention mechanisms have been most successful for natural language processing, as described in Sec. 12.4.5.1. Several visual models with foveation mechanisms have been developed but so far have not become the dominant approach (Larochelle and Hinton, 2010; Denil et al., 2012).

• The human visual system is integrated with many other senses, such as hearing, and factors like our moods and thoughts. Convolutional networks so far are purely visual.

• The human visual system does much more than just recognize objects. It is able to understand entire scenes including many objects and relationships between objects, and processes rich 3-D geometric information needed for our bodies to interface with the world. Convolutional networks have been applied to some of these problems but these applications are in their infancy.

• Even simple brain areas like V1 are heavily impacted by feedback from higher levels. Feedback has been explored extensively in neural network models but has not yet been shown to offer a compelling improvement.

• While feedforward IT firing rates capture much of the same information as convolutional network features, it is not clear how similar the intermediate computations are. The brain probably uses very different activation and pooling functions. An individual neuron's activation probably is not well-characterized by a single linear filter response. A recent model of V1 involves multiple quadratic filters for each neuron (Rust et al., 2005). Indeed our cartoon picture of "simple cells" and "complex cells" might create a non-existent distinction; simple cells and complex cells might both be the same kind of cell but with their "parameters" enabling a continuum of behaviors ranging from what we call "simple" to what we call "complex."

It is also worth mentioning that neuroscience has told us relatively little about how to train convolutional networks. Model structures with parameter sharing across multiple spatial locations date back to early connectionist models of vision (Marr and Poggio, 1976), but these models did not use the modern back-propagation algorithm and gradient descent. For example, the Neocognitron (Fukushima, 1980) incorporated most of the model architecture design elements of the modern convolutional network but relied on a layer-wise unsupervised clustering algorithm.
Lang and Hinton (1988) introduced the use of back-propagation to train time-delay neural networks (TDNNs). To use contemporary terminology, TDNNs are one-dimensional convolutional networks applied to time series. Back-propagation applied to these models was not inspired by any neuroscientific observation and is considered by some to be biologically implausible. Following the success of back-propagation-based training of TDNNs, LeCun et al. (1989) developed the modern convolutional network by applying the same training algorithm to 2-D convolution applied to images.
So far we have described how simple cells are roughly linear and selective for certain features, complex cells are more nonlinear and become invariant to some transformations of these simple cell features, and stacks of layers that alternate between selectivity and invariance can yield grandmother cells for very specific phenomena. We have not yet described precisely what these individual cells detect. In a deep, nonlinear network, it can be difficult to understand the function of individual cells. Simple cells in the first layer are easier to analyze, because their responses are driven by a linear function. In an artificial neural network, we can just display an image of the convolution kernel to see what the corresponding channel of a convolutional layer responds to. In a biological neural network, we do not have access to the weights themselves. Instead, we put an electrode in the neuron itself, display several samples of white noise images in front of the animal's retina, and record how each of these samples causes the neuron to activate. We can then fit a linear model to these responses in order to obtain an approximation of the neuron's weights. This approach is known as reverse correlation (Ringach and Shapley, 2004).
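The reverse-correlation procedure can be simulated in a few lines. This sketch is ours: we build a synthetic "neuron" whose true weights we know, probe it with white-noise stimuli, and recover an approximation of the weights by fitting a linear model (here via least squares).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # number of pixels in the toy "retina"
true_w = rng.standard_normal(d)          # hidden weights of the simulated neuron

# Show many white-noise images and record the (noisy) responses.
X = rng.standard_normal((2000, d))       # each row is one white-noise stimulus
r = X @ true_w + 0.1 * rng.standard_normal(2000)

# Fit a linear model to the stimulus/response pairs.
w_hat, *_ = np.linalg.lstsq(X, r, rcond=None)
# w_hat now closely approximates true_w
```

In a real experiment only the stimuli and firing rates are observable; the point of the fit is that it exposes weights we could never read off the neuron directly.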
Reverse correlation shows us that most V1 cells have weights that are described by Gabor functions. The Gabor function describes the weight at a 2-D point in the image. We can think of an image as being a function of 2-D coordinates, I(x, y). Likewise, we can think of a simple cell as sampling the image at a set of locations, defined by a set of x coordinates X and a set of y coordinates Y, and applying weights that are also a function of the location, w(x, y). From this point of view, the response of a simple cell to an image is given by

s(I) = Σ_{x∈X} Σ_{y∈Y} w(x, y) I(x, y).    (9.15)

Specifically, w(x, y) takes the form of a Gabor function:

w(x, y; α, βx, βy, f, φ, x0, y0, τ) = α exp(−βx x′² − βy y′²) cos(f x′ + φ),    (9.16)

where

x′ = (x − x0) cos(τ) + (y − y0) sin(τ)    (9.17)
and

y′ = −(x − x0) sin(τ) + (y − y0) cos(τ).    (9.18)

Here, α, βx, βy, f, φ, x0, y0, and τ are parameters that control the properties of the Gabor function. Fig. 9.18 shows some examples of Gabor functions with different settings of these parameters.

The parameters x0, y0, and τ define a coordinate system. We translate and rotate x and y to form x′ and y′. Specifically, the simple cell will respond to image features centered at the point (x0, y0), and it will respond to changes in brightness as we move along a line rotated τ radians from the horizontal.

Viewed as a function of x′ and y′, the function w then responds to changes in brightness as we move along the x′ axis. It has two important factors: one is a Gaussian function and the other is a cosine function.
ensures the simple cell will only resp respondond to values near where x 0 and y 0 are b oth
The Gaussian factor
zero, in other words, near the cen α exp β x y can
ter of theβ cell’s
center b e seen
receptiv
receptive e field.as aThegating termfactor
scaling that
ensures
α adjuststhe thesimple cell will only
total magnitude respsimple
of−the ond− tocell’s
valuesresp near whilexβxand
onse,where
response, andy β arey con
btrol
oth
control
zero,
ho
how in other
w quic
quickly
kly its words,
receptivnear
receptive the cen
e field teroff.
falls of the cell’s receptive field. The scaling factor
α adjusts the total magnitude of the simple cell’s resp onse, while β and β control
cos((f x0 + φ ) con
cos
howThequiccosine
kly itsfactor
receptiv 0
e field fallscontrols
trols how the simple cell resp
off. responds
onds to ch changing
anging
brigh
brightness
tness along the x axis. The parameter f con controls
trols the frequency of the cosine
The
and φ con cosine
controls factor cos
trols its phase offset.( f x + φ ) con trols how the simple cell resp onds to changing
brightness along the x axis. The parameter f controls the frequency of the cosine
Altogether, this carto cartoon on view of simple cells means that a simple cell resp responds
onds
and φ controls its phase offset.
to a spspecific
ecific spatial frequency of brightness in a sp specific
ecific direction at a sp specific
ecific
lo Altogether,
location. this carto on view of simple cells means
cation. Simple cells are most excited when the wave of brightness in the image that a simple cell resp onds
to athe
has sp ecific
same phasespatialasfrequency
the weigh
weights. of brightness
ts. This o ccursinwhen a sptheecific
imagedirection
is brigh
brightatt where
a sp ecificthe
lo cation.
weigh
eights Simple
ts are p ositiv cells are most excited
ositivee and dark where the weigh when
eights the w a ve
ts are negativ of
negative. brightness in the
e. Simple cells are most image
has the same phase
inhibited when the wa as
wave the weigh ts. This o ccurs when
ve of brightness is fully out of phase with the the image is brigh
weigh t where
eights—when
ts—when the
w eigh ts are p ositiv e and
the image is dark where the weigh dark where
eights the w eigh
ts are p ositiv ts are negativ e. Simple
ositivee and bright where the weigh cells are
eights most
ts are
inhibited
negativ
negative. e. when the wa ve of brightness is fully out of phase with the w eigh ts—when
the image is dark where the weights are p ositive and bright where the weights are
The carto
cartoon on view of a complex cell is that it computes p the L 2 norm of the
negative.
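Eqs. 9.15 to 9.18 can be turned directly into code. The following NumPy sketch is illustrative (the grid resolution and all parameter values are arbitrary choices of ours, not values from this chapter): it builds a simple cell's Gabor weights and evaluates its response via the sum in Eq. 9.15.

```python
import numpy as np

def gabor(x, y, alpha, bx, by, f, phi, x0, y0, tau):
    """Gabor weight at (x, y); implements Eqs. 9.16-9.18."""
    xp = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)    # Eq. 9.17: x'
    yp = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)   # Eq. 9.18: y'
    return alpha * np.exp(-bx * xp**2 - by * yp**2) * np.cos(f * xp + phi)

# Sample a grid of locations (the sets X and Y) and build the cell's weights.
coords = np.linspace(-1.0, 1.0, 21)
X, Y = np.meshgrid(coords, coords, indexing="ij")
w = gabor(X, Y, alpha=1.0, bx=4.0, by=4.0, f=8.0, phi=0.0, x0=0.0, y0=0.0, tau=0.3)

def simple_cell(image, w):
    """Eq. 9.15: sum over sampled locations of w(x, y) * I(x, y)."""
    return np.sum(w * image)

# The cell is most excited by an image in phase with its own weights:
print(simple_cell(w, w) > simple_cell(-w, w))   # True
```

Flipping the sign of the input inverts bright and dark regions, turning maximal excitation into maximal inhibition, exactly as the cartoon view above describes.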
The cartoon view of a complex cell is that it computes the L² norm of the 2-D vector containing two simple cells' responses: c(I) = √(s0(I)² + s1(I)²). An important special case occurs when s1 has all of the same parameters as s0 except for φ, and φ is set such that s1 is one quarter cycle out of phase with s0. In this case, s0 and s1 form a quadrature pair. A complex cell defined in this way responds when the Gaussian reweighted image I(x, y) exp(−βx x′² − βy y′²) contains a high amplitude sinusoidal wave with frequency f in direction τ near (x0, y0), regardless of the phase offset of this wave. In other words, the complex cell is invariant to small translations of the image in direction τ, or to negating the
Figure 9.18: Gabor functions with a variety of parameter settings. White indicates large positive weight, black indicates large negative weight, and the background gray corresponds to zero weight. (Left) Gabor functions with different values of the parameters that control the coordinate system: x0, y0, and τ. Each Gabor function in this grid is assigned a value of x0 and y0 proportional to its position in its grid, and τ is chosen so that each Gabor filter is sensitive to the direction radiating out from the center of the grid. For the other two plots, x0, y0, and τ are fixed to zero. (Center) Gabor functions with different Gaussian scale parameters βx and βy. Gabor functions are arranged in increasing width (decreasing βx) as we move left to right through the grid, and increasing height (decreasing βy) as we move top to bottom. For the other two plots, the β values are fixed to 1.5× the image width. (Right) Gabor functions with different sinusoid parameters f and φ. As we move top to bottom, f increases, and as we move left to right, φ increases. For the other two plots, φ is fixed to 0 and f is fixed to 5× the image width.
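As a concrete illustration, a Gabor function like those in the figure can be evaluated numerically. The sketch below uses one standard parameterization — a Gaussian envelope (scales βx, βy) times a cosine of frequency f and phase φ, in coordinates translated by (x0, y0) and rotated by τ. The exact functional form and the default parameter values here are illustrative assumptions, not the settings used to render the figure.

```python
import numpy as np

def gabor(size=32, alpha=1.0, beta_x=0.05, beta_y=0.05,
          f=0.5, phi=0.0, x0=0.0, y0=0.0, tau=0.0):
    """Evaluate a 2-D Gabor function on a size x size grid.

    A Gaussian envelope (controlled by beta_x, beta_y) multiplies a
    cosine of frequency f and phase phi, in a coordinate system
    translated by (x0, y0) and rotated by tau.
    """
    coords = np.arange(size) - size / 2.0
    x, y = np.meshgrid(coords, coords)
    # Rotate and translate the coordinate system.
    xp = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)
    yp = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)
    envelope = alpha * np.exp(-beta_x * xp ** 2 - beta_y * yp ** 2)
    return envelope * np.cos(f * xp + phi)

# White pixels correspond to large positive weights, black to large
# negative weights, as in the figure.
weights = gabor(size=32, tau=np.pi / 4)
```

Varying τ rotates the filter's preferred orientation, while shrinking βx or βy widens the envelope, matching the sweeps shown in the three panels.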
image (replacing black with white and vice versa).

Some of the most striking correspondences between neuroscience and machine learning come from visually comparing the features learned by machine learning models with those employed by V1. Olshausen and Field (1996) showed that a simple unsupervised learning algorithm, sparse coding, learns features with receptive fields similar to those of simple cells. Since then, we have found that an extremely wide variety of statistical learning algorithms learn features with Gabor-like functions when applied to natural images. This includes most deep learning algorithms, which learn these features in their first layer. Fig. 9.19 shows some examples. Because so many different learning algorithms learn edge detectors, it is difficult to conclude that any specific learning algorithm is the “right” model of the brain just based on the features that it learns (though it can certainly be a bad sign if an algorithm does not learn some sort of edge detector when applied to natural images). These features are an important part of the statistical structure of natural images and can be recovered by many different approaches to statistical modeling. See Hyvärinen et al. (2009) for a review of the field of natural image statistics.
CHAPTER 9. CONVOLUTIONAL NETWORKS
Chapter 10

Sequence Modeling: Recurrent and Recursive Nets

Recurrent neural networks or RNNs (Rumelhart et al., 1986a) are a family of neural networks for processing sequential data. Much as a convolutional network is a neural network that is specialized for processing a grid of values X such as an image, a recurrent neural network is a neural network that is specialized for processing a sequence of values x^(1), . . . , x^(τ). Just as convolutional networks can readily scale to images with large width and height, and some convolutional networks can process images of variable size, recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization. Most recurrent networks can also process sequences of variable length.

To go from multi-layer networks to recurrent networks, we need to take advantage of one of the early ideas found in machine learning and statistical models of the 1980s: sharing parameters across different parts of a model. Parameter sharing makes it possible to extend and apply the model to examples of different forms (different lengths, here) and generalize across them. If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and across different positions in time. Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence. For example, consider the two sentences “I went to Nepal in 2009” and “In 2009, I went to Nepal.” If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information, whether it appears in the sixth word or the second.
10.1 Unfolding Computational Graphs

A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss. Please refer to Sec. 6.5.1 for a general introduction. In this section we explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.

For example, consider the classical form of a dynamical system:

s^(t) = f(s^(t−1); θ),   (10.1)
Figure 10.1: The classical dynamical system described by Eq. 10.1, illustrated as an unfolded computational graph. Each node represents the state at some time t and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps.

As another example, let us consider a dynamical system driven by an external signal x^(t),

s^(t) = f(s^(t−1), x^(t); θ),   (10.4)
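The recurrences in Eq. 10.1 and Eq. 10.4 can be unrolled mechanically. The sketch below is a minimal illustration: `f` and `theta` are hypothetical stand-ins for a transition function and its parameters. The same `f` and the same `theta` are reused at every step — exactly the parameter sharing that unfolding makes explicit.

```python
def unroll(f, s0, xs, theta):
    """Iterate s_t = f(s_{t-1}, x_t; theta) and return every state.

    The same transition function f and the same parameters theta
    are applied at every time step.
    """
    states = [s0]
    for x in xs:
        states.append(f(states[-1], x, theta))
    return states

# A hypothetical linear transition driven by an external signal x,
# in the form of Eq. 10.4.
f = lambda s, x, theta: theta * s + x
states = unroll(f, s0=0.0, xs=[1.0, 1.0, 1.0], theta=0.5)
# states holds s^(0) through s^(3).
```

Because `unroll` only ever calls `f`, the learned object is the single transition function, not a separate map per time step.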
physical implementation of the model, such as a biological neural network. In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state, as in the left of Fig. 10.2. Throughout this chapter, we use a black square in a circuit diagram to indicate that an interaction takes place with a delay of 1 time step, from the state at time t to the state at time t + 1. The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time. Each variable for each time step is drawn as a separate node of the computational graph, as in the right of Fig. 10.2. What we call unfolding is the operation that maps a circuit as in the left side of the figure to a computational graph with repeated pieces as in the right side. The unfolded graph now has a size that depends on the sequence length.

We can represent the unfolded recurrence after t steps with a function g^(t):

h^(t) = g^(t)(x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1))   (10.6)
      = f(h^(t−1), x^(t); θ)   (10.7)

The function g^(t) takes the whole past sequence (x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^(t) into repeated application of a function f. The unfolding process thus introduces two major advantages:

1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of transition from one state to another state, rather than specified in terms of a variable-length history of states.

2. It is possible to use the same transition function f with the same parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^(t) for all possible time steps. Learning a single, shared model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform. The unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.
10.2 Recurrent Neural Networks

Armed with the graph unrolling and parameter sharing ideas of Sec. 10.1, we can design a wide variety of recurrent neural networks.
Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input to hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Eq. 10.8 defines forward propagation in this model. (Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded computational graph, where each node is now associated with one particular time instance.

Some examples of important design patterns for recurrent neural networks include the following:

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in Fig. 10.3.

• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in Fig. 10.4.

• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in Fig. 10.5.

Fig. 10.3 is a reasonably representative example that we return to throughout most of the chapter.

The recurrent neural network of Fig. 10.3 and Eq. 10.8 is universal in the sense that any function computable by a Turing machine can be computed by such a recurrent network of a finite size. The output can be read from the RNN after a number of time steps that is asymptotically linear in the number of time steps used by the Turing machine and asymptotically linear in the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results regard exact implementation of the function, not approximations. The RNN, when used as a Turing machine, takes a binary sequence as input and its outputs must be discretized to provide a binary output. It is possible to compute all functions in this setting using a single specific RNN of finite size (Siegelmann and Sontag (1995) use 886 units). The “input” of the Turing machine is a specification of the function to be computed, so the same network that simulates this Turing machine is sufficient for all problems. The theoretical RNN used for the proof can simulate an unbounded stack by representing its activations and weights with rational numbers of unbounded precision.

We now develop the forward propagation equations for the RNN depicted in Fig. 10.3. The figure does not specify the choice of activation function for the hidden units. Here we assume the hyperbolic tangent activation function. Also, the figure does not specify exactly what form the output and loss function take. Here we assume that the output is discrete, as if the RNN is used to predict words or characters. A natural way to represent discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible value of the discrete variable. We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h^(0). Then, for each time step from t = 1 to t = τ, we apply the following update equations:

a^(t) = b + W h^(t−1) + U x^(t)   (10.8)
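This forward pass can be sketched directly. The code below implements Eq. 10.8 followed by the tanh hidden activation and softmax output described in the text; the specific update equations after Eq. 10.8, and the matrix shapes and random initialization used in the example, are illustrative assumptions for this sketch.

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Forward propagation for an RNN like that of Fig. 10.3.

    Assumes tanh hidden units and softmax outputs, as in the text.
    x_seq is a sequence of input vectors; h0 is the initial state.
    """
    h, outputs = h0, []
    for x in x_seq:
        a = b + W @ h + U @ x         # Eq. 10.8
        h = np.tanh(a)                # hidden activation
        o = c + V @ h                 # unnormalized log probabilities
        y_hat = np.exp(o - o.max())   # softmax, numerically stable
        y_hat /= y_hat.sum()
        outputs.append(y_hat)
    return h, outputs

# Hypothetical sizes: 3 hidden units, 4-d inputs, 2 output classes.
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4))
W = rng.normal(size=(3, 3))
V = rng.normal(size=(2, 3))
b, c = np.zeros(3), np.zeros(2)
x_seq = [rng.normal(size=4) for _ in range(5)]
h, outputs = rnn_forward(x_seq, np.zeros(3), U, W, V, b, c)
```

Note that U, W, V, b, and c are shared across all five time steps — the loop body is the single transition function applied repeatedly.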
Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x^(t), the hidden layer activations are h^(t), the outputs are o^(t), the targets are y^(t) and the loss is L^(t). (Left) Circuit diagram. (Right) Unfolded computational graph. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by Fig. 10.3. The RNN in Fig. 10.3 can choose to put any information it wants about the past into its hidden representation h and transmit h to the future. The RNN in this figure is trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward. The previous h is connected to the present only indirectly, via the predictions it was used to produce. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others, allowing greater parallelization during training, as described in Sec. 10.2.1.
The network with recurrent connections only from the output at one time step to the hidden units at the next time step (shown in Fig. 10.4) is strictly less powerful because it lacks hidden-to-hidden recurrent connections. For example, it cannot simulate a universal Turing machine. Because this network lacks hidden-to-hidden recurrence, it requires that the output units capture all of the information about the past that the network will use to predict the future. Because the output units are explicitly trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input, unless the user knows how to describe the full state of the system and provides it as part of the training set targets. The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled. Training can thus be parallelized, with the gradient for each step t computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.

log p(y^(1), y^(2) | x^(1), x^(2))   (10.15)
Figure 10.6: Illustration of teacher forcing. Teacher forcing is a training technique that is applicable to RNNs that have connections from their output to their hidden states at the next time step. (Left) At train time, we feed the correct output y^(t) drawn from the train set as input to h^(t+1). (Right) When the model is deployed, the true output is generally not known. In this case, we approximate the correct output y^(t) with the model's output o^(t), and feed the output back into the model.
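The two modes in Fig. 10.6 differ only in which value is fed back into the model. The following is a deliberately simplified, hypothetical sketch of an output-to-hidden network (the external input x is omitted for brevity); only the final line of the loop distinguishes teacher-forced training from open-loop deployment.

```python
import numpy as np

def run(y_targets, W, b, V, teacher_forcing):
    """Sketch of an output-to-hidden RNN, in the spirit of Fig. 10.6.

    With teacher forcing, the correct target y^(t) is fed back at the
    next step; in open-loop (deployed) mode, the model's own output
    o^(t) is fed back instead.
    """
    outputs = []
    fed = np.zeros(V.shape[0])  # nothing to feed back at the first step
    for y in y_targets:
        h = np.tanh(W @ fed + b)          # hidden state from fed-back value
        o = V @ h                         # model output o^(t)
        outputs.append(o)
        fed = y if teacher_forcing else o  # the only difference between modes
    return outputs
```

Running the same network in both modes makes the train/test mismatch concrete: the two trajectories agree at the first step, then diverge as soon as the model's own output differs from the target it would have been fed.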
= log p y (2) | y (1), x (1), x (2) + log p y (1) | x (1), x(2) (10.16)
In this example,= log p we y see ythat, xat time, x t =+22,log , thep moy delxis trained
model ,x to maximize (10.16)the
conditional probability |of y giv (2) givenen the x sequence | so far and the previous y
In this example, we see that at
value from the training set. Maximum likelihoo time t = 2,
likelihoodthe mo del
d thus sp is trained
specifies
ecifies thattoduring
maximize
training,the
conditional
rather thanprobability
feeding theof model’s
y given own output the xbac sequence
back k into so far and
itself, thesetheconnections
previous y
vshould
alue from the with
be fed training
the set.
targetMaximum
values sp likelihoo
ecifyingdwhat
specifying thus sp
theecifies
correctthat duringshould
output training,be.
rather than feeding the
This is illustrated in Fig. 10.6. model’s own output bac k into itself, these connections
should be fed with the target values specifying what the correct output should be.
We originally motivated teacher forcing as allowing us to avoid back-propagation
through time in models that lack hidden-to-hidden connections. Teacher forcing
may still be applied to models that have hidden-to-hidden connections so long as
they have connections from the output at one time step to values computed in the
next time step. However, as soon as the hidden units become a function of earlier
time steps, the BPTT algorithm is necessary. Some models may thus be trained
with both teacher forcing and BPTT.

The disadvantage of strict teacher forcing arises if the network is going to be
later used in an open-loop mode, with the network outputs (or samples from the
output distribution) fed back as input. In this case, the kind of inputs that the
network sees during training could be quite different from the kind of inputs that
it will see at test time. One way to mitigate this problem is to train with both
teacher-forced inputs and free-running inputs, for example by predicting the
correct target a number of steps in the future through the unfolded recurrent
output-to-input paths. In this way, the network can learn to take into account
input conditions (such as those it generates itself in the free-running mode) not
seen during training and how to map the state back towards one that will make
the network generate proper outputs after a few steps. Another approach (Bengio
et al., 2015b) to mitigate the gap between the inputs seen at train time and the
inputs seen at test time randomly chooses to use generated values or actual data
values as input. This approach exploits a curriculum learning strategy to gradually
use more of the generated values as input.
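The random mixing step of that strategy can be sketched as follows (a minimal illustration, not the full procedure of Bengio et al., 2015b; the probability `p_generated` is the hypothetical curriculum knob that would be increased over the course of training):

```python
import numpy as np

rng = np.random.default_rng(1)

def choose_inputs(targets, generated, p_generated):
    """Per time step, feed back either the ground-truth value or the
    model-generated value, picking the generated one with probability
    p_generated. Increasing p_generated during training gradually moves
    the model from teacher forcing toward free-running conditions."""
    use_gen = rng.random(len(targets)) < p_generated
    return np.where(use_gen, generated, targets)

targets = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical ground truth
generated = np.array([1.1, 1.9, 3.2, 3.8, 5.3])   # hypothetical model outputs

early = choose_inputs(targets, generated, p_generated=0.0)  # pure teacher forcing
late = choose_inputs(targets, generated, p_generated=1.0)   # pure free-running
```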
To gain some intuition for how the BPTT algorithm behaves, we provide an
example of how to compute gradients by BPTT for the RNN equations above
(Eq. 10.8 and Eq. 10.12). The nodes of our computational graph include the
parameters U, V, W, b and c as well as the sequence of nodes indexed by t for
x^{(t)}, h^{(t)}, o^{(t)} and L^{(t)}. For each node N we need to compute the gradient ∇_N L
recursively, based on the gradient computed at nodes that follow it in the graph.
We start the recursion with the nodes immediately preceding the final loss

    ∂L/∂L^{(t)} = 1.    (10.17)

In this derivation we assume that the outputs o^{(t)} are used as the argument to the
softmax function to obtain the vector ŷ of probabilities over the output. We also
assume that the loss is the negative log-likelihood of the true target y^{(t)} given the
input so far. The gradient ∇_{o^{(t)}} L on the outputs at time step t, for all i, t, is as
follows:

    (∇_{o^{(t)}} L)_i = ∂L/∂o_i^{(t)} = (∂L/∂L^{(t)}) (∂L^{(t)}/∂o_i^{(t)}) = ŷ_i^{(t)} − 1_{i,y^{(t)}}.    (10.18)

We work our way backwards, starting from the end of the sequence. At the final
time step τ, h^{(τ)} only has o^{(τ)} as a descendent, so its gradient is simple:

    ∇_{h^{(τ)}} L = (∂o^{(τ)}/∂h^{(τ)})^⊤ ∇_{o^{(τ)}} L = V^⊤ ∇_{o^{(τ)}} L.    (10.19)

We can then iterate backwards in time to back-propagate gradients through time,
from t = τ − 1 down to t = 1, noting that h^{(t)} (for t < τ) has as descendents both
o^{(t)} and h^{(t+1)}. Its gradient is thus given by

    ∇_{h^{(t)}} L = (∂h^{(t+1)}/∂h^{(t)})^⊤ (∇_{h^{(t+1)}} L) + (∂o^{(t)}/∂h^{(t)})^⊤ (∇_{o^{(t)}} L)    (10.20)
                = W^⊤ diag(1 − (h^{(t+1)})^2) (∇_{h^{(t+1)}} L) + V^⊤ (∇_{o^{(t)}} L),    (10.21)

where diag(1 − (h^{(t+1)})^2) indicates the diagonal matrix containing the elements
1 − (h_i^{(t+1)})^2. This is the Jacobian of the hyperbolic tangent associated with the
hidden unit i at time t + 1.

Once the gradients on the internal nodes of the computational graph are ob-
tained, we can obtain the gradients on the parameter nodes, which have descendents
at all the time steps:

    ∇_c L = Σ_t (∂o^{(t)}/∂c)^⊤ ∇_{o^{(t)}} L = Σ_t ∇_{o^{(t)}} L
    ∇_b L = Σ_t (∂h^{(t)}/∂b)^⊤ ∇_{h^{(t)}} L = Σ_t diag(1 − (h^{(t)})^2) ∇_{h^{(t)}} L
    ∇_V L = Σ_t (∇_{o^{(t)}} L) h^{(t)⊤}
    ∇_W L = Σ_t diag(1 − (h^{(t)})^2) (∇_{h^{(t)}} L) h^{(t−1)⊤}
    ∇_U L = Σ_t diag(1 − (h^{(t)})^2) (∇_{h^{(t)}} L) x^{(t)⊤}
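These recursions can be checked numerically. The sketch below (hypothetical toy dimensions, randomly initialized parameters) implements the forward pass of the RNN of Eq. 10.8 and Eq. 10.12, runs the backward recursion of Eq. 10.18 through Eq. 10.21, accumulates ∇_W L, and compares one entry against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 2, 3, 4, 5
U = rng.normal(0, 0.4, (n_hid, n_in))
W = rng.normal(0, 0.4, (n_hid, n_hid))
V = rng.normal(0, 0.4, (n_out, n_hid))
b = np.zeros(n_hid)
c = np.zeros(n_out)
xs = rng.normal(size=(T, n_in))
ys = rng.integers(0, n_out, size=T)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(W):
    """h(t) = tanh(b + W h(t-1) + U x(t)); o(t) = c + V h(t);
    L = sum of negative log-likelihoods of the targets."""
    h = np.zeros(n_hid)
    hs, yhats, L = [], [], 0.0
    for t in range(T):
        h = np.tanh(b + W @ h + U @ xs[t])
        yhat = softmax(c + V @ h)
        hs.append(h)
        yhats.append(yhat)
        L -= np.log(yhat[ys[t]])
    return hs, yhats, L

hs, yhats, L = forward(W)

# Backward pass, following Eq. 10.18-10.21.
do = [yhats[t] - np.eye(n_out)[ys[t]] for t in range(T)]  # grad on o(t)
dh = [None] * T
dh[T - 1] = V.T @ do[T - 1]                               # Eq. 10.19
for t in range(T - 2, -1, -1):                            # Eq. 10.21
    dh[t] = W.T @ (np.diag(1 - hs[t + 1] ** 2) @ dh[t + 1]) + V.T @ do[t]

dW = np.zeros_like(W)                                     # grad on W
for t in range(T):
    h_prev = hs[t - 1] if t > 0 else np.zeros(n_hid)
    dW += np.outer(np.diag(1 - hs[t] ** 2) @ dh[t], h_prev)

# Finite-difference check of one entry of the W gradient.
eps = 1e-5
Wp = W.copy(); Wp[0, 1] += eps
Wm = W.copy(); Wm[0, 1] -= eps
num = (forward(Wp)[2] - forward(Wm)[2]) / (2 * eps)
```

The analytic entry `dW[0, 1]` and the numerical estimate `num` should agree to several decimal places, confirming the recursion.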
We do not need to compute the gradient with respect to x^{(t)} for training because
it does not have any parameters as ancestors in the computational graph defining
the loss.

We are abusing notation somewhat in the above equations. We correctly use
∇_{h^{(t)}} L to indicate the full influence of h^{(t)} through all paths from h^{(t)} to L. This
is in contrast to our usage of ∂h^{(t)}/∂W or ∂h^{(t)}/∂U, which we use here in an unconventional
manner. By ∂h^{(t)}/∂W we refer to the effect of W on h^{(t)} only via the use of W at time
step t. This is not standard calculus notation, because the standard definition of
the Jacobian would actually include the complete influence of W on h^{(t)} via its
use in all of the preceding time steps to produce h^{(t−1)}. What we refer to here is
in fact the method of Sec. 6.5.6, that computes the contribution of a single
edge in the computational graph to the gradient.
In the example recurrent network we have developed so far, the losses L^{(t)} were
cross-entropies between training targets y^{(t)} and outputs o^{(t)}. As with a feedforward
network, it is in principle possible to use almost any loss with a recurrent network.
The loss should be chosen based on the task. As with a feedforward network, we
usually wish to interpret the output of the RNN as a probability distribution, and
we usually use the cross-entropy associated with that distribution to define the loss.
Mean squared error is the cross-entropy loss associated with an output distribution
that is a unit Gaussian, for example, just as with a feedforward network.

When we use a predictive log-likelihood training objective, such as Eq. 10.12, we
train the RNN to estimate the conditional distribution of the next sequence element
y^{(t)} given the past inputs. This may mean that we maximize the log-likelihood

    log p(y^{(t)} | x^{(1)}, . . . , x^{(t)}),    (10.22)
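As a minimal numerical illustration of this objective (the per-step conditional distributions below are hypothetical softmax outputs, not produced by an actual RNN), the sequence log-likelihood is simply the sum of the per-step conditional log-probabilities:

```python
import numpy as np

# Hypothetical per-step conditional distributions p(y(t) | x(1), ..., x(t)),
# one row per time step, one column per symbol.
yhat = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
targets = [0, 1, 2]  # observed y(1), y(2), y(3)

# Log-likelihood of the observed sequence: sum of per-step log-probabilities.
log_lik = sum(np.log(yhat[t, targets[t]]) for t in range(3))
```

Minimizing the summed per-step cross-entropy is the same as maximizing `log_lik`.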
or, if the model includes connections from the output at one time step to the next
time step,

    log p(y^{(t)} | x^{(1)}, . . . , x^{(t)}, y^{(1)}, . . . , y^{(t−1)}).    (10.23)

Decomposing the joint probability over the sequence of y values as a series of
one-step probabilistic predictions is one way to capture the full joint distribution
across the whole sequence. When we do not feed past y values as inputs that
condition the next step prediction, the directed graphical model contains no edges
from any y^{(i)} in the past to the current y^{(t)}. In this case, the outputs y are
conditionally independent given the sequence of x values. When we do feed the
actual y values (not their prediction, but the actual observed or generated values)
back into the network, the directed graphical model contains edges from all y^{(i)}
values in the past to the current y^{(t)} value.
Figure 10.8: Introducing the state variable in the graphical model of the RNN, even
though it is a deterministic function of its inputs, helps to see how we can obtain a very
efficient parametrization, based on Eq. 10.5. Every stage in the sequence (for h^{(t)} and
y^{(t)}) involves the same structure (the same number of inputs for each node) and can share
the same parameters with the other stages.

The edges in a graphical model indicate which variables depend directly on other
variables. Many graphical models aim to achieve statistical and computational
efficiency by omitting edges that do not correspond to strong interactions. For
example, it is common to make the Markov assumption that the graphical model
should only contain edges from {y^{(t−k)}, . . . , y^{(t−1)}} to y^{(t)}, rather than containing
edges from the entire past history. However, in some cases, we believe that all past
inputs should have an influence on the next element of the sequence. RNNs are
useful when we believe that the distribution over y^{(t)} may depend on a value of y^{(i)}
from the distant past in a way that is not captured by the effect of y^{(i)} on y^{(t−1)}.

One way to interpret an RNN as a graphical model is to view the RNN as
defining a graphical model whose structure is the complete graph, able to represent
direct dependencies between any pair of y values. The graphical model over the
y values with the complete graph structure is shown in Fig. 10.7. The complete
graph interpretation of the RNN is based on ignoring the hidden units h^{(t)} by
marginalizing them out of the model.

It is more interesting to consider the graphical model structure of RNNs that
results from regarding the hidden units h^{(t)} as random variables. (The conditional
distribution over these variables given their parents is deterministic. This is
perfectly legitimate, though it is somewhat rare to design a graphical model with such
deterministic hidden units.) Including the
hidden units in the graphical model reveals that the RNN provides a very efficient
parametrization of the joint distribution over the observations. Suppose that we
represented an arbitrary joint distribution over discrete values with a tabular
representation—an array containing a separate entry for each possible assignment
of values, with the value of that entry giving the probability of that assignment
occurring. If y can take on k different values, the tabular representation would
have O(k^τ) parameters. By comparison, due to parameter sharing, the number
of parameters in the RNN is O(1) as a function of sequence length. The number
of parameters in the RNN may be adjusted to control model capacity but is not
forced to scale with sequence length. Eq. 10.5 shows that the RNN parametrizes
long-term relationships between variables efficiently, using recurrent applications
of the same function f and same parameters θ at each time step. Fig. 10.8
illustrates the graphical model interpretation. Incorporating the h^{(t)} nodes in
the graphical model decouples the past and the future, acting as an intermediate
quantity between them. A variable y^{(i)} in the distant past may influence a variable
y^{(t)} via its effect on h. The structure of this graph shows that the model can be
efficiently parametrized by using the same conditional probability distributions at
each time step, and that when the variables are all observed, the probability of the
joint assignment of all variables can be evaluated efficiently.

Even with the efficient parametrization of the graphical model, some operations
remain computationally challenging. For example, it is difficult to predict missing
values in the middle of the sequence.

The price recurrent networks pay for their reduced number of parameters is
that optimizing the parameters may be difficult.

The parameter sharing used in recurrent networks relies on the assumption
that the same parameters can be used for different time steps. Equivalently, the
assumption is that the conditional probability distribution over the variables at
time t + 1 given the variables at time t is stationary, meaning that the relationship
between the previous time step and the next time step does not depend on t. In
principle, it would be possible to use t as an extra input at each time step and let
the learner discover any time-dependence while sharing as much as it can between
different time steps. This would already be much better than using a different
conditional probability distribution for each t, but the network would then have to
extrapolate when faced with new values of t.

To complete our view of an RNN as a graphical model, we must describe how
to draw samples from the model. The main operation that we need to perform is
simply to sample from the conditional distribution at each time step. However,
there is one additional complication. The RNN must have some mechanism for
determining the length of the sequence. This can be achieved in various ways.

In the case when the output is a symbol taken from a vocabulary, one can
add a special symbol corresponding to the end of a sequence (Schmidhuber, 2012).
When that symbol is generated, the sampling process stops. In the training set,
we insert this symbol as an extra member of the sequence, immediately after x^{(τ)}
in each training example.

Another option is to introduce an extra Bernoulli output to the model that
represents the decision to either continue generation or halt generation at each
time step. This approach is more general than the approach of adding an extra
symbol to the vocabulary, because it may be applied to any RNN, rather than
only RNNs that output a sequence of symbols. For example, it may be applied to
an RNN that emits a sequence of real numbers. The new output unit is usually a
sigmoid unit trained with the cross-entropy loss. In this approach the sigmoid is
trained to maximize the log-probability of the correct prediction as to whether the
sequence ends or continues at each time step.

Another way to determine the sequence length τ is to add an extra output to
the model that predicts the integer τ itself. The model can sample a value of τ
and then sample τ steps worth of data. This approach requires adding an extra
input to the recurrent update at each time step so that the recurrent update is
aware of whether it is near the end of the generated sequence. This extra input
can either consist of the value of τ or can consist of τ − t, the number of remaining
time steps. Without this extra input, the RNN might generate sequences that
end abruptly, such as a sentence that ends before it is complete. This approach is
based on the decomposition

    P(x^{(1)}, . . . , x^{(τ)}) = P(τ) P(x^{(1)}, . . . , x^{(τ)} | τ).    (10.27)

The strategy of predicting τ directly is used for example by Goodfellow et al.
(2014d).
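The first of these strategies, stopping when a special end-of-sequence symbol is sampled, can be sketched as follows. The "model" here is a hypothetical stand-in conditional distribution rather than an actual RNN:

```python
import numpy as np

rng = np.random.default_rng(3)
EOS = 0  # hypothetical extra "end of sequence" symbol added to the vocabulary

def sample_sequence(step, max_len=50):
    """Draw a sequence by sampling from the conditional distribution at
    each time step, stopping when the special EOS symbol is generated."""
    seq, prev = [], None
    for _ in range(max_len):
        probs = step(prev)                  # p(y(t) | history)
        y = rng.choice(len(probs), p=probs)
        if y == EOS:
            break                           # end-of-sequence symbol halts sampling
        seq.append(y)
        prev = y
    return seq

# A stand-in conditional distribution: after emitting symbol 3 the
# sequence becomes very likely to end.
def toy_step(prev):
    if prev == 3:
        return np.array([0.9, 0.05, 0.03, 0.02])
    return np.array([0.05, 0.35, 0.3, 0.3])

seq = sample_sequence(toy_step)
```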
10.3
All of theBidirectional
recurren
recurrentt netw orks RNNs
networks we hav
havee considered up to now ha have
ve a “causal” struc-
ture, meaning that the state at time t only captures information from the past,
All
x(1)of, . .the
. , xrecurren
(t−1)
, and t netw
the orks
presen
present wte input
have considered
x(t). Some up to now
of the mo
modelshave
dels wea ha“causal”
have struc-
ve discussed
ture, meaning that the state at time t only
also allow information from past y values to affect the current state when the captures information from the past,y
x , . .
values are av. , x ,
available. and
ailable. the presen t input x . Some of the mo dels we ha ve discussed
also allow information from past y values to affect the current state when the y
valuesHo
Howev wev
wever,
are er, in many applications we wan
available. wantt to output a prediction of y (t) whic which h
ma
may y dep
depend end on . For example, in sp speech
eech recognition,
Ho wev er, in many applications
the correct interpretation of the current sound as a phoneme mawe wan t to output a prediction
may y dep y whic
of end
depend on the h
may dep
next few end on
phonemes because of co-articulation .and For pexample,
oten tiallyinma
otentially spyeech
may evenrecognition,
dep
depend
end on
the correct interpretation
the next few words because of the linguistic depof the current sound as a
dependencies phoneme
endencies betw etweenma y dep end
een nearby words: on the if
next few
there are tw phonemes b ecause of
twoo interpretations of the curren co-articulation and p oten tially ma y
currentt word that are both acoustically plausible,even dep end on
the
we ma next
may y havfew w
havee to loords b
look ecause of the linguistic
ok far into the future (and dependencies
the past)betw een nearby words:
to disambiguate them. if
thereisare
This also twotrueinterpretations
of handwriting of the current word
recognition andthat
many areother
both sequence-to-sequence
acoustically plausible,
w e may hav
learning tasks,e todescrib
look ed
described far ininto
thethe nextfuture
section. (and the past) to disambiguate them.
This is also true of handwriting recognition and many other sequence-to-sequence
Bidirectional recurren recurrentt neural netw networks
orks (or bidirectional RNNs) were inv invented
ented
learning tasks, described in the next section.
to address that need (Sch Schuster
uster and Paliw
Paliwal al , 1997 ). They ha
haveve b een extremely suc-
Bidirectional
cessful (Gra Graves recurren t neural netw orks (or bidirectional
ves, 2012) in applications where that need arises, such as handwriting RNNs) w ere inv ented
to address
recognition (Grav that need
Graves es et(Schal.,uster
al. , 2008and ; Gra Paliw
Graves
ves andal, 1997 Sc ). They ha, ve
Schmidhuber
hmidhuber been
2009 ), sp extremely
speech
eech recogni- suc-
cessful
tion (Gra (Gra
Grav vesvesand , 2012 ) in applications
Schmidh
Schmidhub ub
uber er, 2005; Gra where
Grav ves et that
al.,,need
al. 2013arises, such as handwriting
) and bioinformatics (Baldi
recognition
et al.
al.,, 1999). ( Grav es et al. , 2008 ; Gra ves and Sc hmidhuber , 2009 ), sp eech recogni-
tion (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi
As the name suggests, bidirectional RNNs combine an RNN that mo mov ves forw
forward ard
et al., 1999).
through time beginning from the start of the sequence with another RNN that
mo
movesAs bac
ves the name suggests,
backward
kward through time bidirectional
beginning RNNs fromcombine
the endan of RNN that moves
the sequence. forw
Fig. 10.11ard
through time
illustrates the bteginning from the start
ypical bidirectional RNN, of withthe sequence
h (t) standingwithfor another
the stateRNN of that
the
moves bacthat
sub-RNN kward mov through
moves es forward time through
beginning timefrom andtheg (end
t) of the sequence.
standing for the state Fig.of10.11
the
illustrates
sub-RNN that mov the t ypical
moves es bacbidirectional RNN, with h standing
kward through time. This allows the output units o to
backward for the state of( t the
)
Compared to a convolutional network, RNNs applied to images are typically more expensive but allow for long-range lateral interactions between features in the same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the forward propagation equations for such RNNs may be written in a form that shows they use a convolution that computes the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the lateral interactions.
10.4 Encoder-Decoder Sequence-to-Sequence Architectures
We have seen in Fig. 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in Fig. 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in Fig. 10.3, Fig. 10.4, Fig. 10.10 and Fig. 10.11 how an RNN can map an input sequence to an output sequence of the same length.

Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related).

We often call the input to the RNN the "context." We want to produce a representation of this context, C. The context C might be a vector or sequence of vectors that summarize the input sequence X = (x^(1), . . . , x^(n_x)).
The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence was first proposed by Cho et al. (2014a) and shortly after by Sutskever et al. (2014), who independently developed that architecture and were the first to obtain state-of-the-art translation using this approach. The former system is based on scoring proposals generated by another machine translation system, while the latter uses a standalone recurrent network to generate the translations. These authors respectively called this architecture, illustrated in Fig. 10.12, the encoder-decoder or sequence-to-sequence architecture. The idea is very simple: (1) an encoder or reader or input RNN processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state. (2) a decoder or writer or output RNN is conditioned on that fixed-length vector (just like in Fig. 10.9) to generate the output sequence Y = (y^(1), . . . , y^(n_y)). The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths n_x and n_y can vary from each other, while previous architectures constrained n_x = n_y = τ. In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of log P(y^(1), . . . , y^(n_y) | x^(1), . . . , x^(n_x)) over all the pairs of x and y sequences in the training set.
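The two-stage pipeline can be sketched as follows. Using C to initialize the decoder state, feeding the previous output back in, and the particular tanh updates are all illustrative choices of this example; the chapter only requires that the decoder be conditioned on the fixed-length C:

```python
import numpy as np

def encode(x_seq, We, Ue, n_h):
    # encoder (reader) RNN: its final hidden state serves as the context C
    h = np.zeros(n_h)
    for x in x_seq:
        h = np.tanh(We @ h + Ue @ x)
    return h  # C

def decode(C, Wd, Vd, y0, n_steps):
    # decoder (writer) RNN conditioned on the fixed-length context C;
    # here C initializes the decoder state (one common choice)
    h, y = C.copy(), y0
    outputs = []
    for _ in range(n_steps):      # n_y need not equal n_x
        h = np.tanh(Wd @ h + y)   # assumes output dim == n_h, for brevity
        y = Vd @ h
        outputs.append(y)
    return outputs

rng = np.random.default_rng(0)
n_x, n_h = 3, 4
x_seq = [rng.standard_normal(n_x) for _ in range(7)]   # input of length 7
We = rng.standard_normal((n_h, n_h)) * 0.1
Ue = rng.standard_normal((n_h, n_x)) * 0.1
Wd = rng.standard_normal((n_h, n_h)) * 0.1
Vd = rng.standard_normal((n_h, n_h)) * 0.1
C = encode(x_seq, We, Ue, n_h)
y_seq = decode(C, Wd, Vd, np.zeros(n_h), n_steps=4)    # output of length 4
```

The point of the sketch is the interface: a length-7 input becomes a single vector C, from which a length-4 output is generated.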
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
10.5 Deep Recurrent Networks

The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:

1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

With the RNN architecture of Fig. 10.3, each of these three blocks is associated with a single weight matrix. In other words, when the network is unfolded, each of these corresponds to a shallow transformation. By a shallow transformation, we mean a transformation that would be represented by a single layer within a deep MLP. Typically this is a transformation represented by a learned affine transformation followed by a fixed nonlinearity.
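The three blocks can be sketched as three weight matrices in a minimal step function; the shapes and the choice of tanh are illustrative:

```python
import numpy as np

# The three parameter blocks of an ordinary RNN (in the style of Fig. 10.3):
# U: input -> hidden, W: previous hidden -> hidden, V: hidden -> output.
# Each block is a single learned affine map, i.e. a shallow transformation.
def rnn_step(x, h_prev, U, W, V, b, c):
    h = np.tanh(U @ x + W @ h_prev + b)  # blocks 1 and 2, then the fixed nonlinearity
    o = V @ h + c                        # block 3: hidden -> output
    return h, o

rng = np.random.default_rng(0)
n_x, n_h, n_o = 3, 4, 2
h, o = rnn_step(
    x=rng.standard_normal(n_x),
    h_prev=np.zeros(n_h),
    U=rng.standard_normal((n_h, n_x)),
    W=rng.standard_normal((n_h, n_h)),
    V=rng.standard_normal((n_o, n_h)),
    b=np.zeros(n_h),
    c=np.zeros(n_o),
)
```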
Would it be advantageous to introduce depth in each of these operations? Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests so. The experimental evidence is in agreement with the idea that we need enough
depth in order to perform the required mappings. See also Schmidhuber (1992), El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs.
Graves et al. (2013) were the first to show a significant benefit of decomposing the state of an RNN into multiple layers as in Fig. 10.13 (left). We can think of the lower layers in the hierarchy depicted in Fig. 10.13a as playing a role in transforming the raw input into a representation that is more appropriate, at the higher levels of the hidden state.

Pascanu et al. (2014a) go a step further and propose to have a separate MLP (possibly deep) for each of the three blocks enumerated above, as illustrated in Fig. 10.13b. Considerations of representational capacity suggest allocating enough capacity in each of these three steps, but doing so by adding depth may hurt learning by making optimization difficult. In general, it is easier to optimize shallower architectures, and adding the extra depth of Fig. 10.13b makes the shortest path from a variable in time step t to a variable in time step t + 1 become longer. For example, if an MLP with a single hidden layer is used for the state-to-state transition, we have doubled the length of the shortest path between variables in any two different time steps, compared with the ordinary RNN of Fig. 10.3. However, as argued by Pascanu et al. (2014a), this can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated in Fig. 10.13c.
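A minimal sketch of such a deepened transition, assuming a one-hidden-layer MLP for the state-to-state block and an additive skip term (both illustrative choices):

```python
import numpy as np

def deep_transition_step(x, h_prev, U, W1, W2, W_skip):
    """One time step where the state-to-state transition passes through an
    MLP with one hidden layer (extra depth, in the spirit of Fig. 10.13b),
    plus a skip connection from h_prev directly to the new state (in the
    spirit of Fig. 10.13c), keeping the shortest path between time steps
    short. Shapes and nonlinearities are illustrative assumptions."""
    a = np.tanh(W1 @ h_prev + U @ x)        # intermediate layer of the transition MLP
    h = np.tanh(W2 @ a + W_skip @ h_prev)   # W_skip @ h_prev is the skip connection
    return h

rng = np.random.default_rng(0)
n_x, n_h, n_mid = 3, 4, 6
h = deep_transition_step(
    x=rng.standard_normal(n_x),
    h_prev=np.zeros(n_h),
    U=rng.standard_normal((n_mid, n_x)),
    W1=rng.standard_normal((n_mid, n_h)),
    W2=rng.standard_normal((n_h, n_mid)),
    W_skip=rng.standard_normal((n_h, n_h)),
)
```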
10.6 Recursive Neural Networks

Recursive neural networks represent yet another generalization of recurrent networks, with a different kind of computational graph, which is structured as a deep tree, rather than the chain-like structure of RNNs. The typical computational graph for a recursive network is illustrated in Fig. 10.14. Recursive neural networks were introduced by Pollack (1990) and their potential use for learning to reason was described by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a) as well
as in computer vision (Socher et al., 2011b).

One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length τ, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from τ to O(log τ), which might help deal with long-term dependencies. (We suggest not abbreviating "recursive neural network" as "RNN," to avoid confusion with "recurrent neural network.") An open question is how to best structure the tree. One option is to have a tree structure which does not depend on the data,
such as a balanced binary tree. In some application domains, external methods can suggest the appropriate tree structure. For example, when processing natural language sentences, the tree structure for the recursive network can be fixed to the structure of the parse tree of the sentence provided by a natural language parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input, as suggested by Bottou (2011).
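A minimal sketch of weight-shared composition up a balanced binary tree, showing the O(log τ) depth; the tanh-of-affine node function and all shapes are illustrative assumptions:

```python
import numpy as np

def recursive_net(leaves, W, b):
    """Apply one shared composition function bottom-up over a (roughly)
    balanced binary tree: for tau leaves the number of compositions along
    any root-to-leaf path is O(log tau) rather than tau."""
    layer = list(leaves)
    while len(layer) > 1:
        carry = [layer[-1]] if len(layer) % 2 == 1 else []  # odd node passes up
        pairs = layer[: len(layer) - len(carry)]
        # each parent is a nonlinear function of its two children
        layer = [np.tanh(W @ np.concatenate([l, r]) + b)
                 for l, r in zip(pairs[::2], pairs[1::2])] + carry
    return layer[0]

rng = np.random.default_rng(0)
d = 3
leaves = [rng.standard_normal(d) for _ in range(8)]          # tau = 8 leaves
root = recursive_net(leaves, W=rng.standard_normal((d, 2 * d)), b=np.zeros(d))
```

With 8 leaves, the root is reached after 3 levels of composition instead of 8 recurrent steps.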
Many variants of the recursive net idea are possible. For example, Frasconi et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure, and associate the inputs and targets with individual nodes of the tree. The computation performed by each node does not have to be the traditional artificial neuron computation (affine transformation of all inputs followed by a monotone nonlinearity). For example, Socher et al. (2013a) propose using tensor operations and bilinear forms, which have previously been found useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are represented by continuous vectors (embeddings).
10.7 The Challenge of Long-Term Dependencies

The mathematical challenge of learning long-term dependencies in recurrent networks was introduced in Sec. 8.2.5. The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jacobians) compared to short-term ones. Many other sources provide a deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al., 2013a). In this section, we describe the problem in more detail. The remaining sections describe approaches to overcoming the problem.

Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in Fig. 10.15.
In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation

    h^(t) = W^⊤ h^(t−1)    (10.29)

as a very simple recurrent neural network lacking a nonlinear activation function, and lacking inputs x. This recurrence relation essentially describes the power method. It may be simplified to

    h^(t) = (W^t)^⊤ h^(0),    (10.30)

and if W admits an eigendecomposition of the form

    W = QΛQ^⊤,    (10.31)

with orthogonal Q, the recurrence may be simplified further to

    h^(t) = Q^⊤ Λ^t Q h^(0).    (10.32)
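The qualitative behavior of this recurrence is easy to check numerically. The sketch below builds a symmetric W with known eigenvalues (one above 1, the rest below 1; these values are chosen purely for the demo) and iterates the recurrence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build W = Q diag(lam) Q^T with known eigenvalues (an assumption of this demo).
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix
lam = np.array([1.1, 0.9, 0.5, 0.3])
W = Q @ np.diag(lam) @ Q.T

h = rng.standard_normal(4)
for _ in range(100):        # iterate h^(t) = W^T h^(t-1)
    h = W.T @ h
# Components along eigenvalues with magnitude < 1 have decayed away, while the
# component along 1.1 has grown: h is now aligned with that eigenvector.
cosine = abs(Q[:, 0] @ h) / np.linalg.norm(h)
```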
The eigenvalues are raised to the power of t, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of h^(0) that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight w by itself many times. The product w^t will either vanish or explode depending on the magnitude of w. However, if we make a non-recurrent network that has a different weight w^(t) at each time step, the situation is different. If the initial state is given by 1, then the state at time t is given by ∏_t w^(t). Suppose that the w^(t) values are generated randomly, independently from one another, with zero mean and variance v. The variance of the product is O(v^n). To obtain some desired variance v* we may choose the individual weights with variance v = (v*)^(1/n). Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).

The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994). One may hope that the problem can be avoided simply by staying in a region of parameter space where the gradients do not vanish or explode. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish (Bengio et al., 1993, 1994). Specifically, whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction. It does not mean that it is impossible to learn, but that it might take a very long time to learn long-term dependencies, because the signal about these dependencies will tend to be hidden by the smallest fluctuations arising from short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that as we increase the span of the dependencies that need to be captured, gradient-based optimization becomes increasingly difficult.
10.8 Echo State Networks

The recurrent weights mapping from h^(t−1) to h^(t) and the input weights mapping from x^(t) to h^(t) are some of the most difficult parameters to learn in a recurrent network. One proposed (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b) approach to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights. This is the idea that was independently proposed for echo state networks or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b) and liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009) to denote the fact that the hidden units form a reservoir of temporal features which may capture different aspects of the history of inputs.
One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary length sequence (the history of inputs up to time t) into a fixed-length vector (the recurrent state h^(t)), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest. The training criterion may then be easily designed to be convex as a function of the output weights. For example, if the output consists of linear regression from the hidden units to the output targets, and the training criterion is mean squared error, then it is convex and may be solved reliably with simple learning algorithms (Jaeger, 2003).

The important question is therefore: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? The answer proposed in the reservoir computing literature is to view the recurrent net as a dynamical system, and set the input and recurrent weights such that the dynamical system is near the edge of stability.
One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present. The idea of using such skip connections dates back to Lin et al. (1996) and follows from the idea of incorporating delays in feedforward neural networks (Lang and Hinton, 1988). In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t + 1.
It is possible to construct recurrent networks with longer delays (Bengio, 1991).

As we have seen in Sec. 8.2.5, gradients may vanish or explode exponentially with respect to the number of time steps. Lin et al. (1996) introduced recurrent connections with a time-delay of d to mitigate this problem. Gradients now diminish exponentially as a function of τ/d rather than τ. Since there are both delayed and single step connections, gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer dependencies although not all long-term dependencies may be represented well in this way.
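The scaling argument can be illustrated with plain numbers:

```python
# With a contractive Jacobian factor of w per time step, a gradient through
# tau ordinary one-step connections scales like w**tau, while a path built
# from time-delay-d connections crosses only tau/d factors, so it decays
# as a function of tau/d instead. The numbers below are illustrative.
w, tau, d = 0.9, 60, 6
ordinary_path = w ** tau        # 0.9**60: essentially vanished
delayed_path = w ** (tau // d)  # 0.9**10: still carries usable signal
```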
Another way to obtain paths on which the product of derivatives is close to one is to have units with self-connections and a weight near one on these connections.

When we accumulate a running average µ^(t) of some value v^(t) by applying the update µ^(t) ← αµ^(t−1) + (1 − α)v^(t), the α parameter is an example of a linear self-connection from µ^(t−1) to µ^(t). When α is near one, the running average remembers information about the past for a long time, and when α is near zero, information about the past is rapidly discarded. Hidden units with linear self-connections can behave similarly to such running averages. Such hidden units are called leaky units.
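To make the effect of α concrete, here is a minimal sketch (ours, not from the text) of the running-average update applied to a step signal:

```python
# Leaky accumulation: mu^(t) <- alpha * mu^(t-1) + (1 - alpha) * v^(t).
def running_average(values, alpha):
    mu = 0.0
    history = []
    for v in values:
        mu = alpha * mu + (1.0 - alpha) * v
        history.append(mu)
    return history

values = [1.0] * 5 + [0.0] * 5   # a step signal that switches off halfway

# alpha near one: the unit remembers the past for a long time.
slow = running_average(values, alpha=0.9)
# alpha near zero: information about the past is rapidly discarded.
fast = running_average(values, alpha=0.1)

print(slow[-1])  # still well above zero: old inputs linger
print(fast[-1])  # nearly zero: old inputs forgotten almost immediately
```

After the signal switches off, the α = 0.9 unit retains a substantial trace of the earlier ones, while the α = 0.1 unit has essentially forgotten them, which is exactly the time-scale contrast described above.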
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
Skip connections through d time steps are a way of ensuring that a unit can always learn to be influenced by a value from d time steps earlier. The use of a linear self-connection with a weight near one is a different way of ensuring that the unit can access values from the past. The linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than by adjusting the integer-valued skip length.

These ideas were proposed by Mozer (1992) and by El Hihi and Bengio (1996). Leaky units were also found to be useful in the context of echo state networks (Jaeger et al., 2007).

There are two basic strategies for setting the time constants used by leaky units. One strategy is to manually fix them to values that remain constant, for example by sampling their values from some distribution once at initialization time. Another strategy is to make the time constants free parameters and learn them. Having such leaky units at different time scales appears to help with long-term dependencies (Mozer, 1992; Pascanu et al., 2013a).
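The first strategy can be sketched as follows; the sampling distribution, the mapping from time constant to self-connection weight, and all names here are illustrative assumptions of ours, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 8

# Sample each unit's time constant tau once at initialization, log-uniformly
# so the units span several time scales, and fix alpha = 1 - 1/tau in (0, 1).
tau = np.exp(rng.uniform(np.log(1.0), np.log(100.0), size=n_units))
alpha = 1.0 - 1.0 / tau

def leaky_step(h, x, alpha):
    # Each unit keeps a fraction alpha of its state and leaks in new input.
    return alpha * h + (1.0 - alpha) * x

h = np.zeros(n_units)
for _ in range(50):
    h = leaky_step(h, np.ones(n_units), alpha)

# Small-tau units have nearly converged to the input value 1;
# large-tau units still lag behind, integrating over a longer horizon.
print(h)
```

Driving all units with a constant input makes the spread of time scales visible: after 50 steps each unit sits at 1 − α^50, so fast units are near 1 while slow units are still climbing.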
The clever idea of introducing self-loops to produce paths where the gradient can flow for long durations is a core contribution of the initial long short-term memory (LSTM) model (Hochreiter and Schmidhuber, 1997). A crucial addition has been to make the weight on this self-loop conditioned on the context, rather than fixed (Gers et al., 2000). By making the weight of this self-loop gated (controlled by another hidden unit), the time scale of integration can be changed dynamically. In this case, we mean that even for an LSTM with fixed parameters, the time scale of integration can change based on the input sequence, because the time constants are output by the model itself. The LSTM has been found extremely successful in many applications, such as unconstrained handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015) and parsing (Vinyals et al., 2014a).

The LSTM block diagram is illustrated in Fig. 10.16. The corresponding forward propagation equations are given below, in the case of a shallow recurrent
network architecture. Deeper architectures have also been successfully used (Graves et al., 2013; Pascanu et al., 2014a). Instead of a unit that simply applies an element-wise nonlinearity to the affine transformation of inputs and recurrent units, LSTM recurrent networks have "LSTM cells" that have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN. Each cell has the same inputs and outputs as an ordinary recurrent network, but has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit s_i^{(t)} that has a linear self-loop similar to the leaky units described in the previous section. However, here, the self-loop weight (or the associated time constant) is controlled by a forget gate unit f_i^{(t)} (for time step t and cell i), that sets this weight to a value between 0 and 1 via a sigmoid unit:

    f_i^{(t)} = \sigma\left( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} \right),        (10.33)
where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, but with a conditional self-loop weight f_i^{(t)}:

    s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\left( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \right),        (10.34)
where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i^{(t)} is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

    g_i^{(t)} = \sigma\left( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} \right).        (10.35)
The output h_i^{(t)} of the LSTM cell can also be shut off, via the output gate q_i^{(t)}, which also uses a sigmoid unit for gating:

    h_i^{(t)} = \tanh\left( s_i^{(t)} \right) q_i^{(t)},        (10.36)

    q_i^{(t)} = \sigma\left( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \right),        (10.37)
which has parameters b^o, U^o, W^o for its biases, input weights and recurrent weights, respectively. Among the variants, one can choose to use the cell state s_i^{(t)} as an extra input (with its weight) into the three gates of the i-th unit, as shown in Fig. 10.16. This would require three additional parameters.

LSTM networks have been shown to learn long-term dependencies more easily than the simple recurrent architectures, first on artificial data sets designed for testing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Hochreiter et al., 2000), then on challenging sequence processing tasks where state-of-the-art performance was obtained (Graves, 2012; Graves et al., 2013; Sutskever et al., 2014). Variants and alternatives to the LSTM have been studied and used and are discussed next.
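As a concrete illustration, the forward equations (10.33)-(10.37) for a single time step can be sketched in NumPy. This is our own minimal sketch, not code from the book: variable names are ours, the weights are random rather than trained, and the optional cell-state inputs to the gates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_cells = 3, 4

def p():
    # One (bias, input-weight, recurrent-weight) triple per gated quantity.
    return (rng.standard_normal(n_cells),
            rng.standard_normal((n_cells, n_in)),
            rng.standard_normal((n_cells, n_cells)))

(bf, Uf, Wf), (bg, Ug, Wg), (bo, Uo, Wo), (b, U, W) = p(), p(), p(), p()

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, s_prev):
    f = sigmoid(bf + Uf @ x + Wf @ h_prev)   # forget gate, Eq. 10.33
    g = sigmoid(bg + Ug @ x + Wg @ h_prev)   # external input gate, Eq. 10.35
    q = sigmoid(bo + Uo @ x + Wo @ h_prev)   # output gate, Eq. 10.37
    # State update, Eq. 10.34: conditional self-loop weight f gates the old
    # state, while g gates the squashed new input.
    s = f * s_prev + g * sigmoid(b + U @ x + W @ h_prev)
    h = np.tanh(s) * q                       # gated output, Eq. 10.36
    return h, s

h = np.zeros(n_cells)
s = np.zeros(n_cells)
for t in range(5):
    h, s = lstm_step(rng.standard_normal(n_in), h, s)
print(h.shape)  # (4,)
```

Note how the only path carrying the state forward, s = f * s_prev + ..., is linear in s_prev; when f is near one, gradients can flow through it for many steps, which is the self-loop idea the text describes.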
Which pieces of the LSTM architecture are actually necessary? What other successful architectures could be designed that allow the network to dynamically control the time scale and forgetting behavior of different units?

Some answers to these questions are given with the recent work on gated RNNs, whose units are also known as gated recurrent units or GRUs (Cho et al., 2014b; Chung et al., 2014, 2015a; Jozefowicz et al., 2015; Chrupala et al., 2015). The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equations are the following:
    h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + (1 - u_i^{(t-1)}) \sigma\left( b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)} \right),        (10.38)

where u stands for "update" gate and r for "reset" gate. Their value is defined as usual:

    u_i^{(t)} = \sigma\left( b_i^u + \sum_j U_{i,j}^u x_j^{(t)} + \sum_j W_{i,j}^u h_j^{(t)} \right)        (10.39)

and

    r_i^{(t)} = \sigma\left( b_i^r + \sum_j U_{i,j}^r x_j^{(t)} + \sum_j W_{i,j}^r h_j^{(t)} \right).        (10.40)
The reset and update gates can individually "ignore" parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it by the new "target state" value (towards which the leaky integrator wants to converge). The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.

Many more variants around this theme can be designed. For example the reset gate (or forget gate) output could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control and local control. However, several investigations over architectural variations of the LSTM and GRU found no variant that would clearly beat both of these across a wide range of tasks (Greff et al., 2015; Jozefowicz et al., 2015). Greff et al. (2015) found that a crucial ingredient is the forget gate, while Jozefowicz et al. (2015) found that adding a bias of 1 to the LSTM forget gate, a practice advocated by Gers et al. (2000), makes the LSTM as strong as the best of the explored architectural variants.
10.11 Optimization for Long-Term Dependencies

Sec. 8.2.5 and Sec. 10.7 have described the vanishing and exploding gradient problems that occur when optimizing RNNs over many time steps.

An interesting idea proposed by Martens and Sutskever (2011) is that second derivatives may vanish at the same time that first derivatives vanish. Second-order optimization algorithms may roughly be understood as dividing the first derivative by the second derivative (in higher dimension, multiplying the gradient by the inverse Hessian). If the second derivative shrinks at a similar rate to the first derivative, then the ratio of first and second derivatives may remain relatively constant. Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large minibatch, and a tendency to be attracted to saddle points. Martens and Sutskever (2011) found promising results using second-order methods. Later, Sutskever et al. (2013) found that simpler methods such as Nesterov momentum with careful initialization could achieve similar results. See Sutskever (2012) for more detail. Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs. This is part of a continuing theme in machine learning that it is often much easier to design a model that is easy to optimize than it is to design a more powerful optimization algorithm.
As discussed in Sec. 8.2.4, strongly nonlinear functions such as those computed by a recurrent net over many time steps tend to have derivatives that can be either very large or very small in magnitude. This is illustrated in Fig. 8.3 and Fig. 10.17, in which we see that the objective function (as a function of the parameters) has a "landscape" in which one finds "cliffs": wide and rather flat regions separated by tiny regions where the objective function changes quickly, forming a kind of cliff.

The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far, into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution. The gradient tells us the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside of this infinitesimal region, the cost function may begin to curve back upwards. The update must be chosen to be small enough to avoid traversing too much upward curvature. We typically use learning rates that decay slowly enough that consecutive steps have approximately the same learning rate. A step size that is appropriate for a relatively linear part of the landscape is often inappropriate and causes uphill motion if we enter a more curved part of the landscape on the next step.
Figure 10.17: Example of the effect of gradient clipping in a recurrent network with two parameters w and b. Gradient clipping can make gradient descent perform more reasonably in the vicinity of extremely steep cliffs. These steep cliffs commonly occur in recurrent networks near where a recurrent network behaves approximately linearly. The cliff is exponentially steep in the number of time steps because the weight matrix is multiplied by itself once for each time step. Gradient descent without gradient clipping overshoots the bottom of this small ravine, then receives a very large gradient from the cliff face. The large gradient catastrophically propels the parameters outside the axes of the plot. Gradient descent with gradient clipping has a more moderate reaction to the cliff. While it does ascend the cliff face, the step size is restricted so that it cannot be propelled away from the steep region near the solution. Figure adapted with permission from Pascanu et al. (2013a).
A simple type of solution has been in use by practitioners for many years: clipping the gradient. There are different instances of this idea (Mikolov, 2012; Pascanu et al., 2013a). One option is to clip the parameter gradient from a minibatch element-wise (Mikolov, 2012) just before the parameter update. Another is to clip the norm ||g|| of the gradient g (Pascanu et al., 2013a) just before the parameter update:

    \text{if } \|g\| > v        (10.41)

    g \leftarrow \frac{g v}{\|g\|}        (10.42)
where v is the norm threshold and g is used to update parameters. Because the gradient of all the parameters (including different groups of parameters, such as weights and biases) is renormalized jointly with a single scaling factor, the latter method has the advantage that it guarantees that each step is still in the gradient direction, but experiments suggest that both forms work similarly.
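The norm-clipping rule of Eqs. 10.41-10.42 amounts to the following sketch (ours, with illustrative names):

```python
import numpy as np

def clip_gradient_norm(g, v):
    """Rescale g so its norm never exceeds the threshold v."""
    norm = np.linalg.norm(g)
    if norm > v:            # Eq. 10.41: only act when the norm exceeds v
        g = g * (v / norm)  # Eq. 10.42: g <- g v / ||g||, direction preserved
    return g

g = np.array([30.0, 40.0])               # ||g|| = 50, an "exploding" gradient
clipped = clip_gradient_norm(g, v=5.0)
print(clipped)                           # [3. 4.]: norm 5, same direction

small = clip_gradient_norm(np.array([0.3, 0.4]), v=5.0)
print(small)                             # [0.3 0.4]: below threshold, untouched
```

Because all parameter groups would be renormalized by the single factor v/||g||, the clipped step keeps the direction of the joint gradient, matching the advantage noted above.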
Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients. To address vanishing gradients and better capture long-term dependencies, we discussed the idea of creating paths in the computational graph of the unfolded recurrent architecture along which the product of gradients associated with arcs is near 1. One approach to achieve this is with LSTMs and other self-loops and gating mechanisms, described above in Sec. 10.10. Another idea is to regularize or constrain the parameters so as to encourage "information flow." In particular, we would like the gradient vector \nabla_{h^{(t)}} L being back-propagated to maintain its magnitude, even if the loss function only penalizes the output at the end of the sequence. Formally, we want

    (\nabla_{h^{(t)}} L) \frac{\partial h^{(t)}}{\partial h^{(t-1)}}        (10.43)

to be as large as

    \nabla_{h^{(t)}} L.        (10.44)
With this objective, Pascanu et al. (2013a) propose the following regularizer:

    \Omega = \sum_t \left( \frac{ \left\| (\nabla_{h^{(t)}} L) \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \right\| }{ \left\| \nabla_{h^{(t)}} L \right\| } - 1 \right)^2.        (10.45)

Computing the gradient of this regularizer may appear difficult, but Pascanu et al. (2013a) propose an approximation in which we consider the back-propagated vectors \nabla_{h^{(t)}} L as if they were constants (for the purpose of this regularizer, so that there is no need to back-propagate through them). The experiments with this regularizer suggest that, if combined with the norm clipping heuristic (which handles gradient explosion), the regularizer can considerably increase the span of the dependencies that an RNN can learn. Because it keeps the RNN dynamics on the edge of explosive gradients, the gradient clipping is particularly important. Without gradient clipping, gradient explosion prevents learning from succeeding.

A key weakness of this approach is that it is not as effective as the LSTM for tasks where data is abundant, such as language modeling.
10.12 Explicit Memory
Intelligence requires knowledge and acquiring knowledge can be done via learning, which has motivated the development of large-scale deep architectures. However, there are different kinds of knowledge. Some knowledge can be implicit, sub-conscious, and difficult to verbalize—such as how to walk, or how a dog looks different from a cat. Other knowledge can be explicit, declarative, and relatively straightforward to put into words—everyday commonsense knowledge, like "a cat is a kind of animal," or very specific facts that you need to know to accomplish your current goals, like "the meeting with the sales team is at 3:00 PM in room 141."
Neural networks excel at storing implicit knowledge. However, they struggle to memorize facts. Stochastic gradient descent requires many presentations of the same input before it can be stored in neural network parameters, and even then, that input will not be stored especially precisely. Graves et al. (2014b) hypothesized that this is because neural networks lack the equivalent of the working memory system that allows human beings to explicitly hold and manipulate pieces of information that are relevant to achieving some goal. Such explicit memory components would allow our systems not only to rapidly and “intentionally” store and retrieve specific facts but also to sequentially reason with them. The need for neural networks that can process information in a sequence of steps, changing the way the input is fed into the network at each step, has long been recognized as important for the ability to reason rather than to make automatic, intuitive responses to the input (Hinton, 1990).
To resolve this difficulty, Weston et al. (2014) introduced memory networks that include a set of memory cells that can be accessed via an addressing mechanism. Memory networks originally required a supervision signal instructing them how to use their memory cells. Graves et al. (2014b) introduced the neural Turing machine, which is able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake, and allowed end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism (see Bahdanau et al. (2015) and Sec. 12.4.5.1). This soft addressing mechanism has become standard with other related architectures emulating algorithmic mechanisms in a way that still allows gradient-based optimization (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Kumar et al., 2015; Vinyals et al., 2015a; Grefenstette et al., 2015).

Each memory cell can be thought of as an extension of the memory cells in LSTMs and GRUs. The difference is that the network outputs an internal state that chooses which cell to read from or write to, just as memory accesses in a digital computer read from or write to a specific address.
It is difficult to optimize functions that produce exact, integer addresses. To alleviate this problem, NTMs actually read to or write from many memory cells simultaneously. To read, they take a weighted average of many cells. To write, they modify multiple cells by different amounts. The coefficients for these operations are chosen to be focused on a small number of cells, for example, by producing them via a softmax function. Using these weights with non-zero derivatives allows the functions controlling access to the memory to be optimized using gradient descent. The gradient on these coefficients indicates whether each of them should be increased or decreased, but the gradient will typically be large only for those memory addresses receiving a large coefficient.
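The weighted read and write operations just described can be sketched schematically in NumPy. Here the score vector that determines the addressing coefficients is taken as a given input; in an actual NTM it would be produced by the controller network, and the write operation shown is a simplified additive write rather than the full erase-and-add mechanism:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax producing addressing coefficients."""
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_read(memory, scores):
    """Weighted average over all cells; the weights have non-zero
    derivatives, so addressing can be trained by gradient descent."""
    w = softmax(scores)        # one coefficient per memory cell
    return w @ memory          # (n_cells, width) -> (width,)

def soft_write(memory, scores, value):
    """Modify every cell by an amount proportional to its coefficient."""
    w = softmax(scores)
    return memory + np.outer(w, value)
```

When the scores are strongly peaked, the softmax concentrates nearly all of the weight on a single cell, so the soft read and write approximate a discrete, addressed memory access while remaining differentiable.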
These memory cells are typically augmented to contain a vector, rather than the single scalar stored by an LSTM or GRU memory cell. There are two reasons to increase the size of the memory cell. One reason is that we have increased the cost of accessing a memory cell. We pay the computational cost of producing a coefficient for many cells, but we expect these coefficients to cluster around a small number of cells. By reading a vector value, rather than a scalar value, we can
move only forward in time through the sequence. In the case of machine translation and memory networks, at each step, the focus of attention can move to a completely different place, compared to the previous step.

Recurrent neural networks provide a way to extend deep learning to sequential data. They are the last major tool in our deep learning toolbox. Our discussion now moves to how to choose and use these tools and how to apply them to real-world tasks.
Chapter 11

Practical Methodology
Successfully applying deep learning techniques requires more than just a good knowledge of what algorithms exist and the principles that explain how they work. A good machine learning practitioner also needs to know how to choose an algorithm for a particular application and how to monitor and respond to feedback obtained from experiments in order to improve a machine learning system. During day to day development of machine learning systems, practitioners need to decide whether to gather more data, increase or decrease model capacity, add or remove regularizing features, improve the optimization of a model, improve approximate inference in a model, or debug the software implementation of the model. All of these operations are at the very least time-consuming to try out, so it is important to be able to determine the right course of action rather than blindly guessing.

Most of this book is about different machine learning models, training algorithms, and objective functions. This may give the impression that the most important ingredient to being a machine learning expert is knowing a wide variety of machine learning techniques and being good at different kinds of math. In practice, one can usually do much better with a correct application of a commonplace algorithm than by sloppily applying an obscure algorithm. Correct application of an algorithm depends on mastering some fairly simple methodology. Many of the recommendations in this chapter are adapted from Ng (2015).

We recommend the following practical design process:
• Determine your goals—what error metric to use, and your target value for this error metric. These goals and error metrics should be driven by the problem that the application is intended to solve.

• Establish a working end-to-end pipeline as soon as possible, including the estimation of the appropriate performance metrics.

• Instrument the system well to determine bottlenecks in performance. Diagnose which components are performing worse than expected and whether it is due to overfitting, underfitting, or a defect in the data or software.

• Repeatedly make incremental changes such as gathering new data, adjusting hyperparameters, or changing algorithms, based on specific findings from your instrumentation.

As a running example, we will use the Street View address number transcription system (Goodfellow et al., 2014d). The purpose of this application is to add buildings to Google Maps. Street View cars photograph the buildings and record the GPS coordinates associated with each photograph. A convolutional network recognizes the address number in each photograph, allowing the Google Maps database to add that address in the correct location. The story of how this commercial application was developed gives an example of how to follow the design methodology we advocate.

We now describe each of the steps in this process.
11.1 Performance Metrics
Determining your goals, in terms of which error metric to use, is a necessary first step because your error metric will guide all of your future actions. You should also have an idea of what level of performance you desire.

Keep in mind that for most applications, it is impossible to achieve absolute zero error. The Bayes error defines the minimum error rate that you can hope to achieve, even if you have infinite training data and can recover the true probability distribution. This is because your input features may not contain complete information about the output variable, or because the system might be intrinsically stochastic. You will also be limited by having a finite amount of training data.

The amount of training data can be limited for a variety of reasons. When your goal is to build the best possible real-world product or service, you can typically collect more data but must determine the value of reducing error further and weigh this against the cost of collecting more data. Data collection can require time, money, or human suffering (for example, if your data collection process involves performing invasive medical tests). When your goal is to answer a scientific question about which algorithm performs better on a fixed benchmark, the benchmark specification usually determines the training set and you are not allowed to collect more data.

How can one determine a reasonable level of performance to expect? Typically, in the academic setting, we have some estimate of the error rate that is attainable based on previously published benchmark results. In the real-world setting, we have some idea of the error rate that is necessary for an application to be safe, cost-effective, or appealing to consumers. Once you have determined your realistic desired error rate, your design decisions will be guided by reaching this error rate.

Another important consideration besides the target value of the performance metric is the choice of which metric to use. Several different performance metrics may be used to measure the effectiveness of a complete application that includes machine learning components. These performance metrics are usually different from the cost function used to train the model. As described in Sec. 5.1.2, it is common to measure the accuracy, or equivalently, the error rate, of a system.

However, many applications require more advanced metrics.

Sometimes it is much more costly to make one kind of a mistake than another. For example, an e-mail spam detection system can make two kinds of mistakes: incorrectly classifying a legitimate message as spam, and incorrectly allowing a spam message to appear in the inbox. It is much worse to block a legitimate message than to allow a questionable message to pass through. Rather than measuring the error rate of a spam classifier, we may wish to measure some form of total cost, where the cost of blocking legitimate messages is higher than the cost of allowing spam messages.
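This notion of total cost can be illustrated with a short Python sketch; the specific cost values assigned to the two kinds of mistakes are hypothetical, chosen only to make blocking a legitimate message more expensive than letting spam through:

```python
def total_cost(y_true, y_pred, cost_block_legit=10.0, cost_allow_spam=1.0):
    """Asymmetric total cost for a spam filter. Label 1 means spam.
    Blocking a legitimate message (true 0 predicted 1) costs more than
    allowing a spam message through (true 1 predicted 0)."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 1:
            cost += cost_block_legit   # legitimate message blocked
        elif t == 1 and p == 0:
            cost += cost_allow_spam    # spam message allowed through
    return cost
```

Two classifiers with the same error rate can then have very different total costs, depending on which kind of mistake they tend to make.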
Sometimes we wish to train a binary classifier that is intended to detect some rare event. For example, we might design a medical test for a rare disease. Suppose that only one in every million people has this disease. We can easily achieve 99.9999% accuracy on the detection task, by simply hard-coding the classifier to always report that the disease is absent. Clearly, accuracy is a poor way to characterize the performance of such a system. One way to solve this problem is to instead measure precision and recall. Precision is the fraction of detections reported by the model that were correct, while recall is the fraction of true events that were detected. A detector that says no one has the disease would achieve perfect precision, but zero recall. A detector that says everyone has the disease would achieve perfect recall, but precision equal to the percentage of people who have the disease (0.0001% in our example of a disease that only one person in a million has). When using precision and recall, it is common to plot a PR curve, with precision on the y-axis and recall on the x-axis. The classifier generates a score that is higher if the event to be detected occurred. For example, a feedforward network designed to detect a disease outputs ŷ = P(y = 1 | x), estimating the probability that a person whose medical results are described by features x has the disease. We choose to report a detection whenever this score exceeds some threshold. By varying the threshold, we can trade precision for recall. In many cases, we wish to summarize the performance of the classifier with a single number rather than a curve. To do so, we can convert precision p and recall r into an F-score given by

F = 2pr / (p + r).    (11.1)

Another option is to report the total area lying beneath the PR curve.
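A minimal Python sketch of these computations at a single threshold; the convention of reporting a precision of 1.0 when there are no detections (and a recall of 1.0 when there are no true events) is an assumption made for the example:

```python
def precision_recall_f(y_true, scores, threshold):
    """Compute precision, recall, and F-score (Eq. 11.1) for detections
    whose score exceeds the given threshold."""
    pred = [s > threshold for s in scores]
    tp = sum(p and t for p, t in zip(pred, y_true))  # true positives
    detections = sum(pred)
    events = sum(y_true)
    precision = tp / detections if detections else 1.0
    recall = tp / events if events else 1.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Sweeping the threshold over the range of scores and collecting the (recall, precision) pairs traces out the PR curve described above.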
In some applications, it is possible for the machine learning system to refuse to make a decision. This is useful when the machine learning algorithm can estimate how confident it should be about a decision, especially if a wrong decision can be harmful and if a human operator is able to occasionally take over. The Street View transcription system provides an example of this situation. The task is to transcribe the address number from a photograph in order to associate the location where the photo was taken with the correct address in a map. Because the value of the map degrades considerably if the map is inaccurate, it is important to add an address only if the transcription is correct. If the machine learning system thinks that it is less likely than a human being to obtain the correct transcription, then the best course of action is to allow a human to transcribe the photo instead. Of course, the machine learning system is only useful if it is able to dramatically reduce the amount of photos that the human operators must process. A natural performance metric to use in this situation is coverage. Coverage is the fraction of examples for which the machine learning system is able to produce a response. It is possible to trade coverage for accuracy. One can always obtain 100% accuracy by refusing to process any example, but this reduces the coverage to 0%. For the Street View task, the goal for the project was to reach human-level transcription accuracy while maintaining 95% coverage. Human-level performance on this task is 98% accuracy.
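A minimal Python sketch of the coverage computation; the per-example confidence scores and the refusal threshold are assumed inputs here, whereas a real system would derive the confidences from the model itself:

```python
def coverage_and_accuracy(y_true, predictions, confidences, min_confidence):
    """The system responds only when its confidence is high enough.
    Coverage is the fraction of examples that receive a response;
    accuracy is measured over the answered examples only."""
    answered = [(p, t) for p, t, c in zip(predictions, y_true, confidences)
                if c >= min_confidence]
    coverage = len(answered) / len(y_true)
    accuracy = (sum(p == t for p, t in answered) / len(answered)
                if answered else 1.0)  # refusing everything: vacuous 100% accuracy
    return coverage, accuracy
```

Raising the refusal threshold lowers coverage but typically raises accuracy on the remaining examples, which is exactly the trade described in the text.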
Many other metrics are possible. We can, for example, measure click-through rates, collect user satisfaction surveys, and so on. Many specialized application areas have application-specific criteria as well.

What is important is to determine which performance metric to improve ahead of time, then concentrate on improving this metric. Without clearly defined goals, it can be difficult to tell whether changes to a machine learning system make progress or not.
11.2 Default Baseline Models

After choosing performance metrics and goals, the next step in any practical application is to establish a reasonable end-to-end system as soon as possible. In this section, we provide recommendations for which algorithms to use as the first baseline approach in various situations. Keep in mind that deep learning research progresses quickly, so better default algorithms are likely to become available soon after this writing.
afterDep
Depending
thisending
writing.on the complexity of your problem, you may ev even
en wan antt to begin
without using deep learning. If your problem has a chance of being solv solveded by
justDep endinga on
choosing fewthe complexity
linear weigh
weights of your problem,
ts correctly
correctly, , you mayyou wan
wantmay
t to bevegin
en wwith
ant to begin
a simple
without using
statistical mo
model deep
del likelearning. If your problem has a chance of being solved by
logistic regression.
just choosing a few linear weights correctly, you may want to begin with a simple
If you know that your problem falls in into
to an “AI-complete” category like ob object
ject
statistical model like logistic regression.
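Such a baseline really can be a few lines of code: logistic regression fits one weight per input feature by gradient descent on the logistic loss. A minimal sketch (in practice a library implementation, e.g. scikit-learn's, would be preferred):

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.5, steps=2000):
    """Fit weights w and bias b by batch gradient descent
    on the logistic (cross-entropy) loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1 | x)
        w -= lr * X.T @ (p - y) / len(y)        # gradient of the mean loss
        b -= lr * (p - y).mean()
    return w, b
```

If this simple model already approaches the target metric, the added complexity of a deep model may not be needed at all.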
If you know that your problem falls into an “AI-complete” category like object recognition, speech recognition, machine translation, and so on, then you are likely to do well by beginning with an appropriate deep learning model.

First, choose the general category of model based on the structure of your data. If you want to perform supervised learning with fixed-size vectors as input, use a feedforward network with fully connected layers. If the input has known topological structure (for example, if the input is an image), use a convolutional network. In these cases, you should begin by using some kind of piecewise linear unit (ReLUs or their generalizations like leaky ReLUs, PReLUs and maxout).
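The piecewise linear units named here can each be written in a few lines (a sketch; the maxout parameterization with `k` affine pieces is one common formulation, and the shapes chosen are illustrative):

```python
import numpy as np

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # lets a small, fixed slope through for negative inputs;
    # a PReLU instead learns alpha during training
    return np.where(x > 0, x, alpha * x)

def maxout(x, W, b):
    """Maxout unit: the max over k learned affine pieces.
    W has shape (k, n_in, n_out); b has shape (k, n_out)."""
    return np.max(np.einsum('i,kio->ko', x, W) + b, axis=0)
```

All three are piecewise linear, which keeps gradients large and well-behaved wherever the unit is active.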