GoogLeNet
Christian Szegedy, Google
Wei Liu, UNC
Yangqing Jia, Google
Pierre Sermanet, Google
Scott Reed, University of Michigan
Dragomir Anguelov, Google
Dumitru Erhan, Google
Vincent Vanhoucke, Google
Andrew Rabinovich, Google
Deep Convolutional Networks
Revolutionizing computer vision since 1989
Well..
Deep Convolutional Networks
Revolutionizing computer vision since 2012
Why is the deep learning revolution arriving just now?
Deep learning needs a lot of training data.
Deep learning needs a lot of computational resources.
Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object detection. In Advances in Neural Information Processing Systems 2013 (pp. 2553-2561).
Then state-of-the-art performance using a training set of ~10K images for object detection on 20 classes of VOC, without pretraining on ImageNet.

Agrawal, P., Girshick, R., & Malik, J. (2014). Analyzing the Performance of Multilayer Neural Networks for Object Recognition. http://arxiv.org/pdf/1407.1610v1.pdf
40% mAP when training on Pascal VOC 2007 only, without pretraining on ImageNet.

Toshev, A., & Szegedy, C. DeepPose: Human pose estimation via deep neural networks. CVPR 2014
Setting the state of the art of human pose estimation on LSP by training a CNN on four thousand images from scratch.

Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. Scalable Object Detection using Deep Neural Networks. CVPR 2014
Significantly faster to evaluate than a typical (non-specialized) DPM implementation, even for a single object category.

Large-scale distributed multigrid solvers since the 1990s.
MapReduce since 2004 (Jeff Dean et al.)
Scientific computing has been solving large-scale, complex numerical problems at scale via distributed systems for decades.
UFLDL (2010) on Deep Learning
"While the theoretical benefits of deep networks in terms of their compactness and expressive power have been appreciated for many decades, until recently researchers had little success training deep architectures."
[...]
"How can we train a deep network? One method that has seen some success is the greedy layerwise training method."
[...]
"Training can either be supervised (say, with classification error as the objective function on each step), but more frequently it is unsupervised"
Andrew Ng, UFLDL tutorial

Why is the deep learning revolution arriving just now?
?????
Rectified Linear Unit
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP, Vol. 15, pp. 315-323.
GoogLeNet
[Figure: GoogLeNet architecture diagram. Legend: Convolution, Pooling, Softmax, Other]
GoogLeNet vs State of the art
[Figure: GoogLeNet next to the Zeiler-Fergus Architecture (1 tower). Legend: Convolution, Pooling, Softmax, Other]
Problems with training deep architectures?
Vanishing gradient?
Exploding gradient?
Tricky weight initialization?
Justified Questions
Why does it have so many layers???

Why is the deep learning revolution arriving just now?
It used to be hard and cumbersome to train deep models due to sigmoid nonlinearities.
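One way to see why sigmoid nonlinearities make deep models hard to train is to look at how the backpropagated gradient shrinks with depth. The following is a minimal sketch under simplifying assumptions that are not from the slides: every layer has a single unit, all weights are fixed at 1, and every unit sits at the same pre-activation value. The helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of max(0, x): 1 for active units, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def chained_gradient(local_grad, pre_activation, depth):
    # Assumption: identical local derivative in every layer, weights equal to 1,
    # so the backpropagated factor is just the local derivative to the power depth.
    return local_grad(pre_activation) ** depth


depth = 20
sigmoid_best_case = chained_gradient(sigmoid_grad, 0.0, depth)
relu_active = chained_gradient(relu_grad, 1.0, depth)

print(f"sigmoid, {depth} layers, best case: {sigmoid_best_case:.3e}")
print(f"ReLU, {depth} layers, active path: {relu_active:.1f}")
```

Even in the sigmoid's best case the factor decays geometrically with depth, while an active ReLU path passes the gradient through unchanged.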

ReLU
Deep neural networks are highly non-convex without any obvious optimality guarantees or nice theory.
Theoretical breakthroughs
Arora, S., Bhaskara, A., Ge, R., & Ma, T. Provable bounds for learning some deep representations. ICML 2014
Even non-convex ones!
Hebbian Principle
[Figure: layers built bottom to top. Input, then Layer 1 (cluster according to activation statistics), then Layer 2 and Layer 3 (cluster according to correlation statistics)]
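As a toy illustration of clustering units according to correlation statistics, the sketch below groups units whose activation traces are strongly correlated. It is a hypothetical simplification (greedy grouping against a fixed threshold, with made-up activations), not the procedure analyzed by Arora et al.; the function names and the 0.9 threshold are assumptions.

```python
import math


def pearson(a, b):
    # Pearson correlation coefficient of two equally long activation traces.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def group_by_correlation(activations, threshold=0.9):
    # Greedy grouping: a unit joins the first group whose seed unit it
    # correlates with above the threshold, otherwise it starts a new group.
    groups = []
    for unit, trace in enumerate(activations):
        for group in groups:
            seed = group[0]
            if pearson(activations[seed], trace) >= threshold:
                group.append(unit)
                break
        else:
            groups.append([unit])
    return groups


# Made-up activations of four units over five inputs.
activations = [
    [0.1, 0.9, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.9, 0.2],  # fires together with unit 0
    [0.9, 0.1, 0.8, 0.2, 0.9],
    [0.8, 0.0, 0.7, 0.1, 0.8],  # fires together with unit 2
]
groups = group_by_correlation(activations)
print(groups)
```

Each resulting group would then be a candidate for one unit of the next layer, which is the "units that fire together, wire together" reading of the Hebbian principle.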
In images, correlations tend to be local
Cover very local clusters by 1x1 convolutions
Less spread out correlations
Cover more spread out clusters by 3x3 convolutions
Cover more spread out clusters by 5x5 convolutions
A heterogeneous set of convolutions
[Figure: number of filters per kernel size: 1x1, 3x3, 5x5]
Schematic view (naive version)
[Figure: Previous layer feeds 1x1 convolutions, 3x3 convolutions and 5x5 convolutions in parallel; their outputs go to Filter concatenation]
Naive idea (does not work!)
[Figure: Previous layer feeds 1x1 convolutions, 3x3 convolutions, 5x5 convolutions and 3x3 max pooling in parallel; their outputs go to Filter concatenation]
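One reason the naive module scales badly can be sketched with simple channel arithmetic: max pooling has no filters, so it passes every input channel through to the concatenation, and the module output is always wider than its input. The filter counts below (64, 128, 32) and the 192 input channels are illustrative assumptions, as are the function names.

```python
def naive_module_width(in_channels, n1x1, n3x3, n5x5):
    # Output channels of the naive module: the three convolution branches
    # plus the pooling branch, which keeps all input channels unchanged.
    return n1x1 + n3x3 + n5x5 + in_channels


def stack_widths(in_channels, filters, num_modules):
    # Width after each module when identical naive modules are stacked.
    widths = []
    for _ in range(num_modules):
        in_channels = naive_module_width(in_channels, *filters)
        widths.append(in_channels)
    return widths


widths = stack_widths(192, (64, 128, 32), 9)
print(widths)
```

Because the 3x3 and 5x5 convolutions in each module operate on all incoming channels, this steadily growing width also inflates their cost from module to module.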
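Inception module
[Figure: Previous layer feeds four parallel branches: (1) 1x1 convolutions; (2) 1x1 convolutions followed by 3x3 convolutions; (3) 1x1 convolutions followed by 5x5 convolutions; (4) 3x3 max pooling followed by 1x1 convolutions. All four outputs go to Filter concatenation]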
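The effect of the extra 1x1 convolutions can be estimated by counting multiply-adds for the 5x5 branch with and without a 1x1 reduction in front of it. The numbers used here (28x28 feature map, 192 input channels, reduction to 16 channels, 32 5x5 filters, and 64/128/32/32 output filters per branch) are assumptions taken from the first module in the published GoogLeNet paper, not from the slides; the helper name is hypothetical.

```python
def conv_madds(height, width, in_ch, out_ch, kernel):
    # Multiply-adds of a stride-1 convolution that preserves spatial size.
    return height * width * in_ch * out_ch * kernel * kernel


H = W = 28
C_IN = 192

# 5x5 convolution applied directly to all input channels.
direct = conv_madds(H, W, C_IN, 32, 5)

# 1x1 reduction to 16 channels, then the same 5x5 convolution.
reduced = conv_madds(H, W, C_IN, 16, 1) + conv_madds(H, W, 16, 32, 5)

# Width of the concatenated output: 1x1, 3x3, 5x5 and pool-projection branches.
module_width = 64 + 128 + 32 + 32

print(direct, reduced, round(direct / reduced, 2), module_width)
```

Under these assumptions the reduced branch needs roughly a tenth of the multiply-adds, and the 1x1 convolution after pooling keeps the output width fixed by design instead of growing with the input.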
Why does it have so many layers???
[Figure: GoogLeNet architecture diagram with the Inception modules highlighted. Legend: Convolution, Pooling, Softmax, Other, Inception]
9 Inception modules
Network in a network in a network...
Width of inception modules ranges from 256 filters (in early modules) to 1024 in top inception modules.
Module widths: 256, 480, 480, 512, 512, 512, 832, 832, 1024
Can remove fully connected layers on top completely
Number of parameters is reduced to 5 million
Computational cost is increased by less than 2X compared to Krizhevsky's network (<1.5Bn operations/evaluation).
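A rough sketch of why dropping the large fully connected layers saves so many parameters: the comparison below counts weights and biases of a Krizhevsky-style fully connected head against a single classifier on top of an average-pooled 1024-channel feature map. The layer sizes (6x6x256 inputs, two 4096-unit layers, 1000 classes, and a 1024-to-1000 classifier) are commonly cited values and are assumptions here, not figures from the slides.

```python
def dense_params(n_in, n_out):
    # Weights plus biases of one fully connected layer.
    return n_in * n_out + n_out


# Assumed Krizhevsky-style head: 6*6*256 -> 4096 -> 4096 -> 1000.
fc_head = (
    dense_params(6 * 6 * 256, 4096)
    + dense_params(4096, 4096)
    + dense_params(4096, 1000)
)

# Assumed head after average pooling of a 1024-channel map: 1024 -> 1000.
pooled_head = dense_params(1024, 1000)

print(f"fully connected head: {fc_head:,} parameters")
print(f"pooled head:          {pooled_head:,} parameters")
```

Under these assumptions the fully connected head alone holds more than ten times the 5 million parameters quoted for the whole network.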
Classification results on ImageNet 2012

Number of Models | Number of Crops | Computational Cost | Top-5 Error | Compared to Base
1 | 1 (center crop) | 1x | 10.07% |
1 | 10* | 10x | 9.15% | 0.92%
1 | 144 (Our approach) | 144x | 7.89% | 2.18%

*Cropping by [Krizhevsky et al 2014]
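The 144 crops can be enumerated as a product of four choices. The breakdown used here (4 scales, 3 squares per scale, 6 crops per square, 2 mirror states) follows the description in the published GoogLeNet paper and is an assumption relative to the slides, which only give the total; the labels are illustrative.

```python
from itertools import product

# Shorter image side is resized to each of these scales.
scales = (256, 288, 320, 352)
# Squares taken from the resized image (top/center/bottom for portrait images).
squares = ("left", "center", "right")
# Per square: four corner crops, the center crop, and the whole square resized.
crops = (
    "top-left",
    "top-right",
    "bottom-left",
    "bottom-right",
    "center",
    "full-square-resized",
)
# Each crop is evaluated as-is and mirrored.
mirrors = (False, True)

views = list(product(scales, squares, crops, mirrors))
print(len(views))
```

Since every view needs one forward pass, the evaluation cost scales linearly with the number of views, which matches the 144x entry in the table.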
6.54%

Team | Year | Place | Error (top-5) | Uses external data
SuperVision | 2012 | | 16.4% | no
SuperVision | 2012 | 1st | 15.3% | ImageNet 22k
Clarifai | 2013 | - | 11.7% | no
Detection
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524.
Detection
Improved proposal generation:
Increase size of super-pixels by 2X
coverage: 92% -> 90%
number of proposals: 2000/image -> 1000/image
Add multibox* proposals
coverage: 90% -> 93%
number of proposals: 1000/image -> 1200/image
Improves mAP by about 1% for single model.
*Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. Scalable Object Detection using Deep Neural Networks. CVPR 2014
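Proposal coverage can be computed as the fraction of ground-truth boxes matched by at least one proposal. The sketch below assumes a match means intersection over union (IoU) of at least 0.5, a common convention that the slides do not state; the box format, function names and example boxes are also assumptions.

```python
def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coverage(ground_truth, proposals, threshold=0.5):
    # Fraction of ground-truth boxes that some proposal overlaps well enough.
    covered = sum(
        1
        for gt in ground_truth
        if any(iou(gt, p) >= threshold for p in proposals)
    )
    return covered / len(ground_truth)


# Made-up boxes: the first object is covered, the second is not.
ground_truth = [(0, 0, 10, 10), (20, 20, 40, 40)]
proposals = [(1, 1, 10, 10), (30, 30, 50, 50)]
print(coverage(ground_truth, proposals))
```

Fewer proposals per image reduce the number of regions the classifier must score, so the trade-off reported above is between coverage and evaluation cost.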
Detection results without ensembling

Team | mAP | external data | contextual model | bounding-box regression
Trimps-Soushen | 31.6% | ILSVRC12 Classification | no |
Berkeley Vision | 34.5% | ILSVRC12 Classification | no | yes
UvA-Euvision | 35.4% | ILSVRC12 Classification | |
Final Detection Results

Team | Year | Place | mAP | external data | ensemble | contextual model | approach
UvA-Euvision | 2013 | 1st | 22.6% | none | | yes | Fisher vectors
Deep Insight | 2014 | 3rd | 40. | ILSVRC12 Cla... | 3 models | yes | ConvNet
Classification failure cases

Groundtruth: coffee mug
GoogLeNet:
table lamp
lamp shade
printer
projector
desktop computer

Groundtruth: police car
GoogLeNet:
laptop
hair drier
binocular
ATM machine
seat belt

Groundtruth: hay
GoogLeNet:
sorrel (horse)
hartebeest
Arabian camel
warthog
gazelle
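These cases count as errors under the top-5 metric because the ground-truth label is missing from the five predicted labels. A minimal sketch of that check, using the three examples above (the function name and data layout are assumptions):

```python
def top5_correct(ground_truth, predictions):
    # A prediction is correct if the true label is among the first five guesses.
    return ground_truth in predictions[:5]


cases = {
    "coffee mug": ["table lamp", "lamp shade", "printer", "projector", "desktop computer"],
    "police car": ["laptop", "hair drier", "binocular", "ATM machine", "seat belt"],
    "hay": ["sorrel (horse)", "hartebeest", "Arabian camel", "warthog", "gazelle"],
}

errors = sum(not top5_correct(gt, preds) for gt, preds in cases.items())
top5_error = errors / len(cases)
print(f"top-5 error on these examples: {top5_error:.0%}")
```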
Acknowledgments
We would like to thank:
Chuck Rosenberg, Hartwig Adam, Alex Toshev, Tom Duerig, Ning Ye, Rajat Monga, Jon Shlens, Alex Krizhevsky, Sudheendra Vijayanarasimhan, Jeff Dean, Ilya Sutskever, Andrea Frome
and check out our poster!
