SimNets
Nadav Cohen, Amnon Shashua (HUJI)
November 2014
Outline
5 Experiments
6 Summary
Artificial neuron
Sigmoid: $\varphi(z) = \frac{1}{1+\exp\{-z\}}$    ReLU: $\varphi(z) = \max\{0, z\}$
ConvNet example
Kernel methods for deep learning [Cho and Saul, “Kernel methods for deep learning”]
Similarity operator (with a point-wise similarity mapping $\phi$):

$u^\top \phi(x, z) = \sum_{i=1}^{d} u_i \cdot \phi(x, z)_i$
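As a concrete illustration, here is a minimal Python sketch (not from the slides; the function name and the `form` flag are illustrative) of the weighted similarity $u^\top \phi(x,z)$ for the three point-wise similarities considered later (linear, l1, l2):

```python
import numpy as np

def similarity(x, z, u, form="l2"):
    """Weighted similarity u^T phi(x, z) = sum_i u_i * phi(x, z)_i."""
    if form == "lin":
        phi = x * z               # linear: phi(x, z)_i = x_i * z_i
    elif form == "l1":
        phi = -np.abs(x - z)      # l1:     phi(x, z)_i = -|x_i - z_i|
    elif form == "l2":
        phi = -(x - z) ** 2       # l2:     phi(x, z)_i = -(x_i - z_i)^2
    else:
        raise ValueError(f"unknown form: {form}")
    return u @ phi

# Unweighted similarity corresponds to u = 1:
x, z = np.array([1.0, 2.0]), np.array([0.5, 2.5])
print(similarity(x, z, np.ones(2), form="l2"))   # -(0.5^2) - (0.5^2) = -0.5
```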
[Figure: similarity layer; H×W×D input with patches x_ij, templates z_1, …, z_n; output out(i, j, l) = u_l^⊤ φ(x_ij, z_l)]
$\mathrm{MEX}_\xi\{c_i\}_{i=1,\dots,n} := \frac{1}{\xi}\log\left(\frac{1}{n}\sum_{i=1}^{n}\exp\{\xi \cdot c_i\}\right)$
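A minimal, numerically stable implementation of the MEX operator (a sketch, not from the slides; SciPy's `logsumexp` is used for stability):

```python
import numpy as np
from scipy.special import logsumexp

def mex(c, xi):
    """MEX_xi{c_i}_{i=1..n} = (1/xi) * log((1/n) * sum_i exp(xi * c_i)).

    Limits: xi -> +inf gives max(c), xi -> -inf gives min(c),
    and xi -> 0 gives the arithmetic mean of c.
    """
    c = np.asarray(c, dtype=float)
    if xi == 0:
        return c.mean()                      # the xi -> 0 limit
    return (logsumexp(xi * c) - np.log(c.size)) / xi

print(mex([1.0, 2.0, 3.0], xi=100.0))   # ~3.0 (approaches max)
print(mex([1.0, 2.0, 3.0], xi=0.0))     # 2.0  (mean)
print(mex([1.0, 2.0, 3.0], xi=-100.0))  # ~1.0 (approaches min)
```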
[Figure: MEX layer; out(t) = MEX_ξ({inp(s) + b_ts}_{s∈block(t)}, c_t); recovering ReLU: out(t) = max{inp(t), 0}]

Simply set:
Input blocks – single entries (output dimensions same as input’s).
b_ts = 0
c_t = 0
ξ → +∞
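A quick numeric check of this special case (a sketch; a stabilized inline MEX keeps the snippet self-contained):

```python
import numpy as np

def mex(c, xi):
    c = np.asarray(c, dtype=float)
    m = c.max()                               # shift for numerical stability
    return m + np.log(np.mean(np.exp(xi * (c - m)))) / xi

# Single-entry block {inp_t}, offsets b_ts = 0, constant c_t = 0, large xi:
for t in (-1.5, 0.0, 0.3, 2.0):
    print(t, "->", round(mex([t, 0.0], xi=1e3), 4))   # approaches max{t, 0}
```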
[Figure: MEX layer realizing pooling; out(i, j, l) = max/mean{inp(i′, j′, l) : (i′, j′) ∈ pool(i, j)}]

Simply set:
Input blocks – 2D windows (output depth same as input’s).
b_ts = 0
Omit c_t
ξ → +∞ for max-pooling, ξ → 0 for average-pooling.
Note that ξ can be learned during training, i.e. a trade-off between max- and average-pooling can be learned.
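A sketch of MEX pooling over one window, showing the max/average trade-off controlled by ξ (illustrative code, not from the slides):

```python
import numpy as np
from scipy.special import logsumexp

def mex(c, xi):
    c = np.asarray(c, dtype=float)
    if xi == 0:
        return c.mean()
    return (logsumexp(xi * c) - np.log(c.size)) / xi

window = np.array([0.1, 0.5, 0.9, 0.2])   # one flattened 2x2 pooling window
print(mex(window, xi=1e3))   # ~0.900 -> max-pooling     (xi -> +inf)
print(mex(window, xi=0.0))   # 0.425  -> average-pooling (xi -> 0)
print(mex(window, xi=5.0))   # in between; xi can be learned during training
```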
The SimNet architecture
[Figure: basic SimNet; input x, hidden similarity units sim(l) = u_l^⊤ φ(x, z_l) for l = 1, …, n, outputs out(r) = MEX_ξ{sim(l) + b_rl}_{l=1}^{n} for r = 1, …, k]

$\hat{y}(x) = \operatorname{argmax}_{r=1,\dots,k} \mathrm{MEX}_\xi\{u_l^\top \phi(x, z_l) + b_{rl}\}_{l=1}^{n}$
$\hat{y}(x) = \operatorname{argmax}_{r=1,\dots,k} \sum_{l=1}^{n} \alpha_{rl} \cdot K(x, z_l)$

For all considered similarities, $K(x, z_l) := \exp\{\xi \cdot \sum_{i=1}^{d} \phi(x, z_l)_i\}$ is a kernel function:
$K_{\mathrm{lin}}(x, z) = \exp\{\xi \cdot \langle x, z\rangle\}$ – “Exponential” kernel
$K_{l_1}(x, z) = \exp\{-\xi\,\|x - z\|_1\}$ – “Laplacian” kernel
$K_{l_2}(x, z) = \exp\{-\xi\,\|x - z\|_2^2\}$ – “RBF” kernel

Corollary
For all considered similarities, the basic SimNet with fixed ξ > 0 and uniform weights (u_l = 1) is a “reduced” kernel-SVM.
The network similarity templates z_1, …, z_n are the (reduced) support vectors, and the MEX offsets b_rl are directly related to the SVM coefficients.
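The corollary can be checked numerically. Below is an illustrative sketch (random templates and offsets, unweighted l2 similarity) verifying that exp(ξ · MEX) equals a kernel expansion with the “RBF” kernel and coefficients α_rl = exp(ξ · b_rl)/n, so the two classifiers agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d, xi = 5, 3, 4, 2.0           # templates, classes, dim, fixed xi > 0
Z = rng.normal(size=(n, d))           # similarity templates z_1..z_n
B = rng.normal(size=(k, n))           # MEX offsets b_rl
x = rng.normal(size=d)

# Basic SimNet with unweighted l2 similarity: phi(x, z)_i = -(x_i - z_i)^2
sim = -((x - Z) ** 2).sum(axis=1)                              # 1^T phi(x, z_l)
out = np.log(np.mean(np.exp(xi * (sim + B)), axis=1)) / xi     # MEX per class

# Equivalent kernel machine with K(x, z) = exp(-xi * ||x - z||^2) ("RBF")
K = np.exp(xi * sim)                    # K(x, z_l)
alpha = np.exp(xi * B) / n              # alpha_rl = exp(xi * b_rl) / n
scores = alpha @ K                      # sum_l alpha_rl K(x, z_l)

assert np.allclose(np.exp(xi * out), scores)   # exp(xi*MEX) = kernel sum
assert np.argmax(out) == np.argmax(scores)     # same predicted class
```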
SimNets and kernel machines: a basic neural-network analogy (input → hidden layer → output)
With weighted similarities (i.e. when u_l are not fixed), the basic SimNet is no longer a kernel machine.
$\hat{y}(x) = \operatorname{argmax}_{r=1,\dots,k} \max_{l\in[n]}\{u_l^\top \phi(x, z_l) + b_{rl}\}$
Then, up to boundary conditions: $A_r = \bigcup_{l\in[n]} A_{r,l}$, where $A_r$ is the decision region of class r and $A_{r,l}$ is the region in which class r is predicted with the maximum attained at index l.
With unweighted linear similarity (φ(x, z)_i = x_i z_i, u_l = 1), the sets

$A^{r,l}_{r',l'} = \{x : \langle x, z_l\rangle + b_{rl} \geq \langle x, z_{l'}\rangle + b_{r'l'}\}$

are half-spaces. $A_{r,l}$ are intersections of half-spaces (polytopes), and the decision region $A_r$ is thus a union of n polytopes.

With unweighted l2-similarity (φ(x, z)_i = −(x_i − z_i)², u_l = 1):

$A^{r,l}_{r',l'} = \{x : -\|x - z_l\|_2^2 + b_{rl} \geq -\|x - z_{l'}\|_2^2 + b_{r'l'}\}$
$\phantom{A^{r,l}_{r',l'}} = \{x : 2\langle x, z_l\rangle - \|z_l\|_2^2 + b_{rl} \geq 2\langle x, z_{l'}\rangle - \|z_{l'}\|_2^2 + b_{r'l'}\}$

i.e. $A^{r,l}_{r',l'}$ are again half-spaces, $A_{r,l}$ are polytopes and $A_r$ is a union of n polytopes.

Adding weights to the l2-similarity (u_l no longer fixed) converts $A^{r,l}_{r',l'}$ from half-spaces into regions bounded by quadratic hypersurfaces, since the weighted quadratic terms no longer cancel.
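A small sketch confirming the algebra above: with unweighted l2 similarity the pairwise score difference is affine in x, so each $A^{r,l}_{r',l'}$ is indeed a half-space (random values, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
zl, zlp = rng.normal(size=d), rng.normal(size=d)   # templates z_l, z_l'
brl, brplp = 0.3, -0.7                              # offsets b_rl, b_r'l'

def score_diff(x):
    # (-||x - z_l||^2 + b_rl) - (-||x - z_l'||^2 + b_r'l')
    return (-np.sum((x - zl)**2) + brl) - (-np.sum((x - zlp)**2) + brplp)

def affine_form(x):
    # 2<x, z_l - z_l'> - ||z_l||^2 + ||z_l'||^2 + b_rl - b_r'l'
    return 2 * x @ (zl - zlp) - zl @ zl + zlp @ zlp + brl - brplp

x = rng.normal(size=d)
assert np.isclose(score_diff(x), affine_form(x))   # boundary is a hyperplane
```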
“Locality-sharing-pooling SimNet”:

[Figure: input → similarity → pooling → output; sim(i, j, l) = u_l^⊤ φ(x_ij, z_l); pool(p_h, p_w, l) = MEX_{ξ₁}{sim(i, j, l) : (i, j) s.t. q(i, j) = (p_h, p_w)}; out(r) = MEX_{ξ₂}{pool(p_h, p_w, l) + b_{r,l,p_h,p_w}}_{p_h,p_w,l}]
Three layers:
1 Similarity layer
2 MEX layer for pooling: 2D input blocks, c_t terms omitted, offsets zeroed.
3 MEX layer for classification: densely connected, c_t terms omitted, offsets serve for classification.
SimNets and kernel machines: a basic 3-layer SimNet with locality, sharing and pooling
As before, we set u_l = 1 and denote the kernel function exp{ξ · 1^⊤ φ(x, z)} by K(x, z). The classification becomes:

$\hat{y}(\mathrm{inp}) = \operatorname{argmax}_{r=1,\dots,k} \sum_{p_h,p_w,l} \alpha_{r l p_h p_w} \sum_{i,j:\,q(i,j)=(p_h,p_w)} K(x_{ij}, z_l)$
where:
K(·,·) is a kernel function.
Classified instance X contains the concatenation of all input patches.
Support vectors Z_{l,p_h,p_w} are concatenations of vectors subject to:
“Sharing” constraint: entries that correspond to the pool index (p_h, p_w) contain copies of z_l.
“Locality” constraint: entries that do not correspond to the pool index (p_h, p_w) contain “null values”.
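As a numeric sanity check, here is a sketch (random data; it assumes the same ξ in both MEX layers, u_l = 1 and the l2 similarity) verifying that the 3-layer network output is a monotone transform of the double kernel sum above, with α_{r,l,p_h,p_w} = exp(ξ · b_{r,l,p_h,p_w})/(P·n·m):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d, xi = 4, 3, 5, 1.5         # templates, classes, patch dim, shared xi
P, m = 2, 6                         # pooling regions, patches per region
Z = rng.normal(size=(n, d))         # templates z_l
B = rng.normal(size=(k, P, n))      # classification offsets b_{r,p,l}
X = rng.normal(size=(P, m, d))      # input patches x_ij grouped by region

# Layer 1: unweighted l2 similarity, sim[p, i, l] = -||x_pi - z_l||^2
sim = -((X[:, :, None, :] - Z[None, None, :, :]) ** 2).sum(-1)   # (P, m, n)

# Layer 2: MEX pooling over each region (offsets zeroed, c_t omitted)
pool = np.log(np.mean(np.exp(xi * sim), axis=1)) / xi            # (P, n)

# Layer 3: dense MEX classification over all (p, l)
out = np.log(np.mean(np.exp(xi * (pool[None] + B)), axis=(1, 2))) / xi

# Equivalent kernel machine with K(x, z) = exp(-xi * ||x - z||^2)
K = np.exp(xi * sim)                        # K(x_pi, z_l), shape (P, m, n)
alpha = np.exp(xi * B) / (P * n * m)        # alpha_{r,p,l}
scores = np.einsum('kpn,pmn->k', alpha, K)  # sum_{p,l} alpha sum_{ij in p} K

assert np.allclose(np.exp(xi * out), scores)
assert np.argmax(out) == np.argmax(scores)
```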
Other SimNet settings – global average pooling

[Figure: input → similarity → shared MEX offsets → output; sim(i, j, l) = u_l^⊤ φ(x_ij, z_l); label(i, j, r) = MEX_{ξ₁}{sim(i, j, l) + b_{r,l,q(i,j)}}_{l=1}^{n}; out(r) = MEX_{ξ₂}{label(i, j, r)}_{i,j}]

Three layers:
1 Similarity layer
2 MEX layer with offsets shared across pool regions, computing per-location class scores label(i, j, r)
3 MEX layer pooling the per-location class scores into out(r) – global average pooling¹ as ξ₂ → 0

¹ Lin, Chen, Yan. “Network in network”.
The exact same rationale and resulting initialization scheme presented for
l2 -similarity apply also to the case of l1 -similarity, the only difference being
the replacement of the GMM with a mixture of Laplacian distributions,
where each Laplacian has statistically independent coordinates.
[Figure: similarity layer followed by a MEX layer with per-location offsets; sim(i, j, l) = u_l^⊤ φ(x_ij, z_l); mex(t) = MEX_ξ({sim(i, j, l) + b_{t,(i,j,l)}}_{(i,j,l)∈block(t)}, c_t)]
This suggests estimating the mixture separately for each location, and
calculating offsets such that when appended to the outputs of the similarity
layer, the probabilistic heat maps take into account location dependency.
The computed offsets serve for initialization of the MEX layer’s offsets.
Experiments

² Coates et al. “An analysis of single-layer networks in unsupervised feature learning”.
Evaluated SimNet

[Figure: 32×32×3 input; 6×6×3 patches x_ij, stride 1 → 27×27×n similarity maps sim(i, j, l) = u_l^⊤ φ(x_ij, z_l) (weighted l2/l1 similarity); per-location class scores label(i, j, r) = MEX_{ξ₁}{sim(i, j, l) + b_{r,l,q(i,j)}}_{l=1}^{n}, with q(i, j) ∈ {1, 2}² indexing 2×2 shared-offset regions; out(r) = MEX_{ξ→0}{label(i, j, r)}_{1≤i,j≤27}, i.e. global average pooling]
Evaluated ConvNet

[Figure: same topology; conv(i, j, l) = ⟨x_ij, z_l⟩; relu(i, j, l) = max{conv(i, j, l), 0}; pool(p_h, p_w, l) = max{relu(i, j, l) : 1 ≤ i, j ≤ 27, q(i, j) = (p_h, p_w)}; out(r) = ⟨pool, w_r⟩]
[Figure: evaluated single-layer feature-encoding network of Coates et al.²; code(i, j, l) = max{mean_{l′} ‖x_ij − z_{l′}‖₂ − ‖x_ij − z_l‖₂, 0}; pool(p_h, p_w, l) aggregates code(i, j, l) over {(i, j) : 1 ≤ i, j ≤ 27, q(i, j) = (p_h, p_w)}; out(r) = ⟨pool, w_r⟩]
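For reference, a sketch of the “triangle” K-means encoding used by the third evaluated network (following the formula above; the function name is illustrative):

```python
import numpy as np

def triangle_encode(x, Z):
    """code_l = max{ mean_{l'} ||x - z_l'||_2 - ||x - z_l||_2, 0 }

    Coates et al.'s soft K-means feature encoding: centroids closer
    than the average distance get a positive code, the rest get 0.
    """
    d = np.linalg.norm(x - Z, axis=1)    # distances to all centroids z_l
    return np.maximum(d.mean() - d, 0.0)

Z = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
print(triangle_encode(np.array([0.5, 0.5]), Z))   # [1.414, 1.414, 0.0]
```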
Results
Conclusions
Summary
The SimNets architecture consists of two basic building blocks:
Similarity operator: generalizes the ConvNet convolutional operator.
MEX operator: generalizes ConvNet ReLU activation and pooling, but allows much more...
Summary (cont’d)
Thank You