Deep Architectures in Vision - Alexnet to ResNet, Transfer learning, Siamese Networks, Metric
Learning, Ranking/Triplet loss, RCNNs, CNN-RNN, Applications in captioning and video tasks,
3D CNNs
Image classification is the task of classifying a given image into one of the pre-defined
categories.
The traditional pipeline for image classification involves two modules: feature
extraction and classification.
Feature extraction involves extracting a higher level of information from raw pixel values
that can capture the distinction among the categories involved.
This feature extraction is done in an unsupervised manner wherein the classes of the
image have nothing to do with information extracted from pixels.
Some of the traditional and widely used features are GIST, HOG, SIFT, LBP etc. After
the feature is extracted, a classification module is trained with the images and their
associated labels.
The problem with this pipeline is that feature extraction cannot be tweaked according to
the classes and images.
So if the chosen feature lacks the representation required to distinguish the categories, the
accuracy of the classification model suffers a lot, irrespective of the type of classification
strategy employed.
A common theme among state-of-the-art methods following the traditional pipeline has been to
pick multiple feature extractors and combine them inventively to get a better feature.
But this involves too many heuristics, as well as manual labor to tune parameters
according to the domain, to reach a decent level of accuracy.
Another problem with this method is that it is completely different from how we humans
learn to recognize things.
Just after birth, a child is incapable of perceiving their surroundings, but as they grow
and process data, they learn to identify things. This is the philosophy behind deep
learning, wherein no hard-coded feature extractor is built in.
It combines the extraction and classification modules into one integrated system that
learns to extract discriminative representations from the images and to classify them
based on supervised data.
One such system is the multilayer perceptron, also known as a neural network, which consists of
multiple layers of neurons densely connected to each other.
A deep vanilla neural network has so many parameters that it is practically impossible to
train without overfitting, because there are rarely enough training examples.
But with Convolutional Neural Networks (ConvNets), the whole network can be trained
from scratch using a large dataset like ImageNet.
The reason is the sharing of parameters between neurons and the sparse
connections in convolutional layers, as can be seen in figure 2. In the convolution
operation, the neurons in one layer are only locally connected to the input neurons, and
the set of parameters is shared across the 2-D feature map.
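To make the effect of parameter sharing concrete, the following sketch counts the weights of a fully connected layer and of a convolutional layer that map the same input to the same output size. The layer sizes (a 32X32 RGB input, 16 output maps, a 3X3 kernel) and the function names are illustrative assumptions, not values taken from any of the networks discussed below.

```python
def dense_params(in_h, in_w, in_c, out_h, out_w, out_c):
    """Weights + biases when every output unit is connected to every input unit."""
    n_in = in_h * in_w * in_c
    n_out = out_h * out_w * out_c
    return n_in * n_out + n_out


def conv_params(kernel, in_c, out_c):
    """Weights + biases of a kernel x kernel convolution.

    The same kernel is reused at every spatial position, so the count
    does not depend on the height or width of the feature map.
    """
    return kernel * kernel * in_c * out_c + out_c


# Assumed example: 32x32x3 input mapped to a 32x32x16 output.
fc_count = dense_params(32, 32, 3, 32, 32, 16)
conv_count = conv_params(3, 3, 16)
print(fc_count, conv_count)
```

The convolutional layer needs only a few hundred parameters where the dense layer needs tens of millions, which is why ConvNets can be trained from scratch on datasets of realistic size.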
Accuracy:
If you are building an intelligent machine, it is critical that it be as
accurate as possible. One fair objection here is that accuracy depends not only on
the network but also on the amount of data available for training. Hence, these networks
are compared on a standard dataset called ImageNet.
Computation:
Most ConvNets have huge memory and computation requirements, especially while
training. Hence, this becomes an important concern.
Similarly, the size of the final trained model becomes important if you are
looking to deploy a model to run locally on a mobile device. As you can guess, it takes a more
computationally intensive network to achieve higher accuracy. So, there is always a
trade-off between accuracy and computation.
Apart from these, there are many other factors, such as ease of training and the ability of a
network to generalize well. The networks described below are the most popular ones
and are presented in the order in which they were published; each also improved on the
accuracy of the earlier ones.
AlexNet
This architecture was one of the first deep networks to improve ImageNet classification
accuracy by a significant margin over traditional methodologies.
It is composed of 5 convolutional layers followed by 3 fully connected layers.
AlexNet, proposed by Alex Krizhevsky, uses ReLU (Rectified Linear Unit) for the
non-linear part, instead of the tanh or sigmoid functions that were the earlier standard for
traditional neural networks. ReLU is given by
f(x) = max(0,x)
The advantage of ReLU over sigmoid is that it trains much faster,
because the derivative of sigmoid becomes very small in the saturating region and
therefore the updates to the weights almost vanish. This is called the vanishing gradient
problem.
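The difference in gradient behaviour can be seen numerically. The sketch below uses NumPy and illustrative function names; it evaluates the derivative of the sigmoid and of ReLU at a few sample inputs chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); at most 0.25, near 0 when saturated
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return np.maximum(0.0, x)


def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (the value at exactly 0 is a convention)
    return (x > 0).astype(float)


x = np.array([-10.0, -1.0, 0.5, 10.0])
print("sigmoid'(x):", sigmoid_grad(x))
print("relu'(x):   ", relu_grad(x))
```

For large positive inputs the sigmoid derivative is close to zero while the ReLU derivative stays at one, so the weight updates do not shrink in that region.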
In the network, a ReLU layer is placed after every convolutional and fully-connected
(FC) layer.
Another problem that this architecture addressed was over-fitting, which it reduced by using a
Dropout layer after every FC layer.
A Dropout layer has a probability p associated with it and is applied to every neuron of
the response map separately. It randomly switches off the activation with
probability p.
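A minimal sketch of such a layer is given below. It assumes the "inverted dropout" variant, in which the surviving activations are rescaled by 1/(1 − p) during training so that nothing has to change at test time; this rescaling convention is an assumption of the sketch and is not stated in the text above. The function name and the random generator argument are illustrative.

```python
import numpy as np


def dropout(activations, p, rng, training=True):
    """Switch each activation off independently with probability p.

    Inverted dropout (assumed here): kept activations are scaled by
    1 / (1 - p) so the expected value of the output matches the input.
    Requires 0 <= p < 1.
    """
    if not training or p == 0.0:
        return activations
    keep = rng.random(activations.shape) >= p  # True with probability 1 - p
    return activations * keep / (1.0 - p)


rng = np.random.default_rng(0)
x = np.ones(1000)
y = dropout(x, p=0.5, rng=rng)
print(y[:10], y.mean())
```

At test time the function is called with `training=False` and returns its input unchanged.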
Why does DropOut work?
The idea behind dropout is similar to model ensembles. Due to the dropout layer,
each different set of switched-off neurons represents a different architecture, and all
these architectures are trained in parallel, with a weight given to each subset and
the weights summing to one. For n neurons attached to DropOut, the number of
subset architectures formed is 2^n.
So it amounts to the prediction being averaged over this ensemble of models. This
provides a structured model regularization which helps avoid over-fitting.
Another view of why DropOut helps is that, since neurons are switched off at random, they
tend to avoid developing co-adaptations among themselves, which enables them to
develop meaningful features independent of others.
VGG16
This architecture is from the VGG group at Oxford. It improves on AlexNet
by replacing large kernel-sized filters (11 and 5 in the first and second convolutional
layers, respectively) with multiple 3X3 kernel-sized filters one after another.
For a given receptive field (the effective area of the input image on which an output
depends), multiple stacked smaller kernels are better than a single larger
kernel, because multiple non-linear layers increase the depth of the network, which
enables it to learn more complex features, and at a lower cost.
For example, three 3X3 filters stacked on top of each other with stride 1 have a receptive field of 7,
but the number of parameters involved is 3*(9C^2) = 27C^2, in comparison to 49C^2 parameters
for a kernel of size 7.
Here, it is assumed that the number of input and output channels of each layer is C. Also, 3X3
kernels help in retaining finer-level properties of the image. The network architecture is
given in the table.
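The receptive field and parameter counts quoted above can be checked with a few lines. The helper names are illustrative, biases are ignored, and C = 64 is only an example value.

```python
def stacked_receptive_field(kernel, n_layers):
    """Receptive field of n stacked stride-1 convolutions with the same kernel size."""
    return 1 + n_layers * (kernel - 1)


def conv_weights(kernel, channels, n_layers=1):
    """Weight count (biases ignored) with C input and C output channels per layer."""
    return n_layers * kernel * kernel * channels * channels


C = 64  # example channel count
print(stacked_receptive_field(3, 3))  # receptive field of three stacked 3x3 layers
print(conv_weights(3, C, n_layers=3))  # 27 * C^2
print(conv_weights(7, C))              # 49 * C^2
```

The stacked design covers the same 7X7 window with roughly 45% fewer weights and two extra non-linearities.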
You can see that in VGG-D, there are blocks in which the same filter size is applied multiple times
to extract more complex and representative features. This concept of blocks/modules
became a common theme in the networks after VGG.
The VGG convolutional layers are followed by 3 fully connected layers. The width of the
network starts at a small value of 64 and increases by a factor of 2 after every
sub-sampling/pooling layer. It achieves a top-5 accuracy of 92.3% on ImageNet.
GoogLeNet/Inception:
While VGG achieves a phenomenal accuracy on the ImageNet dataset, its deployment on
even modestly sized GPUs is a problem because of its huge computational
requirements, both in terms of memory and time.
It becomes inefficient due to the large width of its convolutional layers.
For instance, for a convolutional layer with 3X3 kernel size which takes 512 channels as
input and outputs 512 channels, the order of calculations per output location is 9X512X512.
In a convolutional operation at one location, every output channel is connected to every
input channel, and so we call it a dense connection architecture.
GoogLeNet builds on the idea that most of the activations in a deep network are
either unnecessary (value of zero) or redundant because of correlations between them.
Therefore the most efficient architecture of a deep network will have sparse connections
between the activations, which implies that not all 512 output channels will have
a connection with all 512 input channels.
There are techniques to prune out such connections, which would result in sparse
weights/connections. But kernels for sparse matrix multiplication are not optimized in
BLAS or cuBLAS (the CUDA BLAS library for GPUs), which renders them even slower than
their dense counterparts.
So GoogLeNet devised a module, called the inception module, that approximates a sparse
CNN with a normal dense construction (shown in the figure). Since only a small number
of neurons are effective, as mentioned earlier, the width (number of convolutional
filters) for a particular kernel size is kept small.
Also, it uses convolutions of different sizes to capture details at varied scales (5X5, 3X3,
1X1).
Another salient point about the module is that it has a so-called bottleneck layer (the 1X1
convolutions in the figure). It helps in the massive reduction of the computation
requirement, as explained below.
Let us take the first inception module of GoogLeNet as an example, which has 192
channels as input. It has just 128 filters of 3X3 kernel size and 32 filters of 5X5 size.
The order of computation for the 5X5 filters is 25X32X192, which can blow up as we go
deeper into the network, when the width of the network and the number of 5X5 filters
further increase.
In order to avoid this, the inception module uses 1X1 convolutions before applying larger
kernels, to reduce the number of input channels fed into those
convolutions.
So in the first inception module, the input to the module is first fed into 1X1 convolutions
with just 16 filters before it is fed into the 5X5 convolutions. This reduces the computations
to 16X192 + 25X32X16. All these changes allow the network to have a large width and
depth.
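The saving from the bottleneck can be worked out directly from the numbers above. The sketch counts multiplications per output location only; the function names are illustrative.

```python
def direct_cost(kernel, in_c, out_c):
    """Multiplications per output location for a kernel x kernel convolution."""
    return kernel * kernel * in_c * out_c


def bottleneck_cost(kernel, in_c, mid_c, out_c):
    """Cost of a 1x1 reduction to mid_c channels followed by the larger convolution."""
    return in_c * mid_c + kernel * kernel * mid_c * out_c


# 5x5 branch of the first inception module: 192 input channels, 32 filters,
# with a 1x1 bottleneck of 16 filters.
without = direct_cost(5, 192, 32)          # 25 * 32 * 192
with_1x1 = bottleneck_cost(5, 192, 16, 32)  # 16 * 192 + 25 * 32 * 16
print(without, with_1x1, without / with_1x1)
```

For this branch the bottleneck cuts the cost by roughly a factor of ten.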
Another change that GoogLeNet made was to replace the fully-connected layers at the
end with a simple global average pooling, which averages the values of each channel across
its 2D feature map after the last convolutional layer.
This drastically reduces the total number of parameters. To see why, consider
AlexNet, where the FC layers contain approximately 90% of the parameters.
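Global average pooling itself has no parameters, as the NumPy sketch below illustrates. The (channels, height, width) layout and the tiny example tensor are assumptions made for the illustration.

```python
import numpy as np


def global_average_pool(feature_maps):
    """Average each channel over its spatial positions.

    feature_maps: array of shape (channels, height, width)
    returns:      array of shape (channels,)
    """
    return feature_maps.mean(axis=(1, 2))


# Two 2x3 feature maps holding the values 0..5 and 6..11.
maps = np.arange(12, dtype=float).reshape(2, 2, 3)
print(global_average_pool(maps))
```

Each feature map is reduced to a single number, so the classifier that follows sees one value per channel regardless of the spatial size of the maps.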
Use of a large network width and depth allows GoogLeNet to remove the FC layers
without affecting the accuracy. It achieves 93.3% top-5 accuracy on ImageNet and is
much faster than VGG.
Residual Networks
As per what we have seen so far, increasing the depth should increase the accuracy of the
network, as long as over-fitting is taken care of.
But the problem with increased depth is that the signal required to change the weights,
which arises at the end of the network from comparing the ground truth and the prediction,
becomes very small at the earlier layers.
It essentially means that the earlier layers hardly learn at all. This is
called the vanishing gradient problem.
The second problem with training deeper networks is that the optimization is performed over a
huge parameter space, so naively adding layers leads to higher training
error. This is called the degradation problem.
Residual networks allow training of such deep networks by constructing the network
from modules called residual modules, as shown in the figure. Each module adds its input
to the output of its stacked layers, so those layers only need to learn a residual
correction to the identity mapping.
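The skip connection of a residual module can be sketched in NumPy as follows. For brevity the sketch uses two fully connected layers where the real modules use convolutions; the function names and the weight shapes are illustrative assumptions.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x).

    The input x is added back to the output of the stacked layers,
    so the layers only have to model the difference from the identity.
    """
    f = w2 @ relu(w1 @ x)
    return relu(f + x)


x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))
# With all-zero weights the block passes its (non-negative) input through unchanged.
print(residual_block(x, zero, zero))
```

Because a block with zero weights reduces to the identity, adding such blocks to a network should not, in principle, make the training error worse.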
TRANSFER LEARNING
Most of the time in Machine Learning, features are hand-crafted by researchers
and domain experts. Fortunately, Deep Learning can extract features automatically. Note
that this does not mean that feature engineering and domain knowledge are no longer important,
because you still have to decide which features you put into your network.
But neural networks have the ability to learn which of the features you have put in are
really important and which are not. A representation learning algorithm can discover a
good combination of features within a very short timeframe, even for complex tasks which
would otherwise require a lot of human effort.
The learned representation can then be used for other problems as well. You simply use
the first layers to obtain the right representation of features, but you don’t use the output of
the network because it is too task-specific.
Simply feed data into your network and use one of the intermediate layers as the output
layer. This layer can then be interpreted as a representation of the raw data.
This approach is mostly used in Computer Vision because it can reduce the dimensionality of your data,
which decreases computation time and makes the data more suitable for traditional algorithms as well.
SIAMESE NETWORKS
In Siamese networks, we take an input image of a person and compute the encoding of that
image. Then we take the same network, without performing any updates on weights or biases,
input an image of a different person, and again compute its encoding. Now we compare these two
encodings to check whether there is a similarity between the two images. These two encodings act
as latent feature representations of the images. Images of the same person have similar
features/encodings. Using this, we compare and tell whether the two images show the same person or
not.
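The comparison step can be sketched as below. The encoder here is a hypothetical single linear layer followed by tanh, standing in for a real CNN; the Euclidean distance, the threshold value and all names are assumptions of the sketch.

```python
import numpy as np


def encode(image, weights):
    """Hypothetical encoder: one linear layer followed by tanh."""
    return np.tanh(weights @ image.ravel())


def same_person(img_a, img_b, weights, threshold):
    """Encode both images with the SAME weights and compare the encodings."""
    distance = np.linalg.norm(encode(img_a, weights) - encode(img_b, weights))
    return distance < threshold, distance


rng = np.random.default_rng(0)
weights = 0.1 * rng.standard_normal((8, 16))  # shared by both branches
img_a = rng.random((4, 4))
img_b = rng.random((4, 4))

print(same_person(img_a, img_a, weights, threshold=0.5))
print(same_person(img_a, img_b, weights, threshold=0.5))
```

The important design point is that both images pass through one set of weights, so the two encodings live in the same feature space and their distance is meaningful.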
Triplet Loss
You can train the network by taking an anchor image and comparing it with both a positive sample
and a negative sample. The dissimilarity between the anchor image and the positive image must be low,
and the dissimilarity between the anchor image and the negative image must be high.
The triplet loss function, from which the gradients are calculated, is
L = max(d(a,p) − d(a,n) + margin, 0). The
variable “a” represents the anchor image, “p” represents a positive image and “n” represents a
negative image. We know that the dissimilarity between a and p should be less than the
dissimilarity between a and n. Another variable called margin, which is a hyperparameter, is
added to the loss equation. The margin defines how far apart the dissimilarities should be, i.e. if
margin = 0.2 and d(a,p) = 0.5 then d(a,n) should be at least 0.7. The margin helps us
distinguish the two images better.
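The margin example can be checked numerically. The sketch below assumes the loss has the form max(d(a,p) − d(a,n) + margin, 0) and takes the two distances as plain numbers; the function name is illustrative.

```python
def triplet_loss(d_ap, d_an, margin):
    """Triplet loss for one (anchor, positive, negative) triplet.

    d_ap: dissimilarity between anchor and positive
    d_an: dissimilarity between anchor and negative
    """
    return max(d_ap - d_an + margin, 0.0)


# margin = 0.2 and d(a,p) = 0.5: the loss vanishes once d(a,n) reaches 0.7.
for d_an in (0.4, 0.6, 0.7, 0.9):
    print(d_an, triplet_loss(0.5, d_an, margin=0.2))
```

Negatives closer than 0.7 still produce a positive loss and therefore a gradient; negatives farther away contribute nothing (up to floating-point rounding at the boundary).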
Therefore, by using this loss function we calculate the gradients, and with the help of the
gradients we update the weights and biases of the Siamese network. For training the network, we
take an anchor image, randomly sample positive and negative images, compute the loss,
and update the weights.
METRIC LEARNING
Metric learning aims to learn a distance function to measure the similarity of samples,
which plays an important role in many visual understanding applications.
Generally, the optimal similarity functions for different visual understanding tasks are
task specific because the distributions for data used in different tasks are usually
different.
It is generally believed that learning a metric from training data can obtain more
encouraging performances than handcrafted metrics e.g., the Euclidean and cosine
distances.
A variety of metric learning methods have been proposed in the literature, and many of
them have been successfully employed in visual understanding tasks such as face
recognition, image classification, visual search, visual tracking, person reidentification,
cross-modal matching, image set classification and image-based geolocalization.
Metric learning techniques are usually classified into two categories: unsupervised and
supervised. Unsupervised metric learning attempts to learn a low-dimensional subspace
to preserve the useful geometrical information of the samples.
Supervised metric learning, which is the mainstream metric learning technique and the
focus in this article, seeks an appropriate metric by formulating an optimization objective
function to exploit supervised information of the training samples, where the objective
functions are designed for different specific tasks.
However, most conventional metric learning methods learn a linear mapping to
project samples into a new feature space, and therefore cannot capture the nonlinear
relationships among data points.
While the kernel trick can be adopted to address this nonlinearity problem, this type of
method suffers from the scalability problem because the kernel trick has two major
issues:
1) choosing a kernel is typically difficult and quite empirical and
2) the expression power of kernel functions is often not flexible enough to capture the
nonlinearity in the data.
Motivated by the fact that deep learning is an effective solution to model the nonlinearity
of samples, several deep metric learning (DML) methods have been proposed in recent
years.
The key idea of DML is to explicitly learn a set of hierarchical nonlinear transformations
that map data points into another feature space for comparing or matching, by exploiting the
architecture of neural networks in deep learning. This unifies feature learning and metric
learning into a joint learning framework.
The goal of this article is to provide an overview of recent advances in DML techniques
and their various applications in different visual understanding tasks.
TRIPLET LOSS
Usually in supervised learning we have a fixed number of classes and train the network using the
softmax cross entropy loss. However in some cases we need to be able to have a variable number
of classes. In face recognition for instance, we need to be able to compare two unknown faces
and say whether they are from the same person or not.
Triplet loss in this case is a way to learn good embeddings for each face. In the embedding space,
faces from the same person should be close together and form well separated clusters.
Two examples with the same label should have their embeddings close together in the embedding
space.
Two examples with different labels should have their embeddings far apart.
However, we don’t want to push the train embeddings of each label to collapse into very small
clusters. The only requirement is that given two positive examples of the same class and one
negative example, the negative should be farther away than the positive by some margin. This is
very similar to the margin used in SVMs, and here we want the clusters of each class to be
separated by the margin. To formalise this requirement, the loss will be defined over triplets of
embeddings:
an anchor
a positive of the same class as the anchor
a negative of a different class
For some distance d on the embedding space, the loss of a triplet (a, p, n) is:
L = max(d(a,p) − d(a,n) + margin, 0)
We minimize this loss, which pushes d(a,p) to 0 and d(a,n) to be greater
than d(a,p) + margin. As soon as n becomes an “easy negative”, the loss becomes
zero.
Triplet mining
Based on the definition of the loss, there are three categories of triplets:
easy triplets: triplets which have a loss of 0,
because d(a,p) + margin < d(a,n)
hard triplets: triplets where the negative is closer to the anchor than the positive,
i.e. d(a,n) < d(a,p)
semi-hard triplets: triplets where the negative is not closer to the anchor than the positive,
but which still have positive loss: d(a,p) < d(a,n) < d(a,p) + margin
Each of these definitions depends on where the negative is, relative to the anchor and positive.
We can therefore extend these three categories to the negatives: hard negatives, semi-hard
negatives or easy negatives.
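The three categories translate directly into a small classification function. The function name and the sample distances are illustrative; ties at the boundaries are assigned to the semi-hard case, which is a choice made for this sketch.

```python
def triplet_category(d_ap, d_an, margin):
    """Classify a triplet from its two distances and the margin."""
    if d_ap + margin < d_an:
        return "easy"       # loss is 0
    if d_an < d_ap:
        return "hard"       # negative closer to the anchor than the positive
    return "semi-hard"      # negative farther than the positive, but inside the margin


print(triplet_category(0.5, 1.0, margin=0.2))
print(triplet_category(0.5, 0.3, margin=0.2))
print(triplet_category(0.5, 0.6, margin=0.2))
```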
For a batch of PK examples (P classes with K examples each), there are two strategies for
selecting triplets:
batch all: select all the valid triplets, and average the loss on the hard and semi-hard triplets.
o a crucial point here is to not take into account the easy triplets (those with loss 0), as
averaging on them would make the overall loss very small
o this produces a total of PK(K−1)(PK−K) triplets
(PK anchors, K−1 possible positives per anchor, PK−K possible
negatives)
batch hard: for each anchor, select the hardest positive (biggest distance d(a,p)) and
the hardest negative (smallest distance d(a,n)) among the batch
o this produces PK triplets
o the selected triplets are the hardest among the batch
According to the paper that proposed these strategies, the batch hard strategy yields the best performance:
Additionally, the selected triplets can be considered moderate triplets, since they are the hardest within a small
subset of the data, which is exactly what is best for learning with the triplet loss.
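A minimal NumPy sketch of the batch hard strategy is shown below. It assumes Euclidean distance between embeddings and that every anchor has at least one positive and one negative in the batch; the function names and the toy batch (P = 2 classes, K = 2 examples each) are illustrative.

```python
import numpy as np


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of rows."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def batch_hard_triplet_loss(embeddings, labels, margin):
    """One triplet per anchor: farthest positive and closest negative."""
    d = pairwise_distances(embeddings)
    same = labels[:, None] == labels[None, :]
    # Hardest positive: largest distance among same-label examples
    # (the anchor itself is included but has distance 0).
    hardest_pos = np.where(same, d, 0.0).max(axis=1)
    # Hardest negative: smallest distance among different-label examples.
    hardest_neg = np.where(same, np.inf, d).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()


emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])
print(batch_hard_triplet_loss(emb, labels, margin=0.5))
print(batch_hard_triplet_loss(emb, labels, margin=3.0))
```

With the small margin the two clusters are already separated and the loss is zero; with the larger margin some anchors violate it and contribute to the loss.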
RCNN
In R-CNN, about 2000 candidate region proposals are extracted from the image using selective
search. These region proposals are warped into a square and fed into a convolutional
neural network that produces a 4096-dimensional feature vector as output.
The CNN acts as a feature extractor: the output dense layer holds the features
extracted from the image, and these features are fed into an SVM to classify the
presence of the object within that candidate region proposal.
In addition to predicting the presence of an object within the region proposals, the algorithm
also predicts four offset values to increase the precision of the bounding box.
For example, given a region proposal, the algorithm might predict the presence of a
person, but the face of that person within that region proposal could have been cut in half.
The offset values therefore help in adjusting the bounding box of the region proposal.
R-CNN cannot be run in real time, as it takes around 47 seconds for each test image.
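How four offset values adjust a box can be sketched as follows. The text above does not say how the offsets are parameterized; the sketch assumes the common convention in which the centre shifts are expressed relative to the box size and the width and height are scaled exponentially. The function name and the sample numbers are illustrative.

```python
import math


def apply_offsets(box, offsets):
    """Adjust a region proposal with four predicted offsets.

    box:     (cx, cy, w, h)   centre coordinates, width and height
    offsets: (tx, ty, tw, th) assumed parameterization:
             centre shift in units of box size, log-scale for width and height
    """
    cx, cy, w, h = box
    tx, ty, tw, th = offsets
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)
print(apply_offsets(proposal, (0.0, 0.0, 0.0, 0.0)))   # unchanged
print(apply_offsets(proposal, (0.5, 0.0, 0.0, 0.0)))   # centre moved right by half a width
```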
Faster R-CNN
Both of the above algorithms (R-CNN & Fast R-CNN) use selective search to find the
region proposals. Selective search is a slow and time-consuming process that affects the
performance of the network.
Therefore, Shaoqing Ren et al. came up with an object detection algorithm that eliminates
the selective search algorithm and lets the network learn the region proposals.
Similar to Fast R-CNN, the image is provided as an input to a convolutional network
which provides a convolutional feature map.
Instead of using the selective search algorithm on the feature map to identify the region
proposals, a separate network is used to predict them. The predicted region
proposals are then reshaped using a RoI pooling layer, whose output is used to classify the
image within the proposed region and predict the offset values for the bounding boxes.
YOLO
YOLO works by taking an image and splitting it into an SxS grid; within each
grid cell we take m bounding boxes. For each bounding box, the network outputs a
class probability and offset values for the bounding box.
The bounding boxes whose class probability is above a threshold value are selected and
used to locate the object within the image.
YOLO is orders of magnitude faster (45 frames per second) than other object detection
algorithms.
The limitation of the YOLO algorithm is that it struggles with small objects within the image;
for example, it might have difficulty detecting a flock of birds. This is due to the
spatial constraints of the algorithm.
RNN-CNN
While deep convolutional neural networks (CNNs) have shown great success in
single-label image classification, it is important to note that real-world images generally contain
multiple labels, which could correspond to different objects, scenes, actions and attributes
in an image.
Traditional approaches to multi-label image classification learn independent classifiers
for each category and employ ranking or thresholding on the classification results. These
techniques, although working well, fail to explicitly exploit the label dependencies in an
image.
In this paper, we utilize recurrent neural networks (RNNs) to address this problem.
Combined with CNNs, the proposed CNN-RNN framework learns a joint image-label
embedding to characterize the semantic label dependency as well as the image-label
relevance, and it can be trained end-to-end from scratch to integrate both information in a
unified framework.
Since we aim to characterize the high-order label correlation, we employ long short-term
memory (LSTM) neurons [15] as our recurrent neurons, which have been demonstrated to
be a powerful model of long-term dependency.
Long Short Term Memory Networks (LSTM)
RNN [15] is a class of neural network that maintains internal hidden states to model
the dynamic temporal behaviour of sequences with arbitrary lengths through directed
cyclic connections between its units.
It can be considered an extension of the hidden Markov model that employs a nonlinear
transition function and is capable of modeling long-term temporal dependencies.
LSTM extends RNN by adding three gates to an RNN neuron: a forget gate f to
control whether to forget the current state; an input gate i to indicate if it should read
the input; an output gate o to control whether to output the state.
These gates enable LSTM to learn long-term dependency in a sequence, and make it
easier to optimize, because these gates help the input signal to effectively propagate
through the recurrent hidden states r(t) without affecting the output.
LSTM also effectively deals with the gradient vanishing/exploding issues that
commonly appear during RNN training [26].
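The role of the three gates can be illustrated with one step of a standard LSTM cell in NumPy. This is the textbook formulation with a tanh candidate state; it is an assumption of this sketch and may differ in detail from the variant used in the framework described here. The weight layout (one stacked matrix W) and all names are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    x:       input vector, shape (n_in,)
    h_prev:  previous output, shape (n_hidden,)
    c_prev:  previous cell state, shape (n_hidden,)
    W:       stacked weights, shape (4 * n_hidden, n_in + n_hidden)
    b:       stacked biases, shape (4 * n_hidden,)
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much of the old state to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much of the new candidate to read
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much of the state to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.standard_normal((4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hidden), np.zeros(n_hidden), W, b)
print(h, c)
```

Because the cell state is updated additively (f * c_prev + i * g), information and gradients can pass through many time steps when the forget gate stays close to one.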
where δ(.) is an activation function, ⊙ is the product with gate value, and various W matrices are
learned parameters. In our implementation, we employ rectified linear units (ReLU) as the
activation function [4].
The illustration of the CNN-RNN framework is shown in Fig. 4. It contains two parts: the
CNN part extracts semantic representations from images; the RNN part models the
image/label relationship and label dependency.
We decompose a multi-label prediction as an ordered prediction path. For example, labels
“zebra” and “elephant” can be decomposed as either (“zebra”, “elephant”) or (“elephant”,
“zebra”).
The probability of a prediction path can be computed by the RNN network. The image,
label, and recurrent representations are projected to the same low dimensional space to
model the image-text relationship as well as the label redundancy.
The RNN model is employed as a compact yet powerful representation of the label
co-occurrence dependency in this space. It takes the embedding of the predicted label at
each time step and maintains a hidden state to model the label co-occurrence information.
The a priori probability of a label given the previously predicted labels can be computed
according to their dot products with the sum of the image and recurrent embeddings.
The probability of a prediction path can be obtained as the product of the a priori
probability of each label given the previous labels in the prediction path.
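The path probability is therefore a plain product of conditional probabilities. In the sketch below the conditional values are made-up numbers for the two orderings of the labels “zebra” and “elephant”; in the framework they would come from the network. The function name is illustrative.

```python
import math


def path_probability(conditional_probs):
    """Product of P(label_t | image, label_1, ..., label_{t-1}) along a prediction path."""
    return math.prod(conditional_probs)


# Hypothetical conditionals:
# P(zebra | I) * P(elephant | I, zebra)
p_zebra_first = path_probability([0.6, 0.5])
# P(elephant | I) * P(zebra | I, elephant)
p_elephant_first = path_probability([0.3, 0.9])
print(p_zebra_first, p_elephant_first)
```

Different orderings of the same label set can receive different probabilities, which is why the prediction is treated as an ordered path.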
A label k is represented as a one-hot vector e_k = [0, ..., 0, 1, 0, ..., 0], which is 1 at the k-th
location, and 0 elsewhere. The label embedding can be obtained by multiplying the
one-hot vector with a label embedding matrix U_l.
The k-th column of U_l is the label embedding of the label k: w_k = U_l · e_k.
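Multiplying by a one-hot vector is equivalent to a table lookup, as the NumPy sketch below shows. It assumes that the embedding matrix is stored with one label per column (shape: embedding dimension × number of labels), so that the product U_l · e_k is well defined; the sizes and names are illustrative.

```python
import numpy as np

num_labels, embed_dim = 5, 3  # hypothetical sizes
rng = np.random.default_rng(0)
U_l = rng.standard_normal((embed_dim, num_labels))  # assumed layout: one column per label


def one_hot(k, n):
    """Vector of length n that is 1 at position k and 0 elsewhere."""
    e = np.zeros(n)
    e[k] = 1.0
    return e


def label_embedding(k):
    """w_k = U_l . e_k"""
    return U_l @ one_hot(k, num_labels)


print(label_embedding(2))
print(U_l[:, 2])  # the same values, read directly from the matrix
```

In practice the lookup form is used, since it avoids building the one-hot vector and the full matrix product.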
The dimension of w_k is usually much smaller than the number of labels. The recurrent
layer takes the label embedding of the previously predicted label, and models the
co-occurrence dependencies in its hidden recurrent states by learning nonlinear functions:
o(t) = h_o(r(t−1), w_k(t)), r(t) = h_r(r(t−1), w_k(t))
where r(t) and o(t) are the hidden states and outputs of the recurrent layer at the time
step t, respectively, w_k(t) is the label embedding of the t-th label in the prediction path,
and h_o(.), h_r(.) are the non-linear RNN functions.
The output of the recurrent layer and the image representation are projected into the same
low-dimensional space as the label embedding: x_t = h(U_o^x o(t) + U_I^x I), where U_o^x
and U_I^x are the projection matrices for the recurrent layer output and the image representation,
respectively.
In a 2D CNN, the value at position (x, y) on the jth feature map in the ith layer is given by
$$v_{ij}^{xy} = \tanh\Big(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\Big)$$
where tanh() is the hyperbolic tangent function, $b_{ij}$ is the bias for this feature map, m indexes
over the set of feature maps in the (i−1)th layer connected to the current feature map, $w_{ijm}^{pq}$ is the
value at the position (p, q) of the kernel connected to the mth feature map, and $P_i$ and $Q_i$ are the
height and width of the kernel, respectively.
In the subsampling layers, the resolution of the feature maps is reduced by pooling over a local
neighborhood on the feature maps in the previous layer, thereby enhancing the invariance to
distortions on the inputs. A CNN architecture can be constructed by stacking multiple layers
of convolution and subsampling in an alternating fashion.
The parameters of a CNN, such as the bias $b_{ij}$ and the kernel weight $w_{ijm}^{pq}$, are usually
learned using either supervised or unsupervised approaches.
3D Convolution
In 2D CNNs, convolutions are applied on the 2D feature maps to compute features from
the spatial dimensions only.
When applied to video analysis problems, it is desirable to capture the motion
information encoded in multiple contiguous frames.
To this end, we propose to perform 3D convolutions in the convolution stages of CNNs
to compute features from both spatial and temporal dimensions.
The 3D convolution is achieved by convolving a 3D kernel to the cube formed by
stacking multiple contiguous frames together.
By this construction, the feature maps in the convolution layer are connected to multiple
contiguous frames in the previous layer, thereby capturing motion information. Formally,
the value at position (x, y, z) on the jth feature map in the ith layer is given by
$$v_{ij}^{xyz} = \tanh\Big(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\Big)$$
where $R_i$ is the size of the 3D kernel along the temporal dimension, and $w_{ijm}^{pqr}$ is the (p, q, r)th
value of the kernel connected to the mth feature map in the previous layer.
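A direct, unoptimized NumPy sketch of this operation is given below for the simplest case of a single input feature map, so the sum over m has one term. The (height, width, time) layout, the function name and the small example volume are assumptions made for illustration.

```python
import numpy as np


def conv3d_valid(volume, kernel, bias=0.0):
    """Naive 3D convolution over a frame cube, followed by tanh.

    volume: stacked frames, shape (H, W, T)
    kernel: 3D kernel, shape (P, Q, R) with R the temporal extent
    Output shape is (H - P + 1, W - Q + 1, T - R + 1): no padding, stride 1.
    """
    H, W, T = volume.shape
    P, Q, R = kernel.shape
    out = np.empty((H - P + 1, W - Q + 1, T - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                patch = volume[x:x + P, y:y + Q, z:z + R]
                out[x, y, z] = np.tanh(bias + np.sum(kernel * patch))
    return out


frames = np.ones((9, 8, 7))                      # seven small frames
kernel = np.full((7, 7, 3), 1.0 / (7 * 7 * 3))   # averaging kernel
features = conv3d_valid(frames, kernel)
print(features.shape)
```

Each output value depends on three consecutive frames, and the temporal dimension shrinks from 7 to 5, which is how motion across neighbouring frames enters the feature map.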
A comparison of 2D and 3D convolutions is given in Fig. 1. Note that a 3D convolutional
kernel can only extract one type of features from the frame cube since the kernel weights are
replicated across the entire cube.
A general design principle of CNNs is that the number of feature maps should be increased in
late layers by generating multiple types of features from the same set of lower level feature
maps. Similarly to the case of 2D convolution, this can be achieved by applying multiple 3D
convolutions with distinct kernels to the same location in the previous layer.
A 3D CNN Architecture
Based on the 3D convolution described above, a variety of CNN architectures can be
devised.
In the following, we describe a 3D CNN architecture that we have developed for human
action recognition on the TRECVID data set. In this architecture, shown in Fig. 3, we
consider seven frames of size 60 X 40 centered on the current frame as inputs to the 3D
CNN model.
We first apply a set of hardwired kernels to generate multiple channels of information
from the input frames. This results in 33 feature maps in the second layer in five different
channels denoted by gray, gradient-x, gradient-y, optflow-x, and optflow-y.
The gray channel contains the gray pixel values of the seven input frames. The feature
maps in the gradient-x and gradient-y channels are obtained by computing gradients
along the horizontal and vertical directions, respectively, on each of the seven input
frames, and the optflow-x and optflow-y channels contain the optical flow fields along
the horizontal and vertical directions, respectively, computed from adjacent input frames.
This hardwired layer is employed to encode our prior knowledge on features, and this
scheme usually leads to better performance as compared to the random initialization.
We then apply 3D convolutions with a kernel size of 7 X 7 X 3 (7 X 7 in the spatial
dimension and 3 in the temporal dimension) on each of the five channels separately.
To increase the number of feature maps, two sets of different convolutions are applied at
each location, resulting in two sets of feature maps in the C2 layer each consisting of 23
feature maps.
In the subsequent subsampling layer S3, we apply 2 X 2 subsampling on each of the
feature maps in the C2 layer, which leads to the same number of feature maps with a
reduced spatial resolution.
The next convolution layer C4 is obtained by applying 3D convolution with a kernel size
of 7 X 6 X 3 on each of the five channels in the two sets of feature maps separately.
To increase the number of feature maps, we apply three convolutions with different
kernels at each location, leading to six distinct sets of feature maps in the C4 layer, each
containing 13 feature maps.
The next layer S5 is obtained by applying 3 X 3 subsampling on each feature map in the
C4 layer, which leads to the same number of feature maps with a reduced spatial
resolution.
At this stage, the size of the temporal dimension is already relatively small (3 for gray,
gradient-x, gradient-y, and 2 for optflow-x and optflow-y), so we perform convolution
only in the spatial dimension at this layer.
The size of the convolution kernel used is 7 X 4 so that the sizes of the output feature
maps are reduced to 1 X 1. The C6 layer consists of 128 feature maps of size 1 X 1, and
each of them is connected to all 78 feature maps in the S5 layer.
After the multiple layers of convolution and subsampling, the seven input frames have
been converted into a 128D feature vector capturing the motion information in the input
frames.
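The feature-map sizes and counts quoted for each layer can be reproduced with simple arithmetic. The sketch below assumes convolutions without padding and with stride 1, and non-overlapping subsampling; the layer name "H1" for the hardwired layer and the function names are labels chosen for this sketch.

```python
def conv_out(size, kernel):
    """Output size of an unpadded, stride-1 convolution along one dimension."""
    return size - kernel + 1


def trace_3d_cnn():
    """Return (height, width, number of feature maps) after each layer."""
    h, w = 60, 40
    # Frames per channel: gray, gradient-x, gradient-y, optflow-x, optflow-y
    frames = [7, 7, 7, 6, 6]
    trace = {"H1": (h, w, sum(frames))}

    # C2: 7 X 7 X 3 convolution on each channel, two sets of kernels
    h, w = conv_out(h, 7), conv_out(w, 7)
    frames = [conv_out(t, 3) for t in frames]
    trace["C2"] = (h, w, 2 * sum(frames))

    # S3: 2 X 2 subsampling
    h, w = h // 2, w // 2
    trace["S3"] = (h, w, 2 * sum(frames))

    # C4: 7 X 6 X 3 convolution, three kernels per set, giving six sets
    h, w = conv_out(h, 7), conv_out(w, 6)
    frames = [conv_out(t, 3) for t in frames]
    trace["C4"] = (h, w, 6 * sum(frames))

    # S5: 3 X 3 subsampling
    h, w = h // 3, w // 3
    trace["S5"] = (h, w, 6 * sum(frames))

    # C6: 7 X 4 spatial convolution producing 128 feature maps
    h, w = conv_out(h, 7), conv_out(w, 4)
    trace["C6"] = (h, w, 128)
    return trace


for layer, shape in trace_3d_cnn().items():
    print(layer, shape)
```

The trace confirms the counts in the text: 33 maps after the hardwired layer, 23 per set in C2, 13 per set and 78 in total in C4 and S5, and 1 X 1 maps in C6.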
The output layer consists of the same number of units as the number of actions, and each
unit is fully connected to each of the 128 units in the C6 layer.
In this design, we essentially apply a linear classifier on the 128D feature vector for
action classification. All the trainable parameters in this model are initialized randomly
and trained by the online error back-propagation algorithm as described in [17].
We have designed and evaluated other 3D CNN architectures that combine multiple
channels of information at different stages, and our results show that this architecture
gives the best performance.