
UNIT V DEEP ARCHITECTURES IN VISION

Deep Architectures in Vision - Alexnet to ResNet, Transfer learning, Siamese Networks, Metric

Learning, Ranking/Triplet loss, RCNNs, CNN-RNN, Applications in captioning and video tasks,

3D CNNs

Deep Architectures in Vision - Alexnet to ResNet

 Image classification is the task of classifying a given image into one of the pre-defined
categories.
 Traditional pipeline for image classification involves two modules: viz. feature
extraction and classification.
 Feature extraction involves extracting a higher level of information from raw pixel values
that can capture the distinction among the categories involved.
 This feature extraction is done in an unsupervised manner wherein the classes of the
image have nothing to do with information extracted from pixels.
 Some of the traditional and widely used features are GIST, HOG, SIFT, LBP etc. After
the feature is extracted, a classification module is trained with the images and their
associated labels.
 The problem with this pipeline is that feature extraction cannot be tweaked according to
the classes and images.
 So if the chosen feature lacks the representation required to distinguish the categories, the
accuracy of the classification model suffers a lot, irrespective of the type of classification
strategy employed.
 A common theme among the state of the art following the traditional pipeline has been, to
pick multiple feature extractors and club them inventively to get a better feature.
 But this involves too many heuristics as well as manual labor to tweak parameters
according to the domain to reach a decent level of accuracy.
 Another problem with this method is that it is completely different from how we humans
learn to recognize things.
 Just after birth, a child is incapable of perceiving his surroundings, but as he progresses
and processes data, he learns to identify things. This is the philosophy behind deep
learning, wherein no hard-coded feature extractor is built in.
 It combines the extraction and classification modules into one integrated system and it
learns to extract, by discriminating representations from the images and classify them
based on supervised data.
 One such system is multilayer perceptrons aka neural networks which are multiple layers
of neurons densely connected to each other.
 A deep vanilla neural network has such a large number of parameters involved that it is
impossible to train such a system without overfitting the model due to the lack of a
sufficient number of training examples.
 But with Convolutional Neural Networks(ConvNets), the task of training the whole
network from the scratch can be carried out using a large dataset like ImageNet.
 The reason behind this is the sharing of parameters between neurons and the sparse
connections in convolutional layers, as shown in Figure 2. In the convolution operation, the
neurons in one layer are only locally connected to the input neurons, and the set of parameters
is shared across the 2-D feature map.
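
As a rough illustration of the effect of parameter sharing, the sketch below (PyTorch; the layer sizes are made up for illustration) compares the parameter count of a single fully-connected layer on a flattened 224x224x3 image with that of a single 3X3 convolutional layer:

import torch.nn as nn

# Dense layer: every output neuron connects to every input pixel.
dense = nn.Linear(224 * 224 * 3, 4096)

# Conv layer: one small 3X3 kernel per output channel, shared across all spatial locations.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print(num_params(dense))  # about 616 million parameters
print(num_params(conv))   # 1,792 parameters (64*3*3*3 weights + 64 biases)

The difference of several orders of magnitude is what makes it feasible to train ConvNets on large datasets without immediately overfitting.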

Accuracy:
 If you are building an intelligent machine, it is absolutely critical that it be as
accurate as possible. A fair objection here is that accuracy depends not only on the
network but also on the amount of data available for training. Hence, these networks
are compared on a standard dataset called ImageNet.

Computation:

 Most ConvNets have huge memory and computation requirements, especially while
training. Hence, this becomes an important concern.
 Similarly, the size of the final trained model becomes important to consider if you are
looking to deploy a model to run locally on mobile. As you can guess, it takes a more
computationally intensive network to produce more accuracy. So, there is always a trade-
off between accuracy and computation.
 Apart from these, there are many other factors like ease of training, the ability of a
network to generalize well, etc. The networks described below are the most popular ones;
they are presented in the order in which they were published, and each achieved better
accuracy than the earlier ones.
AlexNet
 This architecture was one of the first deep networks to push ImageNet Classification
accuracy by a significant stride in comparison to traditional methodologies.
 It is composed of 5 convolutional layers followed by 3 fully connected layers.
 AlexNet, proposed by Alex Krizhevsky, uses ReLu(Rectified Linear Unit) for the non-
linear part, instead of a Tanh or Sigmoid function which was the earlier standard for
traditional neural networks. ReLu is given by
f(x) = max(0,x)

 The advantage of the ReLu over sigmoid is that it trains much faster than the latter
because the derivative of sigmoid becomes very small in the saturating region and
therefore the updates to the weights almost vanish. This is called vanishing gradient
problem.
 In the network, ReLu layer is put after each and every convolutional and fully-connected
layers(FC).
 Another problem that this architecture solved was reducing the over-fitting by using a
Dropout layer after every FC layer.
 A Dropout layer has a probability p associated with it and is applied at every neuron of
the response map separately. It randomly switches off the activation with
probability p.
Why does DropOut work?
 The idea behind dropout is similar to model ensembles. Due to the dropout layer, the
different sets of neurons which are switched off represent different architectures, and all
these different architectures are trained in parallel, with a weight given to each subset such
that the weights sum to one. For n neurons attached to DropOut, the number of
subset architectures formed is 2^n.
 So it amounts to prediction being averaged over these ensembles of models. This
provides a structured model regularization which helps in avoiding the over-fitting.
 Another view of DropOut being helpful is that since neurons are randomly chosen, they
tend to avoid developing co-adaptations among themselves thereby enabling them to
develop meaningful features, independent of others.
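
A minimal sketch of how this looks in code (PyTorch; the layer sizes loosely follow the AlexNet classifier head, and the exact placement of Dropout varies between implementations):

import torch.nn as nn

# Hypothetical classifier head: ReLU after each FC layer, Dropout with p = 0.5 after the ReLU.
classifier = nn.Sequential(
    nn.Linear(9216, 4096),   # 9216 = flattened convolutional features
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),   # 1000 ImageNet classes
)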
VGG16
 This architecture is from VGG group, Oxford. It makes the improvement over AlexNet
by replacing large kernel-sized filters(11 and 5 in the first and second convolutional
layer, respectively) with multiple 3X3 kernel-sized filters one after another.
 With a given receptive field(the effective area size of input image on which output
depends), multiple stacked smaller size kernel is better than the one with a larger size
kernel because multiple non-linear layers increases the depth of the network which
enables it to learn more complex features, and that too at a lower cost.
 For example, three 3X3 filters stacked on top of each other with stride 1 have a receptive
field of 7, but the number of parameters involved is 3*(9C^2), compared to the 49C^2
parameters of a single kernel of size 7 (see the quick calculation at the end of this section).
 Here, it is assumed that the number of input and output channels of the layers is C. Also, 3X3
kernels help in retaining finer-level properties of the image. The network architecture is
given in the table.
 You can see that in VGG-D, there are blocks with same filter size applied multiple times
to extract more complex and representative features. This concept of blocks/modules
became a common theme in the networks after VGG.
 The VGG convolutional layers are followed by 3 fully connected layers. The width of the
network starts at a small value of 64 and increases by a factor of 2 after every sub-
sampling/pooling layer. It achieves the top-5 accuracy of 92.3 % on ImageNet.
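
The parameter comparison from the earlier bullet can be checked with a few lines of Python (assuming C input and C output channels and ignoring biases):

C = 512  # hypothetical channel count

params_three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3X3 conv layers = 3*(9C^2)
params_one_7x7   = 7 * 7 * C * C         # a single 7X7 conv layer = 49C^2

print(params_three_3x3, params_one_7x7)        # 7077888 vs 12845056
print(params_three_3x3 / params_one_7x7)       # about 0.55, i.e. roughly 45% fewer parameters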
GoogLeNet/Inception:
 While VGG achieves a phenomenal accuracy on ImageNet dataset, its deployment on
even the most modest sized GPUs is a problem because of huge computational
requirements, both in terms of memory and time.
 It becomes inefficient due to large width of convolutional layers.
 For instance, for a convolutional layer with a 3X3 kernel that takes 512 channels as
input and outputs 512 channels, the order of calculations is 9X512X512.
 In a convolutional operation at one location, every output channel is connected to every
input channel, and so we call it a dense connection architecture.
 The GoogLeNet builds on the idea that most of the activations in a deep network are
either unnecessary(value of zero) or redundant because of correlations between them.
 Therefore the most efficient architecture of a deep network will have a sparse connection
between the activations, which implies that all 512 output channels will not have
a connection with all the 512 input channels.
 There are techniques to prune out such connections, which would result in sparse
weights/connections. But kernels for sparse matrix multiplication are not optimized in
BLAS or cuBLAS (the CUDA BLAS library for GPUs), which renders them even slower than
their dense counterparts.
 So GoogLeNet devised a module called inception module that approximates a sparse
CNN with a normal dense construction(shown in the figure). Since only a small number
of neurons are effective as mentioned earlier, the width/number of the convolutional
filters of a particular kernel size is kept small.
 Also, it uses convolutions of different sizes to capture details at varied scales(5X5, 3X3,
1X1).
 Another salient point about the module is that it has a so-called bottleneck layer(1X1
convolutions in the figure). It helps in the massive reduction of the computation
requirement as explained below.
 Let us take the first inception module of GoogLeNet as an example which has 192
channels as input. It has just 128 filters of 3X3 kernel size and 32 filters of 5X5 size.
 The order of computation for 5X5 filters is 25X32X192 which can blow up as we go
deeper into the network when the width of the network and the number of 5X5 filter
further increases.
 In order to avoid this, the inception module uses 1X1 convolutions before applying larger
sized kernels to reduce the dimension of the input channels, before feeding into those
convolutions.
 So in the first inception module, the input to the module is first fed into 1X1 convolutions
with just 16 filters before it is fed into 5X5 convolutions. This reduces the computations
to 16X192 + 25X32X16. All these changes allow the network to have a large width and
depth.
 Another change that GoogLeNet made, was to replace the fully-connected layers at the
end with a simple global average pooling which averages out the channel values across
the 2D feature map, after the last convolutional layer.
 This drastically reduces the total number of parameters. This can be understood from
AlexNet, where FC layers contain approx. 90% of parameters.
 Use of a large network width and depth allows GoogLeNet to remove the FC layers
without affecting the accuracy. It achieves 93.3% top-5 accuracy on ImageNet and is
much faster than VGG.
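
The saving from the 1X1 bottleneck described above can be verified with simple multiply-accumulate counts (pure Python, per spatial location, ignoring biases):

# Direct 5X5 convolution: 192 input channels -> 32 output channels
direct = 5 * 5 * 192 * 32                          # 153600 multiplications per location

# Bottleneck: 1X1 conv down to 16 channels, then 5X5 conv from 16 -> 32 channels
bottleneck = 1 * 1 * 192 * 16 + 5 * 5 * 16 * 32    # 3072 + 12800 = 15872

print(direct / bottleneck)                          # roughly a 10x reduction in computation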

Residual Networks
 As per what we have seen so far, increasing the depth should increase the accuracy of the
network, as long as over-fitting is taken care of.
 But the problem with increased depth is that the signal required to change the weights,
which arises from the end of the network by comparing ground-truth and prediction
becomes very small at the earlier layers, because of increased depth.
 It essentially means that the earlier layers learn almost nothing. This is
called the vanishing gradient problem.
 The second problem with training deeper networks is performing the optimization over a
huge parameter space; naively adding layers therefore leads to higher training error. This is
called the degradation problem.
 Residual networks allow the training of such deep networks by constructing the network
through modules called residual modules, as shown in the figure. The intuition around why
it works can be seen as follows:

 Imagine a network A which produces x amount of training error. Construct a network B
by adding a few layers on top of A and set the parameter values in those layers in such a way
that they do nothing to the outputs from A. Let’s call the additional layers C.
 This would mean the same x amount of training error for the new network. So while
training network B, the training error should not go above the training error of A.
 But since it DOES happen in practice, the only explanation is that learning the identity
mapping (doing nothing to the inputs, just copying them as they are) with the added layers C
is not a trivial problem, and the solver fails to achieve it.
 To solve this, the module shown above creates a direct path between the input and the output
of the module, implying an identity mapping, and the added layers C then only need to learn
features on top of the already available input. Since C is learning only the residual, the whole
module is called a residual module.
 Also, similar to GoogLeNet, it uses a global average pooling followed by the
classification layer. Through the changes mentioned, ResNets were learned with network
depth of as large as 152.
 It achieves better accuracy than VGGNet and GoogLeNet while being computationally
more efficient than VGGNet. ResNet-152 achieves a 95.51% top-5 accuracy.
 The architecture is similar to VGGNet, consisting mostly of 3X3 filters. Starting from a
VGG-style plain network, shortcut connections as described above are inserted to form a
residual network. This can be seen in the figure, which shows a small snippet of the earlier
layers derived from VGG-19.
 The power of residual networks can be judged from one of the experiments reported in the
paper. A plain 34-layer network had higher validation error than an 18-layer plain network.
This is where the degradation problem shows up.
 And the same 34-layer network, when converted into a residual network, has much lower
training error than the 18-layer residual network.
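
A minimal sketch of a residual module (PyTorch; this is the basic-block form with an identity shortcut, while the real ResNet also uses strided and projection shortcuts when the spatial size or channel count changes):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # shortcut path: identity mapping
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # the stacked layers only learn the residual
        return self.relu(out)

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))   # output has the same shape as the input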

TRANSFER LEARNING

 Transfer Learning is the reuse of a pre-trained model on a new problem.


 In Transfer Learning, the knowledge of an already trained Machine Learning model is
applied to a different but related problem.
 For example, if you trained a simple classifier to predict whether an image contains a
backpack, you could use the knowledge that the model gained during its training to
recognize other objects like sunglasses.
 With transfer learning, we basically try to exploit what has been learned in one task to
improve generalization in another. We transfer the weights that a Network has learned at
Task A to a new Task B.
 The general idea is to use knowledge, that a model has learned from a task where a lot of
labeled training data is available, in a new task where we don’t have a lot of data. Instead
of starting the learning process from scratch, you start from patterns that have been
learned from solving a related task.
 Transfer Learning is mostly used in Computer Vision and Natural Language Processing
Tasks like Sentiment Analysis, because of the huge amount of computational power that is
needed for them.
 It is not really a Machine Learning technique. Transfer Learning can be seen as a ‘design
methodology’ within Machine Learning like for example, active learning. It is also not an
exclusive part or study-area of Machine Learning.
 Nevertheless, it has become quite popular in the combination with Neural Networks,
since they require huge amounts of data and computational power.
How it works
 For example, in computer vision, Neural Networks usually try to detect edges in their
earlier layers, shapes in their middle layer and some task-specific features in the later
layers. With transfer learning, you use the early and middle layers and only re-train the
latter layers. It helps us to leverage the labeled data of the task it was initially trained on.
 The model has learned to recognize objects and because of that, we will only re-train the
latter layers, so that it will learn what separates sunglasses from other objects.
 In Transfer Learning, we try to transfer as much knowledge as possible from the previous
task, the model was trained on, to the new task at hand. This knowledge can be in various
forms depending on the problem and the data.
Why it is used?
 Using Transfer Learning has several benefits that we will discuss in this section. The main
advantages are basically that you save training time, that your Neural Network performs
better in most cases and that you don’t need a lot of data.
 Usually, you need a lot of data to train a Neural Network from scratch but you don’t
always have access to enough data.
 That is where Transfer Learning comes into play because with it you can build a solid
machine Learning model with comparatively little training data because the model is
already pre-trained.
 This is especially valuable in Natural Language Processing (NLP), because expert
knowledge is usually required to create large labeled datasets. You also save a lot
of training time, because it can sometimes take days or even weeks to train a deep Neural
Network from scratch on a complex task.
When you should use it
 As is always the case in Machine Learning, it is hard to form rules that are generally applicable.
 You would typically use Transfer Learning when
(a) you don’t have enough labeled training data to train your network from scratch and/or
(b) there already exists a network that is pre-trained on a similar task, which is usually
trained on massive amounts of data. Another case where its use would be appropriate is
when Task-1 and Task-2 have the same input.
 If the original model was trained using TensorFlow, you can simply restore it and re-train
some layers for your task.
 Note that Transfer Learning only works if the features learned from the first task are
general, meaning that they can be useful for another related task as well. Also, the input of
the model needs to have the same size as it was initially trained with.
 If you don’t have that, you need to add a preprocessing step to resize your input to the
needed size.
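
A hedged sketch of this recipe (PyTorch/torchvision 0.13+ API; the choice of ResNet-18 and the 10-class head are placeholders for your own task):

import torch.nn as nn
import torchvision.models as models

# Start from an ImageNet pre-trained network instead of random weights.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the early and middle layers so their learned features are kept.
for param in model.parameters():
    param.requires_grad = False

# Replace the task-specific final layer and train only this part on the new data.
model.fc = nn.Linear(model.fc.in_features, 10)   # e.g. 10 classes in the new task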
Approaches to Transfer Learning
1. Training a Model to Reuse it
 Imagine you want to solve Task A but don’t have enough data to train a Deep Neural
Network. One way around this issue would be to find a related Task B, where you have an
abundance of data. Then you could train a Deep Neural Network on Task B and use this
model as starting point to solve your initial Task A.
 Whether you use the whole model or only a few layers of it depends heavily on the
problem you are trying to solve.
 If you have the same input in both Tasks, you could maybe just reuse the model and make
predictions for your new input. Alternatively, you could also just change and re-train
different task-specific layers and the output layer.
2. Using a Pre-Trained Model
 Approach 2 is to use an already pre-trained model. There are a lot of these models
out there, so you have to do a little bit of research. How many layers you reuse and how
many you train again depends, as already said, on your problem, and it is therefore hard to
form a general rule.
 There are also many research institutions that released models they have trained. This type
of Transfer Learning is most commonly used throughout Deep Learning.
3. Feature Extraction
 Another approach is to use Deep Learning to discover the best representation of your
problem, which means finding the most important features. This approach is also known
as Representation Learning and can often result in a much better performance than can be
obtained with hand-designed representation.

 Most of the time in Machine Learning, features are manually hand-crafted by researchers
and domain experts. Fortunately, Deep Learning can extract features automatically. Note
that this does not mean that Feature Engineering and Domain knowledge isn’t important
anymore because you still have to decide which features you put into your Network.
 But Neural Networks have the ability to learn which of the features you have put into them are
really important and which ones aren’t. A representation learning algorithm can discover a
good combination of features within a very short timeframe, even for complex tasks which
would otherwise require a lot of human effort.
 The learned representation can then be used for other problems as well. You simply use
the first layers to spot the right representation of features but you don’t use the output of
the network because it is too task-specific.
 Simply feed data into your network and use one of the intermediate layers as the output
layer. This layer can then be interpreted as a representation of the raw data.
This approach is mostly used in Computer Vision because it can reduce the size of your dataset,
which decreases computation time and makes it more suitable for traditional algorithms as well.
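
A sketch of this feature-extraction variant (PyTorch/torchvision; here the final classification layer is replaced by an identity so that the penultimate representation comes out, which could then be fed to a traditional classifier such as an SVM):

import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()        # drop the task-specific output layer
backbone.eval()

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # a dummy batch of images
    features = backbone(images)            # 8 feature vectors of dimension 512
print(features.shape)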

SIAMESE NETWORKS

In Siamese networks, we take an input image of a person and find the encodings of that
image; then we take the same network, without performing any updates on its weights or biases,
input an image of a different person and again predict its encodings. Now, we compare these two
encodings to check whether there is a similarity between the two images. These two encodings act
as a latent feature representation of the images. Images of the same person have similar
features/encodings. Using this, we compare and tell whether the two images contain the same
person or not.
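
A minimal sketch of the idea (PyTorch; the tiny encoder below is purely hypothetical, any CNN that outputs an embedding would do):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return F.normalize(self.fc(x), dim=1)    # L2-normalised encoding

encoder = SiameseEncoder()
img1 = torch.randn(1, 3, 100, 100)
img2 = torch.randn(1, 3, 100, 100)
e1, e2 = encoder(img1), encoder(img2)            # the same weights encode both images
distance = F.pairwise_distance(e1, e2)           # small distance suggests the same person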

Triplet Loss
You can train the network by taking an anchor image and comparing it with both a positive sample
and a negative sample. The dissimilarity between the anchor image and the positive image must be
low, and the dissimilarity between the anchor image and the negative image must be high.

The triplet loss, L = max(d(a,p) − d(a,n) + margin, 0), is the function from which the gradients are
calculated. The variable “a” represents the anchor image, “p” represents a positive image and “n”
represents a negative image. We know that the dissimilarity between a and p should be less than the
dissimilarity between a and n. Another variable called the margin, which is a hyperparameter, is
added to the loss equation. The margin defines how far apart the dissimilarities should be, i.e. if
margin = 0.2 and d(a,p) = 0.5 then d(a,n) should be at least 0.7. The margin helps us distinguish
the two images better.

Therefore, by using this loss function we calculate the gradients, and with the help of the
gradients we update the weights and biases of the siamese network. To train the network, we
take an anchor image, randomly sample positive and negative images, compute the loss and
update the gradients.
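
A sketch of one training step (PyTorch; torch.nn.TripletMarginLoss implements the max(d(a,p) − d(a,n) + margin, 0) form of the loss, and the embeddings below are random stand-ins for the encoder outputs):

import torch
import torch.nn as nn

criterion = nn.TripletMarginLoss(margin=0.2)

# anchor / positive / negative embeddings, e.g. produced by the shared siamese encoder
a = torch.randn(16, 128, requires_grad=True)
p = torch.randn(16, 128, requires_grad=True)
n = torch.randn(16, 128, requires_grad=True)

loss = criterion(a, p, n)   # zero once d(a,n) >= d(a,p) + margin for every triplet
loss.backward()             # the gradients then update the shared network weights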

METRIC LEARNING

 It is a type of mechanism to combine features to effectively compare observations.


 A metric or distance function is a function that defines a distance between each pair of
elements of a set.
 Metric learning learns a metric function from training data to calculate the similarity or
distance between samples.
 From the perspective of feature learning, metric learning essentially learns a new feature
space by feature transformation (e.g., Mahalanobis distance metric). However, traditional
metric learning algorithms are shallow, which just learn one metric space (feature
transformation).
 To this end, we present a hierarchical metric learning scheme and implement an online
deep metric learning framework, namely ODML.
 Specifically, we take one online metric learning algorithm as a metric layer, followed by
a nonlinear layer (i.e., ReLU), and then stack these layers in the style of deep
learning.

 Metric learning aims to learn a distance function to measure the similarity of samples,
which plays an important role in many visual understanding applications.
 Generally, the optimal similarity functions for different visual understanding tasks are
task specific because the distributions for data used in different tasks are usually
different.
 It is generally believed that learning a metric from training data can obtain more
encouraging performances than handcrafted metrics e.g., the Euclidean and cosine
distances.
 A variety of metric learning methods have been proposed in the literature , and many of
them have been successfully employed in visual understanding tasks such as face
recognition, image classification, visual search, visual tracking , person reidentification ,
cross-modal matching , image set classification and image-based geolocalization.
 Metric learning techniques are usually classified into two categories: unsupervised and
supervised. Unsupervised metric learning attempts to learn a low-dimensional subspace
to preserve the useful geometrical information of the samples.
 Supervised metric learning, which is the mainstream metric learning technique and the
focus in this article, seeks an appropriate metric by formulating an optimization objective
function to exploit supervised information of the training samples, where the objective
functions are designed for different specific tasks.
 However, most conventional metric learning methods usually learn a linear mapping to
project samples into a new feature space, which suffer from the nonlinear relationship of
data points in metric learning.
 While the kernel trick can be adopted to address this nonlinearity problem, this type of
method suffers from the scalability problem because the kernel trick has two major
issues:
1) choosing a kernel is typically difficult and quite empirical and
2) the expression power of kernel functions is often not flexible enough to capture the
nonlinearity in the data.
 Motivated by the fact that deep learning is an effective solution to model the nonlinearity
of samples, several deep metric learning (DML) methods have been proposed in recent
years.
 The key idea of DML is to explicitly learn a set of hierarchical nonlinear transformations
to map data points into other feature space for comparing or matching by exploiting the
architecture of neural networks in deep learning, which unifies feature learning and metric
learning into a joint learning framework.
 The goal of this article is to provide an overview of recent advances in DML techniques
and their various applications in different visual understanding tasks.
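
As a concrete example of the kind of metric such shallow methods learn, a Mahalanobis-style distance with M = L^T L reduces to a Euclidean distance after a learned linear transform (NumPy; L here is a hypothetical learned matrix):

import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(5, 10))            # learned transform: 10-D input -> 5-D metric space

def mahalanobis_like(x, y, L):
    # d_M(x, y)^2 = (x - y)^T M (x - y) with M = L^T L
    diff = L @ (x - y)
    return float(diff @ diff)

x, y = rng.normal(size=10), rng.normal(size=10)
print(mahalanobis_like(x, y, L))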

TRIPLET LOSS
Usually in supervised learning we have a fixed number of classes and train the network using the
softmax cross entropy loss. However in some cases we need to be able to have a variable number
of classes. In face recognition for instance, we need to be able to compare two unknown faces
and say whether they are from the same person or not.
Triplet loss in this case is a way to learn good embeddings for each face. In the embedding space,
faces from the same person should be close together and form well separated clusters.

Definition of the loss

The goal of the triplet loss is to make sure that:

 Two examples with the same label have their embeddings close together in the embedding
space
 Two examples with different labels have their embeddings far away.

However, we don’t want to push the train embeddings of each label to collapse into very small
clusters. The only requirement is that given two positive examples of the same class and one
negative example, the negative should be farther away than the positive by some margin. This is
very similar to the margin used in SVMs, and here we want the clusters of each class to be
separated by the margin. To formalise this requirement, the loss will be defined over triplets of
embeddings:

 an anchor
 a positive of the same class as the anchor
 a negative of a different class

For some distance d on the embedding space, the loss of a triplet (a, p, n) is:
L = max(d(a,p) − d(a,n) + margin, 0)
We minimize this loss, which pushes d(a,p) towards 0 and d(a,n) to be greater
than d(a,p) + margin. As soon as n becomes an “easy negative”, the loss becomes
zero.

Triplet mining
Based on the definition of the loss, there are three categories of triplets:
 easy triplets: triplets which have a loss of 0, because d(a,p) + margin < d(a,n)
 hard triplets: triplets where the negative is closer to the anchor than the positive,
i.e. d(a,n) < d(a,p)
 semi-hard triplets: triplets where the negative is not closer to the anchor than the positive,
but which still have positive loss: d(a,p) < d(a,n) < d(a,p) + margin

Each of these definitions depends on where the negative is relative to the anchor and the positive.
We can therefore extend these three categories to the negatives: hard negatives, semi-hard
negatives or easy negatives.

Offline and online triplet mining


We have defined a loss on triplets of embeddings, and have seen that some triplets are more
useful than others. The question now is how to sample, or “mine” these triplets.

Offline triplet mining


The first way to produce triplets is to find them offline, at the beginning of each epoch for
instance. We compute all the embeddings on the training set, and then only select hard or semi-
hard triplets. We can then train one epoch on these triplets. Concretely, we would produce a list
of triplets (i, j, k). We would then create batches of these triplets of size B, which means we will
have to compute 3B embeddings to get the B triplets, compute the loss of these B triplets and
then backpropagate through the network. Overall this technique is not very efficient since we need to
do a full pass on the training set to generate triplets. It also requires updating the offline-mined
triplets regularly.
Online triplet mining
Online triplet mining was introduced in Facenet and has been well described by Brandon Amos
in his blog post OpenFace 0.2.0: Higher accuracy and halved execution time. The idea here is to
compute useful triplets on the fly, for each batch of inputs. Given a batch of B examples (for
instance B images of faces), we compute the B embeddings and can then find a maximum of
B^3 triplets. Of course, most of these triplets are not valid (i.e. they don’t have 2 positives and 1
negative). This technique gives you more triplets for a single batch of inputs, and doesn’t require
any offline mining. It is therefore much more efficient. We will see an implementation of this in
the last part.

Strategies in online mining


In online mining, we have computed a batch of B embeddings from a batch of B inputs. Now we
want to generate triplets from these B embeddings. Whenever we have three
indices i, j, k ∈ [1, B], if examples i and j have the same label but are distinct, and
example k has a different label, we say that (i, j, k) is a valid triplet. What remains here is
to have a good strategy to pick, among the valid ones, the triplets on which to compute the loss.
They suppose that you have a batch of faces as input of size B = PK, composed
of P different persons with K images each. A typical value is K = 4. The two strategies
are:

 batch all: select all the valid triplets, and average the loss on the hard and semi-hard triplets.
o a crucial point here is to not take into account the easy triplets (those with loss 0), as
averaging over them would make the overall loss very small
o this produces a total of PK(K−1)(PK−K) triplets
(PK anchors, K−1 possible positives per anchor, PK−K possible
negatives)
 batch hard: for each anchor, select the hardest positive (biggest distance d(a,p)) and
the hardest negative among the batch
o this produces PK triplets
o the selected triplets are the hardest among the batch

According to the paper cited above, the batch hard strategy yields the best performance:
Additionally, the selected triplets can be considered moderate triplets, since they are the hardest within a small
subset of the data, which is exactly what is best for learning with the triplet loss.
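
A hedged sketch of the batch hard strategy (PyTorch; it assumes a batch of embeddings with integer labels and uses plain Euclidean distances):

import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    dist = torch.cdist(embeddings, embeddings, p=2)          # pairwise Euclidean distances

    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # same-label mask
    eye = torch.eye(len(labels), dtype=torch.bool)

    pos_mask = same & ~eye                                    # valid positives (same label, distinct)
    neg_mask = ~same                                          # valid negatives (different label)

    hardest_pos = (dist * pos_mask).max(dim=1).values                          # biggest d(a,p)
    hardest_neg = dist.masked_fill(~neg_mask, float("inf")).min(dim=1).values  # smallest d(a,n)

    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()

emb = torch.randn(16, 128)                     # B = P*K, e.g. P = 4 persons with K = 4 images each
lbl = torch.arange(4).repeat_interleave(4)
print(batch_hard_triplet_loss(emb, lbl))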

RCNN

 Object detection aids in pose estimation, vehicle detection, surveillance etc.


 The difference between object detection algorithms and classification algorithms is that in
detection algorithms, we try to draw a bounding box around the object of interest to locate it
within the image.
 Also, you might not necessarily draw just one bounding box in an object detection case, there
could be many bounding boxes representing different objects of interest within the image and
you would not know how many beforehand.
 The major reason why you cannot proceed with this problem by building a standard
convolutional network followed by a fully connected layer is that, the length of the output
layer is variable — not constant, this is because the number of occurrences of the objects of
interest is not fixed.
 A naive approach to solve this problem would be to take different regions of interest from the
image, and use a CNN to classify the presence of the object within that region. The problem
with this approach is that the objects of interest might have different spatial locations within
the image and different aspect ratios.
 Hence, you would have to select a huge number of regions and this could computationally
blow up. Therefore, algorithms like R-CNN, YOLO etc have been developed to find these
occurrences and find them fast.
R-CNN
 To bypass the problem of selecting a huge number of regions, Ross Girshick et al.
proposed a method where we use selective search to extract just 2000 regions from the
image and he called them region proposals.
 Therefore, now, instead of trying to classify a huge number of regions, you can just work
with 2000 regions. These 2000 region proposals are generated using the selective search
algorithm which is written below.
Selective Search:
1. Generate initial sub-segmentation, we generate many candidate regions
2. Use greedy algorithm to recursively combine similar regions into larger ones
3. Use the generated regions to produce the final candidate region proposals

 These 2000 candidate region proposals are warped into a square and fed into a convolutional
neural network that produces a 4096-dimensional feature vector as output.
 The CNN acts as a feature extractor and the output dense layer consists of the features
extracted from the image and the extracted features are fed into an SVM to classify the
presence of the object within that candidate region proposal.
 In addition to predicting the presence of an object within the region proposals, the algorithm
also predicts four values which are offset values to increase the precision of the bounding box.
 For example, given a region proposal, the algorithm would have predicted the presence of a
person but the face of that person within that region proposal could’ve been cut in half.
Therefore, the offset values help in adjusting the bounding box of the region proposal.
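
A sketch of how predicted offsets refine a proposal box, using the common (dx, dy, dw, dh) parameterisation of the R-CNN family (pure Python; the numbers are made up):

import math

def apply_offsets(box, deltas):
    # box = (x, y, w, h) of the region proposal; deltas = (dx, dy, dw, dh) from the regressor
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    cx, cy = x + 0.5 * w, y + 0.5 * h           # proposal centre
    cx, cy = cx + dx * w, cy + dy * h           # shift the centre, scaled by the proposal size
    w, h = w * math.exp(dw), h * math.exp(dh)   # rescale the width and height
    return (cx - 0.5 * w, cy - 0.5 * h, w, h)

print(apply_offsets((50, 30, 100, 200), (0.1, -0.05, 0.2, 0.0)))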

Problems with R-CNN


 It still takes a huge amount of time to train the network as you would have to classify 2000
region proposals per image.

 It cannot be implemented real time as it takes around 47 seconds for each test image.

 The selective search algorithm is a fixed algorithm. Therefore, no learning is happening at
that stage. This could lead to the generation of bad candidate region proposals.
Fast R-CNN
 The same author of the previous paper(R-CNN) solved some of the drawbacks of R-CNN to
build a faster object detection algorithm and it was called Fast R-CNN. The approach is
similar to the R-CNN algorithm.
 But, instead of feeding the region proposals to the CNN, we feed the input image to the CNN
to generate a convolutional feature map. From the convolutional feature map, we identify the
region of proposals and warp them into squares and by using a RoI pooling layer we reshape
them into a fixed size so that it can be fed into a fully connected layer.
 From the RoI feature vector, we use a softmax layer to predict the class of the proposed
region and also the offset values for the bounding box.
 The reason “Fast R-CNN” is faster than R-CNN is because you don’t have to feed 2000
region proposals to the convolutional neural network every time. Instead, the convolution
operation is done only once per image and a feature map is generated from it.
 When you look at the performance of Fast R-CNN during testing time, including region
proposals slows down the algorithm significantly when compared to not using region
proposals. Therefore, region proposals become bottlenecks in Fast R-CNN algorithm
affecting its performance.
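
A rough sketch of the RoI pooling idea (PyTorch; a proposal region is simply cropped from the shared feature map and adaptive-max-pooled to a fixed 7x7 size, ignoring the sub-bin details of the real layer):

import torch
import torch.nn.functional as F

feature_map = torch.randn(1, 512, 40, 60)   # output of the backbone CNN for one image
x1, y1, x2, y2 = 10, 5, 30, 25              # a region proposal in feature-map coordinates

roi = feature_map[:, :, y1:y2, x1:x2]                         # crop the proposal
roi_fixed = F.adaptive_max_pool2d(roi, output_size=(7, 7))    # fixed size for the FC layers
print(roi_fixed.shape)                                        # torch.Size([1, 512, 7, 7])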

Faster R-CNN

 Both of the above algorithms (R-CNN & Fast R-CNN) use selective search to find the
region proposals. Selective search is a slow and time-consuming process that affects the
performance of the network.
 Therefore, Shaoqing Ren et al. came up with an object detection algorithm that eliminates
the selective search algorithm and lets the network learn the region proposals.
 Similar to Fast R-CNN, the image is provided as an input to a convolutional network
which provides a convolutional feature map.
 Instead of using selective search algorithm on the feature map to identify the region
proposals, a separate network is used to predict the region proposals. The predicted region
proposals are then reshaped using a RoI pooling layer which is then used to classify the
image within the proposed region and predict the offset values for the bounding boxes.

YOLO — You Only Look Once


 All of the previous object detection algorithms use regions to localize the object within the
image. The network does not look at the complete image; instead, it looks only at parts of the
image which have high probabilities of containing the object.
 YOLO or You Only Look Once is an object detection algorithm much different from the
region based algorithms seen above. In YOLO a single convolutional network predicts the
bounding boxes and the class probabilities for these boxes.

 YOLO works by taking an image and splitting it into an SxS grid; within each grid cell we
take m bounding boxes. For each bounding box, the network outputs a class probability and
offset values for the box.
 The bounding boxes with a class probability above a threshold value are selected and
used to locate the object within the image.
 YOLO is orders of magnitude faster(45 frames per second) than other object detection
algorithms.
 The limitation of YOLO algorithm is that it struggles with small objects within the image,
for example it might have difficulties in detecting a flock of birds. This is due to the
spatial constraints of the algorithm.
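
A toy sketch of the selection step (PyTorch; the S = 7 grid, m = 2 boxes per cell, the random scores and the tensor layout are all assumptions for illustration only):

import torch

S, m = 7, 2
scores = torch.rand(S, S, m)          # hypothetical confidence * class probability per box
boxes = torch.rand(S, S, m, 4)        # hypothetical (x, y, w, h) per box

keep = scores > 0.5                   # keep boxes whose score exceeds the threshold
selected = boxes[keep]                # (N, 4) boxes used to locate objects in the image
print(selected.shape)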
CNN-RNN
 While deep convolutional neural networks (CNNs) have shown a great success in single-
label image classification, it is important to note that real world images generally contain
multiple labels, which could correspond to different objects, scenes, actions and attributes
in an image.
 Traditional approaches to multi-label image classification learn independent classifiers
for each category and employ ranking or thresholding on the classification results. These
techniques, although working well, fail to explicitly exploit the label dependencies in an
image.
 In this paper, we utilize recurrent neural networks (RNNs) to address this problem.
Combined with CNNs, the proposed CNN-RNN framework learns a joint image-label
embedding to characterize the semantic label dependency as well as the image-label
relevance, and it can be trained end-to-end from scratch to integrate both information in a
unified framework.
 Since we aim to characterize the high-order label correlation, we employ long short term
memory (LSTM) neurons [15] as our recurrent neurons, which has been demonstrated to
be a powerful model of long-term dependency.
Long Short Term Memory Networks (LSTM)
 RNN [15] is a class of neural network that maintains internal hidden states to model
the dynamic temporal behaviour of sequences with arbitrary lengths through directed
cyclic connections between its units.
 It can be considered as a hidden Markov model extension that employs nonlinear
transition function and is capable of modeling long term temporal dependencies.
 LSTM extends RNN by adding three gates to an RNN neuron: a forget gate f to
control whether to forget the current state; an input gate i to indicate if it should read
the input; an output gate o to control whether to output the state.
 These gates enable LSTM to learn long-term dependencies in a sequence, and make it
easier to optimize, because these gates help the input signal to propagate effectively
through the recurrent hidden states r(t) without affecting the output.
 LSTM also effectively deals with the gradient vanishing/exploding issues that
commonly appear during RNN training [26].
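
The gate update equations themselves are not reproduced in these notes; one standard LSTM formulation, written with the symbols used below (a common variant, the exact form in the cited paper may differ), is:

f(t) = δ(Wf · [r(t−1), wk(t)] + bf)          (forget gate)
i(t) = δ(Wi · [r(t−1), wk(t)] + bi)          (input gate)
o(t) = δ(Wo · [r(t−1), wk(t)] + bo)          (output gate)
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ δ(Wc · [r(t−1), wk(t)] + bc)   (cell state)
r(t) = o(t) ⊙ δ(c(t))                        (hidden state)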

where δ(.) is an activation function, ⊙ is the product with gate value, and various W matrices are
learned parameters. In our implementation, we employ rectified linear units (ReLU) as the
activation function [4].
 The illustration of the CNN-RNN framework is shown in Fig. 4. It contains two parts: the
CNN part extracts semantic representations from images; the RNN part models the
image/label relationship and the label dependency.
 We decompose a multi-label prediction as an ordered prediction path. For example, labels
“zebra” and “elephant” can be decomposed as either (“zebra”, “elephant”) or (“elephant”,
“zebra”).
 The probability of a prediction path can be computed by the RNN network. The image,
label, and recurrent representations are projected to the same low dimensional space to
model the image-text relationship as well as the label redundancy.
 The RNN model is employed as a compact yet powerful representation of the label
cooccurrence dependency in this space. It takes the embedding of the predicted label at
each time step and maintains a hidden state to model the label co-occurrence information.
 The a priori probability of a label given the previously predicted labels can be computed
from their dot products with the sum of the image and recurrent embeddings.
 The probability of a prediction path can be obtained as the product of the a priori
probabilities of each label given the previous labels in the prediction path.
 A label k is represented as a one-hot vector ek = [0, ..., 0, 1, 0, ..., 0], which is 1 at the k-th
location and 0 elsewhere. The label embedding can be obtained by multiplying the one-hot
vector with a label embedding matrix Ul: wk = Ul · ek.
 The k-th row of Ul is the label embedding of label k.
 The dimension of wk is usually much smaller than the number of labels. The recurrent
layer takes the label embedding of the previously predicted label, and models the co-
occurrence dependencies in its hidden recurrent states by learning nonlinear functions:
o(t) = ho(r(t−1), wk(t)),   r(t) = hr(r(t−1), wk(t))
where r(t) and o(t) are the hidden states and outputs of the recurrent layer at time
step t, respectively, wk(t) is the label embedding of the t-th label in the prediction path,
and ho(.), hr(.) are the non-linear RNN functions.
 The output of the recurrent layer and the image representation are projected into the same
low-dimensional space as the label embedding:
xt = h(Ux_o · o(t) + Ux_I · I)
where Ux_o and Ux_I are the projection matrices for the recurrent layer output and the image
representation, respectively.
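
A heavily simplified sketch of one prediction step in this joint embedding (NumPy; every matrix below is a hypothetical learned parameter, and h is taken to be ReLU as stated in the text):

import numpy as np

rng = np.random.default_rng(0)
num_labels, embed_dim, hidden_dim, img_dim = 20, 64, 128, 512

Ul   = rng.normal(size=(num_labels, embed_dim))   # label embedding matrix (row k = embedding of label k)
Ux_o = rng.normal(size=(embed_dim, hidden_dim))   # projection of the recurrent output o(t)
Ux_I = rng.normal(size=(embed_dim, img_dim))      # projection of the CNN image representation I

relu = lambda z: np.maximum(z, 0)

I  = rng.normal(size=img_dim)       # CNN image feature
ot = rng.normal(size=hidden_dim)    # recurrent layer output at step t

xt = relu(Ux_o @ ot + Ux_I @ I)     # xt = h(Ux_o o(t) + Ux_I I)
scores = Ul @ xt                    # dot product with each label embedding
scores -= scores.max()              # numerical stability for the softmax
probs = np.exp(scores) / np.exp(scores).sum()
next_label = int(np.argmax(probs))  # most probable next label in the prediction path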

 We propose a novel CNN-RNN framework for the multi-label classification problem.


3D CNN
 This model extracts features from both the spatial and the temporal dimensions by
performing 3D convolutions, thereby capturing the motion information encoded in
multiple adjacent frames.
 The developed model generates multiple channels of information from the input frames,
and the final feature representation combines information from all channels. In 2D CNNs,
2D convolution is performed at the convolutional layers to extract features from local
neighborhood on feature maps in the previous layer.
 Then an additive bias is applied and the result is passed through a sigmoid function.
Formally, the value of a unit at position (x, y) in the j-th feature map in the i-th layer,
denoted as v_ij^xy, is given by

v_ij^xy = tanh( b_ij + Σ_m Σ_{p=0..Pi−1} Σ_{q=0..Qi−1} w_ijm^pq · v_(i−1)m^((x+p)(y+q)) )

where tanh(·) is the hyperbolic tangent function, b_ij is the bias for this feature map, m indexes
over the set of feature maps in the (i−1)-th layer connected to the current feature map, w_ijm^pq is
the value at the position (p, q) of the kernel connected to the m-th feature map, and Pi and Qi are
the height and width of the kernel, respectively.
 In the subsampling layers, the resolution of the feature maps is reduced by pooling over local
neighborhood on the feature maps in the previous layer, thereby enhancing the invariance to
distortions on the inputs. A CNN architecture can be constructed by stacking multiple layers
of convolution and subsampling in an alternating fashion.
 The parameters of the CNN, such as the bias b_ij and the kernel weight w_ijm^pq, are usually
learned using either supervised or unsupervised approaches.
3D Convolution
 In 2D CNNs, convolutions are applied on the 2D feature maps to compute features from
the spatial dimensions only.
 When applied to video analysis problems, it is desirable to capture the motion
information encoded in multiple contiguous frames.
 To this end, we propose to perform 3D convolutions in the convolution stages of CNNs
to compute features from both spatial and temporal dimensions.
 The 3D convolution is achieved by convolving a 3D kernel to the cube formed by
stacking multiple contiguous frames together.
 By this construction, the feature maps in the convolution layer are connected to multiple
contiguous frames in the previous layer, thereby capturing motion information. Formally,
the value at position (x, y, z) on the j-th feature map in the i-th layer is given by

v_ij^xyz = tanh( b_ij + Σ_m Σ_{p=0..Pi−1} Σ_{q=0..Qi−1} Σ_{r=0..Ri−1} w_ijm^pqr · v_(i−1)m^((x+p)(y+q)(z+r)) )

where Ri is the size of the 3D kernel along the temporal dimension, and w_ijm^pqr is the (p, q, r)-th
value of the kernel connected to the m-th feature map in the previous layer.
 A comparison of 2D and 3D convolutions is given in Fig. 1. Note that a 3D convolutional
kernel can only extract one type of features from the frame cube since the kernel weights are
replicated across the entire cube.
 A general design principle of CNNs is that the number of feature maps should be increased in
later layers by generating multiple types of features from the same set of lower-level feature
maps. Similarly to the case of 2D convolution, this can be achieved by applying multiple 3D
convolutions with distinct kernels to the same location in the previous layer.
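
A minimal sketch of a 3D convolution over a stack of frames (PyTorch; the input and kernel sizes are chosen only to mirror the numbers used in the architecture below):

import torch
import torch.nn as nn

# 7 contiguous grayscale frames of size 60 x 40, stacked along the temporal dimension
frames = torch.randn(1, 1, 7, 60, 40)    # (batch, channels, time, height, width)

# 3D kernel: 3 frames in time, 7 x 7 in space, producing 2 sets of feature maps
conv3d = nn.Conv3d(in_channels=1, out_channels=2, kernel_size=(3, 7, 7))

out = conv3d(frames)
print(out.shape)    # torch.Size([1, 2, 5, 54, 34]): both spatial and temporal extents shrink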

A 3D CNN Architecture
 Based on the 3D convolution described above, a variety of CNN architectures can be
devised.
 In the following, we describe a 3D CNN architecture that we have developed for human
action recognition on the TRECVID data set. In this architecture, shown in Fig. 3, we
consider seven frames of size 60 X 40 centered on the current frame as inputs to the 3D
CNN model.
 We first apply a set of hardwired kernels to generate multiple channels of information
from the input frames. This results in 33 feature maps in the second layer, in five different
channels denoted by gray, gradient-x, gradient-y, optflow-x, and optflow-y.
 The gray channel contains the gray pixel values of the seven input frames. The feature
maps in the gradient-x and gradient-y channels are obtained by computing gradients
along the horizontal and vertical directions, respectively, on each of the seven input
frames, and the optflow-x and optflow-y channels contain the optical flow fields along
the horizontal and vertical directions, respectively, computed from adjacent input frames.
 This hardwired layer is employed to encode our prior knowledge on features, and this
scheme usually leads to better performance as compared to the random initialization.
 We then apply 3D convolutions with a kernel size of 7 X 7 X 3 (7 X 7 in the spatial
dimensions and 3 in the temporal dimension) on each of the five channels separately.
 To increase the number of feature maps, two sets of different convolutions are applied at
each location, resulting in two sets of feature maps in the C2 layer each consisting of 23
feature maps.
 In the subsequent subsampling layer S3, we apply 2 X 2 subsampling on each of the
feature maps in the C2 layer, which leads to the same number of feature maps with a
reduced spatial resolution.
 The next convolution layer C4 is obtained by applying 3D convolution with a kernel size
of 7 X 6 X 3 on each of the five channels in the two sets of feature maps separately.
 To increase the number of feature maps, we apply three convolutions with different
kernels at each location, leading to six distinct sets of feature maps in the C4 layer, each
containing 13 feature maps.
 The next layer S5 is obtained by applying 3 X 3 subsampling on each feature map in the
C4 layer, which leads to the same number of feature maps with a reduced spatial
resolution.
 At this stage, the size of the temporal dimension is already relatively small (3 for gray,
gradient-x, gradient-y, and 2 for optflow-x and optflow-y), so we perform convolution
only in the spatial dimension at this layer.
The size of the convolution kernel used is 7 X 4, so that the sizes of the output feature
maps are reduced to 1 X 1. The C6 layer consists of 128 feature maps of size 1 X 1, and
each of them is connected to all 78 feature maps in the S5 layer.
 After the multiple layers of convolution and subsampling, the seven input frames have
been converted into a 128D feature vector capturing the motion information in the input
frames.
 The output layer consists of the same number of units as the number of actions, and each
unit is fully connected to each of the 128 units in the C6 layer.
 In this design, we essentially apply a linear classifier on the 128D feature vector for
action classification. All the trainable parameters in this model are initialized randomly
and trained by the online error back-propagation algorithm as described in [17].
 We have designed and evaluated other 3D CNN architectures that combine multiple
channels of information at different stages, and our results show that this architecture
gives the best performance.
