KANPUR
CS-698
VISUAL RECOGNITION
Prof. Vinay P. Namboodiri
BY
ABHINAV JAIN, 13022
RISHABH GUPTA, 13571
INTRODUCTION
For the task of instance retrieval, we worked on two models: one presented in the paper
Scalable Recognition with a Vocabulary Tree, and one based on Convolutional Neural Networks
(CNNs). For the rest of this report they are referred to as Model-Tree and Model-CNN
respectively.
The basic idea underlying the pipeline of the CNN-based model is that we have a way to extract
a good feature representation of the input I, whether it is an image or a sub-patch of one.
Convolutional Neural Networks are used for this. Units in the first convolutional layer of the
CNN respond to simple patterns such as edges or textures. Units in the second layer aggregate
those responses into more complex patterns. As the process continues deeper, units are
expected to respond to increasingly complex patterns in their corresponding receptive fields.
This yields a good feature representation of the input image. Activations of any
convolutional layer can be used to construct a feature vector; the choice of layer depends
on the goal for which the CNN is being used.
Our goal is instance retrieval: given a query image, we must retrieve the best matches from
the database. Instance retrieval deals with intra-class variability, since only images
containing the exact same instance of the object should be retrieved, not merely images of
the same class. For this, activations from early layers will not suffice; they have to be
taken from layers deeper in the network. In fact, the study on factors of transferability
provides experimental evidence that the last convolutional layer is the best alternative for
instance retrieval.
VGG16 with pre-trained weights has been used in this assignment to obtain activations from
the last convolutional layer. All fully connected layers have been discarded. This gives a
512 x 7 x 7 output volume, a 3D tensor of activation maps for an image. Fine-tuning the
model on images from the database is customary practice, but fine-tuning roughly
14.7 million parameters with just ~6K database images would likely lead to over-fitting.
Thus, we used the pre-trained model as it is and focused on turning the activation map into
a good feature representation.
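The 512 x 7 x 7 output volume and the figure of roughly 14.7 million parameters can be checked with a short calculation. The sketch below assumes the standard VGG16 convolutional configuration (3 x 3 kernels with biases, five 2 x 2 max-pooling layers) and a 224 x 224 RGB input; the input size is not stated in this report and is inferred from the 7 x 7 output. The helper name conv_stack_summary is ours, for illustration only.

```python
# Standard VGG16 convolutional configuration: output channels per 3x3 conv
# layer, with "M" marking a 2x2 max-pooling layer of stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def conv_stack_summary(cfg, in_channels=3, input_size=224, kernel=3):
    """Return (parameter count, output channels, output spatial size).

    Assumes padded convolutions that preserve spatial size, so only the
    pooling layers change the resolution.
    """
    params, channels, size = 0, in_channels, input_size
    for layer in cfg:
        if layer == "M":
            size //= 2  # each pooling layer halves the spatial resolution
        else:
            # weights (kernel * kernel * in * out) plus one bias per output channel
            params += kernel * kernel * channels * layer + layer
            channels = layer
    return params, channels, size


params, channels, size = conv_stack_summary(VGG16_CFG)
print(params, channels, size)  # 14714688 512 7
```

Under these assumptions the convolutional part alone has 14,714,688 parameters, which is far more than ~6K images can reasonably constrain.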
MODEL-CNN
For every feature map, we apply ReLU to get non-negative activations.
MAC (maximum activation of convolutions) feature representation: We treat the 3D tensor
response as a set of 2D feature channel responses (512 channels) and max-pool over all
locations of each 2D channel response. This gives us a 1 x 512 feature vector which is
translation invariant, because it encodes only the maximum local response of each
convolutional filter and not where it occurs.
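A minimal NumPy sketch of the MAC computation is given here for illustration. It assumes the activation volume is available as an array of shape (512, 7, 7); random values stand in for real VGG16 activations, and the function name mac is ours. The circular shift at the end is an idealized way to show that moving responses around the map does not change the descriptor.

```python
import numpy as np


def mac(activations):
    """MAC descriptor of a (C, H, W) activation volume: one value per channel."""
    activations = np.maximum(activations, 0.0)  # ReLU: keep non-negative responses
    channels = activations.shape[0]
    # Flatten each 2D channel response and take its maximum over all locations.
    return activations.reshape(channels, -1).max(axis=1)


rng = np.random.default_rng(0)
fmap = rng.standard_normal((512, 7, 7))  # stand-in for last-conv-layer activations
v = mac(fmap)
print(v.shape)  # (512,)

# Circularly shifting the map moves every response but leaves the maxima unchanged.
shifted = np.roll(fmap, shift=(2, 3), axis=(1, 2))
print(np.allclose(v, mac(shifted)))  # True
```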
RMAC (regional maximum activation of convolutions): For this, we consider different regions
at L different scales on the CNN response maps. For every region, we compute the MAC
feature vector and post-process it with L2 normalization and PCA-whitening. Finally, we
aggregate all the regional feature vectors into one by summing them and L2-normalizing the
result. This gives a 1 x 512 RMAC feature vector.
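The RMAC steps can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the exact code used for this report: the region grid is a simplified version of multi-scale square sampling (side length 2 * min(H, W) / (l + 1) at scale l, rounded down), the names sample_regions and rmac are ours, and pca_mean and pca_whiten are hypothetical placeholders for PCA-whitening parameters that would have to be learned separately. Re-normalizing after whitening is also an assumption. When no PCA parameters are passed, the whitening step is skipped.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)


def sample_regions(h, w, num_scales=3):
    """Simplified multi-scale grid of square regions as (top, left, side)."""
    regions = []
    short = min(h, w)
    for scale in range(1, num_scales + 1):
        side = max(1, (2 * short) // (scale + 1))
        # Evenly spaced positions that cover the whole map along each axis.
        tops = np.linspace(0, h - side, int(np.ceil(h / side))).round().astype(int)
        lefts = np.linspace(0, w - side, int(np.ceil(w / side))).round().astype(int)
        regions.extend((t, x, side) for t in tops for x in lefts)
    return regions


def rmac(activations, num_scales=3, pca_mean=None, pca_whiten=None):
    """RMAC descriptor of a (C, H, W) activation volume."""
    activations = np.maximum(activations, 0.0)  # ReLU
    c, h, w = activations.shape
    total = np.zeros(c)
    for top, left, side in sample_regions(h, w, num_scales):
        region = activations[:, top:top + side, left:left + side]
        vec = l2_normalize(region.reshape(c, -1).max(axis=1))  # regional MAC
        if pca_mean is not None and pca_whiten is not None:
            # Hypothetical, separately learned whitening parameters (assumption).
            vec = l2_normalize(pca_whiten @ (vec - pca_mean))
        total += vec  # aggregate regional vectors by summation
    return l2_normalize(total)


rng = np.random.default_rng(0)
fmap = rng.standard_normal((512, 7, 7))  # stand-in for real activations
descriptor = rmac(fmap, num_scales=3)
print(len(sample_regions(7, 7, 3)), descriptor.shape)  # 14 (512,)
```

For a 7 x 7 map and L = 3, this simplified grid produces 1 + 4 + 9 = 14 regions with side lengths 7, 4 and 3.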
Ranking: RMAC feature vectors are computed for every image in the database. For every
query image, its MAC or RMAC feature vector is computed and matched against database
vectors using simple cosine similarity.
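The ranking step amounts to a matrix-vector product once the vectors are normalized. The sketch below assumes the database descriptors are stacked in an array of shape (N, 512); the random data, the function name rank_by_cosine and the top_k parameter are illustrative only.

```python
import numpy as np


def rank_by_cosine(query, database, top_k=5):
    """Return indices and scores of the top_k database rows most similar to query."""
    q = query / (np.linalg.norm(query) + 1e-12)
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-12)
    scores = db @ q  # cosine similarity of the query with every database vector
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]


rng = np.random.default_rng(0)
database = rng.standard_normal((100, 512))  # stand-in for database descriptors
# A slightly perturbed copy of entry 42 plays the role of the query image.
query = database[42] + 0.05 * rng.standard_normal(512)

order, scores = rank_by_cosine(query, database)
print(order[0])  # 42
```

Normalizing the database vectors once and storing them avoids repeating that work for every query.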
REFERENCES
@article{tolias2015particular,
  title={Particular object retrieval with integral max-pooling of CNN activations},
  author={Tolias, Giorgos and Sicre, Ronan and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1511.05879},
  year={2015}
}