IIT KANPUR CS-698 VISUAL RECOGNITION MODEL CNN

INDIAN INSTITUTE OF TECHNOLOGY
KANPUR
CS-698
VISUAL RECOGNITION
Prof. Vinay P. Namboodiri
BY
ABHINAV JAIN, 13022
RISHABH GUPTA, 13571
INTRODUCTION
For the task of instance retrieval, we worked on two models: one presented in the paper
Scalable Recognition with a Vocabulary Tree, and another that uses Convolutional Neural
Networks. For the rest of this report they are referred to as Model-Tree and Model-CNN
respectively.
The basic idea underlying the CNN-based pipeline is to extract a good feature representation
of the input I, whether it is an image or a sub-patch of one. Convolutional Neural Networks
are used for this. Units in the first convolutional layer of the CNN respond to simple
patterns such as edges or textures. Units in the second layer aggregate those responses into
more complex patterns. As the process continues deeper, units are expected to respond to
increasingly complex patterns in their corresponding receptive fields. This yields a good
feature representation of the input image. Activations of any convolutional layer can be used
to construct a feature vector, and the choice of layer depends on the goal for which the CNN
is being used.
Our goal is instance retrieval: given a query image, we retrieve the best matches from the
database. Instance retrieval must handle intra-class variability: only images containing the
exact same instance of the object should be retrieved, not merely images of the same class.
For this, activations from early layers will not suffice. They have to be taken from layers
deeper in the network. In fact, the study on the factors of transferability provides
experimental evidence that the last convolutional layer is the best choice for instance
retrieval.
VGG16 was used in this assignment, with pre-trained weights, to obtain activations from the
last convolutional layer. All fully connected layers were discarded. This gives a
512 x 7 x 7 output volume, a 3D tensor activation map of the image. Fine-tuning the model on
images from the database is customary practice, but fine-tuning roughly 14.7 million
parameters with only ~6K database images would likely over-fit the model. We therefore used
the pre-trained model as is and focused on turning the activation map into a good feature
representation.
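The two figures quoted above can be checked with a little arithmetic. The sketch below assumes the standard VGG16 layout (13 convolutional layers with 3 x 3 kernels, five 2 x 2 max-pooling stages) and a 224 x 224 input. The report states neither explicitly, so the layer list and the names conv_params and output_side are illustrative only.

```python
# Assumed standard VGG16 convolutional stack: (in_channels, out_channels)
# for each of the 13 conv layers, all with 3x3 kernels.
VGG16_CONV = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(layers, kernel=3):
    """Count weights plus biases of a stack of square-kernel conv layers."""
    return sum(cin * cout * kernel * kernel + cout for cin, cout in layers)


def output_side(input_side=224, num_pools=5):
    """Spatial side length after num_pools stride-2 pooling stages."""
    return input_side // (2 ** num_pools)


total = conv_params(VGG16_CONV)
print(total)            # 14714688, i.e. roughly 14.7 million
print(output_side())    # 7, matching the 512 x 7 x 7 volume
```

Under these assumptions there are about 2,800 trainable parameters per database image (14.7 million over 63 x 84 = 5292 images), which is the imbalance behind the over-fitting concern.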
MODEL CNN
For every feature map, we apply ReLU to get non-negative activations.
MAC (maximum activation of convolutions) feature representation: We treat the 3D tensor
response as a set of 2D feature channel responses (512 channels) and max-pool over all
locations of each 2D channel response. This gives a 1 x 512 feature vector that is
translation invariant, because it encodes the maximum local response of each convolutional
filter.
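A minimal NumPy sketch of the MAC computation described above, assuming the activation map is available as an array of shape (channels, height, width). The function name mac and the random stand-in for real VGG16 activations are illustrative, not the code used in the assignment.

```python
import numpy as np


def mac(fmap):
    """MAC descriptor: ReLU, then global max-pooling over each channel.

    fmap: array of shape (C, H, W). Returns a vector of shape (C,).
    """
    relu = np.maximum(fmap, 0.0)      # non-negative activations
    return relu.max(axis=(1, 2))      # one maximum per channel


# Random stand-in for a 512 x 7 x 7 activation volume.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((512, 7, 7))
vec = mac(fmap)

# Moving the activations around the map (here a circular shift) leaves the
# descriptor unchanged, because only each channel's maximum is kept.
shifted = np.roll(fmap, shift=(2, 3), axis=(1, 2))
same = np.allclose(mac(shifted), vec)
```

The circular shift is only a convenient proxy for translation. With real images, content that moves out of the frame would of course change the descriptor.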
RMAC (regional maximum activation of convolutions): We consider regions at L different
scales on the CNN response maps. For every region, we compute the MAC feature vector and
post-process it with L2 normalization and PCA-whitening. Finally, all regional feature
vectors are aggregated into one by summing and L2-normalizing. This results in a
1 x 512 RMAC feature vector.
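The RMAC aggregation can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure of the paper or of this assignment: regions at scale l are square with side floor(2 * min(H, W) / (l + 1)) and are placed on a uniform l x l grid, and the PCA-whitening step is omitted because it needs a projection fitted on a separate image set. The names regions, l2n and rmac are hypothetical.

```python
import numpy as np


def l2n(v, eps=1e-12):
    """L2-normalize a vector (eps guards against division by zero)."""
    return v / (np.linalg.norm(v) + eps)


def regions(h, w, levels=3):
    """Square regions (top, left, size) on a uniform l x l grid per scale.

    Simplification: the same number of regions is used along both axes,
    which is only a reasonable choice for roughly square maps.
    """
    short = min(h, w)
    out = []
    for l in range(1, levels + 1):
        size = max(1, int(np.floor(2 * short / (l + 1))))
        tops = np.linspace(0, h - size, l).round().astype(int)
        lefts = np.linspace(0, w - size, l).round().astype(int)
        out.extend((int(t), int(x), size) for t in tops for x in lefts)
    return out


def rmac(fmap, levels=3):
    """Sum of L2-normalized regional MAC vectors, L2-normalized again.

    PCA-whitening of the regional vectors is left out of this sketch.
    """
    relu = np.maximum(fmap, 0.0)
    _, h, w = relu.shape
    agg = np.zeros(relu.shape[0])
    for top, left, size in regions(h, w, levels):
        patch = relu[:, top:top + size, left:left + size]
        agg += l2n(patch.max(axis=(1, 2)))
    return l2n(agg)


rng = np.random.default_rng(0)
vec = rmac(rng.standard_normal((512, 7, 7)), levels=3)
```

On a 7 x 7 map with three scales, this grid produces 1 + 4 + 9 = 14 regions with sides 7, 4 and 3.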
Ranking: RMAC feature vectors are computed for every image in the database. For every
query image, its MAC or RMAC feature vector is computed and matched against database
vectors using simple cosine similarity.
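The ranking step amounts to a matrix-vector product once all vectors are L2-normalized. A small sketch, assuming the database descriptors are stacked row-wise in a NumPy array (the function name rank and the toy 2-D vectors are illustrative):

```python
import numpy as np


def rank(db, query, top_k=None):
    """Rank database rows by cosine similarity to the query vector.

    db: array (N, D), query: array (D,). Returns (indices, scores),
    both sorted from best to worst match.
    """
    db_n = db / (np.linalg.norm(db, axis=1, keepdims=True) + 1e-12)
    q_n = query / (np.linalg.norm(query) + 1e-12)
    scores = db_n @ q_n                   # cosine similarity per image
    order = np.argsort(-scores)           # descending similarity
    if top_k is not None:
        order = order[:top_k]
    return order, scores[order]


# Toy example with three 2-D "descriptors".
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
order, scores = rank(db, np.array([1.0, 0.1]))
print(order.tolist())   # [0, 2, 1]
```

If the descriptors are already unit length, as RMAC vectors are after the final L2 normalization, the two normalization lines are redundant but harmless.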
EXPERIMENT AND OBSERVATIONS (MODEL CNN)

Initial Retrieval: For each query image (Q), all 63 x 84 database images were ranked
(84: number of categories, 63: number of images in each category). The number of scales, L,
was set to 3 as in the paper. Experimentally, we noticed that increasing L further
deteriorated the results: more scales mean more regions contributing to the RMAC vector,
which increases the coverage of regions when traced back to the image and therefore brings
in more background clutter.
Re-ranking: The top 1000 images were re-ranked using the re-ranking module. Re-ranking
exhaustively searched for the best region in the output volume at L=5 different scales over
a fixed grid. The optimal region was found as follows: first, a MAC vector was computed for
every region and matched against the MAC vector of the query image using cosine similarity.
The scores were sorted, the best-scoring region was chosen as optimal, and the RMAC vector
was re-computed for it. Finally, the matching and ranking procedure was repeated for the
top 1000 images. The top images, along with their categories, were listed in a .txt file.
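The region search can be illustrated with the sketch below. It makes several assumptions that the report does not state: the five scales are taken to be square windows of side 7, 5, 4, 3 and 2 on the 7 x 7 map, the grid has stride 1, and the shortlist is re-ordered directly by the best regional MAC similarity (re-computing an RMAC vector inside the chosen region is left out). The names best_region and rerank are hypothetical.

```python
import numpy as np


def l2n(v, eps=1e-12):
    """L2-normalize a vector (eps guards against division by zero)."""
    return v / (np.linalg.norm(v) + eps)


def mac(fmap):
    """ReLU followed by a per-channel maximum over all locations."""
    return np.maximum(fmap, 0.0).max(axis=(1, 2))


def best_region(fmap, query_vec, sizes=(7, 5, 4, 3, 2)):
    """Exhaustive fixed-grid search for the window whose MAC vector is
    most similar (cosine) to the query MAC vector.

    Returns (score, (top, left, size)). The window sizes are assumed.
    """
    _, h, w = fmap.shape
    q = l2n(query_vec)
    best_score, best_box = -1.0, None
    for size in sizes:
        if size > min(h, w):
            continue
        for top in range(h - size + 1):
            for left in range(w - size + 1):
                v = l2n(mac(fmap[:, top:top + size, left:left + size]))
                score = float(v @ q)
                if score > best_score:
                    best_score, best_box = score, (top, left, size)
    return best_score, best_box


def rerank(shortlist_maps, query_vec):
    """Order shortlist positions by their best regional score, best first."""
    scores = [best_region(f, query_vec)[0] for f in shortlist_maps]
    return [int(i) for i in np.argsort(-np.asarray(scores))]


# Toy check: image `a` contains the query pattern plus clutter, `b` does not.
patch = np.arange(1, 33, dtype=float).reshape(8, 2, 2)
query_vec = mac(patch)
a = np.zeros((8, 7, 7)); a[:, 3:5, 4:6] = patch; a[7, 0, 0] = 100.0
b = np.zeros((8, 7, 7)); b[0] = 1.0
order = rerank([b, a], query_vec)
```

In the toy check, the clutter at one corner of `a` spoils the whole-map descriptor, but a window that excludes it matches the query almost exactly, so `a` is placed ahead of `b`.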
Limitations: The categories of the objects retrieved for some query images were not
satisfactory. Not fine-tuning the CNN model could be one reason. We consider it the most
plausible one, because the VGG16 model was trained on a very different database.
REFERENCES
@article{tolias2015particular,
title={Particular object retrieval with integral max-pooling of CNN activations},
author={Tolias, Giorgos and Sicre, Ronan and J{\'e}gou, Herv{\'e}},
journal={arXiv preprint arXiv:1511.05879}, year={2015}}
