Abstract: The main goal of this work is to recognize different Indian cultural
events. The problem is challenging due to the presence of many events, each
celebrated in different ways by India's diverse cultures. The events differ in their
rituals as well as in the variety of objects and content associated with them. Still
images are used as the mode of input, which helps us differentiate between the various
cultural events. An attempt has been made to pin down events of various classes and
define them. Apart from basic computer vision techniques, CNN features have been used
extensively to identify the salient features that differentiate one cultural event
from another. Various pre-trained models have also been deployed to support the
proposed framework.
1 Introduction
The rest of the paper is structured as follows. Section 2 reviews prior work on
the topic. It is followed by the experimental setup and scenarios in Section 3.
Section 4 contains the details of the dataset, and it is followed by the conclusion
and references in Sections 5 and 6 respectively.
2 Literature Review
The problem of recognizing Indian cultural events can also be viewed as a scene
classification problem. Related work on cultural event recognition is discussed
in this section.
For object recognition, A. Krizhevsky trained a large, deep convolutional network.
The network consisted of five convolutional layers, some followed by max-pooling
layers, and three fully connected layers. The dropout method was used to reduce
overfitting in the fully connected layers. He collected a large dataset and classified
1.2 million high-resolution images with the trained network [1]. The model, called
AlexNet, outperformed all previous models and won the ILSVRC competition in 2012.
Evaluation allowed five guesses about each image label, and the model reduced the
top-5 error to 15.3%.
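The top-5 metric mentioned above counts an image as correct when the true label appears among the model's five highest-scoring guesses. A minimal sketch of that computation (the toy scores are illustrative, not from the paper):

```python
import numpy as np

def top5_error(logits, labels):
    """Fraction of images whose true label is not among the 5 best guesses."""
    top5 = np.argsort(logits, axis=1)[:, -5:]       # indices of the 5 highest scores
    hits = np.any(top5 == labels[:, None], axis=1)  # is the true label among them?
    return 1.0 - hits.mean()

# toy scores for 2 images over 10 classes
scores = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]], dtype=float)
labels = np.array([9, 9])          # class 9 ranks 1st for image 0, last for image 1
err = top5_error(scores, labels)   # 0.5: one hit, one miss
```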
For event recognition, L. Wang utilized deep CNNs and proposed a new architecture
called the object-scene convolutional neural network (OS-CNN). The architecture was
broken down into two separate networks, an object net and a scene net, which extract
useful information for event understanding from the perspective of objects and scene
context [2]. By late fusion of the recognition results from the object net and the
scene net, a performance of 85.5% was obtained.
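Late fusion, as used in OS-CNN, combines the two networks after they have each produced per-class scores. A minimal sketch, assuming softmax scores are already available from both nets (the weight and the toy scores are illustrative):

```python
import numpy as np

def late_fusion(object_scores, scene_scores, w=0.5):
    """Weighted average of per-class scores from the object net and scene net."""
    return w * object_scores + (1.0 - w) * scene_scores

obj = np.array([[0.70, 0.20, 0.10]])   # object-net softmax scores
scn = np.array([[0.40, 0.50, 0.10]])   # scene-net softmax scores
fused = late_fusion(obj, scn)          # [[0.55, 0.35, 0.10]]
pred = int(fused.argmax(axis=1)[0])    # predicted class: 0
```

Averaging scores rather than features lets each network be trained and tuned independently before combination.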
Karen Simonyan and Andrew Zisserman [12], scholars at the University of Oxford,
presented a model known as VGGNet in the ILSVRC 2014 competition, where it took the
runner-up spot. VGGNet consists of 16 weight layers (13 convolutional and 3 fully
connected) and is an appealing model due to its stable and plain architecture. The
image is passed through a stack of convolutional layers in which filters with a very
small receptive field (3x3, the smallest size able to capture the notion of
left/right, up/down and centre) are used.
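The appeal of stacking small 3x3 filters is that the effective receptive field grows with depth: two stacked stride-1 3x3 layers see a 5x5 region, and three see 7x7, at a lower parameter cost than one large filter. A quick sketch of that arithmetic:

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stacked stride-1 convolutions."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1   # each stride-1 layer widens the field by (kernel - 1)
    return rf

# stacked_receptive_field(2) == 5, stacked_receptive_field(3) == 7
```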
Another work on event recognition extracts the most revealing features from images
using the PageRank technique with a bag of features [10].
Sungheon Park and Nojun Kwak, scholars at the Graduate School of CST, Korea,
proposed an algorithm in which the image is broken down into smaller patches and
distinct representations are extracted. The patches were trained using deep
convolutional neural networks, and the result was calculated by taking the mean of
all patch probabilities with entropy thresholding [3].
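Entropy thresholding in this setting discards patches whose class distribution is near-uniform (high entropy, i.e. uninformative) before averaging. A minimal sketch of one plausible form of that aggregation, with illustrative toy probabilities:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each probability row."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def aggregate_patches(patch_probs, max_entropy=1.0):
    """Average class probabilities over patches whose prediction entropy is low."""
    keep = entropy(patch_probs) <= max_entropy
    chosen = patch_probs[keep] if keep.any() else patch_probs  # fall back to all
    return chosen.mean(axis=0)

patches = np.array([[0.90, 0.05, 0.05],   # confident patch (low entropy)
                    [0.34, 0.33, 0.33]])  # uninformative patch (high entropy)
final = aggregate_patches(patches, max_entropy=0.5)  # the second patch is dropped
```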
Another work was presented by scholars from Stony Brook University to study the
recognition of cultural events from still images. They used Least Squares Support
Vector Machines (LSSVM) and three local features: SIFT (to detect local features in
images and capture the texture of cultural events), color and CNN features.
Categorization techniques like spatial pyramid matching (SPM) and regularized max
pooling (RMP) were put to work. RMP works by partitioning an image into multiple
regions and aggregating the local features computed for each region. The best results
were achieved by combining SPM using SIFT + color with RMP using CNN [4].
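The region-wise aggregation idea behind RMP can be sketched as partitioning a dense feature map into a grid, max-pooling each cell, and concatenating the results (a simplification of the actual RMP formulation, with illustrative shapes):

```python
import numpy as np

def region_max_pool(features, grid=2):
    """Max-pool a dense (H, W, D) feature map over a grid x grid partition."""
    H, W, D = features.shape
    pooled = []
    for i in range(grid):
        for j in range(grid):
            block = features[i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid]
            pooled.append(block.max(axis=(0, 1)))  # one D-dim vector per region
    return np.concatenate(pooled)                  # grid*grid*D image descriptor

feats = np.random.default_rng(0).normal(size=(8, 8, 16))
desc = region_max_pool(feats)   # shape (64,): 2*2 regions x 16 dims
```

Pooling per region preserves coarse spatial layout that a single global pool would discard.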
A. Salvador combined the features of the last three fully connected layers of two
convolutional neural networks. An SVM was trained with a late fusion strategy on each
of the extracted neural codes, achieving a mean average precision of 0.767 after
adding a temporal refinement step to the classification scheme [5].
Another work proposed the automatic classification of events and sub-events using
time-clustering information. The best results were obtained by using post-processing
and cluster information [6].
For image classification, the bag of visual words approach generates a histogram of
visual word occurrences that represents an image. This feature has been used in human
action recognition [7], visual concept classification [8], and object and scene
recognition [9].
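A bag-of-visual-words descriptor assigns each local descriptor to its nearest codeword and counts the assignments. A minimal sketch with a tiny illustrative codebook (real codebooks come from clustering, e.g. k-means over SIFT descriptors):

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word, then count."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)                    # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                     # normalised word histogram

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])  # two toy visual words
descs = np.array([[0.1, 0.2], [9.8, 10.1], [0.3, 0.1], [10.2, 9.9]])
hist = bovw_histogram(descs, codebook)           # [0.5, 0.5]
```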
The GBVS model presented by scholars at Caltech [11] features extensively in the
model proposed by the authors. It revolves around the visually salient features of
images and thus serves as a supporting layer for obtaining accurate results. Its core
functionality, calculating an activation map and converting it into a saliency map,
is the backbone of the proposed model.
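The normalisation step at the end of a saliency pipeline can be sketched simply: rescale the activation map into [0, 1] and use it as a per-pixel weight on the image. This is a simplified stand-in for GBVS's graph-based normalisation, shown only to illustrate the activation-map-to-mask idea:

```python
import numpy as np

def activation_to_saliency(activation):
    """Min-max normalise an activation map into a [0, 1] saliency map."""
    shifted = activation - activation.min()
    peak = shifted.max()
    return shifted / peak if peak > 0 else np.zeros_like(shifted)

def apply_saliency(image, saliency):
    """Weight each pixel by its saliency (mask broadcast over colour channels)."""
    return image * saliency[..., None]

act = np.array([[1.0, 3.0], [2.0, 5.0]])
sal = activation_to_saliency(act)   # 0 at the 1.0 cell, 1 at the 5.0 cell
```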
Fig. 1. Heat map of the salient features of an image belonging to a specific cultural event
3 Experimental Setup
In this section, we describe the details of the experiments conducted. All
experiments were run on a machine powered by an Intel Core i5-7200U CPU clocked at
2.50 GHz, with 8192 MB of RAM and an NVIDIA GeForce 940MX graphics processing unit
with 4172 MB of memory.
The experiments were conducted under two scenarios. In the first scenario, various
standard models were used on the dataset with different classifiers. In the second
scenario, specific emphasis was placed on certain features of the image, and a
framework was followed to obtain the targeted result.
Table 1. Different descriptors and classifiers used along with the accuracies
Using these edges, we map the complete features and focus only on the salient areas
of the image. The GBVS model helps in calculating the salient features of each image
and is used extensively in the proposed model.
The VGGNet model extracts the features of several parts of the image in the form of
layers. We found that the FC6 and FC7 layers of images belonging to the same class
gave very good results, showing the connectivity between the images.
The second part of the proposed model uses the saliency of the given image. We
evaluated the visual saliency of each image using the GBVS model and used the AlexNet
model to store and differentiate between these features. The AlexNet model has a
large learning capacity and the prior knowledge to accommodate all the data; it uses
five convolutional layers and three fully connected layers. Finally, the results from
the AlexNet model and the VGGNet model were joined and classified together using a
support vector machine. Table 2 shows the results obtained during the experiment.
Table 2. Accuracy obtained by each model

Model              Accuracy
VGGNet             61.25%
AlexNet            62.50%
AlexNet + VGGNet   63.96%
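The combination step above (joining the two networks' features and classifying with an SVM) can be sketched as feature concatenation followed by a linear SVM. The random features below are hypothetical stand-ins for the FC-layer codes; the real ones come from AlexNet and VGGNet:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# hypothetical stand-ins for FC-layer features extracted by the two CNNs
alexnet_feats = rng.normal(size=(40, 8))
vggnet_feats = rng.normal(size=(40, 8))
labels = np.repeat([0, 1], 20)
alexnet_feats[labels == 1] += 2.0   # toy separation between the two classes
vggnet_feats[labels == 1] += 2.0

combined = np.hstack([alexnet_feats, vggnet_feats])  # join the two feature sets
clf = SVC(kernel="linear").fit(combined, labels)
train_acc = clf.score(combined, labels)
```

Concatenation lets the SVM weigh evidence from both networks jointly, which is consistent with the combined row of Table 2 outperforming either model alone.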
4 Dataset
The proposed model aims to identify the salient features present in images belonging
to the same class. We used a convolutional neural network in the framework described
above, which later classifies an Indian cultural event from its still images.
Inspired by the ChaLearn Cultural Event Recognition dataset, which consists of
images of 50 cultural events across 28 countries, such as the Rio Carnival and St
Patrick's Day, we built a dataset of 16 prominent cultural events of India. In the
dataset, we can observe the variation in clothes, ways of celebration and human
poses, and the similarity and dissimilarity of objects. Thus, our main goal was to
find saliency in distinct representations amidst similar ones, which is readily
justified by the dataset.
The authors used a relatively small dataset of 1607 images to run the experiments,
most of which were collected from Google and Bing. The dataset was partitioned in the
ratio 8:2 (80% of the data to train the model and 20% to test it) in all the
experiments. The list of cultural events and the number of images in each class is
given in Table 3. We extract the different characteristics of each cultural event and
classify the images with the model discussed above.
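The 8:2 partition described above can be sketched as a shuffled index split (the seed and function name are illustrative):

```python
import numpy as np

def split_80_20(num_images, seed=0):
    """Shuffle image indices and split them 80% train / 20% test."""
    idx = np.random.default_rng(seed).permutation(num_images)
    cut = int(0.8 * num_images)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_80_20(1607)   # 1285 training and 322 test images
```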
5 Conclusion
From the tests conducted on the dataset, it can be inferred that the proposed model
is a reliable technique for classifying different Indian cultural events. Further
work can explore more layers of the images using models with different architectures.
The results clearly signify the importance and future of deep learning frameworks in
identifying various cultural events from their images. Models like these can be used
for tourism and educational purposes. Work like this also has economic importance, as
it can help attract foreign capital by drawing visitors from outside India.
Fig. 3. Confusion matrix when the results of AlexNet are combined with VGGNet and
classified using SVM
6 References
9. Van De Sande, K., Gevers, T. and Snoek, C., 2010. Evaluating color descriptors for
object and scene recognition. IEEE transactions on pattern analysis and machine
intelligence, 32(9), pp.1582-1596.
10. Imran, N., Liu, J., Luo, J. and Shah, M., 2009, October. Event recognition from
photo collections via pagerank. In Proceedings of the 17th ACM international
conference on Multimedia (pp. 621-624). ACM.
11. Harel, J., Koch, C. and Perona, P., 2007. Graph-based visual saliency. In Advances
in neural information processing systems (pp. 545-552).
12. Simonyan, K. and Zisserman, A., 2015. Very deep convolutional networks for
large-scale image recognition. In ICLR.