Recognition of Indian Cultural Events Using a Convolutional Neural Network

Abstract: The main goal of this work is to recognize different Indian cultural
events. The problem is challenging because of the large number of events and
because each event is celebrated in different ways by India's diverse cultures. The
events differ in rituals as well as in the objects and content they involve. Still
images are used as the input modality to differentiate between the various cultural
events. An attempt has been made to pin down events of various classes and define
them. Beyond basic computer vision techniques, CNN features have been used
extensively to identify the salient characteristics that differentiate one cultural
event from another. Several pre-trained models have also been deployed to support
the proposed framework.

1 Introduction

A cultural event is celebrated to preserve an ancient belief or to express the collective
sentiment of a community tied to a specific culture or religion. With a population of
over 1.2 billion, India is the second most populous country in the world. Alongside six
major religions, many other faiths are practiced there. With a cultural history spanning
more than 4,500 years, India is the birthplace of prominent religions such as Hinduism,
Buddhism and Jainism, among many others.
Indian cultural events are known throughout the world for their aura and cultural
beliefs. Their mythological significance fascinates tourists, journalists, photographers,
devotees and many others across the globe. Recognizing an Indian cultural event from
still images alone is a demanding task, not just for a computer but also for the human
eye. The problem spans a large number of classes with widely varying representations,
so the authors set out to build a model that identifies an event from still images.
Because of India's cultural diversity, similar cultural events often share similar objects
and context. Recognizing this difficulty, the authors have tried to identify significant
and distinctive representations of each event.
Through this paper, the authors have tried to answer the following questions:

 What is distinctive about a cultural event?
 How can computer vision approaches be deployed to identify the distinctive
representations of a cultural event?
 Can these distinctive representations aid a computer in recognizing cultural events?

The paper is structured as follows. Section 2 reviews prior work on the topic.
Section 3 describes the experimental setup and scenarios. Section 4 details the dataset,
followed by the conclusion and references in Sections 5 and 6, respectively.

2 Literature Review

The problem of recognizing Indian cultural events can also be viewed as a scene
classification problem. This section discusses related work on cultural event
recognition.
For object recognition, A. Krizhevsky trained a large, deep convolutional network
consisting of five convolutional layers, some followed by max-pooling layers, and
three fully connected layers. He used dropout to reduce overfitting in the fully
connected layers, and classified 1.2 million high-resolution images with the trained
network [1]. The model, known as AlexNet, outperformed all previous models and won
the ILSVRC competition in 2012. Evaluated by allowing five guesses per image label,
it reduced the top-5 error to 15.3%.
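The top-5 evaluation protocol mentioned above can be sketched as follows; the scores here are toy values, not outputs of the actual model:

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of samples whose true label is NOT among the five
    highest-scoring predictions (the ILSVRC top-5 error)."""
    top5 = np.argsort(scores, axis=1)[:, -5:]      # five best guesses per sample
    hits = np.any(top5 == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# two toy samples over 8 classes: the first is a top-5 hit, the second a miss
scores = np.array([[0.1, 0.2, 0.3, 0.05, 0.15, 0.9, 0.4, 0.0],
                   [0.8, 0.1, 0.0, 0.02, 0.03, 0.01, 0.02, 0.02]])
labels = np.array([6, 5])
print(top5_error(scores, labels))  # 0.5
```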
For event recognition, L. Wang utilized deep CNNs and proposed a new architecture
called the object-scene convolutional neural network (OS-CNN). The architecture is
split into two separate networks, an object net and a scene net, which extract
information useful for event understanding from the perspectives of objects and scene
context, respectively [2]. By late fusion of the recognition results from the object net
and scene net, he obtained a performance of 85.5%.
Karen Simonyan and Andrew Zisserman [12], scholars at the University of Oxford,
presented a model known as VGGNet in the ILSVRC 2014 competition, where it
finished as runner-up. VGGNet consists of 16 convolutional layers and is appealing
for its uniform, straightforward architecture. The image is passed through a stack of
convolutional layers using filters with a very small receptive field (3x3, the smallest
size that captures the notion of left/right, up/down and center).
Another line of work on event recognition extracts the most revealing features from
images using a PageRank technique with a bag of features [10].
Sungheon Park and Nojun Kwak, scholars at the Graduate School of CST, Korea,
proposed an algorithm in which an image is broken down into smaller patches from
which distinct representations are extracted. The patches were trained using deep
convolutional neural networks, and the final result was computed by averaging all
probabilities with entropy thresholding [3].
Another work, presented by scholars from Stony Brook University, studied the
recognition of cultural events from still images. They used Least Squares Support
Vector Machines (LSSVM) and three local features: SIFT (to detect local features or
capture the texture of cultural events), color, and CNN features. Categorization
techniques such as spatial pyramid matching (SPM) and regularized max pooling
(RMP) were put to work; RMP partitions an image into multiple regions and
aggregates the local features computed for each region. The best results were achieved
by combining SPM using SIFT + color with RMP using CNN features [4].

A. Salvador combined the features of the last three fully connected layers of two
convolutional neural networks. They trained an SVM with a late-fusion strategy on
each of the extracted neural codes and achieved a mean average precision of 0.767 by
adding a temporal refinement step to their classification scheme [5].
Another work proposed automatic classification of events and sub-events using
time-clustering information. The authors obtained their best results by post-processing
the cluster information [6].
For image classification, the bag of visual words represents an image as a histogram
of visual-word occurrences. This representation has been used in human action
recognition [7], visual concept classification [8], and object and scene recognition [9].
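A bag-of-visual-words representation of this kind can be sketched as follows; the random vectors here are stand-ins for real local descriptors such as SIFT, and the vocabulary size is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# stand-in local descriptors (e.g. SIFT) pooled from a set of training images
train_descriptors = rng.normal(size=(500, 16))

# build the visual vocabulary by clustering the pooled descriptors
k = 8
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_descriptors)

def bovw_histogram(descriptors):
    """Represent one image by the normalized histogram of visual-word
    assignments of its local descriptors."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

image_desc = rng.normal(size=(40, 16))  # descriptors of a single image
h = bovw_histogram(image_desc)
print(h.shape)  # (8,)
```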
The GBVS model presented by scholars at Caltech [11] features extensively in the
model proposed by the authors. It revolves around the visually salient features of an
image and thus serves as a supporting layer for obtaining accurate results. Its core
functionality, computing an activation map and converting it into a saliency map, is
the backbone of the proposed model.

Fig. 1. Heat map of salient features of an image belonging to a specific cultural event

3 Experimental Setup

In this section, we describe the details of the experiments conducted. All
experiments were run on a machine powered by an Intel Core i5-7200U CPU clocked
at 2.50 GHz, with 8192 MB of RAM and an NVIDIA GeForce 940MX GPU with
4172 MB of memory.
The experiments were conducted under two scenarios. In the first scenario, various
standard models were applied to the dataset with different classifiers. In the second
scenario, specific emphasis was placed on certain image features, and a framework
was followed to obtain the targeted result.

3.1 First Experimental Scenario


In this section, we discuss the standard methodologies used to build several baseline
models. These baselines were later used to assess the validity of our proposed model.
The standard methodologies used various image descriptors, such as HOG, GIST
and LBP. The features extracted by these descriptors were then classified using
classifiers such as SVM, RF and KNN.
The dataset was split in the ratio 8:2, and we evaluated every combination of the
descriptors and classifiers above. The aim was to obtain a relatively stable framework
to use either as a base for our model or for comparison. Table 1 shows the results
obtained with the standard methodologies on the dataset.

Table 1. Different descriptors and classifiers used along with the accuracies

Descriptor Classifier Accuracy
HOG SVM 26.7388%
HOG RF 23.1839%
HOG KNN 14.9923%
LBP SVM 23.8390%
LBP RF 24.9226%
LBP KNN 14.5511%
GIST SVM 18.8563%
GIST RF 27.5116%
GIST KNN 21.3292%
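The descriptor-plus-classifier pipeline above can be sketched as follows. The descriptor here is a deliberately simplified orientation histogram standing in for a real HOG implementation (no cell/block structure), and the two synthetic texture classes with orthogonal stripe orientations are assumptions for illustration only:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def orientation_histogram(img, bins=9):
    """Simplified HOG-style descriptor: a global histogram of gradient
    orientations weighted by gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

# toy two-class data: sinusoidal textures varying along x vs along y
rng = np.random.default_rng(0)
X, y = [], []
for _ in range(100):
    tex_x = np.tile(np.sin(np.linspace(0, 8 * np.pi, 32)), (32, 1))
    tex_y = tex_x.T
    for img, label in ((tex_x, 0), (tex_y, 1)):
        X.append(orientation_histogram(img + 0.1 * rng.normal(size=(32, 32))))
        y.append(label)
X, y = np.array(X), np.array(y)

# 8:2 split, as in the paper, then a linear SVM on the descriptors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
print(acc)
```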

3.2 Second Experimental Scenario


This section discusses the focused approach we used to obtain an accurate model. The
dataset, described in Section 4, contains sixteen categories of images that aptly depict
the variation within each cultural event.
Indian cultural festivals in particular are celebrated in ways quite different from
festivals elsewhere in the world. They are ornamented with vibrant color schemes, and
each festival has a unique form of celebration. We therefore found that the salient
features of different festivals play a vital role in differentiating them from other
cultural festivals, and that these features strongly connect images belonging to the
same class. Using these features as the basis of our model, the graph-based visual
saliency of each image was extracted.

Graph-Based Visual Saliency. The Graph-Based Visual Saliency (GBVS) model [11]
is used to compute the salient representations of a given image. It is a bottom-up model
that works in two steps: first it forms an activation map based on certain features, and
then it normalizes the map so that those features are highlighted and become
prominent. Salient representations of an image are locations that contain informative
data related to a topic. To simplify, the salient features of an image are computed in
the steps described below.
The first step is the formation of an activation map, derived from a feature map of
the image. The activation map shows locations whose traits are unusual (Fig. 1),
judged by some criterion relative to the neighboring cells.
Suppose two points M(i, j) and M(p, q) are located on the lattice of the feature map;
the dissimilarity between those points is defined as

d((i, j)||(p, q)) ≜ |log(M(i, j) / M(p, q))|

Using the above dissimilarity, we build a graph, and to normalize it we introduce an
edge between nodes (i, j) and (p, q) with weight

w₂((i, j), (p, q)) ≜ A(p, q) ∙ F(i − p, j − q)

Using these edges, we map the complete features and focus only on the salient areas
of the image. The GBVS model thus provides the salient features of each image and
is used extensively in the proposed model.
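The dissimilarity measure above can be sketched as follows; the Gaussian falloff F and its width sigma are illustrative assumptions, and the edge weights shown are those of the activation-map graph rather than the full GBVS pipeline:

```python
import numpy as np

def dissimilarity(M, i, j, p, q):
    """d((i,j)||(p,q)) = |log(M(i,j)/M(p,q))|: how unusual the feature-map
    value M(i,j) is relative to M(p,q)."""
    return abs(np.log(M[i, j] / M[p, q]))

def edge_weight(M, i, j, p, q, sigma=2.0):
    """Edge weight on the lattice graph: dissimilarity modulated by a
    Gaussian falloff F(i-p, j-q) in lattice distance (sigma assumed)."""
    F = np.exp(-((i - p) ** 2 + (j - q) ** 2) / (2 * sigma ** 2))
    return dissimilarity(M, i, j, p, q) * F

# toy 4x4 feature map with one unusual cell
M = np.full((4, 4), 0.1)
M[2, 2] = 0.9  # a salient location

w_near = edge_weight(M, 2, 2, 2, 3)  # edge touching the unusual cell
w_flat = edge_weight(M, 0, 0, 0, 1)  # edge between two identical cells
print(w_near > w_flat)  # True: unusual cells attract heavier edges
```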

Fig. 2. Block diagram of the proposed model

The proposed model can be explained as a combination of two parts. In the first
part, we use the architecture proposed by Karen Simonyan and Andrew Zisserman,
known as VGGNet. VGGNet has a stack-like structure in which filters with very small
receptive fields are used. Using VGGNet, we could easily access features of several
parts of an image in the form of layers. We found that the FC6 and FC7 layers of
images belonging to the same class gave strong results, showing the connectivity
between the images.
The second part of the proposed model uses the saliency of the given image. We
evaluated the visual saliency of each image using the GBVS model and used AlexNet
to encode and differentiate between these saliency features. AlexNet has a large
learning capacity and prior knowledge to compensate for the data it lacks; it uses five
convolutional layers and three fully connected layers. The results from the AlexNet
and VGGNet models were then joined and classified together using a support vector
machine. Table 2 shows the results obtained in the experiment.

Table 2. Results of the experiments using various models

Models Accuracy
VGGNet 61.25%
AlexNet 62.50%
AlexNet + VGGNet 63.96%
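The join-and-classify step described above can be sketched as follows. The feature arrays here are random stand-ins for the real fc6/fc7 activations (dimensions reduced for speed), and the binary labels are constructed so that the signal spans both feature sets; only the concatenate-then-SVM structure is taken from the paper:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 320

# stand-ins for deep features of n images from the two networks
vgg_feats = rng.normal(size=(n, 64))
alex_feats = rng.normal(size=(n, 64))

# synthetic labels that depend on a component from each feature set,
# so neither network alone carries the whole signal
labels = ((vgg_feats[:, 0] + alex_feats[:, 0]) > 0).astype(int)

# late fusion: concatenate the two feature vectors per image
fused = np.hstack([vgg_feats, alex_feats])

X_tr, X_te, y_tr, y_te = train_test_split(
    fused, labels, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(fused.shape)  # (320, 128)
```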

4 Dataset

The proposed model aims to identify the salient features present in images belonging
to the same class. We used a convolutional neural network in the framework described
above to classify an Indian cultural event from its still images.
Inspired by the ChaLearn Cultural Event Recognition dataset, which consists of
images of 50 cultural events from 28 countries (such as the Rio Carnival and St
Patrick's Day), we built a dataset of 16 prominent cultural events of India. The dataset
exhibits variation in clothes, ways of celebration and human poses, as well as both
similarity and dissimilarity of objects. Thus, our main goal was to find saliency in
distinct representations amidst similar ones, which the dataset readily justifies.
The authors used a relatively small dataset of 1,607 images to run the experiments.
Most of the images were collected from Google and Bing. In all experiments, the
dataset was partitioned in the ratio 8:2, with 80% of the data used to train the model
and 20% used to test it. The list of cultural events and the number of images in each
class is given in Table 3. We extract the distinct characteristics of each cultural event
and classify the images with the model discussed above.
Table 3. Dataset description

S. No Cultural Event Number of images
1 Pushkar mela 100
2 Diwali 100
3 Holi 100
4 Eid 107
5 Ardh Kumbh mela 100
6 Monkey Buffet festival 100
7 Desert festival of Jaisalmer 100
8 Rakshabandhan 100
9 Rath Yatra 100
10 Thaipusam Tamil festival 100
11 Thrissur Pooram festival 100
12 Ganesh Chaturthi 100
13 Christmas 100
14 Janmashtami 100
15 Durga Pooja 100
16 Onam 100
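The 8:2 partition described in this section can be sketched as follows, with per-class counts taken from Table 3; the stratified split is an assumption for illustration, since the paper does not state how the 80/20 split was drawn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# per-class image counts from Table 3: fifteen classes of 100 images plus Eid (107)
counts = [100] * 16
counts[3] = 107  # Eid
labels = np.concatenate([np.full(n, c) for c, n in enumerate(counts)])
images = np.arange(len(labels))  # placeholder image indices

# 8:2 split; stratify keeps class proportions similar in train and test
train_x, test_x, train_y, test_y = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0)

print(len(labels))  # 1607 images in total
print(len(train_x), len(test_x))
```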

5 Conclusion

From the tests conducted on the dataset, it can be inferred that the proposed model is
a reliable technique for classifying different Indian cultural events. Further work can
explore more layers of the image features using models with different architectures.
The results clearly signify the importance and future of deep learning frameworks in
identifying cultural events from their images. Models like these can be used to promote
tourism and for educational purposes. Such work also has economic importance, as it
can help attract foreign capital by drawing visitors from outside India.
Fig. 3. Confusion matrix obtained when the results of AlexNet are combined with VGGNet and
classified using an SVM

6 References

1. Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with
deep convolutional neural networks. In Advances in Neural Information
Processing Systems.
2. Wang, L., Wang, Z., Du, W. and Qiao, Y., 2015. Object-scene convolutional neural
networks for event recognition in images. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops (pp. 30-35).
3. Park, S. and Kwak, N., 2015. Cultural event recognition by subregion classification
with convolutional neural network. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops (pp. 45-50).
4. Kwon, H., Yun, K., Hoai, M. and Samaras, D., 2015. Recognizing cultural events
in images: A study of image categorization models. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops (pp. 51-57).
5. Salvador, A., Zeppelzauer, M., Manchon-Vizuete, D., Calafell, A. and Giro-i-
Nieto, X., 2015. Cultural event recognition with visual convnets and temporal
models. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops (pp. 36-44).
6. Mattivi, R., Uijlings, J., De Natale, F.G. and Sebe, N., 2011. Exploitation of time
constraints for (sub-)event recognition. In Proceedings of the 2011 Joint ACM
Workshop on Modeling and Representing Events (pp. 7-12). ACM.
7. Dollár, P., Rabaud, V., Cottrell, G. and Belongie, S., 2005. Behavior recognition
via sparse spatio-temporal features. In 2nd Joint IEEE International Workshop on
Visual Surveillance and Performance Evaluation of Tracking and Surveillance
(pp. 65-72). IEEE.
8. Uijlings, J.R., Smeulders, A.W. and Scha, R.J., 2010. Real-time visual concept
classification. IEEE Transactions on Multimedia, 12(7), pp. 665-681.
9. Van De Sande, K., Gevers, T. and Snoek, C., 2010. Evaluating color descriptors
for object and scene recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 32(9), pp. 1582-1596.
10. Imran, N., Liu, J., Luo, J. and Shah, M., 2009. Event recognition from photo
collections via PageRank. In Proceedings of the 17th ACM International
Conference on Multimedia (pp. 621-624). ACM.
11. Harel, J., Koch, C. and Perona, P., 2007. Graph-based visual saliency. In Advances
in Neural Information Processing Systems (pp. 545-552).
12. Simonyan, K. and Zisserman, A., 2015. Very deep convolutional networks for
large-scale image recognition. In ICLR.