Abstract
In today’s era of digitization and fast internet, many videos are uploaded to websites, and a
mechanism is required to access these videos accurately and efficiently. Semantic concept detection
achieves this task and is used in many applications like multimedia annotation, video
summarization, indexing, and retrieval. Video retrieval based on semantic concepts is an
efficient yet challenging research area. Semantic concept detection bridges the semantic gap
between the low-level features extracted from key-frames or shots of a video and their high-level
interpretation as semantics. Semantic concept detection automatically assigns labels
to video from a predefined vocabulary. This task is considered a supervised machine learning
problem. The support vector machine (SVM) emerged as the default classifier choice for this task, but
recently the deep Convolutional Neural Network (CNN) has shown exceptional performance in this
area. CNNs require large datasets for training. In this paper, we present a framework for semantic
concept detection using a hybrid model of SVM and CNN. Global features like color moments, HSV
histogram, wavelet transform, grey-level co-occurrence matrix, and edge orientation histogram are
selected as low-level features extracted from the annotated ground-truth video dataset of TRECVID.
In a second pipeline, deep features are extracted using a pretrained CNN. The dataset is partitioned into
three segments to deal with the data imbalance issue. The two classifiers are trained separately on all
segments, and fusion of scores is performed to detect the concepts in the test dataset. System
performance is evaluated using Mean Average Precision for the multi-label dataset. The performance
of the proposed framework using the hybrid model of SVM and CNN is comparable to existing
approaches.
Keywords: Semantic Concept Detection, SVM, CNN, Multi-label Classification, Deep Features,
Imbalanced Dataset.
1. INTRODUCTION
The semantic concept detection system detects the concepts present in key-frames of a video
based on features and automatically assigns labels to the video based on a predefined concept list.
Humans can assign labels based on visual appearance and experience, but automatic
semantic detection systems perform the mapping from low-level features to high-level concepts
using machine learning techniques. Such systems are useful for many applications like semantic
indexing, annotation, video summarization, and retrieval.
A semantic concept detection system works on bridging the semantic gap by mapping
low-level features to high-level video semantics. Extensive research in this field has improved
the efficiency of semantic concept detection systems, but it is still a challenging problem due to
the large variations in the low-level features of semantic concepts and inter-concept similarities.
International Journal of Image Processing (IJIP), Volume (13) : Issue (2) : 2019 13
Nita S. Patil & Sudhir D. Sawarkar
Earlier researchers focused on improving the accuracy of concept detection systems using global
and local features obtained from key-frames or shots of the video and various machine learning
algorithms. In recent years, owing to technological advances in computing power, deep learning
techniques, especially the Convolutional Neural Network (CNN), have shown promising improvements in
efficiency in various fields. CNNs have a powerful ability to extract features from and classify
large amounts of data and hence are widely adopted in concept detection systems.
Systems with an unbalanced dataset have fewer relevant examples than irrelevant
examples. This limits classifier accuracy during the training phase, and the resulting classifier model
may misclassify. Methods used to deal with the unbalanced dataset problem are
mostly based on over-sampling positive examples and down-sampling negative examples, which
may lead to over-fitting of the classifier.
This paper proposes a framework for effective feature extraction and classification that deals with the
imbalanced dataset problem. Section 2 covers related work. Section 3 discusses the basic semantic
concept detection system. Section 4 focuses on the methodology, concept selection for
segments, and the generation of concept training data. Section 5 presents the results of the
experimental evaluation of the concept detector system. Finally, Section 6 concludes the paper
by discussing the key aspects of the presented concept detection system.
2. RELATED WORK
Feature extraction and feature selection are fundamental and important steps in the concept detection
task. In this section we discuss deep learning frameworks and semantic concept detection using
traditional handcrafted features and deep learning methods.
Deep learning frameworks like Caffe [1], Theano [2], and cuda-convnet have been adopted in various video
processing applications.
The deep convolutional network proposed by Krizhevsky et al. [3], implemented on the ImageNet dataset
in the ILSVRC-2010 and ILSVRC-2012 competitions, set a benchmark in deep learning. Krizhevsky
et al. achieved the best results and reduced the top-5 test error by 10.9% compared with the
second winner.
The Deep Convolutional Activation Feature (DeCAF) [4] was used to extract features from an
unlabeled dataset. DeCAF learns features with high generalization and representation ability to
extract semantic information using simple linear classifiers such as the Support Vector Machine
(SVM) and Logistic Regression (LR).
Jia et al. [1] proposed Caffe, a Convolutional Architecture for Fast Feature Embedding, which not only
contains modifiable deep learning algorithms but also provides several pretrained reference models.
Another model for object detection is the region-based CNN (R-CNN) [5], which extracts features from region proposals
to detect semantic concepts in large datasets.
Many approaches rely on local descriptors like SIFT, SURF, and ORB [9][10][11][12]. Duy-Dinh Le et al. [13] evaluated the
performance of global features and local features for the semantic concept detection task on the
TRECVID datasets from 2005 to 2009. They discussed the performance of individual global
features like color moments, color histogram, edge orientation histogram, and local binary
patterns on grids of varying size, various color spaces including HSV, RGB, Luv, and YCrCb, and
variations in bin sizes. The local feature used was SIFT with BOW. They also considered late fusion
of all features by averaging the probability scores of SVM classifiers. They concluded that global
features are compact and effective in feature representation compared to the computational
complexity of local features.
Next, Simonyan et al. [15] proposed a two-stream CNN network for video classification. One
network analyzes the spatial information while the second analyzes the optical flow field. Their
approach yields a significant improvement over the single-frame model.
Tran et al. [16] presented the first single-stream CNN that incorporates both spatial and temporal
information at once. Their approach takes multiple frames as input and examines them with 3D
spatio-temporal convolutional filters. They handled the problem of activity recognition,
performed an evaluation on the UCF101 dataset, and outperformed all previous work. In addition,
their technique is fast, as it does not require optical flow estimation.
Xiaolong Wang et al. [17] used unsupervised learning methods for visual feature representation.
They extracted SURF interest points and used Improved Dense Trajectories (IDT) to obtain the
motion of each SURF point, applying thresholding to retain the moving SURF points; the best
bounding box was then extracted to contain the maximum number of moving SURF points. The first
and last frames of the moving object or object part in a patch were tracked using a KCF tracker,
forming two parts of a triplet. Given the various patches from these triplets, they extracted
feature representations from the last layer of a Siamese Triplet Network and CNN by forward
propagation, trained with a ranking loss function. They obtained a boosted result using a model
ensemble of AlexNet and transfer learning.
In 2013, a team from NTT Media Intelligence Laboratories [18] worked on dense SIFT features
around local points and used Fisher Vector encoding, an effective feature encoding, for the concept
detection task. To reduce dimensionality, the team used PCA, and GMMs to generate the codebooks.
They used Support Vector Machine (SVM) classifiers with linear kernels for the first run. Ensemble
learning with sparse encoding for the fusion of sub-classifiers showed better performance than
ensemble learning with CLARA, the Fisher Vector run classified by the large-scale linear
classification method LIBLINEAR, and the deep CNN.
Zhongwen Xu et al. [22] introduced latent concept descriptors for video event detection by
extracting features from the VGG pool5 layer; the descriptors from all frames in a video are
encoded using VLAD at the last convolutional layer with spatial pooling to generate the video
representations.
Podlesnaya et al. [23] developed a video indexing and retrieval system based on semantic features
extracted from the GoogLeNet architecture and provided shot summarization using temporal pooling.
Video retrieval used structured queries for keyword-based retrieval, and the database index
was prepared using a graph-based approach with the WordNet lexical database.
Nitin J. Janwe et al. [24] asymmetrically trained two CNN models to deal with the unbalanced dataset
problem. They also proposed combining foreground-driven object concepts with background semantic
concepts.
Bahjat Safadi et al. [26] considered various types of audio, image, and motion descriptors for
evaluation, along with variants of classifiers and their fusion. Classical descriptors extracted from
key-frames include color histograms, Gabor transforms, quaternionic wavelets, a variety of interest
point descriptors (SIFT, color SIFT, SURF), local edge patterns, saliency moments, and spectral
profiles for audio description. Semantic descriptors, which are scores computed on the current
data using classifiers trained on other data, are also considered in the fusion step. They used KNN and
SVM as base classifiers, and in the late fusion step a MAP-weighted average of the scores produced by
the two classifiers is combined.
3. SEMANTIC CONCEPT DETECTION SYSTEM
3.1 Preprocessing
In the preprocessing step, the video is divided into significant scenes. A scene is a set of semantically
co-related and temporally adjacent shots depicting a story. Each scene is further segmented into shots.
Key-frames are then selected from each shot, considering all frames, using sequential or
clustering-based methods. In the sequential method, the variation between the visual features of
frames is calculated, and whenever a considerable change is observed the frame is selected as a
key-frame. In cluster-based approaches, frames are clustered based on the variation of the content
of the frames of a shot, and the frame nearest to each cluster center is selected as a key-frame.
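The sequential method described above can be sketched as follows. This is a minimal illustration, assuming each frame is already summarized by a normalized feature histogram; the L1 distance and the threshold value are illustrative choices, not taken from the paper.

```python
import numpy as np

def select_keyframes(frame_hists, threshold=0.3):
    """Sequential key-frame selection: a frame becomes a key-frame when
    its feature histogram differs enough from the last selected key-frame.

    frame_hists: list of 1-D normalized histograms, one per frame.
    threshold: assumed L1-distance cutoff (a tunable parameter).
    """
    if not frame_hists:
        return []
    keyframes = [0]                                  # first frame is always kept
    last = np.asarray(frame_hists[0], dtype=float)
    for i, h in enumerate(frame_hists[1:], start=1):
        h = np.asarray(h, dtype=float)
        if np.abs(h - last).sum() > threshold:       # "considerable change" observed
            keyframes.append(i)
            last = h
    return keyframes
```

A cluster-based variant would instead run k-means over the per-frame histograms and keep the frame nearest each cluster center.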
In the set of shot-based features, audio features and shot-based visual features are extracted.
The extracted audio feature set is composed of volume-related, energy-related, and spectrum-
flux-related features as well as the average zero crossing rate. For shot-based visual features,
the grass ratio is calculated as a mid-level feature, which is useful for detecting sports-related
semantic concepts such as soccer players and sports. Furthermore, a set of motion intensity
estimation features is extracted, such as the center-to-corner pixel change ratio. Both
categories of features aim at representing and summarizing the video.
Key-frame based features are commonly used in concept detection systems. Color features,
texture features, and shape features are derived from key-frames along the spatial scale, i.e.,
global level, region level, and key-point level, as well as at the temporal level. These features can be
used independently or fused together or with other modalities. Fusion can also be done at the
classifier level, where kernel functions can be fused to improve performance.
4. METHODOLOGY
Figure 2 shows the framework of proposed method using Support Vector Machine (SVM) and
Convolutional Neural Network(CNN).
FIGURE 2: Block diagram of Proposed Framework for Semantic Concept Detection Using SVM and CNN.
In an imbalanced dataset, the samples belonging to one class greatly outnumber those belonging
to the other classes. The predictive classifier model developed using such an imbalanced dataset
can be biased and inaccurate. Over-sampling can be used to increase the number of instances in
the minority class by randomly replicating them. Here we have partitioned the dataset into
three segments based on the frequency of the samples in the dataset: concepts having low
frequency are kept in segment one, those with moderate frequency between 0.1 and 0.5 in segment
two, and those above 0.5 in segment three.
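The segment partitioning can be sketched as below. This is a minimal illustration assuming each concept's frequency is its count of positive key-frames divided by the dataset size; the 0.1 and 0.5 cutoffs are the ones stated in the text.

```python
def partition_concepts(concept_counts, total_keyframes):
    """Partition concepts into three segments by relative frequency,
    using the cutoffs from the text (0.1 and 0.5).

    concept_counts: dict mapping concept name -> positive key-frame count.
    total_keyframes: total number of key-frames in the dataset.
    """
    seg1, seg2, seg3 = [], [], []
    for concept, count in concept_counts.items():
        freq = count / total_keyframes
        if freq < 0.1:
            seg1.append(concept)      # rare concepts
        elif freq <= 0.5:
            seg2.append(concept)      # moderate-frequency concepts
        else:
            seg3.append(concept)      # frequent concepts
    return seg1, seg2, seg3
```

Training one classifier per segment then keeps rare concepts from being swamped by frequent ones within a single model.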
E_i = (1/N) Σ_{j=1}^{N} I_ij                                  (1)

σ_i = √( (1/N) Σ_{j=1}^{N} (I_ij − E_i)² )                    (2)
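Equations (1) and (2) are the first two color moments per channel: the mean E_i and standard deviation σ_i over the N pixels of channel i. A minimal sketch, assuming the key-frame is given as an H × W × C pixel array:

```python
import numpy as np

def color_moments(image):
    """First two color moments per channel, matching Eqs. (1) and (2):
    E_i is the channel mean, sigma_i the channel standard deviation.

    image: H x W x C array of pixel values.
    """
    img = np.asarray(image, dtype=float)
    n = img.shape[0] * img.shape[1]        # N = number of pixels per channel
    flat = img.reshape(n, -1)              # one column per channel
    mean = flat.sum(axis=0) / n                            # Eq. (1)
    std = np.sqrt(((flat - mean) ** 2).sum(axis=0) / n)    # Eq. (2)
    return mean, std
```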
Edge orientation histograms are computed from the output of an edge detector in our experiment.
Edge pixels in the vertical, horizontal, and two diagonal directions, plus one non-directional
category, are counted. Table 1 summarizes the feature set used in our experiments. The dimension
of the concatenated feature vector is 99. SVM creates an accurate model when features are in the
same range, so the features are normalized using min-max normalization.
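The min-max normalization step can be sketched as follows. A minimal illustration, assuming (as is standard practice) that the per-dimension minima and maxima are computed on the training set and reused to scale the test set:

```python
import numpy as np

def min_max_normalize(train, test):
    """Min-max normalization to [0, 1] per feature dimension.
    Ranges are learned on the training set and reused on the test set
    so both live on the same scale."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid divide-by-zero on constant features
    return (train - lo) / span, (test - lo) / span
```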
Partition I contains 80 videos used for training and 30 videos for testing. The training dataset consists
of 16542 randomly chosen positive key-frames used to perform classifier training. The test dataset
consists of 12615 randomly chosen positive key-frames from Partition II. Table 2 shows the number of
key-frames used in the two partitions.
Figure 5 shows the number of positive key-frames available for the 36 concepts in the training dataset.
As observed from Figure 5, the TRECVID training dataset is highly imbalanced, having few positive
examples for concepts like US flag compared to the very high frequency of other concepts.
Approaches like oversampling the imbalanced dataset are not sufficient for this task, as the classifier
overfits. Classifiers are biased towards classes having more positive examples and tend to predict
the majority classes. Therefore we have partitioned the training dataset as described in
Section 4.2. Table 3 shows the distribution of training and testing key-frames across the three segments.
4.6 Classification Design
Two types of classifiers are used in the experimentation. The training dataset partitioned into
three segments is referred to as seg1, seg2, and seg3.
TABLE 4: Concepts in all three segments and number of training key-frames for each concept.
4.6.1 SVM
The classifier adopted is the SVM. Global features like color moments, HSV histogram, and wavelet
transform are extracted from the training key-frames and combined to form a single feature vector.
An SVM is trained for each segment separately, creating three SVM models. The stepwise procedure for
building the SVM classifier is as follows [24]:
a. Normalization: Training and test dataset feature vectors are normalized into the range (0.0
to 1.0).
b. Kernel function selection: The choice of kernel function, like RBF or linear, depends upon the
feature selection.
c. Parameter tuning: For non-linear features, the best parameters C and g (gamma) are to be obtained for the
RBF kernel.
d. Training: The values of C and g obtained after cross-validation are used to train on the entire
training set.
e. Testing: The models obtained after training are used for predicting a class for the test dataset.
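Steps (a) through (e) can be sketched with scikit-learn, which is an assumed implementation choice; the paper does not name a library, and the parameter grid for C and gamma below is illustrative, not the grid used in the experiments.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def train_segment_svm(X_train, y_train):
    """Build one segment's SVM model following steps (a)-(d)."""
    scaler = MinMaxScaler()                        # (a) normalize to [0.0, 1.0]
    Xn = scaler.fit_transform(X_train)
    grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
    search = GridSearchCV(SVC(kernel="rbf"),       # (b) RBF kernel
                          grid, cv=3)              # (c) cross-validated C, gamma tuning
    search.fit(Xn, y_train)                        # (d) refit best C, gamma on full set
    return scaler, search.best_estimator_

def predict_segment(scaler, model, X_test):
    """(e) predict classes for the test dataset on the same scale."""
    return model.predict(scaler.transform(X_test))
```

For the later score-fusion step, per-class probability scores would additionally be needed (e.g. `SVC(probability=True)`).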
The scores from the two pipelines are linearly fused to get the final scores. These final merged
scores are used to predict the concepts. The performance of the fused classifier is evaluated over
the test dataset. Figure 6 shows the approach for semantic concept detection on a test video using
the hybrid approach.
Computing MAP: The ground-truth key-frames consist of multi-label data. For each key-frame, the
set of available labels, called the concept set (Yi), and the count of those concepts (Ni) are
computed and stored. Let D be a multi-label test dataset, consisting of |D| multi-label test
examples (xi, Yi), i = 1 . . . |D|, Yi ⊆ L, where L is the concept vocabulary set for the dataset.
When the concept prediction scores for all the L concepts are obtained, the following procedure is
adopted to compute the Top-d MAP [24]:
1. Rank the prediction scores and concepts in descending order for a test sample xi.
2. Pick the top d scores from the ranked list, and let Pi be the concept list from the top d concepts.
The intersection Mi between Yi and Pi gives the concepts correctly predicted by the classifier. The
average precision (AP) for a test sample is computed by (3),
average precision (AP) for a test sample is computed by (3) ,
AP_i = |Y_i ∩ P_i| / |P_i| = M_i / N_i                        (3)
And the Top-d MAP for a classifier, H, on dataset D, is obtained by computing the mean of APs
by (4).
FIGURE 6:Testing phase of Semantic Concept Detection Using SVM and CNN.
MAP(H, D) = (1/|D|) Σ_{i=1}^{|D|} ( |Y_i ∩ P_i| / |P_i| )     (4)
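The Top-d MAP procedure of steps 1 and 2 with Eqs. (3) and (4) can be sketched as below. This is a minimal illustration assuming per-sample prediction scores are given as concept-to-score dictionaries:

```python
def top_d_map(score_lists, truth_sets, d):
    """Top-d MAP per Eqs. (3) and (4): for each test sample, take the d
    highest-scoring concepts P_i and average |Y_i intersect P_i| / |P_i|.

    score_lists: per-sample dict mapping concept -> prediction score.
    truth_sets: per-sample set of ground-truth concepts Y_i.
    """
    aps = []
    for scores, truth in zip(score_lists, truth_sets):
        ranked = sorted(scores, key=scores.get, reverse=True)   # step 1: rank scores
        predicted = set(ranked[:d])                             # step 2: P_i = top d
        aps.append(len(truth & predicted) / len(predicted))     # Eq. (3): AP_i
    return sum(aps) / len(aps)                                  # Eq. (4): mean of APs
```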
In the experiment based on CNN, a pretrained AlexNet architecture is used for obtaining deep
features. These features are used to train an SVM classifier; a multiclass SVM classifier using a fast
linear solver is built on the deep features. A MAP of 0.56 is obtained for the CNN model.
A MAP-weighted average fusion of the scores from both models is performed, and the probability for
each concept class is ranked. Based on the fused scores, a MAP of 0.58 is obtained for the hybrid
model. Table 5 shows the MAP at 5, 10, and n for the SVM, the CNN, and the hybrid fusion of the scores
of these classifiers. Table 6 depicts the concepts available in the ground-truth dataset and the correctly
detected concepts for test key-frames of a few concept classes using the hybrid model. As the dataset is
partitioned into segments, accurate classifier models are created with better discriminating
capability. Concepts with fewer samples are recognized by the model, helping to improve
efficiency.
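The MAP-weighted average fusion of the two pipelines' scores can be sketched as below. A minimal illustration, assuming each model outputs per-concept probability scores and that the weights are the models' validation MAPs normalized to sum to one:

```python
def fuse_scores(svm_scores, cnn_scores, svm_map, cnn_map):
    """MAP-weighted average fusion of per-concept probability scores
    from the SVM and CNN pipelines.

    svm_scores, cnn_scores: dicts mapping concept -> probability score.
    svm_map, cnn_map: MAPs of the individual models, used as fusion weights.
    """
    w_svm = svm_map / (svm_map + cnn_map)   # normalized SVM weight
    w_cnn = cnn_map / (svm_map + cnn_map)   # normalized CNN weight
    return {concept: w_svm * svm_scores[concept] + w_cnn * cnn_scores[concept]
            for concept in svm_scores}
```

Ranking the fused scores per key-frame then yields the top-d predicted concepts.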
As shown in Table 6, out of the 8 labels of the first key-frame, 6 concepts are correctly predicted.
Similarly, the model is able to detect concepts with few annotations and concept classes having few
samples in the dataset. It is observed that the CNN classifier gives better precision than the SVM
classifier and the hybrid fusion method. The performance of the semantic concept detector using CNN is
better than the other detectors, as the automatically extracted deep features are more powerful
compared to the low-level features used to learn the SVM classifier. The MAP obtained for the hybrid
model is less than that of the CNN classifier for the top n concepts.
The proposed work is compared with the existing methods presented in [18], [25], [27], and [28], as
shown in Table 7. The MAP of the proposed system is better than these methods.
Key-frame | Shot | Ground-truth concepts (count) | Detected concepts (count)
207 | Shot26_113_RKF | Boat_Ship, Face, Outdoor, Sky, Snow, Waterscape_Waterfront (6) | Outdoor, Sky, Snow, Waterscape_Waterfront (4)
1066 | Shot4_255_RKF | Building, Face, Meeting, Office, Person, Studio (6) | Building, Face, Meeting, Outdoor (4)
764 | Shot63_25_RKF | Building, Car, Outdoor, Sky, Urban (5) | Building, Car, Outdoor, Sky (4)
12108 | Shot86_114_NRKF_1 | Face, Outdoor, Person, Vegetation (4) | Face, Outdoor, Person (3)
6. CONCLUSION
Video concept detection systems detect one or multiple concepts present in the shots or key-frames
of a video and automatically assign labels to unseen video, which facilitates automatic indexing of
multimedia data. Visual concept detection bridges the semantic gap between the low-level data
representation and the high-level interpretation of the same by the human visual system. Selection of
compact and effective low-level features is important. Also, because of the imbalanced dataset
problem, accurate classifier models cannot be created, resulting in lower accuracy.
In this paper, a framework for multi-label semantic concept detection is proposed using classifiers
trained on global and deep features. The imbalanced dataset issue is addressed by partitioning the
dataset into three segments. The framework is evaluated on the TRECVID dataset using mean
average precision as the predictive measure for the multi-label dataset. The hybrid model of CNN and
SVM performed better than the individual classifiers for the top 10 concepts, whereas the CNN worked
better for the top n ranked concepts.
7. REFERENCES
[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T.
Darrell. "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093,
2014.
[4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng and T. Darrell. “DeCAF: A
Deep Convolutional Activation Feature for Generic Visual Recognition,” CoRR,
abs/1310.1531, vol. 32, 2013.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. “Rich feature hierarchies
for accurate object detection and semantic segmentation,” tech. report, 2012.
[6] A. Ulges, C. Schulze, M. Koch, and T. M. Breuel. “Learning automatic concept detectors
from online video,” In Comput. Vis. Image Underst., vol. 114, no. 4, pp. 429–438, 2010.
[7] B. Safadi, N. Derbas, A. Hamadi, M. Budnik, P. Mulhem, and G. Qu. “LIG at TRECVid 2014:
Semantic Indexing,” 2014.
[10] L. Feng and B. Bhanu. “Semantic Concept Co-Occurrence Patterns for Image Annotation
and Retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 785–799, 2016.
[11] S. T. Strat, A. Benoit, P. Lambert, and A. Caplier. “Retina-Enhanced SURF Descriptors for
Semantic Concept Detection in Videos,” 2012.
[12] F. Markatopoulou, V. Mezaris, N. Pittaras, and I. Patras. “Local Features and a Two-Layer
Stacking Architecture for Semantic Concept Detection in Video,” IEEE Trans. Emerg. Top.
Comput., vol. 3, no. 2, pp. 193–204, 2015.
[15] K. Simonyan and A. Zisserman. “Two-Stream Convolutional Networks for Action Recognition
in Videos,” pp. 1–9, 2014.
[17] X. Wang and A. Gupta, “Unsupervised Learning of Visual Representations using Videos.” In
ICCV, 2015.
[18] Y. Sun, K. Sudo, Y. Taniguchi, H. Li, Y. Guan, and L. Liu. “TRECVid 2013 Semantic Video
Concept Detection by NTT-MD-DUT,” In Sun2013TrecVid2s, 2013.
[19] H. Ha, Y. Yang, and S. Pouyanfar. “Correlation-based Deep Learning for Multimedia
Semantic Concept Detection.” In IEEE International Symposium on Multimedia (ISM08),
pp. 316–321, Dec 2008.
[20] H. Tian and S.-C. Chen. “MCA-NN: Multiple Correspondence Analysis Based Neural
Network for Disaster Information Detection,” In IEEE Third Int. Conf. Multimed. Big Data, pp.
268–275, 2017.
[21] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. “NUS-WIDE: A real-world web
image database from National University of Singapore,” ACM Int. Conf. Image Video Retr., p.
48, 2009.
[22] Z. Xu, Y. Yang, and A. G. Hauptmann. “A Discriminative CNN Video Representation for
Event Detection.” In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2015.
[23] A. Podlesnaya and S. Podlesnyy. “Deep Learning Based Semantic Video Indexing and
Retrieval,” no. 2214.
[24] N. J. Janwe and K. K. Bhoyar. “Multi-label semantic concept detection in videos using fusion
of asymmetrically trained deep convolutional neural networks and foreground driven concept
co-occurrence matrix,” In Appl. Intell., vol. 48, no. 8, pp. 2047–2066, 2018.
[25] F. Markatopoulou, V. Mezaris, and I. Patras. “Cascade of classifiers based on binary, non-
binary and deep convolutional network descriptors for video concept detection.,”In IEEE Int.
Conf. onImage Processing (ICIP 2015), Canada, 2015.
[26] B. Safadi, N. Derbas, A. Hamadi, M. Budnik, P. Mulhem, and G. Qu. “LIG at TRECVid 2014:
Semantic Indexing,” June 2015.
[27] F. Markatopoulou, “Deep Multi-task Learning with Label Correlation Constraint for Video
Concept Detection,” pp. 501–505.