
2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC)

Deep Learning Approaches for Human Activity Recognition in Video Surveillance – A Survey
Rajat Khurana
Dept. of Computer Science & Engineering, IKGPTU Main Campus, Kapurthala, India
Email: rajat.ptu@gmail.com

Alok Kumar Singh Kushwaha
Dept. of Computer Science & Engineering, IKGPTU Main Campus, Kapurthala, India
Email: dr.alokkushwaha@ptu.ac.in

Abstract— Recognition of human activities in video is in demand in numerous computer vision applications such as ambient assisted living, intelligent surveillance and human-computer interaction. One of the most promising directions for Human Activity Recognition is deep learning, and this paper surveys approaches based on it. Convolutional Neural Networks and Recurrent Neural Networks are the architectures most frequently used in deep learning, and deep learning has the capacity to learn features automatically from the input modality. An analysis based on methodology, accuracy, classifier and datasets is presented in this survey paper.

Keywords— HAR, Action, Computer Vision, MEI, MHI

I. INTRODUCTION

Human activity recognition (HAR) using deep learning techniques is a growing area in the field of computer vision. In general, HAR is the process of automatically identifying the action occurring in a video sequence. Over the past few years HAR has attracted considerable attention from researchers because of interesting applications such as automatic surveillance, security, human-computer interaction, video annotation, robotics, human behavior analysis and many more [1]. Human activity recognition consists of pre-processing the input video followed by feature extraction, and the extracted features are used to recognize the action class from the learned representations. Human behavior prediction can be posed as a probabilistic problem of inferring an ongoing activity from a video that contains only its beginning; more accurate behavior detection supports police investigation settings, entertainment setups, healthcare systems, fraud detection agencies, mass population analysis and statistics, and gender classification [2]. Nevertheless, it is very challenging to recognize human activities in unconstrained videos because of real-world conditions such as varying illumination, divergent viewpoints and varying action speeds [1]. Successful techniques for activity recognition fall into two categories. Hand-crafted approaches proceed through data acquisition from the sensor, pre-processing, segmentation, feature extraction, training and classification. The other category is the deep learning-based approach [3], in which features are learned automatically, reducing laborious human intervention, the need for expert knowledge and the manual selection of optimal features. Moreover, if features are not correctly extracted and selected in a hand-crafted approach, correct classification is not possible, whereas learning-based approaches use raw data for end-to-end learning, from pixels to action classification, with Convolutional Neural Networks and Recurrent Neural Networks. Deep learning approaches are further categorized into supervised and unsupervised techniques.

Recognizing human actions with deep learning methodology is therefore an active research issue. The paper is divided into four parts: Part 1 is the introduction, the literature review is carried out in Part 2, Part 3 covers the analytical study of human activity recognition techniques, and the conclusion and future work are given in Part 4.

II. LITERATURE SURVEY

In this section we study the different approaches based upon deep learning. Human activity recognition techniques use appearance and motion cues to learn features from videos. Baccouche et al. [3] proposed a fully automated deep model that learns spatio-temporal features without any prior knowledge by extending Convolutional Neural Networks to three dimensions; a Recurrent Neural Network with one hidden layer of Long Short-Term Memory (LSTM) cells is then trained to classify the learned representations, giving a model with ten layers in total.
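As an illustration of this kind of two-stage pipeline (convolutional spatio-temporal features followed by a recurrent classifier), the following PyTorch-style sketch is not the exact architecture of [3] but a minimal assumed configuration: a small 3D convolutional encoder produces one feature vector per chunk of frames and a single-layer LSTM classifies the resulting sequence.

```python
# Minimal sketch (not the exact model of [3]): 3D conv features per chunk + LSTM classifier.
# Channel counts, feature size and the number of classes are illustrative assumptions.
import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    def __init__(self, num_classes=6, feat_dim=128):
        super().__init__()
        # Spatio-temporal feature extractor applied to each short chunk of frames.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),              # -> (B*S, 64, 1, 1, 1)
        )
        self.project = nn.Linear(64, feat_dim)
        # Single hidden layer of LSTM cells, as in the survey's description of [3].
        self.lstm = nn.LSTM(feat_dim, feat_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, chunks):                    # chunks: (B, S, 3, T, H, W)
        b, s = chunks.shape[:2]
        x = chunks.flatten(0, 1)                  # merge batch and chunk dims
        f = self.encoder(x).flatten(1)            # (B*S, 64)
        f = self.project(f).view(b, s, -1)        # (B, S, feat_dim)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])              # class scores from the last time step

scores = ConvLSTMClassifier()(torch.randn(2, 4, 3, 8, 32, 32))   # -> shape (2, 6)
```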
Taylor et al. [4] proposed the concept of feature learning with convolutional architectures in which static as well as dynamic features are learned from image sequences using gated Restricted Boltzmann Machines and are represented in the form of optical flow. Ji et al. [5] presented a 3D CNN architecture with one hardwired layer, three convolution layers, two subsampling layers and one fully connected layer. This supervised deep architecture generates multiple information channels from adjacent input frames and performs convolution and subsampling in each channel; the final feature representation is obtained by combining the information from all channels.
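To make the idea of 3D convolution concrete, the short sketch below stacks Conv3d layers so that the kernels span time as well as space; the layer widths, pooling schedule and number of classes are illustrative assumptions rather than the exact hardwired, multi-channel design of [5].

```python
# Generic 3D-CNN sketch: kernels convolve over (time, height, width) jointly.
# Layer widths and the number of classes are illustrative assumptions, not those of [5].
import torch
import torch.nn as nn

video_clip = torch.randn(1, 3, 16, 64, 64)            # (batch, channels, frames, H, W)

model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),              # subsample space only
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                      # subsample time and space
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(32, 6),                                 # fully connected classification layer
)

print(model(video_clip).shape)                        # torch.Size([1, 6])
```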
Ng et al. [6] proposed an approach that learns spatio-temporal features over long videos by placing an LSTM on top of GoogLeNet and AlexNet; raw frames and optical flow are both used as input modalities, and feature pooling is applied to fuse class scores for action recognition. Wang et al. [7] proposed the trajectory-pooled deep-convolutional descriptor (TDD), which has the benefits of both hand-crafted and deep-learned features: trained CNNs supply the feature maps used by TDD, Fisher vectors are then used to encode the descriptors, and an SVM is used as the classifier.
After the milestones achieved by CNNs on images, Karpathy et al. [8] extended them to roughly one million videos with 487 classes, the Sports-1M dataset. Two spatial resolutions, a low-resolution context stream and a high-resolution fovea stream, are used to achieve better results and to reduce training time. To verify the results on other challenging datasets, transfer learning is applied to UCF-101 together with the Slow Fusion network, and fine-tuning only a few layers achieves better results than training from scratch.
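The transfer learning recipe mentioned above, reusing a network pre-trained on a large dataset and fine-tuning only a few layers on a smaller target dataset, can be sketched as follows; the ResNet-18 backbone, the choice of unfrozen block and the 101 target classes are assumptions for illustration, not the setup of [8].

```python
# Fine-tuning sketch: freeze a pre-trained backbone and retrain only the last block and classifier.
# The ResNet-18 backbone and the 101 target classes are illustrative assumptions.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

backbone = models.resnet18(pretrained=True)              # ImageNet pre-trained weights

for param in backbone.parameters():                      # freeze everything by default
    param.requires_grad = False

backbone.fc = nn.Linear(backbone.fc.in_features, 101)    # new classifier for the target classes
for param in backbone.layer4.parameters():               # optionally unfreeze the last block
    param.requires_grad = True

trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-3, momentum=0.9)  # train only the unfrozen parameters
```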




Table 1: Summary of various methods for activity recognition. For each surveyed work the table reports the research article, the classifier (CNN, RNN, or a combination with a Softmax or other machine-learning classifier), the datasets used (among them KTH, UCF11, UCF-101, HMDB51, Hollywood2, Sports-1M, TRECVID 2008, VIRAT, MSRC-12, MSR Daily Activity, NTU RGB+D, locomotion data and escalator exits of an airport) and the reported accuracy. The entries cover Baccouche et al. 2011 [3], Taylor et al. 2010 [4], Ji et al. 2013 [5], Ng et al. 2015 [6], Wang et al. [7], Karpathy et al. [8], Gu et al. [9], Park et al. 2016 [10], Bilen et al. 2016 [11], Hasan et al. 2014 [12], Feichtenhofer et al. 2016 [13], Luo et al. [14] and Li et al. 2017 [15].

Gu et al. [9] proposed a deep learning method for the recognition of locomotion activities. Data from multiple smartphone sensors such as the gyroscope, accelerometer and magnetometer are used for this purpose, and stacked denoising autoencoders are used for training, eliminating the need for expert knowledge.
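A denoising autoencoder of the kind referred to above learns a representation by reconstructing a clean input from a corrupted copy. The sketch below shows a single such layer under assumed input and hidden sizes; it is a generic illustration, not the exact stack used in [9].

```python
# Denoising-autoencoder sketch: corrupt the input, then learn to reconstruct the clean signal.
# Input size, hidden size and noise level are illustrative assumptions.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, in_dim=64, hidden=32, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, in_dim)

    def forward(self, x):
        noisy = x + self.noise_std * torch.randn_like(x)    # corrupt the sensor window
        return self.decoder(self.encoder(noisy))            # reconstruct the clean window

model = DenoisingAE()
window = torch.randn(16, 64)                                # batch of sensor feature windows
loss = nn.functional.mse_loss(model(window), window)        # reconstruction objective
loss.backward()
```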
Park et al. [10] drafted an RNN-based approach for health and social care services using a depth camera. The positions of multiple body joints changing over time are represented as a spatio-temporal matrix, on which an RNN is trained and afterwards used for testing. RNNs have recurrent connections between the hidden layers that pass information from previous time steps to the following ones, and this approach achieves state-of-the-art results compared with previously used methods such as HMMs and deep belief networks.
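A minimal sketch of that idea, classifying a sequence of skeleton-joint coordinates with a recurrent layer, is given below; the assumed 20 joints with 3D coordinates, the GRU cell and the class count are illustrative choices rather than the exact configuration of [10].

```python
# Sketch: classify activities from a (frames x joints*3) matrix with a recurrent layer.
# Joint count, hidden size and class count are illustrative assumptions.
import torch
import torch.nn as nn

class SkeletonRNN(nn.Module):
    def __init__(self, num_joints=20, hidden=64, num_classes=10):
        super().__init__()
        self.rnn = nn.GRU(num_joints * 3, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, joints):              # joints: (batch, frames, num_joints*3)
        _, h = self.rnn(joints)             # h: (1, batch, hidden), final hidden state
        return self.head(h[-1])             # class scores per sequence

scores = SkeletonRNN()(torch.randn(4, 30, 60))   # 4 sequences of 30 frames -> (4, 10)
```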
Bilen et al. [11] presented the concept of dynamic images. Dynamic images are produced from a sequence of frames by applying rank pooling to the raw image pixels, so that existing CNN models can be applied to them directly; a visual inspection shows how richly these simple 2D images describe complex motion patterns.
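Dynamic images can be approximated by a weighted temporal average of the frames. The sketch below uses the approximate rank-pooling weights alpha_t = 2t - T - 1 applied directly to pixel values, which is a commonly quoted approximation of rank pooling rather than the full optimization used to define dynamic images in [11].

```python
# Approximate dynamic image: weighted sum of frames with weights 2t - T - 1 (t = 1..T).
# This is a simple approximation of rank pooling on raw pixels, for illustration only.
import numpy as np

def approximate_dynamic_image(frames):
    """frames: array of shape (T, H, W, C); returns a single (H, W, C) dynamic image."""
    T = frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0                           # later frames get larger weights
    dyn = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    # Rescale to a displayable 0-255 range before feeding it to an image CNN.
    dyn -= dyn.min()
    if dyn.max() > 0:
        dyn *= 255.0 / dyn.max()
    return dyn.astype(np.uint8)

clip = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
print(approximate_dynamic_image(clip).shape)            # (64, 64, 3)
```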
Learning activity models from live streaming video is still a challenging task, but it was addressed by Hasan and Roy-Chowdhury [12] using a deep network together with active learning. Two learning phases, initial and incremental, are introduced: in the initial phase a few labeled and unlabeled instances are used to train an autoencoder, and the activity recognition model from that phase is then combined with active learning in the incremental phase. In this way the most useful set of features for each activity class is obtained largely in an unlabeled manner.
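Active learning of that sort typically asks for labels only on the streaming samples the current model is least certain about. The sketch below shows uncertainty-based selection with entropy as the criterion; it is a generic illustration of the principle, not the specific query strategy of [12].

```python
# Uncertainty-based active learning sketch: pick the unlabeled clips whose predicted
# class distribution has the highest entropy and send them for manual labeling.
# The stand-in model, feature shapes and query budget are illustrative assumptions.
import torch
import torch.nn as nn

def select_for_labeling(model, unlabeled_feats, budget=5):
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_feats), dim=1)           # (N, num_classes)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)   # per-sample uncertainty
    return torch.topk(entropy, k=budget).indices                       # indices to query

classifier = nn.Linear(128, 10)                   # stand-in for the learned activity model
pool = torch.randn(100, 128)                      # features of unlabeled streaming clips
print(select_for_labeling(classifier, pool))      # 5 most uncertain samples
```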


Convolutional network towers for the spatial and temporal levels are usually fused at the softmax level, but Feichtenhofer et al. [13] experimented with fusing at the last convolution layer and at the class-prediction layer and achieved better results. The spatial stream is used for 2D appearance features, and the fusion of the two ImageNet pre-trained streams is performed after the ReLU unit.
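A minimal sketch of such spatial/temporal two-stream fusion is given below, with two small stand-in towers fused by summing their activations after the last convolution and ReLU; the tower sizes, the 20-channel stacked optical-flow input and the sum-fusion operator are assumptions, not the exact networks of [13].

```python
# Two-stream fusion sketch: fuse the spatial (RGB) and temporal (optical-flow) towers
# after the last convolution + ReLU instead of averaging softmax scores.
# Tower widths, the flow-stack depth (10 x/y fields) and class count are assumptions.
import torch
import torch.nn as nn

def tower(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),        # fuse after this ReLU
    )

class TwoStreamFusion(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.spatial = tower(3)          # single RGB frame
        self.temporal = tower(20)        # stacked optical-flow fields
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, num_classes))

    def forward(self, rgb, flow):
        fused = self.spatial(rgb) + self.temporal(flow)    # conv-level (sum) fusion
        return self.classifier(fused)

model = TwoStreamFusion()
scores = model(torch.randn(2, 3, 56, 56), torch.randn(2, 20, 56, 56))   # -> (2, 101)
```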
Luo et al. [14] proposed a methodology based on the RGB-D modality, which is used to produce a sequence of atomic 3D flows; an RNN is then used to predict the action classes from these 3D flows. The model is generic with respect to the input modality (RGB, depth or RGB-D) and captures long-term motion dependencies as well as spatio-temporal relationships. Li et al. [15] presented a method based on a Convolutional Neural Network (CNN): the raw spatial features are input to the CNN for feature analysis, and a Softmax classifier is then used for classification. Basic activities such as walking, sitting, jumping, jogging and standing are classified by this model.

III. VISUAL ANALYSIS FOR ACTIVITY RECOGNITION OF HUMANS

This section covers the analytical study of the various techniques; the methods, datasets and their accuracies are presented in Table 1. According to this analysis, the RNN-based method of Park et al. [10] for health and social care services with a depth camera performs particularly well. Yet there is no single method that performs best across the board: because the datasets vary widely, each approach performs best in its own scenario.

IV. CONCLUSIONS
In this survey, our goal was to review techniques for behavioral identification and their applications, covering various deep learning approaches to motion recognition. Human activity detection in video is currently a very active research area. For security purposes, behavior recognition is used in many public places such as train stations: when minor accidents occur in a depot and security personnel cannot observe them directly, this kind of detection technology supports their work on such abnormal activities. Behavior detection in police investigation settings, recreational environments, assistance systems, person detection in crowds, civil enumeration and gender classification accounts for a large share of these analyses. This analysis has found some workable solutions, but the field is still not mature and there is considerable scope for improvement, including further experiments on different input modalities.
REFERENCES
[1] T. Subetha and S. Chitrakala, "A survey on human activity recognition from videos," 2016 Int. Conf. on Information Communication and Embedded Systems (ICICES), pp. 1–7, 2016.
[2] W. Ding, K. Liu, F. Cheng, and J. Zhang, "Learning hierarchical spatio-temporal pattern for human activity prediction," J. Vis. Commun. Image Represent., vol. 35, pp. 103–111, 2016.
[3] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," Human Behavior Understanding, pp. 29–39, 2011.
[4] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," Lect. Notes Comput. Sci., vol. 6316 LNCS, part 6, pp. 140–153, 2010.
[5] S. Ji, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, 2013.
[6] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4694–4702, 2015.
[7] L. Wang, Y. Qiao, and X. Tang, "Action recognition with trajectory-pooled deep-convolutional descriptors," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4305–4314, 2015.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1725–1732, 2014.
[9] F. Gu and K. Khoshelham, "Locomotion activity recognition: A deep learning approach."
[10] S. U. Park, J. H. Park, M. A. Al-Masni, M. A. Al-Antari, M. Z. Uddin, and T.-S. Kim, "A depth camera-based human activity recognition via deep learning recurrent neural network for health and social care services," Procedia Comput. Sci., vol. 100, pp. 78–84, 2016.
[11] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, "Action recognition with dynamic image networks," IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–14, 2017.
[12] M. Hasan and A. K. Roy-Chowdhury, "Continuous learning of human activity models using deep nets," Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 705–720, 2014.
[13] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016.
[14] Z. Luo, B. Peng, D.-A. Huang, A. Alahi, and L. Fei-Fei, "Unsupervised learning of long-term motion dynamics for videos," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 7101–7110, 2017.
[15] J. Li, R. Wu, J. Zhao, and Y. Ma, "Convolutional neural networks (CNN) for indoor human activity recognition using Ubisense system," 29th Chinese Control and Decision Conference (CCDC), pp. 2068–2072, 2017.

