A Handheld Gun Detection Using Faster R-CNN Deep Learning

Gyanendra K. Verma
Department of Computer Engineering
National Institute of Technology
Kurukshetra, Haryana
gyanendra@nitkkr.ac.in

Anamika Dhillon
Department of Computer Engineering
National Institute of Technology
Kurukshetra, Haryana
dhillon.anamika2390@gmail.com
ABSTRACT
Today, most criminal activities are carried out using handheld arms, particularly guns, pistols and revolvers. Several surveys have revealed that the handheld gun is the foremost weapon used for diverse crimes such as burglary and rape. Automatic gun detection is therefore a prime requirement in the current scenario, and this paper presents automatic gun detection in cluttered scenes using Convolutional Neural Networks (CNN). We have used a Deep Convolutional Network (DCN), a state-of-the-art Faster Region-based CNN model, through transfer learning, for automatic gun detection in cluttered scenes. We have evaluated our gun detection on the Internet Movie Firearms Database (IMFDB), a benchmark gun database. Our system achieved promising performance in detecting visual handheld guns. Moreover, we demonstrate that, relative to the number of training images, the CNN model improves classification accuracy, which is most advantageous in applications where a large amount of training data is often not available.

CCS CONCEPTS
• Computing methodologies → Image processing;

KEYWORDS
Gun Detection, Video Surveillance, DCN, CNN, VGG Net, Deep Learning

ACM Reference Format:
Gyanendra K. Verma and Anamika Dhillon. 2017. A Handheld Gun Detection using Faster R-CNN Deep Learning. In ICCCT-2017: International Conference on Computer and Communication Technology, November 24–26, 2017, Allahabad, India, Jennifer B. Sartor, Theo D’Hondt, and Wolfgang De Meuter (Eds.). ACM, New York, NY, USA, Article 4, 5 pages. https://doi.org/10.1145/3154979.3154988

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICCCT-2017, November 24–26, 2017, Allahabad, India
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5324-3/17/11. . . $15.00
https://doi.org/10.1145/3154979.3154988

1 INTRODUCTION
Today, most criminal activities are carried out using handheld arms, particularly guns, pistols and revolvers. Crime and social offences can be reduced by monitoring such activities and identifying antisocial behavior so that law enforcement agencies can take appropriate action at an early stage [6] [16]. A solution to this problem is to deploy surveillance systems or control cameras with automatic handheld gun detection and alerting. In the last few years, deep learning has become a landmark in the area of machine learning, particularly in object detection, classification and image segmentation. Convolutional Neural Networks (CNN) have achieved the best results so far on classical image processing problems such as image segmentation, classification and detection in several applications.

In this paper we present an automatic handheld gun detection system using deep learning, particularly a CNN model. Gun detection is a very challenging problem because of the various subtleties associated with it. One of the most important challenges is occlusion of the gun, which arises frequently. There are two types of gun occlusion, namely gun-to-object and gun-to-site/scene occlusion. Occlusions in gun detection normally arise under three conditions: self-occlusion, inter-object occlusion, or occlusion by background site/scene structure. Self-occlusion arises when one portion of the gun is occluded by another. Inter-object occlusion occurs when a different object such as a hand or clothing occludes the gun. Background occlusion occurs when the gun is occluded by a structure in the background. These occlusions can be either partial or full, and occur when a gun is carried in the hand or in a holster. There are many methods to handle occlusion, based on depth analysis [7] of the object from the camera, fusion of color and shape features of the object [2], and optimal camera positioning [18].

Inter-class variation in guns occurs due to variation in the color and structure of different gun models. Guns are widely available in various colors such as silver and black, which makes detection in images a challenging task. View variation arises from the different 2-D views of a gun from different viewpoints or orientations. Rotation variation arises from rotation of the object in its plane, whereas scale variation appears because of changes in the distance between the gun and the CCTV camera while the video is recorded.

Some other problems that arise during gun detection are noise in the gun image, deformation of the gun, and the real-time processing requirement. Noise can be introduced during the transmission and acquisition of images. Various types of noise can be present, such as salt-and-pepper noise, Rayleigh noise and Gaussian noise, and they can be introduced under varied conditions of transmission and acquisition. The rest of the paper is organized as follows: related work is given in section 2. The methodology with the deep learning model is described in section 3. Implementation details are given in section 4, followed by results and analysis in section 5. Finally, concluding remarks are given in section 6.
ICCCT-2017, November 24–26, 2017, Allahabad, India G. K. Verma et al.
2 RELATED WORK
Research on gun detection has primarily focused on Concealed Weapon Detection (CWD) and knife detection. CWD is based on imaging techniques such as infrared imaging and millimeter wave imaging, applied to luggage control at airports.

In our previous work [11], we implemented a visual gun detection system using SIFT (Scale Invariant Feature Transform) and the Harris interest point detector. The proposed system used color-based segmentation with the k-means clustering algorithm to extract distinct objects from an image. The Harris interest point detector and Fast Retina Keypoint (FREAK) were used to find the weapon in the segmented images. Object detection challenges such as scaling, rotation, affine transformation and occlusion were addressed in this work. We assembled the dataset with sixty-five positive images (gun present) and twenty-four negative images (gun not present). The dataset comprises images of various kinds of handheld guns at various scales, rotations and orientations. In a few images, part of the firearm is blocked by a hand or some other object, and some images contain multiple weapons. In a few images, other items are present besides the firearm, against different backgrounds. The overall accuracy achieved using the proposed system was 84.26%.

Following [11], we implemented a gun detection system using the Speeded Up Robust Features (SURF) interest point detector [12]. Color-based segmentation was used to remove unrelated colors or objects that are not of interest. SURF features were then used to measure the similarity of each segmented object with the weapon descriptor. An object is marked as a gun if half of the features of the weapon descriptor match the SURF features of the object. The accuracy achieved using the SURF descriptor was 88.67%.

Halima, N.B. et al. [3] demonstrated that the BoWSS (Bag of Words Surveillance System) algorithm has a high potential to detect guns. They first extract features using SIFT, cluster the obtained features using K-Means clustering and use an SVM (Support Vector Machine) for training.

Sheen et al. [9] proposed a CWD method based on a three-dimensional millimeter (mm) wave imaging technique to detect concealed weapons on the body at airports and other secure locations. Using 2-D millimeter wave imaging, they modeled a 3-D image of the target. The 3-D image can be formed from the gathered 2-D image data using three-dimensional imaging systems or wideband imaging.

Z. Xue et al. [17] proposed a CWD method based on multi-scale decomposition fusion. This method integrates a color visual image with an infrared (IR) image. The visual and infrared images are fused in a way that maintains the natural color of the original image.

R. Blum et al. [4] proposed a CWD method based on the integration of a visual image with an IR or mm wave image, using a multiresolution mosaic technique. They used an image mosaic to highlight the concealed weapon in the target image. The image mosaic method combines two or more images to construct a composite image with minimal seams. The image mosaic process resembles a cut-and-paste process. A multiresolution algorithm proposed by Simoncelli et al. [10] is used to construct the steerable pyramid for the visual image.

E. M. Upadhyay et al. [13] proposed a CWD method that uses image fusion. They fused IR and visual images to detect concealed weapons in situations where over-exposed and under-exposed areas are present in the image of the scene. Their methodology consists of applying a homomorphic filter to visual and IR images captured under different exposure conditions.

Glowacz et al. [5] proposed a method for visual knife detection for baggage scanning systems at airports and railway stations. Their method is based on the active appearance model and the Harris corner detector.

3 METHODOLOGY

3.1 Database
We have implemented and evaluated our system on the Internet Movie Firearms Database (IMFDb), a benchmark database of firearms [1].

Internet Movie Firearms Database (IMFDb)
The IMFDb is an online database of firearms used or featured in movies, television shows, video games, and anime. The firearm images are compiled from Hollywood movies, television shows, video games and Japanese animation. The following firearms are included under the gun category: Assault Rifle, Battle Rifle, Bullpup, Carbine, Flamethrower, Flare Gun, Fictional Firearm, Grenade, Grenade Launcher, Machine Guns, Machine Pistol, Mine, Missile Launcher, Mortar, Pistol, Revolver, Rifle, Shotgun, Sniper Rifle, Submachine Gun, Underwater Firearm etc.

Although many gun categories are available in IMFDb, we have compiled only Revolver, Rifle and Shotgun. Figure 1 shows sample positive images from the IMFDb database. The negative images were collected randomly from the internet from different categories such as flowers, landscapes and animals.

3.2 Deep Learning Model
We implemented the CNN using MatConvNet [14], a MATLAB toolbox implementing state-of-the-art Convolutional Neural Networks (CNNs) for computer vision applications, running without a Graphical Processing Unit (GPU). In this study we used a VGG-16 based classification model pre-trained on the ImageNet dataset (approximately 1.28 million images across 1,000 generic object classes). VGG Net comes in two versions, VGG-16 and VGG-19. The architecture of VGG-16 involves 16 weight layers (13 convolutional and 3 fully connected) with millions of parameters. The output stage of the model consists of three fully connected layers followed by a softmax activation. VGG-16 applies dropout regularization in the fully connected layers and ReLU activation after every convolutional layer. Deep CNNs such as VGG-16 are generally trained by minimizing the prediction loss.

Let x and y be the input images and the corresponding output class labels. The objective of training is to iteratively minimize the average loss defined in equation 1.

$$J(w) = \frac{1}{N}\sum_{i=1}^{N} L\big(f(w, x_i),\, y_i\big) + \lambda R(w) \qquad (1)$$
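As a rough illustration of equation 1, the sketch below evaluates the mini-batch objective as the mean per-example loss plus a weighted regularizer. The paper does not state which loss L or regularizer R was used, so the softmax cross-entropy loss, the L2 penalty, and the linear scorer standing in for the network f(w, x) are all assumptions made only for this example; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Hypothetical stand-in for f(w, x): a linear scorer with one weight row per class."""
    return softmax([sum(wj * xj for wj, xj in zip(row, x)) for row in weights])


def cross_entropy(probs, label):
    """Assumed per-example loss L: negative log-probability of the true class."""
    return -math.log(probs[label])


def l2_penalty(weights):
    """Assumed regularizer R(w): sum of squared weights."""
    return sum(wj * wj for row in weights for wj in row)


def average_loss(weights, batch, lam):
    """J(w) = (1/N) * sum_i L(f(w, x_i), y_i) + lam * R(w) over one mini-batch."""
    n = len(batch)
    data_term = sum(cross_entropy(predict(weights, x), y) for x, y in batch) / n
    return data_term + lam * l2_penalty(weights)


if __name__ == "__main__":
    # Two classes (gun / no gun), two made-up features per image.
    zero_weights = [[0.0, 0.0], [0.0, 0.0]]
    batch = [([1.0, 2.0], 0), ([0.5, -1.0], 1)]
    # With all-zero weights every class gets probability 0.5, so J(w) = ln 2.
    print(round(average_loss(zero_weights, batch, lam=0.01), 4))  # 0.6931
```

Keeping the data term and the penalty as separate functions mirrors the two summands of equation 1, so either can be swapped without touching the other.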
In equation 1, N is the number of data instances (mini-batch size) in every iteration, L is the loss function, f is the predicted output of the network given the current weights w, and R is the weight decay (regularization) term with Lagrange multiplier λ.

We use Stochastic Gradient Descent (SGD), which is commonly used to update the weights of deep CNNs, as given by equation 2.

$$w_{t+1} = \mu w_t - \alpha \nabla J(w_t) \qquad (2)$$

where µ is the momentum weight for the current weights w_t and α is the learning rate.

The network weights are randomly initialized if the network is trained from scratch, and are set to the weights of a pre-trained network if the deep model is fine-tuned. In this work we fine-tuned VGG-16, initializing it with the weights of the same architecture pre-trained on the ImageNet database. The pre-trained VGG-16 model used in this study was obtained through MatConvNet, a deep learning toolbox for MATLAB.

Figure 1: Sample images from IMFDB database: (a) positive and (b) negative samples

4 EXPERIMENTAL SETUP

4.1 Implementation Details
We adopted the MatConvNet framework [15] to implement the faster convolutional neural network. A Very Deep VGG-16 model with customized layers was used to implement the system.

As the whole system was implemented on a single CPU, training time is a major issue. To handle this problem, we started from a model pre-trained on ImageNet [8] and then fine-tuned it on the database. The Very Deep VGG-16 architecture is explained in section 4.2.

4.2 Architecture: Very Deep VGG-16/10

During training, the only preprocessing step is to subtract the mean RGB value, computed on the training data, from each pixel. The image is then passed through a stack of convolutional layers whose filters have a very small receptive field, 3 × 3, with a convolution stride of 1 pixel. Spatial pooling is performed by five max-pooling layers, which follow some of the convolutional layers. Max-pooling is performed over a 2 × 2 pixel window with stride 2. The stack of convolutional layers, whose depth varies between architectures, is followed by three fully connected layers. The final layer is the soft-max layer, and all hidden layers are equipped with the rectification (ReLU) non-linearity.

4.3 Classification
Training
Training is performed by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum. In spite of the larger number of parameters and the greater depth of the nets, they require fewer epochs to converge, due to (a) implicit regularization imposed by greater depth and smaller convolutional filter sizes and (b) pre-initialization of certain layers. During training, the input to the ConvNets is a fixed-size 224 × 224 RGB image.

To obtain this fixed-size input, the training images are rescaled. Two approaches for setting the training scale S (where S is the smallest side of an isotropically rescaled training image) are considered: 1) single-scale training, which uses a fixed S, and 2) multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax].

To improve the overall training speed of each model, the researchers introduced parallelization to the mini-batch gradient descent process. Since the model is very deep, training on a single GPU would take months to finish. To speed up the process, the researchers trained separate batches of images on each GPU in parallel to calculate the gradients.

Testing
At test time, the input image is classified as follows. First, it is isotropically rescaled to a pre-defined smallest image side, denoted Q. Then, the network is applied densely over the rescaled test image: the fully connected layers are first converted to convolutional layers (the first fully connected layer to a 7 × 7 convolutional layer, the last two fully connected layers to 1 × 1 convolutional layers).

4.4 Performance Evaluation

To evaluate the performance of our system, three metrics are used: True Positive Rate (TPR), False Positive Rate (FPR) and accuracy. True Positive Rate (TPR): It is a measure of the
percentage of actual positive images that are correctly detected by the system [17], as expressed by equation (3). It is also known as sensitivity or recall in machine learning (FN: False Negative).

$$TPR = \frac{TP}{TP + FN} \qquad (3)$$

False Positive Rate (FPR): It is the percentage of negative images that are wrongly detected as positive by the system [17]. It equals one minus the specificity.

Accuracy: It is the proportion of the total number of images that are correctly classified by the system.
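The metrics above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch, not the authors' MATLAB code: the function name and the example counts are hypothetical, and PPV and FDR (which are reported in Section 5 but not defined in the text) follow their standard definitions, PPV = TP / (TP + FP) and FDR = FP / (TP + FP).

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute TPR, FPR, accuracy, PPV and FDR as fractions in [0, 1].

    tp: gun images detected as gun       fn: gun images missed
    fp: non-gun images detected as gun   tn: non-gun images correctly rejected
    """
    return {
        "TPR": safe_div(tp, tp + fn),            # sensitivity / recall, equation (3)
        "FPR": safe_div(fp, fp + tn),            # equals 1 - specificity
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "PPV": safe_div(tp, tp + fp),            # precision
        "FDR": safe_div(fp, tp + fp),            # equals 1 - PPV
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; they are not taken from the paper.
    for name, value in detection_metrics(tp=90, fp=5, tn=95, fn=10).items():
        print(f"{name}: {100 * value:.1f}%")
```

Multiplying each fraction by 100 gives percentages in the same form as the result tables.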
5 RESULTS AND DISCUSSION

The performance of the system is tested under varying conditions such as varied backgrounds and occlusion. It is evaluated in terms of accuracy, True Positive Rate (TPR), False Positive Rate (FPR), Positive Prediction Value (PPV) and False Detection Rate (FDR).

We found that the CNN method works well with inter-class variation and intra-class similarities. The system works well under different conditions such as occlusion, inter-class variation and varying backgrounds. We used three different classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Ensemble Tree, for classification; the results are shown in tables 1-3 respectively. We evaluated the system with different parameter values, selected through cross validation. The best set of parameter values is considered optimal for visual handheld gun detection. The best SVM variant (Fine Gaussian) reaches a classification accuracy of 92.6%; the highest accuracy overall, 93.1%, is obtained with the boosted tree (table 3). The overall performance of the system can be evaluated in terms of ROC curves. Receiver operating characteristic (ROC) curves are an evaluation measure for binary classifiers. They show the tradeoff between specificity and sensitivity. Performance graphs in terms of ROC and AUC curves for (a) SVM, (b) KNN and (c) Ensemble Tree are shown in figure 2 with the maximum accuracy value. We have also compared our results with existing studies; the comparison is given in table 4.
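To make the ROC and AUC measures concrete, the sketch below builds ROC points by sweeping a decision threshold over classifier scores and integrates them with the trapezoidal rule. It is an illustrative pure-Python version under assumed inputs (a score per image, label 1 for gun and 0 for no gun); the scores are invented for the example and are not results from the paper.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FPR, TPR) points.

    scores: higher means "more likely a gun"; labels: 1 = gun, 0 = no gun.
    Both classes must be present, otherwise the rates are undefined.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # hypothetical classifier outputs
    labels = [1, 1, 0, 1, 0, 0]
    print(round(auc(roc_points(scores, labels)), 4))  # 0.8889
```

An AUC of 1.0 corresponds to scores that separate the two classes perfectly, while 0.5 corresponds to chance-level ranking.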
Table 1: Classification accuracy for SVM (values in %)

Algorithm            Accuracy  TPR  FPR  PPV  FDR
Linear SVM           89.9      87   6    95   5
Quadratic SVM        88.3      88   11   91   9
Cubic SVM            84.6      88   19   85   15
Fine Gaussian SVM    92.6      87   0    100  0
Medium Gaussian SVM  92.0      86   0    100  0
Coarse Gaussian SVM  92.0      86   0    100  0
Figure 2: Performance graph in terms of ROC and AUC curve for (a) SVM, (b) KNN and (c) Ensemble Tree

6 CONCLUSION
In this paper we proposed and implemented a novel handheld gun detection approach for surveillance and alert systems. The system uses the CNN-based VGG-16 architecture as a feature extractor, followed by state-of-the-art classifiers, evaluated on a standard gun database. With 93% accuracy, the most promising
results have been obtained. Our system can detect the presence of multiple guns in real time and is robust to variations in affine transformation, scale and rotation, and to partial occlusion. We nevertheless expect that, by adopting newer methods, the performance of the system can be improved and its real-time processing requirements, such as space and time complexity, can be reduced. To compare the accuracy of our approach, the methods of [11] and [12] are used. The accuracy using [11] (SIFT and FREAK) is 84.26% and using [12] (SURF) is 88.67%. From table 4, the accuracy of our system is 93.1%, which is higher than both methods.

Table 2: Classification accuracy for KNN (values in %)

Algorithm     Accuracy  TPR  FPR  PPV  FDR
Fine KNN      87.2      91   18   86   14
Medium KNN    89.9      88   7    94   6
Coarse KNN    89.9      89   10   92   8
Cosine KNN    91.5      88   4    97   3
Cubic KNN     91        88   6    95   5
Weighted KNN  91        88   5    96   4

Table 3: Classification accuracy for Ensemble Tree (values in %)

Algorithm              Accuracy  TPR  FPR  PPV  FDR
Boosted tree           93.1      90   6    99   1
Bagged tree            92.6      88   1    99   1
Subspace Discriminant  76.6      91   42   73   27
Subspace KNN           89.9      89   10   92   8
RUSBoosted tree        92        90   6    95   5

Table 4: Accuracy comparison with existing studies

Study                     Year  Methods       Accuracy (%)
Rohit Tiwari et al. [11]  2015  SIFT & FREAK  84.26
Rohit Tiwari et al. [12]  2015  SURF          88.67
This Study                2017  CNN           93.1

REFERENCES
[1] [n. d.]. IMFDB: Internet Movie Firearms Database. http://www.imfdb.org/wiki/Main_Page. ([n. d.]). Accessed: 2016-10-30.
[2] Jorge P. Batista. 2004. Tracking Pedestrians Under Occlusion Using Multiple Cameras. Springer Berlin Heidelberg, Berlin, Heidelberg, 552–562. https://doi.org/10.1007/978-3-540-30126-4_68
[3] Nadhir Ben Halima and Osama Hosam. 2016. Bag of Words Based Surveillance System Using Support Vector Machines. 10 (04 2016), 331–346.
[4] R. Blum, Zhiyun Xue, Z. Liu, and D. S. Forsyth. 2004. Multisensor concealed weapon detection by using a multiresolution mosaic approach. In IEEE 60th Vehicular Technology Conference, 2004. VTC2004-Fall. 2004, Vol. 7. 4597–4601 Vol. 7. https://doi.org/10.1109/VETECF.2004.1404961
[5] Andrzej Glowacz, Marcin Kmieć, and Andrzej Dziech. 2015. Visual detection of knives in security applications using Active Appearance Models. Multimedia Tools and Applications 74, 12 (01 Jun 2015), 4253–4267. https://doi.org/10.1007/s11042-013-1537-2
[6] Yu-Ming Liang, Sheng-Wen Shih, and Arthur Chun-Chieh Shih. 2013. Human action segmentation and classification based on the Isomap algorithm. Multimedia Tools and Applications 62, 3 (01 Feb 2013), 561–580. https://doi.org/10.1007/s11042-011-0858-2
[7] Yingdong Ma and Qian Chen. 2010. Depth Assisted Occlusion Handling in Video Object Tracking. Springer Berlin Heidelberg, Berlin, Heidelberg, 449–460. https://doi.org/10.1007/978-3-642-17289-2_43
[8] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
[9] D. M. Sheen, D. L. McMakin, and T. E. Hall. 2001. Three-dimensional millimeter-wave imaging for concealed weapon detection. IEEE Transactions on Microwave Theory and Techniques 49, 9 (Sep 2001), 1581–1592. https://doi.org/10.1109/22.942570
[10] E. P. Simoncelli and W. T. Freeman. 1995. The steerable pyramid: a flexible architecture for multi-scale derivative computation. In Proceedings., International Conference on Image Processing, Vol. 3. 444–447 vol.3. https://doi.org/10.1109/ICIP.1995.537667
[11] Rohit Kumar Tiwari and Gyanendra K. Verma. 2015. A Computer Vision based Framework for Visual Gun Detection Using Harris Interest Point Detector. Procedia Computer Science 54, Supplement C (2015), 703 – 712. https://doi.org/10.1016/j.procs.2015.06.083
[12] R. K. Tiwari and G. K. Verma. 2015. A computer vision based framework for visual gun detection using SURF. In 2015 International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO). 1–5. https://doi.org/10.1109/EESCO.2015.7253863
[13] E. M. Upadhyay and N. K. Rana. 2014. Exposure fusion for concealed weapon detection. In 2014 2nd International Conference on Devices, Circuits and Systems (ICDCS). 1–6. https://doi.org/10.1109/ICDCSyst.2014.6926141
[14] Andrea Vedaldi and Karel Lenc. 2014. MatConvNet - Convolutional Neural Networks for MATLAB. CoRR abs/1412.4564 (2014). http://arxiv.org/abs/1412.4564
[15] A. Vedaldi and K. Lenc. 2015. MatConvNet – Convolutional Neural Networks for MATLAB. (2015).
[16] Sergio A. Velastin, Boghos A. Boghossian, and Maria Alicia Vicencio-Silva. 2006. A motion-based image processing system for detecting potentially dangerous situations in underground railway stations. Transportation Research Part C: Emerging Technologies 14, 2 (2006), 96 – 113. https://doi.org/10.1016/j.trc.2006.05.006
[17] Z. Xue, R. S. Blum, and Y. Li. 2002. Fusion of visual and IR images for concealed weapon detection. In Proceedings of the Fifth International Conference on Information Fusion. FUSION 2002. (IEEE Cat.No.02EX5997), Vol. 2. 1198–1205 vol.2. https://doi.org/10.1109/ICIF.2002.1020949
[18] V. Zeljkovic and M. Popovic. 2001. Detection of moving objects in video signal under fast changes of scene illumination. In 5th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Service. TELSIKS 2001. Proceedings of Papers (Cat. No.01EX517), Vol. 2. 411–414 vol.2. https://doi.org/10.1109/TELSKS.2001.955808
