
2013 IEEE International Conference on Computer Vision

Abnormal Event Detection at 150 FPS in MATLAB

Cewu Lu Jianping Shi Jiaya Jia


The Chinese University of Hong Kong
{cwlu, jpshi, leojia}@cse.cuhk.edu.hk

Abstract

Speedy abnormal event detection meets the growing demand to process an enormous number of surveillance videos. Based on the inherent redundancy of video structures, we propose an efficient sparse combination learning framework. It achieves decent performance in the detection phase without compromising result quality. The short running time is guaranteed because the new method effectively turns the original complicated problem into one in which only a few cheap small-scale least-square optimization steps are involved. Our method reaches high detection rates on benchmark datasets at a speed of 140∼150 frames per second on average when computing on an ordinary desktop PC using MATLAB.

1. Introduction

With the increasing demand for security, surveillance cameras are commonly deployed. Detecting abnormal events is one critical task based on what the cameras capture, which is traditionally labor-intensive and requires non-stop human attention. What makes this interminable and boring process worse is that abnormal events generally happen with a frustratingly small chance, making over 99.9% of the effort spent watching videos wasted.

This predicament catalyzes important research in computer vision, aiming to find abnormal events automatically [8, 1, 13, 23, 16, 21, 11, 22, 5]. It is not a typical classification problem due to the difficulty of listing all possible negative samples. Research in this area commonly follows the line that normal patterns are first learned from training videos, and are then used to detect events that deviate from this representation.

Specifically, in [8, 20], extracted trajectories were utilized by tracking objects of interest to represent normal patterns. Outliers are regarded as abnormal. Another line is to learn normal low-level video feature distributions, such as exponential [1], multivariate Gaussian mixture [13], or clustering as a special case [23, 16]. Graph-model normal event representations were proposed in [21, 11, 10, 3, 19, 15, 6, 2], which utilize co-occurrence patterns. Among these methods, normal patterns were fitted in a space-time Markov random field in [21, 11, 10, 3]. Kwon and Lee [9] used a graph editing framework for abnormal event detection. Topic models, such as latent Dirichlet allocation, were employed [19, 15]. Recently, sparse representation [12, 17] has attracted much attention, and sparsity-based abnormality detection models [22, 5] achieved state-of-the-art performance on many datasets.

Although realtime processing is a key criterion for a practically employable system given continuously captured videos, most sparsity-based methods cannot be performed fast enough. The major obstruction to high efficiency is the inherently intensive computation needed to build the sparse representation. Note that a slow process could delay alarms and postpone responses to special events.

We provide a brief analysis of this issue below with respect to the general sparsity strategies and present our new framework with an effective representation. It fits the structure of surveillance video data and leads to an extremely cheap testing cost.

1.1. Sparsity Based Abnormality Detection

Sparsity is a general constraint [22, 5] to model normal event patterns as a linear combination of a set of basis atoms. We analyze abnormality detection in one local region to show that this process is computationally expensive by nature.

Given training features [x_1, ..., x_n] extracted from the history video sequence in a region, a normal pattern dictionary D ∈ R^{p×q} is learned with a sparsity prior. In the testing phase for a new feature x, we reconstruct it by sparsely combining elements in D, expressed as

    min_β ‖x − Dβ‖₂²   s.t.  ‖β‖₀ ≤ s,                          (1)

where β ∈ R^{q×1} contains sparse coefficients. ‖x − Dβ‖₂² is the data-fitting term; ‖β‖₀ is the sparsity regularization term; and s (≪ q) is a parameter to control sparsity. With this representation, an abnormal pattern can be naturally defined as one with a large error resulting from ‖x − Dβ‖₂².

1550-5499/13 $31.00 © 2013 IEEE 2720


DOI 10.1109/ICCV.2013.338
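To make the testing cost of Eq. (1) concrete, the sketch below contrasts the exhaustive support search with a greedy (matching-pursuit-style) approximation. This is an illustration only — NumPy rather than the paper's MATLAB, with hypothetical dimensions p, q, s — not the authors' code.

```python
import numpy as np
from math import comb

p, q, s = 100, 1000, 10          # feature dim, dictionary size, sparsity (illustrative)
rng = np.random.default_rng(0)
D = rng.standard_normal((p, q))
D /= np.linalg.norm(D, axis=0)   # unit-norm dictionary columns
x = rng.standard_normal(p)

# Exhaustively solving Eq. (1) would inspect C(q, s) candidate supports.
print(comb(q, s))                # astronomically many combinations for q=1000, s=10

# Greedy approximation (orthogonal-matching-pursuit style), still costly per feature:
residual, support = x.copy(), []
for _ in range(s):
    support.append(int(np.argmax(np.abs(D.T @ residual))))  # most correlated atom
    S = D[:, support]
    beta, *_ = np.linalg.lstsq(S, x, rcond=None)             # refit on current support
    residual = x - S @ beta
print(np.linalg.norm(residual) ** 2)  # reconstruction error used as abnormality score
```

Even this greedy variant scans the full dictionary s times per feature, which is the per-cube cost the paper's combination learning is designed to avoid.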
Figure 1. Our testing architecture. X denotes testing data. {S_1, ..., S_K} are learned combinations, with each S_i ∈ R^{p×s} (s ≪ q). E_i is the corresponding least-square reconstruction error. The final error is the minimum among all combinations.

Previous work verified that this form can lead to high detection accuracy.

Efficiency Problem  A high testing cost is inevitable when adopting Eq. (1), which aims to find the suitable basis vectors (with scale s) from the dictionary (with scale q) to represent testing data x. The search space is very large, as C(q, s) different combinations exist. Although much effort has been put into reducing the dictionary size [5] and adopting fast sparse coding solvers [22], in general, seconds are needed to process a frame, as reported in prior papers.

The efficiency problem is thus critical to address before this type of method can be deployed practically. A realtime process needs to be 100 times faster than the current fastest sparsity-based methods, which is difficult without tremendous hardware advancement. We tackle this problem from an algorithmic perspective. Our method yields decent performance and naturally accelerates sparse coding by 400+ times even using a MATLAB implementation.

1.2. Our Contribution

We propose sparse combination learning for detection. Exploiting the high structure redundancy in surveillance videos, instead of coding sparsity by finding an s-basis combination from D in Eq. (1), we code it directly as a set of possible combinations of basis vectors. Each combination here corresponds to a set of s dictionary bases in Eq. (1). With this change, instead of searching for s bases out of the q dictionary atoms for each testing feature, we only need to find the most suitable combination by evaluating the least-square error. The testing framework is shown in Fig. 1.

This framework is efficient since only small-scale least-square optimization is required in detection, with simple matrix projection. In our experiments, testing is on a small number of combinations, each taking 10^-6 ∼ 10^-7 second in MATLAB.

The effectiveness of our approach is well guaranteed by the inherent sparsity constraint on the combination size. Compared to original sparse coding, our model is more faithful to the input data. When freely selecting s basis vectors from a total of q vectors by Eq. (1), the reconstructed structure could deviate much from the input due to the large freedom. But with our trained combinations, this is unlikely to happen, since each combination fits its corresponding input data, better constraining reconstruction quality. Our method is therefore robust in distinguishing between normal and abnormal patterns.

We have verified our model on a large set of surveillance videos in Sec. 3.2. We also benchmark it on existing datasets for abnormal event detection. It reaches 140∼150 FPS using a desktop with a 3.4GHz CPU and 8GB memory in MATLAB 2012.

2. Method

We describe our framework that learns sparse basis combinations. To extract usable data, we resize each frame into different scales as in [5] and uniformly partition each layer into a set of non-overlapping patches. All patches have the same size. Corresponding regions in 5 continuous frames are stacked together to form a spatial-temporal cube. An example is illustrated in Fig. 2. This pyramid involves local information in fine-scale layers and more global structures in small-resolution ones.

Figure 2. Pyramid region architecture. A frame is resized into 3 different scales. In each scale the frame is partitioned into several regions.

With the spatial-temporal cubes, we compute 3D gradient features on each of them following [11]. These features in a video sequence are processed separately according to their spatial coordinates. Only features at the same spatial location in the video frames are used together for training and testing.

2.1. Learning Combinations on Training Data

For each cube location, 3D gradient features in all frames are denoted as X = {x_1, ..., x_n} ∈ R^{p×n}, gathered temporally for training. Our goal is to find a sparse basis combination set S = {S_1, ..., S_K} with each S_i ∈ R^{p×s} containing s dictionary basis vectors, forming a unique combination, where s ≪ q. Each S_i belongs to a closed, convex and bounded set, which ensures column-wise unit
norm to prevent over-fitting.

Our sparse combination learning has two goals. The first goal – effective representation – is to find K basis combinations that enjoy a small reconstruction error t. It is coarsely expressed as

    t ≜ min_{S,γ,β} Σ_{j=1}^{n} Σ_{i=1}^{K} γ_j^i ‖x_j − S_i β_j^i‖₂²
        s.t.  Σ_{i=1}^{K} γ_j^i = 1,  γ_j^i ∈ {0, 1},            (2)

where γ = {γ_1, ..., γ_n} and γ_j = {γ_j^1, ..., γ_j^K}. Each γ_j^i indicates whether or not the ith combination S_i is chosen for data x_j. β_j^i is the corresponding coefficient set for representing x_j with combination S_i. The constraints Σ_{i=1}^{K} γ_j^i = 1 and γ_j^i ∈ {0, 1} require that exactly one combination is selected. The objective function makes each training cube x constructible by at least one basis combination in S.

The second goal is to make the total number K of combinations small enough, based on the redundant surveillance video information. It is natural and inevitable because a very large K could possibly make the reconstruction error t in Eq. (2) always close to zero, even for abnormal events.

2.2. Optimization for Training

Our two objectives contradict each other in a sense. Reducing K could increase reconstruction errors. It is not optimal to fix K either, as content may vary among videos. This problem is addressed in our system with a maximum-representation strategy. It automatically finds K while not wildly increasing the reconstruction error t. In fact, the error t for each training feature is upper bounded in our method.

We obtain a set of combinations with a small K by setting a reconstruction error upper bound λ uniformly for all elements in S. If the reconstruction error for each feature is smaller than λ, the coding result is of good quality. So we update function (2) to

    ∀j ∈ {1, ..., n}:  t_j = Σ_{i=1}^{K} γ_j^i (‖x_j − S_i β_j^i‖₂² − λ) ≤ 0,
        s.t.  Σ_{i=1}^{K} γ_j^i = 1,  γ_j^i ∈ {0, 1}.            (3)

Algorithm  Our method performs in an iterative manner. In each pass, we update only one combination, making it represent as many training data as possible. This process can quickly find the dominating combinations encoding the important and most common features. Remaining training cube features that cannot be well represented by this combination are sent to the next round to gather residual maximum commonness. This process ends when all training data are computed and bounded. The size K of the combination set reflects how informative the training data are.

Specifically, in the ith pass, given the leftover training data X_c ⊆ X that cannot be represented by previous combinations {S_1, ..., S_{i−1}}, we compute S_i to bound most data in X_c. Our objective function becomes

    min_{S_i,γ,β} Σ_{j∈Ω_c} γ_j^i (‖x_j − S_i β_j^i‖₂² − λ)
        s.t.  γ_j^i ∈ {0, 1},                                    (4)

where Ω_c is the index set for X_c. It is easy to prove that this cost satisfies condition (3) and that the resulting S_i can represent most data. Specifically, if ‖x_j − S_i β_j^i‖₂² − λ ≥ 0, setting γ_j^i = 0 yields a smaller value than setting γ_j^i = 1. Conversely, γ_j^i should be 1 if ‖x_j − S_i β_j^i‖₂² − λ < 0, complying with condition (3).

In each pass i, we solve the function in Eq. (4) by dividing it into two steps that iteratively update {S_i, β} and γ using the following procedure.

Update {S_i, β}  With fixed γ, Eq. (4) becomes a quadratic function

    L(β, S_i) = Σ_{j∈Ω_c} γ_j^i ‖x_j − S_i β_j^i‖₂².             (5)

Following the traditional procedure, we optimize β while fixing S_i for all γ_j^i ≠ 0, and then optimize S_i using block-coordinate descent [14]. These two steps alternate. The closed-form solution for β is

    β_j^i = (S_i^T S_i)^{−1} S_i^T x_j.                          (6)

S_i finds its solution as

    S_i = Π[S_i − δ_t ∇_{S_i} L(β, S_i)],                        (7)

where δ_t is set to 1E−4 and Π denotes projecting each basis to a unit column. Block-coordinate descent can converge to a global optimum due to its convexity [4]. Therefore, the total energy of L(β, S_i) decreases in each iteration, guaranteeing convergence.

Update γ  With the {S_i, β} output, for each x_j the objective function becomes

    min_{γ_j^i} γ_j^i ‖x_j − S_i β_j^i‖₂² − λ γ_j^i
        s.t.  γ_j^i ∈ {0, 1}.                                    (8)

γ_j^i has the closed-form solution

    γ_j^i = 1 if ‖x_j − S_i β_j^i‖₂² < λ, and 0 otherwise.       (9)

The update step guarantees condition (3).

Algorithm 1 Training for Sparse Combination Learning
  Input: X, current training features X_c = X
  initialize S = ∅ and i = 1
  repeat
    repeat
      Optimize {S_i, β} with Eqs. (6) and (7)
      Optimize {γ} using Eq. (9)
    until Eq. (5) converges
    Add S_i to set S
    Remove computed features x_j with γ_j^i ≠ 0 from X_c
    i = i + 1
  until X_c = ∅
  Output: S

Algorithm 2 Testing with Sparse Combinations
  Input: x, auxiliary matrices {R_1, ..., R_K} and threshold T
  for k = 1 → K do
    if ‖R_k x‖₂² < T then
      return normal event
    end if
  end for
  return abnormal event

Algorithm Summary and Analysis  In each pass, we learn one S_i. We repeat this process to obtain a few combinations until the training data set X_c is empty. This scheme reduces information overlap between combinations. We summarize our training algorithm in Algorithm 1. The initial dictionary S_i in each pass is calculated by clustering the training data X_c via K-means with s centers.

Our algorithm is controlled by λ, the upper bound on reconstruction errors. Reducing it could lead to a larger K. Our approach is expressive because all training normal event patterns are represented with controllable reconstruction errors under condition (3). We train on 20K samples in R^100 in 20 minutes on a PC with 8GB RAM and an Intel 3.4GHz CPU.

2.3. Testing

With the learned sparse combinations S = {S_1, ..., S_K}, in the testing phase with new data x, we check whether there exists a combination in S fitting the reconstruction error upper bound. This can be quickly achieved by checking the least-square error for each S_i:

    min_{β^i} ‖x − S_i β^i‖₂²,  ∀ i = 1, ..., K.                (10)

It is a standard quadratic function with the optimal solution

    β^i = (S_i^T S_i)^{−1} S_i^T x.                             (11)

The reconstruction error in S_i is

    ‖x − S_i β^i‖₂² = ‖(S_i (S_i^T S_i)^{−1} S_i^T − I_p) x‖₂², (12)

where I_p is a p × p identity matrix. To further simplify computation, we define an auxiliary matrix R_i for each S_i:

    R_i = S_i (S_i^T S_i)^{−1} S_i^T − I_p.                     (13)

The reconstruction error for S_i is accordingly ‖R_i x‖₂². If it is small, x is regarded as a normal event pattern. The final testing scheme is summarized in Algorithm 2.

It is noted that the first few dominating combinations represent the largest number of normal event features, which enables us to determine positive data quickly. In our experiments, the average combination checking ratio — the number of combinations checked divided by the total number K — is 0.325. Also, our method can be easily accelerated via parallel processing to achieve O(1) complexity, although this is generally not necessary.

2.4. Relation to Subspace Clustering

Our approach can be regarded as an enhancement of subspace clustering [7], with the major difference lying in the working scheme. The relationship between subspace clustering and our method is similar to that between K-means and hierarchical clustering [18]. Specifically, the subspace clustering method [7] takes the number of clusters k as known or fixed beforehand, like K-means. In video abnormality detection applications, it is however difficult to know the optimal number of bases a priori. Our approach utilizes the allowed representation error to build combinations, where the error upper bound is explicitly implemented with a clear statistical meaning. There is no need to define the cluster size in this method. Our extensive experiments manifest that this strategy is both reliable and efficient.

3. Experiments

We empirically demonstrate that our model is suitable for representing general surveillance videos. We apply our method to different datasets. Quantitative comparisons are provided.

3.1. System Setting

In our method, the size of S_i ∈ R^{p×s} controls the sparsity level. We experimentally set s = 0.1 × p, where p is the data dimension. λ in Eq. (4) is the error upper bound, set to 0.04 in experiments.

Given the input video, we resize each frame to 3 scales with 20×20, 30×40, and 120×160 pixels respectively, and uniformly partition each layer into a set of non-overlapping 10 × 10 patches, leading to 208 sub-regions for each frame in total, as shown in Fig. 2. Corresponding sub-regions in 5 continuous frames are stacked together to form a spatial-temporal cube, each with resolution 10 × 10 × 5. We compute 3D gradient features on each of them following [11]. Those gradients are concatenated into a 1500-dimension feature vector for each cube and are then reduced to 100 dimensions via PCA. Normalization is performed to make them mean 0 and variance 1.

For each frame, we compute an abnormal indicator V by summing the number of abnormal cubes in each scale with weights. It is defined as V = Σ_{i=1}^{n} 2^{n−i} v_i, where v_i is the number of abnormal cubes in scale i. The top scale has index 1 while the bottom one has index n. All experiments are conducted using MATLAB.

3.2. Verification of Sparse Combinations

Surveillance videos consist of many redundant patterns. For example, at a subway exit, people generally move in similar directions. These patterns share information coded in our sparse combinations. To verify this, we collect 150 normal event surveillance videos with a total length of 107.8 hours. The videos are obtained from sources including the UCSD Ped1 dataset [15], the Subway datasets [1] (excluding abnormal event frames), 68 videos from YouTube, and 79 videos we captured. The scenes include subway, mall, traffic, indoor, elevator, square, etc. We show a few example frames in Fig. 3.

Figure 3. A few frames from the surveillance videos used for verification.

Each video contains 208 regions as illustrated in Fig. 2. With the 150 different videos, we gather a total of 31,200 (208 × 150) groups of cube features, with each group corresponding to a set of patches (cubes). They are used separately to verify the combination model. Each group contains 6,000-120,000 features. The number of combinations for each group is denoted as K. We show in Fig. 4 the distribution of K in the 31,200 groups. Its mean is 9.75 and variance is 10.62, indicating that 10 combinations are generally enough in our model. The largest K is 108. About 99% of the Ks are smaller than 45.

Figure 4. Different numbers of basis combinations to represent normal events in 31,200 groups (x-axis: K; y-axis: number of groups that use K combinations).

We illustrate the K distributions spatially in the Avenue data (described below) in Fig. 5. Many regions only need 1 combination because they are static. Largely varying patches may need dozens of combinations to summarize normal events. The statistical regression error is as small as 0.0132±1.38E−4, which indicates our dictionaries contain almost all normal patterns.

Figure 5. Spatial distribution of combination numbers to represent normal structures in the Avenue data.

3.3. Avenue Data Benchmark

We build an Avenue dataset, which contains 15 sequences. Each sequence is about 2 minutes long. The total number of frames is 35,240. There are 14 unusual events including running, throwing objects, and loitering. 4 videos are used as training data, with 8,478 frames in total.

A video sequence and its abnormal event detection result are demonstrated in Fig. 6. Fig. 7 contains two important frames and their abnormal event regions in two image scales. We list the detection statistics in Table 1. The performance of our method is satisfactory, with an average detection rate of 141.34 frames per second.

                Run   Loiter   Throw   False Alarm
Ground Truth     4      5        5        N/A
Ours             4      4        4         1
Table 1. Detection results in the Avenue dataset.
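Returning to the testing scheme of Sec. 2.3, whose speed these frame rates reflect, the following is a minimal NumPy sketch of Eq. (13) and Algorithm 2 — an illustration under assumed toy shapes, not the authors' MATLAB code.

```python
import numpy as np

def build_aux_matrices(combinations):
    """Precompute R_i = S_i (S_i^T S_i)^{-1} S_i^T - I (Eq. (13)) per combination."""
    Rs = []
    for S in combinations:                 # each S is p x s with s << p
        p = S.shape[0]
        Rs.append(S @ np.linalg.inv(S.T @ S) @ S.T - np.eye(p))
    return Rs

def is_abnormal(x, Rs, T):
    """Algorithm 2: x is normal iff some combination reconstructs it within T."""
    for R in Rs:
        if np.sum((R @ x) ** 2) < T:       # least-square residual ||R_i x||^2
            return False                   # normal event
    return True                            # no combination fits: abnormal

# Toy usage: one combination spanning a 2D subspace of R^5 (illustrative numbers).
rng = np.random.default_rng(0)
S1 = rng.standard_normal((5, 2))
Rs = build_aux_matrices([S1])
x_normal = S1 @ np.array([0.5, -1.0])      # lies in the span: residual ~ 0
x_other = rng.standard_normal(5)           # generic point: residual > 0
print(is_abnormal(x_normal, Rs, T=1e-6), is_abnormal(x_other, Rs, T=1e-6))
```

Since each R_i is precomputed offline, testing a feature costs only a handful of small matrix-vector products, which is why per-combination checks take microseconds.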

Figure 6. Detection results in a video sequence. The bottom plot is the response. A peak appears when an abnormal event – paper throwing – happens. The x value indexes frames and the y-axis denotes response strength.

Figure 7. Two abnormal events and their corresponding abnormal patches under two different scales in the Avenue dataset.

Figure 8. Subway dataset (Exit-Gate): Three abnormal events and their corresponding detection maps in two different scales in the Subway-Exit video.

                WD   LT   MISC   Total   FA
Ground Truth     9    3     7      19     0
[22]             9    3     7      19     2
[10]             9    3     7      19     3
[5]              9    -     -       -     2
Subspace         6    3     5      14     4
Ours             9    3     7      19     2
Table 2. Comparison with other sparsity-based methods [22, 5] on the Exit-Gate Subway dataset. WD: wrong direction; LT: loitering; FA: false alarm. "-" means the results are not provided. Subspace: results obtained by replacing our combination learning with subspace clustering [7].

        WD   NP   LT   II   MISC   Total   FA
GT      26   13   14    4     9      66     0
[22]    25    9   14    4     8      60     5
[10]    24    8   13    4     8      57     6
[5]     21    6    -    -     -       -     4
Subspace 21   6    9    3     7      46     7
Ours    25    7   13    4     8      57     4
Table 3. Comparison using the Subway-Entrance video with several previous methods. GT: ground truth; WD: wrong direction; NP: no payment; LT: loitering; II: irregular interactions; MISC: misc; FA: false alarm. "-" means results are not provided. Subspace: replacing our combination learning with subspace clustering [7].

        Second/Frame   Platform       CPU       Memory
[22]    2              MATLAB 7.0     2.6 GHz   2.0GB
[5]     4.6            -              2.6 GHz   2.0GB
Ours    0.00641        MATLAB 2012    3.4 GHz   8.0GB
Table 4. Running time comparison on the Subway dataset.

3.4. Subway Dataset

We conduct quantitative comparisons with previous methods on the Subway dataset [1]. The videos are 2 hours long in total, containing 209,150 frames of size 512 × 384. There are two types of videos, i.e., "exit gate" and "entrance gate" videos.

Exit Gate  The subway exit surveillance video contains 19 different types of unusual events, such as walking in the wrong direction and loitering near the exit. The video sequence in the first 15 minutes is used for training. This configuration is the same as those in [10, 22].

The abnormal event detection results for a few frames are shown in Fig. 8. Table 2 lists the comparison with other methods. Our false alarm rate is low mainly because each combination can construct many normal event features, thus reducing the chance of constructing an abnormal structure with a small error. This representation tightens feature modeling and makes it harder to misclassify abnormality as normal events. In this dataset, our combination number K varies from 1 to 56 for different cube features.

Entrance Gate  In this video, again, the first 15 minutes are used for training. Detection statistics are listed in Table 3. Our results are comparable to those of [22, 10, 5]. The proposed method yields high detection rates together with

Figure 9. Frame-level comparison of the ROC curves on the UCSD Ped1 dataset. Method abbreviations: MPPCA+SF [13], SF [13], MDT [13], Sparse [5], Saligrama [16], Antic [2], Subspace: replacing our combination learning by subspace clustering [7].

Figure 10. Pixel-level comparison of the ROC curves on the UCSD Ped1 dataset. Method abbreviations: MPPCA+SF [13], SF [13], MDT [13], Sparse [5], Saligrama [16], Antic [2], Subspace: replacing our combination learning by subspace clustering [7].

        Second/Frame   Platform       CPU       Memory
[13]    25             -              3.0 GHz   2.0GB
[5]     3.8            -              2.6 GHz   2.0GB
[2]     5 ∼ 10         MATLAB         -         -
Ours    0.00697        MATLAB 2012    3.4 GHz   8.0GB
Table 5. Running time comparison on the UCSD Ped1 dataset.

low false alarm rates.

Running Time Comparison  We compare our system with other sparse dictionary learning based methods [22, 5] in terms of running time on the Subway dataset in Table 4. The speeds of methods [22, 5] are those reported in their respective papers. The difference in detection speed is much larger than the difference in working environment.

3.5. UCSD Ped1 Dataset

The UCSD Ped1 dataset [13] provides 34 short clips for training, and another 36 clips for testing. All testing clips have frame-level ground truth labels, and 10 clips have pixel-level ground truth labels. There are 200 frames in each clip.

Our configuration is similar to that of [13]. That is, the performance is evaluated at the frame and pixel levels. We show the results via ROC curves, Equal Error Rate (EER), and Equal Detected Rate (EDR).

ROC Curve Comparison  According to [13], in frame-level detection, if a frame contains at least one abnormal pixel, it is considered a successful detection. In our experiment, if a frame contains one or more abnormal patches, we label it as an abnormal event. For frame-level evaluation, we alter the frame abnormality threshold to produce the ROC curve shown in Fig. 9. Our method has a reasonably high detection rate when the false positive value is low. This is vital for practical detection system development.

In pixel-level evaluation, a pixel is labeled as abnormal if and only if the regions it belongs to in all scales are abnormal. We alter the threshold for all pixels. Following [13], if more than 40% of truly anomalous pixels are detected, the corresponding frame is considered correctly detected. We show the ROC curve in Fig. 10. Besides all methods compared in [13], we also include the performance of subspace clustering [7]. Our method achieves satisfactory performance.

EER and EDR  Different parameters could affect detection and error rates. Following [13], we obtain these rates when the false positive number equals the missing value. They are called equal error rate (EER) and equal detected rate (EDR). We also compute the area under the ROC curve (AUC). We report EER, EDR and AUC in the pixel-level comparison (Table 6) and EER and AUC in the frame-level comparison (Table 7). These results indicate that our results are of high quality in both measures.

We compare the running time in Table 5. The detection time per frame and working platforms of [13, 5, 2] are obtained from the original papers.

3.6. Separate Cost Analysis

Our testing includes two main steps: feature extraction (3D cube gradient computation and PCA) and combination testing using Algorithm 2. Other minor procedures are frame resizing, matrix reshaping, etc. We list the average running time spent on each step to process one frame on the three datasets in Table 8.

4. Conclusion

We have presented an abnormal event detection method via sparse combination learning. This approach directly learns sparse combinations, which increase the testing speed hundreds of times without compromising effectiveness. Our method achieves state-of-the-art results on several datasets. It is related to, but differs largely from, traditional subspace clustering. Our future work will be to extend the sparse combination learning framework to other video applications.

       SF [13]   MPPCA [13]   SF-MPPCA [13]   MDT [13]   Sparse [5]   Adam [1]   Antic [2]   Subspace [7]   Ours
EDR    21%       18%          18%             45%        46%          24%        68%         39.3%          59.1%
AUC    19.7%     20.5%        21.3%           44.1%      13.3%        46.1%      76%         43.2%          63.8%
Table 6. Comparison of pixel-level EDR and AUC on the UCSD Ped1 dataset.

       SF-MPPCA [13]   SF [13]   MDT [13]   Sparse [5]   Saligrama [16]   Antic [2]   Subspace [7]   Ours
EER    40%             31%       25%        19%          16%              18%         29.6%          15%
AUC    59%             67.5%     81.8%      86%          92.7%            91%         68.4%          91.8%
Table 7. Comparison of frame-level EER and AUC on the UCSD Ped1 dataset.

             Feature extraction (ms)   Combination testing (ms)   Others (ms)   All (ms)   FPS
Avenue       4.513                     1.792                      0.770         7.075      141.34
UCSD Ped1    4.496                     1.724                      0.743         6.965      143.57
Subway       4.634                     1.409                      0.625         6.412      155.97
Table 8. Average running time of processing one frame in each step on the three datasets. "ms" is short for millisecond.

Acknowledgments

This research has been supported by the General Research Fund (No. 412911) from the Research Grants Council of Hong Kong.

References

[1] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE TPAMI, 30(3):555–560, 2008.
[2] B. Antic and B. Ommer. Video parsing for abnormality detection. In ICCV, pages 2415–2422, 2011.
[3] Y. Benezeth, P.-M. Jodoin, V. Saligrama, and C. Rosenberger. Abnormal events detection based on spatio-temporal co-occurences. In CVPR, 2009.
[4] D. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, MA, 1999.
[5] Y. Cong, J. Yuan, and J. Liu. Sparse reconstruction cost for abnormal event detection. In CVPR, pages 3449–3456, 2011.
[6] X. Cui, Q. Liu, M. Gao, and D. Metaxas. Abnormal detection using interaction energy potentials. In CVPR, pages 3161–3167, 2011.
[7] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, 2009.
[8] F. Jiang, J. Yuan, S. A. Tsaftaris, and A. K. Katsaggelos. Anomalous video event detection using spatiotemporal context. Computer Vision and Image Understanding, 115(3):323–333, 2011.
[9] J. Kwon and K. M. Lee. A unified framework for event summarization and rare event detection. In CVPR, 2012.
[10] J. Kim and K. Grauman. Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates. In CVPR, pages 2921–2928, 2009.
[11] L. Kratz and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In CVPR, pages 1446–1453, 2009.
[12] C. Lu, J. Shi, and J. Jia. Online robust dictionary learning. In CVPR, 2013.
[13] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In CVPR, 2010.
[14] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
[15] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.
[16] V. Saligrama and Z. Chen. Video anomaly detection based on local statistical aggregates. In CVPR, pages 2112–2119, 2012.
[17] J. Shi, X. Ren, G. Dai, J. Wang, and Z. Zhang. A non-convex relaxation approach to sparse dictionary learning. In CVPR, pages 1809–1816, 2011.
[18] T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning. Springer New York, 2001.
[19] X. Wang, X. Ma, and E. Grimson. Unsupervised activity perception by hierarchical Bayesian models. In CVPR, pages 1–8, 2007.
[20] S. Wu, B. E. Moore, and M. Shah. Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes. In CVPR, 2010.
[21] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan. Semi-supervised adapted HMMs for unusual event detection. In CVPR, 2005.
[22] B. Zhao, L. Fei-Fei, and E. Xing. Online detection of unusual events in videos via dynamic sparse coding. In CVPR, 2011.
[23] H. Zhong, J. Shi, and M. Visontai. Detecting unusual activity in video. In CVPR, 2004.

