Abstract—The H.264/Advanced Video Coding (AVC) standard is the industry standard in network surveillance, offering the lowest bitrate for a given perceptual quality among MPEG and proprietary codecs. This paper presents a novel approach for background subtraction in bitstreams encoded in the Baseline profile of H.264/AVC. Temporal statistics of the proposed feature vectors, describing macroblock units in each frame, are used to select potential candidates containing moving objects. From the candidate macroblocks, foreground pixels are determined by comparing the colors of corresponding pixels pair-wise with a background model. The basic contribution of the current work compared to the related approaches is that it allows each macroblock to have a different quantization parameter, in view of the requirements of variable as well as constant bit-rate applications. Additionally, a low-complexity technique for color comparison is proposed which enables us to obtain pixel-resolution segmentation at a negligible computational cost compared to that of classical pixel-based approaches. Results showing striking comparisons against those of proven state-of-the-art pixel-domain algorithms are presented over a diverse set of standardized surveillance sequences.

Index Terms—Background subtraction, compressed domain algorithm, H.264/AVC, video surveillance.

I. Introduction and Motivation

Background subtraction [1]–[3] is a fundamental process used in most video-based surveillance systems. Classical pixel-based techniques for background subtraction use raw information about each pixel. However, most video transmitted over networks consists of encoded bitstreams such as those produced by MJPEG, MPEG-x, and H.26x-compliant encoders. Therefore, such algorithms necessitate prior decoding of each frame, involving substantial overhead in terms of computation as well as memory space. These algorithms have decent segmentation performance, but most of them fail to meet real-time constraints due to the computation involved in processing each pixel individually in every frame. Under the circumstances, it is desirable that new background subtraction techniques be able to process encoded video, of considerably smaller size, directly for segmentation. Such techniques, the so-called compressed-domain algorithms, avert significant computational effort by reusing aggregated information about blocks of pixels already available in the coded syntax to localize object motion in frames. This motivates the development of a background subtraction algorithm for compressed bitstreams encoded in the latest video coding standards like H.264 [4].

The H.264/AVC standard brings new possibilities in the field of video surveillance [5]. It outperforms previous coding standards toward its goals of supporting
the proposed method. Experimental results are presented in Section V, and Section VI concludes the paper.

II. Related Work

Although there exist several techniques for moving object segmentation in the MPEG domain, algorithms that work on H.264 compressed video are relatively few. Thilak and Cruesere [6] presented a system to track targets in H.264 video, which relied heavily on prior knowledge of the size of the target. Zeng et al. [7] proposed an algorithm to segment moving objects from the sparse motion vector (MV) field using a block-based Markov random field (MRF). The limitation of this approach is that it is only applicable to video sequences with a stationary background. Liu et al. [8] used a complex binary partition tree to segment the normalized MV field into motion-homogeneous regions. The complexity of the algorithm increases drastically with a noisy MV field. Solana-Cipres et al. [9] incorporated fuzzy linguistic concepts that use MVs and MB decision modes. Fei and Zhu [10] introduced mean shift clustering using the normalized MV field and partitioned block size for moving object segmentation. This algorithm fails to detect objects with slow or intermittent motion. You et al. [11] proposed a new probabilistic spatio-temporal macroblock filtering (PSMF) and linear motion interpolation for object tracking in P-frames. The linearity assumption holds good for group of pictures (GOP) sizes of less than ten frames and slow-moving targets.

Most related techniques that work in the H.264 compressed domain are based solely on the MV field. MVs, however, do not necessarily correspond to true object motion, and such techniques end up wrongly classifying dynamic components (e.g., waving tree branches, fountains, ripples, camera jitter) as foreground. Poppe et al. [12] proposed an alternative technique that relied on the total number of bits required to encode a given MB. This technique fails to detect slow-moving objects.

A common drawback of the related approaches is the assumption of a fixed quantization parameter (QP) for all MBs, thereby limiting them to variable bit rate (VBR) applications with uncontrolled bit rate. However, in most streaming video applications, a predetermined constant output bit rate is desired. These applications, referred to as constant bit rate (CBR) applications, ensure a target bit rate by carefully selecting a different QP for each MB. Unlike the related approaches, our algorithm allows each MB to have a different QP in consideration of the stringent bandwidth requirements of a practical surveillance scenario.

Secondly, we perform filtering on the aggregate result of MB-level object segmentation obtained jointly from the MV field and the residual data, rather than applying separate filtering procedures on either or both of the features in isolation. This helps us to reduce computation while relying more on the consensus between the MB features.

The performance of the related approaches is largely restricted to coarse MB-level segmentation. In contrast, we introduce a low-complexity technique for comparing pixel colors (YCbCr) in order to obtain pixel-resolution segmentation of sequences with a highly dynamic background.

III. Proposed Macroblock Feature Vector

A. H.264 Preliminaries

A bitstream sequence encoded in H.264 consists of a sequence of pictures or frames, each of which is split into MB units covering a fixed area of 16 × 16 pixels. An intracoded or I-macroblock is predicted from spatially neighboring samples of previously coded blocks in the same frame. On the other hand, most MBs are predicted using references to one or more previously coded block(s) of pixels in past frames or a combination of past and future frames. A bi-predictive or B-macroblock may use past and/or future frames as references, while a predictive or P-macroblock may use past frames only. These are collectively called intercoded MBs, which require each referred block to be indicated using a MV with the corresponding reference frame index. Additionally, the difference between the predicted and the target MB, i.e., the prediction error or residual, is transformed, quantized, and entropy coded. Thus, the MVs as well as the residual component, bearing complementary information about the same MB, hold the key to effective feature vector representation.

For network surveillance, the profile most commonly used for encoding is the Baseline. It claims minimal computational resources and has very low latency, ideal for live surveillance feeds on low-end devices. The profile is a simplified implementation of the H.264 standard based only on I- and P-frames. I-frames are coded using I-macroblocks only, and require more bits to encode than any other frame type. P-frames, which are mainly coded using P-macroblocks, may occasionally contain I-macroblocks, particularly in areas of complex motion. A video sequence necessarily starts with an I-frame, but it mostly consists of P-frames, with I-frames spaced at regular intervals. The number of I-frames in a coded video is negligible; hence, they are not considered in the proposed method.

B. Feature #1: Mean of Absolute Transformed Differences (MATD)

As the name indicates, MATD is the mean of the absolute values of the quantized Discrete Cosine Transform (DCT) coefficients in a coded MB. We formulate MATD using the statistical properties of DCT coefficients.

In the literature, it is found to be most appropriate and convenient to model the distribution of DCT coefficients by Laplacian distributions [13]–[15]. A Laplacian probability density function (pdf) is given by

p(z) = \frac{1}{2b} \exp\!\left(-\frac{|z|}{b}\right), \quad z \in \mathbb{R}   (1)

where b > 0 is the Laplacian parameter, which defines the width of the pdf, and z is the value of a given DCT coefficient.

In H.264 Baseline encoding, an integer implementation of the 4 × 4 block DCT operates on a similar-sized block of residual or target data. The resulting block is subsequently scaled and quantized element-wise by a quantization matrix. The quantization scheme most commonly adopted is uniform quantization with a dead-zone [16]. In this process, the value z of an input coefficient is, in general terms, quantized as

k = \left\lfloor \frac{|z| + f}{Q} \right\rfloor \operatorname{sgn}(z)   (2)
TABLE I
Scalar Multiplier q0

Fig. 1. Relation between the input coefficient z and the reconstructed output z_k for quantization step size Q and rounding offset f.

where k ∈ Z represents the quantization level or index that is actually transmitted by the source encoder, Q > 0 is the uniform quantization step size outside the dead-zone interval, and f ∈ [0, Q/2] is the rounding offset that controls the width of the dead-zone. Typically, f is set to Q/6 for P-frames in an effort to approximate the distribution of DCT coefficients in a quantization interval. The decoder process reconstructs the quantized output z_k of the input coefficient z as

z_k = kQ.   (3)

Fig. 1 illustrates the relationship between z and z_k. Formally, with r = Q/(6b),

P(z_k) = \begin{cases} \int_{-\frac{5}{6}Q}^{\frac{5}{6}Q} p(z)\,dz = 1 - \exp(-5r), & \text{if } k = 0 \\ \int_{(k-\frac{1}{6})Q}^{(k+\frac{5}{6})Q} p(z)\,dz = \exp\!\big(-2r(1+3k)\big)\,\sinh 3r, & \text{if } k > 0 \end{cases}

and, by symmetry of the pdf, P(z_{-k}) = P(z_k) for k > 0, where

q(QP, n) = q_0\big(\operatorname{mod}(QP, 6),\, n\big)\, 2^{\lfloor QP/6 \rfloor}   (7)

q_0 being a scalar multiplier as defined in Table I [17]. It is noted in (6) that exactly one half of the transformed coefficients are quantized with a step size equal to q(QP, 2), and the remaining with q(QP, 0) and q(QP, 1) equally. Thus, r can take only three possible values, q(QP, 0)/(6b), q(QP, 1)/(6b), or q(QP, 2)/(6b), for given values of QP and b.

Given the fact that the quantized coefficients z_k of a MB (both I and P types) are entropy coded, the lower bound on the average bit rate (bits/coefficient) may be expressed as

H = -\sum_{k=-\infty}^{+\infty} P(z_k)\,\log_2 P(z_k).

Weighting this per-coefficient entropy over the 384 transformed coefficients of a MB according to the proportions noted above, the expected number of bits B spent on the MB becomes

B = 384\left[\tfrac{1}{4}\,H\!\left(\tfrac{q(QP,0)}{6b}\right) + \tfrac{1}{4}\,H\!\left(\tfrac{q(QP,1)}{6b}\right) + \tfrac{1}{2}\,H\!\left(\tfrac{q(QP,2)}{6b}\right)\right]   (10)

Substituting the values of q(QP, 0)/(6b) and its scalar multiples q(QP, 1)/(6b) and q(QP, 2)/(6b) in (10), we finally obtain MATD. Using the fact that B and QP can only assume non-negative integer values from a limited range, we construct a lookup table containing precomputed values of MATD (indexed by B and QP). This enables us to bypass the entropy decoding and dequantization of individual coefficients which would otherwise be required to compute MATD directly, i.e., the actual MATD. The correlation between the actual MATD and the predicted MATD is explored in Section V.
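Since B (the observed number of bits spent on a MB) and QP both take integer values from small ranges, the per-MB MATD prediction reduces to a single table access at decode time. The following is a minimal C sketch of such a lookup, assuming the table has been filled offline by numerically inverting (10); the table dimensions, the bit-count cap, and the names are illustrative assumptions rather than the actual implementation.

```c
#include <stdint.h>

#define MAX_QP   52     /* H.264 QP range is 0..51                          */
#define MAX_BITS 4096   /* assumed cap on coded bits per MB (illustrative)  */

/* Precomputed offline, e.g., by numerically inverting (10): for each
 * (QP, B) pair store the Laplacian-based MATD estimate. */
static float matd_lut[MAX_QP][MAX_BITS];

/* Predicted MATD for one macroblock: a single table access, so no entropy
 * decoding or dequantization of individual coefficients is needed. */
static inline float predicted_matd(int qp, int mb_bits)
{
    if (mb_bits >= MAX_BITS)
        mb_bits = MAX_BITS - 1;   /* clamp rare, very expensive MBs */
    return matd_lut[qp][mb_bits];
}
```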
C. Feature #2: Sum of Normalized Motion Vector Magnitudes (SNMVM)

SNMVM represents the motion-compensated information associated with the MVs of a MB. In H.264, two new coding characteristics are introduced in motion compensation: variable partition size and multiple reference frames. Each intercoded MB may be predicted using a range of block sizes. Accordingly, the MB is split into one, two, or four MB partitions using either
1) one 16×16 partition (covering the whole MB);
2) two 8×16 partitions;
3) two 16×8 partitions; or
4) four 8×8 partitions.
If the partition size is chosen as 8×8, then each 8×8 block, hereafter a sub-MB, is split into one, two, or four sub-MB partitions (either one 8×8, two 4×8, two 8×4, or four 4×4 sub-MB partitions). Each partition or sub-MB partition of a P-macroblock has a MV (mvxi, mvyi) pointing to an area of the same size in a reference frame, which is used to predict

{0, 1, . . . , T − 1} in raster-scan order starting with zero for the MB at the top-left-hand corner of a frame. A P-frame MB having frame number t at location idx is described by a vector

\vec{p}_{t,idx} = [x, y]^T   (11)

where the components x ≥ 0 and y ≥ 0 are the values of MATD and SNMVM, respectively, for the given MB.
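As an illustration of how the y component of (11) could be assembled from the variable-size partitions listed in Section III-C, the sketch below sums the MV magnitude of every partition or sub-MB partition after normalizing it. The normalization by the MV search range and the data layout are assumptions made for illustration only and are not the definition used in the paper.

```c
#include <math.h>

/* One motion partition of a P-macroblock (16x16 down to 4x4). */
typedef struct {
    int mvx, mvy;    /* motion vector components (quarter-pel units) */
    int w, h;        /* partition width and height in pixels         */
} MbPartition;

/* Illustrative SNMVM: sum of MV magnitudes over all (sub-)partitions of a
 * MB, each normalized by the configured search range. The choice of
 * normalizer is an assumption; the paper defines its own normalization. */
float snmvm(const MbPartition *part, int num_parts, float search_range)
{
    float sum = 0.0f;
    for (int i = 0; i < num_parts; i++) {
        float mag = sqrtf((float)(part[i].mvx * part[i].mvx +
                                  part[i].mvy * part[i].mvy));
        sum += mag / search_range;   /* normalized MV magnitude */
    }
    return sum;
}
```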
E. Initialization and Incremental Update of Covariance

In the proposed approach, static and dynamic components present at different locations in a scene background are modeled with a temporally weighted local covariance Σ_idx, ∀idx ∈ {0, 1, . . . , T − 1}. Feature vectors computed for each idx are stored in T corresponding first-in first-out (FIFO) buffers, each having a predefined capacity M. A new MB vector is inserted at the rear. However, if the buffer is full, the vector at the front, being the least relevant in the temporal order of vectors in the buffer, is removed to make room for the new one. Corresponding to each position j ∈ {1 (front), 2, 3, . . . , M (rear)} in a buffer containing a vector, a predefined weight W_j = j/M is assigned in order of temporal relevance. Without loss of generality, let us assume that buffer[idx] is in a state in which all n vectors from the sequence {[x_i, y_i]^T}_{i=1}^{n} have been inserted, with the nth vector currently at the rear. If n > M, this would have resulted in the removal of the previous (n − M) vectors. Let σ_x² and σ_y² denote the respective weighted variances of {x_i}_{i=max(1,n−M+1)}^{n} and {y_i}_{i=max(1,n−M+1)}^{n} currently accumulated in the buffer. Also, let σ_xy denote the covariance between the same sets of values. Accordingly, the required covariance matrix Σ_idx is expressed as in

\Sigma_{idx} = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}.   (12)

Weighted variance σ_x² is defined as

\sigma_x^2 = \overline{x^2} - \left(\overline{x}\right)^2   (13)

where

\overline{x} = \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}\,x_i \Big/ \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}   (14)

and

\overline{x^2} = \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}\,x_i^2 \Big/ \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}.   (15)

Similarly, we have the corresponding weighted quantities for {y_i} and the cross term, where (20) may be obtained from (21) by substituting for (a, b) each of the ordered pairs (2, 0), (0, 1), (0, 2), and (1, 1), respectively. The update procedures of x̄, x̄², ȳ, ȳ², and x̄y are followed by an adjustment of [W]_n as

[W]_n = \begin{cases} 1, & \text{if } n = 1 \\ [W]_{n-1} + W_{M-n+1}, & \text{if } 1 < n \le M \\ [W]_{n-1}, & \text{if } n > M. \end{cases}   (22)

The overall process of updating Σ_idx requires no more than 14 multiplications, ten divisions, and 21 additions.
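To make the bookkeeping of (12)–(15) concrete, the following C sketch evaluates the weighted covariance directly from the current FIFO contents. It is a straightforward O(M) reference computation under an assumed buffer capacity; the paper's own update instead maintains these statistics incrementally in the constant number of operations quoted above. The struct layout and names are illustrative.

```c
#define M_CAP 16                     /* assumed FIFO capacity M (illustrative) */

typedef struct { float x, y; } FeatVec;   /* x = MATD, y = SNMVM */

/* FIFO buffer for one MB location idx. vec[0] is the front (oldest). */
typedef struct {
    FeatVec vec[M_CAP];
    int     count;                   /* number of vectors currently held (<= M) */
} MbBuffer;

/* Push a new feature vector, evicting the front element when full. */
void buffer_push(MbBuffer *b, FeatVec v)
{
    if (b->count == M_CAP) {
        for (int i = 1; i < M_CAP; i++)    /* drop oldest, shift the rest */
            b->vec[i - 1] = b->vec[i];
        b->count--;
    }
    b->vec[b->count++] = v;
}

/* Reference (non-incremental) evaluation of the weighted covariance (12),
 * using weights W_j = j/M assigned from front to rear as in Section III-E. */
void weighted_cov(const MbBuffer *b, float *var_x, float *var_y, float *cov_xy)
{
    if (b->count == 0) { *var_x = *var_y = *cov_xy = 0.0f; return; }

    float wsum = 0, sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < b->count; i++) {
        /* element i sits at buffer position j = M - count + i + 1 */
        float w = (float)(M_CAP - b->count + i + 1) / M_CAP;
        float x = b->vec[i].x, y = b->vec[i].y;
        wsum += w;
        sx  += w * x;       sy  += w * y;
        sxx += w * x * x;   syy += w * y * y;   sxy += w * x * y;
    }
    float mx = sx / wsum, my = sy / wsum;      /* weighted means, as in (14) */
    *var_x  = sxx / wsum - mx * mx;            /* sigma_x^2, as in (13)      */
    *var_y  = syy / wsum - my * my;            /* sigma_y^2                  */
    *cov_xy = sxy / wsum - mx * my;            /* sigma_xy                   */
}
```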
Fig. 7. Each row shows one example from each category. The categories are
from top to bottom: baseline (highway), camera jitter (boulevard), dynamic
background (fountain01), intermittent object motion (sofa), shadow (cubicle),
and thermal (library).
TABLE II
Quantitative Evaluation (Results of all categories combined)
containing parts of slow-moving objects often go undetected. All MBs in a frame corresponding to which τ1[idx] = 0 and √(MD/T) ∈ (0.02, 0.12) represent the background. Pixels constituting such MBs are used to update the corresponding pixels B_t(x, y) in the existing background using (24). The number of such MBs in a given frame, say β, is practically very small.

B_{t+1}(x, y) = \alpha I_t(x, y) + (1 - \alpha)\,B_t(x, y) \quad \text{for } t > N   (24)

where I_t(x, y) is the pixel's intensity value for frame t, and α = 0.08 is a predefined learning rate that determines the tradeoff between stability and quick update.
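The background maintenance in (24) is a per-pixel running average over the 16 × 16 area of each MB flagged as background. A minimal sketch is given below; the buffer types and frame layout are assumptions for illustration.

```c
#include <stdint.h>

#define ALPHA 0.08f     /* learning rate from (24) */

/* Running-average update (24) of the background model over one MB.
 * 'bg' holds the background model as floats; 'cur' is the decoded frame.
 * (mb_x, mb_y) is the top-left pixel of a 16x16 MB flagged as background. */
void update_background_mb(float *bg, const uint8_t *cur,
                          int stride, int mb_x, int mb_y)
{
    for (int y = mb_y; y < mb_y + 16; y++)
        for (int x = mb_x; x < mb_x + 16; x++) {
            int idx = y * stride + x;
            bg[idx] = ALPHA * (float)cur[idx] + (1.0f - ALPHA) * bg[idx];
        }
}
```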
V. Experimental Results and Discussion

Following the discussion in Section III-B, we provide a statistical comparison of the actual MATD against the predicted MATD values of all P-frame MBs selected at regular intervals from the Traffic sequence. The sequence was encoded in VBR as well as CBR modes, as reported in Fig. 6. The values of actual MATD are found to be greater than the corresponding predicted MATD values, owing to the fact that the latter is modeled using the entropy criterion, which is the theoretical lower bound on the average bitrate. It is observed that the actual MATD and the predicted MATD values are very highly correlated (correlation coefficient ρ > 0.98). In (23), we used the Mahalanobis distance, which is invariant under arbitrary nonsingular linear transformations of the form shown in Fig. 6 (with regression equations). Thus, predicted MATD qualifies as a convenient surrogate for actual MATD insofar as the discriminative aspect of MATD is concerned.

The proposed algorithm was implemented in C and integrated into the H.264/AVC MB decoding module of Libavcodec, an open-source audio/video codec library developed as a part of the FFmpeg [18] project. We evaluate our approach on the entire benchmark dataset provided for the Change Detection Challenge 2012 [19]. As the proposed method uses pixel information to achieve pixel-level accuracy in the final stage of segmentation, we consider it fair and obvious to compare our results with those of proven state-of-the-art (SoA) pixel-based approaches [20]–[24]. All 31 sequences of the dataset were encoded in: 1) VBR with a fixed QP = 25 for all MBs; and 2) CBR with a target bit rate of 1024 kb/s. The encoder configuration was set as follows: Baseline profile with YCbCr 4:2:0 (progressive format) 8-bit chroma subsampling, GOP size varying in [1, 250], rate-distortion optimization enabled, and a MV range (using hexagon-based search) of [−16, 16] × [−16, 16] with three reference frames. The rate of decoding frames was fixed at 25 frames per second (fps).

The background subtraction masks of the proposed method for a few selected frames are shown in Fig. 7 to enable qualitative evaluation against the specified ground-truth masks. For quantitative evaluation, a set of seven evaluation metrics defined in [19], namely recall, specificity, false-positive rate (FPR), false-negative rate (FNR), percentage of bad classification (PBC), F-measure, and precision, has been used together with the average processing speed. An exhaustive comparison of the proposed method with those of [20]–[24] (applied on the original input frames prior to encoding, with the default parameters defined in each work) is summarized in Table II. Subscripts indicate the rank of the corresponding figures in the indicated evaluation category. It is noticed that rankings based on FPR = (1 − specificity) and FNR = (1 − recall) are identical to those based on specificity and recall, respectively; hence, all evaluation metrics except FPR and FNR were given equal weightage for the computation of the average rank.

Prior to empirical evaluation, it is important to realize that every codec can deliver a varying degree of output video quality for a given set of input frames. Any degradation of visual data introduced by "lossy" compression will inevitably remain visible through any further processing of the content. Notwithstanding the encoding options, which considerably affect the performance of a compressed-domain algorithm, the proposed method delivers better overall performance even when pitted against SoA pixel-based techniques.

Background segmentation is but one component of a potentially complex computer vision system. Therefore, in addition to being accurate, a successful technique must consume as few CPU cycles and as little memory as possible. An algorithm that segments perfectly but is computationally expensive is useless, because insufficient processing resources will remain to do anything useful with its results in real time. The most notable aspect, in this regard, is the comparison of average processing speeds in Table II. The computing speeds were recorded for videos with 720 × 420 resolution on a personal computer powered by an Intel Core i7-2600 3.40 GHz CPU with 16 GB RAM (no dedicated hardware or graphics processing unit was used). It is evident that the proposed method runs significantly faster than any of the reported SoA techniques.
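The seven metrics are simple functions of the per-pixel confusion counts accumulated over a category. A small helper of the following form is enough to compute them from TP/FP/TN/FN totals, following the standard changedetection.net-style definitions; the struct and function names here are illustrative, not taken from [19].

```c
typedef struct {
    double tp, fp, tn, fn;    /* per-pixel confusion counts */
} Confusion;

typedef struct {
    double recall, specificity, fpr, fnr, pbc, precision, fmeasure;
} Metrics;

/* Evaluation metrics from confusion counts (changedetection.net style). */
Metrics compute_metrics(Confusion c)
{
    Metrics m;
    m.recall      = c.tp / (c.tp + c.fn);
    m.specificity = c.tn / (c.tn + c.fp);
    m.fpr         = c.fp / (c.fp + c.tn);               /* = 1 - specificity */
    m.fnr         = c.fn / (c.tp + c.fn);                /* = 1 - recall      */
    m.pbc         = 100.0 * (c.fn + c.fp) / (c.tp + c.fn + c.fp + c.tn);
    m.precision   = c.tp / (c.tp + c.fp);
    m.fmeasure    = 2.0 * m.precision * m.recall / (m.precision + m.recall);
    return m;
}
```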
The computation per MB involved in each step of the proposed method (described in Section III) costs up to a constant factor. Consequently, the complexity of the overall process is

O\!\left(T + c_1\beta + c_2\sum_{idx=0}^{T-1}\tau_1[idx]\right)

where \sum_{idx=0}^{T-1}\tau_1[idx] denotes the number of candidate MBs that require pixel-level processing, T is the total number of MBs per frame, and c_1, c_2 are constants. Arguably, the running time scales linearly with T, incurring only a negligible overhead \kappa = c_1\beta + c_2\sum_{idx=0}^{T-1}\tau_1[idx] \le T in addition to the regular decoding cost claimed by each frame.

VI. Conclusion

We introduced a novel approach for background subtraction on videos encoded in the Baseline profile of H.264/AVC. The proposed method is aptly built for real-time network streaming applications in consideration of variable/constant bit rate options under practical bandwidth constraints. It also proved to be robust across a diverse set of real-world (non-synthetic) surveillance sequences.

Acknowledgment

The authors would like to thank the anonymous reviewers and the associate editor for their valuable comments that significantly improved the quality of this paper.

References

[14] E. Y. Lam and J. W. Goodman, "A mathematical analysis of the DCT coefficient distributions for images," IEEE Trans. Image Process., vol. 9, no. 10, pp. 1661–1666, Oct. 2000.
[15] W. Wu and B. Song, "DC coefficient distributions for P-frames in H.264/AVC," ETRI J., vol. 33, no. 5, pp. 814–817, Oct. 2011.
[16] G. J. Sullivan and S. Sun, "On dead-zone plus uniform threshold scalar quantization," in Proc. SPIE Vis. Commun. Image Process., vol. 5960, no. 2, Jul. 2005, pp. 1041–1052.
[17] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd ed. London, U.K.: Wiley, 2010, p. 191.
[18] F. Bellard. (2002, Apr. 26). FFmpeg [Online]. Available: http://ffmpeg.org/
[19] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, "Changedetection.net: A new change detection benchmark dataset," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 1–8.
[20] L. Maddalena and A. Petrosino, "The SOBS algorithm: What are the limits?," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 21–26.
[21] O. Barnich and M. Van Droogenbroeck, "ViBe: A universal background subtraction algorithm for video sequences," IEEE Trans. Image Process., vol. 20, no. 6, pp. 1709–1724, Jun. 2011.
[22] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," in Proc. 2nd Eur. Workshop Adv. Video-Based Surveillance Syst., 2001, pp. 149–158.
[23] A. Elgammal, D. Harwood, and L. S. Davis, "Non-parametric model for background subtraction," in Proc. 6th Eur. Conf. Comput. Vis., 2000, pp. 751–767.
[24] M. Hofmann, P. Tiefenbacher, and G. Rigoll, "Background segmentation with feedback: The pixel-based adaptive segmenter," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 38–43.