Abstract—The H.264/Advanced Video Coding (AVC) standard is the industry standard in network surveillance, offering the lowest bitrate for a given perceptual quality among MPEG and proprietary codecs. This paper presents a novel approach for background subtraction in bitstreams encoded in the Baseline profile of H.264/AVC. Temporal statistics of the proposed feature vectors, describing macroblock units in each frame, are used to select potential candidates containing moving objects. From the candidate macroblocks, foreground pixels are determined by comparing the colors of corresponding pixels pair-wise with a background model. The basic contribution of the current work compared to the related approaches is that it allows each macroblock to have a different quantization parameter, in view of the requirements of variable as well as constant bit-rate applications. Additionally, a low-complexity technique for color comparison is proposed which enables us to obtain pixel-resolution segmentation at a negligible computational cost compared to that of classical pixel-based approaches. Results showing striking comparisons against those of proven state-of-the-art pixel-domain algorithms are presented over a diverse set of standardized surveillance sequences.

Index Terms—Background subtraction, compressed domain algorithm, H.264/AVC, video surveillance.

I. Introduction and Motivation

Background subtraction [1]–[3] is a fundamental process used in most video-based surveillance systems. Classical pixel-based techniques for background subtraction use raw information about each pixel. However, most video transmitted over networks consists of encoded bitstreams such as those produced by MJPEG, MPEG-x, and H.26x-compliant encoders. Therefore, such algorithms necessitate prior decoding of each frame, involving substantial overhead in terms of computation as well as memory space. These algorithms have decent segmentation performance, but most of them fail to meet real-time constraints due to the computation involved in processing each pixel individually in every frame. Under the circumstances, it is desirable that new background subtraction techniques be able to process encoded video, of considerably smaller size, directly for segmentation. Such techniques, the so-called compressed-domain algorithms, avert significant computational effort by reusing aggregated information about blocks of pixels already available in the coded syntax to localize object motion in frames. This motivates the development of a background subtraction algorithm for compressed bitstreams encoded in the latest video coding standards like H.264 [4].

The H.264/AVC standard brings new possibilities in the field of video surveillance [5]. It outperforms previous coding standards toward its goals of supporting
the proposed method. Experimental results are presented in Section V, and Section VI concludes the paper.

II. Related Work

Although there exist several techniques for moving object segmentation in the MPEG domain, algorithms that work on H.264 compressed video are relatively few. Thilak and Cruesere [6] presented a system to track targets in H.264 video, which relied heavily on prior knowledge of the size of the target. Zeng et al. [7] proposed an algorithm to segment moving objects from the sparse motion vector (MV) field using a block-based Markov random field (MRF). The limitation of this approach is that it is only applicable to video sequences with a stationary background. Liu et al. [8] used a complex binary partition tree to segment the normalized MV field into motion-homogeneous regions. The complexity of the algorithm increases drastically with a noisy MV field. Solana-Cipres et al. [9] incorporated fuzzy linguistic concepts that use MVs and MB decision modes. Fei and Zhu [10] introduced mean shift clustering using the normalized MV field and partitioned block size for moving object segmentation. This algorithm fails to detect objects with slow or intermittent motion. You et al. [11] proposed a new probabilistic spatio-temporal macroblock filtering (PSMF) and linear motion interpolation for object tracking in P-frames. The linearity assumption holds good for group of pictures (GOP) sizes of less than ten frames and slow-moving targets.

Most related techniques that work in the H.264 compressed domain are based solely on the MV field. MVs, however, do not necessarily correspond to true object motion, and such techniques end up wrongly classifying dynamic components (e.g., waving tree branches, fountains, ripples, camera jitter) as foreground. Poppe et al. [12] proposed an alternative technique that relied on the total number of bits required to encode a given MB. This technique fails to detect slow-moving objects.

A common drawback of the related approaches is the assumption of a fixed quantization parameter (QP) for all MBs, thereby limiting them to variable bit rate (VBR) applications with uncontrolled bit rate. However, in most streaming video applications, a predetermined constant output bit rate is desired. These applications, referred to as constant bit rate (CBR) applications, ensure a target bit rate by carefully selecting a different QP for each MB. Unlike the related approaches, our algorithm allows each MB to have a different QP in consideration of the stringent bandwidth requirements of a practical surveillance scenario.

Secondly, we perform filtering on the aggregate result of MB-level object segmentation obtained jointly from the MV field and the residual data, rather than applying separate filtering procedures on either or both of the features in isolation. This helps us to reduce computation while relying more on the consensus between the MB features.

The performance of the related approaches is largely restricted to coarse MB-level segmentation. In contrast, we introduce a low-complexity technique for comparing pixel colors (YCbCr) in order to obtain pixel-resolution segmentation of sequences with a highly dynamic background.

III. Proposed Macroblock Feature Vector

A. H.264 Preliminaries

A bitstream sequence encoded in H.264 consists of a sequence of pictures or frames, each of which is split into MB units covering a fixed area of 16 × 16 pixels. An intracoded or I-macroblock is predicted from spatially neighboring samples of previously coded blocks in the same frame. On the other hand, most MBs are predicted using references to one or more previously coded block(s) of pixels in past frames or a combination of past and future frames. A bi-predictive or B-macroblock may use past and/or future frames as references, while a predictive or P-macroblock may use past frames only. These are collectively called intercoded MBs, which require each referred block to be indicated using a MV with the corresponding reference frame index. Additionally, the difference between the predicted and the target MB, i.e., the prediction error or residual, is transformed, quantized, and entropy coded. Thus, the MVs as well as the residual component, bearing complementary information about the same MB, hold the key to effective feature vector representation.

For network surveillance, the profile most commonly used for encoding is the Baseline. It claims minimal computational resources and has very low latency, ideal for live surveillance feeds on low-end devices. The profile is a simplified implementation of the H.264 standard based only on I- and P-frames. I-frames are coded using I-macroblocks only, and require more bits to encode than any other frame type. P-frames, which are mainly coded using P-macroblocks, may occasionally contain I-macroblocks, particularly in areas of complex motion. A video sequence necessarily starts with an I-frame, but it mostly consists of P-frames, with I-frames spaced at regular intervals. The number of I-frames in a coded video is negligible; hence, they are not considered in the proposed method.

B. Feature #1: Mean of Absolute Transformed Differences (MATD)

As the name indicates, MATD is the mean of the absolute values of the quantized Discrete Cosine Transform (DCT) coefficients in a coded MB. We formulate MATD using the statistical properties of DCT coefficients.

In the literature, it is found to be most appropriate and convenient to model the distribution of DCT coefficients by Laplacian distributions [13]–[15]. A Laplacian probability density function (pdf) is given by

p(z) = \frac{1}{2b} \exp\!\left(-\frac{|z|}{b}\right), \quad z \in \mathbb{R}   (1)

where b > 0 is the Laplacian parameter, which defines the width of the pdf, and z is the value of a given DCT coefficient.

In H.264 Baseline encoding, an integer implementation of the 4 × 4 block DCT operates on a similar-sized block of residual or target data. The resulting block is subsequently scaled and quantized element-wise by a quantization matrix. The quantization scheme most commonly adopted is uniform quantization with a dead-zone [16]. In this process, the value z of an input coefficient is, in general terms, quantized as

k = \left\lfloor \frac{|z| + f}{Q} \right\rfloor \operatorname{sgn}(z)   (2)
TABLE I
Scalar Multiplier q0

Fig. 1. Relation between the input coefficient z and the reconstructed output z_k for quantization step size Q and rounding offset f.

where k ∈ Z represents the quantization level or index that is actually transmitted by the source encoder, Q > 0 is the uniform quantization step size outside the dead-zone interval, and f ∈ [0, Q/2] is the rounding offset that controls the width of the dead-zone. Typically, f is set to Q/6 for P-frames in an effort to approximate the distribution of DCT coefficients in a quantization interval. The decoder process reconstructs the quantized output z_k of the input coefficient z as

z_k = kQ.   (3)

Fig. 1 illustrates the relationship between z and z_k. Formally, with r = Q/(6b),

P(z_k) = \begin{cases} \int_{-\frac{5}{6}Q}^{\frac{5}{6}Q} p(z)\,dz = 1 - \exp(-5r), & \text{if } k = 0 \\ \int_{(k-\frac{1}{6})Q}^{(k+\frac{5}{6})Q} p(z)\,dz = \exp\!\big(-2r(1+3k)\big)\,\sinh 3r, & \text{if } k > 0 \end{cases}

and, by symmetry of the pdf, P(z_{-k}) = P(z_k) for k > 0, where

q(QP, n) = q_0\big(\operatorname{mod}(QP, 6),\, n\big)\, 2^{\lfloor QP/6 \rfloor}   (7)

q_0 being a scalar multiplier as defined in Table I [17]. It is noted in (6) that exactly one half of the transformed coefficients are quantized with a step size equal to q(QP, 2), and the remaining with q(QP, 0) and q(QP, 1) equally. Thus, r can take only three possible values, q(QP, 0)/(6b), q(QP, 1)/(6b), or q(QP, 2)/(6b), for given values of QP and b.

Given the fact that the quantized coefficients z_k of a MB (both I and P types) are entropy coded, the lower bound on the average bit rate (bits/coefficient) may be expressed as

H = -\sum_{k=-\infty}^{+\infty} P(z_k)\,\log_2 P(z_k).

Weighting this per-coefficient entropy over the 384 transformed coefficients of a MB according to the proportions noted above, the expected number of bits B spent on the MB becomes

B = 384\left[\tfrac{1}{4}\,H\!\left(\tfrac{q(QP,0)}{6b}\right) + \tfrac{1}{4}\,H\!\left(\tfrac{q(QP,1)}{6b}\right) + \tfrac{1}{2}\,H\!\left(\tfrac{q(QP,2)}{6b}\right)\right]   (10)

Substituting the values of q(QP, 0)/(6b) and its scalar multiples q(QP, 1)/(6b) and q(QP, 2)/(6b) in (10), we finally obtain MATD. Using the fact that B and QP can only assume non-negative integer values from a limited range, we construct a lookup table containing precomputed values of MATD (indexed by B and QP). This enables us to bypass the entropy decoding and dequantization of individual coefficients which would otherwise be required to compute MATD directly, i.e., the actual MATD. The correlation between the actual MATD and the predicted MATD is explored in Section V.
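Since B (the observed number of bits spent on a MB) and QP both take integer values from small ranges, the per-MB MATD prediction reduces to a single table access at decode time. The following is a minimal C sketch of such a lookup, assuming the table has been filled offline by numerically inverting (10); the table dimensions, the bit-count cap, and the names are illustrative assumptions rather than the actual implementation.

```c
#include <stdint.h>

#define MAX_QP   52     /* H.264 QP range is 0..51                          */
#define MAX_BITS 4096   /* assumed cap on coded bits per MB (illustrative)  */

/* Precomputed offline, e.g., by numerically inverting (10): for each
 * (QP, B) pair store the Laplacian-based MATD estimate. */
static float matd_lut[MAX_QP][MAX_BITS];

/* Predicted MATD for one macroblock: a single table access, so no entropy
 * decoding or dequantization of individual coefficients is needed. */
static inline float predicted_matd(int qp, int mb_bits)
{
    if (mb_bits >= MAX_BITS)
        mb_bits = MAX_BITS - 1;   /* clamp rare, very expensive MBs */
    return matd_lut[qp][mb_bits];
}
```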
C. Feature #2: Sum of Normalized Motion Vector Magnitudes (SNMVM)

SNMVM represents the motion-compensated information associated with the MVs of a MB. In H.264, two new coding characteristics are introduced in motion compensation: variable partition size and multiple reference frames. Each intercoded MB may be predicted using a range of block sizes. Accordingly, the MB is split into one, two, or four MB partitions using either
1) one 16×16 partition (covering the whole MB);
2) two 8×16 partitions;
3) two 16×8 partitions; or
4) four 8×8 partitions.
If the partition size is chosen as 8×8, then each 8×8 block, hereafter a sub-MB, is split into one, two, or four sub-MB partitions (either one 8×8, two 4×8, two 8×4, or four 4×4 sub-MB partitions). Each partition or sub-MB partition of a P-macroblock has a MV (mvxi, mvyi) pointing to an area of the same size in a reference frame, which is used to predict

{0, 1, . . . , T − 1} in raster-scan order starting with zero for the MB at the top-left-hand corner of a frame. A P-frame MB having frame number t at location idx is described by a vector

\vec{p}_{t,idx} = [x, y]^T   (11)

where the components x ≥ 0 and y ≥ 0 are the values of MATD and SNMVM, respectively, for the given MB.
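As an illustration of how the y component of (11) could be assembled from the variable-size partitions listed in Section III-C, the sketch below sums the MV magnitude of every partition or sub-MB partition after normalizing it. The normalization by the MV search range and the data layout are assumptions made for illustration only and are not the definition used in the paper.

```c
#include <math.h>

/* One motion partition of a P-macroblock (16x16 down to 4x4). */
typedef struct {
    int mvx, mvy;    /* motion vector components (quarter-pel units) */
    int w, h;        /* partition width and height in pixels         */
} MbPartition;

/* Illustrative SNMVM: sum of MV magnitudes over all (sub-)partitions of a
 * MB, each normalized by the configured search range. The choice of
 * normalizer is an assumption; the paper defines its own normalization. */
float snmvm(const MbPartition *part, int num_parts, float search_range)
{
    float sum = 0.0f;
    for (int i = 0; i < num_parts; i++) {
        float mag = sqrtf((float)(part[i].mvx * part[i].mvx +
                                  part[i].mvy * part[i].mvy));
        sum += mag / search_range;   /* normalized MV magnitude */
    }
    return sum;
}
```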
E. Initialization and Incremental Update of Covariance

In the proposed approach, static and dynamic components present at different locations in a scene background are modeled with a temporally weighted local covariance Σ_idx, ∀idx ∈ {0, 1, . . . , T − 1}. Feature vectors computed for each idx are stored in T corresponding first-in first-out (FIFO) buffers, each having a predefined capacity M. A new MB vector is inserted at the rear. However, if the buffer is full, the vector at the front, being the least relevant in the temporal order of vectors in the buffer, is removed to make room for the new one. Corresponding to each position j ∈ {1 (front), 2, 3, . . . , M (rear)} in a buffer containing a vector, a predefined weight W_j = j/M is assigned in order of temporal relevance. Without loss of generality, let us assume that buffer[idx] is in a state in which all n vectors from the sequence {[x_i, y_i]^T}_{i=1}^{n} have been inserted, with the nth vector currently at the rear. If n > M, this would have resulted in the removal of the previous (n − M) vectors. Let σ_x² and σ_y² denote the respective weighted variances of {x_i}_{i=max(1,n−M+1)}^{n} and {y_i}_{i=max(1,n−M+1)}^{n} currently accumulated in the buffer. Also, let σ_xy denote the covariance between the same sets of values. Accordingly, the required covariance matrix Σ_idx is expressed as in

\Sigma_{idx} = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}.   (12)

Weighted variance σ_x² is defined as

\sigma_x^2 = \overline{x^2} - \left(\overline{x}\right)^2   (13)

where

\overline{x} = \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}\,x_i \Big/ \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}   (14)

and

\overline{x^2} = \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}\,x_i^2 \Big/ \sum_{i=\max(1,\,n-M+1)}^{n} W_{M-n+i}.   (15)

Similarly, we have the corresponding weighted quantities for {y_i} and the cross term, where (20) may be obtained from (21) by substituting for (a, b) each of the ordered pairs (2, 0), (0, 1), (0, 2), and (1, 1), respectively. The update procedures of x̄, x̄², ȳ, ȳ², and x̄y are followed by an adjustment of [W]_n as

[W]_n = \begin{cases} 1, & \text{if } n = 1 \\ [W]_{n-1} + W_{M-n+1}, & \text{if } 1 < n \le M \\ [W]_{n-1}, & \text{if } n > M. \end{cases}   (22)

The overall process of updating Σ_idx requires no more than 14 multiplications, ten divisions, and 21 additions.
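To make the bookkeeping of (12)–(15) concrete, the following C sketch evaluates the weighted covariance directly from the current FIFO contents. It is a straightforward O(M) reference computation under an assumed buffer capacity; the paper's own update instead maintains these statistics incrementally in the constant number of operations quoted above. The struct layout and names are illustrative.

```c
#define M_CAP 16                     /* assumed FIFO capacity M (illustrative) */

typedef struct { float x, y; } FeatVec;   /* x = MATD, y = SNMVM */

/* FIFO buffer for one MB location idx. vec[0] is the front (oldest). */
typedef struct {
    FeatVec vec[M_CAP];
    int     count;                   /* number of vectors currently held (<= M) */
} MbBuffer;

/* Push a new feature vector, evicting the front element when full. */
void buffer_push(MbBuffer *b, FeatVec v)
{
    if (b->count == M_CAP) {
        for (int i = 1; i < M_CAP; i++)    /* drop oldest, shift the rest */
            b->vec[i - 1] = b->vec[i];
        b->count--;
    }
    b->vec[b->count++] = v;
}

/* Reference (non-incremental) evaluation of the weighted covariance (12),
 * using weights W_j = j/M assigned from front to rear as in Section III-E. */
void weighted_cov(const MbBuffer *b, float *var_x, float *var_y, float *cov_xy)
{
    if (b->count == 0) { *var_x = *var_y = *cov_xy = 0.0f; return; }

    float wsum = 0, sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < b->count; i++) {
        /* element i sits at buffer position j = M - count + i + 1 */
        float w = (float)(M_CAP - b->count + i + 1) / M_CAP;
        float x = b->vec[i].x, y = b->vec[i].y;
        wsum += w;
        sx  += w * x;       sy  += w * y;
        sxx += w * x * x;   syy += w * y * y;   sxy += w * x * y;
    }
    float mx = sx / wsum, my = sy / wsum;      /* weighted means, as in (14) */
    *var_x  = sxx / wsum - mx * mx;            /* sigma_x^2, as in (13)      */
    *var_y  = syy / wsum - my * my;            /* sigma_y^2                  */
    *cov_xy = sxy / wsum - mx * my;            /* sigma_xy                   */
}
```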
Fig. 7. Each row shows one example from each category. The categories are
from top to bottom: baseline (highway), camera jitter (boulevard), dynamic
background (fountain01), intermittent object motion (sofa), shadow (cubicle),
and thermal (library).
TABLE II
Quantitative Evaluation (Results of all categories combined)
containing parts of slow-moving objects often go undetected. All MBs in a frame corresponding to which τ1[idx] = 0 and √(MD/T) ∈ (0.02, 0.12) represent the background. Pixels constituting such MBs are used to update the corresponding pixels B_t(x, y) in the existing background using (24). The number of such MBs in a given frame, say β, is practically very small.

B_{t+1}(x, y) = \alpha I_t(x, y) + (1 - \alpha)\,B_t(x, y) \quad \text{for } t > N   (24)

where I_t(x, y) is the pixel's intensity value for frame t, and α = 0.08 is a predefined learning rate that determines the tradeoff between stability and quick update.
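The background maintenance in (24) is a per-pixel running average over the 16 × 16 area of each MB flagged as background. A minimal sketch is given below; the buffer types and frame layout are assumptions for illustration.

```c
#include <stdint.h>

#define ALPHA 0.08f     /* learning rate from (24) */

/* Running-average update (24) of the background model over one MB.
 * 'bg' holds the background model as floats; 'cur' is the decoded frame.
 * (mb_x, mb_y) is the top-left pixel of a 16x16 MB flagged as background. */
void update_background_mb(float *bg, const uint8_t *cur,
                          int stride, int mb_x, int mb_y)
{
    for (int y = mb_y; y < mb_y + 16; y++)
        for (int x = mb_x; x < mb_x + 16; x++) {
            int idx = y * stride + x;
            bg[idx] = ALPHA * (float)cur[idx] + (1.0f - ALPHA) * bg[idx];
        }
}
```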
V. Experimental Results and Discussion

Following the discussion in Section III-B, we provide a statistical comparison of the actual MATD against the predicted MATD values of all P-frame MBs selected at regular intervals from the Traffic sequence. The sequence was encoded in VBR as well as CBR modes, as reported in Fig. 6. The values of actual MATD are found to be greater than the corresponding predicted MATD values, owing to the fact that the latter is modeled using the entropy criterion, which is the theoretical lower bound on the average bitrate. It is observed that the actual MATD and the predicted MATD values are very highly correlated (correlation coefficient ρ > 0.98). In (23), we used the Mahalanobis distance, which is invariant under arbitrary nonsingular linear transformations of the form shown in Fig. 6 (with regression equations). Thus, predicted MATD qualifies as a convenient surrogate for actual MATD insofar as the discriminative aspect of MATD is concerned.

The proposed algorithm was implemented in C and integrated into the H.264/AVC MB decoding module of Libavcodec, an open-source audio/video codec library developed as a part of the FFmpeg [18] project. We evaluate our approach on the entire benchmark dataset provided for the Change Detection Challenge 2012 [19]. As the proposed method uses pixel information to achieve pixel-level accuracy in the final stage of segmentation, we consider it fair and obvious to compare our results with those of proven state-of-the-art (SoA) pixel-based approaches [20]–[24]. All 31 sequences of the dataset were encoded in: 1) VBR with a fixed QP = 25 for all MBs; and 2) CBR with a target bit rate of 1024 kb/s. The encoder configuration was set as follows: Baseline profile with YCbCr 4:2:0 (progressive format) 8-bit chroma subsampling, GOP size varying in [1, 250], rate-distortion optimization enabled, and a MV range (using hexagon-based search) of [−16, 16] × [−16, 16] with three reference frames. The rate of decoding frames was fixed at 25 frames per second (fps).

The background subtraction masks of the proposed method for a few selected frames are shown in Fig. 7 to enable qualitative evaluation against the specified ground-truth masks. For quantitative evaluation, a set of seven evaluation metrics defined in [19], namely recall, specificity, false-positive rate (FPR), false-negative rate (FNR), percentage of bad classification (PBC), F-measure, and precision, has been used together with the average processing speed. An exhaustive comparison of the proposed method with those of [20]–[24] (applied on the original input frames prior to encoding, with the default parameters defined in each work) is summarized in Table II. Subscripts indicate the rank of the corresponding figures in the indicated evaluation category. It is noticed that rankings based on FPR = (1 − specificity) and FNR = (1 − recall) are identical to those based on specificity and recall, respectively; hence, all evaluation metrics except FPR and FNR were given equal weightage for the computation of the average rank.

Prior to empirical evaluation, it is important to realize that every codec can deliver a varying degree of output video quality for a given set of input frames. Any degradation of visual data introduced by "lossy" compression will inevitably remain visible through any further processing of the content. Notwithstanding the encoding options, which considerably affect the performance of a compressed-domain algorithm, the proposed method delivers better overall performance even when pitted against SoA pixel-based techniques.

Background segmentation is but one component of a potentially complex computer vision system. Therefore, in addition to being accurate, a successful technique must consume as few CPU cycles and as little memory as possible. An algorithm that segments perfectly but is computationally expensive is useless, because insufficient processing resources will remain to do anything useful with its results in real time. The most notable aspect, in this regard, is the comparison of average processing speeds in Table II. The computing speeds were recorded for videos with 720 × 420 resolution on a personal computer powered by an Intel Core i7-2600 3.40 GHz CPU with 16 GB RAM (no dedicated hardware or graphics processing unit was used). It is evident that the proposed method runs significantly faster than any of the reported SoA techniques.
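The seven metrics are simple functions of the per-pixel confusion counts accumulated over a category. A small helper of the following form is enough to compute them from TP/FP/TN/FN totals, following the standard changedetection.net-style definitions; the struct and function names here are illustrative, not taken from [19].

```c
typedef struct {
    double tp, fp, tn, fn;    /* per-pixel confusion counts */
} Confusion;

typedef struct {
    double recall, specificity, fpr, fnr, pbc, precision, fmeasure;
} Metrics;

/* Evaluation metrics from confusion counts (changedetection.net style). */
Metrics compute_metrics(Confusion c)
{
    Metrics m;
    m.recall      = c.tp / (c.tp + c.fn);
    m.specificity = c.tn / (c.tn + c.fp);
    m.fpr         = c.fp / (c.fp + c.tn);               /* = 1 - specificity */
    m.fnr         = c.fn / (c.tp + c.fn);                /* = 1 - recall      */
    m.pbc         = 100.0 * (c.fn + c.fp) / (c.tp + c.fn + c.fp + c.tn);
    m.precision   = c.tp / (c.tp + c.fp);
    m.fmeasure    = 2.0 * m.precision * m.recall / (m.precision + m.recall);
    return m;
}
```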
The computation per MB involved in each step of the proposed method (described in Section III) costs up to a constant factor. Consequently, the complexity of the overall process is

O\!\left(T + c_1\beta + c_2\sum_{idx=0}^{T-1}\tau_1[idx]\right)

where \sum_{idx=0}^{T-1}\tau_1[idx] denotes the number of candidate MBs that require pixel-level processing, T is the total number of MBs per frame, and c_1, c_2 are constants. Arguably, the running time scales linearly with T, incurring only a negligible overhead \kappa = c_1\beta + c_2\sum_{idx=0}^{T-1}\tau_1[idx] \le T in addition to the regular decoding cost claimed by each frame.

VI. Conclusion

We introduced a novel approach for background subtraction on videos encoded in the Baseline profile of H.264/AVC. The proposed method is aptly built for real-time network streaming applications in consideration of variable/constant bit rate options under practical bandwidth constraints. It also proved to be robust across a diverse set of real-world (non-synthetic) surveillance sequences.

Acknowledgment

The authors would like to thank the anonymous reviewers and the associate editor for their valuable comments that significantly improved the quality of this paper.

References

[14] E. Y. Lam and J. W. Goodman, "A mathematical analysis of the DCT coefficient distributions for images," IEEE Trans. Image Process., vol. 9, no. 10, pp. 1661–1666, Oct. 2000.
[15] W. Wu and B. Song, "DC coefficient distributions for P-frames in H.264/AVC," ETRI J., vol. 33, no. 5, pp. 814–817, Oct. 2011.
[16] G. J. Sullivan and S. Sun, "On dead-zone plus uniform threshold scalar quantization," in Proc. SPIE Vis. Commun. Image Process., vol. 5960, no. 2, Jul. 2005, pp. 1041–1052.
[17] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd ed. London, U.K.: Wiley, 2010, p. 191.
[18] F. Bellard. (2002, Apr. 26). FFmpeg [Online]. Available: http://ffmpeg.org/
[19] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, "Changedetection.net: A new change detection benchmark dataset," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 1–8.
[20] L. Maddalena and A. Petrosino, "The SOBS algorithm: What are the limits?," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 21–26.
[21] O. Barnich and M. Van Droogenbroeck, "ViBe: A universal background subtraction algorithm for video sequences," IEEE Trans. Image Process., vol. 20, no. 6, pp. 1709–1724, Jun. 2011.
[22] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," in Proc. 2nd Eur. Workshop Adv. Video-Based Surveillance Syst., 2001, pp. 149–158.
[23] A. Elgammal, D. Harwood, and L. S. Davis, "Non-parametric model for background subtraction," in Proc. 6th Eur. Conf. Comput. Vis., 2000, pp. 751–767.
[24] M. Hofmann, P. Tiefenbacher, and G. Rigoll, "Background segmentation with feedback: The pixel-based adaptive segmenter," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 38–43.