Augusto Destrero
Francesca Odone
Alessandro Verri
DISI - Università di Genova
Via Dodecaneso 35 - I-16146 Genova, Italy
{ destrero,odone,verri } @ disi.unige.it
June 10, 2007
Abstract
We describe a trainable system for face detection and tracking. The system is structured around multiple cues
that discard non-face areas as early as possible: we combine motion, skin, and face detection. The latter is the core
of our system and consists of a hierarchy of small SVM classifiers built on the output of an automatic feature selection
procedure. Our feature selection is entirely data-driven and
allows us to obtain powerful descriptions from a relatively
small set of data. Finally, a Kalman tracker on the face
region optimizes detection results over time. We present an
experimental analysis of the face detection module and results obtained with the whole system on the specific task of
counting people entering the scene.
1. Introduction
In this paper we describe a full system that we designed and
implemented for real-time face detection. The efficiency
is guaranteed by a coarse-to-fine multiple cue structure that
aims at discarding a non-face area as soon as there is enough
evidence against it. The building blocks of the system are
a motion detector, a skin detector, and a feature-based face detector implemented as a cascade of SVM classifiers. Finally, a Kalman tracker is applied to minimize operations
over time. Figure 1 sketches the various processing phases
on a given frame. It also shows the scenario monitored by
our system: a corridor of our department, busy during office
hours, illuminated by both natural and neon light. The various blocks are independent and may be switched off if
the conditions they require are not met. The hardware of the
system is a standard PC equipped with a frame grabber and
a color video-surveillance camera.
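
The coarse-to-fine idea can be illustrated with a minimal sketch, assuming each cue is modeled as a boolean test on a candidate region. The names `Box` and `surviving_regions` and the box layout are hypothetical and do not come from the system itself; the point is only that a region rejected by a cheap cue never reaches a more expensive one, and that a cue can be switched off by leaving it out of the list.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed layout for a candidate region: (x, y, width, height).
Box = Tuple[int, int, int, int]


def surviving_regions(regions: Iterable[Box],
                      stages: List[Callable[[Box], bool]]) -> List[Box]:
    """Keep only the regions accepted by every enabled stage, in order.

    `stages` is ordered from cheapest to most expensive (for example
    motion, then skin, then the face classifier). An empty list means
    all cues are switched off and every region is kept.
    """
    kept = []
    for region in regions:
        # all() short-circuits: once a stage rejects the region,
        # the remaining (costlier) stages are never evaluated on it.
        if all(stage(region) for stage in stages):
            kept.append(region)
    return kept
```

Ordering the stages by increasing cost is what makes the early rejection pay off: most regions are discarded before the expensive classifier is ever called.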
Our motion detection is based on an incremental background construction method that we use to initialize (or
re-initialize) a model for the background and to keep it updated as time passes. The background initialization method
is similar to the one proposed in [2]. As for skin detection
we implement the method proposed in [7], with thresholds
Authorized licensed use limited to: Universiti Malaysia Perlis. Downloaded on February 18, 2009 at 02:21 from IEEE Xplore. Restrictions apply.
[Figure: rectangle feature templates of types 2v, 3v, 2h, and 3h.]
\[
D(f_i, f_j) =
\begin{cases}
[\ldots] & \text{if } f_i \in F_k,\ f_j \in F_l,\quad k, l \in \{2h, 2v, 3h, 3v, 4\},\ k \neq l \\
d(f_i, f_j) & \text{otherwise}
\end{cases}
\tag{5}
\]
($d(f_i, f_j)$ is the sum of the Euclidean distances between
corresponding corners of the rectangle support), or (b) features that are close but appear uncorrelated (for this purpose
we use Spearman's correlation test [10]). We call the final set of features $S_{out}$.
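
The distance in (5) can be sketched as follows. This is a minimal illustration and not the authors' code: the `Feature` record and its field names are hypothetical, and the value returned for features of different classes is not recoverable from the text, so `math.inf` is used here purely as an assumption and is exposed as a parameter.

```python
import math
from typing import NamedTuple


class Feature(NamedTuple):
    kind: str   # feature class label, e.g. "2h", "2v", "3h", "3v", "4"
    x: float    # top-left corner of the rectangular support
    y: float
    w: float    # width of the support
    h: float    # height of the support


def corners(f: Feature):
    """Four corners of the support, always listed in the same order."""
    return [(f.x, f.y), (f.x + f.w, f.y),
            (f.x, f.y + f.h), (f.x + f.w, f.y + f.h)]


def d(fi: Feature, fj: Feature) -> float:
    """Sum of the Euclidean distances between corresponding corners."""
    return sum(math.dist(p, q) for p, q in zip(corners(fi), corners(fj)))


def D(fi: Feature, fj: Feature, cross_class_value: float = math.inf) -> float:
    """Distance in the spirit of (5).

    ASSUMPTION: the value for features of different classes is missing
    from the text; infinity is only a placeholder choice.
    """
    if fi.kind != fj.kind:
        return cross_class_value
    return d(fi, fj)
```

For two features of the same class whose supports differ by a translation of (3, 4), every corner moves by 5, so `d` returns 20.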
The set $S_{out}$ is used to set up a coarse-to-fine cascade of
classifiers. Each classifier is built by extracting from $S_{out}$
small subsets of at least 3 distant features that are able to
reach a fixed target performance on a validation set. We
start with 3 mutually distant features (according to (5)) and
add further features until the target performance is reached
on a validation set with a linear SVM.
The target performance is chosen so that each weak classifier
is unlikely to miss positive occurrences: we therefore set
the minimum hit rate to 99.5% and the maximum false positive rate to 50%. These modest targets allow us to achieve
good performance with the global classifier, since the global
performance of the cascade is computed according to
the following [15]:
\[
H = \prod_{i=1}^{K} h_i
\qquad \text{and} \qquad
F = \prod_{i=1}^{K} f_i
\]
where $h_i$ is the hit rate and $f_i$ is the false positive rate of
each weak classifier $i$. In our case, assuming a cascade of
10 weak classifiers, we get $H = 0.995^{10} \approx 0.95$ and
$F = 0.5^{10} \approx 10^{-3}$.
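
The product rule for the cascade can be checked numerically with a short sketch. The function name `cascade_rates` is hypothetical; the per-layer targets are the ones stated above (hit rate 99.5%, false positive rate 50%, 10 layers).

```python
import math


def cascade_rates(hit_rates, fp_rates):
    """Return (H, F): the products of per-layer hit and false positive rates."""
    if len(hit_rates) != len(fp_rates):
        raise ValueError("expected one hit rate and one false positive rate per layer")
    return math.prod(hit_rates), math.prod(fp_rates)


# Ten layers, each exactly at the per-layer targets.
H, F = cascade_rates([0.995] * 10, [0.5] * 10)
print(f"H = {H:.4f}, F = {F:.2e}")  # prints: H = 0.9511, F = 9.77e-04
```

Because both quantities are products, the false positive rate falls quickly with depth while the hit rate degrades slowly, provided every layer keeps its own hit rate very close to 1.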
At the end of the cascade design we test the system live
and store the detected objects. We then analyze the results,
checking whether they meet our needs in terms of performance. If not, we repeat the second and third stages of feature selection on a new set of negative examples, made of
all the false positives detected by the system (similar to the
bootstrapping procedure described in [12]); the features we
obtain are more specialized in discriminating between positive and difficult negative examples. Further layers derived
from this refinement are added at the bottom of the cascade.
of the spatial correlation of the image, while the number of detections in the surroundings of a false positive is often
lower, so we discard regions that accumulated fewer than a
fixed number of hits (4 in our experiments). Moreover, it is
reasonable to keep only one representative for each group of overlapping detections; we do so simply by partitioning the set
of detections into disjoint subsets and keeping the average
bounding box of each subset.
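
The grouping step can be sketched as follows, under stated assumptions: boxes are given as corner coordinates, two detections belong to the same subset when their boxes intersect (directly or through a chain of intersecting boxes), and a subset is kept when it contains at least `min_hits` detections. The overlap criterion and the names `overlaps` and `merge_detections` are illustrative choices, not details taken from the text.

```python
from typing import Dict, List, Tuple

# Assumed layout: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_detections(boxes: List[Box], min_hits: int = 4) -> List[Box]:
    """Partition detections into overlapping groups and average each group.

    Groups with fewer than `min_hits` detections are discarded as
    probable false positives.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    merged = []
    for group in groups.values():
        if len(group) >= min_hits:
            n = len(group)
            merged.append(tuple(sum(b[k] for b in group) / n for k in range(4)))
    return merged
```

The pairwise overlap test is quadratic in the number of detections, which is acceptable for the handful of hits that survive the cascade on a single frame.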
4. Experiments
The methodology described in Section 3, which is entirely
data-driven, has been applied to both face and eye detection
problems. In this section we first report experiments that
confirm the appropriateness of our object detection method
on faces, then we describe the experimental analysis of the
whole system on a face tracking and people counting problem, and finally we present some experiments made with our
system trained to detect eyes.
Figure 5: Sample shots from the system interface, acquired (from top left) at times 11:00:58, 11:44:17, 13:15:13,
13:55:42, 14:25:23, 14:25:25.
obtain applying the cascade to our test set is: 0.1% false
positives and 94.0% hit rate.
due to changes of the background. Once a face is first located in a video, we track it with a Kalman tracker (whose
state system models position and velocity, while the measurement system models the evolution of the position over time)
so as to build the trajectory of the face and count it just once.
The tracking module also allows us to evaluate the stability
of detected faces over time, discarding the ones that survive for only a few frames; more precisely, we set a minimum number of frames in which a face has to be tracked
to be considered stable. This increases the detection performance, since false positives are often less stable than true
ones. The overall performance with respect to the number of people that crossed the scene over the 5 hours was
16% false positives and a hit rate of 84%. The results
are very encouraging, considering that the corridor activity
was entirely out of our control, and it included abrupt scene
changes (due to illumination or to doors being opened or closed),
people standing in the corridor for unpredictable lengths of time, people reading or putting their jumper on while walking, people
using the telephone and thus occluding part of the face, and so
forth (see Fig. 5). Fig. 6 shows the difference between the
number of real faces and the detected faces: positive values
indicate misses, negative values false positives. To make
the figure more readable we concatenated the temporal ranges
where motion was detected; the x-axis therefore does not show
a real flow of time, but a timestamp running over the various
video shots. A posteriori inspection of a subset of the video
shots allowed us to estimate the performance of the detector
with respect to the analyzed patches: a false positive rate of $4 \times 10^{-7}$ and a hit rate of about 67%. The number of analyzed
patches in an average frame is about 20000.
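
A tracker of the kind described above, with a state holding position and velocity and a measurement of position only, can be sketched for a single coordinate as follows. This is a minimal constant-velocity Kalman filter and not the system's implementation: the class name, the noise values `q` and `r`, and the initial covariance are assumptions chosen for illustration. A face position would use one such filter per image coordinate.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate: state (position, velocity), measured position."""

    def __init__(self, position: float, q: float = 1e-2, r: float = 1.0):
        self.x = [float(position), 0.0]      # state estimate [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # state covariance (assumed initial value)
        self.q = q                           # process noise variance (assumed)
        self.r = r                           # measurement noise variance (assumed)

    def predict(self, dt: float = 1.0) -> float:
        """Propagate the state one step ahead and return the predicted position."""
        p, v = self.x
        self.x = [p + dt * v, v]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.x[0]

    def update(self, z: float) -> float:
        """Correct the state with a measured position z and return the new position."""
        (a, b), (c, d) = self.P
        s = a + self.r              # innovation variance, with H = [1, 0]
        k0, k1 = a / s, c / s       # Kalman gain
        y = z - self.x[0]           # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.x[0]
```

In use, `predict` is called once per frame and `update` only when the detector returns a position for that face; the number of frames with a successful update is one possible way to implement the stability count mentioned above.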
References
[4] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity
constraint. Comm. on Pure Appl. Math., 57, 2004.