
P-FAD: Real-time Face Detection Scheme on Embedded Smart Cameras

Chapter 1
INTRODUCTION
Embedded smart cameras have driven a dramatic shift towards distributed surveillance
systems by combining sensing, processing and communication on a single platform. A critical
issue in embedded smart cameras is their limited resources, which makes designing fast and
efficient vision algorithms very challenging. It is therefore important to consider a vision
algorithm's efficiency, memory requirements and portability to an embedded processor during
algorithm design.
Face detection has been one of the most studied topics in the computer vision literature,
and is the stepping stone to all facial analysis algorithms. As a fundamental computer vision
problem, the goal of face detection is, given an arbitrary image, to determine whether or not
there are any faces in the image and, if present, to return the image location and extent of
each face.
Most face detection approaches can be categorized as knowledge-based, feature-based,
template-based or appearance-based methods. However, little attention has been paid to the
algorithms' processing-time efficiency or to meeting real-time requirements in
resource-limited applications. For example, Hsu et al. propose a face detection method for
colour images that needs 540 seconds to process a 640x480 image on a 1.7 GHz CPU. Rowley et al.
present a neural network-based upright frontal face detection system that takes approximately
383 seconds to process a 320x240 image. The Viola-Jones detector, widely regarded as the most
successful and fastest, can process 384x288 images at 15 FPS (frames per second) on a
conventional desktop and at 2 FPS on a low-power 200 MIPS StrongARM processor. Nevertheless,
that image size is too small to be preferable (640x480 resolution is commonly used), and 2 FPS
on a StrongARM is unacceptable for a real-time application based on embedded smart cameras.
To obtain real-time performance on video streams, several optimized face detectors have
appeared. Most of them use the Viola-Jones face detector and optimize the software and/or
hardware implementation to improve system performance. Table 1.1 summarizes some embedded
system-oriented implementations.
Dept. of Electronics and Communication Engineering, MBCET.


TABLE 1.1
IMPLEMENTATION OF VIOLA JONES DETECTOR


Chapter 2
DESIGN GOALS
2.1 PROBLEM DEFINITION
In traditional algorithm design, more attention is paid to detection accuracy than to
processing efficiency or resource-limited conditions. On the other hand, some of the
hardware- and/or software-optimized implementations in Table 1.1 achieve moderate FPS.
However, all of these implementation platforms are ASICs, DSPs and FPGAs, which are highly
specialised and customised processors. In a real smart camera network application, every
camera mote may have its task change as the situation varies. So, our goal was to design a
light-weight face detector for embedded smart cameras with a general-purpose processor, one
that consumes few resources and achieves real-time, acceptable detection performance.

2.2 PROPOSED SOLUTION


This work is based on the observation that computation and storage overhead in image
processing increase in proportion to the amount of pixel manipulation. A natural approach is to
construct a hierarchical scheme: identify face candidates with little computation and
manipulation on the full image, and then elicit the true faces from the candidates with a
reliable algorithm. The key challenge in the hierarchical scheme is how to construct a
multi-layer architecture in which the complex processing is separated from the pixel
manipulation while detection accuracy is still guaranteed. This problem is solved by proposing
Pyramid-like Face Detection (P-FAD), which consists of five layers whose operating units
decrease dramatically from top to bottom while the operations on every unit increase gradually.
P-FAD addresses this challenge using a 3-stage coarse, shift and refine process. P-FAD first
applies coarse operations to every pixel for skin detection. This step is extremely efficient
without losing robustness to the changing environment. P-FAD then makes a shift between
operating units: layers 2 to 4 shift the operation from pixel manipulation to contour points,
grouped regions and face candidates. Finally, P-FAD uses a modified Viola-Jones detector to
refine the final results. The scheme was implemented both on a notebook and on an embedded
smart camera platform.
Experimental results demonstrate P-FAD's resource-aware properties: it can process a VGA
image in just 7.23 ms on a notebook and 28.3 ms on a light-weight embedded smart camera
(roughly 138 and 35 frames per second, respectively) while still maintaining acceptable
detection accuracy compared to the OpenCV implementation of the Viola-Jones Haar detector.
Moreover, P-FAD is not customised or optimised for any given hardware platform, so its
resource-aware properties carry over to other general-purpose smart camera platforms.


Chapter 3
PYRAMID-LIKE FACE DETECTION SCHEME
In this chapter, the hierarchical framework for face detection on embedded smart cameras is
briefly introduced, followed by the challenging issues in constructing the hierarchical
scheme. P-FAD is a hierarchical detection scheme. More specifically, as shown in Fig. 3.1,
P-FAD consists of five layers: skin detection, contour point detection, dynamic group, region
merge and filter, and Haar face detection. P-FAD first uses a relatively coarse skin detection
to find skin pixels; then, through contour point detection, dynamic group, and region merge
and filter, P-FAD shifts the operating unit from pixels to contour points, regions and face
candidates; finally, the results are refined by Haar face detection. The hierarchical
detection scheme is tailored to achieve real-time detection with low computation and storage
overhead: the operating units decrease dramatically from top to bottom while the operations on
each unit increase. It keeps pixel manipulation to a minimum, which significantly cuts the time
cost, and guarantees detection accuracy through the later, more complex processing. Thus,
P-FAD has an inverted-pyramid appearance in terms of the number of operating units in each
layer, while the processing complexity increases with a pyramid-like shape. To achieve low
overhead and high detection accuracy, the following critical issues must be addressed in P-FAD:
Derive efficient detection regions while operating on full images as little as possible.
Achieve a robust detector with high accuracy by considering the changing environment, such as
illumination and different individuals.


Fig 3.1. Pyramid-like architecture. MA is short for memory access operation; NI means normal instruction.


Chapter 4
LAYERS OF PYRAMID LIKE FACE DETECTION SCHEME
4.1 SKIN DETECTION
In this implementation of the scheme, there was no interface to scale the graphics engine
frequency, so the only metric that came into play is the processor frequency. Additional knobs
would definitely have a more profound effect on the energy consumption as well as on peak
power. An algorithm for predicting the EM (such as a weighted average of the previous samples)
as an enhanced way of fine-tuning was not developed, but instead the failure of traditional
schemes in some scenarios where the proposed mechanism succeeds was showcased.

Skin detection is the first layer of P-FAD and the one that manipulates pixels. Because pixel
manipulation accounts for most of the processing time in image processing, a crucial issue in
skin detection is processing complexity. To reduce processing time significantly, the basic
design principle is to use a relatively coarse but highly time-saving skin detection. In P-FAD,
skin detection is based on skin-colour information, as skin colour provides computationally
effective yet robust information against rotation, scaling and partial occlusion. Further, the
skin colour is modelled in the CbCr subset of the YCbCr colour space. The CbCr subset can
eliminate the luminance effect and provides nearly the best performance among different colour
spaces. To classify a pixel as a skin pixel or non-skin pixel, we choose the widely used
Gaussian mixture model (GMM), which has relatively simple parameters without losing accuracy,
to represent the skin-colour distribution through its probability distribution function (PDF)
in the CbCr subspace, defined as:

$$p(x(t)) = \sum_{i} w_{i,t}\, \eta_i\big(x(t), \mu_{i,t}, \Sigma_{i,t}\big) \qquad (1)$$


where t denotes the frame index, $x(t)$ is a two-dimensional colour vector in the CbCr subspace,
and $\eta_i(x(t), \mu_{i,t}, \Sigma_{i,t})$ is the i-th single Gaussian model (SGM) component,
which contributes to the mixture model with weight $w_{i,t}$. The SGM is an elliptical
(two-dimensional) Gaussian joint probability distribution function, determined by its mean
vector $\mu_{i,t}$ and covariance matrix $\Sigma_{i,t}$. Finally, the pixel with colour vector
$x(t)$ is judged to be a skin-colour pixel or not by comparing $p(x(t))$ with a predefined
threshold. The main difficulties in implementing the GMM in P-FAD are the following:
The fixed Gaussian parameters $\mu_{i,t}$, $\Sigma_{i,t}$ obtained by an offline training
procedure on a large face dataset are not robust to the changing environment.
The computation overhead of evaluating Eq. (1) on every pixel, to judge whether it is a
skin-colour pixel, is high.
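
To make the per-pixel cost of Eq. (1) concrete, the following minimal sketch evaluates a two-dimensional Gaussian mixture at one CbCr vector and compares it with a threshold. The function names, the tuple layout of a component, and any parameter values passed in are assumptions made for illustration; they are not taken from P-FAD.

```python
import math


def gaussian_2d(x, mean, cov):
    """Density of a 2-D Gaussian at x; cov is ((s11, s12), (s12, s22))."""
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance using the closed-form 2x2 inverse.
    m2 = (s22 * dx * dx - 2.0 * s12 * dx * dy + s11 * dy * dy) / det
    return math.exp(-0.5 * m2) / (2.0 * math.pi * math.sqrt(det))


def gmm_pdf(x, components):
    """Weighted sum over components, each given as (weight, mean, cov)."""
    return sum(w * gaussian_2d(x, mean, cov) for w, mean, cov in components)


def is_skin(x, components, threshold):
    """Hypothetical decision rule: skin if the mixture density exceeds threshold."""
    return gmm_pdf(x, components) > threshold
```

Every call costs one exponential, one square root and several multiplications per component, which illustrates why evaluating the full mixture on all pixels of a VGA frame is expensive on an embedded processor.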


FIG 4.1. FLOW CHART OF REGION FORMATION PROCEDURE


To solve these two problems, an adaptive GMM skin-colour detection algorithm with online
learning and a simplified judging criterion is used. The pseudo-code is given in Algorithm 1.
First, two sample sets (Sskin, Sfake) derived from the final output of P-FAD are used to train
the adaptive GMM online. The training speed has a different sign for the two sets, which
denotes learning and forgetting the current distribution information of the respective set
(line 5 to line 7). Then the parameters $\mu_{i,t}$, $\Sigma_{i,t}$, $w_{i,t}$ of each SGM are
updated based on the current learning parameter (line 9 to line 15). Note that the threshold
$2.5\,|\sigma_{i,t}|$ corresponds to the 95% confidence interval for confirming that the current
skin-colour distribution belongs to the given SGM (line 10). Moreover, an SGM is added to or
removed from the GMM if no SGM can approximately represent the current skin-colour
distribution (line 17 to line 22). Finally, a pixel is judged to be a skin-colour pixel or not
based on our simplified rectangle judgment.
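
The idea of learning from skin samples and forgetting fake samples can be sketched as one online update step for a single SGM. This is a generic exponential-forgetting update of the kind commonly used for adaptive Gaussian mixtures and is only a stand-in for Algorithm 1, whose exact steps are not reproduced in this report. The function names, the signed `rate` argument and the matching rule on the Mahalanobis distance are assumptions; renormalising the weights across components and adding or removing components are left out.

```python
def mahalanobis2(x, mean, cov):
    """Squared Mahalanobis distance for a 2x2 covariance ((s11, s12), (s12, s22))."""
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (s22 * dx * dx - 2.0 * s12 * dx * dy + s11 * dy * dy) / det


def update_component(x, weight, mean, cov, rate, match_threshold=2.5):
    """One online step for a single SGM (assumed update rule).

    rate > 0 learns from a skin sample; rate < 0 forgets a fake sample.
    The component is returned unchanged when x does not match it.
    """
    if mahalanobis2(x, mean, cov) > match_threshold ** 2:
        return weight, mean, cov
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Move the mean towards (or away from) the sample.
    new_mean = (mean[0] + rate * dx, mean[1] + rate * dy)
    (s11, s12), (_, s22) = cov
    # Blend the covariance with the sample's outer product.
    n11 = s11 + rate * (dx * dx - s11)
    n12 = s12 + rate * (dx * dy - s12)
    n22 = s22 + rate * (dy * dy - s22)
    new_weight = weight + rate * (1.0 - weight)
    return new_weight, new_mean, ((n11, n12), (n12, n22))
```

With a positive rate the component drifts towards the sample and gains weight; with a negative rate it drifts away and loses weight. A practical implementation would also have to keep the covariance positive definite when the rate is negative.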
The basic idea of the simplification is to approximate the equal-value boundary of $p(x(t))$
by a single ellipse and project this ellipse onto the two axes to get its minimum enclosing
rectangle. Specifically, we use Eq. (3) to calculate the mean values of the GMM's parameters
$\mu_{i,t}$ and $\Sigma_{i,t}$ as the approximate ellipse's position and shape, denoted
$\mu(t)$ and $\Sigma(t)$ respectively. Then we decompose the covariance matrix $\Sigma(t)$ of
the ellipse to get the ellipse's rotation angle, from the orthogonal matrix C, and the lengths
of its axes, so that we can use Eq. (4) to obtain the width and height of the ellipse's minimum
enclosing rectangle, denoted Wcr(t) and Wcb(t) respectively. The centre of the rectangle
coincides with that of the approximate ellipse.

Here D is the diagonal matrix of the decomposition, and d11, d22 are its diagonal elements.
Applying a threshold to these elements as our judgement gives the lengths of the approximate
ellipse's axes, and the ellipse's rotation angle can be easily computed from the orthogonal
matrix C.
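
The rectangle judgment can be sketched as follows. The weighted averaging in `mean_ellipse` is an assumed form of Eq. (3), and the scale factor `k = 2.5` is a hypothetical threshold borrowed from the matching threshold mentioned earlier; neither is confirmed by this report. The bounding-box formula itself is the standard one for a rotated ellipse with semi-axes a and b.

```python
import math


def mean_ellipse(components):
    """Weighted average of component means and covariances (assumed form of Eq. (3))."""
    total = sum(w for w, _, _ in components)
    mean = [0.0, 0.0]
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for w, m, c in components:
        for r in range(2):
            mean[r] += w * m[r] / total
            for s in range(2):
                cov[r][s] += w * c[r][s] / total
    return mean, cov


def enclosing_rectangle(mean, cov, k=2.5):
    """Axis-aligned bounding box of the ellipse (x-mean)^T cov^-1 (x-mean) = k^2."""
    s11, s12, s22 = cov[0][0], cov[0][1], cov[1][1]
    # Eigenvalues of the symmetric 2x2 covariance: the diagonal elements d11, d22.
    half_trace = 0.5 * (s11 + s22)
    root = math.sqrt((0.5 * (s11 - s22)) ** 2 + s12 * s12)
    d11, d22 = half_trace + root, half_trace - root
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)   # rotation of the major axis
    a, b = k * math.sqrt(d11), k * math.sqrt(d22)    # semi-axis lengths
    half_w = math.sqrt((a * math.cos(theta)) ** 2 + (b * math.sin(theta)) ** 2)
    half_h = math.sqrt((a * math.sin(theta)) ** 2 + (b * math.cos(theta)) ** 2)
    return (mean[0] - half_w, mean[0] + half_w, mean[1] - half_h, mean[1] + half_h)


def in_rectangle(x, rect):
    """Skin test with at most four comparisons, thanks to short-circuit evaluation."""
    return rect[0] <= x[0] <= rect[1] and rect[2] <= x[1] <= rect[3]
```

The rectangle only needs to be recomputed when the model is updated, so the per-pixel work shrinks to between one and four comparisons, which matches the N to 4N comparison instructions counted for Layer 1 in Chapter 5.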


4.2 CONTOUR POINT DETECTION, DYNAMIC GROUP, AND REGION MERGE


After skin detection, the next step in P-FAD is to shift the operation from pixel
manipulation to regions. To refine the process, P-FAD uses a 3-layer architecture, in which the
operating units progress from contour points through grouped regions to face candidates.
Correspondingly, the operations on each unit increase from several normal instructions (NIs)
through tens of NIs to hundreds of NIs. An AIMD-based (Additive Increase and Multiplicative
Decrease) contour point detection scheme and a dynamic group-based point classification method
are used for foreground detection on embedded smart cameras. Region merge and filter are then
used as an integrated region formation procedure, as shown in Fig. 4.1. The region merge step
merges the small regions, which are usually split by eyebrows or glasses, into a complete face
candidate; the filter step eliminates non-face regions using prior knowledge, such as a
height-width ratio ranging from 1.1 to 1.5 in our scheme. Table 4.1 compares the time
consumption of our proposed region formation with that of the traditional component-finding
approach. The implementation of the component-finding approach is taken from OpenCV 2.3 [9];
the mask size of the morphological filter is 3x3 and the number of iterations is 3, to obtain
good filter performance. The results show that our region formation method saves substantial
time compared to the traditional component-finding method both with (Condition I) and without
(Condition II) morphological filtering. Moreover, the time complexity of the integrated region
formation scheme adapts to the number of faces.
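
The merge and filter steps can be illustrated with a minimal sketch. Only the height-width ratio range of 1.1 to 1.5 comes from the text; the box layout `(x, y, width, height)`, the function names and the decision of which boxes to merge are assumptions for illustration.

```python
def merge_boxes(a, b):
    """Smallest box (x, y, width, height) that covers both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def filter_candidates(regions, min_ratio=1.1, max_ratio=1.5):
    """Keep regions whose height/width ratio lies in the face-like range."""
    kept = []
    for x, y, w, h in regions:
        if w > 0 and min_ratio <= h / w <= max_ratio:
            kept.append((x, y, w, h))
    return kept
```

For example, a forehead region `(10, 10, 50, 20)` and a lower-face region `(10, 35, 50, 40)` separated by the eyebrows merge into `(10, 10, 50, 65)`, whose ratio of 1.3 passes the filter, while neither part would pass on its own.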
TABLE 4.1.
TIME CONSUMPTION OF REGION FORMATION


4.3 MODIFIED HAAR DETECTOR


The above four layers in P-FAD may produce 0 to 5 face candidates in a frame. To verify
the final output, we choose the most successful and fastest detector, the Viola-Jones Haar
feature face detector, as the final layer. In a typical configuration, a VGA image produces
nearly 881484 sub-windows to be classified as face or non-face in the Viola-Jones detector.
However, only several hundred sub-windows remain after the above four layers in P-FAD. Thus,
the computation overhead of the Viola-Jones detector in P-FAD is reduced significantly.
Moreover, the full cascade structure of the Viola-Jones detector is largely unnecessary,
because P-FAD already provides early rejection in the above four layers. Passing a face
sub-window through the whole cascade is time-consuming. In the traditional situation, this
overhead is compensated by the large time saving from the early rejection of non-face
sub-windows. To determine which stage to start from in P-FAD, the cascade's time consumption
for different start stages is first modelled.

$$t = N_{reject}\, t_{reject} + N_{accept}\, t_{accept}$$

where $t = [t(1) \cdots t(n)]^T$, t(x) denotes the time consumption if the start stage is x, and
there are n stages in the cascade structure. Nreject and Naccept are the numbers of rejected
and accepted sub-windows in the whole cascade structure respectively. t_reject and t_accept are
the expected time consumption to reject and to accept a sub-window respectively.

Here pij in matrix P denotes the probability that a sub-window starting from the i-th stage
is rejected in stage j, and aij in matrix A is the sum of features from stage i to stage j. The
time consumption is assumed to be proportional (k times) to the number of processed features.
A diagonal operator maps a matrix to the vector whose elements are the elements on the
matrix's diagonal. Thus, the detector's time consumption is determined by two sets of
arguments: $F = [F(1) \cdots F(n)]^T$, where F(x) is the number of features in stage x; and
$P = [P(1) \cdots P(n)]^T$, where P(x) is the probability of rejecting a non-face sub-window
in stage x. From the OpenCV 2.3 baseline face detector we can get F, the cascade detector's
distribution of feature counts over the stages. Details can be seen in [9]. P(x) is assumed to
increase linearly from 50% to 99%, which is consistent with the cascade structure [11].
Practical scanning conditions are simulated using the above formula, and the detector is also
run on a 2.2 GHz notebook. Simulation and experimental results in Fig. 4.2 show that the time
cost function is concave in the start stage, so its minimum can only be attained at the first
or the last stage. In P-FAD, according to Fig. 4.2, the optimal start stage depends on the
ratio Nreject/Naccept. The overhead rate, (ttotal(1) - ttotal(n))/(ttotal(1) + ttotal(n)),
where ttotal(x) is the total time t(x) for start stage x, is defined as the effect of choosing
the first stage to start; a negative value means that the choice saves time. Fig. 4.3 shows
that when the ratio is extremely large, as in the traditional Viola-Jones detector situation,
the overhead rate is near -1, indicating that starting from the first stage undoubtedly saves
the most processing time. However, when the ratio is smaller than 12, the overhead rate is
positive.
Thus, in P-FAD, where the number of non-face sub-windows is small, starting from the last
stage is optimal in terms of time consumption. Based on the above discussion, we may implement
a modified Viola-Jones detector in P-FAD according to the online estimation of the ratio. Note
that the total number of sub-windows is determined by the scan strategy before classification,
so we only need to estimate Naccept. In our implementation, Naccept is simply assumed to be
approximately equal to the number of face candidates, which is reasonable given the
statistical data in Table 4.2. Obviously, further work could be done to improve the
estimation's accuracy.
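
The time cost model can be explored with a small simulation. The sketch below assumes that a non-face sub-window is rejected at stage j with the stage's rejection probability, conditional on having survived the earlier stages, and that a non-face which survives every stage costs as much as a face; both are modelling assumptions. The ten-stage feature list is hypothetical and merely stands in for the feature counts of a trained cascade.

```python
def stage_costs(features, reject_prob):
    """Expected number of processed features for every possible start stage.

    Returns (reject_cost, accept_cost), two lists indexed by start stage.
    """
    n = len(features)
    reject_cost, accept_cost = [], []
    for start in range(n):
        expected, survive, processed = 0.0, 1.0, 0
        for j in range(start, n):
            processed += features[j]           # features from the start stage to stage j
            p_ij = survive * reject_prob[j]    # probability of rejection exactly at stage j
            expected += p_ij * processed
            survive *= 1.0 - reject_prob[j]
        expected += survive * processed        # non-faces that pass every stage
        reject_cost.append(expected)
        accept_cost.append(float(processed))   # faces pass all remaining stages
    return reject_cost, accept_cost


def total_time(features, reject_prob, n_reject, n_accept, k=1.0):
    """Total time per start stage, proportional (factor k) to processed features."""
    reject_cost, accept_cost = stage_costs(features, reject_prob)
    return [k * (n_reject * r + n_accept * a) for r, a in zip(reject_cost, accept_cost)]


def overhead_rate(features, reject_prob, n_reject, n_accept):
    """(t(1) - t(n)) / (t(1) + t(n)); negative means starting at stage 1 saves time."""
    t = total_time(features, reject_prob, n_reject, n_accept)
    return (t[0] - t[-1]) / (t[0] + t[-1])


# Hypothetical 10-stage cascade with rejection probability rising from 50% to 99%.
FEATURES = [3, 16, 21, 39, 33, 44, 50, 51, 56, 71]
REJECT = [0.50 + (0.99 - 0.50) * i / (len(FEATURES) - 1) for i in range(len(FEATURES))]
```

With these illustrative numbers the overhead rate is positive when there are no non-face sub-windows and negative when non-face sub-windows vastly outnumber faces, which reproduces the qualitative behaviour described for the two extremes of the ratio.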

FIG 4.2 . Viola Jones detector's time cost for different start stages


FIG 4.3. The overhead of time consumption when starting at the first stage

Table 4.2. Detection results of video sequence


Chapter 5
P-FAD'S ALGORITHMIC COMPLEXITY
In this chapter, the computation complexity of our scheme is summarised. First of all, note
that the scheme's overall computation is mainly determined by P-FAD's first and second layers,
while the last three layers can be neglected because they have far fewer operating units.
Specifically, Layer 1 and Layer 2 are pixel manipulations and are completed simultaneously in a
single image scan, which reduces repeated memory access. Suppose the image size is N. Layer 1
needs N or 2N memory accesses to get the CbCr values, N to 4N comparison instructions to run
our simplified rectangle judgment, and N instructions to link to the second layer. Layer 2
needs at most a fraction of N (usually N/10) memory accesses to store the contour points, and
3N normal instructions. As a result, pixel manipulation needs N to 2.1N memory accesses and 5N
to 8N normal instructions in total. Secondly, the time consumption in Layer 3 and Layer 4 is
extremely low because of their dynamic properties and far fewer operating units; see Table 4.1.
Finally, the time consumption of the Viola-Jones detector in P-FAD is reduced significantly
because only hundreds of sub-windows need to be classified. In the traditional Viola-Jones
detector, the number of sub-windows is nearly O(N^2) owing to scaling and shifting, and the
processing time is proportional to it. Moreover, our modified Viola-Jones detector reduces the
time consumption further. Overall, the computation overhead of P-FAD is O(N), which is similar
to that of simple image processing functions and much lower than that of the traditional
Viola-Jones detector, whose computation overhead is O(N^2).
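
As a worked example of these bounds, the short sketch below plugs a VGA frame into the per-layer counts given above. The function name and the dictionary layout are illustrative choices; the coefficients are the ones stated in the text.

```python
def pixel_layer_cost(width, height, contour_divisor=10):
    """Bounds on memory accesses and normal instructions for Layers 1 and 2."""
    n = width * height
    mem_low = n                                # Layer 1 reads N values, Layer 2 stores none
    mem_high = 2 * n + n // contour_divisor    # Layer 1 reads 2N, Layer 2 stores N/10
    ins_low = n + n + 3 * n                    # N comparisons + N link + 3N in Layer 2
    ins_high = 4 * n + n + 3 * n               # 4N comparisons + N link + 3N in Layer 2
    return {"pixels": n, "memory": (mem_low, mem_high), "instructions": (ins_low, ins_high)}


vga = pixel_layer_cost(640, 480)
```

For a 640x480 frame, N is 307,200 pixels, so the pixel layers need between 307,200 and 645,120 memory accesses and between 1,536,000 and 2,457,600 normal instructions per frame.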


Chapter 6
EXPERIMENTAL RESULTS
The face detection scheme was implemented on our embedded smart camera platform and on a
2.20 GHz notebook to evaluate its processing speed and robustness. The embedded smart camera
platform, which is based on the CITRIC architecture, consists of an Intel XScale PXA270
microprocessor and an OV9655 image sensor.
First, the scheme's adaptive GMM algorithm is evaluated on a video sequence. Fig. 5 compares
the skin-tone detection PD (probability of detection) and FA (probability of false alarm) of
the adaptive GMM algorithm with those of two fixed rectangle models. The six selected frames
correspond to pictures 2, 3, 5, 8, 10 and 16 in Fig. 6.1. The last three frames are darker than
the first three because a man stood by the windows, and the persons in different frames have
various poses. The adaptive model is robust to this kind of environmental change.

FIGURE 6.1. Frame from test video sequence

Finally, the P-FAD scheme is implemented on the embedded smart camera to show its
resource-aware property. Fig. 6.2 shows the limited capability of the embedded platform by
listing the run time of basic image processing functions at three frequencies. Face detection
costs only 28.3 ms per VGA image, almost the same as a typical background subtraction
operation.


FIGURE 6.2 TIME CONSUMPTION ON EMBEDDED SMART CAMERAS

Chapter 7
CONCLUSIONS
P-FAD, a hierarchical framework for reducing the computing and storage cost of face
detection in embedded cameras, was implemented. The goal was to reduce pixel manipulation
without compromising detection performance. This goal was met by devising a 3-stage coarse,
shift and refine process, which shifts the operating unit from pixels to contour points,
regions and face candidates, and reserves the more complex processing for the promising units.
The experimental results exhibit the P-FAD scheme's resource-aware properties: it can process
a VGA image in just 7.23 ms on a notebook and 28.3 ms on a light-weight embedded smart camera
while still maintaining acceptable detection accuracy compared to the OpenCV implementation of
the Viola-Jones Haar detector.

REFERENCES
[1] Qiang Wang, Jing Wu, Chengnian Long and Bo Li, "P-FAD: Real-time Face Detection
Scheme on Embedded Smart Cameras", Shanghai Jiao Tong University, Shanghai, China.
[2] L. Acasandrei and A. Barriga, "Accelerating Viola-Jones face detection for embedded and
SoC environments", in Proc. ICDSC Conf., 2011.
[3] M. Bramberger, J. Brunner, B. Rinner, and H. Schwabach, "Real-Time Video Analysis on an
Embedded Smart Camera for Traffic Surveillance", in Proc. ITAS Conf., Toronto, May 2004, pp.
174-178.
[4] P. Chen, P. Ahammad, C. Boyer, "CITRIC: A Low-Bandwidth Wireless Camera Network
Platform", in Proc. ICDSC Conf., Aug. 2008, pp. 1-10.
[5] J. Cho, S. Mirzaei, and R. Kastner, "FPGA-based face detection system using Haar
classifiers", in Proceedings of the ACM/SIGDA International Symposium on Field Programmable
Gate Arrays, 2009.
[6] R.-L. Hsu, M. Abdel-Mottaleb and A.K. Jain, "Face detection in color images", IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-706, May
2002.
[7] P. Kakumanu, S. Makrogiannis, N. Bourbakis, "A survey of skin-color modeling and
detection methods", Pattern Recognition, vol. 40, pp. 1106-1122, 2007.
[8] R. Kleihorst, M. Reuvers, B. Krose and H. Broers, "A smart camera for face recognition",
in Proc. ICIP Conf., 2004.
[9] OpenCV: http://www.opencv.org.cn/opencvdoc/2.3.2/html/index.html
[10] H.A. Rowley, S. Baluja and T. Kanade, "Neural network-based face detection", IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, 1998.
[11] K. Suzuki, I. Horiba, N. Sugie, "Fast connected-component labeling based on sequential
local operations in the course of forward raster scan followed by backward raster scan", in
Proc. ICPR Conf., Aug. 2000, vol. 2, pp. 434-437.
