
Face Detection in Images:
Neural Networks & Support Vector Machines

Asim Shankar (asim@cse.iitk.ac.in)
Priyendra Singh Deshwal (priyesd@iitk.ac.in)

April 2002

Under the supervision of


Dr. Amitabha Mukherjee,
amit@cse.iitk.ac.in

Report submitted in partial fulfillment of requirements of the course


CS397 – Special topics in Computer Science
to the
Computer Science and Engineering Department,
Indian Institute of Technology, Kanpur
Asim Shankar, Priyendra Singh Deshwal Face Detection in Images
IIT Kanpur April 2002

ABSTRACT
Face detection and recognition in images is one of the many problems
that the computer-vision community has worked on over the years. The
applications of such a system are numerous, including automated security
systems, census taking and intelligence gathering. In this report, we
present our experience with two of the most successful techniques
available today ([rowley98], [cvpr97face]) and extensions of this work
into other interesting applications.

1 TABLE OF CONTENTS

1 Table of Contents
2 Introduction
3 Generic Approach
   3.1 The sliding window
   3.2 Image pre-processing
   3.3 Bootstrapping
   3.4 Training Set description
4 The Neural Network Technique
   4.1 Network Structure
   4.2 Results
   4.3 Other species
   4.4 Different Network Architectures
      4.4.1 Fully connected network
      4.4.2 Two outputs
5 Support Vector Machines
   5.1 Introduction to SVMs
   5.2 SVM learning parameters
   5.3 Results of training
6 Implementation Details
7 Neural Nets and SVMs – A comparison
8 Further Directions
9 References
10 Resources

Table of Figures
Figure 1 - Image pre-processing
Figure 2 - Constructing 20x20 training image from original
Figure 3 - Basic structure of neural network (Taken from [rowley98])
Figure 4 - Results of Neural Network on pictures taken by us
Figure 5 - Results of neural network on "standard" pictures
Figure 6 - Results on a "fully-connected" network


2 Introduction

Classification algorithms have traditionally worked by reducing the
object in question to a small set of meaningful features. In many cases,
however, this is not feasible. Face detection, for example, involves
“concepts” (such as a face) that cannot be reduced to a manageable,
quantifiable set of features whose basis or eigen-features can be found.
Since it is not known a priori what the relevant features for the given
concept are, the feature vectors are typically large (such as the grey
values of each pixel in the image).

Under such circumstances, the approach taken is to “learn” the solution
from a large set of examples. We look into a neural-network-based
technique (Rowley et al., [rowley98]) and a support-vector-machine-based
technique (Osuna et al., [cvpr97face]), both of which take in the large
feature vector and attempt to classify it.


3 Generic Approach

The problem in question: given an arbitrary image, mark the faces
detected in the image.

3.1 The sliding window


The two classification techniques studied (neural networks and SVMs)
classify a 20x20 window of pixels as a face or a non-face.

Thus, the system slides this 20x20 window across the image. For the
classifier to correctly detect a face, the face must fit into the window
and occupy all of it, i.e., it must be neither larger nor smaller than
the window. To expect that this will always happen is of course absurd,
so to compensate we repeatedly scale down the image by a constant factor
and slide the 20x20 window over each smaller image.

With this we are able to detect faces that are larger than the window
in the original image.
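
The scan described above can be sketched as follows. This is a minimal
illustration rather than the code used for this report: the image is
assumed to be a list of rows of grey values, and the function names, the
nearest-neighbour downscaling and the scale factor of 1.2 are all
assumptions made for the example.

```python
def downscale(image, factor):
    """Shrink a 2-D list of grey values by `factor` (nearest-neighbour sampling)."""
    height, width = len(image), len(image[0])
    new_h, new_w = int(height / factor), int(width / factor)
    return [[image[min(int(r * factor), height - 1)][min(int(c * factor), width - 1)]
             for c in range(new_w)]
            for r in range(new_h)]


def sliding_windows(image, window=20, step=1, scale_factor=1.2):
    """Yield (x, y, size, pixels) for every window at every scale.

    x, y and size are expressed in original-image coordinates; `pixels`
    is the window x window patch that would be handed to the classifier.
    """
    scale = 1.0
    current = image
    # Stop once the shrunken image can no longer contain a full window.
    while len(current) >= window and len(current[0]) >= window:
        for y in range(0, len(current) - window + 1, step):
            for x in range(0, len(current[0]) - window + 1, step):
                pixels = [row[x:x + window] for row in current[y:y + window]]
                yield int(x * scale), int(y * scale), int(window * scale), pixels
        # Always rescale from the original image (assumed choice) so that
        # sampling errors do not accumulate from one level to the next.
        scale *= scale_factor
        current = downscale(image, scale)
```

Each yielded patch is always 20x20, while the reported size grows with
the scale, which is how a face larger than the window in the original
image can still be marked.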

3.2 Image pre-processing

Face images vary a great deal: diversity in race, color, gender etc.
brings about a great deal of variation in face pictures. Add to that the
differences between images taken under different lighting conditions,
with different equipment etc., and the classifier can get completely
confused, its decision being influenced by such factors.

To avoid this, each image is pre-processed before being given to the
classifier. The pre-processing consists of the following steps:

• Illumination correction: A best-fit brightness plane is subtracted
from the window pixel values, reducing the effect of light and heavy
shadows.
• Histogram equalization: This compensates for differences in
illumination brightness, camera responses, skin color etc.

These steps are applied to each 20x20 window and not to the image as a
whole.
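
A minimal sketch of these two steps is given below. It is an
illustration under stated assumptions, not the IPL-based code used for
this report: the plane is fitted by least squares over the whole
rectangular window (no face mask), equalization uses 256 bins, and all
function names are hypothetical.

```python
def subtract_brightness_plane(window):
    """Fit I(x, y) ~ a*x + b*y + c by least squares and return the residuals."""
    h, w = len(window), len(window[0])
    # Centring the coordinates makes the three unknowns independent on a
    # full rectangular grid, so no matrix inversion is needed.
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    mean = sum(sum(row) for row in window) / float(h * w)
    sxx = h * sum((x - cx) ** 2 for x in range(w))
    syy = w * sum((y - cy) ** 2 for y in range(h))
    a = sum((x - cx) * window[y][x] for y in range(h) for x in range(w)) / sxx
    b = sum((y - cy) * window[y][x] for y in range(h) for x in range(w)) / syy
    return [[window[y][x] - (a * (x - cx) + b * (y - cy) + mean)
             for x in range(w)] for y in range(h)]


def equalize_histogram(window, levels=256):
    """Map values to 0..levels-1 so that the cumulative histogram is roughly linear."""
    flat = [v for row in window for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:                      # constant window: nothing to spread out
        return [[0 for _ in row] for row in window]

    def to_bin(v):
        return min(int((v - lo) / (hi - lo) * levels), levels - 1)

    hist = [0] * levels
    for v in flat:
        hist[to_bin(v)] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    return [[int(round((cdf[to_bin(v)] - cdf_min) * (levels - 1) / float(n - cdf_min)))
             for v in row] for row in window]


def preprocess(window):
    """Illumination correction followed by histogram equalization."""
    return equalize_histogram(subtract_brightness_plane(window))
```

A window whose brightness is a pure linear ramp is flattened to (almost)
zero by the first step, which is the sense in which a one-sided shadow
is reduced before the classifier sees the window.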


Figure 1 - Image pre-processing

3.3 Bootstrapping

Generating a training set for the SVM/neural network is a challenging
task because of the difficulty of placing “characteristic” non-face
images in the training set. Getting a representative sample of face
images is not much of a problem; however, choosing the right combination
of non-face images from the immensely large set of such images is a
complicated task.

For this purpose, after each training session, non-faces incorrectly
detected as faces are placed in the training set for the next session.
This “bootstrap” method avoids the need for a huge set of non-face
images in the training set, many of which may not influence the
training.
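
The bootstrap loop can be sketched as below. The sketch is independent
of the classifier: `train` and `classify` are hypothetical callables
standing in for either engine, and the toy one-dimensional "windows" in
the usage lines exist only to make the example self-contained.

```python
def bootstrap_training(train, classify, faces, nonfaces, scenery_windows, rounds=3):
    """Retrain repeatedly, each time adding the current false detections.

    train(faces, nonfaces) -> model
    classify(model, window) -> True if the window is judged to be a face
    scenery_windows: windows known to contain no face
    """
    nonfaces = list(nonfaces)
    model = train(faces, nonfaces)
    for _ in range(rounds):
        false_positives = [w for w in scenery_windows
                           if classify(model, w) and w not in nonfaces]
        if not false_positives:       # nothing new to learn from
            break
        nonfaces.extend(false_positives)
        model = train(faces, nonfaces)
    return model, nonfaces


# Toy stand-ins (assumptions): a "window" is a single number and the
# model is a threshold halfway between the two classes.
def toy_train(faces, nonfaces):
    return (max(nonfaces) + min(faces)) / 2.0


def toy_classify(model, window):
    return window > model


model, negatives = bootstrap_training(
    toy_train, toy_classify,
    faces=[8, 9, 10], nonfaces=[0, 1], scenery_windows=[2, 5, 6, 3])
```

In the toy run only the two scenery values that the first model
mistakes for faces are added to the negatives; the ones it already
rejects never enter the training set, which is the point of the method.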

3.4 Training Set description

Researchers in the field of face detection have used two common training
sets (CMU and MIT (Poggio)); however, those are not easily available.
For our purposes, we used some images from the CMU test set (see
Resources), the Biometric Security’s BioID face database (see Resources)
and a database of Indian faces generated here at IIT Kanpur.

In each image to be placed in the training set, the eyes, the nose, and
the left, right and center of the mouth were marked. With these
markings, the face was transformed into a 20x20 window with the marked
features at predetermined positions.


Initially, for negative samples, random images were created and added to
the training set. The training set was subsequently enhanced with
bootstrapping of scenery and false-detected images.

To make the system somewhat invariant to changes such as rotation of
the face, random transformations (rotation by ±15 degrees, mirroring)
were applied to images in the training set.
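
A minimal sketch of such transformations is given below. It makes
several assumptions that the text does not state: the angle is drawn
uniformly from the ±15 degree range, mirroring is applied with
probability one half, rotation uses nearest-neighbour sampling, and
corners that rotate in from outside the window are filled with a
constant.

```python
import math
import random


def mirror(window):
    """Flip the window left to right."""
    return [row[::-1] for row in window]


def rotate(window, degrees, fill=0):
    """Rotate about the window centre using nearest-neighbour sampling."""
    h, w = len(window), len(window[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    cos_t = math.cos(math.radians(degrees))
    sin_t = math.sin(math.radians(degrees))
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Inverse mapping: find the source pixel that lands on (x, y).
            sx = cos_t * (x - cx) + sin_t * (y - cy) + cx
            sy = -sin_t * (x - cx) + cos_t * (y - cy) + cy
            ix, iy = int(round(sx)), int(round(sy))
            inside = 0 <= ix < w and 0 <= iy < h
            row.append(window[iy][ix] if inside else fill)
        out.append(row)
    return out


def random_variant(window, rng=random, max_angle=15.0):
    """Return a randomly rotated and possibly mirrored copy of the window."""
    variant = rotate(window, rng.uniform(-max_angle, max_angle))
    return mirror(variant) if rng.random() < 0.5 else variant
```

Rotating the larger source image before cropping to 20x20 would avoid
the filled corners; the sketch rotates the window itself only to stay
short.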

The last used training set (including bootstrapping) had 8982 input
vectors.

Figure 2 - Constructing 20x20 training image from original


4 The Neural Network Technique

We implemented a retinally-connected neural network. The network takes
as input a 400-length vector (each element corresponding to the gray
value of a pixel in the 20x20 window) and returns a result between 0.0
and 1.0. The network is trained using the standard back-propagation
algorithm.

4.1 Network Structure


Our implementation was a crude version of the system described in
[rowley98]. We did not implement arbitration amongst multiple networks
and the size of the training set used was significantly smaller.

Figure 3- Basic structure of neural network (Taken from [rowley98])

The neural network is a two-layer (one hidden, one output) feed-forward
network. There are 400 input neurons, 26 hidden neurons and 1 output
neuron.

The hidden neurons are not connected to all the input neurons. Their
connections are as follows:

• The input image is divided into a 2x2 grid. 4 of the hidden neurons
take input from only one cell of this grid each.
• The input image is divided into a 4x4 grid. 16 of the hidden neurons
take input from only one cell of this grid each. This division into
grids should help in detecting local features (eyes, nose) important
for face detection.
• The input image is divided into 6 horizontal stripes (each of height
5 pixels, thus there is some overlap between stripes). The remaining 6
hidden neurons take input from only one stripe each. This should aid
in the detection of features such as a pair of eyes or the mouth.

The idea is that the hidden neurons taking square (grid) inputs would
detect individual features while the horizontal stripes would detect pairs of
eyes and the mouth.
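
The connection pattern can be written down explicitly, which also makes
it easy to count edges. The sketch below is illustrative only; the
3-pixel spacing between stripe tops is an assumption (the text only says
that the stripes overlap), chosen because six stripes of height 5 then
cover the 20 rows exactly. Under that assumption the total comes to the
1426 edges quoted in Section 4.4.1.

```python
def receptive_fields(size=20):
    """Return one list of (row, col) input pixels per hidden neuron."""
    fields = []
    # 2x2 grid (4 neurons, 10x10 pixels each), then 4x4 grid (16 neurons, 5x5 each).
    for grid in (2, 4):
        cell = size // grid
        for gy in range(grid):
            for gx in range(grid):
                fields.append([(y, x)
                               for y in range(gy * cell, (gy + 1) * cell)
                               for x in range(gx * cell, (gx + 1) * cell)])
    # 6 overlapping horizontal stripes of height 5 (stride of 3 is assumed).
    stripe_height, stride = 5, 3
    for top in range(0, size - stripe_height + 1, stride):
        fields.append([(y, x)
                       for y in range(top, top + stripe_height)
                       for x in range(size)])
    return fields


def count_edges(fields, outputs=1):
    """Input-to-hidden edges plus hidden-to-output edges."""
    return sum(len(f) for f in fields) + len(fields) * outputs


def fully_connected_edges(inputs=400, hidden=26, outputs=1):
    """Edge count when every input feeds every hidden neuron."""
    return inputs * hidden + hidden * outputs


fields = receptive_fields()
retinal = count_edges(fields)        # 400 + 400 + 600 + 26
full = fully_connected_edges()       # 400 * 26 + 26
```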


4.2 Results

Here we present some results obtained (green rectangles around detected
faces). You will notice that there are some false detections, which
should be reduced by adding these to the training set (more
bootstrapping). Also, the same face is often detected multiple times.
The remedy for this is to draw a single bounding rectangle around the
multiply detected regions. We implemented a primitive collapsing
technique, which still needs to be refined.
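
One simple way to collapse multiple detections is sketched below. It is
not the collapsing technique implemented for this report, whose details
are not given; it merely merges any rectangles that overlap into their
common bounding rectangle. Rectangles are assumed to be (x, y, width,
height) tuples.

```python
def overlaps(a, b):
    """True if rectangles a and b share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def bounding(a, b):
    """Smallest rectangle containing both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def collapse(rects):
    """Merge overlapping detections until no two results overlap."""
    merged = []
    for rect in rects:
        rect = tuple(rect)
        changed = True
        while changed:                # a grown rectangle may reach new neighbours
            changed = False
            for other in merged:
                if overlaps(rect, other):
                    merged.remove(other)
                    rect = bounding(rect, other)
                    changed = True
                    break
        merged.append(rect)
    return merged


detections = [(10, 10, 20, 20), (12, 11, 20, 20), (100, 100, 20, 20), (14, 14, 24, 24)]
faces = collapse(detections)
```

A refinement would be to require a minimum number of overlapping
detections before accepting a face, which would also suppress isolated
false detections.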

Figure 4 - Results of Neural Network on pictures taken by us


Figure 5 - Results of neural network on "standard" pictures

4.3 Other species

Here we tried some images of animal faces etc. to see if the network
learnt to recognize faces in general (two eyes, a nose and a mouth) or
was able to detect something unique about human faces. Do note that
none of these animal faces were in the training set. We obtained some
interesting results:

The application (screenshots above) didn’t draw a rectangle around the
chimp, so it didn’t think it was a face. However, when inspected more
closely, we saw that this chimp and some others had a network output
quite close to 0.5 (the demarcating limit we used between a face and a
non-face).

This dog’s face was detected by the network. The region does, after
all, have two eyes, and the fur of the dog is dark in the middle, which
makes it appear somewhat like a nose. However, many other dog faces were
categorically rejected by the system.


4.4 Different Network Architectures

Other than the network structure proposed in [rowley98], we also
experimented with alternative structures and compared their
performance with the one described above.

4.4.1 Fully connected network

After reading about the aforementioned network, an obvious question
that arose was what effect such restricted connections between the
input and hidden neurons have on the network. The restricted
structure has 1426 edges, while if we fully connect all 400 inputs
to all 26 hidden neurons and all 26 hidden neurons to the output
neuron we end up with 10426 edges. To investigate this, we trained a
fully connected network on the same training set.

We observed that the results were quite similar; however, the time
taken to process an image with the fully connected network was much
larger (10426 edges against 1426, i.e. about 7.3 times as many, or
roughly 630% extra edges). Since this slower performance didn’t
translate to more accurate detection, we concluded that Rowley’s
construction was quite appropriate.


Figure 6 - Results on a "fully-connected" network

4.4.2 Two outputs

The networks above, with only one output, gave a few false detections
and on rare occasions missed a face. A common strategy used in many
neural-network-based classifiers is a two-output system. Some believe
that neural networks work better with sparse input/output schemes. We
thus tried a two-output system, where the first output gives a measure
of how likely the given image is to be a face while the second output
gives a measure of how likely the given image is to not be a face.

Again, such a structure seemed to be no better than the original, more
compact network with one output.


5 Support Vector Machines

Preliminary experiments with the SVM technique described in
[cvpr97face] seem to show that it is as promising as the neural network
technique. A small difference between our implementation and the one
proposed is that Osuna et al. use a 19x19 feature vector while we use a
20x20 one, so that the training set can be shared.

The training set used was exactly the same as that used for the neural
network, i.e., 8982 input vectors.

5.1 Introduction to SVMs

The support vector machine is a pattern classification algorithm
developed by V. Vapnik and his team at AT&T Bell Labs [vapnik95svnets].
While most machine-learning-based classification techniques are based on
the idea of minimizing the error on the training data (empirical risk),
SVMs operate on another induction principle, called structural risk
minimization, which minimizes an upper bound on the generalization
error.

Consider data points of the form $\{(x_i, y_i)\}_{i=1}^{N}$, where each
$x_i$ is a point in a high-dimensional space and $y_i$ labels which of
two classes it belongs to; among the infinitely many points of that
space, we wish to determine which class a given point belongs to. If the
two classes are linearly separable, we need to determine a hyper-plane
that separates these two classes in space. However, if the classes are
not cleanly separable, our objective is to find the hyper-plane with the
smallest generalization error. Intuitively, a good choice is the
hyper-plane that leaves the maximum margin between the two classes
(margin being defined as the sum of the distances of the hyper-plane
from the closest points of the two classes) and minimizes the
misclassification errors.

It can be shown that the solution to this problem is a linear
classifier, $f(x) = \mathrm{sign}\left(\sum_{i=1}^{N} \lambda_i y_i x^T x_i + b\right)$,
whose coefficients $\{\lambda_i\}$ are the solution of the following QP
problem:

Figure 7 - QP eqn. whose solutions are the support vectors (from [cvpr97face])
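
For reference, the QP in question is usually stated as the soft-margin
dual below, written in the notation of the classifier above. That it
matches the figure from [cvpr97face] term for term is an assumption;
$C$ is the bound that appears as a learning parameter in Section 5.2.

```latex
\begin{aligned}
\max_{\lambda}\quad & \sum_{i=1}^{N} \lambda_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \lambda_i \lambda_j \, y_i y_j \, x_i^T x_j \\
\text{subject to}\quad & \sum_{i=1}^{N} \lambda_i y_i = 0, \\
& 0 \le \lambda_i \le C, \qquad i = 1, \dots, N
\end{aligned}
```

With a kernel, the inner product $x_i^T x_j$ is replaced by
$K(x_i, x_j)$ both here and in the classifier.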


It turns out that only a small number of coefficients are different from
zero, and since every coefficient corresponds to a particular data
point, this means that the solution is determined by the data points
associated with the non-zero coefficients. These are the support
vectors, the only ones which are relevant to the solution of the
problem; all other data points can be deleted from the data set without
affecting the solution. Intuitively, support vectors are the data points
lying on the border between the two classes.

Figure 8 - Separating hyperplanes: (a) small margin, (b) larger margin, better classifier (taken from [cvpr97face])

In the real world, we’re unlikely to find problems that can actually
be solved by a linear classifier. To extend the technique to
non-linear decision surfaces, we project the original vector into a
higher dimensional feature space. The problem now is the choice of
the features that will project the original vector into a higher
dimensional space. For this we use kernel functions K(x,y). See
[vapnik95svnets] for more details.

5.2 SVM learning parameters

The parameters used by the learning engine were:


• C – (tradeoff between training error (minimized) and margin
(maximized)) = 1.0
• Kernel function – 2nd degree polynomial

The SVM training algorithm reported zero misclassifications over the
training set with these parameters. Increasing C to 2.0, however,
resulted in 89 misclassification errors (1% error) over the training
set. As there were no misclassifications with a 2nd degree polynomial,
increasing the degree did not seem required, and indeed performance was
similar with a 3rd degree polynomial kernel.
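
How a trained model with a polynomial kernel classifies a new vector can
be sketched as below. The kernel form $(x \cdot y + 1)^d$ is a common
choice and is an assumption here; the exact form and constants used by
the training engine are not stated in this report. The support vectors,
coefficients and bias in the usage lines are made-up toy values.

```python
def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree (assumed form)."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def decision_value(x, support_vectors, coefficients, labels, bias, kernel=poly_kernel):
    """Kernelised version of f(x) before taking the sign."""
    return sum(lam * y * kernel(x, sv)
               for sv, lam, y in zip(support_vectors, coefficients, labels)) + bias


def is_face(x, support_vectors, coefficients, labels, bias):
    """Positive decision values are read as 'face'."""
    return decision_value(x, support_vectors, coefficients, labels, bias) > 0


# Toy model: two support vectors in 2-D, one per class.
svs = [[1, 0], [0, 1]]
lams = [0.5, 0.5]
ys = [+1, -1]
value = decision_value([1, 0], svs, lams, ys, bias=0.0)
```

Only the support vectors enter the sum, so classification cost grows
with their number rather than with the size of the training set.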


5.3 Results of training

644 support vectors were obtained with C=1.0 and a 2nd degree
polynomial kernel, with no misclassifications on the training set.

With a 3rd degree polynomial kernel, the support vectors increased to 711.

Interestingly, the learning time for the SVM algorithm was significantly
smaller than that for the neural network. Over the ~9000 image training
set, the SVM algorithm produced the model in approximately 15 minutes,
while backpropagation of the neural network took 1 hour.


6 Implementation Details

In the course of studying the face detection techniques described above,
we did a lot of implementation work. We tried to write a significant
amount of reusable and pluggable code so that future work can easily
build upon our engines.

Intel’s Image Processing Library (IPL) was used for image processing and
manipulation (histogram equalization, window extraction, scaling etc.).
Input vectors were then created from the scaled, processed windows.

The application also assists in the creation of the training set by allowing
features (eyes, nose, mouth) to be labeled, transforming the face based
on the selected features to a 20x20 window, rotating the image randomly,
pre-processing the image and then writing to a training set file.

A neural network library (see Resources) was created for the
corresponding technique. Training of the network was done on a compute
server (as the training set was large) and the trained network was then
plugged into the GUI for testing.

The SVM engine used was SVM-light (see Resources).

All training engines are both Linux and Windows compatible. The GUI is
currently written for Windows systems.

The code written is free for use, with the hope that this will save a
significant amount of time for anyone trying to build up from here.
Please feel free to contact the authors for these applications.


7 Neural Nets and SVMs – A comparison


Here we present a few images (20x20 windows blown up) and the output
of the neural network and the SVM classifier on that input.

Image   Neural Network      SVM classifier
        (0=NO, 1=YES)       (-1=NO, +1=YES)
1       Yes (0.97)          Yes (1.02)
2       Yes (0.82)          Yes (0.89)
3       Yes (0.59)          No!! (-0.12)
4       Yes (0.86)          Yes (0.39)
5       Yes (0.99)          Yes (2.01)
6       Yes (0.87)          Yes (1.11)
7       No (0.020)          No (-2.9)
8       No (0.001)          No (-3.9)
9       No (0.00001)        No (-5.6)
10      No (0.00002)        No (-4.5)
11      No (0.040)          No (-2.3)
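
The two output scales are read differently: the network output is
compared with the 0.5 limit mentioned in Section 4.3, while the SVM
output is judged by its sign. The short sketch below applies both rules
to the values listed above; the function names are hypothetical.

```python
def nn_says_face(output, threshold=0.5):
    """Network output lies in [0, 1]; above the threshold counts as a face."""
    return output > threshold


def svm_says_face(output):
    """SVM output is a signed value; positive counts as a face."""
    return output > 0.0


# (neural network output, SVM output) for the eleven images, in order.
outputs = [
    (0.97, 1.02), (0.82, 0.89), (0.59, -0.12), (0.86, 0.39),
    (0.99, 2.01), (0.87, 1.11), (0.020, -2.9), (0.001, -3.9),
    (0.00001, -5.6), (0.00002, -4.5), (0.040, -2.3),
]

# 1-based positions of the images on which the two classifiers disagree.
disagreements = [i for i, (nn, svm) in enumerate(outputs, start=1)
                 if nn_says_face(nn) != svm_says_face(svm)]
```

The only disagreement is the third image, where both outputs sit close
to their respective decision boundaries.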


8 Further Directions

The face detection problem has many applications in the field of
security systems, automated census, intelligence systems etc. Of
particular interest to us, however, is its use in video summarization.
The idea is that, given a video sequence, we first identify the faces in
the frames and then use the identified faces for motion-tracking and
face-recognition. With this, we may be able to textually comment on the
movement of persons across a scene.

While the face detection technique described above can be applied to
video applications, a major hindrance is the speed, or rather lack of
it, on large images. To do this over a large set of frames in a video
would make the system prohibitively slow. However, we can use
properties of video to ease this problem. For example, using background
subtraction techniques we can reduce the number of regions in the frame
where a face detection is likely, and thus instead of looking at all
windows in each frame we look only at the regions of interest in each
frame.
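
As a rough illustration of the idea, the sketch below finds the bounding
box of the pixels that differ from a known background frame; the window
scan would then be restricted to that box. This is a hypothetical
sketch of one very simple form of background subtraction, with an
arbitrary threshold, and has not been evaluated as part of this work.

```python
def changed_region(frame, background, threshold=25):
    """Bounding box (x, y, width, height) of changed pixels, or None.

    A pixel counts as changed when its grey value differs from the
    background by more than `threshold`.
    """
    xs, ys = [], []
    for y, (row, bg_row) in enumerate(zip(frame, background)):
        for x, (pixel, bg_pixel) in enumerate(zip(row, bg_row)):
            if abs(pixel - bg_pixel) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None                   # nothing moved: no windows to scan
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


background = [[0] * 10 for _ in range(10)]
frame = [row[:] for row in background]
for y in range(2, 5):
    for x in range(3, 7):
        frame[y][x] = 200             # a bright object enters the scene
roi = changed_region(frame, background)
```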

Testing out the feasibility and performance of such a system would be the
next logical step to take after the

Furthermore, the detection scheme described in this report deals only
with full-frontal facial images; profile views and occluded faces are
not handled. Profile views could be detected using the same technique,
possibly using the eye, nose and ear positions to standardize the
training set and then applying the training schemes described above.

We surveyed such techniques as the first step in video summarization.
The next step would be to be able to:
• Label every scene with the characters present in it, and then
• Label every scene with the actions of each actor (bend, walk, move
hand etc.)


9 References
• [rowley98] - Neural network based Face Detection. Henry Rowley,
Shumeet Baluja, Takeo Kanade. CMU. IEEE Transactions on Pattern
Analysis and Machine Intelligence, volume 20, number 1, pages 23-38,
January 1998.
(http://www-2.cs.cmu.edu/afs/cs.cmu.edu/user/har/Web/faces.html)

• [cvpr97face] - Training support vector machines: An application to
Face Detection. Edgar Osuna, Robert Freund, Federico Girosi. MIT.
1997.
(http://citeseer.nj.nec.com/osuna97training.html)

• [sung94examplebased] - Example-based learning for Human Face
Detection. Kah-Kay Sung, Tomaso Poggio. MIT. IEEE Transactions on
Pattern Analysis and Machine Intelligence, volume 20, number 1,
pages 39-51, January 1998.
(http://citeseer.nj.nec.com/sung94examplebased.html)

• [rowley97] - Rotation Invariant Neural Network-Based Face
Detection. H. Rowley, S. Baluja, and T. Kanade. Technical report
CMU-CS-97-201, Computer Science Department, Carnegie Mellon
University, December 1997.

• [vapnik95svnets] - Support vector networks. C. Cortes and V.
Vapnik. Machine Learning, 20:1-25, 1995.

• T. Joachims, Making large-Scale SVM Learning Practical. Advances
in Kernel Methods - Support Vector Learning, B. Schölkopf and C.
Burges and A. Smola (ed.), MIT-Press, 1999.

10 Resources
• CMU image test set for face detection
http://vasc.ri.cmu.edu/IUS/eyes_usr17/har/har1/usr0/har/faces/test/

• BIOID Face database
http://www.bioid.com/technology/facedatabase.html

• Annie – Artificial Neural Network library for C++
http://home.iitk.ac.in/student/asim/annie/

• SVM light – Support Vector Machine training and classification software
http://svmlight.joachims.org/

• Intel Performance Libraries – Image Processing Library
http://www.intel.com/software/products/perflib/ipl
