
MASTER OF SCIENCE IN

INFORMATION NETWORKING





THESIS REPORT


ELASTIC BUNCH GRAPH MATCHING FACE
RECOGNITION:
PERFORMANCE AND COMPARISON WITH SUBSPACE
PROJECTION METHODS


ANDREAS STERGIOU, MSIN 2003


SUPERVISOR: DR. A. PNEVMATIKAKIS

AUTONOMIC AND GRID COMPUTING GROUP

Abstract

The problem of face identification on still images has gained much attention over the past years.
One of the main driving factors for this trend is the ever growing number of applications that an efficient and
resilient recognition technique can address, such as security systems based on biometric data and user-
friendly human-machine interfaces. An example application of the latter category is the smart room, which
uses cameras and microphone arrays to detect the presence of humans, decide on their identity and then react
according to a predefined set of preferences for each person. Although a great number of algorithms have
been proposed in the literature, their success usually assumes some specific parameters for the problem, and
finding a resilient, all-purpose face identification method has proven to be a much tougher challenge.
Variations in pose, illumination, and expression, as well as partial face occlusions are only a few of the
problems that such an algorithm would have to cope with.
Elastic Bunch Graph Matching (EBGM) is a feature-based face identification method. The
algorithm assumes that the positions of certain fiducial points on the faces are known and stores information
about the faces by convolving the images around the fiducial points with 2D Gabor wavelets of varying size.
The results of all convolutions form the Gabor jet for that fiducial point. EBGM treats all images as graphs
(called Face Graphs), with each jet forming a node. The training images are all stacked in a structure called
the Face Bunch Graph (FBG), which is the model used for identification. For each test image, the first step
is to estimate the position of fiducial points on the face based on the known positions of fiducial points in the
FBG. Afterwards, jets are extracted from the estimated points and the resulting Face Graph is compared
against all training images in the FBG, using Gabor jet similarity measures to decide the identity of the
person in the test image.
The goal of this Thesis is to study the EBGM algorithm in the context of the CHIL project, as one of
the candidate technologies for the implementation of the Face ID module in the AIT Smart Room. The first
aspect of the algorithm that was studied is its ability to automatically locate the positions of features on
novel imagery, since increasing the estimation accuracy leads to better identification performance. A number
of feature estimation techniques were compared to ascertain their merits and shortcomings. With regard to
the actual recognition step, a variety of similarity metrics were studied, leading to two EBGM variants
that differ in recognition speed, intended for real-time and off-line applications, respectively. A multitude of
parameters such as the number of training images per class, illumination changes, varying image size,
imperfect eye localization and varying number of wavelets used for the Gabor jets were studied in order to
ascertain the algorithm's robustness to such changes. Finally, the EBGM algorithm was compared with a
number of Subspace Projection method variants (also implemented here at AIT) and the strengths and
weaknesses of both approaches were deduced.



TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION.......................................................................................... 5
1.1 The Face Identification Problem...............................................................................................................................5
1.2 EBGM Algorithm Overview......................................................................................................................................6
1.3 The HumanScan Database ........................................................................................................................................8
1.4 Thesis Goals ................................................................................................................................................................9
1.5 Thesis Organization .................................................................................................................................................10
CHAPTER 2: EBGM IN DEPTH....................................................................................... 11
2.1 Gabor Wavelets ........................................................................................................................................................11
2.2 Gabor Jets .................................................................................................................................................................16
2.2.1 Gabor Jet Similarity Measures............................................................................................................................16
2.3 Face Graphs ..............................................................................................................................................................18
2.4 Face Bunch Graph....................................................................................................................................................19
CHAPTER 3: FEATURE ESTIMATION........................................................................... 21
3.1 Image Preprocessing (Normalization) ....................................................................................................................21
3.2 Landmark Localization ...........................................................................................................................................23
3.3 Displacement Estimation Methods .........................................................................................................................25
3.3.1 Displacement Estimation Predictive Step (DEPS) .............................................................................................25
3.3.2 Displacement Estimation Predictive Iteration (DEPI) ........................................................................................26
3.3.3 Displacement Estimation Grid Search (DEGS)..................................................................................................27
3.3.4 Displacement Estimation Local Search (DELS).................................................................................................28
3.4 Comparative Analysis of DE Methods....................................................................................................................28
3.4.1 FBG Creation Time ............................................................................................................................................29
3.4.2 Image Processing Time.......................................................................................................................................30
3.4.3 Feature Estimation RMS Error ...........................................................................................................................34
3.4.4 Usage Of The Eye Pupil Coordinates .................................................................................................................40
CHAPTER 4: IDENTIFICATION RESULTS..................................................................... 42
4.1 Face Similarity Metrics............................................................................................................................................42
4.2 Baseline Identification Performance.......................................................................................................................43
4.2.1 Using A Single Training Image Per Class ..........................................................................................................43
4.2.2 MS vs. DEGS_16 ...............................................................................................................................................44
4.2.3 Increasing The Training Set Size........................................................................................................................48
4.3 Further Identification Experiments........................................................................................................................49
4.3.1 Effect Of Illumination Changes..........................................................................................................................50
4.3.2 Effect Of Image Size ..........................................................................................................................................52
4.3.3 Effect Of Imperfect Eye Localization.................................................................................................................54
4.3.4 Effect Of Number Of Fiducial Points.................................................................................................................56
4.3.5 Effect Of Gabor Kernel Set Size ........................................................................................................................58
CHAPTER 5: CONCLUSIONS......................................................................................... 61
5.1 Summary...................................................................................................................................................................61
5.2 Directions For Future Work....................................................................................................................................61
APPENDIX A: REFERENCES......................................................................................... 63















Chapter 1: Introduction

1.1 The Face Identification Problem

For human beings, the task of face identification is fairly straightforward and seemingly
uncomplicated; for the average person, only a few glimpses of an unknown face are needed to place
it in memory and just as easily recall it when needed. Although humans perform so well in this
task, it is not clear how the desired result is achieved; deducing the underlying mechanisms which
enable this process is a totally different story, while at the same time being a crucial step in
allowing computers to imitate our face recognition capabilities in a reliable and robust manner.
When a machine is presented with the face identification problem, it must process a given image or
video stream and return the most probable identities of the people present (possibly more than one),
according to the contents of its database (i.e. the people the machine knows). In an effort to
duplicate the human decision process, two main categories of algorithms have been proposed,
relying on either information about the whole face or specific, easily-located points on it (facial
features).
The first of these families of methods is usually termed appearance-based in the literature,
whereas the second is referred to as the feature-based approach. Perhaps the best known
appearance-based algorithm is the Principal Component Analysis (PCA, [1]), which belongs to the
family of Subspace Projection Methods. PCA considers the image as a whole, arranges all pixel
values in a line vector and regards each pixel as a separate dimension of the problem. This vector is
then projected on a space of much lower dimension (hence the name of the family), in an attempt to
reduce the problem size while retaining as much information as possible about the original image.
PCA is usually enhanced with Linear Discriminant Analysis (LDA, [2]) in an effort to improve
performance. LDA is essentially a supervised training method of the system in the projected
subspace which tries to form tight clusters of points corresponding to images from the same
subject, while at the same time placing clusters corresponding to different individuals as far away
as possible.
Feature-based approaches, on the other hand, rely on information about well-defined facial
characteristics and the image area around these points to represent a face in the problem space and
perform recognition. Examples of these facial features are the eyes, nose, mouth, eyebrows etc. The
exact coordinates of the eyes in particular are ideally given, although in practice the algorithm can
only work with estimates obtained from a face detection and eye zone locator module that precedes
the recognition process. An example of a feature-based approach is the Elastic Bunch Graph
Matching (EBGM, [3]) algorithm, which stores spectral information about the neighborhoods of
facial features by convolving these areas with Gabor wavelets (masks). An overview of the EBGM
algorithm is presented in section 1.2; its mechanics and the AIT implementation are discussed
throughout the remainder of this Thesis.
The face recognition task has gained much attention over the last years, due to the great
number of applications it can address. Automatic face identification tools are a key component of
modern biometric security systems and can be deployed both in corporate and in public service
environments (such as an airport), providing reliable and unobtrusive surveillance of the premises.
Another wide area of applications involves user-friendly human-machine interfaces, whereby
incapacitated individuals could rely on such automated systems to receive specific services and
communicate with the rest of the world when in need. Such premises, usually termed smart
rooms, contain a number of sensors such as cameras and microphone arrays which detect the
presence of humans and react according to a set of predefined scenarios whenever certain triggering
events occur.
Due to this increase in the need for reliable and robust automatic face identification
systems, a great number of algorithms have been proposed in the literature ([4] gives an indicative
list of such methods). Up to date, however, no all-purpose, resilient approach has been reported.
This is due to a number of parameters of the face recognition problem which complicate matters in
a number of ways. Any such algorithm must cope with faces of varying pose and size conveying
different expressions. Illumination changes are also important, since they can lead to significant
performance degradation if they are not accounted for appropriately. Another difficulty is that a
face may be partially occluded if that person wears sunglasses or a scarf, in which case it is usually
very difficult to perform recognition reliably.

1.2 EBGM Algorithm Overview

As discussed earlier, the EBGM algorithm is a feature-based approach to the face
identification problem. Figure 1 below shows a flow diagram of the algorithm's implementation. In
the context of EBGM, the facial features that are used are called fiducial points. For the training
step, the exact coordinates of these points are assumed to be known (usually hand-annotated by
humans). Images are represented internally by the algorithm using spectral information of the
regions around these features, which is obtained after convolving those portions of the image with a

Figure 1: Flow diagram of the EBGM algorithm (from [7]).

set of Gabor wavelets of varying size, orientation and phase. The results of the convolution for a
specific position (called the Gabor Jet) are then collected for all fiducial points on a given image
and aggregated (together with the feature coordinates) in that image's Face Graph. Having applied
this process to all images in the training set, all the resulting Face Graphs are concatenated in a
stack-like structure called the Face Bunch Graph (FBG). This is the system's model of all
individuals it can identify.
For the testing step, on the other hand, minimal information about the features is available
(at best we have the eye pupil coordinates). Rather, the algorithm constructs the test image's Face
Graph by estimating the positions of fiducial points in an iterative manner, using the information
stored in the FBG and previously estimated feature positions. This automatic feature localization
capability is one of the major advantages of the EBGM algorithm. After the Face Graph has been
constructed, it is compared against all members of the FBG to determine the closest match
according to a given similarity metric. The identity of the person this closest match corresponds to
is reported as the system's decision.

1.3 The HumanScan Database

In order to evaluate the implementation of the EBGM algorithm, we used the HumanScan
image database ([5]). This contains a total of 1521 images belonging to 23 individuals (classes),
each accompanied by a set of twenty hand-annotated features (fiducial points), as shown in Figures
2 and 3. The HumanScan database contains only moderate lighting and expression variations, and
in some cases an individual has images with and without glasses. Two classes had very few image
instances which were discarded, so in the end 1513 images were used, corresponding to 21 classes
with at least 25 samples per person. Table 1 lists the facial features whose positions are provided
with the database.


Figure 2: Example images from the HumanScan database.


Figure 3: Hand-annotated fiducial points on a HumanScan image.


Feature # Description
1 Right Eye Pupil
2 Left Eye Pupil
3 Right Mouth Corner
4 Left Mouth Corner
5 Outer End of Right Eyebrow
6 Inner End of Right Eyebrow
7 Inner End of Left Eyebrow
8 Outer End of Left Eyebrow
9 Right Temple
10 Outer Corner of Right Eye
11 Inner Corner of Right Eye
12 Inner Corner of Left Eye
13 Outer Corner of Left Eye
14 Left Temple
15 Tip of Nose
16 Right Nostril
17 Left Nostril
18 Centre Point on Outer Edge of Upper Lip
19 Centre Point on Outer Edge of Lower Lip
20 Tip of Chin

Table 1: List of fiducial points provided with the HumanScan images.
1.4 Thesis Goals

The purpose of this Thesis is three-fold:

To implement the EBGM algorithm as a candidate method for the AIT Smart Room Face
Identification module, in the context of the CHIL (Computers in the Human Interaction
Loop) project ([6]). For this purpose, the RAVL C++ libraries were used ([10]).
To test the algorithm under varying conditions (such as lighting and image size variations
and imperfect eye localization) and ascertain its robustness and applicability in a real-world
scenario.
To compare EBGM with representatives of the Subspace Projection Methods family, also
studied here in AIT in the context of the CHIL project, and determine their respective
strengths and weaknesses on a common test bed.


1.5 Thesis Organization

The first Chapter has introduced the Face Identification problem, provided a short
discussion of the EBGM algorithm and the HumanScan image database and stated the Thesis goals.

Chapter 2 discusses the EBGM algorithm in depth, presenting its components analytically
in increasing order of complexity.
Chapter 3 presents the standard image preprocessing step (geometric normalization)
applied and studies the various displacement estimation methods used for feature localization. The
speed and accuracy of these methods is compared on the average and on a per-feature basis, both
when the eye pupil coordinates are known exactly and when they are estimated using the FBG.
Chapter 4 addresses the method's performance for the actual identification process, in
terms of both accuracy and speed. A number of similarity metrics are considered and their
respective strengths and weaknesses are discussed. A baseline recognition performance is
established, which is subsequently compared with the Subspace Projection methods. Following
that, we introduce certain impairments to the problem, such as intensity variations, decreasing
image size and imperfect eye localization incurred by an eye detector module, and the algorithm's
robustness to them is studied, again in conjunction with the results obtained with the Subspace
Projection methods. Some initial results from tests using additional (extrapolated) Gabor jets (from
features not provided in the HumanScan database) are also discussed. Finally, the effect of varying
the Gabor kernel set size is described.
Chapter 5 presents a number of directions for future work and summarizes the Thesis
results and conclusions.











Chapter 2: EBGM In Depth

2.1 Gabor Wavelets

Gabor wavelets are fundamental to the EBGM algorithm and a background in wavelet
analysis is necessary in order to understand the method's intricacies. Wavelets are used (much like
Fourier transforms) to analyze frequency-space properties of an image, the difference being that
wavelets operate on a localized image patch, while the Fourier transform affects the whole
image.
The EBGM algorithm uses a two dimensional form of Gabor wavelets for image
processing. Each wavelet consists of a planar sinusoid multiplied by a two dimensional Gaussian
distribution. The sine wave is activated by the frequency information on the image, while the
Gaussian ensures that the convolution result is dominated by the region close to the center of the
wavelet. Gabor wavelets can take a variety of different forms, usually having parameters that
control the orientation, frequency, phase, size and aspect ratio. To acquire an accurate and
comprehensive description of a feature in an image, it is necessary to convolve that location with a
family of many different wavelets, typically having different frequencies and orientations. This
multitude of convolution kernels leads to the notion of the Gabor Jet that will be described in the
following section.
Figure 4 below shows the result of convolving an image containing a face with a real and
imaginary wavelet. A two dimensional Gabor wavelet will respond to image features that are of the
same orientation and size. The figure shows the original image and the two wavelet masks that are
used for convolution, and the bottom two images depict the magnitude and phase convolution
values at each point. From the magnitude response it can be seen that the wavelets respond
especially to the nose and left ear and that magnitude values change rather slowly with the
displacement from the center of the convolution. On the other hand, phase values are pretty much
proportional to the horizontal displacement.
The EBGM algorithm performs the wavelet convolution using precomputed Gabor masks,
whereby each mask is a two dimensional array that is loaded from a text file at run time and used as
a look-up table for wavelet values during convolution. The masks are centered over the correct
location in the image and each corresponding value is computed by multiplying the pixel intensity
with the mask value at that point and then summing up all individual contributions to the
convolution. In order to compute both the real and the imaginary part of the wavelet, it is necessary
to convolve the image with two masks that are out of phase by π/2, corresponding to the use of a
sine and a cosine in the wavelet transform.


Figure 4: Two Dimensional Gabor Wavelet Convolution Example (from [7]).
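To make the look-up-table convolution described above concrete, the following standalone C++ sketch (hypothetical types and names, not the RAVL-based implementation used in this Thesis) centres a precomputed mask over a pixel and sums the products of pixel intensities and mask values:

#include <vector>

// Minimal grayscale image: row-major pixel intensities.
struct Image {
    int width = 0, height = 0;
    std::vector<double> pixels;                          // size = width * height
    double at(int x, int y) const { return pixels[y * width + x]; }
};

// A precomputed Gabor mask (one phase), also stored row-major.
struct Mask {
    int side = 0;                                        // masks are side x side
    std::vector<double> values;
    double at(int x, int y) const { return values[y * side + x]; }
};

// Convolve the image with the mask centred at (cx, cy); pixels falling
// outside the image are skipped.
double ConvolveAt(const Image& img, const Mask& mask, int cx, int cy) {
    const int half = mask.side / 2;
    double sum = 0.0;
    for (int my = 0; my < mask.side; ++my)
        for (int mx = 0; mx < mask.side; ++mx) {
            const int ix = cx + mx - half;
            const int iy = cy + my - half;
            if (ix < 0 || iy < 0 || ix >= img.width || iy >= img.height) continue;
            sum += img.at(ix, iy) * mask.at(mx, my);     // intensity times mask value
        }
    return sum;
}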

The wavelet specification follows along the lines discussed in [7], and the full equation was
chosen for its straightforward formulation and simplicity as follows

$$W(x, y, \theta, \lambda, \varphi, \sigma, \gamma) = e^{-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}} \cos\left(2\pi\frac{x'}{\lambda} + \varphi\right)$$

where

$$x' = x\cos\theta + y\sin\theta$$
$$y' = -x\sin\theta + y\cos\theta$$


As can be seen from the above equation, a set of five parameters determines each wavelet's
characteristics, as discussed below.

θ specifies the wavelet's orientation, rotating it around its center. A wavelet's orientation
determines the angle of the edges and bars in the image to which it will respond. It is easy to
observe from the previous equations that selecting a set of values between 0 and π is
sufficient, since values between π and 2π are rendered redundant due to the wavelet's
symmetry around its origin. Figure 5 shows a wavelet rotated for four different values of θ:
0, π/4, π/2 and 3π/4.


Figure 5: Examples of different wavelet orientations.

λ specifies the wavelength of the sinusoid, or equivalently the wavelet's frequency. Wavelets
with a large wavelength will respond to gradual changes in the image intensity, whereas
short wavelengths are better suited for sharp edges and bars. Figure 6 shows four kernels
with the wavelength being slowly increased from 8 to 16 pixels.


Figure 6: Examples of different wavelet frequencies.

φ specifies the sinusoid's phase. Typically, Gabor wavelets are based on a sine or cosine
wave. In the scope of the EBGM algorithm, cosine wavelets are viewed as the real part of
the transform and sine wavelets as the imaginary part, therefore a convolution with both
phases produces a complex coefficient for a given orientation and wavelength. In order to
achieve the complex characteristics, the two wavelets must simply have a phase offset of
π/2, i.e. it is not necessary to choose the values 0 and π/2. For example, Figure 7 illustrates
the values of 0, π/2, π and 3π/2 (note that the last two wavelets are again redundant due to
the equation's symmetry).


Figure 7: Examples of different wavelet phases.

σ specifies the Gaussian's radius and consequently its size, which is sometimes also referred
to as its basis of support. The Gaussian's size determines the image area around the kernel's
center that is affected by convolution. In theory, the whole image should be considered;
however, as the convolution moves further away from the center of the Gaussian, the
contribution of the outlying regions becomes negligible. σ is usually chosen to be
proportional to the wavelength, so that wavelets of different size and frequency are simply
scaled versions of one another. The mask size is also closely related to the size of the
Gaussian. Although there is no strict relationship, it is imperative to choose the size so as to
capture the significant portions of the Gaussian and obtain an adequate description in the
frequency space. Figure 8 shows four wavelets with radii of 16, 12, 8 and 4 pixels,
respectively.


Figure 8: Examples of different wavelet sizes.

γ specifies the aspect ratio of the Gaussian. This parameter is included so that the wavelets
can approximate some biological models ([8]) and determines how oblong or round the
wavelet mask will be. Figure 9 depicts four wavelets with aspect ratios between 0.5 and 1.5.


Figure 9: Examples of different wavelet aspect ratios.

For our implementation of the EBGM algorithm, the following parameter sets were
employed:

8 different values for the orientation: θ ∈ {0, π/8, π/4, 3π/8, π/2, 5π/8, 3π/4, 7π/8}.
5 different values for the wavelength of the sinusoid: λ ∈ {4, 4√2, 8, 8√2, 16} pixels.
2 different values for the sinusoid's phase: φ ∈ {0, π/2}.
The Gaussian's radius σ was chosen to be equal to λ. Each mask was computed for 3σ
pixels, both horizontally and vertically.
Perfectly round Gaussians were considered, i.e. γ = 1.

This choice of parameters leads to a total of 80 different wavelets, depicted in Figure 10 below.


Figure 10: The complete wavelet mask set. Real wavelets are on the left side,
imaginary on the right. Wavelengths vary from left to right and orientations
from top to bottom.
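For illustration only, the following C++ sketch fills one such mask directly from the wavelet equation of Section 2.1, under the parameter choices listed above (σ = λ, γ = 1); it is a standalone approximation, not the actual mask-file generator used in the implementation, and the exact mask extent may differ.

#include <cmath>
#include <vector>

// Fill a square mask of side (2*halfSize + 1) with Gabor wavelet values
// W(x, y) = exp(-(x'^2 + gamma^2 y'^2) / (2 sigma^2)) * cos(2*pi*x'/lambda + phi),
// where (x', y') are the coordinates rotated by theta.
std::vector<double> MakeGaborMask(double theta, double lambda, double phi,
                                  double sigma, double gamma, int halfSize) {
    const double pi = std::acos(-1.0);
    const int side = 2 * halfSize + 1;
    std::vector<double> mask(side * side);
    for (int y = -halfSize; y <= halfSize; ++y)
        for (int x = -halfSize; x <= halfSize; ++x) {
            const double xr =  x * std::cos(theta) + y * std::sin(theta);
            const double yr = -x * std::sin(theta) + y * std::cos(theta);
            const double envelope = std::exp(-(xr * xr + gamma * gamma * yr * yr)
                                             / (2.0 * sigma * sigma));
            const double carrier  = std::cos(2.0 * pi * xr / lambda + phi);
            mask[(y + halfSize) * side + (x + halfSize)] = envelope * carrier;
        }
    return mask;
}

For example, one of the 80 kernels (θ = π/4, λ = 8 pixels, cosine phase, σ = λ, γ = 1) could be generated with MakeGaborMask(pi / 4, 8.0, 0.0, 8.0, 1.0, 12).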

2.2 Gabor Jets

The EBGM algorithm uses Gabor jets to represent landmark (facial feature) information.
These jets describe the local frequency information around the landmark location and are
essentially a collection of complex Gabor coefficients from the same location in an image. These
coefficients are generated by convolving an image portion around the feature with the full kernel
set described in section 2.1.
For the standard algorithm configuration, all 80 convolution masks are used, so that a Gabor
jet is composed of 40 complex coefficients, each having a real and an imaginary component. As we described
before, each complex coefficient corresponds to a unique orientation and wavelength of the
sinusoid. For the requirements of the algorithm, these complex coefficients are represented in polar
coordinates using the transformation

$$a_j = \sqrt{a_{real}^2 + a_{imag}^2}$$

$$\varphi_j = \begin{cases} \arctan\left(\dfrac{a_{imag}}{a_{real}}\right), & a_{real} > 0 \\[2mm] \arctan\left(\dfrac{a_{imag}}{a_{real}}\right) + \pi, & a_{real} < 0 \\[2mm] \dfrac{\pi}{2}, & a_{real} = 0 \text{ and } a_{imag} > 0 \\[2mm] -\dfrac{\pi}{2}, & a_{real} = 0 \text{ and } a_{imag} < 0 \end{cases}$$

and stored in an array such that $a_j$ and $\varphi_j$ correspond to the $j$-th complex wavelet pair. This is the
internal representation of the feature for a given image, i.e. the Gabor jet for that feature.
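A minimal sketch of this conversion is shown below (standalone C++, hypothetical names): each real/imaginary pair produced by a cosine and a sine mask is stored as a magnitude and a phase; std::atan2 reproduces the case analysis above up to a multiple of 2π.

#include <cmath>
#include <vector>

// One complex Gabor coefficient in polar form.
struct JetCoefficient {
    double magnitude;   // a_j
    double phase;       // phi_j, in radians
};

// A Gabor jet: one polar coefficient per (orientation, wavelength) pair.
using GaborJet = std::vector<JetCoefficient>;

// Convert raw convolution results (cosine mask = real part, sine mask =
// imaginary part) into the polar representation used by the algorithm.
GaborJet ToPolar(const std::vector<double>& real, const std::vector<double>& imag) {
    GaborJet jet(real.size());
    for (std::size_t j = 0; j < real.size(); ++j) {
        jet[j].magnitude = std::sqrt(real[j] * real[j] + imag[j] * imag[j]);
        jet[j].phase     = std::atan2(imag[j], real[j]);
    }
    return jet;
}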

2.2.1 Gabor Jet Similarity Measures

Measuring the similarity of two jets is fundamental for the EBGM algorithm, as it is
required for both landmark localization (feature estimation) and face graph similarity measurement
(identification). Three Gabor jet similarity measures are generally used (from [3] and [7]).
The simplest metric is referred to as Magnitude Similarity (or MS). This measure will only
compute the similarity of the energy of the frequencies, while the phase information is not utilized.
The result is a similarity measure based on the covariance of the magnitudes:

$$S_a(J, J') = \frac{\sum_{j=1}^{N} a_j a'_j}{\sqrt{\sum_{j=1}^{N} a_j^2 \sum_{j=1}^{N} a_j'^2}}$$
where N is the number of complex wavelet coefficients constituting the jet. While this method is
tolerant of small displacements between the features being compared, it is completely unaffected by
their differences in phase. Thus, it measures the energy of the frequency responses but does not
react to out-of-phase frequency components.
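As a concrete illustration, a straightforward C++ sketch of the magnitude similarity is given below (reusing the GaborJet type from the previous sketch; this is not the thesis code itself):

#include <cmath>

// Magnitude Similarity S_a(J, J'): normalized covariance of the magnitudes.
// Phase information is deliberately ignored.
double MagnitudeSimilarity(const GaborJet& j1, const GaborJet& j2) {
    double num = 0.0, n1 = 0.0, n2 = 0.0;
    for (std::size_t j = 0; j < j1.size(); ++j) {
        num += j1[j].magnitude * j2[j].magnitude;
        n1  += j1[j].magnitude * j1[j].magnitude;
        n2  += j2[j].magnitude * j2[j].magnitude;
    }
    return num / std::sqrt(n1 * n2);
}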
The second metric is very similar to correlation:

$$S_\varphi(J, J') = \frac{\sum_{j=1}^{N} a_j a'_j \cos\left(\varphi_j - \varphi'_j\right)}{\sqrt{\sum_{j=1}^{N} a_j^2 \sum_{j=1}^{N} a_j'^2}}$$




This function is called the Phase Similarity (or PS) measure and it effectively computes a similarity
between -1.0 and 1.0. Although it is not exactly a correlation function, this measure yields relative
quantities similar to the correlation of all 40 original complex convolution values. It is based on the
similarity of the magnitudes of the frequency response; however, these values are weighted by the
similarity of the phase angles. Thus, high scores are achieved only when both the magnitudes and
the phase angles are similar.
Each of these first two measures has its own advantages. Phase Similarity will correctly
respond to the phase information in the images; we have seen, however, that this information changes rapidly,
even for small displacements. Therefore, this measure will produce low similarity if the two jets
being compared come from a similar landmark but are displaced by a small amount, whereas
the Magnitude Similarity metric will still produce high similarity under these conditions. In theory,
Magnitude Similarity may produce some false positives because it completely disregards phase
information. Our results, however, indicate that in general Magnitude Similarity is more robust than
Phase Similarity. This is to a great extent due to the image normalization step, described in Section
3.1.
The final similarity measure attempts to correct for small displacements in the Phase
Similarity metric, and is therefore called Displacement Similarity. In essence, it estimates the
similarity as if jet $J$ had been extracted at a displacement $\vec{d}$ from its current location. This new
approach retains the phase information and can compensate for small displacements, yielding the
following equation:

$$S_D(J, J', \vec{d}) = \frac{\sum_{j=1}^{N} a_j a'_j \cos\left(\varphi_j - \varphi'_j - \vec{d}\cdot\vec{k}_j\right)}{\sqrt{\sum_{j=1}^{N} a_j^2 \sum_{j=1}^{N} a_j'^2}}$$


where the displacement vector is

$$\vec{d} = \begin{pmatrix} d_x \\ d_y \end{pmatrix}$$

and

$$\vec{k}_j = \begin{pmatrix} \dfrac{2\pi}{\lambda_j}\cos\theta_j \\[2mm] \dfrac{2\pi}{\lambda_j}\sin\theta_j \end{pmatrix}$$


is a vector such that it points in the direction of the sinusoid component of the Gabor wavelet and
has a magnitude equal to the frequency of the sinusoid. The displacement along the direction of the
sinusoid is therefore simply the dot product of these two vectors.
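A sketch of the displacement similarity in the same style follows (again reusing the GaborJet type; the wave vectors k_j are assumed to be precomputed from the orientation and wavelength of each kernel). Setting d to the zero vector recovers the phase similarity.

#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

// Displacement Similarity S_D(J, J', d): phase differences are corrected by
// the projection of the displacement d onto the wave vector k_j of each
// kernel (direction theta_j, magnitude 2*pi / lambda_j).
double DisplacementSimilarity(const GaborJet& j1, const GaborJet& j2,
                              const std::vector<Vec2>& k, const Vec2& d) {
    double num = 0.0, n1 = 0.0, n2 = 0.0;
    for (std::size_t j = 0; j < j1.size(); ++j) {
        const double phaseDiff =
            j1[j].phase - j2[j].phase - (d.x * k[j].x + d.y * k[j].y);
        num += j1[j].magnitude * j2[j].magnitude * std::cos(phaseDiff);
        n1  += j1[j].magnitude * j1[j].magnitude;
        n2  += j2[j].magnitude * j2[j].magnitude;
    }
    return num / std::sqrt(n1 * n2);
}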
This similarity measure is based on both the magnitude and phase components of the
coefficients and can compensate for the difference in phases. The only setback is that in order to
maximize this metric, the value of the displacement vector is needed. However, in the feature
estimation step of the EBGM algorithm, it is exactly this vector that we wish to compute; hence
direct maximization of the $S_D$ metric is not possible. Section 3.3 discusses a number of approaches
to this problem.

2.3 Face Graphs

A Face Graph is the structure used to internally represent a face in the context of the EBGM
algorithm. Both training and testing images have a corresponding Face Graph, which stores
information about all the features defined in the HumanScan database. Every node in the graph
contains the location of a landmark and a Gabor jet extracted at that fiducial point. The algorithm
computes the similarity of two faces using the data stored in their respective Face Graphs.
Similarity can be computed as a function of the landmark jets, the feature locations, or even both.
Face Graph similarity metrics will be discussed in detail in Chapter 4.
The Face Graph structure is useful for two reasons. First, a Face Graph contains information
about the images that is useful for recognition; no further information is necessary. Second, after
the Gabor jets have been extracted for all fiducial points on the face and the Face Graph has been
constructed, the original image can be safely discarded. This leads to reduced storage requirements
and also allows recognition to be performed much faster. Figure 11 depicts the EBGM steps needed
to construct a Face Graph.


Figure 11: The steps required to transform an original image to its
corresponding Face Graph (from [3]).
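Conceptually, a Face Graph is a very small structure; the sketch below (hypothetical names, reusing the Vec2 and GaborJet types from earlier sketches, not the RAVL-based classes actually used) stores one node per fiducial point:

#include <string>
#include <vector>

struct FiducialNode {
    Vec2 location;    // landmark coordinates in the normalized image
    GaborJet jet;     // Gabor jet extracted at that location
};

// One Face Graph per image; once it is built, the image itself can be discarded.
struct FaceGraph {
    std::string subjectId;            // class label of the person in the image
    std::vector<FiducialNode> nodes;  // one node per fiducial point (20 here)
};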

2.4 Face Bunch Graph

The Face Bunch Graph (FBG) structure is the algorithm's main model, both for feature
estimation and for identification of testing images. After the training set has been selected, all
corresponding Face Graphs are constructed and then stacked in the FBG. Thus, this structure contains
spectral information for all features across the whole training set. Each node contains the Gabor jets
for that feature from all training images, as well as its average coordinates in the training set. Figure
12 depicts the conceptual structure of the FBG.
One powerful aspect of the FBG is that each bunch of Gabor jets contains many different
examples of its particular fiducial point. For example, the algorithm can provide model jets of eyes
with and without glasses, eyes that are closed and eyes under different lighting conditions. It is also
possible that the algorithm selects jets from different subjects for each landmark on a novel face
being processed. For a particular face the algorithm can mix and match eyes, mouths and noses
such that each landmark has the best possible example, and consequently the best possible
localization accuracy. It is obvious therefore that training images should be chosen so that they
represent all possible states of a person's face.


Figure 12: 3D representation of a Face Bunch Graph (from [3]).


















Chapter 3: Feature Estimation

3.1 Image Preprocessing (Normalization)

The image dimensions in the original HumanScan database are 384x286 pixels; moreover,
these images contain a large portion of background which is not necessary for the identification
process and can therefore safely be removed. Furthermore, there was no standard face size and/or
orientation, so the images had to be normalized prior to any further processing. This resulted in a
more or less uniform face size and orientation distribution across the database, thus improving the
feature estimation accuracy and consequently the actual recognition performance. It should be
noted at this point that this normalization is not part of the original EBGM algorithm as proposed in
[3]. Figure 13 shows a flow diagram of this preprocessing step.

[Figure 13 depicts the normalization pipeline as a sequence of steps: read the image and its feature
coordinates from file, calculate the Gabor wavelets, make the necessary axis adjustments, move the
face to the center of the image, rotate the image so that the eyes are parallel to the horizontal plane,
scale the image to a constant distance between the eyes, re-center the face, smooth the image around
the face, crop the unnecessary part of the image, and save the new image and feature coordinates to
output files.]

Figure 13: Flow diagram of the geometric normalization step.
The basic preprocessing procedure was termed geometric normalization, since it involves
only scaling, rotation and horizontal or vertical shifts of the whole image. The goal is to bring the
face at the center of the image, rotate and resize it appropriately so that the eyes are aligned at
preselected positions and at a predefined distance. The normalization parameters (eye coordinates
and distance) are of course constant for the whole database and were selected after some
experimentation so that applying convolution with even the largest kernel would not result in the
overlap region extending outside the image border. After these steps, the remaining background is
discarded by cropping a rectangular area around the face. To avoid abrupt intensity variations at the
cropped image border which would interfere with the frequency content of the image at these
locations and disrupt the convolution results, the intensity around the face is smoothed as proposed
in [7]. The normalized image size is considerably reduced when compared with the original
(261x266 pixels), leading to faster processing in the following steps of the algorithm. Figure 14
shows an example of face normalization, whereas Table 2 lists the normalization parameters and
the values chosen in our implementation.


Figure 14: Image Normalization example.

Parameter Description Value
Xmid X-coordinate of eyes midpoint 85
Ymid Y-coordinate of eyes midpoint 192
DEYES Distance between eyes (in pixels) 60
BORDER Width of smoothing border (in pixels) 40
U_SMOOTH Distance above the eyes midpoint where smoothing starts (in pixels) 40
D_SMOOTH Distance below the eyes midpoint where smoothing starts (in pixels) 145
L_SMOOTH Distance to the left of the left eye where smoothing starts (in pixels) 60
R_SMOOTH Distance to the right of the right eye where smoothing starts (in pixels) 60

Table 2: List of parameters used for the geometric normalization step.
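As an illustration of how the parameters of Table 2 are used, the following C++ sketch computes the similarity transform (rotation, scaling and translation) that maps the annotated eye positions onto the preselected positions; names and structure are hypothetical (it reuses the Vec2 type from the earlier sketches), and the actual implementation additionally performs the smoothing and cropping steps.

#include <cmath>

// Normalization targets (see Table 2).
const double kXmid = 85.0, kYmid = 192.0;   // target eye midpoint
const double kEyeDistance = 60.0;           // target distance between the eyes

// Similarity transform: p -> scale * R(angle) * p + offset.
struct SimilarityTransform {
    double scale, angle;   // angle in radians
    Vec2 offset;
    Vec2 apply(const Vec2& p) const {
        const double c = std::cos(angle), s = std::sin(angle);
        return { scale * (c * p.x - s * p.y) + offset.x,
                 scale * (s * p.x + c * p.y) + offset.y };
    }
};

// Rotate so the eyes become horizontal, scale to the fixed eye distance and
// translate the eye midpoint to (Xmid, Ymid).
SimilarityTransform NormalizationTransform(const Vec2& rightEye, const Vec2& leftEye) {
    const double dx = leftEye.x - rightEye.x;
    const double dy = leftEye.y - rightEye.y;
    SimilarityTransform t;
    t.angle = -std::atan2(dy, dx);                 // undo the eye tilt
    t.scale = kEyeDistance / std::hypot(dx, dy);   // enforce the eye distance
    const Vec2 mid = {(rightEye.x + leftEye.x) / 2.0, (rightEye.y + leftEye.y) / 2.0};
    const double c = std::cos(t.angle), s = std::sin(t.angle);
    t.offset = { kXmid - t.scale * (c * mid.x - s * mid.y),
                 kYmid - t.scale * (s * mid.x + c * mid.y) };
    return t;
}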

Although this preprocessing step is in general beneficial, there are some images for which a
part of the face (usually the tip of the chin or the upper temple and hair) is lost in the process. These
images were termed incomplete and removed from the image set so that they would not interfere
with the identification process, resulting in a data set of 1373 images. An example of an incomplete
normalized image is shown in Figure 15 below.


Figure 15: Example of an incomplete image, before and after normalization.

3.2 Landmark Localization

The landmark localization process is divided into two basic steps: initial estimation and
refinement. For the initial estimation step of each fiducial point, the algorithm makes an educated
guess based on the positions of the previously localized face features. Starting with the positions of
the eyes (which are very accurately defined after the normalization process), an estimate for the
position of the $n$-th landmark is obtained using the following weighted average formula

$$\vec{p}_n = \frac{\sum_{i=1}^{n-1} w_{in}\left(\vec{p}_i + \vec{v}_{in}\right)}{\sum_{i=1}^{n-1} w_{in}}$$


where $\vec{p}_i$ are the positions of the $n-1$ previously estimated points, $\vec{v}_{in}$ are the distance vectors
between points $i$ and $n$ (actually, the average distances from the FBG are used here) and finally
$w_{in} = e^{-\left\|\vec{v}_{in}\right\|}$ are appropriate weight factors. The notion behind these weights is that the position of a
feature on a face can be better estimated based on its neighboring features, rather than on the ones
that are further apart. As an example, it is expected that the position of the lower lip will lead to a
better estimate of the chin than that of an eyebrow. Regarding the sequence of visiting the features,
[7] suggests starting from the eyes and then proceeding radially outwards to the edge of the face,
but our experiments indicated that this is not necessary as it does not improve estimation accuracy.
To simplify things, features were visited based on the order they are listed in the point files of the
database. Figure 16 illustrates the first step of the landmark localization step.


Figure 16: The known positions of the eyes and the nose bridge provide an
educated guess for the coordinates of the nose in a novel image (from [7]).
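Under the interpretation of the weights used here (an exponential decay with the length of the average distance vector, which is an assumption of this sketch), the initial estimate can be computed as follows (standalone C++, reusing the Vec2 type from the earlier sketches):

#include <cmath>
#include <vector>

// Initial estimate of landmark n from the previously placed landmarks.
// placed[i]      : positions already estimated on the novel face (i = 1..n-1)
// meanOffsets[i] : average distance vector from landmark i to landmark n,
//                  taken from the Face Bunch Graph.
Vec2 InitialEstimate(const std::vector<Vec2>& placed,
                     const std::vector<Vec2>& meanOffsets) {
    Vec2 sum = {0.0, 0.0};
    double totalWeight = 0.0;
    for (std::size_t i = 0; i < placed.size(); ++i) {
        // Closer landmarks receive exponentially larger weights (assumed form).
        const double w = std::exp(-std::hypot(meanOffsets[i].x, meanOffsets[i].y));
        sum.x += w * (placed[i].x + meanOffsets[i].x);
        sum.y += w * (placed[i].y + meanOffsets[i].y);
        totalWeight += w;
    }
    return {sum.x / totalWeight, sum.y / totalWeight};
}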

After obtaining this initial estimate, the algorithm attempts to improve the accuracy by
using Gabor jet similarity metrics. The basic idea is to extract a Gabor jet from the initially
estimated point and then compare it with all jets contained in the FBG for that feature. The jet with
the highest similarity is termed the local expert for that feature and is used to finally determine the
refinement in the features position on the testing image. An interesting point is that this selection
process is followed for all fiducial points separately and it is therefore possible that not all fiducial
points for a testing image will have their local experts from the same training image. This once
again supports the argument that a complex enough (in terms of image diversity) FBG can greatly
improve the algorithm performance. Out of the three Gabor jet similarity metrics presented in
Section 2.2, the first two (namely Magnitude and Phase Similarity respectively) can be used to find
the local expert as described previously. We decided on using $S_\varphi$ since it incorporates phase
information, thus resulting in better estimates.
After the local expert for a feature in a test image has been selected, the third Gabor jet
similarity metric (Displacement Similarity) is used to refine the estimated position by computing a
displacement vector along which the estimated jet must be moved in order to maximize the
similarity with the local expert. The final feature position is obtained by adding the displacement
vector to the initial position estimate. Although the $S_D$ metric is very useful because it captures both
magnitude and phase information and also attempts to correct for small perturbations around a
given position in the image, the drawback for our case is that it cannot be maximized directly since
we lack the exact knowledge of the displacement vector, which is what we are actually trying to
compute. To overcome this issue, four alternative displacement estimation methods can be used,
which were adopted with slight modifications from [7].

3.3 Displacement Estimation Methods

To properly refine the initially estimated landmark locations, the EBGM algorithm relies on
the model jets that are stored in the FBG. The obvious way to find a landmark location would be to
extract a novel Gabor jet from every point in the image and compare these jets with the models.
The pixel whose jet best matches one of the model jets indicates the correct location of the
landmark. This approach however is computationally expensive, since computing a Gabor jet for
every pixel in an image would take a considerable amount of time. Because landmark locations in
an image are always contained in a small region (especially after the normalization step), it is only
necessary to search a small portion of the image around the location of the initial estimate. Using
the $S_D$ metric can greatly reduce the number of wavelet convolutions that must be performed and
therefore increase speed, since it allows similarity to be estimated under small displacements.
The landmark locations can therefore be found by extracting one jet in the region close to
the actual landmark position and then searching that local area for a maximum using the
Displacement Similarity metric. The obvious way to find this maximum would be to use a direct
solution method. The goal of such a method would be to maximize the $S_D$ function with respect to a
displacement in two dimensions, by taking the two partial derivatives, setting them to zero and then
finding values for the displacement vector components that together satisfy both equations. Although
computing the partial derivatives of the function is simple, the resulting function would be a sum of
40 distinct sinusoids with various amplitudes and frequencies. Unfortunately, there is no obvious
and succinct analytic method to find the zeros of those two functions and therefore, alternative
methods for finding the optima must be employed.

3.3.1 Displacement Estimation Predictive Step (DEPS)

The first Displacement Estimation Method suggests performing a small-angle two-term
Taylor expansion to approximate the cosine terms of the $S_D$ metric. By replacing the cosine terms
in the equations with this Taylor expansion, the algorithm can approximate the similarity function
immediately surrounding zero displacement. It logically follows that if the similarity function has a
maximum nearby, its approximation will also have a similar maximum. The approximation takes
the form:

$$\cos\varphi \approx 1 - \frac{\varphi^2}{2}$$

$$S_D(J, J', \vec{d}) \approx \frac{\sum_{j=1}^{N} a_j a'_j \left[1 - 0.5\left(\varphi_j - \varphi'_j - \vec{d}\cdot\vec{k}_j\right)^2\right]}{\sqrt{\sum_{j=1}^{N} a_j^2 \sum_{j=1}^{N} a_j'^2}}$$


It is subsequently fairly straightforward to solve for the displacement vector using the
following formula

$$\begin{pmatrix} d_x \\ d_y \end{pmatrix} = \frac{1}{\Gamma_{xx}\Gamma_{yy} - \Gamma_{xy}\Gamma_{yx}} \begin{pmatrix} \Gamma_{yy} & -\Gamma_{yx} \\ -\Gamma_{xy} & \Gamma_{xx} \end{pmatrix} \begin{pmatrix} \Phi_x \\ \Phi_y \end{pmatrix}$$

where

$$\Phi_x = \sum_{j=1}^{N} a_j a'_j k_{jx}\left(\varphi_j - \varphi'_j\right), \qquad \Gamma_{xy} = \sum_{j=1}^{N} a_j a'_j k_{jx} k_{jy}$$
and the remaining quantities are defined accordingly. This method is called Displacement Estimation
Predictive Step (DEPS) because the required displacement vector is predicted using an
approximation to the $S_D$ metric in a single step. It is quite fast and works well when the true
optimum is near. However, because this method approximates the similarity function surrounding
zero displacement, it produces poor estimates as the distance from the novel jet to the true
landmark location increases. Another disadvantage of this method is that it does not account for
the periodic nature of the similarity function.
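A direct C++ sketch of this single predictive step follows (reusing the GaborJet and Vec2 types and the precomputed wave vectors k_j from the earlier sketches; an illustration of the formula above, not the thesis code):

#include <cmath>
#include <vector>

struct Displacement { double x, y; bool valid; };

// One DEPS step: accumulate the sums Phi_x, Phi_y and Gamma_xx ... Gamma_yy
// defined above and solve the resulting 2x2 linear system for (dx, dy).
Displacement SolveDeps(const GaborJet& model, const GaborJet& novel,
                       const std::vector<Vec2>& k) {
    double phiX = 0, phiY = 0, gxx = 0, gxy = 0, gyy = 0;
    for (std::size_t j = 0; j < model.size(); ++j) {
        const double aa   = model[j].magnitude * novel[j].magnitude;
        const double dphi = model[j].phase - novel[j].phase;
        phiX += aa * k[j].x * dphi;
        phiY += aa * k[j].y * dphi;
        gxx  += aa * k[j].x * k[j].x;
        gxy  += aa * k[j].x * k[j].y;
        gyy  += aa * k[j].y * k[j].y;
    }
    const double det = gxx * gyy - gxy * gxy;        // Gamma_xy = Gamma_yx here
    if (std::fabs(det) < 1e-12) return {0.0, 0.0, false};
    return { (gyy * phiX - gxy * phiY) / det,
             (gxx * phiY - gxy * phiX) / det, true };
}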

3.3.2 Displacement Estimation Predictive Iteration (DEPI)

A straightforward extension to the previous method is to repeat this process a number of
times in order to obtain iteratively higher similarity values until there is no improvement of the
estimate, in which case the resulting method is called Displacement Estimation Predictive Iteration
(DEPI). The algorithm starts with a displacement vector of (0, 0). The maximum of the Taylor
expansion of $S_D$ is computed and the result becomes the new estimate for $\vec{d}$. The expansion is
subsequently recomputed around this new point and is used to further refine the estimate. The
maximum number of iterations is a user-defined parameter.
Although the underlying idea is simple, it is important to implement it carefully in order to
achieve improved performance. The first parameter that must be considered is whether the
algorithm will be allowed to choose a different local expert in each iteration (i.e. a different image
from the training set as the best match), or will simply stick with the local expert obtained by the
first pass. Another setting that must be carefully chosen is the terminating condition; that is,
whether it is sufficient to observe no change in either dimension of the displacement vector or in
both of them before the search terminates. Perhaps the most vital inclusion is to ensure that further
iterations will actually increase similarity; this can be accomplished by allowing the algorithm to
backtrack to a previous step if the new displacement vector estimate gives a lower similarity score.
Table 3 summarizes the results obtained (in terms of total RMS estimation error across all features
on the face in pixels) when a variety of combinations of these constraints is employed. When no
backtracking is applied, it is clear that it is preferable to allow for multiple local experts and
terminate the search when the displacement vector becomes zero in either direction. It should also
be clear that the use of backtracking improves estimation accuracy considerably. Based on these
remarks, the DEPI method was implemented in the MLE-FDZ version with backtracking.

Total RMS Estimation Error MLE-BDZ MLE-FDZ SLE-BDZ SLE-FDZ
w/o Backtracking 34.5781 26.0042 40.5274 35.0626
w/ Backtracking 23.5631 23.5656 25.1149 23.644

Table 3: Comparison of terminating criteria and local expert selection options for the DEPI method in terms
of the total RMS estimation error for all facial features. MLE: Multiple Local Experts, SLE: Single Local
Expert, BDZ: Both Displacements Zero, FDZ: First Displacement Zero.
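Sketched under the same assumptions as the earlier code fragments (and refining the displacement analytically by re-expanding around the current estimate, rather than re-extracting the jet at the shifted pixel as the full implementation may do), the DEPI loop with backtracking looks roughly as follows:

#include <cmath>
#include <vector>

// Iterative refinement (DEPI): repeat the DEPS prediction around the current
// displacement estimate and keep only steps that improve the displacement
// similarity; otherwise backtrack and stop.
Vec2 RefineDepi(const GaborJet& expert, const GaborJet& novel,
                const std::vector<Vec2>& k, int maxIterations) {
    Vec2 d = {0.0, 0.0};
    double bestScore = DisplacementSimilarity(expert, novel, k, d);
    for (int it = 0; it < maxIterations; ++it) {
        // DEPS step with the phase differences corrected by the current d.
        double phiX = 0, phiY = 0, gxx = 0, gxy = 0, gyy = 0;
        for (std::size_t j = 0; j < expert.size(); ++j) {
            const double aa   = expert[j].magnitude * novel[j].magnitude;
            const double dphi = expert[j].phase - novel[j].phase
                              - (d.x * k[j].x + d.y * k[j].y);
            phiX += aa * k[j].x * dphi;
            phiY += aa * k[j].y * dphi;
            gxx  += aa * k[j].x * k[j].x;
            gxy  += aa * k[j].x * k[j].y;
            gyy  += aa * k[j].y * k[j].y;
        }
        const double det = gxx * gyy - gxy * gxy;
        if (std::fabs(det) < 1e-12) break;
        const Vec2 candidate = {d.x + (gyy * phiX - gxy * phiY) / det,
                                d.y + (gxx * phiY - gxy * phiX) / det};
        const double score = DisplacementSimilarity(expert, novel, k, candidate);
        if (score <= bestScore) break;       // no improvement: keep previous estimate
        d = candidate;
        bestScore = score;
    }
    return d;
}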

3.3.3 Displacement Estimation Grid Search (DEGS)

The remaining two displacement estimation approaches are also parametric, but do not
attempt to approximate the $S_D$ metric. Rather, they maximize the similarity by trying out different
values for the displacement vector according to a specific search strategy. The simplest idea is to
search a square area around the initial estimate exhaustively, in which case we are using the
Displacement Estimation Grid Search (DEGS) method and the grid size (i.e., the length of the
square's side) is the user-specified parameter. A pseudo-code segment for this method is presented
in Figure 17. The displacement that produces the best similarity is the estimated displacement of
the novel jet, and in this way the algorithm effectively estimates the distance from the novel jet to
the true location of the landmark.

    J  = ExtractJet(ModelImage, ModelPoint);
    J' = ExtractJet(NovelImage, EstimatePoint);
    for dx = -grid_length/2; dx <= grid_length/2; dx++
        for dy = -grid_length/2; dy <= grid_length/2; dy++
            RefinedPoint = Maximum(S_D(J, J', d));
        end
    end
Figure 17: Pseudo-code for the DEGS method.

3.3.4 Displacement Estimation Local Search (DELS)

A more elaborate approach is to start from the initially estimated position and search its four
immediate neighboring pixels. The neighbor that gives the highest similarity becomes the starting
point for the next step, provided its similarity is higher than that of the current estimate. This
process, called Displacement Estimation Local Search (DELS), terminates either when none of the
neighbors offers any similarity improvement or when a maximum, user-defined, number of steps
have been performed. DELS in general performs a smart search in the same area as DEGS,
although it could be possible to proceed out of the rectangular grid employed by DEGS if the
similarity scores lead the search in one direction for many consecutive steps.
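In the same hedged style, a local-search sketch is given below (examining all four pixel neighbours at each step for simplicity, whereas the implementation only needs to evaluate the three new ones; it reuses the types and DisplacementSimilarity sketch from earlier sections):

#include <vector>

// Local search (DELS): starting from zero displacement, repeatedly move to
// whichever immediate pixel neighbour gives the highest displacement
// similarity, as long as this improves on the current score.
Vec2 LocalSearch(const GaborJet& expert, const GaborJet& novel,
                 const std::vector<Vec2>& k, int maxSteps) {
    Vec2 d = {0.0, 0.0};
    double bestScore = DisplacementSimilarity(expert, novel, k, d);
    const Vec2 neighbours[4] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
    for (int step = 0; step < maxSteps; ++step) {
        Vec2 bestMove = d;
        double bestMoveScore = bestScore;
        for (const Vec2& n : neighbours) {
            const Vec2 candidate = {d.x + n.x, d.y + n.y};
            const double score = DisplacementSimilarity(expert, novel, k, candidate);
            if (score > bestMoveScore) { bestMove = candidate; bestMoveScore = score; }
        }
        if (bestMoveScore <= bestScore) break;   // no neighbour improves the score
        d = bestMove;
        bestScore = bestMoveScore;
    }
    return d;
}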

3.4 Comparative Analysis of DE Methods

In order to decide upon the most appropriate method, the following two criteria were used:
feature estimation accuracy and speed. The first was measured based on the average RMS
estimation error across all features and the whole test image set, while speed characteristics of the
four methods were deduced from average feature estimation time per image measurements.
Furthermore, per-feature RMS estimation error estimates were also obtained, indicating for which
of the fiducial points consistently better or worse estimates can be achieved. The DEPS method was
used as the baseline and various choices for the parameters of the remaining three approaches were
made; specifically, the maximum number of iterations for DEPI were 3, 6 and 10, the grid size for
DEGS was 8, 12 and 16 pixels, and DELS performed at maximum 10, 25 and 50 search steps.
Simulations were run on a P4/2.66GHz with 512 MB RAM under SuSE Linux 9.1. The following
sections describe the results obtained in the various tests and outline the strengths and weaknesses
of each.

3.4.1 FBG Creation Time

The first aspect studied was the speed with which the FBG can be created. When using
between 1 and 20 training images per class to create the FBG, the training time increases pretty
much linearly, at a rate between 5-6 seconds per 21 training images added. The averages are
calculated over all different DE methods for the given number of training images, since FBG
creation time is independent of the DE method used. For the highest number of training images,
420, FBG creation takes less than two minutes. However, it is highly unlikely that such a high
volume of training data will be used in an actual system. As the feature estimation results indicate
(and will be showcased in the subsequent sections), the estimation accuracy improvement is more
than outweighed by the significant increase of processing time per image. Figure 18 depicts the
increase in FBG creation time as more training images are used.

[Plot: Effect of training set size on FBG creation time; x-axis: Training Images per Class (1-20), y-axis: Time (sec).]

Figure 18: FBG Creation Time vs. Training Set Size.
3.4.2 Image Processing Time

The Average Processing Time (APT) per image is defined as the average time needed to
produce the position estimate of all twenty fiducial points in a novel (test) image. It should be
pointed out that APT does not include either the creation of the Face Graph for the test image or its
subsequent storage in file. However, these steps are independent of the DE method used and should
therefore not affect the relative ordering of the various DE methods studied here with respect to
APT; it is expected that they will merely effect a uniform increase for APT in all cases. The general
trend observed was an increase in APT as more images were added to the training set, as would be
naturally expected.
The first goal for this part of the experiments was to study the three parametric DE methods
and see what (if any) the effect of the parameter on the APT was. Subsequently, the best parameter
choice was selected as a representative of this method and compared directly against all other DE
methods. This analysis, coupled with the RMS error results that will be described in the next
section, was meant to identify the most appropriate method in terms of speed, accuracy and the
combination of (or trade-off between) the two decisive factors for our system.
The results for the DEPI method are depicted in Figure 19 and offer a large number of clues
and insights as to the strengths and weaknesses of this approach. For less than four training images
per class (84 in total), the APT is a steadily declining figure, which seems to indicate that the
increasing size of the training set reduces the required iterations per point. This is verified by the
plot of the average iterations per point (Figure 20). However, for more than five training images per
class the APT begins to increase again, although the number of average iterations per point still
declines.
The answer to this seemingly contradictory set of results is twofold: first of all, the rate of
average iterations per point is steadily decreasing and secondly (and most importantly) there is
another factor that affects APT; this is the time it takes to choose the local expert for each fiducial
point from the FBG. Simply put, as the training set size increases, there are more candidates for the
local expert and it is necessary that all are examined sequentially to determine the best match for
that role. This effect outbalances the reduction in APT caused by the reduction in the average
iterations per point, leading to an increased APT. Since by that point the training set size increases
by 21 between successive experiments, it is no wonder that the APT shows this significant increase.
As a last but certainly crucial point, note that increasing the maximum number of allowable
iterations from three to six and then to ten does not seem to have much of an impact on the APT.
This remark is again supported by the average iterations per point figures and shows that in fact the
maximum number of iterations is seldom if ever needed, mainly because of all the checks enforced
[Plot: Effect of maximum number of iterations for DEPI in terms of APT; x-axis: Training Images per Class, y-axis: APT (sec); curves for 3, 6 and 10 iterations.]

Figure 19: Effect of DEPI maximum number of iterations on APT.

between iterations in the code to ensure that no unnecessary and even harmful (in the sense of
worsening the estimate) calculations are performed. Since pretty much the same calculations are
performed in all three cases, the estimation accuracy results should be very close. This is indeed the
case, as described in the next section. In the light of these results, it was straightforward to choose a
maximum number of three iterations as the representative of the DEPI method.

[Plot: Effect of maximum number of iterations for DEPI in terms of average iterations per feature; x-axis: Training Images per Class, y-axis: Average Iterations per Feature; curves for 3, 6 and 10 iterations.]

Figure 20: Effect of DEPI maximum number of iterations on average iterations
needed per feature.
The next method studied was DEGS. Since this is in essence an exhaustive (brute-force)
search of a specified area around the initial estimate provided by the FBG, one would expect the
calculations involved to be very cumbersome and even prohibitively time-consuming for a
real-time system (a view also supported in [7]). It turned out, however, that the difference from the
DEPS method (which served as the reference) was in most cases negligible; in some cases DEGS
was even faster than DEPS, although this is probably due to the APT jitter caused by other Linux
processes executing in parallel with our code. Increasing the grid size from eight to twelve and then
to sixteen pixels naturally resulted in increased APT, although the differences were smaller than
expected, as shown in Figure 21 below. The DEGS_8 variant was chosen as the representative of
this method.
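As a rough illustration of the structure of this search, a minimal DEGS-style refinement is sketched below; extract_jet and similarity are assumed helpers (jet extraction at integer pixel coordinates and a jet similarity such as the magnitude similarity sketched earlier), and whether the stated grid length is the full side or the half-side of the search window is an assumption of this sketch:

import numpy as np

def degs_refine(image, x0, y0, expert_jet, extract_jet, similarity, grid_len=8):
    # Exhaustively probe every pixel of a square window centred on the initial
    # FBG estimate (x0, y0) and keep the position whose jet is most similar to
    # the local expert.
    half = grid_len // 2
    best_pos, best_sim = (x0, y0), -np.inf
    for dx in range(-half, half + 1):
        for dy in range(-half, half + 1):
            sim = similarity(extract_jet(image, x0 + dx, y0 + dy), expert_jet)
            if sim > best_sim:
                best_pos, best_sim = (x0 + dx, y0 + dy), sim
    return best_pos, best_sim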

[Plot: APT (sec) versus training images per class for DEGS grid lengths of 8, 12 and 16 pixels.]
Figure 21: Effect of DEGS grid length on APT.

The final parametric method was DELS. Being essentially a more sophisticated and refined
variant of DEGS, it does not perform an exhaustive search, which in principle makes it faster than
DEGS. Although three pixels are examined at each step, the number of steps is such that the total
number of examined pixels is generally smaller than in a full grid; furthermore, each step requires
only one jet to be extracted. Despite the additional checks and flags used in the code, this method
remains slightly faster than DEGS, in accordance with intuition. In this case, as with DEPI,
allowing a higher maximum number of steps has very little impact on the actual number of steps
needed on average per point, as indicated by Figure 22. This is also reflected in Figure 23 (with
slight aberrations due to the APT jitter discussed already), leading to the selection of DELS_10 as
the representative of this method.
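A greedy local search in the same spirit can be sketched as follows; the actual DELS code examines three pixels per step and extracts only one new jet per step, so the four-neighbour probe below is merely an illustration of the early-terminating, non-exhaustive character of the method, not its exact probing pattern:

def dels_refine(image, x0, y0, expert_jet, extract_jet, similarity, max_steps=10):
    # Move the estimate only while the similarity to the local expert improves,
    # stopping after max_steps or at the first step that brings no improvement.
    pos = (x0, y0)
    best = similarity(extract_jet(image, *pos), expert_jet)
    for _ in range(max_steps):
        candidates = [(pos[0] + dx, pos[1] + dy)
                      for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        sims = [similarity(extract_jet(image, x, y), expert_jet) for x, y in candidates]
        i = max(range(len(sims)), key=sims.__getitem__)
        if sims[i] <= best:
            break
        pos, best = candidates[i], sims[i]
    return pos, best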

[Plot: average steps per feature versus training images per class for maximum step limits of 10, 25 and 50.]
Figure 22: Effect of DELS maximum number of steps on average steps needed per
feature.

[Plot: APT (sec) versus training images per class for maximum step limits of 10, 25 and 50.]
Figure 23: Effect of DELS maximum number of steps on APT.
After choosing the representatives for the three parametric methods, the final step was to
compare them against DEPS and each other. As Figure 24 indicates, in all cases, DEPI_3 is clearly
the slowest of the four methods, while the other three are almost always very close to each other,
with DEPS usually being the fastest of the four. Therefore, DEPS, DEGS_8 and DELS_10 are
practically equivalent with respect to their speed.

[Plot: APT (sec) versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10.]
Figure 24: Speed comparison of all DE Methods.

3.4.3 Feature Estimation RMS Error

The second very important factor to be considered is how well each method was able to
provide an estimate for the fiducial features on a test image. The procedure followed here is similar
to the one described in the previous section: first, a representative of each parametric method was
selected, and then these three representatives were pitted against the reference method (DEPS). The
main figure of comparison was the average RMS feature estimation error over all fiducial points
and images.
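For reference, this error figure can be computed as in the short sketch below; whether the RMS is taken over images first and then averaged over points (as assumed here) or accumulated in a single pass does not change the qualitative comparisons that follow:

import numpy as np

def average_rms_error(estimated, ground_truth):
    # estimated, ground_truth: arrays of shape (n_images, n_points, 2) holding
    # (x, y) coordinates. Returns the RMS displacement per fiducial point and
    # its average over all points and images.
    estimated = np.asarray(estimated, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    sq_dist = np.sum((estimated - ground_truth) ** 2, axis=-1)   # (n_images, n_points)
    rms_per_point = np.sqrt(sq_dist.mean(axis=0))                # RMS over images
    return rms_per_point, rms_per_point.mean()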
For the DEPI variants, it has already been pointed out in the previous section that in most
cases only a few iterations are needed, therefore allowing a higher maximum number of iterations
should not affect performance. This is clearly the case as can be seen by the related plots (Figure
25), where the three curves are practically overlapping. DEPI_3 was chosen somewhat arbitrarily
as the representative, mainly because it was also the representative chosen in the APT comparison.
A similar behavior is evident for the DELS average RMS error plots (Figure 26), leading to the
choice of DELS_10 as the family representative.

[Plot: average RMS error versus training images per class for DEPI_3, DEPI_6 and DEPI_10.]
Figure 25: Comparison of DEPI variants in terms of average RMS estimation error.

[Plot: average RMS error versus training images per class for DELS_10, DELS_25 and DELS_50.]
Figure 26: Comparison of DELS variants in terms of average RMS estimation error.
Moving on to the DEGS family of methods, a small surprise awaited: increasing the grid size
actually reduces estimation accuracy (Figure 27). The most probable cause of this phenomenon is the
periodic nature of the convolution masks (Gabor wavelets), which allows pixels near the edge of the
grid to have comparable or even higher similarity to the local expert than the true feature position.
Since the face preprocessing (geometric normalization) leads to a relative alignment of fiducial
features across all images, the true displacement should not be very large, especially for larger
training set sizes, supporting the argument that searching larger grids is not only a waste of time but
can also produce misleading results. Because of its superior performance, DEGS_8 was nominated
the family representative; given that it is both the fastest and the most accurate variant examined, it
was the de facto choice for all subsequent feature estimation tests involving the DEGS method.

[Plot: average RMS error versus training images per class for DEGS_8, DEGS_12 and DEGS_16.]
Figure 27: Comparison of DEGS variants in terms of average RMS estimation error.

Comparing the three parametric representatives with the reference method, we arrive at
some interesting observations (Figure 28). First of all, in almost all cases DEPS is by far the least
accurate method. For up to a total of ten training images, DEPI_3 and DELS_10 are very close in
performance, followed by DEGS_8. The superiority of DELS_10 becomes apparent after that point,
and throughout the rest of the experiments it steadily provides the most accurate estimates. The
performance of DEGS_8 keeps improving, to the point where it becomes better than DEPI_3 for
more than four training images per class. All in all, it is apparent that DELS_10 is the best approach
for training set sizes of practical interest, as suggested in past publications.

[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10.]
Figure 28: Comparison across all DE Methods in terms of average RMS estimation
error.

Concluding this section, we provide four RMS estimation error plots for specific fiducial
points. These four points (inner corner of the right eye [Figure 29], left temple [Figure 30], tip of the
nose [Figure 31] and tip of the chin [Figure 32]) were chosen because they represent some interesting
trends across all methods and numbers of training images: the first consistently gives some of the
best results, the second an average performance but with high fluctuation in the error as the training
set size increases, and the last two are by far the worst-estimated fiducial points. The plots
themselves should point out how representative these fiducial points are, since the trends observed
there are more or less identical to the ones exhibited by the average RMS error plot (Figure 28).

[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10 (inner corner of right eye).]
Figure 29: Comparison across all DE Methods in terms of average RMS estimation
error (inner corner of right eye).

[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10 (left temple).]
Figure 30: Comparison across all DE Methods in terms of average RMS estimation
error (left temple).

[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10 (tip of nose).]
Figure 31: Comparison across all DE Methods in terms of average RMS estimation
error (tip of nose).

[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10 (tip of chin).]
Figure 32: Comparison across all DE Methods in terms of average RMS estimation
error (tip of chin).

3.4.4 Usage Of The Eye Pupil Coordinates

In the first set of experiments described thus far, the positions of the eye pupils were
considered to be exactly known and were not estimated using the DE methods. The effects of this
simplification seem to be negligible, as indicated by Figure 33, which contains the same results as
Figure 28, with the only difference that now the eye pupil coordinates are also estimated using the
FBG and the DE methods. The discrepancies between the two sets of experiments are not
significant, as was expected. One note must be made here: when the eye pupil coordinates are taken
as absolute truth, the average RMS error is computed over the remaining 18 fiducial points, whereas
in the second case the average is naturally computed over all 20 fiducial points. This is the reason
why the average RMS error improves in the second case, even though we now know two features
with less certainty than before, which should logically also have a negative impact on the estimation
of the rest of the fiducial points. Since the estimation error for the eye pupils is considerably lower
than the average error of the first set of experiments (see Figures 34 and 35 for verification), the
improvement in average RMS error in the second set of experiments should be obvious.

[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10, with the eye pupils also estimated.]
Figure 33: Comparison across all DE Methods in terms of average RMS estimation
error (eye pupils are also estimated).
[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10 (right eye pupil).]
Figure 34: Comparison across all DE Methods in terms of average RMS estimation
error (right eye pupil).

[Plot: average RMS error versus training images per class for DEPS, DEPI_3, DEGS_8 and DELS_10 (left eye pupil).]
Figure 35: Comparison across all DE Methods in terms of average RMS estimation
error (left eye pupil).

Chapter 4: Identification Results

4.1 Face Similarity Metrics

After the fiducial points for a testing image have been estimated, the algorithm proceeds to
extract Gabor jets from all those positions and construct the Face Graph, which is then compared
against all training images in the FBG to produce the system's decision for the identification
problem. A number of metrics (proposed in [3] and [7]) were studied in our experiments in order to
determine the strengths and weaknesses of each one. A short discussion of those similarity
measures follows.
The simplest idea is to ignore all information contained in the jets and rely only on the
positions of the fiducial points for identification. A scan across all members of the training set is
performed in order to determine the image for which the average Euclidean distance from the
testing image, taken over all features, is minimized. Although this approach (named Geometry
Similarity, or GeoS for short) is extremely fast, it also gives by far the worst results. This is due to
the normalization step described in Section 3.1, which enforces a nearly uniform placement of the
facial features across all images, making successful identification very difficult even when a very
large number of training images per class is available.
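A minimal sketch of GeoS, with hypothetical data structures (an (n_points, 2) coordinate array per image), is given below; it makes the speed of the metric obvious, since no convolutions are involved at all:

import numpy as np

def geometry_similarity_id(test_points, gallery):
    # gallery maps an image label to its own (n_points, 2) coordinate array.
    # The gallery image with the smallest mean Euclidean distance over all
    # fiducial points is returned as the identity decision.
    test_points = np.asarray(test_points, dtype=float)
    def mean_dist(pts):
        return np.linalg.norm(test_points - np.asarray(pts, float), axis=1).mean()
    return min(gallery, key=lambda label: mean_dist(gallery[label]))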
The main drawback of GeoS is that it does not utilize the information about the surrounding
areas of the fiducial points stored in the Gabor jets. The simplest methods that make use of this
information are Magnitude Similarity (MS) and Phase Similarity (PS) which were discussed in
Section 2.2.1 and used in the previous Chapter in the context of feature estimation. Although the
inclusion of the jet phase characteristics in PS was a definite improvement in that case, for
recognition purposes it is enough to just use MS. This point was also discussed in [3] and was
verified in our experiments. Figure 36 shows the performance of both MS and PS for a single run
with between 1 and 24 training images per class. Although a single run is of course statistically
insignificant, the performance discrepancy between the two methods is still obvious. MS was the
first of the two metrics that were used for extensive experiments including hundreds of runs with a
given number of training images per class, as will be discussed in Section 4.2.1.
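For completeness, the two jet similarities can be sketched as follows, treating a jet as a complex vector of Gabor coefficients; the expressions follow the definitions in [3], while the exact normalization and the face-level averaging used in the thesis code are assumptions of this sketch:

import numpy as np

def ms(jet_a, jet_b):
    # Magnitude similarity: normalized correlation of coefficient magnitudes.
    a, b = np.abs(jet_a), np.abs(jet_b)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def ps(jet_a, jet_b):
    # Phase similarity: magnitudes weighted by the cosine of the per-coefficient
    # phase difference (the displacement term is omitted here).
    a, b = np.abs(jet_a), np.abs(jet_b)
    dphi = np.angle(jet_a) - np.angle(jet_b)
    return float(np.sum(a * b * np.cos(dphi))) / (np.linalg.norm(a) * np.linalg.norm(b))

def face_similarity(jets_a, jets_b, jet_sim=ms):
    # Face-level similarity taken as the average jet similarity over all points.
    return float(np.mean([jet_sim(ja, jb) for ja, jb in zip(jets_a, jets_b)]))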

[Plot: PMC (%) versus training images per class for the MS and PS metrics.]
Figure 36: Comparison of identification performance in terms of Probability of
Misclassification (PMC) for the MS and PS metrics.

A family of more sophisticated metrics was also studied, based on the Displacement
Estimation methods discussed in Chapter 3. The idea is similar to that applied for feature
estimation, in that we try to find a displacement vector for the test image Gabor jet that maximizes
its similarity to the corresponding jet of a training image under the S_D metric. The only method
that was not studied was DEPI, since its execution times were expected to be very long, based on
our experience from feature estimation. In this case, DELS proved slightly inferior to DEGS,
although it is about an order of magnitude faster; hence DEGS_16 was chosen as the representative
of this family of methods for further experimentation.
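A sketch of such a displacement-compensated metric is given below; kvecs is assumed to hold the wave vector of each Gabor kernel, and the use of integer displacements over a fixed square window (with the 16-pixel extent taken as the half-width) is an assumption made purely for illustration:

import numpy as np

def s_d(jet_a, jet_b, d, kvecs):
    # Displacement-compensated phase similarity: the phase of each coefficient
    # is corrected by d . k_j, where k_j is the wave vector of the j-th kernel
    # (kvecs is an (n_coeffs, 2) array).
    a, b = np.abs(jet_a), np.abs(jet_b)
    dphi = np.angle(jet_a) - np.angle(jet_b) - kvecs @ np.asarray(d, float)
    return float(np.sum(a * b * np.cos(dphi))) / (np.linalg.norm(a) * np.linalg.norm(b))

def degs_similarity(jet_a, jet_b, kvecs, half_len=8):
    # DEGS-style metric: exhaustively test displacements in a square window and
    # report the best achievable similarity between the two jets.
    return max(s_d(jet_a, jet_b, (dx, dy), kvecs)
               for dx in range(-half_len, half_len + 1)
               for dy in range(-half_len, half_len + 1))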

4.2 Baseline Identification Performance

4.2.1 Using A Single Training Image Per Class

Our first step was to compare all similarity metrics discussed in the previous section and
decide upon the two best candidates. To this end the algorithm was run 25 times using a single
training image per class and the average ID time and Probability of Misclassification (PMC %)
were computed. Table 4 lists the corresponding results.

Metric ID Time (sec) PMC (%)
GeoS 0.000148 96.60
MS 0.0046 33.22
PS 0.0056 46.53
DEPS 0.0133 48.78
DEGS_16 0.458 25.31
DELS_25 0.044 29.71

Table 4: Comparison of speed and accuracy of face similarity metrics using a single training image per
class.

Starting from GeoS, we can see that it is an extremely fast metric, but its recognition
performance is clearly unacceptable. MS is the second fastest metric and also performs quite well
given the scarcity of training images. PS is both slower and less accurate than MS; this was
expected and is consistent with the discussion in the previous section and the plots in Figure 36.
Moving on to the DE-based metrics, it is evident that DEPS is also a poor solution. DELS_25
performs somewhat better than MS, but it is ten times slower on average. Finally, DEGS_16 is the
most accurate metric by far, but this advantage is overshadowed by its large run times; it is ten
times slower than DELS_25 and consequently 100 times slower than MS. As a final note, it should
be pointed out that the reported ID times refer only to the actual decision process; the feature
estimation and Face Graph creation times are not included. However, these times are independent
of the metric used for identification.

4.2.2 MS vs. DEGS_16

The previous section described the experiment carried out in order to select the most
appropriate metrics for identification. Table 4 showed that DEGS_16 has the best identification
performance of all metrics considered. However, its slow execution times called for the selection of
a second metric that could provide a compromise between recognition speed and accuracy. The MS
measure was considered for this role, since it is considerably faster than DEGS_16 and still
provides an adequate identification performance. This section discusses a series of experiments that
were carried out in order to compare these two metrics more thoroughly. A single run with between
1 and 24 training images per class was considered for both MS and DEGS_16; although the results
cannot be considered conclusive with so few runs, the emerging trends are nonetheless interesting.
The first aspect that was studied is the performance degradation in terms of PMC % when
switching from DEGS_16 to MS. As can be seen from Figure 37, the discrepancy between the two
methods starts off at around 7% for one training image per class but quickly drops; in fact, it stays
below 2% for more than eight training images per class. This indicates that using MS is a welcome
option when the training set size is quite large.

[Plot: PMC (%) versus training images per class for MS and DEGS_16.]
Figure 37: Identification performance degradation between MS and DEGS_16.

The next step was to compare the execution times of the two methods. It has already been
established in Section 4.2.1 that MS is on average two orders of magnitude faster than DEGS_16
when only one training image per class is available; it is also of interest to see how this ratio evolves
as the training set size increases. This information can be found in Figure 38, where we can see that
the identification time ratio between the two methods ranges between 77 and 180, with a tendency
to decrease as more training images per class become available. Another interesting statistic is the
total processing time ratio of the two metrics, which includes feature estimation and Face Graph
creation in addition to the actual decision process. As indicated by Figure 39, this is in general an
increasing function of the training set size (any fluctuations can safely be attributed to the small
number of runs). This fact seems at first to contradict Figure 38; it is, however, simply caused by
the inclusion of the feature estimation and Face Graph creation times in the calculations.
[Plot: identification time (sec) versus training images per class for MS and DEGS_16.]
Figure 38: Identification time comparison between MS and DEGS_16.

[Plot: total processing time ratio (DEGS_16 vs. MS) versus training images per class.]
Figure 39: Total processing time comparison between MS and DEGS_16.

To understand why this is so, it is instructive to study the ratio of identification to total
processing time for both methods. As can be seen from Figure 40, this ratio is quite small in the
case of MS, never reaching 16%. Things are quite different for DEGS_16 on the other hand; in that
case, the ratio starts off at 53.2% for one training image per class and reaches 94.5% when 24
images are available per person. We can see therefore that the actual decision process takes up most
of the time under DEGS_16; when MS is employed, the most time consuming tasks are the feature
extraction and Face Graph creation steps. Since these are not strongly affected by the increase in
the training set size, the total processing time for MS increases more slowly than in the case of
DEGS_16, thus explaining the results in Figure 39.

[Plot: percentage of identification to total processing time versus training images per class for MS and DEGS_16.]
Figure 40: Percentage of identification to total processing time: MS vs.
DEGS_16.

The most important conclusion from the discussion in this section is that MS becomes
increasingly preferable to DEGS_16 as more training images per class become available; the system
speed can then be increased significantly while suffering only a small performance degradation. For
example, with 22 training images per class, the PMC degradation is around 0.77%, but processing is
almost 9.4 times faster when using MS instead of DEGS_16. Another interesting point is that this
MS run is only slightly slower than the DEGS_16 run with a single training image per class, while
offering a 22.2% performance improvement in terms of PMC.
4.2.3 Increasing The Training Set Size

Having chosen the two most appropriate face similarity metrics for identification, we
proceeded to compare their performance when more than one training image per class is available.
In order to study them further and ascertain their strengths and weaknesses, a large number of
simulations with a varying number of training images per class were run. Table 5 shows
comparative statistics for the MS and DEGS_16 average PMC % and ID time. It is clear that
DEGS_16 has consistently superior identification performance, although the discrepancy between
the two methods is reduced as more training images per class become available (as was suggested in
the previous section). However, MS is considerably faster, being able to identify an image in about
two orders of magnitude less time than DEGS_16. We can therefore propose two versions of the
EBGM algorithm according to the metric used for identification. When we are interested in optimum
recognition performance and have no serious time constraints (off-line applications), the most
appropriate method is definitely DEGS_16. However, when identification time is of the essence (as
in real-time applications), the best approach is to use MS, even though this incurs a small reduction
in the identification rate. Seen from another point of view, DEGS_16 is preferable when training
data is scarce; however, as more images become available for the creation of the FBG, MS provides
a better trade-off between speed and recognition accuracy.

TPC   Runs   MS PMC (%)   DEGS_16 PMC (%)   MS ID Time (sec)   DEGS_16 ID Time (sec)
2     300    19.79        14.07             0.011              0.957
3     400    13.84        10.24             0.016              1.578
5     400    8.55         6.15              0.032              3.168
10    400    4.84         3.06              0.048              5.164

Table 5: Comparison of identification and speed characteristics for the MS and DEGS_16 metrics.

Concluding this section, a comparison of identification accuracy between the two EBGM
variants and the Subspace Projection methods is presented in Figure 41. It can be seen that the
EBGM algorithm performs much better than simple PCA under both metrics; however, both PCA
w/o 3 and LDA are superior to the EBGM variants when more than three training images per class
are available. The critical point comes when two images per class are used to construct the FBG: in
this case, EBGM under the DEGS_16 metric is slightly better than all Subspace Projection variants.
Unfortunately it is also noticeably slower, since LDA runs more than three orders of magnitude
faster. This is another indication that EBGM is preferable when training data is scarce and,
furthermore, that it cannot be used for real-time applications, especially under the DEGS_16 metric.


Figure 41: Comparison of the EBGM variants to the Subspace Projection methods (from [9]) in
terms of recognition accuracy: PMC (%) versus training images per class for MS, DEGS_16, PCA,
PCA w/o 3 and LDA.

4.3 Further Identification Experiments

Having established a baseline identification performance and studied two different metrics
with their respective strengths and weaknesses, we proceeded to run a number of experiments to
determine the algorithm's robustness to illumination variations, image size and imperfect eye
position estimates. Furthermore, the standard EBGM implementation was modified with respect to
the number of fiducial points per person and the number of Gabor wavelets used to construct a jet.
These tests are described in the following paragraphs of this section.



4.3.1 Effect Of Illumination Changes

The first variant of the baseline algorithm consists of applying a more involved
preprocessing method, which was proposed in [7]. The idea behind this approach is to also obtain a
uniform illumination distribution across all images in the database (apart from the standardization
of face size and orientation offered by the simple geometric normalization), so as to improve
recognition performance. The first step is to remove the mean intensity of the original
(unnormalized) HumanScan image. The standard steps for geometric normalization then follow, the
only difference being that the image pixel values are again adjusted to have zero mean and unit
variance prior to cropping out the background; to achieve unit variance across all images, the
maximum variance in the whole database must be computed before this second step. This
preprocessing step was named intensity normalization, since it involves adjusting the pixel values of
the image. An example of a normalized image with and without intensity normalization is depicted
in Figure 42.
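A minimal sketch of the intensity normalization step follows; whether the final scaling divides by a per-image or by the database-wide maximum standard deviation is an assumption of this sketch:

import numpy as np

def zero_mean(image):
    # First pass: remove the mean intensity of the original image.
    image = np.asarray(image, dtype=float)
    return image - image.mean()

def intensity_normalize(geom_normalized_images):
    # Second pass, applied after geometric normalization and before background
    # cropping: re-centre every image and scale it by the maximum standard
    # deviation found in the whole database (assumed database-wide scale, so
    # that no image exceeds unit variance).
    imgs = [np.asarray(im, dtype=float) for im in geom_normalized_images]
    max_std = max(im.std() for im in imgs)
    return [(im - im.mean()) / max_std for im in imgs]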


Figure 42: Result of preprocessing an image, with and without intensity
normalization.

To ascertain the effect of this modification to the normalization procedure, we ran a number
of tests under both similarity metrics and with different numbers of training images per class. The
results we obtained indicate that there is no clear advantage to enhancing the preprocessing stage of
the algorithm with the intensity normalization step. Specifically, in a total of 25 runs using one
training image per class under MS, an identification performance improvement was observed in
only 11 cases, while 13 cases actually suffered from the introduction of intensity normalization and
one run was totally unaffected. The corresponding numbers under DEGS_16 were similar: an
improvement was observed in 12 cases, performance degraded in 10 runs and the remaining three
were unaffected. The average PMC across all 25 runs improved only slightly (by 0.24% under MS
and 0.03% under DEGS_16). These results indicate that MS is slightly more sensitive to illumination
changes in the images, although the effect is negligible in any case.
These tests were repeated for 400 runs with 5 training images per class under the DEGS_16
metric. The results are depicted as a scatter plot in Figure 43 and it is obvious that there is again no
clear benefit in introducing intensity normalization to the preprocessing step. To further exemplify
this, note that both the average and standard deviation statistics for the PMC were practically
unaffected. It is clear that the EBGM algorithm is in general quite robust with respect to
illumination variations in the data set. This is mainly due to the characteristics of the Gabor
wavelets, as was first pointed out in [3].

[Scatter plot: PMC (%) with intensity normalization versus PMC (%) without intensity normalization (400 runs, 5 TPC, DEGS_16).]
Figure 43: Using a scatter plot to study the effect of applying intensity
normalization to the preprocessing step.





4.3.2 Effect Of Image Size

A second interesting point that was studied is the effect of the image size on the
identification performance. This is especially important in the context of the CHIL project, since
we are expected to identify images of much smaller sizes than the ones in the HumanScan database
(even after normalization). Normalized images were resized by factors of 3:4, 2:3, 1:2, 1:3, 1:4 and
1:6 using bilinear interpolation in MATLAB and were subsequently passed through the
identification process. An example of an image at various scales is shown in Figure 43.


Figure 43: A normalized image at varying scales (from left to right and top to bottom): 3:4, 2:3,
1:2, 1:3, 1:4 and 1:6.
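The resizing step itself is straightforward; a minimal Python sketch is given below, where an order-1 (bilinear) zoom stands in for MATLAB's imresize, so minor interpolation differences with the actual experiments are possible:

import numpy as np
from scipy.ndimage import zoom

SCALE_FACTORS = (3/4, 2/3, 1/2, 1/3, 1/4, 1/6)

def downscaled_versions(normalized_img):
    # Produce the six downscaled copies of a normalized face image using
    # bilinear interpolation (order=1).
    img = np.asarray(normalized_img, dtype=float)
    return {f: zoom(img, f, order=1) for f in SCALE_FACTORS}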

Figure 44 shows the effect of adjusting the image size on the recognition performance for a
varying number of training images per class (TPC), under the MS metric. 300 runs were executed
for 2 TPC and 400 runs for the remaining cases. We can see that as the image size is reduced the
probability of misclassification increases, which was to be expected, since downscaling smears the
facial characteristics and makes reliable fiducial point location harder, resulting in lower
identification rates. It is, however, important to note that this performance degradation becomes less
noticeable as more training images per class become available, and also that in all cases
performance is quite insensitive to downscaling by one third or less, with the PMC peaking for a
downscaling by four. It is also quite interesting that beyond that point (i.e. for a scale factor of 1:6)
performance actually improves, except in the case of 2 TPC. The most striking result is that
downscaling by six increases the PMC by only 0.4% when 10 training images per class are contained
in the FBG. This indicates that EBGM can withstand changes in scale quite well if the training set
size is large enough.

[Plot: PMC (%) versus scale factor (%) for 2, 3, 5 and 10 TPC.]
Figure 44: Effect of downsizing images for various training set sizes.

It is also interesting to see how the Subspace Projection methods react to changes in the
image size. In order to study this aspect of the algorithms, we refer to Figure 45, where the
probability of misclassification is plotted for different scales (measured by the horizontal distance
between the eyes), both for EBGM (using the MS metric) and for LDA. We can see that for a few
training images per class, LDA is clearly superior and appears to be more robust with respect to
these changes in scale. However, when 10 training images per class become available, the
performance of EBGM seems rather unaffected by the reduced image size, while that of LDA
deteriorates slightly for smaller images, leading to the conclusion that EBGM is less sensitive to
resizing than LDA when training images are abundant.

[Plot: PMC (%) versus eye distance (pixels) for MS (2 TPC), MS (10 TPC), LDA (2 TPC) and LDA (10 TPC).]
Figure 45: Comparison of EBGM/MS and LDA (from [9]) resilience to image
resizing.

4.3.3 Effect Of Imperfect Eye Localization

So far in our experiments we have assumed that the exact coordinates of the eye pupils for
the testing images are known; in practice, however, this is far from the truth. This aspect of the
problem is also important in the context of the CHIL project, since the front-end of the
identification system in the AIT Smart Room contains an eye detector module to provide initial
estimates for the eye coordinates (Figure 46). Having obtained the results from the eye detector
evaluation (which gave an RMS error of 4% of the actual eye distance), we ran a number of
simulations whereby we used the actual eye positions for the training images (from the HumanScan
point files) but adjusted the eye coordinates of the testing images to match the eye detector output.
Our goal was to study the extent to which this mismatch between the training and testing set
conditions would affect the recognition performance.
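These experiments fed the real eye-detector output to the recognizer; purely for illustration, the sketch below shows how a comparable level of localization error could instead be simulated by Gaussian perturbations whose RMS magnitude equals 4% of the inter-eye distance (this synthetic approach was not the one used here):

import numpy as np

def perturb_eyes(left_eye, right_eye, rms_fraction=0.04, rng=None):
    # Add isotropic Gaussian noise to both eye positions such that the RMS
    # displacement per eye equals rms_fraction of the inter-eye distance.
    rng = np.random.default_rng() if rng is None else rng
    left_eye, right_eye = np.asarray(left_eye, float), np.asarray(right_eye, float)
    eye_dist = np.linalg.norm(right_eye - left_eye)
    sigma = rms_fraction * eye_dist / np.sqrt(2)   # split the RMS over x and y
    noise = rng.normal(0.0, sigma, size=(2, 2))
    return left_eye + noise[0], right_eye + noise[1]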
The results are plotted in Figure 47, where the EBGM algorithm under the DEGS_16 metric
is compared with the best of the Subspace Projection methods (LDA). Although LDA is superior
when perfect knowledge of the eye positions is available, the situation is reversed when the eye
detector module is introduced. In fact, the degradation LDA suffers is so severe that its recognition
performance becomes significantly inferior to that of EBGM, even for a very high number of
training images per class.


Figure 46: Block diagram of the complete AIT Face ID system (courtesy of Dr. A. Pnevmatikakis):
face detector, foreground/background segmentation, skin likelihood and heuristics segmentation,
eye detector, DFFS frontal face verification, face tracker, accumulation of the K out of N smallest
DFFS, and face recognizer with voting.

[Plot: PMC (%) versus training images per class for EBGM (perfect eyes), LDA (perfect eyes), EBGM (imperfect eyes) and LDA (imperfect eyes).]

Figure 47: Identification performance degradation introduced by the eye
detector.



4.3.4 Effect Of Number Of Fiducial Points

It is suggested in [7] that recognition performance can be further improved if additional
points are used in the construction of the Face Graph. These are defined as midpoints between
existing features, although no specific strategy or intuition leading to a successful choice of
interpolated points is suggested; in fact, the extra points added to the Face Graph seem to be chosen
somewhat arbitrarily. In any case, a basic goal should be to cover the face evenly (if possible) with
these additional features; however, a large number of extra points will be detrimental to the
algorithm's performance, since the areas covered by the Gabor wavelets will overlap and the
information contained in the Gabor jets of neighboring points will thus be highly correlated.
Therefore, some trade-off must be made, as also hinted in [3].
To this end, we chose 20 extra points for use in the recognition process, since the number
used in [7] (55) was deemed rather extravagant and redundant; besides, employing a large number
of additional features could only further reduce the algorithm's already low speed. In order to
somewhat mitigate this problem, these points were not accounted for during the feature estimation
process; rather, they were only added to the Face Graph during its creation, by simply extracting
jets at the appropriate coordinates. It should be noted that it is quite difficult to come up with an
appropriate selection strategy or heuristic, although the choice was made while keeping the
trade-off discussed above in mind. Table 6 lists the locations of the additional points considered;
an attempt was made to place these new features as symmetrically as possible around the face.
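Since the extra features are plain midpoints of existing ones, their coordinates can be derived directly from the estimated Face Graph; the sketch below shows a few of the 20 interpolated points of Table 6 (the feature names used as dictionary keys are assumptions, not the HumanScan labels):

import numpy as np

def midpoint(p, q):
    return (np.asarray(p, float) + np.asarray(q, float)) / 2.0

def extra_features(points):
    # points maps fiducial feature names to (x, y) coordinates; the remaining
    # interpolated points of Table 6 follow the same midpoint construction.
    return {
        "eyebrow_midpoint": midpoint(points["right_eyebrow_inner"], points["left_eyebrow_inner"]),
        "pupil_midpoint": midpoint(points["right_pupil"], points["left_pupil"]),
        "right_temple_chin_midpoint": midpoint(points["right_temple"], points["chin_tip"]),
        "left_temple_chin_midpoint": midpoint(points["left_temple"], points["chin_tip"]),
    }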
Although care was taken to place these points in appropriate locations, it was immediately
clear that even twenty new features would degrade performance in most cases (apart from the
expected speed reduction). Therefore, an alternative strategy was employed, whereby the inclusion
of each one of these points on its own was evaluated with respect to the recognition performance
change it incurred. In order to obtain statistically meaningful results, each extra feature was tested
on 25 runs using a single training image per class, and the resulting differences in PMC % from the
original algorithm setup were recorded. In order to evaluate each jet, we were interested in the
number of runs (out of the total of 25) in which its inclusion actually led to a performance
enhancement; furthermore, we studied the frequency with which a jet proved to be either the best or
the worst candidate for inclusion, on the assumption that only one additional point would be used
for the Face Graph creation.
Figure 48 plots the number of runs for which each jet proved to be the best or the worst
addition. It is clear that jet #18 is an unwelcome addition, since it accounted for the worst
performance deterioration in 11 cases. Interestingly enough, jet #17 (its symmetric counterpart) lies
on the other side of the spectrum, since it provides the highest improvement in five cases, more
than any other proposed feature.

Extra Point # Location
1 Mouth center
2 Midpoint between eyebrows
3 Right eyebrow middle
4 Left eyebrow middle
5 Eye pupil midpoint
6 Right eye-nostril midpoint
7 Left eye-nostril midpoint
8 Inner corner of right eye & tip of nose midpoint
9 Inner corner of left eye & tip of nose midpoint
10 Right eye pupil & mouth corner midpoint
11 Left eye pupil & mouth corner midpoint
12 Right mouth corner & nostril midpoint
13 Left mouth corner & nostril midpoint
14 Right mouth corner & temple midpoint
15 Left mouth corner & temple midpoint
16 Lower lip middle & tip of chin midpoint
17 Right temple & tip of chin midpoint
18 Left temple & tip of chin midpoint
19 Right mouth corner & tip of chin midpoint
20 Left mouth corner & tip of chin midpoint

Table 6: List of interpolated additional features.

[Bar plot: number of occurrences of each interpolated jet being the worst or the best addition, versus interpolated jet number.]
Figure 48: Worst-Best jet statistics.
Figure 49 provides much better intuition concerning the most appropriate choices for extra
features by plotting the number of times (i.e. the frequency) with which each feature's addition
enhanced or degraded recognition performance. We can see that half of the proposed jets generally
(i.e. in about 80% of the runs) tend to improve performance, while only three of them are harmful
most of the time. This plot also verifies that jet #18 is definitely a bad inclusion and that #17 is
indeed helpful. One could use this graph as a starting point for further evaluation of the effect that
added jets have on recognition performance, but unfortunately the process of picking likely
candidates as additional features remains largely trial-and-error.

[Bar plot: number of occurrences of performance degradation and improvement versus interpolated jet number.]

Figure 49: Frequency of performance degradation and improvement incurred for
each of the proposed extra jets.

4.3.5 Effect Of Gabor Kernel Set Size

It has already been pointed out that the most computationally intensive part of the algorithm
is the extraction of Gabor jets, both during feature estimation and during Face Graph creation and
identification. Apart from the speed reduction incurred by the use of the full kernel set, we
discussed in the previous section how choosing densely packed features can degrade identification
performance because the resulting jets become highly correlated. The same problem can occur even
if only the 20 basic features provided by HumanScan are used, provided that the kernels are large
enough; in our case, the largest wavelets employed are 97x97 pixels, meaning that the convolution
areas will in some cases overlap heavily. We therefore decided to study how well the algorithm
would perform if the 16 largest kernels were not used in the Gabor jet extraction process.
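Operationally, this modification amounts to keeping only the smallest convolution masks when jets are extracted; a minimal sketch follows, where kernels is assumed to be a list of 2-D masks covering all scales and orientations (how the thesis code stores and orders the kernels is not shown here):

def retain_smallest_kernels(kernels, n_drop=16):
    # Sort the masks by area and drop the n_drop largest ones; jets are then
    # built only from the retained kernels, shrinking the overlapping support
    # regions and the per-jet computation time.
    ordered = sorted(kernels, key=lambda k: k.shape[0] * k.shape[1])
    return ordered[:len(ordered) - n_drop] if n_drop < len(ordered) else []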
In order to test the effects of this modification, we ran the algorithm 25 times with a single
training image per class and recorded the average results. As Figures 50 and 51 indicate, there is a
twofold benefit from this modification under both MS and DEGS_16: the algorithm runs
considerably faster while also gaining somewhat in recognition accuracy. Beyond that point, speed
continues to increase in an almost linear fashion; however, dropping more than 32 kernels results in
a severe PMC % increase. MS seems to be more robust, since it retains its original performance even
when the 32 largest kernels are dropped; in fact, the performance discrepancy between MS and
DEGS_16 steadily declines, to the point where MS becomes slightly better than DEGS_16 when
only the 16 smallest kernels are retained.

[Plot: PMC (%) versus number of retained Gabor kernels for DEGS_16 and MS (1 TPC, 25 runs).]

Figure 50: Effect of the Gabor kernel set size on recognition performance.

[Plot: ID time (sec) versus number of retained Gabor kernels (1 TPC, 25 runs).]

Figure 51: Effect of the Gabor kernel set size on identification time.



















Chapter 5: Conclusions

5.1 Summary

This Thesis has presented the results of the implementation of the EBGM algorithm for face
identification in the context of the CHIL project at AIT. We have provided a detailed analysis of
the major steps of the algorithm and discussed the methods and metrics used for both facial feature
estimation and the actual recognition process. We have shown that EBGM can be used to
successfully locate the positions of fiducial points in novel imagery, even when only a few training
images per class are contained in the FBG. We then proceeded to study two different metrics used
in the identification process and exposed their respective strengths and weaknesses, leading to two
versions of the algorithm that could be used in practice, depending on the particular application and
its requirements in terms of speed and recognition accuracy. Finally, we delved into the intricacies
of the algorithm and touched upon its resilience with respect to illumination variations, different
image scales and imperfect eye localization.
The EBGM algorithm has been studied extensively, both in itself and in comparison with
several variants of the Subspace Projection family, and their respective merits and shortcomings
have been investigated and analyzed. EBGM has proven to be a fairly mathematically involved
face identification method that exhibits robustness under illumination variations, image resizing
and imperfect eye localization. It is a very good choice for off-line applications and cases where
training images are scarce; however, its high computational complexity makes it inappropriate for
real-time applications.

5.2 Directions For Future Work

A number of issues are however still open. One of the most important is the attempt to find
a similarity metric that would combine the speed of MS with the accuracy of DEGS_16. Some
steps in this direction have already been made, by considering information about both the face
geometry and the feature characteristics in a hybrid measure, namely a linear combination of GeoS
and MS. This idea was suggested in [7] but our results indicate that such a simple solution does not
offer any performance improvements; if such a hybrid solution is to be sought out, it will definitely
have to be a more complex combination of geometric and feature information.
Another way to approach this problem is to use the existing metrics but weigh the
contribution of each fiducial feature by a different amount when computing the total similarity over
the whole face. The major obstacle in this case is determining the weights in some systematic way
that guarantees enhanced performance. A simple idea is to weigh each contribution according to the
expected accuracy in estimating the feature position, so that we bias our decision towards those
fiducial points in which we have more confidence. A different approach attempts to increase the
algorithm's resilience to different facial expressions, by applying higher weights to facial features
that are largely indifferent to the expression of emotions, such as the eyes and nostrils; conversely,
the area around the mouth should receive lower weights, since it is the region of the face most
heavily affected by expressions or by talking. This is, however, an open problem entailing
considerable trial-and-error, and much work still lies ahead.
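Whatever weighting scheme is eventually adopted, the change to the face similarity itself is minor, as the sketch below illustrates (the weights themselves, e.g. inverse expected RMS estimation errors or expression-based confidences, are exactly the open question discussed above):

import numpy as np

def weighted_face_similarity(jets_a, jets_b, weights, jet_sim):
    # Each fiducial point contributes to the face similarity according to a
    # confidence weight; the result is the weighted mean of the jet similarities.
    weights = np.asarray(weights, dtype=float)
    sims = np.array([jet_sim(a, b) for a, b in zip(jets_a, jets_b)])
    return float(np.sum(weights * sims) / np.sum(weights))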
Apart from improving recognition accuracy, the second major direction lies in improving
the algorithm's speed. We have already demonstrated the importance of the selection of the Gabor
kernels in Section 4.3.5; another idea along those lines would be to retain the full kernel set while
using smaller wavelets. This can be accomplished by reducing their basis of support, or alternatively
by only retaining an area of length σ around the center (instead of 3σ as we have used so far). It
would also be possible to weigh the contribution of each kernel to the Gabor jet according to the
inverse of its basis of support, thus also reducing possible jet correlation due to heavily overlapping
convolution regions.
A final set of experiments could aim at further investigating the algorithm's resilience to the
impairments discussed in Sections 4.3.1-4.3.3. For example, it would be interesting to perform a
joint evaluation of EBGM and the Subspace Projection methods on image databases with more
adverse characteristics (such as the Yale Illumination database). The effect of image resizing could
also be further investigated by studying a larger number of scale factors in order to obtain smoother
PMC % curves. Last but not least, we have only discussed the performance degradation due to
imperfect eye localization for the testing images alone; to complete the picture, the same
experiments should be repeated under matched conditions, i.e. by considering errors in the
algorithm's knowledge of the eye coordinates for the training images as well.







Appendix A: References

[1] M. Turk and A. Pentland, "Eigenfaces for Recognition", J. Cognitive Neuroscience, Vol. 3, March 1991, pp. 71-86.
[2] R. Duda, P. Hart and D. Stork, Pattern Classification, Wiley-Interscience, New York, 2000.
[3] Laurenz Wiskott, Jean-Marc Fellous, Norbert Krueger and Christoph von der Malsburg, "Face Recognition by Elastic Bunch Graph Matching", in Intelligent Biometric Techniques in Fingerprint and Face Recognition, eds. L.C. Jain et al., CRC Press, ISBN 0-8493-2055-0, Chapter 11, pp. 355-396, 1999.
[4] W. Zhao, R. Chellappa, A. Rosenfeld and P.J. Phillips, "Face Recognition: A Literature Survey", 2000.
[5] http://www.humanscan.de/support/downloads/facedb.php
[6] http://chil.server.de
[7] David S. Bolme, "Elastic Bunch Graph Matching", Master's Thesis, Computer Science Department, Colorado State University, Summer 2003.
[8] N. Petkov and P. Kruizinga, "Computational models of visual neurons specialized in the detection of periodic and aperiodic oriented visual stimuli: Bar and grating cells", 1997, pp. 83-96.
[9] A. Pnevmatikakis and L. C. Polymenakos, "Subspace Projection Face Recognition: Comparison of Methods and Metrics", AIT TR-CV-REC-003, submitted for publication to the International Journal of Computer Vision.
[10] http://ravl.sourceforge.net
