This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2014.2330580, IEEE Transactions on Affective Computing. 1949-3045 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

An Automatic Framework for Textured 3D Video-based Facial Expression Recognition

Munawar Hayat, Mohammed Bennamoun

M. Hayat and M. Bennamoun are with the School of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Hwy, Crawley WA 6009, Australia. E-mail: {munawar, m.bennamoun}@csse.uwa.edu.au
Abstract
Most of the existing research on 3D facial expression recognition has been done
using static 3D meshes. 3D videos of a face are believed to contain more information
in terms of the facial dynamics which are very critical for expression recognition. This
paper presents a fully automatic framework which exploits the dynamics of textured 3D
videos for recognition of six discrete facial expressions. Local video-patches of variable
lengths are extracted from numerous locations of the training videos and represented
as points on the Grassmannian manifold. An efficient graph-based spectral clustering
algorithm is used to separately cluster these points for every expression class. Using
a valid Grassmannian kernel function, the resulting cluster centers are embedded into
a Reproducing Kernel Hilbert Space (RKHS) where six binary SVM models are learnt.
Given a query video, we extract video-patches from it, represent them as points on
the manifold and match these points with the learnt SVM models followed by a voting
based strategy to decide on the class of the query video. The proposed framework
is also implemented in parallel on 2D videos and a score level fusion of 2D & 3D
videos is performed for performance improvement of the system. The experimental
results on the largest publicly available 3D video database, BU4DFE, show that the
system achieves a very high classification accuracy and outperforms the current state
of the art algorithms for facial expression recognition from 3D videos.

Index Terms
Facial expression recognition, 3D videos, Grassmannian manifold, spectral clustering, SVM on Grassmannian manifold


INTRODUCTION AND RELATED WORK

Automatic recognition of human facial expressions finds its applications in many emerging areas such as affective computing and intelligent Human Computer Interaction (HCI). Due to the easy availability of 2D cameras, most of the
existing work on Facial Expression Recognition (FER) has been done using
2D static images or videos. Surveys on 2D FER have been presented in [1],
[2]. Despite the development of many successful algorithms and techniques
for 2D FER, the 2D face data suffers from inherent problems of illumination
changes and pose variations. The 3D data modality is however considered to
be more effective in addressing the issues faced by its 2D counterpart. With
recent advancements in the development of 3D image capturing technologies
such as structured light scanning, photogrammetry and photometric stereo, the
acquisition of 3D data is becoming a more feasible task [3] and future real
world applications would require recognition of facial expressions from 3D
videos. Sandbach et al. [3] and Fang et al. [4] presented comprehensive surveys
on recent advances of 3D Facial Expression Recognition (FER). These surveys
indicate that almost all of the existing research on 3D FER is based on static 3D
images. 3D videos are, however, believed to provide more information in terms
of the dynamics of an individual's face, which is very critical for expression
recognition.
Ever since the public availability of the 3D video database (BU4DFE) [5], only a few methods [5]–[11] have been developed for 3D video based expression recognition. The framework for 3D video based FER generally comprises two main stages: the extraction of facial features and their classification. Of these two stages, the extraction of the pertinent facial features is the most critical, as the best machine learning (classification) algorithm could fail to achieve
the desired performance in the presence of weak features. As indicated in the surveys [3], [4], the focus of the existing methods has mostly been on devising feature extraction techniques for 3D videos. These methods fall short of a full-fledged system capable of automatically recognizing facial expressions from 3D videos. This paper fills this gap and presents a fully automatic system, which requires no user intervention, and is capable of recognizing the six discrete facial expressions (commonly used in the literature) from 3D face videos.
Here we first present an overview of the existing methods which were developed and tested on the BU4DFE dataset. We then present our proposed
system in Sec 2. In order to encode the temporal dynamics of a video in terms
of feature descriptors, the existing methods incorporate different tracking and
alignment techniques to ensure that features extracted from different locations
of the video correspond to the same regions of the face. Furthermore, these
methods require a vertex level correspondence between successive frames of a
video to track the movements of different vertices. In [5] Sun and Yin adapt
a deformable mesh (the first mesh of the video) to the subsequent meshes
and track its changes in order to extract geometric features. First, 83 facial
landmarks from the first frame of the corresponding 2D texture video are
manually annotated and an Active Appearance Model (AAM) is used to track
these points in the subsequent frames. Since the 2D and 3D data are aligned,
the tracked landmarks are transferred to the corresponding 3D meshes. A postprocessing step is incorporated to manually correct wrongly tracked points.
From the known locations of the 83 facial landmarks of the complete video
sequence, the authors use Radial Basis Function (RBF) to adapt all meshes of the
video with the first mesh. This process establishes a vertex level correspondence
and ensures an equal number of vertices amongst all the meshes of the video.
3D surface descriptors are used to characterize every vertex of the 3D mesh
in the video by its primitive geometric surface features (e.g. convex peaks and
concave pits). An optimal discriminative feature space is then constructed for
every mesh using Linear Discriminant Analysis (LDA). For classification, one
HMM for each of the six expressions is learnt. Finally, the probability scores
of the test video against each HMM are evaluated by a Bayesian decision rule to
determine the expression type of the test video. An extension of this approach
has been presented in [12] for the task of face recognition from 3D videos.
Rosato et al. [8] propose a conformal mapping and generic model adaptation
technique for vertex level correspondence establishment. A deformable template based approach is applied to extract 22 feature points from all frames of
the corresponding 2D texture video. These extracted points are then mapped to
the corresponding 3D meshes of the video. A circle pattern conformal mapping
is used to parameterize the 3D meshes in a 2D plane. A coarse to fine model
adaptation is performed in the 2D planar representation for correspondence
establishment between vertices. The established correspondences are then extrapolated from the planar representation to the original 3D meshes. Each 3D mesh is then characterized by one of its 12 geometric primitive features.
LDA is used for classification of the facial expressions.
Fang et al. [9], [13] fit an Annotated deformable Face Model (AFM) to every
mesh of the video and use the fitted video sequence for feature extraction
and classification. For every pair of consecutive video frames, registration is
performed by mesh matching and the rigid transformations are computed. Two
different techniques are used for mesh matching i.e. spin image similarities and
distances between MeshHOG descriptors. The first mesh is directly fitted to the
AFM while the transformations obtained by dense registration of pairs of consecutive meshes are used for fitting the remaining meshes of the video. Assuming
that the expression is performed gradually (from neutral to onset to apex to
offset), a flow matrix is obtained from fitting of the AFM. Spatiotemporal LBP
features are extracted from the flow matrix and an SVM is used for classification.
Sandbach et al. [6], [14] model a facial expression sequence by a four state
model i.e. neutral-onset-apex-offset. Firstly, all 3D meshes of a video are aligned
with the first mesh using an iterative closest point (ICP) algorithm. Then, Free
Form Deformations (FFDs) are used to capture the motion between the frames
of the video. This motion is represented in terms of vector fields. Using a quadtree decomposition, the vector fields are sub-divided into regions according
to the amount of motion appearing in every region. A set of 2D features is
extracted for every region, which are then used to train boosting classifiers on
the onset and offset segments of the expression. Finally, HMMs are used to
model the temporal dynamics of the complete expression sequence.
Huang et al. [10] propose a hybrid approach based upon two vertex mapping algorithms (i.e. displacement mapping and point-to-surface mapping) for
achieving a vertex level correspondence. They follow a procedure similar to
[12] for feature extraction and classification. Unlike most global model based
algorithms e.g. [5], [6], [8], [9] which map the complete face model, the proposed approach segments the face into multiple independent local regions and
maps the corresponding local regions. Nevertheless, mapping a video sequence
of about 100 meshes and establishing a vertex level correspondence amongst
all meshes (each mesh comprises about 40,000 vertices) is a very slow and
computationally expensive process.
Le et al. [7] propose an approach based on facial level curves for expression
recognition from 3D videos. The arc-length function is used to parameterize the
level curves and spatiotemporal features are extracted by comparing the curves
across frames using Chamfer distances. An HMM-based decision boundary
focus algorithm is used for classification.
In order to extract motion between meshes of a 3D video sequence, Drira et
al. [11] use radial curves. First, the facial surface in each 3D mesh of the video is
parameterized by radial curves emanating from the nose tip. Then, the motion
between every pair of consecutive meshes is captured in terms of vector fields
computed by comparing the corresponding radial curves of the two meshes.
LDA is used for dimensionality reduction followed by a multi-class random
forest algorithm for classification.
This paper contributes towards the development of a full-fledged automatic
framework for textured 3D video-based facial expression recognition. After normalizing all videos, the system uses a sliding window to extract video-patches from different locations of each video and represents every video-patch by a set of basis vectors, which is treated as a point on the Grassmannian manifold. Using an efficient graph-based spectral clustering algorithm, the video-patches for each of the six expressions are then clustered, and only the cluster centers are considered for final matching with the query video. A multi-class SVM on the Grassmannian manifold is learnt for the final matching.
The main strengths of our proposed system are as follows: 1) Our system does not rely on manually annotated facial landmarks for the tracking of various facial regions across the frames of a video. 2) Our proposed system does not require a dense correspondence establishment between the vertices of the meshes of a video; therefore, unlike [5], [6], [8], [9], [13], [14], it does not incur an iterative and computationally expensive process to align all meshes of the video. 3) Our system is invariant to the temporal length of the video sequences and makes no assumptions about the presence of all four segments (neutral-onset-apex-offset) or their temporal order in a video. 4) The 2D texture information is readily available in the BU4DFE dataset but has not previously been incorporated along with the 3D information; our proposed system performs a score level fusion of results from 2D and 3D videos for performance improvement. 5) Unlike the systems proposed in [5], [11], which require a large amount of training data (the authors divide a video sequence of about 100 frames into multiple subsequences of 6 frames each), our system works well for any length and number of training videos (see Sec 6).
Preliminary results of our algorithm have been published in [15]. However, a number of extensions have been made since then, which have resulted in a significant improvement in the expression classification accuracy (from 90.97% to 94.34%) and in the efficiency of the algorithm. These extensions include 1) the formulation and implementation of a support vector machine classifier on the Grassmannian manifold, 2) the introduction of a new Grassmannian kernel whose efficacy is demonstrated
through experimental results, and 3) a score-level fusion of the available 2D texture videos for performance improvement.
The rest of the paper is organized as follows. We first provide a detailed description of the three main parts of our proposed system, i.e. spectral clustering on the Grassmannian manifold, SVM on the Grassmannian manifold and 3D video database normalization, in Sec 2, Sec 3 and Sec 4 respectively. Then, the complete pipeline of our automatic video based FER system is explained in Sec 5. Experiments and results are discussed in Sec 6. Finally, Sec 7 concludes the paper.

SPECTRAL CLUSTERING ON GRASSMANNIAN MANIFOLD

Clustering is one of the most widely used techniques for data analysis in many applications. The problem of clustering can be formulated as follows: given a set of points and some measure of similarity between all pairs of points, divide the points into groups such that points in the same group are similar, while points in different groups are dissimilar. While clustering in the Euclidean space has been well explored over the years, very few papers [16]–[18] discuss clustering on the Grassmannian manifold.
By definition, a manifold is a topological space which is locally similar to a Euclidean space. A Grassmannian manifold is the space of all $d$-dimensional linear subspaces of $\mathbb{R}^n$ [18], [19]. A point on the Grassmannian manifold is represented by an orthonormal matrix in $\mathbb{R}^{n \times d}$. The existing clustering techniques [16]–[18] on the Grassmannian manifold need to compute the distance as well as the mean of the points on the manifold for every iteration of the clustering algorithm (e.g. K-means). Methods for computing the mean and distance on the Grassmannian manifold can be broadly categorized as intrinsic and extrinsic [18]. The intrinsic methods are entirely restricted to the manifold itself, whereas extrinsic methods embed the points on the manifold into a Euclidean space and use Euclidean metrics for computations. Using either an intrinsic or extrinsic method for an iterative process such as K-means is very time consuming and requires a lot of computation. For example, the
authors in [17] use an extrinsic method by first embedding the points on the Grassmannian into its tangent space (a Euclidean space) and then applying the Mean Shift algorithm. Traversing from the manifold to its tangent space and back to the manifold for several iterations of the clustering algorithm is very slow and time consuming. Another recent semi-intrinsic method proposed by Turaga et al. [18] uses the Karcher mean [20], which is itself an iterative approach to compute the mean of a set of points on a manifold.
Using spectral clustering [21], [22], the iterative computations on the Grassmannian can be avoided and the problem of clustering can be simplified to an eigenvector decomposition of the graph-Laplacian matrix. The complete spectral clustering algorithm for a set of points $\{X_1, X_2, X_3, \ldots, X_m\}$ on the Grassmannian is provided in Algorithm 1. Firstly, we compute the graph-Laplacian matrix $L \in \mathbb{R}^{m \times m}$. This matrix is information rich and carries similarity scores for all pairs of points on the manifold. As discussed earlier, the points on the Grassmannian are $d$-dimensional subspaces of $\mathbb{R}^n$. On a computer, these points are stored as tall-thin orthonormal matrices in $\mathbb{R}^{n \times d}$ [19]. To compute the similarity between a pair of points on the Grassmannian, we use a Grassmannian kernel function (discussed later in Sec 3.1). The theory of spectral clustering [21] suggests that the normalized graph-Laplacian matrix is more suitable and yields even better clustering. For a matrix $L$, its normalized graph-Laplacian is given by $L_{norm} = D^{-1/2} L D^{-1/2}$ [21], where $D$ is a diagonal degree matrix whose $i$th diagonal element is equal to the sum of all elements of the $i$th row of $L$. After computation of $L_{norm}$, we follow the steps shown in Algorithm 1 to complete the clustering process. It should be noted that unlike the existing clustering algorithms (which require the computation of the mean and distance on the Grassmannian manifold for every iteration), the proposed spectral clustering algorithm reduces the problem to a very low dimensional Euclidean space where clustering can be performed in a much faster and more efficient way. Compared with a recently published method of clustering on the Grassmannian [18], our experimental results showed that the spectral clustering is roughly an
order of magnitude (10-15 times) faster.


Algorithm 1 Spectral Clustering on the Grassmannian
Require: Points on the manifold: $\{X_1, X_2, X_3, \ldots, X_m\}$
1) Compute the graph-Laplacian $L \in \mathbb{R}^{m \times m}$
2) Compute the normalized graph-Laplacian $L_{norm} = D^{-1/2} L D^{-1/2}$
3) Compute the first $t$ eigenvectors of $L_{norm}$
4) Let $V \in \mathbb{R}^{m \times t}$ contain the $t$ eigenvectors of $L_{norm}$
5) Let every row of $V$ be the corresponding point in the Euclidean space, i.e. the new set of points becomes $\{y_i\}_{i=1}^{m}$, $y_i \in \mathbb{R}^t$
6) Cluster the points $\{y_i\}_{i=1}^{m}$ using K-means
Ensure: Clusters $\{C_1, C_2, \ldots, C_k\}$ with $C_i = \{j \mid y_j \in C_i\}$
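To make the steps of Algorithm 1 concrete, the following is a minimal Python sketch assuming NumPy and scikit-learn are available; the function and parameter names (e.g. grassmann_spectral_clustering, n_eig) are ours, and the projection kernel of Sec 3.1.1 is used here as the similarity, which is one of several valid choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def projection_kernel(Xi, Xj):
    # k_proj(Xi, Xj) = ||Xi^T Xj||_F^2  (see Eq. 6)
    return np.linalg.norm(Xi.T @ Xj, 'fro') ** 2

def grassmann_spectral_clustering(points, kernel_fn=projection_kernel,
                                  n_clusters=15, n_eig=15):
    """Sketch of Algorithm 1; points are n x d orthonormal matrices."""
    m = len(points)
    # Step 1: pairwise similarities (the graph-Laplacian L of the paper)
    L = np.array([[kernel_fn(points[i], points[j]) for j in range(m)]
                  for i in range(m)])
    # Step 2: symmetric normalization L_norm = D^{-1/2} L D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(L.sum(axis=1))
    L_norm = L * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Steps 3-5: the top-t eigenvectors give a low dimensional Euclidean embedding
    _, eigvecs = np.linalg.eigh(L_norm)      # eigenvalues in ascending order
    V = eigvecs[:, -n_eig:]                  # rows are the embedded points y_i
    # Step 6: K-means in the embedded space
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(V)
```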

SUPPORT VECTOR MACHINE ON GRASSMANNIAN MANIFOLD

A Support Vector Machine (SVM) is a supervised binary classification algorithm which constructs a hyperplane to optimally separate the data points of the two classes [23]. The original SVM is designed for data lying in a Euclidean space. In this paper, we exploit the theory of Reproducing Kernel Hilbert Space (RKHS) [24] to recast SVM from the Euclidean space to the Grassmannian manifold. More specifically, the data points on the Grassmannian manifold are embedded into an RKHS by using a Grassmannian kernel. This embedding allows us to use the SVM classifier in the RKHS.
Given a set of $m$ training data points $X_{train} = \{X_1, X_2, X_3, \ldots, X_m\}$ along with their class labels $y_{train} = \{y_1, y_2, y_3, \ldots, y_m\}$, where $X_i \in \mathbb{R}^{n \times d}$ (a tall-thin orthonormal matrix, represented as a point on the Grassmannian) and $y_i \in \{-1, +1\}$, the problem of SVM on the Grassmannian manifold can be formulated as follows: find a maximum-margin hyperplane that divides the points having $y_i = -1$ from the points having $y_i = +1$. Using the soft-margin SVM formulation [23], this can be achieved by solving the following optimization problem,

$$
\begin{aligned}
\underset{w,\,b,\,\xi}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\
\text{subject to} \quad & y_i\left(w^T f_i + b\right) \ge 1 - \xi_i,\quad i = 1,\ldots,m \\
& \xi_i \ge 0,\quad i = 1,\ldots,m.
\end{aligned}
\tag{1}
$$
where $w$ is the coefficient vector, $b$ is the intercept term, $\xi$ is a vector of slack variables for handling non-separable data and $C$ is the penalty parameter for the error term [23]. $f_i \in \mathbb{R}^m$ is a feature vector in the RKHS computed between $X_i$ and all $m$ training points in $X_{train}$ using a Grassmannian kernel function,

$$f_i = k(X_{train}, X_i) \tag{2}$$

As shown in [23], the primal problem in (1) is a convex optimization problem. By using Lagrangian duality, it can be solved via the following dual optimization problem,

$$
\begin{aligned}
\underset{\alpha}{\text{minimize}} \quad & \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j\, k(X_i, X_j) - \sum_{i}\alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C,\quad i = 1,\ldots,m \\
& \sum_{i}\alpha_i y_i = 0
\end{aligned}
\tag{3}
$$

The above dual optimization problem is a quadratic programming problem and is solved for the optimal dual variable $\alpha^*$ using CVX1 [25], a package for specifying and solving convex optimization problems in Matlab. From the optimal dual variable $\alpha^*$, the primal variables are given as $w^* = \sum_i \alpha_i^* y_i f_i$ and $b^* = -\frac{1}{2}\langle w^*, k_+ + k_- \rangle$, where $\langle \cdot,\cdot \rangle$ is the dot product operator and $k_+$ and $k_-$ are positive and negative class samples given by $f_i$ (see equation (2) for $f_i$) corresponding to $y_i = +1$ and $y_i = -1$ respectively. The normalized distance of
1. LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) can also be used for solving (3)

a test data point $X_{test}$ from the SVM hyperplane is then given by,

$$f(X_{test}) = \operatorname{sgn}\!\left(\frac{\langle w^*, f_{test}\rangle + b^*}{\|w^*\|}\right) \tag{4}$$

where the sigmoid function $\operatorname{sgn}(z)$ is defined as,

$$\operatorname{sgn}(z) = \frac{1}{1 + \exp(-z)} \tag{5}$$
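As a concrete illustration of training and scoring on the Grassmannian, the sketch below uses scikit-learn's SVC with a precomputed Gram matrix (scikit-learn wraps LIBSVM, mentioned in the footnote) instead of CVX; the function names and the choice of library are ours, not the paper's, and the decision values returned by the library are not renormalized by $\|w^*\|$ as in Eq. (4).

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(points_a, points_b, kernel_fn):
    """Grassmannian kernel values between two sets of points."""
    return np.array([[kernel_fn(a, b) for b in points_b] for a in points_a])

def train_grassmann_svm(train_points, labels, kernel_fn, C=1.0):
    """One binary SVM in the RKHS induced by a Grassmannian kernel (sketch)."""
    K = gram_matrix(train_points, train_points, kernel_fn)
    clf = SVC(C=C, kernel='precomputed')
    clf.fit(K, labels)                        # labels in {-1, +1}
    return clf

def svm_similarity(clf, train_points, test_points, kernel_fn):
    """Sigmoid-squashed decision values (in the spirit of Eqs. 4-5)."""
    K_test = gram_matrix(test_points, train_points, kernel_fn)
    return 1.0 / (1.0 + np.exp(-clf.decision_function(K_test)))
```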

3.1 Grassmannian Kernels

A Grassmannian kernel is used to compute the similarity between two points $X_i$ and $X_j$ on the manifold. A valid Grassmannian kernel function embeds the points lying on the manifold into an RKHS where kernel based machine learning techniques can be used. The necessary and sufficient conditions for a kernel function to be valid are: 1) the kernel function is well defined, i.e. it is invariant to the choice of representation of the input data, and 2) it is symmetric
positive-definite. Here, we first discuss two Grassmannian kernels (projection
kernel and canonical correlation kernel in Sections 3.1.1 and 3.1.2 respectively)
from the literature and then introduce our own proposed Grassmannian kernel
(in Sec 3.1.3). We also show that more complex and sophisticated kernels can be
obtained by the linear combination of existing kernels. This linear combination
of kernels results in an increased discrimination of the data points and an
improved accuracy of the system (see Sec 6.2).
3.1.1 Projection Kernel

The projection kernel between two points $X_i$ and $X_j$ is defined as [26],

$$k_{proj}(X_i, X_j) = \left\| X_i^T X_j \right\|_F^2 \tag{6}$$

The projection kernel is a valid kernel because:

1) It is well-defined, i.e. $k_{proj}(X_i, X_j) = k_{proj}(X_i R_1, X_j R_2)$ for any $R_1, R_2 \in O(d)$, where $O(d)$ is the set of $d \times d$ orthonormal matrices, so the kernel value does not depend on the particular basis chosen for each subspace.

2) It is positive definite. By using the properties of the Frobenius norm, Hamm and Lee [26] showed that the projection kernel is a positive-definite kernel.
3.1.2 Canonical Correlation Kernel

Principal angles have been used as a measure of similarity between two subspaces (i.e. two points on the Grassmannian). The principal angles $0 \le \theta_1 \le \ldots \le \theta_m$ can be defined as the smallest angles between all pairs of unit vectors in the first and second subspaces. The cosines of the principal angles are known as canonical correlations. The first principal angle $\theta_1$ is the smallest angle. Based upon the value of the canonical correlation corresponding to the first principal angle, i.e. $\cos(\theta_1)$, Harandi et al. [27] defined the canonical correlation kernel as,

$$k_{cc}(X_i, X_j) = \max_{a_p \in \operatorname{span}(X_i)} \; \max_{b_q \in \operatorname{span}(X_j)} a_p^T b_q \tag{7}$$

subject to $a_p^T a_p = b_p^T b_p = 1$ and $a_p^T a_q = b_p^T b_q = 0$, $p \ne q$.

The value of $k_{cc}(X_i, X_j)$ can be computed from the singular value decomposition of $X_i^T X_j$, i.e. $X_i^T X_j = Q_{12}\,\Sigma\,Q_{21}$. The value of $k_{cc}(X_i, X_j)$ is equal to the largest singular value in the diagonal matrix $\Sigma$. Harandi et al. [27] showed that the canonical correlation kernel is a valid Grassmannian kernel.
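A minimal sketch of this computation, assuming the two points are given as n x d orthonormal matrices (the function name is ours):

```python
import numpy as np

def canonical_correlation_kernel(Xi, Xj):
    """k_cc(Xi, Xj): the largest canonical correlation cos(theta_1), i.e.
    the largest singular value of Xi^T Xj (Eq. 7)."""
    singular_values = np.linalg.svd(Xi.T @ Xj, compute_uv=False)
    return singular_values[0]   # singular values are returned in descending order
```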
3.1.3 Proposed Kernel

We introduce our kernel function based upon the geodesic distance between points on the Grassmannian manifold. Here, we first provide a description of the geodesic distance on the Grassmannian and then propose our kernel function in equation (10). Later on, we investigate different strategies to achieve a positive definite kernel matrix from the proposed kernel.
The geodesic distance between any two points on the manifold is the length of the shortest possible curve connecting them. Gallivan et al. [28] use the quotient interpretation to derive the geodesic path on the Grassmannian. More specifically,

the Grassmannian manifold can be considered as a quotient space of a Stiefel manifold. The Stiefel manifold itself can be considered as a quotient space of a more basic manifold, i.e. the special orthogonal group $SO(n)$. Therefore, the Grassmannian manifold can be studied as a quotient space of the special orthogonal group. The quotient interpretation allows us to use well known results from the established theory of basic manifolds such as the special orthogonal group. Geodesic paths on $SO(n)$ are given by one-parameter (i.e. the time $t$) exponential flows $t \mapsto \exp(tB)$, where $B \in \mathbb{R}^{n \times n}$ is a skew-symmetric matrix [18], [28]. Using the quotient interpretation, a geodesic path on $\mathcal{G}_{n,d}$ is given by $t \mapsto \exp(tB)$, where the matrix $B$ is of the form [28],

$$B = \begin{bmatrix} 0 & -A^T \\ A & 0 \end{bmatrix}, \quad A \in \mathbb{R}^{(n-d)\times d} \tag{8}$$

The matrix $A$ in equation (8) specifies the direction and speed of the geodesic flow. As shown in [18], [28], in order to compute the geodesic distance between two points $X_i$ and $X_j$ on the Grassmannian, we need to find an appropriate direction matrix $A$ such that the geodesic flow originating from $X_i$ reaches $X_j$ in unit time. For this purpose, we first map $X_i$ and $X_j$ to their respective tangent spaces and then use the algorithm proposed in [28] for an efficient computation of the matrix $A$. The geodesic distance between $X_i$ and $X_j$ is then given by,

$$d_g(X_i, X_j) = \operatorname{trace}(A^T A) \tag{9}$$

where trace denotes the sum of the diagonal elements. By considering the geodesic distance $d_g(X_i, X_j)$, we introduce the following kernel function on the Grassmannian manifold,

$$k_g(X_i, X_j) = \exp\!\left(-\frac{d_g(X_i, X_j)}{2\sigma^2}\right) \tag{10}$$
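The sketch below illustrates the proposed kernel of Eq. (10) under one simplifying assumption: instead of computing the direction matrix A with the algorithm of [28], the geodesic distance is obtained from the principal angles between the two subspaces, which under the usual arc-length metric equals trace(A^T A). The bandwidth value and the function name are placeholders of ours.

```python
import numpy as np

def geodesic_rbf_kernel(Xi, Xj, sigma=1.0):
    """Geodesic-distance RBF kernel sketch (Eq. 10) for n x d orthonormal Xi, Xj."""
    s = np.linalg.svd(Xi.T @ Xj, compute_uv=False)
    thetas = np.arccos(np.clip(s, -1.0, 1.0))   # principal angles
    d_g = np.sum(thetas ** 2)                   # squared arc-length geodesic distance
    return np.exp(-d_g / (2.0 * sigma ** 2))
```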

For the proposed kernel to be a valid kernel, the necessary and sufficient conditions are: 1) it should be well-defined, and 2) the resulting Gram matrix $K \in \mathbb{R}^{N \times N}$,
where $K_{i,j} = k_g(X_i, X_j)$, should be symmetric positive definite. The proposed kernel in Eq. (10) is well defined but not necessarily positive definite in all cases. To make it positive definite, we investigated the following two strategies.
The proposed kernel is a Radial Basis Function (RBF) kernel based on the
geodesic distance on the Grassmann manifold. In the Euclidean space, an RBF
kernel is a valid kernel because the distance term in the exponent defines a
valid inner product in the Euclidean Space [24]. For the geodesic distance on
the manifold to be defined by a valid inner product, we should first map all
the points on the manifold to a common tangent space and then compute the
geodesic distance in terms of the inner product between the mapped vectors on
the tangent space. The common tangent space can correspond to the mean of
all the points on the manifold or to the truncated identity matrix. Although the
geodesic distance defined in terms of the inner product between the mapped
points on the common tangent space results in a positive definite K, there is
a drawback in computing the geodesic distance on the manifold using this
approach. Specifically, mapping the points on the manifold to a common tangent space does not preserve the global structure and the pairwise distances of the original data points on the manifold well. The distribution of the resulting mapped
points tends to be very different from the original distribution of the points
on the manifold and hence the resulting geodesic distance is not very accurate
[29].
In order to enforce our $K$ to be positive definite, we adopt an analytical approach called spectrum transformation [30]–[32]. We first represent $K$ by its singular value decomposition, i.e. $K = U\Lambda U^T$, where $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N)$ is a diagonal matrix of the eigenvalues $\lambda_i$. To approximate $K$ with a positive definite kernel matrix $\tilde{K}$, the following convex optimization problem is then formulated:

$$
\begin{aligned}
\underset{\tilde{K}}{\text{minimize}} \quad & \|\tilde{K} - K\| \\
\text{subject to} \quad & \tilde{K} \succeq 0.
\end{aligned}
\tag{11}
$$
The solution to Eq. (11) is given by,

$$\tilde{K} = U\tilde{\Lambda}U^T, \tag{12}$$

where $\tilde{\Lambda} = \operatorname{diag}(\max(0, \lambda_1), \max(0, \lambda_2), \ldots, \max(0, \lambda_N))$. In summary, a positive definite kernel matrix $\tilde{K}$ is obtained from a non-positive definite matrix $K$ by discarding the negative eigenvalues in $\Lambda$ and replacing them with zeros. Note that the positive definite kernel matrix $\tilde{K}$ is determined from the training data only. In order to transfer the effect of the modifications (done on $K$ to achieve $\tilde{K}$) onto the test data in a consistent manner, we can represent these modifications by a linear transformation, i.e. $\tilde{K} = PK$, where $P = U^T \operatorname{diag}\!\left(I_{\{\lambda_1 \ge 0\}}, \ldots, I_{\{\lambda_N \ge 0\}}\right) U$ and $I_{\{\cdot\}}$ is the indicator function. The modifications done on $K$ to achieve $\tilde{K}$ can then be reflected onto a test sample $X_{test}$ by $\tilde{X}_{test} = PX_{test}$.
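A minimal sketch of the spectrum transformation of Eqs. (11)-(12), assuming a symmetric kernel matrix K (NumPy-based, function name ours):

```python
import numpy as np

def clip_to_positive_definite(K):
    """Approximate a symmetric kernel matrix K by the closest positive
    semi-definite matrix: zero out its negative eigenvalues (Eq. 12)."""
    eigvals, U = np.linalg.eigh(K)           # K = U diag(eigvals) U^T
    clipped = np.maximum(eigvals, 0.0)       # discard negative eigenvalues
    return (U * clipped) @ U.T               # U diag(clipped) U^T
```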
3.1.4 Kernel Combinations

From the theory of RKHS [24], it is well known that a linear combination of valid kernels is itself a valid kernel. Therefore, we can obtain a new kernel by,

$$k_{new} = \beta_{proj}\, k_{proj} + \beta_{cc}\, k_{cc} + \beta_{g}\, k_{g} \tag{13}$$

where $\beta_{proj}, \beta_{cc}, \beta_{g} \ge 0$. Our experimental results (see Sec 6.2) show that the kernel combination enhances the discrimination of the points and results in improved classification.
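Combining the three kernels is then a weighted sum per pair of points. In the sketch below the mixing weights are arbitrary placeholders (the paper selects them experimentally), and the three kernel functions are the sketches given in the preceding subsections.

```python
def combined_kernel(Xi, Xj, b_proj=1.0, b_cc=1.0, b_g=1.0, sigma=1.0):
    """Linear combination of Grassmannian kernels (sketch of Eq. 13)."""
    return (b_proj * projection_kernel(Xi, Xj)
            + b_cc * canonical_correlation_kernel(Xi, Xj)
            + b_g * geodesic_rbf_kernel(Xi, Xj, sigma))
```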

3D VIDEO DATABASE NORMALIZATION

We use the BU4DFE database [33] for our experiments. This database comprises 101 subjects (58 females and 43 males) with an age range of 18-45 years,
belonging to various ethnic and racial groups including Asian (28), Black (8), Latino (3) and White (6). Under the supervision of a psychologist, each subject was asked to perform six different expressions, i.e. angry, disgust, fear, happy, sad and surprise. Each facial expression was captured to produce a 4-second video sequence of temporally varying 2D texture and 3D shapes at a rate of 25 frames per second.

Fig. 1. 3D face normalization


The videos of BU4DFE are acquired from shoulder level up. The size and
resolution of the meshes of a given raw 3D video vary from frame to frame.
We perform an efficient fully automatic face normalization to obtain uniformity
Fig. 2. A few normalized frames (both 2D and 3D) of an expression video from the BU4DFE database for visualization. (a) Range images. (b) RGB images.

in terms of the size and resolution of all meshes of a video. The inputs to our algorithm are the meshes of a raw 3D video. The algorithm removes outlier points, detects the nose tip, crops the Region Of Interest (ROI) around the detected nose tip and corrects the pose of the face. Figure 1 shows the block diagram of the 3D face normalization. The details of each block are given below.
We represent a 3D face by a point cloud matrix $P \in \mathbb{R}^{m \times 3}$, where $m$ is the total number of points and each row of $P$ corresponds to the $x, y, z$ coordinates of a point (vertex). A raw 3D face contains outlier points, as shown by the encircled region in Figure 1(a). We remove these outlier points by finding the mean ($\mu = \frac{1}{m}\sum_{i=1}^{m} z_i$) and the standard deviation ($\sigma = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(z_i - \mu)^2}$) of the depths ($z_i$) of all points. Any point whose depth is outside the $3\sigma$ limit is considered an outlier and is filtered out. Figure 1(b) shows a 3D face after the removal of outlier points. To reduce dimensionality and computational complexity, we sample the filtered 3D face onto a uniform rectangular grid of $278 \times 208$ ($\frac{1}{5}$th of the original size, i.e. $1392 \times 1040$) at a resolution of 1 mm. The resulting uniformly sampled pointcloud is the input to the nose tip detection algorithm.
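A minimal sketch of the 3-sigma depth filtering described above, assuming the point cloud is an m x 3 NumPy array (function name ours):

```python
import numpy as np

def remove_depth_outliers(P):
    """Filter out points whose depth lies outside the 3-sigma limit."""
    z = P[:, 2]                                  # depth values
    mu, sigma = z.mean(), z.std()
    return P[np.abs(z - mu) <= 3.0 * sigma]
```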
The nose tip detection algorithm [34] has three steps. First, 61 2D face profiles of the input pointcloud are generated by rotating the face around the y-axis from $-90°$ to $90°$ in steps of $3°$ (see Fig. 3 (a)). Second, from the generated face profiles, a set of possible nose tip candidates is determined by moving a circle along all the points of these face profiles. A point on the face profile is selected as a nose tip candidate if it fulfills the following criterion: the difference between the areas of the circle lying inside and outside of the face profile must be greater than an experimentally determined threshold. The nose tip candidates selected using this criterion are marked in Fig. 3 (a). The frequency of occurrence of these nose tip candidates with respect to the y-axis is shown in Fig. 3 (b). It can be seen that most of the candidates correspond to the actual nose tip, while others are from the chin and shoulder regions. Finally, in order to filter out the false nose tip candidates and only select the valid nose tip, a triangle is fitted around all candidates. The candidate which best fits the triangle is then selected as the final nose tip (shown in Fig. 3 (c)).
Fig. 3. Nose-tip detection. (a) Face profiles are generated by rotating the face around the y-axis. A set of possible nose-tip candidates is obtained by moving a circle along every point of these face profiles; if, at a point, the area of the circle outside the face profile is greater than the area inside, the point is selected as a possible nose-tip. (b) Frequency of occurrence of the candidate nose-tips w.r.t. the y-axis. A triangle is fitted at these nose-tip candidates. (c) The candidate which best fits the triangle is declared as the final nose-tip.
After a successful detection of the nose tip, a sphere of radius $r = 85$ mm centered at the nose tip is used to crop the face. Figure 1(d) shows the cropped 3D face. Although the videos of the BU4DFE dataset have been acquired in a frontal view, slight head rotations and pose variations were observed. We use a technique similar to [35] for pose correction. The mean vector $\mu = \frac{1}{m}\sum_{i=1}^{m} P_i$ and the covariance matrix $C = \sum_{i=1}^{m} P_i P_i^T - \mu\mu^T$ of the pointcloud matrix $P$ are
computed. The Principal Component Analysis (PCA) of the covariance matrix gives a matrix $V$ of eigenvectors, which is used to align the pointcloud matrix $P$ along its principal axes using $P' = V(P - \mu)$. The pose corrected pointcloud is once again resampled onto a uniform square grid of $160 \times 160$ at 1 mm resolution.
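The sketch below illustrates this PCA-based pose correction: it centres the point cloud and rotates it into its principal-axis frame, corresponding to P' = V(P - mu). The covariance here is computed from the centred points, a slight variant of the formula above; the function name is ours.

```python
import numpy as np

def pose_correct(P):
    """Align an m x 3 point cloud with its principal axes via PCA."""
    mu = P.mean(axis=0)
    Pc = P - mu                                  # centre the point cloud
    C = Pc.T @ Pc                                # 3 x 3 (unnormalized) covariance
    _, V = np.linalg.eigh(C)                     # eigenvectors, ascending order
    V = V[:, ::-1]                               # sort by decreasing variance
    return Pc @ V                                # coordinates in the principal-axis frame
```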
For the meshes of a raw 3D video, the above process of face normalization gives uniformly sampled, pose corrected range images. These range images are joined together to form a range video. As the 2D and 3D data are aligned in the BU4DFE database, the above process is used to normalize the corresponding 2D texture videos as well. The location of the nose-tip detected from the 3D mesh is used to crop the corresponding region from the 2D textured face, which is then uniformly sampled and pose corrected by applying the matrix $V$ to the R, G, and B pixels (mapped onto the 3D pointcloud). A few normalized range images and the corresponding RGB images from a video of the database are shown in Fig 2(a) and (b) for visualization.

THE PROPOSED SYSTEM

The complete pipeline of our automatic 3D video based facial expression recognition system is shown in Fig 4. During the offline training phase, our system extracts video-patches from different locations of the training videos, separately clusters these video-patches into class representatives and learns six binary SVM models on these class representatives. During the online testing phase, a similarity score between every extracted video-patch of the query video and each of the learnt SVM models is computed, and a voting based strategy is used to decide on the class of the query video. A detailed description of each block of our system is given here.
An expression video is generally modeled to contain segments such as neutral
followed by onset, apex and an offset. During our visual inspection of the
videos of the BU4DFE database, we noticed that the neutral-onset-apex-offset
order does not necessarily hold for every video. For example, some videos start
(a) A video is represented as points on the Grassmannian manifold. All raw 3D meshes of the video are normalized (see Fig 1), a sliding window is used to extract video-patches of variable lengths from different locations of the normalized video, LBPs are computed for each of the extracted video-patches for computational benefits, and finally the top four most significant basis vectors obtained by SVD of the video-patch matrix are considered as a point on the Grassmannian.

(b) The complete pipeline of our 3D video based FER system. The system comprises two parts: offline training and online testing. During the offline training, for each of the six expression classes, video-patches from all training videos are extracted and represented as points on the Grassmannian, and spectral clustering is used to compute class representatives. Six one-vs-all binary SVMs are learnt on these class representatives. During online testing, the video-patches extracted from a test video are represented as points on the manifold and matched with the learnt SVM models, followed by a weighted-voting based strategy for expression classification of the test video.

Fig. 4. Block diagram of our 3D video based facial expression recognition system

from the onset of the expression and skip the neutral part; in other videos, a performer might not return to the offset of the expression. Thus, modeling the complete video sequence as a whole could result in performance degradation. Therefore, we need to extract local video-patches of different lengths from numerous locations of a video.
Given a normalized video sequence $V = [f_1, f_2, f_3, \ldots, f_n]$ of $n$ frames, a sliding window of variable length is used to extract video-patches along the sequence. More specifically, a sliding window of length $m$ frames extracts a video-patch from $f_1$ to $f_m$, then the sliding window is shifted by $\frac{m}{2}$ frames and the next patch is extracted. The process is repeated until the last frame, i.e. $f_n$, is reached. An illustration of the extraction of video-patches using a sliding window is presented in Fig 4(a). Four different lengths of the sliding window ($m \in \{24, 30, 36, 44\}$) are used. The motivation for using different lengths of the sliding window comes from our observation that if a person performs an activity during one expression in a certain number of frames, another person might perform the same activity of the same expression in a different number of frames. An experimental analysis of the effect of the window length on the performance of the proposed method is presented in Sec. 6.2.1.
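A minimal sketch of this sliding-window extraction, assuming the video is given as a list of per-frame feature vectors (names ours); windows of length m are taken with a stride of m/2 for each of the four window lengths:

```python
def extract_video_patches(frames, window_lengths=(24, 30, 36, 44)):
    """Extract overlapping video-patches with a stride of half the window length."""
    patches = []
    for m in window_lengths:
        start = 0
        while start + m <= len(frames):
            patches.append(frames[start:start + m])
            start += m // 2
    return patches
```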
Each of the extracted video-patches is represented as a matrix $X = [f_1\, f_2 \ldots f_m]$, $X \in \mathbb{R}^{25600 \times m}$, whose columns correspond to the raster-scanned depth values of the $m$ frames of the video-patch. We can lose the identity information and only retain the required expression deformation information from $X$ by subtracting the mean $\mu = \frac{1}{m}\sum_{i=1}^{m} f_i$ from $X$. The Singular Value Decomposition (SVD) of the resulting matrix $X' = X - \mu$ is given by $X' = USV^T$. The columns of $U$ form a set of orthonormal vectors, which can be regarded as basis vectors. These basis vectors are arranged in descending order of significance and carry the important expression deformation information contained in the video-patch. Fig 5 shows the top four basis vectors corresponding to the top four singular values of video-patches extracted from happy, sad and surprise expressions. A set of these basis vectors (i.e. a tall-thin orthonormal matrix containing
Fig. 5. Top four basis vectors of video-patches extracted from Happy, Sad and Surprise expression videos. (a) Happy. (b) Sad. (c) Surprise.

the top basis vectors in $U$) can be considered as a point on the Grassmannian manifold [18], [19]. In our case, as we only consider the top four basis vectors, our point lies on $\mathcal{G}_{25600,4}$. Clearly, the dimensions of $\mathcal{G}_{25600,4}$ are quite large and would require a lot of memory. We can overcome this by replacing the depth values in our video-patch matrix $X$ with histograms of LBPs [36]. Every frame in $X$ is divided into $4 \times 4$ non-overlapping blocks and histograms of $LBP^{u2}_{8,1}$ [36] in $\mathbb{R}^{59}$ are computed for each block. The histograms of all blocks are concatenated and the frame is represented by a feature vector in $\mathbb{R}^{944}$. This results in our points lying on $\mathcal{G}_{944,4}$ instead of $\mathcal{G}_{25600,4}$. Extracting video-patches and representing them as points on $\mathcal{G}_{944,4}$ also helps us avoid corrupted and incomplete video frames. By considering only the top four most significant basis vectors, we retain the most prevalent and consistent information in the video-patch and ignore the aberrant information from the corrupted frames.
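The sketch below turns one video-patch into a point on $\mathcal{G}_{944,4}$: per-frame $LBP^{u2}_{8,1}$ histograms over a 4x4 block grid, mean subtraction, and the top four left singular vectors. It assumes scikit-image for the LBP computation (its 'nri_uniform' mode gives the 59 uniform pattern bins); the function names are ours.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def frame_to_lbp_feature(frame, grid=(4, 4), n_bins=59):
    """Concatenated block-wise LBP^{u2}_{8,1} histograms (R^944 for a 4x4 grid)."""
    lbp = local_binary_pattern(frame, P=8, R=1, method='nri_uniform')
    h, w = lbp.shape
    bh, bw = h // grid[0], w // grid[1]
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            block = lbp[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)

def patch_to_grassmann_point(patch_frames, k=4):
    """Mean-subtract the frame features and keep the top-k basis vectors."""
    X = np.stack([frame_to_lbp_feature(f) for f in patch_frames], axis=1)
    X = X - X.mean(axis=1, keepdims=True)        # drop the identity information
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]                              # 944 x 4 orthonormal matrix
```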
The above process gives us points on the manifold for one video. We perform the same process and compute the points on the Grassmannian for all video-patches extracted from the training videos of one expression class. The next step is to cluster these points. We follow the procedure described in Sec 2 to perform clustering. During clustering, our similarity graph-Laplacian matrix $L$ showed that some of the points are very dissimilar from most of the other

DRAFT

1949-3045 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TAFFC.2014.2330580, IEEE Transactions on Affective Computing

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

23

points. These points come from those parts of the videos where the expression is performed in an incorrect and inconsistent manner. Therefore, we only consider
the top 300 most similar points for each class and group them into 15 clusters.
The mean of every cluster is computed following the procedure described in
[18]. Finally, the 15 cluster centers are considered as class representatives and
are used in the classification step.
Following a similar procedure, the class representatives of all six classes are computed. The next step is to learn SVM models on the Grassmannian for these class representatives. To do that, we first embed the class representatives in the RKHS by using one of the Grassmannian kernel functions described in Sec 3.1. SVM is designed for binary classification. For multi-class classification, we can use an appropriate multi-class strategy and train multiple binary SVMs. One-vs-one (pairwise comparisons, a total of $\frac{k(k-1)}{2}$ binary SVMs for $k$ classes) and one-vs-all (one against all others, a total of $k$ binary SVMs for $k$ classes) are the two most popular multi-class SVM strategies. A comparison [37] of multi-class SVM strategies shows that, amongst one-vs-one and one-vs-all, one-vs-all achieves a comparable performance with a faster speed. We therefore use the one-vs-all multi-class strategy and train six binary SVM models for the six expression classes. For each expression class, the class representatives belonging to that class are labeled as $+1$, whereas the class representatives belonging to all other classes are labeled as $-1$. Using the CVX package [25], the optimization problem in equation (3) is solved for the optimal dual variables and the corresponding primal variables are computed. Six binary SVM models are thus learnt for the six expression classes. Let us denote the learnt SVM models by $\Theta_c = \{w_c^*, b_c^*\},\ c = 1, \ldots, 6$.
Given a test video $X_{test}$, after normalization and extraction of video-patches from $X_{test}$, we represent these video-patches as points on $\mathcal{G}_{944,4}$. The points are then embedded into the RKHS by using a Grassmannian kernel function. For every embedded point, a similarity score is computed with each of the six learnt SVM models. The similarity is determined in terms of the normalized distance of
the point from the corresponding SVM's hyperplane. Specifically, the similarity $d_c^{(j)}(X_{test})$ of the $j$th video-patch to the $c$th SVM model ($\Theta_c = \{w_c^*, b_c^*\}$) is given by:

$$d_c^{(j)}(X_{test}) = \operatorname{sgn}\!\left(\frac{\bigl\langle w_c^*,\, X_{test}^{(j)} \bigr\rangle + b_c^*}{\|w_c^*\|}\right) \tag{14}$$

After computing the similarity of all the embedded points of the video to all SVM models, a weighted-voting based strategy is used to decide on the class of the query video. Every video point casts a vote for each of the six expression classes, with the weight of each vote being equal to the similarity to the corresponding SVM model (obtained using Eq. (14)). The class label $y_{test}$ of the test video $X_{test}$ is then given by:

$$y_{test} = \arg\max_{c} \sum_{j} d_c^{(j)}(X_{test}) \tag{15}$$

That is, the weights cast for all six classes are accumulated and the class with the maximum accumulated weight is declared to be the class of the query video.
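The weighted-voting step of Eqs. (14)-(15) amounts to accumulating, per class, the similarity scores of all video-patches. A minimal sketch (names ours), where score_fn returns the sigmoid-normalized distance of one point from one model's hyperplane:

```python
import numpy as np

def classify_video(patch_points, svm_models, score_fn):
    """Accumulate per-class weighted votes and return the winning class index."""
    votes = np.zeros(len(svm_models))
    for point in patch_points:
        for c, model in enumerate(svm_models):
            votes[c] += score_fn(model, point)   # weight of the vote (Eq. 14)
    return int(np.argmax(votes))                 # Eq. (15)
```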
2D-3D Fusion: Modern 3D imaging technologies can simultaneously acquire
2D texture (RGB) videos along with the co-registered depth (3D) information.
In order to improve the accuracy of the proposed method, we use both the
texture and the depth information in parallel. For this purpose, we follow the
pipeline in Fig 4 for 2D and 3D data separately and learn their corresponding
SVM models. For the 2D data, the RGB images are first converted into gray
scale images, whereas the depth images are directly used. After learning the
SVM models from the 2D and 3D data separately, the next step is to perform
classification. Given a 2D-3D test video X_test (both 2D and 3D video frames), the classification task consists of finding the class label y_test of the video. For this purpose, we first separately extract video-patches from the 2D and 3D data of X_test, represent these patches as points on the manifold and then compute their similarity from the corresponding SVM models. Let d_c^[2D](X_test^(j)) and d_c^[3D](X_test^(j)) be the similarity of the j-th video-patch from the c-th SVM model trained using
2D and 3D data respectively. The class label y_test of the test video is then given by:

$$y_{test} = \arg\max_{c} \sum_{j} \Big( \alpha_{3D}\, d_c^{[3D]}\big(\mathcal{X}_{test}^{(j)}\big) + \alpha_{2D}\, d_c^{[2D]}\big(\mathcal{X}_{test}^{(j)}\big) \Big) \qquad (16)$$

That is, a weighted voting strategy is used to fuse the information from the 2D and 3D data. The parameters α_2D and α_3D determine the importance given to each data modality. These parameters are determined empirically (for the best cross-validation classification accuracy) by performing a grid search over a range of values between 0 and 1.
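A minimal sketch of this score-level fusion and of the grid search for the fusion weights (written α_2D and α_3D above) is given below; the per-patch score arrays, the validation data and the grid step of 0.1 are assumptions made for illustration only.

```python
import numpy as np
import itertools

def fuse_and_classify(scores_2d, scores_3d, alpha_2d, alpha_3d):
    """Eq. 16 (sketched): scores_* are (num_patches x 6) arrays of per-patch
    SVM similarities for the 2D and 3D streams of one test video."""
    fused = alpha_3d * scores_3d.sum(axis=0) + alpha_2d * scores_2d.sum(axis=0)
    return int(np.argmax(fused))

def grid_search_weights(val_scores_2d, val_scores_3d, val_labels, step=0.1):
    """Pick (alpha_2d, alpha_3d) in [0, 1] that maximise validation accuracy."""
    best, best_acc = (0.5, 0.5), -1.0
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for a2d, a3d in itertools.product(grid, grid):
        preds = [fuse_and_classify(s2, s3, a2d, a3d)
                 for s2, s3 in zip(val_scores_2d, val_scores_3d)]
        acc = np.mean(np.asarray(preds) == np.asarray(val_labels))
        if acc > best_acc:
            best, best_acc = (a2d, a3d), acc
    return best
```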

EXPERIMENTS AND RESULTS

We evaluate the performance of our system on the BU4DFE database. To the best of our knowledge, this is the largest publicly available 3D video database for FER. We first present an overview of the experimental protocols followed by previous papers on the BU4DFE database (Sec 6.1). We then devise our own experimental settings and report the performance of our proposed system (Sec 6.2). We also present a comparison of our system with previously published methods (Sec 6.3).
6.1 Experimental Protocols on BU4DFE

Unfortunately, no standard performance evaluation protocol has been followed by the previous works on the BU4DFE database. Instead, each paper follows its own experimental procedure. A summary of the testing procedures followed in the literature, along with the achieved classification accuracies, is presented in Table 1. The BU4DFE database contains a total of 101 subjects and 606 videos, i.e. each subject has six videos corresponding to the six performed expressions. As can be observed in Table 1, none of the previous works reports results for all 606 videos. Instead, a subset of the dataset is extracted and the performance

| Paper                     | Persons/Videos | Expressions                       | Testing    | Results                  |
| Sun and Yin [5]           | 60 subjects    | All six                           | 10 fold CV | 90.44%                   |
| Drira et al. [11]         | 60 subjects    | All six                           | 10 fold CV | 93.21%                   |
| Huang et al. [10]         | 60 subjects    | All six                           | 20 fold CV | 85.6%                    |
| Le et al. [7]             | 60 subjects    | Sa, Ha, Su                        | 10 fold CV | 92.22%                   |
| Fang et al. [9], [13]     | 507 videos     | All six / An, Ha, Su / Sa, Ha, Su | 10 fold CV | 74.63% / 96.71% / 95.75% |
| Sandbach et al. [6], [14] | 397 videos     | All six / Sa, Ha, Su              | 10 fold CV | 64.6% / 83.03%           |
| Rosato et al. [8]         | Not mentioned  | All six                           | 20 fold CV | 85.9%                    |

TABLE 1
A summary of the experimental settings followed by previously published methods on the BU4DFE database. The acronyms used in the table are, CV: Cross Validation, Sa: Sad, Ha: Happy, Su: Surprise, An: Angry.

of the system is evaluated for that subset. Based upon the selection of the data
subset, we can roughly categorize the previous papers as:
1) Papers which select a subset of 60 subjects. Either 10 fold or 20 fold
person-independent cross validation testing is performed on the selected
subjects. For example, for 10 fold cross validation testing, videos of all
expression classes of 54 subjects are used for training and 6 subjects are
used for testing. Papers in [5], [10], [11] report classification accuracy for
all six expression classes, whereas the method in [7] is evaluated for only
three expressions i.e. sad, happy and surprise.
2) Papers which select a subset of either 507 or 397 videos. The proposed methods in [6], [9], [13], [14] model a video by a four state model
i.e. neutral-onset-apex-offset. These methods make an assumption that
a video starts and ends with the neutral frame. The authors therefore
manually inspect all videos of the dataset and exclude those videos which
do not fulfil these criteria. Moreover, videos which contain corrupted
meshes are also excluded and only 507 & 397 videos are selected in [9], [13]
| proj | cc | g | Experiment-1 (2D / 3D / 2D+3D) | Experiment-2 (2D / 3D / 2D+3D) |
|  1   | 0  | 0 | 88.17% / 91.14% / 93.81%       | 86.89% / 90.62% / 92.07%       |
|  0   | 1  | 0 | 70.50% / 80.78% / 81.94%       | 60.40% / 68.67% / 79.12%       |
|  0   | 0  | 1 | 88.34% / 91.15% / 93.97%       | 86.96% / 91.31% / 92.39%       |
|  0   | 1  | 1 | 83.66% / 90.42% / 92.45%       | 80.12% / 88.27% / 90.13%       |
|  1   | 0  | 1 | 88.76% / 91.62% / 94.34%       | 87.26% / 90.84% / 93.24%       |
|  1   | 1  | 0 | 83.86% / 90.39% / 91.61%       | 80.83% / 88.64% / 89.55%       |
|  1   | 1  | 1 | 85.10% / 90.43% / 92.55%       | 81.55% / 91.60% / 92.43%       |

TABLE 2
A summary of the results of our experiments. The system achieves the best classification accuracy for a linear combination of the projection and the proposed kernel, i.e. for proj = 1, cc = 0, g = 1.

and [6], [14] respectively. The testing is performed on all six expressions as well as on three expressions, i.e. either angry-happy-surprise or sad-happy-surprise.
6.2 Our Results and Analysis

In order to compare our results with previously published methods, we perform the following experiments:
1) Experiment-1: 10 fold person-independent cross validation testing on a subset of 60 subjects
2) Experiment-2: 10 fold person-independent cross validation testing on the complete dataset, i.e. all 606 videos
Experiment-1 will compare our results with [5], [7], [10], [11], whereas Experiment-2 will compare our results with [6], [9], [13], [14]. Further analysis of the comparison is provided in Sec 6.3.
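A person-independent split simply requires that the videos of a given subject never appear in both the training and the testing folds. A sketch using scikit-learn's GroupKFold is shown below; the subject identifiers and the placeholder features are assumptions, since the exact fold generation used here is not specified beyond being person-independent.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def person_independent_folds(video_ids, subject_ids, n_folds=10):
    """Yield (train, test) index arrays such that no subject appears in both."""
    gkf = GroupKFold(n_splits=n_folds)
    X = np.zeros((len(video_ids), 1))          # placeholder features
    for train_idx, test_idx in gkf.split(X, groups=subject_ids):
        yield train_idx, test_idx

# Example: 606 videos of 101 subjects (6 videos each)
subjects = np.repeat(np.arange(101), 6)
videos = np.arange(606)
for train_idx, test_idx in person_independent_folds(videos, subjects):
    # the same subject never occurs on both sides of the split
    assert not set(subjects[train_idx]) & set(subjects[test_idx])
```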
The experiments are performed for different combinations of the kernel functions, i.e. different values of the kernel weights proj, cc and g in equation (13). In order to achieve consistency in our results, each experiment is performed for 10 different runs

|    | An    | Di    | Fe    | Ha    | Sa    | Su    |
| An | 92.71 |  2.89 |  1.67 |  0.00 |  2.73 |  0.00 |
| Di |  2.88 | 93.46 |  1.82 |  1.17 |  0.00 |  0.67 |
| Fe |  3.91 |  4.17 | 90.09 |  1.00 |  0.33 |  0.50 |
| Ha |  1.09 |  0.91 |  0.00 | 98.00 |  0.00 |  0.00 |
| Sa |  1.67 |  2.69 |  2.53 |  0.00 | 93.12 |  0.00 |
| Su |  0.00 |  0.00 |  1.33 |  0.00 |  0.00 | 98.67 |

TABLE 3
Confusion matrix corresponding to the best classification accuracy (i.e. for proj = 1, cc = 0, g = 1) for Experiment-1. An: Angry, Di: Disgust, Fe: Fear, Ha: Happy, Sa: Sad, Su: Surprise.

for different selections of the training and testing subjects. The results averaged over the ten runs are presented in Table 2. The best classification accuracy is achieved for a linear combination of the projection and the proposed kernel, i.e. for proj = 1, cc = 0, g = 1. The confusion matrices for this combination for Experiment-1 and Experiment-2 are shown in Table 3 and Table 4 respectively. The results indicate that the system achieves the highest classification accuracy for the happy and surprise expressions, as these are the most consistently performed expressions. Angry and fear are rather more subtle expressions and the system achieves a comparatively lower accuracy for them. The best overall average classification accuracy achieved by our system for all expressions is 94.34% and 93.24% for Experiment-1 and Experiment-2 respectively.
Our proposed kernel function provides a good classification accuracy of 93.97% when tested alone, i.e. for proj = 0, cc = 0, g = 1. It can be seen that, when tested alone (the first three rows in Table 2), the performance of the proposed kernel function is superior to that of the other two kernel functions.
6.2.1 Effect of video-patch length on performance

The effect of the video patch length on the performance of the proposed method
is presented in Table 5. Experiments are performed by changing the video patch
|    | An    | Di    | Fe    | Ha    | Sa    | Su    |
| An | 91.00 |  3.03 |  2.55 |  0.00 |  3.29 |  0.13 |
| Di |  0.00 | 95.71 |  1.72 |  1.43 |  0.00 |  1.14 |
| Fe |  5.57 |  3.24 | 83.62 |  2.86 |  1.14 |  3.57 |
| Ha |  1.67 |  0.00 |  0.47 | 97.86 |  0.00 |  0.00 |
| Sa |  1.93 |  2.30 |  1.29 |  0.00 | 93.69 |  0.79 |
| Su |  0.00 |  0.00 |  1.26 |  1.17 |  0.00 | 97.57 |

TABLE 4
Confusion matrix corresponding to the best classification accuracy (i.e. for proj = 1, cc = 0, g = 1) for Experiment-2. An: Angry, Di: Disgust, Fe: Fear, Ha: Happy, Sa: Sad, Su: Surprise.

| Patch Length     | Experiment-1 (2D / 3D / 2D+3D) | Experiment-2 (2D / 3D / 2D+3D) |
| 16               | 79.63% / 81.31% / 82.03%       | 75.45% / 78.42% / 81.64%       |
| 24               | 81.54% / 82.09% / 84.26%       | 76.21% / 78.67% / 82.34%       |
| 32               | 81.23% / 84.82% / 85.36%       | 79.92% / 81.72% / 83.13%       |
| 40               | 83.94% / 85.14% / 89.42%       | 81.28% / 84.81% / 85.42%       |
| 48               | 83.75% / 86.01% / 89.18%       | 82.19% / 84.37% / 86.21%       |
| {24, 30, 36, 44} | 88.76% / 91.62% / 94.34%       | 87.26% / 90.84% / 93.24%       |

TABLE 5
Performance analysis of the proposed method for different values of the video-patch length.

length from 16 to 48 with a step size of 8 frames. The results show that the performance of the proposed method improves as the patch length increases. The results also show that the proposed method achieves the best performance when video patches of multiple lengths are combined. Using video patches of different lengths introduces temporal invariance and results in an improved performance.
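For illustration, the temporal windowing behind such multi-length patches could look like the sketch below; only the temporal slicing is shown (the spatial sampling of local patches from numerous face locations is omitted), and the stride of 8 frames is an assumed parameter rather than a value reported here.

```python
import numpy as np

def temporal_patches(video, lengths=(24, 30, 36, 44), stride=8):
    """Cut a video (num_frames x height x width array) into overlapping
    temporal patches of several lengths (the combination used in Table 5)."""
    patches = []
    for length in lengths:
        for start in range(0, video.shape[0] - length + 1, stride):
            patches.append(video[start:start + length])
    return patches
```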
6.2.2 Performance evaluation using ARMA models

Auto-Regressive and Moving Average (ARMA) models have been used for video analysis in a number of tasks such as dynamic texture analysis [38], silhouettes [39] and action recognition [18], [40]. Here, we present a performance evaluation of our method when the videos are modeled using ARMA models. Given a video sequence {f(t)}_{t=1}^{τ}, f(t) ∈ R^p, an ARMA process models the video as:

$$f(t) = C\,z(t) + w(t), \qquad w(t) \sim \mathcal{N}(0, R)$$
$$z(t+1) = A\,z(t) + v(t), \qquad v(t) \sim \mathcal{N}(0, Q) \qquad (17)$$

where z(t) ∈ R^d is the hidden state vector, A ∈ R^{d×d} is the transition matrix and C ∈ R^{p×d} is the measurement matrix. v and w are the noise terms, modeled by zero-mean Gaussians with covariance matrices Q ∈ R^{d×d} and R ∈ R^{p×p} respectively. p is the number of pixels (or features) of a frame and d determines the order of the system.
The parameters (A, C) of an ARMA model can be estimated using a closed-form solution as proposed in [18], [38]. For this purpose, let us represent a video with an ordered sequence of frames [f(1), f(2), ..., f(τ)] indexed by the time t = 1, ..., τ. The singular value decomposition of the video data results in [f(1), f(2), ..., f(τ)] = UΣV^T. The transition and the measurement matrices of the ARMA model can then be estimated by:

$$C = U, \qquad A = \Sigma V^T D_1 V \big(V^T D_2 V\big)^{-1} \Sigma^{-1} \qquad (18)$$

where $D_1 = \begin{bmatrix} 0 & 0 \\ I_{\tau-1} & 0 \end{bmatrix}$ and $D_2 = \begin{bmatrix} I_{\tau-1} & 0 \\ 0 & 0 \end{bmatrix}$.
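A minimal numerical sketch of this closed-form estimation (assuming the frames are vectorised into the columns of a p × τ matrix) is given below; it follows Eq. 18 directly and is not the authors' implementation.

```python
import numpy as np

def estimate_arma(frames, d):
    """Closed-form estimate of (A, C) from Eq. 18 (a sketch).

    frames: p x tau matrix whose columns are the vectorised video frames.
    d:      order of the system (size of the hidden state).
    """
    p, tau = frames.shape
    U, s, Vt = np.linalg.svd(frames, full_matrices=False)
    U, s, Vt = U[:, :d], s[:d], Vt[:d, :]           # keep the top-d components
    Sigma = np.diag(s)
    V = Vt.T                                         # tau x d
    # D1 shifts forward in time, D2 selects the first tau-1 samples (Eq. 18)
    D1 = np.zeros((tau, tau)); D1[1:, :-1] = np.eye(tau - 1)
    D2 = np.zeros((tau, tau)); D2[:-1, :-1] = np.eye(tau - 1)
    C = U
    A = Sigma @ V.T @ D1 @ V @ np.linalg.inv(V.T @ D2 @ V) @ np.linalg.inv(Sigma)
    return A, C
```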
In order to compare ARMA models, we can first represent them with subspaces and then compute the similarity between these subspaces. The subspace
representation of an ARMA model is obtained from the column space of its
| Experiment-1 (2D / 3D / 2D+3D) | Experiment-2 (2D / 3D / 2D+3D) |
| 85.38% / 90.51% / 92.71%       | 81.58% / 86.18% / 89.72%       |

TABLE 6
Average classification performance for videos modeled by ARMA models.
observability matrix O, which in turn is approximated as [18], [40], [41]:

$$\mathcal{O} = \big[\,C^T,\ (CA)^T,\ (CA^2)^T,\ \ldots,\ (CA^{d-1})^T\,\big]^T. \qquad (19)$$

For our framework, we consider the subspace representation of an ARMA model as a point on the Grassmann manifold. More specifically, in order to represent a video modeled by an ARMA model on the manifold, we first approximate the parameters of the ARMA model. These parameters are then used to compute the observability matrix. The subspace spanned by the columns of the observability matrix is finally considered as the representation of the video on the Grassmann manifold.
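Building on the previous sketch, a video modeled by an ARMA model can be mapped to a point on the Grassmann manifold by orthonormalising the columns of the truncated observability matrix of Eq. 19; the QR-based orthonormalisation below is one convenient choice and an assumption of this sketch.

```python
import numpy as np
# assumes estimate_arma() from the sketch above

def arma_grassmann_point(frames, d):
    """Represent a video by the subspace spanned by the columns of the
    (approximate) observability matrix of its ARMA model (Eq. 19, sketched)."""
    A, C = estimate_arma(frames, d)
    blocks = [C]
    for _ in range(d - 1):
        blocks.append(blocks[-1] @ A)               # C, CA, ..., CA^(d-1)
    O = np.vstack(blocks)                            # (d*p) x d observability matrix
    Q, _ = np.linalg.qr(O)                           # orthonormal basis of its column space
    return Q                                         # a point on the Grassmann manifold
```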
The average classification performance of the method for videos modeled
by ARMA models is shown in Table 6. These experiments are performed for
proj = 1, cc = 0, g = 1. An average classification rate of 92.71% and 89.72% is
achieved for Experiment-1 and Experiment-2 respectively.
6.3 Comparison with previous methods

Table 7 shows a comparison of the proposed system with the previously published top-performing methods. The method proposed by Sun & Yin [5] achieves a classification accuracy of 90.44%. However, their method relies on the manual annotation of 83 facial landmarks, which is not only an undesirable and time-consuming process but may also introduce inaccuracies and could be difficult to accommodate in numerous practical applications. Amongst the previously published methods on BU4DFE, the method proposed by Drira et al. [11] shows
the best classification accuracy, i.e. 93.21%. However, it should be noted that, for the methods in [11], [42], each video sequence of n frames has been divided into multiple subsequences of 6 frames each (f_1-f_6, f_2-f_7, f_3-f_8, ..., f_{n-5}-f_n). The authors thus generate a total of 30780 subsequences for training and 3420 subsequences for testing. These methods therefore perform matching on videos of six frames each (about 1/4 of a second). Such a duration is too short to capture the facial dynamics and the corresponding spatiotemporal information.
The method proposed by Fang et al. [9], [13] shows a classification accuracy of 96.71% when evaluated for only three expressions, i.e. angry, happy and surprise. The authors manually inspect all videos of the database and exclude those videos which do not start or end with a neutral frame. Videos which contain a few corrupted meshes are also excluded from the final evaluation. In our proposed framework, by representing the local video-patches as points on the Grassmannian manifold and considering only the top 4 most significant basis vectors, we automatically avoid such corrupted meshes. We therefore evaluate the performance of our system on all 606 videos of the database.
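As an illustration of this robustness argument (not the exact preprocessing used here), keeping only the top few left singular vectors of a patch matrix discards the low-energy components, which is where isolated corrupted frames or meshes tend to end up:

```python
import numpy as np

def patch_to_subspace(patch_matrix, p=4):
    """Represent a (possibly noisy) video-patch, given as an n x m matrix of
    vectorised frames, by its top-p left singular vectors."""
    U, _, _ = np.linalg.svd(patch_matrix, full_matrices=False)
    return U[:, :p]          # n x p orthonormal basis, a point on the manifold
```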
The method in [7] shows an accuracy of 92.22% for only three expressions, i.e. sad, happy and surprise. Our experiments show that our proposed system achieves a very high classification accuracy for the happy and surprise expressions in both Experiment-1 and Experiment-2 (see Table 3 and Table 4). Therefore, if we consider only three expression classes, either angry-happy-surprise as in [9], [13] or sad-happy-surprise as in [7], the classification accuracy achieved by our system is higher than that of the previously published techniques.

| Method                | Testing      | Accuracy | Note                                                       |
| Sun & Yin [5]         | Experiment-1 | 90.44%   | Manual 83 landmarks, evaluated on subsequences of 6 frames |
| Drira et al. [11]     | Experiment-1 | 93.21%   | Evaluated on subsequences of 6 frames                      |
| Le et al. [7]         | Experiment-1 | 92.22%   | Tested for only sad, happy & surprise                      |
| Fang et al. [9], [13] | Experiment-2 | 96.71%   | Tested for only angry, happy & surprise, 507 videos only   |
| This Paper            | Experiment-1 | 94.91%   | Fully automatic. Tested for all six expressions            |
| This Paper            | Experiment-2 | 93.63%   | Tested for all six expressions. All 606 videos             |

TABLE 7
Comparison with previously published results.

CONCLUSION

A system for the automatic recognition of facial expressions from textured 3D videos is presented. After normalizing the raw videos, the system extracts local
video-patches from different locations of the videos and represents them on the Grassmannian manifold. The strengths and effectiveness of spectral clustering [21] are exploited and adapted for an efficient clustering of points on the Grassmannian manifold. SVM models are learnt on the Grassmannian manifold, followed by a voting-based strategy for classification. The theory of RKHS has been explored to adapt the SVM classifier to the Grassmannian manifold. A new Grassmannian kernel function is also proposed. The performance of the system is tested on the largest publicly available 3D video database, BU4DFE. In comparison to previously published methods on the BU4DFE database, our system shows a superior performance in terms of classification accuracy. Moreover, the proposed system avoids the computationally expensive pre-processing steps required for the establishment of a dense vertex-level correspondence. Furthermore, it does
not require any user intervention for manual annotation of facial landmarks.
The system does not make any assumptions about the presence of all four
expression segments in a video and performs equally well for all video types.

ACKNOWLEDGMENT
This work is supported by a SIRF scholarship from the University of Western Australia (UWA) and ARC grant DPI10102166.

REFERENCES
[1] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39-58, 2009.
[2] M. Pantic and L. J. M. Rothkrantz, "Automatic analysis of facial expressions: The state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1424-1445, 2000.
[3] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin, "Static and dynamic 3D facial expression recognition: A comprehensive survey," Image and Vision Computing, 2012.
[4] T. Fang, X. Zhao, O. Ocegueda, S. Shah, and I. Kakadiaris, "3D facial expression recognition: A perspective on promises and challenges," in Automatic Face & Gesture Recognition and Workshops (FG 2011), March 2011, pp. 603-610.
[5] Y. Sun and L. Yin, "Facial expression recognition based on 3D dynamic range model sequences," in ECCV 2008, pp. 58-71.
[6] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert, "Recognition of 3D facial expression dynamics," Image and Vision Computing, February 2012, (in press). [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0262885612000157
[7] V. Le, H. Tang, and T. Huang, "Expression recognition from 3D dynamic faces using robust spatio-temporal shape features," in Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on. IEEE, 2011, pp. 414-421.
[8] M. Rosato, X. Chen, and L. Yin, "Automatic registration of vertex correspondences for 3D facial expression analysis," in Biometrics: Theory, Applications and Systems, 2008. BTAS 2008. 2nd IEEE International Conference on, Sept. 29-Oct. 1, 2008, pp. 1-7.
[9] T. Fang, X. Zhao, O. Ocegueda, S. Shah, and I. Kakadiaris, "3D/4D facial expression analysis: an advanced annotated face model approach," Image and Vision Computing, 2012.
[10] Y. Huang, X. Zhang, Y. Fan, L. Yin, L. Seversky, J. Allen, T. Lei, and W. Dong, "Reshaping 3D facial scans for facial appearance modeling and 3D facial expression analysis," Image and Vision Computing, 2012.

[11] H. Drira, B. B. Amor, M. Daoudi, A. Srivastava, and S. Berretti, "3D dynamic expression recognition based on a novel deformation vector field and random forest," in Pattern Recognition (ICPR), Proceedings of the 21st International Conference on, Nov. 2012.
[12] Y. Sun, X. Chen, M. Rosato, and L. Yin, "Tracking vertex flow and model adaptation for three-dimensional spatiotemporal face analysis," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 40, no. 3, pp. 461-474, May 2010.
[13] T. Fang, X. Zhao, S. Shah, and I. Kakadiaris, "4D facial expression recognition," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, Nov. 2011, pp. 1594-1601.
[14] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert, "A dynamic approach to the recognition of 3D facial expressions and their temporal models," in Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on. IEEE, 2011, pp. 406-413.
[15] M. Hayat, M. Bennamoun, and A. A. El-Sallam, "Clustering of video-patches on Grassmannian manifold for facial expression recognition from 3D videos," in Applications of Computer Vision (WACV), 2013 IEEE Workshop on, Jan. 2013.
[16] H. Cetingul and R. Vidal, "Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds," in Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on. IEEE, 2009, pp. 1896-1902.
[17] R. Subbarao and P. Meer, "Nonlinear mean shift over Riemannian manifolds," IJCV, vol. 84, no. 1, pp. 1-20, 2009.
[18] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa, "Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition," IEEE TPAMI, vol. 33, no. 11, pp. 2273-2286, Nov. 2011.
[19] A. Edelman, T. Arias, and S. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303-353, 1998.
[20] H. Karcher, "Riemannian center of mass and mollifier smoothing," Communications on Pure and Applied Mathematics, vol. 30, no. 5, pp. 509-541, 1977.
[21] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[22] S. Shirazi, M. T. Harandi, C. Sanderson, A. Alavi, and B. C. Lovell, "Clustering on Grassmann manifolds via kernel embedding with application to action analysis," in Image Processing (ICIP), 2012 19th IEEE International Conference on. IEEE, 2012, pp. 781-784.
[23] P. Chen, C. Lin, and B. Schölkopf, "A tutorial on ν-support vector machines," Applied Stochastic Models in Business and Industry, vol. 21, no. 2, pp. 111-136, 2005.
[24] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[25] CVX Research, Inc., "CVX: Matlab software for disciplined convex programming, version 2.0 beta," http://cvxr.com/cvx, Sep. 2012.
[26] J. Hamm and D. Lee, "Grassmann discriminant analysis: a unifying view on subspace-based learning," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 376-383.
[27] M. Harandi, C. Sanderson, S. Shirazi, and B. Lovell, "Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 2705-2712.
[28] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, "Efficient algorithms for inferences on Grassmann manifolds," in Statistical Signal Processing, 2003 IEEE Workshop on. IEEE, 2003, pp. 315-318.
[29] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi, "Kernel methods on the Riemannian manifold of symmetric positive definite matrices," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 73-80.
[30] G. Wu, E. Y. Chang, and Z. Zhang, "An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines," in Proceedings of the 22nd International Conference on Machine Learning, vol. 8. Citeseer, 2005.
[31] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer, "Classification on pairwise proximity data," Advances in Neural Information Processing Systems, pp. 438-444, 1999.
[32] E. Pekalska, P. Paclik, and R. P. Duin, "A generalized kernel approach to dissimilarity-based classification," The Journal of Machine Learning Research, vol. 2, pp. 175-211, 2002.
[33] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale, "A high-resolution 3D dynamic facial expression database," in Automatic Face & Gesture Recognition, 2008. FG '08. 8th IEEE International Conference on, pp. 1-6.
[34] X. Peng, M. Bennamoun, and A. S. Mian, "A training-free nose tip detection method from face range images," Pattern Recognition, vol. 44, no. 3, pp. 544-558, 2011.
[35] A. Mian, M. Bennamoun, and R. Owens, "An efficient multimodal 2D-3D hybrid approach to automatic face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 1927-1943, 2007.
[36] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
[37] C. Hsu and C. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, 2002.
[38] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, "Dynamic textures," International Journal of Computer Vision, vol. 51, no. 2, pp. 91-109, 2003.
[39] A. Veeraraghavan, A. K. Roy-Chowdhury, and R. Chellappa, "Matching shape sequences in video with applications in human movement analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1896-1909, 2005.
[40] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell, "Kernel analysis on Grassmann manifolds for action recognition," Pattern Recognition Letters, vol. 34, no. 15, pp. 1906-1915, 2013.
[41] T. Kailath, Linear Systems. Prentice-Hall, Englewood Cliffs, NJ, 1980, vol. 1.

[42] Y. Sun and L. Yin, "3D spatio-temporal face recognition using dynamic range model sequences," in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 1, pp. 1-7, 2008.
