Index Terms
Facial expression recognition, 3D videos, Grassmannian manifold, spectral clustering, SVM on Grassmannian manifold
every mesh using Linear Discriminant Analysis (LDA). For classification, one
HMM for each of the six expressions is learnt. Finally, the probability scores
of the test video to each HMM are evaluated by a Bayesian decision rule to
determine the expression type of the test video. An extension of this approach
has been presented in [12] for the task of face recognition from 3D videos.
Rosato et al. [8] propose a conformal mapping and generic model adaptation technique for establishing vertex-level correspondences. A deformable template based approach is applied to extract 22 feature points from all frames of the corresponding 2D texture video. These extracted points are then mapped to the corresponding 3D meshes of the video. A circle-pattern conformal mapping is used to parameterize the 3D meshes in a 2D plane. A coarse-to-fine model adaptation is performed in the 2D planar representation to establish correspondences between vertices. The established correspondences are then extrapolated from the planar representation to the original 3D meshes. Each vertex of a 3D mesh is then characterized by one of 12 geometric primitive features. LDA is used for classification of the facial expressions.
Fang et al. [9], [13] fit an Annotated deformable Face Model (AFM) to every mesh of the video and use the fitted video sequence for feature extraction and classification. For every pair of consecutive video frames, registration is performed by mesh matching and the rigid transformations are computed. Two different techniques are used for mesh matching, i.e., spin image similarities and distances between MeshHOG descriptors. The first mesh is directly fitted to the AFM, while the transformations obtained by dense registration of pairs of consecutive meshes are used for fitting the remaining meshes of the video. Assuming that the expression is performed gradually (from neutral to onset to apex to offset), a flow matrix is obtained from the fitting of the AFM. Spatiotemporal LBP features are extracted from the flow matrix and an SVM is used for classification.
Sandbach et al. [6], [14] model a facial expression sequence by a four state
model i.e. neutral-onset-apex-offset. Firstly, all 3D meshes of a video are aligned
with the first mesh using an iterative closest point (ICP) algorithm. Then, Free
Form Deformations (FFDs) are used to capture the motion between the frames
of the video. This motion is represented in terms of vector fields. Using a quadtree decomposition, the vector fields are sub-divided into regions according
to the amount of motion appearing in every region. A set of 2D features is
extracted for every region, which are then used to train boosting classifiers on
the onset and offset segments of the expression. Finally, HMMs are used to
model the temporal dynamics of the complete expression sequence.
Huang et al. [10] propose a hybrid approach based upon two vertex mapping algorithms (i.e. displacement mapping and point-to-surface mapping) for
achieving a vertex level correspondence. They follow a procedure similar to
[12] for feature extraction and classification. Unlike most global model based
algorithms e.g. [5], [6], [8], [9] which map the complete face model, the proposed approach segments the face into multiple independent local regions and
maps the corresponding local regions. Nevertheless, mapping a video sequence
of about 100 meshes and establishing a vertex level correspondence amongst
all meshes (each mesh comprises about 40,000 vertices) is a very slow and
computationally expensive process.
Le et al. [7] propose an approach based on facial level curves for expression
recognition from 3D videos. The arc-length function is used to parameterize the
level curves and spatiotemporal features are extracted by comparing the curves
across frames using Chamfer distances. An HMM-based decision boundary
focus algorithm is used for classification.
In order to extract motion between meshes of a 3D video sequence, Drira et
al. [11] use radial curves. First, the facial surface in each 3D mesh of the video is
parameterized by radial curves emanating from the nose tip. Then, the motion
between every pair of consecutive meshes is captured in terms of vector fields
computed by comparing the corresponding radial curves of the two meshes.
LDA is used for dimensionality reduction followed by a multi-class random
forest algorithm for classification.
This paper contributes towards the development of a full-fledged automatic 3D video based facial expression recognition system.
Clustering is one of the most widely used techniques for data analysis in many applications. The problem of clustering can be formulated as: given a set of points and some measure of similarity between all pairs of points, divide the points into groups such that points in the same group are similar, while points in different groups are dissimilar. While clustering in the Euclidean space has been well explored over the years, very few papers [16]-[18] discuss clustering on the Grassmannian manifold.
By definition, a manifold is a topological space which is locally similar to a Euclidean space. A Grassmannian manifold is the space of all d-dimensional linear subspaces of R^n [18], [19]. A point on the Grassmannian manifold is represented by an n × d orthonormal matrix. The existing clustering techniques [16]-[18] on the Grassmannian manifold need to compute the distance as well as the mean of the points on the manifold for every iteration of the clustering algorithm (e.g. K-means). Methods for computing the mean and the distance on the Grassmannian manifold can be broadly categorized as intrinsic and extrinsic [18]. The intrinsic methods are entirely restricted to the manifold itself, whereas extrinsic methods embed the points on the manifold into a Euclidean space and use Euclidean metrics for computations. Using either an intrinsic or an extrinsic method for an iterative process such as K-means is very time-consuming and requires a lot of computation.
Support Vector Machine (SVM) is a supervised binary classification algorithm which constructs a hyperplane to optimally separate the data points of the two classes [23]. The original SVM is designed for data lying in the Euclidean space. In this paper, we exploit the theory of Reproducing Kernel Hilbert Spaces (RKHS) [24] to recast SVM from the Euclidean space to the Grassmannian manifold. More specifically, the data points on the Grassmannian manifold are embedded into RKHS by using a Grassmannian kernel. This embedding allows us to use the SVM classifier in RKHS.
Given a set of m training data points X_train = {X_1, X_2, X_3, ..., X_m} along with their class labels y_train = {y_1, y_2, y_3, ..., y_m}, where X_i ∈ R^{n×d} (a tall thin orthonormal matrix, representing a point on the Grassmannian) and y_i ∈ {−1, +1}, the problem of SVM on the Grassmannian manifold can be formulated as: find a maximum-margin hyperplane that divides the points having y_i = −1 from the points having y_i = +1. Using the soft-margin SVM formulation [23], this can be achieved by solving the following optimization problem,
minimize_{w, b, ξ}   (1/2) ‖w‖² + C ∑_{i=1}^{m} ξ_i
subject to   y_i (wᵀ f_i + b) ≥ 1 − ξ_i,   i = 1, ..., m                    (1)
             ξ_i ≥ 0,   i = 1, ..., m.
where w is the coefficient vector, b is the intercept term, the ξ_i are slack variables for handling non-separable data and C is the penalty parameter for the error term [23]. f_i ∈ R^m is a feature vector in RKHS computed between X_i and all m training points in X_train using a Grassmannian kernel function,

f_i = k(X_train, X_i).                    (2)
The corresponding dual problem is to find the coefficients α_i that

maximize   ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j k(X_i, X_j)
subject to   0 ≤ α_i ≤ C,   i = 1, ..., m                    (3)
             ∑_i α_i y_i = 0.
The predicted label for a test data point X_test, based on its normalized distance from the SVM hyperplane, is then given by,

f(X_test) = sgn( (⟨w*, f_test⟩ + b*) / ‖w*‖ )                    (4)

1 / (1 + exp(−z))                    (5)
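As a concrete illustration (not part of the original formulation), once a Grassmannian Gram matrix is available, the soft-margin problem of Eqs. (1)-(3) can be handed to any off-the-shelf kernel SVM solver. The sketch below uses scikit-learn's precomputed-kernel SVC as a stand-in for the CVX-based solver used later in the paper; grassmann_kernel is a hypothetical helper, here instantiated with the projection kernel described in the next subsection.

```python
# A minimal sketch (not the authors' implementation) of SVM classification
# with a precomputed Grassmannian kernel.  `grassmann_kernel` is a
# placeholder for any kernel of Section 3.1.
import numpy as np
from sklearn.svm import SVC

def grassmann_kernel(Xs_a, Xs_b):
    """Projection kernel k(Xi, Xj) = ||Xi^T Xj||_F^2 between lists of
    n x d orthonormal matrices (points on the Grassmannian)."""
    K = np.zeros((len(Xs_a), len(Xs_b)))
    for i, Xi in enumerate(Xs_a):
        for j, Xj in enumerate(Xs_b):
            K[i, j] = np.linalg.norm(Xi.T @ Xj, 'fro') ** 2
    return K

def train_grassmann_svm(X_train, y_train, C=1.0):
    K_train = grassmann_kernel(X_train, X_train)   # m x m Gram matrix
    clf = SVC(C=C, kernel='precomputed')           # solves the dual of Eq. (1)
    clf.fit(K_train, y_train)
    return clf

def predict_grassmann_svm(clf, X_train, X_test):
    K_test = grassmann_kernel(X_test, X_train)     # rows: test, cols: train
    return clf.decision_function(K_test)           # signed distance, cf. Eq. (4)
```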
Grassmannian Kernels
Projection Kernel
The projection kernel [26], [27] between two points X_i and X_j on the Grassmannian is defined as,

k_proj(X_i, X_j) = ‖X_iᵀ X_j‖²_F                    (6)
Principal angles have been used as a measure of similarity between two subspaces (i.e. two points on the Grassmannian). The principal angles 0 ≤ θ_1 ≤ θ_2 ≤ ... ≤ θ_m ≤ π/2 are defined between vectors in the first and second subspace. The cosines of the principal angles are known as the canonical correlations. The first principal angle θ_1 is the smallest angle. Based upon the value of the canonical correlation corresponding to the first principal angle, i.e. cos(θ_1), Harandi et al. [27] defined the canonical correlation kernel as,

k_cc(X_i, X_j) = max_{a_p ∈ span(X_i)}  max_{b_q ∈ span(X_j)}  a_pᵀ b_q                    (7)

subject to   a_pᵀ a_p = b_pᵀ b_p = 1  and  a_pᵀ a_q = b_pᵀ b_q = 0,  p ≠ q.
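For illustration (a hedged sketch, not the authors' code), the canonical correlations between the subspaces spanned by two orthonormal matrices X_i and X_j can be obtained as the singular values of X_iᵀX_j, a standard result; the kernel of Eq. (7) keeps the largest one.

```python
# Hedged sketch: canonical correlations between two subspaces.
# For orthonormal Xi, Xj (n x d), the canonical correlations cos(theta_k)
# are the singular values of Xi^T Xj; Eq. (7) keeps the largest one.
import numpy as np

def canonical_correlations(Xi, Xj):
    return np.linalg.svd(Xi.T @ Xj, compute_uv=False)   # sorted, descending

def cc_kernel(Xi, Xj):
    return canonical_correlations(Xi, Xj)[0]            # cos(theta_1)
```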
Proposed Kernel
We introduce our kernel function based upon the geodesic distance between points on the Grassmannian manifold. Here, we first provide a description of the geodesic distance on the Grassmannian and then propose our kernel function in equation (10). Later on, we investigate different strategies to achieve a positive definite kernel matrix from the proposed kernel.
The geodesic distance between any two points on the manifold is the length of the shortest possible curve connecting them. Gallivan et al. [28] use the quotient interpretation to derive the geodesic path on the Grassmannian. More specifically,
the geodesic path is obtained through the matrix exponential of the block skew-symmetric matrix

B = [ 0    −Aᵀ ]
    [ A     0  ] ,   A ∈ R^{(n−d)×d}                    (8)
The matrix A in equation (8) specifies the direction and speed of the geodesic
flow. As shown in [18], [28], in order to compute the geodesic distance between
two points Xi and Xj on the Grassmannian, we need to find an appropriate
direction matrix A such that the geodesic flow originating from Xi reaches Xj
in a unit time. For this purpose, we first map Xi and Xj to their respective
tangent spaces and then use the algorithm proposed in [28] for an efficient
computation of the matrix A. The geodesic distance between Xi and Xj is then
given by,
dg (Xi , Xj ) = trace(AT A)
(9)
where trace denotes the sum of the diagonal elements. By considering the geodesic distance d_g(X_i, X_j), we introduce the following kernel function on the Grassmannian manifold,

k_g(X_i, X_j) = exp( −d_g(X_i, X_j) / (2σ²) )                    (10)
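The sketch below illustrates the proposed kernel under one assumption: for the direction matrix A that drives the geodesic from X_i to X_j in unit time, trace(AᵀA) coincides with the sum of squared principal angles between the two subspaces. The explicit construction of A via the algorithm of [28] used by the authors is therefore replaced by a principal-angle computation.

```python
# Hedged sketch of the proposed geodesic kernel (Eq. 10), assuming
# trace(A^T A) equals the sum of squared principal angles between
# span(Xi) and span(Xj).
import numpy as np

def geodesic_distance(Xi, Xj):
    s = np.clip(np.linalg.svd(Xi.T @ Xj, compute_uv=False), -1.0, 1.0)
    thetas = np.arccos(s)                     # principal angles
    return np.sum(thetas ** 2)                # trace(A^T A), cf. Eq. (9)

def geodesic_kernel(Xi, Xj, sigma=1.0):
    return np.exp(-geodesic_distance(Xi, Xj) / (2.0 * sigma ** 2))
```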
For the proposed kernel to be a valid kernel, the necessary and sufficient conditions are: 1) it should be well-defined, and 2) the resulting Gram matrix K should be positive semi-definite.
When the Gram matrix K computed from the proposed kernel is not positive semi-definite, it can be replaced by its nearest positive semi-definite matrix K̂, obtained by solving

minimize_{K̂}   ‖K̂ − K‖_F
subject to   K̂ ⪰ 0.                    (11)

The solution to Eq. 11 is given by,

K̂ = U Λ̂ Uᵀ,                    (12)

where K = U Λ Uᵀ is the eigen-decomposition of K and Λ̂ is obtained from Λ by setting its negative eigenvalues to zero.
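A minimal sketch of this eigenvalue-clipping step (Eqs. 11-12), assuming a symmetric input matrix, is given below.

```python
# Hedged sketch: project a Gram matrix onto the positive semi-definite cone
# by eigen-decomposition and clipping of negative eigenvalues.
import numpy as np

def nearest_psd(K):
    K_sym = 0.5 * (K + K.T)                        # enforce symmetry
    eigvals, U = np.linalg.eigh(K_sym)             # K = U diag(eigvals) U^T
    eigvals_clipped = np.clip(eigvals, 0.0, None)  # drop negative eigenvalues
    return (U * eigvals_clipped) @ U.T             # K_hat = U Lambda_hat U^T
```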
Kernel Combinations
From the theory of RKHS [24], it is well known that a linear combination of valid kernels is a new valid kernel. Therefore, we can obtain a new kernel by,

k(X_i, X_j) = β_proj k_proj(X_i, X_j) + β_cc k_cc(X_i, X_j) + β_g k_g(X_i, X_j),                    (13)

where β_proj, β_cc, β_g ≥ 0. Our experimental results (see Sec 6.2) show that the kernel combination enhances discrimination of points and results in improved classification.
We use the BU4DFE database [33] for our experiments. This database comprises 101 subjects (58 females and 43 males) with an age range of 18-45 years,
belonging to various ethnic and racial groups including Asian (28), Black (8), Latino (3) and White (6). Under the supervision of a psychologist, each subject is asked to perform 6 different expressions, i.e. angry, disgust, fear, happy, sad and surprise. Each facial expression was captured to produce a 4-second video sequence of temporally varying 2D texture and 3D shapes at the rate of 25 frames per second.
Fig. 2. A few normalized frames (both 2D and 3D) of an expression video from
BU4DFE database for visualization
in terms of the size and resolution of all meshes of a video. The inputs to our algorithm are the meshes of a raw 3D video. The algorithm removes outlier points, detects the nose tip, crops the Region Of Interest (ROI) around the detected nose tip and corrects the pose of the face. Figure 1 shows the block diagram of the 3D face normalization. The details of each block are given below.
We represent a 3D face by a point-cloud matrix P ∈ R^{m×3}, where m is the total number of points and each row of P corresponds to the x, y, z coordinates of a point (vertex). A raw 3D face contains outlier points as shown by the encircled region in Figure 1(a). We remove these outlier points by finding the mean (μ = (1/m) ∑_{i=1}^{m} z_i) and the standard deviation (σ = sqrt((1/m) ∑_{i=1}^{m} (z_i − μ)²)) of the depths (z_i) of all points. Any point whose depth is outside the μ ± 3σ limit is considered an outlier and is filtered. Figure 1(b) shows a 3D face after the removal of outlier points. To reduce dimensionality and computational complexity, we sample the filtered 3D face onto a uniform rectangular grid of 278 × 208 (1/5th of the original size, i.e. 1392 × 1040) at a resolution of 1mm. The resulting uniformly sampled pointcloud is input to the nose tip detection algorithm.
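A hedged sketch of this preprocessing step is given below. The 3σ depth filter follows the description above; the interpolation method used for the uniform resampling is an assumption on our part.

```python
# Hedged sketch of outlier removal and uniform grid resampling of a 3D face.
import numpy as np
from scipy.interpolate import griddata

def remove_depth_outliers(P):
    """P: m x 3 point cloud. Keep points whose depth lies within mu +/- 3*sigma."""
    z = P[:, 2]
    mu, sigma = z.mean(), z.std()
    keep = np.abs(z - mu) <= 3.0 * sigma
    return P[keep]

def resample_to_grid(P, step=1.0):
    """Resample the point cloud onto a uniform x-y grid (1 mm resolution)."""
    x, y, z = P[:, 0], P[:, 1], P[:, 2]
    xs = np.arange(x.min(), x.max(), step)
    ys = np.arange(y.min(), y.max(), step)
    gx, gy = np.meshgrid(xs, ys)
    gz = griddata((x, y), z, (gx, gy), method='linear')   # interpolation assumed
    return gx, gy, gz
```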
The nose tip detection algorithm [34] has three steps. First, 61 2D face profiles of the input pointcloud are generated by rotating the face around the y-axis from −90° to 90° in a step size of 3° (see Fig. 3 (a)). Second, from the generated face profiles, a set of possible nose tip candidates is determined by moving a circle
along all the points of these face profiles. A point on the face profile is selected as a nose tip candidate if it fulfills the following criterion: the difference in the area of the circle lying inside and outside of the face profile must be greater than an experimentally determined threshold. The nose tip candidates selected using this criterion are marked in Fig. 3 (a). The frequency of occurrence of these nose tip candidates with respect to the y-axis is shown in Fig. 3 (b). It can be seen that most of the candidates correspond to the actual nose tip, while others are from the chin and shoulder regions. Finally, in order to filter out the false nose tip candidates and only select the valid nose tip, a triangle is fitted around all candidates. The candidate which best fits the triangle is then selected as the final nose tip (shown in Fig. 3 (c)).
Fig. 3. Nose-tip detection. (a) Face profiles are generated by rotating the face around the y-axis. A set of possible nose-tip candidates is obtained by moving a circle along every point of these face profiles. If at a point the area of the circle outside the face profile is greater than the area inside, the point is selected as a possible nose-tip. (b) Frequency of occurrence of the candidate nose-tips w.r.t. the y-axis. A triangle is fitted at these nose-tip candidates. (c) The candidate which best fits the triangle is declared as the final nose-tip.
After a successful detection of the nose tip, a sphere of radius r = 85mm centered at the nose tip is used to crop the face. Figure 1(d) shows the cropped 3D face. Although the videos of the BU4DFE dataset have been acquired in a frontal view, slight head rotations and pose variations were observed. We use a technique similar to [35] for pose correction. The mean vector (μ = (1/m) ∑_{i=1}^{m} P_i) and the covariance matrix (C = ∑_{i=1}^{m} (P_i − μ)(P_i − μ)ᵀ) of the pointcloud matrix P are computed.
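As an illustration (an assumption on our part, since [35] describes the exact procedure), pose correction from the mean and covariance of the point cloud can be realized by aligning the cloud with its principal axes, as sketched below.

```python
# Hedged sketch of PCA-based pose correction, assuming "a technique similar
# to [35]" denotes rotating the point cloud onto the principal axes of its
# covariance matrix (the Hotelling transform).
import numpy as np

def pose_correct(P):
    mu = P.mean(axis=0)                 # mean vector
    Pc = P - mu                         # centred point cloud
    C = Pc.T @ Pc                       # 3 x 3 covariance (unnormalized)
    eigvals, V = np.linalg.eigh(C)      # principal axes (ascending order)
    V = V[:, ::-1]                      # strongest axis first
    return Pc @ V                       # rotate onto the principal axes
```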
THE PROPOSED SYSTEM
The complete pipeline for our automatic 3D video based facial expression recognition system is shown in Fig 4. During the offline training phase, our system extracts video-patches from different locations of the training videos, separately clusters these video-patches into class representatives and learns six binary SVM models on these class representatives. During the online testing phase, a similarity score of every extracted video-patch of the query video from each of the learnt SVM models is obtained and a voting based strategy is used to decide the class of the query video. The detailed description of each block of our system is given below.
An expression video is generally modeled to contain segments such as neutral
followed by onset, apex and an offset. During our visual inspection of the
videos of the BU4DFE database, we noticed that the neutral-onset-apex-offset
order does not necessarily hold for every video. For example, some videos start
(a) A video is represented as points on the Grassmannian manifold. All raw 3D meshes of the video are normalized (see Fig 1), a sliding window is used to extract video-patches of variable lengths from different locations of the normalized video, for computational benefits LBPs are computed for each extracted video-patch, and finally the top four most significant basis vectors obtained by SVD of the video-patch matrix are considered as a point on the Grassmannian.
(b) The complete pipeline of our 3D video based FER system. The system comprises two parts: offline training and online testing. During the offline training, for each of the six expression classes, video-patches from all training videos are extracted and represented as points on the Grassmannian, and spectral clustering is used to compute class representatives. Six one-vs-all binary SVMs are learnt on these class representatives. During online testing, the video-patches extracted from a test video are represented as points on the manifold and matched with the learnt SVM models, followed by a weighted-voting based strategy for expression classification of the test video.
Fig. 4. Block diagram of our 3D video based facial expression recognition system
from onset of the expression and skip the neutral part, or in other videos, a
performer might not return to the offset of the expression. Thus modeling the
complete video sequence as a whole could result in performance degradation.
Therefore, we need to extract local video-patches of different lengths from
numerous locations of a video.
Given a normalized video sequence V = [f_1, f_2, f_3, ..., f_n] of n frames, a sliding window of variable length is used to extract video-patches along the sequence. More specifically, a sliding window of length m frames extracts a video-patch from f_1 to f_m, then the sliding window is shifted by m/2 frames and the next video-patch is extracted. The process is repeated till the last frame, i.e. f_n, is reached. An illustration of the extraction of video-patches by using the sliding window is presented in Fig 4(a). Four different lengths of the sliding window (m ∈ {24, 30, 36, 44}) are used. The motivation for using different lengths of the sliding window comes from our observation that if a person performs the activity of one expression in a certain number of frames, another person might perform the activity of the same expression in a different number of frames. An experimental analysis of the effect of window length on the performance of the proposed method is presented in Sec. 6.2.1.
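The variable-length sliding-window extraction described above can be summarized in a few lines; the sketch below is a minimal illustration under the window lengths and half-window shift stated in the text.

```python
# Minimal sketch of variable-length sliding-window extraction: a window of
# m frames is shifted by m/2 frames until the end of the sequence.
def extract_video_patches(frames, lengths=(24, 30, 36, 44)):
    patches = []
    n = len(frames)
    for m in lengths:
        start = 0
        while start + m <= n:
            patches.append(frames[start:start + m])
            start += m // 2
    return patches
```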
Each extracted video-patch is represented as a matrix X = [f_1 f_2 ... f_m], X ∈ R^{25600×m}, whose columns correspond to the raster-scanned depth values of the m frames of the video-patch. We can lose the identity information and only retain the required expression deformation information from X by subtracting the mean μ = (1/m) ∑_{i=1}^{m} f_i from X. The Singular Value Decomposition (SVD) of the resulting matrix X′ = X − μ is given by X′ = USVᵀ. The columns of U form a set of orthonormal vectors, which can be regarded as basis vectors. These basis vectors are arranged in a descending order of significance and carry the important expression deformation information contained in the video-patch. Fig 5 shows the top 4 basis vectors corresponding to the top four singular values of video-patches extracted from happy, sad and surprise expressions.
A set of these basis vectors (i.e. a tall-thin orthonormal matrix containing the top four basis vectors) is considered as a point on the Grassmannian manifold.
Fig. 5. Top four basis vectors of video-patches extracted from (a) Happy, (b) Sad and (c) Surprise expression videos
For computational efficiency, each frame of a video-patch is divided into blocks and Local Binary Pattern (LBP) histograms [36] are computed for each block. Histograms of all blocks are concatenated and the frame is represented by a feature vector in R^944. This results in our points being on G_{944,4} instead of G_{25600,4}. Extracting video-patches and representing them as points on G_{944,4} also helps us avoid corrupted and incomplete video frames.
By considering only the top 4 most significant basis vectors, we retain the most
prevalent and consistent information in the video-patch and ignore the aberrant
information from the corrupted frames.
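The mapping of a video-patch to a point on G_{944,4} can be sketched as follows; feature_fn is a hypothetical helper that returns the 944-dimensional concatenated block-wise LBP histogram of a frame.

```python
# Hedged sketch: represent a video-patch as a point on G_{944,4} by removing
# the mean frame feature and keeping the top-4 left singular vectors.
import numpy as np

def patch_to_grassmann_point(patch_frames, feature_fn, k=4):
    X = np.stack([feature_fn(f) for f in patch_frames], axis=1)  # 944 x m
    X = X - X.mean(axis=1, keepdims=True)                        # remove mean / identity info
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]                                              # 944 x 4 orthonormal basis
```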
The above process gives us points on the manifold for one video. We perform the same process and compute the points on the Grassmannian for all video-patches extracted from the training videos of one expression class. The next step is to cluster these points. We follow the procedure described in Sec 2 to perform clustering. During clustering, our similarity graph Laplacian matrix L showed that some of the points are very dissimilar from most of the other
23
points. These points come from those parts of videos, where the expression is
performed in an incorrect and inconsistent manner. Therefore, we only consider
the top 300 most similar points for each class and group them into 15 clusters.
The mean of every cluster is computed following the procedure described in
[18]. Finally, the 15 cluster centers are considered as class representatives and
are used in the classification step.
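For illustration, the grouping into class representatives can be sketched with an off-the-shelf spectral clustering routine applied to a Grassmannian kernel matrix; this stands in for the procedure of Sec. 2, and the cluster medoid is used here as a simple surrogate for the intrinsic cluster mean computed in the paper following [18].

```python
# Hedged sketch: cluster Grassmannian points with a precomputed affinity and
# return one representative (the medoid) per cluster.
import numpy as np
from sklearn.cluster import SpectralClustering

def class_representatives(points, kernel_fn, n_clusters=15):
    K = np.array([[kernel_fn(Xi, Xj) for Xj in points] for Xi in points])
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed').fit_predict(K)
    reps = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        medoid = idx[np.argmax(K[np.ix_(idx, idx)].sum(axis=1))]
        reps.append(points[medoid])
    return reps
```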
Following a similar procedure, the class representatives of all six classes are computed. The next step is to learn SVM models on the Grassmannian for these class representatives. To do that, we first embed the class representatives in RKHS by using one of the Grassmannian kernel functions described in Sec 3.1. SVM is designed for binary classification. For multi-class classification, we can use an appropriate multi-class strategy and train multiple binary SVMs. One-vs-one (pairwise comparisons, a total of k(k − 1)/2 binary SVMs for k classes) and one-vs-all (one against all others, a total of k binary SVMs for k classes) are the two most popular multi-class SVM strategies. A comparison [37] of multi-class SVM strategies shows that amongst one-vs-one and one-vs-all, the one-vs-all strategy achieves a comparable performance with a faster speed. We therefore use the one-vs-all multi-class strategy and train six binary SVM models for the six expression classes. For each expression class, the class representatives belonging to that class are labeled as +1 whereas the class representatives belonging to all other classes are labeled as −1. Using the CVX package [25], the optimization problem in equation 3 is solved for the optimal dual variables and the corresponding primal variables are computed. In this way, six binary SVM models are learnt for the six expression classes. Let us denote the learnt SVM models by {w_c, b_c}, c = 1, ..., 6.
Given a test video X_test, after normalization and extraction of video-patches from X_test, we represent these video-patches as points on G_{944,4}. The points are then embedded into RKHS by using a Grassmannian kernel function. For every embedded point, a similarity score is computed with each of the six learnt SVM models. The similarity is determined in terms of the normalized distance of
the point from the corresponding SVM's hyperplane. Specifically, the similarity d_c(X_test^(j)) of the jth video-patch from the cth SVM model {w_c, b_c} is given by:

d_c(X_test^(j)) = sgn( (⟨w_c, X_test^(j)⟩ + b_c) / ‖w_c‖ )                    (14)
After computing the similarity of all the embedded points of the video from all SVM models, a weighted-voting based strategy is used to decide the class of the query video. Every video point casts a vote for each of the six expression classes, with the weight of each vote being equal to the similarity from the corresponding SVM model (obtained using Eq. 14). The class label y_test of the test video X_test is then given by:

y_test = arg max_c ∑_j d_c(X_test^(j))                    (15)

That is, the weights cast for all six classes are accumulated and the class with the maximum accumulated weight is declared to be the class of the query video.
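A minimal sketch of this decision rule follows, assuming the six one-vs-all models were trained with a precomputed Grassmannian kernel (as in the earlier sketch) on a fixed set of training representatives; the raw decision value of each classifier is used as a stand-in for the normalized distance of Eq. (14).

```python
# Hedged sketch of the weighted-voting rule of Eqs. (14)-(15): every
# video-patch adds its decision value for each of the six SVM models, and the
# class with the largest accumulated weight wins.
import numpy as np

def classify_video(patch_points, svm_models, kernel_fn, train_points):
    K_test = np.array([[kernel_fn(Xp, Xt) for Xt in train_points]
                       for Xp in patch_points])
    votes = np.zeros(len(svm_models))
    for c, clf in enumerate(svm_models):                 # six one-vs-all models
        votes[c] = clf.decision_function(K_test).sum()   # accumulate d_c over patches
    return int(np.argmax(votes))                         # y_test, cf. Eq. (15)
```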
2D-3D Fusion: Modern 3D imaging technologies can simultaneously acquire 2D texture (RGB) videos along with the co-registered depth (3D) information. In order to improve the accuracy of the proposed method, we use both the texture and the depth information in parallel. For this purpose, we follow the pipeline in Fig 4 for the 2D and 3D data separately and learn their corresponding SVM models. For the 2D data, the RGB images are first converted into gray scale images, whereas the depth images are directly used. After learning the SVM models from the 2D and 3D data separately, the next step is to perform classification. Given a 2D-3D test video X_test (both 2D and 3D video frames), the classification task consists in finding the class label y_test of the video. For this purpose, we first separately extract video-patches from the 2D and 3D data of X_test, represent these patches as points on the manifold and then compute their similarity from the corresponding SVM models. Let d_c^[2D](X_test^(j)) and d_c^[3D](X_test^(j)) be the similarity of the jth video-patch from the cth SVM model trained using
2D and 3D data respectively; the class label y_test of the test video is then given by,

y_test = arg max_c ∑_j [ λ_3D d_c^[3D](X_test^(j)) + λ_2D d_c^[2D](X_test^(j)) ]                    (16)

That is, a weighted voting strategy is used to fuse the information from the 2D and 3D data. The parameters λ_2D and λ_3D decide on the importance given to each data modality. These parameters are determined empirically (for the best cross validation classification accuracy) by performing a grid search over a range of values between 0 and 1.
Paper                      Persons/Videos   Expressions   Testing       Results
Sun and Yin [5]            60 subjects      All six       10 fold CV    90.44%
Drira et al. [11]          60 subjects      All six       10 fold CV    93.21%
Huang et al. [10]          60 subjects      All six       20 fold CV    85.6%
Le et al. [7]              60 subjects      Sa, Ha, Su    10 fold CV    92.22%
Fang et al. [9], [13]      507 videos       All six       10 fold CV    74.63%
                                            An, Ha, Su    10 fold CV    96.71%
                                            Sa, Ha, Su    10 fold CV    95.75%
Sandbach et al. [6], [14]  397 videos       All six       10 fold CV    64.6%
                                            Sa, Ha, Su    10 fold CV    83.03%
Rosato et al. [8]          Not mentioned    All six       20 fold CV    85.9%

TABLE 1
A summary of the experimental settings followed by previously published methods on the BU4DFE database. The acronyms used in the Table are, CV: Cross Validation, Sa: Sad, Ha: Happy, Su: Surprise, An: Angry.
of the system is evaluated for that subset. Based upon the selection of the data
subset, we can roughly categorize the previous papers as:
1) Papers which select a subset of 60 subjects. Either 10 fold or 20 fold
person-independent cross validation testing is performed on the selected
subjects. For example, for 10 fold cross validation testing, videos of all
expression classes of 54 subjects are used for training and 6 subjects are
used for testing. Papers in [5], [10], [11] report classification accuracy for
all six expression classes, whereas the method in [7] is evaluated for only
three expressions i.e. sad, happy and surprise.
2) Papers which select a subset of either 507 or 397 videos. The proposed methods in [6], [9], [13], [14] model a video by a four state model
i.e. neutral-onset-apex-offset. These methods make an assumption that
a video starts and ends with the neutral frame. The authors therefore
manually inspect all videos of the dataset and exclude those videos which
do not fulfil these criteria. Moreover, videos which contain corrupted
meshes are also excluded and only 507 & 397 videos are selected in [9], [13]
β_proj   β_cc   β_g      Experiment-1                      Experiment-2
                         2D       3D       2D+3D           2D       3D       2D+3D
1        0      0        88.17%   91.14%   93.81%          86.89%   90.62%   92.07%
0        1      0        70.50%   80.78%   81.94%          60.40%   68.67%   79.12%
0        0      1        88.34%   91.15%   93.97%          86.96%   91.31%   92.39%
0        1      1        83.66%   90.42%   92.45%          80.12%   88.27%   90.13%
1        0      1        88.76%   91.62%   94.34%          87.26%   90.84%   93.24%
1        1      0        83.86%   90.39%   91.61%          80.83%   88.64%   89.55%
1        1      1        85.10%   90.43%   92.55%          81.55%   91.60%   92.43%

TABLE 2
A summary of results for our experiments. The system achieves the best classification accuracy for a linear combination of the projection and the proposed kernel, i.e. for β_proj = 1, β_cc = 0, β_g = 1.
and [6], [14] respectively. The testing is performed on all six expressions as well as on three expressions, i.e. either angry-happy-surprise or sad-happy-surprise.
6.2
      An      Di      Fe      Ha      Sa      Su
An    92.71   2.89    1.67    0.00    2.73    0.00
Di    2.88    93.46   1.82    1.17    0.00    0.67
Fe    3.91    4.17    90.09   1.00    0.33    0.50
Ha    1.09    0.91    0.00    98.00   0.00    0.00
Sa    1.67    2.69    2.53    0.00    93.12   0.00
Su    0.00    0.00    1.33    0.00    0.00    98.67

TABLE 3
Confusion Matrix corresponding to the best classification accuracy (i.e. for β_proj = 1, β_cc = 0, β_g = 1) for Experiment-1. An: Angry, Di: Disgust, Fe: Fear, Ha: Happy, Sa: Sad, Su: Surprise.
for different selections of the training and testing subjects. The results averaged over ten runs are presented in Table 2. The best classification accuracy is achieved for a linear combination of the projection and the proposed kernel, i.e. for β_proj = 1, β_cc = 0, β_g = 1. The confusion matrices for this combination for Experiment-1 and Experiment-2 are shown in Table 3 and Table 4 respectively. The results indicate that the system achieves the highest classification accuracy for the happy and surprise expressions as these are the most consistently performed expressions. Angry and fear are rather more subtle expressions and the system achieves comparatively lower accuracy for them. The best overall average classification accuracy achieved by our system for all expressions is 94.34% and 93.24% for Experiment-1 and Experiment-2 respectively.
Our proposed kernel function provides a good classification accuracy of 93.97% when tested alone, i.e. for β_proj = 0, β_cc = 0, β_g = 1. It can be seen that, when each kernel is tested alone (the first three rows of Table 2), the performance of the proposed kernel function is superior to that of the other two kernel functions.
6.2.1
The effect of the video-patch length on the performance of the proposed method is presented in Table 5. Experiments are performed by changing the video-patch length from 16 to 48 frames with a step size of 8 frames.
      An      Di      Fe      Ha      Sa      Su
An    91.0    3.03    2.55    0.00    3.29    0.13
Di    0.00    95.71   1.72    1.43    0.00    1.14
Fe    5.57    3.24    83.62   2.86    1.14    3.57
Ha    1.67    0.00    0.47    97.86   0.00    0.00
Sa    1.93    2.30    1.29    0.00    93.69   0.79
Su    0.00    0.00    1.26    1.17    0.00    97.57

TABLE 4
Confusion Matrix corresponding to the best classification accuracy (i.e. for β_proj = 1, β_cc = 0, β_g = 1) for Experiment-2. An: Angry, Di: Disgust, Fe: Fear, Ha: Happy, Sa: Sad, Su: Surprise.
Patch Length        Experiment-1                      Experiment-2
                    2D       3D       2D+3D           2D       3D       2D+3D
16                  79.63%   81.31%   82.03%          75.45%   78.42%   81.64%
24                  81.54%   82.09%   84.26%          76.21%   78.67%   82.34%
32                  81.23%   84.82%   85.36%          79.92%   81.72%   83.13%
40                  83.94%   85.14%   89.42%          81.28%   84.81%   85.42%
48                  83.75%   86.01%   89.18%          82.19%   84.37%   86.21%
{24, 30, 36, 44}    88.76%   91.62%   94.34%          87.26%   90.84%   93.24%

TABLE 5
Performance analysis of the proposed method for different values of the video-patch length
The results show that the
performance of the proposed method increases with an increased patch length.
The results also show that the proposed method achieves the best performance
when video patches of multiple lengths are combined. Using video patches of
different lengths introduces temporal invariance and results in an improved
performance.
6.2.2
Auto Regressive and Moving Average (ARMA) models have been used for video analysis in a number of tasks such as dynamic texture analysis [38], silhouettes [39] and action recognition [18], [40]. Here, we present a performance evaluation of our method when the videos are modeled using ARMA models.
Given a video sequence {f(t)}_{t=1}^{τ}, f(t) ∈ R^p, an ARMA process models the video as,

f(t) = C z(t) + w(t),        w(t) ∼ N(0, R)
z(t + 1) = A z(t) + v(t),    v(t) ∼ N(0, Q)                    (17)

where z(t) ∈ R^d is the hidden state vector, A ∈ R^{d×d} is the transition matrix and C ∈ R^{p×d} is the measurement matrix. w and v are noise terms modeled by zero-mean Gaussians with covariance matrices R ∈ R^{p×p} and Q ∈ R^{d×d} respectively. p is the number of pixels (or features) of a frame and d determines the order of the system.
The parameters (A, C) of an ARMA model can be estimated using a closed-form solution as proposed in [18], [38]. For this purpose, let us represent a video with an ordered sequence of frames [f(1), f(2), ..., f(τ)] indexed by the time t = 1, ..., τ. The singular value decomposition of the video data results in [f(1), f(2), ..., f(τ)] = UΣVᵀ. The transition and the measurement matrices of the ARMA model can then be estimated by:

C = U                    (18)
A = Σ Vᵀ D_1 V (Vᵀ D_2 V)⁻¹ Σ⁻¹,

where D_1 = [0 0; I_{τ−1} 0] and D_2 = [I_{τ−1} 0; 0 0].
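A hedged sketch of this closed-form estimation is given below; the video is assumed to be stacked as a p × τ matrix of frame features, and d (the model order) is an assumed input.

```python
# Hedged sketch of the closed-form ARMA parameter estimation of Eq. (18),
# following the SVD-based procedure of [18], [38].
import numpy as np

def estimate_arma(F, d):
    """F: p x tau matrix of frames; d: state dimension (model order)."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    U, S, V = U[:, :d], np.diag(s[:d]), Vt[:d, :].T      # rank-d factors
    C = U                                                 # measurement matrix
    V1, V2 = V[:-1, :], V[1:, :]                          # rows of V at t and t+1
    # A = S (V2^T V1)(V1^T V1)^{-1} S^{-1}, cf. Eq. (18)
    A = S @ V2.T @ V1 @ np.linalg.pinv(V1.T @ V1) @ np.linalg.pinv(S)
    return A, C
```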
In order to compare ARMA models, we can first represent them with subspaces and then compute the similarity between these subspaces. The subspace representation of an ARMA model is obtained from the column space of its observability matrix.
DRAFT
1949-3045 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TAFFC.2014.2330580, IEEE Transactions on Affective Computing
31
Experiment-1                      Experiment-2
2D       3D       2D+3D           2D       3D       2D+3D
85.38%   90.51%   92.71%          81.58%   86.18%   89.72%

TABLE 6
Average classification performance for videos modeled by ARMA models
Table 7 shows a comparison of the proposed system with the previously published top-performing methods. The method proposed by Sun & Yin [5] achieves a classification accuracy of 90.22%. However, their method relies on the manual annotation of 83 facial landmarks, which is not only an undesirable and time-consuming process but may also introduce inaccuracies and could be difficult to accommodate in numerous practical applications. Amongst the previously published methods on BU4DFE, the method proposed by Drira et al. [11] shows
the best classification accuracy, i.e. 93.21%. However, it should be noted that, for the methods in [11], [42], each video sequence of n frames has been divided into multiple subsequences of 6 frames each (f_1–f_6, f_2–f_7, f_3–f_8, ..., f_{n−5}–f_n). The authors thus generate a total of 30780 subsequences for training and 3420 subsequences for testing. The proposed methods therefore perform matching on videos of six frames each (about 1/4th of a second). This duration of video is too short to capture the facial dynamics and the corresponding spatiotemporal information.
The method proposed by Fang et al. [9], [13] shows a classification accuracy of 96.71% when evaluated for only three expressions, i.e. angry, happy and surprise. The authors manually inspect all videos of the database and exclude those videos which do not start or end with a neutral frame. Also, the videos which contain a few corrupted meshes are excluded from the final evaluation. In our proposed framework, by representing the local video-patches as points on the Grassmannian and considering only the top 4 most significant basis vectors, we automatically avoid such corrupted meshes. We therefore evaluate the performance of our system on all 606 videos of the database.
The method in [7] shows an accuracy of 92.22% for only 3 expressions, i.e. sad, happy and surprise. Our experiments show that our proposed system achieves a very high classification accuracy for the happy and surprise expressions for both Experiment-1 and Experiment-2 (see Table 3 and Table 4). Therefore, if we consider only 3 expression classes, either angry-happy-surprise similar to [9], [13] or sad-happy-surprise similar to [7], the classification accuracy achieved by our system is higher than that of the previously published techniques.
Method                  Testing        Accuracy   Note
Sun & Yin [5]           Experiment-1   90.44%     Manual 83 landmarks, evaluated on subsequences of 6 frames
Drira et al. [11]       Experiment-1   93.21%     Evaluated on subsequences of 6 frames
Le et al. [7]           Experiment-1   92.22%     Tested for only sad, happy & surprise
Fang et al. [9], [13]   Experiment-2   96.71%     Tested for only angry, happy & surprise, 507 videos only
This Paper              Experiment-1   94.91%     Fully automatic. Tested for all six expressions
This Paper              Experiment-2   93.63%     Tested for all six expressions. All 606 videos

TABLE 7
Comparison with previously published results
CONCLUSION
The proposed system does not require any user intervention for manual annotation of facial landmarks. The system does not make any assumptions about the presence of all four expression segments in a video and performs equally well for all video types.
ACKNOWLEDGMENT
This work is supported by a SIRF scholarship from the University of Western Australia (UWA) and ARC grant DPI10102166.
REFERENCES
[1] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
[2] M. Pantic and L. J. M. Rothkrantz, "Automatic analysis of facial expressions: The state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1424–1445, 2000.
[3] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin, "Static and dynamic 3D facial expression recognition: A comprehensive survey," Image and Vision Computing, 2012.
[4]
[5] Y. Sun and L. Yin, "Facial expression recognition based on 3D dynamic range model sequences," in ECCV, 2008, pp. 58–71.
[6]
[7] V. Le, H. Tang, and T. Huang, "Expression recognition from 3D dynamic faces using robust spatio-temporal shape features," in IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011.
[8] M. Rosato, X. Chen, and L. Yin, "Automatic registration of vertex correspondences for 3D facial expression analysis," in 2nd IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS 2008), 2008, pp. 1–7.
[9] T. Fang, X. Zhao, O. Ocegueda, S. Shah, and I. Kakadiaris, "3D/4D facial expression analysis: an advanced annotated face model approach," Image and Vision Computing, 2012.
[10] Y. Huang, X. Zhang, Y. Fan, L. Yin, L. Seversky, J. Allen, T. Lei, and W. Dong, "Reshaping 3D facial scans for facial appearance modeling and 3D facial expression analysis," Image and Vision Computing, 2012.
[11] H. Drira, B. B. Amor, M. Daoudi, A. Srivastava, and S. Berretti, "3D dynamic expression recognition based on a novel deformation vector field and random forest," in Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), Nov. 2012.
[12] Y. Sun, X. Chen, M. Rosato, and L. Yin, "Tracking vertex flow and model adaptation for three-dimensional spatiotemporal face analysis," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 40, no. 3, pp. 461–474, May 2010.
[13] T. Fang, X. Zhao, S. Shah, and I. Kakadiaris, "4D facial expression recognition," in IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Nov. 2011, pp. 1594–1601.
[14] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert, "A dynamic approach to the recognition of 3D facial expressions and their temporal models," in IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011.
[17] R. Subbarao and P. Meer, "Nonlinear mean shift over Riemannian manifolds," International Journal of Computer Vision, vol. 84, no. 1, pp. 1–20, 2009.
[18] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa, "Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2273–2286, Nov. 2011.
[19] A. Edelman, T. Arias, and S. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[20] H. Karcher, "Riemannian center of mass and mollifier smoothing," Communications on Pure and Applied Mathematics, vol. 30, no. 5, pp. 509–541, 1977.
[21] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[22] S. Shirazi, M. T. Harandi, C. Sanderson, A. Alavi, and B. C. Lovell, "Clustering on Grassmann manifolds via kernel embedding with application to action analysis," in IEEE International Conference on Image Processing (ICIP), 2012.
[25] CVX Research, Inc., "CVX: Matlab software for disciplined convex programming, version 2.0 beta," http://cvxr.com/cvx, Sep. 2012.
[26] J. Hamm and D. Lee, "Grassmann discriminant analysis: a unifying view on subspace-based learning," in International Conference on Machine Learning (ICML), 2008, pp. 376–383.
[27] M. Harandi, C. Sanderson, S. Shirazi, and B. Lovell, "Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[28] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, "Efficient algorithms for inferences on Grassmann manifolds," in IEEE Workshop on Statistical Signal Processing, 2003, pp. 315–318.
[29] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi, "Kernel methods on the Riemannian manifold of symmetric positive definite matrices," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[42] Y. Sun and L. Yin, "3D spatio-temporal face recognition using dynamic range model sequences," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 1, 2008, pp. 1–7.