
Yue Ming a,⁎, Xiaopeng Hong b

a Beijing Key Laboratory of Work Safety Intelligent Monitoring, School of Electronic Engineering, Beijing University
of Posts and Telecommunications, Beijing
100876, PR China
b Department of Computer Science and Engineering, University of Oulu, Finland

Article history:
Received 2 January 2015
Received in revised form 24 May 2015
Accepted 18 July 2015
Available online 11 December 2015

Keywords:
3D face authentication
Depth estimation
Facial region segmentation
Local Mesh Scale-Invariant Feature Transform (LMSIFT)
Interactive education platform

Abstract

In this paper, we design a unified 3D face authentication system for practical use. First, we propose a facial depth recovery method to construct a facial depth map from stereoscopic videos. It effectively utilizes prior facial information and incorporates a visibility term to classify static and dynamic pixels for robust depth estimation. Secondly, in order to make 3D face authentication more accurate and consistent, we present an intrinsic scale feature detection for interesting points on 3D facial mesh regions. Then, a novel feature descriptor, called the Local Mesh Scale-Invariant Feature Transform (LMSIFT), is proposed to reflect the different face recognition abilities of different facial regions. Finally, the sparse optimization problem of a visual codebook is used for 3D face learning. We evaluate our approach on publicly available 3D face databases and self-collected realistic scene databases. We also develop an interactive education system to investigate its performance in practice, which demonstrates the high performance of the proposed approach for accurate 3D face authentication. Compared with previous popular approaches, our system has consistently better performance in terms of effectiveness, robustness and universality.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Face authentication and recognition has been a hot topic in the computer vision field for decades. It has a wide variety of applications, for example surveillance, automated screening, and human–computer interaction, since human face images are easy to access, and the acquisition of human faces is usually nonintrusive and user-friendly. Over the past several decades considerable efforts have been devoted to 2D face authentication and recognition. As a result, the accuracy of 2D face authentication and recognition has been substantially improved [1–6]. However, 2D face authentication and recognition are still difficult because of the inherent flaws in handling extreme illumination, pose variations, complex backgrounds and intra-subject deformation [7].

Due to the deficiency of using only intensity, as 2D face authentication and recognition algorithms do, the lowered cost of 3D acquisition devices, such as the Kinect, the SR4000 and 3D scanners, makes a strong push for performance improvement by adding 3D depth information [8]. 3D images can capture different transformations and the actual facial anatomical structure. The distinctive advantages of 3D face authentication have improved the effectiveness of recognition [9].

Though great strides have been made in 3D face authentication, it is still challenging to obtain reliability in 3D face identification and verification, particularly in unconstrained "in the wild" scenarios. Especially, accurate depth information recovery from binocular images and effective 3D facial representations are two of the crucial but unsolved issues.

Firstly, traditional binocular vision based depth information recovery methods only depend on the geometric relationship between the two parallel images from the cameras and do not incorporate a reference model of the target; therefore they can only coarsely reconstruct the 3D information (e.g. for large and uniform scenes), and have difficulties in building a fine depth image in which the detail of the scene, like the slight depth variations on a human face, can be well retained.

Secondly, traditional 3D face descriptors, including the depth map [10], spin image [11], meshSIFT [12], etc., are usually based on the assumption that the sampling of the 3D face surface is nearly uniform or made uniform by resampling. However, it is extremely difficult for this assumption to be satisfied in the real world, since occlusion and pose variations may result in the loss of local facial information. As a result, 3D information is inevitably lost. Moreover, these descriptors usually suffer from high computational complexity.

☆ The work presented in this paper was supported by the National Natural Science Foundation of China (Grant no. NSFC-61402046) and the President Funding of Beijing University of Posts and Telecommunications.
⁎ Corresponding author.
E-mail addresses: myname35875235@126.com (Y. Ming), xhong@ee.oulu.fi (X. Hong).
Concentrating on the major challenges in 3D face generation and facial representations, we propose a unified 3D face authentication framework based

on stereoscopic generation. More specifically, we propose a facial depth recovery approach for robust depth estimation, and a novel scale and rotation invariant feature descriptor for 3D face description, to effectively overcome the obstacles of computational expense and complex backgrounds. Finally, we apply the proposed system to a real-world, interactive education environment to conveniently identify students, thereby providing the benefit of personalized service.

The main contributions of this paper are summarized in the following items:

1. Reliability: We propose a reliable facial depth recovery technique, which is adapted to estimate the 3D face shape and geometric information. Different from previous research, our method is based on prior facial information, and incorporates the visibility term with static and dynamic segmentation for robust dense depth estimation. An accurate, refined model has significant advantages in handling occlusions and complex backgrounds.
2. Effectiveness: We derive a descriptor named the Local Mesh Scale-Invariant Feature Transform (LMSIFT) for 3D face authentication. LMSIFT is able to transform the depth and gray information into a one-dimensional vector without the overly strong assumption of uniform density. Compared with commonly used 3D facial descriptors, our representation has better discrimination, and easily handles scale-rotation changes and partial mesh matching.
3. Universality: The performance improvement of our 3D face authentication system was demonstrated on realistic scenes, where the 3D facial data was collected at various times, in different environments, and from individuals from different countries, with a large range of ages, poses, occlusions, and complex backgrounds.

The rest of this paper is organized as follows. In Section 2 we survey related work on face authentication using local features on meshes and on 3D face data. Section 3 introduces our proposed framework. In Section 4 consistent facial depth recovery from a binocular framework is described. In Section 5 our proposed feature detection and descriptor scheme is presented. In Section 6 we derive the learning scheme for the local mesh scale-rotation invariant descriptor. In Section 7 we verify our approach and apply our system to an interactive education system. In Section 8 we conclude the paper.

2. Related works

3D data processing, especially 3D data generation and learning, has long been a challenging topic in computer vision and graphics due to its wide applications in, for example, 3D movies (e.g. Avatar) and 3DTV [13]. The problem of analyzing and recognizing faces is an important branch of 3D data processing. The actual facial anatomical structure is usually recovered by stereo matching, compensating for the data missing during 2D projection and overcoming the difficulty of affordable 3D acquisition devices. In realistic situations, the 3D face recognition method can match partial scans, especially for analyzing students' states in an interactive learning system. From the perspective of theoretical analysis, in 2005, Phillips et al. [14] first compared the recognition performance between 2D and 3D face data, which demonstrated that 3D images can preserve more discriminative information.

2.1. Depth information recovery from binocular images

There are two kinds of devices for depth information acquisition: the depth camera and the stereo camera. The most representative product of the depth camera might be the Kinect developed by Microsoft Corp. The Kinect measures depth by infrared, which has an advantage in real-time situations. However, it is easily influenced by sunlight, so its detection range is limited. In contrast, the stereo camera system collects the depth information by building the matching relationship between the pixels in a pair of images obtained by two cameras. It has much more immunity to sunlight than the former and thus achieves a larger detection range. With the development of technology, the precision of stereo camera based approaches, namely stereo matching algorithms, is getting much higher. As a result, they have become a popular way to collect depth information [15,16].

Early research on stereo matching algorithms was started in the US and Europe [17]. Scharstein and Szeliski reorganized the past research in 2002, presented a complete taxonomy of stereo matching algorithms, which is still being used nowadays, and maintain a website for stereo matching algorithm evaluation [17]. Stereo matching algorithms can be roughly divided into global algorithms and local ones. The global stereo matching algorithms show high performance but their running costs are usually high. On the contrary, the local algorithms compute efficiently but their performance is inferior. To improve their performance, Yoon and Kweon proposed an adaptive weighting scheme [18]. In 2008, mutual information was introduced by Hirschmüller to refine the local algorithm [19]. More recently, research on stereo matching has gradually blurred the concepts of local and global algorithms, combining them to achieve semi-local, semi-global stereo matching schemes. Moreover, parallel stereo matching algorithms are well investigated in [20,21].

However, the above-mentioned methods do not incorporate a reference model of the target; therefore they can only coarsely reconstruct the 3D information (e.g. for large and uniform scenes), which makes it difficult to reflect fine depth variations (e.g. on a human face).

2.2. 3D facial representations

Because of the aforementioned advantages, and with the advent of new capture devices, 3D face recognition has triggered increased interest. Feature representations have also been developed for 3D facial images [8]. Huang et al. [10] proposed multiscale extended Local Binary Patterns (eLBP) to represent the facial geometry; their SIFT-based matching scheme combined local and holistic analysis and proved robust to expressions, occlusions, and pose variations. Spherical harmonic features (SHF) were presented by Liu et al. [11], which demonstrated outstanding advantages for representing both gross shape and fine surface details of the 3D face surface. Our previous work [22] proposed a sparse bounding sphere for 3D face recognition, which demonstrated excellent performance under pose variations. Although facial surface features can efficiently describe facial geometric structures, they unfortunately lose the regional discrimination. Our past research has proved the effectiveness of the rigid area for 3D face recognition [23].

Darom and Keller [24] applied the popular SIFT feature to the mesh domain to capture local regions of interest in different manifestations over a similar support. Maes et al. [25] presented the meshSIFT detector, which detects local features as scale-space extrema on the 3D face surface. Smeets et al. further extended meshSIFT to the symmetric surface feature [12,26,27], which shows promising 3D face recognition performance under expression variations and partial data. More recently, cues from 3D shape and 2D color were fused to improve the learning process [9]. Mian et al. [28] designed a spherical representation for the 3D facial surface and combined it with the scale-invariant feature transform on the corresponding 2D facial image, which can effectively overcome facial expression variations. Xu et al. [29] modified LDA for intrinsic feature estimation and used Adaboost learning in a hierarchical selection scheme to construct an effective classifier. However, this 2D+3D fusion scheme faces the obstacle of expensive computation and the large amount of storage space required.

Most of these methods are based on the assumption that the sampling of the 3D face surface is nearly uniform or made uniform by resampling, which is quite difficult to satisfy in real-world captured datasets and significantly reduces the descriptive power of the features.
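To make the local versus semi-global distinction above concrete, the following sketch (not part of the original paper; file names and parameter values are illustrative) computes a dense disparity map with OpenCV's semi-global block matcher, which combines local window costs with global smoothness penalties in the "semi-local, semi-global" style mentioned here.

```python
import cv2

# Illustrative only: rectified left/right image paths are placeholders.
left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,      # search range; must be a multiple of 16
    blockSize=5,             # local matching window
    P1=8 * 5 * 5,            # penalty for small disparity changes (smoothness)
    P2=32 * 5 * 5,           # penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype("float32") / 16.0
```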

3. The proposed framework

The aforementioned remaining challenges motivate the proposal of our unified binocular vision based 3D face authentication framework, shown in Fig. 1, which includes the following four parts:

1. Facial depth recovery and segmentation: By fusing the facial prior shape and virtual image pairs, we first initialize the depth map for facial sparse depth estimation. In order to effectively address the issues of occlusions, holes and complex backgrounds, spatial-temporal refinement is performed in static and dynamic ways for facial dense depth recovery. The 3D face, generated by camera calibration and the Delaunay algorithm [30], can then be divided into facial regions based on the shape index and bands.
2. Feature detection based on intrinsic mesh scale: Inspired by Lowe [31] in detecting interesting points and their scales using the DoG operator, we present an intrinsic scale detection scheme for interesting points on 3D facial mesh regions and derive mesh regional scale and rotation invariant features, which can effectively handle the degradation of missing data caused by extreme poses and illumination due to inside and outside occlusions.
3. Local mesh scale-invariant feature descriptor: The regional scale estimation can be used to derive the scale-invariant mesh descriptor. A scale-invariant feature transform is adapted to facial mesh data by representing the vicinity of each interest point and estimating its dominant angle to achieve rotation invariance. To make the approach computationally efficient and easy to extend, the concatenated feature descriptor from three orthogonal planes is used for describing the local facial information and its spatial locations. The presented feature descriptor has shown good robustness to scale changes in regional mesh matching.
4. 3D face learning: The sparse optimization problem of a visual codebook is used to extract lower-dimensional facial features. A bag-of-features can be learned to describe the intrinsic facial properties while minimizing the within-class variability.

Fig. 1. The flow chart of our proposed 3D face authentication based on the robust local mesh SIFT feature. Numbers represent the above item numbers.

4. Facial depth recovery and segmentation

As the first step, our objective is to recover the facial depth information and perform face surface segmentation. Fig. 2 provides the flow chart of facial depth recovery and segmentation. The internal and external parameters of the cameras are calibrated by Zhang's method [32]. We first initialize the facial depth maps, combining facial prior information and virtual image pair matching. As the depth pixel recovery may be affected by occlusions, data corruption, and complex backgrounds, we propose a spatiotemporal refinement approach to improve the accuracy. After the facial depth map is generated, facial regions are segmented using the shape index and bands.

Fig. 2. The flow chart of our proposed facial depth recovery and segmentation.

4.1. Facial disparity initialization

To initialize the facial depth map, we first introduce a 3D reference face image as the prior knowledge for facial depth initialization. The 3D reference face image is an average model including 53,490 vertices and 106,446 triangular faces, built from 200 copies of cloud data containing 3D human face point information and its corresponding texture information, obtained from the Basel Face Model database [27]. The average texture and shape of the 3D reference face image are illustrated in Fig. 3. It is notable that human face images, covered by uniform skin, are low-texture regions which can be significantly affected by illumination changes, occlusions and other factors. Therefore, we no longer follow the traditional way of texture-based stereo matching, but use virtual human face images to address the challenges of human face matching at different viewpoints.

Fig. 3. 3D reference face image.

The framework of stereo matching based on the 3D reference face image (template) is shown in Figs. 4 and 5. We build the correspondence between the key points on the input left and right facial images and the key points on the reference 3D facial image, based on which we obtain the pose parameters, as shown in Fig. 4. Via the pose parameters, we project the 3D reference face image to the corresponding left and right virtual images. This projected human face image is set to the same pose as the input 2D face, as shown in Fig. 5. Because the right and left virtual images are the projected facial image pair obtained by mapping the pose parameters of the 3D reference template, the right/left virtual image pair has the same pose as the input right/left facial image pair. Thus, the texture information of the right/left virtual image is the texture information of the 3D reference facial image.
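As a practical aside (not from the paper), the camera calibration by Zhang's method [32] and the rectification used in the next step can be instantiated with OpenCV, whose calibrateCamera implements a Zhang-style planar calibration. The sketch below is a minimal, assumed setup: object/image point lists, image size and variable names are placeholders, and the exact calibration procedure used by the authors is not specified in the text.

```python
import cv2

def calibrate_and_rectify(obj_pts, img_pts_l, img_pts_r, image_size):
    """Zhang-style calibration of both cameras followed by stereo rectification,
    so that corresponding points share the same row (zero vertical disparity)."""
    # Intrinsics of each camera from planar (checkerboard) correspondences.
    _, K_l, d_l, _, _ = cv2.calibrateCamera(obj_pts, img_pts_l, image_size, None, None)
    _, K_r, d_r, _, _ = cv2.calibrateCamera(obj_pts, img_pts_r, image_size, None, None)

    # Extrinsics between the two cameras (rotation R, translation T).
    _, K_l, d_l, K_r, d_r, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, img_pts_l, img_pts_r, K_l, d_l, K_r, d_r, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)

    # Rectification transforms and remap grids for both views.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K_l, d_l, K_r, d_r, image_size, R, T)
    mapx_l, mapy_l = cv2.initUndistortRectifyMap(K_l, d_l, R1, P1, image_size, cv2.CV_32FC1)
    mapx_r, mapy_r = cv2.initUndistortRectifyMap(K_r, d_r, R2, P2, image_size, cv2.CV_32FC1)
    return (mapx_l, mapy_l), (mapx_r, mapy_r), Q
```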



We denote the left and right images from a pair of calibrated binocular facial images (illustrated in Fig. 4) as I_L and I_R, respectively. We use ASM (Active Shape Model) [33] to extract the facial key points, which are closely related to areas where the face shape changes, such as the mouth corners, the nose tip, the nostrils, and the eye corners. Image rectification [34] is then carried out to restrict the vertical disparity between the images to be 0; as a result, only non-zero horizontal parallax is allowed. Suppose the coordinates of one key point on the left and right facial images are (u_1, v_1) and (u_2, v_2) respectively. The disparity can be expressed as d = u_1 - u_2.

Fig. 4. Pose parameter estimation.

According to the selected key points on the binocular images, we project the corresponding points on the reference model onto a (2D) image plane by means of orthogonal projection:

p = f \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} R_\gamma R_\theta R_\phi V_{ref} + t_{2d}    (1)

where V_{ref} = (V_{refx}, V_{refy}, V_{refz})^T are the key point coordinates of the 3D face reference model, p = (p_x, p_y)^T is the corresponding 2D projection point, R = R_\gamma R_\theta R_\phi is the rotation matrix, \gamma, \theta, \phi are the rotation angles around the X, Y, Z axes respectively, related to the facial pose, t_{2d} = (P_x, P_y)^T is the displacement within the 2D plane, and f is the focal length related to the facial length.

We consider the search for the optimal solution of the parameters \gamma, \theta, \phi, f, P_x, P_y as an optimization problem, whose objective function can be expressed as follows:

\min \sum_{j=1,\dots,N} (F_{x,j} - p_{x,j})^2 + (F_{y,j} - p_{y,j})^2    (2)

where (F_{x,j}, F_{y,j}) is the projection of key point j on the 2D input image, (p_{x,j}, p_{y,j}) is the projection position of the corresponding 3D key point j on the 2D virtual image, and N is the number of key points. The optimal solutions for the parameters \gamma, \theta, \phi, f, P_x, P_y of the input left and right images are obtained using the Levenberg-Marquardt method [35], giving the pose parameters of the input facial image pair shown in Table 1. We utilize the obtained pose parameters to generate a pair of virtual images for the input image pair by projecting the 3D face model, as shown in Fig. 5(a) and (b).
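A minimal sketch of the pose fit of Eqs. (1)–(2) is given below, assuming the N×3 reference key points V_ref and the N×2 image key points F are already available from the ASM step; SciPy's Levenberg–Marquardt solver stands in for the method of [35], and the initial values are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def rot(gamma, theta, phi):
    """Rotation matrix R = R_gamma R_theta R_phi about the X, Y, Z axes."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(gamma), -np.sin(gamma)],
                   [0, np.sin(gamma),  np.cos(gamma)]])
    Ry = np.array([[ np.cos(theta), 0, np.sin(theta)],
                   [0, 1, 0],
                   [-np.sin(theta), 0, np.cos(theta)]])
    Rz = np.array([[np.cos(phi), -np.sin(phi), 0],
                   [np.sin(phi),  np.cos(phi), 0],
                   [0, 0, 1]])
    return Rx @ Ry @ Rz

def residuals(params, V_ref, F):
    """Reprojection residuals of Eq. (2); the 2x3 matrix is the orthographic
    projection of Eq. (1)."""
    gamma, theta, phi, f, px, py = params
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    p = f * (P @ rot(gamma, theta, phi) @ V_ref.T).T + np.array([px, py])
    return (F - p).ravel()

def fit_pose(V_ref, F):
    x0 = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])   # gamma, theta, phi, f, Px, Py
    sol = least_squares(residuals, x0, args=(V_ref, F), method="lm")
    return sol.x
```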

The virtual images, which are generated by the known reference human face model, provide prior knowledge on the positions of the correspondence points. This enables us to initialize the facial disparity through the following four steps: firstly, based on the face warping from the input left image to the left virtual image, the point p_lj corresponding to F_lj on the left virtual image is determined by optimizing Eq. (2); secondly, by back-projecting the key point p_lj to the 3D model we get the vertex V, and then projecting V to the right virtual image we obtain the projection p_rj; thirdly, based on face warping analogous to the first step, the corresponding point F_rj on the right image is calculated; finally, if the points on the left and right input images satisfy the epipolar constraint [34], these two points are corresponding points in the stereo matching, and the difference of their abscissae is the human face parallax (disparity) value [35].

Fig. 5. Virtual image synthesis.

Table 1
The pose parameters of the stereo image pairs.

Pose parameters   γ        ϕ        f        θ        Px         Py
Left image        0.2077   0.1240   3.2217   0.0020   195.1449   261.1701
Right image       0.3182   0.2535   3.0836   0.0020   507.1449   261.3643

4.2. Spatio-temporal depth optimization

Our goal is to refine the disparity map Dis to handle occlusions and complex backgrounds. First, optical flow estimation [36] can be used to segment the static pixels and dynamic pixels between consecutive frames from the complex backgrounds. Then, combined with facial prior information, O_L(x) indicates whether a facial pixel x in image I_L is occluded in the left image. If x is occluded, O_L(x) = 1, otherwise O_L(x) = 0. O_R(x) is defined in a similar way. Deriving from the dense depth estimation of [37], the fine-estimated disparity can be solved by minimizing the following energy function:

\min E, \quad \text{s.t. } E(Dis, O, I) = E_d(Dis, O, I) + E_s(Dis, O, I) + E_v(Dis, O, I)    (3)

where E_d(Dis, O, I), E_s(Dis, O, I), and E_v(Dis, O, I) are the data term, the smoothing term and the visibility term of the 3D data, respectively:

1. Data term E_d(Dis, O, I): The data term measures how well the hypothesized disparity Dis fits the observations, as follows:

E_d(Dis, O, I) = \sum_x \left[ E_{d1}(Dis, O, I)\, P_{fF}(x) + E_{d2}(Dis, O, I)\, P_{fB}(x) \right]    (4)

Here, E_{d1}(Dis, O, I) is the energy function for the refined dynamic regions, E_{d2}(Dis, O, I) is for the refined static regions, and P_{fF}(x) and P_{fB}(x) denote the probabilities at pixel x of belonging to the dynamic and static regions, respectively. More specifically,

P_{fF}(x) = \begin{cases} P_S(x)\, P_F(x) / (P_F(x) + P_B(x)), & \mathrm{Distance}(x) > T_b \\ P_F(x) / (P_F(x) + P_B(x)), & \mathrm{Distance}(x) \le T_b \end{cases}    (5)

where we denote P_F(x) = P(L(x) = 1 \mid x) and P_B(x) = P(L(x) = 0 \mid x). Let L be a binary map where L(x) equals 1 if pixel x belongs to the facial area, i.e. the pixel is a "facial pixel", and 0 otherwise. P(L(x) \mid x) can be calculated as the normalized probability that L(x) is the facial region label of pixel x [37]. S is the dynamic regional map. Distance(x) is the Euclidean distance between x and the segmentation boundary. T_b is a preset threshold.
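One possible way to obtain the dynamic regional map S used above is dense optical flow between consecutive frames, as suggested by the reference to [36]; the sketch below uses OpenCV's Farnebäck flow purely as an illustrative stand-in, and the motion threshold is an assumed value rather than one given in the paper.

```python
import cv2
import numpy as np

def dynamic_region_map(prev_gray, curr_gray, motion_thresh=1.0):
    """Rough static/dynamic split between consecutive frames: pixels whose flow
    magnitude exceeds motion_thresh are marked as dynamic (S(x) = 1)."""
    # Parameters: pyr_scale=0.5, levels=3, winsize=15, iterations=3,
    # poly_n=5, poly_sigma=1.2, flags=0.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    return (magnitude > motion_thresh).astype(np.uint8)
```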

Moreover, P_{fB}(x) is calculated by

P_{fB}(x) = 1 - P_{fF}(x)    (6)

The data term can be further expressed as

E_d(Dis, O, I) = \sum_x \frac{1}{Z_n(x)} \Big( O_L(x) O_R(x)\, \eta + (1 - O_L(x))(1 + O_R(x))\, \rho(x, Dis(x), I_L) + (1 - O_R(x))(1 + O_L(x))\, \rho(x, Dis(x), I_R) \Big)    (7)

where, for the captured facial image pair I_L and I_R, \rho(x, Dis(x), I) describes the robustness of the matching cost at pixel x with disparity d, Z_n(x) is a normalization parameter, and the cost \eta is used to quantify the proportion of occlusion in the entire image.

2. Smooth term E_s(Dis, O, I): The smooth term E_s(Dis, O, I) reflects the regional smoothness of the disparity map Dis:

E_s(Dis, O, I) = \sum_x \sum_{y \in N(x)} \lambda(x, y)\, \rho_s(Dis(x), Dis(y))    (8)

where N(x) is the set of adjacent pixels, \lambda(x, y) marks a discontinuity factor, and \rho_s is a robust function defined as \rho_s(Dis(x), Dis(y)) = \min(v\,|Dis(x) - Dis(y)|, T), where vT controls the upper cost limit.

3. Visibility term E_v(Dis, O, I): The visibility term E_v(Dis, O, I) is the same as Eq. (6) in the literature [37], and is used to overcome the impact of light changes during the fine disparity estimation. E_v(Dis, O, I) can be defined as

E_v(Dis, O, I) = E_{vL}(Dis, O_L, I_L) + E_{vR}(Dis, O_R, I_R) = \sum_x \beta_w |O_L(x) - W_L(x, Dis)| + \sum_x \sum_{y \in N(x)} \beta_0 |O_L(x) - O_L(y)| + \sum_x \beta_w |O_R(x) - W_R(x, Dis)| + \sum_x \sum_{y \in N(x)} \beta_0 |O_R(x) - O_R(y)|    (9)

where W_L is a binary map defined on the intensity change of adjacent frames, which can effectively reflect the light change between consecutive frames against the complex backgrounds. If the intensity change exceeds the threshold T, W_L(x, Dis) = 1, otherwise W_L(x, Dis) = 0. W_R is defined in a similar way. N(x) is the set of all pixels adjacent to x.

Given the initialized disparity map Dis, the original energy function (3) can be simplified as

E(Dis, O, I) = \sum_x |O_L(x) - O_R(x)|\, \eta + \sum_x (1 - O_L(x))(1 + O_R(x))\, \rho(x, Dis(x), I_L) + \sum_x (1 - O_R(x))(1 + O_L(x))\, \rho(x, Dis(x), I_R) + \sum_x \beta_w \big( |O_L(x) - W_L(x, Dis)| + |O_R(x) - W_R(x, Dis)| \big) + \sum_x \sum_{y \in N(x)} \beta_0 \big( |O_L(x) - O_L(y)| + |O_R(x) - O_R(y)| \big)    (10)

The first three terms constitute the data term E_d(Dis, O, I), and the last two terms constitute the visibility term E_v(Dis, O, I) [37]. In order to effectively reflect the occlusion of the left and right images, the data term at pixel x can be defined in four cases:

E_d(x, d) = \begin{cases} \rho(x, d, I_L) + \rho(x, d, I_R), & \text{if } O_L(x) = 0 \text{ and } O_R(x) = 0 \\ 2\rho(x, d, I_R), & \text{if } O_L(x) = 1 \text{ and } O_R(x) = 0 \\ 2\rho(x, d, I_L), & \text{if } O_L(x) = 0 \text{ and } O_R(x) = 1 \\ \eta, & \text{if } O_L(x) = 1 \text{ and } O_R(x) = 1 \end{cases}    (11)

And the first term of the visibility term at pixel x also has four cases:

E_v(x, d) = \begin{cases} \beta_w (W_L(x, Dis) + W_R(x, Dis)), & \text{if } O_L(x) = 0 \text{ and } O_R(x) = 0 \\ \beta_w (1 - W_L(x, Dis) + W_R(x, Dis)), & \text{if } O_L(x) = 1 \text{ and } O_R(x) = 0 \\ \beta_w (1 + W_L(x, Dis) - W_R(x, Dis)), & \text{if } O_L(x) = 0 \text{ and } O_R(x) = 1 \\ \beta_w (2 - W_L(x, Dis) - W_R(x, Dis)), & \text{if } O_L(x) = 1 \text{ and } O_R(x) = 1 \end{cases}    (12)

We thus obtain the simplified energy function, which takes full account of the occlusion and illumination variations. BP (Belief Propagation) is used to minimize the energy function (3) and obtain a refined disparity [37].

4.2.1. Facial depth optimization
The facial depth Z can be computed using the camera calibration parameters:

Z = \frac{f\, b}{dis}    (13)

where f is the focal length of the camera, b is the baseline distance of the binocular cameras, and dis is the refined disparity. The initialized depth map is shown in Fig. 6. Then, the coordinates of the 3D vertices can be expressed as

X = \frac{z}{f}\, u_1 = \frac{u_1\, b}{dis}, \qquad Y = \frac{z}{f}\, v_1 = \frac{v_1\, b}{dis}, \qquad Z = \frac{f\, b}{dis}    (14)
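The back-projection of Eqs. (13)–(14) is straightforward to vectorize; the short sketch below is an illustrative helper (not the authors' code), assuming pixel coordinates u_1, v_1 are expressed relative to the principal point of the rectified image.

```python
import numpy as np

def disparity_to_points(disparity, f, b, u1, v1):
    """Back-project rectified pixels to 3D via Eqs. (13)-(14):
    Z = f*b/dis, X = u1*b/dis, Y = v1*b/dis.
    disparity, u1, v1 are arrays of the same shape; f is the focal length in
    pixels and b the baseline in the unit desired for X, Y, Z."""
    dis = np.where(disparity > 0, disparity, np.nan)  # guard against invalid matches
    Z = f * b / dis
    X = u1 * b / dis
    Y = v1 * b / dis
    return np.stack([X, Y, Z], axis=-1)
```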

4.2.2. 3D face mesh generation

A 3D face mesh can be generated by Delaunay triangulation [30] via Z. As the 3D coordinates of all the matched points have been calculated, in order to get a realistic 3D model of a human face it is necessary to triangulate the data points to generate a 3D face model of triangular mesh data. The details of the algorithm are as follows:

1. Initial mesh generation: Sort the scattered face points and search for the point v_1 with the smallest X coordinate. The points v_1, v_2, ..., v_n are sorted according to the square of the distance in ascending order. The first edge connects v_1 and v_2. Then, we search for the non-collinear point v_k with respect to v_1 and v_2 in the sequence {v_i}. The first triangle is formed by the points v_1, v_2, v_k and denoted as the initial mesh front boundary. Using grid-edge technology [38] and expanding outwards point by point, the initial triangular mesh of the face can be formed.
2. Mesh subdivision: Using the Loop subdivision method [34], the basic idea is to insert a node on each edge of a triangle after it is divided into two segments; the original triangulated vertices are augmented by the newly inserted nodes. Thus, each original triangular mesh is divided into four triangular meshes.
(a) If two triangles (v_0, v_1, v_2) and (v_0, v_1, v_3) share an inner edge with the two vertices v_0 and v_1, the new edge point is v_E = \frac{3}{8}(v_0 + v_1) + \frac{1}{8}(v_2 + v_3).
(b) If the internal vertex is v and its 1-neighbor vertices are v_i (i = 1, ..., n), the newly generated vertex is v_V = (1 - n\beta)v + \beta \sum_{i=1}^{n} v_i, where \beta = \frac{1}{n}\left(\frac{5}{8} - \left(\frac{3}{8} + \frac{1}{4}\cos\frac{2\pi}{n}\right)^2\right) is the weight of the neighbor vertices.
(c) If the two vertices of a boundary edge are v_0 and v_1, the new boundary vertex is v_E = \frac{1}{2}(v_0 + v_1).
(d) If the two adjacent vertices of a boundary vertex v on the border are v_0 and v_1, the newly generated vertex becomes v_V = \frac{1}{8}v_0 + \frac{1}{8}v_1 + \frac{3}{4}v.
The new boundary vertices and points are connected to generate a new triangular mesh, which converges to the limit surface in the end.
3. Mesh optimization: The face model is optimized by adjusting the positions of the internal nodes with Laplacian smoothing technology [39].

Fig. 6. The generated face depth map.

Fig. 7. The shape index values of different shapes.

4.3. Facial region segmentation

For 3D face authentication, inherent challenges still constrain the improvement of authentication accuracy, such as facial muscle contraction during expression variations, self-occlusion caused by pose variations, and outliers. In an actual scenario, different facial regions reflect the facial geometry and usually make different contributions to face authentication. In line with this finding, we segment the generated 3D face meshes into several parts with apparent semantic meaning.

The curvature and its derived shape index [40] serve as intrinsic shape properties for segmenting different facial regions. Firstly, as shown in Fig. 7, different geometries have different shape index values. We can perform facial region segmentation by combining key points with standard three-dimensional facial model information. We choose the left/right inner eye corner points, the nasal tip point and the left/right nasal basis points as the five facial key points. The nasal tip point is the highest point of the facial geometric area. Readers may refer to our earlier work for the detailed detection method [22,23]. For each point on the three-dimensional facial surface, we can calculate the shape value on the basis of the minimum and maximum curvature values k_1, k_2. Given k_1, k_2, the shape index value at surface point v_i is calculated as follows:

S(v_i) = \frac{1}{2} - \frac{1}{\pi}\tan^{-1}\frac{k_1(v_i) + k_2(v_i)}{k_1(v_i) - k_2(v_i)}    (15)

The area containing the left/right inner eye point is similar to a cone structure. We apply a 3×3 window function to search the region of the facial symmetry silhouette line and the neighborhood of the nasal tip [41]; the areas that lie above the nasal tip and whose Gaussian curvature values are approximately zero are the inner eye positions. The left/right lateral nasal bases are located at the border of the nose, which is a saddle-like geometric region whose shape index value is approximately 0.375. We can extract the border line of the saddle-like nose with contour information at the region of the facial symmetry silhouette line and the neighborhood of the nasal tip, as shown in Fig. 7.
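Eq. (15) maps the two principal curvatures to a single value; a minimal sketch of that computation is given below, assuming the Dorai–Jain sign convention used in the reconstruction of Eq. (15) and that k1, k2 have already been estimated on the mesh.

```python
import numpy as np

def shape_index(k1, k2):
    """Shape index of Eq. (15) from principal curvatures k1 >= k2."""
    k1 = np.asarray(k1, dtype=float)
    k2 = np.asarray(k2, dtype=float)
    denom = np.where(np.abs(k1 - k2) < 1e-12, 1e-12, k1 - k2)  # avoid division by zero
    return 0.5 - (1.0 / np.pi) * np.arctan((k1 + k2) / denom)

# A symmetric saddle (k1 = -k2) gives S = 0.5 under this convention.
print(shape_index(1.0, -1.0))
```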

In three-dimensional facial segmentation, it is necessary to construct a region model. First, we choose a frontal, expressionless three-dimensional facial point cloud as the reference template. Then, we divide the reference template into a corresponding set of facial regions as the regional reference facial model set, each of which can be considered as the region mask for one facial region. We label each region by the important facial organ it contains and set the region model as the facial region used for region matching with the reference template. We set different deformation parameters for different regions. For a certain region j, the deformation parameters are expressed as follows:

T^{(j)}(v_i) = R_x^{(j)} R_y^{(j)} R_z^{(j)} v_i + t^{(j)}, \quad i = 1, \dots, N    (16)

where v_i is the ith surface point on the matching region of the test object, R_x, R_y, R_z are the rotation parameters about the x, y, z axes respectively, and t is the translation parameter. T^{(j)}(v_i) is the result obtained when the points on the three-dimensional facial surface in region j have been rotated, translated and aligned.

For region j, the optimum region matching alignment is obtained by the following cost function:

\hat{T}^{(j)} = \arg\min_{T^{(j)}} \sum_{i=1}^{J} \big\| T^{(j)}(v_i) - p_{ref}^{(j)} \big\|    (17)

where p_{ref}^{(j)} are the points of the 3D reference facial image in the jth region. The region segmentation and alignment results are shown in Fig. 8.

Fig. 8. 3D facial regional segmentation.

5. Local mesh scale-invariant feature descriptor

In order to realize scale and rotation invariance for a 3D face shape, we extend the 2D SIFT feature to mesh data and form our local descriptor, namely the Local Mesh Scale-Invariant Feature Transform (LMSIFT). LMSIFT effectively characterizes the 3D facial surface at different scales from coarse to fine. Analogous to the 2D SIFT descriptor, LMSIFT also has two key steps, namely feature point detection and description.

5.1. Local mesh SIFT feature detection

In the feature point detection step, density-invariant Gaussian filters are defined to calculate the filtered mesh sets on the surface geometry. These are denoted as mesh octaves MO^s = (V, Tri), where MO^0 = MO, MO is the facial mesh with a set of vertices V = {v_i}, and Tri = \bigcup_i Tri_i is the set of triangle faces generated by the method of Section 4. A scale space of the input mesh MO can be expressed as

MO^s = \begin{cases} MO, & s = 0 \\ \hat{G}_{\sigma_s} \ast MO, & \text{otherwise} \end{cases}    (18)

where MO is the original mesh, and \hat{G}_{\sigma_s} is the approximated Gaussian filter with scale \sigma_s. The scales are calculated as \sigma_s = 2^k \sigma_0, with k being a parameter of the meshSIFT algorithm. The Gaussian filter for meshes is approximated by subsequent convolutions of the mesh with a regional filter. For each vertex v_i^s at octave s, the vertex v_i^{s+1} in the next octave can be computed by applying a regional filter with uniform weights as

v_i^{s+1} = \frac{1}{|Vn_i^s|} \sum_{v_j^s \in Vn_i^s} v_j^s    (19)

where Vn_i denotes the set of first-order neighbors of v_i. Vn is invariant to the distance between the vertices and reflects the facial structural information.

The feature points are defined as the maxima in both location and scale space. For a single facial region mesh, consecutive scale octaves are subtracted to obtain the DoG (Difference of Gaussians) function, for which the gradients and Laplacians on the meshes are calculated using numerical mesh processing [38]. Considering two consecutive mesh octaves, the DoG function d_i^s at scale s is computed by

d_i^s = \frac{1}{(\sigma_s)^2} \big\| v_i^s - v_i^{s+1} \big\|    (20)

Let D_i^s be the local density at v_i^s,

D_i^s = \frac{1}{|Vn_i^s|} \sum_{v_j^s \in Vn_i^s} \big\| v_i^s - v_j^s \big\|    (21)

In a density-invariant formulation [24], the overall filter width is provided by \sigma_N = \sqrt{N}\, D_i. The points v_i^s that are local maxima in both scale and location are chosen as the feature points. The detected feature points have remarkable robustness and repeatability, and can effectively capture the facial mesh structure and reflect the fine discriminative information of the 3D facial geometry.

Different from the meshSIFT and meshDoG detectors [25,12,26,38,42–44], the proposed LMSIFT does not depend on the assumption that the sampling of the facial mesh is uniform.
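The following is a simplified, illustrative sketch of the scale-space construction and extremum detection in the spirit of Eqs. (19)–(20); it assumes the mesh is given as a vertex array plus a first-order adjacency list, and it omits the density-invariant filter width of Eq. (21) and the region-wise handling used in the actual LMSIFT detector.

```python
import numpy as np

def smooth_octave(vertices, neighbors):
    """One smoothing step of Eq. (19): replace each vertex by the mean of its
    first-order neighbors (regional filter with uniform weights)."""
    out = np.empty_like(vertices)
    for i, nbrs in enumerate(neighbors):
        out[i] = vertices[nbrs].mean(axis=0)
    return out

def dog_keypoints(vertices, neighbors, n_octaves=4, sigma0=1.0, k=2.0):
    """Vertices whose DoG response (Eq. (20)) is a local maximum over their
    neighborhood and the two adjacent scales."""
    octaves = [np.asarray(vertices, dtype=float)]
    for _ in range(n_octaves):
        octaves.append(smooth_octave(octaves[-1], neighbors))
    sigmas = [sigma0 * k ** s for s in range(n_octaves)]
    dog = [np.linalg.norm(octaves[s] - octaves[s + 1], axis=1) / sigmas[s] ** 2
           for s in range(n_octaves)]
    keypoints = []
    for s in range(1, n_octaves - 1):
        for i, nbrs in enumerate(neighbors):
            ring = dog[s][nbrs]
            if dog[s][i] > ring.max() and dog[s][i] > dog[s - 1][i] and dog[s][i] > dog[s + 1][i]:
                keypoints.append((i, s))   # vertex index and scale of the extremum
    return keypoints
```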

The LMSIFT detector obtains key points that have local maximum values in different scale spaces based on the different geometric features of the facial regions, which reduces the number of key points and improves the computational efficiency. It also guarantees an effective data supply for the later description of the geometric deformation of a facial region over time. As a result, LMSIFT can adequately identify the characteristics of different facial regions and improve the discriminative power.

5.2. Local mesh SIFT feature descriptor

The meshHoG and meshSIFT descriptors are both feature descriptors aimed at key points on a stationary 3D facial mesh. In a real application system, the information of the key points on a facial region changes with time. The LMSIFT descriptor, different from the meshHoG and meshSIFT descriptors, cascades all the facial frames obtained for the facial mesh to form a mesh video sequence, as shown in Fig. 9. Then, the video sequence is projected onto the three coordinate planes xy, yz, xz, and three LMSIFT descriptors are obtained, one on each plane, called XY-LMSIFT, YZ-LMSIFT and XZ-LMSIFT.

To handle this problem, we utilize the temporal information contained in facial dynamics. Our LMSIFT feature incorporates the motion contained in facial dynamics from the mesh video sequence together with the facial shapes. Instead of characterizing the 3D joint distribution directly, which usually leads to a too sparse representation and more or less suffers from the curse of dimensionality, we consider the feature distributions from three orthogonal planes, namely the xy, yz, xz planes. We thereby build the LMSIFT feature descriptor by concatenating them together to make a compact feature vector. The LMSIFT is composed of XY-LMSIFT, XZ-LMSIFT and YZ-LMSIFT: the spatial geometric plane (XY-LMSIFT) and the two spatial-temporal planes (XZ-LMSIFT and YZ-LMSIFT) are extracted from the XY, XZ and YZ planes. As a result, the LMSIFT not only reflects the facial geometry, but also describes the subtle facial motion information, making it suitable for practical applications.

The feature descriptors can be computed in 3D gradient space. We project the detected feature points onto the xy, yz, xz planes and encode the depth information in the point description. Let I be the projection map of a mesh frame in the XY plane, which contains the spatial geometric information, and D the depth difference map between adjacent temporal frames, which is used to calculate the depth changes in the X and Y directions over time. The image gradients along the horizontal and vertical directions can be calculated as follows:

I_x = \nabla_x(I) = \frac{\partial I}{\partial x}, \quad I_y = \nabla_y(I) = \frac{\partial I}{\partial y}, \quad D_x = \nabla_x(D) = \frac{\partial D}{\partial x}, \quad D_y = \nabla_y(D) = \frac{\partial D}{\partial y}    (22)

where I_x, I_y, D_x, D_y are the gradients in the x and y directions. Fig. 9 illustrates the calculation of LMSIFT. More details can be found in our previous work [42].

6. 3D face learning via local mesh SIFT features

The popular bag-of-words (BoW) representation is adopted for learning the discriminating properties of the 3D face. Each 3D face image is encoded by a histogram of code words. More specifically, we use a sparse representation to make our feature robust for face authentication.

Suppose L = [l_1, l_2, \dots, l_N] \in R^{D \times N} is a set of D-dimensional descriptors extracted from the training videos. We construct the visual codebook B = [b_1, b_2, \dots, b_M] \in R^{D \times M} with M entries. Given L, we seek a sparse representation C = [c_1, c_2, \dots, c_N], c_i \in R^M, which implies that each c_i contains k (k \ll M) or fewer nonzero elements. The problem can then be transformed into the following optimization problem:

\min_C \| L - BC \|_F^2, \quad \text{s.t. } \| c_i \|_0 \le k, \; \forall i    (23)

where \| \cdot \|_F is the Frobenius norm and \| \cdot \|_0 is the l_0 norm, which counts the number of nonzero elements. The solutions to Eq. (23) are provided by convex relaxations [12].

Fig. 9. Computation of the LMSIFT feature descriptor, where D_x is the depth change in the X direction and D_y is the depth change in the Y direction.
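To make the three-orthogonal-plane construction of Eq. (22) concrete, the sketch below computes plane-wise gradient maps and pools them into one concatenated vector. It is only an illustrative simplification: the coarse orientation histogram stands in for the full SIFT-style binning of [42], and the dictionary of projection maps is an assumed input format.

```python
import numpy as np

def plane_gradients(I, D):
    """Gradients of Eq. (22) for one plane: I is the projected geometry map of a
    mesh frame, D the depth-difference map between adjacent frames."""
    Iy, Ix = np.gradient(I)     # np.gradient returns d/drow, d/dcol
    Dy, Dx = np.gradient(D)
    return Ix, Iy, Dx, Dy

def orientation_histogram(gx, gy, bins=8):
    """Magnitude-weighted orientation histogram of a gradient field."""
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)

def lmsift_vector(maps):
    """Concatenate XY-, XZ- and YZ-plane descriptors into one feature vector.
    `maps` is a dict like {"xy": (I, D), "xz": (I, D), "yz": (I, D)}."""
    parts = []
    for plane in ("xy", "xz", "yz"):
        I, D = maps[plane]
        Ix, Iy, Dx, Dy = plane_gradients(I, D)
        parts.append(np.concatenate([orientation_histogram(Ix, Iy),
                                     orientation_histogram(Dx, Dy)]))
    return np.concatenate(parts)
```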

The sparse coefficients of the vectors c_i \in C are associated with all the descriptors of a specified class and thus demonstrate the contribution of the codebook entries toward the representation of that class. Therefore, we use the coefficient histogram to denote the representation of each individual sample,

h_j = \frac{1}{N} \sum_{i=1}^{N} c_i    (24)

where c_i \in R^M is the ith descriptor code of C \in R^{M \times N}, N is the total number of descriptors for a sample, and h_j \in R^M.

The Nearest-Neighbor classifier is used for recognition: the training image with the nearest distance to the test image gives the recognition result.
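The following sketch puts Eqs. (23)–(24) and the nearest-neighbor step together, using scikit-learn's dictionary learning with OMP as one possible (not necessarily the authors') solver for the sparse codebook; the atom count and sparsity level are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.neighbors import KNeighborsClassifier

def learn_codebook(descriptors, n_atoms=256, sparsity=5):
    """Approximate solver for Eq. (23): `descriptors` is an (n_descriptors x D)
    array of LMSIFT descriptors pooled over the training set."""
    dico = DictionaryLearning(n_components=n_atoms,
                              transform_algorithm="omp",
                              transform_n_nonzero_coefs=sparsity)
    dico.fit(descriptors)
    return dico

def coefficient_histogram(dico, descriptors):
    """Eq. (24): average the sparse codes of one sample's descriptors."""
    codes = dico.transform(descriptors)   # (N x M) sparse coefficients
    return codes.mean(axis=0)

def train_classifier(train_histograms, labels):
    """Nearest-neighbor matching between test and training histograms (Section 6)."""
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(train_histograms, labels)
    return clf
```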
7. Experiments

In this section, we provide a performance evaluation of our unified framework for 3D face authentication in comparison with state-of-the-art solutions. Firstly, in Section 7.1
we show the effects of 3D face visualization using our facial depth recovery
described in Sections 4.1 and 4.2. Then, we perform a comparative study of our
proposed LMSIFT feature descriptor for 3D face authentication on two
challenging 3D face databases, the FRGC v2 [14] and CASIA [43] 3D face
databases, in Sections 7.2 and 7.3, respectively. In Section 7.4, we report the
results of 3D face authentication with large facial pose variations. Finally, in
order to test practicality in the real world, we implemented our unified framework
into a practical authentication system in Section 7.5. All of our experiments ran
on a 64-bit Core i5 2.67 GHz PC with 12 GB RAM.
For real applications, the classic Adaboost algorithm is applied to the input right/left facial images to detect the facial region and remove the part of the body below the neck. For the standard public three-dimensional facial databases, we use the robust sparse bounding sphere representation (RSBSR) algorithm [22] to remove unrelated regions.

7.1. Experimental results for 3D face generation

In this subsection, we show the experimental results for facial depth recovery
and 3D face generation using our unified framework. Comparisons are made
between simple and complex backgrounds as shown in Fig. 10. After extracting
the face area from the background, a 3D face model can be obtained by the
method discussed in Section 4. A set of 15 subjects was used for performance analysis (this will be increased in the near future). Data alignment was conducted using an automatic process involving 3D facial rigid transformation. The results demonstrated in Fig. 10 show that the stereo reconstruction of 3D faces reliably reflects the geometric structure of the subject's face. By introducing facial prior information and virtual facial images, our method achieves excellent 3D facial recovery without any singularity points, even when bad lighting or occlusions are present. The method is automatic, stable and fast.
We have collected a new database for evaluating the performance of the facial mesh reconstruction. Left and right facial images from 10 individuals were captured for reconstructing facial meshes by our proposed method, and 3D scanners were used to obtain the face meshes of the 10
individuals as ground truth data. We calculate the average Euclidean distances of
facial key points between the generated facial meshes and the ground truth
meshes. The selected key points include left/right inner eye corner points, nasal
tip point and left/right nasal basis points. From Table 2, we can conclude that the
reconstruction errors satisfy the needs of practical applications.
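The error measure used in this evaluation reduces to a simple per-key-point distance average; a small illustrative helper is sketched below, assuming both meshes provide the same ordered set of key points.

```python
import numpy as np

def mean_keypoint_error(reconstructed, ground_truth):
    """Average Euclidean distance (e.g. in mm) between corresponding facial key
    points; both inputs are (K x 3) arrays ordered identically (left/right inner
    eye corners, nasal tip, left/right nasal base)."""
    diffs = np.asarray(reconstructed) - np.asarray(ground_truth)
    return float(np.linalg.norm(diffs, axis=1).mean())
```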

7.2. Comparison evaluation for verification scenarios

In this subsection, we present the comparative evaluation of our proposed method for 3D face authentication in verification scenarios. In order to evaluate the performance of our method, we first introduce the most widely used database [14], which has become a standard benchmark in the literature of 3D face recognition. For verification, receiver operating characteristic (ROC) curves for the three different FRGC masks, namely ROC I, II, and III, are illustrated in Fig. 11. For all of the masks, at an FAR (False Acceptance Rate) of 0.1%, the overall performance of our proposed method yields the best verification results through all the thresholds, which correspond to the optimal feature subspaces.

Fig. 10. The results of 3D face reconstruction.

Furthermore, intra-class variations, due to factors as diverse as facial expressions, are usually greater than inter-class variations [10]. As a result, some popular methods were proposed for solving expression issues based on different testing sets, including Neutral vs. Neutral (N vs. N), Non-Neutral vs. Neutral (Non-N vs. N), and All vs. Neutral (A vs. N). Table 3 shows our verification results, at an FAR of 0.1%, for the ROC I, ROC II and ROC III protocols along with other state-of-the-art methods as the standard for evaluation on FRGC v2 3D face recognition. We can see that the verification rates of our method are better than all others, except Kakadiaris et al. [45], some results of Dirk et al. [12], and our previous work [22,23].

Kakadiaris et al. [45] used wavelets and pyramid transformation, which resulted in high computational cost. Dirk et al. [12] introduced the uniform assumption for the sampling of the object, which demonstrates excellent performance on the ROC I dataset with slight changes; however, for large outliers or missing data on the 3D face surface, the verification performance degrades significantly on the ROC II and ROC III datasets. Our previous work [22,23] showed promising performance for static 3D face images, but for practical applications those descriptors do not fully take into account the dynamic variations on the 3D face surface. As a result,

when subtle motion and facial muscle contraction occur in actual applications, the robustness of those algorithms is significantly affected.

For our method, we utilize the local scale estimate to derive a scale-invariant local facial mesh. Then, the novel geometric facial descriptors are learned by constructing the visual codebook with the sparse representation. This results in a substantial advantage in speed. Our algorithm, based on LMSIFT features, has a much lower computational complexity and is much easier to implement. The performance will be further improved when the amount of training data can be significantly increased.

In terms of computational efficiency, our method spends only 7.876 s for the whole 3D face authentication on the popular FRGC dataset. Kakadiaris et al. [45] introduced a complex approach; fitting the model took about 15 s from the original facial data. Mian et al. [28] also developed a complicated method for 3D face recognition; the ICP algorithm was used for 3D matching with approximately 12.19 s computational cost. Queirolo's method [49] was even slower than the ICP algorithm because of the global convergence. Thus, our method has superior performance compared with the state-of-the-art algorithms considering the computational efficiency.

Table 2
The average Euclidean distances (mm) of facial key points between the generated facial meshes and the ground truth meshes.

Facial key points                            Average Euclidean distance (mm)
Distance between inner eye corner points     32.25
Left inner eye corner point                  3.75
Right inner eye corner point                 4.02
Nasal tip point                              2.12
Left nasal basis point                       3.84
Right nasal basis point                      4.13

Table 3
Verification results (%) of different 3D face recognition methods on the FRGC v2.0 database.

Methods                 ROC I   ROC II  ROC III  N vs. N  Non-N vs. N  A vs. N
Phillips et al. [14]    –       –       –        –        40           45
Berretti et al. [46]    –       –       –        97.7     91.4         95.5
Passalis et al. [47]    –       –       –        94.9     79.4         81.5
Cook et al. [48]        93.71   92.91   92.01    –        –            –
Kakadiaris et al. [45]  97.3    97.2    97       –        79.4         85.1
Dirk et al. [12]        97.6    78.94   77.24    –        –            –
Ming et al. [22]        96.85   96.41   96.03    96.92    95.57        95.79
Ming et al. [23]        96.35   95.47   95.01    96.27    93.14        95.24
Ours                    96.97   96.5    96.1     96.4     92.5         96.3

Fig. 11. The ROC curves of the FRGC v2.0 3D face database.

7.3. Performance evaluation based on the CASIA 3D face database

3D face authentication often suffers not only from expression variations, but also from illumination and facial poses. Here, we employ the CASIA 3D face database [43] to demonstrate the sensor invariance of our unified framework. The 4625 3D facial images in the database were divided into a training subset and a testing subset: the training subset was composed of 615 images, with 5 facial images selected for each individual, and the rest of the images from the 123 individuals were used as the test set. In these experiments, we compared the performance for 3D face recognition. The considered methods include the COSMOS shape index (CO) [48], the Annotated Deformable Model (ADM) [45], the Sparse Spherical Representation (SSR) [50] and our LMSIFT. The test set was further divided into six subsets with pose and expression variations [29], as shown in Table 4, which reports the rank-one identification rates using the same test sets.

Table 4
Rank-1 identification results (%) based on the CASIA 3D face database.

Test databases                       CO      ADM     SSR     Ours
Illumination variations              48.21   98.33   96.47   98.5
Expression variations                45.74   95.73   93.08   95.2
Small pose variations                45.27   93.97   92.83   94.86
Large pose variations                32.64   56.85   60.99   80.5
Small pose variations with smiling   43.76   90.38   82.15   87.57
Large pose variations with smiling   31.79   52.14   58.43   74.5

From the experimental results, we can draw the following conclusions: (1) the highest identification rate was 98.5% (123 people), achieved by our framework; (2) robustness to shape and illumination variations plays a vitally important role in identifying an individual; (3) with expression and pose variations, our method and ADM show better accuracy; (4) our unified framework overcame the influence of self and external occlusions better than the other popular methods. In light of the above discussion, it should be noted that our framework, based on the LMSIFT feature descriptor, can capture the shape properties of an individual's face while also showing performance superior to the other methods tested. Our method is effective on many different 3D face databases, demonstrating its generality.

7.4. 3D face authentication with large facial pose variations

Large pose variations have a huge impact on 3D face authentication. In this subsection, we evaluate the authentication performance of different facial descriptions under large pose variations. We follow our previous experimental design based on the CASIA database for the computation of the rank-1 identification rate [22], including small pose variations (SPV), large pose variations (LPV), small pose variations with smiling (SPVS), and large pose variations with smiling (LPVS), which contain different training and testing sets from the experimental design in Section 7.3. From Table 5, our proposed descriptor achieves the highest rate on all the testing sets. Depth information is proved to be more effective than intensity information for authentication. Gabor features based on a multi-scale space outperform the original depth and intensity images. The fusion schemes can further improve the discrimination of the feature descriptions. The bounding spherical feature demonstrates greater advantages in overcoming outliers and self-occlusions. However, our proposed method shows more generality and robustness for 3D face authentication with large pose variations due to its scale and rotation invariance.

7.5. Practicability testing for our proposed unified framework

The previous subsections show that our 3D face authentication framework achieves promising results from a theoretical perspective. So far, we have collected one database with the stereo cameras to evaluate the performance of our proposed complete framework. The database is composed of 2000 facial

stereoscopic videos from 20 individuals, including 13 males and 7 females of different ages, with different expressions and poses. To better demonstrate the performance of the framework, we first recover facial depth information from the stereoscopic videos. Some examples are shown in Fig. 12. Then, we compare the performance of our 3D face authentication framework with other face representations, including intensity images and depth images.

Table 5
Rank-1 recognition results (%) for large pose variations.

Test databases              SPV    LPV    SPVS   LPVS
Depth                       90.7   50     81     46.5
Intensity                   69.9   49.5   68.1   48.5
Depth Gabor                 91.4   51.5   82.4   49
Intensity Gabor             75.3   65.5   77.6   61.5
Decision fusion             89     70.5   85.6   64.5
Feature fusion              91     91     87.9   79
Bounding spherical feature  94     93.4   89.2   82.9
Ours                        95.6   94.2   92.4   87.8

We randomly selected 200 of these stereoscopic videos as training sets and the remaining videos were treated as test sets. We repeated the experiments 10 times, and the average recognition results are listed in Table 6. From Table 6, we can conclude that our proposed method achieved the best performance; intensity images and depth images perform worse than ours. Thus, our complete framework better reflects facial geometry and discriminative information.

Table 6
Rank-1 recognition results (%) for our collected database.

Representations       Intensity images   Depth images   Ours
Recognition results   67.2               62.4           81.5

To further investigate the performance of our framework in a practical application, we apply the framework to our previously developed Mandarin educational system [51]. Our Mandarin educational system is a virtual reality game for Mandarin learning, which is highly responsive to a student's pronunciation.

Fig. 12. Some examples and their corresponding face meshes from our collected database.

Fig. 13. The login interface of the learning system.

When a student logs into the system for the first time, a binocular camera embedded in the system captures a 2D face image pair of the student as registration information. Then the backend server treats the registration images as the training images of the

student and generates a corresponding 3D face model following our method Mandarin learning. Based on the evaluation, the Mandarin edutainment system
as presented in Section 4. This information is then added to the student's demonstrates the effectiveness, robustness and universality.
database. References
When a student logs into the system again, the camera captures the face again and the system recovers the corresponding 3D face test model. The test model is matched to the registered models in the student's database using the feature description and learning methods discussed in Sections 5 and 6, respectively. After obtaining a match, the student can enter the Mandarin educational system and begin the learning process. The system assigns personalized learning tasks based on the different skill levels of the students and provides customized services adapted to their individual needs.
The system login interface is shown in Fig. 13. Clicking the input image button in the right column captures the student's facial images with the binocular camera. Then, after clicking the calibration and depth generation buttons, the student's depth map is produced and shown in the lower-left corner. The model generation button builds the student's 3D face model, which is displayed to the right of the depth map. Once the LMSIFT feature descriptor is selected and the features are learned, the face authentication result appears at the bottom of the interface. After the student's identity has been verified as that of a registered user, the student can start personalized learning of Mandarin pronunciation.
The environment of the learning system resembles a desert island, as shown in Fig. 14, with many virtual theme worlds containing rivers, swamps and so on. The student has to navigate step by step through the island to accomplish exploration goals, and pronunciation practice for each question is the only way to reach the next goal. Currently the system has more than 20 registered students from many different countries and a wide range of ages. Our proposed unified framework based on 3D face authentication provides strong technical support for the students through identity authentication and the delivery of personalized services.
Fig. 14. The environment of the Mandarin learning system.
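Following the same conventions, a rough sketch of this login-time verification might look as follows. Here extract_lmsift_descriptors, match_score and the database interface are again hypothetical names standing in for the LMSIFT feature description and learning stages of Sections 5 and 6, and the acceptance threshold is purely illustrative rather than a reported operating point.

```python
def authenticate_student(left_img, right_img, db, threshold=0.8):
    """Verify a returning student against the registered 3D face models."""
    # Recover the 3D face test model exactly as during registration.
    depth_map = recover_facial_depth(left_img, right_img)
    test_model = build_face_mesh(depth_map)

    # Describe the test mesh with LMSIFT features (placeholder, Section 5).
    test_features = extract_lmsift_descriptors(test_model)

    # Match against every registered model and keep the best-scoring identity.
    best_id, best_score = None, float("-inf")
    for student_id, gallery_model in db.items():
        gallery_features = extract_lmsift_descriptors(gallery_model)
        score = match_score(test_features, gallery_features)  # placeholder, Section 6
        if score > best_score:
            best_id, best_score = student_id, score

    # Accept the best match only if it is confident enough.
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score
```

In the deployed interface, these steps correspond to the calibration, depth generation, model generation and LMSIFT buttons described above, with the authentication result shown at the bottom of the login screen.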
8. Conclusions

In this paper, we have presented a novel unified framework for 3D face authentication that handles challenging issues such as facial expressions, pose and illumination variations. We propose a facial depth recovery approach based on stereo matching with facial prior information and virtual images, and we design an accurate and consistent feature descriptor, called Local Mesh Scale-Invariant Feature Transform (LMSIFT), to describe different facial regions with different discriminative power. We present experimental results on 3D face authentication on the two largest standard 3D face databases, FRGC v2 and CASIA. Compared with previous approaches, our framework has consistently better performance and shows good robustness against variations of facial expressions, pose and illumination. To test a practical application, we introduced our unified system into an interactive education platform for Mandarin learning. Based on this evaluation, the Mandarin edutainment system demonstrates the effectiveness, robustness and universality of the proposed framework.

Yue Ming received the B.S. degree in Communication Engineering, the M.Sc. degree in Human–Computer Interaction Engineering, and the Ph.D. degree in Signal and Information Processing from Beijing Jiaotong University, China, in 2006, 2008, and 2013, respectively. She worked as a visiting scholar at Carnegie Mellon University, USA, between 2010 and 2011. Since 2013, she has been a faculty member at Beijing University of Posts and Telecommunications. Her research interests include biometrics, computer vision, computer graphics, information retrieval, and pattern recognition.

Xiaopeng Hong received his B.Eng., M.Eng., and Ph.D. degrees in Computer Application from Harbin Institute of Technology, Harbin, PR China, in 2004, 2007, and 2010, respectively. He has been a researcher in the Center for Machine Vision Research, Department of Computer Science and Engineering, University of Oulu, since 2011. He has authored or co-authored more than 10 peer-reviewed articles in journals and conferences and has served as a reviewer for several journals and conferences. His current research interests include pose and gaze estimation, texture classification, object detection and tracking, and visual speech recognition.
