Escuela de Informática y Telecomunicaciones, Universidad Diego Portales, Ejército 441, Santiago 8320000, Chile
Department of Computer Engineering, Kyung Hee University, 1 Seocheon-dong, Giheung-gu, Yongin-si, Gyeonggi-do 446-701, Republic of Korea
ARTICLE INFO
Article history:
Received 17 January 2014
Available online 16 September 2014
Keywords:
Directional number pattern
Expression recognition
Face descriptor
Image descriptor
Local pattern
Scene recognition
ABSTRACT
Deriving an effective image representation is a critical step for a successful automatic image recognition
application. In this paper, we propose a new feature descriptor named Local Directional Texture Pattern
(LDTP) that is versatile, as it allows us to distinguish persons' expressions as well as different landscape scenes.
In detail, we compute the LDTP feature, at each pixel, by extracting the principal directions of the local
neighborhood, and coding the intensity differences on these directions. Consequently, we represent each
image as a distribution of LDTP codes. The mixture of structural and contrast information makes our descriptor
robust against illumination changes and noise. We also use Principal Component Analysis to reduce the
dimension of the multilevel feature set, and evaluate this reduced descriptor as well.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Nowadays several applications need to recognize something
through visual cues, such as faces, expressions, objects, or scenes.
These applications need robust descriptors that discriminate between
classes, but that are general enough to incorporate variations within
the same class. However, most of the existing algorithms have a narrow application spectrum, i.e., they focus on some specific task. For example, previous methods [1,2] designed for face analysis cannot be used to recognize scenes, as these methods have been tuned for fine texture representation. Similarly, descriptors [3,4] tuned for scene recognition under-perform on other tasks. Therefore, in this paper, we design and test a robust descriptor capable of modeling fine textures as well as coarse ones.
A wide range of algorithms have been proposed to describe
micro-patterns. The most common ones are the appearance-based
methods [5] that use image filters, either on the whole image, to create holistic features, or on some specific region, to create local features, to extract the appearance changes in the image. The performance of the appearance-based methods is excellent in constrained environments, but it degrades under environmental variations [6].
In the literature, there are many methods for the holistic class, such
as Eigenfaces and Fisherfaces. Although these methods have been
studied widely, local descriptors have gained attention because of
their robustness to illumination and pose variations. The Local Binary
Pattern (LBP) [1] is by far the most popular one, and has been successfully applied to several problem domains [5,7,8]. Despite LBP's robustness to monotonic illumination changes, it is sensitive to non-monotonic illumination variation, and shows poor performance in the presence of
random noise [7]. Tan and Triggs [9] proposed an improvement of
LBP by introducing a ternary pattern (LTP) which uses a threshold to
stabilize the micro-patterns. A directional pattern (LDP) [2] has been
proposed to overcome the limitations of LBP. However, it suffers in
noisy conditions, is sensitive to rotations, and cannot detect different
transitions in the intensity regions. Similarly, many other methods appeared that extract information and encode it in a similar way to LBP, such as infrared [10], near infrared [11], and phase information [12,13]. Nevertheless, all these methods inherit the sensitivity problem, i.e., the feature being coded into the bit-string is prone to change due to noise or other variations. Thereby, the directional-number-based methods [14-17] appeared as a solution to the common bit-string representation, as these methods use an explicit coding scheme in which the prominent directions are embedded into
the code. However, all these methods still encode only one type of
information, e.g., intensity or direction, which limits their description
capabilities.
Therefore, in this paper, we propose a novel feature descriptor
named Local Directional Texture Pattern (LDTP) which exploits the
advantages of both directional and intensity information in the image.
The combination of both features outperforms the single-feature counterparts, e.g., LBP, LTP, and LDP, among others. The proposed method identifies the principal directions from the local neighborhood, and then extracts the prominent relative intensity information. Then,
LDTP characterizes the neighborhood by mixing these two features
in a single code. In contrast, previous methods rely on one type of information and use a sensitive coding strategy. In detail, LDTP encodes the structural information of a local neighborhood by analyzing its directional information and the difference between the intensity values of the first and second maximum edge responses. This mechanism is consistent under noise, since the edge responses are more stable than raw intensities, and the use of relative intensity values makes our method more robust against illumination changes and other similar conditions. Moreover, given that we encode only the prominent information of the neighborhood, our method is more robust to changes in comparison to other methods, as we discard insignificant details that
may vary between instances of the image. Consequently, we convey
more reliable information of the local texture, rather than coding all
the information that may be misleading, not important, or affected
by noise. Furthermore, we evaluate the performance of the proposed
LDTP feature with a machine learning method, Support Vector Machine, on five different databases for expression recognition and three
different databases for scene recognition to demonstrate its robustness and versatility.
2. Local Directional Texture Pattern
The LBP feature labels each pixel by thresholding a set of sparse points of its circular neighborhood, and encodes that information in a string of bits. Similarly, LDP encodes the principal directions of each pixel's neighborhood into an eight-bit string. However, these methods only mark whether the analyzed characteristic is on or off for each neighbor. Furthermore, this ad hoc construction overlooks the prominent information in the neighborhood, as all the information in the neighborhood (regardless of its usefulness) is poured into the code. In contrast, we create a code from the principal directions of the local neighborhood, similar to the directional numbers [14-17]. However, the latter have the problem of using only the structural information of the neighborhood.
Therefore, we propose to extract the contrast information from the
principal directions to enhance the description of our code. In other
words, we code the principal direction and the intensity difference of
the two principal directions into one number. This approach allows
us to encode the prominent texture information of the neighborhood
that is revealed by its principal directions.
To compute LDTP, we calculate the principal directional numbers of the neighborhood using the Kirsch compass masks [18] in eight different directions. We define our first directional number as

$P^{1}_{dir} = \arg\max_i \{ I_i \mid 0 \le i \le 7 \}, \quad (1)$
The next and previous pixel positions along the $n$th principal direction are given by

$x^{\pm}_{P^{n}_{dir}} = \begin{cases} x \pm 1 & \text{if } P^{n}_{dir} \in \{0, 1, 7\}, \\ x & \text{if } P^{n}_{dir} \in \{2, 6\}, \\ x \mp 1 & \text{if } P^{n}_{dir} \in \{3, 4, 5\}, \end{cases} \quad (4)$

$y^{\pm}_{P^{n}_{dir}} = \begin{cases} y \mp 1 & \text{if } P^{n}_{dir} \in \{1, 2, 3\}, \\ y & \text{if } P^{n}_{dir} \in \{0, 4\}, \\ y \pm 1 & \text{if } P^{n}_{dir} \in \{5, 6, 7\}. \end{cases} \quad (5)$
This local difference is equivalent to the local thresholding that LBP performs. Unlike the binary encoding of LBP, we encode the difference using three levels (negative, equal, and positive), which creates a more distinctive code for the neighborhood. Each difference is then encoded as

$D_f(d) = \begin{cases} 0 & \text{if } |d| \le \epsilon, \\ 1 & \text{if } d < -\epsilon, \\ 2 & \text{if } d > \epsilon, \end{cases} \quad (6)$

where $\epsilon$ is a threshold. Finally, the LDTP code of the pixel $(x, y)$ is

$\mathrm{LDTP}(x, y) = 16\, P^{1}_{dir}(x, y) + 4\, D_f\big(d_1^{(x,y)}\big) + D_f\big(d_2^{(x,y)}\big). \quad (7)$
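As a minimal sketch (not the paper's implementation), the ternary encoding and the code packing of Eqs. (6) and (7) can be written as follows; the threshold value `eps = 15` is an illustrative assumption:

```python
def encode_difference(d, eps=15):
    """Ternary encoding of an intensity difference (Eq. (6)).
    eps plays the role of the threshold epsilon; 15 is an
    illustrative value, not the paper's setting."""
    if d < -eps:
        return 1
    if d > eps:
        return 2
    return 0

def ldtp_code(p1_dir, d1, d2, eps=15):
    """Pack the first principal direction (0..7) and the two
    ternary-coded differences into one LDTP code (Eq. (7))."""
    return 16 * p1_dir + 4 * encode_difference(d1, eps) + encode_difference(d2, eps)
```

Since the direction contributes a factor of 16 and each ternary digit at most 2, the codes stay below 16*7 + 4*2 + 2 = 122.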
$I_i = |I * M_i|. \quad (2)$

Thus, we compute the absolute value of the responses of the eight Kirsch masks, $\{M_0, \ldots, M_7\}$, applied to a particular pixel, and take the two greatest responses, $P^{1}_{dir}$ and $P^{2}_{dir}$. The second directional number, $P^{2}_{dir}$, is computed in the same way as the first, with the difference that we take the second maximum response in Eq. (1) instead. These directions signal the principal axes of the local texture.
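As an illustrative sketch of Eqs. (1) and (2) in plain NumPy, the two principal directional numbers can be computed as below; the index-to-direction assignment of the masks is an assumption, since the paper fixes its own ordering:

```python
import numpy as np

# The eight Kirsch compass masks. The ordering (0 = east,
# increasing counter-clockwise) is an illustrative choice.
KIRSCH = [np.array(m, dtype=float) for m in (
    [[-3, -3,  5], [-3, 0,  5], [-3, -3,  5]],   # M0: east
    [[-3,  5,  5], [-3, 0,  5], [-3, -3, -3]],   # M1: north-east
    [[ 5,  5,  5], [-3, 0, -3], [-3, -3, -3]],   # M2: north
    [[ 5,  5, -3], [ 5, 0, -3], [-3, -3, -3]],   # M3: north-west
    [[ 5, -3, -3], [ 5, 0, -3], [ 5, -3, -3]],   # M4: west
    [[-3, -3, -3], [ 5, 0, -3], [ 5,  5, -3]],   # M5: south-west
    [[-3, -3, -3], [-3, 0, -3], [ 5,  5,  5]],   # M6: south
    [[-3, -3, -3], [-3, 0,  5], [-3,  5,  5]],   # M7: south-east
)]

def principal_directions(patch):
    """Indices of the largest and second-largest absolute Kirsch
    responses of a 3x3 patch (Eqs. (1) and (2))."""
    responses = np.array([abs(np.sum(patch * m)) for m in KIRSCH])
    order = np.argsort(responses)  # ascending order of |I * M_i|
    return int(order[-1]), int(order[-2])
```

For a patch with a strong vertical edge on its right side, the first directional number comes out as the east mask, matching the intuition that the edge response is strongest across the edge.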
In each of the two principal directions, we compute the intensity
difference of the opposed pixels in the neighborhood. That is
$d_n^{(x,y)} = I\big(x^{+}_{P^{n}_{dir}}, y^{+}_{P^{n}_{dir}}\big) - I\big(x^{-}_{P^{n}_{dir}}, y^{-}_{P^{n}_{dir}}\big), \quad (3)$
where $d_n^{(x,y)}$ is the $n$th difference for the pixel $(x, y)$, taken along the $n$th principal direction; $I(x^{+}_{P^{n}_{dir}}, y^{+}_{P^{n}_{dir}})$ corresponds to the intensity value of the next pixel along that direction, and $I(x^{-}_{P^{n}_{dir}}, y^{-}_{P^{n}_{dir}})$ to that of the previous pixel. The next and previous pixel positions are defined by Eqs. (4) and (5).
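A small sketch of Eqs. (3)-(5) follows; taking direction 0 as east is an assumption about the Kirsch-mask ordering, not something the text fixes here:

```python
def next_prev_offsets(direction):
    """Offsets (dx, dy) of the 'next' pixel along a principal
    direction (Eqs. (4) and (5)); the 'previous' pixel uses the
    negated offsets. Direction 0 is assumed to be east."""
    dx = 1 if direction in (0, 1, 7) else (0 if direction in (2, 6) else -1)
    dy = -1 if direction in (1, 2, 3) else (0 if direction in (0, 4) else 1)
    return dx, dy

def directional_difference(img, x, y, direction):
    """Intensity difference between the next and previous pixels
    around (x, y) along the given direction (Eq. (3))."""
    dx, dy = next_prev_offsets(direction)
    return int(img[y + dy][x + dx]) - int(img[y - dy][x - dx])
```

Note that the previous pixel is simply the point reflection of the next pixel through $(x, y)$, which is why negating the offsets suffices.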
Fig. 1. Example of the LDTP code computation (details are in the text). (a) Original
image. (b) Edge responses after applying Kirsch masks. (c) Coded image. (d) Sample
neighborhood (intensity values). (e) Edge responses of the sample neighborhood shown
in (d). (f) Code of the neighborhood shown in (d). (For interpretation of the references
to color in this figure legend, the reader is referred to the web version of this article.)
$H = \bigcup_{i=0}^{N-1} H_i, \quad (8)$

where $H_i$ is the LDTP histogram of the $i$th region and $N$ is the number of regions.
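The regional histogram concatenation of Eq. (8) can be sketched as follows; the 4x4 grid and 128 bins are illustrative choices, not the paper's configuration:

```python
import numpy as np

def regional_histograms(code_image, grid=(4, 4), n_bins=128):
    """Split the LDTP-coded image into a grid of regions, histogram
    the codes of each region, and concatenate the histograms
    (Eq. (8)). 128 bins suffice since LDTP codes stay below
    16*7 + 4*2 + 2 = 122."""
    hists = []
    for band in np.array_split(code_image, grid[0], axis=0):
        for region in np.array_split(band, grid[1], axis=1):
            h, _ = np.histogram(region, bins=n_bins, range=(0, n_bins))
            hists.append(h)
    return np.concatenate(hists)
```

The resulting vector has one block of `n_bins` counts per region, so spatial layout is preserved in the ordering of the blocks.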
Fig. 3. Steps for creating the histogram of the scene by using a spatial pyramid.
Table 1
Comparison against other methods on the CK and JAFFE databases.

Method      CK 6-class (%)   CK 7-class (%)   JAFFE 6-class (%)   JAFFE 7-class (%)
LBP         92.6 ± 2.9       88.9 ± 3.5       86.7 ± 4.1          80.7 ± 5.5
LDP         98.5 ± 1.4       94.3 ± 3.9       85.8 ± 1.1          85.9 ± 1.8
Gabor       89.8 ± 3.1       86.8 ± 3.1       85.1 ± 5.0          79.7 ± 4.2
LTP         99.3 ± 0.2       92.5 ± 2.5       80.3 ± 1.0          77.9 ± 1.0
LDTP        99.4 ± 1.1       95.1 ± 3.1       90.2 ± 1.0          88.7 ± 0.5
LDTP+PCA    99.7 ± 0.9       95.7 ± 2.9       92.4 ± 1.2          89.2 ± 0.8
Table 2
Confusion matrix of 6-class facial expression recognition using SVM (RBF) with LDTP in the CK database.

(%)        Anger   Disgust   Fear   Joy   Sadness   Surprise
Anger      99.5    0.5       0      0     0         0
Disgust    0       100       0      0     0         0
Fear       0       0         100    0     0         0
Joy        0       0         0      100   0         0
Sadness    2.1     0         0      0     97.9      0
Surprise   0       0         0      0     0         100
Table 3
Confusion matrix of 7-class facial expression recognition using SVM (RBF) with LDTP in the CK database. Correct-classification rates (diagonal): Anger 87.5, Disgust 97, Fear 97, Joy 98.8, Sadness 95.2, Surprise 100, Neutral 95 (%); the remaining mass of each row is spread over the confused classes, with the largest single confusion being 10.7% of the Anger samples.
Table 4
Comparison with state-of-the-art methods (recognition rate, %).

(a) CK+ database
Method       Recognition (%)
SPTS         50.4
CAPP         66.7
SPTS+CAPP    83.3
CLM          74.4
EAI          82.6
LTP          78.7
LDTP         81.4
LDTP+PCA     84.5

(b) MMI database
Method       Recognition (%)
LBP          86.9
CPL          49.4
CSPL         73.5
AFL          47.7
ADL          47.8
LTP          89.5
LDTP         90.5
LDTP+PCA     93.7

(c) CMU database
Method       Recognition (%)
LBP          85.1
LBPw         90.3
LTP          88.8
LDP          88.4
LPQ          90.9
LDTP         90.9
LDTP+PCA     92.9
Table 5
Confusion matrix of facial expression recognition. Correct-classification rates (diagonal): Anger 67.5, Disgust 82.54, Fear 70.83, Happy 98.55, Sadness 60.71, Surprise 91.95 (%); the largest confusions occur for Sadness (21.43 and 14.29), Anger (17.5 and 7.5), and Fear (12.5).

Table 6
Recognition rate (%) in low-resolution (CK database) images.
Resolution   LBP          LDP          Gabor        LDTP+PCA
110 × 150    92.6 ± 2.9   98.5 ± 1.4   89.8 ± 3.1   99.7 ± 0.9
55 × 75      89.9 ± 3.1   95.5 ± 1.6   89.2 ± 3.0   98.8 ± 3.1
36 × 48      87.3 ± 3.4   93.1 ± 2.2   86.4 ± 3.3   98.5 ± 1.4
27 × 37      84.3 ± 4.1   90.6 ± 2.7   83.0 ± 4.3   95.1 ± 1.3
18 × 24      79.6 ± 4.1   89.8 ± 2.3   78.2 ± 4.5   94.8 ± 1.3
14 × 19      76.9 ± 5.0   89.1 ± 3.1   75.1 ± 5.1   94.2 ± 2.7
Table 7
Confusion matrix of facial expression recognition. Correct-classification rates (diagonal): Disgust 89.13, Fear 84.09, Joy 93.65, Sadness 96.3, Surprise 90.28 (%); the largest confusions occur for Fear (11.36) and Surprise (9.72).
Table 8
Recognition rate (%) on the scene data sets.

(a) 8-class scene data set
Method     Recognition (%)
Gist       82.60 ± 0.86
CENTRIST   85.65 ± 0.73
LDTP       81.18 ± 0.78
LDTP+PCA   85.81 ± 0.67

(b) 15-class scene data set
Method     Recognition (%)
SPM-1      66.80 ± 0.60
SPM-2      81.40 ± 0.50
SPM-C      83.25 (N/A)
SP-pLSA    83.70 (N/A)
Gist       73.28 ± 0.67
CENTRIST   83.88 ± 0.76
LDTP       73.12 ± 0.69
LDTP+PCA   83.94 ± 0.85
the second best method, while applying PCA improves the recognition result by 2%.
5.6. Scene recognition results
As in the face analysis, we used PCA to reduce the number of bins in the histogram. After reducing the data, we used 5-fold cross-validation with an RBF kernel to classify the scene images. Finally, we report the average value.
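The PCA reduction step can be sketched with a plain SVD over the histogram matrix; the number of retained components below is an illustrative choice, not the paper's setting:

```python
import numpy as np

def pca_reduce(features, n_components=50):
    """Project feature histograms (rows of `features`) onto their
    top principal components via an SVD of the centered data."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal axes, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean
```

The returned components and mean can then be reused to project test histograms before feeding them to the SVM.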
We evaluated our method against several others, which we describe in the following. Oliva and Torralba [28] proposed the Gist
descriptor to represent the similar spatial structures present in each
scene category. Gist computes the spectral information in an image through the Discrete Fourier Transform, and the spectral signals are then compressed by the Karhunen-Loève Transform. The CENsus TRansform hISTogram (CENTRIST) [3] is a holistic representation
and has strong generalizability for category recognition. CENTRIST
mainly encodes the structural properties within an image and suppresses detailed textural information. Additionally, we compared our
methods against the method of [4]: spatial pyramid matching (SPM). In one variation (SPM-1), they used 16-channel weak features, where their recognition rate is 66.8%; in the other experiment (SPM-2), they increased their recognition rate by 14.6% by incorporating the SIFT descriptor with 400 visual words. Liu and Shah [37] used SIFT, as the previous method [4] did, with the difference that they
used 400 intermediate concepts to convey more reliable information
(SPM-C). Similarly, Bosch et al. [38] used SIFT and 1200 pLSA topics
to incorporate more information (SP-pLSA).
Table 8(a) and (b) show the comparison on the 8- and 15-class scene data sets, in which our proposed method (LDTP + PCA) outperforms the other methods. Despite the improvement being small, our method is more versatile, as it can work on fine textures as well as large image representations. Additionally, we did not preprocess the images to extract parts of them to train our descriptor, as we used the images as a whole, which speeds up the process.
Additionally, we tested our proposed method on the MIT indoor scene dataset against the ROI annotation and Gist method proposed
by Quattoni and Torralba [29], the Object Bank (OB) proposed by
Li et al. [39], and the Local Pairwise Codebook (LPC) proposed by
Morioka and Satoh [40]. Table 9 shows the results of the different methods.
Table 9
Recognition accuracy of the methods for the MIT indoor scenes.

Method      Recognition (%)
ROI + Gist  25.0
OB          37.6
CENTRIST    36.9
LPC         39.6
LDTP        38.1
LDTP+PCA    35.7
References

[28] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. 42 (2001) 145-175.
[29] A. Quattoni, A. Torralba, Recognizing indoor scenes, in: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009, pp. 413-420.
[30] I.T. Joliffe, Principal Component Analysis, Springer-Verlag, 1986.
[31] M.S. Bartlett, G. Littlewort, I. Fasel, J.R. Movellan, Real time face detection and facial expression recognition: development and applications to human computer interaction, in: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop (CVPRW '03), vol. 5, 2003, p. 53, doi:10.1109/CVPRW.2003.10057.
[32] S. Chew, P. Lucey, S. Lucey, J. Saragih, J. Cohn, S. Sridharan, Person-independent facial expression detection using constrained local models, in: Proceedings of the 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG 2011), 2011, pp. 915-920, doi:10.1109/FG.2011.5771373.
[33] L. Jeni, D. Takacs, A. Lorincz, High quality facial expression recognition in video streams using shape related information only, in: Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, pp. 2168-2174, doi:10.1109/ICCVW.2011.6130516.
[34] S. Yang, B. Bhanu, Understanding discrete facial expressions in video using an emotion avatar image, IEEE Trans. Syst. Man Cybern. B 42 (4) (2012) 980-992, doi:10.1109/TSMCB.2012.2192269.
[35] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, D.N. Metaxas, Learning active facial patches for expression analysis, in: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[36] T. Ahonen, A. Hadid, M. Pietikäinen, Face description with local binary patterns: application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 2037-2041, doi:10.1109/TPAMI.2006.244.
[37] J. Liu, M. Shah, Scene modeling using co-clustering, in: Proceedings of the IEEE International Conference on Computer Vision, 2007, pp. 1-7, doi:10.1109/ICCV.2007.4408866.
[38] A. Bosch, A. Zisserman, X. Munoz, Scene classification using a hybrid generative/discriminative approach, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 712-727.
[39] L.J. Li, H. Su, E.P. Xing, F.F. Li, Object bank: a high-level image representation for scene classification & semantic feature sparsification, in: NIPS, 2010, pp. 1378-1386.
[40] N. Morioka, S. Satoh, Building compact local pairwise codebook with joint feature space clustering, in: K. Daniilidis, P. Maragos, N. Paragios (Eds.), Computer Vision - ECCV 2010, Lecture Notes in Computer Science, vol. 6311, Springer, Berlin, Heidelberg, 2010, pp. 692-705, doi:10.1007/978-3-642-15549-9_50.