Linear discriminant analysis: A detailed tutorial

DOI 10.3233/AIC-170729
IOS Press
E-mail: engalaatharwat@hotmail.com
c Faculty of Computers and Informatics, Suez Canal University, Egypt
E-mail: tmgaber@gmail.com
d Faculty of Engineering, Mansoura University, Egypt
E-mail: afai79@yahoo.com
e Faculty of Computers and Information, Cairo University, Egypt
E-mail: aboitcairo@gmail.com
Abstract. Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a pre-processing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box, but (sometimes) not well understood. The aim of this paper is to build a solid intuition for what LDA is and how LDA works, thus enabling readers of all levels to gain a better understanding of the LDA technique and to know how to apply it in different applications. The paper first gives the basic definitions and steps of how the LDA technique works, supported with visual explanations of these steps. Moreover, the two methods of computing the LDA space, i.e. the class-dependent and class-independent methods, are explained in detail. Then, in a step-by-step approach, two numerical examples are demonstrated to show how the LDA space can be calculated using the class-dependent and the class-independent methods. Furthermore, two of the most common LDA problems (i.e. the Small Sample Size (SSS) and non-linearity problems) are highlighted and illustrated, and state-of-the-art solutions to these problems are investigated and explained. Finally, a number of experiments were conducted with different datasets to (1) investigate the effect of the eigenvectors used in the LDA space on the robustness of the extracted features for the classification accuracy, and (2) show when the SSS problem occurs and how it can be addressed.

Keywords: Dimensionality reduction, PCA, LDA, Kernel Functions, Class-Dependent LDA, Class-Independent LDA, SSS (Small Sample Size) problem, eigenvectors, artificial intelligence
Mixture Discriminant Analysis (MDA) [25] and Neural Networks (NN) [27], but the most famous technique of this approach is Linear Discriminant Analysis (LDA) [50]. This category of dimensionality reduction techniques is used in biometrics [12,36], bioinformatics [77], and chemistry [11].

The LDA technique is developed to transform the features into a lower dimensional space that maximizes the ratio of the between-class variance to the within-class variance, thereby guaranteeing maximum class separability [43,76]. There are two types of LDA technique with respect to the classes: class-dependent and class-independent. In the class-dependent LDA, one separate lower dimensional space is calculated for each class to project its data onto, whereas in the class-independent LDA, each class is considered as a separate class against the other classes [1,74]. In this type, there is just one lower dimensional space for all classes to project their data onto.

Although the LDA technique is considered one of the most widely used data reduction techniques, it suffers from a number of problems. In the first problem, LDA fails to find the lower dimensional space if the dimensions are much higher than the number of samples in the data matrix; thus, the within-class matrix becomes singular, which is known as the small sample size (SSS) problem. Different approaches have been proposed to solve this problem. The first approach is to remove the null space of the within-class matrix, as reported in [56,79].

In [57], the authors presented different applications that used the LDA-SSS techniques, such as face recognition and cancer classification. Furthermore, they conducted different experiments using three well-known face recognition datasets to compare different variants of the LDA technique. Nonetheless, in [57], there is no detailed explanation of how (with numerical examples) to calculate the within-class and between-class variances to construct the LDA space. In addition, the steps of constructing the LDA space are not supported with well-explained graphs that help in understanding the underlying mechanism of LDA, and the non-linearity problem was not highlighted.

This paper gives a detailed tutorial about the LDA technique. Section 2 gives an overview of the definition of the main idea of LDA and its background. This section begins by explaining how to calculate, with visual explanations, the between-class variance and the within-class variance, and how to construct the LDA space. The algorithms for calculating the LDA space and projecting the data onto this space to reduce its dimension are then introduced. Section 3 illustrates numerical examples to show how to calculate the LDA space and how to select the most robust eigenvectors to build it. Section 4 explains the two most common problems of the LDA technique and a number of state-of-the-art methods to solve (or approximately solve) these problems. Different applications that use the LDA technique are introduced in Section 5, Section 6 lists some packages that implement LDA, and Section 7 presents the experimental results.
2. Definition of LDA

The LDA technique has three main steps. The first step is to calculate the between-class variance or between-class matrix; the second step is to calculate the within-class variance or within-class matrix; and the third step is to construct the lower dimensional space that maximizes the between-class variance and minimizes the within-class variance. This section explains these three steps in detail and then gives the full description of the LDA algorithm. Figures 1 and 2 are used to visualize the steps of the LDA technique.

Fig. 1. Visualized steps to calculate a lower dimensional subspace of the LDA technique.

Fig. 2. Projection of the original samples (i.e. data matrix) on the lower dimensional space of LDA (Vk).

2.2. Calculating the between-class variance (SB)

The between-class variance of the ith class (SBi) represents the distance between the mean of the ith class (μi) and the total mean (μ). The LDA technique searches for a lower-dimensional space that maximizes the between-class variance, or simply maximizes the separation distance between classes. To explain how the between-class variance or the between-class matrix (SB) can be calculated, the following assumptions are made. Given the original data matrix X = {x1, x2, ..., xN}, where xi represents the ith sample, pattern, or observation and N is the total number of samples, each sample is represented by M features (xi ∈ R^M). In other words, each sample is represented as a point in M-dimensional space. Assume the data matrix is partitioned into c = 3 classes as follows, X = [ω1, ω2, ω3], as shown in Fig. 1 (step (A)). Each class has five samples (i.e. n1 = n2 = n3 = 5), where ni represents the number of samples of the ith class. The total number of samples (N) is calculated as follows, N = Σ_{i=1}^{3} ni.

To calculate the between-class variance (SB), the separation distance between the different classes, which is denoted by (mi − m), is calculated as follows:

    (mi − m)² = (W^T μi − W^T μ)² = W^T (μi − μ)(μi − μ)^T W    (1)

where mi represents the projection of the mean of the ith class and m represents the projection of the total mean, and where μi and μ are the mean of the ith class and the total mean, which are calculated for each class and for all data in steps (B) and (C) of Fig. 1, respectively:

    μj = (1/nj) Σ_{xi ∈ ωj} xi    (2)

    μ = (1/N) Σ_{i=1}^{N} xi = Σ_{i=1}^{c} (ni/N) μi    (3)

where c represents the total number of classes (in our example c = 3).

The term (μi − μ)(μi − μ)^T in Equation (1) represents the separation distance between the mean of the ith class (μi) and the total mean (μ), or simply it represents the between-class variance of the ith class (SBi). Substituting SBi into Equation (1) yields:

    (mi − m)² = W^T SBi W    (4)
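To make these formulas concrete, the following short NumPy sketch implements Equations (2), (3), and the accumulation of the per-class between-class matrices. It is our illustrative reconstruction, not code from the paper, and the names (X, y, between_class_scatter) are assumptions:

    import numpy as np

    def between_class_scatter(X, y):
        # X: (N, M) data matrix with one sample per row; y: (N,) class labels.
        mu = X.mean(axis=0)                      # total mean, Equation (3)
        Sb = np.zeros((X.shape[1], X.shape[1]))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)               # class mean, Equation (2)
            d = (mu_c - mu).reshape(-1, 1)       # (mu_i - mu) as a column vector
            Sb += Xc.shape[0] * (d @ d.T)        # n_i (mu_i - mu)(mu_i - mu)^T
        return Sb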
2.3. Calculating the within-class variance (SW)

The within-class variance of each class (SWj) is calculated as in Equation (5):

    Σ_{j=1}^{c} Σ_{xi∈ωj} ||W^T xi − mj||²
      = Σ_{j=1}^{c} Σ_{xi∈ωj} (W^T xij − W^T μj)²
      = Σ_{j=1}^{c} Σ_{xi∈ωj} W^T (xij − μj)(xij − μj)^T W
      = Σ_{j=1}^{c} W^T SWj W    (5)

From Equation (5), the within-class variance for each class can be calculated as follows, SWj = dj^T · dj = Σ_{i=1}^{nj} (xij − μj)(xij − μj)^T, where xij represents the ith sample in the jth class as shown in Fig. 1 (steps (E), (F)), and dj is the mean-centred data of the jth class, i.e. dj = ωj − μj = {xi}_{i=1}^{nj} − μj. Moreover, step (F) in the figure illustrates how the within-class variance of the first class (SW1) in our example is calculated. The total within-class variance is the sum of the within-class matrices of all classes (see Fig. 1 (step (F))), and it can be calculated as in Equation (6):

    SW = Σ_{i=1}^{3} SWi
       = Σ_{xi∈ω1} (xi − μ1)(xi − μ1)^T + Σ_{xi∈ω2} (xi − μ2)(xi − μ2)^T + Σ_{xi∈ω3} (xi − μ3)(xi − μ3)^T    (6)

2.4. Constructing the lower dimensional space

After calculating the between-class variance (SB) and the within-class variance (SW), the transformation matrix (W) of the LDA technique can be calculated as in Equation (7), which is called Fisher's criterion. This formula can be reformulated as in Equation (8):

    arg max_W (W^T SB W) / (W^T SW W)    (7)

    SB W = λ SW W    (8)

where λ represents the eigenvalues of the transformation matrix (W). The solution of this problem can be obtained by calculating the eigenvalues (λ = {λ1, λ2, ..., λM}) and eigenvectors (V = {v1, v2, ..., vM}) of W = SW⁻¹ SB, if SW is non-singular [36,81,83]. The eigenvalues are scalar values, while the eigenvectors are non-zero vectors, which satisfy Equation (8) and provide us with information about the LDA space. The eigenvectors represent the directions of the new space, and the corresponding eigenvalues represent the scaling factor, length, or magnitude of the eigenvectors [34,59]. Thus, each eigenvector represents one axis of the LDA space, and the associated eigenvalue represents the robustness of this eigenvector. The robustness of an eigenvector reflects its ability to discriminate between different classes, i.e. to increase the between-class variance and decrease the within-class variance of each class, and hence to meet the LDA goal. Thus, the eigenvectors with the k highest eigenvalues are used to construct the lower dimensional space (Vk), while the other eigenvectors ({v_{k+1}, v_{k+2}, ..., v_M}) are neglected, as shown in Fig. 1 (step (G)).

Figure 2 shows the lower dimensional space of the LDA technique, which is calculated as in Fig. 1 (step (G)). As shown, the dimension of the original data matrix (X ∈ R^{N×M}) is reduced by projecting it onto the lower dimensional space of LDA (Vk ∈ R^{M×k}), as denoted in Equation (9) [81]. The dimension of the data after projection is k; hence, M − k features are ignored or deleted from each sample, and each sample that was represented as a point in M-dimensional space is represented as a point in the k-dimensional space:

    Y = X Vk    (9)

Figure 3 shows a comparison between two lower-dimensional sub-spaces. In this figure, the original data, which consist of three classes as in our example, are plotted. Each class has five samples, and all samples are represented by two features only (xi ∈ R²) so that they can be visualized. Thus, each sample is represented as a point in two-dimensional space. The transformation matrix (W (2 × 2)) is calculated using the steps in Sections 2.2, 2.3, and 2.4. The eigenvalues (λ1 and λ2) and eigenvectors (i.e. sub-spaces) (V = {v1, v2}) of W are then calculated. Thus, there are two eigenvectors or sub-spaces.
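The whole class-independent pipeline of Sections 2.2–2.4 (Equations (2)–(9)) can be sketched in a few lines of NumPy. This is our illustrative reconstruction, not code from the paper, and it assumes SW is non-singular (i.e. no SSS problem, see Section 4):

    import numpy as np

    def lda_fit(X, y, k):
        # Returns the k most discriminative eigenvectors (the space V_k).
        mu = X.mean(axis=0)
        M = X.shape[1]
        Sb = np.zeros((M, M))
        Sw = np.zeros((M, M))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            d = (mu_c - mu).reshape(-1, 1)
            Sb += Xc.shape[0] * (d @ d.T)        # between-class matrix
            D = Xc - mu_c                        # mean-centred class data d_j
            Sw += D.T @ D                        # within-class matrix, Equation (6)
        evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)   # W = Sw^{-1} Sb
        order = np.argsort(evals.real)[::-1]     # sort eigenvalues, descending
        return evecs[:, order[:k]].real          # V_k

    # Projection as in Equation (9): Y = X V_k
    # Y = X @ lda_fit(X, y, k)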
A comparison between the two lower-dimensional sub-spaces shows the following:

• First, the separation distance between different classes when the data are projected on the first eigenvector (v1) is much greater than when the data are projected on the second eigenvector (v2). As shown in the figure, the three classes are efficiently discriminated when they are projected on v1. Moreover, the distance between the projected means when the original data are projected on v1 is much greater than when the data are projected on v2, which reflects that the first eigenvector discriminates the three classes better than the second one.
• Second, the within-class variance when the data are projected on v1 is much smaller than when the data are projected on v2.

Fig. 3. A visualized comparison between the two lower-dimensional sub-spaces which are calculated using three different classes.

The remaining steps of the two methods are the same. Table 1 shows the notations which are used in the two algorithms.

2.7. Computational complexity of LDA

In this section, the computational complexity of LDA is analyzed. The computational complexity of the first four steps, which are common in both the class-dependent and class-independent methods, is computed as follows. As illustrated in Algorithm 1, in step (2), calculating the mean of each class requires (NM + cM) operations. In step (3), there are NM additions and M divisions, i.e., there are (NM + M) operations. The computational complexity of the fourth step is c(M + M² + M²), where M is for (μi − μ), M² is for (μi − μ)(μi − μ)^T, and the last M² is for the multiplication between ni and the matrix (μi − μ)(μi − μ)^T.
Algorithm 1 Linear Discriminant Analysis (LDA): Class-Independent
1: Given a set of N samples [xi]_{i=1}^{N}, each of which is represented as a row of length M as in Fig. 1 (step (A)), the data matrix X (N × M) is given by:

    X = [ x(1,1) x(1,2) ... x(1,M)
          x(2,1) x(2,2) ... x(2,M)
          ...
          x(N,1) x(N,2) ... x(N,M) ]    (10)

2: Compute the mean of each class μi (1 × M) as in Equation (2).
3: Compute the total mean of all data μ (1 × M) as in Equation (3).
4: Calculate the between-class matrix SB (M × M) as follows:

    SB = Σ_{i=1}^{c} ni (μi − μ)^T (μi − μ)    (11)

5: Compute the within-class matrix SW (M × M) as follows:

    SW = Σ_{j=1}^{c} Σ_{i=1}^{nj} (xij − μj)^T (xij − μj)    (12)

   where xij represents the ith sample in the jth class.
6: From Equations (11) and (12), the matrix W that maximizes Fisher's formula, which is defined in Equation (7), is calculated as follows, W = SW⁻¹ SB. The eigenvalues (λ) and eigenvectors (V) of W are then calculated.
7: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used as the lower dimensional space (Vk).
8: Project all original samples (X) onto the lower dimensional space of LDA as in Equation (9).

Algorithm 2 Linear Discriminant Analysis (LDA): Class-Dependent
1: Given a set of N samples [xi]_{i=1}^{N}, each of which is represented as a row of length M as in Fig. 1 (step (A)), the data matrix X (N × M) is given by:

    X = [ x(1,1) x(1,2) ... x(1,M)
          x(2,1) x(2,2) ... x(2,M)
          ...
          x(N,1) x(N,2) ... x(N,M) ]    (13)

2: Compute the mean of each class μi (1 × M) as in Equation (2).
3: Compute the total mean of all data μ (1 × M) as in Equation (3).
4: Calculate the between-class matrix SB (M × M) as in Equation (11).
5: for all Class i, i = 1, 2, ..., c do
6:   Compute the within-class matrix of each class SWi (M × M) as follows:

        SWj = Σ_{xi∈ωj} (xi − μj)^T (xi − μj)    (14)

7:   Construct a transformation matrix for each class (Wi) as follows:

        Wi = SWi⁻¹ SB    (15)

8:   The eigenvalues (λ^i) and eigenvectors (V^i) of each transformation matrix (Wi) are then calculated, where λ^i and V^i represent the calculated eigenvalues and eigenvectors of the ith class, respectively.
9:   Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used to construct a lower dimensional space for each class (Vk^i).
10:  Project the samples of each class (ωj) onto their lower dimensional space (Vk^j), as follows:

        Ωj = xi Vk^j,  xi ∈ ωj    (16)

     where Ωj represents the projected samples of the class ωj.
11: end for

In the class-dependent method, steps (6–8) compute, for each class, a separate within-class matrix, a transformation matrix, and its eigenvalues and eigenvectors. These steps are repeated for each class, which increases the complexity of the class-dependent algorithm. In total, the computational complexity of the class-dependent algorithm is O(NM²) if N > M; otherwise, the complexity is O(cM³). Hence, the class-dependent method needs more computations than the class-independent method.
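As a sketch of the difference between the two algorithms, the class-dependent variant (Algorithm 2) can be written as follows. This is again our illustrative NumPy reconstruction under the same assumptions as the earlier sketch, and it returns one projection per class:

    import numpy as np

    def lda_class_dependent(X, y, k):
        mu = X.mean(axis=0)
        M = X.shape[1]
        Sb = np.zeros((M, M))
        classes = np.unique(y)
        for c in classes:
            Xc = X[y == c]
            d = (Xc.mean(axis=0) - mu).reshape(-1, 1)
            Sb += Xc.shape[0] * (d @ d.T)            # Equation (11)
        projections = {}
        for c in classes:
            Xc = X[y == c]
            D = Xc - Xc.mean(axis=0)
            Swi = D.T @ D                            # Equation (14)
            evals, evecs = np.linalg.eig(np.linalg.inv(Swi) @ Sb)  # W_i, Equation (15)
            order = np.argsort(evals.real)[::-1]
            Vki = evecs[:, order[:k]].real           # class-specific space V_k^i
            projections[c] = Xc @ Vki                # Equation (16)
        return projections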
Table 1
Notation

Notation   Description
X          Data matrix
N          Total number of samples in X
W          Transformation matrix
ni         Number of samples in ωi
μi         The mean of the ith class
μ          Total or global mean of all samples
SWi        Within-class variance or scatter matrix of the ith class (ωi)
SBi        Between-class variance of the ith class (ωi)
V          Eigenvectors of W
Vi         ith eigenvector
xij        The ith sample in the jth class
k          The dimension of the lower dimensional space (Vk)
xi         ith sample
M          Dimension of X or the number of features of X
Vk         The lower dimensional space
c          Total number of classes
mi         The mean of the ith class after projection
m          The total mean of all classes after projection
SW         Within-class variance
SB         Between-class variance
λ          Eigenvalue matrix
λi         ith eigenvalue
Y          Projection of the original data

3. Numerical examples

In this section, two numerical examples are presented. They explain the steps needed to calculate the LDA space and show how the LDA technique is used to discriminate between only two different classes. In the first example, the lower-dimensional space is calculated using the class-independent method, while in the second example, the class-dependent method is used. Moreover, a comparison between the lower dimensional spaces of the two methods is presented. In all numerical examples, the numbers are rounded to the nearest hundredths (i.e. only two digits after the decimal point are displayed).

The first four steps of both the class-independent and class-dependent methods are common, as illustrated in Algorithms 1 and 2. Thus, in this section, we show how these steps are calculated.

Given two different classes, ω1 (5 × 2) and ω2 (6 × 2), which have n1 = 5 and n2 = 6 samples, respectively, each sample in both classes is represented by two features (i.e. M = 2) as follows:

    ω1 = [ 1.00 2.00
           2.00 3.00
           3.00 3.00
           4.00 5.00
           5.00 5.00 ]
    and
    ω2 = [ 4.00 2.00
           5.00 0.00
           5.00 2.00
           3.00 2.00
           5.00 3.00
           6.00 3.00 ]    (17)

To calculate the lower dimensional space using LDA, first the mean of each class μj is calculated. The mean of each class and the total mean are shown below:

    μ1 = [3.00 3.60],  μ2 = [4.67 2.00],  μ = [3.91 2.73]    (18)

The between-class variance of the first class (SB1) is then calculated as follows:

    SB1 = n1 (μ1 − μ)^T (μ1 − μ)
        = 5 [−0.91 0.87]^T [−0.91 0.87]
        = [ 4.13 −3.97 ; −3.97 3.81 ]    (19)

Similarly, SB2 is calculated as follows:

    SB2 = [ 3.44 −3.31 ; −3.31 3.17 ]    (20)
The total between-class variance is calculated as follows:

    SB = SB1 + SB2 = [ 7.58 −7.27 ; −7.27 6.98 ]    (21)

To calculate the within-class matrix, first the mean of each class is subtracted from each sample in that class. This step is called mean-centering the data, and it is calculated as follows, di = ωi − μi, where di represents the mean-centred data of the class ωi. The values of d1 and d2 are as follows:

    d1 = [ −2.00 −1.60
           −1.00 −0.60
            0.00 −0.60
            1.00  1.40
            2.00  1.40 ]
    and
    d2 = [ −0.67  0.00
            0.33 −2.00
            0.33  0.00
           −1.67  0.00
            0.33  1.00
            1.33  1.00 ]    (22)

In the next two subsections, two different methods are used to calculate the LDA space.

3.1. Class-independent method

In this section, the LDA space is calculated using the class-independent method. This method represents the standard method of LDA as in Algorithm 1.

After centering the data, the within-class variance for each class (SWi (2 × 2)) is calculated as follows, SWj = dj^T · dj = Σ_{i=1}^{nj} (xij − μj)^T (xij − μj), where xij represents the ith sample in the jth class. The total within-class matrix (SW (2 × 2)) is then calculated as follows, SW = Σ_{i=1}^{c} SWi. The values of the within-class matrix of each class and the total within-class matrix are as follows:

    SW1 = [ 10.00 8.00 ; 8.00 7.20 ],
    SW2 = [ 5.33 1.00 ; 1.00 6.00 ],    (23)
    SW  = [ 15.33 9.00 ; 9.00 13.20 ]

The transformation matrix (W) in the class-independent method can be obtained as follows, W = SW⁻¹ SB, and the values of SW⁻¹ and W are as follows:

    SW⁻¹ = [ 0.11 −0.07 ; −0.07 0.13 ]
    and                                        (24)
    W = [ 1.36 −1.31 ; −1.48 1.42 ]

The eigenvalues (λ (2 × 2)) and eigenvectors (V (2 × 2)) of W are then calculated as follows:

    λ = [ 0.00 0.00 ; 0.00 2.78 ]
    and                                        (25)
    V = [ −0.69 0.68 ; −0.72 −0.74 ]

From the above results it can be noticed that the second eigenvector (V2) has a corresponding eigenvalue larger than that of the first eigenvector (V1), which reflects that the second eigenvector is more robust than the first one; hence, it is selected to construct the lower dimensional space. The original data are projected on the lower dimensional space as follows, yi = ωi V2, where yi (ni × 1) represents the data of the ith class after projection, and its values are as follows:

    y1 = ω1 V2 = [−0.79 −0.85 −0.18 −0.97 −0.29]^T    (26)

Similarly, y2 is calculated as follows:

    y2 = ω2 V2 = [1.24 3.39 1.92 0.56 1.18 1.86]^T    (27)
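The numbers of this example can be reproduced with a few lines of NumPy. This is an illustrative check of ours, not part of the original paper; note that the signs of the computed eigenvectors, and hence of the projections, may be flipped relative to the printed values:

    import numpy as np

    w1 = np.array([[1., 2.], [2., 3.], [3., 3.], [4., 5.], [5., 5.]])
    w2 = np.array([[4., 2.], [5., 0.], [5., 2.], [3., 2.], [5., 3.], [6., 3.]])
    mu1, mu2 = w1.mean(0), w2.mean(0)
    mu = np.vstack([w1, w2]).mean(0)

    Sb = 5 * np.outer(mu1 - mu, mu1 - mu) + 6 * np.outer(mu2 - mu, mu2 - mu)
    Sw = (w1 - mu1).T @ (w1 - mu1) + (w2 - mu2).T @ (w2 - mu2)
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    v2 = evecs[:, np.argmax(evals.real)]   # eigenvector of the non-zero eigenvalue
    print(np.round(Sb, 2))                 # ~ [[7.58, -7.27], [-7.27, 6.98]], Eq. (21)
    print(np.round(evals.real, 2))         # ~ [0.00, 2.78], Eq. (25)
    print(np.round(w1 @ v2, 2))            # ~ y1 of Eq. (26), up to rounding and sign
    print(np.round(w2 @ v2, 2))            # ~ y2 of Eq. (27), up to rounding and sign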
Fig. 4. Probability density function of the projected data of the first example: (a) the projected data on V1; (b) the projected data on V2.

Figure 4 illustrates a probability density function (pdf) graph of the projected data (yi) on the two eigenvectors (V1 and V2). A comparison of the two eigenvectors reveals the following:

• The data of each class are discriminated much better when they are projected on the second eigenvector (see Fig. 4(b)) than when they are projected on the first one (see Fig. 4(a)). In other words, the second eigenvector maximizes the between-class variance more than the first one.
• The within-class variance (i.e. the variance between the samples of the same class) of the two classes is minimized when the data are projected on the second eigenvector. As shown in Fig. 4(b), the within-class variance of the first class is small compared with Fig. 4(a).

3.2. Class-dependent method

In the class-dependent method, a separate transformation matrix Wi = SWi⁻¹ SB is calculated for each class as in Equation (15). The value of W1 is calculated as follows:

    W1 = SW1⁻¹ SB
       = [ 0.90 −1.00 ; −1.00 1.25 ] [ 7.58 −7.27 ; −7.27 6.98 ]
       = [ 14.09 −13.53 ; −16.67 16.00 ]    (28)

Similarly, W2 is calculated as follows:

    W2 = [ 1.70 −1.63 ; −1.50 1.44 ]    (29)

The eigenvalues and eigenvectors of W1 and W2 are then calculated as follows:

    λω1 = [ 0.00 0.00 ; 0.00 30.09 ]  and  Vω1 = [ −0.69 0.65 ; −0.72 −0.76 ]    (30)

    λω2 = [ 3.14 0.00 ; 0.00 0.00 ]  and  Vω2 = [ 0.75 0.69 ; −0.66 0.72 ]    (31)

where λωi and Vωi represent the eigenvalues and eigenvectors of the ith class, respectively.

From the results shown above it can be seen that the second eigenvector of the first class (Vω1^(2)) has a corresponding eigenvalue larger than that of the first one; thus, the second eigenvector is used as a lower dimensional space for the first class as follows, y1 = ω1 · Vω1^(2), where y1 represents the projection of the samples of the first class.
Similarly, for the second class, the first eigenvector (Vω2^(1)) has a corresponding eigenvalue larger than the second one; thus, Vω2^(1) is used to project the data of the second class as follows, y2 = ω2 · Vω2^(1), where y2 represents the projection of the samples of the second class. The values of y1 and y2 are as follows:

    y1 = [−0.88 −1.00 −0.35 −1.24 −0.59]^T  and
    y2 = [1.68 3.76 2.43 0.93 1.77 2.53]^T    (32)

Figure 5 shows a pdf graph of the projected data (i.e. y1 and y2) on the two eigenvectors (Vω1^(2) and Vω2^(1)), and it reveals the following findings:

• First, the projected data of the two classes are efficiently discriminated.
• Second, as shown in Fig. 6, there are two eigenvectors, Vω1^(2) (red line) and Vω2^(1) (blue line), which represent the first and second classes, respectively. The differences between the two eigenvectors are as follows:
  ∗ Projecting the original data on the two eigenvectors discriminates between the two classes. As shown in the figure, the distance between the projected means m1 − m2 is larger than the distance between the original means μ1 − μ2.
  ∗ The within-class variance of each class is decreased. For example, the within-class variance of the first class (SW1) is decreased when it is projected on its corresponding eigenvector.
  ∗ As a result of the above two findings, Vω1^(2) and Vω2^(1) are used to construct the LDA space.

Fig. 6. Illustration of the example of the two different LDA methods. The blue and red lines represent the first and second eigenvectors of the class-dependent approach, respectively, while the solid and dotted black lines represent the second and first eigenvectors of the class-independent approach, respectively.
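Continuing the numeric sketch above, the class-dependent quantities of Equations (28)–(32) follow the same pattern (again an illustrative check of ours; eigenvector signs may be flipped):

    # Class-dependent transformation matrices, Equations (28) and (29).
    W1 = np.linalg.inv((w1 - mu1).T @ (w1 - mu1)) @ Sb   # ~ [[14.09, -13.53], [-16.67, 16.00]]
    W2 = np.linalg.inv((w2 - mu2).T @ (w2 - mu2)) @ Sb   # ~ [[1.70, -1.63], [-1.50, 1.44]]
    l1, V1 = np.linalg.eig(W1)         # non-zero eigenvalue ~ 30.09, Equation (30)
    l2, V2 = np.linalg.eig(W2)         # non-zero eigenvalue ~ 3.14,  Equation (31)
    y1 = w1 @ V1[:, np.argmax(l1.real)]   # ~ y1 of Equation (32), up to sign
    y2 = w2 @ V2[:, np.argmax(l2.real)]   # ~ y2 of Equation (32), up to sign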
Although the LDA is one of the most commonly used data reduction techniques, it suffers from two main problems: the Small Sample Size (SSS) and linearity problems. In the next two subsections, these two problems are explained, and some of the state-of-the-art solutions are highlighted.

4.1. Linearity problem

The LDA technique is used to find a linear transformation that discriminates between different classes. However, if the classes are non-linearly separable, LDA cannot find a lower dimensional space. In other words, LDA fails to find the LDA space when the discriminatory information is not in the means of the classes. Figure 7 shows how the discriminatory information does not exist in the mean, but in the variance of the data; this is because the means of the two classes are equal, and hence the LDA space cannot be calculated.

One of the solutions to this problem is based on the transformation concept, which is known as kernel methods or functions [3,50]. Figure 7 illustrates how the transformation is used to map the original data into a higher dimensional space; hence, the data become linearly separable, and the LDA technique can find the lower dimensional space in the new space. Figure 8 graphically and mathematically shows how two non-separable classes in one-dimensional space are transformed into a two-dimensional space (i.e. a higher dimensional space), thus allowing linear separation.

Fig. 8. Example of kernel functions. The samples in the top panel (X), which are represented by a line (i.e. a one-dimensional space), are non-linearly separable, while the samples in the bottom panel (Z), which are generated by mapping the samples of the top panel, are linearly separable. The top panel shows how the two classes are non-separable, while the bottom shows how the transformation solves this problem and the two classes become linearly separable.

The kernel idea is applied in Support Vector Machines (SVM) [49,66,69], Support Vector Regression (SVR) [58], PCA [51], and LDA [50]. Let φ represent a nonlinear mapping to the new feature space Z, in which the total mean is μ^φ = (1/N) Σ_{i=1}^{N} φ(xi) = Σ_{i=1}^{c} (ni/N) μi^φ. The between-class and within-class matrices in the new feature space Z are then calculated as follows:

    SB^φ = Σ_{i=1}^{c} ni (μi^φ − μ^φ)(μi^φ − μ^φ)^T    (34)

    SW^φ = Σ_{j=1}^{c} Σ_{i=1}^{nj} (φ(xij) − μj^φ)(φ(xij) − μj^φ)^T    (35)
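A toy sketch of this transformation idea (ours, not from the paper) uses an explicit map φ(x) = (x, x²) instead of an implicit kernel; two classes with equal means in one dimension become linearly separable in Z:

    import numpy as np

    # Two classes with (approximately) equal means: the discriminatory
    # information is in the variance, so LDA fails in the 1-D input space.
    x1 = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])        # class 1, near the origin
    x2 = np.array([-4.5, -4.0, -3.5, 3.5, 4.0, 4.5])  # class 2, far from it

    phi = lambda x: np.c_[x, x ** 2]   # explicit mapping phi: X -> Z = (x, x^2)
    z1, z2 = phi(x1), phi(x2)
    # In Z the class means differ in the x^2 coordinate, so the classes become
    # linearly separable and the LDA steps of Section 2 can be applied to z1, z2.
    print(z1.mean(axis=0), z2.mean(axis=0))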
4.2. Small Sample Size (SSS) problem

4.2.1. Problem definition
Singularity, Small Sample Size (SSS), or the under-sampled problem is one of the big problems of the LDA technique. It results from high-dimensional pattern classification tasks or a low number of training samples available for each class compared with the dimensionality of the sample space [30,38,82,85].

The SSS problem occurs when SW is singular.³ The upper bound of the rank⁴ of SW is N − c, while the dimension of SW is M × M [17,38]. Thus, in most cases M ≫ N − c, which leads to the SSS problem. For example, in face recognition applications, the size of a face image may reach 100 × 100 = 10,000 pixels, which represents high-dimensional features and leads to a singularity problem.

4.2.2. Common solutions to the SSS problem
Many studies have proposed solutions for this problem; each has its advantages and drawbacks.
• Regularization (RLDA): In the regularization method, the identity matrix is scaled by multiplying it by a regularization parameter (η > 0) and adding it to the within-class matrix to make it non-singular [18,38,45,82]. Thus, the diagonal components of the within-class matrix are biased as follows, SW = SW + ηI. However, choosing the value of the regularization parameter requires tuning, and a poor choice for this parameter can degrade the performance of the method [38,45]. Another problem of this method is that the parameter η is just added to make the inversion of SW possible and has no clear mathematical interpretation [38,57].
• Sub-space: In this method, a non-singular intermediate space is obtained to reduce the dimension of the original data so that the within-class matrix becomes full-rank⁵ and can be inverted. However, as reported in [22], losing some discriminant information is a common drawback associated with the use of this method.
• Null Space: Many studies have proposed to remove the null space of SW to make SW full-rank and hence invertible. The drawback of this method is that more discriminant information is lost when the null space of SW is removed, which has a negative impact on how the lower dimensional space satisfies the LDA goal [83].

³A matrix is singular if it is square and does not have a matrix inverse, i.e. its determinant is zero; hence, not all of its columns and rows are independent.
⁴The rank of a matrix represents the number of linearly independent rows or columns.
⁵A is a full-rank matrix if all columns and rows of the matrix are independent (i.e. rank(A) = #rows = #cols) [23].

Four different variants of the LDA technique that are used to solve the SSS problem are introduced as follows.

PCA+LDA technique. In this technique, the original d-dimensional features are first reduced to an h-dimensional feature space using PCA, and then LDA is used to further reduce the features to k dimensions. PCA is used in this technique to reduce the dimensions so that the rank of SW becomes N − c, as reported in [4]; hence, the SSS problem is addressed. However, PCA neglects some discriminant information, which may reduce the classification performance [57,60].

Direct LDA technique. Direct LDA (DLDA) is one of the well-known techniques used to solve the SSS problem. This technique has two main steps [83]. In the first step, the transformation matrix, W, is computed to transform the training data to the range space of SB. In the second step, the dimensionality of the transformed data is further reduced using some regulating matrices, as in Algorithm 4. The benefit of DLDA is that no discriminative features are neglected, as happens in the PCA+LDA technique [83].

Regularized LDA technique. In Regularized LDA (RLDA), a small perturbation is added to the SW matrix to make it non-singular, as mentioned in [18]. This regularization can be applied as follows:

    (SW + ηI)⁻¹ SB wi = λi wi    (36)

where η is a regularization parameter that needs to be tuned; a poor choice of it can degrade the generalization performance [57].

Null LDA technique. The aim of the NLDA technique is to find the orientation matrix W, and this can be achieved in two steps. In the first step, the range space of SW is neglected, and the data are projected only on the null space of SW as follows, SW W = 0. In the second step, the aim is to search for W that satisfies SB W ≠ 0 and maximizes |W^T SB W|. The high dimensionality of the feature space may lead to computational problems, which can be solved by (1) using the PCA technique as a pre-processing step, i.e. before applying the NLDA technique, to reduce the dimension of the feature space to N − 1 by removing the null space of ST = SB + SW [57], or (2) using the PCA technique before the second step of the NLDA technique [54]. Mathematically, in the Null LDA (NLDA) technique, the h column vectors of the transformation matrix W = [w1, w2, ..., wh] are taken to be the null space of SW as follows, wi^T SW wi = 0, ∀i = 1, ..., h, where wi^T SB wi ≠ 0. Hence, M − (N − c) linearly independent vectors are used to form a new orientation matrix, which is used to maximize |W^T SB W| subject to the constraint |W^T SW W| = 0, as in Equation (37):

    W = arg max_{|W^T SW W| = 0} |W^T SB W|    (37)
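As an illustrative sketch of the RLDA idea (ours, under the assumption that SW and SB have already been computed as in Section 2), the only change to the class-independent code is biasing the diagonal of SW before inversion:

    import numpy as np

    def rlda_space(Sw, Sb, eta, k):
        # eta > 0 is the regularization parameter; it must be tuned, and a
        # poor choice can degrade generalization (Section 4.2.2).
        Sw_reg = Sw + eta * np.eye(Sw.shape[0])   # S_W + eta*I is non-singular
        evals, evecs = np.linalg.eig(np.linalg.inv(Sw_reg) @ Sb)
        order = np.argsort(evals.real)[::-1]
        return evecs[:, order[:k]].real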
5. Applications of the LDA technique

In many applications, due to the high number of features or dimensionality, the LDA technique has been used. Some of the applications of the LDA technique and its variants are described as follows.

5.1. Biometrics applications

Biometric systems have two main steps, namely, feature extraction (including pre-processing steps) and recognition. In the first step, the features are extracted, and in the second step, the extracted features are used to identify individuals. For example, in [10,20,41,68,75,83], the LDA technique has been applied to face recognition. Moreover, the LDA technique was used in ear [84], fingerprint [44], gait [5], and speech [24] applications. In addition, the LDA technique was used with animal biometrics as in [19,65].

5.2. Agriculture applications

5.3. Medical applications

In medical applications, the data are often represented by a large number of features or dimensions. Due to this high dimensionality, the computational models need more time to train, which may be infeasible and expensive. Moreover, this high dimensionality reduces the classification performance of the computational model and increases its complexity. This problem can be solved using the LDA technique to construct a new set of features from the large number of original features. Many papers have used LDA in medical applications [8,16,39,52–55].

6. Packages

In this section, some of the available packages that are used to compute the space of LDA variants are introduced. For example, WEKA⁶ is a well-known Java-based data mining tool with open source machine learning software for classification, association rules, regression, pre-processing, clustering, and visualization. In WEKA, the machine learning algorithms can be applied directly on a dataset or called from the user's own Java code. XLSTAT⁷ is another data analysis and statistical package for Microsoft Excel that has a wide variety of dimensionality reduction algorithms including LDA. The dChip⁸ package is also used for visualization of gene expression and SNP microarrays, and it includes some data analysis algorithms such as LDA, clustering, and discriminant function analysis. The Dimensionality Reduction¹¹ package is mainly written in Matlab, and it has a number of dimensionality reduction techniques such as ULDA, QLDA, and KDA. DTREG¹² is a software package that is used for modeling medical and business data, and it has several predictive modeling methods such as LDA, PCA, linear regression, and decision trees.
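Beyond the tools above, Python's scikit-learn library also implements LDA. A minimal usage sketch (ours; X and y are an assumed data matrix and label vector, not objects defined in the paper) computes and applies the LDA space in two lines:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis(n_components=2)  # at most c - 1 components
    Y = lda.fit_transform(X, y)                       # project X onto the LDA space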
7. Experimental results and discussion

In this section, a number of experiments were conducted to show (1) how the LDA technique can be used in different applications, (2) what the relation is between its parameter (the eigenvectors) and the accuracy of a classification problem, and (3) when the SSS problem occurs and a method for solving it.

7.1. Experimental setup

This section gives an overview of the databases, the platform, and the machine specification used to conduct our experiments. Different biometric datasets were used in the experiments to show how the LDA technique, through its parameter, behaves with different data. These datasets are described as follows:

• ORL dataset:¹³ a face images dataset (Olivetti Research Laboratory, Cambridge) [48], which consists of 40 distinct individuals, was used. In this dataset, each individual has ten images taken at different times and under varying light conditions. The size of each image is 92 × 112.
• Yale dataset:¹⁴ another face images dataset, which contains 165 grey-scale images in GIF format of 15 individuals [78]. Each individual has 11 images in different expressions and configurations: center-light, happy, left-light, with glasses, normal, right-light, sad, sleepy, surprised, and a wink.
• Ear dataset:¹⁵ an ear images dataset, which consists of 17 distinct individuals. Six views of the left profile from each subject were taken under a uniform, diffuse lighting.

To evaluate the classification accuracy, the Nearest Neighbour classifier was used. This classifier aims to classify the testing image by comparing its position in the LDA space with the positions of the training images. Furthermore, class-independent LDA was used in all experiments. Moreover, the Matlab platform (R2013b) and a PC with the following specifications: Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz and 4.00 GB RAM, under a Windows 32-bit operating system, were used in our experiments.

Table 2
Dataset description

Dataset      Dimension (M)   No. of Samples (N)   No. of classes (c)
ORL64×64     4096            400                  40
ORL32×32     1024            400                  40
Ear64×64     4096            102                  17
Ear32×32     1024            102                  17
Yale64×64    4096            165                  15
Yale32×32    1024            165                  15

7.2. Experiment on LDA parameter (eigenvectors)

The aim of this experiment is to investigate the relation between the number of eigenvectors used in the LDA space and the classification accuracy. The LDA space is constructed from the first k eigenvectors, which are sorted according to their robustness (i.e. their eigenvalues); the robustness of each eigenvector is reflected by its corresponding eigenvalue.
Fig. 9. Samples of the first individual in: ORL face dataset (top row); Ear dataset (middle row), and Yale face dataset (bottom row).
The results of this experiment show that when the number of eigenvectors used in computing the LDA space was increased, the classification accuracy also increased up to a specific extent, after which the accuracy remains constant. As seen in Fig. 10(a), this extent differs from one application to another. For example, the accuracy of the ear dataset remains constant when the percentage of the used eigenvectors is more than 10%. This was expected, as the eigenvectors of the LDA space are sorted according to their robustness (see Section 2.6). Similarly, in the ORL and Yale datasets, the accuracy remains approximately constant after a certain percentage of the eigenvectors is used.
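A sketch of this experiment in NumPy (ours; the variable names X_train, y_train, X_test, y_test, and V are assumptions) projects the data onto the first k sorted eigenvectors and scores a 1-nearest-neighbour classifier for each k:

    import numpy as np

    def nn_accuracy(train_Y, train_labels, test_Y, test_labels):
        # 1-NN in the projected space with Euclidean distances;
        # train_labels and test_labels are NumPy arrays.
        dists = np.linalg.norm(test_Y[:, None, :] - train_Y[None, :, :], axis=2)
        pred = train_labels[np.argmin(dists, axis=1)]
        return float((pred == test_labels).mean())

    # V is assumed to hold the eigenvectors sorted by eigenvalue (Section 2.6).
    # for k in range(1, V.shape[1] + 1):
    #     Vk = V[:, :k]
    #     acc = nn_accuracy(X_train @ Vk, y_train, X_test @ Vk, y_test)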
Fig. 11. The robustness of the first 40 eigenvectors of the LDA technique using the ORL32×32, Ear32×32, and Yale32×32 datasets.

7.3. Experiments on the small sample size problem

The aim of this experiment is to show when the LDA is subject to the SSS problem and which methods can be used to solve it. In this experiment, the PCA-LDA [4] and Direct-LDA [22,83] methods are used to address the SSS problem. A set of experiments was conducted to show how the two methods interact with the SSS problem, including the number of samples in each class and the total number of classes used. High-dimensional samples lead to the SSS problem when the dimension of the samples is much higher than the total number of samples. As shown in Table 2, the size of each image of the ORL64×64 dataset is 64 × 64 = 4096 pixels, while the total number of samples is 400. The mathematical interpretation of this point shows that the dimension of SW is 4096 × 4096, while its rank is at most N − c; hence, SW is singular.

Algorithm 3 PCA-LDA
1: Read the training images (X = {x1, x2, ..., xN}), where xi (ro × co) represents the ith training image, ro and co represent the rows (height) and columns (width) of xi, respectively, and N represents the total number of training images.
2: Convert all images into vector representation Z = {z1, z2, ..., zN}, where the dimension of each zi is M × 1, M = ro × co.
3: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB (M × M), and the within-class matrix SW (M × M) as in Algorithm 1 (steps (2–5)).
4: Use the PCA technique to reduce the dimension of X to be equal to or lower than r, where r represents the rank of SW, as follows:

    XPCA = U^T X    (38)

   where U ∈ R^{M×r} is the lower dimensional space of PCA and XPCA represents the projected data on the PCA space.
5: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB, and the within-class matrix SW of XPCA as in Algorithm 1 (steps (2–5)).
6: Calculate W and the eigenvalues and eigenvectors of the W matrix as in Algorithm 1.
7: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used as a lower dimensional space (Vk).
8: The original samples (X) are first projected on the PCA space and then projected on the LDA space as in Equation (9).
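A compact NumPy sketch of Algorithm 3 (ours, reusing the lda_fit sketch from Section 2, and assuming h is at most the rank of SW so that SW becomes non-singular in the PCA space) is:

    import numpy as np

    def pca_lda(X, y, h, k):
        mu = X.mean(axis=0)
        Xc = X - mu
        # PCA basis U: the top-h principal directions via the SVD of the
        # mean-centred data (the columns of Vh.T are the directions).
        U = np.linalg.svd(Xc, full_matrices=False)[2].T[:, :h]   # (M, h)
        X_pca = Xc @ U                   # projection onto the PCA space, Eq. (38)
        Vk = lda_fit(X_pca, y, k)        # LDA in the reduced space (Algorithm 1)
        return X_pca @ Vk, U, Vk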
Table 3
Accuracy of the PCA-LDA and Direct-LDA methods using the datasets listed in Table 2

Dataset     Dim(SW)       # Training images  # Testing images  Rank(SW)  Accuracy (%)
                                                                         PCA-LDA   Direct-LDA
ORL32×32    1024 × 1024   5                  5                 160       75.5      88.5
                          7                  3                 240       75.5      97.5
                          9                  1                 340       80.39     97.5
Ear32×32    1024 × 1024   3                  3                 34        80.39     96.08
                          4                  2                 51        88.24     94.12
                          5                  1                 68        100       100
Yale32×32   1024 × 1024   6                  5                 75        78.67     90.67
                          8                  3                 105       84.44     97.79
                          10                 1                 135       100       100
ORL64×64    4096 × 4096   5                  5                 160       72        87.5
                          7                  3                 240       81.67     96.67
                          9                  1                 340       82.5      97.5
Ear64×64    4096 × 4096   3                  3                 34        74.5      96.08
                          4                  2                 51        91.18     96.08
                          5                  1                 68        100       100
Yale64×64   4096 × 4096   6                  5                 75        74.67     92
                          8                  3                 105       95.56     97.78
                          10                 1                 135       93.33     100

The bold values indicate that the corresponding methods obtain the best performances.
The Direct-LDA method achieved better results than the PCA-LDA method because, as reported in [22,83], the Direct-LDA method saves approximately all the important information for classification, while the PCA step in the PCA-LDA method saves the information with high variance.

Algorithm 4 Direct-LDA
1: Read the training images (X = {x1, x2, ..., xN}), where xi (ro × co) represents the ith training image, ro and co represent the rows (height) and columns (width) of xi, respectively, and N represents the total number of training images.

8. Conclusions

In this paper, the LDA technique was presented in detail, and two numerical examples were given and graphically illustrated to show how the LDA space using the two methods can be constructed. In all examples, the mathematical interpretation of the robustness and the selection of the eigenvectors, as well as the data projection, were detailed and discussed. Also, common LDA problems (e.g. the SSS and linearity problems) were mathematically explained using graphical examples, and their state-of-the-art solutions were highlighted. Moreover, a detailed implementation of LDA applications was presented. Using three standard datasets, a number of experiments were conducted to (1) investigate and explain the relation between the number of eigenvectors and the robustness of the LDA space, and (2) practically show when the SSS problem occurs and how it can be addressed.

References

[1] S. Balakrishnama and A. Ganapathiraju, Linear discriminant analysis - a brief tutorial, Institute for Signal and Information Processing (1998).
[2] E. Barshan, A. Ghodsi, Z. Azimifar and M.Z. Jahromi, Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition 44(7) (2011), 1357–1371. doi:10.1016/j.patcog.2010.12.015.
[3] G. Baudat and F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12(10) (2000), 2385–2404. doi:10.1162/089976600300014980.
[4] P.N. Belhumeur, J.P. Hespanha and D. Kriegman, Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997), 711–720. doi:10.1109/34.598228.
[5] N.V. Boulgouris and Z.X. Chi, Gait recognition using Radon transform and linear discriminant analysis, IEEE Transactions on Image Processing 16(3) (2007), 731–740. doi:10.1109/TIP.2007.891157.
[6] M. Bramer, Principles of Data Mining, 2nd edn, Springer, 2013.
[11] D. Coomans, D. Massart and L. Kaufman, Optimization by statistical linear discriminant analysis in analytical chemistry, Analytica Chimica Acta 112(2) (1979), 97–122. doi:10.1016/S0003-2670(01)83513-3.
[12] J. Cui and Y. Xu, Three dimensional palmprint recognition using linear discriminant analysis method, in: Proceedings of the Second International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA 2011), IEEE, 2011, pp. 107–111. doi:10.1109/IBICA.2011.31.
[13] D.-Q. Dai and P.C. Yuen, Face recognition by regularized discriminant analysis, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37(4) (2007), 1080–1085. doi:10.1109/TSMCB.2007.895363.
[14] D. Donoho and V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts?, in: Advances in Neural Information Processing Systems 16, MIT Press, 2004, pp. 1141–1148.
[15] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd edn, Wiley, 2012.
[16] S. Dudoit, J. Fridlyand and T.P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97(457) (2002), 77–87. doi:10.1198/016214502753479248.
[17] T.-t. Feng and G. Wu, A theoretical contribution to the fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices, 2014, arXiv preprint arXiv:1409.2579.
[18] J.H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84(405) (1989), 165–175. doi:10.1080/01621459.1989.10478752.
[19] T. Gaber, A. Tharwat, A.E. Hassanien and V. Snasel, Biometric cattle identification approach based on Weber's local descriptor and AdaBoost classifier, Computers and Electronics in Agriculture 122 (2016), 55–66. doi:10.1016/j.compag.2015.12.022.
[20] T. Gaber, A. Tharwat, A. Ibrahim, V. Snášel and A.E. Hassanien, Human thermal face recognition based on random linear oracle (RLO) ensembles, in: International Conference on Intelligent Networking and Collaborative Systems, IEEE, 2015, pp. 91–98. doi:10.1109/INCoS.2015.67.
[21] T. Gaber, A. Tharwat, V. Snasel and A.E. Hassanien, Plant identification: Two dimensional-based vs. one dimensional-based feature extraction methods, Springer, 2015.
[27] G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507. doi:10.1126/science.1127647.
[28] K. Honda and H. Ichihashi, Fuzzy local independent component analysis with external criteria and its application to knowledge discovery in databases, International Journal of Approximate Reasoning 42(3) (2006), 159–173. doi:10.1016/j.ijar.2005.10.011.
[29] Q. Hu, L. Zhang, D. Chen, W. Pedrycz and D. Yu, Gaussian kernel based fuzzy rough sets: Model, uncertainty measures and applications, International Journal of Approximate Reasoning 51(4) (2010), 453–471. doi:10.1016/j.ijar.2010.01.004.
[30] R. Huang, Q. Liu, H. Lu and S. Ma, Solving the small sample size problem of LDA, in: Proceedings of the 16th International Conference on Pattern Recognition, 2002, Vol. 3, IEEE, 2002, pp. 29–32.
[31] A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis, Vol. 46, Wiley, 2004.
[32] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, Wiley, 2000.
[33] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 1st edn, Wiley, 2014.
[34] T. Li, S. Zhu and M. Ogihara, Using discriminant analysis for multi-class classification: An experimental investigation, Knowledge and Information Systems 10(4) (2006), 453–472. doi:10.1007/s10115-006-0013-y.
[35] K. Liu, Y.-Q. Cheng, J.-Y. Yang and X. Liu, An efficient algorithm for Foley–Sammon optimal set of discriminant vectors by algebraic method, International Journal of Pattern Recognition and Artificial Intelligence 6(5) (1992), 817–829. doi:10.1142/S0218001492000412.
[36] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Face recognition using LDA-based algorithms, IEEE Transactions on Neural Networks 14(1) (2003), 195–200. doi:10.1109/TNN.2002.806647.
[37] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularized discriminant analysis for the small sample size problem in face recognition, Pattern Recognition Letters 24(16) (2003), 3079–3087. doi:10.1016/S0167-8655(03)00167-3.
[38] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition, Pattern Recognition Letters 26(2) (2005), 181–191. doi:10.1016/j.patrec.2004.09.014.
[39] B. Moghaddam, Y. Weiss and S. Avidan, Generalized spectral bounds for sparse LDA, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 641–648.
[40] W. Müller, T. Nocke and H. Schumann, Enhancing the visualization process with principal component analysis to support the exploration of trends, in: Proceedings of the 2006 Asia-Pacific Symposium on Information Visualisation, Vol. 60, Australian Computer Society, Inc., 2006, pp. 121–130.
[41] S. Noushath, G.H. Kumar and P. Shivakumara, (2D)²LDA: An efficient approach for face recognition, Pattern Recognition 39(7) (2006), 1396–1400. doi:10.1016/j.patcog.2006.01.018.
[42] K.K. Paliwal and A. Sharma, Improved pseudoinverse linear discriminant analysis method for dimensionality reduction, International Journal of Pattern Recognition and Artificial Intelligence 26(1) (2012), 1250002. doi:10.1142/S0218001412500024.
[43] F. Pan, G. Song, X. Gan and Q. Gu, Consistent feature selection and its application to face recognition, Journal of Intelligent Information Systems 43(2) (2014), 307–321. doi:10.1007/s10844-014-0324-5.
[44] C.H. Park and H. Park, Fingerprint classification using fast Fourier transform and nonlinear discriminant analysis, Pattern Recognition 38(4) (2005), 495–503. doi:10.1016/j.patcog.2004.08.013.
[45] C.H. Park and H. Park, A comparison of generalized linear discriminant analysis algorithms, Pattern Recognition 41(3) (2008), 1083–1097. doi:10.1016/j.patcog.2007.07.022.
[46] S. Rezzi, D.E. Axelson, K. Héberger, F. Reniero, C. Mariani and C. Guillou, Classification of olive oils using high throughput flow ¹H NMR fingerprinting with principal component analysis, linear discriminant analysis and probabilistic neural networks, Analytica Chimica Acta 552(1) (2005), 13–24. doi:10.1016/j.aca.2005.07.057.
[47] Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23(19) (2007), 2507–2517. doi:10.1093/bioinformatics/btm344.
[48] F.S. Samaria and A.C. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, IEEE, 1994, pp. 138–142.
[49] B. Schölkopf, C.J. Burges and A.J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.
[50] B. Schölkopf and K.-R. Müller, Fisher discriminant analysis with kernels, in: Proceedings of the 1999 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing IX, Madison, WI, USA, 1999, pp. 41–48.
[51] B. Schölkopf, A. Smola and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10(5) (1998), 1299–1319. doi:10.1162/089976698300017467.
[52] A. Sharma, S. Imoto and S. Miyano, A between-class overlapping filter-based method for transcriptome data analysis, Journal of Bioinformatics and Computational Biology 10(5) (2012), 1250010. doi:10.1142/S0219720012500102.
[53] A. Sharma, S. Imoto and S. Miyano, A filter based feature selection algorithm using null space of covariance matrix for DNA microarray gene expression data, Current Bioinformatics 7(3) (2012), 289–294. doi:10.2174/157489312802460802.
[54] A. Sharma, S. Imoto, S. Miyano and V. Sharma, Null space based feature selection method for gene expression data, International Journal of Machine Learning and Cybernetics 3(4) (2012), 269–276. doi:10.1007/s13042-011-0061-9.
[55] A. Sharma and K.K. Paliwal, Cancer classification by gradient LDA technique using microarray gene expression data, Data & Knowledge Engineering 66(2) (2008), 338–347. doi:10.1016/j.datak.2008.04.004.
[56] A. Sharma and K.K. Paliwal, A new perspective to null linear discriminant analysis method and its fast implementation using random matrix multiplication with scatter matrices, Pattern Recognition 45(6) (2012), 2205–2213. doi:10.1016/j.patcog.2011.11.018.
[57] A. Sharma and K.K. Paliwal, Linear discriminant analysis for the small sample size problem: An overview, International Journal of Machine Learning and Cybernetics (2014), 1–12.
[58] A.J. Smola and B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14(3) (2004), 199–222. doi:10.1023/B:STCO.0000035301.49549.88.
[59] G. Strang, Introduction to Linear Algebra, 4th edn, Wellesley-Cambridge Press, Massachusetts, 2003.
[60] D.L. Swets and J.J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8) (1996), 831–836. doi:10.1109/34.531802.
[61] M.M. Tantawi, K. Revett, A. Salem and M.F. Tolba, Fiducial feature reduction analysis for electrocardiogram (ECG) based biometric recognition, Journal of Intelligent Information Systems 40(1) (2013), 17–39. doi:10.1007/s10844-012-0214-7.
[62] A. Tharwat, Principal component analysis - a tutorial, International Journal of Applied Pattern Recognition 3(3) (2016), 197–240. doi:10.1504/IJAPR.2016.079733.
[63] A. Tharwat, T. Gaber, Y.M. Awad, N. Dey and A.E. Hassanien, Plants identification using feature fusion technique and bagging classifier, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 461–471.
[64] A. Tharwat, T. Gaber and A.E. Hassanien, One-dimensional vs. two-dimensional based features: Plant identification approach, Journal of Applied Logic (2016).
[65] A. Tharwat, T. Gaber, A.E. Hassanien, H.A. Hassanien and M.F. Tolba, Cattle identification using muzzle print images based on texture features approach, in: Proceedings of the Fifth International Conference on Innovations in Bio-Inspired Computing and Applications IBICA 2014, Springer, 2014, pp. 217–227.
[66] A. Tharwat, A.E. Hassanien and B.E. Elnaghi, A BA-based algorithm for parameter optimization of support vector machine, Pattern Recognition Letters (2016).
[67] A. Tharwat, A. Ibrahim, A.E. Hassanien and G. Schaefer, Ear recognition using block-based principal component analysis and decision fusion, in: International Conference on Pattern Recognition and Machine Intelligence, Springer, 2015, pp. 246–254. doi:10.1007/978-3-319-19941-2_24.
[68] A. Tharwat, H. Mahdi, A. El Hennawy and A.E. Hassanien, Face sketch synthesis and recognition based on linear regression transformation and multi-classifier technique, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 183–193.
[69] A. Tharwat, Y.S. Moemen and A.E. Hassanien, Classification of toxicity effects of biotransformed hepatic drugs using whale optimized support vector machines, Journal of Biomedical Informatics (2017).
[70] C.G. Thomas, R.A. Harshman and R.S. Menon, Noise reduction in BOLD-based fMRI using component analysis, Neuroimage 17(3) (2002), 1521–1537. doi:10.1006/nimg.2002.1200.
[71] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3(1) (1991), 71–86. doi:10.1162/jocn.1991.3.1.71.
[72] V. Vapnik, The Nature of Statistical Learning Theory, 2nd edn, Springer, New York, 2013.
[73] J. Venna, J. Peltonen, K. Nybo, H. Aidos and S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research 11 (2010), 451–490.
[74] P. Viszlay, M. Lojka and J. Juhár, Class-dependent two-dimensional linear discriminant analysis using two-pass recognition strategy, in: Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), IEEE, 2014, pp. 1796–1800.
[75] X. Wang and X. Tang, Random sampling LDA for face recognition, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, IEEE, 2004.
[76] M. Welling, Fisher Linear Discriminant Analysis, Vol. 3, Department of Computer Science, University of Toronto, 2005.
[77] M.C. Wu, L. Zhang, Z. Wang, D.C. Christiani and X. Lin, Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection, Bioinformatics 25(9) (2009), 1145–1151. doi:10.1093/bioinformatics/btp019.
[78] J. Yang, D. Zhang, A.F. Frangi and J.-y. Yang, Two-dimensional PCA: A new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1) (2004), 131–137. doi:10.1109/TPAMI.2004.1261097.
[79] L. Yang, W. Gong, X. Gu, W. Li and Y. Liang, Null space discriminant locality preserving projections for face recognition, Neurocomputing 71(16) (2008), 3644–3649. doi:10.1016/j.neucom.2008.03.009.
[80] W. Yang and H. Wu, Regularized complete linear discriminant analysis, Neurocomputing 137 (2014), 185–191. doi:10.1016/j.neucom.2013.08.048.
[81] J. Ye, R. Janardan and Q. Li, Two-dimensional linear discriminant analysis, in: Proceedings of 17th Advances in Neural Information Processing Systems (NIPS), 2004, pp. 1569–1576.
[82] J. Ye and T. Xiong, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, The Journal of Machine Learning Research 7 (2006), 1183–1204.
[83] H. Yu and J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34(10) (2001), 2067–2070. doi:10.1016/S0031-3203(00)00162-X.
[84] L. Yuan and Z.-c. Mu, Ear recognition based on 2D images, in: Proceedings of the First IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2007), IEEE, 2007, pp. 1–5.
[85] X.-S. Zhuang and D.-Q. Dai, Inverse Fisher discriminate criteria for small sample size problem and its application to face recognition, Pattern Recognition 38(11) (2005), 2192–2194. doi:10.1016/j.patcog.2005.02.011.