Linear discriminant analysis: A detailed tutorial

DOI 10.3233/AIC-170729
IOS Press
E-mail: engalaatharwat@hotmail.com
c Faculty of Computers and Informatics, Suez Canal University, Egypt
E-mail: tmgaber@gmail.com
d Faculty of Engineering, Mansoura University, Egypt
E-mail: afai79@yahoo.com
e Faculty of Computers and Information, Cairo University, Egypt
E-mail: aboitcairo@gmail.com
Abstract. Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a pre-processing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box, but (sometimes) not well understood. The aim of this paper is to build a solid intuition for what LDA is and how LDA works, thus enabling readers of all levels to gain a better understanding of the LDA technique and to know how to apply it in different applications. The paper first gives the basic definitions and steps of how the LDA technique works, supported with visual explanations of these steps. Moreover, the two methods of computing the LDA space, i.e. the class-dependent and class-independent methods, are explained in detail. Then, in a step-by-step approach, two numerical examples are demonstrated to show how the LDA space can be calculated using the class-dependent and the class-independent methods. Furthermore, two of the most common LDA problems (i.e. the Small Sample Size (SSS) and non-linearity problems) are highlighted and illustrated, and state-of-the-art solutions to these problems are investigated and explained. Finally, a number of experiments were conducted with different datasets to (1) investigate the effect of the eigenvectors used in the LDA space on the robustness of the extracted features for the classification accuracy, and (2) show when the SSS problem occurs and how it can be addressed.

Keywords: Dimensionality reduction, PCA, LDA, Kernel Functions, Class-Dependent LDA, Class-Independent LDA, SSS (Small Sample Size) problem, eigenvectors, artificial intelligence
Mixture Discriminant Analysis (MDA) [25] and Neural Networks (NN) [27], but the most famous technique of this approach is Linear Discriminant Analysis (LDA) [50]. This category of dimensionality reduction techniques is used in biometrics [12,36], bioinformatics [77], and chemistry [11].

The LDA technique is developed to transform the features into a lower dimensional space that maximizes the ratio of the between-class variance to the within-class variance, thereby guaranteeing maximum class separability [43,76]. There are two types of LDA technique with respect to the classes: class-dependent and class-independent. In the class-dependent LDA, one separate lower dimensional space is calculated for each class to project its data onto, whereas in the class-independent LDA, each class is considered as a separate class against the other classes [1,74]. In this type, there is just one lower dimensional space for all classes to project their data onto.

Although the LDA technique is considered one of the most widely used data reduction techniques, it suffers from a number of problems. In the first problem, LDA fails to find the lower dimensional space if the dimensions are much higher than the number of samples in the data matrix; thus, the within-class matrix becomes singular, which is known as the small sample size (SSS) problem. Different approaches have been proposed to solve this problem. The first approach is to remove the null space of the within-class matrix, as reported in [56,79].

In [57], the authors presented different applications that used the LDA-SSS techniques, such as face recognition and cancer classification. Furthermore, they conducted different experiments using three well-known face recognition datasets to compare different variants of the LDA technique. Nonetheless, in [57], there is no detailed explanation of how (with numerical examples) to calculate the within-class and between-class variances to construct the LDA space. In addition, the steps of constructing the LDA space are not supported with well-explained graphs that help in understanding the underlying mechanism of LDA, and the non-linearity problem was not highlighted.

This paper gives a detailed tutorial about the LDA technique. Section 2 gives an overview of the definition of the main idea of LDA and its background. This section begins by explaining how to calculate, with visual explanations, the between-class variance and the within-class variance, and how to construct the LDA space. The algorithms for calculating the LDA space and projecting the data onto this space to reduce its dimension are then introduced. Section 3 illustrates numerical examples to show how to calculate the LDA space and how to select the most robust eigenvectors to build it. Section 4 explains the two most common problems of the LDA technique and a number of state-of-the-art methods to solve (or approximately solve) these problems. Different applications that use the LDA technique are introduced in Section 5, Section 6 lists some packages that implement LDA, and Section 7 presents the experimental results.
2. Definition of LDA

The LDA technique has three main steps. The first step is to calculate the between-class variance or between-class matrix; the second step is to calculate the within-class variance or within-class matrix; and the third step is to construct the lower dimensional space that maximizes the between-class variance and minimizes the within-class variance. This section explains these three steps in detail and then gives the full description of the LDA algorithm. Figures 1 and 2 are used to visualize the steps of the LDA technique.

Fig. 1. Visualized steps to calculate a lower dimensional subspace of the LDA technique.

Fig. 2. Projection of the original samples (i.e. data matrix) on the lower dimensional space of LDA (Vk).

2.2. Calculating the between-class variance (SB)

The between-class variance of the ith class (SBi) represents the distance between the mean of the ith class (μi) and the total mean (μ). The LDA technique searches for a lower-dimensional space that maximizes the between-class variance, or simply maximizes the separation distance between classes. To explain how the between-class variance or the between-class matrix (SB) can be calculated, the following assumptions are made. Given the original data matrix X = {x1, x2, ..., xN}, where xi represents the ith sample, pattern, or observation and N is the total number of samples, each sample is represented by M features (xi ∈ R^M). In other words, each sample is represented as a point in M-dimensional space. Assume the data matrix is partitioned into c = 3 classes as follows, X = [ω1, ω2, ω3], as shown in Fig. 1 (step (A)). Each class has five samples (i.e. n1 = n2 = n3 = 5), where ni represents the number of samples of the ith class. The total number of samples (N) is calculated as follows, N = Σ_{i=1}^{3} ni.

To calculate the between-class variance (SB), the separation distance between the different classes, which is denoted by (mi − m), is calculated as follows:

    (mi − m)² = (W^T μi − W^T μ)² = W^T (μi − μ)(μi − μ)^T W    (1)

where mi represents the projection of the mean of the ith class and m represents the projection of the total mean, and where μi and μ are the mean of the ith class and the total mean, which are calculated for each class and for all data in steps (B) and (C) of Fig. 1, respectively:

    μj = (1/nj) Σ_{xi ∈ ωj} xi    (2)

    μ = (1/N) Σ_{i=1}^{N} xi = Σ_{i=1}^{c} (ni/N) μi    (3)

where c represents the total number of classes (in our example c = 3).

The term (μi − μ)(μi − μ)^T in Equation (1) represents the separation distance between the mean of the ith class (μi) and the total mean (μ), or simply it represents the between-class variance of the ith class (SBi). Substituting SBi into Equation (1) yields:

    (mi − m)² = W^T SBi W    (4)
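To make these formulas concrete, the following short NumPy sketch implements Equations (2), (3), and the accumulation of the per-class between-class matrices. It is our illustrative reconstruction, not code from the paper, and the names (X, y, between_class_scatter) are assumptions:

    import numpy as np

    def between_class_scatter(X, y):
        # X: (N, M) data matrix with one sample per row; y: (N,) class labels.
        mu = X.mean(axis=0)                      # total mean, Equation (3)
        Sb = np.zeros((X.shape[1], X.shape[1]))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)               # class mean, Equation (2)
            d = (mu_c - mu).reshape(-1, 1)       # (mu_i - mu) as a column vector
            Sb += Xc.shape[0] * (d @ d.T)        # n_i (mu_i - mu)(mu_i - mu)^T
        return Sb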
2.3. Calculating the within-class variance (SW)

The within-class variance of each class (SWj) is calculated as in Equation (5):

    Σ_{j=1}^{c} Σ_{xi∈ωj} ||W^T xi − mj||²
      = Σ_{j=1}^{c} Σ_{xi∈ωj} (W^T xij − W^T μj)²
      = Σ_{j=1}^{c} Σ_{xi∈ωj} W^T (xij − μj)(xij − μj)^T W
      = Σ_{j=1}^{c} W^T SWj W    (5)

From Equation (5), the within-class variance for each class can be calculated as follows, SWj = dj^T · dj = Σ_{i=1}^{nj} (xij − μj)(xij − μj)^T, where xij represents the ith sample in the jth class as shown in Fig. 1 (steps (E), (F)), and dj is the mean-centred data of the jth class, i.e. dj = ωj − μj = {xi}_{i=1}^{nj} − μj. Moreover, step (F) in the figure illustrates how the within-class variance of the first class (SW1) in our example is calculated. The total within-class variance is the sum of the within-class matrices of all classes (see Fig. 1 (step (F))), and it can be calculated as in Equation (6):

    SW = Σ_{i=1}^{3} SWi
       = Σ_{xi∈ω1} (xi − μ1)(xi − μ1)^T + Σ_{xi∈ω2} (xi − μ2)(xi − μ2)^T + Σ_{xi∈ω3} (xi − μ3)(xi − μ3)^T    (6)

2.4. Constructing the lower dimensional space

After calculating the between-class variance (SB) and the within-class variance (SW), the transformation matrix (W) of the LDA technique can be calculated as in Equation (7), which is called Fisher's criterion. This formula can be reformulated as in Equation (8):

    arg max_W (W^T SB W) / (W^T SW W)    (7)

    SB W = λ SW W    (8)

where λ represents the eigenvalues of the transformation matrix (W). The solution of this problem can be obtained by calculating the eigenvalues (λ = {λ1, λ2, ..., λM}) and eigenvectors (V = {v1, v2, ..., vM}) of W = SW⁻¹ SB, if SW is non-singular [36,81,83]. The eigenvalues are scalar values, while the eigenvectors are non-zero vectors, which satisfy Equation (8) and provide us with information about the LDA space. The eigenvectors represent the directions of the new space, and the corresponding eigenvalues represent the scaling factor, length, or magnitude of the eigenvectors [34,59]. Thus, each eigenvector represents one axis of the LDA space, and the associated eigenvalue represents the robustness of this eigenvector. The robustness of an eigenvector reflects its ability to discriminate between different classes, i.e. to increase the between-class variance and decrease the within-class variance of each class, and hence to meet the LDA goal. Thus, the eigenvectors with the k highest eigenvalues are used to construct the lower dimensional space (Vk), while the other eigenvectors ({v_{k+1}, v_{k+2}, ..., v_M}) are neglected, as shown in Fig. 1 (step (G)).

Figure 2 shows the lower dimensional space of the LDA technique, which is calculated as in Fig. 1 (step (G)). As shown, the dimension of the original data matrix (X ∈ R^{N×M}) is reduced by projecting it onto the lower dimensional space of LDA (Vk ∈ R^{M×k}), as denoted in Equation (9) [81]. The dimension of the data after projection is k; hence, M − k features are ignored or deleted from each sample, and each sample that was represented as a point in M-dimensional space is represented as a point in the k-dimensional space:

    Y = X Vk    (9)

Figure 3 shows a comparison between two lower-dimensional sub-spaces. In this figure, the original data, which consist of three classes as in our example, are plotted. Each class has five samples, and all samples are represented by two features only (xi ∈ R²) so that they can be visualized. Thus, each sample is represented as a point in two-dimensional space. The transformation matrix (W (2 × 2)) is calculated using the steps in Sections 2.2, 2.3, and 2.4. The eigenvalues (λ1 and λ2) and eigenvectors (i.e. sub-spaces) (V = {v1, v2}) of W are then calculated. Thus, there are two eigenvectors or sub-spaces.
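The whole class-independent pipeline of Sections 2.2–2.4 (Equations (2)–(9)) can be sketched in a few lines of NumPy. This is our illustrative reconstruction, not code from the paper, and it assumes SW is non-singular (i.e. no SSS problem, see Section 4):

    import numpy as np

    def lda_fit(X, y, k):
        # Returns the k most discriminative eigenvectors (the space V_k).
        mu = X.mean(axis=0)
        M = X.shape[1]
        Sb = np.zeros((M, M))
        Sw = np.zeros((M, M))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            d = (mu_c - mu).reshape(-1, 1)
            Sb += Xc.shape[0] * (d @ d.T)        # between-class matrix
            D = Xc - mu_c                        # mean-centred class data d_j
            Sw += D.T @ D                        # within-class matrix, Equation (6)
        evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)   # W = Sw^{-1} Sb
        order = np.argsort(evals.real)[::-1]     # sort eigenvalues, descending
        return evecs[:, order[:k]].real          # V_k

    # Projection as in Equation (9): Y = X V_k
    # Y = X @ lda_fit(X, y, k)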
A comparison between the two lower-dimensional sub-spaces shows the following:

• First, the separation distance between different classes when the data are projected on the first eigenvector (v1) is much greater than when the data are projected on the second eigenvector (v2). As shown in the figure, the three classes are efficiently discriminated when they are projected on v1. Moreover, the distance between the projected means when the original data are projected on v1 is much greater than when the data are projected on v2, which reflects that the first eigenvector discriminates the three classes better than the second one.
• Second, the within-class variance when the data are projected on v1 is much smaller than when the data are projected on v2.

Fig. 3. A visualized comparison between the two lower-dimensional sub-spaces which are calculated using three different classes.

The remaining steps of the two methods are the same. Table 1 shows the notations which are used in the two algorithms.

2.7. Computational complexity of LDA

In this section, the computational complexity of LDA is analyzed. The computational complexity of the first four steps, which are common in both the class-dependent and class-independent methods, is computed as follows. As illustrated in Algorithm 1, in step (2), calculating the mean of each class requires (NM + cM) operations. In step (3), there are NM additions and M divisions, i.e., there are (NM + M) operations. The computational complexity of the fourth step is c(M + M² + M²), where M is for (μi − μ), M² is for (μi − μ)(μi − μ)^T, and the last M² is for the multiplication between ni and the matrix (μi − μ)(μi − μ)^T.
Algorithm 1 Linear Discriminant Analysis (LDA): Class-Independent
1: Given a set of N samples [xi]_{i=1}^{N}, each of which is represented as a row of length M as in Fig. 1 (step (A)), the data matrix X (N × M) is given by:

    X = [ x(1,1) x(1,2) ... x(1,M)
          x(2,1) x(2,2) ... x(2,M)
          ...
          x(N,1) x(N,2) ... x(N,M) ]    (10)

2: Compute the mean of each class μi (1 × M) as in Equation (2).
3: Compute the total mean of all data μ (1 × M) as in Equation (3).
4: Calculate the between-class matrix SB (M × M) as follows:

    SB = Σ_{i=1}^{c} ni (μi − μ)^T (μi − μ)    (11)

5: Compute the within-class matrix SW (M × M) as follows:

    SW = Σ_{j=1}^{c} Σ_{i=1}^{nj} (xij − μj)^T (xij − μj)    (12)

   where xij represents the ith sample in the jth class.
6: From Equations (11) and (12), the matrix W that maximizes Fisher's formula, which is defined in Equation (7), is calculated as follows, W = SW⁻¹ SB. The eigenvalues (λ) and eigenvectors (V) of W are then calculated.
7: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used as the lower dimensional space (Vk).
8: Project all original samples (X) onto the lower dimensional space of LDA as in Equation (9).

Algorithm 2 Linear Discriminant Analysis (LDA): Class-Dependent
1: Given a set of N samples [xi]_{i=1}^{N}, each of which is represented as a row of length M as in Fig. 1 (step (A)), the data matrix X (N × M) is given by:

    X = [ x(1,1) x(1,2) ... x(1,M)
          x(2,1) x(2,2) ... x(2,M)
          ...
          x(N,1) x(N,2) ... x(N,M) ]    (13)

2: Compute the mean of each class μi (1 × M) as in Equation (2).
3: Compute the total mean of all data μ (1 × M) as in Equation (3).
4: Calculate the between-class matrix SB (M × M) as in Equation (11).
5: for all Class i, i = 1, 2, ..., c do
6:   Compute the within-class matrix of each class SWi (M × M) as follows:

        SWj = Σ_{xi∈ωj} (xi − μj)^T (xi − μj)    (14)

7:   Construct a transformation matrix for each class (Wi) as follows:

        Wi = SWi⁻¹ SB    (15)

8:   The eigenvalues (λ^i) and eigenvectors (V^i) of each transformation matrix (Wi) are then calculated, where λ^i and V^i represent the calculated eigenvalues and eigenvectors of the ith class, respectively.
9:   Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used to construct a lower dimensional space for each class (Vk^i).
10:  Project the samples of each class (ωj) onto their lower dimensional space (Vk^j), as follows:

        Ωj = xi Vk^j,  xi ∈ ωj    (16)

     where Ωj represents the projected samples of the class ωj.
11: end for

In the class-dependent method, steps (6–8) compute, for each class, a separate within-class matrix, a transformation matrix, and its eigenvalues and eigenvectors. These steps are repeated for each class, which increases the complexity of the class-dependent algorithm. In total, the computational complexity of the class-dependent algorithm is O(NM²) if N > M; otherwise, the complexity is O(cM³). Hence, the class-dependent method needs more computations than the class-independent method.
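As a sketch of the difference between the two algorithms, the class-dependent variant (Algorithm 2) can be written as follows. This is again our illustrative NumPy reconstruction under the same assumptions as the earlier sketch, and it returns one projection per class:

    import numpy as np

    def lda_class_dependent(X, y, k):
        mu = X.mean(axis=0)
        M = X.shape[1]
        Sb = np.zeros((M, M))
        classes = np.unique(y)
        for c in classes:
            Xc = X[y == c]
            d = (Xc.mean(axis=0) - mu).reshape(-1, 1)
            Sb += Xc.shape[0] * (d @ d.T)            # Equation (11)
        projections = {}
        for c in classes:
            Xc = X[y == c]
            D = Xc - Xc.mean(axis=0)
            Swi = D.T @ D                            # Equation (14)
            evals, evecs = np.linalg.eig(np.linalg.inv(Swi) @ Sb)  # W_i, Equation (15)
            order = np.argsort(evals.real)[::-1]
            Vki = evecs[:, order[:k]].real           # class-specific space V_k^i
            projections[c] = Xc @ Vki                # Equation (16)
        return projections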
Table 1
Notation

Notation   Description
X          Data matrix
N          Total number of samples in X
W          Transformation matrix
ni         Number of samples in ωi
μi         The mean of the ith class
μ          Total or global mean of all samples
SWi        Within-class variance or scatter matrix of the ith class (ωi)
SBi        Between-class variance of the ith class (ωi)
V          Eigenvectors of W
Vi         ith eigenvector
xij        The ith sample in the jth class
k          The dimension of the lower dimensional space (Vk)
xi         ith sample
M          Dimension of X or the number of features of X
Vk         The lower dimensional space
c          Total number of classes
mi         The mean of the ith class after projection
m          The total mean of all classes after projection
SW         Within-class variance
SB         Between-class variance
λ          Eigenvalue matrix
λi         ith eigenvalue
Y          Projection of the original data

3. Numerical examples

In this section, two numerical examples are presented. They explain the steps needed to calculate the LDA space and show how the LDA technique is used to discriminate between only two different classes. In the first example, the lower-dimensional space is calculated using the class-independent method, while in the second example, the class-dependent method is used. Moreover, a comparison between the lower dimensional spaces of the two methods is presented. In all numerical examples, the numbers are rounded to the nearest hundredths (i.e. only two digits after the decimal point are displayed).

The first four steps of both the class-independent and class-dependent methods are common, as illustrated in Algorithms 1 and 2. Thus, in this section, we show how these steps are calculated.

Given two different classes, ω1 (5 × 2) and ω2 (6 × 2), which have n1 = 5 and n2 = 6 samples, respectively, each sample in both classes is represented by two features (i.e. M = 2) as follows:

    ω1 = [ 1.00 2.00
           2.00 3.00
           3.00 3.00
           4.00 5.00
           5.00 5.00 ]
    and
    ω2 = [ 4.00 2.00
           5.00 0.00
           5.00 2.00
           3.00 2.00
           5.00 3.00
           6.00 3.00 ]    (17)

To calculate the lower dimensional space using LDA, first the mean of each class μj is calculated. The mean of each class and the total mean are shown below:

    μ1 = [3.00 3.60],  μ2 = [4.67 2.00],  μ = [3.91 2.73]    (18)

The between-class variance of the first class (SB1) is then calculated as follows:

    SB1 = n1 (μ1 − μ)^T (μ1 − μ)
        = 5 [−0.91 0.87]^T [−0.91 0.87]
        = [ 4.13 −3.97 ; −3.97 3.81 ]    (19)

Similarly, SB2 is calculated as follows:

    SB2 = [ 3.44 −3.31 ; −3.31 3.17 ]    (20)
The total between-class variance is calculated as follows:

    SB = SB1 + SB2 = [ 7.58 −7.27 ; −7.27 6.98 ]    (21)

To calculate the within-class matrix, first the mean of each class is subtracted from each sample in that class. This step is called mean-centering the data, and it is calculated as follows, di = ωi − μi, where di represents the mean-centred data of the class ωi. The values of d1 and d2 are as follows:

    d1 = [ −2.00 −1.60
           −1.00 −0.60
            0.00 −0.60
            1.00  1.40
            2.00  1.40 ]
    and
    d2 = [ −0.67  0.00
            0.33 −2.00
            0.33  0.00
           −1.67  0.00
            0.33  1.00
            1.33  1.00 ]    (22)

In the next two subsections, two different methods are used to calculate the LDA space.

3.1. Class-independent method

In this section, the LDA space is calculated using the class-independent method. This method represents the standard method of LDA as in Algorithm 1.

After centering the data, the within-class variance for each class (SWi (2 × 2)) is calculated as follows, SWj = dj^T · dj = Σ_{i=1}^{nj} (xij − μj)^T (xij − μj), where xij represents the ith sample in the jth class. The total within-class matrix (SW (2 × 2)) is then calculated as follows, SW = Σ_{i=1}^{c} SWi. The values of the within-class matrix of each class and the total within-class matrix are as follows:

    SW1 = [ 10.00 8.00 ; 8.00 7.20 ],
    SW2 = [ 5.33 1.00 ; 1.00 6.00 ],    (23)
    SW  = [ 15.33 9.00 ; 9.00 13.20 ]

The transformation matrix (W) in the class-independent method can be obtained as follows, W = SW⁻¹ SB, and the values of SW⁻¹ and W are as follows:

    SW⁻¹ = [ 0.11 −0.07 ; −0.07 0.13 ]
    and                                        (24)
    W = [ 1.36 −1.31 ; −1.48 1.42 ]

The eigenvalues (λ (2 × 2)) and eigenvectors (V (2 × 2)) of W are then calculated as follows:

    λ = [ 0.00 0.00 ; 0.00 2.78 ]
    and                                        (25)
    V = [ −0.69 0.68 ; −0.72 −0.74 ]

From the above results it can be noticed that the second eigenvector (V2) has a corresponding eigenvalue larger than that of the first eigenvector (V1), which reflects that the second eigenvector is more robust than the first one; hence, it is selected to construct the lower dimensional space. The original data are projected on the lower dimensional space as follows, yi = ωi V2, where yi (ni × 1) represents the data of the ith class after projection, and its values are as follows:

    y1 = ω1 V2 = [−0.79 −0.85 −0.18 −0.97 −0.29]^T    (26)

Similarly, y2 is calculated as follows:

    y2 = ω2 V2 = [1.24 3.39 1.92 0.56 1.18 1.86]^T    (27)
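The numbers of this example can be reproduced with a few lines of NumPy. This is an illustrative check of ours, not part of the original paper; note that the signs of the computed eigenvectors, and hence of the projections, may be flipped relative to the printed values:

    import numpy as np

    w1 = np.array([[1., 2.], [2., 3.], [3., 3.], [4., 5.], [5., 5.]])
    w2 = np.array([[4., 2.], [5., 0.], [5., 2.], [3., 2.], [5., 3.], [6., 3.]])
    mu1, mu2 = w1.mean(0), w2.mean(0)
    mu = np.vstack([w1, w2]).mean(0)

    Sb = 5 * np.outer(mu1 - mu, mu1 - mu) + 6 * np.outer(mu2 - mu, mu2 - mu)
    Sw = (w1 - mu1).T @ (w1 - mu1) + (w2 - mu2).T @ (w2 - mu2)
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    v2 = evecs[:, np.argmax(evals.real)]   # eigenvector of the non-zero eigenvalue
    print(np.round(Sb, 2))                 # ~ [[7.58, -7.27], [-7.27, 6.98]], Eq. (21)
    print(np.round(evals.real, 2))         # ~ [0.00, 2.78], Eq. (25)
    print(np.round(w1 @ v2, 2))            # ~ y1 of Eq. (26), up to rounding and sign
    print(np.round(w2 @ v2, 2))            # ~ y2 of Eq. (27), up to rounding and sign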
Fig. 4. Probability density function of the projected data of the first example: (a) the projected data on V1; (b) the projected data on V2.

Figure 4 illustrates a probability density function (pdf) graph of the projected data (yi) on the two eigenvectors (V1 and V2). A comparison of the two eigenvectors reveals the following:

• The data of each class are discriminated much better when they are projected on the second eigenvector (see Fig. 4(b)) than when they are projected on the first one (see Fig. 4(a)). In other words, the second eigenvector maximizes the between-class variance more than the first one.
• The within-class variance (i.e. the variance between the samples of the same class) of the two classes is minimized when the data are projected on the second eigenvector. As shown in Fig. 4(b), the within-class variance of the first class is small compared with Fig. 4(a).

3.2. Class-dependent method

In the class-dependent method, a separate transformation matrix Wi = SWi⁻¹ SB is calculated for each class as in Equation (15). The value of W1 is calculated as follows:

    W1 = SW1⁻¹ SB
       = [ 0.90 −1.00 ; −1.00 1.25 ] [ 7.58 −7.27 ; −7.27 6.98 ]
       = [ 14.09 −13.53 ; −16.67 16.00 ]    (28)

Similarly, W2 is calculated as follows:

    W2 = [ 1.70 −1.63 ; −1.50 1.44 ]    (29)

The eigenvalues and eigenvectors of W1 and W2 are then calculated as follows:

    λω1 = [ 0.00 0.00 ; 0.00 30.09 ]  and  Vω1 = [ −0.69 0.65 ; −0.72 −0.76 ]    (30)

    λω2 = [ 3.14 0.00 ; 0.00 0.00 ]  and  Vω2 = [ 0.75 0.69 ; −0.66 0.72 ]    (31)

where λωi and Vωi represent the eigenvalues and eigenvectors of the ith class, respectively.

From the results shown above it can be seen that the second eigenvector of the first class (Vω1^(2)) has a corresponding eigenvalue larger than that of the first one; thus, the second eigenvector is used as a lower dimensional space for the first class as follows, y1 = ω1 · Vω1^(2), where y1 represents the projection of the samples of the first class.
Similarly, for the second class, the first eigenvector (Vω2^(1)) has a corresponding eigenvalue larger than the second one; thus, Vω2^(1) is used to project the data of the second class as follows, y2 = ω2 · Vω2^(1), where y2 represents the projection of the samples of the second class. The values of y1 and y2 are as follows:

    y1 = [−0.88 −1.00 −0.35 −1.24 −0.59]^T  and
    y2 = [1.68 3.76 2.43 0.93 1.77 2.53]^T    (32)

Figure 5 shows a pdf graph of the projected data (i.e. y1 and y2) on the two eigenvectors (Vω1^(2) and Vω2^(1)), and it reveals the following findings:

• First, the projected data of the two classes are efficiently discriminated.
• Second, as shown in Fig. 6, there are two eigenvectors, Vω1^(2) (red line) and Vω2^(1) (blue line), which represent the first and second classes, respectively. The differences between the two eigenvectors are as follows:
  ∗ Projecting the original data on the two eigenvectors discriminates between the two classes. As shown in the figure, the distance between the projected means m1 − m2 is larger than the distance between the original means μ1 − μ2.
  ∗ The within-class variance of each class is decreased. For example, the within-class variance of the first class (SW1) is decreased when it is projected on its corresponding eigenvector.
  ∗ As a result of the above two findings, Vω1^(2) and Vω2^(1) are used to construct the LDA space.

Fig. 6. Illustration of the example of the two different LDA methods. The blue and red lines represent the first and second eigenvectors of the class-dependent approach, respectively, while the solid and dotted black lines represent the second and first eigenvectors of the class-independent approach, respectively.
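Continuing the numeric sketch above, the class-dependent quantities of Equations (28)–(32) follow the same pattern (again an illustrative check of ours; eigenvector signs may be flipped):

    # Class-dependent transformation matrices, Equations (28) and (29).
    W1 = np.linalg.inv((w1 - mu1).T @ (w1 - mu1)) @ Sb   # ~ [[14.09, -13.53], [-16.67, 16.00]]
    W2 = np.linalg.inv((w2 - mu2).T @ (w2 - mu2)) @ Sb   # ~ [[1.70, -1.63], [-1.50, 1.44]]
    l1, V1 = np.linalg.eig(W1)         # non-zero eigenvalue ~ 30.09, Equation (30)
    l2, V2 = np.linalg.eig(W2)         # non-zero eigenvalue ~ 3.14,  Equation (31)
    y1 = w1 @ V1[:, np.argmax(l1.real)]   # ~ y1 of Equation (32), up to sign
    y2 = w2 @ V2[:, np.argmax(l2.real)]   # ~ y2 of Equation (32), up to sign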
Although the LDA is one of the most commonly used data reduction techniques, it suffers from two main problems: the Small Sample Size (SSS) and linearity problems. In the next two subsections, these two problems are explained, and some of the state-of-the-art solutions are highlighted.

4.1. Linearity problem

The LDA technique is used to find a linear transformation that discriminates between different classes. However, if the classes are non-linearly separable, LDA cannot find a lower dimensional space. In other words, LDA fails to find the LDA space when the discriminatory information is not in the means of the classes. Figure 7 shows how the discriminatory information does not exist in the mean, but in the variance of the data; this is because the means of the two classes are equal, and hence the LDA space cannot be calculated.

One of the solutions to this problem is based on the transformation concept, which is known as kernel methods or functions [3,50]. Figure 7 illustrates how the transformation is used to map the original data into a higher dimensional space; hence, the data become linearly separable, and the LDA technique can find the lower dimensional space in the new space. Figure 8 graphically and mathematically shows how two non-separable classes in one-dimensional space are transformed into a two-dimensional space (i.e. a higher dimensional space), thus allowing linear separation.

Fig. 8. Example of kernel functions. The samples in the top panel (X), which are represented by a line (i.e. a one-dimensional space), are non-linearly separable, while the samples in the bottom panel (Z), which are generated by mapping the samples of the top panel, are linearly separable. The top panel shows how the two classes are non-separable, while the bottom shows how the transformation solves this problem and the two classes become linearly separable.

The kernel idea is applied in Support Vector Machines (SVM) [49,66,69], Support Vector Regression (SVR) [58], PCA [51], and LDA [50]. Let φ represent a nonlinear mapping to the new feature space Z, in which the total mean is μ^φ = (1/N) Σ_{i=1}^{N} φ(xi) = Σ_{i=1}^{c} (ni/N) μi^φ. The between-class and within-class matrices in the new feature space Z are then calculated as follows:

    SB^φ = Σ_{i=1}^{c} ni (μi^φ − μ^φ)(μi^φ − μ^φ)^T    (34)

    SW^φ = Σ_{j=1}^{c} Σ_{i=1}^{nj} (φ(xij) − μj^φ)(φ(xij) − μj^φ)^T    (35)
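A toy sketch of this transformation idea (ours, not from the paper) uses an explicit map φ(x) = (x, x²) instead of an implicit kernel; two classes with equal means in one dimension become linearly separable in Z:

    import numpy as np

    # Two classes with (approximately) equal means: the discriminatory
    # information is in the variance, so LDA fails in the 1-D input space.
    x1 = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])        # class 1, near the origin
    x2 = np.array([-4.5, -4.0, -3.5, 3.5, 4.0, 4.5])  # class 2, far from it

    phi = lambda x: np.c_[x, x ** 2]   # explicit mapping phi: X -> Z = (x, x^2)
    z1, z2 = phi(x1), phi(x2)
    # In Z the class means differ in the x^2 coordinate, so the classes become
    # linearly separable and the LDA steps of Section 2 can be applied to z1, z2.
    print(z1.mean(axis=0), z2.mean(axis=0))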
4.2. Small Sample Size (SSS) problem

4.2.1. Problem definition
Singularity, Small Sample Size (SSS), or the under-sampled problem is one of the big problems of the LDA technique. It results from high-dimensional pattern classification tasks or a low number of training samples available for each class compared with the dimensionality of the sample space [30,38,82,85].

The SSS problem occurs when SW is singular.³ The upper bound of the rank⁴ of SW is N − c, while the dimension of SW is M × M [17,38]. Thus, in most cases M ≫ N − c, which leads to the SSS problem. For example, in face recognition applications, the size of a face image may reach 100 × 100 = 10,000 pixels, which represents high-dimensional features and leads to a singularity problem.

4.2.2. Common solutions to the SSS problem
Many studies have proposed solutions for this problem; each has its advantages and drawbacks.
• Regularization (RLDA): In the regularization method, the identity matrix is scaled by multiplying it by a regularization parameter (η > 0) and adding it to the within-class matrix to make it non-singular [18,38,45,82]. Thus, the diagonal components of the within-class matrix are biased as follows, SW = SW + ηI. However, choosing the value of the regularization parameter requires tuning, and a poor choice for this parameter can degrade the performance of the method [38,45]. Another problem of this method is that the parameter η is just added to make the inversion of SW possible and has no clear mathematical interpretation [38,57].
• Sub-space: In this method, a non-singular intermediate space is obtained to reduce the dimension of the original data so that the within-class matrix becomes full-rank⁵ and can be inverted. However, as reported in [22], losing some discriminant information is a common drawback associated with the use of this method.
• Null Space: Many studies have proposed to remove the null space of SW to make SW full-rank and hence invertible. The drawback of this method is that more discriminant information is lost when the null space of SW is removed, which has a negative impact on how the lower dimensional space satisfies the LDA goal [83].

³A matrix is singular if it is square and does not have a matrix inverse, i.e. its determinant is zero; hence, not all of its columns and rows are independent.
⁴The rank of a matrix represents the number of linearly independent rows or columns.
⁵A is a full-rank matrix if all columns and rows of the matrix are independent (i.e. rank(A) = #rows = #cols) [23].

Four different variants of the LDA technique that are used to solve the SSS problem are introduced as follows.

PCA+LDA technique. In this technique, the original d-dimensional features are first reduced to an h-dimensional feature space using PCA, and then LDA is used to further reduce the features to k dimensions. PCA is used in this technique to reduce the dimensions so that the rank of SW becomes N − c, as reported in [4]; hence, the SSS problem is addressed. However, PCA neglects some discriminant information, which may reduce the classification performance [57,60].

Direct LDA technique. Direct LDA (DLDA) is one of the well-known techniques used to solve the SSS problem. This technique has two main steps [83]. In the first step, the transformation matrix, W, is computed to transform the training data to the range space of SB. In the second step, the dimensionality of the transformed data is further reduced using some regulating matrices, as in Algorithm 4. The benefit of DLDA is that no discriminative features are neglected, as happens in the PCA+LDA technique [83].

Regularized LDA technique. In Regularized LDA (RLDA), a small perturbation is added to the SW matrix to make it non-singular, as mentioned in [18]. This regularization can be applied as follows:

    (SW + ηI)⁻¹ SB wi = λi wi    (36)

where η is a regularization parameter that needs to be tuned; a poor choice of it can degrade the generalization performance [57].

Null LDA technique. The aim of the NLDA technique is to find the orientation matrix W, and this can be achieved in two steps. In the first step, the range space of SW is neglected, and the data are projected only on the null space of SW as follows, SW W = 0. In the second step, the aim is to search for W that satisfies SB W ≠ 0 and maximizes |W^T SB W|. The high dimensionality of the feature space may lead to computational problems, which can be solved by (1) using the PCA technique as a pre-processing step, i.e. before applying the NLDA technique, to reduce the dimension of the feature space to N − 1 by removing the null space of ST = SB + SW [57], or (2) using the PCA technique before the second step of the NLDA technique [54]. Mathematically, in the Null LDA (NLDA) technique, the h column vectors of the transformation matrix W = [w1, w2, ..., wh] are taken to be the null space of SW as follows, wi^T SW wi = 0, ∀i = 1, ..., h, where wi^T SB wi ≠ 0. Hence, M − (N − c) linearly independent vectors are used to form a new orientation matrix, which is used to maximize |W^T SB W| subject to the constraint |W^T SW W| = 0, as in Equation (37):

    W = arg max_{|W^T SW W| = 0} |W^T SB W|    (37)
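As an illustrative sketch of the RLDA idea (ours, under the assumption that SW and SB have already been computed as in Section 2), the only change to the class-independent code is biasing the diagonal of SW before inversion:

    import numpy as np

    def rlda_space(Sw, Sb, eta, k):
        # eta > 0 is the regularization parameter; it must be tuned, and a
        # poor choice can degrade generalization (Section 4.2.2).
        Sw_reg = Sw + eta * np.eye(Sw.shape[0])   # S_W + eta*I is non-singular
        evals, evecs = np.linalg.eig(np.linalg.inv(Sw_reg) @ Sb)
        order = np.argsort(evals.real)[::-1]
        return evecs[:, order[:k]].real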
5. Applications of the LDA technique

In many applications, due to the high number of features or dimensionality, the LDA technique has been used. Some of the applications of the LDA technique and its variants are described as follows.

5.1. Biometrics applications

Biometric systems have two main steps, namely, feature extraction (including pre-processing steps) and recognition. In the first step, the features are extracted, and in the second step, the extracted features are used to identify individuals. For example, in [10,20,41,68,75,83], the LDA technique has been applied to face recognition. Moreover, the LDA technique was used in ear [84], fingerprint [44], gait [5], and speech [24] applications. In addition, the LDA technique was used with animal biometrics as in [19,65].

5.2. Agriculture applications

5.3. Medical applications

In medical applications, the data are often represented by a large number of features or dimensions. Due to this high dimensionality, the computational models need more time to train, which may be infeasible and expensive. Moreover, this high dimensionality reduces the classification performance of the computational model and increases its complexity. This problem can be solved using the LDA technique to construct a new set of features from the large number of original features. Many papers have used LDA in medical applications [8,16,39,52–55].

6. Packages

In this section, some of the available packages that are used to compute the space of LDA variants are introduced. For example, WEKA⁶ is a well-known Java-based data mining tool with open source machine learning software for classification, association rules, regression, pre-processing, clustering, and visualization. In WEKA, the machine learning algorithms can be applied directly on a dataset or called from the user's own Java code. XLSTAT⁷ is another data analysis and statistical package for Microsoft Excel that has a wide variety of dimensionality reduction algorithms including LDA. The dChip⁸ package is also used for visualization of gene expression and SNP microarrays, and it includes some data analysis algorithms such as LDA, clustering, and discriminant function analysis. The Dimensionality Reduction¹¹ package is mainly written in Matlab, and it has a number of dimensionality reduction techniques such as ULDA, QLDA, and KDA. DTREG¹² is a software package that is used for modeling medical and business data, and it has several predictive modeling methods such as LDA, PCA, linear regression, and decision trees.
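Beyond the tools above, Python's scikit-learn library also implements LDA. A minimal usage sketch (ours; X and y are an assumed data matrix and label vector, not objects defined in the paper) computes and applies the LDA space in two lines:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis(n_components=2)  # at most c - 1 components
    Y = lda.fit_transform(X, y)                       # project X onto the LDA space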
7. Experimental results and discussion

In this section, a number of experiments were conducted to show (1) how the LDA technique can be used in different applications, (2) what the relation is between its parameter (the eigenvectors) and the accuracy of a classification problem, and (3) when the SSS problem occurs and a method for solving it.

7.1. Experimental setup

This section gives an overview of the databases, the platform, and the machine specification used to conduct our experiments. Different biometric datasets were used in the experiments to show how the LDA technique, through its parameter, behaves with different data. These datasets are described as follows:

• ORL dataset:¹³ a face images dataset (Olivetti Research Laboratory, Cambridge) [48], which consists of 40 distinct individuals, was used. In this dataset, each individual has ten images taken at different times and under varying light conditions. The size of each image is 92 × 112.
• Yale dataset:¹⁴ another face images dataset, which contains 165 grey-scale images in GIF format of 15 individuals [78]. Each individual has 11 images in different expressions and configurations: center-light, happy, left-light, with glasses, normal, right-light, sad, sleepy, surprised, and a wink.
• Ear dataset:¹⁵ an ear images dataset, which consists of 17 distinct individuals. Six views of the left profile from each subject were taken under a uniform, diffuse lighting.

To evaluate the classification accuracy, the Nearest Neighbour classifier was used. This classifier aims to classify the testing image by comparing its position in the LDA space with the positions of the training images. Furthermore, class-independent LDA was used in all experiments. Moreover, the Matlab platform (R2013b) and a PC with the following specifications: Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz and 4.00 GB RAM, under a Windows 32-bit operating system, were used in our experiments.

Table 2
Dataset description

Dataset      Dimension (M)   No. of Samples (N)   No. of classes (c)
ORL64×64     4096            400                  40
ORL32×32     1024            400                  40
Ear64×64     4096            102                  17
Ear32×32     1024            102                  17
Yale64×64    4096            165                  15
Yale32×32    1024            165                  15

7.2. Experiment on LDA parameter (eigenvectors)

The aim of this experiment is to investigate the relation between the number of eigenvectors used in the LDA space and the classification accuracy. The LDA space is constructed from the first k eigenvectors, which are sorted according to their robustness (i.e. their eigenvalues); the robustness of each eigenvector is reflected by its corresponding eigenvalue.
Fig. 9. Samples of the first individual in: ORL face dataset (top row); Ear dataset (middle row), and Yale face dataset (bottom row).
The results of this experiment show that when the number of eigenvectors used in computing the LDA space was increased, the classification accuracy also increased up to a specific extent, after which the accuracy remains constant. As seen in Fig. 10(a), this extent differs from one application to another. For example, the accuracy of the ear dataset remains constant when the percentage of the used eigenvectors is more than 10%. This was expected, as the eigenvectors of the LDA space are sorted according to their robustness (see Section 2.6). Similarly, in the ORL and Yale datasets, the accuracy remains approximately constant after a certain percentage of the eigenvectors is used.
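A sketch of this experiment in NumPy (ours; the variable names X_train, y_train, X_test, y_test, and V are assumptions) projects the data onto the first k sorted eigenvectors and scores a 1-nearest-neighbour classifier for each k:

    import numpy as np

    def nn_accuracy(train_Y, train_labels, test_Y, test_labels):
        # 1-NN in the projected space with Euclidean distances;
        # train_labels and test_labels are NumPy arrays.
        dists = np.linalg.norm(test_Y[:, None, :] - train_Y[None, :, :], axis=2)
        pred = train_labels[np.argmin(dists, axis=1)]
        return float((pred == test_labels).mean())

    # V is assumed to hold the eigenvectors sorted by eigenvalue (Section 2.6).
    # for k in range(1, V.shape[1] + 1):
    #     Vk = V[:, :k]
    #     acc = nn_accuracy(X_train @ Vk, y_train, X_test @ Vk, y_test)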
Fig. 11. The robustness of the first 40 eigenvectors of the LDA technique using the ORL32×32, Ear32×32, and Yale32×32 datasets.

7.3. Experiments on the small sample size problem

The aim of this experiment is to show when the LDA is subject to the SSS problem and which methods can be used to solve it. In this experiment, the PCA-LDA [4] and Direct-LDA [22,83] methods are used to address the SSS problem. A set of experiments was conducted to show how the two methods interact with the SSS problem, including the number of samples in each class and the total number of classes used. High-dimensional samples lead to the SSS problem when the dimension of the samples is much higher than the total number of samples. As shown in Table 2, the size of each image of the ORL64×64 dataset is 64 × 64 = 4096 pixels, while the total number of samples is 400. The mathematical interpretation of this point shows that the dimension of SW is 4096 × 4096, while its rank is at most N − c; hence, SW is singular.

Algorithm 3 PCA-LDA
1: Read the training images (X = {x1, x2, ..., xN}), where xi (ro × co) represents the ith training image, ro and co represent the rows (height) and columns (width) of xi, respectively, and N represents the total number of training images.
2: Convert all images into vector representation Z = {z1, z2, ..., zN}, where the dimension of each zi is M × 1, M = ro × co.
3: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB (M × M), and the within-class matrix SW (M × M) as in Algorithm 1 (steps (2–5)).
4: Use the PCA technique to reduce the dimension of X to be equal to or lower than r, where r represents the rank of SW, as follows:

    XPCA = U^T X    (38)

   where U ∈ R^{M×r} is the lower dimensional space of PCA and XPCA represents the projected data on the PCA space.
5: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB, and the within-class matrix SW of XPCA as in Algorithm 1 (steps (2–5)).
6: Calculate W and the eigenvalues and eigenvectors of the W matrix as in Algorithm 1.
7: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used as a lower dimensional space (Vk).
8: The original samples (X) are first projected on the PCA space and then projected on the LDA space as in Equation (9).
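A compact NumPy sketch of Algorithm 3 (ours, reusing the lda_fit sketch from Section 2, and assuming h is at most the rank of SW so that SW becomes non-singular in the PCA space) is:

    import numpy as np

    def pca_lda(X, y, h, k):
        mu = X.mean(axis=0)
        Xc = X - mu
        # PCA basis U: the top-h principal directions via the SVD of the
        # mean-centred data (the columns of Vh.T are the directions).
        U = np.linalg.svd(Xc, full_matrices=False)[2].T[:, :h]   # (M, h)
        X_pca = Xc @ U                   # projection onto the PCA space, Eq. (38)
        Vk = lda_fit(X_pca, y, k)        # LDA in the reduced space (Algorithm 1)
        return X_pca @ Vk, U, Vk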
Table 3
Accuracy of the PCA-LDA and Direct-LDA methods using the datasets listed in Table 2

Dataset     Dim(SW)       # Training images  # Testing images  Rank(SW)  Accuracy (%)
                                                                         PCA-LDA   Direct-LDA
ORL32×32    1024 × 1024   5                  5                 160       75.5      88.5
                          7                  3                 240       75.5      97.5
                          9                  1                 340       80.39     97.5
Ear32×32    1024 × 1024   3                  3                 34        80.39     96.08
                          4                  2                 51        88.24     94.12
                          5                  1                 68        100       100
Yale32×32   1024 × 1024   6                  5                 75        78.67     90.67
                          8                  3                 105       84.44     97.79
                          10                 1                 135       100       100
ORL64×64    4096 × 4096   5                  5                 160       72        87.5
                          7                  3                 240       81.67     96.67
                          9                  1                 340       82.5      97.5
Ear64×64    4096 × 4096   3                  3                 34        74.5      96.08
                          4                  2                 51        91.18     96.08
                          5                  1                 68        100       100
Yale64×64   4096 × 4096   6                  5                 75        74.67     92
                          8                  3                 105       95.56     97.78
                          10                 1                 135       93.33     100

The bold values indicate that the corresponding methods obtain the best performances.
The Direct-LDA method achieved better results than the PCA-LDA method because, as reported in [22,83], the Direct-LDA method saves approximately all the important information for classification, while the PCA step in the PCA-LDA method saves the information with high variance.

Algorithm 4 Direct-LDA
1: Read the training images (X = {x1, x2, ..., xN}), where xi (ro × co) represents the ith training image, ro and co represent the rows (height) and columns (width) of xi, respectively, and N represents the total number of training images.

8. Conclusions

In this paper, the LDA technique was presented in detail, and two numerical examples were given and graphically illustrated to show how the LDA space using the two methods can be constructed. In all examples, the mathematical interpretation of the robustness and the selection of the eigenvectors, as well as the data projection, were detailed and discussed. Also, common LDA problems (e.g. the SSS and linearity problems) were mathematically explained using graphical examples, and their state-of-the-art solutions were highlighted. Moreover, a detailed implementation of LDA applications was presented. Using three standard datasets, a number of experiments were conducted to (1) investigate and explain the relation between the number of eigenvectors and the robustness of the LDA space, and (2) practically show when the SSS problem occurs and how it can be addressed.

References

[1] S. Balakrishnama and A. Ganapathiraju, Linear discriminant analysis - a brief tutorial, Institute for Signal and Information Processing (1998).
[2] E. Barshan, A. Ghodsi, Z. Azimifar and M.Z. Jahromi, Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition 44(7) (2011), 1357–1371. doi:10.1016/j.patcog.2010.12.015.
[3] G. Baudat and F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12(10) (2000), 2385–2404. doi:10.1162/089976600300014980.
[4] P.N. Belhumeur, J.P. Hespanha and D. Kriegman, Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997), 711–720. doi:10.1109/34.598228.
[5] N.V. Boulgouris and Z.X. Chi, Gait recognition using Radon transform and linear discriminant analysis, IEEE Transactions on Image Processing 16(3) (2007), 731–740. doi:10.1109/TIP.2007.891157.
[6] M. Bramer, Principles of Data Mining, 2nd edn, Springer, 2013.
[11] D. Coomans, D. Massart and L. Kaufman, Optimization by statistical linear discriminant analysis in analytical chemistry, Analytica Chimica Acta 112(2) (1979), 97–122. doi:10.1016/S0003-2670(01)83513-3.
[12] J. Cui and Y. Xu, Three dimensional palmprint recognition using linear discriminant analysis method, in: Proceedings of the Second International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA 2011), IEEE, 2011, pp. 107–111. doi:10.1109/IBICA.2011.31.
[13] D.-Q. Dai and P.C. Yuen, Face recognition by regularized discriminant analysis, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37(4) (2007), 1080–1085. doi:10.1109/TSMCB.2007.895363.
[14] D. Donoho and V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts?, in: Advances in Neural Information Processing Systems 16, MIT Press, 2004, pp. 1141–1148.
[15] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd edn, Wiley, 2012.
[16] S. Dudoit, J. Fridlyand and T.P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97(457) (2002), 77–87. doi:10.1198/016214502753479248.
[17] T.-t. Feng and G. Wu, A theoretical contribution to the fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices, 2014, arXiv preprint arXiv:1409.2579.
[18] J.H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84(405) (1989), 165–175. doi:10.1080/01621459.1989.10478752.
[19] T. Gaber, A. Tharwat, A.E. Hassanien and V. Snasel, Biometric cattle identification approach based on Weber's local descriptor and AdaBoost classifier, Computers and Electronics in Agriculture 122 (2016), 55–66. doi:10.1016/j.compag.2015.12.022.
[20] T. Gaber, A. Tharwat, A. Ibrahim, V. Snášel and A.E. Hassanien, Human thermal face recognition based on random linear oracle (RLO) ensembles, in: International Conference on Intelligent Networking and Collaborative Systems, IEEE, 2015, pp. 91–98. doi:10.1109/INCoS.2015.67.
[21] T. Gaber, A. Tharwat, V. Snasel and A.E. Hassanien, Plant identification: Two dimensional-based vs. one dimensional-based feature extraction methods, Springer, 2015.
[27] G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507. doi:10.1126/science.1127647.
[28] K. Honda and H. Ichihashi, Fuzzy local independent component analysis with external criteria and its application to knowledge discovery in databases, International Journal of Approximate Reasoning 42(3) (2006), 159–173. doi:10.1016/j.ijar.2005.10.011.
[29] Q. Hu, L. Zhang, D. Chen, W. Pedrycz and D. Yu, Gaussian kernel based fuzzy rough sets: Model, uncertainty measures and applications, International Journal of Approximate Reasoning 51(4) (2010), 453–471. doi:10.1016/j.ijar.2010.01.004.
[30] R. Huang, Q. Liu, H. Lu and S. Ma, Solving the small sample size problem of LDA, in: Proceedings of the 16th International Conference on Pattern Recognition, 2002, Vol. 3, IEEE, 2002, pp. 29–32.
[31] A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis, Vol. 46, Wiley, 2004.
[32] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, Wiley, 2000.
[33] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 1st edn, Wiley, 2014.
[34] T. Li, S. Zhu and M. Ogihara, Using discriminant analysis for multi-class classification: An experimental investigation, Knowledge and Information Systems 10(4) (2006), 453–472. doi:10.1007/s10115-006-0013-y.
[35] K. Liu, Y.-Q. Cheng, J.-Y. Yang and X. Liu, An efficient algorithm for Foley–Sammon optimal set of discriminant vectors by algebraic method, International Journal of Pattern Recognition and Artificial Intelligence 6(5) (1992), 817–829. doi:10.1142/S0218001492000412.
[36] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Face recognition using LDA-based algorithms, IEEE Transactions on Neural Networks 14(1) (2003), 195–200. doi:10.1109/TNN.2002.806647.
[37] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularized discriminant analysis for the small sample size problem in face recognition, Pattern Recognition Letters 24(16) (2003), 3079–3087. doi:10.1016/S0167-8655(03)00167-3.
[38] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition, Pattern Recognition Letters 26(2) (2005), 181–191. doi:10.1016/j.patrec.2004.09.014.
[39] B. Moghaddam, Y. Weiss and S. Avidan, Generalized spectral bounds for sparse LDA, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 641–648.
[40] W. Müller, T. Nocke and H. Schumann, Enhancing the visualization process with principal component analysis to support the exploration of trends, in: Proceedings of the 2006 Asia-Pacific Symposium on Information Visualisation, Vol. 60, Australian Computer Society, Inc., 2006, pp. 121–130.
[41] S. Noushath, G.H. Kumar and P. Shivakumara, (2D)²LDA: An efficient approach for face recognition, Pattern Recognition 39(7) (2006), 1396–1400. doi:10.1016/j.patcog.2006.01.018.
[42] K.K. Paliwal and A. Sharma, Improved pseudoinverse linear discriminant analysis method for dimensionality reduction, International Journal of Pattern Recognition and Artificial Intelligence 26(1) (2012), 1250002. doi:10.1142/S0218001412500024.
[43] F. Pan, G. Song, X. Gan and Q. Gu, Consistent feature selection and its application to face recognition, Journal of Intelligent Information Systems 43(2) (2014), 307–321. doi:10.1007/s10844-014-0324-5.
[44] C.H. Park and H. Park, Fingerprint classification using fast Fourier transform and nonlinear discriminant analysis, Pattern Recognition 38(4) (2005), 495–503. doi:10.1016/j.patcog.2004.08.013.
[45] C.H. Park and H. Park, A comparison of generalized linear discriminant analysis algorithms, Pattern Recognition 41(3) (2008), 1083–1097. doi:10.1016/j.patcog.2007.07.022.
[46] S. Rezzi, D.E. Axelson, K. Héberger, F. Reniero, C. Mariani and C. Guillou, Classification of olive oils using high throughput flow ¹H NMR fingerprinting with principal component analysis, linear discriminant analysis and probabilistic neural networks, Analytica Chimica Acta 552(1) (2005), 13–24. doi:10.1016/j.aca.2005.07.057.
[47] Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23(19) (2007), 2507–2517. doi:10.1093/bioinformatics/btm344.
[48] F.S. Samaria and A.C. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, IEEE, 1994, pp. 138–142.
[49] B. Schölkopf, C.J. Burges and A.J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.
[50] B. Schölkopf and K.-R. Müller, Fisher discriminant analysis with kernels, in: Proceedings of the 1999 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing IX, Madison, WI, USA, 1999, pp. 41–48.
[51] B. Schölkopf, A. Smola and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10(5) (1998), 1299–1319. doi:10.1162/089976698300017467.
[52] A. Sharma, S. Imoto and S. Miyano, A between-class overlapping filter-based method for transcriptome data analysis, Journal of Bioinformatics and Computational Biology 10(5) (2012), 1250010. doi:10.1142/S0219720012500102.
[53] A. Sharma, S. Imoto and S. Miyano, A filter based feature selection algorithm using null space of covariance matrix for DNA microarray gene expression data, Current Bioinformatics 7(3) (2012), 289–294. doi:10.2174/157489312802460802.
[54] A. Sharma, S. Imoto, S. Miyano and V. Sharma, Null space based feature selection method for gene expression data, International Journal of Machine Learning and Cybernetics 3(4) (2012), 269–276. doi:10.1007/s13042-011-0061-9.
[55] A. Sharma and K.K. Paliwal, Cancer classification by gradient LDA technique using microarray gene expression data, Data & Knowledge Engineering 66(2) (2008), 338–347. doi:10.1016/j.datak.2008.04.004.
[56] A. Sharma and K.K. Paliwal, A new perspective to null linear discriminant analysis method and its fast implementation using random matrix multiplication with scatter matrices, Pattern Recognition 45(6) (2012), 2205–2213. doi:10.1016/j.patcog.2011.11.018.
[57] A. Sharma and K.K. Paliwal, Linear discriminant analysis for the small sample size problem: An overview, International Journal of Machine Learning and Cybernetics (2014), 1–12.
[58] A.J. Smola and B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14(3) (2004), 199–222. doi:10.1023/B:STCO.0000035301.49549.88.
[59] G. Strang, Introduction to Linear Algebra, 4th edn, Wellesley-Cambridge Press, Massachusetts, 2003.
[60] D.L. Swets and J.J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8) (1996), 831–836. doi:10.1109/34.531802.
[61] M.M. Tantawi, K. Revett, A. Salem and M.F. Tolba, Fiducial feature reduction analysis for electrocardiogram (ECG) based biometric recognition, Journal of Intelligent Information Systems 40(1) (2013), 17–39. doi:10.1007/s10844-012-0214-7.
[62] A. Tharwat, Principal component analysis - a tutorial, International Journal of Applied Pattern Recognition 3(3) (2016), 197–240. doi:10.1504/IJAPR.2016.079733.
[63] A. Tharwat, T. Gaber, Y.M. Awad, N. Dey and A.E. Hassanien, Plants identification using feature fusion technique and bagging classifier, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 461–471.
[64] A. Tharwat, T. Gaber and A.E. Hassanien, One-dimensional vs. two-dimensional based features: Plant identification approach, Journal of Applied Logic (2016).
[65] A. Tharwat, T. Gaber, A.E. Hassanien, H.A. Hassanien and M.F. Tolba, Cattle identification using muzzle print images based on texture features approach, in: Proceedings of the Fifth International Conference on Innovations in Bio-Inspired Computing and Applications IBICA 2014, Springer, 2014, pp. 217–227.
[66] A. Tharwat, A.E. Hassanien and B.E. Elnaghi, A BA-based algorithm for parameter optimization of support vector machine, Pattern Recognition Letters (2016).
[67] A. Tharwat, A. Ibrahim, A.E. Hassanien and G. Schaefer, Ear recognition using block-based principal component analysis and decision fusion, in: International Conference on Pattern Recognition and Machine Intelligence, Springer, 2015, pp. 246–254. doi:10.1007/978-3-319-19941-2_24.
[68] A. Tharwat, H. Mahdi, A. El Hennawy and A.E. Hassanien, Face sketch synthesis and recognition based on linear regression transformation and multi-classifier technique, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 183–193.
[69] A. Tharwat, Y.S. Moemen and A.E. Hassanien, Classification of toxicity effects of biotransformed hepatic drugs using whale optimized support vector machines, Journal of Biomedical Informatics (2017).
[70] C.G. Thomas, R.A. Harshman and R.S. Menon, Noise reduction in BOLD-based fMRI using component analysis, Neuroimage 17(3) (2002), 1521–1537. doi:10.1006/nimg.2002.1200.
[71] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3(1) (1991), 71–86. doi:10.1162/jocn.1991.3.1.71.
[72] V. Vapnik, The Nature of Statistical Learning Theory, 2nd edn, Springer, New York, 2013.
[73] J. Venna, J. Peltonen, K. Nybo, H. Aidos and S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research 11 (2010), 451–490.
[74] P. Viszlay, M. Lojka and J. Juhár, Class-dependent two-dimensional linear discriminant analysis using two-pass recognition strategy, in: Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), IEEE, 2014, pp. 1796–1800.
[75] X. Wang and X. Tang, Random sampling LDA for face recognition, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, IEEE, 2004.
[76] M. Welling, Fisher Linear Discriminant Analysis, Vol. 3, Department of Computer Science, University of Toronto, 2005.
[77] M.C. Wu, L. Zhang, Z. Wang, D.C. Christiani and X. Lin, Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection, Bioinformatics 25(9) (2009), 1145–1151. doi:10.1093/bioinformatics/btp019.
[78] J. Yang, D. Zhang, A.F. Frangi and J.-y. Yang, Two-dimensional PCA: A new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1) (2004), 131–137. doi:10.1109/TPAMI.2004.1261097.
[79] L. Yang, W. Gong, X. Gu, W. Li and Y. Liang, Null space discriminant locality preserving projections for face recognition, Neurocomputing 71(16) (2008), 3644–3649. doi:10.1016/j.neucom.2008.03.009.
[80] W. Yang and H. Wu, Regularized complete linear discriminant analysis, Neurocomputing 137 (2014), 185–191. doi:10.1016/j.neucom.2013.08.048.
[81] J. Ye, R. Janardan and Q. Li, Two-dimensional linear discriminant analysis, in: Proceedings of 17th Advances in Neural Information Processing Systems (NIPS), 2004, pp. 1569–1576.
[82] J. Ye and T. Xiong, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, The Journal of Machine Learning Research 7 (2006), 1183–1204.
[83] H. Yu and J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34(10) (2001), 2067–2070. doi:10.1016/S0031-3203(00)00162-X.
[84] L. Yuan and Z.-c. Mu, Ear recognition based on 2D images, in: Proceedings of the First IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2007), IEEE, 2007, pp. 1–5.
[85] X.-S. Zhuang and D.-Q. Dai, Inverse Fisher discriminate criteria for small sample size problem and its application to face recognition, Pattern Recognition 38(11) (2005), 2192–2194. doi:10.1016/j.patcog.2005.02.011.