Alaa Tharwat
Alaa Tharwat, Tarek Gaber, Abdelhameed Ibrahim and Aboul Ella Hassanien, "Linear discriminant analysis: A detailed tutorial", AI Communications 30 (2017) 169-190.
Email: engalaatharwat@hotmail.com
The goal of the LDA technique is to project the original data matrix
onto a lower dimensional space. To achieve this goal, three steps
need to be performed.
1 The first step is to calculate the separability between different classes
(i.e. the distance between the means of different classes), which is
called the between-class variance or between-class matrix.
2 The second step is to calculate the distance between the mean and
the samples of each class, which is called the within-class variance or
within-class matrix.
3 The third step is to construct the lower dimensional space which
maximizes the between-class variance and minimizes the within-class
variance.
Theoretical background to LDA: Calculating the between-class variance (S_B)
The total between-class variance is calculated as follows: S_B = Σ_{i=1}^{c} n_i S_{Bi}.
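As an illustrative sketch (not part of the original slides), S_B and S_W can be computed with NumPy as follows; the synthetic data and variable names are assumptions for illustration only:

    import numpy as np

    # Illustrative data: a list of (n_i x M) arrays, one per class
    rng = np.random.default_rng(0)
    X_classes = [rng.normal(loc=m, size=(5, 2)) for m in (0.0, 3.0, 6.0)]

    mu = np.vstack(X_classes).mean(axis=0)        # total mean (length M)
    M = mu.size
    S_B = np.zeros((M, M))                        # between-class matrix (M x M)
    S_W = np.zeros((M, M))                        # within-class matrix (M x M)
    for Xi in X_classes:
        n_i, mu_i = len(Xi), Xi.mean(axis=0)      # class size and class mean
        d = (mu_i - mu).reshape(-1, 1)
        S_B += n_i * (d @ d.T)                    # S_Bi = n_i (mu_i - mu)(mu_i - mu)^T
        S_W += (Xi - mu_i).T @ (Xi - mu_i)        # within-class scatter of class i

Here S_B accumulates the per-class matrices S_Bi exactly as in the sum above, and S_W accumulates the within-class scatter of each class.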
Figure: Illustration of the between-class variance: the distances S_B1, S_B2, and S_B3 between the class means (μ1, μ2, μ3) and the total mean (μ), shown in the original space (x1, x2) together with the first and second eigenvectors (v1, v2) and the projected class means (m1, m2, m3).
Figure: Illustration of the within-class variance: the variance of each class around its own mean (S_W1, S_W2, S_W3), shown in the original space (x1, x2) together with the first and second eigenvectors (v1, v2) and the projected class means (m1, m2, m3).
The eigenvalues are scalar values, while the eigenvectors are non-zero
vectors; together they provide us with the information about the LDA space.
The eigenvectors represent the directions of the new space, and the
corresponding eigenvalues represent the scaling factor, length, or
magnitude of the eigenvectors.
Thus, each eigenvector represents one axis of the LDA space, and the
associated eigenvalue represents the robustness of this eigenvector.
The robustness of an eigenvector reflects its ability to discriminate
between different classes and to decrease the within-class variance of
each class, hence meeting the LDA goal.
Thus, the eigenvectors with the k highest eigenvalues are used to
construct the lower dimensional space (V_k), while the remaining
eigenvectors ({v_{k+1}, v_{k+2}, ..., v_M}) are neglected.
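Continuing the sketch above (and assuming S_W is non-singular), the LDA space V_k could be obtained from the eigenvectors of S_W^{-1} S_B with the k largest eigenvalues:

    # Eigendecomposition of S_W^{-1} S_B (not symmetric in general, so keep the real parts)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

    k = 1                                      # number of retained dimensions (k <= c - 1)
    order = np.argsort(eigvals.real)[::-1]     # sort eigenvalues in decreasing order
    V_k = eigvecs.real[:, order[:k]]           # the k eigenvectors with the largest eigenvalues (M x k)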
Figure: Visualized steps to calculate a lower dimensional subspace of the LDA: (A) the data matrix X (N x M) with classes ω1, ω2, ω3; (B) the mean of each class (μ_i) and the mean-centered data (D_i = ω_i − μ_i); (C) the total mean (μ); (D-F) the between-class matrix (S_B = S_B1 + S_B2 + S_B3) and the within-class matrix (S_W = S_W1 + S_W2 + S_W3), both M x M; (G) sorting the eigenvalues (λ1, ..., λM) and selecting the k eigenvectors with the largest eigenvalues to construct the lower dimensional space V_k ∈ R^{M×k}, which maximizes arg max_W (W^T S_B W)/(W^T S_W W).
Theoretical background to LDA: Constructing the lower dimensional space
Figure: Projection of the original samples (i.e. the data matrix X ∈ R^{N×M}) onto the lower dimensional space of LDA (V_k ∈ R^{M×k}); the projected data are Y = X V_k ∈ R^{N×k}.
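As a one-line sketch (reusing X_classes and V_k from the earlier snippets), the projection in this figure is simply a matrix multiplication:

    X = np.vstack(X_classes)      # original data matrix, N x M
    Y = X @ V_k                   # projected data Y = X V_k, N x k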
Figure: The between-class variance (S_B1, S_B2, S_B3) and the within-class variance (S_W1, S_W2, S_W3) of the three classes, together with the first and second eigenvectors (v1, v2) and the projected class means (m1, m2, m3).
To calculate S_B1:
S_B1 = n_1 (μ_1 − μ)^T (μ_1 − μ) = 5 [−0.91  0.87]^T [−0.91  0.87]
     = [ 4.13  −3.97
        −3.97   3.81 ]   (11)
Similarly, S_B2 is calculated as follows:
S_B2 = [ 3.44  −3.31
        −3.31   3.17 ]   (12)
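Equation (11) can be re-checked with a short snippet (a sketch only; the printed entries differ slightly because μ_1 − μ is rounded to two decimals here):

    import numpy as np

    d1 = np.array([[-0.91, 0.87]])       # (mu_1 - mu), rounded to two decimals
    S_B1 = 5 * d1.T @ d1                 # n_1 (mu_1 - mu)^T (mu_1 - mu)
    print(S_B1)                          # approx [[4.14, -3.96], [-3.96, 3.78]]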
Similarly, y_2 is as follows:
y_2 = ω_2 V_2 = [1.24  3.39  1.92  0.56  1.18  1.86]^T   (19)
Figure: Probability density function of the projected data of the first example,
(a) the projected data on V1 , (b) the projected data on V2 .
where λ_ωi and V_ωi represent the eigenvalues and eigenvectors of the
i-th class, respectively.
From the results shown above it can be seen that the second
eigenvector of the first class (V_ω1^{(2)}) has a larger corresponding
eigenvalue than the first one; thus, the second eigenvector is used as a
lower dimensional space for the first class as follows, y_1 = ω_1 V_ω1^{(2)},
where y_1 represents the projection of the samples of the first class.
The first eigenvector of the second class (V_ω2^{(1)}) has a larger
corresponding eigenvalue than the second one. Thus, V_ω2^{(1)} is used to
project the data of the second class as follows, y_2 = ω_2 V_ω2^{(1)}, where y_2
represents the projection of the samples of the second class.
The values of y_1 and y_2 will be as follows:
y_1 = [−0.88  −1.00  −0.35  −1.24  −0.59]^T  and  y_2 = [1.68  3.76  2.43  0.93  1.77  2.53]^T   (24)
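A hedged sketch of the class-dependent projection described above; it assumes that the class data matrices (omega_1, omega_2), the per-class within-class matrices (S_W1, S_W2), and S_B are already available, and it simply picks, for each class, the eigenvector with the largest eigenvalue:

    import numpy as np

    def class_dependent_projection(omega_i, S_Wi, S_B):
        # Eigendecomposition of S_Wi^{-1} S_B for one class
        eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_Wi) @ S_B)
        best = np.argmax(eigvals.real)             # index of the largest eigenvalue
        v = eigvecs.real[:, [best]]                # selected eigenvector (M x 1)
        return omega_i @ v                         # projection y_i = omega_i v

    # y1 = class_dependent_projection(omega_1, S_W1, S_B)
    # y2 = class_dependent_projection(omega_2, S_W2, S_B)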
Numerical example
The Figure (below) shows a pdf graph of the projected data (i.e. y_1
and y_2) on the two eigenvectors (V_ω1^{(2)} and V_ω2^{(1)}), and a number of
findings are revealed:
First, the projected data of the two classes are efficiently
discriminated.
Second, the within-class variance of the projected samples is lower
than the within-class variance of the original samples.
Figure: Probability density function of the projected data (y_1 and y_2) of Class 1 and Class 2.
Figure: The first example projected onto the class-dependent LDA subspaces: the samples and means of the two classes, the total mean, and the projected class means, with the first class projected onto V_ω1^{(2)} and the second class onto V_ω2^{(1)}; both the class-dependent and the class-independent LDA subspaces are shown.
Figure: Two examples of two non-linearly separable classes (ω_1 and ω_2), where in the original space x the class means may even coincide (μ = μ_1 = μ_2); the top panel shows how the two classes are non-separable, while the bottom panel shows how a mapping or transformation solves this problem and the two classes become linearly separable.
Figure: Example of kernel functions: the samples of Class 1 and Class 2 in the top panel (X), which lie on a line (i.e. a one-dimensional space, from −2 to 2), are non-linearly separable, whereas the samples in the bottom panel (Z), generated by mapping the samples of the top space with Φ(x) = (x, x²) = (z1, z2) = Z, e.g. (1) → (1, 1), (2) → (2, 4), (−1) → (−1, 1), (−2) → (−2, 4), are linearly separable.
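A tiny sketch of this mapping (assuming, as the figure suggests, that one class lies at x = ±1 and the other at x = ±2):

    import numpy as np

    x = np.array([-2.0, -1.0, 1.0, 2.0])    # one-dimensional, non-linearly separable samples
    Z = np.column_stack([x, x ** 2])        # phi(x) = (x, x^2) = (z1, z2)
    print(Z)                                # [[-2. 4.] [-1. 1.] [ 1. 1.] [ 2. 4.]]
    # In the (z1, z2) space the two classes are separable by a horizontal line, e.g. z2 = 2.5.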
Main problems of LDA: Linearity problem
where μ_i^φ = (1/n_i) Σ_{j=1}^{n_i} φ(x_j) and μ^φ = (1/N) Σ_{i=1}^{N} φ(x_i) = Σ_{i=1}^{c} (n_i/N) μ_i^φ.
Main problems of LDA: Small Sample Size (SSS) problem
Problem Definition
Singularity, also called the Small Sample Size (SSS) or under-sampled problem, is one
of the major problems of the LDA technique.
This problem results from high-dimensional pattern classification tasks
or from a low number of training samples available for each class compared
with the dimensionality of the sample space.
The SSS problem occurs when S_W is singular [1].
The upper bound of the rank [2] of S_W is N − c, while the dimension of
S_W is M × M.
Thus, in most cases M >> N − c, which leads to the SSS problem.
For example, in face recognition applications, the size of the face
image may reach 100 × 100 = 10000 pixels, which represents
high-dimensional features and leads to a singularity problem.
[1] A matrix is singular if it is square and does not have a matrix inverse, i.e. its determinant is zero; hence, not all of its columns and rows are independent.
[2] The rank of a matrix represents the number of linearly independent rows or columns.
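A small sketch illustrating the SSS problem on synthetic data (the numbers are arbitrary): with M = 100 features and only N = 15 samples in c = 3 classes, rank(S_W) is at most N − c = 12, so S_W is singular:

    import numpy as np

    rng = np.random.default_rng(0)
    M, n_i, c = 100, 5, 3                              # M >> N - c
    X_classes = [rng.standard_normal((n_i, M)) for _ in range(c)]

    S_W = np.zeros((M, M))
    for Xi in X_classes:
        D = Xi - Xi.mean(axis=0)                       # mean-centered class data
        S_W += D.T @ D

    print(np.linalg.matrix_rank(S_W))                  # <= N - c = 12, far less than M = 100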
Main problems of LDA: Small Sample Size (SSS) problem
[3] A is a full-rank matrix if all columns and rows of the matrix are independent, i.e. rank(A) = #rows = #cols.
Main problems of LDA: Small Sample Size (SSS) problem
Four different variants of the LDA technique that are used to solve
the SSS problem are introduced as follows:
PCA + LDA technique:
In this technique, the original d-dimensional features are first reduced
to an h-dimensional feature space using PCA, and then LDA is used
to further reduce the features to k dimensions.
PCA is used in this technique to reduce the dimensionality so that
the rank of S_W becomes N − c; hence, the SSS problem is addressed.
However, PCA neglects some discriminant information, which may
reduce the classification performance.
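A hedged sketch of the PCA + LDA idea using scikit-learn; the synthetic data and the choice h = N − c are assumptions for illustration only:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = np.vstack([rng.standard_normal((10, 50)) + i for i in range(3)])   # 30 samples, 50 features
    y = np.repeat([0, 1, 2], 10)

    N, c = X.shape[0], len(np.unique(y))
    h = N - c                                            # reduce to h dims so that S_W becomes full rank
    X_pca = PCA(n_components=h).fit_transform(X)         # step 1: PCA to h dimensions
    X_lda = LinearDiscriminantAnalysis(n_components=c - 1).fit_transform(X_pca, y)   # step 2: LDA to k <= c - 1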
Direct LDA technique:
Direct LDA (DLDA) is one of the well-known techniques used
to solve the SSS problem.
This technique has two main steps.
In the first step, the transformation matrix, W, is computed to
transform the training data to the range space of S_B.
In the second step, the dimensionality of the transformed data is
further reduced using some regulating matrices.
The benefit of DLDA is that no discriminative features are
neglected, as happens in the PCA+LDA technique.
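A rough sketch of one common DLDA formulation (a two-step diagonalization; the exact scaling of the regulating matrices varies between papers), assuming S_B and S_W have already been computed:

    import numpy as np

    def direct_lda(S_B, S_W, eps=1e-10):
        # Step 1: keep the range space of S_B and whiten it (so that Z^T S_B Z = I)
        lam_b, V = np.linalg.eigh(S_B)
        keep = lam_b > eps
        Z = V[:, keep] / np.sqrt(lam_b[keep])
        # Step 2: diagonalize the transformed within-class matrix
        lam_w, U = np.linalg.eigh(Z.T @ S_W @ Z)
        order = np.argsort(lam_w)                  # smallest within-class variance first
        return Z @ U[:, order]                     # columns = DLDA projection directions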
Regularized LDA technique:
In Regularized LDA (RLDA), a small perturbation is added to the
S_W matrix to make it non-singular. This regularization can be applied
as follows:
(S_W + ηI)^{-1} S_B w_i = λ_i w_i   (28)
where η represents a regularization parameter. The diagonal
components of S_W are biased by adding this small perturbation.
However, the regularization parameter needs to be tuned, and a poor
choice of it can degrade the generalization performance.
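A short sketch of equation (28), assuming S_B and S_W are given; the value of η below is arbitrary and would need tuning:

    import numpy as np

    eta = 1e-3                                          # regularization parameter (to be tuned)
    M = S_W.shape[0]
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W + eta * np.eye(M)) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs.real[:, order]                          # regularized LDA directions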
Null LDA technique:
The aim of NLDA is to find the orientation matrix W.
Firstly, the range space of S_W is neglected, and the data are
projected only onto the null space of S_W, i.e. S_W W = 0.
Secondly, the aim is to search for W that satisfies S_B W ≠ 0 and
maximizes |W^T S_B W|.
The high dimensionality of the feature space may lead to
computational problems. This problem can be solved by (1) using the
PCA technique as a pre-processing step, i.e. before applying the
NLDA technique, to reduce the dimension of the feature space to
N − 1 by removing the null space of S_T = S_B + S_W, or (2) using the
PCA technique before the second step of the NLDA technique.
Mathematically, in the Null LDA (NLDA) technique, the h column
vectors of the transformation matrix W = [w_1, w_2, ..., w_h] are taken
from the null space of S_W, i.e. w_i^T S_W w_i = 0, ∀i = 1 ... h,
where w_i^T S_B w_i ≠ 0. Hence, M − (N − c) linearly independent vectors
are used to form a new orientation matrix, which is used to maximize
|W^T S_B W| subject to the constraint |W^T S_W W| = 0, as follows:
W = arg max_{|W^T S_W W| = 0} |W^T S_B W|
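A hedged sketch of the NLDA idea using SciPy, assuming S_B and S_W are given and S_W is singular; the number of retained directions h below is only an illustrative choice:

    import numpy as np
    from scipy.linalg import null_space

    Z = null_space(S_W)                           # basis of the null space of S_W, i.e. S_W Z = 0
    lam, U = np.linalg.eigh(Z.T @ S_B @ Z)        # between-class variance inside that null space
    order = np.argsort(lam)[::-1]                 # largest |W^T S_B W| first
    h = min(3, Z.shape[1])                        # number of retained directions (illustrative)
    W = Z @ U[:, order[:h]]                       # orientation matrix W = [w_1, ..., w_h]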