[Figure: the spaces involved in classification — Pattern space (P), Measurement space (M), Feature space (F) and Class-membership space (C) — and the mappings between them.]
• Translation
• Rotation
• Scale
• Noise
• Projective (?)
Compactness
- the ratio of the square of the perimeter to the area of the region.
Connectivity
- the number of neighboring features adjoining the region.
Euler Number
- for a single region, one minus the number of holes in that region.
The Euler number for a set of connected regions can be calculated as the
number of regions minus the number of holes.
Elongatedness:
- the ratio between the length and width of the region bounding rectangle:
  elongatedness = a/b = Area/(thickness)².
Compactness
- compactness = (perimeter)²/Area; it is independent of linear transformations.
Moments of Inertia
The ij-th discrete central moment μ_ij of a region is defined by:

  μ_ij = Σ (x − x̄)^i (y − ȳ)^j

where the sums are taken over all points (x, y) contained within the region and (x̄, ȳ) is the center of gravity of the region:

  x̄ = (1/n) Σ x_i   and   ȳ = (1/n) Σ y_i

Note that n, the total number of points contained in the region, is a measure of its area (= μ_00).
We can form seven new moments from the central moments that
are invariant to changes of position, scale and orientation ( RTS ) of the
object represented by the region, although these new moments are not
invariant under perspective projection. Let the normalized unscaled central moments be:

  m_pq = μ_pq / (μ_00)^γ,   where γ = (p + q)/2 + 1

From these, the seven RST-invariant moments (the Hu moments), built from central moments up to order three, are obtained.
We can also define eccentricity and the orientation Θ of the region using moments. From the raw moments:

  Area = m_00;   x̄ = m_10 / m_00;   ȳ = m_01 / m_00
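The following is a minimal NumPy sketch (not from the notes) of the region features defined above — area, centroid, discrete central moments and compactness — computed on a hypothetical binary mask; the 4-neighbour perimeter estimate is an assumption made for illustration.

```python
import numpy as np

def region_features(mask):
    """mask: 2-D boolean array, True inside the region."""
    ys, xs = np.nonzero(mask)
    n = xs.size                        # area = mu_00 = number of points
    xbar, ybar = xs.mean(), ys.mean()  # centre of gravity

    def mu(i, j):                      # discrete central moment mu_ij
        return np.sum((xs - xbar) ** i * (ys - ybar) ** j)

    # crude 4-connected perimeter estimate: boundary pixels of the mask
    interior = (np.roll(mask, 1, 0) & np.roll(mask, -1, 0) &
                np.roll(mask, 1, 1) & np.roll(mask, -1, 1))
    perimeter = np.count_nonzero(mask & ~interior)

    compactness = perimeter ** 2 / n   # perimeter^2 / area
    return {"area": n, "centroid": (xbar, ybar),
            "mu20": mu(2, 0), "mu02": mu(0, 2), "mu11": mu(1, 1),
            "compactness": compactness}

# toy example: a filled square region
m = np.zeros((20, 20), dtype=bool); m[5:15, 5:15] = True
print(region_features(m))
```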
Polar Signature
Phase of DFT-based Signature Function
Coordinates of the boundary points of the shape are expressed as (x0, y0), (x1, y1), …, (xN, yN), or as a complex signal

  z_i = x_i + j y_i

The coefficient of the Fourier Transform of the shape is given as

  Z(e^{jω}) = Σ_n z_n e^{−jωn},   Z_m = R_m e^{jθ_m}

If the shape is rotated by an angle θ and the starting point is shifted by l_0, the coefficient becomes

  Z'_m = Z_m e^{jθ} e^{−jm2πl_0/N}

so the shift in phase, due to rotation of the shape and change in starting point, is

  θ'_m = θ_m + θ − 2πl_0 m / N

This leads to the derivation of the modified DFT coefficients for normalization against scaling, rotation and the shift in the starting point, as depicted in the table below:

  Scale:     R'_m = R_m / S
  Rotation:  Θ'_m = Θ_m + θ − (Θ_{-1} + Θ_{+1}) / 2
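A rough Python sketch of the DFT-based signature described above, under the assumption of a closed boundary sampled at N points; normalizing the magnitudes by |Z_1| (an assumed convention) removes the scale factor, while rotation and starting-point changes affect only the phases, as derived above.

```python
import numpy as np

def fourier_descriptor(x, y, keep=8):
    z = np.asarray(x) + 1j * np.asarray(y)   # z_i = x_i + j*y_i
    Z = np.fft.fft(z)                        # Z_m = R_m * exp(j*theta_m)
    R = np.abs(Z)
    return R[1:keep + 1] / R[1]              # drop Z_0, normalize scale by |Z_1|

t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
z = 3 * np.cos(t) + 1j * np.sin(t)                    # an ellipse
z2 = np.roll(2.0 * np.exp(1j * np.pi / 6) * z, 5)     # scaled, rotated, new start point
d1 = fourier_descriptor(z.real, z.imag)
d2 = fourier_descriptor(z2.real, z2.imag)
print(np.allclose(d1, d2))   # True: magnitudes are invariant to these changes
```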
Read about :
• MCC
• Shape Context
An Example of Classification
• “Sorting incoming fish on a conveyor according to species, using optical sensing.”
  Species: sea bass, salmon.
– Some properties (features) that could possibly be used to distinguish between the two types of fish are:
• Length
• Lightness
• Width
• Number and shape of fins
• Position of the mouth, etc.
[Figure: samples plotted in the lightness–width feature space.]
Feature space
• The samples of input (when represented by their features) are
represented as points in the feature space
[Figures: samples plotted in example feature spaces — perimeter vs. compactness, area vs. elongation, and a generic F1–F2 space showing clusters for Class 1, Class 2 and Class 3.]
• Pattern
• Feature
• Feature vector
• Feature space
• Classification
• Decision Boundary
• Decision Region
• Discriminant function
• Hyperplanes and Hypersurfaces
• Learning
• Supervised and unsupervised
• Error
• Noise
• PDF
• Bayes’ Rule
• Parametric and Non-parametric approaches
Decision region and Decision Boundary
• The goal of pattern recognition is to reach an optimal decision rule that categorizes the incoming data into their respective categories.
Approaches: Statistical, Syntactic, Neural
Supervised or Unsupervised
• Linear
• Quadratic
• Piecewise
• Non-parametric
Parametric Decision making (Statistical) - Supervised
P(w_i) – prior probability for class w_i;  P(X) – (unconditional) probability of feature vector X.
Take an example:
Two class problem:
Cold (C) and not-cold (C’). Feature is fever (f).
Then, using Bayes' theorem, the probability that a person has a cold, given that she (or he) has a fever, is:

  P(C | f) = P(f | C) P(C) / P(f) = (0.4 × 0.01) / 0.02 = 0.2

Not convinced that it works? Let us take an example with values to verify.
Given: P(C) = 0.01; P(f) = 0.02; P(f | C) = 0.4.
Total population = 1000. Thus, people having cold = 10. People having both fever and cold = 4. Thus, people having only cold = 10 − 4 = 6.
People having fever (with and without cold) = 0.02 × 1000 = 20.
People having fever without cold = 20 − 4 = 16 (may use this later).
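A tiny Python sketch that repeats the counting argument above for the same hypothetical population of 1000:

```python
N = 1000
cold = int(0.01 * N)              # P(C) = 0.01  -> 10 people with a cold
fever = int(0.02 * N)             # P(f) = 0.02  -> 20 people with a fever
fever_and_cold = int(0.4 * cold)  # P(f|C) = 0.4 -> 4 people with both

p_cold_given_fever = fever_and_cold / fever
print(p_cold_given_fever)         # 0.2, matching Bayes' theorem
```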
Thus:

  P(w_i | X) = P(X | w_i) P(w_i) / [ P(w_1) P(X | w_1) + P(w_2) P(X | w_2) + … + P(w_k) P(X | w_k) ]
With our last example:
  u_ij = 1 if x_i ∈ G_j; 0 otherwise,

where G_j, j = 1, 2, …, k represent the k clusters.
If n is the number of known patterns and c the
desired number of clusters, the k-means algorithm is:
Begin
  initialize n, c, μ_1, μ_2, …, μ_c (randomly selected)
  do
    1. classify the n samples according to the nearest μ_i
    2. recompute μ_i
  until no change in μ_i
  return μ_1, μ_2, …, μ_c
End
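A minimal NumPy sketch of the loop above (an illustrative implementation, not the lecture's code); X holds one sample per row and c is the desired number of clusters.

```python
import numpy as np

def kmeans(X, c, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]    # random initial means
    for _ in range(max_iter):
        # 1. classify samples according to the nearest mu_i
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 2. recompute mu_i as the mean of the samples assigned to it
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                           else mu[i] for i in range(c)])
        if np.linalg.norm(new_mu - mu) < tol:             # "no change in mu_i"
            break
        mu = new_mu
    return mu, labels

# two synthetic clusters around (0,0) and (3,3)
X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (20, 2)) for m in ([0, 0], [3, 3])])
print(kmeans(X, 2)[0])
```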
Classification Stage
• The samples have to be assigned to clusters in order to
minimize the cost function which is:
  J = Σ_{i=1}^{c} J_i = Σ_{i=1}^{c} Σ_{k, x_k ∈ G_i} ||x_k − μ_i||²

• This is the squared Euclidean distance of the samples from their cluster centers; summed over all clusters, this should be minimum.
• The classification of a point x_k is done by:

  u_ik = 1 if ||x_k − μ_i||² ≤ ||x_k − μ_j||², ∀ j ≠ i
         0 otherwise
Re-computing the Means
• The means are recomputed according to:
  μ_i = (1 / |G_i|) Σ_{k, x_k ∈ G_i} x_k
• Disadvantages
• What happens when there is overlap between classes, that is, a point is equally close to two cluster centers? The algorithm will not terminate.
• The terminating condition is therefore modified to "the change in the cost function (computed at the end of the classification step) is below some threshold", rather than exactly zero.
An Example
• The number of clusters is two in this case.
• But still there is some overlap.
Some necessary elements of probability and statistics

The normal (Gaussian) density:

  p(x) = (1 / (σ√(2π))) exp[ −½ ((x − μ)/σ)² ]

[Figure: normal distribution PDFs — https://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg (2013)]
The 68 – 95 - 99.7% Rule:
All normal density curves satisfy the following property, which is often referred to as the Empirical Rule:
• about 68% of the observations fall within 1σ of the mean μ;
• about 95% fall within 2σ of μ;
• about 99.7% fall within 3σ of μ.
Thus, the second central moment (also called the variance) of a random variable x is defined as:

  σ_x² = E[{x − E(x)}²] = E[(x − μ_x)²]
       = E(x²) − 2μ_x² + μ_x² = E(x²) − μ_x²

The S.D. of x is σ_x.

Thus,  E(x²) = σ_x² + μ_x².
Correlation between x and y:

  ρ_xy = σ_xy / (σ_x σ_y) = E[ ((x − μ_x)/σ_x) ((y − μ_y)/σ_y) ]

Property of the correlation coefficient:  −1 ≤ ρ_xy ≤ 1
For z = ax + by:

  E[(z − μ_z)²] = a²σ_x² + 2abσ_xy + b²σ_y²

If σ_xy = 0,  σ_z² = a²σ_x² + b²σ_y².
  ρ_xy = σ_xy / (σ_x σ_y) = E[ ((x − μ_x)/σ_x) ((y − μ_y)/σ_y) ]

The correlation coefficient can also be viewed as the cosine of the angle between the two vectors (in R^D) of samples drawn from the two random variables. This interpretation only works with centered data, i.e., data which have been shifted by the sample mean so as to have an average of zero.
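A quick NumPy check of the "cosine of the angle" interpretation on made-up data; centering is what makes the two quantities agree.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + 0.8 * rng.normal(size=500)

rho = np.corrcoef(x, y)[0, 1]                 # usual correlation coefficient
xc, yc = x - x.mean(), y - y.mean()           # centered sample vectors in R^D
cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(rho, cos_angle)                         # the two values coincide
```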
Other PDFs:

Poisson:  P(x) = (λ^x / x!) e^{−λ};  λ > 0

Read about: Binomial, Cauchy, Laplace.
Sample mean is defined as:  x̄ = Σ_{i=1}^{n} x_i P(x_i) = (1/n) Σ_{i=1}^{n} x_i,  where P(x_i) = 1/n.

Sample variance is:  σ_x² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Higher-order moments may also be computed:  E(x_i − x̄)³;  E(x_i − x̄)⁴.
  p(X) = (1 / √(det(Σ)(2π)^d)) exp[ −½ (X − μ)^T Σ^{-1} (X − μ) ]
       = (1 / √(det(Σ)(2π)^d)) exp[ −½ Σ_{ij} (x_i − μ_i) s_ij (x_j − μ_j) ]

where s_ij is the ij-th component of Σ^{-1} (the inverse of the covariance matrix Σ).
Special case, d = 2, where X = (x, y)^T. Then:

  μ = (μ_x, μ_y)^T    and

  Σ = | σ_x²   σ_xy |  =  | σ_x²           ρ_xy σ_x σ_y |
      | σ_xy   σ_y² |     | ρ_xy σ_x σ_y   σ_y²         |
Can you now obtain this, as given earlier:

  p(x, y) = (1 / (2π σ_x σ_y √(1 − ρ_xy²))) exp{ −(1 / (2(1 − ρ_xy²))) [ ((x − μ_x)/σ_x)² − 2ρ_xy (x − μ_x)(y − μ_y)/(σ_x σ_y) + ((y − μ_y)/σ_y)² ] }
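A short NumPy sketch comparing the general d-dimensional density with the explicit bivariate formula above at one arbitrary point (all parameter values are made up):

```python
import numpy as np

mu = np.array([1.0, 2.0])
sx, sy, rho = 1.5, 0.8, 0.4
Sigma = np.array([[sx**2, rho*sx*sy],
                  [rho*sx*sy, sy**2]])

def gauss_general(X, mu, Sigma):
    d = len(mu)
    diff = X - mu
    norm = np.sqrt(np.linalg.det(Sigma) * (2*np.pi)**d)
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

def gauss_bivariate(x, y):
    zx, zy = (x - mu[0]) / sx, (y - mu[1]) / sy
    q = (zx**2 - 2*rho*zx*zy + zy**2) / (2*(1 - rho**2))
    return np.exp(-q) / (2*np.pi*sx*sy*np.sqrt(1 - rho**2))

X = np.array([1.7, 1.1])
print(gauss_general(X, mu, Sigma), gauss_bivariate(*X))   # identical values
```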
In general, for d dimensions:

  Σ = | σ_11  σ_12  . .  σ_1d |   | σ_1²  σ_12  . .  σ_1d |
      | σ_21  σ_22  . .  σ_2d | = | σ_12  σ_2²  . .  σ_2d |
      | .     .     . .  .    |   | .     .     . .  .    |
      | σ_d1  σ_d2  . .  σ_dd |   | σ_1d  σ_2d  . .  σ_d² |

  μ = E(X) = (μ_1, μ_2, …, μ_d)^T
[Figures: iso-density contours of 2-D Gaussians. Diagonal covariance with σ_x > σ_y and ρ_xy = 0 gives axis-aligned ellipses; non-diagonal covariance with σ_x = σ_y and ρ_xy < 0 or ρ_xy > 0 gives oriented ellipses. Remember: asymmetric and oriented Gaussians.]
Decision Regions and Boundaries
Determine the decision region (in Rd) into which X falls, and
assign X to this class.
  g_1(X) = ||X − μ_1||;   g_2(X) = ||X − μ_2||

  p(X | w_i) = (1 / √(det(Σ_i)(2π)^d)) exp[ −½ (X − μ_i)^T Σ_i^{-1} (X − μ_i) ]
  g_i(X) = ω_i^T X + ω_i0,  where ω_i = μ_i  and  ω_i0 = − μ_i^T μ_i / 2

This gives the simplest linear discriminant function, or correlation detector.
The perceptron (ANN) built to form the linear discriminant function
[Figure: a single perceptron with inputs x_1, x_2, …, x_d, weights w_1, w_2, …, w_d, bias w_i0, and output O(X).]

  O(X) = Σ_i (w_i x_i) + w_i0
G = MX − Y + C
The decision region boundaries are determined by solving:  G_i(X) = G_j(X).

The Mahalanobis distance (quadratic term) spawns a number of different surfaces, depending on Σ^{-1}. It is basically a vector distance using a Σ^{-1} norm. It is denoted as:  ||X − μ_i||²_{Σ_i^{-1}}.
Make the case of Bayes' rule more general for class assignment.
Earlier we had assumed that:

  g_i(X) = P(C_i | X),  assuming P(C_i) = P(C_j), ∀ i, j; i ≠ j.

Now,  G_i(X) = log[ P(C_i | X) · P(X) ] = log[ P(X | C_i) ] + log[ P(C_i) ]

  G_i(X) = log[ 1 / √(det(Σ_i)(2π)^d) ] − ½ (X − μ_i)^T Σ_i^{-1} (X − μ_i) + log[ P(C_i) ]
         = −½ (X − μ_i)^T Σ_i^{-1} (X − μ_i) − (d/2) log(2π) − ½ log(det Σ_i) + log[ P(C_i) ]
         = −½ (X − μ_i)^T Σ_i^{-1} (X − μ_i) − ½ log(det Σ_i) + log[ P(C_i) ]      (neglecting the constant term)
Simpler case: Σ_i = σ²I, and eliminating the class-independent bias, we have:

  G_i(X) = −(1 / (2σ²)) (X − μ_i)^T (X − μ_i) + log[ P(C_i) ]

These are loci of constant hyper-spheres, centered at the class mean.
More on this later on…..
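A hedged sketch of this Σ_i = σ²I case: the discriminant reduces to a prior-biased minimum-distance-to-mean rule. The means, σ and priors below are illustrative assumptions.

```python
import numpy as np

mus = np.array([[0.0, 0.0], [3.0, 1.0]])      # class means (assumed)
priors = np.array([0.7, 0.3])                 # class priors (assumed)
sigma = 1.0

def classify(X):
    # G_i(X) = -||X - mu_i||^2 / (2 sigma^2) + log P(C_i)
    g = [-np.sum((X - mu)**2) / (2 * sigma**2) + np.log(p)
         for mu, p in zip(mus, priors)]
    return int(np.argmax(g))

print(classify(np.array([1.4, 0.4])))         # index of the class with largest G_i
```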
If Σ is a diagonal matrix, with equal/unequal σ_ii²:

  Σ = | σ_1²  0     . .  0    |          Σ^{-1} = | 1/σ_1²  0       . .  0      |
      | 0     σ_2²  . .  0    |   and             | 0       1/σ_2²  . .  0      |
      | .     .     . .  .    |                   | .       .       . .  .      |
      | 0     0     . .  σ_d² |                   | 0       0       . .  1/σ_d² |
Considering the discriminant function:

  G_i(X) = −½ (X − μ_i)^T Σ_i^{-1} (X − μ_i) − ½ log(det Σ_i) + log[ P(C_i) ]

this now yields a weighted distance classifier. Depending on the covariance term (more spread/scatter or not), we tend to put more emphasis on some feature vector components than on others.
Thus the decision surfaces are hyperplanes and the decision boundaries will also be linear (use G_i(X) = G_j(X), as done earlier).
This plane partitions the space into two mutually exclusive regions, say R_p and R_n. The assignment of the vector X to either the +ve side, the −ve side, or along H, can be implemented by:

  ω^T X − d   > 0 if X ∈ R_p
              = 0 if X ∈ H
              < 0 if X ∈ R_n
A relook at the Linear Discriminant Function g(X):

  g(X) = ω^T X − d

[Figure: the hyperplane H in the pattern/feature space (x_1, x_2), with the +ve side R_p and the −ve side R_n.]

The orientation of H is determined by ω.
The location of H is determined by d.
Pattern/feature Space and Weight Space
The complementary role of a sample in parametric (weight) space:

[Figure: the hyperplane X^T W = 0 in the pattern/feature space (x_1, x_2) separating classes C1 and C2, and the corresponding hyperplane H' (W^T X = 0) that each sample X_k defines in the weight space (w_1, w_2).]

Training sets: T1 = [X1, X2]; T2 = [X3, X4];
with g(T1) > 0 and g(T2) < 0.

[Figure: the constraints contributed by X1, …, X4 in weight space; the region satisfying all of them is the SOLUTION SPACE.]
LMS learning law in BPNN or FFNN models
Read about: perceptron vs. multi-layer feedforward network.

[Figure: a single perceptron with inputs x_1, x_2, …, x_d, weights w_1, w_2, …, w_d, bias w_i0, and output O(X).]

The perceptron weight update:

  W_{k+1} = W_k + η_k X_k   if X_k^T W_k ≤ 0
          = W_k             otherwise

η_k is the learning rate parameter.
[Figure: in weight space (w_1, w_2), a misclassified sample X_k moves the weight vector from W_k to W_{k+1}, across the hyperplane W^T X_k = 0.]

For the two-class case (classes Χ_1 and Χ_0):

  W_{k+1} = W_k + η_k X_k   if X_k ∈ Χ_1 and X_k^T W_k ≤ 0
          = W_k − η_k X_k   if X_k ∈ Χ_0 and X_k^T W_k ≥ 0

η_k decreases with each iteration.

Training sets, as before: T1 = [X1, X2]; T2 = [X3, X4].
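A short Python sketch of the two-class perceptron update above. As an implementation choice (not from the notes), class-0 samples are reflected so that a single "add η_k X_k when X_k^T W_k ≤ 0" rule covers both cases, and the bias is absorbed as an extra constant input.

```python
import numpy as np

def perceptron(X, labels, eta=1.0, epochs=100):
    Xa = np.hstack([X, np.ones((len(X), 1))])        # augment with a bias term
    Xa[labels == 0] *= -1                            # reflect class-0 samples
    W = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        updated = False
        for xk in Xa:
            if xk @ W <= 0:                          # misclassified sample
                W = W + eta * xk
                updated = True
        if not updated:                              # all samples on the +ve side
            break
    return W

X = np.array([[2., 1.], [3., 2.], [2., 3.], [-1., 0.], [-2., -1.], [0., -2.]])
labels = np.array([1, 1, 1, 0, 0, 0])
print(perceptron(X, labels))                         # a separating weight vector
```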
In the case of FFNN, the objective is to minimize the error term:

  e_k = d_k − s_k = d_k − X_k^T W_k

[Figure: one weight-update step ΔW_k from W_k to W_{k+1}.]

  ξ_k = ½ [d_k − X_k^T W_k]²;   E[ξ_k] = E[d_k²]/2 − P^T W + ½ W^T R W

where P^T = E[d_k X_k^T], and (with X_k augmented by a constant 1 for the bias)

  R = E[X_k X_k^T] = E | 1      x_1k       …  x_nk       |
                       | x_1k   x_1k x_1k  …  x_1k x_nk  |
                       | …                                |
                       | x_nk   x_nk x_1k  …  x_nk x_nk  |

  ∇ξ = (∂ξ/∂w_0, ∂ξ/∂w_1, …, ∂ξ/∂w_n)^T = −P + RW

Thus,  Ŵ = R^{-1} P.
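A small NumPy sketch of the closed-form minimum Ŵ = R⁻¹P, with R and P estimated from samples; the data-generating weights below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 500, 3
Xk = np.hstack([rng.normal(size=(n, dim - 1)), np.ones((n, 1))])  # bias input
true_w = np.array([1.5, -2.0, 0.5])                               # assumed weights
d = Xk @ true_w + 0.01 * rng.normal(size=n)                       # desired outputs

R = (Xk.T @ Xk) / n            # sample estimate of E[X X^T]
P = (Xk.T @ d) / n             # sample estimate of E[d X]
W_hat = np.linalg.solve(R, P)  # W = R^{-1} P
print(W_hat)                   # close to true_w
```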
Effect of class Priors – revisiting DBs in a more general case.
  P(w_i | X) = P(X | w_i) P(w_i) / P(X)

  p(X | w_i) = (1 / √(det(Σ)(2π)^d)) exp[ −½ (X − μ_i)^T Σ^{-1} (X − μ_i) ]

For Σ_i = σ²I:

  g_i(X) = −(1 / (2σ²)) [ X^T X − 2μ_i^T X + μ_i^T μ_i ] + ln P(w_i)

Thus,  g_i(X) = ω_i^T X + ω_i0,
where  ω_i = μ_i / σ²   and   ω_i0 = −μ_i^T μ_i / (2σ²) + ln P(w_i)

The decision boundary between classes k and l is:

  W^T (X − X_0) = 0,   where W = μ_k − μ_l   and

  X_0 = ½ (μ_k + μ_l) − (σ² / ||μ_k − μ_l||²) ln[ P(ω_k) / P(ω_l) ] (μ_k − μ_l)
CASE – B. – Arbitrary Σ, but identical for all classes.

  g_i(X) = −½ (X − μ_i)^T Σ^{-1} (X − μ_i) + ln P(w_i)

Removing the class-invariant quadratic term:

  g_i(X) = −½ μ_i^T Σ^{-1} μ_i + (Σ^{-1} μ_i)^T X + ln P(w_i)

Thus,  g_i(X) = ω_i^T X + ω_i0,
where  ω_i = Σ^{-1} μ_i   and   ω_i0 = −½ μ_i^T Σ^{-1} μ_i + ln P(w_i)

The linear DB is thus:  g_k(X) = g_l(X), k ≠ l,
which is:  (ω_k^T − ω_l^T) X + (ω_k0 − ω_l0) = 0;
(ω_k0 − ω_l0) = (ω_l − ω_k)^T X_0,  where

  X_0 = ½ (μ_k + μ_l) − [ ln( P(ω_k) / P(ω_l) ) / ( (μ_k − μ_l)^T Σ^{-1} (μ_k − μ_l) ) ] (μ_k − μ_l)     (Prove it.)

Thus the linear DB is:  W^T (X − X_0) = 0,
where W = ω_k − ω_l with ω_i = Σ^{-1} μ_i.  Thus,  W = Σ^{-1} (μ_k − μ_l).
Special case (diagonal Σ): the components of W are

  W_D = [ (μ_k1 − μ_l1)/σ_1² ,  (μ_k2 − μ_l2)/σ_2² ]^T

[Figures: the effect on the boundary orientation of increasing σ_2 and decreasing σ_1 (σ_1 > σ_2, diagonal Σ in all cases), and of having both diagonal elements of Σ equal to 1.0. Point P is actually closer (in the Euclidean sense) to the mean of the orange class.]
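A sketch of Case B in NumPy: compute ω_i, ω_i0 for two classes and check that the boundary g_k(X) = g_l(X) indeed passes through the X_0 given above. Means, Σ and priors are illustrative assumptions.

```python
import numpy as np

mu_k, mu_l = np.array([2.0, 0.0]), np.array([0.0, 1.0])   # assumed class means
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])                # assumed shared covariance
Pk, Pl = 0.6, 0.4                                         # assumed priors
Si = np.linalg.inv(Sigma)

def omega(mu, P):
    # omega_i = Sigma^{-1} mu_i;  omega_i0 = -mu_i^T Sigma^{-1} mu_i / 2 + ln P(w_i)
    return Si @ mu, -0.5 * mu @ Si @ mu + np.log(P)

wk, wk0 = omega(mu_k, Pk)
wl, wl0 = omega(mu_l, Pl)
dm = mu_k - mu_l
X0 = 0.5 * (mu_k + mu_l) - (np.log(Pk / Pl) / (dm @ Si @ dm)) * dm

g = lambda X, w, w0: w @ X + w0
print(np.isclose(g(X0, wk, wk0), g(X0, wl, wl0)))   # True: X_0 lies on the boundary
```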
When each class has its own covariance Σ_i, the discriminant becomes quadratic:

  g_i(X) = X^T W_i X + ω_i^T X + ω_i0,
  where W_i = −½ Σ_i^{-1};   ω_i = Σ_i^{-1} μ_i   and   ω_i0 = −½ μ_i^T Σ_i^{-1} μ_i − ½ ln|Σ_i| + ln P(w_i)

Example:

  Σ_1^{-1} = | 2    0   |        Σ_2^{-1} = | 1/2  0   |
             | 0    1/2 |                   | 0    1/2 |

Assume P(w_1) = P(w_2) = 0.5.
The general quadratic discriminant in d dimensions:

  Σ_{i=1}^{d} w_ii x_i² + Σ_{i=1}^{d−1} Σ_{j=i+1}^{d} w_ij x_i x_j + Σ_{i=1}^{d} w_i x_i + w_0 = 0    …(1)

For d = 2:

  w_11 x_1² + w_22 x_2² + w_12 x_1 x_2 + w_1 x_1 + w_2 x_2 + w_0 = 0    …(2)

Special cases of equation (2):
Case 1:
w11 = w22 = w12 = 0; Eqn. (2) defines a line.
Case 2:
w11 = w22 = K; w12 = 0; defines a circle.
Case 3:
w11 = w22 = 1; w12 = w1 = w2 = 0; defines a circle whose center is at the origin.
Case 4:
w11 = w22 = 0; defines a bilinear constraint.
Case 5:
w11 = w12 = w2 = 0; defines a parabola with a specific orientation.
Case 6:
w11 ≠ 0, w22 ≠ 0, w11 ≠ w22 ; w12 = w1 = w2 = 0
defines a simple ellipse.
Equation (1) has

  2d + 1 + d(d−1)/2 = (d+1)(d+2)/2

parameters. The Mahalanobis (quadratic) term expands as:

  d_i² = (X − μ_i)^T Σ_i^{-1} (X − μ_i) = X^T Σ_i^{-1} X − 2 μ_i^T Σ_i^{-1} X + μ_i^T Σ_i^{-1} μ_i
Example of linearization:

  g(X) = x_2 − x_1² − 3x_1 + 6 = 0

Substituting x_3 = x_1²:

  g(X) = x_2 − x_3 − 3x_1 + 6 = W^T X + w_0,
  where X = [x_1, x_2, x_3]^T,  W = [−3, 1, −1]^T  and  w_0 = 6.
CASE – C. – Arbitrary Σ_i, all parameters are class dependent – contd..

  g_i(X) = −½ (X − μ_i)^T Σ_i^{-1} (X − μ_i) − ½ ln|Σ_i| + ln P(w_i)

Thus,  g_i(X) = X^T W_i X + ω_i^T X + ω_i0,
where  W_i = −½ Σ_i^{-1};   ω_i = Σ_i^{-1} μ_i   and   ω_i0 = −½ μ_i^T Σ_i^{-1} μ_i − ½ ln|Σ_i| + ln P(w_i)
[Figures: example class-conditional Gaussian pairs for Case C.
(a) σ_x^(1) = σ_y^(2), σ_y^(1) = σ_x^(2), ρ_1 = ρ_2 = 0, μ_x^(1) < μ_x^(2), μ_y^(1) = μ_y^(2).
(b) σ_x^(1) = σ_y^(2), σ_y^(1) = σ_x^(2), ρ_1 = ρ_2 = 0, μ_x^(1) = μ_x^(2) ± C, μ_y^(1) = μ_y^(2) ∓ C.]
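A hedged sketch of Case C: each class keeps its own Σ_i, so g_i(X) is quadratic in X. All numeric values below are illustrative assumptions.

```python
import numpy as np

params = [  # (mean, covariance, prior) per class -- assumed values
    (np.array([0.0, 0.0]), np.array([[0.5, 0.0], [0.0, 2.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.0], [0.0, 2.0]]), 0.5),
]

def g(X, mu, Sigma, P):
    Si = np.linalg.inv(Sigma)
    Wi = -0.5 * Si                                   # quadratic term
    wi = Si @ mu                                     # linear term
    wi0 = -0.5 * mu @ Si @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(P)
    return X @ Wi @ X + wi @ X + wi0

X = np.array([1.2, 0.8])
print(int(np.argmax([g(X, *p) for p in params])))    # class with the largest g_i
```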
Read about GMM,
and estimation using
MLE or EM methods.
Kullback-Leibler divergence
  D_KL(p, q) = Σ_i [ p(i) log( p(i)/q(i) ) − p(i) + q(i) ]

or, for normalized distributions,

  D_KL(p, q) = − Σ_i p(i) log q(i) + Σ_i p(i) log p(i)
             = H(p, q) − H(p)
             = cross_entropy(p, q) − entropy(p)
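A small check, on two made-up discrete distributions, that the second form of D_KL equals cross-entropy minus entropy:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

dkl = np.sum(p * np.log(p / q))
cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
print(dkl, cross_entropy - entropy)      # the two values agree
```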
• Bregman divergence:

  D_BG(p, q) = F(p) − F(q) − ⟨∇F(q), p − q⟩

  The Bregman distance associated with F for points (P, Q) is the difference between the value of F at point P and the value of the first-order Taylor expansion of F around point Q, evaluated at point P. F is a continuously-differentiable, real-valued, strictly convex function defined on a closed convex set.

• Jensen–Shannon divergence:

  D_JS(p, q) = [ D_KL(P, M) + D_KL(Q, M) ] / 2,   where M = (P + Q)/2
• Deviance information criterion
• Bayesian information criterion
• Quantum relative entropy
• Information gain in decision trees
• Solomon Kullback and Richard Leibler
• Information theory and measure theory
• Entropy power inequality
• Information gain ratio
• F-divergence
Principal Component Analysis
Unsupervised learning
Demonstration of KL Transform
[Figures: example data sets with their first and second eigenvectors overlaid; another example follows.]
Goal of PCA:
Find some orthonormal matrix WT, where Y = WTX;
such that
COV(Y) ≡ (1/(n−1))YYT is diagonalized.
Samples:

  x1 = (−1, 1, 2)^T;  x2 = (−2, 3, 1)^T;  x3 = (4, 0, 3)^T;

  X = | −1  −2   4 |
      |  1   3   0 |
      |  2   1   3 |

3-D problem, with N = 3.

Mean of the samples:  μ = (1/3, 4/3, 2)^T;
centered samples:  x̃1 = (−4/3, −1/3, 0)^T;  x̃2 = (−7/3, 5/3, −1)^T;  x̃3 = (11/3, −4/3, 1)^T

Method – 1 (easiest)

  X̃ = | −4/3  −7/3  11/3 |
      | −1/3   5/3  −4/3 |
      |  0    −1     1   |

  COVAR = (X̃ X̃^T) / 2 = (1/2) | 62/3   −25/3   6  |
                               | −25/3   14/3  −3  |
                               |  6      −3     2  |
Method – 2 (PCA defn.)

  S_T = (1/(N−1)) Σ_{k=1}^{N} (x_k − μ)(x_k − μ)^T

With x̃1 = (−4/3, −1/3, 0)^T, x̃2 = (−7/3, 5/3, −1)^T, x̃3 = (11/3, −4/3, 1)^T:

  C1 = x̃1 x̃1^T = | 1.7778  0.4444  0 |
                  | 0.4444  0.1111  0 |
                  | 0       0       0 |

  C2 = |  5.4444  -3.8889   2.3333 |      C3 = | 13.4444  -4.8889   3.6667 |
       | -3.8889   2.7778  -1.6667 |           | -4.8889   1.7778  -1.3333 |
       |  2.3333  -1.6667   1.0000 |           |  3.6667  -1.3333   1.0000 |

  SigmaC = C1 + C2 + C3 = | 20.6667  -8.3333   6.0000 |
                          | -8.3333   4.6667  -3.0000 |
                          |  6.0000  -3.0000   2.0000 |

  COVAR = SigmaC / 2 = | 10.3333  -4.1667   3.0000 |
                       | -4.1667   2.3333  -1.5000 |
                       |  3.0000  -1.5000   1.0000 |

This agrees with Method 1:

  S = (X̃ X̃^T)/2 = (1/2) | 62/3   −25/3   6 |  =  | 10.3333  -4.1667   3.0000 |
                         | −25/3   14/3  −3 |     | -4.1667   2.3333  -1.5000 |
                         |  6      −3     2 |     |  3.0000  -1.5000   1.0000 |
The SVD [U, S, V] of COVAR (identical for both methods, since the matrix is symmetric) gives the singular values / eigenvalues:

  S = | 13.0404   0        0      |
      |  0        0.6263   0      |
      |  0        0        0.0000 |

with the corresponding eigenvectors in the columns of U (= V).
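A sketch reproducing the 3-D example above with NumPy: center the samples, form the covariance matrix both ways, and take its eigendecomposition (eigenvalues ≈ 13.04, 0.63, 0.0).

```python
import numpy as np

X = np.array([[-1., -2., 4.],
              [ 1.,  3., 0.],
              [ 2.,  1., 3.]])          # columns are the samples x1, x2, x3
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                             # centered data

S_method1 = (Xc @ Xc.T) / (X.shape[1] - 1)                      # Method 1
S_method2 = sum(np.outer(c, c) for c in Xc.T) / (X.shape[1] - 1)  # Method 2
print(np.allclose(S_method1, S_method2))

vals, vecs = np.linalg.eigh(S_method1)  # eigenvalues ~ 13.04, 0.63, 0.0
print(np.sort(vals)[::-1])
```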
V= V=
− 3 − 2 − 1 4 5 6
x1 = ; x2 = ; x3 = ; x4 = ; x5 = ; x6 = ;
− 3 − 2 − 1 4 5 7
2-D problem (d=2), with N = 6. X=
-3 -2 -1 4 5 6
Each column is an observation (sample) -3 -2 -1 4 5 7
and each row a variable (dimension),
Clustering criterion functions:  J_3 = tr(S_1) − μ (tr S_2 − c);   J_4 = tr(S_1) / tr(S_2)
Linear Discriminant Analysis
• Learning set is labeled – supervised learning
• Class specific method in the sense that it tries to ‘shape’ the
scatter in order to make it more reliable for classification.
Linear Discriminant Analysis
• If S_W is nonsingular, W_opt is chosen to satisfy

  W_opt = arg max_W  |W^T S_B W| / |W^T S_W W|

whose columns satisfy the generalized eigenvalue problem  S_B w_i = λ_i S_W w_i.

• There are at most (c − 1) non-zero eigenvalues, so the upper bound of m is (c − 1).
Linear Discriminant Analysis
S_W is singular most of the time. Its rank is at most N − c.
Solution – use an alternative criterion (PCA followed by FLD):

  W_pca = arg max_W |W^T S_T W|

  W_fld = arg max_W  |W^T W_pca^T S_B W_pca W| / |W^T W_pca^T S_W W_pca W|
Demonstration for LDA
Hand-worked EXAMPLE:

Data points (x, y):
  Class 1: (1,1), (2,2), (3,3), (5,4), (4,5), (6,6), (8,7)
  Class 2: (−2,3), (−1,4), (1,5), (3,6), (4,7), (2,8), (5,9)

  Sw = | 10.6122   8.5714 |        Sb = | 20.6429  -17.00 |
       |  8.5714   8.0000 |             | -17.00    14.00 |

  INV(Sw) · Sb = |  27.20   -22.40 |
                 | -31.268   25.75 |

Perform eigendecomposition on the above.

  Eigenvectors: | -0.7719   0.6357 |
                |  0.6357   0.7719 |
The same computation, shown for two variants of Sb:

Variant 1:
  Sw = | 10.6122   8.5714 |        Sb = | 20.6429  -17.00 |
       |  8.5714   8.0000 |             | -17.00    14.00 |
  Eigenvalues of Sw⁻¹Sb: 53.687, 0
  Eigenvectors: | -0.7719   0.6357 |
                |  0.6357   0.7719 |

Variant 2:
  Sw = | 10.6122   8.5714 |        Sb = | 203.143  -95.00 |
       |  8.5714   8.0000 |             | -95.00    87.50 |
  Eigenvalues of Sw⁻¹Sb: 297.83, 0.0
  Eigenvectors: | -0.7355  -0.6775 |
                |  0.6775   0.7355 |

After linear projection, using LDA:
[Figure: the data after linear projection using LDA.]
Same EXAMPLE for LDA, with C = 3:

Data points (x, y), grouped by class label:
  Class 1: (1,1), (2,2), (3,3), (−2,3), (−1,4), (1,5)
  Class 2: (5,4), (4,5), (3,6), (4,7)
  Class 3: (6,6), (8,7), (2,8), (5,9)

  Sw = | 8.0764   -2.125  |        Sb = | 56.845   52.50 |
       | -2.125    4.1667 |             | 52.50    50.00 |

  INV(Sw) · Sb = | 11.958   11.155 |
                 | 18.7     17.69  |

Perform eigendecomposition on the above.
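A sketch recomputing the two-class LDA example above in NumPy, assuming (as the numbers above suggest) that each class scatter is divided by its class size and Sb is weighted by class sizes; the dominant eigenvector of inv(Sw)·Sb is the LDA projection direction.

```python
import numpy as np

X1 = np.array([[1, 1], [2, 2], [3, 3], [5, 4], [4, 5], [6, 6], [8, 7]], float)
X2 = np.array([[-2, 3], [-1, 4], [1, 5], [3, 6], [4, 7], [2, 8], [5, 9]], float)

def scatter(X):                      # per-class scatter divided by N_i (assumed convention)
    d = X - X.mean(axis=0)
    return d.T @ d / len(X)

Sw = scatter(X1) + scatter(X2)       # ~ [[10.61, 8.57], [8.57, 8.00]]
m1, m2, m = X1.mean(0), X2.mean(0), np.vstack([X1, X2]).mean(0)
Sb = len(X1) * np.outer(m1 - m, m1 - m) + len(X2) * np.outer(m2 - m, m2 - m)
                                     # ~ [[20.64, -17.0], [-17.0, 14.0]]

vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = vecs[:, np.argmax(vals)]         # LDA projection direction
print(Sw, Sb, w, sep="\n")
```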
• Reinforcement learning
• Generalization capabilities
• Evolutionary Computations
• Genetic algorithms
• Pervasive computing
• Neural dynamics
• Bishop – PR