
Dimension Reduction from PCA (and beyond) to BoW


Some slide credits: Désiré Sidibé, Assoc. Prof., University of Burgundy
PCA

•  PCA (principal component analysis) is one of the most widely used techniques for
   •  data analysis
   •  data visualization
   •  dimensionality reduction
•  Applications of PCA include:
   •  data compression, image processing, pattern recognition, etc.

PCA

What is PCA?
•  The most common answer would be 'an algorithm for dimensionality reduction'
•  Yes, but:
   •  Where does the algorithm come from?
   •  What is the underlying model?
•  PCA is actually many different things (models):
   •  a latent variable model (Hotelling, 1930s)
   •  variance-maximization directions (Pearson, 1901)
   •  optimal linear reconstruction (the Kosambi-Karhunen-Loève transform in signal processing)
•  It just turns out that these different models lead to the same algorithm (in the linear Gaussian case)

PCA

What is PCA?

Goal of PCA
The main goal of PCA is to express a complex data set in a new set of basis vectors that 'best' explain the data.

•  So, PCA is essentially a change of basis
•  We want to find the most meaningful basis to re-express the data, such that
   •  the new basis reveals hidden structure
   •  the new basis removes redundancy
•  Most of the time, we would like a lower-dimensional space.
PCA algorithm

Given a set of N data samples $x_i \in \mathbb{R}^d$ such that $\sum_i x_i = 0$:

1. Compute the sample covariance matrix $C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T$.
   Note that C is a $d \times d$ matrix.
2. Compute the eigen-decomposition of C: $C = U \Lambda U^T$,
   where U is an orthogonal $d \times d$ matrix and $\Lambda$ is a diagonal matrix.
3. Since C is symmetric, its eigenvectors $u_1, u_2, \ldots, u_d$ form a basis of $\mathbb{R}^d$.

•  The eigenvectors $u_1, u_2, \ldots, u_d$ are called the principal components.
•  The corresponding eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_d$ give the importance of each principal axis.
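As a concrete illustration, here is a minimal NumPy sketch of these three steps (the function and variable names are illustrative, not from the slides; the data matrix is taken as $d \times N$ with the samples as columns, as above):

```python
import numpy as np

def pca(X):
    """PCA of the d x N data matrix X whose columns are the samples x_i."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data: now sum_i x_i = 0
    N = Xc.shape[1]
    C = (Xc @ Xc.T) / N                      # sample covariance matrix, d x d
    lam, U = np.linalg.eigh(C)               # C = U Lambda U^T (eigh returns ascending order)
    order = np.argsort(lam)[::-1]            # re-order so that lambda_1 >= lambda_2 >= ...
    return lam[order], U[:, order]

# toy usage: 200 samples in R^5
X = np.random.randn(5, 200)
lam, U = pca(X)
Y = U.T @ (X - X.mean(axis=1, keepdims=True))   # new representation y_i = U^T x_i
```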
PCA algorithm

The PCA algorithm is pretty simple:
•  First, center the data (if it is not), so that $\sum_i x_i = 0$.
•  Then, compute the sample covariance matrix and its eigenvectors.
•  Finally, each sample point $x_i$ can be represented in the new basis (projection onto the eigenspace) as

   $y_i = U^T x_i$

•  We claim that the new representation makes the data uncorrelated, i.e. $\mathrm{Cov}(y_i, y_j) = 0$ if $i \neq j$.
PCA algorithm
We claim that the new representation makes the data uncorrelated.

Why?
The sample covariance of the transformed data is

$C_{new} = \frac{1}{N} \sum_{i=1}^{N} y_i y_i^T = \frac{1}{N} \sum_{i=1}^{N} (U^T x_i)(U^T x_i)^T$
$\qquad\;\; = U^T \Big( \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T \Big) U = U^T C U = U^T (U \Lambda U^T) U = \Lambda$

Hence, when projected onto the principal components, the data is decorrelated.
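A quick numerical check of this claim, as a sketch (the correlated toy data and all names here are my own, not from the slides):

```python
import numpy as np

# correlated 2-D data: the second coordinate depends strongly on the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
X = np.vstack([x1, 0.8 * x1 + 0.2 * rng.normal(size=1000)])   # 2 x N data matrix

Xc = X - X.mean(axis=1, keepdims=True)
C = Xc @ Xc.T / Xc.shape[1]
lam, U = np.linalg.eigh(C)
Y = U.T @ Xc                                  # projected data y_i = U^T x_i

C_new = Y @ Y.T / Y.shape[1]                  # covariance of the projected data
print(np.round(C_new, 6))                     # off-diagonal entries are ~0: decorrelated
```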

PCA algorithm
Dimensionality reduction
•  We usually want to represent our data in a lower-dimensional space $\mathbb{R}^k$, with $k \ll d$.
•  We achieve this by projecting onto the k principal axes which preserve most of the variance in the data.
•  From the previous analysis, we see that those axes correspond to the eigenvectors associated with the k largest eigenvalues:

   $U = [\, u_1 \; u_2 \; \ldots \; u_d \,]_{d \times d} \;\Rightarrow\; U_k = [\, u_1 \; u_2 \; \ldots \; u_k \,]_{d \times k}$

•  The projected data is then $y_i = U_k^T x_i$, with $y_i \in \mathbb{R}^k$.
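In practice one often picks k from the spectrum of C, e.g. so that the retained eigenvalues account for most of the variance. A small sketch (the 95% threshold and all names are illustrative choices, not from the slides):

```python
import numpy as np

def project_k(X, k):
    """Project the d x N data matrix X onto its first k principal axes."""
    Xc = X - X.mean(axis=1, keepdims=True)
    lam, U = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    Uk = U[:, :k]                        # d x k matrix of the top-k eigenvectors
    return Uk.T @ Xc, lam                # y_i = Uk^T x_i in R^k

X = np.random.randn(10, 500)
Y, lam = project_k(X, k=2)
explained = np.cumsum(lam) / lam.sum()   # fraction of variance kept by the first k axes
k95 = int(np.searchsorted(explained, 0.95)) + 1   # smallest k retaining ~95% of the variance
```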
PCA algorithm

Dual PCA
•  Let X be the $d \times N$ data matrix $X = [x_1, x_2, \ldots, x_N]$, with $x_i \in \mathbb{R}^d$.
•  The sample covariance can be computed as $C = \frac{1}{N} X X^T$.
•  If $N \ll d$, then it is better to work with $C' = \frac{1}{N} X^T X$:
   •  $C'$ is an $N \times N$ matrix.
   •  Let $C' = U' \Lambda' U'^T$ be the eigen-decomposition of $C'$.
   •  We have $\Lambda = \Lambda'$, i.e. the (non-zero) eigenvalues of C and $C'$ are equal.
   •  We have $u_i = X u'_i$ for all i.
•  Working with $C'$ is computationally less expensive if $N \ll d$.
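A sketch of the dual computation when $N \ll d$. One detail the slide leaves implicit is made explicit here as an assumption: the recovered vectors $X u'_i$ are re-normalised to unit length, and only directions with non-zero eigenvalues are kept.

```python
import numpy as np

d, N = 10000, 50                       # many dimensions, few samples (N << d)
X = np.random.randn(d, N)
X -= X.mean(axis=1, keepdims=True)

C_dual = X.T @ X / N                   # C' = (1/N) X^T X, only N x N
lam, U_dual = np.linalg.eigh(C_dual)
order = np.argsort(lam)[::-1]
lam, U_dual = lam[order], U_dual[:, order]

keep = lam > 1e-10                     # discard near-zero eigenvalues (rank <= N - 1 after centering)
U = X @ U_dual[:, keep]                # u_i = X u'_i : principal axes of the original d x d problem
U /= np.linalg.norm(U, axis=0)         # normalise each recovered eigenvector
```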
PCA algorithm

Connection with SVD

PCA & SVD
There is a direct link between PCA and SVD.

•  Let X be the $d \times N$ data matrix $X = [x_1, x_2, \ldots, x_N]$.
•  The sample covariance can be computed as $C = \frac{1}{N} X X^T$.
•  The eigenvectors of C are the principal components.
•  The SVD of X is given as $X = U \Sigma V^T$,
   where U is an orthogonal $d \times d$ matrix and V is an orthogonal $N \times N$ matrix.
•  The columns of U are eigenvectors of $X X^T$.
•  So, the columns of U are the principal components.
•  The singular values of X are ordered as the eigenvalues of C, since (with $C = \frac{1}{N} X X^T$) $\sigma_i^2 = N \lambda_i$.
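The same principal components can therefore be read off an SVD of the centered data matrix, which is how PCA is often implemented in practice. A small sketch checking the correspondence (random data, illustrative names):

```python
import numpy as np

X = np.random.randn(8, 300)
X -= X.mean(axis=1, keepdims=True)
N = X.shape[1]

# eigen-decomposition of the sample covariance
lam = np.sort(np.linalg.eigvalsh(X @ X.T / N))[::-1]

# SVD of the data matrix: X = U Sigma V^T
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(s**2 / N, lam))      # sigma_i^2 = N * lambda_i
# the columns of U_svd span the same principal directions as the eigenvectors of C (up to sign)
```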

Other facts about PCA

•  It can be shown that the principal axes found as described above (i.e. the matrix U) form the best set of orthogonal basis vectors, in the sense that they minimize the average reconstruction error:

   $U = \underset{W}{\arg\min} \; \frac{1}{N} \sum_{i=1}^{N} \| x_i - W W^T x_i \|$

•  For each data point $x_i$, the projection $y_i = U_k^T x_i$ is the best k-dimensional approximation to $x_i$ (best in the mean-square-error sense).
•  The principal axes are axes of maximum variance.
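A short sketch illustrating the optimal-reconstruction property: reconstruct from the top-k principal axes and compare the mean squared error against a random orthonormal basis (the comparison basis is my own illustrative addition, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 1000))
X -= X.mean(axis=1, keepdims=True)

lam, U = np.linalg.eigh(X @ X.T / X.shape[1])
U = U[:, np.argsort(lam)[::-1]]

def recon_error(W, X):
    """Average squared reconstruction error when projecting onto the columns of W."""
    return np.mean(np.sum((X - W @ (W.T @ X)) ** 2, axis=0))

k = 5
Uk = U[:, :k]
Q, _ = np.linalg.qr(rng.standard_normal((20, k)))    # random orthonormal basis, for comparison
print(recon_error(Uk, X) <= recon_error(Q, X))        # True: the PCA basis gives the smallest error
```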
Algebraic Interpretation
•  Given m points in an n-dimensional space, for large n, how does one project onto a low-dimensional space while preserving broad trends in the data and allowing it to be visualized?
Algebraic Interpretation – 1D

•  Choose a line that fits the data so the points are spread out well along the line.
Algebraic Interpretation – 1D

•  Formally, minimize the sum of squares of the distances from the points to the line.

•  Why the sum of squares? Because it allows fast minimization, assuming the line passes through 0.
Algebraic Interpretation – 1D

•  Minimizing the sum of squares of the distances to the line is the same as maximizing the sum of squares of the projections onto that line, thanks to Pythagoras.
Algebraic Interpretation – 1D

•  How is the sum of squares of projection lengths expressed in algebraic terms?

•  Collect the m points as the columns of an n × m matrix B, and let x be the unit vector along the line. The sum of squared projection lengths is then $x^T B B^T x$.
Algebraic Interpretation – 1D

•  In algebraic terms, the problem is:

   $\max_x \; x^T B B^T x \quad \text{subject to} \quad x^T x = 1$
Algebraic Interpretation – 1D

•  Rewriting this: $x^T B B^T x = e = e\, x^T x = x^T (e x)$, which is equivalent to $x^T (B B^T x - e x) = 0$.

•  One can show that the maximum value of $x^T B B^T x$ is obtained for x satisfying $B B^T x = e x$.

•  So, find the largest e and associated x such that the matrix $B B^T$, when applied to x, yields a new vector which is in the same direction as x, only scaled by a factor e.
Algebraic Interpretation – 1D

•  For a generic vector x, $(B B^T) x$ points in some other direction.

•  x is an eigenvector and e an eigenvalue if $e x = (B B^T) x$.
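A small numerical illustration of this statement, as a sketch: among many random unit vectors, none beats the top eigenvector of $B B^T$ on the sum of squared projections (the data and names are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 200))            # n x m matrix whose columns are the m points

eigvals, eigvecs = np.linalg.eigh(B @ B.T)
x_best = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue e

def ssq_proj(x, B):
    """Sum of squared projection lengths of the columns of B onto the unit vector x."""
    return float(x @ B @ B.T @ x)

random_dirs = rng.standard_normal((3, 1000))
random_dirs /= np.linalg.norm(random_dirs, axis=0)      # random unit vectors
print(all(ssq_proj(x_best, B) >= ssq_proj(random_dirs[:, j], B) for j in range(1000)))  # True
```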
[Figure: 2-D data cloud with the 1st principal component axis, y1, and the 2nd principal component axis, y2, overlaid.]

PCA Scores

[Figure: each sample (xi1, xi2) is represented by its scores Yi,1 and Yi,2 along the principal axes.]

PCA Eigenvalues

[Figure: the eigenvalues λ1 and λ2 measure the spread of the data along each principal axis.]
LDA

•  Recall that PCA does not consider class memberships.

•  We need to define a new criterion to optimize, one that accounts for the class memberships of our training data.
LDA Summary
1. Compute total mean.
2. Compute class means.
3. Compute within-class and between-class
scatter matrices.
4. Solve generalized eigenvector problem.
5. Project data onto lower-dimensional
subspace.
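A minimal sketch of these five steps (two Gaussian classes as toy data; scipy.linalg.eigh is used for the generalized eigenvector problem $S_B v = \lambda S_W v$; all names and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, k):
    """Fisher LDA: project the d-dimensional rows of X onto a k-dimensional subspace."""
    mu = X.mean(axis=0)                                  # 1. total mean
    d = X.shape[1]
    Sw = np.zeros((d, d))                                # within-class scatter
    Sb = np.zeros((d, d))                                # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                           # 2. class means
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # 3. scatter matrices
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    w, V = eigh(Sb, Sw)                                  # 4. generalized eigenproblem Sb v = w Sw v
    W = V[:, np.argsort(w)[::-1][:k]]                    # keep the top-k directions
    return X @ W                                         # 5. project the data

# toy usage: two classes in R^4
X = np.vstack([np.random.randn(50, 4) + 2, np.random.randn(50, 4) - 2])
y = np.array([0] * 50 + [1] * 50)
Z = lda(X, y, k=1)
```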
Object Bag of ‘words’
Features

Many problems and applications in computer vision require extracting "good" features from images/videos:
•  image matching
•  object detection/recognition
•  object tracking
•  3D reconstruction
•  image segmentation
•  classification
•  etc.

We can represent an image in several ways.

How do we find good representations?




Features

Recognition pipeline

[Figure: recognition pipeline, from H. Lee 2010]


Feature Extraction

Hand-designed features
Many highly qualified researchers have spent years designing these features:
SIFT, SURF, HOG, LBP, BRIEF, DAISY, ORB, ...
Some are class- or problem-specific.

Can we find a better representation?

Can we learn the features directly from the data?
How?
This talk

Main objectives
An introduction to an active research area in computer vision and pattern recognition
An overview of the main ideas and principles
Some examples of applications

Expected outcomes
To know a bit more about sparse coding
To think about how to use it in your own work
To get ideas for extensions and/or application domains
[Figure slides: retinal image classification and feature extraction examples]


BoW representation

Sampling strategy

Keypoint detection
Detect a set of keypoints (Harris, SIFT, etc.)
Extract local descriptors around each keypoint
BoW representation

Sampling strategy

Dense sampling
Divide image into local patches
Extract local features from each patch
BoW representation
Clustering/Quantization
For each image $I_i$ we extract a set of low-level descriptors and represent them as a feature matrix $X_i$:

$X_i = [\, f_i^1 \; f_i^2 \; \ldots \; f_i^{N_i} \,],$

where $f_i^1, \ldots, f_i^{N_i}$ are the $N_i$ descriptors extracted from $I_i$.

We then put together all descriptors from all training images to form a big training matrix X:

$X = [\, X_1 \; \ldots \; X_N \,].$

X is a matrix of size $d \times M$, with $M = \sum_{i=1}^{N} N_i$ and d the dimension of the descriptors.
BoW representation
Clustering/Quantization
To simplify the notation, we will just write the set of descriptors from the training images as

$X = [\, f_1 \; f_2 \; \ldots \; f_M \,].$

Create a dictionary by solving the following optimization problem:

$\min_{D} \sum_{m=1}^{M} \min_{k=1,\ldots,K} \| f_m - d_k \|^2,$

where $D = [d_1, \ldots, d_K]$ are the K cluster centers to be found and $\|\cdot\|$ is the L2 norm of vectors.
D is the visual dictionary or codebook.
BoW representation

Clustering/Quantization
The optimization problem

$\min_{D} \sum_{m=1}^{M} \min_{k=1,\ldots,K} \| f_m - d_k \|^2$

is solved iteratively with the K-means algorithm.

K-means
1 Initialize the K centers (randomly)
2 Assign each data point to one of the K centers
3 Update the centers
4 Iterate until convergence
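A sketch of this dictionary-learning step using scikit-learn's KMeans (the choice K = 100 and the random descriptors are placeholders, not values from the slides; scikit-learn expects descriptors as rows, so the codebook is transposed at the end to match the d × K convention used here):

```python
import numpy as np
from sklearn.cluster import KMeans

# all training descriptors stacked as rows (M x d); e.g. SIFT descriptors would give d = 128
M, d, K = 5000, 128, 100
X = np.random.rand(M, d)                       # placeholder for real local descriptors

kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
D = kmeans.cluster_centers_.T                  # d x K visual dictionary / codebook
```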
BoW representation

Clustering/Quantization
The K-means algorithm results in a set of K cluster centers which form the dictionary

$D = [\, d_1 \; d_2 \; \ldots \; d_K \,]_{d \times K}$
BoW representation

Features coding
Given the dictionary D and a set of low-level features $X_i$ from image $I_i$,

$X_i = [\, f_i^1 \; f_i^2 \; \ldots \; f_i^{N_i} \,],$

encode each local descriptor $f_i^l$ using the dictionary D: find $a_l$ such that

$\min_{a_l} \| f_i^l - D a_l \|^2 \quad \text{s.t.} \quad \|a_l\|_0 = 1, \; a_l \succeq 0$
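With $\|a_l\|_0 = 1$ and $a_l \succeq 0$, the code $a_l$ is simply a one-hot indicator of the nearest codeword. A sketch of this hard assignment (vectorised distance computation; the function name and toy sizes are illustrative):

```python
import numpy as np

def encode(Xi, D):
    """Hard-assign each descriptor (column of Xi, d x Ni) to its nearest codeword in D (d x K)."""
    # squared Euclidean distances between every descriptor and every codeword
    dists = (np.sum(Xi**2, axis=0)[:, None]        # ||f||^2
             - 2 * Xi.T @ D                        # -2 f.d
             + np.sum(D**2, axis=0)[None, :])      # ||d||^2
    nearest = np.argmin(dists, axis=1)             # index of the closest codeword
    A = np.zeros((D.shape[1], Xi.shape[1]))        # codes, K x Ni
    A[nearest, np.arange(Xi.shape[1])] = 1.0       # one-hot columns: ||a_l||_0 = 1, a_l >= 0
    return A

# usage: 40 descriptors of dimension 128, dictionary of 100 codewords
A = encode(np.random.rand(128, 40), np.random.rand(128, 100))
```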
BoW representation

Features pooling

The coding of image $I_i$ results in a matrix of codes A,

$A = [\, a_1 \; a_2 \; \ldots \; a_{N_i} \,]_{K \times N_i},$

where each $a_l$ satisfies $\|a_l\|_0 = 1$, $a_l \succeq 0$.

The pooling step transforms A into a single signature vector $\hat{x}_i$:

$\hat{x}_i = \text{pooling}(A)$
BoW representation
Features pooling

A popular choice for pooling is to compute a histogram:

$\hat{x}_i = \frac{1}{N_i} \sum_{l=1}^{N_i} a_l$

The final vector simply encodes the frequency of occurrence of each visual word.
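Continuing the coding sketch above, average pooling of the one-hot codes is just a row mean, which indeed yields a histogram of visual-word frequencies (illustrative sketch):

```python
import numpy as np

def pool(A):
    """Average-pool a K x Ni matrix of one-hot codes into a K-dimensional BoW histogram."""
    return A.mean(axis=1)            # hat{x}_i = (1/Ni) * sum_l a_l

# with the encode() sketch above:
# A = encode(Xi, D)
# x_hat = pool(A)                    # entries sum to 1: frequency of each visual word
```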
BoW representation
Summary: Basic BoW framework
1 Extract a set of local features from all images:

  $X = [\, f_1 \; f_2 \; \ldots \; f_M \,]_{d \times M}$

2 Create a visual dictionary by clustering the set of local features:

  $D = [\, d_1 \; d_2 \; \ldots \; d_K \,]_{d \times K}$

3 Given D, encode each local feature from an image $I_i$ by assigning it to its closest visual word:

  $A = [\, a_1 \; a_2 \; \ldots \; a_{N_i} \,]_{K \times N_i}$

4 Finally, compute the final representation of $I_i$:

  $\hat{x}_i = \frac{1}{N_i} \sum_{l=1}^{N_i} a_l$
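Putting the four steps together, here is a compact end-to-end sketch of the basic BoW pipeline. It assumes some local-descriptor extractor `extract_descriptors`, which is not specified in the slides (SIFT keypoints or dense patches are both compatible with the sampling strategies above); all other names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_signatures(images, extract_descriptors, K=100):
    """Compute a K-bin BoW histogram for each image.

    extract_descriptors(image) must return an (Ni x d) array of local descriptors;
    how they are obtained (SIFT keypoints, dense patches, ...) is left open here.
    """
    # 1. extract local features from all images
    per_image = [extract_descriptors(img) for img in images]
    X = np.vstack(per_image)                               # M x d training matrix

    # 2. visual dictionary by clustering
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

    signatures = []
    for F in per_image:
        # 3. hard-assign each descriptor to its closest visual word
        words = kmeans.predict(F)                          # Ni indices in {0, ..., K-1}
        # 4. pooling: histogram of word occurrences, normalised by Ni
        hist = np.bincount(words, minlength=K) / len(F)
        signatures.append(hist)
    return np.array(signatures)                            # one K-dimensional signature per image
```

The resulting signatures can then be fed to any standard classifier, which is how the BoW representation is typically used for image classification.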
