This article will explore the mathematical theory behind the Principal Component Analysis (PCA) algorithm. Ian Goodfellow's wonderful textbook Deep Learning provides an overview of the algorithm. The rest of the introduction will summarize Goodfellow's text, so that this article is self-contained. The subsequent sections will cover aspects of the algorithm Goodfellow did not, including an almost-rigorous argument that PCA achieves its optimization objective and an implementation in Python.
The problem is as follows: suppose we have a collection of $m$ points $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, each in $\mathbb{R}^n$. We wish to apply lossy compression to these points, that is, map each $x^{(i)} \in \mathbb{R}^n$ to some $c^{(i)} \in \mathbb{R}^l$, with $l < n$. We will use an encoding function $f(x) = c$, and a decoding function $g$ such that $x \approx g(f(x))$.
To make the decoding function simple, let $g(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$. We introduce some constraints on $D$. First, we stipulate that the columns of $D$ are orthogonal, which guarantees that the components of $c$ are uncorrelated. Second, we require that the columns of $D$ have unit norm; together these two constraints imply $D^T D = I_l$. To see why fixing the norm is necessary, suppose we have found an optimal decoding matrix $D^*$. We could then create another optimal decoding matrix by scaling a column of $D^*$ by some number, and then scaling the corresponding component of all $c$'s by the inverse of that number. Thus we need to fix a norm for each column of $D$ in order for there to be a unique optimal decoding matrix (up to permutations of its columns).
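To make this setup concrete, here is a small sketch (not from the original text; the matrix D below is just an arbitrary matrix with orthonormal columns obtained from a QR decomposition, not yet the optimal decoding matrix):

import numpy as np

n, l = 5, 2
# an arbitrary n x l matrix with orthonormal columns, via QR decomposition
D, _ = np.linalg.qr(np.random.randn(n, l))
# the two constraints together are equivalent to D^T D = I_l
print(np.allclose(D.T @ D, np.eye(l)))  # True
# decoding a code c back into R^n
c = np.random.randn(l)
x_hat = D @ c  # g(c) = Dc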
We choose our encoding function $f(x)$ so that it minimizes the Euclidean distance between $x$ and its reconstruction $g(c)$:
$$f(x) = \arg\min_{c} \|x - g(c)\|_2.$$
Then the optimization objective simplifies as follows: expanding $\|x - Dc\|_2^2$ and dropping the $x^T x$ term, which does not depend on $c$, leaves $-2x^T Dc + c^T D^T Dc = -2x^T Dc + c^T c$. Setting the gradient with respect to $c$ to zero gives $c = D^T x$, so the encoding function is $f(x) = D^T x$ and the reconstruction of a point is $r(x) = g(f(x)) = DD^T x$. It remains to choose the decoding matrix $D$ that minimizes the total reconstruction error $\sum_i \|x^{(i)} - r(x^{(i)})\|_2^2$, subject to $D^T D = I_l$.
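As a quick numerical sanity check of this claim, here is a sketch (with an arbitrary orthonormal-column D rather than the optimal one): the code $c = D^T x$ should reconstruct $x$ at least as well as any other code.

import numpy as np

n, l = 5, 2
D, _ = np.linalg.qr(np.random.randn(n, l))  # arbitrary orthonormal columns
x = np.random.randn(n)
c = D.T @ x  # the optimal code, c = D^T x
# any perturbed code should reconstruct x at least as badly
c_other = c + 0.1 * np.random.randn(l)
print(np.linalg.norm(x - D @ c) <= np.linalg.norm(x - D @ c_other))  # True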
2 The Almost-Proof
We first have to do a bit of algebra to simplify our optimization objective. I’m
typesetting this late at night so please forgive any minor errors.
By the definition of matrix multiplication, the entries of $VD$ are given by
$$(VD)_{i,j} = \sum_k V_{i,k}\, d^{(j)}_k,$$
where $d^{(j)}$ denotes the $j$th column of $D$.
Again using the definition of matrix multiplication, along with the above equation, we find that the entries of $D^T V D$ are given by
$$(D^T V D)_{i,j} = \sum_r d^{(i)}_r (VD)_{r,j} = \sum_r \sum_k d^{(i)}_r V_{r,k}\, d^{(j)}_k.$$
$$= \sum_{j=1}^{l-1} d^{(j)T} V d^{(j)} + d^{(l)T} V d^{(l)}. \qquad (1)$$
that corresponds to the highest remaining eigenvalue (i.e., an eigenvector that has not been assigned to one of $d^{(1)}, \ldots, d^{(l-1)}$). This is what was to be shown.
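As an informal numerical check of this conclusion (a sketch, separate from the proof): for a symmetric matrix $V$, the quantity $d^T V d$ over unit vectors $d$ is maximized by the eigenvector with the largest eigenvalue, and the maximum value is that eigenvalue.

import numpy as np

np.random.seed(0)
A = np.random.randn(6, 6)
V = A.T @ A  # a symmetric positive semi-definite matrix, like X^T X
evals, evecs = np.linalg.eigh(V)  # eigenvalues in ascending order
top = evecs[:, -1]  # unit eigenvector with the largest eigenvalue
print(np.isclose(top @ V @ top, evals[-1]))  # d^T V d attains the top eigenvalue
d = np.random.randn(6)
d /= np.linalg.norm(d)  # a random unit vector
print(d @ V @ d <= evals[-1] + 1e-12)  # never exceeds the top eigenvalue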
3 Implementation
The implementation of PCA is almost trivial once the math is done. First, we use the following module to generate some data. The data generated consists of 1000 row vectors, each of length 200. The first 100 components of each vector are randomly generated with distribution $U(0, 1)$. The $(100 + n)$th component is the sum of the $n$th component and a random variable with distribution $N(0, 0.01)$. This is so that the components of the data are correlated. In all code blocks, assume NumPy has been imported as np.
Listing 1: vectorDataGenerator.py
X = np.zeros((1000, 200))
for m in range(1000):
    X[m, :100] = np.random.random(100)
    for n in range(100, 200):
        X[m, n] = X[m, n - 100] + np.random.normal(0, 0.01)
np.save('vectorData', X)
Next, we define a function that takes as inputs X and l and returns the decoding
matrix.
Listing 2: PCA.py
def PrincipalCA(X, l):
    n = X.shape[1]
    D = np.zeros((n, l))
    # eigendecomposition of X^T X (symmetric, so eigh is appropriate)
    evals, evecs = np.linalg.eigh(X.T @ X)
    # eigenvalue indices, sorted from largest to smallest
    sort = np.flip(np.argsort(evals), 0)
    # the columns of D are the eigenvectors with the l largest eigenvalues
    for col in range(l):
        D[:, col] = evecs[:, sort[col]]
    return D
And that's all, folks. All that remains is to test our code. Of course, we must have some measure of performance for our PCA implementation. We use the following:
$$\frac{\sum_i \|x^{(i)} - r(x^{(i)})\|^2}{\sum_i \|x^{(i)}\|^2}.$$
We will call this the relative projection error. It can be interpreted as the
proportion of variance lost during reconstruction, and is especially useful for
choosing l. The following module makes sure our code is working properly and
implements the above performance metric. It assumes that vectorData.npy has
been saved into its directory, and uses l = 100.
Listing 3: PCAtest.py
import PCA

X = np.load('vectorData.npy')
n = X.shape[1]
m = X.shape[0]
l = int(n / 2)
print('n:', n, '\nm:', m, '\nl:', l)
D = PCA.PrincipalCA(X, l)
np.save('decompressor', D)
# reconstruct every row: r(x) = D D^T x
R = X @ D @ D.T
unscaledprojerror = np.sum((X - R) ** 2)
unscaledvariance = np.sum(X ** 2)
relativeprojerror = unscaledprojerror / unscaledvariance
print("Projection error:", relativeprojerror, "of variance")
In practice, l can be much lower than n/2, but we are working with highly
artificial data. The program’s output is shown below.
n: 200
m: 1000
l: 100
Projection error: 0.00824 of variance
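Since the relative projection error was suggested above as a tool for choosing l, here is a sketch of how one might sweep over several values of l (it assumes vectorData.npy and PCA.py from the listings above are available; the exact errors will vary with the random data):

import PCA
X = np.load('vectorData.npy')
unscaledvariance = np.sum(X ** 2)
for l in (10, 50, 100, 150):
    D = PCA.PrincipalCA(X, l)
    R = X @ D @ D.T  # reconstruct every row: r(x) = D D^T x
    err = np.sum((X - R) ** 2) / unscaledvariance
    print(l, err)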