
1 Introduction

This article will explore the mathematical theory behind the Principal Component Analysis (PCA) algorithm. Ian Goodfellow's wonderful textbook Deep Learning provides an overview of the algorithm. The rest of the introduction will summarize Goodfellow's text, so that this article is self-contained. The subsequent sections will cover aspects of the algorithm Goodfellow did not, including an almost-rigorous argument that PCA achieves its optimization objective and an implementation in Python.
The problem is as follows: suppose we have a collection of $m$ points $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, each in $\mathbb{R}^n$. We wish to apply lossy compression to these points, that is, map each $x^{(i)} \in \mathbb{R}^n$ to some $c^{(i)} \in \mathbb{R}^l$, with $l < n$. We will use an encoding function $f(x) = c$ and a decoding function $g$ such that $x \approx g(f(x))$. To make the decoding function simple, let $g(c) = Dc$, where $D \in \mathbb{R}^{n \times l}$. We introduce some constraints on $D$. First, we stipulate that the columns of $D$ are orthogonal to each other, which guarantees that the components of $c$ are uncorrelated. Second, we require that the columns of $D$ have unit norm; together, these two constraints imply that $D^\top D = I_l$. To see why the unit-norm constraint is necessary, suppose we have found an optimal decoding matrix $D^*$. We could then create another optimal decoding matrix by scaling a column of $D^*$ by some number and scaling the corresponding component of all the $c$'s by the inverse of that number. Thus we need to fix a norm for each column of $D$ in order for there to be a unique optimal decoding matrix (up to permutations of its columns).
We choose our encoding function $f(x)$ by picking the code $c$ that minimizes the Euclidean distance between $x$ and its reconstruction $g(c)$:
\[
f(x) = \arg\min_c \| x - g(c) \|_2 .
\]

It can be shown that this implies
\[
f(x) = D^\top x .
\]
So our encoding function is as simple as our decoding function. We will call the following function the reconstruction function:
\[
r(x) = g(f(x)) = D D^\top x .
\]
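
As a concrete illustration of these three functions (a sketch of my own, not part of Goodfellow's treatment), the snippet below builds a hypothetical decoding matrix D with orthonormal columns from a QR factorization and implements the encoder, decoder, and reconstruction function in NumPy:

import numpy as np

n, l = 5, 2

# Hypothetical decoding matrix with orthonormal columns (so D^T D = I_l),
# taken from the QR factorization of a random n-by-l matrix.
D, _ = np.linalg.qr(np.random.randn(n, l))

def f(x):
    # Encoder: c = D^T x
    return D.T.dot(x)

def g(c):
    # Decoder: x is approximated by D c
    return D.dot(c)

def r(x):
    # Reconstruction: r(x) = D D^T x
    return g(f(x))

x = np.random.randn(n)
print(np.allclose(r(x), D.dot(D.T).dot(x)))  # expect True
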
Define the optimal decoding matrix $D^*$ to be the one that minimizes the Frobenius norm of the matrix containing the reconstruction errors:
\[
D^* = \arg\min_D \sqrt{ \sum_{i,j} \left( x^{(i)}_j - r(x^{(i)})_j \right)^2 } .
\]

To make our notation more concise, we define
\[
X =
\begin{bmatrix}
{x^{(1)}}^\top \\
{x^{(2)}}^\top \\
\vdots \\
{x^{(m)}}^\top
\end{bmatrix} .
\]

Then the optimization objective simplifies to
\[
D^* = \arg\min_D \| X - X D D^\top \|_F .
\]

Goodfellow shows that in the case $l = 1$, the above can be simplified to $\arg\max_D D^\top X^\top X D$ subject to $D^\top D = 1$. Since $X^\top X$ is symmetric and has real entries, and $D$ has unit norm, $D^*$ is the eigenvector corresponding to the maximum eigenvalue of $X^\top X$. Goodfellow claims, and leaves to the reader to verify, that in general $D^*$ is the matrix whose columns are the $l$ eigenvectors corresponding to the $l$ largest eigenvalues of $X^\top X$. The next section is devoted to an almost-rigorous argument that this is true.
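
Before the argument, here is a rough numerical check of the $l = 1$ base case (a sketch of my own, using an arbitrary randomly generated data matrix): the quadratic form $d^\top X^\top X d$ evaluated at the top unit eigenvector should be at least as large as at any random unit vector.

import numpy as np

np.random.seed(0)
X = np.random.randn(50, 5)        # small toy data matrix
V = X.T.dot(X)

# Unit eigenvector of the largest eigenvalue (eigh sorts eigenvalues ascending)
evals, evecs = np.linalg.eigh(V)
d_star = evecs[:, -1]

best_random = 0.0
for _ in range(1000):
    d = np.random.randn(5)
    d /= np.linalg.norm(d)        # random unit vector
    best_random = max(best_random, d.dot(V).dot(d))

print(d_star.dot(V).dot(d_star) >= best_random)   # expect True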

2 The Almost-Proof
We first have to do a bit of algebra to simplify our optimization objective.

\[
\begin{aligned}
D^* &= \arg\min_D \| X - X D D^\top \|_F \\
    &= \arg\min_D \operatorname{Tr}\!\left( (X - X D D^\top)(X - X D D^\top)^\top \right) \\
    &= \arg\min_D \operatorname{Tr}\!\left( (X - X D D^\top)(X^\top - D D^\top X^\top) \right) \\
    &= \arg\min_D \operatorname{Tr}\!\left( X X^\top - 2\, X D D^\top X^\top + X D D^\top D D^\top X^\top \right) .
\end{aligned}
\]

Noticing that $X X^\top$ does not depend on $D$ and that $D^\top D = I_l$, we obtain

\[
\begin{aligned}
D^* &= \arg\min_D \operatorname{Tr}\!\left( -X D D^\top X^\top \right) \\
    &= \arg\max_D \operatorname{Tr}\!\left( X D D^\top X^\top \right) \\
    &= \arg\max_D \operatorname{Tr}\!\left( D^\top X^\top X D \right) ,
\end{aligned}
\]

where the last step uses the cyclic property of the trace.

To simplify our notation, let $V = X^\top X$. As previously stated, $V$ is symmetric and has real entries.
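
As a quick numerical sanity check on the algebra above (my own sketch, with an arbitrary random $X$ and a random $D$ with orthonormal columns), the squared Frobenius objective should equal $\operatorname{Tr}(X X^\top) - \operatorname{Tr}(D^\top V D)$:

import numpy as np

np.random.seed(1)
X = np.random.randn(20, 6)
V = X.T.dot(X)

# Random D with orthonormal columns (D^T D = I_l), here with l = 3
D, _ = np.linalg.qr(np.random.randn(6, 3))

lhs = np.linalg.norm(X - X.dot(D).dot(D.T), 'fro') ** 2
rhs = np.trace(X.dot(X.T)) - np.trace(D.T.dot(V).dot(D))

print(np.isclose(lhs, rhs))   # expect True

In other words, minimizing the Frobenius objective over matrices with orthonormal columns is the same as maximizing $\operatorname{Tr}(D^\top V D)$.
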
We proceed by inducting on $l$. The base case $l = 1$ was established in the introduction. Suppose the matrix $D^* \in \mathbb{R}^{n \times (l-1)}$ whose columns are the eigenvectors corresponding to the largest $l - 1$ eigenvalues of $V$ maximizes $\operatorname{Tr}(D^\top V D)$ subject to $D^\top D = I_{l-1}$. Let $D'$ be the matrix in $\mathbb{R}^{n \times l}$ that maximizes $\operatorname{Tr}(D^\top V D)$ subject to $D^\top D = I_l$. We will denote the $j$th column of $D'$ by $d^{(j)}$. By the definition of matrix multiplication, the entries of $V D'$ are given by the following:

\[
(V D')_{i,j} = \sum_k V_{i,k} \, d^{(j)}_k .
\]

Again using the definition of matrix multiplication, along with the above equation, we find that the entries of $D'^\top V D'$ are given by
\[
(D'^\top V D')_{i,j} = \sum_r d^{(i)}_r (V D')_{r,j}
                     = \sum_r \sum_k d^{(i)}_r \, V_{r,k} \, d^{(j)}_k .
\]

Therefore the diagonal entries of $D'^\top V D'$ are given by
\[
(D'^\top V D')_{j,j} = \sum_r \sum_k d^{(j)}_r \, V_{r,k} \, d^{(j)}_k .
\]

We can simplify this into a quadratic form:
\[
(D'^\top V D')_{j,j} = {d^{(j)}}^\top V \, d^{(j)} .
\]
Thus the trace of $D'^\top V D'$ is
\[
\operatorname{Tr}(D'^\top V D') = \sum_j {d^{(j)}}^\top V \, d^{(j)}
= \sum_{j=1}^{l-1} {d^{(j)}}^\top V \, d^{(j)} + {d^{(l)}}^\top V \, d^{(l)} . \tag{1}
\]

Here is where we must rely on an unproven assumption. I claim that we can maximize (1) by first maximizing the first sum, and then maximizing the final term over the remaining possibilities for $d^{(l)}$. This is not trivially equivalent to maximizing the entire expression, because the choices of $d^{(1)}, \ldots, d^{(l-1)}$ that maximize the first sum place additional constraints on $d^{(l)}$, which may prevent the entire expression from being maximized. I am sure that someone with a better background in linear algebra than I have could resolve this; perhaps I will when I take the class.
Assume the claim made in the previous paragraph is true. By our inductive hypothesis, the first sum in (1) is maximized by choosing $d^{(1)}, \ldots, d^{(l-1)}$ to be the eigenvectors of $V$ corresponding to the $l - 1$ largest eigenvalues. By our constraints on $D'$, $d^{(l)}$ must have unit norm and be orthogonal to $d^{(1)}, \ldots, d^{(l-1)}$. This implies that $d^{(l)}$ must be one of the remaining $n - l + 1$ unit eigenvectors of $V$. To see why, note that there can be at most $n - l + 1$ mutually orthogonal unit vectors in $\mathbb{R}^n$ orthogonal to $d^{(1)}, \ldots, d^{(l-1)}$, and that there are $n - l + 1$ remaining unit eigenvectors of $V$, all of which are orthogonal to one another and to $d^{(1)}, \ldots, d^{(l-1)}$. Since $e^\top V e$ equals the corresponding eigenvalue whenever $e$ is a unit eigenvector of $V$, the choice of $d^{(l)}$ that maximizes (1) and satisfies our constraints is the eigenvector of $V$ that corresponds to the largest remaining eigenvalue (i.e., an eigenvector that has not already been assigned to one of $d^{(1)}, \ldots, d^{(l-1)}$). This is what was to be shown.
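
For reassurance, here is a small numerical check of the general claim (again a sketch of my own, not part of the argument above): for random data, the matrix of top-$l$ eigenvectors should achieve a trace at least as large as any randomly generated matrix with orthonormal columns.

import numpy as np

np.random.seed(2)
X = np.random.randn(100, 8)
V = X.T.dot(X)
l = 3

# Columns of D_star: eigenvectors of the l largest eigenvalues of V
evals, evecs = np.linalg.eigh(V)      # eigh sorts eigenvalues ascending
D_star = evecs[:, -l:]
trace_star = np.trace(D_star.T.dot(V).dot(D_star))

# Compare against random matrices with orthonormal columns (via QR)
best_random = 0.0
for _ in range(1000):
    Q, _ = np.linalg.qr(np.random.randn(8, l))
    best_random = max(best_random, np.trace(Q.T.dot(V).dot(Q)))

print(trace_star >= best_random)   # expect True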

3 Implementation
The implementation of PCA is almost trivial once the math is done. First, we use the following module to generate some data. The data generated consists of 1000 row vectors, each of length 200. The first 100 components of each vector are randomly generated with distribution $U(0, 1)$. The $(100 + n)$th component is the sum of the $n$th component and a random variable drawn from a normal distribution with mean 0 and standard deviation 0.01. This is so that the components of the data are correlated. In all code blocks, assume NumPy has been imported as np.

Listing 1: vectorDataGenerator.py

# Generate 1000 row vectors of length 200 with correlated components
X = np.zeros((1000, 200))

for m in range(1000):
    # First 100 components: uniform on [0, 1)
    X[m, :100] = np.random.random(100)
    # Last 100 components: copies of the first 100 plus Gaussian noise
    for n in range(100, 200):
        X[m, n] = X[m, n - 100] + np.random.normal(0, 0.01)

np.save('vectorData', X)

Next, we define a function that takes as inputs X and l and returns the decoding
matrix.

Listing 2: PCA.py
def PrincipalCA(X, l):
    n = X.shape[1]

    D = np.zeros((n, l))

    # Eigendecomposition of X^T X
    evals, evecs = np.linalg.eig((X.T).dot(X))

    # Indices that sort the eigenvalues from largest to smallest
    sort = np.flip(np.argsort(evals), 0)

    # The columns of D are the eigenvectors of the l largest eigenvalues
    for col in range(l):
        D[:, col] = evecs[:, sort[col]]

    return D

And that's all, folks. All that remains is to test our code. Of course, we must have some measure of performance for our PCA implementation. We use the following:
\[
\frac{\sum_i \| x^{(i)} - r(x^{(i)}) \|_2}{\sum_i \| x^{(i)} \|_2} .
\]

We will call this the relative projection error. It can be interpreted as the
proportion of variance lost during reconstruction, and is especially useful for
choosing l. The following module makes sure our code is working properly and
implements the above performance metric. It assumes that vectorData.npy has
been saved into its directory, and uses l = 100.

Listing 3: PCAtest.py
import PCA

X = np.load('vectorData.npy')

n = X.shape[1]
m = X.shape[0]
l = int(n / 2)

print('n:', n, '\nm:', m, '\nl:', l)

# Sum of the Euclidean norms of the original data vectors
unscaledvariance = np.sum(np.sqrt(np.sum(X**2, axis=1)))

D = PCA.PrincipalCA(X, l)

np.save('decompressor', D)

# Reconstruction errors x^(i) - r(x^(i)), stored as rows
diffmatrix = X - D.dot((D.T).dot(X.T)).T

# Sum of the Euclidean norms of the reconstruction errors
unscaledprojerror = np.sum(np.sqrt(np.sum(diffmatrix**2, axis=1)))

relativeprojerror = unscaledprojerror / unscaledvariance

print("Projection error:", relativeprojerror, "of variance")

In practice, l can be much lower than n/2, but we are working with highly
artificial data. The program’s output is shown below.

n: 200
m: 1000
l: 100
Projection error: 0.00824 of variance

This is satisfactory performance; 0.01 is usually considered to be the highest acceptable projection error.
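
Since the relative projection error is mainly useful for choosing $l$, a natural follow-up (my own sketch, not part of the listings above) is to sweep over candidate values of $l$ and keep the smallest one whose error falls below some assumed threshold, reusing PrincipalCA from Listing 2:

import numpy as np
import PCA

X = np.load('vectorData.npy')
norm_X = np.sum(np.sqrt(np.sum(X**2, axis=1)))

threshold = 0.01   # assumed maximum acceptable relative projection error

# Try l = 10, 20, ..., 200 and stop at the first value below the threshold
for l in range(10, X.shape[1] + 1, 10):
    D = PCA.PrincipalCA(X, l)
    diffmatrix = X - D.dot((D.T).dot(X.T)).T
    error = np.sum(np.sqrt(np.sum(diffmatrix**2, axis=1))) / norm_X
    if error < threshold:
        print('smallest tested l:', l, 'relative projection error:', error)
        break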
