
Eigenfaces with PCA and SVM

Mike Chaykowsky and Richard Mata

Abstract
We used principal component analysis (PCA) and support vector machines (SVM) to reduce the dimensionality
of, and classify, over 13,000 color images of famous figures' faces, and to examine a rudimentary facial
recognition process. PCA uses the eigenvectors of the covariance matrix to re-express data points along the
directions of highest variance. A metric based on a weighted average of precision and recall yielded a score
of 77% when 50 principal components were used, and of the 36 faces given to the model to recognize, 29 were
predicted correctly. The study could be improved by cross-validating the number of principal components.
Introduction
Facial recognition software has become popular in many commercial applications, such as Snapchat's new
face-swapping feature, in which an individual can swap faces with another individual. Although this feature
is fun to use with friends and family, there are other ways facial recognition software has come to play a
role in society, with its biggest use being security systems. The use of facial recognition software for
security systems is similar to that of other biometric software such as fingerprint scanners and eye
recognition software. Although facial recognition software may not be as reliable as fingerprint or eye
recognition, its key advantage is that it does not require an individual's cooperation to work. Facial
recognition simply pulls from a database of facial images and finds the best match possible.
Facial recognition has a wide array of uses in industry; its most common use is in security systems. For
instance, many law enforcement agencies in the United States use facial recognition software in their
forensic investigative work. When used for security, facial recognition software takes a facial image of an
individual, runs it through a database of images, finds a match, and provides an identification of that
individual. Facial recognition has also become popular for commercial uses such as the image-messaging app
Snapchat's face-swapping feature, in which two individuals place their faces within view of the camera;
once the facial recognition software identifies the two faces, it proceeds to swap them with one another.
To better understand facial recognition software using principal component analysis, some terminology first
needs to be covered. PCA-based recognition works with what are known as eigenfaces, the name given to the
set of eigenvectors used for facial recognition. A non-zero vector v is said to be an eigenvector of a linear
transformation T if

$$T(v) = \lambda v$$

where $\lambda$ is a scalar known as the eigenvalue associated with the eigenvector $v$. These eigenfaces are
derived from the covariance matrix of the probability distribution over a vector space of facial images, and
the eigenfaces themselves form a basis for all images used to construct the covariance matrix. The covariance
matrix is a matrix whose $(i, j)$ element is the covariance between the $i$th and $j$th elements of a random
vector $X$, where a random variable is a variable whose value is subject to variation due to chance. That is,
given a random vector

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix},$$

where $X_1, X_2, \ldots, X_n$ are random variables, the $(i, j)$ entry of the covariance matrix is given by

$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)],$$

where

$$E[X_i] = \mu_i$$

is the expected value of the random variable $X_i$. Thus, the covariance matrix is

$$\Sigma = \begin{pmatrix}
\mathrm{Cov}(X_1, X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Cov}(X_2, X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Cov}(X_n, X_n)
\end{pmatrix}
= \begin{pmatrix}
E[(X_1 - \mu_1)(X_1 - \mu_1)] & \cdots & E[(X_1 - \mu_1)(X_n - \mu_n)] \\
E[(X_2 - \mu_2)(X_1 - \mu_1)] & \cdots & E[(X_2 - \mu_2)(X_n - \mu_n)] \\
\vdots & \ddots & \vdots \\
E[(X_n - \mu_n)(X_1 - \mu_1)] & \cdots & E[(X_n - \mu_n)(X_n - \mu_n)]
\end{pmatrix}$$
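As a concrete, minimal illustration of these definitions, the following sketch builds a covariance matrix
entry by entry from simulated samples and checks it against a library estimator. The data here are made up;
this is not part of the image pipeline, and NumPy's np.cov is simply used as a reference implementation.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))    # 1000 observations of a 3-dimensional random vector X

mu = samples.mean(axis=0)               # estimates of E[X_i] for each component
centered = samples - mu

n_obs, n_vars = samples.shape
sigma = np.empty((n_vars, n_vars))
for i in range(n_vars):
    for j in range(n_vars):
        # (i, j) entry: E[(X_i - mu_i)(X_j - mu_j)], estimated with the usual 1/(n-1) convention
        sigma[i, j] = np.sum(centered[:, i] * centered[:, j]) / (n_obs - 1)

# Agrees with NumPy's estimator (columns are treated as variables when rowvar=False)
assert np.allclose(sigma, np.cov(samples, rowvar=False))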

Data
Our data are images, 250 x 250 pixels each, containing faces of famous figures. There are 13,233 images in
the whole set, covering 5,749 different people, 1,680 of whom have more than one image in the set. This is an
important fact because the program learns from the images, and the highest variance will be described in the
direction of the people with the most images. The dataset contains 5 known errors:

Fold 1: Janica_Kostelic_0001, Janica_Kostelic_0002
Fold 1: Nora_Bendijo_0001, Nora_Bendijo_0002
Fold 5: Jim_OBrien_0001, Jim_OBrien_0002
Fold 5: Jim_OBrien_0001, Jim_OBrien_0003
Fold 5: Elisabeth_Schumacher_0001, Elisabeth_Schumacher_0002
We represent each image as a vector of values, where each value corresponds to a pixel's intensity on a scale
from 0 to 255 inclusive. A square N by N image can then be expressed as an $N^2$-dimensional vector,

$$X = (x_1 \; x_2 \; x_3 \; \ldots \; x_{N^2}),$$

where the rows of pixels are simply concatenated to form a one-dimensional representation of the image. We are
using color images, so we need one such vector for each of the three channels in the RGB color space (red,
green, blue). To describe how the data are processed, and to give a brief introduction to visualizing principal
component analysis (PCA), we can take a closer look at one of the images, a photo of Aaron Eckhart.
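A quick sketch of the flattening step just described. The file name is a hypothetical example and the Pillow
and NumPy libraries are assumed; any 250 x 250 RGB image from the dataset would do.

import numpy as np
from PIL import Image

img = Image.open("Aaron_Eckhart_0001.jpg").convert("RGB")   # hypothetical path to one dataset image
pixels = np.asarray(img)                                    # shape (250, 250, 3): rows, columns, RGB

# Concatenate the rows of each color channel into one long vector, then stack the channels,
# giving a single 250 * 250 * 3 = 187,500-dimensional representation of the image.
red, green, blue = pixels[..., 0], pixels[..., 1], pixels[..., 2]
x = np.concatenate([red.reshape(-1), green.reshape(-1), blue.reshape(-1)]).astype(np.float64)
print(x.shape)                                              # (187500,)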
Example

## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.6929 0.36422 0.03701
## Proportion of Variance 0.9553 0.04422 0.00046
## Cumulative Proportion 0.9553 0.99954 1.00000

[Figure: the segmented image plotted as a function of its first two principal components, u (horizontal axis)
and v (vertical axis).]
Here we see our segmented image described graphically as a function of its first two principal components, u
and v. Now that we have seen what PCA can do with one of our training images, let's take a look at how PCA
actually works.
What is PCA?
PCA is a technique that identifies patterns in data and then expresses the data in a way that highlights those
patterns. Identifying patterns in high-dimensional data is a difficult task. Why is this difficult? We cannot
display our data visually, and the higher we go in dimensions, the more uniformly distant each data point is
from every other data point. PCA can be used to reduce this dimensionality.
Math of PCA
So mathematically, what does PCA want? PCA asks whether there is another basis, a linear combination of the
original basis, that re-expresses our data. PCA carries some assumptions with it. One of these is linearity,
which makes the problem simpler by reducing the set of possible bases; linearity also allows us to assume
continuity in a data set. Let $X$ and $Y$ be $m \times n$ matrices related by a linear transformation $P$,
where $X$ is the original data set and $Y$ is a new representation of the data. Then

$$PX = Y,$$

where the $p_i$ are the rows of $P$, the $x_i$ are the columns of $X$, and the $y_i$ are the columns of $Y$,
represents a change of basis. The rows of $P$, $p_1, \ldots, p_m$, are a set of new basis vectors for
expressing the columns of $X$, and each coefficient of $y_i$ is a dot product of $x_i$ with the corresponding
row of $P$. So the $j$th coefficient of $y_i$ is a projection onto the $j$th row of $P$; that is, $y_i$ is a
projection onto the basis $p_1, \ldots, p_m$. Our conclusion here is that the rows of $P$ are a new set of
basis vectors for the columns of $X$. So we must now ask ourselves: how do we find the correct change of
basis? The row vectors $p_1, \ldots, p_m$ become the principal components.
The first step in applying PCA to a dataset is to subtract the mean from each of the data dimensions, where
the mean is the average across each dimension. This produces a data set whose mean is zero.
Our next step is calculating the covariance matrix. PCA is all about variance, so let's look at the idea of
variance for a moment. For two zero-mean sets of measurements $A = \{a_1, \ldots, a_n\}$ and
$B = \{b_1, \ldots, b_n\}$, the variances of $A$ and $B$ are individually defined as

$$\sigma_A^2 = \langle a_i a_i \rangle_i, \qquad \sigma_B^2 = \langle b_i b_i \rangle_i,$$

where the angle brackets denote the average over the $n$ values.


Covariance is much like variance, except instead of measuring how much one variable varies, it measures how
much two variables co-vary. The covariance of $A$ and $B$ is

$$\sigma_{AB}^2 = \langle a_i b_i \rangle_i.$$

But we don't just want $A$ and $B$; we want to be able to handle a big matrix of $m$ row vectors. So say we
make the sets $A$ and $B$ into row vectors, $a = [a_1, a_2, \ldots, a_n]$ and $b = [b_1, b_2, \ldots, b_n]$.
Then

$$\sigma_{ab}^2 \equiv \frac{1}{n-1} \, a b^T,$$

with the leading term for normalization.
Now we can expand our method from just two vectors to $m$ vectors. We rename the row vectors $x_1 \equiv a$,
$x_2 \equiv b$ and consider additional indexed row vectors $x_3, \ldots, x_m$. Recall that we want a covariance
matrix, so we define a new $m \times n$ matrix $X$ in which each row of $X$ corresponds to all measurements of
a particular type ($x_i$) and each column of $X$ corresponds to the set of measurements from one particular
trial. We then call this new matrix the covariance matrix $S_X$,

$$S_X \equiv \frac{1}{n-1} X X^T,$$

whose $ij$th element is the dot product between the vector of the $i$th measurement type and the vector of the
$j$th measurement type (up to the normalization).
Our new matrix $S_X$ gives us correlation information for all possible pairs of measurements, where zero
covariance corresponds to entirely uncorrelated data. In a way, we can view our very high-dimensional data set
as being extremely redundant; we really don't need all of those dimensions to describe most of the
relationships within the data. To reduce this redundancy, we can imagine changing the covariance matrix until
each variable co-varies as little as possible with the other variables. So with this new covariance matrix
$S_Y$, what features would we want to optimize? If we are trying to make the covariances between different
variables zero, then all off-diagonal terms of $S_Y$ must be zero; in other words, this diagonalizes $S_Y$. To
do this, PCA assumes $P$ is an orthonormal matrix, i.e. that all basis vectors $p_1, \ldots, p_m$ are
orthonormal ($p_i \cdot p_j = \delta_{ij}$). The other assumption PCA makes is that the most important
directions are the ones with the largest variances.
This choice, assuming the directions with the highest variances are the most important ones, is another way of
saying they are the principal components. PCA finds the normalized direction along which the variance in $X$
is maximized; this is $p_1$, the first principal component. When it selects its next direction, it must pick
one perpendicular to all previous directions because of the orthonormality condition. We then specify how many
times it does this; it could select directions up to $m$ times in $m$ dimensions.
So we have calculated the covariance matrix, which is $p \times p$, and since it is a square matrix we can
calculate its eigenvectors and eigenvalues (note: these are unit eigenvectors). This extraction of the
eigenvectors of the covariance matrix is what gives us the lines that characterize the data. We then transform
the data to be expressed in terms of those directions. The eigenvector with the greatest eigenvalue is the
first principal component, so naturally we order the eigenvectors by eigenvalue. Now we have all of the
directions ordered from most important to least important, and this is the key to dimensionality reduction: we
can leave out as many of the trailing components as we want, and what remains is the new dimensionality of our
data set. In matrix terms, this is equivalent to choosing the eigenvectors we want to keep as our principal
components, placing them in a matrix of vectors, taking the transpose of this matrix, and multiplying it by
the original data set, transposed. This gives us the original data in terms of the vectors we chose. Our data
are now expressed in terms of the similarities and differences among the data themselves.
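The whole procedure up to this point fits in a few lines. The sketch below uses NumPy on a made-up data matrix
whose rows are observations: it centers the data, takes the eigenvectors of the covariance matrix, orders them
by eigenvalue, and keeps the top k as the rows of P.

import numpy as np

def pca(data, k):
    """Project data (rows = observations, columns = variables) onto its top k principal components."""
    mean = data.mean(axis=0)
    centered = data - mean                      # step 1: subtract the mean of each dimension
    cov = np.cov(centered, rowvar=False)        # step 2: covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix: real eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]           # step 3: sort directions by decreasing variance
    P = eigvecs[:, order[:k]].T                 # rows of P are the top k principal components
    return centered @ P.T, P, mean              # the data re-expressed in the new (truncated) basis

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 10))
scores, P, mean = pca(data, k=3)
print(scores.shape)                             # (200, 3): each observation now described by 3 numbers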
Solving PCA. Assume some data set is an $m \times n$ matrix $X$, where $m$ is the number of measurement types
and $n$ is the number of data trials. The goal is to find some orthonormal matrix $P$ with $Y = PX$ such that
$S_Y \equiv \frac{1}{n-1} Y Y^T$ is diagonalized. The rows of $P$ are the principal components of $X$. We
begin by rewriting $S_Y$ in terms of $P$:
$$S_Y = \frac{1}{n-1} Y Y^T = \frac{1}{n-1} (PX)(PX)^T = \frac{1}{n-1} P (X X^T) P^T = \frac{1}{n-1} P A P^T,$$
where $A \equiv X X^T$ is symmetric. (A symmetric matrix is diagonalized by an orthogonal matrix of its
eigenvectors.) So

$$A = E D E^T,$$

where $D$ is a diagonal matrix and $E$ is a matrix of the eigenvectors of $A$ arranged as columns. We select
the matrix $P$ to be a matrix in which each row $p_i$ is an eigenvector of $X X^T$; by this selection,
$P \equiv E^T$. Substituting, $A = P^T D P$, and then, using $P^{-1} = P^T$,
$$S_Y = \frac{1}{n-1} P A P^T = \frac{1}{n-1} P (P^T D P) P^T = \frac{1}{n-1} (P P^T) D (P P^T)
= \frac{1}{n-1} (P P^{-1}) D (P P^{-1}) = \frac{1}{n-1} D.$$

The choice of $P$ diagonalizes $S_Y$.
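A quick numerical check of this result on random data (a sketch, not part of the experiment): build P from the
eigenvectors of X X^T and confirm that P S_X P^T comes out diagonal.

import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 400
X = rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)        # zero-mean rows, as the derivation assumes

S_X = X @ X.T / (n - 1)                      # covariance matrix of the row variables
eigvals, E = np.linalg.eigh(X @ X.T)         # A = X X^T = E D E^T
P = E.T                                      # rows of P are eigenvectors of A

S_Y = P @ S_X @ P.T                          # should equal D / (n - 1), i.e. be diagonal
off_diagonal = S_Y - np.diag(np.diag(S_Y))
print(np.max(np.abs(off_diagonal)))          # ~0, up to floating-point error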
PCA for Image Compression
When we use PCA for image compression, we are reducing the image to only its most important directions of
variance; however, we sometimes need to be able to reconstruct the image afterwards. For a perfect
reconstruction we must keep all $m$ components when performing PCA, since if we didn't keep all of the
components we would lose some information about the image. To get the original data back, we multiply the
compressed data by the inverse of the matrix of vectors we created and then add back the mean of the original
data that we subtracted in the beginning.
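A minimal sketch of that round trip on stand-in data (random vectors in place of flattened images), keeping
only k of the available components so that the loss from truncation is visible:

import numpy as np

rng = np.random.default_rng(3)
images = rng.normal(size=(100, 64))            # stand-in for flattened image vectors

mean = images.mean(axis=0)
centered = images - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
P = eigvecs[:, np.argsort(eigvals)[::-1]].T    # all 64 components, ordered by decreasing variance

k = 10
compressed = centered @ P[:k].T                # k numbers per image instead of 64
reconstructed = compressed @ P[:k] + mean      # undo the projection (for orthonormal rows the inverse is the
                                               # transpose), then add the mean back

print(np.mean((images - reconstructed) ** 2))  # reconstruction error; shrinks to ~0 as k approaches 64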
In our experiment we have varying numbers of images for each famous figure, where each image is 250 pixels
wide by 250 pixels high. For each image we create an image vector as described above, and then we put all of
the images together in one matrix with each row being an image's vector. Now assume we have such a matrix for
each famous person in our set and have performed PCA, so we have our original data in terms of the
eigenvectors of the covariance matrix. We want to use this new information to perform facial recognition on
one or more images the program has never seen before. The program checks for the differences between the new
image and the original images along the new axes determined by the PCA eigenvectors' directions. This lets the
program check for the most similarities and differences at once, since these are the directions of highest
variance.
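One simple way to realize this comparison is a nearest-neighbour lookup in the reduced space, sketched below
on stand-in data. (Our actual classification uses an SVM on the PCA coordinates, described in the next
section.)

import numpy as np

def project(faces, P, mean):
    """Express flattened face vectors in the coordinates of the PCA eigenvectors (rows of P)."""
    return (faces - mean) @ P.T

def recognize(new_face, train_faces, train_labels, P, mean):
    train_coords = project(train_faces, P, mean)
    new_coords = project(new_face[np.newaxis, :], P, mean)
    distances = np.linalg.norm(train_coords - new_coords, axis=1)   # differences along the PCA directions
    return train_labels[np.argmin(distances)]                       # closest training face wins

# Stand-in data, just to make the sketch executable.
rng = np.random.default_rng(4)
train_faces = rng.normal(size=(20, 64))
train_labels = np.array(["person_%d" % (i % 5) for i in range(20)])
mean = train_faces.mean(axis=0)
P = np.linalg.eigh(np.cov(train_faces - mean, rowvar=False))[1][:, ::-1].T[:10]   # top 10 components
print(recognize(train_faces[0], train_faces, train_labels, P, mean))              # -> person_0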
What are SVMs?
The next part of our method looks at Support Vector Machines (SVMs). So what are SVMs and what role do they
play in facial recognition? SVMs are supervised learning models that use learning algorithms to analyze data
for classification and regression analysis. That is, given a set of labeled training data, the algorithm
builds a hyperplane or hyperplanes with which it categorizes new examples into one category or another. When
used for facial recognition, SVMs classify the features of a presented facial image into subgroups (eyes,
nose, etc.) which can then be matched with other facial images when searching through an image database.
Math of SVMs
So what is the math behind SVMs? In general, SVMs classify data by finding the best hyperplane that separates
the data points of one class from those of the other. A hyperplane is considered best if it has the largest
margin, that is, the maximal width of the slab parallel to the hyperplane that contains no interior data
points. The data points closest to the hyperplane, which sit on the boundary of the slab, are known as support
vectors.

The figure illustrates these concepts: the + and − symbols serve as the data points being separated by the
hyperplane. Each training data point is a vector $x_i \in \mathbb{R}^N$, $i = 1, 2, \ldots, N$, belonging to a
category $y_i \in \{-1, 1\}$. Assuming that the data provided are linearly separable, the goal is to obtain a
maximum margin by finding the hyperplane that maximizes the distance between the support vectors of the two
classes. This best possible hyperplane is known as the optimal separating hyperplane (OSH). The hyperplane
used to separate the data points is given by

$$f(x) = w \cdot x + b,$$

where $b$ is a real number and

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$

By replacing $w$, we get a more generalized form,

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + b.$$

To find the OSH, we must find $w$ and $b$ that minimize $\|w\|$ subject to, for all data points $(x_i, y_i)$,
$i = 1, \ldots, N$,

$$y_i f(x_i) \ge 1.$$

The support vectors $x_i$ that sit on the boundary of the margin are those data points that satisfy

$$y_i f(x_i) = 1.$$

This case of separating data with an OSH is known as using a hard margin.
In the case where most, but not all, of the data are separable by a hyperplane (that is, the data are not
linearly separable), a soft margin is used. In this case, we wish to minimize the expression

$$\left[\frac{1}{N} \sum_{i=1}^{N} \max\bigl(0, 1 - y_i f(x_i)\bigr)\right] + \lambda \|w\|^2,$$

where $\max(0, 1 - y_i f(x_i))$ is the hinge loss function and $\lambda$ is a parameter that determines the
tradeoff between increasing the size of the margin and ensuring that the $x_i$ lie on the correct side of the
margin. Thus, when $\lambda$ is sufficiently small, the soft margin will act like a hard margin.
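To make the objective concrete, the sketch below simply evaluates it for a candidate (w, b) on toy data. The
data and the value of lambda are made up, and this only computes the quantity an SVM solver would minimize; it
is not a solver itself.

import numpy as np

def soft_margin_objective(w, b, X, y, lam):
    """(1/N) * sum of hinge losses  +  lam * ||w||^2, for labels y in {-1, +1}."""
    f = X @ w + b                            # f(x_i) = w . x_i + b for every sample
    hinge = np.maximum(0.0, 1.0 - y * f)     # zero when y_i f(x_i) >= 1, grows linearly otherwise
    return hinge.mean() + lam * np.dot(w, w)

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels that are linearly separable
print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y, lam=0.01))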
To solve the primal problem, we can minimize the above expression by rewriting it as a constrained
optimization problem. That is, for $i = 1, \ldots, N$, we introduce the variable $\zeta_i$, where
$\zeta_i = \max(0, 1 - y_i f(x_i))$; equivalently, $\zeta_i$ is the smallest non-negative number that satisfies

$$y_i f(x_i) \ge 1 - \zeta_i.$$

Thus, we seek to minimize the expression

$$\left[\frac{1}{N} \sum_{i=1}^{N} \zeta_i\right] + \lambda \|w\|^2.$$
To solve the dual problem, we rewrite the expression above as a simpler problem: for $i = 1, \ldots, N$, we
want to maximize

$$f(c_1, \ldots, c_N) = \sum_{i=1}^{N} c_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i c_i (x_i \cdot x_j) y_j c_j$$

subject to $0 \le c_i \le \frac{1}{2N\lambda}$ for every $i$, and

$$\sum_{i=1}^{N} c_i y_i = 0.$$

The variables $c_i$ are defined through

$$w = \sum_{i=1}^{N} c_i y_i x_i.$$

A point $x_i$ lies strictly on the correct side of the margin when $c_i = 0$, and $x_i$ lies on the boundary
of the margin (i.e., $x_i$ is a support vector) when $0 < c_i \le \frac{1}{2N\lambda}$.
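In practice these quantities do not need to be computed by hand; a library solver exposes them after fitting.
Below is a small sketch using scikit-learn's SVC on toy data; the attribute names are scikit-learn's, and
dual_coef_ stores the signed coefficients y_i c_i for the support vectors.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2.0, 1.0, size=(30, 2)),   # cluster for class -1
               rng.normal(2.0, 1.0, size=(30, 2))])   # cluster for class +1
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_.shape)           # only points on or inside the margin become support vectors
print(clf.dual_coef_)                       # signed coefficients y_i c_i for those support vectors
w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i c_i y_i x_i
print(w, clf.intercept_)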
SVMs used to classify dimensionally-reduced images
All in all, SVMs are used to solve a two-class pattern recognition problem. When presented with a
high-dimensional image, SVMs make use of a kernel function so that the dataset can be made linearly separable;
when presented with a dimensionally-reduced image, SVMs can separate the dataset linearly. The SVMs classify
the features of the individuals into two classes, similarities and dissimilarities. When a presented image
contains similarities to an image in the database, a match is made and an identity for the individual is
returned.
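Putting the two pieces together, the classification in this paper can be approximated by a pipeline of the
following kind, sketched here with scikit-learn and the cropped Labeled Faces in the Wild loader it ships
with. The hyperparameter values are illustrative assumptions, not necessarily the ones used in our runs.

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Download a cropped, resized version of the Labeled Faces in the Wild images.
lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X_train, X_test, y_train, y_test = train_test_split(
    lfw.data, lfw.target, test_size=0.25, random_state=42)

# Reduce each flattened image to 50 principal components, then classify with an RBF-kernel SVM.
model = make_pipeline(
    PCA(n_components=50, whiten=True, random_state=42),
    SVC(kernel="rbf", C=1000.0, gamma=0.005, class_weight="balanced"),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))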

Results
This example of PCA is exciting because we have a visual representation of what happened to our photos. The
eigenfaces figure below shows how PCA applied to all ~13,000 images has created a kind of compilation photo of
all of the famous people, where the most prevalent figures have the most impact on the first eigenface. The
first eigenface explains 21.7% of the variance and the second component explains 13.6%.
Eigenfaces

Below is a list of 19 figures used in the prediction process along with the model's precision and recall, as
well as another metric, the f1-score, which can be interpreted as a weighted average of precision and recall;
an f1-score reaches its best value at 1 and its worst at 0, and the relative contributions of precision and
recall to the f1-score are equal. The formula for the f1-score is

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

Support is the number of images used for each figure during classification. The greatest f1-score came with
Gloria Macapagal Arroyo, and the most comical prediction was that the model thought former Governor of
Missouri John Ashcroft was former Governor of California Arnold Schwarzenegger.

                              precision    recall    f1-score    support

Ariel Sharon                       0.73      0.76        0.74         21
Arnold Schwarzenegger              0.33      0.67        0.44          6
Colin Powell                       0.76      0.75        0.75         67
Donald Rumsfeld                    0.68      0.92        0.78         25
George W Bush                      0.83      0.85        0.84        138
Gerhard Schroeder                  0.86      0.74        0.79         34
Gloria Macapagal Arroyo            0.91      1.00        0.95         10
Hugo Chavez                        0.62      0.76        0.68         17
Jacques Chirac                     0.50      0.50        0.50         10
Jean Chretien                      0.89      0.57        0.70         14
Jennifer Capriati                  0.80      0.57        0.67         14
John Ashcroft                      0.70      0.64        0.67         11
Junichiro Koizumi                  0.92      0.71        0.80         17
Laura Bush                         1.00      0.71        0.83         14
Lleyton Hewitt                     0.44      0.67        0.53          6
Luiz Inacio Lula da Silva          0.70      0.64        0.67         11
Serena Williams                    1.00      0.38        0.56         13
Tony Blair                         0.59      0.74        0.66         27
Vladimir Putin                     0.55      0.50        0.52         12

150 comps: avg / total             0.77      0.75        0.75        467
 50 comps: avg / total             0.78      0.77        0.77        467
250 comps: avg / total             0.76      0.75        0.75        467
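A report in the format above can be produced directly from a fitted model's predictions. The sketch below
continues the pipeline sketch from the previous section (model, X_test, y_test, and lfw come from there);
classification_report and f1_score are scikit-learn utilities.

from sklearn.metrics import classification_report, f1_score

y_pred = model.predict(X_test)                # predicted identities for the held-out images
print(classification_report(y_test, y_pred, target_names=lfw.target_names))

# The weighted average reported in the "avg / total" rows of the table:
print(f1_score(y_test, y_pred, average="weighted"))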


What we can see from the results above is that when the number of principal components was reduced from 150
to 50, the f1-score actually improved. A possible reason for this is overfitting: when the model is allowed to
fit the training data too closely, it will only be able to predict the images it trained on, which is not
helpful at all. When we reduce the number of components to 50, we see improved prediction, precision, and
recall. With 150 components the model predicted 28 of the 36 images correctly, and with 50 components it
predicted 29 of the 36 correctly (see below for the 150-component results). With 250 components the model
makes 10 errors, predicting only 26 of the 36 correctly.
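The improvement suggested in the abstract and conclusion, choosing the number of components by
cross-validation rather than by hand, could be sketched as follows, again continuing the earlier pipeline
sketch; the candidate component counts are arbitrary.

from sklearn.model_selection import GridSearchCV

# make_pipeline names the PCA step "pca", so its parameters are addressed as "pca__...".
param_grid = {"pca__n_components": [25, 50, 100, 150, 250]}
search = GridSearchCV(model, param_grid, scoring="f1_weighted", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)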
Predictions

Conclusion
So why is dimensionality reduction important? Dimensionality reduction proves to be important when the number
of features is non-negligible compared to the number of training samples. In facial recognition we would like
to determine the identity of the person depicted in an image based on a training dataset of labeled facial
images; if we operate in a high-dimensional space, this can prove problematic when only a small number of
training samples is available. Hence, using PCA to reduce the dimensionality of the images makes it easier for
the SVMs to classify them.
Although facial recognition using PCA and SVMs has proven popular, there are other methods that can perform
better. For instance, 3D recognition has emerged as a trend. This method uses 3D sensors to capture
information about an individual's face; the information collected is then used to identify distinctive
features of the face, such as the contours of the eye sockets, nose, and chin. This method is not affected by
lighting changes and can be used to identify a face from a variety of angles, including a profile view.
All in all, facial recognition software has come a long way since it was first introduced. Differing methods
have been used to improve its results, with PCA and SVMs being among the most popular. With PCA, images can be
dimensionally reduced, and SVMs can then be used to classify these dimensionally-reduced images and identify
the individual in each image.

