
Irene Hoffmann

Seminar 107.424
Winter semester 2010/11

Principal Component Analysis with FactoMineR

1 Introduction
Principal Component Analysis (PCA) is a procedure based on a linear transformation of a
data set into a new coordinate system. The aim of PCA is to reduce the dimensionality
of this data set while losing as little information as possible. Information is directly
related to variability, so PCA looks for the direction of maximum variance of all variables.
This direction of maximum variance defines the first principal component, i.e. the first
dimension of the new coordinate system. The second principal component has to be
orthogonal (i.e. uncorrelated) to the first and must capture most of the remaining
variability. Following this approach we get an orthogonal basis of the vector space of the
data set, where each dimension holds more information than the following one.

The reason why it is important to reduce the number of variables of a data set is that many
multivariate data analysis procedures (e.g. cluster analysis, multiple regression) cannot
handle too many variables (too many explanatory variables in comparison to the number of
observations), or the variables in a given data set are highly correlated.

FactoMineR is an R package which provides functions for multivariate data analysis,
with a particular focus on the structure of the data (whether there is a partition or
a hierarchy of the variables, or a partition of the observations) and the type of the
variables (quantitative or categorical). A special focus also lies on supplementary
information, that is information which is added after the analysis. To make the use of
FactoMineR more comfortable, a graphical user interface is implemented in the Rcmdr
environment.

For quantitative variables PCA is used, and we will concentrate on this case here. For
contingency tables Correspondence Analysis is implemented, for categorical variables
Multiple Correspondence Analysis, and many more methods can be found in this package,
such as Multiple Factor Analysis or Hierarchical Multiple Factor Analysis.

Although PCA is an analysis for continuous variables, the functions of this package can
also handle data sets which include categorical variables. To get a better idea of what this
means we will have a look at the data set decathlon, available in the package FactoMineR.

> library(FactoMineR)
> data(decathlon)
> str(decathlon)

'data.frame': 41 obs. of 13 variables:
$ 100m : num 11 10.8 11 11 11.3 ...
$ Long.jump : num 7.58 7.4 7.3 7.23 7.09 7.6 7.3 7.31 6.81 7.56 ...
$ Shot.put : num 14.8 14.3 14.8 14.2 15.2 ...
$ High.jump : num 2.07 1.86 2.04 1.92 2.1 1.98 2.01 2.13 1.95 1.86 ...
$ 400m : num 49.8 49.4 48.4 48.9 50.4 ...
$ 110m.hurdle: num 14.7 14.1 14.1 15 15.3 ...
$ Discus : num 43.8 50.7 49 40.9 46.3 ...
$ Pole.vault : num 5.02 4.92 4.92 5.32 4.72 4.92 4.42 4.42 4.92 4.82 ...
$ Javeline : num 63.2 60.1 50.3 62.8 63.4 ...
$ 1500m : num 292 302 300 280 276 ...
$ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ Points : int 8217 8122 8099 8067 8036 8030 8004 7995 7802 7733 ...
$ Competition: Factor w/ 2 levels "Decastar","OlympicG": 1 1 1 1 1 1 1 1 1 1 ...

This data set contains 13 variables; columns 1 to 12 correspond to the performance of
the athletes during the Olympic Games 2004 and the Decastar 2004. The last variable is a
categorical variable and indicates the respective athletic competition.

2 Mathematical Background
The setting will be the following:

x = (x1, ..., xp)T a p-dimensional random vector,
µ = (µ1, ..., µp)T the mean vector of x,
Σ = E[(x − µ)(x − µ)T] the covariance matrix,
Γ = (γ1, ..., γp) an orthogonal (p × p)-matrix with γiT γi = 1.

The linear transformation of x

z = ΓT (x − µ)

results in a new random vector z. Finding such a transformation matrix Γ as described
above, which maximizes the variance of zi, can be traced back to an eigenvalue problem
in the following way. The variance of zi for i = 1, ..., p is

Var(zi) = γiT Var(x − µ) γi = γiT Σ γi.

With the method of Lagrange multipliers we can maximize this variance under the
restriction γiT γi = 1:

φi = γiT Σ γi − ai (γiT γi − 1)

dφi/dγi = 2 Σ γi − 2 ai γi = 0
⇔ Σ γi = ai γi
⇔ Σ Γ = Γ A

with A = diag(a1, ..., ap). For this eigenvalue problem, the γi are the eigenvectors of Σ and
the ai are the eigenvalues with a1 ≥ a2 ≥ ... ≥ ap. Because Γ is orthogonal, the covariance
matrix can be written as Σ = ΓAΓT.

Now we have the linear transformation z = ΓT (x − µ) with the following properties:

E(z) = ΓT E(x − µ) = 0

Cov(z) = ΓT Cov(x − µ) Γ = ΓT Σ Γ = A = diag(a1, ..., ap)

So the expected value equals zero, the variance of zi is ai, and for all i ≠ j, zi is
uncorrelated with zj.

In classical PCA, the mean vector and the covariance matrix of x are estimated in the
classical way, i.e. for a data matrix X ∈ Rn×p with rows xi· (the observations),

µ̂ = x̄ = (1/n) ∑_{i=1}^{n} xi·

Σ̂ = (1/(n−1)) ∑_{i=1}^{n} (xi· − x̄)(xi· − x̄)T.

The actual values of the principal components, i.e. the matrix Z calculated from the data
matrix X, are usually called scores. The orthogonal matrix Γ is referred to as the loadings
matrix and describes the relation between x and z, i.e. the influence of xi on zj is described
by γij.
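As a small illustration, this eigendecomposition can be reproduced directly in R with eigen(); the following is only a sketch, using the quantitative decathlon variables from above, and the object names X, Gamma and Z are ad hoc choices:

library(FactoMineR)
data(decathlon)

X  <- as.matrix(decathlon[, 1:10])            # one observation per row
Xc <- scale(X, center = TRUE, scale = FALSE)  # subtract the mean vector

Sigma.hat <- cov(Xc)          # classical covariance estimate
eig   <- eigen(Sigma.hat)     # Sigma = Gamma A Gamma^T
Gamma <- eig$vectors          # loadings, columns are the gamma_i
a     <- eig$values           # eigenvalues a_1 >= ... >= a_p

Z <- Xc %*% Gamma             # scores: z = Gamma^T (x - mu) for every row

## the column variances of the scores reproduce the eigenvalues
round(apply(Z, 2, var) - a, 10)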

3 Before Applying PCA


• Scaling
PCA is not invariant to scaling. The matrix Σ is obviously influenced by the variances
of the xi. Therefore it is reasonable to standardize the variables before applying PCA,
especially when the variables have clearly different ranges (see the sketch after this list).
• Normal Distribution
There are no distributional requirements for using PCA in general, but the estimated
covariance matrix is sensitive to non-normally distributed data. Depending on the
data set, a log transformation of certain variables can be helpful to obtain approximate
normality. Notice that normality is a requirement for some tests that are performed in
the context of PCA.

• Outliers
One way to deal with outliers is to use robust PCA. In this case we look only at the
major part of the data and fit the principal components to this majority. If the extreme
values are the data points of special interest, however, this approach would be misleading;
one possible method is then to downweight these values, so that their influence on the
analysis is not too high.
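A minimal sketch (not part of the original analysis) of why scaling matters for the decathlon data: without scale.unit = TRUE the PCA works on the covariance matrix, and variables measured on a large scale, such as the 1500m times, dominate the first component.

library(FactoMineR)
data(decathlon)

## the raw variances differ strongly, e.g. 1500m (seconds) vs. High.jump (metres)
apply(decathlon[, 1:10], 2, var)

res.raw    <- PCA(decathlon[, 1:10], scale.unit = FALSE, graph = FALSE)
res.scaled <- PCA(decathlon[, 1:10], scale.unit = TRUE,  graph = FALSE)

res.raw$eig[1:3, ]      # variance concentrates on the large-scale variables
res.scaled$eig[1:3, ]   # eigenvalues of the correlation matrix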

4 Choosing the Number of Components


The total variance of the data set, ∑_{i=1}^{p} ai, is only completely described by all p
principal components. Because the last principal component has the smallest eigenvalue,
we lose the least variance by leaving the last component out; doing so, we lose one
dimension. The question is now how many components (and thus how many dimensions)
are necessary to represent the given data set X appropriately, i.e. how many of the last
components we can leave out without losing too much information about the structure of
our data set.

Many different approaches to choose a reasonable number of components are given in the
literature. A statistically well-founded method is to test whether, at a given significance
level, the last p − k eigenvalues are equal, i.e. ap = ap−1 = ... = ak+1. Starting with k = 0,
we can increase k as long as the null hypothesis is rejected. For this the test statistic

(n − (2p + 11)/6) (p − k) ln(ma/mg) ∼ χ²_{(p−k+2)(p−k−1)/2}

is used, with

ma = (âk+1 + ... + âp)/(p − k)  the arithmetic mean and
mg = (âk+1 · ... · âp)^{1/(p−k)}  the geometric mean of the last p − k estimated eigenvalues.

Normal distribution of the data is required for this test.
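A hedged sketch of this test in R (the helper name equal.eig.test and the use of the standardized decathlon variables are illustrative assumptions, not part of the original text):

library(FactoMineR)
data(decathlon)

X <- scale(decathlon[, 1:10])
a <- eigen(cov(X))$values          # estimated eigenvalues, decreasing
n <- nrow(X); p <- ncol(X)

equal.eig.test <- function(k, a, n, p) {
  rest <- a[(k + 1):p]                       # the last p - k eigenvalues
  ma   <- mean(rest)                         # arithmetic mean
  mg   <- exp(mean(log(rest)))               # geometric mean
  stat <- (n - (2 * p + 11) / 6) * (p - k) * log(ma / mg)
  df   <- (p - k + 2) * (p - k - 1) / 2
  c(statistic = stat, p.value = 1 - pchisq(stat, df))
}

equal.eig.test(k = 3, a = a, n = n, p = p)   # H0: a_4 = ... = a_10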

The following rules are not based on statistical tests, but they are informal and still often used (a small sketch illustrating them follows after this list):
• We can choose the number of components by the percentage of total variance they
describe, i.e. q components explain the percentage

∑_{k=1}^{q} ak / ∑_{k=1}^{p} ak.

The suggestions for an appropriate percentage lie between 70% and 90%. A reasonable
percentage can even be lower when a data set with a very high number of observations
or variables is given.

• Concentrating on the eigenvalues, we can exclude all components with eigenvalues
less than the average.

• In a scree plot the number of the principal component is plotted against the explained
variance, i.e. the eigenvalue of this component. The last principal components, whose
points lie approximately on a straight line, are excluded.
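A minimal sketch of these rules in R (using plain eigen() on the correlation matrix of the decathlon disciplines, an illustrative choice):

library(FactoMineR)
data(decathlon)

a <- eigen(cor(decathlon[, 1:10]))$values   # eigenvalues of the scaled data

cumsum(a) / sum(a)      # rule 1: keep q components explaining roughly 70%-90%
which(a > mean(a))      # rule 2: components with above-average eigenvalue
plot(a, type = "b",     # rule 3: scree plot, drop the components on the flat part
     xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")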

Another group of methods to choose an appropriate number of components are the
resampling methods, which include the bootstrap and cross-validation. These
non-parametric methods do not require normally distributed data.

The package FactoMineR provides the function estim_ncp to calculate the best number
of dimensions. In this function the generalized cross-validation approximation (GCV) and
the smoothing method (Smooth) are implemented.
> estim_ncp(decathlon[, 1:10], ncp.min = 0, ncp.max = NULL,
+ scale = TRUE, method = "Smooth")

$ncp
[1] 3

$criterion
[1] 16.92292 17.36487 18.27353 16.72241 17.65689 20.12934
[7] 20.39593 22.30299 27.24807 113.33135

> estim_ncp(decathlon[, 1:10], ncp.min = 0, ncp.max = NULL,
+ scale = TRUE, method = "GCV")

$ncp
[1] 4

$criterion
[1] 16.922920 21.085579 21.690016 9.678537 8.506169 10.129689
[7] 9.081901 14.625748 31.431345 81.540410

In $criterion the mean error for each dimension is stored. For the data set decathlon
this function suggests working with three dimensions ($ncp) when using the smoothing
method and with four dimensions when using the generalized cross-validation
approximation. We can see that these methods don't necessarily give identical answers,
and it is important to be aware of this. Combining different methods can be helpful to
gain better insight.

5 Working with FactoMineR


The data set decathlon consists of 13 variables. Only the quantitative variables can be
active variables, i.e. variables that are used to calculate the principal components. In this
example I will choose as active variables the first 10 columns, which correspond to the
achievement of the athletes in the different disciplines, and add the rank and the points
(columns 11 and 12) as supplementary information afterwards. It is not reasonable to use
them as active variables as well, since they represent information already included in the
first 10 columns, and because our aim is to reduce dimensions we won't include dimensions
without any new information. As supplementary information they can still help in the
analysis. Column 13 (Competition) will also be added afterwards to help with the
interpretation.

We will focus on three main issues:

• Individuals’ study
We will observe the variability between the individuals (i.e. the athletes), look for
structure and clusters, and see if we can find different profiles of individuals.

• Variables’ study
It would be interesting to find linear relationships between the variables (i.e. the
performance in the different disciplines). We ask whether we can give a good picture of
the performance of an athlete with fewer variables.

• Link between these two studies
Is it possible to characterize groups of individuals by variables?

5.1 The Function PCA


To perform PCA, FactoMineR supplies the function PCA. This procedure can handle
missing values (they are replaced by the column mean) and it is possible to add
supplementary information. First we will leave all supplementary variables out and look
only at the active variables:

> res.pca1 = PCA(decathlon[, 1:10], scale.unit = TRUE, ncp = 4,
+ graph = TRUE)
Because the units of the variables are very different it is important to scale them to unit
variance. This is done by setting scale.unit to TRUE (which is the default). The argument
ncp equals the number of dimensions kept in the results.

[Figure: Individuals factor map (PCA) of the active variables; Dim 1 (32.72%), Dim 2 (17.37%)]

By this procedure two plots are produced. The first plot is called the Individuals' graph
and it shows the scores according to the first two principal components. Together they
explain about 50% of the information contained in the data set. The graph is divided into
four areas. To understand what it means for an athlete to have, for example, a positive
value for dimension one and a negative value for dimension two, we need to have a look at
the second plot, the Variables' graph. It gives an idea of the conclusions we can draw from
the first two principal components. Often this interpretation is not easy to make (one of
the biggest disadvantages of PCA), but in this example it is possible to interpret them.

The correlation circle describes the correlation between the variable x.k and z.1 on the
first axis and the correlation between x.k and z.2 on the second axis. The angle between
two arrows represents the correlation of the respective variables. There is no linear
dependence if the angle is 90 degrees.
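The numbers behind these two graphs can also be inspected directly; a short sketch (the list components $eig, $ind$coord and $var$cor of the object returned by PCA() are used here for illustration):

res.pca1$eig[1:2, ]        # eigenvalues and explained variance of dimensions 1 and 2
head(res.pca1$ind$coord)   # scores plotted in the Individuals' graph
res.pca1$var$cor[, 1:2]    # correlations drawn in the correlation circle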

[Figure: Variables factor map (PCA) considering only the active variables; Dim 1 (32.72%), Dim 2 (17.37%)]

Projecting the arrows onto the first dimension we can see that Long.jump, 100m and
110m.hurdle are most important for the first principal component. The arrows of 100m and
Long.jump lie nearly on a straight line, so they are negatively correlated, i.e. high scores
in Long.jump are strongly correlated with short times in the 100m. Notice that a short
time implies a high score. The same applies to 400m, 110m.hurdle and 1500m. Hence the
first principal component shows the difference between athletes who are good overall (and
especially in Long.jump and the short distance runs) and athletes who perform badly
overall.
If we now look at the Individuals' graph, we can interpret the first dimension: Karpov was
the best when taking all disciplines into consideration, BOURGUIGNON was the worst.

For the second principal component the variables Discus and Shot.put are most important.
This can be interpreted as the second dimension separating those who are strong from
those who are not. It is also remarkable that the variables Discus and Shot.put are hardly
correlated with the variables Long.jump and 100m: the performance in the former
disciplines has no implication for the performance in the latter.
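This last statement can be checked directly on the raw data; a quick sketch (the selection of these four columns is only for illustration):

## correlations between the throwing and the sprint/jump disciplines
round(cor(decathlon[, c("Discus", "Shot.put", "Long.jump", "100m")]), 2)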

5.2 Quantitative Supplementary Variables


I will now add columns 11 (Rank) and 12 (Points) as supplementary variables. They do
not have an influence on the calculation of the principal components, but are simply added
to the graph. The reason to add supplementary variables is the hope that they will help
with the interpretation.

> res.pca2 = PCA(decathlon[, 1:12], scale.unit = TRUE, ncp = 4,
+ quanti.sup = c(11:12), graph = TRUE)

With the argument quanti.sup, supplementary variables which are quantitative can be
added. The Individuals' graph is not affected by this change. When we include columns 11
and 12 in the procedure, two blue arrows for Rank and Points are added to the Variables'
graph. They are highly correlated with the first principal component, which confirms the
interpretation that the first dimension represents the overall performance. It is also
interesting to see which variables are correlated most with Rank and Points: being good
in the short distance runs is most important for gaining many points. When trying to
interpret this graph, however, we should be careful not to over-interpret it. This is a very
simplified illustration of the data set, containing only about 50% of the variability of the
original data.
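The coordinates of these supplementary arrows can also be read off numerically; a short sketch (assuming the $quanti.sup$coord component of the object returned by PCA()):

## correlations of Rank and Points with the first two dimensions
res.pca2$quanti.sup$coord[, 1:2]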

[Figure: Variables factor map (PCA) with the supplementary variables Rank and Points; Dim 1 (32.72%), Dim 2 (17.37%)]

5.3 Qualitative Supplementary Variables


The last column is the categorical variable Competition. We can add it as a supplementary
variable with the argument quali.sup.

> res.pca3 = PCA(decathlon, scale.unit = TRUE, ncp = 4, quanti.sup = c(11:12),
+ quali.sup = 13, graph = TRUE)
[Figure: Individuals' graph with the categorical supplementary variable Competition; Dim 1 (32.72%), Dim 2 (17.37%)]

This new information is added to the Individuals' graph as an average observation for
each group. We can see that the average athlete participating in the Decastar has a lower
value on the first dimension than the average athlete participating in the Olympic Games.
It is also interesting that there are athletes who performed in both competitions, and
generally their performance is better at the Olympic Games. Considering the second
dimension, the average athlete participating in the Olympic Games is very close to the
average athlete in the Decastar.

The following plot gives a better overview of this relationship.


> plot.PCA(res.pca3, axes = c(1, 2), choix = "ind", habillage = 13)
The argument choix selects which graph to plot again (ind for the Individuals' graph and
var for the Variables' graph). To colour the individuals according to a categorical variable,
the argument habillage is used.
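For instance, the same colouring can be combined with other dimensions; a usage sketch (dimensions 3 and 4 are only available here because ncp = 4 was kept in res.pca3):

plot.PCA(res.pca3, axes = c(3, 4), choix = "ind", habillage = 13)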

To see if the difference between Decastar and Olympic Games is significant, the func-
tion coord.ellipse() can be used. This function constructs confidence ellipses.

> concat = cbind.data.frame(decathlon[, 13], res.pca3$ind$coord)
> ell = coord.ellipse(concat, bary = TRUE, level.conf = 0.95)
> plot.PCA(res.pca3, habillage = 13, ellipse = ell, cex = 0.8)
[Figure: two Individuals factor maps (PCA) coloured by Competition, without and with the 95% confidence ellipses around the category means; Dim 1 (32.72%), Dim 2 (17.37%)]

5.4 Supplementary Observations


Sometimes it is interesting to leave certain observations out of the analysis and add them
as supplementary information. The argument ind.sup provides this possibility.

> res.pca4 = PCA(decathlon, scale.unit = TRUE, ncp = 4, quanti.sup = c(11:12),
+ ind.sup = c(5, 8, 9, 13, 17, 23, 24, 25, 27:41), quali.sup = 13,
+ graph = TRUE)

Here the analysis is done only for those athletes who participated in both competitions;
the athletes who participated in only one competition are added as supplementary
observations. The interpretation of the principal components has changed. The first
dimension is again highly correlated with 100m and with the supplementary variable
Points, but here Discus and Shot.put are also highly important. Overall the first principal
component again shows who is good overall, except for 1500m: this discipline is correlated
positively, i.e. a long time has a positive influence on the first dimension. The second
principal component is highly influenced by Pole.vault and the long distance run, and by
High.jump, which is negatively correlated with these.
We can observe the different performance of an athlete during both competitions. Athletes
are possibly better trained and more motivated for the Olympic Games; their overall
performance is generally much better there. Notice that the number of active observations
is much lower in this example than before.
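To compare the two appearances of an athlete numerically, the coordinates of the active and the supplementary observations can be listed; a short sketch (assuming the $ind$coord and $ind.sup$coord components of the PCA() result):

head(res.pca4$ind$coord[, 1:2])       # athletes who competed in both events (active)
head(res.pca4$ind.sup$coord[, 1:2])   # supplementary athletes projected afterwards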

[Figure: Variables factor map and Individuals factor map (PCA) for the analysis with supplementary observations; Dim 1 (39.44%), Dim 2 (18.36%)]

5.5 The Function dimdesc


This function is very useful for observing which variables are highly correlated with a
certain principal component. It returns not only the correlation coefficients, but also
performs a test of whether these correlations are significant. For each dimension the
results are returned for all variables with a p-value smaller than 0.05 (this default
threshold can be changed with the argument proba). These tables confirm our
interpretation: most important for the first principal component are the active variables
Long.jump and 100m, and the first dimension is highly correlated with the Points gained
in the competition. Most important for the second principal component are the variables
Discus and Shot.put.

> dimdesc(res.pca3, axes = c(1, 2))

$Dim.1
$Dim.1$quanti
correlation p.value
Points 0.9561543 2.099191e-22
Long.jump 0.7418997 2.849886e-08
Shot.put 0.6225026 1.388321e-05
High.jump 0.5719453 9.362285e-05
Discus 0.5524665 1.802220e-04
Rank -0.6705104 1.616348e-06
400m -0.6796099 1.028175e-06
110m.hurdle -0.7462453 2.136962e-08
100m -0.7747198 2.778467e-09

$Dim.2
$Dim.2$quanti
correlation p.value
Discus 0.6063134 2.650745e-05
Shot.put 0.5983033 3.603567e-05
400m 0.5694378 1.020941e-04
1500m 0.4742238 1.734405e-03
High.jump 0.3502936 2.475025e-02
Javeline 0.3169891 4.344974e-02
Long.jump -0.3454213 2.696969e-02

We have already seen these correlations in the graphs, but especially for more complex
data sets with a large number of variables it is harder to get an overview from
two-dimensional plots alone. Then this function helps with the interpretation of the
principal components.
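A usage sketch with a stricter significance threshold and programmatic access to the result (the object name desc is arbitrary):

desc <- dimdesc(res.pca3, axes = 1:2, proba = 0.01)
desc$Dim.1$quanti   # only variables with a p-value below 0.01 are listed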

6 Graphical User Interface


As mentioned before, FactoMineR has a GUI implemented to make its use more
comfortable. It can be installed easily in R with

> source("http://factominer.free.fr/install-facto.r")

It is implemented in the environment of Rcmdr, which can be opened with the command

> library(Rcmdr)

Most of what was described before can also be performed within this GUI. For a good
description of how to use it, please have a look at

http://factominer.free.fr/docs/article FactoMineR.pdf

Chapter four of that article describes the Rcmdr support for the FactoMineR package.

7 Conclusion
This was just a small part of the procedures FactoMineR provides. Much more complex
data sets, with different structures and hierarchies, can be analyzed with this package. A
good introduction to many applications of FactoMineR can be found at:

http://factominer.free.fr/

8 References
Everitt, Brian; Dunn, Graham: Applied Multivariate Data Analysis. Second Edition. -
London: Arnold, 2001.

Fahrmeir, Ludwig; Hamerle, Alfred; Tutz, Gerhard: Multivariate statistische Verfahren.
2. Auflage. - Berlin: Walter de Gruyter & Co, 1996.

Filzmoser, Peter: Multivariate Statistik. Vorlesungsskriptum. - Wien, 2010.

Husson, François; Josse, Julie; Lê, Sébastien: FACTOMINER. http://factominer.free.fr/,
17.12.2010.

Husson, François; Josse, Julie; Lê, Sébastien: FactoMineR: An R Package for Multivariate
Analysis. In: Journal of Statistical Software, Volume 25, 2008, pp. 1-18. American
Statistical Association.

Comprehensive R Archive Network. http://CRAN.R-project.org/, January 2011.
