
Seminar 107.424
Winter semester 2010/11

Principal Component Analysis with FactoMineR

1 Introduction

Principal Component Analysis (PCA) is a procedure based on a linear transformation of a data set into a new coordinate system. The aim of PCA is to reduce the dimensionality of this data set under the condition of losing as little information as possible. Information is directly related to variability, so PCA pays most attention to the direction of maximum variance of all variables. This direction of maximum variance defines the first principal component, i.e. the first dimension of the new coordinate system. The second principal component has to be orthogonal (i.e. uncorrelated) to the first, and in this direction most of the remaining variability has to be found. Following this approach we get an orthogonal basis of the vector space of the data set, where each dimension holds more information than the subsequent ones.

The reason why it is important to reduce the number of variables of a data set is that many multivariate data analysis procedures (e.g. cluster analysis, multiple regression) cannot handle too large a number of variables (too many explanatory variables in comparison to the number of observations), or that the variables in a given data set are highly correlated.

The R package FactoMineR provides methods for multivariate exploratory data analysis, with the main issue being to consider the structure of the data (whether there is a partition or a hierarchy of the variables, or a partition of the observations) and the type of the variables (quantitative or categorical). A special focus also lies on supplementary information, that is, information which is added after the analysis. To make the use of FactoMineR more comfortable, a graphical user interface is implemented in the environment Rcmdr.

For quantitative variables PCA is used, and we will concentrate on this case here. For contingency tables Correspondence Analysis is implemented, for categorical variables Multiple Correspondence Analysis, and many more methods can be found in this package, such as Multiple Factor Analysis or Hierarchical Multiple Factor Analysis.

Although PCA is an analysis for continuous variables, the functions of this package can also handle data which includes categorical variables. To get a better idea of what this means, we will have a look at the data set decathlon, available in the package FactoMineR.


> library(FactoMineR)
> data(decathlon)
> str(decathlon)
'data.frame':   41 obs. of  13 variables:
 $ 100m       : num 11 10.8 11 11 11.3 ...
 $ Long.jump  : num 7.58 7.4 7.3 7.23 7.09 7.6 7.3 7.31 6.81 7.56 ...
 $ Shot.put   : num 14.8 14.3 14.8 14.2 15.2 ...
 $ High.jump  : num 2.07 1.86 2.04 1.92 2.1 1.98 2.01 2.13 1.95 1.86 ...
 $ 400m       : num 49.8 49.4 48.4 48.9 50.4 ...
 $ 110m.hurdle: num 14.7 14.1 14.1 15 15.3 ...
 $ Discus     : num 43.8 50.7 49 40.9 46.3 ...
 $ Pole.vault : num 5.02 4.92 4.92 5.32 4.72 4.92 4.42 4.42 4.92 4.82 ...
 $ Javeline   : num 63.2 60.1 50.3 62.8 63.4 ...
 $ 1500m      : num 292 302 300 280 276 ...
 $ Rank       : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Points     : int 8217 8122 8099 8067 8036 8030 8004 7995 7802 7733 ...
 $ Competition: Factor w/ 2 levels "Decastar","OlympicG": 1 1 1 1 1 1 1 1 1 1 ...

The data set contains the results of decathlon athletes during the Olympic Games 2004 and the Decastar 2004. The last variable is a categorical variable and indicates in which of the two competitions an athlete participated.

2 Mathematical Background

The setting will be the following: let $x = (x_1, \dots, x_p)^T$ be a random vector with

$\mu = (\mu_1, \dots, \mu_p)^T$ the mean vector of $x$,

$\Sigma = E[(x - \mu)(x - \mu)^T]$ the covariance matrix,

$\Gamma = (\gamma_1, \dots, \gamma_p)$ an orthogonal $(p \times p)$-matrix with $\gamma_i^T \gamma_i = 1$.

The principal components are given by the linear transformation

$z = \Gamma^T (x - \mu).$

The choice of $\Gamma$ described above, which maximizes the variance of $z_i$, can be traced back to an eigenvalue problem in the following way. The variance of $z_i$ for $i = 1, \dots, p$ is

$\mathrm{Var}(z_i) = \gamma_i^T \Sigma \gamma_i.$


With the method of Lagrange multipliers we can maximize this variance under the restriction $\gamma_i^T \gamma_i = 1$:

$\phi_i = \gamma_i^T \Sigma \gamma_i - a_i (\gamma_i^T \gamma_i - 1)$

$\frac{d\phi_i}{d\gamma_i} = 2 \Sigma \gamma_i - 2 a_i \gamma_i = 0 \;\Leftrightarrow\; \Sigma \gamma_i = a_i \gamma_i \;\Leftrightarrow\; \Sigma \Gamma = \Gamma A$

with $A = \mathrm{diag}(a_1, \dots, a_p)$. For this eigenvalue problem, the $\gamma_i$ are the eigenvectors of $\Sigma$ and the $a_i$ are the eigenvalues with $a_1 \ge a_2 \ge \dots \ge a_p$. Because $\Gamma$ is orthogonal, the covariance matrix can be written as $\Sigma = \Gamma A \Gamma^T$.

For the principal components this gives $E(z) = \Gamma^T E(x - \mu) = 0$. So the expected value equals zero, the variance of $z_i$ is $a_i$, and for all $i \neq j$, $z_i$ is uncorrelated with $z_j$.

In the classical PCA, the mean vector and the covariance matrix of $x$ are estimated in the classical way, i.e. for a data matrix $X \in \mathbb{R}^{n \times p}$,

$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i\cdot},$

$\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i\cdot} - \bar{x})(x_{i\cdot} - \bar{x})^T.$

The actual values of the principal components, i.e. the matrix $Z$ (calculated from the data matrix $X$), are usually called scores. The orthogonal matrix $\Gamma$ is referred to as the loadings matrix and describes the relation between $x$ and $z$: the influence of $x_i$ on $z_j$ is described by $\gamma_{ij}$.
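As a minimal sketch of this construction (my code, not FactoMineR's implementation; it reuses the decathlon data loaded above), the eigendecomposition route can be carried out directly in R:

> X = scale(decathlon[, 1:10])  # centred data matrix with unit variances
> Sigma.hat = cov(X)            # classical covariance estimate
> eig = eigen(Sigma.hat)        # eig$vectors plays the role of Gamma
> Z = X %*% eig$vectors         # scores: Z = X Gamma
> round(eig$values, 3)          # eigenvalues a_1 >= ... >= a_p

Up to column signs, $Z$ corresponds to the scores computed by the PCA function below; the eigenvalues may differ by a constant factor, since cov() divides by $n - 1$ while FactoMineR weights observations by $1/n$.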

• Scaling
PCA is not invariant to scaling. The matrix $\Sigma$ is obviously influenced by the variances of the $x_i$. Therefore it is reasonable to standardize the variables before applying PCA, especially when the variables have obviously different ranges.

• Normal Distribution
There are no distributional requirements for using PCA in general, but the estimated covariance matrix is sensitive to non-normally distributed data. Depending on the data set, a log transformation of certain variables can help to obtain approximate normality. Notice that normality is a requirement for some tests that are performed in the context of PCA.


• Outliers
One way to deal with outliers is to use robust PCA. In this case we only look at the major part of the data and fit the principal components to this majority. Especially when the extreme values are the data points of special interest, this approach would be misleading. Then one possible method is to downweight these values, so that their influence on the analysis is not too high. (A minimal robust sketch follows below.)
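As an illustration of the robust idea only (a sketch under my own choices, not a method implemented in FactoMineR), one simple variant replaces the classical covariance estimate by the MCD estimator from the package MASS and diagonalizes that instead:

> library(MASS)
> rob = cov.rob(decathlon[, 1:10], method = "mcd")  # robust location and scatter (MCD)
> eig.rob = eigen(rob$cov)                          # robust loadings and eigenvalues
> Z.rob = sweep(as.matrix(decathlon[, 1:10]), 2, rob$center) %*% eig.rob$vectors

For simplicity the variables are not rescaled here; outliers then influence neither the centre nor the directions, they only show up with large robust scores.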

The total variance of the data set, $\sum_{i=1}^{p} a_i$, is only completely described by all $p$ principal components. Because the last principal component has the smallest eigenvalue, we lose the least variance by leaving the last component out. Doing so we lose one dimension. The question will now be how many components (and thus how many dimensions) are necessary to represent the given data set $X$ appropriately, i.e. how many of the last components we can leave out without losing too much information about the structure of our data set.

Several methods for choosing the number of components can be found in the literature. A statistically well-founded method is to test, at a given significance level, whether the last $p - k$ eigenvalues are equal, i.e. $a_p = a_{p-1} = \dots = a_{k+1}$. Starting with $k = 0$, we can increase $k$ as long as the null hypothesis is rejected. For this, the test statistic

$\left(n - \frac{2p + 11}{6}\right) (p - k) \ln\frac{m_a}{m_g} \;\sim\; \chi^2_{(p-k+2)(p-k-1)/2}$

is used with

$m_a = \frac{\hat{a}_{k+1} + \dots + \hat{a}_p}{p - k}, \qquad m_g = \sqrt[p-k]{\hat{a}_{k+1} \cdots \hat{a}_p}.$

Normal distribution of the data is required for this test.
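A small R sketch of this test under the assumptions just stated (the function name equal.eigen.test and the choice k = 2 are mine):

> equal.eigen.test = function(a, k, n) {
+   p = length(a)
+   ma = mean(a[(k + 1):p])            # arithmetic mean of the last p - k eigenvalues
+   mg = exp(mean(log(a[(k + 1):p])))  # geometric mean
+   stat = (n - (2 * p + 11) / 6) * (p - k) * log(ma / mg)
+   df = (p - k + 2) * (p - k - 1) / 2
+   c(statistic = stat, p.value = pchisq(stat, df, lower.tail = FALSE))
+ }
> a = eigen(cor(decathlon[, 1:10]))$values
> equal.eigen.test(a, k = 2, n = nrow(decathlon))  # H0: the last p - 2 eigenvalues are equal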

Not based on statistical tests are the following informal, but still often used rules (see also the sketch after this list):

• We can choose the number of components by the percentage of the total variance they describe, i.e. $q$ components explain the percentage

$\frac{\sum_{k=1}^{q} a_k}{\sum_{k=1}^{p} a_k}.$

The suggestions for an appropriate percentage lie between 70% and 90%. A reasonable percentage can even decrease when a data set with a very high number of observations or variables is given.

• We keep only those components whose eigenvalues are larger than the average eigenvalue; components with eigenvalues less than the average are excluded.

• In a scree plot the number of each principal component is plotted against the explained variance, i.e. the eigenvalue of this component. The last principal components, whose points lie approximately on a straight line, are excluded.
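Both rules are easy to check for the decathlon data; a short sketch (my code, using the eigenvalues that the PCA function returns in res$eig):

> res = PCA(decathlon[, 1:10], graph = FALSE)
> ev = res$eig[, "eigenvalue"]
> cumsum(ev) / sum(ev) * 100  # percentage of variance explained by the first q components
> plot(seq_along(ev), ev, type = "b",
+   xlab = "component", ylab = "eigenvalue", main = "Scree plot")
> abline(h = mean(ev), lty = 2)  # average-eigenvalue rule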


Another group of methods to choose an appropriate number of components are the resampling methods, which include the bootstrap and cross-validation. These nonparametric methods do not require normally distributed data.

The package FactoMineR provides the function estim_ncp to calculate the best number of dimensions. In this function a generalized cross-validation approximation (GCV) and the smoothing method (Smooth) are implemented.

> estim_ncp(decathlon[, 1:10], ncp.min = 0, ncp.max = NULL,
+   scale = TRUE, method = "Smooth")
$ncp
[1] 3

$criterion
[1] 16.92292 17.36487 18.27353 16.72241 17.65689 20.12934
[7] 20.39593 22.30299 27.24807 113.33135

> estim_ncp(decathlon[, 1:10], ncp.min = 0, ncp.max = NULL,
+   scale = TRUE, method = "GCV")
$ncp
[1] 4

$criterion
[1] 16.922920 21.085579 21.690016 9.678537 8.506169 10.129689
[7] 9.081901 14.625748 31.431345 81.540410

In $criterion the mean error for each dimension is stored. For the data set decathlon this function suggests working with three dimensions ($ncp) when using the smoothing method, and with four dimensions when using the generalized cross-validation approximation. We can see that these methods do not necessarily give unique answers, and it is important to be aware of this. Combining different methods can be helpful to gain a better insight.

The data set decathlon consists of 13 variables. Only the quantitative variables can be active variables, i.e. variables that are used to calculate the principal components. In this example I will choose as active variables the first 10 columns, which correspond to the achievements of the athletes in the different disciplines, and add the rank and the points (columns 11 and 12) as supplementary information afterwards. It is not reasonable to use them as active variables as well, since they represent information already included in the first 10 columns, and because our aim is to reduce dimensions we will not include dimensions without any new information. As supplementary information they can still help in the analysis. Column 13 (Competition) will also be added afterwards to help with the interpretation.


We will focus on three main issues:

• Individuals' study
We will observe the variability between the individuals (i.e. the athletes), look for structure and clusters, and see if we can find different profiles of individuals.

• Variables' study
It would be interesting to find linear relationships between variables (i.e. the performance in the different disciplines). We will ask whether we can give a good picture of the performance of an athlete with fewer variables.

• Link between individuals and variables
Is it possible to characterize groups of individuals by variables?

To perform PCA, FactoMineR supplies the function PCA. This procedure can handle missing values (they are replaced by the column mean) and it is possible to add supplementary information. First we will leave all supplementary variables out and look only at the active variables:

> res.pca1 = PCA(decathlon[, 1:10], scale.unit = TRUE,
+   graph = TRUE)

[Figure: Individuals factor map (PCA), the 41 athletes plotted against Dim 1 (32.72%) and Dim 2 (17.37%).]

Because the units of the variables are very different, it is important to scale them to unit variance; therefore the argument scale.unit is set to TRUE (which is the default). The first plot is called the Individuals' graph and shows the athletes projected onto the first two principal components. Together these two components explain about 50% of the information contained in the data set. To understand what it means for an athlete to have, for example, a positive value for dimension one and a negative value for dimension two, we need to have a look at the second plot, the Variables' graph. It gives an idea of the conclusions we can draw from the first two principal components. Often this interpretation is not easy to make (one of the biggest disadvantages of PCA), but in this example it is possible to interpret them.

The correlation circle shows the correlation between each variable $x_{\cdot k}$ and $z_{\cdot 1}$ on the first axis and the correlation between $x_{\cdot k}$ and $z_{\cdot 2}$ on the second axis. The angle between two arrows represents the correlation of the respective variables; there is no linear dependence if the angle is 90 degrees.
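These correlations can also be computed by hand; a short sketch (my code, assuming the result object res.pca1 from above):

> round(cor(decathlon[, 1:10], res.pca1$ind$coord[, 1:2]), 2)  # coordinates on the circle

FactoMineR stores the same values in res.pca1$var$cor.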


[Figure: Variables factor map (PCA), the correlation circle of the active variables, Dim 1 (32.72%) vs. Dim 2 (17.37%).]

Projecting the arrows onto the first dimension, we can see that Long.jump, 100m and 110m.hurdle are most important for the first principal component. The arrows of 100m and Long.jump lie nearly on a straight line, so these variables are negatively correlated, i.e. a long jump goes together with a short time in the 100m. Notice that a short time implies a high score. The same applies to 400m and 110m.hurdle. The first dimension therefore shows the difference between athletes who are good overall (long jumps, short running times) and athletes who are bad all the time.

If we look now at the Individuals' graph we can interpret the first dimension: Karpov was the best when taking all disciplines into consideration, BOURGUIGNON was the worst. For the second principal component the variables Discus and Shot.put are most important. This can be interpreted as the second dimension separating who is strong and who is not. It is also remarkable that the variables Discus and Shot.put are hardly correlated with the variables Long.jump and 100m: the performance in the first disciplines has no implication for the performance in the latter.

I will now add column 11 (Rank) and column 12 (Points) as supplementary variables. They do not have an influence on the calculation of the principal components; they are just added to the graph. The reason to add supplementary variables is the hope that they will help with the interpretation.

> res.pca2 = PCA(decathlon[, 1:12], scale.unit = TRUE,
+   quanti.sup = c(11:12), graph = TRUE)

With the argument quanti.sup, quantitative supplementary variables can be added. The Individuals' graph is not affected by this change. When we include columns 11 and 12 in the procedure, two blue arrows for Rank and Points are added to the Variables' graph. They are highly correlated with the first principal component, which confirms the interpretation that the first dimension represents the overall performance. It is also interesting to see which variables are correlated most with Rank and Points: being good in the short-distance runs is most important for gaining many points. When interpreting this graph we should be careful not to over-interpret it, though. It is a very simplified illustration of the data set, containing only about 50% of the variability of the original data.


[Figure: Variables factor map (PCA) with the supplementary variables Rank and Points added in blue, Dim 1 (32.72%) vs. Dim 2 (17.37%).]

The last column is the categorical variable Competition. We can add it as a supplementary variable with the argument quali.sup.

> res.pca3 = PCA(decathlon, scale.unit = TRUE, quanti.sup = c(11:12),
+   quali.sup = 13, graph = TRUE)

[Figure: Individuals factor map (PCA) with the barycentres of the categories Decastar and OlympicG added, Dim 1 (32.72%) vs. Dim 2 (17.37%).]

This new information is added to the Individuals' graph as an average observation for each category. The average athlete participating in Decastar has a lower value on the first dimension than the average athlete participating in the Olympic Games. Also interesting is that there are athletes who performed in both competitions and generally did better at the Olympic Games. Considering the second dimension, the average athlete participating in the Olympic Games is very close to the average athlete in Decastar.

> plot.PCA(res.pca3, axes = c(1, 2), choix = "ind", habillage = 13)

The argument choix selects which graph to plot again (ind for the Individuals' graph, var for the Variables' graph). To color the individuals according to a categorical variable, the argument habillage is used.
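Analogously, the Variables' graph of the same analysis can be replotted with:

> plot.PCA(res.pca3, axes = c(1, 2), choix = "var")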


To see whether the difference between Decastar and the Olympic Games is significant, the function coord.ellipse() can be used. It constructs confidence ellipses around the category barycentres. Here concat is a data frame combining the categorical variable with the individuals' coordinates:

> concat = cbind.data.frame(decathlon[, 13], res.pca3$ind$coord)
> ell = coord.ellipse(concat, bary = TRUE, level.conf = 0.95)
> plot.PCA(res.pca3, habillage = 13, ellipse = ell, cex = 0.8)

[Figure: two Individuals factor maps (PCA) colored by Competition (Decastar, OlympicG), without and with the 95% confidence ellipses around the category barycentres.]

Sometimes it is interesting to leave certain observations out of the analysis and to add them as supplementary information. The argument ind.sup provides this possibility.

> res.pca4 = PCA(decathlon, scale.unit = TRUE, quanti.sup = c(11:12),
+   ind.sup = c(5, 8, 9, 13, 17, 23, 24, 25, 27:41), quali.sup = 13,
+   graph = TRUE)

Here the analysis is done only for those athletes who participated in both competitions; as supplementary observations we add those athletes who participated in only one competition. The interpretation of the principal components has changed. The first dimension is again highly correlated with 100m and with the supplementary variable Points, but here Discus and Shot.put are also highly important. Overall, the first principal component again shows who is good overall, except for the 1500m: this discipline is correlated positively, i.e. a long time has a positive influence on the overall performance. The second principal component is highly influenced by Pole.vault, the long-distance run and High.jump, which is negatively correlated with these.

We can observe the different performance of an athlete during the two competitions. Athletes are possibly better trained and more motivated for the Olympic Games; their overall performance is generally much better there. Notice that the number of observations used as active observations is much lower in this example than before.


[Figure: Variables factor map and Individuals factor map (PCA) for the analysis with supplementary individuals; the second dimension now explains 18.36% of the variance.]

The function dimdesc() is very useful for observing which variables are highly correlated with a certain principal component. It returns not only the correlation coefficients, but also performs a test of whether these correlations are significant. For each dimension, the results for all variables with a p-value smaller than 0.05 are returned (this default value can be changed by setting the argument proba). The following tables confirm our interpretation: most important for the first principal component are the active variables Long.jump and 100m, and the first dimension is highly correlated with the Points gained in the competition. Most important for the second principal component are the variables Discus and Shot.put.

> dimdesc(res.pca3, axes = 1:2)

$Dim.1
$Dim.1$quanti
            correlation      p.value
Points        0.9561543 2.099191e-22
Long.jump     0.7418997 2.849886e-08
Shot.put      0.6225026 1.388321e-05
High.jump     0.5719453 9.362285e-05
Discus        0.5524665 1.802220e-04
Rank         -0.6705104 1.616348e-06
400m         -0.6796099 1.028175e-06
110m.hurdle  -0.7462453 2.136962e-08
100m         -0.7747198 2.778467e-09

$Dim.2
$Dim.2$quanti
            correlation      p.value
Discus        0.6063134 2.650745e-05
Shot.put      0.5983033 3.603567e-05
400m          0.5694378 1.020941e-04
1500m         0.4742238 1.734405e-03
High.jump     0.3502936 2.475025e-02
Javeline      0.3169891 4.344974e-02
Long.jump    -0.3454213 2.696969e-02

We have already seen these correlations in the graphs, but especially for more complex data sets with a large number of variables it is harder to get an overview from two-dimensional plots alone. Then this function helps with the interpretation of the principal components.
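If a stricter significance threshold is wanted, it can be passed directly (assuming the result object res.pca3 from above):

> dimdesc(res.pca3, axes = 1:2, proba = 0.01)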

As mentioned before, FactoMineR has a GUI to make its use more comfortable. It can be installed easily in R with

> source("http://factominer.free.fr/install-facto.r")

It is implemented in the environment of Rcmdr, which can be opened with the command

> library(Rcmdr)

Most of what was described before can also be performed within this GUI. For a good description of how to use it, please have a look at

http://factominer.free.fr/docs/article_FactoMineR.pdf

In chapter four you will find the Rcmdr support for the FactoMineR package.

7 Conclusion

This was just a small part of the procedures FactoMineR provides. Much more complex data sets, with different structures and hierarchies, can be analyzed with this package. A good introduction to many applications of FactoMineR can be found at:

http://factominer.free.fr/

8 References

Everitt, Brian; Dunn, Graham: Applied Multivariate Data Analysis. Second Edition. London: Arnold, 2001.

2nd edition. Berlin: Walter de Gruyter & Co, 1996.


FactoMineR website: http://factominer.free.fr/ (accessed 17.12.2010).

Husson, François; Josse, Julie; Lê, Sébastien: FactoMineR: An R Package for Multivariate Analysis. In: Journal of Statistical Software, Volume 25, 2008, pages 1-18. American Statistical Association.

