Abstract: This paper is a hands-on introduction that shows how to perform basic tasks in the
analysis of compositional data following Aitchison's philosophy, within the statistical package
'R' and using a contributed package called 'compositions', which is devoted specifically to
compositional data analysis. The tasks covered are: descriptive statistics and plots (ternary
diagrams, boxplots), principal component analysis (using biplots), cluster analysis with the
Aitchison distance, analysis of variance (ANOVA) of a dependent composition, and some
transformations and operations between compositions in the simplex.
This paper will show how the basic tasks of compositional data analysis (Aitchison et al. 2002)
can be performed with the package 'compositions' in the free statistical environment 'R'
(R Development Core Team 2003). The paper aims to be useful for a wide spectrum of 'R' users:
for this reason, it is suggested that experienced users skip these first steps, whereas those
who have never heard of 'R' should begin with Appendix A before continuing with the text. It is
strongly recommended that the reader sit in front of the computer, typing the examples outlined
here: thus, text output of these instructions is kept to a minimum, and almost no figures are
included, although they are described briefly (with a few exceptions).

First steps
... manuals or of typing to a command line any command found out there. However, it should be
remembered that 'R' and its packages are a living project permanently adapted to the development
of the field. More instructions can be found at 'http://www.stat.boogaart.de/compositions/'.
After starting 'R' (either by clicking on the appropriate icon, selecting the entry 'R' in the
start menu or by typing the command 'R' to a console or command window, after installing the
software) a command window appears where commands can be given to 'R'. The following appears:

R: Copyright 2004, The R Foundation for Statistical Computing
Version 2.0.1 (2004-11-15), ISBN 3-900051-07-0
The version number should be checked, since at least version 2.0.0 is required for running
compositions. The '>' mark shows that 'R' is ready to accept commands. This character should not
be typed with the commands. To see how 'R' works, type '3*7', and hit the ENTER key to make 'R'
execute this command:

> 3*7
[1] 21
>

'R' executes the command by multiplying 3 and 7 and then prints the result 21. For the moment,
ignore the '[1]'. In this way 'R' can be used as an (extremely powerful) calculator. To prepare
'R' for compositional data analysis, the library compositions must be loaded with the library
command:

> library(compositions)
Attaching package 'compositions':
The following object(s) are masked from package:stats:
cor cov dist var
The following object(s) are masked from package:base:
%*%
>

When working in a terminal, the help can be closed by typing 'q' for Quit. In a windows-based
environment the help window can simply be closed.

> ls() # Show names of all variables/datasets
[1] "sa.dirichlet" "sa.dirichlet.dil" "sa.dirichlet.mix"
[4] "sa.dirichlet5" "sa.dirichlet5.dil" "sa.dirichlet5.mix"
... (lines omitted)

The other commands show a typical usage of 'R': use '?' to get help information, or 'ls()' to
show all variables/datasets defined previously. Just type the name of a dataset to show its
content, which in this case is a set of simulated amounts of three different chemical elements
in ppm:

> sa.lognormals # Show one of the datasets
             Cu         Zn        Pb
 [1,] 8.8043262 35.1671810 45.895025
 [2,] 0.8115227  2.6547329 47.804310
 [3,] 1.2836130 12.4472047 40.553628
... (lines omitted)
[60,] 3.9854998  6.1301909 40.579417
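The three rows of sa.lognormals printed above can be used to illustrate what closing a dataset means: every row of amounts is rescaled so that its parts sum to a fixed constant (one by default). The following is only a minimal base-'R' sketch under that assumption, not the code used inside 'compositions'; the helper name close_rows is hypothetical and introduced purely for illustration.

```r
# First three rows of sa.lognormals as printed in the text (amounts in ppm)
amounts <- matrix(c(8.8043262, 35.1671810, 45.895025,
                    0.8115227,  2.6547329, 47.804310,
                    1.2836130, 12.4472047, 40.553628),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(NULL, c("Cu", "Zn", "Pb")))

# Hypothetical helper: divide each row by its sum and rescale to 'total'
close_rows <- function(x, total = 1) {
  total * x / rowSums(x)
}

closed <- close_rows(amounts)
print(round(closed, 4))   # proportions of Cu, Zn and Pb in each sample
print(rowSums(closed))    # every row now sums to the closure constant
```

Ratios between parts are unchanged by this rescaling, which is why the later log-ratio summaries do not depend on the closure constant.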
The dataset is now closed. Note that the closure constant is automatically considered as one.
In this case, the resulting object is stored in 'cdata' and marked as an Aitchison composition
(having the attribute class 'acomp') such that it is automatically treated in an adequate way in
further commands. For example, the 'plot' command will automatically draw a ternary diagram:

> plot(cdata) # Ternary diagram
> ?plot.acomp # Help on the acomp-specific plot function

One quits the help by closing the help window or by typing q for 'quit'. There is always an
example of the command at the end of each help page. Try this out to see what happens.
Another graphical display of compositional data, related closely to the Aitchison geometry of
the simplex and displaying the Aitchison distance in a visual way, is the boxplot of log-ratios:

> boxplot(cdata) # Box-plots of pairwise ratios in log scale
> ?boxplot.acomp # Help on compositional box-plots
> boxplot(cdata,log=FALSE) # use normal scale

A barplot can also be used to display the whole dataset:

> barplot(cdata) # Display the whole data set

The variation of compositions can be summarized in several ways (Aitchison et al. 2002;
Pawlowsky-Glahn & Egozcue 2001):

> variation(cdata) # Variation matrix
          Cu        Zn       Pb
Cu 0.0000000 0.4046994 2.938182
Zn 0.4046994 0.0000000 2.910539
Pb 2.9381816 2.9105389 0.000000
> var(cdata) # Variance matrix of the clr-transform
           Cu         Zn         Pb
Cu  0.4194692  0.2125124 -0.6319817
Zn  0.2125124  0.4102550 -0.6227674
Pb -0.6319817 -0.6227674  1.2547491
> mvar(cdata) # metric variance
[1] 2.084473
> msd(cdata) # metric standard deviation = sqrt(mvar/(D-1))
[1] 1.0209
> summary(cdata) # multiple information about pairwise ratios
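The four summaries above are closely linked, and the link can be checked with plain arithmetic. The sketch below starts from the clr variance matrix exactly as printed by var(cdata) and rebuilds the variation matrix, the metric variance and the metric standard deviation from it, assuming the usual identities var(log(x_i/x_j)) = S_ii + S_jj - 2 S_ij and mvar = trace(S). It uses base 'R' only and does not call the package; the variable names are illustrative.

```r
# clr variance matrix S, copied from the var(cdata) output in the text
parts <- c("Cu", "Zn", "Pb")
clr_var <- matrix(c( 0.4194692,  0.2125124, -0.6319817,
                     0.2125124,  0.4102550, -0.6227674,
                    -0.6319817, -0.6227674,  1.2547491),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(parts, parts))
D <- nrow(clr_var)                      # number of parts

# Variation matrix: entry (i, j) is var(log(x_i / x_j)) = S_ii + S_jj - 2 * S_ij
s_diag <- diag(clr_var)
variation_from_clr <- outer(s_diag, s_diag, "+") - 2 * clr_var

# Metric variance is the trace of S; msd follows the formula quoted in the text
metric_variance <- sum(s_diag)
metric_sd <- sqrt(metric_variance / (D - 1))

print(round(variation_from_clr, 6))
print(metric_variance)
print(metric_sd)
```

Up to rounding of the printed digits, the results should agree with the variation(cdata), mvar(cdata) and msd(cdata) outputs shown above.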
. . .
... component analysis, which uses the clr transforms (Aitchison 2002):

> pca <- princomp(cdata) # perform PCA and store the result in pca
> pca # display results as text
Call:
princomp.acomp(x = cdata)
Standard deviations:

The optional parameter 'parts=' allows you to select the parts to be used in the
subcomposition. Optional parameters are a typical way for 'R' to provide additional
functionality beyond the default behaviour of a command. The possible optional parameters and
their effects are documented in the help to each command, which can be invoked by
'? nameoffunction'. The 'c()' function is used here just to Concatenate the variable names.
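As a rough illustration of the clr-based principal component analysis mentioned above, the following base-'R' sketch applies an ordinary PCA to clr-transformed data. It is a sketch under stated assumptions, not the implementation behind princomp.acomp: the data are simulated here purely for illustration (they are not the paper's dataset), the clr helper is written by hand, and prcomp is used in place of the package's method.

```r
# Illustrative data only: 20 simulated three-part amounts (not from the paper)
set.seed(1)
amounts <- matrix(rlnorm(60), ncol = 3,
                  dimnames = list(NULL, c("Cu", "Zn", "Pb")))

# Centred log-ratio transform: log of each part minus the row mean of the logs
clr <- function(x) {
  lx <- log(x)
  lx - rowMeans(lx)
}
z <- clr(amounts)

# Ordinary PCA on the clr scores (prcomp centres the columns by default)
pc <- prcomp(z)
print(pc$sdev)                 # the last value is numerically zero
print(pc$rotation)             # loadings of the parts on each component

# The component variances add up to the total variance of the clr scores
total_variance <- sum(diag(var(z)))
print(c(sum(pc$sdev^2), total_variance))
```

Because every clr row sums to zero, a D-part composition has at most D-1 non-trivial components, which is why the last standard deviation vanishes.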
Fig. 2. Matrix of ternary diagrams of a four-part composition (Cd, Pb, Co, Cu).
Fig. 3. Dendrogram of groups in a four-part composition (Cd, Pb, Co, Cu), defined by the Aitchison distance (plot title 'Cluster Dendrogram'; computed from dist(cdata) with hclust, method "complete").
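The clustering behind Fig. 3 relies on the Aitchison distance, which can be computed as the ordinary Euclidean distance between clr-transformed compositions. The sketch below shows that idea with base 'R' only. The four compositions are made up for illustration (they are not the paper's data), the clr helper is hand-written, and stats::dist and hclust are used directly rather than the masked versions supplied by 'compositions'.

```r
# Hypothetical four-part compositions, for illustration only
comps <- matrix(c(0.10, 0.20, 0.30, 0.40,
                  0.12, 0.18, 0.33, 0.37,
                  0.70, 0.10, 0.10, 0.10,
                  0.65, 0.15, 0.08, 0.12),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("s", 1:4), c("Cd", "Pb", "Co", "Cu")))

# Centred log-ratio transform: log of each part minus the row mean of the logs
clr <- function(x) {
  lx <- log(x)
  lx - rowMeans(lx)
}

# Aitchison distance = Euclidean distance between clr vectors
aitchison_dist <- dist(clr(comps))
print(round(aitchison_dist, 3))

# Complete-linkage hierarchical clustering, as named in the figure
tree <- hclust(aitchison_dist, method = "complete")
groups <- cutree(tree, k = 2)
print(groups)

# The distance does not change when the amounts are rescaled (e.g. to percent)
rescaled_dist <- dist(clr(100 * comps))
```

Calling plot(tree) would draw the corresponding dendrogram.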
When the user wants to analyse only one of the groups, a subset of the data is selected based on
a criterion:

> sa.groups5.area
 [1] Upper Upper Upper Upper Upper Upper Upper Upper Upper Upper
[11] Upper Upper Upper Upper Upper Upper Upper Upper Upper Upper
[21] Middle Middle Middle Middle Middle Middle Middle Middle Middle Middle
[31] Middle Middle Middle Middle Middle Middle Middle Middle Middle Middle
[41] Lower Lower Lower Lower Lower Lower Lower Lower Lower Lower
[51] Lower Lower Lower Lower Lower Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
> upper <- split(cdata,sa.groups5.area)[["Upper"]]
> plot(upper)
> mean(upper)

A parallel analysis of all groups is possible through the 'lapply' or 'sapply' function of 'R':

> sapply(split(cdata,sa.groups5.area),mean)
         Lower      Middle       Upper
Cd 0.064366655 0.006801803 0.001636658
Pb 0.059178242 0.572889919 0.957191909
Co 0.009509995 0.002658064 0.004323912
Cu 0.866945108 0.417650214 0.036847521

However, the grouping information could be used to check whether the groups are really
different in a Multivariate Analysis of Variance (manova), which can be done by 'R' standard
routines based on the ilr transform:

> m <- manova(ilr(cdata)~sa.groups5.area)
> summary(m)
                Df Pillai approx F num Df den Df    Pr(>F)
sa.groups5.area  2 1.0872  22.2312      6    112 < 2.2e-16 ***
Residuals       57
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(ilr.inv(residuals(m)),col=sa.groups5.area)
> plot(ilr.inv(predict(m)),col=sa.groups5.area)
> qqnorm(ilr.inv(residuals(m)))
> mvar(predict(m))/(mvar(residuals(m)+predict(m))) # "R"^2
[1] 0.3980416
> diag(ilrvar2clr(var(predict(m)))/ilrvar2clr(var(residuals(m)+predict(m))))
[1] 0.4001846 0.5670027 0.1392320 0.2654141

Here one sees a highly significant influence of the group, given by a p-value stated as
'< 2.2e-16'. If this example is run, a series of plots results: the first one shows the
residuals with substantial spread. The second plot shows the location of the predicted group
means in ternary diagrams. Unfortunately, the variable names are lost during the ilr transform,
such that the plots are drawn without labels. The third plot shows qqnorm-plots of the pairwise
log-ratios, in order to check the normality assumption used in the manova. The last two
calculations give the total R^2 of the model, about 39.8%, and the individual R^2 values for the
four parts of the composition.

In a similar way a discrimination analysis can be performed based on the ilr transform and
standard functionality of 'R':

> library(MASS) # Loading appropriate library
> # Generating example data
> subsample <- sample(1:60,45)
> TrainingData <- acomp(cdata[subsample,])
> TrainingGroups <- sa.groups5.area[subsample]
> ControlData <- acomp(cdata[-subsample,])
> ControlGroups <- sa.groups5.area[-subsample]
> ControlGroups
 [1] Upper Upper Upper Middle Middle Middle Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
> # Performing the discriminant analysis
> dscr <- lda(TrainingGroups~.,ilr(TrainingData)) # Discrimination Analysis
> dscr
... (output omitted)
> predict(dscr,newdata=ilr(ControlData)) # Classify ControlData
$class
 [1] Upper Upper Upper Middle Middle Lower Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
$posterior
         Lower       Middle        Upper
1 3.626286e-16 1.851031e-07 9.999998e-01
2 7.991869e-12 8.473827e-05 9.999153e-01
... (lines omitted)
> table(ControlGroups, predict(dscr, newdata=ilr(ControlData))$class)
ControlGroups Lower Middle Upper
       Lower      5      0     0
       Middle     1      6     0
       Upper      0      0     3
Downloaded from http://sp.lyellcollection.org/ at Pennsylvania State University on April 8, 2016
Conclusions
For the beginner, this approach immediately provides all basic compositional plots, summaries
and transformations in the form of simple standard commands given in this publication. More
helpful

Cd;Zn;Pb;Cd;Co
1.2;2.6;4.9;0.2;5
23.4;11;0.2;0.002;6.2
. . .
• It doesn't work in the second session: Have you loaded all necessary libraries and prepared
  all variables?
• 'R' comes with plenty of help. Type the 'help.start()' command and start with 'Introduction
  to R'.
• Type 'q()' for quit and the ENTER key to leave 'R'. Save your workspace when asked.

References
AITCHISON, J. 2002. Simplicial inference. In: VIANA, M. A. G. & RICHARDS, D. S. P. (eds)
Algebraic Methods in Statistics and Probability. Contemporary Mathematics Series, 287, American
Mathematical Society, Providence, Rhode Island, 1-22.
AITCHISON, J., BARCELO-VIDAL, C., EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2002. A concise guide to
the algebraic geometric structure of the simplex, the sample space for compositional data
analysis. In: BAYER, U., BURGER, H. & SKALA, W. (eds) Proceedings of the 8th Annual Conference
of the International Association for Mathematical Geology, Berlin, Germany, 387-392.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the
simplex. Stochastic Environmental Research and Risk Assessment, 15 (5), 384-398.
R Development Core Team 2003. R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria (http://www.R-project.org).