Email: joanna.lee@npl.co.uk
Web: http://www.npl.co.uk/nanoanalysis
1. Introduction
   What is multivariate analysis?
   Some matrix algebra
2. Identification
   Principal component analysis (PCA)
   Multivariate curve resolution (MCR)
3. Quantification and prediction
   Partial least squares regression (PLS)
4. Classification
   PCA classification
   Principal Component Discriminant Function Analysis (PC-DFA)
   Partial Least Squares Discriminant Analysis (PLS-DA)
5. Conclusion
Slide 2
Why are we here?
[Figure: number of publications per year of publication (1990-2010) using multivariate methods: PCA, MCR, PLS, DFA and ANNs.]
Slide 3
Data analysis
Identification: what chemicals are on the surface? Where are they located?
Calibration / Quantification: how is the SIMS dataset related to known properties? Can we predict these properties?
Classification
Slide 4
Contents
1. Introduction
   What is multivariate analysis?
   Some matrix algebra
2. Identification
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 5
Chemometrics
Chemometrics relates measurements made on a chemical system to the state of the system via application of mathematical or statistical methods.
SIMS data, analysed with statistical methods (e.g. multivariate analysis), yields statistical results, which are interpreted using knowledge of surface chemistry and instrumental influences.
Manual analysis involves selecting a sub-set of the most interesting features for analysis by eye.
Multivariate analysis involves simultaneous statistical analysis of all the variables.
Slide 8
Advantages and disadvantages
Advantages
   Fast and efficient on modern computers
   Uses all information available
   Improves signal to noise ratio
   Statistically valid, removes potential bias
Disadvantages
   Lots of different methods, procedures, terminologies
   Can be difficult to understand and interpret
Slide 9
Why use multivariate analysis?
[Figure: number of peaks or information bins (10^0 to 10^4) versus counts per peak (10^0 to 10^5). Multivariate methods such as PCA and PLS are useful for datasets with many peaks; ordinary analysis suffices for few peaks, and datasets with many peaks and high counts are "jolly good".]
Slide 10
Contents
1. Introduction
   What is multivariate analysis?
   Some matrix algebra
2. Identification
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 11
Data matrix
Each sample gives a mass spectrum; the intensities are collected into a data matrix X, with one row per sample and one column per variable (mass):

X = [  9  32  10  1  21 ]
    [ 18  20  22  4  12 ]
    [ 24  12  30  6   6 ]

X has 3 rows and 5 columns: a 3 x 5 data matrix (3 samples, 5 mass variables).
Each row (spectrum) is represented by a vector.
Slide 12
Matrix algebra
Matrix addition: A + B = C, (I x K) + (I x K) = (I x K)
A and B must be the same size; each corresponding element is added
(e.g. pure spectra + noise = experimental data):

[ 2  4  1 ]   [ -1  2   0 ]   [ 1  6  1 ]
[ 3  8  6 ] + [  0  1  -2 ] = [ 3  9  4 ]

Matrix multiplication: AB = C, (I x N)(N x K) = (I x K)
The number of columns of A must equal the number of rows of B;
row i of A times column j of B gives element (i, j) of the product AB:

[ 1  4 ]   [ 1  2 ]   [ 1*1 + 4*3   1*2 + 4*2 ]   [ 13  10 ]
[ 2  2 ] x [ 3  2 ] = [ 2*1 + 2*3   2*2 + 2*2 ] = [  8   8 ]
[ 4  2 ]              [ 4*1 + 2*3   4*2 + 2*2 ]   [ 10  12 ]

Slide 13
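The two worked examples above can be reproduced with a few lines of NumPy (a sketch; the matrices are the ones on the slide):

```python
import numpy as np

# Matrix addition: A and B must be the same size; corresponding elements add
# (e.g. pure spectra + noise = experimental data).
A = np.array([[2, 4, 1],
              [3, 8, 6]])
B = np.array([[-1, 2, 0],
              [0, 1, -2]])
print(A + B)    # [[1 6 1]
                #  [3 9 4]]

# Matrix multiplication: the number of columns of A2 must equal the number
# of rows of B2; row i of A2 times column j of B2 gives element (i, j).
A2 = np.array([[1, 4],
               [2, 2],
               [4, 2]])
B2 = np.array([[1, 2],
               [3, 2]])
print(A2 @ B2)  # [[13 10]
                #  [ 8  8]
                #  [10 12]]
```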
Matrix algebra and SIMS
Suppose the surface contains two chemicals, each with its own mass spectrum.
Every sample spectrum is then a linear combination of the two chemical spectra, weighted by the sample composition, i.e. the data matrix is a matrix product:

Data matrix = Sample composition x Chemical spectra

[  9  32  10  1  21 ]   [ 5  1 ]   [ 1  6  1  0  4 ]
[ 18  20  22  4  12 ] = [ 2  4 ] x [ 4  2  5  1  1 ]
[ 24  12  30  6   6 ]   [ 0  6 ]

(samples x masses)    (samples x chemicals)   (chemicals x masses)
Slide 14
Matrix algebra and SIMS
[Slide 15 repeats the decomposition of the data matrix into sample composition and chemical spectra from Slide 14.]
Slide 15
Factor analysis
1. Each spectrum can be represented by a vector.
2. Instead of x, y, z in 3D real space, the axes are mass1, mass2, mass3 etc: "variable space" (also called data space).
3. Assuming the data are a linear combination of chemical spectra, we can write the data matrix as a product of two matrices: one containing the spectra (loadings) and one containing the contributions (scores). This is the basis of factor analysis!
4. There are an infinite number of possible solutions!

Data matrix = Scores x Loadings
[Figure: the 3 x 5 data matrix written as the product of a 3 x 2 scores matrix and a 2 x 5 loadings matrix, with illustrative values.]
Slide 16
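Point 4, the infinite number of solutions, is easy to demonstrate: any factorisation X = TP can be turned into another valid one, X = (TR)(R^-1 P), by an invertible matrix R. A NumPy sketch using the slide's 3 x 5 data matrix (the SVD factorisation and the matrix R here are my own choices, not the values shown on the slide):

```python
import numpy as np

# The 3 x 5 data matrix from the slides (3 samples, 5 mass variables).
X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)

# One possible factorisation, via SVD: scores T = U*s, loadings P = Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T, P = U * s, Vt
print(np.allclose(T @ P, X))    # True

# Any invertible R yields another, equally valid pair: X = (T R)(R^-1 P).
# Hence the infinite number of possible factor-analysis solutions.
R = np.array([[1.0, 0.2, 0.0],
              [0.0, 1.0, 0.3],
              [0.1, 0.0, 1.0]])
T2, P2 = T @ R, np.linalg.solve(R, P)   # solve(R, P) computes R^-1 P
print(np.allclose(T2 @ P2, X))  # True
```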
Contents
1. Introduction
2. Identification
   Principal component analysis (PCA)
   PCA walkthrough
   Data preprocessing
   PCA examples
   Multivariate curve resolution (MCR)
   MCR examples
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 17
Data analysis
Identification: what chemicals are on the surface? Where are they located?
Calibration / Quantification: how is the SIMS dataset related to known properties? Can we predict these properties?
Classification
Slide 18
Terminology
Factor: an axis in the data space of a factor analysis model, representing an underlying dimension that contributes to the data. Also called: principal component, pure component, latent vector.
Loadings: the projection of a factor onto the original variables. Also called: latent loadings, pure loadings.
Factors are directions in the data space chosen such that they reflect interesting properties of the dataset.
Equivalent to a rotation in data space: factors are new axes.
Data are described by their projections onto the factors.
Slide 20
Principal component analysis (PCA)
I = no. of samples, K = no. of mass units, D = dimensionality of data

X = TP
(I x K) = (I x D)(D x K)
Data matrix = Scores matrix x Loadings matrix

The projections of the PCA factors onto the original variables (m1, m2) are the loadings.
The projections of the samples onto the PCA factors are the scores.
The data is fully described by D factors, where D is the dimensionality of the data (number of samples or variables, whichever is smaller).
Slide 21
Principal component analysis (PCA)
PCA factors are chosen such that each successive factor describes the largest amount of variance within the data.
The amount of variance described by each factor is called its eigenvalue.
Slide 22
Principal component analysis (PCA)
I = no. of samples, K = no. of mass units, N = no. of PCA factors

X = TP + E
(I x K) = (I x N)(N x K) + (I x K)
Data matrix = Scores matrix x Loadings matrix + Residuals (noise)

By removing higher factors (small variance due to noise) we can reduce the dimensionality of the data: factor compression.
Often hundreds of variables can be described with just a handful of factors!
Slide 23
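Factor compression can be sketched in a few lines of NumPy (all data here are synthetic, invented for illustration): a 100-variable dataset generated from two underlying factors plus a little noise is reproduced almost perfectly by N = 2 PCA factors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 20 samples x 100 "mass" variables, built from 2 factors.
scores_true = rng.normal(size=(20, 2))
loadings_true = rng.normal(size=(2, 100))
X = scores_true @ loadings_true + 0.01 * rng.normal(size=(20, 100))

# PCA via SVD of the mean-centred data: X = T P + E, keeping N factors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
N = 2
T, P = (U * s)[:, :N], Vt[:N]       # scores (I x N) and loadings (N x K)
E = Xc - T @ P                      # residuals (noise)

# Two factors describe 100 variables: the relative residual is tiny.
print(np.linalg.norm(E) / np.linalg.norm(Xc))
```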
Number of factors
How many factors should we keep?
2. Scree test: the eigenvalue plot levels off in a linearly decreasing manner after 3 factors.
3. Percentage of variance captured by the Nth PCA factor:
   (Nth eigenvalue / sum of all eigenvalues) x 100%
4. Percentage of total variance captured by the first N PCA factors:
   (sum of eigenvalues up to N / sum of all eigenvalues) x 100%
[Figure: eigenvalue plots versus PCA factor for noise-free data and for data with Poisson noise (max 5000 counts).]
Slide 24
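Criteria 3 and 4 can be computed directly from the singular values of the mean-centred data matrix, since each eigenvalue is a squared singular value. A sketch using the 3 x 5 matrix from Slide 12:

```python
import numpy as np

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)
Xc = X - X.mean(axis=0)                 # mean centre each variable

s = np.linalg.svd(Xc, compute_uv=False)
eigenvalues = s**2                      # variance captured by each factor

pct = 100 * eigenvalues / eigenvalues.sum()   # criterion 3: Nth factor
cumulative = np.cumsum(pct)                   # criterion 4: first N factors
print(pct.round(1))
print(cumulative.round(1))
# After mean centring, 3 samples span at most 2 dimensions, so the third
# eigenvalue (and its percentage) is essentially zero.
```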
Contents
1. Introduction
2. Identification
   Principal component analysis (PCA)
   PCA walkthrough
   Data preprocessing
   PCA examples
   Multivariate curve resolution (MCR)
   MCR examples
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 25
PCA walkthrough
Eight polymer samples: PS 2480, PS 3550, PMMA 2170, PMMA 2500, PEG 1470, PEG 4250, PPG 425, PPG 1000.
Data were unit mass binned and mean centered prior to analysis.
Calculation using MATLAB with PLS Toolbox 4.0.
Slide 26
PCA walkthrough
[Slides 27-31: figures only.]
PCA walkthrough
Using PCA we have effectively reduced 300 correlated variables (mass units) to 3 independent variables (factors) by which all the samples can be characterised.
Slide 32
Contents
1. Introduction
2. Identification
   Principal component analysis (PCA)
   PCA walkthrough
   Data preprocessing
   PCA examples
   Multivariate curve resolution (MCR)
   MCR examples
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 33
Data preprocessing
Slide 35
Peak selection and binning
Manual selection
   Peaks of interest only
   Unexpected features lost
Auto peak search
   All peaks of interest included?
   What threshold to use?
Unit mass binning
   Straightforward to use but detailed information lost
0.5 u binning*
   Separates organics from inorganics
Important considerations
   What information are we putting into PCA? What is included? What is omitted?
   Do we need to apply further processing, e.g. dead time correction?
Mean centering
X~_ik = X_ik - mean(X_:k)
Without mean centering, the 1st factor goes from the origin to the centre of gravity of the data; with mean centering, the 1st factor accounts for the highest variance.
Slide 38
Normalisation
X~_ik = X_ik / sum(X_i:)
Preserves the shape of the spectra.
Reduces the effects of topography, sample charging, and changes in primary ion current.
[Figure: spectra before and after normalisation, 0-150 u.]
Slide 39
Variance scaling
X~_ik = X_ik / var(X_:k)
Each variable is scaled by its variance across the samples.
Diagram from P. Geladi and B. Kowalski, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta, 185 (1986) 1.
Slide 40
Poisson and binomial scaling
M. R. Keenan et al., Surf. Interface Anal., 36 (2004) 203
M. R. Keenan et al., Surf. Interface Anal., 40 (2008) 97
Slide 41
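The preprocessing steps above can be sketched as small NumPy functions (X is samples x mass bins; the function names are my own, and the Poisson scaling is a simplified form of the row/column square-root weighting described by Keenan et al.):

```python
import numpy as np

def mean_centre(X):
    # X~_ik = X_ik - mean(X_:k): subtract each variable's mean.
    return X - X.mean(axis=0)

def normalise(X):
    # X~_ik = X_ik / sum(X_i:): divide each spectrum by its total counts,
    # preserving spectral shape while removing total-intensity variations.
    return X / X.sum(axis=1, keepdims=True)

def poisson_scale(X):
    # Down-weight high-count (high Poisson variance) rows and columns by
    # the square roots of their means (simplified Keenan-style scaling).
    row_mean = X.mean(axis=1, keepdims=True)
    col_mean = X.mean(axis=0, keepdims=True)
    return X / np.sqrt(row_mean) / np.sqrt(col_mean)

X = np.array([[9.0, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]])
print(normalise(X).sum(axis=1))      # every spectrum now sums to 1
print(mean_centre(X).mean(axis=0))   # every variable now has zero mean
```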
Contents
1. Introduction
2. Identification
   Principal component analysis (PCA)
   PCA walkthrough
   Data preprocessing
   PCA examples
   Multivariate curve resolution (MCR)
   MCR examples
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 42
PCA example (1)
D.J. Graham et al, Appl. Surf. Sci., 252 (2006) 6860 Slide 43
PCA example (2)
95% confidence limits provide a means for identification / classification.
PCA of images: the raw data cube has I rows, J columns and K mass peaks; it is unfolded into a 2D matrix of pixel spectra before PCA.
Slide 46
PCA image example (1)
Immiscible PC / PVC polymer blend, 42 counts per pixel on average.
[Figure: total ion image, and sorted eigenvalue plots for mean centering, normalisation and Poisson scaling.]
With Poisson scaling, only 2 factors are needed: the dimensionality of the image is reduced by a factor of 20!
PC1 scores and loadings separate the PVC and PC phases (e.g. the O and OH peaks).
The 2nd factor shows detector saturation for the intense 35Cl peak.
PCA image example (2)
Image courtesy of Dr Ian Fletcher, Intertek MSG.
[Figure: total ion image and total spectra.]
Slide 49
PCA image example (2)
Hair fibre with multi-component pretreatment. Image courtesy of Dr Ian Fletcher, Intertek MSG.
[Figure: scores images and loadings spectra for PCA factors 1-5, with characteristic peaks labelled A-E.]
Slide 51
Contents
1. Introduction
2. Identification
   Principal component analysis (PCA)
   PCA walkthrough
   Data preprocessing
   PCA examples
   Multivariate curve resolution (MCR)
   MCR examples
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 52
Multivariate curve resolution (MCR)
Given only the data matrix, can we recover the sample compositions and the chemical spectra?

[  9  32  10  1  21 ]   [ 5  1 ]   [ 1  6  1  0  4 ]
[ 18  20  22  4  12 ] = [ 2  4 ] x [ 4  2  5  1  1 ]
[ 24  12  30  6   6 ]   [ 0  6 ]

(samples x masses)    (samples x chemicals)   (chemicals x masses)

Try multivariate curve resolution (MCR)!
Slide 53
Multivariate curve resolution (MCR)
I = no. of samples, K = no. of mass units, N = no. of factors

X = TP + E
(I x K) = (I x N)(N x K) + (I x K)
Data matrix = (projection of samples onto factors: scores matrix) x (projection of factors onto variables: loadings matrix) + residuals (noise)

MCR is designed for recovery of chemical spectra and contributions from a multi-component mixture, when little or no prior information about the composition is available.
MCR assumes a linear combination of chemical spectra (loadings) and contributions (scores): only an approximation in SIMS.
Slide 54
Multivariate curve resolution (MCR)
MCR uses an iterative least-squares algorithm to extract solutions, while applying suitable constraints.
With a non-negativity constraint, MCR factors resemble SIMS spectra and chemical contributions more directly, as these must be positive.
Slide 55
Outline of MCR
X = TP + E
1. Raw data -> data matrix X.
2. Number of factors: determined e.g. by PCA; noise-filtered data ensures the MCR solution is robust.
3. Initial estimates of T or P:
   Random initialisation
   PCA loadings or scores
   Varimax-rotated PCA loadings or scores
   Pure variable detection algorithm, e.g. SIMPLISMA
4. MCR alternating least squares optimisation, applying constraints (non-negativity, equality), until the convergence criterion is met.
5. Outputs: MCR scores T, MCR loadings P, and the reproduced data matrix X.
Slide 56
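A minimal sketch of the alternating least squares loop in NumPy (the `mcr_als` function is my own; non-negativity is enforced crudely by clipping, whereas real implementations such as the MCR-ALS toolbox use proper constrained least squares). It is applied to the noise-free two-chemical example from Slide 14:

```python
import numpy as np

def mcr_als(X, n_factors, n_iter=500, seed=0):
    """Crude MCR-ALS sketch: alternate least-squares updates of scores T
    and loadings P, clipping negative values to enforce non-negativity."""
    rng = np.random.default_rng(seed)
    T = rng.uniform(size=(X.shape[0], n_factors))   # random initial scores
    for _ in range(n_iter):
        P = np.clip(np.linalg.lstsq(T, X, rcond=None)[0], 0, None)
        T = np.clip(np.linalg.lstsq(P.T, X.T, rcond=None)[0].T, 0, None)
    return T, P

# Two-chemical mixture from Slide 14: X = contributions x spectra.
C = np.array([[5.0, 1], [2, 4], [0, 6]])
S = np.array([[1.0, 6, 1, 0, 4], [4, 2, 5, 1, 1]])
X = C @ S

T, P = mcr_als(X, n_factors=2)
rel_err = np.linalg.norm(X - T @ P) / np.linalg.norm(X)
print(rel_err)   # relative reconstruction error
```

Note that, as the rotational ambiguity discussion below the flowchart implies, the recovered T and P are only defined up to scaling and permutation of the factors.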
Rotational ambiguity
MCR solutions are not unique: different pairs of factors can reproduce the data equally well.
Good initial estimates, suitable data preprocessing and the correct number of factors are essential.
Slide 57
Contents
1. Introduction
2. Identification
   Principal component analysis (PCA)
   PCA walkthrough
   Data preprocessing
   PCA examples
   Multivariate curve resolution (MCR)
   MCR examples
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 58
MCR image example (1)
Simple PVC / PC polymer blend.
MCR calculations using Matlab with the MCR-ALS toolbox, freely available from http://www.mcrals.info/
Slide 59
MCR image example (1)
MCR scores (pure component concentrations): these will be folded to form projection images.
MCR loadings (pure component spectra).
MCR calculations using Matlab with the MCR-ALS toolbox, freely available from http://www.mcrals.info/
Slide 60
MCR image example (1)
Simple PVC / PC polymer blend.
MCR extracts two distinctive factors, corresponding to PVC and PC respectively: straightforward interpretation.
[Figure: loadings and scores on MCR factors 1 (PVC) and 2 (PC), 0-40 u.]
J. L. S. Lee et al., Surf. Interface Anal. 2008, 40, 1-14
Slide 61
MCR image example (2)
Image courtesy of Dr Ian Fletcher, Intertek MSG.
[Figure: total ion image and total spectra.]
Slide 62
MCR image example (2)
Image courtesy of Dr Ian Fletcher, Intertek MSG.
MCR loadings resemble chemical spectra (characteristic peaks A-E) and fragments, and the scores directly reveal the spatial distributions!
[Figure: MCR scores images and loadings spectra for factors 1-5.]
Slide 64
MCR image example (3)
MCR resolves the original images unambiguously!
Slide 65
MCR spectra example
Scores and loadings for 3 of the MCR factors show each factor's contribution to the depth profile.
Improves signal to noise and the correlation of related peaks.

MCR summary
MCR describes the original data using factors, consisting of loadings and scores which resemble chemical spectra and contributions from a multi-component mixture, respectively.
MCR uses an iterative algorithm to extract solutions, while applying suitable constraints, e.g. non-negativity.
Good initial estimates and suitable data preprocessing are essential.
MCR is excellent for identification and localisation of chemicals in complex mixtures and allows for direct interpretation.
Slide 67
Identification summary
Manual analysis: chemical identification difficult (characteristic peaks only); detection of minor components easy only if the substance is known; most suitable for simple datasets with good prior knowledge.
PCA: chemical identification medium (important peaks and correlation); detection of minor components difficult (higher factors capture small variance); most suitable for discrimination of similar chemical phases.
MCR: chemical identification easy (full spectra obtained); detection of minor components possible, depending on the system studied; most suitable for identification of unknown mixtures.
Slide 68
Contents
1. Introduction
2. Identification
3. Quantification and prediction
   Partial least squares regression (PLS)
   Calibration, validation and prediction
   PLS examples
4. Classification
5. Conclusion
Slide 69
Data analysis
Identification: what chemicals are on the surface? Where are they located?
Calibration / Quantification: how is the SIMS dataset related to known properties? Can we predict these properties?
Classification
Slide 70
Regression analysis
y = b*x + e
where y is the response variable, b the regression coefficient, x the predictor variable and e the error.
[Figure: response variable y versus predictor variable x, with fitted regression line.]
Slide 71
Multivariate regression
Known properties of the samples:

            Molecular weight   Solution concentration   Reaction time
Sample 1          5                      1                    3
Sample 2          2                      4                    7
Sample 3          1                      6                    4

Can we predict the properties of similar materials from their SIMS spectra?

y = f(x) + e
y = b1*x1 + b2*x2 + b3*x3 + ... + bm*xm + e

where y is the response variable (i.e. a measured property), bm a regression coefficient, and xm a predictor variable (i.e. the intensity at mass m).
Slide 72
Multivariate regression
I = no. of samples, K = no. of mass units, M = no. of response variables

Y = XB + E
(I x M) = (I x K)(K x M) + (I x M)
X: data matrix; Y: response matrix; B: regression matrix

1. We can calculate B to gain an understanding of the covariance relationship between X and Y, e.g. relating SIMS spectra with sample preparation parameters.
2. B can be applied to future samples in order to predict Y using only measurements of X, e.g. quantifying the surface composition or coverage of samples using only their SIMS spectra.
Slide 73
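Point 2 can be sketched with ordinary least squares in NumPy (the three response rows come from the slide's property table; the 3 x 5 "spectra" are invented for illustration). With more variables (5) than samples (3) the fit is exact, which is precisely why the validation discussed on the later slides matters:

```python
import numpy as np

# Y = X B + E by ordinary least squares. Y is the slide's property table;
# the spectra X are random invented counts.
rng = np.random.default_rng(2)
X = rng.poisson(20, size=(3, 5)).astype(float)   # 3 samples x 5 mass bins
Y = np.array([[5.0, 1, 3],   # molecular weight, concentration, reaction time
              [2, 4, 7],
              [1, 6, 4]])

B = np.linalg.lstsq(X, Y, rcond=None)[0]   # (K x M) regression matrix
print(np.allclose(X @ B, Y))               # True: more variables than samples
```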
Partial least squares regression (PLS)
Y = XB + E
(I x M) = (I x K)(K x M) + (I x M)

The regression vectors B are a linear combination of the PLS loadings that best predict Y from X.
It is important to determine the number of factors to include in PLS!
Slide 74
Calibration, validation, prediction
Y = XB + E
(I x M) = (I x K)(K x M) + (I x M)

Calibration: fit a PLS model to a calibration data set with known X and Y; use cross-validation to determine the number of factors.
Validation: apply the model to an independent validation data set with known X and Y, and calculate the error between the predicted and known Y.
Slide 75
Number of factors: cross validation
[Figure: response variable Y versus predictor variable X for calibration and validation data.]
Slide 76
Number of factors: cross validation
RMSEC (Root Mean Square Error of Calibration) goes down with an increasing number of factors.
To decide the optimal number of factors, use the minimum of RMSECV (Root Mean Square Error of Cross Validation).
[Figure: RMSEC and RMSECV versus number of PLS factors; RMSECV has a minimum at the optimal number.]

"Leave one out" cross validation is the most popular:
1. Calculate the PLS model excluding sample i, for N PLS factors.
2. Use the model to predict sample i and calculate the error.
3. Repeat for all samples.
4. Calculate the root mean square error of cross validation (RMSECV).
5. Repeat for different numbers of PLS factors.
Slide 77
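The leave-one-out loop above can be sketched directly in NumPy (all data here are synthetic; for brevity the model fitted inside the loop is principal component regression rather than PLS, as a stand-in: the cross-validation logic is identical, only the factor definition differs):

```python
import numpy as np

def fit_factor_regression(X, Y, n):
    # Regression through the first n factors of X (PCR, a stand-in for PLS).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n].T @ ((U[:, :n].T @ Y) / s[:n, None])

def rmsecv(X, Y, n):
    # Leave one out: fit without sample i, predict sample i, collect errors.
    errors = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        B = fit_factor_regression(X[keep], Y[keep], n)
        errors.append(Y[i] - X[i] @ B)
    return np.sqrt(np.mean(np.square(errors)))

# Synthetic calibration set: 15 samples, 30 variables, 2 underlying factors.
rng = np.random.default_rng(3)
T = rng.normal(size=(15, 2))
X = T @ rng.normal(size=(2, 30)) + 0.05 * rng.normal(size=(15, 30))
Y = T @ np.array([[1.0], [-2.0]]) + 0.05 * rng.normal(size=(15, 1))

scores = [rmsecv(X, Y, n) for n in range(1, 6)]
print(1 + int(np.argmin(scores)))   # number of factors with minimum RMSECV
```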
Validation and prediction
Validation data should be statistically independent from the calibration data, e.g. data taken on a different batch of samples, on a different day.
Calculate RMSEP (Root Mean Square Error of Prediction).
An independent validation set is essential if we want to use the model to predict new samples!
Calibration -> Validation -> Prediction
Slide 78
Contents
1. Introduction
2. Identification
3. Quantification and prediction
   Partial least squares regression (PLS)
   Calibration, validation and prediction
   PLS examples
4. Classification
5. Conclusion
Slide 79
PLS example (1)
The first PLS factors capture … of the variance in X (the data) and 98.8% of the variance in Y (the thicknesses).
[Figure: PLS loadings and scores for factors 1 and 2.]
Slide 81
PLS example (1)
The PLS regression vector shows the Irganox characteristic peaks most correlated with thickness (e.g. 231, 277, 1176 u).
Irganox dewets on the surface, so the initial thickness is proportional to the surface coverage!
[Figure: regression vector versus mass, and predicted thickness versus thickness measured by XPS (nm).]
F. M. Green et al., Anal. Chem. 2009, 81, 7579
Slide 82
PLS example (2)
[Figure: data matrix and regression matrix.]

PLS summary
PLS is a multivariate linear regression technique.
PLS finds factors that best describe the structure of covariance between X and Y.
The data preprocessing method needs to be selected with care.
PLS is excellent for calibration and quantification, and for studying the relationship between SIMS data and other measured properties.
Properly validated PLS models can be used for prediction of these properties using SIMS spectra.
Slide 84
Contents
1. Introduction
2. Identification
3. Quantification and prediction
4. Classification
   PCA classification
   Principal Component Discriminant Function Analysis (PC-DFA)
   Partial Least Squares Discriminant Analysis (PLS-DA)
5. Conclusion
Slide 85
Data analysis
Identification: what chemicals are on the surface? Where are they located?
Calibration / Quantification: how is the SIMS dataset related to known properties? Can we predict these properties?
Classification
Slide 86
PCA classification
Slide 88
Example 1: PC-DFA
Fisher's ratio = (mean1 - mean2)^2 / (var1 + var2)
i.e. the squared separation of the class means divided by the sum of the within-class variances.
Used to distinguish strains of bacteria.
J. S. Fletcher et al., Appl. Surf. Sci. 252 (2006) 6869
Slide 89
Example 2: PLS-DA
PCA approach: overlay of PC 1, 2 and 3 scores; PC 10 scores and loadings.
Plant tissue image with mesophyll, epidermal and trichome regions.
PC 10 (0.58% of the variance) describes the differences between epidermal cells and other areas, but this is not efficient!
[Figure: scores images and decluttered PC 10 loadings plot.]
Image courtesy of Dr Kat Smart and Prof Chris Grovenor at the University of Oxford
Slide 90
Example 2: PLS-DA
PLS-DA prediction: the regression vector is the combination of peaks which best predicts the differences between the classes.
[Figure: PLS-DA prediction image and decluttered regression vector plot.]
Image courtesy of Dr Kat Smart and Prof Chris Grovenor at the University of Oxford
Slide 91
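The idea behind PLS-DA can be sketched in a few lines of NumPy: encode class membership as an indicator (dummy) Y matrix, regress it on the spectra, and assign each sample to the class with the largest predicted value. Here plain least squares stands in for the PLS regression step, and the two-class "spectra" are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
n_per, k = 10, 20

# Two synthetic classes: spectra scattered around two class-mean spectra.
means = rng.normal(size=(2, k))
X = np.vstack([m + 0.3 * rng.normal(size=(n_per, k)) for m in means])
labels = np.repeat([0, 1], n_per)
Y = np.eye(2)[labels]                      # indicator (dummy) matrix

B = np.linalg.lstsq(X, Y, rcond=None)[0]   # regression vectors, one per class
predicted = np.argmax(X @ B, axis=1)       # class with the largest prediction
print((predicted == labels).mean())        # training accuracy
```

As with PLS regression, a model like this must be validated on independent data before its predictions can be trusted.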
Classification summary
Contents
1. Introduction
2. Identification
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 93
Data analysis
Identification (PCA, MCR): what chemicals are on the surface? Where are they located?
Calibration / Quantification (PLS): how is the SIMS dataset related to known properties? Can we predict these properties?
Classification (PC-DFA, PLS-DA)
Slide 94
Conclusion
Further reading:
Surface and Interface Analysis, Multivariate Analysis special issues (Volume 41, Issues 2 & 8, Feb/Aug 2009)
Surface Analysis: The Principal Techniques, 2nd edition, Chapter 10, "The application of multivariate data analysis techniques in surface analysis"
Slide 95
Bibliography
General
J. L. S. Lee et al., The application of multivariate data analysis techniques in surface analysis, in Surface Analysis: The Principal Techniques, 2nd edition (eds J. C. Vickerman, I. S. Gilmore), Wiley.
P. Geladi et al., Multivariate Image Analysis, John Wiley and Sons (1996)
J. L. S. Lee et al., Quantification and methodology issues in multivariate analysis of ToF-SIMS data for mixed organic systems, Surf. Interface Anal. 40 (2008) 1
D. J. Graham, NESAC/BIO ToF-SIMS MVA web resource, http://nb.engr.washington.edu/nb-sims-resource/
PCA
D. J. Graham et al., Information from complexity: challenges of ToF-SIMS data interpretation, Appl. Surf. Sci. 252 (2006) 6860
M. R. Keenan et al., Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images, Surf. Interface Anal. 36 (2004) 203
M. R. Keenan et al., Mitigating dead-time effects during multivariate analysis of ToF-SIMS spectral images, Surf. Interface Anal. 40 (2008) 97
MCR
N. B. Gallagher et al., Curve resolution for multivariate images with applications to TOF-SIMS and Raman, Chemom. Intell. Lab. Syst. 73 (2004) 105
J. A. Ohlhausen et al., Multivariate statistical analysis of time-of-flight secondary ion mass spectrometry using AXSIA, Appl. Surf. Sci. 231-232 (2004) 230
R. Tauler, A. de Juan, MCR-ALS Graphic User Friendly Interface, http://www.ub.edu/mcr/
PLS
P. Geladi et al., Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta 185 (1986) 1
A. M. C. Davies et al., Back to basics: observing PLS, Spectroscopy Europe 17 (2005) 28
Slide 96