
Module 6

Data preprocessing
PCA
Hypothesis testing
Clustering
[Figure: two known classes, Class A and Class B, and an unknown point "?" to be assigned to one of them]
Classification – “Work Flow”

Raw Data → Preprocessing → Feature Selection → Classifier → Quality Assessment
Raw Data
• Typically vectors

ID     Features (x)               Target value (y)
Mol1   12, 2, 1, 3, 4, 36, 7.5    1
Mol2   4, 5.6, 3, 8, 99, 1, 5.2   0
Mol3   2.1, 2.2, 4, 6, 3, 1, 11   1
…
MolM   x1, x2, …, xn              yM

M data points in an N-dimensional space.


Preprocessing: Mean-Centering
• Eliminates a constant offset of the data

x*_ik = x_ik − x̄_k        i: row index, k: column index
Preprocessing: Range Scaling

• Eliminates different column metrics

x*_ik = (x_ik − x_k(min)) / (x_k(max) − x_k(min)),        0 ≤ x*_ik ≤ 1

i: row index, k: column index
Preprocessing: Autoscaling

• Eliminates different column metrics
• Yields data with zero mean and unit variance

x*_ik = (x_ik − x̄_k) / s_k        where        s_k = sqrt( Σ_{i=1..M} (x_ik − x̄_k)² / (M − 1) )
→ sphericization of groups
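The three preprocessing transforms above can be sketched per column in plain Python (a minimal illustration; the function names are my own, not from the slides):

```python
import math

def mean_center(column):
    """x*_ik = x_ik - mean_k : removes a constant offset from the column."""
    m = sum(column) / len(column)
    return [x - m for x in column]

def range_scale(column):
    """x*_ik = (x_ik - min_k) / (max_k - min_k) : maps the column onto [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def autoscale(column):
    """x*_ik = (x_ik - mean_k) / s_k : zero mean and unit variance per column."""
    m = sum(column) / len(column)
    s = math.sqrt(sum((x - m) ** 2 for x in column) / (len(column) - 1))
    return [(x - m) / s for x in column]
```

In a full preprocessing step each column of the M×N data matrix would be transformed independently.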
Preprocessing / Feature extraction: PCA
• Reduces the dimension of the data vectors
• Generates a new orthogonal co-ordinate system

[Figure: raw data in 3D (axes x1, x2, x3) projected onto 2D PCA scores (axes PC1, PC2); Var1 and Var2 mark the directions of largest variance in the raw data]
PCA Scores

Raw data                 Scores
ID      x1   x2   x3     PC1    PC2    PC3
Mol1    34   12    3      0.7   -1.4    1.6
Mol2     6    2    3      0.4    1.0   -1.5
Mol3     2   12   28     -1.1    0.4   -0.2
Mol4     9   17    4     -0.6   -1.7   -0.4
Mol5    20    5    9      0.5    0.4    0.2
Mol6     4   14   56     -1.9    1.1    1.4
Mol7     3   14    2     -0.6   -1.1   -1.2
Mol8    30    2    3      1.3    0.5    0.7
Mol9    23    3    1      1.0    0.4   -0.0
Mol10   15    5    6      0.4    0.4   -0.4

Correlation
      PC1     PC2    PC3
x1    0.81   -0.16   0.57
x2   -0.76   -0.63   0.16
x3   -0.79    0.44   0.43


Excursus: PCA

Starting data matrix, X

       x1  x2  x3  x4  x5  x6
Mol1    2   1   1   1   3   3
Mol2    1   1   2   1   6   7
Mol3    6   5   6   4   6   5
Mol4    5   5   6   5   4   4

• Which "basic properties" (factors) underlie the six variables?

• Are all six variables needed to describe the molecules?

Correlation analysis to uncover the relationships between the variables
→ factor analysis
Excursus: PCA

Correlation matrix, R
Correlation coefficient (Pearson):

r_x1,x2 = Σ_{i=1..N} (x_1i − x̄_1)(x_2i − x̄_2) / sqrt( Σ_{i=1..N} (x_1i − x̄_1)² · Σ_{i=1..N} (x_2i − x̄_2)² ),    in [−1, 1]

N: number of objects (molecules)
(n: number of variables)

      x1    x2    x3    x4    x5    x6
x1     1   0.97  0.93  0.92  0.14 -0.29
x2           1   0.99  0.98  0.19 -0.17
x3                 1   0.97  0.32 -0.02
x4                       1   0.08 -0.21
x5                             1   0.87
x6                                   1
Excursus: PCA

Standardized variable matrix, Z

z_ji = (x_ji − x̄_j) / s_j        s_j: standard deviation of variable j
                                  s_j = sqrt( 1/(N−1) · Σ_{i=1..N} (x_ji − x̄_j)² )

→ mean equal to zero
→ standard deviation equal to 1

R = 1/(N−1) · Z · Z^T        Z^T: transpose of the standardized starting matrix X

→ simpler calculation of R

The correlation matrix and the variance-covariance matrix are identical (for standardized data):

s²_jk = cov(j, k) = 1/(N−1) · Σ_{i=1..N} z_ji · z_ki

j and k: two of the n variables
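The identity above (Pearson correlation = scaled dot product of the standardized variables) can be checked with the first two variables of the starting matrix X. A minimal sketch in Python; the function names are my own:

```python
import math

def pearson_r(x1, x2):
    """Pearson correlation coefficient, in [-1, 1]."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    den = math.sqrt(sum((a - m1) ** 2 for a in x1) *
                    sum((b - m2) ** 2 for b in x2))
    return num / den

def standardize(x):
    """z_i = (x_i - mean) / s, with s the sample standard deviation."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return [(v - m) / s for v in x]

def r_from_z(x1, x2):
    """Correlation as 1/(N-1) * (z1 . z2) -- identical to pearson_r."""
    z1, z2 = standardize(x1), standardize(x2)
    return sum(a * b for a, b in zip(z1, z2)) / (len(x1) - 1)
```

With x1 = (2, 1, 6, 5) and x2 = (1, 1, 5, 5) from the matrix X, both routes give r ≈ 0.97, matching the correlation matrix on this slide.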
Excursus: PCA

Principal component analysis (PCA)

Model:

z_j = a_j1·F_1 + a_j2·F_2 + … + a_jn·F_n

Z = A · F        n: number of variables in X
                  F: uncorrelated components ("factors")

Each of the n variables in Z is described by n new, uncorrelated variables F.

A linear relationship between the factors and the variables is assumed!
Excursus: PCA

Principal component analysis (PCA)

Z = F · A

X = S · L^T

Score vectors (→ new variable values)        Loadings (→ correlation with the original variables)

Principal components of a matrix
• are orthogonal to each other ("perpendicular")
• do not change their direction under a linear transformation of the matrix

heuristic: NIPALS algorithm
exact: eigenvectors of the variance-covariance matrix
Excursus: Covariance

Covariance matrix of the random variables x1, x2, x3

Σ = | Cov(x1, x1)  Cov(x1, x2)  Cov(x1, x3) |
    | Cov(x2, x1)  Cov(x2, x2)  Cov(x2, x3) |        with    Cov(x, y) = 1/N · Σ_{i=1..N} (x_i − μ_x)(y_i − μ_y)
    | Cov(x3, x1)  Cov(x3, x2)  Cov(x3, x3) |

Determinant |Σ| of a matrix A:

The determinant function det(A) assigns a unique value (a number) to a matrix.
This value is called the "determinant" ("scaling factor of the volume")
= product of the eigenvalues

"Rule of Sarrus" (valid only for 3×3 matrices!):

|Σ| = det(A) = a11·a22·a33 + a12·a23·a31 + a13·a21·a32 − a13·a22·a31 − a11·a23·a32 − a12·a21·a33
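The rule of Sarrus above translates directly into code. A minimal sketch (the function name is my own); for a diagonal matrix the result equals the product of the eigenvalues, as stated on the slide:

```python
def det3_sarrus(a):
    """Determinant of a 3x3 matrix via the rule of Sarrus (3x3 only!).
    a is a list of three rows of three numbers."""
    return (a[0][0] * a[1][1] * a[2][2]
            + a[0][1] * a[1][2] * a[2][0]
            + a[0][2] * a[1][0] * a[2][1]
            - a[0][2] * a[1][1] * a[2][0]
            - a[0][0] * a[1][2] * a[2][1]
            - a[0][1] * a[1][0] * a[2][2])
```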
Excursus: PCA

NIPALS algorithm (S. Wold)
nonlinear iterative partial least squares

Reproduces the data matrix X by a matrix multiplication of S (score matrix) and L^T (transposed loading matrix).

→ iterative estimation of s1, s2, etc.

1) s := x_i                    Choose an arbitrary column of the matrix X as the initial score vector s.

2) l = X^T·s / (s^T·s)         Project X onto s to obtain the corresponding loading vector.

3) l = l / |l|                 Normalize l to length 1.

4) s_old := s                  Store the current score vector.

5) s = X·l / (l^T·l)           Project X onto l to obtain the corresponding new score vector.

6) d := |s_old − s|            Compute the difference d between the new and the old score vector.

7) d < t?                      If d is larger than a defined threshold t (e.g. 10⁻⁶), go back to step 2; otherwise continue with step 8.

8) E := X − s·l^T              Remove the contribution of the principal component just found from the data matrix X.

9) X := E                      To find further principal components, repeat the algorithm from step 1 with the new data matrix.
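Steps 1 to 8 above can be sketched as one function that extracts a single principal component and returns the deflated matrix (a minimal illustration without any linear-algebra library; function and variable names are my own):

```python
import math

def nipals_pc(X, tol=1e-6, max_iter=500):
    """One principal component of X (rows = objects) via NIPALS.
    Returns (score vector s, loading vector l, deflated matrix E).
    The first column of X is used as the initial score vector and
    must not be all zeros."""
    n_rows, n_cols = len(X), len(X[0])
    s = [row[0] for row in X]                       # step 1: a column of X as initial score
    for _ in range(max_iter):
        ss = sum(v * v for v in s)
        l = [sum(X[i][j] * s[i] for i in range(n_rows)) / ss  # step 2: l = X^T s / s^T s
             for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in l))
        l = [v / norm for v in l]                   # step 3: normalize l to length 1
        s_old = s                                   # step 4: store current score
        ll = sum(v * v for v in l)                  # = 1 after normalization
        s = [sum(X[i][j] * l[j] for j in range(n_cols)) / ll  # step 5: s = X l / l^T l
             for i in range(n_rows)]
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(s_old, s)))  # step 6
        if d < tol:                                 # step 7: converged?
            break
    E = [[X[i][j] - s[i] * l[j] for j in range(n_cols)]  # step 8: E = X - s l^T
         for i in range(n_rows)]
    return s, l, E
```

For a rank-1 matrix the first component reproduces X exactly, so the residual E is (numerically) zero; repeating on E (step 9) would yield the remaining components.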
PCA – Variance Explained / Eigenvalues

      Eigenvalue (EV)   % variance explained   cumulative EV   cumulative variance
PC1   1.9               63                     1.9             63
PC2   0.6               20                     2.5             83
PC3   0.5               17                     3.0             100

→ Eigenvalue-one criterion
→ Scree plot
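The percentage columns in the table follow directly from the eigenvalues (for autoscaled data the eigenvalues sum to the number of variables). A minimal sketch; the function name is my own:

```python
def variance_explained(eigenvalues):
    """Percent and cumulative percent of variance per PC."""
    total = sum(eigenvalues)
    pct = [100 * ev / total for ev in eigenvalues]
    cum = [sum(pct[:i + 1]) for i in range(len(pct))]
    return pct, cum
```

With the eigenvalues 1.9, 0.6 and 0.5 from the table this reproduces 63%, 20%, 17% and the cumulative values 63%, 83%, 100%.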
PCA – Scree Plot

• Selection of relevant PC-variables

[Figure: eigenvalue vs. eigenvector rank; the eigenvectors (PCs) before the "elbow" of the curve are to be considered]
A Practical Example
ACE inhibitors vs. MMP inhibitors
Descriptors: MW, clogP, SAS, ovality, boiling point

[Figure: scree plot (eigenvalue vs. number of eigenvalues, 1 to 5) and the corresponding PC1/PC2 score plot separating the two classes]
A Practical Example
ACE inhibitors vs. Cytotoxic agents
Descriptors: MW, clogP, SAS, ovality

[Figure: scree plot (eigenvalue vs. number of eigenvalues, 1 to 4) and the corresponding PC1/PC2 score plot]
Homework!
PCA
Tutorial at Manchester Metropolitan University
http://149.170.199.144/multivar/pca.htm
Hypothesis testing: Student’s t-Test (1)

tests whether the means of two groups are statistically different from each other

The t-Test judges the difference between two means relative to the spread or variability of the two data groups.

[Figure: pairs of group distributions with the same mean difference but different overlap; the pair whose distributions overlap least is intuitively "most different"]

• assumes Gaussian data distribution

http://www.socialresearchmethods.net/kb/stat_t.htm
http://www.physics.csbsju.edu/stats/t-test.html
Hypothesis testing: Student’s t-Test (2)

t = signal / noise = (difference between group means) / (variability of the groups)

t = (x̄_test − x̄_control) / sqrt( σ²_test / N_test + σ²_control / N_control )

[Figure: probability densities P(x) of the test group and the control group along the property axis x]

Is t large enough to be significant?

• α-level: risk of being wrong (typically set to 0.05 = 5%)
• degrees of freedom = N_test + N_control − 2
→ lookup table
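The t statistic defined above is simple to compute directly. A minimal sketch using only the standard library (the function name is my own); the resulting value would then be compared against a lookup table at the chosen α-level and N_test + N_control − 2 degrees of freedom:

```python
import math

def t_statistic(test, control):
    """t = (mean_test - mean_control) /
           sqrt(var_test/N_test + var_control/N_control)."""
    def mean(x):
        return sum(x) / len(x)
    def var(x):  # sample variance
        m = mean(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    return (mean(test) - mean(control)) / math.sqrt(
        var(test) / len(test) + var(control) / len(control))
```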
Hypothesis testing: Kolmogorov-Smirnov-Test (1)

• no assumption about the data distribution ("non-parametric")
  in contrast to Student's t-Test!

KS-testing for:
• discriminating features
• normal distribution
• similarity of two distributions

The KS statistic is applicable to small sample sizes.

based on cumulative distribution analysis

http://www.physics.csbsju.edu/stats/KS-test.html
Hypothesis testing: Kolmogorov-Smirnov-Test (2)

“Dissimilarity” between two property distributions

K = max_{−∞ < x < ∞} |P(x) − P*(x)|        maximum value of the absolute difference between two distribution functions

P(x): actual distribution; P*(x): target (reference, control) distribution

Example:
http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Kolmogorov-Smi/basis-statistik-kolmogorov-smi.html
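For two finite samples the statistic K is the largest gap between their empirical cumulative distribution functions. A minimal sketch (function names are my own; a real test would additionally compare K against a critical value for the sample sizes):

```python
def ks_statistic(sample_a, sample_b):
    """K = max_x |P(x) - P*(x)| for the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))  # the ECDFs only change at sample values

    def ecdf(sorted_sample, x):
        """Fraction of the sample that is <= x."""
        return sum(1 for v in sorted_sample if v <= x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
```

Identical samples give K = 0; completely separated samples give K = 1.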
Break!
Classification Scenarios

The data representation must reflect a solution of the problem.

[Figure: three scenarios: a linear separator, a nonlinear separator, and a case with no obvious separator ("?")]
Classification Methods

• Hierarchical clustering (e.g., Ward's method)
• Non-hierarchical methods (e.g., k-means clustering)

Tutorial at Manchester Metropolitan University
http://149.170.199.144/multivar/ca.htm
Data Preparation

[Figure: the available data as a subset of the data universe]

• are the available data representative of the problem?
• is the number of data points sufficient for meaningful statistics?

Split into training and test set → supervised methods / unsupervised methods
Experimental design:
• Selection of “representative” data sets

• ”D/S-optimal” design
• similarity/dissimilarity-based methods
• partition-based methods
• approximation by e.g., Kennard-Stone (Maxmin) algorithm
Ref.: Gillet et al. (1998) Similarity and dissimilarity methods for
processing chemical structure databases.
www3.oup.co.uk/computer_journal/subs/Volume_41/Issue_08/Gillet.pdf
Outlier detection
By
• visual inspection of the data distribution
• a formal criterion, e.g. 3σ threshold


Software tools
• SIMCA-P
• Statistica
•R
•…
Measuring Binary Classification Accuracy
• Training data (re-classification)
• Test data (classification)

Q1 = (P + N) / #predicted        Q2 = P / (P + O)        Q3 = P / (P + U)

Matthews cc = (P·N − O·U) / sqrt( (N+U)·(N+O)·(P+U)·(P+O) ),    in [−1, 1]

P: # positive correct
N: # negative correct
O: # false-positives (“overprediction”)
U: # false-negatives (“underprediction”)
Example: Matthews cc

[Figure: scatter of Class 1 and Class 0 points with a decision boundary]

cc = (P·N − O·U) / sqrt( (N+U)·(N+O)·(P+U)·(P+O) )

Total = 11
P = 4, N = 4, O = 1, U = 2

cc = (4·4 − 1·2) / sqrt((4+2)·(4+1)·(4+2)·(4+1))
   = 14 / sqrt(900)
   ≈ 0.47
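The Matthews coefficient from the formula above as a one-line function (a minimal sketch; the function name is my own):

```python
import math

def matthews_cc(P, N, O, U):
    """cc = (P*N - O*U) / sqrt((N+U)(N+O)(P+U)(P+O)), in [-1, 1].
    P/N: correct positives/negatives, O/U: false positives/negatives."""
    return (P * N - O * U) / math.sqrt((N + U) * (N + O) * (P + U) * (P + O))
```

A perfect classifier (O = U = 0) gives cc = 1; the worked example above gives 14/30 ≈ 0.47.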
Hierarchical Cluster Analysis

Nested classes
“Phylogenetic” classification (dendrogram)

1. Data preprocessing (which variables, scaling etc.)


2. Calculate the distance matrix, D
3. Find the two most similar objects in D
4. Fuse these two objects
5. Update D: calculate the distances between
the new cluster and all other objects
6. Repeat with step 3) until all cases are in one cluster
Clustering Algorithms - Examples

Aggregation of clusters can be based on minimal dissimilarity between clusters.

• Average Linkage (lowest average distance) – recommended
  (UPGMA: Unweighted Pair-Group Method with Averages)
• Single Linkage (nearest neighbor) → loose clusters, long “chains”
• Complete Linkage (furthest neighbor) → compact, well-separated clusters
Example: Hierarchical Clustering
(Euclidean distance, average linkage)

Data:
ID   x1  x2
A1    1   1
A2    2   1
B1    4   5
B2    5   7
B3    7   7

[Figure: scatter plot of the five points in the x1/x2 plane, and the dendrogram (linkage distance 0 to 7, leaf order B3, B2, B1, A2, A1): A1 and A2 fuse first, then B2 and B3, then B1 joins {B2, B3}; the clusters {A1, A2} and {B1, B2, B3} merge last]
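The six-step procedure from the previous slide, specialized to average linkage, can be sketched as follows (a minimal illustration without a clustering library; function names are my own). Run on the five points above, it reproduces the merge order of the dendrogram:

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(points):
    """Agglomerative clustering of {name: coordinates}.
    Cluster distance = average of all pairwise point distances.
    Returns the merges as (members_a, members_b, distance)."""
    clusters = {name: [name] for name in points}   # step: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        pair, best = None, float("inf")
        names = list(clusters)
        for i in range(len(names)):                # find the two most similar clusters
            for j in range(i + 1, len(names)):
                a, b = clusters[names[i]], clusters[names[j]]
                d = sum(euclid(points[p], points[q])
                        for p in a for q in b) / (len(a) * len(b))
                if d < best:
                    best, pair = d, (names[i], names[j])
        a, b = pair                                # fuse them and update the distances
        merges.append((tuple(clusters[a]), tuple(clusters[b]), best))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges
```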
Clustering Algorithms – Ward‘s method

• Variance approach!
Cluster membership is assessed by calculating the total sum of squared deviations from the mean (centroid) of a cluster.

→ produces “well-structured” dendrograms
→ tends to create small clusters
Example: “Phylogenetic” Tree

4 species with homologous sequences:
ATCC
ATGC
TTCG
TCGG

Assumption: number of pair-wise mutations ≈ dissimilarity

Hamming distance (sequences must have the same length) ≡ number of positions containing different symbols

agtc
cgta        Hamming distance = 2

Levenshtein distance (1965) (“edit distance”) ≡ number of edit operations (deletion, insertion, change of a single symbol) converting one sequence into the other

ag-tcc
cgctca      Levenshtein distance = 3
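Both distances are easy to compute; Levenshtein needs dynamic programming because edits can shift the alignment. A minimal sketch (function names are my own):

```python
def hamming(a, b):
    """Number of positions with different symbols (equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def levenshtein(a, b):
    """Minimum number of edits (deletion, insertion, change of a single
    symbol) converting one sequence into the other (row-wise DP)."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, x in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # keep or change symbol
        prev = cur
    return prev[-1]
```

This reproduces the slide's examples, and hamming on the four species sequences yields exactly the distance matrix of the next slide (e.g. ATCC vs. TCGG = 4).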
Step 1) Distance matrix

        ATCC  ATGC  TTCG  TCGG
ATCC     0     1     2     4
ATGC           0     3     3
TTCG                 0     2
TCGG                       0

→ fuse the two most similar sequences, ATCC and ATGC (distance 1)

Step 2) Reduced distance matrix

               {ATCC, ATGC}   TTCG          TCGG
{ATCC, ATGC}   0              ½(2+3)=2.5    ½(4+3)=3.5
TTCG                          0             2
TCGG                                        0

UPGMA:

[Figure: UPGMA dendrogram over ATCC, ATGC, TTCG, TCGG; leaf branches of length 0.5 (ATCC, ATGC) and 1 (TTCG, TCGG), internal branches of length 1.5; branch length = distance between nodes]
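The reduction in step 2 is the UPGMA rule: the distance between two clusters is the unweighted average of all pairwise member distances. A minimal sketch (the function name is my own) that reproduces the 2.5 and 3.5 above:

```python
def upgma_distance(d, cluster_a, cluster_b):
    """UPGMA: unweighted average of all pairwise distances between
    two clusters. d maps (name, name) tuples to distances; either
    key order is accepted."""
    total = sum(d.get((x, y), d.get((y, x)))
                for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```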
Issues
• branch lengths need not be proportional to the time since
the separation of taxa
• different evolution rates of branches
• unweighted mutation / editing events

Cladistic Methods ("ancestry relationships")

1. Maximum Parsimony (W. Fitch, 1971) (number of evolutionary changes is minimized)
2. Maximum Likelihood (logL for the tree topology is maximized)
Example: 4 homologous sequences: ATCG, ATGG, TCCA, TTCA

[Figure: two candidate tree topologies with inferred ancestral sequences at the internal nodes; the left tree (grouping ATCG, ATGG against TCCA, TTCA) requires 4 mutations, the right tree (grouping ATCG, TCCA against ATGG, TTCA) requires 7 mutations]
4 mutations 7 mutations