• Data preprocessing
• PCA
• Hypothesis testing
• Clustering
[Figure: motivating example – documents sorted into "Professional" vs. "Culture" classes; into which of Class A or Class B does the unknown item "?" belong?]
Classification – “Work Flow”
Raw Data → Preprocessing → Feature Selection → Classifier → Quality Assessment
Raw Data
• Typically vectors
ID     x                           y
Mol1   12, 2, 1, 3, 4, 36, 7.5     1
Mol2   4, 5.6, 3, 8, 99, 1, 5.2    0
Mol3   2.1, 2.2, 4, 6, 3, 1, 11    1
…
MolM   x1, x2, …, xn               yM
Preprocessing: Range scaling

x^*_{ik} = \frac{x_{ik} - x_k^{\min}}{x_k^{\max} - x_k^{\min}}, \qquad 0 \le x^*_{ik} \le 1

i: row index (object), k: column index (variable)
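A minimal sketch of this range scaling in code (NumPy is my choice of tooling here, not the slides'; the matrix reuses the Mol1–Mol3 vectors from the table above):

```python
import numpy as np

# Mol1-Mol3 from the raw-data table above (rows: objects, columns: variables)
X = np.array([[12,   2,   1, 3,  4, 36, 7.5],
              [ 4,   5.6, 3, 8, 99,  1, 5.2],
              [ 2.1, 2.2, 4, 6,  3,  1, 11 ]])

# Column-wise range scaling: (x_ik - min_k) / (max_k - min_k)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # each column spans [0, 1]
```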
Preprocessing: Autoscaling

x^*_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, \qquad \text{where} \quad s_k = \sqrt{\frac{\sum_{i=1}^{N} (x_{ik} - \bar{x}_k)^2}{N - 1}}

Effect: "sphericization" of the groups.
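A corresponding sketch for autoscaling (same example matrix; `ddof=1` gives the N−1 denominator of s_k):

```python
import numpy as np

X = np.array([[12,   2,   1, 3,  4, 36, 7.5],
              [ 4,   5.6, 3, 8, 99,  1, 5.2],
              [ 2.1, 2.2, 4, 6,  3,  1, 11 ]])

# Autoscaling: center each column on its mean, divide by its standard deviation
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_auto.mean(axis=0).round(12))  # ~0 per column
print(X_auto.std(axis=0, ddof=1))     # 1 per column
```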
Preprocessing / Feature extraction: PCA
• Reduces dimension of the data vectors
• Generates new orthogonal co-ordinate system
[Figure: data cloud in the original coordinates x1, x2, x3 (Var1, Var2) projected onto the new axes PC1 and PC2; the projected coordinates are the PCA scores]
Initial data matrix, X

       x1  x2  x3  x4  x5  x6
Mol1    2   1   1   1   3   3
Mol2    1   1   2   1   6   7
Mol3    6   5   6   4   6   5
Mol4    5   5   6   5   4   4

Correlation analysis to reveal the relationships between the variables
↓
Factor analysis
Excursus: PCA
Correlation matrix, R

Correlation coefficient (Pearson):

r_{x_1, x_2} = \frac{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)^2 \cdot \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)^2}} \quad \in [-1, 1]

N: number of objects (molecules)
(n: number of variables)

      x1    x2    x3    x4    x5     x6
x1    1     0.97  0.93  0.92  0.14  -0.29
x2          1     0.99  0.98  0.19  -0.17
x3                1     0.97  0.32  -0.02
x4                      1     0.08  -0.21
x5                            1      0.87
x6                                   1
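As a quick check, this correlation matrix can be reproduced from the initial data matrix (a sketch with NumPy):

```python
import numpy as np

# Initial data matrix X (rows: Mol1-Mol4, columns: x1-x6)
X = np.array([[2, 1, 1, 1, 3, 3],
              [1, 1, 2, 1, 6, 7],
              [6, 5, 6, 4, 6, 5],
              [5, 5, 6, 5, 4, 4]], dtype=float)

R = np.corrcoef(X, rowvar=False)  # pairwise Pearson coefficients, 6x6
print(R.round(2))                 # matches the table above (0.97, 0.93, ...)
```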
Excursus: PCA
Standardized variable matrix, Z

z_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}

Mean equal to zero, standard deviation equal to 1.

R = \frac{1}{N-1} \, Z \cdot Z^T \qquad Z^T: transpose of the standardized initial matrix X
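The same R can be computed from Z, as the slide states; a sketch (rows here are molecules, so the product is Zᵀ·Z rather than the slide's Z·Zᵀ, which assumes variables as rows):

```python
import numpy as np

X = np.array([[2, 1, 1, 1, 3, 3],
              [1, 1, 2, 1, 6, 7],
              [6, 5, 6, 4, 6, 5],
              [5, 5, 6, 5, 4, 4]], dtype=float)
N = X.shape[0]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # mean 0, std 1 per column
R = Z.T @ Z / (N - 1)
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True
```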
Principal Component Analysis (PCA)
Model:

z_j = a_{j1} F_1 + a_{j2} F_2 + \cdots + a_{jn} F_n

n: number of variables in X
Z = A \cdot F \qquad F: uncorrelated components ("factors")
Principal Component Analysis (PCA)

Z = F \cdot A
X = S \cdot L^T \qquad (S: scores, L: loadings)
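A sketch of the decomposition itself, via an eigendecomposition of R (eigenvectors taken as loadings A, scores F = Z·A; the reconstruction uses Aᵀ, which I take to be what the slide's Z = F·A abbreviates):

```python
import numpy as np

X = np.array([[2, 1, 1, 1, 3, 3],
              [1, 1, 2, 1, 6, 7],
              [6, 5, 6, 4, 6, 5],
              [5, 5, 6, 5, 4, 4]], dtype=float)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = Z.T @ Z / (X.shape[0] - 1)

eigval, eigvec = np.linalg.eigh(R)           # symmetric R -> real eigenpairs
order = np.argsort(eigval)[::-1]             # sort descending by variance
eigval, A = eigval[order], eigvec[:, order]  # A: loadings

F = Z @ A                                    # F: scores ("factors")
print(np.allclose(Z, F @ A.T))               # Z is recovered from F and A
```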
PCA – Scree Plot
[Figure: % variance explained and cumulative variance plotted against eigenvector rank; only the eigenvectors (PCs) before the "elbow" are to be considered]
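A sketch of such a scree plot (the eigenvalues here are made-up placeholders, not values from any slide):

```python
import numpy as np
import matplotlib.pyplot as plt

eigval = np.array([3.2, 1.4, 0.9, 0.3, 0.2])  # hypothetical eigenvalues

plt.plot(np.arange(1, len(eigval) + 1), eigval, "o-")
plt.xlabel("Eigenvector rank")
plt.ylabel("Eigenvalue")
plt.title("Scree Plot")
plt.show()

# cumulative % variance explained; keep the PCs before the "elbow"
print(np.cumsum(100 * eigval / eigval.sum()).round(1))
```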
A Practical Example
ACE inhibitors vs. MMP inhibitors
Descriptors: MW, clogP, SAS, ovality, boiling point
[Figure: scree plot – eigenvalue ("Value") vs. number of eigenvalues (1–5) – and a PC1/PC2 scores plot]
A Practical Example
ACE inhibitors vs. Cytotoxic agents
Descriptors: MW, clogP, SAS, ovality
[Figure: scree plot – eigenvalue ("Value") vs. number of eigenvalues (1–4) – and a PC1/PC2 scores plot]
Homework!
PCA
Tutorial at Manchester Metropolitan University
http://149.170.199.144/multivar/pca.htm
Hypothesis testing: Student’s t-Test (1)
Tests whether the means of two groups are statistically different from each other.
[Figure: overlapping distributions of a test group and a control group]
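A minimal two-sample t-test sketch with SciPy (the group values are simulated, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
test = rng.normal(1.0, 1.0, size=30)     # hypothetical test-group values
control = rng.normal(0.0, 1.0, size=30)  # hypothetical control-group values

t, p = stats.ttest_ind(test, control)    # two-sample t-test on the means
print(f"t = {t:.2f}, p = {p:.4f}")       # small p -> means differ
```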
Hypothesis testing: Kolmogorov-Smirnov-Test (2)
KS-testing for:
• discriminating features
• normal distribution
• similarity of two distributions
http://www.physics.csbsju.edu/stats/KS-test.html
Example:
http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Kolmogorov-Smi/basis-statistik-kolmogorov-smi.html
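A corresponding KS-test sketch with SciPy (again on simulated values): the two-sample form compares two distributions, the one-sample form tests against a reference such as the normal distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=100)  # hypothetical feature values, group A
b = rng.normal(0.5, 1.5, size=100)  # hypothetical feature values, group B

d, p = stats.ks_2samp(a, b)          # similarity of two distributions
print(f"D = {d:.3f}, p = {p:.4f}")   # small p -> distributions differ

d1, p1 = stats.kstest(a, "norm")     # test a single sample for normality
```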
Break!
Classification Scenarios
The data representation must allow a solution of the problem.
[Figure: a two-class problem solvable with a linear separator, one requiring a nonlinear separator, and an unclear case ("?")]
Classification Methods
supervised methods vs. unsupervised methods

Experimental design:
• Selection of “representative” data sets
• ”D/S-optimal” design
• similarity/dissimilarity-based methods
• partition-based methods
• approximation by, e.g., the Kennard-Stone (maxmin) algorithm (see the sketch after this list)

Ref.: Gillet et al. (1998) Similarity and dissimilarity methods for processing chemical structure databases. www3.oup.co.uk/computer_journal/subs/Volume_41/Issue_08/Gillet.pdf
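A minimal sketch of the maxmin idea behind Kennard-Stone selection (my own greedy formulation, not code from the reference):

```python
import numpy as np

def kennard_stone(X, k):
    """Pick k representative rows: start with the most distant pair, then
    repeatedly add the point farthest from its nearest selected point."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    selected = list(np.unravel_index(np.argmax(D), D.shape))
    while len(selected) < k:
        rest = [p for p in range(len(X)) if p not in selected]
        dmin = D[np.ix_(rest, selected)].min(axis=1)
        selected.append(rest[int(np.argmax(dmin))])
    return selected

X = np.random.default_rng(0).random((50, 4))  # hypothetical descriptor matrix
print(kennard_stone(X, 5))
```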
Outlier detection
By
• visual inspection of the data distribution
• a formal criterion, e.g. a 3σ threshold
[Figure: data distribution with points beyond the 3σ bound flagged as outliers]
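A sketch of the formal 3σ criterion (simulated data with two planted outliers):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), [8.0, -9.5]])  # planted outliers

mu, sigma = x.mean(), x.std(ddof=1)
outliers = x[np.abs(x - mu) > 3 * sigma]  # beyond the 3-sigma threshold
print(outliers)                           # flags the planted values
```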
Software tools
• SIMCA-P
• Statistica
• R
• …
Measuring Binary Classification Accuracy
• Training data (re-classification)
• Test data (classification)
Q_1 = \frac{P + N}{\#\,\text{predicted}} \qquad Q_2 = \frac{P}{P + O} \qquad Q_3 = \frac{P}{P + U}

Matthews correlation coefficient:

cc = \frac{P \cdot N - O \cdot U}{\sqrt{(N + U)(N + O)(P + U)(P + O)}} \quad \in [-1, 1]

P: # positive correct
N: # negative correct
O: # false positives (“overprediction”)
U: # false negatives (“underprediction”)
Example: Matthews cc
[Figure: predictions for Class 1 and Class 0 objects]

cc = \frac{P \cdot N - O \cdot U}{\sqrt{(N + U)(N + O)(P + U)(P + O)}}

Total = 11
P = 4, N = 4, O = 1, U = 2

cc = (4·4 − 1·2) / √((4+2)·(4+1)·(4+2)·(4+1)) = 14 / √900 ≈ 0.47
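A small helper to reproduce this worked example:

```python
from math import sqrt

def matthews_cc(P, N, O, U):
    """Matthews correlation coefficient from the counts defined above:
    P/N correct positives/negatives, O false positives, U false negatives."""
    return (P * N - O * U) / sqrt((N + U) * (N + O) * (P + U) * (P + O))

print(matthews_cc(4, 4, 1, 2))  # 14/30 = 0.466...
```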
Hierarchical Cluster Analysis
Nested classes
“Phylogenetic” classification (dendrogram)
Recommended: UPGMA (Unweighted Pair-Group Method with Arithmetic mean)
Example: Hierarchical Clustering
Data (Euclidean distances, average linkage):

ID  x1  x2
A1   1   1
A2   2   1
B1   4   5
B2   5   7
B3   7   7

[Figure: scatter plot of the five objects in the (x1, x2) plane and the resulting dendrogram; y-axis: linkage distance (0–7), leaves ordered B3, B2, B1, A2, A1]
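A sketch reproducing this example with SciPy (average linkage on Euclidean distances, i.e. UPGMA):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["A1", "A2", "B1", "B2", "B3"]
X = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)

Z = linkage(X, method="average", metric="euclidean")  # UPGMA
dendrogram(Z, labels=labels)
plt.ylabel("Linkage Distance")
plt.show()
```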
Clustering Algorithms – Ward‘s method
• Variance approach!
Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster.
[Figure: clusters with their centroids (means) marked “+”]
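With SciPy, Ward’s method is just a different linkage choice (same five points as in the example above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 1], [2, 1], [4, 5], [5, 7], [7, 7]], dtype=float)

# Ward: merge the pair of clusters whose fusion least increases the total
# sum of squared deviations from the cluster centroids
Zw = linkage(X, method="ward")
print(Zw)  # each row: merged ids, fusion distance, new cluster size
```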
Levenshtein Distance (1965) (“Edit Distance”) ≡ number of edit operations (deletion, insertion, change of a single symbol) converting one sequence into the other

Example:
ag-tcc
cgctca → Levenshtein distance = 3
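A standard dynamic-programming sketch of the edit distance (the gap character in the alignment above is removed before comparing):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of deletions, insertions, and single-symbol changes
    converting sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # change
        prev = curr
    return prev[-1]

print(levenshtein("agtcc", "cgctca"))  # 3, as in the alignment above
```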
Step 1) Distance matrix

       ATCC  ATGC  TTCG  TCGG
ATCC     0     1     2     4
ATGC           0     3     3
TTCG                 0     2
TCGG                       0

[Figure: guide trees built from the distance matrix – ATCC and ATGC joined via the inferred ancestors ATCA/ATCG; the two tree topologies shown require 4 vs. 7 mutations]