Clustering
Loïc Cerf
September 1st, 2018
DCC – ICEx – UFMG
Example of an application problem
Student profiles
Given the marks students received for different courses, group the
students so that two students in the same group received about the
same marks for each course and two students in different groups
have different profiles.
Outline
1 Definition
2 Classical Algorithms
3 Assessing a Clustering
4 Case study
5 Clustering in KNIME
An optimization problem

Definition
Partitioning the objects into clusters so that each cluster contains
similar objects and objects in different clusters are dissimilar.

Input:
        a1      a2      ...     an
  o1    d1,1    d1,2    ...     d1,n
  o2    d2,1    d2,2    ...     d2,n
  ...   ...     ...     ...     ...
  om    dm,1    dm,2    ...     dm,n
Definition (restated)
Partitioning the objects so that the intra-cluster similarities are
maximized and the inter-cluster similarities are minimized.
Output:
        a1      a2      ...     an      cluster
  o1    d1,1    d1,2    ...     d1,n    c1
  o2    d2,1    d2,2    ...     d2,n    c2
  ...   ...     ...     ...     ...     ...
  om    dm,1    dm,2    ...     dm,n    cm
Illustration
Clustering objects, described with two interval-scaled attributes,
using the Euclidean distance.
x y
o1 91 70
o2 129 91
o3 359 243
o4 322 254
o5 100 104
o6 464 113
o7 342 297
o8 410 65
o9 334 329
  ...   ...   ...

Loïc Cerf, Mineração de Dados Aplicada
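The Euclidean distance used above follows directly from the coordinates; a minimal sketch on the first example objects (function name illustrative):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# the first three objects of the example table
objects = {"o1": (91, 70), "o2": (129, 91), "o3": (359, 243)}

# o1 and o2 are close to each other; o3 is far from both
d12 = euclidean(objects["o1"], objects["o2"])
d13 = euclidean(objects["o1"], objects["o3"])
```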
Definition

Querying patterns:
    {X ∈ P | Q(X, D)}
where:
    D is the dataset,
    P is the pattern space,
    Q is an inductive query.

Querying a clustering is the special case where:
    D is a set of objects O associated with a similarity measure,
    P is {(C1, ..., Ck) ∈ (2^O)^k | ∀ℓ ∈ {1, ..., k}, Cℓ ≠ ∅;
                                    ∀m ≠ ℓ, Cℓ ∩ Cm = ∅;
                                    ∪ℓ=1..k Cℓ = O},
    Q is a function to optimize: it quantifies how similar the pairs of
    objects in the same cluster are and/or how dissimilar those in two
    different clusters are.
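The three constraints defining the pattern space (non-empty clusters, pairwise disjoint, covering O) can be checked mechanically; a minimal sketch, with illustrative object names:

```python
def is_valid_clustering(clusters, objects):
    """Check that clusters (a list of sets) form a partition of objects:
    every cluster non-empty, pairwise disjoint, and their union is objects."""
    if any(len(c) == 0 for c in clusters):
        return False
    seen = set()
    for c in clusters:
        if seen & c:          # overlap with a previously seen cluster
            return False
        seen |= c
    return seen == objects

O = {"o1", "o2", "o3", "o4"}
```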
Inexactness
Outline
1 Definition
2 Classical Algorithms
3 Assessing a Clustering
4 Case study
5 Clustering in KNIME
Dendrogram

(Figure: hierarchical clustering of the example objects, shown as a
dendrogram; the vertical axis gives the distance at which clusters merge.)
Linkage criteria

The similarity between two clusters can be defined as:
    Complete linkage: the worst similarity between any pair of objects
    taken from the two clusters;
    Single linkage: the best similarity between any pair of objects
    taken from the two clusters;
    Group average linkage: the average similarity over all pairs of
    objects taken from the two clusters.

These criteria tend to provide:
    Complete linkage: spherical clusters of approximately equal
    diameters (clustering computed in O(|O|²) time);
    Single linkage: "chains" of similar objects (clustering computed
    in O(|O|²) time);
    Group average linkage: the most natural linkage (clustering
    computed in O(|O|³) time).
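The agglomeration itself can be sketched naively: start from singleton clusters and repeatedly merge the two closest ones under a linkage criterion. A sketch using distances (so "worst similarity" becomes the largest distance), with illustrative names; real implementations are far more optimized:

```python
import math

def cluster_distance(c1, c2, linkage):
    """Distance between two clusters (lists of points) under a linkage
    criterion: complete = largest pairwise distance, single = smallest,
    group average = mean over all pairs."""
    pair_dists = [math.dist(p, q) for p in c1 for q in c2]
    if linkage == "complete":
        return max(pair_dists)
    if linkage == "single":
        return min(pair_dists)
    return sum(pair_dists) / len(pair_dists)  # group average

def agglomerate(points, k, linkage="complete"):
    """Naive agglomerative clustering: merge the two closest clusters
    until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```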
Considering all possible splits to find the best one takes exponential
time. That is why a split is usually found in an approximate way,
e.g., using 2-means.
k-means: illustration
3-means clustering of objects, described with two interval-scaled
attributes, using the Euclidean distance.
x y
o1 91 70
o2 129 91
o3 359 243
o4 322 254
o5 100 104
o6 464 113
o7 342 297
o8 410 65
o9 334 329
  ...   ...   ...
k-means: algorithm

Seeking the centers of k clusters by expectation-maximization:
1 Randomly choose k centers μ1, ..., μk in the object space;
2 Until convergence or a specified maximal number of iterations:
    E Assign each object to the cluster Cℓ with the closest center μℓ;
    M Update the center μℓ of each cluster to the mean of the objects
      assigned to it.
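The E and M steps translate almost line by line into code; a minimal sketch (initializing the centers from sampled objects, a common variant of step 1):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: E-step assigns each point to the nearest center,
    M-step moves each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # init from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # E-step: assign each point to the closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[closest].append(p)
        # M-step: recompute each center as the mean of its cluster
        # (keeping the old center if a cluster became empty)
        new_centers = [
            [sum(coord) / len(cl) for coord in zip(*cl)] if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # convergence
            break
        centers = new_centers
    return centers, clusters
```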
Plot, as a function of k, a measure of the quality of the clustering,
e.g., Σℓ=1..k Σo∈Cℓ ‖o - μℓ‖², which k-means locally minimizes.
Choose k after a large drop.
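The quality measure above, the within-cluster sum of squared distances to the centers, is easy to compute for a given clustering; a minimal sketch:

```python
import math

def wss(clusters):
    """Within-cluster sum of squares: sum over clusters of the squared
    Euclidean distance of each object to its cluster's mean."""
    total = 0.0
    for cl in clusters:
        center = [sum(coord) / len(cl) for coord in zip(*cl)]
        total += sum(math.dist(p, center) ** 2 for p in cl)
    return total
```

Computing it for k = 1, 2, 3, ... and plotting gives the curve whose "large drop" the heuristic looks for.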
EM

(Figures: the example dataset, and the mixture model fitted by EM
after 1 and after 5 iterations.)
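The figures are not reproduced here; as a concrete instance of EM, a minimal sketch for a two-component one-dimensional Gaussian mixture (initialization and parameter names are illustrative choices):

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture.
    E-step: posterior responsibility of each component for each point;
    M-step: reestimate weights, means, and variances from responsibilities."""
    mu = [min(xs), max(xs)]     # crude initialization from the data range
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step
        resp = []
        for x in xs:
            p = [w[j] * pdf(x, mu[j], var[j]) for j in range(2)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj, 1e-6)
    return w, mu, var
```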
k-means specializes EM
k-means vs. EM
Fuzzy c-means
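In fuzzy c-means, each object receives a degree of membership in every cluster rather than a hard assignment; a minimal sketch of the standard membership computation (fuzzifier m > 1 assumed; names illustrative):

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership degrees of one point w.r.t. the
    given centers: u_l = 1 / sum_j (d_l / d_j)^(2/(m-1)); they sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0 for d in dists):  # point coincides with a center
        return [1.0 if d == 0 else 0.0 for d in dists]
    return [
        1.0 / sum((dl / dj) ** (2 / (m - 1)) for dj in dists)
        for dl in dists
    ]
```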
Problem
k-means, EM and fuzzy c-means only find convex clusters.
Problem
k-means, EM and fuzzy c-means only find convex clusters.
Ideas
Problem
k-means, EM and fuzzy c-means only find convex clusters.
Ideas
Kernel k-means
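Kernel k-means runs k-means implicitly in a feature space φ: the squared distance from φ(x) to a cluster mean expands into kernel evaluations only, K(x,x) - (2/|C|) Σ K(x,o) + (1/|C|²) ΣΣ K(o,o'). A minimal sketch (kernel choice illustrative):

```python
def sq_dist_to_cluster_mean(x, cluster, kernel):
    """||phi(x) - mean(phi(o) for o in cluster)||^2 expanded with the kernel:
    K(x,x) - (2/|C|) sum_o K(x,o) + (1/|C|^2) sum_{o,o'} K(o,o')."""
    n = len(cluster)
    return (
        kernel(x, x)
        - 2.0 / n * sum(kernel(x, o) for o in cluster)
        + 1.0 / n**2 * sum(kernel(o, p) for o in cluster for p in cluster)
    )

def linear_kernel(p, q):
    return sum(a * b for a, b in zip(p, q))
```

With the linear kernel this reduces to the ordinary squared Euclidean distance to the cluster mean, a handy sanity check.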
Definition

Removing from the (non-negative and symmetric) similarity graph the
edges with a small total weight so that k "reasonably large" connected
components are obtained.

Approximately compute the partitioning (C1, ..., Ck) ∈ (2^O)^k that
minimizes
    Σℓ=1..k ( Σ{oi ∈ Cℓ, oj ∈ O\Cℓ} s(oi, oj) ) / ( Σ{oi ∈ Cℓ, oj ∈ O} s(oi, oj) ).

Method

Extract the eigenvectors associated with the k smallest eigenvalues of
an affinity matrix, e.g., the normalized Laplacian of the similarity
matrix. Cluster (e.g., with k-means) the objects rewritten w.r.t. these
k attributes.
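The objective above can be evaluated directly on a small similarity graph; a minimal sketch with an illustrative toy similarity:

```python
def normalized_cut(clusters, objects, s):
    """Sum over clusters of (weight of edges leaving the cluster) /
    (total weight of edges incident to the cluster), as in the objective."""
    total = 0.0
    for cl in clusters:
        cut = sum(s(oi, oj) for oi in cl for oj in objects if oj not in cl)
        assoc = sum(s(oi, oj) for oi in cl for oj in objects)
        total += cut / assoc
    return total

# toy symmetric similarity: two tightly linked pairs, weakly inter-linked
S = {("a", "b"): 10.0, ("c", "d"): 10.0, ("a", "c"): 1.0, ("b", "d"): 1.0}

def sim(x, y):
    if x == y:
        return 0.0
    return S.get((x, y), S.get((y, x), 0.0))
```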
DBSCAN: illustration
DBSCAN clustering of objects, described with two interval-scaled
attributes, using the Euclidean distance.
x y
o1 91 70
o2 129 91
o3 359 243
o4 322 254
o5 100 104
o6 464 113
o7 342 297
o8 410 65
o9 334 329
  ...   ...   ...
DBSCAN: algorithm

A density-based algorithm:
1 At each iteration, choose an unlabeled object;
2 List the sufficiently similar objects;
3 If there are too few of them, label the object as an outlier;
4 Otherwise cluster these objects as well as those listed by the same
  recursive process applied to the newly clustered objects.
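The four steps above can be sketched directly; a minimal DBSCAN over distinct points (parameters eps, the distance threshold for "sufficiently similar", and min_pts, the density threshold, are the usual names but illustrative here):

```python
import math

def dbscan(points, eps, min_pts):
    """DBSCAN following the steps above: pick an unlabeled point and list
    its eps-neighbors; too few -> outlier (label -1); otherwise grow a
    cluster, expanding recursively from each dense newly clustered point."""
    labels = {p: None for p in points}  # assumes distinct points
    cluster_id = 0

    def neighbors(p):
        return [q for q in points if q != p and math.dist(p, q) <= eps]

    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) + 1 < min_pts:  # the point itself counts
            labels[p] = -1           # outlier (may later join a cluster)
            continue
        labels[p] = cluster_id
        frontier = list(nbrs)
        while frontier:
            q = frontier.pop()
            if labels[q] is None:
                labels[q] = cluster_id
                q_nbrs = neighbors(q)
                if len(q_nbrs) + 1 >= min_pts:  # q is dense: expand from it
                    frontier.extend(q_nbrs)
            elif labels[q] == -1:
                labels[q] = cluster_id  # border point of this cluster
        cluster_id += 1
    return labels
```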
Configuration
Outline
1 Definition
2 Classical Algorithms
3 Assessing a Clustering
4 Case study
5 Clustering in KNIME
An unsupervised task
Internal evaluation
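The specific measures on the original slides are not reproduced here; one widely used internal measure is the silhouette coefficient, which compares each object's mean distance to its own cluster with its mean distance to the nearest other cluster. A minimal sketch:

```python
import math

def silhouette(clusters):
    """Mean silhouette over all objects: for each object, a = mean distance
    to its own cluster, b = smallest mean distance to another cluster,
    silhouette = (b - a) / max(a, b). Values near 1 mean well clustered."""
    scores = []
    for ci, cl in enumerate(clusters):
        for p in cl:
            if len(cl) == 1:
                scores.append(0.0)  # usual convention for singletons
                continue
            a = sum(math.dist(p, q) for q in cl if q != p) / (len(cl) - 1)
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```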
Comparing clusterings
Randomization of a dataset
Stability of a clustering
The Fowlkes-Mallows index, the Rand index and the adjusted Rand
index (all absent from KNIME) are alternatives. They are all based
on the number of pairs of objects that are in the same/different
cluster(s) in one clustering and in the same/different cluster(s) in
the other clustering.
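These pair-counting indices follow directly from the description above; a minimal sketch of the Rand and Fowlkes-Mallows indices (the adjusted Rand index further corrects for chance agreement):

```python
import math
from itertools import combinations

def pair_counts(labels1, labels2):
    """Count object pairs that are in the same/different cluster(s) in two
    clusterings given as lists of cluster labels (one label per object)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            ss += 1
        elif same1:
            sd += 1
        elif same2:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(labels1, labels2):
    ss, sd, ds, dd = pair_counts(labels1, labels2)
    return (ss + dd) / (ss + sd + ds + dd)

def fowlkes_mallows(labels1, labels2):
    ss, sd, ds, dd = pair_counts(labels1, labels2)
    return ss / math.sqrt((ss + sd) * (ss + ds))
```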
Outline
1 Definition
2 Classical Algorithms
3 Assessing a Clustering
4 Case study
5 Clustering in KNIME
Clustering in KNIME
Practice
© 2012–2018 Loïc Cerf
These slides are licensed under the Creative Commons
Attribution-ShareAlike 4.0 International License.