Michael Houle
National Institute of Informatics
I. Shared-Neighbor Clustering
II. The Relevant-Set Correlation Model
a. Association Measures
b. Significance of Association
c. Partial Significance and Reshaping
III. The GreedyRSC Heuristic
IV. Experimental Results
V. Extensions and Applications
a. Query Result Clustering
b. Outlier Detection
c. Feature Set Evaluation and Selection
Clustering is:
Organization of data into well-differentiated
groups of highly-similar items.
A form of unsupervised learning.
Fundamental operation in data mining &
knowledge discovery.
Important tool in the design of efficient
algorithms and heuristics.
Closely related to search & retrieval.
Other desiderata:
Scalable heuristics.
Adaptive to variations in density.
Handles cluster overlap.
$$ r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}} $$
Apply this to coordinate pairs of set vectors for A, B ⊂ S…
Gives set correlation between A and B :
$$ R(A,B) = \frac{|S|}{\sqrt{(|S|-|A|)(|S|-|B|)}} \left( \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}} - \frac{\sqrt{|A| \cdot |B|}}{|S|} \right) $$
Tends to the cosine similarity measure when A and B are small relative to S.
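As a rough illustration of the set correlation formula and its cosine limit, here is a minimal Python sketch. The function names `set_correlation` and `cosine`, and the use of plain Python sets, are assumptions made for this example and do not come from the slides.

```python
from math import sqrt

def set_correlation(a, b, n):
    """Set correlation R(A, B) for subsets a, b of a universe S with n items."""
    a, b = set(a), set(b)
    if not a or not b or len(a) == n or len(b) == n:
        raise ValueError("R(A, B) is undefined for empty sets or for the whole of S")
    root = sqrt(len(a) * len(b))                  # sqrt(|A| * |B|)
    scale = n / sqrt((n - len(a)) * (n - len(b)))
    return scale * (len(a & b) / root - root / n)

def cosine(a, b):
    """Cosine similarity of two sets: |A n B| / sqrt(|A| * |B|)."""
    a, b = set(a), set(b)
    return len(a & b) / sqrt(len(a) * len(b))

# Identical sets are perfectly correlated; complementary sets are perfectly anti-correlated.
print(set_correlation({1, 2, 3}, {1, 2, 3}, 10))
print(set_correlation(range(0, 5), range(5, 10), 10))
# For sets that are tiny relative to S, R(A, B) is close to the cosine measure.
print(set_correlation({1, 2, 3}, {2, 3, 4}, 10**9), cosine({1, 2, 3}, {2, 3, 4}))
```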
$$ SC(A) = \frac{1}{|A|} \sum_{v \in A} MC(A, Q(v,|A|)) = \frac{1}{|A|^2} \sum_{v \in A} |A \cap Q(v,|A|)| $$
Measure – self-correlation:
The average correlation
between a set and the
(same-sized) relevant sets
of its members.
Denoted SR (A), where
$$ SR(A) = \frac{1}{|A|} \sum_{v \in A} R(A, Q(v,|A|)) = \frac{|S| \cdot SC(A) - |A|}{|S| - |A|} $$
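A small sketch of how SC(A) and SR(A) might be computed from relevant sets. The callable `query(v, k)`, standing in for Q(v, k), and the toy `line_query` ranking (items 0..9 on a line, ranked by distance with ties broken by index) are hypothetical and only serve the example.

```python
def sc(a, query):
    """SC(A): average of |A n Q(v, |A|)| / |A| over the members v of A."""
    a = set(a)
    k = len(a)
    return sum(len(a & set(query(v, k))) for v in a) / k**2

def sr(a, query, n):
    """SR(A) via the closed form (|S| * SC(A) - |A|) / (|S| - |A|)."""
    k = len(set(a))
    return (n * sc(a, query) - k) / (n - k)

def line_query(v, k, n=10):
    """Hypothetical relevant set: the k items of 0..n-1 closest to v on a line."""
    return set(sorted(range(n), key=lambda w: (abs(w - v), w))[:k])

A = {0, 1, 2}
print(sc(A, line_query))        # 8/9: items 0 and 1 recover A fully, item 2 recovers 2 of 3
print(sr(A, line_query, 10))    # 53/63
```

The closed form avoids evaluating R(A, Q(v, |A|)) separately for each member; only the intersection sizes are needed.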
$$ \mathrm{Var}(R(A,B)) = \frac{|S|^2}{(|S|-|A|)(|S|-|B|)\,|A| \cdot |B|} \cdot \frac{(|S|-|A|)(|S|-|B|)\,|A| \cdot |B|}{|S|^2 (|S|-1)} = \frac{1}{|S|-1} $$
$$ Z_{SR}(A) = \frac{SR(A) - E(SR(A))}{\sqrt{\mathrm{Var}(SR(A))}} = \frac{SR(A) - \frac{1}{|A|} \sum_{v \in A} E(R(A, Q(v,|A|)))}{\sqrt{\frac{1}{|A|^2} \sum_{v \in A} \mathrm{Var}(R(A, Q(v,|A|)))}} = \sqrt{|A| (|S|-1)} \; SR(A) $$
$$ E(SC(A)) = \frac{1}{|A|^2} \sum_{v \in A} E(|A \cap Q(v,|A|)|) = \frac{1}{|A|^2} \sum_{v \in A} \frac{|A|^2}{|S|} = \frac{|A|}{|S|} $$
$$ \mathrm{Var}(SC(A)) = \frac{1}{|A|^4} \sum_{v \in A} \mathrm{Var}(|A \cap Q(v,|A|)|) = \frac{(|S|-|A|)^2}{|A| \cdot |S|^2 (|S|-1)} $$
Can show:
$$ Z_{SR}(A) = Z_{SC}(A) = \sqrt{|A| (|S|-1)} \; SR(A) \triangleq Z(A) $$
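The identity between the two standard scores can be checked numerically. The sketch below uses the mean |A|/|S| and variance (|S|-|A|)^2 / (|A| |S|^2 (|S|-1)) of SC(A) under the randomness hypothesis, as given above; the function names are assumptions for this example.

```python
from math import sqrt

def z_from_sr(sr_value, size, n):
    """Z(A) = sqrt(|A| (|S| - 1)) * SR(A)."""
    return sqrt(size * (n - 1)) * sr_value

def z_from_sc(sc_value, size, n):
    """Standard score of SC(A) under the randomness hypothesis."""
    mean = size / n
    var = (n - size) ** 2 / (size * n**2 * (n - 1))
    return (sc_value - mean) / sqrt(var)

# Illustrative values: |A| = 3, |S| = 10, SC(A) = 8/9, hence SR(A) = 53/63.
size, n, sc_value = 3, 10, 8 / 9
sr_value = (n * sc_value - size) / (n - size)
print(z_from_sr(sr_value, size, n), z_from_sc(sc_value, size, n))  # the two values agree
```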
$$ Z(v|A) = \sqrt{|S|-1} \; R(A, Q(v,|A|)) $$
$$ Z(A) = \frac{1}{\sqrt{|A|}} \sum_{v \in A} Z(v|A) $$
(Figure: the overlap A ∩ Q(v, |A|).)
Significance of SR(A′|A) w.r.t. the randomness hypothesis:
$$ Z(A'|A) = \frac{1}{\sqrt{|A'|}} \sum_{v \in A'} Z(v|A) $$
(Figure: the overlap A ∩ Q(v, |A|).)
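A minimal sketch of the per-item and partial significance scores, assuming a hypothetical `query(v, k)` callable for Q(v, k) and the same toy line ranking over items 0..9. Since Q(v, |A|) has the same size as A, R(A, Q(v, |A|)) reduces to (|S| |A ∩ Q| / |A| - |A|) / (|S| - |A|).

```python
from math import sqrt

def z_item(v, a, query, n):
    """Z(v|A) = sqrt(|S| - 1) * R(A, Q(v, |A|))."""
    a = set(a)
    k = len(a)
    overlap = len(a & set(query(v, k)))
    r = (n * overlap / k - k) / (n - k)   # R(A, B) when |A| == |B| == k
    return sqrt(n - 1) * r

def z_partial(a_prime, a, query, n):
    """Z(A'|A): sum of Z(v|A) over v in A', divided by sqrt(|A'|)."""
    a_prime = set(a_prime)
    return sum(z_item(v, a, query, n) for v in a_prime) / sqrt(len(a_prime))

def line_query(v, k, n=10):
    """Hypothetical relevant set: the k items of 0..n-1 closest to v on a line."""
    return set(sorted(range(n), key=lambda w: (abs(w - v), w))[:k])

A = {0, 1, 2}
print(z_item(0, A, line_query, 10))         # 3.0: item 0 recovers A exactly
print(z_item(3, A, line_query, 10))         # 1/7: item 3 barely overlaps A
print(z_partial(A, A, line_query, 10))      # equals Z(A)
print(z_partial({0, 1}, A, line_query, 10))
```

With A′ = A the partial score coincides with Z(A), so one routine covers both cases.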
$$ \underset{C^*}{\text{maximize}} \quad Z(v|C^*) \cdot Z(C^*|C) $$
Candidate Map
set significance.
Edges meeting min threshold on
inter-set significance
(correlation).
Cluster Map
disconnected.
Cluster nodes form a rough
cover of the data set.
(Diagram labels: Pattern Pruning; Candidate Patterns; Sample Relevant Sets.)
Problems:
Curse of dimensionality: computing queries Q(q,k)
can take time linear in |S| even when k is small.
Computing Z(A) takes time quadratic in |A|.
Workarounds:
Compute approximate neighborhoods with the SASH search
structure [H. '03, H. & Sakuma '05].
Pattern generation over samples of varying sizes,
with fixed range for |A|.
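To make the first problem concrete, here is a brute-force sketch of a query Q(q, k): it computes the distance from q to every item, so its cost grows linearly with |S| however small k is. This is not the SASH structure named above, only the exact baseline that an approximate index is meant to replace; the function name, the point tuples and the Euclidean distance are assumptions for the example.

```python
import heapq
from math import dist

def brute_force_query(points, q, k):
    """Indices of the k points nearest to points[q], with q itself included.

    Every point is examined once, so the running time is linear in len(points).
    """
    return heapq.nsmallest(
        k,
        range(len(points)),
        key=lambda i: (dist(points[q], points[i]), i),  # ties broken by index
    )

points = [(0, 0), (1, 0), (5, 5), (0, 2)]
print(brute_force_query(points, 0, 3))  # [0, 1, 3]
```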
$$ 0 \le \frac{Z^2(A)}{|S|-1} = |A| \cdot SR^2(A) \le |A| $$
For all experiments:
Min normalized squared set significance = 4.
Max normalized inter-set significance = 0.5.
Min norm. inter-set sig. = 0.1 (for cluster map).
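A short sketch of how the minimum threshold on normalized squared set significance might be applied. The names below are hypothetical, and treating the threshold as inclusive is an assumption. Because |A| · SR²(A) never exceeds |A|, a set with fewer than 4 items cannot reach a threshold of 4.

```python
MIN_NORM_SQ_SIGNIFICANCE = 4.0  # threshold quoted for all experiments

def normalized_squared_significance(size, sr_value):
    """Z^2(A) / (|S| - 1) = |A| * SR(A)^2, which lies between 0 and |A|."""
    return size * sr_value**2

def passes_min_significance(size, sr_value, threshold=MIN_NORM_SQ_SIGNIFICANCE):
    """True if the candidate set meets the threshold (inclusive, by assumption)."""
    return normalized_squared_significance(size, sr_value) >= threshold

print(passes_min_significance(3, 1.0))   # False: a 3-item set is capped at 3
print(passes_min_significance(4, 1.0))   # True
print(passes_min_significance(16, 0.5))  # True: 16 * 0.25 = 4
print(passes_min_significance(16, 0.4))  # False: 16 * 0.16 = 2.56
```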
26 April 2007 M E Houle @ NII 45
Images
Amsterdam Library of Object Images (ALOI):
Dense feature vectors, colour & texture histogram (prepared
by INRIA-Rocquencourt).
Number of vectors: 110,250.
641 features per vector.
5322 clusters computed in < 4 hours on desktop (older,
slower implementation).
Maximum cluster size: 7201.
Median cluster size: 12.
Minimum cluster size: 4.
SASH accuracy of ~96%.
Example at NII:
WEBCAT Plus library database.
GETA search engine.
WEBCAT Plus QRC tool currently under
development (with N. Grira).
Query Result Clustering
Adapting RSC for QRC:
Database size may not be known.
Self-correlation and significance formulas are then
undefined.
Approximation: assume infinite database size.
As |S| tends to infinity, the normalized
squared significance tends to:
$$ \frac{Z^2(A)}{|S|-1} = |A| \cdot SR^2(A) \to |A| \cdot SC^2(A) $$
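The limit can be illustrated numerically: as |S| grows with |A| and SC(A) held fixed, SR(A) = (|S| SC(A) - |A|) / (|S| - |A|) approaches SC(A). The function names and the values |A| = 10, SC(A) = 0.8 are made up for the example.

```python
def norm_sq_sig_finite(size, sc_value, n):
    """|A| * SR(A)^2 for a database of known size n."""
    sr_value = (n * sc_value - size) / (n - size)
    return size * sr_value**2

def norm_sq_sig_infinite(size, sc_value):
    """Limit of the above as n tends to infinity: |A| * SC(A)^2."""
    return size * sc_value**2

for n in (100, 10_000, 10**9):
    print(n, norm_sq_sig_finite(10, 0.8, n))
print("limit", norm_sq_sig_infinite(10, 0.8))
```

The limiting form depends only on the set size and the overlaps between relevant sets, so it can be evaluated when the database size is unknown.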
Assumptions:
Similarity measure and candidate features
unknown.
Assess the effectiveness of the features &
similarity measure for search and clustering.
Good Feature Sets
For any given 'true'
cluster C:
For any item v in C, any
query based at v should
ideally rank the items of C
ahead of any other items.
Two-set classification.
Best result – when there
exists a partition of the
data set into clusters such
that the two-set
classification property
holds.
$$ DR(A) = SR(A) - \frac{1}{|S|-|A|} \sum_{v \notin A} R(A, Q(v,|A|)) $$
$$ \text{maximize} \quad \frac{1}{|S|} \sum_{q \in S} SR(Q(q,k_q)), \qquad k_q = \arg\max_{1 \le k \le |S|} DR(Q(q,k)) $$
Properties:
If all relevant sets are randomly generated, criterion
is 0.
If all relevant sets are identical, criterion is 0.
If two-set classification holds for all q, criterion
is 1 (the maximum possible).
Conjectured: for any disjoint partition of the data
into clusters of size at least some constant, there
exists a set of rankings for which the maximum of 1 is
achieved.
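The third property can be checked on a toy example. The sketch below evaluates the criterion by brute force for a hypothetical ranking over six items in two clusters, {0, 1, 2} and {3, 4, 5}, in which every query ranks its own cluster first. It assumes items are the integers 0..n-1, and it skips k = |S|, where the same-size form of R used here would divide by zero. All names are illustrative, and the cost is far too high for anything beyond tiny inputs.

```python
def r_same_size(a, b, n):
    """R(A, B) when |A| == |B| == k < n."""
    k = len(a)
    return (n * len(a & b) / k - k) / (n - k)

def sr(a, query, n):
    """SR(A): mean correlation of A with the relevant sets of its members."""
    return sum(r_same_size(a, set(query(v, len(a))), n) for v in a) / len(a)

def dr(a, query, n):
    """DR(A): SR(A) minus the mean correlation with relevant sets of non-members."""
    outside = [v for v in range(n) if v not in a]
    mean_out = sum(r_same_size(a, set(query(v, len(a))), n) for v in outside) / len(outside)
    return sr(a, query, n) - mean_out

def criterion(query, n):
    """Mean over q of SR(Q(q, k_q)), where k_q maximizes DR(Q(q, k)) for 1 <= k < n."""
    total = 0.0
    for q in range(n):
        candidates = [set(query(q, k)) for k in range(1, n)]  # k = n skipped (assumption)
        best = max(candidates, key=lambda a: dr(a, query, n))
        total += sr(best, query, n)
    return total / n

def two_cluster_query(v, k):
    """Hypothetical ranking: own cluster first, then by index distance."""
    return sorted(range(6), key=lambda w: (w // 3 != v // 3, abs(w - v), w))[:k]

print(criterion(two_cluster_query, 6))  # 1.0: two-set classification holds for every q
```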
Problems:
Single linkage (agglomerative) clustering
techniques produce clusters with poor internal
association.
Traditional clustering techniques do not scale
well to large set sizes.
Protein sequence data is “inherently” high-dimensional.
Protein Sequence Similarity