
A Combinatorial Approach to Search and Clustering

Michael Houle
National Institute of Informatics

26 April 2007


Overview

I. Shared-Neighbor Clustering
II. The Relevant-Set Correlation Model
a. Association Measures
b. Significance of Association
c. Partial Significance and Reshaping
III. The GreedyRSC Heuristic
IV. Experimental Results
V. Extensions and Applications
a. Query Result Clustering
b. Outlier Detection
c. Feature Set Evaluation and Selection



What is Clustering?

• Clustering is:
  – Organization of data into well-differentiated groups of highly-similar items.
  – A form of unsupervised learning.
  – A fundamental operation in data mining & knowledge discovery.
  – An important tool in the design of efficient algorithms and heuristics.
  – Closely related to search & retrieval.


Clustering Paradox

• Clustering models/methods traditionally make assumptions about the nature of the data:
  – Data representation.
  – Similarity measures.
  – Data distribution.
  – Cluster numbers, sizes and/or densities.
  – Definition of noise.

• … but cluster analysis seeks to discover the nature of the data!

Shared-Neighbor Clustering

• Similarity measures not fully trusted?
  – Curse of dimensionality – concentration effect.
  – Variations in density.
  – Lack of objective meaning.

• Shared-neighbor information:
  – “If two items have many neighbors in common, they are probably closely related.”
  – Similarity measure used primarily for ranking.
  – Adaptive to variations in density.


Shared-Neighbor Clustering Methods (1)

• Jarvis-Patrick (1973)
  – Hierarchical clustering heuristic.
  – Single-linkage merge criterion.
  – Fixed-cardinality neighborhoods.
  – Merge threshold t.
  – Merge if there exists a pair a, b such that:
    · a and b are k-NNs of one another;
    · the intersection of their k-NN lists contains at least t·k items.


Shared-Neighbor Clustering Methods (2)

• ROCK (Guha, Rastogi, Shim 2000)
  – Hierarchical clustering heuristic.
  – Fixed-radius neighborhoods.
  – Pairwise linkage defined as the size of the intersection of the neighborhoods.
  – Merge the pair of clusters for which the total (size-weighted) inter-cluster linkage is maximized.


Shared-Neighbor Clustering Methods (3)

• SNN (Ertöz, Steinbach, Kumar 2003)
  – Based on DBSCAN (1996):
    · Density over fixed-radius neighborhoods.
    · Core points – density exceeding a supplied threshold.
    · Merging – if one core point is contained in the neighborhood of another.
  – SNN: DBSCAN with fixed-cardinality neighborhoods.
  – Similarity: intersection size of fixed-cardinality neighborhoods.

Drawbacks of Shared-Neighbor Clustering

• Fixed-cardinality k-NN neighborhoods:
  – Bias towards clusters of size on the order of k.
  – Examples: Jarvis-Patrick, SNN.
  – How to choose k ?

• Fixed-radius neighborhoods:
  – Bias towards clusters of higher density.
  – Example: ROCK.
  – How to choose the radius?

• Clustering depends on parameters that make implicit assumptions regarding the data.

Desiderata for Clustering

• Fully automated clustering:
  – Similarity measure available, but used strictly for ranking.
  – Otherwise, no knowledge of the data distribution.
  – Parameters must have a domain-independent interpretation.
  – Automatic determination of the number of clusters and of cluster sizes.

• Other desiderata:
  – Scalable heuristics.
  – Adaptive to variations in density.
  – Handles cluster overlap.




Query-Based Clustering

• How can we cluster when the nature of the data is hidden?
  – No pairwise (dis)similarity measure?
  – Only assumption: relevancy rankings for queries-by-example.
  – Q(q, k): ranked relevant set for query item q, with |Q(q, k)| = k.

• Clusters will be patterned on query relevant sets Q(q, k) for some q in S.
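
Since everything that follows is phrased in terms of Q(q, k), a minimal sketch of such an oracle may help fix ideas. The inner-product similarity, the function names, and the toy data below are illustrative assumptions only; the model itself consumes the rankings, never the underlying scores.

```python
import numpy as np

def make_oracle(X):
    """Return Q(q, k): the set of the k items most relevant to item q.

    X is an (n, d) array of item vectors. The similarity is a stand-in;
    only the resulting ranks matter to the model.
    """
    sims = X @ X.T                         # toy similarity: inner products
    order = np.argsort(-sims, axis=1)      # row q: all items ranked for query q
    def Q(q, k):
        return set(order[q, :k].tolist())  # ranked relevant set, |Q(q, k)| = k
    return Q

rng = np.random.default_rng(0)
Q = make_oracle(rng.normal(size=(100, 5)))
print(sorted(Q(0, 10)))                    # the 10 items most relevant to item 0
```
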
Confidence

• Two sets A and B are related according to their degree of overlap.
• Natural measure – confidence (inspired by Association Rule Mining, and related to the Jarvis-Patrick merge criterion):

$$0 \le \mathrm{conf}(A,B) = \frac{|A \cap B|}{|A|} \le 1$$

• Interpretation of conf : precision & recall (as in IR).
  – Query result Q for concept set C.
  – Precision is conf (Q, C).
  – Recall is conf (C, Q).

Mutual Confidence

• Symmetric measure – mutual confidence:

$$0 \le \mathrm{MC}(A,B) = \sqrt{\mathrm{conf}(A,B)\cdot \mathrm{conf}(B,A)} = \frac{|A \cap B|}{\sqrt{|A|\cdot|B|}} \le 1$$

• Interpretation of MC : cosine of the angle between the set vectors.
  – If item j is a member, the j-th coordinate equals 1.
  – Otherwise, the j-th coordinate equals 0.
  – cos⁻¹(MC (A, B)) is a distance metric.

Set Correlation

• Pearson correlation formula:

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$$

• Apply this to the coordinate pairs of the set vectors for A, B ⊂ S…
• This gives the set correlation between A and B:

$$R(A,B) = \frac{|S|}{\sqrt{(|S|-|A|)(|S|-|B|)}}\left(\frac{|A \cap B|}{\sqrt{|A|\cdot|B|}} - \frac{\sqrt{|A|\cdot|B|}}{|S|}\right)$$

• Tends to the cosine similarity measure when A and B are small relative to S.
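
These three measures translate almost directly into code. A small sketch over explicit sets, with n standing in for |S| (the sets and the value of n are illustrative):

```python
from math import sqrt

def conf(A, B):                    # confidence: |A ∩ B| / |A|
    return len(A & B) / len(A)

def mutual_conf(A, B):             # cosine of the angle between the set vectors
    return len(A & B) / sqrt(len(A) * len(B))

def set_correlation(A, B, n):      # Pearson correlation of the set vectors; n = |S|
    a, b = len(A), len(B)
    return (n / sqrt((n - a) * (n - b))) * (len(A & B) / sqrt(a * b) - sqrt(a * b) / n)

A, B = {0, 1, 2, 3}, {2, 3, 4, 5}
print(conf(A, B), mutual_conf(A, B), set_correlation(A, B, n=100))
```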


Intra-set Association

• How do we measure the goodness of a cluster candidate C ?
  – No pairwise similarity measure is available!

• First-order criterion – if v belongs to C, then:
  – The items relevant to v should belong to C.
  – R (C, Q(v, |C|)) should be high.

• Second-order criterion – if v, w belong to C, then:
  – The items relevant to v should coincide with those relevant to w.
  – (Only the first-order criterion is discussed here.)


Self-confidence

• Measure – self-confidence:
  – The average mutual confidence between a set and the (same-sized) relevant sets of its members.
  – Denoted SC (A), where

$$SC(A) = \frac{1}{|A|}\sum_{v \in A} \mathrm{MC}\big(A,\, Q(v,|A|)\big) = \frac{1}{|A|^2}\sum_{v \in A} \big|A \cap Q(v,|A|)\big|$$

• Related to the SNN density criterion.


Self-correlation

• Measure – self-correlation:
  – The average correlation between a set and the (same-sized) relevant sets of its members.
  – Denoted SR (A), where

$$SR(A) = \frac{1}{|A|}\sum_{v \in A} R\big(A,\, Q(v,|A|)\big) = \frac{|S|\cdot SC(A) - |A|}{|S| - |A|}$$
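
Both intra-set measures reduce to counting overlaps between A and its members' relevant sets. A sketch, assuming a relevant-set oracle Q as sketched earlier:

```python
def self_confidence(A, Q):
    """SC(A) = (1/|A|^2) * sum over v in A of |A ∩ Q(v, |A|)|."""
    k = len(A)
    return sum(len(A & Q(v, k)) for v in A) / k**2

def self_correlation(A, Q, n):
    """SR(A) = (n * SC(A) - |A|) / (n - |A|), where n = |S|."""
    return (n * self_confidence(A, Q) - len(A)) / (n - len(A))
```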


Significance & Size

• Which aggregation of points is more significant?
  – SC (A) = 0.8525, SR (A) = 0.815625.
  – SC (B) = 1.0, SR (B) = 1.0.
  – SC (C) = 0.45, SR (C) ≈ 0.3888889.

• Set size must be considered.

• Note: proper interpretation of a Pearson correlation requires a test of significance.


Randomness Hypothesis

• What if every query relevant set (QRS) were independently selected uniformly at random?
  – The size of the intersection between a QRS and a fixed set is distributed hypergeometrically.
  – Its expectation and variance can be determined.

• If X = |A ∩ B|, where A is fixed and B is selected randomly from S, then:

$$E(X) = \frac{|A|\cdot|B|}{|S|} \qquad \mathrm{Var}(X) = \frac{|A|\cdot|B|\,(|S|-|A|)(|S|-|B|)}{|S|^2\,(|S|-1)}$$
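
This is just the hypergeometric law for drawing |B| items from a population of |S| containing |A| marked items, so the two formulas are easy to sanity-check. A quick sketch against SciPy, with illustrative sizes:

```python
from scipy.stats import hypergeom

s, a, b = 1000, 40, 25           # illustrative values of |S|, |A|, |B|
X = hypergeom(M=s, n=a, N=b)     # population s, a "successes", b draws
print(X.mean(), a * b / s)                                     # E(X)
print(X.var(), a * b * (s - a) * (s - b) / (s**2 * (s - 1)))   # Var(X)
```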


Significance & Standard Scores

• Standard score under the randomness hypothesis: a measure of the deviation from randomness.
  – ZSC (A): the number of standard deviations of SC (A) from its expectation.
  – ZSR (A): the number of standard deviations of SR (A) from its expectation.
• The greater the standard scores, the more significant the aggregation.

$$Z_{SC}(A) = \frac{SC(A) - E\big(SC(A)\big)}{\sqrt{\mathrm{Var}\big(SC(A)\big)}} \qquad Z_{SR}(A) = \frac{SR(A) - E\big(SR(A)\big)}{\sqrt{\mathrm{Var}\big(SR(A)\big)}}$$


Intra-Set Significance

$$E\big(R(A,B)\big) = \frac{|S|}{\sqrt{(|S|-|A|)(|S|-|B|)}}\left(\frac{E\big(|A \cap B|\big)}{\sqrt{|A|\cdot|B|}} - \frac{\sqrt{|A|\cdot|B|}}{|S|}\right) = \frac{|S|}{\sqrt{(|S|-|A|)(|S|-|B|)}}\left(\frac{1}{\sqrt{|A|\cdot|B|}}\cdot\frac{|A|\cdot|B|}{|S|} - \frac{\sqrt{|A|\cdot|B|}}{|S|}\right) = 0$$

$$\mathrm{Var}\big(R(A,B)\big) = \frac{|S|^2}{(|S|-|A|)(|S|-|B|)\,|A|\cdot|B|}\,\mathrm{Var}\big(|A \cap B|\big) = \frac{|S|^2}{(|S|-|A|)(|S|-|B|)\,|A|\cdot|B|} \cdot \frac{|A|\cdot|B|\,(|S|-|A|)(|S|-|B|)}{|S|^2\,(|S|-1)} = \frac{1}{|S|-1}$$

$$Z_{SR}(A) = \frac{SR(A) - E\big(SR(A)\big)}{\sqrt{\mathrm{Var}\big(SR(A)\big)}} = \frac{SR(A) - \frac{1}{|A|}\sum_{v \in A} E\Big(R\big(A, Q(v,|A|)\big)\Big)}{\sqrt{\frac{1}{|A|^2}\sum_{v \in A} \mathrm{Var}\Big(R\big(A, Q(v,|A|)\big)\Big)}} = \sqrt{|A|\,(|S|-1)}\; SR(A)$$


Intra-Set Significance

$$E\big(SC(A)\big) = \frac{1}{|A|^2}\sum_{v \in A} E\big(|A \cap Q(v,|A|)|\big) = \frac{1}{|A|^2}\sum_{v \in A} \frac{|A|^2}{|S|} = \frac{|A|}{|S|}$$

$$\mathrm{Var}\big(SC(A)\big) = \frac{1}{|A|^4}\sum_{v \in A} \mathrm{Var}\big(|A \cap Q(v,|A|)|\big) = \frac{(|S|-|A|)^2}{|A|\cdot|S|^2\,(|S|-1)}$$

• Can show:

$$Z_{SR}(A) = Z_{SC}(A) = \sqrt{|A|\,(|S|-1)}\; SR(A) = Z(A)$$

• Z (A): the intra-set significance of the set A.
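
Since both standard scores collapse to the same closed form, Z (A) is cheap to evaluate once the overlaps are known. A self-contained sketch (Q as before, n = |S|):

```python
from math import sqrt

def intra_set_significance(A, Q, n):
    """Z(A) = sqrt(|A| * (n - 1)) * SR(A), with SC and SR inlined."""
    k = len(A)
    sc = sum(len(A & Q(v, k)) for v in A) / k**2   # self-confidence SC(A)
    sr = (n * sc - k) / (n - k)                    # self-correlation SR(A)
    return sqrt(k * (n - 1)) * sr
```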


Example

• Set significances of A, B, C:
  – SC (A) = 0.8525, SR (A) = 0.815625, Z (A) ≈ 36.29.
  – SC (B) = 1.0, SR (B) = 1.0, Z (B) ≈ 22.25.
  – SC (C) = 0.45, SR (C) ≈ 0.3888889, Z (C) ≈ 12.24.

• Z (A) > Z (B) > Z (C).


Inter-Set Significance

• Z (A, B): the inter-set significance of (the relationship between) A and B.

$$E\big(R(A,B)\big) = 0 \qquad E\big(\mathrm{MC}(A,B)\big) = \frac{\sqrt{|A|\cdot|B|}}{|S|}$$

$$\mathrm{Var}\big(R(A,B)\big) = \frac{1}{|S|-1} \qquad \mathrm{Var}\big(\mathrm{MC}(A,B)\big) = \frac{(|S|-|A|)(|S|-|B|)}{|S|^2\,(|S|-1)}$$

$$Z(A,B) = \frac{R(A,B) - E\big(R(A,B)\big)}{\sqrt{\mathrm{Var}\big(R(A,B)\big)}} = \frac{\mathrm{MC}(A,B) - E\big(\mathrm{MC}(A,B)\big)}{\sqrt{\mathrm{Var}\big(\mathrm{MC}(A,B)\big)}} = \sqrt{|S|-1}\; R(A,B)$$

• For fixed S, inter-set significance is equivalent to the set correlation R (A, B).


Contributions to Significance

• Some members contribute more than others towards the set significance Z (A).
• Contribution of member v to SR (A):

$$t(v|A) = \frac{1}{|A|}\, R\big(A,\, Q(v,|A|)\big)$$

• Can also consider the potential contribution Z (v|A) even when v ∉ A.


Partial Significance

• Standard score of t (v|A) w.r.t. the randomness hypothesis:

$$Z(v|A) = \sqrt{|S|-1}\; R\big(A,\, Q(v,|A|)\big)$$

• The significance of A can be expressed in terms of these partial significances:

$$Z(A) = \frac{1}{\sqrt{|A|}}\sum_{v \in A} Z(v|A)$$


Set Reshaping (1)

• Idea: modify A so as to boost its significance.
• A new set A′ has average mutual correlation to A:

$$SR(A'|A) = \frac{1}{|A'|}\sum_{v \in A'} R\big(A,\, Q(v,|A|)\big)$$

• Significance of SR (A′|A) w.r.t. the randomness hypothesis:

$$Z(A'|A) = \frac{1}{\sqrt{|A'|}}\sum_{v \in A'} Z(v|A)$$


Set Reshaping (2)

• For a fixed size |A′|, we can maximize Z (A′|A) by taking the items with the largest values of Z (v|A).
• For this example:
  – Z (A|A) = Z (A) = 36.29.
  – Z (A′|A) = 37.18.
  – The maximum is achieved at A′.

• A serves as a pattern for the discovery of A′.
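
Reshaping is then a one-pass ranking: score every candidate item by Z (v|A) and keep the top |A′|. A sketch; the candidate pool `items` (e.g. the whole data set) and the parameter names are assumptions:

```python
from math import sqrt

def reshape(A, items, Q, n, new_size):
    """Return the size-new_size set A' maximizing Z(A'|A); n = |S|.

    For equal-size sets, R(A, B) = (n / (n - k)) * (|A ∩ B| / k - k / n),
    so ranking items by Z(v|A) amounts to ranking them by overlap with A.
    """
    k = len(A)
    def z(v):   # Z(v|A) = sqrt(n - 1) * R(A, Q(v, k))
        return sqrt(n - 1) * (n / (n - k)) * (len(A & Q(v, k)) / k - k / n)
    return set(sorted(items, key=z, reverse=True)[:new_size])
```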


Partitioning

• What if we need to assign item v to a single group?

• Want a group C for which both:
  – the set significance is high; and
  – the significance of the relationship with v is high.

• In practice, can choose the group C (reshaped from pattern C*) satisfying:

$$\underset{C^{*}}{\text{maximize}}\;\; Z(v\,|\,C^{*}) \cdot Z(C\,|\,C^{*})$$
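
A sketch of this assignment rule, assuming the clusters are kept as (C, C*) pairs coming out of the reshaping step; the helper names are illustrative:

```python
from math import sqrt

def z_member(v, P, Q, n):
    """Z(v|P) = sqrt(n - 1) * R(P, Q(v, |P|)) for equal-size sets; n = |S|."""
    k = len(P)
    return sqrt(n - 1) * (n / (n - k)) * (len(P & Q(v, k)) / k - k / n)

def z_reshaped(C, P, Q, n):
    """Z(C|P) = (1 / sqrt(|C|)) * sum of Z(v|P) over v in C."""
    return sum(z_member(v, P, Q, n) for v in C) / sqrt(len(C))

def assign(v, clusters, Q, n):
    """clusters: list of (C, C_star) pairs; pick the single best home for v."""
    return max(clusters,
               key=lambda cp: z_member(v, cp[1], Q, n) * z_reshaped(cp[0], cp[1], Q, n))[0]
```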




Cluster Map Generation

• Nodes are sets having sufficiently-high significance scores.

• Edges appear between set nodes having sufficiently-high inter-set significances.

• Retained cluster candidates should not be too highly correlated with other retained candidates.

• The final clusters form an “independent set” within an initial candidate cluster map.


Candidate Map

• Nodes meeting a minimum threshold on set significance.
• Edges meeting a minimum threshold on inter-set significance (correlation).


• Nodes ranked by significance (red is highest).
• Thick edges join nodes whose set correlations are too high.


• Need an independent node set within the thick-edged subgraph.
• Heuristic: greedy by significance.




Cluster Map

• The final cluster map is typically disconnected.
• Cluster nodes form a rough cover of the data set.
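
The greedy independent-set step, sketched as code. The interface here (significance-scored candidate sets plus a set-correlation function) is an assumption for illustration; the 0.5 correlation cutoff matches the experimental settings quoted later.

```python
def greedy_select(candidates, corr, max_corr=0.5):
    """Keep a candidate only if it is not too correlated with any kept cluster.

    candidates: (significance, cluster_set) pairs; corr(A, B): set correlation.
    """
    kept = []
    for _, c in sorted(candidates, key=lambda sc: -sc[0]):  # most significant first
        if all(corr(c, other) < max_corr for other in kept):
            kept.append(c)
    return kept
```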


GreedyRSC

[Flowchart: DB → queries → sample relevant sets → candidate patterns → pattern pruning → cluster generation → cluster pruning → member assignment → cluster map generation.]

Scalability Issues

• Problems:
  – Curse of dimensionality: computing the queries Q(q, k) can take time linear in |S| even when k is small.
  – Computing Z (A) takes time quadratic in |A|.

• Workarounds:
  – Approximate neighborhoods via the SASH search structure [H. '03, H. & Sakuma '05].
  – Pattern generation over samples of varying sizes, with a fixed range for |A|.


Sampling Strategy

• Create bands of samples of sizes |S|, |S|/2, |S|/4, ….
• Within each sample, compute a candidate patch for each element via maximization of set significance.
  – Fixed range of patch sizes a < k < b.
• Within each sample, select patterns greedily and eliminate duplicates.
• Reshape the patterns to form cluster candidates.
• Eliminate duplicate candidates to form the final cluster set.
• Generate the cluster map edges.
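
In outline, the band sampling and the per-element pattern search look roughly as follows. This is a sketch under simplifying assumptions: duplicate elimination is exact matching only, whereas GreedyRSC also prunes near-duplicates by correlation.

```python
import random
from math import sqrt

def z_score(A, Q, n):
    """Intra-set significance Z(A) = sqrt(|A| * (n - 1)) * SR(A); n = |S|."""
    k = len(A)
    sc = sum(len(A & Q(v, k)) for v in A) / k**2
    return sqrt(k * (n - 1)) * (n * sc - k) / (n - k)

def best_pattern(v, Q, n, a, b):
    """The relevant set Q(v, k), a < k < b, maximizing set significance."""
    return max((frozenset(Q(v, k)) for k in range(a + 1, b)),
               key=lambda A: z_score(A, Q, n))

def generate_patterns(S, Q, a=8, b=64, seed=0):
    """Pattern generation over sample bands of size |S|, |S|/2, |S|/4, ..."""
    rng, n = random.Random(seed), len(S)
    sample, patterns = list(S), set()
    while len(sample) >= b:
        patterns.update(best_pattern(v, Q, n, a, b) for v in sample)
        sample = rng.sample(sample, len(sample) // 2)   # next, smaller band
    return patterns
```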


Overall Time Complexity

Without data partitioning:
• Precompute the relevant sets: O(n log n) queries, where n = |S|.
• Compute a pattern for each of O(n log n) neighborhoods: O(b²) time each.
• Compute all patterns: O(b² n log n) time.
• Eliminate duplicate patterns: O((b² + σ²) n log n), where σ² is the average variance of the inverted member list sizes over each sample.
• Form candidate clusters: bounded by O(b n log² n).
• Eliminate duplicate clusters & create the map: O((b² + τ²) n log n), where τ² is the average variance of the inverted member list sizes over each sample.
• Total, excluding the relevant set computation: O((b² + σ² + τ²) n log n + b n log² n).

Data partitioning (details omitted!) introduces a factor of c, the number of data chunks.


Clustering Parameters

• For comparison of significance values, the significance scores can be normalized for convenience.
  – A common factor dependent on |S| can be dropped.
  – Normalizing the inter-set significance → set correlation.
  – Normalizing the square of the set significance: use

$$0 \le \frac{Z^2(A)}{|S|-1} = |A|\cdot SR^2(A) \le |A|$$

• For all experiments:
  – Minimum normalized squared set significance = 4.
  – Maximum normalized inter-set significance = 0.5.
  – Minimum normalized inter-set significance = 0.1 (for cluster map edges).

Images

• Amsterdam Library of Object Images (ALOI):
  – Dense feature vectors, colour & texture histograms (prepared by INRIA-Rocquencourt).
  – Number of vectors: 110,250.
  – 641 features per vector.
  – 5322 clusters computed in < 4 hours on a desktop (older, slower implementation).
  – Maximum cluster size: 7201.
  – Median cluster size: 12.
  – Minimum cluster size: 4.
  – SASH accuracy of ~96%.


Journal Abstracts

• Medline medical journal abstracts, 1996 to mid-2003:
  – Vectors with TF-IDF term weighting, NO dimensional reduction (prepared by IBM TRL).
  – Number of vectors: 1,055,073.
  – ~75 non-zero attributes per vector.
  – Representational dimension: 1,101,003.
  – 9789 clusters computed in < 24 hours on a 3.0 GHz desktop.
  – Maximum cluster size: 15,255.
  – Median cluster size: 45.
  – Minimum cluster size: 4.
  – SASH accuracy of 51%, 115× faster than sequential search.


Protein Sequences

• Bacterial ORFs (protein sequences):
  – Vectors of gapped BLAST scores with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003) (prepared by the DNA Data Bank of Japan).
  – Number of vectors: 378,659.
  – ~125 non-zero attributes per vector.
  – Representational dimension: 40,000.
  – Vector preparation: < 1 day on a 16-node PC cluster.
  – 8907 clusters computed in 7 hours on a 3.0 GHz desktop.
  – Maximum cluster size: 69,859.
  – Median cluster size: 20.
  – Minimum cluster size: 4.
  – SASH accuracy of 75%, 34× faster than sequential search.

Demos





Query Result Clustering

• “Pure” shared-neighbor clustering can be used to cluster the results of queries.
  – Produce long ranked query result lists.
  – The ranking function can be hidden.
  – The database must support queries-by-example.
  – Otherwise, it can be performed without the cooperation of the database manager.

• Example at NII:
  – WEBCAT Plus library database.
  – GETA search engine.
  – WEBCAT Plus QRC tool currently under development (with N. Grira).

Query Result Clustering

• Adapting RSC for QRC:
  – The database size may not be known.
  – The self-correlation and significance formulas are then undefined.
  – Approximation: assume an infinite database size.
• As |S| tends to infinity, SR (A) tends to SC (A), and the normalized squared significance tends to:

$$\frac{Z^2(A)}{|S|-1} = |A|\cdot SR^2(A) \;\to\; |A|\cdot SC^2(A)$$

• This formula does not depend on |S|.
• Computation as per the GreedyRSC heuristic.




Outlier Detection

• RSC model:
  – Patterns of low significance can indicate the presence of outliers.
  – These can be detected as per the initial stages of GreedyRSC.
  – Many potential definitions are possible.

• Some possible formulations (to be minimized; v_i denotes the i-th ranked item of Q(v, k)):

$$\frac{Z\big(Q(v,k)\big)}{\displaystyle\max_{1 \le i \le k} Z\big(Q(v_i,k)\big)} \qquad\qquad \frac{SR\big(Q(v,k)\big)}{\displaystyle\max_{1 \le i \le k} SR\big(Q(v_i,k)\big)}$$

• Work in progress with M. Gebski, NICTA.


Feature Set Evaluation

• Feature selection methods:
  – To our knowledge, current feature selection techniques have been developed only for supervised learning.
  – A training set is needed to guide the process.
  – Work with N. Grira: unsupervised feature set evaluation and selection.

• Assumptions:
  – Similarity measure and candidate features unknown.
  – Assess the effectiveness of the features & similarity measure for search and clustering.

Good Feature Sets

• For any given ‘true’ cluster C:
  – For any item v in C, any query based at v should ideally rank the items of C ahead of any other items.
  – Two-set classification.
  – Best result – when there exists a partition of the data set into clusters such that the two-set classification property holds.


Distinctiveness

• Distinctiveness criterion:
  – A variant of self-correlation.
  – External items that are well-correlated are penalized.
  – Equals 1 if A is perfectly associated (SR (A) = 1) and the external points are uncorrelated with A.

$$DR(A) = \frac{1}{|A|}\sum_{v \in A} R\big(A,\, Q(v,|A|)\big) \;-\; \frac{1}{|S|-|A|}\sum_{v \notin A} R\big(A,\, Q(v,|A|)\big) \;=\; SR(A) \;-\; \frac{1}{|S|-|A|}\sum_{v \notin A} R\big(A,\, Q(v,|A|)\big)$$
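
Directly from the definition, a sketch (S is the full item set; the sum over all external items is exactly what makes DR expensive in practice):

```python
def distinctiveness(A, S, Q):
    """DR(A): average internal correlation minus average external correlation."""
    n, k = len(S), len(A)
    def r(v):   # R(A, Q(v, |A|)) for equal-size sets
        return (n / (n - k)) * (len(A & Q(v, k)) / k - k / n)
    inside = sum(r(v) for v in A) / k
    outside = sum(r(v) for v in S - A) / (n - k)
    return inside - outside
```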


Significance

• Significance can be derived as under RSC:
  – Randomness hypothesis.

$$Z_{DR}(A) = \sqrt{\frac{|A|\,(|S|-|A|)(|S|-1)}{|S|}}\; DR(A)$$

• Unlike the self-correlation significance, the distinctiveness significance tends to 0 as |A| tends to |S|.
• Distinctiveness is expensive to compute in practice.
  – Can use SR (A) to approximate DR (A).


Feature Set Evaluation

• Criterion for feature set selection:
  – For each item q, estimate the most significant relevant set based at q.
  – Average the self-correlations of the most significant relevant sets identified.
  – Serves as the basis for search, e.g. local improvement methods such as Tabu Search.

$$\text{maximize}\;\; \frac{1}{|S|}\sum_{q \in S} SR\big(Q(q, k_q)\big), \qquad k_q = \underset{1 \le k \le |S|}{\mathrm{argmax}}\; DR\big(Q(q,k)\big)$$
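
A sketch of the criterion, reusing the distinctiveness() and self-correlation sketches above; capping the argmax search at an assumed max_k keeps it tractable, since scanning every k up to |S| is impractical:

```python
def feature_set_score(S, Q, max_k=100):
    """Average SR(Q(q, k_q)) over q, with k_q = argmax over k of DR(Q(q, k))."""
    n, total = len(S), 0.0
    for q in S:
        k_q = max(range(2, max_k + 1),
                  key=lambda k: distinctiveness(Q(q, k), S, Q))
        A = Q(q, k_q)
        sc = sum(len(A & Q(v, k_q)) for v in A) / k_q**2
        total += (n * sc - k_q) / (n - k_q)   # SR(Q(q, k_q))
    return total / n
```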


Feature Set Evaluation

$$\text{maximize}\;\; \frac{1}{|S|}\sum_{q \in S} SR\big(Q(q, k_q)\big), \qquad k_q = \underset{1 \le k \le |S|}{\mathrm{argmax}}\; DR\big(Q(q,k)\big)$$

• Properties:
  – If all relevant sets are randomly generated, the criterion is 0.
  – If all relevant sets are identical, the criterion is 0.
  – If two-set classification holds for all q, the criterion is 1 (the maximum possible).
  – Conjectured: for any disjoint partition of the data into clusters of at least some constant size, there exists a set of rankings for which the maximum of 1 is achieved.


Background: Protein Sequence Analysis

• Applications of clustering:
  – Classification of sequences of unknown functionality with respect to clusters of sequences of known functionality.
  – Discovery of new motifs from clusters of sequences of previously unknown function.

• Problems:
  – Single-linkage (agglomerative) clustering techniques produce clusters with poor internal association.
  – Traditional clustering techniques do not scale well to large set sizes.
  – Protein sequence data is “inherently” high-dimensional.

Protein Sequence Similarity

• Pairwise (gapped) sequence alignment scoring: the BLAST heuristic [Altschul et al. '97].
  – Bonus for matching and near-matching symbols.
  – Penalty for non-matching symbols.
  – Penalty for gaps (increases with gap length).
  – Dynamic programming (expensive!).
  – Faster heuristics exist.

MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60
M M++K+L+PTDFSE A A++ + ++ EVILLHVIDE +++ L+ G +
MIFMFRKVLFPTDFSEGAYRAVEVFEKRNKMEVGEVILLHVIDEGTLEE-----LMDGYS 55


Alignment-based Similarity

• Problem: direct use of BLAST scores fails!
  – Not transitive, since alignments are incomplete.
  – A SASH index for approximate neighbourhood computation achieves a poor accuracy-versus-time trade-off.
  – Sequential search would work, but is too expensive.

• Example: pairs (A,B) and (B,C) have high BLAST scores, but pair (A,C) has a score of zero.


Reference Set Similarity

• Solution (with Å. J. Västermark):
  – Vectors of gapped-alignment BLAST scores with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003).
  – Conversion of BLAST scores to E-values.
  – Vector sparsification via thresholding to zero.
  – Vector angle distance metric for neighbourhood computation.

[Figure: the full set of sequences is scored against the sample sequences, yielding a matrix of BLAST E-values.]
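
A sketch of the vectorization and the distance. The E-value-to-score transform and the threshold value are assumed, illustrative choices, not the exact pipeline used here:

```python
import numpy as np

def reference_vectors(evalues, threshold=1e-3):
    """Sparsify an (n_sequences, n_reference) matrix of BLAST E-values.

    Smaller E-values mean stronger matches, so scores are taken as -log E;
    weak matches are thresholded to zero for sparsity.
    """
    scores = -np.log(np.maximum(evalues, 1e-300))
    scores[scores < -np.log(threshold)] = 0.0
    return scores

def angle_distance(u, v):
    """Vector angle distance, a proper metric for neighbourhood computation."""
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```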


Analogy to Text

• Each sequence is analogous to a document.
• The reference sequences are analogous to terms.
• Sparse vectorization:
  – Significant BLAST scores are analogous to the terms appearing in a document.
  – Strong BLAST scores are analogous to the dominant terms of a document (as per TF-IDF weighting).

[Figure: as before – the matrix of BLAST E-values of the full sequence set against the sample sequences.]


