
A Combinatorial Approach to Search and Clustering

Michael Houle
National Institute of Informatics

26 April 2007


Overview

I. Shared-Neighbor Clustering
II. The Relevant-Set Correlation Model
a. Association Measures
b. Significance of Association
c. Partial Significance and Reshaping
III. The GreedyRSC Heuristic
IV. Experimental Results
V. Extensions and Applications
a. Query Result Clustering
b. Outlier Detection
c. Feature Set Evaluation and Selection



What is Clustering?

• Clustering is:
  – Organization of data into well-differentiated groups of highly-similar items.
  – A form of unsupervised learning.
  – A fundamental operation in data mining & knowledge discovery.
  – An important tool in the design of efficient algorithms and heuristics.
  – Closely related to search & retrieval.


Clustering Paradox

• Clustering models/methods traditionally make assumptions about the nature of the data:
  – Data representation.
  – Similarity measures.
  – Data distribution.
  – Cluster numbers, sizes and/or densities.
  – Definition of noise.

• … but cluster analysis seeks to discover the nature of the data!

Shared-Neighbor Clustering

• Similarity measures not fully trusted?
  – Curse of dimensionality – concentration effect.
  – Variations in density.
  – Lack of objective meaning.

• Shared-neighbor information:
  – “If two items have many neighbors in common, they are probably closely related.”
  – Similarity measure used primarily for ranking.
  – Adaptive to variations in density.


Shared-Neighbor Clustering Methods (1)

• Jarvis-Patrick (1973)
  – Hierarchical clustering heuristic.
  – Single-linkage merge criterion.
  – Fixed-cardinality neighborhoods.
  – Merge threshold t.
  – Merge if there exists a pair a, b such that:
    · a and b are k-NNs of one another;
    · the intersection of their k-NN lists contains at least t·k items.


Shared-Neighbor Clustering Methods (2)

• ROCK (Guha, Rastogi, Shim 2000)
  – Hierarchical clustering heuristic.
  – Fixed-radius neighborhoods.
  – Pairwise linkage defined as the size of the intersection of the neighborhoods.
  – Merge the pair of clusters for which the total (size-weighted) inter-cluster linkage is maximized.


Shared-Neighbor Clustering Methods (3)

• SNN (Ertöz, Steinbach, Kumar 2003)
  – Based on DBSCAN (1996):
    · Density over fixed-radius neighborhoods.
    · Core points – density exceeding a supplied threshold.
    · Merging – if one core point is contained in the neighborhood of another.
  – SNN: DBSCAN with fixed-cardinality neighborhoods.
  – Similarity: intersection size of fixed-cardinality neighborhoods.

Drawbacks of Shared-Neighbor Clustering

• Fixed-cardinality k-NN neighborhoods:
  – Bias towards clusters of size on the order of k.
  – Examples: Jarvis-Patrick, SNN.
  – How to choose k ?

• Fixed-radius neighborhoods:
  – Bias towards clusters of higher density.
  – Example: ROCK.
  – How to choose the radius?

• Clustering depends on parameters that make implicit assumptions regarding the data.

Desiderata for Clustering

• Fully automated clustering:
  – Similarity measure available, but used strictly for ranking.
  – Otherwise, no knowledge of the data distribution.
  – Parameters must have a domain-independent interpretation.
  – Automatic determination of the number of clusters and of cluster sizes.

• Other desiderata:
  – Scalable heuristics.
  – Adaptive to variations in density.
  – Handles cluster overlap.




Query-Based Clustering

• How can we cluster when the nature of the data is hidden?
  – No pairwise (dis)similarity measure?
  – Only assumption: relevancy rankings for queries-by-example.
  – Q(q, k): ranked relevant set for query item q, with |Q(q, k)| = k.

• Clusters will be patterned on query relevant sets Q(q, k) for some q in S.
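
Since everything that follows is phrased in terms of Q(q, k), a minimal sketch of such an oracle may help fix ideas. The inner-product similarity, the function names, and the toy data below are illustrative assumptions only; the model itself consumes the rankings, never the underlying scores.

```python
import numpy as np

def make_oracle(X):
    """Return Q(q, k): the set of the k items most relevant to item q.

    X is an (n, d) array of item vectors. The similarity is a stand-in;
    only the resulting ranks matter to the model.
    """
    sims = X @ X.T                         # toy similarity: inner products
    order = np.argsort(-sims, axis=1)      # row q: all items ranked for query q
    def Q(q, k):
        return set(order[q, :k].tolist())  # ranked relevant set, |Q(q, k)| = k
    return Q

rng = np.random.default_rng(0)
Q = make_oracle(rng.normal(size=(100, 5)))
print(sorted(Q(0, 10)))                    # the 10 items most relevant to item 0
```
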
Confidence

• Two sets A and B are related according to their degree of overlap.
• Natural measure – confidence (inspired by Association Rule Mining, and related to the Jarvis-Patrick merge criterion):

$$0 \le \mathrm{conf}(A,B) = \frac{|A \cap B|}{|A|} \le 1$$

• Interpretation of conf : precision & recall (as in IR).
  – Query result Q for concept set C.
  – Precision is conf (Q, C).
  – Recall is conf (C, Q).

Mutual Confidence

• Symmetric measure – mutual confidence:

$$0 \le \mathrm{MC}(A,B) = \sqrt{\mathrm{conf}(A,B)\cdot \mathrm{conf}(B,A)} = \frac{|A \cap B|}{\sqrt{|A|\cdot|B|}} \le 1$$

• Interpretation of MC : cosine of the angle between the set vectors.
  – If item j is a member, the j-th coordinate equals 1.
  – Otherwise, the j-th coordinate equals 0.
  – cos⁻¹(MC (A, B)) is a distance metric.

Set Correlation

• Pearson correlation formula:

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$$

• Apply this to the coordinate pairs of the set vectors for A, B ⊂ S…
• This gives the set correlation between A and B:

$$R(A,B) = \frac{|S|}{\sqrt{(|S|-|A|)(|S|-|B|)}}\left(\frac{|A \cap B|}{\sqrt{|A|\cdot|B|}} - \frac{\sqrt{|A|\cdot|B|}}{|S|}\right)$$

• Tends to the cosine similarity measure when A and B are small relative to S.
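
These three measures translate almost directly into code. A small sketch over explicit sets, with n standing in for |S| (the sets and the value of n are illustrative):

```python
from math import sqrt

def conf(A, B):                    # confidence: |A ∩ B| / |A|
    return len(A & B) / len(A)

def mutual_conf(A, B):             # cosine of the angle between the set vectors
    return len(A & B) / sqrt(len(A) * len(B))

def set_correlation(A, B, n):      # Pearson correlation of the set vectors; n = |S|
    a, b = len(A), len(B)
    return (n / sqrt((n - a) * (n - b))) * (len(A & B) / sqrt(a * b) - sqrt(a * b) / n)

A, B = {0, 1, 2, 3}, {2, 3, 4, 5}
print(conf(A, B), mutual_conf(A, B), set_correlation(A, B, n=100))
```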


Intra-set Association

• How do we measure the goodness of a cluster candidate C ?
  – No pairwise similarity measure is available!

• First-order criterion – if v belongs to C, then:
  – The items relevant to v should belong to C.
  – R (C, Q(v, |C|)) should be high.

• Second-order criterion – if v, w belong to C, then:
  – The items relevant to v should coincide with those relevant to w.
  – (Only the first-order criterion is discussed here.)


Self-confidence

• Measure – self-confidence:
  – The average mutual confidence between a set and the (same-sized) relevant sets of its members.
  – Denoted SC (A), where

$$SC(A) = \frac{1}{|A|}\sum_{v \in A} \mathrm{MC}\big(A,\, Q(v,|A|)\big) = \frac{1}{|A|^2}\sum_{v \in A} \big|A \cap Q(v,|A|)\big|$$

• Related to the SNN density criterion.


Self-correlation

• Measure – self-correlation:
  – The average correlation between a set and the (same-sized) relevant sets of its members.
  – Denoted SR (A), where

$$SR(A) = \frac{1}{|A|}\sum_{v \in A} R\big(A,\, Q(v,|A|)\big) = \frac{|S|\cdot SC(A) - |A|}{|S| - |A|}$$
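
Both intra-set measures reduce to counting overlaps between A and its members' relevant sets. A sketch, assuming a relevant-set oracle Q as sketched earlier:

```python
def self_confidence(A, Q):
    """SC(A) = (1/|A|^2) * sum over v in A of |A ∩ Q(v, |A|)|."""
    k = len(A)
    return sum(len(A & Q(v, k)) for v in A) / k**2

def self_correlation(A, Q, n):
    """SR(A) = (n * SC(A) - |A|) / (n - |A|), where n = |S|."""
    return (n * self_confidence(A, Q) - len(A)) / (n - len(A))
```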


Significance & Size

• Which aggregation of points is more significant?
  – SC (A) = 0.8525, SR (A) = 0.815625.
  – SC (B) = 1.0, SR (B) = 1.0.
  – SC (C) = 0.45, SR (C) ≈ 0.3888889.

• Set size must be considered.

• Note: proper interpretation of a Pearson correlation requires a test of significance.


Randomness Hypothesis

• What if every query relevant set (QRS) were independently selected uniformly at random?
  – The size of the intersection between a QRS and a fixed set is distributed hypergeometrically.
  – Its expectation and variance can be determined.

• If X = |A ∩ B|, where A is fixed and B is selected randomly from S, then:

$$E(X) = \frac{|A|\cdot|B|}{|S|} \qquad \mathrm{Var}(X) = \frac{|A|\cdot|B|\,(|S|-|A|)(|S|-|B|)}{|S|^2\,(|S|-1)}$$
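
This is just the hypergeometric law for drawing |B| items from a population of |S| containing |A| marked items, so the two formulas are easy to sanity-check. A quick sketch against SciPy, with illustrative sizes:

```python
from scipy.stats import hypergeom

s, a, b = 1000, 40, 25           # illustrative values of |S|, |A|, |B|
X = hypergeom(M=s, n=a, N=b)     # population s, a "successes", b draws
print(X.mean(), a * b / s)                                     # E(X)
print(X.var(), a * b * (s - a) * (s - b) / (s**2 * (s - 1)))   # Var(X)
```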


Significance & Standard Scores

• Standard score under the randomness hypothesis: a measure of the deviation from randomness.
  – ZSC (A): the number of standard deviations of SC (A) from its expectation.
  – ZSR (A): the number of standard deviations of SR (A) from its expectation.
• The greater the standard scores, the more significant the aggregation.

$$Z_{SC}(A) = \frac{SC(A) - E\big(SC(A)\big)}{\sqrt{\mathrm{Var}\big(SC(A)\big)}} \qquad Z_{SR}(A) = \frac{SR(A) - E\big(SR(A)\big)}{\sqrt{\mathrm{Var}\big(SR(A)\big)}}$$


Intra-Set Significance

$$E\big(R(A,B)\big) = \frac{|S|}{\sqrt{(|S|-|A|)(|S|-|B|)}}\left(\frac{E\big(|A \cap B|\big)}{\sqrt{|A|\cdot|B|}} - \frac{\sqrt{|A|\cdot|B|}}{|S|}\right) = \frac{|S|}{\sqrt{(|S|-|A|)(|S|-|B|)}}\left(\frac{1}{\sqrt{|A|\cdot|B|}}\cdot\frac{|A|\cdot|B|}{|S|} - \frac{\sqrt{|A|\cdot|B|}}{|S|}\right) = 0$$

$$\mathrm{Var}\big(R(A,B)\big) = \frac{|S|^2}{(|S|-|A|)(|S|-|B|)\,|A|\cdot|B|}\,\mathrm{Var}\big(|A \cap B|\big) = \frac{|S|^2}{(|S|-|A|)(|S|-|B|)\,|A|\cdot|B|} \cdot \frac{|A|\cdot|B|\,(|S|-|A|)(|S|-|B|)}{|S|^2\,(|S|-1)} = \frac{1}{|S|-1}$$

$$Z_{SR}(A) = \frac{SR(A) - E\big(SR(A)\big)}{\sqrt{\mathrm{Var}\big(SR(A)\big)}} = \frac{SR(A) - \frac{1}{|A|}\sum_{v \in A} E\Big(R\big(A, Q(v,|A|)\big)\Big)}{\sqrt{\frac{1}{|A|^2}\sum_{v \in A} \mathrm{Var}\Big(R\big(A, Q(v,|A|)\big)\Big)}} = \sqrt{|A|\,(|S|-1)}\; SR(A)$$


Intra-Set Significance

$$E\big(SC(A)\big) = \frac{1}{|A|^2}\sum_{v \in A} E\big(|A \cap Q(v,|A|)|\big) = \frac{1}{|A|^2}\sum_{v \in A} \frac{|A|^2}{|S|} = \frac{|A|}{|S|}$$

$$\mathrm{Var}\big(SC(A)\big) = \frac{1}{|A|^4}\sum_{v \in A} \mathrm{Var}\big(|A \cap Q(v,|A|)|\big) = \frac{(|S|-|A|)^2}{|A|\cdot|S|^2\,(|S|-1)}$$

• Can show:

$$Z_{SR}(A) = Z_{SC}(A) = \sqrt{|A|\,(|S|-1)}\; SR(A) = Z(A)$$

• Z (A): the intra-set significance of the set A.
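
Since both standard scores collapse to the same closed form, Z (A) is cheap to evaluate once the overlaps are known. A self-contained sketch (Q as before, n = |S|):

```python
from math import sqrt

def intra_set_significance(A, Q, n):
    """Z(A) = sqrt(|A| * (n - 1)) * SR(A), with SC and SR inlined."""
    k = len(A)
    sc = sum(len(A & Q(v, k)) for v in A) / k**2   # self-confidence SC(A)
    sr = (n * sc - k) / (n - k)                    # self-correlation SR(A)
    return sqrt(k * (n - 1)) * sr
```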


Example

• Set significances of A, B, C:
  – SC (A) = 0.8525, SR (A) = 0.815625, Z (A) ≈ 36.29.
  – SC (B) = 1.0, SR (B) = 1.0, Z (B) ≈ 22.25.
  – SC (C) = 0.45, SR (C) ≈ 0.3888889, Z (C) ≈ 12.24.

• Z (A) > Z (B) > Z (C).


Inter-Set Significance

• Z (A, B): the inter-set significance of (the relationship between) A and B.

$$E\big(R(A,B)\big) = 0 \qquad E\big(\mathrm{MC}(A,B)\big) = \frac{\sqrt{|A|\cdot|B|}}{|S|}$$

$$\mathrm{Var}\big(R(A,B)\big) = \frac{1}{|S|-1} \qquad \mathrm{Var}\big(\mathrm{MC}(A,B)\big) = \frac{(|S|-|A|)(|S|-|B|)}{|S|^2\,(|S|-1)}$$

$$Z(A,B) = \frac{R(A,B) - E\big(R(A,B)\big)}{\sqrt{\mathrm{Var}\big(R(A,B)\big)}} = \frac{\mathrm{MC}(A,B) - E\big(\mathrm{MC}(A,B)\big)}{\sqrt{\mathrm{Var}\big(\mathrm{MC}(A,B)\big)}} = \sqrt{|S|-1}\; R(A,B)$$

• For fixed S, inter-set significance is equivalent to the set correlation R (A, B).


Contributions to Significance

• Some members contribute more than others towards the set significance Z (A).
• Contribution of member v to SR (A):

$$t(v|A) = \frac{1}{|A|}\, R\big(A,\, Q(v,|A|)\big)$$

• Can also consider the potential contribution Z (v|A) even when v ∉ A.


Partial Significance

• Standard score of t (v|A) w.r.t. the randomness hypothesis:

$$Z(v|A) = \sqrt{|S|-1}\; R\big(A,\, Q(v,|A|)\big)$$

• The significance of A can be expressed in terms of these partial significances:

$$Z(A) = \frac{1}{\sqrt{|A|}}\sum_{v \in A} Z(v|A)$$


Set Reshaping (1)

• Idea: modify A so as to boost its significance.
• A new set A′ has average mutual correlation to A:

$$SR(A'|A) = \frac{1}{|A'|}\sum_{v \in A'} R\big(A,\, Q(v,|A|)\big)$$

• Significance of SR (A′|A) w.r.t. the randomness hypothesis:

$$Z(A'|A) = \frac{1}{\sqrt{|A'|}}\sum_{v \in A'} Z(v|A)$$


Set Reshaping (2)

• For a fixed size |A′|, we can maximize Z (A′|A) by taking the items with the largest values of Z (v|A).
• For this example:
  – Z (A|A) = Z (A) = 36.29.
  – Z (A′|A) = 37.18.
  – The maximum is achieved at A′.

• A serves as a pattern for the discovery of A′.
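
Reshaping is then a one-pass ranking: score every candidate item by Z (v|A) and keep the top |A′|. A sketch; the candidate pool `items` (e.g. the whole data set) and the parameter names are assumptions:

```python
from math import sqrt

def reshape(A, items, Q, n, new_size):
    """Return the size-new_size set A' maximizing Z(A'|A); n = |S|.

    For equal-size sets, R(A, B) = (n / (n - k)) * (|A ∩ B| / k - k / n),
    so ranking items by Z(v|A) amounts to ranking them by overlap with A.
    """
    k = len(A)
    def z(v):   # Z(v|A) = sqrt(n - 1) * R(A, Q(v, k))
        return sqrt(n - 1) * (n / (n - k)) * (len(A & Q(v, k)) / k - k / n)
    return set(sorted(items, key=z, reverse=True)[:new_size])
```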


Partitioning

• What if we need to assign item v to a single group?

• Want a group C for which both:
  – the set significance is high; and
  – the significance of the relationship with v is high.

• In practice, can choose the group C (reshaped from pattern C*) satisfying:

$$\underset{C^{*}}{\text{maximize}}\;\; Z(v\,|\,C^{*}) \cdot Z(C\,|\,C^{*})$$
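
A sketch of this assignment rule, assuming the clusters are kept as (C, C*) pairs coming out of the reshaping step; the helper names are illustrative:

```python
from math import sqrt

def z_member(v, P, Q, n):
    """Z(v|P) = sqrt(n - 1) * R(P, Q(v, |P|)) for equal-size sets; n = |S|."""
    k = len(P)
    return sqrt(n - 1) * (n / (n - k)) * (len(P & Q(v, k)) / k - k / n)

def z_reshaped(C, P, Q, n):
    """Z(C|P) = (1 / sqrt(|C|)) * sum of Z(v|P) over v in C."""
    return sum(z_member(v, P, Q, n) for v in C) / sqrt(len(C))

def assign(v, clusters, Q, n):
    """clusters: list of (C, C_star) pairs; pick the single best home for v."""
    return max(clusters,
               key=lambda cp: z_member(v, cp[1], Q, n) * z_reshaped(cp[0], cp[1], Q, n))[0]
```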




Cluster Map Generation

• Nodes are sets having sufficiently-high significance scores.

• Edges appear between set nodes having sufficiently-high inter-set significances.

• Retained cluster candidates should not be too highly correlated with other retained candidates.

• The final clusters form an “independent set” within an initial candidate cluster map.


Candidate Map

• Nodes meeting a minimum threshold on set significance.
• Edges meeting a minimum threshold on inter-set significance (correlation).


• Nodes ranked by significance (red is highest).
• Thick edges join nodes whose set correlations are too high.


• Need an independent node set within the thick-edged subgraph.
• Heuristic: greedy by significance.




Cluster Map

• The final cluster map is typically disconnected.
• Cluster nodes form a rough cover of the data set.
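
The greedy independent-set step, sketched as code. The interface here (significance-scored candidate sets plus a set-correlation function) is an assumption for illustration; the 0.5 correlation cutoff matches the experimental settings quoted later.

```python
def greedy_select(candidates, corr, max_corr=0.5):
    """Keep a candidate only if it is not too correlated with any kept cluster.

    candidates: (significance, cluster_set) pairs; corr(A, B): set correlation.
    """
    kept = []
    for _, c in sorted(candidates, key=lambda sc: -sc[0]):  # most significant first
        if all(corr(c, other) < max_corr for other in kept):
            kept.append(c)
    return kept
```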


GreedyRSC

[Flowchart: DB → queries → sample relevant sets → candidate patterns → pattern pruning → cluster generation → cluster pruning → member assignment → cluster map generation.]

Scalability Issues

• Problems:
  – Curse of dimensionality: computing the queries Q(q, k) can take time linear in |S| even when k is small.
  – Computing Z (A) takes time quadratic in |A|.

• Workarounds:
  – Approximate neighborhoods via the SASH search structure [H. '03, H. & Sakuma '05].
  – Pattern generation over samples of varying sizes, with a fixed range for |A|.


Sampling Strategy

• Create bands of samples of sizes |S|, |S|/2, |S|/4, ….
• Within each sample, compute a candidate patch for each element via maximization of set significance.
  – Fixed range of patch sizes a < k < b.
• Within each sample, select patterns greedily and eliminate duplicates.
• Reshape the patterns to form cluster candidates.
• Eliminate duplicate candidates to form the final cluster set.
• Generate the cluster map edges.
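
In outline, the band sampling and the per-element pattern search look roughly as follows. This is a sketch under simplifying assumptions: duplicate elimination is exact matching only, whereas GreedyRSC also prunes near-duplicates by correlation.

```python
import random
from math import sqrt

def z_score(A, Q, n):
    """Intra-set significance Z(A) = sqrt(|A| * (n - 1)) * SR(A); n = |S|."""
    k = len(A)
    sc = sum(len(A & Q(v, k)) for v in A) / k**2
    return sqrt(k * (n - 1)) * (n * sc - k) / (n - k)

def best_pattern(v, Q, n, a, b):
    """The relevant set Q(v, k), a < k < b, maximizing set significance."""
    return max((frozenset(Q(v, k)) for k in range(a + 1, b)),
               key=lambda A: z_score(A, Q, n))

def generate_patterns(S, Q, a=8, b=64, seed=0):
    """Pattern generation over sample bands of size |S|, |S|/2, |S|/4, ..."""
    rng, n = random.Random(seed), len(S)
    sample, patterns = list(S), set()
    while len(sample) >= b:
        patterns.update(best_pattern(v, Q, n, a, b) for v in sample)
        sample = rng.sample(sample, len(sample) // 2)   # next, smaller band
    return patterns
```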


Overall Time Complexity

Without data partitioning:
• Precompute the relevant sets: O(n log n) queries, where n = |S|.
• Compute a pattern for each of O(n log n) neighborhoods: O(b²) time each.
• Compute all patterns: O(b² n log n) time.
• Eliminate duplicate patterns: O((b² + σ²) n log n), where σ² is the average variance of the inverted member list sizes over each sample.
• Form candidate clusters: bounded by O(b n log² n).
• Eliminate duplicate clusters & create the map: O((b² + τ²) n log n), where τ² is the average variance of the inverted member list sizes over each sample.
• Total, excluding the relevant set computation: O((b² + σ² + τ²) n log n + b n log² n).

Data partitioning (details omitted!) introduces a factor of c, the number of data chunks.


Clustering Parameters

• For comparison of significance values, the significance scores can be normalized for convenience.
  – A common factor dependent on |S| can be dropped.
  – Normalizing the inter-set significance → set correlation.
  – Normalizing the square of the set significance: use

$$0 \le \frac{Z^2(A)}{|S|-1} = |A|\cdot SR^2(A) \le |A|$$

• For all experiments:
  – Minimum normalized squared set significance = 4.
  – Maximum normalized inter-set significance = 0.5.
  – Minimum normalized inter-set significance = 0.1 (for cluster map edges).

Images

• Amsterdam Library of Object Images (ALOI):
  – Dense feature vectors, colour & texture histograms (prepared by INRIA-Rocquencourt).
  – Number of vectors: 110,250.
  – 641 features per vector.
  – 5322 clusters computed in < 4 hours on a desktop (older, slower implementation).
  – Maximum cluster size: 7201.
  – Median cluster size: 12.
  – Minimum cluster size: 4.
  – SASH accuracy of ~96%.


Journal Abstracts

• Medline medical journal abstracts, 1996 to mid-2003:
  – Vectors with TF-IDF term weighting, NO dimensional reduction (prepared by IBM TRL).
  – Number of vectors: 1,055,073.
  – ~75 non-zero attributes per vector.
  – Representational dimension: 1,101,003.
  – 9789 clusters computed in < 24 hours on a 3.0 GHz desktop.
  – Maximum cluster size: 15,255.
  – Median cluster size: 45.
  – Minimum cluster size: 4.
  – SASH accuracy of 51%, 115× faster than sequential search.


Protein Sequences

• Bacterial ORFs (protein sequences):
  – Vectors of gapped BLAST scores with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003) (prepared by the DNA Data Bank of Japan).
  – Number of vectors: 378,659.
  – ~125 non-zero attributes per vector.
  – Representational dimension: 40,000.
  – Vector preparation: < 1 day on a 16-node PC cluster.
  – 8907 clusters computed in 7 hours on a 3.0 GHz desktop.
  – Maximum cluster size: 69,859.
  – Median cluster size: 20.
  – Minimum cluster size: 4.
  – SASH accuracy of 75%, 34× faster than sequential search.

Demos





Query Result Clustering

• “Pure” shared-neighbor clustering can be used to cluster the results of queries.
  – Produce long ranked query result lists.
  – The ranking function can be hidden.
  – The database must support queries-by-example.
  – Otherwise, it can be performed without the cooperation of the database manager.

• Example at NII:
  – WEBCAT Plus library database.
  – GETA search engine.
  – WEBCAT Plus QRC tool currently under development (with N. Grira).

Query Result Clustering

• Adapting RSC for QRC:
  – The database size may not be known.
  – The self-correlation and significance formulas are then undefined.
  – Approximation: assume an infinite database size.
• As |S| tends to infinity, SR (A) tends to SC (A), and the normalized squared significance tends to:

$$\frac{Z^2(A)}{|S|-1} = |A|\cdot SR^2(A) \;\to\; |A|\cdot SC^2(A)$$

• This formula does not depend on |S|.
• Computation as per the GreedyRSC heuristic.




Outlier Detection

• RSC model:
  – Patterns of low significance can indicate the presence of outliers.
  – These can be detected as per the initial stages of GreedyRSC.
  – Many potential definitions are possible.

• Some possible formulations (to be minimized; v_i denotes the i-th ranked item of Q(v, k)):

$$\frac{Z\big(Q(v,k)\big)}{\displaystyle\max_{1 \le i \le k} Z\big(Q(v_i,k)\big)} \qquad\qquad \frac{SR\big(Q(v,k)\big)}{\displaystyle\max_{1 \le i \le k} SR\big(Q(v_i,k)\big)}$$

• Work in progress with M. Gebski, NICTA.


Feature Set Evaluation

• Feature selection methods:
  – To our knowledge, current feature selection techniques have been developed only for supervised learning.
  – A training set is needed to guide the process.
  – Work with N. Grira: unsupervised feature set evaluation and selection.

• Assumptions:
  – Similarity measure and candidate features unknown.
  – Assess the effectiveness of the features & similarity measure for search and clustering.

Good Feature Sets

• For any given ‘true’ cluster C:
  – For any item v in C, any query based at v should ideally rank the items of C ahead of any other items.
  – Two-set classification.
  – Best result – when there exists a partition of the data set into clusters such that the two-set classification property holds.


Distinctiveness

• Distinctiveness criterion:
  – A variant of self-correlation.
  – External items that are well-correlated are penalized.
  – Equals 1 if A is perfectly associated (SR (A) = 1) and the external points are uncorrelated with A.

$$DR(A) = \frac{1}{|A|}\sum_{v \in A} R\big(A,\, Q(v,|A|)\big) \;-\; \frac{1}{|S|-|A|}\sum_{v \notin A} R\big(A,\, Q(v,|A|)\big) \;=\; SR(A) \;-\; \frac{1}{|S|-|A|}\sum_{v \notin A} R\big(A,\, Q(v,|A|)\big)$$
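
Directly from the definition, a sketch (S is the full item set; the sum over all external items is exactly what makes DR expensive in practice):

```python
def distinctiveness(A, S, Q):
    """DR(A): average internal correlation minus average external correlation."""
    n, k = len(S), len(A)
    def r(v):   # R(A, Q(v, |A|)) for equal-size sets
        return (n / (n - k)) * (len(A & Q(v, k)) / k - k / n)
    inside = sum(r(v) for v in A) / k
    outside = sum(r(v) for v in S - A) / (n - k)
    return inside - outside
```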


Significance

• Significance can be derived as under RSC:
  – Randomness hypothesis.

$$Z_{DR}(A) = \sqrt{\frac{|A|\,(|S|-|A|)(|S|-1)}{|S|}}\; DR(A)$$

• Unlike the self-correlation significance, the distinctiveness significance tends to 0 as |A| tends to |S|.
• Distinctiveness is expensive to compute in practice.
  – Can use SR (A) to approximate DR (A).


Feature Set Evaluation

• Criterion for feature set selection:
  – For each item q, estimate the most significant relevant set based at q.
  – Average the self-correlations of the most significant relevant sets identified.
  – Serves as the basis for search, e.g. local improvement methods such as Tabu Search.

$$\text{maximize}\;\; \frac{1}{|S|}\sum_{q \in S} SR\big(Q(q, k_q)\big), \qquad k_q = \underset{1 \le k \le |S|}{\mathrm{argmax}}\; DR\big(Q(q,k)\big)$$
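
A sketch of the criterion, reusing the distinctiveness() and self-correlation sketches above; capping the argmax search at an assumed max_k keeps it tractable, since scanning every k up to |S| is impractical:

```python
def feature_set_score(S, Q, max_k=100):
    """Average SR(Q(q, k_q)) over q, with k_q = argmax over k of DR(Q(q, k))."""
    n, total = len(S), 0.0
    for q in S:
        k_q = max(range(2, max_k + 1),
                  key=lambda k: distinctiveness(Q(q, k), S, Q))
        A = Q(q, k_q)
        sc = sum(len(A & Q(v, k_q)) for v in A) / k_q**2
        total += (n * sc - k_q) / (n - k_q)   # SR(Q(q, k_q))
    return total / n
```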


Feature Set Evaluation

$$\text{maximize}\;\; \frac{1}{|S|}\sum_{q \in S} SR\big(Q(q, k_q)\big), \qquad k_q = \underset{1 \le k \le |S|}{\mathrm{argmax}}\; DR\big(Q(q,k)\big)$$

• Properties:
  – If all relevant sets are randomly generated, the criterion is 0.
  – If all relevant sets are identical, the criterion is 0.
  – If two-set classification holds for all q, the criterion is 1 (the maximum possible).
  – Conjectured: for any disjoint partition of the data into clusters of at least some constant size, there exists a set of rankings for which the maximum of 1 is achieved.


Background: Protein Sequence Analysis

• Applications of clustering:
  – Classification of sequences of unknown functionality with respect to clusters of sequences of known functionality.
  – Discovery of new motifs from clusters of sequences of previously unknown function.

• Problems:
  – Single-linkage (agglomerative) clustering techniques produce clusters with poor internal association.
  – Traditional clustering techniques do not scale well to large set sizes.
  – Protein sequence data is “inherently” high-dimensional.

Protein Sequence Similarity

• Pairwise (gapped) sequence alignment scoring: the BLAST heuristic [Altschul et al. '97].
  – Bonus for matching and near-matching symbols.
  – Penalty for non-matching symbols.
  – Penalty for gaps (increases with gap length).
  – Dynamic programming (expensive!).
  – Faster heuristics exist.

MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60
M M++K+L+PTDFSE A A++ + ++ EVILLHVIDE +++ L+ G +
MIFMFRKVLFPTDFSEGAYRAVEVFEKRNKMEVGEVILLHVIDEGTLEE-----LMDGYS 55


Alignment-based Similarity

• Problem: direct use of BLAST scores fails!
  – Not transitive, since alignments are incomplete.
  – A SASH index for approximate neighbourhood computation achieves a poor accuracy-versus-time trade-off.
  – Sequential search would work, but is too expensive.

• Example: pairs (A,B) and (B,C) have high BLAST scores, but pair (A,C) has a score of zero.


Reference Set Similarity

• Solution (with Å. J. Västermark):
  – Vectors of gapped-alignment BLAST scores with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003).
  – Conversion of BLAST scores to E-values.
  – Vector sparsification via thresholding to zero.
  – Vector angle distance metric for neighbourhood computation.

[Figure: the full set of sequences is scored against the sample sequences, yielding a matrix of BLAST E-values.]
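
A sketch of the vectorization and the distance. The E-value-to-score transform and the threshold value are assumed, illustrative choices, not the exact pipeline used here:

```python
import numpy as np

def reference_vectors(evalues, threshold=1e-3):
    """Sparsify an (n_sequences, n_reference) matrix of BLAST E-values.

    Smaller E-values mean stronger matches, so scores are taken as -log E;
    weak matches are thresholded to zero for sparsity.
    """
    scores = -np.log(np.maximum(evalues, 1e-300))
    scores[scores < -np.log(threshold)] = 0.0
    return scores

def angle_distance(u, v):
    """Vector angle distance, a proper metric for neighbourhood computation."""
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```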


Analogy to Text

• Each sequence is analogous to a document.
• The reference sequences are analogous to terms.
• Sparse vectorization:
  – Significant BLAST scores are analogous to the terms appearing in a document.
  – Strong BLAST scores are analogous to the dominant terms of a document (as per TF-IDF weighting).

[Figure: as before – the matrix of BLAST E-values of the full sequence set against the sample sequences.]


