
Clustering Gene Expression Data

Part 2

CS 838
www.cs.wisc.edu/~craven/cs838.html
Mark Craven
craven@biostat.wisc.edu
April 2001

Announcements
• reading for next week
– Friedman et al., Journal of Computational
Biology 2000
– Brazma et al., Genome Research 1998
– Craven et al., ISMB 2000

The Course Project
• implement an algorithm or two
• experiment with it on some real data
• milestones
– description of the basic area (due 4/16)
• the algorithm(s) you will be investigating
• the data set(s) you will be using
• 1-3 hypotheses
– description of your experiments (due 4/23)
• how you will test your hypotheses
• data to be used
• what will be varied
• methodology
– final write-up (due 5/16)
• 8-10 pages similar in format to a CS conference paper
• prototypical organization: introduction, description of
methods, description of experiments, discussion of results

Non-Hierarchical Clustering

K-Means Clustering
• assume our objects are represented by vectors of
real values
• put k cluster centers in same space as objects
• now iteratively move cluster centers
[figure: objects plotted as points, with the k cluster centers drawn as "+" marks in the same space]

K-Means Clustering
• each iteration involves two steps
– assignment of objects to clusters
– re-computation of the means

[figure: one iteration; left, objects assigned to their nearest centers; right, centers re-computed as cluster means]

K-Means Clustering
given: a set $X = \{\vec{x}_1, \ldots, \vec{x}_n\}$ of objects

select $k$ initial cluster centers $\vec{f}_1, \ldots, \vec{f}_k$

while stopping criterion not true do
    for all clusters $c_j$ do
        $c_j = \{\, \vec{x}_i \mid \forall \vec{f}_l : \mathrm{sim}(\vec{x}_i, \vec{f}_j) \geq \mathrm{sim}(\vec{x}_i, \vec{f}_l) \,\}$
    for all means $\vec{f}_j$ do
        $\vec{f}_j = \mu(c_j)$
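A minimal Python/NumPy sketch of this loop, with Euclidean distance standing in for sim (so "nearest center" replaces "highest similarity") and a fixed iteration count as the stopping criterion; both are assumptions the slide leaves open:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Hard k-means. X: (n, d) array of object vectors."""
    rng = np.random.default_rng(seed)
    # select k initial cluster centers: here, k distinct random objects
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # assignment: each object joins the cluster of its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re-computation: each center moves to the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers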

K-Means Clustering

• in k-means as just described, objects are assigned
to one and only one cluster
• can do “soft” k-means clustering via EM
– each cluster represented by a normal
distribution
– E step: determine how likely it is that each
cluster “generated” each object
– M step: move cluster centers to maximize the
likelihood of the objects
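A sketch of the soft variant, under the simplifying assumptions of spherical Gaussians with a fixed shared variance and equal cluster priors (so, as on the slide, only the means are re-estimated):

import numpy as np

def soft_kmeans(X, k, n_iters=50, var=1.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # E step: responsibility of each cluster for each object,
        # proportional to the Gaussian density at that object
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        resp = np.exp(-(sq - sq.min(axis=1, keepdims=True)) / (2.0 * var))
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: move each center to the responsibility-weighted mean,
        # which maximizes the likelihood of the objects given the means
        centers = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return resp, centers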

The CLICK Algorithm
• Sharan & Shamir, ISMB 2000
• objects to be clustered (e.g. genes) represented as
vertices in a graph
• weighted, undirected edges represent similarity of
objects

[figure: example similarity graph; vertices are objects, edges carry similarity weights (1, 5, 4, 1, 6)]

CLICK: How Do We Get Graph?


• assume pairwise similarity values are normally
distributed:
$N(\mu_T, \sigma_T^2)$ for mates (objects in the same
“true” cluster)
$N(\mu_F, \sigma_F^2)$ for non-mates
• estimate the parameters of these distributions, and
Pr(mates) (the probability that two randomly chosen
objects are mates), from the data

CLICK: How Do We Get Graph?
• let $f(S_{ij} \mid i, j \text{ are mates})$ be the probability
density function for similarity values when $i$ and $j$
are mates
• then set the weight of an edge by:

$w_{ij} = \log \dfrac{\Pr(\text{mates}) \, f(S_{ij} \mid i, j \text{ are mates})}{(1 - \Pr(\text{mates})) \, f(S_{ij} \mid i, j \text{ are non-mates})}$

• prune edges with weights $< t$, a specified
non-negative threshold
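A sketch of this weighting step, assuming the distribution parameters and Pr(mates) have already been estimated from the data (scipy's normal density stands in for $f$):

import numpy as np
from scipy.stats import norm

def edge_weight(s_ij, p_mates, mu_T, sigma_T, mu_F, sigma_F):
    """Log-odds weight for an edge whose similarity value is s_ij."""
    num = p_mates * norm.pdf(s_ij, loc=mu_T, scale=sigma_T)
    den = (1.0 - p_mates) * norm.pdf(s_ij, loc=mu_F, scale=sigma_F)
    return float(np.log(num / den))

# pruning: keep an edge (i, j) only if edge_weight(...) >= t for the
# chosen non-negative threshold t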

The Basic CLICK Algorithm


BasicCLICK(G):
    if $V(G) = \{v\}$ then          /* does graph have just one vertex? */
        move $v$ to singleton set $R$
    else if $G$ is a kernel then    /* does graph satisfy stopping criterion? */
        return $V(G)$
    else                            /* partition graph, call recursively */
        $(H, \bar{H}) \leftarrow$ MinWeightCut($G$)
        BasicCLICK($H$)
        BasicCLICK($\bar{H}$)
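A recursive sketch in Python using networkx, with Stoer-Wagner standing in for the minimum weight cut routine and is_kernel supplied by the caller (a sketch of the kernel test appears with the kernel slides below); a disconnected input would first be split into connected components, omitted here:

import networkx as nx

def basic_click(G, singletons, kernels, is_kernel):
    """G: weighted, undirected nx.Graph over the objects to be clustered."""
    if G.number_of_nodes() == 1:
        singletons.update(G.nodes)        # move v to the singleton set R
    elif is_kernel(G):
        kernels.append(set(G.nodes))      # stopping criterion satisfied
    else:
        # partition along a minimum weight cut and recurse on both sides
        _, (H, H_bar) = nx.stoer_wagner(G, weight='weight')
        basic_click(G.subgraph(H).copy(), singletons, kernels, is_kernel)
        basic_click(G.subgraph(H_bar).copy(), singletons, kernels, is_kernel)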

Minimum Weight Cuts
• a cut of a graph is a subset of edges whose
removal disconnects the graph
• a minimum weight cut is the cut with the smallest
sum of edge weights
• can be found efficiently
[figure: the example similarity graph again, with a minimum weight cut separating it into two components]
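For example, networkx's Stoer-Wagner implementation finds one in polynomial time (the graph below is made up for illustration, reusing the figure's edge weights):

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ('a', 'b', 5), ('a', 'c', 1), ('b', 'c', 4),
    ('b', 'd', 1), ('c', 'd', 6),
])
cut_value, (side1, side2) = nx.stoer_wagner(G, weight='weight')
print(cut_value, side1, side2)  # total cut weight and the two vertex sets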

Deciding When a Subgraph Represents a Kernel

• we can test a cut $C$ against two hypotheses:

$H_0^C$: $C$ contains only edges between non-mates
$H_1^C$: $C$ contains only edges between mates

• we can then score $C$ by:

$\log \dfrac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)}$

Deciding When a Subgraph Represents a Kernel

• if we assume a complete graph, the minimum
weight cut algorithm finds a cut that minimizes
this ratio, i.e.

$\mathrm{weight}(C) = \log \dfrac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)}$

• thus, we accept $H_1^C$ and call $G$ a kernel iff
$\mathrm{weight}(C) > 0$

Deciding When a Subgraph Represents a Kernel

• but we don't have a complete graph
• we call $G$ a kernel iff $\mathrm{weight}(C) + \mathrm{weight}'(C) > 0$,
where $\mathrm{weight}'(C)$ approximates the contribution
of the missing edges
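Filling in the is_kernel placeholder from the BasicCLICK sketch; weight_prime is an assumed helper approximating the missing-edge term for a given cut, not a routine from the paper:

import networkx as nx

def make_is_kernel(weight_prime):
    """weight_prime(G, s1, s2): assumed estimate of the missing edges'
    contribution across the cut (s1, s2)."""
    def is_kernel(G):
        if G.number_of_nodes() < 2:
            return True                   # nothing left to cut
        cut_value, (s1, s2) = nx.stoer_wagner(G, weight='weight')
        return cut_value + weight_prime(G, s1, s2) > 0
    return is_kernel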

The Full CLICK Algorithm

• the basic CLICK algorithm produces kernels of
clusters
• add two more operations:
– adoption: find singletons that are similar to a
kernel, and hence can be adopted by it
– merge: merge similar clusters

The Full CLICK Algorithm


CLICK($G_N$):
    $R \leftarrow N$
    while some change occurs do
        BasicCLICK($G_R$)
        let $L$ be the set of kernels produced
        let $R$ be the set of singletons produced
        Adoption($L$, $R$)
    Merge($L$)
    Adoption($R$)
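A high-level sketch of this outer loop; adopt and merge are placeholders for the paper's similarity-based adoption and merging steps, with adopt assumed to return whether any singleton moved:

def click(G, is_kernel, adopt, merge):
    """G: similarity graph over the full object set N."""
    R = set(G.nodes)                   # everything starts as a singleton
    kernels = []
    changed = True
    while changed and R:
        singletons = set()
        basic_click(G.subgraph(R).copy(), singletons, kernels, is_kernel)
        changed = adopt(kernels, singletons)  # similar singletons join kernels
        R = singletons
    merge(kernels)                     # merge similar clusters
    adopt(kernels, R)                  # final adoption pass on the leftovers
    return kernels, R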

CLICK Experiment:
Fibroblast Serum Response Data
[table 2 from Sharan & Shamir, ISMB 2000]

Evaluating Clustering Results


• given random data without any “structure”,
clustering algorithms will still return clusters
• the gold standard: do clusters correspond to
natural categories?
• do clusters correspond to categories we care
about? (there are lots of ways to partition the world)
• how probable does held-aside data look?
• how well does the clustering algorithm optimize
homogeneity and separation?

Measuring Homogeneity
• average similarity of objects to their clusters:

$H_{ave} = \dfrac{1}{|N|} \sum_{u \in N} \mathrm{sim}(F(u), F(\mathrm{cluster}(u)))$

• minimum similarity of an object to its cluster:

$H_{min} = \min_{u \in N} \mathrm{sim}(F(u), F(\mathrm{cluster}(u)))$
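A sketch of both measures, taking cosine similarity as sim and the mean vector of a cluster as its fingerprint $F$ (both assumptions; the slide leaves them abstract):

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def homogeneity(X, labels):
    """X: (n, d) array of object fingerprints; labels: (n,) cluster ids."""
    centers = {j: X[labels == j].mean(axis=0) for j in np.unique(labels)}
    sims = [cosine(x, centers[j]) for x, j in zip(X, labels)]
    return float(np.mean(sims)), float(np.min(sims))  # H_ave, H_min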

Measuring Separation
• average separation of pairs of clusters:

$S_{ave} = \dfrac{1}{\sum_{i \neq j} |X_i| |X_j|} \sum_{i \neq j} |X_i| |X_j| \, \mathrm{sim}(F(X_i), F(X_j))$

• maximum separation of a pair of clusters:

$S_{max} = \max_{i \neq j} \mathrm{sim}(F(X_i), F(X_j))$

• note that under these definitions, low separation is
good!
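The matching separation measures, with the same assumed sim and $F$ as in the homogeneity sketch:

import numpy as np
from itertools import combinations

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def separation(X, labels):
    ids = list(np.unique(labels))
    centers = {j: X[labels == j].mean(axis=0) for j in ids}
    sizes = {j: int((labels == j).sum()) for j in ids}
    num = den = 0.0
    s_max = -np.inf
    for i, j in combinations(ids, 2):   # each unordered pair once; the
        w = sizes[i] * sizes[j]         # factor of 2 cancels in S_ave
        s = cosine(centers[i], centers[j])
        num += w * s
        den += w
        s_max = max(s_max, s)
    return num / den, float(s_max)      # S_ave, S_max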

CLICK Experiment:
Fibroblast Serum Response Data

table from: Sharan & Shamir, ISMB 2000
