
Clustering Gene Expression Data

Part 2

CS 838
www.cs.wisc.edu/~craven/cs838.html
Mark Craven
craven@biostat.wisc.edu
April 2001

Announcements
• reading for next week
– Friedman et al., Journal of Computational
Biology 2000
– Brazma et al., Genome Research 1998
– Craven et al., ISMB 2000

The Course Project
• implement an algorithm or two
• experiment with it on some real data
• milestones
– description of the basic area (due 4/16)
• the algorithm(s) you will be investigating
• the data set(s) you will be using
• 1-3 hypotheses
– description of your experiments (due 4/23)
• how you will test your hypotheses
• data to be used
• what will be varied
• methodology
– final write-up (due 5/16)
• 8-10 pages similar in format to a CS conference paper
• prototypical organization: introduction, description of
methods, description of experiments, discussion of results

Non-Hierarchical Clustering

K-Means Clustering
• assume our objects are represented by vectors of
real values
• put k cluster centers in same space as objects
• now iteratively move cluster centers
[figure: objects plotted as points, with the k cluster centers drawn as "+" marks in the same space]

K-Means Clustering
• each iteration involves two steps
– assignment of objects to clusters
– re-computation of the means

[figure: one iteration; left, objects assigned to their nearest centers; right, centers re-computed as cluster means]

K-Means Clustering
given: a set $X = \{\vec{x}_1, \ldots, \vec{x}_n\}$ of objects

select $k$ initial cluster centers $\vec{f}_1, \ldots, \vec{f}_k$

while stopping criterion not true do
    for all clusters $c_j$ do
        $c_j = \{\, \vec{x}_i \mid \forall \vec{f}_l : \mathrm{sim}(\vec{x}_i, \vec{f}_j) \geq \mathrm{sim}(\vec{x}_i, \vec{f}_l) \,\}$
    for all means $\vec{f}_j$ do
        $\vec{f}_j = \mu(c_j)$
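A minimal Python/NumPy sketch of this loop, with Euclidean distance standing in for sim (so "nearest center" replaces "highest similarity") and a fixed iteration count as the stopping criterion; both are assumptions the slide leaves open:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Hard k-means. X: (n, d) array of object vectors."""
    rng = np.random.default_rng(seed)
    # select k initial cluster centers: here, k distinct random objects
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # assignment: each object joins the cluster of its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re-computation: each center moves to the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers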

K-Means Clustering

• in k-means as just described, objects are assigned
to one and only one cluster
• can do “soft” k-means clustering via EM
– each cluster represented by a normal
distribution
– E step: determine how likely it is that each
cluster “generated” each object
– M step: move cluster centers to maximize the
likelihood of the objects
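A sketch of the soft variant, under the simplifying assumptions of spherical Gaussians with a fixed shared variance and equal cluster priors (so, as on the slide, only the means are re-estimated):

import numpy as np

def soft_kmeans(X, k, n_iters=50, var=1.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # E step: responsibility of each cluster for each object,
        # proportional to the Gaussian density at that object
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        resp = np.exp(-(sq - sq.min(axis=1, keepdims=True)) / (2.0 * var))
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: move each center to the responsibility-weighted mean,
        # which maximizes the likelihood of the objects given the means
        centers = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return resp, centers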

The CLICK Algorithm
• Sharan & Shamir, ISMB 2000
• objects to be clustered (e.g. genes) represented as
vertices in a graph
• weighted, undirected edges represent similarity of
objects

[figure: example similarity graph; vertices are objects, edges carry similarity weights (1, 5, 4, 1, 6)]

CLICK: How Do We Get Graph?


• assume pairwise similarity values are normally
distributed:
$N(\mu_T, \sigma_T^2)$ for mates (objects in the same
“true” cluster)
$N(\mu_F, \sigma_F^2)$ for non-mates
• estimate the parameters of these distributions, and
Pr(mates) (the probability that two randomly chosen
objects are mates), from the data

CLICK: How Do We Get Graph?
• let $f(S_{ij} \mid i, j \text{ are mates})$ be the probability
density function for similarity values when $i$ and $j$
are mates
• then set the weight of an edge by:

$w_{ij} = \log \dfrac{\Pr(\text{mates}) \, f(S_{ij} \mid i, j \text{ are mates})}{(1 - \Pr(\text{mates})) \, f(S_{ij} \mid i, j \text{ are non-mates})}$

• prune edges with weights $< t$, a specified
non-negative threshold
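A sketch of this weighting step, assuming the distribution parameters and Pr(mates) have already been estimated from the data (scipy's normal density stands in for $f$):

import numpy as np
from scipy.stats import norm

def edge_weight(s_ij, p_mates, mu_T, sigma_T, mu_F, sigma_F):
    """Log-odds weight for an edge whose similarity value is s_ij."""
    num = p_mates * norm.pdf(s_ij, loc=mu_T, scale=sigma_T)
    den = (1.0 - p_mates) * norm.pdf(s_ij, loc=mu_F, scale=sigma_F)
    return float(np.log(num / den))

# pruning: keep an edge (i, j) only if edge_weight(...) >= t for the
# chosen non-negative threshold t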

The Basic CLICK Algorithm


BasicCLICK(G):
    if $V(G) = \{v\}$ then          /* does graph have just one vertex? */
        move $v$ to singleton set $R$
    else if $G$ is a kernel then    /* does graph satisfy stopping criterion? */
        return $V(G)$
    else                            /* partition graph, call recursively */
        $(H, \bar{H}) \leftarrow$ MinWeightCut($G$)
        BasicCLICK($H$)
        BasicCLICK($\bar{H}$)
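A recursive sketch in Python using networkx, with Stoer-Wagner standing in for the minimum weight cut routine and is_kernel supplied by the caller (a sketch of the kernel test appears with the kernel slides below); a disconnected input would first be split into connected components, omitted here:

import networkx as nx

def basic_click(G, singletons, kernels, is_kernel):
    """G: weighted, undirected nx.Graph over the objects to be clustered."""
    if G.number_of_nodes() == 1:
        singletons.update(G.nodes)        # move v to the singleton set R
    elif is_kernel(G):
        kernels.append(set(G.nodes))      # stopping criterion satisfied
    else:
        # partition along a minimum weight cut and recurse on both sides
        _, (H, H_bar) = nx.stoer_wagner(G, weight='weight')
        basic_click(G.subgraph(H).copy(), singletons, kernels, is_kernel)
        basic_click(G.subgraph(H_bar).copy(), singletons, kernels, is_kernel)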

Minimum Weight Cuts
• a cut of a graph is a subset of edges whose
removal disconnects the graph
• a minimum weight cut is the cut with the smallest
sum of edge weights
• can be found efficiently
[figure: the example similarity graph again, with a minimum weight cut separating it into two components]
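For example, networkx's Stoer-Wagner implementation finds one in polynomial time (the graph below is made up for illustration, reusing the figure's edge weights):

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ('a', 'b', 5), ('a', 'c', 1), ('b', 'c', 4),
    ('b', 'd', 1), ('c', 'd', 6),
])
cut_value, (side1, side2) = nx.stoer_wagner(G, weight='weight')
print(cut_value, side1, side2)  # total cut weight and the two vertex sets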

Deciding When a Subgraph Represents a Kernel

• we can test a cut $C$ against two hypotheses:

$H_0^C$: $C$ contains only edges between non-mates
$H_1^C$: $C$ contains only edges between mates

• we can then score $C$ by:

$\log \dfrac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)}$

Deciding When a Subgraph Represents a Kernel

• if we assume a complete graph, the minimum
weight cut algorithm finds a cut that minimizes
this ratio, i.e.

$\mathrm{weight}(C) = \log \dfrac{\Pr(H_1^C \mid C)}{\Pr(H_0^C \mid C)}$

• thus, we accept $H_1^C$ and call $G$ a kernel iff
$\mathrm{weight}(C) > 0$

Deciding When a Subgraph Represents a Kernel

• but we don't have a complete graph
• we call $G$ a kernel iff $\mathrm{weight}(C) + \mathrm{weight}'(C) > 0$,
where $\mathrm{weight}'(C)$ approximates the contribution
of the missing edges
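Filling in the is_kernel placeholder from the BasicCLICK sketch; weight_prime is an assumed helper approximating the missing-edge term for a given cut, not a routine from the paper:

import networkx as nx

def make_is_kernel(weight_prime):
    """weight_prime(G, s1, s2): assumed estimate of the missing edges'
    contribution across the cut (s1, s2)."""
    def is_kernel(G):
        if G.number_of_nodes() < 2:
            return True                   # nothing left to cut
        cut_value, (s1, s2) = nx.stoer_wagner(G, weight='weight')
        return cut_value + weight_prime(G, s1, s2) > 0
    return is_kernel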

The Full CLICK Algorithm

• the basic CLICK algorithm produces kernels of
clusters
• add two more operations:
– adoption: find singletons that are similar to a
kernel, and hence can be adopted by it
– merge: merge similar clusters

The Full CLICK Algorithm


CLICK($G_N$):
    $R \leftarrow N$
    while some change occurs do
        BasicCLICK($G_R$)
        let $L$ be the set of kernels produced
        let $R$ be the set of singletons produced
        Adoption($L$, $R$)
    Merge($L$)
    Adoption($R$)
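A high-level sketch of this outer loop; adopt and merge are placeholders for the paper's similarity-based adoption and merging steps, with adopt assumed to return whether any singleton moved:

def click(G, is_kernel, adopt, merge):
    """G: similarity graph over the full object set N."""
    R = set(G.nodes)                   # everything starts as a singleton
    kernels = []
    changed = True
    while changed and R:
        singletons = set()
        basic_click(G.subgraph(R).copy(), singletons, kernels, is_kernel)
        changed = adopt(kernels, singletons)  # similar singletons join kernels
        R = singletons
    merge(kernels)                     # merge similar clusters
    adopt(kernels, R)                  # final adoption pass on the leftovers
    return kernels, R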

CLICK Experiment:
Fibroblast Serum Response Data
[table 2 from Sharan & Shamir, ISMB 2000]

Evaluating Clustering Results


• given random data without any “structure”,
clustering algorithms will still return clusters
• the gold standard: do clusters correspond to
natural categories?
• do clusters correspond to categories we care
about? (there are lots of ways to partition the world)
• how probable does held-aside data look?
• how well does the clustering algorithm optimize
homogeneity and separation?

Measuring Homogeneity
• average similarity of objects to their clusters:

$H_{ave} = \dfrac{1}{|N|} \sum_{u \in N} \mathrm{sim}(F(u), F(\mathrm{cluster}(u)))$

• minimum similarity of an object to its cluster:

$H_{min} = \min_{u \in N} \mathrm{sim}(F(u), F(\mathrm{cluster}(u)))$
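A sketch of both measures, taking cosine similarity as sim and the mean vector of a cluster as its fingerprint $F$ (both assumptions; the slide leaves them abstract):

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def homogeneity(X, labels):
    """X: (n, d) array of object fingerprints; labels: (n,) cluster ids."""
    centers = {j: X[labels == j].mean(axis=0) for j in np.unique(labels)}
    sims = [cosine(x, centers[j]) for x, j in zip(X, labels)]
    return float(np.mean(sims)), float(np.min(sims))  # H_ave, H_min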

Measuring Separation
• average separation of pairs of clusters:

$S_{ave} = \dfrac{1}{\sum_{i \neq j} |X_i| |X_j|} \sum_{i \neq j} |X_i| |X_j| \, \mathrm{sim}(F(X_i), F(X_j))$

• maximum separation of a pair of clusters:

$S_{max} = \max_{i \neq j} \mathrm{sim}(F(X_i), F(X_j))$

• note that under these definitions, low separation is
good!
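The matching separation measures, with the same assumed sim and $F$ as in the homogeneity sketch:

import numpy as np
from itertools import combinations

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def separation(X, labels):
    ids = list(np.unique(labels))
    centers = {j: X[labels == j].mean(axis=0) for j in ids}
    sizes = {j: int((labels == j).sum()) for j in ids}
    num = den = 0.0
    s_max = -np.inf
    for i, j in combinations(ids, 2):   # each unordered pair once; the
        w = sizes[i] * sizes[j]         # factor of 2 cancels in S_ave
        s = cosine(centers[i], centers[j])
        num += w * s
        den += w
        s_max = max(s_max, s)
    return num / den, float(s_max)      # S_ave, S_max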

CLICK Experiment:
Fibroblast Serum Response Data

table from: Sharan & Shamir, ISMB 2000
