Abstract
It is now possible to simultaneously measure the expression of thousands
of genes during cellular differentiation and response, through the
use of DNA microarrays. A major statistical task is to understand the
structure in the data that arise from this technology. In this paper
we review various methods of clustering, and illustrate how they can
be used to arrange both the genes and cell lines from a set of DNA
microarray experiments. The methods discussed are global clustering
techniques including hierarchical, K-means, and block clustering, and
tree-structured vector quantization. Finally, we propose a new method
for identifying structure in subsets of both genes and cell lines that
are potentially obscured by the global clustering approaches.
1 Introduction
DNA microarrays and other high-throughput methods for analyzing complex
nucleic acid samples now make it possible to measure rapidly, efficiently and
accurately the levels of virtually all genes expressed in a biological sample.
The application of such methods in diverse experimental settings generates
results rich in information. However, the process of transforming this
information into meaningful biological insights is impeded by the complexity
and vastness of the data. One way to overcome this obstacle is exemplified
in recent analyses of genome-scale expression time series (Eisen, Spellman,
Brown & Botstein (1998), Tamayo, Slonim, Mesirov, Zhu, Kitareewan &
Dmitrovsky (1999), Iyer, Eisen, Ross, Schuler, Moore, Lee, Trent, Hudson,
Boguski, Lashkari, Botstein & Brown (1999), Chu, Eisen, Mulholland,
Botstein, Brown & Herskowitz (1998), Spellman, Sherlock, Iyer, Zhang, Anders,
Eisen, Brown & Botstein (1998), Roth, Estep & Church (1998)), where
statistical clustering methods were used to organize the data by identifying
groups of genes with similar behavior across time. Such organizational
frameworks greatly facilitate the process of exploring these complex sets of
biological data (Botstein & Brown (1999)). In this paper we discuss the logical
extension of these methods to expression data from collections of discrete
samples, where it is useful to uncover relationships among samples as well as
genes, and illustrate the properties of various methods using gene expression
data from sixty human tumor cell lines [Ross et al. 1999]. We first
describe the application of one-dimensional clustering methods to both the gene
and sample dimensions. We then describe a new implementation of two-way
clustering. Finally, we propose methods for identifying structure in subsets
of both axes that are potentially obscured by global clustering approaches.
2 Clustering techniques
The data from a microarray experiment form a matrix, where the rows are
different genes and the columns are different cell lines. In some experiments
the samples are different cell lines from different people, and we assume that
here. In other experiments the samples are a time series of measurements
during different phases of cell development.
Recently some authors have explored the use of clustering methods to
arrange the genes in some natural order, with similar genes placed close
together. Good general references on clustering are Everitt (1980), Kaufman
& Rousseeuw (1990) and Gordon (1999). There are two major approaches to
clustering: bottom-up and top-down. Hierarchical clustering (e.g. Sokal &
Michener (1958)) is a bottom-up clustering method that starts with each
observation (gene) in its own cluster. It works by agglomerating the closest
pair of clusters at each stage, successively combining clusters until all of the
data are in one cluster. The clustering sequence is represented by a hierarchical
tree, the "dendrogram", which can be cut at any level to yield a specified
number of clusters. Eisen et al. (1998) apply this kind of clustering to DNA
microarray data.
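The bottom-up procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not fix a distance or a linkage rule, so Euclidean distance and average linkage are assumed here, and the function names average_linkage and cut are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(rows):
    """Agglomerate the rows bottom-up and return the list of merges.

    Assumption: Euclidean distance with average linkage.
    Each merge is (members_left, members_right, distance).
    """
    clusters = [[i] for i in range(len(rows))]  # every observation starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = sum(euclidean(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

def cut(merges, n_items, k):
    """Replay the first n_items - k merges, i.e. cut the tree at k clusters."""
    clusters = [[i] for i in range(n_items)]
    for left, right, _ in merges[: n_items - k]:
        clusters = [c for c in clusters if c != left and c != right]
        clusters.append(left + right)
    return sorted(sorted(c) for c in clusters)
```

The member list of the final merge (left members followed by right members) gives one ordering of the objects consistent with the tree, which is the property used later for display.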
Top-down clustering starts with a specified number of clusters and initial
positions for the cluster centers. The K-means (or Lloyd's) algorithm (Lloyd
(1957), MacQueen (1967)) is used to reposition the cluster centers through the
following steps: (a) observations are assigned to the closest cluster center to
form a partition of the data; (b) the observations in each cluster are averaged
to produce new values for the center vector of that cluster. Steps (a) and
(b) are iterated, and the process converges to a local minimum of the total
within-cluster variance. Typically the K-means procedure is repeated with
a number of initial values for the cluster centers, and the best solution (in
terms of total within-cluster variance) is chosen.
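Steps (a) and (b), together with the restart strategy, can be sketched as follows. The initialization (random observations as starting centers), the stopping rule (assignments unchanged) and the names lloyd and kmeans are assumptions made for illustration; they are not specified in the text.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centers, max_iter=100):
    """Iterate steps (a) and (b) until the assignment stops changing."""
    labels = None
    for _ in range(max_iter):
        # (a) assign each observation to the closest center
        new_labels = [min(range(len(centers)),
                          key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # (b) replace each center by the mean of its members
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # total within-cluster sum of squares for this solution
    wss = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, wss

def kmeans(points, k, n_starts=10, seed=0):
    """Run Lloyd's algorithm from several random starts; keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        # assumption: starting centers are k distinct observations
        centers = [list(p) for p in rng.sample(points, k)]
        result = lloyd(points, centers)
        if best is None or result[2] < best[2]:
            best = result
    return best
```

Because each run only reaches a local minimum, the comparison across starts uses the same within-cluster criterion that the iterations decrease.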
Tree-structured vector quantization (TSVQ) carries out K-means clus-
tering in a top-down, binary manner (Gersho & Gray (1992), Perlmutter,
Cosman, Olshen, Gray, Li & Bergin (1998)). It is commonly used in image
and signal compression.
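A rough sketch of the top-down, binary idea is given below. Several choices are assumptions for illustration and are not taken from the text: the leaf to split next is the one whose 2-means split most reduces the within-cluster sum of squares, the 2-means start is a deterministic farthest-point rule, and the names two_means and tsvq are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(data, idx):
    return [sum(data[i][j] for i in idx) / len(idx) for j in range(len(data[0]))]

def wss(data, idx):
    c = centroid(data, idx)
    return sum(sq_dist(data[i], c) for i in idx)

def two_means(data, idx, n_iter=50):
    """Split the observations in idx into two groups with 2-means."""
    c = centroid(data, idx)
    # assumption: deterministic farthest-point initialization
    c0 = data[max(idx, key=lambda i: sq_dist(data[i], c))]
    c1 = data[max(idx, key=lambda i: sq_dist(data[i], c0))]
    left, right = list(idx), []
    for _ in range(n_iter):
        new_left = [i for i in idx if sq_dist(data[i], c0) <= sq_dist(data[i], c1)]
        new_right = [i for i in idx if sq_dist(data[i], c0) > sq_dist(data[i], c1)]
        if not new_left or not new_right or (new_left, new_right) == (left, right):
            break
        left, right = new_left, new_right
        c0, c1 = centroid(data, left), centroid(data, right)
    return left, right

def tsvq(data, k):
    """Grow a binary tree of 2-means splits until there are k leaves."""
    leaves = [list(range(len(data)))]  # leaves kept in left-to-right order
    while len(leaves) < k:
        best = None
        for pos, leaf in enumerate(leaves):
            if len(leaf) < 2:
                continue
            left, right = two_means(data, leaf)
            if not right:
                continue
            # assumption: "best" leaf = largest drop in within-cluster SS
            gain = wss(data, leaf) - wss(data, left) - wss(data, right)
            if best is None or gain > best[0]:
                best = (gain, pos, left, right)
        if best is None:
            break  # nothing left to split
        _, pos, left, right = best
        leaves[pos:pos + 1] = [left, right]
    return leaves
```

Reading the leaves from left to right yields an ordering of the objects, which plain K-means does not provide.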
Principal components analysis (e.g. Mardia, Kent & Bibby (1979)), when applied
to the genes, finds the linear combinations of genes having the highest
variance. Similarly, when applied to cell lines, it finds the highest-variance
linear combination of the cell lines. The correlation of each gene with the
leading principal component provides a way of sorting (clustering) the genes,
and similarly for the cell lines.
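The sorting idea can be illustrated with a small sketch. It assumes each gene (row) is centered across cell lines and uses plain power iteration in place of a full eigen-decomposition; the names leading_scores and order_by_first_component are hypothetical. The sign of a principal component is arbitrary, so the resulting order may come out reversed.

```python
import math

def leading_scores(rows, n_iter=200):
    """Scores of the first principal component, one value per column.

    Power iteration on X^T X, where X holds the (centered) rows.
    """
    n_cols = len(rows[0])
    s = [1.0 + j for j in range(n_cols)]  # arbitrary non-constant start
    for _ in range(n_iter):
        w = [sum(a * b for a, b in zip(r, s)) for r in rows]  # X s (gene weights)
        s = [sum(w[i] * rows[i][j] for i in range(len(rows)))
             for j in range(n_cols)]                          # X^T w (scores)
        norm = math.sqrt(sum(v * v for v in s)) or 1.0
        s = [v / norm for v in s]
    return s

def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def order_by_first_component(x):
    """Sort rows by their correlation with the leading component."""
    # assumption: rows are centered before the component is extracted
    centered = [[v - sum(r) / len(r) for v in r] for r in x]
    s = leading_scores(centered)
    corr = [correlation(r, s) for r in centered]
    order = sorted(range(len(x)), key=lambda i: corr[i], reverse=True)
    return order, corr
```

Applying the same function to the transposed matrix orders the cell lines.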
The self-organizing map (SOM) (Kohonen (1989)) is similar to K-means
clustering, with the additional constraint that the cluster centers are
restricted to lie in a one- or two-dimensional manifold. An online procedure is
used to readjust the positions of the centers. There is a similarity between
SOMs, multi-dimensional scaling and nonlinear principal components. See
Ripley (1995) and Cherkassky & Mulier (1998) for more details. This method
was used successfully for DNA microarray data by Tamayo et al. (1999).
We have found that K-means clustering produces tighter clusters than
hierarchical clustering, but the latter tends to produce a greater number of
smaller clusters, which can be a valuable feature for discovery. Unlike K-
means, hierarchical clustering also produces an ordering of the objects (see
below) which can be informative for data display. SOMs allow interpretation
of the clusters, but should be checked against K-means clustering to see
if the low-dimensional representation for the cluster centers is a reasonable
assumption for the data.
All of these methods are one-way clustering techniques. In this paper
we investigate the use of two-way clustering, to simultaneously cluster both
genes and cell lines. One simple approach to this problem is to apply a
one-way clustering method to the genes and cell lines separately, and we do this
below. Block clustering, in contrast, uses both gene and cell line information
to simultaneously cluster both.
The two-way clustering procedures seek a global organization of genes and
cell lines. We find that they are able to discover gross global structure but
may not be effective for discovering fine detail. In response to this finding,
we propose a new method called "gene shaving", which searches for sets of genes
that optimally separate the cell lines.
valued expression levels Y = (y_ij), where genes are the rows and samples are
the columns.
Two-way clustering. We investigate four different methods for two-way
clustering. The first three methods cluster and reorder the rows and columns
of the data matrix separately from one another.
clustering is performed at each tree node, and the best node is
successively split until the specified number of clusters is obtained. An
advantage over simple K-means is that an ordering of the objects can
be obtained from the leaves of the tree.
(e.g. Duffy & Quiroz (1991)).
Allowable splits: if there are existing row splits that intersect the block,
one of these must be used for the rows, called a "fixed split". The same
is done for columns. Otherwise all split points are tried.
To find the best split into two groups, one can show that it is sufficient
to sort the rows (or columns) by row (resp. column) mean, and then seek a
split in that order. A drawback of block clustering when applied to
median-centered data (the case here) is that at the start, all row and column
means are approximately zero. Hence the procedure has difficulty getting started.
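The sort-then-scan search can be sketched as below. The sketch assumes that each block is summarized by a single constant (its mean) and that a split is scored by the resulting drop in within-block sum of squares; the function name best_row_split is hypothetical. Columns can be handled by passing the transposed block.

```python
def best_row_split(block):
    """Best split of the rows of a block into two groups.

    Assumption: each group is modelled by one constant (its mean), and the
    score is the reduction in within-block sum of squares.
    Returns (rows_low, rows_high, gain).
    """
    n_rows, n_cols = len(block), len(block[0])
    means = [sum(r) / n_cols for r in block]
    # it suffices to consider splits along the rows sorted by mean
    order = sorted(range(n_rows), key=lambda i: means[i])
    total = sum(means)
    best_gain, best_cut = -1.0, None
    left_sum = 0.0
    for pos in range(1, n_rows):
        left_sum += means[order[pos - 1]]
        n_left, n_right = pos, n_rows - pos
        left_mean = left_sum / n_left
        right_mean = (total - left_sum) / n_right
        # between-group sum of squares for two groups of rows
        gain = n_cols * n_left * n_right / n_rows * (left_mean - right_mean) ** 2
        if gain > best_gain:
            best_gain, best_cut = gain, pos
    return order[:best_cut], order[best_cut:], best_gain
```

When every row mean is close to zero, as with median-centered data at the first split, all candidate gains are small and nearly equal, which is the start-up difficulty noted above.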
Restricting the splits to fixed splits ensures that (a) the overall
partition can be displayed as a contiguous representation, with a common
re-ordering for the rows and columns, and (b) the partitions of each of the
rows and columns can be described by hierarchical trees.
Figure 1 shows a simple example for illustration. There are 5 genes and
3 cell lines, labelled 1-5 and 1-3 respectively. The first (vertical) split
separates cell line 3 from 1 and 2. The second (horizontal) split separates genes
2 and 3 from 1,4,5. Now consider splitting the rightmost box. The split
that separates genes 1 and 2 from 3,4,5 in the right box would not allow a
single contiguous representation of the entire data matrix, and hence is not
permitted. The split that separates gene 2 from 1,3,4,5 violates property (b)
above and is also not permitted. The only permissible horizontal split of the
rightmost box is the one that separates genes 2 and 3 from 1,4,5, continuing
the horizontal line segment in the left box all the way to the right.
The contiguity property (a) is most important to preserve. It is however
possible to relax (b), allowing splits such as 2 vs 3,5,4,1 in the right box.
Figure 1: Simple example to illustrate the block clustering rules. The first
(vertical) split separates cell line 3 from 1 and 2. The second (horizontal) split
separates genes 2 and 3 from 1,4,5. If the rightmost box is split horizontally,
it must be split between genes 2,3 and 1,4,5.
Stopping rule for splitting blocks. For all clustering techniques, estimation
of the appropriate number of clusters is an important but difficult problem.
Clustering algorithms will find clusters even when applied to independent
(unclustered) data, so it is important to calibrate them. Milligan & Cooper
(1985) compare many of the suggested approaches to the problem, for
one-way clustering. For block clustering, Duffy & Quiroz (1991) suggest the use
of permutation tests to determine when a given block split is not significant.
However this can lead to early stopping of the splitting process, which can
miss good block splits later.
Instead, our strategy is to split into some large number of blocks M, and
then apply weakest link pruning (recombining) of the blocks to produce a
series of partitions having different numbers of blocks (between 1 and M).
Then we apply the algorithm to permuted versions of the data, to estimate
the best number of blocks k ≤ M. Here is a summary of what we call the
"maximum gap" approach:
1. Let rss_k be the total within-block sum of squares, when k clusters are
used.
e) Gene shaving
The two-way clustering methods seek a single re-ordering of the cell lines
for all genes. However a more complex pattern may exist. In particular, one
set of genes might cluster the cell lines in one fashion, and another set of
genes might produce a very different clustering.
Here we describe a method which first finds the linear combination of
genes having maximal variation among the cell lines. We think of this linear
combination as a "super gene". The genes having lowest correlation with
the super gene are then removed ("shaved") from the data, and the process is
continued until the subset of genes contains only one gene. This process
produces a sequence of gene blocks, each containing genes that are similar to
one another and displaying large variance across the cell lines.
The details of the gene shaving procedure are as follows:
1. Start with all of the data. Find the first principal component of the
genes.
2. For each gene i compute the absolute value of its correlation with the
first principal component.
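The shaving loop described above (find the super gene, rank genes by absolute correlation with it, remove the least correlated, repeat until one gene is left) might be sketched as follows. The fraction shaved at each pass is not stated in the text, so the default of 10% (at least one gene) is purely an assumption, as are the centering of each gene, the use of power iteration, and the names super_gene and shave.

```python
import math

def super_gene(rows, n_iter=100):
    """First principal component scores (one per cell line), by power iteration."""
    n_cols = len(rows[0])
    s = [1.0 + j for j in range(n_cols)]  # arbitrary non-constant start
    for _ in range(n_iter):
        w = [sum(a * b for a, b in zip(r, s)) for r in rows]
        s = [sum(w[i] * rows[i][j] for i in range(len(rows)))
             for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in s)) or 1.0
        s = [v / norm for v in s]
    return s

def abs_corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return abs(cov) / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def shave(x, fraction=0.1):
    """Return the nested sequence of gene index sets, largest first.

    Assumption: `fraction` of the remaining genes (at least one) is removed
    at each pass; the text does not give this value.
    """
    centered = [[v - sum(r) / len(r) for v in r] for r in x]
    genes = list(range(len(x)))
    sequence = [list(genes)]
    while len(genes) > 1:
        s = super_gene([centered[g] for g in genes])
        # rank the remaining genes by absolute correlation with the super gene
        ranked = sorted(genes, key=lambda g: abs_corr(centered[g], s), reverse=True)
        n_drop = max(1, int(fraction * len(genes)))
        genes = sorted(ranked[: len(genes) - n_drop])
        sequence.append(list(genes))
    return sequence
```

The sketch only produces the nested sequence of candidate blocks; how one block is chosen from that sequence is not covered here.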
The Dataset.
The dataset used in our study has expression measurements on 6830 genes
for a set of 64 human cancer tumors. A full description of these data appears
in [Ross et al. 1999]. The row and column medians were set to zero by
alternately subtracting off the median of each column and each row.
Finally, missing values were set to the value zero.
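A minimal sketch of this preprocessing is shown below. It assumes that missing values are represented as None, that medians are computed over the observed entries only, and that a fixed number of alternating sweeps is enough; none of these details are given in the text, and the name median_center is hypothetical.

```python
from statistics import median

def median_center(x, n_sweeps=10):
    """Alternately subtract column and row medians, then zero-fill missing.

    Assumptions: missing entries are None and are ignored when computing
    medians; n_sweeps alternating passes are performed.
    """
    x = [list(r) for r in x]  # work on a copy
    n_rows, n_cols = len(x), len(x[0])
    for _ in range(n_sweeps):
        for j in range(n_cols):
            col = [x[i][j] for i in range(n_rows) if x[i][j] is not None]
            m = median(col) if col else 0.0
            for i in range(n_rows):
                if x[i][j] is not None:
                    x[i][j] -= m
        for i in range(n_rows):
            row = [v for v in x[i] if v is not None]
            m = median(row) if row else 0.0
            x[i] = [None if v is None else v - m for v in x[i]]
    # finally, missing values are set to zero
    return [[0.0 if v is None else v for v in r] for r in x]
```

Filling with zero only after centering places each missing entry at the common row and column median.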
4 Results
Two-way clustering
Figures 2-7 show the clustering results for the human tumor data.
K-means clustering performs poorly, probably because it does not give an
order of the clustered objects. In the figure we have used multidimensional
scaling to order the objects within each cluster, and to order the centroids
of each cluster. TSVQ fixes this problem, and gives a similar picture to
hierarchical clustering. Both TSVQ and hierarchical clustering have successfully
organized the genes and cell lines to produce some visible structure. Block
clustering probably does the best job of discovering contiguous blocks of gene
expression.
Two of the cell lines have two replicates in the dataset, indicated by
the suffix "repro". An effective clustering technique should place replicates
near one another. Examining the figures, this occurs for hierarchical,
principal component, TSVQ and partly for block clustering. K-means clustering
fails in this regard.
Block clustering also gives a one-way clustering of the cell lines, and a
one-way clustering of the genes. In Figure 7, the cell lines are partitioned into
9 groups, by the vertical lines in the diagram. This partition is hierarchical,
meaning that for any two subpartitions the first is contained in the second,
or vice versa. Examination of the genes in Figure 7 corresponding to the
green block in the bottom left, and the red block in the middle left, reveals
a number that are known to be characteristically up or down regulated in
leukemia. Also included are unregulated ring 3 proteins, and cytoskeletal
proteins. The presence of a breast cell line clustered with the leukemias
is somewhat surprising, and is also seen with some of the other clustering
techniques. However it is difficult to extract fine gene-cell line interactions
from block clustering or any of the other global clustering schemes.
Figure 2: Human tumor data, with genes in the rows and cell lines in the
columns. The order of the rows and columns was randomly chosen. (Panel
title: raw data.)
Figure 3: Clustering for human tumor data. Shown is the result of reordering
rows and columns, from hierarchical clustering applied separately to each.
Figure 4: Clustering for human tumor data. Shown is the result of K-means
clustering, applied separately to rows and columns. (Panel title: Two-way
k-means.)
Figure 5: Clustering for human tumor data. Shown is the result of
tree-structured vector quantization, applied separately to rows and columns.
(Panel title: two-way TSVQ.)
Figure 6: Clustering for human tumor data. Here the rows and columns are
ordered with respect to their inner product with the first principal component.
Figure 7: Clustering for human tumor data. Result from block clustering:
rows and columns have been rearranged, and some contiguous blocks are
visible. (Panel title: block clustering.)
Gene shaving
The first three blocks of genes from the gene shaving process are shown
in Figures 8 to 10. The variance of the column means of gene expression is
indicated in the heading. Some clear separation of the cell lines is visible.
Although the cancer classes were not used in the shaving process, the resulting
orderings are quite successful at grouping together some of the classes.
The gene names shown at the left of each rectangle are internal codes.
Most of the genes are uncharacterized, illustrating the potential for this
technique to discover new patterns of expression.
The full gene names are:
Block 1:
1. "357775" "SIDW357775, Human nuclear orphan receptor LXR-alpha mRNA, complete cds [5':W95560, 3':W95433]"
2. "512287" "SID512287, Human neuronal pentraxin 1 (NPTX1) mRNA, complete cds [5':AA057692, 3':AA057694]"
3. "359412" "SIDW359412, Cyclin D2 [5':AA011227, 3':AA010487]"
4. "376178" "SIDW376178, Human 5'-AMP-activated protein kinase, gamma-1 subunit mRNA, complete cd [5':AA040683, 3':AA040600]"
5. "136798" "FN1 Fibronectin 1 Chr.2 [136798, (IEW), 5':R36450, 3':R36451]"
6. "359396" "SIDW359396, Human cGMP-stimulated 3',5'-cyclic nucleotide phosphodiesterase PDE2A3 (PDE2A) mRNA, complete cd [5':AA010496, 3'
7. "376052" "SIDW376052, Human nucleotide-binding protein mRNA, complete cds [5':AA039305, 3':AA039353]"
8. "151144" "FN1 Fibronectin 1 Chr.2 [151144, (EW), 5':H03906, 3':H03907]"
9. "324037" "SIDW324037, Homo sapiens clone 24590 mRNA sequence [5':W46518, 3':W46450]"
Block 2:
1. "50250" "ESTs Chr.9 [50250, (R), 5':H17799, 3':H17800]"
2. "512355" "SID512355, ESTs, Highly similar to SRC SUBSTRATE P80/85 PROTEINS [Gallus gallus] [5':AA059424, 3':AA057835]"
Block 3:
1. "241935" "SPP1 Secreted phosphoprotein 1 (osteopontin, bone sialoprotein I) Chr.4 [241935, (EW), 5':H93913, 3':H93048]"
2. "363981" "SPP1 Secreted phosphoprotein 1 (osteopontin, bone sialoprotein I) Chr.4 [363981, (EW), 5':AA021511, 3':AA021512]"
The genes in the first block are related to stromal cells, and tend to separate
the tissue tumors from blood cancers. The second block of genes is
uncharacterized. The third block consists of secreted phosphoprotein genes, and
produces a different separation of the stromal cancers than the first block of
genes. This illustrates the potential for this technique to discover new
patterns of expression.
5 Discussion
We have investigated the use of two-way clustering methods for DNA microarray
data. Some of the methods are successful for discovering contiguous areas
of high or low gene expression, including hierarchical clustering, TSVQ, and
block clustering. We have introduced the "maximum gap" diagnostic for
protection against finding spurious structure.
There are close connections between block clustering and the classification
and regression tree algorithm (CART) of Breiman, Friedman, Olshen & Stone
(1984). Block clustering is very similar to CART with splits on 2 categorical
predictors (genes and cell lines), and the pruning algorithm is the same as
that in CART. What is different is the restriction to fixed splits and the use
of permutations to estimate the optimal number of splits.
Figure 8: First gene block from gene shaving process. (Heading: variance =
4.37. Row labels: 1071X, 3414X, 3397X, 4751X, 2808X, 2492X, 3281X, 5037X,
200X.)
Figure 9: Second gene block from gene shaving process. (Heading: variance =
3.007. Row labels: 6293X, 3004X, 4344X, 2453X, 1082X, 2016X, 802X, 118X,
502X.)
Figure 10: Third gene block from gene shaving process. (Heading: variance =
11.344. Row labels: 4325X, 263X.)
By seeking a single global organization of the data, the two-way clustering
procedures are limited in their ability to discover fine structure. The gene
shaving method, introduced here, looks for blocks of genes that produce
different separations of the cell lines, and the initial results look very
promising. There are many interesting modifications of this procedure. For
example, any aspect of the data can be used to direct the shaving process. If
class labels are available for the cell lines (tumor types in our example), the
shaving can be supervised by these labels. The procedure then tries to find
subsets of genes that separate the classes as well as possible. Details will be
given in a forthcoming paper.
References
Botstein, D. & Brown, P. (1999), `Exploring the new world of the genome
with DNA microarrays', Nature Genetics (Supp.) 21, 33-37.
Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984), Classification and
Regression Trees, Wadsworth.
Cherkassky, V. & Mulier, F. (1998), Learning from data, Wiley.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O.
& Herskowitz, I. (1998), `The transcriptional program of sporulation in
budding yeast', Science 282, 699-705.
Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998), `Cluster analysis
and display of genome-wide expression patterns', Proc. Nat. Acad. Sci.
95, 14863-14868.
Everitt, B. (1980), Cluster analysis, Halsted, New York.
Hartigan, J. (1972), `Direct clustering of a data matrix', J. Amer. Statis.
Assoc. 67, 123-129.
Iyer, V. R., Eisen, M. B., Ross, D. R., Schuler, G., Moore, T., Lee, J. C. F.,
Trent, J. M., Hudson, J., Boguski, M., Lashkari, D., Shalon, D.,
Botstein, D. & Brown, P. (1999), `The transcriptional program in the
response of human fibroblasts to serum', Science 283, 83-87.
MacQueen, J. (1967), Some methods for classification and analysis of
multivariate observations, in `Proceedings of the Fifth Berkeley Symposium
on Mathematical Statistics and Probability, eds L.M. LeCam and J.
Neyman', Univ. of Cal. Press, pp. 281-297.
Mardia, K., Kent, J. & Bibby, J. (1979), Multivariate Analysis, Academic
Press.
Roth, F. P., Hughes, J., Estep, P. & Church, G. (1998), `Finding DNA
regulatory motifs within unaligned noncoding sequences clustered by
whole-genome mRNA quantitation', Nat. Biotechnol. 16, 939-945.
Spellman, P. T., Sherlock, G., Iyer, V. R., Zhang, M., Anders, K., Eisen,
M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998), `Comprehensive
identification of cell cycle-regulated genes of the yeast Saccharomyces
cerevisiae by microarray hybridization', Mol. Biol. Cell 9(12), 3273-3297.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S. & Dmitrovsky,
E. (1999), `Interpreting patterns of gene expression with self-organizing
maps: Methods and applications to hematopoietic differentiation', Proc.
Nat. Acad. Sci. 96, 2907-2912.