
Tampereen teknillinen yliopisto.

Julkaisu 561
Tampere University of Technology. Publication 561

Cristian Mircean

Genomic and Proteomic Signal Processing in Cancer Research
Thesis for the degree of Doctor of Technology to be presented with due permission for
public examination and criticism in Tietotalo Building, Auditorium TB222, at Tampere
University of Technology, on the 29th of November 2005, at 12 noon.

Tampereen teknillinen yliopisto - Tampere University of Technology


Tampere 2005

In memory of my father

ISBN 952-15-1475-2
ISSN 1459-2045

Abstract
In order to measure the state of alteration in a cell's molecular biology, we need
qualitative markers, better performing analysis algorithms, and more reliable diagnosis
of the molecular underpinnings of disease, especially cancer.
New technologies have revealed much about the cells' inner processes, but this
information explosion represents a challenge for the scientific community.
Bioinformatics fills the gaps by creating algorithms, tools, and methods to process
thousands of signals.
During this period of exciting progress, this doctoral research was dedicated to the
development of good classifiers, insightful models, and methods to select the most
significant features and patterns in molecular biology measurements. These studies
utilized several approaches, based on genomics and proteomics techniques, to develop
methods and apply them to real data from several cancer cell types.
The classification capabilities of the developed algorithms were scrutinized using
multivariate statistical tools. These typically minimize the estimated error rates;
additionally, we investigated the biological interpretation of the results. The tools
should have direct application in medicine.

Genomics
Disease diagnosis is sometimes biased by features visually identified by pathologists on
tissue samples. However, molecular modifications in cells reflect the biological nature of
specific tumors more accurately than visual features. Current diagnosis schemes will be
enhanced if the classification uses molecular signatures. Also, careful design of methods
to discriminate between diseased and normal cells may replace the human effort in
diagnosis and may reduce error. We analyzed numerous discrimination methods using
cDNA microarray data. The most successful techniques combined the k-Nearest Neighbor
algorithm with optimized (Lloyd) data quantization. The error rates obtained with
quantization methods are shown to be smaller than those reported in previously published
studies on the same data sets.
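The quantization step mentioned above can be illustrated with a minimal sketch of a
one-dimensional Lloyd quantizer. It is an illustration only and not necessarily the exact
procedure of the publications: the function names, the quantile-based initialization, the
choice of three levels, and the toy log-ratios are assumptions made here.

```python
def nearest_level(value, codebook):
    """Index of the codeword closest to value (squared-error criterion)."""
    return min(range(len(codebook)), key=lambda j: (value - codebook[j]) ** 2)


def lloyd_quantize(values, levels=3, max_iter=100):
    """One-dimensional Lloyd quantizer: returns (codebook, level index per value).

    Assumes len(values) >= levels.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Initial codebook from evenly spaced sample quantiles (an assumption of this sketch).
    codebook = [ordered[(2 * j + 1) * n // (2 * levels)] for j in range(levels)]
    for _ in range(max_iter):
        # Nearest-neighbour condition: assign every value to its closest codeword.
        assignment = [nearest_level(v, codebook) for v in values]
        # Centroid condition: move every codeword to the mean of its cell.
        updated = []
        for j in range(levels):
            cell = [v for v, a in zip(values, assignment) if a == j]
            updated.append(sum(cell) / len(cell) if cell else codebook[j])
        if updated == codebook:
            break
        codebook = updated
    return codebook, [nearest_level(v, codebook) for v in values]


# Toy log-ratios of one gene across nine samples (hypothetical numbers).
log_ratios = [-2.1, -1.9, -2.0, 0.1, -0.1, 0.0, 2.2, 1.8, 2.0]
codebook, indices = lloyd_quantize(log_ratios, levels=3)
print(codebook, indices)
```

The quantized indices, rather than the raw values, would then be passed to the distance
function of the k-Nearest Neighbor classifier.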
A classifier was trained on typical cases from a glioma microarray data set. The
effectiveness of learning depended on the quality of the data and the information they
contained. The selected genes provided the lowest cross-validation classification error.
This classifier was applied to the remaining cases of several mixed gliomas and atypical
meningioma. The algorithm correctly classified most of the gliomas, and the detailed
voting results provided subtle information regarding the molecular similarities with
neighboring classes. We propose that the developed method can be used as a diagnostic
tool for observing the continuous character of glioma malignancy.
Information theory provides grounds for feature selection. We measured the
influence of genes on the information stored in the data set. The relative improvement
may be estimated in terms of the description length gain when a particular gene is used,
compared to the description without the gene. Rissanen's normalized maximum likelihood
is a principled way to find the structure of the model for classification and the
informative candidates for the feature sets. Further, on another data set, we used the
"gene shaving" clustering method to group similarly behaving genes into clusters, and we
checked the enriched functions based on gene ontology. The functions of the selected
genes showed a significant enrichment of genes involved in metabolism and signal
transduction.

Proteomics
One of the proteomic technologies able to measure the levels of protein expression in a
large number of biological samples simultaneously is the reverse-phase lysate microarray.
The technology provides the means for effective observation at the level of proteins,
which carry out cellular functions and are important to understanding biological
systems. A challenge for accurate quantification of protein expression is the relatively
narrow dynamic range associated with the commonly used chromogenic signal detection
system. We developed a 1440-spot (and then a 1728-spot) lysate microarray that contains
80 (or 96) lysate (or serum) samples, each printed in triplicate at six two-fold
dilutions. We then designed several algorithms that estimate the levels of protein
expression.
The analysis showed that the method based on a robust least squares estimator
provided the most accurate quantitation of the protein lysate microarray data for purified
bovine serum albumin. We then applied the technology to real biological samples. As a
first application of the method, we analyzed HCT116 colon cancer cell lines after
treatment with each of two drugs or a combination of the two drugs. The array contained
p53-/- HCT116 cells with no treatment as well as p53+/+ HCT116 cells (parental cells)
with no treatment as a control. The protein levels estimated from the array data were
compared to those observed by western blotting.
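To illustrate why a robust estimator helps with dilution-series data, the following sketch
estimates the expression ratio between a sample and a reference spotted at matching
dilutions. It uses iteratively reweighted least squares with Huber weights, which is one
common robust least squares scheme and not necessarily the estimator evaluated in this
thesis. The linear response model, the function names, the tuning constant 1.345, and the
spot intensities are all assumptions of this example.

```python
def median(values):
    """Median of a non-empty list."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    return ordered[mid] if len(ordered) % 2 else 0.5 * (ordered[mid - 1] + ordered[mid])


def least_squares_ratio(reference, sample):
    """Ordinary least squares slope through the origin: sample ~ ratio * reference."""
    return sum(x * y for x, y in zip(reference, sample)) / sum(x * x for x in reference)


def robust_ratio(reference, sample, k=1.345, iterations=50):
    """Huber-weighted least squares slope through the origin."""
    weights = [1.0] * len(reference)
    ratio = 0.0
    for _ in range(iterations):
        num = sum(w * x * y for w, x, y in zip(weights, reference, sample))
        den = sum(w * x * x for w, x in zip(weights, reference))
        ratio = num / den
        residuals = [y - ratio * x for x, y in zip(reference, sample)]
        # Robust scale: median absolute residual, rescaled for Gaussian noise.
        scale = 1.4826 * median([abs(r) for r in residuals])
        if scale == 0.0:
            break
        # Huber weights: full weight for small residuals, reduced weight for large ones.
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in residuals]
    return ratio


# Six two-fold dilutions; the second sample spot is damaged (hypothetical intensities).
reference = [64.0, 32.0, 16.0, 8.0, 4.0, 2.0]
sample = [128.0, 10.0, 32.0, 16.0, 8.0, 4.0]
print(least_squares_ratio(reference, sample), robust_ratio(reference, sample))
```

In this toy series the damaged spot pulls the ordinary least squares estimate away from
the underlying two-fold ratio, while the reweighted estimate progressively discounts it.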
Then, in a large-scale, high-throughput application, we surveyed 82 glioma
samples for the levels of protein expression and phosphorylation of 46 different proteins
involved in signaling of cell survival, apoptosis, angiogenesis, invasion, and cell cycle
pathways. We observed two groups based on survival curves, glioblastomas vs. other
gliomas of lower grade, where glioblastomas are associated with a dramatically and
distinctly worse outcome. Twelve proteins were identified as the most powerful
discriminators, and cluster analysis of phosphorylated sites suggested functional
relationships that warrant further investigation.

Content
Abstract
Content
List of publications
Preface
Outline of the thesis
Acknowledgements
Abbreviations and notations
1. Historical perspective of the "Central Dogma" of biology
2. Genomics and proteomics in diagnosis of disease and prognosis
   Major characteristics of cancer cells
   Efforts in molecular targeting
   Personalized medicine
   Importance of technology for immediate-integrated diagnosis
3. Glioma overview
   Brain tumor classification
   Pathology of astrocytic gliomas
   Pathology of oligodendrogliomas
   Pathology of mixed gliomas
   Molecular pathogenesis of gliomas
   Reverse-phase protein lysate array analysis
4. Fundaments of analysis of cDNA microarray data
   Experimental design
   Data pre-processing analysis in microarrays
   Standardization of information acquired with microarray experiment
5. Reverse-phase protein array technology
6. Translational and post-translational examination of pathway alterations during
   glioma progression
   Antibody selection
   Lysate array preparation
   Analysis of lysate array data
   Proteins selected on glioma progression
7. An overview of analysis methods
   Normalization
   Differential expression of genes
   Clustering
   Distance Measures
   k-Means algorithm
   Principal Component Analysis (PCA)
   Multidimensional Scaling (MDS)
   Classification
   k-Nearest Neighbor algorithm
   Quantization and the Lloyd algorithm
   Model selection and Minimum Description Length (MDL)
Publications

List of publications
P1. Mircean C, Tabus I, Astola J. Quantization and distance function selection for
discrimination of tumors using gene expression data. SPIE 2002, BiOS 2002
Symposium, 19-25 January 2002, San Jose, CA.

P2. Mircean C, Tabus I, Astola J, Kobayashi T, Shiku H, Yamaguchi M, Shmulevich I,
Zhang W. Quantization and similarity measure selection for discrimination of
lymphoma subtypes under k-nearest neighbour classification. SPIE 2004, BiOS
2004, Microarrays, Combinatorial Techniques and High Throughput Screening,
24-29 January 2004, San Jose, California, USA.

P3. Tabus I, Mircean C, Zhang W, Shmulevich I, Astola J. Chapter 14:
Transcriptome-Based Glioma Classification using Informative Gene Set; in
Genomic and Molecular Neuro-Oncology, Jones and Bartlett Publishers, 2003.
ISBN: 0-7637-2261-8.

P4. Fuller GN*, Mircean C*, Tabus I, Taylor E, Sawaya R, Bruner MJ, Shmulevich I,
Zhang W. Molecular Voting for Glioma Classification Reflecting Heterogeneity in
the Continuum of Cancer Progression. Oncol Rep. 2005 14: 651-656. *Co-first
author.

P5. Giurcaneanu CD, Mircean C, Fuller GN, Tabus I. Chapter 2: Finding
functional structures in glioma gene-expressions using Gene Shaving clustering and
MDL principle; in Computational and Statistical Approaches to Genomics, second
edition, Kluwer Academic Publisher (in press).

P6. Mircean C, Tabus I, Kobayashi T, Yamaguchi M, Shiku H, Shmulevich I, Zhang
W. Pathway analysis of informative genes from microarray data reveals that
metabolism and signal transduction genes distinguish different subtypes of
lymphomas. Int J Oncol. 2004 Mar;24(3):497-504.

P7. Fuller GN, Hess KR, Mircean C, Tabus I, Shmulevich I, Rhee CH, Aldape KD,
Bruner JM, Sawaya RA, Zhang W. Chapter 14: Human Glioma Diagnosis From Gene
Expression Data; in Computational and Statistical Approaches to Genomics,
Kluwer Academic Publisher, 2002. ISBN: 1-4020-7023-3.

P8. Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I, Hamilton SR,
Zhang W. Robust estimation of protein expression ratios with lysate microarray
technology. Bioinformatics. 2005 May 1;21(9):1935-42. Epub 2005 Jan 12.

Preface
Biology, genetics, and medicine have experienced revolutionary changes in the past
decade. A milestone was the success of the Human Genome Project¹, which identified the
approximately 20,000-25,000 genes present in human DNA, determined the sequences
of the 3 billion chemical base pairs that make up human DNA, and stored this
information in databases. Unprecedented high-throughput experiments are now capable
of acquiring data on thousands of molecular events at once, generating an information
explosion.
There is a dramatic gap between the generation of data and the capabilities and
methods of processing the information. A direct challenge to bioinformatics researchers
is to provide and improve tools for data analysis. To help achieve these goals, my
research concentrated on developing algorithms that process biological genomic and
proteomic data.
Signal processing and computer science engaged with medicine much later than with
other observational sciences. The traditional role of engineers, designing devices and
instruments, has changed into direct engagement in the experiments that generate the
data flow. The high-throughput character of current experiments requires design,
data acquisition, and processing to be effective.

Genomics
The "one gene, one disease" archetype was discarded long ago for many diseases. It is
now established that the malfunctioning of a group of genes (not only one) often
generates disease. The subjectivity and recurrent uncertainty of classical histological
diagnosis, which is based on morphological particularities, is one of the reasons for
improving diagnostic tools with molecular models. Further, the diagnosis should be
grounded on measurements with certain statistical premises. Correct estimations usually
utilize only the information obtained from representative features, filtering out the
information from the rest of the high-dimensional feature space of a typical experiment.
One goal of measuring gene expression in cancer research is to develop models for
the molecular classification of tumors and capabilities for an objective diagnosis based on
gene expression. Decisions for diagnosis or treatment based only on a single marker may
be in error if the selected marker is not representative. However, it is not feasible to
make diagnoses based on a large number of genes or proteins because of cost and errors
related to noisy measurements. Therefore research has focused on reducing the number
of analyzed genes to an informative set, typically containing 100-200 genes or even fewer.
The classification errors depend directly on the classifier, on the information
contained in the data set, and on the match of the method to the type of data set.

¹ The Human Genome Project was a 13-year project coordinated by the U.S. Department of
Energy and the National Institutes of Health and completed in 2003. Major partners
included the Wellcome Trust (U.K.), and important contributions were made by researchers
from Japan, France, Germany, China, and other countries.
(http://www.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml)

Quantization was observed to be especially important in filtering noise.
Quantization is often mistakenly presented as a loss of data precision. This is true only
when the data constitute a perfectly true representation of the generating mechanism and
when no fundamental uncertainty exists in the generating mechanism itself. In most
cases, neither of these assumptions holds. For example, in the case of cDNA microarray
data, it is widely recognized that the reproducibility of measurements and between-slide
variation are major issues. Furthermore, genetic regulation exhibits considerable
uncertainty at the biological level. We bring evidence in favor of quantization, suggesting
that this type of noise is in fact advantageous in some regulatory mechanisms. The goal
of quantization can be thought of as trying to find the right balance between properly
capturing the information content in the data and filtering out the non-informative and
detrimental noise.
By discovering the subset of genes that allows correct diagnosis, we help in
understanding different cancers, which can eventually lead to an understanding of the
regulatory network between the genes (or their protein products) involved in disease. The
proteins produced by the selected genes are also potential targets for the
pharmaceutical industry. The key proteins that regulate one or several genes or proteins
could also be regulated or inhibited by drugs.
Much attention has been concentrated on a small number of genes that appear to
have a high probability of involvement in the observed cancer. These so-called classifiers
can be obtained by "supervised" or "unsupervised" methods.
In filter methods, the genes are ranked according to a common property relevant for
the prediction or classification (such as discriminative power, correlation, or mutual
information) that is able to explain the sample (disease) class, without making it
explicit in the subsequent prediction model. After ranking single features or groups of
features, a suitable subset is accepted and proposed as the set to be used for subsequent
analysis tasks. Wrapper methods, on the other hand, aim to restrict the set of factors so
that the prediction ability of a given method is improved. The prediction capability of a
particular method is investigated for all possible groups of genes; the set with the best
performance is declared optimal. It maximizes the prediction abilities of the studied
class of models, but may not be relevant for other classes of models. In gene-expression
analysis, the number of possible combinations of thousands of genes makes wrappers
computationally infeasible.
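A minimal sketch can make the contrast between the two families concrete. The filter below
ranks genes by the ratio of between-class to within-class scatter, one possible
discriminative-power criterion; the gene names, the expression values, and the choice of
5000 genes and subsets of 100 are hypothetical and serve only as illustration.

```python
from math import comb


def scatter_ratio(values, labels):
    """Between-class scatter divided by within-class scatter for a single gene."""
    overall = sum(values) / len(values)
    between = within = 0.0
    for cls in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        mean_cls = sum(group) / len(group)
        between += len(group) * (mean_cls - overall) ** 2
        within += sum((v - mean_cls) ** 2 for v in group)
    return between / within if within > 0 else float("inf")


def rank_genes(expression, labels):
    """Filter method: order genes by decreasing scatter ratio.

    expression maps a gene name to its list of values, one per sample.
    """
    return sorted(expression, key=lambda g: scatter_ratio(expression[g], labels),
                  reverse=True)


def count_gene_subsets(total_genes, subset_size):
    """Number of candidate subsets an exhaustive wrapper would have to evaluate."""
    return comb(total_genes, subset_size)


# Hypothetical expression values for three genes over six samples in two classes.
expression = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],
    "geneB": [2.0, 1.0, 3.0, 2.1, 0.9, 3.1],
    "geneC": [0.5, 1.5, 1.0, 2.0, 3.0, 2.5],
}
labels = ["x", "x", "x", "y", "y", "y"]
print(rank_genes(expression, labels))
print(len(str(count_gene_subsets(5000, 100))), "digits")
```

The filter scores each gene once, whereas the subset count shows why evaluating every
group of genes with a wrapper is out of reach.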

In contrast, we look into simple and effective methods that may be promising for
molecular diagnostic efforts, as a complement to the currently used morphology-based
diagnostics. This thesis regards algorithms as "simple" from the viewpoints of
computational load, programming, and information complexity.
Future technological evolution may lead to powerful diagnostic tools based on the
classification algorithms we develop today, whether as stand-alone tools (a
system-software solution), integrated into existing devices, or as major health-computing
systems established as large national databases. The current trend in cancer diagnosis is
to merge efforts toward a complete decision-making system that integrates tools from gene
expression with protein expression, analysis of protein interactions, loss-of-function
phenotyping using RNAi, tissue microarrays, metabolomics, cellomics, etc. This
complicates the possible network of interactions, but brings another level of
dimensionality to the solution space of the classification problem.

Proteomics
cDNA microarray techniques that analyze levels of mRNA implicitly assume that gene
transcription levels are correlated with the levels of the proteins, which accomplish the
cellular functions. However, cDNA transcriptional data should be complemented with
information on protein translation and post-translational modification. The products of
genes act in the cellular environment depending on the protein level and on protein
post-translational modifications such as phosphorylation and acetylation. Although the
methods of data mining, classification, and multivariate analysis are similar for gene
expression and protein expression, the specifics of the technologies make the two
operations distinct. In gene expression analysis, we deal with thousands of
features (genes), typically duplicated. In protein expression, the number of features
(proteins and antibodies) is typically in the range of hundreds, but samples are analyzed
in multiple replicates and dilutions. Earlier methods were characterized by
semi-quantitative or qualitative measurements. To evaluate the expression of proteins in a
high-throughput manner, together with the new developments in reverse-phase
protein array technology, we observed a specific need for robust algorithms that
quantitatively estimate the expression of proteins.
With the aid of protein array technology we can estimate the expression of proteins in
a high-throughput manner. In a simplified description, a robot produces the required
diluted spots by depositing the lysate from cells on a nitrocellulose membrane. After
batch printing, one or two layers of antibodies (the latter case is called a sandwich)
bind with high specificity to the spotted proteins. Each antibody is processed on a
different slide. The new algorithms should be able to deal with the relatively narrow
dynamic range of the chromogenic signal detection system and with possible errors caused
by incorrect loading of lysate on the membrane, cracks in the membrane surface, and
imperfections in spot replication and segmentation.
Because of the high-throughput character of reverse-phase protein arrays as
compared with technologies like western blotting, all samples are treated under
virtually identical conditions throughout the process. Also, there is the opportunity to
design multivariate analyses, similar to those performed in microarray technology.
In some cases, as in glioma research, samples are rare. The amount of lysate
needed in protein array technology is about two orders of magnitude lower than for
western blotting. Each antibody used must have proven specificity (e.g., by recognizing a
single band in a western blot). In the last year, there has been an increase in the purity
and availability of antibodies. Lastly, but of great importance, the total cost of
reverse-phase protein array analysis is lower than for classical methods of analysis.

Outline of the thesis


The thesis is structured in two parts, an introduction and a collection of publications.
The introduction attempts to recapitulate the state of the art in the fields of genomics
and proteomics and to introduce the articles in the collection of publications.

Introduction
The field of bioinformatics has evolved dramatically since my doctoral studies began:
there are new technologies, new applications, improved devices, and more systematic
approaches to molecular biology experiments. The replicability of measurements is
higher and the set of initial feature genes has been partly standardized by industry
researchers. Additionally, there is a tendency toward quantitative proteomics, while in
2000 most studies measured mRNA expression.
Chapter 1 describes the "Central dogma of biology" and genetics from a historical
perspective.
Chapter 2 portrays the importance of diagnosis and prognosis in clinical research.
Owing to the development of new molecular biology techniques, and driven further by
economic reasons, there has been an attempt to integrate the diagnosis of disease.
Molecular targeting is one of the steps in the drug discovery path. In this chapter, the
characteristics of cancer cells are compared to those of normal cells.
Chapter 3 describes the specificities of gliomas as they fall into the wide spectrum of
cancer malignancy. Brain tumors are described according to the World Health
Organization (WHO) system based on histological features. The chapter captures the
known molecular changes in pathology and refers to articles that might help the reader
interested in further medical, clinical, or pathological details.
In genomic research, two major platforms are frequently used for measuring the
expression of genes in samples, differing in the method of nucleic acid deposition.
Chapter 4 describes the fundamentals of preparing, processing, and analyzing data from
cDNA microarrays. Particularities of pre-processing and analysis of microarray data, with
emphasis on the technological process, are explained here.
Technological developments made the high-throughput observation of transcriptional
levels possible years before we were able to use high-throughput techniques to
evaluate translational levels. The scientific community assumed a direct relation between
the two phenomena without having technological capabilities for high-throughput
analysis of protein expression. Quantitative measures and high-throughput technologies
in proteomic research have now made possible the observation of translational and
post-translational alterations in cancer cells. Chapter 5 describes technological
developments in protein arrays with emphasis on reverse-phase lysate arrays. Our
contribution to lysate array technology is placed in the framework of other algorithms
that estimate the expression of proteins. Several theoretical specifics and
technological descriptions that were not included in (P8) are presented in Chapter 5.
Chapter 6 presents the application of our algorithms to the study of glioblastomas.
Chapter 7 is dedicated to describing the mathematical algorithms utilized in the papers.
This chapter concisely describes methods of normalization, differential expression of
genes, clustering methods, the distance measures used in clustering, the k-Nearest
Neighbor algorithm, k-means, principal component analysis (PCA), multidimensional
scaling (MDS) as a method for visualizing high-dimensional spaces, classification
methods, quantization and the Lloyd algorithm, model selection, and minimum description
length (MDL).

Collection of publications
Genomics. The first of the attached publications (P1) compares several discrimination
methods for the classification of gliomas using gene expression data. The methods, their
combinations, and the choice of the distance function were evaluated. Our error rates
based on the new methods were shown to be smaller than those reported in previously
published studies on the same set.
The next study (P2) applied the novel classification techniques to cDNA microarray
data for discriminating subtypes of malignant lymphoma. The genes on which the
classification is based were selected by ranking them according to a separability
criterion that takes into account between-class and within-class scatter. The
errors estimated using cross-validation were significantly lower for the
k-Nearest Neighbor (k-NN) algorithm with optimized Lloyd data quantization than
those produced by classical variants of the k-NN algorithm. Multidimensional scaling
and hierarchical clustering dendrograms were used to visualize the separation of the
three subtypes of lymphoma.
In the study described in (P3), we designed a classification model that compares
the description length of the message telling the class of each patient without knowledge
of gene expressions to the description length of the message that makes use of the gene
expression values to predict the glioma types. The description length of a model is the
total length necessary to encode both the errors of the model and the model itself. A
small number of genes are able to describe the class labels well, such that a description
gain is obtained when using them to encode the label string. However, many more genes
do not appear to contribute to the class labels because their addition does not provide a
better description than the unconditional model. The cost of the model includes the cost
of describing the errors and the model parameters. We show a meaningful reduction of
the candidate sets by applying the minimum description length principle to nested
feature sets, which allows many overly complex predictors to be discarded if predictors
with a smaller subset of genes return a better description length.
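The idea of a description gain can be sketched for the simplest setting: binary class
labels and a single gene quantized to two levels. The code length of the label string is
computed with the normalized maximum likelihood (NML) code for the Bernoulli model class,
once for all labels together and once separately within each gene level. This is an
illustrative simplification, not the model of (P3); the function names and the toy data
are assumptions, and the cost of stating which gene is used is ignored.

```python
from math import comb, log2


def max_likelihood(count, total):
    """Maximized Bernoulli likelihood of a string with `count` ones among `total` symbols."""
    p = count / total
    return (p ** count) * ((1 - p) ** (total - count))


def bernoulli_nml_length(labels):
    """NML code length in bits of a binary string under the Bernoulli model class."""
    n, k = len(labels), sum(labels)
    if n == 0:
        return 0.0
    # Normalizer: sum of maximized likelihoods over all binary strings of length n.
    normalizer = sum(comb(n, j) * max_likelihood(j, n) for j in range(n + 1))
    return -log2(max_likelihood(k, n)) + log2(normalizer)


def description_length_gain(labels, gene_levels):
    """Bits saved by encoding the labels separately within each level of a binary gene."""
    unconditional = bernoulli_nml_length(labels)
    conditional = sum(
        bernoulli_nml_length([lab for lab, g in zip(labels, gene_levels) if g == level])
        for level in (0, 1)
    )
    return unconditional - conditional


# Hypothetical labels of eight patients and two quantized genes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]
uninformative = [0, 1, 0, 1, 0, 1, 0, 1]
print(description_length_gain(labels, informative))
print(description_length_gain(labels, uninformative))
```

A positive gain indicates that the gene shortens the description of the labels; a negative
gain means that the extra model cost is not repaid, so the unconditional description is
preferred.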
In the next study (P4), we first trained the classifier on a gene set obtained from
typical glioma cases to extract the signatures that reflected more accurately the biological
nature of specific tumors. Then we applied this gene set to other cases, including
samples that did not belong to gliomas, such as atypical meningioma. The actual voting
results, which are typically used only to decide the winning class label in the k-Nearest
Neighbor algorithm, provided a useful means of gaining deeper insight into the stage
of a tumor in the continuum of cancer development. This classification scheme provides
more subtle information regarding the molecular similarities to the neighboring classes
and correctly classifies the gliomas. It also overcomes a limitation of diagnosis using
other classifications: the gliomas can be at any stage of the continuum of cancer
progression and may contain mixed features, but can still be identified with this
classification scheme.
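The voting idea can be sketched as follows: instead of reporting only the winning class,
the classifier returns the fraction of the k nearest training samples that belong to each
class. The Euclidean distance, the value k = 5, the class names "A" and "B", and the
two-dimensional toy data are assumptions of this example; the publications select the
distance function and the gene set from the data.

```python
from collections import Counter
from math import dist


def knn_votes(train_points, train_labels, query, k=5):
    """Fraction of the k nearest training samples that vote for each class."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: dist(train_points[i], query))[:k]
    counts = Counter(train_labels[i] for i in nearest)
    return {label: counts[label] / k for label in counts}


# Hypothetical two-gene expression profiles for two tumor classes.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_labels = ["A", "A", "A", "B", "B", "B"]
votes = knn_votes(train_points, train_labels, query=(0.6, 0.6), k=5)
print(votes)
```

A sample that lies between two classes receives split votes, which is the kind of graded
information the voting results are meant to expose.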
By grouping genes with similar expression patterns, we were able to observe broader
patterns and reduce the dimensionality. Patterns that synergistically regulate the
components of the system may group together or the respective cluster of features may
have a common pathway of interaction in the cell. Two variants of the "gene shaving"
algorithm are presented in (P5). The decisions in the second method were based on the
minimum description length principle.
In the study described in (P6), features were selected according to their individual
discriminatory power, clustered by a gene-shaving algorithm, and the informative genes
were analyzed for gene-ontology pathways. Six of the clusters were highly correlated with
the class labels of the patients' disease, and the top three clusters accounted for the
major differences among the three class subtypes. The functions of the genes selected in
this lymphoma study showed a significant enrichment of metabolism and signal
transduction. To further examine whether genes of particular functions reflected more
faithfully the differences between the subtypes of lymphomas, we separated the
informative genes into six different functional groups and performed multidimensional
scaling (MDS) analysis using each of the gene groups. Four of the gene-function groups
(metabolism, signal transduction, transcription factors, and cell adhesion and migration
pathways) separated the three lymphoma subtypes well, whereas apoptosis and cell cycle
genes did not result in a good separation.
The last publication in genomics (P7) was an implementation of the classifiers as a
diagnostic tool for clinical use. Ideally, a diagnostic tool would be a customized
expression array comprising fewer than 100 genes, each possessing robust differentiating
power with respect to the particular cancer under investigation. This book chapter
describes gliomas from the viewpoint of histological diagnosis. A number of methods,
including linear discriminant analysis, the k-Nearest Neighbor algorithm, and
multidimensional scaling, have been applied to derive the optimal size of the data for
correct classification. Gene expression profiling studies have demonstrated utility in
molecular classification and in the identification of novel subgroups, markers, and
targets for therapeutic intervention, and the set of 40 genes identified may potentially
constitute a clinical diagnostic chip. Ultimately, clinicians will adopt the method that
best allows them to determine patient therapy.
Proteomics. Despite the enormous genomic complexity of the human organism, at the
protein level the complexity is further increased as a result of post-translational
modifications, such as phosphorylation, acetylation, and ubiquitination, which can
appreciably impact the functional state of proteins. Reverse-phase lysate arrays allow
observation of these phenomena in a high-throughput manner. In (P8), we evaluated
several algorithms for estimating relative protein expression in different samples on the
lysate microarray by means of a cross-validation procedure. The analysis showed that the
algorithm based on a robust least squares estimator provided the most accurate
quantification of the protein lysate microarray data. We demonstrated the impact of the
technology and our best-performing algorithm by estimating relative expression levels of p53 and
p21 in either p53+/+ or p53-/- HCT116 colon cancer cells after treatments with each of
two drugs and their combination.
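The idea of a robust least squares estimator can be illustrated with a minimal sketch. The summary above does not spell out the estimator used in (P8), so everything below is an assumption made for illustration: the straight-line model of log signal against dilution step, the Huber weighting, the MAD scale estimate, and the name `huber_irls` are all hypothetical choices, not the published method.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=50, tol=1e-10):
    """Fit y ~ X @ beta by iteratively reweighted least squares (Huber weights)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust residual scale from the median absolute deviation (MAD).
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        if scale < 1e-12:           # residuals essentially zero: nothing to reweight
            break
        u = np.abs(r) / scale
        # Weight 1 for small residuals, delta/u for large ones (down-weighted).
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Hypothetical dilution series: log2 signal falls by 1 per two-fold dilution step.
steps = np.arange(8, dtype=float)
noise = np.array([0.02, -0.01, 0.03, 0.0, -0.02, 0.01, -0.03, 0.02])
log_signal = 10.0 - steps + noise
log_signal[3] += 5.0                # one contaminated spot (outlier)

X = np.column_stack([np.ones_like(steps), steps])
ols_beta = np.linalg.lstsq(X, log_signal, rcond=None)[0]
robust_beta = huber_irls(X, log_signal)

# The intercept plays the role of the (log) expression level of the sample.
print("OLS intercept:   ", ols_beta[0])
print("Robust intercept:", robust_beta[0])
```

In this toy series the single contaminated spot pulls the ordinary least squares intercept away from the underlying value of 10, while the reweighted fit stays close to it.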


Acknowledgements
The research presented in this thesis reflects the joint work in Signal Processing with two
areas: molecular biology and bioinformatics. The characteristics of data processing
required in genomics and in proteomics highlighted the need of a tight collaboration
between a biological group (placed where patients need care and where the samples are
acquired and evaluated) and a computer science oriented group. This work was begun in
August 2000 at Institute of Signal Processing, Tampere University of Technology,
Tampere, Finland, and continued beginning in 2003 at the Cancer Genomics Core
Laboratory, The University of Texas M.D. Anderson Cancer Center, Houston, Texas.
My gratitude goes foremost to my supervisor Prof. Ioan Tabus from Tampere
University of Technology, who helped me in each and every aspect of
my work and life in Finland. I appreciate his support, and credit him with correctly
pointing to the research I needed to do. I thank him for his never-ending willingness to
help, and especially for his expertise in the field of signal processing. My work with Prof.
Ioan Tabus as mentor transformed me from a student into a researcher.
I am particularly thankful to Prof. Dr. Wei Zhang, Director of the Cancer Genomics Core
Laboratory and my supervisor on the biological side of this research, for his support and
devotion to bringing new ideas to life. M.D. Anderson is a first-rate institution in
cancer treatment, and the core laboratory led by Dr. Zhang is certainly one of the
research leaders in genomics. My work with the group of Dr. Zhang resulted in major
articles published during the last two years. I thank him for his hard work in guiding me and
polishing my scientific attitude, for sharing his scientific expertise, and for his managerial
qualities.
I owe special gratitude to Prof. Ilya Shmulevich, my supervisor in the mathematical,
statistical, and algorithm research performed at M.D. Anderson. I met him a couple of
years ago, when he was a lecturer on Nonlinear Filters at the Institute of Signal Processing.
His clear explanations guided me and changed my way of understanding algorithmic
processes, and even life. I have become simpler and more organized, in the sense of
Rissanen's modeling. The understanding of processes in molecular biology necessitates
visualization and an organized "logical trend." Ilya brought and catalyzed hundreds of ideas
in our research, and it is a pleasure to have him as a supervisor, as a teacher, and as a
friend. I also learned from him that "a sustained good work in research is always
rewarded sooner or later".
I respectfully thank Acad. Prof. Jaakko Astola for his support and advice regarding
my research and for sharing his ideas as co-author in publications included in this thesis.


I appreciate the generosity of Tampere Graduate School in Information Science and
Engineering (TISE). In particular, I thank Prof. Markku Renfors, Director, who granted
part of my salary and funded the scientific travels during this period. I send many warm
thanks for the clear and consistent help of Dr. Pertti Koivisto. The funding received from
Academy of Finland is gratefully acknowledged.
Special thanks to my co-authors Ioan Tabus, Wei Zhang, Ilya Shmulevich, Jaakko
Astola, Gregory N. Fuller, Ciprian D. Giurcaneanu, David Cogdell, Eun-Ju Lee, Ellen
Taylor, Yu Jia, Woonyoung Choi, Rongcai Jiang, Raymond Sawaya, Tohru Kobayashi,
Hiroshi Shiku, Motoko Yamaguchi, Kenneth R. Hess, Chang H. Rhee, Kenneth D.
Aldape, Janet M. Bruner, Stanley Hamilton, to my colleagues Latha Ramdas, Sarah
Dunlap, Harri Lähdesmäki, Matti Nykter, Antti Niemistö, Li Mei, Kenji Tada, and many
others who helped me in numerous situations.
My presence at Tampere University of Technology I owe to Prof. Pauli Kuosmanen
and Prof. Corneliu Rusu. Both warmly recommended me to participate in research at
Tampere University of Technology and complete this thesis in Signal Processing.
I am grateful to those few close friends who encouraged me, and for the constant
encouragement that came from afar, from unchanged, deeply warm hearts.
I express my deep gratitude to my parents Maria and Gheorghe for every moment of
life, especially for educating me; to Bogdan and Blandine for their support.
Finally, I deeply thank my wife Crina for her help and her patience during my long
working nights.


Abbreviations

Abbreviation    Description
AKTpThr308      phosphorylated AKT at threonine 308
BadpSer136      phosphorylated Bad at serine 136
BSS/WSS         differentially expressed genes using between sum of squares and within sum of squares
c-Abl, Cabl     v-abl Abelson murine leukemia viral oncogene homolog 1
CCND1           cyclin D1, cyclin D3
CDK4 (CDK6)     cyclin-dependent kinase 4, cyclin-dependent kinase 6
CDKN2A          cyclin-dependent kinase inhibitor 2A
EGFR            epidermal growth factor receptor
EGFRpTyr845     phosphorylated epidermal growth factor receptor at tyrosine 845
IGFBP2          insulin-like growth factor binding protein 2
IGFBP5          insulin-like growth factor binding protein 5
IκBα            inhibitory kappa B protein alpha
LOH             loss of heterozygosity
MDM2 (MDM4)     transformed 3T3 cell double minute 2, p53 binding protein; (double minute 4)
MET             met proto-oncogene (hepatocyte growth factor receptor)
MMP9            matrix metalloproteinase 9
MYC-C           v-myc myelocytomatosis viral oncogene homolog
NF-κB           nuclear factor of kappa light polypeptide gene enhancer in B-cells 1 (p105)
NHGRI           National Human Genome Research Institute
p-BAD           phosphorylated BCL2-antagonist of cell death
PCR             polymerase chain reaction
PDGFR           platelet-derived growth factor receptor
p-EGFR          phosphorylated epidermal growth factor receptor
PI3K            phosphatidylinositol 3-kinase
Pi3k            phosphoinositide-3 kinase
p-RB            phosphorylated retinoblastoma protein
RB1             retinoblastoma 1
TP53            tumor protein p53
VEGF            vascular endothelial growth factor
WHO             World Health Organization


Chapter 1
Historical perspective of the "Central Dogma" of biology
"Dans les champs de l'observation le hasard ne favorise que les esprits préparés."
Louis Pasteur¹
In a metaphoric sense, the present is a reflection of the values of the past, and the
present creates deterministic reasons for the future. Creative arrangements and ideas
are more effective when based on a currently accepted framework supported by past
experience.
As we approach the maturation of the field of genetics, we see an unprecedented
trend of knowledge accumulation. The next decades are expected to bring
systematization and easy use of the large amount of data generated, and therefore
extensive application of new knowledge. This chapter offers a brief survey of the
historical laws in molecular biology.
As long as models reflect real events, they help our understanding by allowing us to
predict behaviors. We need the models to guide experimental design and anticipate
results. Rissanen says that only humans need models and modeling, since intrinsically
the real events happen without our defined model.
The set of statements called the "Central Dogma" was made more than 40 years ago, as a
way of explaining the world of information transfer in cells. This theory asserts that
information flows from chromosomal DNA through RNA to protein. More than a model,
the Central Dogma has become a paradigm; Adam Smith² affirmed that paradigms are
like "water to the fish," a generally accepted judgment. Of course, paradigms may be
detrimental to the generation and acceptance of new ideas: "When we are in the middle
of the paradigm, it is hard to imagine any other paradigm." Accepted by the scientific
community, the Central Dogma has inspired researchers in structuring entirely new
fields of genetics, molecular biology, systems biology, and pharmacology.
Historically, knowledge is a succession of discoveries followed by assimilation. There
were a number of competing theories before the true crystalline structure of DNA was
determined. After extensive experiments in crystallography, on April 25, 1953 Nature
magazine published the article describing the double-helical structure of DNA by Watson

1 Lecture, University of Lille (December 7, 1854). In H. Eves, Return to Mathematical Circles, Boston:
Prindle, Weber and Schmidt, 1988; translated as "Chance favors only the prepared mind."
2 Adam Smith (1723-1790) was a Scottish economist and moral philosopher. His "Inquiry into the Nature and
Causes of the Wealth of Nations" was one of the initial studies of industry and commerce development.


and Crick [10],[11]. The manuscript explained for the first time the X-ray diffraction images
obtained from DNA and was supported by an article by Wilkins et al. in the same issue [12].
Five years later, in 1958, Crick [5] described the hypothesis for molecular processes and
the informational flow between the three families of polymers: DNA, RNA, and protein.
From the beginning, it was clear that not all informational flows possessed the same
probability. Crick originally represented the informational flow as DNA → RNA →
PROTEIN. This suggests that life was traceable to DNA. Reviewing the hypothesis in 1970, Crick
[4] summarized the work of Temin and Mizutani [9] and Baltimore [1] showing
that an RNA tumor virus can use viral RNA as a template for DNA synthesis, and
observed that no major arguments against the Central Dogma had appeared after
twelve years of genetic research.

Figure 1.1. Schematic of the Central Dogma of molecular biology. (a) All possible simple transfers between
the three families of polymers; arrows represent the directional flow of detailed sequence information.
(b) The arrows assign different probability rates to the directional flows (as postulated by Crick in 1958):
solid lines represent probable transfers, dotted arrows possible transfers, and absent arrows impossible
transfers. Of all the possible combinations presented in panel a, the main flow in panel b is
DNA → RNA → Protein.

If one considers the un-wound linear structure of the three families of polymers, all
possible informational inter-connections (see Figure 1.1 a.), including bi-directional
transfers, must be initially taken into consideration [4]. Given the information available
to researchers at the time, the likelihood of certain transfers was clearly dissimilar due to
known conformational three-dimensional crystalline structures. By inspecting the
experiments and considering his own laboratory experience, Crick [5] grouped the
informational transfers into three typological classes. The first group is formed by
transfers for which experimental evidence, direct or indirect, existed. These transfers
commonly occurred in all known cell types:
I (a) DNA → DNA (via the cell's DNA polymerase)
I (b) DNA → RNA
I (c) RNA → Protein
I (d) RNA → RNA (presumed in 1958 [5] to take place because of the existence of RNA viruses, and
retracted in 1970 [4])


For the second group of transfers there was neither any experimental evidence
nor any strong theoretical requirement.
II (a) RNA → DNA (Temin [9] and Baltimore [1]; also RNA → DNA → RNA via reverse transcriptase)
II (b) DNA → Protein

Years after the publication, Temin's indirect evidence [9], corroborated by the
experiments of Baltimore [1], showed the presence of a specific enzyme in RNA tumor
virus particles that makes a DNA copy from RNA. The publication of this work was met
with a generally hostile reception mainly because of the accepted Central Dogma
paradigm. In 1975, Temin, Baltimore, and Dulbecco shared the Nobel Prize for
discovering the enzyme, reverse transcriptase, responsible for copying the information in
a strand of RNA into DNA.
The third group of transfers was not known to occur [5].
III (a) Protein → Protein
III (b) Protein → RNA
III (c) Protein → DNA
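The three groups of transfers listed above can be written down as a small lookup table. This is only an illustrative encoding of the classification as it is presented in this chapter; the names `TRANSFER_CLASS` and `classify_transfer` are hypothetical and not part of any published tool.

```python
# Crick's grouping of sequence-information transfers, as listed in the text.
# Keys are (source polymer, destination polymer); values are the group labels.
POLYMERS = ("DNA", "RNA", "Protein")

TRANSFER_CLASS = {
    ("DNA", "DNA"): "I",          # replication
    ("DNA", "RNA"): "I",          # transcription
    ("RNA", "Protein"): "I",      # translation
    ("RNA", "RNA"): "I",          # presumed in 1958 because of RNA viruses
    ("RNA", "DNA"): "II",         # later shown to occur via reverse transcriptase
    ("DNA", "Protein"): "II",
    ("Protein", "Protein"): "III",
    ("Protein", "RNA"): "III",
    ("Protein", "DNA"): "III",
}

def classify_transfer(source, destination):
    """Return the group (I, II, or III) assigned to a transfer in the 1958 scheme."""
    return TRANSFER_CLASS[(source, destination)]

# Every ordered pair of polymers, including self-transfers, is covered once.
all_pairs = {(s, d) for s in POLYMERS for d in POLYMERS}
assert all_pairs == set(TRANSFER_CLASS)

for (s, d), group in sorted(TRANSFER_CLASS.items(), key=lambda kv: kv[1]):
    print(f"{group:>3}: {s} -> {d}")
```

Group III contains exactly the transfers that start from protein, which is the "negative statement" of the Central Dogma discussed below.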

There are several factors that make Crick's paper remarkable (see [4]): (1) There are no
assumptions regarding the machinery or how the transfer is made. The accuracy is
considered high and possible errors are not discussed. (2) Control mechanisms are not
considered, nor the rate at which the processes work. (3) The organisms under discussion
are present-day organisms and the paper [5] was not intended to apply to events in the
remote past. (4) The Central Dogma is essentially a negative statement, saying that
transfer of genetic information from protein to other polymers does not exist. These
statements from 1970 [4] try to clarify the disagreements published in several papers
during that year.
For many decades, it was thought that there were no mechanisms that allow
information to flow from protein back into RNA. Once protein is synthesized from RNA,
the information was considered trapped at the protein level. This hypothesis seemed
reasonable in light of the degeneracy of the genetic code; in most cases more than one
nucleotide triplet specifies an amino acid. Recent studies [2] provide
emerging evidence for the role of conformational plasticity in protein-protein
interactions, and mechanisms also exist for a type of protein replication by means of
prions [8]. The mechanism involves a specific isomer of a normal protein that is homologous to a
prion. When the organism is infected, the prions interact with this homologous isomer of the
protein, forcing it to become another prion. Although this lies more on the side of post-translational
processes, by this mechanism certain proteins can replicate and an informational
transfer is made. These findings conflict with, and now complete, the Central Dogma postulated in 1958.


The transfers from DNA to RNA use the same type of encoding (nucleic acids) and
therefore are called transcription. When the information flows from RNA to DNA,
a flow catalyzed by an enzyme called reverse transcriptase, the process is called reverse
transcription. Protein synthesis, directed by RNA, is called translation [3]
because an essentially different encoding of the information is used by
proteins (amino acids instead of nucleic acids). For both translation and transcription,
the molecular apparatus is very complex and involves many different RNA and protein
molecules.
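The difference between the two encodings can be sketched in a few lines of code. The example is deliberately simplified and rests on stated assumptions: sequences are given in the coding-strand sense (so transcription reduces to replacing T with U, and reverse transcription to the opposite substitution, ignoring strand complementarity), and `CODON_TABLE` is only a small excerpt of the standard genetic code chosen for illustration.

```python
# Excerpt of the standard genetic code (RNA codons -> one-letter amino acids).
# "*" marks a stop codon. Only a few codons are included for illustration.
CODON_TABLE = {
    "AUG": "M",
    "UUU": "F", "UUC": "F",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",
}

def transcribe(coding_dna):
    """DNA -> RNA: same nucleic-acid alphabet, T replaced by U (coding-strand view)."""
    return coding_dna.upper().replace("T", "U")

def reverse_transcribe(rna):
    """RNA -> DNA: the inverse substitution (simplified, coding-strand view)."""
    return rna.upper().replace("U", "T")

def translate(rna):
    """RNA -> protein: read triplets until a stop codon; unknown codons become 'X'."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCTTTTTAA")
print(mrna, "->", translate(mrna))
# A different RNA sequence encoding the same peptide (degeneracy of the code):
print("AUGGCCUUCUGA", "->", translate("AUGGCCUUCUGA"))
```

Because several triplets map to the same amino acid, the protein sequence alone does not determine the RNA sequence it came from, which is the degeneracy argument mentioned earlier in the chapter.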
Ultimately, the Central Dogma is not sufficient to explain the complexity of a cell or
organism. The sequencing of the entire human genome opened the era of further
understanding of information flow in cells and organisms. The next step is the
elucidation of the relationship between DNA genes and other regions of DNA and the
products of genes: how and when transcription and translation occur, and how post-translational modifications to proteins ultimately define a cell. Strong opposition to the
gene-centric paradigm also arises from symbiosis phenomena (the merger of two
organisms into one without inheritance of genes). By exploiting the power of cooperation
rather than competition, symbiosis can allow an evolutionary jump that might otherwise take a million
years of individual trial and error [6].
Although static as a concept, the "Central Dogma", which asserts that
information flows from DNA nucleotide sequence to messenger RNA (mRNA) and is then
translated to the specific amino acid sequence of a protein, has had a large positive impact
over the last decades on modeling the machinery of intracellular mechanisms.
References:
[1] Baltimore D. Viral RNA-dependent DNA polymerase. Nature 226: 1209-1211, 1970.
[2] Buck E, Iyengar R. Organization and functions of interacting domains for signaling by
protein-protein interactions. Sci STKE 2003(209): re14, 2003.
[3] Crick FHC, Barnett L, Brenner S, Watts-Tobin RJ. General nature of the genetic code for proteins.
Nature 192: 1227-1232, 1961.
[4] Crick FHC. Central Dogma of Molecular Biology. Nature 227: 561-563, 1970.
[5] Crick FHC. In Symp. Soc. Exp. Biol. The Biological Replication of Macromolecules, XII, 138
(1958).
[6] Kelly K. Out of Control: The New Biology of Machines, Social Systems, and the Economic
World. Perseus Books, 1995.
[7] Kornberg A. Biologic synthesis of deoxyribonucleic acid. Science 131: 1503-1508, 1960.
[8] Prusiner SB. The Prion Diseases One Protein, Two Shapes. Scientific American 272(1): 48-57,
1995.
[9] Temin HM, Mizutani S. RNA-dependent DNA polymerase in virions of Rous sarcoma virus.
Nature 226: 1211-1213, 1970.
[10] Watson JD, Crick FHC. Genetical implications of the structure of deoxyribonucleic acid. Nature
171: 964-967, 1953.
[11] Watson JD, Crick FHC. Molecular structure of nucleic acids: A structure for deoxyribose nucleic
acid. Nature 171: 737-738, 1953.
[12] Wilkins MHF, Stokes AR, Wilson HR. Molecular structure of deoxypentose nucleic acids. Nature
171: 738-740, 1953.
[13] Yanofsky C, Carlton BC, Guest JR, Helinski DR, Henning U. On the colinearity of gene structure
and protein structure. Proc. Natl. Acad. Sci. USA 51: 266-272, 1964.

Chapter 2
Genomics and proteomics in diagnosis of disease and
prognosis
Molecular profiling completes disease diagnosis and prognosis. Clinical medicine,
classically described as disease diagnosis, medication, side effects, and management of
outcome, has evolved dramatically with the advent of genomics and proteomics.
Sequence analysis of genomes, discovery of structures and pathways, and gene and protein
expression profiling have aided cancer evaluation and therapy and have enhanced patient
prognosis. In addition, these applications support pharmacology by incorporating
pharmacokinetics and pharmacodynamics into drug development.
In the mid-1990s the initial phase of bioinformatics concentrated on only a few subjects
of relevance to sequence analysis; now, bioinformatics is essential to the medical,
pharmaceutical, and biological fields. The development of technologies and the broader
understanding of the genetic pathways in the development of tumors opened the
possibility of correlating molecular data with clinical outcome, survival, and response to
treatment modalities. This chapter will review and discuss the relations between
technologies and diagnosis. It is clear that the combined use of high-throughput
techniques in a comprehensive system will better facilitate our understanding of the
genetic complexities inherent in cancer and will revolutionize cancer therapy.
"Why search for differences between diseases at the molecular level?"
Molecular profiling comprises individual applications of mRNA expression, proteomic,
and metabolomic measurements, and combinations of techniques used to characterize
the state of a cell or a tissue [21]. We are looking for patterns that predict or identify
sub-phenotypes of disease that should allow clinicians to make more informed decisions
about therapy and ultimately allow design of drugs suited to a particular disease
genotype.
Consider the general case of a data set and its compression. Good
compression rates are obtained when an appropriate model is used for the patterns
contained in the file. Therefore, in addition to minimizing the size of the file, a good
compressor makes use of the model and extracts information. This duality
(model-information) is essential to information theory.
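A small experiment makes the link between model and compression concrete. The sketch below uses the general-purpose `zlib` compressor from the Python standard library purely as a stand-in for "a compressor with a model"; it is not the modeling approach used in this thesis, and the two toy sequences are invented for illustration.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Number of bytes after zlib compression at the highest compression level."""
    return len(zlib.compress(data, 9))

n = 4000
# A sequence with an obvious pattern: the compressor's model captures it easily.
patterned = b"ACGT" * (n // 4)
# A sequence over the same four-letter alphabet with no pattern to exploit.
rng = random.Random(0)
unstructured = bytes(rng.choice(b"ACGT") for _ in range(n))

print("original length:      ", n)
print("patterned compressed: ", compressed_size(patterned))
print("unstructured compressed:", compressed_size(unstructured))
```

Both inputs have the same length and alphabet, yet the patterned one shrinks far more, because its regularity is exactly what the compressor's internal model can describe briefly.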
Molecular profiling analysis is based on the discovery of good predictors. With features represented
by mRNA transcriptional or protein translational expression levels, the goal is to make use of
the information content that describes the disease classification. Each feature brings a
certain amount of information about the disease and the state of the cell. In other words,
while considering a given label-set of disease, invasiveness, or survival terms, if the
classification can be described in a short manner or by a small number of features, the
information content is of low complexity. Descriptions that require more space are of
higher complexity or information content.
We certainly know that only a small number of the features (genes or proteins) are
involved in cancer, because most other features regulate non-related mechanisms (e.g.,
they are involved in normal metabolism or normal function of the cell, or are proteins
that consolidate and define the internal structure of the cell). The retained small set in
our discussion represents the patterns linked to the cancer phenotype.
The information based on class labels and the information based on the expression
values assigned to the labels both characterize the information retrieval process. When
we talk about features, we mean the genes, proteins, phosphorylation states, etc., that
carry information and are identified during a process called feature selection. When we
discuss patterns, we refer to situations in which the algorithm first maps the space of
features, so that the selection is made in a mapped space (e.g., by principal component
analysis, PCA, or by multidimensional scaling) and the interpretation of the selection is
not straightforward. It is worth looking ahead to some of the consequences of successful
selection of predictors.
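The distinction between selecting features and selecting patterns in a mapped space can be illustrated on synthetic data. The sketch below is an assumption-laden toy example: the data are simulated, the BSS/WSS ratio is used as one possible feature-ranking criterion, PCA is computed with a plain singular value decomposition, and the function names are hypothetical.

```python
import numpy as np

def bss_wss(X, y):
    """Between-class over within-class sum of squares for each column (gene) of X."""
    overall_mean = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        class_mean = Xk.mean(axis=0)
        bss += Xk.shape[0] * (class_mean - overall_mean) ** 2
        wss += ((Xk - class_mean) ** 2).sum(axis=0)
    return bss / wss

def pca_scores(X, n_components=2):
    """Project centered data onto the leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components], Vt[:n_components]

# Simulated expression matrix: 20 samples (two classes) x 50 genes.
rng = np.random.default_rng(0)
n_per_class, n_genes = 10, 50
y = np.repeat([0, 1], n_per_class)
X = rng.normal(size=(2 * n_per_class, n_genes))
informative = [3, 17, 42]                      # hypothetical marker genes
X[np.ix_(y == 1, informative)] += 5.0          # shifted expression in class 1

# Feature selection: the result is a list of genes, directly interpretable.
selected = np.argsort(bss_wss(X, y))[::-1][:3]
print("selected genes:", sorted(selected.tolist()))

# Pattern selection: each component mixes all genes through its loadings.
scores, loadings = pca_scores(X)
print("scores shape:", scores.shape, "loadings shape:", loadings.shape)
```

The selected features are named genes, whereas each principal component is a weighted combination of all 50 genes, which is why the interpretation of a selection made in the mapped space is less direct.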
Molecular profiling may be conducted at several molecular levels by means of
numerous technologies that measure gene-expression levels, protein levels, or
metabolites. This thesis analyzes improvements in cDNA microarray and reverse-phase
lysate array technologies. The "information-model" duality recommends that features or
patterns from cells and tissues that discriminate disease progression, invasiveness,
survival terms, or disease classification must also satisfy the following requirements:

Identify markers that visualize the affected cells or tissues. Despite the numerous
methods of tumor visualization, including computed tomography (CT), magnetic
resonance imaging (MRI), and positron emission tomography (PET), these
techniques may give uncertain results. Tumors are cells that behave in a different
manner from normal tissue, yet it is often difficult to distinguish between benign and
malignant. In the initial stages, the changes manifest at transcriptional,
translational, or post-translational levels. Only in late stages of disease do cells
display visible histological and/or structural changes. Consequently, markers are
still needed to discriminate cancerous from non-cancerous cells.
The current markers used in PET and CT are based on an (increased) metabolic
activity of cancer tissues [5][12]; MRI detects density differences (based on water
content) [11]. In an ideal case clinicians generate prognosis and diagnosis based on
the size and the anatomic and histologic characteristics of the malignancy. However, in
reality, a clear line between benign and malignant cells is hardly detectable.

Discover new possible targets. If the selected features (genes, proteins, or
metabolites) or patterns identify the tumor, then new drugs must be able to
target these same features. Therapy should modulate the production of these specific
events via opposed pathways or restore the balance directly at the source of the
imbalance.

Generate models. The selected features or patterns compose the active nodes in a
graph representation of a system. The model is defined by the arrangement of
elements and the interconnections that relate them. Arcs are proposed to represent the
modulations between the molecular components. If studied under impulses such as the
knock-out of a gene¹ or malignancy², the response of the system might help elucidate
the system structure and therefore generate knowledge concerning the mechanisms
of cancer.

Eventually lead to the pathogenesis. In several cancer types the pathogenesis is
known (e.g., cervical cancer is due to human papillomavirus), but for most
malignant diseases the etiology is unknown. We know only a set of risk factors, but
not the exact origin of the disease.
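The "Generate models" requirement above, in which selected features become nodes of a graph and a knock-out acts as an impulse, can be sketched with a toy Boolean network. This is a minimal sketch under strong assumptions: Boolean dynamics are only one possible modeling choice, the four nodes A to D and their update rules are invented for illustration, and the function names are hypothetical.

```python
# Toy regulatory graph: A activates B, C represses B, B activates D.
# Each rule computes a node's next state from the current state of the network.
RULES = {
    "A": lambda s: s["A"],                  # external input, held constant
    "B": lambda s: s["A"] and not s["C"],   # activated by A, repressed by C
    "C": lambda s: s["C"],                  # external input, held constant
    "D": lambda s: s["B"],                  # downstream readout of B
}

def step(state, rules, knockout=frozenset()):
    """One synchronous update; knocked-out nodes are clamped to False."""
    new_state = {node: bool(rule(state)) for node, rule in rules.items()}
    for node in knockout:
        new_state[node] = False
    return new_state

def steady_state(state, rules, knockout=frozenset(), max_steps=20):
    """Iterate until the state stops changing; return None if no fixed point is reached."""
    state = dict(state)
    for node in knockout:
        state[node] = False
    for _ in range(max_steps):
        next_state = step(state, rules, knockout)
        if next_state == state:
            return state
        state = next_state
    return None

initial = {"A": True, "B": False, "C": True, "D": False}
wild_type = steady_state(initial, RULES)
c_knockout = steady_state(initial, RULES, knockout={"C"})
print("wild type:  ", wild_type)
print("C knock-out:", c_knockout)
```

Comparing the two steady states shows how the response to a knock-out reveals structure: removing the repressor C switches on B and, one step later, D.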

Major characteristics of cancer cells


Certain general characteristics of cancer can be listed despite the large number of
pathological specificities resulting from many different genetic changes. These
modifications concern mainly the cells; therefore, we can define the cancer as mainly a
cellular disorder. One major characteristic of cancer cells is the lack of contact inhibition.
The uncontrolled and disorganized growth is called neoplasia, a pathologic process in
which a permanent alteration in a cell's growth-controlling mechanism permits its
continuous proliferation. Neoplasia literally means "new growth"; the abnormal growth
of a tissue or organ is called a neoplasm or tumor. Normal healthy cells can detect
cellular borders and inhibit their division when in contact with other normal cells or the
culture dish. The tendency of normal cells in tissue cultures is to grow in a single layer.
Cancer cells do not limit division due to contact. Neoplasia and cancer are often used
interchangeably, although technically the latter refers specifically to malignant neoplasia.
A related term, anaplasia, is the failure of structural differentiation within a cell or group
of cells. Often it refers to a group of cells with increased capacity for multiplication,
malignant growth, and an inability to function normally.

1 This refers to the impulse response in system theory.
2 Normal and malignant might define two distinct states of the same system.


To accomplish their unchecked growth and the transport of metabolites, tumors are
complemented by complex vascularization. The growth of new blood vessels is called
angiogenesis. In the healthy body, angiogenesis occurs in the process of healing wounds,
for restoring blood flow to tissues after injuries. In malignant growth, the process of
angiogenesis is not regulated and new blood vessels spread to supply the growing tumor's
nutrient demands. A relatively new class of drugs, called angiogenesis inhibitors,
prevents abnormal vascular proliferation and therefore acts on the malignant tissue by
cutting off the nutrient supply and slowing tumor growth.
[Figure 2.1 diagram. Title: Targeted Drug Discovery. Stage labels: Disease model; Target identification;
Target validation. Box labels: Clinical samples; Patients; Disease tissue expression; Genomics/proteomics
associations; Forward genetics & reverse genetics; Modulation on molecular mechanisms; Modulation on
systemic mechanisms; In vitro models; In vivo models; Clinical. Branch labels: Molecular approach;
Systemic approach.]

Figure 2.1. Targeted drug discovery approaches are categorized based on molecular or systemic origin.
Three stages define the drug discovery process: the identification of disease, target identification, and target
validation. If the studied cells are components of a biological system, discovery is through a systemic
approach and in vivo studies are used. In the molecular approach, studies are initially performed in vitro and
the modulation is studied at single-cell level. Target validation is carried out using cell cultures or animal
models and is followed by careful clinical studies.

Another phenomenon is related to the capacity of tumor cells to detach from the
primary site and spread to other organs. Cancer cells tend to be more motile and possess
intrinsic differences in adhesion characteristics from normal cells. Cancerous cells
typically metastasize to distant locations by penetrating through basement membranes
into lymphatic and blood vessels, then circulate through the bloodstream and grow at
distant loci elsewhere in the body.


Efforts in molecular targeting


Over the years, the pharmaceutical industry has developed a highly successful business
model for making new medicines. Currently, there are an estimated 13,000 prescription
drugs on the market [1] and an imminent expiry of patents for numerous best selling
products.
Discovery, evaluation, and validation stages are expensive and difficult. Academic
research often covers the early steps of this business model, while industry
concentrates on the later stages of drug development and optimization, and on efficient
production. The complexity of current anti-cancer drug discovery and the cost of related
research suggest that the pharmaceutical industry will not achieve the same income rate for
genetically specific cancer therapies as it has in the past for traditional drugs [14].
This suggests a need for new models in drug discovery, drug evaluation, and
validation stages. The financial interests lean toward merging the efforts in research and
distributing the risk. By reorganizing their core competencies, academia, biotechnology
companies, and the pharmaceutical industry, in a combined effort, may more efficiently
address the existing delays in anticancer drug discovery and development [14].
The steps required for new drugs are shown in Figure 2.1: discovery of the potential
targets, the evaluation and validation of these targets, and production and optimization,
completed by coherently developed therapeutic strategies.
The studies on the genetic basis of differential responses of patients to drugs resulted
in the evolution of pharmacogenetics³ into pharmacogenomics⁴ [24]. In this evolving
framework, target identification involves the identification and characterization of new proteins⁵
whose modulation might inhibit or reverse disease progression. Changes at the molecular
level of genomics and proteomics are correlated with the disease class from clinical
samples, cells developed in vitro, or cells or tissues obtained from in vivo models.
Good clinical samples (meaning that the samples support the representative medical
features needed for discrimination) are required for effective target identification.

Personalized medicine
Genomics, proteomics, and other '-omics' technologies have revealed a complexity among
cancers that makes almost every tumor genetically unique. Effective targeted therapies
might be better suited for small subgroups of patients, who require a personalized
approach for discovery. Research groups with interests in multiple areas, rather than a
single production-oriented research line, may prove more effective in finding solutions for

3 The study of how a patient's genetic make-up affects the response to medicines.
4 The study of the effect of an individual's genotype on the response to medications.
5 The targets are typically proteins, but are not restricted to these.


targeted drugs. We can observe two tendencies manifested in cancer growth at the
molecular level. These trends situate the researchers at opposite positions. One tendency
is that cancers that arise in different locations share alterations of the same genes or
pathways. In this case, common therapeutic targets may regulate various cancer types. In
the second case, cancers arising from the same tissue of origin consist of a complex
combination of several different genetic alterations that uniquely define a genetic
subclass of that cancer [14]. The therapeutic targets that are active in each situation may
not overlap.
Initially, cancer therapies were dictated by the organ system of origin. For example,
malignancies arising in a specific organ were grouped together as one single disease, and
thus received the same therapy. The testing of new therapeutic agents and the
chemotherapy prescribed to the patients typically followed the same judgment.
Since the evolution of molecular biology, cancer is no longer classified based solely
upon location [2][15][13][17]. Other histological data and information about genetic
alterations contribute to the diagnosis. Studies based on large sets of patients
demonstrated that the knowledge about targets and their relevance in human cancers is
incomplete. There is a clear need to turn attention to patient characteristics, such as
individual genetic composition, in addition to histological and tissue characteristics.
Therefore, a (small) representative subpopulation is needed for the medical classification
so that we can use the advantages of highly specific, targeted drugs [14][18].
These tendencies lean toward personalized therapies and treatment in cancer
[16][22][9]. It is not a trivial task to provide targeted drugs (or a targeted combination) to
those patients whose tumors carry the relevant genetic alterations. For this purpose, the
patients need to be catalogued by the number and the type of alterations, and the data
stored in standardized databases. Individualized diagnoses and treatments with the
specific medications can then be made.
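The cataloguing step described above can be sketched as a small data structure. This is only an illustrative sketch, not a system described in this thesis: the class names, the field names, and the agent-to-gene mapping (`agent_A`, `agent_B`) are hypothetical, and the example record is invented.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alteration:
    gene: str  # gene symbol, e.g. "EGFR"
    kind: str  # type of alteration, e.g. "mutation", "amplification", "deletion"


@dataclass
class PatientRecord:
    patient_id: str
    tissue_of_origin: str
    alterations: list = field(default_factory=list)

    def count_by_kind(self):
        """Number of alterations of each type carried by this patient."""
        return Counter(a.kind for a in self.alterations)


def candidate_agents(record, agent_targets):
    """Return the agents whose target gene is altered in this patient.

    agent_targets maps an agent name to the gene it targets; the mapping
    is supplied by the caller and is purely illustrative here.
    """
    altered = {a.gene for a in record.alterations}
    return sorted(agent for agent, gene in agent_targets.items() if gene in altered)


# Hypothetical example data (not taken from any clinical source).
record = PatientRecord(
    patient_id="P001",
    tissue_of_origin="lung",
    alterations=[Alteration("EGFR", "mutation"), Alteration("TP53", "deletion")],
)
agents = {"agent_A": "EGFR", "agent_B": "PTEN"}

print(record.count_by_kind())
print(candidate_agents(record, agents))
```

A standardized database would need agreed vocabularies for the gene symbols and alteration types; the free-text strings used here are a simplification.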
The motor of all business-models is economic success and thus several economic
concerns of personalized medicine need to be discussed here. The vision of personalized
medicine raises the difficulties of a small and fragmented market. If the market is divided
into smaller, genetically stratified segments, block-buster drugs will become rarer and
the pharmaceutical sector might suffer and become less attractive to investors [22].
Production of personalized drugs, even specific combinations of typical compounds,
would need to take place after the diagnosis stage using a model of "drug on demand."
This is a complex system able to analyze large amounts of data, and requires
communication between diagnostic systems and clinicians and an interconnected
informational flow. It could be an "all-in-one" service provided by one single company or
agency, or separate systems with a standardized communication protocol. Currently,
diagnosis, target discovery, validation, and drug production are provided by separate
institutions (hospitals, research laboratories, drug industry, etc.).

[Figure: two treatment schemes. a) General therapy: patients with disease originating from a certain tissue type are treated from a pool of possible agents; chances of success: low. b) Individualized therapy: patients with disease originating from a certain tissue type undergo a genetic screen and receive a personalized drug from a pool of agents specific for genetic alterations; chances of success: high.]


The sudden increase in drug entities, coupled with a smaller market for each, shows
that the envisioned system would need to be universal, producing a large number of
personalized drugs. Also, the model of "drug on demand" will change the infrastructure
of present companies. Additional issues, such as protecting the patient's genetic data, and other
ontological6 grounds make this subject delicate. A positive example is imatinib, developed
by Novartis; this successful chronic myelogenous leukemia drug provides confidence in
targeted drugs as a business model. Although prescribed for less than 1% of US cancer
patients, the market for imatinib was ~US$ 1.2 billion in 2003 [16].
Currently, pharmaceutical companies are seeking targeted drugs in order to replace
traditional, less effective therapeutics. In previous years, hormonal,
cytotoxic, and targeted drugs each accounted for about one third of profits (Figure 2.3). The
research in drug development has changed such that targeted drugs will cover more than
two-thirds of the income on cancer products in the coming decade. Although the number
of potential targeted therapies in cancer is much higher than for other diseases, the
market lists only a limited number of targeted drugs. Table 2.1 lists the targeted
cancer therapy agents analyzed in a recent publication [14] and approved since the year
2000.
Table 2.1 New agents for cancer therapy marketed since 2000. The discovery of novel drugs is considered
one of the most difficult scientific challenges of our times, and both pharma and academia have realized that
many diseases are unlikely to be cured exclusively by either one on their own [14].
Novel agents for cancer therapy

Drug name | Trade name (company) | Indication | Originator | Year | Web page
Pemetrexed | Alimta (Eli Lilly) | Mesothelioma | Eli Lilly | 2004 | http://www.alimta.com/
Bevacizumab | Avastin (Roche) | Colorectal cancer | Genentech | 2004 | http://www.drugdevelopmenttechnology.com/projects/avastin/
Clofarabine | Clolar (Genzyme) | Acute lymphocytic leukemia | Southern Research Institute | 2004 | http://www.clolar.com/
Cetuximab | Erbitux (Bristol-Myers Squibb) | Colorectal cancer | University of California & ImClone | 2004 | http://www.erbitux.com/
Erlotinib | Tarceva (Genentech) | Non-small-cell lung cancer | OSI Pharmaceuticals & Pfizer | 2004 | http://www.tarceva.com/
Gefitinib | Iressa (AstraZeneca) | Non-small-cell lung cancer | AstraZeneca & Sugen | 2003 | http://www.iressa.com/
Bortezomib | Velcade (Millennium) | Multiple myeloma | ProScript (Millennium) | 2003 | http://www.mlnm.com/products/velcade/
Tositumomab | Bexxar (GlaxoSmithKline) | Non-Hodgkin's lymphoma | Coulter Corp with Dana-Farber Cancer Institute & University of Michigan | 2003 | http://www.bexxar.com/
Oxaliplatin | Eloxatin (Sanofi-Aventis) | Colorectal cancer | Nagoya University | 2002 | http://en.sanofi-aventis.com/
Ibritumomab tiuxetan | Zevalin (Biogen Idec) | Non-Hodgkin's lymphoma | IDEC (Biogen Idec) | 2002 | http://www.zevalin.com/
Imatinib mesylate | Gleevec (Novartis) | Chronic myelogenous leukemia | Novartis | 2001 | http://www.gleevec.com/
Alemtuzumab | Campath (Genzyme) | B-cell chronic lymphocytic leukemia | Cambridge University | 2001 | http://www.campath.com/
Gemtuzumab ozogamicin | Mylotarg (Wyeth) | Acute myeloid leukemia | Celltech Group | 2000 | http://www.wyeth.com/
Arsenic trioxide | Trisenox (Cell Therapeutics) | Acute promyelocytic leukemia | PolaRx Biopharmaceuticals (Cell Therapeutics) | 2000 | http://www.trisenox.com/

6
The information from the study of genetic alterations may be utilized against the patient in many ways if
made public.


Importance of technology for immediate integrated diagnosis


The development of safe and effective new therapies is a long, difficult, and expensive
process [26]. The engine of cancer treatment is based on the identification of molecular
abnormalities that drive a malignant process. Novel molecular targets are discovered by
use of a number of high-throughput technologies, such as genome sequencing, gene
expression microarrays, and protein arrays.
The high-throughput screening technologies and bioinformatics, the innovative drug
delivery systems, pharmaceutical biotechnology, pharmacogenomics, and combinatorial
chemistry are major advances that give a new direction to pharmaceutical sciences. No
developed system comprises all knowledge as a unitary system, but important efforts are
being directed toward a system able to integrate the knowledge from many technologies
[3][7][20].

[Figure 2.3: bar chart titled "Potential increase in importance of targeted drugs", comparing hormonal, cytotoxic, and targeted agents among oncology products on the market and oncology products in development.]

Figure 2.3. Targeted therapy has increased in importance in recent years. The total worldwide cancer
revenue is about US$ 12 billion (2001). Targeted drugs account for about 18% of the total; data from [14].


Data mining success in other fields, such as retail marketing and text mining (e.g., as
demonstrated by Google), shows the potential of integrative diagnosis in medical
decision management and specifically in cancer diagnosis and treatment. These methods
require standardized and compatible interrogation methods followed by specific
information- and knowledge-extraction tools. The efforts and rapid evolution in
integration of such databases [9][4][8][6][19] are slowed by the huge amount of data that
must be processed. This data processing requires computational and human-validation
resources.
Gordon et al. recently presented a standardized neuro-imaging database that may
provide a normative and evidence-based framework for individually-based assessments
in "Personalized Medicine." The three primary goals of this database are to quantify
individual differences in brain function, to compare an individual's performance to their
database peers, and to provide a robust normative framework of multidimensional
measures for clinical assessment and treatment prediction [10].
The software in bioinformatics tends to follow the rules of "open software"7. The
positive side of open source is the capability of multiple groups to collaborate and
improve the tools and algorithms. A larger number of specialized analysis tools could
positively impact biology by providing systematized and standardized information
[25]. Because bioinformatics software is largely written by the highly computer-literate
investigators in this field, its use by non-specialists requires the additional development of facile
user interfaces.
Figure 2.4 describes the technologies and the integrated use of molecular profiling
and clinical data. Multivariate analysis, pattern recognition, and system modeling are all
integral parts of selecting features and patterns. Signal processing and computer science
manage the flow of information between technologies with the goal of extracting the
biological knowledge. Technologies have progressed from simple semi-quantitative,
single sample, singly-observed features to more complex, statistically quantitative, and
high-throughput measurements. These evolutions will provide a vast space for research,
and changes in technology will definitely require signal processing methods.
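As a minimal illustration of what selecting features can mean in practice, the sketch below ranks measured features (for example, gene expression values) by how well each one separates two sample classes, using the absolute Welch t-statistic. This is a generic univariate filter chosen for brevity, not the method developed in this thesis; the function names and the toy data are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two samples (each needs at least two values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_features(samples, labels):
    """Rank feature indices by |t| between class 0 and class 1, best first.

    samples: list of rows, one row of feature values per sample.
    labels:  list of class labels (0 or 1), one per sample.
    """
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, lab in zip(samples, labels) if lab == 0]
        b = [row[j] for row, lab in zip(samples, labels) if lab == 1]
        scores.append(abs(welch_t(a, b)))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data: three features, six samples; only the last feature differs by class.
samples = [
    [1.0, 5.0, 0.1],
    [1.2, 4.0, 0.2],
    [0.9, 6.0, 0.0],
    [1.1, 5.5, 3.1],
    [1.0, 4.5, 2.9],
    [1.3, 5.0, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]
order, scores = rank_features(samples, labels)
print(order)
```

A univariate score of this kind ignores interactions between features, which is one reason multivariate analysis and pattern recognition are named alongside it in the text.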

7
The meaning here is that the program source code is made freely available for modification and redistribution.

[Figure 2.4 (diagram). Clinical level: disease classification, molecular profile, and drug response, connected through multivariate analysis, pattern recognition, and system modeling (selecting the predictive patterns in phenotypes). Data levels: genomics, transcriptomics, proteomics, metabolomics, phenomics. Technologies shown: protein expression (2D gels, mass spectrometry, Western blot, protein arrays); protein interaction (yeast two-hybrid, phage display, mass spectrometry); RNA expression (microarray, SAGE, TaqMan); RNA knockdown (antisense, ribozyme, RNAi); loss-of-function phenotyping (knockouts). Outcomes: expression patterns, study of molecular functions, knowledge.]

Figure 2.4 Technologies and the integrated use of molecular profiling and clinical data. Correlating the phenotypes with the patterns in each technology relies on
signal processing algorithms.

References:

[1] Ahlborn H, Henderson S, Davies N. No immediate pain relief for the pharmaceutical industry.
Curr Opin Drug Discov Devel. 2005 May;8(3):384-91.
[2] Bardelli A, Parsons DW, Silliman N, Ptak J, Szabo S, Saha S, Markowitz S, Willson JK,
Parmigiani G, Kinzler KW, Vogelstein B, Velculescu VE. Mutational analysis of the tyrosine kinome in
colorectal cancers. Science. 2003 May 9;300(5621):949.
[3] Basik M, Mousses S, Trent J. Integration of genomic technologies for accelerated cancer drug
development. Biotechniques. 2003 Sep;35(3):580-2, 584, 586 passim.
[4] Bertone P, Gerstein M. Integrative data mining: the new direction in bioinformatics. IEEE Eng
Med Biol Mag. 2001 Jul-Aug;20(4):33-40.
[5] Brock CS, Young H, Osman S, Luthra SK, Jones T, Price PM. Glucose metabolism in brain
tumours can be estimated using [18F]2-fluorodeoxyglucose positron emission tomography and a
population-derived input function scaled using a single arterialised venous blood sample. Int J Oncol.
2005 May;26(5):1377-83.
[6] Cornell M, Paton NW, Hedeler C, Kirby P, Delneri D, Hayes A, Oliver SG. GIMS: an integrated
data storage and analysis environment for genomic and functional data. Yeast. 2003 Nov;20(15):1291
[7] Ferrari M, Cremonesi L, Bonini P, Stenirri S, Foglieni B. Molecular diagnostics by
microelectronic microchips. Expert Rev Mol Diagn. 2005 Mar;5(2):183-92.
[8] Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka.
Bioinformatics. 2004 Oct 12;20(15):2479-81. Epub 2004 Apr 8.
[9] Gerstein M. Integrative database analysis in structural genomics. Nat Struct Biol. 2000 Nov;7
Suppl:960-3.
[10] Gordon E, Cooper N, Rennie C, Hermens D, Williams LM. Integrative neuroscience: the role of a
standardized database. Clin EEG Neurosci. 2005 Apr;36(2):64-75.
[11] Jacobs AH, Kracht LW, Gossmann A, Ruger MA, Thomas AV, Thiel A, Herholz K. Imaging in
neurooncology. NeuroRx. 2005 Apr;2(2):333-47.
[12] Jacobs AH, Li H, Winkeler A, Hilker R, Knoess C, Ruger A, Galldiks N, Schaller B, Sobesky J,
Kracht L, Monfared P, Klein M, Vollmar S, Bauer B, Wagner R, Graf R, Wienhard K, Herholz K, Heiss
WD. PET-based molecular imaging in neuroscience. Eur J Nucl Med Mol Imaging. 2003
Jul;30(7):1051-65. Epub 2003 May 23.
[13] Kwak EL, Sordella R, Bell DW, Godin-Heymann N, Okimoto RA, Brannigan BW, Harris PL,
Driscoll DR, Fidias P, Lynch TJ, Rabindran SK, McGinnis JP, Wissner A, Sharma SV, Isselbacher KJ,
Settleman J, Haber DA. Irreversible inhibitors of the EGF receptor may circumvent acquired resistance
to gefitinib. Proc Natl Acad Sci U S A. 2005 May 24;102(21):7665-70. Epub 2005 May 16.
[14] Lengauer C, Diaz LA Jr, Saha S. Cancer drug discovery through collaboration. Nat Rev Drug
Discov. 2005 May;4(5):375-80.
[15] Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL,
Haserlat SM, Supko JG, Haluska FG, Louis DN, Christiani DC, Settleman J, Haber DA. Activating
mutations in the epidermal growth factor receptor underlying responsiveness of nonsmall-cell lung
cancer to gefitinib. N Engl J Med. 2004 May 20;350(21):2129-39. Epub 2004 Apr 29.
[16] Mertens G. [Market Research Report] Beyond the blockbuster drug. Strategies for "nichebusterdrugs", targeted therapies and personalized medicine. Business Insights (February 2005).
[17] Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N,
Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M. EGFR
mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004 Jun
4;304(5676):1497-500. Epub 2004 Apr 29.
[18] Patel JD, Pasche B, Argiris A. Targeting non-small cell lung cancer with epidermal growth factor
tyrosine kinase inhibitors: where do we stand, where do we go. Crit Rev Oncol Hematol. 2004
Jun;50(3):175-86.
[19] Russell RB. Genomics, proteomics and bioinformatics: all in the same boat. Genome Biol. 2002
Sep 24;3(10):REPORTS4034. Epub 2002 Sep 24.
[20] Saghatelian A, Cravatt BF. Global strategies to integrate the proteome and metabolome. Curr Opin
Chem Biol. 2005 Feb;9(1):62-8.
[21] Stoughton RB, Friend SH. How molecular profiling could revolutionize drug discovery. Nat Rev
Drug Discov. 2005 Apr;4(4):345-50.
[22] Sullivan CG. How personalized medicine is changing the rules of drug life exclusivity.
Pharmacogenomics. 2004 Jun;5(4):429-32.
[23] Toyoda T, Wada A. Omic space: coordinate-based integration and analysis of genomic phenomic
interactions. Bioinformatics. 2004 Jul 22;20(11):1759-65. Epub 2004 Mar 22.
[24] Weinshilboum R, Wang L. Pharmacogenomics: bench to bedside. Nat Rev Drug Discov. 2004
Sep;3(9):739-48.
[25] Wiley HS, Michaels GS. Should software hold data hostage? Nat Biotech. 2004 22, 1037-38.
[26] Workman P. Drug discovery strategies: technologies to accelerate translation from target to drug. J
Chemother. 2004 Nov;16 Suppl 4:13-5.

Chapter 3
Glioma overview1
Although brain tumors account for less than 2% of the total incidence of primary tumors,
the childhood incidence of brain cancers is very high, accounting for approximately 20% of
all cancers in those ages 19 and under. Primary tumors2 of the brain are the foremost
cause of cancer mortality among children (Figure 3.1), responsible for 7% of the years of
life lost from cancer before the age of 70 years. The incidence of brain tumors worldwide
is about 7 in 100,000 people per year [54], with no major differences across countries.
Gliomas are the most common primary tumors of the central nervous system [8].
The median survival of patients with glioblastoma, the most aggressive subclass of
gliomas, is less than 1 year even when surgical resection is combined with pre- or postoperative chemotherapy, immunotherapy, or radiotherapy. In patients with low-grade
gliomas3, such as oligodendroglioma, long-term survival can be achieved after surgical
resection and postoperative chemotherapy.
This variation in survival terms implies that there are important molecular
distinctions among the different glioma grades. Chapter 7 describes our analysis of
glioma data. Based on survival and Kaplan-Meier curves, the samples cluster into only
two classes (glioblastomas vs. the others). A significant relationship was observed in a
recent study [44] between the poor survival of patients with glioblastomas and the
morphology of the tumor cell nuclei. As defined by the physiopathology (see Table 3.1 at
the end of chapter), glioblastomas are characterized by nuclear atypia and
multinucleated cells. The reasons for this correlation are not yet clear, but if data-mining
algorithmic rules are followed, future research should concentrate on the nuclei and
phenomena localized to the nucleus. Analysis of the same bio-morphometric data [44]
showed that the survival term is statistically independent of the amount of surgical
resection, of the patient's age, and of the classification of the glioblastoma (as
primary or secondary).
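For readers unfamiliar with Kaplan-Meier curves, the following sketch shows how such a survival curve is estimated from follow-up times with censoring: at each time t_i with d_i observed deaths among n_i patients still at risk, the running survival estimate is multiplied by (1 - d_i / n_i). The function name and the toy follow-up data are assumptions for illustration; this is the textbook product-limit estimator, not the code used for the analysis in Chapter 7.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) estimate of the survival function.

    times:  follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    Returns a list of (time, survival probability) at each time with a death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= removed
    return curve


# Toy follow-up data in months (invented for illustration).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Comparing two groups, such as glioblastomas against the other gliomas, amounts to estimating one curve per group; a formal comparison would additionally need a test such as the log-rank test, which is not shown here.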

1
This chapter describes the pathology of the most common tumors of the nervous system and recapitulates
in brief the molecular characteristics of gliomas. This short survey is warranted in the context of applications
described in the thesis and attached publications that study the classification of gliomas. My expertise is signal
processing; for a more comprehensive examination of the subject, for further details about brain tumor
pathology, I recommend the publications listed in references.
2
There are two categories of brain tumors: primary and secondary (or metastatic). In brain cancer, primary
(benign or malignant) tumors originate in the brain tissue. The classification depends on the tissue from which
the tumor originates. Metastatic, or secondary, tumors are cancers that start in other parts of the body and
metastasize to the brain.
3
The grading system addresses the speed of progression (and aggressiveness) of a cancer.


Figure 3.1. Mortality rates for common cancers. The values were calculated per 100,000 person-years
during 1950 to 1995 (left) and differentiated (right) as neoplastic types for the age group 0-19. Brain and
nervous system tumors are ranked second after leukemias.

Brain tumor classification


The current classifications of gliomas are based on morphological features and classify
the cancer based on the cell type in the developing embryo/fetus or adult that the tumor
cells most resemble histologically. Brain tumors were first systematically subclassified by
Bailey and Cushing in 1926 on a histogenetic basis [15][4][3]. Their concept of tumor
grading was a milestone as it related the histopathological classification structure and
certain characteristics related to the outcome of the patients [4]. Later, the definitions
(presented in Table 3.2) were adjusted to provide a standard for communication
between groups. Four malignancy grades are recognized by the World Health
Organization (WHO) system, with grade I tumors the biologically least aggressive and
grade IV the biologically most aggressive tumors [32].


In brain tumor aggressiveness evaluation and classification, most diagnoses are
made on the basis of very small and fragmented biopsies. Thus, a neuropathologist needs
to know the clinical context of the patient. The minimum amount of clinical information
that must be provided for a comprehensive decision includes: age, neuro-radiological
findings including location of the tumor, location of the biopsies, relevant clinical and
family history, and whether the patient has received any treatment. Although this is not
formally part of the WHO system, the brain tumors may also be subclassified as invasive4
or non-invasive.

Pathology of astrocytic gliomas


Astrocytoma is defined as the cancer that primarily arises from astrocytes. Based on
morphology of infiltration, two major categories of astrocytic gliomas are described in
the WHO system (Table 3.3). The more common group comprises the diffuse tumors, which include:
a) diffuse astrocytomas (Table 3.1), b) anaplastic5 astrocytomas, and c) glioblastomas;
the less common astrocytic neoplasms have circumscribed growth and include: d)
pilocytic astrocytomas (Table 3.1), and e) pleomorphic xanthoastrocytomas (Table 3.1).
Table 3.3. Classification and grading of astrocytic tumors.

Tumor type1 | WHO grade
Diffuse astrocytoma | II
  Fibrillary astrocytoma | II
  Protoplasmic astrocytoma | II
  Gemistocytic astrocytoma | II
Anaplastic astrocytoma | III
Glioblastoma | IV
  Giant cell glioblastoma | IV
  Gliosarcoma | IV
Pilocytic astrocytoma | I
Pleomorphic xanthoastrocytoma | II
Pleomorphic xanthoastrocytoma with anaplastic features | Not determined

1
The classes are according to the World Health Organization classification of tumors of the nervous system
[32] [55].
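For computational work, the grading in Table 3.3 can be encoded as a simple lookup. The sketch below transcribes the table; the dictionary name, the helper functions, and the choice to represent "Not determined" as `None` are assumptions made for illustration.

```python
# WHO grades of astrocytic tumors, transcribed from Table 3.3.
# None marks the entry listed as "Not determined".
WHO_GRADE = {
    "diffuse astrocytoma": 2,
    "fibrillary astrocytoma": 2,
    "protoplasmic astrocytoma": 2,
    "gemistocytic astrocytoma": 2,
    "anaplastic astrocytoma": 3,
    "glioblastoma": 4,
    "giant cell glioblastoma": 4,
    "gliosarcoma": 4,
    "pilocytic astrocytoma": 1,
    "pleomorphic xanthoastrocytoma": 2,
    "pleomorphic xanthoastrocytoma with anaplastic features": None,
}

ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV"}


def who_grade(tumor_type):
    """Return the WHO grade as a Roman numeral, or 'Not determined'."""
    grade = WHO_GRADE[tumor_type.strip().lower()]
    return ROMAN[grade] if grade is not None else "Not determined"


def most_aggressive(tumor_types):
    """Return the tumor type with the highest WHO grade.

    Types without a determined grade are ignored; a ValueError is raised
    if no graded type remains.
    """
    graded = [t for t in tumor_types if WHO_GRADE[t.strip().lower()] is not None]
    return max(graded, key=lambda t: WHO_GRADE[t.strip().lower()])


print(who_grade("Glioblastoma"))
print(most_aggressive(["Pilocytic astrocytoma", "Anaplastic astrocytoma"]))
```

Storing the grade as an integer keeps ordering comparisons simple, since grade I denotes the biologically least aggressive and grade IV the most aggressive tumors.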

Macroscopically, diffuse astrocytomas (WHO grade II) appear as ill-defined lesions in
the white and/or grey matter characterized by enlargement and distortion, but not
destruction, of the invaded brain structures [55]. The anatomical boundaries become
blurred by invading tumor tissue.

4
An invasive type of cancer has spread from the point of origin to adjacent tissue.
5
Anaplasia is the phenomenon of replacement of specialized cells by unspecialized, undifferentiated, or stem
cells; in other words, dedifferentiation.

Table 3.2. The World Health Organization (WHO) classification of tumors affecting the Central Nervous System 6,7,8

Neuroepithelial tumors
1. Astrocytic tumors [glial tumors--categories I-V]
   Astrocytoma (WHO grade II)
      protoplasmic, gemistocytic, fibrillary, mixed
   Anaplastic (malignant) astrocytoma (WHO grade III)
      hemispheric, diencephalic, optic, brain stem, cerebellar
   Glioblastoma multiforme (WHO grade IV)
      giant cell glioblastoma, gliosarcoma
   Pilocytic astrocytoma [non-invasive, WHO grade I]
      hemispheric, diencephalic, optic, brain stem, cerebellar
   Subependymal giant cell astrocytoma [non-invasive, WHO grade I]
   Pleomorphic xanthoastrocytoma [non-invasive, WHO grade I]
2. Oligodendroglial tumors
   Oligodendroglioma (WHO grade II)
   Anaplastic (malignant) oligodendroglioma (WHO grade III)
3. Ependymal cell tumors
   Ependymoma (WHO grade II) [10]
      cellular, papillary, epithelial, clear cell, mixed
   Anaplastic ependymoma (WHO grade III)
   Myxopapillary ependymoma
   Subependymoma (WHO grade I)
4. Mixed gliomas
   Mixed oligoastrocytoma (WHO grade II)
   Anaplastic (malignant) oligoastrocytoma (WHO grade III)
   Others (e.g. ependymo-astrocytomas)
5. Neuroepithelial tumors of uncertain origin
   Polar spongioblastoma (WHO grade IV)
   Astroblastoma (WHO grade IV)
   Gliomatosis cerebri (WHO grade IV)
6. Tumors of the choroid plexus
   Choroid plexus papilloma
   Choroid plexus carcinoma (anaplastic choroid plexus papilloma)
7. Neuronal and mixed neuronal-glial tumors
   Gangliocytoma
   Dysplastic gangliocytoma of cerebellum (Lhermitte-Duclos)
   Ganglioglioma
   Anaplastic (malignant) ganglioglioma
   Desmoplastic infantile ganglioglioma
      desmoplastic infantile astrocytoma
   Central neurocytoma
   Dysembryoplastic neuroepithelial tumor
   Olfactory neuroblastoma (esthesioneuroblastoma)
      olfactory neuroepithelioma
8. Pineal Parenchyma Tumors
   Pineocytoma
   Pineoblastoma
   Mixed pineocytoma/pineoblastoma
9. Tumors with neuroblastic or glioblastic elements (embryonal tumors)
   Medulloepithelioma
   Primitive neuroectodermal tumors with multipotent differentiation
      medulloblastoma (variants: medullomyoblastoma, melanocytic
      medulloblastoma, desmoplastic medulloblastoma) [1][2][5][6][7]
      cerebral primitive neuroectodermal tumor
   Neuroblastoma
      ganglioneuroblastoma
   Retinoblastoma
   Ependymoblastoma

Other neoplasms
10. Tumors of the Sellar Region
   Pituitary adenoma
   Pituitary carcinoma
   Craniopharyngioma
11. Hematopoietic tumors
   Primary malignant lymphomas
   Plasmacytoma
   Granulocytic sarcoma
   Others
12. Germ Cell Tumors
   Germinoma
   Embryonal carcinoma
   Yolk sac tumor (endodermal sinus tumor)
   Choriocarcinoma
   Teratoma
   Mixed germ cell tumors
13. Tumors of the Meninges
   Meningioma
      meningothelial, fibrous (fibroblastic), transitional (mixed),
      psammomatous, angiomatous, microcystic, secretory, clear cell,
      chordoid, lymphoplasmacyte-rich, metaplastic subtypes
   Atypical meningioma
   Anaplastic (malignant) meningioma
14. Non-meningothelial tumors of the meninges
   Benign Mesenchymal
      osteocartilaginous tumors, lipoma, fibrous histiocytoma, others
   Malignant Mesenchymal
      chondrosarcoma, hemangiopericytoma, rhabdomyosarcoma, meningeal
      sarcomatosis, others
   Primary Melanocytic Lesions
      diffuse melanosis, melanocytoma, malignant melanoma (variant:
      meningeal melanomatosis)
   Hemopoietic Neoplasms
      malignant lymphoma, plasmacytoma, granulocytic sarcoma
   Tumors of Uncertain Histogenesis
      hemangioblastoma (capillary hemangioblastoma)
15. Tumors of Cranial and Spinal Nerves
   Schwannoma (neurinoma, neurilemoma)
      cellular, plexiform, and melanotic subtypes
   Neurofibroma
      circumscribed (solitary) neurofibroma
      plexiform neurofibroma
   Malignant peripheral nerve sheath tumor (Malignant schwannoma)
      epithelioid
      divergent mesenchymal or epithelial differentiation
      melanotic
16. Local Extensions from Regional Tumors
   Paraganglioma (chemodectoma)
   Chordoma
   Chondroma
   Chondrosarcoma
   Carcinoma
17. Metastatic tumours
18. Unclassified Tumors
19. Cysts and Tumor-like Lesions
   Rathke cleft cyst
   Epidermoid
   Dermoid
   Colloid cyst of the third ventricle
   Enterogenous cyst
   Neuroglial cyst
   Granular cell tumor (choristoma, pituicytoma)
   hypothalamic neuronal hamartoma
   nasal glial heterotopia
   plasma cell granuloma

6
Since 1993 a new classification of neoplasms for central nervous system has been used. The classification is based on the premise that each type of tumor results from the abnormal
growth of a specific cell type. To the extent that the behavior of a tumor correlates with basic cell type, tumor classification dictates the choice of therapy and predicts prognosis. In the
grading system for aggressiveness, the classes are of a single defined grade.
7
See http://neurosurgery.mgh.harvard.edu/newwhobt.htm by Stephen B. Tatter, M.D., Ph.D. [32].
8
Categories in italics are not recognized by the WHO classification system, but are in common use.


Anaplastic astrocytomas (WHO grade III) are characterized by signs of focal or
diffuse anaplasia, such as increased cellularity, nuclear atypia, and marked mitotic
activity. The typical histological signs of anaplastic astrocytomas are microvascular
proliferation without necrosis (Table 3.1). The structure often includes fibrillar and
gemistocytic9 astrocyte cells, fusiform cells, small anaplastic cells, and pleomorphic10
multinuclear giant cells. The nucleus manifests a high mitotic activity, prominent
features, and atypical forms. Anaplastic astrocytomas tend to progress to glioblastomas.
At the cellular level, glioblastomas (WHO grade IV) are tumors that may be composed of
cells of various morphologies. At the macroscopic level, glioblastoma tumors are
largely necrotic, with a peripheral zone of fleshy grey tumor tissue, hemorrhagic
behavior, edema, and presence of pathological microvascular proliferation. Necrosis and
vascularization are essential for the diagnosis. Topologically, glioblastomas share a
supratentorial11 location preference in the cerebral hemispheres with the other diffusely
growing astrocytomas. Other histological sub-variants of glioblastoma listed by WHO
classification [55], but rare, are the giant cell glioblastoma [16] [33] and the gliosarcoma
[59] [47].
Circumscribed growth tumors develop mostly in children. Pilocytic12 astrocytomas
are typically found in cerebellum, optic nerve and optic chiasm, hypothalamus, thalamus,
basal ganglia and brainstem [9]. In rare cases, pilocytic astrocytomas can arise in lobes or
spinal cord. Pilocytic astrocytomas are less aggressive tumors than glioblastomas and are
classified as WHO grade I. Pleomorphic xanthoastrocytomas (PXAs) are rare tumors
that develop superficially in the cerebral hemispheres of children with a history of
chronic epilepsy. PXAs are characterized by macroscopic cystic lesions in the cerebral
cortex that grow attached to the meninges and behave typically as WHO grade II.

Pathology of oligodendrogliomas
A subclass of glioma, oligodendroglioma is the brain cancer subtype that primarily arises
from cells morphologically resembling oligodendroglia [32]. Like classification of
astrocytomas, the classification of oligodendrogliomas is based on histopathological
features; the majority of studies describe oligodendrogliomas in the context of other
malignancies, making analysis of correlations between tumor characteristics and patient
prognosis difficult. Bailey proposed a link between what are now called
oligodendroglioma [4] [3] and oligodendrocytes based on three observations: that
both cell types demonstrate round and uniform nuclei, both show a swollen and clear
9
The gemistocytic astrocytoma is a histological variant of diffuse astrocytomas. It is characterized by the
presence of large, glial fibrillary acidic protein (GFAP)-expressing neoplastic astrocytes called gemistocytes
and a high tendency towards rapid progression to glioblastoma [63].
10
Having variation in the size and shape of their nuclei.
11
Location in the upper part of the brain, anatomically above the "tentorium cerebri".
12
Cells that look like fibers when viewed under a microscope.


cytoplasm following standard histological tissue preparation, and the two cell types
display similar cell processes upon silver staining [27].
Oligodendrogliomas (WHO grade II) consist of moderately cellular, monomorphic
tumors with round nuclei, often artifactually inflamed cytoplasm, few or no mitoses, no
florid microvascular proliferation, or necrosis, and are classified as malignancy grade II
according to the WHO. The characteristic cytoplasm artifact relevant for diagnosis is the
clear cytoplasm, resembling a honeycomb, seen upon standard tissue preparation. The
tumor tissue contains numerous delicate, branching vessels with reticular appearance.
Anaplastic oligodendrogliomas are histologically represented by aggressiveness
(WHO grade III) through increase in nuclear pleomorphic13 features, abnormal
pigmentation and pronounced high-cellularity, fervent mitotic activity, and prominent
microvascular proliferation with spontaneous necrosis. Oligodendrogliomas (grades II
and III) show relatively specific genetic abnormalities that differ from the other gliomas.
Anaplastic oligodendrogliomas are differentiated from oligodendrogliomas by the level of
aggressiveness.

Pathology of mixed gliomas


Mixed gliomas or oligoastrocytomas are defined by the presence of both oligodendroglial
and astrocytic components, and are divided into grade II and anaplastic grade III
variants [27]. Histologically, oligoastrocytomas show two different morphological
patterns (Table 3.1): biphasic tumors with two distinct components of astrocytic and
oligodendroglial differentiation and diffuse variants with tumor cells in varying
transitional states between oligodendroglial and astrocytic appearance [26].

Molecular pathogenesis of gliomas


In the last two decades, the molecular genetic studies in human glioma research have
concentrated on diffuse infiltrating astrocytomas, in particular on the most aggressive
class of glioblastomas and most studies have observed gene expression at the
transcriptional level. These studies identified a number of genes (Table 3.4, Figures 3.2
and 3.3) that may play an important role in glioma progression.
Table 3.4 Tumor suppressor genes and proto-oncogenes in astrocytic gliomas.

Gene | Location | Function | Alteration common in
TP53 | 17p13 | Transactivator involved in the regulation of apoptosis, cell cycle progression, DNA repair | diffuse astrocytomas, anaplastic astrocytomas, glioblastomas (secondary)
RB1 | 13q14 | Nuclear phospho-protein involved in cell cycle regulation | glioblastomas, anaplastic astrocytomas
CDKN2A | 9p21 | Inhibitor of cyclin-dependent kinase 4 and 6 | glioblastomas, anaplastic astrocytomas
p14ARF | 9p21 | Inhibitor of Mdm2 | glioblastomas, anaplastic astrocytomas
PTEN | 10q23 | Dual-specificity protein phosphatase and lipid phosphatase, negative regulator of phosphatidylinositol 3-kinase | glioblastomas
EGFR | 7p11 | Tyrosine kinase growth factor receptor | glioblastomas
PDGFR | 4q12 | Tyrosine kinase growth factor receptor | glioblastomas
MET | 7q31 | Tyrosine kinase growth factor receptor | glioblastomas
CDK4 | 12q13 | Cyclin-dependent kinase, promotes G1/S phase progression | glioblastomas
CDK6 | 7q21-q22 | Cyclin-dependent kinase, promotes G1/S phase progression | glioblastomas
CCND1 | 11q13 | Cyclin D1, promotes G1/S phase progression | glioblastomas
CCND3 | 6p21 | Cyclin D3, promotes G1/S phase progression | glioblastomas
MDM2 | 12q15 | Inhibitor of p53 function | glioblastomas
MDM4 | 1q32 | Inhibitor of p53 function | glioblastomas
MYCC | 8q24 | Transcription factor | glioblastomas

13
OLIG genes are expressed strongly in oligodendroglioma, and are absent or expressed at a low level in
astrocytoma.


Studies have indicated that tumor suppressor genes [17][29][53][64][46] and
proto-oncogenes [21][39][46] are genetically altered during glioma progression. Among the
changes are the amplification of EGFR, loss of heterozygosity (LOH) of chromosome 10,
mutation or deletion of PTEN, LOH of chromosome 9, deletion of p16 genes, and
mutation in p53 [11]. Recent genomic studies using cDNA microarrays have revealed a
large number of gene expression changes, including the discovery of overexpression of
IGFBP2 in 80% of glioblastomas [66][12][23][20].
We are far from understanding the gene and pathway alterations in gliomas. In fact,
the question of the cell of origin, illustrated by the animal models of gliomas and by
lineage markers such as Olig1/2 (footnote 13) [38], remains unsolved. A number of papers
[6][17][20] observe alterations in astrocytic cells in the expression of growth factors and
other proteins that control apoptosis and the cell cycle (see Figure 3.2). In
oligodendroglioma expansion, LOH 1p/19q is suggested [50][43] to characterize
oligodendroglial lineage evolution (Figure 3.3). If alterations in the p53-dependent
pathways are present, the diagnosis is of mixed glioma (i.e., oligoastrocytoma). In
summary, most genetic alterations in gliomas result from the disruption of three main
cellular systems: the RB1, p53, and tyrosine kinase receptor pathways. Other gene
alterations have also been identified in glioma, including those in genes that promote
mitotic signal transduction, cell cycle regulation, apoptosis, angiogenesis, or invasion.


[Figure 3.2: diagram of alterations across diffuse astrocytomas, anaplastic astrocytomas,
and glioblastomas, in four groups. p53-dependent alterations (apoptosis, cell progression,
and DNA repair): TP53 mutation; p14ARF hypermethylation / homozygous deletion; MDM2
amplification; MDM4 amplification. Growth factor signaling pathway activations: PDGFRA
overexpression; PDGFRA amplification; EGFR amplification / rearrangement; MET
amplification. pRb1-dependent alterations: CDKN2A / CDKN2B homozygous deletion; RB1
mutation / hypermethylation; CDK4 / CDK6 amplification; CCND1 / CCND3 amplification.
Pi3k / AKT activation: PTEN mutation; CTMP hypermethylation; EGFR amplification /
rearrangement; PDGFR amplification; MET amplification.]

Figure 3.2 Phenomena that characterize the progression of diffuse astrocytic gliomas. Judging by the
frequency of alterations in samples across the progression of the astrocytic lineage, the impairment of
p53-dependent growth control, either through mutation of TP53 or p14ARF silencing by hypermethylation, is one
of the first steps leading to the formation of astrocytomas (grade II). Diffuse astrocytomas frequently present
PDGFR overexpression that may activate the mitogenic signaling pathways. A small set of (grade III)
anaplastic astrocytomas have lost pRb1-dependent cell cycle control and glioblastomas (grade IV) are
sometimes characterized by Pi3k/AKT signaling alterations.

[Figure 3.3: diagram relating tumor location (temporal, extra-temporal) and genetic
alteration (TP53 mutation, unknown, LOH 1p / LOH 19q) to diagnosis (oligoastrocytoma,
astrocytoma, oligodendroglioma).]

Figure 3.3. Molecular model of oligodendroglial classification. Oligodendrogliomas are predominantly located
in the extratemporal region and have LOH 1p/19q, while tumors with TP53 mutations principally arise in the
temporal lobes and resemble astrocytomas. Mueller et al. [43] proposed this diagnosis model.


Reverse-phase protein lysate array analysis


In contrast to studies that observe gene expression at the transcriptional level (usually
using cDNA microarrays or oligo-arrays), parallel high-throughput proteomic analyses of
gliomas have been lacking. Inspired by the high-throughput DNA microarray techniques
that have been used to comprehensively analyze the molecular basis of various diseases,
we developed the reverse-phase lysate array technique to provide high-throughput
analysis of protein levels. The technology of reverse-phase protein arrays was initiated
by Espina et al. [19]. The reverse-phase protein lysate array experiment involves spotting
cell extracts from a large number of biological samples on a coated glass slide. The array
is subsequently probed with a large number of antibodies. The technique has shown
promise for monitoring the expression of disease-related proteins [41][22][65].
Using a reverse-phase protein lysate array improved with robust estimation and
high-throughput capabilities [42], we analyzed a set of glioma samples (footnote 14). The
feature selection procedure described in Chapter 7 was used to determine which of the
evaluated proteins had altered levels of expression or post-translational modification in
glioblastomas relative to low-grade gliomas. These alterations included: changes in the
phosphorylation levels of the retinoblastoma (p-Rb) protein, the BCL2-antagonist of cell
death protein (p-BAD), and epidermal growth factor receptor (p-EGFR); and up-regulation
of insulin-like growth factor binding protein 2 (IGFBP2), the macrophage gelatinase
protein (matrix metalloproteinase 9 or MMP9), phosphatidylinositol 3-kinase (Pi3K)
protein, insulin-like growth factor binding protein 5 (IGFBP5), BCL2 protein, threonine
308-phosphorylated Akt (p-Thr308Akt), vascular endothelial growth factor (VEGF),
NF-κB inhibitor alpha (IκB) protein, and proto-oncogene tyrosine-protein kinase ABL1
(c-Abl). These results are in agreement with the transcriptional observations from the
analyzed publications (Figure 3.2 and Table 3.4; for details see Chapter 7).

14

Chapter 7 describes details of the procedure.

Table 3.1 Histological features and classification of the gliomas.

[Images column: histological images, see footnote 15.]

| Histological class | Pathophysiology |
|---|---|
| Diffuse astrocytoma (WHO grade II) | Diffuse astrocytomas are moderately cellular and infiltrative tumors, expanding, distorting, but not destroying the neighboring anatomical structures. Mitotic activity is generally absent. TP53 mutations and overexpression of the PDGFR are the principal associated genetic alterations, although these findings are more frequently observed in adults than in children. |
| Anaplastic astrocytoma (WHO grade III) | Anaplastic astrocytoma (AA) arises in the same locations as diffuse astrocytomas, with a preference for the cerebral hemispheres. Tumors show increased cellularity, distinctive nuclear atypia, marked mitotic activity, and a tendency to infiltrate through neighboring tissue. This stage of astrocytoma is the least common, and may represent a short-term intermediate lesion during the transition from Grade II to Grade IV astrocytoma. |
| Glioblastoma Multiforme (WHO grade IV) | GBM is an anaplastic (footnote 16), highly cellular tumor with poorly differentiated, round, or pleomorphic cells, occasional multinucleated cells, nuclear atypia, and anaplasia. GBM differs from anaplastic astrocytomas (AA) by the presence of necrosis under the microscope. Variants of the tumor include: gliosarcoma, multifocal GBM, or gliomatosis cerebri, in which the entire brain may be infiltrated with tumor cells. GFAP staining varies. |
| Pilocytic astrocytoma (grade II) | Characterized by fusiform "piloid" bipolar astrocytes, with alternating dense and loose areas. In loose areas, microcysts may coalesce to form the macroscopic cysts. The presence of nuclear atypia (without mitotic activity) does not carry a worse prognosis. Vascular changes are limited to capillary proliferations that may include glomeruloid capillaries and endothelial proliferation. Eosinophilic "Rosenthal fibers" are characteristic; calcification is possible. Common locations are the cerebellum and diencephalon (especially the optic nerves and hypothalamus). |
| Pleomorphic xanthoastrocytoma (grade II) | Pleomorphic xanthoastrocytomas are a mixture of unusually pleomorphic cells, ranging from fibrillary to bizarre giant multinucleated cells with intracellular lipid vacuoles ("xanthoma" cells). Usually a large hemispheric mass, closely related to the cerebral surface. |
| Oligodendroglioma (grade II); Anaplastic oligodendrogliomas (grade III) | Oligodendrogliomas are a continuous spectrum of lesions ranging from well-differentiated neoplasms to malignant tumors; solid, relatively well-defined, soft, gray-pink tumors. The tumor is typically located in the cortex and white matter, and infiltration of the overlying leptomeninges (footnote 17) may be seen. Calcification is frequent. Necrosis, cyst formation, and hemorrhage define grade III. |
| Oligoastrocytomas (grade II and III) | Oligoastrocytomas exhibit histologic characteristics indicative of malignancy, including high cellularity, cellular pleomorphism, nuclear atypia, and increased mitotic activity that includes both astrocytic and oligodendrocytic components. |

15

Part of the images are from public sources; part are from experiments performed in vivo at the Cancer Genomics Core Laboratory by Sarah Dunlap, M.D. Anderson Cancer Center.

16

Undifferentiated; characterized by anaplasia or reversed development.


References
[1] Adesina AM, Dunn ST, Moore WE, Nalbantoglu J. Expression of p27kip1 and p53 in
medulloblastoma: Relationship with cell proliferation and survival. Pathol Res Pract 2000;196:243-50
[2] Aldosari N, Bigner SH, Burger PC, et al. MYCC and MYCN oncogene amplification in
medulloblastoma. A fluorescence in situ hybridization study on paraffin sections from the Children's
Oncology Group. Arch Pathol Lab Med 2002;126:540-44
[3] Bailey P, Bucy P (1929) Oligodendrogliomas of the brain. J Pathol Bacteriol 32:735-754
[4] Bailey P, Cushing H. A classification of tumors of the glioma group on a histogenetic basis with a
correlated study of prognosis. Philadelphia: JB Lippincott, 1926.
[5] Batra SK, McLendon RE, Koo JS, et al. Prognostic implications of chromosome 17p deletions in
human medulloblastomas. J Neurooncol 1995;24:39-45
[6] Ben Arush MW, Linn S, Ben-Izhak O, et al. Prognostic significance of DNA ploidy in childhood
astrocytomas. Pediatr Hematol Oncol 1999;16:387-96
[7] Biegel JA, Janss AJ, Raffel C, et al. Prognostic significance of chromosome 17p deletions in
childhood primitive neuroectodermal tumors (medulloblastomas) of the central nervous system. Clin
Cancer Res 1997;3:473-78
[8] Boudreau CR, Liau LM. Molecular characterization of brain tumors. Clin. Neurosurg. 2004, 51,
81-90.
[9] Bredel M, Pollack IF, Hamilton RL, Birner P, Hainfellner JA, Zentner J. DNA topoisomerase
IIalpha predicts progression-free and overall survival in pediatric malignant non-brainstem gliomas. Int J
Cancer 2002;99:817-20
[10] Carter M, Nicholson J, Ross F, et al. Genetic abnormalities detected in ependymomas by
comparative genomic hybridisation. Br J Cancer 2002;86:929-39
[11] Caskey LS, Fuller GN, Bruner JM, Yung WK, Sawaya RE, Holland EC, Zhang W. Toward a
molecular classification of the gliomas: histopathology, molecular genetics, and gene expression
profiling. Histol Histopathol. 2000 Jul;15(3): 971-81.
[12] Chen G, Gharib TG, Huang CC, Taylor JM, Misek DE, Kardia SL, Giordano TJ, Iannettoni MD,
Orringer MB, Hanash SM, Beer DG. Discordant protein and mRNA expression in lung
adenocarcinomas. Mol. Cell Proteomics 2002, 1, 304-313.
[13] Cogen PH. Prognostic significance of molecular genetic markers in childhood brain tumors.
Pediatr Neurosurg 1991;17:245-50
[14] Coons SW, Johnson PC, Haskett D, Rider R. Flow cytometric analysis of deoxyribonucleic acid
ploidy and proliferation in choroid plexus tumors. Neurosurgery 1992;31:850-56
[15] Cushing H. Intracranial tumors: Notes upon a series of two thousand verified cases with
surgical-mortality percentages pertaining thereto. Springfield, IL: Charles C. Thomas. 1932
[16] De Prada I, Cordobes F, Azorin D, Contra T, Colmenero I, Glez-Mediero I. Pediatric giant cell
glioblastoma: a case report and review of the literature. Childs Nerv Syst. 2005 Jul 6; [Epub ahead of
print]
[17] Drach LM, Kammermeier M, Neirich U, et al. Accumulation of nuclear p53 protein and prognosis
of astrocytomas in childhood and adolescence. Clin Neuropathol 1996;15:67-73
[18] Dyer S, Prebble E, Davison V, et al. Genomic imbalances in pediatric intracranial ependymomas
define clinically relevant groups. Am J Pathol 2002;161:2133-41
[19] Espina V, Mehta AI, Winters ME, Calvert V, Wulfkuhle J, Petricoin EF 3rd, Liotta LA. Protein
microarrays: molecular profiling technologies for clinical specimens. Proteomics. 2003 Nov;3(11):2091-100.
[20] Fuller GN, Hess KR, Rhee CH, et al. Molecular classification of human diffuse gliomas by
multidimensional scaling analysis of gene expression profiles parallels morphology-based classification,
correlates with survival, and reveals clinically-relevant novel glioma subsets. Brain Pathol 2002;12:108-16
[21] Gilbertson RJ, Clifford SC, MacMeekin W, et al. Expression of the ErbB-neuregulin signaling
network during human cerebellar development: Implications for the biology of medulloblastoma. Cancer
Res 1998;58:3932-41
[22] Grubb RL, Calvert VS, Wulkuhle JD, Paweletz CP, Linehan WM, Phillips JL, Chuaqui R, Valasco
A, Gillespie J, Emmert-Buck M, Liotta LA, Petricoin EF. Signal pathway profiling of prostate cancer
using reverse phase protein arrays. Proteomics. 2003 Nov;3(11):2142-6.
[23] Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance
in yeast. Mol Cell Biol. 1999 Mar;19(3):1720-30.
[24] Hall PA, Going JJ. Predicting the future: A critical appraisal of cancer prognosis studies.
Histopathology 1999;35:489-94
[25] Hamilton RL, Pollack IF. The molecular biology of ependymomas. Brain Pathol 1997;7:807-22
[26] Hart MN, Petito CK, Earle KM (1974) Mixed gliomas. Cancer 33:134-140
[27] Hartmann C, Mueller W, von Deimling A. Pathology and molecular genetics of oligodendroglial
tumors. J Mol Med. 2004 Oct;82(10):638-55.

[28] Ino Y, Betensky RA, Zlatescu MC, et al. Molecular subtypes of anaplastic oligodendroglioma:
Implications for patient management at diagnosis. Clin Cancer Res 2001;7:839-45
[29] Jaros E, Lunec J, Perry RH, Kelly PJ, Pearson AD. p53 protein overexpression identifies a group
of central primitive neuroectodermal tumours with poor prognosis. Br J Cancer 1993;68:801-7
[30] Kaatsch P, Rickert CH, Kühl J, Schüz J, Michaelis J. Population-based epidemiological data of brain
tumors in German children. Cancer 2001;92:3155-64
[31] Kim JY, Sutton ME, Lu DJ, et al. Activation of neurotrophin-3 receptor TrkC induces apoptosis
in medulloblastomas. Cancer Res 1999;59:711-19
[32] Kleihues P, Cavenee WK, eds. World Health Organization classification of tumours: Vol. 1.
Pathology and genetics of tumours of the nervous system. Lyon: IARC Press, 2000.
[33] Klein R, Molenkamp G, Sorensen N, Roggendorf W. Favorable outcome of giant cell
glioblastoma in a child. Report of an 11-year survival period. Childs Nerv Syst. 1998 Jun;14(6):288-91.
[34] Korshunov A, Golanov A, Timirgaz V. Immunohistochemical markers for intracranial
ependymoma recurrence. An analysis of 88 cases. J Neurol Sci 2000;177:72-82
[35] Korshunov A, Savostikova M, Ozerov S. Immunohistochemical markers for prognosis of
average-risk pediatric medulloblastomas. The effect of apoptotic index, TrkC, and c-myc expression. J
Neurooncol 2002;58:271-79
[36] Korshunov A, Sycheva R, Timirgaz V, Golanov A. Prognostic value of immunoexpression of the
chemoresistance-related proteins in ependymomas: An analysis of 76 cases. J Neurooncol 1999;45:219-27
[37] Kotylo PK, Robertson PB, Fineberg NS, Azzarelli B, Jakacki R. Flow cytometric DNA analysis of
pediatric intracranial ependymomas. Arch Pathol Lab Med 1997;121:1255-58
[38] Lu QR, Park JK, Noll E, Chan JA, Alberta J, Yuk D, Alzamora MG, Louis DN, Stiles CD,
Rowitch DH, Black PM. Oligodendrocyte lineage genes (OLIG) as molecular markers for human glial
brain tumors. Proc Natl Acad Sci U S A. 2001 Sep 11;98(19):10851-6. Epub 2001 Aug 28.
[39] MacDonald TJ, Brown KM, LaFleur B, et al. Expression profiling of medulloblastoma: PDGFRA
and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat Genet 2001;29:143-52
[40] Marshall T, Rutledge JC. Flow cytometry DNA applications in pediatric tumor pathology. Pediatr
Dev Pathol 2000;3:314-34
[41] Melton L. Protein arrays: Proteomics in multiplex. Nature 2004, 429, 101-107.
[42] Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I, Hamilton SR, Zhang W. Robust
estimation of protein expression ratios with lysate microarray technology. Bioinformatics. 2005 May
1;21(9):1935-42. Epub 2005 Jan 12.
[43] Mueller W, Hartmann C, Hoffmann A, Lanksch W, Kiwit J, Tonn J, Veelken J, Schramm J,
Weller M, Wiestler OD, Louis DN, von Deimling A (2002) Genetic signature of oligoastrocytomas
correlates with tumor location and denotes distinct molecular subsets. Am J Pathol 161:313-319
[44] Nafe R, Franz K, Schlote W, Schneider B. Morphology of tumor cell nuclei is significantly related
with survival time of patients with glioblastomas. Clin Cancer Res. 2005; 11: 2141-8.
[45] Nicholson JC, Ross FM, Kohler JA, Ellison DW. Comparative genomic hybridization and
histological variation in primitive neuroectodermal tumours. Br J Cancer 1999;80:1322-31
[46] Nutt CL, Mani DR, Betensky RA, et al. Gene expression-based classification of malignant gliomas
correlates better with survival than histological classification. Cancer Res 2003;63:1602-7
[47] Okami N, Kawamata T, Kubo O, Yamane F, Kawamura H, Hori T. Infantile gliosarcoma: a case
and a review of the literature. Childs Nerv Syst. 2002 Jul;18(6-7):351-5. Epub 2002 May 17.
[48] Paulus W, Lisle DK, Tonn JC, et al. Molecular genetic alterations in pleomorphic
xanthoastrocytoma. Acta Neuropathol 1996;91:293-97
[49] Pollack IF, Campbell JW, Hamilton RL, Martinez AJ, Bozik ME. Proliferation index as a
predictor of prognosis in malignant gliomas of childhood. Cancer 1997;79:849-56
[50] Pollack IF, Finkelstein SD, Burnham J, et al. Association between chromosome 1p and 19q loss
and outcome in pediatric malignant gliomas: Results from the CCG-945 cohort. Pediatr Neurosurg
2003;39:114-21
[51] Pollack IF, Finkelstein SD, Woods J, et al. Expression of p53 and prognosis in children with
malignant gliomas. N Eng J Med 2002; 346:420-27
[52] Pomeroy SL, Tamayo P, Gaasenbeek M, et al. Prediction of central nervous system embryonal
tumour outcome based on gene expression. Nature 2002;415:436-42
[53] Raffel C, Frederick L, O'Fallon JR, et al. Analysis of oncogene and tumor suppressor gene
alterations in pediatric malignant astrocytomas reveals reduced survival for patients with PTEN
mutations. Clin Cancer Res 1999;5:4085-90
[54] Reifenberger G, Collins VP. Pathology and molecular genetics of astrocytic gliomas. J Mol Med
(2004) 82:656-670
[55] Reifenberger G, Collins VP. Pathology and molecular genetics of astrocytic gliomas. J Mol Med.
2004 Oct;82(10):656-70.
[56] Reyes-Mugica M, Chou PM, Myint MM, Ridaura-Sanz C, Gonzales-Crussi F, Tomita T.
Ependymomas in children: Histologic and DNA-flow cytometric study. Pediatr Pathol 1994;14:453-66


[57] Rickert CH, Sträter R, Kaatsch P, et al. Pediatric high-grade astrocytomas show chromosomal
imbalances distinct from adult cases. Am J Pathol 2001;158:1525-32
[58] Rickert CH, Wiestler OD, Paulus W. Chromosomal imbalances in choroid plexus tumors. Am J
Pathol 2002;160:1105-13
[59] Salvati M, Caroli E, Raco A, Giangaspero F, Delfini R, Ferrante L. Gliosarcomas: analysis of 11
cases do two subtypes exist? J Neurooncol. 2005 Aug;74(1):59-63.
[60] Shibata N, Oda H, Hirano A, Kato Y, Kawaguchi M, Dal Canto MC, Uchida K, Sawada T,
Kobayashi M. Molecular biological approaches to neurological disorders including knockout and
transgenic mouse models. Neuropathology. 2002 Dec;22(4):337-49.
[61] Sreekumar A, Nyati MK, Varambally S, Barrette TR, Ghosh D, Lawrence TS, Chinnaiyan AM.
Profiling of cancer cells using protein microarrays: discovery of novel radiation-regulated proteins.
Cancer Res. 2001 Oct 15;61(20):7585-93.
[62] Ward S, Harding B, Wilkins P, et al. Gain of 1q and loss of 22q are the most common changes
detected by comparative genomic hybridisation in paediatric ependymoma. Genes Chromosomes Cancer
2001;32:59-66
[63] Watanabe K, Peraud A, Gratas C, Wakai S, Kleihues P, Ohgaki H. p53 and PTEN gene mutations
in gemistocytic astrocytomas. Acta Neuropathol (Berl). 1998 Jun;95(6):559-64.
[64] Woodburn RT, Azzarelli B, Montebello JF, Goss IE. Intense p53 staining is a valuable prognostic
indicator for poor prognosis in medulloblastoma/central nervous system primitive neuroectodermal
tumors. J Neurooncol 2001;52:57-62
[65] Wulfkuhle JD, Aquino JA, Calvert VS, Fishman DA, Coukos G, Liotta LA, Petricoin EF 3rd.
Signal pathway profiling of ovarian cancer from human tissue specimens using reverse-phase protein
microarrays. Proteomics. 2003 Nov;3(11):2085-90.
[66] Zhou YH, Tan F, Hess KR, Yung WK. The expression of PAX6, PTEN, vascular endothelial
growth factor, and epidermal growth factor receptor in gliomas: relationship to tumor grade and survival.
Clin Cancer Res. 2003 Aug 15;9(9):3369-75.


Chapter 4
Fundaments of analysis of cDNA microarray data
Microarray technology can be described as a multi-step, data-intensive technology for
observing RNA phenomena in a high-throughput manner. One reason for its popularity is
the revolutionary perspective it has added to biological studies: several decades ago
researchers could observe variations only on a gene-by-gene basis, but now they are able
to evaluate hundreds or thousands of genes in one experiment. Improvements in the
technology allow observation of gene expression in a highly noisy environment, and good
results can be obtained with a wisely chosen experimental design. In addition, a great
deal of information can be extracted by means of simple methods.
Two major platforms are frequently used, differing by the method of nucleic acid
deposition: in the first, nucleic acid probes are robotically spotted, and in the second,
they are deposited by photolithography. Both platforms are continuously optimized,
changed, and further developed. The robotically spotted microarray (referred to as cDNA
because polymerase chain reaction (PCR) products were amplified from cDNA) was
described by Schena et al. [31] (footnote 1) in 1995 and DeRisi et al. [7] (footnote 2) in
1996; other important improvements to the algorithms are presented in [2], [5] [8], [16]
[18], [24] [32], [34], [39], [37], [38] [9]. The photo-lithographically synthesized array
(also referred to as high-density oligonucleotide chips or oligoarrays because each gene is
represented by multiple ~25-mer oligonucleotides, or as Affymetrix chips after the first
company to produce the technology) was initially described by Lockhart et al. [22],
Lipshutz et al. [21], and others [20] [28] [30][29] [33] [42]. The original differences
based on the size of the nucleic acids (number of nucleotides in each probe) have
gradually diminished over the last few years. Each platform has strengths and weaknesses.
The elements of each procedural step are described with emphasis on the technical and
statistical-algorithmic aspects; Figure 4.1 shows the flow of a typical microarray
experiment.

1
The system was developed to monitor the expression of 45 Arabidopsis genes by means of simultaneous,
two-color fluorescence hybridization on robotically printed, high-density arrays. Hybridization volumes of
2 µL enabled, for the first time, the detection of transcripts in probe mixtures derived from 2 micrograms
of total cellular messenger RNA.
2
The paper was the first investigation of the genetic basis of the phenotype of tumors using cDNA
technology. The probes for hybridization were derived from two sources of cellular mRNA [UACC-903 and
UACC-903(+6)]. As in the previous citation, samples were labeled with two-color dyes to provide a direct and
internally controlled comparison of the mRNA levels.


[Figure 4.1: flow diagram. Experimental design (define the biological or pathological
question; statistical relevance of the experiment; microarray scheme design; validation)
-> biological experiment, collecting the samples -> RNA extraction and fluorescent
labeling (labeled sample A, labeled sample B), fabrication of the microarray -> microarray
hybridization -> microarray scanning, spot identification and segmentation, intensity
value extraction -> data pre-processing (poor-quality spot assessment, normalization)
-> data analysis (differentially expressed genes, classification, pathway analysis, gene
ontology, etc.) -> knowledge extraction.]

Figure 4.1 Flow of a typical microarray experiment. A good experiment design ensures qualitative results.


Figure 4.2 illustrates the well-known representation published by the NHGRI group
(Duggan, Bittner, Chen, Meltzer, and Trent) in 1999 [9], which became the standard
representation of the process in microarray technology.

Figure 4.2 Representation of cDNA microarray schema presented in [9] by NHGRI-group Duggan,
Bittner, Chen, Meltzer, and Trent in 1999.

The success of the experiment depends on its initial design. Four elements have been
distinguished as crucial in the design phase (see Leung et al. [18] [19]).
1) The hypothesis and the initial examination conditions of the biological aspects
should be stated in clear form. This is not intended to restrict the chances of new,
unexpected findings, but to concentrate on a few biological questions. To illustrate, the
next examples focus on feature selection and discrimination of classes A, B, C, and D. If
the number of profiles (samples) is limited, feature selection in the discrimination
(A vs. B vs. C vs. D) is more prone to errors than in the case (A∪B vs. C∪D). Moreover,
the set of genes selected by the discrimination (A vs. B vs. C vs. D) differs from the
set of genes selected by the discriminations (A vs. B vs. C∪D) or (A∪B vs. C vs. D)
or any other combination of these classes. Sometimes, it is difficult to derive the
classification rules from the biologist's questions.
2) The systematic and experimental errors in microarray experiments can be
reduced during the design phase. Careful planning will allow the cells from the treatment
and normal groups to pass through similar conditions, so that the parameters and
conditions of the experiment affect the cells in the discriminated classes equally.
Another example is performing a dye-swap experiment so that the particular
characteristics of the dyes are removed.
3) The experimental design should lead to a simple, cost-effective, accurate, and
correct conclusion from a statistical viewpoint. In the case of feature selection for
classification, as in the previous example, an equal effort is suggested for each of the
classes; a minimum of 7 to 9 samples should be analyzed from each class.
4) An experiment that complies with the standards for information collection
can be compared to other experiments; the value of the results will thus increase.
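The point made in element 1, that the selected gene set depends on how the classes are grouped, can be illustrated with a minimal sketch. It ranks genes by a one-way ANOVA F-statistic, which is an assumption made here for illustration and not necessarily the selection criterion used in this work; the toy data and the helper name `f_statistic` are hypothetical.

```python
import numpy as np

def f_statistic(x, labels):
    """One-way ANOVA F-statistic of each gene (column of x) for the given labels."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    groups = [x[labels == g] for g in np.unique(labels)]
    n, k = x.shape[0], len(groups)
    grand_mean = x.mean(axis=0)
    # Between-group and within-group sums of squares, per gene.
    between = sum(len(g) * (g.mean(axis=0) - grand_mean) ** 2 for g in groups)
    within = sum(((g - g.mean(axis=0)) ** 2).sum(axis=0) for g in groups)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: 4 samples per class, 2 genes.
# Gene 0 separates A from B; gene 1 separates {A, B} from {C, D}.
noise = np.array([-0.1, 0.1, -0.1, 0.1])
gene0 = np.concatenate([0 + noise, 4 + noise, 2 + noise, 2 + noise])
gene1 = np.concatenate([0 + noise, 0 + noise, 1 + noise, 1 + noise])
x = np.column_stack([gene0, gene1])

four_class = np.repeat(["A", "B", "C", "D"], 4)                      # A vs. B vs. C vs. D
two_class = np.where(np.isin(four_class, ["A", "B"]), "AB", "CD")    # A∪B vs. C∪D

top_four = int(np.argmax(f_statistic(x, four_class)))   # best gene, four-class question
top_two = int(np.argmax(f_statistic(x, two_class)))     # best gene, merged-class question
```

In this toy setting the four-class question ranks gene 0 first, whereas the merged question ranks gene 1 first, because merging A and B turns their difference into within-class variance.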
Depending on the purpose of the experiment, the expertise of the personnel, and other
conditions of the laboratory, one of the two platforms will be more appealing. Robotically
printed cDNA is generally cheaper but requires expertise and laborious work for
customization. Oligonucleotide arrays are more expensive and generally not customizable,
but they are accurate, self-contained, and simple to use.
In a cDNA microarray, two different fluorescent dyes are commonly used for the
samples "A" and "B" that are co-hybridized to the same array. Nucleic acids are labeled
with the monochromatic dyes Cy3 and Cy5 with the aim of comparing the expression of
different biological samples (footnote 3) or of samples under different conditions.
Because of co-hybridization, the errors in the levels of expression will be lower than
when the samples are hybridized separately. Several methods for finding differentially
expressed genes are described in Chapter 8.
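To make the two-color comparison and the dye-swap idea concrete, the sketch below computes per-spot log-ratios and combines a dye-swap pair. It assumes background-corrected intensities and a dye bias that is additive on the log scale; the intensity values and the function names are hypothetical.

```python
import numpy as np

def log_ratio(cy5, cy3):
    """Per-spot log2 ratio M = log2(Cy5 / Cy3) of background-corrected intensities."""
    return np.log2(np.asarray(cy5, dtype=float) / np.asarray(cy3, dtype=float))

def dye_swap_average(m_forward, m_swapped):
    """Combine a dye-swap pair.

    Forward array: sample A in Cy5, sample B in Cy3 -> M = log2(A/B) + bias
    Swapped array: sample B in Cy5, sample A in Cy3 -> M = log2(B/A) + bias
    Half the difference cancels the additive dye bias.
    """
    return (np.asarray(m_forward) - np.asarray(m_swapped)) / 2.0

# Hypothetical example: true A/B ratios of 4, 1, and 0.5, and a dye effect
# that doubles every Cy5 reading.
a = np.array([400.0, 100.0, 50.0])
b = np.array([100.0, 100.0, 100.0])
dye_gain = 2.0

m_forward = log_ratio(dye_gain * a, b)   # A labeled with Cy5
m_swapped = log_ratio(dye_gain * b, a)   # B labeled with Cy5
m_corrected = dye_swap_average(m_forward, m_swapped)
```

Under these assumptions `m_corrected` recovers the true log2 ratios, while each single array alone is shifted by the dye effect.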
Although the classical cDNA arrays use the two-dye method, there are also single-dye
microarray techniques in which the spots are observed on a single channel. Most of the
considerations regarding extraction, labeling, and hybridization still apply. The image
acquisition and data processing will differ depending on the exploratory needs.
Oligonucleotide arrays can be considered "single-channel"; the expression of a gene is
the corroborated expression of several oligonucleotides. Depending on the company that
provides the chips, the size of the oligonucleotides may vary from 25-mer to 50-mer. Since
the oligonucleotide library is carefully selected and the estimation procedure is
implemented in the device, the method provides higher selectivity and sensitivity, and
fits a large range of applications. Customization allows a user to discard the features
that are not needed from a larger set and to analyze only the genes of interest. Moreover,
in this technology, confidence in the results is enhanced by redundantly observing the
features (typically, one gene is observed on average by 16 unique oligonucleotides).
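As an illustration of how redundant probes can be corroborated into one expression value per gene, the following sketch summarizes a probe set by the median of its log2 intensities. This is a deliberately simple stand-in chosen for illustration, not the estimation procedure implemented by any vendor; the probe values and the function name are hypothetical.

```python
import numpy as np

def summarize_probe_set(intensities):
    """Summarize one gene's probe intensities as the median of their log2 values.

    The median tolerates a few outlying probes (for example, from
    cross-hybridization) better than the mean does.
    """
    return float(np.median(np.log2(np.asarray(intensities, dtype=float))))

# Hypothetical probe set: 16 probes for one gene, one of them an outlier.
probes = [256.0] * 15 + [65536.0]
expression = summarize_probe_set(probes)        # robust summary
mean_based = float(np.mean(np.log2(probes)))    # pulled upward by the outlier
```

The comparison with `mean_based` shows why a robust summary is a natural choice when a gene is observed by many probes of unequal quality.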
3
In studies designed to learn the features that discriminate diseased versus healthy cells, the samples
commonly are obtained from tumor cells and normal cells. In some cases, the "normal cells" are replaced by a
reference formed from a mixture of cells specific for the respective tissue (e.g., Stratagene, CA provides such
Universal Reference RNA). Other authors emphasize evolution in the same cell population. Experiments
could involve tumor vs. metastasis or before vs. after specific treatment, or radiation vs. mechanical stress.
Between these experimental designs there is no technical difference in dealing with RNA-amplification,
labeling, etc. (called wet-laboratory steps), but from a discrimination or classification point of view the
algorithms generally differ.


Experimental design
Experimental design has to combine the knowledge of several fields. When the technology
was first developed, the design in many laboratories was decided without taking into
account mathematical or statistical aspects (for instance, the value of an experiment that
contains a small number of samples is limited). Recently the situation has changed, and
more and more genomic laboratories combine, in the experimental design phase, the
efforts of (molecular) biologists, (clinical) medical staff with pathology expertise,
technical microarray specialists, and computer scientists with mathematical and
statistical knowledge. One of the benefits of involving computer scientists in the design
stage is that it minimizes the time and cost of acquiring results from this
high-throughput technology.
In addition, programmers can integrate the analyzed microarray data and enhance the
adaptability of the experimental design. The circumstances of the experiments,
especially in basic investigation laboratories or core facilities, often lead to questions
that commercially available software cannot answer.
Statistics must contribute to the analysis of the quantitative biological results for the
correct interpretation of a biological experiment. The performance of further analysis,
the choice of mathematical methods, and the accuracy and ease of implementation of the
algorithms depend strongly on a statistically correct design of the experiment.
The selected features and the (supervised) classification error depend not only
on the accuracy of the classification algorithm. If samples are labeled erroneously, the features
selected by an optimal classifier will reflect an erroneous biological learning procedure.
Because the labeling is assigned by a pathologist, this initial labeling is highly
important. Typically, the number of samples in an array experiment is small. Therefore,
if a significant percentage is incorrectly labeled, the quality of the experiment is dramatically
decreased. Special attention is required for cases with mixed cell types, such as the mixed
glioma types in Fuller et al.4 (2002) [11], where the pathologists designated the samples
as having features from two or more pathological classes. Mixed samples will not be

4 Glioma sub-types are clinically classified by visual observation of features of the tissues. In labeling the
cancer sub-types, pathologists make assignments based on the type of cell from which the tumor originates.
Neuronal cells very seldom degenerate and produce cancer; the cells that nourish, protect, and mechanically
sustain the neurons are at the origin of neuronal cancers. Two glioma sub-types are oligodendrogliomas, which
originate from oligodendrocytes, and astrocytomas, which originate from astrocytes. In several cases, the pathologists
observe tissues with features from both sub-types. The approach currently used is to label such cancers as
"mixed gliomas". When analyzing the molecular bases of the glioma sub-type discrimination, the samples
labeled as "mixed gliomas" should not be considered.

utilized to gain knowledge about the class discrimination, because in these cases the
disease-class attribution is unclear.
An important prerequisite of classification procedures is a complete, updated, and
correct clinical database. If the database is correctly designed and populated properly,
the design algorithms can select the features responsible for aggressiveness of disease or
survival terms. Tracking all these details of the experimental design requires complete
attention and a clear and organized approach. The proper design of both experiments
and data handling will avoid later undesired and costly reiteration of experiments.
Image acquisition is also critical and can affect the quality of the entire process.
Acquisition can easily be repeated two or three times in the time-frame limited by the
stability of the fluorescent dyes used. Image acquisition and analysis is succinctly
described by (1) laser scanning, (2) spot recognition and segmentation, and (3) spot
intensity evaluation.
Details of the cDNA image scanning process will be described here; the procedures
used for oligonucleotide arrays are similar. Immediately after sample spotting, in order
to avoid label degradation due to time and light instability,5 the slides are scanned and
the image files are stored in a lossless compression format (usually TIFF6). The files are
typically 20-80 MB for each of the two dye wavelengths, depending on the resolution
(typically adjustable in steps of 1, 5, 10, and 20 µm). For large numbers of samples,
scanning is a time-consuming process. Sufficiently informative scans can be
obtained when the resolution is set to 10 µm for cDNA [12]. A suggested rule is that
the pixel size be no larger than 10% of the spot diameter7. The scanners on the market
are generally designed for the dual red-green (Cy5-Cy3) wavelengths, and most have a
port for laser-bulb installation that allows blue and yellow detection.
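
The resolution rule can be checked with a short calculation. The following is a minimal sketch, assuming a hypothetical circular spot of 100 µm diameter scanned at 10 µm per pixel (the spot diameter and the function name are illustrative assumptions, not values from the scanner documentation). It counts the pixels of the bounding square and those whose centers fall inside the spot.

```python
def pixels_in_spot(spot_diameter_um, pixel_size_um):
    """Return (pixels in bounding square, pixels whose centers lie inside the circular spot)."""
    n = int(spot_diameter_um // pixel_size_um)  # pixels along one side of the square
    radius = n / 2.0                            # spot radius in pixel units
    inside = 0
    for row in range(n):
        for col in range(n):
            # distance of the pixel center from the spot center
            dx = col + 0.5 - radius
            dy = row + 0.5 - radius
            if dx * dx + dy * dy <= radius * radius:
                inside += 1
    return n * n, inside

# Hypothetical example: 100 µm spot, 10 µm pixels.
square, inside = pixels_in_spot(100, 10)
print(square, inside)
```

Under these assumptions the bounding square holds 100 pixels, and the count of pixels inside the circle stays above the "more than 60 pixels" mentioned in the footnote to the rule.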
Correct scanning requires appropriate adjustment of the laser beam intensity and
the photomultiplier gain. The number of spots with saturated pixels should be minimized
(e.g., 3-10 for a slide with 12 k features), such that the experiment benefits from the
maximum dynamic range of the device. Each scan degrades the quality of the slide
to some extent through irreversible modification of the fluorescent labels (this effect is called
photo-bleaching). If the laser beam intensity is too low and the gain is too high, the
photomultiplier will introduce too much error or the spots will not be visible. For a
"preview" scan, it is advisable to set the laser intensity to the minimum value needed, in order not to damage the slide. Although the slides are kept in a dry and dark
5 At M.D. Anderson Cancer Center in the Cancer Genomics Core Laboratory, the scanning step is performed
on the same day as the washing procedure, when the membrane of the slides is still wet. Spurious expression levels
can be observed when the slide is very dry.
6 Tagged Image File Format.
7 This rule actually generates a square of 100 pixels that covers the spot, and more than 60 pixels will
characterize the spot. This is considered enough to get a good estimate of the spot value. See further the "sVOL"
definition.

place, photo-bleaching affects the microarrays and the slides will typically degrade after
six months.
To locate the spots and quantify the spot-level values, the papers presented in this
thesis used ArrayVision. This software is one of many choices on the market.
The advantages of ArrayVision include clear and detailed documentation of the
algorithms for measuring expression. Once a laboratory creates a structure that links the
position of each spot to the accession number (from the GenBank database) of the
respective gene, the configuration for robotic spotting of cDNA and the spot-estimation
method do not need major modifications for further experiments.
The grid8 for spot-finding is placed manually (using click-and-drag). If the distance
characteristics are correctly defined over the entire slide, the pre-defined circles will
overlap the spots. The precise initial position is not critical once the first circle
overlaps the first spot, since the grid position will auto-tune within a fine range. The
specifications for the grid, including the number of spots, the configuration in patches (the
number of vertical and horizontal spots and the distances between spots), the respective
distances, and the dimensions, are easily accessible and defined in a table.
There are several algorithms that automatically re-place the grid and warp the
location of the grid to the center of the spots. One of these methods, based on the
two-dimensional fast Fourier transform (2D-FFT) of the image, obtains the
relative distance between spots by means of the next-maximum frequency, since the
spots represent local maxima in two-dimensional space. A requirement
in all grid-finding solutions is to visually verify the
grid and/or flag the spots that are of low quality.
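
The frequency-based idea can be illustrated in one dimension. The sketch below is a simplification under stated assumptions, not the algorithm of any particular software: it takes a synthetic intensity profile (as would be obtained by summing an image along one axis), computes a naive discrete Fourier transform, and reads the spot-to-spot distance from the strongest non-zero frequency. The function name, the Gaussian spot shape, and the 16-pixel pitch are all hypothetical.

```python
import cmath
import math

def spot_pitch(profile):
    """Estimate the spot-to-spot distance (in pixels) from a 1D intensity profile."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]          # remove the zero-frequency term
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):                  # naive DFT, adequate for short profiles
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n / best_k                               # period of the dominant frequency

# Synthetic profile: Gaussian-shaped spots every 16 pixels along a 128-pixel line.
profile = [sum(math.exp(-((i - c) ** 2) / (2 * 2.0 ** 2)) for c in range(8, 128, 16))
           for i in range(128)]
pitch = spot_pitch(profile)
print(pitch)
```

On real scans the profile is noisy and the grid may be slightly rotated, so an estimate of this kind would only seed the grid, which still has to be verified visually as noted above.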
Once the grid for spot-finding is placed over the scanned microarray image, the values
of spots are acquired. Segmentation is the process used to separate the foreground
pixels containing the true signal from the nonspecific background. The pixel
intensities in a grid containing a spot are assumed to be the superposition of the
foreground and the local background. Two approaches are typically considered
depending on the segmentation algorithm that defines the convex area of the
spot. The simplest is to define fixed-circle size segmentation9.

Figure 4.3 The image of a single spot. The circles estimate the intensity of the
spot, while the diamonds estimate the intensity of the background for the spot area.

8 The grid is the imaginary net-structure that defines the location of centers of the spots.
9 The procedure assumes (ideally) for cDNA that the spots in the array are circular. Besides simplicity of
implementation, it can be advantageous in some oligoarrays (e.g., in Illumina's arrays www.ilumina.com)

A more accurate procedure with direct application to cDNA arrays, since the spots are
not perfectly circular, is the adaptive segmentation method (Figure 4.3). Two circles
centered on the mass-center of the spot, which usually coincides with a node of the
spot-finding grid, define the minimum and maximum initial area for segmentation. The
algorithm called SPOT [37], developed by Yang et al. (2001), enlarges the small inner
circle and shrinks the large exterior circle to a convex area such that an optimal border
can be determined. A region-growing algorithm can then be used to define the area of
the spot enclosed in this segmented convex area.
In estimating the value of a spot, the algorithms may or may not use the background.
The value of the background may be estimated as the mean (or median) value over
the entire slide; in this case the method is called flattened background. Another method
is to estimate the background locally for each spot: the intensity of the background
pixels is measured in four diamonds around each spot, and the estimated intensity
(the median of the pixels in the diamonds) is then applied to the area
of the spot.
In most publications on cDNA microarrays, the spot value is estimated as "subtracted
volume" (i.e., volume expression of spots when the background is removed), denoted
by sVOL:

$$ sVOL = (S_{spot} \cdot A_{spot}) - \frac{(S_{bkg} \cdot A_{bkg}) \cdot A_{spot}}{A_{bkg}} $$

where $S_{spot}$ is the spot signal level calculated as the median of pixel values in the
convex area $A_{spot}$ of the spot, and $S_{bkg}$ and $A_{bkg}$ are the corresponding signal
level and area of the background.
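
The sVOL estimate can be written out as a short sketch. It assumes that the spot and background pixels have already been separated by one of the segmentation methods above and that areas are measured in pixels; the function name and the pixel values are hypothetical and do not reproduce the ArrayVision implementation.

```python
from statistics import median

def subtracted_volume(spot_pixels, background_pixels):
    """sVOL = S_spot*A_spot - (S_bkg*A_bkg)*A_spot/A_bkg, with areas counted in pixels."""
    s_spot = median(spot_pixels)        # spot signal level
    a_spot = len(spot_pixels)           # spot area
    s_bkg = median(background_pixels)   # local background level
    a_bkg = len(background_pixels)      # background area
    return s_spot * a_spot - (s_bkg * a_bkg) * a_spot / a_bkg

# Hypothetical pixel intensities for one spot and its local background.
svol = subtracted_volume([100, 120, 110, 130], [10, 12, 14])
print(svol)
```

Because the background volume is rescaled to the spot area, the expression reduces to the difference between the spot and background levels multiplied by the spot area.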

Several publications [40][14] analyzed the validity of considering the background,
since the values of the background are not involved in the observed molecular
phenomena, irrespective of the mRNA concentration. By using the background in the
estimation process, we introduce noise into the estimated spot values. Besides, good-quality
experiments will have low levels of background, with values significantly smaller
than the spot volume.
9 (continued) where the spots are defined by holes in the silicon bed. The hybridization takes place in the holes, and the
fixed-circle method confers robustness.

Another procedure, by Nagarajan [25], defines which pixels, restricted to a spot-finding
grid superposed on the microarray scan, belong to the spots and which belong to
the background. The k-means clustering technique and partitioning around
medoids (PAM) were used to generate a binary partition (the low cluster is assigned to the
background and the high cluster to the spot), based on the pixel intensity distribution. The
values of the spot and background are assessed from the cluster representatives: the
median of each cluster for k-means and the medoid for PAM. The results [25]
are reported to be superior to those of previous methods by Yang et al. [39] [37].
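
The binary partition of pixel intensities can be sketched with a two-cluster k-means in one dimension. This is a minimal illustration under simplifying assumptions (initial centers at the minimum and maximum intensity, hypothetical pixel values and function name), not the published implementation of [25].

```python
from statistics import median

def two_means_partition(pixels, max_iter=100):
    """Split pixel intensities into a low (background) and a high (spot) cluster."""
    lo, hi = float(min(pixels)), float(max(pixels))   # initial cluster centers
    background, spot = list(pixels), []
    for _ in range(max_iter):
        background = [p for p in pixels if abs(p - lo) <= abs(p - hi)]
        spot = [p for p in pixels if abs(p - lo) > abs(p - hi)]
        if not spot:                                  # flat region, nothing to separate
            break
        new_lo = sum(background) / len(background)
        new_hi = sum(spot) / len(spot)
        if new_lo == lo and new_hi == hi:             # centers stable: converged
            break
        lo, hi = new_lo, new_hi
    return background, spot

# Hypothetical intensities inside one grid cell.
pixels = [10, 12, 11, 13, 200, 210, 190, 205]
background, spot = two_means_partition(pixels)
background_value = median(background)   # cluster representative for the background
spot_value = median(spot)               # cluster representative for the spot
print(background_value, spot_value)
```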

Data pre-processing analysis in microarrays.


Although the exploratory analysis methods10 in the two microarray platforms are
conceptually similar once the gene expression11 profiles are obtained, the pre-processing
approaches may vary. The pre-processing steps are intended to enhance the biological
information in the data set. Poor-quality spot estimates can be caused by local artifacts,
such as errors in deposition or binding on the membrane, or by systematic effects that
normalization algorithms are meant to remove (systematic errors, variations in sample
concentrations, or inter-slide variation).
When the spot intensity is lower than the background, the image-processing tools will
assign a zero value. The zero values of these spots are critical for further steps, because
most classification algorithms are not able to deal with the infinite or undefined (NaN)
values12 that zeros produce after division or logarithmic transformation.
One typical method is to simply exclude individual gene values that contain spots with
log-transformed intensity lower or higher than a constant times the standard deviation
[26] [27] or, if the number of rows that have zero values is small enough, to flag and filter
out the entire row (gene) that contains the zero values.
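
The flag-and-filter step can be sketched as follows. This is an illustrative example with hypothetical function names and values: spots with a zero (or negative) value in either channel are flagged with `None` instead of producing an infinite or undefined log ratio, and genes that contain any flagged value are dropped.

```python
import math

def log_ratios(tumor, normal):
    """Return log2(tumor/normal) per gene, or None where the ratio is undefined."""
    ratios = []
    for t, n in zip(tumor, normal):
        if t <= 0 or n <= 0:
            ratios.append(None)          # flag instead of producing inf or NaN
        else:
            ratios.append(math.log2(t / n))
    return ratios

def drop_flagged_genes(matrix):
    """Keep only rows (genes) without flagged values."""
    return [row for row in matrix if all(v is not None for v in row)]

# Hypothetical spot values for three genes in one tumor/normal pair.
ratios = log_ratios([400.0, 0.0, 50.0], [100.0, 80.0, 0.0])
kept = drop_flagged_genes([[2.0, 1.0], [None, 0.5], [0.3, None]])
print(ratios, kept)
```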
An important advance in dealing with unreliable spots is the imputation
procedure [36]. This procedure estimates the value of the flagged spots (or flagged genes)
based on the gene values of the closest neighbors, using a k-nearest neighbor (k-NN) procedure.
Depending on whether the search for the closest neighbors is restricted to the same disease
sub-type or covers the entire data set, the imputing method is called supervised or unsupervised.
Imputation based on the k-NN algorithm provides a robust and sensitive approach for
estimating missing spot values or gene expressions for microarrays. However, depending on
the significance of the biological discovery, the importance of the results, and the
amount of "imputed data", caution should be exercised when drawing critical biological
conclusions from data that are partially imputed.
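
A minimal sketch of the unsupervised variant is given below. It is an illustration under stated assumptions rather than the implementation of [36]: missing values are marked with `None`, the distance between two genes is the root-mean-square difference over the samples observed in both, and the missing value is replaced by the plain average of the k nearest genes. The function name and the data are hypothetical.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries of a genes-by-samples matrix from the k nearest genes (rows)."""
    imputed = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for s, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                if h == g or other[s] is None:
                    continue                      # neighbor must have this sample observed
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[s]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:                           # leave None if no neighbor is usable
                imputed[g][s] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Hypothetical expression matrix: the first gene lacks a value in the third sample.
data = [[1.0, 2.0, None],
        [1.1, 2.1, 3.0],
        [0.9, 1.9, 2.8],
        [5.0, 6.0, 9.0]]
filled = knn_impute(data, k=2)
print(filled[0])
```

For the supervised variant, the candidate rows and columns would simply be restricted to samples of the same disease sub-type before the search.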
The normalization process is intended to remove the systematic errors. These errors
might be due to unbalanced expression caused by the dyes, to differences in labeling
efficiency (in binding), or to differences in secondary emission and filtering. Amplification

10
Chapter 8 describes a set of algorithms for selection of the differentially expressed genes after the preprocessing steps are performed. A variety of signal processing methods can be applied for
classification/discrimination of cancer types and for discovering differentially expressed genes in a specific
cancer type or sub-types using various methods from multivariate data analysis.
11
"Gene expression" refers to the amount of one or more gene products in a particular sample. In the cDNA
microarray case, the expression is often calculated as the mean of the two replicates. Several other modalities
are possible if the number of replicates is higher. In oligoarrays the value of "gene expression" is related to the
expression of the gene by means of the hybridization to a number of oligonucleotides (~30 per gene).
12
An infinite value comes from dividing by zero (e.g., tumor/normal; also logarithm of a zero value is not
defined).

on the two channels of scanning is known to introduce important errors that can partially
be corrected. A microarray experiment can be normalized based on principles that come
from different fields. Using a molecular biology method, normalization can be performed
by introducing additional gene probes as controls into the microarray design, namely probes
for so-called equally-expressed genes. From statistics comes a method that normalizes the
expression of the entire microarray by its own statistics: for sets with 20 k features/genes, it
is expected that most genes are expressed in the same way in the different samples.
The approach of equally-expressed genes introduces probes that hybridize to
housekeeping genes or exogenous genes as controls. Initially, housekeeping genes like
glyceraldehyde-3-phosphate dehydrogenase or ribosomal RNA were thought to be
expressed at a constant level in all tissues [10]. Many studies used a single housekeeping
gene for normalization. However, in later studies [6], no gene with equal expression in all
cells could be identified. Several papers showed that the expression level of housekeeping
genes varied among tissues or cells, and changed under certain circumstances. The
reference gene(s) must be selected for each study (namely, for the tissue of origin), and
this selection requires special attention [41], [5]. If care is not exercised, normalizing the expression
values to a housekeeping gene may introduce additional errors.
Another method of normalization was developed with the rationale of removing
any effect due to individual characteristics of dyes. Scientists performed a
"dye-swapping" procedure, where tumor and normal samples were each spotted with both
dyes, Cy3 and Cy5. This method cannot be applied when there is a limited amount
of sample and it increases the cost of the experiment.

Figure 4.4 Example of normalization procedure. A more complete description is presented in Chapter 8.

Nonlinear normalization methods use statistics that assume the genes on a slide to have
similar median and quantile expression values to aid in normalization [13] [16] [23]. There
is a large variety of algorithms for normalizing the expression levels (described in Chapter
8) that fall into several categories: (1) subtraction of the global median value of each
sample, (2) subtraction of the median value and equalization of the variance using the
standard deviation (STD) or the median absolute deviation (MAD) estimate, and (3)
application of locally weighted scatterplot smoothing (lowess) normalization.
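
Categories (1) and (2) can be sketched in a few lines. The example below is illustrative only, with hypothetical function names and values; the MAD is used without the consistency factor that is sometimes applied to make it comparable to the standard deviation.

```python
from statistics import median

def median_center(values):
    """Category (1): subtract the global median of the sample."""
    m = median(values)
    return [v - m for v in values]

def mad_scale(values):
    """Category (2): subtract the median and divide by the median absolute deviation."""
    centered = median_center(values)
    mad = median([abs(v) for v in centered])
    if mad == 0:
        raise ValueError("MAD is zero; the sample cannot be scaled")
    return [v / mad for v in centered]

# Hypothetical log-expression values of one sample.
sample = [2.0, 4.0, 6.0, 8.0, 20.0]
centered = median_center(sample)
scaled = mad_scale(sample)
print(centered, scaled)
```

Both operations are robust to the single large value in the example, which is the usual motivation for preferring the median and MAD over the mean and STD.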

Briefly, "MA-plots" were defined by Dudoit et al. [8] and help identify the artifacts of
the spotting procedure. On the scatter plot defined by

$$ M = \log_2 (Cy3 / Cy5) \quad \text{and} \quad A = \log_2 \sqrt{Cy3 \cdot Cy5} $$

several local normalization procedures can be defined by subtracting a function $c(\cdot)$
from the intensity log ratios, computed separately for each
slide, using only data from that hybridization.


A constant adjustment is used to tune the distribution of the log-ratios to have a
median of zero for each slide. This assumes that the red and green intensities are related by
a constant factor, $Cy3 = Const \cdot Cy5$; therefore, on the logarithmic scale,
$M = \log_2 (Cy3 / Cy5)$ needs to be shifted by a constant:

$$ \log_2 \frac{Cy3}{Const \cdot Cy5} = \log_2 \frac{Cy3}{Cy5} - \log_2 Const $$
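
The constant adjustment can be sketched as follows: M and A are computed for each spot, and the median of M is subtracted so that the log ratios of the slide have a median of zero. The function names and intensities are hypothetical, and all intensities are assumed to be strictly positive.

```python
import math
from statistics import median

def ma_values(cy3, cy5):
    """Return (M, A) per spot: M = log2(Cy3/Cy5), A = log2(sqrt(Cy3*Cy5))."""
    m = [math.log2(g / r) for g, r in zip(cy3, cy5)]
    a = [0.5 * math.log2(g * r) for g, r in zip(cy3, cy5)]
    return m, a

def global_normalize(m):
    """Shift the log ratios by a constant so that their median is zero."""
    shift = median(m)
    return [v - shift for v in m]

# Hypothetical intensities for three spots on one slide.
cy3 = [2.0, 8.0, 16.0]
cy5 = [1.0, 2.0, 8.0]
m, a = ma_values(cy3, cy5)
m_normalized = global_normalize(m)
print(m, a, m_normalized)
```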
However, a global normalization approach is not adequate in situations where dye biases
can depend on the overall spot intensity or the spatial location on the array. Several articles
([39], [3]) proposed normalization based on local regression. To perform a local
A-dependent normalization [39], $\log_2 (Cy3 / Cy5)$ is mapped to:

$$ \log_2 (Cy3 / Cy5) - c(A) $$

(see Figure 4.4, where $c(\cdot)$ is the lowess fit of the MA-plot).

Berger et al. [3] presented an optimization approach for the parameters in lowess,
which are usually selected arbitrarily. The optimization determines the bandwidth
parameter for the local regression by minimizing a
cost function defined as the mean-squared difference between the lowess estimates and the
normalization reference level.
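
The intensity-dependent normalization can be sketched with a simplified lowess: a local linear regression with tricube weights, without the robustness iterations of the full procedure. The code is an illustration under these assumptions; the function name, the `frac` bandwidth parameter (fraction of points used in each local fit), and the data are hypothetical, and no bandwidth optimization is attempted.

```python
import math

def lowess_fit(x, y, frac=0.5):
    """Simplified lowess: tricube-weighted local linear fit evaluated at each x."""
    n = len(x)
    k = min(n, max(3, int(math.ceil(frac * n))))      # points per local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12                     # local bandwidth
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0         # fall back to the weighted mean
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical MA data with a purely intensity-dependent dye bias.
A = [6.0 + 0.5 * i for i in range(12)]
M = [0.1 * a - 0.3 for a in A]
c_of_A = lowess_fit(A, M, frac=0.5)
M_normalized = [m - c for m, c in zip(M, c_of_A)]
print(max(abs(v) for v in M_normalized))
```

In this synthetic case the bias is exactly linear in A, so the local fit removes it completely; on real data the result depends on the bandwidth, which is the parameter addressed by the optimization in [3].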
Although initially designed to equalize the red and green channels of the microarray
experiment, the normalization algorithms described above may also apply to spot replicate
values and to gene-expression values originating from two samples. In the case of spot
replicates, even for the same color, the expressions of replicates are expected to have the same
local characteristics of median and variance if the number of evaluated genes is high. In
the case of gene expressions, the two samples (e.g., tumor vs. normal, or a transfected
stable cell line vs. parental cells [15]) should have similar levels of expression in the
majority of genes.
Interestingly, in addition to systematic effects of dyes, which can mostly be corrected
by a normalization method, a gene-specific dye bias was defined recently by
Martin-Magniette et al. [24] as the Label Bias Index (LBI). This bias may take values larger than two
for the ratio (Cy3 / Cy5) and may alter the conclusions about differentially expressed

genes. The issue is new and more studies must be performed for the effect to be
understood.

Standardization of information acquired with microarray experiment.
There are several reasons for standardization of microarray technology. As in other
technical areas, proper standardization permits increased interchange of information
and makes the comparison of experiments valid. Without a standard, it is difficult to judge the
validity of protocols [1]. The Microarray Gene Expression Data (MGED) Society (http://www.mged.org)
proposed [4] the standard Minimum Information About a Microarray Experiment
(MIAME). The standard aims to guide the development of microarray databases and data
management software. The data models and exchange formats for microarray data are
described in MAGE. The Microarray Gene Expression Markup Language (MAGE-ML, based
on XML) can describe microarray designs, microarray manufacturing information,
microarray experiment setup and execution information, gene expression data, and data
analysis results.
MAGE was submitted by the European Bioinformatics Institute (EBI at
http://www.ebi.org) and Rosetta Biosoftware (http://www.rosettabio.com) and in 2002
was adopted by the Object Management Group (OMG) standards group
(http://www.mged.org/mage). Additionally, several companies including Agilent
(http://www.agilent.com), Affymetrix (www.affymetrix.com), and Iobion
(http://www.iobion.com) contributed to MAGE-ML [35]. Several tools supporting the
information capture and management are described on the MIAME webpage
(http://www.mged.org/miame).
The minimal information is structured in six parts: (1) the standards for experimental
design, the set of hybridization experiments as a whole, (2) standards for array design,
each array used and each element (spot) on the array, (3) samples used, extract
preparation and labeling, (4) procedures and parameters of hybridizations, (5) images,
quantitation, and measurements specifications, (6) normalization controls: types, values,
specifications. As MIAME is commonly accepted by major users, it is advisable that
experiments comply with these standards. Also, most journals require data-submission
using this format.
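
The six parts listed above lend themselves to a simple completeness check before submission. The sketch below is only an illustration: the section keys and the record layout are hypothetical names chosen for this example and are not MAGE-ML element names or part of any MGED tool.

```python
# Hypothetical keys, one per MIAME part listed above.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridizations",
    "measurements",
    "normalization_controls",
)

def missing_miame_sections(record):
    """Return the MIAME parts that are absent or empty in an experiment record."""
    return [section for section in MIAME_SECTIONS if not record.get(section)]

# Hypothetical experiment annotation with two parts still missing.
record = {
    "experimental_design": "tumor vs. normal, 20 hybridizations",
    "array_design": "cDNA array, 12 k features",
    "samples": ["glioma_01", "glioma_02"],
    "hybridizations": "overnight, standard protocol",
    "measurements": "",
}
missing = missing_miame_sections(record)
print(missing)
```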

References
[1] [No Authors], Microarray standards at last. Nature 419, 323 (26 September 2002); doi:10.1038/419323a
[2] Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z. Tissue classification
with gene expression profiles. J Comput Biol. 2000;7(3-4):559-83.
[3] Berger JA, Hautaniemi S, Jarvinen AK, Edgren H, Mitra SK, Astola J. Optimized LOWESS
normalization parameter selection for DNA microarray data. BMC Bioinformatics. 2004 Dec
09;5(1):194.
[4] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W,
Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC,
Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M.
Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.
Nat Genet. 2001 Dec; 29(4):373.
[5] Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat Genet. 2002 Dec;
32 Suppl:490-5.
[6] de Kok JB, Roelofs RW, Giesendorf BA, Pennings JL, Waas ET, Feuth T, Swinkels DW, Span
PN. Normalization of gene expression measurements in tumor tissues: comparison of 13 endogenous
control genes. Lab Invest. 2005 Jan;85(1):154-9.
[7] DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM. Use
of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet. 1996
Dec;14(4):457-60.
[8] Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially
expressed genes in replicated cDNA microarray experiments. Stat. Sinica 12 (2002), pp. 111-139.
[9] Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM. Expression profiling using cDNA
microarrays. Nat Genet. 1999;21(1, suppl):10-14.
[10] Eickhoff B, Korn B, Schick M, Poustka A, van der Bosch J. Normalization of array hybridization
experiments in differential gene expression analysis. Nucleic Acids Res. 1999 Nov 15;27(22):e33.
[11] Fuller GN, Hess KR, Mircean C, Tabus I, Shmulevich I, Rhee CH, Aldape KD, Bruner JM,
Sawaya RA, Zhang W. Chapter 16: Human Glioma Diagnosis From Gene Expression Data; in
Computational and Statistical Approaches to Genomics. Kluwer Academic Publishers, 2002. ISBN: 1-4020-7023-3
[12] Jain AN, Tokuyasu TA, Snijders AM, Segraves R, Albertson DG, Pinkel D. Fully automatic
quantification of microarray image data. Genome Res. 2002 Feb;12(2):325-32.
[13] Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J
Comput Biol. 2000;7(6):819-37.
[14] Kooperberg C, Fazzio TG, Delrow JJ, Tsukiyama T. Improved background correction for spotted
DNA microarrays. J Comput Biol. 2002;9(1):55-66.
[15] Lee EJ, Mircean C, Shmulevich I, Wang H, Liu J, Niemisto A, Kavanagh JJ, Lee JH, Zhang W.
Insulin-like growth factor binding protein 2 promotes ovarian cancer cell invasion. Mol Cancer. 2005
Feb 02;4(1):7.
[16] Lee ML, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray gene expression
studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci U S
A. 2000 Aug 29;97(18):9834-9.
[17] Lee PD. Control genes and variability: absence of ubiquitous reference transcripts in diverse
mammalian expression studies. Genome Res. 12 (2002), pp. 292-297.
[18] Leung YF, Cavalieri D. Fundamentals of cDNA microarray data analysis. Trends Genet. 2003
Nov;19(11):649-59.
[19] Leung YF. et al., Microarray software review. In: Berrar DP. et al., A practical approach to
microarray data analysis, Kluwer academic (2002).
[20] Li C. and Wong W.H. Model-based analysis of oligonucleotide arrays: expression index
computation and outlier detection. Proc. Natl. Acad. Sci. U. S. A. 98 (2001), pp. 31-36.
[21] Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays.
Nat Genet. 1999 Jan;21(1 Suppl):20-4.
[22] Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C,
Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density
oligonucleotide arrays. Nat Biotechnol. 1996 Dec;14(13):1675-1680.
[23] Lönnstedt I, Speed TP. Replicated Microarray Data. Stat. Sinica 12 (2002), pp. 31-46.
[24] Martin-Magniette ML, Aubert J, Cabannes E, Daudin JJ. Evaluation of the gene-specific dye bias
in cDNA microarray experiments. Bioinformatics. 2005 Feb 2; [Epub ahead of print]
[25] Nagarajan R. Intensity-based segmentation of microarray images. IEEE Trans Med Imaging. 2003
Jul; 22(7):882-9.
[26] Pritchard CC, Hsu L, Delrow J, Nelson PS. Project normal: defining normal variance in mouse
gene expression. Proc Natl Acad Sci U S A. 2001 Nov 6;98(23):13266-71.

[27] Quackenbush J. Microarray data normalization and transformation. Nat. Genet. 32 Suppl. (2002),
pp. 496-501.
[28] Sasik R. et al., Statistical analysis of high-density oligonucleotide arrays: a multiplicative noise
model. Bioinformatics 18 (2002), pp. 1633-1640.
[29] Schadt EE, Li C, Ellis B, Wong WH. Feature extraction and normalization algorithms for high-density
oligonucleotide gene expression array data. J Cell Biochem Suppl. 2001;Suppl 37:120-5.
[30] Schadt EE, Li C, Su C, Wong WH. Analyzing high-density oligonucleotide gene expression array
data. J Cell Biochem. 2000 Oct 20;80(2):192-202.
[31] Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns
with a complementary DNA microarray. Science. 1995 Oct 20;270(5235):467-70
[32] Sherlock G. Analysis of large-scale gene expression data. Brief. Bioinform. 2 (2001), pp. 350-362.
[33] Shmulevich I, Astola J, Cogdell D, Hamilton SR, Zhang W. Data extraction from composite
oligonucleotide microarrays. Nucleic Acids Res. 2003 Apr 1; 31(7): e36-e36.
[34] Simon RM, Dobbin K. Experimental design of DNA microarray experiments. Biotechniques
(2003), pp. S16-S21.
[35] Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball
C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White
J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ Jr, Brazma A.
Design and implementation of microarray gene expression markup language (MAGE-ML). Genome
Biol. 2002 Aug 23;3(9):RESEARCH0046. Epub 2002 Aug 23.
[36] Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB.
Missing value estimation methods for DNA microarrays. Bioinformatics. 2001 Jun;17(6):520-5.
[37] Yang YH, Buckley MJ, Speed TP. Analysis of cDNA microarray images. Brief Bioinform. 2001
Dec;2(4):341-9.
[38] Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA
microarray data: a robust composite method addressing single and multiple slide systematic variation.
Nucleic Acids Res. 2002 Feb 15;30(4):e15.
[39] Yang YH, Speed T. Design issues for cDNA microarray experiments. Nat. Rev. Genet. 3 (2002),
pp. 579-588.
[40] Zakharkin SO, Kim K, Mehta T, Chen L, Barnes S, Scheirer KE, Parrish RS, Allison DB, Page
GP. Sources of variation in Affymetrix microarray experiments. BMC Bioinformatics. 2005 Aug
29;6(1):214 [Epub ahead of print]
[41] Zhang X, Ding L, Sandford AJ. Selection of reference genes for gene expression studies in human
neutrophils by real-time PCR. BMC Mol Biol. 2005 Feb 18;6(1):4.
[42] Zhou Y, R. Abagyan. Algorithms for high-density oligonucleotide array. Curr. Opin. Drug Discov.
Devel. 6 (2003), pp. 339-345.

Chapter 5
Reverse-phase protein array technology
The emerging field of quantitative proteomics aims to identify differences between
groups of experimental samples by measuring the expression of proteins. The
development of high-throughput techniques to measure protein expression is required
for applications in targeted drug discovery and analysis of biomarkers, and may
revolutionize early cancer diagnosis [26][23][12][13]. The recently developed
technologies overcome the two main impediments of traditional immunoassay platforms:
limited multiplexing capability and high sample volume requirements. The reverse-phase
lysate arrays described in this thesis are one of the technologies that benefit from the
highly specific binding of antibodies to unique proteins. Due to its high throughput,
scalability, flexibility, and low sample consumption, the reverse-phase lysate array (or simply,
lysate array) has been promoted as an excellent protein quantitation tool.
Cancer growth, invasion, and metastasis are usually the result of perturbations in the
normal network of cell signaling proteins [16]. Genetic defects ultimately lead to tumor
cell survival by altering the functional proteins that confer survival advantages for the
tumor [14]. Through specific analysis of protein isoforms, lysate arrays can monitor
changes in cells at translational and post-translational levels. This knowledge is
complementary to transcriptional levels studied in genomic analysis. Correlation of
studies of translational processes with those from high-throughput protein arrays will
hopefully reveal the cellular proteomic network for individual patients' tumors. Direct
findings or computational models of cellular networks have potential to aid in diagnosis,
aid in treatment decisions, and predict prognosis [19][3][6][15]. Performance
benchmarking, standardization, and procedural specification for protein arrays will
improve the communication and information exchange in the key applications such as
novel assays, biomarker discovery in trials of new drugs, and validation of these
proteomics techniques.
For measurement of the expression of hundreds of proteins on hundreds of samples,
these protein array methods have three major advantages when compared to western
blot experiments: a minimal amount of protein extract is required, which is especially
valuable in the case of rare samples; considerably lower cost; and much less labor.
Two approaches are currently used to measure the expression of proteins in
high-throughput format. In the first, called a forward-phase array, a large number of
antibodies are deposited on the slide. The slide is then incubated with one cell lysate

sample. Each spot corresponds to a protein of interest. The expression is detected with a
second tagged molecule or by directly labeling the protein of interest.
Reverse-phase arrays are prepared as shown in Figure 5.1 by immobilizing the
sample proteins, usually from a cell lysate or serum, on a slide, typically by robotic
spotting, in a manner similar to cDNA microarray printing. Spots represent samples, and
usually each sample is spotted in a multiple dilution series. Thousands of spots are
printed on each slide, and each slide comprises hundreds of samples. Only a small
quantity of lysate is utilized for preparation of a slide. As a result, rare or valuable samples,
such as those we used from glioma tissues, are conserved. The entire slide is then probed
with a single antibody, and thus a single protein is measured across multiple samples.
Because all samples are probed on the same slide, the multiple samples are assayed
under the same conditions.

[Figure 5.1 diagram labels: Patients; Tumor biopsy; Laser capture microdissection; Tumor cell analysis; Lysate; Reverse-phase lysate array slide; Antibodies; Labels]

Figure 5.1. Main steps in the production of a reverse-phase lysate array. With the aid of laser capture
microdissection, protein lysates from histopathologically relevant cell populations of the sample tumor are
isolated, extracted, and immobilized on the reverse-phase protein array. On average, 10,000 cells were
microdissected and lysed, a quantity smaller than used in the traditional immunoassay platforms.

In the case of the forward-phase array, many antibodies are applied to a single glass
surface and each sample is analyzed on a different slide. To allow comparison of
samples, the following requirements should be met: (a) assay cross-reactivity must be
eliminated, (b) conditions must allow for the different sensitivities of the multiple analytes,
(c) the dynamic range must be adjusted according to biological relevance, and (d) the
slide production conditions must be optimized for reproducibility. Cross-reactivity is less
of a problem in reverse-phase protein array assays than in forward-phase arrays, but the
issue of dynamic range is similar. Proteins of interest may be expressed over a broad
dynamic range (up to a factor of 10^10) [10]. A challenge, therefore, is to develop a system
capable of accurate detection over this wide range of protein concentrations. In contrast
to traditional enzyme-linked immunoassays, specifications and standards for antibody
microarrays have not yet been generally agreed upon [25]. A number of studies have
begun to examine the procedures and formulate these standards.
The procedures used to prepare the samples have been described [11][9][1].
Traditional formalin-fixed, paraffin-embedded specimens typically provide superior
morphology and reliable long-term storage of clinical specimens. However,
formalin-fixed, paraffin-embedded specimens are not always compatible with current molecular
techniques because of suboptimal recovery of most macromolecules. A recent paper [7]
reports progress in nondestructive molecule extraction. Prior to microdissection,
paraffin-embedded tissue sections are completely submerged in xylene and stained
with a modified H&E staining¹ protocol. Using a laser capture microdissection system, the
tumor cells are isolated from the tissue under careful direct pathological examination.
Espina et al. suggested a formula to calculate the total number of cells required in an
experiment. Typically, a 2 mm × 5 mm core of tissue obtained by core needle biopsy with a
16 or 18 gauge needle is sufficient. The total number of cells, T, required for the protein
array (in the range 1 k to 5 k antibodies) depends on the number of protein
molecules per sample, x, the sensitivity of the system, s, and the number of
molecules/mole. This may be expressed [10] by the formula T = 6.023(x s).

A number of companies provide standard 25 × 76 mm coated slides specifically designed
for protein arrays. The slides typically used in microarrays are also acceptable for a
self-designed lysate array. In our experiments, we observed better results (in terms of sample binding
to the nitrocellulose membrane) with VIVID PALL slides². Other choices are available,
including FASTSlides from Schleicher and Schuell Bioscience³, which are stable for up to
three years at room temperature, according to the company. SuperProtein⁴ slides from
TeleChem International⁵ and the MaxiSorp black polymer-coated slides from Nalge
Nunc⁶ are also designed for protein arrays [20]. 3-D HydroGel-coated slides from

¹ Hematoxylin & Eosin staining protocol, which colors the nucleus blue and the cytoplasm pale pink. For
example protocols, see http://www.bcm.edu/rosenlab/protocols.html .
² http://www.pall.com; 2200 Northern Boulevard, East Hills, NY 11548, USA, or
http://www.pall.com/laboratory_17821.asp for VIVID products.
³ http://www.schleicher-schuell.com; Hahnestrasse 3, D-37586 Dassel, Germany.
⁴ An atomically flat glass substrate coated with a 150 µm thick layer of hydrophobic polymer.
⁵ http://www.arrayit.com; 524 East Weddell Drive, Sunnyvale, CA 94089, USA.
⁶ http://www.nalgenunc.com; 75 Panorama Creek Drive, Rochester, NY 14625, USA.


PerkinElmer⁷ provide a hydrophilic environment and the capacity to hold more than a
single layer of probes [20].
Prior to our publication [22], several other articles described important advances in
lysate array technology. Nishizuka et al. studied the correlation between mRNA
expression and protein expression [24]. Because they are technically easier to implement,
techniques to measure transcript levels in high-throughput format have developed more
rapidly than protein profiling, and transcript expression levels were assumed to predict
protein expression. However, protein levels cannot be inferred from transcript data [24].
NCI cell lines (previously analyzed intensively at the transcription level) were printed on
FAST nitrocellulose-coated glass slides with a pin-in-ring format GMS 417 arrayer
(Affymetrix, Santa Clara, CA) with four 500 µm-diameter pins [24]. The cDNA/protein
and oligo/protein arrays showed significant correlations for the 19 species across 60 cell
lines. The mean correlation coefficients were +0.52 for cDNA/protein and +0.40 for
oligo/protein, with the profiling structure showing a higher correlation of proteins with
mRNA levels across the 60 cell lines.
Nishizuka et al. [24] were the first to estimate protein expression based on a
monotonic linear spline fitted to the intensity curve of the serial dilution
procedure. The algorithm maps the expression in the space of
dilutions (see Figure 5.2). For each sample, the intensity values between the maximum
of the range, $I_{\max}$, and the minimum of the range, $I_{\min}$, are smoothed by a spline
curve interpolation, $I = f(\cdot)$ as a function of dilution, and the expression is
estimated as the $\log_2$ dilution at the $p = 25\%$
level. The values proposed to restrict the dynamic range, $[I_{\max} \ldots I_{\min}]$,
are heuristically selected. The third highest ranked value observed anywhere on the array
is selected as $I_{\max}$, and the minimum intensity, $I_{\min}$, is the mean of the last ten points
over all cell types. The true dilution factor is then estimated as
$f^{-1}[I_{\min} + p\,(I_{\max} - I_{\min})]$.

[Figure 5.2 plot: spline curves fitted to the serial dilution of samples; intensity vs. dilution series (log2), with the p = 25% level marked.]

Figure 5.2. Dose-dependent interpolation algorithm developed in [24] for the study of protein expression.

⁷ http://www.perkinelmer.com; 45 William Street, Wellesley, MA 02481-4078, USA.
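The dose interpolation idea attributed to [24] can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dose_interpolation`, the forced non-increasing curve, and the example numbers are assumptions made for illustration, and a piecewise-linear curve stands in for the monotone spline.

```python
def dose_interpolation(log2_dilutions, intensities, i_min, i_max, p=0.25):
    """Return the log2 dilution at which a monotone piecewise-linear curve
    through the measured points crosses i_min + p * (i_max - i_min).

    Hypothetical helper for illustration; returns None when the target
    intensity lies outside the measured range.
    """
    target = i_min + p * (i_max - i_min)
    # Order the points by dilution step.
    points = sorted(zip(log2_dilutions, intensities))
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Force a non-increasing curve: signal should fall as the lysate is diluted.
    for k in range(1, len(ys)):
        ys[k] = min(ys[k], ys[k - 1])
    # Find the segment that brackets the target and invert it linearly.
    for k in range(1, len(xs)):
        high, low = ys[k - 1], ys[k]
        if low <= target <= high:
            if high == low:
                return xs[k - 1]
            fraction = (high - target) / (high - low)
            return xs[k - 1] + fraction * (xs[k] - xs[k - 1])
    return None


# Example with made-up intensities for five two-fold dilution steps.
steps = [0, 1, 2, 3, 4]
signal = [100.0, 80.0, 60.0, 40.0, 20.0]
crossing = dose_interpolation(steps, signal, i_min=0.0, i_max=200.0, p=0.25)
```

Under this reading, two samples can be compared through the difference between their crossing points, since each unit on the log2 dilution axis corresponds to one two-fold dilution.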


A set of minimal standards for use with the first generation of antibody arrays in
quantitative proteomics is suggested in [10][25][24]. Perlee et al. [25] identify the main
sources of error under highly multiplexed conditions and describe the requisites for
specificity, sensitivity, and reproducibility of antibody biomarkers: (1) a comprehensive
validation program to minimize antibody cross-reaction; (2) the application of
standardized statistical approaches to manage the data output of highly replicated assays;
(3) requirements for controls (e.g., the negative control) to normalize sample replicate
measurements; (4) use of real-time monitors to evaluate sensitivity, dynamic range, and
platform precision; and (5) survey procedures to reveal the significance of biomarker
findings.
In the search for a better procedure for estimating protein levels on lysate arrays,
we evaluated several algorithms for estimating relative protein expression in different
samples. In our recent
paper [22], we focused on statistics-based algorithms and evaluated two
algorithms by means of a cross-validation procedure. A nonlinear method and a
method based on robust least squares were compared to standard linear regression.
Accurate quantification is relatively difficult because of the narrow dynamic range
associated with chromogenic signal detection. To increase the accuracy of the
measurements, each sample is serially diluted, and from four to ten two-fold dilutions are
used. The detection capabilities of the laboratory equipment led to the choice of six two-fold
dilutions, and each dilution was spotted in triplicate [22]. Thus, each sample yielded 18
measurements over what is effectively a wide dynamic range. Samples were placed on the
nitrocellulose membrane using a Genetic Solution robot. The thickest pins (500 µm)
available were used and each sample was spotted 5 times. In each batch of slides, we
reserved one slide for normalization and one for a negative control. For normalization,
actin or gold staining was used. The slides were scanned, preferably while wet, since
the contrast and the membrane may change after drying. The analysis involves several steps
common in microarray studies [4][5][8][21][27]: spot finding, background intensity
determination, and segmentation.
The results of the analysis showed that the algorithm based on robust least squares
provided the most accurate quantitation. The method uses all 18 spots per sample (3
replicates, 6 dilutions) in log-log space to fit the linear model. As a preliminary
application, we estimated the relative expression levels of p53 and p21 in either p53+/+
or p53-/- HCT116 colon cancer cells after two drug treatments [22].
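A robust straight-line fit over the 18 spots of one sample can be sketched as follows. The exact estimator of [22] is not reproduced here: Huber-weighted iteratively reweighted least squares is an assumed stand-in, and the function names, the tuning constant `c = 1.345`, and the synthetic data are illustrative only.

```python
import statistics


def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def robust_line_fit(xs, ys, c=1.345, iterations=50):
    """Huber-type robust fit by iteratively reweighted least squares.

    Assumed stand-in for the robust least squares of the text: residuals
    larger than c times a MAD-based scale are down-weighted.
    """
    weights = [1.0] * len(xs)
    intercept, slope = weighted_line_fit(xs, ys, weights)
    for _ in range(iterations):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        scale = statistics.median(abs(r) for r in residuals) / 0.6745
        if scale < 1e-12:
            break  # the fit already passes through most points
        limit = c * scale
        weights = [1.0 if abs(r) <= limit else limit / abs(r) for r in residuals]
        intercept, slope = weighted_line_fit(xs, ys, weights)
    return intercept, slope


# Synthetic example: 6 two-fold dilution steps x 3 replicates in log2 space,
# with one contaminated spot.
log2_steps = [step for step in range(6) for _ in range(3)]
log2_signal = [10.0 - step for step in log2_steps]
log2_signal[0] = 15.0  # outlier
fit_intercept, fit_slope = robust_line_fit(log2_steps, log2_signal)
```

If two samples are fitted with a common slope, the difference between their intercepts can be read as a log2 expression ratio; that interpretation is an assumption of this sketch rather than a statement of the procedure in [22].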


As a further step in the robust quantitation of expression on lysate arrays,
the submitted paper of Tabus et al. [28] estimates the expression of proteins by
fitting a model to the expression ratios of an entire slide using linear and
nonlinear models. The lysate array production followed the same steps as in [22], and
concentrations were estimated from the same scheme of sequentially diluted samples. Since all
spots on a single slide undergo the same processes, the dependence of the measurement on
concentration is expected to be the same (linear or nonlinear) function.

Figure 5.3. Nonlinear modeling of the expression of a protein on a lysate
array. The measured data $y_i$ (in red) and the model $g(x_i, \theta)$ (in
green) vs. the estimated log concentration, $x_i = q_{s_i} - d_i$ [28].

For the first-order polynomial model the paper derived the closed-form solution, and for the
sigmoidal model and higher-order polynomial models an iterative procedure was used.
The paper computed the Cramér-Rao bound for each model and validated the estimation
accuracy by Monte Carlo simulation. When different criteria for model structure
selection were tested (Rissanen's stochastic complexity, Akaike information, and Minimum
Description Length), the models selected for the expression of particular proteins on the
lysate arrays were typically sigmoidal (see Figure 5.3 and Table 5.1).

Table 5.1. Nonlinear model order selection in lysate array. Model structure selection gives preference to
saturated models.

Lysate array | Rissanen's stochastic complexity (SC) | Akaike information (AIC) | Minimum Description Length (MDL)
pThr308AKT | Sigmoidal | Polynomial k = 4 | Polynomial k = 4
pSer473AKT | Sigmoidal | Sigmoidal | Sigmoidal
t-AKT | Sigmoidal | Sigmoidal | Sigmoidal
β-actin | Sigmoidal | Polynomial k = 5 | Polynomial k = 5


Nonlinear models are able to capture the saturation that affects the dependence of intensity on
concentration. The accuracy of the estimated values can be assessed by Cramér-Rao
bounds or Monte Carlo simulations. Tabus et al. show that models with
saturation are preferred; these create a basis for better estimation of protein levels on slides
with saturated samples and offer criteria for better experimental design and more
precise inference.
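Model structure selection of this kind can be illustrated with a small sketch restricted to the Akaike criterion and to polynomial candidates. It does not reproduce the sigmoidal model, the stochastic complexity, or the MDL computations of [28]; the Gaussian-error form AIC = n ln(RSS/n) + 2k, the helper names, and the synthetic saturating data are assumptions.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - tail) / m[r][r]
    return solution


def fit_polynomial(xs, ys, order):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    k = order + 1
    normal = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    moments = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(normal, moments)


def residual_sum_of_squares(xs, ys, coeffs):
    return sum(
        (y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )


def aic(rss, n_points, n_params):
    """Akaike information criterion under an assumed Gaussian error model."""
    return n_points * math.log(rss / n_points) + 2 * n_params


def select_order_by_aic(xs, ys, max_order):
    """Return the polynomial order with the lowest AIC and all scores."""
    scores = {}
    for order in range(1, max_order + 1):
        coeffs = fit_polynomial(xs, ys, order)
        rss = residual_sum_of_squares(xs, ys, coeffs)
        scores[order] = aic(rss, len(xs), order + 1)
    return min(scores, key=scores.get), scores


# Synthetic saturating response (a logistic curve) over log concentration.
log_conc = [-4.0 + 0.5 * i for i in range(17)]
response = [1.0 / (1.0 + math.exp(-x)) for x in log_conc]
best_order, aic_scores = select_order_by_aic(log_conc, response, max_order=5)
```

On saturating data such as this, a straight line is penalized heavily by its residuals, which is consistent with the preference for saturating models reported in the text.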
The next chapter of this thesis presents a more extensive application of reverse-phase
lysate arrays and our algorithms. We slightly modified the design of the 1440-spot lysate
microarray slide from [22] to hold 96 samples, and we observed the expression levels of
46 proteins in a set of 82 glioma samples.

References
[1] Ahram M, Flaig MJ, Gillespie JW, Duray PH, Linehan WM, Ornstein DK, Niu S, Zhao Y,
Petricoin EF 3rd, Emmert-Buck MR. Evaluation of ethanol-fixed, paraffin-embedded tissues for
proteomic applications. Proteomics. 2003 Apr;3(4):413-21.
[2] Blume-Jensen P, Hunter T. Oncogenic kinase signalling. Nature. 2001 May 17;411(6835):355-65.
[3] Bowden ET, Barth M, Thomas D, Glazer RI, Mueller SC. An invasion-related complex of
cortactin, paxillin and PKCmu associates with invadopodia at sites of extracellular matrix degradation
Oncogene. 1999 Aug 5;18(31):4440-9.
[4] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W,
Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC,
Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M.
Minimum information about a microarray experiment (MIAME)-toward standards for microarray data.
Nat Genet. 2001 Dec;29(4):365-71.
[5] Carlisle AJ, Prabhu VV, Elkahloun A, Hudson J, Trent JM, Linehan WM, Williams ED,
Emmert-Buck MR, Liotta LA, Munson PJ, Krizman DB. Development of a prostate cDNA microarray and
statistical gene expression analysis package. Mol Carcinog. 2000 May;28(1):12-22.
[6] Celis JE, Gromov P. Proteomics in translational cancer research: toward an integrated approach.
Cancer Cell. 2003 Jan;3(1):9-15.
[7] Chu WS, Liang Q, Liu J, Wei MQ, Winters M, Liotta L, Sandberg G, Gong M. A nondestructive
molecule extraction method allowing morphological and molecular analyses using a single tissue
section. Lab Invest. 2005 Aug 22; [Epub ahead of print]
[8] Cutler P. Protein arrays: the current state-of-the-art. Proteomics. 2003 Jan;3(1):3-18.
[9] Emmert-Buck MR, Strausberg RL, Krizman DB, Bonaldo MF, Bonner RF, Bostwick DG, Brown
MR, Buetow KH, Chuaqui RF, Cole KA, Duray PH, Englert CR, Gillespie JW, Greenhut S, Grouse L,
Hillier LW, Katz KS, Klausner RD, Kuznetzov V, Lash AE, Lennon G, Linehan WM, Liotta LA, Marra
MA, Munson PJ, Ornstein DK, Prabhu VV, Prang C, Schuler GD, Soares MB, Tolstoshev CM, Vocke
CD, Waterston RH. Molecular profiling of clinical tissues specimens: feasibility and applications. J Mol
Diagn. 2000 May;2(2):60-6.
[10] Espina V, Mehta AI, Winters ME, Calvert V, Wulfkuhle J, Petricoin EF 3rd, Liotta LA. Protein
microarrays: molecular profiling technologies for clinical specimens. Proteomics. 2003 Nov;3(11):2091-100.
[11] Gillespie JW, Best CJ, Bichsel VE, Cole KA, Greenhut SF, Hewitt SM, Ahram M, Gathright YB,
Merino MJ, Strausberg RL, Epstein JI, Hamilton SR, Gannot G, Baibakova GV, Calvert VS, Flaig MJ,
Chuaqui RF, Herring JC, Pfeifer J, Petricoin EF, Linehan WM, Duray PH, Bova GS, Emmert-Buck MR.
Evaluation of non-formalin tissue fixation for molecular profiling studies. Am J Pathol. 2002
Feb;160(2):449-57.
[12] Haab BB, Dunham MJ, Brown PO. Protein microarrays for highly parallel detection and
quantitation of specific proteins and antibodies in complex solutions. Genome Biol.
2001;2(2):RESEARCH0004. Epub 2001 Jan 22.

[13] Haab BB. Methods and applications of antibody microarrays in cancer research. Proteomics. 2003
Nov;3(11):2116-22.
[14] Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000 Jan 7;100(1):57-70.
[15] Hunter T. Signaling-2000 and beyond. Cell. 2000 Jan 7;100(1):113-27.
[16] Hunter T. The Croonian Lecture 1997. The phosphorylation of proteins on tyrosine: its role in cell
growth and disease. Philos Trans R Soc Lond B Biol Sci. 1998 Apr 29;353(1368):583-605.
[17] Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic
networks. Nature. 2000 Oct 5;407(6804):651-4.
[18] Knezevic V, Leethanakul C, Bichsel VE, Worth JM, Prabhu VV, Gutkind JS, Liotta LA, Munson
PJ, Petricoin EF 3rd, Krizman DB. Proteomic profiling of the cancer microenvironment by antibody
arrays. Proteomics. 2001 Oct;1(10):1271-8.
[19] Liotta LA, Kohn EC. The microenvironment of the tumour-host interface. Nature. 2001 May
17;411(6835):375-9.
[20] Melton L. Protein arrays: proteomics in multiplex. Nature. 2004 May 6;429(6987):101-7.
[21] Miller JC, Zhou H, Kwekel J, Cavallo R, Burke J, Butler EB, Teh BS, Haab BB. Antibody
microarray profiling of human prostate cancer sera: antibody screening and identification of potential
biomarkers. Proteomics. 2003 Jan;3(1):56-63.
[22] Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I, Hamilton SR, Zhang W. Robust
estimation of protein expression ratios with lysate microarray technology. Bioinformatics. 2005 May
1;21(9):1935-42. Epub 2005 Jan 12.
[23] Nielsen UB, Geierstanger BH. Multiplexed sandwich assays in microarray format. J Immunol
Methods. 2004 Jul;290(1-2):107-20.
[24] Nishizuka S, Charboneau L, Young L, Major S, Reinhold CW, Waltham M, Mehr KH, Bussey KJ,
Lee JK, Espina V, Munson PJ, Petricoin EF 3rd, Liotta LA, Weinstein JN. Proteomic profiling of the
NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays. Proc Natl Acad Sci U
S A. 2003 Nov 25;100(24):14229-34.
[25] Perlee L, Christiansen J, Dondero R, Grimwade B, Lejnine S, Mullenix M, Shao W, Sorette M,
Tchernev V, Patel D, Kingsmore S. Development and standardization of multiplexed antibody
microarrays for use in quantitative proteomics. Proteome Sci. 2004 Dec 15;2(1):9.
[26] Schweitzer B, Kingsmore SF. Measuring proteins on microarrays. Curr Opin Biotechnol. 2002
Feb;13(1):14-9.
[27] Sreekumar A, Nyati MK, Varambally S, Barrette TR, Ghosh D, Lawrence TS, Chinnaiyan AM.
Profiling of cancer cells using protein microarrays: discovery of novel radiation-regulated proteins.
Cancer Res. 2001 Oct 15;61(20):7585-93.
[28] Tabus I, Hategan A, Mircean C, Rissanen J, Shmulevich I, Zhang W, Astola J. Nonlinear
modeling of protein expressions in protein arrays. (submitted)
[29] Tonkinson JL, Stillman BA. Nitrocellulose: a tried and true polymer finds utility as a
post-genomic substrate. Front Biosci. 2002 Jan 1;7:c1-12.

Chapter 6
Translational and post-translational examination of
pathway alterations during glioma progression¹,²
The progression of gliomas has been extensively studied at the genomic level using cDNA
microarrays [7][9]. However, systematic examination of translational and
post-translational levels has only recently been carried out in a high-throughput manner.
This chapter describes our applications of the reverse-phase lysate array
[14][30][18][46][31], constructed using protein lysates from 82 different
primary glioma tissues. We surveyed the expression and phosphorylation of 46 different
proteins³ involved in signaling of cell survival, apoptosis, angiogenesis, invasion, and
cell-cycle pathways. The analysis algorithm described in [31] was employed to robustly
estimate and quantify the protein expression in these samples.
When the proteins were ranked by their power to discriminate 37 glioblastomas from 45
other types of glioma, 12 proteins were identified as the most powerful discriminators.
These proteins were IκBα, EGFRpTyr845, AKTpThr308, PI3K, BadpSer136, IGFBP2,
IGFBP5, MMP9, VEGF, pRB, Bcl-2, and c-Abl. Additionally, the cluster analysis showed
a close link between the pairs of proteins PI3K and AKTpThr308, IGFBP5 and IGFBP2,
and IκBα and EGFRpTyr845. Another cluster included MMP9, Bcl-2, VEGF, and pRB.
These patterns may suggest functional relationships that justify further investigation. The
marked association of phosphorylation of AKT at Thr308, but not Ser473, with
glioblastoma suggests that specific activation of the PI3K pathway is critical to glioma
progression.
Gliomas are usually classified as high-grade (glioblastoma), mid-grade (anaplastic
astrocytoma and anaplastic oligodendroglioma), and low-grade (astrocytoma and
oligodendroglioma) [1]. Low-grade gliomas are rare, and this is reflected by the small
number included in the study (18 of 82 samples; less than 25% of our sample set). In
addition to the classical diffuse glioma subtypes of astrocytoma and oligodendroglioma,
there is a group of gliomas that possess mixed lineage features (oligoastrocytoma) and
are classified as mixed gliomas. Different grades of glioma are associated with different
patient survival times, with the most significant difference between glioblastoma and
¹ This chapter partially follows the paper by Jiang R², Mircean C², Shmulevich I, Cogdell D, Jia Y, Tabus I,
Aldape K, Sawaya R, Bruner J, Fuller GN, and Zhang W, titled "Pathway alterations during glioma
progression revealed by reverse phase protein lysate arrays" [46]. The paper has been submitted to
Proteomics.
² The authors R. Jiang and C. Mircean contributed equally to the paper [46].
³ A total of 108 antibodies were initially tested. Some did not show a unique band in western blotting and
were, therefore, not used in any assessment.


lower grade tumors. To characterize the 82 tumor samples used in this study, we first
performed a Kaplan-Meier survival analysis according to the information from our
clinical database. As shown in Figure 6.1 for the 82 samples, there was a significant
difference in patient survival time between glioblastoma and all other gliomas combined.
However, there is an overlap in the survival curves of the other glioma subgroups.
Therefore, based on clinical phenotype, the 82 samples can be divided into two major
groups. Although the samples could be divided into 3 or 4 histopathological groups, a
grouping scheme based on clinical difference is likely to generate more biologically
meaningful information at the protein and pathway levels than groupings based purely
on morphological features.
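For readers unfamiliar with the Kaplan-Meier (product-limit) estimator used for this grouping, a minimal sketch is given below. The thesis does not describe its implementation; the function name and the toy follow-up times are hypothetical and carry no clinical meaning.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time of each patient
    events : True if death was observed at that time, False if censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == current:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((current, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


# Toy example: four patients, the second one censored.
toy_curve = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
```

Computing one such curve per group (glioblastoma versus all other gliomas) gives the kind of comparison shown in Figure 6.1; a formal test of the difference, such as the log-rank test, is outside this sketch.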

Figure 6.1. Kaplan-Meier survival analysis. The survival curves for patients show that there is a distinction
between glioblastoma and lower grade glioma. But the survival difference between the subgroups of lower
grade gliomas in this sample set is not significant.

Glioblastoma poses a formidable challenge to the cancer research community. The
extremely poor prognosis and the refractory nature with respect to all conventional
therapies suggest a complex pattern of alterations likely involving multiple cellular
pathways and molecular regulatory systems in the tumor cells. A survey of protein
expression is particularly useful in providing comprehensive insights into the functional
mechanisms and important modifications to cellular pathways during disease
progression.

Figure 6.2. Quality evaluation of the antibodies used on protein lysate arrays. All the antibodies were tested in a western blotting assay to confirm they preferentially detect a single
band. Some blots were sequentially probed with 2 or 3 different antibodies and the composite results are shown.


Survival analysis of patients in this cohort showed a major difference between
glioblastoma and lower grade tumors. The difference in protein expression may very well
be responsible for the distinct clinical phenotypes. We attempted to understand the
relationships among some of the proteins based on the fact that they share common
patterns of expression (or were clustered together in the clustering analysis). These
findings are discussed in this chapter in the context of the known functions of the
proteins and literature reports.
Table 6.1. Antibodies used in the experiments and their sources.

Company | Name of antibody
Cell Signaling Technology, Inc., Beverly, MA 01915 | AKT, AKTpSer473, AKTpThr308, PTEN, PTENpSer380, RSK1/RSK2/RSK3, p90RSKpThr573, GSK3beta, BADpSer136, Cleaved Caspase 8 Asp374, Cleaved Caspase 9 Asp315, Puma, PDGFR, MAPK, mTOR, mTORpSer2481, CD11b, EGFRpTyr845
BD Biosciences Immunocytometry Systems, San Jose, CA 95131 | PI3K, Integrin 5, p16, pRB, Src Pan, Src pTyr529, NFκB, Cathepsin D
Santa Cruz Biotechnology, Inc., Santa Cruz, CA 95060 | p53 (DO-1), BCL-2, Bax, p-PDGFR, Cdk7, Cdk4, Cyclin D3, VEGF, Tie2, IGFBP2, IGFBP3, IGFBP5, IκBα, EGFR vIII, c-Abl
GeneTex Inc., San Antonio, TX 78245 | p14ARF, c-Myc
Invitrogen Corporation (Zymed Laboratories Inc.), Carlsbad, CA 92008 | EGFR
Abcam Inc, Cambridge, MA 02139 | MMP2, MMP9
Sigma Chem Co., St. Louis, MO 63178 | β-actin

Antibody selection
Accumulating evidence from conventional molecular biology studies indicates that
signal transduction pathways involved in cell growth, cell death, and
metabolism are disturbed in glioma progression. Therefore, in order to systematically
survey the subset of proteomic changes in gliomas using a parallel protein-lysate array
platform, we first generated a list of proteins (Table 6.1) that have been previously
implicated in oncogenic pathways; some of these proteins have already been shown to
be activated in glioblastoma through other types of assays.
Protein lysate arrays provide a high-throughput platform that allows simultaneous
detection of a protein in a large number of samples with replicates and serial dilutions.
However, the dot-blot nature of the assay demands that high-quality antibodies be used
in order to avoid cross-reactivity, which would produce an unacceptable level of noise
and render the data uninterpretable. Therefore, we first tested many of the antibodies


on a western blot to make certain that the antibodies applied to the array detect a single
dominant band. Representative blots are shown in Figure 6.2. Only 46 antibodies passed
this quality control step and were subsequently used in the hybridization experiments.

Lysate array preparation


The frozen glioma tissue samples were collected from the Tissue Bank of the Brain
Tumor Center at The University of Texas M. D. Anderson Cancer Center. Samples were
collected from 82 different patients with the following subtypes of glioma: 8
low-grade astrocytomas (World Health Organization [WHO] grade II), 7
oligodendrogliomas (WHO grade II), 3 oligoastrocytomas (WHO grade II), 10 anaplastic
astrocytomas (WHO grade III), 11 anaplastic oligodendrogliomas (WHO grade III), 6
anaplastic oligoastrocytomas (WHO grade III), and 37 glioblastomas (WHO grade IV).
Patient survival data were also collected from the database. All studies were approved by
the Institutional Review Board through an established protocol.
Protein isolation from glioma tissues was described previously [23]. The
concentrations of protein samples were determined using the Bradford assay according
to the manufacturer's protocol (Bio-Rad Laboratories, Hercules, CA) and adjusted to 20
µg/µL with lysis buffer (20 M Tris, pH 7.6, 150 mM NaCl, 5 mM EDTA, 0.5% NP40).
The lysate solution was then diluted two-fold six times with the lysis buffer freshly
supplemented with 0.02 mM Leupeptin. The serially diluted protein lysates were printed
on PVDF-coated glass slides in triplicate using a robotic spotter (G3, Genomics
Solutions). We used the same method as in [31].
Each antibody was tested on a western blot to confirm that it detected a single
dominant band and thus was appropriate for this experiment. The 46 antibodies are
listed in Table 6.1. The secondary antibodies, including anti-goat, anti-rabbit, and
anti-mouse antibodies, were purchased from Vector Laboratories (Burlingame, CA).
Detection was conducted with the DakoCytomation catalyzed signal amplification
system kit (CSA, DakoCytomation; Carpinteria, CA) as described in [31]. Briefly,
endogenous biotin was blocked for 5 min using the biotin blocking kit, followed by
application of protein block for 10 min. Primary antibodies were diluted and incubated
on slides for 2 hours, and biotinylated secondary antibodies were incubated for 1 hour.
For signal amplification, the slides were incubated for 15 min with a
streptavidin-biotin-peroxidase complex provided in the amplification kit and for 15 min each with
amplification reagents (biotinyl-tyramide/hydrogen peroxide, and
streptavidin-peroxidase). Development of slides was completed using hydrogen peroxide. The slides
were then allowed to air dry. Primary and secondary antibodies used in these studies
were diluted 1:100 to 1:200 and 1:4000 to 1:10,000, respectively. In addition to β-actin, which


served as the positive control in each set of protein arrays, one negative control without
any primary antibody was included in each set of experiments. The hybridized slides
were scanned at an optical resolution of 1200 dpi and images were saved as uncompressed
TIFF files. After inverting the 16-bit-per-pixel gray image (to allow the same analysis approach as
used in cDNA microarray technology), the spots were segmented and quantified with
ArrayVision (Imaging Research Inc., St. Catharines, Ontario, Canada).

Analysis of lysate array data


The lysate was printed in triplicate with six two-fold dilutions, yielding 18 data points for
each sample⁴. Protein expression was quantified using the robust least squares method,
as described in [31]. Inter-slide variability was assessed using quantile-to-quantile
normalization [28][48]. The expression levels were normalized against β-actin measured
on each production lot (24 slides in each lot).
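One common form of quantile normalization across slides can be sketched as follows. The source does not give the details of the procedure it cites, so the reference distribution (the mean of the sorted values across slides), the function name, and the toy numbers are assumptions; ties are not treated specially.

```python
def quantile_normalize(slides):
    """Map every slide onto a shared reference distribution.

    slides : list of equal-length lists, one list of intensities per slide
    Returns new lists in which the k-th smallest value of each slide is
    replaced by the mean of the k-th smallest values over all slides.
    """
    n_values = len(slides[0])
    sorted_slides = [sorted(values) for values in slides]
    reference = [
        sum(s[k] for s in sorted_slides) / len(slides) for k in range(n_values)
    ]
    normalized = []
    for values in slides:
        # Indices of this slide's values from smallest to largest.
        order = sorted(range(n_values), key=lambda idx: values[idx])
        result = [0.0] * n_values
        for rank, idx in enumerate(order):
            result[idx] = reference[rank]
        normalized.append(result)
    return normalized


# Toy example with two slides and three spots each.
normalized_slides = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After this step every slide has the same empirical distribution, so remaining differences between slides reflect rank changes of individual samples rather than slide-wide intensity shifts.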
All data were normalized against β-actin in a fashion similar to that routinely
used in western blotting analysis. To diminish variability introduced by
microarray production, each protein-lysate array production lot (24 slides) was
normalized against β-actin separately. This study focused on finding features that
distinguish glioblastomas from all other gliomas. Therefore, the classification is treated
as a two-class discrimination problem: glioblastoma vs. others (low-grade astrocytoma,
anaplastic astrocytoma, oligodendroglioma, anaplastic oligodendroglioma, and mixed
glioma samples). After normalization with β-actin, we selected the protein features with
the highest discrimination power based on a t-statistic [42][12][13]. To decide the
threshold, we selected the subset that minimized the false discovery rate (FDR), which is
defined as the expected proportion of false positives among the declared significant
results [2][26].
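The ranking and thresholding steps can be sketched under explicit assumptions. The thesis does not state which t-statistic variant or which FDR estimator was used: the Welch form and the Benjamini-Hochberg step-up adjustment below are common choices substituted for illustration, and the function names, protein labels, and numbers are hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Two-sample t-statistic with unequal variances (Welch form, assumed)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))


def rank_proteins(expression, is_glioblastoma):
    """Rank proteins by |t| between glioblastoma and all other samples.

    expression      : dict mapping protein name to one value per sample
    is_glioblastoma : list of booleans, one per sample
    """
    scores = {}
    for protein, values in expression.items():
        gbm = [v for v, flag in zip(values, is_glioblastoma) if flag]
        other = [v for v, flag in zip(values, is_glioblastoma) if not flag]
        scores[protein] = welch_t(gbm, other)
    ranking = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranking, scores


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data: three glioblastoma and three other samples for two made-up proteins.
labels = [True, True, True, False, False, False]
toy_expression = {
    "protein_A": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "protein_B": [1.0, 2.0, 3.0, 1.0, 2.0, 3.5],
}
toy_ranking, toy_scores = rank_proteins(toy_expression, labels)
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Converting t-statistics to p-values requires the t distribution, which is left out here; the adjustment function therefore takes p-values as input.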

Proteins selected on glioma progression


The normalized data were then statistically analyzed to identify the feature proteins
that distinguished glioblastoma from other gliomas in our sample set. To gain insight
into potential pathway and protein relationships, we also performed clustering analysis
of the protein data. The feature proteins were ranked by their discrimination between
glioblastoma and other glioma subtypes (Figure 6.4). The proteins that differed most in
expression were IκBα, EGFRpTyr845, AKTpThr308, PI3K, IGFBP5,
IGFBP2, MMP9, Bcl-2, c-Abl, VEGF, BadpSer136, and pRB. Clustering analysis of these
⁴ The procedure described in [31] is based on a 1440-spot design: 80 samples × 6 dilutions × 3 replicates. In this
study we increased the density to 96 samples per slide.


proteins showed that IκBα clusters with EGFRpTyr845, PI3K with AKTpThr308, and
IGFBP5 with IGFBP2. MMP9, Bcl-2, VEGF, and pRB formed another cluster.

Figure 6.3. Examples of protein lysate array images. The images show a layout of the protein lysate array.
Each sample is present in triplicate in a series of six two-fold dilutions.

NFκB/IκBα and EGFR pathways and their relationship

IκBα was the protein whose expression differed most between glioblastoma and other
glioma subtypes in our analysis. IκBα is the key regulator of nuclear factor kappa B
(NFκB), a transcription factor involved in cell growth, apoptosis, and immune response
[39]. NFκB is retained in the cytoplasm through its interaction with IκBα. When IκBα is
phosphorylated following activation of IKK, NFκB is released from the cytoplasm and enters the
nucleus to act on its target genes, including IκBα itself. Therefore, a regulatory feedback
loop exists between these two proteins, and an increase in IκBα is often considered a result
of activation of NFκB [11]. Interestingly, IκBα is known to be a labile protein whose
expression level is sensitive to physiological conditions. In contrast, the steady state level
of NFκB is less sensitive to modulation and the key regulatory step is at the cellular
localization level. Thus, it is not surprising that the NFκB subunits p65 and p50 were not
found to be key feature proteins that distinguished glioblastomas. In a recent study, we
reported that NFκB was indeed activated in glioblastoma [45]. Thus, these protein-lysate
array data from a large set of patient samples confirm our previous observation.
Our clustering analysis showed that IκBα clustered with phosphorylated EGFR. EGFR
is critical for cell growth, differentiation, survival, and migration. The activation of EGFR
is believed to be one of the most important molecular events in triggering glioblastoma.
Amplification of EGFR genes occurs in 40% of glioblastomas, and EGFR overexpression
occurs in 60% [47][40]. A mutation in the EGFR gene that results in a shortened form of
EGFR called EGFR vIII has been detected in approximately 20% of glioblastomas [3]. In
our lysate array study, we also detected EGFR vIII in 12% of glioblastomas (data not
shown), but because of its relatively low frequency of detection, EGFR vIII was not
selected by our statistical analysis as distinguishing between the two glioma groups.
Tyrosine kinase receptors are often activated by phosphorylation and the functional
activation of the protein is a major switch in the growth pathway in the cells. Thus, it is
not surprising that we found EGFRpTyr845 rather than EGFR as one of the top
significant discriminators between glioblastoma and other subtypes of gliomas. Tyr845 is
highly conserved within the active loop of the kinase domain on the EGFR and
phosphorylation of this residue is mediated by c-Src and dependent on EGF stimulation
[4][38]. Phosphorylation of EGFR on Tyr845 is necessary for the binding of
EGFR to cytochrome c oxidase subunit II (Cox II), which is a mitochondrion-encoded
protein and a subunit of complex IV of the respiratory chain [6]. After EGF stimulation,
EGFR translocates to the mitochondrion, where it interacts with Cox II to regulate cell
survival. Integrin proteins induce phosphorylation of EGFR on tyrosines 845, 1068,
1086, and 1173 [5][33], thus EGFR activation is also implicated in increased cell motility
and invasion, additional critical features of glioblastoma.


It is intriguing that IκB and EGFRpTyr845 formed a cluster in our analysis. Reports
in the literature have shown that EGFR activates NFκB, and there is an NFκB regulatory
element in the promoter of EGFR [34]. Thus, the clustering of the two proteins may
reflect mutual functional regulation in glioblastoma. Because both NFκB and EGFR are
targets for therapeutic intervention in several types of cancers [32], a combined attack on
both targets in glioblastoma may offer an effective therapeutic strategy.
PI3K and AKT survival pathways
Our results showed that both PI3K and phosphorylated AKT are among the top feature
proteins that distinguish glioblastoma from lower grade gliomas. This is consistent with
prior reports of the role of these proteins in many different types of cancers, including
glioblastoma. Animal model results show that elevation of AKT and Ras is sufficient to
induce spontaneous glioblastoma [21]. Because PI3K is upstream of AKT, it is not
surprising that PI3K and phosphorylated AKT fall in one group in our clustering analysis.
Our studies revealed that phosphorylation of AKT is a complex process. It has been
observed that AKT is phosphorylated on two major residues: Thr308 and Ser473. Two
antibodies are available that specifically recognize these sites. Our study showed that
phosphorylation of Thr308 is a discriminating event between glioblastomas and other
gliomas, whereas phosphorylation of Ser473 is not. A search of the literature
demonstrated that many studies do not specify which site is phosphorylated although
many studies focus on Ser473. However, a few investigations showed that the two sites
are differentially phosphorylated. In response to insulin stimulation, phosphorylation
at Thr308 increased, whereas phosphorylation at Ser473 was not significantly affected [25][24].
The two residues are also phosphorylated by different kinases: Thr308 is
phosphorylated by 3-phosphoinositide-dependent kinase-1 (PDK1) and Ser473 by
the putative kinase PDK2. Thus, site-specific phosphorylation of AKT may represent two
different switches. In gliomagenesis, phosphorylation of Ser473 may represent an early
event in cancer progression, and phosphorylation of Thr308 may represent a later event
leading to glioblastoma. Thus, full AKT activation may require phosphorylation on both
sites [15]. This notion can be tested via site-directed mutagenesis experiments.
Another interesting observation of phosphorylation of AKT at Thr308 is that it
appears to be linked to primary glioblastomas because in glioblastoma cell lines cultured
in vitro, AKTpThr308 is phosphorylated at a very low level, whereas AKTpSer473
phospho-protein is abundant (data not shown). This is very similar to several other
proteins closely associated with glioblastoma. EGFR amplification and overexpression, a
signature of glioblastoma tissues, are lost in cultured cell lines [17]. Similarly, IGFBP2,
which is overexpressed in 80% of glioblastoma tissues, is expressed at a low level in most
glioblastoma cell lines (data not shown). Therefore, phosphorylation of AKT at Thr308,
together with EGFR and IGFBP2 expression, appears to be relevant in vivo, but not in
vitro, perhaps related to hypoxic conditions.
IGFBP2/IGFBP5 invasion pathway
As mentioned above, genomics studies coupled with tissue microarray experiments have
shown that IGFBP2 overexpression is a signature event in glioblastoma. IGFBP2 is a
promoter of glioma invasion, one of the most important phenotypes of glioblastoma [44].
There are six members in the IGFBP family and they have very different functions,
especially those that are IGF independent [49]. IGFBP5 has been implicated in breast
cancer metastasis [20][41]. In this study, we showed that IGFBP5 is also overexpressed
in glioblastoma and IGFBP2 and IGFBP5 are closely clustered. Thus, both proteins may
contribute to glioma invasion and/or other common functions. This hypothesis can be
readily tested in the future.
The other feature proteins
Several other proteins were identified as feature discriminators for glioblastoma.
Angiogenesis is a key phenotype in glioblastoma, and thus, the selection of VEGF as one
of the feature proteins was expected. Resistance to apoptosis is another important
phenotype. Therefore, the selection of Bcl-2 and BADpSer136 as two discriminators was
also consistent with the phenotype, although their over-expression was a new finding.
Bcl-2 is a survival protein and has been shown to be expressed in glioblastomas [29].
BAD is an apoptosis promoting protein, but when phosphorylated, BAD becomes inactive
and perhaps may even gain survival function [36]. Although BAD phosphorylation is
believed to be a downstream event of AKT phosphorylation, intriguingly, we found that
BADpSer136 did not cluster with AKTpThr308 and PI3K. This may mean that there are
other major upstream regulators of BAD phosphorylation. In support of this hypothesis,
Scheid et al. showed that activation of AKT alone was not sufficient to phosphorylate
BAD and complete inhibition of PI3K/AKT did not abrogate the phosphorylation of BAD
[35]. Furthermore, phosphorylation of BAD at Ser136 was correlated with AKTpSer473
phosphorylation but not with AKTpThr308 phosphorylation; these data suggest that the
activation of the AKT pathway could be independent of BADpSer136 activation [27][43].
An interesting finding from our study is that c-Abl is also highly expressed in
glioblastoma. In an earlier microarray experiment, we found that c-Abl mRNA
expression is associated with poor survival, although those results were not published
due to a small sample size of 25 patients. The present study, however, appears to support
the earlier finding. Additional experiments should be carried out to pursue this
observation because of the clinical implications.

Figure 6.4. Quantitative analysis of protein expression. The protein level was normalized against β-actin and then quantile normalized. The protein expression levels are represented
as heat-maps in (a) and (b), where the proteins are ranked in decreasing order of discrimination power for glioblastomas vs. other lower grades. The most discriminating feature proteins are
shown in (b). We used the ratio of the between-class to within-class sums of squares (BSS/WSS) and selected the set that returns the minimum false discovery rate.
Visually, the FDR is the ratio of areas between the random assignment (yellow bars) and the correct assignment (red bar) shown in (c), when discrimination is high. The location of the minimal false
discovery rate is shown in (d).


Imatinib (Gleevec), which inhibits bcr-abl in chronic myeloid leukemia and c-Kit in
gastrointestinal stromal cancers, has been one of the most successful therapeutic agents
ever used for targeted therapy [16]. Because Gleevec is also an inhibitor of Abl, it may
also be beneficial for glioblastoma treatment. A clinical trial with Gleevec in glioblastoma
is ongoing [8], and it may be insightful to view the results of that trial through the prism
of our findings.
In summary, our survey of 46 proteins and post-translationally modified isoforms in
82 glioma tissue samples has yielded several biologically relevant discoveries that further
our understanding of glioma systems. Some of these findings provide confirmation, using
a different experimental approach, of concepts previously proposed in the literature.
Others are novel and provide focus for further, in-depth, functional studies. The present
glioma protein-lysate array study demonstrates the utility of this proteomics discovery
tool in advancing our understanding of glioma physiology.

References
[1] Benesch, M., Wagner, S., Berthold, F., Wolff, J.E. J. Neurooncol. 2005, 72, 179-183.
[2] Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in
behavior genetics research. Behav Brain Res. 2001 Nov 1;125(1-2):279-84.
[3] Biernat W, Huang H, Yokoo H, Kleihues P, Ohgaki H. Predominant expression of mutant EGFR
(EGFRvIII) is rare in primary glioblastomas. Brain Pathol. 2004 Apr;14(2):131-6.
[4] Biscardi JS, Maa MC, Tice DA, Cox ME, Leu TH, Parsons SJ. c-Src-mediated phosphorylation of
the epidermal growth factor receptor on Tyr845 and Tyr1101 is associated with modulation of receptor
function. J Biol Chem. 1999 Mar 19;274(12):8335-43.
[5] Boerner JL, Demory ML, Silva C, Parsons SJ. Phosphorylation of Y845 on the epidermal growth
factor receptor mediates binding to the mitochondrial protein cytochrome c oxidase subunit II. Mol Cell
Biol. 2004 Aug;24(16):7059-71.
[6] Boerner JL, Demory ML, Silva C, Parsons SJ. Phosphorylation of Y845 on the epidermal growth
factor receptor mediates binding to the mitochondrial protein cytochrome c oxidase subunit II. Mol Cell
Biol. 2004 Aug;24(16):7059-71.
[7] Boudreau CR, Liau LM, Molecular characterization of brain tumors. Clin. Neurosurg. 2004, 51,
81-90.
[8] Breedveld P, Pluim D, Cipriani G, Wielinga P, van Tellingen O, Schinkel AH, Schellens JH. The
effect of Bcrp1 (Abcg2) on the in vivo pharmacokinetics and brain penetration of imatinib mesylate
(Gleevec): implications for the use of breast cancer resistance protein and P-glycoprotein inhibitors to
enable the brain penetration of imatinib in patients. Cancer Res. 2005 Apr 1;65(7):2577-82.
[9] Caskey LS, Fuller GN, Bruner JM, Yung WK, Sawaya RE, Holland EC, Zhang W. Toward a
molecular classification of the gliomas: histopathology, molecular genetics, and gene expression
profiling. Histol Histopathol. 2000 Jul;15(3): 971-81.
[10] Chen G, Gharib TG, Huang CC, Taylor JM, Misek DE, Kardia SL, Giordano TJ, Iannettoni MD,
Orringer MB, Hanash SM, Beer DG. Discordant protein and mRNA expression in lung
adenocarcinomas. Mol. Cell Proteomics 2002, 1, 304-313.
[11] Chiao PJ, Miyamoto S, Verma IM. Autoregulation of I kappa B alpha activity. Proc Natl Acad Sci
U S A. 1994 Jan 4;91(1):28-32.
[12] Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of
tumors using gene expression data. J. Amer. Statist. Assoc. 2000, 97, 77-87.
[13] Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially
expressed genes in replicated cDNA microarray experiments. Statistica. Sinica. 2000, 12, 111-139.
[14] Espina V, Mehta AI, Winters ME, Calvert V, Wulfkuhle J, Petricoin EF 3rd, Liotta LA. Protein
microarrays: molecular profiling technologies for clinical specimens. Proteomics. 2003 Nov;3(11):2091-100.
[15] Feng, J., Park, J., Cron, P., Hess, D., Hemmings, B.A. J. Biol. Chem. 2004, 279, 41189-41196.

[16] George D. Targeting PDGF receptors in cancer--rationales and proof of concept clinical trials. Adv
Exp Med Biol. 2003;532:141-51.
[17] Giannini C, Sarkaria JN, Saito A, Uhm JH, Galanis E, Carlson BL, Schroeder MA, James CD.
Patient tumor EGFR and PDGFRA gene amplifications retained in an invasive intracranial xenograft
model of glioblastoma multiforme. Neuro-oncol. 2005 Apr;7(2):164-76.
[18] Grubb RL, Calvert VS, Wulkuhle JD, Paweletz CP, Linehan WM, Phillips JL, Chuaqui R, Valasco
A, Gillespie J, Emmert-Buck M, Liotta LA, Petricoin EF. Signal pathway profiling of prostate cancer
using reverse phase protein arrays. Proteomics. 2003 Nov;3(11):2142-6.
[19] Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance
in yeast. Mol Cell Biol. 1999 Mar;19(3):1720-30.
[20] Hao X, Sun B, Hu L, Lahdesmaki H, Dunmire V, Feng Y, Zhang SW, Wang H, Wu C, Wang H,
Fuller GN, Symmans WF, Shmulevich I, Zhang W. Differential gene and protein expression in primary
breast malignancies and their lymph node metastases as revealed by combined cDNA microarray and
tissue microarray analysis. Cancer. 2004 Mar 15;100(6):1110-22
[21] Holland EC, Celestino J, Dai C, Schaefer L, Sawaya RE, Fuller GN. Combined activation of Ras
and Akt in neural progenitors induces glioblastoma formation in mice. Nat Genet. 2000 May;25(1):55-7.
[22] Jiang R, Mircean C, Shmulevich I, Cogdell D, Jia Y, Tabus I, Aldape K, Sawaya R, Bruner J,
Fuller GN, Zhang W. Pathway alterations during glioma progression revealed by reverse phase protein
lysate arrays. Proteomics (submitted)
[23] Jung JM, Li H, Kobayashi T, Kyritsis AP, Langford LA, Bruner JM, Levin VA, Zhang W.
Inhibition of human glioblastoma cell growth by WAF1/Cip1 can be attenuated by mutant p53. Cell
Growth Differ. 1995 Aug;6(8):909-13.
[24] Karlsson HK, Zierath JR, Kane S, Krook A, Lienhard GE, Wallberg-Henriksson H. Insulin-stimulated phosphorylation of the Akt substrate AS160 is impaired in skeletal muscle of type 2 diabetic
subjects. Diabetes. 2005 Jun;54(6):1692-7.
[25] Kawakami Y, Nishimoto H, Kitaura J, Maeda-Yamamoto M, Kato RM, Littman DR, Leitges M,
Rawlings DJ, Kawakami T. Protein kinase C betaII regulates Akt phosphorylation on Ser-473 in a cell
type- and stimulus-specific fashion. J Biol Chem. 2004 Nov 12;279(46):47720-5. Epub 2004 Sep 9.
[26] Keselman, H.J., Cribbie, R., Holland, B. Br. J. Math. Stat. Psychol. 2002, 55, 27-39.
[27] Khor TO, Gul YA, Ithnin H, Seow HF. Positive correlation between overexpression of phospho-BAD with phosphorylated Akt at serine 473 but not threonine 308 in colorectal carcinoma. Cancer Lett.
2004 Jul 16;210(2):139-50.
[28] Lu T, Costello CM, Croucher PJ, Hasler R, Deuschl G, Schreiber S. Can Zipf's law be adapted to
normalize microarrays? BMC Bioinformatics. 2005, 6, 37.
[29] Lytle RA, Jiang Z, Zheng X, Rich KM. BCNU down-regulates anti-apoptotic proteins bcl-xL and
Bcl-2 in association with cell death in oligodendroglioma-derived cells. J Neurooncol. 2004
Jul;68(3):233-41.
[30] Melton L. Protein arrays: Proteomics in multiplex. Nature 2004, 429, 101-107.
[31] Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I, Hamilton SR, Zhang W. Robust
estimation of protein expression ratios with lysate microarray technology. Bioinformatics. 2005 May
1;21(9):1935-42. Epub 2005 Jan 12.
[32] Monks NR, Biswas DK, Pardee AB. Blocking anti-apoptosis as a strategy for cancer
chemotherapy: NF-kappaB as a target. J Cell Biochem. 2004 Jul 1;92(4):646-50.
[33] Moro L, Dolce L, Cabodi S, Bergatto E, Erba EB, Smeriglio M, Turco E, Retta SF, Giuffrida MG,
Venturino M, Godovac-Zimmermann J, Conti A, Schaefer E, Beguinot L, Tacchetti C, Gaggini P,
Silengo L, Tarone G, Defilippi P. Integrin-induced epidermal growth factor (EGF) receptor activation
requires c-Src and p130Cas and leads to phosphorylation of specific EGF receptor tyrosines. J Biol
Chem. 2002 Mar 15;277(11):9405-14. Epub 2001 Dec 27.
[34] Park S, James CD. ECop (EGFR-Coamplified and overexpressed protein), a novel protein,
regulates NF-κB transcriptional activity and associated apoptotic response in an IκBα-dependent manner.
Oncogene. 2005, 24, 2495-2502.
[35] Scheid MP, Duronio V. Dissociation of cytokine-induced phosphorylation of Bad and activation
of PKB/akt: involvement of MEK upstream of Bad phosphorylation. Proc Natl Acad Sci U S A. 1998
Jun 23;95(13):7439-44.
[36] Schurmann A, Mooney AF, Sanders LC, Sells MA, Wang HG, Reed JC, Bokoch GM. p21-activated kinase 1 phosphorylates the death agonist bad and protects cells from apoptosis. Mol Cell Biol.
2000 Jan;20(2):453-61.
[37] Sreekumar A, Nyati MK, Varambally S, Barrette TR, Ghosh D, Lawrence TS, Chinnaiyan AM.
Profiling of cancer cells using protein microarrays: discovery of novel radiation-regulated proteins.
Cancer Res. 2001 Oct 15;61(20):7585-93.
[38] Tice DA, Biscardi JS, Nickles AL, Parsons SJ. Mechanism of biological synergy between cellular
Src and epidermal growth factor receptor. Proc Natl Acad Sci U S A. 1999 Feb 16;96(4):1415-20.
[39] Tran NL, McDonough WS, Savitch BA, Sawyer TF, Winkles JA, Berens ME. The tumor necrosis
factor-like weak inducer of apoptosis (TWEAK)-fibroblast growth factor-inducible 14 (Fn14) signaling
system regulates glioma cell survival via NFkappaB pathway activation and BCL-XL/BCL-W
expression. J Biol Chem. 2005 Feb 4;280(5):3483-92. Epub 2004 Dec 16.
[40] Tripp SR, Willmore-Payne C, Layfield LJ. Relationship between EGFR overexpression and gene
amplification status in central nervous system gliomas. Anal Quant Cytol Histol. 2005 Apr;27(2):71-8.
[41] van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K,
Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend
SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002 Jan
31;415(6871):530-6.
[42] von Heydebreck A, Huber W, Poustka A, Vingron M. Identifying splits with clear separation: a
new class discovery method for gene expression data. Bioinformatics. 2001;17 Suppl 1:S107-14.
[43] Wang SW, Denny TA, Steinbrecher UP, Duronio V. Phosphorylation of Bad is not essential for
PKB-mediated survival signaling in hemopoietic cells. Apoptosis. 2005 Mar;10(2):341-8.
[44] Wang, H., Wang, H., Shen, W., Huang, H., et al., Insulin-like Growth Factor Binding Protein 2
Enhances Glioblastoma Invasion by Activating Invasion-enhancing Genes. Cancer Res. 2003, 63, 4315-4321.
[45] Wang, H., Wang, H., Zhang, W., Huang, H.J., et al., Analysis of the activation status of Akt,
NFkappaB, and Stat3 in human diffuse gliomas. Lab Invest. 2004, 84, 941-951.
[46] Wulfkuhle JD, Aquino JA, Calvert VS, Fishman DA, Coukos G, Liotta LA, Petricoin EF 3rd.
Signal pathway profiling of ovarian cancer from human tissue specimens using reverse-phase protein
microarrays. Proteomics. 2003 Nov;3(11):2085-90.
[47] Xie D, Zeng YX, Wang HJ, Tai LS, Wen JM, Tao Y, Ma NF, Hu L, Sham JS, Guan XY.
Amplification and overexpression of epidermal growth factor receptor gene in glioblastomas of Chinese
patients correlates with patient's age but not with tumor's clinicopathological pathway. Acta Neuropathol
(Berl). 2005 Sep 7; [Epub ahead of print]
[48] Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA
microarray data: a robust composite method addressing single and multiple slide systematic variation.
Nucleic Acids Res. 2002 Feb 15;30(4):e15.
[49] Zhang W, Wang H, Song SW, Fuller GN. Insulin-like growth factor binding protein 2: gene
expression microarrays and the hypothesis-generation paradigm. Brain Pathol. 2002 Jan;12(1):87-94.
[50] Zhou YH, Tan F, Hess KR, Yung WK. The expression of PAX6, PTEN, vascular endothelial
growth factor, and epidermal growth factor receptor in gliomas: relationship to tumor grade and survival.
Clin Cancer Res. 2003 Aug 15;9(9):3369-75.

Chapter 7
An overview of analysis methods
Scientists make predictions based on hypotheses, laws, and theories. The value of
predictions is determined by whether they are supported by experiments. If results do
not confirm the initial hypothesis or experimental results are not confirmed by other
scientists, then the initial assumption that generated the prediction should be changed or
discarded.
Models meet a very "human" need to understand events; we create models and
have expectations. Models are not required by biological organisms. The scientific
hypothesis must be testable and falsifiable¹. The process of proving a hypothesis is not
based on accumulating evidence in its favor, but rather on showing that situations that
could establish falsity do not happen [29]. In order to support a hypothesis, observations
need to be specific and not simply observational; they must be quantitatively expressed
in numerical measurements. When a hypothesis is proven, it gains the status of a
scientific law or theory. Once widely confirmed by scientists over time, the laws become
accepted as facts.
With the aid of technology, molecular biology has evolved from traditional-observational to quantitative-estimated measurements. We expect that biology and
medicine will continue to move toward a more quantitative science based on scientific
laws. In addition to algorithms, models, and pure numbers as used in mathematics,
bioinformatics research deals with numbers that often represent measurements of fuzzy
biological events. Measurement, precision, and other concepts that are clearly applicable
in highly technical sciences need amendment and comprehensive adjustment before
being applied to biological events.
Systems biology attempts to define a biological environment of a cell or organism
including the structure of the complex integrated system composed of genes, proteins,
metabolites, etc. Two features, the structure and the dynamics of the biological system,
provide the foundation from which variations are measured to analyze the robustness,
which is the essential property of biological systems [15].
In an article in Science published in 2000, Kitano defined robustness this way:
"Robustness of biological systems manifests in various ways. Firstly, biological systems
constantly adapt to internal or external changes. Secondly, they show certain
insensitivity, which enables them to deal with the noise generated by the stochastic
signals to which they are exposed. Finally, they also exhibit what could be called a
graceful degradation, which is a slow and gradual end as opposed to the catastrophic
failure that occurs when functions are damaged" [18].

1 This refers to the theory of assertion described by the philosophers Sir Karl Popper and Ernest Gellner.
Rather than the non-philosophical use of "falsification," meaning "counterfeiting," in science the idea of
"falsifiable" means "disprovable," the opposite of "verifiable."
In the array-based experiments described in this dissertation, we have attempted to
improve measurement methods used to characterize variations in cellular events. No
matter what the intrinsic character of the technology (cDNA microarrays, oligonucleotide
arrays, or protein arrays) and independent of what is measured (transcriptional levels,
translational levels, or post-translational levels), the data analysis aims to extract
biologically relevant conclusions. The "array" character of these experiments
distinguishes them from other classical biological analyses by the large number of
measurements and the number of undesirable sources of variation.

Normalization
Analysis is the last step of array-based experiments and poses a substantial number of
difficulties and challenges. Data pre-processing typically includes a step that removes
systematic variability, a process called normalization. Some sources of generalized
variation may be highly informative as they may be of biological origin. On the other
hand, variations due to the specific character of array-based technology should be
removed prior to analysis.
Normalization methods are usually based on certain hypotheses. For instance, the
expressions detected should come from a random sample of genes, most of which are not
differentially expressed. This means that the expression of only a small subset of genes
will be altered by a given condition. Another hypothesis is that for each sample under
evaluation, equal amounts of RNA or protein were initially present. This implies that the
total number of molecules in the samples will be roughly the same.
In the case of a cDNA microarray, if two samples are co-hybridized and their expression values
are measured with two different fluorescent dyes (Cy3 and Cy5), denoted as $R_i$ and $G_i$, the ratio of
the arithmetic means of the two channels

$$k_{mean} = \frac{\sum_i R_i}{\sum_i G_i}$$

may be used for global normalization by scaling one of the channels (e.g.,
$G'_i = k_{mean} G_i$ and $R'_i = R_i$) such that the new ratio is equal to unity:

$$\frac{\sum_i R'_i}{\sum_i G'_i} = \frac{\sum_i R_i}{k_{mean} \sum_i G_i} = 1$$

and when a logarithmic scale is used:

$$\log\frac{R'_i}{G'_i} = \log(R'_i) - \log(G'_i) = \log(R_i) - \log(G_i) - \log(k_{mean}).$$

The mean estimate assumes a Gaussian distribution of outliers, an assumption that is not valid for cDNA
expression data. Hoyle et al., 2002 [14] note that cDNA microarrays have skewed
distribution behavior, with many genes expressed at a very low level. Therefore, a robust
estimate of tendency (such as the 75th percentile)² will give a more realistic estimate of the
common intensity in the two fluorescent dyes, Cy3 and Cy5.
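A minimal sketch of global normalization, assuming the two channels are held in NumPy arrays. It scales the G channel by a single factor computed either from the channel sums (the mean-based estimate) or from a chosen percentile (the robust alternative mentioned above). The function name `global_normalize` and its parameters are illustrative assumptions.

```python
import numpy as np

def global_normalize(r, g, estimator="mean", q=75):
    """Scale the G channel by one global factor k so both channels agree.

    estimator="mean"       : k = sum(R) / sum(G)
    estimator="percentile" : k = percentile_q(R) / percentile_q(G)
    Returns (R', G', k) with R' = R and G' = k * G.
    """
    r = np.asarray(r, dtype=float)
    g = np.asarray(g, dtype=float)
    if estimator == "mean":
        k = r.sum() / g.sum()
    elif estimator == "percentile":
        k = np.percentile(r, q) / np.percentile(g, q)
    else:
        raise ValueError("estimator must be 'mean' or 'percentile'")
    return r, k * g, k
```

On a logarithmic scale the same operation is a constant shift: the normalized log-ratio equals the raw log-ratio minus log(k) for every spot, so global normalization cannot correct an intensity-dependent bias.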
A simple way to compare the two channels is with a log-log plot. Points that are
above or below the diagonal in this plot correspond to spots that have higher expression
levels in one of the channels. Another way to visualize the data is to create an "Intensity
vs. Ratio" plot for the normalized data, called an MA-plot. This plot of the log-ratios
$M_i = \log\frac{R_i}{G_i}$ vs. mean log-intensities $A_i = \frac{\log(R_i G_i)}{2}$ is shown in Figure 7.1 (c, d).
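The MA transformation can be computed directly from the two channel intensities. The sketch below assumes base-2 logarithms, which is a common convention but is not stated in the text; the function name `ma_values` is an assumption.

```python
import numpy as np

def ma_values(r, g):
    """Return (M, A) for two-channel intensities.

    M = log2(R / G)        log-ratio
    A = 0.5 * log2(R * G)  mean log-intensity
    """
    r = np.asarray(r, dtype=float)
    g = np.asarray(g, dtype=float)
    m = np.log2(r / g)
    a = 0.5 * np.log2(r * g)
    return m, a
```

Plotting M against A rotates the log-log plot by 45 degrees, so spots with equal intensity in both channels lie on the horizontal line M = 0.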
Figure 7.1. Intra-array normalization methods. Global normalization consists in scaling one of the channels, shown as log-log plots in (a) and (b),
while lowess locally estimates the intensity of the channels, shown as MA-plots in (c) and (d).

2 In the case of an array with many genes expressed at a low level, the median, mean, or other central-tendency estimates will be close to zero. A higher than 50th percentile estimate of tendency will better
estimate the variance in distributions.


Most MA-plots of the data from cDNA arrays show skewed tendencies due to the
different binding potential of Cy3 and Cy5 in genes expressed at a low level compared to
highly expressed genes. Typically, microarray MA-scatter plots resemble a "fish"; they
are skewed toward high mean log-intensities in one of the channels. This bias cannot be
removed by global normalization methods, even though the averages of the intensities of
the two channels are equal.
One way to remove such bias is to apply a local estimate of the intensity. The methods
called lowess and loess are derived from locally weighted scatter plot smoothing and are
differentiated by the model used in the regression: lowess uses a linear polynomial,
whereas loess uses a quadratic polynomial. The smoothing process is local in the same way as the
moving average method: each smoothed value is determined by neighboring data points within a
span³. The regression weight function is defined for the data points contained within the
span. In addition, a robust weight function may be used to make the smoothing
outlier-resistant.
In local regression smoothing [5][24], a three-step process is applied to each point
(Figure 7.2). (1) The regression weights are calculated in the span given by the function:

$$w_i = \left(1 - \left|\frac{x - x_i}{d(x)}\right|^3\right)^3$$

where $x$ is the predictor value associated with the response value to be smoothed,
$x_i$ are the nearest neighbors of $x$ as defined by the span, and $d(x)$ is the distance along the
abscissa from $x$ to the most distant predictor value within the span. (2) A weighted
linear least squares regression is performed (for lowess, the regression uses a first-degree
polynomial; for loess, the regression uses a second-degree polynomial). In order to ensure
that the smoothed values do not become distorted by neighboring outliers, a robust
smoothing, which is not influenced by a small fraction of outliers, is used to calculate the
weights in the span:

$$w_i = \begin{cases} \left(1 - \left(\dfrac{r_i}{6\,\mathrm{MAD}(r)}\right)^2\right)^2, & |r_i| < 6\,\mathrm{MAD}(r) \\ 0, & |r_i| \ge 6\,\mathrm{MAD}(r) \end{cases}$$

where $r_i$ is the residual of the ith data point produced by the regression smoothing
procedure, and MAD is the median absolute deviation of the residuals. (3) The smoothed
value is given by the weighted regression at the predictor value of interest; in the case of
robust smoothing, the regression uses both the local regression weight and the robust weight.
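The three steps above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the MathWorks routine: the span is taken as a fraction of the data points, MAD is computed as the median of the absolute residuals, and the names `tricube`, `bisquare`, and `local_regression` are hypothetical.

```python
import numpy as np

def tricube(u):
    """Local regression weight (1 - |u|^3)^3 for |u| <= 1."""
    u = np.clip(np.abs(u), 0.0, 1.0)
    return (1.0 - u**3) ** 3

def bisquare(residuals):
    """Robust weight (1 - (r / 6 MAD)^2)^2 for |r| < 6 MAD, otherwise 0."""
    mad = np.median(np.abs(residuals))  # assumption: MAD = median(|r|)
    if mad == 0.0:
        return np.ones_like(residuals)
    u = np.abs(residuals) / (6.0 * mad)
    return np.where(u < 1.0, (1.0 - u**2) ** 2, 0.0)

def local_regression(x, y, span=0.3, degree=1, robust_iters=0):
    """degree=1 corresponds to lowess, degree=2 to loess."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(degree + 2, int(np.ceil(span * n))))  # points per span
    robust_w = np.ones(n)
    fitted = np.empty(n)
    for iteration in range(robust_iters + 1):
        for j in range(n):
            dist = np.abs(x - x[j])
            idx = np.argsort(dist)[:k]          # nearest neighbors in the span
            d = dist[idx].max()
            if d == 0.0:
                d = 1.0
            # Step 1: local weights, multiplied by robust weights if any.
            w = tricube(dist[idx] / d) * robust_w[idx]
            # Step 2: weighted least squares via square-root weighting.
            sw = np.sqrt(w)
            design = np.vander(x[idx], degree + 1)
            coef, *_ = np.linalg.lstsq(design * sw[:, None], y[idx] * sw, rcond=None)
            # Step 3: evaluate the local polynomial at the point of interest.
            fitted[j] = np.polyval(coef, x[j])
        if iteration < robust_iters:
            robust_w = bisquare(y - fitted)
    return fitted
```

For intensity-dependent normalization of a two-channel array, one would typically fit the log-ratios against the mean log-intensities with such a smoother and subtract the fitted curve from the log-ratios.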
3 The elements considered in the window process.


Figure 7.2. Representation of Lowess generation steps⁴.

Robotic pins are used to print samples on the nitrocellulose membrane or glass for cDNA
and reverse-phase lysate arrays; therefore, another possible systematic source of variation
is the pin-loading capacity.
Any defect or variation in the mechanical parameters of the pins might affect in a
common way the expression of the genes or proteins in the group deposited by the same pin.
For instance, for a microarray that uses 16 pins, a pin-dependent normalization
procedure applies quantile-to-quantile normalization to the expression of the genes
deposited by each pin separately. A good visualization is a box-plot from an ANOVA
procedure (Figure 7.3). The normalization procedure should be used with caution when
the number of spots processed by each pin is small, because in this case more bias may be
introduced into the existing expression values.
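One possible reading of pin-dependent quantile-to-quantile normalization is sketched below. It assumes that the reference distribution is the pooled set of all spots on the array and that each pin group is mapped onto that reference by rank; the text does not specify either choice. The function name `pinwise_quantile_normalize` is hypothetical.

```python
import numpy as np

def pinwise_quantile_normalize(values, pin_ids):
    """Map each pin group onto the quantiles of the pooled distribution.

    values  : 1-D array of spot intensities
    pin_ids : 1-D array with the pin label of each spot
    """
    values = np.asarray(values, dtype=float)
    pin_ids = np.asarray(pin_ids)
    out = np.empty_like(values)
    for pin in np.unique(pin_ids):
        mask = pin_ids == pin
        v = values[mask]
        ranks = np.argsort(np.argsort(v))   # 0-based rank of each spot in its group
        probs = (ranks + 0.5) / len(v)      # mid-point quantile positions
        out[mask] = np.quantile(values, probs)
    return out
```

The rank order within each pin group is preserved; only the scale and shape of the group distribution change. With very few spots per pin the quantile positions become coarse, which is one way to see why the procedure can introduce bias for small groups.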

4 From MathWorks, http://www.mathworks.com.


Figure 7.3. Pin-dependent normalization. A quick survey of the ANOVA shows variation among the groups
generated by pins 1-4, and that group 5 is out of scale. This may be due to a defect in the needle that
printed this group (pin no. 5, not loading). The quantile-to-quantile normalization may solve this problem.

Normalization of reverse-phase lysate arrays is more complicated than for cDNA
arrays because the spots represent multiple diluted versions of the original sample. The
distribution of spots is a combination of expressions and diluted replicas of the same
distribution. Additionally, the relatively large chromographic range hinders the
normalization process. I am not aware of publications on reverse-phase lysate arrays that
take into consideration pin loading or normalization.
When performing array-based experiments, high reproducibility, both within and
between arrays, is essential. High reproducibility enables researchers to compare
data obtained from different samples, especially when the amount of available sample or
the cost limits the design, such that only a small number of arrays can be used. A method
to enhance reproducibility in repeated experiments on multiple slides is inter-array
normalization. Box-plots (see Figure 7.4) show that there are large differences in scale
(approx. variance) among arrays. A classical method is to use quantile-to-quantile
normalization of the arrays' expression values.
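The quantile-to-quantile normalization mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figure 7.4: the function name `quantile_normalize` is hypothetical, all arrays are assumed to hold the same number of spots, and the reference distribution is assumed to be the mean of the sorted values across arrays (one common choice).

```python
def quantile_normalize(arrays):
    """Map every array onto a common reference distribution.

    Assumption: the reference is the mean of the k-th smallest values
    across all arrays; ties are broken by position in the array.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must contain the same number of spots")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference value for each rank: average over arrays of the sorted values.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Indices of the spots of this array, ordered by increasing expression.
        order = sorted(range(n), key=lambda idx: a[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace the value by the reference quantile
        normalized.append(out)
    return normalized
```

After this step every array has exactly the same set of values, so the box-plots of all arrays coincide. The same sketch could be applied per pin, treating the group of spots printed by each pin as one array, provided the groups have equal size.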

Figure 7.4, panels (a) and (b). Inter-slide pre- and post-normalization (quantile-to-quantile
normalization) box-plots for seven arrays.


After inter-slide normalization the scales are comparable. Without this normalization,
one or more slides would have excessive weight in deciding the features when performing
a classification. In cases with small differences between scales, however, the error
introduced by an inter-slide normalization procedure may be detrimental to subsequent
analysis procedures (footnote 5).
The majority of expressed genes exhibit a power-law distribution with an exponent
close to -1 (i.e., gene expression obeys Zipf's law, [25] and [6][22]). Based on the
observation that single-channel and two-channel microarray data sets also follow a
power-law distribution, a recent paper developed a normalization method based on Zipf's
law. This normalization procedure is useful when the quantile-to-quantile method cannot
be applied (e.g., in the case of microarrays containing functionally specific gene sets,
where the initial hypothesis that most genes are not differentially expressed does not
hold true).

Differential expression of genes


In array-based studies, the differential expression pattern of each gene is usually
evaluated by the (typically pair-wise) dissimilarity of mean expression values among
experimental conditions. Such comparisons have been routinely assessed as fold changes,
whereby genes with greater than two- or three-fold changes are selected for further
investigation.
However, a gene with a high fold change between the two experimental conditions
might also exhibit high within-class variability, and hence its differential expression may
not be statistically significant. Similarly, a modest change in gene expression
may be significant if its differential expression pattern is highly reproducible. In order to
assess differential expression in a way that controls for both false positives and false
negatives, with attention paid to the reliability of variance estimates and multiple
comparison issues, two-sample t-statistics are often used for testing each gene's
differential expression [2][36].
Using a cDNA microarray experiment (footnote 6) as an example, for each gene $i$ we compute the
mean conditional on the class $l \in L$ as

$$\bar{M}_{i,l} = \frac{1}{n_l} \sum_{j \,:\, l(j) = l} M(i,j),$$

where $M(i,j)$ is the expression of gene $i$ in sample $j$, $l(j)$ is the class of sample $j$, and
$n_l$ is the number of samples in the class $l$. We define the number of classes by $L$ and the total
number of samples by $m$. The within class variance is defined as

$$W_i = \frac{\sum_{l \in L} \sum_{j \,:\, l(j) = l} \left( M(i,j) - \bar{M}_{i,l} \right)^2}{m - L}$$

and the between class variance as

$$B_i = \frac{\sum_{l \in L} n_l \left( \bar{M}_{i,l} - \frac{1}{m} \sum_{j} M(i,j) \right)^2}{L - 1}.$$

5 For example, BSS/WSS is highly sensitive when WSS → 0, that is, when the elements in the class are close
to one another. The inter-slide normalization, based on the principle that most genes are not differentially
expressed, can spoil the WSS → 0 condition, and therefore any further feature selection based on ranking the
elements by this BSS/WSS procedure (see next paragraph).
6 Similar logic may be applied to protein arrays.

Figure 7.5. Representation of the ratio of between-class (BSS) and within-class (WSS) variances in a
3-dimensional space.

This is represented in Figure 7.5. For each gene, the ratio of between-class and
within-class variances (BSS/WSS, footnote 7) may be used to rank all genes, and to select
those genes having the largest ratios, up to a predefined number, as feature genes.
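The ranking described above can be sketched in a few lines. The sketch below assumes the definitions of the within-class and between-class variances given earlier (normalized by m - L and L - 1 respectively); the function names `bss_wss` and `rank_genes` are hypothetical and introduced only for illustration.

```python
def bss_wss(values, labels):
    """Ratio B_i / W_i for one gene.

    values: expression of the gene in each sample; labels: class of each sample.
    """
    classes = sorted(set(labels))
    m, n_classes = len(values), len(classes)
    overall_mean = sum(values) / m
    between = 0.0
    within = 0.0
    for c in classes:
        members = [v for v, lab in zip(values, labels) if lab == c]
        class_mean = sum(members) / len(members)
        between += len(members) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in members)
    b = between / (n_classes - 1)
    w = within / (m - n_classes)
    if w == 0.0:
        # No spread inside the classes: the ratio is unbounded (or 0 if b is 0 too).
        return float("inf") if b > 0.0 else 0.0
    return b / w


def rank_genes(expression, labels, n_top):
    """Indices of the n_top genes (rows of expression) with the largest BSS/WSS."""
    ratios = [bss_wss(row, labels) for row in expression]
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return order[:n_top]
```

For instance, a gene with values 1, 2, 3 in one class and 7, 8, 9 in the other has B = 54 and W = 1, and is ranked above a gene whose values are interleaved between the classes.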
A valid threshold for the BSS/WSS ratios must be selected such that the retained features are
informative. After determining the significance level for retaining a specific set
of best-ranked genes, we can define a procedure for assessing the validity of the set. The
general idea is to compare the BSS/WSS obtained with the true classes against the BSS/WSS
obtained with a random assignment of labels. We call this measure the "confidence level"
($C_{level}$) and use it as a threshold for selecting the lowest acceptable value of the ratio
BSS/WSS (Figure 7.6). $C_{level}$ is defined as the ratio of the number of cases in which the real
value of BSS/WSS is larger than the BSS/WSS for a random assignment of labels to the total
number of cases:

$$C_{level} = \frac{N_{real > random}}{N_{total\ cases}} \cdot 100 \; [\%].$$

7 Between Sum of Squares (BSS), Within Sum of Squares (WSS), Total Sum of Squares (TSS).
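A minimal sketch of how the confidence level could be estimated by random relabeling is given below. It assumes that a "random assignment of labels" means a random permutation of the true labels (so that class sizes are preserved); the function names, the number of runs and the fixed seed are illustrative assumptions.

```python
import random


def bss_wss(values, labels):
    """Ratio of between-class to within-class variance for one gene."""
    classes = sorted(set(labels))
    m, n_classes = len(values), len(classes)
    overall_mean = sum(values) / m
    between = within = 0.0
    for c in classes:
        members = [v for v, lab in zip(values, labels) if lab == c]
        class_mean = sum(members) / len(members)
        between += len(members) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in members)
    b, w = between / (n_classes - 1), within / (m - n_classes)
    if w == 0.0:
        return float("inf") if b > 0.0 else 0.0
    return b / w


def confidence_level(values, labels, n_runs=1000, seed=0):
    """Percentage of random relabelings whose BSS/WSS is below the real one."""
    rng = random.Random(seed)
    real = bss_wss(values, labels)
    shuffled = list(labels)
    wins = 0
    for _ in range(n_runs):
        rng.shuffle(shuffled)  # random assignment that keeps the class sizes
        if real > bss_wss(values, shuffled):
            wins += 1
    return 100.0 * wins / n_runs
```

With only a handful of samples the number of distinct relabelings is small, so the attainable confidence levels are coarse; a large number of runs only helps when the number of samples permits many distinct permutations.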

To discriminate the classes, two other methods have been used, one based on the
Receiver Operating Characteristic (ROC) [34] and one on the t-test. On two-class separation
problems, the t-test and the ratio of Between Sum of Squares to Within Sum of Squares
(BSS/WSS) should lead to the same ranking.
The ROC method occupies a central unifying position [4][8][10] in the process of assessing
the clinical outcome, usually based on a small number of elements. Its robustness is due to its
nonparametric character (i.e., the method does not assume normally distributed values in the
classes); this property matches our expectation for expression of genes or proteins over
multiple samples (on linear or logarithmic scale representation).

Figure 7.6. Histogram of BSS/WSS values for the first gene on a random assignment of labels (10^6 runs).
The real BSS/WSS is 0.2155, leading to a $C_{level}$ of 94.66%.
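As an illustration of the nonparametric character of the ROC approach, the area under the ROC curve (AUC) of a single gene can be computed from pairwise comparisons alone, without any distributional assumption. The sketch below relies on the standard equivalence between the AUC and the probability that a randomly chosen sample of the positive class has a higher value than a randomly chosen sample of the other class; the function name `auc` and the choice of which class is "positive" are assumptions made for the example.

```python
def auc(values, labels, positive):
    """Area under the ROC curve of one feature, from pairwise comparisons."""
    pos = [v for v, lab in zip(values, labels) if lab == positive]
    neg = [v for v, lab in zip(values, labels) if lab != positive]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive sample correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

A value of 1 (or 0) indicates perfect separation of the two classes, and 0.5 indicates no separation, so |AUC - 0.5| can serve as the per-gene measure of separation.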

The performance in distinguishing classes can be described in terms of
separation accuracy, or the ability to correctly classify subjects into relevant subgroups.
In many cases, the number of features to be analyzed is much larger than the
number of samples in each class. For this reason, we need to test with very high
specificity (similar to the case of diagnostic testing, where the disease prevalence is very
small), to prevent false positive results.
The standard p-value, although sometimes used inappropriately, was introduced for
testing individual hypotheses. A principled way to select the threshold or to assign the
confidence interval in multiple testing is to use a correction, such as the Bonferroni
correction. A given significance value may be appropriate for each individual comparison,
but when several independent statistical tests are performed simultaneously, the
significance value of each test needs to be lowered to account for the number of
comparisons being performed, in order to avoid spurious positives in the retained set.

For a set of $i = 1 \ldots m$ features selected from the entire set, the significance value for the
set is obtained from the product of the individual significances (footnote 8):

$$\prod_{i = 1 \ldots m} (1 - \alpha_i) = 1 - \alpha_{set}$$
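For equal individual significances, the relation above can be inverted to obtain the per-test significance, as a short numerical sketch shows. The function name `per_test_alpha` is hypothetical; the exact form is often called the Šidák correction, and its first-order approximation is the Bonferroni correction.

```python
def per_test_alpha(alpha_set, k):
    """Per-test significance for k independent tests with equal significance.

    Returns (exact, bonferroni): the exact solution of (1 - a)^k = 1 - alpha_set
    and its small-alpha approximation alpha_set / k.
    """
    exact = 1.0 - (1.0 - alpha_set) ** (1.0 / k)
    bonferroni = alpha_set / k
    return exact, bonferroni
```

For example, with a set-level significance of 0.05 and 1000 tests, both forms give a per-test significance of about 5e-5; the Bonferroni value is slightly smaller and therefore slightly more conservative.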

Although a number of other adjustment procedures are available [2],[36],[13],[12],[11], it is
not immediately clear how small a p-value should be to protect against false positives.
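The false discovery rate (FDR) of a test is defined as the expected proportion of false
positives among the declared significant results [1],[16]:

$$FDR = \frac{p_0 \{ 1 - F_0(c) \}}{1 - F(c)},$$

where $c > 0$ is a critical value (for example, on the space of BSS/WSS values). Here $F_0$ and
$f_0$ denote the distribution function and the density of the statistic under a random assignment
of classes, and $F$ and $f$ those obtained with the correct assignment (Figure 7.7). The
adjustment value $p_0$ is selected such that $p_0 \le \min_c \{ f(c) / f_0(c) \}$ or, as presented in
[3], by integrating over an interval near $c = 0$:

$$p_0 = \frac{\int f(c) \, dc}{\int f_0(c) \, dc}$$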

Because of this directly useful interpretation, and the convenience of scale, it is easier to
use FDR than p-values [28]. Once the discrimination procedure has been decided (for example,
t-statistic or ROC), for each gene we obtain a measure of separation. The values
over the entire set are used to draw the histogram of discrimination values. In an ideal
case, when discrimination carries (biological) meaning, the probability density function
of a random assignment of classes will decrease earlier than in the case of the correct
assignment of classes (Figure 7.7). Visually, FDR is the ratio between the areas of the random
assignment and the correct assignment where discrimination is high.

Figure 7.7. False discovery rate and components.
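The visual interpretation above suggests a simple empirical estimate: replace the two tail areas by the fractions of statistics exceeding the critical value. The sketch below is only an illustration under stated assumptions: `observed` holds the discrimination values obtained with the correct labels, `null` holds values obtained with randomly assigned labels, and p0 = 1 is used as a conservative default; the function name is hypothetical.

```python
def empirical_fdr(observed, null, c, p0=1.0):
    """Estimate FDR(c) = p0 * (1 - F0(c)) / (1 - F(c)) from two samples.

    observed: statistics computed with the correct class labels
    null:     statistics computed with random class labels
    Returns None when no observed statistic exceeds c.
    """
    tail_observed = sum(1 for s in observed if s > c) / len(observed)
    tail_null = sum(1 for s in null if s > c) / len(null)
    if tail_observed == 0.0:
        return None  # nothing is declared significant at this critical value
    return min(1.0, p0 * tail_null / tail_observed)
```

Scanning c over the range of the statistic and picking the smallest c whose estimated FDR is below a target (for example 5%) gives a threshold with a directly interpretable meaning.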

8 NB. When each individual comparison carries an equal significance $\alpha$, this equation can easily be
solved for $\alpha = 1 - (1 - \alpha_{set})^{1/k}$, which for a small significance value reduces to
$\alpha = \alpha_{set} / k$. The relation is called the Bonferroni correction.


To the extent that the assignment of class labels has biological meaning, important features
have higher discrimination values than randomly assigned features would have. When it does
not, the density function of the random assignment will overlap with, or end later than, the
"correct" probability density function discussed above. These histograms may additionally be
utilized as visual tools to observe the significance of the analysis.

Clustering
Clustering is the classification of similar objects into different groups, or more precisely,
the partitioning of a data set into subsets (called clusters), so that the data in each subset
ideally will share common attributes, often proximity according to some defined distance
measure. There are two types of data clustering algorithms: hierarchical and partitional.
In hierarchical algorithms, successive clusters are found using previously established
clusters, while partitional algorithms determine all clusters at once. Hierarchical
algorithms are either agglomerative (also called bottom-up) or divisive (top-down)
methods. Agglomerative algorithms begin with each element as a separate cluster and
merge them in successively larger clusters. The divisive algorithms begin with the whole
set and divide it into successively smaller clusters.
Agglomerative hierarchical clustering builds the hierarchy from the individual
elements by progressively merging clusters; the first step is to determine which elements
merge together. To know the order in which elements merge, we need to define the
closeness measure between elements, the function $d(x, y)$ (which is sometimes a distance,
footnote 9). By using this closeness function between elements and/or clusters, depending
on the way of merging, we have the following linkages:
- complete linkage clustering, when $\max_{x \in C_1, y \in C_2} d(x, y)$ is used;
- single linkage clustering, when $\min_{x \in C_1, y \in C_2} d(x, y)$ is used;
- average linkage clustering, when the mean distance between elements of each cluster,
  $\frac{1}{card(C_1) \, card(C_2)} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$, is used;
- the sum of all intra-cluster variance;
- the increase in variance for the cluster being merged (called Ward's criterion).

9 I am referring to the mathematical distance function, or metric, defined as a function
$d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ such that, for all $x, y, z$: (a) $d(x, y) \ge 0$
(non-negativity), (b) $d(x, y) = 0$ if and only if $x = y$ (identity of indiscernibles), (c) $d(x, y) = d(y, x)$
(symmetry), and (d) $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality).


Each agglomeration occurs at larger values of the function than the previous
agglomeration and one can decide to stop clustering either when the clusters are too far
apart to be merged (based on a distance criterion) or when there is a sufficiently small
number of clusters (a number criterion).
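The agglomerative procedure with a number criterion can be sketched as below. This is a deliberately naive illustration (it recomputes all cluster distances at every step) that assumes the Euclidean distance as the closeness function and supports only the single, complete and average linkages; the function name `agglomerate` is hypothetical.

```python
import math


def agglomerate(points, n_clusters, linkage="single"):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns a list of clusters, each a sorted list of point indices.
    """
    def cluster_distance(c1, c2):
        pairwise = [math.dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    clusters = [[i] for i in range(len(points))]  # every element starts alone
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage value.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

A distance criterion could be used instead by stopping as soon as the smallest linkage value exceeds a chosen threshold.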

Distance measures
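If $x, y \in \mathbb{R}^n$ are two points in the $n$-dimensional space, the following norms can be
defined:
- the 1-norm distance $\sum_{i=1}^{n} |x_i - y_i|$, called the Manhattan distance;
- the 2-norm distance $\left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}$, which is the Euclidean distance;
- the p-norm distance $\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$, which is the Minkowski distance of
  order $p$, where $p$ need not be an integer;
- the infinity-norm distance
  $\lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max(|x_1 - y_1|, \ldots, |x_n - y_n|)$,
  also called the Chebyshev distance.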
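The distances above translate directly into code. The following sketch uses hypothetical function names and illustrates how the Minkowski distance approaches the Chebyshev distance as the order p grows.

```python
def minkowski(x, y, p):
    """p-norm (Minkowski) distance; p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Infinity-norm distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7, the Euclidean distance is 5, and the Chebyshev distance is 4; the Minkowski distance of order 50 is already within a small fraction of 4.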

k-Means algorithm
The k-means algorithm assigns each point to the cluster whose centroid is closest. The
centroids are the points whose coordinates are a measure of central tendency (for example,
the mean), computed on each dimension separately over all the points in the cluster. The
algorithm described in [27] (a) randomly generates k clusters and determines the cluster
centroids or, directly, generates k seed points as cluster centers; (b) assigns each point to
the nearest cluster center; (c) re-computes the new cluster centers; and (d) repeats until
some convergence criterion is met (usually that the assignment has not changed).
The main advantages of this algorithm are its simple implementation and
computational speed (an important advantage for large datasets). The
algorithm does NOT return the same results on each run; the resulting clusters depend
on the initial assignments. The k-means algorithm maximizes inter-cluster
(or, equivalently, minimizes intra-cluster) variance, but it only guarantees a locally
optimal solution, that is, a local minimum of the intra-cluster variance.
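Steps (a) to (d) can be sketched as follows. The sketch assumes the variant that draws k seed points from the data as initial centers, the squared Euclidean distance, and a fixed random seed for repeatability; the function name `kmeans` and its parameters are illustrative.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means; returns (assignment, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # (a) k seed points
    assignment = None
    for _ in range(max_iter):
        # (b) assign each point to the nearest center
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:  # (d) stop when nothing changes
            break
        assignment = new_assignment
        # (c) recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers
```

Because the outcome depends on the seed points, a common practice is to run the algorithm several times with different seeds and keep the solution with the smallest intra-cluster variance.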


Principal Component Analysis (PCA)


The high-throughput character of microarray technology, which allows the generation of
data with high dimensionality, is challenging for scientists who analyze gene
expression data. Thousands to tens of thousands of genes can be analyzed in a single
microarray experiment. In addition, the data are rather noisy. In the characterization of
such data, singular value decomposition (SVD) and principal component analysis (PCA) can
be valuable algorithmic instruments, since both decrease the number of features by
mapping the data to another space.
Consider the matrix $X$ of a microarray experiment of size $m \times n$, where $m > n$
(the number of genes analyzed is substantially larger than the number of samples). The
elements $x_{ij}$ of the matrix ($x_{ij} \in \mathbb{R}$) are the expression levels of the $i$th gene
in the $j$th sample (or patient). The elements of the $j$th column of $X$ form the $m$-dimensional
vector $p_j$, which denotes the expression profile of the $j$th patient. The elements of the $i$th
row of $X$ form the $n$-dimensional vector $g_i$, which we refer to as the transcriptional response
of the $i$th gene.
The singular value decomposition of $X$ is $X = U S V^T$, where $U$ is an $m \times n$ matrix whose
columns $u_k$ are called left singular vectors (gene coefficient vectors) and form an
orthonormal basis for the expression profiles, with the property $U^T U = I$ ($u_i \cdot u_j = 1$ when
$i = j$, and $u_i \cdot u_j = 0$ otherwise). $V^T$ is an $n \times n$ matrix with $V^T V = I$; the rows $v_k$
of $V^T$ contain the elements of the right singular vectors (expression level vectors) and
form an orthonormal basis for the gene transcriptional responses. $S$ is a diagonal matrix
with the same dimension as $V^T$. The diagonal values $\{s_1 \ldots s_n\}$ are called singular values
and represent the mode amplitudes. When $r$ is the rank of $X$, $s_k > 0$ for $1 \le k \le r$ and
$s_k = 0$ for $(r + 1) \le k \le n$. The matrix $X^{(l)}$ is defined by

$$X^{(l)} = \sum_{k=1}^{l} u_k s_k v_k^T$$

and minimizes the sum of the squares $\sum_{ij} | x_{ij} - x_{ij}^{(l)} |^2$ among all matrices of rank $l$.


Calculating the SVD involves finding the eigenvalues and eigenvectors of $X X^T$ and
$X^T X$. The eigenvectors of $X^T X$ form the columns of $V$, and the eigenvectors of $X X^T$ form
the columns of $U$. Furthermore, the singular values in $S$ are square roots of the
eigenvalues from $X X^T$ or $X^T X$ (footnote 10). The singular values are the diagonal elements
of the matrix $S$ and are arranged in descending order. The singular values are always real
numbers. If the elements in matrix $X$ are real, then $U$ and $V$ are also real.

The relationship between PCA and SVD is described in Wall et al., 2003 [35].
Performing PCA is the equivalent of performing SVD on the covariance matrix (footnote 11) of
the data. When matrix $X$ is centered on columns, $X^T X = \sum_i g_i g_i^T$ is proportional to the
covariance matrix of samples.
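The link between the SVD and the eigenvectors of $X^T X$ can be illustrated with a power iteration, which repeatedly multiplies a vector by $X^T X$ to obtain the leading right singular vector. This is a sketch for illustration only, not a replacement for a numerical library: it assumes a non-zero matrix, a largest singular value that is strictly larger than the others, and a start vector that is not orthogonal to the leading singular vector; the function name is hypothetical.

```python
import math


def leading_singular_triplet(X, n_iter=200):
    """Leading (u, s, v) of X (a list of m rows of length n) by power iteration."""
    m, n = len(X), len(X[0])
    v = [1.0 / math.sqrt(n)] * n  # assumed start vector
    for _ in range(n_iter):
        # One multiplication by X^T X: first X v, then X^T (X v).
        Xv = [sum(X[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(X[i][j] * Xv[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    Xv = [sum(X[i][j] * v[j] for j in range(n)) for i in range(m)]
    s = math.sqrt(sum(c * c for c in Xv))  # singular value = |X v|
    u = [c / s for c in Xv]                # left singular vector = X v / s
    return u, s, v
```

The rank-1 matrix with elements s * u[i] * v[j] is then the approximation $X^{(1)}$; subtracting it from X and repeating the iteration would give the next singular triplet.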

Multidimensional Scaling (MDS)


Multidimensional scaling (MDS) refers to a group of methods used to analyze pair-wise
similarities. In general, the goal of the analysis is to detect meaningful underlying
dimensions that allow the researcher to explain observed similarities or dissimilarities
(distances) between the investigated objects. The starting point of MDS is a matrix
consisting of similarities or dissimilarities of the entities. The matrix can hold distances
between pattern vectors in Euclidean space, but in MDS the dissimilarities need not be
distances in the mathematically strict sense. The multitude of variants of MDS is based
on various cost functions and optimization algorithms. The first Euclidean MDS was
developed in the 1930s and later generalized for analyzing non-metric data (footnote 12).
If the elements are represented by $X_k$ in the high-dimensional space, and by $X'_k$ in
their projection in a low-dimensional space, the representation should be optimized so that
the distances between the items in the low-dimensional space are as close as possible
to the original distances. If we denote with $d(k, l)$ the distance between the elements
$x_k$ and $x_l$ in the high-dimensional space, and with $d'(k, l)$ the distance between $x'_k$ and
$x'_l$ in the low-dimensional space, the metric MDS approximates $d(k, l)$ by $d'(k, l)$ such that
the square-error cost (the objective function) is minimum:

$$\sum_{k,l} [d(k, l) - d'(k, l)]^2$$

10 Proof: $X = U S V^T$ and $X^T = V S U^T$; $X^T X = V S U^T U S V^T = V S^2 V^T$; then $X^T X V = V S^2$.
Similarly, $X X^T = U S V^T V S U^T = U S^2 U^T$; then $X X^T U = U S^2$.
11 For N observations, the covariance between variables x and y is
$C(x, y) = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$.
12 See the work on MDS by Young and Householder, 1938; Torgerson, 1952 for Euclidean MDS and further
Kruskal and Wish, 1978; de Leeuw and Heiser, 1982; Wish and Carroll, 1982; Young, 1985.

If the components of the data vectors are of ordinal type, the rank order of the distances
between the vectors is meaningful. In this case, the projection will map the distances to
such values that best preserve the rank order.
In non-metric MDS, only the rank order of entries in the data matrix (not the actual
dissimilarities) is assumed to contain the significant information. Kruskal, 1964 [19],
Shepard, 1962 [32], and Kruskal and Wish, 1977 [20] use a function $f$ to generate the cost
function:

$$\frac{\sum_{k,l} [f(d(k, l)) - d'(k, l)]^2}{\sum_{k,l} (d'(k, l))^2}$$

The distances in the final configuration should be in the same rank order as the original
data. Consequently, the purpose of the non-metric MDS algorithm is to find a
configuration of points whose distances reflect as closely as possible the rank order of the
data.
Using a monotone regression that involves the calculation of a new set of distances
$d'(k, l)$, the most common approach is to determine the elements $X'_k$ in an iterative
process, commonly referred to as the Shepard-Kruskal algorithm. The steps involved in
the MDS algorithm are as follows: (1) assign points to arbitrary coordinates in a
p-dimensional space; (2) compute the distances among all pairs of points, to form the
estimated $d'(k, l)$; (3) compare the $d'(k, l)$ matrix with the corresponding $d(k, l)$ by
evaluating the cost function; (4) adjust the coordinates of each point in the direction that
maximally reduces the stress, and repeat from step (2) until the fit is adequate.
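The four steps can be illustrated on the simpler metric cost (the sum of squared differences between original and projected distances), using plain gradient descent. This sketch deliberately omits the monotone regression of the Shepard-Kruskal algorithm, so it is a metric MDS and not the non-metric procedure itself; the function name, the step size, the number of iterations and the random initialization are all assumptions.

```python
import math
import random


def metric_mds(D, dim=2, n_iter=500, step=0.05, seed=0):
    """Gradient descent on the sum over pairs of (D[k][l] - |y_k - y_l|)^2.

    D is a symmetric matrix of dissimilarities. Returns (Y, history), where
    history lists the cost before the first step and after every iteration.
    """
    rng = random.Random(seed)
    n = len(D)
    # (1) arbitrary starting coordinates
    Y = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]

    def stress(Y):
        # (2)+(3) distances in the projection, compared with the originals
        return sum((D[k][l] - math.dist(Y[k], Y[l])) ** 2
                   for k in range(n) for l in range(k + 1, n))

    history = [stress(Y)]
    for _ in range(n_iter):
        grad = [[0.0] * dim for _ in range(n)]
        for k in range(n):
            for l in range(n):
                if k == l:
                    continue
                dkl = math.dist(Y[k], Y[l])
                if dkl == 0.0:
                    continue  # coincident points give no direction
                coeff = 2.0 * (dkl - D[k][l]) / dkl
                for a in range(dim):
                    grad[k][a] += coeff * (Y[k][a] - Y[l][a])
        # (4) move every point against the gradient of the cost
        for k in range(n):
            for a in range(dim):
                Y[k][a] -= step * grad[k][a]
        history.append(stress(Y))
    return Y, history
```

In the non-metric case, step (3) would first replace the original dissimilarities by their monotone regression on the current distances, so that only the rank order influences the configuration.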
Since non-metric MDS deals with ranking, a crucial problem is how to treat ties in the
data. There are two main approaches, the first, called the primary approach, considers
the ties as undetermined and continues with the algorithm preserving the equality or
replacing it by an inequality. In the secondary approach, ties are retained in the fitting
values. Consequently, if the actual distances do not preserve every inequality in the data,
the infringement would be counted as a deviation from monotonicity. The primary
approach is preferred if there are a large number of distinct dissimilarity values.

Classification
Classification refers to a group of methods that aim to characterize the objects, or
phenomena, in a data set. The classification methods make use of either a subset of features
relevant for the task or the entire space of features. The methods can be supervised or
unsupervised, according to whether knowledge of categorized objects within known classes
(the labels of elements) is used.
There are two phases in construction of a classifier. In the training phase, the
training set is used to decide how the parameters should be combined in order to
separate the various classes of objects. In the application phase, the weights determined
in the training set are applied to a set of objects that do not have known classes in order
to determine what the classes are likely to be.
A special case of classification is when we divide a known set, with known labels, into
a training set and a test set. The application phase then aims to report the differences
between the classification results and the real labels. This is a test model for the
classifier, used to observe its performance on a representative set of data.
If a classifier has few parameters to learn, the classification is usually an easy
problem. When there are many parameters to consider, as in the case of array-based
datasets, the classification problem becomes difficult because of the size of the space of
parameter combinations to search, and techniques based on exhaustive searches of the
parameter space are sometimes computationally infeasible.
Formally, the classification problem can be stated as follows: from a training set
$\{(x_1, y_1), \ldots, (x_n, y_n)\}$, we aim to produce a classifier (function) $h : X \to Y$ that maps
the objects $x \in X$ to their attached classification labels $y \in Y$.

Three related mathematical problems can be defined. The first is the mapping of the
multi-dimensional vector space of features to a set of labels. In this category, we can
include the nearest neighbor algorithm that partitions the feature space into regions and
assigns labels for respective regions. Also in this first category, we can include the
unsupervised clustering of feature space and labeling of each cluster or region.
In the second category of classification is the estimation problem, defined as a
function of the form $P(class \mid \vec{x}) = f(\vec{x}; \vec{\theta})$, where the input feature vector is
$\vec{x}$ and the function $f$ depends on several other parameters $\vec{\theta}$. This category of
classification includes the Bayesian classification, where

$$P(class \mid \vec{x}) = \int f(\vec{x}; \vec{\theta}) \, P(\vec{\theta} \mid D) \, d\vec{\theta}$$

and the result is integrated over all possible $\vec{\theta}$, weighted by how likely they are given
the training data $D$. The third category relates to the second; in this approach, the
class-conditional probability $P(\vec{x} \mid class)$ is estimated and the Bayesian approach (Bayes'
rule) is then used to obtain the class probability.


k-Nearest Neighbor algorithm


A straightforward classification method is to simply find, in the N-dimensional space of
elements, the closest object from the training set to an object being classified. The
neighboring character is defined by a function (for example, a correlation). It is likely
that the element tested will share a common class label with its closest elements in the
training set. The main advantage of nearest neighbor methods is that they are easily
implemented. Good results can be obtained if the N features are chosen carefully, and if
they are weighted carefully in the computation of the distance.
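A minimal sketch of the k-nearest neighbor rule is given below. It assumes the Euclidean distance as the neighboring function (a correlation-based measure could be substituted) and a simple majority vote among the k closest training objects; the function name `knn_predict` and the example labels are illustrative.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label of query by majority vote among its k nearest training objects."""
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda xy: math.dist(xy[0], query),  # assumed closeness function
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With k = 1 the rule reduces to the nearest neighbor method described above; larger k makes the decision less sensitive to a single mislabeled or noisy training object.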

Quantization and the Lloyd algorithm


An N-level vector quantizer for k-dimensional blocks of real-valued source samples is
designed by determining the N code vectors in k-dimensional Euclidean space into which
the source blocks are to be quantized. In vector quantizer design, the goal is to find a
vector quantizer for which the expected squared-error quantization noise achieves a
value close to the minimum.
Stuart Lloyd, first in an unpublished technical report written in 1957, and then in an
article published in 1982 [23], proposed an algorithm for accomplishing this goal (footnote 13).
The Lloyd algorithm starts with an initial quantizer and modifies it through a sequence of
iterations. On each iteration, the code-vectors from the preceding iteration are replaced
with the centroids of the nearest neighbor regions corresponding to these code-vectors.
As long as successive iterations of the Lloyd algorithm continue to generate new
quantizers, the quantization noise strictly decreases (footnote 14). In the scalar quantization
case, if the density function of the source samples is log-concave and has finite variance,
it is known that the N-level quantizers generated by iterates of the Lloyd algorithm have
asymptotically optimal performance, independently of the N-level quantizer chosen at
the start of the iteration process.
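For the scalar case, the iteration can be sketched as follows. The sketch works on a finite set of source samples rather than on a density function, keeps a code value unchanged when its region is empty, and stops when the codebook no longer changes; these choices and the function name `lloyd` are assumptions for the illustration.

```python
def lloyd(samples, codebook, max_iter=100):
    """Scalar Lloyd iteration; returns (codebook, mean squared quantization noise)."""
    codebook = sorted(codebook)
    for _ in range(max_iter):
        # Nearest neighbor regions of the current code values.
        regions = [[] for _ in codebook]
        for s in samples:
            nearest = min(range(len(codebook)), key=lambda i: (s - codebook[i]) ** 2)
            regions[nearest].append(s)
        # Replace every code value by the centroid of its region.
        new_codebook = [sum(r) / len(r) if r else c for r, c in zip(regions, codebook)]
        if new_codebook == codebook:
            break  # no new quantizer is generated
        codebook = new_codebook
    noise = sum(min((s - c) ** 2 for c in codebook) for s in samples) / len(samples)
    return codebook, noise
```

Starting from the code values 0 and 1 on the samples 0, 1, 2, 10, 11, 12, the iteration moves the codebook to 1 and 11 in two steps. The same loop with vectors instead of scalars is the k-means procedure described earlier.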

13 Lloyd stated his algorithm for scalar quantization, the case in which k = 1, but it easily extends to k > 1.
14 Lloyd-Max necessary conditions for an optimal quantizer. These were first discovered by the Polish
mathematicians Lukaszewicz and Steinhaus [26].


Model selection and Minimum Description Length (MDL)
The minimum description length (MDL) principle is a formalization of Occam's
Razor (footnote 15), in which the best hypothesis for a given set of data is the one that leads
to the largest compression of the data. Data can be represented by strings of symbols from a
finite set. The MDL principle says that any regularity in a given set of data can be utilized
to compress the data. Since we want to select the algorithm that captures the most
regularity in the data, we look for the hypothesis with which the best compression can be
achieved.
In order to do this, we must first fix a code to compress the data, in the most general
case a programming language. We then write a program in that language that outputs the
data. This program thus represents the data. The length of the shortest program that
outputs the data is called the Kolmogorov complexity of the data [21]. This is the central
idea of Ray Solomonoff's idealized theory of inductive inference [33]. In this theoretical
work, the mathematics does not provide a practical way to calculate inference. Kolmogorov
complexity is uncomputable when the input is an arbitrary sequence of data. If we
accidentally find a short program that outputs the data, it is in general not possible to
know that it is the shortest possible program. The Kolmogorov complexity depends on the
programming language and is only defined up to a constant number of bits. If only a small
amount of data is available, then such constants may have a very large influence on the
inference results: good results cannot be guaranteed when one is working with limited data.
Minimum description length (MDL) restricts the set of allowed codes in such a way
that it becomes possible (computable) to find the shortest code length of the data, relative
to the allowed codes. Rather than "programs" as in Kolmogorov complexity, in MDL
theory candidate hypotheses, models, and codes are used. The set of allowed codes is
called the model class.
MDL theory was introduced by Jorma Rissanen in 1978 [31]. MDL establishes a
correspondence between code length functions and probability distributions. An
important result in information theory (the Kraft-McMillan inequality) states [30] that
for any probability distribution P, it is possible to construct a code C such that the length
(in bits) of C(x) is equal to -log2 P(x); this code minimizes the expected code length. This
can also be reversed: given the code C, one can construct a probability distribution P such
that the length of C(x) is equal to -log2 P(x). In other words, searching for an efficient
code is reduced to a search for a good probability distribution, and vice versa.
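The correspondence between probabilities and code lengths can be checked numerically. In the sketch below, code lengths are rounded up to whole bits (an assumption needed because -log2 P(x) is generally not an integer), and the Kraft sum verifies that a prefix code with these lengths exists; the function names are hypothetical.

```python
import math


def code_lengths(probabilities):
    """Whole-bit code lengths ceil(-log2 P(x)) for every symbol x."""
    return {x: math.ceil(-math.log2(p)) for x, p in probabilities.items()}


def kraft_sum(lengths):
    """Sum of 2^(-length); a prefix code with these lengths exists if it is <= 1."""
    return sum(2.0 ** -length for length in lengths.values())


def expected_length(probabilities, lengths):
    """Expected number of bits per symbol under the given distribution."""
    return sum(p * lengths[x] for x, p in probabilities.items())
```

For the distribution P(a) = 1/2, P(b) = 1/4, P(c) = P(d) = 1/8 the lengths are 1, 2, 3 and 3 bits, the Kraft sum is exactly 1, and the expected length of 1.75 bits equals the entropy of the distribution.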

15 Occam's Razor principle states that one should make no more assumptions than needed. In simple words,
"the simplest explanation is the best."


This has led researchers to view MDL as equivalent to Bayesian inference (footnote 16). The
code length of the model, and the code length of model and data together, in the MDL
framework correspond to the prior probability and the marginal likelihood, respectively, in
the Bayesian framework. While the Bayesian machinery is often useful in constructing
efficient MDL codes, the MDL framework sometimes uses other codes that do not fit into a
Bayesian framework (footnote 17). Furthermore, the MDL principle prefers some priors over
others. While the same priors tend to be favored in so-called objective Bayesian analysis,
they are favored for different reasons.

References
[1] Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in
behavior genetics research. Behav Brain Res. 2001 Nov 1;125(1-2):279-84.
[2] Dudoit S, Yang YH, Callow MJ, Speed T. Statistical Methods for Identifying Differentially
Expressed Genes in Replicated cDNA Microarray Experiments, Statistica Sinica. 2002 12:111-139
[3] Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes Analysis of a Microarray
Experiment. faculty.washington.edu/ ~jstorey/papers/ETST_JASA_2001.pdf
[4] Egan JP. Signal Detection Theory and ROC Analysis, Academic Press,New York, 1975
[5] Fan J, Gijbels I. Local Polynomial Modelling and its Applications. Chapman and Hall, London.
1996.
[6] George K. Zipf, Human Behaviour and the Principle of Least-Effort, Addison-Wesley, Cambridge
MA, 1949
[7] Gersho A, Gray RM, Vector Quantization and Signal Compression, Kluwer Academic Publishers,
1991.
[8] Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev
Diagn Imaging. 1989; 29(3):307-35.
[9] Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P.
'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns.
Genome Biol. 2000;1(2):RESEARCH0003. Epub 2000 Aug 4.
[10] Henderson AR. Assessing test accuracy and its clinical consequences: a primer for receiver
operating characteristic curve analysis. Ann Clin Biochem. 1993 Nov; 30 ( Pt 6):521-39.
[11] Hochberg Y and Benjamini Y. More powerful procedures for multiple significance testing.
Stat. Med. 1990, 9:811-818.
[12] Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika
1988, 75:800-802.
[13] Holm S. A simple sequentially rejective multiple test procedure. Scand J. Stat. 1979, 6:65-70.
[14] Hoyle DC, Rattray M, Jupp R, Brass A. Making sense of microarray data distributions.
Bioinformatics. 2002 Apr;18(4):576-84.
[15] Jardon M. Systems Biology: An Overview, at http://www.bioteach.ubc.ca/Bioinformatics/
systemsbiology/
[16] Keselman HJ, Cribbie R, Holland B. Controlling the rate of Type I error over a large set of
statistical tests. Br J Math Stat Psychol. 2002 May;55(Pt 1):27-39.
[17] Kieffer JC, A survey of the theory of source coding with a fidelity criterion. Information Theory,
IEEE Transactions on, 1993
[18] Kitano H. Systems Biology: A Brief Overview. Science, 2002, 295: 1662-1664.
[19] Kruskal JB, Nonmetric Multidimensional Scaling: A Numerical Method, Psychometrika, 29:2
1964, pp. 115-129.
[20] Kruskal JB, Wish M. Multidimensional Scaling, Sage Publications, Beverly Hills, Calif., 1978.
[21] Li M, Vitanyi P. An introduction to Kolmogorov Complexity and its Applications: Preface to the
First Edition (1997)
[22] Li W, Random texts exhibit Zipf's-law-like word frequency distribution, IEEE Transactions on
Information Theory, 38(6), pp.1842-1845, 1992
16 For example in David MacKay's Information Theory, Inference, and Learning Algorithms.
17 An example is the Shtarkov `normalized maximum likelihood code', which plays a central role in current
MDL theory, but has no equivalent in Bayesian inference.
[23] Lloyd SP, Least Squares Quantization in PCM, IEEE Transactions on Information Theory, 1982.
[24] Loader C. Local Regression and Likelihood. Springer, New York 1999.
[25] Lu T, Costello CM, Croucher JP, Häsler R, Deuschl G, Schreiber S. Can Zipf's law be adapted to
normalize microarrays? BMC Bioinformatics. 2005; 6: 37.
[26] Lukaszewicz J, Steinhaus H. On measuring by comparison, Zastos. Mat., vol. 2, pp. 225-232,
1955.
[27] MacQueen J. Some methods for classification and analysis of multivariate observations. Proc. Fifth
Berkeley Symp. University of California Press 1, 1966, pp. 281-297.
[28] Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False Discovery Rate, Sensitivity and
Sample Size for Microarray Studies. Bioinformatics. 2005 Apr 21
[29] Popper K, Conjectures and Refutations, London: Routledge and Kegan Paul, 1963, pp. 33-39
[30] Rissanen J. Generalized Kraft Inequality and Arithmetic Coding. IBM Journal of Research and
Development 20(3): 198-203 (1976)
[31] Rissanen J. Theory of Relations for Databases - A Tutorial Survey. MFCS 1978: 536-551
[32] Shepard RN. The analysis of proximities: multidimensional scaling with an unknown distance function. Psychometrika. 1962, 27: 125-140, 219-246.
[33] Solomonoff R "Perfect Training Sequences and the Costs of Corruption - A Progress Report on
Inductive Inference Research," Oxbridge Research, August 1982.
[34] Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988 Jun 3; 240(4857):1285-93.
[35] Wall ME, Rechtsteiner A , Rocha LM - A Practical Approach to Microarray Data Analysis, 2003
[36] Westfall PH, Young SS. Resampling-Based Multiple Testing, 1993 Wiley, New York.

Publications

Publication no. 1
Mircean C, Tabus I, Astola J.
Quantization and distance function selection for discrimination of
tumors using gene expression data.
SPIE 2002, BiOS 2002 Symposium, 19-25 January 2002, San
Jose, CA.

Quantization and distance function selection for discrimination of tumors using gene expression data
Cristian Mircean, Ioan Tabus, Jaakko Astola
Institute of Signal Processing
Tampere University of Technology
POBox 553, Tampere, Finland
ABSTRACT
This paper compares several discrimination methods for the classification of tumors using gene expression data. We
introduce variations of known classification methods, and compare the effects of quantizing the data prior to applying
various methods, and also discuss the selection of the distance function. The error rates obtained with the new methods
are shown to be smaller than those reported in recently published studies.
Keywords: Supervised classification, k-nearest neighbor, Lloyd quantization

1. INTRODUCTION
This is a case study based on the MITLeukemia data set, which contains expression levels of about seven thousand genes in 72 measurements (also referred to as cells in the following). The 72 measurements are labeled in two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), and the ALL class is further refined into B-cells and T-cells.
We concentrate on the k Nearest Neighbor (k-NN) classification method and propose variations of it, the most important one being the use of normalized mutual information to define the closeness function. In order to get meaningful estimates of the mutual information, the data has to be quantized first, which leaves us with the task of selecting an appropriate quantization method.
The criterion used to rank the performance of the various methods is the estimated error rate. The error rates are estimated by repeating the following 150 times (later 1000 times): split the 72 cells into 48 training cells (for which the labels are known) and 24 test cells, for which the labels are estimated by the given method; the number of erroneous classifications in run i is stored as err(i). To represent the error rates graphically we either box-plot the 150 values of err(i) or bar-plot them: the first bar represents the number of times err(i) = 1, the second bar is 2 x (number of times err(i) = 2), and so on, so that the total height of all the bars equals the overall number of erroneous classifications in the 150 runs.
We also study how well individual cells can be classified, by checking how many times in the 150 runs a cell was misclassified. If a cell is repeatedly misclassified, the original labeling becomes questionable, which suggests using classification as a technique for spotting possible wrong diagnoses.
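The repeated random-split protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `classify(train_x, train_y, query)` callback signature and the fixed random seed are hypothetical and do not come from the paper.

```python
import random

def estimate_error(samples, labels, classify, n_runs=150, n_train=48, seed=0):
    """Repeat random train/test splits and return err(i) for every run i."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sketch is reproducible
    indices = list(range(len(samples)))
    errors_per_run = []
    for _ in range(n_runs):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        train_x = [samples[i] for i in train]
        train_y = [labels[i] for i in train]
        # err(i): number of test cells whose predicted label is wrong
        err = sum(classify(train_x, train_y, samples[i]) != labels[i] for i in test)
        errors_per_run.append(err)
    return errors_per_run

def bar_heights(errors_per_run):
    """Bar e has height e * (number of runs with err(i) == e)."""
    heights = {}
    for e in errors_per_run:
        if e > 0:
            heights[e] = heights.get(e, 0) + e
    return heights
```

With this bar definition the heights sum to the total number of misclassifications over all runs, which is the property the bar plots rely on.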

2. DATA SET, PRE-PROCESSING DATA


1. Original Data Set
The MitLeukemia data set has been analyzed in several papers (Golub1 and Dudoit2) and is publicly available at:
http://waldo.wi.mit.edu/MPR/data_set_ALL_AML.html.
The dataset contains measurements of gene expressions, done with Affymetrix Hu6800 micro-arrays, corresponding
to acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) samples from bone marrow and peripheral
blood. Intensity values have been re-scaled such that overall intensities for each chip are equivalent. The values of chip
rescaling can be retrieved in The Whitehead Institute for Biomedical Research / MIT Center for Genome Research web
page http://waldo.wi.mit.edu/MPR/table_ALL_AML_rfactors.txt.
2. Gene selection
The original data set contains gene profiles for 7130 genes x 72 cells. We pre-processed the MitLeukemia data set as follows: after excluding the genes with a max/min ratio ≤ 5 and a max-min difference ≤ 500, and taking the base-10 logarithm, the data set was reduced to 1709 genes x 72 cells. However, the number of genes (factors) still far exceeds the number of cells to be classified (72), and therefore we have to further select smaller subsets of discriminative genes. To this end we consider the criterion BSS/WSS (between-class variance / within-class variance), as described next.

Functional Monitoring and Drug-Tissue Interaction, Manfred D. Kessler, Gerhard J. Müller, Editors,
Proceedings of SPIE Vol. 4623 (2002) 2002 SPIE 1605-7422/02/$15.00

3. Selection of genes using BSS / WSS ratio
We applied the BSS / WSS selection of genes in order to exclude from our analysis those genes whose expression stays close to a constant value across the cells. We obtained good results in minimizing the estimated error rates (for the k-NN method with a closeness function based on the correlation coefficient) by reducing the gene set to the first 40 to 80 genes.
The variance of a continuous variable x can be decomposed into a within-group sum of squares (WSS) and a between-group sum of squares (BSS) by the following formulae:

$$WSS^{g} = \sum_{i} \left( x_i^{g} - \bar{x}^{g} \right)^2, \qquad BSS^{g} = n^{g} \left( \bar{x}^{g} - \bar{x} \right)^2$$

where g designates a group of interest, $n^{g}$ denotes the number of data in the group g, i is the index running over the samples belonging to the group g, $\bar{x}^{g}$ represents the average expression level of the elements in the group, and $\bar{x}$ is the average across all samples.
For gene j in the 2-class decomposition, the BSS / WSS ratio is:

$$\frac{BSS(j)}{WSS(j)} = \frac{BSS^{ALL} + BSS^{AML}}{WSS^{ALL} + WSS^{AML}} = \frac{n^{ALL} \left( \bar{x}_j^{ALL} - \bar{x}_{.j} \right)^2 + n^{AML} \left( \bar{x}_j^{AML} - \bar{x}_{.j} \right)^2}{\sum_{i} \left( x_{ij}^{ALL} - \bar{x}_j^{ALL} \right)^2 + \sum_{i} \left( x_{ij}^{AML} - \bar{x}_j^{AML} \right)^2}$$

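For the 3-class decomposition (ALL B cell, ALL T cell, AML), the BSS / WSS ratio for gene j is:

$$\frac{BSS(j)}{WSS(j)} = \frac{BSS^{ALL\,B} + BSS^{ALL\,T} + BSS^{AML}}{WSS^{ALL\,B} + WSS^{ALL\,T} + WSS^{AML}} = \frac{n^{ALL\,B} \left( \bar{x}_j^{ALL\,B} - \bar{x}_{.j} \right)^2 + n^{ALL\,T} \left( \bar{x}_j^{ALL\,T} - \bar{x}_{.j} \right)^2 + n^{AML} \left( \bar{x}_j^{AML} - \bar{x}_{.j} \right)^2}{\sum_{i} \left( x_{ij}^{ALL\,B} - \bar{x}_j^{ALL\,B} \right)^2 + \sum_{i} \left( x_{ij}^{ALL\,T} - \bar{x}_j^{ALL\,T} \right)^2 + \sum_{i} \left( x_{ij}^{AML} - \bar{x}_j^{AML} \right)^2}$$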

When using the Entropy Correlation Coefficient (see Section 3.5) we first selected the most discriminative genes according to their BSS / WSS ratio and then quantized the values. The alternative would be to use the decomposition proposed by Odaka6 for categorical data sets. In that case, the quantization is applied first, and then the genes with the largest BSS / WSS on the quantized values are selected.
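The BSS / WSS ranking described in this section can be illustrated with a short Python sketch. The function names and the data layout (one list of expression values per gene, one label per sample) are assumptions made here for illustration; treating a gene with zero within-group variation as either perfectly discriminative or uninformative is also a choice of this sketch, not something the paper specifies.

```python
def bss_wss_ratio(values, labels):
    """BSS/WSS for one gene: values[i] is its expression in sample i, labels[i] the class."""
    overall_mean = sum(values) / len(values)
    bss = 0.0
    wss = 0.0
    for group in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == group]
        group_mean = sum(members) / len(members)
        bss += len(members) * (group_mean - overall_mean) ** 2
        wss += sum((v - group_mean) ** 2 for v in members)
    if wss == 0:
        # assumption: a constant gene scores 0, a perfectly separating gene scores inf
        return float("inf") if bss > 0 else 0.0
    return bss / wss

def top_genes(expr, labels, p):
    """expr[j] is the profile of gene j across samples; return the indices of the p best genes."""
    ranked = sorted(range(len(expr)),
                    key=lambda j: bss_wss_ratio(expr[j], labels),
                    reverse=True)
    return ranked[:p]
```

The same function covers the 2-class and 3-class decompositions, since it simply sums over whatever groups appear in `labels`.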

3. NEAREST NEIGHBOUR CLASSIFIER


1. k-Nearest Neighbor rule
The Nearest Neighbor (NN) allocation is based on a variant of nearest neighbor density estimation of the group-conditional densities. The basics of the method are described in Fix and Hodges7, Stone2, McLachlan3.
The k-NN rule to classify a gene expression profile proceeds as follows:
- find the closest k observations in the learning (training) set;
- predict the class of the unknown gene profile by majority vote (the winning label among the neighbors).
We also tested the k,l-NN rule from Hellman9 and Ripley8 using the correlation coefficient; for the uncertainty domain
we use the Mahalanobis distance.
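A minimal Python sketch of the k-NN rule follows. The names `knn_classify` and `abs_diff_sum` are hypothetical, and the city-block dissimilarity is only a placeholder assumed here so that the example runs; any of the measures discussed in this section can be passed in its place.

```python
from collections import Counter

def knn_classify(train_x, train_y, query, k, dissimilarity):
    """Label `query` by majority vote among its k closest training profiles."""
    # sort training indices from the closest to the farthest profile
    order = sorted(range(len(train_x)),
                   key=lambda i: dissimilarity(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def abs_diff_sum(x, y):
    """Placeholder dissimilarity (city-block distance), assumed only for this example."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

With two classes, an odd k avoids tied votes.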
2. Correlation Coefficient
The most commonly used measure of the linear relationship between two variables is the correlation coefficient. The values of the coefficient range from -1 to +1. If there is no linear relationship between the two variables, the value of the coefficient is 0. If there is a perfect positive relationship, the value is +1. If there is a perfect negative relationship, the value is -1. Note that the correlation coefficient measures linear relationships only. Two variables may have a correlation coefficient close to zero and yet have a very strong non-linear relationship. Gene activity is characterized by non-linear behavior, so the correlation coefficient will reveal only the linear relationships between the expression profiles of cancerous cells.


In the MITLeukemia data set we measure the similarity between two mRNA gene expression profiles, one belonging to one of the two classes, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), and the other one considered unlabeled. The similarity between gene expression profiles $x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$ is evaluated by:

$$r_{x,y} = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$


In the k-NN algorithm we define the closeness function as one minus the absolute value of the correlation; profiles with a strong correlation, whether negative or positive, are therefore interpreted as being close to each other.
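The closeness function based on one minus the absolute correlation can be written as the following Python sketch. The function names are hypothetical, and returning a correlation of 0 for a constant profile (where the coefficient is undefined) is an assumption of this sketch.

```python
import math

def pearson(x, y):
    """Sample correlation coefficient r between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def correlation_dissimilarity(x, y):
    """1 - |r|: close to 0 for strongly (anti)correlated profiles, 1 for uncorrelated ones."""
    return 1.0 - abs(pearson(x, y))
```

Because of the absolute value, a perfectly anti-correlated pair of profiles is as close as a perfectly correlated one.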
3. Euclidean Distance
We can also use Euclidean distance between the expression vectors as the distance measure.
If $x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$ are the expression profiles, the Euclidean distance is:

$$d_{x,y} = \left( \sum_{i=1}^{p} (x_i - y_i)^2 \right)^{1/2}$$


The two cells are considered similar if the shape of the gene expression profile is similar.
4. Mahalanobis Distance
We use in this report the distances from an unclassified cell $x = (x_1, \ldots, x_p)$ from the test set (TS) to the two class means $\mu_{ALL} = (\mu_{ALL,1}, \ldots, \mu_{ALL,p})$ and $\mu_{AML} = (\mu_{AML,1}, \ldots, \mu_{AML,p})$ in the learning set (LS).
If we consider the ALL class and the AML class as having normal density distributions with means $\mu_{ALL}$ and $\mu_{AML}$ (p x 1 vectors) and common covariance matrix $\Sigma$ (a p x p positive definite matrix), the Mahalanobis distances are defined as:

$$\Delta_{ALL}(x, \mu_{ALL}, \Sigma) = \left( (x - \mu_{ALL})' \, \Sigma^{-1} \, (x - \mu_{ALL}) \right)^{1/2}$$

$$\Delta_{AML}(x, \mu_{AML}, \Sigma) = \left( (x - \mu_{AML})' \, \Sigma^{-1} \, (x - \mu_{AML}) \right)^{1/2}$$


The Mahalanobis distance takes the variability of the cells' gene expression profiles into account. Instead of treating all values equally when calculating the distance from the mean point, it weights the differences by the range of variability in the direction of the sample point. Another advantage of using the Mahalanobis measure for discrimination is that the distances are expressed in units of standard deviation from the group mean.
A drawback of the method is its increasing computational complexity, especially for large values of p, the number of genes in the cell's expression profile.
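The following Python sketch illustrates the Mahalanobis distance to a class mean with a common (pooled) covariance matrix. It is a standard-library illustration only: the function names, the pooled estimator with its degrees-of-freedom correction and the Gaussian-elimination solver are assumptions of this sketch, not the implementation used for the experiments.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(groups):
    """Common covariance matrix estimated from several groups of profiles."""
    p = len(groups[0][0])
    cov = [[0.0] * p for _ in range(p)]
    for group in groups:
        m = mean_vector(group)
        for row in group:
            d = [v - mv for v, mv in zip(row, m)]
            for a in range(p):
                for b in range(p):
                    cov[a][b] += d[a] * d[b]
    dof = sum(len(g) for g in groups) - len(groups)
    return [[c / dof for c in row] for row in cov]

def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def mahalanobis(x, mean, cov):
    """sqrt((x - mean)' cov^-1 (x - mean)), computed without forming the inverse."""
    d = [a - b for a, b in zip(x, mean)]
    z = solve(cov, d)
    return math.sqrt(sum(a * b for a, b in zip(d, z)))
```

Solving the linear system costs on the order of p^3 operations, which is the growth in computational cost with the number of genes mentioned above.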
5. Mutual information; Entropy Correlation Coefficient
Mutual Information measures the amount of information that one random variable contains about another random
variable, or equivalently, measures the reduction in the uncertainty of one random variable due to the knowledge of the
other. Each gene expression profile is a random vector. Mutual information can be defined also for continuous domain,
but in order to reduce the computational task we preferred to quantize the continuous values first.
Consider two random variables X and Y with a joint probability mass function p(x,y) and marginal probability mass
functions p(x) and p(y). The mutual information I(X;Y) is the relative entropy between the joint distribution and the
product distribution:

$$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

The relationship between H(X), H(Y), H(X,Y) and I(X;Y) is expressed as:

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

where H(X), H(Y) and H(X,Y) are the entropies of the random variables X, Y and of the joint variable (X,Y), respectively.
When used as a measure of the degree of dependence of two qualitative variables, the mutual information has some disadvantages. The maximum value depends on the marginal distributions and is not scaled between 0 and 1. In analogy with the correlation coefficient defined for quantitative variables, Astola and Virtanen10 propose that the mutual information be scaled by $\frac{1}{2}\left(H(X) + H(Y)\right)$. The entropy correlation coefficient (Astola10) is defined for qualitative variables as follows:

$$\mathrm{ECC}(X,Y) = \frac{I(X;Y)}{\frac{1}{2}\left(H(X) + H(Y)\right)} = 2\left(1 - \frac{H(X,Y)}{H(X) + H(Y)}\right)$$

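For quantized profiles, the mutual information and the entropy correlation coefficient can be estimated from symbol counts, as in the Python sketch below. The function names and the plug-in (relative frequency) entropy estimate are assumptions made for illustration; base-2 logarithms are used, although the coefficient itself does not depend on the base.

```python
import math
from collections import Counter

def entropy(symbols):
    """Plug-in entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired quantized values."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def entropy_correlation(x, y):
    """Mutual information scaled by (H(X) + H(Y)) / 2."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # assumption: two constant profiles carry no shared information
    return 2.0 * mutual_information(x, y) / (hx + hy)
```

Two profiles that are relabelings of each other reach the maximum value 1, while statistically independent profiles give 0.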
4. QUANTIZATION

Several algorithms can be used for changing the gene expression profile from continuous values to discrete values. Considering the global data as a Gaussian distribution, the first method applied is to split the domain of values in three parts, delimited by $\mu - \sigma/2$ and $\mu + \sigma/2$. This ad-hoc quantization is later compared with that obtained by the Lloyd algorithm.
After the pre-processing steps discussed in Section 2.2 the data has mean $\mu = 2.9743$ and standard deviation $\sigma = 0.4984$; i.e. the three-interval partition is in this case [min, 2.7251); (2.7251, 3.2235) and (3.2235, max].
Another partition can be defined by maximizing the entropy of the resulting quantized variable. The procedure consists in finding the partition for which the partition areas in the histogram are equal. We obtain, in this way, the splitting points at values 2.8525 and 3.2778.

Figure 1. Histogram of gene expression profile values for the first 40 genes according to BSS/WSS
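The two simple three-level quantizers of this section can be sketched in Python as follows. The function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator was used), and assigning a value that falls exactly on a cut point to the upper interval is a convention of this sketch.

```python
import statistics

def quantize(value, cut_points):
    """Index of the interval containing value (0 .. len(cut_points))."""
    return sum(value >= c for c in cut_points)

def cuts_from_moments(mu, sigma):
    """Cut points mu - sigma/2 and mu + sigma/2 of the ad-hoc quantizer."""
    return [mu - sigma / 2, mu + sigma / 2]

def mean_sigma_cuts(values):
    """Ad-hoc cut points estimated from the data (population standard deviation assumed)."""
    return cuts_from_moments(statistics.mean(values), statistics.pstdev(values))

def equal_frequency_cuts(values, levels=3):
    """Cut points giving (nearly) equal counts per level, i.e. maximum entropy."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * j) // levels] for j in range(1, levels)]
```

Calling `cuts_from_moments(2.9743, 0.4984)` reproduces the splitting points 2.7251 and 3.2235 quoted in the text.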

1. Lloyd Algorithm for Quantization


The Lloyd5 algorithm provides a partition that minimizes the distortion at a given rate. This algorithm was shown to give the best results in both the mutual information case and the entropy correlation coefficient case.
The centroid of a given partition cell R is defined as the scalar y* = centr(R) which minimizes the mean distance between x ∈ R and y*:

$$y^* = \mathrm{centr}(R) \iff E\left[d(x, y^*) \mid x \in R\right] \le E\left[d(x, y) \mid x \in R\right] \quad \text{for every } y$$

The centroid can be estimated by the arithmetic mean:

$$\mathrm{centr}(R) = \frac{1}{|R|} \sum_{i=1}^{|R|} x_i$$

where $R = \{x_i : i = 1, \ldots, |R|\}$ and $|R|$ is the cardinality of R.

The Lloyd algorithm consists in the following steps:


1. Initialization: arbitrarily choose M scalars as the initial set of code words in the codebook
2. Nearest-Neighbor Search: For each training point, find the code word in the current codebook that is closest,
and assign that point to the corresponding cell,
3. Centroid Update: Update the code word in each cell using the centroid of the training points assigned to that
cell.
4. Iteration: Repeat steps 2 and 3 until the average distance falls below a preset threshold
Using Lloyd algorithm the splitting points are 2.7018 and 3.2144; and the reconstruction values are 2.3676, 2.9576 and
3.5164.
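The four steps above can be sketched for scalar data in Python as follows. This is an illustration under assumptions: the function name, the squared-error distance and the stopping rule (stop when the average distortion no longer decreases by more than a tolerance) are choices of this sketch; the paper only states that the iteration ends at a preset threshold.

```python
def lloyd_1d(values, codebook, tol=1e-9, max_iter=1000):
    """Scalar Lloyd quantizer: returns (reconstruction values, splitting points)."""
    code = sorted(codebook)  # step 1: initial code words supplied by the caller
    previous = float("inf")
    for _ in range(max_iter):
        cells = [[] for _ in code]
        distortion = 0.0
        for v in values:
            # step 2: nearest-neighbor search in the current codebook
            j = min(range(len(code)), key=lambda i: (v - code[i]) ** 2)
            cells[j].append(v)
            distortion += (v - code[j]) ** 2
        distortion /= len(values)
        # step 3: centroid update (an empty cell keeps its old code word)
        code = [sum(c) / len(c) if c else code[j] for j, c in enumerate(cells)]
        # step 4: stop when the average distortion has stopped improving
        if previous - distortion < tol:
            break
        previous = distortion
    cuts = [(a + b) / 2 for a, b in zip(code, code[1:])]
    return code, cuts
```

For squared error the splitting points are the midpoints between neighboring reconstruction values, which is how `cuts` is derived here.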


5. SIMULATION RESULTS
In this section we describe the variations on the classical algorithms and the selection of their parameters, which we use
in our experiments.
1. Review of a previous report on the Leukemia data set
Repeating the experiments of Dudoit et al.4 we obtained very similar results (Figure 2). In this experiment we take the p = 40 genes having the largest values of BSS / WSS; increasing the number of variables to p = 200 did not significantly affect the performance. The number of runs is N = 150. We first used the box plot diagram to visualize the data, as in the paper of Dudoit et al.4 The best results with the Correlation Coefficient similarity measure are obtained for k = 11 to 20 neighbors. The box median in the box plots, for all these k values, is not at zero.
2. Changing the metric
We experimented with changing the metric used by the k-NN algorithm, while keeping the number of genes constant, p = 40. The Euclidean distance is the basic metric used for reference. The Euclidean distance brings no great advantage (Figure 3): although the box median is sometimes at zero, errors with large values appear as well.
The nearest neighbor classification rule with a reject option (k,l-NN) was introduced by Hellman9 and Ripley8. We use the correlation coefficient for the outer interval; for the uncertainty domain (heuristically chosen as 20%) we use the Mahalanobis distance. This distance can be defined between one gene expression profile from TS and each of the two classes from LS. The Mahalanobis distance is not used directly in the k-NN algorithm; it is used only as an option. When using it, we observe some improvement: the box plots in Figure 4 have a lower median position for k = 12 to 20.
A particular situation, occurring frequently, is the following: the number of votes for the ALL class is zero. In this situation the presumed class of the gene expression profile from TS is AML. In the same way, if the number of AML votes is zero, then the presumed class is ALL. In all other cases, we let the Mahalanobis distance decide. The results are presented in Figure 5.
We conclude that, in general, the error rates increase significantly when the Mahalanobis distance is used in this way.
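The two decision rules of this subsection can be sketched in Python as below. The function names and argument layout are hypothetical. In the reject-option rule the `ratio` argument is whatever vote ratio the experimenter compares against the threshold; the default values 0.5319 and 0.1 are the ones quoted in the annotation of Figure 4, and resolving distance ties in favor of ALL is an assumption of this sketch.

```python
def vote_then_distance(neighbor_labels, dist_to_all, dist_to_aml):
    """Unanimous k-NN votes decide; otherwise the smaller class distance decides."""
    n_all = neighbor_labels.count("ALL")
    n_aml = neighbor_labels.count("AML")
    if n_all == 0:
        return "AML"
    if n_aml == 0:
        return "ALL"
    # mixed votes: fall back to the distance to each class mean
    return "ALL" if dist_to_all <= dist_to_aml else "AML"

def reject_option_rule(ratio, dist_to_all, dist_to_aml,
                       threshold=0.5319, incertitude=0.1):
    """Vote ratio decides outside the uncertainty interval, distances decide inside it."""
    if ratio >= threshold + incertitude:
        return "ALL"
    if ratio <= threshold - incertitude:
        return "AML"
    return "ALL" if dist_to_all <= dist_to_aml else "AML"
```

In both functions the distances would be the Mahalanobis distances from the test profile to the ALL and AML learning groups.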
3. Comparing the effect of various metrics for the quantized data
In the next experiments we use the Mutual Information as a metric in the k-NN algorithm. The continuous values are quantized in three levels; the limits of the partition are $\mu - \sigma/2$ and $\mu + \sigma/2$. The use of this ad-hoc quantization already improves the results when compared to the correlation coefficient case.
We estimated the error rates using k = 1 to 47 and show the detail of the interesting domain, k = 12 to 21 neighbors, in Figure 6.
The lowest classification errors are obtained as follows: for the Correlation Coefficient similarity, when k = 17, the estimated error is 2.80%; for the Euclidean distance, when k = 10, the estimated error is 2.47%; for the combined use of the Correlation Coefficient and Mahalanobis distance, when the uncertainty interval is 20% and k = 19, the estimated error is 2.91%; for the Mutual Information similarity, when k = 15, the estimated error is 2.30%.
4. Entropy Correlation Coefficient
The Entropy Correlation Coefficient (ECC) is a better metric for the quantized values since it is normalized by the entropies of both profile vectors, from the learning set (LS) and the test set (TS). It takes values in [0, 1]. It is shown in Astola10 that when the Entropy Correlation Coefficient is represented as a function of the degree of dependence, in the bivariate case, the resulting graph is very close to the first bisector.
In our experiments we used p = 40, N = 150 runs, and different quantization models. We observed that significant changes are due to the quantization scheme. Better results are achieved with Lloyd quantization, which minimizes the distortion due to quantization (see Figure 7 and Figure 8).
5. Distribution of misclassified profiles
By repeating the random split into learning and test sets, a given gene expression profile will sometimes be classified correctly and sometimes be misclassified. In this experiment we observe, for each gene expression profile, the number of misclassifications in the 150 runs with k = 17, this being the value of k which gave the minimum estimated error rate for the correlation coefficient metric. When using the Correlation Coefficient metric (with p = 40; k = 17; N = 1000) the distribution of the estimated error over the gene expression profiles is approximately uniform, with a minimum value of 1.5% and a maximum value of 7.8%.
The error is reduced substantially when p is in the range 20 to 40 (Figure 9). We studied the distribution of the estimated error over the profiles for the Entropy Correlation Coefficient when the selected genes are the first 20, 40 or 200 genes, or the genes ranked 41 to 80, in the order of BSS / WSS.
In Figures 10, 11, 12 and 13 the lower-right panels show the ratio Miss / Hit. For the 66th observation the ratio is sometimes infinite, because the number of hits is zero, and is not represented.
As we can see from Figure 12, when the number of gene variables is increased to 200, the error concentrates on observation 66, with a 100% rate, and on observation 67, with 23.4%. All other gene expression profiles give 100% correct prediction. Observations 66 and 67 also had low prediction strengths (0.27 and 0.15, respectively) in the paper of Golub et al.1 and in the report of Dudoit et al.4
When the variables are changed completely (Figure 13), the 66th observation is again present with a 100% misclassification ratio.
Comparing the error dispersion in the above experiments we see that the Entropy Correlation Coefficient improves the results of the k-NN algorithm. The overall error (for the k which gives the minimum error) is concentrated on only
several profiles.
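The per-profile bookkeeping behind these error distributions can be sketched in Python as follows. The function names, the `classify(train_x, train_y, query)` callback and the fixed seed are assumptions made for illustration.

```python
import random

def per_profile_errors(samples, labels, classify, n_runs=150, n_train=48, seed=0):
    """Count, for every profile, how often it was tested and how often it was missed."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    n = len(samples)
    tested = [0] * n
    missed = [0] * n
    indices = list(range(n))
    for _ in range(n_runs):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        train_x = [samples[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test:
            tested[i] += 1
            if classify(train_x, train_y, samples[i]) != labels[i]:
                missed[i] += 1
    return tested, missed

def miss_hit_ratio(n_tested, n_missed):
    """Miss / Hit ratio; infinite when the profile was never classified correctly."""
    hits = n_tested - n_missed
    return n_missed / hits if hits > 0 else float("inf")
```

A profile that is missed every time it lands in the test set has zero hits, which is the infinite-ratio case noted above for the 66th observation.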
6. Comparative study with 3 classes
It is possible that the 66th observation is misclassified because of the small number of classes considered. The two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), may not be enough to characterize the gene expression profiles. We therefore extended the experiments to three classes: ALL B cell, ALL T cell and AML.
When only the first p = 20 genes are selected for the classification algorithm, the error concentrates on the 17th observation (originally classified as ALL B cell). This gene expression profile was perfectly classified in the two-class case. When the number of variables p is increased from 20 to 40 and 200, the 66th and 67th observations again become misclassified.
The estimated error rates for 3 classes (ALL B cell, ALL T cell and AML) are shown in Figure 14; we used the Entropy Correlation Coefficient metric, Lloyd quantization, p = 20, 40, 200 and the genes ranked 41 to 80, k = 17 and N = 150. Figures 15, 16, 17 and 18 present the error dispersion over the gene expression profiles for the first 20, 40 and 200 genes and for the genes ranked 41 to 80 in the order of BSS / WSS.
7. Auto selected k value
All experiments reported so far were run for many values of k (the number of neighbors in the k-NN algorithm) to check which value gives the lowest error rate. An interesting issue is the estimation of k using only the LS data.
In the auto-selected k experiment, for each run (N = 1, ..., 150) we randomly divide the learning set (LS) in two parts with the same ratio 2/3 as in the main division. With a similar k-NN algorithm repeated N1 = 100 times using LS only, we obtain the value of k with the minimum estimated error, which is then used in the main algorithm.
Varying the number of variables p from 20 to 200 (see Figure 19) we observe that the estimated error rate is lowest for p = 40 genes. The error distribution over the gene expression profiles (Figure 20) displays the same pattern,
very high at the 66th and 67th profiles.
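The inner selection of k from the learning set alone can be sketched in Python as follows. The function names, the squared Euclidean dissimilarity used inside `knn_label` and the tie-breaking rule (the first candidate with the minimum error wins) are assumptions of this sketch; the experiments in the paper use other closeness functions.

```python
import random
from collections import Counter

def knn_label(train_x, train_y, query, k):
    """k-NN majority vote with squared Euclidean dissimilarity (assumed for this sketch)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def select_k(ls_x, ls_y, candidate_ks, n_inner=100, seed=0):
    """Pick the k with the fewest errors over repeated 2/3 - 1/3 splits of the learning set."""
    rng = random.Random(seed)
    indices = list(range(len(ls_x)))
    n_train = (2 * len(indices)) // 3
    total_errors = {k: 0 for k in candidate_ks}
    for _ in range(n_inner):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        train_x = [ls_x[i] for i in train]
        train_y = [ls_y[i] for i in train]
        for k in candidate_ks:
            total_errors[k] += sum(
                knn_label(train_x, train_y, ls_x[i], k) != ls_y[i] for i in test)
    # ties are resolved in favor of the earliest candidate
    return min(candidate_ks, key=lambda k: total_errors[k])
```

The selected k is then used once on the outer split, so the test cells never influence the choice of k.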

6. CONCLUSIONS
Comparing the similarity measures used in the k-NN algorithm, we concluded that Entropy Correlation Coefficient is
the best performing.
The quantization is an important step in microarray data processing since it makes the decision robust to noise. Lloyd
quantization, which minimizes the quantization distortion, turns out to provide the minimum estimated classification
error.
The k-NN algorithm is simple and yet provides good classification.
The 66th and the 67th gene profiles accumulate the largest part of the estimated error rate in both cases of:
Two-class classification: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia)
Three-class classification: ALL B cell, ALL T cell, AML


7. FIGURES
[Plot: box plots titled "Error rates using the condition described in 'Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data', Sandrine Dudoit, Jane Fridlyand, Terence Speed". x-axis: k value of the k-NN algorithm (1 to 47); y-axis: error rates (0 to 0.6).]
Figure 2. Box plot diagram for estimated errors using k-NN algorithm and Correlation Coefficient metric.

[Plot: box plots titled "Error of k-NN algorithm using Euclidean metric". x-axis: k value from k-NN algorithm (1 to 45); y-axis: estimated error (0 to 0.6).]
Figure 3. Box plot diagram for estimated errors using k-NN algorithm and Euclidean Distance metric

[Plot: box plots titled "Error rates using k-NN algorithm combined with Mahalanobis distance for +/- 20% incertitude interval". x-axis: k value of the k-NN algorithm (1 to 47); y-axis: error rates (0 to 0.5). Annotation in the plot: if the ratio (nb. of votes ALL / nb. of votes AML) >= (threshold + incertitude), then the presumed class chosen is ALL; if (nb. of votes ALL / nb. of votes AML) <= (threshold - incertitude), then the presumed class chosen is AML; for the (+/-) incertitude space the Mahalanobis distance was used. In this case: threshold = 0.5319 and incertitude = 0.1 (circa 20%).]
Figure 4. Box plot diagram for estimated errors using k-NN algorithm and Mahalanobis Distance metric

[Plot: box plots titled "Error rates using Mahalanobis distance in maximum way". x-axis: k value of the k-NN algorithm (1 to 47); y-axis: error rates (0 to 0.6). Annotation in the plot: the k-NN rule was applied first. If the number of ALL votes is 0, then the presumed class chosen is AML. If the number of AML votes is 0, then the presumed class chosen is ALL. In all the other cases, the Mahalanobis distance from the vector to the ALL and AML learning groups was used; the minimum distance decided the presumed class of the vector.]
Figure 5. Box plot diagram for estimated classification errors using k-NN algorithm and Mahalanobis Distance metric


[Plot: line plot titled "Detail in the interesting area of global error rates". Legend: Correlation metric; Euclidean Distance; Correlation & Mahalanobis metric; Mutual Information metric. x-axis: k value from k-NN algorithm (12 to 21); y-axis: estimated error rates (0 to 0.045).]
Figure 6. Comparison of the estimated classification errors values using k-NN algorithm and Correlation Metric, Euclidean Distance, Mahalanobis Distance metric and Mutual Information Metric.

[Plot: line plot titled "Detail in the interesting area of global error rates". Legend: Correlation Distance; Euclidean Distance; Correlation & Mahalanobis Distance; ECC with (Mean & Standard Deviation) Quantisation; ECC with Lloyd Quantisation. x-axis: k value from k-NN algorithm (12 to 24); y-axis: estimated error rates (0 to 0.045).]
Figure 7. Estimated classification errors for Entropy Correlation Coefficient metric compared with the other metrics.

[Plot: three bar-plot panels titled "Error Rates using mean and standard deviation", "Error Rates using Lloyd Quantisation" and "Error Rates using Maximum Entropy Quantisation". Legend in each panel: 1 error; 2 errors; 3 errors. x-axis: k value from k-NN algorithm (10 to 25); y-axis: Estimated Error Rate [%] (0 to 4).]
Figure 8. Comparing the estimated error for Entropy Correlation Coefficient metric in different quantization models.

[Plot: four bar-plot panels titled "Error rate for the first 20 BSS / WSS", "Error rate for the first 40 BSS / WSS", "Error rate for the ordered 41-80 BSS / WSS" and "Error rate for the first 200 BSS / WSS". Legend: number of errors per run (1 error, 2 errors, ... up to 10 errors). x-axis: k value from k-NN algorithm (10 to 25); y-axis: % Error (0 to 8).]
Figure 9. Comparing the estimated error for Entropy Correlation Coefficient metric; Lloyd quantization for different p = BSS / WSS values; N = 150.


[Plot: "The number of Misclassifications using the Entropy Correlation Coefficient and Lloyd Quantisation for the first 20 BSS / WSS". Four panels: No. of Misclassification; No. of Total Tests; Mis / All tests (% Error); Mis / Hit (Ratio Value). x-axis in each panel: indices of experiment (1 to 72).]
Figure 10. The estimated error distribution over the gene expression profiles using Entropy Correlation Coefficient metric; Lloyd quantization; k = 12 to 21; N = 150; with p = 20.

[Plot: "The number of Misclassifications using the Entropy & Lloyd Quantisation for the first 40 BSS / WSS". Four panels: No. of Misclassification; No. of Total Tests; Mis / All tests (% Error); Mis / Hit (Ratio Value). x-axis in each panel: indices of experiment (1 to 72).]
Figure 11. The estimated error distribution over the gene expression profiles using Entropy Correlation Coefficient metric; Lloyd quantization; k = 12 to 21; N = 150; with p = 40.

[Plot: "The number of Misclassifications using Entropy & Lloyd Quantisation for the first 200 BSS / WSS". Four panels: No. of Misclassification; No. of Total Tests; Mis / All tests (% Error); Mis / Hit (Ratio Value). x-axis in each panel: indices of experiment (1 to 72).]
Figure 12. The estimated error distribution over the gene expression profiles using Entropy Correlation Coefficient metric; Lloyd quantization; k = 12 to 21; N = 150; with p = 200.

[Plot: "The number of Misclassifications using the Entropy & Lloyd Quantisation for p = 41-80 genes". Four panels: No. of Misclassification; No. of Total Tests; Mis / All tests (% Error); Mis / Hit (Ratio Value). x-axis in each panel: indices of experiment (1 to 72).]
Figure 13. The estimated error distribution over the gene expression profiles using Entropy Correlation Coefficient metric; Lloyd quantization; k = 12 to 21; N = 150; with p = 40 genes, those with ranks from 41 to 80 considering the BSS / WSS order.


[Plot: four bar-plot panels titled "Error rate for the first 20 BSS / WSS with 3 classes", "Error rate for the first 40 BSS / WSS with 3 classes", "Error rate for the ordered 41-80 BSS / WSS with 3 classes" and "Error rate for the first 200 BSS / WSS with 3 classes". Legend: number of errors per run (1 error, 2 errors, ... up to 9 errors). x-axis: k value from k-NN algorithm (0 to 15); y-axis: % Error (0 to 10).]
Figure 14. Comparing the estimated error for 3 classes (ALL B cell, ALL T cell and AML); Entropy Correlation Coefficient metric; Lloyd quantization; different p = BSS / WSS values; N = 150.
Figure 15. The estimated error distribution over the gene expression profiles for 3 classes (ALL B cell, ALL T cell and AML); using Entropy Correlation Coefficient metric; Lloyd quantization; p = 20; k = 17; N = 150.
[Plot title: The number of Misclassification using the Entropy & Lloyd Quantisation for 3 classes. Panels: No. of Misclassification, No. of Total Tests, % Error, Ratio Value; panel titles Mis / Total Tests and Miss / Hit; x-axis: Indices of experiment.]

Figure 16. The estimated error distribution over the gene expression profiles for 3 classes (ALL B cell, ALL T cell and AML); using Entropy Correlation Coefficient metric; Lloyd quantization; p = 40; k = 17; N = 150.
[Plot title: The number of Misclassifications using the Entropy & Lloyd Quantisation for 3 classes. Panels: No. of Misclassifications, No. of Total Tests, % Error, Ratio Value; x-axis: Indices of experiment.]

Figure 17. The estimated error distribution over the gene expression profiles for 3 classes (ALL B cell, ALL T cell and AML); using Entropy Correlation Coefficient metric; Lloyd quantization; p = 200; k = 17; N = 150.
[Plot title: The number of Misclassification using the Entropy & Lloyd Quantization for 3 classes. Panels: No. of Misclassifications, No. of Total Tests; panel titles Mis / Total Tests and Mis / Hit; x-axis: Indices of experiment.]

Figure 18. The estimated error distribution over the gene expression profiles for 3 classes (ALL B cell, ALL T cell and AML); using Entropy Correlation Coefficient metric; Lloyd quantization; p = 40, the genes ranked 41 to 80 in the order of BSS / WSS; k = 17; N = 150.
[Plot title: The number of Misclassifications using the Lloyd Quantisation and Correlation Coefficient for 3 classes. Panels: No. of Misclassifications, No. of Total Tests, % Error, Ratio Value; panel titles Miss / Total Tests and Miss / Hit; x-axis: Indices of experiment.]

Figure 19. Estimated error for 3 classes (ALL B cell, ALL T cell and AML) using auto selected k value; Entropy Correlation Coefficient metric; Lloyd quantization; different p = 20, 40, 200; N = 1[…]
[y-axis: % Error; x-axis: The number "p" of BSS / WSS; legend: 1 error to 7 errors.]

Figure 20. The estimated error distribution over the gene expression profiles for 3 classes (ALL B cell, ALL T cell and AML); using auto selected k value; Entropy Correlation Coefficient metric; Lloyd quantization; p = 40; N = 150.
[Plot title: The number of Misclassifications using ECC & Lloyd Quantisation; first quantising and second choosing BSS / WSS for 3 classes with k autoselected from LS. Panels: No. of Misclassifications, No. of Total Tests, % Error, Ratio Value; panel titles Miss / All Tests and Miss / Hit; x-axis: Indices of experiment.]

8. REFERENCES
1. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R.
Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander. Molecular classification of cancer: class discovery
and class prediction by gene expression monitoring. Science, 286: pp. 531-537, (1999)
2. C. J. Stone. Consistent nonparametric regression (with discussion). Ann. Statist. 5, pp. 595-645. (1977)
3. G. J. McLachlan. Discriminant analysis and statistical pattern recognition, Wiley-Interscience Pub. ISBN 0-471-61531-5 (1992)
4. S. Dudoit, J. Fridlyand, T. P. Speed. Comparison of Discrimination Methods for the Classification of Tumors
Using Gene Expression Data, Dept. of Statistics, University of California, Berkeley, Technical report # 576 (2000)
5. S. P. Lloyd. Least Squares Quantization in PCM, IEEE Transactions on Information Theory, Vol. IT-28, March
1982, 129-137. (1982)
6. Takashi Odaka. Sum of Squares Decomposition for Categorical Data, Kwansei Gakuin Studies in Computer
Science, Vol. 14, pp. 1-6 (1999)
7. E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination: consistency properties. Technical
Report, Randolph Field, Texas: USAF School of Aviation Medicine (1951)
8. B. D. Ripley. Pattern Recognition and Neural Networks, Cambridge University Press. ISBN 0-521-46086-7 (1996)
9. M. E. Hellman. The nearest neighbor classification rule with a reject option. IEEE Transactions Systems Science
and Cybernetics 6, pp. 179-185. (1970) Reprinted in Dasarathy (1991)
10. J. Astola, I. Virtanen. Measure of overall statistic dependence based on the entropy concept. Vaasan korkeakoulu
ISBN 951-683-181-8 (1983)

Publication no. 2
Mircean C, Tabus I, Astola J, Kobayashi T, Shiku H, Yamaguchi M,
Shmulevich I. and Zhang W.
Quantization and similarity measure selection for discrimination
of lymphoma subtypes under k-nearest neighbor classification.
SPIE 2004, BiOS 2004, Microarrays, Combinatorial Techniques
and High Throughput Screening, 24-29 January 2004, San Jose,
California, USA

Microarrays and Combinatorial Techniques: Design, Fabrication, and Analysis II,
edited by Dan V. Nicolau, Ramesh Raghavachari, Proceedings of SPIE Vol. 5328
(SPIE, Bellingham, WA, 2004) 1605-7422/04/$15 doi: 10.1117/12.529580

Proc. of SPIE Vol. 5328


Publication no. 3
Tabus I, Mircean C, Zhang W, Shmulevich I. and Astola J.
Chapter 14: Transcriptome-Based Glioma Classification using
Informative Gene Set
in Genomic and Molecular Neuro-Oncology, Jones and Bartlett
Publishers, 2003 ISBN: 0-7637-2261-8

Publication no. 4
Fuller GN*, Mircean C*, Tabus I, Taylor E, Sawaya R, Bruner MJ,
Shmulevich I, Zhang W.
Molecular Voting for Glioma Classification Reflecting
Heterogeneity in the Continuum of Cancer Progression.
Oncol Rep. 2005 14: 651-656. *Co-first author.

Molecular Voting for Glioma Classification Reflecting Heterogeneity in the Continuum of
Cancer Progression
Running title: Molecular voting for glioma classification

Gregory N. Fuller1, Cristian Mircean1, Ioan Tabus, Ellen Taylor, Raymond Sawaya, Janet M.
Bruner, Ilya Shmulevich, Wei Zhang
Department of Pathology (GNF, CM, JMB, ET, IS, WZ) and Neurosurgery (RS), The University of Texas
M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, Texas 77030 USA.
Institute of Signal Processing (CM, IT), Tampere University of Technology
P.O. Box 553, Tampere 33101, Finland.

Gregory N. Fuller and Cristian Mircean made equal contribution to this study

Correspondence: Wei Zhang, Ph.D., Cancer Genomics Core Laboratory, Department of Pathology, the
University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston, Texas 77030. Tel:
713-745-1103; Fax: 713-792-5549
E-mail: wzhang@mdanderson.org.
Keywords: glioma, classification of mixed glioma, multidimensional scaling.
Abbreviations used: GBM, glioblastoma multiforme; AA, anaplastic astrocytoma; AO, anaplastic
oligodendroglioma; OL, oligodendroglioma; MDS, multidimensional scaling; k-NN, k-nearest neighbor;
WSS, within-group sum of squares; BSS, between-group sum of squares.

Abstract
Gliomas are the most common brain tumors. Based on morphological features, they are generally
categorized into two lineages (astrocytic and oligodendrocytic) and three grades: low-grade
(astrocytoma and oligodendroglioma), mid-grade (anaplastic astrocytoma and anaplastic
oligodendroglioma), and high-grade (glioblastoma multiforme). A strict classification scheme has
limitations because a specific glioma can be at any stage of the continuum of cancer progression
and may contain mixed features. Thus, a more comprehensive classification based on molecular
signatures may reflect the biological nature of specific tumors more accurately. In this study, we
used microarray technology to profile the gene expression of 49 human brain tumors and
applied the k-nearest neighbor algorithm for classification. We first trained the classification
gene set with the 19 most typical glioma cases and selected a set of genes that provided the lowest
cross-validation classification error with k = 5. Then we applied this gene set to the 30
remaining cases, including several that do not belong to the gliomas, such as atypical meningioma.
The results showed that not only does the algorithm correctly classify most of the gliomas, but the
detailed voting results also provide more subtle information regarding the molecular similarities to
the neighboring classes. For the atypical meningioma, the voting was equally split among the
four classes, indicating a difficulty in placing meningioma into the four classes of gliomas.
Thus, the actual voting results, which are typically used only to decide the winning class label in
k-nearest neighbor algorithms, in and of themselves provide a useful method for gaining deeper
insight into the stage of a tumor in the continuum of cancer development.

Introduction
Gliomas are primary tumors of the central nervous system and account for 80% of adult primary
brain tumors (Kleihues and Cavenee, 2000). The prognosis for patients with an advanced
glioma, GBM, is very poor, with a median survival of 8 to 10 months. The diffuse gliomas are
traditionally separated into subtypes based on subjective interpretation and arbitrary weighting of
morphologic features (histologic criteria). The gliomas that share morphological features of
normal astrocytes are classified as astrocytomas and those with morphological features of normal
oligodendrocytes are classified as oligodendrogliomas. Depending on the cellularity and features
such as mitotic index and the presence of necrosis, gliomas are further classified as low-grade
(oligodendroglioma and astrocytoma), mid-grade (anaplastic oligodendroglioma and
anaplastic astrocytoma), or high-grade (GBM) (see Caskey et al., 2000 for a review). However,
heterogeneity is an intrinsic feature of all tumors and histologic classification does not capture
the continuum of the tumor spectrum. When the dominant pathological feature cannot be
identified for a specific tumor, pathologists designate it a mixed tumor. For gliomas, tumors of
mixed or uncertain phenotype represent a significant percentage (10-30%, Kleihues and
Cavenee, 2000). To alleviate the subjectivity in cancer classification, gene expression profiling
has been used to survey the molecular events in cancers and a number of computational
algorithms have been used to perform molecular classification based on groups of genes
(Alizadeh et al., 2000; Fuller et al., 2002; Hedenfalk et al., 2001; Kim et al., 2002). However,
most studies ostensibly concluded that gene expression-based classification matched
pathological classification, and thus provided no further insight in terms of classification.
In addition, there have been few attempts to classify the pathologically uncertain cases. A
recent study (Nutt et al., 2003) investigated whether gene expression profiling could be used to
define subgroups of GBM and AO more objectively than standard pathology. The feature genes
were ranked based on the correlation to each of the two classes and tested against a random
permutation. The classifier, based on Euclidean distance, achieved its lowest error rate with 20
features. This study showed that prediction models can predict survival for diagnostically
challenging malignant gliomas better than standard pathology.
In this study, we established gene expression profiles of 49 brain tumors including diffuse
gliomas of both typical subtypes (OL, AO, AA, and GBM) and atypical types. By using the
k-NN method and Fisher discriminant, we identified a set of 50 genes that were used to construct a
voting classifier whose assignments were highly consistent with pathological evaluation. Further, the actual
votes for each class reflect heterogeneity in the continuum of glioma progression. The molecular
voting showed that mixed gliomas are indeed mixed in gene expression profiles with reference to
the typical subtypes. Finally, the voting for an atypical meningioma was split between all
classes, suggesting a difficulty in its placement, which is consistent with the non-glioma nature
of meningioma.
Materials and Methods:
Patient samples and microarray experiments. All the glioma tissues were obtained from
the Brain Tumor Tissue Bank of M. D. Anderson Cancer Center with the approval of the
Institutional Review Board. The tissues were first evaluated by a pathologist (GNF) to
confirm the initial pathological evaluation and to ensure that only those tissues with more than 90%
tumor were used.
RNA isolation, microarray experiments, and image analysis were carried out following
procedures previously described (Fuller et al., 1999; Shmulevich et al., 2002; Kobayashi et al.,
2003). The cDNA microarray used for this study included 2,303 genes printed in duplicate and
sequence-verified before printing in the Cancer Genomics Core Laboratory of M. D. Anderson
Cancer Center (Taylor et al., 2001).
Voting unlabelled samples using k-Nearest Neighbor (k-NN) analysis on quantized
data. At the preprocessing phase, we eliminated those genes considered to be incorrectly
measured by using the Hampel identifier procedure (Huber, 1981) for outlier detection on
log-transformed expressions (see Mircean et al., 2004). Starting from the initial set of 2,303 genes for
each patient, we retained 2,009 genes, with the expression of each remaining gene estimated
as the average of its two replicates. As an additional filter for non-informative
genes, we eliminated those genes having the same quantized expression in all samples and were
left with 1,826 genes to be used for the k-NN classifier.
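The preprocessing step can be illustrated with a minimal Python sketch. The text does not state which quantity the Hampel identifier was applied to, so the sketch assumes it is the difference between the two log-transformed replicates of each gene, with the customary threshold of three scaled median absolute deviations. The function names, the threshold `t`, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hampel_outliers(x, t=3.0):
    """Flag entries lying more than t scaled MADs away from the median."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    # 1.4826 makes the MAD a consistent estimate of sigma for Gaussian data.
    mad = 1.4826 * np.median(np.abs(x - med))
    if mad == 0.0:
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - med) > t * mad

def filter_and_average(rep1, rep2, t=3.0):
    """Drop genes whose replicates disagree abnormally; average the others."""
    log1, log2 = np.log2(rep1), np.log2(rep2)
    keep = ~hampel_outliers(log1 - log2, t)   # assumption: test replicate differences
    return keep, 0.5 * (log1[keep] + log2[keep])

# Synthetic duplicate spots for one sample (hypothetical data).
rng = np.random.default_rng(0)
rep1 = rng.lognormal(mean=5.0, sigma=1.0, size=200)
rep2 = rep1 * rng.lognormal(mean=0.0, sigma=0.05, size=200)
rep2[:3] *= 50.0                              # three grossly inconsistent spots
keep, expr = filter_and_average(rep1, rep2)
```

Because the median and the MAD are themselves insensitive to outliers, a few corrupted spots do not inflate the rejection threshold, which is the usual reason for preferring this rule over a mean and standard deviation test.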
Quantization can be seen as a tradeoff between removing unwanted noise and keeping
useful information. It has already been demonstrated that quantizing gene expression data prior
to classification can dramatically reduce the classification error (Mircean et al., 2004). Applying
quantization to a dataset makes subsequent decisions more robust to the noise usually present
in microarray data, especially when the dataset is composed of measurements collected over a
long period of time. Therefore, we decided to find the quantization thresholds separately for
each patient sample, since the number of genes observed per sample is large enough to
partition the data accurately. Implicitly, this procedure normalizes the gene
expressions across patients. The decision to quantize the data prior to subsequent
analysis such as classification is motivated by practical reasons (i.e., reducing classification error
by removing unwanted sources of variability), especially in the absence of any statistical
assumptions on the data or the noise, and we do not rule out the possibility that other
statistical procedures that do not involve quantizing the data may perform equally well.
Such a decision is quite similar in spirit to pre-smoothing of data, commonly performed in signal
and image processing prior to further analysis.
Concerning the number of quantization levels, the simplest approach is the binary model,
where the genes can be either "on" or "off". This approach allows the use of existing
methods for dealing with binary models and has been shown to be
useful in gene expression data analysis (Shmulevich and Zhang, 2002; Zhou et al., 2003). The
data may also be quantized to three or four levels, which allows each gene to discriminate
more than two classes. We considered and compared binary, ternary, and quaternary
quantization and obtained the optimal thresholds by using the Lloyd algorithm (also known as
k-means) (Lloyd, 1982) separately for each patient sample.
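A minimal sketch of scalar Lloyd quantization for one patient sample is given below. The quantile-based initialization, the iteration cap, and the name `lloyd_quantize` are assumptions made for illustration; the text specifies only that the Lloyd algorithm was run separately on each sample with two, three, or four levels.

```python
import numpy as np

def lloyd_quantize(values, levels=4, iters=100):
    """1-D Lloyd (k-means) quantizer: returns integer levels and thresholds."""
    x = np.asarray(values, dtype=float)
    # Assumed initialization: evenly spaced quantiles of the sample.
    centers = np.quantile(x, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Optimal thresholds lie midway between neighbouring centroids.
        thresholds = 0.5 * (centers[:-1] + centers[1:])
        labels = np.searchsorted(thresholds, x)
        new_centers = centers.copy()
        for j in range(levels):
            members = x[labels == j]
            if members.size:          # an empty cell keeps its old centroid
                new_centers[j] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    thresholds = 0.5 * (centers[:-1] + centers[1:])
    return np.searchsorted(thresholds, x), thresholds

# One synthetic "sample" with four well separated expression modes.
rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(m, 0.2, 50) for m in (0.0, 2.0, 4.0, 6.0)])
q, thr = lloyd_quantize(sample, levels=4)
```

Running the function on each column of a genes-by-samples matrix gives per-sample thresholds, so the quantized values become comparable across arrays measured at different times.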
The most discriminative genes are selected according to the ratio of the between-group sum of
squares (BSS) to the within-group sum of squares (WSS) (see Mircean et al., 2002).
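The gene ranking can be sketched as follows, using the standard per-gene definition in which BSS sums the squared deviations of class means from the overall mean and WSS sums the squared deviations of samples from their class mean. The function names and the handling of genes with zero WSS are illustrative assumptions.

```python
import numpy as np

def bss_wss_ratio(X, y):
    """X: genes x samples, y: class label of each sample. One ratio per gene."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall = X.mean(axis=1)
    bss = np.zeros(X.shape[0])
    wss = np.zeros(X.shape[0])
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1)
        bss += Xc.shape[1] * (mc - overall) ** 2
        wss += ((Xc - mc[:, None]) ** 2).sum(axis=1)
    ratio = np.zeros(X.shape[0])
    pos = wss > 0
    ratio[pos] = bss[pos] / wss[pos]
    # Assumption: a gene that is constant within every class but differs
    # between classes (possible after quantization) is ranked first.
    ratio[~pos & (bss > 0)] = np.inf
    return ratio

def top_genes(X, y, p):
    """Indices of the p genes with the largest BSS/WSS ratio."""
    return np.argsort(-bss_wss_ratio(X, y), kind="stable")[:p]

# Synthetic check: gene 7 is made to depend strongly on the class.
rng = np.random.default_rng(2)
y = np.array([0] * 5 + [1] * 5 + [2] * 5)
X = rng.normal(size=(30, 15))
X[7] += 5.0 * y
selected = top_genes(X, y, 3)
```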
As a variant of nearest neighbor density estimation of the group-conditional densities (Fix and
Hodges, 1951; Stone, 1977), k-NN assigns to an unknown profile the label that occurs most
frequently among its k nearest neighbors. In case of a tie, k is reduced until a winner exists.
We also tested the correlation coefficient and the entropy correlation coefficient
(Mircean et al., 2002) as similarity measures between the gene-expression profiles of two patient
samples.
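The voting rule, including the tie handling, might look like the sketch below. It uses the Pearson correlation coefficient as the similarity measure and returns both the winning label and the vote counts at the requested k, since the counts themselves are the quantity of interest in this study. The name `knn_vote` and the synthetic profiles are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_vote(train, labels, query, k=5):
    """train: samples x genes. Returns (winning label, votes at the requested k)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Similarity between the query profile and each training profile.
    sims = np.array([np.corrcoef(row, query)[0, 1] for row in train])
    order = np.argsort(-sims, kind="stable")      # most similar first
    votes_at_k = Counter(labels[i] for i in order[:k])
    kk = k
    while True:
        ranked = Counter(labels[i] for i in order[:kk]).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0], dict(votes_at_k)
        kk -= 1                                   # tie: drop the farthest neighbour

# Two synthetic class prototypes with three noisy training profiles each.
rng = np.random.default_rng(3)
proto = {"AA": rng.normal(size=50), "GBM": rng.normal(size=50)}
labels = ["AA"] * 3 + ["GBM"] * 3
train = np.array([proto[c] + 0.1 * rng.normal(size=50) for c in labels])
query = proto["GBM"] + 0.1 * rng.normal(size=50)
winner4, votes4 = knn_vote(train, labels, query, k=4)
winner6, votes6 = knn_vote(train, labels, query, k=6)   # 3-3 tie, resolved at k = 5
```

With k = 1 a single class always wins, so the loop terminates for any valid k.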
Results and discussion
We began this study with the aim of providing a molecular classification of the mixed gliomas.
We profiled the gene expression of 49 brain tumor tissue samples using a 2,303-gene cDNA
microarray produced in our facility. Those 49 cases included 27 glioblastomas (GBM), 7
anaplastic astrocytomas (AA), 6 anaplastic oligodendrogliomas (AO), 3 oligodendrogliomas
(OL), 5 mixed gliomas, and one case of atypical meningioma. To gain a global estimate of the
similarities or differences among the cases, we performed an initial unsupervised analysis using
multidimensional scaling (MDS). Almost no separation among the four typical groups of
gliomas, as well as the mixed cases, was observed in the MDS representation (Figure 1),
implying the presence of many non-informative genes.
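For readers unfamiliar with MDS, a short sketch of the classical (Torgerson) variant follows. The text does not say which MDS variant or which dissimilarity was used, so both the algorithm choice and the Euclidean dissimilarity in the example are assumptions made for illustration.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed n points in `dims` dimensions from an n x n dissimilarity matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centred Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dims]          # keep the largest ones
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

# Sanity check on points that really are two-dimensional.
rng = np.random.default_rng(4)
P = rng.normal(size=(10, 2))
D = np.linalg.norm(P[:, None] - P[None, :], axis=2)
Y = classical_mds(D, dims=2)
D_embedded = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)
```

For expression profiles, D would be computed between patient samples over the chosen gene set; when many genes carry no class information, the dissimilarities are dominated by noise and the embedded groups overlap.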
The best-performing genes for molecular classification were identified by using 19
samples that exhibited characteristic morphologic features of the four basic subtypes of diffuse
glioma (5 anaplastic astrocytomas, 5 anaplastic oligodendrogliomas, 6 glioblastomas, and 3
oligodendrogliomas). We first selected candidate subsets of the most discriminative genes
according to the BSS/WSS ratio, the size of the gene set being p ∈ {20, 30, 40, 50, 60, 70}, and then
analyzed the data set using the k-NN algorithm and MDS representation for visualization. Using
cross-validation, we chose the parameters (p, k, and the number of quantization levels) by
performing a random 2:1 split of the set of 19 patients (12 in the training set and 7 in the test set).
The classification problem consists of discriminating AA, AO, OL, and GBM. We tested various
numbers of discriminative genes using both quantized and unquantized data. The smallest
cross-validation error was obtained for quaternary quantized data (Lloyd quantization) with the
correlation coefficient as the similarity measure and with 50 retained discriminative genes
(summarized in Table 2). Figure 2 shows an MDS representation of the 49 patients using the
identified 50 informative genes. This figure demonstrates the spatial separation among the four
glioma groups and visually presents the closeness of patients diagnosed as mixed gliomas and
atypical meningioma.
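The parameter search can be sketched as a repeated random 12/7 split over a grid of p and k. This is a simplified illustration on synthetic data: the number of repetitions, the grid, the choice to repeat gene selection inside every split, and all function names are assumptions, and the choice of the number of quantization levels is omitted.

```python
import numpy as np
from collections import Counter

def bss_wss(X, y):
    """Per-gene BSS/WSS ratio; X is samples x genes."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        bss += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        wss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return bss / np.maximum(wss, 1e-12)

def knn_label(train_X, train_y, query, k):
    """Majority vote among the k most correlated profiles; a tie shrinks k."""
    sims = np.array([np.corrcoef(row, query)[0, 1] for row in train_X])
    order = np.argsort(-sims)
    while True:
        ranked = Counter(train_y[order[:k]].tolist()).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]
        k -= 1

def split_error(X, y, p, k, n_train=12, n_splits=30, seed=0):
    """Average test error over repeated random 12/7 splits."""
    rng = np.random.default_rng(seed)
    wrong = total = 0
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]
        # Assumption: genes are re-selected on the training part of each split.
        genes = np.argsort(-bss_wss(X[tr], y[tr]))[:p]
        for i in te:
            pred = knn_label(X[tr][:, genes], y[tr], X[i, genes], k)
            wrong += int(pred != y[i])
            total += 1
    return wrong / total

# Synthetic stand-in for the 19 typical cases (class sizes 5, 5, 6, 3).
rng = np.random.default_rng(5)
y = np.repeat([0, 1, 2, 3], [5, 5, 6, 3])
X = rng.normal(size=(19, 100))
for c in range(4):
    X[y == c, 3 * c:3 * c + 3] += 6.0      # three marker genes per class
grid = {(p, k): split_error(X, y, p, k) for p in (6, 12, 24) for k in (1, 3, 5)}
best = min(grid, key=grid.get)
```

Selecting genes inside each split keeps the test samples out of the feature ranking, which avoids an optimistic bias in the estimated error.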
To test the robustness of the derived classifier gene set for molecular voting, we
performed the voting analysis using the same classifiers on the remaining independent set of 30
samples (validation set - see Table 1), among which 4 had mixed (oligoastrocytic) features
and 1 was a non-glioma tissue (atypical meningioma). In this algorithm,
classifier genes are used to vote for an assignment for each case into one of the four subtypes
based on its neighbors. If our objective is to produce one final nosologic assignment, then this
can be determined by the class receiving the highest number of votes. However, as we shall see,
the distribution of the votes among the four classes may itself be more informative, especially for
the case of mixed tumors.
For the classical cases in the independent validation set, only two were classified at
variance with the traditional histopathologic assignment (Table 1). One of the misassigned
GBM cases had three votes for AA and two votes for GBM. Another GBM case had four votes
for AO and one vote for AA. Because there is no absolute dividing line between AA and GBM,
and high-grade AOs are often characterized as GBMs, this may not be a result of
misassignment. Rather, the detailed voting information provides a more subtle snapshot of the
cancer progression process and of the characteristic features of a tumor. Careful scrutiny of the
two tables showed that some GBMs had 4-5 votes for GBM whereas others showed 3 votes for
GBM and 2 votes for AA. Thus, the detailed votes may provide important clinical information
about the differences between cases in the continuum of the glioma spectrum.
Similarly, for the mixed gliomas, voting was conspicuously divided among the four glioma
subtypes. For the atypical meningioma, which is not a glioma, the almost even voting
split among the four glioma subtypes clearly indicated that this case does not fit any of those four
subtypes well. Thus, this molecular voting algorithm with the k-NN method appears to provide
additional diagnostic information as compared to classical pathological diagnosis.

We attempted to determine whether the detailed voting has any bearing on clinical
outcome by correlating the number of votes for GBM with patient survival. Indeed, we observed
a correlation between the number of GBM votes and shorter survival. However, this
observation should not be overinterpreted at this time because GBMs are overrepresented in our
sample population, as they are in the glioma population, and GBMs are known to have shorter survival.
Nevertheless, this study provides insight for future studies with molecular diagnosis or
prognosis as the goal. We envision that a survival (or therapy response) gene set can be
determined from a training set of gliomas with long, median, and short survival times (or
response levels). This gene set may then be used to predict whether a patient will survive longer
than the median or respond at different levels.
Acknowledgements

The work was partially supported by an Advanced Technology grant from Texas Higher
Education Coordinating Board and the Bullock Fund for Brain Tumor Research, and a grant
from Academy of Finland. The Cancer Genomics Core Lab is supported by the Tobacco
Settlement Fund to M. D. Anderson Cancer Center (MDACC) as appropriated by the Texas
Legislature, a grant from Kadoorie Foundation to MDACC, a grant from the Goodwin Fund, and
the Cancer Center Supporting Grant from NIH/NCI.

References:

1. Alizadeh, A. A., Eisen, M. B., Davis, R.E., Ma C, Lossos, I. S., Rosenwald, A., Boldrick,
J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson,
J. Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C.,
Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R.,
Byrd, J. C., Botstein, D., Brown, P. O., Staudt, L. M. Distinct types of diffuse large B-cell
lymphoma identified by gene expression profiling. Nature., 403(6769):503-11, 2000
2. Borg, I., and Groenen, P. Modern Multidimensional Scaling: Theory and Applications.
Springer, New York, 1997.
3. Caskey, L. S., Fuller, G. N., Bruner, J. M., Yung, W. K., Sawaya, R. E., Holland, E. C.,
and Zhang, W. Toward a molecular classification of the gliomas: histopathology,
molecular genetics, and gene expression profiling. Histol. Histopathol., 15: 971-981,
2000.
4. Fix, E., and Hodges, J. Discriminatory analysis, nonparametric discrimination:
consistency properties. Technical Report, Randolph Field, Texas: USAF School of
Aviation Medicine (1951)
5. Fuller, G. N., Rhee, C. H., Hess, K. R., Caskey, L. S., Wang, R., Bruner, J. M., Yung, W.
K., and Zhang, W. Reactivation of insulin-like growth factor binding protein 2
expression in glioblastoma multiforme: a revelation by parallel gene expression profiling.
Cancer Res 59: 4228-32, 1999.
6. Fuller, G. N., Hess, K. R., Rhee, C. H., Yung, W. K., Sawaya, R. A., Bruner, J. M., and
Zhang W. Molecular classification of human diffuse gliomas by multidimensional scaling
analysis of gene expression profiles parallels morphology-based classification, correlates
with survival, and reveals clinically-relevant novel glioma subsets. Brain Pathol.
12(1):108-16, 2002.
7. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander,
E. S. Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science. 286(5439):531-7, 1999.
8. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P.,
Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A., and Trent, J.
Gene-expression profiles in hereditary breast cancer. N Engl J Med. 344(8):539-48, 2001.
9. Huber, P. J. Robust Statistics. John Wiley & Sons, p.107, 1981.
10. Kim, S., Dougherty, E. R., Shmulevich, I., Hess, K. R., Hamilton, S. R., Trent, J. M.,
Fuller, G. N., and Zhang, W. Identification of combination gene sets for glioma
classification. Mol Cancer Ther. (13):1229-36, 2002.
11. Kleihues, P., Cavenee, W. K. Pathology and genetics of tumours of the nervous system.
Lyon: IARC press; 2000.
12. Kobayashi, T., Yamaguchi, M., Kim, S., Morikawa, J., Ogawa, S., Ueno, S., Suh, E.,
Dougherty, E., Shmulevich, I., Shiku, H., and Zhang, W. Microarray reveals differences
in both tumors and vascular specific gene expression in de novo CD5+ and CD5- diffuse
large B-cell lymphomas. Cancer Res. 63(1):60-6, 2003.
13. Lloyd, S. P. Least Squares Quantization in PCM, IEEE Transactions on Information
Theory, vol. IT-28, March 1982, 129-137, 1982.


14. Mircean, C., Tabus, I., and Astola, J. Quantization and distance function selection for
discrimination of tumors using gene expression data. Proceedings of SPIE Photonics
West 2002, BiOS 2002 Symposium, San Jose, CA. 2002.
15. Mircean, C., Tabus, I., Astola, J., Kobayashi, T., Shiku, H., Yamaguchi, M., Shmulevich,
I., and Zhang, W. Quantization and similarity measure selection for discrimination of
lymphoma subtypes under k - nearest neighbor classification. SPIE Photonics West 2004,
BiOS 2004 Symposium, San Jose, CA. 2004.
16. Nutt, C. L., Mani, D. R., Betensky, R. A., Tamayo, P., Cairncross, J. G., Ladd C., Pohl,
U., Hartmann, C., McLaughlin, M. E., Batchelor, T. T., Black, P. M., Deimling, A.,
Pomeroy, S. L., Golub, T. R., and Louis, D. N. Gene Expression-based Classification of
Malignant Gliomas Correlates Better with Survival than Histological Classification.
Cancer Research 63, 1602-1607, 2003.
17. Shmulevich, I., Hunt, K., El-Naggar, A., Taylor, E., Ramdas, L., Laborde, P., Hess, K.
R., Pollock, R., and Zhang, W. Tumor specific gene expression profiles in human
leiomyosarcoma: an evaluation of intratumor heterogeneity. Cancer; 94: 2069-2075,
2002.
18. Shmulevich, I., and Zhang, W. Binary Analysis and Optimization-Based Normalization
of Gene Expression Data. Bioinformatics, 18(4): 555-565, 2002.
19. Stone, C. J. Consistent nonparametric regression (with discussion). Ann Statist. 5 pp.
595-645, 1977.
20. Taylor, E., Cogdell, D., Coombes, K., Hu, L., Ramdas, L., Tabor, A., Hamilton, S., and
Zhang, W. Sequence verification as quality control step for production of cDNA
microarray. BioTechniques 31: 62-65, 2001.
21. Zhou, X., Wang, X. and Dougherty, E. R. Binarization of microarray data on the basis of
a mixture model. Mol Cancer Ther. 2(7):679-84, 2003.


Table 1. Molecular voting procedure results and gene expression based diagnostic.
Nr    Index /   Pathology                 Gene expression based diagnostic    Votes k=4         Votes k=5
Crt   Dataset                             k=1    k=4                k=5       [AA AO OL GBM]    [AA AO OL GBM]
20    B-01      GBM                       GBM    GBM                GBM       0 0 0 4           0 0 0 5
21    B-20      GBM                       GBM    GBM/AA*            GBM       2 0 0 2           2 0 0 3
22    B-21      AA                        AA     AA                 AA        3 0 0 1           4 0 0 1
23    B-22      GBM                       AA     AA/GBM*            GBM       2 0 0 2           2 0 0 3
24    B-23      GBM                       AA     AA/GBM*            AA        2 0 0 2           3 0 0 2
25    B-24      AA                        AA     AA                 AA        3 0 1 0           3 0 1 1
26    B-25      Anaplastic mixed AO       n.r.   n.r.               n.r.      n.r.              n.r.
27    B-26      GBM                       n.r.   GBM                GBM       0 1 0 3           0 1 0 4
28    B-27      GBM                       n.r.   AA/GBM*            GBM       2 0 0 2           2 0 0 3
29    B-28      GBM                       n.r.   GBM                GBM       1 0 0 3           1 0 0 4
30    B-29      high grade Astrocytoma    n.r.   n.r.               n.r.      n.r.              n.r.
31    B-30      GBM                       n.r.   n.r.               n.r.      n.r.              n.r.
32    B-31      low grade Astrocytoma     n.r.   n.r.               n.r.      n.r.              n.r.
33    B-32      mixed oligo               AA     AA                 AA        3 0 0 1           3 0 0 2
34    B-33      GBM                       AA     AA/GBM*            GBM       2 0 0 2           2 0 0 3
35    B-34      AO w/necrosis             AA     AO                 AO        1 2 0 1           1 3 0 1
36    B-35      GBM                       GBM    GBM                GBM       1 0 0 3           1 0 0 4
37    B-36      GBM                       AO     AO                 AO        1 3 0 0           1 4 0 0
38    B-37      GBM                       GBM    GBM                GBM       0 0 0 4           0 0 0 5
39    B-38      GBM                       GBM    GBM                GBM       0 0 0 4           0 0 0 5
40    B-39      Atypical meningioma       n.r.   OL/GBM*/AA*/AO*    n.r.      n.r.              n.r.
41    B-40      Anaplastic mixed AO       n.r.   n.r.               n.r.      n.r.              n.r.
42    B-41      GBM                       GBM    GBM                GBM       1 0 0 3           2 0 0 3
43    B-42      GBM                       GBM    GBM                GBM       0 0 0 4           1 0 0 4
44    B-43      GBM                       GBM    GBM/AA*            GBM       2 0 0 2           2 0 0 3
45    B-44      GBM                       GBM    GBM                GBM       0 0 0 4           0 0 0 5
46    B-45      GBM                       GBM    GBM                GBM       0 1 0 3           0 1 0 4
47    B-46      GBM                       GBM    GBM                GBM       1 0 0 3           1 0 0 4
48    B-47      GBM                       GBM    GBM/AA*            GBM       2 0 0 2           2 0 0 3
49    B-48      GBM                       GBM    GBM                GBM       1 0 0 3           1 0 0 4
[n.r.: cell value not recoverable from the extracted table layout.]

Note: We split the data into training and test sets. Set A contains 19 profiles and set B contains 30 patient profiles. Sets A and B were collected and processed during different time frames. As a training set, we use the profiles from A [1:19], which are balanced across subtypes. The votes for the samples in B are based only on the known labels from the A dataset. In the case of a tie (e.g., gene profile 40 with k=4), we decrease the number of neighbors by one until a unique maximum vote exists; these cases are marked (in sequence) by the '*' symbol. This corresponds to choosing the neighbor majority vote as the estimated label. The diagnostic decisions for k = 3 are similar to those for k = 1.
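The tie-breaking rule described in the note can be sketched as follows. This is only a minimal illustration, not the implementation used for Table 1: the Euclidean distance, the function name `knn_vote`, and the toy one-dimensional profiles are assumptions made here.

```python
from collections import Counter
import math

def knn_vote(train, labels, query, k):
    """Majority vote among the k nearest training profiles.

    On a tie for the top vote, k is decreased by one until a unique
    winner exists (k = 1 always yields one). Returns (label, votes, k_used).
    Euclidean distance is an assumption of this sketch.
    """
    # Indices of training profiles, nearest first.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    while k >= 1:
        votes = Counter(labels[i] for i in order[:k])
        ranked = votes.most_common()
        # Accept when a single class has strictly more votes than any other.
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0], dict(votes), k
        k -= 1

# Hypothetical toy data: one "gene" per profile.
train = [[0.0], [1.0], [2.0], [10.0], [11.0]]
labels = ["AA", "AA", "GBM", "GBM", "GBM"]
label, votes, k_used = knn_vote(train, labels, [1.6], k=4)
print(label, votes, k_used)
```

With k = 4 the four nearest neighbors split 2 to 2, so the sketch falls back to k = 3, which is the situation marked by '*' in the table.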


Table 2. Feature genes that yielded the smallest cross-validation error. On the right side, for genes sorted in decreasing order of Fisher discriminant values, we show the unquantized expression. The patients (columns from left to right) follow the labels "AA", "AO", "OL", "GM", and "mixed tumors".
Symbol Gene Names
Accession Name
AA, AO, OL, and GBM
MATN2
IGFBP1
DDOST
MEF2C

Matrilin 2
Insulin-like growth factor binding protein 1
dolichyl-diphosphooligosaccharide -protein glycosyltransferase
MADS box transcription enhancer factor 2, polypeptide C

AA071473
AA233079
NM_005216
AA234897
R31938
AA464346
PAFAH1B3 Platelet-activating factor acetylhydrolase, isoform Ib
H08188
CLCN6
Chloride channel 6
AA398458
MGC5178 Hypothetical protein MGC5178
HIVEP1
Human immunodeficiency virus type I enhancer binding protein 1 AA429769
AA461424
EFNB2
Ephrin-B2
N58107
VTN
Vitronectin (serum spreading factor, somatomedin B)
ATRX
Alpha thalassemia/mental retardation syndrome X-linked (RAD54) AA410435
AA486471
FMOD
Fibromodulin
NM_000873
ICAM2
intercellular adhesion molecule 2
NM_001964
EGR1
early growth response 1
AA482198
MPI
Mannose phosphate isomerase
AA489602
TRAP1
Heat shock protein 75
AA258735
Hs. moderate similarity to protein pir:A45973
HSUDGM
UNG2
uracil-DNA glycosylase 2
H11692
AP3B2
Adaptor-related protein complex 3, beta 2 subunit
W38923
ROR2
Receptor tyrosine kinase-like orphan receptor 2
AA497051
STHM
Sialyltransferase
NM_000436
OXCT
3-oxoacid CoA transferase
H63175
NM_003826
NAPG
N-ethylmaleimide-sensitive factor attachment protein, gamma
NM_003139
SRPR
signal recognition particle receptor ('docking protein')
AC005510
R32756
EWSR1
Ewing sarcoma breakpoint region 1
N74131
TFF3
Trefoil factor 3 (intestinal)
AC007199
T64094
HD
Huntingtin (Huntington disease)
H12006
AP4M1
Adaptor-related protein complex 4, mu 1 subunit
AA010781
DCLRE1A DNA cross-link repair 1A (PSO2 homolog, S. cerevisiae)
HSNFYB
NFYB
Nucl. trans. fact.Y, beta (alias HAP3, CBF-A, CBF-B, NF-YB)
NM_004999
MYO6
myosin VI, (alias DFNA22, DFNB37, KIAA0389)
AA453969
LDHC
Lactate dehydrogenase C
AA281932
PBEF
Pre-B-cell colony-enhancing factor
NM_015847
MBD1
methyl-CpG binding domain protein 1
AA625655
REG1A
Reg. islet-deriv. 1 alpha (pancreatic stone protein)
AA504327
PTP4A2
Protein tyrosine phosphatase type IVA, member 2
HSP162
EEA1
Early endosome antigen 1, 162kD
HSJ735G18
AL096703
T64192
TRB@
T cell receptor beta locus
AA487215
MYLK
Myosin, light polypeptide kinase
AA644191
ARL3
ADP-ribosylation factor-like 3
T60191
AJ271079
AA453202
NR1D1
Nucl. rec. subfam 1, gr D, me 1 (V-erbA related protein EAR-1)
W93163
TNFAIP6
Tumor necrosis factor, alpha-induced protein 6
NM_014688
RNTRE
related to the N terminus of tre


Figures

[Figure 1: three-dimensional scatter plot; axes: dimension 1, dimension 2, dimension 3.]

Figure 1. Multidimensional Scaling representation (MDS) of the 49 brain tumor gene-expression profiles
using the retained 1826 genes (genes retained after all preprocessing steps). Red circles, anaplastic astrocytoma
(AA); green diamond, anaplastic oligodendroglioma (AO); blue triangle, oligodendroglioma (OL); cyan star,
glioblastoma (GBM); and black cross, the others.

[Figure 2: three-dimensional scatter plot; axes: dimension 1, dimension 2, dimension 3.]

Figure 2. Multidimensional Scaling representation (MDS) of the same glioma profiles as in Figure 1, using the
most discriminative 50 genes based on five-neighbor analysis. Red circles, anaplastic astrocytoma (AA); green
diamond, anaplastic oligodendroglioma (AO); blue triangle, oligodendroglioma (OL); cyan star, glioblastoma
(GBM); and black cross, the others.


Publication no. 5
Giurcaneanu CD, Mircean C, Fuller GN and Tabus I.
Chapter 2: Finding functional structures in glioma gene-expressions using Gene Shaving clustering and MDL principle,
in Computational and Statistical Approaches to Genomics, second edition, Kluwer Academic Publisher (in press)

Chapter 7
FINDING FUNCTIONAL STRUCTURES IN
GLIOMA GENE-EXPRESSIONS USING GENE
SHAVING CLUSTERING AND MDL PRINCIPLE
Ciprian D. Giurcaneanu (1), Cristian Mircean (1,2), Gregory N. Fuller (2), and Ioan Tabus (1)
(1) Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
(2) Cancer Genomics Core Laboratory, Department of Pathology, The University of Texas M. D. Anderson Cancer Center

1. Introduction

The recent technological advances in genomics have made it possible to measure simultaneously, under similar experimental conditions, the expressions of thousands of genes, allowing unprecedentedly wide probing at the transcriptome level. The huge amount of data thus available calls for improved methods of data analysis, well grounded in classical statistical methods and providing fast and reliable processing. Finding clusters can be a preliminary step for data analysis, and is a valuable task in itself: the obtained clusters can convey useful information regarding the similar expression patterns of a number of genes, thus allowing a dimensionality reduction of the data set; furthermore, a cluster of similar genes suggests hypotheses regarding the genes that act in synergy and that are on the same pathway influencing the studied disease.
The traditional clustering methods are very appealing in gene data analysis, since they do not rely on any a priori knowledge or prior models for the data set. Clustering was traditionally a method of choice for biological data; many of the existing clustering algorithms were developed for use in taxonomy, for the classification of plants and animals. Therefore the first option was to employ classical clustering methods on the increasingly large amount of gene expression data, e.g. partitioning or hierarchical algorithms derived and well studied for some other type of biological data. The importance of studying gene expression data for

90

COMPUTATIONAL GENOMICS

understanding, diagnosing, and possibly altering the course of a disease has made it necessary to further develop clustering methods, to provide multiple options for finding valuable structures in data.
The general aim of clustering is to group together objects that are similar according to a certain criterion. This implies that each algorithm will identify a specific structure in a data set, based on the chosen similarity criterion. Different algorithms will be sensitive to different types of structure, and will signal different facets of the dependencies between the genes in the studied process. Understanding the inner workings of various clustering algorithms increases the ability to choose the proper algorithm for a given application. There are a number of surveys discussing classical clustering algorithms in the context of gene expression data. We concentrate on presenting a recent method, in two of its variants, since this method was specifically designed for clustering gene expression data. The name of the original algorithm is Gene Shaving (GS) (Hastie et al., 2000a), and we present it in a separate section. GS makes clever use of the geometrical structure of large matrices, and also requires a series of statistical decisions. These were originally made using statistics obtained by extensively sampling from random distributions, but later the decisions could be based on the simpler solutions offered by the minimum description length (MDL) principle (Giurcaneanu et al., 2004a).
We also treat the problem of estimating the number of gene clusters in genomic data sets by analyzing two solutions that lead to different versions of the GS algorithm: GS-Gap and GS-MDL (Hastie et al., 2000a; Giurcaneanu et al., 2004a). The motivations of the two estimation approaches are different, and we mention that the latter is derived using the minimum description length principle (Rissanen, 1978). Since most clustering algorithms require the user to set the number of clusters as input, the accurate estimation of the number of clusters is extremely important. Many of the traditional estimators are analyzed and compared in Milligan et al. (1985) in a generic context, while studies closer to genomic problems can be found in Giurcaneanu et al. (2004a), Giurcaneanu et al. (2004b), and Tabus et al. (2003), including the principled MDL method for deciding the number of clusters in gene expression data.
We apply the two versions of the GS algorithm to the glioma data set studied in (Fuller et al., submitted). Fuller et al. propose to alleviate the subjectivity in glioma classification using gene expression profiling. Their voting algorithm is able to provide more subtle information regarding the molecular similarities to the neighboring classes, specifically for mixed gliomas. A number of computational algorithms have been used to perform molecular classification based on groups of genes (Fuller et al., submitted; Nutt et al., 2003).

Finding Functional Structures in Glioma Gene-Expressions

91

Gliomas are usually classified into two lineages (astrocytic and oligodendrocytic) and into low-grade (astrocytoma and oligodendroglioma), mid-grade (anaplastic astrocytoma and anaplastic oligodendroglioma), and high-grade (glioblastoma multiforme) tumors. The class label is based on morphological features assessed by pathologists, such as cellularity, mitotic index, and the presence of necrosis. This classification scheme has limitations because a specific glioma can be at any stage of the continuum of cancer progression and may contain mixed features.
In this work, we focus on the gene expression that makes the difference between morphological classifications of glioma subtypes. We seek to capture the feature genes that mark the transition from lower grades to higher grades (see Caskey et al., 2000 for a review), the increased aggressiveness of the glioblastoma multiforme (GM) subtype of glioma, and the lineage differentiation of gliomas.
The initial data contain 2303 different genes (see Taylor et al., 2001) that are observed over p = 49 different samples (experiments). The number of feature genes is reduced in several steps to N = 121, for which an optimal classifier was obtained for the four glioma subtypes using quantized values (Fuller et al., submitted). In the next section we clarify the procedures used. Grouping the genes of the obtained classifier into clusters with similar expression patterns over patients is a way to gain insight into the molecular basis of gliomas and to reveal intrinsic interactions at the molecular level between components, as well as common function. If a cluster is enriched with a certain function, we expect with higher probability to find genes that act together in a possibly chained pathway. Further efforts can concentrate on these genes, in finding the molecular pathways and validating them with biological procedures.

2. Description of Processing Glioma Data Set

Patient samples and microarray experiments. The cDNA microarray used for this experiment included 2,300 genes printed in duplicate. The sequences were verified before printing in the Cancer Genomics Core Laboratory of M. D. Anderson Cancer Center; details are described in (Taylor et al., 2001). The glioma tissues were obtained from the Brain Tumor Tissue Bank of M. D. Anderson Cancer Center with the approval of the Institutional Review Board. The original pathological assessment of the tissues was reviewed by a pathologist (GNF). Procedural details related to RNA isolation, microarray experiments, and imaging analysis are described in (Fuller et al., 1999; Shmulevich et al., 2002; Kobayashi et al., 2003). The 49 cases include 27 glioblastomas (GM), 7 anaplastic astrocytomas (AA), 6 anaplastic oligodendrogliomas (AO), 3 oligodendrogliomas (OL), 5 mixed gliomas, and one case of atypical meningioma.
Gene selection in a Leave-One-Out Cross-Validation (LOO-CV) procedure. The gene selection process is in itself very important, revealing information about the important genes for glioma diagnosis and, finally, determining the performance of the classifier. As a first step, we concentrate on selecting genes bearing a higher discrimination power for the four glioma types. These genes are determined such that the performance of a given classifier is optimal in a cross-validation experiment, so that the genes perform well on sets different from the ones used for training. We ranked the genes according to the Fisher discriminant, and then chose sets of the top $r$ genes for use in a $k$-nearest neighbor ($k$NN) classifier. The parameters, $r$ and the number of neighbors $k$, are selected in a leave-one-out cross-validation (LOO-CV) procedure that leaves out, in turn, one patient from the training set. The resulting errors will have a granularity of 1/43. To overcome this granularity, we could use a scheme that randomizes the training set and test set, but the small number of patients does not permit the estimation of the classifier parameters in subsequent splits inside the cross-validation loop with large test sets.
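As an illustration of the ranking step, the sketch below scores each gene with one common multi-class form of the Fisher discriminant (between-class scatter of the class means divided by the pooled within-class scatter) and keeps the top r genes. The chapter does not spell out the exact formula, so this form, the function names, and the toy matrix are assumptions made here.

```python
def fisher_score(values, labels):
    """Between-class over within-class scatter for one gene (assumed form)."""
    overall = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for c in sorted(set(labels)):
        xs = [v for v, lab in zip(values, labels) if lab == c]
        mu = sum(xs) / len(xs)
        between += len(xs) * (mu - overall) ** 2
        within += sum((x - mu) ** 2 for x in xs)
    return between / within if within > 0 else float("inf")

def rank_genes(matrix, labels, r):
    """Indices of the r genes (rows) with the largest Fisher score."""
    scores = [fisher_score(row, labels) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    return order[:r]

# Hypothetical toy data: 3 genes (rows) measured on 4 patients (columns).
matrix = [
    [1.0, 1.1, 5.0, 5.1],  # separates the two classes well
    [1.0, 5.0, 1.0, 5.0],  # carries no class information
    [0.0, 0.2, 3.0, 3.1],  # separates the classes, slightly noisier
]
labels = ["AA", "AA", "GBM", "GBM"]
print(rank_genes(matrix, labels, r=2))
```

In the procedure described above, this ranking would be recomputed on each training set only, so that the left-out patient never influences which genes are selected.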
We group the expressions of genes in a matrix, such that $M_{unq}(i, j)$ is the expression of gene $i$ on sample $j$. In the preprocessing step, we removed the genes with inaccurately replicated spots (see Mircean et al., 2004 for a description of the method) and we left out the mixed cases. The algorithm uses only 43 patients out of the total of p = 49 samples.
After flagging the spots considered to be incorrectly measured, we estimate the expression $M_{unq}(i, j)$ of a gene as the mean of its replicates; we quantize to four levels using the Lloyd algorithm, separately for each patient, and remove genes with low variance over patients. We denote by $M_q$ the resulting set of genes, reduced from 2303 to 1826 genes.
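The per-patient quantization can be pictured with a small one-dimensional Lloyd iteration: assign each value to the nearest codebook level, then move each level to the mean of its cell, and repeat until nothing changes. The sketch below is an assumption-laden illustration (evenly spaced initial codebook, hypothetical function name and data), not the preprocessing code used for the glioma set.

```python
def lloyd_quantize(values, levels=4, iterations=100):
    """Quantize scalar values to `levels` levels with Lloyd's iteration.

    Returns (level index per value, final codebook). The evenly spaced
    initial codebook is an assumption of this sketch.
    """
    lo, hi = min(values), max(values)
    codebook = [lo + (hi - lo) * (2 * m + 1) / (2 * levels) for m in range(levels)]

    def nearest(v):
        return min(range(levels), key=lambda m: abs(v - codebook[m]))

    for _ in range(iterations):
        cells = [[] for _ in range(levels)]
        for v in values:
            cells[nearest(v)].append(v)
        # Centroid update; an empty cell keeps its previous level.
        updated = [sum(c) / len(c) if c else codebook[m] for m, c in enumerate(cells)]
        if updated == codebook:
            break
        codebook = updated
    return [nearest(v) for v in values], codebook

# Hypothetical expression values of one patient.
values = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
indices, codebook = lloyd_quantize(values)
print(indices, codebook)
```

Running the quantizer separately on each column of the expression matrix corresponds to the "separately for each patient" choice in the text.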
In cross-validation, the following procedure is applied for each $j \in \{1, 2, \ldots, p\}$: the $j$-th column of the matrix $M_q$ is chosen to be the test set, and the rest of the columns of $M_q$ form the training set, denoted $M_{q,j}^{training}$. For simplicity we drop the index $j$ from $M_{q,j}^{training}$ and the notation becomes $M_q^{training}$. Using only the training set, the genes are ranked by the Fisher discriminant, and we select the number of retained features in steps $r \in [10, 20, \ldots, 70]$ and the number of neighbors of the $k$NN classifier, $k \in [1, 2, \ldots, 7]$, splitting $M_q^{training}$ again with a LOO-CV procedure. For each selection of size $r \times 41$ in $M_q^{training}$ we observe the misclassification error. We use the $r$ and $k$ that give the lowest observed error over all patients in $M_q^{training}$. For each $j$-th sample, the vote vector (see Fuller et al., submitted) is saved (the figures show these votes in the second column). We further use this set of N = 121 genes because we consider these genes very likely to play a special role in the discrimination of the four glioma subtypes, since they minimize the discrimination errors.
We continue the gene shaving experiments by using the N = 121 genes selected by the LOO-CV procedure from the set $M_{unq}$, observed in p = 49 different patients (samples). We denote by $X$ the retained matrix of size $N \times p$.
Our goal is to identify groups of genes that operate similarly in the considered experiments.

3. A Brief Review of Gene Shaving (GS)

We briefly revisit the strategy of the GS algorithm. Each cluster $S_n$ found by GS is a set of $n$ genes (out of $N$), where $n$ and the set of genes are selected trying to satisfy the following requirements: (a) all $n$ genes have a similar profile over the $p$ experiments; (b) the variance of the cluster center (mean) for $S_n$ is the largest among all possible groups of $n$ genes; (c) the value of $n$ is such that the cluster obtained is the largest cluster containing only similar genes. These requirements correspond to the intuition of the biologist who is looking for those genes that operate similarly, and whose measurements vary significantly from one patient to another. There is no interest in genes with a flat profile over various conditions, since their measurements are almost the same for all types of analyzed tumors, and consequently they cannot be used in diagnosis.
To keep the computational complexity at a reasonable level, GS uses the largest principal component of the rows of $X$ to select the genes according to the above requirements. After all rows of the matrix $X$ are centered to have zero mean, a sequence of nested clusters $S_N \supset S_{n_1} \supset S_{n_2} \supset \cdots \supset S_1$ is produced by an iterative procedure. The subscript of the set $S_i$ denotes the number of genes contained in it; thus the largest cluster in the chain contains all $N$ genes, and the smallest cluster contains just one gene. The nested clusters are generated as follows: first the principal component of the rows of $X$ is computed, then a proportion $\alpha = 10\%$ of the rows of $X$ having the smallest inner product with this principal component are eliminated, and the remaining genes are grouped in the cluster $S_{n_1}$. The procedure continues iteratively, by shaving off a proportion $\alpha$ of the genes kept at the previous iteration, until the cluster $S_1$ is found. Each set $S_i$ has an associated $i \times p$ matrix containing the expressions of the genes kept in the cluster, and the principal component of that matrix gives the preferred direction of the genes to be kept after each shaving-off stage. This process ensures that the candidate clusters, in their sequence of decreasing size, become more and more pure, with the genes better and better aligned to the principal component of the matrices associated with the candidate clusters.
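The shaving loop described above can be sketched in a few lines. This is a hedged illustration rather than the published implementation: the leading principal direction is approximated by power iteration, genes are ranked by the absolute value of their inner product with it (an assumption, made because the sign of a principal component is arbitrary), and at least one gene is removed per pass. All function names and the toy matrix are hypothetical.

```python
import math

def leading_direction(rows, iterations=200):
    """Approximate the first principal direction (length p) of centered rows."""
    # Start from the row with the largest norm so the start lies in the row space.
    v = max(rows, key=lambda r: sum(x * x for x in r))
    for _ in range(iterations):
        proj = [sum(a * b for a, b in zip(r, v)) for r in rows]          # X v
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows)))          # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def shave(X, alpha=0.1):
    """Return the nested chain of gene-index sets, from all genes down to one."""
    rows = [[x - sum(r) / len(r) for x in r] for r in X]  # center each gene
    kept = list(range(len(rows)))
    chain = [kept[:]]
    while len(kept) > 1:
        v = leading_direction([rows[i] for i in kept])
        score = {i: abs(sum(a * b for a, b in zip(rows[i], v))) for i in kept}
        n_drop = max(1, int(alpha * len(kept)))
        kept = sorted(kept, key=lambda i: score[i], reverse=True)[: len(kept) - n_drop]
        kept.sort()
        chain.append(kept[:])
    return chain

# Hypothetical data: three strongly varying genes and two nearly flat ones.
X = [
    [5.0, -5.0, 5.0, -5.0],
    [4.0, -4.0, 4.0, -4.0],
    [-3.0, 3.0, -3.0, 3.0],
    [0.1, 0.0, -0.1, 0.0],
    [0.0, 0.2, 0.0, -0.2],
]
for cluster in shave(X):
    print(len(cluster), cluster)
```

On this toy matrix the two nearly flat genes are shaved off first, which mirrors the remark that genes with a flat profile are of no interest for diagnosis.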
Once proper candidates are generated by the consecutive shaving process, one needs to choose the optimal cluster from the nested sets $S_N, S_{n_1}, S_{n_2}, \ldots, S_1$. There are two ways to choose a suitable set: by using the Gap statistic (as proposed in the original GS) or by using an MDL selection. Both are reviewed in this survey.
After the selection of the most appropriate set as a cluster, a new cluster can be obtained by excluding the direction of the principal component of the previously extracted cluster from the following stages. That can be achieved either by orthogonalizing all the rows of $X$ with respect to the direction of the principal component previously extracted (Hastie et al., 2000a), or by simply removing from the matrix $X$ the genes found in all previous clusters (Giurcaneanu et al., 2004a). Depending on the exclusion procedure used, the iterative process of extracting clusters continues for $p$ successive clusters when using orthogonalization, or until only $p$ genes are left in the matrix $X$ when the genes of the clusters found are iteratively removed.
The Gap statistic selection of the optimal cluster size is a simple process of assessing the structure (regularity) of a cluster by checking it against the statistics of randomly permuted data. The nested clusters have been selected such that the variance of the cluster mean is high, the cluster center being almost aligned to the first principal component, which by its definition has the highest possible variance. Now Gap will choose the cluster $S_n$ containing genes that are very similar to each other and simultaneously have a high degree of dissimilarity with the genes not included in $S_n$. Based on the analogy with the analysis of variance (Mardia et al., 1979), the criterion used in cluster selection relies on computing for $S_n$ the within variance

$$V_n^W = \frac{1}{p} \sum_{j=1}^{p} \left[ \frac{1}{n} \sum_{i \in S_n} \left( x_{ij} - \mathrm{ave}_j(S_n) \right)^2 \right],$$

and the between variance

$$V_n^B = \frac{1}{p} \sum_{j=1}^{p} \left( \mathrm{ave}_j(S_n) - \mathrm{ave}(S_n) \right)^2,$$

where $\mathrm{ave}_j(S_n) = \frac{1}{n} \sum_{i \in S_n} x_{ij}$ is calculated as the average of the measurements in cell line $j$ that correspond to genes in $S_n$, and $\mathrm{ave}(S_n) = \frac{1}{np} \sum_{i \in S_n} \sum_{j=1}^{p} x_{ij}$. The cluster $S_n$ is recommended to be selected when the criterion

$$D_n = 100 \cdot \frac{V_n^B / V_n^W}{1 + V_n^B / V_n^W}$$

takes large values. To check that $D_n$ is larger than one could expect by chance, the value of the Gap statistic is evaluated. The procedure involves generating a number of matrices, each of them obtained by applying a different permutation to the columns of $X$. For example, assume that 20 such matrices are generated, and let us denote them $X_1, \ldots, X_{20}$. Nested clusters are identified for every matrix $X_i$, $1 \le i \le 20$, by using the same algorithm employed for the original matrix $X$, and let $D_n(X_i)$ be the criterion computed for the cluster of size $n$ of the matrix $X_i$. The Gap statistic is given by

$$\mathrm{Gap}(n) = D_n - \mathrm{ave}(D_n),$$

where $\mathrm{ave}(D_n) = \frac{1}{20} \sum_{i=1}^{20} D_n(X_i)$, and the optimal cluster size is

$$n^* = \arg\max_n \mathrm{Gap}(n).$$
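The criterion and the Gap selection can be made concrete with a short sketch. It computes the within and between variances exactly as defined above, uses the algebraically equivalent form $D_n = 100\, V_n^B / (V_n^B + V_n^W)$ to avoid dividing by a zero within variance, and then picks the cluster size with the largest gap. The function names, the toy matrix, and the made-up $D_n$ values for the permuted matrices are assumptions for illustration; re-running the shaving on each permuted matrix is left out.

```python
def d_criterion(X, cluster):
    """D_n for the genes (row indices) in `cluster` of the matrix X."""
    p = len(X[0])
    n = len(cluster)
    col_ave = [sum(X[i][j] for i in cluster) / n for j in range(p)]   # ave_j(S_n)
    total_ave = sum(col_ave) / p                                      # ave(S_n)
    v_within = sum(
        sum((X[i][j] - col_ave[j]) ** 2 for i in cluster) / n for j in range(p)
    ) / p
    v_between = sum((a - total_ave) ** 2 for a in col_ave) / p
    if v_between + v_within == 0.0:
        return 0.0
    # Same value as 100 * (VB/VW) / (1 + VB/VW), but safe when VW == 0.
    return 100.0 * v_between / (v_between + v_within)

def best_size_by_gap(sizes, d_original, d_permuted):
    """Pick the cluster size maximizing Gap(n) = D_n - ave(D_n).

    d_original[idx] is D_n for sizes[idx] on X; d_permuted is a list of
    such lists, one per permuted matrix.
    """
    gaps = []
    for idx in range(len(sizes)):
        ave = sum(run[idx] for run in d_permuted) / len(d_permuted)
        gaps.append(d_original[idx] - ave)
    best = max(range(len(sizes)), key=lambda idx: gaps[idx])
    return sizes[best], gaps

# Hypothetical two-gene, two-sample matrix.
print(d_criterion([[0.0, 2.0], [0.0, 4.0]], [0, 1]))
# Hypothetical D_n values for three candidate sizes and two permuted matrices.
print(best_size_by_gap([10, 5, 2], [50.0, 80.0, 60.0],
                       [[40.0, 30.0, 55.0], [44.0, 34.0, 57.0]]))
```

The cost of the approach is visible in the second function: every entry of `d_permuted` requires a full shaving run on a permuted matrix.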

Once the optimal cluster is found, each row of $X$ is orthogonalized with respect to the vector representing the mean of the selected cluster, and the procedure continues until an a priori specified number of clusters is found.
GS proved to be a successful algorithm for clustering gene expression arrays (Hastie et al., 2000b), but we have to note the computational burden due to the necessity of clustering not only the matrix $X$, but also the other twenty matrices $\{X_i, 1 \le i \le 20\}$. Moreover, the Gap statistic is mostly heuristically motivated. In Giurcaneanu et al. (2004a), a principled method was introduced for the selection of the optimal cluster. The principle on which the method relies is the celebrated MDL (Minimum Description Length; Rissanen, 1978), and it leads to a fast algorithm for optimal cluster selection. To distinguish between the two forms of the GS algorithm, we call GS-Gap the algorithm proposed by Hastie (Hastie et al., 2000a) and GS-MDL the modified version introduced by Giurcaneanu (Giurcaneanu et al., 2004a). In the next section we briefly present the GS-MDL algorithm in connection with probability models for gene clustering.

4. The GS-MDL Clustering Algorithm

In this section we show how the MDL principle can lead to a method for choosing the optimal cluster from the set $S_N, S_{n_1}, S_{n_2}, \ldots, S_1$, without resorting to the evaluation of the Gap statistic. The idea is very simple: we just need a wand that points at $S_N$ and shows us how many clusters it contains. If $S_N$ contains one single cluster, then we decide that $S_N$ is the optimal choice; otherwise, we use the wand again until finding an index $n^*$ such that $S_{n^*+1}$ contains more clusters and $S_{n^*}$ contains exactly one cluster. We conclude that $S_{n^*}$ is the optimal choice.
We explain in the sequel how MDL can provide the desired wand. The roots of MDL are in information theory, and it allows selecting the best parametric model for a data set using as criterion the minimum description length or, equivalently, the minimum code length. We emphasize here that all the considered models belong to a finite family that will be defined later. The MDL principle does not search for the true model of the observed data, but selects the best model from a family that is defined a priori. The selection relies on a scenario of transmitting the whole data set from a hypothesized encoder to a decoder. The encoder is constrained to use only models from the given family. In the most celebrated form of MDL, the code length is evaluated as the sum of two parts:


Choose a model from the family and tune its parameters so as to obtain the best fit to the given data set. Practically, this step corresponds to finding the maximum likelihood estimators for the model parameters. The estimated values are then sent to the decoder by using a code length of $\frac{1}{2} \log N$ for each parameter, where $N$ is the number of samples in the data set. The symbol $\log(\cdot)$ stands for the natural logarithm, thus the code length is expressed in nats. Equally well we can consider the logarithm base two in order to express the code length in bits. Therefore the first term of the code length is $\frac{1}{2} q \log N$, where $q$ denotes the number of parameters of the model.

Once both the encoder and the decoder know the values of the parameters, it remains only to encode the samples in the data set according to the chosen model. This leads to a code length that equals minus the logarithm of the maximum likelihood.

Note that the code length is not constrained to be an integer. This is not a difficulty, since we are not interested in realizable code lengths, but rather in using code lengths to compare various models. To compute the two-part code length criterion described above, we have to resort to probabilistic models, which appear frequently in clustering.
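A toy model-selection problem may help to fix the two-part criterion. The sketch below compares two Gaussian models for scalar data, one with the mean fixed at zero (q = 1 parameter, the variance) and one with a free mean (q = 2), and reports for each the sum of the parameter cost $\frac{1}{2} q \log N$ and minus the maximized log-likelihood, in nats. The model family, the function names, and the data are assumptions chosen for illustration; they are not the family used by GS-MDL.

```python
import math

def gaussian_neg_log_likelihood(data, mean, var):
    """Minus the log-likelihood of i.i.d. Gaussian samples, in nats."""
    n = len(data)
    return 0.5 * n * math.log(2 * math.pi * var) + sum((x - mean) ** 2 for x in data) / (2 * var)

def two_part_code_length(data, free_mean):
    """(1/2) q log N plus minus the maximized log-likelihood."""
    n = len(data)
    mean = sum(data) / n if free_mean else 0.0
    var = sum((x - mean) ** 2 for x in data) / n      # ML estimate of the variance
    q = 2 if free_mean else 1
    return 0.5 * q * math.log(n) + gaussian_neg_log_likelihood(data, mean, var)

# Hypothetical data sets: one far from zero, one centered at zero.
shifted = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1]
centered = [-0.1, 0.1, 0.0, -0.2, 0.2, 0.0, -0.1, 0.1]
for name, data in [("shifted", shifted), ("centered", centered)]:
    print(name, two_part_code_length(data, False), two_part_code_length(data, True))
```

Because the data are continuous, the second term is computed from a density and can be negative; as noted above, only the comparison between models matters. For the shifted data the extra parameter pays for itself, while for the centered data the simpler model gives the shorter code.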

4.1 Background on Mixture Models for Gene Expression Data and Traditional Estimation Methods

The effort of developing mathematical models for clustering gene expression data obtained with various technologies has not been very extensive so far. Most of the recent papers (Dougherty et al., 2002; Giurcaneanu et al., 2004a; Hastie et al., 2000a; Yeung et al., 2001) that treat the issue of model-based clustering for gene expression arrays have shown that the finite mixture model can be successfully applied.
In the finite mixture model, a multivariate distribution is associated with each gene cluster. To fix the ideas, let us assume that the number of gene clusters does not exceed $p$, the total number of cell lines. Following the approach of Hastie et al. (2000a), we make the hypothesis that the number of signal clusters is $K$, and that there also exists a noise cluster. The last cluster groups all genes with a flat profile over all conditions, and consequently its mean is hypothesized to be a row vector of size $1 \times p$ with all entries equal to zero. The mean vectors of the signal clusters are denoted by $b_1^T, b_2^T, \ldots, b_K^T$. Additionally, we assume that the matrix whose rows are $b_1^T, b_2^T, \ldots, b_K^T$ is full-rank. Moreover, the covariance matrix is assumed to be the same for all clusters, and to equal $\sigma^2 I$, where $\sigma^2$ is a parameter of the model and $I$ denotes the $p \times p$ identity matrix.
Under the mixture sampling, the rows of the matrix $X$, $x_1^T, x_2^T, \ldots, x_N^T$, are taken at random from the mixture density such that the number of observations from each cluster has a multinomial distribution with sample size $N$ and probability parameters $p_1, p_2, \ldots, p_{K+1}$.
The results reported (Hastie et al., 2000a; Hastie et al., 2000b; Yeung et al., 2001) encourage us to use such simple parametric models. Finite mixture models have been extensively studied in statistics (Redner et al., 1984), and they are suitable for the application of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Various instances of the use of the EM algorithm for clustering based on finite mixture models, generically named Classification-Expectation-Maximization (CEM), are investigated in Celeux et al. (1995).
The EM algorithm is designed to be applied to incomplete data, which recommends it for use in clustering. The aim of gene clustering when the data set is recorded in the matrix $X$ is to assign a cluster label to each row $x_i^T$ of $X$. Following the approach of Celeux et al. (1995), we assign to $x_i^T$ a row vector $v_i^T$ whose length equals the number of clusters, $K + 1$. If $x_i^T$ is assigned to cluster $k$, then the $k$-th entry of $v_i^T$ takes the value one, and all other entries are zeros. One can easily observe that the complete data are given by the pairs $(x_i^T, v_i^T)$.
To illustrate how the EM algorithm can be employed in gene clustering, we consider the well-known case when the distribution for each cluster is Gaussian with covariance matrix $\sigma^2 I$, where $I$ is the $p \times p$ identity matrix and the parameter $\sigma^2$ is unknown. Therefore the set of parameters is

$$\theta = \left\{ V, p_1, p_2, \ldots, p_{K+1}, b_1^T, b_2^T, \ldots, b_{K+1}^T \right\}$$

where $b_k^T$ are the mean vectors of the clusters, and $p_k$ are the mixing proportions, $0 < p_k < 1$, $\sum_{k=1}^{K+1} p_k = 1$. The CEM algorithm is an iterative procedure to search for the values of the parameters $\theta$ that maximize the log-likelihood function:

$$L\left(x_1^T, x_2^T, \ldots, x_N^T; \theta\right) = \sum_{k=1}^{K+1} \sum_{i=1}^{N} v_{ik} \log\left[ p_k\, g_k\left(x_i^T \mid b_k^T, \sigma^2 I\right) \right]$$

where the notation $g_k(\cdot \mid \cdot)$ is used for the Gaussian distribution. We briefly revisit in the sequel each step of the algorithm (Celeux et al., 1995):


M step. Assume that every gene is assigned to a cluster, or equivalently that the entries of the matrix $V$ are fixed. Denote by $N_k$ the total number of genes that have been assigned to cluster $k$. Elementary calculations prove that the log-likelihood function is maximized by selecting the values of the parameters such that

$$\hat{p}_k = \frac{N_k}{N}, \qquad \hat{b}_k^T = \frac{\sum_{i=1}^{N} v_{ik} x_i^T}{N_k}, \qquad \hat{\sigma}^2 = \frac{\mathrm{tr}(W)}{Np},$$

where $\mathrm{tr}(W)$ is the trace of the within-cluster scattering matrix $W = \sum_{k=1}^{K+1} \sum_{i=1}^{N} v_{ik} (x_i - \hat{b}_k)(x_i - \hat{b}_k)^T$.

E step. For every row $x_i^T$, the probability that $x_i^T$ belongs to cluster $k$ is estimated as

$$t_k\left(x_i^T\right) = \frac{\hat{p}_k\, g_k\left(x_i^T \mid \hat{b}_k^T, \hat{\sigma}^2 I\right)}{\sum_{l=1}^{K+1} \hat{p}_l\, g_l\left(x_i^T \mid \hat{b}_l^T, \hat{\sigma}^2 I\right)}.$$

C step. For every index $i \in \{1, 2, \ldots, N\}$, the entries of the row $v_i^T$ are modified so that

$$v_{ik} = \begin{cases} 1, & k = k^* \\ 0, & \text{otherwise} \end{cases}$$

where $k^* = \arg\max_k t_k\left(x_i^T\right)$.
The algorithm is initialized with an arbitrary matrix $V$, and the iterations stop when a convergence criterion is fulfilled. In practical applications the CEM is started from many different random points. Each initialization can lead to a different convergence point, and the one that maximizes the log-likelihood function is chosen as the final result. Observe that we have described the algorithm for the spherical model, which is particularly interesting for our application, but more complex models are studied in Celeux et al. (1995). We have assumed that the true number of clusters is given as input to the algorithm. When the number of clusters is not known a priori, it is estimated by using the Bayesian Information Criterion (BIC; Schwarz, 1978). A discussion on the use of CEM in conjunction with BIC can be found, for example, in Fraley et al. (1998), and we note here that BIC is not the only criterion applied for estimating the number of clusters. In the general case, BIC and the two-part code MDL have very similar expressions even if their theoretical motivations are rather different. For the reader interested in information theoretic criteria, we refer to two tutorials, Barron et al. (1998) and Stoica et al. (2004). We emphasize here that the use of the MDL principle in GS-MDL represents a totally different approach from the tandem CEM-BIC.
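The M, E, and C steps for the spherical model can be sketched compactly. In this hedged illustration the E and C steps are merged: because all clusters share the same $\sigma^2$, the Gaussian normalizing constant and the denominator of the posterior are common to all clusters, so the most probable cluster is the one maximizing $\log \hat{p}_k - \|x_i - \hat{b}_k\|^2 / (2\hat{\sigma}^2)$. Labels are stored as integers instead of indicator rows, all cluster means are estimated freely, and a single deterministic start is used; these choices, the function name, and the toy data are assumptions.

```python
import math

def cem(X, initial_labels, iterations=50):
    """Classification EM for a spherical Gaussian mixture.

    X: list of N rows of length p; initial_labels: one cluster index per row.
    Returns (labels, means, sigma2), with means and sigma2 from the last M step.
    """
    N, p = len(X), len(X[0])
    C = max(initial_labels) + 1
    labels = list(initial_labels)
    for _ in range(iterations):
        # M step: proportions, means, and the common variance tr(W) / (N p).
        counts = [labels.count(k) for k in range(C)]
        means = []
        for k in range(C):
            members = [X[i] for i in range(N) if labels[i] == k]
            means.append([sum(col) / len(members) for col in zip(*members)]
                         if members else [0.0] * p)
        trace_w = sum((x - m) ** 2
                      for i in range(N) for x, m in zip(X[i], means[labels[i]]))
        sigma2 = max(trace_w / (N * p), 1e-12)
        # E and C steps: assign each row to its most probable cluster.
        new_labels = []
        for row in X:
            scores = []
            for k in range(C):
                if counts[k] == 0:
                    scores.append(-math.inf)   # empty clusters stay empty
                    continue
                d2 = sum((x - m) ** 2 for x, m in zip(row, means[k]))
                scores.append(math.log(counts[k] / N) - d2 / (2 * sigma2))
            new_labels.append(scores.index(max(scores)))
        if new_labels == labels:               # convergence: labels unchanged
            break
        labels = new_labels
    return labels, means, sigma2

# Hypothetical data: two well-separated groups, deliberately poor start.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, means, sigma2 = cem(X, [0, 1, 0, 1, 0, 1])
print(labels, means, sigma2)
```

In practice, as the text notes, one would restart from many random label matrices and keep the run with the largest log-likelihood.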
It is well known that the convergence of the EM algorithm can be very slow. The EM algorithm also has the drawback that it can converge to local maxima. The interested reader can find in Giurcaneanu et al. (2004a) a comparison of the performance of GS-MDL and the CEM algorithm on artificial data that mimic gene expression arrays: GS-MDL is significantly faster than CEM at the same level of accuracy in assigning the genes to clusters.

4.2 MDL Estimation of the Number of Clusters

Let us investigate how to apply the MDL principle for estimating the number of gene clusters in the hypothesis of the particular nite mixture discussed above, without the need of running the slowly convergent EM algorithm. For every possible number of clusters k, we have to compute the best
code length which implicitly means that we have to label each gene with
a cluster name. Considering all possible partitions of N genes in kclusters,
and then evaluating the code length for every case is not a practical way
to solve the problem. We emphasize here that we just need a fast method
to estimate the number of clusters, and we do not need to decide which
gene belongs to which cluster, since this is successfully solved by the GS
algorithm.
The key observation that leads to the design of the GS-MDL method
relates to the particular structure of the mixture covariance matrix $\Sigma$. It
was shown in Giurcaneanu et al. (2004a) that the eigenvalues of $\Sigma$ can be
sorted such that
$$\lambda_1(\Sigma) > \lambda_2(\Sigma) > \cdots > \lambda_K(\Sigma) > \lambda_{K+1}(\Sigma) = \lambda_{K+2}(\Sigma) = \cdots = \lambda_p(\Sigma) = \sigma^2 .$$


Therefore it is enough to count how many eigenvalues of $\Sigma$ are larger
than $\sigma^2$, and we obtain at once the number of clusters. Unfortunately, in
practice we do not know the true mixture covariance matrix $\Sigma$, but only
the covariance matrix that is estimated from data and denoted $S$. The eigenvalues of $S$ are all different with probability one, therefore we can write
without any loss of generality:
$$\lambda_1(S) > \lambda_2(S) > \cdots > \lambda_p(S) .$$
The problem is similar to the estimation of the number of
sources, which is well known in the signal processing community, and
for which an elegant solution based on MDL was proposed in Wax et al.
(1985). The GS-MDL method is inspired by this solution.
Since the application of the MDL principle strongly requires a coding
scenario, let us assume that each row $x_i^T$ of the matrix $X$ is encoded independently. Moreover, to perform the encoding, we assume that each row
$x_i^T$ is Gaussian distributed with mean $b^T$ and covariance matrix $\Sigma$, where
$b^T = \sum_{k=1}^{K} p_k b_k^T$. This coding procedure is not optimal in terms of compression results, but allows us to estimate in a fast way the number of gene
clusters. The likelihood function depends only on $b^T$ and $\Sigma$, and furthermore $\Sigma$ can be expressed using only its eigenvalues and eigenvectors. The
ML estimator for the mean is $\hat{b}^T = \frac{1}{N} \sum_{i=1}^{N} x_i^T$. Note that the ML estimators for the eigenvalues $\lambda_i(\Sigma)$ and the eigenvectors $u_i$ are given by a
famous result from Anderson (1963):
$$\hat{\lambda}_i(\Sigma) = \frac{N-1}{N} \lambda_i(S), \quad i \in \{1, 2, \ldots, K\}$$
$$\hat{\sigma}^2 = \frac{N-1}{N(p-K)} \sum_{i=K+1}^{p} \lambda_i(S)$$
$$\hat{u}_i = c_i, \quad i \in \{1, 2, \ldots, K\}$$
where $\lambda_1(S), \lambda_2(S), \ldots, \lambda_p(S)$ and $c_1, c_2, \ldots, c_K$ are the eigenvalues and, respectively, the eigenvectors of the sample covariance matrix $S$.
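The estimators above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function name `anderson_ml_estimates` is hypothetical, and it assumes that S is the sample covariance with the 1/(N-1) normalisation, which the factor (N-1)/N in the formulas suggests but the text does not state.

```python
import numpy as np

def anderson_ml_estimates(X, K):
    """Sketch of the ML estimates for a covariance with p-K equal eigenvalues.

    X is an N x p data matrix (rows are observations); K is the assumed
    number of leading, distinct eigenvalues. Assumes S uses 1/(N-1).
    """
    N, p = X.shape
    S = np.cov(X, rowvar=False)              # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort in descending order
    lam_S = eigvals[order]
    C = eigvecs[:, order]
    lam_hat = (N - 1) / N * lam_S[:K]                       # leading eigenvalues
    sigma2_hat = (N - 1) / (N * (p - K)) * lam_S[K:].sum()  # common noise variance
    U_hat = C[:, :K]                                        # leading eigenvectors
    return lam_hat, sigma2_hat, U_hat
```

By construction the K leading estimates plus (p-K) copies of the noise estimate add up to (N-1)/N times the trace of S, which is a convenient sanity check.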
The log-likelihood function reduces to
$$\log \left( \frac{\prod_{i=K+1}^{p} \lambda_i(S)^{1/(p-K)}}{\frac{1}{p-K} \sum_{i=K+1}^{p} \lambda_i(S)} \right)^{\frac{(p-K)N}{2}} \qquad (7.1)$$

We now focus on the second term of the two-part MDL criterion, which is
mainly given by the number of parameters that must be sent from the encoder to the decoder. In the analyzed case, we have to transmit the estimated entries for the mean vector $\hat{b}$ and the matrix $\hat{\Sigma}$. Instead of sending


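all entries of $\hat{\Sigma}$ one by one, we transmit the eigenvectors and the eigenvalues of $\hat{\Sigma}$, relying on the following identity, which is implied by the spectral
representation theorem:
$$\hat{\Sigma} = \hat{\sigma}^2 I + \sum_{k=1}^{K} \left( \hat{\lambda}_k(\Sigma) - \hat{\sigma}^2 \right) \hat{u}_k \hat{u}_k^T$$
where $\hat{u}_k$, $1 \le k \le K$, are the eigenvectors of $\hat{\Sigma}$ and $I$ is the $p \times p$ identity
matrix.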
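The identity can be checked numerically. The sketch below builds a covariance matrix from made-up values (p = 5, K = 2, sigma^2 = 0.5, and leading eigenvalues 3.0 and 1.5 are all illustrative assumptions) and confirms that its spectrum consists of the K chosen eigenvalues followed by p - K copies of sigma^2.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, sigma2 = 5, 2, 0.5                  # illustrative sizes, not from the text

# Hypothetical orthonormal eigenvectors: first K columns of a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
U = Q[:, :K]
lam = np.array([3.0, 1.5])                # K leading eigenvalues, all larger than sigma2

# Sigma = sigma^2 I + sum_k (lambda_k - sigma^2) u_k u_k^T
# (U * (lam - sigma2)) scales column k of U by (lambda_k - sigma^2).
Sigma = sigma2 * np.eye(p) + (U * (lam - sigma2)) @ U.T

# Eigenvalues in descending order: the K leading ones, then p-K copies of sigma^2.
eig = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
```

Only the K leading eigenvalues, the common value sigma^2, and the K eigenvectors are needed to rebuild the full p x p matrix, which is why the encoder can transmit them instead of all matrix entries.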
We can easily count the number of parameters:

The mean vector $\hat{b}^T$ has length $p$, therefore it contributes $p$ parameters. Since $p$ does not depend on $K$, we ignore the cost of
transmitting the mean vector to the decoder;

The eigenvalues of $\hat{\Sigma}$ are all real-valued. Recall that $K$ of them are
distinct and the remaining $p-K$ are equal to $\hat{\sigma}^2$, therefore the number
of parameters is $K+1$;

The eigenvectors of $\hat{\Sigma}$ have real-valued entries. Since there are $K$ eigenvectors, each of length $p$, this gives $Kp$ parameters.
Each eigenvector is constrained to have norm 1, thus only $p-1$ entries of each eigenvector are independent parameters. Similarly, the
orthogonality constraint on the eigenvectors reduces the number of
parameters by $K(K-1)/2$.
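As a check on the bookkeeping, the contributions listed above (eigenvalues, eigenvector entries, and the norm and orthogonality constraints) can be added explicitly; the mean vector is left out because its cost does not depend on K:

```latex
\begin{align*}
\underbrace{(K+1)}_{\text{eigenvalues}}
+ \underbrace{Kp}_{\text{eigenvector entries}}
- \underbrace{K}_{\text{unit norm}}
- \underbrace{\tfrac{K(K-1)}{2}}_{\text{orthogonality}}
&= 1 + Kp - \frac{K(K-1)}{2} \\
&= 1 + \frac{K(2p-K+1)}{2}.
\end{align*}
```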


The total number of parameters is therefore $1 + \frac{K(2p-K+1)}{2}$, which
together with equation (7.1) leads to the following MDL criterion:
$$\hat{K} = \arg\min_k \left[ (p-k) N \log \frac{A_k}{G_k} + \frac{k(2p-k+1)}{2} \log N \right] \qquad (7.2)$$
where $A_k = \frac{1}{p-k} \sum_{i=k+1}^{p} \lambda_i(S)$ and $G_k = \left( \prod_{i=k+1}^{p} \lambda_i(S) \right)^{1/(p-k)}$. Observe
that $A_k$ is the arithmetic mean of the last $p-k$ eigenvalues of $S$, and
$G_k$ is the geometric mean of the same eigenvalues. We know from an elementary inequality that $A_k \ge G_k$, with equality if and only if $\lambda_{k+1}(S) =
\lambda_{k+2}(S) = \ldots = \lambda_p(S)$, which leads to the conclusion that the first term
of the criterion is nonnegative. It is obvious that the second term is positive
and penalizes the number of parameters.
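A minimal sketch of how criterion (7.2) could be evaluated from data follows. The function name `gs_mdl_num_clusters`, the search range k = 0, ..., p-1, and the 1/(N-1) normalisation of S are assumptions made for illustration; the text does not specify them. The sketch also assumes N > p so that S has strictly positive eigenvalues.

```python
import numpy as np

def gs_mdl_num_clusters(X):
    """Evaluate criterion (7.2) for k = 0..p-1 and return the minimiser.

    X is an N x p data matrix (rows are observations). Returns the
    estimated K and the array of criterion values indexed by k.
    """
    N, p = X.shape
    S = np.cov(X, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues of S, descending
    lam = np.clip(lam, 1e-12, None)              # guard against round-off below zero
    crit = np.empty(p)
    for k in range(p):                           # p - k >= 1 for every k tried
        tail = lam[k:]                           # the last p-k eigenvalues
        A_k = tail.mean()                        # arithmetic mean
        G_k = np.exp(np.log(tail).mean())        # geometric mean
        fit = (p - k) * N * np.log(A_k / G_k)    # first term, nonnegative
        penalty = 0.5 * k * (2 * p - k + 1) * np.log(N)
        crit[k] = fit + penalty
    return int(np.argmin(crit)), crit
```

Computing the geometric mean through logarithms avoids overflow or underflow when p is large; the whole procedure needs one eigendecomposition of a p x p matrix and no EM iterations.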
The expression of the criterion has an intuitive form and is easy to
compute. The question remains whether such a criterion can lead to an
accurate estimate of the number of clusters despite the various approximations made when deriving it. The answer is given by a result from
Giurcaneanu et al. (2004a), where the consistency of the MDL criterion


(7.2) is proven. Therefore, if the number of samples N is large enough, the
estimated number of clusters equals the true number with probability approaching one. The result is very
interesting since Giurcaneanu et al. (2004a) also prove that another
famous criterion, namely Akaike's criterion (Akaike, 1974), is not consistent even if its form is close to that of the MDL criterion. In fact, Giurcaneanu
et al. (2004a) discuss a larger family of consistent estimators, but
among them the criterion (7.2) is shown to be the best in experiments with
simulated data.
To conclude this section on the use of MDL for finding the number of
clusters, we mention several related methods. In the derivation of the GS-MDL criterion the crucial role is played by the eigenvectors and the eigenvalues of the covariance matrix. Heuristic methods have already been
applied to determine how many eigenvalues are small and correspond just
to noise. From this category of methods we mention the graphical methods
(Cattell, 1966), or the comparison of the eigenvalues with some heuristically found thresholds (Everitt and Dunn, 2001). All these methods have recently become popular in the bioinformatics community, especially in connection
with the application of Singular Value Decomposition (SVD) in microarray data analysis. In this approach, the left singular vectors (eigenassays),
the right singular vectors (eigengenes), and the singular values are used
for the biological interpretation of data recorded in matrix X (Wall et al.,
2002). An implementation of this method named SVDMAN is described
in Wall et al. (2001), and the same paper draws a parallel between
GS-Gap and SVDMAN.
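To make the SVD terminology concrete, the following sketch decomposes a small random matrix. The 20 x 6 genes-by-assays layout and the random data are assumptions chosen only for illustration; this is not the SVDMAN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))        # hypothetical data: 20 genes x 6 assays

# Thin SVD: X = U diag(s) Vt, singular values returned in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eigenassays = U                     # left singular vectors (columns of U)
eigengenes = Vt                     # right singular vectors (rows of Vt)

# Fraction of the total squared signal carried by each singular value;
# heuristic methods look for the point where these fractions level off.
explained = s**2 / np.sum(s**2)
```

Plotting `explained` against its index gives the kind of scree display used by the graphical methods mentioned above.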

5. Functional Insights in Clustering Glioma Gene-Expression

In the original version of GS-Gap, the clusters found are non-exclusive in the
sense that the same gene can potentially be assigned to several different clusters. The same non-exclusive property also occurs in the case of the SVDMAN
algorithm. Our aim is to partition the gene set into non-overlapping clusters,
and consequently, once a cluster is identified by the GS-MDL algorithm,
all the corresponding genes are eliminated, and a new matrix X is generated. The rows of X are not orthogonalized with respect to the average
gene in the last found cluster. Therefore, at each step the number of rows
in matrix X decreases, while the number of columns remains constant.
We stop the procedure when the number of genes in X becomes
smaller than the number of columns of X.
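The elimination loop described above can be sketched as follows. The cluster-finding step itself is abstracted into a hypothetical callable `find_cluster`, which stands in for one run of the shaving procedure and is assumed to return the row indices of one cluster within the matrix it receives.

```python
import numpy as np

def partition_genes(X, find_cluster):
    """Extract non-overlapping clusters by repeated elimination.

    X is a genes x samples matrix. `find_cluster` is a hypothetical
    stand-in for one cluster search; it receives the current matrix and
    returns row indices of the cluster found in it.
    """
    remaining = np.arange(X.shape[0])       # original indices of genes still in X
    clusters = []
    # Stop once fewer genes remain than there are columns.
    while remaining.size >= X.shape[1]:
        idx = np.asarray(find_cluster(X[remaining]), dtype=int)
        if idx.size == 0:                   # safety stop if nothing is returned
            break
        clusters.append(remaining[idx])     # map back to original gene indices
        remaining = np.delete(remaining, idx)
    return clusters, remaining              # leftover genes form the last group
```

Keeping the original indices in `remaining` means each returned cluster refers to rows of the initial matrix, even though the working matrix shrinks at every step.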
After running the GS-MDL algorithm, 72 genes are grouped in 16 different clusters, and the remaining 49 genes form the last big cluster. In our
analysis we focus on the 16 clusters, which are labeled with numbers from
1 to 16 according to the order in which they were found by the


algorithm. We also run the GS-Gap algorithm, which groups the 121 genes
into 11 clusters.
GS groups together genes that are highly correlated, either positively or negatively. Clusters are approximately aligned with the principal components of the data; e.g., the first cluster is aligned with the first principal component.
We are interested in the following three molecular-biological discrimination
problems:

The transition from low grades to the most aggressive grade (i.e., Oligodendroglioma Low Grade, Anaplastic Oligodendroglioma and
Anaplastic Astrocytoma vs. Glioblastoma Multiforme)

The key differences between the two lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma and Oligodendroglioma Low Grade)

The transition from the lowest glioma grade to high grades (Oligodendroglioma Low Grade vs. all others)

For each of these problems we evaluate the clusters generated by GS-Gap
and GS-MDL using the discrimination power of the average gene¹, expressed by its BSS/WSS statistic (see Fig. 7.1). For each of the three molecular-biological discrimination problems we select the cluster with the highest discrimination power and compare the functions of the genes
in the clusters selected by the two algorithms. Then, we evaluate the implications of each gene with GoMiner (Zeeberg et al., 2003) and the NCBI
Entrez Database (http://www.ncbi.nlm.nih.gov).
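The evaluation step can be sketched as below. Both function names are hypothetical, and the BSS/WSS definition used here (class-size-weighted squared deviations of class means from the overall mean, divided by the pooled within-class squared deviations) is the common form of the statistic, assumed rather than taken from the text.

```python
import numpy as np

def average_gene(X_cluster, signs):
    """Mean expression profile of one cluster (genes x samples).

    `signs` holds +1 or -1 per gene; negatively correlated genes are
    flipped before averaging.
    """
    signs = np.asarray(signs, dtype=float)[:, None]
    return (signs * np.asarray(X_cluster, dtype=float)).mean(axis=0)

def bss_wss(profile, labels):
    """Ratio of between-class to within-class sum of squares for one profile."""
    profile = np.asarray(profile, dtype=float)
    labels = np.asarray(labels)
    overall = profile.mean()
    bss = 0.0
    wss = 0.0
    for c in np.unique(labels):
        vals = profile[labels == c]
        bss += vals.size * (vals.mean() - overall) ** 2   # between-class part
        wss += ((vals - vals.mean()) ** 2).sum()          # within-class part
    return bss / wss
```

For a two-class problem such as "Glioblastoma Multiforme vs. all others", `labels` would simply mark each sample as belonging to one side or the other; a larger ratio means the average gene separates the two groups more clearly.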
The knowledge about the biological questions revealed by the two algorithms is comparable. Both algorithms suggest that the transition from
low grades to the most aggressive grade (Oligodendroglioma Low Grade,
Anaplastic Oligodendroglioma and Anaplastic Astrocytoma vs. Glioblastoma Multiforme) is predominantly due to the Cell growth function
(see Fig. 7.2).
The SRPR (signal recognition particle receptor) gene is present in the clusters generated by both GS-Gap and GS-MDL. In mammalian cells, the SRP receptor is required for the targeting of nascent secretory proteins to the ER (endoplasmic reticulum) membrane; among cellular physiological processes,
the SRP receptor is involved in intracellular protein transport in growing cells.
Also, we identify the implication of members of the POU transcription
factor family together with highly preponderant Cell growth genes such as

¹ The average gene is defined as the mean of the genes from one cluster. Negatively correlated genes are represented with the reversed sign.

[Figure 7.1: two bar charts, each titled "Power of discrimination". Vertical axis: discrimination power (0 to 2). Horizontal axis: clusters. Series: OL vs. all, GM vs. all, AA vs. AO+OL.]

Figure 7.1. The power of discrimination between classes (AA, AO, OL and GM) for the three important discrimination problems. The left panel shows
the clusters obtained from Gene Shaving with Minimum Description Length (GS-MDL), and the right panel the clusters of Gene Shaving with Gap
(GS-Gap). In each panel, from left to right for all clusters, the power of discrimination is measured by the ratio of the Between Sum of Squares
to the Within Sum of Squares (BSS/WSS) for: 1) the transition from the lowest glioma grade to high grades (Oligodendroglioma Low Grade vs. all others);
2) the transition from low grades to the most aggressive grade (Oligodendroglioma Low Grade, Anaplastic Oligodendroglioma and Anaplastic Astrocytoma vs.
Glioblastoma Multiforme); 3) the discrimination between the two lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma
and Oligodendroglioma Low Grade).

[Figure 7.2 panels: "Genes and Properties selected by MDL Gene Shaving"; "DETAIL of average gene and label of patients for GS-MDL grouping". Table columns: NL, Symbol, Votes, +/-, Name, Function, and mean / std. dev. for each glioma subtype (AA, AO, OL, GM). Genes listed: SRPR (signal recognition particle receptor), MYO6 (myosin VI), CLCN6 (chloride channel 6).]

Figure 7.2. The transition from low-medium grades to the highest grade (all vs. GM) is influenced by genes from the Cell growth function family.
Gene Shaving with Minimum Description Length (GS-MDL) and Gene Shaving with Gap (GS-Gap) collect genes from the same functional group in the
clusters (Cluster 3 and Cluster 4) that have the highest discriminatory power in this problem. SRPR (signal recognition particle receptor, a docking protein)
is present in both selections. Other relevant genes are FADD (Fas-TNFRSF6 associated via death domain), POU6F1 (POU domain, class 6, transcription
factor 1), and IGFBP1 (insulin-like growth factor binding protein 1). For the four-type classification problem, Cluster 3 (GS-MDL) ranks first
and Cluster 4 (GS-Gap) second. We consider the genes from this cluster to be related to the evolution of glioma with respect to aggressiveness. (See
the caption of Fig. 7.3 for a description of the color map and table content.)

IGFBP1 (insulin-like growth factor binding protein 1), MYO6 (myosin
VI), MATN2 (matrilin 2), and DCLRE1A (DNA cross-link repair 1A). The
POU domain transcription factors are a subgroup of homeodomain
proteins that appear to control cellular phenotypes. Expression of the
POU domain, class 6, transcription factor 1 (POU6F1) is restricted to the medial
habenula, to a dispersed population of neurons in the dorsal hypothalamus,
and to subsets of ganglion and amacrine cells in the retina (Andersen et
al., 1993). No specific function has been proposed, and a knockout study
has not been reported.
In epithelial cells, Myosin VI (MYO6) functions to move endocytic vesicles out of actin-rich regions. In the absence of this motor activity, the uncoated vesicles exhibit Brownian-like motion and are trapped in actin
(Aschenbrenner et al., 2004). The walking mechanism is controversial
because of the very large step size of 36 nm (Okten et al., 2004). Myosin
VI is known to take part in cytoskeleton organization in growing cells. Although gene shaving is not a supervised clustering method, in our experiment MYO6
is specifically over-expressed in GM compared with the other subtypes.
The differences between the two lineages of glioma (Anaplastic Astrocytoma vs. Anaplastic Oligodendroglioma and Oligodendroglioma
Low Grade) are underlined by the Transcription and DNA binding functions
(see Fig. 7.3). The results are consistent and overlap for both Gene Shaving
algorithms (GS-Gap and GS-MDL). As can be seen from Fig. 7.1a, Cluster
6 has the highest discrimination power for lineage differentiation compared with the other groupings. GS-MDL (marked with a red-shaded background)
is more selective than GS-Gap
(light-yellow background). In both cases, the genes predominantly have the Transcription and DNA binding function. Leger et al.
(1995) suggested that Translin is involved in the process of translocation,
and HMGA1 in the activation of viral gene expression (together with the
POU domain protein Tst-1/Oct-6, because of its endogenous expression
in myelinating glia). Also, the differential expression of HMGA1 could
be a result of treatment with morphine, which may trigger independent reaction pathways affecting transcription, regulation, and/or post-synthetic
phosphorylation (Chakarov et al., 2000).
For the transition from the lowest glioma grade to high grades (Oligodendroglioma Low Grade vs. all others), any classification algorithm could
have limitations because of the small number of Oligodendroglioma Low Grade samples. In our case, most of the selected genes in the two winning clusters have a metabolic function. Figs. 7.7 and 7.8 show the expression
of Myosin light polypeptide kinase (MYLK), Endothelin receptor type B
(EDNRB), Heat shock protein 75 (TRAP1), Sialyltransferase 7 (SIAT7B),
and RNTRE (USP6NL).

[Figure 7.3, continued. Panels: "Genes and Properties selected by Gap Gene Shaving"; "DETAIL of average gene and label of patients for GS-Gap grouping". Table columns: NL, Symbol, Votes, +/-, Name, Function, and mean / std. dev. for each glioma subtype (AA, AO, OL, GM). Genes listed: IGFBP1, MATN2, DCLRE1A, SRPR, EWSR1, H2BFQ, POU6F1, FADD. CLUSTER 4 significances: 0.56856 (OL vs. all = 0.2023, GM vs. all = 0.27056, AA vs. AO+OL = 0.96931).]

Figure 7.3. Continued...

[Figure 7.3 panels: "Genes and Properties GS-MDL and GS-Gap"; "7 genes with Transcription Function". Table columns: NL, Symbol, Votes, +/-, Name, Function, and mean / std. dev. for each glioma subtype (AA, AO, OL, GM). Genes listed: ATRX, TRAP1, STHM, NFYB, REG1A, MBD1, RNTRE, INPP5D, TBP, KIAA0268, TSN, HOXD1, HMGA1.]

Figure 7.3. Transcription and DNA binding functions are characteristic for the lineage differentiation separation. The left figure presents the genes:
columns are gene expression profiles; the 49 patients are grouped by glioma labels: AA, AO, OL, and GM. For positively correlated genes, green is
used for low expression values and red for high expression values. The gene shaving algorithm groups together positively and negatively correlated genes.
If a gene is negatively correlated, we represent its expression as a negative film. In the +/- column we label positively correlated genes with + and
negatively correlated genes with -. On the right side of the table we show for each class the mean (upper row, left aligned) and the variance of the expression
values (lower row, right aligned).

[Figure 7.4 panel: "DETAIL of average gene and label of patients for GS-MDL grouping".]

Figure 7.4. Transcription and DNA binding functions are characteristic for the lineage differentiation separation. The lower part of the image
shows the patient labels and the average gene of Cluster 6 for GS-MDL, i.e., the signed averages of the genes across the patients.
Cluster 6 (GS-MDL) overlaps Cluster 8 (GS-Gap). We then reorder the patients such that the values of the average gene are increasing, list
the reordered patient labels horizontally, and below this list draw a row representing the values of the sorted average gene by colors. The bottom row shows
the average gene as a plot.

[Figure 7.5 panel: "Genes and Properties selected by GS-MDL". Table columns: NL, Symbol, Votes, +/-, Name, Function, and mean / std. dev. for each glioma subtype (AA, AO, OL, GM). Genes listed: SULT1A3, EFNB2, OPRK1, NAPG, MEF2C, DCLRE1A. CLUSTER 1 significances: 0.65937 (OL vs. all = 0.11673, GM vs. all = 0.044389, AA vs. AO+OL = 0.77953).]

Figure 7.5. Cluster 1 groups the genes aligned with the first principal component. In both GS-MDL and GS-Gap, the first clusters contain genes that
have the highest vote rate (43/43) from the LOO-CV algorithm, as shown in Fuller et al. (submitted). Five genes are represented in red; they belong to the first
cluster for both GS-MDL and GS-Gap. The first clusters are composed of genes with a mixture of metabolic functions: DNA-binding, transduction, and
cell-cycle. These genes discriminate well in the problem of the transition from the lowest grade to medium-high glioma grades (OL vs. all). We can observe
low expression in the AO and OL subtypes versus over-expression in the AA glioma subtype.

[Figure 7.6 panels: "Genes and Properties selected by GS-Gap"; "DETAIL on average gene and label of patients". Table columns: NL, Symbol, Votes, +/-, Name, Function, and mean / std. dev. for each glioma subtype (AA, AO, OL, GM). Genes listed: SULT1A3, OPRK1, MEF2C, MPI, NAPG, UROS, SGCD, TERF1, TSPY.]

Figure 7.6. Cluster 1 groups the genes aligned with the first principal component.

[Figure 7.7 panels: "Genes and Properties selected by GS-MDL"; "DETAIL on average gene and label of patients". Table columns: NL, Symbol, Votes, +/-, Name, Function, and mean / std. dev. for each glioma subtype (AA, AO, OL, GM). Genes listed: TRAP1 (heat shock protein 75), SIAT7B (sialyltransferase 7), USP6NL (related to the N terminus of tre; rntre). CLUSTER 10 significances: 0.38475 (OL vs. all = 0.24703, GM vs. all = 0.0018932, AA vs. AO+OL = 0.0036034).]

Figure 7.7. The winners in the discrimination problem of the transition from the lowest grade (OL) to the other grades are different in GS-MDL and
GS-Gap. Cluster 10 of GS-MDL creates a better structure than GS-Gap. The genes agree on metabolic function. TRAP1 (heat shock protein 75) is
intensively studied.

[Figure 7.8 panels: "Genes and Properties selected by GS-Gap"; "DETAIL on average gene and label of patients". Table columns: NL, Symbol, Votes, +/-, Name, Function, and mean / std. dev. for each glioma subtype (AA, AO, OL, GM). Entries listed: Hs. chr. 4 B368A9 map 4q25, MYLK (myosin, light polypeptide kinase), EDNRB (endothelin receptor type B). CLUSTER 4 significances: 0.28159 (OL vs. all = 0.23609, GM vs. all = 0.011157, AA vs. AO+OL = 0.012255).]

Figure 7.8. The winners in the discrimination problem of the transition from the lowest grade (OL) to the other grades are different in GS-MDL and
GS-Gap. Cluster 9 of GS-Gap is composed of MYLK (myosin, light polypeptide kinase), EDNRB (endothelin receptor type B), and clone B368A9 map
4q25.

References
Akaike, H. (1974) A new look at the statistical model identification. IEEE
Trans. Autom. Control, AC-19:716–723, Dec.
Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S.,
Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I.,
Yang, L., Marti, G. E., Moore, T., Hudson, J. Jr., Lu, L., Lewis, D. B.,
Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger,
D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M.
R., Byrd, J. C., Botstein, D., Brown, P. O., Staudt, L. M. (2000) Distinct
types of diffuse large B-cell lymphoma identified by gene expression
profiling. Nature, 403(6769):503–11.
Andersen B, Schonemann D, Pearse II V, Jenne K, Sugarman J, Rosenfeld
G. (1993) Brn-5 is a divergent POU domain factor highly expressed in
layer IV of the neocortex. J Biol Chem 268:23390–23398.
Anderson, T.W. (1963) Asymptotic theory for principal component
analysis. Ann. Math. Stat., 34:122–148, Mar.
Aschenbrenner L, Naccache SN, Hasson T. (2004) Uncoated endocytic
vesicles require the unconventional myosin, Myo6, for rapid transport
through actin barriers. Mol Biol Cell. May;15(5):2253–63. Epub 2004
Mar 05.
Barron A, Rissanen J, Yu B. (1998) The minimum description length principle in coding and modeling. IEEE Trans. Info. Theory, IT-44:
2743–2760.
Borg, I., and Groenen, P. (1997) Modern Multidimensional Scaling:
Theory and Applications. Springer, New York.
Caskey, L. S., Fuller, G. N., Bruner, J. M., Yung, W. K., Sawaya, R. E.,
Holland, E. C., and Zhang, W. (2000) Toward a molecular classification
of the gliomas: histopathology, molecular genetics, and gene expression
profiling. Histol. Histopathol., 15:971–981.
Cattell, R.B. (1966) The scree test for the number of factors. Multivariate
Behavioral Research, 1:245–276.
Celeux, G. and Govaert, G. (1995) Gaussian parsimonious clustering
models. Pattern Recognit., 28:781–793.
Chakarov S, Chakalova L, Tencheva Z, Ganev V, Angelova A. (2000)
Morphine treatment affects the regulation of high mobility group I-type
chromosomal phosphoproteins in C6 glioma cells. Life Sci. 24;66(18):
1725–31.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum likelihood
from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat.
Methodol., 39:1–38.
Dougherty, E.R., Barrera, J., Brun, M., Kim, S., Cesar, R.M., Chen, Y.,
Bittner, M. and Trent, J.M. (2002) Inference from clustering with
application to gene-expression microarrays. J. Comput. Biol., 9(1):
105–126, Jan.
Entrez Database Website http://www.ncbi.nlm.nih.gov/ at National Center
for Biotechnology Information.
Everitt, B.S. and Dunn, G. (2001) Applied multivariate data analysis.
Arnold, London.
Fix, E., and Hodges, J. (1951) Discriminatory analysis, nonparametric
discrimination: consistency properties. Technical Report, Randolph
Field, Texas: USAF School of Aviation Medicine.
Fraley, C., and Raftery, A.E. (1998) How many clusters? Which clustering
method? Answers via model-based cluster analysis. Comput. J., 41:
578–588.
Fuller, G. N., Hess, K. R., Rhee, C. H., Yung, W. K., Sawaya, R. A.,
Bruner, J. M., and Zhang, W. (2002) Molecular classification of human
diffuse gliomas by multidimensional scaling analysis of gene expression
profiles parallels morphology-based classification, correlates with survival, and reveals clinically-relevant novel glioma subsets. Brain Pathol.
12(1):108–16.
Fuller, G. N., Mircean, C., Tabus, I., Taylor, E., Sawaya, R., Bruner, J.,
Shmulevich, I., Zhang, W. Molecular Voting for Glioma Classification
Reflecting Heterogeneity in the Continuum of Cancer Progression
(submitted).
Fuller, G. N., Rhee, C. H., Hess, K. R., Caskey, L. S., Wang, R., Bruner,
J. M., Yung, W. K. and Zhang, W. (1999) Reactivation of insulin-like
growth factor binding protein 2 expression in glioblastoma multiforme:
a revelation by parallel gene-expression profiling. Cancer Res 59:
4228–32.
Giurcaneanu, C.D., Tabus, I., Astola, J., Ollila, J. and Vihinen, M. (2004a)
Fast iterative gene clustering based on information theoretic criteria for
selecting the cluster structure. J. Comput. Biol., 11(4):660–682.
Giurcaneanu, C.D., Tabus, I., Shmulevich, I. and Zhang, W. (2004b) Clustering genes and samples from glioma microarray data. In R. Dobrescu
and C. Vasilescu, editors, Interdisciplinary applications of fractal and
chaos theory, pages 157–171. The Publishing House of the Romanian
Academy, Bucharest, Romania, 2004.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M.,
Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M.
A., Bloomfield, C. D., and Lander, E. S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression
monitoring. Science. 286(5439):531–7.
Hastie, T., Tibshirani, R., Eisen, MB, Alizadeh A, Levy R, Staudt L,
Chan WC, Botstein D, Brown P. (2000) Gene shaving as a method for
identifying distinct sets of genes with similar expression patterns.
Genome Biol. 2000;1(2):RESEARCH0003. Epub 2000 Aug 04.
http://genomebiology.com/2000/1/2/research/0003/
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U.,
Weinstein, J., Alizadeh, A., Staudt, L. and Botstein, D. (2000) Gene
Shaving: a new class of clustering methods.
http://www.stat.stanford.edu/hastie/Papers/
Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M.,
Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P.,
Wilfond, B., Borg, A., and Trent, J. (2001) Gene-expression profiles
in hereditary breast cancer. N Engl J Med. 344(8):539–48.
Huber, P. J. (1981) Robust Statistics. John Wiley & Sons, p. 107.
Kim, S., Dougherty, E. R., Shmulevich, I., Hess, K. R., Hamilton, S. R.,
Trent, J. M., Fuller, G. N., and Zhang, W. (2002) Identification of combination gene sets for glioma classification. Mol Cancer Ther. (13):
1229–36.
Kleihues, P., Cavenee, W. K. (2000) Pathology and genetics of tumours of
the nervous system. Lyon: IARC Press.
Kobayashi, T., Yamaguchi, M., Kim, S., Morikawa, J., Ogawa, S.,
Ueno, S., Suh, E., Dougherty, E., Shmulevich, I., Shiku, H., and
Zhang, W. (2003) Microarray reveals differences in both tumors and
vascular specific gene expression in de novo CD5+ and CD5- diffuse
large B-cell lymphomas. Cancer Res. 63(1):60–6.
Leger H, Sock E, Renner K, Grummt F, Wegner M. (1995) Functional
interaction between the POU domain protein Tst-1/Oct-6 and the
high-mobility-group protein HMG-I/Y. Mol Cell Biol. Jul;15(7):
3738–47.
Lloyd, S. P. (1982) Least Squares Quantization in PCM. IEEE
Transactions on Information Theory, vol. IT-28, March, 129–137.
Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979) Multivariate analysis.
Academic Press, London.
Milligan, G.W. and Cooper, M.C. (1985) An examination of procedures
for determining the number of clusters in a data set. Psychometrika,
50:159–179.
Mircean, C., Tabus, I., and Astola, J. (2002) Quantization and distance
function selection for discrimination of tumors using gene expression
data. Proceedings of SPIE Photonics West 2002, BiOS 2002 Symposium, San Jose, CA.
Mircean, C., Tabus, I., Astola, J., Kobayashi, T., Shiku, H., Yamaguchi, M.,
Shmulevich, I., and Zhang, W. (2004) Quantization and similarity
measure selection for discrimination of lymphoma subtypes under
k-nearest neighbor classification. SPIE Photonics West 2004, BiOS
2004 Symposium, San Jose, CA.
Nutt, C. L., Mani, D. R., Betensky, R. A., Tamayo, P., Cairncross, J. G.,
Ladd C., Pohl, U., Hartmann, C., McLaughlin, M. E., Batchelor, T. T.,
Black, P. M., Deimling, A., Pomeroy, S. L., Golub, T. R., and
Louis, D. N. (2003) Gene Expression-based Classication of Malignant
Gliomas Correlates Better with Survival than Histological Classication. Cancer Research 63, 16021607.
Okten Z, Churchman LS, Rock RS, Spudich JA. (2004) Myosin VI walks
hand-over-hand along actin. Nat Struct Mol Biol., Aug 1 Epub 2004
Aug 01.
Redner R.A. and Walker, H.F. (1984) Mixture densities, maximum
likelihood and the EM algorithm. SIAM Rev., 26:195239.
Rissanen, J. (1978) Modeling by shortest data description. Automatica J.
IFAC, 14:465471.
Schwarz, G. (1978) Estimating the dimension of a model. Ann. Stat., 6(2):
461464.
Shmulevich, I., and Zhang, W. (2002) Binary Analysis and OptimizationBased Normalization of Gene Expression Data. Bioinformatics, 18(4):
555565.
Shmulevich, I., Hunt, K., El-Naggar, A., Taylor, E., Ramdas, L.,
Laborde, P., Hess, K. R., Pollock, R., and Zhang, W. (2002) Tumor
specic gene expression proles in human leiomyosarcoma: an
evaluation of intratumor heterogeneity. Cancer; 94: 20692075.
Stoica, P. and Selen, Y. (2004) Model-order selection. Signal Processing
Mag., 21(4):3647.
Stone, C. J. (1977) Consistent nonparametric regression (with discussion).
Ann Statist. 5 pp. 595645.
Tabus I. and Astola. J. Clustering the non-uniformly sampled time series of gene expression data. In: Proc. ISSPA 2003, EURASIP-IEEE
Seventh Int. Symp. on Signal Processing and its Applications,
pages 6164, Paris, Jul. 14 2003.
Taylor, E., Cogdell, D., Coombes, K., Hu, L., Ramdas, L., Tabor, A.,
Hamilton, S., and Zhang, W. (2001) Sequence verication as quality
control step for production of cDNA microarray. BioTechniques 31:
6265.
Wall, M.E., Dick, P.A. and Brettin, T.S. (2001) SVDMAN-singular value
decomposition of microarray data. Bioinformatics, 17:566568.
Wall, M.E., Rechtsteiner, A. and Rocha, L.M. (2002) Singular value
decomposition and principal component analysis. In: D. P. Berrar, W.
Dubitzky, and M. Granzow, editors, A practical approach to microarray

118

COMPUTATIONAL GENOMICS

data analysis, pages 91109. Kluwer Academic Publishers, Mass,


USA.
Wax, M. and Kailath, T. (1985) Detection of signals by information
theoretic criteria. IEEE Trans. Acoustics Speech Signal Proc., 33:
387392.
Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E. and Ruzzo W.L. (2001)
Model-based clustering and data transformations for gene expression
data. Bioinformatics, 17:977987.
Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M,
Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ,
Riss J, Barrett JC, Weinstein JN. (2003) GoMiner: a resource for
biological interpretation of genomic and proteomic data. Genome Biol.;
4(4):R28. Epub 2003 Mar 25.
Zhou, X., Wang, X. and Dougherty, E. R. (2003) Binarization of microarray data on the basis of a mixture model. Mol Cancer Ther. 2(7):
67984.

Publication no. 6
Mircean C, Tabus I, Kobayashi T, Yamaguchi M, Shiku H,
Shmulevich I, Zhang W.
Pathway analysis of informative genes from microarray data
reveals that metabolism and signal transduction genes distinguish
different subtypes of lymphomas.
Int J Oncol. 2004 Mar;24(3):497-504.

INTERNATIONAL JOURNAL OF ONCOLOGY 24: 497-504, 2004


Pathway analysis of informative genes from microarray data
reveals that metabolism and signal transduction genes
distinguish different subtypes of lymphomas
CRISTIAN MIRCEAN1, IOAN TABUS1, TOHRU KOBAYASHI2, MOTOKO YAMAGUCHI2,
HIROSHI SHIKU2, ILYA SHMULEVICH3 and WEI ZHANG3

1Institute of Signal Processing, Tampere University of Technology, FIN-33101 Tampere, Finland;
2The Second Department of Internal Medicine, Mie University School of Medicine, Tsu 514-8507, Japan;
3Department of Pathology, The University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA

Received August 1, 2003; Accepted September 9, 2003

Abstract. Recent clinicopathological studies identified a
unique subgroup of diffuse large B-cell lymphoma (DLBCL)
that expresses CD5 on the cell surface. This de novo CD5+
DLBCL comprises 10% of all DLBCL and has a poorer
prognosis than CD5- DLBCL. Comparison of gene expression
profiles between de novo CD5+ DLBCLs and CD5- DLBCLs
shows that de novo CD5+ DLBCL expresses high levels of
integrin β1 in tumor cells and CD36 in the vascular cells. On
the other hand, comparison between mantle cell lymphomas
(MCLs) and DLBCLs expectedly identified cyclin D1 as a
top feature gene. To gain insight into the molecular pathway
differences among the three types of lymphoma, we evaluated
the functional categories of groups of genes important for the
discrimination among the three groups. We first selected 280
(from 2,142) genes, according to their individual discriminatory
power. We then used the gene-shaving clustering algorithm
and identified 22 clusters of genes. Of the 22 clusters, six
were highly correlated with the class labels of the patients
and the top three clusters accounted for the major difference
among the three lymphoma subtypes. A multidimensional
scaling (MDS) analysis using the average genes from the top
three clusters separated the three lymphoma subtypes quite
well. The functions of the genes in the top three gene clusters
showed a significant enrichment of metabolism and signal
transduction. To further examine whether genes of particular
functions reflect more faithfully the difference between the
subtypes of lymphomas, we separated the 280 informative
genes into six different functional groups and performed
MDS analysis using each of the gene groups. Four of the gene-function groups (metabolism, signal transduction pathway,
transcriptional factors, cell adhesion and migration) separated
the three lymphoma subtypes well, whereas apoptosis genes
and cell cycle genes did not result in good separation.

_________________________________________
Correspondence to: Dr Wei Zhang, Cancer Genomics Core
Laboratory, Department of Pathology, The University of Texas
M.D. Anderson Cancer Center, 1515 Holcombe Blvd., Houston,
TX 77030, USA
E-mail: wzhang@mdanderson.org
Key words: gene shaving, cDNA microarray, diffuse large B-cell
lymphoma, mantle cell lymphoma, metabolism and transduction

Introduction
Diffuse large B-cell lymphomas (DLBCLs) are characterized
by proliferation of large transformed lymphoid cells. They
are heterogeneous in immunophenotypes, clinical features, and
treatment responses (1). Alizadeh et al identified two different
subtypes of DLBCLs, germinal center B-like DLBCL and
activated B-like DLBCL, based on the characteristic gene
expression patterns (2). In the activated B-like DLBCLs, a
poor prognostic subtype, the BCL-2 gene, which has antiapoptotic function, was overexpressed (2). Mantle cell
lymphoma (MCL) is characterized by the proliferation of
monomorphous small to medium-sized lymphoid cells with
irregular nuclei, the recurrent cytogenetic abnormality of
t(11;14)(q13;q32), and poor prognosis (1). Lymphoma cells of
MCL express cyclin D1, BCL-2 and CD5 (1). Immunoglobulin
heavy and light chain genes are rearranged in both DLBCLs
and MCLs. However, variable region genes are not mutated
in most MCLs, consistent with a pre-germinal center B-cell
origin, whereas the variable region genes are mutated in
DLBCLs (1). It has been reported that apoptosis-inducing
genes were down-regulated in MCLs when compared with
non-malignant hyperplastic lymph nodes (3), which may
account for their poor response to chemotherapy and poor
prognosis.
Immunophenotypically, naive B-cells express CD5, and
chronic lymphocytic leukemia (CLL) and MCL are considered
to correspond to CD5+ B-cells (1). Recently, however, it has been found that 10%
of DLBCLs without a prior CLL phase express CD5, and this
de novo CD5+ DLBCL has been reported to be clinicopathologically different from CD5- DLBCL and MCL (4). Immunoglobulin variable region genes are mutated in de novo CD5+
DLBCLs (5-7). To further evaluate de novo CD5+ DLBCL at
the molecular level, we performed gene expression profiling
using cDNA microarray technology (8). A series of genes

MIRCEAN et al: FUNCTIONAL GENE GROUPS SEPARATING LYMPHOMAS

were identified that distinguish de novo CD5+ DLBCLs from
CD5- DLBCLs (8). Immunohistochemical confirmation studies
focusing on adhesion molecules revealed that integrin β1 was
overexpressed in lymphoma cells of de novo CD5+ DLBCLs
and that CD36 was overexpressed in their endothelial cells
(8). Integrin β1 overexpression in de novo CD5+ DLBCL is
consistent with the high extranodal involvement and poor
prognosis of this type of lymphoma (9-11). In addition,
vascular CD36 overexpression in de novo CD5+ DLBCL may
contribute to the high incidence of intravascular or intrasinusoidal
infiltration of lymphoma cells.
Taken together, different subtypes of lymphomas have
different phenotypes manifested by different cellular functional
pathways. Thus, gene expression profiling followed by gene
function classification may provide important clues to the
biological behavior of the tumors. In this study, we analyzed
gene expression profiles of de novo CD5+ DLBCLs, CD5- DLBCLs, and MCLs according to function-based gene sets
and found that four gene function groups (metabolism, signal
transduction pathway, transcriptional factors, cell adhesion
and migration) are the most dominant features that distinguish
the three types of lymphomas.
Methods
Patients/samples. Clinical samples were obtained from 10
patients with MCL, 11 patients with de novo CD5+ DLBCL,
and 9 patients with CD5- DLBCL. The diagnoses were made
according to the World Health Organization Classification of
Tumours of Haematopoietic and Lymphoid Tissues (1). DNA
microarray studies using specimens of patients with hematopoietic malignancies were approved by the Institutional Review
Committee of Mie University School of Medicine (8).
Microarray experiment. A total of 2,303 known human cDNAs
(2,142 distinct ones) were prepared by PCR from the Research
Genetics cDNA clone library using two primers, purified using
MultiScreen PCR plates (Millipore Corp., Bedford, MA) and
verified by sequencing at the Cancer Genomics Core Lab.
(M.D. Anderson Cancer Center) before printing (9). The DNA
clones, in 384-well plates, were spotted onto poly-L-lysine-coated microscope slides using an arrayer (Genomic Solutions,
Ann Arbor, MI). Lymphoma cells were lysed in the TRI
reagent (MRC, Cincinnati, OH). Control RNA was prepared
by mixing the same amount of total RNA extracted from
these six cell lines: K562, HL60, NB4, BV173, KBM7 and
Jurkat cells (8). The labeling reaction was performed as
described previously (8,10,11). Methods of hybridization were
described previously (8). Hybridized arrays were scanned at
10 µm resolution on a GeneTAC LS-IV scanner (Genomic
Solutions), and the obtained signal intensities were quantified
with ArrayVision (Imaging Research Inc., St. Catharines,
Ontario, Canada).
After removing the few genes for which the replicate spots
were drastically different in intensity due to scratches and
dust (8,10), the expression of each gene was calculated as the
mean of the two replicate spots.
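Identification of informative genes. Informative genes are
those that have the highest difference in expression between
the three different classes of lymphoma as compared to the
difference in expression within the same class of lymphomas.
Mathematically, let $V \in \mathbb{R}^N$ be a row vector of expression of a
gene over N patients, $W_i \in \{\mathrm{CD5-}, \mathrm{CD5+}, \mathrm{MCL}\}$ the class of patient i, and
$W = (W_1, \ldots, W_N)$ the vector of classes, of size N=30. We have
$n_{\mathrm{CD5-}} = 9$ patients in class CD5-, $n_{\mathrm{CD5+}} = 11$ patients in class CD5+, and
$n_{\mathrm{MCL}} = 10$ patients in the MCL class. The average of expressions
for class CD5- is

\[ \bar{V}_{\mathrm{CD5-}} = \frac{1}{n_{\mathrm{CD5-}}} \sum_{i:\, W_i = \mathrm{CD5-}} V_i . \]

In the same way, we can define $\bar{V}_{\mathrm{CD5+}}$ and $\bar{V}_{\mathrm{MCL}}$. The overall average is

\[ \bar{V} = \frac{1}{N} \sum_{i=1}^{N} V_i . \]

The between-class sum of squares/within-class sum of
squares (BSS/WSS) ratio for vector V with classes W is
defined as:

\[ \mathrm{BSS/WSS}(V, W) = \frac{\sum_{c} n_c \, (\bar{V}_c - \bar{V})^2}{\sum_{c} \sum_{i:\, W_i = c} (V_i - \bar{V}_c)^2}, \qquad c \in \{\mathrm{CD5-}, \mathrm{CD5+}, \mathrm{MCL}\}. \]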
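
As an illustration of the BSS/WSS ratio, the following is a minimal Python sketch, assuming the standard between-class over within-class sum-of-squares form. The function name bss_wss and the toy expression values are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def bss_wss(values, labels):
    """Between-class / within-class sum-of-squares ratio for one gene.

    values: expression of the gene over the patients.
    labels: class label of each patient (same length as values).
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    groups = defaultdict(list)
    for v, w in zip(values, labels):
        groups[w].append(v)
    overall = sum(values) / len(values)
    bss = 0.0
    wss = 0.0
    for members in groups.values():
        class_mean = sum(members) / len(members)
        bss += len(members) * (class_mean - overall) ** 2
        wss += sum((v - class_mean) ** 2 for v in members)
    # A gene that is constant within every class has no within-class spread.
    return float("inf") if wss == 0.0 else bss / wss


# Hypothetical toy data: 3 patients per class.
labels = ["CD5-"] * 3 + ["CD5+"] * 3 + ["MCL"] * 3
separated = [0.0, 0.1, -0.1, 1.0, 1.1, 0.9, 2.0, 2.1, 1.9]
mixed = [0.0, 1.0, 2.0, 0.1, 1.1, 2.1, -0.1, 0.9, 1.9]
print(bss_wss(separated, labels), bss_wss(mixed, labels))
```

A gene whose values group tightly by class gets a large ratio, while a gene whose values vary mostly within classes gets a ratio near zero, which is the ordering used to rank genes.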

Cluster identifications by gene shaving algorithm. We ranked
genes according to the BSS/WSS ratio and retained a large
number of genes (n=280), i.e., those having higher importance
for {CD5-, CD5+, MCL} classification. The 280 genes were
analyzed using the unsupervised gene shaving clustering procedure
(12). The gene shaving method is designed to extract
coherent clusters of genes orthogonal to the previous clusters,
and it automatically determines the size of the clusters. Briefly,
the gene shaving algorithm consists of the following steps:
1) Start with the entire expression table X, in our case 280 feature-genes (rows) and 30 patients (columns).
2) Center the rows (subtract the row mean).
3) Compute the leading principal component of the rows.
4) Shave off (remove rows from the table) a proportion $\alpha$ (typically 10%) of the rows $x_i$ with the smallest absolute inner product with the leading principal component.
5) Repeat 3 and 4 until one gene remains.
6) From the nested sequence $S_{280} \supset \ldots \supset S_k \supset \ldots \supset S_1$, where $S_k$ denotes a cluster of k genes, estimate the optimal cluster size $\hat{k}$ by maximizing the gap statistic $\mathrm{Gap}(k) = D_k - \bar{D}^*_k$, where $D_k$ is the square of the between-to-within variance ratio for the kth member of the sequence and $\bar{D}^*_k$ is the average of the same measure for the clusters $S_k^{*b}$ obtained when $X^{*b}$, b = 1,2,...,B, are matrices with the rows of X permuted.
7) Orthogonalize the rows of X with respect to $\bar{x}_{S_{\hat{k}}}$, the average gene.
8) Repeat steps 1-6 to find the next optimal cluster.
The significance of each cluster related to the classes {CD5-,
CD5+, MCL} is computed using the BSS/WSS criterion applied
to the average gene $\bar{x}_{S_k}$ of the cluster by using the following
power of discrimination indicator:

\[ C(S_k) = \mathrm{BSS/WSS}(\bar{x}_{S_k}, W), \]

where W is the vector of labels. We obtained 22 clusters of different size.
Clusters 1, 2, 6, 8, 19 and 20 were, according to the value of
$C(S_k)$, the most informative for the three-class discrimination.
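
Steps 1-5 of the shaving procedure can be sketched in Python as below. This is only an illustrative sketch, not the implementation used in the study: the function names are hypothetical, the leading principal component is approximated by power iteration, and the gap-statistic selection (step 6) and the orthogonalization (step 7) are omitted.

```python
import math


def center_rows(X):
    """Subtract each row's mean (step 2)."""
    centered = []
    for row in X:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered


def leading_pc(rows, iters=200):
    """Unit vector over the columns: leading principal component of the
    rows, approximated by power iteration on X^T X (step 3)."""
    start = max(rows, key=lambda r: sum(v * v for v in r))
    norm = math.sqrt(sum(v * v for v in start))
    if norm == 0.0:
        raise ValueError("all rows are constant")
    u = [v / norm for v in start]
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(r, u)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(len(u))]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            break
        u = [v / norm for v in w]
    return u


def shave_sequence(X, alpha=0.1):
    """Return the nested sequence of row-index sets produced by steps 1-5."""
    rows = center_rows(X)
    active = list(range(len(rows)))
    sequence = [list(active)]
    while len(active) > 1:
        u = leading_pc([rows[i] for i in active])
        score = {i: abs(sum(a * b for a, b in zip(rows[i], u))) for i in active}
        # Shave a proportion alpha of the rows, at least one, never all.
        n_drop = min(max(1, int(alpha * len(active))), len(active) - 1)
        keep = sorted(active, key=lambda i: score[i], reverse=True)
        active = sorted(keep[:len(active) - n_drop])
        sequence.append(list(active))
    return sequence


# Hypothetical toy table: rows 0-2 share one pattern, rows 3-5 are weak noise.
s = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
X = [s, [2 * v for v in s], [-v for v in s],
     [0.1, -0.1, 0.1, -0.1, 0.1, -0.1],
     [0.05, 0.05, -0.1, -0.1, 0.05, 0.05],
     [-0.1, 0.1, 0.1, -0.1, -0.1, 0.1]]
sequence = shave_sequence(X)
print(sequence)
```

Because rows are ranked by the absolute inner product, genes that are strongly anti-correlated with the leading component are retained together with the correlated ones.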

Figure 1. Three clusters of genes identified by gene shaving. For each cluster, the average gene is displayed twice: in the original order of patients and in sorted
form, the latter showing that similar lymphoma subtypes correspond to close values of the average gene. The first cluster of genes (upper left) is dominantly
involved in signal transduction. Most CD5- DLBCL cases had high expression of the average gene, whereas most MCL cases had lower expression values.
The second cluster (lower left) has the highest discrimination power. It can be considered enriched for metabolism and signal transduction functions. Cluster 6
(right) is a mixture of genes with all functions, among which the signal transduction function is enriched. The gene expression pattern of clusters 1 and 6 (both
enriched in signal transduction) indicates that CD5- DLBCLs are still heterogeneous although these nine cases consist of nodal CD5- DLBCL, suggesting the
presence of subgroups. The gene names corresponding to each row are listed on the vertical axes (those with only an accession no. are available in the supplementary
tables and are presented here with empty spaces). The class labels of each patient are listed on the horizontal axis, below the first compact image. The last row
in the compact image, named "average", shows the averages of the genes along the patients. We re-ordered the patients such that the values of the average
gene are increasing, then listed the re-ordered patient labels horizontally and drew a row below with colors representing the values of the sorted average
gene. That row is strongly green at the left end, then shades off and changes to red at the right end; being sorted, it makes it easier to observe whether
a specific lymphoma class group has higher or lower expression values.

Functional grouping of genes. We annotated the functions of
genes from each cluster according to the SOURCE program
of Stanford University (13). We checked whether a
function was enriched in a cluster as compared to the full population by
counting the number of genes having that function.
The probability p of occurrence of a certain function in
the whole population can be estimated as the number of
occurrences over the total number of genes. For a cluster of
size n where the function occurs $x_C$ times, the function is said
to be enriched if it appears unusually often according to the
level of decision $\alpha$. Modeling the number of occurrences x as
binomial (n Bernoulli trials with success probability p), let $x_H$ denote
the threshold for which $P(x > x_H) < \alpha$ and $x_L$ the threshold for
which $P(x < x_L) < \alpha$. We reject the hypothesis that in the cluster
the function is represented similarly to the entire population,
and we conclude that a function is enriched in a group, when
$x_C > x_H$, while we call rare cases those when $x_C < x_L$.
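
The enrichment test can be sketched as a binomial tail computation. This is a minimal Python sketch under the assumption that the occurrence count in a cluster of size n is binomial with the population probability p; the function names are hypothetical. Cut-offs obtained this way depend on the tail convention (strict or non-strict inequality, one-sided or two-sided level), so they need not coincide with thresholds quoted elsewhere in the text.

```python
from math import comb


def upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(max(k, 0), n + 1))


def enrichment_threshold(n, p, alpha=0.05):
    """Smallest x_H such that P(X > x_H) < alpha."""
    for x_h in range(n + 1):
        if upper_tail(n, p, x_h + 1) < alpha:
            return x_h
    return n


def is_enriched(x_c, n, p, alpha=0.05):
    """True when the observed count x_c exceeds the threshold x_H."""
    return x_c > enrichment_threshold(n, p, alpha)


# Example: a function carried by 35 of 280 genes, observed 8 times
# in a cluster of 24 genes.
print(upper_tail(24, 35 / 280, 8), is_enriched(8, 24, 35 / 280))
```

The lower threshold for rare functions follows in the same way from the lower tail P(X < x_L).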
Results
Identification of gene clusters using gene shaving method.
Thousands of genes are profiled in a microarray experiment.
However, not all genes contribute to the difference of subjects
under study and many genes behave in a similar manner, which
potentially carries biological meaning. In the current study, we
were interested in finding gene clusters that represent the
separation of three lymphoma subtypes. We first selected 280
genes that had discriminatory power among the 30 lymphoma
cases. Then, we clustered the 280 genes according to their
gene expressions using the gene shaving method.

Gene shaving is an unsupervised clustering method, which
uses only the gene expressions for grouping into clusters and
the information on the disease type (i.e., class label) is not
used. The particular algorithm for establishing the sizes of
the clusters and refining the principal components by focusing
on smaller sets of genes was summarized in the Methods.
When sorting the clusters in decreasing order of power
of discrimination in the ternary classification problem (to
distinguish all three lymphoma classes), we found that the
first, second, and sixth clusters performed extremely well,
while the others had either little value or were very small in the
number of feature-genes. We concentrated on the clusters 1,
2, and 6 since the cluster 8 has only 2 genes and cluster 19
has only 3 genes. The gene names corresponding to each row
are listed on the vertical axes of Fig. 1. The class label of
each patient is listed on the horizontal axis, below the first
compact image. The last row in the compact image is named
"average" and shows the averages of the genes along the
patients. We then re-ordered the patients such that the values of
the average gene are increasing, listed the re-ordered patient
labels horizontally, and drew a row below with colors
representing the values of the sorted average gene. That
row is strongly green at the left end, then shades off
and changes to red at the right end. For cluster 1, most CD5-
DLBCL cases had low expressions of the average gene,
whereas most MCL cases had higher values. Thus, the average
gene was discriminative for the three types of lymphomas,
although the grouping of the genes in the cluster was unsupervised.
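
The re-ordering of patients by the average gene amounts to a column-wise mean followed by a sort. A minimal Python sketch, with hypothetical function names and toy values, is:

```python
def average_gene(cluster_rows):
    """Column-wise mean of the genes (rows) in one cluster."""
    n_genes = len(cluster_rows)
    return [sum(col) / n_genes for col in zip(*cluster_rows)]


def sort_patients(cluster_rows, labels):
    """Order patients by increasing value of the cluster's average gene."""
    avg = average_gene(cluster_rows)
    order = sorted(range(len(avg)), key=lambda j: avg[j])
    return [labels[j] for j in order], [avg[j] for j in order]


# Hypothetical toy cluster: 2 genes measured on 3 patients.
rows = [[2.0, -1.0, 0.5],
        [1.0, -3.0, 0.5]]
labels = ["MCL", "CD5-", "CD5+"]
sorted_labels, sorted_values = sort_patients(rows, labels)
print(sorted_labels, sorted_values)
```

If patients of one class end up adjacent in the sorted label list, the average gene separates that class from the others even though the class labels were never used to form the cluster.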
Functional characterization of the genes in the cluster. We
annotated the functions of genes from each cluster and looked
for dominant functions represented in each cluster. We checked
if in a cluster a function is enriched as compared to the full
population, by counting the number of genes having that
function.
Cluster 1 has a high power of discrimination, with C(S1)=
1.1311, and has 24 genes. The signal transduction function
has probability p=35/280=0.125 in the whole set, and the function
should be considered enriched at the significance threshold
α=0.05 for a number >7 occurrences in a cluster of size 24.
Cluster 1 contains 8 genes having a signal transduction type
of function, which therefore can be considered an enriched
function.
Cluster 2 contains 16 genes and has an even higher
discrimination power, C(S2)=1.6388. Functionally, this cluster
can be considered enriched for metabolism and signal
transduction functions at the significance threshold α=0.05.
To see that, we note that for the metabolism function p=53/280,
and at a size of 16 genes, a cluster is enriched if it contains at
least 6 genes with that function, which is true for cluster 2.
For the signal transduction function, enrichment is obtained if
the cluster contains more than 5 genes having the function,
which again is true for cluster 2. One can easily observe from
Fig. 1 that all genes in cluster 2 are highly expressed for
patients with the CD5+ subtype.
Cluster 6 has significance coefficient C(S6)=0.9043. We
observed a mixture of genes with all functions, among which
the signal transduction function is enriched. The average gene
is especially highly expressed for patients 5-9 (all CD5- DLBCL).
As an interesting fact, the cluster gathered 3
apoptosis genes (TNFAIP3, IL1B, HDAC3), 2 tumor
suppressors (TPD52 and DLG1, considered not to belong to
any of the six function-classes), and 5 genes from cell cycle
and proliferation, classes that are seldom represented in
the whole set.
Cluster 8 has significance C(S8)=1.2829 and groups two
genes, ASAH1, which hydrolyzes the sphingolipid ceramide
into sphingosine and free fatty acid (belonging to metabolism)
and the other coding for the KIAA0379 protein, whose function
is unknown. Small groupings like this cluster should be
statistically tested on larger experiments. The genes in cluster 8
are highly expressed for the patients in MCL class.
Fig. 2 summarizes all 22 clusters; its right side shows a
zoomed view of the three clusters (1, 2, and 6). We next
characterized the genes in the three best
clusters and identified the dominant functional features. In
cluster 1, genes involved in signal transduction are enriched.
In cluster 2, genes involved in metabolism and signal transduction are enriched. In cluster 6, genes in signal transduction,
apoptosis, and cell cycle are enriched.
The three lymphoma classes can be grouped in four
combination scenarios (CD5- DLBCLs vs. others; de novo
CD5+ DLBCLs vs. others; MCLs vs. DLBCLs, and each
class separately). How each cluster discriminates the three
lymphoma subtypes in the two-class problems and in the
three-class problem is shown in Fig. 3.
We observe the following: cluster 2, which is enriched in
genes belonging to the metabolism and signal transduction
functions, is extremely good in discriminating between de novo
CD5+ DLBCLs and others; cluster 1, which is enriched in
genes belonging to the signal transduction function, performs a
good discrimination of MCLs vs. DLBCLs; cluster 6, enriched
in genes belonging to signal transduction, apoptosis, and
cell cycle, is able to discriminate well CD5- DLBCLs
vs. others. Drawing a 3D representation in Fig. 4
with the average genes of clusters 1, 2 and 6 on the axes, we
observe that these genes are able to discriminate the disease
types quite well.
Multidimensional scaling (MDS) representation of cellular
functions of genes. The above analysis of the gene
clusters identified by gene shaving showed that genes of
different functional groups have different discriminative power
in separating the three types of lymphomas. To test this in a
more global manner, we used the whole set of 280 genes,
each of which individually exhibits high discrimination of the
three lymphoma classes. First, we picked those genes that are
involved in six cell functions: metabolism, signal transduction, transcription and DNA binding, cell cycle and
cell proliferation, adhesion and cell migration, and apoptosis.
We examined for each group of genes how well MDS could
separate the lymphoma classes (Fig. 5). The grouped genes
from the metabolism function are the best in separating the
three lymphoma subtypes. Genes from signal transduction
are the second best, and genes from the other functional groups
separate less well. Thus, the MDS-based analysis
corroborates the results we obtained from the gene shaving
method.
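
The text does not state which MDS variant was used. As an illustration only, the sketch below assumes classical (Torgerson) MDS on Euclidean distances between patients, each patient being described by the expression values of the genes in one functional group. All function names are hypothetical, and the eigenvectors are approximated by power iteration with deflation to keep the example dependency-free.

```python
import math


def pairwise_sq_dists(points):
    """Squared Euclidean distances between all pairs of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def top_eigenpairs(B, k, iters=500):
    """Top-k eigenpairs of a symmetric positive semidefinite matrix,
    approximated by power iteration with deflation."""
    n = len(B)
    B = [row[:] for row in B]
    pairs = []
    for _ in range(k):
        v = [math.sin(i + 1.0) for i in range(n)]  # fixed, generic start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(points, dims=2):
    """Embed the points in `dims` dimensions, preserving distances as well
    as possible in the least-squares sense of classical MDS."""
    n = len(points)
    d2 = pairwise_sq_dists(points)
    row_mean = [sum(r) / n for r in d2]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    B = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d, (lam, v) in enumerate(top_eigenpairs(B, dims)):
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
    return coords


# Hypothetical toy data: four "patients" described by three "genes".
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
coords = classical_mds(points, dims=2)
print(coords)
```

Running the embedding once per functional gene group and plotting the two coordinates with the class labels gives the kind of per-function comparison described in this section.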

Figure 2. Clustering structure resulting from the gene shaving algorithm. We sorted the clusters in descending order of their discriminatory power, with
clusters 2, 8, 1, 19 and 6 discriminating the three lymphoma classes best. These clusters are enriched in metabolic (cluster 2) and signal transduction
(clusters 2, 1 and 6) functions. In the right side image, the names of the genes corresponding to the rows of clusters 2, 1, and 6 are listed on the vertical axes.
The columns represent the gene expression profiles of the patients, their labels coinciding with the unsorted profiles in Fig. 1.

Figure 3. Classification problem. Three classes can be grouped into four
possible combinations. The power of discrimination of genes in the best
discriminatory clusters is shown for the two-class problems (CD5- DLBCLs vs.
others; de novo CD5+ DLBCLs vs. others; and MCLs vs. DLBCLs) and the
three-class problem (each class separately). In the two-class problems, cluster 1
concentrates power of discrimination for MCLs vs. DLBCLs; cluster 2 for
de novo CD5+ DLBCLs vs. others, and cluster 6 for CD5- DLBCLs vs. others.

Figure 4. 3D representation of patients considering the average genes of clusters 1,
2 and 6. The gene shaving algorithm steers the average genes approximately
parallel to the principal components (PC) of the data set. The average genes
of clusters 1, 2 and 6, represented on the three axes, result in good separation
of the three classes. We represent CD5- DLBCL with red circles, de novo
CD5+ DLBCL with green diamonds, and MCL class with blue triangles.

Discussion
Lymphoma is a heterogeneous disease (1). Depending on
the similarities to the normal cells during differentiation,
lymphomas are classified into different subtypes (1). Genetic
studies at the chromosomal level revealed some recurrent chromosomal translocations further defining subtypes (1). Expression
of surface markers such as CD5 and CD10 in B-cell neoplasms
has been useful in the identification of additional subclasses
of lymphomas (1), although not all markers distinguish
tumors into clinically relevant subtypes. Recent transcriptome
studies have been another successful method to subclassify
lymphomas, an example being the separation of germinal
center B-cell like DLBCL, activated B-cell like DLBCL
(2), and a recently identified type 3 (14). However, the
prognostic implication of the two subgroups is not yet firmly
established. The original study showed differential survival of
the two groups, but an independent study by a second research
group did not observe such a difference (15).

Figure 5. Multidimensional scaling (MDS) representations for the separation by functional gene groups. From the 280 retained genes having a high discrimination value for the
three-class lymphoma problem, we separated six groups of genes, one for each of the six functional classes: metabolism, signal transduction, transcription
and DNA binding factors, cell cycle and cell proliferation, adhesion factors and migration, and apoptosis. The best separation of the three lymphoma classes
is obtained with the genes having metabolic functions; the next best is the one with signal transduction genes. For the remaining four functional groups, the MDS
representations show a poorer separation of the lymphoma cases. Patients are labeled as follows: CD5- DLBCL with red circles, de novo CD5+ DLBCL with green
diamonds, and MCL class with blue triangles.

Therefore, recognition of subgroups of a disease is only
the first step in understanding the biology underlying the
disease heterogeneity. Even if the clinical prognosis of
the disease subgroups is apparent, the molecular basis for
such a difference in most cases is unknown. With increasing
accumulation of functional knowledge of the genes in the
genome and the newly acquired capability to measure
gene expression activity in cells, such molecular bases have
begun to emerge.
We used the gene expression profiling method to compare
transcriptomes of de novo CD5+ DLBCL, CD5- DLBCL,
and MCL. When focusing on the most differentially expressed
genes between the two subtypes of DLBCL, we found
integrin β1 and CD36, among others (8). We speculated that
overexpression of integrin β1 and CD36 may account at
least partially for the more invasive nature of de novo CD5+
DLBCL. However, MCL, which expresses CD5 and has an
invasive nature, overexpresses nuclear cyclin D1, which is
different from de novo CD5+ DLBCL (1,8). Thus, the
differences and similarities among the three subtypes of
lymphomas are complex and involve multiple genes. Whereas
it may be extremely valuable to focus on individual feature
genes, such an approach does not provide global insight
into the functional pathways that holistically reflect the
differences and similarities among different diseases.
The present study used the gene shaving method to identify
gene clusters that are associated with the classification of
the three subtypes of lymphoma, allowing us to scrutinize
the gene function makeup of the gene clusters. Interestingly,
among the top two gene clusters, metabolism-related genes
and signal transduction genes are dominantly represented.
The third top cluster is also enriched in signal transduction
genes, in addition to cell cycle and apoptosis genes. Further,
we took another approach by including all the 280 informative
genes (from which the clusters were selected) and grouped
those genes based on their known cellular functions. Then,
we performed MDS analysis using genes in each of the
functional groups. Notably, metabolism and signal transduction genes were among the best groups of genes that
separate the three subtypes of B-cell lymphoma. The apoptosis
gene group, in contrast, did not separate the classes well.
These results suggest that the key cellular pathways that set
apart the three different lymphomas are metabolism and signal
transduction.
This may be surprising given that the cancer field has
focused much on cell cycle and apoptosis. However, a look
back at the history will reveal that in the middle of the last
century, Otto Warburg proposed a hypothesis that the cause
of cancer is primarily a defect in energy metabolism (16).
Extending this hypothesis, different subtypes of cancer may
well have different defects in energy metabolism. The fact
that we observe metabolism as the key difference at the gene
level among the three lymphomas provides a link from genotype to phenotype.
A scrutiny of the genes in the three clusters revealed some
interesting insights. MCL is an established category of B-cell
neoplasm characterized by recurrent cytogenetic abnormality
of t(11;14), and nuclear cyclin D1 expression, which was also
shown in our study (8). Gene expression profiles in cluster 1
show more difference between MCLs and DLBCLs (Fig. 2).
Interestingly, ICAM1 (CD54), which is an adhesion molecule,
is expressed highly in DLBCLs, although MCLs are very
invasive to extranodal organs. This finding suggests that extranodal involvement of MCL is regulated by different adhesion
molecules. It has been reported that expression of the homing receptor α4β7
integrin predicts digestive tract involvement
in MCL (17). The Purkinje cell protein 4 (PCP4) in cluster 1,
which is known to be up-regulated in the process of B-cell
anergy (18), is expressed higher in DLBCL than in MCL.
This observation is consistent with the notion that most
MCLs belong to the naive B-cell population. Myeloid cell
nuclear differentiation antigen (MNDA) is reported to be
weakly expressed in hairy cell leukemia, parafollicular B-cell
lymphoma, MCL, and small lymphocytic lymphoma, and it
is negative in germinal center cells or plasma cells (19). Cell
sorting of normal bone marrow showed MNDA expression in
CD20+/CD10-/CD5- B-cells. The latter evidence corresponds to
our finding that MNDA is mainly expressed in CD5- DLBCL
(19). Hypoxia-inducible factor 1, α-subunit (HIF1A) expression
is reported in 70% of patients with non-Hodgkin's lymphoma
(20). In the present study, HIF1A expression is mainly seen in
de novo CD5+ DLBCL suggesting that its expression influences
angiogenesis.
Among genes in cluster 2, protein tyrosine kinase 2/focal
adhesion kinase (FAK) is overexpressed in de novo CD5+
DLBCLs, and this gene is known to integrate growth-factor
and integrin signals to promote cell migration (21). We have
shown that integrin β1 was overexpressed in de novo CD5+
DLBCLs (8). Overexpression of FAK may relate to the high
incidence of integrin β1 overexpression, resulting in extranodal involvement and intravenous/intrasinusoidal infiltration
of this lymphoma. Another interesting gene in cluster 2 is
pim-1. It has been reported that pim-1 transgenic mice develop
T-cell lymphoma (22,23) and that bovine leukemia virus
induces CD5+ B lymphocytosis and CD5+ B-cell leukemia or
lymphoma associated with elevation of pim-1 and c-myc in
cattle (24). Electron transfer flavoprotein and acyl-coenzyme A
dehydrogenase are known to interact with each other and are
involved in human diseases (25). Both genes are in cluster 2,
and electron transfer flavoprotein is overexpressed in de novo
CD5+ DLBCL and acyl-coenzyme A dehydrogenase is high
in CD5- DLBCL, although the exact significance of this
expression pattern is not clear at present. It has been reported
that transfer of aldehyde-dehydrogenase gene induced cyclophosphamide-resistance (26). Overexpression of aldehyde-dehydrogenase in de novo CD5+ DLBCL may partially
contribute to chemo-resistance of this lymphoma (4).
The apoptosis gene group and the cell cycle and
proliferation gene group did not separate these three subtypes of
lymphoma well, but the similarity of de novo CD5+ DLBCLs
and MCLs may account for the poor prognosis of these two
subtypes of lymphoma. In addition, the gene expression
patterns of clusters 1 and 6 indicate that CD5- DLBCLs are
still heterogeneous, although these nine cases consist of
nodal CD5- DLBCL, suggesting the presence of other subgroups.
In summary, this study, examining the genes that
distinguish the three subtypes of lymphomas, has offered
insight into the molecular processes that discriminate the
different clinical entities of the disease. It is quite interesting
that metabolism and signal transduction may be at the center
of lymphomagenesis.
Acknowledgments
This work was partially supported by the grants (nos.
13671059, 12217064, and 12217062) from Japanese Ministry
of Education, Sports, Science and Technology; the Tobacco
Settlement Fund to M.D. Anderson Cancer Center (MDACC)
as appropriated by the Texas Legislature; a grant from
Kadoorie Foundation to MDACC; CCSG grant from NIH/
NCI, and a grant from Academy of Finland.
References
1. Jaffe ES, Harris NL, Stein H and Vardiman JW (eds): World
Health Organization Classification of Tumours. Pathology and
Genetics, Tumours of Haematopoietic and Lymphoid Tissues.
IARC Press, Lyon, pp119-187, 2001.


MIRCEAN et al: FUNCTIONAL GENE GROUPS SEPARATING LYMPHOMAS

2. Alizadeh AA, Eisen MB, Davis RE, et al: Distinct types of diffuse
large B-cell lymphoma identified by gene expression profiling.
Nature 403: 503-511, 2000.
3. Hofmann W-K, De Vos S, Tsukasaki K, Wachsman W,
Pinkus GS, Said JW and Koeffler HP: Altered apoptosis pathways in mantle cell lymphoma detected by oligonucleotide microarray. Blood 98: 787-794, 2001.
4. Yamaguchi M, Seto M, Okamoto M, et al: De novo CD5+ diffuse
large B-cell lymphoma: a clinicopathologic study of 109 patients.
Blood 99: 815-821, 2002.
5. Kume M, Suzuki R, Yatabe Y, et al: Somatic hypermutations in
the VH segment of immunoglobulin genes of CD5-positive diffuse
large B-cell lymphoma. Jpn J Cancer Res 88: 1087-1093, 1997.
6. Taniguchi M, Oka K, Hiasa A, Yamaguchi M, Ohno T, Kita K
and Shiku H: De novo CD5+ diffuse large B-cell lymphomas
express VH genes with somatic mutation. Blood 91: 1145-1151,
1998.
7. Nakamura N, Hashimoto Y, Kuze T, Tasaki K, Sasaki Y, Sato M
and Abe M: Analysis of the immunoglobulin heavy chain gene
variable region of CD5-positive diffuse large B-cell lymphoma.
Lab Invest 79: 925-933, 1999.
8. Kobayashi T, Yamaguchi M, Kim S, et al: Microarray reveals
differences in both tumors and vascular specific gene expression
in de novo CD5+ and CD5- diffuse large B-cell lymphomas.
Cancer Res 63: 60-66, 2003.
9. Taylor E, Cogdell D, Coombes K, et al: Sequence verification
as quality control step for production of cDNA microarray.
Biotechniques 31: 62-65, 2001.
10. Shmulevich I, Hunt K, El-Naggar A, et al: Tumor specific gene
expression profiles in human leiomyosarcoma: an evaluation of
intratumor heterogeneity. Cancer 94: 2069-2075, 2002.
11. Hu L, Wang J, Baggerly K, et al: Obtaining reliable information
from minute amounts of RNA using cDNA microarrays. BMC
Genomics 3: 16, 2002.
12. Hastie T, Tibshirani R, Eisen M, et al: Gene shaving: a new
class of clustering methods for expression arrays. Technical
report. Stanford University, Stanford, 2000.
13. Diehn M, Sherlock G, Binkley G, et al: SOURCE: a unified
genomic resource of functional annotations, ontologies, and
gene expression data. Nucleic Acids Res 31: 219-223, 2003.
14. Rosenwald A, Wright G, Chan WC, et al: The use of molecular
profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med 346: 1937-1947, 2002.
15. Shipp MA, Ross KN, Tamayo P, et al: Diffuse large B-cell
lymphoma outcome prediction by gene-expression profiling and
supervised machine learning. Nat Med 8: 68-74, 2002.
16. Warburg O: On the origin of cancer cells. Science 123: 309-314,
1956.
17. Geissmann F, Ruskone-Fourmestraux A, Hermine O, et al:
Homing receptor α4β7 integrin expression predicts digestive
tract involvement in mantle cell lymphoma. Am J Pathol 153:
1701-1705, 1998.
18. Glynne R, Ghandour G, Rayner J, Mack DH and Goodnow CC:
B-lymphocyte quiescence, tolerance, and activation as viewed
by global gene expression profiling on microarrays. Immunol Rev
176: 216-246, 2000.
19. Miranda RN, Briggs RC, Shults K, Kinney MC, Jensen RA and
Cousar JB: Immunocytochemical analysis of MNDA in tissue
sections and sorted normal bone marrow cells documents
expression only in maturing normal and neoplastic myelomonocytic cells and a subset of normal and neoplastic B lymphocytes. Hum Pathol 30: 1040-1049, 1999.
20. Stewart M, Talks K, Leek R, Turley H, Pezzella F, Harris A and
Gatter K: Expression of angiogenic factors and hypoxia inducible
factors HIF 1, HIF 2 and CA IX in non-Hodgkin's lymphoma.
Histopathology 40: 253-260, 2002.
21. Sieg DJ, Hauck CR, Ilic D, Klingbeil CK, Schaefer E,
Damsky CH and Schlaepfer DD: FAK integrates growth-factor
and integrin signals to promote cell migration. Nat Cell Biol 2:
249-256, 2000.
22. Van Lohuizen M, Verbeek S, Krimpenfort P, Domen J, Saris C,
Radaszkiewicz T and Berns A: Predisposition to lymphomagenesis in pim-1 transgenic mice: cooperation with c-myc and
N-myc in murine leukemia virus-induced tumors. Cell 56:
673-682, 1989.
23. Breuer M, Slebos R, Verbeek S, Lohuizen M, Wientjens E and
Berns A: Very high frequency of lymphoma induction by a
chemical carcinogen in pim-1 transgenic mice. Nature 340:
61-63, 1989.
24. Stone DM, Norton LK, Magnuson NS and Davis WC: Elevated
pim-1 and c-myc proto-oncogene induction in B lymphocytes
from BLV-infected cows with persistant B lymphocytosis.
Leukemia 10: 1629-1638, 1996.
25. Amendt BA and Rhead WJ: The multiple acyl-coenzyme A
dehydrogenation disorders, glutaric aciduria type II and ethylmalonic-adipic aciduria: mitochondrial fatty acid oxidation,
acyl-coenzyme A dehydrogenase, and electron transfer flavoprotein activities in fibroblasts. J Clin Invest 78: 205-213, 1986.
26. Magni M, Shammah S, Schiro R, Mellado W, Dalla-Favera R
and Gianni AM: Induction of cyclophosphamide-resistance by
aldehyde-dehydrogenase gene transfer. Blood 87: 1097-1103,
1996.

Publication no. 7
Fuller GN, Hess KR, Mircean C, Tabus I, Shmulevich I, Rhee CH,
Aldape KD, Bruner JM, Sawaya RA, Zhang W.
Chapter 14: Human Glioma Diagnosis From Gene Expression Data
in Computational and Statistical Approaches to Genomics
Kluwer Academic Publisher 2002 ISBN: 1-4020-7023-3

Publication no. 8
Mircean C, Shmulevich I, Cogdell D, Choi W, Jia Y, Tabus I,
Hamilton SR, Zhang W.
Robust estimation of protein expression ratios with lysate
microarray technology.
Bioinformatics. 2005 May 1;21(9):1935-42. Epub 2005 Jan 12.

BIOINFORMATICS

ORIGINAL PAPER

Vol. 21 no. 9 2005, pages 1935-1942


doi:10.1093/bioinformatics/bti258

Gene expression

Robust estimation of protein expression ratios with lysate
microarray technology
Cristian Mircean1,2, Ilya Shmulevich1,*, David Cogdell1, Woonyoung Choi1, Yu Jia1,
Ioan Tabus2, Stanley R. Hamilton1 and Wei Zhang1
1 Department of Pathology, University of Texas M.D. Anderson Cancer Center, Houston, TX, USA and
2 Institute of Signal Processing, Tampere University of Technology, Tampere, Finland

Received on June 24, 2004; revised on December 21, 2004; accepted on December 29, 2004
Advance Access publication January 10, 2005

ABSTRACT
Motivation: The protein lysate microarray is a developing proteomic
technology for measuring protein expression levels in a large number of biological samples simultaneously. A challenge for accurate
quantification is the relatively narrow dynamic range associated with
the commonly used chromogenic signal detection system. To facilitate accurate measurement of the relative expression levels, each
sample is serially diluted and each diluted version is spotted on
a nitrocellulose-coated slide in triplicate. Thus, each sample yields
multiple measurements in different dynamic ranges of the detection
system. This study aims to develop suitable algorithms that yield
accurate representations of the relative expression levels in different
samples from multiple data points.
Results: We evaluated two algorithms for estimating relative protein
expression in different samples on the lysate microarray by means of a
cross-validation procedure. For this purpose as well as for quality control we designed a 1440-spot lysate microarray containing 80 identical
samples of purified bovine serum albumin, printed in triplicate with six
2-fold dilutions. Our analysis showed that the algorithm based on a
robust least squares estimator provided the most accurate quantification of the protein lysate microarray data. We also demonstrated our
methods by estimating relative expression levels of p53 and p21 in
either p53+/+ or p53-/- HCT116 colon cancer cells after two drug
treatments and their combinations on another lysate microarray.
Availability: http://www.cs.tut.fi/mirceanc/lysate_array_bioinformatics.htm
Contact: is@ieee.org

INTRODUCTION
Despite the enormous genomic complexity of most organisms, and in
particular humans, the complexity is further increased at the protein
level as a result of posttranslational modifications, such as phosphorylation, acetylation and ubiquitination, which can appreciably
impact the functional state of proteins. Thus, it is not only the levels
of proteins but also their modification status that will have to be
studied in order to gain a deeper understanding of biological systems. A number of proteomic technologies have been developed
that allow researchers to study proteins in a high-throughput fashion. Among these is the protein microarray, which may appear in
different formats depending on what substrates are deposited on the
solid matrix (MacBeath and Schreiber, 2000). For example, various antibodies can be spotted on the slides to produce the so-called
antibody array (Ivanov et al., 2004).

*To whom correspondence should be addressed.
Another format is called the reverse-phase protein lysate microarray (or simply, lysate array) where cellular lysates or purified proteins
are arrayed on nitrocellulose-coated slides and probed with antibodies that recognize various proteins and their modified products
(Paweletz et al., 2001). Several recent studies report the use of lysate arrays for investigating signal pathways using cancer specimens
(Grubb et al., 2003; Wulfkuhle et al., 2003; Espina et al., 2003).
A lysate array spotted from a library of purified proteins can also
serve as a powerful tool for detecting protein-protein interactions.
The advantages of a lysate array include the requirement of minimal amount of protein extracts for generation of tens of arrays,
potential for automation of hybridization and considerable saving of
labor compared with western blotting experiments. Another important advantage of lysate arrays over the western blotting assay is that
multiple replicates and dilutions can be incorporated into the experimental design, thus making the protein level quantification more
accurate. This is not insignificant because the dynamic range of commonly used protein detection methods is narrow, whereas the range
of protein expression in different samples can be very large. Therefore, it is important to develop robust algorithms for analyzing the
multiple data points produced by lysate arrays.
Nishizuka et al. (2003) profiled 60 human cancer cell lines (NCI-60) with lysate arrays using serial dilutions. After removal of
outliers, dilution curves were estimated from the ten dilution points
using monotonic linear spline interpolation. The patterns of protein expression were compared with those obtained for the same
genes using both cDNA and oligonucleotide arrays. It was discovered
that structure-related proteins exhibited a high correlation between
mRNA and protein levels across the NCI-60 cell lines.
In this study, our primary objective was to develop a method for
accurately estimating the relative protein expression in two different samples from a protein lysate microarray. Toward that end, we
proposed and evaluated two different algorithms. For this purpose as
well as for quality control we designed a 1440-spot lysate microarray containing 80 identical samples of purified bovine serum albumin
(BSA), printed in triplicate with six 2-fold dilutions. We further tested
the selected algorithms to quantify the expression of selected proteins
in colon cancer cells after drug treatment.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

C.Mircean et al.

Fig. 1. The layout of the lysate arrays. The approaches proposed in this paper estimate differential protein expression. The samples are placed in patches of 18
spots, 6 dilutions and 3 replicates for each dilution, as shown in the illustration on the left. The sets T (k) and N (k) represent the measurements corresponding
to two different samples (e.g. tumor and normal), with k denoting the dilution and s denoting the replicate.

MATERIALS AND METHODS


Lysate mixtures and signal detection
Cells and protein extraction Proteins were harvested from isogenic
HCT116 cell lines provided by Dr Bert Vogelstein (Johns Hopkins University, Baltimore, MD). The p53+/+ and p53-/- cell lines were cultured in
Dulbecco's minimal essential medium (DMEM) with 10% Nu-serum (Collaborative Research Products, Bedford, MA). Lysates were collected in a
buffer containing 20 mM Tris pH 7.6, 150 mM NaCl, 5 mM EDTA, 0.5%
NP40 freshly supplemented with 0.02 mM leupeptin and 0.01 mM PMSF
after 48 h of drug treatment at IC25. Lowry assay was used to normalize the
protein concentration prior to liquid handling.

Lysate array printing BSA demonstration chips containing six 2-fold
dilutions with 80 sample replicates were made by a liquid-handling robot
RSP (TECAN US, Research Triangle Park, NC). These dilutions were then
spotted on the nitrocellulose-coated FAST (Schleicher and Schuell, Keene,
NH) or Vivid (Pall, Ann Arbor, MI) slides with 500 µm center-to-center
interspot distance using a G3 spotter (Genomic Solutions, Ann Arbor, MI)
to produce triplicate spots of all dilution points with the specified number of
transfers. HCT116 lysate arrays were created in a similar manner with 500,
250, 125, 62.5, 31.25 and 15.625 ng/µl of protein.
Detection and imaging Detection of target protein was done by a heavily modified DAKO CSA kit (DakoCytomation, Carpinteria, CA) catalog
# K1500. Briefly, the slides were blocked with reblot + Mild (Chemicon,
Temecula, CA) catalog # 2502, followed by overnight blocking with i-block
(Applied Biosystems, Foster City, CA) catalog # Tropix AI300. The next
day, blocking continued with fresh 3% hydrogen peroxide, Avidin/Biotin and
casein block. Primary antibodies p53 Ab-6 (Oncogene Research Products,
San Diego, CA) and p21/WAF1 (gift from Wade Harper, Baylor College
of Medicine, Houston, TX) were diluted in antibody dilution buffer and
incubated at 25 °C for 1 h in a humid chamber. Secondary biotinylated
antibodies BA-9200 for rabbit primaries or BA-1000 for mouse primaries
(Vector Laboratories, Burlingame, CA) were diluted 1:10 000 and incubated
as before. Final steps included streptavidin-biotin, amplification reagent and
streptavidin/peroxidase incubations from the CSA kit followed by DAB+
development. TBST washes preceded all steps for the removal of the previous reagent. Slides were scanned in color at 1200 DPI; the resulting image was
converted to a 16-bit grayscale and inverted (negative) to allow quantification
by ArrayVision (Imaging Research Inc.).

Design on slide and spotting


In order to tune the preprocessing stages of the liquid handling, spotting and
image analysis, we created slides containing purified BSA spotted in three
replicates of each sample with six 2-fold dilutions for each of 80 identical
samples resulting in 1440 spots per slide. Three replicates were included to
provide better error resilience and outlier detection (Fig. 1 gives the layout
of the lysate microarray).

Robust estimation of expression ratios


An important goal is the estimation of the ratio between expressions of specific proteins from two samples (e.g. tumor/normal or treated/untreated). We
propose and evaluate two robust approaches in comparison with standard linear regression aimed at estimating the expression ratios from lysate arrays.
In an ideal case, both robust methods should yield the same results as linear
regression.
In developing our approaches for estimating the protein expression ratios,
our overall aim was to produce results that are biologically accurate and
robust to outliers and other artifacts. We consider an outlier to be a spot that
is inconsistent either with the other replicate spots in the same dilution (e.g.
in a certain dilution two replicates are close to each other while the third is
significantly different) or with respect to all other spots, which may be caused,
e.g. by extreme saturation or a crack in the membrane affecting all spots from
the same dilution.

Protein lysate array quantification

Fig. 2. Robust estimation of the protein expression ratios from six dilutions in three replicates. The model that fits the dilutions is expected to be linear in
log-log space. (a and b) Graphical representation of the non-linear and robust least squares approaches, respectively (see text). (c) The distance between the
two fitted lines at each dilution is weighted according to the estimated spot quality at that dilution (see text).


Non-linear approach The use of triplicates rather than duplicates confers
an obvious advantage for robustly estimating the protein expression value. In
this approach, we apply the median to the triplicates because of its outlier-resilient properties, thus providing a robust estimate of protein expression
for each dilution. We then perform a least squares linear fit to the median
estimated values (one estimate per dilution and sample), since the dilution
is expected to be linear on a log-log scale as will become apparent in the
experimental section (Fig. 5). Figure 2a illustrates the approach for the sample
labeled n. The log-expression value is denoted by $n_i^{(k)}$ for the replicate i at
dilution k, where $i \in \{1, 2, 3\}$ and $k = 1, \ldots, 6$; k = 1 corresponds to
the highest concentration. For the sample labeled t, where we have the log-expression values $t_i^{(k)}$, the approach is similar. One potential drawback of the
non-linear approach described here is that when all replicate spots from one
dilution are erroneous, say, due to a scratch, the fitted line can be dramatically
affected, even though the other dilutions may be perfectly reliable. The least
squares method is optimal if the errors are normal, independent and identically
distributed. However, if this is not the case, it can perform very poorly,
especially if data are affected by outliers. Next, we propose a more robust
method that alleviates the mentioned drawbacks to a large extent.
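The median-then-fit idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `median_ls_fit`, the use of log2 relative concentration as the abscissa, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def median_ls_fit(x, y):
    """Median over replicates, then ordinary least squares.

    x: (6,) log-concentration per dilution (assumed abscissa).
    y: (6, 3) log-intensities, one row per dilution, one column per replicate.
    Returns (slope, intercept) of the fitted line.
    """
    y_med = np.median(y, axis=1)              # one robust value per dilution
    slope, intercept = np.polyfit(x, y_med, deg=1)
    return slope, intercept

# Hypothetical data: six 2-fold dilutions, k = 1 is the highest concentration.
x = -np.arange(6, dtype=float)                # log2 of relative concentration
y = np.tile((10.0 + 0.8 * x)[:, None], (1, 3))
y[2, 2] += 5.0                                # one bad replicate at one dilution
slope, intercept = median_ls_fit(x, y)
```

A single bad replicate is absorbed by the median, but if all three replicates of a dilution were shifted, the median would follow them, which is the drawback noted in the text.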
Robust least squares This method considers all 18 spots (3 replicates, 6
dilutions) for each sample in order to fit a linear model (in log-log space).
Lysate array data may be susceptible to several artifacts, such as membrane irregularities, improper spot segmentation or outliers due to saturation,
that can affect all replicates for a given dilution. Because one of the main
disadvantages of least squares is its sensitivity to outliers, a robust regression scheme
should be used. We propose the use of Huber weights with an iteratively reweighted least squares algorithm that minimizes the weighted sum
of squares, where the weight given to each value depends on its distance to
the fitted line (Huber, 1981; Street et al., 1988). The algorithm, using all 18
spots from each sample, will produce a robust linear fit in log-log space.
This method is shown in Figure 2b, where we use the same notation as in
Figure 2a.
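A minimal sketch of iteratively reweighted least squares with Huber weights is given below. It is not the authors' implementation: the tuning constant 1.345, the MAD-based scale estimate, the stopping rule and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def huber_irls(x, y, c=1.345, n_iter=50, tol=1e-10):
    """Fit y = intercept + slope * x by IRLS with Huber weights.

    c is the Huber tuning constant (assumed value); the residual scale is
    re-estimated at each iteration from the median absolute deviation.
    Returns (slope, intercept).
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary LS start
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        if scale < 1e-12:                                # essentially exact fit
            break
        u = np.abs(r) / (c * scale)
        w = np.where(u <= 1.0, 1.0, 1.0 / u)             # Huber weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta[1], beta[0]

# Hypothetical sample: 18 spots, with all three replicates of one dilution corrupted.
x = np.tile(-np.arange(6, dtype=float), 3)
y = 10.0 + 0.8 * x + 0.05 * np.sin(np.arange(18))
y[x == -3.0] += 4.0
ols_slope, ols_intercept = np.polyfit(x, y, 1)
rob_slope, rob_intercept = huber_irls(x, y)
```

Because the weights depend on the distance to the fitted line rather than on agreement among replicates, an entire corrupted dilution is down-weighted, which the median-based approach cannot achieve.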
Estimation of protein expression ratio The distance between the two
fitted lines for the two samples n and t represents the logarithm of the protein
expression ratio. However, it is clear that because of experimental variability,
the two lines (tumor and normal) will not be perfectly parallel. Our approach
is to estimate the distances at all the six dilution points and to weight them
in accordance with the estimated quality of the slide at these points. This is
based on the intuition that the higher the dispersion for a particular dilution
(over the entire array) the less the weight this dilution should get when calculating the distance between the two fitted lines and consequently, the less
the influence this dilution should have on the final estimate of the ratio of
protein expressions between the two samples. In the case of the BSA test
slide, we chose the weights to be inversely proportional to the dispersion
of the values for each dilution, as measured by the interquartile range. For
other lysate arrays, which typically contain many different samples, we used
another approach to estimate the weights by means of the coefficient of variation (CoV), which is defined as the standard deviation divided by the mean
for each dilution. Specifically, for each dilution, we calculated the mean (or
median) of the CoVs, where the mean is taken over all samples on the
array, and set the weights to be inversely proportional to these values. Finally,
the weights were normalized such that their sum is equal to one. In summary,
we proposed the following multistage method:
(1) Fit a (robust) regression line for each sample,
(2) Calculate the dilution weights,
(3) Calculate the weighted distance between the two samples.
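Steps (2) and (3) of the multistage method can be sketched as follows. The function names, the array shapes and the example numbers are hypothetical; in particular, whether the CoV is computed on raw or log intensities is not stated in the text, so the helper below simply operates on whatever values it is given.

```python
import numpy as np

def weights_from_dispersion(dispersion):
    """Weights inversely proportional to per-dilution dispersion, summing to one."""
    w = 1.0 / np.asarray(dispersion, dtype=float)
    return w / w.sum()

def mean_cov_per_dilution(values):
    """Mean coefficient of variation per dilution.

    values: (samples, dilutions, replicates) array of spot intensities.
    """
    cov = values.std(axis=2, ddof=1) / values.mean(axis=2)
    return cov.mean(axis=0)                    # average over all samples

def weighted_log_ratio(fit_t, fit_n, x, weights):
    """Quality-weighted distance between two fitted lines (slope, intercept)."""
    slope_t, icpt_t = fit_t
    slope_n, icpt_n = fit_n
    distance = (icpt_t + slope_t * x) - (icpt_n + slope_n * x)
    return float(np.sum(weights * distance))

# Hypothetical example: two parallel fitted lines one log unit apart.
x = -np.arange(6, dtype=float)
weights = weights_from_dispersion([0.1, 0.1, 0.2, 0.2, 0.4, 0.8])
log_ratio = weighted_log_ratio((0.8, 11.0), (0.8, 10.0), x, weights)
cov_shape = mean_cov_per_dilution(
    np.arange(1, 73, dtype=float).reshape(4, 6, 3)).shape
```

For parallel lines the weights do not change the result; they matter only when the two fitted lines diverge, in which case noisier dilutions contribute less to the estimate.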

RESULTS AND DISCUSSION


Experiments using protein lysate arrays involved multiple steps,
including preparation of the biological samples, construction of
lysate arrays, detection and imaging. An additional factor for
lysate arrays is that when the protein concentrations from biological
samples, such as microdissected cells, are very low, we want to
repeat the spotting multiple times on one array in order to transfer more proteins to the lysate array. This is not normally done for
cDNA microarrays and consistency of such repeated touches has
to be evaluated in our experiments. Quality control is crucial for the
success of the experiments and for evaluation of the data analysis
methods as in cDNA microarray experiments. To avoid potentially
complicating factors associated with biological materials from cells,
we first used a purified protein, BSA, as printing material to evaluate
our lysate spotting procedures and the performance of the estimation algorithms. For example, we compared the quality of the slides
after 1, 5 and 10 touches (Fig. 3), referring to the number of times
a protein is transferred to the same spot on the slide by the printing
robot. As seen in Figure 3, single-touch printing results in more outliers, especially in the low concentration range, but 10-touch printing
produces a dilution curve with an insufficiently steep slope. We found
5 touches to yield the best overall quality.

Fig. 3. A box plot showing the lower quartile, median and upper quartile
values of the logarithms of the BSA spots for each of the six dilutions (x-axis: dilutions). The
three figures, from top to bottom, compare the performance with 1, 5 and
10 touches per spot on Pall's Vivid slides. The whiskers extend to 1.5 times
the interquartile range and all values beyond the whiskers are deemed to be
outliers. Taking into account the number of outliers and the slope (implicitly
the saturations) of the dilution curve, the best quality was obtained with 5
touches. Each boxplot represents 240 spots (80 samples × 3 replicates).

Fig. 4. Comparison of the methods using the BSA array. The main figure shows the histogram of the mean squared error (MSE) between the estimated true
values, computed from 49 training samples, and the estimated values computed from one test sample performed 5776 times with random splits of the data
(limited by computational load). The lower inset shows the histograms of the errors for the proposed methods, defined as the difference between the estimated
true values and the estimated test values. As can be seen, the robust least squares method that uses all 18 spots from a sample produces dramatically smaller
errors than least squares applied to median estimated spot values at each dilution or than standard least squares. Additionally, the robust least squares method
yields Gaussian behavior of cross-validation errors (on the split 30:49:1), as shown in the upper insets, containing the histograms of the errors as well as a
normal probability plot, which is expected to appear linear for a Gaussian distribution. Standard least squares serves as a baseline comparison for either of the
robust methods. For weighting procedures in robust least squares, we tested Andrews, Cauchy, Huber, Logistic, Talwar, Tukey and Welsch with similar results.
Huber produced the smallest standard deviation of the mean error (std. ME). Also, other heuristic methods (not presented here) were tested with inferior results.

We also used the BSA array to compare the two proposed
algorithms to estimate the protein expression ratios. Since all 80
samples on the BSA array are identical, under ideal circumstances,
the protein expression ratios between any pair of samples should be
equal to one; equivalently, the log-expression differences should be
zero. Our strategy for comparing the two proposed methods with
each other and with standard linear regression is based on a crossvalidation approach whereby the true dilution curve is estimated,
by averaging from 49 samples (training set), and an error is formed
between that curve and a test dilution curve computed from one
sample (test set). The remaining 30 samples are used for computing
the weights corresponding to the quality of each dilution, as discussed earlier. The reason for reserving 30 samples for computing
only the weights is to avoid computing the quality weights from
the same samples to which these weights will be applied. Finally,
the 49:1:30 random split is performed approximately 5700 times in
order to construct the histograms of the errors (log-expression differences between the quality-weighted dilution curves). The results
are shown in Figure 4. In addition to the error, we also computed
the mean squared error (MSE), the histogram of which is also
shown in Figure 4. Our results strongly indicate that the robust least
squares approach yields a dramatically lower error than the non-linear approach and the standard linear regression. In the iteratively
reweighted least squares algorithm, there are a number of possibilities for the weight function, each protecting against outliers in
slightly different ways. We used the Huber weight function, but
also tested other variants. Figure 4 (upper inset) shows fairly comparable results for the Andrews, Cauchy, Logistic, Talwar, Tukey
and Welsch weighting schemes. Furthermore, only the robust least
squares method yields cross-validation errors that are normally distributed, as shown by the histogram and the normal probability plot
in Figure 4 for Huber weights.
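The 49:1:30 cross-validation scheme described above can be sketched as follows. This is a simplified illustration, not the authors' code: the function names and synthetic data are hypothetical, the per-sample dilution curves are approximated by replicate means instead of fitted lines, and only 200 splits are drawn to keep the example fast.

```python
import numpy as np

def split_49_1_30(n_samples, rng):
    """Random split into training (49), test (1) and weight-estimation (30) sets."""
    perm = rng.permutation(n_samples)
    return perm[:49], perm[49:50], perm[50:]

def iqr_weights(raw):
    """Weights inversely proportional to the per-dilution interquartile range.

    raw: (samples, dilutions, replicates) log-intensities.
    """
    pooled = raw.transpose(1, 0, 2).reshape(raw.shape[1], -1)
    q75, q25 = np.percentile(pooled, [75, 25], axis=1)
    w = 1.0 / (q75 - q25)
    return w / w.sum()

def cv_errors(raw, curves, n_splits=200, seed=0):
    """Quality-weighted difference between a test curve and the training average."""
    rng = np.random.default_rng(seed)
    errors = np.empty(n_splits)
    for s in range(n_splits):
        train, test, wset = split_49_1_30(raw.shape[0], rng)
        w = iqr_weights(raw[wset])             # weights from held-apart samples
        reference = curves[train].mean(axis=0) # estimate of the true curve
        errors[s] = np.sum(w * (curves[test[0]] - reference))
    return errors

# Hypothetical BSA-like array: 80 identical samples, 6 dilutions, 3 replicates.
rng = np.random.default_rng(1)
x = -np.arange(6, dtype=float)
raw = 10.0 + 0.8 * x[None, :, None] + rng.normal(0.0, 0.1, size=(80, 6, 3))
curves = raw.mean(axis=2)                      # stand-in for fitted values
errors = cv_errors(raw, curves)
train, test, wset = split_49_1_30(80, np.random.default_rng(3))
```

Since all samples are identical, the errors should scatter around zero; a histogram of `errors` corresponds in spirit to the error histograms in Figure 4.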
In order to demonstrate this method in a real experimental setting,
we designed another lysate array containing samples of HCT116
colon cancer cell lines under two different drug treatments (we
describe them as drug 1 and drug 2) as well as a treatment consisting of a combination of the two drugs. Additionally, the array
contained p53-null (p53-/-) HCT116 cells with no treatment as
well as p53+/+ HCT116 cells (parental cells) with no treatment
as a control. Each sample was spotted with six dilutions and three
replicates, as on the BSA array. Using these arrays, we measured the
relative protein expressions for p53 as well as its downstream target
gene product p21. All expression ratios were with respect to p53+/+
HCT116 cells with no treatment.

Fig. 5. Expression of p53 and p21 in p53+/+ HCT116 cells under drug 1 (B), drug 2 (C) and combination of drug 1 and drug 2 (D), as well as p53-/-
HCT116 cells with no treatment (E), all relative to p53+/+ HCT116 cells with no treatment (A). The no-treatment p53+/+ HCT116 control (A) is represented
as a reference baseline at zero in (b). The plotted bars illustrate the relative estimated values as well as the bootstrap-estimated standard deviations (1000
bootstrap samples). See the Methods section for more details. The values for the bars and standard deviations are shown in the table in (e). As a validation step,
a western blot is shown in (a), with the same labels (A-E) as in (b). The quality-based weights, estimated using the coefficient of variation as described in the
Methods section, are seen to be different for p21 than for p53. (f) shows the quality-based weights for p53 (green) and p21 (red).

To estimate the protein expression ratios, we used the robust
least squares method and estimated the quality-based weights for
each dilution using the mean of the coefficients of variation over
the samples of the slide, as described in the Methods section.
Figures 5c and d show the fitted lines for all samples, for p53 and
p21, respectively.
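
The fitting step described above can be illustrated with a minimal sketch. It assumes a straight-line model of log2 intensity against dilution index, Huber weights updated by iteratively re-weighted least squares, and a fixed quality weight per dilution. The formula used here for the quality weight (inverse of the mean coefficient of variation, scaled so the best dilution gets 1), the tuning constant 1.345, the function names and the toy data are all assumptions for illustration; the exact definitions are those of the Methods section.

```python
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ym - b * xm, b


def quality_weights(mean_cv):
    """ASSUMPTION: weight = 1 / (mean CV), scaled so the best dilution gets 1."""
    best = min(mean_cv)
    return [best / cv for cv in mean_cv]


def huber_irls(x, y, base_w, c=1.345, max_iter=50, tol=1e-10):
    """IRLS with Huber weights multiplied by fixed per-point quality weights."""
    a, b = weighted_line_fit(x, y, base_w)
    for _ in range(max_iter):
        res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        med = statistics.median(res)
        # Robust residual scale from the median absolute deviation.
        scale = statistics.median([abs(r - med) for r in res]) / 0.6745
        if scale == 0:
            break
        w = [q * min(1.0, c * scale / abs(r)) if r != 0 else q
             for q, r in zip(base_w, res)]
        a_new, b_new = weighted_line_fit(x, y, w)
        done = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if done:
            break
    return a, b


# Toy data (hypothetical): six dilutions, three replicates, true slope -1,
# with all three replicates of the fifth dilution shifted upwards.
dilutions = [0, 1, 2, 3, 4, 5]
noise = [-0.05, 0.0, 0.05]
x = [k for k in dilutions for _ in noise]
y = [10.0 - k + e for k in dilutions for e in noise]
y = [yi + 3.0 if xi == 4 else yi for xi, yi in zip(x, y)]

mean_cv = [0.04, 0.04, 0.05, 0.06, 0.08, 0.12]  # hypothetical mean CVs
q = quality_weights(mean_cv)
base_w = [q[k] for k in x]

ols = weighted_line_fit(x, y, [1.0] * len(x))
robust = huber_irls(x, y, base_w)
print("OLS slope:", ols[1], "robust slope:", robust[1])
```

In this toy setting the robust slope stays closer to the true value of -1 than the ordinary least squares slope, because the shifted replicates are progressively down-weighted.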
We estimated the errors of the model once again, this time directly applied to HCT116 cells, by means of a bootstrap procedure
(Efron and Tibshirani, 1993). In applying the bootstrap, there are at least
two possible ways of estimating the errors for the distance
between two regressions. The fitted lines of drug 1 (B), drug
2 (C), the combination of drug 1 and drug 2 (D) and p53-/- no-drug
(E) are each compared with that of untreated p53+/+ HCT116 (A). The
principles of the two methods are different. One could bootstrap
the pairs $(n_i^{(k)}, k)$ of spot intensity and dilution and then estimate the ratio between two protein
expressions as the quality-weighted distance using the bootstrap-randomized selection. Another option is to estimate the errors by
means of bootstrapping the residuals.
We preferred to use the first alternative, which involves recalculation of the parameters, as it makes no assumption about whether or
not the regression model holds. Figure 5b illustrates the expression
ratios, relative to p53+/+ HCT116 cells with no treatment, indicated
as a baseline at zero, along with the bootstrap-estimated standard
deviations. Because the bar graphs in Figure 5b represent distances,
the standard deviations on these bars incorporate the combined variability owing to the treatment (or p53-/-) and the p53+/+ no-drug
reference.
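
The pairs bootstrap can be sketched as follows, under stated assumptions. Each sample's (dilution, log2 intensity) pairs are resampled with replacement, both lines are refitted, and the distance is recomputed each time. Ordinary least squares stands in for the robust fit to keep the sketch short, and the distance is taken to be the quality-weighted mean vertical gap between the two fitted lines. That definition, the function names, the quality weights and the toy data are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ym - b * xm, b


def weighted_distance(fit_ref, fit_trt, dilutions, quality):
    """ASSUMPTION: quality-weighted mean vertical gap between two fitted lines."""
    gaps = [(fit_trt[0] + fit_trt[1] * k) - (fit_ref[0] + fit_ref[1] * k)
            for k in dilutions]
    return sum(q * g for q, g in zip(quality, gaps)) / sum(quality)


def resample_pairs(x, y, rng):
    """Draw (x, y) pairs with replacement; redraw if only one dilution is hit."""
    while True:
        idx = [rng.randrange(len(x)) for _ in x]
        if len({x[i] for i in idx}) > 1:
            return [x[i] for i in idx], [y[i] for i in idx]


def bootstrap_distance(x_ref, y_ref, x_trt, y_trt, dilutions, quality,
                       n_boot=1000, seed=0):
    """Return the bootstrap mean and standard deviation of the distance."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_boot):
        fit_ref = fit_line(*resample_pairs(x_ref, y_ref, rng))
        fit_trt = fit_line(*resample_pairs(x_trt, y_trt, rng))
        dists.append(weighted_distance(fit_ref, fit_trt, dilutions, quality))
    return statistics.mean(dists), statistics.stdev(dists)


# Toy data (hypothetical): the treated sample sits one log2 unit above the reference.
dilutions = [0, 1, 2, 3, 4, 5]
noise = [-0.05, 0.0, 0.05]
x = [k for k in dilutions for _ in noise]
y_ref = [10.0 - k + e for k in dilutions for e in noise]
y_trt = [11.0 - k + e for k in dilutions for e in noise]
quality = [1.0, 1.0, 0.8, 0.67, 0.5, 0.33]  # hypothetical quality weights

point = weighted_distance(fit_line(x, y_ref), fit_line(x, y_trt), dilutions, quality)
boot_mean, boot_sd = bootstrap_distance(x, y_ref, x, y_trt, dilutions, quality)
print("distance:", point, "bootstrap SD:", boot_sd)
```

Because both samples are resampled in every replicate, the resulting standard deviation reflects the variability of the treated sample and of the reference together, which is the combined variability mentioned above.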
As expected, the p53-/- cells (E) have a much lower p53 relative
expression than all other conditions. The same behavior is seen for
p21 expression, though less dramatically. Further, the combination
of drugs (D) does not increase the expression of p53 and p21 in
HCT116 p53+/+ cells, relative to drug 1 (B) or drug 2 (C) alone. As
a validation step, Figure 5a shows a western blot corresponding to
all tested conditions.

Fig. 6. To study the robustness of the algorithms, the three methods were applied to a BSA array with a crack in the membrane. Sample 59 is presented in
the original slide image (a) and then, before quantification, in a negative image prepared for ArrayVision. As a result of the crack, all three replicates of the fifth dilution are
outliers (b). The simple linear regression is most affected by this error, followed by the non-linear approach; robust least squares is able to reject the effect of
outliers. Furthermore, using a bootstrap procedure (1000 times), which combines the pairs of spot intensity and dilution (see Discussion section), the estimated
errors from robust least squares are considerably lower than from the other two cases. The error bars on the fitted lines represent the standard deviations of the
estimated fits for each dilution using the bootstrap procedure. The lower inset shows bootstrap histograms (10 000 bootstrap samples) of distances between two
neighboring normal samples, computed with robust least squares (c), least squares (d) and least squares of the medians (e). In each subplot, the vertical bar
indicates the distance between a cracked sample and its neighboring normal sample, using the same algorithm that is used to generate the histogram.

The HCT116 cell data are fairly free of outliers. In an ideal case,
the two robust methods should return the same results as a standard
regression model. We nevertheless considered it worthwhile to report a case in which the
three methods return significantly different results, a case not used in
our further analyses. The slide in question is spotted with BSA, with
1-touch. We can observe several cracks in the membrane caused by
erroneous handling combined with faster drying on a Palls Vivid
slide. On the left of Figure 6, we show the entire slide. The upper
detail is the processed image as described in the Methods section,
before applying the segmentation step of ArrayVision. For this
sample, the values of the fifth dilution spots are strongly affected
by the crack and, as we can see, in reality the spots themselves are
not outliers. Of the three methods, robust least
squares is the least influenced, followed by the least squares of the
medians method (non-linear approach), and finally by standard least
squares. Further, it may be useful, as an additional step, to conduct
a test of linearity (on the log-scale) prior to applying any of the
aforementioned methods (Neter et al., 2004).
In measuring the expression of samples, an important question
is whether one sample is significantly different from another.
The results of the three algorithms are compared on the cracked
membrane sample and two normal samples. In Figure 6, we plot
bootstrap histograms (10 000 runs; resampling with replacement was
performed separately for each dilution) of the distances between
two neighboring normal samples, using each of the three estimation methods. The distance between the sample influenced by the
cracked membrane and the neighboring normal sample is in the same
range (P-value 0.61) using the robust least squares method (Fig. 6c),
in contrast to the other two algorithms (P-value is zero) (Fig. 6d
and e). The distances between the cracked and normal samples
are shown by the vertical bars in Figure 6c, d and e.
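
The comparison above can be sketched in code, with several assumptions. Replicates are resampled with replacement separately within each dilution, the distance between two normal samples is recomputed for each replicate to build a reference histogram, and the observed distance is then located in that histogram. The text does not state how the P-value was computed; the two-sided empirical fraction used here is an assumption, as are the distance definition (mean vertical gap between fitted lines), the use of ordinary least squares, the function names and the toy data.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ym - b * xm, b


def fit_sample(sample):
    """Fit a line to a sample stored as {dilution: [replicate log-intensities]}."""
    x = [k for k, reps in sample.items() for _ in reps]
    y = [v for reps in sample.values() for v in reps]
    return fit_line(x, y)


def distance(sample_a, sample_b):
    """ASSUMPTION: mean vertical gap between the two fitted lines over the dilutions."""
    (a0, a1), (b0, b1) = fit_sample(sample_a), fit_sample(sample_b)
    ks = sorted(sample_a)
    return sum((b0 + b1 * k) - (a0 + a1 * k) for k in ks) / len(ks)


def stratified_resample(sample, rng):
    """Resample replicates with replacement separately within each dilution."""
    return {k: [rng.choice(reps) for _ in reps] for k, reps in sample.items()}


def null_distances(normal_a, normal_b, n_boot=2000, seed=0):
    """Bootstrap distribution of the distance between two normal samples."""
    rng = random.Random(seed)
    return [distance(stratified_resample(normal_a, rng),
                     stratified_resample(normal_b, rng))
            for _ in range(n_boot)]


def empirical_p(null, observed):
    """ASSUMPTION: two-sided fraction of bootstrap distances at least as extreme."""
    return sum(abs(d) >= abs(observed) for d in null) / len(null)


# Toy data (hypothetical): two normal neighbours and one sample whose fifth
# dilution is shifted upwards in all three replicates, as with a membrane crack.
noise_a = [-0.05, 0.0, 0.05]
noise_b = [-0.04, 0.01, 0.03]
normal_a = {k: [10.0 - k + e for e in noise_a] for k in range(6)}
normal_b = {k: [10.0 - k + e for e in noise_b] for k in range(6)}
cracked = {k: [v + (3.0 if k == 4 else 0.0) for v in reps]
           for k, reps in normal_a.items()}

null = null_distances(normal_a, normal_b)
observed = distance(normal_b, cracked)
p_value = empirical_p(null, observed)
print("observed distance:", observed, "empirical P-value:", p_value)
```

With a non-robust fit, as in this sketch, the shifted dilution pulls the observed distance well outside the reference histogram. Replacing `fit_line` with a robust estimator would be expected to bring it back towards the range seen between normal samples.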
Our quality control experiments and evaluation of several proposed
methods for estimating the relative protein expressions, using a specially designed BSA array, indicate that this technology, coupled with
the computational methods described here, can be used to robustly
determine protein differential expression in a high-throughput manner. The experiments with HCT116 cells further support this claim.
We demonstrate that the robust least squares approach provides an
accurate quantification for protein lysate arrays that contain multiple
data points for each sample.

ACKNOWLEDGEMENTS
The work was partially supported by the Tobacco Settlement Fund
to M.D. Anderson Cancer Center (MDACC) as appropriated by the
Texas Legislature, a grant from the Kadoorie Foundation to MDACC,
a grant from the Goodwin Fund and the Cancer Center Supporting
Grant from NIH/NCI, and a grant from the Academy of Finland.

REFERENCES
Efron,B. and Tibshirani,R.J. (1993) An Introduction to the Bootstrap. Chapman and
Hall, New York.
Espina,V., Mehta,A.I., Winters,M.E., Calvert,V., Wulfkuhle,J., Petricoin,E.F.,III and
Liotta,L.A. (2003) Protein microarrays: molecular profiling technologies for clinical
specimens. Proteomics, 11, 2091–2100.
Grubb,R.L., Calvert,V.S., Wulkuhle,J.D., Paweletz,C.P., Linehan,W.M., Phillips,J.L., Chuaqui,R., Valasco,A., Gillespie,J., Emmert-Buck,M., Liotta,L.A. and
Petricoin,E.F. (2003) Signal pathway profiling of prostate cancer using reverse phase
protein arrays. Proteomics, 11, 2142–2146.
Huber,P.J. (1981) Robust Statistics. Wiley, New York.
Ivanov,S.S., Chung,A.S., Yuan,P.Z., Guan,A.Y., Sachs,K.V., Reichner,J.S. and
Chin,Y.E. (2004) Antibodies immobilized as arrays to profile protein posttranslational modifications in mammalian cells. Mol. Cell Proteomics, 8,
788–795.
MacBeath,G. and Schreiber,S.L. (2000) Printing proteins as microarrays for high-throughput function determination. Science, 289, 1760–1763.
Neter,J., Kutner,M.H., Wasserman,W. and Nachtsheim,C.J. (2004) Applied Linear
Regression Models with Student CD-rom, 4th edn. WCB/McGraw-Hill, Boston,
New York.
Nishizuka,S., Charboneau,L., Young,L., Major,S., Reinhold,W.C., Waltham,M.,
Kouros-Mehr,H., Bussey,K.J., Lee,J.K., Espina,V. et al. (2003) Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate
microarrays. Proc. Natl Acad. Sci. USA, 24, 14229–14234.
Paweletz,C.P., Charboneau,L., Bichsel,V.E., Simone,N.L., Chen,T., Gillespie,J.W.,
Emmert-Buck,M.R., Roth,M.J., Petricoin,E.F. III and Liotta,L.A. (2001) Reverse
phase protein microarrays which capture disease progression show activation of
pro-survival pathways at the cancer invasion front. Oncogene, 20, 1981–1989.
Street,J.O., Carroll,R.J. and Ruppert,D. (1988) A note on computing robust regression
estimates via iteratively re-weighted least squares. Am. Stat., 42, 152–154.
Wulfkuhle,J.D., Aquino,J.A., Calvert,V.S., Fishman,D.A., Coukos,G., Liotta,L.A.
and Petricoin,E.F. III (2003) Signal pathway profiling of ovarian cancer from
human tissue specimens using reverse-phase protein microarrays. Proteomics, 3(11),
2085–2090.
