
Advances in

Bioinformatics
ISSN: 1687-8027
Volume 2014 No. 1, June 2014
About this Journal
Advances in Bioinformatics is a peer-reviewed, open access journal that publishes original research
articles as well as review articles in all areas of bioinformatics.

Abstracting and Indexing
The articles of Advances in Bioinformatics are included in the following databases/resources:

Academic OneFile
Academic Search Complete
Access to Global Online Research in Agriculture (AGORA)
Airiti Library
Applied Science and Technology Source
Biological Sciences
BioMedSearch
Biotechnology and BioEngineering Abstracts
Biotechnology Research Abstracts
CAB Abstracts
Chemical Abstracts Service (CAS)
CNKI Scholar
Computers and Applied Sciences Complete
CSA Illustrata - Natural Sciences
CSA Illustrata - Technology
CSA Technology Research Database
Current Abstracts
Directory of Open Access Journals (DOAJ)
EBSCO Discovery Service
EBSCOhost Connection
Expanded Academic Index
Google Scholar
HINARI Access to Research in Health Programme
InfoTrac Custom journals
INSPEC
J-Gate Portal
Odysci Academic Search
ProQuest Advanced Technologies and Aerospace Collection
ProQuest Biological Science Collection
ProQuest Computer Science Journals
ProQuest Natural Science Collection
ProQuest SciTech Collection
PubMed
PubMed Central
Scopus
The DBLP Computer Science Bibliography
The Index of Information Systems Journals
The Informatics Portal io-port.net
TOC Premier

Editorial Board
Shandar Ahmad, National Institute of Biomedical Innovation, Japan
Tatsuya Akutsu, Kyoto University, Japan
Rolf Backofen, University of Freiburg, Germany
Craig Benham, University of California, Davis, USA
Mark Borodovsky, Georgia Institute of Technology, USA
Rita Casadio, Università di Bologna, Italy
Ming Chen, Zhejiang University, China
David Corne, Heriot Watt University, United Kingdom
Bhaskar Dasgupta, University of Illinois at Chicago, USA
Ramana Davuluri, The Wistar Institute, USA
J. Dopazo, Felipe Research Centre, Spain
Anton Enright, European Bioinformatics Institute, United Kingdom
Stavros J. Hamodrakas, National and Kapodistrian University of Athens, Greece
Paul Harrison, McGill University, Canada
Huixiao Hong, U.S. Food and Drug Administration, USA
David Jones, University College London, United Kingdom
George Karypis, University of Minnesota, USA
Jian-Liang Li, Sanford-Burnham Medical Research Institute, USA
Jie Liang, University of Illinois at Chicago, USA
Guohui Lin, University of Alberta, Canada
Pietro Liò, University of Cambridge, United Kingdom
Dennis Livesay, University of North Carolina at Charlotte, USA
Satoru Miyano, The University of Tokyo, Japan
Burkhard Morgenstern, University of Goettingen, Germany
Masha Niv, Hebrew University of Jerusalem, Israel
Florencio Pazos, Consejo Superior de Investigaciones Científicas, Spain
David Posada, Universidad de Vigo, Spain
Jagath Rajapakse, Nanyang Technological University, Singapore
Marcel J. T. Reinders, Delft University of Technology, The Netherlands
P. Rouze, Ghent University, Belgium
Alejandro A. Schäffer, National Institutes of Health, USA
E. L. Sonnhammer, Stockholm University, Sweden
Sandor Vajda, Boston University, USA
Yves Van de Peer, U Gent, Belgium
Antoine van Kampen, University of Amsterdam, The Netherlands
Alexander Zelikovsky, Georgia State University, USA
Zhongming Zhao, Vanderbilt University, USA
Yi Ming Zou, University of Wisconsin-Milwaukee, USA

Editorial Workflow
The following is the editorial workflow that every manuscript submitted to the journal undergoes during
the course of the peer-review process.
The entire editorial workflow is performed using the online Manuscript Tracking System. Once a
manuscript is submitted it is sent to an appropriate Editor based on the subject of the manuscript and
the availability of the Editors. If the Editor finds that the manuscript may not be of sufficient quality to
go through the normal peer review process, or that the subject of the manuscript may not be
appropriate for the journal's scope, the Editor may Refuse to Consider the manuscript. In this case,
the manuscript is sent to a second Editor, and if the second Editor also chooses to Refuse to
Consider the manuscript, the manuscript shall be rejected with no further processing.
If the Editor finds that the submitted manuscript is of sufficient quality and falls within the scope of the
journal, they would assign the manuscript to a minimum of 2 and a maximum of 5 external reviewers
for peer-review. The reviewers submit their reports on the manuscripts along with their
recommendation of one of the following actions to the Editor:
Publish Unaltered
Consider after Minor Changes
Consider after Major Changes
Reject: Manuscript is flawed or not sufficiently novel
When all reviewers have submitted their reports, the Editor can make one of the following editorial
recommendations:
Publish Unaltered
Consider after Minor Changes
Consider after Major Changes
Reject
If the Editor recommends Publish Unaltered, the manuscript is accepted for publication.
If the Editor recommends Consider after Minor Changes, the authors are notified to prepare and
submit a final copy of their manuscript with the required minor changes suggested by the reviewers.
The Editor reviews the revised manuscript after the minor changes have been made by the authors.
Once the Editor is satisfied with the final manuscript, the manuscript can be accepted.
If the Editor recommends Consider after Major Changes, the recommendation is communicated to
the authors. The authors are expected to revise their manuscripts in accordance with the changes
recommended by the reviewers and to submit their revised manuscript in a timely manner. Once the
revised manuscript is submitted, the Editor can then make an editorial recommendation which can be
Publish Unaltered, Consider after Minor Changes, or Reject.
If the Editor recommends rejecting the manuscript, the rejection is immediate. Also, if two of the
reviewers recommend rejecting the manuscript, the rejection is immediate.
The editorial workflow gives the Editors the authority to reject any manuscript because of
inappropriateness of its subject, lack of quality, or incorrectness of its results. The Editor cannot assign
himself/herself as an external reviewer of the manuscript. This is to ensure a high-quality, fair, and
unbiased peer-review process of every manuscript submitted to the journal, since any manuscript
must be recommended by one or more (usually two or more) external reviewers along with the Editor
in charge of the manuscript in order for it to be accepted for publication in the journal.

The name of the Editor recommending the manuscript for publication is published with the manuscript
to indicate and acknowledge their invaluable contribution to the peer-review process and the
indispensability of their contributions to the running of the journals.
The peer-review process is single blinded; that is, the reviewers know who the authors of the
manuscript are, but the authors do not have access to the information of who the peer reviewers are.
Every journal published by Hindawi has an acknowledgment page for the researchers who have
performed the peer-review process for one or more of the journal manuscripts in the past year.
Without the significant contributions made by these researchers, the publication of the journal would
not be possible.

Table of Contents
Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets
Sreevidya Sadananda Sadasiva Rao, Lori A. Shepherd, Andrew E. Bruno, Song Liu, and Jeffrey C. Miecznikowski

A Multilevel Gamma-Clustering Layout Algorithm for Visualization of Biological Networks
Tomas Hruz, Markus Wyss, Christoph Lucas, Oliver Laule, Peter von Rohr, Philip Zimmermann, and Stefan Bleuler

Reverse Engineering Sparse Gene Regulatory Networks Using Cubature Kalman Filter and Compressed Sensing
Amina Noor, Erchin Serpedin, Mohamed Nounou, and Hazem Nounou

Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST Databases
Manrique Mata-Montero, Nabil Shalaby, and Bradley Sheppard

Gene Regulation, Modulation, and Their Applications in Gene Expression Data Analysis
Mario Flores, Tzu-Hung Hsiao, Yu-Chiao Chiu, Eric Y. Chuang, Yufei Huang, and Yidong Chen

Spectral Analysis on Time-Course Expression Data: Detecting Periodic Genes Using a Real-Valued Iterative Adaptive Approach
Kwadwo S. Agyepong, Fang-Han Hsu, Edward R. Dougherty, and Erchin Serpedin

Identification of Robust Pathway Markers for Cancer through Rank-Based Pathway Activity Inference
Navadon Khunlertgit and Byung-Jun Yoon

An Overview of the Statistical Methods Used for Inferring Gene Regulatory Networks and Protein-Protein Interaction Networks
Amina Noor, Erchin Serpedin, Mohamed Nounou, Hazem Nounou, Nady Mohamed, and Lotfi Chouchane

Using Protein Clusters from Whole Proteomes to Construct and Augment a Dendrogram
Yunyun Zhou, Douglas R. Call, and Shira L. Broschat

Solving the 0/1 Knapsack Problem by a Biomolecular DNA Computer
Hassan Taghipour, Mahdi Rezaei, and Heydar Ali Esmaili

Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Comparing Imputation Procedures for Affymetrix Gene
Expression Datasets Using MAQC Datasets
Sreevidya Sadananda Sadasiva Rao,1 Lori A. Shepherd,1 Andrew E. Bruno,2
Song Liu,1 and Jeffrey C. Miecznikowski1,3

1 Department of Biostatistics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
2 Center for Computational Research, University at Buffalo, NYS Center of Excellence in Bioinformatics and Life Sciences,
Buffalo, NY 14203, USA
3 Department of Biostatistics, SUNY University at Buffalo, Buffalo, NY 14214, USA

Correspondence should be addressed to Jeffrey C. Miecznikowski; jcm38@buffalo.edu


Received 26 June 2013; Accepted 28 August 2013
Academic Editor: Shandar Ahmad
Copyright © 2013 Sreevidya Sadananda Sadasiva Rao et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Introduction. The microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of the
precision and comparability of microarrays and of various microarray analysis methods. However, to date no studies that we are
aware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the
MAQC Affymetrix datasets to evaluate several imputation procedures in Affymetrix microarrays. Results. We evaluated several
cutting edge imputation procedures and compared them using different error measures. We randomly deleted 5% and 10% of the
data and imputed the missing values using each imputation method. We performed 1000 simulations and averaged the results.
The results for both 5% and 10% deletion are similar. Among the imputation methods, we observe that the local least squares
method with k = 4 is most accurate under the error measures considered. The k-nearest neighbor method with k = 1 has the
highest error rate among imputation methods and error measures. Conclusions. We conclude that, for imputing missing values in
Affymetrix microarray datasets using the MAS 5.0 preprocessing scheme, the local least squares method with k = 4 has the best
overall performance and the k-nearest neighbor method with k = 1 has the worst overall performance. These results hold true
for both 5% and 10% missing values.

1. Introduction
In microarray experiments, randomly missing values may
occur due to scratches on the chip, spotting errors, dust, or
hybridization errors. Other nonrandom missing values may
be biological in nature, for example, probes with low intensity
values or intensity values that may exceed a readable threshold.
These missing values will create incomplete gene expression
matrices where the rows refer to genes and the columns
refer to samples. These incomplete expression matrices will
make it difficult for researchers to perform downstream
analyses such as differential expression inference, clustering
or dimension reduction methods (e.g., principal components
analysis), or multidimensional scaling. Hence, it is critical to
understand the nature of the missing values and to choose an
accurate method to impute the missing values.

There have been several methods put forth to impute
missing data in microarray experiments. In one of the first
papers related to microarrays, Troyanskaya et al. [1] examine
several methods of imputing missing data and ultimately
suggest a k-nearest neighbors approach. Researchers also
explored applying previously developed schemes to microarrays,
such as the nonlinear iterative partial least squares
(NIPALS) as discussed by Wold [2]. A Bayesian approach for
missing data in gene expression microarrays is provided by
Oba et al. [3]. Other approaches such as that of Bø et al. [4]
suggest using least squares methods to estimate the missing
values in microarray data, while Kim et al. [5] suggest using
a local least squares imputation. A Gaussian mixture method
for imputing missing data is proposed by Ouyang et al. [6].
While many of these approaches can be generally applied
to different types of gene expression arrays, we will focus
on applying these methods to Affymetrix gene expression
arrays, one of the most popular arrays in scientific research.
Naturally, when proposing a new imputation scheme for
expression arrays, it is necessary to compare the new method
against existing methods. Several excellent papers have compared
missing data procedures on high throughput data
platforms such as two-dimensional gel electrophoresis, as
in Miecznikowski et al.'s work [7], or gene expression arrays
[8-10]. Before studying missing data imputation schemes in
Affymetrix gene expression arrays, it is reasonable to first
remove any existing missing values. In this way, we ensure
that any subsequent missing values have known true values.
A detection call algorithm is used to filter and remove missing
expression values based on absent/present calls [11]. Subsequently,
a preprocessing scheme is employed. There are
numerous tasks to perform in preprocessing Affymetrix
arrays, including background adjustment, normalization,
and summarization. A good overview of the methods available
for preprocessing is provided by Gentleman et al. [12].
For our analysis, the detection call employs MAS 5.0 [13] to
obtain expression values; thus, we also use the MAS 5.0 suite
of functions as our preprocessing method.
For our analysis, we focus on the microarray quality control
(MAQC) datasets (Accession no. GSE5350), where the
datasets have been specifically designed to address the points
of strength and weakness of various microarray analysis
methods. The MAQC datasets were designed by the US Food
and Drug Administration to provide quality control (QC)
tools to the microarray community to avoid procedural failures.
The project aimed to develop guidelines for microarray
data analysis by providing the public with large reference
datasets along with readily accessible reference ribonucleic
acid (RNA) samples. Another purpose of this project was to
establish QC metrics and thresholds for objectively assessing
the performance achievable by various microarray platforms.
These datasets were designed to evaluate the advantages and
disadvantages of various data analysis methods.
The initial results from the MAQC project were published
in Shi's work [14] and later in Chen et al.'s work [15] and
Shi et al.'s work [16]. Specifically, the MAQC experimental
design for the Affymetrix gene expression HG-U133 Plus 2.0
GeneChip includes 6 different test sites, 4 pools per site, and
5 replicates per pool, for a total of 120 arrays (see Section 2).
This rich dataset provides an ideal setting for evaluating
imputation methods on Affymetrix expression arrays. While
this dataset has been mined to determine inter- and intra-platform
reproducibility of measurements, to our knowledge, no one has
studied imputation methods on this dataset.
The MAQC dataset hybridizes two RNA sample types:
Universal Human Reference RNA (UHRR) from Stratagene
and Human Brain Reference RNA (HBRR) from Ambion.
These 2 reference samples and varying mixtures of these samples
constitute the 4 different pools included in the MAQC
dataset. By using various mixtures of UHRR and HBRR, this
dataset is designed to study technical variations present in
this technology. By technical variations, we are referring to
the variability between preparations and labeling of sample,
variability between hybridization of the same sample to different
arrays, testing site variability, and variability between
the signal on replicate features of the same array. Meanwhile,
biological variability refers to variability between individuals
in a population and is independent of the microarray process
itself. Because the MAQC dataset is designed to study technical
variation, we can examine the accuracy of the imputation
procedures without the confounding feature of biological
variability. Other than MAQC datasets, similar technical
datasets have been used to evaluate different analysis methods
specific to Affymetrix microarrays, for example, methods for
identifying differentially expressed genes [17-19].
In summary, our analysis examines cutting edge imputation
schemes on an Affymetrix technical dataset with minimal
biological variation. Section 2 discusses the MAQC
dataset and the proposed imputation schemes. Meanwhile,
Section 3 describes the results from applying the imputation
methods for addressing missingness in the MAQC datasets.
Finally, we conclude our paper with a discussion and conclusion
in Sections 4 and 5.

2. Materials and Methods


2.1. Datasets. The MAQC experiments and datasets are fully
described by Shi [14]. The MAQC dataset hybridizes 2 RNA
samples: a Universal Human Reference RNA (UHRR) from
Stratagene and a Human Brain Reference RNA (HBRR) from
Ambion. From these 2 samples, 4 pools are created, that is, the
2 reference RNA samples as well as 2 mixtures of the original
samples: Pool A, 100% UHRR; Pool B, 100% HBRR; Pool C,
75% UHRR and 25% HBRR; and Pool D, 25% UHRR and 75%
HBRR. Both Pool A and Pool B are commercially available
and biologically distinct, so we expect a large number of
differentially expressed genes between Pool A and Pool B.
There are 6 different test sites where each test site assayed
the 4 pools with 5 replicates per pool. Thus, for each test site
there are a total of 20 arrays and thus a total of 120 arrays
over the 6 sites. The data is examined separately for each
pool (4) and each site (6), yielding 24 site and pool
datasets.
2.2. Missing Values and Detection Call Algorithm. Using MAS
5.0, a detection call algorithm is used to flag the missing
values [13]. The detection call determines if the transcript of
a gene is present or absent in the sample. For every gene, the
microarray chip has probes that perfectly match a segment
of the gene sequence (PM probes) and probes that contain
a single mismatched nucleotide in the center of the perfect
match probe (MM probes). The difference in the intensity of
the perfect and mismatch probes is used to make detection
calls.
The detection call algorithm is further summarized by
Mei et al. [11]. For each genetic transcript, there is a probe
set with 11 to 20 probe pairs, where a probe pair consists of
a PM probe and an MM probe. In short, discrimination scores
are calculated for each probe set from the raw intensity data
for the probe pairs in the probe set. For each probe pair, the
ratio of the difference to the sum of the PM and MM intensities,
(PM - MM)/(PM + MM), gives the discrimination score for that
probe pair. This score is calculated for all the probe pairs in a
probe set. The null hypothesis is that the median discrimination
score of a probe set is equal to τ, and the alternate hypothesis
is that the median discrimination score is greater than τ, where τ is
defined as a small nonnegative number which can be changed
by the user to adjust the specificity and sensitivity. One-sided
Wilcoxon signed rank tests are performed for each probe set.
Two significance levels, α1 and α2, act as the cutoffs for the
p values for probe set detection calls. A present call is made for
a probe set (transcript) with a p value < α1, an absent call for
a transcript with p value ≥ α2, and a marginally detected call
for a transcript with α1 ≤ p value < α2. We use the MAS 5.0
preset values 0.04, 0.06, and 0.015 for α1, α2, and τ, respectively,
to determine if the probe set is present, marginally present, or
absent in the sample.
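The two building blocks of the detection call can be sketched as follows. This is only an illustration, not the MAS 5.0 implementation: the function names and intensity values are hypothetical, the score is taken as (PM - MM)/(PM + MM), and the p value is assumed to come from a separate one-sided test that is not shown here.

```python
from statistics import median


def discrimination_scores(pm, mm):
    """Per-probe-pair score R = (PM - MM) / (PM + MM)."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p value to a call using the two cutoffs."""
    if p_value < alpha1:
        return "present"
    if p_value < alpha2:
        return "marginal"
    return "absent"


# Hypothetical raw intensities for one probe set with 4 probe pairs.
pm = [1200.0, 950.0, 1430.0, 800.0]
mm = [300.0, 400.0, 500.0, 780.0]
scores = discrimination_scores(pm, mm)

# The test asks whether the median score exceeds tau (0.015 by default).
print(median(scores) > 0.015)
print(detection_call(0.01), detection_call(0.05), detection_call(0.2))
```

Raising or lowering the two cutoffs trades present calls against absent calls, which is the sensitivity and specificity adjustment mentioned above.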
2.3. Percent Present Algorithm. We use the mas5calls function,
detailed in Affymetrix [20], from the affy package [13]
to make the detection calls. Using this function, we get a
present, marginal, or absent call for each probe set in each
array. For every sample, probe sets were filtered based on
the present calls: probe sets that were present in all 5
replicates of a given pool and a given site were retained for
further analyses. Probe sets that were detected as absent or
marginally present in 1 or more replicates of a sample were
removed. This creates a complete expression matrix for each
site and pool combination.
The simpleaffy R package has methods for quality control
metrics on Affymetrix arrays [21]. One metric is the percent
present call, which calculates the percentage of present probe
sets in each array. Using this metric, we calculate the percent
present calls for all 120 arrays separately and then average the
percentages over the 5 replicates for each sample and each site.
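The filtering rule and the percent present metric can be illustrated with a small sketch. The probe names, the call values, and both helper functions are hypothetical, and the single-letter codes "P", "M", and "A" are used here simply as labels for present, marginal, and absent.

```python
def percent_present(calls):
    """Percentage of probe sets called present ("P") on one array."""
    return 100.0 * sum(c == "P" for c in calls) / len(calls)


def keep_all_present(calls_by_probe):
    """Keep probe sets called "P" in every replicate of one site and pool."""
    return [probe for probe, calls in calls_by_probe.items()
            if all(c == "P" for c in calls)]


# Hypothetical calls for 4 probe sets across the 5 replicates.
calls_by_probe = {
    "probe_a": ["P", "P", "P", "P", "P"],
    "probe_b": ["P", "M", "P", "P", "P"],
    "probe_c": ["A", "A", "P", "A", "A"],
    "probe_d": ["P", "P", "P", "P", "P"],
}

kept = keep_all_present(calls_by_probe)

# Percent present per array (column), then the mean over the 5 replicates.
per_array = [percent_present([calls[r] for calls in calls_by_probe.values()])
             for r in range(5)]
mean_percent = sum(per_array) / len(per_array)
print(kept, per_array, mean_percent)
```

A single marginal or absent call in any replicate is enough to drop a probe set, which is why the retained matrix contains no missing entries.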
2.4. Preprocessing Algorithm. We preprocess each complete
expression matrix using MAS 5.0, available in Bioconductor
[22], to obtain expression values for further analyses. The
MAS 5.0 preprocessing was implemented using the R language
affy library [13].
Preprocessing algorithms for Affymetrix gene expression
microarrays are necessary to account for the systematic variation
present in array technology and to summarize the signal
for each gene, which is measured via a series of probe sets.
As discussed by Gentleman et al. [12], preprocessing schemes
can be organized into three steps: a background adjustment
step, a normalization step, and a summarization step. In short,
the MAS 5.0 preprocessing algorithm is outlined in the Statistical
Algorithms Description Document [20] and used in
the MAS 5.0 software from Affymetrix [20]. The steps in MAS 5.0
involve (1) a weighted nearest neighbor step to estimate and
remove the background signal, (2) a normalization step that
scales all arrays to a baseline array, and (3) a summarization
step using an ideal mismatch, which may differ slightly
from the mismatch (MM) probe described earlier.
To compare imputation methods, we randomly remove a
percentage of the probe set expression values from the complete
expression matrix and compare the complete dataset
with the dataset(s) in which the missing probe set expression
values are estimated via an imputation method. We randomly
remove 5% and 10% of the probe set expression values from
the complete expression matrix with 1000 Monte Carlo simulations
at each deletion percentage.
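The random deletion step might look like the sketch below. It is an assumption-laden illustration: the function name, the use of None as the missing marker, and the seed argument are all hypothetical choices, not details taken from the study.

```python
import random


def delete_at_random(matrix, fraction, seed=None):
    """Return a copy of matrix with a fraction of entries set to None,
    together with the (row, column) positions that were removed."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_missing = round(fraction * len(cells))
    removed = rng.sample(cells, n_missing)  # sampled without replacement
    masked = [row[:] for row in matrix]
    for i, j in removed:
        masked[i][j] = None
    return masked, removed


# Hypothetical complete matrix: 20 probe sets by 5 replicates.
complete = [[float(10 * i + j + 1) for j in range(5)] for i in range(20)]
masked, removed = delete_at_random(complete, 0.05, seed=1)
print(len(removed))
```

Keeping the removed positions alongside the masked matrix is what allows the imputed values to be compared with the known true values later. Note that this sketch does not prevent a row from losing all of its values.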
2.5. Missing Value Imputation Methods. Similar to the analysis
by Oh et al. [10], we examine the following missing data
analysis methods for the MAQC dataset:
(1) row average (ROW),
(2) k-nearest neighbors using Euclidean distance or Pearson
correlation, with k = 1 or 5, where k is the number
of neighbors in the imputation (KNN),
(3) singular value decomposition (SVD) [1],
(4) least squares adaptive (LSA) [4],
(5) local least squares (LLS), choosing k = 1, 3, and 4,
where k is the cluster size used for regression [5],
(6) Bayesian principal components analysis (BPCA) [3],
and
(7) nonlinear iterative partial least squares (NIPALS) [2].
Note that the row average method (ROW) and k-nearest
neighbor (KNN) imputation were done using an R package [23],
while LSA was implemented using the Java language code [24]. In the
ROW method, the average of the values that are present for
that particular probe set is used to replace the missing probe
set expression values. The KNN algorithm classifies objects
based on closest (nearest) probe sets. In this algorithm, we
find the k-nearest neighbors using a suitable distance metric,
and then we impute the missing elements by averaging the
(nonmissing) values of those neighbors. In the KNN method,
there are different types of distance metrics (Pearson correlation,
Euclidean, Mahalanobis, and Chebyshev distance) that
can be employed. We chose the Euclidean distance metric as
it has been reported to be more accurate [25].
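A minimal sketch of the ROW and KNN ideas is given below. It is not the R package used in the study: the function names are hypothetical, None marks a missing value, and the Euclidean distance is averaged over the columns that both rows have observed, which is an assumption made here so that rows with different numbers of shared columns remain comparable.

```python
import math


def row_average_impute(matrix):
    """Replace each None by the mean of the observed values in its row."""
    out = []
    for row in matrix:
        seen = [v for v in row if v is not None]
        mean = sum(seen) / len(seen)
        out.append([mean if v is None else v for v in row])
    return out


def knn_impute(matrix, k):
    """Fill each None with the mean, in that column, of the k nearest rows."""
    out = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                if h == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(
                        sum((a - b) ** 2 for a, b in shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [val for _, val in candidates[:k]]
            out[i][j] = sum(nearest) / len(nearest)
    return out


# Hypothetical 3 probe sets by 3 samples with one missing value.
m = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [5.0, 6.0, 9.0]]
print(row_average_impute(m)[0][2], knn_impute(m, 1)[0][2], knn_impute(m, 2)[0][2])
```

The sketch assumes every row keeps at least one observed value and that each missing entry has at least one usable neighbor; real data would need explicit handling of those cases.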
Singular value decomposition (SVD) reduces the dimension
of the data matrix and uses the global information in
the data matrix to predict the missing values as detailed by
Troyanskaya et al. [1]. Initially, all missing values are replaced
by the row average. With this complete gene matrix, SVD is
used to obtain eigengenes, which are the orthogonal principal
components. Then, the nonmissing values are regressed
against the most significant eigengenes, and the regression
function is used to predict the missing values. Using an
expectation-maximization algorithm, missing values are estimated
repeatedly until the total change in the expression matrix falls
below the empirically determined threshold of 0.01.
The least squares adaptive method (LSA) is a combination
of gene-based and array-based estimates of the missing values.
The gene-based estimate is based on the correlation
between genes and the array-based estimate is based on the
correlation between arrays. A weighted average of these two
estimates predicts the missing value. The weight is chosen to
minimize the sum of squared errors for the new estimate.
An adaptive weighting procedure is used which takes into
account the strength of the gene correlation in the gene-based
estimates. The LSA method is fully described by Bø et al.
[4] and was implemented using the LSimpute.jar Java code
available at http://www.ii.uib.no/~trondb/imputation/.
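The weighting idea can be sketched as follows, under a strong simplification: a single global weight is fitted on entries whose true values are known, whereas the adaptive procedure of the LSA method lets the weight depend on the strength of the gene correlation. The function names and numbers are hypothetical and this is not the LSimpute code.

```python
def best_weight(gene_est, array_est, truth):
    """Weight w in [0, 1] minimizing sum((w*g + (1 - w)*a - y)^2)."""
    d = [g - a for g, a in zip(gene_est, array_est)]
    r = [y - a for y, a in zip(truth, array_est)]
    denom = sum(x * x for x in d)
    if denom == 0.0:
        return 0.5  # both estimates agree everywhere, so any weight works
    w = sum(x * y for x, y in zip(d, r)) / denom
    return min(1.0, max(0.0, w))


def combine(gene_est, array_est, w):
    """Weighted average of the gene-based and array-based estimates."""
    return [w * g + (1.0 - w) * a for g, a in zip(gene_est, array_est)]


# Hypothetical estimates for two entries with known true values.
gene_est = [2.0, 4.0]
array_est = [0.0, 0.0]
truth = [1.0, 2.0]
w = best_weight(gene_est, array_est, truth)
print(w, combine(gene_est, array_est, w))
```

The closed form comes from setting the derivative of the squared error with respect to w to zero; clipping to the interval from 0 to 1 keeps the result a proper weighted average.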

The LLS method is a neighbor-based approach that selects
neighbors based on their Pearson correlation coefficient.
Multiple regression is performed using k-nearest neighbors
as described by Kim et al. [5], and the LLS method is implemented
using the R package pcaMethods [26]. The method
restricts k to be less than the number of replicates/columns. In
our case, with 5 replicates, we chose k equal to 1, 3, or 4. Global
based methods, SVD [1] and BPCA [3], were implemented
using the R package pcaMethods [26]. The NIPALS method
is summarized by Wold [2] and is implemented using the
R package pcaMethods [26]. Similar to KNN, in order to
implement the NIPALS algorithm, it is necessary for the user
to specify the number of principal components. To evaluate
the different methods of imputation, probe set expression
values were randomly deleted from the complete dataset, and
the summary measures in the next section were compared
across the methods.
Figure 1: Percent present across pools and sites (average of percent present calls per site per pool).
Each curve shows a different site; the x-axis shows the 4 pools and the y-axis shows
the mean percentage of present probes on the Affymetrix arrays.
Pool B has the smallest percentage of present probes, while Pool D
has the largest percentage of present probes. Site 4 has the highest
percentage of present probes, while Site 2 has the lowest percentage
of present probes.

2.6. Quantitative Error Evaluation. The complete expression
matrices for each pool and site are such that the rows correspond
to probe sets and the columns correspond to samples.
Similar to Oh et al. [10], we denote this complete expression
matrix as $CD = (y_{ij})$, where $y_{ij}$ is the expression intensity
of probe set $i$ (roughly speaking, gene $i$) on sample $j$. To
simulate the missing data, we randomly remove 5% or 10%
of the entries in CD. Then, given a missing value imputation
scheme, the missing value for probe set $i$, sample $j$, is imputed
as $\hat{y}_{ij}$ and the imputed dataset is denoted as ID.
To compare the imputed dataset ID with the complete
dataset CD, we employ the following summary statistics:

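(1) root mean squared error (RMSE),

$$\mathrm{RMSE} = \sqrt{\frac{1}{\text{no. of missing}} \sum_{\{i,j\ \text{missing}\}} \left(y_{ij} - \hat{y}_{ij}\right)^{2}}, \tag{1}$$

(2) relative estimation error (RAE) [25],

$$\mathrm{RAE} = \frac{1}{\text{no. of missing}} \sum_{\{i,j\ \text{missing}\}} \frac{\left|y_{ij} - \hat{y}_{ij}\right|}{g(y_{ij})}, \tag{2}$$

where

$$g(y_{ij}) = \begin{cases} y_{ij}, & \text{if } y_{ij} > \epsilon, \\ \epsilon, & \text{if } y_{ij} < \epsilon, \end{cases} \tag{3}$$

(3) logged RMSE (LRMSE) [8],

$$\mathrm{LRMSE} = \sqrt{\frac{1}{\text{no. of missing}} \sum_{\{i,j\ \text{missing}\}} \left(l_{ij} - \hat{l}_{ij}\right)^{2}}, \tag{4}$$

where $l_{ij} = \log(y_{ij})$ (and likewise $\hat{l}_{ij} = \log(\hat{y}_{ij})$), and

(4) RAE-L2 [10],

$$\text{RAE-L2} = \sqrt{\frac{1}{\text{no. of missing}} \sum_{\{i,j\ \text{missing}\}} \left(\frac{y_{ij} - \hat{y}_{ij}}{g(y_{ij})}\right)^{2}}. \tag{5}$$
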

See Section 4 for the motivation for using these error measures to evaluate the imputation methods. To understand the
variability in the imputation procedures, we perform each
missing data simulation 1000 times.
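As an illustration of how two of these measures can be computed over the deleted positions only, the sketch below takes RMSE as the square root of the mean squared difference and LRMSE as the same quantity on log-transformed values. The function names and data are hypothetical, and the natural logarithm is an assumption because the base is not stated.

```python
import math


def rmse(complete, imputed, missing):
    """Root mean squared error over the deleted (row, column) positions."""
    sq = [(complete[i][j] - imputed[i][j]) ** 2 for i, j in missing]
    return math.sqrt(sum(sq) / len(sq))


def lrmse(complete, imputed, missing):
    """RMSE computed on log-transformed values (natural log assumed)."""
    sq = [(math.log(complete[i][j]) - math.log(imputed[i][j])) ** 2
          for i, j in missing]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical 2 by 2 example with two deleted and re-imputed entries.
complete = [[100.0, 200.0], [50.0, 400.0]]
imputed = [[100.0, 220.0], [40.0, 400.0]]
missing = [(0, 1), (1, 0)]
print(rmse(complete, imputed, missing), lrmse(complete, imputed, missing))
```

Because RMSE works on the raw intensity scale, it is dominated by errors on highly expressed probe sets, whereas the logged version weighs errors relative to the size of the value. This may help explain why the two measures can rank methods differently.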
2.7. Ranking the Imputation Methods. To identify the overall
best and worst performing imputation methods (IM), we
rank the IM based on their average performance across the
different error measures, all pools, and all sites. The ranking
procedure is carried out separately for 5% and 10% deletion.
For each simulation, we compute 4 error measures for
each of the 10 imputation methods. Averaging over the 1000
simulations, we get an average error value for each imputation
method for every site and pool combination. For example, for
the metric RMSE, there are 10 values: 1 for each imputation
method at, say, Site 1 and Pool A.
Then, we rank the 10 IM based on each error measure
separately for each site and pool combination. For example,
based on RMSE values, the IM are ranked from the lowest to
highest; the IM with the lowest RMSE value is 1 and the IM with
the highest RMSE is 10. The IM each have a rank value for a
given error measure at each site for each of the 4 pools.
For every imputation method, the error measure rank
values are averaged across the 6 sites for each pool; thus we
obtain 4 average rank values, 1 for each pool. Finally, we
average these 4 rank values to obtain a single number that
gives a global ranking to every imputation method, reflecting

Table 1: Summary of imputation methods for 5% and 10% deletion.

Error metric  RMSE          LRMSE         RAE           RAEL2         Average
Deletion %    5%     10%    5%     10%    5%     10%    5%     10%    5%     10%
BPCA          2.38   2.33   5.5    5.71   5.13   5.46   5.63   5.67   4.66   4.79
KNN1          9.79   9.79   9.88   10     9.83   10     9.79   9.88   9.82   9.92
KNN5          7.83   7.83   9.08   8.88   8.33   8.75   8.42   8.79   8.42   8.56
LLS1          5.17   5.21   4.29   4.25   4.67   4.5    4.38   4.33   4.63   4.57
LLS3          3.83   3.75   2.17   2.25   2.13   2.17   2.17   2.17   2.57   2.58
LLS4          2.79   2.92   2.29   2.92   1.96   1.96   2      2.08   2.26   2.48
LSA           1      1      4.88   4.92   4.33   4.5    4.71   4.71   3.73   3.78
NIPALS        7.25   7.25   7.33   6.96   7.33   7.08   7.33   7.04   7.31   7.08
ROW           6      5.96   1.5    1.5    2      2      1.58   1.46   2.77   2.72
SVD           8.96   8.96   8.29   8.08   8.29   8.17   8.33   8.29   8.47   8.38

Rows correspond to imputation methods and columns correspond to error measures, with the last columns showing the average across the error measures.
Each imputation method is ranked based on its average rank performance across all pools and all sites. The rank values for every error measure and imputation
method combination are averaged across the 6 sites and 4 pools as detailed in Section 2. Smaller average rank values suggest more accurate imputation methods.
From the table, we observe that the RMSE metric suggests that the LSA imputation method has the best performance. With the LRMSE and RAEL2 metrics, ROW is the
best imputation method. LLS with k = 4 (LLS4) has the best performance when we use the RAE error measure. KNN with k = 1 (KNN1) has the highest rank
value for any given error measure; thus, it is the worst performing imputation method. LLS with k = 4 (LLS4) has the overall best performance across the
different error measures. These results hold true for both 5% and 10% deletion.

[Figure 2 plot. Vertical axis: RMSE mean values (0 to 1400). Horizontal axis: test sites (1–6) and overall mean (M) for each imputation method (BPCA, KNN1, KNN5, LLS1, LLS3, LLS4, LSA, NIPALS, ROW, SVD).]

Figure 2: Average RMSE barplot with error bars. RMSE values are represented on the y-axis. The x-axis has the 6 sites (1, 2, 3, 4, 5, and 6) and
10 imputation tests (BPCA, KNN with k = 1, 5, LLS with k = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M), depicted by the slashed bar, is
the overall mean for individual IM where the RMSE values are averaged across the 4 pools and 6 sites. This figure shows the performance of
the 10 imputation tests using the RMSE metric with 5% deletion of values. 1000 simulations were performed where each simulation generated
a dataset containing 5% missing values by randomly removing probe set values from the complete expression matrix of probe sets. Missing
values were imputed using the 10 imputation tests. The results are compared using the RMSE metric (see Section 2). The RMSE values are
averaged across the 4 pools. LSA has the best performance as it has the lowest RMSE value for a given site. KNN with k = 1 has the highest
RMSE value and has the worst performance for all pools and all sites.

[Figure 3 plot. Vertical axis: LRMSE mean values (0.0 to 0.6). Horizontal axis: test sites (1–6) and overall mean (M) for each imputation method (BPCA, KNN1, KNN5, LLS1, LLS3, LLS4, LSA, NIPALS, ROW, SVD).]

Figure 3: Average LRMSE barplot with error bars. LRMSE values are represented on the y-axis. The x-axis has the 6 sites (1, 2, 3, 4, 5, and 6)
and 10 imputation tests (BPCA, KNN with k = 1, 5, LLS with k = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M), depicted by the slashed
bar, represents the overall mean for individual IM where the LRMSE values are averaged across the 4 pools and 6 sites. This figure shows
the performance of the 10 imputation tests using the LRMSE metric with 5% deletion of values. 1000 simulations were performed where each
simulation generated a dataset containing 5% missing values by randomly removing probe set values from the complete expression matrix
of probe sets. Missing values were imputed using the 10 imputation tests. The results are compared using the LRMSE metric (see Section 2).
The LRMSE values are averaged across the 4 pools. ROW has the best performance as it has the lowest LRMSE value for a given site. KNN
with k = 1 has the highest LRMSE value and has the worst performance for all pools and all sites.

its overall performance across different error measures, sites,
and pools for a given deletion percentage.
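The rank aggregation described above can be illustrated with a short sketch. It assumes the values of one error measure are stored in an array indexed by (method, site, pool); the function name `global_ranks` and the tie handling (ties broken by array order) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def global_ranks(errors):
    """Return one global average rank per imputation method.

    errors: array of shape (n_methods, n_sites, n_pools) holding a single
    error measure (for example RMSE). Layout and name are hypothetical.
    """
    errors = np.asarray(errors, dtype=float)
    # Rank the methods within each (site, pool) cell: 1 = lowest error.
    # argsort applied twice yields 0-based ranks; ties follow array order.
    order = np.argsort(errors, axis=0)
    ranks = np.argsort(order, axis=0) + 1
    # Average the ranks across sites for each pool ...
    per_pool = ranks.mean(axis=1)        # shape (n_methods, n_pools)
    # ... then average the per-pool values into a single number per method.
    return per_pool.mean(axis=1)         # shape (n_methods,)

# Toy example: 3 methods, 2 sites, 2 pools.
toy = np.array([
    [[1.0, 1.0], [1.0, 1.0]],   # method 0 is always the most accurate
    [[2.0, 3.0], [2.0, 3.0]],
    [[3.0, 2.0], [3.0, 2.0]],
])
print(global_ranks(toy))
```

Repeating this for each error measure and averaging the resulting values gives the kind of overall summary reported in the "Average" rows of Table 1.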

3. Results
We summarize our findings in two ways: probe set detection
call summaries, and error metrics and rankings for IM. Detection call results compare sites and pools, while IM results
identify the best imputation method based on the error
metrics discussed in Section 2.
3.1. Detection Call Algorithm Results. Across the 120 samples,
as shown in Figure 1, the percent present calls have a minimum
value of 51% and a maximum value of 58.5%. We observe
that Site 4 has the highest mean percent present calls and
Site 2 has the lowest mean percent present calls for probe
sets. In terms of pools, Pool B has the lowest mean percent
present calls for probe sets while Pool D has the highest mean
percent present calls (see Figure 1). We performed an analysis
of variance (ANOVA) to examine the effects of site and pool
on the percentage of present probe sets in a microarray. The
P values for site and pool are <0.0001, indicating significant site
and pool effects. Nevertheless, the smallest percent present
is 49.77 while the largest percent present is 63.69. These
results indicate that the percentage of present probes is
sensitive to site and pool; this sensitivity could be caused by the wet
lab preparation of each pool and/or slight differences in each
laboratory's (site) microarray protocols. Regardless of these
subtle differences, we believe that percent present calls are
similar across sites and pools, and hence it is reasonable to
compare the subsequent IM results across the different sites
and pools.
The Affymetrix HG-U133 chip has 54,675 probe sets. After
filtering the absent calls, the number of present probe sets
ranges from 22,900 (Pool B, Site 2) to 27,021 (Pool C, Site 3).
The number of present probe sets for Site 1 is 24,184 (Pool A),
23,557 (Pool B), 25,163 (Pool C), and 25,318 (Pool D). Further
tables and graphs representing the percent present calls and
present probe sets for each pool and site can be found in
Sadasiva Rao et al.'s work [27].
3.2. Imputation Results. The imputation methods are ranked
based on average rank performance as described in Section 2,
and the results are summarized in Table 1. Based on this
ranking, the results are very similar at both 5% and 10%
deletion. The RMSE metric suggests that the LSA imputation method
has the best performance. With the LRMSE and RAEL2 metrics,
ROW is the best imputation method. The imputation method
LLS with k = 4 has the best performance with the RAE
metric (with its tuning parameter set to 0.20).
From Table 1, we observe that KNN with k = 1 has the
highest value for any given error measure; thus, it is the worst

[Figure 4 plot. Vertical axis: RAE mean values (0.0 to 1.0). Horizontal axis: test sites (1–6) and overall mean (M) for each imputation method (BPCA, KNN1, KNN5, LLS1, LLS3, LLS4, LSA, NIPALS, ROW, SVD).]

Figure 4: Average RAE barplot with error bars. RAE values are represented on the y-axis. The x-axis has the 6 sites (1, 2, 3, 4, 5, and 6)
and 10 imputation tests (BPCA, KNN with k = 1, 5, LLS with k = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M), depicted by the slashed
bar, represents the overall mean for individual IM where the RAE values are averaged across the 4 pools and 6 sites. This figure shows the
performance of the 10 imputation tests using the RAE metric with 5% deletion of values. 1000 simulations were performed where each
simulation generated a dataset containing 5% missing values by randomly removing probe set values from the complete expression matrix of
probe sets. Missing values were imputed using the 10 imputation tests. The results are compared using the RAE metric (see Section 2). The
RAE values are averaged across the 4 pools. LLS with k = 4 has the best performance as it has the lowest RAE value for a given site. KNN
with k = 1 has the highest RAE value and has the worst performance for all pools and all sites.

performing imputation method across all pools and sites.
LLS with k = 4 has the overall best performance across the
different error measures.
Figures 2, 3, 4, and 5 show the performance of different
imputation methods for each error measure for all the pool
and site combinations for 5% deletion. Further supplemental
figures and tables showing the performance of different
imputation methods on each site and pool as measured by the
4 error measures are found in Sadasiva Rao et al.'s work [27].
Results from 5% deletion and 10% deletion show a similar
pattern. As expected, the imputed values and variance with
10% missing data are larger than those with 5% missing data. Site 4 has
the highest values for most of the imputation tests for all the
samples (see Sadasiva Rao et al.'s work [27] for more details).
Ultimately, LLS with k = 4 has the best performance with 10%
deleted values.
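The simulation design behind these results (random deletion of a fixed fraction of values, imputation, and comparison against the known true values) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `simulate_rmse` is hypothetical, and a plain row-mean fill is used as a simple stand-in for an imputation method.

```python
import numpy as np

def simulate_rmse(X, frac=0.05, n_sim=100, seed=0):
    """Monte Carlo deletion study on a complete expression matrix X.

    In each simulation a fraction `frac` of entries is removed at random,
    each removed entry is imputed by the mean of the remaining values in
    its row (assumed stand-in method), and the RMSE over the removed
    entries is recorded. Assumes no row loses all of its values.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_missing = max(1, int(round(frac * X.size)))
    rmse_values = np.empty(n_sim)
    for s in range(n_sim):
        # Choose the entries to delete, without replacement.
        idx = rng.choice(X.size, size=n_missing, replace=False)
        mask = np.zeros(X.size, dtype=bool)
        mask[idx] = True
        mask = mask.reshape(X.shape)
        # Row means computed from the observed (non-deleted) values only.
        observed = np.where(mask, np.nan, X)
        row_means = np.nanmean(observed, axis=1, keepdims=True)
        imputed = np.where(mask, row_means, X)
        # Error is evaluated only on the entries that were deleted.
        rmse_values[s] = np.sqrt(np.mean((imputed[mask] - X[mask]) ** 2))
    return rmse_values
```

Because the deleted values are known, any imputation method can be substituted for the row-mean step and scored in the same way.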

4. Discussion
The MAQC project allows researchers to study a variety of
microarray aspects including comparisons of one-color and
two-color arrays [28], reproducibility [14, 15, 29], removal of
batch effects [30], and determining differentially expressed
genes [31]. From this diverse research, it is clear that
the MAQC projects represent a fertile testing ground for
microarray inspired algorithms and methods. However, to
date, we are not aware of any work examining imputation
methods on the MAQC datasets.
Our conclusion is that LLS with k = 4 has the best
performance given our set of error measures. We note that
the optimality of LLS with k = 4 is not uniform across
all error measures, sites, and pools. Also, in Figures 2–5, it
is clear that several other imputation methods offer similar
performance to LLS with k = 4, for example, LLS with k =
1, 3, LSA, and BPCA. These results are similar to those found
by Brock et al. [8] and commented on by Aittokallio [32],
concluding that the top performing imputation algorithms
(LS, LLS, and BPCA) are all highly competitive with each
other, but no method is uniformly superior in all analyses. To
that end, Brock et al. [8] develop measures to determine the
appropriate (optimal) imputation method for a given dataset
based on the correlation within the dataset.
We chose a set of cutting-edge imputation schemes to
apply to the MAQC datasets. There are numerous applied
references for the imputation schemes including [7–10, 32–34].
Optimality in the imputation schemes was assessed

[Figure 5 plot. Vertical axis: RAEL2 mean values (0.0 to 2.5). Horizontal axis: test sites (1–6) and overall mean (M) for each imputation method (BPCA, KNN1, KNN5, LLS1, LLS3, LLS4, LSA, NIPALS, ROW, SVD).]

Figure 5: Average RAEL2 barplot with error bars. RAEL2 values are represented on the y-axis. The x-axis has the 6 sites (1, 2, 3, 4, 5, and 6)
and 10 imputation tests (BPCA, KNN with k = 1, 5, LLS with k = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M), depicted by the slashed
bar, represents the overall mean for individual IM where the RAEL2 values are averaged across the 4 pools and 6 sites. This figure shows the
performance of the 10 imputation tests using the RAEL2 metric with 5% deletion of values. 1000 simulations were performed where each
simulation generated a dataset containing 5% missing values by randomly removing probe set values from the complete expression matrix
of probe sets. Missing values were imputed using the 10 imputation tests. The results are compared using the RAEL2 error measure (see
Section 2). The RAEL2 values are averaged across the 4 pools. ROW has the best performance as it has the lowest RAEL2 value for a given
site. KNN with k = 1 has the highest RAEL2 value and has the worst performance for all pools and all sites.

via (1) raw score error measures and (2) rank-based error
measures taken across our cohort of error measures. The
error measures chosen (see Section 2) were designed to assess
(1) errors in raw expression values (RMSE), (2) errors in
the logarithm transformed expression values (LRMSE), (3)
relative errors designed to penalize errors relative to the raw
expression values (RAE), and (4) relative errors designed
to penalize the error relative to the logarithm expression
value (RAEL2). Hence, there are 2 (relative and absolute)
error measures based on raw expression scores and 2 error
measures (relative and absolute) based on the logarithm of
expression values. Because of this balanced design in error
measures between relative and absolute measures and raw
and logarithm transformed data, it is reasonable to compute
the average rank across these error measures to assess the
overall quality of an imputation method (see Table 1). Thus,
the rank-based error measures shown in Table 1 summarize
the results in a straightforward manner across sites, pools,
and error measures. Note that we set the RAE tuning parameter
to 0.20. For future work, our group is interested in
studying the robustness of RAE to the choice of this parameter. We also
include the raw score error measures to demonstrate the best
imputation methods regardless of the employed set of
imputation methods (see Figures 2–5).
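The contrast between raw-scale and log-scale absolute errors can be made concrete with a small sketch. The exact definitions are given in Section 2; the code below assumes the usual root mean squared error and a base-2 logarithm for the transformed version, and it omits RAE and RAEL2 because their formulas are not restated here. The function names are illustrative.

```python
import math

def rmse(true, imputed):
    """Root mean squared error on the raw expression scale."""
    return math.sqrt(sum((t - i) ** 2 for t, i in zip(true, imputed)) / len(true))

def lrmse(true, imputed):
    """RMSE after a log2 transform (assumed base); requires positive values."""
    return rmse([math.log2(t) for t in true], [math.log2(i) for i in imputed])

# Two probe sets, both imputed 10% too high.
true_values = [100.0, 1000.0]
imputed_values = [110.0, 1100.0]

# On the raw scale the high-intensity probe set dominates the error;
# on the log scale both probe sets contribute equally.
print(rmse(true_values, imputed_values))
print(lrmse(true_values, imputed_values))
```

This is why a method can rank differently under RMSE and LRMSE: the raw-scale measure is driven mostly by highly expressed probe sets.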

Our study is designed with the technical MAQC dataset
in mind. Thus, our error measures do not include biological
measures of the type discussed in [10]. These biological measures are designed to study the clustering and classification
schemes commonly applied to gene expression microarrays.
While our summary error measures are important to compare the imputation schemes, it is not clear how the different
imputation procedures will affect downstream biological
analysis and interpretation. It is outside of the scope of this
paper to address this biological question since the MAQC
experiment does not represent a real biological experiment.
We study imputation methods while using the MAS 5.0
algorithm as the preprocessing method. However, there are
other preprocessing algorithms such as RMA [35–37] and
GCRMA [38] that are routinely used, and these methods
may influence the performance of the imputation scheme. We
highlight several works that extensively study and compare
preprocessing schemes for Affymetrix datasets including [17,
18, 39, 40]. Comparing imputation
methods across different preprocessing algorithms is left for future work.
We recognize that the MAQC datasets are not without
criticism. For example, the issue of choosing an overall optimal preprocessing scheme is still an open question [41].
Another serious criticism is provided in [42] with a reply by

Shi et al. [43]. In that discussion, one of the main concerns
involves technical versus biological variation. This important
issue has arisen when studying other technical microarray
datasets [39]. Considering both aspects of this question, if
we use datasets containing biological and technical variation,
that is, datasets designed to answer biological questions, then
there are biases due to the intent of the original datasets
(e.g., biological variation of the species, sample preparation,
procurement of RNA, and hybridization affinities).

5. Conclusions
Missing values in microarray experiments are a common
problem with effects on downstream analysis. Many variables
such as the biological variability of the dataset, experimental
conditions of the study, percentage of missing values, and
type of downstream analysis performed need to be considered when choosing an imputation method.
In our work, we use the MAQC datasets with the MAS 5.0
preprocessing scheme to compare missing data imputation
schemes for Affymetrix datasets. The best and worst performing imputation schemes remain the same for both 5% and 10%
deletion percentages. We observe that the k-nearest neighbor
method with k = 1 has the worst performance among the
imputation schemes across all error measures. The local least
squares (LLS) method with k = 4 gives the best performance
for imputing missing values across all error measures for
both 5% and 10% deletion. These conclusions are based on
studying 10 imputation methods with 4 error metrics and
1000 Monte-Carlo simulations.

Authors' Contribution
Jeffrey C. Miecznikowski and Song Liu designed the study.
Sreevidya Sadananda Sadasiva Rao performed the statistical
analysis. Sreevidya Sadananda Sadasiva Rao and Jeffrey C.
Miecznikowski wrote the paper. Lori A. Shepherd and
Andrew E. Bruno assisted with the data analysis and writing
the paper. All authors read and approved the final paper.

References
[1] O. Troyanskaya, M. Cantor, G. Sherlock et al., "Missing value estimation methods for DNA microarrays," Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
[2] H. Wold, "Path models with latent variables: the NIPALS approach," in Quantitative Sociology: International Perspectives on Mathematical and Statistical Modeling, pp. 307–357, 1975.
[3] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, and S. Ishii, "A Bayesian missing value estimation method for gene expression profile data," Bioinformatics, vol. 19, no. 16, pp. 2088–2096, 2003.
[4] T. H. Bø, B. Dysvik, and I. Jonassen, "LSimpute: accurate estimation of missing values in microarray data with least squares methods," Nucleic Acids Research, vol. 32, no. 3, p. e34, 2004.
[5] H. Kim, G. H. Golub, and H. Park, "Missing value estimation for DNA microarray gene expression data: local least squares imputation," Bioinformatics, vol. 21, no. 2, pp. 187–198, 2005.
[6] M. Ouyang, W. J. Welsh, and P. Georgopoulos, "Gaussian mixture clustering and imputation of microarray data," Bioinformatics, vol. 20, no. 6, pp. 917–923, 2004.
[7] J. C. Miecznikowski, S. Damodaran, K. F. Sellers, D. E. Coling, R. Salvi, and R. A. Rabin, "A comparison of imputation procedures and statistical tests for the analysis of two-dimensional electrophoresis data," Proteome Science, vol. 9, p. 14, 2011.
[8] G. N. Brock, J. R. Shaffer, R. E. Blakesley, M. J. Lotz, and G. C. Tseng, "Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes," BMC Bioinformatics, vol. 9, no. 1, p. 12, 2008.
[9] M. Celton, A. Malpertuy, G. Lelandais, and A. G. de Brevern, "Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments," BMC Genomics, vol. 11, no. 1, p. 15, 2010.
[10] S. Oh, D. D. Kang, G. N. Brock, and G. C. Tseng, "Biological impact of missing-value imputation on downstream analyses of gene expression profiles," Bioinformatics, vol. 27, no. 1, Article ID btq613, pp. 78–86, 2011.
[11] R. Mei, X. Di, T. B. Ryder et al., "Analysis of high density expression microarrays with signed-rank call algorithms," Bioinformatics, vol. 18, no. 12, pp. 1593–1599, 2002.
[12] R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Statistics for Biology and Health, 2005.
[13] L. Gautier, L. Cope, B. M. Bolstad, and R. A. Irizarry, "Affy-analysis of Affymetrix GeneChip data at the probe level," Bioinformatics, vol. 20, no. 3, pp. 307–315, 2004.
[14] L. Shi, "The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements," Nature Biotechnology, vol. 24, no. 9, pp. 1151–1161, 2006.
[15] J. J. Chen, H. Hsueh, R. R. Delongchamp, C. Lin, and C. Tsai, "Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data," BMC Bioinformatics, vol. 8, no. 1, p. 412, 2007.
[16] L. Shi, W. D. Jones, R. V. Jensen et al., "The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies," BMC Bioinformatics, vol. 9, supplement 9, p. S10, 2008.
[17] S. E. Choe, M. Boutros, A. M. Michelson, G. M. Church, and M. S. Halfon, "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset," Genome Biology, vol. 6, no. 2, p. R16, 2005.
[18] Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, "Preferred analysis methods for Affymetrix GeneChips. II. An expanded, balanced, wholly-defined spike-in dataset," BMC Bioinformatics, vol. 11, no. 1, p. 285, 2010.
[19] Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, "A wholly defined Agilent microarray spike-in dataset," Bioinformatics, vol. 27, no. 9, Article ID btr135, pp. 1284–1289, 2011.
[20] I. Affymetrix, "Statistical algorithms description document," Technical Paper, 2002.
[21] C. L. Wilson and C. J. Miller, "Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis," Bioinformatics, vol. 21, no. 18, pp. 3683–3685, 2005.
[22] R. C. Gentleman, V. J. Carey, D. M. Bates et al., "Bioconductor: open software development for computational biology and bioinformatics," Genome Biology, vol. 5, no. 10, p. R80, 2004.
[23] T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu, Impute: Imputation for Microarray Data, 1999, R package version 1.10.0.

[24] T. H. Bø, B. Dysvik, and I. Jonassen, Lsimpute: Accurate estimation of missing values in microarray data with least squares methods, 2005, http://www.ii.uib.no/trondb/imputation/.
[25] D. V. Nguyen, N. Wang, and R. J. Carroll, "Evaluation of missing value estimation for microarray data," Journal of Data Science, vol. 2, no. 4, pp. 347–370, 2004.
[26] W. Stacklies and H. Redestig, PcaMethods: A Collection of PCA Methods, 2007, R package version 1.18.0.
[27] S. S. Sadasiva Rao, L. A. Shepherd, A. E. Bruno, S. Liu, and J. C. Miecznikowski, "A full analysis of imputation procedures for Affymetrix gene expression datasets," Technical Report 1202, SUNY University at Buffalo-Department of Biostatistics, Buffalo, NY, USA, 2012.
[28] T. A. Patterson, E. K. Lobenhofer, S. B. Fulmer-Smentek et al., "Performance comparison of one-color and two-color platforms within the MicroArray Quality Control (MAQC) project," Nature Biotechnology, vol. 24, no. 9, pp. 1140–1150, 2006.
[29] Z. Wen, C. Wang, Q. Shi et al., "Evaluation of gene expression data generated from expired Affymetrix GeneChip microarrays using MAQC reference RNA samples," BMC Bioinformatics, vol. 11, supplement 6, p. S10, 2010.
[30] J. Luo, M. Schumacher, A. Scherer et al., "A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data," Pharmacogenomics Journal, vol. 10, no. 4, pp. 278–291, 2010.
[31] K. Kadota and K. Shimizu, "Evaluating methods for ranking differentially expressed genes applied to microArray quality control data," BMC Bioinformatics, vol. 12, no. 1, p. 227, 2011.
[32] T. Aittokallio, "Dealing with missing values in large-scale studies: microarray data imputation and beyond," Briefings in Bioinformatics, vol. 11, no. 2, Article ID bbp059, pp. 253–264, 2009.
[33] J. Tuikkala, L. L. Elo, O. S. Nevalainen, and T. Aittokallio, "Missing value imputation improves clustering and interpretation of gene expression microarray data," BMC Bioinformatics, vol. 9, no. 1, p. 202, 2008.
[34] A. Liew, N. Law, and H. Yan, "Missing value imputation for gene expression data: computational techniques to recover missing data from available information," Briefings in Bioinformatics, vol. 12, no. 5, Article ID bbq080, pp. 498–513, 2011.
[35] B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed, "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias," Bioinformatics, vol. 19, no. 2, pp. 185–193, 2003.
[36] R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, and T. P. Speed, "Summaries of Affymetrix GeneChip probe level data," Nucleic Acids Research, vol. 31, no. 4, p. e15, 2003.
[37] R. A. Irizarry, B. Hobbs, F. Collin et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[38] Z. Wu, R. A. Irizarry, R. Gentleman, F. Martinez-Murillo, and F. Spencer, "A model-based background adjustment for oligonucleotide expression arrays," Journal of the American Statistical Association, vol. 99, no. 468, pp. 909–917, 2004.
[39] A. R. Dabney and J. D. Storey, "A reanalysis of a published Affymetrix GeneChip control dataset," Genome Biology, vol. 7, no. 3, p. 401, 2006.
[40] D. P. Gaile and J. C. Miecznikowski, "Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent," BMC Genomics, vol. 8, no. 1, p. 105, 2007.

[41] J. M. Perkel, "Six things you won't find in the MAQC," The Scientist, vol. 20, no. 11, p. 68, 2007.
[42] P. Liang, "MAQC papers over the cracks," Nature Biotechnology, vol. 25, no. 1, pp. 27–28, 2007.
[43] L. Shi, W. D. Jones, R. V. Jensen et al., "Reply to MAQC papers over the cracks," Nature Biotechnology, vol. 25, pp. 28–29, 2007.

Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
A Multilevel Gamma-Clustering Layout Algorithm for
Visualization of Biological Networks
Tomas Hruz,1 Markus Wyss,2 Christoph Lucas,1 Oliver Laule,2
Peter von Rohr,2 Philip Zimmermann,2 and Stefan Bleuler2
1 Institute of Theoretical Computer Science, ETH Zurich, 8092 Zurich, Switzerland
2 NEBION AG, Hohlstrasse 515, 8048 Zurich, Switzerland

Correspondence should be addressed to Markus Wyss; mw@nebion.com


Received 12 April 2013; Accepted 7 June 2013
Academic Editor: Guohui Lin
Copyright © 2013 Tomas Hruz et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Visualization of large complex networks has become an indispensable part of systems biology, where organisms need to be
considered as one complex system. The visualization of the corresponding network is challenging due to the size and density of
edges. In many cases, the use of standard visualization algorithms can lead to high running times and poorly readable visualizations
due to many edge crossings. We suggest an approach that analyzes the structure of the graph first and then generates a new graph
which contains specific semantic symbols for regular substructures like dense clusters. We propose a multilevel gamma-clustering
layout visualization algorithm (MLGA) which proceeds in three subsequent steps: (i) a multilevel γ-clustering is used to identify
the structure of the underlying network, (ii) the network is transformed to a tree, and (iii) finally, the resulting tree, which shows
the network structure, is drawn using a variation of a force-directed algorithm. The algorithm has the potential to visualize very
large networks because it uses modern clustering heuristics which are optimized for large graphs. Moreover, most of the edges are
removed from the visual representation, which allows keeping an overview of complex graphs with dense subgraphs.

1. Introduction
The development in systems biology has brought a strong
interest in considering an organism as a large and complex
network of interacting parts. Many subsystems of living
organisms can be modeled as complex networks. One important example is a network of biochemical reactions which
constitutes a complex system responsible for homeostasis in
the living cell. An abstract network model of the biochemical
processes within the cell can be constructed such that reactions are represented as nodes and metabolites (and enzymes)
as edges. In the past, this system was studied mainly on a
subsystem level through metabolic pathways. Recently, it has
become important to consider the metabolic system as one
complex network to understand deeper phenomena involving interactions across multiple pathways.
The need to study the whole network consisting of thousands of reactions, metabolites, and enzymes requires a
visualization system allowing biologists to study the overall
structure of the system. Such a visualization should allow navigation and comprehension of the global system structures.
In the present paper, we propose a visualization algorithm
for very large networks arising in systems biology and we
illustrate its usage on two complex biological networks. The
first case study is a metabolic network of Arabidopsis thaliana
and the second case study is a gene correlation network of
Mus musculus based on mRNA expression measurements.
Biological networks are usually represented as graphs
because such a model can provide an insight into their structure. The goal of the subsequent visualization is to present the
information contained in the graph in a clear and structured
way. For instance, closely related nodes of a subsystem should
be positioned together. This can be achieved using a cost
function which formalizes the visualization criteria and
which controls the drawing algorithm. Several standard
algorithms exist to achieve this goal using continuous optimization of the cost function, but the optimization of a
discrete cost function remains hard to solve.


Figure 1: Arabidopsis thaliana metabolic network visualized with (a) a force-directed algorithm with all edges shown and (b) the MLGA method,
which combines γ-clustering with the force-directed algorithm. The underlying network has 1199 reactions (nodes) and 4386 metabolites
(edges).

A widely used graph drawing method for larger graphs
is the force-directed layout algorithm [1, 2]. Basically, the
graph is modeled as a physical system. A force is calculated
on every node: a repulsive force between every pair of nodes
and an attractive force if an edge exists between two nodes.
The forces direct the system into a steady state which defines
a final layout. However, the method has several disadvantages
for large graphs with many edges. First, a straightforward
implementation needs to calculate the forces between each
node pair in each iteration. Second, for complex graphs too
many iterations are needed to find an optimal layout. Third,
a drawback results from the node degree distribution in biological networks, which tends to be skewed (scale-free). Few
nodes have a high degree while a large number of nodes have
a small degree. The attractive forces pull the
nodes with many interactions together into a small area, which prevents
the identification of the network structure in the dense parts
of the network; see Figure 1(a). The repulsive force against
the other nodes leads to a scattered layout. To overcome
these disadvantages, other graph visualization methods have
emerged which are discussed later. On the other hand, for a
very specific class of graphs like trees, a modified version of
the force-directed algorithm can still be a suitable method.
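The force model described above can be illustrated with a single layout iteration. This is a minimal sketch: the function name `force_step`, the step size, and the specific force magnitudes (repulsion proportional to k²/d, attraction proportional to d²/k, in the style of Fruchterman and Reingold) are assumptions made for illustration and are not taken from the cited algorithms [1, 2].

```python
import numpy as np

def force_step(pos, edges, k=1.0, step=0.05):
    """One iteration of a simple force-directed layout.

    pos: (n, 2) array of node positions; edges: list of (i, j) index pairs.
    Force magnitudes are an assumed Fruchterman-Reingold style model.
    """
    pos = np.asarray(pos, dtype=float)
    n = len(pos)
    disp = np.zeros_like(pos)
    # Repulsive force between every pair of nodes: O(n^2) work per iteration.
    for i in range(n):
        for j in range(i + 1, n):
            delta = pos[i] - pos[j]
            dist = max(np.linalg.norm(delta), 1e-9)
            force = (k * k / dist) * (delta / dist)
            disp[i] += force
            disp[j] -= force
    # Attractive force only between nodes joined by an edge.
    for i, j in edges:
        delta = pos[i] - pos[j]
        dist = max(np.linalg.norm(delta), 1e-9)
        force = (dist * dist / k) * (delta / dist)
        disp[i] -= force
        disp[j] += force
    # Move every node a small step along its net force.
    return pos + step * disp
```

The double loop over node pairs is the per-iteration cost mentioned in the text, and iterating `force_step` until the positions stop changing corresponds to reaching the steady state.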
The visualization of very large biological networks was
considered in [3]. The large graph layout algorithm (LGL)
separates the graph into connected components, lays out
each connected component separately, and integrates these
layouts into one coordinate system. A grid variant of the
spring-based algorithm [1] is used to draw the graph for
each connected component. To separate dense parts of each
component, the minimum spanning tree (MST) is calculated
to define the order in which nodes are included in the
layout computation. Beginning from a root node of the MST,
new nodes with increasing edge distance from the root are
iteratively added to the layout. The new nodes are placed
randomly on spheres away from the current layout. At
each iteration, the spring-based layout algorithm is executed
until the layout is at rest. Under certain conditions, this
node placement strategy reduces cluttering and retains the
structure of core components; moreover, it separates highly
connected components. This layout phase is illustrated in [3,
page 181, Figure 1]. However, in some situations, the LGL
algorithm can even obfuscate the true structure of the graph.
Consider the situation where two cliques in the graph are
connected by a matching. The MST algorithm will represent
this subgraph as a star having many paths of length two
from the center and one path of length three leading to the
center of the second clique. The rendering according to LGL
would lead to a situation where the first clique is placed in the
interior of the second clique. Such a starting configuration can
easily lead to a situation where the force-directed algorithm
cannot separate the cliques; moreover, the edges of the second
clique would cross the rest of the graph. The problem is that in
this situation the MST algorithm reduces the second clique to
one edge. In such cases a different solution would be needed,
as we describe later.
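The inclusion order described above, in which nodes are added by increasing edge distance from a root of the spanning tree, amounts to a breadth-first traversal. The following sketch assumes the tree is given as an adjacency dictionary; the function name `layers_from_root` is hypothetical, and this is not the LGL implementation.

```python
from collections import deque

def layers_from_root(tree, root):
    """Group tree nodes by edge distance from the root.

    tree: dict mapping each node to a list of its neighbors (assumed input
    format). Returns a list of layers; layer d holds the nodes at distance d.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in tree[u]:
            if v not in dist:          # first visit gives the tree distance
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = {}
    for node, d in dist.items():
        layers.setdefault(d, []).append(node)
    return [layers[d] for d in sorted(layers)]

# Small tree: 0 is the root, 1 and 2 are its children, 3 hangs below 1.
example_tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(layers_from_root(example_tree, 0))
```

Each layer would be added to the layout in turn. Since a spanning tree of n nodes keeps only n − 1 edges, most edges inside a clique play no role in this ordering, which is the source of the problem described in the text.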
The problem of fast visualization for protein interaction
networks was studied in [4]. The method uses an approach
with a grouping phase and a layout phase. In the grouping
phase the algorithm identifies the connected components of
the graph and uniformly selects pivot nodes in each component. The selection of the pivot nodes is controlled by a set of
rules based on empirical parameters. In the layout phase, the
pivot nodes define an initial layout of the connected components. Afterwards, the layout of each connected component is
refined separately. The authors show that the method is faster
than many other algorithms; however, a certain disadvantage
of this algorithm is the choice of pivot nodes involving
many parameters and a complex set of rules. The rules and
their parameters are heuristically identified to give a uniform
distribution of the nodes within the connected component.
Another drawback is that the method per se cannot visualize
the structure of dense subgraphs because of too many edge
crossings (see [4, page 1887, Figure 3]). To improve the visualization, the authors introduce visual operations to collapse
the cliques (and complete bipartite subgraphs) to reduce the
number of edges and nodes. Additionally, the problem of
finding a maximal clique (or complete bipartite subgraph) is
NP-hard, as is its approximation, so there is almost
no chance of having fast identification heuristics for large
graphs. Our algorithm improves the situation in this respect
because relaxing the requested density of the subgraph through
γ-clustering (where 0 ≤ γ ≤ 1 is the cluster density) allows
much more efficient heuristics for large graphs (on the order of 10^6
nodes and edges [5]).
A global optimization method was explored in [6], where
the authors describe a layout algorithm for metabolic networks.
Nodes of the graph are placed on a square grid. A
discrete cost function between a pair of nodes is introduced
based on their relation and position on the grid. By minimizing
the total cost, a layout is generated. A simulated
annealing heuristic is used to optimize the cost function
by choosing better layouts among possible candidates. Because
the layout is computationally costly to calculate, the
approach is applicable only to networks with a few hundred
nodes. The authors showed that the algorithm works well on
sparse or planar graphs and clarifies the network structure, as
the cost function of the method places closely related nodes
together. However, this layout algorithm would place dense parts
of the graph in the same area, leading to many edge crossings.
Additionally, as no reduction in the number of edges or nodes
is performed, the identification of the graph structure would
be very hard for large graphs with many edges.

2. MLGA Approach
Experience with the existing visualization methods has
shown that it is necessary to provide a structural view of
dense networks. Representing networks with a large number
of nodes and edges in a two-dimensional area results in many
edge crossings. Dense subgraphs prevent the recognition of
the network structure if drawn directly. Apart from other
technical problems, this is the main shortcoming of most
layout algorithms. We believe that future progress in the
visualization of large and dense networks lies in algorithms
which first analyze the structure of the graph and then
generate a new graph which contains specific semantic symbols
for regular substructures like dense clusters. Additionally,
the algorithms may allow the user to drill down and interactively
show all edges for a given substructure, as described below (see
Section 9, Visual Representation and Operation). Dense clusters
are ideal candidates for graph preprocessing because they
can be simply described and efficiently searched, and, if they
are replaced with a specific symbol, they significantly reduce
the complexity of the resulting low-dimensional (planar or
three-dimensional) picture because they contain most of the
edges. Moreover, we focus on graph clustering algorithms
because the underlying dimension of graphs can be very high,
which creates difficulties for other clustering algorithms.
Graph clustering is a large field with many algorithms
developed over the years [7]; however, there is no universal
solution for all cases. Even the definition of a cluster comes in
many flavors with different algorithmic consequences. Therefore,
it is important to consider a certain class of graphs which
is sufficiently general in the context of bioinformatics but
allows for using an efficient clustering method. Recently, new
clustering methods have emerged based on the idea of so-called
$\gamma$-clusters [5] or $(\alpha, \beta)$-clusters [8]. These methods use fast
heuristics which allow large graphs to be clustered efficiently.
The existence of such methods inspired the general idea
behind our research: to use clustering algorithms to build a
hierarchical structure of a given graph which can be visualized
much better and which tells the users more about the
structure of the underlying biological network. In the following,
we focus on $\gamma$-clusters, but other graph clustering
methods could be used as well.

3. Algorithm
The MLGA method combines multilevel $\gamma$-clustering and a
specific tree transformation with a force-directed layout algorithm
to visualize the structure of highly complex biological
networks. First, the original graph is preprocessed using the
clustering algorithm described in [5] to identify the clusters.
For every cluster, a new cluster node is created, and these new
nodes are linked with new edges if there are edges between
the nodes of the underlying clusters, as illustrated in Figure 2.
This process constructs the first hierarchical layer above
the original graph. Then, the clustering algorithm is recursively
applied to the cluster nodes themselves to generate a cluster
hierarchy. Afterwards, this hierarchy is transformed into a tree
showing only the shortest paths from a root node through the
intermediate cluster nodes to the nodes of the initial graph.
Finally, a modified version of the force-directed algorithm
visualizes the tree structure of the remaining graph. This
combination of preprocessing and layout algorithm eases the
identification of the clusters and their interactions;
see Figure 1(b).
For the clustering step, we prefer $\gamma$-clustering to $(\alpha, \beta)$-clustering
or to other more complex methods because with those it
would be much more difficult to control the clustering parameters
during the transitions between the hierarchy levels. The
only parameter which has to be specified for our algorithm at
every hierarchical level is the parameter $\gamma$. It can be seen that
the density of the graph grows as the algorithm proceeds
to the higher levels. On the other hand, the number of nodes
decreases very rapidly, so that after a few steps there is only
one clique left. As a consequence, it is not meaningful to use
the same clustering parameters as the algorithm recursively
proceeds up the hierarchy. For more complex clustering algorithms,
it would be very difficult to define good clustering
parameters if the parameter space has more dimensions.
In our case, the sequence of values for the parameter $\gamma$
must be increasing. As we discuss later, the actual values
can be determined empirically; moreover, 3-4 values are
sufficient for large graphs.


Figure 2: (a) The construction of a cluster hierarchy and (b) the transformation to a tree.

4. Algorithmic Phases
Let $G = (V, E)$ be an undirected graph with the vertex set $V$
and edge set $E$. A $\gamma$-cluster for $0 \le \gamma \le 1$, also described as a
$\gamma$-clique or dense subgraph, is a subset $S \subseteq V$ such that for its
edge set $E(S)$ and the vertex set $V(S)$ the following is true:

$$|E(S)| \ge \gamma \binom{|V(S)|}{2}. \qquad (1)$$

Finding a $\gamma$-clique of maximal cardinality in $G$ is the
maximal $\gamma$-clique problem. The 1-clique problem is NP-hard
and is proved to be hard to approximate [9].
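Condition (1) can be checked directly for a candidate vertex subset. The following is a minimal sketch; the function name `is_gamma_cluster` and the edge-list representation are illustrative assumptions and are not taken from [5]. Connectivity of the subset is not checked here.

```python
def is_gamma_cluster(subset, edges, gamma):
    """Check condition (1): |E(S)| >= gamma * C(|V(S)|, 2).

    `subset` is an iterable of vertices and `edges` an iterable of
    undirected vertex pairs; both names are illustrative.
    """
    nodes = set(subset)
    # Collect the edges induced by the subset, counting each undirected
    # edge once and ignoring self-loops.
    induced = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in nodes and v in nodes
    }
    n = len(nodes)
    max_edges = n * (n - 1) // 2  # binomial coefficient C(n, 2)
    return len(induced) >= gamma * max_edges


# A 4-cycle has 4 of the 6 possible edges (density 2/3).
square = [(1, 2), (2, 3), (3, 4), (4, 1)]
print(is_gamma_cluster({1, 2, 3, 4}, square, 0.5))
print(is_gamma_cluster({1, 2, 3, 4}, square, 0.7))
```

With $\gamma = 1$ the test reduces to an ordinary clique check, which is why relaxing $\gamma$ below 1 enlarges the set of admissible clusters.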
To identify the clusters on one hierarchical level, we use
a heuristic developed in [5] to detect $\gamma$-clusters in very
large graphs. Reference [5] introduced a potential function
on a vertex set relative to a given $\gamma$-cluster and derived an
algorithm to discover maximal $\gamma$-clusters. The time complexity
of the algorithm is $O(|S||V|^2)$, with $S$ the set of vertices
of the maximal $\gamma$-cluster detected. Further, the authors use
a greedy randomized adaptive search procedure (GRASP)
version of the algorithm with edge pruning. The feasibility
of the resulting method was demonstrated by applying it to
telecommunication data with millions of vertices and edges.

5. $\gamma$-Cluster Detection

To find all $\gamma$-clusters on one level in the graph, a variant
(Algorithm 1) of the GRASP approach of [5] is used. The
cluster construction procedure construct_dsubg is the nonbipartite
case for finding a high-cardinality cluster of specified
density $\gamma$ in a graph with nodes $V$ and edges $E$. Our algorithm
repeatedly applies the detection algorithm to the highest
hierarchical level of the new graph. It terminates if no more
$\gamma$-clusters are found or if a given maximum number of consecutive
clusters with a size below a given minimum size is reached.
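The control flow of Algorithm 1 (createClusters) can be sketched in Python as follows. This is only an illustration under stated assumptions: `greedy_dense_subgraph` is a hypothetical, much simpler stand-in for construct_dsubg and is not the GRASP heuristic of [5]; the default values of `min_size` and `max_count` are invented for the example; and the sketch drops every edge incident to a removed node rather than only the edges inside the cluster.

```python
def density(nodes, adj):
    """Edge density of the subgraph induced by `nodes`."""
    n = len(nodes)
    if n < 2:
        return 1.0
    m = sum(len(adj[v] & nodes) for v in nodes) // 2
    return m / (n * (n - 1) / 2)


def greedy_dense_subgraph(gamma, adj):
    """Hypothetical stand-in for construct_dsubg (NOT the method of [5]).

    Grows a vertex set from the highest-degree vertex, always adding the
    neighbour with the most links into the set, while density >= gamma.
    """
    if not any(adj.values()):
        return set()  # no edges left, so no cluster can be built
    cluster = {max(sorted(adj), key=lambda v: len(adj[v]))}
    while True:
        frontier = set().union(*(adj[v] for v in cluster)) - cluster
        best = max(sorted(frontier), key=lambda v: len(adj[v] & cluster),
                   default=None)
        if best is None or density(cluster | {best}, adj) < gamma:
            return cluster
        cluster.add(best)


def create_clusters(gamma, adj, min_size=3, max_count=3):
    """Sketch of createClusters; min_size and max_count defaults are assumed."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    clusters, count = [], 0
    cluster = greedy_dense_subgraph(gamma, adj)
    while cluster and count < max_count:
        if len(cluster) >= min_size:
            clusters.append(cluster)
            count = 0
        else:
            count += 1  # one more undersized cluster in a row
        # Remove the cluster's nodes; this sketch also drops every edge
        # incident to them, not only the edges inside the cluster.
        for v in cluster:
            for u in adj.pop(v):
                if u in adj:
                    adj[u].discard(v)
        cluster = greedy_dense_subgraph(gamma, adj)
    return clusters


# Two triangles joined by the edge (3, 4).
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
         4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(create_clusters(1.0, graph))
```

The counter is reset whenever a sufficiently large cluster is found, so the loop stops only after `max_count` undersized clusters in a row or when no cluster can be constructed at all.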

input: V: vertices
input: E: edges
input: γ: density of cluster
begin
    initialize empty list of clusters C;
    count ← 0;
    cluster ← construct_dsubg(γ, V, E);
    while cluster ≠ ∅ and count < max_count do
        size ← |cluster|;
        if size ≥ min_size then
            add cluster to C;
            count ← 0;
        else
            count ← count + 1;
        end
        set V to V without nodes of cluster;
        set E to E without edges within cluster;
        cluster ← construct_dsubg(γ, V, E);
    end
end
Algorithm 1: createClusters.

6. Hierarchy Creation
The cluster detection algorithm is repeatedly applied to the
graph and the clusters to build a hierarchy; see Figure 2(a).
Each node of the graph has an attribute level which is
initially set to zero. First, the cluster detection algorithm
retrieves the clusters of this initial graph. Afterwards, the
algorithm iteratively creates the clusters of the next level $l$
among the clusters one level below, $l - 1$ (Algorithm 2). To
control the density of the clusters on each level, a $\gamma$-value is
specified. At each level, an edge between the cluster nodes
is created if an edge exists between the nodes one level
below. Additionally, a new edge is generated between the
cluster and the nodes belonging to the cluster. The algorithm
terminates if no more clusters are found. This phase resembles
hierarchical clustering, where new nodes are introduced
for hierarchically different clusters, but the $\gamma$-clustering is
based on a completely different density measure and merges
multiple nodes in one step. Consequently, this leads to a
much lower tree depth (as described in the following section)
compared to hierarchical clustering, which generates a
binary tree.
15

Advances in Bioinformatics

begin
    level ← 0;
    nodes ← getNodes(level);
    γ ← getGammaValue(level);
    clusters ← createClusters(nodes, γ);
    while clusters ≠ ∅ do
        create one node on the next level for each cluster;
        create edges between clusters and nodes;
        level ← level + 1;
        nodes ← getNodes(level);
        γ ← getGammaValue(level);
        clusters ← createClusters(nodes, γ);
    end
end
Algorithm 2: createMultiLevelClusters.
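The level-by-level construction of Algorithm 2 might be sketched as below. All names are illustrative assumptions: `next_level` builds the cluster nodes of one level and links two of them whenever any of their members share an edge, the `members` map plays the role of the membership edges, and `toy_clusters` is a deliberately simple stand-in for createClusters rather than the heuristic of [5]. Unlike Algorithm 2, the sketch also stops when the supplied list of $\gamma$ values is exhausted.

```python
def next_level(adj, clusters, level):
    """Create one cluster node per cluster and link the cluster nodes.

    Returns (cluster_adj, members): adjacency among the new cluster nodes
    and a map from each cluster node to the nodes it contains.
    """
    owner, members = {}, {}
    for i, cluster in enumerate(clusters):
        name = ("cluster", level, i)  # hypothetical naming scheme
        members[name] = set(cluster)
        for v in cluster:
            owner[v] = name
    cluster_adj = {name: set() for name in members}
    for v, neighbours in adj.items():
        for u in neighbours:
            a, b = owner.get(v), owner.get(u)
            # Link two cluster nodes if any of their members share an edge.
            if a is not None and b is not None and a != b:
                cluster_adj[a].add(b)
                cluster_adj[b].add(a)
    return cluster_adj, members


def create_multilevel_clusters(adj, gammas, cluster_fn):
    """Apply cluster_fn level by level until it finds no more clusters."""
    hierarchy, level, current = [], 0, adj
    while level < len(gammas):
        clusters = cluster_fn(gammas[level], current)
        if not clusters:
            break
        level += 1
        current, members = next_level(current, clusters, level)
        hierarchy.append(members)
    return hierarchy


def toy_clusters(gamma, adj, min_size=2):
    """Toy stand-in for createClusters (Algorithm 1); not the method of [5]."""
    free, clusters = set(adj), []
    for seed in sorted(adj, key=lambda v: -len(adj[v])):
        if seed not in free:
            continue
        cluster = {seed}
        for v in sorted(adj[seed] & free):
            trial = cluster | {v}
            edges = sum(len(adj[u] & trial) for u in trial) // 2
            if edges >= gamma * len(trial) * (len(trial) - 1) / 2:
                cluster = trial
        if len(cluster) >= min_size:
            clusters.append(cluster)
            free -= cluster
    return clusters


# Two triangles joined by the edge (3, 4), clustered with growing gamma.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
         4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
hierarchy = create_multilevel_clusters(graph, [0.8, 0.9], toy_clusters)
for depth, members in enumerate(hierarchy, start=1):
    print(depth, members)
```

Passing the clustering routine as a parameter keeps the hierarchy construction independent of the particular cluster detection heuristic.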

7. Tree Transformation
To expose the structure of the cluster hierarchy, a tree
transformation is performed; see Figure 2(b). In the transformation
(Algorithm 3), a hidden root node is connected to all cluster
nodes at the highest level as their parent. Afterwards, only
the edges belonging to the shortest path from the root node
to each node are shown. If the shortest path is not unique, a
path is chosen at random. The distance for each node is
calculated beginning from the root using a breadth-first search.
The parent of a node is set to the neighbor node with the
shortest distance. If the node belongs to a cluster node one
level above, the parent is set to this cluster.
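A compact way to picture the transformation is a breadth-first search that records one parent per node. The sketch below makes several assumptions that are not in the original: the data layout (`adj`, `cluster_of`, `top_nodes`), the name of the hidden root, and a deterministic tie-break by sorted node name instead of the random choice described above.

```python
from collections import deque


def tree_transformation(adj, cluster_of, top_nodes, root="ROOT"):
    """Breadth-first tree transformation of a cluster hierarchy.

    adj        -- node -> set of neighbours, including membership edges
    cluster_of -- node -> cluster node one level above (if any)
    top_nodes  -- cluster nodes on the highest level
    Returns (parent, dist, hidden); hidden holds the edges not in the tree.
    """
    parent = {v: root for v in top_nodes}
    dist = {root: 0, **{v: 1 for v in top_nodes}}
    queue = deque(sorted(top_nodes))
    while queue:
        node = queue.popleft()
        for nb in sorted(adj[node]):
            if nb in dist:
                continue  # reached earlier, so it already has a parent
            cluster = cluster_of.get(nb)
            # Prefer the node's own cluster as parent if it was visited.
            parent[nb] = cluster if cluster in dist else node
            dist[nb] = dist[parent[nb]] + 1
            queue.append(nb)
    hidden = {
        frozenset((u, v))
        for u in adj
        for v in adj[u]
        if parent.get(u) != v and parent.get(v) != u
    }
    return parent, dist, hidden


# Two triangles (n1..n3, n4..n6) joined by n3-n4, clustered as c1 and c2,
# which are in turn members of the top-level cluster T.
adj = {
    "T": {"c1", "c2"},
    "c1": {"T", "c2", "n1", "n2", "n3"},
    "c2": {"T", "c1", "n4", "n5", "n6"},
    "n1": {"c1", "n2", "n3"},
    "n2": {"c1", "n1", "n3"},
    "n3": {"c1", "n1", "n2", "n4"},
    "n4": {"c2", "n3", "n5", "n6"},
    "n5": {"c2", "n4", "n6"},
    "n6": {"c2", "n4", "n5"},
}
cluster_of = {"c1": "T", "c2": "T", "n1": "c1", "n2": "c1", "n3": "c1",
              "n4": "c2", "n5": "c2", "n6": "c2"}
parent, dist, hidden = tree_transformation(adj, cluster_of, ["T"])
print(parent)
print(len(hidden), "edges hidden")
```

In this example every edge of the initial graph ends up hidden, and only the membership edges and the edges to the top-level cluster remain visible.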

8. Layout Algorithm
A modified version of a force-directed algorithm [2] is used
to lay out the transformed graph. Our method introduces a
different edge length on each level. Longer edges are assigned
to higher levels than to lower levels. This results in a
natural visualization of the hierarchy. Furthermore, the initial
positions of the nodes are specifically calculated. The nodes of
the graph are located on concentric circles with the hidden
root node at the center. Nodes immediately connected to
the root are positioned on the next inner circle, and so on. A
segment of the circle is assigned to each node, within which its
location is calculated. Recursively, a fraction of this segment
is assigned to the children of the node on the next circle.
This initial setup reduces the rendering time and guides the
layout algorithm to visualize the tree structure. A random
initial positioning may result in a local minimum of the
force-directed layout with many edge crossings, which would
disrupt the tree representation. Additionally, the repulsive forces
are ignored beyond a given distance depending on the size
of the drawing area. This restriction prevents disconnected
components of the graph from separating too far. To suppress
the well-known oscillation problem [10] of force-directed
algorithms, a damping heuristic is used in which we compute an
average of the previous node positions during the force
calculation.
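The initial placement on concentric circles can be illustrated with a short recursive routine. This is a sketch under assumptions: the text does not specify how a node's segment is divided among its children, so the code gives every child an equal share, places each node at the center of its segment, and uses a hypothetical `ring_spacing` parameter for the distance between circles.

```python
import math


def initial_positions(children, root, ring_spacing=1.0):
    """Place tree nodes on concentric circles around the hidden root.

    children -- node -> list of child nodes (the transformed tree)
    Returns a dict node -> (x, y).
    """
    positions = {root: (0.0, 0.0)}

    def place(node, depth, start, end):
        kids = children.get(node, [])
        if not kids:
            return
        width = (end - start) / len(kids)  # assumed: equal share per child
        for i, kid in enumerate(kids):
            lo = start + i * width
            angle = lo + width / 2  # center of the child's segment
            radius = depth * ring_spacing
            positions[kid] = (radius * math.cos(angle),
                              radius * math.sin(angle))
            # The child's own children subdivide the child's segment.
            place(kid, depth + 1, lo, lo + width)

    place(root, 1, 0.0, 2 * math.pi)
    return positions


tree = {"ROOT": ["a", "b"], "a": ["a1", "a2"]}
positions = initial_positions(tree, "ROOT")
for name, (x, y) in positions.items():
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

Because children stay inside the angular segment of their parent, subtrees do not overlap in the starting configuration, which is the property the force-directed refinement relies on.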

begin
    create root;
    set parent of highest level nodes to root;
    candidates ← highest level nodes;
    foreach node ∈ candidates do
        if node belongs to a cluster one level above then
            node.parent ← cluster;
        else
            set node.parent to the neighbor with shortest
            distance to the root;
        end
        node.dist ← node.parent.dist + 1;
        foreach neighbor of node except node.parent do
            if neighbor has already been visited then
                hide edge;
            else
                candidates ← candidates ∪ neighbor;
            end
        end
    end
end
Algorithm 3: treeTransformation.

As the graph is transformed into a tree, other layout
algorithms can be used in this phase. Reference [11] uses
a level-based approach which horizontally aligns nodes with
the same distance from the root node. As only a few levels are
created for our initial graph, the resulting drawing would have
a much larger width than height. A ringed circular approach
like [12], where the children of the nodes are plotted on the
periphery of a circle, has a better space efficiency on the 2D
plane than [11]. However, a visual inspection of the resulting graph
in [12, page 11, Figure 7 (left)] shows that the force-directed
layout distributes the children of a node more evenly.

9. Visual Representation and Operation


After the creation of the cluster hierarchy and the tree
transformation, many initial edge connections are hidden.
In the presentation of the resulting graph, the nodes of the
initial graph are colored depending on their number of edges
in the initial graph. Additionally, cluster nodes and edges are
visualized with different symbols and colors (Figure 3).
Our implementation of the visualization tool offers two
operations to get deeper insight into the original graph. First,
all edges between a selected node and its direct neighbors
can be highlighted (Figure 4(b)). If the marked node is a
cluster node, all connections to the nodes of the cluster
are shown. During the tree transformation, most of
these connections were eliminated, and a direct connection
between two nodes in different clusters was replaced by an
indirect connection between the cluster nodes. The second
operation displays all edges between the nodes forming
a $\gamma$-cluster node (Figure 4(c)), which allows the user to
temporarily switch the view between the star-shaped cluster
node and the real connections of the cluster.

Figure 3: Semantic symbols used in MLGA visualization. Nodes
and edges are color encoded; nodes of the initial graph are colored
according to their degree, and cluster nodes are enlarged at each level
of the hierarchy. Legend entries: node of initial graph (degree 0, 1, 2,
3, 4, 5, or 6 and higher); cluster node level 1; cluster node level 2;
edge of initial graph; edge of a node belonging to a cluster; edge
between cluster nodes of the same level.

10. Computation Speed and Memory Requirements
Retrieving a $\gamma$-clique with the clustering algorithm of [5]
has a running time of $O(|S||V|^2)$, with $|S|$ the size of the
detected clique and $|V|$ the size of the initial graph. As the
algorithm is recursively applied to the remaining nodes of
the graph, a time complexity of at most $O(|V|^3)$ results on
each level of the hierarchy. The number of levels depends on
the number of clusters found on each level. It ranges from
$\log_2 |V|$ in the worst case, where two nodes are clustered together,
down to 1 if all nodes are in the same cluster. Therefore, the
total runtime order has an upper limit $O((\log_2 |V|)^4)$ and a
lower limit $O(|V|^3)$. The tree transformation of the resulting
hierarchy uses breadth-first search. As new nodes and edges are
introduced during the hierarchy creation, its runtime ranges
from $O(4|V| + |E|)$ down to $O(|V| + |E|)$ in terms of the initial
number of nodes $|V|$ and initial number of edges $|E|$.
The implemented version of the force-directed layout algorithm
needs a runtime of $O(|V|^2)$. A specialized tree layout
algorithm like [11] has a runtime of order $O(|V|)$, and [12] has an
order of $O(|V|)$, or $O(|V| \log |V|)$ if an optimal solution for the
circle size is required.
The memory required by the algorithm mainly depends
on the graph representation. For biological networks, a
representation between the worst case $O(4|V| + |E|)$ and $O(|V| + |E|)$
space is suitable.

11. Results
11.1. Metabolic Networks. To provide experimental justification
of the proposed method, we extracted the metabolic
network for Arabidopsis thaliana from Genevestigator [13].
The network has 932 nodes and 2315 edges. The edges
represent metabolites and the nodes represent biochemical
reactions. We used two versions of the network: the first is illustrated
in Figure 5(a), and the second, in Figure 5(b), additionally
contains the regulatory pathways with enzymes as
edges, leading to 1199 nodes and 4386 edges.
The application of multi-level $\gamma$-clustering to visualize the
A. thaliana biochemical network (Figure 5(a)) revealed that
both the global view and lucidity were sustained, which is also
true when regulatory pathways were added to the network
(Figure 5(b)). We looked at plant isoprenoid biosynthesis, in
particular at the synthesis of brassinosteroids (BRs), a class of
plant hormones which are essential for the regulation of plant
growth and development [14, 15]. All individual reaction
steps leading to BRs were found structured according to
metabolite flux through the pathway as part of a level 2
cluster, which also contains upstream pathways as well as
reactions leading to other isoprenoid end products (Figure 5(a)).
After inclusion of regulatory elements and signal-transduction
chains, multi-level $\gamma$-clustering assigned the biochemical
reactions from brassinosteroid biosynthesis to clusters
containing reactions known to be regulated by BRs and
elements that are involved in the regulation of this biosynthetic
pathway. As an example, the known fact that BRs act
synergistically with auxins to promote cell elongation [16] is
nicely reflected in the MLGA-drawn network (Figure 5(b)).
11.2. Gene Interaction Networks. The MLGA method can
also be successfully used to analyze gene interaction networks.
Gene interaction networks are constructed as graphs
where nodes represent genes and edges represent interactions
between genes on various biological levels, for example,
interactions between the corresponding proteins or regulatory
and causal interactions obtained from gene expression
experiments. There is long-standing research into methods
for obtaining different kinds of gene
interaction networks based on different types of input data
[17–22]. However, as we illustrate below, even when a simplified
network generation method was used, our visualization algorithm
was able to correctly identify the biologically meaningful
subsystems from a genome-wide correlation network.
To generate the gene correlation networks, we used a
Mus musculus dataset from the Genevestigator database. The
data consisted of 3157 publicly available Affymetrix arrays.
Each array measured the expression values of 12488 genes.
The gene correlation matrix was calculated using the Pearson
correlation, and afterwards a network was generated in which
an edge was introduced between two genes if the correlation
value between the genes was above a certain threshold.
Two networks were constructed, using a threshold of 0.72 in
Figure 6(a) and 0.80 in Figure 6(b).
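The thresholding step can be illustrated on a toy expression table. The sketch below is not the pipeline used for the Genevestigator data: the gene names and values are invented, the helper names are assumptions, and the quadratic pairwise loop is only practical for small inputs.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_network(expression, threshold):
    """Return the gene pairs whose correlation is above `threshold`."""
    genes = sorted(expression)
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(expression[g], expression[h]) > threshold:
                edges.add((g, h))
    return edges


# Invented expression profiles over six arrays.
expression = {
    "g1": [1, 2, 3, 4, 5, 6],
    "g2": [2, 4, 6, 8, 10, 12],   # perfectly correlated with g1
    "g3": [6, 5, 4, 3, 2, 1],     # anticorrelated with g1
    "g4": [3, 1, 2, 4, 6, 5],     # correlation of about 0.77 with g1
}
print(sorted(correlation_network(expression, 0.72)))
print(sorted(correlation_network(expression, 0.80)))
```

Raising the threshold from 0.72 to 0.80 removes the weaker g4 edges in this toy case, mirroring how the higher threshold yields the sparser of the two networks.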
We use a well-known ribosomal gene complex [23–25]
to illustrate the possibilities of MLGA to discover interesting
structures in the correlation network. An inspection of the
cluster highlighted in Figure 6(a) shows that it contains
genes which are documented to belong to the ribosomal
cluster. Moreover, our method is structurally stable over
a wide range of graph densities. This can be seen by comparing
Figures 6(a) and 6(b). Figure 6(a) has a lower threshold;
therefore, it contains 38 889 edges and 2774 nodes compared to
8659 edges and 1232 nodes in Figure 6(b). In both cases the
ribosomal cluster can be clearly identified.

Figure 4: (a) A part of a gene correlation network of Arabidopsis thaliana drawn with MLGA, (b) showing all edges connected to the $\gamma$-cluster
node at the top right and (c) displaying all edges between the nodes defining the cluster. The inset shows a magnification of the edges of the
selected cluster.

Figure 5: MLGA applied to (a) A. thaliana biochemical network without signaling effects and regulatory elements. Reactions directly
involved in the synthesis of brassinosteroids are highlighted with the red color and direct connections are depicted by red edges. The level
2 cluster, indicated by an arrow, combines the major parts of isoprenoid biosynthesis, resulting from the nonmevalonate pathway. (b) A.
thaliana biochemical network including signaling effects and regulatory elements. Reactions directly involved in brassinosteroid and auxin
metabolism/signaling are highlighted with red and direct connections are depicted by red edges. Black arrows point to reactions involved in
brassinosteroid metabolism/signaling. Green arrows point to reactions involved in auxin metabolism/signaling.

Figure 6: MLGA applied to (a) Mus musculus gene correlation network generated with a threshold of 0.72. (b) The gene correlation network
generated with a threshold of 0.80. The red highlighted nodes and direct connections belong to the ribosomal cluster.

12. Discussion
Visualization methods often contain parameters which must
be empirically identified. In [4], the selection of pivot nodes is
determined by a set of empirical values to achieve a uniform
distribution of these nodes in the network. Afterwards,
the layout is computed with respect to the selected pivot
nodes. In our method, the only important empirically set
parameters are the cluster densities on each hierarchical
level. These influence the granularity of the visualization, and
the subsequent tree transformation supports the recovery
of the network structure. Computational experience has
shown that it is advisable to use a slightly smaller $\gamma$
value in the first level than in the subsequent levels; we use 0.5 and
0.7, respectively. The density of the graph grows considerably
over the hierarchical levels. In most of the experiments with
graphs of up to $10^5$ edges, the third level was already a
clique. Therefore, very few $\gamma$ values are needed even for large
graphs.
A different approach in a similar direction to our research
is described in [26]. The authors consider weighted graphs
where vertices with degree 1 are repeatedly removed until
no more vertices of degree 1 exist. The removed nodes can be
added back at the end of the drawing algorithm. After that, a
hierarchy of clusters is calculated on the remaining graph using
an approximation of the graph distances. A cluster is formed
if the pairwise shortest path of its nodes is equal to or above a
threshold which depends on the hierarchy level. This leads to a
different definition of clusters than in our algorithm, together
with a different cluster heuristic. The algorithm preserves
more edges than ours, making the structure of the graph
more difficult to identify. Comparing the result of MLGA in
Figure 1(b) with [26], the structure of the graphs in [26] is only
partially resolved (see, e.g., [26, page 2, Figure 1]). As
in our case, a version of the force-directed algorithm is used
in the last stage to refine the visualization. A higher weight is
assigned to the edges between the nodes of a cluster than to
the edges between clusters. This forces the nodes of a cluster to
be drawn close to each other. In our approach, the additional
cluster node and the desired edge length have a similar effect.
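The preprocessing step attributed to [26], repeated removal of degree-1 vertices, can be sketched as follows. The function name and the adjacency-dictionary representation are assumptions for illustration; edge weights and the later reinsertion of the removed nodes are left out.

```python
def prune_degree_one(adj):
    """Repeatedly remove vertices of degree 1.

    adj -- node -> set of neighbours (undirected graph)
    Returns (core, removed): the remaining graph and the removal order.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    removed = []
    leaves = [v for v in adj if len(adj[v]) == 1]
    while leaves:
        v = leaves.pop()
        if v not in adj or len(adj[v]) != 1:
            continue  # degree changed since v was queued
        (u,) = adj.pop(v)
        adj[u].discard(v)
        removed.append(v)
        if len(adj[u]) == 1:
            leaves.append(u)  # the neighbour has become a new leaf
    return adj, removed


# A triangle (1, 2, 3) with a tail 3-4-5 attached.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
core, removed = prune_degree_one(graph)
print(sorted(core), removed)
```

Only the tail is peeled off in this example, leaving the triangle as the core on which a cluster hierarchy would then be computed.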
In addition to complexity problems for large graphs (NP-hard
approximation), the algorithms based on identification
of cliques also have the drawback that it is often not clear which
cliques are relevant. This problem is particularly present
in graphs with dense subgraphs, where we obtain
a system of large cliques with similar sizes which additionally
have large intersections. In particular, in the presence
of noise, where every measurement defines a graph which
differs in a small percentage of edges, it is difficult to decide
during clique-based visualization which part of the graph
should be emphasized as structurally important. An interesting
development is represented by [27], where the authors
concentrate on intersections of large clusters under the condition
that these intersections are cliques. They identify so-called
atom subgraphs which represent a clique minimal separator
decomposition (a separator is a set of vertices whose removal
disconnects the graph into several parts). However, the
relaxation from cliques to dense clusters in our method
also improves on the intersection problem because our
method would glue cliques with a large intersection into
one cluster.

13. Conclusion
As discussed, many approaches try to improve the layout of
complex networks through better placement of the nodes
alone. In our work, we pursue a different line of research
towards efficient visualization algorithms for large biological
networks. Our approach does not aim at rendering all edges
in a network; instead, we focus on the discovery and visualization
of important structural features. This approach is combined
with complementary visual operations which allow the user to drill
down into the details of structurally identified elements. The
MLGA method is successful in identifying the biologically
relevant structures and allows for processing very large
graphs, as we illustrated in two different case studies of
biological networks. Naturally, this paradigm opens new
questions on how to further improve the visualization output
and speed. Different clustering algorithms can be tried to
create the multi-level structure; however, in the case of
multiparameter clustering the control and analysis of the
parameter values between the levels would become more
difficult.
On the theoretical side, the next question to consider
is how to provide a provably good (optimal) sequence of
$\gamma$ values. Another question is whether the surprisingly good
structure identification features of our algorithm could be
traced back to the scale-free character of many biological
networks.

Disclosure
All materials, source code, and small case studies data can
be freely downloaded from http://www.pw.ethz.ch/research/
projects/complexnetworks/. Very large datasets are available
on request.

Acknowledgments
The authors would like to thank Professor Peter Widmayer
for the ongoing support of this project. This work was also
supported by the Commission for Technology and Innovation of
the Swiss Federation under Grant 9428.1 PFLS-LS.

References
[1] T. Kamada and S. Kawai, "An algorithm for drawing general
undirected graphs," Information Processing Letters, vol. 31, no. 1,
pp. 7–15, 1989.
[2] T. M. J. Fruchterman and E. M. Reingold, "Graph drawing by
force-directed placement," Software, vol. 21, no. 11, pp. 1129–1164, 1991.
[3] A. T. Adai, S. V. Date, S. Wieland, and E. M. Marcotte, "LGL:
creating a map of protein function with an algorithm for visualizing
very large biological networks," Journal of Molecular
Biology, vol. 340, no. 1, pp. 179–190, 2004.
[4] K. Han and B.-H. Ju, "A fast layout algorithm for protein
interaction networks," Bioinformatics, vol. 19, no. 15, pp. 1882–1888,
2003.
[5] J. Abello, M. Resende, and S. Sudarsky, "Massive quasi-clique
detection," in LATIN 2002: Theoretical Informatics, S. Rajsbaum,
Ed., vol. 2286 of Lecture Notes in Computer Science, pp. 598–612,
Springer, Berlin, Germany, 2002.
[6] W. Li and H. Kurata, "A grid layout algorithm for automatic
drawing of biochemical networks," Bioinformatics, vol. 21, no.
9, pp. 2036–2042, 2005.
[7] S. E. Schaeffer, "Graph clustering," Computer Science Review,
vol. 1, pp. 27–64, 2007.
[8] N. Mishra, R. Schreiber, I. Stanton, and R. Tarjan, "Clustering
social networks," in Algorithms and Models for the Web-Graph,
A. Bonato and F. Chung, Eds., vol. 4863 of Lecture Notes in
Computer Science, pp. 56–67, Springer, Berlin, Germany, 2007.
[9] J. Hastad, "Clique is hard to approximate within $n^{1-\epsilon}$," Acta
Mathematica, vol. 182, pp. 105–142, 1999.
[10] A. Frick, A. Ludwig, and H. Mehldau, "A fast adaptive layout
algorithm for undirected graphs (extended abstract and system
demonstration)," in Graph Drawing, R. Tamassia and I. Tollis,
Eds., vol. 894 of Lecture Notes in Computer Science, pp. 388–403,
Springer, Berlin, Germany, 1995.
[11] E. M. Reingold and J. S. Tilford, "Tidier drawings of trees," IEEE
Transactions on Software Engineering, vol. SE-7, no. 2, pp. 223–228, 1981.
[12] S. Grivet, D. Auber, J. P. Domenger, and G. Melancon, "Bubble
tree drawing algorithm," in Computer Vision and Graphics, K.
Wojciechowski, B. Smolka, H. Palus, R. Kozera, W. Skarbek, and
L. Noakes, Eds., vol. 32 of Computational Imaging and Vision,
pp. 633–641, Springer, Dordrecht, The Netherlands, 2006.
[13] T. Hruz, O. Laule, G. Szabo et al., "Genevestigator v3: a reference
expression database for the meta-analysis of transcriptomes,"
Advances in Bioinformatics, vol. 2008, Article ID 420747, 5
pages, 2008.
[14] T. Asami, Y. K. Min, K. Sekimata et al., "Mode of action of
brassinazole: a specific inhibitor of brassinosteroid biosynthesis," ACS
Symposium Series, vol. 774, pp. 269–280, 2001.
[15] J.-X. He, J. M. Gendron, Y. Yang, J. Li, and Z.-Y. Wang, "The
GSK3-like kinase BIN2 phosphorylates and destabilizes BZR1,
a positive regulator of the brassinosteroid signaling pathway in
Arabidopsis," Proceedings of the National Academy of Sciences
of the United States of America, vol. 99, no. 15, pp. 10185–10190,
2002.
[16] K. J. Halliday, "Plant hormones: the interplay of brassinosteroids
and auxin," Current Biology, vol. 14, no. 23, pp. R1008–R1010,
2004.
[17] K. Y. Yip, R. P. Alexander, K.-K. Yan, and M. Gerstein,
"Improved reconstruction of in silico gene regulatory networks
by integrating knockout and perturbation data," PLoS ONE, vol.
5, no. 1, Article ID e8121, 2010.
[18] M. Mutwil, B. Usadel, M. Schutte, A. Loraine, O. Ebenhoh,
and S. Persson, "Assembly of an interactive correlation network
for the Arabidopsis genome using a novel Heuristic Clustering
Algorithm," Plant Physiology, vol. 152, no. 1, pp. 29–43, 2010.
[19] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano,
and G. Stolovitzky, "Revealing strengths and weaknesses
of methods for gene network inference," Proceedings of the
National Academy of Sciences of the United States of America,
vol. 107, no. 14, pp. 6286–6291, 2010.
[20] S. De Bodt, S. Proost, K. Vandepoele, P. Rouze, and Y. Van de
Peer, "Predicting protein-protein interactions in Arabidopsis
thaliana through integration of orthology, gene ontology and
co-expression," BMC Genomics, vol. 10, p. 288, 2009.
[21] D. R. Rhodes, S. A. Tomlins, S. Varambally et al., "Probabilistic
model of the human protein-protein interaction network,"
Nature Biotechnology, vol. 23, no. 8, pp. 951–959, 2005.
[22] A. de la Fuente, N. Bing, I. Hoeschele, and P. Mendes, "Discovery
of meaningful associations in genomic data using partial
correlation coefficients," Bioinformatics, vol. 20, no. 18, pp.
3565–3574, 2004.
[23] J.-F. Rual, K. Venkatesan, T. Hao et al., "Towards a proteome-scale
map of the human protein-protein interaction network,"
Nature, vol. 437, no. 7062, pp. 1173–1178, 2005.
[24] K. Ishii, T. Washio, T. Uechi, M. Yoshihama, N. Kenmochi, and
M. Tomita, "Characteristics and clustering of human ribosomal
protein genes," BMC Genomics, vol. 7, article 37, 2006.
[25] O. Atias, B. Chor, and D. A. Chamovitz, "Large-scale analysis
of Arabidopsis transcription reveals a basal co-regulation network,"
BMC Systems Biology, vol. 3, p. 86, 2009.
[26] R. Bourqui, D. Auber, and P. Mary, "How to draw clustered
weighted graphs using a multilevel force-directed graph
drawing algorithm," in Proceedings of the 11th International
Conference Information Visualization (IV '07), pp. 757–764, July
2007.
[27] B. Kaba, N. Pinet, G. Lelandais, A. Sigayret, and A. Berry,
"Clustering gene expression data using graph separators," In Silico
Biology, vol. 7, no. 4-5, pp. 433–452, 2007.

Advances in Bioinformatics

Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Reverse Engineering Sparse Gene Regulatory Networks Using
Cubature Kalman Filter and Compressed Sensing
Amina Noor,1 Erchin Serpedin,1 Mohamed Nounou,2 and Hazem Nounou3
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Chemical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City, P.O. Box 23874, Doha, Qatar
3 Electrical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City, P.O. Box 23874, Doha, Qatar

Correspondence should be addressed to Amina Noor; amina@neo.tamu.edu


Received 30 November 2012; Accepted 15 April 2013
Academic Editor: Yufei Huang
Copyright © 2013 Amina Noor et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper proposes a novel algorithm for inferring gene regulatory networks which makes use of cubature Kalman filter (CKF) and Kalman filter (KF) techniques in conjunction with compressed sensing methods. The gene network is described using a state-space model. A nonlinear model for the evolution of gene expression is considered, while the gene expression data is assumed to follow a linear Gaussian model. The hidden states are estimated using CKF. The system parameters are modeled as a Gauss-Markov process and are estimated using compressed sensing-based KF. These parameters provide insight into the regulatory relations among the genes. The Cramér-Rao lower bound of the parameter estimates is calculated for the system model and used as a benchmark to assess the estimation accuracy. The proposed algorithm is evaluated rigorously using synthetic data in different scenarios which include different numbers of genes and varying numbers of sample points. In addition, the algorithm is tested on the DREAM4 in silico data sets as well as the in vivo data sets from the IRMA network. The proposed algorithm shows superior performance in terms of accuracy, robustness, and scalability.

1. Introduction
Gene regulation is one of the most intriguing processes taking place in living cells. With hundreds of thousands of genes at their disposal, cells must decide which genes are to be expressed at a particular time. As the cell development evolves, different needs and functions entail an efficient mechanism to turn the required genes on while leaving the others off. Cells can also activate new genes to respond effectively to environmental changes and perform specific roles. The knowledge of which gene triggers a particular genetic condition can help us ward off the potential harmful effects by switching that gene off. For instance, cancer can be controlled by deactivating the genes that cause it.
Gene expression is the process of generating functional gene products, for example, mRNA and protein. The level of gene functionality can be measured using microarrays or gene chips to produce the gene expression data [1]. More accurate estimation of gene expression is now possible using the RNA-Seq method. Intelligent use of such data can help improve our understanding of how the genes are interacting in a living organism [2–4]. Gene regulation is known to exhibit several modes; a couple of important ones include transcription regulation and posttranscription regulation [5].
While the theoretical applications of gene regulation are extremely promising, realizing them requires a thorough understanding of this complex process. Different genes may cooperate to produce a particular reaction, while a gene may repress another gene as well. The potential benefits of gene regulation can only be reaped if a complete and accurate picture of genetic interactions is available. A network specifying the different interconnections of genes can go a long way in understanding the gene regulation mechanism. The control and interaction of genes can be described through a gene regulatory network. Such a network depicts various interdependencies among genes, where nodes of the network represent the genes and the edges between them correspond to interactions among them. The strength of these interactions represents the extent to which a gene is affected by other genes in the network. A key ingredient of this approach is an accurate and representative modeling of gene networks. Precise modeling of a regulatory network coupled with efficient inference and intervention algorithms can help in devising personalized medicines and cures for genetic diseases [6].
Various methods for gene network modeling have been proposed recently in the literature with varying degrees of sophistication [7–10]. These techniques can be broadly classified as static and dynamic modeling schemes. Static modeling includes the use of correlation, statistical independence for clustering [11–13], and information theoretic criteria [14–16]. On the other hand, dynamic models provide an insight into the temporal evolution of gene expressions and hence yield a more quantitative prediction on gene network behavior [17–20]. In order to incorporate the stochasticity of gene expressions, statistical techniques have been applied [13]. A rich literature is also available on the Bayesian modeling of gene networks [21–26]. Promoted in part by the Bayesian methods, the state-space approach is a popular technique to model the gene networks [27–33], whereby the hidden states can be estimated using the Kalman filter. In the case of nonlinear functions, the extended Kalman filter (EKF) and particle filter represent feasible approaches [33, 34]. However, the EKF relies on the first-order linear approximations of nonlinearities, while the particle filter may be computationally too complex. A comprehensive review of these methods can be found in [35].
In this paper, the gene network is modeled using a state-space approach, and the cubature Kalman filter (CKF) is used to estimate the hidden states of the nonlinear model [36, 37]. The gene expressions are assumed to evolve following a sigmoid squash function, whereas a linear function is considered for the expression data. The noise is assumed to be Gaussian for both the state evolution and gene expression measurements. As the gene network is assumed sparse, a simple mean square error minimization technique will not suffice for the estimation of static parameters. Therefore, a compressed sensing-based Kalman filter (CSKF) [38] is used in conjunction with CKF for reliable estimation of parameters. In case of statistical inference, it is essential to obtain some guarantees on the performance of estimators. In this regard, the Cramér-Rao lower bound (CRB) of the parameter estimates is used as a benchmarking index to assess the mean square error (MSE) performance of the proposed estimator, which is evaluated here for a parameter vector. The performance of the proposed algorithm is tested on synthetically generated random Boolean networks in various scenarios. The algorithm is also tested using DREAM4 data sets and IRMA networks [39, 40].
The main contributions of this paper can be summarized as follows.
(1) CKF is proposed for the estimation of states, and a compressed sensing-based Kalman filter is used for the estimation of system parameters. The genes are known to interact with few other genes only, necessitating the use of a sparsity constraint for more accurate estimation. The proposed algorithm carries out online estimation of parameters and is therefore computationally efficient and particularly suitable for large gene networks.
(2) The Cramér-Rao lower bound is calculated for the estimation of unknown parameters of the system. The performance of the proposed algorithm is compared to the CRB. This comparison is significant as it shows the room for improvement in the estimation of parameters.
(3) The proposed algorithm is compared with the EKF algorithm. Using the false alarm errors, true connections, and Hamming distance as fidelity criteria, rigorous simulations are carried out to assess the performance of the algorithm with the increase in the number of samples. In addition, receiver operating characteristic (ROC) curves are plotted to evaluate the algorithms for different network sizes. It is observed that the proposed algorithm outperforms EKF in terms of accuracy and precision. The proposed algorithm is then applied to the DREAM4 10-gene and 100-gene data sets to assess the algorithm accuracy. The underlying gene network for the IRMA data sets is also inferred.
The rest of this paper is organized as follows. Section 2 describes the underlying system model for the gene expressions. The proposed CKF algorithm in combination with CSKF for gene network inference is formulated in Section 3. The derivation of the CRB is shown in Section 4, and the simulation results and their interpretation are presented in Section 5. Finally, conclusions are drawn in Section 6.

2. System Model
Gene regulatory networks can be modeled as static or dynamical systems. In this work, state-space modeling is considered, which is an instance of a dynamic modeling approach and can effectively cope with time variations. The states represent gene expressions, and their evolution in time, in general, can be expressed as

$$\mathbf{x}_k = \mathbf{g}(\mathbf{x}_{k-1}) + \mathbf{w}_k, \quad k = 1, \ldots, K, \tag{1}$$

where $K$ is the total number of data points available, $\mathbf{w}_k$ is assumed to be a zero-mean Gaussian random variable with covariance $\mathbf{Q} = \sigma_w^2 \mathbf{I}$, and the function $\mathbf{g}(\cdot)$ represents the regulatory relationship between the genes and is generally nonlinear. The microarray data is a set of noisy observations and is commonly expressed as a linear Gaussian model [41]

$$\mathbf{y}_k = \mathbf{h}(\mathbf{x}_k) + \mathbf{v}_k, \tag{2}$$
where $\mathbf{v}_k$ is a Gaussian-distributed random variable with zero mean and covariance $\mathbf{S} = \sigma_v^2 \mathbf{I}$ and incorporates the uncertainty in the microarray experiments. In order to capture the gene interactions effectively, the following nonlinear state evolution model is assumed [33, 34]:

$$x_{k,n} = \sum_{m=1}^{N} b_{nm} f_m(x_{k-1,m}) + w_{k,n}, \quad k = 1, \ldots, K, \quad n = 1, \ldots, N, \tag{3}$$

where $N$ is the total number of genes in the network and $f_m(\cdot)$ is the sigmoid squash function

$$f_m(x_{k-1,m}) = \frac{1}{1 + e^{-x_{k-1,m}}}. \tag{4}$$

This particular choice for the nonlinear function ensures that the conditional distribution of the states remains Gaussian [41]. The multiplicative constants $b_{nm}$ quantify the positive or negative relations between various genes in the network. A positive value of $b_{nm}$ implies that the $m$th gene is activating the $n$th gene, whereas a negative value implies repression [34, 42]. The absolute value of these parameters indicates the strength of interaction.

The model given in (3) and (4), in the absence of any constraints, may be unidentifiable and may result in overfitted solutions [43]. Assumptions on network structures are, therefore, necessary to obtain a connectivity matrix that agrees with the biological knowledge. In a gene regulatory network (GRN), the genes are known to interact with few other genes only. To this end, the coefficients $b_{nm}$ are estimated using sparsity constraints, as explained in the next section.

A discrete linear Gaussian model for the microarray data is considered, which can be expressed at the $k$th time instant as [41]

$$\mathbf{y}_k = \mathbf{x}_k + \mathbf{v}_k. \tag{5}$$

Stacking the unknown parameters together, the parameter vector to be estimated is

$$\mathbf{b} \triangleq [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_N]^T, \tag{6}$$

where $\mathbf{b}_n = [b_{n1}, \ldots, b_{nN}]$. Plugging the values of states from (3) into (5), it follows that

$$\mathbf{y}_k = \mathbf{R}_k \mathbf{b} + \mathbf{e}_k, \tag{7}$$

where

$$\mathbf{R}_k \triangleq \begin{bmatrix} \mathbf{f} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{f} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{f} \end{bmatrix}, \tag{8}$$

$$\mathbf{f} \triangleq [f_1(x_{k-1,1}) \; \cdots \; f_N(x_{k-1,N})]. \tag{9}$$

Thus, the gene network inference problem boils down to the estimation of system parameters $\mathbf{b}$ using the observations $\mathbf{y}_k$, where the effective noise $\mathbf{e}_k$ is the sum of system and observation noises. The next section describes the proposed inference algorithm for sparse networks.

Figure 1: Block diagram of network inference methodology CKFS. [Flowchart: input time-series data $\mathbf{y}$; initialize $\mathbf{b}_0$, $\mathbf{x}_0$; CKF; CSKF; repeat until the last sample; output.]
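The state evolution model (3)-(4), the observation model (5), and the regression form (7)-(9) can be illustrated with a minimal NumPy sketch. The three-gene matrix `B`, the noise levels, and the helper names `sigmoid`, `build_regressor`, and `simulate` are hypothetical choices made for illustration only and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid squash function of (4), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

def build_regressor(x_prev):
    """Build R_k of (8): the row vector f of (9) repeated on the block diagonal."""
    f = sigmoid(x_prev)                               # length-N vector of sigmoids
    return np.kron(np.eye(len(x_prev)), f[None, :])   # shape (N, N*N)

def simulate(B, x0, K, sigma_w, sigma_v, rng):
    """Generate states via (3) and noisy observations via (5)."""
    N = B.shape[0]
    x = np.empty((K + 1, N))
    y = np.empty((K, N))
    x[0] = x0
    for k in range(1, K + 1):
        x[k] = B @ sigmoid(x[k - 1]) + sigma_w * rng.standard_normal(N)
        y[k - 1] = x[k] + sigma_v * rng.standard_normal(N)
    return x, y

rng = np.random.default_rng(0)
# Hypothetical sparse network: entry (n, m) is the effect of gene m on gene n.
B = np.array([[0.0, 1.5, 0.0],
              [-2.0, 0.0, 0.0],
              [0.0, 0.8, 0.0]])
x, y = simulate(B, np.zeros(3), K=20, sigma_w=1e-2, sigma_v=1e-2, rng=rng)

b = B.reshape(-1)            # stacked parameter vector of (6), row by row
R1 = build_regressor(x[0])   # regressor for the first time step
```

With this stacking, `R1 @ b` reproduces the noise-free part of the first state, `B @ sigmoid(x[0])`, which is the identity that turns the nonlinear state model into a problem that is linear in the parameters.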

3. Method
In this section, the methodology proposed to infer the system parameters in (3) is described. The proposed cubature Kalman filter with sparsity constraints (CKFS) approach is succinctly illustrated in Figure 1. The specific details of this algorithm are presented next.
3.1. Cubature Kalman Filter. The Kalman filter is a Bayesian filter which provides the optimal solution to the linear Gaussian case of the state-space inference problem depicted by (1) and (2) and proceeds through a recursive predict-update process. The underlying assumption of Gaussianity for the predictive and the likelihood densities simplifies the Kalman filter algorithm to a two-step process, consisting of prediction and update of the mean and covariance of the hidden states. However, the presence of nonlinear functions in the state and measurement equations requires calculation of multidimensional integrals of the form nonlinear function × Gaussian density [36], which in general is computationally prohibitive. Several solutions to this problem have been proposed, including the EKF, which linearizes the nonlinear function by taking its first-order Taylor approximation, and the unscented Kalman filter (UKF), which approximates the probability density function (PDF) using a nonlinear transformation of the random variable. Recently, a new approach, CKF, has been proposed which evaluates the integrals numerically using spherical-radial cubature rules [36].
The next two subsections briefly explain the working of Bayesian filtering and the CKF solution for the nonlinear multidimensional integrals.
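3.1.1. Time Update. Let the observations up to the time instant $k$ be denoted by $\mathbf{d}_k$; that is, $\mathbf{d}_k \triangleq [\mathbf{y}_1, \ldots, \mathbf{y}_k]^T$. In the prediction phase, also called the time update of the Bayesian filter, the mean and covariance of the Gaussian predictive density are computed as follows:

$$\hat{\mathbf{x}}_{k|k-1} = E[\mathbf{g}(\mathbf{x}_{k-1}) \mid \mathbf{d}_{k-1}],$$
$$\mathbf{P}_{xx,k|k-1} = E[\mathbf{g}(\mathbf{x}_{k-1}) \mathbf{g}^T(\mathbf{x}_{k-1}) \mid \mathbf{d}_{k-1}] - \hat{\mathbf{x}}_{k|k-1} \hat{\mathbf{x}}_{k|k-1}^T + \mathbf{Q}_{k-1}, \tag{10}$$

where $E$ denotes the expectation operator and $\mathbf{x}_{k-1}$ is normally distributed with parameters $(\hat{\mathbf{x}}_{k-1|k-1}, \mathbf{P}_{xx,k-1|k-1})$. The expression for the mean is a consequence of the zero-mean nature of the Gaussian noise $\mathbf{w}_k$ and its independence from $\mathbf{d}_{k-1}$. The estimates $\hat{\mathbf{x}}_{k-1|k-1}$ and $\mathbf{P}_{xx,k-1|k-1}$ are assumed to be available from the previous iteration. Here, $\mathbf{P}_{xx,k|k-1}$ is an estimate of the error covariance matrix.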
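3.1.2. Measurement Update. Since the measurement noise is also Gaussian, the likelihood density is given by $\mathbf{y}_k \mid \mathbf{d}_{k-1} \sim \mathcal{N}(\mathbf{y}_k; \hat{\mathbf{y}}_{k|k-1}, \mathbf{P}_{yy,k|k-1})$. As the measurements become available at the $k$th time instant, the mean and covariance of the likelihood density are calculated as follows:

$$\hat{\mathbf{y}}_{k|k-1} = E[\mathbf{y}_k \mid \mathbf{d}_{k-1}],$$
$$\mathbf{P}_{yy,k|k-1} = E[\mathbf{x}_k \mathbf{x}_k^T \mid \mathbf{d}_{k-1}] - \hat{\mathbf{y}}_{k|k-1} \hat{\mathbf{y}}_{k|k-1}^T + \mathbf{S}_{k-1}. \tag{11}$$

The updated posterior density, obtained from the conditional joint density of the states and the measurements, can be expressed as

$$p\left([\mathbf{x}_k^T \; \mathbf{y}_k^T]^T \mid \mathbf{d}_{k-1}\right) = \mathcal{N}\left( \begin{pmatrix} \hat{\mathbf{x}}_{k|k-1} \\ \hat{\mathbf{y}}_{k|k-1} \end{pmatrix}, \begin{pmatrix} \mathbf{P}_{xx,k|k-1} & \mathbf{P}_{xy,k|k-1} \\ \mathbf{P}_{xy,k|k-1}^T & \mathbf{P}_{yy,k|k-1} \end{pmatrix} \right), \tag{12}$$

where

$$\mathbf{P}_{xy,k|k-1} = E[\mathbf{x}_k \mathbf{x}_k^T \mid \mathbf{d}_{k-1}] - \hat{\mathbf{x}}_{k|k-1} \hat{\mathbf{y}}_{k|k-1}^T \tag{13}$$

is the cross-covariance matrix between the states and the measurements. Hence, the states and the corresponding error covariance matrix are updated by calculating the innovation $\mathbf{y}_k - \hat{\mathbf{y}}_{k|k-1}$ and the Kalman gain $\mathbf{K}_{x,k}$:

$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_{x,k} (\mathbf{y}_k - \hat{\mathbf{y}}_{k|k-1}),$$
$$\mathbf{P}_{xx,k|k} = \mathbf{P}_{xx,k|k-1} - \mathbf{K}_{x,k} \mathbf{P}_{yy,k|k-1} \mathbf{K}_{x,k}^T, \tag{14}$$
$$\mathbf{K}_{x,k} = \mathbf{P}_{xy,k|k-1} \mathbf{P}_{yy,k|k-1}^{-1}.$$

The next subsection briefly describes the computation of the high-dimensional integrals present in the equations above.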
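For the linear observation model $\mathbf{y}_k = \mathbf{x}_k + \mathbf{v}_k$ used in this paper, the predicted measurement equals the predicted state, so the measurement covariance reduces to $\mathbf{P}_{xx,k|k-1} + \mathbf{S}$ and the cross-covariance to $\mathbf{P}_{xx,k|k-1}$. The following is a minimal sketch of the resulting measurement update under that assumption; the function name `measurement_update` and the toy numbers are illustrative only.

```python
import numpy as np

def measurement_update(x_pred, P_pred, y, S):
    """Measurement update for the linear observation model y_k = x_k + v_k."""
    y_pred = x_pred                   # predicted measurement equals predicted state
    P_yy = P_pred + S                 # measurement covariance reduces to P_xx + S
    P_xy = P_pred                     # cross-covariance reduces to P_xx
    K = P_xy @ np.linalg.inv(P_yy)    # Kalman gain
    x_upd = x_pred + K @ (y - y_pred) # corrected state estimate
    P_upd = P_pred - K @ P_yy @ K.T   # corrected error covariance
    return x_upd, P_upd

# Toy example: unit prior covariance and unit measurement noise.
x_upd, P_upd = measurement_update(np.zeros(2), np.eye(2),
                                  np.array([2.0, 4.0]), np.eye(2))
```

With equal prior and measurement uncertainty the gain is one half, so the updated state lands midway between the prediction and the observation.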
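3.1.3. Computation of Integrals Using Spherical-Radial Cubature Points. In order to determine the expectations in (10) using a numerical integration method, a spherical-radial cubature rule is applied. This method calculates the cubature points $\mathbf{X}_{i,k-1|k-1}$ as follows [36]:

$$\mathbf{X}_{i,k-1|k-1} = \mathbf{U}_{k-1|k-1} \boldsymbol{\xi}_i + \hat{\mathbf{x}}_{k-1|k-1}, \tag{15}$$

where $\boldsymbol{\xi}_i = \sqrt{m/2}\,[\mathbf{1}]_i$, $i = 1, \ldots, m$, $m = 2N$ denotes the total number of cubature points, and $\mathbf{U}_{k-1|k-1}$ stands for the square root of the error covariance matrix; that is,

$$\mathbf{P}_{xx,k-1|k-1} = \mathbf{U}_{k-1|k-1} \mathbf{U}_{k-1|k-1}^T. \tag{16}$$

The cubature points are updated via the state equation

$$\mathbf{X}^{*}_{i,k|k-1} = \mathbf{g}(\mathbf{X}_{i,k-1|k-1}). \tag{17}$$

The propagated cubature points yield the state and error covariance estimates

$$\hat{\mathbf{x}}_{k|k-1} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{X}^{*}_{i,k|k-1},$$
$$\mathbf{P}_{xx,k|k-1} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{X}^{*}_{i,k|k-1} \mathbf{X}^{*T}_{i,k|k-1} - \hat{\mathbf{x}}_{k|k-1} \hat{\mathbf{x}}_{k|k-1}^T + \mathbf{Q}_{k-1}. \tag{18}$$

The integrals in (11) and (14) can be evaluated in a similar manner. The next subsection explains the estimation of parameters in the system.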
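The cubature rule of (15)-(18) can be sketched in a few lines of NumPy. The sketch below assumes that the matrix square root in (16) is taken as a Cholesky factor, which is one common choice and is not stated in the text; the names `cubature_points` and `time_update` and the linear test map `A` are hypothetical.

```python
import numpy as np

def cubature_points(x_est, P):
    """Return the m = 2N cubature points of (15), one point per row."""
    N = len(x_est)
    U = np.linalg.cholesky(P)                              # P = U U^T, as in (16)
    xi = np.sqrt(N) * np.hstack([np.eye(N), -np.eye(N)])   # sqrt(m/2) [1]_i, m = 2N
    return (U @ xi).T + x_est                              # shape (2N, N)

def time_update(x_est, P, g, Q):
    """Propagate the points through g as in (17) and average them as in (18)."""
    pts = cubature_points(x_est, P)
    prop = np.array([g(p) for p in pts])
    x_pred = prop.mean(axis=0)
    P_pred = prop.T @ prop / len(prop) - np.outer(x_pred, x_pred) + Q
    return x_pred, P_pred

# Sanity check with a linear map, for which the cubature rule is exact.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
P0 = np.array([[2.0, 0.3], [0.3, 1.0]])
Q = 0.1 * np.eye(2)
x0 = np.array([1.0, -2.0])
x_pred, P_pred = time_update(x0, P0, lambda x: A @ x, Q)
```

For a linear state function the result coincides with the standard Kalman prediction, `A @ x0` and `A @ P0 @ A.T + Q`, which gives a quick way to validate an implementation before applying it to the sigmoid model.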
3.2. Estimation of Sparse Parameters Using Kalman Filter. The state estimates are obtained using the CKF as described in the previous subsection. In order to estimate the unknown parameters in the system model, one of the most commonly used methods involves stacking the parameters with the states and estimating them together. The estimation process performed in this manner is called joint estimation. Another method for the estimation of parameters consists of a two-step recursive process which is termed dual estimation. This process estimates the states in the first step, and with the assumption that the states are known, the parameters are estimated in the second step. These steps are repeated until the algorithm converges to the true values or until the available observations are exhausted. This paper makes use of the latter technique.
The vector $\mathbf{b}$ as defined in (6) is assumed to evolve as a Gauss-Markov model. As discussed previously, the states are assumed to be known at this step. The system evolution equations can therefore be expressed as

$$\mathbf{b}_k = \mathbf{b}_{k-1} + \boldsymbol{\eta}_{k-1},$$
$$\mathbf{y}_k = \mathbf{R}_k \mathbf{b}_k + \mathbf{e}_k, \tag{19}$$

where $\boldsymbol{\eta}_k$ stands for the i.i.d. Gaussian noise and $\mathbf{R}_k$ is as defined in (8). It is observed that (19) is a system of linear equations with additive Gaussian noise, and therefore, the Kalman filter is the optimal choice for the estimation of the parameter vector. The standard predict and update steps involved in the Kalman filter are summarized as follows:
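$$\hat{\mathbf{b}}_{k|k-1} = \hat{\mathbf{b}}_{k-1|k-1},$$
$$\mathbf{P}_{b,k|k-1} = \mathbf{P}_{b,k-1|k-1} + \mathbf{Q}_{\eta},$$
$$\mathbf{u}_k = \mathbf{y}_k - \mathbf{R}_k \hat{\mathbf{b}}_{k|k-1},$$
$$\mathbf{K} = \mathbf{P}_{b,k|k-1} \mathbf{R}_k^T \left( \mathbf{R}_k \mathbf{P}_{b,k|k-1} \mathbf{R}_k^T + \sigma_e^2 \mathbf{I} \right)^{-1}, \tag{20}$$
$$\hat{\mathbf{b}}_{k|k} = \hat{\mathbf{b}}_{k|k-1} + \mathbf{K} \mathbf{u}_k,$$
$$\mathbf{P}_{b,k|k} = (\mathbf{I} - \mathbf{K} \mathbf{R}_k) \mathbf{P}_{b,k|k-1},$$

where $\mathbf{K}$ denotes the Kalman gain, $\mathbf{P}_b$ represents the error covariance matrix, $\mathbf{Q}_{\eta}$ is the covariance of the noise $\boldsymbol{\eta}_k$, and $\sigma_e^2$ is the variance of the effective noise $\mathbf{e}_k$.

The Kalman filter algorithm is based on an $\ell_2$-norm minimization criterion. As the gene networks are known to be highly sparse, the parameter vector is expected to have only a few nonzero values. A more accurate approach for estimating such a vector is to introduce an additional constraint on its $\ell_1$-norm, which is the core idea in compressed sensing [38, 44]. Such an $\ell_1$-norm constraint provides a unique solution to the underdetermined set of equations [45]. Therefore, instead of a simple $\ell_2$-norm minimization, the following constrained optimization problem is considered:

$$\min_{\hat{\mathbf{b}}_k} \left\| \hat{\mathbf{b}}_k - \mathbf{b}_k \right\|_2^2 \quad \text{s.t.} \quad \left\| \hat{\mathbf{b}}_k \right\|_1 \leq \epsilon. \tag{21}$$

The importance of this constraint can be judged by the fact that without it, the system would be rendered unidentifiable [43].

The problem (21) can be solved using a pseudomeasurement (PM) method which incorporates the inequality constraint of (21) in the filtering process by assuming an artificial measurement $\|\hat{\mathbf{b}}_k\|_1 - \epsilon' = 0$, where $\epsilon'$ denotes the pseudonoise. This is concisely expressed as

$$0 = \bar{\mathbf{R}} \hat{\mathbf{b}}_k - \epsilon',$$
$$\bar{\mathbf{R}} = [\operatorname{sign}(\hat{b}_k(1)), \ldots, \operatorname{sign}(\hat{b}_k(N^2))]. \tag{22}$$

The value of the covariance matrix $\mathbf{R}_{\epsilon} = \sigma_{\epsilon}^2 \mathbf{I}$ of the pseudonoise is selected in a similar manner as the process noise covariance in the EKF algorithm. However, it is found that large values of the variance, that is, $\sigma_{\epsilon}^2 \geq 100$, prove sufficient in most cases [38]. Further details on selecting these parameters can be found in [38, 46]. The PM method solves (21) in a recursive manner for $N_{\tau}$ iterations, indexed by $\tau$, using the following steps:

$$\mathbf{K}^{\tau} = \mathbf{P}^{\tau} \bar{\mathbf{R}}_{\tau}^T \left( \bar{\mathbf{R}}_{\tau} \mathbf{P}^{\tau} \bar{\mathbf{R}}_{\tau}^T + \mathbf{R}_{\epsilon} \right)^{-1},$$
$$\hat{\mathbf{b}}^{\tau+1} = (\mathbf{I} - \mathbf{K}^{\tau} \bar{\mathbf{R}}_{\tau}) \hat{\mathbf{b}}^{\tau}, \tag{23}$$
$$\mathbf{P}^{\tau+1} = (\mathbf{I} - \mathbf{K}^{\tau} \bar{\mathbf{R}}_{\tau}) \mathbf{P}^{\tau},$$

where $\bar{\mathbf{R}}_{\tau}$ is the sign vector of (22) evaluated at $\hat{\mathbf{b}}^{\tau}$. At each $k$th time instant, $\mathbf{P}_{b,k|k}$ and $\hat{\mathbf{b}}_{k|k}$ obtained from (20) are considered as initial values; that is, $\hat{\mathbf{b}}^{1} = \hat{\mathbf{b}}_{k|k}$ and $\mathbf{P}^{1} = \mathbf{P}_{b,k|k}$, which is the error covariance matrix. The value of $N_{\tau}$ is equal to the number of constraints, that is, the expected number of nonzero entries of $\mathbf{b}$ in the system. Possible ways for calculating $N_{\tau}$ include the minimum description length (MDL) principle and the Bayesian information criterion (BIC).

Algorithm 1: Network inference: CKFS.
(1) Input time series data set $\mathbf{y}$.
(2) Initialize $K$, $N_{\tau}$, $\hat{\mathbf{b}}_0$, $\hat{\mathbf{x}}_0$.
(3) for $k = 1, \ldots, K$ do
(4)     Find the state estimates using CKF following the time and measurement update steps in Section 3.
(5)     Estimate parameters $\hat{\mathbf{b}}_k$ from $\hat{\mathbf{x}}_k$ and $\mathbf{y}_k$ using (20).
(6)     for $\tau = 1, \ldots, N_{\tau}$ do
(7)         Update the parameters $\hat{\mathbf{b}}_k$ using (23).
(8)     end for
(9) end for
(10) return $\hat{\mathbf{b}}$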
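The parameter update of (20) and the pseudomeasurement refinement of (23) can be sketched as two small NumPy functions. This is an illustrative sketch under the assumption that the pseudomeasurement is scalar, so that its noise covariance is the single number `sigma2_eps`; the function names, the argument `Q_eta` for the covariance of the parameter noise, and the toy inputs are hypothetical.

```python
import numpy as np

def kf_parameter_update(b_est, P, R_k, y_k, Q_eta, sigma2_e):
    """One predict/update cycle of (20) for the parameter vector."""
    b_pred = b_est                                   # random-walk prediction
    P_pred = P + Q_eta
    u = y_k - R_k @ b_pred                           # innovation
    G = R_k @ P_pred @ R_k.T + sigma2_e * np.eye(len(y_k))
    K = P_pred @ R_k.T @ np.linalg.inv(G)            # Kalman gain
    b_new = b_pred + K @ u
    P_new = (np.eye(len(b_est)) - K @ R_k) @ P_pred
    return b_new, P_new

def pm_refine(b_est, P, n_iter, sigma2_eps):
    """Pseudomeasurement iterations of (23) that shrink the l1-norm of b."""
    b, P = b_est.copy(), P.copy()
    I = np.eye(len(b))
    for _ in range(n_iter):
        R_bar = np.sign(b)[None, :]                  # sign row vector of (22)
        K = P @ R_bar.T / (R_bar @ P @ R_bar.T + sigma2_eps)
        b = (I - K @ R_bar) @ b
        P = (I - K @ R_bar) @ P
    return b, P

# Toy usage: one observation of the first of two parameters.
b1, P1 = kf_parameter_update(np.zeros(2), np.eye(2), np.array([[1.0, 0.0]]),
                             np.array([2.0]), np.zeros((2, 2)), 1.0)
# Toy usage: a single refinement step on a vector with one small entry.
b0 = np.array([1.0, 0.01, -0.5])
b_ref, _ = pm_refine(b0, np.eye(3), n_iter=1, sigma2_eps=100.0)
```

In the refinement step every entry is pulled toward zero by an amount proportional to the current $\ell_1$-norm, so small entries are driven to the neighborhood of zero while large entries are only slightly affected.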
3.3. Inference Algorithm. The network inference algorithm is summarized in Algorithm 1. The algorithm consists of a recursive process which repeats itself for the number of observations present in the time-series data. For each time sample, the state estimate is obtained using the CKF, and the parameter estimate is obtained using the KF. Since the parameters are expected to be sparse, the estimates are then refined further using the CSKF algorithm. This iterative process results in a simple and accurate algorithm for gene network inference while considering a complex nonlinear model.

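4. Cramér-Rao Bound
The performance of an estimator can be judged by comparing it with theoretical lower bounds proposed in parameter estimation theory. The CRB establishes a lower bound on the MSE of an unbiased estimator [47]. In particular, the CRB states that the covariance matrix of the estimator $\hat{\mathbf{b}}$ is lower bounded by

$$E\left[(\hat{\mathbf{b}} - \mathbf{b})(\hat{\mathbf{b}} - \mathbf{b})^T\right] \geq [\mathbf{I}(\mathbf{b})]^{-1}, \tag{24}$$

where the matrix inequality is to be interpreted in the semidefinite sense and $\mathbf{I}(\mathbf{b})$ is the Fisher information matrix (FIM)

$$\mathbf{I}(\mathbf{b}) = E\left[\left(\frac{\partial \ln p(\mathbf{y} \mid \mathbf{b})}{\partial \mathbf{b}}\right)\left(\frac{\partial \ln p(\mathbf{y} \mid \mathbf{b})}{\partial \mathbf{b}}\right)^T\right]. \tag{25}$$

The CRB for gene network inference can be calculated as follows. By stacking all the observations for $k = 1, \ldots, K$, (7) can be written compactly in the matrix form

$$\mathbf{y} = \mathbf{R}\mathbf{b} + \mathbf{e}, \tag{26}$$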
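where $\mathbf{y} = [\mathbf{y}_1^T, \ldots, \mathbf{y}_K^T]^T$, $\mathbf{R} = [\mathbf{R}_1^T, \ldots, \mathbf{R}_K^T]^T$, and $\mathbf{e} = [\mathbf{e}_1^T, \ldots, \mathbf{e}_K^T]^T$. The PDF $p(\mathbf{y} \mid \mathbf{b})$ is expressed as

$$p(\mathbf{y} \mid \mathbf{b}) = c \exp\left(-\frac{(\mathbf{y} - \mathbf{R}\mathbf{b})^T (\mathbf{y} - \mathbf{R}\mathbf{b})}{2\sigma_e^2}\right), \tag{27}$$

where $c$ is a constant and $\sigma_e^2$ is the variance of the effective noise. The derivative of $\ln p(\mathbf{y} \mid \mathbf{b})$ can be expressed as

$$\frac{\partial \ln p(\mathbf{y} \mid \mathbf{b})}{\partial \mathbf{b}} = \frac{\partial}{\partial \mathbf{b}}\left[-\frac{(\mathbf{y} - \mathbf{R}\mathbf{b})^T (\mathbf{y} - \mathbf{R}\mathbf{b})}{2\sigma_e^2}\right] = \frac{\mathbf{R}^T \mathbf{y} - \mathbf{R}^T \mathbf{R}\mathbf{b}}{\sigma_e^2}. \tag{28}$$

It now follows that

$$\left(\frac{\partial \ln p(\mathbf{y} \mid \mathbf{b})}{\partial \mathbf{b}}\right)\left(\frac{\partial \ln p(\mathbf{y} \mid \mathbf{b})}{\partial \mathbf{b}}\right)^T = \frac{\mathbf{R}^T (\mathbf{y} - \mathbf{R}\mathbf{b})(\mathbf{y} - \mathbf{R}\mathbf{b})^T \mathbf{R}}{\sigma_e^4}. \tag{29}$$

By taking the expectation of (29), the FIM in (25) is given by

$$\mathbf{I}(\mathbf{b}) = \frac{\mathbf{R}^T \mathbf{R}}{\sigma_e^2}. \tag{30}$$

The inverse of the FIM in (30) can be used to place a lower bound on the estimation error of the parameter vector $\mathbf{b}$. Figure 2 shows the comparison of the MSE of the CKFS algorithm with the CRB as a function of the number of samples for one representative gene from the eight-gene network considered in Section 5.1. It is observed that the MSE of the estimated parameters decreases with increasing number of samples.

Figure 2: Cramér-Rao bound on the estimation of parameters ($\log_{10}$ MSE versus sample size; curves: CRB and CKFS). The MSE for one representative parameter is shown here for a network consisting of 8 vertices.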
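The bound implied by (30) can be evaluated numerically once the stacked regressor $\mathbf{R}$ is available. The sketch below assumes that the sigmoid outputs are known exactly and that $\mathbf{R}^T \mathbf{R}$ is invertible, which requires at least as many linearly independent samples as genes; the two-gene values in `f_rows` and the function names are hypothetical.

```python
import numpy as np

def regressor(f):
    """R_k of (8): the row vector f repeated on the block diagonal."""
    return np.kron(np.eye(len(f)), f[None, :])

def cramer_rao_bound(f_rows, sigma2_e):
    """Inverse of the Fisher information matrix of (30)."""
    R = np.vstack([regressor(f) for f in f_rows])   # stacked R of (26)
    fim = R.T @ R / sigma2_e                        # Fisher information matrix
    return np.linalg.inv(fim)

# Hypothetical sigmoid outputs for a two-gene network at three time points.
f_rows = np.array([[0.5, 0.7],
                   [0.6, 0.2],
                   [0.3, 0.9]])
bound_2 = cramer_rao_bound(f_rows[:2], sigma2_e=1e-2)   # two samples
bound_3 = cramer_rao_bound(f_rows, sigma2_e=1e-2)       # three samples
```

The diagonal of the returned matrix bounds the variance of each parameter estimate. Adding a sample can only increase the Fisher information, so the diagonal of `bound_3` is no larger than that of `bound_2`, in line with the decreasing trend described for Figure 2.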

5. Results and Discussion


The simulation results of the CKFS algorithm are discussed in this section. The performance is first tested on synthetic data obtained from randomly generated Boolean networks under various scenarios and performance metrics. The algorithm is then assessed on the DREAM4 networks and the IRMA network.
5.1. Synthetic Data. Time-series data is produced from randomly generated Boolean networks using the system model (3) and (5). Two scenarios are considered for this purpose. First, the comparison is performed by varying the sample size while keeping the network size fixed. The gene network consists of 8 genes and 20 edges. In terms of network estimation, if the algorithm predicts an edge between two nodes which is not present in reality, an error, referred to as false alarm error (F), is said to have occurred. Another situation is the indication of the absence of an edge in the graph which in fact is present in the real network. This kind of error is termed missed detection (M). The summation of these two errors normalized over the total number of edges in the network yields the Hamming distance. It is also important to consider the probability of predicting the true connections correctly, which is assessed by the true connections (T) metric. An algorithm with low Hamming distance and small false alarm error is particularly desirable, as predicting an edge erroneously can be troublesome for biologists. True connections indicate the reliability of the predictions. Figure 3 illustrates the performance of the CKFS algorithm and that of the EKF algorithm proposed in [34] in terms of the metrics described above. It is important to mention here that the same system model is assumed by both CKFS and EKF algorithms for the purpose of this simulation. These metrics are the same as those used in [15]. The variances of both the system and measurement noises, $\sigma_w^2$ and $\sigma_v^2$, respectively, are taken to be $10^{-5}$ in all the simulations and are assumed to be known. It is noticed that EKF has a slightly lower false alarm rate when the number of samples is small; however, as the number of samples increases, CKFS yields a lower false alarm error. The Hamming distance for CKFS is also smaller than that for EKF, indicating a smaller cumulative error. True connections show a consistent behavior for the two algorithms when the number of samples is increased, with CKFS predicting connections more accurately. These experiments show the superiority of CKFS in terms of lower error rate.
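The four metrics described above can be computed directly from edge sets. In the sketch below, edges are represented as (regulator, target) pairs and every metric is normalized by the number of true edges; this normalization follows the caption of Figure 3, and the representation, the function name `edge_metrics`, and the toy edge lists are assumptions made for illustration.

```python
def edge_metrics(true_edges, predicted_edges):
    """False alarms (F), missed detections (M), Hamming distance, and true connections (T)."""
    true_edges = set(true_edges)
    predicted_edges = set(predicted_edges)
    n_true = len(true_edges)
    false_alarms = len(predicted_edges - true_edges)   # predicted but not real
    missed = len(true_edges - predicted_edges)         # real but not predicted
    correct = len(true_edges & predicted_edges)        # real and predicted
    return {
        "F": false_alarms / n_true,
        "M": missed / n_true,
        "hamming": (false_alarms + missed) / n_true,
        "T": correct / n_true,
    }

# Toy example with directed edges written as (regulator, target).
truth = [(1, 2), (2, 3), (3, 1), (1, 4)]
predicted = [(1, 2), (2, 3), (4, 1)]
metrics = edge_metrics(truth, predicted)
```

Because the edges are directed, the reversed prediction (4, 1) counts both as a false alarm and as a missed detection of (1, 4).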
To obtain a more rigorous evaluation, the performance of the algorithms is then compared in a scenario which considers the sample size to be fixed and assumes networks of different sizes. The receiver operating characteristic (ROC) curves are plotted as performance measures. A higher area under the ROC curve (AUROC) shows more true positives for a given false positive rate and therefore indicates better classification. The performance of CKFS and EKF is shown in Figure 4 for networks described by the number of nodes, the number of edges, and the number of time points. It is observed that the CKFS exhibits performance superior to that of the EKF for networks of different sizes.

Figure 3: (a), (b), and (c) False alarm errors, Hamming distance, and true connections, each plotted against sample size for EKF and CKFS. The synthetic networks consist of 8 vertices and 20 edges. The metric is normalized over the number of edges. CKFS gives lower error and predicts more true connections with the increase in the sample size of data.
The complexity of the two algorithms is compared for synthetically generated networks with the number of genes equal to 10, 20, 30, and 40. The sample size is kept at 50 time points for each of these networks, and the run time for the EKF and CKFS algorithms is calculated as shown in Table 1. It is noted that EKF is faster for smaller network sizes, but as the network size increases, its run time gets much larger than that for CKFS. The main reason for this is that EKF [34] estimates the states and parameters by stacking them together, which requires large-sized matrix multiplications at each iteration. The benefit associated with performing dual estimation, as in CKFS, is that the parameters are estimated separately from the states. Since the system is linear and one-to-one for parameters, inversion of much smaller matrices can be performed, reducing the computational complexity of the CKFS algorithm. CKFS is therefore particularly attractive for large-sized networks.
Figure 4: ROC curves (true positive rate versus false positive rate) for the performance of CKFS and EKF using synthetic data. Panels (a), (b), and (c) correspond to (number of nodes, number of edges, number of time points) equal to (5, 10, 20), (10, 12, 20), and (15, 19, 20), respectively. The area under the ROC curve for CKFS is larger than that for EKF for networks of various sizes.

Table 1: Run time in seconds for EKF and CKFS algorithms for varying network sizes for synthetically generated data. The number of sample points is fixed to 50.

Number of genes    10      20      30      40
EKF                0.16    1.9     16.5    84
CKFS               1.2     4.3     11.5    24.1

5.2. DREAM4 Gene Networks. Several in silico networks have been produced in order to benchmark the performance of gene network inference algorithms. The Dialogue on Reverse Engineering Assessment and Methods (DREAM) in silico networks serve as one of the popular benchmarks used for this purpose [39, 48]. In this section, the performance of the CKFS algorithm is evaluated using the 10-gene and 100-gene networks released online by the DREAM4 challenge. Five networks are produced using the known GRNs of Escherichia coli and Saccharomyces cerevisiae. The data sets for each of the 10-gene networks consist of 21 data points for five different perturbations. The inference is performed by using all the perturbations. The 100-gene networks consist of data sets for ten perturbations. AUROC and area under the precision-recall curve (AUPR) are calculated for the five networks of both the data sets and shown in Tables 2 and 3, respectively. The quantities precision and recall are defined as $\text{precision} = TP/(TP + FP)$ and $\text{recall} = TP/(TP + FN)$, respectively, where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives. For comparison purposes, the values of the two quantities for the time-series network identification (TSNI) algorithm that exploits ordinary differential equations are also given [39]. The CKFS algorithm is found to perform significantly better than the TSNI algorithm.

Figure 5: The inferred IRMA networks (nodes: GAL80, GAL4, SWI5, CBF1, and ASH1). (a), (b), and (c) Gold standard, inferred network using CKFS, and inferred network using ODE [39, 40]. Black arrows indicate true connections, blue arrows indicate the edges that are correct but whose directions are reversed, and red arrows indicate false positives.

Table 2: Area under the ROC curve (AUROC) and area under the PR curve (AUPR) for DREAM4 10-gene networks for the five different networks. Entries are given as AUROC (AUPR).

Algorithm      Network 1      Network 2      Network 3      Network 4      Network 5
ODE [39]       0.62 (0.27)    0.63 (0.32)    0.58 (0.21)    0.63 (0.23)    0.68 (0.25)
CKFS           0.63 (0.40)    0.67 (0.50)    0.72 (0.50)    0.75 (0.49)    0.81 (0.42)
Random [39]    0.55 (0.18)    0.55 (0.19)    0.55 (0.17)    0.57 (0.17)    0.56 (0.16)

Table 3: Area under the ROC curve (AUROC) and area under the PR curve (AUPR) for DREAM4 100-gene networks for the five different networks. Entries are given as AUROC (AUPR).

Algorithm      Network 1       Network 2       Network 3       Network 4       Network 5
ODE [39]       0.55 (0.02)     0.55 (0.03)     0.60 (0.03)     0.54 (0.02)     0.59 (0.03)
CKFS           0.67 (0.13)     0.57 (0.08)     0.60 (0.10)     0.62 (0.10)     0.60 (0.07)
Random [39]    0.50 (0.002)    0.50 (0.002)    0.50 (0.002)    0.50 (0.002)    0.50 (0.002)
5.3. IRMA Gene Network. In addition to synthetic data, it is imperative to test the algorithms using real biological data. In this subsection, the performance of the CKFS algorithm is assessed using the in vivo reverse-engineering and modeling assessment (IRMA) network [40]. This network consists of five genes. Galactose activates the gene expression in the network, whereas glucose deactivates it. The cells are grown in the presence of galactose and then switched to glucose to obtain the switch-off data, which represents the expression samples at 21 time points. The switch-on data consists of 16 sample points and is obtained by growing the cells in a glucose medium and then changing to galactose. The system and measurement noise variances for the CKFS are assumed to be the same as in the previous simulations. Figure 5 shows the inferred network, the gold standard, and the network inferred using TSNI. It is observed that the CKFS algorithm succeeds in predicting most of the interactions while giving fewer false positives.

6. Conclusions

Tis paper presents a novel algorithm for inferring gene


regulatory networks from time-series data. Gene regulation
is assumed to follow a nonlinear state evolution model. Te
parameters of the system, which indicate the inhibitory or
excitatory relationships between the genes, are estimated
using compressed sensing-based Kalman fltering. Te sparsity constraint on the parameters is crucial because the genes
are known to interact with few other genes only. Te use of
CKF and the dual estimation of states and parameters renders
the algorithm computationally efcient. Te performance of
CKFS is evaluated for synthetic data for diferent network
sizes as well as varying sample points. ROC curves, Hamming
distance, and true positives are used for comparing the
accuracy of inferred network with EKF. It is observed that
CKFS outperforms the EKF algorithm. In addition, CKFS

30
gives advantages over EKF in terms of smaller run time for
large networks. Te Cramer-Rao lower bound is also determined for the parameters of the model and compared with
the MSE performance of the proposed algorithm. Assessment using DREAM4 10-gene and 100-gene networks and
IRMA network data corroborates the superior performance
of CKFS. Future research directions include incorporating
the estimation of model order in the network inference
algorithm.

Acknowledgments
This work was supported by US National Science Foundation
(NSF) Grant 0915444 and QNRF-NPRP Grant 09-874-3235. The material in this paper was presented in part at the
IEEE International Workshop on Genomic Signal Processing
and Statistics (GENSIPS), San Antonio, TX, USA, December
2011.

References
[1] X. Zhou, X. Wang, and E. R. Dougherty, Genomic Networks: Statistical Inference from Microarray Data, John Wiley & Sons, New York, NY, USA, 2006.
[2] H. Kitano, "Computational systems biology," Nature, vol. 420, pp. 206-210, 2002.
[3] X. Zhou and S. T. C. Wong, Computational Systems Bioinformatics, World Scientific, River Edge, NJ, USA, 2008.
[4] X. Cai and X. Wang, "Stochastic modeling and simulation of gene networks," IEEE Signal Processing Magazine, vol. 24, no. 1, pp. 27-36, 2007.
[5] D. Yue, J. Meng, M. Lu, C. L. P. Chen, M. Guo, and Y. Huang, "Understanding micro-RNA regulation: a computational perspective," IEEE Signal Processing Magazine, vol. 29, no. 1, pp. 77-88, 2012.
[6] R. Pal, S. Bhattacharya, and M. U. Caglar, "Robust approaches for genetic regulatory network modeling and intervention: a review of recent advances," IEEE Signal Processing Magazine, vol. 29, no. 1, pp. 66-76, 2012.
[7] H. Hache, H. Lehrach, and R. Herwig, "Reverse engineering of gene regulatory networks: a comparative study," Eurasip Journal on Bioinformatics and Systems Biology, vol. 2009, Article ID 617281, 2009.
[8] T. Schlitt and A. Brazma, "Current approaches to gene regulatory network modelling," BMC Bioinformatics, vol. 8, no. 6, p. 9, 2007.
[9] H. D. Jong, "Modeling and simulation of genetic regulatory systems: a literature review," Journal of Computational Biology, vol. 9, no. 1, pp. 67-103, 2002.
[10] I. Nachman, A. Regev, and N. Friedman, "Inferring quantitative models of regulatory networks from expression data," Bioinformatics, vol. 20, no. 1, pp. i248-i256, 2004.
[11] C. D. Giurcaneanu, I. Tabus, and J. Astola, "Clustering time series gene expression data based on sum-of-exponentials fitting," EURASIP Journal on Advances in Signal Processing, vol. 2005, no. 8, Article ID 358568, pp. 1159-1173, 2005.
[12] C. D. Giurcaneanu, I. Tabus, J. Astola, J. Ollila, and M. Vihinen, "Fast iterative gene clustering based on information theoretic criteria for selecting the cluster structure," Journal of Computational Biology, vol. 11, no. 4, pp. 660-682, 2004.
[13] X. Cai and G. B. Giannakis, "Identifying differentially expressed genes in microarray experiments with model-based variance estimation," IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 2418-2426, 2006.
[14] X. Zhou, X. Wang, and E. R. Dougherty, "Gene clustering based on cluster-wide mutual information," Journal of Computational Biology, vol. 11, no. 1, pp. 151-165, 2004.
[15] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring connectivity of genetic regulatory networks using information-theoretic criteria," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 2, pp. 262-274, 2008.
[16] J. Dougherty, I. Tabus, and J. Astola, "Inference of gene regulatory networks based on a universal minimum description length," Eurasip Journal on Bioinformatics and Systems Biology, vol. 2008, Article ID 482090, 2008.
[17] L. Qian, H. Wang, and E. R. Dougherty, "Inference of noisy nonlinear differential equation models for gene regulatory networks using genetic programming and Kalman filtering," IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3327-3339, 2008.
[18] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring gene regulatory networks from time series data using the minimum description length principle," Bioinformatics, vol. 22, no. 17, pp. 2129-2135, 2006.
[19] X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R. Dougherty, "A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks," Bioinformatics, vol. 20, no. 17, pp. 2918-2927, 2004.
[20] J. Meng, M. Lu, Y. Chen, S.-J. Gao, and Y. Huang, "Robust inference of the context specific structure and temporal dynamics of gene regulatory network," BMC Genomics, vol. 11, no. 3, p. S11, 2010.
[21] Y. Zhang, Z. Deng, H. Jiang, and P. Jia, "Inferring gene regulatory networks from multiple data sources via a dynamic Bayesian network with structural EM," in DILS, S. C. Boulakia and V. Tannen, Eds., vol. 4544 of Lecture Notes in Computer Science, pp. 204-214, Springer, New York, NY, USA, 2007.
[22] K. Murphy and S. Mian, "Modeling gene expression data using dynamic Bayesian networks," University of California, Berkeley, Calif, USA, 2001.
[23] H. Liu, D. Yue, L. Zhang, Y. Chen, S. J. Gao, and Y. Huang, "A Bayesian approach for identifying miRNA targets by combining sequence prediction and gene expression profiling," BMC Genomics, vol. 11, no. 3, p. S12, 2010.
[24] Y. Huang, J. Wang, J. Zhang, M. Sanchez, and Y. Wang, "Bayesian inference of genetic regulatory networks from time series microarray data using dynamic Bayesian networks," Journal of Multimedia, vol. 2, no. 3, pp. 46-56, 2007.
[25] B.-E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. d'Alché-Buc, "Gene networks inference using dynamic Bayesian networks," Bioinformatics, vol. 19, no. 2, pp. ii138-ii148, 2003.
[26] C. Rangel, D. L. Wild, F. Falciani, Z. Ghahramani, and A. Gaiba, "Modelling biological responses using gene expression profiling and linear dynamical systems," Bioinformatics, pp. 349-356, 2005.
[27] M. Quach, N. Brunel, and F. d'Alché-Buc, "Estimating parameters and hidden variables in non-linear state-space models based on ODEs for biological networks inference," Bioinformatics, vol. 23, no. 23, pp. 3209-3216, 2007.
[28] F.-X. Wu, W.-J. Zhang, and A. J. Kusalik, "Modeling gene expression from microarray expression data with state-space equations," in Pacific Symposium on Biocomputing, R. B. Altman, A. K. Dunker, L. Hunter, T. A. Jung, and T. E. Klein, Eds., pp. 581-592, World Scientific, River Edge, NJ, USA, 2004.
[29] R. Yamaguchi, S. Yoshida, S. Imoto, T. Higuchi, and S. Miyano, "Finding module-based gene networks with state-space models: mining high-dimensional and short time-course gene expression data," IEEE Signal Processing Magazine, vol. 24, no. 1, pp. 37-46, 2007.
[30] O. Hirose, R. Yoshida, S. Imoto et al., "Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models," Bioinformatics, vol. 24, no. 7, pp. 932-942, 2008.
[31] J. Angus, M. Beal, J. Li, C. Rangel, and D. Wild, "Inferring transcriptional networks using prior biological knowledge and constrained state-space models," in Learning and Inference in Computational Systems Biology, N. Lawrence, M. Girolami, M. Rattray, and G. Sanguinetti, Eds., pp. 117-152, MIT Press, Cambridge, UK, 2010.
[32] C. Rangel, J. Angus, Z. Ghahramani et al., "Modeling T-cell activation using gene expression profiling and state-space models," Bioinformatics, vol. 20, no. 9, pp. 1361-1372, 2004.
[33] A. Noor, E. Serpedin, M. N. Nounou, and H. N. Nounou, "Inferring gene regulatory networks via nonlinear state-space models and exploiting sparsity," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1203-1211, 2012.
[34] Z. Wang, X. Liu, Y. Liu, J. Liang, and V. Vinciotti, "An extended Kalman filtering approach to modeling nonlinear dynamic gene regulatory networks via short gene expression time series," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 3, pp. 410-419, 2009.
[35] A. Noor, E. Serpedin, M. Nounou, H. Nounou, N. Mohamed, and L. Chouchane, "An overview of the statistical methods used for inferring gene regulatory networks and protein-protein interaction networks," Advances in Bioinformatics, vol. 2013, Article ID 953814, 12 pages, 2013.
[36] I. Arasaratnam and S. Haykin, "Cubature Kalman filters," IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1254-1269, 2009.
[37] A. Noor, E. Serpedin, M. N. Nounou, and H. N. Nounou, "A cubature Kalman filter approach for inferring gene regulatory networks using time series data," in Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '11), pp. 25-28, 2011.
[38] A. Carmi, P. Gurfil, and D. Kanevsky, "Methods for sparse signal recovery using Kalman filtering with embedded pseudo-measurement norms and quasi-norms," IEEE Transactions on Signal Processing, vol. 58, no. 4, pp. 2405-2409, 2010.
[39] C. A. Penfold and D. L. Wild, "How to infer gene networks from expression profiles, revisited," Interface Focus, pp. 857-870, 2011.
[40] I. Cantone, L. Marucci, F. Iorio et al., "A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches," Cell, vol. 137, no. 1, pp. 172-181, 2009.
[41] Y. Huang, I. M. Tienda-Luna, and Y. Wang, "Reverse engineering gene regulatory networks: a survey of statistical models," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 76-97, 2009.
[42] Z. Wang, F. Yang, D. W. C. Ho, S. Swift, A. Tucker, and X. Liu, "Stochastic dynamic modeling of short gene expression time-series data," IEEE Transactions on Nanobioscience, vol. 7, no. 1, pp. 44-55, 2008.
[43] H. Xiong and Y. Choe, "Structural systems identification of genetic regulatory networks," Bioinformatics, vol. 24, no. 4, pp. 553-560, 2008.
[44] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267-288, 1996.
[45] E. J. Candès and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203-4215, 2005.
[46] J. D. Geeter, H. V. Brussel, and J. D. Schutter, "A smoothly constrained Kalman filter," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 10, pp. 1171-1177, 1997.
[47] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice-Hall, New York, NY, USA, 1993.
[48] http://wiki.c2b2.columbia.edu/dream/.

Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Efficient Serial and Parallel Algorithms for Selection of
Unique Oligos in EST Databases
Manrique Mata-Montero,1 Nabil Shalaby,2 and Bradley Sheppard1,2
1 Department of Computer Science, Memorial University, Canada
2 Department of Mathematics and Statistics, Memorial University, Canada

Correspondence should be addressed to Nabil Shalaby; nshalaby@mun.ca

Received 15 October 2012; Accepted 14 February 2013
Academic Editor: Alexander Zelikovsky
Copyright © 2013 Manrique Mata-Montero et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Obtaining unique oligos from an EST database is a problem of great importance in bioinformatics, particularly in the discovery of
new genes and the mapping of the human genome. Many algorithms have been developed to find unique oligos, many of which are
much less time consuming than the traditional brute force approach. An algorithm was presented by Zheng et al. (2004) which finds
the solution of the unique oligos search problem efficiently. We implement this algorithm as well as several new algorithms based
on some theorems included in this paper. We demonstrate how, with these new algorithms, we can obtain unique oligos much faster
than with previous ones. We parallelize these new algorithms to further improve the time of finding unique oligos. All algorithms
are run on ESTs obtained from a Barley EST database.

1. Introduction
Expressed Sequence Tags (or ESTs) are fragments of DNA
that are about 200-800 bases long, generated from the
sequencing of complementary DNA. ESTs have many applications.
They were used in the Human Genome Project in the
discovery of new genes and are often used in the mapping
of genomic libraries. They can be used to infer functions of
newly discovered genes based on comparison to known genes
[1].
An oligonucleotide (or oligo) is a subsequence of an EST.
Oligos are short, typically no longer than 50
nucleotide bases. Oligos are often referred to in the context
of their length by adding the suffix "mer." For example,
an oligo of length 9 would be referred to as a 9-mer. Oligos
are of considerable importance in relation to EST databases:
an oligo that is unique in an EST database serves
as a representative of its EST sequence. Oligos
contained in EST databases have
applications in many areas such as PCR primer design,
microarrays, and probing genomic libraries [2-4].
In this paper we improve on the algorithms presented
in [2] to solve the unique oligos search problem. This problem
requires us to determine all oligos that appear in one EST
sequence but not in any of the others. In addition, we
consider two oligos to be virtually identical if they fall within
a certain number of mismatches from each other. In the
appendix we include all the algorithms used and developed
in this paper.

2. The Unique Oligos Search Problem


In this paper we use the notation $\mathrm{HD}(a, b)$ to denote the
Hamming Distance between the strings $a$ and $b$. Given an
EST database $D = \{x_1, x_2, \ldots, x_k\}$, where each $x_i$ is a string over
the alphabet {A, C, G, T}, integers $l$ and $d$, and an $l$-mer $y$, we say
that $y$ occurs approximately in $D$ if there exists a substring $z$
of some EST $x_i$ such that $\mathrm{HD}(y, z) \le d$. We also say that a
$d$-mutant list of a string $w$ is a list of all possible strings, $w'$, of
length $|w|$ over the alphabet {A, C, G, T} such that $\mathrm{HD}(w, w') \le d$.
Such a string $w'$ is referred to as a $d$-mutant of $w$. A unique
oligo of $D$ is defined as an $l$-mer $u$ such that $u$ occurs in exactly
one EST and does not occur approximately in any other EST.
The unique oligos search problem is the problem of finding
all unique oligos in an EST database.
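
The definitions above can be made concrete with a short brute-force sketch. It is meant only as a reference for small inputs, not as the method of [2]; the function names (`hamming`, `occurs_approximately`, `mutant_list`, `unique_oligos`) are hypothetical, positions are 0-based, and the mutant list is taken to include the string itself because the definition uses "at most" $d$ mismatches.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_approximately(y, ests, d):
    """True if some substring z of some EST satisfies HD(y, z) <= d."""
    l = len(y)
    return any(
        hamming(y, est[p:p + l]) <= d
        for est in ests
        for p in range(len(est) - l + 1)
    )


def mutant_list(w, d):
    """All strings over ACGT within Hamming distance d of w (w included)."""
    out = set()
    for k in range(d + 1):
        for positions in combinations(range(len(w)), k):
            # At each chosen position, substitute one of the 3 other letters.
            choices = [[c for c in ALPHABET if c != w[p]] for p in positions]
            for letters in product(*choices):
                t = list(w)
                for p, c in zip(positions, letters):
                    t[p] = c
                out.add("".join(t))
    return sorted(out)


def unique_oligos(ests, l, d):
    """(EST index, start) of every l-mer with no d-approximate match elsewhere."""
    result = set()
    for i, est in enumerate(ests):
        others = ests[:i] + ests[i + 1:]
        for p in range(len(est) - l + 1):
            if not occurs_approximately(est[p:p + l], others, d):
                result.add((i, p))
    return result
```

Because every l-mer is compared against every other EST, this is the brute-force approach that the filtration algorithms below are designed to avoid.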

Require: EST database D = {x_1, x_2, ..., x_k}, integer l (length of unique oligos), and integer d
    (maximum number of mismatches between non-unique oligos)
Ensure: All unique l-mers in D
q ← ⌊l/(⌊d/2⌋ + 1)⌋
hashtable ← findqmers(D, q) (hashtable of positions of all q-mers in D)
for i ← 0 to 4^q − 1 {split loop iterations among processors} do
    b ← i as a base 4 integer of length q
    mutlist ← list of base 4 integers of length q mismatching b by 1 digit
    mutnums ← the numbers in mutlist in base 10
    mut ← list of each hashtable[j] for all j in mutnums
    goo2(D, l, d, hashtable[i], mut)
end for
Algorithm 1: Algorithm for the unique oligos problem.

Many algorithms have been presented to solve this problem [5, 6]. The algorithm presented in [2] relies on the
observation that if two $l$-mers agree within a specific Hamming
Distance, then they must share a certain substring. These
observations are presented in this paper as theorems.

Theorem 1. Suppose one has two $l$-mers $u_1$ and $u_2$ such that
$\mathrm{HD}(u_1, u_2) \le d$. If one divides them both into $\lfloor d/2 \rfloor + 1$ substrings,
$u_1^{1} u_1^{2} \cdots u_1^{\lfloor d/2 \rfloor + 1}$ and $u_2^{1} u_2^{2} \cdots u_2^{\lfloor d/2 \rfloor + 1}$, and each $u_j^{i}$, except
possibly $u_j^{\lfloor d/2 \rfloor + 1}$, has length $\lfloor l/(\lfloor d/2 \rfloor + 1) \rfloor$, then there exists at
least one $i_0 \in \{1, 2, \ldots, \lfloor d/2 \rfloor + 1\}$ such that $\mathrm{HD}(u_1^{i_0}, u_2^{i_0}) \le 1$.

Proof. Suppose by contradiction that for every $i \in \{1, 2, \ldots, \lfloor d/2 \rfloor + 1\}$,
$u_1^{i}$ and $u_2^{i}$ have at least 2 mismatches. Then
$\mathrm{HD}(u_1, u_2) \ge 2(\lfloor d/2 \rfloor + 1) \ge d + 1$, which contradicts the fact that
$\mathrm{HD}(u_1, u_2) \le d$.

Using this observation, an algorithm was presented in
[2] which solves the unique oligos search problem in time
$O((l - q)\, q\, n^2 / 4^q)$. The algorithm can be thought of as a two-phase
method. In the first phase we record the position of
each $q$-mer in the database into a hash table of size $4^q$. We
do so in such a way that for each $q$-mer $w$ over the alphabet
{A, C, G, T} we have that $\mathrm{hashtable}[\mathrm{map}(w)] = \{\{s_1, p_1\}, \{s_2, p_2\}, \ldots, \{s_m, p_m\}\}$,
whereby $s_i$ is an EST sequence,
$p_i$ is the position of $w$ within that sequence, and $m$ is the
number of occurrences of $w$ in the database. In the second
phase, we extend every pair of identical $q$-mers into $l$-mers
and compare these $l$-mers for nonuniqueness. We also do the
same for pairs that have a Hamming Distance of 1. If they are
nonunique, we mark them accordingly. Theorem 1 guarantees
that if an $l$-mer is nonunique, then it must contain a $q$-mer
substring that differs by at most one character from a
$q$-mer substring of another $l$-mer. Hence, if an $l$-mer is
nonunique, it will be marked during phase two.
Assuming there are $n$ symbols in our EST database, the
filing of the $q$-mers into the hash table takes time $O(n)$. In
phase two, we assume that the distribution of $q$-mers in the
database is uniform; in other words, that each table entry contains
$n/4^q$ positions. Thus we have $O(n^2/4^{2q})$ comparisons within
each table entry. Each $q$-mer also has a 1-mutant list of size $3q$,
so we have $O(q\, n^2/4^{2q})$ comparisons for each entry in the table.
Also, the time required to extend each pair of $q$-mers to $l$-mers
is $2(l - q + 1)$. Given that we have $4^q$ entries in the hash
table, we have a total time complexity of

$$O\!\left((l - q)\, q\, \frac{n^2}{4^{2q}}\, 4^q\right) = O\!\left(\frac{(l - q)\, q\, n^2}{4^q}\right), \qquad (1)$$

where

$$q = \left\lfloor \frac{l}{\lfloor d/2 \rfloor + 1} \right\rfloor. \qquad (2)$$
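
As a sanity check on Theorem 1, the following sketch splits two l-mers into ⌊d/2⌋ + 1 aligned blocks and tests whether some pair of blocks differs in at most one position. It is an illustration under assumptions, not part of the original implementation: the helper names are hypothetical, and the random pairs are generated only to exercise the statement for the parameter values l = 28 and d = 6 used in the experiments.

```python
import random


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def split_blocks(s, parts):
    """Split s into `parts` blocks; all but the last have length len(s) // parts."""
    q = len(s) // parts
    cuts = [i * q for i in range(parts)] + [len(s)]
    return [s[cuts[i]:cuts[i + 1]] for i in range(parts)]


def has_close_block(u1, u2, d):
    """Conclusion of Theorem 1: some aligned block pair has at most 1 mismatch."""
    parts = d // 2 + 1
    return any(
        hamming(a, b) <= 1
        for a, b in zip(split_blocks(u1, parts), split_blocks(u2, parts))
    )


def random_pair_within(l, d, rng):
    """A random l-mer together with a copy carrying at most d substitutions."""
    u1 = [rng.choice("ACGT") for _ in range(l)]
    u2 = list(u1)
    for p in rng.sample(range(l), rng.randint(0, d)):
        u2[p] = rng.choice([c for c in "ACGT" if c != u1[p]])
    return "".join(u1), "".join(u2)


def check_theorem1(l, d, trials, seed=0):
    """True if every sampled pair within distance d has a close block."""
    rng = random.Random(seed)
    return all(
        has_close_block(*random_pair_within(l, d, rng), d) for _ in range(trials)
    )
```

Calling `check_theorem1(28, 6, 1000)` exercises the theorem on sampled pairs; a pair with two mismatches in every block, which necessarily has more than d mismatches in total, fails `has_close_block` as expected.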

In [7], several variations of Theorem 1 are presented. We
can use these theorems to generate similar algorithms with
slightly different time complexities.

Theorem 2. Suppose one has two $l$-mers $u_1$ and $u_2$ such that
$\mathrm{HD}(u_1, u_2) \le d$. If one divides them both into $d + 1$ substrings,
$u_1^{1} u_1^{2} \cdots u_1^{d+1}$ and $u_2^{1} u_2^{2} \cdots u_2^{d+1}$, and each $u_j^{i}$, except possibly $u_j^{d+1}$, has
length $\lfloor l/(d + 1) \rfloor$, then there exists at least one $i_0 \in \{1, 2, \ldots, d + 1\}$
such that $u_1^{i_0} = u_2^{i_0}$.

Proof. Suppose by contradiction that we cannot find any $i_0 \in \{1, 2, \ldots, d + 1\}$
such that $u_1^{i_0} = u_2^{i_0}$. Then there exists at least
one mismatch between $u_1^{i}$ and $u_2^{i}$ for each $i \in \{1, 2, \ldots, d + 1\}$,
and thus we have at least $d + 1$ mismatches, which contradicts
the fact that $\mathrm{HD}(u_1, u_2) \le d$.

Based on Theorem 2 we can design a second algorithm
that works in a similar way to Algorithm 1. The major difference
between these algorithms is that in Algorithm 2 we are
not required to do comparisons with each hash table entry's
mutant list. This means we have $O(n^2/4^{2q})$ comparisons within
each table entry, which yields a total time complexity of

$$O\!\left((l - q)\, \frac{n^2}{4^{2q}}\, 4^q\right) = O\!\left(\frac{(l - q)\, n^2}{4^q}\right), \qquad (3)$$
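
Theorem 2 can be verified exhaustively for short strings. The sketch below fixes one l-mer, enumerates every string of the same length over {A, C, G, T}, and confirms that each one within Hamming distance d shares an identical aligned block when both are cut into d + 1 blocks. The function names are hypothetical, and the exhaustive enumeration is only practical for small l.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def shares_exact_block(u1, u2, d):
    """Conclusion of Theorem 2: some aligned block of the d + 1 blocks is identical."""
    parts = d + 1
    q = len(u1) // parts
    cuts = [i * q for i in range(parts)] + [len(u1)]
    return any(
        u1[cuts[i]:cuts[i + 1]] == u2[cuts[i]:cuts[i + 1]] for i in range(parts)
    )


def check_theorem2(u1, d):
    """Exhaustively test every string of length len(u1) against u1."""
    for letters in product("ACGT", repeat=len(u1)):
        u2 = "".join(letters)
        if hamming(u1, u2) <= d and not shares_exact_block(u1, u2, d):
            return False
    return True
```

For example, `check_theorem2("ACGTAC", 2)` examines all 4096 strings of length 6. This property is what lets Algorithm 2 compare only positions that fall in the same hash table entry.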


Require: EST database D = {x_1, x_2, ..., x_k}, integer l (length of unique oligos), and integer d
    (maximum number of mismatches between non-unique oligos)
Ensure: All unique l-mers in D
q ← ⌊l/(d + 1)⌋
hashtable ← findqmers(D, q) (hashtable of positions of all q-mers in D)
for i ← 0 to 4^q − 1 {split loop iterations among processors} do
    goo(D, l, d, hashtable[i])
end for
Algorithm 2: Algorithm for the unique oligos problem.

Require: EST database D = {x_1, x_2, ..., x_k}, integer l (length of unique oligos), and integer d
    (maximum number of mismatches between non-unique oligos)
Ensure: All unique l-mers in D
q ← ⌊l/(⌊d/3⌋ + 1)⌋
hashtable ← findqmers(D, q) (hashtable of positions of all q-mers in D)
for i ← 0 to 4^q − 1 {split loop iterations among processors} do
    b ← i as a base 4 integer of length q
    mutlist ← list of base 4 integers of length q mismatching b by at most 2 digits
    mutnums ← the numbers in mutlist in base 10
    mut ← list of each hashtable[j] for all j in mutnums
    goo2(D, l, d, hashtable[i], mut)
end for
Algorithm 3: Algorithm for the unique oligos problem.

Require: EST database D = {x_1, x_2, ..., x_k}, integer q
Ensure: A hashtable of all q-mer positions.
hashtable ← an empty hashtable of q-mer positions in D
for i ← 1 to k do
    for j ← 1 to length(D[i]) − q + 1 do
        index ← map(D[i], j, j + q − 1)
        hashtable[index] ← Append(hashtable[index], {i, j})
    end for
end for
Algorithm 4: Findqmers (D, q).

t ← substring(s, i, j)
n ← t read as a base 4 integer under the transformation {A, C, G, T} → {0, 1, 2, 3}
return n
Algorithm 5: Map (string s, i, j).

t ← substring of s from character i to character j
return t
Algorithm 6: Substring (string s, i, j).
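
The first phase (Algorithms 4 to 6) can be sketched in a few lines. This is an illustrative translation rather than the authors' C code: it uses 0-based, half-open slices instead of the 1-based inclusive positions of the pseudocode, the names `map_qmer` and `find_qmers` are hypothetical, and it assumes every EST contains only the letters A, C, G, and T (any other character raises a `KeyError`).

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def map_qmer(s, start, end):
    """Encode s[start:end] as an integer by reading it as a base-4 number."""
    value = 0
    for ch in s[start:end]:
        value = 4 * value + CODE[ch]
    return value


def find_qmers(ests, q):
    """Hash table: encoded q-mer -> list of (EST index, start position)."""
    table = defaultdict(list)
    for i, est in enumerate(ests):
        for j in range(len(est) - q + 1):
            table[map_qmer(est, j, j + q)].append((i, j))
    return table
```

For instance, `map_qmer("ACGT", 0, 4)` evaluates to 0·64 + 1·16 + 2·4 + 3 = 27, so every q-mer receives an index between 0 and 4^q − 1, which is the range the main loops of Algorithms 1 to 3 iterate over.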


posi ← a list of positions of a specified q-mer in D
    (posi = {{s_1, p_1}, {s_2, p_2}, ...} where {s, p} corresponds to position p of sequence s)
mut ← a list of positions of q-mers in D that mismatch this q-mer by either 1 or 2 characters
    (depending on the filtration algorithm using this function)
for i ← 1 to length(posi) do
    for j ← i + 1 to length(posi) do
        if posi[i][1] ≠ posi[j][1] then
            lmers1 ← list of l-mers generated from the extension of the q-mer in position posi[i]
            lmers2 ← list of l-mers generated from the extension of the q-mer in position posi[j]
            for a ← 1 to length(lmers1) do
                for b ← 1 to length(lmers2) do
                    if HD(lmers1[a], lmers2[b]) ≤ d then
                        mark the l-mers as non-unique
                    end if
                end for
            end for
        end if
    end for
    for j ← 1 to length(mut) do
        if posi[i][1] ≠ mut[j][1] then
            lmers1 ← list of l-mers generated from the extension of the q-mer in position posi[i]
            lmers2 ← list of l-mers generated from the extension of the q-mer in position mut[j]
            for a ← 1 to length(lmers1) do
                for b ← 1 to length(lmers2) do
                    if HD(lmers1[a], lmers2[b]) ≤ d then
                        mark the l-mers as non-unique
                    end if
                end for
            end for
        end if
    end for
end for
Algorithm 7: goo2(D, l, d, posi, mut).

posi ← a list of positions of a q-mer in D
    (posi = {{s_1, p_1}, {s_2, p_2}, ...} where {s, p} corresponds to position p of sequence s)
for i ← 1 to length(posi) do
    for j ← i + 1 to length(posi) do
        if posi[i][1] ≠ posi[j][1] then
            lmers1 ← list of l-mers generated from the extension of the q-mer in position posi[i]
            lmers2 ← list of l-mers generated from the extension of the q-mer in position posi[j]
            for a ← 1 to length(lmers1) do
                for b ← 1 to length(lmers2) do
                    if HD(lmers1[a], lmers2[b]) ≤ d then
                        mark the l-mers as non-unique
                    end if
                end for
            end for
        end if
    end for
end for
Algorithm 8: goo(D, l, d, posi).
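
Putting Findqmers and goo together gives a compact end-to-end sketch of the exact-match filter (Algorithm 2). It is written under several assumptions that are not in the original: the hash table is keyed by the q-mer string rather than its base-4 code, positions are 0-based, l ≥ d + 1 so that q ≥ 1, and the names `extensions` and `non_unique_exact_filter` are hypothetical. It returns the set of marked (non-unique) l-mer positions.

```python
from collections import defaultdict


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def extensions(est, p, q, l):
    """Start positions of all l-mers of est that contain the q-mer at position p."""
    lo = max(0, p + q - l)
    hi = min(p, len(est) - l)
    return range(lo, hi + 1)


def non_unique_exact_filter(ests, l, d):
    """Mark l-mers that have a d-approximate match in a different EST.

    Only q-mer occurrences that fall in the same bucket are extended
    and compared, with q = l // (d + 1) as in Theorem 2.
    """
    q = l // (d + 1)
    table = defaultdict(list)
    for i, est in enumerate(ests):
        for p in range(len(est) - q + 1):
            table[est[p:p + q]].append((i, p))

    marked = set()
    for posi in table.values():
        for a in range(len(posi)):
            for b in range(a + 1, len(posi)):
                (i, p), (j, r) = posi[a], posi[b]
                if i == j:
                    continue  # compare occurrences from different ESTs only
                for s in extensions(ests[i], p, q, l):
                    for t in extensions(ests[j], r, q, l):
                        if hamming(ests[i][s:s + l], ests[j][t:t + l]) <= d:
                            marked.add((i, s))
                            marked.add((j, t))
    return marked
```

On the toy database `["AAAAAA", "AAAATT", "GGGGGG"]` with l = 4 and d = 1, the l-mers AAAA and AAAT are marked, while AATT and every GGGG remain unique. Like the pseudocode, the sketch compares all extension pairs of two q-mer occurrences, which repeats some comparisons but keeps the logic simple.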

Table 1: Results of serial algorithms.

Algorithm     l    d   q   Dataset         Time taken (secs)   Non-unique oligos
Algorithm 2   28   6   4   1 (78 ESTs)                   163              46,469
Algorithm 1   28   6   7   1 (78 ESTs)                   131              46,469
Algorithm 3   27   6   9   1 (78 ESTs)                   231              46,564
Algorithm 2   28   6   4   2 (2838 ESTs)             197,500           1,611,241
Algorithm 1   28   6   7   2 (2838 ESTs)             117,714           1,611,241
Algorithm 3   27   6   9   2 (2838 ESTs)              94,317           1,614,235

Table 2: Results of parallel algorithms on 12 processors.

Algorithm     l    d   q   Dataset         Time taken (secs)   Non-unique oligos
Algorithm 2   28   6   4   1 (78 ESTs)                    33              46,469
Algorithm 1   28   6   7   1 (78 ESTs)                    29              46,469
Algorithm 3   27   6   9   1 (78 ESTs)                    66              46,564
Algorithm 2   28   6   4   2 (2838 ESTs)              40,420           1,611,241
Algorithm 1   28   6   7   2 (2838 ESTs)              22,848           1,611,241
Algorithm 3   27   6   9   2 (2838 ESTs)              18,375           1,614,235

where

$$q = \left\lfloor \frac{l}{d + 1} \right\rfloor. \qquad (4)$$

A third theorem was also briefly mentioned in [7]; however,
it was not implemented in an algorithm. We use this theorem
to create a third algorithm to solve the unique oligos search
problem.

Theorem 3. Suppose one has two $l$-mers $u_1$ and $u_2$ such that
$\mathrm{HD}(u_1, u_2) \le d$. If one divides them both into $\lfloor d/3 \rfloor + 1$ substrings,
$u_1^{1} u_1^{2} \cdots u_1^{\lfloor d/3 \rfloor + 1}$ and $u_2^{1} u_2^{2} \cdots u_2^{\lfloor d/3 \rfloor + 1}$, and each $u_j^{i}$, except
possibly $u_j^{\lfloor d/3 \rfloor + 1}$, has length $\lfloor l/(\lfloor d/3 \rfloor + 1) \rfloor$, then there exists at
least one $i_0 \in \{1, 2, \ldots, \lfloor d/3 \rfloor + 1\}$ such that $\mathrm{HD}(u_1^{i_0}, u_2^{i_0}) \le 2$.

Proof. Suppose by contradiction that for every $i \in \{1, 2, \ldots, \lfloor d/3 \rfloor + 1\}$,
$u_1^{i}$ and $u_2^{i}$ have at least 3 mismatches. Then
$\mathrm{HD}(u_1, u_2) \ge 3(\lfloor d/3 \rfloor + 1) \ge d + 1$, which contradicts the fact that
$\mathrm{HD}(u_1, u_2) \le d$.

The algorithm is somewhat similar to Algorithm 1. The
main difference is that we compare every $q$-mer to the $q$-mers in
its corresponding 2-mutant list, rather than its 1-mutant list.
Each $q$-mer has $9\binom{q}{2} + 3q = 9q(q - 1)/2 + 3q$ 2-mutants, so
we have $O(q^2 n^2/4^{2q})$ comparisons for each entry in the hash table,
yielding a total time complexity of

$$O\!\left((l - q)\, q^2\, \frac{n^2}{4^{2q}}\, 4^q\right) = O\!\left(\frac{(l - q)\, q^2\, n^2}{4^q}\right), \qquad (5)$$

where

$$q = \left\lfloor \frac{l}{\lfloor d/3 \rfloor + 1} \right\rfloor. \qquad (6)$$
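
The count of 2-mutants, 9·C(q, 2) + 3q, can be checked by direct enumeration: there are 3q strings at distance exactly 1 and 9·C(q, 2) at distance exactly 2. The sketch below is illustrative only; `mutants_exact` and `two_mutant_count` are hypothetical names, and the count excludes the q-mer itself, matching the figure quoted in the text.

```python
from itertools import combinations, product
from math import comb


def mutants_exact(s, k, alphabet="ACGT"):
    """All strings that differ from s in exactly k positions."""
    out = []
    for positions in combinations(range(len(s)), k):
        # Each chosen position can take any of the 3 other letters.
        choices = [[c for c in alphabet if c != s[p]] for p in positions]
        for letters in product(*choices):
            t = list(s)
            for p, c in zip(positions, letters):
                t[p] = c
            out.append("".join(t))
    return out


def two_mutant_count(q):
    """Closed form from the text: 9 * C(q, 2) + 3q."""
    return 9 * comb(q, 2) + 3 * q


def enumerated_count(s):
    """Number of strings at Hamming distance 1 or 2 from s."""
    return len(mutants_exact(s, 1)) + len(mutants_exact(s, 2))
```

For q = 9, the value used with Algorithm 3 in the experiments, both `two_mutant_count(9)` and `enumerated_count("ACGTACGTA")` give 351, compared with 27 entries in a 1-mutant list of the same length.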

It is important to note the $4^q$ term in the denominator
of our time complexity expressions. Since this term is exponential
in $q$, it will have the largest impact on the time taken to
run our algorithms. For fixed $l$ and $d$, Algorithm 3 uses the largest
value of $q$ and Algorithm 2 the smallest. Based on this observation, we expect
Algorithm 3 to run the fastest, followed by Algorithm 1 and
then Algorithm 2.
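
A small calculation makes the expected ordering concrete. The sketch below evaluates q for each partition rule and the leading term of each complexity expression with constants dropped and n set to 1; these are rough relative figures under the uniform-distribution assumption of the analysis, not predictions of actual running time, and the function names are hypothetical.

```python
def q_values(l, d):
    """q-mer length used by each algorithm for oligo length l and mismatch bound d."""
    return {
        "Algorithm 1": l // (d // 2 + 1),
        "Algorithm 2": l // (d + 1),
        "Algorithm 3": l // (d // 3 + 1),
    }


def relative_cost(l, d, n=1.0):
    """Leading complexity terms (constants dropped) for the three algorithms."""
    q = q_values(l, d)
    q1, q2, q3 = q["Algorithm 1"], q["Algorithm 2"], q["Algorithm 3"]
    return {
        "Algorithm 1": (l - q1) * q1 * n ** 2 / 4 ** q1,
        "Algorithm 2": (l - q2) * n ** 2 / 4 ** q2,
        "Algorithm 3": (l - q3) * q3 ** 2 * n ** 2 / 4 ** q3,
    }


def ranking(l, d):
    """Algorithms ordered from smallest to largest estimated cost."""
    cost = relative_cost(l, d)
    return sorted(cost, key=cost.get)
```

With l = 28 and d = 6 the rules give q = 7, 4, and 9 for Algorithms 1, 2, and 3, and `ranking(28, 6)` returns Algorithm 3, Algorithm 1, Algorithm 2, in agreement with the expectation stated above.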

3. Implementation
We implement these algorithms in C on a machine with
12 Intel Core i7 CPU 80 @ 3.33 GHz processors and 12 GB
of memory. The datasets we use in this implementation are
Barley ESTs taken from the genetic software HarvEST by
Steve Wanamaker and Timothy Close of the University of
California, Riverside (http://harvest.ucr.edu/). We use two
different EST databases, one with 78 ESTs and another with
2838. In our experiments we search for oligos of lengths
27 and 28 since they are common lengths for oligonucleotides.
As we increase the size of the database, we see that
Algorithm 3 is the most efficient, as anticipated (data shown
in Tables 1 and 2).
An important property of all of these algorithms
is that their main portion is a for loop
which iterates through each index of the hash table, and that
the loop iterations are independent of each other.
These two factors make the algorithms natural candidates for
parallelism. Rather than process the hash table one index
at a time, our parallel algorithms process groups of indices
simultaneously. Ignoring the communication between processors,
our algorithms optimally parallelize our three serial
algorithms.
There are many APIs in different programming languages
that aid in the task of parallel programming. Some examples
in the C programming language are OpenMP and
POSIX Pthreads. OpenMP allows one to easily parallelize
a C program amongst multiple cores of a multicore machine
[8]. OpenMP also has an extension called Cluster OpenMP
which allows one to parallelize across multiple machines in a
computing cluster.
A new trend in parallel programming is the use of
GPUs. GPUs are the processing units inside a computer's graphics
card. C has several APIs which allow one to carry out GPU
programming. Two such APIs are OpenCL and CUDA
[9, 10].
In the second implementation of our algorithms we use
OpenMP to parallelize our algorithms across the 12
cores of our machine. Ideally, the time taken by a parallel
algorithm would be that of the serial algorithm divided by the
number of processors. Comparing Tables 1 and 2, the measured
speedups range from about 3.5 to about 5.2, so the parallel
versions are substantially faster than the serial ones while
remaining below the ideal 12-fold gain.
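
The idea of handing groups of hash table indices to different workers can be sketched as follows. This is an analogy to an OpenMP parallel for loop with a static schedule, not the authors' C code: `process_bucket` is a hypothetical stand-in for the per-index work (goo or goo2), and because of the global interpreter lock, CPython threads illustrate the partitioning without delivering the speedup that OpenMP threads give for this kind of CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_indices(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, nearly equal chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def process_bucket(index):
    """Hypothetical placeholder for the work done on one hash table index."""
    return index % 3 == 0


def run_parallel(n_buckets, n_workers):
    """Process every bucket index, one chunk of indices per worker."""
    def work(chunk):
        return [i for i in chunk if process_bucket(i)]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(work, chunk_indices(n_buckets, n_workers)))
    # pool.map preserves chunk order, so the merged result is sorted by index.
    return [i for part in parts for i in part]
```

Because each iteration touches only its own bucket, no locking is needed while the chunks are processed; results are merged once at the end. Contiguous chunks are the simplest split, although buckets of very different sizes may leave some workers idle, which is one possible reason for speedups below the ideal.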

4. Conclusion
In this paper we used three algorithms to solve the unique
oligos search problem which are extensions of the algorithm
presented in [2]. We observed that we can achieve
a significant performance improvement by parallelizing our
algorithms. We can also see that Algorithm 3 yields the best
results for larger databases. For smaller databases, however,
the time differences between the algorithms are small and
Algorithm 3 is the slowest, owing to
the time required to compute the mutants of
each $q$-mer. Other algorithms can be obtained by setting $q$ to
different values. See Algorithms 1, 2, 3, 4, 5, 6, 7, and 8.

References
[1] M. D. Adams, J. M. Kelley, J. D. Gocayne et al., "Complementary DNA sequencing: expressed sequence tags and human genome project," Science, vol. 252, no. 5013, pp. 1651-1656, 1991.
[2] J. Zheng, T. J. Close, T. Jiang, and S. Lonardi, "Efficient selection of unique and popular oligos for large EST databases," Bioinformatics, vol. 20, no. 13, pp. 2101-2112, 2004.
[3] S. H. Nagaraj, R. B. Gasser, and S. Ranganathan, "A hitchhiker's guide to expressed sequence tag (EST) analysis," Briefings in Bioinformatics, vol. 8, no. 1, pp. 6-21, 2007.
[4] W. Klug, M. Cummings, and C. Spencer, Concepts of Genetics, Prentice-Hall, Upper Saddle River, NJ, USA, 8th edition, 2006.
[5] F. Li and G. D. Stormo, "Selection of optimal DNA oligos for gene expression arrays," Bioinformatics, vol. 17, no. 11, pp. 1067-1076, 2001.
[6] S. Rahmann, "Rapid large-scale oligonucleotide selection for microarrays," in Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (CSB '02), pp. 54-63, IEEE Press, Stanford, Calif, USA, 2002.
[7] S. Go, Combinatorics and its applications in DNA analysis [M.S. thesis], Department of Mathematics and Statistics, Memorial University of Newfoundland, 2009.
[8] OpenMP.org, 2012, http://openmp.org/wp/.
[9] Khronos Group, "OpenCL: The open standard for parallel programming of heterogeneous systems," 2012, http://www.khronos.org/opencl/.
[10] Nvidia, "Parallel Programming and Computing Platform, CUDA, Nvidia," 2012, http://www.nvidia.com/object/cuda_home_new.html.


Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Gene Regulation, Modulation, and Their Applications in
Gene Expression Data Analysis
Mario Flores,1 Tzu-Hung Hsiao,2 Yu-Chiao Chiu,3 Eric Y. Chuang,3
Yufei Huang,1 and Yidong Chen2,4
1 Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX 78249, USA
2 Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA
3 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
4 Department of Epidemiology and Biostatistics, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA

Correspondence should be addressed to Yufei Huang; yufei.huang@utsa.edu and Yidong Chen; cheny8@uthscsa.edu
Received 2 December 2012; Accepted 24 January 2013
Academic Editor: Mohamed Nounou
Copyright © 2013 Mario Flores et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Common microarray and next-generation sequencing data analyses concentrate on tumor subtype classification, marker detection,
and transcriptional regulation discovery during biological processes by exploring correlated gene expression patterns and their
shared functions. Genetic regulatory network (GRN) based approaches have been employed in many large studies in order to
scrutinize for dysregulation and potential treatment controls. In addition to gene regulation and network construction, the concept
of the network modulator, which has significant systemic impact, has been proposed, and detection algorithms have been developed
in past years. Here we provide a unified mathematical description of these methods, followed by a brief survey of these modulator
identification algorithms. As an early attempt to extend the concept to a new RNA regulation mechanism, competitive endogenous
RNA (ceRNA), within a modulator framework, we provide two applications to illustrate the network construction, the modulation effect,
and the preliminary findings from these networks. The methods we surveyed and developed are used to dissect the regulated
network under different modulators. Beyond these, the concept of modulation can be adapted to various biological mechanisms
to discover novel gene regulation mechanisms.

1. Introduction
With the development of microarray [1] and, more recently, next-generation sequencing techniques [2], transcriptional profiling of biological samples, such as tumor samples [3–5] and samples from other model organisms, has been carried out in order to study sample subtypes at the molecular level or transcriptional regulation during biological processes [6–8]. While common data analysis methods employ hierarchical clustering algorithms or pattern classification to explore correlated genes and their functions, genetic regulatory network (GRN) approaches have been employed to scrutinize dysregulation between different tumor groups or biological processes (see reviews [9–12]).

To construct the network, most research has focused on methods based on gene expression data derived from high-throughput technologies, using metrics such as Pearson or Spearman correlation [13], mutual information [14], the codetermination method [15, 16], Bayesian methods [17, 18], and probabilistic Boolean networks [19]. Recently, a new mode of transcriptional regulation via competitive endogenous RNAs (ceRNAs) has been proposed [20, 21], introducing an additional dimension in modeling gene regulation. This type of regulation requires knowledge of microRNA (miRNA) binding targets [22, 23] and the hypothesis that RNAs regulate one another by competing for miRNA binding. Common GRN construction confines regulators to transcription factor (TF) proteins, a primary transcription programming machine, which relies on sequence-specific binding sites at target genes' promoter regions. In contrast, ceRNAs mediate gene regulation by competing for miRNA binding sites in the target 3′ UTR region, which exist in >50% of mRNAs [22, 24]. In this study, we extend the current network construction methods by incorporating regulation via ceRNAs.

Figure 1: Regulator-target pair in genetic regulatory network model: (a) basic regulator-target pair and (b) regulator-target complex.

In tumorigenesis, gene mutation is the main cause of cancer [25]. A mutation may not be directly reflected in a change at the gene expression level; however, it will disrupt gene regulation [26–28]. Hudson et al. found that mutated myostatin and MYL2 showed different coexpression when compared with wild-type myostatin. Chun et al. also showed that oncogenic KRAS modulates HIF-1α and HIF-2α target genes and in turn modulates cancer metabolism. Stelniec-Klotz et al. presented a complex hierarchical model of a KRAS-modulated network followed by double perturbation experiments. Shen et al. [29] showed a temporal change of GRNs modulated after estradiol stimulation, indicating an important role of estrogen in modulating GRNs. Functionally, the modulation effect of high expression of ESR1 was also reported by Wilson and Dering [30], who studied previously published microarray data from cells treated with hormone receptor agonists and antagonists [31–33]. In this study, we provide a comprehensive review of existing algorithms for uncovering modulators. Because either the mutation or the protein expression status is unknown in many reported studies, partitioning diverse samples by condition, such as active or inactive oncogene status (and perhaps a combination of multiple mutations), and predicting a putative modulator of gene regulation remain difficult tasks.
By combining gene regulation obtained from coexpression data and ceRNAs, we report here an early attempt to unify the two systems mathematically while assuming a known modulator, estrogen receptor (ER). Using the TCGA [3] breast tumor gene expression data and the clinical test results (ER status), we demonstrate the approach of obtaining a GRN via ceRNAs and a new presentation of ER modulation effects. By integrating breast cancer data into our ceRNA discovery website, we are well positioned to further explore the ceRNA regulation network and to further develop discovery algorithms for detecting potential modulators of regulatory interactions.

2. Models of Gene Regulation and Modulation


2.1. Regulation of Gene Expression. The complex relationships among genes and their products in a cellular system can be studied using genetic regulatory networks (GRNs). The networks model the different states or phenotypes of a cellular system. In this model, interactions are commonly modeled as regulator-target pairs, with edges between regulator and target representing the direction of their interaction, as shown in Figure 1(a). A target gene is a gene whose expression can be altered (activated or suppressed) by a regulator gene. This definition implies that any gene can at some point be a target gene or a direct or indirect regulator, depending on its position in the genetic regulatory network. The regulator gene is a gene that controls (activates or suppresses) its target genes' expression. The activated (or suppressed) genes are sometimes involved in specific biological functions, such as cell proliferation in cancer. Examples of regulator-target pairs are common in biology. For example, the target gene CDCA7 (cell division cycle-associated protein 7) is a c-Myc (regulator) responsive gene, and it is part of c-Myc-mediated transformation of lymphoblastoid cells. Furthermore, as shown in Figure 1(b), a regulator gene can also act as a target gene if there exists an upstream regulator.
If the interaction is modeled after the Boolean network (BN) model [34], then

$$y(t+1) = f\bigl(x_1(t), \ldots, x_k(t), y(t)\bigr), \tag{1}$$

where each regulator $x_i \in \{0, 1\}$ is a binary variable, as is its target $y$. As described by (1), the target $y$ at time $t+1$ is completely determined by the values of its regulators at time $t$ by means of a Boolean function $f \in F$, where $F$ is a collection of Boolean functions. Thus, the Boolean network $G(V, F)$ is defined as a set of nodes (genes) $V = \{x_1, x_2, \ldots, x_n\}$ and a list of functions (edges or interactions) $F = \{f_1, f_2, \ldots, f_n\}$. Similarly, such a relationship can be defined in the framework of a Bayesian network, where the regulators-target relationship defined in (1) can be modeled by the distribution

$$p\bigl(y(t+1), x_1(t), \ldots, x_k(t), y(t)\bigr) = p\bigl(y(t+1) \mid \mathrm{Parents}(y(t+1))\bigr)\, p\bigl(\mathrm{Parents}(y(t+1))\bigr), \tag{2}$$

where $\mathrm{Parents}(y(t+1)) = \{x_1(t), \ldots, x_k(t), y(t)\}$ is the set of regulators, or parents, of $y$, $p(y(t+1) \mid \mathrm{Parents}(y(t+1)))$ is the conditional distribution defining the regulator-target relationship, and $p(\mathrm{Parents}(y(t+1)))$ models the prior distribution of regulators. Unlike in (1), the target and regulators in (2) are modeled as random variables. Despite this difference, in both (1) and (2), the target is always a function (or conditional distribution) of the regulators (or parents). When the relationship is defined by a Boolean function as in (1), the conditional distribution in (2) takes the form of a binomial distribution (or a multinomial distribution when both regulators and target take more than two states). Other distributions such as the Gaussian and Poisson can be introduced to model more complex relationships than the Boolean. Network construction, inference, and control, however, are beyond the scope of this paper, and we leave these topics to the literature [9, 35, 36].

Figure 2: Three different cases of regulation of gene expression that share the network representation of a regulator-target interaction.
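
As an illustration of the Boolean update in (1), the following minimal sketch steps a toy network forward in time. The three genes (`x1`, `x2`, `y`) and their rules are hypothetical and chosen only for illustration; they are not taken from any network discussed in this paper.

```python
from typing import Callable, Dict, List

State = Dict[str, int]

# Hypothetical rules for a toy network with two regulators (x1, x2) and one target (y).
RULES: Dict[str, Callable[[State], int]] = {
    "x1": lambda s: s["x1"],                 # x1 keeps its current value
    "x2": lambda s: 1 - s["x1"],             # x2 is repressed by x1
    "y": lambda s: s["x1"] & (1 - s["x2"]),  # y is on only if x1 is on and x2 is off
}


def step(state: State) -> State:
    """One synchronous update: each gene at t+1 is a Boolean function of the state at t."""
    return {gene: rule(state) for gene, rule in RULES.items()}


def trajectory(state: State, n_steps: int) -> List[State]:
    """Return the states visited over n_steps updates, starting state included."""
    states = [dict(state)]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states
```

Calling `trajectory({"x1": 1, "x2": 1, "y": 0}, 3)` shows the toy network settling into a fixed point after two updates. In general (1) also allows the target to depend on its own previous value, which these particular rules do not use.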
The interactions among genes and their products in a complex cellular process of gene expression are diverse, governed by the central dogma of molecular biology [37]. Different regulation mechanisms can act at different stages. Figure 2 shows three different cases of regulation of gene expression. Figure 2(a) shows regulation of expression in which a transcription factor (TF) regulates the expression of a protein-coding gene (in dark grey) by binding to the promoter region of the target gene $y$. Figure 2(b) shows regulation at the protein level, in which a ligand protein interacts with a receptor to activate relay molecules that transduce outside signals directly into cell behavior. Figure 2(c) shows regulation at the RNA level, in which one or more miRNAs regulate a target mRNA by translational repression or target transcript degradation via binding to sequence-specific binding sites (called miRNA response elements or MREs) in the 3′ UTR region. As illustrated in Figure 2(c), the target genes/proteins all contain a binding domain or docking site, enabling specific interactions between regulator-target pairs, a common element in network structure.

Figure 3: Graphical representation of the triplet interaction of regulator $x$, target $y$, and modulator $m$.


2.2. Modulation of Gene Regulation. Different from the concept of a coregulator commonly referred to in regulatory biology, a modulator denotes a gene or protein that is capable of altering endogenous gene expression at one stage or time. In the context of this paper, we specifically define a modulator to be a gene that can systemically influence the interaction of a regulator-target pair, either activating or suppressing the interaction in the presence/absence of the modulator. One example of a modulator is the widely studied estrogen receptor (ER) in breast cancer studies [38–40]; the ER status determines not only tumor progression but also chemotherapy treatment outcomes. It is well known that binding of estrogen to its receptor facilitates ER activity to activate or repress gene expression [41], thus effectively modulating the GRN. Figure 3 illustrates the model of the interaction between a modulator ($m$) and the regulator ($x$)-target ($y$) pair that it modulates.
Following the convention used in (1) and (2), the modulation interaction in Figure 3 can be modeled by

$$y = F^{(m)}(x), \tag{3}$$

where $y$ represents target expression, $x$ represents the parents (regulators) of target $y$, and $F^{(m)}(\cdot)$ is the regulation function modulated by $m$. When $F^{(m)}(\cdot)$ is stochastic, the relationship is modeled by the conditional distribution

$$p(y, x \mid m) = p(y \mid x, m)\, p(x \mid m), \tag{4}$$

where $p(y \mid x, m)$ models the regulator-target relationship modulated by $m$ and $p(x \mid m)$ defines the prior distribution of the regulators' (parents') expression modulated by $m$. Different distribution models can be used to model different mechanisms of modulation. At the biological level, there are different mechanisms for modulation of the interaction $x$-$y$, and several algorithms for the prediction of modulators have been developed. This survey presents the latest formulations and algorithms for prediction of modulators.

3. Survey of Algorithms of Gene Regulation and Modulation Discovery
During the past years, many computational tools have been developed for regulation network construction; depending on the hypothesis, the modulator concept can then be tested and modulators extracted. Here we focus on modulator detection algorithms (MINDy, Mimosa, GEM, and Hermes). To introduce the gene-gene interaction concept, we also briefly discuss an algorithm for regulation network construction (ARACNE) and a ceRNA identification algorithm (MuTaMe).
3.1. ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks). ARACNE [14, 42] is an algorithm that extracts transcriptional networks from microarray data by using an information-theoretic method to reduce indirect interactions. ARACNE assumes that it is sufficient to estimate 2-way marginal distributions, when the sample size is > 100, in genomics problems, such that

$$p(\{x_i\}) = \frac{1}{z} \exp\Bigl\{-\Bigl[\sum_{i=1}^{N} \phi_i(x_i) + \sum_{i,j} \phi_{i,j}(x_i, x_j)\Bigr]\Bigr\}. \tag{5}$$

Alternatively, a candidate interaction can be identified by estimating the mutual information MI of genes $x_i$ and $x_j$, $\mathrm{MI}(x_i, x_j) = \mathrm{MI}_{i,j}$, where $\mathrm{MI}_{i,j} = 1$ if genes $x_i$ and $x_j$ are identical, and $\mathrm{MI}_{i,j}$ is zero if $p(x_i, x_j) = p(x_i)p(x_j)$, that is, if $x_i$ and $x_j$ are statistically independent. Specifically, the mutual information of the gene expressions $x$ and $y$ of regulator and target genes is estimated using the Gaussian kernel estimator. ARACNE takes two additional steps to clean the network: (1) removing $\mathrm{MI}_{i,j}$ if its value is less than that derived from two independent genes via random permutation and (2) applying the data processing inequality (DPI). The algorithm further assumes that for a gene triplet $(x_i, x_j, x_k)$, where $x_i$ regulates $x_k$ through $x_j$,

$$\mathrm{MI}_{i,k} < \min(\mathrm{MI}_{i,j}, \mathrm{MI}_{j,k}), \quad \text{if } x_i \to x_j \to x_k \text{ with no alternative path}, \tag{6}$$

where $\to$ represents a regulation relationship. In other words, the lowest mutual information $\mathrm{MI}_{i,k}$ comes from an indirect interaction and thus is removed from the GRN by ARACNE in the DPI step. A similar algorithm was proposed [43] that utilizes conditional mutual information to explore more than 2 regulators.
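
The two cleaning steps can be sketched as follows. This is a simplified illustration rather than the ARACNE implementation: it assumes the MI estimates are already available in a dictionary, uses a single fixed `threshold` in place of the permutation-derived cutoff, and omits any tolerance on the DPI comparison. The function name `dpi_prune` is hypothetical.

```python
from itertools import combinations


def dpi_prune(mi, threshold=0.0):
    """Return the edges kept after thresholding and the DPI step.

    mi maps frozenset({gene_a, gene_b}) to an MI estimate.
    """
    # Step 1: keep only edges whose MI exceeds the significance threshold.
    edges = {edge for edge, value in mi.items() if value > threshold}
    genes = sorted({gene for edge in edges for gene in edge})

    # Step 2: in every fully connected triplet, mark the weakest edge as indirect.
    # All triplets are examined against the thresholded edge set before removal.
    indirect = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in edges for edge in triangle):
            indirect.add(min(triangle, key=lambda edge: mi[edge]))
    return edges - indirect
```

For three genes with MI values 0.8, 0.7, and 0.3 on the three pairs, the pair with 0.3 is treated as the indirect interaction and dropped, which is the situation described by (6).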

3.2. MINDy (Modulator Inference by Network Dynamics). Similar to ARACNE, MINDy is also an information-theoretic algorithm [44]. However, MINDy aims to identify potential transcription factor (TF)-target gene pairs that can be modulated by a candidate modulator. MINDy assumes that the expressions of the modulated TF-target pairs have different correlations under different expression states of the modulator. For simplicity and computational reasons, MINDy considers only two modulator expression states, that is, up-expression ($m = 1$) or down-expression ($m = 0$). It then tests whether the expression correlations of potential TF-target pairs are significantly different for modulator up-expression versus down-expression. The modulator-dependent correlation is assessed by the conditional mutual information (CMI), $I(x, y \mid m = 0)$ and $I(x, y \mid m = 1)$. As in ARACNE, the CMI is calculated using the Gaussian kernel estimator. To test whether a pair of TF ($x$) and target ($y$) is modulated by $m$, the CMI difference is calculated as

$$\Delta I = I(x, y \mid m = 1) - I(x, y \mid m = 0). \tag{7}$$

The pair is determined to be modulated if $\Delta I \neq 0$. The significance value for $\Delta I \neq 0$ is computed using permutation tests.
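
The test in (7) can be sketched as below. Note the substitution: instead of the Gaussian kernel estimator used by MINDy, this sketch uses a simple plug-in MI estimate on rank-binned data, which is cruder but keeps the example self-contained. The function names, the number of bins, and the permutation scheme (shuffling the modulator labels) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def binned_mi(xs, ys, bins=3):
    """Plug-in MI estimate (in nats) after equal-frequency binning by rank."""
    def discretize(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        labels = [0] * len(values)
        for rank, i in enumerate(order):
            labels[i] = rank * bins // len(values)
        return labels

    n = len(xs)
    dx, dy = discretize(xs), discretize(ys)
    joint, px, py = Counter(zip(dx, dy)), Counter(dx), Counter(dy)
    return sum(c / n * math.log(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


def delta_cmi(xs, ys, ms, bins=3):
    """Delta I = I(x, y | m = 1) - I(x, y | m = 0) for a binary modulator state."""
    def mi_given(state):
        idx = [i for i, m in enumerate(ms) if m == state]
        return binned_mi([xs[i] for i in idx], [ys[i] for i in idx], bins)

    return mi_given(1) - mi_given(0)


def permutation_pvalue(xs, ys, ms, n_perm=200, seed=0, bins=3):
    """Permutation p-value for Delta I != 0, obtained by shuffling modulator labels."""
    rng = random.Random(seed)
    observed = abs(delta_cmi(xs, ys, ms, bins))
    shuffled = list(ms)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(delta_cmi(xs, ys, shuffled, bins)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `xs` and `ys` are the TF and target expression values across samples and `ms` holds the binarized modulator state of each sample.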
3.3. Mimosa. Similarly to MINDy, Mimosa [45] was proposed to identify modulated TF-target pairs. However, it does not preselect a set of modulators of interest but rather aims to also search for the modulators. Mimosa also assumes that a modulator takes only two states, that is, absence and presence, or 0 and 1. The modulated regulator-target pair is further assumed to be correlated when a modulator is present but uncorrelated when it is absent. Therefore, the distribution of a modulated TF-target pair, $x$ and $y$, naturally follows a mixture distribution

$$p(x, y) = \pi\, p(x, y \mid m = 0) + (1 - \pi)\, p(x, y \mid m = 1), \tag{8}$$

where $\pi$ is the probability of the modulator being absent. In particular, uncorrelated and correlated bivariate Gaussian distributions were introduced to model the different modulated regulator-target relationships, such that

$$p(x, y \mid m = 0) = \frac{1}{2\pi}\, e^{-(1/2)(x^2 + y^2)}, \tag{9a}$$

$$p(x, y \mid m = 1) = \frac{1}{2\pi\sqrt{1 - \alpha^2}}\, e^{-(1/2)(x^2 + y^2 - 2\alpha x y)/(1 - \alpha^2)}, \tag{9b}$$

where $\alpha$ models the correlation between $x$ and $y$ when the modulator is present (in the normalizing constants of (9a) and (9b), $\pi$ denotes the mathematical constant rather than the mixing probability). With this model, Mimosa sets out to fit the samples of every potential regulator-target pair with the mixture model (8). This is equivalent to finding a partition of the paired expression samples into correlated and uncorrelated samples. The paired expression samples that possess such a correlated-uncorrelated partition ($0.3 < \pi < 0.7$ and $|\alpha| > 0.8$) are determined to be modulated. To identify the modulator of a modulated pair (or a group of modulated pairs), a weighted $t$-test was developed to search for genes that are differentially expressed in the correlated partition versus the uncorrelated partition.

3.4. GEM (Gene Expression Modulator). GEM [46] improves over MINDy by predicting how a modulator-TF interaction affects the expression of the target gene. It can detect new types of interactions that result in stronger correlation but low $\Delta I$, which therefore would be missed by MINDy. GEM hypothesizes that the correlation between the expression of a modulator $m$ and a target $y$ must change as that of the TF $x$ changes. Unlike the algorithms surveyed above, GEM first transforms the continuous expression levels to binary states (up-expression (1) or down-expression (0)) and then works only with discrete expression states. To model the hypothesized relationship, the following model is proposed:

$$P(y = 1 \mid x, m) = \alpha_c + \alpha_m m + \alpha_x x + \gamma\, m x, \tag{10}$$

where $\alpha_c$ is a constant, $\alpha_m$ and $\alpha_x$ model the effects of modulator and TF on the target gene, and $\gamma$ represents the effect of the modulator-TF interaction on the target gene. If the modulator-TF interaction has an effect on $y$, then $\gamma$ will be nonzero. For a given $(x, y, m)$ triplet, GEM devised an algorithm to estimate the model coefficients in (10) and a test to determine whether $\gamma$ is nonzero, that is, whether $m$ is a modulator of $x$ and $y$.

3.5. MuTaMe (Mutually Targeted MRE Enrichment). The goal of MuTaMe [21] is to identify ceRNA networks of a gene of interest (GoI) or mRNA that share miRNA response elements (MREs) of the same miRNAs. Figure 4 shows two mRNAs, where one is the GoI $y$ and the other is a candidate ceRNA or modulator $m$. In the figure, the miRNA shown in red has MREs in both mRNA $y$ and mRNA $m$; in this case the presence of mRNA $m$ will start the competition with $y$ for the miRNA shown in red.

Figure 4: Modulation of gene regulation by competing mRNAs.

The hypothesis of MuTaMe is that mRNAs that have many of the same MREs can regulate each other by competing for miRNA binding. The input of this algorithm is a GoI, which is targeted by a group of miRNAs known to the user. Then, from a database of predicted MREs for the entire transcriptome, it is possible to obtain the binding sites and their predicted locations in the 3′ UTR for all mRNAs. These data are used to generate scores for each mRNA based on several features:

(a) the number of miRNAs that an mRNA $m$ shares with the GoI $y$;
(b) the density of the predicted MREs for the miRNA; it favors the cases in which more MREs are located within shorter distances;
(c) the distribution of the MREs for every miRNA; it favors situations in which the MREs tend to be evenly distributed;
(d) the number of MREs predicted to target $m$; it favors situations where each miRNA contains more MREs in $m$.

Then each candidate transcript is assigned a score that results from multiplying the scores in (a) to (d). This score indicates the likelihood of the candidate being a ceRNA and is used to predict ceRNAs.
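
The final MuTaMe scoring step, multiplying the feature scores (a) to (d) and ranking candidates, can be sketched as below. How each individual feature score is computed is not specified here, so the sketch assumes they are precomputed; the feature keys, function names, and example values are hypothetical.

```python
from math import prod

# Hypothetical keys for the precomputed feature scores (a) to (d).
FEATURES = ("shared_mirnas", "mre_density", "mre_evenness", "mre_count")


def mutame_score(feature_scores):
    """Combined score of one candidate: the product of its four feature scores."""
    return prod(feature_scores[name] for name in FEATURES)


def rank_candidates(candidates):
    """Return (candidate, score) pairs sorted from most to least likely ceRNA."""
    scored = {name: mutame_score(features) for name, features in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up feature scores for two placeholder candidates.
    example = {
        "candidate_1": {"shared_mirnas": 0.8, "mre_density": 0.5,
                        "mre_evenness": 0.5, "mre_count": 0.5},
        "candidate_2": {"shared_mirnas": 0.5, "mre_density": 0.2,
                        "mre_evenness": 0.5, "mre_count": 0.5},
    }
    for name, score in rank_candidates(example):
        print(name, round(score, 4))
```

Because the combination is a product, a candidate that scores near zero on any single feature is ranked low regardless of its other features.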

3.6. Hermes. Hermes [20] is an extension of MINDy that infers candidate modulators of miRNA activity from expression profiles of genes and miRNAs of the same samples. Hermes makes inferences by estimating the MI and CMI. However, different from MINDy (7), Hermes extracts the dependences within a triplet by studying the difference between the CMI of $x$ expression and $y$ expression conditional on the expression of $m$ and the MI of $x$ and $y$ expressions as follows:

$$\Delta I = I(x; y \mid m) - I(x; y). \tag{11}$$

These quantities and their associated statistical significance can be computed from collections of gene expression profiles with 250 or more samples. Hermes expands MINDy by providing the capacity to identify candidate modulator genes of miRNA activity. The presence of these modulators ($m$) will affect the relation between the expression of the miRNAs targeting a gene ($x$) and the expression level of this gene ($y$).
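
The quantity in (11) can be illustrated with a small sketch on discrete expression states. This is an assumption made for simplicity: Hermes works with expression profiles and its own estimators, whereas the code below uses plug-in estimates from counts. The function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(x; y) in nats from paired discrete observations."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


def conditional_mi(xs, ys, ms):
    """I(x; y | m): MI within each modulator state, weighted by the state frequency."""
    n = len(ms)
    total = 0.0
    for state, count in Counter(ms).items():
        idx = [i for i, m in enumerate(ms) if m == state]
        total += count / n * mutual_information([xs[i] for i in idx], [ys[i] for i in idx])
    return total


def hermes_delta(xs, ys, ms):
    """Delta I = I(x; y | m) - I(x; y), as in (11)."""
    return conditional_mi(xs, ys, ms) - mutual_information(xs, ys)
```

A useful sanity check is the case where `y` equals `x` when `m = 0` and the opposite of `x` when `m = 1`: the marginal MI between `x` and `y` vanishes, while the conditional MI is maximal, so the difference is positive and flags `m` as a modulator.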
In summary, we surveyed some of the most popular algorithms for the inference of modulators. Additional modulator identification algorithms are summarized in Table 1. It is worth noting that the concept of a modulator applies to cases beyond those discussed in this paper. One such example is the multilayer integrated regulatory model proposed in Yan et al. [49], where the top layer of regulators could also be considered as modulators.

4. Applications to Breast Cancer Gene Expression Data
Algorithms utilizing the modulator concept have been implemented in various software packages. Here we discuss two new applications, MGERA and TraceRNA, implemented in-house specifically to utilize the concepts of differential correlation coefficients and ceRNAs to construct a modulated GRN with a predetermined modulator. In the case of MGERA, we chose the estrogen receptor gene, ESR1, as the initial starting point, since it is one of the dominant and systemic factors in breast cancer; in the case of TraceRNA, we also chose gene ESR1 and its modulated gene network. Preliminary results of applications to TCGA breast cancer data are reported in the following two sections.

Table 1: Gene regulation network and modulator identification methods.
Algorithm | Features | References
ARACNE | Interaction network constructed via mutual information (MI). | [14, 42]
Network profiler | A varying-coefficient structural equation model (SEM) to represent the modulator-dependent conditional independence between genes. | [47]
MINDy | Gene-pair interaction dependency on modulator candidates by using the conditional mutual information (CMI). | [44]
Mimosa | Search for modulator by partitioning samples with a Gaussian mixture model. | [45]
GEM | A probabilistic method for detecting modulators of TFs that affect the expression of the target gene by using a priori knowledge and gene expression profiles. | [46]
MuTaMe | Based on the hypothesis that shared MREs can regulate mRNAs by competing for microRNA binding. | [21]
Hermes | Extension of MINDy to include microRNAs as candidate modulators by using CMI and MI from expression profiles of genes and miRNAs of the same samples. | [20]
ER modulator | Analyzes the interaction between TF and target gene conditioned on a group of specific modulator genes via a multiple linear regression. | [48]

4.1. MGERA. The Modulated Gene Regulation Analysis algorithm (MGERA) was designed to explore gene regulation pairs modulated by the modulator $m$. The regulation pairs can be identified by examining the coexpression of two genes based on Pearson correlation (similar to (7) in the context of correlation coefficient). The Fisher transformation is adopted to normalize the correlation coefficients biased by sample sizes, to obtain equivalent statistical power among data with different sample sizes. The statistical significance of the difference in the absolute correlation coefficients between two genes is tested by Student's $t$-test following the Fisher transformation. For gene pairs with significantly different coefficients, active and deactive statuses are identified by examining the modulated gene expression pairs (MGEPs). The MGEPs are further combined to construct the modulated gene regulation network for a systematic and comprehensive view of interaction under modulation.
To demonstrate the ability of MGERA, we set estrogen receptor (ER) as the modulator and applied the algorithm to TCGA breast cancer expression data [3], which contain 588 expression profiles (461 ER+ and 127 ER−). Using P value < 0.01 and a difference in the absolute Pearson correlation coefficients > 0.6 as criteria, we identified 2,324 putative ER+ MGEPs, and a highly connected ER+ modulated gene regulation network was constructed (Figure 5). The ten genes with highest connectivity are shown in Table 2. The cysteine/tyrosine-rich 1 gene (CYYR1), connected to 142 genes, was identified as the top hub gene in the network and thus may serve as a key regulator under ER+ modulation. Gene Ontology analysis of CYYR1 and its connected neighbor genes revealed significant association with extracellular matrix, epithelial tube formation, and angiogenesis.

Table 2: Hub genes derived from modulated gene regulation network (Figure 5).
Gene | Number of ER+ MGEPs
CYYR1 | 142
MRAS | 109
C9orf19 | 95
LOC339524 | 93
PLEKHG1 | 92
FBLN5 | 91
BOC | 91
ANKRD35 | 89
FAM107A | 83
C16orf77 | 73
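
The MGERA selection step for a single gene pair can be sketched as follows. One substitution should be noted: the sketch compares the Fisher-transformed absolute correlations of the two sample groups with the standard normal approximation, not the Student's t-test described above. The function names are hypothetical, and the default cutoffs mirror the criteria quoted in the text (P value < 0.01, difference in absolute correlation > 0.6).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_difference_test(r1, n1, r2, n2):
    """z statistic and two-sided p-value for |r1| versus |r2| after Fisher transformation.

    Assumes |r| < 1 and more than three samples in each group.
    """
    z1, z2 = math.atanh(abs(r1)), math.atanh(abs(r2))
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # normal approximation
    return z, p


def is_modulated_pair(r1, n1, r2, n2, p_cut=0.01, diff_cut=0.6):
    """Apply both criteria: significant difference and large change in |r|."""
    _, p = fisher_difference_test(r1, n1, r2, n2)
    return p < p_cut and abs(abs(r1) - abs(r2)) > diff_cut
```

With the group sizes reported above (461 and 127), a pair with correlation 0.8 in one group and 0.1 in the other passes both criteria, whereas a pair with 0.5 and 0.3 fails the difference cutoff.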
4.2. TraceRNA. To identify the regulation network of ceRNAs for a GoI, we developed a web-based application, TraceRNA, presented earlier in [50], and extended it to regulation network construction. The analysis flow chart of TraceRNA is shown in Figure 6. For a selected GoI, the GoI binding miRNAs (GBmiRs) are derived either from validated miRNAs in miRTarBase [51] or from miRNAs predicted by SVMicrO [52]. Then mRNAs (other than the given GoI) also targeted by the GBmiRs are identified as ceRNA candidates. The relevant (or tumor-specific) gene expression data are used to further strengthen the relationship between the ceRNA candidates and the GoI. The candidate ceRNAs that are coexpressed with the GoI are reported as putative ceRNAs. To construct the gene regulation network via GBmiRs, we set each ceRNA as a secondary GoI, and the ceRNAs of these secondary GoIs are identified by applying the algorithm recursively. Upon identifying all the ceRNAs, the regulation network of ceRNAs of the given GoI is constructed.
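
The candidate search and its recursive expansion can be sketched as below. This is an illustrative outline, not the TraceRNA code: the miRNA-target map is assumed to be a plain dictionary, coexpression is supplied by the caller as a predicate, the `min_shared` default of 0.6 is an assumed fraction of shared GBmiRs, and the recursion is bounded by a hypothetical `max_depth` parameter.

```python
from collections import deque


def gbmirs(goi, mir_targets):
    """miRNAs whose target set contains the gene of interest."""
    return {mir for mir, genes in mir_targets.items() if goi in genes}


def putative_cernas(goi, mir_targets, coexpressed, min_shared=0.6):
    """Genes sharing at least min_shared of the GoI's GBmiRs and coexpressed with it."""
    mirs = gbmirs(goi, mir_targets)
    if not mirs:
        return set()
    shared = {}
    for mir in mirs:
        for gene in mir_targets[mir]:
            if gene != goi:
                shared[gene] = shared.get(gene, 0) + 1
    return {gene for gene, count in shared.items()
            if count / len(mirs) >= min_shared and coexpressed(goi, gene)}


def cerna_network(goi, mir_targets, coexpressed, min_shared=0.6, max_depth=2):
    """Breadth-first expansion: each putative ceRNA becomes a secondary GoI."""
    edges, seen, queue = set(), {goi}, deque([(goi, 0)])
    while queue:
        gene, depth = queue.popleft()
        if depth == max_depth:
            continue
        for partner in putative_cernas(gene, mir_targets, coexpressed, min_shared):
            edges.add(frozenset((gene, partner)))
            if partner not in seen:
                seen.add(partner)
                queue.append((partner, depth + 1))
    return edges
```

In practice `coexpressed` would wrap a correlation test on the expression data, and `mir_targets` would be filled from validated or predicted miRNA-target pairs.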

Figure 5: ER+ modulated gene regulation network. Legend: positive correlation under ER+ modulation; negative correlation under ER+ modulation; gene.

To identify ceRNA candidates, three miRNA binding prediction algorithms, SiteTest, SVMicrO, and BCMicrO, are used in TraceRNA. SiteTest is an algorithm similar to MuTaMe and uses UTR features for target prediction. SVMicrO [52] is an algorithm that uses a large number of sequence-level site features as well as UTR features, including binding secondary structure, energy, and conservation, whereas BCMicrO [53] employs a Bayesian approach that integrates predictions from 6 popular algorithms: TargetScan, miRanda, PicTar, mirTarget, PITA, and DIANA-microT. The Pearson correlation coefficient was used to test the coexpression between the GoI and the candidate ceRNAs. We utilized the TCGA breast cancer cohort [3] as the expression data, requiring that 60% of GBmiRs be common miRNAs and that the Pearson correlation coefficient be > 0.9. The final scores of putative ceRNAs (see Table 3, last column) were generated using the Borda merging method, which reranks the sum of ranks from both GBmiR binding and coexpression values [54]. To illustrate the utility of the TraceRNA algorithm for breast cancer study, we focus on the genes interacting with the estrogen receptor alpha, ESR1, with GBmiRs including miR-18a, miR-18b, miR-193b, miR-19a, miR-19b, miR-206, miR-20b, miR-22, miR-221, miR-222, miR-29b, and miR-302c. The regulation network generated with ESR1 as the initial GoI is shown in Figure 7, and the top 18 ceRNAs are provided in Table 3. The TraceRNA algorithm can be accessed at http://compgenomics.utsa.edu/cerna/.
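
The rank merging described above can be sketched as a simple Borda-style procedure: rank the candidates by each score, sum the two ranks, and rerank by the sum. The tie-breaking rule (alphabetical by gene name) and the function names are assumptions for illustration, and the gene labels in the usage note are placeholders.

```python
def ranks(scores):
    """Map each gene to its rank (1 = best), with higher scores ranked first."""
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def borda_merge(binding_scores, coexpression_scores):
    """Return the genes present in both lists, ordered by their summed ranks."""
    binding_rank = ranks(binding_scores)
    coexpression_rank = ranks(coexpression_scores)
    common = set(binding_rank) & set(coexpression_rank)
    total = {gene: binding_rank[gene] + coexpression_rank[gene] for gene in common}
    return sorted(total, key=lambda gene: (total[gene], gene))
```

For instance, with binding scores `{"geneA": 0.9, "geneB": 0.8, "geneC": 0.5}` and coexpression values `{"geneA": 0.91, "geneB": 0.99, "geneC": 0.95}`, `geneB` has the smallest rank sum and is placed first even though it is not the top candidate by binding score alone.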

Figure 6: The analysis flow chart of TraceRNA. Flow: gene of interest (GoI) → GoI binding miRNAs (GBmiRs; validated miRNAs from miRTarBase or predicted by SVMicrO) → identify mRNAs targeted by the same GBmiRs → candidates of ceRNAs → identify ceRNAs with coexpressed patterns → putative ceRNAs (output to TraceRNA website) → select each ceRNA as a secondary GoI → regulation network of ceRNAs.

Figure 7: (a) ceRNA network for gene of interest ESR1 generated using TraceRNA. (b) Network graph enlarged at ESR1.

5. Conclusions
In this report, we attempt to provide a unified concept of modulation of gene regulation, encompassing earlier mRNA expression-based methods and the more recent ceRNA method. We expect that integrating the ceRNA concept into gene-gene interactions, together with the identification of their modulators, will further enhance our understanding of gene interactions and their systemic influence. The applications provided here also represent examples of our earlier attempts to construct modulated networks specific to breast cancer studies. Further investigation will be carried out to extend our modeling to provide a unified understanding of genetic regulation in an altered environment.

Table 3: Top 18 candidate ceRNAs for ESR1 as GOI obtained from TraceRNA. ESR1 is at rank of 174 (not listed in this table).

              SVMicrO-based prediction    Expression correlation
Gene symbol   Score      P value          Score      P value       Final score
FOXP1         1.066      0.0043           0.508      0.016         1212
VEZF1         0.942      0.0060           0.4868     0.020         1179
NOVA1         0.896      0.0067           0.479      0.023         1160
CPEB3         0.858      0.0074           0.484      0.022         1149
MAP2K4        0.919      0.0064           0.322      0.097         1139
FAM120A       0.885      0.0069           0.341      0.082         1130
PCDHA3        0.983      0.0054           0.170      0.215         1125
SIRT1         0.927      0.0062           0.230      0.162         1117
PCDHA5        0.983      0.0054           0.148      0.233         1113
PTEN          0.898      0.0067           0.221      0.168         1104
PCDHA1        0.983      0.0054           0.140      0.239         1103
NBEA          0.752      0.0098           0.491      0.020         1102
ZFHX4         0.970      0.0056           0.154      0.229         1097
GLCE          0.798      0.0087           0.3231     0.096         1096
MAGI2         0.777      0.0092           0.321      0.097         1086
SATB2         0.801      0.0086           0.243      0.151         1078
LEF1          0.753      0.0098           0.291      0.112         1065
ATPBD4        0.819      0.0082           0.170      0.215         1060

Authors' Contribution
M. Flores and T.-H. Hsiao contributed equally to this work.

Acknowledgments
The authors would like to thank the Qatar National Research Foundation (NPRP 09-874-3-235, to Y. Chen and Y. Huang) and the National Science Foundation (CCF-1246073, to Y. Huang) for funding this work. The authors also thank the UTSA Computational Systems Biology Core Facility (NIH RCMI 5G12RR013646-12) for computational support.

References
[1] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, "Quantitative monitoring of gene expression patterns with a complementary DNA microarray," Science, vol. 270, no. 5235, pp. 467–470, 1995.
[2] E. R. Mardis, "Next-generation DNA sequencing methods," Annual Review of Genomics and Human Genetics, vol. 9, pp. 387–402, 2008.
[3] Cancer Genome Atlas Network, "Comprehensive molecular portraits of human breast tumours," Nature, vol. 490, pp. 61–70, 2012.
[4] D. Bell, A. Berchuck, M. Birrer et al., "Integrated genomic analyses of ovarian carcinoma," Nature, vol. 474, no. 7353, pp. 609–615, 2011.
[5] R. McLendon, A. Friedman, D. Bigner et al., "Comprehensive genomic characterization defines human glioblastoma genes and core pathways," Nature, vol. 455, no. 7216, pp. 1061–1068, 2008.
[6] C. M. Perou, T. Sørlie, M. B. Eisen et al., "Molecular portraits of human breast tumours," Nature, vol. 406, no. 6797, pp. 747–752, 2000.
[7] J. Lapointe, C. Li, J. P. Higgins et al., "Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 3, pp. 811–816, 2004.
[8] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 25, pp. 14863–14868, 1998.
[9] T. Schlitt and A. Brazma, "Current approaches to gene regulatory network modelling," BMC Bioinformatics, vol. 8, supplement 6, article S9, 2007.
[10] H. Hache, H. Lehrach, and R. Herwig, "Reverse engineering of gene regulatory networks: a comparative study," Eurasip Journal on Bioinformatics and Systems Biology, vol. 2009, Article ID 617281, 2009.
[11] W. P. Lee and W. S. Tzou, "Computational methods for discovering gene networks from expression data," Briefings in Bioinformatics, vol. 10, no. 4, pp. 408–423, 2009.
[12] C. Sima, J. Hua, and S. Jung, "Inference of gene regulatory networks using time-series data: a survey," Current Genomics, vol. 10, no. 6, pp. 416–429, 2009.
[13] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim, "A gene-coexpression network for global discovery of conserved genetic modules," Science, vol. 302, no. 5643, pp. 249–255, 2003.
[14] A. A. Margolin, I. Nemenman, K. Basso et al., "ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context," BMC Bioinformatics, vol. 7, supplement 1, article S7, 2006.
[15] E. R. Dougherty, S. Kim, and Y. Chen, "Coefficient of determination in nonlinear signal processing," Signal Processing, vol. 80, no. 10, pp. 2219–2235, 2000.
[16] S. Kim, E. R. Dougherty, Y. Chen et al., "Multivariate measurement of gene expression relationships," Genomics, vol. 67, no. 2, pp. 201–209, 2000.
[17] X. Chen, M. Chen, and K. Ning, "BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network," Bioinformatics, vol. 22, no. 23, pp. 2952–2954, 2006.
[18] A. V. Werhli, M. Grzegorczyk, and D. Husmeier, "Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks," Bioinformatics, vol. 22, no. 20, pp. 2523–2531, 2006.
[19] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, "Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks," Bioinformatics, vol. 18, no. 2, pp. 261–274, 2002.
[20] P. Sumazin, X. Yang, H.-S. Chiu et al., "An extensive microRNA-mediated network of RNA-RNA interactions regulates established oncogenic pathways in glioblastoma," Cell, vol. 147, no. 2, pp. 370–381, 2011.
[21] Y. Tay, L. Kats, L. Salmena et al., "Coding-independent regulation of the tumor suppressor PTEN by competing endogenous mRNAs," Cell, vol. 147, no. 2, pp. 344–357, 2011.
[22] D. P. Bartel, "MicroRNAs: target recognition and regulatory functions," Cell, vol. 136, no. 2, pp. 215–233, 2009.
[23] D. Yue, J. Meng, M. Lu, C. L. P. Chen, M. Guo, and Y. Huang, "Understanding microRNA regulation: a computational perspective," IEEE Signal Processing Magazine, vol. 29, no. 1, Article ID 6105465, pp. 77–88, 2012.
[24] M. W. Jones-Rhoades and D. P. Bartel, "Computational identification of plant microRNAs and their targets, including a stress-induced miRNA," Molecular Cell, vol. 14, no. 6, pp. 787–799, 2004.
[25] D. Hanahan and R. A. Weinberg, "The hallmarks of cancer," Cell, vol. 100, no. 1, pp. 57–70, 2000.
[26] S. Y. Chun, C. Johnson, J. G. Washburn, M. R. Cruz-Correa, D. T. Dang, and L. H. Dang, "Oncogenic KRAS modulates mitochondrial metabolism in human colon cancer cells by inducing HIF-1α and HIF-2α target genes," Molecular Cancer, vol. 9, article 293, 2010.
[27] N. J. Hudson, A. Reverter, and B. P. Dalrymple, "A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation," PLoS Computational Biology, vol. 5, no. 5, Article ID e1000382, 2009.
[28] I. Stelniec-Klotz, S. Legewie, O. Tchernitsa et al., "Reverse engineering a hierarchical regulatory network downstream of oncogenic KRAS," Molecular Systems Biology, vol. 8, Article ID 601, 2012.
[29] C. Shen, Y. Huang, Y. Liu et al., "A modulated empirical Bayes model for identifying topological and temporal estrogen receptor α regulatory networks in breast cancer," BMC Systems Biology, vol. 5, article 67, 2011.
[30] C. A. Wilson and J. Dering, "Recent translational research: microarray expression profiling of breast cancer. Beyond classification and prognostic markers?" Breast Cancer Research, vol. 6, no. 5, pp. 192–200, 2004.
[31] H. E. Cunliffe, M. Ringner, S. Bilke et al., "The gene expression response of breast cancer to growth regulators: patterns and correlation with tumor expression profiles," Cancer Research, vol. 63, no. 21, pp. 7158–7166, 2003.

[32] J. Frasor, F. Stossi, J. M. Danes, B. Komm, C. R. Lyttle, and B. S. Katzenellenbogen, "Selective estrogen receptor modulators: discrimination of agonistic versus antagonistic activities by gene expression profiling in breast cancer cells," Cancer Research, vol. 64, no. 4, pp. 1522–1533, 2004.
[33] L. J. van't Veer, H. Dai, M. J. van de Vijver et al., "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[34] S. A. Kauffman, The Origins of Order: Self-Organization and Selection in Evolution, Oxford University Press, New York, NY, USA, 1993.
[35] J. D. Allen, Y. Xie, M. Chen, L. Girard, and G. Xiao, "Comparing statistical methods for constructing large scale gene networks," PLoS ONE, vol. 7, no. 1, Article ID e29348, 2012.
[36] Y. Huang, I. M. Tienda-Luna, and Y. Wang, "Reverse engineering gene regulatory networks: a survey of statistical models," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 76–97, 2009.
[37] F. Crick, "Central dogma of molecular biology," Nature, vol. 227, no. 5258, pp. 561–563, 1970.
[38] A. Hamilton and M. Piccart, "The contribution of molecular markers to the prediction of response in the treatment of breast cancer: a review of the literature on HER-2, p53 and BCL-2," Annals of Oncology, vol. 11, no. 6, pp. 647–663, 2000.
[39] C. Sotiriou, S. Y. Neo, L. M. McShane et al., "Breast cancer classification and prognosis based on gene expression profiles from a population-based study," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 18, pp. 10393–10398, 2003.
[40] T. Sørlie, C. M. Perou, R. Tibshirani et al., "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 19, pp. 10869–10874, 2001.
[41] J. S. Carroll, C. A. Meyer, J. Song et al., "Genome-wide analysis of estrogen receptor binding sites," Nature Genetics, vol. 38, no. 11, pp. 1289–1297, 2006.
[42] K. Basso, A. A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, and A. Califano, "Reverse engineering of regulatory networks in human B cells," Nature Genetics, vol. 37, no. 4, pp. 382–390, 2005.
[43] K. C. Liang and X. Wang, "Gene regulatory network reconstruction using conditional mutual information," Eurasip Journal on Bioinformatics and Systems Biology, vol. 2008, Article ID 253894, 2008.
[44] K. Wang, B. C. Bisikirska, M. J. Alvarez et al., "Genome-wide identification of post-translational modulators of transcription factor activity in human B cells," Nature Biotechnology, vol. 27, no. 9, pp. 829–837, 2009.
[45] M. Hansen, L. Everett, L. Singh, and S. Hannenhalli, "Mimosa: mixture model of co-expression to detect modulators of regulatory interaction," Algorithms for Molecular Biology, vol. 5, no. 1, article 4, 2010.
[46] O. Babur, E. Demir, M. Gonen, C. Sander, and U. Dogrusoz, "Discovering modulators of gene expression," Nucleic Acids Research, vol. 38, no. 17, Article ID gkq287, pp. 5648–5656, 2010.
[47] T. Shimamura, S. Imoto, Y. Shimada et al., "A novel network profiling analysis reveals system changes in epithelial-mesenchymal transition," PLoS ONE, vol. 6, no. 6, Article ID e20804, 2011.
[48] H. Y. Wu et al., "A modulator based regulatory network for ERalpha signaling pathway," BMC Genomics, vol. 13, supplement 6, article S6, 2012.

[49] K.-K. Yan, W. Hwang, J. Qian et al., "Construction and analysis of an integrated regulatory network derived from high-throughput sequencing data," PLoS Computational Biology, vol. 7, no. 11, Article ID e1002190, 2011.
[50] M. Flores and Y. Huang, "TraceRNA: a web based application for ceRNAs prediction," in Proceedings of the IEEE Genomic Signal Processing and Statistics Workshop (GENSIPS '12), 2012.
[51] S. D. Hsu, F. M. Lin, W. Y. Wu et al., "MiRTarBase: a database curates experimentally validated microRNA-target interactions," Nucleic Acids Research, vol. 39, no. 1, pp. D163–D169, 2011.
[52] H. Liu, D. Yue, Y. Chen, S. J. Gao, and Y. Huang, "Improving performance of mammalian microRNA target prediction," BMC Bioinformatics, vol. 11, article 476, 2010.
[53] Y. Dong et al., "A Bayesian decision fusion approach for microRNA target prediction," BMC Genomics, vol. 13, 2012.
[54] J. A. Aslam and M. Montague, "Models for metasearch," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 276–284, New Orleans, La, USA, 2001.


Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Spectral Analysis on Time-Course Expression Data: Detecting
Periodic Genes Using a Real-Valued Iterative Adaptive Approach
Kwadwo S. Agyepong,1 Fang-Han Hsu,1 Edward R. Dougherty,1,2 and Erchin Serpedin1

1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004-2101, USA

Correspondence should be addressed to Erchin Serpedin; serpedin@ece.tamu.edu


Received 26 October 2012; Accepted 23 January 2013
Academic Editor: Mohamed Nounou
Copyright © 2013 Kwadwo S. Agyepong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Time-course expression profiles and methods for spectrum analysis have been applied for detecting transcriptional periodicities, which are valuable patterns for unraveling genes associated with cell cycle and circadian rhythm regulation. However, most of the proposed methods suffer, to a certain extent, from restrictions and a large number of false positives. Additionally, in some experiments, arbitrarily irregular sampling times as well as the presence of high noise and small sample sizes make accurate detection a challenging task. A novel scheme for detecting periodicities in time-course expression data is proposed, in which a real-valued iterative adaptive approach (RIAA), originally proposed for signal processing, is applied for periodogram estimation. The inferred spectrum is then analyzed using Fisher's hypothesis test. With a proper p-value threshold, periodic genes can be detected. A periodic signal, two nonperiodic signals, and four sampling strategies were considered in the simulations, including both bursts and drops. In addition, two real yeast datasets were applied for validation. The simulations and real data analysis reveal that RIAA can perform competitively with the existing algorithms. The advantage of RIAA is manifested when the expression data are highly irregularly sampled and when the number of cycles covered by the sampling time points is very small.

1. Introduction
Patterns of periodic gene expression have been found to be associated with essential biological processes such as cell cycle and circadian rhythm [1], and the detection of periodic genes is crucial to advance our understanding of gene function, disease pathways, and, ultimately, therapeutic solutions. Using high-throughput technologies such as microarrays, gene expression profiles at discrete time points can be derived, and hundreds of cell cycle regulated genes have been reported in a variety of species. For example, Spellman et al. applied cell synchronization methods and conducted time-course gene expression experiments on Saccharomyces cerevisiae [2]. The authors identified 800 cell cycle regulated genes using DNA microarrays. Also, Rustici et al. and Menges et al. identified 407 and about 500 cell cycle regulated genes in Schizosaccharomyces pombe and Arabidopsis, respectively [3, 4].

Signal processing in the frequency domain simplifies the analysis, and a growing number of studies have demonstrated the power of spectrum analysis in the detection of periodic genes. Considering the common issues of missing values and noise in microarray experiments, Ahdesmäki et al. proposed a robust detection method incorporating the fast Fourier transform (FFT) with a series of data preprocessing and hypothesis testing steps [5]. Two years later, the authors further proposed a modified version for expression data with unevenly spaced time intervals [6]. A Lomb-Scargle (LS) approach, originally used for finding periodicities in astrophysics, was developed for expression data with uneven sampling [7]. Yang et al. further improved the performance using a detrended fluctuation analysis [8]; their method, which uses harmonic regression in the time domain for significance evaluation, was termed "Lomb-Scargle periodogram and harmonic regression" (LSPR). Basically, these methods consist of two steps: transferring the signals into the frequency (spectral) domain and then applying a significance evaluation test for the resulting peak in the spectral density.
While numerous methods have been developed for detecting periodicities in gene expression, most of these methods suffer from false positive errors and working restrictions to a certain extent, particularly when the time-course data contain limited time points. In addition, no single algorithm seems to be available that resolves all of these challenges. Because of high manufacturing and preparation costs, microarray and other high-throughput experiments share the characteristics of small sample size [9], noisy measurements [10], and arbitrary sampling strategies [11], thereby making the detection of periodicities highly challenging. Since the number and functions of cell cycle regulated genes, or periodic genes, remain greatly uncertain, advances in detection algorithms are urgently needed.
Recently, Stoica et al. developed a novel nonparametric method, termed the "real-valued iterative adaptive approach" (RIAA), specifically for spectral analysis with nonuniformly sampled data [12]. As stated by the authors, RIAA, an iteratively weighted least-squares periodogram, can provide robust spectral estimates and is most suitable for sinusoidal signals. These characteristics of RIAA inspired us to apply it to time-course gene expression data and to examine its performance. Herein, we incorporate RIAA with a Fisher's statistic to detect transcriptional periodicities. A rigorous comparison of RIAA with several aforementioned algorithms in terms of sensitivities and specificities is conducted through simulations, and results of real data analysis are also provided.
In this study, we found that the RIAA algorithm can provide robust spectral estimates for the detection of periodic genes regardless of the sampling strategies adopted in the experiments or the nonperiodic nature of noise present in the measurement process. We show through simulations that RIAA can outperform the existing algorithms, particularly when the data are highly irregularly sampled and when the number of cycles covered by the sampling time points is very small. These characteristics of RIAA fit perfectly the needs of time-course gene expression data analysis. This paper is organized as follows. In Section 2, we begin with an overview of RIAA. In Section 3, a scheme for detecting periodicities is proposed, and simulation models for performance evaluation and a real data analysis for validation purposes are presented. A complete investigation of the performance of RIAA and a rigorous comparison with other algorithms are provided in Section 4.

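2. RIAA Algorithm

RIAA is an iterative algorithm developed for finding the least-squares periodogram with the utilization of a weighted function. The essential mathematics involved in RIAA is introduced in this section with the algorithm input being time-course expression data; for more details regarding RIAA, the readers are encouraged to check the original paper by Stoica et al. [12].

2.1. Basics. Suppose that the signals associated with the periodic gene expressions are composed of noise and sinusoidal components. Let $y(t_n)$, $n = 1, \ldots, N$, denote the time-course expression ratios of gene $g$ at instances $t_1, \ldots, t_N$, respectively; $y(t_n)$ are real numbers; $\sum_{n=1}^{N} y(t_n) = 0$. The least-squares periodogram is given by

$$\Phi_{\mathrm{LS}}(\omega) = |\hat{\beta}(\omega)|^2, \tag{1}$$

where $\hat{\beta}(\omega)$ is the solution to the following fitting problem:

$$\hat{\beta}(\omega) = \arg\min_{\beta(\omega)} \sum_{n=1}^{N} \left| y(t_n) - \beta(\omega) e^{j\omega t_n} \right|^2. \tag{2}$$

Let $\beta(\omega) = |\beta(\omega)| e^{j\phi(\omega)} = \alpha e^{j\phi}$, where $\alpha = |\beta(\omega)| \geq 0$ and $\phi = \phi(\omega) \in [0, 2\pi]$ refer to the amplitude and phase of $\beta(\omega)$, respectively. The criterion in (2) can then be rewritten as

$$\sum_{n=1}^{N} \left[ y(t_n) - \alpha\cos(\omega t_n + \phi) \right]^2 + \alpha^2 \sum_{n=1}^{N} \sin^2(\omega t_n + \phi). \tag{3}$$

The second term in the above equation is data independent and can be omitted from the minimization operation. Hence, the criterion (2) is simplified to

$$(\hat{\alpha}, \hat{\phi}) = \arg\min_{\alpha, \phi} \sum_{n=1}^{N} \left[ y(t_n) - \alpha\cos(\omega t_n + \phi) \right]^2. \tag{4}$$

We further apply $a = \alpha\cos(\phi)$ and $b = -\alpha\sin(\phi)$ and derive an equivalent of (4) as follows:

$$(\hat{a}, \hat{b}) = \arg\min_{a, b} \sum_{n=1}^{N} \left[ y(t_n) - a\cos(\omega t_n) - b\sin(\omega t_n) \right]^2. \tag{5}$$

The target of interest to the fitting problem now becomes $a$ and $b$ (instead of $\beta(\omega)$), and the solution is well known to be

$$\begin{bmatrix} \hat{a} \\ \hat{b} \end{bmatrix} = \mathbf{R}^{-1}\mathbf{r}, \tag{6}$$

where

$$\mathbf{R} = \sum_{n=1}^{N} \begin{bmatrix} \cos^2(\omega t_n) & \cos(\omega t_n)\sin(\omega t_n) \\ \sin(\omega t_n)\cos(\omega t_n) & \sin^2(\omega t_n) \end{bmatrix}, \qquad \mathbf{r} = \sum_{n=1}^{N} \begin{bmatrix} \cos(\omega t_n) \\ \sin(\omega t_n) \end{bmatrix} y(t_n). \tag{7}$$

After $a$ and $b$ are estimated, the least-squares periodogram can be derived.

2.2. Observation Interval and Resolution. Prior to implementation of RIAA for periodogram estimation, the observation interval $[0, \omega_{\max}]$ and the resolution in terms of grid size have to be selected. To this end, the maximum frequency $\omega_{\max}$ in the observation interval without aliasing errors for sampling instances $t_1, \ldots, t_N$ can be evaluated by

$$\omega_{\max} = \frac{\omega_0}{2}, \tag{8}$$

where $\omega_0$ is given by

$$\omega_0 = \frac{2\pi(N - 1)}{\sum_{n=1}^{N-1} (t_{n+1} - t_n)}. \tag{9}$$

The observation interval $[0, \omega_{\max}]$ is hence chosen after $\omega_{\max}$ is obtained.

To ensure that the smallest frequency separation in time-course expression data with regular or irregular sampling can be adequately detected, the grid size $\Delta\omega$ is chosen to be

$$\Delta\omega = \frac{2\pi}{t_N - t_1}, \tag{10}$$

which, in fact, is the resolution limit of the least-squares periodogram. As a result, the frequency grids considered in the periodogram are

$$\omega_k = k\,\Delta\omega, \quad k = 1, \ldots, K, \tag{11}$$

where the number of grids $K$ is given by

$$K = \left\lfloor \frac{\omega_{\max}}{\Delta\omega} \right\rfloor. \tag{12}$$

2.3. Implementation. The following notations are introduced for the implementation of RIAA at a specific frequency $\omega_k$:

$$\mathbf{Y} = [y(t_1) \;\cdots\; y(t_N)]^T, \qquad \boldsymbol{\theta}_k = [a(\omega_k) \;\; b(\omega_k)]^T, \qquad \mathbf{A}_k = [\mathbf{c}_k \;\; \mathbf{s}_k], \tag{13}$$

where

$$\mathbf{c}_k = [\cos(\omega_k t_1) \;\cdots\; \cos(\omega_k t_N)]^T, \qquad \mathbf{s}_k = [\sin(\omega_k t_1) \;\cdots\; \sin(\omega_k t_N)]^T, \tag{14}$$

and $a(\omega_k)$ and $b(\omega_k)$ denote variables $a$ and $b$ at frequency $\omega_k$, respectively.

RIAA's salient feature is the addition of a weighted matrix $\mathbf{Q}_k$ to the least-squares fitting criterion. The weighted matrix $\mathbf{Q}_k$ can be viewed as a covariance matrix encapsulating the contributions to the spectrum of noise and of sinusoidal components in $\mathbf{Y}$ other than $\omega_k$; it is defined as

$$\mathbf{Q}_k = \boldsymbol{\Sigma} + \sum_{p=1,\, p \neq k}^{K} \mathbf{A}_p \mathbf{D}_p \mathbf{A}_p^T, \tag{15}$$

where

$$\mathbf{D}_p = \frac{a^2(\omega_p) + b^2(\omega_p)}{2} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \tag{16}$$

and $\boldsymbol{\Sigma}$ denotes the covariance matrix of noise in expression data $\mathbf{Y}$, given by

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{bmatrix}. \tag{17}$$

Assuming that $\mathbf{Q}_k$ is invertible, in RIAA, a weighted least-squares fitting problem is formulated and considered for finding $a$ and $b$ (instead of using (5)), and it is written in the form of matrices using (13) as follows:

$$\hat{\boldsymbol{\theta}}_k = \arg\min_{\boldsymbol{\theta}_k} \, [\mathbf{Y} - \mathbf{A}_k \boldsymbol{\theta}_k]^T \mathbf{Q}_k^{-1} [\mathbf{Y} - \mathbf{A}_k \boldsymbol{\theta}_k]. \tag{18}$$

In Stoica et al. [12], the solution to (18) has been shown to be

$$\hat{\boldsymbol{\theta}}_k = \left( \mathbf{A}_k^T \mathbf{Q}_k^{-1} \mathbf{A}_k \right)^{-1} \mathbf{A}_k^T \mathbf{Q}_k^{-1} \mathbf{Y}, \tag{19}$$

and the RIAA periodogram at $\omega = \omega_k$ can be derived by

$$\Phi_{\mathrm{riaa}}(\omega_k) = \frac{1}{N} \hat{\boldsymbol{\theta}}_k^T \left( \mathbf{A}_k^T \mathbf{A}_k \right) \hat{\boldsymbol{\theta}}_k. \tag{20}$$

From (15) and (19), it is obvious that $\mathbf{Q}_k$ and $\hat{\boldsymbol{\theta}}_k$ are dependent on each other. An iterative approach (i.e., RIAA) is hence a feasible solution to get the estimate $\hat{\boldsymbol{\theta}}_k$ and the weighted matrix $\mathbf{Q}_k$.

The iteration for estimating the spectrum starts with initial estimates $\hat{\boldsymbol{\theta}}_k^0$, in which the elements $a$ and $b$ are given by (6) with $\omega = \omega_k$, $k = 1, \ldots, K$. After initialization, the first iteration begins. First, the elements $a$ and $b$ of $\hat{\boldsymbol{\theta}}_k^0$ are applied to obtain $\hat{\mathbf{D}}_k^1$ using (16). Secondly, to get a good estimate of $\hat{\boldsymbol{\Sigma}}^1$, the frequency $\omega_p$ at which the largest value is located in the temporary periodogram $\hat{\Phi}^0(\omega_k)$, $k = 1, \ldots, K$, derived using (20) with $\hat{\boldsymbol{\theta}}_k = \hat{\boldsymbol{\theta}}_k^0$, is applied for obtaining a reverse-engineered signal $\hat{\mathbf{Y}}^0$. The elements $\hat{y}(t_n)$, $n = 1, \ldots, N$, in $\hat{\mathbf{Y}}^0$ are given by

$$\hat{y}(t_n) = \sqrt{2\hat{\Phi}^0(\omega_p)}\, \cos(\omega_p t_n + \varphi), \quad n = 1, \ldots, N. \tag{21}$$

The phase $\varphi$ of the cosine function is unknown; however, $\hat{\boldsymbol{\Sigma}}^1$ is estimable using

$$\hat{\boldsymbol{\Sigma}}^1 = \left( \min_{\varphi \in [0, 2\pi]} \frac{\| \mathbf{Y} - \hat{\mathbf{Y}}^0 \|^2}{N} \right) \mathbf{I}_N, \tag{22}$$

where $\| \cdot \|$ is the Euclidean norm and $\mathbf{I}_N$ is the $N \times N$ identity matrix, in keeping with the diagonal form (17). With estimates $\hat{\mathbf{D}}_k^1$ and $\hat{\boldsymbol{\Sigma}}^1$, the estimates $\hat{\mathbf{Q}}_k^1$, $k = 1, \ldots, K$, in the first iteration are hence given by (15). After this, $\hat{\mathbf{Q}}_k^1$ are inserted into the right-hand side of (19) and updated estimates $\hat{\boldsymbol{\theta}}_k^1$, $k = 1, \ldots, K$, are derived. The algorithm consists of repeating these steps and updating $\hat{\mathbf{Q}}_k^i$ and $\hat{\boldsymbol{\theta}}_k^i$ iteratively, where $i$ denotes the number of iterations, until a termination criterion is reached. If the process stops at the $I$th iteration, then the final RIAA periodogram is given by (20) using $\hat{\boldsymbol{\theta}}_k^I$. The pseudocode in Algorithm 1 represents a concise description of the iterative RIAA process.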
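The steps of Sections 2.2 and 2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `frequency_grid` and `riaa_periodogram`, the use of a floor in (12), the amplitude $\sqrt{2\hat{\Phi}^0(\omega_p)}$ for the reverse-engineered sinusoid in (21), the grid search over the unknown phase in (22), and the small lower bound on the noise variance are all choices made here for the sake of a runnable example.

```python
import numpy as np


def frequency_grid(t):
    """Angular-frequency grid omega_k = k * d_omega following (8)-(12)."""
    t = np.asarray(t, dtype=float)
    n = t.size
    omega0 = 2.0 * np.pi * (n - 1) / np.sum(np.diff(t))  # (9)
    omega_max = omega0 / 2.0                              # (8)
    d_omega = 2.0 * np.pi / (t[-1] - t[0])                # (10)
    k_max = int(np.floor(omega_max / d_omega))            # (12), floor assumed
    return d_omega * np.arange(1, k_max + 1)              # (11)


def riaa_periodogram(t, y, max_iter=15, tol=0.005, n_phase=360):
    """Return (omegas, periodogram) for samples y taken at times t."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()  # the text assumes zero-sum data
    n = t.size
    omegas = frequency_grid(t)
    # A_k = [c_k s_k] from (13)-(14): one N x 2 matrix per grid frequency.
    A = [np.column_stack((np.cos(w * t), np.sin(w * t))) for w in omegas]
    # Initialization: ordinary least squares, (6) with R = A^T A, r = A^T y.
    theta = np.array([np.linalg.solve(Ak.T @ Ak, Ak.T @ y) for Ak in A])
    # Assumption: the unknown phase in (22) is searched on a uniform grid.
    phases = np.linspace(0.0, 2.0 * np.pi, n_phase, endpoint=False)

    def power(th):
        # (20): theta^T (A^T A) theta / N equals ||A theta||^2 / N.
        return np.array([np.sum((Ak @ v) ** 2) / n for Ak, v in zip(A, th)])

    for _ in range(max_iter):
        spec = power(theta)
        p = int(np.argmax(spec))
        # (21)-(22): best-phase sinusoid at the peak; residual power as sigma^2.
        amp = np.sqrt(2.0 * spec[p])  # assumed amplitude of the sinusoid
        model = amp * np.cos(omegas[p] * t[None, :] + phases[:, None])
        sigma2 = np.min(np.sum((y - model) ** 2, axis=1)) / n
        sigma2 = max(sigma2, 1e-12)  # assumed floor keeping Q_k invertible
        # (16): D_p is a scaled identity, so A_p D_p A_p^T = d_p * A_p A_p^T.
        d = np.sum(theta ** 2, axis=1) / 2.0
        blocks = [dk * (Ak @ Ak.T) for dk, Ak in zip(d, A)]
        total = sigma2 * np.eye(n) + sum(blocks)
        new_theta = np.empty_like(theta)
        for k, Ak in enumerate(A):
            Q = total - blocks[k]                      # (15): sum over p != k
            QinvA = np.linalg.solve(Q, Ak)             # Q_k^{-1} A_k
            new_theta[k] = np.linalg.solve(Ak.T @ QinvA, QinvA.T @ y)  # (19)
        old = np.linalg.norm(theta, axis=1)
        new = np.linalg.norm(new_theta, axis=1)
        theta = new_theta
        # Termination rule of Algorithm 1 (relative change below 0.005).
        if np.sum((new - old) ** 2) < tol * np.sum(old ** 2):
            break
    return omegas, power(theta)
```

Calling `riaa_periodogram(t, y)` returns the grid of angular frequencies and the periodogram values on that grid; dividing the frequencies by $2\pi$ gives cycles per unit time. The sketch does not handle degenerate grids, for example a grid frequency at which the sine column vanishes under perfectly regular sampling.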

3. Methods

Figure 1 demonstrates our scheme for periodicity detection and algorithm comparison. The first step involves a periodogram estimation, which converts the time-course gene expression ratios into the frequency domain. Three methods are considered for comparison: RIAA, LS, and a detrend LS (termed DLS), which uses an additional detrend function (developed in LSPR) before regular LS periodogram estimation is applied. The derived spectra are then analyzed using hypothesis testing. This study is conducted using a Fisher's test, with the null hypothesis that there are no periodic signals in the time domain and hence no significantly large peak in the derived spectra. The algorithm performance is evaluated and compared via simulations and receiver operating characteristic (ROC) curves. In real microarray data analysis, three published benchmark sets are utilized as standards of cell cycle genes for performance comparison.

[Flowchart: time-course expression ratios; periodogram estimation (spectral analysis in frequency domain: RIAA, compared with LS, DLS); hypothesis testing (Fisher's test); periodic genes and nonperiodicities. Evaluation: simulations (ROC curves) and real data (benchmark sets).]

Figure 1: The scheme of the process for detecting periodicities in time-course expression data.

Algorithm RIAA
Initialization
  Use (6) to obtain the initial estimates $a$ and $b$ in $\hat{\boldsymbol{\theta}}_k^0$.
The First Iteration
  Obtain $\hat{\mathbf{D}}_k^1$ using (16) with parameters $a$ and $b$ given by $\hat{\boldsymbol{\theta}}_k^0$. Obtain $\hat{\boldsymbol{\Sigma}}^1$ using (22). Use $\hat{\mathbf{D}}_k^1$ and $\hat{\boldsymbol{\Sigma}}^1$ to derive the first weighted matrix $\hat{\mathbf{Q}}_k^1$ by (15). Update the estimate $\hat{\boldsymbol{\theta}}_k^1$ by (19) with $\mathbf{Q}_k = \hat{\mathbf{Q}}_k^1$.
Updating Iteration
  At the $i$th iteration, $i = 1, 2, \ldots$, estimates $\hat{\mathbf{Q}}_k^i$ and $\hat{\boldsymbol{\theta}}_k^i$ are iteratively updated in the same way as in the first iteration.
Termination
  Terminate simply after 15 iterations ($I = 15$), or when the total change in $\|\hat{\boldsymbol{\theta}}_k\|$, $k = 1, \ldots, K$, is extremely small, say, $\sum_{k=1}^{K} (\|\hat{\boldsymbol{\theta}}_k^i\| - \|\hat{\boldsymbol{\theta}}_k^{i-1}\|)^2 < 0.005 \sum_{k=1}^{K} \|\hat{\boldsymbol{\theta}}_k^{i-1}\|^2$; then $I = i$.

Algorithm 1: The pseudocode of the iterative process in RIAA.

3.1. Fisher's Test. After the spectrum of time-course expression data is obtained via periodogram estimation, a Fisher's statistic $f_g$ for gene $g$, with the null hypothesis $H_0$ that the peak of the spectral density is insignificant against the alternative hypothesis $H_1$ that the peak of the spectral density is significant, is applied as

$$f_g = \frac{\max_{1 \leq k \leq K} \left( \Phi(\omega_k) \right)}{K^{-1} \sum_{k=1}^{K} \Phi(\omega_k)}, \tag{23}$$

where $\Phi$ refers to the periodogram derived using RIAA, LS, or DLS. The null hypothesis $H_0$ is rejected, and the gene $g$ is claimed as a periodic gene if its p-value, denoted as $p_g$, is less than or equal to a specific significance threshold. For simplicity, $p_g$ is approximated from the asymptotic null distribution of $f_g$ assuming Gaussian noise [13] as follows:

$$p_g = 1 - \left( 1 - e^{-f_g} \right)^K. \tag{24}$$

In real data analysis, deviation might be invoked for the estimation of $p_g$ when the time-course data is short. This issue was carefully addressed by Liew et al. [14], and, as suggested, alternative methods such as random permutation may provide less deviation and better performance. However, permutation also has limitations such as tending to be conservative [15]. While finding the most robust method for the p-value evaluation remains an open question, it is beyond the scope of this study since the algorithm comparison via ROC curves is threshold independent [16], and the results are unaffected by the deviation.
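The test of Section 3.1 can be sketched in a few lines of Python. The function names are hypothetical, and the p-value expression $1 - (1 - e^{-f})^K$ is an assumption made here: it is the usual asymptotic form for the ratio of the largest periodogram value to the mean value under Gaussian noise, and is taken to be what (24) denotes.

```python
import numpy as np


def fisher_statistic(power):
    """Ratio of the periodogram peak to its mean over the K grid points, as in (23)."""
    power = np.asarray(power, dtype=float)
    return power.max() / power.mean()


def fisher_p_value(power):
    """Asymptotic p-value of the peak; the form 1 - (1 - exp(-f))^K is assumed."""
    k = len(power)
    f = fisher_statistic(power)
    return 1.0 - (1.0 - np.exp(-f)) ** k
```

A gene would then be called periodic when `fisher_p_value(power)` is at or below the chosen significance threshold. A flat spectrum gives a statistic of 1 and a large p-value, whereas a spectrum dominated by a single peak gives a large statistic and a small p-value.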
3.2. Simulations. Simulations are applied to evaluate the performance of RIAA. The simulation models and sampling strategies used for simulations are described in the following paragraphs.

3.2.1. Periodic and Nonperiodic Signals. Three models, one for periodic signals and two for nonperiodic signals, are considered as transcriptional signals. Since periodic genes are transcribed in an oscillatory manner, the expression levels embedded with periodicities are assumed to be

$$y(t_n) = M\cos(\omega_s t_n) + \varepsilon_{t_n}, \quad n = 1, \ldots, N, \tag{25}$$

where $M$ denotes the sinusoidal amplitude; $\omega_s$ refers to the signal frequency; $\varepsilon_{t_n}$ are Gaussian noise independent and identically distributed (i.i.d.) with parameters $\mu$ and $\sigma$. For nonperiodic signals, the first model is simply composed of Gaussian noise, given by

$$y(t_n) = \varepsilon_{t_n}, \quad n = 1, \ldots, N. \tag{26}$$

Additionally, as visualized by Chubb et al., gene transcription can be nonperiodically activated with irregular intervals in a living eukaryotic cell, like pulses turning on and off rapidly and discontinuously [17]. Based on this, the second nonperiodic model incorporates one additional transcriptional burst and one additional sudden drop into the Gaussian noise, which can be written as

$$y(t_n) = I_b(t_n) - I_d(t_n) + \varepsilon_{t_n}, \quad n = 1, \ldots, N, \tag{27}$$

where $I_b$ and $I_d$ are indicator functions, equal to 1 at the location of the burst and the drop, respectively, and 0 otherwise. The transcriptional burst assumes a positive pulse while the transcriptional drop assumes a negative pulse. Both of them may be located randomly among all time points and are assumed to last for two time points. In other words, the indicator functions are equal to 1 at two consecutive time points, say, $I_b = 1$ at $t_n$ and $t_{n+1}$. The burst and the drop have no overlap.

Figure 2: (a) A time-course periodic signal with frequency 0.2 sampled by the bio-like sampling strategy (gene expression versus time; sampled data and periodic signal); 16 time points are assigned to the interval (0, 8], and 8 time points are assigned to the interval (8, 16]. (b) The periodogram derived using RIAA (amplitude versus frequency; p-value = 2.4 × 10^-3). The maximum value (peak) in the periodogram is located at frequency 0.195.
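The three signal models (25)-(27) could be simulated along the following lines. This is a sketch under assumptions: the function names are hypothetical, the noise parameters are read as the mean and standard deviation of the Gaussian noise, and the burst and drop positions are drawn uniformly by rejection sampling until they do not overlap.

```python
import numpy as np


def periodic_signal(t, amplitude, omega, mu, sigma, rng):
    """Model (25): a sinusoid plus i.i.d. Gaussian noise."""
    return amplitude * np.cos(omega * t) + rng.normal(mu, sigma, size=t.size)


def noise_signal(t, mu, sigma, rng):
    """Model (26): i.i.d. Gaussian noise only."""
    return rng.normal(mu, sigma, size=t.size)


def burst_drop_signal(t, mu, sigma, rng, width=2):
    """Model (27): one burst (+1) and one drop (-1), each lasting `width` points."""
    n = t.size
    # Assumption: start positions are uniform; redraw until the pulses do not overlap.
    while True:
        b, d = rng.integers(0, n - width + 1, size=2)
        if abs(b - d) >= width:
            break
    pulses = np.zeros(n)
    pulses[b:b + width] += 1.0   # indicator of the burst
    pulses[d:d + width] -= 1.0   # indicator of the drop
    return pulses + rng.normal(mu, sigma, size=n)
```

Passing an explicit `numpy.random.Generator` (for example `np.random.default_rng(1)`) keeps the simulated datasets reproducible.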
3.2.2. Sampling Strategies. As for the choices of sampling time points $t_n$, $n = 1, \ldots, N$, four different sampling strategies, one with regular sampling and three with irregular sampling, are considered. First, regular sampling is applied, in which all time intervals are set to be $1/\lambda$, where $\lambda$ is a constant. Secondly, a bio-like sampling strategy is invoked. This strategy tends to have more time points at the beginning of time-course experiments and fewer time points afterwards: we set the first 2/3 of the time intervals as $1/\lambda$ and the next 1/3 of the time intervals as $2/\lambda$. Third, time intervals are randomly chosen between $1/\lambda$ and $2/\lambda$. The last sampling strategy, in which all time intervals are exponentially distributed with parameter $\lambda$, is less realistic than the others, but it is helpful for evaluating the performance of RIAA under pathological conditions.
ROC curves are applied for performance comparison. To this end, 10,000 periodic signals were generated using (25) and 10,000 nonperiodic signals were generated using either (26) or (27). Sensitivity measures the proportion of successful detection among the 10,000 periodic signals, and specificity measures the proportion of correct claims on the 10,000 nonperiodic simulation datasets. Sampling time points are decided by one of the four sampling strategies, and the number of time points is chosen arbitrarily. For all ROC curves in Section 4, $\lambda = 2$ and $N = 24$.
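Given the p-values obtained for the simulated periodic and nonperiodic signals, the points of an ROC curve follow directly from the definitions of sensitivity and specificity above. The sketch below is one possible way to compute them; the function name and the default threshold grid are hypothetical.

```python
import numpy as np


def roc_points(p_periodic, p_nonperiodic, thresholds=None):
    """Return (1 - specificity, sensitivity) for each p-value threshold."""
    p_pos = np.asarray(p_periodic, dtype=float)
    p_neg = np.asarray(p_nonperiodic, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # assumed default grid
    # A signal is claimed periodic when its p-value is at or below the threshold.
    sensitivity = np.array([np.mean(p_pos <= th) for th in thresholds])
    specificity = np.array([np.mean(p_neg > th) for th in thresholds])
    return 1.0 - specificity, sensitivity
```

Because every threshold contributes one point, the resulting curve does not depend on any single significance level, which is why the comparison of RIAA, LS, and DLS through ROC curves is threshold independent.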
3.3. Real Data Analysis. Two yeast cell cycle experiments synchronized using an alpha-factor, one conducted by Spellman et al. [2] and one conducted by Pramila et al. [18], are considered for a real data analysis. The first time-course microarray dataset, termed "dataset alpha" and downloaded from the Yeast Cell Cycle Analysis Project website (http://genome-www.stanford.edu/cellcycle/), harbors 6,178 gene expression levels and 18 sampling time points with a 7-minute interval. The second time-course dataset, termed "dataset alpha 38," was downloaded from the online portal for Fred Hutchinson Cancer Research Center's scientific laboratories (http://labs.fcrc.org/breeden/cellcycle/). This dataset contains 4,774 gene expression levels and 25 sampling time points with a 5-minute interval. Three benchmark sets of genes that have been utilized in Lichtenberg et al. [19] and Liew et al. [20] as standards of cell cycle genes are also applied herein for performance comparison. These benchmark sets, involving 113, 352, and 518 genes, respectively, include candidates of cell cycle regulated genes in yeast proposed by Spellman et al. [2], Johansson et al. [21], Simon et al. [22], Lee et al. [23], and Mewes et al. [24] and are accessible on a laboratory website (http://www.cbs.dtu.dk/cellcycle/).

4. Results
Figure 3: The ROC curves derived from simulations with 24 sampling time points, signal amplitude = 1, frequency = 0.4, and Gaussian noise
with $\mu = 0$ and $\sigma = 0.5$. Description of subplots is provided in Section 4.
[Panels (a)-(h); axes: sensitivity versus 1-specificity; legend: RIAA, LS, DLS.]

Figure 4: The ROC curves derived from simulations with 24 sampling time points, signal amplitude = 1, frequency = 0.1, and Gaussian noise
with $\mu = 0$ and $\sigma = 0.5$. Description of subplots is provided in Section 4.
[Panels (a)-(h); axes: sensitivity versus 1-specificity; legend: RIAA, LS, DLS.]

Figure 5: The intersection of preserved genes and the benchmark sets using the RIAA, LS, and DLS algorithms. (a), (b), and (c) show the analysis
results when dataset alpha was applied. (d), (e), and (f) show the analysis results when dataset alpha 38 was applied.
[Panels (a)-(f), one per combination of dataset and benchmark set (113, 352, and 518 genes); axes: the number of intersections versus the number of preserved genes; legend: RIAA, LS, DLS.]

RIAA performed well in the conducted simulations. As
shown in Figure 2(a), a periodic signal (solid line) with
amplitude = 1 and frequency = 0.4 is sampled
using the bio-like sampling strategy, which applies 16 time
points in (0, 8] and 8 more time points in (8, 16]. Gaussian
noise with parameters $\mu = 0$ and $\sigma = 0.5$ is assumed
during microarray experiments. The resulting time-course
expression levels (dots) at a total of 24 time points, together with
the sampling time information, were treated as inputs to
the RIAA algorithm. Figure 2(b) demonstrates the result
of periodogram estimation. In this example, the grid size
was chosen to be 0.065, and a total of 11 amplitudes
corresponding to different frequencies were obtained and
shown in the spectrum. Using Fisher's test, the peak at the
third grid point (frequency = 0.195) was found to be significantly
large ($p$-value $= 2.4 \times 10^{-3}$), and hence a periodic gene was
claimed.
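The significance of the largest spectral peak, as in the periodogram example of Figure 2(b), can be illustrated with the classical Fisher g-test. This is a hedged sketch: it assumes the input is a list of nonnegative spectral power values on the frequency grid and uses the textbook exact p-value formula, which may differ in detail from the procedure the authors applied to the RIAA spectrum. The name `fisher_g_test` is hypothetical.

```python
from math import comb, floor


def fisher_g_test(spectrum):
    """Return (g, p_value) for the largest peak of a power spectrum.

    g is the largest ordinate divided by the sum of all ordinates.
    The p-value uses Fisher's exact formula for m ordinates:
        p = sum_{j=1}^{floor(1/g)} (-1)^(j-1) * C(m, j) * (1 - j*g)^(m-1)
    """
    m = len(spectrum)
    g = max(spectrum) / sum(spectrum)
    upper = min(m, floor(1.0 / g))
    p = sum(
        (-1) ** (j - 1) * comb(m, j) * (1.0 - j * g) ** (m - 1)
        for j in range(1, upper + 1)
    )
    # Clamp to [0, 1] to absorb floating-point round-off in the alternating sum.
    return g, min(1.0, max(0.0, p))
```

A spectrum dominated by one grid point gives g close to 1 and a small p-value, whereas a nearly flat spectrum gives a p-value close to 1.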
ROC curves clearly illustrate the performance of RIAA.
In Figures 3 and 4, subplots (a)-(b), (c)-(d), (e)-(f), and (g)-(h) refer to the simulations with regular, bio-like, binomially random, and exponentially random sampling strategies,
respectively. Additionally, in the left-hand side subplots (a),
(c), (e), and (g), nonperiodic signals were simply Gaussian
noise with parameters $\mu = 0$ and $\sigma = 0.5$, while in the
right-hand side subplots (b), (d), (f), and (h), nonperiodic
signals involve not only the Gaussian noise but also a
transcriptional burst and a sudden drop (27). Periodic signals
were generated using (25) with amplitude = 1, $\lambda = 2$, and
$N = 24$. The only difference in simulation settings between
Figures 3 and 4 is the frequency of the periodic signals, which is
0.4 and 0.1, respectively. As shown in these figures,
LS and DLS can perform as well as RIAA when the time-course
data are regularly sampled or mildly irregularly sampled;
however, when data are highly irregularly sampled, RIAA
outperforms the others. The superiority of RIAA over DLS
is particularly clear when the signal frequency is small.
Figure 5 illustrates the results of the real data analysis
when the three algorithms, namely, RIAA, LS, and
DLS, were applied. On the x-axis, the numbers indicate the
thresholds, that is, the numbers of genes that we preserved and classified as periodic
among all yeast genes; on the y-axis, the numbers refer
to the intersection of the preserved genes and the proposed
periodic candidates listed in the benchmark sets. Figures
5(a)–5(c) demonstrate the results derived from dataset alpha
when the 113-gene, 352-gene, and 518-gene benchmark sets
were applied, respectively.
Similarly, Figures 5(d)–5(f) demonstrate the results derived
from dataset alpha 38. RIAA does not produce significantly
different numbers of intersections compared with
LS and DLS in most of these
cases. However, RIAA shows slightly better coverage when
dataset alpha 38 and the 113-gene benchmark set were
utilized (Figure 5(d)).
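The curves in Figure 5 amount to counting, for each threshold, how many of the preserved genes appear in a benchmark set. A minimal sketch, assuming the genes have already been ranked from most to least significant; the name `benchmark_overlap` and the gene labels in the usage note are hypothetical:

```python
def benchmark_overlap(ranked_genes, benchmark, thresholds):
    """Count benchmark genes among the top-k ranked genes for each threshold k."""
    benchmark = set(benchmark)
    return [len(benchmark.intersection(ranked_genes[:k])) for k in thresholds]
```

For example, `benchmark_overlap(["g1", "g2", "g3", "g4"], {"g2", "g4"}, [1, 2, 4])` counts the benchmark genes among the top 1, 2, and 4 preserved genes.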

5. Conclusions
In this study, rigorous simulations specifically designed
to conform with real experiments reveal that RIAA can
outperform the classical LS and modified DLS algorithms
when the sampling time points are highly irregular and when
the number of cycles covered by the sampling times is very
limited. These characteristics, as also claimed in the original
study by Stoica et al. [12], suggest that RIAA can be
generally applied to detect periodicities in time-course gene
expression data with good potential to yield better results. A
supplementary simulation further shows the superiority of
RIAA over LS and DLS when multiple periodic signals are
considered (see Supplementary Figure S1 available online at
http://dx.doi.org/10.1155/2013/171530). From the simulations,
we also learned that the addition of a transcriptional burst and
a sudden drop to nonperiodic signals (the negatives) does not
affect the power of RIAA in terms of periodicity detection.
Moreover, the detrend function in DLS, designed to improve
LS by removing the linearity in time-course data, may fail to
provide improved accuracy and makes the algorithm unable
to detect periodicities when transcription oscillates with a
very low frequency.
The intersection of detected candidates and proposed
periodic genes in the real data analysis (Figure 5) does not
reveal much difference among RIAA, LS, and DLS. One
possible reason is that the sampling time points used
in the yeast experiments are not highly irregular (not many
missing values are included); as demonstrated in Figures 3(a)–3(d), RIAA performs only as well as the LS
and DLS algorithms when the time-course data are regularly
or mildly irregularly sampled. Also, the very limited number of time
points contained in the datasets may bias the estimation
of $p$-values [14] and thus hinder RIAA from exhibiting
its advantage. In addition, the number of true cell cycle genes
included in the benchmark sets remains uncertain. We expect
that the superiority of RIAA in real data analysis will become
clearer in the future when more studies and more datasets
become available.
Besides the comparison of these algorithms, it is interesting to note that the bio-like sampling strategy could lead
to better detection of periodicities than the regular sampling
strategy (as shown in Figures 3(c) and 3(d)). It might be
beneficial to apply wider sampling time intervals at later
periods to prolong the experimental time coverage when the
number of time points is limited.

Acknowledgments
The authors would like to thank the members of the Genomic
Signal Processing Laboratory, Texas A&M University, for
the helpful discussions and valuable feedback. This work
was supported by the National Science Foundation under
Grant no. 0915444. The RIAA MATLAB code is available at
http://gsp.tamu.edu/Publications/supplementary/agyepong12a/.

References
[1] W. Zhao, K. Agyepong, E. Serpedin, and E. R. Dougherty, "Detecting periodic genes from irregularly sampled gene expressions: a comparison study," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2008, Article ID 769293, 2008.
[2] P. T. Spellman, G. Sherlock, M. Q. Zhang et al., "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[3] G. Rustici, J. Mata, K. Kivinen et al., "Periodic gene expression program of the fission yeast cell cycle," Nature Genetics, vol. 36, no. 8, pp. 809–817, 2004.
[4] M. Menges, L. Hennig, W. Gruissem, and J. A. H. Murray, "Cell cycle-regulated gene expression in Arabidopsis," Journal of Biological Chemistry, vol. 277, no. 44, pp. 41987–42002, 2002.
[5] M. Ahdesmaki, H. Lahdesmaki, R. Pearson, H. Huttunen, and O. Yli-Harja, "Robust detection of periodic time series measured from biological systems," BMC Bioinformatics, vol. 6, article 117, 2005.
[6] M. Ahdesmaki, H. Lahdesmaki, A. Gracey et al., "Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data," BMC Bioinformatics, vol. 8, article 233, 2007.
[7] E. F. Glynn, J. Chen, and A. R. Mushegian, "Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms," Bioinformatics, vol. 22, no. 3, pp. 310–316, 2006.
[8] R. Yang, C. Zhang, and Z. Su, "LSPR: an integrated periodicity detection algorithm for unevenly sampled temporal microarray data," Bioinformatics, vol. 27, no. 7, pp. 1023–1025, 2011.
[9] E. R. Dougherty, "Small sample issues for microarray-based classification," Comparative and Functional Genomics, vol. 2, no. 1, pp. 28–34, 2001.
[10] Y. Tu, G. Stolovitzky, and U. Klein, "Quantitative noise analysis for gene expression microarray experiments," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 22, pp. 14031–14036, 2002.
[11] Z. Bar-Joseph, "Analyzing time series gene expression data," Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004.
[12] P. Stoica, J. Li, and H. He, "Spectral analysis of nonuniformly sampled data: a new approach versus the periodogram," IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 843–858, 2009.
[13] J. Fan and Q. Yao, Nonlinear Time Series: Nonparametric and Parametric Methods, Springer, New York, NY, USA, 2003.
[14] A. W. C. Liew, N. F. Law, X. Q. Cao, and H. Yan, "Statistical power of Fisher test for the detection of short periodic gene expression profiles," Pattern Recognition, vol. 42, no. 4, pp. 549–556, 2009.
[15] V. Berger, "Pros and cons of permutation tests in clinical trials," Statistics in Medicine, vol. 19, no. 10, pp. 1319–1328, 2000.
[16] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[17] J. R. Chubb, T. Trcek, S. M. Shenoy, and R. H. Singer, "Transcriptional pulsing of a developmental gene," Current Biology, vol. 16, no. 10, pp. 1018–1025, 2006.
[18] T. Pramila, W. Wu, W. Noble, and L. Breeden, "Periodic genes of the yeast Saccharomyces cerevisiae: a combined analysis of five cell cycle data sets," 2007.
[19] U. Lichtenberg, L. J. Jensen, A. Fausbøll, T. S. Jensen, P. Bork, and S. Brunak, "Comparison of computational methods for the identification of cell cycle-regulated genes," Bioinformatics, vol. 21, no. 7, pp. 1164–1171, 2005.
[20] A. W. C. Liew, J. Xian, S. Wu, D. Smith, and H. Yan, "Spectral estimation in unevenly sampled space of periodically expressed microarray time series data," BMC Bioinformatics, vol. 8, article 137, 2007.
[21] D. Johansson, P. Lindgren, and A. Berglund, "A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription," Bioinformatics, vol. 19, no. 4, pp. 467–473, 2003.
[22] I. Simon, J. Barnett, N. Hannett et al., "Serial regulation of transcriptional regulators in the yeast cell cycle," Cell, vol. 106, no. 6, pp. 697–708, 2001.
[23] T. I. Lee, N. J. Rinaldi, F. Robert et al., "Transcriptional regulatory networks in Saccharomyces cerevisiae," Science, vol. 298, no. 5594, pp. 799–804, 2002.
[24] H. W. Mewes, D. Frishman, U. Guldener et al., "MIPS: a database for genomes and protein sequences," Nucleic Acids Research, vol. 30, no. 1, pp. 31–34, 2002.


Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Identification of Robust Pathway Markers for Cancer through
Rank-Based Pathway Activity Inference
Navadon Khunlertgit and Byung-Jun Yoon
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
Correspondence should be addressed to Byung-Jun Yoon; bjyoon@ece.tamu.edu
Received 30 November 2012; Accepted 19 January 2013
Academic Editor: Hazem Nounou
Copyright © 2013 N. Khunlertgit and B.-J. Yoon. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
One important problem in translational genomics is the identification of reliable and reproducible markers that can be used to
discriminate between different classes of a complex disease, such as cancer. The typical small sample setting makes the prediction
of such markers very challenging, and various approaches have been proposed to address this problem. For example, it has been
shown that pathway markers, which aggregate the gene activities in the same pathway, tend to be more robust than gene markers.
Furthermore, the use of gene expression ranking has been demonstrated to be robust to batch effects and to lead to more
interpretable results. In this paper, we propose an enhanced pathway activity inference method that uses gene ranking to predict the
pathway activity in a probabilistic manner. The main focus of this work is on identifying robust pathway markers that can ultimately
lead to robust classifiers with reproducible performance across datasets. Simulation results based on multiple breast cancer datasets
show that the proposed inference method identifies better pathway markers that can predict breast cancer metastasis with higher
accuracy. Moreover, the identified pathway markers can lead to better classifiers with more consistent classification performance
across independent datasets.

1. Introduction
Advances in microarray and sequencing technologies have
enabled the measurement of genome-wide expression profiles, which has spawned a large number of studies aiming
to make accurate diagnosis and prognosis based on gene
expression profiles [1–4]. For example, there has been a significant amount of work on identifying markers and building
classifiers that can be used to predict breast cancer metastasis
[2, 4]. Many existing methods have directly employed gene
expression data without any knowledge of the interrelations
between genes. As a result, the predicted gene markers often
lack interpretability and many of them are not reproducible
in other independent datasets.
To overcome this problem, several different approaches
have been proposed so far. For example, a recent work by
Geman et al. [3] proposed an approach that utilizes the
relative expression between genes, rather than their absolute
expression values. It was shown that the resulting markers
are easier to interpret, robust to chip-to-chip variations,
and more reproducible across datasets. Another possible
way to address the aforementioned problem is to interpret
the gene expression data at a modular level through
data integration [5–11]. These methods utilize additional
data sources and prior knowledge, such as protein-protein
interaction (PPI) data and pathway knowledge, to jointly
analyze the expression of interrelated genes. This results in
modular markers, such as pathway markers and subnetwork
markers, which have been shown to improve the classification
performance and also to be more reproducible across independent datasets [8–11]. In order to utilize pathway markers,
we need to infer the pathway activity by integrating the
gene expression data with pathway knowledge. For example,
Guo et al. [6] used the mean or median expression value
of the member genes (that belong to the same pathway)
as the activity level of a given pathway. Recently, Su et al.
[10] proposed a probabilistic pathway activity inference
method that uses the log-likelihood ratio between different
phenotypes based on the expression level of each member
gene.
In this work, we propose an enhanced pathway activity
inference method that utilizes the ranking of the member
genes to predict the pathway activity in a probabilistic
manner. The immediate goal is to identify better pathway
markers that are more reliable, more reproducible, and easier
to interpret. Ultimately, we aim to utilize these markers to
build accurate and robust disease classifiers. The proposed
method is motivated by the relative gene expression analysis
strategy proposed in [3, 12], and it builds on the concept of
probabilistic pathway activity inference proposed in [10, 11].
In this study, we focus on predicting breast cancer metastasis
and demonstrate that the proposed method outperforms
existing methods. Preliminary results of this work were
originally presented in [13].

2. Materials and Methods


2.1. Study Datasets. Six independent breast cancer microarray gene expression datasets have been used in this
study: GSE2034 (USA) [4], NKI295 (The Netherlands)
[14], GSE7390 (Belgium) [15], GSE1456 (Stockholm) [16],
GSE15852, and GSE9574. The Netherlands dataset uses a
custom Agilent chip and it has been obtained from the Stanford website [17]. All other datasets have been profiled using the
Affymetrix U133a platform and they have been downloaded
from the Gene Expression Omnibus (GEO) website [18].
The above datasets have been used in our study both with
and without (re)normalization. To test the reproducibility
of pathway markers, we selected the USA dataset and the
Belgium dataset, both of which were obtained using the
Affymetrix platform. The raw data for these two datasets
have been normalized by utilizing the microarray preprocessing methods provided in the Bioconductor package [19].
We applied three popular normalization methods (RMA,
GCRMA, and MAS5) with default settings.
The pathway data have been obtained from the MSigDB
3.0 Canonical Pathways [20]. This pathway dataset consists of
880 pathways, where 3,698 genes in these pathways intersect
with all datasets.
2.2. Gene Ranking. In this study, we utilize gene ranking, or
the relative ordering of the genes based on their expression
levels within each profile [3]. Consider a pathway that
contains $n$ member genes $G = \{g_1, g_2, \ldots, g_n\}$ after removing
the genes that are not included in all datasets. Given a sample
$\mathbf{x}^k = \{x_1^k, x_2^k, \ldots, x_n^k\}$ that contains the expression levels of the
member genes, the gene ranking $\mathbf{r}^k$ is defined as follows:

$$\mathbf{r}^k = \{ r_{i,j}^k \mid 1 \le i < j \le n \}, \tag{1}$$

where

$$r_{i,j}^k = \begin{cases} 1, & \text{if } x_i^k < x_j^k, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

The resulting gene ranking $\mathbf{r}^k$ is a binary vector representing
the ordering of the member genes based on their expression
values in the $k$th sample $\mathbf{x}^k$. To preserve the gene ranking
in each sample, we do not employ any between-sample
normalization.

2.3. Pathway Activity Inference Based on Gene Ranking. To
infer the pathway activity, we follow the strategy proposed in
[10], where the activity level of a given pathway in the $k$th
sample is predicted by aggregating the probabilistic evidence
of all the member genes. The main difference between the
strategy proposed in this work and the original strategy [10] is
that we estimate the probabilistic evidence provided by each
gene based on its ranking rather than its expression value.
More specifically, the pathway activity level $a^k$ is given by

$$a^k = \sum_{1 \le i < j \le n} \lambda_{i,j}(r_{i,j}^k), \tag{3}$$

where $\lambda_{i,j}(r_{i,j}^k)$ is the log-likelihood ratio (LLR) between the
two phenotypes (i.e., class labels) for the ranking $r_{i,j}^k$. The LLR
$\lambda_{i,j}(r_{i,j}^k)$ is defined as

$$\lambda_{i,j}(r_{i,j}^k) = \log\left[\frac{f_{i,j}^1(r_{i,j}^k)}{f_{i,j}^2(r_{i,j}^k)}\right], \tag{4}$$

where $f_{i,j}^1(r)$ is the conditional probability mass function
(PMF) of the ranking of the expression level of gene $g_i$ and
gene $g_j$ under phenotype 1 and $f_{i,j}^2(r)$ is the conditional PMF
of the ranking of the expression level of gene $g_i$ and gene $g_j$
under phenotype 2.
In practice, the number of possible gene pairs $\binom{n}{2}$ may be
too large when we have large pathways with many member
genes (i.e., when $n$ is large). To reduce the computational
complexity, we prescreen the gene pairs based on the mutual
information [21] as follows. For every gene pair $(g_i, g_j)$, we first
compute the mutual information between the ranking $r_{i,j}$
and the corresponding phenotype. Then we select the top
10% gene pairs with the highest mutual information and use
only these gene pairs for computing the pathway activity level
defined in (3). Although we selected the top 10% gene pairs
for simplicity, this may not necessarily be optimal, and one
may also think of other strategies for adaptively choosing this
threshold.
In a practical setting, we may not have enough training
data to reliably estimate the PMFs $f_{i,j}^1(r)$ and $f_{i,j}^2(r)$. For this
reason, we normalize the original LLR $\lambda_{i,j}(r_{i,j}^k)$ as follows to
decrease its sensitivity to small alterations in gene ranking:

$$\hat{\lambda}_{i,j}(r_{i,j}^k) = \frac{\lambda_{i,j}(r_{i,j}^k) - \mu(\lambda_{i,j})}{\sigma(\lambda_{i,j})}, \tag{5}$$

where $\mu(\lambda_{i,j})$ and $\sigma(\lambda_{i,j})$ are the mean and standard deviation
of $\lambda_{i,j}(r_{i,j}^k)$ across all $k = 1, \ldots, K$. Figure 1 illustrates the
overall process.
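The rank-based inference in (1)-(5) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, phenotype labels are assumed to be the integers 1 and 2, the PMFs are estimated by counting with a pseudocount (added here only to keep the LLR finite), the standard deviation is the population form, the mutual-information prescreening is omitted, and, following the caption of Figure 1, the activity aggregates the normalized LLRs.

```python
from itertools import combinations
from math import log, sqrt


def gene_ranking(x):
    """Binary ranking: r[(i, j)] = 1 if x[i] < x[j], else 0, for all i < j."""
    return {(i, j): int(x[i] < x[j]) for i, j in combinations(range(len(x)), 2)}


def fit_llr(samples, labels, pseudocount=1.0):
    """Estimate per-pair LLR tables and their normalization statistics.

    Returns (llr, stats) where llr[pair][r] = log(f1(r) / f2(r)) and
    stats[pair] = (mean, std) of the LLR over the training samples.
    """
    rankings = [gene_ranking(x) for x in samples]
    llr, stats = {}, {}
    for pair in rankings[0]:
        # counts[c][r]: how often ranking value r occurs under phenotype c.
        counts = {1: [pseudocount, pseudocount], 2: [pseudocount, pseudocount]}
        for r, c in zip(rankings, labels):
            counts[c][r[pair]] += 1.0
        f1 = [v / sum(counts[1]) for v in counts[1]]
        f2 = [v / sum(counts[2]) for v in counts[2]]
        llr[pair] = [log(f1[b] / f2[b]) for b in (0, 1)]
        # Mean and standard deviation of this pair's LLR across samples.
        values = [llr[pair][r[pair]] for r in rankings]
        mean = sum(values) / len(values)
        std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        stats[pair] = (mean, std if std > 0 else 1.0)
    return llr, stats


def pathway_activity(x, llr, stats):
    """Sum of normalized LLRs over all gene pairs for one sample."""
    r = gene_ranking(x)
    return sum((llr[p][r[p]] - stats[p][0]) / stats[p][1] for p in llr)
```

Because only the within-sample ordering enters `gene_ranking`, any monotone rescaling of a sample leaves its inferred activity unchanged.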

Figure 1: Probabilistic inference of rank-based pathway activity. For a given pathway, we first compute the ranking of the member genes for
each individual sample in the dataset. Then we estimate the conditional probability mass function (PMF) of the gene ranking under each
phenotype. Next, we transform the gene ranking into log-likelihood ratios (LLRs) based on the estimated PMFs and normalize the LLR
matrix. Finally, the pathway activity level is inferred by aggregating the normalized LLRs of the member genes.
[Panels: member gene expression matrix, gene ranking, member gene ranking matrix, LLR, LLR matrix, normalization, pathway activity level.]

2.4. Assessing the Discriminative Power of Pathway Markers.
In order to assess the discriminative power of a pathway
marker, we compute the $t$-test statistics score, which is given
by

$$t(\mathbf{a}) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / K_1 + \sigma_2^2 / K_2}}, \tag{6}$$

where $\mathbf{a} = \{a^k\}$ is the set of inferred pathway activity
levels for a given pathway, $\mu_\ell$ and $\sigma_\ell$ represent the mean
and the standard deviation of the pathway activity levels
for samples with phenotype $\ell \in \{1, 2\}$, respectively, and
$K_\ell$ represents the number of samples in the dataset with
phenotype $\ell$. This measure has been widely used in previous
studies to evaluate the performance of pathway markers
[9, 10].
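The $t$-test statistics score of a pathway marker can be computed directly from the inferred activity levels and the phenotype labels. The sketch below assumes labels 1 and 2 and uses the sample variance (divisor $K_\ell - 1$), which is an assumption since the text does not state the divisor; the name `t_score` is hypothetical.

```python
from math import sqrt


def t_score(activity, labels):
    """Two-sample t-score of activity levels between phenotypes 1 and 2."""
    groups = {1: [], 2: []}
    for a, c in zip(activity, labels):
        groups[c].append(a)

    def mean(values):
        return sum(values) / len(values)

    def variance(values):
        # Sample variance; the divisor (len - 1) is an assumption of this sketch.
        m = mean(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    a1, a2 = groups[1], groups[2]
    pooled = variance(a1) / len(a1) + variance(a2) / len(a2)
    return (mean(a1) - mean(a2)) / sqrt(pooled)
```

Markers are then compared by the absolute value of this score, so the sign (which phenotype has the higher activity) does not affect the ranking.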
2.5. Evaluation of the Classification Performance. In order
to evaluate the classification performance, we use the AUC
(area under the ROC curve). Many previous studies [8–11] have
utilized the AUC due to its ability to summarize the efficacy of
a classification method over the entire range of specificity
and sensitivity. We compute the AUC based on the method
proposed in [22]. Given a classifier, let $x_1, x_2, \ldots, x_m$ be
the output of the classifier for $m$ positive samples and let
$y_1, y_2, \ldots, y_n$ be the output for $n$ negative samples. The AUC
of the classifier can be computed as follows:

$$A = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I(x_i > y_j), \tag{7}$$

where

$$I(x_i > y_j) = \begin{cases} 1, & \text{if } x_i > y_j, \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$
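The pairwise form of the AUC in (7) and (8) translates directly into code. A minimal sketch, with the hypothetical name `auc`; ties contribute zero, as the strict inequality in (8) implies.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = sum(1 for x in pos_scores for y in neg_scores if x > y)
    return wins / (len(pos_scores) * len(neg_scores))
```

This direct double loop costs on the order of m times n comparisons, which is adequate for datasets of a few hundred samples.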

3. Results and Discussion


3.1. Discriminative Power of the Pathway Markers Using the
Proposed Method. In order to assess the performance of the
rank-based pathway activity inference method proposed in
this paper, we first evaluated the discriminative power of the
pathway markers following a setup similar to that adopted
in a number of previous studies [9, 10]. For comparison, we
also evaluated the performance of the mean- and median-based schemes proposed in [6] and the original probabilistic
pathway activity inference method (we refer to this method
as the LLR method for simplicity) presented in [10]. As
explained in Materials and Methods, the discriminative
power of a pathway marker was measured based on the
absolute $t$-test score of the inferred pathway activity level.
Then the pathway markers were sorted according to their $t$-score, in descending order.
Figure 2 shows the discriminative power of the pathway
markers on the six datasets using different activity inference
methods. On each dataset, we computed the mean absolute $t$-test statistics score of the top $P\%$ pathways for each of the four
pathway activity inference methods. The x-axis corresponds
to the proportion ($P\%$) of the top pathway markers that were
considered and the y-axis shows the mean absolute $t$-test
score for these pathway markers. As we can see from Figure 2,
the proposed method clearly improves the discriminative
power of the pathway markers on all six datasets that we
considered in this study. In order to investigate the effect of
normalization on the discriminative power of the pathway
activity inference methods, we repeated this experiment
using the USA and the Belgium datasets, where we first
normalized the raw data using three different normalization
methods (RMA, GCRMA, and MAS5) and then evaluated the
discriminative power of the pathway markers. The results are
summarized in Figure S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/2013/618461), where
we can see that the proposed rank-based scheme is not very
sensitive to the choice of the normalization method and
performs consistently well in all cases.
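Each point on the curves in Figure 2 is the mean absolute $t$-score over a top fraction of the pathway markers. A hedged sketch of that summary follows; the name `mean_top_abs_score` is hypothetical, and rounding the number of retained markers up to the next integer is an assumption.

```python
from math import ceil


def mean_top_abs_score(scores, percent):
    """Mean absolute score of the top `percent`% markers, ranked by |score|."""
    ranked = sorted((abs(s) for s in scores), reverse=True)
    # Keep at least one marker; round the cutoff up (assumption of this sketch).
    k = max(1, ceil(len(ranked) * percent / 100.0))
    return sum(ranked[:k]) / k
```

Evaluating the function at 10%, 20%, and so on up to 100% gives one curve per inference method.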
Next, we investigated how the top pathway markers
identified on a specific dataset perform in other independent
datasets. We first ranked the pathway markers based on their
mean absolute $t$-test statistics score in one of the datasets
and then estimated the discriminative power of the top $P\%$
markers on a different dataset. These results are shown in
Figure 3, where the first dataset is used for ranking the
markers and the second dataset is used for assessing the
discriminative power. As we can see from Figure 3, the
pathway markers identified using the mean- and the median-based schemes do not retain their discriminative power
very well in other datasets. Both the LLR method [10] and
the proposed rank-based inference method perform well
across different datasets, where the proposed method clearly
outperforms the previous LLR method. It is interesting to
see that the discriminative power of the markers is retained
even when we consider datasets that are obtained using
different platforms. For example, the USA and Belgium datasets are
profiled on the U133a platform and The Netherlands dataset
is profiled on a custom Agilent chip, but Figure 3 shows
that pathway markers identified using the proposed method
retain their discriminative power across these datasets. As
before, we repeated these experiments after normalizing the
datasets using different normalization methods. The results
are depicted in Figure S2, where we can see that the proposed
method works very well, regardless of the normalization
method that was used. Interestingly, this is also true even
when the first dataset and the second dataset are normalized
using different methods, as shown in Figures S3 and S4.
Another interesting observation is that the rank-based
method can overcome one of the limitations of the previous
LLR method. For example, normalizing the Belgium
dataset using GCRMA makes the LLR method fail, as
some of the genes lose variability and some of the LLR values
become infinite. We can see this issue in Figures S1(d), S2(c),
S3(a), and S3(f). However, this limitation is easily overcome
by the proposed method through the use of gene ranking and
the preselection of informative gene pairs based on mutual
information.
3.2. Classification Performance of the Pathway Markers Using
the Proposed Method. Next, we evaluated the classification
performance of the proposed rank-based pathway activity
inference method. For this purpose, we performed five-fold
cross-validation experiments, following a setup similar to that used
in previous studies [8–11]. We first performed the within-dataset experiments for each of the six datasets. First, a given
dataset was randomly divided into five folds, where four folds
(training dataset) were used for constructing an LDA (linear
discriminant analysis) classifier and the remaining fold
(testing dataset) was used for evaluating its performance.
To construct the classifier, the training dataset was again
divided into three folds, where two folds (marker-evaluation
dataset) were used for evaluating the pathway markers and
the remaining fold (feature-selection dataset) was used for feature
selection. The entire training dataset was used for PDF/PMF
estimation. The overall setup is shown in Figure 4(a).
In order to build the classifier, we first evaluated the discriminative power of each pathway on the marker-evaluation
dataset. The pathways were sorted according to their absolute
$t$-test statistics score in descending order and the top 50
pathways were selected as potential features. Initially, we
started with an LDA-based classifier with a single feature
(i.e., the pathway marker at the top of the list) and
continued to expand the feature set by considering additional pathway markers in the list. The classifier was trained
using the marker-evaluation dataset and its performance was
assessed on the feature-selection dataset by measuring the
AUC. Pathway markers were added to the feature set only
when they increased the AUC. Finally, the performance of
the classifier with the optimal feature set was evaluated by
computing the AUC on the testing dataset. The above process
was repeated for 100 random partitions to ensure reliable
results, and we report the average AUC as the measure of
overall classification performance.
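The feature-selection loop described above is a greedy forward selection. The sketch below shows only that loop and assumes a caller-supplied function `evaluate_auc` (hypothetical) that trains a classifier, for example LDA, on the marker-evaluation dataset with the given markers and returns its AUC on the feature-selection dataset.

```python
def greedy_select(ranked_markers, evaluate_auc):
    """Forward selection over markers sorted by absolute t-score.

    Starts from the top marker and keeps each further marker only if it
    increases the AUC returned by evaluate_auc(list_of_markers).
    """
    selected = [ranked_markers[0]]
    best = evaluate_auc(selected)
    for marker in ranked_markers[1:]:
        score = evaluate_auc(selected + [marker])
        if score > best:
            selected.append(marker)
            best = score
    return selected, best
```

In the setup described above, `ranked_markers` would hold the top 50 pathways, and the returned feature set would then be evaluated once on the held-out testing fold.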
Figure 5 shows how the respective classifiers that use
different pathway activity inference methods perform on
different datasets. As we can see in Figure 5, among the
four inference methods, the proposed rank-based scheme
typically yields the best average performance across these
datasets. We also performed similar experiments based on
the USA and the Belgium datasets after normalizing the raw
data using different normalization methods. These results are
summarized in Figure S5. We can see from Figure S5 that
the proposed method yields the best performance on the
USA dataset for all three normalization methods. On the
Belgium dataset, the proposed method yields good, consistent
performance that is not very sensitive to the normalization
method.

Figure 2: Discriminative power of pathway markers. We computed the mean absolute t-score of the top-ranked markers (top 10% through 100%) for each dataset without
any further normalization. [Six panels, (a) to (f), one per dataset: USA, The Netherlands, Belgium, GSE1456, GSE15852, and GSE9574. Horizontal axis: top 10% to 100% of the markers; vertical axis: average absolute t-score. Legend: Proposed, LLR, Mean, Median.]

3.3. Reproducibility of the Pathway Markers Identified by the
Proposed Method. To assess the reproducibility of the pathway
markers, we performed the following cross-dataset
experiments based on a similar setup that has been utilized
in previous studies [8–11]. In this experiment, we used one
of the breast cancer datasets for selecting the best pathway
markers (i.e., only for feature selection) and a different dataset
for building the classifier (using the selected pathways) and
evaluating the performance of the resulting classifier. More
specifically, we proceeded as follows. The first dataset was
first divided into three folds, where two folds were used for
marker evaluation and the remaining fold was used for
feature selection. The second dataset was randomly divided
into five folds, where four folds were used to train the LDA
classifier, using the features selected from the first dataset,
and the remaining fold was used to evaluate the classification
performance. The overall setup is shown in Figure 4(b).
To obtain reliable results, we repeated this experiment for
100 random partitions (of the second dataset) and report
the average AUC as the performance metric. For these
experiments, we used the three largest breast cancer datasets
(USA, The Netherlands, and Belgium) among the six.

Figure 3: Discriminative power of pathway markers across different datasets. The pathway markers have been ranked and sorted using the
first dataset, and their discriminative power has been reevaluated using the second dataset. As before, the mean absolute t-score was used for
assessing the discriminative power. [Six panels, (a) to (f), one per dataset pair: USA-The Netherlands, USA-Belgium, The Netherlands-USA, The Netherlands-Belgium, Belgium-USA, and Belgium-The Netherlands. Horizontal axis: top 10% to 100% of the markers; vertical axis: average absolute t-score. Legend: Proposed, LLR, Mean, Median.]

The results of the cross-dataset classification experiments
are shown in Figure 6. As we can see from this figure, the
proposed rank-based inference scheme typically outperforms
other methods in terms of reproducibility. Furthermore, we
can also observe that the proposed method yields consistent
classification performance across experiments, while the
performance of other inference methods is much more sensitive
to the choice of the dataset. Next, we repeated the cross-dataset
classification experiments based on the USA and the
Belgium datasets after normalizing the raw data using RMA,
GCRMA, and MAS5. As shown in Figure 7, the proposed
method yields consistently good performance, regardless of
the normalization method that was used.
Finally, we performed additional cross-dataset experiments
after normalizing the USA and the Belgium datasets
using different normalization methods. These results are
summarized in Figures S6 and S7. We can see that the
proposed pathway activity inference scheme is relatively
robust to normalization mismatch. Moreover, these results
also show that the proposed scheme overcomes the problem
of the previous LLR-based scheme [10] when used with
GCRMA (see Figures 7, S6, and S7).

Figure 4: Experimental setup for evaluating the classification performance. (a) The setup for the within-dataset experiment. (b) The setup
for the cross-dataset experiment. [Flowchart. In (a), the training set of a single dataset is used for marker evaluation (rank the pathways by the discriminative power of their activity levels), for feature selection (select features using sequential forward selection to maximize the AUC), and for building the classifier using LDA; the testing set is used to evaluate the classifier. In (b), Dataset 1 is used for marker evaluation and feature selection, the training set of Dataset 2 is used to build the classifier using LDA, and the testing set of Dataset 2 is used to evaluate the classifier.]

Figure 5: Classification performance for within-dataset experiments. The bars show the classification performance (average AUC)
of different pathway activity inference methods evaluated on various
breast cancer datasets. [Bar chart over the six datasets (USA, The Netherlands, Belgium, GSE1456, GSE15852, and GSE9574); vertical axis: AUC from 0.5 to 1. Legend: Proposed, LLR, Mean, Median.]


Figure 6: Classification performance for cross-dataset experiments.
The bars show the cross-dataset classification performance (average
AUC) of different pathway activity inference methods. The first
dataset was used for selecting the pathway markers and the second
dataset was used for training and evaluation of the classifier. The
three largest breast cancer datasets were used: USA (U), The
Netherlands (N), and Belgium (B). [Bar chart over the dataset pairs U-N, U-B, N-U, N-B, B-U, and B-N; vertical axis: AUC from 0.5 to 0.75. Legend: Proposed, LLR, Mean, Median.]

4. Conclusions

In this work, we proposed an improved pathway activity
inference scheme, which can be used for finding more robust
and reproducible pathway markers for predicting breast cancer
metastasis. The proposed method integrates two effective
strategies that have been recently proposed in the field:
namely, the probabilistic pathway activity inference method
[10] and the ranking-based relative gene expression analysis
approach [3]. Experimental results based on several breast
cancer gene expression datasets show that our proposed
inference method identifies better pathway markers that
have higher discriminative power, are more reproducible,
and can lead to better classifiers that yield more consistent
performance across independent datasets.

Figure 7: Classification performance for cross-dataset experiments.
We repeated the cross-dataset experiments based on the USA and
the Belgium datasets after normalizing the raw data using different
normalization methods. [Bar chart over U-B (RMA), U-B (GCRMA), U-B (MAS5), B-U (RMA), B-U (GCRMA), and B-U (MAS5); vertical axis: AUC from 0.5 to 0.75. Legend: Proposed, LLR, Mean, Median.]

Acknowledgment
N. Khunlertgit has been supported by a scholarship from the
Royal Thai Government.

References
[1] M. West, C. Blanchette, H. Dressman et al., "Predicting the
clinical status of human breast cancer by using gene expression
profiles," Proceedings of the National Academy of Sciences of the
United States of America, vol. 98, no. 20, pp. 11462–11467, 2001.
[2] L. J. Van't Veer, H. Dai, M. J. Van de Vijver et al., "Gene
expression profiling predicts clinical outcome of breast cancer,"
Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[3] D. Geman, C. D'Avignon, D. Q. Naiman, and R. L. Winslow,
"Classifying gene expression profiles from pairwise mRNA
comparisons," Statistical Applications in Genetics and Molecular
Biology, vol. 3, no. 1, article 19, 2004.
[4] Y. Wang, J. G. M. Klijn, Y. Zhang et al., "Gene-expression profiles
to predict distant metastasis of lymph-node-negative primary
breast cancer," The Lancet, vol. 365, no. 9460, pp. 671–679, 2005.
[5] L. Tian, S. A. Greenberg, S. W. Kong, J. Altschuler, I. S. Kohane,
and P. J. Park, "Discovering statistically significant pathways
in expression profiling studies," Proceedings of the National
Academy of Sciences of the United States of America, vol. 102, no.
38, pp. 13544–13549, 2005.
[6] Z. Guo, T. Zhang, X. Li et al., "Towards precise classification
of cancers based on robust gene functional expression profiles,"
BMC Bioinformatics, vol. 6, article 58, 2005.
[7] C. Auffray, "Protein subnetwork markers improve prediction of
cancer outcome," Molecular Systems Biology, vol. 3, article 141,
2007.
[8] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker,
"Network-based classification of breast cancer metastasis," Molecular
Systems Biology, vol. 3, article 140, 2007.
[9] E. Lee, H. Y. Chuang, J. W. Kim, T. Ideker, and D. Lee, "Inferring
pathway activity toward precise disease classification," PLoS
Computational Biology, vol. 4, no. 11, Article ID e1000217, 2008.
[10] J. Su, B. J. Yoon, and E. R. Dougherty, "Accurate and reliable
cancer classification based on probabilistic inference of pathway
activity," PLoS ONE, vol. 4, no. 12, Article ID e8161, 2009.
[11] J. Su, B. J. Yoon, and E. R. Dougherty, "Identification of
diagnostic subnetwork markers for cancer in human protein-protein
interaction network," BMC Bioinformatics, vol. 11, no. 6,
article 8, 2010.
[12] J. A. Eddy, L. Hood, N. D. Price, and D. Geman, "Identifying
tightly regulated and variably expressed networks by Differential
Rank Conservation (DIRAC)," PLoS Computational Biology,
vol. 6, no. 5, Article ID e1000792, 2010.
[13] N. Khunlertgit and B. J. Yoon, "Finding robust pathway markers
for cancer classification," in Proceedings of the IEEE International
Workshop on Genomic Signal Processing and Statistics
(GENSIPS '12), 2012.
[14] M. J. Van De Vijver, Y. D. He, L. J. Van't Veer et al., "A
gene-expression signature as a predictor of survival in breast cancer,"
New England Journal of Medicine, vol. 347, no. 25, pp. 1999–2009,
2002.
[15] C. Desmedt, F. Piette, S. Loi et al., "Strong time dependence
of the 76-gene prognostic signature for node-negative breast
cancer patients in the TRANSBIG multicenter independent
validation series," Clinical Cancer Research, vol. 13, no. 11, pp.
3207–3214, 2007.
[16] Y. Pawitan, J. Bjohle, L. Amler, and A. L. Borg, "Gene expression
profiling spares early breast cancer patients from adjuvant therapy:
derived and validated in two population-based cohorts,"
Breast Cancer Research, vol. 7, pp. R953–R964, 2005.
[17] H. Y. Chang, D. S. A. Nuyten, J. B. Sneddon et al., "Robustness,
scalability, and integration of a wound-response gene expression
signature in predicting breast cancer survival," Proceedings
of the National Academy of Sciences of the United States of
America, vol. 102, no. 10, pp. 3738–3743, 2005.
[18] R. Edgar, M. Domrachev, and A. E. Lash, "Gene Expression
Omnibus: NCBI gene expression and hybridization array data
repository," Nucleic Acids Research, vol. 30, no. 1, pp. 207–210,
2002.
[19] R. C. Gentleman, V. J. Carey, D. M. Bates et al., "Bioconductor:
open software development for computational biology and
bioinformatics," Genome Biology, vol. 5, no. 10, p. R80, 2004.
[20] A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsdóttir,
P. Tamayo, and J. P. Mesirov, "Molecular signatures database
(MSigDB) 3.0," Bioinformatics, vol. 27, no. 12, pp. 1739–1740,
2011.
[21] T. M. Cover and J. A. Thomas, Elements of Information Theory,
Wiley Interscience, New York, NY, USA, 2006.
[22] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition
Letters, vol. 27, no. 8, pp. 861–874, 2006.

Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Review Article
An Overview of the Statistical Methods Used for Inferring Gene
Regulatory Networks and Protein-Protein Interaction Networks
Amina Noor,1 Erchin Serpedin,1 Mohamed Nounou,2 Hazem Nounou,3
Nady Mohamed,4 and Lotfi Chouchane4
1 Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843-3128, USA
2 Chemical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City,
P.O. Box 23874, Doha, Qatar
3 Electrical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City,
P.O. Box 23874, Doha, Qatar
4 Department of Genetic Medicine, Weill Cornell Medical College in Qatar, P.O. Box 24144, Doha, Qatar

Correspondence should be addressed to Amina Noor; amina@neo.tamu.edu


Received 20 July 2012; Revised 12 January 2013; Accepted 17 January 2013
Academic Editor: Yufei Huang
Copyright © 2013 Amina Noor et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The large influx of data from high-throughput genomic and proteomic technologies has encouraged researchers to seek
approaches for understanding the structure of gene regulatory networks and proteomic networks. This work reviews some of the
most important statistical methods used for modeling gene regulatory networks (GRNs) and protein-protein interaction (PPI)
networks. The paper focuses on the recent advances in the statistical graphical modeling techniques, state-space representation
models, and information theoretic methods that were proposed for inferring the topology of GRNs. It appears that the problem
of inferring the structure of PPI networks is quite different from that of GRNs. Clustering and probabilistic graphical modeling
techniques are of prime importance in the statistical inference of PPI networks, and some of the recent approaches using these
techniques are also reviewed in this paper. Performance evaluation criteria for the approaches used for modeling GRNs and PPI
networks are also discussed.

1. Introduction
The postgenomic era is marked by the availability of a deluge
of genomic data and has, thus, enabled researchers to
look towards new dimensions for understanding the complex
biological processes governing the life of a living organism
[1–5]. The various life-sustaining functions are performed
via a collaborative effort involving DNA, RNA, and proteins.
Genes and proteins interact with themselves and each other
and orchestrate the successful completion of a multitude of
important tasks. Understanding how they work together to
form a cellular network in a living organism is extremely
important in the field of molecular biology. Two important
problems in this considerably nascent field of computational
biology are the inference of gene regulatory networks and
the inference of protein-protein interaction networks. This
paper first looks at how the genes and proteins interact with
themselves and then discusses the inference of an integrative
cellular network of genes and proteins combined.
Gene regulation is one of the many fascinating processes
taking place in a living organism whereby the expression and
repression of genes are controlled in a systematic manner.
With the help of the enzyme RNA polymerase, DNA is
transcribed into mRNA, which may or may not be translated into
proteins. It is found that in certain special cases mRNA is
reverse-transcribed to DNA. The processes of transcription
and translation are schematically represented in Figure 1,
where the interactions in black show the most general
framework and the interactions depicted in red occur less
frequently. Transcription factors (TFs), which are a class of
proteins, play the significant role of binding to the DNA
and thereby regulating transcription. Since the genes may
be coding for TFs and/or other proteins, a complex network
of genes and proteins is formed. The level of activity of a gene
is measured in terms of the amount of resulting functional
product, and is referred to as gene expression. The recent
high-throughput genomic technologies are able to measure
the gene expression values and have provided large-scale
data sets, which can be used to obtain insights into how the
gene networks are organized and operated. One of the most
encountered representations of gene regulatory networks is
in terms of a graph, where the genes are depicted by its nodes
and the edges represent the interactions between them.

Figure 1: Central dogma of molecular biology. [Schematic: gene to RNA (transcription) and RNA to protein (translation).]

The gene regulatory network (GRN) inference problem
consists in understanding the underlying system model
[6–10]. Simply stated, given the gene expression data, the
activation or repression actions by a set of genes on the other
genes need to be identified. There are several issues associated
with this problem, including the choice of models that
capture the gene interactions sufficiently well, followed by
robust and reliable inference algorithms that can be used to
derive decisive conclusions about the network. The inferred
networks vary in their sophistication depending on the extent
and accuracy of the prior knowledge available and the type of
models used in the process. It is also important that the gene
networks thus inferred should possess the highly desirable
quality of reproducibility in order to have a high degree of
confidence in them. A sufficiently accurate picture of gene
interactions could pave the way for significant breakthroughs
in finding cures for various genetic diseases including cancer.
Protein-protein interactions (PPIs) are of enormous
significance for the workings of a cell. Insights into the molecular
mechanism can be obtained by finding the protein
interactions with a high degree of accuracy [11, 12]. Protein
interaction networks do not consist of binary interactions
only; rather, in order to carry out various tasks, proteins work
together with cohorts to form protein complexes. It should be
emphasized that a particular protein may be a part of different
protein complexes, and hence the inference problem is much
more complicated. The existing high-throughput proteomic
data sets enable the inference of protein-protein interactions.
However, it is found that the protein-protein interactions
obtained by using different methods may not be equivalent,
indicating that a large number of false positives and negatives
are present in the data. Similar to the representation of gene
regulatory networks, protein-protein interaction networks
will also be modeled in terms of graphs, where the proteins
denote the nodes and the edges signify whether an interaction
is present between the adjacent nodes.

Many statistical methods have been applied extensively
to solve various bioinformatics problems in the last decade.
There are several papers that provide excellent reviews of
various statistical and computational techniques for inferring
genomic and proteomic networks [2, 12]. However, it is
important to understand the fundamental similarities and
differences that characterize the two inference problems. This
paper provides an overview of the most recent statistical
methods proposed for the inference of GRNs and PPI
networks. For gene network inference, three large classes of
modeling and inferencing techniques will be presented,
namely, probabilistic graphical modeling approaches,
information theoretic methods, and state-space representation
models. Clustering and probabilistic graphical modeling
methods, which comprise the largest class of statistical
methods using PPI data, are reviewed for the protein-protein
interaction networks. Through a concise review of these
contemporary algorithms, our goal is to provide the reader
with a sufficiently rich understanding of the current
state-of-the-art techniques used in the field of genomic and proteomic
network inference.
The rest of this paper is organized as follows. Section 2
describes some of the data sets available for the inference
of genomic and proteomic networks. Section 3 reviews the
recent statistical methods employed to infer gene regulatory
networks. Protein-protein network inferencing techniques
are reviewed in Section 4. The methods for obtaining an
integrated network with gene networks and protein-protein
interaction networks as subnetworks are given in Section 5.
The inferred network evaluation is discussed in Section 6.
Finally, conclusions are drawn in Section 7.

2. Available Biological Data

The postgenomic era is distinguished by the availability of
a huge number of biological data sets, which are quite
heterogeneous in nature and difficult to analyze [3]. It is expected that
these data sets can aid in obtaining useful knowledge about
the underlying interactions in gene-gene and protein-protein
networks. This section reviews some of the main types of data
used for the inference of genomic and proteomic networks,
including gene expression data, protein-protein interaction
data, and ChIP-chip data.
2.1. Gene Expression Data. Of all the available datasets, gene
expression data is the most widely used for gene regulatory
network inference. Gene expression is the process that results
in functional products, for example, RNA or proteins, while
utilizing the information coded on the genes. The level of
gene expression is an important indicator of how active a
gene is and is measured in the form of gene expression
data. Similarity in the gene expression profiles of two genes
suggests some level of correlation between them. In this
paper, the gene expression data is denoted by means of a
random variable $\mathbf{x}(t)$, where $t$ stands for the time index.
2.1.1. cDNA-Microarray Data. One way of generating
cDNA-microarray data is via the DNA microarray technology, which
is by far the most popular method employed for this purpose.
The number of data samples is in general much smaller
than the number of genes. A main drawback associated with
cDNA-microarray data is the noise in the observed gene
expressions. Although the gene expression values should be
continuous, the inability to measure them accurately suggests
the use of discretized values.

Figure 2: Expression estimation in RNA-Seq. [Flowchart: mRNA, reverse transcription to cDNA, sequencing into short sequence reads, alignment into mapped reads, and estimation of expression levels.]

2.1.2. RNA-Seq Data. The recent advancement of sequencing
technologies has provided the ability to acquire more accurate
gene expression levels [13]. RNA-Seq is a novel technology for
mapping and quantifying transcriptomes, and it is expected
to replace all the contemporary methods because of its
superiority in terms of time, complexity, and accuracy. The
gene expression estimation in RNA-Seq begins with the
reverse transcription of the RNA sample into cDNA,
which undergoes high-throughput sequencing, resulting in
short sequence reads. These reads are then mapped to the
reference genome using a variety of available alignment tools.
The gene expression levels are estimated using the mapped
reads, and several algorithms have been proposed in the
recent literature to find efficient and more accurate estimates
of the gene expression levels. This process is summarized in
Figure 2. The gene expression data obtained in this manner
has been found to be much more reproducible and less noisy
as compared to the cDNA microarrays. The next subsection
describes the data used for PPI network inference.
2.2. Protein-Protein Interaction Data. Large-scale PPI data
have been produced in recent years by high-throughput
technologies like yeast two-hybrid and tandem affinity
purification, which provide stable and transient interactions, and
mass spectrometry, which indicates the protein complexes
[11, 12]. These data sets, in addition to being incomplete,
also contain false positives, and, therefore, the interactions
found in various data sets may not agree with each other.
Owing to this disagreement, it is imperative to make use
of statistical methods to infer the PPI networks by finding
reliable and reproducible interactions and to predict the
interactions not yet found in the currently available data.
2.3. ChIP-Chip Data. ChIP-chip, an abbreviation of
chromatin immunoprecipitation and microarray
(chip), investigates the interactions between DNA and
proteins. This data provides information about the DNA-binding
proteins. Since some of the genes encode transcription
factors (TFs), which in turn regulate some other genes and/or
proteins, this information comes in handy for the inference
of gene networks [10] and the integrated network. However,
generating the ChIP-chip data for a large genome would be
technically and financially difficult.
2.4. Other Data Sets. Apart from the data sets described
above, gene deletion and perturbation data are worth
mentioning here. A perturbation data set is generated by performing
an initial perturbation and then letting the system react
to it [14]. The gene expression values at the following time
instants and at steady state are measured, thereby obtaining
the response of the genes to the specific perturbation, which
could be the increase or decrease of the expression level of all
or certain genes. A gene deletion dataset, as the name indicates,
involves deleting a gene and measuring the resulting
expression level of other genes. This data may effectively uncover
simple direct relationships [14].

3. Modeling and Inferring Gene Regulatory Networks

Gene regulatory networks capture the interactions present
among the genes. Accurate and reliable estimation of gene
networks is crucial and can reap far-reaching
benefits in the field of medicinal biology, for example, in
terms of developing personalized medicines. The following
subsections review the main statistical methods used for
inference of gene regulatory networks. First, the important
class of probabilistic graphical models is presented.
3.1. Probabilistic Graphical Modeling Techniques. Probabilistic
graphical models have emerged as a useful tool for reverse
engineering gene regulatory networks. A gene network is
represented by a graph $\mathcal{G} = (V, E)$, where $V$ represents
the set of vertices (genes), and $E$ denotes the set of edges
connecting the vertices. The vertices of the graph are modeled
as random variables and the edges signify the interaction
between them. The expression value of gene $i$ is denoted by
$x_i$, and the total number of genes in the network is denoted
by $N$. The following subsections briefly describe some of the
robust and popular graphical modeling techniques for gene
network inference.
3.1.1. Bayesian Networks. Bayesian networks model the gene
regulatory networks as directed acyclic graphs (DAGs). To
simplify the inference process, the probability distribution
of DAG-networks is generally factored in terms of the
conditional distributions of each random variable given its
parents:

$$p(\mathbf{X}) = \prod_{i=1}^{N} p\left(x_i \mid \mathrm{Pa}(x_i)\right), \tag{1}$$

where $\mathrm{Pa}(x_i)$ denotes the parents of node $x_i$. The gene
regulatory network is inferred by using the Bayesian network
learning techniques. This is done by maximizing the
probability $P(\mathcal{G} \mid \mathcal{D})$, where $\mathcal{D}$ denotes the available gene
expression data. Several scoring metrics have been proposed
to obtain the best graph structure [15]. The network thus
obtained is unique to the extent of equivalence class; that is,
the independence relationships are uniquely identified.
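The factorization in (1) can be illustrated with a small example. The three-gene network, its binary states, and the conditional probability values below are hypothetical and chosen only to show how the joint probability is assembled from the per-gene conditional distributions.

```python
from itertools import product

# Hypothetical three-gene network: g1 -> g2, g1 -> g3 (binary states 0/1).
parents = {"g1": [], "g2": ["g1"], "g3": ["g1"]}

# Conditional probability tables: cpt[gene][parent_states] = P(gene = 1 | parents).
cpt = {
    "g1": {(): 0.3},
    "g2": {(0,): 0.2, (1,): 0.9},
    "g3": {(0,): 0.6, (1,): 0.1},
}


def joint_probability(state):
    """Evaluate the factorization in (1) for one joint assignment of all genes."""
    prob = 1.0
    for gene, pa in parents.items():
        p_on = cpt[gene][tuple(state[p] for p in pa)]
        prob *= p_on if state[gene] == 1 else 1.0 - p_on
    return prob


genes = list(parents)
total = sum(joint_probability(dict(zip(genes, values)))
            for values in product([0, 1], repeat=len(genes)))
print(joint_probability({"g1": 1, "g2": 1, "g3": 0}), total)
```

Summing the factored joint over all eight assignments gives one (up to rounding), which is a quick sanity check that the tables define a valid distribution.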
The gene expression data available to date consist of very
few data points, while the number of genes is substantially
larger, rendering the system underdetermined. As an
alternative to finding the complete networks, scientists have
proposed looking at certain important features, for example,
Markov relations and order relations. If a gene $i$ is present in
the minimal network blanketing the gene $j$, then a Markov
relation is said to be established. A relationship between two
genes is referred to as an order relation if a particular
gene appears to be a parent of another gene in all the
equivalent networks. By aggregating this information, it is
possible to infer the underlying regulatory structure robustly
and reliably. The network structure inferred in this manner
looks at the static interactions only. In order to cater for the
dynamic interactions inherent in gene networks, dynamic
Bayesian networks (DBNs) have been used [16, 17].
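The minimal blanketing network mentioned above corresponds to the Markov blanket of a node, which in a DAG consists of the node's parents, its children, and the other parents of its children. The sketch below computes it for a hypothetical five-gene graph; the gene names and edges are invented for illustration.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node` in a DAG.

    parents: dict mapping each node to the list of its parent nodes.
    The blanket consists of the node's parents, its children, and the
    other parents of its children.
    """
    children = [n for n, pa in parents.items() if node in pa]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket


# Hypothetical five-gene DAG: g1 -> g3 <- g2, g3 -> g4, g5 -> g4.
dag = {"g1": [], "g2": [], "g5": [], "g3": ["g1", "g2"], "g4": ["g3", "g5"]}
print(sorted(markov_blanket(dag, "g3")))
```

Here g5 belongs to the blanket of g3 even though the two genes share no edge, because both are parents of g4.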
3.1.2. Qualitative Probabilistic Networks. A novel method of
modeling gene networks is via the usage of qualitative
probabilistic networks (QPNs), which represent the qualitative
analog of the DBNs [18]. The structural and independence
properties of QPNs are the same as those of Bayesian
networks. However, instead of being concerned about the local
conditional probabilities of the random variables, the former
class of models looks at how the changes in probabilities
of the random variables affect the probabilities of their
immediate parents. This change is measured in qualitative
terms instead of quantitative values, that is, whether the
probabilities increase, decrease, or stay the same as shown in
Figure 3.
Two important properties of QPNs are the qualitative
influences and the qualitative synergies. A positive influence
denoted by $S^{+}(A, B)$ indicates the greater possibility of $B$
having a higher value when that of $A$ is high and vice versa,
irrespective of all other variables; that is,

$$S^{+}(A, B) \quad \text{if } \Pr(b \mid a, x) > \Pr(b \mid \bar{a}, x), \tag{2}$$

where $a$ and $\bar{a}$ denote a high and a low value of $A$, $b$ denotes a
high value of $B$, and $x$ denotes any assignment of the other variables.
In the case of three variables, QPNs look at the synergies.
A positive additive synergy, denoted by $Y^{+}(\{A, B\}, C)$, exists
when the combined effect of the parent nodes is greater on
the child node than their individual effects given by

$$Y^{+}(\{A, B\}, C) \quad \text{if } \Pr(c \mid a, b, x) + \Pr(c \mid \bar{a}, \bar{b}, x) > \Pr(c \mid a, \bar{b}, x) + \Pr(c \mid \bar{a}, b, x). \tag{3}$$

QPNs, thus, provide more insight into the gene networks
by indicating whether a particular gene is a promoter or an
inhibitor.
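A qualitative influence of the kind defined in (2) can be read off a conditional probability table by checking the direction of change in every context of the remaining variables. The sketch below does this for binary variables; the table values are hypothetical, and returning "?" when the direction depends on the context is a convention adopted here for the example.

```python
def influence_sign(pr_b_given):
    """Qualitative influence of binary A on binary B.

    pr_b_given: dict mapping (a, x) -> Pr(B = 1 | A = a, X = x), where a is
    0 or 1 and x ranges over the joint states of all other parents of B.
    Returns '+', '-', '0', or '?' when the direction depends on the context x.
    """
    contexts = {x for (_, x) in pr_b_given}
    diffs = [pr_b_given[(1, x)] - pr_b_given[(0, x)] for x in contexts]
    if all(d > 0 for d in diffs):
        return "+"
    if all(d < 0 for d in diffs):
        return "-"
    if all(d == 0 for d in diffs):
        return "0"
    return "?"


# Hypothetical table: Pr(B = 1 | A = a, X = x) for a second parent X.
table = {(0, 0): 0.10, (1, 0): 0.40,
         (0, 1): 0.30, (1, 1): 0.85}
print(influence_sign(table))  # A promotes B in every context
```

A gene whose influence sign is "+" on a target would be read as a promoter of that target, and one with sign "-" as an inhibitor.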
Figure 3: Qualitative probabilistic network (red) for a Bayesian
network (blue).

3.1.3. Graphical Gaussian Models. Graphical Gaussian models,
also known as covariance selection or concentration
graph models, provide a simple and effective way of
characterizing the gene interactions [19, 20]. This method relies on
assessing the conditional dependencies among genes in terms
of partial correlation coefficients among the gene expressions
and results in an undirected network. A covariance matrix
is estimated using the available gene expression data sets.
Suppose that $\mathbf{X} \in \mathbb{R}^{M \times N}$ denotes the gene expression data
matrix, where the $M$ rows correspond to observations and
the $N$ columns correspond to genes (with each column centered
to zero mean), then an estimate of the
covariance matrix is obtained by

$$\hat{\mathbf{W}} = \frac{1}{M - 1} \mathbf{X}^{T} \mathbf{X}. \tag{4}$$

Assuming invertibility of $\hat{\mathbf{W}}$, the partial correlations can be
determined as

$$\rho_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}, \tag{5}$$

where $\rho_{ij}$ denotes the partial correlation between genes $i$ and
$j$, and $\omega_{ij}$ is the $(i, j)$th entry of $\hat{\mathbf{W}}^{-1}$.
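The computation of partial correlations from an inverted covariance estimate can be sketched in plain Python as follows. The expression values are invented, the helper names are illustrative, and the small Gauss-Jordan routine is used only to keep the example free of external libraries.

```python
import math


def covariance(data):
    """Sample covariance of a list of observations (rows) over genes (columns)."""
    m, n = len(data), len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]


def inverse(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(data):
    """Partial correlation of every gene pair given all remaining genes."""
    omega = inverse(covariance(data))
    n = len(omega)
    return [[1.0 if i == j
             else -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
             for j in range(n)] for i in range(n)]


# Hypothetical expression matrix: 5 observations (rows) of 3 genes (columns).
expression = [[1.0, 2.1, 0.5],
              [2.0, 3.9, 1.4],
              [3.0, 6.2, 1.1],
              [4.0, 7.8, 2.3],
              [5.0, 10.1, 2.0]]
for row in partial_correlations(expression):
    print([round(v, 3) for v in row])
```

With more genes than observations the covariance estimate is singular and the inversion step fails, which is the small-sample difficulty that penalized approaches are meant to address.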
3.1.4. Graphical LASSO Algorithm. A major drawback of the
covariance-matrix-estimation-based methods is their unreliability due to the small number of
data samples. Making use of the fact that gene networks are inherently sparse, it is
possible to obtain the dependencies between genes by means
of a penalized linear regression approach [20]. The graphical
Least Absolute Shrinkage and Selection Operator (LASSO)
algorithm solves the network inference problem efficiently by
maximizing the following penalized likelihood function:
$$f(\mathbf{W}^{-1}) = \log(\det(\mathbf{W}^{-1})) - \operatorname{trace}(\hat{\mathbf{W}} \mathbf{W}^{-1}) - \lambda \|\mathbf{W}^{-1}\|_{1}, \tag{6}$$

where $\lambda$ controls the sparsity of the network, the notation $\|\cdot\|_{1}$
represents the $\ell_{1}$-norm, and $\mathbf{W}$ denotes the covariance matrix.
This optimization can be carried out by using block coordinate
descent methods, the details of which can be found in [20]
and the references therein.
3.2. State-Space Representation Models. One of the earliest
and most widely used methods of modeling gene networks
employs state-space representation models [21]. As
opposed to other classes, all the methods belonging to this
class model the dynamic evolution of the gene network. These
models generally consist of two sets of equations: the first set
represents the evolution of the hidden state
variables, denoted by $\mathbf{z}(t)$, and the second set
relates the hidden state variables to the observed gene
expression data, denoted by $\mathbf{x}(t)$, as depicted in Figure 4. The
functions $f(\cdot)$ and $g(\cdot)$ describe the evolution of hidden and
observed variables, respectively. The remainder of this section
describes various models for gene network inference using the
state-space representation.

Figure 4: State-Space model.

3.2.1. Linear State-Space Model. The simplest model for state-space equations is the linear
Gaussian model given by [21, 22]:

$$\mathbf{z}(t) = \mathbf{A}\mathbf{z}(t-1) + \mathbf{v}(t), \qquad \mathbf{x}(t) = \mathbf{C}\mathbf{z}(t) + \mathbf{w}(t), \tag{7}$$

where $\mathbf{A}$ is a matrix representing the regulatory relations
between the genes, and $t$ stands for the discrete time points.
Difference equations are used in place of differential equations because only discrete
observations are available in the gene expression data. The noise components
$\mathbf{v}(t)$ and $\mathbf{w}(t)$ represent the system and the measurement noise,
respectively, and are assumed to be Gaussian. The noise models the uncertainty
present in the estimated gene expression data. The matrix $\mathbf{C}$
is generally considered to be an identity matrix. Inference in
gene networks modeled by the state-space representation (7)
can be performed using standard Kalman filter updates. The
simplicity of the state-space model helps avoid overfitting of the
network, and therefore, it provides reliable results.
3.2.2. Nonlinear Models. While it is useful to represent
gene networks by simple models to ease the computational
complexity, it is also imperative to incorporate nonlinear
effects into the system equations, since the genes are known
to interact nonlinearly [23]. A particular function that is
frequently used to capture the nonlinear effects is the sigmoid
squash function defined below in (9) [24]. The nonlinear
state-space representation model capturing the gene interactions is described by the
following system of equations:

$$\mathbf{z}(t) = \mathbf{A}\mathbf{z}(t-1) + \mathbf{B} f(\mathbf{z}(t-1), \boldsymbol{\mu}) + \mathbf{I}_{0} + \mathbf{v}(t), \tag{8}$$

where the $i$th entry of the vector function $f(\cdot)$ is given by the
sigmoid squash function:

$$f_{i}(z_{i}, \mu_{i}) = \frac{1}{1 + e^{-\mu_{i} z_{i}}}, \tag{9}$$
where $\mu_{i}$ is a parameter to be identified. Matrix $\mathbf{A}$ represents
the linear relationships between the genes, while matrix $\mathbf{B}$
characterizes the nonlinear interactions. The problem, thus,
boils down to the estimation of the following unknowns in
the system:

$$\boldsymbol{\theta} = [\mathbf{A}, \mathbf{B}, \boldsymbol{\mu}, \mathbf{I}_{0}], \tag{10}$$

where $\mathbf{I}_{0}$ models the constant bias. One way of solving these
equations is by using the extended Kalman filter (EKF) [24],
which is a popular algorithm for solving nonlinear state-space equations. The EKF
algorithm provides the solution by approximating the nonlinear system by its first-order
linear approximation. Other variants of the Kalman filter algorithm, like
the cubature Kalman filter (CKF), the unscented Kalman filter
(UKF), and the particle filter algorithm, are also used to solve such
inference problems [25].
However, for many studies, the considered nonlinear
model comprises a large number of unknowns, and in
order to estimate these unknown variables with considerable
accuracy, data sets consisting of a large number of samples
are required. The availability of only small data sets is a
major obstacle to the reliable estimation of a large
number of unknowns. This problem can be partially avoided
by simplifying the model to include only nonlinear terms,
thus reducing the number of unknown parameters to the bare
minimum [25], and by approximating $\mu_{i}$ to be one. The system
of equations corresponding to such a parsimonious scenario
is then given by

$$\mathbf{z}(t) = \mathbf{B} f(\mathbf{z}(t-1)) + \mathbf{v}(t), \tag{11}$$

where $f$ is the function defined previously.
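To make the parsimonious model (11) concrete, the sketch below simulates it for a two-gene network using the sigmoid squash function of (9) with $\mu_{i} = 1$. It is only an illustrative forward simulation under stated assumptions, not an inference algorithm: the names `sigmoid` and `simulate`, the example matrix `B`, and the noise level are all hypothetical.

```python
import math
import random


def sigmoid(z, mu=1.0):
    """Sigmoid squash function of (9): 1 / (1 + exp(-mu * z))."""
    return 1.0 / (1.0 + math.exp(-mu * z))


def simulate(B, z0, steps, noise_std=0.0, seed=0):
    """Iterate z(t) = B f(z(t-1)) + v(t), with Gaussian v(t), as in (11)."""
    rng = random.Random(seed)
    z = list(z0)
    trajectory = [z]
    for _ in range(steps):
        f = [sigmoid(value) for value in z]
        z = [sum(B[i][j] * f[j] for j in range(len(f))) + rng.gauss(0.0, noise_std)
             for i in range(len(f))]
        trajectory.append(z)
    return trajectory


# Hypothetical two-gene network: gene 2 activates gene 1, gene 1 represses gene 2.
B = [[0.0, 2.0],
     [-2.0, 0.0]]
states = simulate(B, z0=[0.0, 0.0], steps=10, noise_std=0.05)
```

In an inference setting the direction is reversed: the trajectory is (noisily) observed and the entries of $\mathbf{B}$ are the unknowns to be estimated, for example with the filtering approaches mentioned above.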

3.2.3. Models with Sparsity Constraints. A crucial feature of
many gene networks is their inherent sparsity; that is, each
gene in the network is connected to only a few other genes.
Therefore, matrices $\mathbf{A}$ and $\mathbf{B}$ depicting the regulatory
relations between the genes are expected to contain only very
few nonzero values as compared to the size of these matrices.
Consequently, one may apply shrinkage-based methods like
LASSO [25, 26] for parameter estimation and parsimonious
model selection. One of the ways of inferring models with
sparsity constraints is to perform dual estimation, which
involves estimating the states and the parameters one by one.
The hidden states can be estimated using the particle filter
algorithm, and once all the estimates for the hidden states are
obtained, they can be stacked together to form a matrix. For a given gene, with $K$ time
points and $N$ genes, the following system of equations is thus obtained to perform
the parameter estimation:

$$\begin{bmatrix} z_{1} \\ z_{2} \\ \vdots \\ z_{K} \end{bmatrix} = \begin{bmatrix} f(z_{0,1}) & \cdots & f(z_{0,N}) \\ f(z_{1,1}) & \cdots & f(z_{1,N}) \\ \vdots & \ddots & \vdots \\ f(z_{K-1,1}) & \cdots & f(z_{K-1,N}) \end{bmatrix} \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{N} \end{bmatrix} + \begin{bmatrix} v_{1} \\ v_{2} \\ \vdots \\ v_{K} \end{bmatrix}, \tag{12}$$

which can be expressed compactly in vector/matrix-form
representation as

$$\mathbf{z} = \boldsymbol{\Phi}\mathbf{b} + \mathbf{v}, \tag{13}$$

with $\boldsymbol{\Phi}$ denoting the matrix of function values in (12).
LASSO operates on this system of equations and produces a
parameter vector $\mathbf{b}$ by minimizing the criterion [27]:

$$\min_{\mathbf{b}} \frac{1}{2} \|\mathbf{z} - \boldsymbol{\Phi}\mathbf{b}\|_{2}^{2} + \lambda \|\mathbf{b}\|_{1}. \tag{14}$$

The parameter estimates obtained using LASSO-based algorithms appear to be more reliable
than the estimates provided by other approaches [25].
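One common way to minimize a criterion of the form (14) is cyclic coordinate descent with soft thresholding. The sketch below is a minimal version of that generic technique and is not claimed to be the solver used in [25, 27]; the names `soft_threshold` and `lasso_coordinate_descent`, the fixed number of sweeps, and the toy inputs are assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exact zeros give sparsity."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(Phi, z, lam, sweeps=200):
    """Minimize 0.5 * ||z - Phi b||_2^2 + lam * ||b||_1 as in (14)."""
    rows, cols = len(Phi), len(Phi[0])
    b = [0.0] * cols
    col_norms = [sum(Phi[t][j] ** 2 for t in range(rows)) for j in range(cols)]
    for _ in range(sweeps):
        for j in range(cols):
            if col_norms[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes b[j].
            rho = sum(
                Phi[t][j] * (z[t] - sum(Phi[t][k] * b[k]
                                        for k in range(cols) if k != j))
                for t in range(rows))
            b[j] = soft_threshold(rho, lam) / col_norms[j]
    return b


# Hypothetical toy system: the second regressor is weak and is shrunk to zero.
Phi_toy = [[1.0, 0.0], [0.0, 1.0]]
z_toy = [3.0, 0.5]
b_hat = lasso_coordinate_descent(Phi_toy, z_toy, lam=1.0)
```

Larger values of `lam` drive more entries of the estimate to exactly zero, which corresponds to removing candidate regulatory links from the inferred network.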
3.2.4. State-Space Models for Time-Delayed Dependencies.
The state-space models discussed so far do not consider
time delays, whereas it has been found that time-delayed
interactions are present in gene networks [28] due to the time
required for the processes of transcription and translation to
take place. One of the ways to model this phenomenon is by
adopting the following state-space model:

$$\mathbf{z}(t) = \mathbf{A}\mathbf{z}(t-1) + \mathbf{B}\mathbf{u}(t-\tau) + \mathbf{v}(t), \qquad \mathbf{x}(t) = \mathbf{C}\mathbf{z}(t) + \mathbf{w}(t). \tag{15}$$

In this state-space model, the input $\mathbf{u}(t)$ is considered to be the
expression profile of a regulator such as a transcription factor.
Here, $\mathbf{A}$ stands for the state transition matrix, while
matrix $\mathbf{B}$ captures the effect of regulators on the system.
The value of the time delay $\tau$ is obtained by finding the best
fit over a range of possible values using Akaike's information
criterion (AIC) in order to avoid overfitting the network.
3.3. Information Theoretic Methods. Information theoretic
methods have provided some of the most robust and reliable
algorithms for gene network inference and form the basis of a
standard in this field [29–31]. A particular advantage associated with these methods is
their ability to work with minimal
assumptions about the underlying network. This is in contrast
with the probabilistic graphical modeling techniques as well
as the state-space models, both of which have their own set
of assumptions. As highlighted previously, a Markov network
provides an undirected network, while Bayesian networks are
not able to incorporate cycles or feedback loops. State-space
models, apart from the linear Gaussian model, make critical
assumptions on the model structure. These drawbacks are
not present in the case of information theoretic methods. The
following discussion presents the main information theoretic
approaches for inferring gene regulatory networks.

3.3.1. Finding the Correlation between Genes. Two of the most
fundamental concepts in information theory are mutual
information and entropy. Mutual information between two
random variables $X$ and $Y$ is defined as [32]

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = H(X) + H(Y) - H(X, Y), \tag{16}$$

where $H$ denotes the entropy or the uncertainty present in a
random variable, and it is given by

$$H(X) = -\sum_{x} p(x) \log p(x). \tag{17}$$

Mutual information measures the statistical dependence, a generalized notion of
correlation, between two random variables. In the context of gene network inference,
a higher mutual information between two genes indicates
a higher dependency, and therefore, a possible interaction
between them. Some of the most important and robust algorithms for gene network
inference make use of the mutual
information for finding the interacting genes [29, 30].
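The entropy identity in (16) suggests a simple plug-in estimator for discretized expression data: estimate each entropy in (17) from empirical frequencies and combine them. The sketch below follows that idea; the function names and the toy profiles are hypothetical, base-2 logarithms are an arbitrary choice of unit, and practical tools typically add binning or kernel density estimation for continuous expression values.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in estimate of H(X) in bits, as in (17), from discrete samples."""
    total = len(samples)
    counts = Counter(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), the entropy form of (16)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Hypothetical discretized profiles (0 = low, 1 = high expression).
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]   # identical to gene_a: fully dependent
gene_c = [0, 1, 0, 1]   # unrelated to gene_a in this toy sample
mi_ab = mutual_information(gene_a, gene_b)
mi_ac = mutual_information(gene_a, gene_c)
```

With these toy profiles, `mi_ab` equals the entropy of `gene_a`, while `mi_ac` is zero, matching the interpretation of mutual information as a measure of dependence.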
3.3.2. Identifying Indirect Interactions between Genes. If the
mutual information between two genes is greater than a certain threshold, it indicates
some correlation between them.
However, this information alone is not sufficient to decide
whether the genes are connected directly or indirectly via
an intermediate gene. The data processing inequality (DPI)
provides some insight to assess whether such a scenario holds.
In the case of three genes forming a Markov chain $X \to Y \to Z$, as shown in
Figure 5, DPI can be expressed as

$$I(X; Z) \le \min\left[I(X; Y), I(Y; Z)\right]. \tag{18}$$

Using this inequality, it is found that the interaction with the
least mutual information is an indirect one. This method is
employed in ARACNE [29], which has become a standard
algorithm for gene network inference. However, DPI fails to
hold in situations where one of the three genes is a parent
gene to the other two genes. Conditional mutual information
has been proposed to be used in such cases [30]. Conditional
mutual information is defined as

$$I(X; Y \mid Z) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\,p(y \mid z)} = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z). \tag{19}$$

If $I(X; Y \mid Z)$ is much less than $I(X; Y)$, it implies that $Z$ is
a parent of the genes $X$ and $Y$, as shown in Figure 5. If
the two quantities are almost equal, it means that the gene $Z$
does not have any influence on the other two genes. Therefore,
by employing the idea of conditional mutual information,
indirect interactions in the case of a common cause can be
sifted.
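The DPI-based pruning step can be sketched as follows: for every fully connected triplet of genes, the edge with the smallest mutual information is treated as indirect and removed. This is a simplified illustration in the spirit of ARACNE [29], not its actual implementation; the function name `prune_with_dpi`, the `tolerance` parameter, and the example mutual information values are hypothetical.

```python
from itertools import combinations


def prune_with_dpi(mi, tolerance=0.0):
    """Drop the weakest edge of every complete triplet, following (18).

    mi maps frozenset({gene_a, gene_b}) to a mutual information value.
    """
    genes = sorted({gene for pair in mi for gene in pair})
    indirect = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(edge in mi for edge in edges):
            continue  # DPI is only applied to triangles
        weakest = min(edges, key=lambda edge: mi[edge])
        strongest_two = [mi[edge] for edge in edges if edge != weakest]
        # Mark as indirect only if clearly weaker than both other edges.
        if mi[weakest] < min(strongest_two) - tolerance:
            indirect.add(weakest)
    return {edge: value for edge, value in mi.items() if edge not in indirect}


# Hypothetical values for a chain X -> Y -> Z.
mi_values = {
    frozenset(("X", "Y")): 0.9,
    frozenset(("Y", "Z")): 0.8,
    frozenset(("X", "Z")): 0.3,
}
direct_edges = prune_with_dpi(mi_values)
```

Edges are marked during the scan and removed only at the end, so the result does not depend on the order in which triplets are visited. As noted above, this rule is not appropriate for the common-cause configuration, where conditional mutual information (19) is the suggested tool.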

Figure 5: Markov chain (blue) and common cause (red).

3.3.3. Finding the Directed Networks. Calculating the mutual
information using static data does not provide any information about the directed
relationships. On the other hand,
using time series data may indicate the directionality of
interactions as well [33]. Mutual information for time series
data can be expressed as

$$I(X_{t+1}; Y_{t}) = \sum_{x_{t+1}, y_{t}} p(x_{t+1}, y_{t}) \log \frac{p(x_{t+1}, y_{t})}{p(x_{t+1})\,p(y_{t})}. \tag{20}$$

If a high value is obtained for $I(X_{t+1}; Y_{t})$, it signifies a
directed relationship from gene $Y$ to gene $X$. While using these
methods, the determination of the significance threshold is
of considerable importance, and the threshold can be estimated based on
prior knowledge about the network.
The information theoretic quantities discussed so far are
symmetric (or bidirectional) and do not provide any information about the directionality
by themselves. Some new
metrics have been proposed recently to infer asymmetric or
one-directional relationships, such as the $\phi$-mixing coefficient
defined as [34]:

$$\phi(X \mid Y) = \max_{S, T} \left| \Pr\{X \in S \mid Y \in T\} - \Pr\{X \in S\} \right|. \tag{21}$$

In other words, this coefficient provides a measure of independence or difference between
two genes $X$ and $Y$. DPI also
holds true for the $\phi$-mixing metric, and therefore, it can be
used to identify the indirect interactions as in the case of
mutual information.
3.3.4. Time-Delayed Dependencies. Another way of finding
directed relationships is by detecting the time-delayed dependencies using time series
data. The time instants at which
the expression level of a gene, relative to its initial value, goes above or drops below
the thresholds $\tau_{\text{up}}$ and $\tau_{\text{down}}$, respectively, are noted [35]. These
instants are called the initial change of expression (IcE) times
and are defined as

$$\operatorname{IcE}(X_{a}) = \arg\min_{t} \left\{ \frac{x_{a}^{t}}{x_{a}^{0}} \ge \tau_{\text{up}} \ \text{or} \ \frac{x_{a}^{t}}{x_{a}^{0}} \le \tau_{\text{down}} \right\}. \tag{22}$$

It can be seen that a gene $X_{a}$ can be a regulator for gene $X_{b}$ if
and only if (iff) $\operatorname{IcE}(X_{a}) < \operatorname{IcE}(X_{b})$. The mutual
information in this case is given by

$$I(X_{a}; X_{b}^{(k)}) = \sum_{t} p(x_{a}^{t}, x_{b}^{t+k}) \log \frac{p(x_{a}^{t}, x_{b}^{t+k})}{p(x_{a}^{t})\,p(x_{b}^{t+k})}, \tag{23}$$

where the delay is denoted by $k$. The next step consists
in finding the maximum of the mutual information values
calculated for all the time delays; that is,

$$I_{\max}(X_{a}, X_{b}) = \max_{k} \left\{ I(X_{a}; X_{b}^{(k)}) \right\} \quad \text{for } k = 1, 2, \ldots, \ \text{while } \operatorname{IcE}(X_{a}) \le \operatorname{IcE}(X_{b}). \tag{24}$$

If the value of the maximum mutual information is greater
than a prespecified threshold, it is concluded that a directed
relationship exists from $X_{a}$ to $X_{b}$. The choice of threshold
is very important in all the information theoretic methods;
it is selected on the basis of a predetermined $p$-value [29]. This helps to obtain
networks with the required significance value.
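The search over time delays can be sketched by shifting one discretized profile against the other, computing the mutual information for each shift, and keeping the maximum. The sketch below is a simplified illustration only: it omits the IcE precondition and the significance test, uses a plug-in entropy estimator, and all names (`lagged_mi`, `best_lag`) and toy profiles are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in entropy estimate in bits from discrete samples."""
    total = len(samples)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(samples).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def lagged_mi(source, target, lag):
    """Mutual information between source at time t and target at time t + lag."""
    return mutual_information(source[:-lag], target[lag:])


def best_lag(source, target, max_lag):
    """Return the delay with the largest lagged mutual information and its value."""
    scores = {k: lagged_mi(source, target, k) for k in range(1, max_lag + 1)}
    k_best = max(scores, key=scores.get)
    return k_best, scores[k_best]


# Hypothetical profiles: the target repeats the source two time points later.
source = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
target = [0, 0] + source[:-2]
delay, score = best_lag(source, target, max_lag=3)
```

A directed edge from the source gene to the target gene would be reported only if `score` also exceeds the chosen significance threshold.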
3.3.5. Model Selection. An important and necessary step in
the implementation of the above-mentioned algorithms is
the model selection. A network formed by using mutual
information alone will result in an overfitted structure, and
therefore, model selection becomes imperative. The minimum
description length (MDL) principle was proposed as a general
approach for model selection. MDL states that the network
with the shortest coding length should be selected. For a
network with a large number of nodes, the coding length will
be large and vice versa. The MDL principle provides a trade-off
and aids in selecting only the significant interactions between
the genes. MDL was applied in various ways in finding the
coding length of the network and the probability densities
associated with it [33]. Another way of using this principle is
in conjunction with the maximum likelihood (ML) principle,
which results in a more general algorithm [36]. Further
details on this algorithm can be found in [36]. Thus, it appears
that the tools of information theory are quite powerful in
modeling and inferring gene regulatory networks.

4. Inferring the Protein-Protein Interaction Networks
Having examined the gene network inference problem, this
section describes the statistical methods that are used to find
reliable and complete protein-protein interaction networks.
As opposed to gene networks, which are mostly inferred using
expression data or similar measurements, inference of PPI networks
can be carried out in various ways such as phylogenetic
profiling and identification of structural patterns. This paper
focuses only on the methods that employ PPI data to make
inference. The given data in this scenario are the protein-protein interactions. However,
such data sets consist of a
large number of false positives and negatives and are far from
being complete and homogeneous. Therefore, only a small
overlap is found between the PPI data sets obtained from
various sources. However, it is observed that the interactions
predicted by more than one method are more reliable [37].
One of the challenges is the large number of interactions indicated by the PPI data as
opposed to the considerably fewer
interactions assumed to be present in reality. Therefore, the
problem in this scenario is to find more reliable interactions
and predict the yet unknown interactions. In addition, the
protein interactions can be of different types ranging from
stable ones to transient ones [37].
It is to be noted that, as opposed to the gene networks,
a lot of work can still be done for protein-protein network
inference using the probabilistic methods. In a living organism, several proteins work
together to carry out various tasks,
forming a protein complex. Most of the PPI data consists of
binary interactions only, and it is very rare to find interactions
between more than two proteins simultaneously. Hence,
identification of protein complexes is of prime importance to
gain a better understanding of the cellular network.
Detecting protein complexes is a fundamental area of
study of protein networks [38], for which various clustering
methods have been applied. One way of identifying
the protein complexes is graph segmentation, where
the graph is clustered into subgraphs using cost-based search
algorithms. Another approach is broadly categorized as
conservation across species [38], where alignment tools are
used to find the complexes that are common in multiple data
sets coming from different species. In what follows, some of
the recently proposed probabilistic graphical-modeling- and
clustering-based methods are described.
4.1. Markov Networks. The available PPI data look mostly at
the binary interactions, and interactions of three or more
proteins are hard to find. However, it is important to look at
the interacting proteins holistically. Markov networks are
probabilistic graphical modeling techniques which result
in undirected graphs. Suppose $\mathbf{X} = \{X_{1}, \ldots, X_{n}\}$ is a
vector of random variables modeling the proteins. Their joint
distribution is captured in terms of the potentials $\phi_{k}$.
The random variables $\mathbf{X}_{k}$ that are connected to each other
are called the scope for the particular potential $\phi_{k}$. The joint
probability distribution is then given by

$$P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z} \prod_{k} \phi_{k}(\mathbf{x}_{k}), \tag{25}$$

where $Z$ is the normalizing constant, also called the partition function. In this way, a
compact representation of the
probability distribution is obtained. The network structure
is learned by using the independence properties of Markov
networks using the available PPI data. The details of this
method can be found in [37].

4.2. Bayesian Networks. Another way of modeling PPI networks is by means of Bayesian
networks (BNs) [39], which
represent a probabilistic graphical modeling technique. The
inference algorithm is based on finding the conditional
probability densities $p(X_{i} \mid C)$, where $C$ denotes the class
variable, and $X_{i}$ denotes the $i$th node in the network. A
particular strength of BNs is their ability to estimate model
parameters even in the presence of incomplete data, which is
often the case with the PPI networks. This fact makes BNs a
well-suited method for modeling protein networks. One
way of estimating the model parameters is via the Expectation
Maximization (EM) algorithm [39]. The joint probability
distribution is expressed as

$$p(C, X_{1}, \ldots, X_{n}) = p(C) \prod_{i=1}^{n} p(X_{i} \mid C). \tag{26}$$

Assuming all the random variables to be independent of each
other, the posterior density is given by

$$p(C \mid X_{1}, \ldots, X_{n}) = p(C) \prod_{i=1}^{n} \frac{p(X_{i} \mid C)}{p(X_{i})}. \tag{27}$$

Once the model parameters are known, predictions can be
made about random variables for which the data may not
be available. Therefore, this algorithm provides a suitable
method for finding protein complexes.
4.3. Graphical Clustering Methods. One of the ways of graph
clustering is based on supervised learning [12, 38]. The
subgraphs are modeled using Bayesian networks, and the features consist of topological
patterns of graphs and biological
properties. Rather than assuming the widely used cliqueness
property, which considers all the nodes to be connected
with each other, the algorithm looks for the properties that
are inferred from already known complexes. Two important
features are the label $c$ indicating whether a subgraph is
a complex and the number of nodes $n$. The other feature
descriptors, including degree statistics, graph density, and
degree correlation statistics, are indicated by $x_{1}, \ldots, x_{m}$ and
are considered independent given $c$ and $n$. The number of
nodes in and of itself is an important feature. Its importance
can be seen from the fact that a larger number of nodes in
a subgraph indicates a lesser probability of it being a clique.
All the subgraphs are assigned scores by making use of
these properties. One way of finding how probable it is for
a subgraph to be a protein complex is to perform simple
hypothesis testing by calculating the following log-ratio of conditional
probabilities [12, 38]:

$$L = \log \frac{p(c_{1} \mid n, x_{1}, \ldots, x_{m})}{p(c_{0} \mid n, x_{1}, \ldots, x_{m})} = \log \frac{p(c_{1})\,p(n \mid c_{1}) \prod_{k=1}^{m} p(x_{k} \mid n, c_{1})}{p(c_{0})\,p(n \mid c_{0}) \prod_{k=1}^{m} p(x_{k} \mid n, c_{0})}, \tag{28}$$

where $c_{1}$ and $c_{0}$ denote $c = 1$ and $c = 0$, and the posterior probabilities are
calculated via Bayes' rule as

$$p(c_{1} \mid n, x_{1}, \ldots, x_{m}) = \frac{p(n, x_{1}, \ldots, x_{m} \mid c = 1)\,p(c = 1)}{p(n, x_{1}, \ldots, x_{m})} = \frac{p(x_{1}, \ldots, x_{m} \mid n, c = 1)\,p(n \mid c = 1)\,p(c = 1)}{p(n, x_{1}, \ldots, x_{m})}. \tag{29}$$

These probability densities can be calculated using maximum
likelihood methods. By comparing the obtained score to
a predetermined threshold, some of the subgraphs can be
labeled as complexes. This algorithm takes the weighted
matrix of PPI data as input, where the weights are assigned
using the likelihood of any particular interaction. Several
other graphical-clustering-based methods are surveyed in
[12].
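The scoring step can be illustrated with a small function that evaluates the log-ratio of the two posteriors obtained from Bayes' rule (29), under the conditional independence of the feature descriptors given the label and the number of nodes. This is a minimal sketch rather than the procedure of [12, 38]: the function name `complex_score`, the dictionary layout, and the probability values are hypothetical, and in practice the probabilities would be estimated from known complexes.

```python
import math


def complex_score(prior, size_likelihood, feature_likelihoods):
    """Log-ratio of the posteriors for c = 1 versus c = 0.

    prior[c]               -- p(c)
    size_likelihood[c]     -- p(n | c) for the observed number of nodes n
    feature_likelihoods[c] -- list of p(x_k | n, c) for the observed features
    """
    def log_joint(c):
        return (math.log(prior[c])
                + math.log(size_likelihood[c])
                + sum(math.log(p) for p in feature_likelihoods[c]))

    # The evidence p(n, x_1, ..., x_m) cancels in the ratio.
    return log_joint(1) - log_joint(0)


# Hypothetical probabilities for one candidate subgraph.
score = complex_score(
    prior={1: 0.5, 0: 0.5},
    size_likelihood={1: 0.4, 0: 0.2},
    feature_likelihoods={1: [0.5], 0: [0.25]},
)
is_complex = score > 0.0  # hypothetical threshold of zero
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when many feature descriptors are used.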
4.4. Matrix Factorization Methods for Clustering. Nonnegative matrix factorization (NMF)
is a method widely used in
problems of clustering. Application of this technique has been
proposed recently in [40], where an ensemble of nonnegative
factored matrices obtained using protein-protein interaction
data are combined together to perform soft clustering. The
importance of this step lies in the fact that a particular object
may belong to multiple classes. Hence, the various algorithms
reported in the literature performing hard clustering may not
be of much benefit in such scenarios. This ensemble NMF
method is observed to classify the proteins in accordance
with the functions they perform and also identify the multiple
groups they belong to.
The algorithm produces base clusterings by factorizing
the symmetric data matrix $\mathbf{S}$ of protein interactions in the
following manner [40]:

$$\min_{\mathbf{V} > 0} \left\| \mathbf{S} - \mathbf{V}\mathbf{V}^{T} \right\|_{F}^{2}, \tag{30}$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm. The factors $\mathbf{V}$
produced in this manner are not unique. Let $k_{i}$ be the number
of clusters in the $i$th base clustering, each with a different value
in order to promote diversity. Once the ensemble of factored
matrices is available, the next step is to construct the graph
by combining the information present in them. Parameter
$l = k_{1} + \cdots + k_{\tau}$ gives the total number of basis vectors, which
are denoted by $\mathbf{V} = \{\mathbf{v}_{1}, \ldots, \mathbf{v}_{l}\}$. Each vector denotes a node on
the graph, and the edge weight is calculated using the Pearson
correlation for a pair of vectors $(\mathbf{v}_{i}, \mathbf{v}_{j})$ given by

$$\operatorname{cor}(\mathbf{v}_{i}, \mathbf{v}_{j}) = \frac{1}{2} \left( \frac{(\mathbf{v}_{i} - \bar{\mathbf{v}}_{i})^{T} (\mathbf{v}_{j} - \bar{\mathbf{v}}_{j})}{\|\mathbf{v}_{i} - \bar{\mathbf{v}}_{i}\|_{2}\, \|\mathbf{v}_{j} - \bar{\mathbf{v}}_{j}\|_{2}} + 1 \right). \tag{31}$$
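The edge weight in (31) is the Pearson correlation rescaled from $[-1, 1]$ to $[0, 1]$. A minimal sketch of this computation is given below; the function name `normalized_correlation` and the toy basis vectors are hypothetical, and the handling of constant vectors (raising an error) is a design choice made here, not something specified in [40].

```python
import math


def normalized_correlation(u, v):
    """Pearson correlation of u and v mapped to [0, 1], as in (31)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if norm == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return 0.5 * (sum(a * b for a, b in zip(du, dv)) / norm + 1.0)


# Hypothetical basis vectors from two base clusterings.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # perfectly correlated with v1
v3 = [3.0, 2.0, 1.0]   # perfectly anti-correlated with v1
w12 = normalized_correlation(v1, v2)
w13 = normalized_correlation(v1, v3)
```

Mapping the correlation to a nonnegative range keeps all edge weights of the combined graph nonnegative, which is convenient for subsequent graph clustering.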

Having looked at the GRNs and PPI network inference


problems individually, we now proceed to review the recent
advancements in the joint modeling of the two networks.

5. An Integrated Cellular Network
The advances in reverse engineering of GRNs and PPI
networks have paved the way for joint estimation of GRNs
and PPI networks [41]. This is a step towards the inference of
an integrated network consisting of genes, proteins, and transcription factors, indicating
the interactions within and between these groups. Figure 6 shows the schematic of an
integrated cellular network. In this section, we review two important
ways of estimating a joint network.

Figure 6: An integrated cellular network (nodes: proteins P1–P4, transcription factors
TF1–TF3, and genes G1–G4).

5.1. Probabilistic Graphical Models for Joint Inference. Reference [41] proposed an
interesting method for estimating
GRNs and PPI networks simultaneously. Suppose that the
gene expression is denoted by $\mathbf{x}$ and PPI data is represented by
$\mathbf{y}$. The algorithm provides an undirected protein network $G_{P}$
and a directed gene network $G_{G}$, modeled using Markov and
Bayesian networks, respectively, by maximizing their joint
distribution; that is,

$$P(G_{G}, G_{P} \mid \mathbf{x}, \mathbf{y}) \propto P(G_{G}, G_{P}, \mathbf{x}, \mathbf{y}) = P(\mathbf{x} \mid G_{G})\,P(\mathbf{y} \mid G_{P})\,P(G_{G}, G_{P}), \tag{32}$$

where $P(\mathbf{x} \mid G_{G}, G_{P}) = P(\mathbf{x} \mid G_{G})$ and
$P(\mathbf{y} \mid G_{G}, G_{P}) = P(\mathbf{y} \mid G_{P})$. The inference on Markov and
Bayesian networks
is performed in the same manner as explained in the previous
sections. The two subnetworks are estimated iteratively until the
algorithm converges. Further details on this algorithm can be
found in [41].
5.2. Joint Estimation Using State-Space Model. The state-space
model can also be used to obtain an integrated network
of gene and protein-protein interactions [42, 43]. A novel
approach employing a nonlinear model is proposed in [43],
where the system parameters are estimated using constrained
least squares. The gene expression is assumed to follow a
dynamic model given by

$$x_{i}(t + 1) = x_{i}(t) + \sum_{j=1}^{N} a_{ij}\,y_{j}(t) - \lambda_{i}\,x_{i}(t) + k_{i} + \epsilon_{i}(t), \tag{33}$$

where

$$y_{j}(t) = f_{j}(z_{j}(t)) = \frac{1}{1 + \exp\{-(z_{j}(t) - \mu_{j}) / \sigma_{j}\}} \tag{34}$$

and $z_{j}$ denotes the protein activity profile of the $j$th transcription
factor, and its mean and standard deviation are represented
by $\mu_{j}$ and $\sigma_{j}$, respectively. The magnitude of $a_{ij}$ indicates
the strength of the relationship between the $j$th TF and the $i$th
gene, and the sign suggests whether it is an excitatory or
inhibitory relationship. The model in (33) suggests that the
gene expression level at time instant $t + 1$ depends upon the
gene expression level at the previous time instant as well
as the protein activity level. The degradation effect of gene
expression is modeled by $\lambda_{i}$, $k_{i}$ is a constant representing
the basal level, and $\epsilon_{i}(t)$ is the Gaussian noise modeling the
uncertainties in the model and the errors in the data.
The protein activity level follows the dynamic
model:

$$z_{j}(t + 1) = z_{j}(t) + \sum_{k=1}^{M} b_{jk}\,z_{j}(t)\,z_{k}(t) + \alpha_{j}\,x_{j}(t) - \beta_{j}\,z_{j}(t) + h_{j} + \omega_{j}(t), \tag{35}$$

where $b_{jk}$ gives the relationship between the proteins, $\alpha_{j}$
indicates the translation effect of mRNA to protein, and $\omega_{j}(t)$
is the Gaussian noise; the remaining terms mirror the degradation and basal-level terms
of (33). The unknown parameters for both the
models are given by

$$\boldsymbol{\theta}_{i} = [a_{i1} \ \cdots \ a_{iN} \ \ \lambda_{i} \ \ k_{i}], \qquad \boldsymbol{\varphi}_{j} = [b_{j1} \ \cdots \ b_{jM} \ \ \alpha_{j} \ \ \beta_{j} \ \ h_{j}] \tag{36}$$

and are estimated by solving a constrained least squares
problem [43]. Once the individual subnetworks are obtained,
they are merged together to form one cellular network with
the TFs connecting them together.
The problem of inferring an integrated network is still in its
early stages, and several avenues of research are
still open. Moreover, comparison studies are needed so as to
determine the merits and demerits of the different methods
in use.

6. Performance Evaluation
The inference accuracy can be assessed using the knowledge
of a gold-standard network or the true network. In order
to benchmark the algorithms, the correctly identified edges,
or true positives (TPs), need to be counted. In addition,
the number of false positives (FPs), or the edges incorrectly
indicated to be present, and false negatives (FNs), or the edges that were
missed, should also be counted [10], along with the true negatives (TNs), the absent
edges correctly left out. With these
values in hand, the true positive rate or recall, TPR =
TP/(TP + FN), the false positive rate, FPR = FP/(FP + TN),
and the positive predictive value, PPV = TP/(TP + FP),
also called the precision, can be calculated. These quantities
enable the performance to be viewed graphically through the receiver operating
characteristic (ROC) curve, which plots TPR against FPR, and summarized by the area
under it. These
criteria are the most widely used fidelity criteria for gene
network inference algorithms.
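The quantities above can be computed directly from sets of edges. The sketch below assumes that edges are represented as (regulator, target) tuples and that the set of all candidate edges is known; the function names and the small three-gene example are hypothetical.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    return numerator / denominator if denominator else float("nan")


def edge_metrics(predicted, truth, candidates):
    """TPR, FPR, and PPV of a predicted edge set against a gold standard."""
    tp = len(predicted & truth)               # correctly identified edges
    fp = len(predicted - truth)               # edges wrongly reported
    fn = len(truth - predicted)               # edges missed
    tn = len(candidates - predicted - truth)  # absent edges correctly left out
    return {
        "TPR": ratio(tp, tp + fn),
        "FPR": ratio(fp, fp + tn),
        "PPV": ratio(tp, tp + fp),
    }


# Hypothetical three-gene example with all ordered pairs as candidate edges.
genes = ["A", "B", "C"]
candidates = {(a, b) for a in genes for b in genes if a != b}
truth = {("A", "B"), ("B", "C")}
predicted = {("A", "B"), ("A", "C")}
metrics = edge_metrics(predicted, truth, candidates)
```

Sweeping the decision threshold of an inference algorithm and recording the (FPR, TPR) pair at each setting traces out the ROC curve.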
While it is possible to identify the gene regulatory
relationships experimentally, doing so would be both technically prohibitive and very
costly. For this
reason, several in silico and in vivo networks have been
generated to assist in benchmarking the network inference
algorithms. Foremost among these are the DREAM (dialogue
on reverse engineering assessment and methods) [44] and
IRMA (in vivo reverse engineering and modeling assessment)
[45] datasets. Reference [10] provides a unified survey of
some of the important gene network inference
algorithms using these datasets.

7. Discussions and Conclusions
This paper reviews the main statistical methods used for
inference of gene and protein-protein networks. PPI network
inference can be carried out in a wide variety of ways by
exploiting phylogenetic information and sequencing data.
This paper focused only on those inference methods that
employ PPI data.
For the inference of gene regulatory networks, the problem can be simply stated as
follows: given the gene expression data, find the interactions between the genes. Three
major classes of statistical methods were reviewed in this
paper: probabilistic graphical models, state-space models,
and information theoretic methods. For all these methods,
modeling as well as inference techniques were discussed.
It is observed that much progress has been made in the
field of GRN inference. However, almost all of the proposed
network inference methods in the literature work with only
the popular gene expression data sets. An interesting part
of future work could be integrating different data sets and
the available biological knowledge to come up with better and
more robust algorithms.
Comparing the three broad classes of statistical methods
reviewed in the paper, it is found that the information
theoretic methods have advantages over the other methods
in terms of minimal modeling assumptions and, therefore,
are capable of modeling more general networks. Graphical
modeling techniques assume the network to be acyclic in the case
of Bayesian network modeling and provide an undirected
graph when using Markov networks. The nonlinear state-space models work with nonlinear
functions which may not
be truly representative of the underlying network, thereby
resulting in less robust algorithms.
In the case of PPI network prediction, the most popular statistical method is clustering.
In addition, probabilistic graphical modeling techniques are also used. However, several
important avenues of research are still open. Since the Markov
networks and Bayesian networks are able to model PPI
networks efficiently, other probabilistic graphical techniques
such as factor graphs could potentially be used for solving
this inference problem. Clustering methods are more suited
to the PPI network inference problem as the main emphasis
is on the identification of protein complexes. It is found that
certain important and popular modeling techniques may fail
to model PPI networks [46]. Also, clustering methods based
on mutual information could be used [47].
Several statistical methods have been proposed to infer an
integrated network of transcription regulation and protein-protein interaction. A
state-space model for integrated
network inference involves parameter estimation which
indicates the strength of the inhibitory and excitatory regulations. As the cellular
networks are known to be sparse,
employing sparsity-constrained least squares for parameter
estimation as proposed in [25] is expected to result in more
robust inference algorithms.
Recent years have shown tremendous and rapid progress
in the field of cellular network modeling. With the amount
and types of data sets increasing, algorithms combining
multiple datasets are necessary for the future.

Acknowledgments
This paper was made possible by QNRF-NPRP Grant no. 09-874-3-235 and support from NSF
Grant no. 0915444. The
statements made herein are solely the responsibility of the
authors.

References
[1] X. Zhou and S. T. C. Wong, Computational Systems Bioinformatics, World Scientifc, 2008.
[2] Y. Huang, I. M. Tienda-Luna, and Y. Wang, Reverse engineering gene regulatory networks: a survey of statistical models,
IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 7697, 2009.
[3] X. Zhou, X. Wang, and E. R. Dougherty, Genomic Networks:
Statistical Inference from Microarray Data, John Wiley & Sons,
2006.
[4] H. Kitano, Computational systems biology, Nature, vol. 420,
no. 6912, pp. 206210, 2002.
[5] B. Mallick, D. Gold, and V. Baladandayuthapani, Bayesian
Analysis of Gene Expression Data, Wiley, 2009.
[6] H. D. Jong, Modeling and simulation of genetic regulatoy
systems: a literature review, Journal of Computational Biology,
vol. 9, no. 1, pp. 67103, 2002.
[7] X. Cai and X. Wang, Stochastic modeling and simulation of
gene networks, IEEE Signal Processing Magazine, vol. 24, no. 1,
pp. 2736, 2007.
[8] H. Hache, H. Lehrach, and R. Herwig, Reverse engineering of
gene regulatory networks: a comparative study, Eurasip Journal
on Bioinformatics and Systems Biology, vol. 2009, Article ID
617281, 2009.
[9] F. Markowetz and R. Spang, Inferring cellular networksa
review, BMC Bioinformatics, vol. 8, article S5, 2007.
[10] C. A. Penfold and D. L. Wild, How to infer gene networks from
expression profles, revisited, Interface Focus, vol. 3, pp. 857
870, 2011.
[11] J. Wang, M. Li, Y. Deng, and Y. Pan, Recent advances in
clustering methods for protein interaction networks, BMC
Genomics, vol. 11, no. supplement 3, article S10, 2010.
[12] X. Li, M. Wu, C. K. Kwoh, and S. K. Ng, Computational
approaches for detecting protein complexes from protein interaction networks: a survey, BMC Genomics, vol. 11, no. 1, article
S3, 2010.
[13] A. Mortazavi, B. A. Williams, K. McCue, L. Schaefer, and B.
Wold, Mapping and quantifying mammalian transcriptomes
by RNA-Seq, Nature Methods, vol. 5, no. 7, pp. 621628, 2008.
[14] K. Y. Yip, R. P. Alexander, K. K. Yan, and M. Gerstein, Improved
reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data, PLoS ONE, vol. 5, no. 1,
Article ID e8121, 2010.

77

[15] D. Koller and N. Friedman, Probabilistic Graphical Models:


Principles and Techniques, MIT Press, 2009.
[16] K. Murphy and S. Mian, Modeling gene expression data using
dynamic Bayesian networks, Tech. Rep., University of California, Berkeley, Calif, USA, 2001.
[17] Y. Zhang, Z. Deng, H. Jiang, and P. Jia, Inferring gene
regulatory networks from multiple data sources via a dynamic
Bayesian network with structural EM, in DILS, S. C. Boulakia
and V. Tannen, Eds., vol. 4544 of Lecture Notes in Computer
Science, pp. 204214, Springer, 2007.
[18] Z. M. Ibrahim, A. Ngom, and A. Y. Tawfk, Using qualitative
probability in reverse-engineering gene regulatory networks,
IEEE Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 326334, 2011.
[19] N. Kramer, J. Schafer, and A. Boulesteix, Regularized estimation of large-scale gene association networks using graphical
gaussian models, BMC Bioinformatics, vol. 10, no. 1, p. 384,
2009.
[20] P. Menendez, Y. A. I. Kourmpetis, C. J. F. ter Braak, and F. A. van
Eeuwijk, Gene regulatory networks from multifactorial perturbations using graphical lasso: application to the DREAM4
challenge, PLoS ONE, vol. 5, no. 12, Article ID e14147, 2010.
[21] F.-X. Wu, W.-J. Zhang, and A. J. Kusalik, Modeling gene
expression from microarray expression data with state-space
equations, in Pacifc Symposium on Biocomputing, R. B. Altman, A. K. Dunker, L. Hunter, T. A. Jung, and T. E. Klein, Eds.,
pp. 581592, World Scientifc, 2004.
[22] Z. Wang, F. Yang, D. W. C. Ho, S. Swif, A. Tucker, and X. Liu,
Stochastic dynamic modeling of short gene expression timeseries data, IEEE Transactions on Nanobioscience, vol. 7, no. 1,
pp. 4455, 2008.
[23] M. Quach, N. Brunel, and F. Dalche-Buc, Estimating parameters and hidden variables in non-linear state-space models
based on ODEs for biological networks inference, Bioinformatics, vol. 23, no. 23, pp. 32093216, 2007.
[24] Z. Wang, X. Liu, Y. Liu, J. Liang, and V. Vinciotti, An extended
kalman fltering approach to modeling nonlinear dynamic gene
regulatory networks via short gene expression time series,
IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 3, pp. 410419, 2009.
[25] A. Noor, E. Serpedin, M. N. Nounou, and H. N. Nounou, Inferring gene regulatory networks via nonlinear state-space models
and exploiting sparsity, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 12031211,
2012.
[26] A. Noor, E. Serpedin, M. Nounou, and H. Nounou, Inferring
gene regulatory networks with nonlinear models via exploiting
sparsity, in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP 12), pp. 725728, March 2012.
[27] R. Tibshirani, Regression shrinkage and selection via the lasso,
Journal of the Royal Statistical Society B, vol. 58, pp. 267288,
1996.
[28] C. Koh, F. X. Wu, G. Selvaraj, and A. J. Kusalik, Using a statespace model and location analysis to infer time-delayed Regulatory Networks, Eurasip Journal on Bioinformatics and Systems
Biology, vol. 2009, Article ID 484601, 3 pages, 2009.
[29] A. A. Margolin, I. Nemenman, K. Basso et al., ARACNE: an
algorithm for the reconstruction of gene regulatory networks
in a mammalian cellular context, BMC Bioinformatics, vol. 7,
no. supplement 1, article S7, 2006.

78
[30] W. Zhao, E. Serpedin, and E. R. Dougherty, Inferring connectivity of genetic regulatory networks using informationtheoretic criteria, IEEE/ACM Transactions on Computational
Biology and Bioinformatics, vol. 5, no. 2, pp. 262274, 2008.
[31] A. Noor, E. Serpedin, M. N. Nounou, H. N. Nounou, N.
Mohamed, and L. Chouchane, Information theoretic methods
for modeling of gene regulatory networks, in IEEE Symposium
on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 12), pp. 418423, 2012.
[32] T. Cover and J. Tomas, Elements of Information Teory, Wiley
Interscience, 2006.
[33] W. Zhao, E. Serpedin, and E. R. Dougherty, Inferring gene
regulatory networks from time series data using the minimum
description length principle, Bioinformatics, vol. 22, no. 17, pp.
21292135, 2006.
[34] M. Vidyasagar, Probabilistic methods in cancer biology, Childhood, vol. 20, pp. 8289, 2011.
[35] P. Zoppoli, S. Morganella, and M. Ceccarelli, TimeDelayARACNE: reverse engineering of gene networks from timecourse data by an information theoretic approach, BMC Bioinformatics, vol. 11, no. 1, article 154, 2010.
[36] J. Dougherty, I. Tabus, and J. Astola, Inference of gene regulatory networks based on a universal minimum description
length, Eurasip Journal on Bioinformatics and Systems Biology,
vol. 2008, Article ID 482090, 2008.
[37] A. Jaimovich, G. Elidan, H. Margalit, and N. Friedman,
Towards an integrated protein-protein interaction network: a
relational Markov network approach, Journal of Computational
Biology, vol. 13, no. 2, pp. 145164, 2006.
[38] Y. Qi, F. Balem, C. Faloutsos, J. Klein-Seetharaman, and Z. BarJoseph, Protein complex identifcation by supervised graph
local clustering, Bioinformatics, vol. 24, no. 13, pp. i250i268,
2008.
[39] J. R. Bradford, C. J. Needham, A. J. Bulpitt, and D. R. Westhead,
Insights into protein-protein interfaces using a Bayesian network prediction method, Journal of Molecular Biology, vol. 362,
no. 2, pp. 365386, 2006.
[40] D. Greene, G. Cagney, N. Krogan, and P. Cunningham, Ensemble non-negative matrix factorization methods for clustering
protein-protein interactions, Bioinformatics, vol. 24, no. 15, pp.
17221728, 2008.
[41] N. Nariai, Y. Tamada, S. Imoto, and S. Miyano, Estimating
gene regulatory networks and protein-protein interactions of
Saccharomyces cerevisiae from multiple genome-wide data,
Bioinformatics, vol. 21, no. supplement 2, pp. ii206ii212, 2005.
[42] C. W. Li and B. S. Chen, Identifying functional mechanisms of
gene and protein regulatory networks in response to a broader
range of environmental stresses, Comparative and Functional
Genomics, vol. 2010, Article ID 408705, 2010.
[43] Y. C. Wang and B. S. Chen, Integrated cellular network
of transcription regulations and protein-protein interactions,
BMC Systems Biology, vol. 4, no. 1, article 20, 2010.
[44] http://wiki.c2b2.columbia.edu/dream.
[45] I. Cantone, L. Marucci, F. Iorio et al., A yeast synthetic network
for in vivo assessment of reverse-engineering and modeling
approaches, Cell, vol. 137, no. 1, pp. 172181, 2009.
[46] R. Schweiger, M. Linial, and N. Linial, Generative probabilistic
models for protein-protein interaction networks-the biclique
perspective, Bioinformatics, vol. 27, no. 13, pp. i142i148, 2011.
[47] X. Zhou, X. Wang, and E. R. Dougherty, Construction of
genomic networks using mutual-information clustering and

Advances in Bioinformatics
reversible-jump Markov-chain-Monte-Carlo predictor design,
Signal Processing, vol. 83, no. 4, pp. 745761, 2003.

Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Using Protein Clusters from Whole Proteomes to Construct and
Augment a Dendrogram
Yunyun Zhou,1 Douglas R. Call,1,2 and Shira L. Broschat1,2,3

1 School of Electrical Engineering and Computer Science, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA
2 Paul G. Allen School for Global Animal Health, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA
3 Department of Veterinary Microbiology and Pathology, Washington State University, P.O. Box 642752, Pullman, WA 99164-2752, USA

Correspondence should be addressed to Shira L. Broschat; shira@eecs.wsu.edu
Received 19 November 2012; Revised 3 January 2013; Accepted 13 January 2013
Academic Editor: Yves Van de Peer
Copyright © 2013 Yunyun Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In this paper we present a new ab initio approach for constructing an unrooted dendrogram using protein clusters, an approach that has the potential for estimating relationships among several thousands of species based on their putative proteomes. We employ an open-source software program called pClust that was developed for use in metagenomic studies. Sequence alignment is performed by pClust using the Smith-Waterman algorithm, which is known to give optimal alignment and, hence, greater accuracy than BLAST-based methods. Protein clusters generated by pClust are used to create protein profiles for each species in the dendrogram, these profiles forming a correlation filter library for use with a new taxon. To augment the dendrogram with a new taxon, a protein profile for the taxon is created using BLASTp, and this new taxon is placed into a position within the dendrogram corresponding to the highest correlation with profiles in the correlation filter library. This work was initiated because of our interest in plasmids, and each step is illustrated using proteomes from Gram-negative bacterial plasmids. Proteomes for 527 plasmids were used to generate the dendrogram, and to demonstrate the utility of the insertion algorithm twelve recently sequenced pAKD plasmids were used to augment the dendrogram.

1. Introduction
The availability of complete proteomes for hundreds of thousands of species provides an unprecedented opportunity to study genetic relationships among a large number of species. However, the software tools needed to handle such massive amounts of data must first be developed before we can exploit the availability of these proteomes. Currently the tools used for clustering are restricted either in the number of proteomes that can be examined, because of the time required to obtain results, or in their sensitivity. For example, clustering by means of hidden Markov models (HMM), multiple sequence alignment, and pairwise sequence alignment by means of the Smith-Waterman alignment algorithm are limited by their time complexity. The Smith-Waterman algorithm, a dynamic programming algorithm, is known to give optimal alignment between two protein sequences for a given similarity matrix [1], but alignment of two sequences of lengths $m$ and $n$ requires $O(mn)$ time. On the other hand, heuristic approximate alignment methods, frequently based on BLAST and its variants [2], reduce the computational time required; for example, in practice BLAST effectively reduces the time to $O(n)$, but this comes at the risk of losing sensitivity to homology detection. In fact, numerous articles (for example, see [3, 4]) have discussed this loss of sensitivity in BLAST-based results compared to those of the Smith-Waterman algorithm. To ensure that a maximum number of homologous sequences are identified, highly sensitive pairwise homology detection is required. Otherwise, the clusters of homologous sequences obtained by means of a given clustering method will not include all possible members and, ultimately, the final results will be less accurate.

Figure 1: Flowchart for tree construction using pClust. [Flow shown: input plasmid protein sequences (P1 to P7) are clustered by pClust into protein clusters (C1 to C7); the clusters define binary protein profiles (rows PM1 to PM6, columns C1 to C7); a distance metric is applied to the profiles; the tree is constructed.]
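
The quadratic cost of the Smith-Waterman algorithm discussed above can be seen directly in a minimal sketch of its scoring recurrence. The version below is illustrative only: it uses a simple match/mismatch score and a linear gap penalty (both assumptions) rather than a protein similarity matrix with affine gaps, and it is not the implementation used inside pClust.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Illustrative scoring only: fixed match/mismatch values and a linear
    gap penalty (assumptions, not the settings used by pClust).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment
                prev[j - 1] + sub, # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The two nested loops visit every cell of an (m + 1) by (n + 1) table once, which is where the $O(mn)$ running time comes from; keeping only two rows reduces memory but not time.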
In this work we use an alternative sequence comparison algorithm and clustering method called pClust. Rather than approximating Smith-Waterman, pClust systematically eliminates sequence pairs with little likelihood of having alignments and then employs the Smith-Waterman algorithm only on promising pairs [5]. Clustering is accomplished using a method based on a previously developed approach called shingling [6]. By filtering out unlikely sequences and using the Smith-Waterman algorithm judiciously, pClust remains highly sensitive to sequence homology without loss of speed. In an unpublished study of 6,602 proteins from four bacterial proteomes, pClust and BLAST results were compared, and BLASTp missed more than 69% of the aligned pairs identified by pClust. In a different study, a direct clusters-to-clusters comparison was performed with BLAST results used as the test and pClust results used as the benchmark [7]. The results showed that all the BLAST results were included within the pClust results but BLAST missed 14% of the clustered pairs obtained with pClust. In addition to being sensitive and fast, pClust is readily parallelizable, which matters because clustering proteins from the proteomes of thousands of species will require high-performance computing platforms and the use of parallel algorithms.
This work was initiated by our interest in plasmids. We wanted a software tool that would allow us to obtain genetic relationships among 527 Gram-negative bacterial plasmids based on their putative proteome sequences. In addition, we wanted an efficient means of adding new plasmids to our initial dendrogram as their proteomes become available. Plasmids are typically circular DNA sequences that can transfer between and replicate within bacteria and that are generally classified as broad- or narrow-host range [8, 9]. Plasmid sequences are described as mosaic because they are composed of DNA arising from many sources [10]. Plasmids serve to shuttle important adaptive traits, such as antibiotic resistance, between organisms [11, 12]. Consequently, understanding the genetic relationships among plasmids is important, for example, in the study of microbial evolution, in medical epidemiology, and in assessing the dissemination of antibiotic resistance genes [13, 14]. There are a number of approaches to examining plasmid relationships. Some researchers focus on the identification of important plasmid backbone genes that are involved in horizontal gene transfer (HGT) or replication within bacterial hosts [15, 16]. Some approaches compare compositional features such as genomic signatures and codon usage [5, 17]. Some researchers use network-based representations to explore genetic relationships among plasmids [5, 18, 19]. In this work we use the whole proteomes of 527 Gram-negative (GN) bacterial plasmids to construct a dendrogram.
We use protein cluster information from pClust to construct our dendrogram and then to predict the relationship of new plasmids within the structure of this tree. A binary profile is created for each species, indicating the presence or absence of a protein in each cluster (Figure 1). The concatenation of all the profiles results in a binary matrix from which a distance matrix is calculated, and neighbor joining is then used to construct a dendrogram. The binary matrix also can be viewed as a library of individual profiles that can serve as correlation filters for a new taxon. A profile for a new taxon can be quickly correlated with the profiles in the library to identify the profile with the highest correlation coefficient. This correlation coefficient is then evaluated based on known biological information and a decision is made as to whether the taxon should be added to the tree. If it is to be added, its binary profile is added to the binary matrix, a new distance matrix is calculated, and neighbor joining is again used to construct a new dendrogram with the additional taxon. To demonstrate the algorithm on new plasmids, we focus on sequences from twelve pAKD plasmids that were isolated from Norwegian soil [20]. These plasmids belong to incompatibility groups IncP-1() and IncP-1(). A phylogenetic tree constructed using multiple alignment of the relaxase gene traI is presented by Sen et al. [20] and serves as a basis of comparison for our augmentation results.

Figure 2: Flowchart for insertion of a new taxon into an existing tree using a correlation filter library. [Flow shown: proteins (P1, P2, ...) are extracted from the new plasmid; BLASTp against the existing protein clusters (C1 to C7) yields a new binary profile; the correlation coefficient between the new profile and every profile in the correlation filter library (PM1 to PM6) is calculated; the new plasmid is classified near the plasmids with the largest coefficient and inserted into the original tree.]
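
The binary profiles described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (a dictionary mapping each cluster to its member (plasmid, protein) pairs) and the function name `build_profiles` are hypothetical and do not reflect the actual pClust output file format.

```python
def build_profiles(clusters, plasmid_ids):
    """Build one presence/absence profile per plasmid.

    clusters: dict mapping cluster id -> list of (plasmid_id, protein_id)
              pairs (hypothetical layout, not the pClust file format).
    Returns the ordered cluster ids and a dict plasmid_id -> list of 0/1.
    """
    cluster_ids = sorted(clusters)
    column = {cid: k for k, cid in enumerate(cluster_ids)}
    profiles = {pid: [0] * len(cluster_ids) for pid in plasmid_ids}
    for cid, members in clusters.items():
        for plasmid_id, _protein_id in members:
            # 1 means the plasmid has at least one protein in this cluster
            profiles[plasmid_id][column[cid]] = 1
    return cluster_ids, profiles


# Toy data, invented purely for illustration.
toy_clusters = {
    "C1": [("PM1", "P1"), ("PM3", "P5")],
    "C2": [("PM2", "P3"), ("PM3", "P6")],
    "C3": [("PM1", "P2"), ("PM2", "P4")],
}
cluster_ids, profiles = build_profiles(toy_clusters, ["PM1", "PM2", "PM3"])
```

Stacking the profile rows gives the binary matrix from which the distance matrix is computed; the same rows double as the correlation filter library.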

2. Materials and Methods


2.1. Data Preparation. Zhou et al. [21] presented a virtual hybridization method to construct a dendrogram for 527 GN bacterial plasmids with 50 or more putative coding genes. The same plasmids are used in this study to facilitate comparison. BLASTp with default parameters was used to remove duplicate proteins within plasmid sequences using a similarity score defined by the formula (length of matching sequence) × (BLAST identity score) / (length of reference protein + length of matching sequence) ≥ 0.45; that is, proteins with scores ≥ 0.45 were considered to be duplicates [22]. The maximum score of 0.5 is obtained when two proteins are an exact match. Including the matching sequence length in the denominator of the formula ensures that a large difference in sequence lengths does not bias the results. After removal of duplicate proteins, more than 97,000 protein sequences remained.
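
As a worked illustration of the similarity score, the sketch below evaluates the formula for a few cases. It assumes the BLAST identity score is expressed as a fraction between 0 and 1, which is consistent with the stated maximum of 0.5 for an exact match; the function and parameter names are hypothetical.

```python
def similarity_score(match_len, identity, ref_len):
    """(matching length x identity) / (reference length + matching length).

    identity is assumed to be a fraction in [0, 1] (an assumption that
    matches the stated maximum score of 0.5 for an exact match).
    """
    return match_len * identity / (ref_len + match_len)


def is_duplicate(match_len, identity, ref_len, cutoff=0.45):
    """Apply the duplicate-removal cutoff quoted in the text."""
    return similarity_score(match_len, identity, ref_len) >= cutoff


exact = similarity_score(300, 1.0, 300)      # full-length, identical
partial = similarity_score(150, 1.0, 300)    # identical over half the length
```

Because the matching length appears in the denominator, a short but perfect match against a long reference protein scores well below the exact-match value, which is the length-bias safeguard described above.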
2.2. Dendrogram Construction. The flowchart in Figure 1 shows the approach used to construct a dendrogram for the plasmids based on the >97,000 plasmid protein sequences. The protein sequences $P_1, P_2, \ldots, P_n$ are used as input into the pClust program [5], which employs the Smith-Waterman algorithm to perform pairwise comparison of a subset of the sequences. The output from pClust is composed of clusters $C_1, C_2, \ldots, C_m$ of homologous proteins. Protein profiles $PM_1, PM_2, \ldots, PM_k$ are then created for all the plasmids from the pClust output files. Each profile consists of a binary sequence with 1 indicating the presence of a protein and 0 indicating absence (Figure 1). The pClust software was used with default settings in the configuration file except for ExactMatchLen, for which a value of 4 was used. A total of 6,618 clusters (defined as having at least two proteins) were identified by pClust. The resulting 527 × 6,618 binary matrix was used to construct the dendrogram for two different distance measures. The Jaccard distance metric was originally developed for computation with binary matrices and is given by

$$ d_J = \frac{M_{10} + M_{01}}{M_{10} + M_{01} + M_{11}}, \qquad (1) $$

where $M_{10}$ is the number of clusters $C_1, C_2, \ldots, C_m$ that are 1 for species $i$ and 0 for species $j$, $M_{01}$ is the number of clusters that are 0 for species $i$ and 1 for species $j$, and $M_{11}$ is the number of clusters that are 1 for both species $i$ and $j$. We also employ a conventional Euclidean distance metric. For both metrics, a neighbor-joining algorithm was used to obtain the final dendrogram.

Figure 3: Jaccard distance tree for 50 Gram-negative plasmids. [Each leaf is labeled with an accession number, host species, and plasmid name; the hosts are Salmonella, Escherichia, Aeromonas, Pseudomonas, Photobacterium, Yersinia, Enterobacter, Klebsiella, Serratia, and Borrelia species.]
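
The two distance measures can be sketched for a pair of binary profiles as follows. This is a minimal pure-Python illustration with hypothetical function names; it computes only the pairwise distances and does not include the neighbor-joining step.

```python
import math


def jaccard_distance(x, y):
    """Jaccard distance between two equal-length 0/1 profiles."""
    m10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    denom = m10 + m01 + m11
    # Two all-zero profiles share no clusters; treat them as distance 0
    # (a convention chosen here, not stated in the text).
    return (m10 + m01) / denom if denom else 0.0


def euclidean_distance(x, y):
    """Conventional Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(profiles, metric):
    """Symmetric matrix of pairwise distances for a list of profiles."""
    n = len(profiles)
    return [[metric(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
d_jaccard = jaccard_distance(x, y)      # (1 + 1) / (1 + 1 + 2)
d_euclid = euclidean_distance(x, y)     # sqrt(2)
```

Note that clusters absent from both profiles do not enter the Jaccard distance at all, whereas for binary data the squared Euclidean distance simply counts the mismatching clusters.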
2.3. Insertion of New Plasmids. As additional plasmid gene sequences become available, we can repeat the procedure described in the previous section to obtain a new dendrogram. The amount of computation and time required to accomplish this task, however, is excessive considering the incremental gain that may be achieved. For example, the original execution time for the 527-plasmid tree was 72 hours on an Intel Xeon CPU E5420 machine with 32 GB of memory. Instead it is preferable to have a means of inserting new plasmids into the existing tree structure as described in this section, where execution of the insertion algorithm takes only a few minutes on a laptop computer.

Figure 4: Euclidean distance tree for 50 Gram-negative plasmids. [Each leaf is labeled with an accession number, host species, and plasmid name.]
To insert a new plasmid into an existing dendrogram, proteins $P_1, P_2, \ldots, P_n$ from a new plasmid are extracted from the plasmid proteome (Figure 2). BLASTp is performed with these proteins against all the proteins in the 6,618 clusters to determine the protein profile for the new plasmid. A protein is considered to be a member of a cluster when its similarity score is >0.2. The similarity score is given by (length of matching sequence) × (BLAST identity score) / (length of reference protein + length of matching sequence). The cutoff value of 0.2 is consistent with the 40% sequence similarity used as a parameter setting in pClust, since a full-length match at 40% identity gives 0.4/2 = 0.2. Correlation filtering is then performed with the correlation filter library consisting of the protein profiles of the original 527 GN bacterial plasmids. Pearson's product-moment correlation coefficient, whose absolute value is less than or equal to 1, is used to measure the correlation between two profiles [23, 24]. The larger the correlation value, the greater the similarity between two profiles. This value is used to determine whether the plasmid fits into the dendrogram and, if so, where it should be located as explained in the discussion section. When appropriate, the new protein profile is added to the binary matrix, and a tree is constructed from the entire matrix as described in the previous section.

Figure 5: Subtree for 12 pAKD plasmids. [Leaves shown: the twelve pAKD plasmids (pAKD25, pAKD16, pAKD34, pAKD26, pAKD31, pAKD33, pAKD15, pAKD14, pAKD17, pAKD18, pAKD29, and pAKD1; accessions JN106164 to JN106175), annotated >0.7, followed by neighboring plasmids from Ralstonia, Achromobacter, Acidovorax, Comamonas, Pseudomonas, Enterobacter, Delftia, Burkholderia, Photobacterium, Yersinia, Aromatoleum, and Candidatus hosts, annotated >0.5.]
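
The correlation filtering step can be sketched as below. This is a minimal illustration, not the authors' code: the function names and the toy library are hypothetical, and the default threshold of 0.5 is the minimum correlation value the paper reports using in its discussion of results.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, report 0
    return cov / math.sqrt(var_x * var_y)


def best_match(new_profile, library, threshold=0.5):
    """Return (name, r, accept) for the library profile with the largest r."""
    scores = {name: pearson(new_profile, prof)
              for name, prof in library.items()}
    name = max(scores, key=scores.get)
    return name, scores[name], scores[name] >= threshold


# Toy library, invented for illustration only.
library = {
    "PM1": [1, 0, 1, 1, 0, 0],
    "PM2": [0, 1, 0, 0, 1, 1],
}
name, r, accept = best_match([1, 0, 1, 1, 0, 0], library)
```

A plasmid that passes the threshold would then have its profile appended to the binary matrix before the tree is rebuilt; the biological plausibility check described in the text remains a separate, manual decision.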

3. Results and Discussions


3.1. 527-Plasmid Dendrogram. Following the procedure described above, a dendrogram was constructed for 527 GN bacterial plasmids. Because of its size, it is not shown, but it is available as supplementary information in Newick standard format (.nwk) for both Jaccard and Euclidean distance metrics and can be viewed using MEGA5 [25]. A tree constructed using the Jaccard distance metric for the same subset of 50 plasmids used in [21] is shown in Figure 3, and the Euclidean distance version is shown in Figure 4. These trees are very similar with only a slight difference in the clustering of the Borrelia plasmids. The tree constructed using the Euclidean distance metric is closer to the one shown in [21], but the Jaccard tree does a better job of clustering the Borrelia plasmids [26, 27]. The Jaccard distance metric is commonly used for a binary matrix. Nevertheless, the results based on Euclidean distance compare favorably with those obtained for a nonbinary intensity matrix using a different approach [21]. It is not clear which distance metric gives more accurate results, so users should construct trees with both metrics and decide which one is more accurate on the basis of the biology of the system.
3.2. Insertion of New Plasmids. We applied our correlation filter algorithm to twelve new plasmids from the pAKD family [20]. The twelve plasmids cluster together and are most closely grouped with genera typical of other soil bacteria. The correlation coefficient values among the pAKD plasmids were >0.7 and, relative to the other plasmids, decreased with distance to >0.5 (Figure 5). pAKD plasmids 16, 25, and 34 belong to the IncP-1() compatibility group and form a discrete cluster; pAKD plasmids 1, 14, 15, 17, 18, 29, 31, and 33 cluster as the IncP-1() compatibility group. Although pAKD26 falls into the IncP-1() clade, it should be in the IncP-1() group if compatibility grouping is considered the gold standard for comparison. Nevertheless, the placement is distal from the eight other plasmids in the group, and pAKD26 was actually designated as IncP-1-2 to differentiate it from the other eight plasmids as recently described in [28]. Our results are consistent with [20].
Importantly, the correlation coefficient is used to check the final dendrogram; that is, a new plasmid should be located near the plasmid with which it is most highly correlated. In addition, the correlation coefficient is used to determine whether a plasmid should even be inserted into a dendrogram. In other words, how does the magnitude of the correlation coefficient influence our confidence in the placement of a new plasmid within an existing dendrogram? Several works offer guidelines for the interpretation of a correlation coefficient [29, 30], but all criteria are in some way arbitrary and ultimately interpretation of a correlation coefficient depends on the purpose. In our case, we chose a value of 0.5, but we also require biological evidence, for example, that a plasmid is, in fact, from a GN bacterium.
To further examine the correlation coefficient, we randomly selected 10 Gram-positive (GP) bacterial plasmid proteomes from 10 different genera. The correlation coefficients were found to range from 0.112 to 0.234. GP bacterial plasmids do not belong in our GN bacterial plasmid dendrogram, and our minimum correlation value of 0.5 suffices to exclude these unrelated plasmids. While this level of discrimination is easy to identify, we should note that the 527 GN bacterial plasmids considered in this study do not represent the full diversity of GN plasmids. Thus, it is possible to obtain a small correlation coefficient value for a completely new and uncharacterized GN plasmid. If the new plasmid is able to meet an underlying correlation threshold, it can be placed within the dendrogram structure, and by incorporating the new plasmid sequence information into the correlation filter library, we can group future plasmids that may be closely related to it.
While the method of inserting new plasmids into an existing tree is fast and efficient, at some point, generation of a new dendrogram using all proteins from all the taxa will probably be required. We do not know at what point this will occur, but we assume it will be necessary eventually to ensure that all possible protein clusters are included. Recall that a cluster must contain at least two proteins to be considered a cluster. Thus, any new plasmid containing a protein that would have formed a cluster with a single discarded protein represents incomplete information in the library. It is probable that the total number of clusters for all Gram-negative plasmids will ultimately be much greater than 6,618.

4. Conclusion
In this work we present a new ab initio method for constructing a dendrogram from whole proteomes that begins with output from pClust, a software program developed for homology detection for large-scale protein sequence analyses. We develop an efficient approach for insertion of a new species into the dendrogram based on the use of a correlation filter library. This is much more efficient than constructing an entirely new tree, which is computationally costly. We illustrate our method by creating a dendrogram for 527 Gram-negative bacterial plasmids and augmenting this dendrogram with twelve pAKD plasmids isolated from Norwegian soil. For purposes of comparison, we also construct a smaller dendrogram consisting of 50 species and use two different distance metrics. The two resulting trees agree well with results shown in [21]. The classification results for the twelve plasmids agree with a phylogenetic tree constructed using multiple sequence alignment of the relaxase gene traI presented in [20].

Authors Contribution
Y. Zhou and S. L. Broschat performed the research for this
paper, and all three authors shared in the preparation of the
paper.

Conflict of Interests
This work was not influenced by any commercial agency, and
no conflict of interests exists.

Acknowledgments
The authors are grateful to the Carl M. Hansen Foundation
for partial support of Y. Zhou and the Washington State
Agricultural Research Center and College of Veterinary
Medicine Agricultural Animal Health program for support
of D. R. Call.

References
[1] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[2] S. F. Altschul, T. L. Madden, A. A. Schäffer et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
[3] D. L. Brutlag, J.-P. Dautricourt, R. Diaz, J. Fier, B. Moxon, and R. Stamm, "BLAZE: an implementation of the Smith-Waterman sequence comparison algorithm on a massively parallel computer," Computers and Chemistry, vol. 17, no. 2, pp. 203–207, 1993.
[4] E. G. Shpaer, M. Robinson, D. Yee, J. D. Candlin, R. Mines, and T. Hunkapiller, "Sensitivity and selectivity in protein similarity searches: a comparison of Smith-Waterman in hardware to BLAST and FASTA," Genomics, vol. 38, no. 2, pp. 179–191, 1996.
[5] C. Wu, A. Kalyanaraman, and W. R. Cannon, "PGraph: efficient parallel construction of large-scale protein sequence homology graphs," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 10, Article ID 6127863, pp. 1923–1933, 2012.
[6] D. Gibson, R. Kumar, and A. Tomkins, "Discovering large dense subgraphs in massive graphs," in Proceedings of the 31st International Conference on Very Large Data Bases, pp. 721–732, September 2005.
[7] A. Kalyanaraman, S. Aluru, S. Kothari, and V. Brendel, "Efficient clustering of large EST data sets on parallel computers," Nucleic Acids Research, vol. 31, no. 11, pp. 2963–2974, 2003.
[8] E. Bapteste, Y. Boucher, J. Leigh, and W. F. Doolittle, "Phylogenetic reconstruction and lateral gene transfer," Trends in Microbiology, vol. 12, no. 9, pp. 406–411, 2004.
[9] E. Fidelma Boyd, C. W. Hill, S. M. Rich, and D. L. Hartl, "Mosaic structure of plasmids from natural populations of Escherichia coli," Genetics, vol. 143, no. 3, pp. 1091–1100, 1996.
[10] H. Ochman, J. G. Lawrence, and E. A. Groisman, "Lateral gene transfer and the nature of bacterial innovation," Nature, vol. 405, no. 6784, pp. 299–304, 2000.
[11] C. M. Thomas, "Paradigms of plasmid organization," Molecular Microbiology, vol. 37, no. 3, pp. 485–491, 2000.
[12] C. M. Thomas and K. M. Nielsen, "Mechanisms of, and barriers to, horizontal gene transfer between bacteria," Nature Reviews Microbiology, vol. 3, no. 9, pp. 711–721, 2005.
[13] M. Couturier, F. Bex, P. L. Bergquist, and W. K. Maas, "Identification and classification of bacterial plasmids," Microbiological Reviews, vol. 52, no. 3, pp. 375–395, 1988.
[14] J. J. Dennis, "The evolution of IncP catabolic plasmids," Current Opinion in Biotechnology, vol. 16, no. 3, pp. 291–298, 2005.
[15] J. Huang and J. P. Gogarten, "Ancient horizontal gene transfer can benefit phylogenetic reconstruction," Trends in Genetics, vol. 22, no. 7, pp. 361–366, 2006.
[16] S. Karlin and C. Burge, "Dinucleotide relative abundance extremes: a genomic signature," Trends in Genetics, vol. 11, no. 7, pp. 283–290, 1995.
[17] S. Karlin, "Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes," Trends in Microbiology, vol. 9, no. 7, pp. 335–343, 2001.
[18] M. Brilli, A. Mengoni, M. Fondi, M. Bazzicalupo, P. Liò, and R. Fani, "Analysis of plasmid genes by phylogenetic profiling and visualization of homology relationships using Blast2Network," BMC Bioinformatics, vol. 9, article 551, 2008.
[19] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, "Network analyses structure genetic diversity in independent genetic worlds," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.
[20] D. Sen, G. A. Van der Auwera, L. M. Rogers, C. M. Thomas, C. J. Brown, and E. M. Top, "Broad-host-range plasmids from agricultural soils have IncP-1 backbones with diverse accessory genes," Applied and Environmental Microbiology, vol. 77, pp. 7975–7983, 2011.

[21] Y. Zhou, D. R. Call, and S. L. Broschat, "Genetic relationships among 527 Gram-negative bacterial plasmids," Plasmid, vol. 68, no. 2, pp. 133–141, 2012.
[22] D. R. Call, R. S. Singer, D. Meng et al., "blaCMY-2-positive IncA/C plasmids from Escherichia coli and Salmonella enterica are a distinct component of a larger lineage of plasmids," Antimicrobial Agents and Chemotherapy, vol. 54, no. 2, pp. 590–596, 2010.
[23] J. L. Rodgers and W. A. Nicewander, "Thirteen ways to look at the correlation coefficient," The American Statistician, vol. 42, pp. 59–66, 1988.
[24] M. S. Stigler, "Francis Galton's account of the invention of correlation," Statistical Science, vol. 4, pp. 73–79, 1989.
[25] K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar, "MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods," Molecular Biology and Evolution, vol. 28, no. 10, pp. 2731–2739, 2011.
[26] M. Lescot, S. Audic, C. Robert et al., "The genome of Borrelia recurrentis, the agent of deadly louse-borne relapsing fever, is a degraded subset of tick-borne Borrelia duttonii," PLoS Genetics, vol. 4, no. 9, Article ID e1000185, 2008.
[27] J. E. Purser and S. J. Norris, "Correlation between plasmid content and infectivity in Borrelia burgdorferi," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 25, pp. 13865–13870, 2000.
[28] P. Norberg, M. Bergstrom, V. Jethava, D. Dubhashi, and M. Hermansson, "The IncP-1 plasmid backbone adapts to different host bacterial species and evolves through homologous recombination," Nature Communications, vol. 2, article 268, 2011.
[29] A. Buda and A. Jarynowski, "Life-time of correlations and its applications," Wydawnictwo Niezalezne, vol. 1, pp. 5–21, 2010.
[30] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, Lawrence Erlbaum Associates, Hillsdale, NJ, USA, 2nd edition, 1988.

Hindawi Publishing Corporation


Advances in Bioinformatics
Volume 2014, No. 1, June 2014
doi:10.1155/2012/705435

Research Article
Solving the 0/1 Knapsack Problem by a Biomolecular DNA
Computer
Hassan Taghipour,1 Mahdi Rezaei,2 and Heydar Ali Esmaili1
1 Department of Pathology, Tabriz University of Medical Sciences, Tabriz, Iran
2 Department of Theoretical Physics and Astrophysics, University of Tabriz, Tabriz 51664, Iran

Correspondence should be addressed to Hassan Taghipour; taghipourh@yahoo.com


Received 2 November 2012; Revised 11 January 2013; Accepted 11 January 2013
Academic Editor: Bhaskar Dasgupta
Copyright © 2013 Hassan Taghipour et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Solving some mathematical problems such as NP-complete problems by conventional silicon-based computers is problematic and
takes a very long time. DNA computing is an alternative method of computing which uses DNA molecules for computing purposes.
DNA computers have massive degrees of parallel processing capability. The massively parallel processing characteristic of DNA
computers is of particular interest in solving NP-complete and hard combinatorial problems. NP-complete problems such as the
knapsack problem and other hard combinatorial problems can be easily solved by DNA computers in a very short period of time
compared with conventional silicon-based computers. Sticker-based DNA computing is one of the methods of DNA computing. In
this paper, sticker-based DNA computing was used for solving the 0/1 knapsack problem. At first, a biomolecular solution space
was constructed by using appropriate DNA memory complexes. Then, by the application of a sticker-based parallel algorithm using
biological operations, the knapsack problem was resolved in polynomial time.

1. Introduction
DNA encodes the genetic information of cellular organisms. The unique and specific structure of DNA makes it one of the favorite candidates for computing purposes. In comparison with conventional silicon-based computers, DNA computers have massive degrees of miniaturization and parallelism. With recent technology, about $10^{18}$ DNA molecules can be produced and placed in a medium-sized laboratory test tube. Each of these DNA molecules could act as a small processor. Biological operations such as hybridization, separation, setting, and clearing can be performed simultaneously on all of these DNA strands. Thus, in an in vitro assay, we could handle about $10^{18}$ DNA molecules, or we can say that $10^{18}$ data processors can be executed in parallel.
In 1994, Adleman introduced DNA computing as a new method of parallel computing [1]. Adleman succeeded in solving a seven-point Hamiltonian path problem solely by manipulating DNA molecules and suggested that DNA could be used to solve complex mathematical problems.
In 1999, a new model of DNA computing (the sticker model) was introduced by Roweis et al. [2]. This model has a kind of random access memory that requires no strand extension, uses no enzymes, and its materials are reusable. Sticker-based DNA computing has the potential to be a universal method in DNA computing. Roweis et al. [2] also proposed a specific machine architecture for implementing the sticker model as a microprocessor-controlled parallel robotic workstation. Thus, the operations used in the sticker model can be performed on fully automated devices, which is helpful in reducing the error rates of operations.
In this paper, we applied the sticker model to solving the knapsack problem, which is one of the NP-complete problems.
The paper is organized as follows. Section 2 introduces the DNA structure and various DNA computing models and discusses sticker-based DNA computing and the biological operations which are used in the sticker model. Section 3 introduces a DNA-based algorithm for solving the knapsack problem in the sticker model.

5'-TGCATTCCG-3'
3'-ACGTAAGGC-5'

Figure 1: A DNA molecule.

2. Basics of DNA Computing


2.1. Structure of DNA and DNA Computing Models. DNA is a polymeric, double-stranded molecule which is composed of monomers called nucleotides. Nucleotides are the building blocks of DNA, and each of them contains three components: a sugar, a phosphate group, and a nitrogenous base. There are four different nitrogenous bases which contribute to DNA structure: Thymine (T) and Cytosine (C), which are called pyrimidines, and Adenine (A) and Guanine (G), which are called purines. Because nitrogenous bases are the variable components of nucleotides, different nucleotides are distinguished by the nitrogenous bases in their structure. For this reason, the names of the bases are used to refer to the nucleotides, and the nucleotides are simply represented as A, G, C, and T. The nucleotides are linked together by phosphodiester bonds and form a single-stranded DNA (ssDNA). An ssDNA molecule can be likened to a string consisting of a combination of four different symbols, A, G, C, and T. Mathematically, this means that we have a four-letter alphabet $\Sigma = \{A, G, C, T\}$ to encode information. Two ssDNA molecules join together to form a double-stranded DNA (dsDNA) based on the complementarity rule: A always pairs with T, and likewise C pairs with G. In Figure 1, a schematic picture of DNA is shown.
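The complementarity rule can be illustrated with a few lines of Python. This is only a minimal sketch: the function names are hypothetical and not part of the paper, and the example strand is the nine-base upper strand TGCATTCCG read from Figure 1.

```python
# Watson-Crick pairing rule from the text: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement, read in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the complementary strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

top = "TGCATTCCG"         # upper strand, written 5' to 3'
bottom = complement(top)  # lower strand, written 3' to 5'
```

Because the two strands of a duplex are antiparallel, `complement` gives the partner strand as it is drawn under the original, while `reverse_complement` gives the same partner read in its own 5' to 3' direction.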
DNA computing was initially developed by Adleman in 1994. Adleman resolved an instance of the Hamiltonian path problem just by handling DNA molecules [1]. In 1995, Lipton presented a method for solving the satisfiability (SAT) problem [3]. The Adleman-Lipton model can be used to solve different NP-complete problems. In the Adleman-Lipton model, DNA splints are used for the construction of the solution space. Adleman [4, 5] also presented a molecular algorithm for solving the 3-coloring problem. Chang and Guo [6–8] showed that the DNA operations in the Adleman-Lipton model could be used for developing DNA algorithms to resolve the dominating set problem, the vertex cover problem, the maximal clique problem, and the independent set problem.
In 1999, Roweis et al. [2] introduced the sticker-based DNA computing model and applied it to solving the minimal set cover problem; this model was also applied to breaking the Data Encryption Standard (DES) [9]. In our previous work, we also applied the sticker-based model to solving the independent set problem [10].
Besides the Adleman-Lipton and sticker-based models, various other models have been proposed in DNA computing. Ouyang et al. [11] solved the maximal clique problem using DNA molecules and restriction endonuclease enzymes. Amos et al. [12, 13] described a DNA computation model using restriction endonuclease enzymes instead of successive cycles of separation by DNA hybridization, which can reduce the error rate of computation. Hagiya et al. [14] proposed a new method of DNA computing that involves a self-acting DNA molecule containing the input, program, and working memory. In this method, a single-stranded DNA molecule consists of an input segment on the 5′ end, followed by a formula (program) segment, followed by a spacer, and finally a head on the 3′ end that moves and performs the computation. Another method for DNA computation is computation by self-assembly. Winfree et al. [15–17] introduced linear and 2-dimensional self-assembly models.
The surface-based model was introduced by Liu et al. [18]. This model uses DNA molecules attached to a solid surface, instead of DNA molecules floating in a solution. The surface-based model was used by Taghipour et al. for solving the dominating set problem [19]. Computing by blocking was introduced by Rozenberg and Spaink [20]. This model uses a novel approach to filter the DNA molecules. Instead of separating the DNA strands into distinct tubes, or destroying and removing the DNA molecules that do not contribute to finding a solution, it blocks (inactivates) them in a way that the blocked strands can be considered as nonexistent during the subsequent steps of computation.

2.2. Sticker-Based DNA Computation. The sticker model was introduced by Roweis et al. [2]. In this model, there is a memory strand $N$ bases in length subdivided into $K$ nonoverlapping regions, each $M$ bases long ($N \geq MK$). $M$ can be, for example, 20. The substrands (bit regions) are significantly different from each other. One sticker is designed for each subregion; each sticker is $M$ bases long and is complementary to one and only one of the $K$ memory regions. If a sticker is annealed to its corresponding region on the memory strand, then the particular region is said to be on. If no sticker is annealed to a region, then the corresponding bit is off. Each memory strand along with its annealed stickers is called a memory complex. In the sticker model, a tube is a collection of memory complexes, composed of a large number of identical memory strands, each of which has stickers annealed only at the required bit positions. This method of representing information differs from other methods in which the presence or absence of a particular subsequence in a strand corresponds to a particular bit being on or off. In the sticker model, each possible bit string is represented by a unique association of memory strands and stickers. This model has a kind of random access memory that requires no strand extension and uses no enzymes [2]. Indeed, in the sticker model, memory strands are used as registers, and stickers are used to write and erase information in the registers.
Another concept in the sticker model is the $(K, L)$ library. Each $(K, L)$ library contains memory complexes with $K$ bit regions; the first $L$ bit regions are either on or off, in all possible ways, whereas the remaining $K - L$ bit regions are off. The last $K - L$ bit regions can be used for intermediate data storage. In every $(K, L)$ library, there are at least $2^L$ memory complexes. In Figure 2, a memory complex with 7 bit regions representing the binary number 1100101 is shown.
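As an illustration of the library concept, the following Python sketch enumerates one memory complex per bit pattern. It is an abstraction rather than a laboratory protocol: a memory complex is modeled as a tuple of 0/1 values (1 meaning a sticker is annealed), and the name `make_library` and its parameters `k` and `l` (total bit regions and freely varying bit regions) are assumptions introduced here.

```python
from itertools import product

def make_library(k: int, l: int) -> list[tuple[int, ...]]:
    """Enumerate one memory complex per bit pattern of a library with
    k bit regions, of which the first l vary freely and the rest are off."""
    if not 0 <= l <= k:
        raise ValueError("need 0 <= l <= k")
    # Every on/off pattern of the first l regions, padded with k - l off regions.
    return [bits + (0,) * (k - l) for bits in product((0, 1), repeat=l)]

library = make_library(7, 3)

# The 7-bit memory complex of Figure 2, written in the same encoding.
figure_2_complex = tuple(int(ch) for ch in "1100101")
```

A real tube would hold many copies of each pattern; the sketch keeps exactly one copy so that the number of distinct complexes is easy to check.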


Figure 2: A memory complex representing 1100101.

2.3. Biological Operations in Sticker Model. There are four principal operations in the sticker model: combination, separation, setting, and clearing [2]. We also defined a new operation called divide, which is used in the construction of the solution space [10]. Here, we briefly discuss these operations.
(1) Combine ($T_0$, $T_1$, and $T_2$). The memory complexes from the tubes $T_1$ and $T_2$ are combined to form a new tube, $T_0$; the contents of $T_1$ and $T_2$ are simply poured into the tube $T_0$ ($T_0 = T_1 \cup T_2$).
(2) Separate ($T_0$, $i$) → ($T_+$, $T_-$). This operation creates two new tubes $T_+$ and $T_-$; $T_+$ contains the memory complexes having the $i$th bit on ($T_+ = +(T_0, i)$), and $T_-$ contains the memory complexes having the $i$th bit off ($T_- = -(T_0, i)$).
(3) Set ($T_0$, $i$). The $i$th bit region on every memory complex in tube $T_0$ is set to 1 or turned on.
(4) Clear ($T_0$, $i$). The $i$th bit region on every memory complex in tube $T_0$ is set to 0 or turned off.
(5) Divide ($T_0$, $T_1$, and $T_2$). By this operation, the contents of tube $T_0$ are divided into two equal portions and poured into the tubes $T_1$ and $T_2$.
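The five operations can be mimicked in software to check an algorithm before thinking about the chemistry. The Python sketch below is a simplified model under stated assumptions: a tube is a list of memory complexes, each complex is a tuple of 0/1 bits, bit regions are numbered from 1 as in the text, and all function names are hypothetical. In particular, `divide` here splits the list by position, which is cruder than a well-mixed physical split.

```python
MemoryComplex = tuple[int, ...]  # 1 = sticker annealed (on), 0 = no sticker (off)
Tube = list[MemoryComplex]

def combine(t1: Tube, t2: Tube) -> Tube:
    """Combine: pour the contents of two tubes into one."""
    return t1 + t2

def separate(t0: Tube, i: int) -> tuple[Tube, Tube]:
    """Separate: return (plus, minus), the complexes with bit i on and off."""
    plus = [c for c in t0 if c[i - 1] == 1]
    minus = [c for c in t0 if c[i - 1] == 0]
    return plus, minus

def set_bit(t0: Tube, i: int) -> Tube:
    """Set: turn bit i on in every complex of the tube."""
    return [c[: i - 1] + (1,) + c[i:] for c in t0]

def clear_bit(t0: Tube, i: int) -> Tube:
    """Clear: turn bit i off in every complex of the tube."""
    return [c[: i - 1] + (0,) + c[i:] for c in t0]

def divide(t0: Tube) -> tuple[Tube, Tube]:
    """Divide: split the tube into two portions of (nearly) equal size."""
    half = len(t0) // 2
    return t0[:half], t0[half:]

tube = [(0, 0, 0), (1, 0, 1)]
plus, minus = separate(tube, 1)
```

Each function returns new lists instead of mutating its input, which makes it easy to count operations or replay a procedure step by step.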

3. Solving the 0/1 Knapsack Problem in Sticker-Based DNA Computers
3.1. Definition of the Knapsack Problem. The knapsack problem is one of the classical optimization problems and has two variants: the 0/1 and fractional knapsack problems.
The 0/1 knapsack problem is posed as follows.
There are $n$ items $I_1, I_2, I_3, \ldots, I_n$; each item $I_i$ has a weight $w_i$ and a value $v_i$, where $w_i$ and $v_i$ are integers. We have a knapsack whose capacity (weight) is $C$, where $C$ is also an integer. We want to take the most valuable set of items that fit in our knapsack. Which items should we take? This is called the 0/1 or binary knapsack problem because each item must either be taken or left behind; we cannot take a fractional amount of an item.
In the fractional knapsack problem, the setup is the same, but we can take fractions of items, rather than having to make a binary (0-1) choice for each item. The fractional knapsack problem is solvable by a greedy strategy, whereas the 0/1 knapsack problem is not. The 0/1 knapsack problem has been proved to be an NP-complete problem [21].
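The difference between the two variants can be seen on a small instance. The following Python sketch is illustrative only: the three items (weights 10, 20, 30; values 60, 100, 120) and the capacity 50 are a textbook-style example chosen here, not data from this paper, and the function names are hypothetical. Weights are assumed to be positive.

```python
from itertools import combinations

def greedy_order(weights, values):
    """Item indices sorted by value per unit weight, best first."""
    return sorted(range(len(weights)),
                  key=lambda i: values[i] / weights[i], reverse=True)

def greedy_fractional(weights, values, capacity):
    """Greedy strategy for the fractional variant (optimal there)."""
    total, remaining = 0.0, capacity
    for i in greedy_order(weights, values):
        take = min(weights[i], remaining)   # take as much of item i as fits
        total += values[i] * take / weights[i]
        remaining -= take
    return total

def greedy_01(weights, values, capacity):
    """The same greedy order applied to whole items only (not optimal)."""
    total, remaining = 0, capacity
    for i in greedy_order(weights, values):
        if weights[i] <= remaining:
            total += values[i]
            remaining -= weights[i]
    return total

def brute_force_01(weights, values, capacity):
    """Exact 0/1 answer by checking all 2^n subsets."""
    n, best = len(weights), 0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if sum(weights[i] for i in subset) <= capacity:
                best = max(best, sum(values[i] for i in subset))
    return best

weights, values, capacity = [10, 20, 30], [60, 100, 120], 50
```

On this instance the greedy rule reaches 240 when fractions are allowed, but only 160 when items must be taken whole, while exhaustive search over all subsets finds 220. The exhaustive search is the step that the DNA solution space of Section 3.2 carries out in parallel.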

Figure 3: Memory strand with at least $n + q + p$ bit regions.

3.2. Construction of Sticker-Based DNA Solution Space for Knapsack Problem
3.2.1. Designing Appropriate DNA Memory Complexes. As discussed before, there are $n$ items $I_1, I_2, I_3, \ldots, I_n$; each item $I_i$ has a weight $w_i$ and a value $v_i$, where $w_i$ and $v_i$ are integers. Let the total weight of the items be $q$ and the total value of the items be $p$:

$$q = w_1 + w_2 + w_3 + \cdots + w_n = \sum_{i=1}^{n} w_i, \qquad p = v_1 + v_2 + v_3 + \cdots + v_n = \sum_{i=1}^{n} v_i, \tag{1}$$

where $n$ is the total number of items.
We start with $2^n$ or more identical memory strands, each of which has at least $n + q + p$ bit regions (Figure 3). The first $n$ bit regions (bit regions 1 to $n$) are used to represent the $n$ items, the middle $q$ bit regions (bit regions $n + 1$ to $n + q$) represent the total weight of items $q$, and the next $p$ bit regions (bit regions $n + q + 1$ to $n + q + p$) represent the total value of items $p$. Each bit region can have, for example, 20 nucleotides; thus, every memory strand contains at least $20(n + q + p)$ nucleotides.
3.2.2. Production of DNA Memory Complexes Which Represent All Possible Subsets of Items. It is clear that a set of $n$ items has $2^n$ subsets and each of these subsets has its own weight and value. For construction of the solution space, it is essential to represent all subsets of items by appropriate DNA memory complexes. Thus, by using $2^n$ or more memory strands and making the first $n$ bit regions on or off in all possible ways, we represent all $2^n$ subsets of items by DNA memory complexes. In other words, we simply design an $(n + q + p, n)$ library. For this purpose, Procedure 1 is proposed.
Procedure 1 uses the divide, set, and combine operations. At the end of the procedure, tube $T_0$ contains all of the memory complexes, each of which represents one of the subsets of items.

(1) Input ($T_0$), where $T_0$ contains $2^n$ or more memory strands with at least $(n + q + p)$ bit regions.
(2) For $i = 1$ to $n$, where $n$ is the total number of items
    (a) Divide ($T_0$, $T_1$, $T_2$)
    (b) Set ($T_1$, $i$)
    (c) Combine ($T_0$, $T_1$, $T_2$)
    End for
Procedure 1

3.2.3. Representing the Weight and Value of Each Subset on DNA Memory Complexes. In this step, based on the items which are present in the subsets, and by annealing the corresponding stickers in the $q$ (weight) and $p$ (value) regions of the memory strands, the total weight and value of the subsets are represented on the memory complexes. Note that each item $I_i$ has a weight $w_i$ and a value $v_i$; thus, for each item $I_i$, $w_i$ stickers are annealed to the $q$ region and $v_i$ stickers are annealed to the $p$ region on the memory strands. Consequently, the numbers of annealed stickers in the $q$ and $p$ regions represent the weight and value of the corresponding subset, respectively. Procedure 2 is proposed for representing the weight and value of each subset.

For $i = 1$ to $n$
{
    Separate ($T_0$, $i$) → ($T_+$, $T_-$)
    For $j = 1$ to $w_i$
        Set ($T_+$, $n + \sum_{k=1}^{i-1} w_k + j$)
    For $j = 1$ to $v_i$
        Set ($T_+$, $n + \sum_{k=1}^{n} w_k + \sum_{k=1}^{i-1} v_k + j$)
    Combine ($T_0$, $T_+$, $T_-$)
}
Procedure 2

Now, our solution space is completely produced and contains at least $2^n$ memory complexes, each of which represents one of the subsets of items, and the numbers of annealed stickers in the $q$ and $p$ regions represent the weight and value of the corresponding subset, respectively.
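The construction of the solution space can be simulated in a few lines of Python. This is a sketch under explicit assumptions, not the authors' implementation: tubes are lists of 0/1 tuples, all names are hypothetical, and `divide` models an ideal, well-mixed split by sending every second copy of each distinct complex to each tube. A naive positional split of the list would not reproduce all subsets.

```python
def separate(tube, i):
    """Split a tube by bit i (1-based): (complexes with bit on, bit off)."""
    return ([c for c in tube if c[i - 1] == 1],
            [c for c in tube if c[i - 1] == 0])

def set_bit(tube, i):
    """Turn bit i (1-based) on in every complex of the tube."""
    return [c[: i - 1] + (1,) + c[i:] for c in tube]

def divide(tube):
    """Idealized Divide: each distinct complex is split evenly between tubes."""
    ordered = sorted(tube)              # group identical complexes together
    return ordered[0::2], ordered[1::2]

def build_solution_space(weights, values):
    n, q, p = len(weights), sum(weights), sum(values)
    t0 = [(0,) * (n + q + p)] * (2 ** n)    # 2^n identical empty strands

    # Procedure 1: switch item bits on in all possible ways.
    for i in range(1, n + 1):
        t1, t2 = divide(t0)
        t1 = set_bit(t1, i)
        t0 = t1 + t2                        # Combine

    # Procedure 2: record weight and value in unary in their regions.
    for i in range(1, n + 1):
        plus, minus = separate(t0, i)
        w_offset = n + sum(weights[: i - 1])
        v_offset = n + q + sum(values[: i - 1])
        for j in range(1, weights[i - 1] + 1):
            plus = set_bit(plus, w_offset + j)
        for j in range(1, values[i - 1] + 1):
            plus = set_bit(plus, v_offset + j)
        t0 = plus + minus                   # Combine
    return t0

space = build_solution_space([1, 2], [2, 1])
```

For the two-item example, `space` holds four complexes of eight bit regions each, and in every complex the number of on bits in the weight and value regions equals the weight and value of the subset encoded in the first two bits.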
3.3. DNA Algorithm for Solving the 0/1 Knapsack Problem. Algorithm 1 is proposed for solving the 0/1 knapsack problem. According to the steps in the algorithm, the knapsack problem can be resolved by sticker-based DNA computation in polynomial time, that is, with a number of biological operations that is polynomial in the total weight $q$ and the total value $p$.
By the execution of step 1, the memory complexes without any annealed stickers in the weight ($q$) region (representing the empty subset) are placed in tube $T_0$, the memory complexes with only one annealed sticker (representing the subsets of items whose weight is 1) are placed in tube $T_1$, the memory complexes with 2 annealed stickers (representing the subsets of items whose weight is 2) are placed in tube $T_2$, the memory complexes with 3 annealed stickers (representing the subsets of items whose weight is 3) are placed in tube $T_3$, and finally, the tube $T_q$ contains the memory complexes in which all bit regions located in the $q$ region are turned on (representing the subset which contains all items). In other words, step 1 is a sorting procedure that sorts memory complexes according to the number of annealed stickers in the $q$ region. In this step, $q + 1$ tubes are produced ($T_0, T_1, T_2, \ldots, T_q$), and the number of every tube indicates the number of annealed stickers in the $q$ region. Step 1 contains $q(q + 1)/2$ separate and $q(q + 1)/2$ combine operations, for a total of $q(q + 1)$ operations.
In step 2 of the algorithm, the contents of tubes $T_{C+1}, T_{C+2}, T_{C+3}, \ldots, T_q$ are discarded, because the memory complexes which are present in these tubes represent subsets of items whose weight exceeds the capacity $C$ of the knapsack. Then, the contents of tubes $T_0, T_1, T_2, \ldots, T_C$ are mixed together and transferred to tube $T_0$. Now, tube $T_0$ contains the memory complexes which represent the subsets of items whose weight does not exceed the capacity of the knapsack. Thus, at the end of step 2, the memory complexes which represent subsets of items whose weight exceeds the capacity of the knapsack have been removed from the solution space, and only the memory complexes representing subsets that fit in our knapsack remain. Step 2 contains only 2 operations.
By the execution of step 3, the memory complexes are sorted according to the number of annealed stickers in the value ($p$) region. During this step, $p + 1$ tubes are produced ($T_0, T_1, T_2, \ldots, T_p$). The memory complexes without any annealed stickers in the $p$ region (representing the empty subset) are placed in tube $T_0$, the memory complexes with only one annealed sticker (representing the subsets of items whose value is 1) are placed in tube $T_1$, the memory complexes with 2 annealed stickers (representing the subsets of items whose value is 2) are placed in tube $T_2$, the memory complexes with 3 annealed stickers (representing the subsets of items whose value is 3) are placed in tube $T_3$, and finally, the tube $T_p$ contains the memory complexes in which all bit regions located in the $p$ region are turned on (representing the subset which contains all items). Step 3 contains $p(p + 1)/2$ separate and $p(p + 1)/2$ combine operations, for a total of $p(p + 1)$ operations.

(1) For $i = n$ to $n + q - 1$
        For $j = i$ down to $n$
            Separate ($T_{j-n}$, $i + 1$) → ($T_{(j-n+1)'}$, $T_{j-n}$)
            Combine ($T_{j-n+1}$, $T_{j-n+1}$, $T_{(j-n+1)'}$)
(2) The capacity of the knapsack is $C$:
        Discard tubes $T_{C+1}, T_{C+2}, T_{C+3}, \ldots, T_q$
        Combine ($T_0$, $T_0, T_1, T_2, \ldots, T_C$)
(3) For $i = n + q$ to $n + q + p - 1$
        For $j = i$ down to $n + q$
            Separate ($T_{j-n-q}$, $i + 1$) → ($T_{(j-n-q+1)'}$, $T_{j-n-q}$)
            Combine ($T_{j-n-q+1}$, $T_{j-n-q+1}$, $T_{(j-n-q+1)'}$)
(4) Read $T_p$; else if it was empty then:
        Read $T_{p-1}$; else if it was empty then:
        Read $T_{p-2}$; else if it was empty then:
        ...
        Read $T_2$; else if it was empty then:
        Read $T_1$;
Algorithm 1

In step 4, all of the tubes (from $T_p$ to $T_1$) are evaluated for the presence of memory complexes, and the first tube which is not empty contains the memory complexes that represent the most valuable set. Step 4 contains at most $p$ Read operations.
Finally, the total number of operations in our algorithm is $q(q + 1) + 2 + p(p + 1) + p = q^2 + p^2 + q + 2p + 2$.
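The four steps of Algorithm 1 and the operation count can be checked with a small simulation. The Python sketch below rests on several assumptions: tubes are lists of 0/1 tuples, the solution space is enumerated directly instead of being built by Procedures 1 and 2, all function names are hypothetical, and the example items (weights 2, 3, 4; values 3, 4, 5; capacity 5) are made up for illustration.

```python
from itertools import product

def solution_space(weights, values):
    """Directly enumerate the complexes that Procedures 1 and 2 would produce:
    n item bits, then weight and value written in unary."""
    space = []
    for chosen in product((0, 1), repeat=len(weights)):
        w_region = tuple(b for b, w in zip(chosen, weights) for _ in range(w))
        v_region = tuple(b for b, v in zip(chosen, values) for _ in range(v))
        space.append(chosen + w_region + v_region)
    return space

def separate(tube, i):
    """Split a tube by bit i (1-based): (complexes with bit on, bit off)."""
    return ([c for c in tube if c[i - 1] == 1],
            [c for c in tube if c[i - 1] == 0])

def sort_by_count(t0, first, length):
    """Steps 1 and 3: tubes[k] ends up holding the complexes with k stickers
    in bit regions first+1 .. first+length. Also counts the operations."""
    tubes = [list(t0)] + [[] for _ in range(length)]
    ops = 0
    for i in range(first, first + length):
        for j in range(i, first - 1, -1):        # j = i down to first
            k = j - first
            on, off = separate(tubes[k], i + 1)  # Separate
            tubes[k] = off
            tubes[k + 1] = tubes[k + 1] + on     # Combine
            ops += 2
    return tubes, ops

def solve(weights, values, capacity):
    n, q, p = len(weights), sum(weights), sum(values)
    tubes, ops = sort_by_count(solution_space(weights, values), n, q)  # step 1
    t0 = [c for tube in tubes[: capacity + 1] for c in tube]           # step 2
    ops += 2                                     # one Discard, one Combine
    tubes, more = sort_by_count(t0, n + q, p)                          # step 3
    ops += more
    for value in range(p, 0, -1):                                      # step 4
        if tubes[value]:
            return value, tubes[value][0][:n], ops
    return 0, (0,) * n, ops

best_value, chosen_items, ops_without_reads = solve([2, 3, 4], [3, 4, 5], 5)
```

For this instance the best subset consists of the first two items with total value 7, and the counted separate, combine, and discard operations plus the at most $p$ reads add up to $q^2 + p^2 + q + 2p + 2$. Note that the simulation enumerates all $2^n$ complexes explicitly, so only the number of operations, not the amount of material, stays polynomial.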

4. Conclusion
In this paper, sticker-based DNA computing was used for solving the 0/1 knapsack problem. This method could be used for solving other NP-complete problems. There are four principal operations in the sticker model: combination, separation, setting, and clearing. We also defined a new operation called divide and applied it in the construction of the solution space.
As mentioned earlier, one of the important properties of DNA computing is its massive parallelism, which makes it a favorite and powerful tool for solving NP-complete and hard combinatorial problems. In the sticker model, as in other DNA-based computation methods, the ability of DNA molecules to form duplexes is used as the main biological operation. The main difference between the sticker model and the Adleman-Lipton model is that in the sticker model there is a kind of random access memory and the computations do not depend on the extension of DNA molecules as seen in the Adleman-Lipton model.

References
[1] L. Adleman, "Molecular computation of solutions to combinatorial problems," Science, vol. 266, pp. 1021–1024, 1994.
[2] S. Roweis, E. Winfree, R. Burgoyne et al., "A sticker based model for DNA computation," in Proceedings of the 2nd Annual Workshop on DNA Computing, Princeton University, L. Landweber and E. Baum, Eds., Series in Discrete Mathematics and Theoretical Computer Science, DIMACS, pp. 1–29, American Mathematical Society, 1999.
[3] R. J. Lipton, "DNA solution of hard computational problems," Science, vol. 268, pp. 542–545, 1995.
[4] L. M. Adleman, On Constructing a Molecular Computer, Department of Computer Science, University of Southern California, 1995.
[5] L. M. Adleman, "On constructing a molecular computer," in DNA Based Computers, R. J. Lipton and E. B. Baum, Eds., pp. 1–22, American Mathematical Society, 1996.
[6] W.-L. Chang and M. Guo, "Solving the dominating-set problem in Adleman-Lipton's Model," in Proceedings of the 3rd International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 167–172, Kanazawa, Japan, 2002.
[7] W.-L. Chang and M. Guo, "Solving the clique problem and the vertex cover problem in Adleman-Lipton's model," in IASTED International Conference, Networks, Parallel and Distributed Processing, and Applications, pp. 431–436, Tsukuba, Japan, 2002.
[8] W.-L. Chang and M. Guo, "Solving NP-complete problem in the Adleman-Lipton Model," in Proceedings of The International Conference on Computer and Information Technology, pp. 157–162, 2002.

[9] L. Adleman, P. Rothemund, S. Roweis, and E. Winfree, "On applying molecular computation to the data encryption standard," in Proceedings of the 2nd DIMACS Workshop on DNA Based Computers, Princeton University, pp. 24–48, 1996.
[10] H. Taghipour, A. Taghipour, M. Rezaei, and H. Esmaili, "Solving the independent set problem by sticker based DNA computers," American Journal of Molecular Biology, vol. 2, no. 2, pp. 153–158, 2012.
[11] Q. Ouyang, P. D. Kaplan, S. Liu, and A. Libchaber, "DNA solution of the maximal clique problem," Science, vol. 278, no. 5337, pp. 446–449, 1997.
[12] M. Amos, A. Gibbons, and D. Hodgson, "Error-resistant implementation of DNA computations," in Proceedings of the 2nd DIMACS Workshop on DNA Based Computers, 1996.
[13] M. Amos, A. Gibbons, and D. Hodgson, "A new model of DNA computation," in Proceedings of the 12th British Colloquium on Theoretical Computer Science, 1996.
[14] M. Hagiya, M. Arita, D. Kiga, K. Sakamoto, and S. Yokoyama, "Towards parallel evaluation and learning of boolean formulas with molecules," DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 48, pp. 57–72, 1999.
[15] E. Winfree, "Simulations of computing by self-assembly," in Proceedings of the 4th International Meeting on DNA Based Computers, pp. 213–239, 1998.
[16] E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman, "Design and self-assembly of two-dimensional DNA crystals," Nature, vol. 394, no. 6693, pp. 539–544, 1998.
[17] E. Winfree, X. Yang, and N. Seeman, "Universal computation via self-assembly of DNA: some theory and experiments," in Proceedings of the 2nd DIMACS Workshop on DNA Based Computers, 1996.
[18] Q. Liu, Z. Guo, A. E. Condon, R. M. Corn, M. G. Lagally, and L. M. Smith, "A surface-based approach to DNA computation," in Proceedings of the 2nd Annual Meeting on DNA Based Computers, Princeton University, 1996.
[19] H. Taghipour, M. Rezaei, and H. Esmaili, "Applying surface-based DNA computing for solving the dominating set problem," American Journal of Molecular Biology, vol. 2, no. 3, pp. 286–290, 2012.
[20] G. Rozenberg and H. Spaink, "DNA computing by blocking," Theoretical Computer Science, vol. 292, no. 3, pp. 653–665, 2003.
[21] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, San Francisco, Calif, USA, 1979.

Advances in Bioinformatics

Author Guidelines
Submission
Manuscripts should be submitted by one of the authors of the manuscript through the online
Manuscript Tracking System. Regardless of the source of the word-processing tool, only electronic
PDF (.pdf) or Word (.doc, .docx, .rtf) files can be submitted through the MTS. There is no page limit.
Only online submissions are accepted to facilitate rapid publication and minimize administrative costs.
Submissions by anyone other than one of the authors will not be accepted. The submitting author
takes responsibility for the paper during submission and peer review. If for some technical reason
submission through the MTS is not possible, the author can contact abi@hindawi.com for support.

Terms of Submission
Papers must be submitted on the understanding that they have not been published elsewhere and are
not currently under consideration by another journal published by Hindawi or any other publisher. The
submitting author is responsible for ensuring that the article's publication has been approved by all the
other coauthors. It is also the authors' responsibility to ensure that the articles emanating from a
particular institution are submitted with the approval of the necessary institution. Only an
acknowledgment from the editorial office officially establishes the date of receipt. Further
correspondence and proofs will be sent to the author(s) before publication unless otherwise indicated.
It is a condition of submission of a paper that the authors permit editing of the paper for readability. All
enquiries concerning the publication of accepted papers should be addressed to abi@hindawi.com.

Peer Review
All manuscripts are subject to peer review and are expected to meet standards of academic
excellence. Submissions will be considered by an editor and, if not rejected right away, by peer reviewers, whose identities will remain anonymous to the authors.

Microarray Data Submission


Before publication, the microarray data should be deposited in an appropriate database such as Gene
Expression Omnibus (GEO) or ArrayExpress, and an entry name or accession number must be
included in the manuscript prior to its publication. Microarray data should be MIAME compliant. During
the reviewing process, submitting authors must provide the editor and the reviewers
handling their manuscript with the login information needed to access the data in the
database.

Article Processing Charges


Advances in Bioinformatics is an open access journal. Open access charges allow publishers to make
the published material available for free to all interested online visitors. For more details about the
article processing charges of Advances in Bioinformatics, please visit the Article Processing Charges
information page.

Units of Measurement
Units of measurement should be presented simply and concisely using the International System of Units (SI).

Title and Authorship Information


The following information should be included:

Paper title
Full author names
Full institutional mailing addresses
Email addresses

Abstract
The manuscript should contain an abstract. The abstract should be self-contained and citation-free
and should not exceed 200 words.

Introduction
This section should be succinct, with no subheadings.

Materials and Methods


This part should contain sufficient detail so that all procedures can be repeated. It can be divided into
subsections if several methods are described.

Results and Discussion


These sections may each be divided by subheadings or may be combined.

Conclusions
This should clearly explain the main conclusions of the work highlighting its importance and relevance.

Acknowledgments
All acknowledgments (if any) should be included at the very end of the paper before the references
and may include supporting grants, presentations, and so forth.

References
Authors are responsible for ensuring that the information in each reference is complete and accurate.
All references must be numbered consecutively and citations of references in text should be identified
using numbers in square brackets (e.g., as discussed by Smith [9]; as discussed elsewhere [9,
10]). All references should be cited within the text; otherwise, these references will be automatically
removed.

Preparation of Figures
Upon submission of an article, authors should include all figures and tables in the PDF file of
the manuscript. Figures and tables should not be submitted in separate files. If the article is accepted,
authors will be asked to provide the source files of the figures. Each figure should be supplied in a
separate electronic file. All figures should be cited in the paper in a consecutive order. Figures should
be supplied in either vector art formats (Illustrator, EPS, WMF, FreeHand, CorelDraw, PowerPoint,
Excel, etc.) or bitmap formats (Photoshop, TIFF, GIF, JPEG, etc.). Bitmap images should have a resolution
of at least 300 dpi unless the resolution is intentionally set to a lower level for scientific reasons. If a
bitmap image has labels, the image and labels should be embedded in separate layers.

Preparation of Tables
Tables should be cited consecutively in the text. Every table must have a descriptive title and if
numerical measurements are given, the units should be included in the column heading. Vertical rules
should not be used.

Proofs
Corrected proofs must be returned to the publisher within 2-3 days of receipt. The publisher will do
everything possible to ensure prompt publication. It will therefore be appreciated if the manuscripts
and figures conform from the outset to the style of the journal.

Copyright
Open Access authors retain the copyrights of their papers, and all open access articles are distributed
under the terms of the Creative Commons Attribution License, which permits unrestricted use,
distribution and reproduction in any medium, provided that the original work is properly cited.
The use of general descriptive names, trade names, trademarks, and so forth in this publication, even
if not specifically identified, does not imply that these names are not protected by the relevant laws
and regulations.
While the advice and information in this journal are believed to be true and accurate on the date of its
going to press, neither the authors, the editors, nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Disclosure Policy
A competing interest exists when professional judgment concerning the validity of research is
influenced by a secondary interest, such as financial gain. We require that our authors reveal any
possible conflict of interests in their submitted manuscripts.
If there is no conflict of interests, authors should state that "The author(s) declare(s) that there is no
conflict of interests regarding the publication of this article."
