Advances in Bioinformatics
ISSN: 1687-8027
Volume 2014 No. 1, June 2014
About this Journal
Advances in Bioinformatics is a peer-reviewed, open access journal that publishes original research
articles as well as review articles in all areas of bioinformatics.
Abstracting and Indexing
The articles of Advances in Bioinformatics are included in the following databases/resources:
Academic OneFile
Academic Search Complete
Access to Global Online Research in Agriculture (AGORA)
Airiti Library
Applied Science and Technology Source
Biological Sciences
BioMedSearch
Biotechnology and BioEngineering Abstracts
Biotechnology Research Abstracts
CAB Abstracts
Chemical Abstracts Service (CAS)
CNKI Scholar
Computers and Applied Sciences Complete
CSA Illustrata - Natural Sciences
CSA Illustrata - Technology
CSA Technology Research Database
Current Abstracts
Directory of Open Access Journals (DOAJ)
EBSCO Discovery Service
EBSCOhost Connection
Expanded Academic Index
Google Scholar
HINARI Access to Research in Health Programme
InfoTrac Custom journals
INSPEC
J-Gate Portal
Odysci Academic Search
ProQuest Advanced Technologies and Aerospace Collection
ProQuest Biological Science Collection
ProQuest Computer Science Journals
ProQuest Natural Science Collection
ProQuest SciTech Collection
PubMed
PubMed Central
Scopus
The DBLP Computer Science Bibliography
The Index of Information Systems Journals
The Informatics Portal io-port.net
TOC Premier
Editorial Board
Shandar Ahmad, National Institute of Biomedical Innovation, Japan
Tatsuya Akutsu, Kyoto University, Japan
Rolf Backofen, University of Freiburg, Germany
Craig Benham, University of California, Davis, USA
Mark Borodovsky, Georgia Institute of Technology, USA
Rita Casadio, Università di Bologna, Italy
Ming Chen, Zhejiang University, China
David Corne, Heriot Watt University, United Kingdom
Bhaskar Dasgupta, University of Illinois at Chicago, USA
Ramana Davuluri, The Wistar Institute, USA
J. Dopazo, Felipe Research Centre, Spain
Anton Enright, European Bioinformatics Institute, United Kingdom
Stavros J. Hamodrakas, National and Capodistrian University of Athens, Greece
Paul Harrison, McGill University, Canada
Huixiao Hong, U.S. Food and Drug Administration, USA
David Jones, University College London, United Kingdom
George Karypis, University of Minnesota, USA
Jian-Liang Li, Sanford-Burnham Medical Research Institute, USA
Jie Liang, University of Illinois at Chicago, USA
Guohui Lin, University of Alberta, Canada
Pietro Liò, University of Cambridge, United Kingdom
Dennis Livesay, University of North Carolina at Charlotte, USA
Satoru Miyano, The University of Tokyo, Japan
Burkhard Morgenstern, University of Goettingen, Germany
Masha Niv, Hebrew University of Jerusalem, Israel
Florencio Pazos, Consejo Superior de Investigaciones Científicas, Spain
David Posada, Universidad de Vigo, Spain
Jagath Rajapakse, Nanyang Technological University, Singapore
Marcel J. T. Reinders, Delft University of Technology, The Netherlands
P. Rouzé, Ghent University, Belgium
Alejandro A. Schäffer, National Institutes of Health, USA
E. L. Sonnhammer, Stockholm University, Sweden
Sandor Vajda, Boston University, USA
Yves Van de Peer, U Gent, Belgium
Antoine van Kampen, University of Amsterdam, The Netherlands
Alexander Zelikovsky, Georgia State University, USA
Zhongming Zhao, Vanderbilt University, USA
Yi Ming Zou, University of Wisconsin-Milwaukee, USA
Editorial Workflow
The following is the editorial workflow that every manuscript submitted to the journal undergoes during
the course of the peer-review process.
The entire editorial workflow is performed using the online Manuscript Tracking System. Once a
manuscript is submitted it is sent to an appropriate Editor based on the subject of the manuscript and
the availability of the Editors. If the Editor finds that the manuscript may not be of sufficient quality to
go through the normal peer review process, or that the subject of the manuscript may not be
appropriate for the journal's scope, the Editor may Refuse to Consider the manuscript. In this case,
the manuscript is sent to a second Editor, and if the second Editor also chooses to Refuse to
Consider the manuscript, the manuscript shall be rejected with no further processing.
If the Editor finds that the submitted manuscript is of sufficient quality and falls within the scope of the
journal, they assign the manuscript to a minimum of 2 and a maximum of 5 external reviewers
for peer-review. The reviewers submit their reports on the manuscripts along with their
recommendation of one of the following actions to the Editor:
Publish Unaltered
Consider after Minor Changes
Consider after Major Changes
Reject: Manuscript is flawed or not sufficiently novel
When all reviewers have submitted their reports, the Editor can make one of the following editorial
recommendations:
Publish Unaltered
Consider after Minor Changes
Consider after Major Changes
Reject
If the Editor recommends Publish Unaltered, the manuscript is accepted for publication.
If the Editor recommends Consider after Minor Changes, the authors are notified to prepare and
submit a final copy of their manuscript with the required minor changes suggested by the reviewers.
The Editor reviews the revised manuscript after the minor changes have been made by the authors.
Once the Editor is satisfied with the final manuscript, the manuscript can be accepted.
If the Editor recommends Consider after Major Changes, the recommendation is communicated to
the authors. The authors are expected to revise their manuscripts in accordance with the changes
recommended by the reviewers and to submit their revised manuscript in a timely manner. Once the
revised manuscript is submitted, the Editor can then make an editorial recommendation which can be
Publish Unaltered, Consider after Minor Changes, or Reject.
If the Editor recommends rejecting the manuscript, the rejection is immediate. Also, if two of the
reviewers recommend rejecting the manuscript, the rejection is immediate.
The editorial workflow gives the Editors the authority to reject any manuscript because of
inappropriateness of its subject, lack of quality, or incorrectness of its results. The Editor cannot assign
himself/herself as an external reviewer of the manuscript. This is to ensure a high-quality, fair, and
unbiased peer-review process of every manuscript submitted to the journal, since any manuscript
must be recommended by one or more (usually two or more) external reviewers along with the Editor
in charge of the manuscript in order for it to be accepted for publication in the journal.
The name of the Editor recommending the manuscript for publication is published with the manuscript
to indicate and acknowledge their invaluable contribution to the peer-review process and the
indispensability of their contributions to the running of the journals.
The peer-review process is single blinded; that is, the reviewers know who the authors of the
manuscript are, but the authors do not have access to the information of who the peer reviewers are.
Every journal published by Hindawi has an acknowledgment page for the researchers who have
performed the peer-review process for one or more of the journal manuscripts in the past year.
Without the significant contributions made by these researchers, the publication of the journal would
not be possible.
Table of Contents
Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets
Sreevidya Sadananda Sadasiva Rao, Lori A. Shepherd, Andrew E. Bruno, Song Liu,
and Jeffrey C. Miecznikowski
Efficient Serial and Parallel Algorithms for Selection of Unique Oligos in EST Databases
Manrique Mata-Montero, Nabil Shalaby, and Bradley Sheppard
Research Article
Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets
Sreevidya Sadananda Sadasiva Rao, Lori A. Shepherd, Andrew E. Bruno, Song Liu,
and Jeffrey C. Miecznikowski
1. Introduction
In microarray experiments, randomly missing values may occur due to scratches on the chip,
spotting errors, dust, or hybridization errors. Other nonrandom missing values may be biological
in nature, for example, probes with low intensity values or intensity values that may exceed a
readable threshold. These missing values will create incomplete gene expression matrices where
the rows refer to genes and the columns refer to samples. These incomplete expression matrices
will make it difficult for researchers to perform downstream analyses such as differential
expression inference, clustering or dimension reduction methods (e.g., principal components
analysis), or multidimensional scaling. Hence, it is critical to understand the nature of the
missing values and to choose an accurate method to impute the missing values.
on applying these methods to Affymetrix gene expression arrays, one of the most popular arrays
in scientific research.
Naturally, when proposing a new imputation scheme for expression arrays, it is necessary to
compare the new method against existing methods. Several excellent papers have compared missing
data procedures on high-throughput data platforms such as two-dimensional gel electrophoresis,
as in Miecznikowski et al. [7], or gene expression arrays [8–10]. Before studying missing data
imputation schemes in Affymetrix gene expression arrays, it is reasonable to first remove any
existing missing values. In this way, we ensure that any subsequent missing values have known
true values. A detection call algorithm is used to filter and remove missing expression values
based on absent/present calls [11]. Subsequently, a preprocessing scheme is employed. There are
numerous tasks to perform in preprocessing Affymetrix arrays, including background adjustment,
normalization, and summarization. A good overview of the methods available for preprocessing is
provided by Gentleman et al. [12]. For our analysis, the detection call employs MAS 5.0 [13] to
obtain expression values; thus, we also use the MAS 5.0 suite of functions as our preprocessing
method.
For our analysis, we focus on the microarray quality control (MAQC) datasets (Accession no.
GSE5350), where the datasets have been specifically designed to address the points of strength
and weakness of various microarray analysis methods. The MAQC datasets were designed by the US
Food and Drug Administration to provide quality control (QC) tools to the microarray community
to avoid procedural failures. The project aimed to develop guidelines for microarray data
analysis by providing the public with large reference datasets along with readily accessible
reference ribonucleic acid (RNA) samples. Another purpose of this project was to establish QC
metrics and thresholds for objectively assessing the performance achievable by various
microarray platforms. These datasets were designed to evaluate the advantages and disadvantages
of various data analysis methods.
The initial results from the MAQC project were published by Shi [14] and later by Chen et al.
[15] and Shi et al. [16]. Specifically, the MAQC experimental design for the Affymetrix gene
expression HG-U133 Plus 2.0 GeneChip includes 6 different test sites, 4 pools per site, and
5 replicates per pool at each site, for a total of 120 arrays (see Section 2). This rich dataset
provides an ideal setting for evaluating imputation methods on Affymetrix expression arrays.
While this dataset has been mined to determine inter- and intraplatform reproducibility of
measurements, to our knowledge, no one has studied imputation methods on this dataset.
The MAQC dataset hybridizes two RNA sample types: Universal Human Reference RNA (UHRR) from
Stratagene and Human Brain Reference RNA (HBRR) from Ambion. These 2 reference samples and
varying mixtures of these samples constitute the 4 different pools included in the MAQC
dataset. By using various mixtures of UHRR and HBRR, this dataset is designed to study
technical variations present in this technology. By technical variations, we are referring to
the variability between preparations and labeling of sample, variability between hybridization
of the same sample to different arrays, testing site variability, and variability between the
signal on replicate features of the same array. Meanwhile, biological variability refers to
variability between individuals in a population and is independent of the microarray process
itself. Because the MAQC dataset is designed to study technical variation, we can examine the
accuracy of the imputation procedures without the confounding feature of biological
variability. Other than MAQC datasets, similar technical datasets have been used to evaluate
different analysis methods specific to Affymetrix microarrays, for example, methods for
identifying differentially expressed genes [17–19].
In summary, our analysis examines cutting-edge imputation schemes on an Affymetrix technical
dataset with minimal biological variation. Section 2 discusses the MAQC dataset and the
proposed imputation schemes. Meanwhile, Section 3 describes the results from applying the
imputation methods for addressing missingness in the MAQC datasets. Finally, we conclude our
paper with a discussion and conclusion in Sections 4 and 5.
set is equal to $\tau$, and the alternative hypothesis is that the median discrimination score
is greater than $\tau$, where $\tau$ is defined as a small nonnegative number which can be
changed by the user to adjust the specificity and sensitivity. One-sided Wilcoxon signed-rank
tests are performed for each probe set. Two significance levels, $\alpha_1$ and $\alpha_2$, act
as the cutoffs for the $p$-values for probe set detection calls. A present call is made for a
probe set (transcript) with $p < \alpha_1$, an absent call for a transcript with
$p \geq \alpha_2$, and a marginally detected call for a transcript with
$\alpha_1 \leq p < \alpha_2$. We use the MAS 5.0 preset values 0.04, 0.06, and 0.015 for
$\alpha_1$, $\alpha_2$, and $\tau$, respectively, to determine if the probe set is present,
marginally present, or absent in the sample.
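The mapping from a detection p-value to a call can be sketched in a few lines of Python. This is only an illustration of the cutoff rule described above, assuming the p-value has already been computed by the one-sided test; the function name `detection_call` and the labels "P", "M", and "A" are choices made for this sketch, and the default cutoffs are the preset values 0.04 and 0.06 quoted in the text.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to a present/marginal/absent call.

    alpha1 and alpha2 are the two significance cutoffs; the defaults are
    the preset values quoted in the text. The labels are illustrative.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < alpha1:
        return "P"  # present: p < alpha1
    if p_value < alpha2:
        return "M"  # marginal: alpha1 <= p < alpha2
    return "A"      # absent: p >= alpha2


# Hypothetical p-values, including the two boundary cases.
calls = [detection_call(p) for p in (0.001, 0.04, 0.05, 0.06, 0.5)]
print(calls)
```

Note that a p-value exactly equal to the lower cutoff is treated as marginal and one exactly equal to the upper cutoff as absent, which follows the inequalities stated above.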
2.3. Percent Present Algorithm. We use the mas5calls function detailed in Affymetrix [20] from
the affy package [13] to make the detection calls. Using this function, we get a present,
marginal, or absent call for each probe set in each array. For every sample, probe sets were
filtered based on the present calls where probe sets that were present in all 5 replicates of a
given pool and a given site were retained for further analyses. Probe sets that were detected
as absent or marginally present in 1 or more replicates of a sample were removed. This creates
a complete expression matrix for each site and pool combination.
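As a minimal sketch of this filtering rule, assuming the calls for one site and pool combination are held in a dictionary that maps a probe set identifier to its list of replicate calls (the data layout, the identifiers, and the function name are hypothetical and not taken from the affy package):

```python
def keep_all_present(calls_by_probe_set):
    """Return the probe set IDs called present ("P") in every replicate."""
    return [
        probe_set
        for probe_set, calls in calls_by_probe_set.items()
        if all(call == "P" for call in calls)
    ]


# Hypothetical calls for three probe sets across the 5 replicates.
example_calls = {
    "probe_set_1": ["P", "P", "P", "P", "P"],
    "probe_set_2": ["P", "M", "P", "P", "P"],  # one marginal call: removed
    "probe_set_3": ["A", "A", "P", "A", "A"],  # absent calls: removed
}
retained = keep_all_present(example_calls)
print(retained)
```

Only the probe sets in `retained` would contribute rows to the complete expression matrix for that site and pool.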
The simpleaffy R package has methods for quality control metrics on Affymetrix arrays [21]. One
metric is percent present call, which calculates the percentage of present probe sets in each
array. Using this metric, we calculate the percent present calls for all 120 arrays separately
and then average the percentages over the 5 replicates for each sample and each site.
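The percent present metric and its replicate average are simple to state in code. The sketch below is a plain Python illustration under the assumption that each array is represented by a list of call labels; it does not reproduce the simpleaffy implementation, and both function names are hypothetical.

```python
def percent_present(calls):
    """Percentage of probe sets on one array that are called present ("P")."""
    return 100.0 * sum(call == "P" for call in calls) / len(calls)


def mean_percent_present(replicate_calls):
    """Average the per-array percent present over a list of replicate arrays."""
    values = [percent_present(calls) for calls in replicate_calls]
    return sum(values) / len(values)


# Two hypothetical replicate arrays with four probe sets each.
replicates = [["P", "A", "P", "M"], ["P", "P", "P", "P"]]
print(mean_percent_present(replicates))
```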
2.4. Preprocessing Algorithm. We preprocess each complete expression matrix using MAS 5.0
available in Bioconductor [22] to obtain expression values for further analyses. The MAS 5.0
preprocessing was implemented using the R language affy library [13].
Preprocessing algorithms for Affymetrix gene expression microarrays are necessary to account
for the systematic variation present in array technology and to summarize the signal for each
gene which is measured via a series of probe sets. As discussed by Gentleman et al. [12],
preprocessing schemes can be organized into three steps: a background adjustment step, a
normalization step, and a summarization step. In short, the MAS 5.0 preprocessing algorithm is
outlined in the Statistical Algorithms Description Document [20] and used in the Affymetrix
MAS 5.0 software [20]. The steps in MAS 5.0 involve (1) a weighted nearest neighbor step to
estimate and remove the background signal, (2) a normalization step that scales all arrays to a
baseline array, and (3) a summarization step using an ideal mismatch, which may be slightly
different than the perfect mismatch probe described earlier.
To compare imputation methods, we randomly remove a percentage of the probe set expression
values from the complete expression matrix and compare the complete dataset and the dataset(s)
with the missing probe set expression values estimated via an imputation method. We randomly
remove 5% and 10% of the probe set expression values from the complete expression matrix with
1000 Monte-Carlo simulations at each deletion percentage.
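One way to picture a single deletion step is the sketch below. It assumes that entries are removed uniformly at random over the whole matrix, which the text does not spell out, and it uses a small toy matrix and a seeded generator so that the run is repeatable; `delete_at_random` is a hypothetical helper name.

```python
import random


def delete_at_random(matrix, fraction, rng):
    """Return a masked copy of matrix and the list of removed positions.

    Assumption: entries are removed uniformly at random over all cells.
    Removed cells are marked with None; the input matrix is not modified.
    """
    positions = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    n_remove = round(fraction * len(positions))
    removed = rng.sample(positions, n_remove)
    masked = [row[:] for row in matrix]
    for i, j in removed:
        masked[i][j] = None
    return masked, removed


rng = random.Random(0)  # fixed seed for a repeatable illustration
complete = [[float(10 * i + j) for j in range(5)] for i in range(8)]  # 8 x 5 toy matrix
masked, removed = delete_at_random(complete, 0.10, rng)
print(len(removed), "of", 8 * 5, "entries removed")
```

Because the removed positions are returned, the true values stay available for scoring an imputation method; repeating the call with fresh random draws gives the Monte-Carlo replicates.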
2.5. Missing Value Imputation Methods. Similar to the analysis by Oh et al. [10], we examine the
following missing data analysis methods for the MAQC dataset:
(1) row average (ROW),
(2) nearest neighbors using Euclidean distance or Pearson correlation, with $k$ = 1 or 5, where
$k$ is the number of neighbors in the imputation (KNN),
(3) singular value decomposition (SVD) [1],
(4) least squares adaptive (LSA) [4],
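To make the two simplest methods in this list concrete, the following Python sketch imputes missing entries (marked as None) by row average and by a basic nearest-neighbor rule. It is a simplified illustration, not the implementation used in the study: the nearest-neighbor variant here uses Euclidean distance over the columns observed in both rows and an unweighted mean of the neighbors, and it assumes every row has at least one observed value and every missing cell has at least one usable neighbor. All names are hypothetical.

```python
import math


def row_average_impute(matrix):
    """Replace each None with the mean of the observed values in its row."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)  # assumes a non-empty row
        imputed.append([mean if v is None else v for v in row])
    return imputed


def knn_impute(matrix, k):
    """Replace each None with the mean of that column over the k nearest rows."""
    imputed = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue  # a neighbor must have the target column observed
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Euclidean distance over shared columns, scaled by their count.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy matrix: rows are probe sets, columns are arrays; one value is missing.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [5.0, 6.0, 7.0],
    [1.1, 2.1, 3.2],
]
print(row_average_impute(toy)[0])
print(knn_impute(toy, 1)[0])
```

In the toy matrix the second row matches the first on both observed columns, so with one neighbor the missing value is copied from it, while the row average ignores the other rows entirely.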
Figure 1: Percent present across pools and sites. Each curve shows a different site, and the
x-axis shows the 4 pools and the y-axis shows the mean percentage of present probes on the
Affymetrix arrays. Pool B has the smallest percentage of present probes, while Pool D has the
largest percentage of present probes. Site 4 has the highest percentage of present probes,
while Site 2 has the lowest percentage of present probes.
[Equations (1)–(5): definitions of the error measures (RMSE, LRMSE, RAE, and RAEL2), each
computed over the set of missing entries.]
See Section 4 for the motivation for using these error measures to evaluate the imputation methods. To understand the
variability in the imputation procedures, we perform each
missing data simulation 1000 times.
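For orientation, the two root mean squared error measures can be sketched in Python under their conventional definitions. This is an assumption rather than a transcription of the paper's formulas: RMSE is taken as the square root of the mean squared difference between true and imputed values over the missing entries, LRMSE as the same quantity on log-transformed values, and the base-2 logarithm is a choice made for this sketch. The relative measures (RAE and RAEL2) are not sketched here.

```python
import math


def rmse(true_values, imputed_values):
    """Root mean squared error over the missing entries (conventional form)."""
    n = len(true_values)
    squared = sum((t - e) ** 2 for t, e in zip(true_values, imputed_values))
    return math.sqrt(squared / n)


def lrmse(true_values, imputed_values):
    """RMSE on log-transformed values.

    Assumption: base-2 logarithm; all values must be strictly positive.
    """
    return rmse([math.log2(t) for t in true_values],
                [math.log2(e) for e in imputed_values])


# Hypothetical true and imputed expression values at two missing positions.
truth = [2.0, 8.0]
estimate = [4.0, 4.0]
print(rmse(truth, estimate), lrmse(truth, estimate))
```

The toy values show why the two measures can rank methods differently: on the raw scale the error at the larger value dominates, whereas on the log scale both errors are one unit.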
2.7. Ranking the Imputation Methods. To identify the overall best and worst performing
imputation methods (IM), we rank the IM based on their average performance across the different
error measures, all pools, and all sites. The ranking procedure is carried out separately for
5% and 10% deletion.
For each simulation, we compute 4 error measures for each of the 10 imputation methods.
Averaging over the 1000 simulations, we get an average error value for each imputation method
for every site and pool combination. For example, for the metric RMSE, there are 10 values: 1
for each imputation method at, say, Site 1 and Pool A.
Then, we rank the 10 IM based on each error measure separately for each site and pool
combination. For example, based on RMSE values, the IM are ranked from the lowest to highest;
the IM with the lowest RMSE value is 1 and the IM with the highest RMSE is 10. The IM each have
a rank value for a given error measure at each site for each of the 4 pools.
For every imputation method, the error measure rank values are averaged across the 6 sites for
each pool; thus we obtain 4 average rank values, 1 for each pool. Finally, we average these 4
rank values to obtain a single number that gives a global ranking to every imputation method,
reflecting
Table 1

Method    RMSE          LRMSE         RAE           RAEL2         Average
          5%     10%    5%     10%    5%     10%    5%     10%    5%     10%
BPCA      2.38   2.33   5.5    5.71   5.13   5.46   5.63   5.67   4.66   4.79
KNN1      9.79   9.79   9.88   10     9.83   10     9.79   9.88   9.82   9.92
KNN5      7.83   7.83   9.08   8.88   8.33   8.75   8.42   8.79   8.42   8.56
LLS1      5.17   5.21   4.29   4.25   4.67   4.5    4.38   4.33   4.63   4.57
LLS3      3.83   3.75   2.17   2.25   2.13   2.17   2.17   2.17   2.57   2.58
LLS4      2.79   2.92   2.29   2.92   1.96   1.96   2      2.08   2.26   2.48
LSA       1      1      4.88   4.92   4.33   4.5    4.71   4.71   3.73   3.78
NIPALS    7.25   7.25   7.33   6.96   7.33   7.08   7.33   7.04   7.31   7.08
ROW       6      5.96   1.5    1.5    2      2      1.58   1.46   2.77   2.72
SVD       8.96   8.96   8.29   8.08   8.29   8.17   8.33   8.29   8.47   8.38

Rows correspond to imputation methods and columns correspond to error measures with the last
columns showing the average across the error measures. Each imputation method is ranked based
on its average rank performance across all pools and all sites. The rank values for every error
measure and imputation method combination are averaged across the 6 sites and 4 pools as
detailed in Section 2. Smaller average rank values suggest more accurate imputation methods.
From the table, we observe that the RMSE metric suggests that the LSA imputation method has the
best performance. With the LRMSE and RAEL2 metrics, ROW is the best imputation method. LLS with
$k$ = 4 (LLS4) has the best performance when we use the RAE error measure. KNN with $k$ = 1
(KNN1) has the highest rank value for any given error measure; thus, it is the worst performing
imputation method. LLS with $k$ = 4 (LLS4) has the overall best performance across the
different error measures. These results hold true for both 5% and 10% deletion.
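The ranking procedure of Section 2.7 can be sketched as follows. This is a minimal Python illustration with made-up error values for three methods in two site and pool cells; the function names are hypothetical, and ties are broken by input order, which is an assumption since the text does not say how ties are handled. With the same number of sites in every pool, averaging over sites within each pool and then over pools gives the same number as averaging over all cells at once, which is what the sketch does.

```python
def rank_methods(errors):
    """Rank methods from lowest error (rank 1) to highest within one cell."""
    ordered = sorted(errors, key=errors.get)
    return {method: position for position, method in enumerate(ordered, start=1)}


def average_ranks(cells):
    """Average each method's rank over a list of per-cell error dictionaries."""
    totals = {}
    for errors in cells:
        for method, rank in rank_methods(errors).items():
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(cells) for method, total in totals.items()}


# Hypothetical mean RMSE values for three methods in two site/pool cells.
cells = [
    {"LSA": 310.0, "BPCA": 325.0, "KNN1": 900.0},
    {"LSA": 305.0, "BPCA": 300.0, "KNN1": 880.0},
]
avg = average_ranks(cells)
print(avg)

# The Average columns are means of the four per-measure ranks: the four 5%
# ranks listed first in Table 1 average to the first 5% Average entry, 4.66.
first_row_5pct = [2.38, 5.5, 5.13, 5.63]
print(round(sum(first_row_5pct) / len(first_row_5pct), 2))
```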
Figure 2: Average RMSE barplot with error bars. RMSE values are represented on the y-axis. The
x-axis has the 6 sites (1, 2, 3, 4, 5, and 6) and 10 imputation tests (BPCA, KNN with $k$ = 1,
5, LLS with $k$ = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashed bar is
the overall mean for individual IM where the RMSE values are averaged across the 4 pools and 6
sites. This figure shows the performance of the 10 imputation tests using the RMSE metric with
5% deletion of values. 1000 simulations were performed where each simulation generated a
dataset containing 5% missing values by randomly removing probe set values from the complete
expression matrix of probe sets. Missing values were imputed using the 10 imputation tests. The
results are compared using the RMSE metric (see Section 2). The RMSE values are averaged across
the 4 pools. LSA has the best performance as it has the lowest RMSE value for a given site. KNN
with $k$ = 1 has the highest RMSE value and has the worst performance for all pools and all
sites.
Figure 3: Average LRMSE barplot with error bars. LRMSE values are represented on the y-axis. The
x-axis has the 6 sites (1, 2, 3, 4, 5, and 6) and 10 imputation tests (BPCA, KNN with $k$ = 1,
5, LLS with $k$ = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashed bar
represents the overall mean for individual IM where the LRMSE values are averaged across the 4
pools and 6 sites. This figure shows the performance of the 10 imputation tests using the LRMSE
metric with 5% deletion of values. 1000 simulations were performed where each simulation
generated a dataset containing 5% missing values by randomly removing probe set values from the
complete expression matrix of probe sets. Missing values were imputed using the 10 imputation
tests. The results are compared using the LRMSE metric (see Section 2). The LRMSE values are
averaged across the 4 pools. ROW has the best performance as it has the lowest LRMSE value for
a given site. KNN with $k$ = 1 has the highest LRMSE value and has the worst performance for
all pools and all sites.
3. Results
We summarize our findings in two ways: probe set detection call summaries, and error metrics
and rankings for the IM. Detection call results compare sites and pools, while IM results
identify the best imputation method based on the error metrics discussed in Section 2.
3.1. Detection Call Algorithm Results. Across the 120 samples, the mean percent present calls
shown in Figure 1 have a minimum value of 51% and a maximum value of 58.5%. We observe that
Site 4 has the highest mean percent present calls and Site 2 has the lowest mean percent
present calls for probe sets. In terms of pools, Pool B has the lowest mean percent present
calls for probe sets while Pool D has the highest mean percent present calls (see Figure 1). We
performed an analysis of variance (ANOVA) to examine the effects of site and pool on the
percentage of present probe sets in a microarray. The $p$-values for site and pool are <0.0001,
indicating significant site and pool effects. Nevertheless, the smallest percent present is
49.77 while the largest percent present is 63.69. These results indicate that the percentage of
present probes is sensitive to site and pool and could be caused by the wet
Figure 4: Average RAE barplot with error bars. RAE values are represented on the y-axis. The
x-axis has the 6 sites (1, 2, 3, 4, 5, and 6) and 10 imputation tests (BPCA, KNN with $k$ = 1,
5, LLS with $k$ = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashed bar
represents the overall mean for individual IM where the RAE values are averaged across the 4
pools and 6 sites. This figure shows the performance of the 10 imputation tests using the RAE
metric with 5% deletion of values. 1000 simulations were performed where each simulation
generated a dataset containing 5% missing values by randomly removing probe set values from the
complete expression matrix of probe sets. Missing values were imputed using the 10 imputation
tests. The results are compared using the RAE metric (see Section 2). The RAE values are
averaged across the 4 pools. LLS with $k$ = 4 has the best performance as it has the lowest RAE
value for a given site. KNN with $k$ = 1 has the highest RAE value and has the worst
performance for all pools and all sites.
4. Discussion
The MAQC project allows researchers to study a variety of microarray aspects including
comparisons of one-color and two-color arrays [28], reproducibility [14, 15, 29], removal of
batch effects [30], and determining differentially expressed
Figure 5: Average RAEL2 barplot with error bars. RAEL2 values are represented on the y-axis. The
x-axis has the 6 sites (1, 2, 3, 4, 5, and 6) and 10 imputation tests (BPCA, KNN with $k$ = 1,
5, LLS with $k$ = 1, 3, 4, LSA, NIPALS, ROW, and SVD). Mean (M) depicted by the slashed bar
represents the overall mean for individual IM where the RAEL2 values are averaged across the 4
pools and 6 sites. This figure shows the performance of the 10 imputation tests using the RAEL2
metric with 5% deletion of values. 1000 simulations were performed where each simulation
generated a dataset containing 5% missing values by randomly removing probe set values from the
complete expression matrix of probe sets. Missing values were imputed using the 10 imputation
tests. The results are compared using the RAEL2 error measure (see Section 2). The RAEL2 values
are averaged across the 4 pools. ROW has the best performance as it has the lowest RAEL2 value
for a given site. KNN with $k$ = 1 has the highest RAEL2 value and has the worst performance
for all pools and all sites.
via (1) raw score error measures and (2) rank-based error measures taken across our cohort of
error measures. The error measures chosen (see Section 2) were designed to assess (1) errors in
raw expression values (RMSE), (2) errors in the logarithm transformed expression values
(LRMSE), (3) relative errors designed to penalize errors relative to the raw expression values
(RAE), and (4) relative errors designed to penalize the error relative to the logarithm
expression value (RAEL2). Hence, there are 2 (relative and absolute) error measures based on
raw expression scores and 2 error measures (relative and absolute) based on the logarithm of
expression values. Because of this balanced design in error measures between relative and
absolute measures and raw and logarithm transformed data, it is reasonable to compute the
average rank across these error measures to assess the overall quality of an imputation method
(see Table 1). Thus, these rank-based error measures shown in Table 1 summarize the results in
a straightforward manner across sites, pools, and error measures. Note that we set the
threshold parameter of the RAE error measure to 0.20. For future work, our group is interested
in studying the robustness of RAE to the choice of this parameter. We also include the raw
score error measures to demonstrate the best imputation methods regardless of the employed set
of the imputation methods (see Figures 2–5).
Shi et al. [43]. In that discussion, one of the main concerns involves technical versus
biological variation. This important issue has arisen when studying other technical microarray
datasets [39]. Considering both aspects of this question, if we use datasets containing
biological and technical variation, that is, datasets designed to answer biological questions,
then there are biases due to the intent of the original datasets (e.g., biological variation of
the species, sample preparation, procurement of RNA, and hybridization affinities).
5. Conclusions
Missing values in microarray experiments are a common problem with effects on downstream
analysis. Many variables such as the biological variability of the dataset, experimental
conditions of the study, percentage of missing values, and type of downstream analysis
performed need to be considered when choosing an imputation method.
In our work, we use the MAQC datasets with the MAS 5.0 preprocessing scheme to compare missing
data imputation schemes for Affymetrix datasets. The best and worst performing imputation
schemes remain the same for both 5% and 10% deletion percentages. We observe that the
$k$-nearest neighbor method with $k$ = 1 has the worst performance among the imputation schemes
across all error measures. The local least squares (LLS) method with $k$ = 4 gives the best
performance for imputing missing values across all error measures for both 5% and 10% deletion.
These conclusions are based on studying 10 imputation methods with 4 error metrics and 1000
Monte-Carlo simulations.
Authors' Contribution
Jeffrey C. Miecznikowski and Song Liu designed the study. Sreevidya Sadananda Sadasiva Rao
performed the statistical analysis. Sreevidya Sadananda Sadasiva Rao and Jeffrey C.
Miecznikowski wrote the paper. Lori A. Shepherd and Andrew E. Bruno assisted with the data
analysis and writing the paper. All authors read and approved the final paper.
References
[1] O. Troyanskaya, M. Cantor, G. Sherlock et al., Missing value
estimation methods for DNA microarrays, Bioinformatics, vol.
17, no. 6, pp. 520525, 2001.
[2] H. Wold, Path models with latent variables: the NIPALS
approach, in Quantitative Sociology: International Perspectives
on Mathematical and Statistical Modeling, pp. 307357, 1975.
[3] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, and
S. Ishii, A Bayesian missing value estimation method for gene
expression profle data, Bioinformatics, vol. 19, no. 16, pp. 2088
2096, 2003.
[4] T. H. B, B. Dysvik, and I. Jonassen, LSimpute: accurate estimation of missing values in microarray data with least squares
methods, Nucleic Acids Research, vol. 32, no. 3, p. e34, 2004.
[5] H. Kim, G. H. Golub, and H. Park, Missing value estimation
for DNA microarray gene expression data: local least squares
imputation, Bioinformatics, vol. 21, no. 2, pp. 187198, 2005.
9
[6] M. Ouyang, W. J. Welsh, and P. Georgopoulos, Gaussian mixture clustering and imputation of microarray data, Bioinformatics, vol. 20, no. 6, pp. 917923, 2004.
[7] J. C. Miecznikowski, S. Damodaran, K. F. Sellers, D. E. Coling,
R. Salvi, and R. A. Rabin, A comparison of imputation procedures and statistical tests for the analysis of two-dimensional
electrophoresis data, Proteome Science, vol. 9, p. 14, 2011.
[8] G. N. Brock, J. R. Shafer, R. E. Blakesley, M. J. Lotz, and G. C.
Tseng, Which missing value imputation method to use in
expression profles: a comparative study and two selection
schemes, BMC Bioinformatics, vol. 9, no. 1, p. 12, 2008.
[9] M. Celton, A. Malpertuy, G. Lelandais, and A. G. de Brevern,
Comparative analysis of missing value imputation methods
to improve clustering and interpretation of microarray experiments, BMC Genomics, vol. 11, no. 1, p. 15, 2010.
[10] S. Oh, D. D. Kang, G. N. Brock, and G. C. Tseng, Biological
impact of missing-value imputation on downstream analyses of
gene expression profles, Bioinformatics, vol. 27, no. 1, Article
ID btq613, pp. 7886, 2011.
[11] R. Mei, X. Di, T. B. Ryder et al., Analysis of high density expression microarrays with signed-rank call algorithms, Bioinformatics, vol. 18, no. 12, pp. 15931599, 2002.
[12] R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit,
Bioinformatics and computational biology solutions using R
and Bioconductor, Statistics for Biology and Health, 2005.
[13] L. Gautier, L. Cope, B. M. Bolstad, and R. A. Irizarry, AfyAnalysis of Afymetrix GeneChip data at the probe level,
Bioinformatics, vol. 20, no. 3, pp. 307315, 2004.
[14] L. Shi, Te MicroArray Quality Control (MAQC) project
shows inter- and intraplatform reproducibility of gene expression measurements, Nature Biotechnology, vol. 24, no. 9, pp.
11511161, 2006.
[15] J. J. Chen, H. Hsueh, R. R. Delongchamp, C. Lin, and C.
Tsai, Reproducibility of microarray data: a further analysis of
microarray quality control (MAQC) data, BMC Bioinformatics,
vol. 8, no. 1, p. 412, 2007.
[16] L. Shi, W. D. Jones, R. V. Jensen et al., Te balance of reproducibility, sensitivity, and specifcity of lists of diferentially
expressed genes in microarray studies, BMC Bioinformatics,
vol. 9, supplement 9, p. S10, 2008.
[17] S. E. Choe, M. Boutros, A. M. Michelson, G. M. Church, and
M. S. Halfon, Preferred analysis methods for Afymetrix GeneChips revealed by a wholly defned control dataset, Genome
Biology, vol. 6, no. 2, p. R16, 2005.
[18] Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, Preferred analysis methods for Afymetrix GeneChips. II. An expanded, balanced, wholly-defned spike-in dataset, BMC Bioinformatics,
vol. 11, no. 1, p. 285, 2010.
[19] Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, A wholly
defned Agilent microarray spike-in dataset, Bioinformatics,
vol. 27, no. 9, Article ID btr135, pp. 12841289, 2011.
[20] I. Afymetrix, Statistical algorithms description document,
Technical Paper, 2002.
[21] C. L. Wilson and C. J. Miller, Simpleafy: a BioConductor package for Afymetrix Quality Control and data analysis, Bioinformatics, vol. 21, no. 18, pp. 36833685, 2005.
[22] R. C. Gentleman, V. J. Carey, D. M. Bates et al., Bioconductor:
open sofware development for computational biology and
bioinformatics, Genome Biology, vol. 5, no. 10, p. R80, 2004.
[23] T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu, Impute:
Imputation for Microarray Data, 1999, R package version 1.10.0.
10
[24] T.H. BB, B. Dysvik, and I. Jonassen, Lsimpute: Accurate estimation of missing values in microarray data with least squares
methods, 2005, http://www.ii.uib.no/trondb/imputation/.
[25] D. V. Nguyen, N. Wang, and R. J. Carroll, Evaluation of missing
value estimation for microarray data, Journal of Data Science,
vol. 2, no. 4, pp. 347370, 2004.
[26] W. Stacklies and H. Redestig, PcaMethods: A Collection of PCA
Methods, 2007, R package version 1.18.0.
[27] S. S. Sadasiva Rao, L. A. Shepherd, A. E. Bruno, S. Liu, and J.
C. Miecznikowski, A full analysis of imputation procedures for
Afymetrix gene expression datasets, Technical Report 1202,
SUNY University at Bufalo-Department of Biostatistics, Buffalo, NY, USA, 2012.
[28] T. A. Patterson, E. K. Lobenhofer, S. B. Fulmer-Smentek et al.,
Performance comparison of one-color and two-color platforms within the MicroArray Quality Control (MAQC)
project, Nature Biotechnology, vol. 24, no. 9, pp. 11401150,
2006.
[29] Z. Wen, C. Wang, Q. Shi et al., Evaluation of gene expression
data generated from expired Afymetrix GeneChip microarrays using MAQC reference RNA samples, BMC Bioinformatics, vol. 11, supplement 6, p. S10, 2010.
[30] J. Luo, M. Schumacher, A. Scherer et al., A comparison of batch
efect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data,
Pharmacogenomics Journal, vol. 10, no. 4, pp. 278291, 2010.
[31] K. Kadota and K. Shimizu, Evaluating methods for ranking
diferentially expressed genes applied to microArray quality
control data, BMC Bioinformatics, vol. 12, no. 1, p. 227, 2011.
[32] T. Aittokallio, Dealing with missing values in large-scale studies: microarray data imputation and beyond, Briefngs in Bioinformatics, vol. 11, no. 2, Article ID bbp059, pp. 253264, 2009.
[33] J. Tuikkala, L. L. Elo, O. S. Nevalainen, and T. Aittokallio, Missing value imputation improves clustering and interpretation of
gene expression microarray data, BMC Bioinformatics, vol. 9,
no. 1, p. 202, 2008.
[34] A. Liew, N. Law, and H. Yan, Missing value imputation for gene
expression data: computational techniques to recover missing
data from available information, Briefngs in Bioinformatics,
vol. 12, no. 5, Article ID bbq080, pp. 498513, 2011.
Advances in Bioinformatics
[41] J. M. Perkel, Six things you wont fnd in the MAQC, Te
Scientist, vol. 20, no. 11, p. 68, 2007.
[42] P. Liang, MAQC papers over the cracks, Nature Biotechnology,
vol. 25, no. 1, pp. 2728, 2007.
[43] L. Shi, W. D. Jones, R. V. Jensen et al., Reply to MAQC papers
over the cracks, Nature Biotechnology, vol. 25, pp. 2829, 2007.
Research Article
A Multilevel Gamma-Clustering Layout Algorithm for
Visualization of Biological Networks
Tomas Hruz,1 Markus Wyss,2 Christoph Lucas,1 Oliver Laule,2
Peter von Rohr,2 Philip Zimmermann,2 and Stefan Bleuler2
1. Introduction
The development of systems biology has brought a strong interest in considering an organism as a large and complex network of interacting parts. Many subsystems of living organisms can be modeled as complex networks. One important example is the network of biochemical reactions, which constitutes a complex system responsible for homeostasis in the living cell. An abstract network model of the biochemical processes within the cell can be constructed such that reactions are represented as nodes and metabolites (and enzymes) as edges. In the past, this system was studied mainly at the subsystem level through metabolic pathways. Recently, it has become important to consider the metabolic system as one complex network in order to understand deeper phenomena involving interactions across multiple pathways.
The need to study the whole network, consisting of thousands of reactions, metabolites, and enzymes, requires a visualization system that allows biologists to study the overall structure of the system. Such a visualization should allow navigation and comprehension of the global system structures. In the present paper, we propose a visualization algorithm for very large networks arising in systems biology and we illustrate its usage on two complex biological networks. The first case study is a metabolic network of Arabidopsis thaliana and the second case study is a gene correlation network of Mus musculus based on mRNA expression measurements.
Biological networks are usually represented as graphs because such a model can provide insight into their structure. The goal of the subsequent visualization is to present the information contained in the graph in a clear and structured way. For instance, closely related nodes of a subsystem should be positioned together. This can be achieved using a cost function which formalizes the visualization criteria and which controls the drawing algorithm. Several standard algorithms exist to achieve this goal using continuous optimization of the cost function, but the optimization of a discrete cost function remains hard to solve.
Figure 1: Arabidopsis thaliana metabolic network visualized with (a) a force-directed algorithm with all edges shown and (b) the MLGA method, which combines γ-clustering with the force-directed algorithm. The underlying network has 1199 reactions (nodes) and 4386 metabolites (edges).
many parameters and a complex set of rules. The rules and their parameters are heuristically identified to give a uniform distribution of the nodes within the connected component. Another drawback is that the method per se cannot visualize the structure of dense subgraphs because of too many edge crossings (see [4, page 1887, Figure 3]). To improve the visualization, the authors introduce visual operations that collapse the cliques (and complete bipartite subgraphs) to reduce the number of edges and nodes. Additionally, the problem of finding a maximum clique (or complete bipartite subgraph) is NP-hard, as is its approximation, so there is almost no chance of obtaining fast identification heuristics for large graphs. Our algorithm improves the situation in this respect because relaxing the requested density of the subgraph through γ-clustering (where 0 ≤ γ ≤ 1 is the cluster density) allows much more efficient heuristics for large graphs (on the order of 10^6 nodes and edges [5]).
A global optimization method was explored in [6], where the authors describe a layout algorithm for metabolic networks. Nodes of the graph are placed on a square grid. A discrete cost function between a pair of nodes is introduced based on their relation and position on the grid. By minimizing the total cost, a layout is generated. A simulated annealing heuristic is used to optimize the cost function by choosing better layouts among possible candidates. Due to the computationally costly calculation of the layout, the approach is applicable only to networks with a few hundred nodes. The authors showed that the algorithm works well on sparse or planar graphs and clarifies the network structure, as the cost function of the method places closely related nodes together. However, this layout algorithm would place dense parts of the graph in the same area, leading to many edge crossings. Additionally, as no reduction in the number of edges or nodes is performed, the identification of the graph structure would be very hard for large graphs with many edges.
2. MLGA Approach
The experience with the existing visualization methods has shown that it is necessary to provide a structural view of dense networks. Representing networks with a large number of nodes and edges in a two-dimensional area results in many edge crossings. Dense subgraphs prevent the recognition of the network structure if drawn directly. Apart from other technical problems, this is the main shortcoming of most layout algorithms. We believe that future progress in the visualization of large and dense networks lies in algorithms which analyze the structure of the graph first and then generate a new graph which contains specific semantic symbols for regular substructures like dense clusters. Additionally, the algorithms may allow for drilling down and interactively showing all edges for a given substructure, as described below (see the section on visual representation and operation). Dense clusters are ideal candidates for graph preprocessing because they can be simply described and efficiently searched, and, if they are replaced with a specific symbol, they significantly reduce the complexity of the resulting low-dimensional (planar or three-dimensional) picture because they contain most of the edges. Moreover, we focus on the graph clustering algorithms
3. Algorithm
The MLGA method combines multilevel γ-clustering and a specific tree transformation with a force-directed layout algorithm to visualize the structure of highly complex biological networks. First, the original graph is preprocessed using the clustering algorithm described in [5] to identify the clusters. For every cluster, a new cluster node is created, and these new nodes are linked with new edges if there are edges between the underlying cluster nodes, as illustrated in Figure 2. This process constructs the first hierarchical layer above the original graph. Then, the clustering algorithm is recursively applied to the cluster nodes themselves to generate a cluster hierarchy. Afterwards, this hierarchy is transformed into a tree showing only the shortest paths from a root node through the intermediate cluster nodes to the nodes of the initial graph. Finally, a modified version of the force-directed algorithm visualizes the tree structure of the remaining graph. This combination of preprocessing and layout algorithm eases the identification of the clusters and their interactions; see Figure 1(b).
For the clustering step, we prefer γ-clustering to (α, β)-clustering or to other more complex methods because, with those, it would be much more difficult to control the clustering parameters during the transitions between the hierarchy levels. The only parameter which has to be specified for our algorithm at every hierarchical level is the parameter γ. It can be seen that the density of the graph grows when the algorithm proceeds to the higher levels. On the other hand, the number of nodes decreases very rapidly, so that after a few steps there is only one clique left. As a consequence, it is not meaningful to use the same clustering parameters as the algorithm recursively proceeds up the hierarchy. For more complex clustering algorithms, it would be very difficult to define good clustering parameters if the parameter space has more dimensions. In our case, the sequence of values for the parameter γ must be growing. As we discuss later, the actual values can be determined empirically and, moreover, 3-4 values are sufficient for large graphs.
Figure 2: (a) The construction of a cluster hierarchy and (b) the transformation to a tree.
4. Algorithmic Phases
Let $G = (V, E)$ be an undirected graph with the vertex set $V$ and edge set $E$. A γ-cluster for $0 \leq \gamma \leq 1$, also described as a γ-clique or dense subgraph, is a subset $S \subseteq V$ such that for its edge set $E(S)$ and the vertex set $V(S)$ the following is true:
$$|E(S)| \geq \gamma \binom{|V(S)|}{2}. \tag{1}$$
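The density condition in (1) can be checked directly for any candidate node set. The following is a minimal sketch, not the authors' implementation; the function name `is_gamma_cluster` and the edge-list representation are assumptions made for illustration.

```python
def is_gamma_cluster(nodes, edges, gamma):
    """Check the density condition |E(S)| >= gamma * C(|V(S)|, 2).

    nodes: iterable of node labels forming the candidate set S (hypothetical input format)
    edges: iterable of (u, v) pairs of the undirected graph
    gamma: requested density, 0 <= gamma <= 1
    """
    node_set = set(nodes)
    # Collect undirected edges with both endpoints inside S (self-loops ignored).
    induced = {frozenset((u, v)) for u, v in edges
               if u in node_set and v in node_set and u != v}
    n = len(node_set)
    max_edges = n * (n - 1) // 2  # number of edges in a complete graph on n nodes
    return len(induced) >= gamma * max_edges


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(is_gamma_cluster({1, 2, 3}, edges, 1.0))     # a triangle is a full clique
print(is_gamma_cluster({1, 2, 3, 4}, edges, 0.8))  # 4 of 6 possible edges
```

With γ = 1 the condition reduces to the ordinary clique definition; lowering γ admits progressively sparser node sets.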
5. γ-Cluster Detection
6. Hierarchy Creation
The cluster detection algorithm is repeatedly applied to the
graph and the clusters to build a hierarchy; see Figure 2(a).
Each node of the graph has an attribute level which is
input: V: vertices
input: E: edges
input: γ: density of cluster
begin
  initialize empty list of clusters C;
  count ← 0;
  cluster ← construct_dsubg(γ, V, E);
  while cluster ≠ ∅ ∧ count < max_count do
    size ← |cluster|;
    if size ≥ min_size then
      add cluster to C;
      count ← 0;
    else
      count ← count + 1;
    end
    set V to V without nodes of cluster;
    set E to E without edges within cluster;
    cluster ← construct_dsubg(γ, V, E);
  end
end
Algorithm 1: createClusters.
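The loop of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions: `construct_dsubg` below is a simple greedy stand-in written for this example, not the quasi-clique detector of [5], and the default values of `min_size` and `max_count` are hypothetical.

```python
def construct_dsubg(gamma, vertices, adj):
    """Greedy stand-in (assumption): grow a set from the highest-degree vertex
    while its edge density stays at least gamma."""
    if not vertices:
        return set()
    # Seed: highest remaining degree, ties broken by smallest label.
    seed = min(vertices, key=lambda v: (-len(adj[v] & vertices), v))
    cluster, edges_inside = {seed}, 0
    while True:
        best, best_links = None, 0
        for v in sorted(vertices - cluster):
            links = len(adj[v] & cluster)
            if links > best_links:
                best, best_links = v, links
        if best is None:
            break
        n = len(cluster) + 1
        if edges_inside + best_links < gamma * n * (n - 1) / 2:
            break  # adding the best candidate would violate the density bound
        cluster.add(best)
        edges_inside += best_links
    return cluster


def create_clusters(vertices, edges, gamma, min_size=3, max_count=5):
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u != v and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    clusters, count = [], 0
    cluster = construct_dsubg(gamma, vertices, adj)
    while cluster and count < max_count:
        if len(cluster) >= min_size:
            clusters.append(cluster)
            count = 0
        else:
            count += 1  # too small: count consecutive failures
        vertices -= cluster          # set V to V without nodes of cluster
        for u in cluster:            # set E to E without edges within cluster
            adj[u] -= cluster
        cluster = construct_dsubg(gamma, vertices, adj)
    return clusters


# Two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
clusters = create_clusters(range(1, 7), edges, gamma=1.0)
print(clusters)
```

The counter mirrors the pseudocode: it resets whenever a sufficiently large cluster is found, so the loop stops only after `max_count` consecutive undersized results or when no vertices remain.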
begin
  level ← 0;
  nodes ← getNodes(level);
  γ ← getGammaValue(level);
  clusters ← createClusters(nodes, γ);
  while clusters ≠ ∅ do
    create one node on the next level for each cluster;
    create edges between clusters and nodes;
    level ← level + 1;
    nodes ← getNodes(level);
    γ ← getGammaValue(level);
    clusters ← createClusters(nodes, γ);
  end
end
Algorithm 2: createMultiLevelClusters.
7. Tree Transformation
To reveal the structure of the cluster hierarchy, a tree transformation is performed; see Figure 2(b). In the transformation (Algorithm 3), a hidden root node is connected to all cluster nodes at the highest level as their parent. Afterwards, only the edges belonging to the shortest path from the root node to each node are shown. If the shortest path is not unique, a path is chosen at random. The distance of each node is calculated beginning from the root using a breadth-first search. The parent of a node is set to the neighbor node with the shortest distance. If the node belongs to a cluster node one level above, the parent is set to this cluster.
8. Layout Algorithm
A modified version of a force-directed algorithm [2] is used to lay out the transformed graph. Our method introduces a different edge length on each level: edges at higher levels are assigned greater lengths than edges at lower levels. This results in a natural visualization of the hierarchy. Furthermore, the initial positions of the nodes are specifically calculated. The nodes of the graph are located on concentric circles with the hidden root node at the center. Nodes immediately connected to the root are positioned on the innermost circle, their children on the next circle, and so on. A segment of the circle is assigned to each node within which its location is calculated. Recursively, a fraction of this segment is assigned to the children of the node on the next circle. This initial setup reduces the rendering time and guides the layout algorithm to visualize the tree structure. A random initial positioning may result in a local minimum of the force-directed layout with many edge crossings, which would disrupt the tree representation. Additionally, the repulsive forces are ignored beyond a given distance depending on the size of the drawing area. This restriction prevents disconnected components of the graph from separating too far. To suppress the well-known oscillation problem [10] of force-directed algorithms, a damping heuristic is used in which we compute an average of the previous node positions during the force calculation.
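The initial placement on concentric circles can be illustrated with a short sketch. It assumes, for simplicity, that each node's angular segment is split equally among its children and that rings are equally spaced; the text only states that a fraction of the segment is passed on, so both choices, as well as the names `radial_positions` and `ring_spacing`, are assumptions of this example.

```python
import math


def radial_positions(children, root, ring_spacing=1.0):
    """Place a tree on concentric circles around a hidden root.

    children: dict mapping a node to the list of its child nodes (hypothetical format)
    Returns a dict mapping each node to an (x, y) starting position.
    """
    positions = {root: (0.0, 0.0)}

    def place(node, depth, start, end):
        kids = children.get(node, [])
        if not kids:
            return
        width = (end - start) / len(kids)  # equal split (assumption)
        for i, kid in enumerate(kids):
            lo = start + i * width
            hi = lo + width
            angle = (lo + hi) / 2.0        # center of the assigned segment
            radius = depth * ring_spacing
            positions[kid] = (radius * math.cos(angle), radius * math.sin(angle))
            place(kid, depth + 1, lo, hi)  # children stay inside the parent's segment

    place(root, 1, 0.0, 2.0 * math.pi)
    return positions


tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
pos = radial_positions(tree, "root")
for name, (x, y) in pos.items():
    print(name, round(x, 3), round(y, 3))
```

Because every subtree is confined to its parent's angular segment, the starting layout has no crossings between tree edges, which is the property the force-directed refinement is meant to preserve.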
begin
  create root;
  set parent of highest level nodes to root;
  candidates ← highest level nodes;
  foreach node ∈ candidates do
    if node belongs to a cluster one level above then
      node.parent ← cluster;
    else
      set node.parent to the neighbor with shortest distance to the root;
    end
    node.dist ← node.parent.dist + 1;
    foreach neighbor of node except node.parent do
      if neighbor has already been visited then
        hide edge;
      else
        candidates ← candidates ∪ {neighbor};
      end
    end
  end
end
Algorithm 3: treeTransformation.
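A breadth-first rendering of Algorithm 3 might look as follows. This is a sketch under assumptions: the adjacency dictionary is taken to contain the cluster-membership edges as well as the ordinary edges, the names `adj`, `cluster_of`, and `top_nodes` are hypothetical, and ties between equally distant neighbors are broken deterministically here rather than at random as in the text.

```python
from collections import deque


def tree_transformation(adj, cluster_of, top_nodes, root="ROOT"):
    """Return (parent, hidden): the tree parent of every node and the hidden edges."""
    parent = {root: None}
    dist = {root: 0}
    for n in top_nodes:          # highest-level cluster nodes hang off the hidden root
        parent[n] = root
        dist[n] = 1
    queue = deque(top_nodes)
    queued = set(top_nodes)
    hidden = set()
    while queue:
        node = queue.popleft()
        if node not in dist:     # not yet visited: choose its parent now
            cluster = cluster_of.get(node)
            if cluster is not None and cluster in dist:
                parent[node] = cluster
            else:
                visited = [n for n in adj[node] if n in dist]
                parent[node] = min(visited, key=lambda n: dist[n])
            dist[node] = dist[parent[node]] + 1
        for nb in adj.get(node, ()):
            if nb == parent[node]:
                continue
            if nb in dist:       # neighbor already visited: the edge is not a tree edge
                hidden.add(frozenset((node, nb)))
            elif nb not in queued:
                queued.add(nb)
                queue.append(nb)
    return parent, hidden


adj = {
    "C1": ["C2", "a", "b"], "C2": ["C1", "c", "d"],
    "a": ["C1", "b", "e"], "b": ["C1", "a", "c"],
    "c": ["C2", "b", "d"], "d": ["C2", "c"], "e": ["a"],
}
cluster_of = {"a": "C1", "b": "C1", "c": "C2", "d": "C2"}
parent, hidden = tree_transformation(adj, cluster_of, ["C1", "C2"])
print(parent)
print(sorted(tuple(sorted(e)) for e in hidden))
```

In the example, the edge between the two cluster nodes and all edges among the original nodes a, b, c, and d are hidden, while the unclustered node e is attached to its nearest visited neighbor a.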
Node of initial graph (degree 0)
Node of initial graph (degree 1)
Node of initial graph (degree 2)
Node of initial graph (degree 3)
Node of initial graph (degree 4)
Node of initial graph (degree 5)
Node of initial graph (degree 6 or higher)
Cluster node level 1
Cluster node level 2
Edge of initial graph
Edge of a node belonging to a cluster
Edge between cluster nodes of same level
11. Results
11.1. Metabolic Networks. To provide experimental justification of the proposed method, we extracted the metabolic
network for Arabidopsis thaliana from Genevestigator [13].
The network has 932 nodes and 2315 edges. The edges
Figure 4: (a) A part of a gene correlation network of Arabidopsis thaliana drawn with MLGA, (b) showing all edges connected to the γ-cluster node at the top right, and (c) displaying all edges between the nodes defining the cluster. The inset shows a magnification of the edges of the selected cluster.
Figure 5: MLGA applied to (a) the A. thaliana biochemical network without signaling effects and regulatory elements. Reactions directly involved in the synthesis of brassinosteroids are highlighted in red and direct connections are depicted by red edges. The level 2 cluster, indicated by an arrow, combines the major parts of isoprenoid biosynthesis, resulting from the nonmevalonate pathway. (b) The A. thaliana biochemical network including signaling effects and regulatory elements. Reactions directly involved in brassinosteroid and auxin metabolism/signaling are highlighted in red and direct connections are depicted by red edges. Black arrows point to reactions involved in brassinosteroid metabolism/signaling. Green arrows point to reactions involved in auxin metabolism/signaling.
8659 edges and 1232 nodes in Figure 6(b). In both cases the
ribosomal cluster can be clearly identifed.
12. Discussion
Visualization methods often contain parameters which must
be empirically identified. In [4], the selection of pivot nodes is
Figure 6: MLGA applied to (a) the Mus musculus gene correlation network generated with a threshold of 0.72 and (b) the gene correlation network generated with a threshold of 0.80. The red highlighted nodes and direct connections belong to the ribosomal cluster.
13. Conclusion
As discussed, many approaches try to improve the layout of complex networks through better placement of the nodes alone. In our work, we pursue a different line of research towards efficient visualization algorithms for large biological networks. Our approach does not aim at rendering all edges in a network; instead, we focus on the discovery and visualization of important structural features. This approach is combined with complementary visual operations which allow drilling down into the details of structurally identified elements. The MLGA method is successful in identifying the biologically relevant structures and allows for processing very large graphs, as we illustrated in two different case studies of biological networks. Naturally, this paradigm opens new questions on how to further improve the visualization output and speed. Different clustering algorithms can be tried to create the multilevel structure; however, in the case of multiparameter clustering, the control and analysis of the parameter values between the levels would become more difficult.
On the theoretical side, the next question to consider is how to provide a provably good (optimal) sequence of γ values. Another question is whether the surprisingly good structure identification features of our algorithm could be traced back to the scale-free character of many biological networks.
Disclosure
All materials, source code, and small case studies data can
be freely downloaded from http://www.pw.ethz.ch/research/projects/complexnetworks/. Very large datasets are available
on request.
Acknowledgments
The authors would like to thank Professor Peter Widmayer
for the ongoing support of this project. This work was also
supported by the Commission for Technology and Innovation of
the Swiss Federation under Grant 9428.1 PFLS-LS.
References
[1] T. Kamada and S. Kawai, "An algorithm for drawing general undirected graphs," Information Processing Letters, vol. 31, no. 1, pp. 7–15, 1989.
[2] T. M. J. Fruchterman and E. M. Reingold, "Graph drawing by force-directed placement," Software: Practice and Experience, vol. 21, no. 11, pp. 1129–1164, 1991.
[3] A. T. Adai, S. V. Date, S. Wieland, and E. M. Marcotte, "LGL: creating a map of protein function with an algorithm for visualizing very large biological networks," Journal of Molecular Biology, vol. 340, no. 1, pp. 179–190, 2004.
[4] K. Han and B.-H. Ju, "A fast layout algorithm for protein interaction networks," Bioinformatics, vol. 19, no. 15, pp. 1882–1888, 2003.
[5] J. Abello, M. Resende, and S. Sudarsky, "Massive quasi-clique detection," in LATIN 2002: Theoretical Informatics, S. Rajsbaum, Ed., vol. 2286 of Lecture Notes in Computer Science, pp. 598–612, Springer, Berlin, Germany, 2002.
[6] W. Li and H. Kurata, "A grid layout algorithm for automatic drawing of biochemical networks," Bioinformatics, vol. 21, no. 9, pp. 2036–2042, 2005.
[7] S. E. Schaeffer, "Graph clustering," Computer Science Review, vol. 1, pp. 27–64, 2007.
[8] N. Mishra, R. Schreiber, I. Stanton, and R. Tarjan, "Clustering social networks," in Algorithms and Models for the Web-Graph, A. Bonato and F. Chung, Eds., vol. 4863 of Lecture Notes in Computer Science, pp. 56–67, Springer, Berlin, Germany, 2007.
[9] J. Håstad, "Clique is hard to approximate within n^{1−ε}," Acta Mathematica, vol. 182, pp. 105–142, 1999.
[10] A. Frick, A. Ludwig, and H. Mehldau, "A fast adaptive layout algorithm for undirected graphs (extended abstract and system demonstration)," in Graph Drawing, R. Tamassia and I. Tollis,
drawing algorithm, in Proceedings of the 11th International Conference Information Visualization (IV '07), pp. 757–764, July 2007.
[27] B. Kaba, N. Pinet, G. Lelandais, A. Sigayret, and A. Berry, "Clustering gene expression data using graph separators," In Silico Biology, vol. 7, no. 4-5, pp. 433–452, 2007.
Research Article
Reverse Engineering Sparse Gene Regulatory Networks Using
Cubature Kalman Filter and Compressed Sensing
Amina Noor,1 Erchin Serpedin,1 Mohamed Nounou,2 and Hazem Nounou3
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Chemical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City, P.O. Box 23874, Doha, Qatar
3 Electrical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City, P.O. Box 23874, Doha, Qatar
1. Introduction
Gene regulation is one of the most intriguing processes taking place in living cells. With hundreds of thousands of genes at their disposal, cells must decide which genes to express at a particular time. As cell development progresses, different needs and functions entail an efficient mechanism to turn the required genes on while leaving the others off. Cells can also activate new genes to respond effectively to environmental changes and perform specific roles. Knowing which gene triggers a particular genetic condition can help us ward off the potential harmful effects by switching that gene off. For instance, cancer can be controlled by deactivating the genes that cause it.
Gene expression is the process of generating functional gene products, for example, mRNA and protein. The level of gene functionality can be measured using microarrays or gene chips to produce the gene expression data [1]. More
known to interact with only a few other genes, necessitating the use of a sparsity constraint for more accurate estimation. The proposed algorithm carries out online estimation of parameters and is therefore computationally efficient and particularly suitable for large gene networks.
(2) The Cramér-Rao lower bound (CRB) is calculated for the estimation of the unknown parameters of the system. The performance of the proposed algorithm is compared to the CRB. This comparison is significant as it shows the room for improvement in the estimation of parameters.
(3) The proposed algorithm is compared with the EKF algorithm. Using the false alarm errors, true connections, and Hamming distance as fidelity criteria, rigorous simulations are carried out to assess the performance of the algorithm as the number of samples increases. In addition, receiver operating characteristic (ROC) curves are plotted to evaluate the algorithms for different network sizes. It is observed that the proposed algorithm outperforms EKF in terms of accuracy and precision. The proposed algorithm is then applied to the DREAM4 10-gene and 100-gene data sets to assess the algorithm accuracy. The underlying gene network for the IRMA data sets is also inferred.
The rest of this paper is organized as follows. Section 2 describes the underlying system model for the gene expressions. The proposed CKF algorithm in combination with CSKF for gene network inference is formulated in Section 3. The derivation of the CRB is shown in Section 4, and the simulation results and their interpretation are presented in Section 5. Finally, conclusions are drawn in Section 6.
2. System Model
Gene regulatory networks can be modeled as static or dynamical systems. In this work, state-space modeling is considered,
which is an instance of a dynamic modeling approach and
can effectively cope with time variations. The states represent
gene expressions, and their evolution in time can, in general,
be expressed as
$$\mathbf{x}_k = \mathbf{g}(\mathbf{x}_{k-1}) + \mathbf{w}_k, \qquad k = 1, \ldots, K, \tag{1}$$
(2)
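the gene interactions effectively, the following nonlinear state
evolution model is assumed [33, 34]:
$$x_{i,k} = \sum_{j=1}^{N} b_{ij}\, f_j(x_{j,k-1}) + w_{i,k}, \qquad i = 1, \ldots, N, \quad k = 1, \ldots, K, \tag{3}$$
where
$$f_j(x_{j,k-1}) = \frac{1}{1 + e^{-x_{j,k-1}}},$$
$$\mathbf{b} \triangleq [\mathbf{b}_1^T, \mathbf{b}_2^T, \ldots, \mathbf{b}_N^T]^T, \tag{6}$$
$$\mathbf{y}_k = \mathbf{R}_k \mathbf{b} + \mathbf{e}_k, \tag{7}$$
$$\mathbf{R}_k \triangleq \begin{bmatrix} \mathbf{f}_k^T & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{f}_k^T & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{f}_k^T \end{bmatrix}, \tag{8}$$
$$\mathbf{f}_k \triangleq [f_1(x_{1,k-1}), \ldots, f_N(x_{N,k-1})]^T. \tag{9}$$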
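To make the nonlinear state evolution model concrete, the sketch below simulates expression trajectories under two assumptions that are not confirmed by the recoverable text: the regulation function is the logistic sigmoid, and the expression of gene i at time k is a weighted sum of the sigmoid-transformed expressions at time k-1 plus Gaussian noise. The names `simulate_expression`, `B`, and `noise_std` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_expression(B, x0, steps, noise_std=0.0, seed=0):
    """Simulate x_k = B f(x_{k-1}) + w_k for a small gene network.

    B: list of rows; B[i][j] is the assumed influence of gene j on gene i
    x0: initial expression levels
    noise_std: standard deviation of the additive Gaussian noise w_k
    """
    rng = random.Random(seed)
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        f = [sigmoid(v) for v in x]  # regulation function applied gene by gene
        x = [sum(b_ij * f_j for b_ij, f_j in zip(row, f)) + rng.gauss(0.0, noise_std)
             for row in B]
        trajectory.append(x)
    return trajectory


# Gene 2 activates gene 1; gene 1 represses gene 2 (a sparse interaction matrix).
B = [[0.0, 1.0],
     [-1.0, 0.0]]
traj = simulate_expression(B, x0=[0.0, 0.0], steps=3)
for k, state in enumerate(traj):
    print(k, [round(v, 4) for v in state])
```

A zero entry in `B` means no regulatory link, so sparsity of the true network shows up as sparsity of the parameter matrix that the inference algorithm has to recover.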
3. Method
In this section, the methodology proposed to infer the system parameters in (3) is described. The proposed cubature
Kalman filter with sparsity constraints (CKFS) approach is
succinctly illustrated in Figure 1. The specific details of this
algorithm are presented next.
3.1. Cubature Kalman Filter. The Kalman filter is a Bayesian filter which provides the optimal solution to a general linear state-space inference problem of the form (1) and (2) and follows a recursive predict-update process. The underlying assumption of Gaussianity for the predictive and the likelihood densities simplifies the Kalman filter algorithm to a two-step process, consisting of prediction and update of the mean and covariance of the hidden states. However, the presence of nonlinear functions in the state and measurement equations requires the calculation of multidimensional integrals of the form nonlinear function × Gaussian density [36], which in general is computationally prohibitive. Several solutions to this problem have been proposed, including the EKF, which linearizes the nonlinear function by taking its first-order Taylor approximation, and the unscented Kalman filter (UKF), which approximates the probability density function (PDF) using a nonlinear transformation of the random variable. Recently, a new approach, the CKF, has been proposed which evaluates the integrals numerically using spherical-radial cubature rules [36].
The next two subsections briefly explain the working of Bayesian filtering and the CKF solution for the nonlinear multidimensional integrals.
3.1.1. Time Update. Let the observations up to the time instant
$k$ be denoted by $\mathbf{d}_k$; that is, $\mathbf{d}_k \triangleq [\mathbf{y}_1, \ldots, \mathbf{y}_k]$. In the prediction phase, also called the time update of the Bayesian filter,
(10)
where $\mathrm{E}$ denotes the expectation operator and $\mathbf{x}_{k-1}$ is normally distributed with parameters $(\hat{\mathbf{x}}_{k-1|k-1}, \mathbf{P}_{xx,k-1|k-1})$. The
third equality is a consequence of the zero-mean nature
of the Gaussian noise $\mathbf{w}_k$ and its independence from $\mathbf{d}_{k-1}$. The
estimates $\hat{\mathbf{x}}_{k-1|k-1}$ and $\mathbf{P}_{xx,k-1|k-1}$ are assumed to be available
from the previous iteration. Here, $\mathbf{P}_{xx,k|k-1}$ is an estimate of
the error covariance matrix.
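3.1.2. Measurement Update. Since the measurement noise
is also Gaussian, the likelihood density is given by $\mathbf{y}_k \mid \mathbf{d}_{k-1} \sim \mathcal{N}(\mathbf{y}_k; \hat{\mathbf{y}}_{k|k-1}, \mathbf{P}_{yy,k|k-1})$. As the measurements become
available at the $k$th time instant, the mean and covariance of
the likelihood density are calculated as follows:
$$\hat{\mathbf{y}}_{k|k-1} = \mathrm{E}[\mathbf{y}_k \mid \mathbf{d}_{k-1}],$$
$$\mathbf{P}_{yy,k|k-1} = \mathrm{E}[\mathbf{x}_k \mathbf{x}_k^T] - \hat{\mathbf{y}}_{k|k-1} \hat{\mathbf{y}}_{k|k-1}^T + \mathbf{S}_{k-1}. \tag{11}$$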
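The updated posterior density, obtained from the conditional joint density of the states and the measurements, can be
expressed as
$$p\left([\mathbf{x}_k^T \ \mathbf{y}_k^T]^T \mid \mathbf{d}_{k-1}\right) = \mathcal{N}\left( \begin{pmatrix} \hat{\mathbf{x}}_{k|k-1} \\ \hat{\mathbf{y}}_{k|k-1} \end{pmatrix}, \begin{pmatrix} \mathbf{P}_{xx,k|k-1} & \mathbf{P}_{xy,k|k-1} \\ \mathbf{P}_{xy,k|k-1}^T & \mathbf{P}_{yy,k|k-1} \end{pmatrix} \right), \tag{12}$$
where
$$\mathbf{P}_{xx,k|k} = \mathbf{P}_{xx,k|k-1} - \mathbf{K}_k \mathbf{P}_{yy,k|k-1} \mathbf{K}_k^T, \tag{14}$$
$$\mathbf{K}_k = \mathbf{P}_{xy,k|k-1} \mathbf{P}_{yy,k|k-1}^{-1}. \tag{15}$$
$$\mathbf{X}^{*}_{i,k|k-1} = \mathbf{g}(\mathbf{X}_{i,k-1|k-1}). \tag{17}$$
$$\hat{\mathbf{x}}_{k|k-1} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{X}^{*}_{i,k|k-1}, \tag{18}$$
$$\mathbf{P}_{xx,k|k-1} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{X}^{*}_{i,k|k-1} \mathbf{X}^{*T}_{i,k|k-1} - \hat{\mathbf{x}}_{k|k-1} \hat{\mathbf{x}}_{k|k-1}^T + \mathbf{Q}_{k-1}. \tag{19}$$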
parameter vector. The standard predict and update steps
involved in the Kalman filter are summarized as follows:
b |1 = b 1|1 + ,
P,|1 = P,1|1 + ,
u = y R b ,
K = P,|1 R (R P,|1 R + 2 I1 ) ,
(20)
b | = b |1 + Ku ,
P,| = (I KR ) P,|1 ,
2
minb b 2
s.t. b .
(21)
The value of the pseudonoise covariance matrix (a scaled identity matrix) is selected in a similar manner as the process noise
covariance in the EKF algorithm. However, it is found that
large values of the variance (100 or more) prove sufficient in
most cases [38]. Further details on selecting these parameters
can be found in [38, 46]. The PM method solves (21) in a
recursive manner over a fixed number of iterations using the following steps:
K = P R (R P R + ) ,
b +1 = (I KR ) b ,
P+1 = (I KR ) P .
(23)
4. Cramér-Rao Bound
The performance of an estimator can be judged by comparing
it with theoretical lower bounds from parameter
estimation theory. The CRB establishes a lower bound on the
MSE of an unbiased estimator [47]. In particular, the CRB
states that the covariance matrix of the estimator $\hat{\mathbf{b}}$ is lower
bounded by
$$\mathrm{E}\left[(\hat{\mathbf{b}} - \mathbf{b})(\hat{\mathbf{b}} - \mathbf{b})^T\right] \geq [\mathbf{I}(\mathbf{b})]^{-1}, \tag{24}$$
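$$I(b) = E\left[\left(\frac{\partial \ln p(y \mid b)}{\partial b}\right)\left(\frac{\partial \ln p(y \mid b)}{\partial b}\right)^T\right]. \tag{25}$$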
(26)
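where $y$, $R$, and $e$ stack the per-sample quantities $y_k$, $R_k$, and $e_k$, respectively. The PDF $p(y \mid b)$ is expressed as
$$p(y \mid b) \propto \exp\left(-\frac{(y - Rb)^T (y - Rb)}{2\sigma^2}\right), \tag{27}$$
$$\frac{\partial \ln p(y \mid b)}{\partial b} = \frac{\partial}{\partial b}\left[-\frac{(y - Rb)^T (y - Rb)}{2\sigma^2}\right] = \frac{R^T y - R^T R b}{\sigma^2}, \tag{28}$$
$$\left(\frac{\partial \ln p(y \mid b)}{\partial b}\right)\left(\frac{\partial \ln p(y \mid b)}{\partial b}\right)^T = \frac{R^T (y - Rb)(y - Rb)^T R}{\sigma^4}. \tag{29}$$
[Figure: $\log_{10}$ MSE versus sample size for the CRB and CKFS.]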
(30)
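For the linear Gaussian model behind (27), taking the expectation of (29) with $E[(y - Rb)(y - Rb)^T] = \sigma^2 I$ gives the Fisher information $R^T R / \sigma^2$, so the bound in (24) becomes $\sigma^2 (R^T R)^{-1}$. The sketch below evaluates this for a two-parameter case in pure Python. The function names and the example regressor matrix are hypothetical, and white noise with variance `noise_var` is an assumption of the sketch.

```python
def fisher_information(R, noise_var):
    """Return I(b) = R^T R / noise_var for the model y = R b + e."""
    n = len(R[0])
    return [[sum(row[i] * row[j] for row in R) / noise_var for j in range(n)]
            for i in range(n)]


def invert_2x2(M):
    """Invert a 2x2 matrix given as a list of two rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("Fisher information is singular; no finite bound")
    return [[d / det, -b / det], [-c / det, a / det]]


def crb_2params(R, noise_var):
    """Cramer-Rao bound (inverse Fisher information) for two parameters."""
    return invert_2x2(fisher_information(R, noise_var))


# Hypothetical regressors: three measurements of two parameters.
R_example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bound = crb_2params(R_example, noise_var=0.5)
```

The diagonal of `bound` lower-bounds the variance of any unbiased estimate of each parameter, which is the quantity an empirical MSE curve would be compared against.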
Figure 3: (a) False alarm errors, (b) Hamming distance, and (c) true connections, each plotted against sample size for EKF and CKFS. The synthetic networks consist of 8 vertices and 20 edges. The metric is normalized over the number of edges. CKFS gives lower error and predicts more true connections as the sample size of the data increases.
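A minimal sketch of how such edge-level metrics can be computed from binary adjacency matrices is given below. The caption does not spell out the exact definitions, so the formulas used here (false positives, mismatched entries, and correctly recovered edges, each divided by the number of true edges) are assumptions, as are the function name `network_metrics` and the toy networks.

```python
def network_metrics(true_adj, est_adj):
    """Compare an estimated directed network with the true one.

    Both arguments are square 0/1 matrices (lists of lists) where
    entry [i][j] = 1 means an edge from gene i to gene j.
    All metrics are normalized by the number of true edges.
    """
    n = len(true_adj)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    edges = sum(true_adj[i][j] for i, j in pairs)
    if edges == 0:
        raise ValueError("the true network has no edges to normalize by")
    tp = sum(1 for i, j in pairs if true_adj[i][j] and est_adj[i][j])
    fp = sum(1 for i, j in pairs if not true_adj[i][j] and est_adj[i][j])
    fn = edges - tp
    return {
        "false_alarm": fp / edges,
        "hamming": (fp + fn) / edges,
        "true_connections": tp / edges,
    }


# Toy example: a 3-gene cycle and an estimate with one wrong edge.
true_net = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
est_net = [[0, 1, 1], [0, 0, 0], [1, 0, 0]]
metrics = network_metrics(true_net, est_net)
```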
Figure 4: ROC curves (true positive rate versus false positive rate) for the performance of CKFS and EKF using synthetic data. The parameter triples for panels (a), (b), and (c) are (5, 10, 20), (10, 12, 20), and (15, 19, 20), respectively. The area under the ROC curve for CKFS is larger than that for EKF for networks of various sizes.
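The area under an ROC curve can be computed without tracing the curve, using the rank-based (Mann-Whitney) identity: it equals the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge, with ties counted as one half. The sketch below illustrates this; the function name `auroc` and the example scores are hypothetical, and this is not claimed to be the procedure used to produce the figure.

```python
def auroc(scores, labels):
    """Area under the ROC curve from edge scores and 0/1 ground-truth labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores for four candidate edges.
example_auc = auroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```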
Table 1: Run time in seconds for the EKF and CKFS algorithms for varying network sizes on synthetically generated data. The number of sample points is fixed at 50.

Number of genes    10      20     30      40
EKF                0.16    1.9    16.5    84
CKFS               1.2     4.3    11.5    24.1
Figure 5: The inferred IRMA networks over the genes GAL80, GAL4, SWI5, CBF1, and ASH1: (a) gold standard, (b) network inferred using CKFS, and (c) network inferred using ODE [39, 40]. Black arrows indicate true connections, blue arrows indicate edges that are correct but whose directions are reversed, and red arrows indicate false positives.
Table 2: Area under the ROC curve (AUROC) and area under the PR curve (AUPR) for the five different DREAM4 10-gene networks. Entries are AUROC with AUPR in parentheses.

Algorithm      Network 1      Network 2      Network 3      Network 4      Network 5
ODE [39]       0.62 (0.27)    0.63 (0.32)    0.58 (0.21)    0.63 (0.23)    0.68 (0.25)
CKFS           0.63 (0.40)    0.67 (0.50)    0.72 (0.50)    0.75 (0.49)    0.81 (0.42)
Random [39]    0.55 (0.18)    0.55 (0.19)    0.55 (0.17)    0.57 (0.17)    0.56 (0.16)
Table 3: Area under the ROC curve (AUROC) and area under the PR curve (AUPR) for the five different DREAM4 100-gene networks. Entries are AUROC with AUPR in parentheses.

Algorithm      Network 1       Network 2       Network 3       Network 4       Network 5
ODE [39]       0.55 (0.02)     0.55 (0.03)     0.60 (0.03)     0.54 (0.02)     0.59 (0.03)
CKFS           0.67 (0.13)     0.57 (0.08)     0.60 (0.10)     0.62 (0.10)     0.60 (0.07)
Random [39]    0.50 (0.002)    0.50 (0.002)    0.50 (0.002)    0.50 (0.002)    0.50 (0.002)
5.3. IRMA Gene Network. In addition to synthetic data, it is imperative to test the algorithms using real biological data. In this subsection, the performance of the CKFS algorithm is assessed using the in vivo reverse-engineering and modeling assessment (IRMA) network [40]. This network consists of five genes. Galactose activates gene expression in the network, whereas glucose deactivates it. The cells are grown in the presence of galactose and then switched to glucose to obtain the switch-off data, which comprise expression samples at 21 time points. The switch-on data consist of 16 sample points and are obtained by growing the cells in a glucose medium and then changing to galactose. The system and measurement noise variances for the CKFS are assumed to be the same as in the previous simulations. Figure 5 shows the inferred network, the gold standard, and the network inferred using TSNI. It is observed that the CKFS algorithm succeeds
6. Conclusions
gives advantages over EKF in terms of smaller run time for large networks. The Cramér-Rao lower bound is also determined for the parameters of the model and compared with the MSE performance of the proposed algorithm. Assessment using DREAM4 10-gene and 100-gene networks and IRMA network data corroborates the superior performance of CKFS. Future research directions include incorporating the estimation of model order in the network inference algorithm.
Acknowledgments
This work was supported by US National Science Foundation (NSF) Grant 0915444 and QNRF-NPRP Grant 09-874-3235. The material in this paper was presented in part at the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), San Antonio, TX, USA, December 2011.
References
[1] X. Zhou, X. Wang, and E. R. Dougherty, Genomic Networks: Statistical Inference from Microarray Data, John Wiley & Sons, New York, NY, USA, 2006.
[2] H. Kitano, "Computational systems biology," Nature, vol. 420, pp. 206–210, 2002.
[3] X. Zhou and S. T. C. Wong, Computational Systems Bioinformatics, World Scientific, River Edge, NJ, USA, 2008.
[4] X. Cai and X. Wang, "Stochastic modeling and simulation of gene networks," IEEE Signal Processing Magazine, vol. 24, no. 1, pp. 27–36, 2007.
[5] D. Yue, J. Meng, M. Lu, C. L. P. Chen, M. Guo, and Y. Huang, "Understanding micro-RNA regulation: a computational perspective," IEEE Signal Processing Magazine, vol. 29, no. 1, pp. 77–88, 2012.
[6] R. Pal, S. Bhattacharya, and M. U. Caglar, "Robust approaches for genetic regulatory network modeling and intervention: a review of recent advances," IEEE Signal Processing Magazine, vol. 29, no. 1, pp. 66–76, 2012.
[7] H. Hache, H. Lehrach, and R. Herwig, "Reverse engineering of gene regulatory networks: a comparative study," Eurasip Journal on Bioinformatics and Systems Biology, vol. 2009, Article ID 617281, 2009.
[8] T. Schlitt and A. Brazma, "Current approaches to gene regulatory network modelling," BMC Bioinformatics, vol. 8, no. 6, p. 9, 2007.
[9] H. D. Jong, "Modeling and simulation of genetic regulatory systems: a literature review," Journal of Computational Biology, vol. 9, no. 1, pp. 67–103, 2002.
[10] I. Nachman, A. Regev, and N. Friedman, "Inferring quantitative models of regulatory networks from expression data," Bioinformatics, vol. 20, no. 1, pp. i248–i256, 2004.
[11] C. D. Giurcaneanu, I. Tabus, and J. Astola, "Clustering time series gene expression data based on sum-of-exponentials fitting," EURASIP Journal on Advances in Signal Processing, vol. 2005, no. 8, Article ID 358568, pp. 1159–1173, 2005.
[12] C. D. Giurcaneanu, I. Tabus, J. Astola, J. Ollila, and M. Vihinen, "Fast iterative gene clustering based on information theoretic criteria for selecting the cluster structure," Journal of Computational Biology, vol. 11, no. 4, pp. 660–682, 2004.
[13] X. Cai and G. B. Giannakis, "Identifying differentially expressed genes in microarray experiments with model-based variance estimation," IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 2418–2426, 2006.
[14] X. Zhou, X. Wang, and E. R. Dougherty, "Gene clustering based on cluster-wide mutual information," Journal of Computational Biology, vol. 11, no. 1, pp. 151–165, 2004.
[15] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring connectivity of genetic regulatory networks using information-theoretic criteria," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 2, pp. 262–274, 2008.
[16] J. Dougherty, I. Tabus, and J. Astola, "Inference of gene regulatory networks based on a universal minimum description length," Eurasip Journal on Bioinformatics and Systems Biology, vol. 2008, Article ID 482090, 2008.
[17] L. Qian, H. Wang, and E. R. Dougherty, "Inference of noisy nonlinear differential equation models for gene regulatory networks using genetic programming and Kalman filtering," IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3327–3339, 2008.
[18] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring gene regulatory networks from time series data using the minimum description length principle," Bioinformatics, vol. 22, no. 17, pp. 2129–2135, 2006.
[19] X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R. Dougherty, "A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks," Bioinformatics, vol. 20, no. 17, pp. 2918–2927, 2004.
[20] J. Meng, M. Lu, Y. Chen, S.-J. Gao, and Y. Huang, "Robust inference of the context specific structure and temporal dynamics of gene regulatory network," BMC Genomics, vol. 11, no. 3, p. S11, 2010.
[21] Y. Zhang, Z. Deng, H. Jiang, and P. Jia, "Inferring gene regulatory networks from multiple data sources via a dynamic Bayesian network with structural EM," in DILS, S. C. Boulakia and V. Tannen, Eds., vol. 4544 of Lecture Notes in Computer Science, pp. 204–214, Springer, New York, NY, USA, 2007.
[22] K. Murphy and S. Mian, "Modeling gene expression data using dynamic Bayesian networks," University of California, Berkeley, Calif, USA, 2001.
[23] H. Liu, D. Yue, L. Zhang, Y. Chen, S. J. Gao, and Y. Huang, "A Bayesian approach for identifying miRNA targets by combining sequence prediction and gene expression profiling," BMC Genomics, vol. 11, no. 3, p. S12, 2010.
[24] Y. Huang, J. Wang, J. Zhang, M. Sanchez, and Y. Wang, "Bayesian inference of genetic regulatory networks from time series microarray data using dynamic Bayesian networks," Journal of Multimedia, vol. 2, no. 3, pp. 46–56, 2007.
[25] B.-E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. d'Alché-Buc, "Gene networks inference using dynamic Bayesian networks," Bioinformatics, vol. 19, no. 2, pp. ii138–ii148, 2003.
[26] C. Rangel, D. L. Wild, F. Falciani, Z. Ghahramani, and A. Gaiba, "Modelling biological responses using gene expression profiling and linear dynamical systems," Bioinformatics, pp. 349–356, 2005.
[27] M. Quach, N. Brunel, and F. d'Alché-Buc, "Estimating parameters and hidden variables in non-linear state-space models based on ODEs for biological networks inference," Bioinformatics, vol. 23, no. 23, pp. 3209–3216, 2007.
[28] F.-X. Wu, W.-J. Zhang, and A. J. Kusalik, "Modeling gene expression from microarray expression data with state-space equations," in Pacific Symposium on Biocomputing, R. B. Altman, A. K. Dunker, L. Hunter, T. A. Jung, and T. E. Klein, Eds., pp. 581–592, World Scientific, River Edge, NJ, USA, 2004.
[29] R. Yamaguchi, S. Yoshida, S. Imoto, T. Higuchi, and S. Miyano, "Finding module-based gene networks with state-space models - mining high-dimensional and short time-course gene expression data," IEEE Signal Processing Magazine, vol. 24, no. 1, pp. 37–46, 2007.
[30] O. Hirose, R. Yoshida, S. Imoto et al., "Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models," Bioinformatics, vol. 24, no. 7, pp. 932–942, 2008.
[31] J. Angus, M. Beal, J. Li, C. Rangel, and D. Wild, "Inferring transcriptional networks using prior biological knowledge and constrained state-space models," in Learning and Inference in Computational Systems Biology, N. Lawrence, M. Girolami, M. Rattray, and G. Sanguinetti, Eds., pp. 117–152, MIT Press, Cambridge, UK, 2010.
[32] C. Rangel, J. Angus, Z. Ghahramani et al., "Modeling T-cell activation using gene expression profiling and state-space models," Bioinformatics, vol. 20, no. 9, pp. 1361–1372, 2004.
[33] A. Noor, E. Serpedin, M. N. Nounou, and H. N. Nounou, "Inferring gene regulatory networks via nonlinear state-space models and exploiting sparsity," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1203–1211, 2012.
[34] Z. Wang, X. Liu, Y. Liu, J. Liang, and V. Vinciotti, "An extended Kalman filtering approach to modeling nonlinear dynamic gene regulatory networks via short gene expression time series," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 3, pp. 410–419, 2009.
[35] A. Noor, E. Serpedin, M. Nounou, H. Nounou, N. Mohamed, and L. Chouchane, "An overview of the statistical methods used for inferring gene regulatory networks and protein-protein interaction networks," Advances in Bioinformatics, vol. 2013, Article ID 953814, 12 pages, 2013.
[36] I. Arasaratnam and S. Haykin, "Cubature Kalman filters," IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1254–1269, 2009.
[37] A. Noor, E. Serpedin, M. N. Nounou, and H. N. Nounou, "A cubature Kalman filter approach for inferring gene regulatory networks using time series data," in Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '11), pp. 25–28, 2011.
[38] A. Carmi, P. Gurfil, and D. Kanevsky, "Methods for sparse signal recovery using Kalman filtering with embedded pseudo-measurement norms and quasi-norms," IEEE Transactions on Signal Processing, vol. 58, no. 4, pp. 2405–2409, 2010.
[39] C. A. Penfold and D. L. Wild, "How to infer gene networks from expression profiles, revisited," Interface Focus, pp. 857–870, 2011.
[40] I. Cantone, L. Marucci, F. Iorio et al., "A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches," Cell, vol. 137, no. 1, pp. 172–181, 2009.
[41] Y. Huang, I. M. Tienda-Luna, and Y. Wang, "Reverse engineering gene regulatory networks: a survey of statistical models," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 76–97, 2009.
[42] Z. Wang, F. Yang, D. W. C. Ho, S. Swift, A. Tucker, and X. Liu, "Stochastic dynamic modeling of short gene expression time-series data," IEEE Transactions on Nanobioscience, vol. 7, no. 1, pp. 44–55, 2008.
Research Article
Efficient Serial and Parallel Algorithms for Selection of
Unique Oligos in EST Databases
Manrique Mata-Montero,1 Nabil Shalaby,2 and Bradley Sheppard1,2
1. Introduction
Expressed Sequence Tags (or ESTs) are fragments of DNA that are about 200–800 bases long, generated from the sequencing of complementary DNA. ESTs have many applications. They were used in the Human Genome Project in the discovery of new genes and are often used in the mapping of genomic libraries. They can be used to infer functions of newly discovered genes based on comparison to known genes [1].
An oligonucleotide (or oligo) is a subsequence of an EST. Oligos are short, typically no longer than 50 nucleotide bases. Oligos are often referred to in the context of their length by adding the suffix "mer". For example, an oligo of length 9 would be referred to as a 9-mer. Oligos are significant in relation to EST databases: an oligo that is unique in an EST database serves as a representative of its EST sequence. The oligos contained in these EST databases have applications in many areas such as PCR primer design, microarrays, and probing genomic libraries [2–4].
In this paper we will improve on the algorithms presented in [2] to solve the unique oligos search problem. This problem
Require: EST database = {1 , 2 , . . . , }, integer (length of unique oligos) and integer
(maximum number of mismatches between non-unique oligos)
Ensure: All unique -mers in
(1) /(/2 + 1)
(2) fndqmers() (hashtable of positions of all qmers in )
(3) for 1 to 4 {split loop iterations among processors} do
(4) as a base 4 integer of length
(5) list of base 4 integers of length mismatching by 1 digit
(6) the numbers in in base 10
(7) list of each [] for all
(8) goo2(, , , [], )
(9) end for
Algorithm 1: Algorithm for the unique oligos problem.
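The first two steps of the listing, computing the filter length and building a hash table of q-mer positions, can be sketched as follows. The symbols `l` (oligo length) and `d` (maximum mismatches) are names assumed here because the extracted listing lost them, and the table is keyed by the q-mer string rather than by its base-4 integer for readability. The function names are hypothetical, and this is not the authors' C implementation.

```python
from collections import defaultdict


def filter_length(l, d):
    """Step (1): q = floor(l / (floor(d / 2) + 1))."""
    return l // (d // 2 + 1)


def find_qmers(ests, q):
    """Step (2): map every q-mer to the (EST index, offset) pairs where it occurs."""
    table = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - q + 1):
            table[seq[pos:pos + q]].append((est_id, pos))
    return table


# Toy database of two short sequences (real ESTs are hundreds of bases long).
toy_table = find_qmers(["ACGTAC", "CGTA"], 3)
```

For oligos of length 28 with at most 6 mismatches, `filter_length(28, 6)` gives 7, so candidate pairs only need to be compared when they share a similar 7-mer.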
Many algorithms have been presented to solve this problem [5, 6]. The algorithm presented in [2] relies on the observation that if two oligos of equal length agree to within a specific Hamming distance, then they must share a certain substring. These observations are presented in this paper as theorems.
2
)4 )
4
( ) 2
),
= (
4
where
=
.
/2 + 1
(1)
(2)
2
)4 )
4
( ) 2
),
= (
4
(3)
(1) substring(, , )
(2) under the transformation {A, C, G, T} {0, 1, 2, 3}
(3) return
Algorithm 5: Map (string , , ).
(1) substring of from character to character
(2) return
Algorithm 6: Substring (string , , ).
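A small sketch of the two helper routines above is given below. The argument names `s`, `i`, and `j` and the 1-indexed, inclusive substring convention are assumptions, since the extracted listings lost their variable names. The mapping of A, C, G, T to the digits 0, 1, 2, 3 follows the listing, with the digit string read as a base-4 integer.

```python
BASE4 = {"A": 0, "C": 1, "G": 2, "T": 3}


def substring(s, i, j):
    """Characters i..j of s, 1-indexed and inclusive (assumed convention)."""
    return s[i - 1:j]


def map_to_int(s, i, j):
    """Encode s[i..j] as a base-4 integer using A,C,G,T -> 0,1,2,3."""
    value = 0
    for ch in substring(s, i, j):
        value = value * 4 + BASE4[ch]
    return value
```

For example, `map_to_int("ACGTAC", 2, 4)` encodes "CGT" as 1·16 + 2·4 + 3 = 27, which can serve directly as an index into a hash table with 4^3 slots.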
Table 1: Results of serial algorithms.

Algorithm      Oligo length    Mismatches    q    Dataset          Non-unique oligos
Algorithm 2    28              6             4    1 (78 ESTs)      46,469
Algorithm 1    28              6             7    1 (78 ESTs)      46,469
Algorithm 3    27              6             9    1 (78 ESTs)      46,564
Algorithm 2    28              6             4    2 (2838 ESTs)    1,611,241
Algorithm 1    28              6             7    2 (2838 ESTs)    1,611,241
Algorithm 3    27              6             9    2 (2838 ESTs)    1,614,235
Oligo length    Mismatches    q    Dataset
28              6             4    1 (78 ESTs)
28              6             7    1 (78 ESTs)
27              6             9    1 (78 ESTs)
28              6             4    2 (2838 ESTs)
28              6             7    2 (2838 ESTs)
27              6             9    2 (2838 ESTs)
where
(4)
.
+1
A third theorem was also briefly mentioned [7]; however, it was not implemented in an algorithm. We use this theorem to create a third algorithm to solve the unique oligos search problem.
=
2
)4 )
4
( ) 2 2
= (
),
4
where
=
.
/3 + 1
(5)
(6)
Non-unique oligos (one value per table row, in order): 46,469; 46,469; 46,564; 1,611,241; 1,611,241; 1,614,235
3. Implementation
We implement these algorithms in C on a machine with 12 Intel Core i7 CPU 80 @ 3.33 GHz processors and 12 GB of memory. The datasets we use in this implementation are barley ESTs taken from the genetic software HarvEST by Steve Wanamaker and Timothy Close of the University of California, Riverside (http://harvest.ucr.edu/). We use two different EST databases, one with 78 ESTs and another with 2838. In our experiments we search for oligos of lengths 27 and 28, since these are common lengths for oligonucleotides. As we increase the size of the database, we see that Algorithm 3 is the most efficient, as anticipated (data shown in Tables 1 and 2).
One important thing to note about all of these algorithms is that the main portion of each is a for loop which iterates through each index of the hash table. The loop iterations are also independent of each other. These two factors make the algorithms well suited to parallelization. Rather than process the hash table one index at a time, our parallel algorithms process groups of indices simultaneously. Ignoring the communication between processors, our algorithms optimally parallelize our three serial algorithms.
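The partitioning idea, giving each worker its own group of hash-table indices because the iterations are independent, can be sketched as below. This is an illustration only: the paper's implementation is in C, the per-index work here is a hypothetical placeholder, and in standard CPython builds threads illustrate the partitioning rather than a real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def process_index(index):
    """Placeholder for the per-index work (hypothetical); it squares the index."""
    return index * index


def process_chunk(indices):
    """Handle one worker's group of indices sequentially."""
    return [process_index(i) for i in indices]


def parallel_over_indices(num_indices, workers):
    """Split indices 0..num_indices-1 across workers and gather results in order."""
    # Worker w receives the interleaved indices w, w + workers, w + 2*workers, ...
    chunks = [range(w, num_indices, workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(process_chunk, chunks))
    # Reassemble so that results[i] is the result for index i.
    results = [None] * num_indices
    for w, chunk_result in enumerate(partial):
        for k, value in enumerate(chunk_result):
            results[w + k * workers] = value
    return results
```

Because no iteration reads another iteration's output, the only coordination needed is the final gathering step, which is why the serial loop parallelizes so directly.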
There are many APIs in different programming languages that aid in the task of parallel programming. Some examples in the C programming language are OpenMP and POSIX Pthreads. OpenMP allows one to easily parallelize a C program amongst multiple cores of a multicore machine [8]. OpenMP also has an extension called Cluster OpenMP which allows one to parallelize across multiple machines in a computing cluster.
A new trend in parallel programming is the use of GPUs. GPUs are the processing units inside a computer's graphics card. C has several APIs which allow one to carry out GPU programming. Two such APIs are OpenCL and CUDA [9, 10].
In the second implementation of our algorithms we use OpenMP to parallelize our algorithms across the 12 cores of our machine. We can easily see that we achieve near optimal parallelization with our parallel algorithms; that is, the time taken by the parallel algorithms is approximately that of the serial algorithms divided by the number of processors.
4. Conclusion
In this paper we used three algorithms to solve the unique oligos search problem which are extensions of the algorithm presented in [2]. We observed that we can achieve a significant performance improvement by parallelizing our algorithms. We can also see that Algorithm 3 yields the best results for larger databases. For smaller databases, however, the time difference between each pair of algorithms is negligible, with Algorithm 3 being the slowest; this is due to the time required to compute the mismatches of each $q$-mer. Other algorithms can be obtained by setting $q$ to different values. See Algorithms 1, 2, 3, 4, 5, 6, 7, and 8.
References
[1] M. D. Adams, J. M. Kelley, J. D. Gocayne et al., "Complementary DNA sequencing: expressed sequence tags and human genome project," Science, vol. 252, no. 5013, pp. 1651–1656, 1991.
[2] J. Zheng, T. J. Close, T. Jiang, and S. Lonardi, "Efficient selection of unique and popular oligos for large EST databases," Bioinformatics, vol. 20, no. 13, pp. 2101–2112, 2004.
[3] S. H. Nagaraj, R. B. Gasser, and S. Ranganathan, "A hitchhiker's guide to expressed sequence tag (EST) analysis," Briefings in Bioinformatics, vol. 8, no. 1, pp. 6–21, 2007.
[4] W. Klug, M. Cummings, and C. Spencer, Concepts of Genetics, Prentice-Hall, Upper Saddle River, NJ, USA, 8th edition, 2006.
[5] F. Li and G. D. Stormo, "Selection of optimal DNA oligos for gene expression arrays," Bioinformatics, vol. 17, no. 11, pp. 1067–1076, 2001.
[6] S. Rahmann, "Rapid large-scale oligonucleotide selection for microarrays," in Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (CSB '02), pp. 54–63, IEEE Press, Stanford, Calif, USA, 2002.
[7] S. Go, Combinatorics and its applications in DNA analysis [M.S. thesis], Department of Mathematics and Statistics, Memorial University of Newfoundland, 2009.
[8] OpenMP.org, 2012, http://openmp.org/wp/.
[9] Khronos Group, "OpenCL - The open standard for parallel programming of heterogeneous systems," 2012, http://www.khronos.org/opencl/.
[10] Nvidia, "Parallel Programming and Computing Platform - CUDA - Nvidia," 2012, http://www.nvidia.com/object/cuda_home_new.html.
Research Article
Gene Regulation, Modulation, and Their Applications in
Gene Expression Data Analysis
Mario Flores,1 Tzu-Hung Hsiao,2 Yu-Chiao Chiu,3 Eric Y. Chuang,3 Yufei Huang,1 and Yidong Chen2,4
1 Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX 78249, USA
2 Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA
3 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
4 Department of Epidemiology and Biostatistics, University of Texas Health Science Center at San Antonio, San Antonio, TX 78229, USA
Correspondence should be addressed to Yufei Huang; yufei.huang@utsa.edu and Yidong Chen; cheny8@uthscsa.edu
Received 2 December 2012; Accepted 24 January 2013
Academic Editor: Mohamed Nounou
Copyright © 2013 Mario Flores et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Common microarray and next-generation sequencing data analyses concentrate on tumor subtype classification, marker detection, and transcriptional regulation discovery during biological processes by exploring correlated gene expression patterns and their shared functions. Genetic regulatory network (GRN) based approaches have been employed in many large studies in order to scrutinize for dysregulation and potential treatment controls. In addition to gene regulation and network construction, the concept of the network modulator that has significant systemic impact has been proposed, and detection algorithms have been developed in past years. Here we provide a unified mathematical description of these methods, followed by a brief survey of these modulator identification algorithms. As an early attempt to extend the concept to a new RNA regulation mechanism, competitive endogenous RNA (ceRNA), within a modulator framework, we provide two applications to illustrate the network construction, the modulation effect, and the preliminary findings from these networks. The methods we surveyed and developed are used to dissect the regulated network under different modulators. Beyond these applications, the concept of modulation can be adapted to various biological mechanisms to discover novel gene regulation mechanisms.
1. Introduction
With the development of microarray [1] and, more recently, next-generation sequencing techniques [2], transcriptional profiling of biological samples, such as tumor samples [3–5] and samples from other model organisms, has been carried out in order to study sample subtypes at the molecular level or transcriptional regulation during biological processes [6–8]. While common data analysis methods employ hierarchical clustering algorithms or pattern classification to explore correlated genes and their functions, genetic regulatory network (GRN) approaches have been employed to scrutinize for dysregulation between different tumor groups or biological processes (see reviews [9–12]).
Figure 1: Regulator-target pair in genetic regulatory network model: (a) basic regulator-target pair and (b) regulator-target complex.
(1)
Figure 2: Three different cases of regulation of gene expression that share the network representation of a regulator-target interaction.
( ( + 1) , 1 () , . . . , () , ())
= ( ( + 1) | Parents ( ( + 1)))
(Parents ( ( + 1))) ,
Modulator
(2)
Regulator
Target
Interact
(3)
(4)
1 [=1 ( )+, ( , )]
.
(5)
Or a candidate interaction can be identified using an estimate of the mutual information MI of genes $g_i$ and $g_j$, $\mathrm{MI}(g_i, g_j) = \mathrm{MI}_{ij}$, where $\mathrm{MI}_{ij} = 1$ if genes $g_i$ and $g_j$ are identical, and $\mathrm{MI}_{ij}$ is zero if $p(g_i, g_j) = p(g_i)p(g_j)$, that is, if $g_i$ and $g_j$ are statistically independent. Specifically, the mutual information of the expression levels of the regulator and target genes is estimated using a Gaussian kernel estimator. ARACNE takes two additional steps to clean the network: (1) removing $\mathrm{MI}_{ij}$ if its value is less than that derived from two independent genes via random permutation and (2) applying the data processing inequality (DPI). The algorithm further assumes that, for a gene triplet $(g_i, g_j, g_k)$ in which $g_i$ regulates $g_k$ through $g_j$,
$$\mathrm{MI}_{ik} < \min(\mathrm{MI}_{ij}, \mathrm{MI}_{jk}). \tag{6}$$
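The DPI pruning step can be sketched as follows: in every fully connected gene triplet, the edge with the smallest mutual information is treated as an indirect interaction and removed. The dictionary layout, the function name `apply_dpi`, and the optional `tolerance` parameter are assumptions made for this illustration; it is not the reference ARACNE implementation.

```python
from itertools import combinations


def apply_dpi(mi, tolerance=0.0):
    """Prune indirect edges using the data processing inequality.

    mi maps frozenset({gene_a, gene_b}) to a mutual information value.
    Returns the set of edges that survive pruning.
    """
    genes = sorted({g for pair in mi for g in pair})
    removed = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(e in mi for e in edges):
            continue  # DPI only applies to fully connected triplets
        weakest = min(edges, key=lambda e: mi[e])
        others = [mi[e] for e in edges if e != weakest]
        # Remove the weakest edge if it is clearly below both other edges.
        if mi[weakest] < min(others) * (1.0 - tolerance):
            removed.add(weakest)
    return set(mi) - removed


# Hypothetical triplet: X regulates Z only through Y.
example_mi = {
    frozenset(("X", "Y")): 0.9,
    frozenset(("Y", "Z")): 0.8,
    frozenset(("X", "Z")): 0.3,
}
kept_edges = apply_dpi(example_mi)
```

Edges are marked during the scan and removed only at the end, so the outcome does not depend on the order in which triplets are visited.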
(7)
where is the probability of the modulator being absent. Particularly, an uncorrelated and correlated bivariant Gaussian
distributions were introduced to model diferent modulated
regulator-target relationship, such that
(, | = 0) =
(, | = 1)
=
21 2
1 (1/2)(2 +2 )
,
(1/2)(
+2 +2)/(12 )
(9a)
(9b)
a partition of the paired expression samples into correlated and uncorrelated samples. The paired expression samples that possess such a correlated-uncorrelated partition (mixing probability between 0.3 and 0.7 and absolute correlation above 0.8) are determined to be modulated. To identify the modulator of a modulated pair (or a group of modulated pairs), a weighted t-test was developed to search for the genes that are differentially expressed in the correlated partition versus the uncorrelated partition.
3.4. GEM (Gene Expression Modulator). GEM [46] improves
over MINDy by predicting how a modulator-TF interaction
afects the expression of the target gene. It can detect new
types of interactions that result in stronger correlation but
low , which therefore would be missed by MINDy. GEM
hypothesizes that the correlation between the expression of a
modulator and a target must change, as that of the TF
changes. Unlike the previously surveyed algorithms, GEM first transforms the continuous expression levels to binary states (up- (1) or down-expression (0)) and then works only with discrete expression states. To model the hypothesized relationship, the following model is proposed:
( = 1 | , ) = + + + ,
(10)
(11)
These quantities and their associated statistical significance can be computed from collections of gene expression data with 250 or more samples. Hermes expands MINDy by providing the capacity to identify candidate modulator genes of miRNA activity. The presence of these modulators will affect the relation between the expression of the miRNAs targeting a gene and the expression level of this gene.
In summary, we surveyed some of the most popular algorithms for the inference of modulators. Additional modulator identification algorithms are summarized in Table 1. It is worth noting that the concept of the modulator applies to cases beyond those discussed in this paper. One such example is the multilayer integrated regulatory model proposed in Yan et al. [49], where the top layer of regulators could also be considered as modulators.
Table 1: Gene regulation network and modulator identification methods.

Algorithm           References
ARACNE              [14, 42]
Network profiler    [47]
MINDy               [44]
Mimosa              [45]
GEM                 [46]
MuTaMe              [21]
Hermes              [20]
ER modulator        [48]
Gene Ontology analysis of CYYR1 and its connected neighbor genes revealed signifcant association with extracellular
matrix, epithelial tube formation, and angiogenesis.
4.2. TraceRNA. To identify the regulation network of ceRNAs for a GoI, we developed a web-based application, TraceRNA, presented earlier in [50], with an extension to regulation network construction. The analysis flow chart of TraceRNA is shown in Figure 6. For a selected GoI, the GoI binding miRNAs (GBmiRs) were derived from either validated miRNAs from miRTarBase [51] or predicted miRNAs from SVMicrO [52]. Then mRNAs (other than the given GoI) also targeted by GBmiRs were identified as the candidate ceRNAs. The relevant (or tumor-specific) gene expression data were used to further strengthen the relationship between the ceRNA candidates and the GoI. The candidate ceRNAs that were coexpressed with the GoI were reported as putative ceRNAs. To construct the gene regulation network via GBmiRs, we set each ceRNA as a secondary GoI, and the ceRNAs of these secondary GoIs were identified by applying the algorithm recursively. Upon identifying all the ceRNAs, the regulation network of ceRNAs of a given GoI was constructed.
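The workflow described above can be sketched as a depth-limited expansion: find the miRNAs that bind the gene of interest, collect the other genes those miRNAs target, keep the coexpressed ones as putative ceRNAs, and repeat with each of them as a secondary GoI. The data structures, the function name `cerna_network`, the `max_depth` limit, and the toy inputs are all assumptions for illustration; the sketch does not call miRTarBase, SVMicrO, or the TraceRNA service.

```python
def cerna_network(goi, mirna_targets, coexpressed, max_depth=2):
    """Build a set of directed (GoI, ceRNA) edges starting from one gene.

    mirna_targets: dict mapping each miRNA to the set of genes it targets.
    coexpressed: function (gene_a, gene_b) -> bool, e.g. a correlation test.
    max_depth: number of expansion rounds (assumed stopping rule).
    """
    edges = set()
    visited = set()
    frontier = [goi]
    for _ in range(max_depth):
        next_frontier = []
        for gene in frontier:
            if gene in visited:
                continue
            visited.add(gene)
            # GoI binding miRNAs: every miRNA that targets this gene.
            gbmirs = [m for m, targets in mirna_targets.items()
                      if gene in targets]
            # Candidate ceRNAs: other genes sharing at least one of them.
            candidates = set()
            for m in gbmirs:
                candidates |= mirna_targets[m] - {gene}
            # Putative ceRNAs: candidates that are coexpressed with the gene.
            for cand in sorted(candidates):
                if coexpressed(gene, cand):
                    edges.add((gene, cand))
                    next_frontier.append(cand)
        frontier = next_frontier
    return edges


# Hypothetical inputs: two miRNAs and a coexpression rule excluding G1-G3.
toy_targets = {"miR-a": {"G1", "G2", "G3"}, "miR-b": {"G2", "G4"}}
toy_network = cerna_network(
    "G1", toy_targets, lambda a, b: {a, b} != {"G1", "G3"})
```

The `visited` set keeps the expansion from revisiting genes that were already used as a GoI, which is what prevents cycles between mutually competing transcripts from looping forever.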
[Figure 7 image: (a) ceRNA network graph of putative ceRNAs; (b) enlarged view of the graph around ESR1. Node labels (gene symbols) are not reproduced.]
Figure 7: (a) ceRNA network for gene of interest ESR1 generated using TraceRNA. (b) Network graph enlarged at ESR1.
5. Conclusions
In this report, we attempt to provide a unified concept of
modulation of gene regulation, encompassing earlier mRNA
expression-based methods and, more recently, the ceRNA method. We
expect the integration of the ceRNA concept into gene-gene
interactions and their modulator identification will further
Table 3: Top 18 candidate ceRNAs for ESR1 as GOI obtained from TraceRNA. ESR1 is at rank 174 (not listed in this table).
Gene symbol   SVMicrO-based prediction   Expression correlation   Final score
              Score     p-value          Score     p-value
FOXP1         1.066     0.0043           0.508     0.016          1212
VEZF1         0.942     0.0060           0.4868    0.020          1179
NOVA1         0.896     0.0067           0.479     0.023          1160
CPEB3         0.858     0.0074           0.484     0.022          1149
MAP2K4        0.919     0.0064           0.322     0.097          1139
FAM120A       0.885     0.0069           0.341     0.082          1130
PCDHA3        0.983     0.0054           0.170     0.215          1125
SIRT1         0.927     0.0062           0.230     0.162          1117
PCDHA5        0.983     0.0054           0.148     0.233          1113
PTEN          0.898     0.0067           0.221     0.168          1104
PCDHA1        0.983     0.0054           0.140     0.239          1103
NBEA          0.752     0.0098           0.491     0.020          1102
ZFHX4         0.970     0.0056           0.154     0.229          1097
GLCE          0.798     0.0087           0.3231    0.096          1096
MAGI2         0.777     0.0092           0.321     0.097          1086
SATB2         0.801     0.0086           0.243     0.151          1078
LEF1          0.753     0.0098           0.291     0.112          1065
ATPBD4        0.819     0.0082           0.170     0.215          1060
Authors' Contribution
M. Flores and T.-H. Hsiao contributed equally to this work.
Acknowledgments
The authors would like to thank the funding support of
this work by Qatar National Research Foundation (NPRP
09-874-3-235) to Y. Chen and Y. Huang and by National Science
Foundation (CCF-1246073) to Y. Huang. The authors also
thank the computational support provided by the UTSA
Computational Systems Biology Core Facility (NIH RCMI
5G12RR013646-12).
Research Article
Spectral Analysis on Time-Course Expression Data: Detecting
Periodic Genes Using a Real-Valued Iterative Adaptive Approach
Kwadwo S. Agyepong,1 Fang-Han Hsu,1 Edward R. Dougherty,1,2 and Erchin Serpedin1
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004-2101, USA
1. Introduction
Patterns of periodic gene expression have been found to
be associated with essential biological processes such as
the cell cycle and circadian rhythm [1], and the detection of
periodic genes is crucial to advance our understanding of
gene function, disease pathways, and, ultimately, therapeutic solutions. Using high-throughput technologies such as
microarrays, gene expression profiles at discrete time points
can be derived, and hundreds of cell cycle regulated genes have
been reported in a variety of species. For example, Spellman
et al. applied cell synchronization methods and conducted
time-course gene expression experiments on Saccharomyces
cerevisiae [2]. The authors identified 800 cell cycle regulated
genes using DNA microarrays. Also, Rustici et al. and Menges
et al. identified 407 and about 500 cell cycle regulated genes
in Schizosaccharomyces pombe and Arabidopsis, respectively
[3, 4].
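2. RIAA Algorithm
2.1. Basics. Suppose that the signals associated with the periodic gene expressions are composed of noise and sinusoidal components. Let $y(t_n)$, $n = 1, \ldots, N$, denote the time-course expression ratios of a gene at instances $t_1, \ldots, t_N$, respectively; the $y(t_n)$ are real numbers with $\sum_{n=1}^{N} y(t_n) = 0$. The least-squares periodogram is given by
\[ P_{\mathrm{LS}}(\omega) = |\hat{\beta}(\omega)|^2, \tag{1} \]
where
\[ \hat{\beta}(\omega) = \arg\min_{\beta(\omega)} \sum_{n=1}^{N} \left| y(t_n) - \beta(\omega) e^{j\omega t_n} \right|^2. \tag{2} \]
With $\beta(\omega) = |\beta| e^{j\phi}$, the criterion in (2) can be written as
\[ \sum_{n=1}^{N} \left[ y(t_n) - |\beta| \cos(\omega t_n + \phi) \right]^2 + |\beta|^2 \sum_{n=1}^{N} \sin^2(\omega t_n + \phi). \tag{3} \]
The second term in the above equation is data independent and can be omitted from the minimization operation. Hence, with $\alpha = |\beta|$, the criterion (2) is simplified to
\[ (\hat{\alpha}, \hat{\phi}) = \arg\min_{\alpha, \phi} \sum_{n=1}^{N} \left[ y(t_n) - \alpha \cos(\omega t_n + \phi) \right]^2. \tag{4} \]
Using $a = \alpha \cos\phi$ and $b = -\alpha \sin\phi$, (4) is equivalent to
\[ (\hat{a}, \hat{b}) = \arg\min_{a, b} \sum_{n=1}^{N} \left[ y(t_n) - a \cos(\omega t_n) - b \sin(\omega t_n) \right]^2. \tag{5} \]
The solution is
\[ [\hat{a} \;\; \hat{b}]^T = R^{-1} r, \tag{6} \]
where
\[ R = \sum_{n=1}^{N} \begin{bmatrix} \cos^2(\omega t_n) & \cos(\omega t_n) \sin(\omega t_n) \\ \sin(\omega t_n) \cos(\omega t_n) & \sin^2(\omega t_n) \end{bmatrix}, \qquad r = \sum_{n=1}^{N} \begin{bmatrix} \cos(\omega t_n) \\ \sin(\omega t_n) \end{bmatrix} y(t_n). \tag{7} \]
2.2. Observation Interval and Resolution. Prior to implementation of RIAA for periodogram estimation, the observation interval $[0, \omega_{\max}]$ and the resolution in terms of grid size have to be selected. To this end, the maximum frequency $\omega_{\max}$ in the observation interval without aliasing errors for sampling instances $t_1, \ldots, t_N$ can be evaluated by
\[ \omega_{\max} = \frac{\omega_0}{2}, \tag{8} \]
where $\omega_0$ is given by
\[ \omega_0 = \frac{2\pi (N - 1)}{\sum_{n=1}^{N-1} (t_{n+1} - t_n)}. \tag{9} \]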
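As a rough illustration of (6)-(9), the sketch below fits a single sinusoid at each grid frequency by solving the 2x2 system of (6)-(7) and evaluates the alias-free limit of (8)-(9). The function names, the sampling times, the test signal, and the scaling of the periodogram by 1/N are assumptions made for this example, not details taken from the paper.

```python
import math

def max_frequency(t):
    """omega_max = omega_0 / 2, with omega_0 = 2*pi*(N-1) / sum of sampling gaps."""
    n = len(t)
    omega0 = 2.0 * math.pi * (n - 1) / sum(t[i + 1] - t[i] for i in range(n - 1))
    return omega0 / 2.0

def ls_fit(t, y, omega):
    """Solve R [a, b]^T = r at one frequency; also return r^T R^-1 r."""
    c = [math.cos(omega * tn) for tn in t]
    s = [math.sin(omega * tn) for tn in t]
    rcc = sum(ci * ci for ci in c)
    rcs = sum(ci * si for ci, si in zip(c, s))
    rss = sum(si * si for si in s)
    rc = sum(ci * yi for ci, yi in zip(c, y))
    rs = sum(si * yi for si, yi in zip(s, y))
    det = rcc * rss - rcs * rcs           # determinant of the 2x2 matrix R
    a = (rss * rc - rcs * rs) / det
    b = (rcc * rs - rcs * rc) / det
    return a, b, a * rc + b * rs          # [a b] r equals r^T R^-1 r

def ls_periodogram(t, y, omegas):
    """Assumed scaling for illustration: P(omega) = r^T R^-1 r / N."""
    return [ls_fit(t, y, w)[2] / len(t) for w in omegas]

# Hypothetical irregular sampling times and a noise-free test signal.
t = [0.0, 0.4, 1.1, 1.5, 2.3, 3.0, 3.2, 4.1, 5.0, 5.6, 6.9, 8.0]
y = [2.0 * math.cos(1.2 * tn) + 0.5 * math.sin(1.2 * tn) for tn in t]
omegas = [0.05 * k for k in range(1, 61)]
power = ls_periodogram(t, y, omegas)
peak = omegas[max(range(len(omegas)), key=lambda k: power[k])]
```

Because the test signal lies exactly in the span of the cosine and sine at 1.2 rad per time unit, the fit at that frequency recovers a = 2 and b = 0.5, and the periodogram peaks there.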
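[Equation (10) is not recoverable from the source.]
With grid size $\Delta\omega$, the frequency grid is
\[ \omega_k = k \, \Delta\omega, \quad k = 1, \ldots, K, \tag{11} \]
\[ K = \left\lfloor \frac{\omega_{\max}}{\Delta\omega} \right\rfloor. \tag{12} \]
\[ Y = [y(t_1) \; \cdots \; y(t_N)]^T, \quad \theta_k = [a(\omega_k) \;\; b(\omega_k)]^T, \quad A_k = [c_k \;\; s_k], \tag{13} \]
where
\[ c_k = [\cos(\omega_k t_1) \; \cdots \; \cos(\omega_k t_N)]^T, \quad s_k = [\sin(\omega_k t_1) \; \cdots \; \sin(\omega_k t_N)]^T, \tag{14} \]
\[ Q_k = \Sigma + \sum_{p=1,\, p \neq k}^{K} A_p D_p A_p^T, \tag{15} \]
where
\[ D_p = \frac{a^2(\omega_p) + b^2(\omega_p)}{2} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \tag{16} \]
\[ \Sigma = \begin{bmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{bmatrix}. \tag{17} \]
Assuming that $Q_k$ is invertible, in RIAA, a weighted least-squares fitting problem is formulated and considered for finding $a$ and $b$ (instead of using (5)), and it is written in the form of matrices using (13) as follows:
\[ \hat{\theta}_k = \arg\min_{\theta_k} \, [Y - A_k \theta_k]^T Q_k^{-1} [Y - A_k \theta_k]. \tag{18} \]
The solution is
\[ \hat{\theta}_k = \left( A_k^T Q_k^{-1} A_k \right)^{-1} A_k^T Q_k^{-1} Y, \tag{19} \]
and the RIAA periodogram at $\omega_k$ is
\[ \hat{P}(\omega_k) = \frac{1}{N} \hat{\theta}_k^T \left( A_k^T A_k \right) \hat{\theta}_k. \tag{20} \]
[Equations (21) and (22) are not recoverable from the source; (22) involves a minimization over $[0, 2\pi]$, where $\| \cdot \|$ is the Euclidean norm.]
With estimates $\hat{D}_p^1$ and $\hat{\Sigma}^1$, the estimates $\hat{Q}_k^1$, $k = 1, \ldots, K$, in the first iteration are
3. Methods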
Algorithm RIAA
Initialization
Use (6) to obtain the initial estimates $\hat{a}(\omega_k)$ and $\hat{b}(\omega_k)$ in $\hat{\theta}_k^0$.
The First Iteration
Obtain $\hat{D}_p^1$ using (16) with parameters $a$ and $b$ given by $\hat{\theta}_p^0$. Obtain $\hat{\Sigma}^1$ using (22). Use $\hat{D}_p^1$ and $\hat{\Sigma}^1$ to derive the first weighted matrix $\hat{Q}_k^1$ by (15). Update the estimate $\hat{\theta}_k^1$ by (19) with $Q_k = \hat{Q}_k^1$.
Updating Iteration
At the $i$th iteration, $i = 1, 2, \ldots$, the estimates $\hat{Q}_k^i$ and $\hat{\theta}_k^i$ are iteratively updated in the same way
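The iteration in the algorithm box can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the noise covariance is replaced by a fixed term `noise_var` times the identity (the paper's own noise estimate is not available here), the number of iterations is fixed, and the helper names `solve`, `wls`, and `riaa` as well as the test data are hypothetical.

```python
import math

def solve(M, B):
    """Solve M X = B (B may have several columns) by Gauss-Jordan elimination."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))  # partial pivoting
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def wls(c, s, y, Q):
    """Weighted LS estimate (A^T Q^-1 A)^-1 A^T Q^-1 Y for A = [c s]."""
    Z = solve(Q, [[ci, si, yi] for ci, si, yi in zip(c, s, y)])  # rows of Q^-1 [c s Y]
    gcc = sum(ci * z[0] for ci, z in zip(c, Z))
    gcs = sum(ci * z[1] for ci, z in zip(c, Z))
    gsc = sum(si * z[0] for si, z in zip(s, Z))
    gss = sum(si * z[1] for si, z in zip(s, Z))
    hc = sum(ci * z[2] for ci, z in zip(c, Z))
    hs = sum(si * z[2] for si, z in zip(s, Z))
    det = gcc * gss - gcs * gsc
    return (gss * hc - gcs * hs) / det, (gcc * hs - gsc * hc) / det

def riaa(t, y, omegas, n_iter=5, noise_var=1e-3):
    """Return the periodogram (1/N) theta^T A^T A theta on the grid after n_iter updates."""
    n = len(t)
    C = [[math.cos(w * tn) for tn in t] for w in omegas]
    S = [[math.sin(w * tn) for tn in t] for w in omegas]
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    theta = [wls(c, s, y, eye) for c, s in zip(C, S)]   # initialization: plain LS
    for _ in range(n_iter):
        d = [(a * a + b * b) / 2.0 for a, b in theta]    # D_p = d_p * I
        # Assumed noise term plus the sum over all grid points of A_p D_p A_p^T.
        total = [[noise_var * eye[i][j]
                  + sum(dp * (c[i] * c[j] + s[i] * s[j]) for dp, c, s in zip(d, C, S))
                  for j in range(n)] for i in range(n)]
        # Q_k removes the k-th term from the total before the weighted LS update.
        theta = [wls(c, s, y,
                     [[total[i][j] - dk * (c[i] * c[j] + s[i] * s[j]) for j in range(n)]
                      for i in range(n)])
                 for dk, c, s in zip(d, C, S)]
    return [sum((a * cn + b * sn) ** 2 for cn, sn in zip(c, s)) / n
            for (a, b), c, s in zip(theta, C, S)]

# Hypothetical test: dense early sampling, sparse late sampling, small perturbation.
t = [0.5, 1.0, 1.4, 2.1, 2.5, 3.0, 3.6, 4.0, 4.7, 5.5, 6.0, 7.2, 8.5, 10.0, 12.5, 16.0]
y = [math.cos(1.2 * tn + 0.3) + 0.05 * math.sin(7.3 * i) for i, tn in enumerate(t)]
omegas = [0.1 * k for k in range(1, 31)]
power = riaa(t, y, omegas)
peak = omegas[max(range(len(omegas)), key=lambda k: power[k])]
```

Building the full weighted matrix once per iteration and subtracting the k-th term keeps the cost per grid point at one N x N solve, which is adequate for the short time series considered here.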
3.1. Fisher's Test. After the spectrum of time-course expression data is obtained via periodogram estimation, a Fisher's statistic $f$ for a gene, with the null hypothesis $H_0$ that the peak of the spectral density is insignificant against the alternative hypothesis $H_1$ that the peak of the spectral density is significant, is applied as
\[ f = \frac{\max_{1 \le k \le K} \hat{P}(\omega_k)}{K^{-1} \sum_{k=1}^{K} \hat{P}(\omega_k)}. \tag{23} \]
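A small sketch of the statistic in (23) is given below. The p-value function is an assumption added for illustration: it uses the classical exact formula for Fisher's g statistic (peak divided by the sum of the ordinates), which is derived for evenly sampled Gaussian white noise and may differ from the p-value expression used in the paper.

```python
import math

def fisher_statistic(power):
    """Peak of the periodogram divided by its mean over the K grid points."""
    return max(power) / (sum(power) / len(power))

def fisher_g_pvalue(power):
    """Classical exact p-value for g = max / sum (assumed, see the note above)."""
    k = len(power)
    g = max(power) / sum(power)
    p = 0.0
    for j in range(1, int(math.floor(1.0 / g)) + 1):
        p += (-1) ** (j - 1) * math.comb(k, j) * max(0.0, 1.0 - j * g) ** (k - 1)
    return min(1.0, max(0.0, p))

flat = [1.0] * 8              # no dominant peak
peaked = [0.1] * 7 + [10.0]   # one dominant peak
```

Note that the statistic of (23) equals K times the classical g, so ranking genes by either quantity gives the same order when K is the same for all genes.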
[Flowchart figure, box labels: time-course expression ratios; spectral analysis in frequency domain; periodogram estimation (RIAA, compared with LS, DLS); hypothesis testing (Fisher's test); ROC curves; benchmark sets; simulations; real data.]
[Figure 2 panels: (a) gene expression versus time, showing the periodic signal and the sampled data; (b) amplitude versus frequency for the RIAA periodogram.]
Figure 2: (a) A time-course periodic signal with frequency 0.2 sampled by the bio-like sampling strategy; 16 time points are assigned to
the interval (0,8], and 8 time points are assigned to the interval (8,16]. (b) The periodogram derived using RIAA. The maximum value (peak)
in the periodogram is located at frequency 0.195.
4. Results
RIAA performed well in the conducted simulations. As
shown in Figure 2(a), a periodic signal (solid line) with
amplitude = 1 and frequency = 0.4 is sampled
[Figure 3 panels (a)-(h): ROC curves (sensitivity versus 1-specificity) for RIAA, LS, and DLS.]
Figure 3: The ROC curves derived from simulations with 24 sampling time points, signal amplitude = 1, frequency = 0.4, and Gaussian noise
with μ = 0 and σ = 0.5. Description of subplots is provided in Section 4.
[Figure 4 panels (a)-(h): ROC curves (sensitivity versus 1-specificity) for RIAA, LS, and DLS.]
Figure 4: The ROC curves derived from simulations with 24 sampling time points, signal amplitude = 1, frequency = 0.1, and Gaussian noise
with μ = 0 and σ = 0.5. Description of subplots is provided in Section 4.
[Figure 5 panels (a)-(f): the number of intersections with the benchmark sets (including the 518-gene benchmark set) for RIAA, LS, and DLS.]
Figure 5: The intersection of preserved genes and the benchmark sets using RIAA, LS, and DLS algorithms. (a), (b), and (c) reveal the analysis
results when dataset alpha was applied. (d), (e), and (f) reveal the analysis results when dataset alpha 38 was applied.
set, and 518-gene benchmark set were applied, respectively.
Similarly, Figures 5(d)–5(f) demonstrate the results derived
from dataset alpha 38. RIAA does not result in significant
differences in the numbers of intersections when compared
to those corresponding to LS and DLS in most of these
cases. However, RIAA shows slightly better coverage when
the dataset alpha 38 and the 113-gene benchmark set were
utilized (Figure 5(d)).
5. Conclusions
In this study, rigorous simulations specifically designed to conform to real experiments reveal that RIAA can outperform the classical LS and modified DLS algorithms when the sampling time points are highly irregular and when the number of cycles covered by the sampling times is very limited. These characteristics, as also claimed in the original study by Stoica et al. [12], suggest that RIAA can be generally applied to detect periodicities in time-course gene expression data with good potential to yield better results. A supplementary simulation further shows the superiority of RIAA over LS and DLS when multiple periodic signals are considered (see Supplementary Figure s1 available online at http://dx.doi.org/10.1155/2013/171530). From the simulations, we also learned that the addition of a transcriptional burst and a sudden drop to nonperiodic signals (the negatives) does not affect the power of RIAA in terms of periodicity detection. Moreover, the detrend function in DLS, designed to improve LS by removing the linearity in time-course data, may fail to provide improved accuracy and makes the algorithm unable to detect periodicities when transcription oscillates with a very low frequency.
The intersection of detected candidates and proposed periodic genes in the real data analysis (Figure 5) does not reveal much difference among RIAA, LS, and DLS. One possible reason is that the sampling time points in the yeast experiment are not highly irregular (not many missing values are included); as demonstrated in Figures 3(a)–3(d), RIAA performs only as well as the LS and DLS algorithms when the time-course data are regularly or mildly irregularly sampled. Also, the very limited number of time points contained in the dataset may distort the estimation of p-values [14] and thus hinder RIAA from exhibiting its advantage. Besides, the number of true cell cycle genes included in the benchmark sets remains uncertain. We expect that the superiority of RIAA in real data analysis will become clearer in the future when more studies and more datasets become available.
Besides the comparison of these algorithms, it is interesting to note that the bio-like sampling strategy could lead to better detection of periodicities than the regular sampling strategy (as shown in Figures 3(c) and 3(d)). It might be beneficial to apply looser sampling time intervals at later periods to prolong the experimental time coverage when the number of time points is limited.
Acknowledgments
The authors would like to thank the members in the Genomic
Signal Processing Laboratory, Texas A&M University, for
References
[1] W. Zhao, K. Agyepong, E. Serpedin, and E. R. Dougherty, "Detecting periodic genes from irregularly sampled gene expressions: a comparison study," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2008, Article ID 769293, 2008.
[2] P. T. Spellman, G. Sherlock, M. Q. Zhang et al., "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
[3] G. Rustici, J. Mata, K. Kivinen et al., "Periodic gene expression program of the fission yeast cell cycle," Nature Genetics, vol. 36, no. 8, pp. 809–817, 2004.
[4] M. Menges, L. Hennig, W. Gruissem, and J. A. H. Murray, "Cell cycle-regulated gene expression in Arabidopsis," Journal of Biological Chemistry, vol. 277, no. 44, pp. 41987–42002, 2002.
[5] M. Ahdesmaki, H. Lahdesmaki, R. Pearson, H. Huttunen, and O. Yli-Harja, "Robust detection of periodic time series measured from biological systems," BMC Bioinformatics, vol. 6, article 117, 2005.
[6] M. Ahdesmaki, H. Lahdesmaki, A. Gracey et al., "Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data," BMC Bioinformatics, vol. 8, article 233, 2007.
[7] E. F. Glynn, J. Chen, and A. R. Mushegian, "Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms," Bioinformatics, vol. 22, no. 3, pp. 310–316, 2006.
[8] R. Yang, C. Zhang, and Z. Su, "LSPR: an integrated periodicity detection algorithm for unevenly sampled temporal microarray data," Bioinformatics, vol. 27, no. 7, pp. 1023–1025, 2011.
[9] E. R. Dougherty, "Small sample issues for microarray-based classification," Comparative and Functional Genomics, vol. 2, no. 1, pp. 28–34, 2001.
[10] Y. Tu, G. Stolovitzky, and U. Klein, "Quantitative noise analysis for gene expression microarray experiments," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 22, pp. 14031–14036, 2002.
[11] Z. Bar-Joseph, "Analyzing time series gene expression data," Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004.
[12] P. Stoica, J. Li, and H. He, "Spectral analysis of nonuniformly sampled data: a new approach versus the periodogram," IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 843–858, 2009.
[13] J. Fan and Q. Yao, Nonlinear Time Series: Nonparametric and Parametric Methods, Springer, New York, NY, USA, 2003.
[14] A. W. C. Liew, N. F. Law, X. Q. Cao, and H. Yan, "Statistical power of Fisher test for the detection of short periodic gene expression profiles," Pattern Recognition, vol. 42, no. 4, pp. 549–556, 2009.
[15] V. Berger, "Pros and cons of permutation tests in clinical trials," Statistics in Medicine, vol. 19, no. 10, pp. 1319–1328, 2000.
[16] A. P. Bradley, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[17] J. R. Chubb, T. Trcek, S. M. Shenoy, and R. H. Singer, "Transcriptional pulsing of a developmental gene," Current Biology, vol. 16, no. 10, pp. 1018–1025, 2006.
[18] T. Pramila, W. Wu, W. Noble, and L. Breeden, "Periodic genes of the yeast Saccharomyces cerevisiae: a combined analysis of five cell cycle data sets," 2007.
[19] U. Lichtenberg, L. J. Jensen, A. Fausbøll, T. S. Jensen, P. Bork, and S. Brunak, "Comparison of computational methods for the identification of cell cycle-regulated genes," Bioinformatics, vol. 21, no. 7, pp. 1164–1171, 2005.
[20] A. W. C. Liew, J. Xian, S. Wu, D. Smith, and H. Yan, "Spectral estimation in unevenly sampled space of periodically expressed microarray time series data," BMC Bioinformatics, vol. 8, article 137, 2007.
[21] D. Johansson, P. Lindgren, and A. Berglund, "A multivariate approach applied to microarray data for identification of genes with cell cycle-coupled transcription," Bioinformatics, vol. 19, no. 4, pp. 467–473, 2003.
[22] I. Simon, J. Barnett, N. Hannett et al., "Serial regulation of transcriptional regulators in the yeast cell cycle," Cell, vol. 106, no. 6, pp. 697–708, 2001.
[23] T. I. Lee, N. J. Rinaldi, F. Robert et al., "Transcriptional regulatory networks in Saccharomyces cerevisiae," Science, vol. 298, no. 5594, pp. 799–804, 2002.
[24] H. W. Mewes, D. Frishman, U. Guldener et al., "MIPS: a database for genomes and protein sequences," Nucleic Acids Research, vol. 30, no. 1, pp. 31–34, 2002.
Research Article
Identification of Robust Pathway Markers for Cancer through
Rank-Based Pathway Activity Inference
Navadon Khunlertgit and Byung-Jun Yoon
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
Correspondence should be addressed to Byung-Jun Yoon; bjyoon@ece.tamu.edu
Received 30 November 2012; Accepted 19 January 2013
Academic Editor: Hazem Nounou
Copyright 2013 N. Khunlertgit and B.-J. Yoon. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
One important problem in translational genomics is the identification of reliable and reproducible markers that can be used to
discriminate between different classes of a complex disease, such as cancer. The typical small sample setting makes the prediction
of such markers very challenging, and various approaches have been proposed to address this problem. For example, it has been
shown that pathway markers, which aggregate the gene activities in the same pathway, tend to be more robust than gene markers.
Furthermore, the use of gene expression ranking has been demonstrated to be robust to batch effects and to lead to more
interpretable results. In this paper, we propose an enhanced pathway activity inference method that uses gene ranking to predict the
pathway activity in a probabilistic manner. The main focus of this work is on identifying robust pathway markers that can ultimately
lead to robust classifiers with reproducible performance across datasets. Simulation results based on multiple breast cancer datasets
show that the proposed inference method identifies better pathway markers that can predict breast cancer metastasis with higher
accuracy. Moreover, the identified pathway markers can lead to better classifiers with more consistent classification performance
across independent datasets.
1. Introduction
Advances in microarray and sequencing technologies have
enabled the measurement of genome-wide expression profiles, which have spawned a large number of studies aiming
to make accurate diagnosis and prognosis based on gene
expression profiles [1–4]. For example, there has been a significant amount of work on identifying markers and building
classifiers that can be used to predict breast cancer metastasis
[2, 4]. Many existing methods have directly employed gene
expression data without any knowledge of the interrelations
between genes. As a result, the predicted gene markers often
lack interpretability and many of them are not reproducible
in other independent datasets.
To overcome this problem, several different approaches
have been proposed so far. For example, a recent work by
Geman et al. [3] proposed an approach that utilizes the
relative expression between genes, rather than their absolute
expression values. It was shown that the resulting markers
are easier to interpret, robust to chip-to-chip variations,
and more reproducible across datasets. Another possible
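\[ r = \{ r_{i,j} \mid 1 \le i < j \le n \}, \tag{1} \]
where
\[ r_{i,j} = \begin{cases} 1, & \text{if } x_i < x_j, \\ 0, & \text{otherwise,} \end{cases} \tag{2} \]
and $x_i$ denotes the expression level of gene $g_i$ in a pathway with $n$ member genes. The pathway activity level is given by
\[ a = \sum_{1 \le i < j \le n} \lambda_{i,j}(r_{i,j}), \tag{3} \]
\[ \lambda_{i,j}(r) = \log \left[ \frac{f^1_{i,j}(r)}{f^2_{i,j}(r)} \right], \tag{4} \]
where $f^1_{i,j}(r)$ is the conditional probability mass function (PMF) of the ranking of the expression level of gene $g_i$ and gene $g_j$ under phenotype 1 and $f^2_{i,j}(r)$ is the conditional PMF of the ranking of the expression level of gene $g_i$ and gene $g_j$ under phenotype 2.
In practice, the number of possible gene pairs $\binom{n}{2}$ may be too large when we have large pathways with many member genes (i.e., when $n$ is large). To reduce the computational complexity, we prescreen the gene pairs based on the mutual information [21] as follows. For every gene pair $(g_i, g_j)$, we first compute the mutual information between the ranking $r_{i,j}$ and the corresponding phenotype. Then we select the top 10% of gene pairs with the highest mutual information and use only these gene pairs for computing the pathway activity level defined in (3). Although we selected the top 10% of gene pairs for simplicity, this may not necessarily be optimal, and one may also think of other strategies for adaptively choosing this threshold.
In a practical setting, we may not have enough training data to reliably estimate the PMFs $f^1_{i,j}(r)$ and $f^2_{i,j}(r)$. For this reason, the LLRs are normalized as
\[ \hat{\lambda}_{i,j}(r_{i,j}) = \frac{\lambda_{i,j}(r_{i,j}) - \mu(\lambda_{i,j})}{\sigma(\lambda_{i,j})}, \tag{5} \]
where $\mu(\lambda_{i,j})$ and $\sigma(\lambda_{i,j})$ are the mean and the standard deviation of $\lambda_{i,j}$ across the samples. The t-test score is
\[ t = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / K_1 + \sigma_2^2 / K_2}}, \tag{6} \]
where $\mu_c$, $\sigma_c$, and $K_c$ are the mean, the standard deviation, and the number of samples for phenotype $c$.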
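The rank-based activity inference of (1)-(4) can be sketched as follows. This is an illustrative outline under assumptions, not the authors' code: the PMFs are estimated by counting with a pseudocount of 0.5 (a choice made here to avoid taking the logarithm of zero), the mutual-information prescreening and the normalization step are omitted, and the function names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def rank_vector(x):
    """r[(i, j)] = 1 if x[i] < x[j], else 0, for all gene pairs i < j."""
    return {(i, j): int(x[i] < x[j]) for i, j in combinations(range(len(x)), 2)}

def estimate_llr(samples, labels, pseudo=0.5):
    """Per-pair log-likelihood ratios log f1(r) / f2(r) for r in {0, 1}."""
    n = len(samples[0])
    llr = {}
    for i, j in combinations(range(n), 2):
        counts = {1: [pseudo, pseudo], 2: [pseudo, pseudo]}   # smoothed counts per phenotype
        for x, c in zip(samples, labels):
            counts[c][int(x[i] < x[j])] += 1
        f1 = [v / sum(counts[1]) for v in counts[1]]
        f2 = [v / sum(counts[2]) for v in counts[2]]
        llr[(i, j)] = [math.log(f1[r] / f2[r]) for r in (0, 1)]
    return llr

def pathway_activity(x, llr):
    """Sum of the pairwise LLRs evaluated at the observed rankings of one sample."""
    return sum(values[int(x[i] < x[j])] for (i, j), values in llr.items())

# Toy pathway with three genes; phenotype labels are 1 and 2.
samples = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [3, 1, 2]]
labels = [1, 1, 2, 2]
llr = estimate_llr(samples, labels)
activities = [pathway_activity(x, llr) for x in samples]
```

In this toy example the first gene is ranked below the second gene only in phenotype 1, so the inferred activity is positive for the phenotype 1 samples and negative for the phenotype 2 samples.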
[Figure 1 diagram: gene ranking of the pathway member genes for the samples of phenotype 1 and phenotype 2; estimation of the ranking PMFs; LLR matrix; normalization.]
Figure 1: Probabilistic inference of rank-based pathway activity. For a given pathway, we first compute the ranking of the member genes for
each individual sample in the dataset. Then we estimate the conditional probability mass function (PMF) of the gene ranking under each
phenotype. Next, we transform the gene ranking into log-likelihood ratios (LLRs) based on the estimated PMFs and normalize the LLR
matrix. Finally, the pathway activity level is inferred by aggregating the normalized LLRs of the member genes.
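$$\mathrm{AUC} = \frac{1}{K_1 K_2} \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} I(x_i > y_j), \quad (7)$$
where
$$I(x_i > y_j) = \begin{cases} 1, & \text{if } x_i > y_j, \\ 0, & \text{otherwise.} \end{cases} \quad (8)$$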
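A normalized double sum of indicator functions of this kind is the usual pairwise way of computing the area under the ROC curve (AUC). The sketch below assumes that reading; the function name and the example scores are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive
    sample receives the strictly higher score."""
    wins = sum(1 for p in pos_scores for q in neg_scores if p > q)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier outputs for the two phenotypes.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Ties contribute zero here, which follows the strict inequality in the indicator function.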
in a number of previous studies [9, 10]. For comparison, we
also evaluated the performance of the mean- and median-based schemes proposed in [6] and the original probabilistic
pathway activity inference method (we refer to this method
as the LLR method for simplicity) presented in [10]. As
explained in Materials and Methods, the discriminative
power of a pathway marker was measured based on the
absolute t-test score of the inferred pathway activity level.
Then the pathway markers were sorted according to their score, in descending order.
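The scoring and sorting step can be illustrated with a short sketch. It assumes a two-sample t statistic with unpooled variances; the exact variant used by the authors is not confirmed here, and the pathway names and activity values are hypothetical.

```python
import math
from statistics import mean, variance

def t_score(group1, group2):
    """Two-sample t statistic with unpooled variances (assumed form)."""
    se = math.sqrt(variance(group1) / len(group1)
                   + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / se

def rank_pathways(activities, labels):
    """Return pathway names sorted by absolute t-score (descending)
    together with the scores themselves."""
    scores = {}
    for name, values in activities.items():
        g1 = [v for v, c in zip(values, labels) if c == 1]
        g2 = [v for v, c in zip(values, labels) if c == 2]
        scores[name] = abs(t_score(g1, g2))
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical activity levels of two pathways across six samples.
labels = [1, 1, 1, 2, 2, 2]
activities = {
    "pathway_A": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "pathway_B": [1.0, 5.0, 9.0, 2.0, 6.0, 10.0],
}
ranking, scores = rank_pathways(activities, labels)
```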
Figure 2 shows the discriminative power of the pathway
markers on the six datasets using different activity inference
methods. On each dataset, we computed the mean absolute t-test
statistics score of the top-ranked pathways (from the top 10% up
to 100%) for each of the four pathway activity inference methods.
The x-axis corresponds to the proportion (%) of the top pathway
markers that were considered and the y-axis shows the mean
absolute t-test score for these pathway markers. As we can see
from Figure 2, the proposed method clearly improves the discriminative
power of the pathway markers on all six datasets that we
considered in this study. In order to investigate the effect of
normalization on the discriminative power of the pathway
activity inference methods, we repeated this experiment
using the USA and the Belgium datasets, where we first
normalized the raw data using three different normalization
methods (RMA, GCRMA, and MAS5) and then evaluated the
discriminative power of the pathway markers. The results are
summarized in Figure S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/2013/618461), where
we can see that the proposed rank-based scheme is not very
sensitive to the choice of the normalization method and
performs consistently well in all cases.
Next, we investigated how the top pathway markers
identified on a specific dataset perform in other independent
datasets. We first ranked the pathway markers based on their
mean absolute t-test statistics score in one of the datasets
and then estimated the discriminative power of the top-ranked
markers on a different dataset. These results are shown in
Figure 3, where the first dataset is used for ranking the
markers and the second dataset is used for assessing the
discriminative power. As we can see from Figure 3, the
pathway markers identified using the mean- and the median-based schemes do not retain their discriminative power
very well in other datasets. Both the LLR method [10] and
the proposed rank-based inference method perform well
across different datasets, with the proposed method clearly
outperforming the previous LLR method. It is interesting to
see that the discriminative power of the markers is retained
even when we consider datasets that are obtained using
different platforms. For example, the USA and Belgium datasets are
profiled on the U133a platform and The Netherlands dataset
is profiled on a custom Agilent chip, but Figure 3 shows
that pathway markers identified using the proposed method
retain their discriminative power across these datasets. As
before, we repeated these experiments after normalizing the
datasets using different normalization methods. The results
are depicted in Figure S2, where we can see that the proposed
method works very well, regardless of the normalization
method that was used. Interestingly, this is also true even
when the first dataset and the second dataset are normalized
using different methods, as shown in Figures S3 and S4.
Another interesting observation is that the rank-based
method can overcome one of the limitations of the previous
LLR method. For example, normalizing the Belgium
dataset using GCRMA makes the LLR method fail, as
some of the genes lose variability and some of the LLR values
become infinite. We can see this issue in Figures S1(d), S2(c),
S3(a), and S3(f). However, this limitation is easily overcome
by the proposed method through the use of gene ranking and
the preselection of informative gene pairs based on mutual
information.
3.2. Classification Performance of the Pathway Markers Using
the Proposed Method. Next, we evaluated the classification
performance of the proposed rank-based pathway activity
inference method. For this purpose, we performed fivefold
cross-validation experiments, following a setup similar to that used
in previous studies [8–11]. We first performed the within-dataset experiments for each of the six datasets. First, a given
dataset was randomly divided into five folds, where four folds
(training dataset) were used for constructing an LDA (linear
discriminant analysis) classifier and the remaining fold
(testing dataset) was used for evaluating its performance.
To construct the classifier, the training dataset was again
divided into three folds, where two folds (marker-evaluation
dataset) were used for evaluating the pathway markers and
the remaining fold (feature-selection dataset) was used for feature
selection. The entire training dataset was used for PDF/PMF
estimation. The overall setup is shown in Figure 4(a).
In order to build the classifier, we first evaluated the discriminative power of each pathway on the marker-evaluation
dataset. The pathways were sorted according to their absolute
t-test statistics score in descending order and the top 50
pathways were selected as potential features. Initially, we
started with an LDA-based classifier with a single feature
(i.e., the pathway marker at the top of the list) and
continued to expand the feature set by considering additional pathway markers in the list. The classifier was trained
using the marker-evaluation dataset and its performance was
assessed on the feature-selection dataset by measuring the
AUC. Pathway markers were added to the feature set only
when they increased the AUC. Finally, the performance of
the classifier with the optimal feature set was evaluated by
computing the AUC on the testing dataset. The above process
was repeated for 100 random partitions to ensure reliable
results, and we report the average AUC as the measure of
overall classification performance.
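The greedy feature selection loop described above can be sketched independently of the classifier. In this illustration `score_fn` is a hypothetical stand-in for "train on the marker-evaluation dataset and measure the AUC on the feature-selection dataset"; the lookup table of AUC values is invented purely to exercise the loop.

```python
def forward_select(candidates, score_fn):
    """Greedy forward selection over an ordered candidate list.

    Starts from the top-ranked candidate and adds each further
    candidate only if it strictly increases the score.
    """
    selected = [candidates[0]]
    best = score_fn(selected)
    for candidate in candidates[1:]:
        trial = score_fn(selected + [candidate])
        if trial > best:
            selected.append(candidate)
            best = trial
    return selected, best

# Hypothetical AUC values for a few feature subsets.
toy_auc = {
    frozenset(["p1"]): 0.70,
    frozenset(["p1", "p2"]): 0.68,
    frozenset(["p1", "p3"]): 0.75,
}
chosen, chosen_auc = forward_select(
    ["p1", "p2", "p3"], lambda feats: toy_auc[frozenset(feats)]
)
```

Passing the scoring step in as a function keeps the selection logic separate from the choice of classifier, so an LDA-based scorer could be plugged in without changing the loop.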
Figure 5 shows how the respective classifiers that use
different pathway activity inference methods perform on
different datasets. As we can see in Figure 5, among the
four inference methods, the proposed rank-based scheme
typically yields the best average performance across these
datasets. We also performed similar experiments based on
the USA and the Belgium datasets after normalizing the raw
data using different normalization methods. These results are
summarized in Figure S5. We can see from Figure S5 that
the proposed method yields the best performance on the
Figure 2: Discriminative power of pathway markers. We computed the mean absolute t-score of the top-ranked markers (top 10% to 100%) for each dataset without
any further normalization. The six panels (a)-(f) show one dataset each (USA, The Netherlands, Belgium, GSE1456, GSE15852, and GSE9574); in each panel the x-axis is the
proportion of top markers, the y-axis is the average absolute t-score, and the curves correspond to the Proposed, LLR, Mean, and Median methods.
Figure 3: Discriminative power of pathway markers across different datasets. The pathway markers have been ranked and sorted using the
first dataset, and their discriminative power has been reevaluated using the second dataset. As before, the mean absolute t-score was used for
assessing the discriminative power. The six panels (a)-(f) show one dataset pair each (USA-Belgium, USA-The Netherlands, The Netherlands-USA,
The Netherlands-Belgium, Belgium-The Netherlands, and Belgium-USA); the curves correspond to the Proposed, LLR, Mean, and Median methods.
Figure 4: Experimental setup for evaluating the classification performance. (a) The setup for the within-dataset experiment. (b) The setup
for the cross-dataset experiment. In the marker-evaluation step, the pathways are ranked by the discriminative power of their activity levels;
in the feature-selection step, features are selected using sequential forward selection to maximize the AUC.
Figure 5: Classification performance for within-dataset experiments. The bars show the classification performance (average AUC)
of different pathway activity inference methods (Mean, Median, Proposed, and LLR) evaluated on various
breast cancer datasets (GSE9574, GSE15852, GSE1456, Belgium, The Netherlands, and USA).
[Figure: classification performance (vertical axis from 0.5 to 0.75) of the Proposed, LLR, Mean, and Median methods for the dataset pairs U-N, U-B, N-U, N-B, B-U, and B-N, and for the B-U and U-B pairs under MAS5, GCRMA, and RMA normalization.]
4. Conclusions
Acknowledgment
N. Khunlertgit has been supported by a scholarship from the
Royal Thai Government.
References
[1] M. West, C. Blanchette, H. Dressman et al., "Predicting the
clinical status of human breast cancer by using gene expression
profiles," Proceedings of the National Academy of Sciences of the
United States of America, vol. 98, no. 20, pp. 11462–11467, 2001.
[2] L. J. van't Veer, H. Dai, M. J. van de Vijver et al., "Gene
expression profiling predicts clinical outcome of breast cancer,"
Nature, vol. 415, no. 6871, pp. 530–536, 2002.
[3] D. Geman, C. D'Avignon, D. Q. Naiman, and R. L. Winslow,
"Classifying gene expression profiles from pairwise mRNA
comparisons," Statistical Applications in Genetics and Molecular
Biology, vol. 3, no. 1, article 19, 2004.
[4] Y. Wang, J. G. M. Klijn, Y. Zhang et al., "Gene-expression profiles
to predict distant metastasis of lymph-node-negative primary
breast cancer," The Lancet, vol. 365, no. 9460, pp. 671–679, 2005.
[5] L. Tian, S. A. Greenberg, S. W. Kong, J. Altschuler, I. S. Kohane,
and P. J. Park, "Discovering statistically significant pathways
in expression profiling studies," Proceedings of the National
Academy of Sciences of the United States of America, vol. 102, no.
38, pp. 13544–13549, 2005.
[6] Z. Guo, T. Zhang, X. Li et al., "Towards precise classification
of cancers based on robust gene functional expression profiles,"
BMC Bioinformatics, vol. 6, article 58, 2005.
[7] C. Auffray, "Protein subnetwork markers improve prediction of
cancer outcome," Molecular Systems Biology, vol. 3, article 141,
2007.
[8] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker, "Network-based classification of breast cancer metastasis," Molecular
Systems Biology, vol. 3, article 140, 2007.
[9] E. Lee, H. Y. Chuang, J. W. Kim, T. Ideker, and D. Lee, "Inferring
pathway activity toward precise disease classification," PLoS
Computational Biology, vol. 4, no. 11, Article ID e1000217, 2008.
[10] J. Su, B. J. Yoon, and E. R. Dougherty, "Accurate and reliable
cancer classification based on probabilistic inference of pathway
activity," PLoS ONE, vol. 4, no. 12, Article ID e8161, 2009.
Review Article
An Overview of the Statistical Methods Used for Inferring Gene
Regulatory Networks and Protein-Protein Interaction Networks
Amina Noor,1 Erchin Serpedin,1 Mohamed Nounou,2 Hazem Nounou,3
Nady Mohamed,4 and Lotfi Chouchane4
1 Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843-3128, USA
2 Chemical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City,
P.O. Box 23874, Doha, Qatar
3 Electrical Engineering Department, Texas A&M University at Qatar, 253 Texas A&M Engineering Building, Education City,
P.O. Box 23874, Doha, Qatar
4 Department of Genetic Medicine, Weill Cornell Medical College in Qatar, P.O. Box 24144, Doha, Qatar
1. Introduction
The postgenomic era is marked by the availability of a deluge
of genomic data, which has enabled researchers to
look towards new dimensions for understanding the complex
biological processes governing the life of a living organism
[1–5]. The various life-sustaining functions are performed
via a collaborative effort involving DNA, RNA, and proteins.
Genes and proteins interact with themselves and each other
and orchestrate the successful completion of a multitude of
important tasks. Understanding how they work together to
form a cellular network in a living organism is extremely
important in the field of molecular biology. Two important
problems in the relatively nascent field of computational
biology are the inference of gene regulatory networks and
the inference of protein-protein interaction networks. This
paper first looks at how the genes and proteins interact with
[Figure: a gene is transcribed into RNA, which is translated into protein.]
[Figure: RNA-Seq workflow: mRNA, reverse transcription to cDNA, sequencing into short sequence reads, alignment to obtain mapped reads, and estimation of expression levels.]
$$P(X) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)), \quad (1)$$
if ( | , ) > ( | ) .
(2)
if ( | , , ) + ( | , , )
> ( | , , ) + ( | , , ) .
(3)
graph models, provide a simple and effective way of characterizing the gene interactions [19, 20]. This method relies on
assessing the conditional dependencies among genes in terms
of partial correlation coefficients among the gene expressions
and results in an undirected network. A covariance matrix
is estimated using the available gene expression data sets.
Suppose that $X \in \mathbb{R}^{N \times n}$ denotes the gene expression data
matrix, where the $N$ rows correspond to observations and
the $n$ columns correspond to genes; then an estimate of the
covariance matrix is obtained by
$$\hat{W} = \frac{1}{N-1} X^{T} X. \quad (4)$$
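To make the idea of partial correlation concrete, the following sketch computes a first-order partial correlation between two genes given a third, using the standard recursion on pairwise Pearson correlations. It is a simplified stand-in for the full precision-matrix computation used in Gaussian graphical models; the function names and the toy expression profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Partial correlation of x and y after removing the linear
    effect of a single conditioning variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy profiles (hypothetical): gene z drives both x and y.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [zi + e for zi, e in zip(z, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]
y = [zi + e for zi, e in zip(z, [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5])]
marginal = pearson(x, y)
conditional = partial_corr(x, y, z)
```

Here x and y are strongly correlated marginally, but the correlation nearly vanishes once z is conditioned on, which is the pattern that would lead to no direct edge between x and y in the undirected network.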
(5)
(6)
1
()
()
(8)
(7)
( , ) =
1
,
1 +
(9)
(10)
z () = B (z ( 1)) + V () ,
(11)
. . . (0, )
V
..
] [ 1 ] [V1 ]
]
2 ] [ 2 ]
...
.
][
. ] + [ . ],
][
..
[
] .. ] [ .. ]
.
(1, )] [ ] [V ]
(12)
(13)
(14)
The parameter estimates obtained using LASSO-based algorithms appear to be more reliable than the estimates provided
by other approaches [25].
3.2.4. State-Space Models for Time-Delayed Dependencies.
The state-space models discussed so far do not consider
time delays, whereas time-delayed interactions are known to be
present in gene networks [28] because the processes of
transcription and translation take time to complete.
One way to model this phenomenon is to
adopt the following state-space model:
$$z(t) = A z(t-1) + B u(t-\tau) + v(t),$$
$$x(t) = C z(t) + w(t). \quad (15)$$
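A scalar simulation helps show what the delayed input term does. The sketch below is only illustrative: it assumes a one-dimensional hidden state, a zero initial state, and zero input before the start of the record; the parameter names and values are hypothetical.

```python
import random

def simulate(a, b, c, inputs, delay, noise_std=0.0, seed=0):
    """Simulate z(t) = a*z(t-1) + b*u(t-delay) + v(t), x(t) = c*z(t) + w(t)
    for a scalar state, with Gaussian process and observation noise."""
    rng = random.Random(seed)
    z = 0.0  # assumed initial state
    states, observations = [], []
    for t in range(len(inputs)):
        # Inputs before the start of the record are assumed to be zero.
        u_delayed = inputs[t - delay] if t - delay >= 0 else 0.0
        z = a * z + b * u_delayed + rng.gauss(0.0, noise_std)
        states.append(z)
        observations.append(c * z + rng.gauss(0.0, noise_std))
    return states, observations

# Noise-free response to a unit impulse at t = 0 with a delay of two steps.
states, observations = simulate(
    a=0.5, b=1.0, c=2.0, inputs=[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], delay=2
)
```

The impulse has no effect on the hidden state until two time steps later, after which the state decays geometrically with rate `a`.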
$$I(X;Y) = \sum_{x,y} p(x,y) \log\left[\frac{p(x,y)}{p(x)\,p(y)}\right] = H(X) + H(Y) - H(X,Y), \quad (16)$$
(17)
(18)
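$$I(X;Y \mid Z) = \sum_{x,y,z} p(x,y,z) \log\left[\frac{p(x,y \mid z)}{p(x \mid z)\,p(y \mid z)}\right] = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z). \quad (19)$$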
( ; ) = [ ( , + ) log
=1
( , ) = max { ( , () )}
(+1 , )
].
(+1 ) ( )
(20)
(21)
In other words, this coefficient provides a measure of independence or difference between two genes. The DPI also
holds true for the φ-mixing metric, and therefore, it can be
used to identify the indirect interactions as in the case of
mutual information.
3.3.4. Time-Delayed Dependencies. Another way of finding
directed relationships is to detect time-delayed dependencies using time series data. The time instants at which
the mutual information goes above or drops below the
thresholds τ_up and τ_down, respectively, are noted [35]. These
instants are called the initial change of expression (IcE) times
and are defined as
IcE ( ) = arg min {
or
down } .
up
0
0
( ) (+ )
],
(23)
( , + )
(22)
(24)
1
(x ) ,
4.2. Bayesian Networks. Another way of modeling PPI networks is by means of Bayesian networks (BNs) [39], a
probabilistic graphical modeling technique. The
inference algorithm is based on finding the conditional
probability densities that relate the class
variable to each node in the network. A
( | )
.
( )
(27)
(25)
where $Z$ is the normalizing constant, also called the partition function. In this way, a compact representation of the
probability distribution is obtained. The network structure
is learned from the available PPI data by using the independence
properties of Markov networks. The details of this
method can be found in [37].
(26)
(1 | 1 , . . . , )
(0 | 1 , . . . , )
( | 1 )
=1 ( | , 1 )
= log
,
( | 0 )
=1 ( | , 0 )
(28)
(, 1 , . . . , | = 1) ( = 1)
(, 1 , . . . , )
(1 , . . . , | , = 1) ( | = 1) ( = 1)
.
(, 1 , . . . , )
(29)
These probability densities can be calculated using maximum
likelihood methods. By comparing the obtained score to
a predetermined threshold, some of the subgraphs can be
labeled as complexes. This algorithm takes the weighted
matrix of PPI data as input, where the weights are assigned
using the likelihood of any particular interaction. Several
other graphical-clustering-based methods are surveyed in
[12].
4.4. Matrix Factorization Methods for Clustering. Nonnegative matrix factorization (NMF) is a method widely used in
clustering problems. Its application was recently
proposed in [40], where an ensemble of nonnegative
factored matrices obtained from protein-protein interaction
data is combined to perform soft clustering. The
importance of this step lies in the fact that a particular object
may belong to multiple classes. Hence, the various algorithms
reported in the literature that perform hard clustering may not
be of much benefit in such scenarios. This ensemble NMF
method is observed to classify the proteins in accordance
with the functions they perform and also to identify the multiple
groups they belong to.
The algorithm produces base clusterings by factorizing
the symmetric data matrix of protein interactions in the
following manner [40]:
$$\min_{V > 0} \left\| S - V V^{T} \right\|_{F}^{2}, \quad (30)$$
(V V ) (V V )
1
+ 1) .
(
2 V V )2 V V
2
(31)
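A small numerical sketch can illustrate the symmetric factorization of an interaction matrix into a nonnegative membership matrix. It uses a commonly cited damped multiplicative update for symmetric NMF, which is an assumption swapped in for illustration and is not necessarily the update rule of the cited method. All names and the toy matrix are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sq_error(s, v):
    """Squared Frobenius error between S and V V^T."""
    approx = matmul(v, transpose(v))
    return sum((s[i][j] - approx[i][j]) ** 2
               for i in range(len(s)) for j in range(len(s)))

def symmetric_nmf(s, k, iters=300, seed=0, eps=1e-12):
    """Approximate S by V V^T with V >= 0 using the damped update
    V <- V * (1/2 + 1/2 * (S V) / (V V^T V)), applied elementwise."""
    rng = random.Random(seed)
    n = len(s)
    v = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    for _ in range(iters):
        sv = matmul(s, v)
        vvtv = matmul(matmul(v, transpose(v)), v)
        v = [[v[i][c] * (0.5 + 0.5 * sv[i][c] / (vvtv[i][c] + eps))
              for c in range(k)] for i in range(n)]
    return v

# Toy interaction matrix (hypothetical) with two groups of two proteins.
S = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
v_init = symmetric_nmf(S, k=2, iters=0)  # same seed, no updates
v_final = symmetric_nmf(S, k=2)
err_init, err_final = sq_error(S, v_init), sq_error(S, v_final)
```

Each row of the resulting matrix can be read as soft membership weights of one protein in the k clusters, so a protein with two sizeable entries would belong to two groups.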
[Figure: network diagram with protein nodes P1 to P4, transcription factor nodes TF1 to TF3, and gene nodes G1 to G4.]
5.1. Probabilistic Graphical Models for Joint Inference. Reference [41] proposed an interesting method for estimating
GRNs and PPI networks simultaneously. Suppose that the
gene expression data is denoted by x and the PPI data is represented by
y. The algorithm provides an undirected protein network $G_P$
and a directed gene network $G_G$, modeled using Markov and
Bayesian networks, respectively, by maximizing their joint
distribution; that is,
$$P(G_G, G_P \mid x, y) \propto P(x, y, G_G, G_P) = P(x \mid G_G)\, P(y \mid G_P)\, P(G_G, G_P), \quad (32)$$
where $P(x \mid G_G, G_P) = P(x \mid G_G)$ and $P(y \mid G_G, G_P) = P(y \mid G_P)$.
The inference on Markov and Bayesian networks
is performed in the same manner as explained in the previous
sections. The two subnetworks are estimated iteratively until the
algorithm converges. Further details on this algorithm can be
found in [41].
5.2. Joint Estimation Using State-Space Model. A state-space
model can also be used to obtain an integrated network
of gene and protein-protein interactions [42, 43]. A novel
approach employing a nonlinear model is proposed in [43],
where the system parameters are estimated using constrained
least squares. The gene expression is assumed to follow a
dynamic model given by
( + 1) = () + () () + + () , (33)
=1
where
() = ( ()) =
1 + exp { ( () ) / }
(34)
( + 1) = () + () ()
=1
+ () () + + () ,
(35)
= [1 ] ,
= [1 ]
(36)
6. Performance Evaluation
The inference accuracy can be assessed using the knowledge
of a gold-standard network, that is, the true network. In order
to benchmark the algorithms, the correctly identified edges,
or true positives (TPs), need to be counted. In addition,
the false positives (FPs), the edges incorrectly
indicated to be present, and the false negatives (FNs), the
true edges that were missed, should also be counted [10]. With these
values in hand, the true positive rate or recall, TPR =
TP/(TP + FN), the false positive rate, FPR = FP/(FP + TN),
and the positive predictive value or precision, PPV = TP/(TP + FP),
can be calculated, where TN denotes the number of true negatives
(absent edges correctly left out). These quantities
enable us to view the performance graphically through the
ROC curve, which plots the TPR against the FPR, and to summarize
it by the area under that curve. These
criteria are the most widely used fidelity criteria for gene
network inference algorithms.
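These rates are straightforward to compute once the inferred and gold-standard edges are available. The sketch below assumes an undirected network without self-loops, so the number of possible edges is n(n-1)/2; the function name and the example edge lists are hypothetical.

```python
def edge_metrics(true_edges, predicted_edges, n_nodes):
    """TPR, FPR, and PPV for an undirected network given as node pairs."""
    true_set = {frozenset(edge) for edge in true_edges}
    pred_set = {frozenset(edge) for edge in predicted_edges}
    possible = n_nodes * (n_nodes - 1) // 2
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)
    tn = possible - tp - fp - fn
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "PPV": tp / (tp + fp),
    }

# Hypothetical four-node example: two of three true edges are recovered.
metrics = edge_metrics(
    true_edges=[(0, 1), (1, 2), (2, 3)],
    predicted_edges=[(1, 0), (1, 2), (0, 3)],
    n_nodes=4,
)
```

The sketch does not guard against empty edge sets, which would make a denominator zero; sweeping a confidence threshold over scored edges and recomputing these rates would trace out the ROC curve.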
While it is possible to identify the gene regulatory
relationships experimentally, doing so would be both technically prohibitive and very costly. For this
indicates the strength of the inhibitory and excitatory regulations. As the cellular networks are known to be sparse,
employing sparsity-constrained least squares for parameter
estimation as proposed in [25] is expected to result in more
robust inference algorithms.
Recent years have shown tremendous and rapid progress
in the field of cellular network modeling. With the amount
and types of data sets increasing, algorithms that combine
multiple data sets will be necessary in the future.
Acknowledgments
This paper was made possible by QNRF-NPRP Grant no. 09-874-3-235 and support from NSF Grant no. 0915444. The
statements made herein are solely the responsibility of the
authors.
References
[1] X. Zhou and S. T. C. Wong, Computational Systems Bioinformatics, World Scientific, 2008.
[2] Y. Huang, I. M. Tienda-Luna, and Y. Wang, "Reverse engineering gene regulatory networks: a survey of statistical models,"
IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 76–97, 2009.
[3] X. Zhou, X. Wang, and E. R. Dougherty, Genomic Networks:
Statistical Inference from Microarray Data, John Wiley & Sons,
2006.
[4] H. Kitano, "Computational systems biology," Nature, vol. 420,
no. 6912, pp. 206–210, 2002.
[5] B. Mallick, D. Gold, and V. Baladandayuthapani, Bayesian
Analysis of Gene Expression Data, Wiley, 2009.
[6] H. D. Jong, "Modeling and simulation of genetic regulatory
systems: a literature review," Journal of Computational Biology,
vol. 9, no. 1, pp. 67–103, 2002.
[7] X. Cai and X. Wang, "Stochastic modeling and simulation of
gene networks," IEEE Signal Processing Magazine, vol. 24, no. 1,
pp. 27–36, 2007.
[8] H. Hache, H. Lehrach, and R. Herwig, "Reverse engineering of
gene regulatory networks: a comparative study," Eurasip Journal
on Bioinformatics and Systems Biology, vol. 2009, Article ID
617281, 2009.
[9] F. Markowetz and R. Spang, "Inferring cellular networks – a
review," BMC Bioinformatics, vol. 8, article S5, 2007.
[10] C. A. Penfold and D. L. Wild, "How to infer gene networks from
expression profiles, revisited," Interface Focus, vol. 3, pp. 857–870, 2011.
[11] J. Wang, M. Li, Y. Deng, and Y. Pan, "Recent advances in
clustering methods for protein interaction networks," BMC
Genomics, vol. 11, no. supplement 3, article S10, 2010.
[12] X. Li, M. Wu, C. K. Kwoh, and S. K. Ng, "Computational
approaches for detecting protein complexes from protein interaction networks: a survey," BMC Genomics, vol. 11, no. 1, article
S3, 2010.
[13] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B.
Wold, "Mapping and quantifying mammalian transcriptomes
by RNA-Seq," Nature Methods, vol. 5, no. 7, pp. 621–628, 2008.
[14] K. Y. Yip, R. P. Alexander, K. K. Yan, and M. Gerstein, "Improved
reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data," PLoS ONE, vol. 5, no. 1,
Article ID e8121, 2010.
[30] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring connectivity of genetic regulatory networks using information-theoretic criteria," IEEE/ACM Transactions on Computational
Biology and Bioinformatics, vol. 5, no. 2, pp. 262–274, 2008.
[31] A. Noor, E. Serpedin, M. N. Nounou, H. N. Nounou, N.
Mohamed, and L. Chouchane, "Information theoretic methods
for modeling of gene regulatory networks," in IEEE Symposium
on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB '12), pp. 418–423, 2012.
[32] T. Cover and J. Thomas, Elements of Information Theory, Wiley
Interscience, 2006.
[33] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring gene
regulatory networks from time series data using the minimum
description length principle," Bioinformatics, vol. 22, no. 17, pp.
2129–2135, 2006.
[34] M. Vidyasagar, "Probabilistic methods in cancer biology," Childhood, vol. 20, pp. 82–89, 2011.
[35] P. Zoppoli, S. Morganella, and M. Ceccarelli, "TimeDelay-ARACNE: reverse engineering of gene networks from time-course data by an information theoretic approach," BMC Bioinformatics, vol. 11, no. 1, article 154, 2010.
[36] J. Dougherty, I. Tabus, and J. Astola, "Inference of gene regulatory networks based on a universal minimum description
length," Eurasip Journal on Bioinformatics and Systems Biology,
vol. 2008, Article ID 482090, 2008.
[37] A. Jaimovich, G. Elidan, H. Margalit, and N. Friedman,
"Towards an integrated protein-protein interaction network: a
relational Markov network approach," Journal of Computational
Biology, vol. 13, no. 2, pp. 145–164, 2006.
[38] Y. Qi, F. Balem, C. Faloutsos, J. Klein-Seetharaman, and Z. Bar-Joseph, "Protein complex identification by supervised graph
local clustering," Bioinformatics, vol. 24, no. 13, pp. i250–i268,
2008.
[39] J. R. Bradford, C. J. Needham, A. J. Bulpitt, and D. R. Westhead,
"Insights into protein-protein interfaces using a Bayesian network prediction method," Journal of Molecular Biology, vol. 362,
no. 2, pp. 365–386, 2006.
[40] D. Greene, G. Cagney, N. Krogan, and P. Cunningham, "Ensemble non-negative matrix factorization methods for clustering
protein-protein interactions," Bioinformatics, vol. 24, no. 15, pp.
1722–1728, 2008.
[41] N. Nariai, Y. Tamada, S. Imoto, and S. Miyano, "Estimating
gene regulatory networks and protein-protein interactions of
Saccharomyces cerevisiae from multiple genome-wide data,"
Bioinformatics, vol. 21, no. supplement 2, pp. ii206–ii212, 2005.
[42] C. W. Li and B. S. Chen, "Identifying functional mechanisms of
gene and protein regulatory networks in response to a broader
range of environmental stresses," Comparative and Functional
Genomics, vol. 2010, Article ID 408705, 2010.
[43] Y. C. Wang and B. S. Chen, "Integrated cellular network
of transcription regulations and protein-protein interactions,"
BMC Systems Biology, vol. 4, no. 1, article 20, 2010.
[44] http://wiki.c2b2.columbia.edu/dream.
[45] I. Cantone, L. Marucci, F. Iorio et al., "A yeast synthetic network
for in vivo assessment of reverse-engineering and modeling
approaches," Cell, vol. 137, no. 1, pp. 172–181, 2009.
[46] R. Schweiger, M. Linial, and N. Linial, "Generative probabilistic
models for protein-protein interaction networks-the biclique
perspective," Bioinformatics, vol. 27, no. 13, pp. i142–i148, 2011.
[47] X. Zhou, X. Wang, and E. R. Dougherty, "Construction of
genomic networks using mutual-information clustering and
reversible-jump Markov-chain-Monte-Carlo predictor design,"
Signal Processing, vol. 83, no. 4, pp. 745–761, 2003.
Research Article
Using Protein Clusters from Whole Proteomes to Construct and
Augment a Dendrogram
Yunyun Zhou,1 Douglas R. Call,1,2 and Shira L. Broschat1,2,3
1
School of Electrical Engineering and Computer Science, Washington State University, P.O. Box 642752, Pullman,
WA 99164-2752, USA
2
Paul G. Allen School for Global Animal Health, Washington State University, P.O. Box 642752, Pullman,
WA 99164-2752, USA
3
Department of Veterinary Microbiology and Pathology, Washington State University, P.O. Box 642752, Pullman,
WA 99164-2752, USA
Correspondence should be addressed to Shira L. Broschat; shira@eecs.wsu.edu
Received 19 November 2012; Revised 3 January 2013; Accepted 13 January 2013
Academic Editor: Yves Van de Peer
Copyright © 2013 Yunyun Zhou et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In this paper we present a new ab initio approach for constructing an unrooted dendrogram using protein clusters, an approach that
has the potential for estimating relationships among several thousands of species based on their putative proteomes. We employ an
open-source software program called pClust that was developed for use in metagenomic studies. Sequence alignment is performed
by pClust using the Smith-Waterman algorithm, which is known to give optimal alignment and, hence, greater accuracy than
BLAST-based methods. Protein clusters generated by pClust are used to create protein profiles for each species in the dendrogram,
these profiles forming a correlation filter library for use with a new taxon. To augment the dendrogram with a new taxon, a protein
profile for the taxon is created using BLASTp, and this new taxon is placed into a position within the dendrogram corresponding to
the highest correlation with profiles in the correlation filter library. This work was initiated because of our interest in plasmids, and
each step is illustrated using proteomes from Gram-negative bacterial plasmids. Proteomes for 527 plasmids were used to generate
the dendrogram, and to demonstrate the utility of the insertion algorithm twelve recently sequenced pAKD plasmids were used to
augment the dendrogram.
1. Introduction
The availability of complete proteomes for hundreds of thousands of species provides an unprecedented opportunity to
study genetic relationships among a large number of species.
However, the necessary software tools for handling massive
amounts of data must first be developed before we can
exploit the availability of these proteomes. Currently the
tools used for clustering either are restricted in terms of
the number of proteomes that can be examined because of
the time required to obtain results or else are restricted in
terms of their sensitivity. For example, clustering by means
of hidden Markov models (HMMs), multiple sequence alignment, and pairwise sequence alignment by means of the
Smith-Waterman alignment algorithm are limited by their
[Figure: flowchart for tree construction. Proteins (P1 to P7) are grouped by pClust into protein clusters (C1 to C7), a binary protein profile over the clusters is built for each entry PM1 to PM6, and a distance metric on the profiles is used for tree construction.]
Figure 2: Flowchart for insertion of a new taxon into an existing tree using a correlation filter library. A new binary profile over the protein clusters (C1 to C7) is constructed
using BLASTp and compared with the stored profiles (PM1 to PM6).
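The insertion step can be sketched as a lookup for the stored profile that correlates most strongly with the new profile. This illustration assumes Pearson correlation between binary cluster-membership profiles; the exact correlation measure, the function names, and the library contents are hypothetical.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length binary profiles.
    Returns 0.0 when either profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def best_match(new_profile, library):
    """Return the library entry with the highest correlation to the
    new profile, along with all correlation scores."""
    scores = {name: correlation(new_profile, profile)
              for name, profile in library.items()}
    return max(scores, key=scores.get), scores

# Hypothetical filter library over seven protein clusters (C1 to C7).
library = {
    "PM1": [1, 0, 1, 0, 0, 0, 0],
    "PM2": [0, 0, 1, 1, 0, 0, 0],
    "PM3": [1, 0, 0, 0, 0, 1, 1],
}
new_profile = [1, 0, 0, 0, 0, 1, 1]
nearest, scores = best_match(new_profile, library)
```

The new taxon would then be attached next to the best-matching entry in the tree, which avoids recomputing the full clustering for every insertion.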
( + )
,
( + + )
(1)
NC_006856, Salmonella choleraesuis, pSC138
NC_010119, Salmonella choleraesuis, pOU7519
NC_011092, Salmonella enterica, pCVM19633 110
NC_011964, Escherichia coli, pAPEC-O103-ColBM
NC_006143, Aeromonas punctata, pFBAOT6
NC_007100, Pseudomonas aeruginosa, Rms149
NC_008613, Photobacterium piscicida, pP91278
NC_008612, Photobacterium piscicida, pP99018
NC_009139, Yersinia ruckeri, pYR1
NC_009141, pIP1202
NC_012885, Aeromonas hydrophila, pRA1
NC_012690, Escherichia coli, peH4H
NC_009140, Salmonella newport, pSN254
NC_012692, Escherichia coli, pAR060302
NC_012693, Salmonella enterica, pAM04528
NC_009349, Aeromonas salmonicida, pAsa4
NC_012886, Escherichia coli, pRAx
NC_012555, Enterobacter cloacae, pEC-IMP
NC_012556, Enterobacter cloacae, pEC-IMPQ
NC_010870, Klebsiella pneumoniae, pK29
NC_005211, Serratia marcescens, R478
NC_009838, Escherichia coli, pAPEC-O1-R
NC_009981, Salmonella choleraesuis, pMAK1
NC_013365, Escherichia coli, pO111 1
NC_003384, Salmonella typhi, pHCM1
NC_002305, Salmonella typhi, R27
NC_005249, Klebsiella pneumoniae, pLVPK
NC_006625, Klebsiella pneumoniae, NTUH-K2044
NC_014107, Enterobacter cloacae, pECL A
NC_012193, Borrelia burgdorferi, 72a lp54
NC_012194, Borrelia burgdorferi, CA-11.2a lp54
NC_012244, Borrelia burgdorferi, 94a lp54
NC_012199, Borrelia burgdorferi, 64b lp54
NC_012175, Borrelia burgdorferi, WI91-23 lp54
NC_012505, Borrelia burgdorferi, 29805 lp54
NC_012504, Borrelia burgdorferi, Bol26 lp54
NC_001857, Borrelia burgdorferi, B31 lp54
NC_013129, Borrelia burgdorferi, JD1 lp54
NC_013130, Borrelia burgdorferi, N40 lp54
NC_012202, Borrelia burgdorferi, CA11.2alp36-28-4
NC_001855, Borrelia burgdorferi, B31 lp36
NC_012184, Borrelia burgdorferi, 64b lp36
NC_008565, Borrelia afzelii, PKol p60-2
NC_012167, Borrelia burgdorferi, WI91-23 lp38
NC_012182, Borrelia burgdorferi, 64b lp38
NC_011857, Borrelia garinii, PBr lp36
NC_011867, Borrelia garinii, Far04 lp36
NC_011856, Borrelia garinii, PBr lp25
NC_011860, Borrelia garinii, PBr lp28-4
NC_012166, Borrelia valaisiana, VS116 lp25
described in the previous section to obtain a new dendrogram. The amount of computation and time required to
accomplish this task, however, is excessive considering the
incremental gain that may be achieved. For example, the
original execution time for the 527-plasmid tree was 72 hours
on an Intel Xeon CPU E5420 machine with 32 GB of memory.
Instead it is preferable to have a means of inserting new
[Figure: dendrogram of 50 plasmids from Enterobacter, Klebsiella, Serratia, Escherichia, Salmonella, Photobacterium, Yersinia, Aeromonas, Pseudomonas, and Borrelia hosts. Each leaf is labeled with an accession number, host species, and plasmid name.]
[Figure 5: dendrogram branch containing the twelve inserted pAKD plasmids (accessions JN106164 to JN106175; correlation coefficients >0.7) and the neighboring plasmids (correlation coefficients >0.5) from Ralstonia, Achromobacter, Acidovorax, Comamonas, Pseudomonas, Enterobacter, Delftia, Burkholderia, Photobacterium, Yersinia, Aromatoleum, and Candidatus hosts.]
Borrelia plasmids [26, 27]. The Jaccard distance metric is
commonly used for a binary matrix. Nevertheless, the results
based on Euclidean distance compare favorably with those
obtained for a nonbinary intensity matrix using a different
approach [21]. It is not clear which distance method gives
more accurate results, so users should try both metrics and
decide which one is more accurate
on the basis of the biology of the system.
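To make the comparison of the two metrics concrete, the following is a minimal sketch of how Jaccard and Euclidean distances could be computed for binary profiles. The function names and the toy profiles are illustrative assumptions, not part of the original method, and the convention that two all-zero profiles have Jaccard distance 0 is likewise an assumption.

```python
import math

def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # clusters present in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # clusters present in at least one
    # Assumption: two empty profiles are treated as identical (distance 0).
    return 0.0 if either == 0 else 1.0 - both / either

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(profiles, metric):
    """Pairwise distances for a dict mapping name -> profile."""
    names = sorted(profiles)
    return {(p, q): metric(profiles[p], profiles[q]) for p in names for q in names}

# Hypothetical toy profiles: rows are plasmids, columns are protein clusters.
toy = {"PM1": [1, 0, 1], "PM2": [1, 1, 0], "PM3": [1, 0, 1]}
jac = distance_matrix(toy, jaccard_distance)
euc = distance_matrix(toy, euclidean_distance)
```

Either matrix can then be passed to the tree-building step, which makes it straightforward to compare the two resulting dendrograms against known biology.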
3.2. Insertion of New Plasmids. We applied our correlation
filter algorithm to twelve new plasmids from the pAKD
family [20]. The twelve plasmids cluster together and are most
closely grouped with genera typical of other soil bacteria.
The correlation coefficient values among the pAKD plasmids
were >0.7 and decreased relative to the other plasmids with
distance to >0.5 (Figure 5). pAKD plasmids 16, 25, and 34
belong to the IncP-1() compatibility group and form a
discrete cluster; pAKD plasmids 1, 14, 15, 17, 18, 29, 31, and
33 cluster as the IncP-1() compatibility group. Although
pAKD26 falls into the IncP-1() clade, it should be in the
IncP-1() group if compatibility grouping is considered the
gold standard for comparison. Nevertheless, the placement
is distal from the eight other plasmids in the group, and
pAKD26 was actually designated as IncP-1-2 to differentiate
it from the other eight plasmids as recently described in [28].
Our results are consistent with [20].
Importantly, the correlation coefficient is used to check
the final dendrogram; that is, a new plasmid should be
located near the plasmid with which it is most highly
correlated. In addition, the correlation coefficient is used to
determine whether a plasmid should even be inserted into
a dendrogram. In other words, how does the magnitude of
the correlation coefficient influence our confidence in the
placement of a new plasmid within an existing dendrogram?
Several works offer guidelines for the interpretation of a
correlation coefficient [29, 30], but all criteria are in some
way arbitrary, and ultimately interpretation of a correlation
coefficient depends on the purpose. In our case, we chose
a value of 0.5, but we also require biological evidence, for
example, that a plasmid is, in fact, from a GN bacterium.
To further examine the correlation coefficient, we randomly selected 10 Gram-positive bacterial plasmid proteomes
from 10 different genera. The correlation coefficients were
found to range from 0.112 to 0.234. GP bacterial plasmids do
not belong in our GN bacterial plasmid dendrogram, and our
minimum correlation value of 0.5 suffices to exclude these
unrelated plasmids. While this level of discrimination is easy
to identify, we should note that the 527 GN bacterial plasmids
considered in this study do not represent the full diversity of
GN plasmids. Thus, it is possible to obtain a small correlation
coefficient value for a completely new and uncharacterized
GN plasmid. If the new plasmid is able to meet an underlying
correlation threshold, it can be placed within the dendrogram
structure, and by incorporating the new plasmid sequence
information into the correlation filter library, we can group
future plasmids that may be closely related to it.
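The insertion rule discussed above (place a new plasmid next to its most highly correlated library profile, and only if the correlation reaches 0.5) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the library is assumed to be a dict of binary cluster-membership profiles, the Pearson coefficient is assumed as the correlation measure, and the names `pearson`, `best_match`, and the toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Assumption: a constant profile carries no information, so report 0.
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def best_match(library, new_profile, threshold=0.5):
    """Return (name, r) for the most correlated library profile.

    The name is None when even the best correlation is below the threshold,
    meaning the new plasmid should not be inserted into the dendrogram.
    """
    name, r = max(
        ((k, pearson(v, new_profile)) for k, v in library.items()),
        key=lambda item: item[1],
    )
    return (name, r) if r >= threshold else (None, r)

# Hypothetical library: plasmid name -> 0/1 profile over seven protein clusters.
library = {
    "PM1": [1, 0, 1, 0, 0, 1, 0],
    "PM2": [0, 1, 0, 1, 1, 0, 0],
    "PM3": [1, 0, 0, 0, 0, 1, 1],
}
accepted = best_match(library, [1, 0, 0, 0, 0, 1, 1])   # identical to PM3
rejected = best_match(library, [0, 0, 1, 1, 0, 0, 1])   # weakly related to all
```

After a plasmid is accepted, its profile would be added to `library` so that later plasmids related to it can be matched, as described in the text.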
While the method of inserting new plasmids into an existing tree is fast and efficient, at some point, generation of a new
dendrogram using all proteins from all the taxa will probably
4. Conclusion
In this work we present a new ab initio method for constructing a dendrogram from whole proteomes that begins
with output from pClust, a software program developed
for homology detection for large-scale protein sequence
analyses. We develop an efficient approach for insertion of
a new species into the dendrogram based on the use of a
correlation filter library. This is much more efficient than
constructing an entirely new tree, which is computationally
costly. We illustrate our method by creating a dendrogram for
527 Gram-negative bacterial plasmids and augmenting this
dendrogram with twelve pAKD plasmids isolated from Norwegian soil. For purposes of comparison, we also construct
a smaller dendrogram consisting of 50 species and use two
different distance metrics. The two resulting trees agree well
with results shown in [21]. The classification results for the
twelve plasmids agree with a phylogenetic tree constructed
using multiple sequence alignment of the relaxase gene traI
presented in [20].
Authors' Contribution
Y. Zhou and S. L. Broschat performed the research for this
paper, and all three authors shared in the preparation of the
paper.
Conflict of Interests
This work was not influenced by any commercial agency, and
no conflict of interests exists.
Acknowledgments
The authors are grateful to the Carl M. Hansen Foundation
for partial support of Y. Zhou and the Washington State
Agricultural Research Center and College of Veterinary
Medicine Agricultural Animal Health program for support
of D. R. Call.
References
[1] T. F. Smith and M. S. Waterman, "Identification of common
molecular subsequences," Journal of Molecular Biology, vol. 147,
no. 1, pp. 195-197, 1981.
[2] S. F. Altschul, T. L. Madden, A. A. Schaffer et al., "Gapped
BLAST and PSI-BLAST: a new generation of protein database
search programs," Nucleic Acids Research, vol. 25, no. 17, pp.
3389-3402, 1997.
[21] Y. Zhou, D. R. Call, and S. L. Broschat, "Genetic relationships
among 527 Gram-negative bacterial plasmids," Plasmid, vol. 68,
no. 2, pp. 133-141, 2012.
[22] D. R. Call, R. S. Singer, D. Meng et al., "blaCMY-2-positive
IncA/C plasmids from Escherichia coli and Salmonella enterica
are a distinct component of a larger lineage of plasmids,"
Antimicrobial Agents and Chemotherapy, vol. 54, no. 2, pp. 590-596, 2010.
[23] J. L. Rodgers and W. A. Nicewander, "Thirteen ways to look at
the correlation coefficient," The American Statistician, vol. 42,
pp. 59-66, 1988.
[24] M. S. Stigler, "Francis Galton's account of the invention of correlation," Statistical Science, vol. 4, pp. 73-79, 1989.
[25] K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, and
S. Kumar, "MEGA5: molecular evolutionary genetics analysis
using maximum likelihood, evolutionary distance, and maximum parsimony methods," Molecular Biology and Evolution,
vol. 28, no. 10, pp. 2731-2739, 2011.
[26] M. Lescot, S. Audic, C. Robert et al., "The genome of Borrelia
recurrentis, the agent of deadly louse-borne relapsing fever, is a
degraded subset of tick-borne Borrelia duttonii," PLoS Genetics,
vol. 4, no. 9, Article ID e1000185, 2008.
[27] J. E. Purser and S. J. Norris, "Correlation between plasmid
content and infectivity in Borrelia burgdorferi," Proceedings of
the National Academy of Sciences of the United States of America,
vol. 97, no. 25, pp. 13865-13870, 2000.
[28] P. Norberg, M. Bergstrom, V. Jethava, D. Dubhashi, and M.
Hermansson, "The IncP-1 plasmid backbone adapts to different
host bacterial species and evolves through homologous recombination," Nature Communications, vol. 2, article 268, 2011.
[29] A. Buda and A. Jarynowski, "Life-time of correlations and its
applications," Wydawnictwo Niezalezne, vol. 1, pp. 5-21, 2010.
[30] J. Cohen, Statistical Power Analysis for the Behavioral Sciences,
Lawrence Erlbaum Associates, Hillsdale, NJ, USA, 2nd edition,
1988.
Research Article
Solving the 0/1 Knapsack Problem by a Biomolecular DNA
Computer
Hassan Taghipour,1 Mahdi Rezaei,2 and Heydar Ali Esmaili1
1. Introduction
DNA encodes the genetic information of cellular organisms.
The unique and specific structure of DNA makes it one of the
favorite candidates for computing purposes. In comparison
with conventional silicon-based computers, DNA computers
have massive degrees of miniaturization and parallelism.
With recent technology, about 10^18 DNA molecules can be
produced and placed in a medium-sized laboratory test tube.
Each of these DNA molecules could act as a small processor. Biological operations such as hybridization, separation,
setting, and clearing can be performed simultaneously on all
of these DNA strands. Thus, in an in vitro assay, we could
handle about 10^18 DNA molecules, or we can say that 10^18 data
processors can be executed in parallel.
In 1994, Adleman introduced DNA computing as a
new method of parallel computing [1]. Adleman succeeded
in solving a seven-point Hamiltonian path problem solely by
manipulating DNA molecules and suggested that DNA could
be used to solve complex mathematical problems.
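[Figure: complementary base pairing between antiparallel DNA strands (5' and 3' ends marked); T G C A pairs with A C G T, and T T C C G pairs with A A G G C.]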
= 1 + 2 + 3 + + = ,
=1
= 1 + 2 + 3 + + =
(1)
=1
(1) Input (0 ), where 0 contains 2 or more memory strands with at least ( + + ) bit regions.
(2) For = 1 to , where is the total number of items
(a) Divide (0 , 1 , 2 )
(b) Set (1 , )
(c) Combine (0 , 1 , 2 )
End for
Procedure 1
For = 1 to
{
Separate (0 , ) (+ , )
For = 1 to
1
Set (+ , + =1 + )
For = 1 to
1
Set (+ , + =1 + =1 + )
Combine (0 , + , )
}
Procedure 2
(1) For = to + 1
For = down to
Separate ( , + 1) ((+1) , )
Combine (+1 , +1 , (+1) )
(2) The capacity of knapsack is ,
Discard tubes +1 , +2 , +3 , . . . ,
Combine (0 , 0 , 1 , 2 , . . . , )
(3) For = + to + + 1
For = down to +
Separate ( , + 1) ((+1) , )
Combine (+1 , +1 , (+1) )
(4) Read ; else if it was empty then:
Read 1 ; else if it was empty then:
Read 2 ; else if it was empty then:
..
.
Read 2 ; else if it was empty then:
Read 1 ;
Algorithm 1
4. Conclusion
In this paper, sticker-based DNA computing was used
for solving the 0/1 knapsack problem. This method could
be used for solving other NP-complete problems. There are
four principal operations in the sticker model: Combination,
Separation, Setting, and Clearing. We also defined a new
operation called divide and applied it in the construction of the
solution space.
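As an informal illustration of these operations, the sketch below simulates a test tube as a Python list of bit tuples (one tuple per memory strand) and builds the solution space by repeated divide, set, and combine steps. Everything here is an assumption made for illustration: the function names, the list-of-tuples representation, and the modeling of divide as giving both halves a copy of every strand type (which presumes many copies of each strand in the tube). The final readout uses ordinary arithmetic as a shortcut and is not the sticker-based sorting procedure of the paper.

```python
from itertools import chain

def combine(*tubes):
    """Pour several tubes into one."""
    return list(chain.from_iterable(tubes))

def separate(tube, bit):
    """Split a tube into strands with the given bit on and strands with it off."""
    on = [s for s in tube if s[bit]]
    off = [s for s in tube if not s[bit]]
    return on, off

def set_bit(tube, bit):
    """Turn the given bit on in every strand of the tube."""
    return [s[:bit] + (1,) + s[bit + 1:] for s in tube]

def clear_bit(tube, bit):
    """Turn the given bit off in every strand of the tube."""
    return [s[:bit] + (0,) + s[bit + 1:] for s in tube]

def divide(tube):
    """Assumed model of divide: both halves hold every strand type."""
    return list(tube), list(tube)

def solution_space(n):
    """Generate all 2**n item selections by divide, set, and combine."""
    t0 = [(0,) * n]
    for i in range(n):
        t1, t2 = divide(t0)
        t1 = set_bit(t1, i)      # item i selected in one half only
        t0 = combine(t1, t2)
    return t0

def best_knapsack(weights, values, capacity):
    """Shortcut readout (plain arithmetic, not a DNA operation)."""
    total = lambda nums, s: sum(x for x, b in zip(nums, s) if b)
    feasible = [s for s in solution_space(len(weights))
                if total(weights, s) <= capacity]
    return max(feasible, key=lambda s: total(values, s))

# Hypothetical instance: three items, knapsack capacity 5.
best = best_knapsack([2, 3, 4], [3, 4, 5], 5)
```

In this toy instance the selection `(1, 1, 0)` (the first two items) fills the capacity exactly with the largest total value.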
As mentioned earlier, one of the important properties of
DNA computing is its real massive parallelism, which makes
it a favorite and powerful tool for solving NP-complete and
hard combinatorial problems. In the sticker model, as in other
DNA-based computation methods, the ability of DNA
molecules to form duplexes is used as the main biological
operation. The main difference between the sticker model and
the Adleman-Lipton model is that in the sticker model there is a
kind of random access memory and the computations do not
depend on the extension of DNA molecules, as they do in the Adleman-Lipton model.
References
[1] L. Adleman, "Molecular computation of solutions to combinatorial problems," Science, vol. 266, pp. 1021-1024, 1994.
[2] S. Roweis, E. Winfree, R. Burgoyne et al., "A sticker based
model for DNA computation," in Proceedings of the 2nd Annual
Workshop on DNA Computing, Princeton University, L. Landweber and E. Baum, Eds., Series in Discrete Mathematics and
Theoretical Computer Science, DIMACS, pp. 1-29, American
Mathematical Society, 1999.
[3] R. J. Lipton, "DNA solution of hard computational problems,"
Science, vol. 268, pp. 542-545, 1995.
[4] L. M. Adleman, On Constructing a Molecular Computer,
Department of Computer Science, University of Southern
California, 1995.
[5] L. M. Adleman, "On constructing a molecular computer," in
DNA Based Computers, R. J. Lipton and E. B. Baum, Eds., pp.
1-22, American Mathematical Society, 1996.
[6] W.-L. Chang and M. Guo, "Solving the dominating-set problem
in Adleman-Lipton's Model," in Proceedings of the 3rd International Conference on Parallel and Distributed Computing,
Applications and Technologies, pp. 167-172, Kanazawa, Japan,
2002.
[7] W.-L. Chang and M. Guo, "Solving the clique problem and the
vertex cover problem in Adleman-Lipton's model," in IASTED
International Conference, Networks, Parallel and Distributed
Processing, and Applications, pp. 431-436, Tsukuba, Japan, 2002.
[8] W.-L. Chang and M. Guo, "Solving NP-complete problem in
the Adleman-Lipton Model," in Proceedings of The International
Conference on Computer and Information Technology, pp. 157-162, 2002.
[9] L. Adleman, P. Rothemund, S. Roweis, and E. Winfree, "On
applying molecular computation to the data encryption standard," in Proceedings of the 2nd DIMACS Workshop on DNA
Based Computers, Princeton University, pp. 24-48, 1996.
[10] H. Taghipour, A. Taghipour, M. Rezaei, and H. Esmaili, "Solving
the independent set problem by sticker based DNA computers,"
American Journal of Molecular Biology, vol. 2, no. 2, pp. 153-158,
2012.
[11] Q. Ouyang, P. D. Kaplan, S. Liu, and A. Libchaber, "DNA
solution of the maximal clique problem," Science, vol. 278, no.
5337, pp. 446-449, 1997.
[12] M. Amos, A. Gibbons, and D. Hodgson, "Error-resistant implementation of DNA computations," in Proceedings of the 2nd
DIMACS Workshop on DNA Based Computers, 1996.
[13] M. Amos, A. Gibbons, and D. Hodgson, "A new model of DNA
computation," in Proceedings of the 12th British Colloquium on
Theoretical Computer Science, 1996.
[14] M. Hagiya, M. Arita, D. Kiga, K. Sakamoto, and S. Yokoyama,
"Towards parallel evaluation and learning of boolean formulas with molecules," DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 48, pp. 57-72,
1999.
[15] E. Winfree, "Simulations of computing by self-assembly," in
Proceedings of the 4th International Meeting on DNA Based
Computers, pp. 213-239, 1998.
[16] E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman, "Design
and self-assembly of two-dimensional DNA crystals," Nature,
vol. 394, no. 6693, pp. 539-544, 1998.
[17] E. Winfree, X. Yang, and N. Seeman, "Universal computation
via self-assembly of DNA: some theory and experiments," in
Proceedings of the 2nd DIMACS Workshop on DNA Based
Computers, 1996.
[18] Q. Liu, Z. Guo, A. E. Condon, R. M. Corn, M. G. Lagally, and
L. M. Smith, "A surface-based approach to DNA computation,"
in Proceedings of the 2nd Annual Meeting on DNA Based
Computers, Princeton University, 1996.
[19] H. Taghipour, M. Rezaei, and H. Esmaili, "Applying surface-based DNA computing for solving the dominating set problem,"
American Journal of Molecular Biology, vol. 2, no. 3, pp. 286-290,
2012.
[20] G. Rozenberg and H. Spaink, "DNA computing by blocking,"
Theoretical Computer Science, vol. 292, no. 3, pp. 653-665, 2003.
[21] M. R. Garey and D. S. Johnson, Computers and Intractability:
A Guide to the Theory of NP-Completeness, Freeman, San
Francisco, Calif, USA, 1979.
Advances in Bioinformatics
Author Guidelines
Submission
Manuscripts should be submitted by one of the authors of the manuscript through the online
Manuscript Tracking System. Regardless of the source of the word-processing tool, only electronic
PDF (.pdf) or Word (.doc, .docx, .rtf) files can be submitted through the MTS. There is no page limit.
Only online submissions are accepted to facilitate rapid publication and minimize administrative costs.
Submissions by anyone other than one of the authors will not be accepted. The submitting author
takes responsibility for the paper during submission and peer review. If for some technical reason
submission through the MTS is not possible, the author can contact abi@hindawi.com for support.
Terms of Submission
Papers must be submitted on the understanding that they have not been published elsewhere and are
not currently under consideration by another journal published by Hindawi or any other publisher. The
submitting author is responsible for ensuring that the article's publication has been approved by all the
other coauthors. It is also the authors' responsibility to ensure that the articles emanating from a
particular institution are submitted with the approval of the necessary institution. Only an
acknowledgment from the editorial office officially establishes the date of receipt. Further
correspondence and proofs will be sent to the author(s) before publication unless otherwise indicated.
It is a condition of submission of a paper that the authors permit editing of the paper for readability. All
enquiries concerning the publication of accepted papers should be addressed to abi@hindawi.com.
Peer Review
All manuscripts are subject to peer review and are expected to meet standards of academic
excellence. Submissions will be considered by an editor and, if not rejected right away, by peer reviewers, whose identities will remain anonymous to the authors.
Units of Measurement
Units of measurement should be presented simply and concisely using the International System of Units (SI).
Paper title
Full author names
Full institutional mailing addresses
Email addresses
Abstract
The manuscript should contain an abstract. The abstract should be self-contained and citation-free
and should not exceed 200 words.
Introduction
This section should be succinct, with no subheadings.
Conclusions
This should clearly explain the main conclusions of the work highlighting its importance and relevance.
Acknowledgments
All acknowledgments (if any) should be included at the very end of the paper before the references
and may include supporting grants, presentations, and so forth.
References
Authors are responsible for ensuring that the information in each reference is complete and accurate.
All references must be numbered consecutively and citations of references in text should be identified
using numbers in square brackets (e.g., "as discussed by Smith [9]"; "as discussed elsewhere [9,
10]"). All references should be cited within the text; otherwise, these references will be automatically
removed.
Preparation of Figures
Upon submission of an article, authors are supposed to include all figures and tables in the PDF file of
the manuscript. Figures and tables should not be submitted in separate files. If the article is accepted,
authors will be asked to provide the source files of the figures. Each figure should be supplied in a
separate electronic file. All figures should be cited in the paper in a consecutive order. Figures should
be supplied in either vector art formats (Illustrator, EPS, WMF, FreeHand, CorelDraw, PowerPoint,
Excel, etc.) or bitmap formats (Photoshop, TIFF, GIF, JPEG, etc.). Bitmap images should be of 300 dpi
resolution at least unless the resolution is intentionally set to a lower level for scientific reasons. If a
bitmap image has labels, the image and labels should be embedded in separate layers.
Preparation of Tables
Tables should be cited consecutively in the text. Every table must have a descriptive title and if
numerical measurements are given, the units should be included in the column heading. Vertical rules
should not be used.
Proofs
Corrected proofs must be returned to the publisher within 2-3 days of receipt. The publisher will do
everything possible to ensure prompt publication. It will therefore be appreciated if the manuscripts
and figures conform from the outset to the style of the journal.
Copyright
Open Access authors retain the copyrights of their papers, and all open access articles are distributed
under the terms of the Creative Commons Attribution License, which permits unrestricted use,
distribution and reproduction in any medium, provided that the original work is properly cited.
The use of general descriptive names, trade names, trademarks, and so forth in this publication, even
if not specifically identified, does not imply that these names are not protected by the relevant laws
and regulations.
While the advice and information in this journal are believed to be true and accurate on the date of its
going to press, neither the authors, the editors, nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.
Disclosure Policy
A competing interest exists when professional judgment concerning the validity of research is
influenced by a secondary interest, such as financial gain. We require that our authors reveal any
possible conflict of interests in their submitted manuscripts.
If there is no conflict of interests, authors should state that "The author(s) declare(s) that there is no
conflict of interests regarding the publication of this article."