
Guest Editorial: WABI Special Section Part II

Junhyong Kim and Inge Jonassen

The Fourth International Workshop on Algorithms in
Bioinformatics (WABI) 2004 was held in Bergen, Nor-
way, September 2004. The program committee consisted of
33 members and selected, among 117 submissions, 39 to be
presented at the workshop and included in the proceedings
from the workshop (volume 3240 of Lecture Notes in
Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner,
and Michael Waterman).
The WABI 2004 program committee selected a small
number of papers among the 39 to be invited to submit
extended versions of their papers to a special section of the
IEEE/ACM Transactions on Computational Biology and Bioin-
formatics. Four papers were published in the October-
December 2004 issue of the journal and this issue contains
an additional three papers. We would like to thank both the
entire program committee for WABI and the reviewers of
the papers in this issue for their valuable contributions.
The first of the papers is "A New Distance for High Level
RNA Secondary Structure Comparison," authored by Julien
Allali and Marie-France Sagot. This paper describes algo-
rithms for comparing secondary structures of RNA molecules
where the structures are represented by trees. The problem of
classifying RNA secondary structure is becoming critical as
biologists are discovering more and more noncoding func-
tional elements in the genome (e.g., miRNA). Most likely, the
major functional determinants of the elements are their
secondary structure and, therefore, a metric between such
secondary structures will also help delineate clusters of
functional groups. In Allali and Sagot's paper, two tree
representations of secondary structure are compared by
analysing how one tree can be transformed into the other
using an allowed set of operations. Each operation can be
associated with a cost and the distance between two trees can
then be defined as the minimum cost associated with a
transform of one tree to the other. Allali and Sagot introduce
two new operations that they name edge fusion and node
fusion and show that these alleviate limitations associated
with the classical tree edit operations used for RNA
comparison. Importantly, they also present algorithms for
calculating the distance between trees allowing the new
operations in addition to the classical ones, and analyze the
performance of the algorithms.
The second paper is "Topological Rearrangements and
Local Search Method for Tandem Duplication Trees" and is
authored by Denis Bertrand and Olivier Gascuel. The paper
approaches the problem of estimating the evolutionary
history of tandem repeats. A tandem repeat is a stretch of
DNA sequence that contains an element that is repeated
multiple times and where the repeat occurrences are next to
each other in the sequence. Since the repeats are subject to
mutations, they are not identical. Therefore, tandem repeats
occur through evolution by copying (duplication) of
repeat elements in blocks of varying size. Bertrand and
Gascuel address the problem of finding the most likely
sequence of events giving rise to the observed set of repeats.
Each sequence of events can be described by a duplication
tree and one searches for the tree that is the most
parsimonious, i.e., one that explains how the sequence has
evolved from an ancestral single copy with a minimum
number of mutations along the branches of the tree. The
main difference with the standard phylogeny problem is
that the linear ordering of the tandem duplications imposes
constraints on the possible binary tree form. This paper
describes a local search method that allows exploration of
the complete space of possible duplication trees and shows
that the method is superior to other existing methods for
reconstructing the tree and recovering its duplication
events.
The third paper is Optimizing Multiple Seeds for
Homology Search authored by Daniel G. Brown. The
paper presents an approach to selecting starting points for
pairwise local alignments of protein sequences. The
problem of pairwise local alignment is to find a segment
from each sequence so that the two segments can be aligned to
obtain a high score. For commonly used scoring schemes,
this can be solved exactly using dynamic programming.
However, pairwise alignment is frequently applied to large
data sets and heuristic methods for restricting alignments to
be considered are frequently used, for instance, in the
BLAST programs. The key is to restrict the number of
alignments as much as possible, by choosing a few good
seeds, without missing high scoring alignments. The paper
shows that this can be formulated as an integer program-
ming problem and presents an algorithm for choosing optimal
seeds. Analysis is presented showing that the approach
gives four times fewer false positives (unnecessary seeds) in
comparison with BLASTP without losing more good hits.
Junhyong Kim
Inge Jonassen
Guest Editors
. J. Kim is with the Department of Biology, University of Pennsylvania,
3451 Walnut Street, Philadelphia, PA 19104.
E-mail: junhyong@sas.upenn.edu.
. I. Jonassen is with the Department of Informatics and Computational
Biology Unit, University of Bergen, HIB N5020 Bergen, Norway.
E-mail: inge@ii.uib.no.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org.
Junhyong Kim is the Edmund J. and Louise
Kahn Term Endowed Professor in the Depart-
ment of Biology at the University of Pennsylvania.
He holds joint appointments in the Department of
Computer and Information Science, Penn Center
for Bioinformatics, and the Penn Genomics
Institute. He serves on the editorial board of
Molecular Development and Evolution and the
IEEE/ACM Transactions on Computational Biol-
ogy and Bioinformatics, the council of the Society
for Systematic Biology, and the executive committee of the Cyber
Infrastructure for Phylogenetics Research. His research focuses on
computational and experimental approaches to comparative develop-
ment. The current focus of his lab is in three areas: computational
phylogenetics, in silico gene discovery, and comparative development
using genome-wide gene expression data.
Inge Jonassen is a professor of computer
science in the Department of Informatics at the
University of Bergen in Norway, where he is
a member of the bioinformatics group. He is also
affiliated with the Bergen Center for Computa-
tional Science at the same university where he
heads the Computational Biology Unit. He is also
vice president of the Society for Bioinformatics in
the Nordic Countries (SocBiN) and a member of
the board of the Nordic Bioinformatics Network.
He coordinates the technology platform for bioinformatics funded by the
Norwegian Research Council functional genomics programme FUGE.
He has worked in the field of bioinformatics since the early 1990s, where
he has primarily focused on methods for discovery of patterns with
applications to biological sequences and structures and on methods for
the analysis of microarray gene expression data.
. For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
A New Distance for High Level RNA
Secondary Structure Comparison
Julien Allali and Marie-France Sagot
Abstract: We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new
operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in
the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent
RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term,
this term depends only on the number of successive fusions that may be applied to the same node, not on the total number of fusions.
The algorithm therefore remains efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.

Index Terms: Tree comparison, edit operation, distance, RNA, secondary structure.

1 INTRODUCTION
RNAs are one of the fundamental elements of a cell. Their
role in regulation has been recently shown to be far
more prominent than initially believed (20 December 2002
issue of Science, which designated small RNAs with
regulatory function as the scientific breakthrough of the
year). It is now known, for instance, that there is massive
transcription of noncoding RNAs. Yet current mathematical
and computer tools remain mostly inadequate to identify,
analyze, and compare RNAs.
An RNA may be seen as a string over the alphabet of
nucleotides (also called bases), {A, C, G, U}. Inside a cell,
RNAs do not retain a linear form, but instead fold in space.
The fold is given by the set of nucleotide bases that pair. The
main type of pairing, called canonical, corresponds to bonds
of the type A-U and G-C. Other rarer types of bonds
may be observed, the most frequent among them being G-U,
also called the wobble pair. Fig. 1 shows the sequence of a
folded RNA. Each box represents a consecutive sequence of
bonded pairs, corresponding to a helix in 3D space. The
secondary structure of an RNA is the set of helices (or the
list of paired bases) making up the RNA. Pseudoknots,
which may be described as a pair of interleaved helices, are
in general excluded from the secondary structure of an
RNA. RNA secondary structures can thus be represented as
planar graphs. An RNA primary structure is its sequence of
nucleotides while its tertiary structure corresponds to the
geometric form the RNA adopts in space.
Apart from helices, the other main structural elements in
an RNA are:
1. hairpin loops which are sequences of unpaired bases
closing a helix;
2. internal loops which are sequences of unpaired
bases linking two different helices;
3. bulges which are internal loops with unpaired bases
on one side only of a helix;
4. multiloops which are unpaired bases linking at least
three helices.
Stems are successions of one or more among helices,
internal loops, and/or bulges.
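Since pseudoknots are excluded, a secondary structure can be handled simply as a set of base pairs in which no two pairs cross. The short Python sketch below (our own illustration, not part of the paper; the pair values are invented) shows this representation and a naive planarity check:

```python
# A minimal sketch of a pseudoknot-free secondary structure: a set of paired
# positions, screened for two base pairs (i, j) and (k, l) that interleave as
# i < k < j < l. The example pairs below are purely illustrative.

def is_pseudoknot_free(pairs):
    """Return True if no two base pairs cross (structure is planar)."""
    normalized = [tuple(sorted(p)) for p in pairs]
    for a in range(len(normalized)):
        i, j = normalized[a]
        for b in range(a + 1, len(normalized)):
            k, l = normalized[b]
            if i < k < j < l or k < i < l < j:  # interleaved helices
                return False
    return True

# Toy structure: a small hairpin (nested pairs) plus one crossing pair.
nested = [(0, 20), (1, 19), (2, 18)]
print(is_pseudoknot_free(nested))                  # True
print(is_pseudoknot_free(nested + [(10, 25)]))     # False: pseudoknot
```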
The comparison of RNA secondary structures is one of
the main basic computational problems raised by the study
of RNAs. It is the problem we address in this paper. The
motivations are many. RNA structure comparison has been
used in at least one approach to RNA structure prediction
that takes as initial data a set of unaligned sequences
supposed to have a common structural core [1]. For each
sequence, a set of structural predictions are made (for
instance, all suboptimal structures predicted by an algo-
rithm like Zuker's MFOLD [15], or all suboptimal sets of
compatible helices or stems). The common structure is then
found by comparing all the structures obtained from the
initial set of sequences, and identifying a substructure
common to all, or to some of the sequences. RNA structure
comparison is also an essential element in the discovery of
RNA structural motifs, or profiles, or of more general
models that may then be used to search for other RNAs of
the same type in newly sequenced genomes. For instance,
general models for tRNAs and introns of group I have been
derived by hand [3], [10]. It is an open question whether
models at least as accurate as these, or perhaps even more
accurate, could have been derived in an automatic way. The
identification of smaller structural motifs is an equally
important topic that requires comparing structures.
As we saw, the comparison of RNA structures may
concern known RNA structures (that is, structures that were
experimentally determined) or predicted structures. The
. J. Allali is with the Institut Gaspard-Monge, Université de Marne-la-
Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex
2, France. E-mail: allali@univ-mlv.fr.
. M.-F. Sagot is with Inria Rhône-Alpes, Université Claude Bernard, Lyon I,
43 Bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France.
E-mail: Marie-France.Sagot@inria.fr.
Manuscript received 11 Oct. 2004; accepted 20 Dec. 2004; published online
30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org, and reference IEEECS Log Number TCBB-0164-1004.
objective in both cases is the same: to find the common parts
of such structures.
In [11], Shapiro suggested to mathematically model RNA
secondary structures without pseudoknots by means of
trees. The trees are rooted and ordered, which means that
the order among the children of a node matters. This order
corresponds to the 5'-3' orientation of an RNA sequence.
Given two trees representing each an RNA, there are two
main ways for comparing them. One is based on the
computation of the edit distance between the two trees
while the other consists in aligning the trees and using the
score of the alignment as a measure of the distance between
the trees. Contrary to what happens with sequences, the
two, alignment and edit distance, are not equivalent. The
alignment distance is a restrained form of the edit distance
between two trees, where all insertions must be performed
before any deletions. The alignment distance for general
trees was defined in 1994 by Jiang et al. in [9] and extended
to an alignment distance between forests in [6]. More
recently, Höchsmann et al. [7] applied the tree alignment
distance to the comparison of two RNA secondary
structures. Because of the restriction on the way edit
operations can be applied in an alignment, we are not
concerned in this paper with tree alignment distance and
we therefore address exclusively from now on the problem
of tree edit distance.
Our way for comparing two RNA secondary structures is
then to apply a number of tree edit operations in one or both of
the trees representing the RNAs until isomorphic trees are
obtained. The currently most popular program using this
approach is probably the Vienna package [5], [4]. The tree edit
operations considered are derived from the operations
classically applied to sequences [13]: substitution, deletion,
and insertion. In 1989, Zhang and Shasha [14] gave a dynamic
programming algorithm for comparing two trees. Shapiro
and Zhang then showed [12] how to use tree editing to
compare RNAs. The latter also proposed various tree models
that could be used for representing RNA secondary struc-
tures. Each suggested tree offers a more or less detailed view
of an RNA structure. Figs. 2b, 2c, 2d, and 2e present a few
examples of such possible views for the RNA given in Fig. 2a.
In Fig. 2, the nodes of the tree in Fig. 2b represent either
unpaired bases (leaves) or paired bases (internal nodes). Each
node is labeled with, respectively, a base or a pair of bases. A
node of the tree in Fig. 2c represents a set of successive
unpaired bases or of stacked paired ones. The label of a node
is an integer indicating, respectively, the number of unpaired
bases or the height of the stack of paired ones. The nodes of the
tree in Fig. 2d represent elements of secondary structure:
hairpin loop (H), bulge (B), internal loop (I), or multiloop (M).
The edges correspond to helices. Finally, the tree in Fig. 2e
contains only the information concerning the skeleton of
multiloops of an RNA. The last representation, though giving
a highly simplified view of an RNA, is important nevertheless
as it is generally accepted that it is this skeleton which is
usually the most constrained part of an RNA. The last two
models may be enriched with information concerning, for
instance, the number of (unpaired) bases in a loop (hairpin,
internal, multi) or bulge, and the number of paired bases in a
helix. The first labels the nodes of the tree, the second its edges.
Other types of information may be added (such as overall
composition of the elements of secondary structure). In fact,
one could consider working with various representations
simultaneously or in an interlocked, multilevel fashion. This
goes beyond the scope of this paper which is concerned with
comparing RNA secondary structures using any one among
the many tree representations possible. We shall, however,
comment further on this multilevel approach later on.
Concerning the objectives of this paper, they are twofold.
The first is to give some indications on why the classical edit
operations that have been considered so far in the literature
for comparing trees present some limitations when the trees
stand for RNA structures. Three cases of such limitations will
be illustrated through examples in Section 3. In Section 4, we
then introduce two novel operations, so-called node-fusion
and edge-fusion, that enable us to address some of these
limitations and then give a dynamic programming algorithm
for comparing two RNA structures with these two additional
operations. Implementation issues and initial results are
presented in Section 4. In Section 5, we give a first application
Fig. 1. Primary and secondary structures of a transfer RNA.
Fig. 2. Example of different tree representations ((b), (c), (d), and (e)) of
the same RNA (a).
of our algorithm to the comparison of two RNA secondary
structures. Finally, in Section 6, we sketch the main ideas
behind the multilevel RNA comparison approach mentioned
above. Before that, we start by introducing some notation and
by recalling in the next section the basics about classical tree
edit operations and tree mapping.
This paper is an extended version of a paper presented at
the Workshop on Algorithms in BioInformatics (WABI) in
2004, in Bergen, Norway. A few more examples are given to
illustrate some of the points made in the WABI paper,
complexity and implementation issues are discussed in
more depth as are the cost functions and a multilevel
approach to comparing RNAs.
2 TREE EDITING AND MAPPING
Let T be an ordered rooted tree, that is, a tree where the
order among the children of a node matters. We define
three kinds of operations on T: deletion, insertion, and
relabeling (corresponding to a substitution in sequence
comparison). The operations are shown in Fig. 3. The
deletion (Fig. 3b) of a node u removes u from the tree. The
children of u become the children of u's father. An insertion
(Fig. 3c) is the symmetric of a deletion. Given a node $u$, we
remove a consecutive (in relation to the order among the
children) set $u_1, \ldots, u_p$ of its children, create a new node $v$,
make $v$ a child of $u$ by attaching it at the place where the set
was, and, finally, make the set $u_1, \ldots, u_p$ (in the same order)
the children of $v$. The relabeling of a node (Fig. 3d) consists
simply in changing its label.
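To make these definitions concrete, here is a minimal Python sketch of an ordered rooted tree together with the three classical operations; it is an illustration under our own naming, not the authors' code.

```python
# Ordered rooted tree with the three classical edit operations. Children are
# kept in a list, so the order among siblings is preserved.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def relabel(node, new_label):
    """Relabeling: only the label changes."""
    node.label = new_label

def delete(parent, child):
    """Deletion: the children of `child` take its place under `parent`."""
    pos = parent.children.index(child)
    parent.children[pos:pos + 1] = child.children

def insert(parent, new_label, start, end):
    """Insertion: a new node adopts the consecutive children
    parent.children[start:end] and is attached in their place."""
    block = parent.children[start:end]
    parent.children[start:end] = [Node(new_label, block)]

# Tiny usage example.
t = Node("A", [Node("B"), Node("C"), Node("D")])
insert(t, "V", 0, 2)          # V becomes the parent of B and C
delete(t, t.children[0])      # deleting V restores B and C under A
relabel(t, "K")
```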
Given two trees $T$ and $T'$, we define $S = \{s_1, \ldots, s_e\}$ to be
a series of edit operations such that, if we apply successively
the operations in $S$ to the tree $T$, we obtain $T'$ (i.e., $T$
and $T'$ become isomorphic). A series of operations like $S$
realizes the editing of $T$ into $T'$ and is denoted by $T \xrightarrow{S} T'$.
We define a function $cost$ from the set of possible edit
operations (deletion, insertion, relabeling) to the integers (or
the reals) such that $cost_s$ is the score of the edit operation $s$.
If $S$ is a series of edit operations, we define by extension
$cost_S = \sum_{s \in S} cost_s$. We can define the edit distance between
two trees as the minimal cost of a series of operations that
performs the editing of $T$ into $T'$:
$$distance(T, T') = \min\{cost_S \mid T \xrightarrow{S} T'\}.$$
Let an insertion or a deletion cost one and the relabeling of
a node cost zero if the label is the same and one otherwise. For
the two trees of the figure on the left, the series
$relabel(A \to F) \cdot delete(B) \cdot insert(G)$ realizes the editing of the left tree into
the right one and costs 3. Another possibility is the series
$delete(B) \cdot relabel(A \to G) \cdot insert(F)$, which also costs 3. The
distance between these two trees is 3.
Given a series of operations $S$, let us consider the nodes
of $T$ that are not deleted (in the initial tree or after some
relabeling). Such nodes are associated with nodes of $T'$. The
mapping $M_S$ relative to $S$ is the set of couples $(u, u')$ with
$u \in T$ and $u' \in T'$ such that $u$ is associated with $u'$ by $S$.
The operations described above are the classical tree edit
operations that have been commonly used in the literature
for RNA secondary structure comparison. We now present a
few results obtained using such classical operations that will
allow us to illustrate a few limitations they may present when
used for comparing RNA structures.
3 LIMITATIONS OF CLASSICAL TREE EDIT
OPERATIONS FOR RNA COMPARISON
As suggested in [12], the tree edit operations recalled in the
previous section can be used on any type of tree coding of
an RNA secondary structure.
Fig. 4 shows two RNAsePs extracted from the database [2]
(they are found, respectively, in Streptococcus gordonii and
Thermotoga maritima). For the example we discuss now, we
code the RNAs using the tree representation indicated in
Fig. 2b where a node represents a base pair and a leaf an
unpaired base. After applying a few edit operations to the
trees, we obtain the result indicated in Fig. 4, with deleted/
inserted bases in gray. We have surrounded a few regions that
match in the two trees. Bases in the rectangular box at the
bottom of the RNA on the left are thus associated with bases in
the bottom rightmost rectangular box of the RNA on the right.
The same is observed for the bases in the oval boxes for both
RNAs. Such matches illustrate one of the main problems with
the classical tree edit operations: Bases in one RNA may be
mapped to identically labeled bases in the other RNA to
minimise the total cost, while such bases should not be
associated in terms of the elements of secondary structure to
which they belong. In fact, such elements are often distant
from one another along the common RNA structure. We call
this problem the "scattering effect." It is related to the
definition of tree edit operations. In the case of this example
and of the representation adopted, the problem might have
been avoided if structural information had been used.
Indeed, the problem appears also because the structural
Fig. 3. Edit operations: (a) the original tree T, (b) deletion of the node
labelled D, (c) insertion of the node labeled I, and (d) relabeling of a
node in T (the label A of the root is changed into K).
location of an unpaired base is not taken into account. It is
therefore possible to match, for instance, an unpaired base
from a hairpin loop with an unpaired base from a multiloop.
Using another type of representation, as we shall do, would,
however, not be enough to solve all problems as we see next.
Indeed, to compare the same two RNAs, we can also use a
more abstract tree representation such as the one given in
Fig. 2d. In this case, the internal nodes represent a multiloop,
internal loop, or bulge, the leaves code for hairpin loops and
the edges for helices. The result of the editing of $T$ into $T'$ for
some cost function is presented in Fig. 5 (we shall come back later to
the cost functions used in the case of such more abstract RNA
representations; for the sake of this example, we may assume
an arbitrary one is used).
The problem we wish to illustrate in this case is shown
by the boxes in the figure. Consider the boxes at the bottom.
In the left RNA, we have a helix made up of 13 base pairs. In
the right RNA, the helix is formed by seven base pairs
followed by an internal loop and another helix of size 5. By
definition (see Section 2), the algorithm can only associate
one element in the first tree to one element in the second
tree. In this case, we would like to associate the helix of the
left tree to the two helices of the second tree since it seems
clear that the internal loop represents either an inserted
element in the second RNA, or the unbonding of one base
pair. This, however, is not possible with classical edit
operations.
A third type of problem one can meet when using only
the three classical edit operations to compare trees standing
for RNAs is similar to the previous one, but concerns this
time a node instead of edges in the same tree representa-
tion. Often, an RNA may present a very small helix between
two elements (multiloop, internal-loop, bulge, or hairpin-
loop) while such helix is absent in the other RNA. In this
case, we would therefore have liked to be able to associate
one node in a tree representing an RNA with two or more
Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNAsePs, of Saccharomyces uvarum and of
Saccharomyces kluveri, using the model given in Fig. 2d.
Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNAsePs, of Streptococcus gordonii and of Thermotoga maritima,
using the model given in Fig. 2b.
nodes in the tree for the other RNA. Once again, this is not
possible with any of the classical tree edit operations. An
illustration of this problem is shown in Fig. 6.
We shall use RNA representations that take the elements
of the structure of an RNA into account to avoid some of the
scattering effect. Furthermore, in addition to considering
information of a structural nature, labels are attached, in
general, to both nodes and edges of the tree representing an
RNA. Such labels are numerical values (integers or reals).
They represent in most cases the size of the corresponding
element, but may also further indicate its composition, etc.
Such additional information is then incorporated into the
cost functions for all three edit operations. It is important to
observe that when dealing with trees labeled at both the
nodes and edges, any node and the edge that leads to it (or,
in an alternative perspective, departs from it) represent a
single object from the point of view of computing an edit
distance between the trees.
It remains now to deal with the last two problems that
are a consequence of the one-to-one associations between
nodes and edges enforced by the classical tree edit
operations. To that purpose, we introduce two novel tree
edit operations, called the edge fusion and the node fusion.
4 INTRODUCING NOVEL TREE EDIT OPERATIONS
4.1 Edge Fusion and Node Fusion
In order to address some of the limitations of the classical tree
edit operations that were illustrated in the previous section,
we need to introduce two novel operations. These are the edge
fusion and the node fusion. They may be applied to any of the
tree representations given in Figs. 2c, 2d, and 2e.
An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an
edge leading to a node $u$, $c_i$ a child of $u$, and $e_{c_i}$ the edge
between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in
replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links
the father of $u$ to $c_i$. Its label then becomes a function of the
(numerical) labels of $e_u$, $u$, and $e_{c_i}$. For instance, if such labels
indicated the size of each element (e.g., for a helix, the number
of its stacked pairs, and for a loop, the min, max, or the average
of its unpaired bases on each side of the loop), the label of $e$
could be the sum of the sizes of $e_u$, $u$, and $e_{c_i}$. Observe that
merging two edges implies deleting all subtrees rooted at the
children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions
is added to the cost of the edge fusion.
An example of node fusion is given in Fig. 7b. Let $u$ be a
node and $c_i$ one of its children. Performing a node fusion of
$u$ and $c_i$ consists in making $u$ the father of all children of $c_i$
and in relabeling $u$ with a value that is a function of the
values of the labels of $u$, $c_i$, and of the edge between them.
Observe that a node fusion may be simulated using the
classical edit operations by a deletion followed by a
relabeling. However, the difference between a node fusion
and a deletion/relabeling is in the cost associated with both
operations. We shall come back to this point later.
Obviously, like insertions or deletions, edge fusions and
node fusions have of course symmetric counterparts, which
are the edge split and the node split.
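As an informal illustration of the two new operations, the sketch below (our own, not the authors' implementation) uses numeric size labels and takes the sum of the fused labels as the combined label, one of the possibilities mentioned above.

```python
# Rough sketch of node fusion and edge fusion on a tree whose nodes and
# incoming edges carry numeric size labels.

class Node:
    def __init__(self, size, edge_size=0, children=None):
        self.size = size            # label of the node
        self.edge_size = edge_size  # label of the edge leading to this node
        self.children = children or []

def node_fusion(u, c):
    """Fuse child c into u: u adopts c's children, labels are combined."""
    u.size += c.size + c.edge_size
    pos = u.children.index(c)
    u.children[pos:pos + 1] = c.children

def edge_fusion(parent, u, c):
    """Fuse the edge leading to u with the edge from u to child c.
    The subtrees rooted at the other children of u are dropped here;
    their deletion cost would be charged to the operation."""
    c.edge_size += u.edge_size + u.size
    pos = parent.children.index(u)
    parent.children[pos] = c

# Usage: fuse a size-5 helix, an internal loop of size 2, and a size-7 helix
# into a single edge of size 14 (the values are illustrative only).
root = Node(0)
loop = Node(2, edge_size=5, children=[Node(3, edge_size=7)])
root.children.append(loop)
edge_fusion(root, loop, loop.children[0])
print(root.children[0].edge_size)   # 14
```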
Given two rooted, ordered, and labeled trees $T$ and $T'$,
we define the edit distance with fusion between $T$ and $T'$
Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.
Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNAsePs from Pyrococcus furiosus and
Metallosphaera sedula. Triangles stand for bulges, diamonds stand for internal loops, and squares for hairpin loops.
as $distance_{fusion}(T, T') = \min\{cost_S \mid T \xrightarrow{S} T'\}$, with $cost_s$ the
cost associated to each of the seven edit operations now
considered (relabeling, insertion, deletion, node fusion and
split, edge fusion and split).
Proposition 1. If the following is verified:

. $cost_{match}(a, b)$ is a distance,
. $cost_{ins}(a) = cost_{del}(a) \geq 0$,
. $cost_{node\_fusion}(a, b, c) = cost_{node\_split}(a, b, c) \geq 0$, and
. $cost_{edge\_fusion}(a, b, c) = cost_{edge\_split}(a, b, c) \geq 0$,

then $distance_{fusion}$ is indeed a distance.

Proof. The positiveness of $distance_{fusion}$ is given by the fact
that all elementary cost functions are positive. Its
symmetry is guaranteed by the symmetry in the costs
of the insertion/deletion and (node/edge) fusion/split
operations. Finally, it is straightforward to see that
$distance_{fusion}$ satisfies the triangular inequality.
Besides the above properties that must be satisfied by the
cost functions in order to obtain a distance, others may be
introduced for specific purposes. Some will be discussed in
Section 5.
We now present an algorithm to compute the tree edit
distance between two trees using the classical tree edit
operations plus the two operations just introduced.
4.2 Algorithm
The method we introduce is a dynamic programming
algorithm based on the one proposed by Zhang and Shasha.
Their algorithm is divided in two parts: They first compute
the edit distance between two trees (this part is denoted by
TDist) and then the distance between two forests (this part
is denoted by FDist). Fig. 8 illustrates in pictorial form the
part TDist and Fig. 9 the FDist part of the computation.
In order to take our two new operations into account, we
need to compute a few more things in the TDist part.
Indeed, we must add the possibility for each tree to have a
node fusion (inversely, node split) between the root and one
of its children, or to have an edge fusion (inversely edge
split) between the root and one of its children. These
additional operations are indicated in the right box of Fig. 8.
We present now a formal description of the algorithm. Let
$T$ be an ordered rooted tree with $|T|$ nodes. We denote by $t_i$
the $i$th node in a postfix order. For each node $t_i$, $l(i)$ is the
index of the leftmost leaf of the subtree rooted at $t_i$. Let
$T[i \ldots j]$ denote the forest composed by the nodes $t_i, \ldots, t_j$
($T = T[1 \ldots |T|]$). To simplify notation, from now on, when
there is no ambiguity, $i$ will refer to the node $t_i$. In this case,
$distance(i_1 \ldots i_2,\ j_1 \ldots j_2)$ will be equivalent to
$distance(T[i_1 \ldots i_2],\ T'[j_1 \ldots j_2])$.
The algorithm of Zhang and Shasha is fully described by
the following recurrence formula:

if $i_1 = l(i_2)$ and $j_1 = l(j_2)$:

$$\mathrm{MIN}\begin{cases}
distance(i_1 \ldots i_2 - 1,\ j_1 \ldots j_2) + cost_{del}(i_2)\\
distance(i_1 \ldots i_2,\ j_1 \ldots j_2 - 1) + cost_{ins}(j_2)\\
distance(i_1 \ldots i_2 - 1,\ j_1 \ldots j_2 - 1) + cost_{match}(i_2, j_2)
\end{cases} \qquad (1)$$

else:

$$\mathrm{MIN}\begin{cases}
distance(i_1 \ldots i_2 - 1,\ j_1 \ldots j_2) + cost_{del}(i_2)\\
distance(i_1 \ldots i_2,\ j_1 \ldots j_2 - 1) + cost_{ins}(j_2)\\
distance(i_1 \ldots l(i_2) - 1,\ j_1 \ldots l(j_2) - 1) + distance(l(i_2) \ldots i_2,\ l(j_2) \ldots j_2)
\end{cases} \qquad (2)$$
Part (1) of the formula corresponds to Fig. 8, while part (2)
corresponds to Fig. 9. In practice, the algorithm stores in a
matrix the score between each subtree of $T$ and $T'$. The space
complexity is therefore $O(|T| \times |T'|)$. To reach this complexity,
the computation must be done in a certain order (see
Section 4.3). The time complexity of the algorithm is
$$O(|T| \times \min(leaf(T), height(T)) \times |T'| \times \min(leaf(T'), height(T'))),$$
where $leaf(T)$ and $height(T)$ represent, respectively, the
number of leaves and the height of a tree $T$.
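The classical part of the computation (recurrences (1) and (2), without fusions) can be sketched compactly. The Python sketch below is only an illustration, not the authors' C++ implementation: it assumes unit insertion/deletion costs and a 0/1 relabeling cost, and the "keyroot" set it uses corresponds to the set LR(T) of left roots introduced in Section 4.3.

```python
# A compact sketch of the classical Zhang-Shasha tree edit distance
# (recurrences (1) and (2); fusions are NOT handled). Trees are given as
# nested tuples (label, [children]). Illustrative code, not the paper's.

def postorder(tree):
    """Number nodes 1..|T| in postfix order; leftmost[i] is l(i), the
    leftmost leaf of the subtree rooted at node i."""
    labels, leftmost = [None], [None]          # 1-indexed
    def walk(node):
        label, children = node
        first_leaf = None
        for child in children:
            leaf = walk(child)
            if first_leaf is None:
                first_leaf = leaf
        labels.append(label)
        index = len(labels) - 1
        leftmost.append(first_leaf if first_leaf is not None else index)
        return leftmost[index]
    walk(tree)
    return labels, leftmost

def keyroots(leftmost):
    """The set LR(T): nodes k with no k' > k sharing the same leftmost leaf."""
    seen, roots = set(), []
    for k in range(len(leftmost) - 1, 0, -1):
        if leftmost[k] not in seen:
            roots.append(k)
            seen.add(leftmost[k])
    return sorted(roots)

def tree_edit_distance(t1, t2):
    lab1, l1 = postorder(t1)
    lab2, l2 = postorder(t2)
    n1, n2 = len(lab1) - 1, len(lab2) - 1
    td = [[0] * (n2 + 1) for _ in range(n1 + 1)]      # tree distances
    for i in keyroots(l1):
        for j in keyroots(l2):
            m, n = i - l1[i] + 2, j - l2[j] + 2
            fd = [[0] * n for _ in range(m)]           # forest distances
            for di in range(1, m):
                fd[di][0] = fd[di - 1][0] + 1          # deletions
            for dj in range(1, n):
                fd[0][dj] = fd[0][dj - 1] + 1          # insertions
            for di in range(1, m):
                for dj in range(1, n):
                    x, y = l1[i] + di - 1, l2[j] + dj - 1
                    if l1[x] == l1[i] and l2[y] == l2[j]:
                        # both forests are trees: recurrence (1)
                        fd[di][dj] = min(fd[di - 1][dj] + 1,
                                         fd[di][dj - 1] + 1,
                                         fd[di - 1][dj - 1]
                                         + (lab1[x] != lab2[y]))
                        td[x][y] = fd[di][dj]
                    else:
                        # general forests: recurrence (2)
                        fd[di][dj] = min(fd[di - 1][dj] + 1,
                                         fd[di][dj - 1] + 1,
                                         fd[l1[x] - l1[i]][l2[y] - l2[j]]
                                         + td[x][y])
    return td[n1][n2]

# Toy trees (not the paper's figure): relabel the root, delete B, insert F.
left = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
right = ("G", [("C", []), ("D", []), ("F", [("E", [])])])
print(tree_edit_distance(left, right))   # 3 with unit costs
```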
Fig. 8. Zhang and Shasha's dynamic programming algorithm: the tree distance part. The right box corresponds to the additional operations added to
take fusion into account.
The formula to compute the edit score allowing for both
node and edge fusions follows.
if $i_1 = l(i_k)$ and $j_1 = l(j_{k'})$:

$$\mathrm{MIN}\begin{cases}
distance(\{i_1 \ldots i_{k-1}\}, \emptyset;\ \{j_1 \ldots j_{k'}\}, path') + cost_{del}(i_k)\\
distance(\{i_1 \ldots i_k\}, path;\ \{j_1 \ldots j_{k'-1}\}, \emptyset) + cost_{ins}(j_{k'})\\
distance(\{i_1 \ldots i_{k-1}\}, \emptyset;\ \{j_1 \ldots j_{k'-1}\}, \emptyset) + cost_{match}(i_k, j_{k'})\\
\text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}\text{, set } i_l = l(i_c)\text{:}\\
\quad distance(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, path.(u, i_c);\ \{j_1 \ldots j_{k'}\}, path')\\
\qquad +\ cost_{node\_fusion}(i_c, i_k) \qquad \text{(obs.: } i_k \text{ data are changed)}\\
\quad distance(\{i_l \ldots i_{c-1}, i_k\}, path.(e, i_c);\ \{j_1 \ldots j_{k'}\}, path') + cost_{edge\_fusion}(i_c, i_k)\\
\qquad +\ distance(\{i_1 \ldots i_{l-1}\}, \emptyset;\ \emptyset, \emptyset) + distance(\{i_{c+1} \ldots i_k - 1\}, \emptyset;\ \emptyset, \emptyset)\\
\qquad \text{(obs.: } i_k \text{ data are changed)}\\
\text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}\text{, set } j_{l'} = l(j_{c'})\text{:}\\
\quad distance(\{i_1 \ldots i_k\}, path;\ \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, path'.(u, j_{c'}))\\
\qquad +\ cost_{node\_split}(j_{c'}, j_{k'}) \qquad \text{(obs.: } j_{k'} \text{ data are changed)}\\
\quad distance(\{i_1 \ldots i_k\}, path;\ \{j_{l'} \ldots j_{c'}, j_{k'}\}, path'.(e, j_{c'})) + cost_{edge\_split}(j_{c'}, j_{k'})\\
\qquad +\ distance(\emptyset, \emptyset;\ \{j_1 \ldots j_{l'-1}\}, \emptyset) + distance(\emptyset, \emptyset;\ \{j_{c'+1} \ldots j_{k'} - 1\}, \emptyset)\\
\qquad \text{(obs.: } j_{k'} \text{ data are changed)}
\end{cases} \qquad (3)$$

else, set $i_l = l(i_k)$ and $j_{l'} = l(j_{k'})$:

$$\mathrm{MIN}\begin{cases}
distance(\{i_1 \ldots i_{k-1}\}, \emptyset;\ \{j_1 \ldots j_{k'}\}, path') + cost_{del}(i_k)\\
distance(\{i_1 \ldots i_k\}, path;\ \{j_1 \ldots j_{k'-1}\}, \emptyset) + cost_{ins}(j_{k'})\\
distance(\{i_1 \ldots i_{l-1}\}, \emptyset;\ \{j_1 \ldots j_{l'-1}\}, \emptyset) + distance(\{i_l \ldots i_k\}, path;\ \{j_{l'} \ldots j_{k'}\}, path')
\end{cases} \qquad (4)$$
Given two nodes $u$ and $v$ such that $v$ is a child of $u$,
$node\_fusion(u, v)$ is the fusion of node $v$ with $u$, and
$edge\_fusion(u, v)$ is the edge fusion between the edges
leading to, respectively, nodes $u$ and $v$. The symmetric
operations are denoted by, respectively, $node\_split(u, v)$ and
$edge\_split(u, v)$.
The distance computation takes two new parameters,
$path$ and $path'$. These are sets of pairs $(e \text{ or } u, v)$ which
indicate, for node $i_k$ (respectively, $j_{k'}$), the series of fusions
that were done. Thus, a pair $(e, v)$ indicates that an edge
fusion has been performed between $i_k$ and $v$, while for $(u, v)$
a node $v$ has been merged with node $i_k$.
The notation $path.(e, v)$ indicates that the operation $(e, v)$
has been performed in relation to node $i_k$ and the
information is thus concatenated to the set $path$ of pairs
currently linked with $i_k$.
4.3 Implementation and Complexity
The previous section gave the recurrence formulae for
calculating the edit distance between two trees allowing for
node and edge fusion and split. We now discuss the
complexity of the algorithm. This requires paying attention
to some high-level implementation details that, in the case
of the tree edit distance problem, may have an important
influence on the theoretical complexity of the algorithm.
Such details were first observed by Zhang and Shasha. They
concern the order in which to perform the operations
indicated in (2) and (1) to obtain an algorithm that is time
and space efficient.
Let us consider the last line of (2). We may observe that
the computation of the distance between two forests refers
to the computation of the distance between two trees
$T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$. We must therefore memorise
the distance between any two subtrees of $T$ and $T'$.
Furthermore, we have to carry out the computation from
the leaves to the root because when we compute the
distance between two subtrees $U$ and $U'$, the distance
between any subtrees of $U$ and $U'$ must already have been
measured. This explains the space complexity, which is in
$O(|T| \times |T'|)$ and corresponds to the size of the table used for
storing such distances in memory.
If we look at (1) now, we see that it is not necessary to
calculate separately the distance between the subtrees
rooted at $i'$ and $j'$ if $i'$ is on the path from $l(i)$ to $i$ and $j'$
is on the path from $l(j)$ to $j$, for $i$ and $j$ nodes of,
respectively, $T$ and $T'$.
We define the set $LR(T)$ of the left roots of $T$ as follows:
$$LR(T) = \{k \mid 1 \leq k \leq |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k)\}.$$
Fig. 9. Zhang and Shasha's dynamic programming algorithm: the forest distance part.
The algorithm for computing the edit distance between $T$
and $T'$ consists then in computing the distance between
each subtree rooted at a node in $LR(T)$ and each subtree
rooted at a node in $LR(T')$. Such subtrees are considered
from the leaves to the root of $T$ and $T'$, that is, in the order
of their indexes.
Zhang and Shasha proved that this algorithm has a
time complexity in $O(|T| \times \min(leaf(T), height(T)) \times |T'| \times
\min(leaf(T'), height(T')))$, $leaf(T)$ designating the number
of leaves of $T$ and $height(T)$ its height. In the worst
case (fan tree), the complexity is in $O(|T|^2 \times |T'|^2)$.
Taking fusion and split operations into account does
not change the above reasoning. However, we must now
store in memory the distance between all subtrees
$T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$ and all the possible values
of $path$ and $path'$.
We must therefore determine the number of values that
$path$ can take. This amounts to determining the total number
of successive fusions that could be applied to a given node.
We recall that $path$ is a list of pairs $(e \text{ or } u, v)$. Let
$path = \{(e \text{ or } u, v)_1, (e \text{ or } u, v)_2, \ldots, (e \text{ or } u, v)_\ell\}$ be the list for node $i$
of $T$. The first fusion can be performed only with a child $v_1$
of $i$. If $d$ is the maximum degree of $T$, there are $d$ possible
choices for $v_1$. The second fusion can be done with one of
the children of $i$ or with one of its grandchildren. Let $v_2$ be
the node chosen. There are $d + d^2$ possible choices for $v_2$.
Following the same reasoning, there are $\sum_{k=1}^{\lambda} d^k$ possible
choices for the $\lambda$th node $v_\lambda$ to be fusioned with $i$.
Furthermore, we must take into account the fact that a
fusion can concern a node or an edge. The total number of
values possible for the variable $path$ is therefore:
$$2^{\ell} \prod_{k=1}^{\ell} \sum_{j=1}^{k} d^j = 2^{\ell} \prod_{k=1}^{\ell} \frac{d^{k+1} - 1}{d - 1},$$
that is:
$$2^{\ell} \left(\frac{1}{d-1}\right)^{\ell} \prod_{k=1}^{\ell} (d^{k+1} - 1) < 2^{\ell} \left(\frac{1}{d-1}\right)^{\ell} d^{(\ell+1)(\ell+2)/2}.$$
A node $i$ may then be involved in $O((2d)^{\ell})$ possible
successive (node/edge) fusions.
As indicated, we must store in memory the distance
between each subtree $T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$ for all
possible values of $path$ and $path'$. The space complexity of
our algorithm is thus in $O((2d)^{\ell} \times (2d')^{\ell} \times |T| \times |T'|)$, with $d$
and $d'$ the maximum degrees of, respectively, $T$ and $T'$.
The computation of the time complexity of our algorithm
is done in a similar way as for the algorithm of Zhang and
Shasha. For each node of $T$ and $T'$, one must compute the
number of subtree distance computations the node will be
involved in by considering all subtrees rooted in, respectively,
a node of $LR(T)$ and a node of $LR(T')$. In our case,
one must also take into account for each node the possibility
of applying a fusion. This leads to a time complexity in
$$O((2d)^{\ell} \times |T| \times \min(leaf(T), height(T)) \times (2d')^{\ell} \times |T'| \times \min(leaf(T'), height(T'))).$$
This complexity suggests that the fusion operations may
be used only for reasonable trees (typically, less than
100 nodes) and small values of $\ell$ (typically, less than 4). It is,
however, important to observe that the overall number of
fusions one may perform can be much greater than $\ell$
without affecting the worst-case complexity of the algorithm.
Indeed, any number of fusions can be made while
still retaining the bound of
$$O((2d)^{\ell} \times |T| \times \min(leaf(T), height(T)) \times |T'| \times \min(leaf(T'), height(T')))$$
so long as one does not realize more than $\ell$ consecutive
fusions for each node.
In general, also, most interesting tree representations of
an RNA are of small enough size as will be shown next,
together with some initial results obtained in practice.
5 APPLICATION TO RNA SECONDARY STRUCTURES
COMPARISON
The algorithm presented in the previous section has been
coded using C++. An online version is available at http://
www-igm.univ-mlv.fr/~allali/migal/.
We recall that RNAs are relatively small molecules with
sizes limited to a few kilobases. For instance, the small
ribosomal subunit of Sulfolobus acidocaldarius (D14876) is
made up of 1,147 bases. Using the representation shown in
Fig. 2b, the tree obtained contains 440 internal nodes and
567 leaves, that is 1,007 nodes overall. Using the representa-
tion in Fig. 2d, the tree is composed of 78 nodes. Finally, the
tree obtained using the representation given in Fig. 2e
contains only 48 nodes. We therefore see that even for large
RNAs, any of the known abstract tree-representations (that
is, representations which take the elements of the secondary
structure of an RNA into account) that we can use leads to a
tree of manageable size for our algorithm. In fact, for small
values of $\ell$ (2 or 3), the tree comparison takes reasonable
time (a few minutes) and memory (less than 1 GB).
As we already mentioned, a fusion (respectively, split) can
be viewed as an alternative to a deletion (respectively,
insertion) followed by a relabeling. Therefore, the cost
function for a fusion must be chosen carefully.
To simplify, we reason on the cost of a node fusion
without considering the label of the edges leading to the
nodes that are fusioned with a father. The formal definition
of the cost functions takes the edges also into account.
Let us assume that the cost function returns a real
value between zero and one. If we want to compute the
cost of a fusion between two nodes $u$ and $v$, the aim is to
give to such fusion a cost slightly greater than the cost of
deleting $v$ and relabeling $u$; that is, we wish to have
$cost_{node\_fusion}(u, v) = \min(cost_{del}(v) + t,\ 1)$. The parameter $t$
is a tuning parameter for the fusion.
Suppose that the new node $w$ resulting from the fusion of
$u$ and $v$ matches with another node $z$. The cost of this match
is $cost_{match}(w, z)$. If we do not allow for node fusions, the
algorithm will first match $u$ with $z$, then will delete $v$. If we
compare the two possibilities, on one hand we have a total
cost of $cost_{node\_fusion}(u, v) + cost_{match}(w, z)$ for the fusion,
that is, $cost_{del}(v) + t + cost_{match}(w, z)$; on the other hand, a
cost of $cost_{del}(v) + cost_{match}(u, z)$. Thus, $t$ represents the gain
that must be obtained by $cost_{match}(w, z)$ with regard to
$cost_{match}(u, z)$, that is, by a match without fusion. This is
illustrated in Fig. 10.
In this example, the cost associated with the path on the top
is $cost_{match}(5, 9) + cost_{del}(3)$. The path at the bottom has a cost
of $cost_{node\_fusion}(5, 3) = cost_{del}(3) + t$ for the node fusion, to
which is added a relabeling cost of $cost_{match}(8, 9)$, leading to a
total of $cost_{match}(8, 9) + cost_{del}(3) + t$. A node fusion will
therefore be chosen if $cost_{match}(8, 9) + t < cost_{match}(5, 9)$,
that is, if the score of a match with fusion is better by at
least $t$ than a match without fusion.
We apply the same reasoning to the cost of an edge fusion.
The cost functions for a node and for an edge fusion between a
node $u$ and a node $v$, with $e_u$ denoting the edge leading to $u$
and $e_v$ the edge leading to $v$, are defined as follows:
$$cost_{node\_fusion}(u, v) = cost_{del}(v) + cost_{del}(e_v) + t,$$
$$cost_{edge\_fusion}(u, v) = cost_{del}(u) + cost_{del}(e_u) + t + \sum_{c\ \text{sibling of}\ v} \text{cost of deleting the subtree rooted at } c.$$
The tuning parameter $t$ is thus an important parameter
that allows us to control fusions. Always considering a cost
function that produces real values between 0 and 1, if $t$ is
equal to 0.1, a fusion will be performed only if it improves
the score by 0.1. In practice, we use values of $t$ between 0
and 0.2.
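The following small Python sketch illustrates these cost functions and the decision of Fig. 10 with hypothetical cost values; the numbers and helper names are ours, not taken from the paper.

```python
# Illustrative sketch of the fusion cost functions above, with all elementary
# costs assumed to lie in [0, 1] and a tuning parameter t.

T = 0.05  # tuning parameter: the gain a fused match must provide

def cost_node_fusion(cost_del_v, cost_del_ev, t=T):
    # deleting node v and its incoming edge e_v, plus the tuning penalty t;
    # capped at 1 to stay within the assumed [0, 1] range
    return min(cost_del_v + cost_del_ev + t, 1.0)

def cost_edge_fusion(cost_del_u, cost_del_eu, cost_del_siblings, t=T):
    # deleting node u, its incoming edge e_u, the subtrees hanging off the
    # other children of u, plus the tuning penalty t
    return cost_del_u + cost_del_eu + t + cost_del_siblings

# Decision of Fig. 10 with hypothetical costs: fuse nodes 5 and 3 only if the
# fused node (8) matches node 9 at least t better than node 5 alone would.
cost_match_5_9, cost_match_8_9, cost_del_3 = 0.40, 0.30, 0.20
without_fusion = cost_del_3 + cost_match_5_9
with_fusion = cost_node_fusion(cost_del_3, 0.0) + cost_match_8_9
print(with_fusion < without_fusion)   # True: 0.55 < 0.60, fusion is chosen
```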
For practical considerations, we also set a further
condition on the cost and relabeling functions related to a
node or edge resulting from a fusion, which is as follows:
$$cost_{del}(a) + cost_{del}(b) \leq cost_{del}(c),$$
with $c$ the label of the node/edge resulting from the fusion
of the nodes/edges labeled $a$ and $b$. Indeed, if this condition
is not fulfilled, the algorithm may systematically fusion the
nodes or edges to reduce the overall cost.
An important consequence of the conditions seen above
is that a node fusion cannot be followed by an edge fusion.
Below, the node fusion followed by an edge fusion costs:
$$cost_{del}(b) + cost_{del}(B) + t + cost_{del}(AB) + cost_{del}(a) + t.$$
The alternative is to destroy node B (together with edge b) and
then to operate an edge fusion, the whole costing
$$cost_{del}(b) + cost_{del}(B) + cost_{del}(A) + cost_{del}(a) + t.$$
The difference between these two costs is $t + cost_{del}(AB) - cost_{del}(A)$,
which is always positive.
This observation allows us to significantly improve the
practical performance of the algorithm.
We have applied the new algorithm on the two RNAs
shown in Fig. 5 (these are eukaryotic nuclear P RNAs from
Saccharomyces uvarum and Saccharomyces kluveri) and coded
using the same type of representation as in Fig. 2d. We have
limited the number of consecutive fusions to one ($\ell = 1$).
The computation of the edit distance between the two trees
taking node and edge fusions into account besides dele-
tions, insertions, and relabeling has required less than a
second. The total cost allowing for fusions is 6.18 with
$t = 0.05$ against 7.42 without fusions. As indicated in Fig. 11, the
last two problems discussed in Section 3 disappear thanks
to some edge fusions (represented by the boxes).
An example of node fusions required when comparing
two real RNAs is given in Fig. 12. The RNAs are coded
using the same type of representation as in Fig. 2d. The
figure shows part of the mapping obtained between the
small subunits of two ribosomal RNAs retrieved from [8]
(from Bacillaria paxillifer and Calicophoron calicophorum). The
node fusion has been circled.
Fig. 10. Illustration of the gain that must be obtained using a fusion
instead of a deletion/relabeling.
6 MULTILEVEL RNA STRUCTURE COMPARISON:
SKETCH OF THE MAIN IDEA
We briefly discuss now an approach which addresses in
part the scattering effect problem (see Section 3). This
approach is being currently validated and will be more fully
described in another paper. We therefore present here the
main idea only.
To start with, it is important to understand the nature of
this scattering effect. Let us consider first a trivial case: the
cost functions are unitary (insertion, deletion, and relabeling
each cost 1) and we compute the edit distance between two
trees composed of a single node each. The obtained mapping
will associate the single node in the first tree with the single
one in the second tree, independently from the labels of the
nodes. This example can be extended to the comparison of
two trees whose node labels are all different. In this case, the
obtained mapping corresponds to the maximum home-
omorphic subtree common to both trees.
If the two RNA secondary structures compared using a
tree representation which models both the base pairs and
the nonpaired bases are globally similar but present some
local dissimilarity, then an edit operation will almost
always associate the nodes of the locally divergent regions
that are located at the same positions relatively to the global
common structure. This is a normal, expected behavior in
the context of an editing. However, it seems clear also when
we look at Fig. 4 that the bases of a terminal loop should not
be mapped to those of a multiple loop.
To reduce this problem, one possible solution consists of
adding to the nodes corresponding to a base information
concerning the element of secondary structure to which the
base belongs. The cost functions are then adapted to take
this type of information into account. This solution,
although producing interesting results, is not entirely
satisfying. Indeed, the algorithm will tend to systematically
put into correspondence nodes (and, thus, bases) belonging
to structural elements of the same type, which is also not
necessarily a good choice as these elements may not be
related in the overall structure. It seems therefore preferable
to have a structural approach first, mapping initially the
elements of secondary structure to each other and taking
care of the nucleotides in a second step only.
The approach we have elaborated may be briefly
described as follows: Given two RNA secondary structures,
the first step consists in coding the RNAs by trees of type c
in Fig. 2 (nodes represent bulges or multiple, internal or
Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.
Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.
terminal loops while edges code for helices). We then
compute the edit distance between these two trees using the
two novel fusion operations described in this paper. This
also produces a mapping between the two trees. Each node
and edge of the trees, that is, each element of secondary
structure, is then colored according to this mapping. Two
elements are thus of a same color if they have been mapped
in the first step. We now have at our disposal an
information concerning the structural similarity of the two
RNAs. We can then code the RNAs using a tree of type b.
To these trees, we add to each node the colour of the
structural element to which it belongs. We need now only to
restrict the match operation to nodes of the same color. Two
nodes can therefore match only if they belong to secondary
elements that have been identified in the first step as being
similar.
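The color restriction itself amounts to a modified relabeling cost. A minimal sketch (hypothetical names and values, not the authors' code) could look as follows:

```python
# Sketch of the color restriction described above: after the structure-level
# comparison, every base-level node carries the color of its secondary
# structure element, and relabeling is forbidden across colors.

FORBIDDEN = float("inf")   # in practice, any cost larger than any useful match

def colored_match_cost(label1, color1, label2, color2):
    """Relabel cost restricted to nodes whose structural elements were
    mapped to each other (same color) in the first step."""
    if color1 != color2:
        return FORBIDDEN
    return 0.0 if label1 == label2 else 1.0

print(colored_match_cost("A-U", "helix_3", "G-C", "helix_3"))  # 1.0
print(colored_match_cost("A-U", "helix_3", "A-U", "loop_7"))   # inf
```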
To illustrate the use of this algorithm, we have applied it
to the two RNAs of Fig. 4. Fig. 13 presents the trees of type
(Fig. 2c) coding for these structures, and the mapping
produced by the computation of the edit distance with
fusion. In particular, the noncolored fine dashed nodes and
edges correspond, respectively, to deleted nodes/edges.
One can see that in the left RNA, the two hairpin loops
involved in the scattering effect problem in Fig. 4 (indicated
by the arrows) have been destroyed and will not be mapped
to one another anymore when the edit operations are
applied to the trees of the type in Fig. 2b.
This approach allows us to obtain interesting results.
Furthermore, it considerably reduces the complexity of
the algorithm for comparing two RNA structures coded
with trees of the type in Fig. 2b. However, it is important to
observe that the scattering effect problem is not specific to
the tree representations of the type in Fig. 2b. Indeed, the
same problem may be observed, to a lesser degree, with
trees of the type in Fig. 2c. This is the reason why we
generalize the process by adopting a modelling of RNA
secondary structures at different levels of abstraction. This
model, and the accompanying algorithm for comparing
RNA structures, is in progress.
7 FURTHER WORK AND CONCLUSION
We have proposed an algorithm that addresses two main
limitations of the classical tree edit operations for compar-
ing RNA secondary structures. Its complexity is high in
theory if many fusions are applied in succession to the same
node, but the total number of fusions that may be performed
is not limited. In practice, the algorithm is fast enough for
most situations one is likely to meet.
To provide a more complete solution to the problem of
the scattering effect, we also proposed a new multilevel
approach for comparing two RNA secondary structures
whose main idea was sketched in this paper. Further details
and evaluation of this novel comparison scheme will be the
subject of another paper.
REFERENCES
[1] D. Bouthinon and H. Soldano, A New Method to Predict the
Consensus Secondary Structure of a Set of Unaligned RNA
Sequences, Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.
[2] J.W. Brown, The Ribonuclease P Database, Nucleic Acids
Research, vol. 24, no. 1, p. 314, 1999.
[3] N. el Mabrouk and F. Lisacek, Very Fast Identification of
RNA Motifs in Genomic DNA. Application to tRNA Search in the
Yeast Genome, J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.
[4] I. Hofacker, The Vienna RNA Secondary Structure Server, 2003.
[5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M.
Tacker, and P. Schuster, Fast Folding and Comparison of RNA
Secondary Structures, Monatshefte für Chemie, vol. 125, pp. 167-
188, 1994.
[6] M. Höchsmann, T. Töller, R. Giegerich, and S. Kurtz, Local
Similarity in RNA Secondary Structures, Proc. IEEE Computer Soc.
Conf. Bioinformatics, p. 159, 2003.
[7] M. Höchsmann, B. Voss, and R. Giegerich, Pure Multiple RNA
Secondary Structure Alignments: A Progressive Profile Ap-
proach, IEEE/ACM Trans. Computational Biology and Bioinfor-
matics, vol. 1, no. 1, pp. 53-62, 2004.
[8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, The
European Database on Small Subunit Ribosomal RNA, Nucleic
Acids Research, vol. 30, no. 1, pp. 183-185, 2002.
[9] T. Jiang, L. Wang, and K. Zhang, Alignment of Trees: An
Alternative to Tree Edit, Proc. Fifth Ann. Symp. Combinatorial
Pattern Matching, pp. 75-86, 1994.
[10] F. Lisacek, Y. Diaz, and F. Michel, Automatic Identification of
Group I Intron Cores in Genomic DNA Sequences, J. Molecular
Biology, vol. 235, no. 4, pp. 1206-1217, 1994.
Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dash lines indicate some of the associations resulting
from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for
hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.
[11] B. Shapiro, An Algorithm for Multiple RNA Secondary Struc-
tures, Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-
393, 1988.
[12] B.A. Shapiro and K. Zhang, Comparing Multiple RNA Secondary
Structures Using Tree Comparisons, Computer Applications in the
Biosciences, vol. 6, no. 4, pp. 309-318, 1990.
[13] K.-C. Tai, The Tree-to-Tree Correction Problem, J. ACM, vol. 26,
no. 3, pp. 422-433, 1979.
[14] K. Zhang and D. Shasha, Simple Fast Algorithms for the Editing
Distance between Trees and Related Problems, SIAM J. Comput-
ing, vol. 18, no. 6, pp. 1245-1262, 1989.
[15] M. Zuker, Mfold Web Server for Nucleic Acid Folding and
Hybridization Prediction, Nucleic Acids Research, vol. 31, no. 13,
pp. 3406-3415, 2003.
Julien Allali studied at the University of Marne-
la-Vallée (France), where he received the MSc
degree in computer science and computational
genomics. In 2001, he began his PhD in
computational genomics at the Gaspard Monge
Institute of the University of Marne-la-Vallée. His
thesis focused on the study of RNA secondary
structures and, in particular, their comparison
using a tree distance. In 2004, he received the
PhD degree.
Marie-France Sagot received the BSc degree in computer science from
the University of São Paulo, Brazil, in 1991, the PhD degree in
theoretical computer science and applications from the University of
Marne-la-Vallée, France, in 1996, and the Habilitation from the same
university in 2000. From 1997 to 2001, she worked as a research
associate at the Pasteur Institute in Paris, France. In 2001, she moved
to Lyon, France, as a research associate at the INRIA, the French
National Institute for Research in Computer Science and Control. Since
2003, she has been the Director of Research at the INRIA. Her research
interests are in computational biology, algorithmics, and combinatorics.
. For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
Topological Rearrangements and Local Search
Method for Tandem Duplication Trees
Denis Bertrand and Olivier Gascuel
Abstract: The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch [4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR, TBR, ...) cannot be applied directly to duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree Pruning and Regrafting) rearrangement to valid duplication trees allows the whole duplication tree space to be explored. We use these restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is applied to the optimization of the parsimony and minimum evolution criteria. We show through simulations that this method improves all existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to tandemly repeated human Zinc finger genes and observe that our method obtains a much better duplication tree than any other program.
Index Terms: Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc finger genes.
1 INTRODUCTION
Repeated sequences constitute an important fraction of
most genomes, from the well-studied Escherichia coli
bacterial genome [1] to the Human genome [2]. For
example, it is estimated that more than 50 percent of the
Human genome consists of repeated sequences [2], [3].
There exist three major types of repeated sequences:
transposon-derived repeats, micro or minisatellites, and
large duplicated sequences, the last often containing one or
several RNA or protein-coding genes. Micro or minisatel-
lites arise through a mechanism called slipped-strand
mispairing, and are always arranged in tandem: copies of
a same basic unit are linearly ordered on the chromosome.
Large duplicated sequences are also often found in tandem
and, when this is the case, unequal recombination is widely
assumed to be responsible for their formation.
Both the linear order among tandemly repeated se-
quences, and the knowledge of the biological mechanisms
responsible for their generation, suggest a simple model of
evolution by duplication. This model, first described by
Fitch in 1977 [4], introduces tandem duplication trees as
phylogenies constrained by the unequal recombination
mechanism. Although it is a completely different biological mechanism, slipped-strand mispairing leads to the same
duplication model [5]. A formal recursive definition of this
model is provided in Section 2, but its main features can be
grasped from the examples of Fig. 1. Fig. 1a shows the
duplication history of the 13 Antennapedia-class homeobox
genes from the cognate group [6]. In this history, the
ancestral locus has undergone a series of simple duplica-
tion events where one of the genes has been duplicated into
two adjacent copies. Starting from the unique ancestral
gene, this series of events has produced the extant locus
containing the 13 linearly ordered contemporary genes. It is
easily seen [7] that trees only containing simple duplication
events are equivalent to binary search trees with labeled
leaves. They differ from standard phylogenies in that node
children have left/right orientation. Fig. 1b shows another
example corresponding to the nine variable genes of the
human T cell receptor Gamma (TRGV) locus [8]. In this
history, the most recent event involves a double duplica-
tion where two adjacent genes have been simultaneously
duplicated to produce four adjacent copies. Duplication
trees containing multiple duplication events differ from
binary search trees, but are less general than phylogenies.
The model proposed by Fitch [4] covers both simple and
multiple duplication trees.
Fitch's paper [4] received relatively little attention at the
time of its publication probably due to the lack of available
sequence data. Rediscovered by Benson and Dong [9],
Tang et al. [10], and Elemento et al. [8], tandemly repeated
sequences and their suggested duplication model have
recently received much interest, providing several new
computational biology problems and challenges [11], [12].
The main challenge consists of creating algorithms
incorporating the model constraints to reconstruct the duplication history of tandemly repeated sequences.
. The authors are with Projet Méthodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS-Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier Cedex 5, France. E-mail: gascuel@lirmm.fr.
Manuscript received 11 Oct. 2004; revised 17 Dec. 2004; accepted 20 Dec.
2004; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org, and reference IEEECS Log Number TCBBSI-0170-1004.
Indeed, accurate reconstruction of duplication histories
will be useful to elucidate various aspects of genome
evolution. They will provide new insights into the
mechanisms and determinants of gene and protein domain
duplication, often recognized as major generators of
novelty [13]. Several important gene families, such as
immunity-related genes, are arranged in tandem; better
understanding their evolution should provide new insights
into their duplication dynamics and clues about their
functional specialization. Studying the evolution of micro
and minisatellites could resolve unanswered biological
questions regarding human migrations or the evolution of
bacterial diseases [14].
Given a set of aligned and ordered sequences (DNA or
proteins), the aim is to find the duplication tree that best
explains these sequences, according to usual criteria in
phylogenetics, e.g., parsimony or minimum evolution. Few
studies have focused on the computational hardness of this
problem, and all of these studies only deal with the
restricted version where simultaneous duplication of multi-
ple adjacent segments is not allowed. In this context, Jaitly
et al. [15] shows that finding the optimal single copy
duplication tree with parsimony is NP-Hard and that this
problem has a PTAS (Polynomial Time Approximation
Scheme). Another closely related PTAS is given by Tang
et al. [10] for the same problem. On the other hand,
Elemento et al. [7] describes a polynomial distance-based
algorithm that reconstructs optimal single copy tandem
duplication trees with minimum evolution.
However, it is commonly believed, as in phylogeny, that
most (especially multiple) duplication tree inference pro-
blems are NP-Hard. This explains the development of
heuristic approaches. Benson and Dong [9] provides various parsimony-based heuristic reconstruction algorithms to infer duplication trees, especially from minisatellites. Elemento et al. [8] present an enumerative algorithm that computes the most parsimonious duplication tree; this algorithm (by its exhaustive approach) is limited to datasets of less than 15 repeats. Several distance-based methods have also been described. The WINDOW method [10] uses an agglomeration scheme similar to UPGMA [16] and NJ [17], but the cost function used to judge potential duplications is based on the assumption that the sequences follow a molecular clock mode of evolution. The DTSCORE method [18] uses the same scheme but corrects this limitation using a score criterion [19], like ADDTREE [20]. DTSCORE can be used with sequences that do not follow the molecular clock, which is, for example, essential when dealing with gene families containing pseudogenes that evolve much faster than functional genes.
Finally, GREEDY SEARCH [21] corresponds to a different
approach divided into two steps: First, a phylogeny is
computed with a classical reconstruction method (NJ), then,
with nearest neighbor interchange (NNI) rearrangements, a
duplication tree close to this phylogeny is computed. This
approach is noteworthy since it implements topological
rearrangements which are highly useful in phylogenetics
[22], but it works blindly and does not ensure that good
duplication trees will be found (cf. Section 5.2).
Topological rearrangements have an essential function in
phylogenetic inference, where they are used to improve an
initial phylogeny by subtree movement or exchange.
Rearrangements are very useful for all common criteria
(parsimony, distance, maximum likelihood) and are inte-
grated into all classical programs like PAUP* [23] or
PHYLIP [24]. Furthermore, they are used to define various
distances between phylogenies and are the foundation of
much mathematical work [25]. Unfortunately, they cannot
be directly used here, as shown by a simple example given later.
Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6].
(b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In
both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.
Indeed, when applied to a duplication tree, they do
not guarantee that another valid duplication tree will be
produced.
In this paper, we describe a set of topological rearrange-
ments to stay inside the duplication tree space and explore
the whole space from any of its elements. We then show the
advantages of this approach for duplication tree inference
from sequences. In Section 2, we describe the duplication
model introduced by [4], [8], [10], as well as an algorithm to
recognize duplication trees in linear time. Thanks to this
algorithm, we restrict the neighborhoods defined by
classical phylogeny rearrangements, namely, nearest neigh-
bor interchange (NNI) and subtree pruning and regrafting
(SPR), to valid duplication trees. We demonstrate (Section 3)
that for NNI moves this restricted neighborhood does not
allow the exploration of the whole duplication tree space.
On the other hand, we demonstrate that the restricted
neighborhood of SPR rearrangement allows the whole
space to be explored. In this way, we define a local search
method, applied here to parsimony and minimum evolu-
tion (Section 4). We compare this method to other existing
approaches using simulated and real data sets (Section 5).
We conclude by discussing the positive results obtained by
our method, and indicate directions for further research
(Section 6).
2 MODEL
2.1 Duplication History and Duplication Tree
The tandem duplication model used in this article was first
introduced by Fitch [4] then studied independently by [8],
[10]. It is based on unequal recombination which is assumed
to be the sole evolution mechanism (except point mutations)
acting on sequences. Although it is a completely different
biological mechanism, slipped-strand mispairing leads to
the same duplication model [5], [9].
Let O = (1, 2, ..., n) be the ordered set of sequences representing the extant locus. Initially containing a single copy, the locus grew through a series of consecutive duplications. As shown in Fig. 2a, a duplication history may contain simple duplication events. When the duplicated fragment contains two, three, or k repeats, we say that it involves a multiple duplication event. Under this duplication model, a duplication history is a rooted tree with n labeled and ordered leaves, in which internal nodes of degree 3 correspond to duplication events. In a real duplication history (Fig. 2a), the time intervals between consecutive duplications are completely known, and the internal nodes are ordered from top to bottom according to the moment they occurred in the course of evolution. Any ordered segment set of the same height then represents an ancestral state of the locus. We call such a set a floor, and we say that two nodes u, v are adjacent (written u ≺ v) if there is a floor where u and v are consecutive and u is on the left of v.
However, in the absence of a molecular clock mode of
evolution (a typical problem), it is impossible to recover the
order between the duplication events of two different
lineages from the sequences. In this case, we are only able to
infer a duplication tree (DT) (Fig. 2b) or a rooted
duplication tree (RDT) (Fig. 2c).
A duplication tree is an unrooted phylogeny with
ordered leaves, whose topology is compatible with at least
one duplication history. Also, internal nodes of duplication
trees are partitioned into events (or blocks following
[10]), each containing one or more (ordered) nodes. We distinguish simple duplication events, which contain a unique internal node, and multiple duplication events, which group a series of adjacent and simultaneous duplications (both kinds appear in Fig. 2c). Let E = (s_i, s_{i+1}, ..., s_k) denote an event containing internal nodes s_i, s_{i+1}, ..., s_k in left to right order. We say that two consecutive nodes of the same event are adjacent (s_j ≺ s_{j+1}), just as in histories, since any event belongs to a floor in all of the histories that are compatible with the DT being considered.
Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the
possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position ρ1 on (b).
The same notation will also be used for leaves, to express the segment order in the extant locus. When the tree is rooted, every internal node s_j is unambiguously associated to one parent and two child nodes; moreover, one child of s_j is left and the other one is right, which are denoted l_j and r_j, respectively. In this case, for any duplication history that is compatible with this tree, the child nodes of an event E = (s_i, s_{i+1}, ..., s_k) are organized as follows:

$$l_i \prec l_{i+1} \prec \cdots \prec l_k \prec r_i \prec r_{i+1} \prec \cdots \prec r_k .$$
In [8], [26], [27], it was shown that rooting a duplication tree is different from rooting a phylogeny: the root of a duplication tree necessarily lies on the tree path between the most distant repeats on the locus, i.e., 1 and n; moreover, the root is always located above all multiple duplications. For example, Fig. 1b shows that there are only three valid root positions.
2.2 Recursive Definition of Rooted and Unrooted
Duplication Trees
A duplication tree is compatible with at least one duplication history. This suggests a recursive definition, which progressively reconstructs a possible history, given a phylogeny T and a leaf ordering O. We define a cherry (l, s, r) as a pair of leaves (l and r) separated by a single node s in T, and we call C(T) the set of cherries of T. This recursive definition reverses evolution: it searches for a visible duplication event, agglomerates this event, and checks whether the reduced tree is a duplication tree. In the case of rooted trees, we have:

(T, O) defines a duplication tree with root ρ if and only if:

1. (T, O) only contains ρ, or
2. there is in C(T) a series of cherries (l_i, s_i, r_i), (l_{i+1}, s_{i+1}, r_{i+1}), ..., (l_k, s_k, r_k), with k ≥ i and l_i ≺ l_{i+1} ≺ ... ≺ l_k ≺ r_i ≺ r_{i+1} ≺ ... ≺ r_k in O, such that (T', O') defines a duplication tree with root ρ, where T' is obtained from T by removing l_i, l_{i+1}, ..., l_k, r_i, r_{i+1}, ..., r_k, and O' is obtained by replacing l_i, l_{i+1}, ..., l_k, r_i, r_{i+1}, ..., r_k by s_i, s_{i+1}, ..., s_k in O.

The definition for unrooted trees is quite similar:

(T, O) defines an unrooted duplication tree if and only if:

1. (T, O) contains a single segment, or
2. the same as for rooted trees, with (T', O') now defining an unrooted duplication tree.
Those definitions provide a recursive algorithm, RADT (Recognition Algorithm for Duplication Trees), to check whether any given phylogeny with ordered leaves is a duplication tree. In case of success, this algorithm can also be used to reconstruct duplication events: at each step, the series of internal nodes denoted above by s_i, s_{i+1}, ..., s_k is a duplication event. When the tree is rooted, l_j is the left child of s_j and r_j its right child, for every j with i ≤ j ≤ k. This algorithm can be implemented in O(n) time [26], where n is the number of leaves. Another linear algorithm is proposed by Zhang et al. [21] using a top-down approach instead of a bottom-up one, but it applies only to rooted duplication trees.
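To make the recursive definition concrete, the following Python sketch checks the rooted, simple-event case only (every agglomerated event is a single cherry). The nested-tuple tree encoding, the function names, and the use of string labels for agglomerated nodes are illustrative assumptions; neither the O(n) implementation of [26] nor multiple duplication events are handled here.

```python
def is_simple_duplication_tree(tree, order):
    """Rooted, simple-event variant of the recursive definition: repeatedly
    find a cherry (l, s, r) whose leaves l and r are consecutive (l just left
    of r) in the current ordering, agglomerate it, and continue until a single
    segment remains.  `tree` is a nested tuple (left, right); leaves are the
    labels listed in `order`."""
    def has_cherry(t, l, r):
        if not isinstance(t, tuple):
            return False
        return t == (l, r) or has_cherry(t[0], l, r) or has_cherry(t[1], l, r)

    def collapse(t, l, r, label):
        # Replace the cherry (l, r) by the single pseudo-leaf `label`.
        if not isinstance(t, tuple):
            return t
        if t == (l, r):
            return label
        return (collapse(t[0], l, r, label), collapse(t[1], l, r, label))

    current, ordering, step = tree, list(order), 0
    while isinstance(current, tuple):
        for i in range(len(ordering) - 1):
            l, r = ordering[i], ordering[i + 1]
            if has_cherry(current, l, r):          # visible duplication event
                label = "s%d" % step
                step += 1
                current = collapse(current, l, r, label)
                ordering[i:i + 2] = [label]
                break
        else:
            return False    # no agglomeratable event: not a duplication tree
    return True

# For example (hypothetical four-segment locus):
# is_simple_duplication_tree(((1, 2), (3, 4)), (1, 2, 3, 4))  -> True
# is_simple_duplication_tree(((1, 3), (2, 4)), (1, 2, 3, 4))  -> False
```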
3 TOPOLOGICAL REARRANGEMENTS FOR
DUPLICATION TREES
This section shows how to explore the DT space using SPR
rearrangements. First, we describe some NNI, SPR, and
TBR rearrangement properties with standard phylogenies.
But, these rearrangements cannot be directly used to
explore the DT space. Indeed, when applied to a duplica-
tion tree, they do not guarantee that another valid
duplication tree will be produced. So, we have decided to
restrict the neighborhood defined by those rearrangements
to duplication trees. If we only used NNI rearrangements,
the neighborhood would be too restricted (as shown by a
simple example) and would not allow the whole DT space
to be explored. On the other hand, we can distinguish two
types of SPR rearrangements which, when applied to a
rooted duplication tree guarantee that another valid
duplication tree will be produced. Thanks to these specific
rearrangements, we demonstrate that restricting the neigh-
borhood of SPR rearrangements allows the whole space of
duplication trees to be explored.
18 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: T, whose RDT is a rooted version; T' is obtained by applying NNI(5,4) around the bold edge; none of the possible root positions of T' (a, b, c, and d) leads to a valid RDT, cf. tree (b), which corresponds to root b in T'.
3.1 Topological Rearrangements for Phylogeny
There are many ways of carrying out topological rearrange-
ments on phylogeny [22]. We only describe NNI (Nearest
Neighbor Interchange), SPR (Subtree Pruning Regrafting),
and TBR(Tree Bisection and Reconnection) rearrangements.
The NNI move is a simple rearrangement which
exchanges two subtrees adjacent to the same internal edge
(Figs. 3 and 4). There are two possible NNIs for each
internal edge, so 2(n - 3) neighboring trees for one tree with n leaves. This rearrangement allows the whole space of phylogenies to be explored; i.e., there is a succession of NNI moves making it possible to transform any phylogeny P1 into any phylogeny P2 [28].
The SPR move consists of pruning a subtree and
regrafting it, by its root, to an edge of the resulting tree
(Figs. 6 and 7). We note that the neighborhood of a tree
defined by the NNI rearrangements is included in the
neighborhood defined by SPRs. The latter rearrangement
defines a neighborhood of size 2(n - 3)(2n - 7) [25].
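As a quick worked illustration of these neighborhood sizes (the numbers below are ours, not taken from the paper): for a tree with n = 12 leaves,

$$2(n-3) = 2 \cdot 9 = 18 \ \text{NNI neighbors}, \qquad 2(n-3)(2n-7) = 2 \cdot 9 \cdot 17 = 306 \ \text{SPR neighbors}.$$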
Finally, TBR generalizes SPR by allowing the pruned
subtree to be reconnected by any of its edges to the resulting
tree. These three rearrangements (NNI, SPR, and TBR) are
reversible; that is, if T' is obtained from T by a particular rearrangement, then T can be obtained from T' using the same type of rearrangement.
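The SPR move itself is easy to state operationally. The Python sketch below performs a single (unrestricted) SPR on an unrooted tree stored as an adjacency map; the data structure, the function name, and the integer node labels are assumptions made for illustration, not the authors' implementation. The restricted SPR of Section 3.4 is obtained by keeping only those results accepted by RADT.

```python
from copy import deepcopy

def spr(adj, u, v, x, y):
    """One SPR move on an unrooted binary tree given as an adjacency map
    (integer node -> list of neighbor nodes).  The edge (u, v) is cut and
    the subtree on the v side is pruned; u is assumed to be an internal
    (degree-3) node, and (x, y) must be an edge of the remaining tree.
    A new adjacency map is returned."""
    t = deepcopy(adj)
    # 1. Cut the edge (u, v); the pruned subtree keeps v as its attachment point.
    t[u].remove(v)
    t[v].remove(u)
    # 2. Suppress u, now of degree 2, by joining its two remaining neighbors.
    a, b = t[u]
    t[a].remove(u); t[b].remove(u)
    t[a].append(b); t[b].append(a)
    del t[u]
    # 3. Subdivide the target edge (x, y) with a fresh node and attach v to it.
    new = max(t) + 1
    t[x].remove(y)
    t[y].remove(x)
    t[new] = [x, y, v]
    t[x].append(new)
    t[y].append(new)
    t[v].append(new)
    return t
```

Enumerating all (pruned edge, target edge) pairs and discarding duplicates yields the SPR neighborhood whose size is quoted above.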
3.2 NNI Rearrangements Do Not Stay in DT Space
The classical phylogenetic rearrangements (NNI, SPR,
TBR,...) do not always stay in DT space. So, if we apply
an NNI to a DT (e.g., Fig. 3), the resulting tree is not always
a valid DT. This property is also true for SPR and TBR
rearrangements since NNI rearrangements are included in
these two rearrangement classes.
3.3 Restricted NNI Does Not Allow the Whole DT
Space to Be Explored
To restrict the neighborhood defined by NNI rearrange-
ments to duplication trees, each element of the neighbor-
hood is filtered thanks to the recognition algorithm (RADT).
But, this restricted neighborhood does not allow the whole
DT space to be explored. Fig. 4 gives an example of a
duplication tree, T, the neighborhood of which does not
contain any DT. So, its restricted neighborhood is empty,
and there is no succession of restricted NNIs allowing T to
be transformed into any other DT.
3.4 Restricted SPR Allows the Whole DT Space to
Be Explored
As before, we restrict (using RADT) the neighborhood
defined by SPR rearrangements to duplication trees. We
name restricted SPR, SPR moves that, starting from a
duplication tree, lead to another duplication tree.
Main Theorem. Let T1 and T2 be any given duplication trees; T1 can be transformed into T2 via a succession of restricted SPRs.
Proof. To demonstrate the Main Theorem, we define two types of special SPR that ensure staying within the space of rooted duplication trees (RDT). Given these two types of SPRs, we demonstrate that it is possible to transform any rooted duplication tree into a caterpillar, i.e., a rooted tree in which all internal nodes belong to the tree path between the leaf 1 and the tree root ρ (cf. Fig. 5). This result demonstrates the theorem. Indeed, let T1 and T2 be two RDTs. We can transform T1 and T2 into a caterpillar by a succession of restricted SPRs. So, it is possible to transform T1 into T2 by a succession of restricted SPRs, with (possibly) a caterpillar as intermediate tree. This property holds since the reciprocal movement of an SPR is an SPR. As the two SPR types proposed ensure that we stay within the RDT space, we have the desired result for rooted duplication trees. And this result extends to unrooted duplication trees, since two DTs can be arbitrarily rooted, transformed from one to the other using restricted SPRs, then unrooted. □
The first special SPR allows multiple duplication events to be destroyed. Let E = (s_i, s_{i+1}, ..., s_k) be a duplication event, let r_i and l_k be, respectively, the right child of s_i and the left child of s_k, and let p_i be the father of s_i.
Fig. 5. A six-leaf caterpillar.
Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: T, whose RDT is a rooted version; T' is obtained by exchanging subtrees 1 and (2 5); none of the possible root positions of T' (a, b, and c) leads to a valid duplication tree, cf. tree (b), which corresponds to root b in T'; and the same holds for every neighbor of T obtained by NNI.
The DELETE rearrangement consists of pruning the subtree of root r_i and grafting this subtree onto the edge (s_k, l_k), while l_i is renamed s_i and the edge (l_i, s_i) is deleted. Fig. 6 illustrates this rearrangement.
Lemma 1. DELETE preserves the RDT property.
Proof. Let T be the initial tree (Fig. 6a), let E = (s_i, s_{i+1}, ..., s_k) be an event of T, and let T' be the tree obtained from T by applying DELETE to E (Fig. 6b). The children of any node s_j (i ≤ j ≤ k) are denoted l_j and r_j.
By definition, for any duplication history compatible with T we have

$$l_i \prec l_{i+1} \prec \cdots \prec l_k \prec r_i \prec r_{i+1} \prec \cdots \prec r_k .$$

Thus, there is a way to partially agglomerate T (using an RADT-like procedure) such that these nodes become leaves. The same agglomeration can be applied to T', as only the ancestors of the l_j's and r_j's are affected by DELETE. Now, 1) agglomerate the event E of T, and 2) reduce T' by agglomerating the cherry (l_k, r_i) and then agglomerating the event (s_{i+1}, ..., s_k). Two identical trees follow, which concludes the proof. □
By successively applying DELETE to any duplication tree, we remove all multiple duplication events. The following SPR rearrangement allows duplications to be moved within a simple RDT, i.e., any RDT containing only simple duplications. Let p be a node of a simple RDT T, l its left child, r its right child, and x the left child of r. This rearrangement consists of pruning the subtree of root x and regrafting it to the edge (l, p) (Fig. 7). This rearrangement is an SPR (in fact an NNI); we name it LEFT as it moves the subtree root towards the left. It is obvious that the tree obtained by applying such a rearrangement to a simple RDT is a simple RDT. We now establish the following lemma, which shows that any simple tree can be transformed into a caterpillar.
Lemma 2. Let T be a simple RDT; T can be transformed into a caterpillar by a succession of LEFT rearrangements.
Proof. In a caterpillar, all internal nodes are ancestors of leaf 1. If T is not a caterpillar, there is an internal node that is not an ancestor of 1. If this node is the right child of its father, we can apply LEFT to its left child (Fig. 7). If it is the left child of its father, we consider its father: that father cannot be an ancestor of 1, since its children are the node itself and a node to its right. So, we can apply the same argument: either this father is adequate for performing LEFT, or we consider its father again. In this way, we necessarily obtain a node for which the rearrangement is possible. T is then transformed into a caterpillar by successively applying the LEFT rearrangement to nodes which are not on the path between 1 and ρ. After a finite number of steps, all internal nodes are ancestors of 1 and T has been transformed into a caterpillar. This concludes the proof of Lemma 2 and, therefore, of our Main Theorem. □
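As a small illustration of the LEFT move used in this proof, here is one natural reading in Python, on a rooted tree stored with explicit left/right children. The Node class and the function name are assumptions for illustration only, and the left/right placement of the regrafted subtree is chosen so that the in-order leaf sequence (hence the locus order) is preserved.

```python
class Node:
    """Rooted binary tree node with an explicit left/right orientation."""
    def __init__(self, label=None, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def left_move(p):
    """LEFT rearrangement below node p of a simple RDT: prune x, the left
    child of p's right child r, and regraft it on the edge between p and its
    left child l.  The in-order leaf sequence is unchanged, so a simple RDT
    stays a simple RDT.  Assumes p.right is an internal node."""
    l, r = p.left, p.right
    x, rr = r.left, r.right
    p.left = Node(left=l, right=x)   # new duplication node on the former edge (l, p)
    p.right = rr                     # r, left with a single child, is suppressed
```

Repeatedly applying this move to nodes that are not ancestors of leaf 1, as in the proof above, drives any simple RDT toward a caterpillar.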
4 LOCAL SEARCH METHOD
We consider data consisting of an alignment of n segments of length l, and of the ordering O of the segments along
the locus. This alignment has been created before tree
construction and the problem is not to build simultaneously
the alignment and the tree, a much more complicated task
[29]. The aim is to find a (nearly) optimal duplication tree,
where optimal is defined by some usual phylogenetic
criterion and the ordered and aligned segments at hand.
Topological rearrangements described in the previous
section naturally lead to a local search method for this
purpose. We discuss its use to optimize the usual Wagner
parsimony [22] and the distance-based balanced minimum
evolution criterion (BME) [30], [31]. First, we describe our
local search method, then we define briefly these two
criteria and explain how to compute them during local
search.
20 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
Fig. 7. LEFT rearrangement.
Fig. 6. DELETE rearrangement.
4.1 The LSDT Method
Our method, LSDT (Local Search for Duplication Trees),
follows a classical local search procedure in which, at each
step, we try to strictly improve the current tree. This
approach can be used to optimize various criteria. In this
study, we restrict ourselves to parsimony and balanced
minimum evolution; f(T) represents the value (to be minimized) of one of these criteria for the duplication tree T and the sequence set.
Algorithm 1 summarizes LSDT. The neighborhood of the current DT, T_current, is computed using SPR. As we explained earlier, we use the RADT procedure to restrict this neighborhood to valid DTs. When a tree is a valid DT, its f criterion value is computed. In that way, we select the best neighbor of T_current. If this DT improves the best value obtained so far (i.e., f(T_best)), the local search restarts with this new topology. If no neighbor of T_current improves T_best, the local search is stopped and T_best is returned.
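A schematic Python version of this loop is given below; spr_neighborhood, is_duplication_tree (the RADT check), and the criterion f are placeholders for the components described in this paper, so the sketch only illustrates the control flow of Algorithm 1, not the authors' actual implementation.

```python
def lsdt(start_tree, f, spr_neighborhood, is_duplication_tree):
    """Local Search for Duplication Trees (schematic): move to the best
    duplication tree in the restricted SPR neighborhood of the current
    tree as long as the criterion f (to be minimized) strictly improves."""
    best, best_score = start_tree, f(start_tree)
    while True:
        scored = [(f(t), t) for t in spr_neighborhood(best)
                  if is_duplication_tree(t)]        # restriction via RADT
        if not scored:
            return best
        score, challenger = min(scored, key=lambda st: st[0])
        if score < best_score:                      # strict improvement only
            best, best_score = challenger, score
        else:
            return best
```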
To analyze the time complexity of one LSDT step, we have to consider the size of the neighborhood defined by the restricted SPR. In the worst case, this size is of the same order as the size of an unrestricted SPR neighborhood, i.e., O(n²). Indeed, for the double caterpillar (Fig. 8), it is possible to move any subtree rooted on the path between leaf n/2 and ρ towards any edge of the path between leaf n/2 + 1 and ρ, and inversely. Thus, for this tree, O(n²) restricted SPRs can be performed. In the worst case, restricting the neighborhood defined by SPR to duplication trees does not significantly decrease the neighborhood size. However, on average the reduction is quite significant; e.g., with n = 48, only 5 percent of the neighborhood corresponds to valid DTs, assuming DTs are uniformly distributed [26].
Since the time complexity of the recognition algorithm (RADT) is O(n), computing the neighborhood defined by restricted SPR requires O(n³). The calculation of the criterion value is done for each tree of the restricted neighborhood. Thus, one local search step basically requires O(n³ + n²q), where q represents the time complexity of computing the criterion value. However, preprocessing allows this time complexity to be lowered, both for parsimony and minimum evolution, as we shall explain in the following sections.
4.2 The Maximum Parsimony Criterion
Parsimony is commonly acknowledged [22] to be a good
criterion when dealing with slightly divergent sequences,
which is usually the case with tandemly duplicated genes
[8]. The parsimony criterion involves selecting the tree
which minimizes the number of substitutions needed to
explain the evolution of the given sequences. Finding the
most parsimonious tree [22] or duplication tree [15] is
NP-hard, but we can find the optimal labeling of the
internal nodes and the parsimony score of a given tree T in
polynomial time using the Fitch-Hartigan algorithm [32],
[33]. The parsimony score and optimal labeling of internal
nodes is independently computed for each position within
sequences, using a postorder depth-first search algorithm
that requires O(n) time [32], [33]. Thus, computing the parsimony score of n sequences of length l requires O(ln) time. Hence, if we use this algorithm during our local search method, one local search step is computed in O(ln³), which is relatively high.
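For concreteness, here is a minimal per-column Fitch pass (bottom-up only); the nested-tuple tree encoding and the function name are illustrative assumptions. Summing the returned counts over the l alignment columns gives the O(ln) parsimony score discussed above.

```python
def fitch_site(tree, state):
    """Fitch's bottom-up pass for one alignment column.  `tree` is a rooted
    binary tree as nested tuples with leaf names at the tips, and `state`
    maps each leaf name to its character in this column.  Returns a pair
    (candidate state set at the root, number of substitutions)."""
    if not isinstance(tree, tuple):                       # leaf
        return {state[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], state)
    right_set, right_cost = fitch_site(tree[1], state)
    common = left_set & right_set
    if common:                                            # no extra substitution here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical column: fitch_site(((("a", "b"), "c"), "d"),
#                                 {"a": "A", "b": "A", "c": "G", "d": "G"})
# returns ({"G"}, 1), i.e., one substitution suffices for this site.
```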
To speed up this process, we adapted techniques
commonly used in phylogeny for fast calculation of
parsimony. Our implementation uses a data structure
implemented (among others) in DNAPARS [24] and
described in [34], [35]. Let T_p be the pruned subtree and T_r be the resulting tree. A preprocessing stage computes the parsimony vector (i.e., the optimal score and optimal labeling of all sequence positions) of every rooted subtree of T_r, using a double depth-first search [36] (Fig. 9a); the first search is a postorder traversal and computes the parsimony vectors of down-subtrees; the second search is a preorder traversal and computes the parsimony vectors of up-subtrees. Each search requires O(nl) time. Thanks to this data structure, the parsimony score of the tree obtained by regrafting T_p on any given edge of T_r is computed in O(l) time (Fig. 9b). Hence, computing the SPR neighbor with minimum parsimony of any given duplication tree is achieved in O(n³ + n·nl + n²l) = O(n³ + n²l) time; the first term (n³) represents the neighborhood computation; the second term (n·nl) corresponds to the time required by the n preprocessing stages; the third term (n²l) is the time to test the n subtrees and the n possible insertion edges.
Fig. 8. A simple rooted duplication tree with a double caterpillar
structure.
4.3 The Distance-Based Balanced Minimum
Evolution Principle
As in any distance-based approach, we first estimate the
matrix of pairwise evolutionary distances between the
segments, using some standard distance estimator [22],
e.g., the Kimura two-parameter estimator [37] in case of
DNA or the JTT method with proteins [38]. Let Δ be this matrix and δ_ij be the distance between segments i and j. The matrix Δ plus the segment order is the input of the reconstruction method.
The minimum evolution principle (ME) [39], [40]
involves selecting the shortest tree to be the tree which
best explains the observed sequences. The tree length is
equal to the sum of all the edge lengths, and the edge
lengths are estimated by minimizing a least squares fit
criterion. The problem of inferring optimal phylogenies
within ME is commonly assumed to be NP-hard, as are
many other distance-based phylogeny inference problems
[41]. Nonetheless, ME forms the basis of several phyloge-
netic reconstruction methods, generally based on greedy
heuristics. Among them is the popular Neighbor-Joining
(NJ) algorithm [17]. Starting from a star tree, NJ iteratively
agglomerates external pairs of taxa so as to minimize the
tree length at each step.
Recently, Pauplin [30] proposed a simple formula to estimate the tree length L(T) of tree T:

$$L(T) = \sum_{i<j} 2^{\,1 - T_{ij}}\, \delta_{ij},$$

where T_{ij} is the topological distance (number of edges) in T between segments i and j. The correctness of this formula
was shown by Semple and Steel [42], while Desper and
Gascuel [31] showed that this formula is a special case of
weighted-least squares tree fitting. Moreover, Desper and
Gascuel demonstrated that selecting the shortest tree (as
computed from above formula) is statistically consistent and
well suited for phylogenetic inference. They called this new
version of ME balanced minimum evolution (BME) [31].
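The balanced length is straightforward to compute directly from the topological distances; the Python sketch below does so for an unrooted tree given as an adjacency map, with the pairwise distances δ stored in a dict keyed by ordered leaf pairs (all of these representation choices are illustrative assumptions). Its cost is O(n²), consistent with the bound stated just below.

```python
from collections import deque

def balanced_length(adj, leaves, delta):
    """Pauplin's balanced tree-length estimate
    L(T) = sum over leaf pairs i < j of 2**(1 - T_ij) * delta[(i, j)],
    where T_ij is the number of edges between leaves i and j in the tree
    `adj` (node -> list of neighbors) and delta is keyed by pairs (i, j)
    with i appearing before j in `leaves`."""
    def edge_distances_from(src):
        dist, queue = {src: 0}, deque([src])
        while queue:                  # BFS gives topological distances in a tree
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        return dist

    total = 0.0
    for a, i in enumerate(leaves):
        dist = edge_distances_from(i)
        for j in leaves[a + 1:]:
            total += 2.0 ** (1 - dist[j]) * delta[(i, j)]
    return total
```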
Using the above formula, the length of any given tree is computed in O(n²), so computing one LSDT local search step can be achieved in O(n⁴). However, a faster implementation is possible using a straightforward modification of our BME addition algorithm [43]. This involves:

1. pruning a rooted subtree T_p from tree T,
2. computing the average distance between all nonintersecting subtree pairs in the remaining tree T_r,
3. computing the average distance between T_p and any subtree of T_r in T, and
4. using formula (10) from [43] and RADT to find the best allowed edge to regraft T_p.

Steps 2 and 3 are based on algorithms described in [43], which follow the same approach as the double depth-first search described in the previous section. These two steps require O(n²) time, just as Step 4. As there are O(n) subtrees to prune and regraft, this implementation requires O(n³) time to perform one search step.
5 RESULTS
5.1 Simulation Protocol
We applied our method and other existing methods to
simulated datasets obtained using the procedure described
in [18]. We uniformly randomly generated rooted tandem
duplication trees (see [26]) with 12, 24, and 48 leaves and
assigned lengths to the edges of these trees using the
coalescent model [44]. We then obtained molecular clock
trees (MC), which might be unrealistic in numerous cases,
e.g., when the sequences being studied contain pseudo-
genes which evolve much faster than functional genes.
Then, we generated nonmolecular clock trees (NO-MC)
from the previous trees by independently multiplying every edge length by 1 + 0.8X, where X was drawn from an exponential distribution with parameter 1.
Fig. 9. (a) Every edge defines one down-subtree and one up-subtree; e.g., one edge of the five-leaf tree shown defines the down-subtree (2 3) and the up-subtree (1 (4 5)). Moreover, only the parsimony vectors of the five leaves are known before the preprocessing stage. The postorder search computes the parsimony vector of every down-subtree from those of its two children (e.g., that of (2 3) from leaves 2 and 3); the preorder search then computes the parsimony vector of every up-subtree. (b) When the parsimony vector of every subtree in T_r is known, regrafting T_p on any given edge and computing the parsimony score of the resulting tree only requires analyzing the parsimony vectors of three subtrees and is done in O(l) time.
MC trees were rescaled by multiplying every edge length by 1.8.
The trees thus obtained (MC and NO-MC) have a
maximum leaf-to-leaf divergence in the range [0.1, 0.7],
and in NO-MC trees the ratio between the longest and
shortest root-to-leaf lineages is about 3.0 on average. Both
values are in accordance with real data, e.g., gene families
[8] or repeated protein domains [10].
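A minimal sketch of this branch-length protocol is given below; only the 1 + 0.8X deviation with X drawn from an exponential distribution of parameter 1 and the 1.8 rescaling come from the text above, while the function names and the list representation of edge lengths are illustrative.

```python
import random

def deviate_from_clock(edge_lengths, rng=None):
    """NO-MC trees: multiply every edge length independently by 1 + 0.8*X,
    X ~ Exp(1).  The expected factor is 1.8, which is why MC trees are
    rescaled by the same constant below."""
    rng = rng or random.Random()
    return [b * (1.0 + 0.8 * rng.expovariate(1.0)) for b in edge_lengths]

def rescale_molecular_clock(edge_lengths):
    """MC trees: rescale every edge length by 1.8."""
    return [1.8 * b for b in edge_lengths]
```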
SEQGEN [45] was used to produce a 1,000 bp-long
nucleotide multiple alignment from each of the generated
trees using the Kimura two-parameter model of substitution
[46], and a distance matrix was computed by DNADIST [24]
from this alignment using the same substitution model. For
MC and NO-MC cases, 1,000 trees (and, then, 1,000 sequence
sets and 1,000 distance matrices) were generated per tree
size. These data sets were used to compare the ability of the
various methods to recover the original trees from the
sequences or from the distance matrices, depending on the
method being tested. We measured the percentage of trees
(out of 1,000) being correctly reconstructed (%tr). For the
phylogeny reconstruction methods, we also kept the
percentage of duplication trees among the set of inferred
trees. Due to the random process used for generating these
trees and datasets, some short branches might not have
undergone any substitution (as during Evolution) and, thus,
are unobtainable, except by chance. When n and, thus, the branch number are high, it becomes hard or impossible to
find the entire tree. So, we also measured the percentage of
duplication events in the true tree recovered by the inferred
tree (%ev). A duplication event involves one or more
internal nodes and is the lowest common ancestor of a set
of leaves; we say it covers its descendent leaves. However,
the leaves covered by a simple duplication event can change
when the root position changes. As regards the true tree, the
root is known and each event is defined by the set of leaves
which it covers. But, the inferred tree is unrooted. To avoid
ambiguity, we then tested all possible root positions and
chose the one which gave the highest proximity in number
of events detected between the true tree and the inferred
tree, where two events are identical if they cover the same
leaves. Finally, we kept the average parsimony value of each
method (pars).
5.2 Performance and Comparison
Using this protocol, we compared NJ [17], TNT [47], and
GREEDY-SEARCH (GS) [21] which starts from the NJ tree, a
modified version of GREEDY TRHIST RESTRICTED (GTR)
[9] to infer multiple duplication trees, WINDOW [10],
DTSCORE [18], and eight versions of our local search
method LSDT corresponding to different starting duplica-
tion trees (GS, GTR, WINDOW, and DTSCORE) and
different criteria (parsimony and BME). TNT and GS use
the parsimony criterion, but the others are distance-based
methods. TNT is acknowledged as one of the very best
parsimony packages; it was run with 10 replicates and TBR
rearrangements. TNT often returns a set of equally
parsimonious trees. When this set contained duplication
trees, we randomly selected one of them; when no
duplication tree was inferred by TNT, we randomly
selected one of the output trees.
Results are given in Tables 1 and 2. First, we observe that
with n = 48 the true tree is almost never entirely found, for
the reasons explained earlier. On the other hand, the best
methods recover 80 to 95 percent of the duplication events,
indicating that the tested datasets are relatively easy. NJ
and TNT perform relatively well, but they often output
trees that are not duplication trees, which is unsatisfactory
(e.g., with 48 leaves and NO-MC, NJ and TNT only infer
1 percent and 5 percent of duplication trees, respectively).
The GS approach is noteworthy since it modifies the trees
inferred by NJ to transform them into duplication trees.
However, GS is only slightly better than NJ regarding the
proportion of correctly reconstructed trees, but consider-
ably degrades the number of recovered duplication events,
which could be explained by the blind search it performs
to transform NJ trees into duplication trees. GTR also
obtains relatively poor results. As expected from its
assumptions, WINDOW performs better in the MC case
than in the NO-MC one. Finally, DTSCORE obtains the best
performance among the four existing methods, whatever
the topological criterion considered.
Applying our method to starting trees produced by GS,
GTR, WINDOW, and DTSCORE reveals the advantages of
the local search approach. Optimizing parsimony or BME
gives similar results, with a slight advantage for parsimony
as expected from the relatively low divergence rates in our
data sets. The trees produced by GS, GTR, and WINDOW
are clearly improved and, for most, are better than those
obtained by DTSCORE. DTSCORE trees are also improved,
even though this improvement is not very high from a
topological point of view. This could be explained by the
fact that DTSCORE is already an accurate method with
respect to the datasets used.
When we consider the parsimony criterion, the gain
achieved by LSDT is appreciable for each start method. This
could be expected for GS, WINDOW and DTSCORE which
do not optimize this criterion; with n = 48 in the NO-MC case,
the gain for GS is about 329, thus confirming that this
method is clearly suboptimal; the gains for WINDOW and
DTSCORE are about 42 and 15, which are lower but still
significant. The results for GTR, which optimizes parsimony, are more surprising, since the gain (still with n = 48 in the NO-MC case) is about 77 on average, which is very high.
Moreover, the parsimony value obtained by LSDT is very
close to that of TNT, in spite of a much more restricted
search space. This confirms the good performance of our local search method.
It should be stressed that these gains
are obtained at low computational cost as dealing with any
of the 48-taxon datasets only requires about 10 seconds
for parsimony and five seconds for BME on a standard
PC-Pentium 4.
5.3 Analysis of the ZNF45 Family
Zinc finger (ZNF) genes code for proteins that contain one
or more zinc finger motifs. The zinc finger motif is one of
the most common motifs involved in nucleic acid-protein
interaction. Experimental studies on functions of ZNF genes
suggest that many of them code for transcription factors,
and some of them are known to take part in cellular growth
and development [48]. However, the biological functions of
most ZNF genes are currently unknown. The 16 members of
ZNF45 gene family are found in the q13.2 gene cluster on
human chromosome 19 [49]. The organization and features
of the members of the ZNF45 family suggest that the genes
in the family may have been produced by a series of in situ gene duplication events [49].
TABLE 2
Performance Comparison Using Simulations (No Molecular Clock of Evolution)
Note: see Table 1.
TABLE 1
Performance Comparison Using Simulations (Molecular Clock Mode of Evolution)
X+LSDT_Y: X is the method used to obtain the starting tree and Y the criterion being optimized by LSDT; %tr: the percentage of trees being correctly reconstructed; the percentage of duplication trees obtained by phylogeny reconstruction methods is given between parentheses; %ev: the percentage of duplication events in the true tree being recovered by the inferred tree; pars: the average parsimony value.
The ZNF45 gene family has
been previously studied by Tang et al. [10] and Zhang et al.
[21], who proposed different tandem duplication trees to
explain its evolutionary history.
We downloaded the DNA sequences of the 16 members
of ZNF45 from NCBI. Multiple alignment was achieved
using TCOFFEE¹ with default settings. We removed gaps, as usual in phylogenetics [22], and third codon positions, which look saturated (734 parsimony steps are required to explain the evolution of these 237 sites). We thus obtained a final alignment² containing 474 homologous sites, with a
maximum pairwise divergence of 0.45.
PAUP* [23] was used to estimate the matrix of pairwise
distances, assuming the GTR substitution model [50] and a
gamma distribution of rates with parameter 1.
We used this distance matrix and DTSCORE to build a
starting tree, which was then refined by LSDT using
parsimony. We selected this criterion because of its good
performance with simulated data (Tables 1 and 2). The
resulting tree (Figs. 10a and 10b) is a simple DT requiring
897 steps to explain the extant sequences. We tried to
improve this score using a computationally intensive
ratchet approach [51], but were unable to obtain any other
DT with better (or even identical) parsimony. We also ran
TNT with ratchet, 1,000 random taxon addition replicates
and TBR branch swapping (i.e., all TNT options to intensify
the search) and found one maximum-parsimony phylogeny
requiring 896 steps. This phylogeny (Fig. 10c) contains an
unresolved node with degree 4 and is not a duplication tree.
The TNT phylogeny is close to the LSDT duplication tree. To transform one into the other, only three taxa have to be moved (Fig. 10), and both trees differ by only one parsimony step.
1. http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi.
2. Available on request.
Fig. 10. (a) Duplication tree for the 16 genes of human ZNF 45 family inferred by DTSCORE plus LSDT with parsimony; black dots represent the only
allowed root positions, according to the tandem duplication model; the (arbitrarily) selected root position is circled. (b) Rooted duplication tree
corresponding to tree (a). (c) Phylogeny inferred by TNT. Tree (a) can be obtained from tree (c) by moving ZNF45 and ZNF228 to edge 1, and
ZNF233 to edge 2. Edge lengths in tree (a) and tree (c) were estimated by maximum likelihood [52]. Lengths in tree (b) are meaningless and were
adjusted to obtain a readable drawing.
A similar difference was commonly observed in
simulation where TNT found (non-DT) phylogenies requir-
ing one parsimony step less (on average) than the DTs
found by LSDT (Tables 1 and 2), though the true tree used
to generate the sequences was a DT. Thus, having (only)
one parsimony step of difference between the best DT and
the best phylogeny is not significant and can be seen as
supporting the duplication model. Moreover, the discre-
pancy between the two trees can be explained by long
branch attraction, a phenomenon that frequently affects
parsimony-based reconstructions [53]. Indeed, ZNF180 and
ZNF229 genes are distant from the other genes (Figs. 10a
and 10c) and might perturb the whole tree. When removing
those two genes from the data set, both LSDT and TNT
found the same tree, which is identical to the LSDT tree of
Fig. 10a without the two genes. With 14 segments, the
probability of randomly picking up a duplication tree
among all distinct phylogenies is less than 10
4
[26]. This
extremely small probability indicates that the identity
between LSDT and TNT trees is very unlikely to be due
to chance. This provides a strong support for the tandem
duplication model and indicates that our LSDT tree likely
represents most, if not all, of the history of the ZNF45 family.
We compared trees obtained by Tang et al. [10], Zhang
et al. [21], and those of the other programs to the LSDT tree
of Fig. 10. We computed the parsimony score of each tree
and the percentage of events shared by each tree with the
LSDT tree. Just as in the simulation study, we tested GS
[21], GTR [9], WINDOW [10], DTSCORE [8], and LSDT
using different starting points but optimizing parsimony in
all cases.
Results are displayed in Table 3 and confirm those
obtained with simulated data sets. Results for the trees from [10] and [21] are poor, which was expected, as these methods (WINDOW and GS, respectively) do not optimize the parsimony criterion and as we did not use the same alignment. GS is relatively poor, while DTSCORE, WINDOW, and GTR perform better. LSDT
clearly improves these four methods, with gains ranging
from 10 to 50 parsimony steps. In all cases but GTR,
LSDT recovers the most parsimonious DT of Fig. 10.
6 CONCLUSION AND PROSPECTS
We have demonstrated that restricting the neighborhood
defined by the SPR rearrangement to valid duplication trees
allows the whole DT space to be explored. Thanks to these
rearrangements, we have defined a general local search
method which we used to optimize the parsimony and
balanced minimum evolution criteria. We have thus
improved the topological accuracy of all the tested
methods.
Several research directions are possible. Finding the set
of combinatorial configurations for the SPR rearrangement
which necessarily produce a duplication tree, could allow
the neighborhood computation to be accelerated (e.g., for
n = 48, only 5 percent of the SPR neighborhood corresponds
to duplication trees) and, furthermore, gain more insight
into the nature of duplication trees, which are just starting
to be investigated mathematically [12], [26], [27]. Our local
search method could be improved using restricted TBR
rearrangements or with the help of different stochastic
approaches (taboo, noising, ...) in order to avoid local
minima. Moreover, it would be relevant to test this local
search method with other criteria like maximum likelihood.
Finally, combining the tandem duplication events with
speciation events, as described in [54] and [55] for
nontandem duplications, would be relevant for real
applications where we have homologous tandem repeats
from several genomes.
ACKNOWLEDGMENTS
The authors would like to thank Wafae El Alaoui for her help
with ZNF45 family genes, and Richard Desper, Wim Hordijk,
and the referees of the Workshop on Algorithms in
Bioinformatics (WABI 04) for reading preliminary versions
of this paper. This work was supported by ACI-IMPBIO
(Ministère de la Recherche, France).
TABLE 3
Analysis of the ZNF45 Data Set
REFERENCES
[1] F. Blattner, G. Plunkett, C. Bloch, N. Perna, V. Burland, M. Riley, J.
Collado-Vides, J. Glasner, C. Rode, G. Mayhew, J. Gregor, N.
Davis, H. Kirkpatrick, M. Goeden, D. Rose, B. Mau, and Y. Shao,
The Complete Genome Sequence of Escherichia coli K-12,
Science, vol. 277, no. 5331, pp. 1453-1474, 1997.
[2] E. Lander et al., Initial Sequencing and Analysis of the Human
Genome, Nature, vol. 409, pp. 860-921, 2001.
[3] A. Smit, Interspersed Repeats and Other Mementos of Transpo-
sable Elements in Mammalian Genomes, Current Opinion in
Genetics & Development, vol. 9, pp. 657-663, 1999.
[4] W. Fitch, Phylogenies Constrained by Cross-Over Process as
Illustrated by Human Hemoglobins in a Thirteen-Cycle, Eleven
Amino-Acid Repeat in Human Apolipoprotein A-I, Genetics,
vol. 86, pp. 623-644, 1977.
[5] G. Levinson and G. Gutman, Slipped-Strand Mispairing: A Major
Mechanism for DNA Sequence Evolution, Molecular Biology and
Evolution, vol. 4, pp. 203-221, 1987.
[6] J. Zhang and M. Nei, Evolution of Antennapedia-Class Homeo-
box Genes, Genetics, vol. 142, no. 1, pp. 295-303, 1996.
[7] O. Elemento and O. Gascuel, An Exact and Polynomial Distance-
Based Algorithm to Reconstruct Single Copy Tandem Duplication
Trees, Proc. 14th Ann. Symp. Combinatorial Pattern Matching
(CPM2003), 2003.
[8] O. Elemento, O. Gascuel, and M.-P. Lefranc, Reconstructing the
Duplication History of Tandemly Repeated Genes, Molecular
Biology and Evolution, vol. 19, pp. 278-288, 2002.
[9] G. Benson and L. Dong, Reconstructing the Duplication History
of a Tandem Repeat, Proc. Intelligent Systems in Molecular Biology
(ISMB1999), T. Lengauer, ed., pp. 44-53, 1999.
[10] M. Tang, M. Waterman, and S. Yooseph, Zinc Finger Gene
Clusters and Tandem Gene Duplication, J. Computational Biology,
vol. 9, pp. 429-446, 2002.
[11] E. Rivals, A Survey on Algorithmic Aspects of Tandem Repeats
Evolution, Intl J. Foundations of Computer Science, vol. 15, no. 2,
pp. 225-257, 2004.
[12] O. Gascuel, D. Bertrand, and O. Elemento, Reconstructing the
Duplication History of Tandemly Repeated Sequences, Math. of
Evolution and Phylogeny, O. Gascuel, ed., 2004.
[13] S. Ohno, Evolution by Gene Duplication. Springer Verlag, 1970.
[14] P.L. Fleche, Y. Hauck, L. Onteniente, A. Prieur, F. Denoeud, V.
Ramisse, P. Sylvestre, G. Benson, F. Ramisse, and G. Vergnaud, A
Tandem Repeats Database for Bacterial Genomes: Application to
the Genotyping of Yersinia Pestis and Bacillus Anthracis, BioMed
Central Microbiology, vol. 1, pp. 2-15, 2001.
[15] D. Jaitly, P. Kearney, G. Lin, and B. Ma, Methods for
Reconstructing the History of Tandem Repeats and Their
Application to the Human Genome, J. Computer and System
Sciences, vol. 65, pp. 494-507, 2002.
[16] P. Sneath and R. Sokal, Numerical Taxonomy. pp. 230-234, San
Francisco: W.H. Freeman and Company, 1973.
[17] N. Saitou and M. Nei, The Neighbor-Joining Method: A New
Method for Reconstructing Phylogenetic Trees, Molecular Biology
and Evolution, vol. 4, pp. 406-425, 1987.
[18] O. Elemento and O. Gascuel, A Fast and Accurate Distance-
Based Algorithm to Reconstruct Tandem Duplication Trees,
Bioinformatics, vol. 18, pp. 92-99, 2002.
[19] J. Barthélemy and A. Guénoche, Trees and Proximity Representa-
tions. Wiley and Sons, 1991.
[20] S. Sattath and A. Tversky, Additive Similarity Trees, Psychome-
trika, vol. 42, pp. 319-345, 1977.
[21] L. Zhang, B. Ma, L. Wang, and Y. Xu, Greedy Method for
Inferring Tandem Duplication History, Bioinformatics, vol. 19,
pp. 1497-1504, 2003.
[22] D. Swofford, P. Olsen, P. Waddell, and D. Hillis, Molecular
Systematics. pp. 407-514, Sunderland, Mass.: Sinauer Associates,
1996.
[23] D. Swofford, PAUP*. Phylogenetic Analysis Using Parsimony (*and
Other Methods), version 4. Sunderland, Mass.: Sinauer Associates,
1999.
[24] J. Felsenstein, PHYLIP (PHYLogeny Inference Package), Cladis-
tics, vol. 5, pp. 164-166, 1989.
[25] C. Semple and M. Steel, Phylogenetics. Oxford Univ. Press, 2003.
[26] O. Gascuel, M. Hendy, A. Jean-Marie, and S. McLachlan, The
Combinatorics of Tandem Duplication Trees, Systematic Biology,
vol. 52, pp. 110-118, 2003.
[27] J. Yang and L. Zhang, On Counting Tandem Duplication Trees,
Molecular Biology and Evolution, vol. 21, pp. 1160-1163, 2004.
[28] D. Robinson, Comparison of Labeled Trees with Valency Trees,
J. Combinatorial Theory, vol. 11, pp. 105-119, 1971.
[29] L. Wang and D. Gusfield, Improved Approximation Algorithms
for Tree Alignment, J. Algorithms, vol. 25, pp. 255-273, 1997.
[30] Y. Pauplin, Direct Calculation of a Tree Length Using a Distance
Matrix, J. Molecular Evolution, vol. 51, pp. 41-47, 2000.
[31] R. Desper and O. Gascuel, Theoretical Foundation of the
Balanced Minimum Evolution Method of Phylogenetic Inference
and Its Relationship to Weighted Least-Squares Tree Fitting,
Molecular Biology and Evolution, vol. 21, no. 3, pp. 587-598, 2004.
[32] W. Fitch, Toward Defining the Course of Evolution: Minimum
Change for a Specified Tree Topology, Systematic Zoology, vol. 20,
pp. 406-416, 1971.
[33] J. Hartigan, Minimum Mutation Fits to a Given Tree, Biometrics,
vol. 29, pp. 53-65, 1973.
[34] G. Ganapathy, V. Ramachandran, and T. Warnow, Better Hill-
Climbing Searches for Parsimony, Proc. Third Intl Workshop
Algorithms in Bioinformatics, 2003.
[35] P.A. Goloboff, Methods for Faster Parsimony Analysis, Cladis-
tics, vol. 12, pp. 199-220, 1996.
[36] V. Berry and O. Gascuel, Inferring Evolutionary Trees with
Strong Combinatorial Evidence, Theoretical Computer Science,
vol. 240, pp. 271-298, 2000.
[37] M. Kimura, A Simple Model for Estimating Evolutionary Rates of
Base Substitutions through Comparative Studies of Nucleotide
Sequences, J. Molecular Evolution, vol. 16, pp. 111-120, 1980.
[38] D. Jones, W. Taylor, and J. Thornton, The Rapid Generation of
Mutation Data Matrices from Protein Sequences, Computer
Applications in Biosciences, vol. 8, pp. 275-282, 1992.
[39] K. Kidd and L. Sgaramella-Zonta, Phylogenetic Analysis:
Concepts and Methods, Am. J. Human Genetics, vol. 23, pp. 235-
252, 1971.
[40] A. Rzhetsky and M. Nei, Theoretical Foundation of the
Minimum-Evolution Method of Phylogenetic Inference, Molecu-
lar Biology and Evolution, vol. 10, pp. 1073-1095, 1993.
[41] W. Day, Computational Complexity of Inferring Phylogenies
from Dissimilarity Matrices, Bull. Math. Biology, vol. 49, pp. 461-
467, 1987.
[42] C. Semple and M. Steel, Cyclic Permutations and Evolutionary
Trees, Advances in Applied Math., vol. 32, no. 4, pp. 669-680, 2004.
[43] R. Desper and O. Gascuel, Fast and Accurate Phylogeny
Reconstruction Algorithms Based on the Minimum-Evolution
Principle, J. Computational Biology, vol. 9, pp. 687-706, 2002.
[44] M. Kuhner and J. Felsenstein, A Simulation Comparison of
Phylogeny Algorithms under Equal and Unequal Evolutionary
Rates, Molecular Biology and Evolution, vol. 11, pp. 459-468, 1994.
[45] A. Rambault and N. Grassly, Seq-Gen: An Application for the
Monte Carlo Simulation of DNA Sequence Evolution Along
Phylogenetic Trees, Computer Applied Biosciences, vol. 13, pp. 235-
238, 1997.
[46] J. Felsenstein and G. Churchill, A Hidden Markov Model
Approach to Variation Among Sites in Rate of Evolution,
Molecular Biology and Evolution, vol. 13, pp. 93-104, 1996.
[47] P.A. Goloboff, J.S. Farris, and K. Nixon, TNT: Tree Analysis
Using New Technology, 2000, www.cladistics.com.
[48] T. El-Barabi and T. Pieler, Zinc Finger Proteins: What We Know
and What We Would Like to Know, Mechanisms of Development,
vol. 33, pp. 155-169, 1991.
[49] M. Shannon, J. Kim, L. Ashworth, E. Branscomb, and L. Stubbs,
Tandem Zinc-Finger Gene Families in Mammals: Insights and
Unanswered Questions, DNA SequenceThe J. Sequencing and
Mapping, vol. 8, no. 5, pp. 303-315, 1998.
[50] P. Waddel and M. Steel, General Time Reversible Distances with
Unequal Rates Across Sites: Mixing T and Inverse Gaussian
Distributions with Invariant Sites, Molecular Phylogeny and
Evolution, vol. 8, pp. 398-414, 1997.
[51] K.C. Nixon, The Parsimony Ratchet, a New Method for Rapid
Parsimony Analysis, Cladistics, vol. 15, pp. 407-414, 1999.
[52] S. Guindon and O. Gascuel, A Simple, Fast and Accurate Method
to Estimate Large Phylogenies by Maximum-Likelihood, Sys-
tematic Biology, vol. 52, no. 5, pp. 696-704, 2003.
[53] J. Felsenstein, Cases in Which Parsimony or Compatibility
Methods Will Be Positively Misleading, Systematic Zoology,
vol. 27, pp. 401-410, 1978.
BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 27
[54] D. Page and M. Charleston, From Gene to Organismal Phylogeny:
Reconciled Trees and the Gene Tree/Species Tree Problem,
Molecular Phylogenetics and Evolution, vol. 7, pp. 231-240, 1997.
[55] M. Hallett, J. Lagergren, and A. Tofigh, Simultaneous Identifica-
tion of Duplications and Lateral Transfers, Proc. Conf. Research
and Computational Molecular Biology (RECOMB2004), pp. 347-356,
2004.
Denis Bertrand is a PhD student under the
supervision of Olivier Gascuel. His research
subject is the study of tandemly repeated
sequences. His main areas of interest are
phylogenetics, combinatorics, and algorithms.
Olivier Gascuel is Directeur de Recherche at
the Centre National de la Recherche Scientifi-
que (France). He is the head of the bioinfor-
matics group from the LIRMM laboratory,
belongs to the editorial board of Systematic
Biology and of BMC Evolutionary Biology, and
has served on a number of program committees
of bioinformatics conferences (ISMB, WABI). He
started in this field in the mid 1980s, with works
on sequence analysis and protein structure
prediction. Since the beginning of the 1990s, he turned his efforts to
phylogenetics, focusing on the mathematical and computational tools
and concepts. He (co)authored several well-known phylogeny inference
programs (BioNJ, PHYML, FastME).
Optimizing Multiple Seeds
for Protein Homology Search
Daniel G. Brown
Abstract: We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local
protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model picking a set of seed
models as an integer programming problem and give algorithms to choose such a set of seeds. While the problem is NP-hard, and
Quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds we have chosen
allows four to five times fewer false positive hits, while preserving essentially identical sensitivity to BLASTP.
Index Terms: Bioinformatics database applications, similarity measures, biology and genetics.
1 INTRODUCTION
Pairwise alignment is one of the most important problems
in bioinformatics. Here, we continue an exploration into
the seeding and structure of local pairwise alignments and
show that a recent strategy for seeding nucleotide align-
ments can be expanded to protein alignment. Heuristic
protein sequence aligners, exemplified by BLASTP [1], find
almost all high-scoring alignments. However, the sensitivity
of heuristic aligners to moderate-scoring alignments can
still be poor. In particular, alignments with BLASTP score
between 40 and 60 are commonly missed by BLASTP, even
though many are of truly homologous sequences. We focus
on these alignments and show that a change to the seeding
strategy gives success rates comparable to BLASTP with far
fewer false positive hits.
Specifically, multiple spaced seeds [2] and their relatives,
vector seeds [3], can be used in local protein alignment to
reduce the false positive rate in the seeding step of alignment
by a factor of four. We present a protocol for choosing
multiple vector seeds that allows us to find good seeds that
work well together. Our approach is based on solving a set-
cover integer program whose solution gives optimal thresh-
olds for a collection of seeds. Our IP is prone to overtraining,
so we discuss how to reduce the dependency of the solution
on the set of training alignments, both by increasing the false
positive rate of the seeds found slightly and by making the
program less sensitive to outliers. The problem we are trying
to solve is NP-hard and Quasi-NP-hard to approximate to a
sublogarithmic factor, so we present heuristics for it, though
most instances are of moderate enough size to use integer
programming solvers.
Our successful result here contrasts with our previous
work [3] in which we introduced vector seeds. There, we
found that using only one vector seed would not substan-
tially improve BLASTP's sensitivity or selectivity. The use
of multiple seeds is the important change in the present
work. This successful use of multiple seeds is similar to
what has been reported recently for pairwise nucleotide
alignment [4], [5], [6], but the approach we use is different
since protein aligners require extremely high sensitivity. We
note that, independently of our work, the authors of
PatternHunter, the first program to use optimized spaced
seeds, have developed a protein aligner based on seeding
approaches similar to those we discuss here [7]; however,
they have not offered theoretical justification for their
approach, which, in some sense, we provide here.
Our results confirm the themes developed by us and
others since the initial development of spaced seeds. The
first theme is that spaced seeds help in heuristic alignment
because the very surprisingly conserved regions that one
uses as a basis for building an alignment happen more
independently in true alignments than for unspaced seeds.
In protein alignments, there are often many small regions of
high conservation, each of which has a chance to have a hit
to a seed in it. With unspaced seeds, the probability that any
one of these regions is hit is low, but, when a region is hit,
there may be several more hits, which is unhelpful. By
contrast, a spaced seed is likely to hit a given region fewer
times, wasting less runtime, and will also hit at least one
region in more alignments, increasing sensitivity.
The second theme is that the more one understands how
local and global alignments look, the more possible it is to
tailor alignment seeding strategies to a particular applica-
tion, reducing false positives and improving true positives.
Here, by basing our set of seeds on sensitivity to true
alignments, we choose a set of seed models that hit diverse
types of short conserved alignment subregions. Conse-
quently, the probability that one of them hits a given
alignment is high since they complement each other well.
. The author is with the School of Computer Science, University of Waterloo,
200 University Ave., West, Waterloo, ON N2L 3G1, Canada.
E-mail: browndg@uwaterloo.ca.
Manuscript received 1 Nov. 2004; revised 2 Jan. 2005; accepted 11 Jan. 2005;
published online 30 Mar. 2005.
2 BACKGROUND: HEURISTIC ALIGNMENT AND
SPACED SEEDS
Since the development of heuristic sequence aligners [1], the
same approach has been commonly used: identify short,
highly conserved regions and build local alignments
around these hits. This avoids the use of the Smith-
Waterman algorithm [8] for pairwise local alignment, which
has Θ(nm) runtime on input sequences A and B of length n and m, respectively. (We will use the notation A[i] to represent the ith character of sequence A.)
Instead, assuming random sequences, the expected runtime of this heuristic search method is h(n, m) + a(n, m), where h(n, m) is the amount of time needed to find hits in the two sequences and a(n, m) is the expected time needed to compute the alignments from the hits. Most heuristic aligners have h(n, m) = Θ(n + m + nm/k), while a(n, m) = Θ(nm/k) for some large constant k. There are many assumptions in these formulas. First, even when we align sequences with true homologies, most hits are between unrelated positions, so the estimation of the runtime need not consider whether the sequences are related. Further, this simplification assumes that each hit found in the first phase results in a constant amount of work being done in the second phase to identify that it is false (or that true hits are rare). It is the speedup factor of k that is important here; assuming m and n are large, the overall runtime is much faster.
Most heuristic aligners look at the scores of matching
characters in short regions and use high-scoring short
regions as hits. For example, BLASTP [1] hits are three
consecutive positions in the two sequences where the total
score, according to a BLOSUM or PAM scoring matrix, of
aligning the three letters in one sequence to the three letters
of the other sequence is at least +13. Finding such hits can
be done easily, for example, by making a hash table of one
sequence and searching positions of the hash table for the
other sequence, in time proportional to the length of the
sequences and the number of hits found. BLASTP uses
more complicated data structures for this process, but the
principle is similar.
2.1 Seeding Models
To generalize BLASTP's hits, we defined vector seeds [3], [9]. A vector seed is a pair (v, T). Vector v = (v_1, ..., v_k) is a vector of position multipliers and T is a threshold. Given two sequences A and B, let s_{i,j} be the score in our scoring matrix of aligning A[i] to B[j]. If we consider position i in A and j in B, we then get a hit to the vector seed at those positions when v · (s_{i,j}, s_{i+1,j+1}, ..., s_{i+k-1,j+k-1}) ≥ T. In this framework, BLASTP's seed is ((1, 1, 1), 13).
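To make these definitions concrete, the following short Python sketch (ours, not from the original papers) enumerates vector-seed hits by a direct scan of all position pairs; the scoring function is a toy stand-in for the BLOSUM62 matrix, and in practice the scan would be replaced by the hash-table lookup described above.

# Illustrative sketch of vector-seed hit detection; not the authors' code.
# score() is a toy stand-in for a BLOSUM/PAM matrix lookup.

def score(a, b):
    """Toy substitution score; a real aligner would use BLOSUM62."""
    return 5 if a == b else -1

def vector_seed_hits(A, B, seed):
    """Yield (i, j) pairs where the vector seed (v, T) has a hit, i.e.,
    v . (s[i,j], s[i+1,j+1], ..., s[i+k-1,j+k-1]) >= T."""
    v, T = seed
    k = len(v)
    for i in range(len(A) - k + 1):
        for j in range(len(B) - k + 1):
            if sum(v[d] * score(A[i + d], B[j + d]) for d in range(k)) >= T:
                yield (i, j)

# BLASTP's default seed in this framework, and a spaced variant that
# ignores its second position.
blastp_seed = ((1, 1, 1), 13)
spaced_seed = ((1, 0, 1, 1), 13)

if __name__ == "__main__":
    A, B = "MKTAYIAKQR", "MKSAYIARQR"
    print(list(vector_seed_hits(A, B, blastp_seed)))
    print(list(vector_seed_hits(A, B, spaced_seed)))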
Vector seeds generalize the earlier idea of spaced seeds
[2] for nucleotide alignments, where both scores and the
vector are 0/1 vectors and where T, the threshold, equals
the number of 1s in v. A spaced seed requires an exact
match in the positions where the vector is 1 and the places
where the vector is 0 are don't care positions. In our
original work with vector seeds [3], the freedom to allow
positions of v to have values beside 0 and 1 was not
extremely useful, so the vector seeds we discuss here all
have binary vectors v.
Spaced seeds have the same expected number of junk
hits as unspaced seeds. For unrelated noise DNA se-
quences, this is nm · 4^(−w), where w is the number of ones in the seed (its support). Their advantage comes because more
distinct internal subregions of a given alignment will match
a spaced seed than the unspaced seed; this happens because
the hits are more independent of each other. The probability
that an alignment of length 64 with 70 percent conservation
matches a good spaced seed of support 11 can be greater
than 45 percent because there are likely to be more
subregions that match the spaced seed than the unspaced
seed; by contrast, the default BLASTN seed, which is
11 consecutive required matches, hits only 30 percent of
alignments.
Spaced seeds have three advantages over unspaced
seeds. First, their hits are more independent, which means
that it is more likely that a given alignment has at least one
hit to a seed; fewer alignments have many. Second, the seed
model can be tailored to a particular application: If there is
structure or periodicity to alignments, this can be reflected
in the design of the seeds chosen. For example, in searching
for homologous codons, they can be tailored to the three-
periodic structure of such alignments [10], [11]. Finally, the
use of multiple seeds allows us to boost sensitivity well
above what is achievable with a single seed, which, for
nucleotide alignment, can give near 100 percent sensitivity
in reasonable runtime [4].
Keich et al. [12] have given an algorithm for a simple
model of alignments to compute the probability that an
alignment hits a seed; this has been extended by both
Buhler et al. [10] and Brejova et al. [11] to more complex
sequence models. Choi et al. [13] have also shown
experimental results for spaced seeds with high sensitivity
across a wide range of homologies. Kucherov et al. [14]
show how to adapt spaced seeds to the interesting case of
alignments where no subregion of the alignment has a
higher score than the entire alignment.
2.2 Some Newer Seeding Models
Another seeding model, which has recently arisen [7], [15], is that of ungapped alignment seeds. These were developed by
Brown and Hudek [15] to anchor global alignments of
ambiguous DNA sequences and, independently, by Kisman
et al. [7] in their heuristic protein aligner, tPatternHunter.
An ungapped alignment seed is a vector v, a global
threshold T, and a vector of positional minimum scores b.
There is a match between positions in the two sequences
when the vector of pairwise match scores is at least as large,
position-by-position, as the minimum scores vector b and
where the dot product of the position-by-position scores and
the multiplier vector v is at least T. These seeds are a
compromise between spaced seeds and consecutive seeds:
They require spaced positions to have good scores (those
where the lower bound vector b has high values), while also
focusing on the quality of the local alignment at the seed by
possibly examining all of the positions of the seed. It is not
possible to cast an ungapped alignment seed in the language
of vector seeds because of the requirement that each individual position's score is greater than its bound. It is possible to cast a vector seed as an ungapped alignment seed, by setting the b vector to −∞ in all positions, thus removing the position-by-position lower bound requirement.
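As a small illustration of this two-level requirement (ours, with the same toy scoring stand-in as in the earlier sketch and a hypothetical seed), the following Python fragment tests one position pair against an ungapped alignment seed (v, T, b).

# Illustrative sketch of an ungapped-alignment-seed hit test; the seed and
# scoring rule below are hypothetical, not taken from the paper.

def score(a, b):
    """Toy substitution score standing in for BLOSUM62."""
    return 5 if a == b else -1

def ungapped_seed_hit(A, B, i, j, seed):
    """True if seed (v, T, b) hits at (i, j): every positional score meets
    its lower bound b[d], and the scores dotted with v reach T.
    Setting every b[d] to -inf recovers a plain vector seed."""
    v, T, b = seed
    scores = [score(A[i + d], B[j + d]) for d in range(len(v))]
    if any(s < lower for s, lower in zip(scores, b)):
        return False
    return sum(m * s for m, s in zip(v, scores)) >= T

# Hypothetical seed: positions 0 and 3 must each score at least 0, and the
# four positions together must score at least 13.
example_seed = ((1, 1, 1, 1), 13, (0, float("-inf"), float("-inf"), 0))

if __name__ == "__main__":
    print(ungapped_seed_hit("MKTAYIAKQR", "MKSAYIAKQR", 0, 0, example_seed))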
Csűrös [16] has also extended this framework of seeding to
look at variable-length seeds, where the length of the regions
that must match depends on their positional scores. While
this approach can also be brought into the framework of the
present work, we have not done so in our experiments.
2.3 Multiple Seeds
Another important extension to these ideas of seeding has
been the use of multiple seeds of different sorts in basing
alignments. In this approach, an attempt is made to perform
extension when any of a collection of seed models has a hit.
This will work well if each chosen seed has a very low false
positive rate so that their total false positive rate is still
below that of one seed of comparable sensitivity.
Several authors [2], [3], [4], [6], [10], [17] have proposed
using multiple seeds and given heuristics to choose them.
This problem was recently given a theoretical framework by
Xu et al. [5] and, independently, Kucherov et al. [18] studied
heuristic algorithms for identifying sets of good seeds. In
work unrelated to the present work, Kisman et al. [7] have
heuristically used multiple ungapped alignment seeds
(though not called by that term) for protein alignment. To
the best of our knowledge, the present work is the first work
to choose multiple seeds for protein alignment with a
theoretical basis.
3 CHOOSING A GOOD SET OF SEEDS
Spaced seeds have made a substantial impact in nucleotide
alignments, but less in protein alignment. Here, we show
that they have use in this domain as well. Specifically,
multiple vector seeds or multiple ungapped alignment
seeds, with high thresholds, give essentially the sensitivity
of BLASTP with four times fewer noise hits. Slightly fewer
alignments are hit, but the regions of alignment hit by the vector seeds include all of the same good regions hit by the BLASTP seed, and a few more. In other words, BLASTP hits
more alignments, but the hits found by BLASTP and not the
vector seeds are mostly in areas unlikely to be expanded to
full alignments.
We adapt a framework for identifying sets of seeds
introduced by Xu et al. [5]. We model multiple seed
selection as a set cover problem and give heuristics for the
problem. For our purposes, one advantage of the formula-
tion is that it works with explicit alignments: Since real
alignments may not look like a probabilistic model, we can
pick a set of seeds for sensitivity to a collection of true
alignments. Unfortunately, this also gives rise to problems,
as the thresholds may be set high due to overtraining for a
given set of alignments.
Most of our experiments concern themselves with vector
seeds, but the framework can be expanded straightforwardly
to ungapped alignment seeds as well. This is because we do
not compute theoretical sensitivity of the seeds, but, instead,
only identify hits in existing real alignments. Indeed, our
framework is quite broad and extends to many different
models for seeding as long as the assumption that false
positives are additive is reasonably accurate and that one can
compute that false positive rate for the seed models. Where
the ungapped alignment seeds require some thought, we
present the addition needed for them.
3.1 Background Rates
One important detail that we need before we begin is to compute the
background hit rate for a given vector or ungapped
alignment seed. We noted previously [3] that this can be
computed for vector seeds, given a scoring matrix; it is also
straightforward to compute for ungapped alignment seeds
as well. Namely, from the scoring matrix, we can compute
the distribution of letters in random sequences implied by
the matrix; this can then be used to compute the distribu-
tion of scores found in unrelated sequences. Using this, we
can compute the probability that unrelated sequences give a
hit to a given seed at a random position, which we call the
false positive rate for that seed. In fact, we can easily
compute the entire probability distribution on the score for
a given seed vector at a random position. Similarly, we can
compute this probability under the constraint that posi-
tional scores have minimum value, thus expanding to
ungapped alignment seeds.
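As a sketch of this computation (ours; the two-letter alphabet, frequencies, and scores below are placeholders for the BLOSUM-implied values), the following Python code builds the positional score distribution for unrelated sequences and convolves it over the 1-positions of a binary vector seed to obtain that seed's background hit probability.

# Illustrative sketch: background hit rate of a binary vector seed from a
# scoring matrix and background letter frequencies (all values hypothetical).
from collections import defaultdict

freqs = {"A": 0.6, "B": 0.4}                # hypothetical background frequencies
scores = {("A", "A"): 2, ("A", "B"): -1,    # hypothetical substitution scores
          ("B", "A"): -1, ("B", "B"): 3}

def positional_score_dist(freqs, scores):
    """Distribution of the score of aligning two independent random letters."""
    dist = defaultdict(float)
    for a, pa in freqs.items():
        for b, pb in freqs.items():
            dist[scores[(a, b)]] += pa * pb
    return dict(dist)

def convolve(d1, d2):
    """Distribution of the sum of two independent discrete scores."""
    out = defaultdict(float)
    for s1, p1 in d1.items():
        for s2, p2 in d2.items():
            out[s1 + s2] += p1 * p2
    return dict(out)

def false_positive_rate(seed, freqs, scores):
    """P(a random unrelated position pair gives a hit) for a binary vector seed."""
    v, T = seed
    pos = positional_score_dist(freqs, scores)
    total = {0: 1.0}
    for m in v:
        if m:          # a 0 multiplier contributes nothing to the dot product
            total = convolve(total, pos)
    return sum(p for s, p in total.items() if s >= T)

if __name__ == "__main__":
    print(false_positive_rate(((1, 1, 1), 6), freqs, scores))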
For the default BLASTP seed, the probability that two
random unrelated positions have a hit is quite high, 1/
1,600. Because of this high level of false positives, BLASTP
must filter hits further in hopes of throwing out hits in
unrelated sequences. Specifically, BLASTP rapidly exam-
ines the local area around a hit and, if this region is not also
well-conserved, the hit is thrown out. Sometimes, this
filtering throws out all of the hits found in some true
alignments and, thus, BLASTP misses them, even though
they hit the seed. One way of modeling this filtering is to
view BLASTP as testing two seeds simultaneously: The
vector seed ((1, 1, 1), 13) and an ungapped alignment seed
that looks at the region surrounding the seed hit.
Our goal in using other seed models here is to reduce the
false positive rate, while still hitting the overwhelming
majority of alignments and hitting them in places that are
highly enough conserved as to make a full alignment likely.
A flowchart of our proposal, and the approach of BLASTP,
is in Fig. 1.
For a set Q of alignment seeds, we say that its false positive rate is the probability that any seed in Q has a hit to two random positions in unrelated sequences. This is not equal to the sum of the false positive rates for all seeds in Q since hits to
one seed may overlap hits to another. However, we will use this approximation in our optimization. As we extend to a very large collection of seeds in Q, this can become worrisome as the same false positive may be counted many times. However, this may be appropriate, in fact, depending on how the search is done to find the false hits.
3.2 An Integer Program to Choose Many Seeds
Here, we give an integer program to find the set of seeds
that hits all alignments in a given training set with overall
lowest possible false positive rate. We will show that our IP
encodes the Set-Cover problem and that it is NP-hard to
solve and Quasi-NP-hard even to approximate to a
sublogarithmic factor. However, for moderate-sized train-
ing sets, we can solve it, in practice, or use simple heuristics
to get good solutions.
Given a set of alignment seeds Q, we say that they hit a
given alignment a if any member of Q has a hit to the
alignment. Our goal in picking such a set will be to
minimize the false positive rate of the set Q, with the
requirement that we hit all alignments in a training
collection, A.
This optimization goal is the alternative to the goal of Xu
et al. [5]. In that work, we maximized seed sensitivity when
a maximum number of spaced seeds is allowed; given that
all possible seeds had the same false positive rate, this was
equivalent to maximizing sensitivity for a given false
positive rate. This alternative goal of minimizing false
positives when we want 100 percent sensitivity on the
training set is appropriate for protein alignment; however,
we want to achieve extremely high sensitivity, as close to
100 percent as possible.
3.2.1 The Integer Program
Here, we show how to cast this seed selection problem as an
integer program. Recall that a seed model is the vector v of
multipliers or for an ungapped alignment seed, the vector v
of multipliers, and the vector b of positional lower bounds.
We will call this vector or vectors the pattern of a seed.
We can then view choosing a set of vector or ungapped
alignment seeds as choosing thresholds for each pattern.
More formally, suppose we are given a collection of alignments A = {a_1, ..., a_m} and a set of seed patterns P = {p_1, ..., p_n}. We will choose thresholds T*_1, ..., T*_n for the patterns of P such that the seed model set Q* = {(p_1, T*_1), ..., (p_n, T*_n)} hits all alignments in A and the false positive rate of Q* is as low as possible. The T*_i may be +∞, which corresponds to not choosing the pattern p_i at all.
We require that each alignment a must be hit, so one of the thresholds must be low enough to hit a. To verify this, we compute the best-scoring hit for each seed pattern p_i in each alignment a_j; let the score of this hit be T_{i,j}. If we choose T*_i so that it is at most T_{i,j}, then the seed (p_i, T*_i) will hit alignment a_j.
To model this as an integer program, we have a collection of integer variables x_{i,T}, one for each possible threshold value T for seed pattern p_i. We note that we are requiring that this number is a small number or can be granularized reasonably since each possible threshold will get its own constraint. For simple seeds from a BLOSUM matrix, the scores at a position come in a small range of integers, so the possible reasonable thresholds form a small range; let T_m be the smallest such threshold. We will set variable x_{i,T} to 1 when the threshold for seed pattern p_i is at most T; for each pattern p_i, its threshold chosen is the smallest T where x_{i,T} = 1.
To compute the false positive rate, we let r_{i,T} be the probability that a random place in the background model has score exactly T according to the seed model (p_i, T). We add these up for all of the false hits with score equal to or greater than the chosen thresholds. Our integer program is as follows:

    minimize  Σ_{i,T} x_{i,T} · r_{i,T},   such that
    Σ_i x_{i,T_{i,j}} ≥ 1      for all alignments a_j,
    x_{i,T} ≥ x_{i,T−1}        for all thresholds T > T_m,
    x_{i,T} ∈ {0, 1}           for all i and T.
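A minimal sketch of this integer program, written with the PuLP modelling library and fed tiny hypothetical data in place of the training alignments and the BLOSUM-derived background rates, might look as follows; the extensions discussed in Section 3.2.3 are added as further linear constraints in the same way.

# Illustrative PuLP sketch of the threshold-selection IP (pip install pulp).
# All data below are hypothetical; they stand in for the training alignments
# and the BLOSUM-derived background rates used in the paper.
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

patterns = ["p1", "p2"]
alignments = ["a1", "a2", "a3"]
thresholds = list(range(10, 16))            # candidate thresholds T_m .. T_max

# best_hit[i][j]: score of the best hit of pattern i in alignment j,
# assumed already clamped into the candidate threshold range.
best_hit = {"p1": {"a1": 14, "a2": 11, "a3": 10},
            "p2": {"a1": 12, "a2": 15, "a3": 13}}
# rate[i][T]: background probability of a hit to pattern i with score exactly T.
rate = {"p1": {T: 2.0 ** -T for T in thresholds},
        "p2": {T: 2.0 ** -(T + 1) for T in thresholds}}

prob = LpProblem("seed_threshold_selection", LpMinimize)
x = LpVariable.dicts("x", [(i, T) for i in patterns for T in thresholds],
                     cat=LpBinary)

# Objective: total background rate of all scores at or above the chosen thresholds.
prob += lpSum(rate[i][T] * x[(i, T)] for i in patterns for T in thresholds)

# Each alignment must be hit: some pattern's threshold is at most its best hit score.
for j in alignments:
    prob += lpSum(x[(i, best_hit[i][j])] for i in patterns) >= 1

# Monotonicity: a threshold of at most T-1 is also a threshold of at most T.
for i in patterns:
    for T in thresholds[1:]:
        prob += x[(i, T)] >= x[(i, T - 1)]

# Extensions from Section 3.2.3 are extra constraints, e.g. a cap on the
# number of patterns actually used:
#   prob += lpSum(x[(i, thresholds[-1])] for i in patterns) <= 4

prob.solve()
for i in patterns:
    chosen = [T for T in thresholds if x[(i, T)].value() > 0.5]
    print(i, "threshold:", min(chosen) if chosen else "unused (+infinity)")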
Our framework is quite general: Given any collection of
alignments and the sensitivity of a collection of seeds to the
alignments, one can use this IP formulation to choose
thresholds to hit all alignments while minimizing false
positives. In particular, one could require that a hit satisfy
multiple seeds simultaneously or use more complicated hit
formulations. Of course, for these harder models, one might
have a more difficult time optimizing the integer program.
3.2.2 NP-Hardness
We now show that the problem of optimizing the seed set to
minimize the false positive rate while hitting all alignments
is NP-hard and that it is Quasi-NP-hard to approximate to
within a logarithmic factor [19]. (That is, assuming NP does not have deterministic algorithms running in n^{O(log log n)} time, no polynomial-time algorithm exists with approximation ratio o(log n).)
Fig. 1. Flowchart contrasting BLASTP's approach to heuristic sequence alignment to the one proposed here. The only difference is in the initial
collection of hits. The smaller collection of hits found with the variations
on seeds gives as many hits to true alignments that survive to the third
stage as does BLASTP, yet far fewer noise hits must be filtered out.
We show this by giving an approximation-preserving
reduction of the Set-Cover problem to this problem. Since
Set-Cover is Quasi-NP-hard to approximate to within a
logarithmic factor [19], so is our problem.
An instance of Set-Cover is a ground set S and a collection T = {T_1, ..., T_m} of subsets of S; the goal is the smallest cardinality subset of T whose union is S. The connection to our problem is clear: We will produce one alignment per ground set member and, for each of the elements of T, we will have one seed. For simplicity, we will assume that S = {1, ..., n}. To fill the construction out, we
will assign the vector seed v_i = (1, 0, ..., 0, 1, 1), with i zeros between the leading 1 and the final two 1s, to every ground set element s_i. In a model of sequence where all positions are independent of all others, each of these seeds has the same false positive rate, so the false positive rate will be proportional to the number of ground set members chosen.
Then, for each set T_j ∈ T, we create an alignment A_j of length 2n² + 4n by pasting together n blocks of length 2n + 4. If i is in T_j, then we make the ith block of the alignment have its first and (i+2)nd positions be of score 1, while all other positions in the block have score zero; if i ∉ T_j, then the ith block is all score zero. Then, it is clear that if we choose the seed v_i, we will hit all alignments A_j where i ∈ T_j. If we desire the minimum false positive rate to hit all alignments, this is exactly equivalent to choosing the minimum cardinality set to cover all of the T_j.
Thus, we have presented an approximation-preserving
transformation from Set-Cover to our problem and it is both
NP-hard and Quasi-NP-hard to approximate to within a
logarithmic factor.
3.2.3 Expansions of the Framework
In our experiments, we use the vector seed requirement as a
threshold; one could use a more complicated threshold
scheme to focus on hits that would be expanded to full
alignments. That is, our minimum threshold for T
i;j
could
be the highest-scoring hit of seed vector v_j in alignment a_i that is expanded to a full alignment. We could also have a more
complicated way of seeding alignments and, still, as long as
we could compute false positive rates, we could require that
all alignments are hit and minimize false positive rates.
Also, we can limit the total number of vector seeds used
in the true solution (in other words, limit the number of
vectors with finite threshold). We do this by putting an
upper bound on Σ_i x_{i,T} for the maximum threshold T. In
practice, one might want an upper bound of four or eight
seeds, as each chosen seed requires a method to identify hits
and one might not want to have to use too many such
methods in the goal of keeping fewer indexes of a protein
sequence database, for example.
Further, we might want to not allow seeds to be chosen
with very high threshold. The optimal solution to the
problem will have the thresholds on the seeds as high as
possible while still hitting each alignment. This allows
overtraining: Since even a tiny increase in the thresholds
would have caused a missed alignment, we may easily
expect that, in another set of alignments, there may be
alignments just barely missed by the chosen thresholds.
This is particularly possible if thresholds are allowed to get
extremely high and only useful for a single alignment. This
overtraining happened in some of our experiments, so we
lowered the maximum so that they were either found in a
fairly narrow range (+13 to +25) or set to +∞ when a seed
was not used. As one way of also addressing overtraining,
we considered lowering the thresholds obtained from the IP
uniformly or just lowering the thresholds that have been set
to high values.
And, finally, the framework can be extended to allow a
specific number of alignments to be missed. For each
alignment, rather than requiring that

    Σ_i x_{i,T_{i,j}} ≥ 1,

which requires that some threshold be chosen so that the alignment is hit, we can add a 0/1 slack variable s_j to count how many are missed, changing the constraint to

    Σ_i x_{i,T_{i,j}} + s_j ≥ 1.

Then, if we require that

    Σ_j s_j ≤ M,
this allows at most M alignments to be so missed. This may
be appropriate to allow the optimization framework to be
less sensitive to a small number of outliers. We show
experiments with this slightly expanded framework in the
next section.
We note one simplification of our formulation: False hit
rates are not additive. Given two spaced seeds, a hit to one
may coincide with a hit to the other, so the background rate
of false positives is lower than estimated by the program.
When we give such background rates later, we will
distinguish those found by the IP from the true values.
3.2.4 Solving the IP and Heuristics
To solve this integer program or its variations is not
necessarily straightforward since the problem is NP-hard.
In our experiments, we used sets of approximately 400 alignments, and the IP could be solved directly and quickly using the CPLEX 9.0 integer programming solver.
Straightforward heuristics also work well for the
problem, such as solving the LP relaxation and rounding
to 1 all variables with values close to 1, until all alignments
are hit, or setting all variables with fractional LP solutions to
1 and then raising thresholds on seeds until we start to miss
alignments.
We finally note that a simple greedy heuristic works well
for the problem, as well: Start with low thresholds for all
seed patterns and repeatedly increase the threshold whose
increase most reduces the false positive rate until no such
increase can be made without missing an alignment. This
simple heuristic performed essentially comparably to the
integer program in our experiments, but, since the IP solved
quickly, we present its results.
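A sketch of this greedy procedure (ours, reusing the same hypothetical data shapes as the integer-program sketch above) is given below; it raises one threshold per round, always taking the raise that removes the most background rate while keeping every training alignment hit.

# Illustrative sketch of the greedy threshold-raising heuristic (hypothetical
# data; raising a pattern past its top threshold drops that pattern entirely).

thresholds = list(range(10, 16))
alignments = ["a1", "a2", "a3"]
best_hit = {"p1": {"a1": 14, "a2": 11, "a3": 10},
            "p2": {"a1": 12, "a2": 15, "a3": 13}}
rate = {"p1": {T: 2.0 ** -T for T in thresholds},
        "p2": {T: 2.0 ** -(T + 1) for T in thresholds}}

def all_hit(level):
    """True if every alignment is hit by some pattern at its current threshold."""
    return all(any(level[i] < len(thresholds) and
                   thresholds[level[i]] <= best_hit[i][j] for i in best_hit)
               for j in alignments)

level = {i: 0 for i in best_hit}       # index into thresholds; len(...) = unused
assert all_hit(level), "even the lowest thresholds miss an alignment"
while True:
    best = None
    for i in best_hit:
        if level[i] == len(thresholds):
            continue                   # pattern already dropped
        gain = rate[i][thresholds[level[i]]]   # background rate removed by one raise
        level[i] += 1
        if all_hit(level) and (best is None or gain > best[1]):
            best = (i, gain)
        level[i] -= 1
    if best is None:
        break                          # no raise is possible without a miss
    level[best[0]] += 1

for i, k in sorted(level.items()):
    print(i, "threshold:", thresholds[k] if k < len(thresholds) else "unused")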
One other advantage to the IP formulation is that the
false-positive rate from the LP relaxation is a lower bound
on what can possibly be achieved; the simple greedy
heuristic offers no such lower bound.
4 EXPERIMENTAL RESULTS
Here, we present the results of experiments with our
multiple seed selection framework in the context of protein
alignments. Our goal is to identify collections of seed
models which together have extremely high sensitivity to
even moderately strong alignments, while admitting a very
low false positive rate.
Since we pick seeds with a relatively small number of
alignments, we run the serious risk of overtraining. In
particular, the requirement that our set of seeds has
100 percent sensitivity on the training data need not require
that it also have comparable sensitivity overall. In one
example, the particular choice of training examples was
apparently quite unrepresentative since a 100 percent
sensitivity to this set of alignments still gave only 96 percent
sensitivity on a testing set. (Or, presumably, the testing set
may be unrepresentative.) As a simple way of exploring this,
we examined what happened when we lowered the thresh-
old on some seeds that were chosen by the integer program
to modestly increase their false positive rates and sensitivity
in the hope of still keeping very high sensitivity.
We first present simple experiments with vector seeds
and with ungapped alignment seeds on a small sample of
alignments discovered with BLASTP; in this section, we
also allow for seed sets that miss a small number of the
training alignments.
Then, we explore how well these seed sets do in hitting
alignments that we did not use BLASTP to identify. Here,
we note that our vector seed sets do not appear to do as well
as BLASTP for sensitivity to alignments in general, but they
do hit more alignments with high-scoring short regions;
presumably, these alignments are more likely true.
4.1 Preliminary Experiments
We begin by exploring several sets of alignments generated
using BLASTP. Our target score range for our alignments is
BLASTP score between +40 and +60 (BLOSUM score +112
to +168). These moderate-scoring alignments can happen by
chance, but also are often true. Alignments below this
threshold are much more likely to be errors, while, in a
database of proteins we used, such alignments are likely to
happen to a random sequence by chance only one time in
10,000, according to BLASTP's statistics.
We begin by identifying a set of BLASTP alignments in
this score range. To avoid overrepresenting certain families
of alignments in our test set, we did an all-versus-all
comparison of 8,654 human proteins from the SWISS-PROT
database [20]. (We note that this is the same set of proteins
and alignments we used in our previous vector seed work
[3]. We have used this test set in part to confirm our belief
that, while a single seed may not help much, in comparison
to BLASTP, many seeds will be of assistance.) We then
divided the proteins into families so that all alignments
with BLASTP score greater than 100 are between two
sequences in the same family and there are as many families
as possible. We then chose 10 sets of alignments in our
target score range such that, in each set of alignments, a
particular family will only contribute at most eight
alignments to that set. Note that, since our threshold for
sharing family membership is a BLASTP score greater than
100 and the alignments we are seeking score between +40
and +60, many chosen alignments will be between members
of different families. We divided the sets of alignments into
five training sets and five testing sets. It is possible that the
same alignments will occur in a training and testing set as
we did not take any efforts to avoid this, though the set of
possible alignments is large enough to make this a rare
occurrence.
We note that we are using this somewhat complicated
system specifically because we want to avoid imposing a
preexisting bias on the set of alignments: Many true yet
moderate-scoring alignments will be between proteins with
different function or from different biological families. For the
same reason, we have used alignments from dynamic
programming as our standard, rather than structural align-
ments of known proteins or curated alignments because our
goal is to improve the quality of heuristic alignments.
Certainly, many of the alignments we consider will not be
precise; still, a heuristic dynamic programming-based align-
ment that finds a hit between two proteins and then uses the
same scoring matrix as BLASTP will find the exact same,
potentially inaccurate, alignment as did BLASTP.
4.1.1 Multiple Vector Seeds
We then considered the set of all 35 vector patterns of length
at most 7 that include three or four 1s (the support of the
seed). We used this collection of vector patterns as we have
seen no evidence that nonbinary seed vectors are preferable
to binary ones for proteins and because it is more difficult to
find hits to seeds with higher support than four due to the
high number of needed hash table keys.
We computed the optimal set of thresholds for these
vector seeds such that every alignment in a training set has
a hit to at least one of the seeds, while minimizing the
background rate of hits to the seeds and only using at most
10 vector patterns. Then, we examined the sensitivity of the
chosen seeds for a training set to its corresponding test set.
The results are found in Table 1. Some seed sets chosen
showed signs of overtraining, but others were quite
successful, where the chosen seeds work well for their
training set as well and have low false positive rate.
We took the best seed set with near 100 percent
sensitivity for both its training and testing data, which
was the third of our experimental sets and used it in further
experiments. This seed set is shown in Table 2. We note that
this seed set has a five times lower false positive rate (1/8,000) than does BLASTP, while still hitting all of its testing alignments but four (which is not statistically significantly different from zero). We also considered a set of thresholds where we lowered the higher thresholds slightly to allow more hits and possibly avoid overtraining on the initial set of alignments. These altered thresholds are shown as well in Table 2 and give a total false positive rate of 1/6,900. (This set of thresholds also hits all 402 test alignments for that instance.)
4.1.2 A Weaker Requirement on the Sensitivity
As noted previously, we can alter our integer program so
that it does not require 100 percent sensitivity on the
training data set. We performed experiments on this
formulation, using five subsets of the training alignments
chosen as before, where we allowed between zero and five
alignments from the training set to be missed by the seed
set. We show results in Table 3, using again a randomly
chosen testing set for each training set. The training data
sets varied in size from 304 to 415, while the testing sets
ranged from 392 to 407 in size.
Unsurprisingly, if we did not hit all alignments in the
training set, we often miss alignments in the testing set as
well. However, the ranges of the sensitivities we saw in
testing data for the seed sets picked allowing some misses
in the training data were much less wide, suggesting that
there may be fewer seed thresholds lowered merely to
accommodate a single outlier in the training data. As such,
if slightly lower sensitivity is acceptable, this approach may
give much more predictable results than training to require
all alignments to be hit.
4.1.3 Multiple Ungapped Alignment Seeds
Ungapped alignment seeds can be seen as breaking the
model we have for alignment speed. The most straightfor-
ward implementation of ungapped alignment seeds would
involve a hash table keyed on the letters corresponding to
the positions in the bounds vector b, where there is a
nontrivial lower bound on the score of a position. Still, even
after the first step, where we identified pairs of positions
satisfying the minimum bounds scores, we still need
another test to verify that a pair of positions satisfies the
requirement of the dot product of the local alignment score
with the vector v of positional multipliers being higher than
the threshold. Similar limitations affect any such two-phase
seed, such as requiring that two hypothetically aligned
positions satisfy two vector seeds at once.
If we assume, however, that testing a hit from the simple hash table (to verify whether the dot product of the local alignment scores with the vector of multipliers v exceeds the threshold T) can be done so rapidly that we can throw out misses without having to count them, then we return to the case
from before, where we need count only the fraction of
positions expected to pass both levels of filtration. This
assumption may be appropriate, assuming that the small
amount of time taken to throw out a hash-table hit that does
not satisfy the dot product threshold is much, much smaller
than the amount of time needed to throw out a hit to the
whole ungapped alignment seed that still does not make a
good local alignment.
TABLE 3
Weakening Sensitivity to Testing Alignment
Reduces Sensitivity on Training Alignments
TABLE 2
Seeds and Thresholds Chosen by
Integer Programming for 409 Test Alignments
TABLE 1
Hit Rates for Optimal Seed Sets for Various Sets of Training
Alignments when Applied to an Unrelated Test Set
With this in mind, we tested our set of moderate
alignments on a simple collection of ungapped alignment
seed patterns to identify whether ungapped alignment seeds
form a potentially superior seed filtering approach to vector
seeds. Of course, since they include vector seeds as a special
case, this is trivial, but our interest is primarily whether the
advantage of ungapped alignments is large enough to merit
their consideration over that of vector seeds.
In our experiments, we used ungapped alignment seeds
where the vector of score lower bounds consisted of only
the values 0 and −∞ (which results in no score restriction);
we also allowed the vector of pairwise multipliers to only
be the all-ones vector. This simple approach, which was
used independently in the multiple aligner of Brown and
Hudek [15] and in the tPatternHunter protein aligner [7],
simply requires a good local region, with certain specified
positions having positive score. We required that the
bounds vector have at most four active positions and
considered seed lengths between three and six. Note that, in
this model, the bounds vector (0, 0, 0, −∞) behaves quite differently than the bounds vector (0, 0, 0) because we will
be adding pairwise scores of four positions in the former
case and three in the latter.
The results of our experiment are shown in Table 4. We
used the same testing and training data sets as for Table 3.
In general, these results are slightly worse than the results
of our original experiments with vector seeds when we
require 100 percent sensitivity to testing data, but improve
when we allow some misses in the training data. Typical
false positive rates on the order of 1/10,000 are common with testing sensitivity of approximately 99 percent, as before; again, the corresponding false positive rate for BLASTP's seed is approximately 1/1,600.
A positive note to the ungapped alignment seeds is that
there seems to be less overtraining: As the training
sensitivity is allowed to go down slightly, the testing
sensitivity does not plummet as quickly as for vector seeds.
One reason for this is that an ungapped alignment seed,
both times they have been implemented [7], [15], still
requires high-scoring short local alignment around the
seed. As we show in the next section, focusing on very
narrow alignments in seeding may be inappropriate and
one should instead focus on longer windows around a hit
before discarding it with a filter.
4.2 A Broader Set of Alignments
Returning to our set of vector seeds from Table 2, we then
considered a larger set of alignments in our target range of
good, but not great scores to verify if the advantage of
multiple seeds still holds. We used the Smith-Waterman
algorithm to compute all alignments between pairs of a
1,000-sequence subset of our protein data set and computed
how many of them were not found by BLASTP. Only 970
out of 2,950 Smith-Waterman alignments with BLOSUM62
score between +112 and +168 had been identified by
BLASTP, even though alignments in this score range would
have happened by chance only one time in 10,000 according
to BLASTP's statistics.
Almost all of these 2,950 alignments, 2,942, had a hit to
the BLASTP default seed. Despite this, however, only 970
actually built a successful BLASTP alignment. Our set of
eight seeds had hits to 1,939 of the 1,980 that did not build a
BLASTP alignment and to 955 of the 970 that did build a
BLASTP alignment, so, at first glance, the situation does not
look good. However, the difference between having a hit
and having a hit in a good region of the alignment is where
we are able to show substantial improvement.
The discrepancy between hits and alignments comes
because the BLASTP seed can have a hit in a bad part of the
alignment, which is filtered out. Typically, such hits occur
in a region where the source of positive score is quite short,
which is much more likely with an unspaced seed than with
a spaced seed. We looked at all of the regions of length
10 amino acids of alignments that included a hit to a seed
(either the BLASTP seed or one of the multiple seeds), and
assigned the best score of such a region to that alignment; if
no ungapped region of length 10 surrounded a hit, we
assumed it would certainly be filtered out. The data are
shown in Table 5 and show that of the alignments hit by the
spaced seeds, they are hit in regions that are essentially
identical in conservation to where the BLASTP seed hits
them. For example, 47.7 percent of the alignments contain a
10-amino acid region around a hit to the ((1, 1, 1), 13) seed
with BLOSUM score at least +30, while 46.7 percent contain
such a region surrounding a hit to one of the multiple seeds
with higher threshold. If we use the lower thresholds that
allow slightly more false positives, their performance is
actually slightly better than BLASTP's.
Table 5 also shows that the higher-threshold seed ((1, 1, 1),
15), which has a worse false positive rate (1/5,700) than our
ensembles of seeds, performs substantially worse: Namely,
only 64 percent of the alignments have a hit to the single seed
foundin a region with local score above +25, while 73 percent
of the alignments have a hit to one of the multiple seeds with
this property. This single seed strategy is clearly worse than
the multiple seed strategy of comparable false positive rate
and the optimized seeds perform comparably to BLASTP in
TABLE 4
Ungapped Alignment Seeds Offer
Similar Performance to Vector Seeds
identifying the alignments that actually have a core con-
served region.
Our experiments show that multiple seed models can have an impact on local alignment of protein sequences. Using many spaced seeds, which we picked by optimizing an integer program, we find seed models with a chance of finding a good hit in a moderate-scoring alignment comparable to that of the BLASTP seed, with four to five times fewer noise hits. The difficulty with the BLASTP seed is that it not only has more junk hits and more hits in overlapping places, it
also has more hits in short regions of true alignments, which
are likely to be filtered and thrown out.
5 CONCLUSIONS
We have given a theoretical framework to the problem of
using spaced seeds for protein homology search detection.
Our result shows that using multiple vector or ungapped
alignment seeds can give sensitivity to good parts of local
protein alignments essentially comparable to BLASTP,
while reducing the false positive rate of the search
algorithm by a factor of four to five.
Our set of vector seeds is chosen by optimizing an
integer programming framework for choosing multiple
seeds when we want 100 percent sensitivity to a collection
of training alignments. The framework is general enough to
accommodate many extensions, such as requiring a fixed
amount of sensitivity on the training (not only 100 percent),
allowing only a small number of seeds to be chosen or
allowing for many different sorts of seeding strategies. We
have mostly used it to optimize sets of vector seeds because
they encapsulate an approach to homology search for
nucleotides that has been very successful.
One difficulty with our approach is that it relies on a
theoretical estimate of the runtime of a homology search
program: namely, that the program will take time propor-
tional to the number of false positives found by the seeding
method. As seeding methods become more complex, such
as the two-step ungapped alignment seeds, it may become
harder to identify what a false positive is, in particular, if
a false positive fits through one step of a filter, but is quickly
discarded before the next step, should it count toward the
estimated runtime? Using our framework, we identified a
set of seeds for moderate-scoring protein alignments whose
total false positive rate in random sequence is four-to-five
times lower than the default BLASTP seed. This set of seeds
had hits to slightly fewer alignments in a test set of
moderate-scoring alignments found by the Smith-Water-
man algorithm than found by BLASTP; however, the
BLASTP seeds hit subregions of these alignments that were
actually slightly worse than hit by the spaced seeds. Hence,
given the filtering used by BLASTP, we expect that the two
alignment strategies would give comparable sensitivity,
while the spaced seeds give four times fewer false hits.
ACKNOWLEDGMENTS
The author would like to thank Ming Li for introducing him
to the idea of spaced seeds. This work is supported by the
Natural Science and Engineering Research Council of
Canada and by the Human Frontier Science Program. A
preliminary version of this paper [21] appeared at the
Workshop on Algorithms in Bioinformatics, held in Bergen,
Norway, in September, 2004.
REFERENCES
[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman,
Basic Local Alignment Search Tool, J. Molecular Biology, vol. 215,
no. 3, pp. 403-410, 1990.
[2] B. Ma, J. Tromp, and M. Li, PatternHunter: Faster and More
Sensitive Homology Search, Bioinformatics, vol. 18, no. 3, pp. 440-
445, Mar. 2002.
[3] B. Brejova, D. Brown, and T. Vinar, Vector Seeds: An Extension to
Spaced Seeds Allows Substantial Improvements in Sensitivity and
Specificity, Proc. Third Ann. Workshop Algorithms in Bioinformatics,
pp. 39-54, 2003.
[4] M. Li, B. Ma, D. Kisman, and J. Tromp, PatternHunter II: Highly
Sensitive and Fast Homology Search, J. Bioinformatics and
Computational Biology, vol. 2, no. 3, pp. 419-439, 2004.
[5] J. Xu, D. Brown, M. Li, and B. Ma, Optimizing Multiple Spaced
Seeds for Homology Search, Proc. 15th Ann. Symp. Combinatorial
Pattern Matching, pp. 47-58, 2004.
[6] Y. Sun and J. Buhler, Designing Multiple Simultaneous Seeds for
DNA Similarity Search, Proc. Eighth Ann. Intl Conf. Computational
Biology, pp. 76-84, 2004.
[7] D. Kisman, M. Li, B. Ma, and L. Wang, TPatternHunter: Gapped,
Fast and Sensitive Translated Homology Search, Bioinformatics,
2004.
TABLE 5
Hits in Locally Good Regions of Alignments
[8] T. Smith and M. Waterman, Identification of Common Molecular
Subsequences, J. Molecular Biology, vol. 147, pp. 195-197, 1981.
[9] B. Brejova, D. Brown, and T. Vinar, Vector Seeds: An Extension to
Spaced Seeds, J. Computer and System Sciences, 2005, pending
publication.
[10] J. Buhler, U. Keich, and Y. Sun, Designing Seeds for Similarity
Search in Genomic DNA, Proc. Seventh Ann. Intl Conf. Computa-
tional Biology, pp. 67-75, 2003.
[11] B. Brejova, D. Brown, and T. Vinar, Optimal Spaced Seeds for
Homologous Coding Regions, J. Bioinformatics and Computational
Biology, vol. 1, pp. 595-610, Jan. 2004.
[12] U. Keich, M. Li, B. Ma, and J. Tromp, On Spaced Seeds for
Similarity Search, Discrete Applied Math., vol. 138, pp. 253-263,
2004.
[13] K.P. Choi, F. Zeng, and L. Zhang, Good Spaced Seeds for
Homology Search, Bioinformatics, vol. 20, no. 7, pp. 1053-1059,
2004.
[14] G. Kucherov, L. Noe, and Y. Ponty, Estimating Seed Sensitivity
on Homogeneous Alignments, Proc. Fourth IEEE Intl Symp.
BioInformatics and BioEng., pp. 387-394, 2004.
[15] D. Brown and A. Hudek, New Algorithms for Multiple DNA
Sequence Alignment, Proc. Fourth Ann. Workshop Algorithms in
Bioinformatics, pp. 314-326, 2004.
[16] M. Csűrös, Performing Local Similarity Searches with Variable
Length Seeds, Proc. 15th Ann. Symp. Combinatorial Pattern
Matching, pp. 373-387, 2004.
[17] K. Choi and L. Zhang, Sensitive Analysis and Efficient Method
for Identifying Optimal Spaced Seeds, J. Computer and System
Sciences, vol. 68, pp. 22-40, 2004.
[18] G. Kucherov, L. Noe, and Y. Ponty, Multiseed Lossless
Filtration, Proc. 15th Ann. Symp. Combinatorial Pattern Matching,
pp. 297-310, 2004.
[19] U. Feige, A Threshold of ln n for Approximating Set Cover,
J. ACM, vol. 45, pp. 634-652, 1998.
[20] A. Bairoch and R. Apweiler, The SWISS-PROT Protein Sequence
Database and Its Supplement TrEMBL in 2000, Nucleic Acids
Research, vol. 28, no. 1, pp. 45-48, 2000.
[21] D. Brown, Multiple Vector Seeds for Protein Alignment, Proc.
Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 170-181,
2004.
Daniel G. Brown received the undergraduate
degree in mathematics with computer science
from the Massachusetts Institute of Technology
in 1995 and the PhD degree in computer science
from Cornell University in 2000. He then spent a
year as a research scientist at the Whitehead
Institute/MIT Center for Genome Research in
Cambridge, Massachusetts, working on the Hu-
man and Mouse Genome Projects. Since 2001,
he has been an assistant professor in the School of Computer Science
at the University of Waterloo.
Editorial: State of the Transaction
Dan Gusfield

It is a pleasure to write this editorial at the beginning of the second year of the publication of the IEEE/ACM Transactions
on Computational Biology and Bioinformatics (TCBB). The last year saw the publication of four issues of TCBB, the first of
which was mailed out roughly nine months after our initial call for submissions. That accomplishment was the result of
tremendous cooperation and hard work on the part of authors, reviewers, associate editors, and staff. I would like to thank
everyone for making that possible.
During the past year, we received roughly 205 submissions and, presently, we have about 50 of those under review. In
our first year, we published 16 papers, including Part I of a special section on The Best Papers from WABI (Workshop on
Algorithms in Bioinformatics). Part II will appear this year, along with a special issue on Machine Learning in
Computational Biology and Bioinformatics. Other special issues are also in the planning stages. The papers that we have
published are establishing TCBB as a venue for the highest quality research in a broad range of topics in computational
biology and bioinformatics. I know that some of the papers we have already published will be cited as the foundational or
the definitive papers in several subareas of the field.
A goal for the future is to attract more submissions from the biology community, and this will be facilitated when TCBB
is indexed in MEDLINE, which requires two years of publication before it will consider indexing a journal. So, this second
year of publication will hopefully lead to the inclusion of TCBB in MEDLINE.
Finally, I would like to share some wonderful news we received in February. The Association of American Publishers,
Professional and Scholarly Publishing Division awarded TCBB their Honorable Mention award for The Best New Journal
in any category for the year 2004. Only one Honorable Mention is awarded. Again, the credit for this accomplishment goes
to all the authors, reviewers, associate editors, and staff who have worked so hard to establish TCBB in this last year. I look
forward to continued growth and success of TCBB in our second year of publication.
Dan Gusfield
Editor-in-Chief
Bases of Motifs for Generating
Repeated Patterns with Wild Cards
Nadia Pisanti, Maxime Crochemore, Roberto Grossi, and Marie-France Sagot
Abstract: Motif inference represents one of the most important areas of research in computational biology, and one of its oldest.
Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in
relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature:
matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work
has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns.
This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently
proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs.
Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all
the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a
sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus,
smaller) than previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of
motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the
minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope to
efficiently compute such bases unless the quorum is fixed.
Index Terms: Motif bases, repeated motifs.
1 INTRODUCTION
IDENTIFYING motifs in biological sequences is one of the
oldest fields in computational biology. Yet, it remains also
very much an open problem in the sense that no currently
existing definition of a motif is fully satisfying for the
purposes of accurately and sensitively identifying the
biological features that such motifs are supposed to
represent. Among the most difficult to model are binding
sites, as they are often quite degenerate. Indeed, variability
may be considered part of their function. Such variability
translates itself into changes in the motif, mostly substitu-
tions, that do not affect the biological function. Two main
schools of thought on how to define motifs in biology have
coexisted for years, each valid in its own way. The first
works with a statistical representation of motifs, usually
given in the form of what is called in the literature a PSSM (Position Specific Scoring Matrix) [9], [11], [12], [13], or a
profile, which is one type of PSSM. Interesting PSSMs are
those that have a high information value (measured, for
instance, by the relative entropy of the corresponding
matrix). The second school defines a motif as a consensus
[4], [24]. A motif is therefore a pattern that appears
repeatedly, in general, approximately, that is, up to a
certain number of differences (most often substitutions
only) in a sequence or set of sequences of interest.
It is generally accepted that PSSMs are more appropriate
for modeling an already known (in the sense of well-
characterized) biological feature for the purpose of then
identifying other occurrences of the feature, even though
the false positive rate of this further identification remains
very high. Identifying the PSSM itself ab initio is still,
however, a difficult problem, particularly for large data sets
or when the amount of noise may be high. The methods
used are also heuristics with no guarantee, leaving open the possibility that motifs that are statistically as meaningful
as those reported have been missed.
On the other hand, formulating the problem of identifying approximate motifs as patterns enables one to address the motif identification problem in an exhaustive fashion, even though the algorithmic complexity of the problem remains relatively high, and the model may appear more limited than PSSMs. Because of the lower algorithmic complexity of identifying repeated patterns, the model may, however, be made more complex and biologically pertinent in other ways. One could think of introducing motifs composed of various different submotifs separated by variable-length distances that may then also be found in a relatively efficient way [14]. Motifs presenting such a high level of combinatorial complexity are indeed frequent, particularly in eukaryotes. Exhaustively seeking approximately repeated patterns may, however, have the drawback of producing many solutions, that is, many motifs. In fact, the number of motifs identified with this model may be so high (e.g., exponential in the size of the input) that it is as impossible to manage as the initial input sequence(s), even though they provide a first way of structuring such input.
N. Pisanti and R. Grossi are with the Dipartimento di Informatica, Università di Pisa, Italy. E-mail: {pisanti, grossi}@di.unipi.it.
M. Crochemore is with the Institut Gaspard-Monge, University of Marne-la-Vallée, France, and King's College London. E-mail: maxime.crochemore@univ-mlv.fr.
M.-F. Sagot is with INRIA Rhône-Alpes, Laboratoire de Biométrie et Biologie Évolutive, Université Claude Bernard Lyon 1, France, and King's College London. E-mail: marie-france.sagot@inria.fr.
Manuscript received 14 Mar. 2004; revised 2 Dec. 2004; accepted 16 Feb. 2005; published online 30 Mar. 2005.
Yet, it appeared clear also to any
computational biologist working with motifs as patterns that
there was further structure to be extracted from the set of
motifs found, even when such a set is huge. Furthermore,
such a structure could reflect some additional biological
information, thus providing additional motivation for infer-
ring it. Doing this is generally addressed by means of
clustering, or even by attempting to bring together the two
types of motif models (PSSMs and patterns). Indeed, recently
researchers have been using pattern detection as a first filter-
flavored step toward inferring PSSMs from biological
sequences [6]. This seems very promising although much
work remains to be done to precisely determine the relation
between the two types of models, and to fully explore the
biological implications this may have.
Again, each of the two above approaches is valid, but the
question remained open whether or not the inner structure
of a set of motifs could be expressed in a manner that would
be more satisfying from both the mathematical and the
biological points of view. Then, in 2000, a paper by Parida et
al. [17] seemed to present a way of extracting such an inner
structure in a very elegant and powerful way for a
particular type of motif. The power of their proposal
resided in the fact that the above mentioned structure
corresponded to a well-known and precisely defined
mathematical object and, moreover, guaranteed that no
solution would be lost. Exhaustiveness in relation to the
chosen type of motif is also preserved, thus enabling a
biologist to draw some conclusions even in the face of
negative answers (i.e., when no motifs, or no a priori
expected motifs are found in a given input), something
which PSSM-detecting methods do not allow. The structure
is that of a basis of motifs. Informally speaking, it is a subset
of all the motifs satisfying some input parameters (related,
for instance, to which differences between a pattern and its
occurrences are allowed) from which it is possible to
recover all the other motifs, in the sense that all motifs not
in the basis are a combination of some (in general, a few
only) motifs in the basis. Such a combination is modeled by
simple rules to systematically generate the other motifs with
an output sensitive cost [18]. A basis would therefore also
provide a way of characterizing the input, which then might
be used to compare different inputs without resorting to the
traditional alignment methods with all the pitfalls they
present. The idea of a basis would fulfill such expectations
if its size could be proven to be small enough. The argument
[17] seemed to be that, for the type of motifs considered, a
compact enough basis could always be found.
The motifs considered in [17] were patterns with wild card symbols occurring in a given sequence s of n symbols drawn over an alphabet Σ. A wild card symbol is a special symbol matching any other element.¹ For example, the pattern T◦G matches both TTG and TGG inside s = TTGG. Parida et al. focused on patterns which appear at least q times in s for an input parameter q ≥ 2, called the quorum. This may, at first sight, seem an even more restrictive type of motif than patterns in general. It, however, has the merit of capturing one aspect of biological features that current PSSMs in general ignore, or address only in an indirect way. This aspect often concerns isolated positions inside a motif that are not part of the biological feature being captured. This is the case, for instance, with some binding sites, particularly at the protein level. Studying patterns with wild cards has a further very important motivation in biology, even when no differences (such as substitutions) are allowed. Indeed, motifs such as these or closely related ones can be used as seeds for finding long repeats and for aligning, pairwise or multiple-wise, a set of sequences or even whole genomes [15], [23].
The basis introduced by Parida et al. had interesting features, but presented some unsatisfying properties. In particular, as we show in this paper, there is an infinite family of strings for which the authors' basis contains Ω(n²) motifs for q = 2. This contradicts the upper bound of 3n for any q ≥ 2 given in [17]. As a result, the O(n³ log n)-time algorithm mentioned in [17] for finding the basis of motifs does not hold, since it relies on the upper bound of 3n, thus leaving open the problem of efficiently discovering a basis. A refinement of the definition of basis and an incremental construction in O(n³) time has recently been described by Apostolico and Parida [2]. A comparative survey of several notions of bases can be found in [22].

Closely following previous work, here we introduce a new definition of basis. The condition for the new basis is stronger than that of [17] and, hence, our basis is included in that of [17] (and is thus smaller), while both are able to generate the same set of motifs with mechanical rules. Our basis is moreover symmetric: given a string s, the motifs in the basis for its reverse are the reversals of the motifs in the basis for s. Moreover, the number of motifs in our basis can provably be upper bounded in the worst case by n − 1 for q = 2, and these motifs occur in s a total of 2n times at most. However, we reveal an exponential dependency on q for the number of motifs in all bases defined so far (i.e., including our basis, Parida et al.'s, and Pelfrène et al.'s [19]), something unnoticed in previous work. Consequently, no polynomial-time algorithm can exist for finding one of these bases with arbitrary values of q ≥ 2.
2 NOTATION AND TERMINOLOGY

We consider strings that are finite sequences of letters drawn from an alphabet Σ, whose elements are also called solid characters. We introduce an additional symbol (denoted by ◦ and called wild card) that does not belong to Σ and matches any letter; a wild card clearly matches itself. The length of a string t, denoted by |t|, is the number of letters and wild cards in t, and t[i] indicates the letter or wild card at position i in t, for 0 ≤ i ≤ |t| − 1 (hence, t = t[0] t[1] ... t[|t| − 1], also noted t[0..|t| − 1]).

Definition 1 (pattern). Given the alphabet Σ, a pattern is a string in Σ ∪ Σ(Σ ∪ {◦})*Σ (that is, it starts and ends with a solid character).

The patterns are related by the following specificity relation ⪯.

1. In the literature on sequence analysis and pattern matching, the wild card is often referred to as "do not care" (as it is in the literature on bases of motifs). Therefore, we will use this latter term when referring to the sequence analysis and string matching literature.
Definition 2 (⪯). For individual characters σ1, σ2 ∈ Σ ∪ {◦}, we have σ1 ⪯ σ2 if σ1 = ◦ or σ1 = σ2. Relation ⪯ extends to strings in (Σ ∪ {◦})* under the convention that each string t is implicitly surrounded by wild cards, namely, letter t[j] is ◦ when j < 0 or j ≥ |t|. Hence, v is more specific than u (written u ⪯ v) if u[j] ⪯ v[j] for any integer j.
We can now formally define the occurrences of a pattern x in s and its location list.

Definition 3 (occurrence, L). We say that u occurs at position ℓ in v if u[j] ⪯ v[j + ℓ] for 0 ≤ j ≤ |u| − 1 (equivalently, we say that u matches v[ℓ..ℓ + |u| − 1]). For the input string s ∈ Σ+ with n = |s|, we consider the location list L_x ⊆ {0, ..., n − 1} as the set of all the positions on s at which x occurs.

When a pattern u occurs in another pattern (or in a string) v, we also say that v contains u. For example, the location list of x = T◦G in s = TTGG is L_x = {0, 1}; hence, s contains x.
Definition 4 (motif). Given a parameter q ≥ 2, called quorum, we say that pattern x is a motif in s when |L_x| ≥ q.

Given any location list L_x and any integer d, we adopt the notation L_x + d = {ℓ + d | ℓ ∈ L_x} for indicating the occurrences in L_x displaced by the offset d.
Definition 5 (maximality). A motif x is maximal if, for any other motif y that contains x, there is no integer d such that L_y = L_x + d.

In other words, making a maximal motif x more specific (thus obtaining y) reduces the number of its occurrences in s. Definition 5 is equivalent to that meant in [17], which states that x is maximal if there exist no other motif y and no integer d ≥ 0 verifying L_x = L_y + d, such that x[j] ⪯ y[j + d] for 0 ≤ j ≤ |x| − 1 (that is, x occurs in y at position d in our terminology).²
Definition 6 (irredundant motif). A maximal motif x is irredundant if, for any maximal motifs y_1, y_2, ..., y_k such that L_x = ∪_{i=1}^{k} L_{y_i}, motif x must be one of the y_i's. Conversely, if all the y_i's are different from x, pattern x is said to be covered by motifs y_1, y_2, ..., y_k.
The basis of irredundant motifs for string s is the set of all irredundant motifs in s. The definition is given with respect to the set of maximal motifs of the input string, which is unique; indeed, such a basis is unique and it can be used as a generator for all maximal motifs in s, as proved in [17]. The size of the basis is the number of irredundant motifs contained in it. We illustrate the notions given so far by employing the example string s = FABCXFADCYZEADCEADC. For this string and q = 2, the location list of motif x_1 = A◦C is L_{x_1} = {1, 6, 12, 16}, and that of motif x_2 = FA◦C is L_{x_2} = {0, 5}. They are both maximal because they lose at least one of their occurrences when extended with solid characters at one side (possibly with wild cards in between), or when their wild cards are replaced by solid characters. However, motif x_3 = DC, having list L_{x_3} = {7, 13, 17}, is not maximal. It occurs in x_4 = ADC, where L_{x_4} = {6, 12, 16}, and its occurrences can be obtained from those of x_4 by a displacement of d = 1 positions. The basis of the irredundant motifs for s is made up of x_1 = A◦C, x_2 = FA◦C, x_4 = ADC, and x_5 = EADC. The location list of each of them cannot be obtained from the union of any of the other location lists.
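To make these definitions concrete, the following short Python sketch (our own illustration, not the authors' code; the wild card ◦ is written '.' and helper names such as location_list are ours) computes location lists by brute force and reproduces the running example.

# Illustrative sketch: patterns with the wild card '.' and their location lists
# (Definitions 3 and 4).

def occurs_at(pattern: str, s: str, pos: int, wild: str = ".") -> bool:
    """True if `pattern` matches s[pos : pos + len(pattern)] (Definition 3)."""
    if pos < 0 or pos + len(pattern) > len(s):
        return False
    return all(p == wild or p == c
               for p, c in zip(pattern, s[pos:pos + len(pattern)]))

def location_list(pattern: str, s: str, wild: str = ".") -> list[int]:
    """All positions of s at which `pattern` occurs (the list L_x)."""
    return [i for i in range(len(s) - len(pattern) + 1)
            if occurs_at(pattern, s, i, wild)]

def is_motif(pattern: str, s: str, q: int = 2, wild: str = ".") -> bool:
    """Definition 4: a pattern is a motif when it occurs at least q times."""
    return len(location_list(pattern, s, wild)) >= q

if __name__ == "__main__":
    s = "FABCXFADCYZEADCEADC"
    print(location_list("A.C", s))   # [1, 6, 12, 16]
    print(location_list("FA.C", s))  # [0, 5]
    print(location_list("ADC", s))   # [6, 12, 16]
    print(is_motif("EADC", s))       # True (occurs at 11 and 15)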
3 IRREDUNDANT MOTIFS: THE BASIS AND ITS SIZE FOR QUORUM q = 2

In this section, we show the existence of an infinite family of strings s_k (k ≥ 5) for which there are Ω(n²) irredundant motifs in the basis for quorum q = 2, where n = |s_k|. In this way, we disprove the claimed upper bound of 3n [17] mentioned in Section 1. Each string s_k will be constructed from a shorter string t_k, which we now define. For each k, t_k = A^k T A^k, where A^k denotes the letter A repeated k times (our argument works, in general, for w^k v w^k, where w and v are strings of equal length not sharing any common character). String t_k contains an exponential number of maximal motifs, including those of the form A{A, ◦}^{k−2}A with exactly two wild cards, that is, length-k patterns over {A, ◦} that begin and end with A and contain exactly two ◦'s. To see why, note that each such motif x occurs four times in t_k: two occurrences of x match the first and the last k letters of t_k, while each of the two wild cards of x, when aligned with the letter T of t_k, contributes one of the two remaining occurrences. Extending x or replacing a wild card with a solid character reduces the number of these occurrences, so x is maximal. The idea of our proof is to obtain the strings s_k by prefixing t_k with O(|t_k|) symbols so that these motifs x become irredundant in s_k. Since there are Θ(k²) of them, and n = |s_k| = Θ(|t_k|) = Θ(k), this leads to the claimed result.
In order to define the strings s_k on the alphabet Σ = {A, T, u, v, w, x, y, z, a_1, a_2, ..., a_{k−2}}, we introduce some notation. Let w^R denote the reversal of a string w, and let ev_k, od_k, u_k, v_k be the strings thus defined:

if k is even:  ev_k = a_2 a_4 ... a_{k−2},  od_k = a_1 a_3 ... a_{k−3},
               u_k = ev_k u ev_k^R vw ev_k,  v_k = od_k xy od_k^R z od_k;

if k is odd:   ev_k = a_2 a_4 ... a_{k−3},  od_k = a_1 a_3 ... a_{k−2},
               u_k = ev_k uv ev_k^R wx ev_k,  v_k = od_k y od_k^R z od_k.

The strings s_k are then defined by s_k = u_k v_k t_k for k ≥ 5. Fig. 1 shows the construction for k = 7.

Fact 1. The length of u_k v_k is 3k, and that of s_k is n = 5k + 1.
2. Actually, the definition literally reported in [17] is "Definition 4 (Maximal Motif). Let p_1, p_2, ..., p_k be the motifs in a sequence s. Let p_i[j] be ◦ if j > |p_i|. A motif p_i is maximal if and only if there exists no p_l, l ≠ i, and no integer 0 ≤ δ such that L_{p_i} + δ = L_{p_l} and p_l[δ + j] ⪯ p_i[j] hold for 1 ≤ j ≤ |p_i|." (The symbols in p_i and p_l are indexed starting from 1 onward.) The corresponding example in the paper illustrates the definition for s = ABCDABCD, stating that p_i = ABCD is maximal while p_l = ABC is not. However, p_i does not match the definition because of the existence of its prefix p_l (setting δ = 0); hence, we suspect a minor typo in the definition, for which the definition should read as "... such that L_{p_i} = L_{p_l} + δ and p_i[j] ⪯ p_l[δ + j]."
Proof. Whatever the parity of k, the string u_k v_k contains the six letters u, v, w, x, y, z, two (forward) occurrences each of ev_k and od_k, and one occurrence each of ev_k^R and od_k^R. Since od_k and ev_k together contain one occurrence of each letter a_1, a_2, ..., a_{k−2}, we have |od_k| + |ev_k| = k − 2. Moreover, |ev_k^R| = |ev_k| and |od_k^R| = |od_k|, so that |u_k v_k| = 6 + 3(k − 2) = 3k. This proves the first statement. For the second statement, the total length of s_k follows by observing that |t_k| = 2k + 1, and so n = |s_k| = 3k + 2k + 1 = 5k + 1.
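The construction can be checked mechanically. The following sketch (our own code and encoding choices; the letters a_i are represented as pairs ("a", i) and the wild card as '.') builds t_k and s_k following the definitions above and verifies Fact 1 together with the occurrence counts used in the argument above and in Proposition 1 below.

# Verification sketch for the Section 3 construction (k = 7).

from itertools import combinations

def t(k: int) -> str:
    return "A" * k + "T" + "A" * k

def s(k: int):
    ev = [("a", i) for i in range(2, k - 1, 2)]   # a_2 a_4 ...
    od = [("a", i) for i in range(1, k - 1, 2)]   # a_1 a_3 ...
    rev = lambda w: list(reversed(w))
    if k % 2 == 0:
        u_k = ev + ["u"] + rev(ev) + ["v", "w"] + ev
        v_k = od + ["x", "y"] + rev(od) + ["z"] + od
    else:
        u_k = ev + ["u", "v"] + rev(ev) + ["w", "x"] + ev
        v_k = od + ["y"] + rev(od) + ["z"] + od
    return u_k + v_k + list(t(k))

def occurrences(pattern: str, text, wild: str = ".") -> list[int]:
    return [i for i in range(len(text) - len(pattern) + 1)
            if all(p == wild or p == text[i + j] for j, p in enumerate(pattern))]

k = 7
sk, tk = s(k), t(k)
assert len(sk) == 5 * k + 1 and len(tk) == 2 * k + 1          # Fact 1
# Every length-k motif over {A, .} with exactly two wild cards occurs 4 times in t_k.
for p1, p2 in combinations(range(1, k - 1), 2):
    motif = "".join("." if j in (p1, p2) else "A" for j in range(k))
    assert len(occurrences(motif, tk)) == 4
# A^p . A^(k-p-1) occurs exactly 3 times in s_k (the location list of Proposition 1).
for p in range(1, k - 1):
    motif = "A" * p + "." + "A" * (k - p - 1)
    assert len(occurrences(motif, sk)) == 3
print("construction checks passed for k =", k)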
Proposition 1. For 1 ≤ p ≤ k − 2, no motif of the form A^p ◦ A^{k−p−1} can be maximal in s_k. Also, motif A^k cannot be maximal in s_k.

Proof. Let u be an arbitrary motif of the form A^p ◦ A^{k−p−1}, with 1 ≤ p ≤ k − 2. Its location list is L_u = {0, k − p, k + 1} + |u_k v_k| = {3k, 4k − p, 4k + 1}, since |u_k v_k| = 3k by Fact 1 and u matches the two substrings A^k of s_k as well as A^p T A^{k−p−1}. The occurrences are shown in Fig. 1 for k = 7 and p = 2. No other occurrences are possible. Let us consider the position, say i, of the leftmost appearance of letter a_p in s_k (recall that there are three positions on s_k at which letter a_p occurs; we have i = 0 in our example of Fig. 1 with p = 2). We claim that the motif y = a_p ◦^{3k−i−1} u satisfies L_y = L_u − (3k − i). Since u occurs in y, it follows that u cannot be maximal in s_k by Definition 5 (setting d = −(3k − i)). To see why L_u = L_y + (3k − i), it suffices to prove that the distance in s_k between the positions of the two leftmost letters a_p is k − p, while that between the leftmost and the rightmost a_p is k + 1. The verification is a bit tedious because four cases arise according to whether each of k and p is even or odd. Since the cases are analogous, we detail only two of them, namely, when both k and p are even, and when k is even and p is odd. In the first case, the three occurrences of a_p are all in u_k. Moreover, the distance between the two leftmost letters a_p is the length of the substring a_p a_{p+2} ... a_{k−2} u a_{k−2} a_{k−4} ... a_{p+2}, that is, 2 |a_{p+2} ... a_{k−2}| + 2 = 2(k − 2 − p)/2 + 2 = k − p. The distance between the leftmost and the rightmost a_p is the length of a_p a_{p+2} ... a_{k−2} u ev_k^R vw a_2 a_4 ... a_{p−2}. This is also the length of u ev_k^R vw a_2 a_4 ... a_{p−2} a_p a_{p+2} ... a_{k−2} = u ev_k^R vw ev_k, that is, 2(k − 2)/2 + 3 = k + 1, as expected. In the second case, where k is even and p is odd, the occurrences of a_p are all in v_k. Analogously to the first case, the distance between the two leftmost letters a_p is the length of a_p a_{p+2} ... a_{k−3} xy a_{k−3} ... a_{p+2}, that is, 2 |a_{p+2} ... a_{k−3}| + 3 = 2(k − 3 − p)/2 + 3 = k − p. The distance between the leftmost and the rightmost a_p is the length of the string a_p a_{p+2} ... a_{k−3} xy od_k^R z a_1 a_3 ... a_{p−2}, which equals k + 1, the length of xy od_k^R z od_k. The analogous verification of the other two cases yields the fact that u cannot be maximal.

The second part of the proposition, for motif A^k, proceeds along the same lines, except that we choose y = a_p ◦^{3k−i−1} A^k, with i as before (note that y is not required to be maximal and that the motifs in the statement are maximal in t_k).
Proposition 2. Each motif of the form A{A, ◦}^{k−2}A with exactly two ◦'s is irredundant in s_k.

Proof. Let x be an arbitrary motif of the form A{A, ◦}^{k−2}A with two ◦'s, namely, x = A^{p_1} ◦ A^{p_2−p_1−1} ◦ A^{k−p_2−1} for 1 ≤ p_1 < p_2 ≤ k − 2. To prove that x is an irredundant motif, we first show that x is maximal. Its location list is L_x = {0, k − p_2, k − p_1, k + 1} + 3k, since |u_k v_k| = 3k by Fact 1 and x matches the two substrings A^k of s_k as well as A^{p_1} T A^{k−p_1−1} and A^{p_2} T A^{k−p_2−1}. Any other motif y such that x occurs in y can be obtained by replacing at least one wild card (at position p_1 or p_2) in x with a solid character, but this would cause the removal of position 4k − p_1 or 4k − p_2 from L_x. Analogously, extending x to the right by putting a solid character at position |x| or larger would eliminate position 4k + 1 from L_x. Finally, extending x to the left by a solid character would eliminate at least one position from L_x because no symbol occurs four times in u_k v_k. In conclusion, for any motif y such that x occurs in y, we have L_y ≠ L_x + d for any integer d and, thus, x is a maximal motif by Definition 5. We now prove that x is irredundant according to Definition 6. Let us consider an arbitrary set of maximal motifs y_1, y_2, ..., y_k such that L_x = ∪_{i=1}^{k} L_{y_i}. We claim that at least one y_i is of the form A{A, ◦}^{k−2}A. Indeed, there must exist a location list L_{y_i} containing position 4k + 1, since that position belongs to L_x. This implies that y_i occurs in the suffix A^k of s_k. It cannot be that |y_i| < k, since then y_i would occur also at some position j > 4k + 1, whereas j ∉ L_x, which is impossible. Consequently, y_i is of length k and matches A^k, thus being of the form A{A, ◦}^{k−2}A. We observe that y_i cannot contain zero or one ◦'s, as it would not be maximal by Proposition 1. Also, y_i cannot contain three or more ◦'s, as each distinct ◦ would match the letter T in s_k, giving |L_{y_i}| > |L_x|, which is impossible. The only possibility is that y_i contains exactly two ◦'s, at the same positions as x does, because L_{y_i} ⊆ L_x and both motifs are maximal. It follows that y_i = x, proving the proposition.
Theorem 2. The basis for string s_k contains Ω(n²) irredundant motifs, where n = |s_k| and k ≥ 5.

Proof. By Proposition 2, the number of irredundant motifs in s_k is at least (k − 2 choose 2) = Ω(k²), the number of choices of the two wild card positions in {A, ◦}^{k−2}. Since |s_k| = 5k + 1 by Fact 1, we get the conclusion.
Fig. 1. Example string s_7 (the letter a_i of the definition is simply denoted by i). Above it are shown the occurrences of u from the proof of Proposition 1, while the three lines below show the occurrences of motif x = a_4 ◦^19 AAAA◦AA in s_7. The letter 4 corresponds to position 4 of the wild card in AAAA◦AA.
4 TILING MOTIFS: THE BASIS AND ITS PROPERTIES

4.1 Terminology and Properties

In this section, we introduce a natural notion of a basis for generating all maximal motifs occurring in a string s of length n.

Definition 7 (tiling motif). A maximal motif x is tiling if, for any maximal motifs y_1, y_2, ..., y_k and for any integers d_1, d_2, ..., d_k such that L_x = ∪_{i=1}^{k} (L_{y_i} + d_i), motif x must be one of the y_i's. Conversely, if all the y_i's are different from x, pattern x is said to be tiled by motifs y_1, y_2, ..., y_k.
The notion of tiling is in general more selective than that of irredundancy. Continuing our example string s = FABCXFADCYZEADCEADC, we have seen in Section 2 that motif x_1 = A◦C is irredundant for s. Now, x_1 is tiled by x_2 = FA◦C and x_4 = ADC according to Definition 7, since its location list, L_{x_1} = {1, 6, 12, 16}, can be obtained from the union of L_{x_2} = {0, 5} and L_{x_4} = {6, 12, 16} with respective displacements d_2 = 1 and d_4 = 0.
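The covering condition of Definition 7 is easy to test mechanically for small examples. The following sketch (our own helper names, wild card written '.') tries, for each candidate motif y, every displacement at which the tiled pattern x occurs inside y, in the spirit of Remark 1 below.

# Sketch of the covering test behind Definition 7.

def location_list(pattern, s, wild="."):
    return {i for i in range(len(s) - len(pattern) + 1)
            if all(p == wild or p == s[i + j] for j, p in enumerate(pattern))}

def occurs_in(x, y, d, wild="."):
    """Does pattern x occur at offset d inside pattern y?"""
    return 0 <= d <= len(y) - len(x) and all(
        c == wild or c == y[d + j] for j, c in enumerate(x))

def is_tiled_by(x, candidates, s, wild="."):
    lx = location_list(x, s, wild)
    covered = set()
    for y in candidates:
        if y == x:
            continue
        ly = location_list(y, s, wild)
        for d in range(len(y) - len(x) + 1):
            if occurs_in(x, y, d, wild):       # then L_y + d is a subset of L_x
                covered |= {pos + d for pos in ly}
    return covered == lx

s = "FABCXFADCYZEADCEADC"
print(is_tiled_by("A.C", ["FA.C", "ADC"], s))    # True: ({0,5}+1) union ({6,12,16}+0)
print(is_tiled_by("ADC", ["FA.C", "EADC"], s))   # False: ADC is tiling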
Remark 1. A fairly direct consequence of Definition 7 is that if x is tiled by y_1, y_2, ..., y_k with associated displacements d_1, d_2, ..., d_k, then x occurs at position d_i in y_i for 1 ≤ i ≤ k. As a consequence, we have d_i ≥ 0 in Definition 7. Note also that the y_i's in Definition 7 are not necessarily distinct and that k > 1 for tiled motifs. (This follows from the fact that L_x = L_{y_1} + d_1 with x ≠ y_1 would contradict the maximality of both x and y_1.) As a result, a maximal motif x occurring exactly q times in s is tiling, as it cannot be tiled by any other motifs: such motifs would occur less than q times.
The basis of tiling motifs is the complete set of all tiling motifs for s, and the size of the basis is the number of these motifs. For example, the basis, let us denote it by B, for FABCXFADCYZEADCEADC contains FA◦C, EADC, and ADC as tiling motifs. Although Definition 7 is derived from that of irredundant motifs given in Definition 6, the difference is much more substantial than it may appear. The basis of tiling motifs relies on the fact that tiling motifs are considered as invariant by displacement, as for maximality. Consequently, our definition of basis is symmetric, that is, each tiling motif in the basis for the reverse string is the reverse of a tiling motif in the basis of s. This follows from the symmetry in Definition 7 and from the fact that maximality is also symmetric in Definition 5. It is a sine qua non condition for having a notion of basis invariant by the left-to-right or right-to-left order of the symbols in s (like the entropy of s), while this property does not hold for the irredundant motifs.

The basis of tiling motifs has further interesting properties for quorum q = 2, illustrated in Sections 4.2, 4.3, and 4.4. In Section 4.2, we show that our basis is linear (that is, its size is at most n − 1). In Section 4.3, we show that the total size of the location lists for the tiling motifs is less than 2n, describing how to find them in O(n² log n log |Σ|) time. In Section 4.4, we discuss some applications, such as generating all maximal motifs with the basis and finding motifs with a constraint on the number of undefined symbols.
4.2 A Linear Upper Bound for the Tiling Motifs with Quorum q = 2

Given a string s of length n, let B denote its basis of tiling motifs for quorum q = 2. Although the number of maximal motifs may be exponential and the basis of irredundant motifs may be at least quadratic (see Section 3), we show that the size of B is always less than n. For this, we introduce an operator ⊕ between the symbols of Σ to define the merges, which are at the heart of the properties of B. Given two letters σ1, σ2 with σ1 ≠ σ2, the operator satisfies σ1 ⊕ σ2 = ◦ and σ1 ⊕ σ1 = σ1. The operator applies to any pair of strings x, y, so that w = x ⊕ y satisfies w[j] = x[j] ⊕ y[j] for all integers j.

Definition 8 (Merge). For 1 ≤ k ≤ n − 1, let s_k be the (infinite) string whose character at position i is s_k[i] = s[i] ⊕ s[i + k]. If s_k contains at least one solid character, Merge_k denotes the motif obtained by removing all the leading and trailing ◦'s in s_k (that is, those appearing before the leftmost solid character and after the rightmost solid character).
For example, FABCXFADCYZEADCEADC has Merge_4 = EADC, Merge_5 = FA◦C, Merge_6 = Merge_10 = ADC, and Merge_11 = Merge_15 = A◦C. The latter is the only merge that is not a tiling motif.
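The merges of Definition 8 are straightforward to compute. The sketch below (ours; '.' stands for ◦) superposes s with itself shifted by k, trims the wild-card borders, and reproduces the merges listed above together with the pair of occurrences that generates each of them.

# Sketch of Definition 8: Merge_k for q = 2.

def merge2(s: str, k: int, wild: str = "."):
    """Return (Merge_k, occ) or None if s_k has no solid character.
    occ = (i, i + k) for the leftmost solid position i of s_k."""
    sk = [a if a == b else wild for a, b in zip(s, s[k:])]
    solid = [i for i, c in enumerate(sk) if c != wild]
    if not solid:
        return None
    i, j = solid[0], solid[-1]
    return "".join(sk[i:j + 1]), (i, i + k)

s = "FABCXFADCYZEADCEADC"
for k in range(1, len(s)):
    m = merge2(s, k)
    if m:
        print(k, m)   # k=4: EADC, k=5: FA.C, k=6 and 10: ADC, k=11 and 15: A.C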
Lemma 1. If Merge_k exists, it must be a maximal motif.

Proof. Motif x = Merge_k occurs at positions, say, i and i + k in s, where character s_k[i] is solid by Definitions 4 and 8. We use the fact that x occurs at least twice in s to show that it is maximal. Suppose it is not maximal. By Definition 5, there exists y ≠ x such that x occurs in y and L_y = L_x + d for some integer d (in this case d ≤ 0). Since y is more specific than x displaced by d, there must exist at least one position j with 0 ≤ j < |y| such that x[j + d] = ◦ and y[j] = σ ∈ Σ. Hence, x[j + d] = s[i + (j + d)] ⊕ s[i + k + (j + d)] = ◦, and so s[(i + d) + j] ≠ s[(i + k + d) + j]. Since y[j] cannot match both of the latter symbols in s, at least one of i + d or i + k + d is not a position of y in s. This contradicts the hypothesis that L_y = L_x + d, whereas both i, i + k ∈ L_x.
Lemma 2. For each tiling motif x in the basis B, there is at least one k for which Merge_k = x.

Proof. As mentioned in Remark 1, a maximal motif occurring exactly twice in s is tiling. Hence, if |L_x| = 2, say L_x = {i, j} with j > i, then x = Merge_k with k = j − i by the maximality of x and that of the merges by Lemma 1. Let us now consider the case where |L_x| > 2. For any pair i, j ∈ L_x, we denote by w_{ij} the string s[i..i + |x| − 1] ⊕ s[j..j + |x| − 1] obtained by applying the operator ⊕ to the two substrings of s matching x at positions i and j, respectively. We have x ⪯ w_{ij}, since x occurs at positions i and j, and L_x = ∪_{i,j ∈ L_x} L_{w_{ij}}, since we are taking all pairs of occurrences of x. Letting k = |j − i| for i, j ∈ L_x, we observe that w_{ij} is a substring of Merge_k, occurring at position, say, e_k in it. Thus,

∪_{i,j ∈ L_x} L_{w_{ij}} = ∪_{k = |j−i| : i,j ∈ L_x} (L_{Merge_k} + e_k) = L_x.

By Definition 7, the fact that x is tiling implies that x must be one Merge_k, proving the lemma.
We now state the main property of tiling bases, which follows directly from Lemma 2.

Theorem 3 (linearity of the basis). Given a string s of length n and the quorum q = 2, let M be the set of Merge_k, for 1 ≤ k ≤ n − 1, such that Merge_k exists. The basis B of tiling motifs for s satisfies B ⊆ M and, therefore, the size of B is at most n − 1.

A simple consequence of Theorem 3 implies a tight bound on the number of tiling motifs for periodic strings. If s = w^e for a string w repeated e > 1 times, then s has at most |w| tiling motifs.

Corollary 1. The number of tiling motifs for s is at most p, the smallest period of s.

The bound in Corollary 1 is not valid for irredundant motifs. String s = ATATATATA has period p = 2 and only one tiling motif, ATATATA, while its irredundant motifs are A, ATA, ATATA, and ATATATA.
4.3 A Simple Algorithm for Computing Tiling Motifs with Quorum q = 2

We describe how to compute the basis B for string s when q = 2. A brute-force algorithm generating first all maximal motifs of s takes exponential time in the worst case. Theorem 3 plays a crucial role in that we first compute the motifs in M and then discard those being tiled. Since B ⊆ M, what remains is exactly B. To appreciate this approach, it is worth noting that we are left with the problem of selecting B from at most n − 1 maximal motifs in M, rather than selecting B among all the maximal motifs in s, which may be exponential in number. Our simple algorithm takes O(n² log n log |Σ|) time and is faster than previous (and more complicated) methods discussed in Section 1.

Step 1. Compute the multiset M′ of merges. Letting s_k[i] be the leftmost solid character of string s_k in Definition 8, we define occ_x = {i, i + k} to be the positions of the two occurrences of x whose superposition generates x = Merge_k. For k = 1, 2, ..., n − 1, we compute string s_k in O(n − k) time. If s_k contains some solid characters, we compute x = Merge_k and occ_x in the same time complexity. As a result, we compute the multiset M′ of merges in O(n²) time. Each merge x in M′ is identified by a triplet (i, i + k, |x|), from which we can recover the jth symbol of x in constant time by simple arithmetic operations and comparisons.
Step 2. Transform the multiset M′ into the set M of merges. Since there can be two or more merges in M′ that are identical and correspond to the same merge in M, we put together all identical merges in M′ by radix sorting them. The total cost of this step is dominated by radix sorting, giving O(n²) time. As a byproduct, we produce the temporary location list T_x = ∪_{x′ ∈ M′ : x′ = x} occ_{x′} for each distinct x ∈ M thus obtained.

Lemma 3. Each motif x ∈ B satisfies T_x = L_x.

Proof. For a fixed x ∈ B, the fact that x is equal to at least one merge by Lemma 2 implies that T_x is well defined, with |T_x| ≥ 2. Since T_x ⊆ L_x, let us assume by contradiction that L_x − T_x ≠ ∅. For each pair i ∈ L_x − T_x and j ∈ T_x, let m_{ij} = Merge_{|j−i|}, which is maximal by Lemma 1. Note that each m_{ij} ≠ x by our assumption, as otherwise i would belong to T_x; however, x must occur in m_{ij}, say, at position e_{ij} in m_{ij}. Consequently, ∪_{i ∈ L_x−T_x, j ∈ T_x} (L_{m_{ij}} + e_{ij}) = L_x, since any occurrence of x is either some i ∈ L_x − T_x or some j ∈ T_x. At this point, we apply Definition 7 to the tiling motif x, obtaining the contradiction that x must be equal to one of the m_{ij}'s.

Notice that the conclusion of Lemma 3 does not necessarily hold for the motifs in M − B. For the example string FADABCXFADCYZEADCEADCFADC, one such motif is x = ADC with L_x = {8, 14, 18, 22} while T_x = {8, 18}.
Step 3. Select M* ⊆ M, where M* = {x ∈ M : T_x = L_x}. In order to build M*, we employ the Fischer-Paterson algorithm based on convolution [8] for string matching with don't cares to compute the whole list of occurrences L_x for each merge x ∈ M. Its cost is O((|x| + n) log n log |Σ|) time for each merge x. Since |x| < n and there are at most n − 1 motifs x ∈ M, we obtain O(n² log n log |Σ|) time to construct all lists L_x. We can then compute M* by discarding the merges x ∈ M such that T_x ≠ L_x in additional O(n²) time.

Lemma 4. The set M* satisfies the conditions B ⊆ M* and Σ_{x ∈ M*} |L_x| < 2n.

Proof. The first condition follows from the fact that the motifs in M − M* are surely tiled by Lemma 3. The second condition follows from the definition of M* and from the observation that

Σ_{x ∈ M*} |L_x| = Σ_{x ∈ M*} |T_x| ≤ Σ_{x ∈ M} |occ_x| < 2n,

since |occ_x| = 2 (see Step 1) and there are less than n of them.
The property of M* in Lemma 4 is crucial in that Σ_{x ∈ M} |L_x| = Θ(n²) when many lists contain Θ(n) entries. For example, s = A^n has n − 1 distinct merges, each of the form x = A^i for 1 ≤ i ≤ n − 1, and so |L_x| = n − i + 1. This would be a sharp drawback in Step 4, when removing tiled motifs, as it may turn into an Ω(n³) algorithm. Using M* instead, we are guaranteed that Σ_{x ∈ M*} |L_x| = O(n); hence, we may still have some tiled motifs in M*, but their total number of occurrences is O(n).
Step 4. Discard the tiled motifs in M*. We can now check for tiling motifs in O(n²) time. Given two distinct motifs x, y ∈ M*, we want to test whether L_x + d ⊆ L_y for some integer d and, in that case, we want to mark the entries in L_y that are also in L_x + d. At the end of this task, the lists having all entries marked are tiled (see Definition 7). By removing their corresponding motifs from M*, we eventually obtain the basis B by Lemma 4. Since the meaningful values of d are as many as the entries of L_y, we have only |L_y| possible values to check. For a given value of d, we avoid merging L_x and L_y in O(|L_x| + |L_y|) time to perform the test, as it would contribute a total of Ω(n³) time. Instead, we exploit the fact that each list has values in the range 0..n − 1, and use two bit-vectors of size n to perform the above check in O(|L_x| · |L_y|) time for all values of d. This gives O(Σ_y Σ_x |L_x| · |L_y|) = O(Σ_y |L_y| · Σ_x |L_x|) = O(n²) by Lemma 4.
We therefore detail how to perform the above check with L_x and L_y in O(|L_x| · |L_y|) time. We use two bit-vectors V_1 and V_2 of length n, initially set to all zeros. Given y ∈ M*, we set V_1[i] = 1 if i ∈ L_y. For each x ∈ M* − {y} and for each d ∈ L_y − i (where i is the smallest entry of L_x), we then perform the following test: if all j ∈ L_x + d satisfy V_1[j] = 1, we set V_2[j] = 1 for all such j. Otherwise, we take the next value of d, or the next motif if there are no more values of d, and we repeat the test. After examining all x ∈ M* − {y}, we check whether V_1[i] = V_2[i] for all i ∈ L_y. If so, y is tiled, as its list is covered by possibly shifted location lists of other motifs. We then reset the ones in both vectors in O(|L_y|) time.
Summing up Steps 1-4, we have that the dominant cost is that of Step 3, and we have proved the following result.

Theorem 4. Given an input string s of length n over the alphabet Σ, the basis of tiling motifs with quorum q = 2 can be computed in O(n² log n log |Σ|) time. The total number of motifs in the basis is less than n, and the total number of their occurrences in s is less than 2n.
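For illustration only, the following self-contained Python sketch follows Steps 1-4 naively (it is our own code with our own names, not the authors' implementation, and it recomputes the small helpers of the earlier sketches so the block runs on its own; '.' is the wild card).

# End-to-end sketch of Steps 1-4 for q = 2 on a short string.

from collections import defaultdict

def merge2(s, k, wild="."):
    sk = [a if a == b else wild for a, b in zip(s, s[k:])]
    solid = [i for i, c in enumerate(sk) if c != wild]
    if not solid:
        return None
    i, j = solid[0], solid[-1]
    return "".join(sk[i:j + 1]), (i, i + k)

def location_list(x, s, wild="."):
    return {i for i in range(len(s) - len(x) + 1)
            if all(c == wild or c == s[i + j] for j, c in enumerate(x))}

def tiling_basis(s, wild="."):
    # Steps 1-2: merges and their temporary lists T_x.
    T = defaultdict(set)
    for k in range(1, len(s)):
        m = merge2(s, k, wild)
        if m:
            x, occ = m
            T[x] |= set(occ)
    # Step 3: keep the merges x with T_x = L_x.
    M_star = {x: location_list(x, s, wild) for x in T}
    M_star = {x: lx for x, lx in M_star.items() if T[x] == lx}
    # Step 4: discard motifs whose list is covered by displaced lists of the others.
    basis = []
    for y, ly in M_star.items():
        covered = set()
        for x, lx in M_star.items():
            if x == y:
                continue
            i0 = min(lx)
            for d in (pos - i0 for pos in ly):        # candidate displacements
                shifted = {pos + d for pos in lx}
                if shifted <= ly:
                    covered |= shifted
        if covered != ly:
            basis.append(y)
    return basis

print(sorted(tiling_basis("FABCXFADCYZEADCEADC")))   # ['ADC', 'EADC', 'FA.C']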
We have implemented the algorithm underlying Theorem 4, and we report here the lessons learned from our experiments. Step 1 requires, in practice, less than the predicted O(n²) running time. If p = 1/|Σ| denotes the probability that two randomly chosen symbols of Σ match under the uniform distribution, the position of the first solid character in a merge is geometrically distributed, and so the expected number of examined characters in s is O(1/p) = O(|Σ|), yielding O(n|Σ|) time on average to locate the first (scanning s from the beginning) and the last (scanning s from the end backward) solid character in each merge. A similar approach can be followed in Step 2 for finding the distinct merges. In this case, the merges are first partially sorted using hashing, exploiting the fact that the input is almost sorted. Insertion sort is then the best choice and works very efficiently in our experiments (at least 50 percent faster than Quicksort). We do not compute the full merges at this stage, but delay this expensive part to a later stage, on a small set of buckets that require explicit representation of the merges. As a result, the average case is almost linear. For example, executing Steps 1 and 2 on chromosome V of C. elegans, containing more than 21 million bases, took around 15 minutes on a machine with 512MB of RAM running Linux on a 1GHz AMD Athlon processor. Step 3 is expensive also in practice, and the worst case predicted by theory shows up in the experiments. Running this step on sequences much shorter than chromosome V of C. elegans took many hours. Step 4 is not much of a problem. As a result, an alternative way of selecting M* from M in Step 3, working fast in practice, would improve considerably the overall performance.
4.4 Some Applications

Checking whether a pattern is a motif. The main property underlying the notion of basis is that it is a generator of all motifs. The generation can be done as follows: first select segments of motifs in the basis that start and end with solid characters, then replace any number of internal solid characters by wild cards. However, since the number of motifs, and even of maximal motifs, can be exponential, this is not really meaningful unless this number is small and the time complexity of the algorithm is proportional to the total size of the output. An attempt in this direction is made in [18]. The dual problem concerns testing only one pattern. We show how, given a pattern x, it can be tested whether x is a motif for string s, that is, whether pattern x occurs at least q times in s. There are two possible ways of performing such a test, depending on whether we test directly on the string or on the basis. The answer relies on iterative applications of the observation made in Remark 1, according to which any tiled motif must occur in at least one tiling motif. The next two statements deal with the alternative. In both cases, we assume that integer k comes from the decomposition of pattern x in the form u_0 ◦^{e_0} u_1 ◦^{e_1} ... u_{k−1} ◦^{e_{k−1}} u_k, where the subwords u_i contain no wild cards (u_i ∈ Σ+, 0 ≤ i ≤ k) and the e_j are positive integers, 0 ≤ j ≤ k − 1. The next proposition states a well-known fact on matching such a pattern in a text without any wild card, which we report here because it is used in the sequel.

Proposition 3. The positions of the occurrences of a pattern x in a string of length n can be computed in time O(kn).
Proof. This is a mere application of matching a pattern with don't cares inside a text without don't cares. Using, for instance, the Fischer and Paterson algorithm [8] is not necessary. Instead, the positions of the subwords u_i are computed by a multiple string-matching algorithm, such as the Aho-Corasick algorithm [1]. For each position p of an occurrence of u_i, a counter associated with position p − ℓ on s is incremented, where ℓ is the offset of u_i in x. Counters whose value is k + 1 correspond then to occurrences of x in s. It remains to check whether x occurs at least q times in s. The running time is governed by the string-matching algorithm, which is O(kn) (equivalent to running k times a linear-time string-matching algorithm).
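The counter idea is easy to prototype. In the sketch below (ours; for clarity each solid subword is located with str.find rather than with the Aho-Corasick automaton on which the O(kn) bound relies), a starting position is an occurrence exactly when its counter equals the number of solid subwords.

# Sketch of the counter-based matching from the proof of Proposition 3.

def decompose(x, wild="."):
    """Split pattern x into its solid subwords u_i together with their offsets in x."""
    parts, i = [], 0
    while i < len(x):
        if x[i] != wild:
            j = i
            while j < len(x) and x[j] != wild:
                j += 1
            parts.append((x[i:j], i))
            i = j
        else:
            i += 1
    return parts

def occurrences_with_wildcards(x, s, wild="."):
    parts = decompose(x, wild)
    counters = [0] * len(s)
    for u, offset in parts:
        start = s.find(u)
        while start != -1:
            pos = start - offset                 # candidate start of x
            if 0 <= pos <= len(s) - len(x):
                counters[pos] += 1
            start = s.find(u, start + 1)
    # positions whose counter equals the number of solid subwords are occurrences
    return [p for p, c in enumerate(counters) if c == len(parts)]

print(occurrences_with_wildcards("A.C", "FABCXFADCYZEADCEADC"))   # [1, 6, 12, 16]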
Proposition 4. Given the basis B of string s, testing whether pattern x is a motif or a maximal motif can be done in O(kb) time, where b = Σ_{y ∈ B} |y|.

Proof. From Remark 1, testing whether x is a maximal motif requires only finding whether x occurs in an element y of the basis. To do this, we can apply the procedure of the previous proof, because wild cards in y should be viewed as extra characters that do not match any letter of Σ. The time complexity of the procedure is thus O(kb). Since a nonmaximal motif occurs in a maximal motif, the same procedure applies to test whether x is a general motif.

As a consequence of Propositions 3 and 4, we get an upper bound on the time complexity for testing motifs.

Corollary 2. Testing whether or not pattern u_0 ◦^{e_0} u_1 ◦^{e_1} ... u_{k−1} ◦^{e_{k−1}} u_k is a motif in a string of length n having a basis of total size b can be done in time O(k · min{b, n}).
Remark 2. Inside the procedure described in the proofs of Propositions 3 and 4, it is also possible to use bit-vector pattern-matching methods [3], [16], [25] to compute the occurrences of x. This leads to practically efficient solutions running in time proportional to the length n of the string or the total size b of the basis, in the bit-vector model of machine. This is certainly a method of choice for short patterns.
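As one concrete possibility in the spirit of Remark 2, the following sketch (our own code, not one of the cited algorithms) adapts the classical Shift-And bit-vector matcher so that wild-card positions of the pattern match every text character.

# Sketch of a Shift-And matcher supporting wild cards in the pattern.

def shift_and_with_wildcards(x, s, wild="."):
    m = len(x)
    # wild_bits: bits of the pattern positions holding a wild card
    wild_bits = 0
    for j, c in enumerate(x):
        if c == wild:
            wild_bits |= 1 << j
    # mask[c]: bit j set if pattern position j matches character c
    mask = {}
    for c in set(s):
        bits = wild_bits
        for j, pc in enumerate(x):
            if pc == c:
                bits |= 1 << j
        mask[c] = bits
    state, out = 0, []
    for i, c in enumerate(s):
        state = ((state << 1) | 1) & mask.get(c, wild_bits)
        if state & (1 << (m - 1)):
            out.append(i - m + 1)
    return out

print(shift_and_with_wildcards("A.C", "FABCXFADCYZEADCEADC"))   # [1, 6, 12, 16]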
Finding the longest motif with a bounded number of wild cards. We address an interesting question concerning the computation of a longest motif occurring repeated in a string. Given an integer g ≥ 0, let LM_g(s) be the maximal length of the motifs occurring in a string s of length n with quorum q = 2 and containing no more than g wild cards. If g = 0, the value can be computed in O(n log |Σ|) time with the help of the suffix tree of s (see [5] or [10]). For g > 0, we can show that LM_g(s) can be computed in O(g n²) time using the suffix tree, augmented (in linear time) to accept longest common ancestor (LCA) queries, as follows: for each possible pair (i, j) of positions on s for which s[i] = s[j], we compute the longest common prefix of s[i..n − 1] and s[j..n − 1] in constant time through an LCA query on the suffix tree. If ℓ is the length of the prefix, we get the first part s[i..i + ℓ − 1] of a possible longest motif. The second part is found similarly by considering the pair of positions (i + ℓ + 1, j + ℓ + 1). The process is iterated g times (or less) and provides a longest motif containing at most g wild cards and occurring at positions i and j. Length LM_g(s) is obtained by taking the maximum length of the motifs for all pairs of positions (i, j). This yields the next result.
Proposition 5. Using the suffix tree, LM_g(s) can be computed in O(g n²) time.
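The iteration just described is summarized in the sketch below (ours), where a naive character-by-character longest-common-extension replaces the suffix tree with LCA queries; the iteration per position pair is the same, only the constant-time extension guarantee is lost.

# Sketch of the iterated longest-common-extension idea behind Proposition 5.

def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:]."""
    l = 0
    while i + l < len(s) and j + l < len(s) and s[i + l] == s[j + l]:
        l += 1
    return l

def LM(s, g):
    """Maximal length of a motif (quorum 2) with at most g wild cards."""
    best, n = 0, len(s)
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] != s[j]:
                continue
            a, b, length, wilds = i, j, 0, 0
            while True:
                l = lce(s, a, b)
                length += l
                a, b = a + l, b + l
                if l > 0:
                    best = max(best, length)   # motif currently ends with a solid char
                if wilds == g or a >= n or b >= n:
                    break
                wilds += 1                     # align one mismatch with a wild card
                length += 1
                a, b = a + 1, b + 1
    return best

print(LM("FABCXFADCYZEADCEADC", 1))   # 4 (e.g., EADC with no wild card, or FA.C with one)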
What makes the use of the basis of tiling motifs interesting is that computing LM_g(s) becomes a mere pattern-matching exercise because of the strong properties of the basis. This contrasts with the previous result, grounded on the deep algorithmic technique for LCA queries.

Proposition 6. Using the basis B of tiling motifs, LM_g(s) can be computed in time O(b), where b = Σ_{y ∈ B} |y|.
Proof. Let x be a motif yielding LM_g(s) (i.e., x is of length LM_g(s)); hence, x occurs at least twice in s. Let y be a maximal motif in which x occurs (we have y = x if x is itself maximal). Let z be a tiling motif in which y occurs (again, we may have z = y if y is a tiling motif). The word x then occurs in z, which belongs to the basis. Let us say that it matches z[i..j]. Assume that x is not a tiling motif, that is, x ≠ z. Certainly, i = 0 or z[i − 1] = ◦; otherwise, x would not be the longest with its property. For the same reason, j = |z| − 1 or z[j + 1] = ◦. But, indeed, x occurs exactly in z, which means that its wild card symbols do not match any solid symbol of z, because otherwise z[i..j] would contain fewer than g wild cards and could be extended by at least one symbol to the left or to the right (since x ≠ z), yielding a contradiction with the definition of x. Therefore, either x is a tiling motif or it matches exactly a segment of one of the tiling motifs. Searching for x thus reduces to finding a longest segment of a tiling motif in B that contains no more than g wild cards. The computation can be done in linear time with only two pointers on z, which proves the result.
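The two-pointer scan mentioned at the end of the proof can be sketched as follows (our code; the basis is given as a list of patterns with '.' as the wild card): slide a window with at most g wild cards over each tiling motif and trim it to solid characters at both ends.

# Sketch of the linear scan over the basis used in Proposition 6.

def longest_segment(y, g, wild="."):
    best = left = wilds = 0
    for right, c in enumerate(y):
        if c == wild:
            wilds += 1
        while wilds > g:                      # shrink until at most g wild cards
            if y[left] == wild:
                wilds -= 1
            left += 1
        lo, hi = left, right                  # trim to solid characters at both ends
        while lo <= hi and y[lo] == wild:
            lo += 1
        while hi >= lo and y[hi] == wild:
            hi -= 1
        if hi >= lo:
            best = max(best, hi - lo + 1)
    return best

def LM_from_basis(basis, g, wild="."):
    return max(longest_segment(y, g, wild) for y in basis)

print(LM_from_basis(["FA.C", "EADC", "ADC"], 1))   # 4
print(LM_from_basis(["FA.C", "EADC", "ADC"], 0))   # 4 (EADC)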
By Proposition 6, it is clear that a small basis B leads to an efficient computation once B is given. If we have to build B from scratch, we can observe that no (maximal) motif can give a larger value of LM_g(s) if it does not belong to B. With this observation, we have O(n²) running time, which always beats the O(g n²) cost of using the suffix tree. In particular, it is interesting to notice that the running time of the algorithm using the basis is independent of the parameter g.
5 PSEUDOPOLYNOMIAL BASES FOR HIGHER QUORUM

We now discuss the general case of quorum q ≥ 2 for finding the basis of a string of length n. Differently from previous work, we show in Section 5.1 that no polynomial-time algorithm can exist for an arbitrary value of q in the worst case, both for the basis of irredundant motifs and for the basis of tiling motifs. The size of these bases provably depends exponentially on suitable values of q ≥ 2; that is, we give a lower bound of ((n−1)/2 − 1 choose q−1) = Ω( ((n−1)/(2(q−1)))^(q−1) ). In practice, this size has an exponential growth for increasing values of q up to O(log n), but larger values of q are theoretically possible in the worst case. Fixing q = (n−1)/4 + 1 in our lower bound, we get a size of Ω(2^((n−1)/4)) motifs in the bases. On the average, q = O(log_{|Σ|} n), by extending the argument after Theorem 4, namely, using the fact that on the average the number of simultaneous comparisons needed to find the first solid character of a merge is O(|Σ|^(q−1)), which must be less than n.

We show a further property for the basis of tiling motifs in Section 5.2, giving an upper bound of (n−1 choose q−1) on its size with a simple proof. Since we can find an algorithm taking time proportional to the square of that size, we can conclude that a worst-case polynomial-time algorithm for finding the basis of tiling motifs exists if and only if the quorum satisfies either q = O(1) or q = n − O(1) (the latter condition is hardly meaningful in practice).
5.1 A Lower Bound of ((n−1)/2 − 1 choose q−1) on the Bases

We show the existence of a family of strings for which there are at least ((n−1)/2 − 1 choose q−1) tiling motifs for a quorum q. Since a tiling motif is also irredundant, this gives a lower bound for the irredundant motifs, to be combined with that in Section 3 (note that the lower bound in Section 3 still gives Ω(n²) for q = 2). For q > 2, this gives a lower bound of ((n−1)/2 − 1 choose q−1) = Ω( ((n−1)/(2(q−1)))^(q−1) ) for the number of both tiling and irredundant motifs.

The strings are this time of the form t_k = A^k T A^k (k ≥ 5), without the left extension used in the bound of Section 3. The proof proceeds by exhibiting (k−1 choose q−1) motifs that are maximal and each have exactly q occurrences, from which it follows immediately that they are tiling. Indeed, Remark 1 for tiling motifs holds for any q ≥ 2: all maximal motifs that occur exactly q times in a string are tiling.
Proposition 7. For 2 ≤ q ≤ k and 1 ≤ p ≤ k − 1, any motif of the form A^p ◦ {A, ◦}^{k−p−1} ◦ A^p with exactly q wild cards is tiling (and so irredundant) in t_k.
Proof. Let x be an arbitrary motif of the form A^p ◦ {A, ◦}^{k−p−1} ◦ A^p with q wild cards; namely, x = A^{p_1} ◦ A^{p_2−p_1−1} ◦ ... ◦ A^{p_{q−1}−p_{q−2}−1} ◦ A^{k−p_{q−1}−1} ◦ A^{p_1}, for 1 ≤ p_1 < p_2 < ... < p_{q−1} ≤ k − 1 and p = p_1. We first have to prove that x is a maximal motif according to Definition 5. Its length is k + 1 + p_1 and its location list is L_x = {0, k − p_{q−1}, ..., k − p_2, k − p_1}. Observe that the number of its occurrences is exactly the number of times the wild card appears in x, which is equal to q. A motif y different from x such that x occurs in y can be obtained by replacing the wild card at position p_i with a solid symbol, for 1 ≤ i ≤ q − 1, but this eliminates k − p_i from the location list of y (replacing the last wild card, at position k, similarly eliminates position 0). Also, y can be obtained by extending x to the right by a solid symbol (at any position ≥ |x|), but then position k − p_1 is not in L_y, because the last symbol in that occurrence of y would occupy position (k − p_1) + |y| − 1 ≥ (k − p_1) + |x| = (k − p_1) + (k + 1 + p_1) = |t_k|, beyond the last position of t_k, which is impossible. Analogously, y can be obtained by extending x to the left by a solid symbol (at any position d < 0), but then position 0 is no longer in L_y. Consequently, for any motif y more specific than x, we have L_y ≠ L_x + d, implying that x is maximal. As previously mentioned, x is tiling because it has exactly q occurrences.
Theorem 5. String t_k has ((n−1)/2 − 1 choose q−1) = Ω( ((n−1)/(2(q−1)))^(q−1) ) tiling (and irredundant) motifs, where n = |t_k| and k ≥ 2.

Proof. By Proposition 7, the tiling or irredundant motifs in t_k are at least (k−1 choose q−1), the number of choices of q − 1 positions on A^{k−1}. Since n = 2k + 1, we obtain the statement.
5.2 An Upper Bound of (n−1 choose q−1) Tiling Motifs

We now prove that (n−1 choose q−1) is an upper bound on the size of a basis of tiling motifs for a string s and quorum q ≥ 2. Let us denote as before such a basis by B. To prove the upper bound, we use again the notion of a merge, except that it now involves q strings. The operator ⊕ between the elements of Σ extends to more than two arguments, so that the result is a ◦ if at least two arguments differ. Let k now denote an array of q − 1 positive values k_1, ..., k_{q−1} with 1 ≤ k_i < k_j ≤ n − 1 for all 1 ≤ i < j ≤ q − 1.

Definition 9. Let s_k denote the string such that its jth character is s_k[j] = s[j] ⊕ s[j + k_1] ⊕ ... ⊕ s[j + k_{q−1}] for all integers j. Merge_k is the pattern obtained by removing all the leading and trailing ◦'s in s_k (that is, those appearing before the leftmost solid character and after the rightmost solid character).
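Definition 9 generalizes the sketch given after Definition 8 in the obvious way; the short illustration below (our code and names) superposes q aligned copies of s, one per offset in the array k, and returns the trimmed merge together with the q generating positions.

# Sketch of Definition 9: Merge_k for an array of q-1 offsets.

def merge_q(s, offsets, wild="."):
    """offsets = (k_1, ..., k_{q-1}), strictly increasing, all in 1..n-1."""
    n = len(s)
    limit = n - offsets[-1]                    # positions j with j + k_{q-1} < n
    sk = []
    for j in range(limit):
        chars = {s[j]} | {s[j + k] for k in offsets}
        sk.append(s[j] if len(chars) == 1 else wild)
    solid = [j for j, c in enumerate(sk) if c != wild]
    if not solid:
        return None
    i, j = solid[0], solid[-1]
    return "".join(sk[i:j + 1]), tuple(i + k for k in (0, *offsets))

s = "FABCXFADCYZEADCEADC"
print(merge_q(s, (4,)))       # ('EADC', (11, 15)) -- the same as Merge_4 for q = 2
print(merge_q(s, (6, 10)))    # ('ADC', (6, 12, 16)) -- a merge for quorum q = 3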
Lemmas 5 and 6, reported below, extend Lemmas 1 and 2 to q > 2.

Lemma 5. If Merge_k exists for quorum q, then it must be a maximal motif.

Proof. Let x = Merge_k denote the (nonempty) pattern, and let s_k[i] be its first character, which is solid by Definition 9. Since x occurs at least q times in s, at positions i, i + k_1, ..., i + k_{q−1}, x is a motif for quorum q. We show that x is maximal. Suppose it is not maximal. By Definition 5, there exists y ≠ x such that x occurs in y and L_y = L_x + d for some integer d. This implies that there exists at least one position j with 0 ≤ j < |y| such that y[j] = σ ∈ Σ and x[j + d] = ◦. Since

x[j + d] = s[i + j + d] ⊕ s[i + k_1 + j + d] ⊕ ... ⊕ s[i + k_{q−1} + j + d],

at least one among i + d, i + k_1 + d, ..., i + k_{q−1} + d is not an occurrence of y, contradicting the hypothesis that L_y = L_x + d (since i, i + k_1, ..., i + k_{q−1} ∈ L_x).
Lemma 6. For each tiling motif x in the basis B with quorum q, there is at least one k for which Merge_k = x.

Proof. If |L_x| = q and L_x = {i_1, ..., i_q} with i_1 < ... < i_q, then x = Merge_k, where k is the array of values i_2 − i_1, i_3 − i_1, ..., i_q − i_1. Let us now consider the case where |L_x| > q. Given any q-tuple i_1, ..., i_q ∈ L_x and letting k be the corresponding array of values i_2 − i_1, ..., i_q − i_1, let w_k denote s[i_1..i_1 + |x| − 1] ⊕ ... ⊕ s[i_q..i_q + |x| − 1], which is a substring of the Merge_k introduced in Definition 9. We have that x ⪯ w_k and L_x = ∪_{i_1,i_2,...,i_q ∈ L_x} L_{w_k}. Since each w_k for i_1, i_2, ..., i_q ∈ L_x is a substring of Merge_k, we infer that

L_x = ∪_{i_1,i_2,...,i_q ∈ L_x} (L_{Merge_k} + e_k),

where the e_k's are non-negative integers. By Definition 7, if all these Merge_k's were different from x, then x would not be tiling, which is a contradiction. Therefore, at least one Merge_k is x.
The following property of tiling bases follows from Lemmas 5 and 6.

Theorem 6. Given a string s of length n and a quorum q ≥ 2, let M be the set of Merge_k, for any of the (n−1 choose q−1) possible choices of k for which Merge_k exists. The basis B of tiling motifs for s satisfies B ⊆ M and, therefore, the size of B is at most (n−1 choose q−1).
The tiling motifs in our basis appear in s a total of q · (n−1 choose q−1) times at most. A variation of the algorithm given in Section 4.3 gives a pseudopolynomial-time complexity of O( q² (n−1 choose q−1)² ). When this upper bound is combined with the lower bound of Section 5.1, we obtain that there exists a polynomial-time algorithm for finding the basis if and only if either q = O(1) or q = n − O(1).
6 CONCLUSIONS

The work presented in this paper is theoretical in nature, but it should be clear by now that its practical consequences, particularly (but not exclusively) for computational biology, are relevant. Whether motifs as patterns are used for inferring binding sites or repeats of any length, for characterizing sequences, or as a filtering step in a whole-genome comparison algorithm or before inferring PSSMs, we show that wild cards alone are not enough for a biologically satisfying definition of the patterns of interest. Simply throwing away the pattern type of motif detection, however, is not a good way to address the problem. This is confirmed by various biological publications [24], [7], as well as by the not yet published but already publicly available results of a first motif detection competition (http://bio.cs.washington.edu/assessment/). Even if patterns are not the best way of modeling biological features, they deserve an important function in any future improved algorithm for inferring motifs ab initio from biological sequences. As such, the purpose of this paper is to shed some further light on the inner structure of one important type of motif.
ACKNOWLEDGMENTS

Many suggestions from the anonymous referees greatly improved the original form of this paper. The authors are thankful to them for this and to M.H. ter Beek for improving the English. A preliminary version of the results in this paper has been described in the technical report IGM-2002-10, July 2002 [20], and in [21]. Work was partially supported by the French program bioinformatique EPST 2002 "Algorithms for Modelling and Inference Problems in Molecular Biology." N. Pisanti and R. Grossi were partially supported by the Italian PRIN project "ALINWEB: Algorithmics for Internet and the Web." M.-F. Sagot was partially supported by CNRS-INRIA-INRA-INSERM action BioInformatique and the Wellcome Trust Foundation. M. Crochemore was partially supported by CNRS action AlBio, NATO Science Programme grant PST.CLG.977017, and the Wellcome Trust Foundation.
REFERENCES
[1] A. Aho and M. Corasick, Efficient String Matching: An Aid to
Bibliographic Search, Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.
[2] A. Apostolico and L. Parida, Incremental Paradigms of Motif
Discovery, J. Computational Biology, vol. 11, no. 1, pp. 15-25, 2004.
[3] R. Baeza-Yates and G. Gonnet, A New Approach to Text
Searching, Comm. ACM, vol. 35, pp. 74-82, 1992.
[4] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, Ap-
proaches to the Automatic Discovery of Patterns in Biose-
quences, J. Computational Biology, vol. 5, pp. 279-305, 1998.
[5] M. Crochemore and W. Rytter, Jewels of Stringology. World
Scientific Publishing, 2002.
[6] E. Eskin, From Profiles to Patterns and Back Again: A Branch and
Bound Algorithm for Finding Near Optimal Motif Profiles,
RECOMB04: Proc. Eighth Ann. Intl Conf. Computational Molecular
Biology, pp. 115-124, 2004.
[7] E. Eskin, U. Keich, M. Gelfand, and P. Pevzner, Genome-Wide
Analysis of Bacterial Promoter Regions, Proc. Pacific Symp.
Biocomputing, pp. 29-40, 2003.
[8] M. Fischer and M. Paterson, String Matching and Other
Products, SIAM AMS Complexity of Computation, R. Karp, ed.,
pp. 113-125, 1974.
[9] M. Gribskov, A. McLachlan, and D. Eisenberg, Profile Analysis:
Detection of Distantly Related Proteins, Proc. Natl Academy of
Sciences, vol. 84, no. 13, pp. 4355-4358, 1987.
[10] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer
Science and Computational Biology. Cambridge Univ. Press, 1997.
[11] G.Z. Hertz and G.D. Stormo, Escherichia Coli Promoter Sequences:
Analysis and Prediction, Methods in Enzymology, vol. 273, pp. 30-
42, 1996.
[12] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald,
and J.C. Wooton, Detecting Subtle Sequence Signals: A Gibbs
Sampling Strategy for Multiple Alignment, Science, vol. 262,
pp. 208-214, 1993.
[13] C.E. Lawrence and A.A. Reilly, An Expectation Maximization
(EM) Algorithm for the Identification and Characterization of
Common Sites in Unaligned Biopolymer Sequences, Proteins:
Structure, Function, and Genetics, vol. 7, pp. 41-51, 1990.
[14] L. Marsan and M.-F. Sagot, Algorithms for Extracting Structured
Motifs Using a Suffix Tree with an Application to Promoter and
Regulatory Site Consensus Identification, J. Computational Biol-
ogy, vol. 7, pp. 345-362, 2000.
[15] W. Miller, Comparison of Genomic DNA Sequences: Solved and
Unsolved Problems, Bioinformatics, vol. 17, pp. 391-397, 2001.
[16] G. Myers, A Fast Bit-Vector Algorithm for Approximate String
Matching Based on Dynamic Programming, J. ACM, vol. 46, no. 3,
pp. 395-415, 1999.
[17] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, and Y. Gao, Pattern
Discovery on Character Sets and Real-Valued Data: Linear Bound
on Irredundant Motifs and Efficient Polynomial Time Algorithm,
Proc. SIAM Symp. Discrete Algorithms (SODA), 2000.
[18] L. Parida, I. Rigoutsos, and D. Platt, An Output-Sensitive Flexible
Pattern Discovery Algorithm, Combinatorial Pattern Matching,
A. Amir and G. Landau, eds., pp. 131-142, Springer-Verlag, 2001.
[19] J. Pelfrne, S. Abdedda m, and J. Alexandre, Extracting Approx-
imate Patterns, Combinatorial Pattern Matching, pp. 328-347,
Springer-Verlag, 2003.
[20] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, A Basis
for Repeated Motifs in Pattern Discovery and Text Mining,
Technical Report IGM 2002-10, Institut Gaspard-Monge, Univ. of
Marne-la-Vallee, July 2002.
[21] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, A Basis of
Tiling Motifs for Generating Repeated Patterns and Its Complex-
ity for Higher Quorum, Math. Foundations of Computer Science
(MFCS), B. Rovan and P. Vojtas, eds., pp. 622-631, Springer-
Verlag, 2003.
[22] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, String
Algorithmics, chapter: A Comparative Study of Bases for Motif
Inference, pp. 195-225, KCL Press, 2004.
[23] D. Pollard, C. Bergman, J. Stoye, S. Celniker, and M. Eisen,
Benchmarking Tools for the Alignment of Functional Noncoding
DNA, BMC Bioinformatics, vol. 5, pp. 6-23, 2004.
[24] A. Vanet, L. Marsan, and M.-F. Sagot, Promoter Sequences and
Algorithmical Methods for Identifying Them, Research in Micro-
biology, vol. 150, pp. 779-799, 1999.
[25] S. Wu and U. Manber, Path-Matching Problems, Algorithmica,
vol. 8, no. 2, pp. 89-101, 1992.
Nadia Pisanti received the laurea degree in
computer science in 1996 from the University of
Pisa (Italy), the French DEA in fundamental
informatics with applications to genome treat-
ment in 1998 from the University of Marne-la-
Vallee (France), and the PhDdegree in computer
science in 2002 from the University of Pisa. She
has been postdoctorate at INRIA and at the
University of Paris 13 and she is currently a
research fellow in the Department of Computer
Science of the University of Pisa. Her interests are in computational
biology and, in particular, in motifs extractionandgenomerearrangement.
Maxime Crochemore received the PhD degree
in 1978 and the Doctorat detat in 1983 from the
University of Rouen. He received his first
professorship position at the University of
Paris-Nord in 1975 where he acted as President
of the Department of Mathematics and Compu-
ter Science for two years. He became a
professor at the University Paris 7 in 1989 and
was involved in the creation of the University of
Marne-la-Vallee where he is presently a profes-
sor. He also created the Computer Science Research Laboratory of this
university in 1991. Since then, he has been the director of the laboratory,
which now has around 45 permanent researchers. Professor Crochem-
ore has been a senior research fellow at Kings College London since
2002. He has been the recipient of several French grants on string
algorithmics and bioinformatics. He participated in a good number of
international projects in algorithmics and supervised 20 PhD students.
PISANTI ET AL.: BASES OF MOTIFS FOR GENERATING REPEATED PATTERNS WITH WILD CARDS 49
Roberto Grossi received the laurea degree in
computer science in 1988, and the PhD degree
in computer science in 1993, at the University of
Pisa. He joined the University of Florence in
1993 as an associate researcher. Since 1998,
he has been an associate professor of computer
science in the Dipartimento di Informatica,
University of Pisa. He has been visiting several
international research institutions. His interests
are in the design and analysis of algorithms and
data structures, namely, dynamic and external memory algorithms,
graph algorithms, experimental and algorithm engineering, fast lookup
tables and dictionaries, pattern matching algorithms, text indexing, and
compressed data structures.
Marie-France Sagot received the BSc degree in computer science from
the University of Sao Paulo, Brazil, in 1991, the PhD degree in
theoretical computer science and applications from the University of
Marne-la-Vallee, France, in 1996, and the Habilitation from the same
university in 2000. From 1997 to 2001, she worked as a research
associate at the Pasteur Institute in Paris, France. In 2001, she moved
to Lyon, France, as a research associate at the INRIA, the French
National Institute for Research in Computer Science and Control. Since
2003, she has been director of research at the INRIA. Her research
interests are in computational biology, algorithmics, and combinatorics.
> For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
50 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
Multiseed Lossless Filtration
Gregory Kucherov, Laurent Noe , and Mikhail Roytberg
AbstractWe study a method of seed-based lossless filtration for approximate string matching and related bioinformatics
applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt
and Ka rkka inen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial
properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed
technique to the problem of oligonucleotide selection for an EST sequence database.
Index TermsFiltration, string matching, gapped seed, gapped q-gram, local alignment, sequence similarity, seed family, multiple
spaced seeds, dynamic programming, EST, oligonucleotide selection.

1 INTRODUCTION
F
ILTERING is a widely-used technique in biosequence
analysis. Applied to the approximate string matching
problem [2], it can be summarized by the following two-
stage scheme: To find approximate occurrences (matches) of
a given string in a sequence (text), one first quickly discards
(filters out) those sequence regions where matches cannot
occur, and then checks out the remaining parts of the
sequence for actual matches. The filtering is done according
to small patterns of a specified form that the searched string
is assumed to share, in the exact way, with its approximate
occurrences. A similar filtration scheme is used by heuristic
local alignment algorithms ([3], [4], [5], [6], to mention a
few): They first identify potential similarity regions that
share some patterns and then actually check whether those
regions represent a significant similarity by computing a
corresponding alignment.
Two types of filtering should be distinguishedlossless
and lossy. A lossless filtration guarantees to detect all
sequence fragments under interest, while a lossy filtration
may miss some of them, but still tries to detect a majority of
them. Local alignment algorithms usually use a lossy
filtration. On the other hand, the lossless filtration has been
studied in the context of approximate string matching
problem [7], [1]. In this paper, we focus on the lossless
filtration.
In the case of lossy filtration, its efficiency is measured by
two parameters, usually called selectivity and sensitivity. The
sensitivity measures the part of sequence fragments of
interest that are missed by the filter (false negatives), and
the selectivity indicates what part of detected candidate
fragments do not actually represent a solution (false
positives). In the case of lossless filtration, only the
selectivity parameter makes sense and is therefore the main
characteristic of the filtration efficiency.
The choice of patterns that must be contained in the
searched sequence fragments is a key ingredient of the
filtration algorithm. Gapped seeds (spaced seeds, gapped q-
grams) have been recently shown to significantly improve
the filtration efficiency over the traditional technique of
contiguous seeds. In the framework of lossy filtration for
sequence alignment, the use of designed gapped seeds has
been introduced by the PATTERNHUNTER method [4] and
then used by some other algorithms (e.g., [5], [6]). In [8], [9],
spaced seeds have been shown to improve indexing
schemes for similarity search in sequence databases. The
estimation of the sensitivity of spaced seeds (as well as of
some extended seed models) has been the subject of several
recent studies [10], [11], [12], [13], [14], [15]. In the
framework of lossless filtration for approximate pattern
matching, gapped seeds were studied in [1] (see also [7])
and have also been shown to increase the filtration
efficiency considerably.
In this paper, we study an extension of the lossless
single-seed filtration technique [1]. The extension is based
on using seed families rather than individual seeds. The idea
of simultaneous use of multiple seeds for DNA local
alignment was already envisaged in [4] and applied in
PATTERNHUNTER II software [16]. The problem of design-
ing efficient seed families has also been studied in [17]. In
[18], multiple seeds have been applied to the protein search.
However, the issues analyzed in the present paper are quite
different, due to the proposed requirement for the search to
be lossless.
The rest of the paper is organized as follows: After
formally introducing the concept of multiple seed filtering
in Section 2, Section 3 is devoted to dynamic programming
algorithms to compute several important parameters of
seed families. In Section 4, we first study several combina-
torial properties of families of seeds and, in particular, seeds
having a periodic structure. These results are used to obtain
a method for constructing efficient seed families. We also
outline a heuristic genetic programming algorithm for
constructing seed families. Finally, in Section 5, we present
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 51
. G. Kucherov and L. Noe are with the INRIA/LORIA, 615, rue du Jardin
Botanique, B.P. 101, 54602 Villers-le`s-Nancy, France.
E-mail: {Gregory.Kucherov, Laurent.Noe}@loria.fr.
. M. Roytberg is with the Institute of Mathematical Problems in Biology,
Pushchino, Moscow Region, Russia. E-mail: roytberg@impb.psn.ru.
Manuscript received 24 Sept. 2004; revised 13 Dec. 2004; accepted 10 Jan.
2005; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org, and reference IEEECS Log Number TCBB-0154-0904.
1545-5963/05/$20.00 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM
several seed families we computed, and we report a large-
scale experimental application of the method to a practical
problem of oligonucleotide selection.
2 MULTIPLE SEED FILTERING
A seed Q (called also spaced seed or gapped q-gram) is a list
p
1
; p
2
; . . . ; p
d
of positive integers, called matching positions,
such that p
1
< p
2
< . . . < p
d
. By convention, we always
assume p
1
= 0. The span of a seed Q, denoted s(Q), is the
quantity p
d
1. The number d of matching positions is called
the weight of the seed and denoted w(Q). Often, we will use a
more visual representation of seeds, adopted in [1], as words
of length s(Q) over the two-letter alphabet #; , where #
occurs at all matching positions andat all positions in
between. For example, seed 0; 1; 2; 4; 6; 9; 10; 11 of weight 8
andspan12is representedbyword### # # ###.
The character is called a joker. Note that, unless otherwise
stated, the seed has the character # at its first and last
positions.
Intuitively, a seed specifies the set of patterns that, if
shared by two sequences, indicate a possible similarity
between them. Two sequences are similar if the Hamming
distance between them is smaller than a certain threshold.
For example, sequences CACTCGT and CACACTT are similar
within Hamming distance 2 and this similarity is detected
by the seed ## # at position 2. We are interested in seeds
that detect all similarities of a given length with a given
Hamming distance.
Formally, a gapless similarity (hereafter simply similarity)
of two sequences of length m is a binary word w 0; 1
m
interpreted as a sequence of matches (1s) and mismatches
(0s) of individual characters from the alphabet of input
sequences. A seed Q = p
1
; p
2
; . . . ; p
d
matches a similarity w
at position i, 1 _ i _ m p
d
1, iff for every j [1::d[, we
have w[i p
j
[ = 1. In this case, we also say that seed Q has
an occurrence in similarity w at position i. A seed Q is said to
detect a similarity w if Q has at least one occurrence in w.
Given a similarity length m and a number of
mismatches k, consider all similarities of length m
containing k 0s and (m k) 1s. These similarities are
called (m; k)-similarities. A seed Q solves the detection
problem (m; k) (for short, the (m; k)-problem) iff all of
m
k

(m; k)-similarities w are detected by Q. For example, one
can check that seed # ## # ## solves the
(15; 2)-problem.
Note that the weight of the seed is directly related to the
selectivity of the corresponding filtration procedure. A larger
weight improves the selectivity, as less similarities will pass
through the filter. On the other hand, a smaller weight
reduces the filtration efficiency. Therefore, the goal is to
solve an (m; k)-problem by a seed with the largest possible
weight.
Solving (m; k)-problems by a single seed has been studied
by Burkhardt and Karkkainen [1]. An extension we propose
here is to use a family of seeds, instead of a single seed, to solve
the (m; k)-problem. Formally, a finite family of seeds F =<
Q
l
>
L
l=1
solves an (m; k)-problemiff for any (m; k)-similarity w,
there exists a seed Q
l
F that detects w.
Note that the seeds of the family are used in the
complementary (or disjunctive) fashion, i.e., a similarity is
detected if it is detected by one of the seeds. This differs from
the conjunctive approach of [7] where a similarity should be
detected by two seeds simultaneously.
The following example motivates the use of multiple
seeds. In [1], it has been shown that a seed solving the
(25; 2)-problem has the maximal weight 12. The only such
seed (up to reversal) is
### # ### # ### #:
However, the problem can be solved by the family
composed of the following two seeds of weight 14:
##### ## ##### ##
and
# ## ##### ## ####:
Clearly, using these two seeds increases the selectivity of
the search, as only similarities having 14 or more matching
characters pass the filter versus 12 matching characters in
the case of single seed. On uniform Bernoulli sequences,
this results in the decrease of the number of candidate
similarities by the factor of [A[
2
=2, where A is the input
alphabet. This illustrates the advantage of the multiple seed
approach: it allows to increase the selectivity while
preserving a lossless search. The price to pay for this gain
in selectivity is multiplying the work on identifying the
seed occurrences. In the case of large sequences, however,
this is largely compensated by the decrease in the number
of false positives caused by the increase of the seed weight.
3 COMPUTING PROPERTIES OF SEED FAMILIES
Burkhardt and Karkkainen [1] proposed a dynamic pro-
gramming algorithm to compute the optimal threshold of a
given seedthe minimal number of its occurrences over all
possible (m; k)-similarities. In this section, we describe an
extension of this algorithm for seed families and, on the
other hand, describe dynamic programming algorithms for
computing two other important parameters of seed families
that we will use in a later section.
Consider an (m; k)-problem and a family of seeds
F =< Q
l
>
L
l=1
. We need the following notations:
. s
max
= maxs(Q
l
)
L
l=1
, s
min
= mins(Q
l
)
L
l=1
,
. for a binary word w and a seed Q
l
, suff(Q
l
; w)=1 if
Q
l
matches w at position ([w[s(Q
l
)1) (i.e.,
matches a suffix of w), otherwise suff(Q
l
; w)=0,
. last(w) = 1 if the last character of w is 1, otherwise
last(w) = 0, and
. zeros(w) is the number of 0s in w.
3.1 Optimal Threshold
Given an (m; k)-problem, a family of seeds F =< Q
l
>
L
l=1
has the optimal threshold T
F
(m; k) if every (m; k)-similarity
52 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
has at least T
F
(m; k) occurrences of seeds of F and this is the
maximal number with this property. Note that overlapping
occurrences of a seed as well as occurrences of different
seeds at the same position are counted separately. For
example, the singleton family ### ## has threshold 2
for the (15; 2)-problem.
Clearly, F solves an (m; k)-problem if and only if
T
F
(m; k) > 0. If T
F
(m; k) > 1, then one can strengthen the
detection criterion by requiring several seed occurrences for
a similarity to be detected. This shows the importance of the
optimal threshold parameter.
We now describe a dynamic programming algorithm
for computing the optimal threshold T
F
(m; k). For a
binary word w, consider the quantity T
F
(m; k; w) defined
as the minimal number of occurrences of seeds of F in all
(m; k)-similarities which have the suffix w. By definition,
T
F
(m; k) = T
F
(m; k; "). Assume that we precomputed
values T
F
(j; w) = T
F
(s
max
; j; w), for all j _ maxk; s
max
,
[w[ = s
max
. The algorithm is based on the following
recurrence relations on T
F
(i; j; w), for i _ s
max
.
T
F
(i; j; w[1::n[) =
T
F
(j; w); if i =s
max
;
T
F
(i1; j1; w[1::n1[); if w[n[ =0;
T
F
(i1; j; w[1::n1[) [
P
L
l=1
suff(Q
l
; w)[; if n=s
max
;
minT
F
(i; j; 1:w); T
F
(i; j; 0:w); if zeros(w)<j;
T
F
(i; j; 1:w); if zeros(w)=j:
8
>
>
>
>
>
>
<
>
>
>
>
>
>
:
The first relation is an initial condition of the recurrence.
The second one is based on the fact that if the last symbol of
w is 0, then no seed can match a suffix of w (as the last
position of a seed is always assumed to be a matching
position). The third relation reduces the size of the problem
by counting the number of suffix seed occurrences. The
fourth one splits the counting into two cases, by considering
two possible characters occurring on the left of w. If w
already contains j 0s, then only 1 can occur on the left of w,
as stated by the last relation.
A dynamic programming implementation of the above
recurrence allows to compute T
F
(m; k; ") in a bottom-up
fashion, startingfrominitial values T
F
(j; w) andapplyingthe
above relations in the order in which they are given. A
straightforward dynamic programming implementation re-
quires O(m k 2
(s
max
1)
) time and space. However, the space
complexity can be immediately improved: If values of i are
processed successively, then only O(k 2
(s
max
1)
) space is
needed. Furthermore, for each i and j, it is not necessary to
consider all 2
(smax1)
different strings w, but only those which
contain up to j 0s. The number of those w is g(j; s
max
) =
P
j
e=0
s
max
e

. For eachi, j ranges from0 to k. Therefore, for each
i, we needtostore f(k; s
max
) =
P
k
j=0
g(j; s
max
) =
P
k
j=0
smax
j

(k j 1) values. This yields the same space complexity as


for computing the optimal threshold for one seed [1].
The quantity
P
L
l=1
suff(Q
l
; w) can be precomputed for all
considered words w in time O(L g(k; s
max
)) and space
O(g(k; s
max
)), under the assumption that checking an
individual match is done in constant time. This leads to
the overall time complexity O(m f(k; s
max
) L g(k; s
max
))
with the leading term m f(k; s
max
) (as L is usually small
compared to m and g(k; s
max
) is smaller than f(k; s
max
)).
3.2 Number of Undetected Similarities
We now describe a dynamic programming algorithm that
computes another characteristic of a seed family, that will
be used later in Section 4.4. Consider an (m; k)-problem.
Given a seed family F =< Q
l
>
L
l=1
, we are interested in
the number U
F
(m; k) of (m; k)-similarities that are not
detected by F. For a binary word w, define U
F
(m; k; w) to
be the number of undetected (m; k)-similarities that have
the suffix w.
Similar to [10], let X(F) be the set of binary words w such
that 1) [w[ _ s
max
, 2) for any Q
l
F, suff(Q
l
; 1
s
max
[w[
w) = 0,
and 3) no proper suffix of w satisfies 2). Note that word 0
belongs to X(F), as the last position of every seed is a
matching position.
The following recurrence relations allow to compute
U
F
(i; j; w) for i _ m, j _ k, and [w[ _ s
max
:
U
F
(i; j; w[1::n[) =
i[w[
jzeros(w)

; if i < s
min
;
0; if l [1::L[;
suff(Q
l
; w) = 1;
U
F
(i 1; j last(w); w[1::n 1[); if w X(F);
U
F
(i; j; 1:w) U
F
(i; j; 0:w); if zeros(w) < j;
U
F
(i; j; 1:w); if zeros(w) = j:
8
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
:
The first condition says that if i < s
min
, then no word of
length i will be detected, hence the binomial coefficient. The
second condition is straightforward. The third relation
follows from the definition of X(F) and allows us to reduce
the size of the problem. The last two conditions are similar
to those from the previous section.
The set X(F) can be precomputed in time O(L
g(k; s
max
)) and the worst-case time complexity of the whole
algorithm remains O(m f(k; s
max
) L g(k; s
max
)).
3.3 Contribution of a Seed
Using a similar dynamic programming technique, one can
compute, for a given seed of the family, the number of
(m; k)-similarities that are detected only by this seed and not
by the others. Together with the number of undetected
similarities, this parameter will be used later in Section 4.4.
Given an (m; k)-problem and a family F =< Q
l
>
L
l=1
, we
define S
F
(m; k; l) to be the number of (m; k)-similarities
detected by the seed Q
l
exclusively (through one or several
occurrences), and S
F
(m; k; l; w) to be the number of those
similarities ending with the suffix w. A dynamic program-
ming algorithm similar to the one described in the previous
sections can be applied to compute S
F
(m; k; l). The
recurrence is given below.
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 53
S
F
(i; j; l; w[1::n[) =
0 if i < s
min
or l
/
,= l
suff(Q
l
/ ; w) = 1
S
F
(i 1; j 1; l; w[1::n 1[) if w[n[ = 0
S
F
(i 1; j; l; w[1::n 1[) if n = [Q
l
[ and
suff(Q
l
; w) = 0
S
F
(i 1; j; l; w[1::n 1[)
U
F
(i 1; j; w[1::n 1[) if n = s
max
and
suff(Q
l
; w) = 1
and \l
/
,= l;
suff(Q
l
/ ; w) = 0;
S
F
(i; j; l; 1:w[1::n[)
S
F
(i; j; l; 0:w[1::n[) if zeros(w) < j
S
F
(i; j; l; 1:w[1::n[) if zeros(w) = j:
8
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
:
The third and fourth relations play the principal role:
if Q
l
does not match a suffix of w[1::n[, then we simply
drop out the last letter. If Q
l
matches a suffix of w[1::n[,
but no other seed does, then we count prefixes matched
by Q
l
exclusively (term S
F
(i 1; j; l; w[1::n 1[)) together
with prefixes matched by no seed at all (term
U
F
(i 1; j; w[1::n 1[)). The latter is computed by the
algorithm of the previous section.
The complexity of computing S
F
(m; k; l) for a given l is
the same as the complexity of dynamic programming
algorithms from the previous sections.
4 SEED DESIGN
In the previous section we showed how to compute various
useful characteristics of a given family of seeds. A much
more difficult task is to find an efficient seed family that
solves a given (m; k)-problem. Note that there exists a trivial
solution where the family consists of all
m
k

position
combinations, but this is in general unacceptable in practice
because of a huge number of seeds. Our goal is to find
families of reasonable size (typically, with the number of
seeds smaller than 10), with a good filtration efficiency.
In this section, we present several results that contribute
to this goal. In Section 4.1, we start with the case of single
seed with a fixed number of jokers and show, in particular,
that for one joker, there exists one best seed in a sense that
will be defined. We then show in Section 4.2 that a solution
for a larger problem can be obtained from a smaller one by a
regular expansion operation. In Section 4.3, we focus on
seeds that have a periodic structure and show how those
seeds can be constructed by iterating some smaller seeds.
We then show a way to build efficient families of periodic
seeds. Finally, in Section 4.4, we briefly describe a heuristic
approach to constructing efficient seed families that we
used in the experimental part of this work presented in
Section 5.
4.1 Single Seeds with a Fixed Number of Jokers
Assume that we fixed a class of seeds under interest (e.g.,
seeds of a given minimal weight). One possible way to
define the seed design problem is to fix a similarity length
m and find a seed that solves the (m; k)-problem with the
largest possible value of k. A complementary definition is to
fix k and minimize m provided that the (m; k)-problem is
still solved. In this section, we adopt the second definition
and present an optimal solution for one particular case.
For a seed Q and a number of mismatches k, define the
k-critical length for Q as the minimal value m such that Q
solves the (m; k)-problem. For a class of seeds c and a value
k, a seed is k-optimal in c if Q has the minimal k-critical
length among all seeds of c.
One interesting class of seeds c is obtained by putting an
upper bound on the possible number of jokers in the seed,
i.e. on the number (s(Q) w(Q)). We have found a general
solution of the seed design problem for the class c
1
(n)
consisting of seeds of weight d with only one joker, i.e. seeds
#
dr
#
r
.
Consider first the case of one mismatch, i.e., k = 1. A
1-optimal seed from c
1
(d) is #
dr
#
r
with r = d=2|. To
see this, consider an arbitrary seed Q = #
p
#
q
, p q = d,
and assume by symmetry that p _ q. Observe that the
longest (m; 1)-similarity that is not detected by Q is
1
p1
01
pq
of length (2p q). Therefore, we have to minimize
2p q = d p, and since p _ d=2|, the minimum is reached
for p = d=2|, q = d=2|.
However, for k _ 2, an optimal seed has an asymmetric
structure described by the following theorem.
Theorem 1. Let n be an integer and r = [d=3[ ([x[ is the closest
integer to x). For every k _ 2, seed Q(d) = #
dr
#
r
is
k-optimal among the seeds of c
1
(d).
Proof. Again, consider a seed Q = #
p
#
q
, p q = d, and
assume that p _ q. Consider the longest word S(k) from
(1
+
0)
k
1
+
, k _ 1, which is not detected by Q and let L(k) is
the length of S(k). By the above remark, S(1) = 1
p1
01
pq
and L(1) = 2p q.
It is easily seen that for every k, S(k) starts either with
1
p1
0, or with 1
pq
01
q1
0. Define L
/
(k) to be the maximal
length of a word from (1
+
0)
k
1
+
that is not detected by Q
and starts with 1
q1
0. Since prefix 1
q1
0 implies no
additional constraint on the rest of the word, we have
L
/
(k) = q L(k 1). Observe that L
/
(1) = p 2q (word
1
q1
01
pq
). To summarize, we have the following
recurrences for k _ 2:
L
/
(k) = q L(k 1); (1)
L(k) = maxp L(k 1); p q 1 L
/
(k 1); (2)
with initial conditions L
/
(1) = p 2q, L(1) = 2p q.
Two cases should be distinguished. If p _ 2q 1, then
the straightforward induction shows that the first term in
(2) is always greater, and we have
L(k) = (k 1)p q; (3)
and the corresponding longest word is
S(k) = (1
p1
0)
k
1
pq
: (4)
54 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
If q _ p _ 2q 1, then by induction, we obtain
L(k) =
( 1)p (k 1)q if k = 2;
( 2)p kq if k = 2 1;
&
(5)
and
S(k) =
(1
pq
01
q1
0)

1
pq
if k = 2;
1
p1
0(1
pq
01
q1
0)

1
pq
if k = 2 1:
&
(6)
By definition of L(k), seed #
p
#
q
detects any word
from (1
+
0)
k
1
+
of length (L(k) 1) or more, and this is the
tight bound. Therefore, we have to find p; q which
minimize L(k). Recall that p q = d, and observe that for
p _ 2q 1, L(k) (defined by (3)) is increasing on p, while
for p _ 2q 1, L(k) (defined by (5)) is decreasing on p.
Therefore, both functions reach its minimum when
p = 2q 1. Therefore, if d = 1 (mod 3), we obtain q =
d=3| and p = d q. If d = 0 (mod 3), a routine computa-
tion shows that the minimum is reached at q = d=3,
p = 2d=3, and if d = 2 (mod 3), the minimum is reached
at q = d=3|, p = d q. Putting the three cases together
results in q = [d=3[, p = d q. .
To illustrate Theorem 1, seed #### ## is optimal
among all seeds of weight 6 with one joker. This means that
this seed solves the (m; 2)-problem for all m _ 16 and this is
the smallest possible bound over all seeds of this class.
Similarly, this seed solves the (m; 3)-problem for all m _ 20,
which is the best possible bound, etc.
4.2 Regular Expansion and Contraction of Seeds
We now show that seeds solving larger problems can be
obtained from seeds solving smaller problems, and vice
versa, using regular expansion and regular contraction
operations.
Given a seed Q , its i-regular expansion i Q is
obtained by multiplying each matching position by i. This
is equivalent to inserting i 1 jokers between every two
successive positions along the seed. For example, if Q =
0; 2; 3; 5 (or # ## #), then the 2-regular expansion
of Q is 2 Q = 0; 4; 6; 10 (or # # # #).
Given a family F, its i-regular expansion i F is the
family obtained by applying the i-regular expansion on
each seed of F.
Lemma 1. If a family F solves an (m; k)-problem, then the
(im; (i 1)k 1)-problem is solved both by family F and by
its i-regular expansion F
i
= i F.
Proof. Consider an (im; (i 1)k 1)-similarity w. By the
pigeon hole principle, it contains at least one substring of
length m with k mismatches or less and, therefore, F
solves the (im; (i 1)k 1)-problem. On the other hand,
consider i disjoint subsequences of w each one consisting
of m positions equal modulo i. Again, by the pigeon hole
principle, at least one of them contains k mismatches or
less and, therefore, the (im; (i 1)k 1)-problem is
solved by i F. .
The following lemma is the inverse of Lemma 1. It states
that if seeds solving a bigger problem have a regular
structure, then a solution for a smaller problem can be
obtained by the regular contraction operation, inverse to the
regular expansion.
Lemma 2. If a family F
i
= i F solves an (im; k)-problem, then
F solves both the (im; k)-problemand the (m; k=i|)-problem.
Proof. One can even show that F solves the (im; k)-problem
with the additional restriction for F to match inside one of
the position intervals [1::m[; [m 1::2m[; . . . ; [(i 1)m
1::im[. This is done by using the bijective mapping from
Lemma 1: Given an (im; k)-similarity w, consider i disjoint
subsequences w
j
(0 _ j _ i 1) of w obtained by picking
m positions equal to j modulo i, and then consider the
concatenation w
/
= w
1
w
2
. . . w
i1
w
0
.
For every (im; k)-similarity w
/
, its inverse image w is
detected by F
i
, and therefore F detects w
/
at one of the
intervals
[1::m[; [m 1::2m[; . . . ; [(i 1)m 1::im[:
Futhermore, for any (m; k=i|)-similarity v, consider w
/
=
v
i
and its inverse image w. As w
/
is detected by F
i
, v is
detected by F. .
Example 1. To illustrate the two lemmas above, we give the
following example pointed out in [1]. The following two
seeds are the only seeds of weight 12 that solve the
(50; 5)-problem:
# # # # # # #
# # # # #
and
### # ### # ### #:
The first one is the 2-regular expansion of the second. The
second one is the only seed of weight 12 that solves the
(25; 2)-problem.
The regular expansion allows, in some cases, to obtain an
efficient solution for a larger problem by reducing it to a
smaller problem for which an optimal or a near-optimal
solution is known.
4.3 Periodic Seeds
In this section, we study seeds with a periodic structure that
can be obtained by iterating a smaller seed. Such seeds often
turn out to be among maximally weighted seeds solving a
given (m; k)-problem. Interestingly, this contrasts with the
lossy framework where optimal seeds usually have a
random irregular structure.
Consider two seeds Q
1
;Q
2
represented as words over
#;. In this section, we lift the assumption that a seed
must start and end with a matching position. We denote
[Q
1
;Q
2
[
i
the seed defined as (Q
1
Q
2
)
i
Q
1
. For example,
[### #; [
2
=### # ### # ### #.
We also need a modification of the (m; k)-problem, where
(m; k)-similarities are considered modulo a cyclic permuta-
tion. We say that a seed family F solves a cyclic
(m; k)-problem, if for every (m; k)-similarity w, F detects
one of cyclic permutations of w. Trivially, if F solves an
(m; k)-problem, it also solves the cyclic (m; k)-problem. To
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 55
distinguish from a cyclic problem, we call sometimes an
(m; k)-problem a linear problem.
We first restrict ourselves to the single-seed case. The
following lemma demonstrates that iterating smaller seeds
solving a cyclic problem allows to obtain a solution for
bigger problems, for the same number of mismatches.
Lemma 3. If a seed Q solves a cyclic (m; k)-problem, then for
every i _ 0, the seed Q
i
= [Q;
(ms(Q))
[
i
solves the linear
(m (i 1) s(Q) 1; k)-problem. If i ,= 0, the inverse
holds too.
Proof. = Consider an (m (i 1) s(Q) 1; k)-similarity
u. Transform u into a similarity u
/
for the cyclic
(m; k)-problem as follows: For each mismatch position
of u, set 0 at position ( mod m) in u
/
. The other positions
of u
/
are set to 1. Clearly, there are at most k 0s in u. As Q
solves the (m; k)-cyclic problem, we can find at least one
position j, 1 _ j _ m, such that Q detects u
/
cyclicly.
We show now that Q
i
matches at position j of u (which
is a validpositionas 1 _ j _ mands(Q
i
) = im s(Q)). As
the positions of 1 in u are projected modulo mto matching
positions of Q, then there is no 0 under any matching
element of Q
i
and, thus, Q
i
detects u.
= Consider a seed Q
i
= [Q;
(ms(Q))
[
i
solving the
(m (i 1) s(Q) 1; k)-problem. As i > 0, consider (m
(i 1) s(Q) 1; k)-similarities having all their mis-
matches located inside the interval [m; 2m 1[. For each
such similarity, there exists a position j, 1 _ j _ m, such
that Q
i
detects it. Note that the span of Q
i
is at least
m s(Q), which implies that there is either an entire
occurrence of Q inside the window [m; 2m 1[, or a
prefix of Q matching a suffix of the window and the
complementary suffix of Q matching a prefix of the
window. This implies that Q solves the cyclic
(m; k)-problem. .
Example 2. Observe that the seed ### # solves the
cyclic (7; 2)-problem. From Lemma 3, this implies that for
every i _ 0, the (11 7i; 2)-problem is solved by the seed
[### #; [
i
of span 5 7i. Moreover, for i = 1; 2; 3,
this seed is optimal (maximally weighted) over all seeds
solving the problem.
By a similar argument based on Lemma 3, the
periodic seed [##### ##; [
i
solves the
(18 11i; 2)-problem. Note that its weight grows as
7
11
m compared to
4
7
m for the seed from the previous
paragraph. However, when m , this is not an
asymptotically optimal bound, as we will see later.
The (18 11i; 3)-problem is solved by the seed
(### # #; )
i
, as seed ### # #
solves the cyclic (11; 3)-problem. For i = 1; 2, the former
is a maximally weighted seed among all solving the
(18 11i; 3)-problem.
One question raised by these examples is whether
iterating some seed could provide an asymptotically
optimal solution, i.e., a seed of maximal asymptotic weight.
The following theorem establishes a tight asymptotic bound
on the weight of an optimal seed, for a fixed number of
mismatches. It gives a negative answer to this question, as it
shows that the maximal weight grows faster than any linear
fraction of the similarity size.
Theorem 2. Consider a constant k. Let w(m) be the maximal
weight of a seed solving the cyclic (m; k)-problem. Then,
(m w(m)) = (m
k1
k
).
Proof. Note first that all seeds solving a cyclic (m; k)-problem
canbe consideredas seeds of spanm. The number of jokers
in any seed Q is then n = m w(Q). The theorem states
that the minimal number of jokers of a seed solving the
(m; k)-problem is (m
k1
k
) for every fixed k.
Lower bound Consider a cyclic (m; k)-problem. The
number D(m; k) of distinct cyclic (m; k)-similarities
satisfies
m
k

m
_ D(m; k); (7)
as every linear (m; k)-similarity has at most m cyclicly
equivalent ones. Consider a seed Q. Let n be the number
of jokers in Q and J
Q
(m; k) the number of distinct cyclic
(m; k)-similarities detected by Q. Observe that J
Q
(m; k) _
n
k

and if Q solves the cyclic (m; k)-problem, then
D(m; k) = J
Q
(m; k) _
n
k

: (8)
From (7) and (8), we have
m
k

m
_
n
k

: (9)
Using the Stirling formula, this gives n(k) = (m
k1
k
).
Upper bound. To prove the upper bound, we construct
a seed Q that has no more then k m
k1
k
joker positions
and solves the cyclic (m; k)-problem.
We start with the seed Q
0
of span m with all matching
positions, and introduce jokers into it in k steps. After
step i, the obtained seed is denoted Q
i
, and Q = Q
k
.
Let B = m
1
k
|. Q
1
is obtained by introducing into Q
0
individual jokers with periodicity B by placing jokers at
positions 1; B 1; 2B 1; . . . . At step 2, we introduce
into Q
1
contiguous intervals of jokers of length B with
periodicity B
2
, such that jokers are placed at positions
[1 . . . B[; [B
2
1 . . . B
2
B[; [2B
2
1 . . . 2B
2
B[; . . . .
In general, at step i (i _ k), we introduce into Q
i
intervals of B
i1
jokers with periodicity B
i
at positions
[1 . . . B
i1
[; [B
i
1 . . . B
i
B
i1
[; . . . (see Fig. 1).
Note that Q
i
is periodic with periodicity B
i
. Note
also that at each step i, we introduce at most m
1
i
k
|
intervals of B
i1
jokers. Moreover, due to overlaps
with already added jokers, each interval adds (B
1)
i1
new jokers.
This implies that the total number of jokers added at
step i is at most m
1
i
k
(B 1)
i1
_ m
1
i
k
m
1
k
(i1)
= m
k1
k
.
Thus, the total number of jokers in Q is less than k m
k1
k
.
Byinductiononi, we prove that for any (m; i)-similarity
u(i _ k), Q
i
detects ucyclicly, that is there is a cyclic shift of
Q
i
such that all i mismatches of u are covered with jokers
introduced at steps 1; . . . ; i.
For i = 1, the statement is obvious, as we can
always cover the single mismatch by shifting Q
1
by at
most (B 1) positions. Assuming that the statement
56 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
holds for (i 1), we show now that it holds for i too.
Consider an (m; i)-similarity u. Select one mismatch of
u. By induction hypothesis, the other (i 1) mis-
matches can be covered by Q
i1
. Since Q
i1
has period
B
i1
and Q
i
differs from Q
i1
by having at least one
contiguous interval of B
i1
jokers, we can always shift
Q
i
by j B
i1
positions such that the selected mismatch
falls into this interval. This shows that Q
i
detects u.
We conclude that Q solves the cyclic (m; i)-problem. .
Using Theorem 2, we obtain the following bound on the
number of jokers for the linear (m; k)-problem.
Lemma 4. Consider a constant k. Let w(m) be the maximal
weight of a seed solving the linear (m; k)-problem. Then,
(m w(m)) = (m
k
k1
).
Proof. To prove the upper bound, we construct a seed Q
that solves the linear (m; k)-problem and satisfies the
asymptotic bound. Consider some l < m that will be
defined later, and let P be a seed that solves the cyclic
(l; k)-problem. Without loss of generality, we assume
s(P) = l.
For a real number e _ 1, define P
e
to be the maximally
weighted seed of span at most l
e
of the form
P
/
P P P
//
, where P
/
and P
//
are, respectively, a
suffix and a prefix of P. Due to the condition of maximal
weight, w(P
e
) _ e w(P).
We now set Q = P
e
for some real e to be defined.
Observe that if e l _ m l, then Q solves the linear
(m; k)-problem. Therefore, we set e =
ml
l
.
Fromtheproof of Theorem2, wehavel w(P) _ k l
k1
k
.
We then have
w(Q) = e w(P) _
m l
l
(l k l
k1
k
): (10)
If we set
l = m
k
k1
; (11)
we obtain
m w(Q) _ (k 1)m
k
k1
km
k1
k1
; (12)
and as k is constant,
m w(Q) = O(m
k
k1
): (13)
The lower bound is obtained similarly to Theorem 2.
Let Q be a seed solving a linear (m; k)-problem, and let
n = m w(Q). From simple combinatorial considera-
tions, we have
m
k

_
n
k

(m s(Q)) _
n
k

n; (14)
which implies n = (m
k
k1
) for constant k. .
The following simple lemma is also useful for construct-
ing efficient seeds.
Lemma 5. Assume that a family F solves an (m; k)-problem. Let
F
/
be the family obtained from F by cutting out l characters
from the left and r characters from the right of each seed of F.
Then F
/
solves the (m r l; k)-problem.
Example 3. The (9 7i; 2)-problem is solved by the seed
[###; # [
i
which is optimal for i = 1; 2; 3. Using
Lemma 5, this seed can be immediately obtained from
the seed [### #; [
i
from Example 2, solving the
(11 7i; 2)-problem.
We now apply the above results for the single seed case
to the case of multiple seeds.
For a seed Q considered as a word over #; , we
denote by Q
[i[
its cyclic shift to the left by i characters.
For exampl e, i f Q = #### # ## , t hen
Q
[5[
= # ## #### . The following lemma gives
a way to construct seed families solving bigger
problems from an individual seed solving a smaller
cyclic problem.
Lemma 6. Assume that a seed Q solves a cyclic (m; k)-problem
and assume that s(Q) = m (otherwise, we pad Q on the right
with (m s(Q)) jokers). Fix some i > 1. For some L > 0,
consider a list of Lintegers 0 _ j
1
< < j
L
< m, and define a
family of seeds F =< |(Q
[jl [
)
i
| >
L
l=1
, where |(Q
[jl [
)
i
| stands
for the seed obtained from(Q
[jl [
)
i
by deleting the joker characters
at the left and right edges. Define (l) = ((j
l1
j
l
) mod m)
(or, alternatively, (l) = ((j
l
j
l1
) mod m)) for all l,
1 _ l _ L. Let m
/
= maxs(|(Q
[jl [
)
i
|) (l)
L
l=1
1. Then,
F solves the (m
/
; k)-problem.
Proof. The proof is an extension of the proof of Lemma 3.
Here, the seeds of the family are constructed in such a
way that for any instance of the linear (m
/
; k)-problem,
there exists at least one seed that satisfies the property
required in the proof of Lemma 3 and, therefore, matches
this instance. .
In applying Lemma 6, integers j
l
are chosen from the
interval [0; m[ in such a way that values s([[(Q[j
l
[)
i
[[) (l)
are closed to each other. We illustrate Lemma 6 with two
examples that follow.
Example 4. Let m = 11, k = 2. Consider the seed Q =
#### # ## solving the cyclic (11; 2)-problem.
Choose i = 2, L = 2, j
1
= 0, j
2
= 5. This gives two seeds:
Q
1
= |(Q
[0[
)
2
| = #### # ## #### # ##
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 57
Fig. 1. Construction of seeds Q
i
from the proof of Theorem 2. Jokers are
represented in white and matching positions in black.
and
Q
2
=|(Q
[5[
)
2
| = # ## #### # ## ####
of span 20 and 21, respectively, (1) = 6 and (2) = 5.
max20 6; 21 5 1 = 25. Therefore, family F =
Q
1
; Q
2
solves the (25; 2)-problem.
Example 5. Let m = 11, k = 3. The seed Q = ### #
# solving the cyclic (11; 3)-problem. Choose
i = 2, L = 2, j
1
= 0, j
2
= 4. The two seeds are
Q
1
= |(Q
[0[
)
2
| = ### # # ### # #
(span 19) and
Q
2
= |(Q
[4[
)
2
|
= # # ### # # ###
(span 21), with (1) = 7 and (2) = 4. max19 7;
21 4 1 = 25. Therefore, family F = Q
1
; Q
2
solves
the (25; 3)-problem.
4.4 Heuristic Seed Design
Results of Sections 4.1, 4.2, and 4.3 allow one to construct
efficient seed families in certain cases, but still do not allow
a systematic seed design. Recently, linear programming
approaches to designing efficient seed families were
proposed in [19] and in [18], respectively, for DNA and
protein similarity search. However, neither of these
methods aims at constructing lossless families.
In this section, we outline a heuristic genetic program-
ming algorithm for designing lossless seed families. The
algorithm will be used in the experimental part of this
work, that we present in the next section. Note that this
algorithm uses the dynamic programming algorithms
discussed in Section 3. Since the algorithm uses standard
genetic programming techniques, we give only a high-level
description here without going into all details.
The algorithm tries to iteratively improve characteristics
of a population of seed families until it finds a small family
that detects all (m; k)-similarities (i.e., is lossless). The first
step of each iteration is based on screening current families
against a set of difficult similarities that are similarities that
have been detected by fewer families. This set is continually
reordered and updated according to the number of families
that do not detect those similarities. For this, each set is
stored in a tree and the reordering is done using the list-as-
a-tree principle [20]: Each time a similarity is not detected by
a family, it is moved towards the root of the tree such that
its height is divided by two.
For those families that pass through the screening, the
number of undetected similarities is computed by the
dynamic programming algorithm of Section 3.2. The family
is kept if it produces a smaller number than the families
currently known. An undetected similarity obtained during
this computation is added as a leaf to the tree of difficult
similarities.
To detect seeds to be improved inside a family, we
compute the contribution of each seed by the dynamic
programming algorithm of Section 3.3. The seeds with the
least contribution are then modified with a higher prob-
ability. In general, the population of seed families is
evolving by mutating and crossing over according to the set
of similarities they do not detect. Moreover, random seed
families are regularly injected into the population in order
to avoid local optima.
The described heuristic procedure often allows efficient
or even optimal solutions to be computed in a reasonable
time. For example, in 10 runs of the algorithm, we found
three of the six existing families of two seeds of weight 14
solving the (25; 2)-problem. The whole computation took
less than 1 hour, compared to a week of computation
needed to exhaustively test all seed pairs. Note that the
randomized-greedy approach (incremental completion of
the seed set by adding the best random seed) applied a
dozen of times to the same problem yielded only sets of
three and sometimes four, but never two seeds, taking
about 1 hour at each run.
5 EXPERIMENTS
We describe two groups of experiments that we made. The
first one concerns the design of efficient seed families, and
the second one applies a multiseed lossless filtration to the
identification of unique oligos in a large set of EST
sequences.
5.1 Seed Design Experiments
We considered several (m; k)-problems. For each problem,
and for a fixed number of seeds in the family, we computed
families solving the problem and realizing the largest
possible seed weight (under a natural assumption that all
seeds in a family have the same weight). We also kept track
of the ways (periodic seeds, genetic programming heur-
istics, exhaustive search) in which those families can be
computed.
Tables 1 and 2 summarize some results obtained for the
(25; 2)-problem and the (25; 3)-problem, respectively. Fa-
milies of periodic seeds (that can be found using Lemma 6)
are marked with
p
, those that are found using a genetic
algorithm are marked with
g
, and those which are obtained
by an exhaustive search are marked with
e
. Only in this
latter case, the families are guaranteed to be optimal.
Families of periodic seeds are shifted according to their
construction (see Lemma 6).
Moreover, to compare the selectivity of different families
solving a given (m; k)-problem, we estimated the probability
for at least one of the seeds of the family to match at a
given position of a uniform Bernoulli four-letter sequence.
This has been done using the inclusion-exclusion formula.
Note that the simple fact of passing from a single seed to
a two-seed family results in a considerable gain in
efficiency: In both examples shown in the tables there a
change of about one order magnitude in the selectivity
estimator .
5.2 Oligo Selection Using Multiseed Filtering
An important practical application of lossless filtration is
the selection of reliable oligonucleotides for DNA micro-
array experiments. Oligonucleotides (oligos) are small DNA
sequences of fixed size (usually ranging from 10 to 50)
designed to hybridize only with a specific region of the
genome sequence. In microarray experiments, oligos are
expected to match ESTs that stem from a given gene and not
58 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
to match those of other genes. As the first approximation,
the problem of oligo selection can then be formulated as the
search for strings of a fixed length that occur in a given
sequence but do not occur, within a specified distance, in
other sequences of a given (possibly very large) sample.
Different approaches to this problem apply different
distance measures and different algorithmic techniques
[21], [22], [23], [24]. The experiments we briefly present here
demonstrate that the multiseed filtering provides an
efficient computation of candidate oligonucleotides. These
should then be further processed by complementary
methods in order to take into account other physico-
chemical factors occurring in hybridisation, such as the
melting temperature or the possible hairpin structure of
palindromic oligos.
Here, we adopt the formalization of the oligo selection
problem as the problem of identifying in a given sequence
(or a sequence database) all substrings of length m that have
no occurrences elsewhere in the sequence within the
Hamming distance k. The parameters m and k were set to
32 and 5, respectively. For the (32; 5)-problem, different seed
families were designed and their selectivity was estimated.
Those are summarized in the table in Fig. 2, using the same
conventions as in Tables 1 and 2 above. The family
composed of six seeds of weight 11 was selected for the
filtration experiment (shown in Fig. 2).
The filtering has been applied to a database of rice EST
sequences composed of 100,015 sequences for a total length
of 42,845,242 bp.
1
Substrings matching other substrings
with five substitution errors or less were computed. The
computation took slightly more than one hour on a
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 59
TABLE 2
Seed Families for (25,3)-Problem
1. Source: http://bioserver.myongji.ac.kr/ricemac.html, The Korea Rice
Genome Database.
TABLE 1
Seed Families for (25,2)-Problem
Pentium2 4 3GHz computer. Before applying the filtering
using the family for the (32; 5)-problem, we made a rough
prefiltering using one spaced seed of weight 16 to detect,
with a high selectivity, almost identical regions. Sixty-five
percent of the database has been discarded by this
prefiltering. Another 22 percent of the database has been
filtered out using the chosen seed family, leaving the
remaining 13 percent as oligo candidates.
6 CONCLUSION
In this paper, we studied a lossless filtration method based
on multiseed families and demonstrated that it represents
an improvement compared to the single-seed approach
considered in [1]. We showed how some important
characteristics of seed families can be computed using the
dynamic programming. We presented several combinator-
ial results that allow one to construct efficient families
composed of seeds with a periodic structure. Finally, we
described a large-scale computational experiment of de-
signing reliable oligonucleotides for DNA microarrays. The
obtained experimental results provided evidence of the
applicability and efficiency of the whole method.
The results of Sections 4.1, 4,2, and 4.3 establish several
combinatorial properties of seed families, but many more of
them remain to be elucidated. The structure of optimal or
near-optimal seed families can be reduced to number-
theoretic questions, but this relation remains to be clearly
established. In general, constructing an algorithm to
systematically design seed families with quality guarantee
remains an open problem. Some complexity issues remain
open too: For example, what is the complexity of testing if a
single seed is lossless for given m; k? Section 3 implies a
time bound exponential on the number of jokers. Note that
for multiple seeds, computing the number of detected
similarities is NP-complete [16, Section 3.1].
Another direction is to consider different distance
measures, especially the Levenstein distance, or at least to
allow some restricted insertion/deletion errors. The method
proposed in [25] does not seem to be easily generalized to
multiseed families, and a further work is required to
improve lossless filtering in this case.
ACKNOWLEDGMENTS
G. Kucherov and L. Noe have been supported by the French
Action Specifique Algorithmes et Sequences of CNRS. A part
of this work has been done during a stay of M. Roytberg at
LORIA, Nancy, supported by INRIA. M. Roytberg has been
supported by the Russian Foundation for Basic Research
(project nos. 03-04-49469, 02-07-90412) and by grants from
the RF Ministry for Industry, Science, and Technology (20/
2002, 5/2003) and NWO. An extended abstract of this work
has been presented to the Combinatorial Pattern Matching
Conference (Istanbul, July 2004).
REFERENCES
[1] S. Burkhardt and J. Karkkainen, Better Filtering with Gapped
q-Grams, Fundamenta Informaticae, vol. 56, nos. 1-2, pp. 51-70,
2003, preliminary version in Combinatorial Pattern Matching
2001.
[2] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings
Practical On-Line Search Algorithms for Texts and Biological
Sequences. Cambridge Univ. Press, 2002.
[3] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller,
and D. Lipman, Gapped BLAST and PSI-BLAST: A New
Generation of Protein Database Search Programs, Nucleic Acids
Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[4] B. Ma, J. Tromp, and M. Li, PatternHunter: Faster and More
Sensitive Homology Search, Bioinformatics, vol. 18, no. 3, pp. 440-
445, 2002.
[5] S. Schwartz, J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison,
D. Haussler, and W. Miller, HumanMouse Alignments with
BLASTZ, Genome Research, vol. 13, pp. 103-107, 2003.
[6] L. Noe and G. Kucherov, Improved Hit Criteria for DNA Local
Alignment, BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
[7] P. Pevzner and M. Waterman, Multiple Filtration and Approx-
imate Pattern Matching, Algorithmica, vol. 13, pp. 135-154, 1995.
[8] A. Califano and I. Rigoutsos, Flash: A Fast Look-Up Algorithm
for String Homology, Proc. First Intl Conf. Intelligent Systems for
Molecular Biology, pp. 56-64, July 1993.
[9] J. Buhler, Provably Sensitive Indexing Strategies for Biosequence
Similarity Search, Proc. Sixth Ann. Intl Conf. Computational
Molecular Biology (RECOMB 02), pp. 90-99, Apr. 2002.
[10] U. Keich, M. Li, B. Ma, and J. Tromp, On Spaced Seeds for
Similarity Search, Discrete Applied Math., vol. 138, no. 3, pp. 253-
263, 2004.
[11] J. Buhler, U. Keich, and Y. Sun, Designing Seeds for Similarity
Search in Genomic DNA, Proc. Seventh Ann. Intl Conf. Computa-
tional Molecular Biology (RECOMB 03), pp. 67-75, Apr. 2003.
[12] B. Brejova, D. Brown, and T. Vinar, Vector Seeds: An Extension to
Spaced Seeds Allows Substantial Improvements in Sensitivity and
Specificity, Proc. Third Intl Workshop Algorithms in Bioinformatics
(WABI), pp. 39-54, Sept. 2003.
[13] G. Kucherov, L. Noe, and Y. Ponty, Estimating Seed Sensitivity
on Homogeneous Alignments, Proc. IEEE Fourth Symp. Bioinfor-
matics and Bioeng. (BIBE 2004), May 2004.
[14] K. Choi and L. Zhang, Sensitivity Analysis and Efficient Method
for Identifying Optimal Spaced Seeds, J. Computer and System
Sciences, vol. 68, pp. 22-40, 2004.
[15] M. Csűrös, Performing Local Similarity Searches with Variable
Length Seeds, Proc. 15th Ann. Combinatorial Pattern Matching
Symp. (CPM), pp. 373-387, 2004.
Fig. 2. Computed seed families for the (32; 5)-problem and the chosen family (six seeds of weight 11).
[16] M. Li, B. Ma, D. Kisman, and J. Tromp, PatternHunter II: Highly
Sensitive and Fast Homology Search, J. Bioinformatics and
Computational Biology, vol. 2, no. 3, pp. 417-440, Sept. 2004.
[17] Y. Sun and J. Buhler, Designing Multiple Simultaneous Seeds for
DNA Similarity Search, Proc. Eighth Ann. Intl Conf. Research in
Computational Molecular Biology (RECOMB 2004), pp. 76-84, Mar.
2004.
[18] D.G. Brown, Multiple Vector Seeds for Protein Alignment, Proc.
Fourth Intl Workshop Algorithms in Bioinformatics (WABI), pp. 170-
181, Sept. 2004.
[19] J. Xu, D. Brown, M. Li, and B. Ma, Optimizing Multiple Spaced
Seeds for Homology Search, Proc. 15th Symp. Combinatorial
Pattern Matching, pp. 47-58, 2004.
[20] J. Oommen and J. Dong, Generalized Swap-with-Parent Schemes
for Self-Organizing Sequential Linear Lists, Proc. 1997 Intl Symp.
Algorithms and Computation (ISAAC 97), pp. 414-423, Dec. 1997.
[21] F. Li and G. Stormo, Selection of Optimal DNA Oligos for Gene
Expression Arrays, Bioinformatics, vol. 17, pp. 1067-1076, 2001.
[22] L. Kaderali and A. Schliep, Selecting Signature Oligonucleotides
to Identify Organisms Using DNA Arrays, Bioinformatics, vol. 18,
no. 10, pp. 1340-1349, 2002.
[23] S. Rahmann, Fast Large Scale Oligonucleotide Selection Using
the Longest Common Factor Approach, J. Bioinformatics and
Computational Biology, vol. 1, no. 2, pp. 343-361, 2003.
[24] J. Zheng, T. Close, T. Jiang, and S. Lonardi, Efficient Selection of
Unique and Popular Oligos for Large EST Databases, Proc. 14th
Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 273-283,
2003.
[25] S. Burkhardt and J. Kärkkäinen, One-Gapped q-Gram Filters for
Levenshtein Distance, Proc. 13th Symp. Combinatorial Pattern
Matching (CPM 02), vol. 2373, pp. 225-234, 2002.
Gregory Kucherov received the PhD degree in
computer science in 1988 from the USSR
Academy of Sciences, and a Habilitation degree
in 2000 from the Henri Poincaré University in
Nancy. He is a senior INRIA researcher with the
LORIA research unit in Nancy, France. For the
last 10 years, he has been doing research on
word combinatorics, text algorithms and combi-
natorial algorithms for bioinformatics, and com-
putational biology.
Laurent Noé studied computer science at the
ESIAL engineering school in Nancy, France. He
received the MS degree in 2002 and is currently
a PhD student in computational biology at
LORIA.
Mikhail Roytberg received the PhD degree in
computer science in 1983 from Moscow State
University. He is a leader of the Computational
Molecular Biology Group in the Institute of
Mathematical Problems in Biology of the Rus-
sian Academy of Sciences at Pushchino, Rus-
sia. In recent years, his main research field
has been the development of algorithms for
comparative analysis of biological sequences.
. For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
Text Mining Biomedical Literature
for Discovering Gene-to-Gene Relationships:
A Comparative Study of Algorithms
Ying Liu, Shamkant B. Navathe, Jorge Civera, Venu Dasigi,
Ashwin Ram, Brian J. Ciliax, and Ray Dingledine
Abstract: Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of
microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their
usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from
MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns.
The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper.
Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA),
which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional
keyword associations. The results showed that BEA-PARTITION and the hierarchical clustering algorithm outperformed k-means clustering
and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of
BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell
cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the
results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means
and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION
provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a
powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
Index Terms: Bond energy algorithm, microarray, MEDLINE, text analysis, cluster analysis, gene function.
1 INTRODUCTION
DNA microarrays, among the most rapidly growing tools
for genome analysis, are introducing a paradigmatic
change in biology by shifting experimental approaches from
single gene studies to genome-level analyses [1], [2].
Increasingly accessible microarray platforms allow the
rapid generation of large expression data sets [3]. One of
the key challenges of microarray studies is to derive
biological insights from the unprecedented quantities of
data on gene-expression patterns [5]. Partitioning genes into
closely related groups has become an element of practically
all analyses of microarray data [4].
A number of computer algorithms have been applied to
gene clustering. One of the earliest was a hierarchical
algorithm developed by Eisen et al. [6]. Other popular
algorithms, such as k-means [7] and Self-Organizing Maps
(SOM) [8], have also been widely used. These algorithms have
demonstrated their usefulness in gene clustering, but some
basic problems remain [2], [9]. Hierarchical clustering
organizes expression data into a binary tree, in which the
leaves are genes and the interior nodes (or branch points) are
candidate clusters. True clusters with discrete boundaries are
not produced [10]. Although SOM is efficient and simple to
implement, studies suggest that it typically performs worse
than the traditional techniques, such as k-means [11].
Based on the assumption that genes with the same function
or in the same biological pathway usually show similar
expression patterns, the functions of unknown genes can be
inferred from those of the known genes with similar
expression profile patterns. Therefore, expression profile
gene clustering by all the algorithms mentioned above has
received much attention; however, the task of finding
functional relationships between specific genes is left to the
investigator. Manual scanning of the biological literature (for
example, via MEDLINE) for clues regarding potential
functional relationships among a set of genes is not feasible
when the number of genes to be explored rises above
approximately 10. Restricting the scan (manual or automatic)
to annotation fields of GenBank, SwissProt, or LocusLink is
quicker but can suffer from the ad hoc relationship of
keywords to the research interests of whoever submitted
the entry. Moreover, keeping annotation fields current as new
information appears in the literature is a major challenge that
is rarely met adequately.
. Y. Liu, S.B. Navathe, J. Civera, and A. Ram are with the College of
Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta,
GA 30322.
E-mail: {yingliu, sham, ashwin}@cc.gatech.edu, jorcisai@iti.upv.es.
. V. Dasigi is with the Department of Computer Science, School of
Computing and Software Engineering, Southern Polytechnic State
University, Marietta, GA 30060. E-mail: vdasigi@spsu.edu.
. B.J. Ciliax is with the Department of Neurology, Emory University School
of Medicine, Atlanta, GA 30322. E-mail: bciliax@emory.edu.
. R. Dingledine is with the Department of Pharmacology, Emory University
School of Medicine, Atlanta, GA 30322.
E-mail: rdingledine@pharm.emory.edu.
Manuscript received 4 Apr. 2004; revised 1 Oct. 2004; accepted 10 Feb. 2005;
published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org, and reference IEEECS Log Number TCBB-0043-0404.
If, instead of organizing by expression pattern similarity,
genes were grouped according to shared function, investi-
gators might more quickly discover patterns or themes of
biological processes that were revealed by their microarray
experiments and focus on a select group of functionally
related genes. A number of clustering strategies based on
shared functions rather than similar expression patterns
have been devised. Chaussabel and Sher [3] analyzed
literature profiles generated by extracting the frequencies of
certain terms from the abstracts in MEDLINE and then
clustered the genes based on these terms, essentially
applying the same algorithm used for expression pattern
clustering. Jenssen et al. [12] used co-occurrence of gene
names in abstracts to create networks of related genes
automatically. Text analysis of biomedical literature has
also been applied successfully to incorporate functional
information about the genes in the analysis of gene
expression data [1], [10], [13], [14] without generating
clusters de novo. For example, Blaschke et al. [1] extracted
information about the common biological characteristics of
gene clusters from MEDLINE using Andrade and Valencia's
statistical text mining approach, which accepts user-
supplied abstracts related to a protein of interest and
returns an ordered set of keywords that occur in those
abstracts more often than would be expected by chance [15].
We expanded and extended Andrade and Valencia's
approach [15] to functional gene clustering by using an
approach that applies an algorithm called the Bond Energy
Algorithm (BEA) [16], [17], which, to our knowledge, has
not been used in bioinformatics. We modified it so that the
affinity among attributes (in our case, genes) is defined
based on the sharing of keywords between them and we
came up with a scheme for partitioning the clustered
affinity matrix to produce clusters of genes. We call the
resulting algorithm BEA-PARTITION. BEA was originally
conceived as a technique to cluster questions in psycholo-
gical instruments [16], has been used in operations research,
production engineering, marketing, and various other fields
[18], and is a popular clustering algorithm in distributed
database system (DDBS) design. The fundamental task of
BEA in DDBS design is to group attributes based on their
affinity, which indicates how closely related the attributes
are, as determined by the inclusion of these attributes by the
same database transactions. In our case, each gene was
considered as an attribute. Hence, the basic premise is that
two genes would have higher affinity, thus higher bond
energy, if abstracts mentioning these genes shared many
informative keywords. BEA has several useful properties
[16], [19]. First, it groups attributes with larger affinity
values together, and the ones with smaller values together
(i.e., during the permutation of columns and rows, it
shuffles the attributes towards those with which they have
higher affinity and away from those with which they have
lower affinity). Second, the composition and order of the
final groups are insensitive to the order in which items are
presented to the algorithm. Finally, it seeks to uncover and
display the association and interrelationships of the clus-
tered groups with one another.
In order to explore whether this algorithm could be
useful for clustering genes derived from microarray
experiments, we compared the performance of BEA-
PARTITION, hierarchical clustering algorithm, self-organiz-
ing map, and the k-means algorithm for clustering func-
tionally-related genes based on shared keywords, using
purity, entropy, and mutual information as metrics for
evaluating cluster quality.
2 METHODS
2.1 Keyword Extraction from Biomedical Literature
We used statistical methods to extract keywords from
MEDLINE citations, based on the work of [15]. This method
estimates the significance of words by comparing the
frequency of words in a given gene-related set (Test Set)
of abstracts with their frequency in a background set of
abstracts. We modified the original method by using
1) a different background set, 2) a different stemming
algorithm (Porter's stemmer), and 3) a customized stop list.
The details were reported by Liu et al. [20], [21].
For each gene analyzed, word frequencies were calcu-
lated from a group of abstracts retrieved by an SQL
(structured query language) search of MEDLINE for the
specific gene name, gene symbol, or any known aliases (see
LocusLink, ftp://ftp.ncbi.nih.gov/refseq/LocusLink/
LL_tmpl.gz for gene aliases) in the TITLE field. The resulting
set of abstracts (the Test Set) was processed to generate a
specific keyword list.
Test Sets of Genes. We compared BEA-PARTITION and
other clustering algorithms (k-means, hierarchical, and
SOM) on two test sets.
1. Twenty-six genes in four well-defined functional
groups consisting of 10 glutamate receptor subunits,
seven enzymes in catecholamine metabolism, five
cytoskeletal proteins, and four enzymes in tyrosine
and phenylalanine synthesis. The gene names and
aliases are listed in Table 1. This experiment was
performed to determine whether keyword associa-
tions can be used to group genes appropriately and
whether the four gene families or clusters that were
known a priori would also be predicted by a
clustering algorithm simply using the affinity metric
based on keywords.
2. Forty-four yeast genes involved in the cell cycle of
budding yeast (Saccharomyces cerevisiae) that had
altered expression patterns on spotted DNA
microarrays [6]. These genes were analyzed by
Cherepinsky et al. [4] to demonstrate their Shrink-
age algorithm for gene clustering. A master list of
member genes for each cluster was assembled
according to a combination of 1) common cell-cycle
functions and regulatory systems and 2) the
corresponding transcriptional activators for each
gene [4] (Table 2).
Keyword Assessment. Statistical formulae from [15] for
word frequencies were used without modification. These
calculations were repeated for all gene names in the test
set, a process that generated a database of keywords
associated with specific genes, the strength of the associa-
tion being reflected by a z-score. The z-score of word $w$ for gene $g$ is defined as:

$$Z^{w}_{g} = \frac{F^{w}_{g} - \bar{F}^{w}}{\sigma^{w}}, \qquad (1)$$

where $F^{w}_{g}$ equals the frequency of word $w$ in Test Set $g$ (i.e., in Test Set $g$, the number of abstracts in which word $w$ occurs divided by the total number of abstracts), and $\bar{F}^{w}$ and $\sigma^{w}$ are the average frequency and standard deviation, respectively, of word $w$ in the background set. Intuitively, the z-score compares the importance or discriminatory relevance of a keyword in the test set of abstracts with the background set that represents the expected occurrence of that word in the literature at large.
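To make this scoring step concrete, the short Python sketch below computes z-scores for the words of a hypothetical per-gene test set against precomputed background statistics, following (1). The choice of Python, the toy word frequencies, and the function names are our own illustrative assumptions, not the authors' implementation.

```python
# Sketch of the z-score computation in (1), under assumed toy data.
# background_mean / background_std hold the average frequency and standard
# deviation of each word across the background abstract set.
background_mean = {"receptor": 0.020, "kinase": 0.015, "cell": 0.300}
background_std = {"receptor": 0.010, "kinase": 0.008, "cell": 0.050}

def word_frequencies(abstracts):
    """Fraction of abstracts in the test set that contain each word."""
    counts = {}
    for abstract in abstracts:
        for word in set(abstract.lower().split()):
            counts[word] = counts.get(word, 0) + 1
    return {w: c / len(abstracts) for w, c in counts.items()}

def z_scores(abstracts):
    """z-score of each word for this gene's test set, as in (1)."""
    freqs = word_frequencies(abstracts)
    scores = {}
    for word, f in freqs.items():
        if word in background_mean and background_std[word] > 0:
            scores[word] = (f - background_mean[word]) / background_std[word]
    return scores

# Toy test set of abstracts for one gene (purely illustrative).
test_abstracts = ["glutamate receptor subunit in cell membrane",
                  "receptor binding and kinase signaling in cell"]
print(z_scores(test_abstracts))
```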
Keyword Selection for Gene Clustering. We used z-score
thresholds to select the keywords used for gene clustering.
Those keywords with z-scores less than the threshold were
discarded. The z-score thresholds we tested were 0, 5, 8, 10,
15, 20, 30, 50, and 100. The database generated by this
algorithm is represented as a sparse word (rows) × gene
(columns) matrix with cells containing z-scores. The matrix is
characterized as sparse because each gene has only a
fraction of all words associated with it. The output of the
keyword selection for all genes in each Test Set is represented
as a sparse keyword (rows) × gene (columns) matrix with
cells containing z-scores.
2.2 BEA-PARTITION: Detailed Working of the
Algorithm
The BEA-PARTITION takes a symmetric matrix as input,
permutes its rows and columns, and generates a sorted
matrix, which is then partitioned to form a clustered matrix.
Constructing the Symmetric Gene × Gene Matrix. The
sparse word × gene matrix, with the cells containing the
z-scores of each word-gene pair, was converted to a gene ×
gene matrix with the cells containing the sum of products of
z-scores for shared keywords. The z-score value was set to
zero if the value was less than the threshold. Larger values
reflect stronger and more extensive keyword associations
between gene-gene pairs. For each gene pair $G_i, G_j$ and
every word $w$ they share in the sparse word × gene matrix, the
$G_i \times G_j$ cell value $\mathrm{aff}(G_i, G_j)$ in the gene × gene matrix
represents the affinity of the two genes for each other and is
calculated as:

$$\mathrm{aff}(G_i, G_j) = \frac{\sum_{w} Z^{w}_{G_i}\, Z^{w}_{G_j}}{1{,}000}. \qquad (2)$$
Dividing the sum of the z-score product by 1,000 was
done to reduce the typically large numbers to a more
readable format in the output matrix.
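A minimal sketch of this conversion, assuming the thresholded word × gene z-score matrix is held as a NumPy array (the toy values and variable names below are ours, not the paper's):

```python
import numpy as np

# Rows are keywords, columns are genes; cells hold z-scores (toy values).
word_gene = np.array([[12.0,  0.0, 15.0],
                      [ 0.0, 20.0,  0.0],
                      [11.0, 14.0,  9.0]])

threshold = 10.0
z = np.where(word_gene >= threshold, word_gene, 0.0)   # zero out sub-threshold scores

# aff(Gi, Gj) = sum over shared keywords of z(w, Gi) * z(w, Gj), divided by 1,000 as in (2).
affinity = (z.T @ z) / 1000.0
print(affinity)
```

Because sub-threshold cells are zeroed, only keywords retained for both genes contribute to the matrix product, matching the shared-keyword sum in (2).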
Sorting the Matrix [19]. The sorted matrix is generated
as follows:
TABLE 1
Twenty-Six Genes Manually Clustered Based on Functional Similarity
TABLE 2
Forty-Four Yeast Genes Grouped by Transcriptional Activators and Cell Cycle Functions [4]
1. Initialization. Place and fix one of the columns of
symmetric matrix arbitrarily into the clustered
matrix.
2. Iteration. Pick one of the remaining n - i columns
(where i is the number of columns already in the
sorted matrix). Choose the placement in the sorted
matrix that maximizes the change in bond energy as
described below (3). Repeat this step until no more
columns remain.
3. Row ordering. Once the column ordering is deter-
mined, the placement of the rows should also be
changed correspondingly so that their relative
positions match the relative position of the columns.
This restores the symmetry to the sorted matrix.
To calculate the change in bond energy for each possible
placement of the next, $(i+1)$th, column, the bonds between
that column $k$ and each of the two newly adjacent columns
$i, j$ are added, and the bond that would be broken between
the latter two columns is subtracted. Thus, the bond
energy between these three columns $i$, $j$, and $k$ (represent-
ing gene $i$ = $G_i$; gene $j$ = $G_j$; gene $k$ = $G_k$) is calculated by
the following interaction contribution measure:

$$\mathrm{energy}(G_i, G_j, G_k) = 2\,\bigl[\,\mathrm{bond}(G_i, G_k) + \mathrm{bond}(G_k, G_j) - \mathrm{bond}(G_i, G_j)\,\bigr], \qquad (3)$$

where $\mathrm{bond}(G_i, G_j)$ is the bond energy between gene $G_i$ and gene $G_j$, with

$$\mathrm{bond}(G_i, G_j) = \sum_{l=1}^{N} \mathrm{aff}(G_l, G_i)\, \mathrm{aff}(G_l, G_j), \qquad (4)$$

$$\mathrm{aff}(G_0, G_i) = \mathrm{aff}(G_i, G_0) = \mathrm{aff}(G_{N+1}, G_i) = \mathrm{aff}(G_i, G_{N+1}) = 0. \qquad (5)$$
The last set of conditions (5) takes care of cases where a
gene is being placed in the sorted matrix to the left of the
leftmost gene or to the right of the rightmost gene during
column permutations, and prior to the topmost row and
following the last row during row permutations.
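The following Python sketch gives one simplified reading of this column-ordering procedure using (3)-(5) on a small illustrative affinity matrix; the greedy insertion order, the helper names, and the toy data are our assumptions, not the authors' code.

```python
import numpy as np

def bond(aff, x, y):
    """bond(Gx, Gy) as in (4): the sum over all genes l of aff(Gl, Gx) * aff(Gl, Gy).
    An index of -1 stands for the artificial empty columns of (5), whose bonds are zero."""
    if x == -1 or y == -1:
        return 0.0
    return float(np.dot(aff[:, x], aff[:, y]))

def bea_order(aff):
    """Greedy BEA column ordering: place each remaining column at the position
    that maximizes the interaction contribution measure (3)."""
    n = aff.shape[0]
    order = [0]                              # step 1: fix an arbitrary first column
    for col in range(1, n):                  # step 2: place the remaining columns
        best_pos, best_gain = 0, -np.inf
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else -1
            right = order[pos] if pos < len(order) else -1
            gain = 2 * (bond(aff, left, col) + bond(aff, col, right)
                        - bond(aff, left, right))
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, col)
    return order                             # step 3: rows are then reordered the same way

aff = np.array([[45.0,  2.0, 30.0],
                [ 2.0, 50.0,  1.0],
                [30.0,  1.0, 40.0]])
order = bea_order(aff)
print(order)
print(aff[np.ix_(order, order)])             # the sorted (clustered affinity) matrix
```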
Partitioning the Sorted Matrix. The original BEA
algorithm [16] did not propose how to partition the sorted
matrix. The partitioning heuristic was added by Navathe
et al. [17] for the problems in the distributed database
design. These heuristics were constructed using the goals of
design: to minimize access time and storage costs. We do
not have the luxury of such a clear cut objective function in
our case. Hence, to partition the sorted matrix into
submatrices, each representing a gene cluster, we experi-
mented with different heuristics and, finally, derived a
heuristic that identifies the boundaries between clusters by
sequentially finding the maximum sum of the quotients for
corresponding cells in adjacent columns across the matrix.
With each successive split, only those rows corresponding
to the remaining columns were processed, i.e., only the
remaining symmetrical portion of the submatrix was used
for further iterations of the splitting algorithm. The number
of clusters into which the gene affinity matrix was
partitioned was determined by AUTOCLASS (described
below), however, other heuristics might be useful for this
determination. The boundary metric $B$ for columns $G_i$
and $G_j$, used to place a new splitting point between
existing columns $i$ and $j$, was defined as:

$$B(G_i, G_j) = \max_{p_1 < s \le p} \; \sum_{k=p_1+1}^{p} \frac{\max\bigl(\mathrm{aff}(k, s),\ \mathrm{aff}(k, s-1)\bigr)}{\min\bigl(\mathrm{aff}(k, s),\ \mathrm{aff}(k, s-1)\bigr)}, \qquad (6)$$

where $s$ is the new splitting point (for simplicity, we use the
number of the leftmost column in the new submatrix that is
to the right of the splitting point), which splits the
submatrix defined between the two previous splitting points $p$
and $p_1$ (which do not necessarily represent contiguous
columns). To partition the entire sorted matrix, the
following initial conditions are set: $p = N$, $p_1 = 0$.
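A rough Python sketch of one way to apply this splitting rule to a sorted affinity matrix follows; the recursive splitting strategy, the handling of zero-valued cells, and the helper names are simplifications we introduce for illustration rather than details taken from the paper.

```python
import numpy as np

def best_split(aff, lo, hi):
    """Best split point s (lo < s < hi) for the block covering columns lo..hi-1,
    scored as in (6) by the sum over the block's rows of max/min quotients
    of the cells in columns s and s-1."""
    best_s, best_score = None, -np.inf
    for s in range(lo + 1, hi):
        a, b = aff[lo:hi, s], aff[lo:hi, s - 1]
        hi_vals, lo_vals = np.maximum(a, b), np.minimum(a, b)
        score = np.where(lo_vals > 0, hi_vals / np.maximum(lo_vals, 1e-9), 0.0).sum()
        if score > best_score:
            best_s, best_score = s, score
    return best_s, best_score

def partition(aff, n_clusters):
    """Split the sorted matrix into n_clusters blocks; returns the boundary indices."""
    splits = [0, aff.shape[0]]                # boundaries of the whole matrix
    while len(splits) - 1 < n_clusters:
        candidates = []
        for lo, hi in zip(splits[:-1], splits[1:]):
            if hi - lo > 1:                   # this block can still be split
                s, score = best_split(aff, lo, hi)
                candidates.append((score, s))
        splits.append(max(candidates)[1])
        splits.sort()
    return splits

sorted_aff = np.array([[45.0, 30.0,  2.0,  1.0],
                       [30.0, 40.0,  1.0,  2.0],
                       [ 2.0,  1.0, 50.0, 20.0],
                       [ 1.0,  2.0, 20.0, 35.0]])
print(partition(sorted_aff, 2))   # boundary found between the two 2x2 blocks
```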
2.3 k-Means Algorithm and Hierarchical Clustering
Algorithm
k-means and hierarchical clustering analysis were performed
using Cluster/Treeview programs available online (http://
bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
software.htm).
2.4 Self-Organizing Map
Self-organizing map was performed using GeneClus-
ter 2.0 (http://www.broad.mit.edu/cancer/software/
software.html).
The Euclidean distance measure was used when the gene ×
keyword matrix was the input. When the gene × gene matrix was
used as input, the gene similarity was calculated by (2).
2.5 Number of Clusters
In order to apply BEA-PARTITION and k-means cluster-
ing algorithms, the investigator needs to have a priori
knowledge about the number of clusters in the test set.
We determined the number of clusters by applying
AUTOCLASS, an unsupervised Bayesian classification
system developed by [22]. AUTOCLASS, which seeks a
maximum posterior probability classification, determines
the optimal number of classes in large data sets. Among
a variety of applications, AUTOCLASS has been used
for the discovery of new classes of infra-red stars in the
IRAS Low Resolution Spectral catalogue, new classes of
airports in a database of all US airports, and discovery
of classes of proteins, introns and other patterns in
DNA/protein sequence data [22]. We applied an open
source implementation of AUTOCLASS (http://
ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/
autoclass-c-program.html). The resulting number of
clusters was then used as the endpoint for the
partitioning step of the BEA-PARTITION algorithm. To
determine whether AUTOCLASS could discover the
number of clusters in the test sets correctly, we also
tested numbers of clusters other than the ones
AUTOCLASS predicted.
2.6 Evaluating the Clustering Results
To evaluate the quality of our resultant clusters, we used
the established metrics of Purity, Entropy, and Mutual
Information, which are briefly described below [23]. Let us
assume that we have C classes (i.e., C expert clusters, as
shown in Tables 1 and 2), while our clustering algorithms
produce $K$ clusters, $\lambda_1, \lambda_2, \ldots, \lambda_K$.
Purity. Purity can be interpreted as classification
accuracy under the assumption that all objects of a cluster
are classified to be members of the dominant class for that
cluster. If the majority of genes in cluster A are in class X,
then class X is the dominant class. Purity is defined as the
ratio between the number of items in cluster $\lambda_i$ from the
dominant class and the size of cluster $\lambda_i$, that is:

$$P(\lambda_i) = \frac{1}{n_i} \max_{j} n_i^{j}, \qquad i = 1, 2, \ldots, K, \qquad (7)$$

where $n_i = |\lambda_i|$, that is, the size of cluster $i$, and $n_i^{j}$ is the
number of genes in $\lambda_i$ that belong to class $j$, $j = 1, 2, \ldots, C$.
The closer to 1 the purity value is, the more similar this
cluster is to its dominant class. Purity is measured for each
cluster and the average purity of each test gene set cluster
result was calculated.
Entropy. Entropy denotes how uniform the cluster is. If a
cluster is composed of genes coming from different classes,
then the value of entropy will be close to 1. If a cluster only
contains one class, the value of entropy will be close to 0.
The ideal value for entropy would be zero. Lower values of
entropy would indicate better clustering. Entropy is also
measured for each cluster and is defined as:

$$E(\lambda_i) = -\frac{1}{\log C} \sum_{j=1}^{C} \frac{n_i^{j}}{n_i} \log \frac{n_i^{j}}{n_i}. \qquad (8)$$
The average entropy of each test gene set cluster result was
also calculated.
Mutual Information. One problem with purity and
entropy is that they are inherently biased to favor small
clusters. For example, if we had one object for each cluster,
then the value of purity would be 1 and entropy would be
zero, no matter what the distribution of objects in the expert
classes is.
Mutual information is a symmetric measure for the
degree of dependency between clusters and classes. Unlike
correlation, mutual information also takes higher order
dependencies into account. We use mutual information
because it captures how related clusters are to classes
without bias towards small clusters. Mutual information is
a measure of the discordance between the algorithm-
derived clusters and the actual clusters. It is the measure
of how much information the algorithm-derived clusters
can tell us to infer the actual clusters. Random clustering
has mutual information of 0 in the limit. Higher mutual
information indicates higher similarity between the algo-
rithm-derived clusters and the actual clusters. Mutual
information is defined as:

$$M = \frac{2}{N} \sum_{i=1}^{K} \sum_{j=1}^{C} n_i^{j}\, \frac{\log\!\left( \dfrac{n_i^{j}\, N}{\sum_{t=1}^{K} n_t^{j} \; \sum_{t=1}^{C} n_i^{t}} \right)}{\log(K \cdot C)}, \qquad (9)$$
where N is the total number of genes being clustered and K
is the number of clusters the algorithm produced, and C is
the number of expert classes.
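For concreteness, the Python sketch below evaluates (7), (8), and (9) on toy cluster and class labels; the labels and the helper names are invented for illustration only.

```python
import numpy as np

def contingency(cluster_labels, class_labels, K, C):
    """n[i, j] = number of genes placed in cluster i that belong to expert class j."""
    n = np.zeros((K, C))
    for i, j in zip(cluster_labels, class_labels):
        n[i, j] += 1
    return n

def cluster_quality(cluster_labels, class_labels, K, C):
    n = contingency(cluster_labels, class_labels, K, C)
    N = n.sum()
    sizes = n.sum(axis=1)                                   # n_i, the cluster sizes
    purity = n.max(axis=1) / sizes                          # (7), one value per cluster
    p = n / sizes[:, None]
    with np.errstate(divide="ignore"):
        logp = np.where(p > 0, np.log(p), 0.0)
    entropy = -(p * logp).sum(axis=1) / np.log(C)           # (8), one value per cluster
    col = n.sum(axis=0)                                     # class totals
    mi = 0.0
    for i in range(K):                                      # (9), logarithm of base K*C
        for j in range(C):
            if n[i, j] > 0:
                mi += n[i, j] * np.log(n[i, j] * N / (col[j] * sizes[i])) / np.log(K * C)
    mi *= 2.0 / N
    return purity.mean(), entropy.mean(), mi

# Toy example: six genes, two algorithm-derived clusters, two expert classes.
clusters = [0, 0, 0, 1, 1, 1]
classes  = [0, 0, 1, 1, 1, 0]
print(cluster_quality(clusters, classes, K=2, C=2))
```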
2.7 Top-Scoring Keywords Shared among Members
of a Gene Cluster
Keywords were ranked according to their highest shared z-
scores in each cluster. The keyword sharing strength metric
($S_w$) is defined as the sum of z-scores for a shared keyword
$w$ within the cluster, multiplied by the number of genes $M$
within the cluster with which the word is associated; in this
calculation, z-scores less than a user-selected threshold are
set to zero and are not counted:

$$S_w = \sum_{g=1}^{M} z_g^{w} \;\times\; \sum_{g=1}^{M} \mathrm{Count}\!\left(z_g^{w}\right). \qquad (10)$$
Thus, larger values reflect stronger and more extensive
keyword associations within a cluster. We identified the
30 highest scoring keywords for each of the four clusters and
provided these four lists to approximately 20 students,
postdoctoral fellows, and faculty, asking them to guess a
major function of the underlying genes that gave rise to the
four keyword lists.
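As a small illustration of this ranking, the Python sketch below scores keywords within one hypothetical cluster according to (10); the gene names, z-scores, and threshold are invented.

```python
# Rank shared keywords within a cluster by (10): (sum of thresholded z-scores)
# multiplied by (number of cluster genes whose z-score passes the threshold).
cluster_zscores = {                       # keyword -> {gene: z-score}, toy values
    "receptor":  {"GluR1": 85.0, "GluR2": 90.0, "KA2": 40.0},
    "synapse":   {"GluR1": 25.0, "GluR2": 5.0},
    "dopamine":  {"KA2": 12.0},
}
threshold = 10.0

def sharing_strength(scores):
    kept = [z for z in scores.values() if z >= threshold]   # sub-threshold scores dropped
    return sum(kept) * len(kept)

ranked = sorted(cluster_zscores,
                key=lambda w: sharing_strength(cluster_zscores[w]), reverse=True)
for w in ranked:
    print(w, sharing_strength(cluster_zscores[w]))
```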
3 RESULTS
3.1 Keywords and Keyword × Gene Matrix
Generation
A list of keywords was generated for each gene to build the
keyword × gene matrix. Keywords were sorted according
to their z-scores. The keyword selection experiment (see
below) showed that a z-score threshold of 10 generally
produced better results, which suggests that keywords with
z-scores lower than 10 have less information content (e.g.,
"cell," "express"). The relative values of z-scores depended
on the size of the background set (data not shown). Since we
used 5.6 million abstracts as the background set, the
z-scores of most of the informative keywords were well
above 10 (based on smaller values of standard deviation in
the definition of z-score). The keyword × gene matrices
were used as inputs to k-means, the hierarchical clustering
algorithm, and the self-organizing map, whereas, as required
by the BEA approach, they were first converted to a gene ×
gene matrix based on common shared keywords and these
gene × gene matrices were used as inputs to BEA-PARTITION.
An overview of the gene clustering by shared keyword
process is provided in Fig. 1.
3.2 Effect of Keyword Selection on Gene Clustering
The effect of using different z-score thresholds for keyword
selection on the quality of resulting clusters is shown in
Figs. 2A1 and 2B1. For both test sets, BEA-PARTITION
produced clusters with higher mutual information when z-
score thresholds were within a range of 10 to 20. For the 44-
gene set, k-means produced clusters with the highest
mutual information when the z-score threshold was 8,
while, for the 26-gene set, mutual information was highest
when z-score threshold was 15. For the remaining studies,
we chose to use a z-score threshold of 10 to keep as many
functional keywords as possible.
3.3 Number of Clusters
We then used AUTOCLASS to decide the number of
clusters in the test sets. AUTOCLASS took the keyword ×
gene matrix as input and predicted that there were five
clusters in the set of 26 genes and nine clusters in the set of
44 yeast genes. The effect of the numbers of clusters on the
algorithm performance was shown in Figs. 2A2 and 2B2.
BEA-PARTITION again produced a better result regardless
of the number of clusters used. BEA-PARTITION had the
highest mutual information when the numbers of clusters
were five (26-gene set) and nine (44-gene set), whereas
k-means worked marginally better when the numbers of
clusters were 8 (26-gene set) and 10 (44-gene set). Based on
these results we chose to use five and nine clusters,
respectively, because the probabilities were higher than
the other choices.
3.4 Clustering of the 26-Gene Set by Keyword
Associations
To determine whether keyword associations could be used to
group genes appropriately, we clustered the 26-gene set with
BEA-PARTITION, k-means, the hierarchical algorithm,
SOM, and AUTOCLASS. Keyword lists were generated for
each of these 26 genes, which belonged to one of four well-
defined functional groups (Table 1). The resulting word ×
gene matrix had 26 columns (genes) and approximately
8,540 rows (words with z-scores ≥ 10 appearing in any of
the query sets). The BEA-PARTITION, with z-score threshold
= 10, correctly assigned 25 of 26 genes to the appropriate
cluster based on the strength of keyword associations (Fig. 3).
Tyrosine transaminase was the only outlier. As expected from
the BEA-PARTITION, cells inside clusters tended to have
much higher values than those outside. The hierarchical cluster-
ing algorithm, with the gene × keyword matrix as the input,
generated a similar result to BEA-PARTITION (five clusters,
and TT was the outlier) (Fig. 4a). The results, with the gene × gene
matrix as the input, are shown in tables in the supplemen-
tary materials, which can be found at www.computer.org/
publications/dlib.
While BEA-PARTITION and hierarchical clustering
algorithm produced clusters very similar to the original
functional classes, those produced by k-means (Table 4),
self-organizing map (Table 5), and AUTOCLASS (Table 6),
with the gene × keyword matrix as input, were heterogeneous
and, thus, more difficult to explain.
Fig. 1. Procedure for clustering genes by the strength of their associated keywords.
Fig. 2. Effect of keyword selection by z-score thresholds (A1 and B1)
and different number of clusters (A2 and B2) on the cluster quality. Z-
score thresholds were used to select the keywords for gene clustering.
Those keywords with z-scores less than the threshold were discarded.
To determine the effect of keyword selection by z-score thresholds on
cluster quality, we tested z-score thresholds 0, 5, 8, 10, 15, 20, 30, 50,
and 100. To determine whether AUTOCLASS could be used to discover
the number of clusters in the test sets correctly, we tested a different
number of clusters other than the ones AUTOCLASS predicted (four for
the 26-gene set and nine for the 44-gene set).
The average purity, average entropy, and mutual information of the BEA-
PARTITION and hierarchical algorithm result were 1, 0,
and 0.88, while those of the k-means result were 0.53, 0.65, and
0.28, respectively, those of SOM result were 0.76, 0.35, and
0.18, respectively, and those of AUTOCLASS result were
0.82, 0.28, and 0.56 (Table 3) (gene × keyword matrix as
input). When the gene × gene matrix was used as input to the
hierarchical algorithm, k-means, and SOM, the results were
even worse as measured by purity, entropy, and mutual
information (Table 3).
3.5 Yeast Microarray Gene Clustering by Keyword
Association
To determine whether our text mining/gene clustering
approach could be used to group genes identified in
microarray experiments, we clustered 44 yeast genes taken
from Eisen et al. [6] via Cherepinsky et al. [4], again using
BEA-PARTITION, the hierarchical algorithm, SOM, AUTO-
CLASS, and k-means. Keyword lists were generated for each
of the 44 yeast genes (Table 2), and a 3,882 (words appearing
in the query sets with z-scores ≥ 10) × 44 (genes) matrix was
created. The clusters produced by the BEA-PARTITION,
k-means, SOM, and AUTOCLASS are shown in Tables 7, 8, 9,
and 10, respectively, whereas those produced by the
hierarchical algorithm are shown in Fig. 4b. The average
purity, average entropy, and mutual information of the BEA-
PARTITION result were 0.74, 0.24, and 0.60, whereas those of
the hierarchical algorithm, SOM, k-means, and AUTOCLASS
results (gene × keyword matrix as input) were 0.86, 0.12, and
0.58; 0.60, 0.37, and 0.46; 0.61, 0.33, and 0.39; and 0.57, 0.39, and
0.49, respectively (Table 3).
3.6 Keywords Indicative of Major Shared Functions
with a Gene Cluster
Keywords shared among genes (26-gene set) within each
cluster were ranked according to a metric based on both the
degree of significance (the sum of z-scores for each keyword)
and the breadth of distribution (the sum of the number of
genes within the cluster for which the keyword has a z-score
greater than a selected threshold). This double-pronged
metric obviated the difficulty encountered with keywords
that had extremely high z-scores for single genes within the
cluster but modest z-scores for the remainder. The 30 highest
scoring keywords for each of the four clusters were tabulated
(Table 11). The respective keyword lists appeared to be highly
informative about the general function of the original,
preselected clusters when shown to medical students,
faculty, and postdoctoral fellows.
4 DISCUSSION
In this paper, we clustered the genes by shared functional
keywords. Our gene clustering strategy is similar to the
document clustering in information retrieval. Document
clustering, defined as grouping documents into clusters
according to their topics or main contents in an unsuper-
vised manner, organizes large amounts of information into
a small number of meaningful clusters and improves the
information retrieval performance either via cluster-driven
dimensionality reduction, term-weighting, or query expan-
sion [9], [24], [25], [26], [27].
Term vector-based document clustering has been widely
studied in information retrieval [9], [24], [25], [26], [27].
Fig. 3. Gene clusters by keyword associations using BEA-PARTITION. Keywords with z-scores ≥ 10 were extracted from MEDLINE abstracts for
26 genes in four functional classes. The resulting word × gene sparse matrix was converted to a gene × gene matrix. The cell values are the sum of
z-score products for all keywords shared by the gene pair. This value is divided by 1,000 for purposes of display. A modified bond energy algorithm
[16], [17] was used to group genes into five clusters based on the strength of keyword associations, and the resulting gene clusters are boxed.
A number of clustering algorithms have been proposed and
many of them have been applied to bioinformatics research.
In this report, we introduced a new algorithm for clustering
genes, BEA-PARTITION. Our results showed that BEA-
PARTITION, in conjunction with the heuristic developed
for partitioning the sorted matrix, outperforms the k-means
algorithm and SOM in two test sets. In the first set of genes
(26-gene set), BEA-PARTITION, as well as hierarchical
algorithm, correctly assigned 25 of 26 genes in a test set of
four known gene groups with one outlier, whereas k-means
and SOM mixed the genes into five more evenly sized but
less well functionally defined groups. In the 44-gene set, the
result generated by BEA-PARTITION had the highest
mutual information, indicating that BEA-PARTITION out-
performed all the other four clustering algorithms.
4.1 BEA-PARTITION versus k-Means
In this study, the z-score thresholds were used for keyword
selection. When the threshold was 0, all words, including
noise (noninformative words and misspelled words), were
used to cluster genes. Under the tested conditions, clusters
produced by BEA-PARTITION had higher quality than
those produced by k-means. BEA-PARTITION clusters
genes based on their shared keywords. It is unlikely that
genes within the same cluster shared the same noisy words
with high z-scores, indicating that BEA-PARTITION is less
sensitive to noise than k-means. In fact, BEA-PARTITION
performed better than k-means in the two test gene sets
under almost all test conditions (Fig. 2). BEA-PARTITION
performed best when z-score thresholds were 10, 15, and 20,
which indicated 1) that the words with z-score less than 10
were less informative and 2) few words with z-scores
between 10 and 20 were shared by at least two genes and
did not improve the cluster quality. When z-score thresh-
olds were high (≥ 30 in the 26-gene set and ≥ 20 in the
44-gene set), more informative words were discarded, and
as a result, the cluster quality was degraded.
Fig. 4. Gene clusters by keyword associations using the hierarchical clustering algorithm. Keywords with z-scores ≥ 10 were extracted from MEDLINE
abstracts for (a) 26 genes in four functional classes and (b) 44 genes in nine classes. The resulting word × gene sparse matrix was used as input to
the hierarchical algorithm.
BEA-PARTITION is designed to group cells with larger
values together, and the ones with smaller values together.
The final order of the genes within the cluster reflected
deeper interrelationships. Among the 10 glutamate receptor
genes examined, GluR1, GluR2, and GluR4 are AMPA
receptors, while GluR6, KA1, and KA2 are kainate receptors.
The observation that BEA-PARTITION placed gene GluR6
and gene KA2 next to each other, confirms that the literature
associations between GluR6 and KA2 are higher than those
between GluR6 and AMPA receptors. Furthermore, the
association and interrelationships of the clustered groups
with one another can be seen in the final clustering matrix.
For example, TT was an outlier in Fig. 3, however, it still
had higher affinity to PD1 (affinity = 202) and PD2 (affinity
= 139) than to any other genes. Thus, TT appears to be
strongly related to genes in the tyrosine and phenylalanine
synthesis cluster, from which it originated.
BEA-PARTITION has several advantages over the
k-means algorithm: 1) while k-means generally produces a
locally optimal clustering [2], BEA-PARTITION produces
the globally optimal clustering by permuting the columns
and rows of the symmetric matrix; 2) the k-means algorithm
is sensitive to initial seed selection and noise [9].
TABLE 3
The Quality of the Gene Clusters Derived by Different Clustering Algorithms, Measured by Purity, Entropy, and Mutual Information
TABLE 4
Twenty-Six Gene Set k-Means Result (Gene × Keyword Matrix as Input)
4.2 BEA-PARTITION versus Hierarchical Algorithm
The hierarchical clustering algorithm, as well as k-means and
Self-Organizing Maps, have been widely used in microarray
expression profile analysis. Hierarchical clustering orga-
nizes expression data into a binary tree without providing
clear indication of how the hierarchy should be clustered. In
practice, investigators define clusters by a manual scan of
the genes in each node and rely on their biological expertise
to notice shared functional properties of genes. Therefore,
the definition of the clusters is subjective, and as a result,
different investigators may interpret the same clustering
result differently. Some have proposed automatically
defining boundaries based on statistical properties of the
gene expression profiles; however, the same statistical
criteria may not be generally applicable to identify all
relevant biological functions [10]. We believe that an
algorithm that produces clusters with clear boundaries
can provide more objective results and possibly new
discoveries, which are beyond the expert's knowledge. In
this report, our results showed that BEA-PARTITION can
have similar performance as a hierarchical algorithm, and
provide distinct cluster boundaries.
4.3 k-Means versus SOM
The k-means algorithm and SOM can group objects into
different clusters and provide clear boundaries.
TABLE 5
Twenty-Six Gene SOM Result (Gene × Keyword Matrix as Input)
TABLE 6
Twenty-Six Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)
Despite its simplicity and efficiency, the SOM algorithm has several
weaknesses that make its theoretical analysis difficult and
limit its practical usefulness. Various studies have sug-
gested that it is hard to find any criteria under which the
SOM algorithm performs better than the traditional
techniques, such as k-means [11]. Balakrishnan et al. [28]
compared the SOM algorithm with k-means clustering on
108 multivariate normal clustering problems. The results
showed that the SOM algorithm performed significantly
worse than the k-means clustering algorithm. Our results
also showed that k-means performed better than SOM by
generating clusters with higher mutual information.
4.4 Computing Time
The computing time of BEA-PARTITION, like that of the
hierarchical algorithm and SOM, is on the order of $N^2$, which
means that it grows proportionally to the square of the
number of genes and is commonly denoted as $O(N^2)$; that of
k-means is on the order of $N \cdot K \cdot T$ ($O(NKT)$), where $N$ is the
number of genes tested, $K$ is the number of clusters, and $T$ is
the number of improvement steps (iterations) performed by
k-means. In our study, the number of improvement steps was
1,000. Therefore, when the number of genes tested is about
1,000, BEA-PARTITION runs roughly $(aK + b)$ times faster than
k-means, where $a$ and $b$ are constants. As long as the number
of genes to be clustered is less than the product of the number
of clusters and the number of iterations, BEA-PARTITION
will run faster than k-means.
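As a back-of-the-envelope check of this comparison (with the stated T = 1,000 and an assumed K, not values reported in the paper):

```python
# Rough operation-count comparison for N genes: O(N^2) for BEA-PARTITION versus
# O(N*K*T) for k-means with K clusters and T iterations.
N, K, T = 1000, 9, 1000      # K = 9 is an assumed example value
bea_ops = N ** 2             # 1,000,000
kmeans_ops = N * K * T       # 9,000,000 with these assumed values
print(kmeans_ops / bea_ops)  # k-means does roughly K times more work when N == T
```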
TABLE 7
Forty-Four Yeast Genes BEA-PARTITION Result (Gene × Keyword Matrix as Input)
TABLE 8
Forty-Four Yeast Gene SOM Result (Gene × Keyword Matrix as Input)
4.5 Number of Clusters
One disadvantage of BEA-PARTITION and k-means com-
pared to hierarchical clustering is that the investigator needs
to have a priori knowledge about the number of clusters in the
test set, which may not be known. We approached this
problem by using AUTOCLASS to predict the number of
clusters in the test sets. BEA-PARTITION performed best
when it grouped the genes into five clusters (26-gene set) and
nine clusters (44-gene set), which were predicted by AUTO-
CLASS with higher probabilities. Therefore, AUTOCLASS appears to be an effective tool to assist the BEA-PARTITION in gene clustering.
TABLE 9
Forty-Four Yeast Gene k-Means Result (Gene × Keyword Matrix as Input)
TABLE 10
Forty-Four Yeast Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)
5 CONCLUSIONS AND FUTURE WORK
There are several aspects of the BEA approach that we are
currently exploring with more detailed studies. For example,
although the BEA-PARTITION described here performs
relatively well on small sets of genes, the larger gene lists
expected from microarray experiments need to be tested.
Furthermore, we derived a heuristic to partition the clustered
affinity matrix into clusters. We anticipate that this heuristic,
which is simply based on the sum of ratios of corresponding
values from adjacent columns, will generally work regardless
of the type of items being clustered. Generally, optimizing the
heuristic to partition a sorted matrix after BEA-based
clustering will be valuable. Finally, we are developing a
Web-based tool that will include a text mining phase to
identify functional keywords, and a gene clustering phase to
cluster the genes based on the shared functional keywords.
We believe that this tool should be useful for discovering
novel relationships among sets of genes because it links genes
by shared functional keywords rather than just reporting
known interactions based on published reports. Thus, genes
that never co-occur in the same publication could still be
linked by their shared keywords.
The BEA approach has been applied successfully to other
disciplines, such as operations research, production en-
gineering, and marketing [18]. The BEA-PARTITION
algorithm represents our extension to the BEA approach
specifically for dealing with the problem of discovering
functional similarity among genes based on functional
keywords extracted from literature. We believe that this
important clustering technique, which was originally
proposed by [16] to cluster questions on psychological
instruments and later introduced by [17] for clustering of
data items in database design, has promise for application
to other bioinformatics problems where starting matrices
are available from experimental observations.
ACKNOWLEDGMENTS
This work was supported by NINDS (RD) and the Emory-
Georgia Tech Research Consortium. The authors would
like to thank Brian Revennaugh and Alex Pivoshenk for
research support.
REFERENCES
[1] C. Blaschke, J.C. Oliveros, and A. Valencia, Mining Functional
Information Associated with Expression Arrays, Functional &
Integrative Genomics, vol. 1, pp. 256-268, 2001.
[2] Y. Xu, V. Olman, and D. Xu, EXCAVATOR: A Computer
Program for Efficiently Mining Gene Expression Data, Nucleic
Acids Research, vol. 31, pp. 5582-5589, 2003.
[3] D. Chaussabel and A. Sher, Mining Microarray Expression Data
by Literature Profiling, Genome Biology, vol. 3, pp. 1-16, 2002.
[4] V. Cherepinsky, J. Feng, M. Rejali, and B. Mishra, Shrinkage-
Based Similarity Metric for Cluster Analysis of Microarray Data,
Proc. Natl Academy of Sciences USA, vol. 100, pp. 9668-9673, 2003.
[5] J. Quackenbush, Computational Analysis of Microarray Data,
Nature Rev. Genetics, vol. 2, pp. 418-427, 2001.
TABLE 11
Top Ranking Keywords Associated with Each Gene Cluster
[6] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster
Analysis and Display of Genome-Wide Expression Patterns, Proc.
Natl Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.
[7] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, and J.
O'Brien, Large-Scale Clustering of cDNA-Fingerprinting Data,
Genome Research, vol. 9, pp. 1093-1105, 1999.
[8] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E.
Dmitrovsky, E.S. Lander, and T.R. Golub, Interpreting Patterns of
Gene Expression with Self-Organizing Maps: Methods and
Application to Hematopoietic Differentiation, Proc. Natl Academy
of Sciences USA, vol. 96, pp. 2907-2912, 1999.
[9] A.K. Jain, M.N. Murty, and P.J. Flynn, Data Clustering: A
Review, ACM Computing Surveys, vol. 31, pp. 264-323, 1999.
[10] S. Raychaudhuri, J.T. Chang, F. Imam, and R.B. Altman, The
Computational Analysis of Scientific Literature to Define and
Recognize Gene Expression Clusters, Nucleic Acids Research,
vol. 15, pp. 4553-4560, 2003.
[11] B. Kégl, Principal Curves: Learning, Design, and Applications,
PhD dissertation, Dept. of Computer Science, Concordia Univ.,
Montreal, Quebec, 2002.
[12] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, A
Literature Network of Human Genes for High-Throughput
Analysis of Gene Expression, Natl Genetics, vol. 178, pp. 139-
143, 2001.
[13] D.R. Masys, J.B. Welsh, J.L. Fink, M. Gribskov, I. Klacansky, and J.
Corbeil, Use of Keyword Hierarchies to Interpret Gene
Expression Patterns, Bioinformatics, vol. 17, pp. 319-326, 2001.
[14] S. Raychaudhuri, H. Schutze, and R.B. Altman, Using Text
Analysis to Identify Functionally Coherent Gene Groups, Genome
Research, vol. 12, pp. 1582-1590, 2002.
[15] M. Andrade and A. Valencia, Automatic Extraction of Keywords
from Scientific Text: Application to the Knowledge Domain of
Protein Families, Bioinformatics, vol. 14, pp. 600-607, 1998.
[16] W.T. McCormick, P.J. Schweitzer, and T.W. White, Problem
Decomposition and Data Reorganization by a Clustering Techni-
que, Operations Research, vol. 20, pp. 993-1009, 1972.
[17] S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, Vertical
Partitioning Algorithms for Database Design, ACM Trans.
Database Systems, vol. 9, pp. 680-710, 1984.
[18] P. Arabie and L.J. Hubert, The Bond Energy Algorithm
Revisited, IEEE Trans. Systems, Man, and Cybernetics, vol. 20,
pp. 268-274, 1990.
[19] A.T. Ozsu and P. Valduriez, Principles of Distributed Database
Systems, second ed. Prentice Hall Inc., 1999.
[20] Y. Liu, M. Brandon, S. Navathe, R. Dingledine, and B.J. Ciliax,
Text Mining Functional Keywords Associated with Genes, Proc.
Medinfo 2004, pp. 292-296, Sept. 2004.
[21] Y. Liu, B.J. Ciliax, K. Borges, V. Dasigi, A. Ram, S. Navathe, and R.
Dingledine, Comparison of Two Schemes for Automatic Key-
word Extraction from MEDLINE for Functional Gene Clustering,
Proc. IEEE Computational Systems Bioinformatics Conf. (CSB 2004),
pp. 394-404, Aug. 2004.
[22] P. Cheeseman and J. Stutz, Bayesian Classification (Autoclass):
Theory and Results, Advances in Knowledge Discovery and Data
Mining, pp. 153-180, AAAI/MIT Press, 1996.
[23] A. Strehl, Relationship-Based Clustering and Cluster Ensembles
for High-Dimensional Data Mining, PhD dissertation, Dept. of
Electric and Computer Eng., The University of Texas at Austin,
2002.
[24] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval.
New York: Addison Wesley Longman, 1999.
[25] F. Sebastiani, Machine Learning in Automated Text Categoriza-
tion, ACM Computing Surveys, vol. 34, pp. 1-47, 1999.
[26] P. Willett, Recent Trends in Hierarchic Document Clustering: A
Critical Review, Information Processing and Management, vol. 24,
pp. 577-597, 1988.
[27] J. Aslam, A. Leblanc, and C. Stein, Clustering Data without Prior
Knowledge, Proc. Algorithm Eng.: Fourth Intl Workshop, 1982.
[28] P.V. Balakrishnan, M.C. Cooper, V.S. Jacob, and P.A. Lewis, A
Study of the Classification Capabilities of Neural Networks Using
Unsupervised Learning: A Comparison with K-Means Cluster-
ing, Psychometrika, vol. 59, pp. 509-525, 1994.
Ying Liu received the BS degree in environ-
mental biology from Nanjing University, China.
He received Master's degrees in bioinformatics
and computer science from Georgia Institute of
Technology in 2002. He is a PhD candidate in
the College of Computing, Georgia Institute of
Technology, where he works on text mining
biomedical literature to discover gene-to-gene
relationships. His research interests include
bioinformatics, computational biology, data
mining, text mining, and database systems. He is a student member of
IEEE Computer Society.
Shamkant B. Navathe received the PhD degree
from the University of Michigan in 1976. He is a
professor in the College of Computing, Georgia
Institute of Technology. He has published more
than 130 refereed papers in database research;
his important contributions are in database
modeling, database conversion, database de-
sign, conceptual clustering, distributed database
allocation, data mining, and database integra-
tion. Current projects include text mining of
medical literature databases, creation of databases for biological
applications, transaction models in P2P and Web applications, and
data mining for better understanding of genomic/proteomic and medical
data. His recent work has been focusing on issues of mobility,
scalability, interoperability, and personalization of databases in scien-
tific, engineering, and e-commerce applications. He is an author of the
book, Fundamentals of Database Systems, with R. Elmasri (Addison
Wesley, fourth edition, 2004), which is currently the leading database
textbook worldwide. He also coauthored the book Conceptual Design:
An Entity Relationship Approach (Addison Wesley, 1992) with Carlo
Batini and Stefano Ceri. He was the general cochairman of the 1996
International VLDB (Very Large Data Base) Conference in Bombay,
India. He was also program cochair of ACM SIGMOD 1985 at Austin,
Texas. He is also on the editorial boards of Data and Knowledge
Engineering (North Holland), Information Systems (Pergamon Press),
Distributed and Parallel Databases (Kluwer Academic Publishers), and
World Wide Web Journal (Kluwer). He has been an associate editor of
IEEE Transactions on Knowledge and Data Engineering. He is a
member of the IEEE.
Jorge Civera received the BSc degree in
computer science from the Universidad Politéc-
nica de Valencia in 2002, and the MSc degree in
computer science from Georgia Institute of
Technology in 2003. He is currently a PhD
student at the Departamento de Sistemas Informá-
ticos y Computación and a research assistant in
the Instituto Tecnológico de Informática. He
also holds a fellowship from the Spanish Ministry
of Education and Culture. His research interests
include bioinformatics, machine translation, and text mining.
Venu Dasigi received the BE degree in electro-
nics and communication engineering from An-
dhra University in 1979, the MEE degree in
electronic engineering from the Netherlands
Universities Foundation for International Coop-
eration in 1981, and the MS and PhD degrees in
computer science from the University of Mary-
land, College Park in 1985 and 1988, respec-
tively. He is currently professor and chair of
computer science at Southern Polytechnic State
University in Marietta, Georgia. He is also an honorary professor at
Gandhi Institute of Technology and Management in India. He held
research fellowships at the Oak Ridge National Laboratory and the Air
Force Research Laboratory. His research interests include text mining,
information retrieval, natural language processing, artificial intelligence,
bioinformatics, and computer science education. He is a member of
ACM and the IEEE Computer Society.
Ashwin Ram received the PhD degree from
Yale University in 1989, the MS degree from the
University of Illinois in 1984, and the BTech
degree from IIT Delhi in 1982. He is an associate
professor in the College of Computing at the
Georgia Institute of Technology, an associate
professor of Cognitive Science, and an adjunct
professor in the School of Psychology. He has
published two books and more than 80 scientific
papers in international forums. His research
interests lie in artificial intelligence and cognitive science, and include
machine learning, natural language processing, case-based reasoning,
educational technology, and artificial intelligence applications.
Brian J. Ciliax received the BS degree in
biochemistry from Michigan State University in
1981, and the PhD degree in pharmacology from
the University of Michigan in 1987. He is
currently an assistant professor in the Depart-
ment of Neurology at Emory University School of
Medicine. His research interests include the
functional neuroanatomy of the basal ganglia,
particularly as it relates to hyperkinetic move-
ment disorders such as Tourette's Syndrome.
Since 2000, he has collaborated with the coauthors on the development
of a system to functionally cluster genes (identified by high-throughput
genomic and proteomic assays) according to keywords mined from
relevant MEDLINE abstracts.
Ray Dingledine received the PhD degree in
pharmacology from Stanford. He is currently
professor and chair of pharmacology at Emory
University and serves on the Scientific Council of
NINDS at NIH. His research interests include the
application of microarray and associated tech-
nologies to identify novel molecular targets for
neurologic disease, the normal functions and
pathobiology of glutamate receptors, and the
role of COX-2 signaling in neurologic disease.
For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
2004 Reviewers List
We thank the following reviewers for the time and energy they have given to TCBB:
A
John Aach
Tatsuya Akutsu
David Aldous
Aijun An
Iannis Apostolakis
Lars Arvestad
Daniel Ashlock
Kevin Atteson
Wai-Ho Au
B
Rolf Backofen
David Bader
Tim Bailey
Tomas Balla
Serafim Batzoglou
Gil Bejerano
Amir Ben-Dor
Asa Ben-Hur
Anne Bergeron
Olaf Bininda-Emonds
Riccardo Boscolo
Guillaume Bourque
Alvis Brazma
Daniel Brown
Duncan Brown
Barb Bryant
David Bryant
Jeremy Buhler
Joachim Buhmann
C
Andrea Califano
Colin Campbell
Alberto Caprara
Keith Chan
Claudine Chaouiya
Ferdinando Cicalese
Melissa Cline
David Corne
Nello Cristianini
Miklos Csuros
Adele Cutler
D
Patrik D'haeseleer
Michiel de Hoon
Arthur Delcher
Alain Denise
Marcel Dettling
Inderjit S. Dhillon
Diego di Bernardo
Adrian Dobra
Bruce R. Donald
Sebastián Dormido-Canto
Zhihua Du
Blythe Durbin
E
Nadia El-Mabrouk
Charles Elkan
Eleazar Eskin
F
Giancarlo Ferrari-Trecate
Liliana Florea
Gary Fogel
Yoav Freund
Jane Fridlyand
Yan Fu
Terrence Furey
Cesare Furlanello
G
Olivier Gascuel
Dan Geiger
Zoubin Ghahramani
Debashis Ghosh
Pulak Ghosh
Raffaele Giancarlo
Robert Giegerich
David Gilbert
Jan Gorodkin
John Goutsias
Daniel Gusfield
Isabelle M. Guyon
Adolfo Guzman-Arenas
H
Sridhar Hannenhalli
Alexander Hartemink
Tzvika Hartman
Lisa Holm
Paul Horton
Steve Horvath
Xiao Hu
Haiyan Huang
Alan Hubbard
Katharina Huber
Dirk Husmeier
Daniel Huson
J
Inge Jonassen
Rebecka Jornsten
K
Jaap Kaandorp
Markus Kalisch
Rachel Karchin
Juha Karkkainen
Kevin Karplus
Simon Kasif
Samuel Kaski
Ed Keedwell
Purvesh Khatri
Hyunsoo Kim
Junhyong Kim
Ross D. King
Andrzej Konopka
Hamid Krim
Nandini Krishnamurthy
Gregory Kucherov
David Kulp
L
Michelle Lacey
Wai Lam
Giuseppe Lancia
Michael Lappe
Richard Lathrop
Nicolas Le Novere
Thierry LeCroq
Hansheng Lei
Boaz Lerner
Christina Leslie
Ilya Levner
Dequan Li
Fan Li
Jinyan Li
Wentian Li
Jie Liang
Olivier Lichtarge
Charles Ling
Michal Linial
Huan Liu
Zhenqiu Liu
Stanley Loh
Heitor Lopes
Rune Lyngsoe
M
Bin Ma
Patrick Ma
François Major
Elisabetta Manduchi
Mark Marron
Jens Meiler
Stefano Merler
Webb Miller
Marta Milo
Satoru Miyano
Annette Molinaro
Shinichi Morishita
Vincent Moulton
Marcus Mueller
Sayan Mukherjee
Rory Mulvaney
T.M. Murali
Simon Myers
N
Iftach Nachman
Luay Nakhleh
Anand Narasimhamurthy
Gonzalo Navarro
William Noble
O
Enno Ohlebusch
Arlindo Oliveira
Jose Oliver
Christos Ouzounis
P
Junfeng Pan
Rong Pan
Wei Pan
Paul Pavlidis
Itsik Pe'er
Christian Pedersen
Anton Petrov
Tuan Pham
Katherine Pollard
Gianluca Pollastri
Calton Pu
R
John Rachlin
Mark Ragan
Jagath Rajapakse
R.S. Ramakrishna
Isidore Rigoutsos
Dave Ritchie
Fredrik Ronquist
Juho Rousu
Jem Rowland
Larry Ruzzo
Leszek Rychlewski
S
Gerhard Sagerer
Steven Salzberg
Herbert Sauro
Alejandro Schaffer
Alexander Schliep
Scott Schmidler
Jeanette Schmidt
Alexander Schönhuth
Charles Semple
Soheil Shams
Roded Sharan
Chad Shaw
Dinggang Shen
Dou Shen
Lisan Shen
Stanislav Shvartsman
Amandeep Sidhu
Richard Simon
Sameer Singh
Janne Sinkkonen
Steven S. Skiena
Quinn Snell
Carol Soderlund
Rainer Spang
Peter Stadler
Mike Steel
Gerhard Steger
Jens Stoye
Jack Sullivan
Krister Swenson
T
Pablo Tamayo
Amos Tanay
Chun Tang
Jijun Tang
Thomas Tang
Glenn Tesler
Robert Tibshirani
Martin Tompa
Anna Tramontano
James Troendle
Jerry Tsai
Koji Tsuda
John Tyson
V
Eugene van Someren
Stella Veretnik
David Vogel
Gwenn Volkert
W
Baoying Wang
Chang Wang
Lisan Wang
Tandy Warnow
Michael K. Weir
Jason Weston
Ydo Wexler
Nalin Wickramarachchi
Chris Wiggins
David Wild
Tiffani Williams
Thomas Wu
X
Dong Xu
Jinbo Xu
Y
Qiang Yang
Yee Hwa Yang
Zizhen Yao
Daniel Yekutieli
Jeffrey Yu
Z
Mohammed J. Zaki
An-Ping Zeng
Chengxiang Zhai
Jingfen Zhang
Kaizhong Zhang
Xuegong Zhang
Yang Zhang
Zhi-Hua Zhou
Zonglin Zhou
Ji Zhu