
A Guided Tour to Approximate String Matching


GONZALO NAVARRO
University of Chile

We survey the current techniques to cope with the problem of string matching that
allows errors. This is becoming a more and more relevant issue for many fast growing
areas such as information retrieval and computational biology. We focus on online
searching and mostly on edit distance, explaining the problem and its relevance, its
statistical behavior, its history and current developments, and the central ideas of the
algorithms and their complexities. We present a number of experiments to compare the
performance of the different algorithms and show which are the best choices. We
conclude with some directions for future work and open problems.

Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems—Pattern matching, Computations on discrete structures; H.3.3 [Information storage and retrieval]: Information search and retrieval—Search process
General Terms: Algorithms
Additional Key Words and Phrases: Edit distance, Levenshtein distance, online string
matching, text searching allowing errors

1. INTRODUCTION

This work focuses on the problem of string matching that allows errors, also called approximate string matching. The general goal is to perform string matching of a pattern in a text where one or both of them have suffered some kind of (undesirable) corruption. Some examples are recovering the original signals after their transmission over noisy channels, finding DNA subsequences after possible mutations, and text searching where there are typing or spelling errors.

The problem, in its most general form, is to find the positions in a text where a given pattern occurs, allowing a limited number of "errors" in the matches. Each application uses a different error model, which defines how different two strings are. The idea for this "distance" between strings is to make it small when one of the strings is likely to be an erroneous variant of the other under the error model in use.

The goal of this survey is to present an overview of the state of the art in approximate string matching. We focus on online searching (that is, when the text

Partially supported by Fondecyt grant 1-990627.


Author’s address: Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago,
Chile, e-mail: gnavarro@dec.uchile.cl.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or direct commercial advantage and
that copies show this notice on the first page or initial screen of a display along with the full citation.
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any
component of this work in other works, requires prior specific permission and/or a fee. Permissions may
be requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2001 ACM 0360-0300/01/0300-0031 $5.00

ACM Computing Surveys, Vol. 33, No. 1, March 2001, pp. 31–88.


cannot be preprocessed to build an index on it), explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We also consider some variants of the problem. We present a number of experiments to compare the performance of the different algorithms and show the best choices. We conclude with some directions for future work and open problems.

Unfortunately, the algorithmic nature of the problem strongly depends on the type of "errors" considered, and the solutions range from linear time to NP-complete. The scope of our subject is so broad that we are forced to narrow our focus on a subset of the possible error models. We consider only those defined in terms of replacing some substrings by others at varying costs. In this light, the problem becomes minimizing the total cost to transform the pattern and its occurrence in text to make them equal, and reporting the text positions where this cost is low enough.

One of the best studied cases of this error model is the so-called edit distance, which allows us to delete, insert and substitute simple characters (with a different one) in both strings. If the different operations have different costs or the costs depend on the characters involved, we speak of general edit distance. Otherwise, if all the operations cost 1, we speak of simple edit distance or just edit distance (ed). In this last case we simply seek for the minimum number of insertions, deletions and substitutions to make both strings equal. For instance ed("survey", "surgery") = 2. The edit distance has received a lot of attention because its generalized version is powerful enough for a wide range of applications. Despite the fact that most existing algorithms concentrate on the simple edit distance, many of them can be easily adapted to the generalized edit distance, and we pay attention to this issue throughout this work. Moreover, the few algorithms that exist for the general error model that we consider are generalizations of edit distance algorithms.

On the other hand, most of the algorithms designed for the edit distance are easily specialized to other cases of interest. For instance, by allowing only insertions and deletions at cost 1, we can compute the longest common subsequence (LCS) between two strings. Another simplification that has received a lot of attention is the variant that allows only substitutions (Hamming distance).

An extension of the edit distance enriches it with transpositions (i.e. a substitution of the form ab → ba at cost 1). Transpositions are very important in text searching applications because they are typical typing errors, but few algorithms exist to handle them. However, many algorithms for edit distance can be easily extended to include transpositions, and we keep track of this fact in this work.

Since the edit distance is by far the best studied case, this survey focuses basically on the simple edit distance. However, we also pay attention to extensions such as generalized edit distance, transpositions and general substring substitution, as well as to simplifications such as LCS and Hamming distance. In addition, we also pay attention to some extensions of the type of pattern to search: when the algorithms allow it, we mention the possibility of searching some extended patterns and regular expressions allowing errors. We now point out what we are not covering in this work.

—First, we do not cover other distance functions that do not fit the model of substring substitution. This is because they are too different from our focus and the paper would lose cohesion. Some of these are: Hamming distance (short survey in [Navarro 1998]), reversals [Kececioglu and Sankoff 1995] (which allows reversing substrings), block distance [Tichy 1984; Ehrenfeucht and Haussler 1988; Ukkonen 1992; Lopresti and Tomkins 1997] (which allows rearranging and permuting the substrings), q-gram distance [Ukkonen 1992] (based on finding common substrings of fixed length q), allowing swaps [Amir et al. 1997b; Lee et al. 1997], etc. Hamming




distance, despite being a simplification of the edit distance, is not covered because specialized algorithms for it exist that go beyond the simplification of an existing algorithm for edit distance.

—Second, we consider pattern matching over sequences of symbols, and at most generalize the pattern to a regular expression. Extensions such as approximate searching in multidimensional texts (short survey in [Navarro and Baeza-Yates 1999a]), in graphs [Amir et al. 1997a; Navarro 2000a] or multipattern approximate searching [Muth and Manber 1996; Baeza-Yates and Navarro 1997; Navarro 1997a; Baeza-Yates and Navarro 1998] are not considered. None of these areas is very developed and the algorithms should be easy to grasp once approximate pattern matching under the simple model is well understood. Many existing algorithms for these problems borrow from those we present here.

—Third, we leave aside nonstandard algorithms, such as approximate,¹ probabilistic or parallel algorithms [Tarhio and Ukkonen 1988; Karloff 1993; Atallah et al. 1993; Altschul et al. 1990; Lipton and Lopresti 1985; Landau and Vishkin 1989].

—Finally, an important area that we leave aside in this survey is indexed searching, i.e. the process of building a persistent data structure (an index) on the text to speed up the search later. Typical reasons that prevent keeping indices on the text are: extra space requirements (as the indices for approximate searching tend to take many times the text size), volatility of the text (as building the indices is quite costly and needs to be amortized over many searches) and simply inadequacy (as the field of indexed approximate string matching is quite immature and the speedup that the indices provide is not always satisfactory). Indexed approximate searching is a difficult problem, and the area is quite new and active [Jokinen and Ukkonen 1991; Gonnet 1992; Ukkonen 1993; Myers 1994a; Holsti and Sutinen 1994; Manber and Wu 1994; Cobbs 1995; Sutinen and Tarhio 1996; Araújo et al. 1997; Navarro and Baeza-Yates 1999b; Baeza-Yates and Navarro 2000; Navarro et al. 2000]. The problem is very important because the texts in some applications are so large that no online algorithm can provide adequate performance. However, virtually all the indexed algorithms are strongly based on online algorithms, and therefore understanding and improving the current online solutions is of interest for indexed approximate searching as well.

These issues have been put aside to keep a reasonable scope in the present work. They certainly deserve separate surveys. Our goal in this survey is to explain the basic tools of approximate string matching, as many of the extensions we are leaving aside are built on the basic algorithms designed for online approximate string matching.

This work is organized as follows. In Section 2 we present in detail some of the most important application areas for approximate string matching. In Section 3 we formally introduce the problem and the basic concepts necessary to follow the rest of the paper. In Section 4 we show some analytical and empirical results about the statistical behavior of the problem. Sections 5–8 cover all the work of interest we could trace on approximate string matching under the edit distance. We divided it in four sections that correspond to different approaches to the problem: dynamic programming, automata, bit-parallelism, and filtering algorithms. Each section is presented as a historical tour, so that we do not only explain the

¹ Please do not confuse an approximate algorithm (which delivers a suboptimal solution with some suboptimality guarantees) with an algorithm for approximate string matching. Indeed approximate string matching algorithms can be regarded as approximation algorithms for exact string matching (where the maximum distance gives the guarantee of optimality), but in this case it is harder to find the approximate matches, and of course the motivation is different.





work done but also show how it was developed.

Section 9 presents experimental results comparing the most efficient algorithms. Finally, we give our conclusions and discuss open questions and future work in Section 10.

There exist other surveys on approximate string matching, which are however too old for this fast moving area [Hall and Dowling 1980; Sankoff and Kruskal 1983; Apostolico and Galil 1985; Galil and Giancarlo 1988; Jokinen et al. 1996] (the last one was in its definitive form in 1991). So all previous surveys lack coverage of the latest developments. Our aim is to provide a long awaited update. This work is partially based on Navarro [1998], but the coverage of previous work is much more detailed here. The subject is also covered, albeit with less depth, in some textbooks on algorithms [Crochemore and Rytter 1994; Baeza-Yates and Ribeiro-Neto 1999].

2. MAIN APPLICATION AREAS

The first references to this problem we could trace are from the sixties and seventies, where the problem appeared in a number of different fields. In those times, the main motivation for this kind of search came from computational biology, signal processing, and text retrieval. These are still the largest application areas, and we cover each one here. See also [Sankoff and Kruskal 1983], which has a lot of information on the birth of this subject.

2.1 Computational Biology

DNA and protein sequences can be seen as long texts over specific alphabets (e.g. {A,C,G,T} in DNA). Those sequences represent the genetic code of living beings. Searching specific sequences over those texts appeared as a fundamental operation for problems such as assembling the DNA chain from the pieces obtained by the experiments, looking for given features in DNA chains, or determining how different two genetic sequences are. This was modeled as searching for given "patterns" in a "text." However, exact searching was of little use for this application, since the patterns rarely matched the text exactly: the experimental measures have errors of different kinds and even the correct chains may have small differences, some of them significant due to mutations and evolutionary alterations and others unimportant. Finding DNA chains very similar to those sought represents significant results as well. Moreover, establishing how different two sequences are is important to reconstruct the tree of evolution (phylogenetic trees). All these problems required a concept of "similarity," as well as an algorithm to compute it.

This gave a motivation to "search allowing errors." The errors were those operations that biologists knew were common in genetic sequences. The "distance" between two sequences was defined as the minimum (i.e. most likely) sequence of operations to transform one into the other. With regard to likelihood, the operations were assigned a "cost," such that the more likely operations were cheaper. The goal was then to minimize the total cost.

Computational biology has since then evolved and developed a lot, with a special push in recent years due to the "genome" projects that aim at the complete decoding of the DNA and its potential applications. There are other, more exotic, problems such as structure matching or searching for unknown patterns. Even the simple problem where the pattern is known is very difficult under some distance functions (e.g. reversals).

Some good references for the applications of approximate pattern matching to computational biology are Sellers [1974], Needleman and Wunsch [1970], Sankoff and Kruskal [1983], Altschul et al. [1990], Myers [1991, 1994b], Waterman [1995], Yap et al. [1996], and Gusfield [1997].

2.2 Signal Processing

Another early motivation came from signal processing. One of the largest areas deals with speech recognition, where the general problem is to determine, given an audio signal, a textual message which is being transmitted. Even simplified




problems such as discerning a word from a small set of alternatives is complex, since parts of the signal may be compressed in time, parts of the speech may not be pronounced, etc. A perfect match is practically impossible.

Another problem is error correction. The physical transmission of signals is error-prone. To ensure correct transmission over a physical channel, it is necessary to be able to recover the correct message after a possible modification (error) introduced during the transmission. The probability of such errors is obtained from signal processing theory and used to assign a cost to them. In this case we may not even know what we are searching for; we just want a text which is correct (according to the error correcting code used) and closest to the received message. Although this area has not developed much with respect to approximate searching, it has generated the most important measure of similarity, known as the Levenshtein distance [Levenshtein 1965; 1966] (also called "edit distance").

Signal processing is a very active area today. The rapidly evolving field of multimedia databases demands the ability to search by content in image, audio and video data, which are potential applications for approximate string matching. We expect in the next years a lot of pressure on nonwritten human-machine communication, which involves speech recognition. Strong error correcting codes are also sought, given the current interest in wireless networks, as the air is a low quality transmission medium.

Good references for the relations of approximate pattern matching with signal processing are Levenshtein [1965], Vintsyuk [1968], and Dixon and Martin [1979].

2.3 Text Retrieval

The problem of correcting misspelled words in written text is rather old, perhaps the oldest potential application for approximate string matching. We could find references from the twenties [Masters 1927], and perhaps there are older ones.

Since the sixties, approximate string matching has been one of the most popular tools to deal with this problem. For instance, 80% of these errors are corrected allowing just one insertion, deletion, substitution, or transposition [Damerau 1964].

There are many areas where this problem appears, and Information Retrieval (IR) is one of the most demanding. IR is about finding the relevant information in a large text collection, and string matching is one of its basic tools.

However, classical string matching is normally not enough, because the text collections are becoming larger (e.g. the Web text has surpassed 6 terabytes [Lawrence and Giles 1999]), more heterogeneous (different languages, for instance), and more error prone. Many are so large and grow so fast that it is impossible to control their quality (e.g. in the Web). A word which is entered incorrectly in the database cannot be retrieved anymore. Moreover, the pattern itself may have errors, for instance in cross-lingual scenarios where a foreign name is incorrectly spelled, or in old texts that use outdated versions of the language.

For instance, text collections digitalized via optical character recognition (OCR) contain a nonnegligible percentage of errors (7–16%). The same happens with typing (1–3.2%) and spelling (1.5–2.5%) errors. Experiments for typing Dutch surnames (by the Dutch) reached 38% of spelling errors. All these percentages were obtained from Kukich [1992]. Our own experiments with the name "Levenshtein" in Altavista gave more than 30% of errors allowing just one deletion or transposition.

Nowadays, there is virtually no text retrieval product that does not allow some extended search facility to recover from errors in the text or pattern. Other text processing applications are spelling checkers, natural language interfaces, command language interfaces, computer aided tutoring and language learning, to name a few.

A very recent extension which became possible thanks to word-oriented text compression methods is the possibility to perform approximate string matching at the word level [Navarro et al. 2000]. That





is, the user supplies a phrase to search and the system searches the text positions where the phrase appears with a limited number of word insertions, deletions and substitutions. It is also possible to disregard the order of the words in the phrases. This allows the query to survive different wordings of the same idea, which extends the applications of approximate pattern matching well beyond the recovery of syntactic mistakes.

Good references about the relation of approximate string matching and information retrieval are Wagner and Fisher [1974], Lowrance and Wagner [1975], Nesbit [1986], Owolabi and McGregor [1988], Kukich [1992], Zobel and Dart [1996], French et al. [1997], and Baeza-Yates and Ribeiro-Neto [1999].

2.4 Other Areas

The number of applications for approximate string matching grows every day. We have found solutions to the most diverse problems based on approximate string matching, for instance handwriting recognition [Lopresti and Tomkins 1994], virus and intrusion detection [Kumar and Spafford 1994], image compression [Luczak and Szpankowski 1997], data mining [Das et al. 1997], pattern recognition [González and Thomason 1978], optical character recognition [Elliman and Lancaster 1990], file comparison [Heckel 1978], and screen updating [Gosling 1991], to name a few. Many more applications are mentioned in Sankoff and Kruskal [1983] and Kukich [1992].

3. BASIC CONCEPTS

We present in this section the important concepts needed to understand all the development that follows. Basic knowledge of the design and analysis of algorithms and data structures, basic text algorithms, and formal languages is assumed. If this is not the case we refer the reader to good books on these subjects, such as Aho et al. [1974], Cormen et al. [1990], Knuth [1973] (for algorithms), Gonnet and Baeza-Yates [1991], Crochemore and Rytter [1994], Apostolico and Galil [1997] (for text algorithms), and Hopcroft and Ullman [1979] (for formal languages).

We start with some formal definitions related to the problem. Then we cover some data structures not widely known which are relevant for this survey (they are also explained in Gonnet and Baeza-Yates [1991] and Crochemore and Rytter [1994]). Finally, we make some comments about the tour itself.

3.1 Approximate String Matching

In the discussion that follows, we use s, x, y, z, v, w to represent arbitrary strings, and a, b, c, . . . to represent letters. Writing a sequence of strings and/or letters represents their concatenation. We assume that concepts such as prefix, suffix and substring are known. For any string s ∈ Σ* we denote its length as |s|. We also denote s_i the ith character of s, for an integer i ∈ {1..|s|}. We denote s_{i..j} = s_i s_{i+1} · · · s_j (which is the empty string if i > j). The empty string is denoted as ε.

In the Introduction we have defined the problem of approximate string matching as that of finding the text positions that match a pattern with up to k errors. We now give a more formal definition.

Let Σ be a finite² alphabet of size |Σ| = σ.
Let T ∈ Σ* be a text of length n = |T|.
Let P ∈ Σ* be a pattern of length m = |P|.
Let k ∈ R be the maximum error allowed.
Let d : Σ* × Σ* → R be a distance function.
The problem is: given T, P, k and d(·), return the set of all the text positions j such that there exists i such that d(P, T_{i..j}) ≤ k.

² However, many algorithms can be adapted to infinite alphabets with an extra O(log m) factor in their cost. This is because the pattern can have at most m different letters and all the rest can be considered equal for our purposes. A table of size σ could be replaced by a search structure over at most m + 1 different letters.





Note that endpoints of occurrences are reported to ensure that the output is of linear size. By reversing all strings we can obtain start points.

In this work we restrict our attention to a subset of the possible distance functions. We consider only those defined in the following form:

The distance d(x, y) between two strings x and y is the minimal cost of a sequence of operations that transform x into y (and ∞ if no such sequence exists). The cost of a sequence of operations is the sum of the costs of the individual operations. The operations are a finite set of rules of the form δ(z, w) = t, where z and w are different strings and t is a nonnegative real number. Once the operation has converted a substring z into w, no further operations can be done on w.

Note especially the restriction that forbids acting many times over the same string. Freeing the definition from this condition would allow any rewriting system to be represented, and therefore determining the distance between two strings would not be computable in general.

If for each operation of the form δ(z, w) there exists the respective operation δ(w, z) at the same cost, then the distance is symmetric (i.e. d(x, y) = d(y, x)). Note also that d(x, y) ≥ 0 for all strings x and y, that d(x, x) = 0, and that it always holds d(x, z) ≤ d(x, y) + d(y, z). Hence, if the distance is symmetric, the space of strings forms a metric space.

General substring substitution has been used to correct phonetic errors [Zobel and Dart 1996]. In most applications, however, the set of possible operations is restricted to:

—Insertion: δ(ε, a), i.e. inserting the letter a.
—Deletion: δ(a, ε), i.e. deleting the letter a.
—Substitution or Replacement: δ(a, b) for a ≠ b, i.e. substituting a by b.
—Transposition: δ(ab, ba) for a ≠ b, i.e. swapping the adjacent letters a and b.

We are now in position to define the most commonly used distance functions (although there are many others).

—Levenshtein or edit distance [Levenshtein 1965]: allows insertions, deletions and substitutions. In the simplified definition, all the operations cost 1. This can be rephrased as "the minimal number of insertions, deletions and substitutions to make two strings equal." In the literature the search problem in many cases is called "string matching with k differences." The distance is symmetric, and it holds 0 ≤ d(x, y) ≤ max(|x|, |y|).
—Hamming distance [Sankoff and Kruskal 1983]: allows only substitutions, which cost 1 in the simplified definition. In the literature the search problem in many cases is called "string matching with k mismatches." The distance is symmetric, and it is finite whenever |x| = |y|. In this case it holds 0 ≤ d(x, y) ≤ |x|.
—Episode distance [Das et al. 1997]: allows only insertions, which cost 1. In the literature the search problem in many cases is called "episode matching," since it models the case where a sequence of events is sought, where all of them must occur within a short period. This distance is not symmetric, and it may not be possible to convert x into y in this case. Hence, d(x, y) is either |y| − |x| or ∞.
—Longest common subsequence distance [Needleman and Wunsch 1970; Apostolico and Guerra 1987]: allows only insertions and deletions, all costing 1. The name of this distance refers to the fact that it measures the length of the longest pairing of characters that can be made between both strings, so that the pairings respect the order of the letters. The distance is the number of unpaired characters. The distance is symmetric, and it holds 0 ≤ d(x, y) ≤ |x| + |y|.

In all cases, except the episode distance, one can think that the changes can be made over x or y. Insertions on x are the




same as deletions in y and vice versa, and substitutions can be made in any of the two strings to match the other.

This paper is most concerned with the simple edit distance, which we denote ed(·). Although transpositions are of interest (especially in case of typing errors), there are few algorithms to deal with them. However, we will consider them at some point in this work (note that a transposition can be simulated with an insertion plus a deletion, but the cost is different). We also point out when the algorithms can be extended to have different costs of the operations (which is of special interest in computational biology), including the extreme case of not allowing some operations. This includes the other distances mentioned.

Note that if the Hamming or edit distance are used, then the problem makes sense for 0 < k < m, since if we can perform m operations we can make the pattern match at any text position by means of m substitutions. The case k = 0 corresponds to exact string matching and is therefore excluded from this work. Under these distances, we call α = k/m the error level, which, given the above conditions, satisfies 0 < α < 1. This value gives an idea of the "error ratio" allowed in the match (i.e. the fraction of the pattern that can be wrong).

We finish this section with some notes about the algorithms we are going to consider. Like string matching, this area is suitable for very theoretical and for very practical contributions. There exist a number of algorithms with important improvements in their theoretical complexity, but they are very slow in practice. Of course, for carefully built scenarios (say, m = 100,000 and k = 2) these algorithms could be a practical alternative, but these cases do not appear in applications. Therefore, we now point out the parameters of the problem that we consider "practical," i.e. likely to be of use in some applications, and when we later say "in practice" we mean under the following assumptions.

—The pattern length can be as short as 5 letters (e.g. text retrieval) and as long as a few hundred letters (e.g. computational biology).
—The number of errors allowed k satisfies that k/m is a moderately low value. Reasonable values range from 1/m to 1/2.
—The text length can be as short as a few thousand letters (e.g. computational biology) and as long as megabytes or gigabytes (e.g. text retrieval).
—The alphabet size σ can be as low as four letters (e.g. DNA) and as high as 256 letters (e.g. compression applications). It is also reasonable to think of even larger alphabets (e.g. oriental languages or word oriented text compression). The alphabet may or may not be random.

3.2 Suffix Trees and Suffix Automata

Suffix trees [Weiner 1973; Knuth 1973; Apostolico and Galil 1985] are widely used data structures for text processing [Apostolico 1985]. Any position i in a string S automatically defines a suffix of S, namely S_{i..|S|}. In essence, a suffix tree is a trie data structure built over all the suffixes of S. At the leaf nodes the pointers to the suffixes are stored. Each leaf represents a suffix and each internal node represents a unique substring of S. Every substring of S can be found by traversing a path from the root. Each node representing the substring ax has a suffix link that leads to the node representing the substring x.

To improve space utilization, this trie is compacted into a Patricia tree [Morrison 1968]. This involves compressing unary paths. At the nodes that root a compressed path, an indication of which character to inspect is stored. Once unary paths are not present the tree has O(|S|) nodes instead of the worst-case O(|S|²) of the trie (see Figure 1). The structure can be built in time O(|S|) [McCreight 1976; Ukkonen 1995].

A DAWG (Deterministic Acyclic Word Graph) [Crochemore 1986; Blumer et al. 1985] built on a string S is a deterministic automaton able to recognize all the substrings of S. As each node in the suffix tree corresponds to a substring, the DAWG is no more than the suffix tree

ACM Computing Surveys, Vol. 33, No. 1, March 2001.



A Guided Tour to Approximate String Matching 39

Fig. 1. The suffix trie and suffix tree for a sample string. The “$” is a special marker to denote the end of
the text. Two suffix links are exemplified in the trie: from "abra" to "bra" and then to "ra". The internal
nodes of the suffix tree show the character position to inspect in the string.
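To make the structure of Figure 1 concrete, here is a naive suffix trie sketch. This is illustrative code of mine, not from the survey: it builds the uncompacted trie in O(|S|^2) time and space, which is exactly why practical implementations compact it into a suffix tree and build that in O(|S|) time [McCreight 1976; Ukkonen 1995].

```python
def build_suffix_trie(s):
    """Naive suffix trie over all suffixes of s, as nested dicts.
    O(|s|^2) time and space; a suffix tree would need only O(|s|)."""
    s += "$"                          # unique end marker, as in Figure 1
    root = {}
    for i in range(len(s)):           # insert every suffix s[i:]
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root

def is_substring(trie, p):
    """Every substring of s labels a path from the root of its suffix trie."""
    node = trie
    for c in p:
        if c not in node:
            return False
        node = node[c]
    return True

trie = build_suffix_trie("abracadabra")
print(is_substring(trie, "acad"))     # True
print(is_substring(trie, "abx"))      # False
```

Each suffix ends at the "$" marker, so leaves are distinguishable from internal nodes even in this minimal sketch.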

augmented with failure links for the letters not present in the tree. Since final nodes are not distinguished, the DAWG is smaller. DAWGs have similar applications to those of suffix trees, and also need O(|S|) space and construction time. Figure 2 illustrates.

A suffix automaton on S is an automaton that recognizes all the suffixes of S. The nondeterministic version of this automaton has a very regular structure and is shown in Figure 3 (the deterministic version can be seen in Figure 2).

3.3 The Tour

Sections 5–8 present a historical tour across the four main approaches to online approximate string matching (see Figure 4). In those historical discussions, keep in mind that there may be a long gap between the time when a result is discovered and when it finally gets published in its definitive form. Some apparent inconsistencies can be explained in this way (e.g. algorithms which are "finally" analyzed before they appear). We did our best in the bibliography to trace the earliest version of the works, although the full reference corresponds generally to the final version.

At the beginning of each of these sections we give a taxonomy to help guide the tour. The taxonomy is an acyclic graph where the nodes are the algorithms and the edges mean that the work lower down can be seen as an evolution of the work in the upper position (although sometimes the developments are in fact independent).

Finally, we specify some notation regarding time and space complexity. When we say that an algorithm is O(x) time we refer to its worst case (although sometimes we say that explicitly). If the cost is average, we say so explicitly. We also sometimes say that the algorithm is O(x) cost, meaning time. When we refer to space complexity we say so explicitly. The average case analysis normally assumes a random text, where each character is selected uniformly and independently from the alphabet. The pattern is not normally assumed to be random.

4. THE STATISTICS OF THE PROBLEM

A natural question about approximate searching is: what is the probability of a match? This question is not only

40 G. Navarro

Fig. 2. The DAWG or the suffix automaton for the sample string. If all the states are final, it is a DAWG.
If only the 2nd, 5th and rightmost states are final then it is a suffix automaton.

interesting in itself, but also essential for the average case analysis of many search algorithms, as will be seen later. We now present the existing results and an empirical validation. In this section we consider the edit distance only. Some variants can be adapted to these results.

The effort in analyzing the probabilistic behavior of the edit distance has not given good results in general [Kurtz and Myers 1997]. An exact analysis of the probability of the occurrence of a fixed pattern allowing k substitution errors (i.e. Hamming distance) can be found in Régnier and Szpankowski [1997], although the result is not easy to average over all the possible patterns. The results we present here apply to the edit distance model and, although not exact, are easier to use in general.

The result of Régnier and Szpankowski [1997] holds under the assumption that the characters of the text are independently generated with fixed probabilities, i.e. a Bernoulli model. In the rest of this paper we consider a simpler model, the "uniform Bernoulli model," where all the characters occur with the same probability 1/σ. Although this is a gross simplification of the real processes that generate the texts in most applications, the results obtained are quite reliable in practice. In particular, all the analyses apply quite well to biased texts if we replace σ by 1/p, where p is the probability that two random text characters are equal.

Although the problem of the average edit distance between two strings is closely related to the better studied LCS, the well known results of Chvátal and Sankoff [1975] and Deken [1979] can hardly be applied to this case. It can be shown that the average edit distance between two random strings of length m tends to a constant fraction of m as m grows, but the fraction is not known. It holds that for any two strings of length m, m − lcs ≤ ed ≤ 2(m − lcs), where ed is their edit distance and lcs is the length of their longest common subsequence. As proved in Chvátal and Sankoff [1975], the average LCS is between m/√σ and me/√σ for large σ, and therefore the average edit distance is between m(1 − e/√σ) and 2m(1 − 1/√σ). For large σ it is conjectured that the true value is m(1 − 1/√σ) [Sankoff and Mainville 1983].

For our purposes, bounding the probability of a match allowing errors is more important than the average edit distance. Let f(m, k) be the probability of a random pattern of length m matching a given text position with k errors or less under the edit distance (i.e. the text position is reported as the end of a match). In Baeza-Yates and Navarro [1999], Navarro [1998], and Navarro and Baeza-Yates [1999b] upper and lower bounds on the maximum error level α* for which f(m, k) is exponentially decreasing on m are found. This is important because many algorithms search for potential matches that have to be verified later, and the cost of such verifications is polynomial in m, typically O(m^2). Therefore, if that event occurs with probability O(γ^m) for some γ < 1 then the total cost of verifications is O(m^2 γ^m) = o(1), which makes the verification cost negligible.

We first show the analytical bounds for f(m, k), then give a new result on average

Fig. 3. A nondeterministic suffix automaton to recognize any suffix of "abracadabra.” Dashed lines rep-
resent ε-transitions (i.e. they occur without consuming any input).

edit distance, and finally present an experimental verification.

4.1 An Upper Bound

The upper bound for α* comes from the proof that the matching probability is f(m, k) = O(γ^m) for

    γ = ( 1 / (σ α^{2α/(1−α)} (1 − α)^2) )^{1−α}  ≤  ( e^2 / (σ (1 − α)^2) )^{1−α}        (1)

where we note that γ is 1/σ for α = 0 and grows to 1 as α grows. This matching probability is exponentially decreasing on m as long as γ < 1, which is equivalent to

    α < 1 − e/√σ − O(1/σ) ≤ 1 − e/√σ        (2)

Therefore, α < 1 − e/√σ is a conservative condition on the error level which ensures "few" matches. Therefore, the maximum level α* satisfies α* > 1 − e/√σ.

The proof is obtained using a combinatorial model. Based on the observation that m − k common characters must appear in the same order in two strings that match with k errors, all the possible alternatives to select the matching characters from both strings are enumerated. This model, however, does not take full advantage of the properties of the edit distance: even if m − k characters match, the distance can be larger than k. For example, in ed(abc, bcd) = 2, i.e. although two characters match, the distance is not 1.

4.2 A Lower Bound

On the other hand, the only optimistic bound we know of is based on the consideration that only substitutions are allowed (i.e. Hamming distance). This distance is simpler to analyze but its matching probability is much lower. Using a combinatorial model again it is shown that the matching probability is f(m, k) ≥ δ^m m^{−1/2}, where

    δ = ( 1 / ((1 − α) σ) )^{1−α}

Therefore an upper bound for the maximum α* value is α* ≤ 1 − 1/σ, since otherwise it can be proved that f(m, k) is not exponentially decreasing on m (i.e. it is Ω(m^{−1/2})).

4.3 A New Result on Average Edit Distance

We can now prove that the average edit distance is larger than m(1 − e/√σ) for any σ (recall that the result of Chvátal and Sankoff [1975] holds for large σ). We define p(m, k) as the probability that the edit distance between two strings of length m is at most k. Note that p(m, k) ≤ f(m, k) because in the latter case we can match with any text suffix of length from m − k to m + k. Then the average edit distance is

    Σ_{k=0..m} k Pr(ed = k)  =  Σ_{k=0..m} Pr(ed > k)  =  Σ_{k=0..m} (1 − p(m, k))  =  m − Σ_{k=0..m} p(m, k)

which, since p(m, k) increases with k, is larger than

    m − (K p(m, K) + (m − K))  =  K (1 − p(m, K))

for any K of our choice. In particular, for K/m < 1 − e/√σ we have that p(m, K) ≤ f(m, K) = O(γ^m) for γ < 1. Therefore choosing K = m(1 − e/√σ) − 1 yields that the edit distance is at least




Fig. 4. Taxonomy of the types of solutions for online searching.


m(1 − e/√σ) + O(1), for any σ. As we see later, this proof converts a conjecture about the average running time of an algorithm [Chang and Lampe 1992] into a fact.

4.4 Empirical Verification

We verify the analysis experimentally in this section (this is also taken from Baeza-Yates and Navarro [1999] and Navarro [1998]). The experiment consists of generating a large random text (n = 10 MB) and running the search of a random pattern on that text, allowing k = m errors. At each text character, we record the minimum allowed error k for which that text position matches the pattern. We repeat the experiment with 1,000 random patterns.

Finally, we build the cumulative histogram, finding how many text positions have matched with up to k errors, for each k value. We consider that k is "low enough" up to where the histogram values become significant, that is, as long as few text positions have matched. The threshold is set to n/m^2, since m^2 is the normal cost of verifying a match. However, the selection of this threshold is not very important, since the histogram is extremely concentrated. For example, for m in the hundreds, it moves from almost zero to almost n in just five or six increments of k.

Figure 5 shows the results for σ = 32. On the left we show the histogram we have built, where the matching probability undergoes a sharp increase at α*. On the right we show the α* value as m grows. It is clear that α* is essentially independent of m, although it is a bit lower for short patterns. The increase in the left plot at α* is so sharp that the right plot would be the same if we plotted the value of the average edit distance divided by m.

Figure 6 uses a stable m = 300 to show the α* value as a function of σ. The curve α = 1 − 1/√σ is included to show its closeness to the experimental data. Least squares give the approximation α* = 1 − 1.09/√σ, with a relative error smaller than 1%. This shows that the upper bound analysis (Eq. (2)) matches reality better, provided we replace e by 1.09 in the formulas.

Therefore, we have shown that the matching probability has a sharp behavior: for low α it is very low, not as low as 1/σ^m like exact string matching, but still exponentially decreasing in m, with an exponent base larger than 1/σ. At some α value (that we call α*) it sharply increases and quickly becomes almost 1. This point is close to α* = 1 − 1/√σ in practice.

This is why the problem is of interest only up to a given error level, since for higher errors almost all text positions match. This is also the reason that some algorithms have good average behavior only for low enough error levels. The point α* = 1 − 1/√σ matches the conjecture of Sankoff and Mainville [1983].
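The sharp behavior just described is easy to probe in miniature. The sketch below is my own toy version of such an experiment, not the 10 MB setup of Section 4.4: it estimates p(m, k), the probability that two random strings of length m are at edit distance at most k, by sampling.

```python
import random

def edit_distance(x, y):
    # classical dynamic programming, keeping one vector (see Section 5.1.1)
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, 1):
        curr = [i]
        for j, yj in enumerate(y, 1):
            curr.append(min(prev[j] + 1,                 # delete xi
                            curr[j - 1] + 1,             # insert yj
                            prev[j - 1] + (xi != yj)))   # match/substitute
        prev = curr
    return prev[-1]

def p_hat(m, k, sigma=4, trials=200, seed=42):
    """Estimate p(m, k): the probability that two random strings of
    length m over a sigma-letter alphabet are at edit distance <= k."""
    rng = random.Random(seed)
    letters = [chr(ord("a") + i) for i in range(sigma)]
    hits = 0
    for _ in range(trials):
        x = "".join(rng.choice(letters) for _ in range(m))
        y = "".join(rng.choice(letters) for _ in range(m))
        hits += edit_distance(x, y) <= k
    return hits / trials

print(p_hat(20, 2))    # error level far below alpha*: estimate near 0
print(p_hat(20, 15))   # error level far above alpha*: estimate near 1
```

With σ = 4, the estimates are negligible for error levels well below α* ≈ 1 − 1/√σ = 0.5 and close to 1 well above it, matching the transition shown in Figure 5.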


Fig. 5. On the left, probability of an approximate match as a function of the error level (m =
300). On the right, the observed α ∗ error level as a function of the pattern length. Both cases
correspond to random text with σ = 32.

5. DYNAMIC PROGRAMMING ALGORITHMS

We start our tour with the oldest among the four areas, which directly inherits from the earliest work. Most of the theoretical breakthroughs in the worst-case algorithms belong to this category, although only a few of them are really competitive in practice. The latest practical work in this area dates back to 1992, although there are recent theoretical improvements. The major achievements are O(kn) worst-case algorithms and O(kn/√σ) average-case algorithms, as well as other recent theoretical improvements on the worst-case.

We start by presenting the first algorithm that solved the problem and then give a historical tour on the improvements over the initial solution. Figure 7 helps guide the tour.

5.1 The First Algorithm

We now present the first algorithm to solve the problem. It has been rediscovered many times in the past, in different areas, e.g. Vintsyuk [1968], Needleman and Wunsch [1970], Sankoff [1972], Sellers [1974], Wagner and Fisher [1974], and Lowrance and Wagner [1975] (there are more in Ullman [1977], Sankoff and Kruskal [1983], and Kukich [1992]). However, this algorithm computed the edit distance, and it was converted into a search algorithm only in 1980 by Sellers [Sellers 1980]. Although the algorithm is not very efficient, it is among the most flexible ones in adapting to different distance functions.

We first show how to compute the edit distance between two strings x and y. Later, we extend that algorithm to search a pattern in a text allowing errors. Finally, we show how to handle more general distance functions.

5.1.1 Computing Edit Distance. The algorithm is based on dynamic programming. Imagine that we need to compute ed(x, y). A matrix C_{0..|x|,0..|y|} is filled, where C_{i,j} represents the minimum number of operations needed to match x_{1..i} to y_{1..j}. This is computed as follows:

    C_{i,0} = i
    C_{0,j} = j
    C_{i,j} = if (x_i = y_j) then C_{i−1,j−1}
              else 1 + min(C_{i−1,j}, C_{i,j−1}, C_{i−1,j−1})

where at the end C_{|x|,|y|} = ed(x, y). The rationale for the above formula is as follows. First, C_{i,0} and C_{0,j} represent the edit distance between a string of length i or j and the empty string. Clearly i (respectively j) deletions are needed on the nonempty string. For two nonempty strings of length i and j, we assume inductively that all the edit distances between shorter strings have already been computed, and try to convert x_{1..i} into y_{1..j}.

Consider the last characters x_i and y_j. If they are equal, then we do not need to


Fig. 6. Theoretical and practical values for α ∗ , for m = 300 and different σ values.

consider them and we proceed in the best possible way to convert x_{1..i−1} into y_{1..j−1}. On the other hand, if they are not equal, we must deal with them in some way. Following the three allowed operations, we can delete x_i and convert in the best way x_{1..i−1} into y_{1..j}, insert y_j at the end of x_{1..i} and convert in the best way x_{1..i} into y_{1..j−1}, or substitute x_i by y_j and convert in the best way x_{1..i−1} into y_{1..j−1}. In all cases, the cost is 1 plus the cost for the rest of the process (already computed). Notice that the insertions in one string are equivalent to deletions in the other.

An equivalent formula which is also widely used is

    C'_{i,j} = min(C_{i−1,j−1} + δ(x_i, y_j), C_{i−1,j} + 1, C_{i,j−1} + 1)

where δ(a, b) = 0 if a = b and 1 otherwise. It is easy to see that both formulas are equivalent because neighboring cells differ in at most one (just recall the meaning of C_{i,j}), and therefore when δ(x_i, y_j) = 0 we have that C_{i−1,j−1} cannot be larger than C_{i−1,j} + 1 or C_{i,j−1} + 1.

The dynamic programming algorithm must fill the matrix in such a way that the upper, left, and upper-left neighbors of a cell are computed prior to computing that cell. This is easily achieved by either a row-wise left-to-right traversal or a column-wise top-to-bottom traversal, but we will see later that, using a difference recurrence, the matrix can also be filled by (upper-left to lower-right) diagonals or "secondary" (upper-right to lower-left) diagonals. Figure 8 illustrates the algorithm to compute ed("survey", "surgery").

Therefore, the algorithm is O(|x||y|) time in the worst and average case. However, the space required is only O(min(|x|, |y|)). This is because, in the case of a column-wise processing, only the previous column must be stored in order to compute the new one, and therefore we just keep one column and update it. We can process the matrix row-wise or column-wise so that the space requirement is minimized.

On the other hand, the sequences of operations performed to transform x into y can be easily recovered from the matrix, simply by proceeding from the cell C_{|x|,|y|} to the cell C_{0,0} following the path (i.e. sequence of operations) that matches the update formula (multiple paths may exist). In this case, however, we need to store the complete matrix or at least an area around the main diagonal.

This matrix has some properties that can be easily proved by induction (see, e.g. Ukkonen [1985a]) and which make it
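The recurrence and the column-wise space optimization just described can be rendered as a short sketch (an illustrative Python version of mine, not the survey's code; it reproduces the example of Figure 8):

```python
def edit_distance(x, y):
    """ed(x, y) by dynamic programming: C[i][j] is the minimum number of
    operations to convert x[:i] into y[:j]. Only the previous column is
    kept, so the space is linear rather than O(|x||y|)."""
    prev = list(range(len(y) + 1))           # C[0][j] = j
    for i in range(1, len(x) + 1):
        curr = [i]                           # C[i][0] = i
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                curr.append(prev[j - 1])     # characters match: no cost
            else:
                curr.append(1 + min(prev[j],       # delete x[i-1]
                                    curr[j - 1],   # insert y[j-1]
                                    prev[j - 1]))  # substitute
        prev = curr
    return prev[len(y)]

print(edit_distance("survey", "surgery"))    # 2, as in Figure 8
```

To recover the actual sequence of operations, the full matrix (or a band around the main diagonal) would have to be stored, as noted above.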


Fig. 7. Taxonomy of algorithms based on the dynamic programming matrix. References are
shortened to first letters (single authors) or initials (multiple authors), and to the last two digits
of years.
Key: Vin68 = [Vintsyuk 1968], NW70 = [Needleman and Wunsch 1970], San72 = [Sankoff 1972],
Sel74 = [Sellers 1974], WF74 = [Wagner and Fisher 1974], LW75 = [Lowrance and Wagner 1975],
Sel80 = [Sellers 1980], MP80 = [Masek and Paterson 1980], Ukk85a & Ukk85b = [Ukkonen
1985a; 1985b], Mye86a & Mye86b = [Myers 1986a; 1986b], LV88 & LV89 = [Landau and Vishkin
1988; 1989], GP90 = [Galil and Park 1990], UW93 = [Ukkonen and Wood 1993], GG88 = [Galil
and Giancarlo 1988], CL92 = [Chang and Lampe 1992], CL94 = [Chang and Lawler 1994], SV97 =
[Sahinalp and Vishkin 1997], CH98 = [Cole and Hariharan 1998], and BYN99 = [Baeza-Yates
and Navarro 1999].


Fig. 8. The dynamic programming algorithm to compute the edit distance between "survey" and "surgery." The bold entries show the path to the final result.

Fig. 9. The dynamic programming algorithm to search "survey" in the text "surgery" with two errors. Each column of this matrix is a value of the C vector. Bold entries indicate matching text positions.
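The search of Figure 9, detailed in Section 5.1.2, can be sketched as follows (again an illustrative Python rendering of mine): the column starts as C_i = i, each text character updates it with C'_0 = 0 so that a match may start anywhere, and every position where C_m ≤ k is reported.

```python
def search(pattern, text, k):
    """Report end positions (1-based) of approximate occurrences of
    pattern in text with at most k errors under the edit distance.
    O(mn) time, O(m) space."""
    m = len(pattern)
    C = list(range(m + 1))                # column for the empty text prefix
    occurrences = []
    for j, tj in enumerate(text, 1):      # process text char by char
        new = [0]                         # C'[0] = 0: match can start anywhere
        for i in range(1, m + 1):
            if pattern[i - 1] == tj:
                new.append(C[i - 1])
            else:
                new.append(1 + min(new[i - 1],  # C'[i-1]
                                   C[i],        # C[i]
                                   C[i - 1]))   # C[i-1]
        C = new
        if C[m] <= k:                     # occurrence ends at position j
            occurrences.append(j)
    return occurrences

print(search("survey", "surgery", 2))     # [5, 6, 7]: 3 occurrences, as in Figure 9
```

Only the end positions are reported; recovering the matching substrings themselves would require extra bookkeeping.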
possible to design better algorithms. Some of the most useful are that the values of neighboring cells differ in at most one, and that upper-left to lower-right diagonals are nondecreasing.

5.1.2 Text Searching. We now show how to adapt this algorithm to search a short pattern P in a long text T. The algorithm is basically the same, with x = P and y = T (proceeding column-wise so that O(m) space is required). The only difference is that we must allow that any text position is the potential start of a match. This is achieved by setting C_{0,j} = 0 for all j ∈ 0..n. That is, the empty pattern matches with zero errors at any text position (because it matches with a text substring of length zero).

The algorithm then initializes its column C_{0..m} with the values C_i = i, and processes the text character by character. At each new text character T_j, its column vector is updated to C'_{0..m}. The update formula is

    C'_i = if (P_i = T_j) then C_{i−1}
           else 1 + min(C'_{i−1}, C_i, C_{i−1})

and the text positions where C_m ≤ k holds are reported.

The search time of this algorithm is O(mn) and its space requirement is O(m). This is a sort of worst case in the analysis of all the algorithms that we consider later. Figure 9 exemplifies this algorithm applied to search the pattern "survey" in the text "surgery" (a very short text indeed) with at most k = 2 errors. In this case there are 3 occurrences.

5.1.3 Other Distance Functions. It is easy to adapt this algorithm for the other distance functions mentioned. If the operations have different costs, we add the cost instead of adding 1 when computing C_{i,j}, i.e.

    C_{0,0} = 0
    C_{i,j} = min(C_{i−1,j−1} + δ(x_i, y_j), C_{i−1,j} + δ(x_i, ε), C_{i,j−1} + δ(ε, y_j))

where we assume δ(a, a) = 0 for any a ∈ Σ and that C_{−1,j} = C_{i,−1} = ∞ for all i, j. For distances that do not allow some operations, we just take them out of the minimization formula, or, which is the same, we assign ∞ to their δ cost. For transpositions, we allow a fourth rule that says that C_{i,j} can be C_{i−2,j−2} + 1 if x_{i−1}x_i = y_j y_{j−1} [Lowrance and Wagner 1975].

The most complex case is to allow general substring substitutions, in the form of a finite set R of rules. The formula is given in Ukkonen [1985a]:

    C_{0,0} = 0
    C_{i,j} = min(C_{i−1,j−1} if x_i = y_j,
                  C_{i−|s_1|,j−|s_2|} + δ(s_1, s_2) for each (s_1, s_2) ∈ R
                      such that x_{1..i} = x′ s_1 and y_{1..j} = y′ s_2)

An interesting problem is how to compute this recurrence efficiently. A naive approach takes O(|R|mn), where |R| is the sum of all the lengths of the strings in


Fig. 10. The Masek and Paterson algorithm partitions the dynamic programming matrix in cells
(r = 2 in this example). On the right, we shaded the entries of adjacent cells that influence the
current one.

R. A better solution is to build two Aho–Corasick automata [Aho and Corasick 1975] with the left and right hand sides of the rules, respectively. The automata are run as we advance in both strings (left hand sides in x and right hand sides in y). For each pair of states (i_1, i_2) of the automata we precompute the set of substitutions that can be tried (i.e. those δ's whose left and right hand sides match the suffixes of x and y, respectively, represented by the automata states). Hence, we know in constant time (per cell) the set of possible substitutions. The complexity is now much lower; in the worst case it is O(cmn), where c is the maximum number of rules applicable to a single text position.

As mentioned, the dynamic programming approach is unbeaten in flexibility, but its time requirements are indeed high. A number of improved solutions have been proposed over the years. Some of them work only for the edit distance, while others can still be adapted to other distance functions. Before considering the improvements, we mention that there exists a way to see the problem as a shortest path problem on a graph built on the pattern and the text [Ukkonen 1985a]. This reformulation has been conceptually useful for more complex variants of the problem.

5.2 Improving the Worst Case

5.2.1 Masek and Paterson (1980). It is interesting that one important worst-case theoretical result in this area is as old as the Sellers algorithm [Sellers 1980] itself. In 1980, Masek and Paterson [1980] found an algorithm whose worst case cost is O(mn / log_σ^2 n) and requires O(n) extra space. This is an improvement over the O(mn) classical complexity.

The algorithm is based on the Four-Russians technique [Arlazarov et al. 1975]. Basically, it replaces the alphabet Σ by r-tuples (i.e. Σ^r) for a small r. Considered algorithmically, it first builds a table of solutions of all the possible problems (i.e. portions of the matrix) of size r × r, and then uses the table to solve the original problem in blocks of size r. Figure 10 illustrates.

The values inside the r × r size cells depend on the corresponding letters in the pattern and the text, which gives σ^{2r} possibilities. They also depend on the values in the last column and row of the upper and left cells, as well as the bottom-right state of the upper left cell (see Figure 10). Since neighboring cells differ in at most one, there are only three choices for adjacent cells once the current cell is known. Therefore, this adds only m·3^{2r} possibilities. In total, there are m·(3σ)^{2r} different cells to precompute. Using O(n) memory we have enough space for r = log_{3σ} n, and since we finally compute mn/r^2 cells, the final complexity follows.

The algorithm is only of theoretical interest, since as the same authors estimate, it will not beat the classical algorithm for


Fig. 11. On the left, the O(k 2 ) algorithm to compute the edit distance. On
the right, the way to compute the strokes in diagonal transition algorithms.
The solid bold line is guaranteed to be part of the new stroke of e errors,
while the dashed part continues as long as both strings match.
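Ukkonen's stroke-based computation is involved, but the same diagonal property it rests on (values along a diagonal never decrease, and an alignment of cost at most k cannot leave the band of 2k + 1 diagonals around the main one) already yields a much simpler check of whether ed(x, y) ≤ k. The following Python sketch is mine, not Ukkonen's algorithm; it runs in O(k|x|) time rather than O(k^2):

```python
def within_distance(x, y, k):
    """Return ed(x, y) if it is <= k, else None. Only the band of 2k+1
    diagonals around the main one is filled: any alignment of cost <= k
    stays within k cells of the main diagonal."""
    if abs(len(x) - len(y)) > k:          # already too different in length
        return None
    INF = k + 1                            # stands for "more than k"
    prev = [j if j <= k else INF for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        lo, hi = max(1, i - k), min(len(y), i + k)
        curr = [INF] * (len(y) + 1)
        if i <= k:
            curr[0] = i
        for j in range(lo, hi + 1):
            curr[j] = min(prev[j - 1] + (x[i - 1] != y[j - 1]),  # match/sub
                          prev[j] + 1,                           # delete
                          curr[j - 1] + 1)                       # insert
        prev = curr
    d = prev[len(y)]
    return d if d <= k else None

print(within_distance("survey", "surgery", 2))   # 2
print(within_distance("survey", "surgery", 1))   # None
```

Cells outside the band are treated as "more than k," which never changes a result that is at most k; this is the "cut-off" flavor of the diagonal observation rather than the constant-time stroke computation described next.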

texts below 40 GB size (and it would need that extra space!). Adapting it to other distance functions does not seem difficult, but the dependencies among different cells may become more complex.

5.2.2 Ukkonen (1983). In 1983, Ukkonen [1985a] presented an algorithm able to compute the edit distance between two strings x and y in O(ed(x, y)^2) time, or to check in time O(k^2) whether that distance was ≤ k or not. This is the first member of what has been called "diagonal transition algorithms," since it is based on the fact that the diagonals of the dynamic programming matrix (running from the upper-left to the lower-right cells) are monotonically increasing (more than that, C_{i+1,j+1} ∈ {C_{i,j}, C_{i,j} + 1}). The algorithm is based on computing in constant time the positions where the values along the diagonals are incremented. Only O(k^2) such positions are computed to reach the lower-right decisive cell.

Figure 11 illustrates the idea. Each diagonal stroke represents a number of errors, and is a sequence where both strings match. When a stroke of e errors starts, it continues while the adjacent strokes of e − 1 errors continue, and then goes on as long as it keeps matching the text. To compute each stroke in constant time we need to know at what point it matches the text. The way to do this in constant time is explained shortly.

5.2.3 Landau and Vishkin (1985). In 1985 and 1986, Landau and Vishkin found the first worst-case time improvements for the search problem. All of them and the thread that followed were diagonal transition algorithms. In 1985, Landau and Vishkin [1988] showed an algorithm which was O(k^2 n) time and O(m) space, and in 1986 they obtained O(kn) time and O(n) space [Landau and Vishkin 1989].

The main idea in Landau and Vishkin was to adapt Ukkonen's diagonal transition algorithm for edit distance [Ukkonen 1985a] to text searching. Basically, the dynamic programming matrix was computed diagonal-wise (i.e. stroke by stroke) instead of column-wise. They wanted to compute the length of each stroke in constant time (i.e. the point where the values along a diagonal were to be incremented). Since a text position was to be reported when matrix row m was reached before incrementing more than k times the values along the diagonal, this immediately gave the O(kn) algorithm. Another way to see it is that each diagonal is abandoned as soon as the kth stroke ends; there are n diagonals and hence nk strokes, each of them computed in constant time (recall Figure 11).

A recurrence on diagonals (d) and number of errors (e), instead of rows (i) and columns (j), is set up in the following way:


Despite being conceptually clear, it is


not easy to find this node in constant time.
In 1986, the only existing LCA algorithm
was that of Harel and Tarjan [1984], which
had constant amortized time, i.e. it an-
Fig. 12. The diagonal transition matrix to search swered n0 > n LCA queries in O(n0 ) time.
"survey" in the text "surgery" with two errors. Bold
entries indicate matching diagonals. The rows are e
In our case we have kn queries, so each
values and the columns are the d values. one finally cost O(1). The resulting algo-
rithm, however, is quite slow in practice.
Ld ,−1 = Ln+1,e = −1, for all e, d
Ld ,|d |−2 = |d | − 2, for −(k + 1) ≤ d ≤ −1 5.2.4 Myers (1986). In 1986, Myers also
Ld ,|d |−1 = |d | − 1, for −(k + 1) ≤ d ≤ −1 found an algorithm with O(kn) worst-case
Ld ,e = i + max(Pi+1..i+` =Td +i+1..d +i+` ) behavior [Myers 1986a, 1986b]. It needed
` O(n) extra space, and shared the idea of
where i = max(Ld ,e−1 + 1, computing the k new strokes using the
Ld −1,e−1 , Ld +1,e−1 + 1) previous ones, as well as the use of a
suffix tree on the text for the LCA algo-
where the external loop updates e from 0 to k and the internal one updates d from −e to n. Negatively numbered diagonals are those virtually starting before the first text position. Figure 12 shows our search example using this recurrence.

Note that the L matrix has to be filled by diagonals, e.g. L_{0,3}, L_{1,2}, L_{2,1}, L_{0,4}, L_{1,3}, L_{2,2}, L_{0,5}, .... The difficult part is how to compute the strokes in constant time (i.e. the max_ℓ(·)). The problem is equivalent to knowing the longest prefix of P_{i..m} that matches T_{j..n}. This data is called “matching statistics.” The algorithms of this section differ basically in how they manage to compute the matching statistics quickly.

We defer the explanation of Landau and Vishkin [1988] for later (together with Galil and Park [1990]). In Landau and Vishkin [1989], the longest match is obtained by building the suffix tree (see Section 3.2) of T;P (text concatenated with pattern), which is where the huge O(n) extra space comes from. The longest prefix common to both suffixes P_{i..m} and T_{j..n} can be visualized in the suffix tree as follows: imagine the root-to-leaf paths that end in each of the two suffixes. Both paths share their beginning (at least they share the root). The last suffix tree node common to both paths represents a substring that is precisely the longest common prefix. In the literature, this last common node is called the lowest common ancestor (LCA) of the two nodes.

rithm. Unlike other algorithms, this one is able to report the O(kn) matching substrings of the text (not only the endpoints) in O(kn) time. This makes the algorithm suitable for more complex applications, for instance in computational biology. The original reference is a technical report that never went to press, but it has recently been included in a larger work [Landau et al. 1998].

5.2.5 Galil and Giancarlo (1988). In 1988, Galil and Giancarlo [1988] obtained the same time complexity as Landau and Vishkin using O(m) space. Basically, the suffix tree of the text is built by overlapping pieces of size O(m). The algorithm scans the text four times, being even slower than [Landau and Vishkin 1989]. Therefore, the result was of theoretical interest.

5.2.6 Galil and Park (1989). One year later, in 1989, Galil and Park [1990] obtained O(kn) worst-case time and O(m^2) space, worse in theory than Galil and Giancarlo [1988] but much better in practice. Their idea is rooted in the work of Landau and Vishkin [1988] (which had obtained O(k^2 n) time). In both cases, the idea is to build the matching statistics of the pattern against itself (longest match between P_{i..m} and P_{j..m}), resembling in some sense the basic ideas of Knuth et al. [1977]. But this algorithm is still slow in practice.
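In concrete terms, the matching statistics used above are answers to longest-common-prefix queries between suffixes. The brute-force sketch below (ours, in Python) only illustrates what is being computed; the suffix-tree and LCA machinery of these algorithms answers each query in constant time after linear-time preprocessing.

```python
def lcp(a, b, i, j):
    """Length of the longest common prefix of a[i:] and b[j:] (0-based)."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n

# lcp(P, T, i, j) is the longest prefix of P[i:] matching T[j:], i.e. the
# quantity the "matching statistics" make available in O(1) per query.
P, T = "survey", "surgery"
stats = [lcp(P, T, 0, j) for j in range(len(T))]
```

This quadratic-time loop is of course exactly what the suffix-tree constructions avoid.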

ACM Computing Surveys, Vol. 33, No. 1, March 2001.


50 G. Navarro

Fig. 13. On the left, the progress of the stroke-wise algorithm. The relevant
strokes are enclosed in a dotted triangle and the last k strokes computed are in
bold. On the right, the selection of the k relevant strokes to cover the last text
area. We put in bold the parts of the strokes that are used.

Consider again Figure 11, and in particular the new stroke with e errors at the right. The beginning of the stroke is dictated by the three neighboring strokes of e − 1 errors, but after the longest of the three ceases to affect the new stroke, how long it continues (dashed line) depends only on the similarity between pattern and text. More specifically, if the dotted line (suffix of a stroke) at diagonal d spans rows i1 to i1 + ℓ, the longest match between T_{d+i1..} and P_{i1..} has length ℓ. Therefore, the strokes computed by the algorithm give some information about longest matches between text and pattern. The difficult part is how to use that information.

Figure 13 illustrates the algorithm. As explained, the algorithm progresses by strokes, filling the matrix of Figure 12 diagonally, so that when a stroke is computed, its three neighbors have already been computed. We have enclosed in a dotted triangle the strokes that may contain the information on longest matches relevant to the new strokes being computed. The algorithm of Landau and Vishkin [1988] basically searches for the relevant information in this triangle and hence takes O(k^2 n) time.

This is improved in Galil and Park [1990] to O(kn) by considering the relevant strokes carefully. Let us call e-stroke a stroke with e errors. First consider a 0-stroke. This full stroke (not only a suffix) represents a longest match between pattern and text. So, from the k previous 0-strokes we can keep the one that lasts longest in the text, and up to that text position we have all the information we need about longest matches. We consider now all the 1-strokes. Although only a suffix of those strokes really represents a longest match between pattern and text, we know that this is definitely true after the last text position reached by a 0-stroke (since by then no 0-stroke can “help” a 1-stroke to last longer). Therefore, we can keep the 1-stroke that lasts longest in the text and use it to define longest matches between pattern and text when there are no more active 0-strokes. This argument continues for all the k errors, showing that in fact the complete relevant text area can be covered with just k strokes. Figure 13 (right) illustrates this idea.

The algorithm of Galil and Park [1990] basically keeps this list of k relevant strokes³ up to date all the time. Each time a new e-stroke is produced, it is compared against the current relevant e-stroke, and if the new one lasts longer in the text than the old one, it replaces the old stroke. Since the algorithm progresses in the text, old strokes are naturally eliminated with this procedure.

³ Called “reference triples” there.

A final problem is how to use the indirect information given by the relevant strokes to compute the longest matches

between pattern and text. What we have is a set of longest matches covering the text area of interest, plus the precomputed longest matches of the pattern against itself (starting at any position). We now know where the dashed line of Figure 11 starts (say it is P_{i1} and T_{d+i1}) and want to compute its length. To know where the longest match between pattern and text ends, we find the relevant stroke where the beginning of the dashed line falls. That stroke represents a maximal match between T_{d+i1..} and some P_{j1..}. As we know by preprocessing the longest match between P_{i1..} and P_{j1..}, we can derive the longest match between P_{i1..} and T_{d+i1..}. There are some extra complications to take care of when both longest matches end at the same position or one has length zero, but all of them can be sorted out in O(k) time per diagonal of the L matrix.

Finally, Galil and Park show that the O(m^2) extra space needed to store the matrix of longest matches can be reduced to O(m) by using a suffix tree of the pattern (not the text as in previous work) and LCA algorithms, so we add different entries in Figure 7 (note that Landau and Vishkin [1988] already had O(m) space). Galil and Park also show how to add transpositions to the edit operations at the same complexity. This technique can be extended to all these diagonal transition algorithms. We believe that allowing different integral costs for the operations or forbidding some of them can be achieved with simple modifications of the algorithms.

5.2.7 Ukkonen and Wood (1990). An idea similar to that of using the suffix tree of the pattern (and similarly slow in practice) was independently discovered by Ukkonen and Wood in 1990 [Ukkonen and Wood 1993]. They use a suffix automaton (described in Section 3.2) on the pattern to find the matching statistics, instead of the table. As the algorithm progresses over the text, the suffix automaton keeps count of the pattern substrings that match the text at any moment. Although they report O(m^2) space for the suffix automaton, it can take O(m) space.

5.2.8 Chang and Lawler (1994). In 1990, Chang and Lawler [1994] repeated the idea that was briefly mentioned in Galil and Park [1990]: that matching statistics can be computed using the suffix tree of the pattern and LCA algorithms. However, they used a newer and faster LCA algorithm [Schieber and Vishkin 1988], truly O(1), and reported the best time among algorithms with guaranteed O(kn) performance. However, the algorithm is still not competitive in practice.

5.2.9 Cole and Hariharan (1998). In 1998, Cole and Hariharan [1998] presented an algorithm with worst case O(n(1 + k^c/m)), where c = 3 if the pattern is “mostly aperiodic” and c = 4 otherwise.⁴ The idea is that, unless a pattern has a lot of self-repetition, only a few diagonals of a diagonal transition algorithm need to be computed.

This algorithm can be thought of as a filter (see the following sections) with worst-case guarantees useful for very small k. It resembles some ideas about filters developed in Chang and Lawler [1994]. Probably other filters can be proved to have good worst cases under some periodicity assumptions on the pattern, but this thread has not been explored up to now. This algorithm is an improvement over a previous one [Sahinalp and Vishkin 1997], which is more complex and has a worse complexity, namely O(nk^8 (α log* n)^(1/log 3)). In any case, the interest of this work is theoretical too.

⁴ The definition of “mostly aperiodic” is rather technical and related to the number of auto-repetitions that occur in the pattern. Most patterns are “mostly aperiodic.”

5.3 Improving the Average Case

5.3.1 Ukkonen (1985). The first improvement to the average case is due to Ukkonen in 1985. The algorithm, a short note at the end of Ukkonen [1985b], improved the dynamic programming algorithm to O(kn) average time and O(m) space. This algorithm was later called the “cut-off heuristic.” The main idea is that, since a pattern does not normally match in the text, the values at each column

(from top to bottom) quickly reach k + 1 (i.e. mismatch), and that if a cell has a value larger than k + 1, the result of the search does not depend on its exact value. A cell is called active if its value is at most k. The algorithm simply keeps count of the last active cell and avoids working on the rest of the cells.

To keep the last active cell, we must be able to recompute it for each new column. At each new column, the last active cell can be incremented by at most one, so we check whether we have activated the next cell at O(1) cost. However, it is also possible that the last active cell now becomes inactive. In this case we have to search upwards for the new last active cell. Although we can work O(m) in a given column, we cannot work more than O(n) overall, because there are at most n increments of this value in the whole process, and hence there are no more than n decrements. Hence, the last active cell is maintained at O(1) amortized cost per column.

Ukkonen conjectured that this algorithm was O(kn) on average, but this was proven only in 1992 by Chang and Lampe [1992]. The proof was refined in 1996 by Baeza-Yates and Navarro [1999]. The result can probably be extended to more complex distance functions, although with substrings the last active cell must exceed k by enough to ensure that it can never return to a value smaller than k. In particular, it must have the value k + 2 if transpositions are allowed.

5.3.2 Myers (1986). An algorithm in Myers [1986a] is based on diagonal transitions like those in the previous sections, but the strokes are simply computed by brute force. Myers showed that the resulting algorithm was O(kn) on average. This is clear because the length of the strokes is σ/(σ − 1) = O(1) on average. The same algorithm was proposed again in 1989 by Galil and Park [1990]. Since only the k strokes need to be stored, the space is O(k).

5.3.3 Chang and Lampe (1992). In 1992, Chang and Lampe [1992] produced a new algorithm called “column partitioning,” based on exploiting a different property of the dynamic programming matrix. They again consider the fact that, along each column, the numbers are normally increasing. They work on “runs” of consecutive increasing cells (a run ends when C_{i+1} ≠ C_i + 1). They manage to work O(1) per run in the column actualization process.

To update each run in constant time, they precompute loc(j, x) = min { j′ ≥ j : P_{j′} = x } for all pattern positions j and all characters x (hence it needs O(mσ) space). At each column of the matrix, they consider the current text character x and the current row j, and know in constant time where the run is going to end (i.e. the next character match). The run can end before this, namely where the parallel run of the previous column ends.

Based on empirical observations, they conjecture that the average length of the runs is O(√σ). Notice that this matches our result that the average edit distance is m(1 − e/√σ), since this is the number of increments along columns, and therefore there are O(m/√σ) nonincrements (i.e. runs). From there it is clear that each run has average length O(√σ). Therefore, we have just proved Chang and Lampe’s conjecture.

Since the paper uses the cut-off heuristic of Ukkonen, their average search time is O(kn/√σ). This is, in practice, the fastest algorithm of this class.

Unlike the other algorithms in this section, it seems difficult to adapt [Chang and Lampe 1992] to other distance functions, since their idea relies strongly on the unitary costs. It is mentioned that the algorithm could run in average time O(kn log log(m)/σ), but it would not be practical.

6. ALGORITHMS BASED ON AUTOMATA

This area is also rather old. It is interesting because it gives the best worst-case time algorithm (O(n), which matches the lower bound of the problem). However, there is a time and space exponential dependence on m and k that limits its practicality.

Fig. 14. Taxonomy of algorithms based on deterministic automata. References are shortened
to first letters (single authors) or initials (multiple authors), and to the last two digits of years.
Key: Ukk85b = [Ukkonen 1985b], MP80 = [Masek and Paterson 1980], Mel96 = [Melichar 1996],
Kur96 = [Kurtz 1996], Nav97b = [Navarro 1997b], and WMM96 = [Wu et al. 1996].

We first present the basic solution and then discuss the improvements. Figure 14 shows the historical map of this area.

6.1 An Automaton for Approximate Search

An alternative and very useful way to consider the problem is to model the search with a nondeterministic automaton (NFA). This automaton (in its deterministic form) was first proposed in Ukkonen [1985b], and first used in nondeterministic form (although implicitly) in Wu and Manber [1992b]. It is shown explicitly in Baeza-Yates [1991], Baeza-Yates [1996], and Baeza-Yates and Navarro [1999].

Consider the NFA for k = 2 errors under the edit distance shown in Figure 15. Every row denotes the number of errors seen (the first row zero, the second row one, etc.). Every column represents matching a pattern prefix. Horizontal arrows represent matching a character (i.e. if the pattern and text characters match, we advance in the pattern and in the text). All the others increment the number of errors (move to the next row): vertical arrows insert a character in the pattern (we advance in the text but not in the pattern), solid diagonal arrows substitute a character (we advance in the text and pattern), and dashed diagonal arrows delete a character of the pattern (they are ε-transitions, since we advance in the pattern without advancing in the text). The initial self-loop allows a match to start anywhere in the text. The automaton signals (the end of) a match whenever a rightmost state is active. If we do not care about the number of errors in the occurrences, we can consider as final states those of the last full diagonal.

It is not hard to see that once a state in the automaton is active, all the states of the same column and higher numbered rows are active too. Moreover, at a given text character, if we collect the smallest active rows at each column, we obtain the vertical vector of the dynamic programming algorithm (in this case [0, 1, 2, 3, 3, 3, 2]; compare to Figure 9).

Other types of distances (Hamming, LCS, and Episode) are obtained by deleting some arrows of the automaton. Different integer costs for the operations can also be modeled by changing the arrows. For instance, if insertions cost 2 instead of 1, we make the vertical arrows move from rows i to rows i + 2. Transpositions are modeled by adding an extra state S_{i,j} between each pair of states

Fig. 15. An NFA for approximate string matching of the pattern "survey" with two errors. The
shaded states are those active after reading the text "surgery".

at position (i, j) and (i + 1, j + 2), and arrows labeled P_{i+2} from state (i, j) to S_{i,j} and P_{i+1} between S_{i,j} and (i + 1, j + 2) [Melichar 1996]. Adapting to general substring substitution needs more complex setups, but it is always possible.

This automaton can simply be made deterministic to obtain O(n) worst-case search time. However, as we see next, the main problem becomes the construction of the DFA (deterministic finite automaton). An alternative solution is based on simulating the NFA instead of making it deterministic.

6.2 Implementing the Automaton

6.2.1 Ukkonen (1985). In 1985, Ukkonen proposed the idea of a deterministic automaton for this problem [Ukkonen 1985b]. However, an automaton like that of Figure 15 was not explicitly considered. Rather, each possible set of values for the columns of the dynamic programming matrix is a state of the automaton. Once the set of all possible columns and the transitions among them were built, the text was scanned with the resulting automaton, performing exactly one transition per character read.

The big problem with this scheme was that the automaton had a potentially huge number of states, which had to be built and stored. To improve space usage, Ukkonen proved that all the elements in the columns that were larger than k + 1 could be replaced by k + 1 without affecting the output of the search (the lemma was used in the same paper to design the cut-off heuristic described in Section 5.3). This reduced the potential number of different columns. He also showed that adjacent cells in a column differ by at most one. Hence, the column states could be defined as a vector of m incremental values in the set {−1, 0, 1}. All this made it possible in Ukkonen [1985b] to obtain a nontrivial bound on the number of states of the automaton, namely O(min(3^m, m(2mσ)^k)). This size, although much better than the obvious O((k + 1)^m), is still very large except for short patterns or very low error levels. The resulting space complexity of the algorithm is m times the above value. This exponential space complexity has to be added to the O(n) time complexity, as the preprocessing time to build the automaton.

As a final comment, Ukkonen suggested that the columns could be computed only partially, up to, say, 3k/2 entries. Since he conjectured (and was later proved correct in Chang and Lampe [1992]) that the columns of interest were O(k) on average, this would normally not affect the algorithm, though it would reduce the number of possible states. If at some point the states not computed were really

Fig. 16. On the left, the automaton of Ukkonen [1985b] where each column is
a state. On the right, the automaton of Wu et al. [1996] where each region is a
state. Both compute the columns of the dynamic programming matrix.

needed, the algorithm would compute them by dynamic programming.

Notice that to incorporate transpositions and substring substitutions into this conception, we need to consider that each state is the set of the j last columns of the dynamic programming matrix, where j is the longest left-hand side of a rule. In this case it is better to build the automaton of Figure 15 explicitly and make it deterministic.

6.2.2 Wu, Manber, and Myers (1992). It was not until 1992 that Wu et al. looked into this problem again [Wu et al. 1996]. The idea was to trade time for space using a Four Russians technique [Arlazarov et al. 1975]. Since the cells could be expressed using only values in {−1, 0, 1}, the columns were partitioned into blocks of r cells (called “regions”) which took 2r bits each. Instead of precomputing the transitions from a whole column to the next, the transitions from a region to the next region in the column were precomputed, although the current region could now depend on three previous regions (see Figure 16). Since the regions were smaller than the columns, much less space was necessary. The total amount of work was O(m/r) per column in the worst case, and O(k/r) on average. The space requirement was exponential in r. By using O(n) extra space, the algorithm was O(kn/log n) on average and O(mn/log n) in the worst case. Notice that this shares the Four Russians approach with [Masek and Paterson 1980], but there is an important difference: the states in this case do not depend on the letters of the pattern and text. The states of the “automaton” of Masek and Paterson [1980], on the other hand, depend on the text and pattern.

This Four Russians approach is so flexible that this work was extended to handle regular expressions allowing errors [Wu et al. 1995]. The technique for exact regular expression searching is to pack portions of the deterministic automaton in bits and compute transition tables for each portion. The few transitions among portions are left nondeterministic and simulated one by one. To allow errors, each state is no longer simply active or inactive; rather, it keeps count of the minimum number of errors that makes it active, in O(log k) bits.

6.2.3 Melichar (1995). In 1995, Melichar [1996] again studied the size of the deterministic automaton. By considering the properties of the NFA of Figure 15, he refined the bound of Ukkonen [1985b] to O(min(3^m, m(2mt)^k, (k + 2)^(m−k) (k + 1)!)), where t = min(m + 1, σ). The space complexity and preprocessing time of the automaton are t times the number of states. Melichar also conjectured that this automaton is bigger when there are periodicities in the pattern, which matches the results of Cole and Hariharan [1998] (Section 5.2), in the sense that periodic

patterns are more problematic. This is in fact a property shared with other problems in string matching.

6.2.4 Kurtz (1996). In 1996, Kurtz [1996] proposed another way to reduce the space requirements to at most O(mn). It is an adaptation of Baeza-Yates and Gonnet [1994], who first proposed it for the Hamming distance. The idea was to build the automaton in lazy form, i.e. build only the states and transitions actually reached in the processing of the text. The automaton starts as just one initial state, and the states and transitions are built as needed. By doing this, all those transitions that Ukkonen [1985b] considered and that were not necessary are in fact never built, without the need to guess. The price is the extra overhead of a lazy construction versus a direct construction, but the idea pays off. Kurtz also proposed building only the initial part of the automaton (which should contain the most commonly traversed states) to save space.

Navarro [1997b; 1998] studied the growth of the complete and lazy automata as a function of m, k, and n (this last value for the lazy automaton only). The empirical results show that the lazy automaton grows with the text at a rate of O(n^β), for 0 < β < 1, depending on σ, m, and k. Some replacement policies designed to work with bounded memory are proposed in Navarro [1998].

7. BIT-PARALLELISM

These algorithms are based on exploiting the parallelism of the computer when it works on bits. This is also a new (after 1990) and very active area. The basic idea is to “parallelize” another algorithm using bits. The results are interesting from the practical point of view, and are especially significant when short patterns are involved (typical in text retrieval). They may work effectively for any error level.

In this section we find elements that could strictly belong to other sections, since we parallelize other algorithms. There are two main trends: parallelize the work of the nondeterministic automaton that solves the problem (Figure 15), or parallelize the work of the dynamic programming matrix.

We first explain the technique and then the results achieved by using it. Figure 17 shows the historical development of this area.

7.1 The Technique of Bit-Parallelism

This technique, in common use in string matching [Baeza-Yates 1991; 1992], was introduced in the Ph.D. thesis of Baeza-Yates [1989]. It consists of taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By using this fact cleverly, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in a computer word. Since in current architectures w is 32 or 64, the speedup is very significant in practice and improves with technological progress. In order to relate the behavior of bit-parallel algorithms to other work, it is normally assumed that w = Θ(log n), as dictated by the RAM model of computation. We, however, prefer to keep w as an independent value. We now introduce some notation we use for bit-parallel algorithms.

—The length of a computer word (in bits) is w.

—We denote as b_ℓ..b_1 the bits of a mask of length ℓ. This mask is stored somewhere inside the computer word. Since the length w of the computer word is fixed, we are hiding the details on where we store the ℓ bits inside it.

—We use exponentiation to denote bit repetition (e.g. 0^3 1 = 0001).

—We use C-like syntax for operations on the bits of computer words: “|” is the bitwise-or, “&” is the bitwise-and, “^” is the bitwise-xor, and “∼” complements all the bits. The shift-left operation, “<<,” moves the bits to the left and enters zeros from the right, i.e. b_m b_{m−1}..b_2 b_1 << r = b_{m−r}...b_2 b_1 0^r. The shift-right “>>” moves the bits in the other direction. Finally, we

Fig. 17. Taxonomy of bit-parallel algorithms. References are shortened to first letters
(single authors) or initials (multiple authors), and to the last two digits of years.
Key: BY89 = [Baeza-Yates 1989], WM92b = [Wu and Manber 1992b], Wri94 = [Wright
1994], BYN99 = [Baeza-Yates and Navarro 1999], and Mye99 = [Myers 1999].

can perform arithmetic operations on the bits, such as addition and subtraction, which operate on the bits as if they formed a number. For instance, b_ℓ..b_x 10000 − 1 = b_ℓ..b_x 01111.

We now explain the first bit-parallel algorithm, Shift-Or [Baeza-Yates and Gonnet 1992], since it is the basis of much of what follows. The algorithm searches a pattern in a text (without errors) by parallelizing the operation of a nondeterministic finite automaton that looks for the pattern. Figure 18 illustrates this automaton.

This automaton has m + 1 states, and can be simulated in its nondeterministic form in O(mn) time. The Shift-Or algorithm achieves O(mn/w) worst-case time (i.e. optimal speedup). Notice that if we convert the nondeterministic automaton to a deterministic one with O(n) search time, we get an improved version of the KMP algorithm [Knuth et al. 1977]. However, KMP is twice as slow for m ≤ w.

The algorithm first builds a table B which, for each character c, stores a bit mask B[c] = b_m..b_1. The mask in B[c] has the bit b_i set if and only if P_i = c. The state of the search is kept in a machine word D = d_m..d_1, where d_i is 1 whenever P_{1..i} matches the end of the text read up to now (i.e. the state numbered i in Figure 18 is active). Therefore, a match is reported whenever d_m = 1.

D is set to 0^m originally, and for each new text character T_j, D is updated using the formula⁵

    D′ ← ((D << 1) | 0^(m−1) 1) & B[T_j]

⁵ The real algorithm uses the bits with the inverse meaning, and therefore the operation “| 0^(m−1) 1” is not necessary. We preferred to explain this more didactic version.

Fig. 18. Nondeterministic automaton that searches "survey" exactly.

The formula is correct because the ith bit is set if and only if the (i − 1)th bit was set for the previous text character and the new text character matches the pattern at position i. In other words, T_{j−i+1..j} = P_{1..i} if and only if T_{j−i+1..j−1} = P_{1..i−1} and T_j = P_i. It is possible to relate this formula to the movement that occurs in the nondeterministic automaton for each new text character: each state gets the value of the previous state, but this happens only if the text character matches the corresponding arrow.

For patterns longer than the computer word (i.e. m > w), the algorithm uses ⌈m/w⌉ computer words for the simulation (not all of them are active all the time). The algorithm is O(n) on average.

It is easy to extend Shift-Or to handle classes of characters. In this extension, each position in the pattern matches a set of characters rather than a single character. The classical string matching algorithms are not extended so easily. In Shift-Or, it is enough to set the ith bit of B[c] for every c ∈ P_i (P_i is now a set). For instance, to search for "survey" in case-insensitive form, we just set to 1 the first bit of B["s"] and B["S"], and the same with the rest. Shift-Or can also search for multiple patterns (where the complexity is O(mn/w) if we consider that m is the total length of all the patterns); it was later enhanced [Wu and Manber 1992b] to support a larger set of extended patterns and even regular expressions.

Many online text algorithms can be seen as implementations of an automaton (classically, in its deterministic form). Bit-parallelism has since its invention become a general way to simulate simple nondeterministic automata instead of converting them to deterministic form. It has the advantage of being much simpler, in many cases faster (since it makes better use of the registers of the computer word), and easier to extend to handle complex patterns than its classical counterparts. Its main disadvantage is the limitation it imposes on the size of the computer word; in many cases its adaptations to cope with longer patterns are not very efficient.

7.2 Parallelizing Nondeterministic Automata

7.2.1 Wu and Manber (1992). In 1992, Wu and Manber [1992b] published a number of ideas that had a great impact on the future of practical text searching. They first extended the Shift-Or algorithm to handle wild cards (i.e. allow an arbitrary number of characters between two given positions in the pattern) and regular expressions (the most flexible pattern that can be searched efficiently). Of more interest to us is that they presented a simple scheme to combine any of the preceding extensions with approximate string matching.

The idea is to simulate, using bit-parallelism, the NFA of Figure 15, so that each row i of the automaton fits in a computer word R_i (each state is represented by a bit). For each new text character, all the transitions of the automaton are simulated using bit operations among the k + 1 computer words. Notice that all the k + 1 computer words have the same structure (i.e. the same bit is aligned on the same text position). The update formula to obtain the new R′_i values at text position j from the current R_i values is

    R′_0 = ((R_0 << 1) | 0^(m−1) 1) & B[T_j]
    R′_{i+1} = ((R_{i+1} << 1) & B[T_j]) | R_i | (R_i << 1) | (R′_i << 1)

and we start the search with R_i = 0^(m−i) 1^i. As expected, R_0 undergoes a simple

Shift-Or process, while the other rows receive ones (i.e. active states) from previous rows as well. The four terms in the formula for R′_{i+1}, in that order, correspond to the horizontal, vertical, diagonal, and dashed diagonal arrows.

The cost of this simulation is O(k⌈m/w⌉n) in the worst and average case, which is O(kn) for patterns typical in text searching (i.e. m ≤ w). This is a perfect speedup over the serial simulation of the automaton, which would cost O(mkn) time. Notice that for short patterns, this is competitive to the best worst-case algorithms.

Thanks to the simplicity of the construction, the rows of the pattern can be changed by a different automaton. As long as one is able to solve a problem for exact string matching, make k + 1 copies of the resulting computer word, and perform the same operations in the k + 1 words (plus the arrows that connect the words), one has an algorithm to find the same pattern allowing errors. Hence, with this algorithm one is able to perform approximate string matching with sets of characters, wild cards, and regular expressions. The algorithm also allows some extensions unique to approximate searching: a part of the pattern can be searched with errors while another may be forced to match exactly, and different integer costs of the edit operations can be accommodated (including not allowing some of them). Finally, one is able to search a set of patterns at the same time, but this capability is very limited (since all the patterns must fit in a computer word).

The great flexibility obtained encouraged the authors to build a software called Agrep [Wu and Manber 1992a],⁶ where all these capabilities are implemented (although some particular cases are solved in a different manner). This software has been taken as a reference in all the subsequent research.

⁶ Available at ftp.cs.arizona.edu.

7.2.2 Baeza-Yates and Navarro (1996). In 1996, Baeza-Yates and Navarro presented a new bit-parallel algorithm able to parallelize the computation of the automaton even more [Baeza-Yates and Navarro 1999]. The classical dynamic programming algorithm can be thought of as a column-wise "parallelization" of the automaton [Baeza-Yates 1996]; Wu and Manber [1992b] proposed a row-wise parallelization. Neither algorithm was able to increase the parallelism (even if all the NFA states fit in a computer word) because of the ε-transitions of the automaton, which caused what we call zero-time dependencies. That is, the current values of two rows or two columns depend on each other, and hence cannot be computed in parallel.

In Baeza-Yates and Navarro [1999] the bit-parallel formula for a diagonal parallelization was found. They packed the states of the automaton along diagonals instead of rows or columns, which run in the same direction of the diagonal arrows (notice that this is totally different from the diagonals of the dynamic programming matrix). This idea had been mentioned much earlier by Baeza-Yates [1991] but no bit-parallel formula was found. There are m − k + 1 complete diagonals (the others are not really necessary), which are numbered from 0 to m − k. The number D_i is the row of the first active state in diagonal i (all the subsequent states in the diagonal are active because of the ε-transitions). The new D′_i values after reading text position j are computed as

D′_i = min(D_i + 1, D_{i+1} + 1, g(D_{i−1}, T_j))

where the first term represents the substitutions, the second term the insertions, and the last term the matches (deletions are implicit since we represent only the lowest-row active state of each diagonal). The main problem is how to compute the function g, defined as

g(D_i, T_j) = min({k + 1} ∪ {r : r ≥ D_i ∧ P_{i+r} = T_j})

Notice that an active state that crosses a horizontal edge has to propagate all the way down by the diagonal. This was finally solved in 1996 [Baeza-Yates and Navarro 1999; Navarro 1998] by representing


the D_i values in unary form and using arithmetic operations on the bits which have the desired propagation effects. The formula can be understood either numerically (operating the D_i's) or logically (simulating the arrows of the automaton). The resulting algorithm is O(n) worst case time and very fast in practice if all the bits of the automaton fit in the computer word (while Wu and Manber [1992b] keeps O(kn)). In general, it is O(⌈k(m − k)/w⌉n) worst case time, and O(⌈k²/w⌉n) on average since the Ukkonen cut-off heuristic is used (see Section 5.3). The scheme can handle classes of characters, wild cards and different integral costs in the edit operations.

7.3 Parallelizing the Dynamic Programming Matrix

7.3.1 Wright (1994). In 1994, Wright [1994] presented the first work using bit-parallelism on the dynamic programming matrix. The idea was to consider secondary diagonals (i.e. those that run from the upper-right to the bottom-left) of the matrix. The main observation is that the elements of the matrix follow the recurrence⁷

C_{i,j} = C_{i−1,j−1}      if P_i = T_j  or  C_{i−1,j} = C_{i−1,j−1} − 1  or  C_{i,j−1} = C_{i−1,j−1} − 1
C_{i,j} = C_{i−1,j−1} + 1  otherwise

which shows that the new secondary diagonal can be computed using the two previous ones. The algorithm stores the differences between C_{i,j} and C_{i−1,j−1} and represents the recurrence using modulo 4 arithmetic. The algorithm packs many pattern and text characters in a computer word and performs in parallel a number of pattern versus text comparisons, then using the vector of the results of the comparisons to update many cells of the diagonal in parallel. Since it has to store characters of the alphabet in the bits, the algorithm is O(nm log(σ)/w) in the worst and average case. This was competitive at that time for very small alphabets (e.g. DNA). As the author recognizes, it seems quite difficult to adapt this algorithm for other distance functions.

⁷ The original one in Wright [1994] has errors.

7.3.2 Myers (1998). In 1998, Myers [1999] found a better way to parallelize the computation of the dynamic programming matrix. He represented the differences along columns instead of the columns themselves, so that two bits per cell were enough (in fact this algorithm can be seen as the bit-parallel implementation of the automaton which is made deterministic in Wu et al. [1996], see Section 6.2). A new recurrence is found where the cells of the dynamic programming matrix are expressed using horizontal and vertical differences, i.e. Δv_{i,j} = C_{i,j} − C_{i−1,j} and Δh_{i,j} = C_{i,j} − C_{i,j−1}:

Δv_{i,j} = min(−Eq_{i,j}, Δv_{i,j−1}, Δh_{i−1,j}) + (1 − Δh_{i−1,j})
Δh_{i,j} = min(−Eq_{i,j}, Δv_{i,j−1}, Δh_{i−1,j}) + (1 − Δv_{i,j−1})

where Eq_{i,j} is 1 if P_i = T_j and zero otherwise. The idea is to keep packed binary vectors representing the current (i.e. j-th) values of the differences, and finding the way to update the vectors in a single operation. Each cell C_{i,j} is seen as a small processor that receives inputs Δv_{i,j−1}, Δh_{i−1,j}, and Eq_{i,j} and produces outputs Δv_{i,j} and Δh_{i,j}. There are 3 × 3 × 2 = 18 possible inputs, and a simple formula is found to express the cell logic (unlike Wright [1994], the approach is logical rather than arithmetical). The hard part is to parallelize the work along the column because of the zero-time dependency problem. The author finds a solution which, despite the fact that a very different model is used, resembles that of Baeza-Yates and Navarro [1999].

The result is an algorithm that uses the bits of the computer word better, with a worst case of O(⌈m/w⌉n) and an average case of O(⌈k/w⌉n) since it uses the Ukkonen cut-off (Section 5.3). The update formula is a little more complex than that


of Baeza-Yates and Navarro [1999] and hence the algorithm is a bit slower, but it adapts better to longer patterns because fewer computer words are needed.

As it is difficult to surpass O(kn) algorithms, this algorithm may be the last word with respect to asymptotic efficiency of parallelization, except for the possibility to parallelize an O(kn) worst case algorithm. As it is now common to expect of bit-parallel algorithms, this scheme is able to search some extended patterns as well, but it seems difficult to adapt it to other distance functions.

8. FILTERING ALGORITHMS

Our last category is quite new, starting in 1990 and still very active. It is formed by algorithms that filter the text, quickly discarding text areas that do not match. Filtering algorithms address only the average case, and their major interest is the potential for algorithms that do not inspect all text characters. The major theoretical achievement is an algorithm with average cost O(n(k + logσ m)/m), which was proven optimal. In practice, filtering algorithms are the fastest too. All of them, however, are limited in their applicability by the error level α. Moreover, they need a nonfilter algorithm to check the potential matches.

We first explain the general concept and then consider the developments that have occurred in this area. See Figure 19.

8.1 The Concept of Filtering

Filtering is based on the fact that it may be much easier to tell that a text position does not match than to tell that it matches. For instance, if neither "sur" nor "vey" appear in a text area, then "survey" cannot be found there with one error under the edit distance. This is because a single edit operation cannot alter both halves of the pattern.

Most filtering algorithms take advantage of this fact by searching pieces of the pattern without errors. Since the exact searching algorithms can be much faster than approximate searching ones, filtering algorithms can be very competitive (in fact, they dominate in a large range of parameters).

It is important to notice that a filtering algorithm is normally unable to discover the matching text positions by itself. Rather, it is used to discard (hopefully large) areas of the text that cannot contain a match. For instance, in our example, it is necessary that either "sur" or "vey" appear in an approximate occurrence, but it is not sufficient. Any filtering algorithm must be coupled with a process that verifies all those text positions that could not be discarded by the filter.

Virtually any nonfiltering algorithm can be used for this verification, and in many cases the developers of a filtering algorithm do not care to look for the best verification algorithm, but just use the dynamic programming algorithm. The selection is normally independent, but the verification algorithm must behave well on short texts because it can be started at many different text positions to work on small text areas. By careful programming it is almost always possible to keep the worst-case behavior of the verifying algorithm (i.e. avoid verifying overlapping areas).

Finally, the performance of filtering algorithms is very sensitive to the error level α. Most filters work very well on low error levels and very badly otherwise. This is related to the amount of text that the filter is able to discard. When evaluating filtering algorithms, it is important not only to consider their time efficiency but also their tolerance for errors. One possible measure for this filtration efficiency is the total number of matches found divided by the total number of potential matches pointed out by the filtration algorithm [Sutinen 1998].

A term normally used when referring to filters is "sublinearity." It is said that a filter is sublinear when it does not inspect all the characters in the text (like the Boyer–Moore algorithms [Boyer and Moore 1977] for exact searching, which can at best be O(n/m)). However, no online algorithm can be truly sublinear, i.e. o(n), if m is independent of n. This is only achievable with indexing algorithms.


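The filter-then-verify loop described above can be made concrete with the running example (neither "sur" nor "vey" present means no match of "survey" with one error). The sketch below is ours, not taken from any of the cited implementations: it splits the pattern into k + 1 pieces, locates exact piece occurrences with plain string search, and verifies each candidate neighborhood of length at most m + 2k with a standard dynamic-programming check; all function names are illustrative.

```python
def min_edit_distance(p, window):
    """Smallest edit distance between p and any substring of window:
    row 0 is all zeros so a match may start anywhere, and the minimum
    of the last row lets it end anywhere (Sellers' variant)."""
    prev = [0] * (len(window) + 1)
    for i, pc in enumerate(p, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                 # pattern char unmatched
                           cur[j - 1] + 1,              # text char unmatched
                           prev[j - 1] + (pc != wc)))   # match / substitution
        prev = cur
    return min(prev)

def approx_find(pattern, text, k):
    """Filter then verify: k errors cannot touch all of k + 1 pieces,
    so some piece must appear verbatim; only the neighborhood of such
    an exact piece occurrence needs verification."""
    m, pieces = len(pattern), k + 1
    bounds = [m * i // pieces for i in range(pieces + 1)]
    hits = set()
    for p in range(pieces):
        off, piece = bounds[p], pattern[bounds[p]:bounds[p + 1]]
        start = text.find(piece)
        while start != -1:
            lo = max(0, start - off - k)                  # candidate area of
            hi = min(len(text), start - off + m + k)      # length <= m + 2k
            if min_edit_distance(pattern, text[lo:hi]) <= k:
                hits.add(lo)
            start = text.find(piece, start + 1)
    return sorted(hits)
```

For instance, approx_find("survey", text, 1) reports the start of each text area containing "survey" with at most one error, while areas lacking both "sur" and "vey" are never verified at all.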
Fig. 19. Taxonomy of filtering algorithms. Complexities are all on average. References are shortened
to first letters (single authors) or initials (multiple authors), and to the last two digits of years.
Key: TU93 = [Tarhio and Ukkonen 1993], JTU96 = [Jokinen et al. 1996], Nav97a = [Navarro 1997a],
CL94 = [Chang and Lawler 1994], Ukk92 = [Ukkonen 1992], BYN99 = [Baeza-Yates and Navarro 1999],
WM92b = [Wu and Manber 1992b], BYP96 = [Baeza-Yates and Perleberg 1996], Shi96 = [Shi 1996],
NBY99c = [Navarro and Baeza-Yates 1999c], Tak94 = [Takaoka 1994], CM94 = [Chang and Marr 1994],
NBY98a = [Navarro and Baeza-Yates 1998a], NR00 = [Navarro and Raffinot 2000], ST95 = [Sutinen
and Tarhio 1995], and GKHO97 = [Giegerich et al. 1997].

We divide this area in two parts: moderate and very long patterns. The algorithms for the two areas are normally different, since more complex filters are only worthwhile for longer patterns.

8.2 Moderate Patterns

8.2.1 Tarhio and Ukkonen (1990). Tarhio and Ukkonen [1993]⁸ launched this area in 1990, publishing an algorithm that used Boyer–Moore–Horspool techniques [Boyer and Moore 1977; Horspool 1980] to filter the text. The idea is to align the pattern with a text window and scan the text backwards. The scanning ends where more than k "bad" text characters are found. A "bad" character is one that not only does not match the pattern position it is aligned with, but also does not match any pattern character at a distance of k characters or less. More formally, assume that the window starts at text position j + 1, and therefore T_{j+i} is aligned with

⁸ See also Jokinen et al. [1996], which has a correction to the algorithm.


P_i. Then T_{j+i} is bad when Bad(i, T_{j+i}), where Bad(i, c) has been precomputed as c ∉ {P_{i−k}, P_{i−k+1}, . . . , P_i, . . . , P_{i+k}}.

The idea of the bad characters is that we know for sure that we have to pay an error to match them, i.e. they will not match as a byproduct of inserting or deleting other characters. When more than k characters that are errors for sure are found, the current text window can be abandoned and shifted forward. If, on the other hand, the beginning of the window is reached, the area T_{j+1−k..j+m} must be checked with a classical algorithm.

To know how much we can shift the window, the authors show that there is no point in shifting P to a new position j′ where none of the k + 1 text characters that are at the end of the current window (T_{j+m−k}, .., T_{j+m}) match the corresponding character of P, i.e. where T_{j+m−r} ≠ P_{m−r−(j′−j)}. If those differences are fixed with substitutions, we make k + 1 errors, and if they can be fixed with less than k + 1 operations, then it is because we aligned some of the involved pattern and text characters using insertions and deletions. In this case, we would have obtained the same effect by aligning the matching characters from the start.

So for each pattern position i ∈ {m − k..m} and each text character a that could be aligned to position i (i.e. for all a ∈ Σ), the shift to align a in the pattern is precomputed, i.e. Shift(i, a) = min{s > 0 : P_{i−s} = a} (or m if no such s exists). Later, the shift for the window is computed as min_{i ∈ m−k..m} Shift(i, T_{j+i}). This last minimum is computed together with the backward window traversal.

The analysis in Tarhio and Ukkonen [1993] shows that the search time is O(kn(k/σ + 1/(m − k))), without considering verification. In Appendix A.1 we show that the amount of verification is negligible for α < e^{−(2k+1)/σ}. The analysis is valid for m ≫ σ > k, so we can simplify the search time to O(k²n/σ). The algorithm is competitive in practice for low error levels. Interestingly, the version k = 0 corresponds exactly to the Horspool algorithm [Horspool 1980]. Like Horspool, it does not take proper advantage of very long patterns. The algorithm can probably be adapted to other simple distance functions if we define k as the minimum number of errors needed to reject a string.

8.2.2 Jokinen, Tarhio, and Ukkonen (1991). In 1991, Jokinen, Tarhio and Ukkonen [Jokinen et al. 1996] adapted a previous filter for the k-mismatches problem [Grossi and Luccio 1989]. The filter is based on the simple fact that inside any match with at most k errors there must be at least m − k letters belonging to the pattern. The filter does not care about the order of those letters. This is a simple version of Chang and Lawler [1994] (see Section 8.3), with less filtering efficiency but simpler implementation.

The search algorithm slides a window of length m over the text⁹ and keeps count of the number of window characters that belong to the pattern. This is easily done with a table that, for each character a, stores a counter of a's in the pattern which have not yet been seen in the text window. The counter is incremented when an a enters the window and decremented when it leaves the window. Each time a positive counter is decremented, the window character is considered as belonging to the pattern. When there are m − k such characters, the area is verified with a classical algorithm.

The algorithm was analyzed by Navarro [1997a] using a model of urns and balls. He shows that the algorithm is O(n) time for α < e^{−m/σ}. Some possible extensions are studied in Navarro [1998].

The resulting algorithm is competitive in practice for short patterns, but it worsens for long ones. It is simple to adapt to other distance functions by just determining how many characters must match in an approximate occurrence.

8.2.3 Wu and Manber (1992). In 1992, a very simple filter was proposed by Wu and Manber [1992b] (among many other ideas in that work). The basic idea is in fact very

⁹ The original version used a variable size window. This simplification is from Navarro [1997a].


old [Rivest 1976]: if a pattern is cut in k + 1 pieces, then at least one of the pieces must appear unchanged in an approximate occurrence. This is evident, since k errors cannot alter the k + 1 pieces. The proposal was then to split the pattern in k + 1 approximately equal pieces, search the pieces in the text, and check the neighborhood of their matches (of length m + 2k). They used an extension of Shift-Or [Baeza-Yates and Gonnet 1992] to search all the pieces simultaneously in O(mn/w) time. In the same year, 1992, Baeza-Yates and Perleberg [1996] suggested better algorithms for the multipattern search: an Aho–Corasick machine [Aho and Corasick 1975] to guarantee O(n) search time (excluding verifications), or Commentz-Walter [1979].

Only in 1996 was the improvement really implemented [Baeza-Yates and Navarro 1999], by adapting the Boyer–Moore–Sunday algorithm [Sunday 1990] to multipattern search (using a trie of patterns and a pessimistic shift table). The resulting algorithm is surprisingly fast in practice for low error levels. There is no closed expression for the average case cost of this algorithm [Baeza-Yates and Régnier 1990], but we show in Appendix A.2 that a gross approximation is O(kn logσ(m)/σ). Two independent proofs in Baeza-Yates and Navarro [1999] and Baeza-Yates and Perleberg [1996] show that the cost of the search dominates for α < 1/(3 logσ m). A simple way to see this is to consider that checking a text area costs O(m²) and is done when any of the k + 1 pieces of length m/(k + 1) match, which happens with probability near k/σ^{1/α}. The result follows from requiring the average verification cost to be O(1).

This filter can be adapted, with some care, to other distance functions. The main issue is to determine how many pieces an edit operation can destroy and how many edit operations can be made before surpassing the error threshold. For example, a transposition can destroy two pieces in one operation, so we would need to split the pattern in 2k + 1 pieces to ensure that one is unaltered. A more clever solution for this case is to leave a hole of one character between each pair of pieces, so that the transposition cannot alter both.

8.2.4 Baeza-Yates and Navarro (1996). The bit-parallel algorithms presented in Section 7 [Baeza-Yates and Navarro 1999] were also the basis for novel filtering techniques. As the basic algorithm is limited to short patterns, the algorithms split longer patterns in j parts, making them short enough to be searchable with the basic bit-parallel automaton (using one computer word).

The method is based on a more general version of the partition into k + 1 pieces [Myers 1994a; Baeza-Yates and Navarro 1999]. For any j, if we cut the pattern in j pieces, then at least one of them appears with ⌊k/j⌋ errors in any occurrence of the pattern. This is clear, since if each piece needs more than k/j errors to match, then the complete match needs more than k errors.

Hence, the pattern was split in j pieces (of length m/j) which were searched with k/j errors using the basic algorithm. Each time a piece was found, the neighborhood was verified to check for the complete pattern. Notice that the error level α for the pieces is kept unchanged.

The resulting algorithm is O(n√(mk)/w) on average. Its maximum α value is 1 − e m^{O(1/√w)}/√σ, smaller than 1 − e/√σ and worsening as m grows. This may be surprising since the error level α is the same for the subproblems. The reason is that the verification cost keeps O(m²) but the matching probability is O(γ^{m/j}), larger than O(γ^m) (see Section 4).

In 1997, the technique was enriched with "superimposition" [Baeza-Yates and Navarro 1999]. The idea is to avoid performing one separate search for each piece of the pattern. A multipattern approximate searching is designed using the ability of bit-parallelism to search for classes of characters. Assume that we want to search "survey" and "secret." We search the pattern "s[ue][rc][vr]e[yt]," where [ab] means {a, b}. In the NFA of Figure 15, the horizontal arrows are traversable by more than one letter. Clearly, any match of each
this case is to leave a hole of one character than one letter. Clearly, any match of each


Fig. 20. The hierarchical verification method for a pattern split in four parts. The boxes (leaves) are the elements which are really searched, and the root represents the whole pattern. At least one pattern at each level must match in any occurrence of the complete pattern. If the bold box is found, all the bold lines may be verified.

of the two patterns is also a match of the superimposed pattern, but not vice-versa (e.g. "servet" matches with zero errors). So the filter is weakened but the search is made faster. Superimposition allowed lowering the average search time to O(n) for α < 1 − e m^{O(1/√w)} √(m/(σ w)) and to O(n√(mk/(σ w))) for the maximum α of the 1996 version. By using a j value smaller than the one necessary to put the automata in single machine words, an intermediate scheme was obtained that softly adapted to higher error levels. The algorithm was O(kn log(m)/w) for α < 1 − e/√σ.

8.2.5 Navarro and Baeza-Yates (1998). The final twist in the previous scheme was the introduction of "hierarchical verification" in 1998 [Navarro and Baeza-Yates 1998a]. For simplicity assume that the pattern is partitioned in j = 2^r pieces, although the technique is general. The pattern is split in two halves, each one to be searched with ⌊k/2⌋ errors. Each half is recursively split in two and so on, until the pattern is short enough to make its NFA fit in a computer word (see Figure 20). The leaves of this tree are the pieces actually searched. When a leaf finds a match, instead of checking the whole pattern as in the previous technique, its parent is checked (in a small area around the piece that matched). If the parent is not found, the verification stops, otherwise it continues with the grandparent until the root (i.e. the whole pattern) is found. This is correct because the partitioning scheme applies to each level of the tree: the grandparent cannot appear if none of its children appear, even if a grandchild appeared. Figure 20 shows an example. If one searches the pattern "aaabbbcccddd" with four errors in the text "xxxbbxxxxxxx," and splits the pattern in four pieces to be searched with one error, the piece "bbb" will be found in the text. In the original approach, one would verify the complete pattern in the text area, while with the new approach one verifies only its parent "aaabbb" and immediately determines that there cannot be a complete match.

An orthogonal hierarchical verification technique is also presented in Navarro and Baeza-Yates [1998a] to include superimposition in this scheme. If the superimposition of four patterns matches, the set is split in two sets of two patterns each, and it is checked whether some of them match instead of verifying all the four patterns one by one.

The analysis in Navarro [1998] and Navarro and Baeza-Yates [1998a] shows that the average verification cost drops to O((m/j)²). Only now the problem scales well (i.e. O(γ^{m/j}) verification probability and O((m/j)²) verification cost). With hierarchical verification, the verification cost stays negligible for α < 1 − e/√σ. All the simple extensions of bit-parallel algorithms apply, although the partition into j pieces may need some redesign for other distances. Notice that it is very difficult to break the barrier of α* = 1 − e/√σ for any filter because, as shown in Section 4, there are too many real matches, and even the best filters must check real matches.

In the same year, 1998, the same authors [Navarro and Baeza-Yates 1999c; Navarro 1998] added hierarchical verification to the filter that splits the pattern in k + 1 pieces and searches them with zero errors. The analysis shows that with this technique the verification cost does not dominate the search time for α < 1/logσ m. The resulting filter is the fastest for most cases of interest.

8.2.6 Navarro and Raffinot (1998). In 1998 Navarro and Raffinot [Navarro and Raffinot 2000; Navarro 1998] presented a novel approach based on suffix automata


Fig. 21. The construction to search any reverse prefix of "survey" allowing 2 errors.

(see Section 3.2). They adapted an exact string matching algorithm, BDM, to allow errors.

The idea of the original BDM algorithm is as follows [Crochemore et al. 1994; Crochemore and Rytter 1994]. The deterministic suffix automaton of the reverse pattern is built so that it recognizes the reverse prefixes of the pattern. Then the pattern is aligned with a text window, and the window is scanned backwards with the automaton (this is why the pattern is reversed). The automaton is active as long as what it has read is a substring of the pattern. Each time the automaton reaches a final state, it has seen a pattern prefix, so we remember the last time it happened. If the automaton arrives with active states at the beginning of the window then the pattern has been found, otherwise what is there is not a substring of the pattern and hence the pattern cannot be in the window. In any case the last window position that matched a pattern prefix gives the next initial window position. The algorithm BNDM [Navarro and Raffinot 2000] is a bit-parallel implementation (using the nondeterministic suffix automaton, see Figure 3) which is much faster in practice and allows searching for classes of characters, etc.

A modification of Navarro and Raffinot [2000] is to build a NFA to search the reversed pattern allowing errors, modify it to match any pattern suffix, and apply essentially the same BNDM algorithm using this automaton. Figure 21 shows the resulting automaton.

This automaton recognizes any reverse prefix of P allowing k errors. The window will be abandoned when no pattern substring matches what was read with k errors. The window is shifted to the next pattern prefix found with k errors. The matches must start exactly at the initial window position. The window length is m − k, not m, to ensure that if there is an occurrence starting at the window position then a substring of the pattern occurs in any suffix of the window (so that we do not abandon the window before reaching the occurrence). Reaching the beginning of the window does not guarantee a match, however, so we have to check the area by computing edit distance from the beginning of the window (at most m + k text characters).

In Appendix A.3 it is shown that the average complexity¹⁰ is O(n(α + α* logσ(m)/m)/((1 − α)α* − α)) and the filter works well for α < (1 − e/√σ)/(2 − e/√σ), which for large alphabets tends to 1/2. The result is competitive for low error levels, but the pattern cannot be very

¹⁰ The original analysis of Navarro [1998] is inaccurate.


Fig. 22. Algorithms LET and SET. LET covers all the text with pattern substrings, while
SET works only at block beginnings and stops when it finds k differences.

long because of the bit-parallel implementation. Notice that trying to do this with the deterministic BDM would have generated a very complex construction, while the algorithm with the nondeterministic automaton is simple. Moreover, a deterministic automaton would have too many states, just as in Section 6.2. All the simple extensions of bit-parallelism apply, provided the window length m − k is carefully reconsidered.

A recent software program, called nrgrep, capable of fast, exact, and approximate searching of simple and complex patterns has been built with this method [Navarro 2000b].

8.3 Very Long Patterns

8.3.1 Chang and Lawler (1990). In 1990, Chang and Lawler [1994] presented two algorithms (better analyzed in Giegerich et al. [1997]). The first one, called LET (for "linear expected time"), works as follows: the text is traversed linearly, and at each time the longest pattern substring that matches the text is maintained. When the substring cannot be extended further, it starts again from the current text position; Figure 22 illustrates.

The crucial observation is that, if less than m − k text characters have been covered by concatenating k longest substrings, then the text area does not match the pattern. This is evident because a match is formed by k + 1 correct strokes (recall Section 5.2) separated by k errors. Moreover, the strokes need to be ordered, which is not required by the filter.

The algorithm uses a suffix tree on the pattern to determine in a linear pass the longest pattern substring that matches the text seen up to now. Notice that the article is from 1990, the same year that Ukkonen and Wood [1993] did the same with a suffix automaton (see Section 5.2). Therefore, the filtering is in O(n) time. The authors use Landau and Vishkin [1989] as the verifying algorithm and therefore the worst case is O(kn). The authors show that the filtering time dominates for α < 1/logσ m + O(1). The constants are involved, but practical figures are α ≤ 0.35 for σ = 64 or α ≤ 0.15 for σ = 4.

The second algorithm presented is called SET (for "sublinear expected time"). The idea is similar to LET, except that the text is split in fixed blocks of size (m − k)/2, and the check for k contiguous strokes starts only at block boundaries. Since the shortest match is of length m − k, at least one of these blocks is always contained completely in a match. If one is able to discard the block, no occurrence can contain it. This is also illustrated in Figure 22.

The sublinearity is clear once it is proven that a block is discarded on average in O(k logσ m) comparisons. Since 2n/(m − k) blocks are considered, the average time is O(αn logσ(m)/(1 − α)). The maximum α level stays the same as in LET, so the complexity can be simplified to O(αn logσ m). Although the proof that limits the comparisons per block is quite involved, it is not hard to see intuitively why it is true: the probability of finding a stroke of length ℓ in the pattern is limited by m/σ^ℓ, and the detailed proof shows that ℓ = logσ m is on average the longest stroke found. This contrasts with the result of Myers [1986a] (Section 5.3), which shows that k strokes add up to O(k) length.


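The SET discarding test just described can be sketched as follows. This is our illustration, with simplifications: the set of pattern substrings is kept in a plain Python set rather than a suffix tree, and the blocks that survive would still have to be verified with a nonfiltering algorithm.

```python
def set_filter(pattern, text, k):
    """Return the block starts that cannot be discarded: from each block
    boundary, try to cover m - k text characters with k + 1 maximal
    pattern substrings ('strokes'), one presumed error between them."""
    m = len(pattern)
    # Naive substitute for the suffix tree: all substrings of the pattern.
    subs = {pattern[i:j] for i in range(m) for j in range(i + 1, m + 1)}

    def spans_enough(start):
        pos = start
        for _ in range(k + 1):
            ln = 0  # longest extension that is still a pattern substring
            while pos + ln < len(text) and text[pos:pos + ln + 1] in subs:
                ln += 1
            pos += ln
            if pos - start >= m - k:
                return True        # the block might lie inside a match
            pos += 1               # skip one character (a presumed error)
        return False               # k + 1 strokes cover too little: discard

    block = max((m - k) // 2, 1)   # fixed blocks of size (m - k) / 2
    return [b for b in range(0, len(text), block) if spans_enough(b)]
```

For example, with pattern "abcdef" and one error, a block falling inside "abcdyf" survives (strokes "abcd" and "f" around one error span 6 ≥ m − k characters), while blocks in random unrelated text are discarded after k + 1 failed extensions.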
Fig. 23. Q-gram algorithm. The left one [Ukkonen 1992] counts the number of pattern q-grams
in a text window. The right one [Sutinen and Tarhio 1995] finds sequences of pattern q-grams in
approximately the same text positions (we have put in bold a text sample and the possible q-grams
to match it).

The difference is that here we can take the strokes from anywhere in the pattern. Both LET and SET are effective for very long patterns only, since their overhead does not pay off on short patterns. Different distance functions can be accommodated after rereasoning the adequate k values.

8.3.2 Ukkonen (1992). In 1992, Ukkonen [1992] independently rediscovered some of the ideas of Chang and Lampe. He presented two filtering algorithms, one of which (based on what he called "maximal matches") is similar to the LET of Chang and Lawler [1994] (in fact Ukkonen presents it as a new "block distance" computable in linear time, and shows that it serves as a filter for the edit distance). The other filter is the first reference to "q-grams" for online searching (there are much older ones in indexed searching [Ullman 1977]).

A q-gram is a substring of length q. A filter was proposed based on counting the number of q-grams shared between the pattern and a text window (this is presented in terms of a new "q-gram distance" which may be of interest on its own). A pattern of length m has (m − q + 1) overlapping q-grams. Each error can alter q q-grams of the pattern, and therefore (m − q + 1 − kq) pattern q-grams must appear in any occurrence; Figure 23 illustrates.

Notice that this is a generalization of the counting filter of Jokinen et al. [1996] (Section 8.2), which corresponds to q = 1. The search algorithm is similar as well, although of course keeping a table with a counter for each of the σ^q q-grams is impractical (especially because only m − q + 1 of them are present). Ukkonen uses a suffix tree to keep count of the last q-gram seen in linear time (the relevant information can be attached to the m − q + 1 important nodes at depth q in the suffix tree).

The filter therefore takes linear time. There is no analysis to show which is the maximum error level tolerated by the filter, so we attempt a gross analysis in Appendix A.4, valid for large m. The result is that the filter works well for α < O(1/ logσ m), and that the optimal q to obtain it is q = logσ m. The search algorithm is more complicated than that of Jokinen et al. [1996]. Therefore, using larger q values only pays off for larger patterns. Different distance functions are easily accommodated by recomputing the number of q-grams that must be preserved in any occurrence.

8.3.3 Takaoka (1994). In 1994, Takaoka [1994] presented a simplification of Chang and Lawler [1994]. He considered h-samples of the text (which are nonoverlapping q-grams of the text taken each h characters, for h ≥ q). The idea is that if one h-sample is found in the pattern, then a neighborhood of the area is verified. By using h = ⌊(m − k − q + 1)/(k + 1)⌋ one cannot miss a match. The easiest way to see this is to start with k = 0. Clearly, we need h = m − q + 1 to not lose any matches.
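The counting condition above can be sketched as follows. This is our own simplified illustration (fixed q = 2, text windows of length m, and a plain counter table instead of Ukkonen's suffix tree); the function name and parameters are ours, not from the surveyed papers.

```c
/* Sketch of the q-gram counting filter: a window sharing fewer than
   m-q+1-k*q q-grams with the pattern cannot contain an occurrence
   with <= k errors.  Simplified: q = 2, window of length m, and one
   counter per possible 2-gram instead of a suffix tree. */
#include <string.h>

#define Q 2
#define TABLE (256 * 256)

static int qcode(const char *s)      /* encode a 2-gram as an index */
{
    return ((unsigned char)s[0] << 8) | (unsigned char)s[1];
}

/* Returns 1 when the window text[pos..pos+m-1] passes the filter
   (i.e., must still be verified with a nonfilter algorithm). */
int qgram_filter(const char *text, int pos, const char *pat, int k)
{
    static int cnt[TABLE];
    int m = (int)strlen(pat), i, shared = 0;
    int thresh = m - Q + 1 - k * Q;  /* q-grams that must survive */

    memset(cnt, 0, sizeof cnt);
    for (i = 0; i + Q <= m; i++) cnt[qcode(pat + i)]++;     /* pattern */
    for (i = pos; i + Q <= pos + m; i++)                    /* window  */
        if (cnt[qcode(text + i)] > 0) { cnt[qcode(text + i)]--; shared++; }
    return shared >= thresh;
}
```

For pat = "abcdefgh" and k = 1 the threshold is 8 − 2 + 1 − 2 = 5 shared 2-grams; a window with one substitution still shares 5 of them and passes, while a random window is discarded.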


For larger k, recall that if the pattern is split in k + 1 pieces some of them must appear with no errors. The filter divides h by k + 1 to ensure that any occurrence of those pieces will be found (we are assuming q < m/(k + 1)).

Using a suffix tree of the pattern, the h-sample can be found in O(q) time. Therefore the filtering time is O(qn/h), which is O(αn logσ(m)/(1 − α)) if the optimal q = logσ m is used. The error level is again α < O(1/ logσ m), which makes the time O(αn logσ m).

8.3.4 Chang and Marr (1994). It looks like O(αn logσ m) is the best complexity achievable by using filters, and that it will work only for α = O(1/ logσ m). But in 1994, Chang and Marr obtained an algorithm which was

O(n (k + logσ m) / m)

for α < ρσ, where ρσ depends only on σ and it tends to 1 − e/√σ for very large σ. At the same time, they proved that this was a lower bound for the average complexity of the problem (and therefore their algorithm was optimal on average). This is a major theoretical breakthrough.

The lower bound is obtained by taking the maximum (or sum) of two simple facts: the first one is the O(n logσ(m)/m) bound of Yao [1979] for exact string matching, and the second one is the obvious fact that in order to discard a block of m text characters, at least k characters should be examined to find the k errors (and hence O(kn/m) is a lower bound). Also, the maximum error level is optimal according to Section 4. What is impressive is that an algorithm with such complexity was found.

The algorithm is a variation of SET [Chang and Lawler 1994]. It is of polynomial space in m, i.e. O(m^t) space for some constant t which depends on σ. It is based on splitting the text in contiguous substrings of length ℓ = t logσ m. Instead of finding in the pattern the longest exact matches starting at the beginning of blocks of size (m − k)/2, it searches the text substrings of length ℓ in the pattern allowing errors.

The algorithm proceeds as follows. The best matches allowing errors inside P are precomputed for every ℓ-tuple (hence the O(m^t) space). Starting at the beginning of the block, it searches consecutive ℓ-tuples in the pattern (each in O(ℓ) time), until the total number of errors made exceeds k. If by that time it has not yet covered m − k text characters, the block can be safely skipped.

The reason why this works is a simple extension of SET. We have found an area contained in the possible occurrence which cannot be covered with k errors (even allowing the use of unordered portions of the pattern for the match). The algorithm is only practical for very long patterns, and can be extended for other distances with the same ideas as the other filtration and q-gram methods.

It is interesting to notice that α ≤ 1 − e/√σ is the limit we have discussed in Section 4, which is a firm barrier for any filtering mechanism. Chang and Lawler proved an asymptotic result, while a general bound is proved in Baeza-Yates and Navarro [1999]. The filters of Chang and Marr [1994] and Navarro and Baeza-Yates [1998a] reduce the problem to fewer errors instead of to zero errors. An interesting observation is that it seems that all the filters that partition the problem into exact search can be applied for α = O(1/ logσ m), and that in order to improve this to 1 − e/√σ we must partition the problem into (smaller) approximate searching subproblems.

8.3.5 Sutinen and Tarhio (1995). Sutinen and Tarhio [1995] generalized the Takaoka filter in 1995, improving its filtering efficiency. This is the first filter that takes into account the relative positions of the pattern pieces that match in the text (all the previous filters matched pieces of the pattern in any order). The generalization is to force s q-grams of the pattern to match (not just one). The pieces must conserve their relative ordering in the pattern and must not be more than k characters away from their correct


position (otherwise we need to make more than k errors to use them). This method is also illustrated in Figure 23.

In this case, the sampling step is reduced to h = ⌊(m − k − q + 1)/(k + s)⌋. The reason for this reduction is that, to ensure that s pieces of the pattern match, we need to cut the pattern into k + s pieces. The pattern is divided in k + s pieces and a hashed set is created for each piece so that the pieces are forced not to be too far away from their correct positions. The set contains the q-grams of the piece and some neighboring ones too (because the sample can be slightly misaligned). At search time, instead of a single h-sample, they consider text windows of contiguous sequences of k + s h-samples. Each of these h-samples is searched in the corresponding set, and if at least s are found the area is verified. This is a sort of Hamming distance, and the authors resort to an efficient algorithm for that distance [Baeza-Yates and Gonnet 1992] to process the text.

The resulting algorithm is O(αn logσ m) on average using optimal q = logσ m, and works well for α < 1/ logσ m. The algorithm is better suited for long patterns, although with s = 2 it can be reasonably applied to short ones as well. In fact the analysis is done for s = 2 only in Sutinen and Tarhio [1995].

8.3.6 Shi (1996). In 1996 Shi [1996] proposed to extend the idea of the k + 1 pieces (explained in Section 8.2) to k + s pieces, so that at least s pieces must match. This idea is implicit in the filter of Sutinen and Tarhio but had not been explicitly written down. Shi compared his filter against the simple one, finding that the filtering efficiency was improved. However, this improvement will be noticeable only for long patterns. Moreover, the online searching efficiency is degraded because the pieces are shorter (which affects any Boyer–Moore-like search), and because the verification logic is more complex. No analysis is presented in the paper, but we conjecture that the optimum s is O(1) and therefore the same complexity and tolerance to errors is maintained.

8.3.7 Giegerich, Kurtz, Hischke, and Ohlebusch (1996). Also in 1996, a general method to improve filters was developed [Giegerich et al. 1997]. The idea is to mix the phases of filtering and checking, so that the verification of a text area is abandoned as soon as the combined information from the filter (number of guaranteed differences left) and the verification in progress (number of actual differences seen) shows that a match is not possible. As they show, however, the improvement occurs in a very narrow area of α. This is a consequence of the statistics of this problem that we have discussed in Section 4.

9. EXPERIMENTS

In this section we make empirical comparisons among the algorithms described in this work. Our goal is to show the best options at hand depending on the case. Nearly 40 algorithms have been surveyed, some of them without existing implementations and many of them already known to be impractical. To avoid excessively long comparisons among algorithms known not to be competitive, we have left many of them aside.

9.1 Included and Excluded Algorithms

A large group of excluded algorithms is from the theoretical side based on the dynamic programming matrix. Although these algorithms are not competitive in practice, they represent (or represented at their time) a valuable contribution to the development of the algorithmic aspect of the problem. The dynamic programming algorithm [Sellers 1980] is excluded because the cut-off heuristic of Ukkonen [1985b] is known to be faster (e.g. in Chang and Lampe [1992] and in our internal tests); the Masek and Paterson algorithm [1980] is argued in the same paper to be worse than dynamic programming (which is quite bad) for n < 40 GB; Landau and Vishkin [1988] has bad complexity and was improved later by many others in theory and practice; Landau and Vishkin [1989] is


implemented with a better LCA algorithm in Chang and Lampe [1992] and found too slow; Myers [1986a] is considered slow in practice by the same author in Wu et al. [1996]; Galil and Giancarlo [1988] is clearly slower than Landau and Vishkin [1989]; Galil and Park [1990], one of the fastest among the O(kn) worst case algorithms, is shown to be extremely slow in Ukkonen and Wood [1993], Chang and Lampe [1992], and Wright [1994] and in internal tests done by ourselves; Ukkonen and Wood [1993] is shown to be slow in Jokinen et al. [1996]; the O(kn) algorithm implemented in Chang and Lawler [1994] is in the same paper argued to be the fastest of the group and shown to be not competitive in practice; Sahinalp and Vishkin [1997] and Cole and Hariharan [1998] are clearly theoretical: their complexities show that the patterns have to be very long and the error level too low to be of practical application. To give an idea of how slow is "slow," we found Galil and Park [1990] 10 times slower than Ukkonen's cut-off heuristic (a similar result is reported by Chang and Lampe [1992]). Finally, other O(kn) average time algorithms are proposed in Myers [1986a] and Galil and Park [1990], and they are shown to be very similar to Ukkonen's cut-off [Ukkonen 1985b] in Chang and Lampe [1992]. Since the cut-off heuristic is already not very competitive, we leave aside the other similar algorithms. Therefore, from the group based on dynamic programming we consider only the cut-off heuristic (mainly as a reference) and Chang and Lampe [1992], which is the only one competitive in practice.

From the algorithms based on automata we consider the DFA algorithm [Ukkonen 1985b], but prefer its lazy version implemented in Navarro [1997b], which is equally fast for small automata and much faster for large automata. We also consider the Four Russians algorithm of Wu et al. [1996]. From the bit-parallel algorithms we consider Wu and Manber [1992b], Baeza-Yates and Navarro [1999], and Myers [1999], leaving aside Wright [1994]. As shown in the 1996 version of Baeza-Yates and Navarro [1999], the algorithm of Wright [1994] was competitive only on binary text, and this was shown to not hold anymore in Myers [1999].

From the filtering algorithms, we have included Tarhio and Ukkonen [1993]; the counting filter proposed in Jokinen et al. [1996] (as simplified in Navarro [1997a]); the algorithm of Navarro and Raffinot [2000]; and those of Sutinen and Tarhio [1995] and Takaoka [1994] (this last seen as the case s = 1 of Sutinen and Tarhio [1995], since this implementation worked better). We have also included the filters proposed in Baeza-Yates and Navarro [1999], Navarro and Baeza-Yates [1998a], and Navarro [1998], preferring to present only the last version, which incorporates all the twists of superimposition, hierarchical verification and mixed partitioning. Many previous versions are outperformed by this one. We have also included the best version of the filters that partition the pattern in k + 1 pieces, namely the one incorporating hierarchical verification [Navarro and Baeza-Yates 1999c; Navarro 1998]. In those publications it is shown that this version clearly outperforms the previous ones proposed in Wu and Manber [1992b], Baeza-Yates and Perleberg [1996], and Baeza-Yates and Navarro [1999]. Finally, we are discarding some filters [Chang and Lawler 1994; Ukkonen 1992; Chang and Marr 1994; Shi 1996] which are applicable only to very long patterns, since this case is excluded from our experiments as explained shortly. Some comparisons among them were carried out by Chang and Lampe [1992], showing that LET is equivalent to the cut-off algorithm with k = 20, and that the time for SET is 2α times that of LET. LET was shown to be the fastest with patterns a hundred letters long and a few errors in Jokinen et al. [1996], but we recall that many modern filters were not included in that comparison.

We now list the included algorithms and the relevant comments about them. All the algorithms implemented by us represent our best coding effort and have been found similar or faster than


other implementations found elsewhere. The implementations coming from other authors were checked with the same standards and in some cases their code was improved with better register usage and I/O management. The number in parenthesis following the name of each algorithm is the number of lines of the C implementation we use. This gives a rough idea of how complex the implementation of each algorithm is.

CTF (239) The cut-off heuristic of Ukkonen [1985b] implemented by us.

CLP (429) The column partitioning algorithm of Chang and Lampe [1992], implemented by them. We replaced their I/O by ours, which is faster.

DFA (291) The lazy deterministic automaton of Navarro [1997b], implemented by us.

RUS (304) The Four-Russians algorithm of Wu et al. [1996], implemented by them. We tried different r values (related to the time/space tradeoff) and found that the best option is always r = 5 in our machine.

BPR (229) The NFA bit-parallelized by rows [Wu and Manber 1992b], implemented by us and restricted to m ≤ w. Separate code is used for k = 1, 2, 3 and k > 3. We could continue writing separate versions but decided that this is reasonable up to k = 3, as at that point the algorithm is not competitive anyway.

BPD (249 – 1,224) The NFA bit-parallelized by diagonals [Baeza-Yates and Navarro 1999], implemented by us. Here we do not include any filtering technique. The first number (249) corresponds to the plain technique and the second one (1,224) to handling partitioned automata.

BPM (283 – 722) The bit-parallel implementation of the dynamic programming matrix [Myers 1999], implemented by that author. The two numbers have the same meaning as in the previous item.

BMH (213) The adaptation of Horspool to allow errors [Tarhio and Ukkonen 1993], implemented by them. We use their algorithm 2 (which is faster), improve some register usage and replace their I/O by ours, which is faster.

CNT (387) The counting filter of Jokinen et al. [1996], as simplified in Navarro [1997a] and implemented by us.

EXP (877) Partitioning in k + 1 pieces plus hierarchical verification [Navarro and Baeza-Yates 1999c; Navarro 1998], implemented by us.

BPP (3,466) The bit-parallel algorithms of Baeza-Yates and Navarro [1999], Navarro and Baeza-Yates [1998a], and Navarro [1998] using pattern partitioning, superimposition, and hierarchical verification. The implementation is ours and is packaged software that can be downloaded from the Web page of the author.

BND (375) The BNDM algorithm adapted to allow errors in Navarro and Raffinot [2000] and Navarro [1998], implemented by us and restricted to m ≤ w. Separate code is used for k = 1, 2, 3 and k > 3. We could continue writing separate versions but decided that this is reasonable up to k = 3.

QG2 (191) The q-gram filter of Sutinen and Tarhio [1995], implemented by them and used with s = 2 (since s = 1 is the algorithm of Takaoka [1994], see next item; and s > 2 worked well only for very long patterns). The code is restricted to k ≤ w/2 − 3, and it is also not run when q is found to be 1 since the performance is very poor. We improved register usage and replaced the I/O management by our faster versions.

QG1 (191) The q-gram algorithm of Takaoka [1994], run as the special case s = 1 of the previous item. The same restrictions on the code apply.

We did our best to uniformize the algorithms. The I/O is the same in all cases: the text is read in chunks of 64 KB to improve locality (this is the optimum in our machine) and care is taken to not lose or repeat matches in the borders; open is used instead of fopen because the latter is slower. We also uniformize internal conventions: only a final special character


(zero) is used at the end of the buffer to help algorithms recognize it; and only the number of matches found is reported.

In the experiments we separate the filtering and nonfiltering algorithms. This is because the filters can in general use any nonfilter to check for potential matches, so the best algorithm is formed by a combination of both. All the filtering algorithms in the experiments use the cut-off algorithm [Ukkonen 1985b] as their verification engine, except for BPP (whose very essence is to switch smoothly to BPD) and BND (which uses a reverse BPR to search in the window and a forward BPR for the verifications).

9.2 Experimental Setup

Apart from the algorithms and their details, we describe our experimental setup. We measure CPU times and show the results in tenths of seconds per megabyte. Our machine is a Sun UltraSparc-1 with 167 MHz and 64 MB in main memory, we run Solaris 2.5.1 and the texts are on a local disk of 2 GB. Our experiments were run on texts of 10 MB and repeated 20 times (with different search patterns). The same patterns were used for all the algorithms.

In the applications, we have selected three types of texts.

DNA: This file is formed by concatenating the 1.34 MB DNA chain of H. influenzae with itself until 10 MB is obtained. Lines are cut at 60 characters. The patterns are selected randomly from the text, avoiding line breaks if possible. The alphabet size is four, save for a few exceptions along the file, and the results are similar to a random four-letter text.

Natural language: This file is formed by 1.29 MB from the work of Benjamin Franklin filtered to lower-case and separators converted to a space (except line breaks which are respected). This mimics common information retrieval scenarios. The text is replicated to obtain 10 MB and search patterns are randomly selected from the same text at word beginnings. The results are roughly equivalent to a random text over 15 characters.

Speech: We obtained speech files from discussions of U.S. law from Indiana University, in PCM format with 8 bits per sample. Of course, the standard edit distance is of no use here, since it has to take into account the absolute values of the differences between two characters. We simplified the problem in order to use edit distance: we reduced the range of values to 64 by quantization, considering two samples that lie in the same range as equal. We used the first 10 MB of the resulting file. The results are similar to those on a random text of 50 letters, although the file shows smooth changes from one letter to the next.

We present results using different pattern lengths and error levels in two flavors: we fix m and show the effect of increasing k, or we fix α and show the effect of increasing m. A given algorithm may not appear at all in a plot when its times are above the y range or its restrictions on m and k do not intersect with the x range. In particular, filters are shown only for α ≤ 1/2. We remind readers that in most applications the error levels of interest are low.

9.3 Results

Figure 24 shows the results for short patterns (m = 10) and varying k. In nonfiltering algorithms BPD is normally the fastest, up to 30% faster than the next one, BPM. The DFA is also quite close in most cases. For k = 1, a specialized version of BPR is slightly faster than BPD (recall that for k > 3 BPR starts to use a nonspecialized algorithm, hence the jump). An exception occurs in DNA text, where for k = 4 and k = 5, BPD shows a nonmonotonic behavior and BPM becomes the fastest. This behavior comes from its O(k(m − k)n/w) complexity,[11]

[11] Another reason for this behavior is that there are integer round-off effects that produce nonmonotonic results.
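The chunked I/O used in the setup (64 KB blocks, with care at the block borders) can be sketched as follows. This is our own illustration, not the experimental code: exact matching stands in for the real approximate matcher, and the function name is ours. Carrying the last m − 1 bytes of each block into the next one guarantees that an occurrence crossing a border is neither lost nor counted twice, since no full occurrence fits inside the carried bytes alone.

```c
/* Sketch of chunked text scanning: read CHUNK-byte blocks and carry
   the last m-1 bytes, so matches crossing a border are found exactly
   once.  Exact search stands in for the approximate matcher. */
#include <stdio.h>
#include <string.h>

#define CHUNK 65536

long chunked_count(FILE *f, const char *pat)
{
    static char buf[CHUNK];
    int m = (int)strlen(pat), len = 0, i, got;
    long count = 0;
    for (;;) {
        got = (int)fread(buf + len, 1, CHUNK - len, f);
        len += got;
        for (i = 0; i + m <= len; i++)        /* fully contained matches */
            if (memcmp(buf + i, pat, m) == 0) count++;
        if (got == 0) break;
        if (len >= m) {                       /* keep the m-1 byte border */
            memmove(buf, buf + len - (m - 1), m - 1);
            len = m - 1;
        }
    }
    return count;
}
```

A match wholly inside a block starts at a position that is scanned in that block only; a match starting in the carried region is scanned only after the next block arrives, so each occurrence is counted exactly once.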


Fig. 24. Results for m = 10 and varying k. The left plots show nonfiltering and the right plots
show filtering algorithms. Rows 1–3 show DNA, English, and speech files, respectively.


which in texts with larger alphabets is not noticeable because the cut-off heuristic keeps the cost unchanged. Indeed, the behavior of BPD would have been totally stable if we had chosen m = 9 instead of m = 10, because the problem would fit in a computer word all the time. BPM, on the other hand, handles much longer patterns, maintaining stability, although it takes up to 50% more time than BPD.

With respect to filters, EXP is the fastest for low error levels. The value of "low" increases for larger alphabets. At some point, BPP starts to dominate. BPP adapts smoothly to higher error levels by slowly switching to BPD, so BPP is a good alternative for intermediate error levels, where EXP ceases to work until it switches to BPD. However, this range is void on DNA and English text for m = 10. Other filters competitive with EXP are BND and BMH. In fact, BND is the fastest for k = 1 on DNA, although no filter works very well in that case. Finally, QG2 does not appear because it only works for k = 1 and it was worse than QG1.

The best choice for short patterns seems to be EXP while it works, switching to the best bit-parallel algorithm for higher errors. Moreover, the verification algorithm for EXP should be BPR or BPD (which are the fastest where EXP dominates).

Figure 25 shows the case of longer patterns (m = 30). Many of the observations are still valid in this case. However, in this case the algorithm BPM shows its advantage over BPD, since the entire problem still fits in a computer word for BPM and it does not for BPD. Hence in the left plots the best algorithm is BPM except for low k, where BPR or BPD are better. With respect to filters, EXP or BND are the fastest, depending on the alphabet, until a certain error level is reached. At that point BPP becomes the fastest, in some cases still faster than BPM. Notice that for DNA a specialized version of BND for k = 4 and even 5 could be the fastest choice.

In Figure 26 we consider the case of fixed α = 0.1 and growing m. The results repeat somewhat those for nonfiltering algorithms: BPR is the best for k = 1 (i.e. m = 10), then BPD is the best until a certain pattern length is reached (which varies from 30 on DNA to 80 on speech), and finally BPM becomes the fastest. Note that for such a low error level the number of active columns is quite small, which permits algorithms like BPD and BPM to keep their good behavior for patterns much longer than what they could handle in a single machine word. The DFA is also quite competitive until its memory requirements become unreasonable.

The real change, however, is in the filters. In this case EXP becomes the star filter in English and speech texts. The situation for DNA, on the other hand, is quite complex. For m ≤ 30, BND is the fastest, and indeed an extended implementation allowing longer patterns could keep it being the fastest for a few more points. However, that case would have to handle four errors, and only a specialized implementation for fixed k = 4, 5, . . . could maintain a competitive performance. We have determined that such specialized code is worthwhile up to k = 3 only. When BND ceases to be applicable, EXP becomes the fastest algorithm, and finally QG2 beats it (for m ≥ 60). However, notice that for m > 30, all the filters are beaten by BPM and therefore make little sense (on DNA).

There is a final phenomenon that deserves mention with respect to filters. The algorithms QG1 and QG2 improve as m grows. These algorithms are the most practical, and the only ones we tested in the family of algorithms suitable for very long patterns. Thus, although all these algorithms would not be competitive in our tests (where m ≤ 100), they should be considered in scenarios where the patterns are much longer and the error level is kept very low. In such a scenario, those algorithms would finally beat all the algorithms we consider here.

The situation becomes worse for the filters when we consider α = 0.3 and varying m (Figure 27). On DNA, no filter can beat the nonfiltering algorithms, and among them the tricks to maintain a few active columns do not work well. This


Fig. 25. Results for m = 30 and varying k. The left plots show nonfiltering and the right plots
show filtering algorithms. Rows 1–3 show DNA, English and speech files, respectively.


Fig. 26. Results for α = 0.1 and varying m. The left plots show nonfiltering and the right plots
show filtering algorithms. Rows 1–3 show DNA, English and speech files, respectively.


Fig. 27. Results for α = 0.3 and varying m. The left plots show nonfiltering and the right plots
show filtering algorithms. Rows 1–3 show DNA, English and speech files, respectively.


Fig. 28. The areas where each algorithm is the best; gray is that of filtering algorithms.

favors the algorithms that pack more information per bit, which makes BPM the best in all cases except for m = 10 (where BPD is better). The situation is almost the same on English text, except that BPP works reasonably well and becomes quite similar to BPM (the periods where each one dominates are interleaved). On speech, on the other hand, the scenario is similar to that for nonfiltering algorithms, but the EXP filter still beats all of them, as 30% of errors is low enough on the speech files. Note in passing that the error level is too high for QG1 and QG2, which can only be applied in a short range and yield bad results.

To give an idea of the areas where each algorithm dominates, Figure 28 shows the case of English text. There is more information in Figure 28 than can be inferred from previous plots, such as the area where RUS is better than BPM. We have shown the nonfiltering algorithms and superimposed in gray the area where the filters dominate. Therefore, in the gray area the best choice is to use the corresponding filter using the dominating nonfilter as its verification engine. In the nongray area it is better to use the dominating nonfiltering algorithm directly, with no filter.

A code implementing such a heuristic (including EXP, BPD and BPP only) is publicly available from the author's Web page.[12] This combined code is faster than each isolated algorithm, although of course it is not really a single algorithm but the combination of the best choices.

10. CONCLUSIONS

We reach the end of this tour on approximate string matching. Our goal has been to present and explain the main ideas behind the existing algorithms, to classify them according to the type of approach proposed, and to show how they perform in practice in a subset of possible practical scenarios. We have shown that the oldest approaches, based on the dynamic programming matrix, yield the most important theoretical developments, but in general the algorithms have been improved by modern developments based on filtering and bit-parallelism. In particular, the fastest algorithms combine a fast filter to discard most of the text with a fast nonfilter algorithm to check for potential matches.

We show some plots summarizing the contents of the survey. Figure 29 shows the historical order in which the algorithms appeared in the different areas.

[12] http://www.dcc.uchile.cl/~gnavarro/pubcode. To apply EXP the option -ep must be used.
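For reference, the verification engine used by the filters above, the dynamic programming algorithm of Sellers [1980] with Ukkonen's [1985b] cut-off heuristic, can be sketched as follows (our own illustrative coding, assuming m < 256; the function name is ours):

```c
/* Sketch of column-wise dynamic programming with Ukkonen's cut-off:
   only rows up to the last active cell (value <= k) are computed,
   which gives O(kn) time on average. */
#include <string.h>

/* Count the text positions where pat matches with at most k errors. */
int dp_cutoff_search(const char *text, const char *pat, int k)
{
    int n = (int)strlen(text), m = (int)strlen(pat);
    int C[256];                          /* one column; assumes m < 256 */
    int i, j, lac, matches = 0;
    for (i = 0; i <= m; i++) C[i] = i;   /* column for the empty text */
    lac = k < m ? k : m;                 /* last active cell: C[i] <= k */
    for (j = 0; j < n; j++) {
        int top = lac < m ? lac + 1 : m;
        int diag = C[0], old;            /* C[0] stays 0: matches may */
        C[0] = 0;                        /* start at any text position */
        if (top > lac) C[top] = k + 1;   /* stale cell, known inactive */
        for (i = 1; i <= top; i++) {
            old = C[i];
            if (pat[i-1] == text[j]) C[i] = diag;
            else {
                int best = diag < old ? diag : old;
                if (C[i-1] < best) best = C[i-1];
                C[i] = best + 1;         /* subst / del / ins, +1 error */
            }
            diag = old;
        }
        lac = top;
        while (lac > 0 && C[lac] > k) lac--;   /* shrink active zone */
        if (lac == m) matches++;               /* C[m] <= k: occurrence */
    }
    return matches;
}
```

For instance, searching "word" with k = 1 in "hello world" reports three positions, the text characters ending "wor", "worl" and "world", all within distance 1 of the pattern.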


Fig. 29. Historical development of the different areas. References are shortened to first letters
(single authors) or initials (multiple authors), and to the last two digits of years.
Key: Sel80 = [Sellers 1980], MP80 = [Masek and Paterson 1980], LV88 = [Landau and Vishkin
1988], Ukk85b = [Ukkonen 1985b], LV89 = [Landau and Vishkin 1989], Mye86a = [Myers 1986a],
GG88 = [Galil and Giancarlo 1988], GP90 = [Galil and Park 1990], CL94 = [Chang and Lawler
1994], UW93 = [Ukkonen and Wood 1993], TU93 = [Tarhio and Ukkonen 1993], JTU96 = [Jokinen
et al. 1996], CL92 = [Chang and Lampe 1992], WMM96 = [Wu et al. 1996], WM92b = [Wu and
Manber 1992b], BYP96 = [Baeza-Yates and Perleberg 1996], Ukk92 = [Ukkonen 1992], Wri94 =
[Wright 1994], CM94 = [Chang and Marr 1994], Tak94 = [Takaoka 1994], Mel96 = [Melichar
1996], ST95 = [Sutinen and Tarhio 1995], Kur96 = [Kurtz 1996], BYN99 = [Baeza-Yates and
Navarro 1999], Shi96 = [Shi 1996], GKHO97 = [Giegerich et al. 1997], SV97 = [Sahinalp and
Vishkin 1997], Nav97a = [Navarro 1997a], CH98 = [Cole and Hariharan 1998], Mye99 = [Myers
1999], NBY98a & NBY99b = [Navarro and Baeza-Yates 1998a; 1999b], and NR00 = [Navarro and
Raffinot 2000].

Figure 30 shows a worst case time/space complexity plot for the nonfiltering algorithms. Figure 31 considers filtration algorithms, showing their average case complexity and the maximum error level α for which they work. Some practical assumptions have been made to order the different functions of k, m, σ, w, and n.

Approximate string matching is a very active research area, and it should continue in that status in the foreseeable future: strong genome projects in computational biology, the pressure for oral human-machine communication and the heterogeneity and spelling errors present in textual databases are just a sample of the reasons that drive researchers to look for faster and more flexible algorithms for approximate pattern matching.

It is interesting to point out theoretical and practical questions that are still open.


Fig. 30. Worst case time and space complexity of nonfiltering algorithms. We replaced w by
Θ(log n). References are shortened to first letters (single authors) or initials (multiple authors),
and to the last two digits of years.
Key: Sel80 = [Sellers 1980], LV88 = [Landau and Vishkin 1988], WM92b = [Wu and Manber
1992b], GG88 = [Galil and Giancarlo 1988], UW93 = [Ukkonen and Wood 1993], GP90 = [Galil
and Park 1990], CL94 = [Chang and Lawler 1994], Mye86a = [Myers 1986a], LV89 = [Landau
and Vishkin 1989], CH98 = [Cole and Hariharan 1998], BYN99 = [Baeza-Yates and Navarro
1999], Mye99 = [Myers 1999], WMM96 = [Wu et al. 1996], MP80 = [Masek and Paterson 1980],
and Ukk85a = [Ukkonen 1985a].

—The exact matching probability and average edit distance between two random strings is a difficult open question. We found a new bound in this survey, but the problem is still open.

—A worst-case lower bound of the problem is clearly Ω(n), but the only algorithms achieving it have space and preprocessing cost exponential in m or k. The only improvements to the worst case with polynomial space complexity are the O(kn) algorithms and, for very small k, O(n(1 + k^4/m)). Is it possible to improve the algorithms or to find a better lower bound for this case?

—The previous question also has a practical side: Is it possible to find an algorithm which is O(kn) in the worst case and efficient in practice? Using bit-parallelism, there are good practical algorithms that achieve O(kn/w) on average and O(mn/w) in the worst case.

—The lower bound of the problem for the average case is known to be Ω(n(k + logσ m)/m), and there exists an algorithm achieving it, so from the theoretical point of view that problem is closed. However, from the practical side, the algorithms approaching those limits work well only for very long patterns, while a much simpler algorithm (EXP) is the best for moderate and short patterns. Is it possible to find a unified approach, good in practice and with that theoretical complexity?
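To make these bounds concrete, the O(mn) dynamic-programming search that the above algorithms improve upon [Sellers 1980] can be sketched as follows (a minimal Python illustration, not taken from the survey; the names are ours):

```python
def approx_search(pattern, text, k):
    """Sellers-style O(mn) dynamic-programming search: report the end
    positions of text substrings within edit distance k of pattern."""
    m = len(pattern)
    # D[i] = edit distance between pattern[:i] and the best-matching
    # suffix of the text read so far; row 0 stays 0 because an
    # occurrence may start at any text position.
    D = list(range(m + 1))
    ends = []
    for pos, c in enumerate(text):
        diag = D[0]                      # D[i-1] of the previous column
        for i in range(1, m + 1):
            new = min(D[i] + 1,          # error: c inserted in the occurrence
                      D[i - 1] + 1,      # error: pattern[i-1] deleted
                      diag + (pattern[i - 1] != c))  # match or substitution
            diag, D[i] = D[i], new
        if D[m] <= k:
            ends.append(pos)             # an occurrence ends here
    return ends
```

Each text character costs O(m), giving the O(mn) worst case that the O(kn) and bit-parallel O(kn/w) algorithms discussed above improve.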



82 G. Navarro

Fig. 31. Average time and maximum tolerated error level for the filtration algorithms. References
are shortened to first letters (single authors) or initials (multiple authors), and to the last two digits
of years.
Key: BYN99 = [Baeza-Yates and Navarro 1999], NBY98a = [Navarro and Baeza-Yates 1998a],
JTU96 = [Jokinen et al. 1996], Ukk92 = [Ukkonen 1992], CL94 = [Chang and Lawler 1994], WM92b =
[Wu and Manber 1992b], TU93 = [Tarhio and Ukkonen 1993], Tak94 = [Takaoka 1994], Shi96 = [Shi
1996], ST95 = [Sutinen and Tarhio 1995], NR00 = [Navarro and Raffinot 2000], and CM94 = [Chang
and Marr 1994].
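Many of the filters in Figure 31 rest on the same observation: if the pattern is split into k + 1 disjoint pieces, any occurrence with at most k errors must contain at least one piece unchanged. The filtration phase can be sketched as follows (our own illustration, not code from the survey; the verification step would then run a dynamic-programming check on each candidate area):

```python
def candidate_areas(pattern, text, k):
    """Filtration by the k+1 pieces lemma: search the text exactly for
    each of k+1 disjoint pattern pieces and return the (start, end)
    text areas that could hold an occurrence with at most k errors."""
    m = len(pattern)
    cuts = [m * i // (k + 1) for i in range(k + 2)]   # piece boundaries
    areas = set()
    for i in range(k + 1):
        piece, off = pattern[cuts[i]:cuts[i + 1]], cuts[i]
        pos = text.find(piece)
        while pos != -1:
            # an occurrence containing this piece intact starts within
            # k positions of pos - off and spans at most m + k characters
            areas.add((max(0, pos - off - k), min(len(text), pos - off + m + k)))
            pos = text.find(piece, pos + 1)
    return sorted(areas)
```

The exact multipattern search dominates when the error level is low, which is why these filters degrade as α grows, as the figure shows.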

—Another practical question on filtering algorithms is: Is it possible in practice to improve over the current best existing algorithms?

—Finally, there are many other open questions related to offline approximate searching, which is a much less mature area needing more research.

APPENDIX A. SOME ANALYSES

Since some of the source papers lack an analysis, or they do not analyze exactly what is of interest to us, we provide a simple analysis. This is not the purpose of this survey, so we content ourselves with rough figures. In particular, our analyses are valid for σ << m. All refer to filters and are organized according to the original order, so the reader should first read the algorithm description to understand the terminology.

A.1 Tarhio and Ukkonen (1990)

First, the probability of a text character being "bad" is that of not matching 2k + 1 pattern positions, i.e. Pbad = (1 − 1/σ)^(2k+1) ≈ e^(−(2k+1)/σ), so we try on average 1/Pbad characters until we find a bad one. Since k + 1 bad characters


have to be found, we make O(k/Pbad) character comparisons until we leave the window. On the other hand, the probability of verifying a text window is that of reaching its beginning. We approximate that probability by equating m to the average portion of the traversed window (k/Pbad), to obtain α < e^(−(2k+1)/σ).

A.2 Wu and Manber (1992)

The Sunday algorithm can be analyzed as follows. To see how far we can verify in the current window, consider that (k + 1) patterns have to fail. Each one fails on average in logσ(m/(k + 1)) character comparisons, but the time for all of them to fail is longer. By Yao's bound [Yao 1979], this cannot be less than logσ m. Otherwise we could split the test of a single pattern into (k + 1) tests of subpatterns, and all of them would fail in less than logσ m time, breaking the lower bound. To compute the average shift, consider that k characters must be different from the last window character, and therefore the average shift is σ/k. The final complexity is therefore O(kn logσ(m)/σ). This is optimistic, but we conjecture that it is the correct complexity. An upper bound is obtained by replacing k by k^2 (i.e. adding the times for all the pieces to fail).

A.3 Navarro and Raffinot (1998)

The automaton matches the text window with k errors until almost surely k/α* characters have been inspected (so that the error level becomes lower than α*). From there on, it becomes exponentially decreasing on γ, which can be made 1/σ in O(k) total steps. From that point on, we are in a case of exact string matching and then logσ m characters are inspected, for a total of O(k/α* + logσ m). When the window is shifted to the last prefix that matched with k errors, this is also at k/α* distance from the end of the window, on average. The window length is m − k, and therefore we shift the window in m − k − k/α* on average. Therefore, the total amount of work is O(n(α + α* logσ(m)/m)/((1 − α)α* − α)). The filter works well unless the probability of finding a pattern prefix with errors at the beginning of the window is high. This is the same as saying that k/α* = m − k, which gives α < (1 − e/√σ)/(2 − e/√σ).

A.4 Ukkonen (1992)

The probability of finding a given q-gram in the text window is 1 − (1 − 1/σ^q)^m ≈ 1 − e^(−m/σ^q). So the probability of verifying the text position is that of finding (m − q + 1 − kq) q-grams of the pattern, i.e. C(m−q+1, kq) (1 − e^(−m/σ^q))^(m−q+1−kq). This must be O(1/m^2) in order not to interfere with the search time. Taking logarithms and approximating the combinatorials using Stirling's n! = (n/e)^n √(2πn) (1 + O(1/n)), we arrive at

    kq < (2 logσ m + (m − q + 1) logσ(1 − e^(−m/σ^q))) / (logσ(1 − e^(−m/σ^q)) + logσ(kq) − logσ(m − q + 1))

from which, by replacing q = logσ m, we obtain

    α < 1 / (logσ m (logσ α + logσ logσ m)) = O(1/logσ m)

a quite common result for this type of filter. The q = logσ m is chosen because the result improves as q grows, but it is necessary that q ≤ logσ m holds, since otherwise logσ(1 − e^(−m/σ^q)) becomes zero and the result worsens.

ACKNOWLEDGMENTS

The author thanks the many researchers in this area for their willingness to exchange ideas and/or share their implementations: Amihood Amir, Ricardo Baeza-Yates, William Chang, Udi Manber, Gene Myers, Erkki Sutinen, Tadao Takaoka, Jorma Tarhio, Esko Ukkonen, and Alden Wright. The referees also provided important suggestions that improved the presentation.


REFERENCES

AHO, A. AND CORASICK, M. 1975. Efficient string matching: an aid to bibliographic search. Commun. ACM 18, 6, 333–340.
AHO, A., HOPCROFT, J., AND ULLMAN, J. 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA.
ALTSCHUL, S., GISH, W., MILLER, W., MYERS, G., AND LIPMAN, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
AMIR, A., LEWENSTEIN, M., AND LEWENSTEIN, N. 1997a. Pattern matching in hypertext. In Proceedings of the 5th International Workshop on Algorithms and Data Structures (WADS '97). LNCS, vol. 1272, Springer-Verlag, Berlin, 160–173.
AMIR, A., AUMANN, Y., LANDAU, G., LEWENSTEIN, M., AND LEWENSTEIN, N. 1997b. Pattern matching with swaps. In Proceedings of the Foundations of Computer Science (FOCS '97), 1997, 144–153.
APOSTOLICO, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. Springer-Verlag, Berlin, 85–96.
APOSTOLICO, A. AND GALIL, Z. 1985. Combinatorial Algorithms on Words. NATO ISI Series. Springer-Verlag, Berlin.
APOSTOLICO, A. AND GALIL, Z. 1997. Pattern Matching Algorithms. Oxford University Press, Oxford, UK.
APOSTOLICO, A. AND GUERRA, C. 1987. The Longest Common Subsequence problem revisited. Algorithmica 2, 315–336.
ARAÚJO, M., NAVARRO, G., AND ZIVIANI, N. 1997. Large text searching allowing errors. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 2–20.
ARLAZAROV, V., DINIC, E., KONROD, M., AND FARADZEV, I. 1975. On economic construction of the transitive closure of a directed graph. Sov. Math. Dokl. 11, 1209–1210. Original in Russian in Dokl. Akad. Nauk SSSR 194, 1970.
ATALLAH, M., JACQUET, P., AND SZPANKOWSKI, W. 1993. A probabilistic approach to pattern matching with mismatches. Random Struct. Algor. 4, 191–213.
BAEZA-YATES, R. 1989. Efficient Text Searching. Ph.D. thesis, Dept. of Computer Science, University of Waterloo. Also as Res. Rep. CS-89-17.
BAEZA-YATES, R. 1991. Some new results on approximate string matching. In Workshop on Data Structures, Dagstuhl, Germany. Abstract.
BAEZA-YATES, R. 1992. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress. Elsevier Science, Amsterdam, vol. I, 465–476.
BAEZA-YATES, R. 1996. A unified view of string matching algorithms. In Proceedings of the Theory and Practice of Informatics (SOFSEM '96). LNCS, vol. 1175, Springer-Verlag, Berlin, 1–15.
BAEZA-YATES, R. AND GONNET, G. 1992. A new approach to text searching. Commun. ACM 35, 10, 74–82. Preliminary version in ACM SIGIR '89.
BAEZA-YATES, R. AND GONNET, G. 1994. Fast string matching with mismatches. Information and Computation 108, 2, 187–199. Preliminary version as Tech. Rep. CS-88-36, Data Structuring Group, Univ. of Waterloo, Sept. 1988.
BAEZA-YATES, R. AND NAVARRO, G. 1997. Multiple approximate string matching. In Proceedings of the 5th International Workshop on Algorithms and Data Structures (WADS '97). LNCS, vol. 1272, Springer-Verlag, Berlin, 174–184.
BAEZA-YATES, R. AND NAVARRO, G. 1998. New and faster filters for multiple approximate string matching. Tech. Rep. TR/DCC-98-10, Dept. of Computer Science, University of Chile. Random Struct. Algor., to appear. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/multi.ps.gz.
BAEZA-YATES, R. AND NAVARRO, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127–158. Preliminary versions in Proceedings of CPM '96 (LNCS, vol. 1075, 1996) and in Proceedings of WSP '96, Carleton Univ. Press, 1996.
BAEZA-YATES, R. AND NAVARRO, G. 2000. Block-addressing indices for approximate text retrieval. J. Am. Soc. Inf. Sci. (JASIS) 51, 1 (Jan.), 69–82.
BAEZA-YATES, R. AND PERLEBERG, C. 1996. Fast and practical approximate pattern matching. Information Processing Letters 59, 21–27. Preliminary version in CPM '92 (LNCS, vol. 644, 1992).
BAEZA-YATES, R. AND RÉGNIER, M. 1990. Fast algorithms for two dimensional and multiple pattern matching. In Proceedings of the Scandinavian Workshop on Algorithmic Theory (SWAT '90). LNCS, vol. 447, Springer-Verlag, Berlin, 332–347.
BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, MA.
BLUMER, A., BLUMER, J., HAUSSLER, D., EHRENFEUCHT, A., CHEN, M., AND SEIFERAS, J. 1985. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31–55.
BOYER, R. AND MOORE, J. 1977. A fast string searching algorithm. Commun. ACM 20, 10, 762–772.
CHANG, W. AND LAMPE, J. 1992. Theoretical and empirical comparisons of approximate string matching algorithms. In Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching (CPM '92). LNCS, vol. 644, Springer-Verlag, Berlin, 172–181.
CHANG, W. AND LAWLER, E. 1994. Sublinear approximate string matching and biological applications. Algorithmica 12, 4/5, 327–344. Preliminary version in FOCS '90.
CHANG, W. AND MARR, T. 1994. Approximate string matching and local similarity. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM '94). LNCS, vol. 807, Springer-Verlag, Berlin, 259–273.


CHVÁTAL, V. AND SANKOFF, D. 1975. Longest common subsequences of two random sequences. J. Appl. Probab. 12, 306–315.
COBBS, A. 1995. Fast approximate matching using suffix trees. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching (CPM '95), 41–54.
COLE, R. AND HARIHARAN, R. 1998. Approximate string matching: a simpler faster algorithm. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (SODA '98), 463–472.
COMMENTZ-WALTER, B. 1979. A string matching algorithm fast on the average. In Proc. ICALP '79. LNCS, vol. 6, Springer-Verlag, Berlin, 118–132.
CORMEN, T., LEISERSON, C., AND RIVEST, R. 1990. Introduction to Algorithms. MIT Press, Cambridge, MA.
CROCHEMORE, M. 1986. Transducers and repetitions. Theor. Comput. Sci. 45, 63–86.
CROCHEMORE, M. AND RYTTER, W. 1994. Text Algorithms. Oxford Univ. Press, Oxford, UK.
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Speeding up two string-matching algorithms. Algorithmica 12, 247–267.
DAMERAU, F. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3, 171–176.
DAS, G., FLEISHER, R., GASIENIEK, L., GUNOPULOS, D., AND KÄRKÄINEN, J. 1997. Episode matching. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM '97). LNCS, vol. 1264, Springer-Verlag, Berlin, 12–27.
DEKEN, J. 1979. Some limit results for longest common subsequences. Discrete Math. 26, 17–31.
DIXON, R. AND MARTIN, T., Eds. 1979. Automatic Speech and Speaker Recognition. IEEE Press, New York.
EHRENFEUCHT, A. AND HAUSSLER, D. 1988. A new distance metric on strings computable in linear time. Discrete Appl. Math. 20, 191–203.
ELLIMAN, D. AND LANCASTER, I. 1990. A review of segmentation and contextual analysis techniques for text recognition. Pattern Recog. 23, 3/4, 337–346.
FRENCH, J., POWELL, A., AND SCHULMAN, E. 1997. Applications of approximate word matching in information retrieval. In Proceedings of the 6th ACM International Conference on Information and Knowledge Management (CIKM '97), 9–15.
GALIL, Z. AND GIANCARLO, R. 1988. Data structures and algorithms for approximate string matching. J. Complexity 4, 33–72.
GALIL, Z. AND PARK, K. 1990. An improved algorithm for approximate string matching. SIAM J. Comput. 19, 6, 989–999. Preliminary version in ICALP '89 (LNCS, vol. 372, 1989).
GIEGERICH, R., KURTZ, S., HISCHKE, F., AND OHLEBUSCH, E. 1997. A general technique to improve filter algorithms for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 38–52. Preliminary version as Tech. Rep. 96-01, Universität Bielefeld, Germany, 1996.
GONNET, G. 1992. A tutorial introduction to Computational Biochemistry using Darwin. Tech. rep., Informatik E.T.H., Zuerich, Switzerland.
GONNET, G. AND BAEZA-YATES, R. 1991. Handbook of Algorithms and Data Structures, 2d ed. Addison-Wesley, Reading, MA.
GONZÁLEZ, R. AND THOMASON, M. 1978. Syntactic Pattern Recognition. Addison-Wesley, Reading, MA.
GOSLING, J. 1991. A redisplay algorithm. In Proceedings of the ACM SIGPLAN/SIGOA Symposium on Text Manipulation, 123–129.
GROSSI, R. AND LUCCIO, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113–120.
GUSFIELD, D. 1997. Algorithms on Strings, Trees and Sequences. Cambridge Univ. Press, Cambridge.
HALL, P. AND DOWLING, G. 1980. Approximate string matching. ACM Comput. Surv. 12, 4, 381–402.
HAREL, D. AND TARJAN, E. 1984. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13, 2, 338–355.
HECKEL, P. 1978. A technique for isolating differences between files. Commun. ACM 21, 4, 264–268.
HOLSTI, N. AND SUTINEN, E. 1994. Approximate string matching using q-gram places. In Proceedings of the 7th Finnish Symposium on Computer Science. Univ. of Joensuu, 23–32.
HOPCROFT, J. AND ULLMAN, J. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
HORSPOOL, R. 1980. Practical fast searching in strings. Software Practice Exper. 10, 501–506.
JOKINEN, P. AND UKKONEN, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 2nd Mathematical Foundations of Computer Science (MFCS '91). Springer-Verlag, Berlin, vol. 16, 240–248.
JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. A comparison of approximate string matching algorithms. Software Practice Exper. 26, 12, 1439–1458. Preliminary version as Tech. Rep. A-1991-7, Dept. of Computer Science, Univ. of Helsinki, 1991.
KARLOFF, H. 1993. Fast algorithms for approximately counting mismatches. Inf. Process. Lett. 48, 53–60.
KECECIOGLU, J. AND SANKOFF, D. 1995. Exact and approximation algorithms for the inversion distance between two permutations. Algorithmica 13, 180–210.
KNUTH, D. 1973. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, MA.


KNUTH, D., MORRIS, J., JR., AND PRATT, V. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 1, 323–350.
KUKICH, K. 1992. Techniques for automatically correcting words in text. ACM Comput. Surv. 24, 4, 377–439.
KUMAR, S. AND SPAFFORD, E. 1994. A pattern-matching model for intrusion detection. In Proceedings of the National Computer Security Conference, 11–21.
KURTZ, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). Carleton Univ. Press, 156–170.
KURTZ, S. AND MYERS, G. 1997. Estimating the probability of approximate matches. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM '97). LNCS, vol. 1264, Springer-Verlag, Berlin, 52–64.
LANDAU, G. AND VISHKIN, U. 1988. Fast string matching with k differences. J. Comput. Syst. Sci. 37, 63–78. Preliminary version in FOCS '85.
LANDAU, G. AND VISHKIN, U. 1989. Fast parallel and serial approximate string matching. J. Algor. 10, 157–169. Preliminary version in ACM STOC '86.
LANDAU, G., MYERS, E., AND SCHMIDT, J. 1998. Incremental string comparison. SIAM J. Comput. 27, 2, 557–582.
LAWRENCE, S. AND GILES, C. L. 1999. Accessibility of information on the web. Nature 400, 107–109.
LEE, J., KIM, D., PARK, K., AND CHO, Y. 1997. Efficient algorithms for approximate string matching with swaps. In Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM '97). LNCS, vol. 1264, Springer-Verlag, Berlin, 28–39.
LEVENSHTEIN, V. 1965. Binary codes capable of correcting spurious insertions and deletions of ones. Probl. Inf. Transmission 1, 8–17.
LEVENSHTEIN, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 8, 707–710. Original in Russian in Dokl. Akad. Nauk SSSR 163, 4, 845–848, 1965.
LIPTON, R. AND LOPRESTI, D. 1985. A systolic array for rapid string comparison. In Proceedings of the Chapel Hill Conference on VLSI, 363–376.
LOPRESTI, D. AND TOMKINS, A. 1994. On the searchability of electronic ink. In Proceedings of the 4th International Workshop on Frontiers in Handwriting Recognition, 156–165.
LOPRESTI, D. AND TOMKINS, A. 1997. Block edit models for approximate string matching. Theor. Comput. Sci. 181, 1, 159–179.
LOWRANCE, R. AND WAGNER, R. 1975. An extension of the string-to-string correction problem. J. ACM 22, 177–183.
LUCZAK, T. AND SZPANKOWSKI, W. 1997. A suboptimal lossy data compression based on approximate pattern matching. IEEE Trans. Inf. Theor. 43, 1439–1451.
MANBER, U. AND WU, S. 1994. GLIMPSE: A tool to search through entire file systems. In Proceedings of the USENIX Technical Conference. USENIX Association, Berkeley, CA, 23–32. Preliminary version as Tech. Rep. 93-34, Dept. of Computer Science, Univ. of Arizona, Oct. 1993.
MASEK, W. AND PATERSON, M. 1980. A faster algorithm for computing string edit distances. J. Comput. Syst. Sci. 20, 18–31.
MASTERS, H. 1927. A study of spelling errors. Univ. of Iowa Studies in Educ. 4, 4.
MCCREIGHT, E. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262–272.
MELICHAR, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition (ICPR '96). IEEE CS Press, Silver Spring, MD, 256–260. Preliminary version in Computer Analysis of Images and Patterns (LNCS, vol. 970, 1995).
MORRISON, D. 1968. PATRICIA—Practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4, 514–534.
MUTH, R. AND MANBER, U. 1996. Approximate multiple string search. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). LNCS, vol. 1075, Springer-Verlag, Berlin, 75–86.
MYERS, G. 1994a. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345–374. Preliminary version as Tech. Rep. TR90-25, Computer Science Dept., Univ. of Arizona, Sept. 1991.
MYERS, G. 1994b. Algorithmic Advances for Searching Biosequence Databases. Plenum Press, New York, 121–135.
MYERS, G. 1986a. Incremental alignment algorithms and their applications. Tech. Rep. 86-22, Dept. of Computer Science, Univ. of Arizona.
MYERS, G. 1986b. An O(ND) difference algorithm and its variations. Algorithmica 1, 251–266.
MYERS, G. 1991. An overview of sequence comparison algorithms in molecular biology. Tech. Rep. TR-91-29, Dept. of Computer Science, Univ. of Arizona.
MYERS, G. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395–415. Earlier version in Proceedings of CPM '98 (LNCS, vol. 1448).
NAVARRO, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 125–139.
NAVARRO, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 112–124.


NAVARRO, G. 1998. Approximate Text Searching. Ph.D. thesis, Dept. of Computer Science, Univ. of Chile. Tech. Rep. TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
NAVARRO, G. 2000a. Improved approximate pattern matching on hypertext. Theor. Comput. Sci. 237, 455–463. Previous version in Proceedings of LATIN '98 (LNCS, vol. 1380).
NAVARRO, G. 2000b. Nrgrep: A fast and flexible pattern matching tool. Tech. Rep. TR/DCC-2000-3, Dept. of Computer Science, Univ. of Chile, Aug. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/nrgrep.ps.gz.
NAVARRO, G. AND BAEZA-YATES, R. 1998a. Improving an algorithm for approximate pattern matching. Tech. Rep. TR/DCC-98-5, Dept. of Computer Science, Univ. of Chile. Algorithmica, to appear. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/dexp.ps.gz.
NAVARRO, G. AND BAEZA-YATES, R. 1998b. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1, 2. http://www.clei.cl.
NAVARRO, G. AND BAEZA-YATES, R. 1999a. Fast multi-dimensional approximate pattern matching. In Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM '99). LNCS, vol. 1645, Springer-Verlag, Berlin, 243–257. Extended version to appear in J. Discrete Algor. (JDA).
NAVARRO, G. AND BAEZA-YATES, R. 1999b. A new indexing method for approximate string matching. In Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM '99). LNCS, vol. 1645, Springer-Verlag, Berlin, 163–185. Extended version to appear in J. Discrete Algor. (JDA).
NAVARRO, G. AND BAEZA-YATES, R. 1999c. Very fast and simple approximate string matching. Inf. Process. Lett. 72, 65–70.
NAVARRO, G. AND RAFFINOT, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algor. 5, 4. Previous version in Proceedings of CPM '98. Lecture Notes in Computer Science, Springer-Verlag, New York.
NAVARRO, G., MOURA, E., NEUBERT, M., ZIVIANI, N., AND BAEZA-YATES, R. 2000. Adding compression to block addressing inverted indexes. Kluwer Inf. Retrieval J. 3, 1, 49–77.
NEEDLEMAN, S. AND WUNSCH, C. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 444–453.
NESBIT, J. 1986. The accuracy of approximate string matching algorithms. J. Comput.-Based Instr. 13, 3, 80–83.
OWOLABI, O. AND MCGREGOR, R. 1988. Fast approximate string matching. Software Practice Exper. 18, 4, 387–393.
RÉGNIER, M. AND SZPANKOWSKI, W. 1997. On the approximate pattern occurrence in a text. In Proceedings of Compression and Complexity of SEQUENCES '97. IEEE Press, New York.
RIVEST, R. 1976. Partial-match retrieval algorithms. SIAM J. Comput. 5, 1.
SAHINALP, S. AND VISHKIN, U. 1997. Approximate pattern matching using locally consistent parsing. Manuscript, Univ. of Maryland Institute for Advanced Computer Studies (UMIACS).
SANKOFF, D. 1972. Matching sequences under deletion/insertion constraints. In Proceedings of the National Academy of Sciences of the USA, vol. 69, 4–6.
SANKOFF, D. AND KRUSKAL, J., Eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
SANKOFF, D. AND MAINVILLE, S. 1983. Common Subsequences and Monotone Subsequences. Addison-Wesley, Reading, MA, 363–365.
SCHIEBER, B. AND VISHKIN, U. 1988. On finding lowest common ancestors: simplification and parallelization. SIAM J. Comput. 17, 6, 1253–1262.
SELLERS, P. 1974. On the theory and computation of evolutionary distances. SIAM J. Appl. Math. 26, 787–793.
SELLERS, P. 1980. The theory and computation of evolutionary distances: pattern recognition. J. Algor. 1, 359–373.
SHI, F. 1996. Fast approximate string matching with q-blocks sequences. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). Carleton Univ. Press, 257–271.
SUNDAY, D. 1990. A very fast substring search algorithm. Commun. ACM 33, 8, 132–142.
SUTINEN, E. 1998. Approximate Pattern Matching with the q-Gram Family. Ph.D. thesis, Dept. of Computer Science, Univ. of Helsinki, Finland. Tech. Rep. A-1998-3.
SUTINEN, E. AND TARHIO, J. 1995. On using q-gram locations in approximate string matching. In Proceedings of the 3rd Annual European Symposium on Algorithms (ESA '95). LNCS, vol. 979, Springer-Verlag, Berlin, 327–340.
SUTINEN, E. AND TARHIO, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). LNCS, vol. 1075, Springer-Verlag, Berlin, 50–61.
TAKAOKA, T. 1994. Approximate pattern matching with samples. In Proceedings of ISAAC '94. LNCS, vol. 834, Springer-Verlag, Berlin, 234–242.
TARHIO, J. AND UKKONEN, E. 1988. A greedy approximation algorithm for constructing shortest common superstrings. Theor. Comput. Sci. 57, 131–145.


TARHIO, J. AND UKKONEN, E. 1993. Approximate Boyer–Moore string matching. SIAM J. Comput. 22, 2, 243–260. Preliminary version in SWAT '90 (LNCS, vol. 447, 1990).
TICHY, W. 1984. The string-to-string correction problem with block moves. ACM Trans. Comput. Syst. 2, 4, 309–321.
UKKONEN, E. 1985a. Algorithms for approximate string matching. Information and Control 64, 100–118. Preliminary version in Proceedings of the International Conference Foundations of Computation Theory (LNCS, vol. 158, 1983).
UKKONEN, E. 1985b. Finding approximate patterns in strings. J. Algor. 6, 132–137.
UKKONEN, E. 1992. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci. 1, 191–211.
UKKONEN, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching (CPM '93), 228–242.
UKKONEN, E. 1995. Constructing suffix trees on-line in linear time. Algorithmica 14, 3, 249–260.
UKKONEN, E. AND WOOD, D. 1993. Approximate string matching with suffix automata. Algorithmica 10, 353–364. Preliminary version as Rep. A-1990-4, Dept. of Computer Science, Univ. of Helsinki, Apr. 1990.
ULLMAN, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Comput. J. 10, 141–147.
VINTSYUK, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 52–58.
WAGNER, R. AND FISHER, M. 1974. The string to string correction problem. J. ACM 21, 168–178.
WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.
WEINER, P. 1973. Linear pattern matching algorithms. In Proceedings of the IEEE Symposium on Switching and Automata Theory, 1–11.
WRIGHT, A. 1994. Approximate string matching using within-word parallelism. Software Practice Exper. 24, 4, 337–362.
WU, S. AND MANBER, U. 1992a. Agrep—a fast approximate pattern-matching tool. In Proceedings of the USENIX Technical Conference. USENIX Association, Berkeley, CA, 153–162.
WU, S. AND MANBER, U. 1992b. Fast text searching allowing errors. Commun. ACM 35, 10, 83–91.
WU, S., MANBER, U., AND MYERS, E. 1995. A subquadratic algorithm for approximate regular expression matching. J. Algor. 19, 3, 346–360.
WU, S., MANBER, U., AND MYERS, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50–67. Preliminary version as Tech. Rep. TR29-36, Computer Science Dept., Univ. of Arizona, 1992.
YAO, A. 1979. The complexity of pattern matching for a random string. SIAM J. Comput. 8, 368–387.
YAP, T., FRIEDER, O., AND MARTINO, R. 1996. High Performance Computational Methods for Biological Sequence Analysis. Kluwer Academic Publishers, Dordrecht.
ZOBEL, J. AND DART, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th ACM International Conference on Information Retrieval (SIGIR '96), 166–172.

Received August 1999; revised March 2000; accepted May 2000
