
Algorithms and Data
Structures
Part 5: String Matching (Wikipedia Book
2014)
By Wikipedians
Editors: Reiner Creutzburg, Jenny Knackmuß
Contents

Articles
String Matching
    String (computer science)
    String searching algorithm
    Knuth–Morris–Pratt algorithm
    Boyer–Moore string search algorithm
References
    Article Sources and Contributors
    Image Sources, Licenses and Contributors
Article Licenses
    License
String Matching
String (computer science)
In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some
kind of variable. The latter may allow its elements to be mutated and/or the length changed, or it may be fixed (after
creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words)
that stores a sequence of elements, typically characters, using some character encoding. A string may also denote
more general arrays or other sequence (or list) data types and structures.
Depending on the programming language and the precise data type used, a variable declared to be a string may either cause
storage in memory to be statically allocated for a predetermined maximum length or employ dynamic allocation to
allow it to hold a variable number of elements.
When a string appears literally in source code, it is known as a string literal and has a representation that denotes it as
such.
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite
sequence of symbols that are chosen from a set called an alphabet.
Formal theory
Let Σ be a non-empty finite set of symbols (alternatively called characters), called the alphabet. No assumption is
made about the nature of the symbols. A string (or word) over Σ is any finite sequence of symbols from Σ. For
example, if Σ = {0, 1}, then 01011 is a string over Σ.
The length of a string is the number of symbols in the string (the length of the sequence) and can be any
non-negative integer. The empty string is the unique string over Σ of length 0, and is denoted ε or λ.
The set of all strings over Σ of length n is denoted Σⁿ. For example, if Σ = {0, 1}, then
Σ² = {00, 01, 10, 11}. Note that Σ⁰ = {ε} for any alphabet Σ.
The set of all strings over Σ of any length is the Kleene closure of Σ and is denoted Σ*. In terms of Σⁿ,

    Σ* = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ Σ³ ∪ ...

For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, ...}. Although Σ* itself is countably
infinite, all elements of Σ* have finite length.
A set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. For example, if Σ = {0, 1}, the set
of strings with an even number of zeros ({ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001, 1010,
1100, 1111, ...}) is a formal language over Σ.
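These sets are small enough to enumerate directly for short lengths. The following Python sketch (illustrative only; the names are ours, not part of the formal theory) lists Σⁿ and an initial portion of Σ*:

    from itertools import product

    sigma = "01"  # the alphabet Σ = {0, 1}

    def strings_of_length(n):
        # The set Σ^n: all strings over sigma of length n.
        return ["".join(p) for p in product(sigma, repeat=n)]

    print(strings_of_length(2))  # ['00', '01', '10', '11']

    # Σ* is the union of Σ^n over all n >= 0; it is infinite, so we
    # can only ever list a finite initial portion of it.
    print([s for n in range(3) for s in strings_of_length(n)])
    # ['', '0', '1', '00', '01', '10', '11']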
Concatenation and substrings
Concatenation is an important binary operation on Σ*. For any two strings s and t in Σ*, their concatenation is
defined as the sequence of symbols in s followed by the sequence of characters in t, and is denoted st. For example,
if Σ = {a, b, ..., z}, s = bear, and t = hug, then st = bearhug and ts = hugbear.
String concatenation is an associative, but non-commutative operation. The empty string ε serves as the identity
element; for any string s, εs = sε = s. Therefore, the set Σ* and the concatenation operation form a monoid, the free
monoid generated by Σ. In addition, the length function defines a monoid homomorphism from Σ* to the
non-negative integers (that is, a function L: Σ* → ℕ ∪ {0} such that L(st) = L(s) + L(t)).
A string s is said to be a substring or factor of t if there exist (possibly empty) strings u and v such that t = usv. The
relation "is a substring of" defines a partial order on Σ*, the least element of which is the empty string.
Prefixes and suffixes
A string s is said to be a prefix of t if there exists a string u such that t = su. If u is nonempty, s is said to be a proper
prefix of t. Symmetrically, a string s is said to be a suffix of t if there exists a string u such that t = us. If u is
nonempty, s is said to be a proper suffix of t. Suffixes and prefixes are substrings of t.
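In most programming languages these relations correspond to built-in operations; a brief Python illustration of the bear/hug example above:

    s, t = "bear", "hug"
    print(s + t, t + s)                # bearhug hugbear: concatenation is non-commutative
    print((s + t) + s == s + (t + s))  # True: concatenation is associative
    print("" + s == s + "" == s)       # True: the empty string is the identity element

    print("ear" in s + t)              # True: "ear" is a substring (factor) of "bearhug"
    print((s + t).startswith(s))       # True: "bear" is a proper prefix of "bearhug"
    print((s + t).endswith(t))         # True: "hug" is a proper suffix of "bearhug"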
Rotations
A string s = uv is said to be a rotation of t if t = vu. For example, if Σ = {0, 1} the string 0011001 is a rotation of
0100110, where u = 00110 and v = 01.
Reversal
The reverse of a string is a string with the same symbols but in reverse order. For example, if s = abc (where a, b, and
c are symbols of the alphabet), then the reverse of s is cba. A string that is the reverse of itself (e.g., s = madam) is
called a palindrome, which also includes the empty string and all strings of length 1.
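Both definitions have compact computational tests; a Python sketch (the string-doubling trick for rotations is a standard idiom, not part of the definition):

    def is_rotation(t, s):
        # s is a rotation of t exactly when both have the same length
        # and s occurs inside t concatenated with itself.
        return len(s) == len(t) and s in t + t

    print(is_rotation("0100110", "0011001"))  # True: u = 00110, v = 01
    print("abc"[::-1])                        # cba: reversal
    print("madam" == "madam"[::-1])           # True: a palindrome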
Lexicographical ordering
It is often useful to define an ordering on a set of strings. If the alphabet Σ has a total order (cf. alphabetical order)
one can define a total order on Σ* called lexicographical order. For example, if Σ = {0, 1} and 0 < 1, then the
lexicographical order on Σ* includes the relationships ε < 0 < 00 < 000 < ... < 0001 < 001 < 01 < 010 < 011 < 0110 <
01111 < 1 < 10 < 100 < 101 < 111 < 1111 < 11111 ...
String operations
A number of additional operations on strings commonly occur in the formal theory. These are given in the article on
string operations.
Topology
Strings admit the following interpretation as nodes on a graph:
• Fixed-length strings can be viewed as nodes on a hypercube
• Variable-length strings (of finite length) can be viewed as nodes on the k-ary tree, where k is the number of symbols in Σ
• Infinite strings can be viewed as infinite paths on the k-ary tree.
The natural topology on the set of fixed-length strings or variable length strings is the discrete topology, but the
natural topology on the set of infinite strings is the limit topology, viewing the set of infinite strings as the inverse
limit of the sets of finite strings. This is the construction used for the p-adic numbers and some constructions of the
Cantor set, and yields the same topology.
Isomorphisms between string representations of topologies can be found by normalizing according to the
lexicographically minimal string rotation.
String datatypes
A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful
datatype that they are implemented in nearly every programming language. In some languages they are available as
primitive types and in others as composite types. The syntax of most high-level programming languages allows for a
string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal
or string literal.
String length
Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often
constrained to an artificial maximum. In general, there are two types of string datatypes: fixed-length strings, which
have a fixed maximum length and which use the same amount of memory whether this maximum is reached or not,
and variable-length strings, whose length is not arbitrarily fixed and which use varying amounts of memory
depending on their actual size. Most strings in modern programming languages are variable-length strings. Despite
the name, even variable-length strings are limited in length, although, in general, the limit depends only on the
amount of memory available. The string length can be stored as a separate integer (which puts a theoretical limit on
the length) or implicitly through a termination character, usually a character value with all bits zero. See also
"Null-terminated" below.
Character encoding
String datatypes have historically allocated one byte per character, and, although the exact character set varied by
region, character encodings were similar enough that programmers could often get away with ignoring this, since
characters a program treated specially (such as period and space and comma) were in the same place in all the
encodings a program would encounter. These character sets were typically based on ASCII or EBCDIC.
Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256
characters (the limit of an encoding using a single 8-bit byte per character) for reasonable representation. The normal solutions
involved keeping single-byte representations for ASCII and using two-byte representations for CJK ideographs. Use
of these with existing code led to problems with matching and cutting of strings, the severity of which depended on
how the character encoding was designed. Some encodings such as the EUC family guarantee that a byte value in the
ASCII range will represent only that ASCII character, making the encoding safe for systems that use those characters
as field separators. Other encodings such as ISO-2022 and Shift-JIS do not make such guarantees, making matching
on byte codes unsafe. These encodings also were not "self-synchronizing", so that locating character boundaries
required backing up to the start of a string, and pasting two strings together could result in corruption of the second
string (these problems were much reduced with EUC, as any ASCII character synchronizes the encoding).
Unicode has simplified the picture somewhat. Most programming languages now have a datatype for Unicode
strings. Unicode's preferred byte stream format UTF-8 is designed not to have the problems described above for
older multibyte encodings. UTF-8, UTF-16 and UTF-32 all require the programmer to know that the fixed-size code
units are different from the "characters"; the main difficulty currently is incorrectly designed APIs that attempt to
hide this difference.
Implementations
Some languages like C++ implement strings as templates that can be used with any datatype, but this is the
exception, not the rule.
Some languages, such as C++ and Ruby, normally allow the contents of a string to be changed after it has been
created; these are termed mutable strings. In other languages, such as Java and Python, the value is fixed and a new
string must be created if any alteration is to be made; these are termed immutable strings.
Strings are typically implemented as arrays of bytes, characters, or code units, in order to allow fast access to
individual units or substrings (including characters when they have a fixed length). A few languages such as Haskell
implement them as linked lists instead.
Some languages, such as Prolog and Erlang, avoid implementing a dedicated string datatype at all, instead adopting
the convention of representing strings as lists of character codes.
Representations
Representations of strings depend heavily on the choice of character repertoire and the method of character
encoding. Older string implementations were designed to work with repertoire and encoding defined by ASCII, or
more recent extensions like the ISO 8859 series. Modern implementations often use the extensive repertoire defined
by Unicode along with a variety of complex encodings such as UTF-8 and UTF-16.
The term bytestring usually indicates a general-purpose string of bytes, rather than strings of only (readable)
characters, strings of bits, or such. Byte strings often imply that bytes can take any value and any data can be stored
as-is, meaning that there should be no value interpreted as a termination value.
Most string implementations are very similar to variable-length arrays with the entries storing the character codes of
corresponding characters. The principal difference is that, with certain encodings, a single logical character may take
up more than one entry in the array. This happens for example with UTF-8, where single codes (UCS code points)
can take anywhere from one to four bytes, and single characters can take an arbitrary number of codes. In these
cases, the logical length of the string (number of characters) differs from the logical length of the array (number of
bytes in use). UTF-32 is the only Unicode encoding that avoids this problem.
Null-terminated
The length of a string can be stored implicitly by using a special terminating character; often this is the null character
(NUL), which has all bits zero, a convention used and perpetuated by the popular C programming language. Hence,
this representation is commonly referred to as a C string.
In terminated strings, the terminating code is not an allowable character in any string. Strings with a length field do not
have this limitation and can also store arbitrary binary data. In C, two things are needed to handle binary data: a
character pointer and the length of the data.
An example of a null-terminated string stored in a 10-byte buffer, along with its ASCII (or more modern UTF-8)
representation as 8-bit hexadecimal numbers is:
 F    R    A    N    K   NUL   k    e    f    w
 46   52   41   4E   4B   00   6B   65   66   77
The length of the string in the above example, "FRANK", is 5 characters, but it occupies 6 bytes. Characters after the
terminator do not form part of the representation; they may be either part of another string or just garbage. (Strings of
this form are sometimes called ASCIZ strings, after the original assembly language directive used to declare them.)
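The byte-level convention can be mimicked in any language with byte buffers; a Python sketch of the buffer above (c_strlen is a hypothetical helper, not a standard function):

    buf = bytes([0x46, 0x52, 0x41, 0x4E, 0x4B, 0x00, 0x6B, 0x65, 0x66, 0x77])

    def c_strlen(buffer):
        # Scan for the NUL terminator, as C's strlen does.
        n = 0
        while buffer[n] != 0:
            n += 1
        return n

    n = c_strlen(buf)
    print(n)                        # 5
    print(buf[:n].decode("ascii"))  # FRANK; the bytes after NUL are ignored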
Length-prefixed
The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value
(a convention used in many Pascal dialects); as a consequence, some people call it a P-string. Storing the string
length as a byte limits the maximum string length to 255. To avoid such limitations, improved implementations of
P-strings use 16-, 32-, or 64-bit words to store the string length. When the length field covers the address space,
strings are limited only by the available memory.
Here is the equivalent Pascal string stored in a 10-byte buffer, along with its ASCII / UTF-8 representation:
length   F    R    A    N    K    k    e    f    w
  05    46   52   41   4E   4B   6B   65   66   77
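Decoding such a buffer is a two-step read: take the length byte, then that many following bytes. A Python sketch of the buffer above:

    buf = bytes([0x05, 0x46, 0x52, 0x41, 0x4E, 0x4B, 0x6B, 0x65, 0x66, 0x77])

    n = buf[0]                           # the length prefix: 5
    print(buf[1:1 + n].decode("ascii"))  # FRANK; the trailing bytes are ignored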
Strings as records
Many languages, including object-oriented ones, implement strings as records in a structure like:
class string {
    int length;
    char *text;
};
This implementation is usually hidden and accessed through member functions. The "text" will be a dynamically
allocated memory area that might be expanded as needed. See also string (C++).
Linked-list
Both character termination and length codes limit strings: for example, C character arrays that contain null (NUL)
characters cannot be handled directly by C string library functions, while strings using a length code are limited to the
maximum value of the length code.
Both of these limitations can be overcome by clever programming, of course, but such workarounds are by definition
not standard.
Rough equivalents of the C termination method have historically appeared in both hardware and software. For
example, "data processing" machines like the IBM 1401 used a special word mark bit to delimit strings at the left,
where the operation would start at the right. This meant that, while the IBM 1401 had a seven-bit word in "reality",
almost no-one ever thought to use this as a feature, and override the assignment of the seventh bit to (for example)
handle ASCII codes.
It is possible to create data structures and functions that manipulate them that do not have the problems associated
with character termination and can in principle overcome length code bounds. It is also possible to optimize the
string represented using techniques from run length encoding (replacing repeated characters by the character value
and a length) and Hamming encoding.
While these representations are common, others are possible. Using ropes makes certain string operations, such as
insertions, deletions, and concatenations more efficient.
Security concerns
The differing memory layout and storage requirements of strings can affect the security of the program accessing the
string data. String representations requiring a terminating character are commonly susceptible to buffer overflow
problems if the terminating character is not present, caused by a coding error or an attacker deliberately altering the
data. String representations adopting a separate length field are also susceptible if the length can be manipulated. In
such cases, program code accessing the string data requires bounds checking to ensure that it does not inadvertently
access or change data outside of the string memory limits.
String data is frequently obtained from user-input to a program. As such, it is the responsibility of the program to
validate the string to ensure that it represents the expected format. Performing limited or no validation of user-input
can cause a program to be vulnerable to code injection attacks.
Text file strings
In computer readable text files, for example programming language source files or configuration files, strings can be
represented. The NUL byte is normally not used as terminator since that does not correspond to the ASCII text
standard, and the length is usually not stored, since the file should be human editable without bugs.
Two common representations are:
• Surrounded by quotation marks (ASCII 22 in hexadecimal), used by most programming languages. To be able to include quotation marks, newline characters etc., escape sequences are often available, usually using the backslash character (ASCII 5C in hexadecimal).
• Terminated by a newline sequence, for example in Windows INI files.
Non-text strings
While character strings are very common uses of strings, a string in computer science may refer generically to any
sequence of homogeneously typed data. A string of bits or bytes, for example, may be used to represent non-textual
binary data retrieved from a communications medium. This data may or may not be represented by a string-specific
datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the
programming language being used. If the programming language's string implementation is not 8-bit clean, data
corruption may ensue.
String processing algorithms
There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms
include:
• String searching algorithms for finding a given substring or pattern
• String manipulation algorithms
• Sorting algorithms
• Regular expression algorithms
• Parsing a string
• Sequence mining
Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and
finite state machines.
Character string-oriented languages and utilities
Character strings are such a useful datatype that several languages have been designed in order to make string
processing applications easy to write. Examples include the following languages:
awk
Icon
MUMPS
Perl
Rexx
Ruby
sed
SNOBOL
Tcl
TTM
Many Unix utilities perform simple string manipulations and can be used to easily program some powerful string
processing algorithms. Files and finite streams may be viewed as strings.
Some APIs like Multimedia Control Interface, embedded SQL or printf use strings to hold commands that will be
interpreted.
Recent scripting programming languages, including Perl, Python, Ruby, and Tcl employ regular expressions to
facilitate text operations.
Some languages such as Perl and Ruby support string interpolation, which permits arbitrary expressions to be
evaluated and included in string literals.
Character string functions
String functions are used to manipulate a string or change or edit the contents of a string. They also are used to query
information about a string. They are usually used within the context of a computer programming language.
The most basic example of a string function is the length(string) function, which returns the length of a
string (not counting any terminator characters or any of the string's internal structural information) and does not
modify the string. For example, length("hello world") returns 11.
Many string functions exist in other languages with similar or exactly the same syntax or parameters. For
example, in many languages the length function is represented as len(string). Even though string functions are
very useful, a programmer using them should be mindful that a string function in one language could behave
differently in another, or have a similar or completely different name, parameters, syntax, and results.
String searching algorithm
In computer science, string searching algorithms, sometimes called string matching algorithms, are an important
class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a
larger string or text.
Let Σ be an alphabet (finite set). Formally, both the pattern and searched text are vectors of elements of Σ. The Σ
may be a usual human alphabet (for example, the letters A through Z in the Latin alphabet). Other applications may
use a binary alphabet (Σ = {0,1}) or a DNA alphabet (Σ = {A,C,G,T}) in bioinformatics.
In practice, how the string is encoded can affect the feasible string search algorithms. In particular if a variable width
encoding is in use then it is slow (time proportional to N) to find the Nth character. This will significantly slow down
many of the more advanced search algorithms. A possible solution is to search for the sequence of code units instead,
but doing so may produce false matches unless the encoding is specifically designed to avoid it.
Basic classification
The various algorithms can be classified by the number of patterns each uses.
Single pattern algorithms
Let m be the length of the pattern and let n be the length of the searchable text.
Algorithm                                                     Preprocessing time     Matching time [1]
Naïve string search algorithm                                 0 (no preprocessing)   Θ((n-m+1) m)
Rabin–Karp string search algorithm                            Θ(m)                   average Θ(n+m), worst Θ((n-m+1) m)
Finite-state automaton based search                           Θ(m |Σ|)               Θ(n)
Knuth–Morris–Pratt algorithm                                  Θ(m)                   Θ(n)
Boyer–Moore string search algorithm                           Θ(m + |Σ|)             Ω(n/m), O(nm)
Bitap algorithm (shift-or, shift-and, Baeza–Yates–Gonnet)     Θ(m + |Σ|)             O(mn)

[1] Asymptotic times are expressed using O, Ω, and Θ notation.
The Boyer–Moore string search algorithm has been the standard benchmark for the practical string search
literature.
Algorithms using a finite set of patterns
• Aho–Corasick string matching algorithm
• Commentz-Walter algorithm
• Rabin–Karp string search algorithm
Algorithms using an infinite number of patterns
Naturally, the patterns cannot be enumerated in this case. They are usually represented by a regular grammar or
regular expression.
Other classification
Other classification approaches are possible. One of the most common uses preprocessing as its main criterion.
Classes of string searching algorithms [1]

                            Text not preprocessed        Text preprocessed
Patterns not preprocessed   Elementary algorithms        Index methods
Patterns preprocessed       Constructed search engines   Signature methods
Naïve string search
The simplest and least efficient way to see where one string occurs inside another is to check each place it could be,
one by one, to see if it's there. So first we see if there's a copy of the needle in the first character of the haystack; if
not, we look to see if there's a copy of the needle starting at the second character of the haystack; if not, we look
starting at the third character, and so forth. In the normal case, we only have to look at one or two characters for each
wrong position to see that it is a wrong position, so in the average case, this takes O(n + m) steps, where n is the
length of the haystack and m is the length of the needle; but in the worst case, searching for a string like "aaaab" in a
string like "aaaaaaaaab", it takes O(nm) steps.
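A short Python sketch of this naive search (illustrative only; Python's built-in str.find plays the same role in practice):

    def naive_search(haystack, needle):
        # Try every alignment of the needle, left to right.
        n, m = len(haystack), len(needle)
        for start in range(n - m + 1):
            for j in range(m):
                if haystack[start + j] != needle[j]:
                    break               # mismatch: slide to the next alignment
            else:
                return start            # all m characters matched
        return -1                       # no occurrence

    print(naive_search("aaaaaaaaab", "aaaab"))  # 5, after many wasted comparisons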
Finite state automaton based search
In this approach, we avoid backtracking by constructing a deterministic finite automaton (DFA) that recognizes the
stored search string. These are expensive to construct (they are usually created using the powerset construction) but
are very quick to use.
Stubs
Knuth–Morris–Pratt computes a DFA that recognizes inputs with the string to search for as a suffix, Boyer–Moore
starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step.
Baeza–Yates keeps track of whether the previous j characters were a prefix of the search string, and is therefore
adaptable to fuzzy string searching. The bitap algorithm is an application of Baeza–Yates' approach.
Index methods
Faster search algorithms are based on preprocessing of the text. After building a substring index, for example a
suffix tree or suffix array, the occurrences of a pattern can be found quickly. As an example, a suffix tree can be
built in Θ(n) time, and all z occurrences of a pattern can be found in O(m + z) time, under the assumption that the
alphabet has a constant size and all inner nodes in the suffix tree know what leaves are underneath them. The latter
can be accomplished by running a DFS algorithm from the root of the suffix tree.
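A minimal sketch of the index idea, using a sorted suffix array in place of a suffix tree (a toy quadratic construction; linear-time constructions exist; requires Python 3.10+ for bisect's key argument):

    from bisect import bisect_left, bisect_right

    def build_suffix_array(text):
        # Toy O(n^2 log n) construction: sort suffix start positions.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def occurrences(text, sa, pattern):
        # Suffixes beginning with the pattern form a contiguous block in sa.
        key = lambda i: text[i:i + len(pattern)]
        lo = bisect_left(sa, pattern, key=key)
        hi = bisect_right(sa, pattern, key=key)
        return sorted(sa[lo:hi])

    text = "banana"
    sa = build_suffix_array(text)
    print(occurrences(text, sa, "ana"))  # [1, 3]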
Other variants
Some search methods, for instance trigram search, are intended to find a "closeness" score between the search string
and the text rather than a "match/non-match". These are sometimes called "fuzzy" searches.
Academic conferences on text searching
• Combinatorial Pattern Matching (CPM), a conference on combinatorial algorithms for strings, sequences, and trees.
• String Processing and Information Retrieval (SPIRE), an annual symposium on string processing and information retrieval.
• Prague Stringology Conference (PSC), an annual conference on algorithms on strings and sequences.
• Competition on Applied Text Searching (CATS), an annual series of evaluations of text searching algorithms.
References
[1] Melichar, Borivoj, Jan Holub, and J. Polcar. Text Searching Algorithms. Volume I: Forward String Matching. Vol. 1. 2 vols., 2005. http://stringology.org/athens/TextSearchingAlgorithms/.
R. S. Boyer and J. S. Moore, A fast string searching algorithm (http://www.cs.utexas.edu/~moore/publications/fstrpos.pdf), Comm. ACM 20 (10), 762–772 (1977).
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 32: String Matching, pp. 906–932.
External links
• Huge (maintained) list of pattern matching links (http://www.cs.ucr.edu/~stelo/pattern.html) Last updated: 12/27/2008 20:18:38
• StringSearch: high-performance pattern matching algorithms in Java (http://johannburkard.de/software/stringsearch/), implementations of many string matching algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
• Exact String Matching Algorithms (http://www-igm.univ-mlv.fr/~lecroq/string/index.html): animation in Java, detailed description and C implementation of many algorithms.
• Boyer-Moore-Raita-Thomas (http://www.concentric.net/~Ttwang/tech/stringscan.htm)
• (PDF) Improved Single and Multiple Approximate String Matching (http://www.cs.ucr.edu/~stelo/cpm/cpm04/35_Navarro.pdf)
• Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2647288/)
Knuth–Morris–Pratt algorithm
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for
occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch
occurs, the word itself embodies sufficient information to determine where the next match could begin, thus
bypassing re-examination of previously matched characters.
The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris.
The three published it jointly in 1977.
Background
A string matching algorithm wants to find the starting index m in string S[] that matches the search word W[].
The most straightforward algorithm is to look for a character match at successive values of the index m, the position
in the string being searched, i.e. S[m]. If the index m reaches the end of the string then there is no match, in which
case the search is said to "fail". At each position m the algorithm first checks for equality of the first character in the
searched for word, i.e. S[m] =? W[0]. If a match is found, the algorithm tests the other characters in the searched
for word by checking successive values of the word position index, i. The algorithm retrieves the character W[i]
in the searched for word and checks for equality of the expression S[m+i] =? W[i]. If all successive characters
match in W at position m then a match is found at that position in the search string.
Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then
the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The
chance that the first two letters will match is 1 in 26^2 (1 in 676). So if the characters are random, then the expected
complexity of searching string S[] of length k is on the order of k comparisons or O(k). The expected performance
is very good. If S[] is 1 billion characters and W[] is 1000 characters, then the string search should complete after
about one billion character comparisons.
That expected performance is not guaranteed. If the strings are not random, then checking a trial m may take many
character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string
S[] consists of 1 billion characters that are all A, and that the word W[] is 999 A characters terminating in a final B
character. The simple string matching algorithm will now examine 1000 characters at each trial position before
rejecting the match and advancing the trial position. The simple string search example would now take about 1000
character comparisons times 1 billion positions for 1 trillion character comparisons. If the length of W[] is n, then
the worst case performance is O(kn).
The KMP algorithm does not have the horrendous worst case performance of the straightforward algorithm. KMP
spends a little time precomputing a table (on the order of the size of W[], O(n)), and then it uses that table to do an
efficient search of the string in O(k).
The difference is that KMP makes use of previous match information that the straightforward algorithm does not. In
the example above, when KMP sees a trial match fail on the 1000th character (i=999) because
S[m+999] ≠ W[999], it will increment m by 1, but it will know that the first 998 characters at the new position
already match. KMP matched 999 A characters before discovering a mismatch at the 1000th character (position 999).
Advancing the trial match position m by one throws away the first A, so KMP knows there are 998 A characters that
match W[] and does not retest them; that is, KMP sets i to 998. KMP maintains its knowledge in the precomputed
table and two state variables. When KMP discovers a mismatch, the table determines how much KMP will increase
(variable m) and where it will resume testing (variable i).
KMP algorithm
Worked example of the search algorithm
To illustrate the algorithm's details, we work through a (relatively artificial) run of the algorithm, where W =
"ABCDABD" and S = "ABC ABCDAB ABCDABCDABDE". At any given time, the algorithm is in a state
determined by two integers:
• m, which denotes the position within S which is the beginning of a prospective match for W
• i, the index in W denoting the character currently under consideration.
In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run,
like
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if
they match. However, in the fourth step, we find that S[3] is a space while W[3] = 'D', a mismatch. Rather than
beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S except at 0; hence,
having checked all those characters previously, we know there is no chance of finding the beginning of a match if we
check them again. Therefore we move on to the next character, setting m = 4 and i = 0.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy.
However, just prior to the end of the current partial match, we passed an "AB" which could be the beginning of a
new match, so we must take this into consideration. As we already know that these characters match the two
characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and
continue matching the current character. Thus, not only do we omit previously matched characters of S, but also
previously matched characters of W.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return
to the beginning of W and begin searching at the next character of S: m = 11, reset i = 0.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
Once again we immediately hit upon a match "ABCDAB" but the next character, 'C', does not match the final
character 'D' of the word W. Reasoning as before, we set m = 15, to start at the two-character string "AB"
leading up to the current position, set i = 2, and continue matching from the current position.
1 2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
This time we are able to complete the match, whose first character is S[15].
Description of pseudocode for the search algorithm
The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial
match" table T, described below, which indicates where we need to look for the start of a new match in the event that
the current one ends in a mismatch. The entries of T are constructed so that if we have a match starting at S[m]
that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i]
in S (that is, T[i] is the amount of "backtracking" we need to do after a mismatch). This has two implications:
first, T[0] = -1, which indicates that if W[0] is a mismatch, we cannot backtrack and must simply check the
next character; and second, although the next possible match will begin at index m + i - T[i], as in the example
above, we need not actually check any of the T[i] characters after that, so that we continue searching from
W[T[i]]. The following is a sample pseudocode implementation of the KMP search algorithm.
algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)

    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            let m ← m + i - T[i]
            if T[i] > -1 then
                let i ← T[i]
            else
                let i ← 0

    (if we reach here, we have searched all of S unsuccessfully)
    return the length of S
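A direct Python transcription of this pseudocode (T is the partial match table built by the table-building algorithm described below):

    def kmp_search(S, W, T):
        # Returns the zero-based position in S at which W is found,
        # or len(S) if there is no match (as in the pseudocode above).
        m, i = 0, 0
        while m + i < len(S):
            if W[i] == S[m + i]:
                if i == len(W) - 1:
                    return m
                i += 1
            else:
                m += i - T[i]         # shift the start of the match forward
                i = T[i] if T[i] > -1 else 0
        return len(S)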
Efficiency of the search algorithm
Assuming the prior existence of the table T, the search portion of the KnuthMorrisPratt algorithm has complexity
O(n), where n is the length of S and the O is big-O notation. Except for the fixed overhead incurred in entering
and exiting the function, all the computations are performed in the while loop. To bound the number of iterations
of this loop, observe that T is constructed so that if a match which had begun at S[m] fails while comparing S[m
+ i] to W[i], then the next possible match must begin at S[m + (i - T[i])]. In particular the next possible
match must occur at a higher index than m, so that T[i] < i.
This fact implies that the loop can execute at most 2n times. For, in each iteration, it executes one of the two
branches in the loop. The first branch invariably increases i and does not change m, so that the index m + i of the
currently scrutinized character of S is increased. The second branch adds i - T[i] to m, and as we have seen,
this is always a positive number. Thus the location m of the beginning of the current potential match is increased.
Now, the loop ends if m + i = n; therefore each branch of the loop can be reached at most n times, since they
respectively increase either m + i or m, and m ≤ m + i: if m = n, then certainly m + i ≥ n, so that since it
increases by unit increments at most, we must have had m + i = n at some point in the past, and therefore either
way we would be done.
Thus the loop executes at most 2n times, showing that the time complexity of the search algorithm is O(n).
Here is another way to think about the runtime: suppose we begin to match W and S at positions i and p. If W exists
as a substring of S at p, then W[0 through m] == S[p through p+m]. Upon success, that is, when the word and the text
match at a position (W[i] == S[p+i]), we increase i by 1. Upon failure, that is, when the word and the text do
not match at a position (W[i] != S[p+i]), the text pointer is kept still, while the word pointer rolls back a certain
amount (i = T[i], where T is the jump table), and we attempt to match W[T[i]] with S[p+i]. The total
roll-back of i is bounded by i; that is to say, for any failure, we can only roll back as much as we have
progressed up to the failure. Then it is clear the runtime is 2n.
"Partial match" table (also known as "failure function")
The goal of the table is to allow the algorithm not to match any character of S more than once. The key observation
about the nature of a linear search that allows this to happen is that in having checked some segment of the main
string against an initial segment of the pattern, we know exactly at which places a new potential match which could
continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern
itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not
sacrificing any potential matches in doing so.
We want to be able to look up, for each position in W, the length of the longest possible initial segment of W leading
up to (but not including) that position, other than the full segment starting at W[0] that just failed to match; this is
how far we have to backtrack in finding the next match. Hence T[i] is exactly the length of the longest possible
proper initial segment of W which is also a segment of the substring ending at W[i - 1]. We use the convention
that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no
possibility of backtracking), we set T[0] = -1, as discussed above.
Worked example of the table-building algorithm
We consider the example of W = "ABCDABD" first. We will see that it follows much the same pattern as the main
search, and is efficient for similar reasons. We set T[0] = -1. To find T[1], we must discover a proper suffix of
"A" which is also a prefix of W. But there are no proper suffixes of "A", so we set T[1] = 0. Likewise, T[2] =
0.
Continuing to T[3], we note that there is a shortcut to checking all suffixes: let us say that we discovered a proper
suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible); then its first character is
a proper prefix of W, hence a proper prefix itself, and it ends at W[1], which we already determined cannot occur in
case T[2]. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1
only if a valid suffix of size m was found at the previous stage (e.g. T[x]=m).
Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole
one with length 1 fails, so T[3] = 0.
We pass to the subsequent W[4], 'A'. The same logic shows that the longest substring we need consider has length
1, and although in this case 'A' does work, recall that we are looking for segments ending before the current
character; hence T[4] = 0 as well.
Considering now the next character, W[5], which is 'B', we exercise the following logic: if we were to find a
subpattern beginning before the previous character W[4], yet continuing to the current one W[5], then in particular
it would itself have a proper initial segment ending at W[4] yet beginning before it, which contradicts the fact that
we already found that 'A' itself is the earliest occurrence of a proper segment ending at W[4]. Therefore we need
not look before W[4] to find a terminal string for W[5]. Therefore T[5] = 1.
Finally, we see that the next character in the ongoing segment starting at W[4] = 'A' would be 'B', and indeed
this is also W[5]. Furthermore, the same argument as above shows that we need not look before W[4] to find a
segment for W[6], so that this is it, and we take T[6] = 2.
Therefore we compile the following table:
i 0 1 2 3 4 5 6
W[i] A B C D A B D
T[i] -1 0 0 0 0 1 2
Another example, more interesting and complex:
i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
W[i] P A R T I C I P A T E I N P A R A C H U T E
T[i] -1 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 1 2 3 0 0 0 0 0
Description of pseudocode for the table-building algorithm
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle
is that of the overall search: most of the work was already done in getting to the current position, so very little needs
to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously
gives non-proper substrings at the beginning. This necessitates some initialization code.
algorithm kmp_table:
    input:
        an array of characters, W (the word to be analyzed)
        an array of integers, T (the table to be filled)
    output:
        nothing (but during operation, it populates the table)

    define variables:
        an integer, pos ← 2 (the current position we are computing in T)
        an integer, cnd ← 0 (the zero-based index in W of the next
            character of the current candidate substring)

    (the first few values are fixed but different from what the algorithm
    might suggest)
    let T[0] ← -1, T[1] ← 0

    while pos < length(W) do
        (first case: the substring continues)
        if W[pos - 1] = W[cnd] then
            let cnd ← cnd + 1, T[pos] ← cnd, pos ← pos + 1
        (second case: it doesn't, but we can fall back)
        else if cnd > 0 then
            let cnd ← T[cnd]
        (third case: we have run out of candidates. Note cnd = 0)
        else
            let T[pos] ← 0, pos ← pos + 1
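And a direct Python transcription of the table-building pseudocode (it assumes a non-empty word), checked against the worked example above:

    def kmp_table(W):
        # Builds the partial match table T for the word W.
        T = [0] * len(W)
        T[0] = -1
        pos, cnd = 2, 0
        while pos < len(W):
            if W[pos - 1] == W[cnd]:    # first case: the substring continues
                cnd += 1
                T[pos] = cnd
                pos += 1
            elif cnd > 0:               # second case: fall back
                cnd = T[cnd]
            else:                       # third case: out of candidates
                T[pos] = 0
                pos += 1
        return T

    T = kmp_table("ABCDABD")
    print(T)  # [-1, 0, 0, 0, 0, 1, 2], matching the table above
    print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", T))  # 15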
Efficiency of the table-building algorithm
The complexity of the table algorithm is O(n), where n is the length of W. Since, except for some initialization, all the
work is done in the while loop, it is sufficient to show that this loop executes in O(n) time, which will be done
by simultaneously examining the quantities pos and pos - cnd. In the first branch, pos - cnd is preserved,
as both pos and cnd are incremented simultaneously, but naturally, pos is increased. In the second branch, cnd
is replaced by T[cnd], which we saw above is always strictly less than cnd, thus increasing pos - cnd. In the
third branch, pos is incremented and cnd is not, so both pos and pos - cnd increase. Since pos ≥ pos -
cnd, this means that at each stage either pos or a lower bound for pos increases; therefore since the algorithm
terminates once pos = n, it must terminate after at most 2n iterations of the loop, since pos - cnd begins at
1. Therefore the complexity of the table algorithm is O(n).
Efficiency of the KMP algorithm
Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the
overall algorithm is O(n + k).
These complexities are the same, no matter how many repetitive patterns are in W or S.
Variants
A real-time version of KMP can be implemented using a separate failure function table for each character in the
alphabet. If a mismatch occurs on character x in the text, the failure function table for character x is consulted for
the index i in the pattern at which the mismatch took place. This will return the length of the longest substring
ending at i matching a prefix of the pattern, with the added condition that the character after the prefix is x. With
this restriction, character x in the text need not be checked again in the next phase, and so only a constant number of
operations are executed between the processing of each index of the text. This satisfies the real-time computing
restriction.
The Booth algorithm uses a modified version of the KMP preprocessing function to find the lexicographically
minimal string rotation. The failure function is progressively calculated as the string is rotated.
References
Knuth, Donald; Morris, James H., Jr.; Pratt, Vaughan (1977). "Fast pattern matching in strings" [1]. SIAM Journal on Computing 6 (2): 323–350. doi:10.1137/0206024 [2]. Zbl 0372.68005 [3].
Cormen, Thomas; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "Section 32.4: The Knuth-Morris-Pratt algorithm". Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931. ISBN 0-262-03293-7. Zbl 1047.68161 [4].
Crochemore, Maxime; Rytter, Wojciech (2003). Jewels of Stringology. Text Algorithms. River Edge, NJ: World Scientific. pp. 20–25. ISBN 981-02-4897-0. Zbl 1078.68151 [5].
Szpankowski, Wojciech (2001). Average Case Analysis of Algorithms on Sequences. Wiley-Interscience Series in Discrete Mathematics and Optimization. With a foreword by Philippe Flajolet. Chichester: Wiley. pp. 15–17, 136–141. ISBN 0-471-24063-X. Zbl 0968.68205 [6].
External links
• String Searching Applet animation [7]
• An explanation of the algorithm [8] and sample C++ code [9] by David Eppstein
• Knuth-Morris-Pratt algorithm [10] description and C code by Christian Charras and Thierry Lecroq
• Explanation of the algorithm from scratch [11] by FH Flensburg
• Breaking down steps of running KMP [12] by Chu-Cheng Hsieh
• NPTELHRD YouTube lecture video [13]
References
[1] http://citeseer.ist.psu.edu/context/23820/0
[2] http://dx.doi.org/10.1137%2F0206024
[3] http://www.zentralblatt-math.org/zmath/en/search/?format=complete&q=an:0372.68005
[4] http://www.zentralblatt-math.org/zmath/en/search/?format=complete&q=an:1047.68161
[5] http://www.zentralblatt-math.org/zmath/en/search/?format=complete&q=an:1078.68151
[6] http://www.zentralblatt-math.org/zmath/en/search/?format=complete&q=an:0968.68205
[7] http://www.cs.pitt.edu/~kirk/cs1501/animations/String.html
[8] http://www.ics.uci.edu/~eppstein/161/960227.html
[9] http://www.ics.uci.edu/~eppstein/161/kmp/
[10] http://www-igm.univ-mlv.fr/~lecroq/string/node8.html
[11] http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm
[12] http://oak.cs.ucla.edu/cs144/examples/KMPSearch.html
[13] http://www.youtube.com/watch?v=Zj_er99KMb8
Boyer–Moore string search algorithm
In computer science, the Boyer–Moore string search algorithm is an efficient string searching algorithm that is the
standard benchmark for practical string search literature.[1] It was developed by Robert S. Boyer and J Strother
Moore in 1977. The algorithm preprocesses the string being searched for (the pattern), but not the string being
searched in (the text). It is thus well-suited for applications in which the pattern is much shorter than the text or
persists across multiple searches. The Boyer-Moore algorithm uses information gathered during the preprocessing step to
skip sections of the text, resulting in a lower constant factor than many other string algorithms. In general, the
algorithm runs faster as the pattern length increases.
Definitions
A N P A N M A N -
P A N - - - - - -
- P A N - - - - -
- - P A N - - - -
- - - P A N - - -
- - - - P A N - -
- - - - - P A N -
Alignments of pattern PAN to text ANPANMAN, from k=3 to k=8. A match occurs at k=5.
• S[i] refers to the character at index i of string S, counting from 1.
• S[i..j] refers to the substring of string S starting at index i and ending at j, inclusive.
• A prefix of S is a substring S[1..i] for some i in range [1, n], where n is the length of S.
• A suffix of S is a substring S[i..n] for some i in range [1, n], where n is the length of S.
• The string to be searched for is called the pattern and is referred to with symbol P.
• The string being searched in is called the text and is referred to with symbol T.
• The length of P is n.
• The length of T is m.
• An alignment of P to T is an index k in T such that the last character of P is aligned with index k of T.
• A match or occurrence of P occurs at an alignment if P is equivalent to T[(k-n+1)..k].
Description
The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at
different alignments. Instead of a brute-force search of all alignments (of which there are m - n + 1), Boyer-Moore
uses information gained by preprocessing P to skip as many alignments as possible.
The algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in P and T are
then compared starting at index n in P and k in T, moving backward: the strings are matched from the end of P to the
start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a
mismatch occurs upon which the alignment is shifted to the right according to the maximum value permitted by a
number of rules. The comparisons are performed again at the new alignment, and the process repeats until the
alignment is shifted past the end of T, which means no further matches will be found.
The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.
Shift Rules
The Bad Character Rule
Description
- - - - X - - K - - -
A N P A N M A N A M -
- N N A A M A N - - -
- - - N N A A M A N -
Demonstration of bad character rule with pattern NNAAMAN.
The bad-character rule considers the character in T at which the comparison process failed (assuming such a failure
occurred). The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in
line with the mismatched occurrence in T is proposed. If the mismatched character does not occur to the left in P, a
shift is proposed that moves the entirety of P past the point of mismatch.
Preprocessing
Methods vary on the exact form the table for the bad character rule should take, but a simple constant-time lookup
solution is as follows: create a 2D table which is indexed first by the index of the character c in the alphabet and
second by the index i in the pattern. This lookup will return the occurrence of c in P with the next-highest index j < i
or -1 if there is no such occurrence. The proposed shift will then be i - j, with O(1) lookup time and O(kn) space,
assuming a finite alphabet of size k.
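A Python sketch of this table (keyed by character in a dictionary rather than a dense 2D array; the names are ours):

    def bad_character_table(P):
        # table[c][i] = index of the closest occurrence of c strictly
        # to the left of position i in P, or -1 if there is none.
        table = {c: [-1] * (len(P) + 1) for c in set(P)}
        for i, ch in enumerate(P):
            for c in table:
                table[c][i + 1] = i if c == ch else table[c][i]
        return table

    tbl = bad_character_table("NNAAMAN")
    # On a mismatch of text character c against P[i], the proposed shift
    # is i - j where j = tbl[c][i] (take j = -1 for characters that do
    # not occur in P at all, shifting the whole pattern past the mismatch).
    print(tbl["A"][6])  # 5: the nearest 'A' left of index 6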
The Good Suffix Rule
Description
- - - - X - - K - - - - -
M A N P A N A M A N A P -
A N A M P N A M - - - - -
- - - - A N A M P N A M -
Demonstration of good suffix rule with pattern ANAMPNAM.
The good suffix rule is markedly more complex in both concept and implementation than the bad character rule. It is
the reason comparisons begin at the end of the pattern rather than the start, and is formally stated thus:
Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch
occurs at the next comparison to the left. Then find, if it exists, the right-most copy t′ of t in P such that t′
is not a suffix of P and the character to the left of t′ in P differs from the character to the left of t in
P. Shift P to the right so that substring t′ in P aligns with substring t in T. If t′ does not exist, then
shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted
pattern matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right. If
an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P
matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, that
is, shift P past t.
Preprocessing
The good suffix rule requires two tables: one for use in the general case, and another for use when either the general
case returns no meaningful result or a match occurs. These tables will be designated L and H respectively. Their
definitions are as follows:
For each i, L[i] is the largest position less than n such that string P[i..n] matches a suffix of P[1..L[i]]
and such that the character preceding that suffix is not equal to P[i-1]. L[i] is defined to be zero if there
is no position satisfying the condition.
Let H[i] denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If none
exists, let H[i] be zero.
Both of these tables are constructible in O(n) time and use O(n) space. The alignment shift for index i in P is given
by n - L[i] or n - H[i]. H should only be used if L[i] is zero or a match has been found.
The Galil Rule
A simple but important optimization of Boyer-Moore was put forth by Galil in 1979. As opposed to shifting, the
Galil rule deals with speeding up the actual comparisons done at each alignment by skipping sections that are known
to match. Suppose that at an alignment k1, P is compared with T down to character c of T. Then if P is shifted to k2
such that its left end is between c and k1, in the next comparison phase a prefix of P must match the substring
T[(k2 - n)..k1]. Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without
explicitly comparing past k1. In addition to increasing the efficiency of Boyer-Moore, the Galil rule is required for
proving linear-time execution in the worst case.
Performance
The Boyer-Moore algorithm as presented in the original paper has worst-case running time of O(n+m) only if the
pattern does not appear in the text. This was first proved by Knuth, Morris, and Pratt in 1977, followed by Guibas
and Odlyzko in 1980 with an upper bound of 5m comparisons in the worst case. Richard Cole gave a proof with an
upper bound of 3m comparisons in the worst case in 1991.
When the pattern does occur in the text, running time of the original algorithm is O(nm) in the worst case. This is
easy to see when both pattern and text consist solely of the same repeated character. However, inclusion of the Galil
rule results in linear runtime across all cases.
Implementations
Various implementations exist in different programming languages. In C++, Boost provides the generic
Boyer–Moore search [2] implementation under the Algorithm library.
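As a simple illustration, here is a Python sketch that applies only the bad character rule, in the style of the Boyer-Moore-Horspool simplification described under Variants below (a sketch under those assumptions, not Boost's implementation or the full original algorithm):

    def horspool_search(T, P):
        # Bad-character-rule-only variant (Boyer-Moore-Horspool).
        n, m = len(P), len(T)
        if n == 0 or n > m:
            return -1
        # Shift table: distance from the last occurrence of each character
        # (ignoring P's final position) to the end of the pattern.
        shift = {}
        for i in range(n - 1):
            shift[P[i]] = n - 1 - i
        k = n - 1                  # alignment: index in T of P's last character
        while k < m:
            i = 0
            while i < n and P[n - 1 - i] == T[k - i]:
                i += 1             # compare right to left, as Boyer-Moore does
            if i == n:
                return k - n + 1   # zero-based start of the match
            k += shift.get(T[k], n)
        return -1

    print(horspool_search("ANPANMAN", "PAN"))  # 2 (the 1-indexed alignment k = 5 above)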
Variants
The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm using only the bad character
rule.
The Apostolico-Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given
alignment by skipping explicit character comparisons. This uses information gleaned during the pre-processing of
the pattern in conjunction with suffix match lengths recorded at each match attempt. Storing suffix match lengths
requires an additional table equal in size to the text being searched.
References
[1] Hume and Sunday (1991). "Fast String Searching". Software: Practice and Experience 21 (11), 1221–1248 (November 1991).
[2] http://www.boost.org/doc/libs/1_53_0/libs/algorithm/doc/html/algorithm/Searching.html#the_boost_algorithm_library.Searching.BoyerMoore
External links
• Original paper on the Boyer-Moore algorithm (http://www.cs.utexas.edu/~moore/publications/fstrpos.pdf)
• An example of the Boyer-Moore algorithm (http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html) from the homepage of J Strother Moore, co-inventor of the algorithm
• Richard Cole's 1991 paper proving runtime linearity (http://www.cs.nyu.edu/cs/faculty/cole/papers/CHPZ95.ps)
Article Sources and Contributors
String (computer science) Source: https://en.wikipedia.org/w/index.php?oldid=586639232 Contributors: 216.60.221.xxx, A4b3c2d1e0f, Ahoerstemeier, Alai, Alan Millar, Alfabalon,
AnOddName, Andreas Kaufmann, Andres, Andrew Helwer, Andy Dingley, Anphanax, Anthony Borla, Arthur Frayn, AxelBoldt, B4hand, BIL, Beland, Bevo, Bkkbrad, Black Falcon,
Bogdangiusca, Boleslav Bobcik, Borgx, Bryan Derksen, BurntSky, C45207, CanadianMaritimer, CanisRufus, Captain Conundrum, Castaa, Cedders, Cgtdk, Charles Matthews, Charvest, Chris the
speller, Christopherlin, Conversion script, Courcelles, Cybercobra, Dainomite, Damian Yerrick, Dcoetzee, Denispir, Dennis714, Dereckson, Derek farn, Doctorfluffy, Doug Bell, Dreadstar,
Dreftymac, Drj, Drphilharmonic, Dysprosia, Elassint, Eloquence, Error, Fabartus, Fatboar, Forderud, Fropuff, Fudo, Furrykef, GNRY09, Gaiacarra, Garyzx, Georg Peter, Ghewgill, Giftlite,
GoingBatty, Gparker, Gregbard, Gumum, Gurch, Gwil, Gyro Copter, Hairy Dude, Hippopha, Hornlo, Howcheng, IOLJeff, Ian Pitchford, Icep, JDP90, JLaTondre, Jarble, Jay-Sebastos, Jc3s5h,
Jeremysr, Jiri 1984, Jncraton, John254, JonathanCross, Jonnabuz, Jordandanford, Kbdank71, Komarov om, Koyaanis Qatsi, Kusunose, Kyng, Lambiam, LilHelpa, Linas, Loadmaster,
Local.empire, Luke Igoe, Mad Tinman, Marc van Leeuwen, MattGiuca, Maximaximax, Michael Hardy, Mikeblas, Minghong, MisterSheik, Mjb, Mkweise, Mlpkr, Mojo Hand, Murray Langton,
Murtasa, Mythas11, Nasnema, Nbarth, Neelix, Nevyn, Obradovic Goran, Oleg Alexandrov, OpinionPerson, Pantser, Patrick, Pengo, Perique des Palottes, Peterdjones, Pexatus, Philg88, Pimlottc,
Pinguin.tk, Plugwash, Pnm, Ptarjan, Qwertyus, R. S. Shaw, RTC, Richard W.M. Jones, RogerofRomsey, Rory O'Kane, Ruud Koot, S.rvarr.S, Sahirshah, Scarfboy, Sebbe, Seec77, Sewing,
Shahab, Shirifan, Sietse Snel, Slady, Spearhead, Spitzak, Stephen Gilbert, StuartBrady, TBloemink, Taemyr, TakuyaMurata, Teles, Tentinator, The Anome, The Thing That Should Not Be,
TheIncredibleEdibleOompaLoompa, Thumperward, Tigrisek, Tobias Bergemann, Tompsci, Tortoise3, Treekids, Ubermonkey, Underrated1 17, Urod, Uuf6429, Vadmium, Wayfarer, Wikipelli,
WinterSpw, Witharebelyell, Zundark, 149 anonymous edits
String searching algorithm Source: https://en.wikipedia.org/w/index.php?oldid=585633011 Contributors: A3RO, Algebran, Alvestrand, Andreas Kaufmann, Angela, Ascnder,
Ayonggu114ster, B4hand, Bender235, Bisqwit, Boleslav Bobcik, Borgx, BrokenSegue, CRGreathouse, Catmoongirl, Conversion script, Dcirovic, Dcoetzee, Drake Wilson, Dummy6277,
Excirial, Fredrik, Hariva, HerrMister, Ijustam, Jafet, Jan.papousek, Jarble, Jaredwf, Jwpat7, KerinthIT, Kku, Kragen, Kumioko (renamed), Ltickett, Macrakis, Mandarax, Materialscientist,
MaxEnt, MegaHasher, Mikeblas, Mlpkr, Mordomo, Ms2ger, NJM, Neilbeach, Netrapt, Nils Grimsmo, Nils.grimsmo, Nixdorf, Nyxos, OldCodger2, PFHLai, Phe, PhilKnight, Plugwash, Pne,
Poor Yorick, Quuxplusone, Ruud Koot, Sam Hocevar, Shadowhillway, Shehzad.kazmi, Shekharsuman93, Sniffnoy, Squash, Squidonius, SummerWithMorons, Szabolcs Nagy, Taw, Thosylve,
Tony1, TripleF, Tristan Schmelcher, Trusilver, Watcher, Webmeischda, 91 anonymous edits
Knuth–Morris–Pratt algorithm Source: https://en.wikipedia.org/w/index.php?oldid=586295098 Contributors: A5b, Acntx, Adityasinghhhh, Adityasinghhhhh, Almi, Amitchaudhary, Andrew
Helwer, Antaeus Feldspar, Arlolra, Axings, Bgwhite, Bikri, Blinken, Borgx, Bruguiea, Bryan Derksen, Byronknoll, Chadernook, Chester br, Chucheng, Crescent Moon, Curly Turkey, David
Eppstein, Dcoetzee, Deltahedron, Diagonalfish, Dmshafi, Ee19921, Elias, Erroneous01, Fibonacci, Glrx, GregorB, Haojin, Hariva, Hddqsb, Hobsonlane, J04n, Jagat sastry, Jaredwf, Javy413,
Jeremiah Mountain, Jocapc, Johnuniq, Jon Awbrey, KSmrq, Krischik, LOL, Little Mountain 5, LokiClock, MadLex, Madoka, Magicheader, Magioladitis, Mandarax, Mark T, Mhss, Michael
Hardy, Mikespedia, NeilFraser, Niceguyedc, Olau, OnePlusTwelve, PACO, Peni, Phe, Pranith, Pratik.mallya, Quuxplusone, RainCT, Raknarf44, Rich Farmbrough, Ruud Koot, Ryan Reich, Shell
Kinney, Sikuyihsoy, Smallman12q, Spencer4Hire, Swift, Timwi, Tom Alsberg, Tregoweth, Tushicomeng, Versus, Wahas1234, Wikibob, Winston Chuen-Shih Yang, Wisiti, Ww, Xterminatrix,
Ycl6, Zhaladshar, Znora, 167 anonymous edits
Boyer–Moore string search algorithm Source: https://en.wikipedia.org/w/index.php?oldid=585928282 Contributors: Abednigo, Adfellin, Alex.mccarthy, Ancheta Wis, Andrew Helwer,
Antaeus Feldspar, Art1x com, Aunndroid, BD2412, Balabiot, Barry Fruitman, Beland, Biker Biker, Billlava, Billyoneal, Bisqwit, Blueyoshi321, Booyabazooka, Borgx, Brunobowden, Chokfung,
ChrisGualtieri, Cneubauer, Cwalgampaya, Czlaner, Damian Yerrick, DaveWF, Dcoetzee, Dekart, Deqing.huang, DocWatson42, Donner60, Dpakoha, Duplicity, Edsarian, Eeppeliteloop, Elassint,
Evgeny Lykhin, Eyal0, Fbriere, Fib, Freaky Dug, Fredrik, Furrykef, Greenrd, Icktoofay, IgushevEdward, Infinito, J12f, Jashmenn, Jemfinch, Jinghaoxu, JoeMarfice, JustAHappyCamper,
Jy2wong, Karnan, Kayvee, Klutzy, Kostmo, Kri, Kucyla, Lauren Lester, Lisamh, Lumpynifkin, M.O.X, Martinkunev, Mathiasl26, Maximus Rex, Mboverload, Mi1ror, Mikeblas, Moink, Mr flea,
Murray Langton, Neelpulse, Nemo bis, Nickjhay, Nneonneo, Ott2, PedR, Phe, PhilKnight, Plindenbaum, Pne, Quuxplusone, RJFJR, Radagast83, Rich Farmbrough, Ruud Koot, Ryan Reich,
SeekerOfThePath, Smallman12q, Snowgene, SummerWithMorons, Szabolcs Nagy, Thegeneralguy, Tide rolls, Tim Starling, Tim.head, Tobias Bergemann, Triddle, TripleF, Vacation9, Watcher,
Wikibob, Wthrower, Ww, Xillion, YUL89YYZ, Zearin, 218 anonymous edits
Image Sources, Licenses and Contributors
Image:DFA search mommy.svg Source: https://en.wikipedia.org/w/index.php?title=File:DFA_search_mommy.svg License: Public Domain Contributors: Dcoetzee, Jochen Burghardt,
Kilom691
License
Creative Commons Attribution-Share Alike 3.0
//creativecommons.org/licenses/by-sa/3.0/
