String Matching

Outline
String Matching
Introduction
Naive Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm
Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or body of text).
Many applications:
  While using an editor, word processor, or browser
  Login name and password checking
  Virus detection
  Header analysis in data communications
  DNA sequence analysis
  Web search engines (e.g. Google)
  Image analysis

String-Matching Problem
The text is an array T[1..n] of length n.
The pattern is an array P[1..m] of length m.
Elements of T and P are characters from a finite alphabet Σ.
  E.g., Σ = {0, 1} or Σ = {a, b, ..., z}
T and P are usually called strings of characters.
String-Matching Problem (contd.)
We say that pattern P occurs with shift s in text T if:
  a) 0 ≤ s ≤ n-m, and
  b) T[(s+1)..(s+m)] = P[1..m]
If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift.
String-matching problem: find all valid shifts for a given T and P.
Example 1
text T (positions 1..13): a b c a b a a b c a b a c
pattern P (positions 1..4): a b a a
With s = 3, P lines up with T[4..7] = a b a a.
Shift s = 3 is a valid shift (n = 13, m = 4, and 0 ≤ s ≤ n-m holds).
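Example 2
text T (positions 1..13): a b c a b a a b c a b a a
pattern P (positions 1..4): a b a a
s = 3: P lines up with T[4..7] = a b a a
s = 9: P lines up with T[10..13] = a b a a
Both s = 3 and s = 9 are valid shifts.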
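The definition of a valid shift can be checked directly. The following is a minimal sketch, assuming Python strings for T and P; the function name is_valid_shift is only illustrative. Note that the slides index from 1 while Python indexes from 0, so T[(s+1)..(s+m)] becomes text[s:s+m].

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if the pattern occurs in the text with shift s."""
    n, m = len(text), len(pattern)
    # Condition (a): 0 <= s <= n-m.  Condition (b): the window equals P.
    return 0 <= s <= n - m and text[s:s + m] == pattern


T = "abcabaabcabaa"   # text of Example 2
P = "abaa"
valid = [s for s in range(len(T) - len(P) + 1) if is_valid_shift(T, P, s)]
print(valid)  # [3, 9]
```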
Naive String-Matching Algorithm

Input: text string T[1..n] and pattern string P[1..m]
Result: all valid shifts are displayed
NAIVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n-m
    if P[1..m] = T[(s+1)..(s+m)]
      print "pattern occurs with shift" s
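A possible Python rendering of the pseudocode is sketched below. It assumes 0-based strings and returns the valid shifts as a list instead of printing them; the character-by-character loop stands in for the comparison P[1..m] = T[(s+1)..(s+m)].

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which the pattern occurs in the text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0 .. n-m
        j = 0
        # Compare left to right, stopping at the first mismatch.
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabca", "abca"))  # [0, 3]
```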
Naive Algorithm
The naive algorithm checks, at every position in the text from 0 to n-m, whether an occurrence of the pattern starts there. After each attempt, it shifts the pattern exactly one position to the right.
Example (from left to right):
  a b c a b c a    text
  a b c a          (shift = 0)
    a b c a        (shift = 1)
      a b c a      (shift = 2)
        a b c a    (shift = 3)
Analysis: Worst-case Example
pattern P (positions 1..4): a a a b
text T (positions 1..13): a a a a a a a a a a a a a
At every shift the first three characters of P match, and the mismatch is found only at the fourth.
Worst-case Analysis
There are m comparisons for each shift in the worst case.
There are n-m+1 shifts.
So the worst-case running time is Θ((n-m+1)m).
  In the example on the previous slide, there are (13-4+1)·4 = 40 comparisons in total.
The naive method is inefficient because information gained at one shift is not reused at the next.
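The comparison count for the worst-case example can be reproduced with a small counter. This is only a sketch: it assumes the naive matcher compares left to right and stops at the first mismatch within each shift, and the helper name count_comparisons is hypothetical.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher over all shifts."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                   # mismatch: move on to the next shift
    return comparisons


# 10 shifts, 4 comparisons each: (13-4+1) * 4
print(count_comparisons("a" * 13, "aaab"))  # 40
```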
Analysis
Brute-force pattern matching runs in O(mn) time in the worst case.
But most searches of ordinary text take O(m+n) time, which is very quick.
Brute-force Analysis (Best)
Best Case (N and M are the lengths of the text and the pattern)
Example 1: Pattern found at the first position of the text (search stops at the first occurrence)
  Text: 0000000000000000001
  Pattern: 000
  Cost = O(M)
Example 2: Pattern not found, with a mismatch on the first character at every shift
  Text: 0000000000000000001
  Pattern: 11
  Cost = O(N+M)
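The two best-case examples can be checked by counting comparisons. The sketch below assumes a brute-force search that stops at the first occurrence; the function name first_match_comparisons and its return convention are illustrative choices, not part of the slides.

```python
def first_match_comparisons(text: str, pattern: str):
    """Return (first valid shift or None, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[s + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return s, comparisons       # stop at the first occurrence
    return None, comparisons


text = "0" * 18 + "1"
print(first_match_comparisons(text, "000"))  # (0, 3): M comparisons
print(first_match_comparisons(text, "11"))   # (None, 18): one per shift
```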
Naive Algorithm
Example (from right to left):
  a b c a b c a    text
        a b c a    (shift = 3)
      a b c a      (shift = 2)
    a b c a        (shift = 1)
  a b c a          (shift = 0)
The pattern occurs with shifts 0 and 3.
Rabin-Karp Algorithm
Has a worst-case running time of O((n-m+1)m), but the average case is O(n+m).
Also works well in practice.
Based on the number-theoretic notion of modular equivalence.
We assume that Σ = {0, 1, 2, ..., 9}, i.e., each character is a decimal digit.
  In general, use radix d, where d = |Σ|.
Rabin-Karp Approach
We can view a string of k characters (digits) as a length-k decimal number.
  E.g., the string 31425 corresponds to the decimal number 31,425.
Given a pattern P[1..m], let p denote the corresponding decimal value.
Given a text T[1..n], let t_s denote the decimal value of the length-m substring T[(s+1)..(s+m)], for s = 0, 1, ..., n-m.
Rabin-Karp Approach (contd.)
t_s = p iff T[(s+1)..(s+m)] = P[1..m], so s is a valid shift iff t_s = p.
p can be computed in O(m) time using Horner's rule:
  p = P[m] + 10 (P[m-1] + 10 (P[m-2] + ... + 10 (P[2] + 10 P[1]) ...))
t_0 can similarly be computed in O(m) time.
The remaining values t_1, t_2, ..., t_{n-m} can be computed in O(n-m) time, since t_{s+1} can be computed from t_s in constant time.
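The nested formula for p is Horner's rule, which can be sketched as a single left-to-right loop. The function name horner_value is an assumption made for illustration; the input is taken to be a string of decimal digits.

```python
def horner_value(digits: str, radix: int = 10) -> int:
    """Evaluate a digit string as a radix-d number with m multiplications."""
    value = 0
    for ch in digits:
        # Shift the value accumulated so far one place left, then add the digit.
        value = radix * value + int(ch)
    return value


print(horner_value("31415"))  # 31415
```

The same routine applied to the first m characters of the text gives t_0.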
Rabin-Karp Approach (contd.)
t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]
E.g., if T = {..., 3, 1, 4, 1, 5, 2, ...}, m = 5 and t_s = 31,415, then
  t_{s+1} = 10 (31415 - 10000·3) + 2 = 14152
Thus we can compute p in Θ(m) time and t_0, t_1, t_2, ..., t_{n-m} in Θ(n-m+1) time.
So we can find all occurrences of the pattern P[1..m] in the text T[1..n] with Θ(m) preprocessing time and Θ(n-m+1) matching time.
But there is a problem: this assumes that p and t_s are small numbers. They may be too large to work with conveniently.
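The constant-time update from t_s to t_{s+1} can be sketched as follows, still without a modulus. The sketch assumes a digit string with n ≥ m and uses 0-based indexing, so T[s+1] in the slides is digits[s] here; window_values is a hypothetical helper name.

```python
def window_values(digits: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a string of decimal digits (n >= m)."""
    n = len(digits)
    high = 10 ** (m - 1)                # weight of the high-order digit
    t = int(digits[:m])                 # t_0, computed once in O(m)
    values = [t]
    for s in range(n - m):
        # Drop the old high-order digit, shift left, append the new digit.
        t = 10 * (t - high * int(digits[s])) + int(digits[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```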
Rabin-Karp Approach (contd.)
Solution: we can use modular arithmetic with a suitable modulus q.
  t_{s+1} ≡ (10 (t_s - T[s+1] h) + T[s+m+1]) (mod q), where h ≡ 10^{m-1} (mod q)
q is chosen as a small prime number, e.g., 13 for radix 10.
  Generally, if the radix is d, then dq should fit within one computer word.
How values modulo 13 are computed
The window slides over 3 1 4 1 5 2: the old high-order digit is 3 and the new low-order digit is 2.
  14152 ≡ ((31415 - 3·10000)·10 + 2) (mod 13)
        ≡ ((7 - 3·3)·10 + 2) (mod 13)
        ≡ 8 (mod 13)
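The modular update can be verified numerically. This is a small sketch with the slide's values hard-coded (q = 13, m = 5, old digit 3, new digit 2); Python's % operator returns a non-negative result, so the negative intermediate value needs no special handling.

```python
q = 13
m = 5
h = pow(10, m - 1, q)                   # 10^(m-1) mod q
t_s = 31415 % q                         # hash of the current window
old_digit, new_digit = 3, 2

# t_{s+1} = (10 * (t_s - old_digit * h) + new_digit) mod q
t_next = (10 * (t_s - old_digit * h) + new_digit) % q

print(h, t_s, t_next)                   # 3 7 8
print(14152 % q)                        # 8, agrees with the rolling update
```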
Problem of Spurious Hits
t_s ≡ p (mod q) does not imply that t_s = p.
  Modular equivalence does not necessarily mean that two integers are equal.
A case in which t_s ≡ p (mod q) but t_s ≠ p is called a spurious hit.
On the other hand, if two integers are not equivalent modulo q, then they cannot be equal.
Example
pattern P = 3 1 4 1 5, p mod 13 = 7
text T (positions 1..14): 2 3 1 4 1 5 2 6 7 3 9 9 2 1
shift s   window   t_s mod 13
0         23141    1
1         31415    7     valid match
2         14152    8
3         41526    4
4         15267    5
5         52673    10
6         26739    11
7         67399    7     spurious hit
8         73992    9
9         39921    11
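The residues in this example can be recomputed and each shift classified. The sketch below is deliberately simple: it hashes every window from scratch with int(...) % q rather than using the rolling update, and the name classify_shifts and its labels are illustrative assumptions.

```python
def classify_shifts(text: str, pattern: str, q: int = 13):
    """Return (shift, t_s mod q, label) for every window of a digit string."""
    m = len(pattern)
    p_mod = int(pattern) % q
    rows = []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        t_mod = int(window) % q
        if t_mod != p_mod:
            label = "no hit"
        elif window == pattern:
            label = "valid match"
        else:
            label = "spurious hit"      # same residue, different digits
        rows.append((s, t_mod, label))
    return rows


for row in classify_shifts("23141526739921", "31415"):
    print(row)
```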
Rabin-Karp Algorithm
The basic structure is like the naive algorithm, but uses modular arithmetic as described.
For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit.
In the worst case, every shift is verified.
  The running time can be shown to be O((n-m+1)m).
The average-case running time is O(n+m).
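Putting the pieces together, one way the matcher might look in Python is sketched below. It assumes a text and pattern made of decimal digits, radix d = 10 and modulus q = 13 as in the slides, and 0-based shifts; the slice comparison plays the role of the character-by-character verification.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return all valid shifts of a digit pattern in a digit text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                # d^(m-1) mod q
    p = t = 0
    for i in range(m):                  # preprocessing: p and t_0 by Horner's rule
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # A hit must be verified, since it may be spurious.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                   # roll the hash to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("23141526739921", "31415"))  # [1]
```

The spurious hit at s = 7 is rejected by the verification step, so only the valid shift is reported.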
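Example 2
Let T = a b c b a b and P = a b c.
Take a = 97, b = 98, c = 99 (i.e., the ASCII values of the characters), so the radix is d = |Σ| = 256.
Integer value of P: p = c + 256 (b + 256·a) = 99 + 256 (98 + 256·97) = 6,382,179, which is then reduced modulo the chosen q.
  Note that q should not equal the radix: modulo 256 every term but the last vanishes, leaving only 99, the code of the final character.
In similar fashion, we can calculate the hash value of each length-m window of the text and compare it with that of P to check for a valid or spurious hit (as in the previous slides).
Analysis
In the worst case, every shift is verified.
  The running time can be shown to be O((n-m+1)m).
The average-case running time is O(n+m).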
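The radix-256 value of P can be computed with the same Horner loop, using ord() for the character codes. In the sketch below the helper name radix_value is illustrative and the prime q = 101 is purely an assumption chosen for the demonstration; the slides do not fix a modulus for this example.

```python
def radix_value(s: str, d: int = 256) -> int:
    """Interpret a string as a radix-d number using character codes."""
    value = 0
    for ch in s:
        value = d * value + ord(ch)
    return value


T, P = "abcbab", "abc"
p = radix_value(P)
print(p)                                # 6382179
print(p % 256)                          # 99: only the last character survives

q = 101                                 # assumed prime modulus, for illustration
m = len(P)
hits = [s for s in range(len(T) - m + 1)
        if radix_value(T[s:s + m]) % q == p % q]
valid = [s for s in hits if T[s:s + m] == P]   # verify each hit
print(valid)                            # [0]
```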
3. The KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in left-to-right order (like the brute-force algorithm).
But it shifts the pattern more intelligently than the brute-force algorithm.
Indices in this section are 0-based: the pattern is P[0..m-1].
If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
Answer: shift so that the largest prefix of P[0..j-1] that is also a suffix of P[1..j-1] stays aligned with the text.
Example
(Figure: text T and pattern P aligned; a mismatch occurs at j = 5, and comparison resumes at j_new = 2.)
Why?
j == 5
Find the largest prefix (start) of "a b a a b" (P[0..j-1]) which is a suffix (end) of "b a a b" (P[1..j-1]).
Answer: "a b"
Set j = 2   // the new j value
KMP Failure Function

KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1)
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
Failure Function Example

P: "abaaba"
j:  012345
k (= j-1)   F(k)
0           0
1           0
2           1
3           1
4           2
F(k) is the size of the largest prefix.
In code, F() is represented by an array, like the table.
P: "abaaba" Why is F(4) == 2?
F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = find the size of "ab"
  = 2
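The failure function can be computed as an array in O(m) time. The sketch below uses the standard incremental construction, which is one common way to do it rather than necessarily the one these slides have in mind; fail[k] holds F(k).

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    m = len(pattern)
    fail = [0] * m
    k = 0                               # size of the prefix matched so far
    for i in range(1, m):
        # Fall back to shorter prefixes until one can be extended by pattern[i].
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```

The first five entries agree with the table for "abaaba"; in particular fail[4] == 2.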
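Using the Failure Function

The Knuth-Morris-Pratt algorithm modifies the brute-force algorithm.
  If a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
    k = j-1;
    j = F(k);   // obtain the new j
  If the mismatch occurs at j == 0, there is no k; advance i to the next text character instead.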
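A compact Python sketch of the resulting search is given below. It assumes 0-based indices, recomputes the failure array inside a helper so the block is self-contained, and reports every occurrence (continuing after a full match via the failure function, which is an extension beyond the mismatch rule above).

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of the pattern in the text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    fail = failure_function(pattern)
    matches = []
    i = j = 0                           # i indexes the text, j the pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:                  # full match ending at i-1
                matches.append(i - m)
                j = fail[j - 1]
        elif j > 0:
            j = fail[j - 1]             # mismatch: k = j-1, j = F(k)
        else:
            i += 1                      # mismatch at j == 0: advance in the text
    return matches


print(kmp_search("abcabaabcabaa", "abaa"))  # [3, 9]
```

Note that i never decreases, which is why the text is never re-read.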
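Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b
k      0 1 2 3 4
F(k)   0 0 1 0 1
Comparisons (numbered in order):
  1-6:    P aligned at T[0]; mismatch at comparison 6 (j = 5), so j = F(4) = 1
  7:      P aligned at T[4]; mismatch at j = 1, so j = F(0) = 0
  8-12:   P aligned at T[5]; mismatch at comparison 12 (j = 4), so j = F(3) = 0
  13:     P aligned at T[9]; mismatch at j = 0, so i advances
  14-19:  P aligned at T[10]; all six characters match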
P: "abacab" Why is F(4) == 1?
F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = find the size of "a"
  = 1
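The comparison numbering in the example can be checked by instrumenting a KMP search with a counter. The sketch below assumes a non-empty pattern and stops at the first match; kmp_first_match and its return convention are illustrative, not part of the slides.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_first_match(text: str, pattern: str):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    fail = failure_function(pattern)
    i = j = comparisons = 0
    while i < n:
        comparisons += 1                # one comparison of T[i] with P[j]
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - m + 1, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]
        else:
            i += 1
    return -1, comparisons


print(failure_function("abacab")[:5])                       # [0, 0, 1, 0, 1]
print(kmp_first_match("abacaabaccabacabaabb", "abacab"))    # (10, 19)
```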
KMP Advantages
KMP runs in optimal time: O(m+n)
  very fast
The algorithm never needs to move backwards in the input text T.
  This makes the algorithm good for processing very large files that are read in from external devices or through a network stream.
KMP Disadvantages
KMP doesn't work so well as the size of the alphabet increases:
  there is more chance of a mismatch (more possible mismatches)
  mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later
