
Outline

String Matching
Introduction
Naive Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm

Introduction
What is string matching?
  Finding all occurrences of a pattern in a given text (or body of text)

Many applications
  While using an editor, word processor, or browser
  Login name and password checking
  Virus detection
  Header analysis in data communications
  DNA sequence analysis, Web search engines (e.g. Google), image analysis

String-Matching Problem
The text is in an array T[1..n] of length n
The pattern is in an array P[1..m] of length m
Elements of T and P are characters from a finite alphabet Σ
  E.g., Σ = {0, 1} or Σ = {a, b, …, z}

Usually T and P are called strings of characters

String-Matching Problem (contd.)

We say that pattern P occurs with shift s in text T if:
  a) 0 ≤ s ≤ n-m, and
  b) T[(s+1)..(s+m)] = P[1..m]

If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift
String-matching problem: finding all valid shifts for a given T and P

Example 1

position:   1  2  3  4  5  6  7  8  9 10 11 12 13
text T:     a  b  c  a  b  a  a  b  c  a  b  a  c
pattern P:           a  b  a  a                      (s = 3)

Shift s = 3 is a valid shift (n = 13, m = 4, and 0 ≤ s ≤ n-m holds)

Example 2

position:   1  2  3  4  5  6  7  8  9 10 11 12 13
text T:     a  b  c  a  b  a  a  b  c  a  b  a  a
pattern P:           a  b  a  a                      (s = 3)
                                       a  b  a  a    (s = 9)

Naive String-Matching Algorithm

Input: Text strings T[1..n] and P[1..m]
Result: All valid shifts displayed

NAIVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n-m
    if P[1..m] = T[(s+1)..(s+m)]
      print "pattern occurs with shift" s
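The pseudocode above translates almost line for line into Python. The following is a minimal sketch, not a tuned implementation; the function name `naive_string_matcher` is chosen here for illustration, and because Python strings are 0-indexed, the slice `text[s:s + m]` plays the role of T[(s+1)..(s+m)].

```python
def naive_string_matcher(text, pattern):
    """Return every shift s (0 <= s <= n-m) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # s = 0 .. n-m
        if text[s:s + m] == pattern:      # compares up to m characters
            shifts.append(s)
    return shifts


# Text and pattern taken from Example 2 of the slides.
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```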

Naive Algorithm

The naive algorithm checks, at every position in the text from 0 to n-m, whether an occurrence of the pattern starts there. After each attempt, it shifts the pattern exactly one position to the right.
Example (from left to right):
  a b c a b c a      (text)
  a b c a            (shift = 0)
    a b c a          (shift = 1)
      a b c a        (shift = 2)
        a b c a      (shift = 3)

Analysis: Worst-case Example

position:   1  2  3  4  5  6  7  8  9 10 11 12 13
text T:     a  a  a  a  a  a  a  a  a  a  a  a  a
pattern P:  a  a  a  b                               (s = 0)
               a  a  a  b                            (s = 1)

Worst-case Analysis

There are m comparisons for each shift in the worst case
There are n-m+1 shifts
So, the worst-case running time is Θ((n-m+1)m)
  In the example on the previous slide, we have (13-4+1)·4 = 40 comparisons in total

The naive method is inefficient because the information gained at one shift is not reused at later shifts
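To make the comparison count concrete, the sketch below counts individual character comparisons made by the naive method, stopping each attempt at the first mismatch. The helper name `count_comparisons` is an assumption made for this illustration.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by the naive left-to-right matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                     # mismatch: move on to the next shift
    return comparisons


# Worst-case example from the slides: 13 a's against "aaab".
print(count_comparisons("a" * 13, "aaab"))  # 40
```

Every one of the 10 shifts matches three characters before failing on the fourth, which is where the (n-m+1)·m figure comes from.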

Analysis
Brute-force pattern matching runs in O(mn) time in the worst case.
But most searches of ordinary text take O(m+n) time, because a mismatch usually occurs within the first few characters of each attempt.

Brute-force Analysis (Best)

Best Case
Example 1: Pattern found at the first position of the text
  Text: 0000000000000000001
  Pattern: 000
  Cost = O(M)

Example 2: Pattern not found, and always a mismatch on the first character
  Text: 0000000000000000001
  Pattern: 11
  Cost = O(N+M)

Naive Algorithm

Example (from right to left):
  a b c a b c a      (text)
        a b c a      (shift = 3)
      a b c a        (shift = 2)
    a b c a          (shift = 1)
  a b c a            (shift = 0)
The pattern occurs with shifts 0 and 3

Rabin-Karp Algorithm

Has a worst-case running time of O((n-m+1)m), but the average case is O(n+m)
  Also works well in practice

Based on the number-theoretic notion of modular equivalence
We assume that Σ = {0, 1, 2, …, 9}, i.e., each character is a decimal digit
  In general, use radix d, where d = |Σ|

Rabin-Karp Approach

We can view a string of k characters (digits) as a length-k decimal number
  E.g., the string 31425 corresponds to the decimal number 31,425

Given a pattern P[1..m], let p denote the corresponding decimal value
Given a text T[1..n], let t_s denote the decimal value of the length-m substring T[(s+1)..(s+m)] for s = 0, 1, …, n-m

Rabin-Karp Approach (contd.)

t_s = p iff T[(s+1)..(s+m)] = P[1..m], so s is a valid shift iff t_s = p
p can be computed in O(m) time using Horner's rule:
  p = P[m] + 10 (P[m-1] + 10 (P[m-2] + … + 10 (P[2] + 10 P[1]) …))

t_0 can similarly be computed in O(m) time
The other values t_1, t_2, …, t_{n-m} can be computed in O(n-m) time in total, since t_{s+1} can be computed from t_s in constant time

Rabin-Karp Approach (contd.)

t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]
  E.g., if T = {…, 3, 1, 4, 1, 5, 2, …}, m = 5 and t_s = 31,415, then t_{s+1} = 10 (31415 - 10000·3) + 2 = 14152
Thus we can compute p in Θ(m) time and t_0, t_1, …, t_{n-m} in Θ(n-m+1) time
So we can find all occurrences of the pattern P[1..m] in the text T[1..n] with Θ(m) preprocessing time and Θ(n-m+1) matching time

But there is a problem: this assumes that p and t_s are small numbers
  They may be too large to work with easily
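The Horner computation and the constant-time window update can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `horner_value` and `window_values` are hypothetical, the text is a string of decimal digits, and indices are 0-based, so T[s+1] on the slides is `text[s]` here. Python integers have arbitrary precision, so this version sidesteps the size problem just mentioned rather than solving it.

```python
def horner_value(digits):
    """Decimal value of a digit string, computed in len(digits) steps."""
    value = 0
    for ch in digits:
        value = 10 * value + int(ch)
    return value


def window_values(text, m):
    """Yield t_0, t_1, ..., t_{n-m} for a digit string text."""
    n = len(text)
    if m < 1 or m > n:
        return
    high = 10 ** (m - 1)                  # weight of the high-order digit
    t = horner_value(text[:m])            # t_0, computed in O(m)
    yield t
    for s in range(n - m):
        # Drop the old high-order digit, shift left, append the new digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        yield t


print(list(window_values("314152", 5)))  # [31415, 14152]
```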

Rabin-Karp Approach (contd.)

Solution: we can use modular arithmetic with a suitable modulus, q
  E.g., t_{s+1} ≡ (10 (t_s - T[s+1] h) + T[s+m+1]) (mod q), where h ≡ 10^{m-1} (mod q)

q is chosen as a small prime number; e.g., 13 for radix 10
  Generally, if the radix is d, then dq should fit within one computer word

How values modulo 13 are computed

  3 1 4 1 5 2
  (3 = old high-order digit, 2 = new low-order digit)

  14152 ≡ ((31415 - 3·10000)·10 + 2) (mod 13)
        ≡ ((7 - 3·3)·10 + 2) (mod 13)
        ≡ 8 (mod 13)
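The same step can be checked in a few lines of Python. This is a small sketch; `next_hash` is a name invented for the illustration, with radix d = 10 and modulus q = 13 as on the slide. Python's `%` operator returns a non-negative result for a positive modulus, so the intermediate negative value needs no special handling.

```python
def next_hash(t_s, old_digit, new_digit, h, q, d=10):
    """Rolling update: hash of the next window from the hash of the current one."""
    return (d * (t_s - old_digit * h) + new_digit) % q


q = 13
h = pow(10, 5 - 1, q)                     # h = 10^(m-1) mod q with m = 5
print(h)                                  # 3
print(next_hash(31415 % q, 3, 2, h, q))   # 8
print(14152 % q)                          # 8
```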

Problem of Spurious Hits

t_s ≡ p (mod q) does not imply that t_s = p
  Modular equivalence does not necessarily mean that two integers are equal

A case in which t_s ≡ p (mod q) but t_s ≠ p is called a spurious hit
On the other hand, if two integers are not modular equivalent, then they cannot be equal

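Example

pattern:   3 1 4 1 5    (31415 mod 13 = 7)

position:   1  2  3  4  5  6  7  8  9 10 11 12 13 14
text:       2  3  1  4  1  5  2  6  7  3  9  9  2  1

shift s:    0  1  2  3  4  5  6  7  8  9
mod 13:     1  7  8  4  5 10 11  7  9 11

s = 1: valid match (window 3 1 4 1 5)
s = 7: spurious hit (window 6 7 3 9 9)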
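The hit classification in this example can be reproduced with a short Python sketch. For clarity it recomputes each window value directly instead of using the rolling update, and the function name `classify_hits` is an assumption made for the illustration.

```python
def classify_hits(text, pattern, q=13):
    """List (shift, kind) for every window whose value mod q equals the pattern's."""
    m = len(pattern)
    p = int(pattern) % q
    hits = []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:          # a hit: hashes agree modulo q
            kind = "valid match" if window == pattern else "spurious hit"
            hits.append((s, kind))
    return hits


print(classify_hits("23141526739921", "31415"))
# [(1, 'valid match'), (7, 'spurious hit')]
```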

Rabin-Karp Algorithm

Basic structure like the naive algorithm, but uses modular arithmetic as described
For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit
In the worst case, every shift is verified
  Running time can be shown to be O((n-m+1)m)

Average-case running time is O(n+m)
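Putting the pieces together, a complete matcher might look like the sketch below. It is one possible rendering, not the slides' own code: the name `rabin_karp`, the use of `ord()` as the digit value of a character, and the defaults d = 256 and q = 101 are all assumptions made here for illustration.

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return all valid shifts, verifying every hash hit character by character."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                  # d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # preprocessing: p and t_0 in O(m)
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Only compare characters when the hashes agree (rules out spurious hits).
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                     # roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("abcabaabcabaa", "abaa"))  # [3, 9]
```

A larger prime q makes spurious hits rarer, at the cost of larger intermediate values.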

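Example 2

Let T = a b c b a b and P = a b c
Take a = 97, b = 98, c = 99 (i.e. the ASCII values of the characters), so |Σ| = 256
Integer value of P: p = c + 256 (b + 256 a) = 99 + 256 (98 + 256·97) = 6,382,179
Reduced modulo q = 256: p mod 256 = 99
  With q equal to the radix, the hash is just the code of the last character, so in practice a prime q is chosen, as on the earlier slides
In similar fashion, we can calculate the hash value of each length-m window of the text and compare it with p to check for a valid or spurious hit (as in the previous slides)

Analysis
In the worst case, every shift is verified
  Running time can be shown to be O((n-m+1)m)
Average-case running time is O(n+m)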

3. The KMP Algorithm


The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in left-to-right order (like the brute-force algorithm).
  But it shifts the pattern more intelligently than the brute-force algorithm.

If a mismatch occurs between the text and the pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons? (Indices are 0-based from here on.)
  Answer: by the largest prefix of P[0..j-1] that is a suffix of P[1..j-1]

Example

[Figure: text T aligned with pattern P; a mismatch at j = 5 gives the new position j_new = 2]

Why?
j == 5
Find the largest prefix (start) of "a b a a b" (P[0..j-1])
that is a suffix (end) of "b a a b" (P[1..j-1])
  Answer: "a b"
Set j = 2 // the new j value

KMP Failure Function

KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
  j = mismatch position in P[]
  k = position before the mismatch (k = j-1)
The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

Failure Function Example

P: "abaaba"
j:  012345

(k == j-1)
k     0  1  2  3  4
F(k)  0  0  1  1  2

F(k) is the size of the largest prefix.
In code, F() is represented by an array, like the table.

Why is F(4) == 2?
F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaab" that is also a suffix of "baab"
  = find the size of "ab"
  = 2
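The table can be computed in time linear in the pattern length. The sketch below uses the standard incremental construction; the function name `failure_function` is an assumption for illustration, and the returned list also contains the entry for the last pattern position, which a table indexed by k = j-1 does not need.

```python
def failure_function(pattern):
    """fail[k] = size of the largest prefix of P[0..k] that is also a suffix of P[1..k]."""
    m = len(pattern)
    fail = [0] * m
    k = 0                                 # length of the currently matched prefix
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]               # fall back to a shorter prefix
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```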

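Using the Failure Function

The Knuth-Morris-Pratt algorithm modifies the brute-force algorithm.
  If a mismatch occurs at P[j] (i.e. P[j] != T[i]) with j > 0, then
    k = j-1;
    j = F(k); // obtain the new j
  If the mismatch occurs at j == 0, there is no k, so the text position i advances by one instead.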

Example

T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

k     0  1  2  3  4
F(k)  0  0  1  0  1

[Figure: P is aligned under T five times; the character comparisons are numbered 1-6, 7, 8-12, 13 and 14-19]

P: "abacab"
Why is F(4) == 1?

F(4) means
  find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
  = find the size of the largest prefix of "abaca" that is also a suffix of "baca"
  = find the size of "a"
  = 1
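The whole search can be sketched in Python as below. This is a minimal illustration rather than a definitive implementation: the name `kmp_search` and the choice to return the comparison count alongside the match index are assumptions made here so that the run can be compared with the numbered comparisons of the example with P = "abacab".

```python
def kmp_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    # Failure function of the pattern (0-based, as on the slides).
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    i = j = comparisons = 0
    while i < n:
        comparisons += 1
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - m + 1, comparisons   # full match found
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]               # k = j-1; j = F(k); i does not move back
        else:
            i += 1                        # mismatch at j == 0: advance in the text
    return -1, comparisons


print(kmp_search("abacaabaccabacabaabb", "abacab"))  # (10, 19)
```

The text index `i` never decreases, which is the property that lets the text be read as a stream.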

KMP Advantages

KMP runs in optimal time: O(m+n)
  very fast

The algorithm never needs to move backwards in the input text, T
  this makes the algorithm good for processing very large files that are read in from external devices or through a network stream

KMP Disadvantages

KMP doesn't work so well as the size of the alphabet increases
  more chance of a mismatch (more possible mismatches)
  mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later
