
Outline

String Matching
Introduction
Naïve Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm
Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or
body of text)
Many applications
While using editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis, Web search engines (e.g.
Google), image analysis
String-Matching Problem
The text is in an array T [1..n] of length n
The pattern is in an array P [1..m] of
length m
Elements of T and P are characters from a
finite alphabet Σ
E.g., Σ = {0, 1} or Σ = {a, b, …, z}
Usually T and P are called strings of
characters
String-Matching Problem (cont'd)

We say that pattern P occurs with shift s in
text T if:
a) 0 ≤ s ≤ n-m and
b) T [(s+1)..(s+m)] = P [1..m]
If P occurs with shift s in T, then s is a
valid shift, otherwise s is an invalid shift
String-matching problem: finding all valid
shifts for a given T and P
Example 1
            1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  b  c  a  b  a  a  b  c  a  b  a  c
pattern P            a  b  a  a     (s = 3)
                     1  2  3  4

shift s = 3 is a valid shift
(n = 13, m = 4 and 0 ≤ s ≤ n-m holds)
Example 2
            1  2  3  4
pattern P   a  b  a  a

            1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  b  c  a  b  a  a  b  c  a  b  a  a
s = 3                a  b  a  a
s = 9                                  a  b  a  a
Naïve String-Matching Algorithm
Input: Text strings T [1..n] and P [1..m]
Result: All valid shifts displayed

NAÏVE-STRING-MATCHER (T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n-m
    if P[1..m] = T [(s+1)..(s+m)]
        print "pattern occurs with shift" s
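The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not part of the original slides: the function name `naive_string_matcher` is hypothetical, and Python strings are 0-based, so the slice `text[s:s + m]` plays the role of T [(s+1)..(s+m)].

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0, 1, ..., n - m
        # text[s:s + m] is T[(s+1)..(s+m)] in the slides' 1-based notation
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


# Text and pattern from Example 2
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```

The sketch returns the shifts in a list instead of printing them, which makes the result easier to reuse.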
Naïve Algorithm

The naïve algorithm checks, at every position s in the text
from 0 to n-m, whether an occurrence of the pattern
starts there or not.
After each attempt, it shifts the pattern by exactly one position to
the right.
Example (from left to right):
a b c a b c a
a b c a         (shift = 0)
  a b c a       (shift = 1)
    a b c a     (shift = 2)
      a b c a   (shift = 3)
Analysis: Worst-case Example
            1  2  3  4
pattern P   a  a  a  b

            1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  a  a  a  a  a  a  a  a  a  a  a  a
            a  a  a  b
               a  a  a  b
Worst-case Analysis
There are m comparisons for each shift in the
worst case
There are n-m+1 shifts
So, the worst-case running time is Θ((n-m+1)m)
In the example on the previous slide, we have (13-4+1)·4 = 40
comparisons in total
The naïve method is inefficient because information
from a shift is not used again
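The count above can be checked with a short Python sketch. This is an illustration only: the helper name `count_naive_comparisons` is hypothetical, and it assumes the naïve matcher compares characters left to right and stops at the first mismatch.

```python
def count_naive_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by a naive left-to-right matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                   # stop this shift at the first mismatch
    return comparisons


# Worst-case example: every shift needs all m = 4 comparisons
print(count_naive_comparisons("a" * 13, "aaab"))  # 40
```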
Naïve Algorithm

Example (from right to left):
a b c a b c a
      a b c a   (shift = 3)
    a b c a     (shift = 2)
  a b c a       (shift = 1)
a b c a         (shift = 0)
The pattern occurs with shifts 0 and 3
Rabin-Karp Algorithm
Has a worst-case running time of O((n-m+1)m),
but the average case is O(n+m)
Also works well in practice
Based on number-theoretic notion of
modular equivalence
We assume that Σ = {0, 1, 2, …, 9}, i.e.,
each character is a decimal digit
In general, use radix-d where d = |Σ|
Rabin-Karp Approach
We can view a string of k characters (digits)
as a length-k decimal number
E.g., the string 31425 corresponds to the
decimal number 31,425
Given a pattern P [1..m], let p denote the
corresponding decimal value
Given a text T [1..n], let t_s denote the decimal
value of the length-m substring T [(s+1)..(s+m)]
for s = 0, 1, …, (n-m)
Rabin-Karp Approach (cont'd)

t_s = p iff T [(s+1)..(s+m)] = P [1..m]
s is a valid shift iff t_s = p
p can be computed in O(m) time (Horner's rule):
p = P[m] + 10 (P[m-1] + 10 (P[m-2] + …))
t_0 can similarly be computed in O(m) time
The other values t_1, t_2, …, t_{n-m} can be computed in O(n-m)
time, since t_{s+1} can be computed from t_s in
constant time
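The nested formula for p can be evaluated from the leftmost digit, as in the following Python sketch. It is an illustration under the slides' assumption that every character is a decimal digit. The helper name `decimal_value` is hypothetical.

```python
def decimal_value(digits: str) -> int:
    """Evaluate a digit string as a decimal number in O(m) time."""
    value = 0
    for ch in digits:                   # leftmost (high-order) digit first
        value = 10 * value + int(ch)    # one multiply and one add per digit
    return value


print(decimal_value("31425"))  # 31425
```

The same routine gives both p (from the pattern) and t_0 (from the first m characters of the text).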
Rabin-Karp Approach (cont'd)

t_{s+1} = 10(t_s - 10^{m-1} T [s+1]) + T [s+m+1]

E.g., if T = {…, 3, 1, 4, 1, 5, 2, …}, m = 5 and t_s =
31,415, then t_{s+1} = 10(31415 - 10000·3) + 2
= 14152
Thus we can compute p in Θ(m) time and can
compute t_0, t_1, t_2, …, t_{n-m} in Θ(n-m+1) time
And we can find all occurrences of the pattern
P[1..m] in text T[1..n] with Θ(m) preprocessing
time and Θ(n-m+1) matching time.
But there is a problem: this assumes p and t_s are small numbers
They may be too large to work with easily
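The constant-time update can be sketched in Python as follows, assuming decimal digits and 0-based indexing, so `text[s]` stands for T [s+1] and `text[s + m]` for T [s+m+1]. The name `window_values` is hypothetical. Python integers have arbitrary precision, so this sketch does not run into the size problem. In a language with fixed-width words, the values would overflow for large m.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a text made of decimal digits."""
    n = len(text)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)                # weight of the high-order digit
    t = int(text[:m])                   # t_0
    values = [t]
    for s in range(n - m):
        # drop the old high-order digit, shift left, append the new digit
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```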
Rabin-Karp Approach (cont'd)

Solution: we can use modular arithmetic with
a suitable modulus, q
E.g.,
t_{s+1} ≡ (10(t_s - T [s+1] h) + T [s+m+1]) (mod q)
where h ≡ 10^{m-1} (mod q)
q is chosen as a small prime number; e.g.,
13 for radix 10
Generally, if the radix is d, then dq should fit
within one computer word
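How values modulo 13 are computed
Window 3 1 4 1 5 has value 7 (mod 13). Dropping the old high-order
digit 3 and appending the new low-order digit 2 gives the
window 1 4 1 5 2, with value 8 (mod 13):

14152 ≡ ((31415 - 3·10000)·10 + 2) (mod 13)
      ≡ ((7 - 3·3)·10 + 2) (mod 13)
      ≡ 8 (mod 13)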
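The same calculation can be reproduced with a few lines of Python. This is a sketch only, and the function name `next_hash` and its parameters are hypothetical. Note that Python's `%` operator returns a non-negative result for a positive modulus, so the intermediate value -18 is reduced to 8 without extra handling. In languages where `%` can return a negative value, q would need to be added back.

```python
def next_hash(t_s: int, old_digit: int, new_digit: int,
              h: int, q: int, d: int = 10) -> int:
    """Compute t_{s+1} mod q from t_s mod q in constant time."""
    return (d * (t_s - old_digit * h) + new_digit) % q


q = 13
h = pow(10, 4, q)        # 10^(m-1) mod q with m = 5
t_s = 31415 % q          # hash of the window 3 1 4 1 5
print(h, t_s, next_hash(t_s, 3, 2, h, q))  # 3 7 8
```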
Problem of Spurious Hits
t_s ≡ p (mod q) does not imply that t_s = p
Modular equivalence does not necessarily mean
that two integers are equal
A case in which t_s ≡ p (mod q) when t_s ≠ p is
called a spurious hit

On the other hand, if two integers are not
modular equivalent, then they cannot be
equal
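Example
pattern P = 3 1 4 1 5, and 31415 mod 13 = 7

position      1  2  3  4  5  6  7  8  9 10 11 12 13 14
text T        2  3  1  4  1  5  2  6  7  3  9  9  2  1

shift s       0  1  2  3  4  5  6  7  8  9
t_s mod 13    1  7  8  4  5 10 11  7  9 11

s = 1 is a valid match (T [2..6] = 3 1 4 1 5)
s = 7 is a spurious hit (T [8..12] = 6 7 3 9 9, and 67399 mod 13 = 7)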
Rabin-Karp Algorithm
Basic structure like the naïve algorithm,
but uses modular arithmetic as described
For each hit, i.e., for each s where t_s ≡ p
(mod q), verify character by character
whether s is a valid shift or a spurious hit
In the worst case, every shift is verified
Running time can be shown as O((n-m+1)m)
Average-case running time is O(n+m)
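Putting the pieces together, a minimal Rabin-Karp sketch in Python might look like the following. It assumes a text of decimal digits (radix d = 10) and the modulus q = 13 used in the slides. The function name `rabin_karp` and the choice to return spurious hits separately are illustrative assumptions, made so the earlier example can be reproduced.

```python
def rabin_karp(text: str, pattern: str, q: int = 13, d: int = 10):
    """Return (valid_shifts, spurious_hits) for digit strings."""
    n, m = len(text), len(pattern)
    valid, spurious = [], []
    if m == 0 or m > n:
        return valid, spurious
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # preprocessing: p and t_0 mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    for s in range(n - m + 1):
        if p == t:                       # a hit: verify character by character
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                    # roll the hash to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


# Text and pattern from the spurious-hit example
print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```

A larger prime q makes spurious hits rarer, at the cost of larger intermediate values.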
3. The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm
looks for the pattern in the text in a left-to-right
order (like the brute force algorithm).

But it shifts the pattern more intelligently
than the brute force algorithm.

If a mismatch occurs between the text and
pattern P at P[j], what is the most we can shift the
pattern to avoid wasteful comparisons?
Example
Why? (j == 5)

Find largest prefix (start) of:
"a b a a b" ( P[0..j-1] )

which is suffix (end) of:
"b a a b" ( P[1..j-1] )

Answer: "a b"
Set j = 2 // the new j value
KMP Failure Function
KMP preprocesses the pattern to find
matches of prefixes of the pattern with the
pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1).
The failure function F(k) is defined as the
size of the largest prefix of P[0..k] that is
also a suffix of P[1..k].
Failure Function Example
P: "abaaba"
j: 012345

(k == j-1)
k     0  1  2  3  4
F(k)  0  0  1  1  2

F(k) is the size of the largest prefix.
In code, F() is represented by an array, like the table.
Why is F(4) == 2?
P: "abaaba"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that
is also a suffix of "baab"
= find the size of "ab"
= 2
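A common way to fill the failure-function array in O(m) time is sketched below in Python. The slides do not show this code, so treat it as one possible implementation; the name `failure_function` is hypothetical. The returned list also contains the last entry F(m-1), which the table above leaves out.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    m = len(pattern)
    fail = [0] * m
    j = 0                                # length of the prefix matched so far
    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]              # fall back to a shorter prefix
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```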
Using the Failure Function

The Knuth-Morris-Pratt algorithm modifies the
brute-force algorithm.
if a mismatch occurs at P[j]
(i.e. P[j] != T[i]), then
    k = j-1;
    j = F(k); // obtain the new j
if j == 0 there is no k; move to the next
text character (i = i+1) instead
Example
(the numbers 1..19 label the character comparisons in the order they are made)

T:     a  b  a  c  a  a  b  a  c  c  a  b  a  c  a  b  a  a  b  b
       1  2  3  4  5  6
P:     a  b  a  c  a  b
                      7
                   a  b  a  c  a  b
                      8  9 10 11 12
                      a  b  a  c  a  b
                                 13
                                  a  b  a  c  a  b
                                    14 15 16 17 18 19
                                     a  b  a  c  a  b

k     0  1  2  3  4
F(k)  0  0  1  0  1
Why is F(4) == 1?
P: "abacab"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that
is also a suffix of "baca"
= find the size of "a"
= 1
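The example trace can be reproduced with the following Python sketch of the KMP search. It is an illustration under stated assumptions: the name `kmp_search` is hypothetical, the function stops at the first match, and the comparison counter covers only text-versus-pattern comparisons in the search phase, not the preprocessing of the pattern.

```python
def kmp_search(text: str, pattern: str) -> tuple[int, int]:
    """Return (index of the first match or -1, number of comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    # Preprocessing: build the failure function F as a list
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Search: i never moves backwards in the text
    i = j = comparisons = 0
    while i < n:
        comparisons += 1
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - m + 1, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]              # k = j-1; j = F(k)
        else:
            i += 1                       # mismatch at P[0]: advance in the text
    return -1, comparisons


print(kmp_search("abacaabaccabacabaabb", "abacab"))  # (10, 19)
```

The result shows the match starting at 0-based index 10 after 19 comparisons, in line with the numbered comparisons in the example.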
KMP Advantages
KMP runs in optimal time: O(m+n)
very fast

The algorithm never needs to move
backwards in the input text, T
this makes the algorithm good for processing
very large files that are read in from external
devices or through a network stream
KMP Disadvantages
KMP doesn't work as well as the size of the
alphabet increases
more chance of a mismatch (more possible
mismatches)
mismatches tend to occur early in the pattern,
but KMP is faster when the mismatches occur
later
