String Matching

Outline
String Matching
Introduction
Naïve Algorithm
Rabin-Karp Algorithm
Knuth-Morris-Pratt (KMP) Algorithm
Introduction
What is string matching?
Finding all occurrences of a pattern in a given text (or
body of text)
Many applications
While using editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis, Web search engines (e.g.
Google), image analysis
String-Matching Problem
The text is in an array T [1..n] of length n
The pattern is in an array P [1..m] of
length m
Elements of T and P are characters from a
finite alphabet Σ
E.g., Σ = {0, 1} or Σ = {a, b, …, z}
Usually T and P are called strings of
characters
String-Matching Problem (cont'd)
We say that pattern P occurs with shift s in
text T if:
a) 0 ≤ s ≤ n-m and
b) T [(s+1)..(s+m)] = P [1..m]
If P occurs with shift s in T, then s is a
valid shift, otherwise s is an invalid shift
String-matching problem: finding all valid
shifts for a given T and P
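The definition of a valid shift can be sketched in Python as follows. This is a minimal illustration, assuming 0-indexed Python strings, so the slides' T [(s+1)..(s+m)] becomes the slice T[s:s+m]. The function name is_valid_shift is an illustrative choice, not part of the slides.

```python
def is_valid_shift(T: str, P: str, s: int) -> bool:
    """Return True if pattern P occurs in text T with shift s."""
    n, m = len(T), len(P)
    # Condition (a): the shifted pattern must lie entirely inside the text.
    if not 0 <= s <= n - m:
        return False
    # Condition (b): T[(s+1)..(s+m)] = P[1..m] in the slides' 1-based notation.
    return T[s:s + m] == P


# Text and pattern taken from Example 1 of the slides.
print(is_valid_shift("abcabaabcabac", "abaa", 3))  # True
```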
Example 1
          1  2  3  4  5  6  7  8  9  10 11 12 13
text T    a  b  c  a  b  a  a  b  c  a  b  a  c
pattern P          a  b  a  a   (s = 3)
                   1  2  3  4
shift s = 3 is a valid shift
(n = 13, m = 4, and 0 ≤ s ≤ n-m holds)
Example 2
pattern P a  b  a  a
          1  2  3  4
          1  2  3  4  5  6  7  8  9  10 11 12 13
text T    a  b  c  a  b  a  a  b  c  a  b  a  a
s=3                a  b  a  a
s=9                                  a  b  a  a
Naïve String-Matching Algorithm
Input: Text strings T [1..n] and P [1..m]
Result: All valid shifts displayed
NAÏVE-STRING-MATCHER (T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n-m
    if P [1..m] = T [(s+1)..(s+m)]
        print "pattern occurs with shift" s
Naïve Algorithm
The naïve algorithm checks, at every position in the text
from 0 to n-m, whether an occurrence of the pattern
starts there.
After each attempt, it shifts the pattern by exactly one position to
the right.
Example (from left to right):
a b c a b c a
a b c a          (shift = 0)
  a b c a        (shift = 1)
    a b c a      (shift = 2)
      a b c a    (shift = 3)
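The naïve matcher can be sketched in Python as follows. This is a minimal illustration, assuming 0-indexed strings; it returns the list of valid shifts instead of printing them, and the function name naive_string_matcher is an illustrative choice.

```python
from typing import List


def naive_string_matcher(T: str, P: str) -> List[int]:
    """Return every valid shift s of pattern P in text T (naive method)."""
    n, m = len(T), len(P)
    shifts = []
    # Try every shift s = 0, 1, ..., n-m, moving one position at a time.
    for s in range(n - m + 1):
        # Compare the whole pattern against the window of the text at shift s.
        if T[s:s + m] == P:
            shifts.append(s)
    return shifts


# Text and pattern from the left-to-right example in the slides.
print(naive_string_matcher("abcabca", "abca"))  # [0, 3]
```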
Analysis: Worst-case Example
          1  2  3  4
pattern P a  a  a  b
          1  2  3  4  5  6  7  8  9  10 11 12 13
text T    a  a  a  a  a  a  a  a  a  a  a  a  a
          a  a  a  b
             a  a  a  b
Worst-case Analysis
There are m comparisons for each shift in the
worst case
There are n-m+1 shifts
So, the worst-case running time is Θ((n-m+1)m)
In the example on the previous slide, we have (13-4+1)·4 = 40
comparisons in total
The naïve method is inefficient because information
gained at one shift is not reused at later shifts
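To make the count concrete, the following sketch instruments the naïve method with a comparison counter. It assumes character-by-character comparison that stops at the first mismatch within each shift. The function name count_comparisons is hypothetical.

```python
from typing import List, Tuple


def count_comparisons(T: str, P: str) -> Tuple[List[int], int]:
    """Run the naive matcher and count single-character comparisons."""
    n, m = len(T), len(P)
    shifts = []
    comparisons = 0
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1          # one character comparison
            if T[s + j] != P[j]:
                break                 # mismatch: give up on this shift
            j += 1
        if j == m:                    # all m characters matched
            shifts.append(s)
    return shifts, comparisons


# Worst-case example from the slides: 13 a's against the pattern "aaab".
print(count_comparisons("a" * 13, "aaab"))  # ([], 40)
```

Every one of the 10 shifts matches three characters and fails on the fourth, which gives the (13-4+1)·4 = 40 comparisons stated above.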
Naïve Algorithm
Example (from right to left):
a b c a b c a
      a b c a    (shift = 3)
    a b c a      (shift = 2)
  a b c a        (shift = 1)
a b c a          (shift = 0)
The pattern occurs with shifts 0 and 3
Rabin-Karp Algorithm
Has a worst-case running time of O((n-m+1)m),
but the average case is O(n+m)
Also works well in practice
Based on the number-theoretic notion of
modular equivalence
We assume that Σ = {0, 1, 2, …, 9}, i.e.,
each character is a decimal digit
In general, use radix d, where d = |Σ|
Rabin-Karp Approach
We can view a string of k characters (digits)
as a length-k decimal number
E.g., the string "31425" corresponds to the
decimal number 31,425
Given a pattern P [1..m], let p denote the
corresponding decimal value
Given a text T [1..n], let t_s denote the decimal
value of the length-m substring T [(s+1)..(s+m)]
for s = 0, 1, …, n-m
Rabin-Karp Approach (cont'd)
t_s = p iff T [(s+1)..(s+m)] = P [1..m]
s is a valid shift iff t_s = p
p can be computed in O(m) time
p = P[m] + 10 (P[m-1] + 10 (P[m-2] + … + 10 (P[2] + 10 P[1]) …))
t_0 can similarly be computed in O(m) time
The other values t_1, t_2, …, t_{n-m} can be computed in O(n-m)
time, since t_{s+1} can be computed from t_s in
constant time
t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]
E.g., if T = {…, 3, 1, 4, 1, 5, 2, …}, m = 5 and t_s =
31,415, then t_{s+1} = 10 (31415 - 10000·3) + 2
= 14152
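The two computations above (the nested evaluation of p, often called Horner's rule, and the constant-time update from t_s to t_{s+1}) can be sketched in Python. This is an illustration only, assuming the text is a string of decimal digits and 0-indexed positions, so the slides' T[s+1] and T[s+m+1] become T[s] and T[s+m]. The function names are hypothetical.

```python
from typing import List


def horner_value(digits: str) -> int:
    """Decimal value of a digit string, computed with nested multiplication."""
    value = 0
    for ch in digits:
        value = 10 * value + int(ch)
    return value


def window_values(T: str, m: int) -> List[int]:
    """Return t_0, t_1, ..., t_{n-m} for the digit string T."""
    n = len(T)
    if m <= 0 or m > n:
        return []
    high = 10 ** (m - 1)            # weight of the high-order digit
    t = horner_value(T[:m])         # t_0, computed in O(m)
    values = [t]
    for s in range(n - m):
        # Drop the old high-order digit T[s], shift left, append T[s+m].
        t = 10 * (t - high * int(T[s])) + int(T[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```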
Thus we can compute p in Θ(m) time and can
compute t_0, t_1, t_2, …, t_{n-m} in Θ(n-m+1) time
So we can find all occurrences of the pattern
P [1..m] in the text T [1..n] with Θ(m) preprocessing
time and Θ(n-m+1) matching time.
But there is a problem: this assumes that p and t_s are small
enough for arithmetic on them to take constant time
They may be too large to work with easily
Solution: we can use modular arithmetic with
a suitable modulus, q
E.g.,
t_{s+1} ≡ (10 (t_s - T[s+1] h) + T[s+m+1]) (mod q)
where h ≡ 10^{m-1} (mod q)
q is chosen as a small prime number; e.g.,
13 for radix 10
Generally, if the radix is d, then dq should fit
within one computer word
How values modulo 13 are computed
3 1 4 1 5 2
window 3 1 4 1 5 has value 7 (mod 13); window 1 4 1 5 2 has value 8 (mod 13)
old high-order digit: 3; new low-order digit: 2
14152 ≡ (31415 - 3·10000)·10 + 2 (mod 13)
      ≡ (7 - 3·3)·10 + 2 (mod 13)
      ≡ 8 (mod 13)
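The modular update step can be checked with a short Python sketch. The function name next_window_mod and its parameter names are hypothetical. The defaults d = 10 and q = 13 follow the slides' example.

```python
def next_window_mod(t_s: int, old_digit: int, new_digit: int,
                    m: int, d: int = 10, q: int = 13) -> int:
    """Compute t_{s+1} mod q from t_s mod q in constant time."""
    h = pow(d, m - 1, q)   # h = d^(m-1) mod q; here 10000 mod 13 = 3
    # In Python, % with a positive modulus never returns a negative result,
    # so the negative intermediate value (7 - 3*3 = -2) needs no correction.
    return (d * (t_s - old_digit * h) + new_digit) % q


# 31415 mod 13 = 7; drop the high-order 3, append the low-order 2.
print(next_window_mod(7, old_digit=3, new_digit=2, m=5))  # 8
```

In languages where the remainder operator can return a negative value (C or Java, for instance), q would typically be added before the final reduction.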
Problem of Spurious Hits
t_s ≡ p (mod q) does not imply that t_s = p
Modular equivalence does not necessarily mean
that two integers are equal
A case in which t_s ≡ p (mod q) but t_s ≠ p is
called a spurious hit
On the other hand, if two integers are not
equivalent modulo q, then they cannot be
equal
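Example
pattern P    3  1  4  1  5      (31415 mod 13 = 7)
index        1  2  3  4  5  6  7  8  9  10 11 12 13 14
text T       2  3  1  4  1  5  2  6  7  3  9  9  2  1
shift s      0  1  2  3  4  5  6  7  8  9
t_s mod 13   1  7  8  4  5  10 11 7  9  11
s = 1 is a valid match (window 31415)
s = 7 is a spurious hit (window 67399 ≠ 31415)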
The Rabin-Karp algorithm has the same basic structure as the
naïve algorithm, but uses modular arithmetic as described
For each hit, i.e., for each s where t_s ≡ p
(mod q), verify character by character
whether s is a valid shift or a spurious hit
In the worst case, every shift is verified
The worst-case running time can be shown to be O((n-m+1)m)
The average-case running time is O(n+m)
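Putting the pieces together, a Rabin-Karp matcher might look like the sketch below. It assumes a text of decimal digits, radix d = 10 and modulus q = 13 as in the slides' example, and 0-indexed strings. The function name rabin_karp is an illustrative choice, and returning the spurious hits separately is done here only to make them visible.

```python
from typing import List, Tuple


def rabin_karp(T: str, P: str, d: int = 10, q: int = 13) -> Tuple[List[int], List[int]]:
    """Return (valid shifts, spurious hits) for digit strings T and P."""
    n, m = len(T), len(P)
    valid: List[int] = []
    spurious: List[int] = []
    if m == 0 or m > n:
        return valid, spurious
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # preprocessing: p and t_0 mod q
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    for s in range(n - m + 1):
        if p == t:                       # a hit: verify character by character
            if T[s:s + m] == P:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                    # roll the window to shift s+1
            t = (d * (t - int(T[s]) * h) + int(T[s + m])) % q
    return valid, spurious


# Text and pattern from the spurious-hit example in the slides.
print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```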
The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm
looks for the pattern in the text in left-to-right
order (like the brute-force algorithm).
But it shifts the pattern more intelligently
than the brute-force algorithm.
The KMP Algorithm (cont'd)
If a mismatch occurs between the text and
pattern P at P[j], what is the most we can shift the
pattern to avoid wasteful comparisons?
Example
Why? (mismatch at j == 5)
Find the largest prefix (start) of:
"a b a a b" ( P[0..j-1] )
which is a suffix (end) of:
"b a a b" ( P[1..j-1] )
Answer: "a b"
Set j = 2 // the new j value
KMP Failure Function
KMP preprocesses the pattern to find
matches of prefixes of the pattern with the
pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1).
The failure function F(k) is defined as the
size of the largest prefix of P[0..k] that is
also a suffix of P[1..k].
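The definition of F(k) can be turned directly into a brute-force Python sketch. It is meant only to mirror the definition (it is not the efficient preprocessing that KMP actually uses), and the function name failure_by_definition is hypothetical.

```python
def failure_by_definition(P: str, k: int) -> int:
    """Size of the largest prefix of P[0..k] that is also a suffix of P[1..k]."""
    prefix_source = P[:k + 1]     # P[0..k]
    suffix_source = P[1:k + 1]    # P[1..k]
    # Try candidate sizes from largest to smallest; the first hit is the answer.
    for size in range(len(suffix_source), 0, -1):
        if suffix_source.endswith(prefix_source[:size]):
            return size
    return 0


print([failure_by_definition("abaaba", k) for k in range(5)])  # [0, 0, 1, 1, 2]
```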
Failure Function Example
(k == j-1)
P: "abaaba"
j:  012345
k     0 1 2 3 4
F(k)  0 0 1 1 2
F(k) is the size of
the largest prefix.
In code, F() is represented by an array, like
the table.
Why is F(4) == 2?
P: "abaaba"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that
is also a suffix of "baab"
= find the size of "ab"
= 2
Using the Failure Function
The Knuth-Morris-Pratt algorithm modifies the
brute-force algorithm.
if a mismatch occurs at P[j]
(i.e. P[j] != T[i]), then
k = j-1;
j = F(k); // obtain the new j
// if j == 0 there is no k: move on to the next text character instead
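The mismatch rule can be placed in a complete matcher, sketched below in Python. This is one common formulation, not necessarily the code behind the slides. It assumes 0-indexed strings and reports all match positions, and the names compute_failure and kmp_match are illustrative.

```python
from typing import List


def compute_failure(P: str) -> List[int]:
    """Failure function F as an array: F[k] for k = 0 .. m-1, in O(m) time."""
    m = len(P)
    F = [0] * m
    j = 0                                  # length of the prefix matched so far
    for i in range(1, m):
        while j > 0 and P[i] != P[j]:
            j = F[j - 1]                   # fall back to a shorter prefix
        if P[i] == P[j]:
            j += 1
        F[i] = j
    return F


def kmp_match(T: str, P: str) -> List[int]:
    """Return the start index of every occurrence of P in T."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    F = compute_failure(P)
    matches = []
    i = j = 0
    while i < n:                           # i never moves backwards in T
        if T[i] == P[j]:
            i += 1
            j += 1
            if j == m:                     # whole pattern matched
                matches.append(i - m)
                j = F[m - 1]               # keep going for further matches
        elif j > 0:
            j = F[j - 1]                   # k = j-1; j = F(k)
        else:
            i += 1                         # mismatch at P[0]: next text char
    return matches


print(kmp_match("abacaabaccabacabaabb", "abacab"))  # [10]
```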
Example
(the numbers 1 to 19 give the order of the character comparisons)
T:  a  b  a  c  a  a  b  a  c  c  a  b  a  c  a  b  a  a  b  b
    1  2  3  4  5  6
P:  a  b  a  c  a  b
                   7
                a  b  a  c  a  b
                   8  9  10 11 12
                   a  b  a  c  a  b
                               13
                               a  b  a  c  a  b
                                  14 15 16 17 18 19
                                  a  b  a  c  a  b
k     0 1 2 3 4
F(k)  0 0 1 0 1
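The trace in this example can be reproduced with a small Python sketch that counts character comparisons and stops at the first match. It assumes the failure table is supplied by the caller (here, the table from the example), and the function name kmp_first_match is hypothetical.

```python
from typing import List, Tuple


def kmp_first_match(T: str, P: str, F: List[int]) -> Tuple[int, int]:
    """Return (index of first match or -1, number of character comparisons).

    F[k] must hold the failure function for k = 0 .. len(P)-2.
    """
    n, m = len(T), len(P)
    if m == 0:
        return 0, 0
    i = j = comparisons = 0
    while i < n:
        comparisons += 1                   # compare T[i] with P[j]
        if T[i] == P[j]:
            if j == m - 1:                 # last pattern character matched
                return i - m + 1, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]                   # k = j-1; j = F(k)
        else:
            i += 1                         # mismatch at P[0]
    return -1, comparisons


# Text, pattern and failure table from the example.
print(kmp_first_match("abacaabaccabacabaabb", "abacab", [0, 0, 1, 0, 1]))  # (10, 19)
```

The 19 comparisons agree with the numbering in the example, and the match starts at index 10 of T (0-indexed).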
Why is F(4) == 1?
P: "abacab"
F(4) means
find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that
is also a suffix of "baca"
= find the size of "a"
= 1
KMP Advantages
KMP runs in optimal time: O(m+n)
very fast
The algorithm never needs to move
backwards in the input text, T
this makes the algorithm good for processing
very large files that are read in from external
devices or through a network stream
KMP Disadvantages
KMP doesn't work as well as the size of the
alphabet increases
more chance of a mismatch (more possible
mismatches)
mismatches tend to occur early in the pattern,
but KMP is faster when the mismatches occur
later
