Algorithms
F O U R T H
E D I T I O N
NFA simulation
NFA construction
applications
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
NFA simulation
NFA construction
applications
Pattern matching
Substring search. Find a single string in text.
Pattern matching. Find one of a specified set of strings in text.
Ex. [genomics]
text
GC G( CG G |AG G )*C T G
GC GG CG T GTG T GCG A GAG A GT G G G T T TA A AGC T GG CGC G GAGGCGGCTGGCGCGGAGGCTG
Syntax highlighting
/*************************************************************************
* Compilation: javac NFA.java
* Execution:
java NFA regexp text
* Dependencies: Stack.java Bag.java Digraph.java DirectedDFS.java
*
* % java NFA "(A*B|AC)D" AAAABD
* true
*
* % java NFA "(A*B|AC)D" AAAAC
* false
*
*************************************************************************/
public class NFA
{
private Digraph G;
private String regexp;
private int M;
input
output
Ada
HTML
Asm
XHTML
Applescript
LATEX
Awk
MediaWiki
Bat
ODF
Bib
TEXINFO
Bison
ANSI
C/C++
DocBook
C#
Cobol
Caml
Changelog
Css
D
Erlang
Flex
Fortran
GLSL
Haskell
Html
Java
Javalog
Javascript
Latex
Lisp
Lua
http://code.google.com/p/chromium/source/search
Regular expressions
A regular expression is a notation to specify a set of strings.
possibly infinite
operation
order
example RE
matches
concatenation
AABAAB
AABAAB
or
AA | BAAB
AA
BAAB
closure
AB*A
AA
ABBBBBBBBA
AB
ABABA
A(A|B)AAB
AAAAB
ABAAB
(AB)*A
A
ABABABABABA
AA
ABBA
parentheses
operation
example RE
matches
wildcard
.U.U.U.
CUMULUS
JUGULUM
SUCCUBUS
TUMULTUOUS
character class
[A-Za-z][a-z]*
word
Capitalized
camelCase
4illegal
at least 1
A(BC)+DE
ABCDE
ABCBCDE
ADE
BCDE
exactly k
[0-9]{5}-[0-9]{4}
08540-1321
19072-5541
111111111
166-54-111
regular expression
matches
.*SPB.*
RASPBERRY
CRISPBREAD
SUBSPACE
SUBSPECIES
166-11-4433
166-45-1111
11-55555555
8675309
wayne@princeton.edu
rs@princeton.edu
spam@nowhere
ident3
PatternMatcher
3a
ident#3
(substring search)
[0-9]{3}-[0-9]{2}-[0-9]{4}
(U. S. Social Security numbers)
[a-z]+@([a-z]+\.)+(edu|com)
(simplified email addresses)
[$_A-Za-z][$_A-Za-z0-9]*
(Java identifiers)
http://xkcd.com/1313
yes
no
obama
romney
bush
mccain
clinton
kerry
reagan
gore
...
washington
clinton
Ex. Match elected presidents but not opponents (unless they later won).
RE. bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls
madison
harrison
10
http://www.justice.gov/oig/special/s0807/final.pdf
11
12
http://xkcd.com/208
13
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
14
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
NFA simulation
NFA construction
applications
For any DFA, there exists a RE that describes the same set of strings.
For any RE, there exists a DFA that recognizes the same set of strings.
RE
0* | (0*10*10*10*)*
DFA
Stephen Kleene
Princeton Ph.D. 1934
17
a
text
AAAABD
pattern
matches text
reje
ct
Bad news. Basic plan is infeasible (DFA may have exponential # of states).
18
a
text
AAAABD
pattern
matches text
reje
ct
Q. What is an NFA?
19
Nondeterminism.
One view: machine can guess the proper sequence of state transitions.
Another view: sequence is a proof that the machine accepts the text.
0
10
11
accept state
A
0
A
3
A
3
match transition:
scan to next input character
and change state
B
3
D
5
10
11
!-transition:
change state
with no match
10
11
A
3
no way out
of state 4
A
0
A
0
no way out
of state 7
A
A
3
A
3
C
3
no way
9 out
of state 4
10
11
no way out
of state 4
wrongby
guess
if input is
Q. Is A A A C matched
NFA?
A
no way out
of state 7
A
0
A
3
A
3
A
3
C
3
no way out
of state 4
10
11
Nondeterminism
Q. How to determine whether a string is matched by an automaton?
DFA. Deterministic easy because exactly one applicable transition.
NFA. Nondeterministic can be several applicable transitions;
need to select the right one!
Q. How to simulate NFA?
A. Systematically consider all possible transition sequences. [stay tuned]
10
11
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
NFA simulation
NFA construction
applications
NFA representation
State names. Integers from 0 to M.
number of symbols in RE
10
11
accept state
NFA simulation
Q. How to efficiently simulate an NFA?
A. Maintain set of all possible states that NFA could be in
after reading in the first i text characters.
input
-transitions
match transitions
10
11
10
11
accept !
Digraph reachability
Digraph reachability. Find all vertices reachable from a given source
or set of vertices.
30
// match transitions
// epsilon transition digraph
// number of states
31
follow -transitions
}
for (int v : pc)
if (v == M) return true;
return false;
}
32
10
11
accept state
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
NFA simulation
NFA construction
applications
10
11
accept state
10
11
10
11
single-character closure
i+1
G.addEdge(i, i+1);
G.addEdge(i+1, i);
closure expression
single-character closure
i
i+1
lp
. . .
G.addEdge(i, i+1);
G.addEdge(i+1, i);
. . .
i+1
or
lp
...
G.addEdge(lp, i+1);
G.addEdge(i+1,
3
4
5lp);
or expression
or
lp
...
or expression
G.addEdge(lp, i+1);
G.addEdge(i+1, lp);
closure expression
lp
i+1
...
G.addEdge(lp, or+1);
9
10
11
G.addEdge(or,
i);
)
D
NFA construction rules
...
G.addEdge(lp, or+1);
38
closure expression
i
Building an NFA
corresponding
toi+1an RE
lp
. . .
or expression
or
lp
...
...
G.addEdge(lp, or+1);
G.addEdge(or, i);
10
11
( symbol:
| symbol:
) symbol:
10
11
stack
( ( A * B | A C ) D )
41
stack
10
11
accept state
2-way or
closure
(needs 1-character lookahead)
metasymbols
}
return G;
}
43
10
11
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
NFA simulation
NFA construction
applications
contains RE
as a substring
% more words.txt
a
aback
dictionary
(standard in Unix)
abacus
abalone
abandon
47
<blink>text</blink>some
reluctant
text<blink>more
text</blink>
greedy
48
50
Harvesting information
Goal. Print all substrings of input that match a RE.
51
Harvesting information
RE pattern matching is implemented in Java's java.util.regexp.Pattern and
java.util.regexp.Matcher
classes.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Harvester
{
public static void main(String[] args)
{
String regexp
= args[0];
In in
= new In(args[1]);
String input
= in.readAll();
Pattern pattern = Pattern.compile(regexp);
Matcher matcher = pattern.matcher(input);
while (matcher.find())
{
StdOut.println(matcher.group());
}
}
}
compile() creates a
Pattern (NFA) from RE
matcher() creates a
Matcher (NFA simulator)
from NFA and text
group() returns
the substring most
recently found by find()
52
%
%
%
%
%
%
java
java
java
java
java
java
Validate
Validate
Validate
Validate
Validate
Validate
"(a|aa)*b"
"(a|aa)*b"
"(a|aa)*b"
"(a|aa)*b"
"(a|aa)*b"
"(a|aa)*b"
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
1.6
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
3.7
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
9.7
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
23.2
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
62.2
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 161.6
seconds
seconds
seconds
seconds
seconds
seconds
Not-so-regular expressions
Back-references.
// beriberi couscous
// 1111 111111 111111111
Context
Abstract machines, languages, and nondeterminism.
KMP
grep
javac
string DFA.
RE NFA.
Java language Java byte code.
KMP
grep
Java
pattern
string
RE
program
parser
unnecessary
check if legal
check if legal
compiler output
DFA
NFA
byte code
simulator
DFA simulator
NFA simulator
JVM
55