McCLUR£, Editor
Regular Expression Search In the compiled code, the lists mentioned in the algo-
rithm are not characters, but transfer instructions into
Algorithm the compiled code. The execution is extremely fast since
a transfer to the top of the current list automatically
searches for all possible sequel characters in the regular
I~EN THOMPSON expression.
Bell Telephone Laboratories, Inc., Murray Hill, New Jersey This compile-search algorithm is incorporated as the
context search in a time-sharing text editor. This is by
no means the only use of such a search routine. For
A method for locating specific character strings embedded example, a variant of this algorithm is used as the symbol
in character text is described and an implementation of this table search in an assembler.
method in the form of a compiler is discussed. The compiler I t is assumed that the reader is familiar with regular
accepts a regular expression as source language and pro- expressions [2] and the machine language of the I B M 7094
duces an IBM 7 0 9 4 program as object language. The object computer [3].
program then accepts the text to be searched as input and
produces a signal every time an embedded string in the text The Compiler
matches the given regular expression. Examples, problems,
and solutions are also presented. The compiler consists of three concurrently running
stages. The first stage is a syntax sieve that allows only
KEY WORDS AND PHRASES: search, match, regular expression
syntactically correct regular expressions to pass. This
CR CATEGORIES:3.74, 4.49, 5.32
stage also inserts the operator " . " for juxtaposition of
regular expressions. The second stage converts the regular
The Algorithm
expression to reverse Polish form. The third stage is the
Previous search algorithms involve backtracking when object code producer. The first two stages are straight-
a partially successful search path fails. This necessitates forward and are not discussed. The third stage expects a
a lot of storage and bookkeeping, and executes slowly. In syntactically correct, reverse Polish regular expression.
the regular expression recognition technique described in The regular expression a(b I c),d will be carried through
this paper, each character in the text to be searched is as an example. This expression is translated into abc I * " d •
examined in sequence against a list of all possible current by the first two stages. A functional description of the
characters. During this examination a new list of all third stage of the compiler follows:
possible next characters is built. When the end of the The heart of the third stage is a pushdown stack. Each
current list is reached, the new list becomes the current entry in the pushdown stack is a pointer to the compiled
list, the next character is obtained, and the process con- code of an operand. When a binary operator ("1" or ". ")
tinues. In the terms of Brzozowski [1], this algorithm con- is compiled, the top (most recent) two entries on the stack
tinually takes the left derivative of the given regular ex- are combined and a resultant pointer for the operation re-
pression with respect to the text to be searched. The places the two stack entries. The result of the binary
parallel nature of this algorithm makes it extremely fast. operator is then available as an operand in another opera-
tion. Similarly, a unary operator ("*") operates on the top
The Implementation
entry of the stack and creates an operand to replace that
The specific implementation of this algorithm is a com-
entry. When the entire regular expression is compiled,
piler that translates a regular expression into I B M 7094
there is just one entry in the stack, and that is a pointer to
code. The compiled code, along with certain runtime
routines, accepts the text to be searched as input and the code for the regular expression.
finds all substrings in the text that match the regular The compiled code invokes one of two functional rou-
expression. The compiling phase of the implemention does tines. The first is called N N O D E . N N O D E matches a
not detract from the overall speed since any search routine single character and will be represented by an oval con-
must translate the input regular expression into some taining the character that is recognized. The second func-
sort of machine accessible form. tional routine is called CNODE. C N O D E will split the
o b c
a-(b c}~-d
FIG. 5
FIG. 1
a-(bic)* eof :
code[pc] := instruction('tra', value('found'), 0, 0);
pc := p c + l
FIG. 4
end
The integer procedure get character returns the next
The final two characters d. compile and connect an character from the second stage of the compiler. The
CNODE AXC **,7 CLIST COUNT TRACMD TRA XCHG CONSTANT, NOT
CAL CLIST,7 EXECUTED
SLW CLIST+i,7 MOVE TRA XCHG
INSTRUCTION Initialization is required to set up the initial lists and
PCA ,4 start the first character.
ACL TSXCMD
SLW CLIST,7 INSERT NEW TSX **,2 INIT SCA NNODE,0
INSTRUCTION TRA XCHG
TXI *+1,7,-1
The routine F O U N D is transferred to for each successful
SCA CNODE,7 INCREMENT CLIST
COUNT match of the entire regular expression. There is a one
TRA 2,4 RETURN character delay between the end of a successful match
$ and the transfer to F O U N D . The null regular expression
TSXCMD TSX 1,2 CONSTANT, NOT is found on the first character while one character regular
EXECUTED
expressions are found on the second character. This means
The subroutine N N O D E is called after a successful that an extra (end of file) character must be put through
This character is right adjusted in the accumulator. code[pc] := instruction('tra', value('code')+pc+4, O, 0);
G E T C H A must also recognize the end of the text and code[pcA-1] := instruction('tsx', value('cnode'), 4, 0);
code[pc+2] := code[staek[lc--1]];
terminate the search. code[pc+3] := code[stack[lc-2]];
Notes code[stack[lc- 2]] := instruetion( 'tra', value('code') + p c + l, 0, 0);
eode[stack[lc--1]] := instruction ('tra', value( 'code') +pc+4, O, O)
Code compiled for a** will go into a loop due to the if lambda[lc-2] = 0 t h e n
closure operator on an operand containing the null regular begin if lambda[lc--1] ~ 0 t h e n
expression, k. ']?here are two ways out of this problem. T h e lambda[lc-- 2] = lambda[le--1]
end else
first is to not allow such an expression to get t h r o u g h the
if lambda[lc--1] ~ 0 t h e n
s y n t a x sieve. I n m o s t practical applications, this would code[lambda[lc--1]] :=
not be a serious restriction. T h e second w a y out is to instruction ( 'tra', value( 'code') + lambda[lc-- 2], 0, 0);
recognize l a m b d a separately in operands and r e m e m b e r pc := pc+4;
the C O D E location of the recognition of lambda. This lc := lc--1;
go to advance;
means t h a t a , is compiled as a search for h l a a , . I f the
eof:
closure operation is performed on an operand containing code[pc] := instruetion('tra', value('found'), O, 0);
lambda, the instruction T R A FAIL is overlaid on that pc := p c + l
portion of the operand that recognizes lambda. Thus a** end
is compiled as h]aa,(aa,),. The next note on the implementation is that the sizes
The array lambda is added to the third stage of the pre- of the two runtime lists can grow quite large. For example,
vious compiler. It contains zero if the corresponding the expression a . a . a * a * a . a * explodes when it encounters
operand does not contain X. I t contains the code location a few concurrent a's. This expression is equivalent to a*
of the recognition of X if the opcrand does contain h. (The and therefore should n o t generate so m a n y entries. Such
code location of the recognition of X can never be zero.) r e d u n d a n t searches can be easily t e r m i n a t e d b y h a v i n g
begin N N O D E ( C N O D E ) search N L I S T ( C L I S T ) for a m a t c h -
integer procedure get character; code;
ing e n t r y before it puts an e n t r y in the list. This n o w gives
integer procedure instruction(op, address, tag, decrement);
code ; a m a x i m u m size on the n u m b e r of entries t h a t can be in the
integer procedure value(symbol); code; lists. T h e m a x i m u m n u m b e r of entries t h a t can be in
integer procedureindex(character); code; C L I S T is the n u m b e r of T S X C N O D E , 4 and T S X
integer char, lc, pc; N N O D E , 4 instructions compiled. T h e m a x i m u m n u m -
integer array stack, lambda[0:lO], code[O:300];
ber of entries in N L I S T is just the n u m b e r of T S X
switch switch := alpha, juxta, closure, or, eof ;
lc := pc := 0; N N O D E , 4 instructions compiled. I n practice, these
advance : m a x i m a are never met.
char := get character; T h e execution is so fast, t h a t a n y other recognition and
go to switch[index(char)J; deleting of r e d u n d a n t searches, such as described b y K u n o
alpha :
and Oettinger [4], would p r o b a b l y waste time.
code[pc] := instruetion('tra', value('code')+pc+l, O, 0);
code[pc+l] := instruetion('txl', value('fail'), 1, --char--i); This compiling scheme is v e r y amenable to the extension
code[pc+2] := instruction('txh', value('fail'), 1, ~char); of the regular expressions recognized. Special characters
code[pc+3] := instruction('tsx', value('nnode'), 4, 0); can be introduced to m a t c h special situations or sequences.
stack[lc] := pc; E x a m p l e s include: beginning of line character, end of line
lambda[lc] :-~ 0;
character, a n y character, alphabetic character, a n y n u m -
pc := pc-4-4;
lc := lc+l; ber of spaces character, lambda, etc. I t is also easy to
g o t o advance; incorporate new operators in the regular expression rou-
juxta : tine. Examples include: not, exclusive or, intersection, etc.
if lambda[lc~l] = 0 thezi
lambda[lc--2] := 0; REFERENCES
le := lc--1; 1. BRZOZOWSKI, J. A. Derivatives of regular expressions. J.
g o t o advance; ACM 11, 4 (Oct. 1964), 481-494.
closure : 2. I~LEENE, S. C. Representation of events in nerve nets and
code[pc] := instruction('tsx', value('cnode'), 4, 0); finite automata. In Automata Studies, Ann. Math. Stud. No.
code[pc+l] := code[stack[lc-1]]; 34. Princeton U. Press, Princeton, N.J., 1956, pp. 3-41.
code[pc+2] := instruction('tra', value('code')Wpc+6, O, 0); 3. IBM Corp. IBM 7094 principles of operation. File No. 7094-01,
code[pc+3] := instruetion ( 'tsx', value ( 'cnode') , 4, 0); Form A22-6703-1.
code[pc +4 ] : = code[stack[lc- 1]] ; 4. KuNO, S., AND OETTINGER, A. G. Multiple-path syntactic
code[pc+5] := instruction('tra', value('code')-4-pc+6, O, 0); analyzer. Proc. I F I P Congress, Munich, 1962, North-Holland
code[stack[lc--1]] := instruction('tra', value('code')-~-pc-4-3, O, 0); Pub. Co., Amsterdam.