ABSTRACT
Lex helps write programs whose control flow is directed by instances of regular expressions in the input
stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a
parsing routine.
Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a
program which reads an input stream, copying it to an output stream and partitioning the input into strings
which match the given expressions. As each such string is recognized the corresponding program fragment is
executed. The recognition of the expressions is performed by a deterministic finite automaton generated by
Lex. The program fragments written by the user are executed in the order in which the corresponding regular
expressions occur in the input stream.
The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match
possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream
will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.
Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable
Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however,
will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under
UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this
compiler-compiler system.
Table of Contents
1. Introduction.
2. Lex Source.
3. Lex Regular Expressions.
4. Lex Actions.
5. Ambiguous Source Rules.
6. Lex Source Definitions.
7. Usage.
8. Lex and Yacc.
9. Examples.
10. Left Context Sensitivity.
11. Character Set.
12. Summary of Source Format.
13. Caveats and Bugs.
14. Acknowledgments.
15. References.

1. Introduction.

Lex is a program generator designed for lexical processing of character
input streams. It accepts a high-level, problem oriented specification
for character string matching, and produces a program in a general
purpose language which recognizes regular expressions. The regular
expressions are specified by the user in the source specifications given
to Lex. The Lex written code recognizes these expressions in an input
stream and partitions the input stream into strings matching the
expressions. At the boundaries between strings program sections provided
by the user are executed. The Lex source file associates the regular
expressions and the program fragments. As each expression appears in the
input to the program written by Lex, the corresponding fragment is
executed.

The user supplies the additional code beyond expression matching needed
to complete his tasks, possibly including code written by other
generators. The program that recognizes the expressions is generated in
the general purpose programming language employed for the user’s program
fragments. Thus, a high level expression language is provided to write
the string expressions to be matched while the user’s freedom to write
actions is unimpaired. This avoids forcing the user who wishes to use a
string manipulation language for input analysis to write processing
programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a
new language feature which can be added to different programming
languages, called ‘‘host languages.’’ Just as general purpose languages
can produce code to run on different computer hardware, Lex can write
code in different host languages. The host language is used for the
output code generated by Lex and also for the program fragments added by
the user. Compatible run-time libraries for the different host languages
are also provided. This makes Lex adaptable to different environments
and different users. Each application may be directed to the combination
of hardware and host language appropriate to the task, the user’s
background, and the properties of local implementations. At present, the
only supported host language is C, although Fortran (in the form of
Ratfor [2]) has been available in the past. Lex itself exists on UNIX,
GCOS, and OS/370; but the code generated by Lex may be taken anywhere
the appropriate compilers exist.

Lex turns the user’s expressions and actions (called source in this
memo) into the host general-purpose language; the generated program is
named yylex. The yylex program will recognize expressions in a stream
(called input in this memo) and perform the specified actions for each
expression as it is detected. See Figure 1.

     Source -> Lex -> yylex

     Input -> yylex -> Output

     Figure 1. An overview of Lex

line, it should be enclosed in braces. As a slightly more useful
example, suppose it is desired to change a number of words from British
to American spelling. Lex rules such as

     colour      printf("color");
     mechanise   printf("mechanize");
     petrol      printf("gas");

would be a start. These rules are not quite enough, since the word
petroleum would become gaseum; a way of dealing with this will be
described later.

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED
[5]. A regular expression specifies a set of strings to be matched. It
contains text characters (which match the corresponding characters in
the strings being compared) and operator characters (which specify
repetitions, choices, and other features). The letters of the alphabet
and the digits are always text characters; thus the regular expression

     integer

matches the string integer wherever it appears and the expression

     a57D

looks for the string a57D.

Operators. The operator characters are

     " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used.
The quotation mark operator (") indicates that whatever is contained
between a pair of quotes is to be taken as text characters. Thus

     xyz"++"

matches the string xyz++ when it appears. Note that a part of a string
may be quoted. It is harmless but unnecessary to quote an ordinary text
character; the expression

     "xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric
character being used as a text character, the user can avoid remembering
the list above of current operator characters, and is safe should
further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by
preceding it with \ as in

     xyz\+\+

which is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into an
expression; normally, as explained above, blanks or tabs end a rule. Any
blank character not contained within [ ] (see below) must be quoted.
Several normal C escapes with \ are recognized: \n is newline, \t is
tab, and \b is backspace. To enter \ itself, use \\. Since newline is
illegal in an expression, \n must be used; it is not required to escape
tab and backspace. Every character but blank, tab, newline and the list
above is always a text character.

Character classes. Classes of characters can be specified using the
operator pair [ ]. The construction [abc] matches a single character,
which may be a, b, or c. Within square brackets, most operator meanings
are ignored. Only three characters are special: these are \ - and ^. The
- character indicates ranges. For example,

     [a-z0-9<>_]

indicates the character class containing all the lower case letters, the
digits, the angle brackets, and underline. Ranges may be given in either
order. Using - between any pair of characters which are not both upper
case letters, both lower case letters, or both digits is implementation
dependent and will get a warning message. (E.g., [0-z] in ASCII is many
more characters than it is in EBCDIC). If it is desired to include the
character - in a character class, it should be first or last; thus

     [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character
after the left bracket; it indicates that the resulting string is to be
complemented with respect to the computer character set. Thus

     [^abc]

matches all characters except a, b, or c, including all special or
control characters; or

     [^a-zA-Z]

is any character which is not a letter. The \ character provides the
usual escapes within character class brackets.

Arbitrary character. To match almost any character, the operator
character

     .

is the class of all characters except newline. Escaping into octal is
possible although non-portable:

     [\40-\176]

matches all printable characters in the ASCII character set, from octal
40 (blank) to octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of an
expression. Thus

     ab?c

matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the
operators * and +.

     a*

is any number of consecutive a characters, including zero; while

     a+

is one or more instances of a. For example,

     [a-z]+

is all strings of lower case letters. And

     [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic character.
This is a typical expression for recognizing identifiers in computer
languages.

Alternation and Grouping. The operator | indicates alternation:

     (ab|cd)

matches either ab or cd. Note that parentheses are used for grouping,
although they are not necessary on the outside level;

this input. Normally, the next input string would overwrite the current
entry in yytext. Second, yyless(n) may be called to indicate that not
all the characters matched by the currently successful expression are
wanted right now. The argument n indicates the number of characters in
yytext to be retained. Further characters previously matched are
returned to the input. This provides the same sort of lookahead offered
by the / operator, but in a different form.

Example: Consider a language which defines a string as a set of
characters between quotation (") marks, and provides that to include a "
in a string it must be preceded by a \. The regular expression which
matches that is somewhat confusing, so that it might be preferable to
write

     \"[^"]*   {
               if (yytext[yyleng-1] == '\\')
                    yymore();
               else
                    ... normal user processing
               }

which will, when faced with a string such as "abc\"def" first match the
five characters "abc\ ; then the call to yymore() will cause the next
part of the string, "def, to be tacked on the end. Note that the final
quote terminating the string should be picked up in the code labeled
‘‘normal processing’’.

The function yyless() might be used to reprocess text in various
circumstances. Consider the C problem of distinguishing the ambiguity of
‘‘=-a’’. Suppose it is desired to treat this as ‘‘=- a’’ but print a
message. A rule might be

     =-[a-zA-Z]   {
                  printf("Operator (=-) ambiguous\n");
                  yyless(yyleng-1);
                  ... action for =- ...
                  }

which prints a message, returns the letter after the operator to the
input stream, and treats the operator as ‘‘=-’’. Alternatively it might
be desired to treat this as ‘‘= -a’’. To do this, just return the minus
sign as well as the letter to the input:

     =-[a-zA-Z]   {
                  printf("Operator (=-) ambiguous\n");
                  yyless(yyleng-2);
                  ... action for = ...
                  }

will perform the other interpretation. Note that the expressions for the
two cases might more easily be written

     =-/[A-Za-z]

in the first case and

     =/-[A-Za-z]

in the second; no backup would be required in the rule action. It is not
necessary to recognize the whole identifier to observe the ambiguity.
The possibility of ‘‘=-3’’, however, makes

     =-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also permits access to the I/O
routines it uses. They are:

1) input() which returns the next input character;
2) output(c) which writes the character c on the output; and
3) unput(c) pushes the character c back onto the input stream to be read
   later by input().

By default these routines are provided as macro definitions, but the
user can override them and supply private versions. These routines
define the relationship between external files and internal characters,
and must all be retained or modified consistently. They may be
redefined, to cause input or output to be transmitted to or from strange
places, including other programs or internal memory; but the character
set used must be consistent in all routines; a value of zero returned by
input must mean end of file; and the relationship between unput and
input must be retained or the Lex lookahead will not work. Lex does not
look ahead at all if it does not have to, but every rule ending in + * ?
or $ or containing / implies lookahead. Lookahead is also necessary to
match an expression that is a prefix of another expression. See below
for a discussion of the character set used by Lex. The standard Lex
library imposes a 100 character limit on backup.

Another Lex library routine that the user will sometimes want to
redefine is yywrap() which is called whenever Lex reaches an
end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup
on end of input. Sometimes, however, it is convenient to arrange for
more input to arrive from a new source. In this case, the user should
provide a yywrap which arranges for new input and returns 0. This
instructs Lex to continue processing. The default yywrap always returns
1.

This routine is also a convenient place to print tables, summaries, etc.
at the end of a program. Note that it is not possible to write a normal
rule which recognizes end-of-file; the only access to this condition is
through yywrap. In fact, unless a private version of input() is supplied
a file containing nulls cannot be handled, since a value of 0 returned
by input is taken to be end-of-file.

5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression
can match the current input, Lex chooses as follows:

1) The longest match is preferred.
2) Among rules which matched the same number of characters, the rule
   given first is preferred.

Thus, suppose the rules

     integer   keyword action ...;
     [a-z]+    identifier action ...;

to be given in that order. If the input is integers, it is taken as an
identifier, because [a-z]+ matches 8 characters while integer matches
only 7. If the input is integer, both rules match 7 characters, and the
keyword rule is selected because it was given first. Anything shorter
(e.g. int) will not match the expression integer and so the identifier
interpretation is used.

The principle of preferring the longest match makes rules containing
expressions like .* dangerous. For example,

     '.*'

might seem a good way of recognizing a string in single quotes. But it
is an invitation for the program to read far ahead, looking for a
distant single quote. Presented with the input

     'first' quoted string here, 'second' here

the above expression will match

     'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

     '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of
errors like this are mitigated by the fact that the . operator will not
match newline. Thus expressions like .* stop on the current line. Don’t
try to defeat this with expressions like [.\n]+ or equivalents; the Lex
generated program will try to read the entire input file, causing
internal buffer overflows.

Note that Lex is normally partitioning the input stream, not searching
for all possible matches of each expression. This means that each
character is accounted for once and only once. For example, suppose it
is desired to count occurrences of both she and he in an input text.
Some Lex rules to do this might be

     she     s++;
     he      h++;
     \n      |
     .       ;

where the last two rules ignore everything besides he and she. Remember
that . does not include newline. Since she includes he, Lex will
normally not recognize the instances of he included in she, since once
it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT
means ‘‘go do the next alternative.’’ It causes whatever rule was second
choice after the current rule to be executed. The position of the input
pointer is adjusted accordingly. Suppose the user really wants to count
the included instances of he:

     she     {s++; REJECT;}
     he      {h++; REJECT;}
     \n      |
     .       ;

these rules are one way of changing the previous example to do just
that. After counting each expression, it is rejected; whenever
appropriate, the other expression will then be counted. In this example,
of course, the user could note that she includes he but not vice versa,
and omit the REJECT action on he; in other cases, however, it would not
be possible a priori to tell which input characters were in both
classes.

Consider the two rules

     a[bc]+   { ... ; REJECT;}
     a[cd]+   { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the
second matches. The input string accb matches the first rule for four
characters and then the second rule for three characters. In contrast,
the input accd agrees with the second rule for four characters and then
the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to
partition the input stream but to detect all examples of some items in
the input, and the instances of these items may overlap or include each
other. Suppose a digram table of the input is desired; normally the
digrams overlap, that is the word the is considered to contain both th
and he. Assuming a two-dimensional array named digram to be incremented,
the appropriate source is

     %%
     [a-z][a-z]   {digram[yytext[0]][yytext[1]]++; REJECT;}
     \n           ;

where the REJECT is necessary to pick up a letter pair beginning at
every character, rather than at every other character.

6. Lex Source Definitions.

Remember the format of the Lex source:

     {definitions}
     %%
     {rules}
     %%
     {user routines}

So far only the rules have been described. The user needs additional
options, though, to define variables for use in his program and for use
by Lex. These can go either in the definitions section or in the rules
section.

Remember that Lex is turning the rules into a program. Any source not
intercepted by Lex is copied into the generated program. There are three
classes of such things.

1) Any line which is not part of a Lex rule or action which begins with
a blank or tab is copied into the Lex generated program. Such source
input prior to the first %% delimiter will be external to any function
in the code; if it appears immediately after the first %%, it appears in
an appropriate place for declarations in the function written by Lex
which contains the actions. This material must look like program
fragments, and should precede the first Lex rule.

As a side effect of the above, lines which begin with a blank or tab,
and which contain a comment, are passed through to the generated
program. This can be used to include comments in either the Lex source
or the generated code. The comments should follow the host language
convention.

2) Anything included between lines containing only %{ and %} is copied
out as above. The delimiters are discarded. This format permits entering
text like preprocessor statements that must begin in column 1, or
copying lines that do not look like programs.

3) Anything after the second %% delimiter, regardless of formats, etc.,
is copied out after the Lex output.

Definitions intended for Lex are given before the first %% delimiter.
Any line in this section not contained

relevant left context appeared some time earlier, such as at the
beginning of a line.

This section describes three means of dealing with different
environments: a simple use of flags, when only a few rules change from
one environment to another, the use of start conditions on rules, and
the possibility of making multiple lexical analyzers all run together.
In each case, there are rules which recognize the need to change the
environment in which the following input text is analyzed, and set some
parameter to reflect the change. This may be a flag explicitly tested by
the user’s action code; such a flag is the simplest way of dealing with
the problem, since Lex is not involved at all. It may be more
convenient, however, to have Lex remember the flags as initial
conditions on the rules. Any rule may be associated with a start
condition. It will only be recognized when Lex is in that start
condition. The current start condition may be changed at any time.
Finally, if the sets of rules for the different environments are very
dissimilar, clarity may be best achieved by writing several distinct
lexical analyzers, and switching from one to another as desired.

Consider the following problem: copy the input to the output, changing
the word magic to first on every line which began with the letter a,
changing magic to second on every line which began with the letter b,
and changing magic to third on every line which began with the letter c.
All other words and all other lines are left unchanged.

These rules are so simple that the easiest way to do this job is with a
flag:

             int flag;
     %%
     ^a      {flag = 'a'; ECHO;}
     ^b      {flag = 'b'; ECHO;}
     ^c      {flag = 'c'; ECHO;}
     \n      {flag = 0; ECHO;}
     magic   {
             switch (flag)
             {
             case 'a': printf("first"); break;
             case 'b': printf("second"); break;
             case 'c': printf("third"); break;
             default: ECHO; break;
             }
             }

should be adequate.

To handle the same problem with start conditions, each start condition
must be introduced to Lex in the definitions section with a line reading

     %Start name1 name2 ...

where the conditions may be named in any order. The word Start may be
abbreviated to s or S. The conditions may be referenced at the head of a
rule with the <> brackets:

     <name1>expression

is a rule which is only recognized when Lex is in the start condition
name1. To enter a start condition, execute the action statement

     BEGIN name1;

which changes the start condition to name1. To resume the normal state,

     BEGIN 0;

resets the initial condition of the Lex automaton interpreter. A rule
may be active in several start conditions:

     <name1,name2,name3>

is a legal prefix. Any rule not beginning with the <> prefix operator is
always active.

The same example as before can be written:

     %START AA BB CC
     %%
     ^a          {ECHO; BEGIN AA;}
     ^b          {ECHO; BEGIN BB;}
     ^c          {ECHO; BEGIN CC;}
     \n          {ECHO; BEGIN 0;}
     <AA>magic   printf("first");
     <BB>magic   printf("second");
     <CC>magic   printf("third");

where the logic is exactly the same as in the previous method of
handling the problem, but Lex does the work rather than the user’s code.

11. Character Set.

The programs generated by Lex handle character I/O only through the
routines input, output, and unput. Thus the character representation
provided in these routines is accepted by Lex and employed to return
values in yytext. For internal use a character is represented as a small
integer which, if the standard library is used, has a value equal to the
integer value of the bit pattern representing the character on the host
computer. Normally, the letter a is represented as the same form as the
character constant 'a'. If this interpretation is changed, by providing
I/O routines which translate the characters, Lex must be told about it,
by giving a translation table. This table must be in the definitions
section, and must be bracketed by lines containing only ‘‘%T’’. The
table contains lines of the form

     {integer} {character string}

which indicate the value associated with each character. Thus the next
example

     %T
      1      Aa
      2      Bb
     ...
     26      Zz
     27      \n
     28      +
     29      -
     30      0
     31      1
     ...
     39      9
     %T

     Sample character table.

maps the lower and upper case letters together into the integers 1
through 26, newline into 27, + and - into 28
and 29, and the digits into 30 through 39. Note the escape for newline.
If a table is supplied, every character that is to appear either in the
rules or in any valid input must be included in the table. No character
may be assigned the number 0, and no character may be assigned a bigger
number than the size of the hardware character set.

12. Summary of Source Format.

The general form of a Lex source file is:

     {definitions}
     %%
     {rules}
     %%
     {user subroutines}

The definitions section contains a combination of

1) Definitions, in the form ‘‘name space translation’’.
2) Included code, in the form ‘‘space code’’.
3) Included code, in the form

     %{
     code
     %}

4) Start conditions, given in the form

     %S name1 name2 ...

5) Character set tables, in the form

     %T
     number space character-string
     ...
     %T

6) Changes to internal array sizes, in the form

     %x nnn

where nnn is a decimal integer representing an array size and x selects
the parameter as follows:

     Letter   Parameter
     p        positions
     n        states
     e        tree nodes
     a        transitions
     k        packed character classes
     o        output array size

Lines in the rules section have the form ‘‘expression action’’ where the
action may be continued on succeeding lines by using braces to delimit
it.

Regular expressions in Lex use the following operators:

     x        the character "x"
     "x"      an "x", even if x is an operator.
     \x       an "x", even if x is an operator.
     [xy]     the character x or y.
     [x-z]    the characters x, y or z.
     [^x]     any character but x.
     .        any character but newline.
     ^x       an x at the beginning of a line.
     <y>x     an x when Lex is in start condition y.
     x$       an x at the end of a line.
     x?       an optional x.
     x*       0,1,2, ... instances of x.
     x+       1,2,3, ... instances of x.
     x|y      an x or a y.
     (x)      an x.
     x/y      an x but only if followed by y.
     {xx}     the translation of xx from the definitions section.
     x{m,n}   m through n occurrences of x

13. Caveats and Bugs.

There are pathological expressions which produce exponential growth of
the tables when converted to deterministic machines; fortunately, they
are rare.

REJECT does not rescan the input; instead it remembers the results of
the previous scan. This means that if a rule with trailing context is
found, and REJECT executed, the user must not have used unput to change
the characters forthcoming from the input stream. This is the only
restriction on the user’s ability to manipulate the not-yet-processed
input.

14. Acknowledgments.

As should be obvious from the above, the outside of Lex is patterned on
Yacc and the inside on Aho’s string matching routines. Therefore, both
S. C. Johnson and A. V. Aho are really originators of much of Lex, as
well as debuggers of it. Many thanks are due to both.

The code of the current version of Lex was designed, written, and
debugged by Eric Schmidt.

15. References.

1. B. W. Kernighan and D. M. Ritchie, The C Programming Language,
   Prentice-Hall, N. J. (1978).
2. B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran,
   Software - Practice and Experience, 5, pp. 395-406 (1975).
3. S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing Science
   Technical Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ
   07974.
4. A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to
   Bibliographic Search, Comm. ACM 18, 333-340 (1975).
5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text Editor,
   Computing Science Technical Report No. 5, 1972, Bell Laboratories,
   Murray Hill, NJ 07974.
6. D. M. Ritchie, private communication. See also M. E. Lesk, The
   Portable C Library, Computing Science Technical Report No. 31, Bell
   Laboratories, Murray Hill, NJ 07974.