To understand the innards of lex/flex and use it effectively, you should be comfortable with regular
expressions and finite automata. Here are some exercises to help you practice.
1. Note which strings are in the language denoted by each regular expression.
(a) ((ab)|b)*c matches which of these strings? ababbc abab c babc aaabc
(b) ab*c*(a|b)c matches which of these strings? acac acbbc abbcac abc acc
(c) (a|b)a+(ba)* matches which of these strings? ba bba ababa aa baa
2. Write regular expressions for each description. The alphabet Σ is the binary digits {0, 1}.
3. Describe the language denoted by the following regular expressions. The alphabet Σ is
{x, y}.
(a) x(x|y)*y
(b) ((x|y)(x|y))+
(c) x*(yx+)*x*
(d) (x|y)*((xx)|(yy))*y*
One easy way to practice with regular expressions is to use the Unix utility grep to search
files for lines matching a given regular expression (see the man page for usage).
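If you would rather experiment from C than from the shell, the same kind of check can be sketched with the POSIX regex routines (regcomp, regexec, regfree from <regex.h>). This is a minimal sketch, not part of the original handout: the helper name full_match is made up for illustration, and it assumes POSIX extended syntax, which covers the operators used in these exercises (|, *, +, parentheses). The pattern is wrapped in ^( )$ so that the whole string must match, which is what "in the language" means in problem 1.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the entire string s is matched by the
 * extended regular expression `pattern`, 0 if it is not, and -1 if the
 * pattern cannot be compiled or memory runs out.
 */
int full_match(const char *pattern, const char *s)
{
    /* Room for "^(" + pattern + ")$" + terminating NUL. */
    size_t len = strlen(pattern) + 5;
    char *anchored = malloc(len);
    if (anchored == NULL)
        return -1;
    snprintf(anchored, len, "^(%s)$", pattern);

    regex_t re;
    int rc = regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB);
    free(anchored);
    if (rc != 0)
        return -1;

    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means "matched" */
    regfree(&re);
    return rc == 0;
}
```

Calling full_match("((ab)|b)*c", "ababbc") lets you check your own answer to problem 1(a) one string at a time.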
4. The DFA below accepts which of these strings? xy xyxxy yyyx xyyxyxyxxy
[Figure: DFA diagram not reproduced]
5. Construct the following automata.
(a) Construct a DFA for the language in problem 2a.
(b) Construct a DFA for the language in problem 2f.
Remember that lex matches the longest token it can. If a token matches more than one pattern,
the one listed first takes precedence. Also, assume that yytext is the global character buffer
storing the matched token (as a null-terminated C string).
Show the output printed from the scanner when reading this input:
aaaa
acabca
bababbc
7. Suppose you already have a working scanner for Decaf or some similar language. Now,
you want to add a simple pre-processor-like feature that allows large chunks of code to be
switched on and off with #if ... #endif blocks. So, for instance, here:
#if 0 _
... |region A
... _|
#endif
...
#if 1 _
... |region B
... _|
#endif
Answers
2. (a) (0|1)*01 (any number of binary digits can precede the ending 01)
(b) 1*01* (must be a zero somewhere, can be preceded or followed by 1s)
(c) (11)* (ones must be added in pairs)
(d) (0*10*10*)* (take answer to c and insert optional zeros in-between)
(e) (0|1)*01(0|1)* (must be 01 somewhere, anything can come before or after)
(f) 1*0* (0 can only be followed by another 0)
5. Note there can be many equivalent automata. We tried to simplify to a fairly tidy version.
(a)
(b)
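As a rough illustration for 5(b), here is one possible DFA for the language 1*0* from answer 2(f), written as a transition table in C. It is a sketch under stated assumptions, not necessarily the automaton the authors drew: the state names ONES, ZEROS and DEAD and the function name dfa_accepts are invented for this example. ONES is the start state (only 1s seen so far), ZEROS means at least one 0 has been seen, and DEAD is the trap state entered when a 1 follows a 0.

```c
/* States of a hypothetical DFA for 1*0*. */
enum { ONES, ZEROS, DEAD, NUM_STATES };

/* DELTA[state][symbol] is the next state; symbol is 0 or 1. */
static const int DELTA[NUM_STATES][2] = {
    /* ONES  */ { ZEROS, ONES },
    /* ZEROS */ { ZEROS, DEAD },
    /* DEAD  */ { DEAD,  DEAD },
};

/* ONES and ZEROS are accepting; DEAD is not. */
static const int ACCEPTING[NUM_STATES] = { 1, 1, 0 };

/* Returns 1 if the DFA accepts s, 0 otherwise. */
int dfa_accepts(const char *s)
{
    int state = ONES;
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return 0;               /* symbol outside the alphabet {0, 1} */
        state = DELTA[state][*s - '0'];
    }
    return ACCEPTING[state];
}
```

A table like this mirrors what lex generates internally: the scanner is a loop that indexes a transition table by the current state and the next input character.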
6. Since lex will try to match the longest lexeme it can, even if it manages to match a pattern,
it keeps pulling characters if it thinks a longer pattern might be matched. Only when it
realizes that a longer lexeme can’t be matched will it give up and officially match a pattern.
If the input doesn’t match any of the patterns, the default rule applies: it matches a single
character, and its action is to just ECHO that character to standard output.
Output:
1 aaaa
2 ac
3 abca
1 baba
5 bb
c
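The longest-match and first-rule-wins behavior described in this answer can be sketched in C as follows. This is only an illustration: the two rules below ("if" and [a-z]+) are a hypothetical rule set, not the patterns from problem 6, and next_token is an invented name. The sketch uses the POSIX regex routines and tries every rule at the current position, keeping the longest match and, on a tie, the rule listed first.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical rules in priority order; "^" anchors each at the current position. */
static const char *const RULES[] = { "^if", "^[a-z]+" };
enum { NUM_RULES = sizeof RULES / sizeof RULES[0] };

/*
 * Picks the rule for the lexeme starting at `input`.
 * Returns the 1-based index of the winning rule and stores the lexeme
 * length in *len. Returns 0 when no rule matches: *len is then 1 (the
 * default rule consumes one character) or 0 at end of input.
 * Returns -1 if a pattern fails to compile.
 */
int next_token(const char *input, size_t *len)
{
    int best_rule = 0;
    size_t best_len = 0;

    if (*input == '\0') {
        *len = 0;
        return 0;
    }
    for (int i = 0; i < NUM_RULES; i++) {
        regex_t re;
        regmatch_t m;
        if (regcomp(&re, RULES[i], REG_EXTENDED) != 0)
            return -1;
        if (regexec(&re, input, 1, &m, 0) == 0 && m.rm_so == 0) {
            size_t n = (size_t)m.rm_eo;
            /* Strictly longer wins, so an earlier rule keeps a tie. */
            if (n > best_len) {
                best_len = n;
                best_rule = i + 1;
            }
        }
        regfree(&re);
    }
    *len = best_rule ? best_len : 1;
    return best_rule;
}
```

With these rules, the input "if" is claimed by rule 1 because both rules match two characters and rule 1 is listed first, while "iffy" goes to rule 2 because it matches a longer lexeme. A real lex scanner gets the same result from a single pass over a combined DFA rather than by trying each pattern separately.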
7. Because other lexical features that use start states may be nested inside a #if/#endif
block, we need to use both exclusive and inclusive start states. We also need to use the state
stack.
...
%s INIF
%x IGNORE
%option stack
/* Definition Section */
...
BEGINIF (#if" "(0|1))
ENDIF (#endif)
%%
/* Rules Section */
...
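{BEGINIF}         { /* yytext is "#if 0" or "#if 1", so index 4 holds the digit */
                    if (yytext[4] == '0')
                        yy_push_state(IGNORE);
                    else
                        yy_push_state(INIF);
                  }
<IGNORE>{BEGINIF} { /* nested #if inside a skipped region: keep skipping */
                    yy_push_state(IGNORE); }
<IGNORE>.|\n      { /* ignore anything, including newlines, until we hit endif */ }
<IGNORE>{ENDIF}   { yy_pop_state(); }
<INIF>{ENDIF}     { yy_pop_state(); }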
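To see what the state stack is doing independently of flex, the same push/pop logic can be sketched as a plain C function. This is a simplified model, not the handout's solution: filter_if_blocks and MAX_DEPTH are invented names, and the sketch assumes that each directive starts at the beginning of its own line (the flex rules above match the directive wherever it appears). Pushing saves the current state when a #if is seen and popping restores it at the matching #endif.

```c
#include <stddef.h>
#include <string.h>

enum state { INITIAL, INIF, IGNORE };

#define MAX_DEPTH 32   /* arbitrary nesting limit for this sketch */

/*
 * Copies into `out` every line of `in` that is not inside a "#if 0" region.
 * Directive lines themselves are dropped. Returns 0 on success, or -1 on
 * unbalanced directives, nesting deeper than MAX_DEPTH, or a full buffer.
 */
int filter_if_blocks(const char *in, char *out, size_t out_size)
{
    enum state stack[MAX_DEPTH];
    int depth = 0;
    enum state cur = INITIAL;
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    while (*in != '\0') {
        const char *eol = strchr(in, '\n');
        /* Line length, including the newline if there is one. */
        size_t len = eol ? (size_t)(eol - in) + 1 : strlen(in);

        if (strncmp(in, "#if 0", 5) == 0 || strncmp(in, "#if 1", 5) == 0) {
            if (depth == MAX_DEPTH)
                return -1;
            stack[depth++] = cur;                 /* like yy_push_state() */
            cur = (cur == IGNORE || in[4] == '0') ? IGNORE : INIF;
        } else if (strncmp(in, "#endif", 6) == 0) {
            if (depth == 0)
                return -1;                        /* #endif without #if */
            cur = stack[--depth];                 /* like yy_pop_state() */
        } else if (cur != IGNORE) {
            if (used + len >= out_size)
                return -1;
            memcpy(out + used, in, len);
            used += len;
            out[used] = '\0';
        }
        in += len;
    }
    return depth == 0 ? 0 : -1;                   /* unclosed #if is an error */
}
```

Because the previous state is saved on every push, a #if 1 nested inside a #if 0 region stays ignored, and control returns to the correct enclosing state at each #endif.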