
CS143 Handout 6

Summer 2007 June 28


Section 1 Handout

Regular Expressions and DFAs

To understand the innards of lex/flex and use it effectively, you should be comfortable with regular
expressions and finite automata. Here are some exercises to help you practice.

1. Note which strings are in the language denoted by each regular expression.

(a) ((ab)|b)*c matches which of these strings? ababbc abab c babc aaabc
(b) ab*c*(a|b)c matches which of these strings? acac acbbc abbcac abc acc
(c) (a|b)a+(ba)* matches which of these strings? ba bba ababa aa baa

2. Write regular expressions for each description. The alphabet Σ is the binary digits {0, 1}.

(a) All strings which end in 01
(b) All strings which contain exactly one 0
(c) All strings which contain an even number of 1s and no 0s
(d) All strings which contain an even number of 1s and any number of 0s
(e) All strings which contain the substring 01
(f) All strings which do not contain the substring 01

3. Describe the language denoted by the following regular expressions. The alphabet Σ is
{x, y}.

(a) x(x|y)*y
(b) ((x|y)(x|y))+
(c) x*(yx+)*x*
(d) (x|y)*((xx)|(yy))*y*

One easy way to practice with regular expressions is to use the unix utility grep to search
files for lines matching a given regular expression (see man page for usage).
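
As a rough illustration of what grep does with a pattern, here is a minimal Python sketch. The helper name grep_lines and the sample lines are made up for this example. The point to notice is that grep reports a line when the pattern matches anywhere in it, so testing whether a whole string is in the language requires anchoring the pattern (with grep, the -x option, together with -E for the | and + operators).

```python
import re

def grep_lines(pattern, lines, whole_line=False):
    """Return the lines that match pattern, roughly like grep -E (or grep -E -x)."""
    regex = re.compile(pattern)
    # search() finds the pattern anywhere in the line; fullmatch() requires
    # the entire line to be in the language.
    test = regex.fullmatch if whole_line else regex.search
    return [line for line in lines if test(line)]

sample = ["xy", "yxyx", "xxy", "y"]           # made-up test lines
print(grep_lines(r"x(x|y)*y", sample))         # lines containing a match
print(grep_lines(r"x(x|y)*y", sample, True))   # lines that match in full
```

The second call is the one that answers "is this string in the language?"; the first also accepts yxyx because it contains the substring xy.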

4. The DFA below accepts which of these strings? xy xyxxy yyyx xyyxyxyxxy


5. Construct the following automata.
(a) Construct a DFA for the language in problem 2a.
(b) Construct a DFA for the language in problem 2f.

Practice with (f)lex

Remember that lex matches the longest token it can. If a token matches more than one pattern,
the one listed first takes precedence. Also, assume that yytext is the global character buffer
storing the matched token (as a null-terminated C string).

6. Given this lex specification:

(a|b)+a        { printf("1 %s\n", yytext); }
ab*c+          { printf("2 %s\n", yytext); }
(ab)+(a|c)*    { printf("3 %s\n", yytext); }
(aa)+          { printf("4 %s\n", yytext); }
a?b*(ac)?      { printf("5 %s\n", yytext); }
[ \t\n]        { /* discard whitespace */ }

Show the output printed from the scanner when reading this input:

aaaa
acabca
bababbc

7. Suppose you already have a working scanner for Decaf or some similar language. Now,
you want to add a simple pre-processor-like feature that allows large chunks of code to be
switched on and off with #if ... #endif blocks. So, for instance, here:

#if 0      _
...         |  region A
...        _|
#endif
...
#if 1      _
...         |  region B
...        _|
#endif

Everything in “region A” would be ignored completely, and everything in “region B” would
be processed as if the enclosing #if / #endif pair weren’t there. Only 0 or 1 is allowed
as the value in the #if directive.
Show what would need to be added to the corresponding scanner specification file in order
to implement this.

Answers

1. (a) ababbc, c, and babc match; abab and aaabc do not
(b) acac, abbcac, and abc match; acbbc and acc do not
(c) ba, aa, and baa match; bba and ababa do not
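
The memberships for problem 1 can be checked mechanically. The following is a minimal sketch using Python's re module, where re.fullmatch tests whether the entire string is in the language; the helper name members is made up for this example.

```python
import re

# Problem 1: each regular expression with its five candidate strings.
problems = {
    "a": (r"((ab)|b)*c", ["ababbc", "abab", "c", "babc", "aaabc"]),
    "b": (r"ab*c*(a|b)c", ["acac", "acbbc", "abbcac", "abc", "acc"]),
    "c": (r"(a|b)a+(ba)*", ["ba", "bba", "ababa", "aa", "baa"]),
}

def members(pattern, candidates):
    """Return the candidates that belong to the language of pattern."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for label, (pattern, candidates) in problems.items():
    print(label, pattern, members(pattern, candidates))
```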

2. (Note: Many different regular expressions are possible.)

(a) (0|1)*01 (any number of binary digits can precede the ending 01)
(b) 1*01* (must be a zero somewhere, can be preceded or followed by 1s)
(c) (11)* (ones must be added in pairs)
(d) 0*(10*10*)* (take answer to c and insert optional zeros in-between; the leading 0* also admits strings with no 1s at all, such as 0)
(e) (0|1)*01(0|1)* (must be 01 somewhere, anything can come before or after)
(f) 1*0* (0 can only be followed by another 0)
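
Since many answers are possible, it helps to test a candidate expression against the plain-English description. The sketch below does this by brute force over every binary string up to a small length; the helper names and the length bound of 8 are arbitrary choices for illustration, and agreement on short strings is evidence rather than a proof.

```python
import re
from itertools import product

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def first_counterexample(pattern, predicate, max_len=8):
    """Return the shortest string where regex and predicate disagree, or None."""
    regex = re.compile(pattern)
    for s in binary_strings(max_len):
        if bool(regex.fullmatch(s)) != predicate(s):
            return s
    return None

checks = {
    "a": (r"(0|1)*01", lambda s: s.endswith("01")),
    "b": (r"1*01*", lambda s: s.count("0") == 1),
    "c": (r"(11)*", lambda s: "0" not in s and len(s) % 2 == 0),
    "d": (r"0*(10*10*)*", lambda s: s.count("1") % 2 == 0),
    "e": (r"(0|1)*01(0|1)*", lambda s: "01" in s),
    "f": (r"1*0*", lambda s: "01" not in s),
}

for label, (pattern, predicate) in checks.items():
    print(label, pattern, first_counterexample(pattern, predicate))

# A near miss for (d): without a leading 0*, a string of only zeros is rejected.
print(first_counterexample(r"(0*10*10*)*", lambda s: s.count("1") % 2 == 0))
```

The last line shows why such a check is worth running: an expression that looks right for (d) can still miss the strings that contain no 1s.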

3. (a) string must start with x and end in y
(b) string must be of even length ≥ 2
(c) every y is followed by at least one x (can’t contain substring yy and can’t end with y)
(d) any string (i.e. this expression matches Σ∗ )

4. xy xyxxy yyyx xyyxyxyxxy

5. Note there can be many equivalent automata. We tried to simplify to a fairly tidy version.

(a)

(b)
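
One possible pair of DFAs for problem 5 can also be written down as transition tables and simulated. This is a sketch only: the state names are made up, and these automata are not necessarily the ones drawn in the original diagrams, since many equivalent automata exist.

```python
def run_dfa(transitions, start, accepting, text):
    """Simulate a DFA: follow one transition per input symbol, then test acceptance."""
    state = start
    for symbol in text:
        state = transitions[state][symbol]
    return state in accepting

# 5(a): strings ending in 01. Each state records how much of "01" was just seen.
ENDS_IN_01 = {
    "start": {"0": "saw0", "1": "start"},
    "saw0":  {"0": "saw0", "1": "saw01"},
    "saw01": {"0": "saw0", "1": "start"},
}

# 5(b): strings with no substring 01, i.e. 1*0*. "dead" is a trap state.
NO_01 = {
    "ones":  {"0": "zeros", "1": "ones"},
    "zeros": {"0": "zeros", "1": "dead"},
    "dead":  {"0": "dead", "1": "dead"},
}

for s in ["", "01", "1101", "010", "110", "0011"]:
    print(repr(s),
          run_dfa(ENDS_IN_01, "start", {"saw01"}, s),
          run_dfa(NO_01, "ones", {"ones", "zeros"}, s))
```

Every state has a transition on both 0 and 1, which is what makes these deterministic and complete; the trap state in 5(b) is what a diagram would usually show as a non-accepting state with self-loops.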

6. Since lex will try to match the longest lexeme it can, even if it manages to match a pattern,
it keeps pulling characters if it thinks a longer pattern might be matched. Only when it
realizes that a longer lexeme can’t be matched will it give up and officially match a pattern.
If something doesn’t match any of the patterns, it matches the default pattern, where the
default action is to just ECHO the lexeme to standard out.
Output:

1 aaaa
2 ac
3 abca
1 baba
5 bb
c
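
The trace for problem 6 can be reproduced with a small simulation of the two rules lex applies: take the longest match, and on a tie prefer the pattern listed first. This is a brute-force sketch in Python, not how lex works internally (lex compiles all patterns into a single automaton); the function names are made up for illustration.

```python
import re

RULES = [                    # (label, pattern), in the order listed in problem 6
    ("1", r"(a|b)+a"),
    ("2", r"ab*c+"),
    ("3", r"(ab)+(a|c)*"),
    ("4", r"(aa)+"),
    ("5", r"a?b*(ac)?"),
    (None, r"[ \t\n]"),      # whitespace: matched but discarded
]
COMPILED = [(label, re.compile(pattern)) for label, pattern in RULES]

def longest_prefix(regex, text, pos):
    """Length of the longest non-empty match starting exactly at pos (0 if none)."""
    for end in range(len(text), pos, -1):
        if regex.fullmatch(text, pos, end):
            return end - pos
    return 0

def scan(text):
    """Return everything the scanner would print for the given input."""
    out = []
    pos = 0
    while pos < len(text):
        best_len, best_label = 0, None
        for label, regex in COMPILED:
            n = longest_prefix(regex, text, pos)
            if n > best_len:         # strict '>' keeps the earliest rule on ties
                best_len, best_label = n, label
        if best_len == 0:
            out.append(text[pos])    # default rule: echo one character
            pos += 1
            continue
        if best_label is not None:
            out.append(f"{best_label} {text[pos:pos + best_len]}\n")
        pos += best_len
    return "".join(out)

print(scan("aaaa\nacabca\nbababbc\n"), end="")
```

Trying every prefix length is needed here because Python's backtracking matcher does not by itself return the longest possible match; for example, pattern 5 applied to ac would otherwise stop after the a.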

7. Because other lexical features that use start states may be nested inside a #if/#endif
block, we need to use both exclusive and inclusive start states. We also need to use the state
stack.

...
%s INIF
%x IGNORE
%option stack

/* Definition Section */
...
BEGINIF   (#if" "(0|1))
ENDIF     (#endif)
%%

    /* Rules Section */
...
{BEGINIF}         { /* yytext[4] is the 0 or 1 that follows "#if " */
                    if (yytext[4] == '0')
                        yy_push_state(IGNORE);
                    else
                        yy_push_state(INIF);
                  }
<IGNORE>.|\n      { /* ignore anything, newlines included, until we hit endif */ }
<IGNORE>{ENDIF}   { yy_pop_state(); }
<INIF>{ENDIF}     { yy_pop_state(); }
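
The effect of the state stack can be modeled outside of flex. The sketch below is a simplified, line-based Python model (the real scanner works character by character, and the function name filter_regions is made up); it keeps one stack entry per open #if and, as in the specification above, recognizes only #endif while in the exclusive IGNORE state.

```python
def filter_regions(lines):
    """Model of the #if/#endif handling: a stack of start states, one per open #if."""
    stack = ["INITIAL"]
    kept = []
    for line in lines:
        text = line.strip()
        state = stack[-1]
        if state == "IGNORE":
            # Exclusive state: only #endif is recognized, everything else is dropped.
            if text == "#endif":
                stack.pop()
        elif text in ("#if 0", "#if 1"):
            # Same test as the scanner action: character 4 is the 0 or 1.
            stack.append("IGNORE" if text[4] == "0" else "INIF")
        elif text == "#endif" and state == "INIF":
            stack.pop()
        else:
            kept.append(line)        # ordinary input, handled by the other rules
    return kept

source = ["a", "#if 0", "b", "#endif", "c",
          "#if 1", "d", "#if 0", "e", "#endif", "f", "#endif", "g"]
print(filter_regions(source))
```

Popping returns to whatever state was active before the #if, which is why a disabled block nested inside an enabled one resumes in INIF rather than in the initial state.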
