
By Esubalew Alemneh

Session 1

Lexical Analyzer

- The Role of the Lexical Analyzer
- What is a token?
- Token Attributes
- Lexical Errors
- Input Buffering
- Specification of Tokens
- Recognition of Tokens

Lexical Analyzer

Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer, or scanner. A lexer often exists as a single function which is called by a parser (syntax analyzer) or another function.

The tasks done by the scanner are:

- grouping input characters into tokens
- stripping out comments, white space, and other separators
- correlating error messages with the source program
- macro expansion
- passing tokens to the parser

[Diagram: the lexical analyzer reads characters from the source program; each time the parser requests the next token, the lexical analyzer passes back a token and its attribute value, and it enters identifiers into the symbol table.]

Sometimes lexical analyzers are divided into a cascade of two processes: scanning, which performs simple tasks such as deleting comments and white space, and lexical analysis proper, which produces the tokens.

Advantages of separating lexical analysis from parsing:

- Simpler design: separating the two analyses gives high cohesion / low coupling and improves maintainability.
- Compiler efficiency: liberty to apply specialized techniques that serve only lexical tasks, e.g. speeding up the reading of input characters using specialized buffering techniques.
- Compiler portability (e.g. Linux to Windows): input-device peculiarities are restricted to the lexical analyzer.
- Enables integration of third-party lexers (lexical-analysis tools).

What is a token?

Tokens correspond to sets of strings:

- Identifier: strings of letters or digits, starting with a letter
- Integer: a non-empty string of digits, e.g. 123 (but not 123.45, which contains a ".")
- Keyword: "else" or "if" or "begin" or ...
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
- Symbols: operator and punctuation characters

Comments and white space are stripped out rather than passed on as tokens.

What do we want to do?

Example:

    if (i == j)
        z = 0;
    else
        z = 1;

The input is just a string of characters:

    \t if (i == j) \n \t \t z = 0;\n \t else \n \t \t z = 1;

Goal: partition the input string into substrings, where the substrings are tokens.

Tokens: if, (, i, ==, j, ), z, =, 0, ;, else, 1

Lexeme

- The actual sequence of characters that matches a pattern and has a given token class.
- Examples: Name, Data, x, 345, 2, 0, 629, ...

Token

- A classification for a common set of strings.
- Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, ...
- One token for all identifiers.

Pattern

- The rule that characterizes the set of strings for a token.
- Examples: integer: [0-9]+ ; identifier: ([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*
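The two patterns above can be tried out directly with Python's re module (a toy sketch; the function name is illustrative, and the pattern strings simply transcribe the definitions):

```python
import re

# Patterns from the definitions above: integer and identifier.
INTEGER = re.compile(r"[0-9]+")
IDENTIFIER = re.compile(r"([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*")

def classify(lexeme):
    """Return the token class of a lexeme, or None if no pattern matches."""
    if IDENTIFIER.fullmatch(lexeme):
        return "Identifier"
    if INTEGER.fullmatch(lexeme):
        return "Integer"
    return None

print(classify("x345"))   # an identifier: starts with a letter
print(classify("629"))    # an integer: a non-empty string of digits
print(classify("9abc"))   # matches neither pattern
```

Note that the identifier pattern is tried first; a lexeme such as x345 matches it in full, while 629 falls through to the integer pattern.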

Token Attributes

More than one lexeme can match a pattern, so the lexical analyzer must provide information about the particular lexeme that matched. For example, the pattern for token number matches 0, 1, and 934 alike, but the code generator must know which lexeme was found in the source program. Thus the lexical analyzer returns to the parser a token name and an attribute value. For each lexeme, output of the following form is produced:

    (token-name, attribute-value)

For the statement E = M * C ** 2, the output is:

    <id, pointer to symbol table entry for E>
    <assign_op>
    <id, pointer to symbol table entry for M>
    <mult_op>
    <id, pointer to symbol table entry for C>
    <exp_op>
    <number, integer value 2>
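A sketch of how a scanner might emit such (token-name, attribute-value) pairs for this statement, in Python. The symbol table here is a plain dict and install_id is a hypothetical helper; the operator token names follow the list above:

```python
import re

symbol_table = {}  # name -> symbol-table "pointer" (here, just an index)

def install_id(name):
    """Return an index into the symbol table, installing the name if new."""
    if name not in symbol_table:
        symbol_table[name] = len(symbol_table)
    return symbol_table[name]

# One named group per token class; '**' must come before '*' so that
# the exponentiation operator is not scanned as two multiplications.
SPEC = [("id", r"[A-Za-z][A-Za-z0-9]*"), ("number", r"[0-9]+"),
        ("exp_op", r"\*\*"), ("mult_op", r"\*"), ("assign_op", r"=")]
SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def tokens(source):
    for m in SCANNER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "id":
            yield (kind, install_id(lexeme))   # attribute: symbol-table entry
        elif kind == "number":
            yield (kind, int(lexeme))          # attribute: integer value
        else:
            yield (kind, None)                 # operators need no attribute

print(list(tokens("E = M * C ** 2")))
```

The printed list mirrors the pair sequence above, with symbol-table indices standing in for pointers.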

Lexical Errors

The lexical analyzer cannot detect all errors; other components are needed. In what situations do errors occur? When none of the patterns for tokens matches any prefix of the remaining input. However, look at fi(a==f(x)): this generates no lexical error in C, since fi is a valid lexeme for the token id; subsequent phases of the compiler generate the errors.

Possible error-recovery actions:

- deleting or inserting input characters
- replacing or transposing characters
- or skipping ahead to the next separator to ignore the problem

Input Buffering

Compilation involves processing a large number of characters, and the scanner may have to look ahead several characters past a lexeme before it is sure of the token. How can the reading of the source program be sped up? Specialized buffering techniques have been developed:

- a two-buffer scheme that handles large lookaheads safely
- a sentinel improvement which saves the time of checking buffer ends

Buffer Pairs

- Buffer size N, N = size of a disk block (e.g. 4096 bytes).
- Read N characters into a buffer with one system call, i.e. not one call per character.
- If fewer than N characters are read, a special character eof marks the end of the source file.
- Two pointers into the input are maintained: lexemeBegin marks the beginning of the current lexeme, and forward scans ahead until a pattern match is found.
- Initially both pointers point to the first character of the next lexeme. The forward pointer scans; when a lexeme is found, forward is set to the character at its right end.
- After processing the lexeme, both pointers are set to the character immediately past the lexeme.

Sentinels

In the preceding scheme, for each character read we make two tests: one to check whether we have moved off the end of a buffer, and one to determine what character was read. We can combine the buffer-end test with the character test by placing a sentinel character, eof, at the end of each buffer. Note that eof retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end.
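A minimal sketch of the buffering idea in Python (the buffer size and sentinel value are illustrative; a real scanner does this in C with pointer arithmetic over two fixed buffers):

```python
import io

N = 16        # illustrative buffer size; a real scanner uses a disk-block size, e.g. 4096
EOF = "\0"    # sentinel character, assumed not to occur in the source text

def scan_chars(stream):
    """Yield source characters, with one read call per N-character refill.
    The sentinel planted at the buffer end lets the inner loop test a
    single condition per character instead of two."""
    while True:
        block = stream.read(N)
        if not block:            # nothing left: true end of input
            return
        buf = block + EOF        # plant the sentinel
        i = 0
        while True:
            c = buf[i]
            i += 1
            if c == EOF:         # single test: buffer exhausted, go refill
                break
            yield c

text = "\t if (i == j) \n z = 0;"
print("".join(scan_chars(io.StringIO(text))) == text)
```

The inner loop performs only the sentinel comparison; the expensive refill logic runs once per block rather than once per character.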

Specifying Tokens

Two issues in lexical analysis: how to specify tokens, and how to recognize the tokens given a token specification (i.e., how to implement the nexttoken() routine). Tokens are specified by regular expressions.

Regular Expressions

- Represent patterns of strings of characters.
- Although they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
- The set of strings generated by a regular expression r is denoted L(r).

Strings and operations:

- Alphabet: a finite set of symbols, e.g. {a, b, c}.
- Concatenation of strings x and y: xy; εx = xε = x.
- Exponentiation: s^0 = ε, s^i = s^(i-1)s for i > 0.
- Prefix vs. suffix; proper prefix vs. proper suffix; substring; subsequence.

Languages

A language is any countable set of strings over some fixed alphabet.

Example alphabets: {0, 1}; {a, b, c}; {A..Z}; {A..Z, a..z, 0..9, +, -, <, >, ...}

Example languages: {0, 10, 100, 1000, 10000, ...}; {0, 1, 100, 000, 111, ...}; {abc, aabbcc, aaabbbccc, ...}; {TEE, FORE, BALL}; {FOR, WHILE, GOTO}; {all legal PASCAL programs}; {all grammatically correct English sentences}

Special languages: ∅, the empty language, and {ε}, the language that contains the empty string only.

Operations on Languages

Let L = {A, B, C, D} and D = {1, 2, 3}. Then:

- L ∪ D = {A, B, C, D, 1, 2, 3}
- LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
- L^2 = LL = {AA, AB, AC, AD, BA, BB, BC, BD, CA, ..., DD}
- L^4 = L^2 L^2 = ??
- L* = {all possible concatenations of strings from L, plus ε}
- L+ = L* - {ε}
- L(L ∪ D) = ??
- L(L ∪ D)* = ??
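The finite cases above can be checked mechanically; a quick Python sketch (one-character strings stand for the symbols):

```python
from itertools import product

L = {"A", "B", "C", "D"}
D = {"1", "2", "3"}

def concat(X, Y):
    """Language concatenation: every string of X followed by every string of Y."""
    return {x + y for x, y in product(X, Y)}

union = L | D                  # L ∪ D
LD = concat(L, D)
L2 = concat(L, L)
L4 = concat(L2, L2)            # L^4 = L^2 L^2: all 4-letter strings over L

print(sorted(union))
print(len(LD), len(L2), len(L4))   # 12, 16, and 4^4 = 256 strings
```

This answers the "??" for L^4 in particular: it is the set of all 256 four-letter strings over L. (L* and L+ are infinite, so they cannot be enumerated this way.)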

Regular Expressions

Formal definition of regular expressions. Given an alphabet Σ:

(1) ε is a regular expression denoting {ε}, the set that contains the empty string.
(2) For each a ∈ Σ, a is a regular expression denoting {a}, the set containing the string a.
(3) If r and s are regular expressions, then:
    - r | s is a regular expression denoting L(r) ∪ L(s)
    - rs is a regular expression denoting L(r)L(s)
    - (r)* is a regular expression denoting (L(r))*

A regular expression is defined together with the language it denotes. Regular expressions are used to denote regular languages.

All strings over the English alphabet that start with tab or end with bat:

    tab(A|...|Z|a|...|z)* | (A|...|Z|a|...|z)*bat

All strings over the English alphabet in which 1, 2, 3 exist in ascending order:

Regular Definitions

A regular definition gives names to regular expressions in order to construct more complicated regular expressions. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

    d1 → r1
    d2 → r2
    d3 → r3
    ...

where:

- each di is a new symbol, not in Σ and not the same as any other of the d's, and
- each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.

Example

C identifiers are strings of letters, digits, and underscores. A regular definition for the language of C identifiers:

    letter_ → A | B | C | ... | Z | a | b | ... | z | _
    digit → 0 | 1 | 2 | ... | 9
    id → letter_ ( letter_ | digit )*

Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:

    digit → 0 | 1 | 2 | ... | 9
    digits → digit digit*
    optionalFraction → . digits | ε
    optionalExponent → ( E ( + | - | ε ) digits ) | ε
    number → digits optionalFraction optionalExponent
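The unsigned-number definition translates almost verbatim into one Python regular expression; the named pieces below mirror digits, optionalFraction, and optionalExponent:

```python
import re

digits = r"[0-9]+"                         # digits -> digit digit*
optional_fraction = rf"(\.{digits})?"      # . digits | epsilon
optional_exponent = rf"(E[+-]?{digits})?"  # ( E ( + | - | epsilon ) digits ) | epsilon
number = re.compile(digits + optional_fraction + optional_exponent)

for s in ["5280", "0.01234", "6.336E4", "1.89E-4", "1.89E-", "."]:
    print(s, bool(number.fullmatch(s)))
```

The last two inputs fail: an exponent must be followed by digits, and a number must begin with digits.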

Shorthand notations:

- +: one or more; r* = r+ | ε and r+ = rr*
- ?: zero or one; r? = r | ε
- [range]: a set or range of characters, e.g. [A-Z] = A|B|C|...|Z

Example: C variable name: [A-Za-z][A-Za-z0-9]*

Write regular definitions for:

Phone numbers of the form (510) 643-1481:

    number → ( area ) exchange - phone
    phone → digit^4
    exchange → digit^3
    area → digit^3
    digit → 0 | 1 | 2 | 3 | ... | 9

Email Addresses (Exercise)
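The phone-number definition, transcribed into a Python regular expression (\d is the usual shorthand for digit):

```python
import re

digit = r"\d"
phone = digit + "{4}"       # phone    -> digit^4
exchange = digit + "{3}"    # exchange -> digit^3
area = digit + "{3}"        # area     -> digit^3
# number -> ( area ) exchange - phone
number = re.compile(rf"\({area}\) {exchange}-{phone}")

print(bool(number.fullmatch("(510) 643-1481")))   # matches the pattern
print(bool(number.fullmatch("510-643-1481")))     # wrong layout: no parentheses
```

Note how the parentheses of the pattern must be escaped, since ( and ) are regular-expression operators.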

Regular Grammars

Regular grammars are used to represent regular languages. A regular grammar is either left- or right-linear:

- Right regular grammars: rules of the forms A → ε, A → a, A → aB, where A, B are variables and a is a terminal.
- Left regular grammars: rules of the forms A → ε, A → a, A → Ba, where A, B are variables and a is a terminal.

Example: S → aS | bA, A → cA | ε

This grammar produces the language generated by the regular expression a*bc*.
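A right-linear grammar can be run directly as a recognizer. Here is a small sketch for the example grammar S → aS | bA, A → cA | ε (the dict encoding and function name are illustrative):

```python
# Each right-linear rule is (terminal, next-variable); ("", None) encodes epsilon.
GRAMMAR = {
    "S": [("a", "S"), ("b", "A")],
    "A": [("c", "A"), ("", None)],   # A -> cA | epsilon
}

def derives(var, s):
    """True if variable `var` can derive exactly the string `s`."""
    for terminal, nxt in GRAMMAR[var]:
        if terminal == "" and nxt is None:   # epsilon rule
            if s == "":
                return True
        elif s.startswith(terminal) and nxt is not None:
            if derives(nxt, s[len(terminal):]):
                return True
    return False

for w in ["b", "aaabccc", "ba", ""]:
    print(w, derives("S", w))
```

As expected for a*bc*, strings with exactly one b (preceded only by a's and followed only by c's) are accepted, while "ba" and the empty string are rejected.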

Recognition of Tokens

Recognition of tokens is the second issue in lexical analysis: how do we recognize the tokens, given a token specification? Given the grammar of branching statements, the terminals of the grammar are if, then, else, relop, id, and number; the patterns for these tokens are given by regular definitions.

Note that the lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws (one or more blanks, tabs, or newlines) and not returning it to the parser.

The table shows which token name is returned to the parser and what attribute value is returned for each lexeme or family of lexemes.

Transition Diagrams

As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts called "transition diagrams". Example: in the transition diagram for relop, on the edges marked with *, we must retract the input one position (move back one character).

Mapping transition diagrams into C code:

    Token nexttoken() {
        while (1) {
            switch (state) {
            case 0:
                c = nextchar();
                if (c is white space) state = 0;   /* stay in state 0 */
                else if (c == '<') state = 1;
                else if (c == '=') state = 5;
                ...
                break;
            case 1:
                c = nextchar();
                if (c == '=')      { state = 2; return ...; }   /* relop is "<=" */
                else if (c == '>') { state = 3; return ...; }   /* relop is "<>" */
                else               { retract(1); return ...; }  /* relop is "<"  */
            ...
            case ...:   /* start of the identifier diagram */
                if (isletter(c)) state = 10; else state = fail();
                break;
            case 10: ...
            case 11:
                retract(1); insert(id);
                return ...;   /* an identifier */
            }
        }
    }

Recognition of Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem: keywords like if or then are reserved, but they look like identifiers. One solution is to install the reserved words in the symbol table initially. When an identifier is found, installID is called to place it in the symbol table if it is not already there.
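A sketch of the reserved-word trick in Python: seed the symbol table with the keywords before scanning starts, and let the install routine distinguish them (the table contents and function name here are illustrative):

```python
# Seed the symbol table with reserved words; each entry records its token name.
symbol_table = {"if": "IF", "then": "THEN", "else": "ELSE"}

def install_id(lexeme):
    """Return (token-name, lexeme). Reserved words were installed up front,
    so they are found before an ID entry could ever be created for them."""
    if lexeme in symbol_table and symbol_table[lexeme] != "ID":
        return (symbol_table[lexeme], lexeme)   # a reserved word
    symbol_table[lexeme] = "ID"                 # install the identifier if new
    return ("ID", lexeme)

print(install_id("if"))      # a reserved word, not an identifier
print(install_id("count"))   # an ordinary identifier
```

Because the lookup happens in one table, keywords and identifiers are scanned by the same diagram and separated only at installation time.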

Session 2

Finite Automata

- Deterministic Finite Automata
- From Regular Expressions to Automata
- Conversion of NFA to DFA
- State Minimization
- Design of a Lexical-Analyzer Generator
- Implementing Scanners

Finite Automata

At the heart of the transition diagram is the formalism known as finite automata. A finite automaton is a finite-state transition diagram that can be used to recognize the language denoted by a regular expression. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string. A finite automaton can be a:

- non-deterministic finite automaton (NFA), or a
- deterministic finite automaton (DFA).

An NFA can be diagrammatically represented by a labeled directed graph. Example:

- set of states = {A, B, C, D}
- input symbols = {0, 1}
- start state is A, accepting state is D

The same NFA is represented by the transition table:

    State | 0      | 1
    ------+--------+-------
    A     | {A, B} | {A, C}
    B     | {D}    | --
    C     | --     | {D}
    D     | {D}    | {D}

Acceptance of Input Strings by Automata

An NFA accepts an input string str if and only if there is some path in the finite-state transition diagram from the start state to some final state such that the edge labels along this path spell out str. The language recognized by an NFA is the set of strings it accepts.
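This acceptance test can be simulated by tracking the set of states the NFA could currently be in. A Python sketch using one consistent reading of the A-D transition table above:

```python
# Transition table of the NFA above: state -> {symbol -> set of next states}.
DELTA = {
    "A": {"0": {"A", "B"}, "1": {"A", "C"}},
    "B": {"0": {"D"}, "1": set()},
    "C": {"0": set(), "1": {"D"}},
    "D": {"0": {"D"}, "1": {"D"}},
}
START, ACCEPTING = "A", {"D"}

def accepts(w):
    """Simulate the NFA by keeping the set of currently possible states."""
    states = {START}
    for sym in w:
        # All states reachable from any current state on this symbol.
        states = set().union(*(DELTA[q][sym] for q in states)) if states else set()
    return bool(states & ACCEPTING)

for w in ["100", "0110", "0101"]:
    print(w, accepts(w))
```

With this table the machine reaches D exactly when the input contains two consecutive identical symbols, so "100" and "0110" are accepted while "0101" is not.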

A deterministic finite automaton (DFA) is a special case of an NFA where:

- there are no moves on input ε, and
- for each state q and input symbol a, there is exactly one edge out of q labeled a.

[Diagram: an example DFA; its language can also be written as the regular expression (b*aa*(aa*b)+).]

Implementing a DFA

Let us assume that the end of a string is marked with a special symbol, eos. The algorithm for recognition will be as follows (an efficient implementation):

    q = q0
    c = nextchar              { get the next character from the input string }
    while (c != eos) do       { do until the end of the string }
    begin
        q = move(q, c)        { transition function }
        c = nextchar
    end
    if (q in F) then          { if q is an accepting state }
        return yes
    else
        return no
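The same algorithm in runnable Python. The example DFA here is hypothetical (it is not the one from the slides): it accepts binary strings with an even number of 0's:

```python
def run_dfa(delta, q0, F, w):
    """Simulate a DFA: q = q0; q = move(q, c) for each character; accept if q in F."""
    q = q0
    for c in w:
        q = delta[(q, c)]   # transition function move(q, c)
    return q in F           # accepting state?

# Hypothetical example DFA: states {even, odd} count 0's modulo 2.
delta = {("even", "0"): "odd", ("even", "1"): "even",
         ("odd", "0"): "even", ("odd", "1"): "odd"}

print(run_dfa(delta, "even", {"even"}, "1001"))   # two 0's: accepted
print(run_dfa(delta, "even", {"even"}, "10"))     # one 0: rejected
```

Note the loop does exactly one table lookup per character, which is why a DFA is the efficient machine to simulate.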

Implementing an NFA

    Q = ε-closure({q0})            { set of all states accessible from q0
                                     by ε-transitions }
    c = nextchar
    while (c != eos) do
    begin
        Q = ε-closure(move(Q, c))  { set of all states accessible from a state
                                     in Q by a transition on c }
        c = nextchar
    end
    if (Q ∩ F != ∅) then
        return yes
    else
        return no

Here:

- ε-closure(Q): the set of states reachable from some state q in Q on ε-transitions alone.
- move(Q, c): the set of states to which there is a transition on input symbol c from some state q in Q.

Exercise: given an NFA, find ε-closure(s) for a state s, i.e. the states reached through ε-transitions only.
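ε-closure itself is just a small graph search over the ε-edges. A Python sketch (the ε-edge table at the bottom is a hypothetical example, not the NFA from the exercise):

```python
def eps_closure(states, eps):
    """All states reachable from `states` using epsilon-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):      # follow each epsilon-edge out of q
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

# Hypothetical epsilon-edges: 0 -> 1, 0 -> 2, 2 -> 3.
eps = {0: {1, 2}, 2: {3}}
print(eps_closure({0}, eps))   # follows 0->1, 0->2, then 2->3
print(eps_closure({3}, eps))   # 3 has no outgoing epsilon-edges
```

The worklist makes the search terminate even if the ε-edges form cycles.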

From Regular Expressions to Automata

For each kind of RE there is a corresponding NFA, i.e., any regular expression can be converted to an NFA that defines the same language. The algorithm is syntax-directed, in the sense that it works recursively up the parse tree of the regular expression; for each sub-expression the algorithm constructs an NFA with a single accepting state.

Algorithm: the McNaughton-Yamada-Thompson algorithm to convert a regular expression to an NFA.

INPUT: A regular expression r over alphabet Σ.
OUTPUT: An NFA N accepting L(r).
METHOD: Begin by parsing r into its constituent sub-expressions. The rules for constructing an NFA consist of basis rules for handling sub-expressions with no operators, and inductive rules for constructing larger NFA's from the NFA's for the immediate sub-expressions of a given expression.

The construction cases (diagrams omitted):

- for the expression ε, construct an NFA with an ε-edge from a new start state to a new accepting state;
- for any symbol a in Σ, construct an NFA with an a-edge from a new start state to a new accepting state;
- an NFA for the union of two regular expressions (e.g. a|b);
- an NFA for the concatenation r1r2;
- an NFA for r* (e.g. (a|b)*).

Example: constructing an NFA for the regular expression r = (a|b)*abb:

- Step 1: construct NFAs for a and b
- Step 2: construct a | b
- Step 3: construct (a|b)*
- Step 4: concatenate it with a, then b, then b
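The four steps above can be sketched in Python. Each fragment has one start and one accepting state, combined exactly as in Thompson's construction (class and function names are illustrative):

```python
from itertools import count

class NFA:
    """Thompson-style fragment: one start state, one accepting state.
    trans maps (state, symbol) -> set of states; symbol None is an epsilon-edge."""
    def __init__(self, start, accept, trans):
        self.start, self.accept, self.trans = start, accept, trans

fresh = count()  # global supply of fresh state numbers

def add(trans, q, sym, r):
    trans.setdefault((q, sym), set()).add(r)

def symbol(a):                       # basis: NFA for a single symbol
    s, f = next(fresh), next(fresh)
    t = {}; add(t, s, a, f)
    return NFA(s, f, t)

def union(n1, n2):                   # NFA for r1 | r2
    s, f = next(fresh), next(fresh)
    t = {**n1.trans, **n2.trans}
    for n in (n1, n2):
        add(t, s, None, n.start); add(t, n.accept, None, f)
    return NFA(s, f, t)

def concat(n1, n2):                  # NFA for r1 r2
    t = {**n1.trans, **n2.trans}
    add(t, n1.accept, None, n2.start)
    return NFA(n1.start, n2.accept, t)

def star(n):                         # NFA for r*
    s, f = next(fresh), next(fresh)
    t = dict(n.trans)
    add(t, s, None, n.start); add(t, s, None, f)
    add(t, n.accept, None, n.start); add(t, n.accept, None, f)
    return NFA(s, f, t)

def accepts(nfa, w):
    def closure(states):             # epsilon-closure of a state set
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in nfa.trans.get((q, None), ()):
                if r not in seen:
                    seen.add(r); stack.append(r)
        return seen
    states = closure({nfa.start})
    for c in w:
        states = closure(set().union(*(nfa.trans.get((q, c), set()) for q in states)))
    return nfa.accept in states

# r = (a|b)*abb, built exactly as in steps 1-4 above.
r = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                         symbol("a")), symbol("b")), symbol("b"))
print([w for w in ["abb", "aabb", "babb", "ab", "abba"] if accepts(r, w)])
```

Only the strings ending in abb survive the filter, as the regular expression requires.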

Conversion of NFA to DFA

Why? A DFA is difficult to construct directly from REs, and an NFA is difficult to represent in a computer program and inefficient to simulate. So: RE → NFA → DFA.

Conversion algorithm: subset construction. The idea is that each DFA state corresponds to a set of NFA states: after reading input a1a2...an, the DFA is in the state that represents the subset Q of the states of the NFA that are reachable from the start state along paths labeled a1a2...an.

INPUT: An NFA N.
OUTPUT: A DFA D accepting the same language as N.
ALGORITHM:

    Dstates := { ε-closure(q0) }   // initially all states in Dstates are unmarked
    while (there is an unmarked state Q in Dstates) do
    begin
        mark Q;
        for (each input symbol a) do
        begin
            U := ε-closure( move(Q, a) );
            if (U is not in Dstates) then
                add U as an unmarked state to Dstates;
            Dtran[Q, a] := U;
        end
    end
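The algorithm above, in Python. Frozensets of NFA states serve as DFA state names; for a concrete input we use the small ε-free NFA for (a|b)*abb (states 0-3) that appears in the example below:

```python
def eps_closure(states, trans):
    """States reachable from `states` via epsilon-edges (labeled None)."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in trans.get((q, None), ()):
            if r not in seen:
                seen.add(r); stack.append(r)
    return frozenset(seen)

def subset_construction(trans, q0, alphabet):
    """Return (start state, Dstates, Dtran) of the equivalent DFA."""
    start = eps_closure({q0}, trans)
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        Q = unmarked.pop()                    # mark Q
        for a in alphabet:
            moved = set().union(*(trans.get((q, a), set()) for q in Q))
            U = eps_closure(moved, trans)
            if U not in dstates:              # a new DFA state
                dstates.add(U); unmarked.append(U)
            dtran[(Q, a)] = U
    return start, dstates, dtran

# NFA for (a|b)*abb: state 0 loops on a and b; 0 -a-> 1 -b-> 2 -b-> 3.
trans = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
start, dstates, dtran = subset_construction(trans, 0, "ab")
print(sorted(sorted(s) for s in dstates))
```

The four DFA states {0}, {0,1}, {0,2}, {0,3} that come out match the states A, B, C, D computed by hand in the example below.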

Example: NFA to DFA. For the NFA built from (a|b)*abb, the start DFA state is A = ε-closure(0) = {0, 1, 2, 4, 7}, since these are exactly the states reachable from state 0 via a path all of whose edges have label ε. Note that a path can have zero edges, so state 0 itself is reachable from state 0 by such a path. The input alphabet is {a, b}. Thus, our first step is to mark A and compute Dtran[A, a] = ε-closure(move(A, a)) and Dtran[A, b] = ε-closure(move(A, b)). Among the states in A, only 2 and 7 have transitions on a, to 3 and 8 respectively, so we let Dtran[A, a] = B. To compute Dtran[A, b]: among the states in A, only 4 has a transition on b, to 5, giving C = ε-closure({5}). If we continue this process with the unmarked sets B and C, we eventually reach a point where all the states of the DFA are marked.

Another example, this time without ε-transitions: the NFA for (a|b)*abb with states 0-3, where state 0 loops on both a and b, and 0 →a 1 →b 2 →b 3 (start state 0, accepting state 3). The subset construction computes:

    ε-closure(0) = {0}
    move(0, a) = {0, 1}        move(0, b) = {0}
    move({0,1}, a) = {0, 1}    move({0,1}, b) = {0, 2}
    move({0,2}, a) = {0, 1}    move({0,2}, b) = {0, 3}

New DFA states: A = {0}, B = {0, 1}, C = {0, 2}, D = {0, 3}.

[Diagram: the resulting DFA over {a, b} with states A, B, C, D and accepting state D.]

Due to ε-transitions, we must in general compute ε-closure(S); for instance, in one small ε-NFA, ε-closure(0) = {0, 3}.

[Diagram: an ε-NFA with states 1-5 and start state 1, whose start DFA state is A = ε-closure of the start state = {1, 2}.]

    U := ε-closure(move({1,2}, a)) = {3,4,5}; add {3,4,5} to Dstates
    U := ε-closure(move({1,2}, b)) = { }
    ε-closure(move({3,4,5}, a)) = {5}
    ε-closure(move({3,4,5}, b)) = {4,5}
    ε-closure(move({4,5}, a)) = {5}
    ε-closure(move({4,5}, b)) = {5}

The resulting DFA states are A = {1,2}, B = {3,4,5}, C = {4,5}, D = {5}, with the transition table below (dashes mark missing moves):

    State       | a  | b
    ------------+----+----
    A = {1,2}   | B  | --
    B = {3,4,5} | D  | C
    C = {4,5}   | D  | D
    D = {5}     | -- | --

Exercise: convert the ε-NFA shown below into a DFA, using subset construction.

Putting the pieces together:

    Regular Expression R
        → [RE → NFA conversion]
        → [NFA → DFA conversion]
        → [DFA simulation] on an input string w
        → "Yes" if w ∈ L(R), "No" if w ∉ L(R)

State Minimization

If we implement a lexical analyzer as a DFA, we would generally prefer a DFA with as few states as possible, since each state requires entries in the table that describes the lexical analyzer. There is always a unique minimum-state DFA for any regular language. Moreover, this minimum-state DFA can be constructed from any DFA for the same language by grouping sets of equivalent states.

An algorithm for minimizing the number of states of a DFA:

INPUT: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting states F.
OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.

Example: start with the initial partition of the non-accepting states {A, B, C, D} and the accepting states {E}.

In the first round, the group {A, B, C, D} can be split into {A, B, C}{D}, since on input b, D goes to a member of the accepting group {E} while A, B, C do not; Πnew for this round is {A, B, C}{D}{E}.

In the next round, split {A, B, C} into {A, C}{B}, since on input b, B goes to a member of another group, {D}, while A and C do not. Thus, after the second round, Πnew = {A, C}{B}{D}{E}.

For the third round, we cannot split the one remaining group with more than one state, since A and C each go to the same state (and therefore to the same group) on each input. So Πfinal = {A, C}{B}{D}{E}: the minimum-state version of the given DFA has four states.

[Diagram: the minimized four-state DFA over {a, b}, with states {A,C}, B, D and accepting state E.]

Minimize states of the following DFA
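The refinement rounds above can be mechanized. The sketch below runs partition refinement on the five-state DFA for (a|b)*abb; its transition table is supplied here as in the standard textbook version of this example, since the figure is not reproduced:

```python
def minimize(states, alphabet, delta, accepting):
    """Split groups until no group contains states that move to different groups."""
    partition = [set(accepting), set(states) - set(accepting)]
    while True:
        def group_of(q):
            return next(i for i, g in enumerate(partition) if q in g)
        new_partition = []
        for g in partition:
            # States behave alike iff they reach the same group on every symbol.
            buckets = {}
            for q in g:
                key = tuple(group_of(delta[(q, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(q)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):   # nothing split: done
            return new_partition
        partition = new_partition

# DFA for (a|b)*abb, states A-E, accepting state E (standard example table).
delta = {("A", "a"): "B", ("A", "b"): "C", ("B", "a"): "B", ("B", "b"): "D",
         ("C", "a"): "B", ("C", "b"): "C", ("D", "a"): "B", ("D", "b"): "E",
         ("E", "a"): "B", ("E", "b"): "C"}
groups = minimize("ABCDE", "ab", delta, {"E"})
print(sorted(sorted(g) for g in groups))
```

The result is the four-group partition {A, C}{B}{D}{E} derived by hand above.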

The lexical analyzer has to recognize the longest possible string. For example, while scanning the identifier newval, the scanner sees the prefixes n, ne, new, newv, newva, newval, and it must return the whole lexeme newval rather than the prefix new.

What is the end of a token? Is there any character which marks the end of a token? If the number of characters in a token is fixed, there is no problem. Otherwise: the end of an identifier, for instance, can be marked by any character that cannot occur inside an identifier. We may need a lookahead. In Prolog, compare p :- X is 1. with p :- X is 1.5. A dot followed by a white-space character can mark the end of a number; but if that is not the case, the dot must be treated as part of the number.
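The longest-match rule ("maximal munch") can be sketched as: at each position, try every pattern and keep the longest match (the token names and patterns here are illustrative):

```python
import re

# Hypothetical token patterns; at each position the longest match wins.
PATTERNS = [("ID", r"[A-Za-z][A-Za-z0-9]*"), ("NUM", r"[0-9]+"),
            ("WS", r"[ \t\n]+"), ("PUNCT", r"[=;]")]

def tokenize(src):
    pos, out = 0, []
    while pos < len(src):
        # Match every pattern at the current position; keep the longest.
        best = max(((name, re.match(pat, src[pos:])) for name, pat in PATTERNS),
                   key=lambda nm: nm[1].end() if nm[1] else -1)
        name, m = best
        if not m:
            raise SyntaxError(f"no token matches at position {pos}")
        if name != "WS":                 # whitespace is stripped, not returned
            out.append((name, m.group()))
        pos += m.end()
    return out

print(tokenize("newval = 12;"))
```

Because the longest match is chosen, newval comes out as one identifier token rather than the shorter prefix new.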

Implementing Scanners

We have learnt the formalism for lexical analysis: regular expressions (REs).

Solution 1: implement using a tool such as Lex (for C) or Flex (for C/C++). The programmer specifies the interesting tokens using REs, and the tool generates the source code from the given REs; the gain is writing REs instead of the code itself.

Solution 2: write the code from scratch. The result is essentially the code that such a tool generates: table-driven code based on a finite-state automaton, in which the final states of the DFA are associated with actions.

The details about Lex will be discussed in the laboratory sessions.

Assignment 1 (10%)

1. Implement a DFA and an NFA. The program should accept states, alphabet, transition function, start state, and set of final states. Then the program should check whether a string provided by a user is accepted or rejected. You can use either Java or C++ for the implementation.

2. Convert the following NFA to a DFA.

[Diagram: the NFA for question 2, together with the answer for question 2.]

Reading Assignment: constructing a DFA directly from a regular expression.
