
UNIT 1

Overview of compiler environment
Pass and phase
Phases of compiler
Regular expression
Lexical analyzer
LEX tool
Bootstrapping

Compiler - Introduction

A compiler is a computer program that translates a program in a source language into an equivalent program in a target language.

A source program/code is a program/code written in the source language, which is usually a high-level language.

A target program/code is a program/code written in the target language, which is often a machine language or an intermediate code.

[Diagram: the source program is fed to the compiler, which produces a target program and error messages; the target program then maps input to output.]

A language-processing system

Skeletal source program
→ Preprocessor → source program
→ Compiler → target assembly program
→ Assembler → relocatable object code
→ Linker (together with libraries and other relocatable object files) → absolute machine code

Try for example:

gcc -v myprog.c

The Economy of Programming Languages

Why are there so many programming languages?
- Application domains have distinctive/conflicting needs.

Why are there new programming languages?
- Programmer training is the dominant cost.

What is a good programming language?
- There is no universally accepted metric.

Why Study Compilers?

- Build a large, ambitious software system.
- See theory come to life.
- Learn how to build programming languages.
- Learn how programming languages work.
- Learn tradeoffs in language design.

Building a compiler requires knowledge of:
- programming languages (parameter passing, variable scoping, memory allocation, etc.)
- theory (automata, context-free languages, etc.)
- algorithms and data structures (hash tables, graph algorithms, dynamic programming, etc.)
- computer architecture (assembly programming)
- software engineering.

Phases of a Compiler

Source program
→ Lexical analyzer → token stream
→ Syntax analyzer → syntax tree
→ Semantic analyzer → syntax tree
→ Intermediate code generator → intermediate representation
→ Code optimizer → intermediate representation
→ Code generator → target program

The symbol table and the error handler interact with all phases.

The Structure of a Compiler: The Analysis-Synthesis Model of Compilation

There are two parts to compilation:

Analysis
- Breaks up the source program into pieces and imposes a grammatical structure.
- Creates an intermediate representation of the source program.
- Determines the operations and records them in a tree structure, the syntax tree.
- Known as the front end of the compiler.

Synthesis
- Constructs the target program from the intermediate representation.
- Takes the tree structure and translates the operations into the target program.
- Known as the back end of the compiler.

Source code → Front end → Intermediate code → Back end → Target code

The Analysis Task For Compilation

Three phases:

Linear / Lexical Analysis:
- Left-to-right scan to identify tokens
- token: a sequence of characters having a collective meaning

Hierarchical Analysis:
- Grouping of tokens into meaningful collections

Semantic Analysis:
- Checking to ensure correctness of components

Phase 1. Lexical Analysis

First step: recognize words.
- Words are the smallest units above letters: This is a sentence.

Lexical analysis divides the program text into words, or tokens:

if (x == y) z = 1; else z = 2;

For the statement

position = initial + rate * 60 ;

the units position, =, initial, +, rate, *, 60 and ; are all tokens; blanks, line breaks, etc. are scanned out.

Once words are understood, the next step is to understand sentence structure.
- Parsing = diagramming sentences
- The diagram is a tree
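The scanning step just described can be sketched in code. Below is a minimal, illustrative tokenizer (the token-class names and the pattern set are assumptions made for this sketch, not from the notes): it classifies each lexeme and scans out whitespace.

```python
import re

# Illustrative token classes; names and patterns are assumptions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),   # keywords are reclassified below
    ("OP",     r"==|[=+*;(){}]"),
    ("SKIP",   r"[ \t\n]+"),       # blanks and line breaks are scanned out
]
KEYWORDS = {"if", "else"}

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                lexeme = m.group()
                if name == "ID" and lexeme in KEYWORDS:
                    name = "KEYWORD"
                if name != "SKIP":
                    tokens.append((name, lexeme))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f"no token matches at position {pos}")
    return tokens

print(tokenize("if (x == y) z = 1; else z = 2;"))
```

Patterns are tried in the listed order; a production scanner would instead apply maximal munch across all rules, as discussed later in these notes.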

Phase 2. Hierarchical Analysis

Parsing or Syntax Analysis

For the previous example, position = initial + rate * 60, we would have the parse tree:

assignment statement
  identifier: position
  =
  expression
    expression: identifier initial
    +
    expression
      expression: identifier rate
      *
      expression: number 60

Nodes of the tree are constructed using a grammar for the language.

Phase 3. Semantic Analysis

Find more complicated semantic errors and support code generation.

The parse tree is augmented with semantic actions. For the running example, the compressed tree

:=
  position
  +
    initial
    *
      rate
      60

becomes, after semantic analysis inserts a conversion,

:=
  position
  +
    initial
    *
      rate
      inttofloat(60)

Most important activity in this phase:

Type checking - legality of operands

Many different situations require a conversion action:

float = int + char ;
A[int] = A[float] + int ;
while (char != int)
etc.
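The inttofloat coercion can be sketched as a small semantic-analysis pass over an expression tree. The tuple-based node shapes and the type-lookup table below are assumptions made for this sketch.

```python
# Minimal semantic-analysis sketch: insert an inttofloat conversion when
# an int operand meets a float operand, mirroring rate * 60 above.

def analyze(node, symbol_types):
    """Return (typed_node, type). Leaves are ('id', name) or ('num', value)."""
    kind = node[0]
    if kind == "id":
        return node, symbol_types[node[1]]   # look up the declared type
    if kind == "num":
        return node, ("float" if isinstance(node[1], float) else "int")
    op, left, right = node                   # binary operator node
    left, lt = analyze(left, symbol_types)
    right, rt = analyze(right, symbol_types)
    if lt == "float" and rt == "int":
        right = ("inttofloat", right)        # coerce the int operand up
    elif lt == "int" and rt == "float":
        left = ("inttofloat", left)
        lt = "float"
    return (op, left, right), lt

types = {"position": "float", "initial": "float", "rate": "float"}
tree = ("+", ("id", "initial"), ("*", ("id", "rate"), ("num", 60)))
typed, t = analyze(tree, types)
print(typed, t)
```

A full type checker would also reject illegal operand combinations instead of only inserting conversions.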

Translation of a statement, phase by phase:

character stream:
position = initial + rate * 60

Lexical Analyzer produces the token stream:
<id,1> <=> <id,2> <+> <id,3> <*> <60>

Syntax Analyzer produces the syntax tree:
=
  <id,1>
  +
    <id,2>
    *
      <id,3>
      60

Semantic Analyzer inserts a conversion:
=
  <id,1>
  +
    <id,2>
    *
      <id,3>
      inttofloat(60)

Intermediate Code Generator:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Machine-Independent Code Optimizer:
t1 = id3 * 60.0
id1 = id2 + t1

Code Generator:
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1

SYMBOL TABLE:
1  position
2  initial
3  rate
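The machine-independent optimization step (folding inttofloat(60) to 60.0 and substituting temporaries) can be sketched on the three-address code. This is a deliberately simplified pass, not a real optimizer: the text-substitution approach is an assumption and would mis-handle temporary names that share a prefix.

```python
# Simplified constant folding + forward substitution on three-address code.

def optimize(quads):
    # quads: list of (dest, expr) pairs, e.g. ("t1", "inttofloat(60)")
    values = {}
    out = []
    for dest, expr in quads:
        # constant-fold inttofloat of an integer literal
        if expr.startswith("inttofloat(") and expr[11:-1].isdigit():
            expr = str(float(expr[11:-1]))
        # substitute previously seen temporaries into the expression
        # (naive string replace; fine here, unsafe for names like t1/t12)
        for temp, val in values.items():
            expr = expr.replace(temp, val)
        if dest.startswith("t"):
            values[dest] = expr   # defer emitting the temporary
        else:
            out.append((dest, expr))
    return out

code = [("t1", "inttofloat(60)"),
        ("t2", "id3 * t1"),
        ("t3", "id2 + t2"),
        ("id1", "t3")]
for dest, expr in optimize(code):
    print(dest, "=", expr)
```

This sketch substitutes every temporary, producing the single line id1 = id2 + id3 * 60.0; the notes' two-instruction form keeps one temporary because a real back end must respect register pressure.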

Phases and Passes

Pass:
- A pass is a physical scan over the source program.
- The portions of one or more phases are combined into a module called a pass.
- An intermediate file is required between two passes.
- Splitting into more passes reduces memory.
- A single-pass compiler is faster than a two-pass compiler.

Phase:
- A phase is a logically cohesive operation that takes input in one form and produces output in another form.
- No intermediate files are needed between phases.
- Splitting into more phases reduces the complexity of the program.
- Reducing the number of phases increases execution speed.

Lexical Analysis

For the source text

if (i == j)
    z = 0;
else
    z = 1;

the compiler sees the character stream:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Token Class (or Class)
- In English: noun, verb, adjective, ...
- In a programming language: identifiers, keywords, operators, numbers, ...

Token
- A classification for a common set of strings
- <Identifier>, <Number>, etc.

Pattern
- The rules which characterize the set of strings for a token
- File and OS wildcards: *.*, [A-Z]

Lexeme
- The actual sequence of characters that matches a pattern and is classified by a token
- Identifiers: x, count, ...

Token classes correspond to sets of strings.

Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: else or if or begin or ...
Whitespace: a non-empty sequence of blanks, newlines, and tabs

The lexical analyzer classifies program substrings according to role and communicates the tokens to the parser:

Lexical Analyzer → <Class, String> → Parser

An implementation must do two things:

1. Recognize the substrings corresponding to tokens (the lexemes).
2. Identify the token class of each lexeme.

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Find the number of tokens in the following code segments:

1. printf("Compiler Design");
2. DO I = 15.5;
3. int add(int x, int y)
   {
       return x + y;
   }
4. printf("i = %d, &i = %p", i, &i);
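A small longest-match scanner can check an answer such as exercise 3. The pattern set below is an assumption chosen to cover just this fragment; punctuation and operators each count as one token.

```python
import re

# Toy longest-match token counter; patterns are illustrative assumptions.
PATTERNS = [r"\d+", r"[A-Za-z_]\w*", r"[+{}();,]"]

def count_tokens(text):
    count, pos = 0, 0
    while pos < len(text):
        if text[pos].isspace():        # whitespace is scanned out
            pos += 1
            continue
        best = max((re.match(p, text[pos:]) for p in PATTERNS),
                   key=lambda m: len(m.group()) if m else -1)
        if best is None:
            raise SyntaxError(f"no pattern matches at {pos}")
        count += 1
        pos += len(best.group())
    return count

print(count_tokens("int add(int x, int y) { return x + y; }"))
```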

Complexity in Lexical Analysis

FORTRAN rule: whitespace is insignificant.
- DO 5 I = 1,25 begins a DO loop, but
- DO 5 I = 1.25 is an assignment (to the variable DO5I)
- VAR1 is the same as VA R1

PL/I keywords are not reserved:

IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN;

C++ template syntax: Foo<Bar>
C++ stream syntax: cin >> var;
(the >> closing a nested template looks like the stream operator)
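The DO-loop ambiguity can be made concrete: with blanks removed, only the comma versus the period after the = decides the statement kind, so a FORTRAN scanner must look ahead that far. A toy classifier (an illustration, not a real FORTRAN lexer):

```python
# Toy FORTRAN statement classifier; whitespace is insignificant, so
# "DO 5 I" may start a loop or be part of the variable name DO5I.

def classify_fortran_stmt(stmt):
    compact = stmt.replace(" ", "")      # blanks carry no meaning
    if compact.startswith("DO") and "=" in compact:
        tail = compact.split("=", 1)[1]
        if "," in tail:
            return "DO loop"             # DO 5 I = 1,25
    return "assignment"                  # DO5I = 1.25

print(classify_fortran_stmt("DO 5 I = 1,25"))
print(classify_fortran_stmt("DO 5 I = 1.25"))
```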

The goal of lexical analysis is to:
- Partition the input string into lexemes
- Identify the token of each lexeme

Left-to-right scan => lookahead is sometimes required.

Regular Languages

Lexical structure = token classes. We must say what set of strings is in a token class; for this we use regular languages, and regular expressions specify regular languages.

Five constructs:
- Two base cases: the empty string and 1-character strings
- Three compound expressions: union, concatenation, iteration

Def. The regular expressions over an alphabet Σ are the smallest set of expressions including

R = ε | c (where c ∈ Σ) | R + R | RR | R*

RE examples:

For Σ = {0,1}, find the strings represented by the following REs:

1. 1* = { ε, 1, 11, 111, ... } (all strings of 1's)
2. (1 + 0)1 = { 11, 01 }
3. 0* + 1* = all strings of only 0's or only 1's
4. (0 + 1)* = all strings over {0,1}
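The four examples above can be checked with Python's re module, writing the notes' + (union) as |; fullmatch tests set membership.

```python
import re

# The four regular expressions over {0,1}, in Python syntax.
one_star   = re.compile(r"1*")       # 1*      : epsilon, 1, 11, 111, ...
union_then = re.compile(r"(1|0)1")   # (1+0)1  : exactly 11 and 01
same_char  = re.compile(r"0*|1*")    # 0*+1*   : all 0's or all 1's
any_string = re.compile(r"(0|1)*")   # (0+1)*  : every string over {0,1}

assert one_star.fullmatch("111")
assert not one_star.fullmatch("101")
assert union_then.fullmatch("01") and union_then.fullmatch("11")
assert not union_then.fullmatch("00")
assert same_char.fullmatch("0000") and same_char.fullmatch("11")
assert not same_char.fullmatch("01")
assert any_string.fullmatch("011010") and any_string.fullmatch("")
print("all membership checks pass")
```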

Formal Languages

Def. Let Σ be a set of characters (an alphabet). A language over Σ is a set of strings of characters drawn from Σ.

Alphabet = English characters, Language = English sentences
Alphabet = ASCII, Language = C programs

Meaning function L maps syntax to semantics:

L(e) = M

Why use a meaning function?
- Makes clear what is syntax and what is semantics
- Allows us to consider notation as a separate issue
- Because expressions and meanings are not 1-1

Meaning is many to one, never one to many!

Lexical Specifications

Keyword: if or else or then or ...
Integer: a non-empty string of digits
Identifier: strings of letters or digits, starting with a letter
Whitespace: a non-empty sequence of blanks, newlines, and tabs

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
digits = digit+
opt_fraction = ('.' digits) + ε
opt_exponent = ('E' ('+' + '-' + ε) digits) + ε
num = digits opt_fraction opt_exponent

Shorthands:

At least one: A+ = AA*
Union: A | B = A + B
Option: A? = A + ε
Range: a + b + ... + z = [a-z]
Excluded range: complement of [a-z] = [^a-z]
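The num specification above translates directly to a Python regular expression, with each '+ ε' option becoming '?' (note the spec allows only an uppercase E):

```python
import re

# num = digits opt_fraction opt_exponent, as a Python regex.
digits = r"[0-9]+"
num = re.compile(rf"{digits}(\.{digits})?(E[+-]?{digits})?")

for s in ["60", "3.14", "6E2", "2.5E+3", "1E-9"]:
    assert num.fullmatch(s), s        # all in L(num)
for s in [".5", "3.", "E5", "2.5E"]:
    assert not num.fullmatch(s), s    # fraction/exponent need digits
print("num matches exactly the specified strings")
```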

Lexical Specification Process

1. Write a regular expression for the lexemes of each token class:
   Number = digit+
   Keyword = 'if' + 'else' + ...
   Identifier = letter (letter + digit)*
   OpenPar = '('

2. Construct R, matching all lexemes for all tokens:
   R = Keyword + Identifier + Number + ...
     = R1 + R2 + ...

3. Let the input be x1...xn. For 1 ≤ i ≤ n, check x1...xi ∈ L(R).

4. If successful, then we know that x1...xi ∈ L(Rj) for some j.

5. Remove x1...xi from the input and go to (3).

Resolving Ambiguities

How much input is used?
- Maximal munch: take the longest match

Which token is used?
- Choose the one listed first

What if no rule matches?
- Pass on to the error handler
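Both disambiguation rules can be sketched together: pick the strictly longest match, and on a tie keep the rule listed first, so the keyword if beats the identifier if. The rule set here is illustrative.

```python
import re

# Illustrative rule list; order matters for breaking ties.
RULES = [("KEYWORD", r"if|else"), ("ID", r"[A-Za-z]+"), ("OP", r"==|=")]

def next_token(text, pos):
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text[pos:])
        # strictly longer wins; ties keep the earlier-listed rule
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    if best is None:
        raise SyntaxError(f"no rule matches at {pos}")  # error handler's job
    return best

print(next_token("ifx == y", 0))  # maximal munch: the identifier ifx
print(next_token("if (x)", 0))    # tie: keyword, because listed first
print(next_token("== y", 0))      # longest match: one ==, not two ='s
```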

Lexical errors

Some errors are beyond the power of the lexical analyzer to recognize:

fi (a == f(x))

However, it may be able to recognize errors like:

d = 2r

Such errors are recognized when no pattern for tokens matches a character sequence.

Error recovery

- Panic mode: successive characters are ignored until we reach a well-formed token
- Delete one character from the remaining input
- Insert a missing character into the remaining input
- Replace a character by another character
- Transpose two adjacent characters

Input buffering

Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return. In C, for example, we need to look past -, = or < to decide what token to return. We introduce a two-buffer scheme to handle large lookaheads safely.

Two buffers of the same size, say 4096, are alternately reloaded. Two pointers into the input are maintained:

- Pointer lexeme_Begin marks the beginning of the current lexeme.
- Pointer forward scans ahead until a pattern match is found.

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}

Transition diagrams
Transition diagram for relop

Transition diagram for reserved words and identifiers

Transition diagram for unsigned numbers

Transition diagram for whitespace
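The relop transition diagram (for <, <=, <>, =, >, >=) can be rendered as straight-line code, with the diagram's "other"-labelled edges becoming the cases where the lookahead character is not consumed. The token and attribute names here are assumptions.

```python
# The relop transition diagram as code: advance on <, =, >, and retract
# (i.e. do not consume the lookahead) on an "other" edge.
def relop(text, pos):
    """Return (token, attribute, next_pos), or None if no relop starts here."""
    c = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if c == "<":
        if nxt == "=":
            return ("relop", "LE", pos + 2)
        if nxt == ">":
            return ("relop", "NE", pos + 2)
        return ("relop", "LT", pos + 1)   # "other": nxt is not consumed
    if c == "=":
        return ("relop", "EQ", pos + 1)
    if c == ">":
        if nxt == "=":
            return ("relop", "GE", pos + 2)
        return ("relop", "GT", pos + 1)   # "other": nxt is not consumed
    return None

print(relop("<= 10", 0))
print(relop("<> b", 0))
print(relop("> x", 0))
```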
