
UNIT 1

Overview of compiler environment
Pass and phase
Phases of compiler
Regular expression
Lexical analyzer
LEX tool
Bootstrapping

Compiler - Introduction

A compiler is a computer program that translates a program in a source language into an equivalent program in a target language.

A source program/code is a program/code written in the source language, which is usually a high-level language.

A target program/code is a program/code written in the target language, which is often a machine language or an intermediate code.

[Diagram: the source program is fed to the compiler, which produces a target program and error messages; the target program then maps input to output.]

A language-processing system

Skeletal source program
→ Preprocessor → source program
→ Compiler → target assembly program
→ Assembler → relocatable object code
→ Linker (together with libraries and other relocatable object files) → absolute machine code

Try for example:

gcc -v myprog.c

The Economy of Programming Languages

Why are there so many programming languages?
- Application domains have distinctive/conflicting needs.

Why are there new programming languages?
- Programmer training is the dominant cost.

What is a good programming language?
- There is no universally accepted metric.

Why Study Compilers?

- Build a large, ambitious software system.
- See theory come to life.
- Learn how to build programming languages.
- Learn how programming languages work.
- Learn tradeoffs in language design.

Building a compiler requires knowledge of:
- programming languages (parameter passing, variable scoping, memory allocation, etc.)
- theory (automata, context-free languages, etc.)
- algorithms and data structures (hash tables, graph algorithms, dynamic programming, etc.)
- computer architecture (assembly programming)
- software engineering.

Phases of a Compiler

Source program
→ Lexical analyzer → token stream
→ Syntax analyzer → syntax tree
→ Semantic analyzer → syntax tree
→ Intermediate code generator → intermediate representation
→ Code optimizer → intermediate representation
→ Code generator → target program

The symbol table and the error handler interact with all phases.

The Structure of a Compiler: The Analysis-Synthesis Model of Compilation

There are two parts to compilation:

Analysis
- Breaks up the source program into pieces and imposes a grammatical structure.
- Creates an intermediate representation of the source program.
- Determines the operations and records them in a tree structure, the syntax tree.
- Known as the front end of the compiler.

Synthesis
- Constructs the target program from the intermediate representation.
- Takes the tree structure and translates the operations into the target program.
- Known as the back end of the compiler.

Source code → Front end → Intermediate code → Back end → Target code

The Analysis Task For Compilation

Three phases:

Linear / Lexical Analysis:
- Left-to-right scan to identify tokens
- token: a sequence of characters having a collective meaning

Hierarchical Analysis:
- Grouping of tokens into meaningful collections

Semantic Analysis:
- Checking to ensure correctness of components

Phase 1. Lexical Analysis

First step: recognize words.
- Words are the smallest units above letters: This is a sentence.

Lexical analysis divides the program text into words, or tokens:

if (x == y) z = 1; else z = 2;

For the statement

position = initial + rate * 60 ;

the units position, =, initial, +, rate, *, 60 and ; are all tokens; blanks, line breaks, etc. are scanned out.

Once words are understood, the next step is to understand sentence structure.
- Parsing = diagramming sentences
- The diagram is a tree
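The scanning step just described can be sketched in code. Below is a minimal, illustrative tokenizer (the token-class names and the pattern set are assumptions made for this sketch, not from the notes): it classifies each lexeme and scans out whitespace.

```python
import re

# Illustrative token classes; names and patterns are assumptions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),   # keywords are reclassified below
    ("OP",     r"==|[=+*;(){}]"),
    ("SKIP",   r"[ \t\n]+"),       # blanks and line breaks are scanned out
]
KEYWORDS = {"if", "else"}

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                lexeme = m.group()
                if name == "ID" and lexeme in KEYWORDS:
                    name = "KEYWORD"
                if name != "SKIP":
                    tokens.append((name, lexeme))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f"no token matches at position {pos}")
    return tokens

print(tokenize("if (x == y) z = 1; else z = 2;"))
```

Patterns are tried in the listed order; a production scanner would instead apply maximal munch across all rules, as discussed later in these notes.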

Phase 2. Hierarchical Analysis

Parsing or Syntax Analysis

For the previous example, position = initial + rate * 60, we would have the parse tree:

assignment statement
  identifier: position
  =
  expression
    expression: identifier initial
    +
    expression
      expression: identifier rate
      *
      expression: number 60

Nodes of the tree are constructed using a grammar for the language.

Phase 3. Semantic Analysis

Find more complicated semantic errors and support code generation.

The parse tree is augmented with semantic actions. For the running example, the compressed tree

:=
  position
  +
    initial
    *
      rate
      60

becomes, after semantic analysis inserts a conversion,

:=
  position
  +
    initial
    *
      rate
      inttofloat(60)

Most important activity in this phase:

Type checking - legality of operands

Many different situations require a conversion action:

float = int + char ;
A[int] = A[float] + int ;
while (char != int)
etc.
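The inttofloat coercion can be sketched as a small semantic-analysis pass over an expression tree. The tuple-based node shapes and the type-lookup table below are assumptions made for this sketch.

```python
# Minimal semantic-analysis sketch: insert an inttofloat conversion when
# an int operand meets a float operand, mirroring rate * 60 above.

def analyze(node, symbol_types):
    """Return (typed_node, type). Leaves are ('id', name) or ('num', value)."""
    kind = node[0]
    if kind == "id":
        return node, symbol_types[node[1]]   # look up the declared type
    if kind == "num":
        return node, ("float" if isinstance(node[1], float) else "int")
    op, left, right = node                   # binary operator node
    left, lt = analyze(left, symbol_types)
    right, rt = analyze(right, symbol_types)
    if lt == "float" and rt == "int":
        right = ("inttofloat", right)        # coerce the int operand up
    elif lt == "int" and rt == "float":
        left = ("inttofloat", left)
        lt = "float"
    return (op, left, right), lt

types = {"position": "float", "initial": "float", "rate": "float"}
tree = ("+", ("id", "initial"), ("*", ("id", "rate"), ("num", 60)))
typed, t = analyze(tree, types)
print(typed, t)
```

A full type checker would also reject illegal operand combinations instead of only inserting conversions.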

Translation of a statement, phase by phase:

character stream:
position = initial + rate * 60

Lexical Analyzer produces the token stream:
<id,1> <=> <id,2> <+> <id,3> <*> <60>

Syntax Analyzer produces the syntax tree:
=
  <id,1>
  +
    <id,2>
    *
      <id,3>
      60

Semantic Analyzer inserts a conversion:
=
  <id,1>
  +
    <id,2>
    *
      <id,3>
      inttofloat(60)

Intermediate Code Generator:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Machine-Independent Code Optimizer:
t1 = id3 * 60.0
id1 = id2 + t1

Code Generator:
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1

SYMBOL TABLE:
1  position
2  initial
3  rate
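The machine-independent optimization step (folding inttofloat(60) to 60.0 and substituting temporaries) can be sketched on the three-address code. This is a deliberately simplified pass, not a real optimizer: the text-substitution approach is an assumption and would mis-handle temporary names that share a prefix.

```python
# Simplified constant folding + forward substitution on three-address code.

def optimize(quads):
    # quads: list of (dest, expr) pairs, e.g. ("t1", "inttofloat(60)")
    values = {}
    out = []
    for dest, expr in quads:
        # constant-fold inttofloat of an integer literal
        if expr.startswith("inttofloat(") and expr[11:-1].isdigit():
            expr = str(float(expr[11:-1]))
        # substitute previously seen temporaries into the expression
        # (naive string replace; fine here, unsafe for names like t1/t12)
        for temp, val in values.items():
            expr = expr.replace(temp, val)
        if dest.startswith("t"):
            values[dest] = expr   # defer emitting the temporary
        else:
            out.append((dest, expr))
    return out

code = [("t1", "inttofloat(60)"),
        ("t2", "id3 * t1"),
        ("t3", "id2 + t2"),
        ("id1", "t3")]
for dest, expr in optimize(code):
    print(dest, "=", expr)
```

This sketch substitutes every temporary, producing the single line id1 = id2 + id3 * 60.0; the notes' two-instruction form keeps one temporary because a real back end must respect register pressure.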

Phases and Passes

Pass:
- A pass is a physical scan over the source program.
- The portions of one or more phases are combined into a module called a pass.
- An intermediate file is required between two passes.
- Splitting into more passes reduces memory.
- A single-pass compiler is faster than a two-pass compiler.

Phase:
- A phase is a logically cohesive operation that takes input in one form and produces output in another form.
- No intermediate files are needed between phases.
- Splitting into more phases reduces the complexity of the program.
- Reducing the number of phases increases execution speed.

Lexical Analysis

For the source text

if (i == j)
    z = 0;
else
    z = 1;

the compiler sees the character stream:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Token Class (or Class)
- In English: noun, verb, adjective, ...
- In a programming language: identifiers, keywords, operators, numbers, ...

Token
- A classification for a common set of strings
- <Identifier>, <Number>, etc.

Pattern
- The rules which characterize the set of strings for a token
- File and OS wildcards: *.*, [A-Z]

Lexeme
- The actual sequence of characters that matches a pattern and is classified by a token
- Identifiers: x, count, ...

Token classes correspond to sets of strings.

Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: else or if or begin or ...
Whitespace: a non-empty sequence of blanks, newlines, and tabs

The lexical analyzer classifies program substrings according to role and communicates the tokens to the parser:

Lexical Analyzer → <Class, String> → Parser

An implementation must do two things:

1. Recognize the substrings corresponding to tokens (the lexemes).
2. Identify the token class of each lexeme.

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Find the number of tokens in the following code segments:

1. printf("Compiler Design");
2. DO I = 15.5;
3. int add(int x, int y)
   {
       return x + y;
   }
4. printf("i = %d, &i = %p", i, &i);
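A small longest-match scanner can check an answer such as exercise 3. The pattern set below is an assumption chosen to cover just this fragment; punctuation and operators each count as one token.

```python
import re

# Toy longest-match token counter; patterns are illustrative assumptions.
PATTERNS = [r"\d+", r"[A-Za-z_]\w*", r"[+{}();,]"]

def count_tokens(text):
    count, pos = 0, 0
    while pos < len(text):
        if text[pos].isspace():        # whitespace is scanned out
            pos += 1
            continue
        best = max((re.match(p, text[pos:]) for p in PATTERNS),
                   key=lambda m: len(m.group()) if m else -1)
        if best is None:
            raise SyntaxError(f"no pattern matches at {pos}")
        count += 1
        pos += len(best.group())
    return count

print(count_tokens("int add(int x, int y) { return x + y; }"))
```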

Complexity in Lexical Analysis

FORTRAN rule: whitespace is insignificant.
- DO 5 I = 1,25 begins a DO loop, but
- DO 5 I = 1.25 is an assignment (to the variable DO5I)
- VAR1 is the same as VA R1

PL/I keywords are not reserved:

IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN;

C++ template syntax: Foo<Bar>
C++ stream syntax: cin >> var;
(the >> closing a nested template looks like the stream operator)
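The DO-loop ambiguity can be made concrete: with blanks removed, only the comma versus the period after the = decides the statement kind, so a FORTRAN scanner must look ahead that far. A toy classifier (an illustration, not a real FORTRAN lexer):

```python
# Toy FORTRAN statement classifier; whitespace is insignificant, so
# "DO 5 I" may start a loop or be part of the variable name DO5I.

def classify_fortran_stmt(stmt):
    compact = stmt.replace(" ", "")      # blanks carry no meaning
    if compact.startswith("DO") and "=" in compact:
        tail = compact.split("=", 1)[1]
        if "," in tail:
            return "DO loop"             # DO 5 I = 1,25
    return "assignment"                  # DO5I = 1.25

print(classify_fortran_stmt("DO 5 I = 1,25"))
print(classify_fortran_stmt("DO 5 I = 1.25"))
```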

The goal of lexical analysis is to:
- Partition the input string into lexemes
- Identify the token of each lexeme

Left-to-right scan => lookahead is sometimes required.

Regular Languages

Lexical structure = token classes. We must say what set of strings is in a token class; for this we use regular languages, and regular expressions specify regular languages.

Five constructs:
- Two base cases: the empty string and 1-character strings
- Three compound expressions: union, concatenation, iteration

Def. The regular expressions over an alphabet Σ are the smallest set of expressions including

R = ε | c (where c ∈ Σ) | R + R | RR | R*

RE examples:

For Σ = {0,1}, find the strings represented by the following REs:

1. 1* = { ε, 1, 11, 111, ... } (all strings of 1's)
2. (1 + 0)1 = { 11, 01 }
3. 0* + 1* = all strings of only 0's or only 1's
4. (0 + 1)* = all strings over {0,1}
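The four examples above can be checked with Python's re module, writing the notes' + (union) as |; fullmatch tests set membership.

```python
import re

# The four regular expressions over {0,1}, in Python syntax.
one_star   = re.compile(r"1*")       # 1*      : epsilon, 1, 11, 111, ...
union_then = re.compile(r"(1|0)1")   # (1+0)1  : exactly 11 and 01
same_char  = re.compile(r"0*|1*")    # 0*+1*   : all 0's or all 1's
any_string = re.compile(r"(0|1)*")   # (0+1)*  : every string over {0,1}

assert one_star.fullmatch("111")
assert not one_star.fullmatch("101")
assert union_then.fullmatch("01") and union_then.fullmatch("11")
assert not union_then.fullmatch("00")
assert same_char.fullmatch("0000") and same_char.fullmatch("11")
assert not same_char.fullmatch("01")
assert any_string.fullmatch("011010") and any_string.fullmatch("")
print("all membership checks pass")
```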

Formal Languages

Def. Let Σ be a set of characters (an alphabet). A language over Σ is a set of strings of characters drawn from Σ.

Alphabet = English characters, Language = English sentences
Alphabet = ASCII, Language = C programs

Meaning function L maps syntax to semantics:

L(e) = M

Why use a meaning function?
- Makes clear what is syntax and what is semantics
- Allows us to consider notation as a separate issue
- Because expressions and meanings are not 1-1

Meaning is many to one, never one to many!

Lexical Specifications

Keyword: if or else or then or ...
Integer: a non-empty string of digits
Identifier: strings of letters or digits, starting with a letter
Whitespace: a non-empty sequence of blanks, newlines, and tabs

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
digits = digit+
opt_fraction = ('.' digits) + ε
opt_exponent = ('E' ('+' + '-' + ε) digits) + ε
num = digits opt_fraction opt_exponent

Shorthands:

At least one: A+ = AA*
Union: A | B = A + B
Option: A? = A + ε
Range: a + b + ... + z = [a-z]
Excluded range: complement of [a-z] = [^a-z]
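The num specification above translates directly to a Python regular expression, with each '+ ε' option becoming '?' (note the spec allows only an uppercase E):

```python
import re

# num = digits opt_fraction opt_exponent, as a Python regex.
digits = r"[0-9]+"
num = re.compile(rf"{digits}(\.{digits})?(E[+-]?{digits})?")

for s in ["60", "3.14", "6E2", "2.5E+3", "1E-9"]:
    assert num.fullmatch(s), s        # all in L(num)
for s in [".5", "3.", "E5", "2.5E"]:
    assert not num.fullmatch(s), s    # fraction/exponent need digits
print("num matches exactly the specified strings")
```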

Lexical Specification Process

1. Write a regular expression for the lexemes of each token class:
   Number = digit+
   Keyword = 'if' + 'else' + ...
   Identifier = letter (letter + digit)*
   OpenPar = '('

2. Construct R, matching all lexemes for all tokens:
   R = Keyword + Identifier + Number + ...
     = R1 + R2 + ...

3. Let the input be x1...xn. For 1 ≤ i ≤ n, check x1...xi ∈ L(R).

4. If successful, then we know that x1...xi ∈ L(Rj) for some j.

5. Remove x1...xi from the input and go to (3).

Resolving Ambiguities

How much input is used?
- Maximal munch: take the longest match

Which token is used?
- Choose the one listed first

What if no rule matches?
- Pass on to the error handler
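Both disambiguation rules can be sketched together: pick the strictly longest match, and on a tie keep the rule listed first, so the keyword if beats the identifier if. The rule set here is illustrative.

```python
import re

# Illustrative rule list; order matters for breaking ties.
RULES = [("KEYWORD", r"if|else"), ("ID", r"[A-Za-z]+"), ("OP", r"==|=")]

def next_token(text, pos):
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text[pos:])
        # strictly longer wins; ties keep the earlier-listed rule
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    if best is None:
        raise SyntaxError(f"no rule matches at {pos}")  # error handler's job
    return best

print(next_token("ifx == y", 0))  # maximal munch: the identifier ifx
print(next_token("if (x)", 0))    # tie: keyword, because listed first
print(next_token("== y", 0))      # longest match: one ==, not two ='s
```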

Lexical errors

Some errors are beyond the power of the lexical analyzer to recognize:

fi (a == f(x))

However, it may be able to recognize errors like:

d = 2r

Such errors are recognized when no pattern for tokens matches a character sequence.

Error recovery

- Panic mode: successive characters are ignored until we reach a well-formed token
- Delete one character from the remaining input
- Insert a missing character into the remaining input
- Replace a character by another character
- Transpose two adjacent characters

Input buffering

Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return. In C, for example, we need to look past -, = or < to decide what token to return. We introduce a two-buffer scheme to handle large lookaheads safely.

Two buffers of the same size, say 4096, are alternately reloaded. Two pointers into the input are maintained:

- Pointer lexeme_Begin marks the beginning of the current lexeme.
- Pointer forward scans ahead until a pattern match is found.

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}

Transition diagrams
Transition diagram for relop

Transition diagram for reserved words and identifiers

Transition diagram for unsigned numbers

Transition diagram for whitespace
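The relop transition diagram (for <, <=, <>, =, >, >=) can be rendered as straight-line code, with the diagram's "other"-labelled edges becoming the cases where the lookahead character is not consumed. The token and attribute names here are assumptions.

```python
# The relop transition diagram as code: advance on <, =, >, and retract
# (i.e. do not consume the lookahead) on an "other" edge.
def relop(text, pos):
    """Return (token, attribute, next_pos), or None if no relop starts here."""
    c = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if c == "<":
        if nxt == "=":
            return ("relop", "LE", pos + 2)
        if nxt == ">":
            return ("relop", "NE", pos + 2)
        return ("relop", "LT", pos + 1)   # "other": nxt is not consumed
    if c == "=":
        return ("relop", "EQ", pos + 1)
    if c == ">":
        if nxt == "=":
            return ("relop", "GE", pos + 2)
        return ("relop", "GT", pos + 1)   # "other": nxt is not consumed
    return None

print(relop("<= 10", 0))
print(relop("<> b", 0))
print(relop("> x", 0))
```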
