
Yeditepe University

Department of Computer Engineering

CSE 232
Systems Programming
LECTURE NOTES #5
BASIC COMPILER THEORY AND
INTERPRETERS

1. Basic Language Concepts

A grammar is a formal description of a language.


A language is a notational system for communication.
A sentence is a collection of language symbols that are arranged according to the basic precepts of
grammar.
Syntax is the study of the form or structure of sentences that are valid in the language.
Semantics is the study of the meaning of valid sentences written in the language.
An algorithm is an abstract description of how to solve a problem.
Interpretation is the performance of the algorithm that a program represents.

Methods of Defining the Syntax of a Language


A programming language is usually defined by a set of terminals (the basic alphabet of the language), a set of
non-terminals (constructions that need to be further defined within the language), production rules (rules
that define how non-terminals can be constructed from terminals and non-terminals) and a starting
symbol (which indicates the initial production rule that defines the entire language). Symbol generally refers
to both terminals and non-terminals, whereas token refers to terminal symbols only. Defining
a language requires a notational system to express its production rules; two such systems are BNF and
syntax graphs.
Backus Normal Form (BNF)
Terminals are written as they appear (BEGIN, :=)
Non-terminals are usually enclosed in <> (<expression>, <integer>)
A production rule is expressed as a non-terminal followed by ::= (or →) followed by the
definition of the non-terminal. There may be several ways to define a non-terminal, which are
separated by the | symbol.
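Recursion (repetition) is written with braces: {<z>} denotes zero or more occurrences of <z>; a lower
bound i and an upper bound j may be attached to the braces to limit the number of occurrences.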

Example:
1 <assign> ::= ident := <expr>
2 <expr>   ::= <expr> + <term> | <expr> - <term> | <term>
               (or <expr> ::= <term> {+|- <term>})
3 <term>   ::= <term> * <factor> | <term> / <factor> | <factor>
               (or <term> ::= <factor> {*|/ <factor>})
4 <factor> ::= ident | numb | ( <expr> )

Parse tree
Example: A:=B+C*(D-E)

assign
    ident  A
    :=
    expr
        expr
            term
                factor
                    ident  B
        +
        term
            term
                factor
                    ident  C
            *
            factor
                (
                expr
                    expr
                        term
                            factor
                                ident  D
                    -
                    term
                        factor
                            ident  E
                )
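
The grammar above can be turned into a small recursive-descent parser. The following is a minimal sketch in Python, not part of the original notes. It uses the iterative forms of rules 2 and 3 (for example <expr> ::= <term> {+|- <term>}), because the left-recursive forms would make a recursive-descent parser call itself forever, and it nests the nodes to the left by hand so that the result has the same shape as the parse tree above. The names tokenize and Parser and the tuple layout of the tree are assumptions made for this illustration.

```python
import re


def tokenize(text):
    """Split the input into identifiers, numbers, ':=' and one-character symbols."""
    return re.findall(r"[A-Za-z]\w*|\d+|:=|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def assign(self):
        # <assign> ::= ident := <expr>
        name = self.take()
        if not name[0].isalpha():
            raise SyntaxError(f"identifier expected, got {name!r}")
        self.take(":=")
        tree = ("assign", ("ident", name), ":=", self.expr())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        # <expr> ::= <term> {+|- <term>}, nested to the left
        node = ("expr", self.term())
        while self.peek() in ("+", "-"):
            op = self.take()
            node = ("expr", node, op, self.term())
        return node

    def term(self):
        # <term> ::= <factor> {*|/ <factor>}, nested to the left
        node = ("term", self.factor())
        while self.peek() in ("*", "/"):
            op = self.take()
            node = ("term", node, op, self.factor())
        return node

    def factor(self):
        # <factor> ::= ident | numb | ( <expr> )
        tok = self.take()
        if tok == "(":
            inner = self.expr()
            self.take(")")
            return ("factor", "(", inner, ")")
        if tok.isdigit():
            return ("factor", ("numb", tok))
        if tok[0].isalpha():
            return ("factor", ("ident", tok))
        raise SyntaxError(f"unexpected token {tok!r}")
```

Calling Parser(tokenize("A:=B+C*(D-E)")).assign() returns a nested tuple whose root is assign and which contains four expr nodes, five term nodes, five factor nodes and five ident nodes.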

2. Compilers
A compiler takes a program written in a high-level language and converts it into an equivalent
program in machine language (translation).
source program (in high-level lang.) -> compiler -> object code (in machine lang.) -> execution -> results of the program

Figure 1. Compilation of a source program.

Phases of the compiler:

source language -> analysis -> synthesis -> target language

Figure 2. Basic steps of compilation

1. Lexical analysis (Scanner)
The source code is scanned to recognize and classify the tokens: the source program is converted into
the basic elements of the language (tokens: keywords, identifiers, constants, operators, punctuation).
Ordered pairs are used:
- <type, index>
e.g., for B++; <operator, 1> might be + and <symbol, 2> might be B
- <symbol type, string>
e.g., for gcd=12; <identifier, gcd>, <number, 12>
The identifier, literal and symbol tables are built.
2. Syntactic analysis (Parser)
The parser inputs the tokens and attempts to combine them into valid constructs of the language (such as
statements).
3. Semantic analysis (Parser)
Appropriate action routines are called, which generate the intermediate form of these
constructs, such as a parse tree, P-code or a symbol table.
4. Code Generation
When the parser recognizes a portion of the program, the corresponding routine is executed, which
generates the machine language code.
5. Optimization
- Machine-dependent and machine-independent optimizations are possible.
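
The lexical analysis phase (phase 1 above) can be illustrated with a small scanner that produces <symbol type, string> pairs. This is only a sketch in Python under simplifying assumptions: the token classes and their regular expressions in TOKEN_SPEC are chosen for this example, keywords are not distinguished from identifiers, and no identifier or literal tables are built.

```python
import re

# (token class, regular expression); the classes are assumptions for this sketch
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator", r"\+\+|[-+*/=]"),
    ("punctuation", r"[;()]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def scan(source):
    """Return the list of (symbol type, string) pairs found in source."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "skip":  # white space is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For the input gcd=12; the scanner returns the pairs (identifier, gcd), (operator, =), (number, 12) and (punctuation, ;), which corresponds to the second pair format described in phase 1.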
Passes of a compiler:
In a 2-pass compiler:
Pass 1: - Lexical analysis
Pass 2: - Syntactic and semantic analysis
        - Code generation
        - Optimization
3. Interpreters
An interpreter directly executes the statements of a high-level language, just as if those statements
were part of the instruction set of the machine.
program statements -> interpreter -> result of the statement

Figure 3. Input and output of an interpreter

Interpreter Example: Tiny BASIC


Example 1:
Text Buffer Contents (. stands for a blank, ^ marks the end of a line):

10.Y=5^20.X=10^30.X=X+Y^40.GOTO.100^50.Y=90^100.Y=Y*X^

Offset in text buff.    Program text
0                       10 Y=5^
7                       20 X=10^
15                      30 X=X+Y^
24                      40 GOTO 100^
36                      50 Y=90^
44                      100 Y=Y*X^

Statement Descriptor Table (SDT)

Expresses the order of the statements in the program.
Each entry in the SDT describes one line (statement) of the source code.
4 fields are used:
- Descriptor no: descriptor number for the statement
- Statement no: number of the statement in the source code
- Offset in text buffer: a pointer to the first character of the statement in the text buffer
- Link: a link to the descriptor no. of the next statement
SDT for the example:

Descriptor no.    Statement no.    Offset in text buffer    Link
0                 10               0                        1
1                 20               7                        2
2                 30               15                       3
3                 40               24                       4
4                 50               36                       5
5                 100              44                       6
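
The SDT for the example can be derived mechanically from the text buffer. The sketch below, in Python, is an illustration rather than the actual Tiny BASIC code: it assumes that ^ terminates every line, that each line starts with its statement number, and that the lines are already stored in ascending statement order, so the link of each entry is simply the next descriptor number. The function name build_sdt and the dictionary keys are hypothetical.

```python
def build_sdt(text_buffer, eol="^"):
    """Build one SDT entry per line of the text buffer."""
    sdt = []
    offset = 0
    while offset < len(text_buffer):
        end = text_buffer.index(eol, offset)  # ValueError if a line is unterminated
        line = text_buffer[offset:end]
        descriptor_no = len(sdt)
        sdt.append({
            "descriptor": descriptor_no,
            "statement": int(line.split()[0]),  # leading statement number
            "offset": offset,                   # first character of the line
            "link": descriptor_no + 1,          # next statement in sequence
        })
        offset = end + 1                        # skip the end-of-line marker
    return sdt
```

Applied to the buffer "10 Y=5^20 X=10^30 X=X+Y^40 GOTO 100^50 Y=90^100 Y=Y*X^", this reproduces the offsets 0, 7, 15, 24, 36, 44 and the links 1 to 6 of the table above.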

Interpretation Process

1. Select the first statement descriptor.
2. Parse the line.
3. Process it.
4. More lines? If no (N), done. If yes (Y), fetch the next statement and go back to step 2.

Figure 4. Control flow in the interpreter.

Parsing:
When a line is parsed:
- A Parse Table is generated
- Keywords (FOR, TO, IF) are converted to 1-byte ASCII codes
- Numbers in ASCII are converted to 2-byte binary numbers
- Algebraic expressions are converted to postfix (Reverse Polish Notation)
e.g. (A+B)*(C/D) => AB+CD/*
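
The conversion to postfix can be done with an operator stack. The sketch below, in Python, uses the well-known shunting-yard technique; the notes do not say which method Tiny BASIC uses, so this is an assumption. It handles single-character operands, the four operators + - * / (all left-associative) and parentheses, and the name to_postfix is chosen for this example.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def to_postfix(infix):
    """Convert an infix expression with one-character operands to postfix."""
    output = []
    stack = []  # pending operators and open parentheses
    for ch in infix:
        if ch.isspace():
            continue
        if ch.isalnum():
            output.append(ch)                # operands go straight to the output
        elif ch == "(":
            stack.append(ch)
        elif ch == ")":
            while stack[-1] != "(":          # flush back to the matching "("
                output.append(stack.pop())
            stack.pop()                      # discard the "("
        else:
            # pop operators of higher or equal precedence (left associativity)
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]:
                output.append(stack.pop())
            stack.append(ch)
    while stack:
        output.append(stack.pop())
    return "".join(output)
```

With this sketch, to_postfix("(A+B)*(C/D)") gives AB+CD/*, the same result as the example above.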

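Parse Table
Fields: Statement type, Length, Next descriptor, Sequential/Branch (S/B), Offset in text buffer,
Branch target, Type, Operands

Ex:
50 Y=90^
descriptor no=4, next descriptor=5, offset=36
Parse table entries, in the order shown in the figure: 1, 5, 1, 2, S, 39, 41, 3 (int. variable), 2 (int. constant)
Here 39 and 41 are the text buffer offsets of the operands Y and 90 (the statement starts at offset 36),
and 3 and 2 are the types of these operands.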

Fetching the next statement:

IP (instruction pointer) shows the next statement to be executed.

If the S/B field is S (sequential) then
- IP <- next descriptor field in the SDT
If the S/B field is B (branch) then
- Search the statement no. field of the SDT to find the descriptor no. of the statement to be
branched to
- IP <- descriptor no. field of the matching SDT entry
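
The two cases for updating IP can be sketched as a single function. This is a Python illustration under assumptions: the SDT is a list of dictionaries holding the values of the example program, the link field plays the role of the next descriptor, and the names SDT and fetch_next are hypothetical.

```python
# SDT of the example program (descriptor no., statement no., offset, link)
SDT = [
    {"descriptor": 0, "statement": 10, "offset": 0, "link": 1},
    {"descriptor": 1, "statement": 20, "offset": 7, "link": 2},
    {"descriptor": 2, "statement": 30, "offset": 15, "link": 3},
    {"descriptor": 3, "statement": 40, "offset": 24, "link": 4},
    {"descriptor": 4, "statement": 50, "offset": 36, "link": 5},
    {"descriptor": 5, "statement": 100, "offset": 44, "link": 6},
]


def fetch_next(sdt, ip, sb, branch_target=None):
    """Return the new IP (a descriptor number) after the statement at ip."""
    if sb == "S":
        # sequential: follow the link of the current descriptor
        return sdt[ip]["link"]
    # branch: search the statement no. field for the target statement
    for entry in sdt:
        if entry["statement"] == branch_target:
            return entry["descriptor"]
    raise LookupError(f"no statement numbered {branch_target}")
```

For the example program, statement 40 (GOTO 100) at descriptor 3 yields fetch_next(SDT, 3, "B", 100) == 5, so statement 50 is skipped and execution visits statements 10, 20, 30, 40 and 100.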

Interpreters use a Symbol Table to keep the variable names and their values:
Symbol Table (ST)
Variable    Value

Execution of Program Statements:

Assignments
(10 X=7^, 12 X=Y^, 13 X=A+B^)
Evaluate the expression and assign the result to the value field of the variable in the Symbol Table.
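
Executing an assignment can be sketched as evaluating the postfix form of the right-hand side against the Symbol Table and storing the result. The Python code below is an illustration only: the Symbol Table is modelled as a dictionary, the expression is assumed to be already tokenized into postfix order, division is assumed to be integer division, and the names eval_postfix and assign are hypothetical.

```python
import operator

# integer arithmetic is an assumption of this sketch
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,
}


def eval_postfix(tokens, symbol_table):
    """Evaluate a list of postfix tokens using values from the symbol table."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        elif tok.isdigit():
            stack.append(int(tok))            # integer constant
        else:
            stack.append(symbol_table[tok])   # variable: look up its value
    return stack.pop()


def assign(symbol_table, target, postfix_tokens):
    """Store the value of the expression in the value field of target."""
    symbol_table[target] = eval_postfix(postfix_tokens, symbol_table)
```

For instance, starting from an empty table, assign(st, "Y", ["5"]), assign(st, "X", ["10"]) and assign(st, "X", ["X", "Y", "+"]) leave X equal to 15 and Y equal to 5.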
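Arrays
Ex: A(10)
For a one-dimensional array A(n) whose first element is stored at address base:
Loc(A(i)) = base + i * size

For a two-dimensional array A(m, n) with m rows and n columns (indices starting at 0):
row-wise:    Loc(A(i, j)) = base + (i * n + j) * size
column-wise: Loc(A(i, j)) = base + (j * m + i) * size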
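
The address calculations for array elements can be checked with a few lines of code. The Python sketch below assumes zero-based indices and a two-dimensional array with n_rows rows and n_cols columns: in row-wise storage a full row of n_cols elements precedes each new row, and in column-wise storage a full column of n_rows elements precedes each new column. The function names are hypothetical.

```python
def loc_1d(base, i, size):
    """Address of A(i) in a one-dimensional array."""
    return base + i * size


def loc_row_wise(base, i, j, n_cols, size):
    """Address of A(i, j) when rows are stored one after another."""
    return base + (i * n_cols + j) * size


def loc_column_wise(base, i, j, n_rows, size):
    """Address of A(i, j) when columns are stored one after another."""
    return base + (j * n_rows + i) * size
```

With base 1000, 2-byte elements and a 3 x 4 array, element A(1, 2) is at 1000 + (1 * 4 + 2) * 2 = 1012 in row-wise storage and at 1000 + (2 * 3 + 1) * 2 = 1014 in column-wise storage.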

Unconditional Branches
(20 GOTO 100^)
Set the S/B field to B.

Conditional Branches
(30 IF (A>B) 500^)
Evaluate the Boolean expression. If the condition is true, set the S/B field to B (so the
interpretation process continues with statement number 500); otherwise, set the S/B field to S (and
continue with the next statement in the sequence).

Subroutines

Subroutine Return Stack (SRS): Each entry contains the descriptor number of the next statement
after the subroutine call.

CALL statement:
- the descriptor no. of the next statement is pushed onto the SRS
- the number of the first statement in the subroutine is obtained from the text buffer and a branch is
performed to that statement
RETURN statement:
- the descriptor no. on top of the SRS is popped
- a branch is executed to that descriptor
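
The CALL and RETURN steps above can be sketched with a plain list used as the SRS. This Python example is an illustration under assumptions: the small SDT describes a made-up program (10 CALL 100, 20 ..., 30 ..., 100 ..., 110 RETURN), only the fields needed here are stored, and the names do_call and do_return are hypothetical.

```python
# SDT of a hypothetical program; statement 100 starts the subroutine
SDT = [
    {"descriptor": 0, "statement": 10, "link": 1},   # 10 CALL 100
    {"descriptor": 1, "statement": 20, "link": 2},
    {"descriptor": 2, "statement": 30, "link": 3},
    {"descriptor": 3, "statement": 100, "link": 4},
    {"descriptor": 4, "statement": 110, "link": 5},  # 110 RETURN
]


def do_call(srs, sdt, ip, target_statement):
    """Push the return descriptor and branch to the subroutine; returns the new IP."""
    srs.append(sdt[ip]["link"])  # descriptor no. of the statement after the CALL
    for entry in sdt:
        if entry["statement"] == target_statement:
            return entry["descriptor"]
    raise LookupError(f"no statement numbered {target_statement}")


def do_return(srs):
    """Pop the return descriptor from the SRS; returns the new IP."""
    return srs.pop()
```

Because the SRS is a stack, nested calls return in the reverse order of the calls: the most recently pushed descriptor is the first one popped.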

Loops

Loop Control Table (LCT): Dynamically created and destroyed for each loop. Contains the following
information:
1. descriptor number of the first statement inside the loop
2. descriptor number of the first statement after the end of the loop
3. address of the ST entry of the loop counter
4. ending value of the loop counter

LRS (Loop Return Stack): a stack of LCT numbers, one for each loop that is currently active.

Loop Begin Statement:
(50 FOR I=1 TO 100^)
- allocate an LCT
- push the number of the LCT onto the LRS
- initialize the value of the loop counter
Loop End Statement:
(70 NEXT I^)
- access the LCT of the top element in the LRS
- if the second field of the LCT has not been filled yet, fill it
- if the loop counter is equal to its ending value, pop from the LRS and destroy the LCT;
otherwise, increment the loop counter and go back to the beginning of the loop
UNLOOP Statement: (termination of looping)
(60 UNLOOP^)
- pop the related LCT from the LRS
- if the descriptor of the next statement after the loop is not in the LCT, follow the links in
the SDT to find the loop end
- clear the LCT
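
The loop begin and loop end steps can be sketched as follows. This Python example is a simplification, not the actual implementation: the LRS holds the LCT objects themselves rather than LCT numbers, the variable name stands in for the address of the ST entry, the step is assumed to be 1, and UNLOOP is left out. The names LCT, loop_begin and loop_end are chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LCT:
    first_inside: int            # 1. descriptor of the first statement inside the loop
    first_after: Optional[int]   # 2. descriptor of the first statement after the loop
    counter: str                 # 3. stands in for the address of the ST entry
    end_value: int               # 4. ending value of the loop counter


def loop_begin(lrs, symbols, first_inside, counter, start, end):
    """FOR: allocate an LCT, push it on the LRS, initialize the counter."""
    lrs.append(LCT(first_inside, None, counter, end))
    symbols[counter] = start
    return first_inside  # new IP


def loop_end(lrs, symbols, descriptor_after_next):
    """NEXT: finish the loop or increment the counter and repeat; returns the new IP."""
    lct = lrs[-1]
    if lct.first_after is None:      # second field not filled yet
        lct.first_after = descriptor_after_next
    if symbols[lct.counter] == lct.end_value:
        lrs.pop()                    # loop finished: the LCT is discarded
        return lct.first_after
    symbols[lct.counter] += 1
    return lct.first_inside
```

For a loop FOR I=1 TO 3 whose body occupies descriptor 1 and whose NEXT occupies descriptor 2, loop_end returns 1 twice (with I becoming 2 and then 3) and then returns 3, the descriptor after the loop, leaving the LRS empty.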

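LCT structure: each node holds an LCT no. and a pointer to the next node. The nodes are kept on two
linked lists, a free list and an in-use list.

[Figure: the free list and the in-use list initially, and after the allocation of the first LCT.]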