
Overview of Compiler

Recap
• What is a Translator?
• What is a Compiler?
• Is the Compiler the only translator available?
• What else is required to run a High Level
Program?
• Executing a program written in a high level
language is basically a two-step process:
• The source program must first be
compiled (i.e., translated to an object
program).
• The object program is then loaded into
memory and executed.

• Components required for executing HLL
programs are:
– Translators, Linkers, Loaders
Structure of Compiler
• The process of compilation is very complex.
• It is not reasonable, from either a logical
point of view or an implementation point of
view, to treat compilation as occurring in
one single step.
• The compilation process is therefore
partitioned into a series of sub-processes
called phases.

• What is a phase?
• A phase is a logically cohesive
operation that takes as input one
representation of the source program and
produces as output another
representation.
Source Program
      ↓
Lexical Analysis (SCANNER)
      ↓
Syntax Analysis (PARSER)
      ↓
Semantic Analysis
      ↓
Intermediate Code Generation
      ↓
Code Optimization
      ↓
Code Generation
      ↓
Target Program

(Table Management and Error Handling interact with all phases.)
Lexical Analyzer (SCANNER)
• ROLE: separates the characters of the
source program into groups that logically
belong together.
• TOKENS: these groups are called
tokens.
• Tokens are the basic units of syntax.
double f = sqrt(-1);

T_DOUBLE ("double")
T_IDENT ("f")
T_OP ("=")
T_IDENT ("sqrt")
T_LPAREN ("(")
T_OP ("-")
T_INTCONSTANT ("1")
T_RPAREN (")")
T_SEP (";")
• Eliminates white space (tabs, blanks,
comments).
• A key issue is speed.
• A scanner must recognize the various parts of
the language's syntax.

• SOME PARTS ARE EASY:
– White space
– Keywords and operators
  specified as literal patterns – do, end
– Comments
  opening and closing delimiters – /* ... */
• SOME PARTS ARE MUCH HARDER:
– Identifiers
  an alphabetic character followed by up to k alphanumerics
– Numbers
  integers – 0, or a digit from 1-9 followed by digits from 0-9
  decimals – integer "." digits from 0-9
  reals – (integer or decimal) "E" (+ or -) digits from 0-9

• A POWERFUL NOTATION IS NEEDED TO
SPECIFY THESE PATTERNS.
THESE POWERFUL NOTATIONS ARE:

1. REGULAR EXPRESSIONS
2. FINITE AUTOMATA
– DETERMINISTIC FINITE AUTOMATA (DFA)
– NON-DETERMINISTIC FINITE AUTOMATA
(NDFA)
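As a sketch of how regular expressions drive a scanner, the fragment below tokenizes the earlier example double f = sqrt(-1); using Python's re module. The token names follow the slide's example; the TOKEN_SPEC table and the tokenize helper are illustrative, not taken from any particular scanner generator.

```python
import re

# Token patterns, tried in order (keywords before identifiers).
# Names follow the slide's example (T_DOUBLE, T_IDENT, ...).
TOKEN_SPEC = [
    ("T_DOUBLE",      r"double\b"),
    ("T_INTCONSTANT", r"[0-9]+"),
    ("T_IDENT",       r"[A-Za-z_][A-Za-z0-9_]*"),
    ("T_OP",          r"[=+\-*/]"),
    ("T_LPAREN",      r"\("),
    ("T_RPAREN",      r"\)"),
    ("T_SEP",         r";"),
    ("WHITESPACE",    r"\s+"),   # recognized, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "WHITESPACE":      # eliminate white space
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("double f = sqrt(-1);"))
# first tokens: ('T_DOUBLE', 'double'), ('T_IDENT', 'f'), ('T_OP', '='), ...
```

Real scanner generators (lex, flex) compile such patterns into a DFA, which is what makes single-pass, character-at-a-time scanning fast.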
SYNTAX ANALYSIS (PARSER)
• ROLE: the syntax analyzer groups tokens
together into syntactic structures.
• Example: the three tokens representing A+B
might be grouped into a syntactic structure
called an expression.

• Describes the set of strings (that are
programs) using a grammar.
Expression -> UnaryExpression
Expression -> FuncCall
Expression -> T_INTCONSTANT
UnaryExpression -> T_OP Expression
FuncCall -> T_IDENT T_LPAREN Expression T_RPAREN

THIS IS A GRAMMAR (ie THE RULES OF THE LANGUAGE).
SO FOR GENERATION OF PARSERS, YOU MUST
UNDERSTAND THE GRAMMAR OF THE LANGUAGE.

FOR THAT YOU MUST LEARN:
– CONTEXT FREE GRAMMARS
– CONTEXT SENSITIVE GRAMMARS
• The output of this phase is a syntactic
structure known as a syntax tree,
derivation tree, or parse tree (the tree
generated as a result of parsing).
• Leaves of the parse tree are tokens.
PARSE TREE FOR double f = sqrt(-1);

Expression
└─ FuncCall
   ├─ T_IDENT (sqrt)
   ├─ T_LPAREN "("
   ├─ Expression
   │  └─ UnaryExpression
   │     ├─ T_OP (-)
   │     └─ Expression
   │        └─ T_INTCONSTANT (1)
   └─ T_RPAREN ")"
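The grammar on the previous slide can be turned directly into a recursive-descent parser: one function per nonterminal, choosing a production by looking at the next token. The sketch below is a minimal Python illustration (the tuple-shaped tree nodes are an assumption made for this example, not a standard representation).

```python
def parse_expression(tokens, i):
    """Expression -> UnaryExpression | FuncCall | T_INTCONSTANT.
    Returns (tree, index of the next unconsumed token)."""
    kind, text = tokens[i]
    if kind == "T_OP":                       # UnaryExpression -> T_OP Expression
        child, i = parse_expression(tokens, i + 1)
        return ("UnaryExpression", text, child), i
    if kind == "T_IDENT":                    # FuncCall -> T_IDENT T_LPAREN Expression T_RPAREN
        assert tokens[i + 1][0] == "T_LPAREN", "expected '('"
        arg, i = parse_expression(tokens, i + 2)
        assert tokens[i][0] == "T_RPAREN", "expected ')'"
        return ("FuncCall", text, arg), i + 1
    if kind == "T_INTCONSTANT":              # Expression -> T_INTCONSTANT
        return ("IntConstant", int(text)), i + 1
    raise SyntaxError(f"unexpected token {kind}")

tokens = [("T_IDENT", "sqrt"), ("T_LPAREN", "("), ("T_OP", "-"),
          ("T_INTCONSTANT", "1"), ("T_RPAREN", ")")]
tree, _ = parse_expression(tokens, 0)
print(tree)
# ('FuncCall', 'sqrt', ('UnaryExpression', '-', ('IntConstant', 1)))
```

The nesting of the tuples mirrors the parse tree drawn above: a FuncCall whose argument is a UnaryExpression over an integer constant.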
OVERVIEW Contd..
SEMANTIC ANALYSIS
• Checks the source program for semantic errors and gathers type
information for the subsequent code-generation phase.

• Uses the hierarchical structure built by the syntax-analysis phase to
identify the operands and operators of expressions and statements.

• An important component of semantic analysis is type checking.

• The compiler checks that each operator has operands that are
permitted by the source language specification.

• Eg. many programming language definitions require the compiler to
report an error every time a real number is used to index an array.
Semantic Analysis
• “does it make sense”?
• Checking semantic rules such as
– Is there a main function?
– Is every variable declared before use?
– Are operands type compatible? (coercions)
– Do function arguments match the function
declaration?
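A minimal sketch of type checking over a toy expression tree. The node shapes ("int", v), ("+", l, r), ("index", element_type, subscript) are hypothetical, invented for this example; the coercion and array-index rules mirror the bullets above.

```python
def check(node):
    """Return the type of an expression, enforcing the semantic rules above."""
    op = node[0]
    if op in ("int", "real"):                # a literal carries its own type
        return op
    if op == "+":
        lt, rt = check(node[1]), check(node[2])
        # coercion rule: int + real is allowed, the result is real
        return "real" if "real" in (lt, rt) else "int"
    if op == "index":                        # ("index", element_type, subscript_expr)
        if check(node[2]) != "int":          # the rule from the slide:
            raise TypeError("array index must be an integer, not a real")
        return node[1]
    raise ValueError(f"unknown node {op!r}")

print(check(("+", ("int", 2), ("real", 3.5))))   # prints: real
```

Attempting check(("index", "int", ("real", 1.0))) raises the TypeError, which is exactly the real-subscript error mentioned above.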
Intermediate Code Generation
• On a logical level the output of the syntax
analyzer and semantic analyzer is some
representation of a parse tree.
• After syntax and semantic analysis, some
compilers generate an explicit intermediate
representation of the source program.
• This intermediate representation can be thought
of as a program for an abstract machine.
• Intermediate representation should have two
properties:
– Should be easy to produce.
– Easy to translate into target program.
• The intermediate representation can take a
variety of forms; of these, Three-Address
Code is commonly used.
• Three-Address Code:
– Three-address code consists of a sequence of
instructions, each of which has at most three
operands.
– Each three-address instruction has at most one
operator in addition to the assignment.
• Thus, when generating these instructions, the compiler has to
decide on the order in which operations are to be done.
position := initial + rate * 60
        ↓
Lexical Analysis (SCANNER)
        ↓
id1 := id2 + id3 * 60
        ↓
Syntax Analysis (PARSER)
        ↓
         :=
        /  \
     id1    +
           / \
        id2   *
             / \
          id3   60
        ↓
Semantic Analysis
        ↓
         :=
        /  \
     id1    +
           / \
        id2   *
             / \
          id3   inttoreal
                    |
                   60
        ↓
Intermediate Code Generation
        ↓
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
• Multiplication precedes addition.
• Compiler must generate a temporary name to hold
the value computed by each instruction.
• Some three address instructions can have fewer
than three operands.
– INTERMEDIATE CODE GENERATION IS
DONE BY SYNTAX-DIRECTED
TRANSLATION, A PROCESS IN WHICH
ACTIONS OF THE SYNTAX ANALYSIS PHASE
GUIDE THE TRANSLATION.
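Generating three-address code from an expression tree is a short recursive walk: emit code for each child, then emit one instruction into a fresh temporary. The sketch below reproduces the sequence shown earlier for position := initial + rate * 60 (the tuple tree shape is an assumption for this example).

```python
temp_count = 0

def new_temp():
    """Generate a fresh temporary name: temp1, temp2, ..."""
    global temp_count
    temp_count += 1
    return f"temp{temp_count}"

def gen(node, code):
    """Emit three-address instructions for an expression tree;
    return the name that holds the tree's value."""
    if isinstance(node, str):                # an identifier or temporary
        return node
    op, left, right = node
    l, r = gen(left, code), gen(right, code)
    t = new_temp()
    code.append(f"{t} := {l} {op} {r}")      # at most one operator per instruction
    return t

code = []
t60 = new_temp()
code.append(f"{t60} := inttoreal(60)")       # the coercion inserted by semantic analysis
result = gen(("+", "id2", ("*", "id3", t60)), code)
code.append(f"id1 := {result}")
print("\n".join(code))
```

Note how the recursion visits the * subtree before emitting +, so multiplication precedes addition in the generated sequence, as observed above.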
Code Optimization
• This is an optional phase.
• The Code Optimizer analyzes and changes the intermediate
code so that the transformed code is better in some sense.
• The GOAL of this phase is to
– reduce running time or space (or both).
• The term optimization is a complete misnomer, since
there is no algorithmic way of producing a target
language program that is best possible under any
reasonable definition of “best”.
• A good optimizing compiler can improve the target
program by perhaps a factor of two in overall speed, in
comparison with a compiler that generates code without
using specialized techniques.
Example of optimization:
• Local Optimization:
– Consider a jump over a jump in the
intermediate code:
if A>B goto L2
goto L3
L2:
– This sequence could be replaced by the
single statement
if A≤B goto L3
Sequence 1
1. Compare A and B to set the condition codes.
2. Jump to L2 if the code for > is set.
3. Jump to L3.
Sequence 2
1. Compare A and B to set the condition codes.
2. Jump to L3 if the code for < or = is set.

Assume A > B is true half the time; then in sequence 1
we execute (1) and (2) every time and (3) half the
time, for an average of 2.5 instructions.
For sequence 2 we always execute two instructions,
a 20% saving.
Loop Optimizations
– Loop Invariants
• Remove loop invariants, i.e., entities whose value remains
the same inside the loop.

for i:=1 to 100
  k:=10;
  j:=j+2;
end
OPTIMIZED VERSION
k:=10;
for i:=1 to 100
  j:=j+2;
end
– Loop Unrolling: replicate the body of the loop.
begin
  I:=1;
  while I<=100 do
  begin
    A[I]:=0;
    I:=I+1;
  end
end
OPTIMIZED:
begin
  I:=1;
  while I<=100 do
  begin
    A[I]:=0;
    I:=I+1;
    A[I]:=0;
    I:=I+1;
  end
end
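The two loops above can be checked for equivalence with a quick sketch (Python, with slot 0 left unused to simulate 1-based arrays; this relies on the trip count of 100 being even, as in the slide):

```python
# Original loop: one iteration per loop test.
A = [None] * 101
i = 1
while i <= 100:
    A[i] = 0
    i += 1

# Unrolled loop: two iterations per loop test,
# halving the number of comparisons and branches.
B = [None] * 101
i = 1
while i <= 100:
    B[i] = 0
    i += 1
    B[i] = 0
    i += 1

assert A == B   # same final array contents
```

The payoff is fewer executed loop-control instructions, at the cost of a larger program.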
– Loop Jamming
– Merging the bodies of two loops. It is necessary that each loop
be executed the same number of times and that the indices be
the same.

Begin
  for I:=1 to 10 do
    for J:=1 to 10 do
      A[I,J]:=0;
  for I:=1 to 10 do
    A[I,I]:=1;
End
Optimized Version
Begin
  for I:=1 to 10 do
  begin
    for J:=1 to 10 do
      A[I,J]:=0;
    A[I,I]:=1;
  end
End
• Other Optimizations
– Common subexpression elimination
– Redundant computation elimination
– Moving computations to less frequently executed
places (i.e., out of loops).
Code Generation
• Converts the intermediate code into a sequence of machine
instructions.
• Simple code generator might map the statement A:=B+C into the
machine code sequence
LOAD A
ADD C
STORE A
• However such straightforward macro like expansion of intermediate
code into machine code usually produces a target program that
contains many redundant loads and stores and that uses the
resources of target machine inefficiently.
• To avoid this, the code generator keeps track of the run-time
contents of registers.
• Knowing what quantities reside in registers, the code
generator can generate loads and stores only when necessary.
• It attempts to utilize registers as efficiently as possible.
• Register allocation is difficult to do
optimally, but some heuristic approaches
exist and give reasonably good results.
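The register-tracking idea above can be sketched for a toy single-register machine. The generator below handles statements of the form X := Y + Z and remembers what the register currently holds, so a value already in the register is not reloaded (this LOAD/ADD/STORE instruction set is the illustrative one used earlier, not a real machine's).

```python
def gen_machine(stmts):
    """Generate LOAD/ADD/STORE code for (dest, left, right) addition
    statements, tracking the register's run-time contents."""
    code, reg = [], None          # reg = name of the value currently in the register
    for dest, left, right in stmts:
        if reg != left:           # skip the LOAD when the value is already loaded
            code.append(f"LOAD {left}")
        code.append(f"ADD {right}")
        code.append(f"STORE {dest}")
        reg = dest                # after STORE, the register still holds dest's value
    return code

print(gen_machine([("A", "B", "C"), ("D", "A", "E")]))
# ['LOAD B', 'ADD C', 'STORE A', 'ADD E', 'STORE D']
```

Naive macro expansion would emit LOAD A before ADD E; tracking the register contents eliminates that redundant load.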
SUMMARY
Instruction Selection
– Produce compact, fast code.
– Use available addressing modes.
– a pattern matching problem
• Ad hoc techniques
• Tree pattern matching algorithms
• String pattern matching algorithms
• Dynamic programming techniques
Register Allocation
– Limited resources.
– Loads and Stores should be minimized.
– Keep run time track of values in registers.
– Optimal allocation is difficult.
• NP-complete for 1 or k registers.
In next class
• Continue with the overview.
• Passes of compilers.
• Bootstrapping
• Cross Compilation
Overview Contd…
Symbol Table Management
• An essential function of a compiler is to record
the identifiers used in the source program and
collect information about various attributes of
these identifiers.
• The attributes may provide information about the
– Storage allocated for an identifier.
– Its type
– Its scope
– In case of procedures (name & types of arguments,
method of passing each argument, the type returned)
• Symbol Table is a data structure containing a
record for each identifier, with fields for the
attributes of the identifier.
• This data structure allows us to find the record
for each identifier quickly and to store and
retrieve data from the record quickly.
• When an identifier in the source program is
detected by the lexical analyzer, it is
entered into the symbol table.
• However, the attributes of an identifier
cannot normally be determined during
lexical analysis.
• Eg. Pascal Declaration
var position, initial, rate: real;
• The remaining phases enter information
about identifiers into the symbol table and
then use the information in various ways.
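A minimal symbol table can be sketched as a map from identifier to a record of attributes: the scanner enters the name, and later phases fill in attributes such as the type. This SymbolTable class and its method names are an illustration, not a standard interface.

```python
class SymbolTable:
    """One record per identifier; attributes are filled in by later
    phases (the lexical analyzer only enters the name)."""
    def __init__(self):
        self._records = {}

    def enter(self, name):                   # called from the scanner
        self._records.setdefault(name, {})

    def set_attr(self, name, attr, value):   # called from later phases
        self._records[name][attr] = value

    def lookup(self, name):                  # returns the record, or None
        return self._records.get(name)

# var position, initial, rate: real;  (the Pascal declaration above)
table = SymbolTable()
for ident in ("position", "initial", "rate"):
    table.enter(ident)                       # scanner enters the names
    table.set_attr(ident, "type", "real")    # semantic analysis adds the type
print(table.lookup("rate"))                  # prints: {'type': 'real'}
```

A production compiler would use a hash table with scope-aware lookup; the dictionary here stands in for that structure.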
Error Detection and Handling
• Each phase can encounter errors.
• For example:

– The lexical analyzer may be unable to proceed because the next token
in the source program is misspelled.
– The syntax analyzer may be unable to infer a structure for its input
because a syntactic error such as missing parenthesis has occurred.
– The intermediate code generator may detect an operator whose
operands have incompatible types.
– The code optimizer, during control flow analysis, may detect that
certain statements can never be reached.
– The code generator may find a compiler created constant that is too
large to fit in a word of the target machine.
– While entering information into symbol table, the book-keeping routine
may discover an identifier that has been multiply declared with
contradictory attributes.
• Whenever a phase of the compiler discovers an
error, it must report the error to the error handler,
which issues an appropriate diagnostic message.
• Once error has been noted, the compiler must
modify the input to the phase detecting the error,
so that the latter can continue processing its
input, looking for subsequent errors.
• A compiler that stops when it finds the first error
is not as helpful as it could be.
• Good error handling is difficult because certain
errors can mask subsequent errors.
Phases Grouped into Passes
• In an implementation of a compiler, portions of
one or more phases are combined into a module
called a pass.
• A pass reads the source program or the output
of the previous pass, makes the transformations
specified by its phases, and writes output into an
intermediate file, which may then be read by a
subsequent pass.
• If several phases are grouped together into one
pass, then the operation of the phases may be
interleaved, with control alternating among the
phases.
• The number of passes and the grouping of
phases into passes depends upon:
– A particular language and machine.
– Structure of language.
• Certain languages require at least two passes to
generate code easily. For example, ALGOL allows
the declaration of a name to occur after uses of that
name. Code for expressions containing the name
cannot be generated conveniently until the
declaration has been seen.
– Environment.
– The environment in which the compiler must operate
can also affect the number of passes.
– A multi-pass compiler can be made to use less space
than a single-pass compiler, since the space occupied
by the compiler program for one pass can be reused
by the following pass.
– A multi-pass compiler is of course slower than a single-
pass compiler, because each pass reads and writes
an intermediate file.
– Thus, compilers running on computers with small
memory would normally use several passes, while on
a computer with large memory a compiler with fewer
passes would be possible.
Bootstrapping
• A compiler is characterized by three languages:
– Its source language
– Its object language
– The language in which it is written.
• All three languages may be different.
• A compiler may run on one machine and
produce code for another machine. Such a
compiler is called a cross-compiler.
• Sometimes we hear of a compiler being
implemented in its own language.
• How was the first compiler compiled?
• Suppose we have a new language L, which
we want to make available on several
machines, say A and B.
