

Introduction to Syntax Analysis


When an input string (source code, or a program in some language) is given to a compiler, the compiler processes it in several phases, starting from lexical analysis (which scans the input and divides it into tokens) and ending with target code generation. The first phase, the scanner (lexical analyzer), works as a text scanner: it reads the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
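For illustration, a minimal lexer sketch in Python (the token names and the toy expression language here are assumptions, not part of any particular compiler) can emit such pairs:

import re

# A minimal, hypothetical tokenizer for a toy expression language.
# Token names (ID, NUM, PLUS, TIMES) are illustrative, not standard.
TOKEN_SPEC = [
    ("NUM",   r"\d+"),
    ("ID",    r"[A-Za-z_]\w*"),
    ("PLUS",  r"\+"),
    ("TIMES", r"\*"),
    ("SKIP",  r"\s+"),
]

def tokenize(source):
    """Yield <token-name, attribute-value> pairs for the input string."""
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    for match in re.finditer(pattern, source):
        name = match.lastgroup
        if name != "SKIP":
            yield (name, match.group())

print(list(tokenize("count + 2 * rate")))
# [('ID', 'count'), ('PLUS', '+'), ('NUM', '2'), ('TIMES', '*'), ('ID', 'rate')]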

Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are checked against the grammar of the source language, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
Explanation

Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the
syntactical structure of the given input, i.e. whether the given input is in the correct syntax (of
the language in which the input has been written) or not. It does so by building a data structure,
called a Parse tree or Syntax tree. The parse tree is constructed by using the pre-defined
Grammar of the language and the input string. If the given input string can be produced with the
help of the syntax tree (in the derivation process), the input string is found to be in the correct
syntax; if not, an error is reported by the syntax analyzer.

In short: The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree. The parser thus accomplishes two tasks: parsing the code while looking for errors, and generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error recovery strategies for this.

This phase uses context-free grammar (CFG), which is recognized by push-down automata. CFG is a superset of regular grammar.
Example:
Suppose Production rules for the Grammar of a language are:
S -> cAd
A -> bc|a
And the input string is “cad”.
Now the parser attempts to construct a syntax tree from this grammar for the given input string. It uses the given production rules and applies them as needed to generate the string. To generate the string "cad":

i. S
ii. cAd (apply S -> cAd)
iii. cbcd (apply A -> bc)
iv. cad (backtrack and apply A -> a)

In step iii above, the production rule A -> bc was not a suitable one to apply (because the string produced is "cbcd", not "cad"); here the parser needs to backtrack and apply the next production rule available for A, which is shown in step iv, and the string "cad" is produced.
Thus, the given input can be produced by the given grammar, therefore the input is correct
in syntax.
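A minimal sketch of this backtracking behaviour, written as a hand-coded recursive-descent-style parser for the grammar above (the function names are illustrative only):

# Backtracking parser sketch for the grammar:
#   S -> c A d
#   A -> b c | a
def parse_A(s, pos):
    """Try each alternative for A in order; return the new position or None."""
    if s[pos:pos+2] == "bc":      # first alternative: A -> b c
        return pos + 2
    if s[pos:pos+1] == "a":       # fall back to the second alternative: A -> a
        return pos + 1
    return None

def parse_S(s):
    """S -> c A d; succeeds only if the whole input is consumed."""
    if not s.startswith("c"):
        return False
    pos = parse_A(s, 1)
    if pos is None:
        return False
    return s[pos:] == "d"

print(parse_S("cad"))   # True  (A -> bc fails, parser falls back to A -> a)
print(parse_S("cbcd"))  # True
print(parse_S("cbd"))   # False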

Derivation
A derivation is basically a sequence of production rule applications used to obtain the input string. During parsing, we make two decisions for some sentential form of the input:

 Deciding the non-terminal which is to be replaced.
 Deciding the production rule by which the non-terminal will be replaced.

To decide which non-terminal to replace, and in which order, we have two options.

Left-most Derivation: If the sentential form of an input is scanned and replaced from left to right, it is called left-most derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Example 1

Production rules:

1. S = S + S
2. S = S - S
3. S = a | b |c

Input:

a - b + c

The left-most derivation is:

1. S = S+S
2. S = S-S+S
3. S = a-S+S
4. S = a-b+S
5. S = a-b+c
Example 2
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id
The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
Notice that the left-most side non-terminal is always processed first.

Right-most Derivation: If we scan and replace the input with production rules, from
right to left, it is known as right-most derivation. The sentential form derived from the
right-most derivation is called the right-sentential form.
Example 1
1. S = S + S
2. S = S - S
3. S = a | b |c
Input:
a-b+c

The right-most derivation is:

1. S = S - S
2. S = S - S + S
3. S = S - S + c
4. S = S - b + c
5. S = a - b + c

Example 2
Production rules:
E → E + E
E → E * E
E → id

Input string: id + id * id
The right-most derivation is:
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id
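To make the left-most versus right-most choice concrete, the following sketch rewrites a sentential form one non-terminal at a time; the grammar symbols and the sequence of productions are hard-coded to reproduce the derivations shown above:

# Sketch: rewrite the non-terminal E step by step, choosing either the
# left-most or the right-most occurrence, to mirror the derivations above.
def derive(start, rhs_sequence, leftmost=True):
    form = start
    steps = [form]
    for rhs in rhs_sequence:
        idx = form.find("E") if leftmost else form.rfind("E")
        form = form[:idx] + rhs + form[idx + 1:]
        steps.append(form)
    return steps

# Left-most derivation of id + id * id
print(" => ".join(derive("E", ["E*E", "E+E", "id", "id", "id"], leftmost=True)))
# E => E*E => E+E*E => id+E*E => id+id*E => id+id*id

# Right-most derivation of id + id * id
print(" => ".join(derive("E", ["E+E", "E*E", "id", "id", "id"], leftmost=False)))
# E => E+E => E+E*E => E+E*id => E+id*id => id+id*id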

Parse Tree
A parse tree is a graphical depiction of a derivation. Its nodes are grammar symbols, which can be terminals or non-terminals. In parsing, the string is derived from the start symbol, and the start symbol of the derivation becomes the root of the parse tree.

So, in short, a parse tree is the graphical representation of a derivation whose symbols can be terminals or non-terminals.

A parse tree follows the precedence of operators: the deepest sub-tree is traversed first, therefore the operator in that sub-tree gets precedence over the operator in the parent nodes.

Note: In a parse tree:

 All leaf nodes are terminals.
 All interior nodes are non-terminals.
 In-order traversal gives the original input string.
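A tiny sketch of these properties (the Node class and method names are made up for illustration):

# Parse-tree node sketch: interior nodes hold non-terminals, leaves hold terminals.
class Node:
    def __init__(self, symbol, children=None):
        self.symbol = symbol
        self.children = children or []   # empty list => leaf (terminal)

    def leaves(self):
        """In-order (left-to-right) traversal of the leaves gives back the input string."""
        if not self.children:
            return [self.symbol]
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result

# A parse tree for a * b + c using T -> T + T | T * T | a | b | c
tree = Node("T", [
    Node("T", [Node("T", [Node("a")]), Node("*"), Node("T", [Node("b")])]),
    Node("+"),
    Node("T", [Node("c")]),
])
print("".join(tree.leaves()))   # a*b+c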
Example 1:
Production rules:

1. T = T + T | T * T
2. T = a|b|c

Input:
a * b + c
Step 1 through Step 5: the parse tree is built one production at a time.

Resulting output of a * b + c.

Here we took the left-most derivation of a * b + c.



Example 2:
Here we take the left-most derivation of a + b * c, written with id in place of a, b, and c.
The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id

Example 3:
Consider the statement S = 2 + 3 * 4 (the tree that later phases such as semantic analysis work on). The parse tree corresponding to S would be:
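As a rough sketch of why the deepest sub-tree matters, evaluating such a tree bottom-up computes 3 * 4 before the addition; the tree shape below is assumed, since the figure is not reproduced here:

# Evaluate a parse/expression tree bottom-up: deepest sub-trees are computed first,
# so the * below the + is applied before the addition: 2 + (3 * 4) = 14.
def evaluate(node):
    if isinstance(node, int):          # leaf: a number
        return node
    op, left, right = node             # interior node: (operator, left, right)
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

tree = ("+", 2, ("*", 3, 4))           # assumed shape of the parse tree for 2 + 3 * 4
print(evaluate(tree))                  # 14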

Ambiguity
A grammar G is said to be ambiguous if there exists more than one left-most derivation, more than one right-most derivation, or more than one parse tree for some input string. If the grammar is not ambiguous, it is called unambiguous.

Note: A language is said to be inherently ambiguous if every grammar that generates it is ambiguous. Ambiguity in a grammar is not good for compiler construction. If the grammar has ambiguity, no method can automatically detect and remove the ambiguity, but you can remove ambiguity by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.

Example 1:
E → E + E
E → E - E
E → id
For the input string id + id - id, the above grammar generates two parse trees:

Example 2:
Let us consider this grammar: E -> E+E | id
We can create 2 parse trees from this grammar to obtain the string id+id+id:
The following are the 2 parse trees generated by left-most derivations:

Both of the above parse trees are derived from the same grammar rules, but they are different. Hence the grammar is ambiguous.

Example 3:
Let us now consider the following grammar:

Set of alphabet symbols Σ = {0, …, 9, +, *, (, )}

E -> I
E -> E + E
E -> E * E
E -> (E)
I -> ε | 0 | 1 | … | 9

From the above grammar, the string 3*2+5 can be derived in 2 ways:

I) First leftmost derivation:
E=>E*E
=>I*E
=>3*E
=>3*E+E
=>3*I+E
=>3*2+E
=>3*2+I
=>3*2+5

II) Second leftmost derivation:
E=>E+E
=>E*E+E
=>I*E+E
=>3*E+E
=>3*I+E
=>3*2+E
=>3*2+I
=>3*2+5

Example 4:
1. S = aSb | SS
2. S = ε

For the string aabb, the above grammar generates two parse trees:

Example 5:
Consider the grammar:
S -> aS | Sa | a
For the string aaa we will have 4 parse trees, hence the grammar is ambiguous.
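A brute-force sketch that counts the distinct parse trees of this particular grammar (not a general algorithm) confirms the count of 4:

# Count parse trees of the ambiguous grammar S -> aS | Sa | a for a given string.
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(s):
    total = 0
    if s == "a":                              # S -> a
        total += 1
    if len(s) >= 2 and s[0] == "a":           # S -> a S
        total += count_trees(s[1:])
    if len(s) >= 2 and s[-1] == "a":          # S -> S a
        total += count_trees(s[:-1])
    return total

print(count_trees("aaa"))   # 4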

Parser
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by the lexical analyzer are grouped according to the syntactic rules of the language. This is done by a parser. The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar of the source language. It detects and reports any syntax errors and produces a parse tree from which intermediate code can be generated. So, in short, the parser is the component of the compiler that structures the data coming from the lexical analysis phase: it takes input in the form of a sequence of tokens and produces output in the form of a parse tree.

Types of parser

1. Top Down Parser
2. Bottom Up Parser

Top down parsing

 Top down parsing attempts to build the parse tree from the root to the leaves. A top down parser starts from the start symbol and proceeds towards the input string, transforming the start symbol into the input. It follows left-most derivation. Recursive Descent and LL parsers are the top-down parsers.

 Recursive descent parsing: It is a common form of top-down parsing. It is called recursive as it uses recursive procedures to process the input. Recursive descent parsing may suffer from backtracking.

 Backtracking: It means that if one derivation of a production fails, the syntax analyzer restarts the process using different rules of the same production. This technique may process the input string more than once to determine the right production.

 Non-recursive predictive parsing: This type of parsing does not require backtracking. Predictive parsers can be constructed for LL(1) grammars.

o LL Parser or LL(1) parser: An LL parser accepts an LL grammar. LL grammar is a subset of context-free grammar, but with some restrictions to get a simplified version, in order to achieve easy implementation. LL grammar can be implemented by means of both algorithms, namely recursive-descent or table-driven.

LL parsers are denoted as LL(k). The first L in LL(k) means parsing the input from left to right, the second L in LL(k) stands for left-most derivation, and k itself represents the number of lookahead symbols. Generally k = 1, so LL(k) may also be written as LL(1).
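A small sketch of the table-driven variant; the toy grammar, its parsing table entries and the token names are chosen here only for illustration:

# Table-driven LL(1) parsing sketch for the toy grammar:
#   E  -> T E'
#   E' -> + T E' | ε
#   T  -> id
# The parsing table maps (non-terminal, lookahead) to a production body.
TABLE = {
    ("E",  "id"): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", "$"):  [],                # E' -> ε
    ("T",  "id"): ["id"],
}
NON_TERMINALS = {"E", "E'", "T"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]               # start symbol on top of the end marker
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NON_TERMINALS:
            body = TABLE.get((top, look))
            if body is None:
                return False         # no table entry => syntax error
            stack.extend(reversed(body))
        else:                        # terminal (or $): must match the lookahead
            if top != look:
                return False
            pos += 1
    return pos == len(tokens)

print(ll1_parse(["id", "+", "id"]))  # True
print(ll1_parse(["id", "+"]))        # False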

Rules for LL(1) parser

A grammar G is LL(1) if, for any two distinct productions A → α | β of G:
 for no terminal a do both α and β derive strings beginning with a;
 at most one of α and β can derive the empty string;
 if β can derive ε, then α does not derive any string beginning with a terminal in FOLLOW(A).

Construction of LL(1) Parsing Table:

To construct the parsing table, we have two functions (sets of rules):
1: First(): If there is a variable, and from that variable we try to derive all possible strings, then the set of terminal symbols with which those strings can begin is called the First of that variable.

Rules for First Sets

1. If X is a terminal then First(X) is just {X}.
2. If there is a production X → ε then add ε to First(X).
3. If there is a production X → Y1Y2..Yk then add First(Y1Y2..Yk) to First(X).
4. First(Y1Y2..Yk) is either:
   a. First(Y1), if First(Y1) does not contain ε;
   b. or, if First(Y1) does contain ε, then First(Y1Y2..Yk) is everything in First(Y1) except for ε, together with everything in First(Y2..Yk);
   c. if First(Y1), First(Y2), .., First(Yk) all contain ε, then add ε to First(Y1Y2..Yk) as well.

2: Follow(): The set of terminal symbols that can follow a variable in the process of derivation.

Rules for Follow Sets
1. First put $ (the end-of-input marker) in Follow(S), where S is the start symbol.
2. If there is a production A → αBβ (where α and β can be whole strings of grammar symbols), then everything in First(β) except for ε is placed in Follow(B).
3. If there is a production A → αB, then everything in Follow(A) is in Follow(B).
4. If there is a production A → αBβ, where First(β) contains ε, then everything in Follow(A) is in Follow(B).
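A compact sketch that computes First sets by iterating the rules above until a fixed point is reached; the grammar encoding and the use of 'ε' as a plain string are illustrative choices:

# Sketch: compute First sets by iterating the rules above until nothing changes.
# Grammar format: {non-terminal: [list of alternative bodies]}, bodies are symbol lists.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],      # [] stands for the ε-production
    "T":  [["id"]],
}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                add = set()
                all_nullable = True
                for sym in body:
                    if sym not in grammar:        # terminal: First(sym) = {sym}
                        add.add(sym)
                        all_nullable = False
                        break
                    add |= first[sym] - {"ε"}
                    if "ε" not in first[sym]:
                        all_nullable = False
                        break
                if all_nullable:                  # every symbol can derive ε (or body is ε)
                    add.add("ε")
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

print(first_sets(GRAMMAR))
# {'E': {'id'}, "E'": {'+', 'ε'}, 'T': {'id'}}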

Bottom up parsing
As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the parse tree up to the start symbol.

Bottom up parsing is classified into various parsing techniques. These are as follows:

1. Shift-Reduce Parsing ->
 LR(1)
 SLR(1)
 CLR(1)
 LALR(1)

2. Operator Precedence Parsing
3. Table Driven LR Parsing

Example 1 of bottom-up parsing:


Input string: a + b * c
Production rules:
S → E
E → E + T
E → E * T
E → T
T → id
Let us start bottom-up parsing
a + b * c
Read the input and check if any production matches with the input:
a + b * c
T + b * c
E + b * c
E + T * c
E * c
E * T
E
S

Example 2 of bottom-up parsing:

Production

1. E → T
2. T → T * F
3. T → id
4. F → T
5. F → id

Parse Tree representation of input string "id * id" is as follows:



Shift reduce parsing
o Shift reduce parsing is a process of reducing a string to the start symbol of a grammar.
o Shift reduce parsing uses a stack to hold the grammar symbols and an input tape to hold the string.
o Shift reduce parsing performs two actions: shift and reduce. That is why it is known as shift-reduce parsing.
o In the shift action, the current symbol in the input string is pushed onto the stack.
o In each reduction, the symbols on top of the stack that match the right-hand side of a production are replaced by the non-terminal on the left-hand side of that production.

Example:
Grammar:

1. S → S+S
2. S → S-S
3. S → (S)
4. S → a

Input string:

1. a1-(a2+a3)

Parsing table:
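The step-by-step parsing table is not reproduced here, but a rough sketch of the shift/reduce moves for this grammar (treating a1, a2 and a3 as the terminal a, and using a naive "reduce when the top of the stack matches a right-hand side, otherwise shift" strategy chosen only for this example) is:

# Naive shift-reduce sketch for the grammar S -> S+S | S-S | (S) | a.
# Identifiers a1, a2, a3 are treated as the terminal 'a'.
RHS = ["S+S", "S-S", "(S)", "a"]   # right-hand sides, each reduces to S

def shift_reduce(tokens):
    stack, i = [], 0
    while True:
        # reduce as long as the top of the stack matches some right-hand side
        reduced = True
        while reduced:
            reduced = False
            for rhs in RHS:
                if "".join(stack[-len(rhs):]) == rhs:
                    print("reduce:", "".join(stack), "->", "".join(stack[:-len(rhs)]) + "S")
                    stack[-len(rhs):] = ["S"]
                    reduced = True
                    break
        if i == len(tokens):
            return stack == ["S"]          # accept iff only the start symbol remains
        print("shift:  ", tokens[i])
        stack.append(tokens[i])
        i += 1

print(shift_reduce(list("a-(a+a)")))       # True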

There are two main categories of shift-reduce parsing, as follows:

1. Operator-Precedence Parsing
2. LR-Parser
