Syntax Analysis
• Every programming language has precise rules that describe the
syntactic structure of well-formed programs.
• The syntax of programming language constructs can be specified by
Role of a parser
• Main tasks
Parser
• Top Down
• Builds parse trees from top to bottom
• Example: Recursive descent parsing, predictive parsing
• Bottom Up
Errors in a program
• Lexical: misspellings of identifiers, keywords or operators
if x<1 thenn y=5:
• Syntactic: e.g. unbalanced parentheses or braces
if ((x<1)&(y>5)))…. {…….{….{……}}
• Semantic: type errors, undefined IDs, etc.
if (x+5) then ...
• Logical: bugs
if (i<9) should be <=, not <
Requirement
• Detect all errors (except logical!)
• Messages should be helpful.
• Difficult to produce clear messages! Example: “Syntax Error”
Error Recovery Approaches: Panic Mode
• Discard tokens until we see a “synchronizing” token
• Simple to implement
• Commonly used
• The key...
– Good set of synchronizing tokens
– Knowing what to do then
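The discard-until-synchronizing-token idea can be sketched on a toy language. Everything here is an assumption made for illustration: the statement form ID = NUM ;, the (kind, text) token pairs, and the choice of ";" as the only synchronizing token.

```python
# Toy grammar (assumed for illustration): program -> stmt*, stmt -> ID '=' NUM ';'
SYNC_TOKENS = {";"}  # hypothetical synchronizing set
EXPECTED = ["ID", "=", "NUM", ";"]

def parse_program(tokens):
    """Parse (kind, text) tokens; return (number of good statements, error messages)."""
    errors, parsed, pos = [], 0, 0
    while pos < len(tokens):
        start, ok = pos, True
        for kind in EXPECTED:
            if pos < len(tokens) and tokens[pos][0] == kind:
                pos += 1
            else:
                ok = False
                break
        if ok:
            parsed += 1
            continue
        found = tokens[pos][0] if pos < len(tokens) else "end of input"
        errors.append(f"statement starting at token {start}: expected {kind}, found {found}")
        # Panic mode: discard tokens until a synchronizing token, then step past it.
        while pos < len(tokens) and tokens[pos][0] not in SYNC_TOKENS:
            pos += 1
        pos += 1
    return parsed, errors
```

Because every pass through the outer loop consumes at least one token, the recovery cannot loop forever. The price is that everything between the error and the next ";" goes unchecked.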
Error Recovery Approaches: Phrase-Level Recovery
• Compiler corrects the program by deleting or inserting tokens
• ...so it can proceed to parse from where it was
• The key...
– Don’t get into an infinite loop ...constantly inserting tokens
Error Recovery Approaches: Error Productions
• Augment the CFG with “Error Productions”
• Now the CFG accepts anything!
• If “error productions” are used...
• Their actions:
– { print (“Error...”) }
• Used with...
Error Recovery Approaches: Global Correction
• Theoretical Approach
• Find the minimum change to the source to yield a valid program
(Insert tokens, delete tokens, swap adjacent tokens)
Grammars
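• A grammar is a 4-tuple
G = (N, T, P, S) where
• T is a finite set of tokens (terminal symbols)
• N is a finite set of nonterminals
• P is a finite set of productions of the form
α → β where α ∈ (N ∪ T)* N (N ∪ T)* and β ∈ (N ∪ T)*
• S ∈ N is the designated start symbol
• In a context-free grammar, the left side α of every production is a single nonterminal, i.e. α ∈ N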
Notational Conventions Used
• Terminals
a, b, c, … ∈ T
specific terminals: 0, 1, id, +
• Nonterminals
A, B, C, … ∈ N
specific nonterminals: expr, term, stmt
• Grammar symbols
X, Y, Z ∈ (N ∪ T)
• Strings of terminals
u, v, w, x, y, z ∈ T*
• Strings of grammar symbols
Example
• Terminals
– Keywords: else (“else”)
– Token classes: ID, INTEGER, REAL
– Punctuation: ; (“;”)
• Non-terminals
Any symbol appearing on the left hand side of any rule
• Start Symbol
Usually the non-terminal on the left hand side of the first rule
• Rules (or “Productions”)
S → 0S1
S → ε
The string 0011 is in the language generated. The derivation is: S ⇒ 0S1 ⇒ 00S11 ⇒ 0011
For compactness, we write S → 0S1 | ε
Palindrome
• Let P be language of palindromes with alphabet { a, b }. One can
determine a CFG for P by finding a recursive decomposition.
• If we peel first and last symbols from a palindrome, what remains is a
palindrome; and if we wrap a palindrome with the same symbol front and back, then it is still a palindrome.
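• CFG is
P → a P a | b P b | ε
• As written, this generates only the even-length palindromes; adding P → a | b covers the odd-length ones.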
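The peel-and-wrap decomposition translates almost directly into a recursive recognizer. The sketch below follows the grammar exactly as written, so it accepts only even-length palindromes (for example, it rejects aba). The function name is hypothetical.

```python
def in_palindrome_cfg(s):
    """Recognize strings generated by P -> a P a | b P b | ε over the alphabet {a, b}."""
    if s == "":
        return True                          # P -> ε
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "ab":
        return in_palindrome_cfg(s[1:-1])    # P -> a P a  or  P -> b P b
    return False
```

Each recursive call corresponds to one use of a production, so the call chain for abba mirrors the derivation P ⇒ aPa ⇒ abPba ⇒ abba.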
Even 0’s
• A CFG for all binary strings with an even number of 0’s.
• Find the decomposition. If first symbol is 1, then even number of 0’s
remain. If first symbol is 0, then go to next 0; after that again an even number of 0’s remain.
Alternate CFG for Even 0’s
• Here is another CFG for the same language.
• Note that when first symbol is 0, what remains has odd number of 0’s.
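The productions themselves are not shown here, so the grammars in the comments below are assumptions reconstructed from the two decompositions described. The functions simply follow those decompositions and assume the input contains only the symbols 0 and 1.

```python
# Assumed grammar for the first decomposition:   S -> 1 S | 0 A 0 S | ε,  A -> 1 A | ε
def even_zeros(s):
    if s == "":
        return True                      # S -> ε
    if s[0] == "1":
        return even_zeros(s[1:])         # S -> 1 S
    nxt = s.find("0", 1)                 # first symbol is 0: go to the next 0
    return nxt != -1 and even_zeros(s[nxt + 1:])

# Assumed grammar for the alternate version:     S -> 1 S | 0 T | ε,  T -> 1 T | 0 S
def even_alt(s):
    if s == "":
        return True
    return even_alt(s[1:]) if s[0] == "1" else odd_alt(s[1:])

def odd_alt(s):
    """True when s has an odd number of 0s (the role of T)."""
    if s == "":
        return False
    return odd_alt(s[1:]) if s[0] == "1" else even_alt(s[1:])
```

Comparing both functions with a direct count of 0s on all short binary strings is a quick sanity check that the two decompositions describe the same language.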
Examples
• A CFG for the regular language corresponding to the RE 00∗11∗.
• The language is the concatenation of two languages: all strings of zeroes with all strings of ones.
Derivation
• We derive strings in the language of a CFG by starting with the start
symbol, and repeatedly replacing some variable A by the right side of one of its productions.
• That is, the “productions for A” are those that have A on the left side of
Derivations – Formalism
• We say αAβ => αγβ if A -> γ is a production.
• Example: S -> 01; S -> 0S1
Iterated Derivation
• =>* means “zero or more derivation steps.”
• Basis: α =>* α for any string α.
Example: Iterated Derivation
• S -> 01; S -> 0S1.
• S => 0S1 => 00S11 => 000111.
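A small sketch of => and =>* for this grammar may help. It assumes every grammar symbol is a single character, and it relies on the fact that neither production shortens a string, so forms longer than the target can be discarded. The function names are hypothetical.

```python
PRODUCTIONS = {"S": ["01", "0S1"]}   # S -> 01 | 0S1

def one_step(form):
    """All strings obtainable from `form` by one derivation step (=>)."""
    results = set()
    for i, symbol in enumerate(form):
        for body in PRODUCTIONS.get(symbol, []):
            results.add(form[:i] + body + form[i + 1:])
    return results

def derives(start, target):
    """Decide start =>* target by searching over sentential forms."""
    seen, frontier = {start}, [start]
    while frontier:
        form = frontier.pop()
        if form == target:
            return True                  # includes the zero-step case start == target
        for nxt in one_step(form):
            # No production here shrinks a string, so longer forms can be pruned.
            if len(nxt) <= len(target) and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False
```

For instance, derives("S", "000111") follows exactly the three steps listed above, while derives("S", "0101") fails because no sentential form of this grammar has a 1 before a 0.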
Sentential Forms
• Any string of variables and/or terminals derived from the start symbol
is called a sentential form
Leftmost and Rightmost Derivations
• Derivations allow us to replace any of the variables in a string.
• Leads to many different derivations of the same string.
• By forcing the leftmost variable (or alternatively, the rightmost
Leftmost Derivations
• Say wAα =>lm wβα if w is a string of terminals only and A -> β is a production.
• Also, α =>*lm β if α becomes β by a sequence of zero or more =>lm steps.
Example: Leftmost Derivations
• Balanced-parentheses grammar: S -> SS | (S) | ()
• S =>lm SS =>lm (S)S =>lm (())S =>lm (())()
• Thus, S =>*lm (())()
• S => SS => S() => (S)() => (())() is a derivation, but not a leftmost derivation.
Rightmost Derivations
• Say αAw =>rm αβw if w is a string of terminals only and A -> β is a production.
Example: Rightmost Derivations
• Balanced-parentheses grammar: S -> SS | (S) | ()
• S =>rm SS =>rm S() =>rm (S)() =>rm (())()
• Thus, S =>*rm (())()
• S => SS => SSS => S()S => ()()S =>
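The leftmost and rightmost derivations of (())() can be replayed mechanically. In this sketch the caller supplies the sequence of production bodies to use, and the only thing that changes is which occurrence of S gets replaced. The function name and its leftmost flag are hypothetical.

```python
BODIES = {"SS", "(S)", "()"}   # right sides of S -> SS | (S) | ()

def derive(bodies, leftmost=True):
    """Replay a derivation from S, always rewriting the leftmost (or rightmost) S."""
    form, steps = "S", ["S"]
    for body in bodies:
        if body not in BODIES:
            raise ValueError(f"not a production body: {body}")
        # str.index / str.rindex raise ValueError if no variable is left to rewrite.
        i = form.index("S") if leftmost else form.rindex("S")
        form = form[:i] + body + form[i + 1:]
        steps.append(form)
    return steps

left = derive(["SS", "(S)", "()", "()"])
right = derive(["SS", "()", "(S)", "()"], leftmost=False)
```

Both runs end in (())() and use the same multiset of productions; only the order in which they are applied differs.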
Parse Trees
• Parse trees are trees labeled by symbols of a
particular CFG.
• Leaves: labeled by a terminal or ε.
• Interior nodes: labeled by a variable.
• Children are labeled by the right side of a production
for the parent.
Example: Parse Tree
S -> SS | (S) | ()
           S
         /   \
        S     S
      / | \  / \
     (  S  ) (  )
       / \
      (   )
Yield of a Parse Tree
• The concatenation of the labels of the leaves in left-to-right order is called the yield of the parse tree.
• That is, in the order of a preorder traversal.
• Example: the yield of the parse tree in the previous example is (())()
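As a small illustration, a parse tree can be stored as nested data and its yield computed by visiting the leaves from left to right. The representation (a leaf is a string, an interior node is a (label, children) pair) is an assumption chosen for brevity.

```python
# Parse tree for (())() under S -> SS | (S) | ()
TREE = ("S", [
    ("S", ["(", ("S", ["(", ")"]), ")"]),
    ("S", ["(", ")"]),
])

def tree_yield(tree):
    """Concatenate the leaf labels in left-to-right order; ε leaves contribute nothing."""
    if isinstance(tree, str):
        return "" if tree == "ε" else tree
    _label, children = tree
    return "".join(tree_yield(child) for child in children)
```

Calling tree_yield(TREE) visits the leaves in the same order a preorder traversal would reach them.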
Parse Trees, Left- and Rightmost Derivations
• For every parse tree, there is a unique leftmost, and a unique
rightmost derivation.
• We’ll prove:
Parse Trees and Rightmost Derivations
• The ideas are essentially the mirror image of the
proof for leftmost derivations.
Parse Trees and Any Derivation
• The proof that you can obtain a parse tree from a
leftmost derivation doesn’t really depend on “leftmost.”
• First step still has to be A => X1…Xn.
• And w still can be divided so the first portion is
Ambiguity
• A grammar is ambiguous if it has more than one Parse-Tree for some string.
– Equivalently, there is more than one right-most or left-most derivation for some string.
• Ambiguity is bad: Leaves meaning of some programs ill-defined since we cannot decide its syntactical structure uniquely.
• Ambiguity is a Property of Grammars, not Languages.
• Two alternative solutions:
1. Disambiguate the grammar
Ambiguity: Arithmetic Expressions
Consider the Grammar for arithmetic expressions:
E → E + E | E ∗ E | (E ) | −E | id
The sequence of Tokens id + id ∗ id has two Parse-Trees
First Parse-Tree:            Second Parse-Tree:
        E                            E
      / | \                        / | \
     E  +  E                      E  *  E
     |   / | \                  / | \   |
    id  E  *  E                E  +  E  id
        |     |                |     |
       id     id              id     id
The first Parse-Tree reflects the usual assumption that * takes precedence over +.
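One way to see the ambiguity concretely is to count parse trees. The sketch below handles only the productions E → E + E | E ∗ E | id (parentheses and unary minus are left out for brevity, which is a simplification) and tries every operator position as the root of the tree.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for `tokens` under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees deriving tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0          # E -> id
        total = 0
        for k in range(i + 1, j - 1):                      # candidate top-level operator
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)     # E -> E + E  or  E -> E * E
        return total

    return count(0, len(tokens))
```

A count greater than 1 for some token sequence means the grammar is ambiguous; for id + id * id the two trees correspond to choosing + or * as the root.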
Free University of Bolzano–Formal Languages and Compilers. Lecture V, 2014/2015 – A.Artale
Eliminating Ambiguity by Disambiguating the Grammar
• Sometimes it is possible to eliminate ambiguity by rewriting the Grammar.
• Example. Let us rewrite the Grammar for arithmetic expressions:
E → E + T | T
T → T ∗ F | F
F → ...
– Enforces precedence of * over +;
– Enforces left-associativity of + and *
Eliminating Ambiguity: Example
The sequence of Tokens id + id ∗ id has now only one Parse-Tree
          E
        / | \
       E  +  T
       |   / | \
       T  T  *  F
       |  |     |
       F  F     id
       |  |
      id  id
E → E + T | T
T → T ∗ F | F
F → ...
Ambiguity: The Dangling Else
• Consider the Grammar for if-then-else statements:
Stmt → if Expr then Stmt
| if Expr then Stmt else Stmt
| other
• This Grammar is ambiguous.
• Example. Consider the statement: if E1 then if E2 then S1 else S2
The Dangling Else: Example
The statement: if E1 then if E2 then S1 else S2, has two Parse-Trees
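First Parse-Tree (else attached to the inner if):
Stmt → if E1 then [Stmt → if E2 then S1 else S2]
Second Parse-Tree (else attached to the outer if):
Stmt → if E1 then [Stmt → if E2 then S1] else S2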
• Typically, the first Parse-Tree is preferred.
• Disambiguating Rule: Match each else with the closest unmatched then.
Disambiguating Dangling Else
• Disambiguating Rule: Match each else with the closest unmatched
then.
• The rule can be incorporated into the Grammar if we distinguish between
matched and unmatched statements.
• A statement between a then-else must be matched.
Stmt → Matched_stmt | Unmatched_stmt
Matched_stmt → if Expr then Matched_stmt else Matched_stmt
| Other-Stmt
Unmatched_stmt → if Expr then Stmt
| if Expr then Matched_stmt else Unmatched_stmt
• This Grammar generates the same set of strings as the previous one but gives just one Parse-Tree for if-then-else statements.
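The closest-unmatched-then rule is also what a simple recursive-descent parser does when it greedily consumes an else. The sketch below implements the disambiguating rule directly rather than the matched/unmatched grammar, and it assumes that every token other than if, then and else is an atomic expression or statement.

```python
def parse_stmt(tokens, pos=0):
    """Return (tree, next_pos) for the statement starting at tokens[pos]."""
    if tokens[pos] != "if":
        return tokens[pos], pos + 1                      # 'other' statement
    cond = tokens[pos + 1]
    if tokens[pos + 2] != "then":
        raise SyntaxError("expected 'then'")
    then_branch, pos = parse_stmt(tokens, pos + 3)
    # The innermost if that is still open gets first claim on a following else.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("if", cond, then_branch), pos

tree, _ = parse_stmt("if E1 then if E2 then S1 else S2".split())
```

Here the inner call sees the else before the outer call does, so S2 ends up as the else branch of the inner if.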
Elimination of Left Recursion
• A grammar is left recursive if it has a nonterminal A such that there is a derivation A =>+ Aα (one or more steps) for some string α.
• Top-down parsing methods cannot handle left-recursive grammars.
Elimination of left recursion
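• If we have the production A → Aα | β, left recursion can be eliminated by rewriting it as:
A → βA’
A’ → αA’ | ε
• Example grammar:
• E → E + T | T
• T → T * F | F
• F → (E) | id
• After eliminating left recursion:
• E → T E’
• E’ → + T E’ | ε
• T → F T’
• T’ → * F T’ | ε
• F → (E) | id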
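The rewriting rule can be sketched as a short function. This version handles immediate left recursion only, represents each production body as a list of symbols with the empty list standing for ε, and assumes the primed name is not already used in the grammar. The function name is hypothetical.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε."""
    new_head = head + "'"
    alphas = [b[1:] for b in bodies if b and b[0] == head]    # the α parts
    betas = [b for b in bodies if not b or b[0] != head]      # the β parts
    if not alphas:
        return {head: bodies}                                 # nothing to do
    return {
        head: [beta + [new_head] for beta in betas],          # A  -> β A'
        new_head: [alpha + [new_head] for alpha in alphas] + [[]],  # A' -> α A' | ε
    }

e_rules = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
t_rules = eliminate_immediate_left_recursion("T", [["T", "*", "F"], ["F"]])
f_rules = eliminate_immediate_left_recursion("F", [["(", "E", ")"], ["id"]])
```

Indirect left recursion (for example A → Bα with B → Aβ) needs the more general algorithm that first orders the nonterminals and substitutes productions; that case is outside this sketch.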
Left Factoring
• Useful for producing grammar suitable for predictive or top down parsing
• When the choice between two alternative A-productions is not clear, we may be able to rewrite the productions to defer the decision.
Example-Left factoring
If we have the two productions
stmt → if expr then stmt else stmt | if expr then stmt
then, on seeing the input token if, we cannot immediately tell which production to choose to expand stmt.
In general, if A → αβ1 | αβ2 are two A-productions with a common nonempty prefix α, the left-factored productions are:
A → αA’
A’ → β1 | β2
Example-Left factoring
stmt → if expr then stmt else stmt | if expr then stmt
• After left factoring
stmt → if expr then stmt S’
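Left factoring of two alternatives can be sketched in a few lines. Production bodies are lists of symbols, the empty list stands for ε, and the name of the new nonterminal is supplied by the caller. The helper names are hypothetical.

```python
def common_prefix(a, b):
    """Longest common prefix of two symbol lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]

def left_factor_pair(head, body1, body2, new_head):
    """Rewrite A -> α β1 | α β2 as A -> α A', A' -> β1 | β2."""
    alpha = common_prefix(body1, body2)
    if not alpha:
        return {head: [body1, body2]}          # no common prefix: leave unchanged
    return {
        head: [alpha + [new_head]],
        new_head: [body1[len(alpha):], body2[len(alpha):]],   # [] stands for ε
    }

factored = left_factor_pair(
    "stmt",
    "if expr then stmt else stmt".split(),
    "if expr then stmt".split(),
    "S'",
)
```

For the stmt example the common prefix is if expr then stmt, so the new nonterminal S’ is left with the two remainders: else stmt and ε.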