Lecture 3: Syntax Analysis
The role of the parser
• performs context-free syntax analysis
• guides context-sensitive analysis
• constructs an intermediate representation
• produces meaningful error messages
• attempts error correction
Syntax analysis
Grammars are often written in Backus-Naur form (BNF).
Example:
1. <goal> ::= <expr>
2. <expr> ::= <expr> <op> <expr>
3.          | num
4.          | id
5. <op>   ::= +
6.          | -
7.          | *
In a BNF for a grammar, we represent
1. non-terminals with <angle brackets> or CAPITAL LETTERS
2. terminals with typewriter font or underline
3. productions as in the example
BNF Example
Write a BNF grammar for the language of Pascal variable declarations.
var i : integer;
var b : boolean;
var myfloat : real;
mychar : char;
x, y, z : integer;
Solution:
<vardecl> ::= var <vardecllist> ;
<vardecllist> ::= <varandtype> { ; <varandtype> }
<varandtype> ::= <ident> { , <ident> } : <typespec>
<ident> ::= <letter> { <idchar> }
Notational Conventions Used
• Terminals: a, b, c, … ∈ T
  specific terminals: 0, 1, id, +
• Nonterminals: A, B, C, … ∈ N
  specific nonterminals: expr, term, stmt
• Grammar symbols: X, Y, Z ∈ (N ∪ T)
• Strings of terminals: u, v, w, x, y, z ∈ T*
• Strings of grammar symbols: α, β, γ ∈ (N ∪ T)*
Scanning vs. parsing
Factoring out lexical analysis simplifies the compiler
term ::= [a-zA-Z] ( [a-zA-Z] | [0-9] )*
       | 0 | [1-9][0-9]*
op   ::= + | - | * | /
expr ::= (term op)* term
Where do we draw the line?
Regular expressions:
- Normally used to classify identifiers, numbers, keywords …
- Simpler and more concise for tokens than a grammar
- More efficient scanners can be built from REs
CFGs are used to impose structure
- Brackets: (), begin … end, if … then … else
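The division of labour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token names ID, NUM, OP and SKIP and the functions `scan` and `is_expr` are hypothetical, not part of the lecture. The scanner classifies characters with regular expressions; the syntax check then works on token kinds only.

```python
import re

# Token classes taken from the regular expressions on the slide.
# The names ID, NUM, OP and SKIP are illustrative assumptions.
TOKEN_SPEC = [
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUM", r"0|[1-9][0-9]*"),
    ("OP", r"[+\-*/]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return a list of (kind, lexeme) tokens, or raise on a bad character."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":  # whitespace is dropped by the scanner
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def is_expr(tokens):
    """Check expr ::= (term op)* term on the token kinds only."""
    kinds = [kind for kind, _ in tokens]
    if len(kinds) % 2 == 0:  # terms and ops must alternate, ending on a term
        return False
    terms_ok = all(k in ("ID", "NUM") for k in kinds[0::2])
    ops_ok = all(k == "OP" for k in kinds[1::2])
    return terms_ok and ops_ok
```

For example, `scan("x + 2 * y")` yields five tokens and `is_expr` accepts them, while `x + * y` scans fine but is rejected at the syntax level.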
Hierarchy of grammar classes
© Oscar Nierstrasz, Parsing
LL(k):
- Left-to-right, Leftmost derivation, k tokens lookahead
LR(k):
- Left-to-right, Rightmost derivation, k tokens lookahead
SLR:
- Simple LR (uses “follow sets”)
LALR:
Derivations
We can view the productions of a CFG as rewriting rules.
<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <expr> <op> <expr> <op> <expr>
       ⇒ <id,x> <op> <expr> <op> <expr>
       ⇒ <id,x> + <expr> <op> <expr>
       ⇒ <id,x> + <num,2> <op> <expr>
       ⇒ <id,x> + <num,2> * <expr>
       ⇒ <id,x> + <num,2> * <id,y>
We have derived the sentence: x + 2 * y
We denote this derivation (or parse) as: <goal> ⇒* id + num * id
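The rewriting view can be made concrete with a small sketch. It assumes the seven numbered productions of the expression grammar given earlier; the function name `derive_leftmost` and the list-of-symbols representation are illustrative choices, not something the lecture prescribes. Each step replaces the leftmost nonterminal with the right-hand side of the chosen production.

```python
# Productions of the expression grammar, numbered as on the slide.
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "expr"]),
    3: ("expr", ["num"]),
    4: ("expr", ["id"]),
    5: ("op", ["+"]),
    6: ("op", ["-"]),
    7: ("op", ["*"]),
}
NONTERMINALS = {"goal", "expr", "op"}

def derive_leftmost(start, choices):
    """Apply the numbered productions, in order, to the leftmost nonterminal.

    Returns every sentential form of the derivation, starting with [start].
    """
    form = [start]
    steps = [list(form)]
    for number in choices:
        lhs, rhs = PRODUCTIONS[number]
        # Locate the leftmost nonterminal in the current sentential form.
        index = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[index] != lhs:
            raise ValueError(f"production {number} does not apply to {form[index]}")
        form[index:index + 1] = rhs  # rewrite in place
        steps.append(list(form))
    return steps

# The derivation of x + 2 * y shown above uses productions 1,2,2,4,5,3,7,4.
STEPS = derive_leftmost("goal", [1, 2, 2, 4, 5, 3, 7, 4])
```

The last entry of `STEPS` is the token-level sentence `id + num * id`.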
New grammar with some additions to force precedence
1. <goal>   ::= <expr>
2. <expr>   ::= <expr> + <term>
3.            | <expr> - <term>
4.            | <term>
5. <term>   ::= <term> * <factor>
6.            | <term> / <factor>
7.            | <factor>
8. <factor> ::= num
9.            | id
Forcing the desired precedence
Now, for the string x + 2 * y:
<goal> ⇒ <expr>
       ⇒ <expr> + <term>
       ⇒ <expr> + <term> * <factor>
       ⇒ <expr> + <term> * <id,y>
       ⇒ <expr> + <factor> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <term> + <num,2> * <id,y>
       ⇒ <factor> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>
Role of the Parser
• Not all sequences of tokens are programs.
• The parser must distinguish between valid and invalid sequences of tokens.
Parsing: the big picture
Top-down versus bottom-up
• Top-down parser:
  • starts at the root of derivation tree and fills in
  • picks a production and tries to match the input
  • may require backtracking
  • some grammars are backtrack-free (predictive)
• Bottom-up parser:
  • starts at the leaves and fills in
  • starts in a state valid for legal first tokens
  • as input is consumed, changes state to encode possibilities (recognize valid prefixes)
Top-Down Parsing
• LL methods (Left-to-right, Leftmost derivation) and recursive-descent parsing
Grammar:
E → T + T
T → ( E )
T → - E
T → id
Leftmost derivation:
E ⇒lm T + T
  ⇒lm id + T
  ⇒lm id + id
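A recursive-descent recognizer for this small grammar might look like the sketch below. It assumes the input is already a list of token strings and that "$" marks the end of input; the class and method names are hypothetical. Each nonterminal becomes one method, and one token of lookahead selects the T-production.

```python
class Parser:
    """Recursive-descent recognizer for E -> T + T, T -> ( E ) | - E | id."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" as end marker is an assumption
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        """Consume one token, which must equal `expected`."""
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_E(self):
        # E -> T + T
        self.parse_T()
        self.match("+")
        self.parse_T()

    def parse_T(self):
        # The lookahead token decides which T-production to expand.
        tok = self.peek()
        if tok == "(":
            self.match("(")
            self.parse_E()
            self.match(")")
        elif tok == "-":
            self.match("-")
            self.parse_E()
        elif tok == "id":
            self.match("id")
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

def accepts(tokens):
    """Return True if the whole token list is derivable from E."""
    parser = Parser(tokens)
    try:
        parser.parse_E()
        parser.match("$")
        return True
    except SyntaxError:
        return False
```

The order of the calls made while accepting `id + id` mirrors the leftmost derivation shown above: E is expanded first, then the left T, then the right T.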
Left Recursion
• Productions of the form
    A → A α | β | γ
  are left recursive
• When one of the productions in a grammar is left recursive, a predictive parser loops forever on certain inputs
Left Recursive Grammar
A grammar is said to be left-recursive if it has a non-terminal A such that there is a derivation A ⇒+ A α, for some string α.
Consider the grammar:
(i) A → A α | β
A top-down parser expanding A can keep choosing A → A α without consuming any input, so it can go into an infinite loop.
Corresponding grammar without left recursion:
A → β R
R → α R | ε
Elimination of left recursion
Rewrite every left-recursive production
A → A α | β | γ | A δ
into a right-recursive production:
A → β A_R | γ A_R
A_R → α A_R | δ A_R | ε
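The rewrite can be sketched as a small function. This handles immediate left recursion only, and the representation is an assumption of the sketch: each alternative is a list of symbols, an empty list stands for ε, and the new nonterminal is named by appending "_R".

```python
def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a1 | ... | b1 | ... as A -> b A_R, A_R -> a A_R | ε.

    Only immediate left recursion is handled. Each alternative is a list
    of symbols; [] stands for ε. Naming the helper by appending "_R" is
    an assumption of this sketch.
    """
    # Tails α of the left-recursive alternatives A -> A α.
    recursive = [alt[1:] for alt in alternatives if alt[:1] == [nonterminal]]
    # Alternatives β that do not start with A.
    others = [alt for alt in alternatives if alt[:1] != [nonterminal]]
    if not recursive:
        return {nonterminal: alternatives}  # nothing to do
    tail = nonterminal + "_R"
    return {
        nonterminal: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],
    }

# E -> E + T | T  becomes  E -> T E_R,  E_R -> + T E_R | ε
RESULT = eliminate_left_recursion("E", [["E", "+", "T"], ["T"]])
```

Indirect left recursion (through another nonterminal) would need the nonterminals to be ordered and substituted first; that step is outside this sketch.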
Eliminate Left recursion
1- E → E + T | T
2- Expr → Expr + Term | Expr - Term | Term
   Term → Term * Factor | Term ∕ Factor | Factor
   Factor → ( Expr ) | Num | Identifier
3- A → B C | a
   B → C A | A b
Left Factoring
• When a nonterminal has two or more productions whose right-hand sides start with the same grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing
• Replace productions
    A → α β1 | α β2 | … | α βn | γ
  with
    A → α A_R | γ
    A_R → β1 | β2 | … | βn
Remove Left Factoring
1- S → if E then S else S
   S → if E then S
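One pass of the replacement rule can be sketched as follows, using the two `if` productions as input. The representation (lists of symbols, an empty list for ε) and the helper names `S_R1`, `S_R2`, … are assumptions of this sketch; a full implementation would repeat the pass until no two alternatives share a prefix.

```python
def common_prefix(sequences):
    """Longest list of symbols that every sequence starts with."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(nonterminal, alternatives):
    """One left-factoring pass over the alternatives of one nonterminal.

    A -> α β1 | α β2 | γ  becomes  A -> α A_R1 | γ  and  A_R1 -> β1 | β2.
    Helper names of the form A_R1, A_R2, ... are an assumption of this sketch.
    """
    # Group alternatives by their first symbol (None for an ε-alternative).
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[0] if alt else None, []).append(alt)

    result = {nonterminal: []}
    counter = 0
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nonterminal].extend(group)  # nothing to factor
            continue
        counter += 1
        helper = f"{nonterminal}_R{counter}"
        prefix = common_prefix(group)
        result[nonterminal].append(prefix + [helper])
        result[helper] = [alt[len(prefix):] for alt in group]
    return result

FACTORED = left_factor("S", [
    ["if", "E", "then", "S", "else", "S"],
    ["if", "E", "then", "S"],
])
```

Here the shared prefix `if E then S` is factored out and the helper nonterminal derives either `else S` or ε.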
Predictive Parsing
• If a top-down parser picks the wrong production, it may need to backtrack
• Alternative is to look ahead in input and use context to pick correctly
• Fortunately, large classes of CFGs can be parsed with limited lookahead
• Most programming language constructs fall in those subclasses
Predictive Parsing
• Eliminate left recursion from grammar
• Left factor the grammar
• Compute FIRST and FOLLOW
• Two variants:
  • Recursive (recursive-descent parsing)
  • Non-recursive (table-driven parsing)
Predictive Parsing
FIRST Sets:
For some rhs α ∈ G, define FIRST(α) as the set of tokens that appear as the first symbol in some string that derives from α.
Predictive Parsing
That is, x ∈ FIRST(α) iff α ⇒* x γ for some γ.
FIRST Set
• FIRST(α) = { the set of terminals that begin all strings derived from α }
FIRST(a) = { a }  if a ∈ T
FIRST(ε) = { ε }
FIRST(A) = ∪ FIRST(α)  for A → α ∈ P
FIRST(X1 X2 … Xk) =
  if for all j = 1, …, i-1 : ε ∈ FIRST(Xj) then
    add non-ε in FIRST(Xi) to FIRST(X1 X2 … Xk)
  if for all j = 1, …, k : ε ∈ FIRST(Xj) then
    add ε to FIRST(X1 X2 … Xk)
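These rules are usually applied repeatedly until no set changes. The sketch below does that under a few assumptions: a grammar is a dict from nonterminal to a list of right-hand sides, an empty list stands for ε, and the test grammar is a right-recursive expression grammar (E → T E_R, E_R → + T E_R | ε, T → F T_R, T_R → * F T_R | ε, F → ( E ) | id) chosen purely as input.

```python
EPS = "ε"

def first_of_string(symbols, first):
    """FIRST(X1 X2 … Xk), given the FIRST set of every single symbol."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}   # add the non-ε members of FIRST(Xi)
        if EPS not in first[sym]:
            return result              # Xi cannot vanish, so stop here
    result.add(EPS)                    # every Xj can derive ε (or k = 0)
    return result

def compute_first(grammar, terminals):
    """Iterate the FIRST rules to a fixed point.

    `grammar` maps each nonterminal to a list of right-hand sides
    (lists of symbols); an empty list stands for ε.
    """
    first = {t: {t} for t in terminals}            # FIRST(a) = {a}
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

GRAMMAR = {
    "E": [["T", "E_R"]],
    "E_R": [["+", "T", "E_R"], []],
    "T": [["F", "T_R"]],
    "T_R": [["*", "F", "T_R"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = compute_first(GRAMMAR, {"+", "*", "(", ")", "id"})
```

For this input the nonterminals E, T and F all get the FIRST set { (, id }, while E_R and T_R get { +, ε } and { *, ε }.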
Predictive Parsing
FOLLOW(α) is the set of all words in the grammar that can legally appear after an α.
FOLLOW
• FOLLOW(A) = { the set of terminals that can immediately follow nonterminal A }
FOLLOW(A) =
  for all (B → α A β) ∈ P do
    add FIRST(β)\{ε} to FOLLOW(A)
  for all (B → α A β) ∈ P and ε ∈ FIRST(β) do
    add FOLLOW(B) to FOLLOW(A)
  for all (B → α A) ∈ P do
    add FOLLOW(B) to FOLLOW(A)
  if A is the start symbol S then
    add $ to FOLLOW(A)
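The four rules can again be iterated to a fixed point. In this sketch the FIRST sets of the nonterminals are supplied as input rather than computed, to keep the example short; the grammar representation (dict of lists, empty list for ε) and all names are assumptions for illustration.

```python
EPS, END = "ε", "$"

def first_of_string(symbols, first):
    """FIRST of a symbol string; a symbol without an entry is a terminal."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result

def compute_follow(grammar, first, start):
    """Iterate the FOLLOW rules until no set grows any more."""
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                    # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue              # only nonterminals get FOLLOW sets
                    beta_first = first_of_string(rhs[i + 1:], first)
                    new = beta_first - {EPS}  # FIRST(β) \ {ε}
                    if EPS in beta_first:     # β is empty or can derive ε
                        new |= follow[lhs]    # so FOLLOW(B) flows into FOLLOW(A)
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

GRAMMAR = {
    "E": [["T", "E_R"]],
    "E_R": [["+", "T", "E_R"], []],
    "T": [["F", "T_R"]],
    "T_R": [["*", "F", "T_R"], []],
    "F": [["(", "E", ")"], ["id"]],
}
# FIRST sets of the nonterminals, supplied by hand for this sketch.
FIRST = {
    "E": {"(", "id"}, "E_R": {"+", EPS},
    "T": {"(", "id"}, "T_R": {"*", EPS},
    "F": {"(", "id"},
}
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
```

Because β = ε gives FIRST(β) = { ε }, the case B → α A needs no separate branch: it is covered by the test for ε in FIRST(β).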
Find First and Follow sets
CFG
expr → expr + term | term
term → term * factor | factor
factor → number | ( expr )
Reformed
E → E + T | T
T → T * F | F
F → ( E ) | id
LL(1) Grammar
• A grammar G is LL(1) if it is not left recursive and for each collection of productions
    A → α1 | α2 | … | αn
  for nonterminal A the following holds:
  1. FIRST(αi) ∩ FIRST(αj) = ∅ for all i ≠ j
  2. if αi ⇒* ε then
     2.a. αj ⇏* ε for all i ≠ j
     2.b. FIRST(αj) ∩ FOLLOW(A) = ∅ for all i ≠ j
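The pairwise conditions can be checked mechanically once the FIRST set of every alternative and FOLLOW(A) are known. The sketch below takes those sets as given; the function name `ll1_conflicts` and the conflict labels are hypothetical. It relies on the fact that αi ⇒* ε exactly when ε ∈ FIRST(αi), so condition 2.a shows up as ε in the intersection of condition 1.

```python
from itertools import combinations

EPS = "ε"

def ll1_conflicts(first_sets, follow):
    """Check the LL(1) conditions for one nonterminal A.

    `first_sets` lists FIRST(α1), …, FIRST(αn) for A → α1 | … | αn and
    `follow` is FOLLOW(A). Returns (i, j, kind) triples, 1-based;
    an empty list means the conditions hold for A.
    """
    conflicts = []
    for (i, fi), (j, fj) in combinations(enumerate(first_sets, start=1), 2):
        if fi & fj:
            # Condition 1 (and 2.a when the shared element is ε).
            conflicts.append((i, j, "FIRST/FIRST"))
        if EPS in fi and fj & follow:
            conflicts.append((i, j, "FIRST/FOLLOW"))  # condition 2.b
        if EPS in fj and fi & follow:
            conflicts.append((i, j, "FIRST/FOLLOW"))  # 2.b, roles swapped
    return conflicts

# S → a S | a : both alternatives start with a.
print(ll1_conflicts([{"a"}, {"a"}], {"$"}))
# E_R → + T E_R | ε with FOLLOW(E_R) = { $, ) } : no conflict.
print(ll1_conflicts([{"+"}, {EPS}], {"$", ")"}))
```

The check does not detect left recursion by itself; that part of the definition has to be tested separately.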
Non-LL(1) Examples
Grammar       | Not LL(1) because:
S → S a | a   | Left recursive
S → a S | a   | FIRST(a S) ∩ FIRST(a) ≠ ∅
S → a R | ε   | For R: S ⇒* ε and ε ⇒* ε
R → S | ε     |
S → a R a     |
Example Table
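E → T E_R
E_R → + T E_R | ε
T → F T_R
T_R → * F T_R | ε
F → ( E ) | id
A → α           | FIRST(α) | FOLLOW(A)
E → T E_R       | ( id     | $ )
E_R → + T E_R   | +        | $ )
E_R → ε         | ε        | $ )
T → F T_R       | ( id     | + $ )
T_R → * F T_R   | *        | + $ )
T_R → ε         | ε        | + $ )
F → ( E )       | (        | * + $ )
F → id          | id       | * + $ )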
Example Table
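E → T E_R
E_R → + T E_R | ε
T → F T_R
T_R → * F T_R | ε
F → ( E ) | id
    | id        | +             | *             | (         | )       | $
E   | E → T E_R |               |               | E → T E_R |         |
E_R |           | E_R → + T E_R |               |           | E_R → ε | E_R → ε
T   | T → F T_R |               |               | T → F T_R |         |
T_R |           | T_R → ε       | T_R → * F T_R |           | T_R → ε | T_R → ε
F   | F → id    |               |               | F → ( E ) |         |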
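A table like this one can drive a non-recursive predictive parser with an explicit stack. The sketch below hard-codes the table for the grammar above; the entries for F are derived here from the grammar (F → ( E ) | id) and, like the dictionary layout and the function name `parse`, are assumptions of this sketch.

```python
# LL(1) table for E → T E_R, E_R → + T E_R | ε, T → F T_R,
# T_R → * F T_R | ε, F → ( E ) | id.  An empty list stands for ε.
# The two F entries are derived from the grammar for this sketch.
TABLE = {
    ("E", "id"): ["T", "E_R"], ("E", "("): ["T", "E_R"],
    ("E_R", "+"): ["+", "T", "E_R"], ("E_R", ")"): [], ("E_R", "$"): [],
    ("T", "id"): ["F", "T_R"], ("T", "("): ["F", "T_R"],
    ("T_R", "+"): [], ("T_R", "*"): ["*", "F", "T_R"],
    ("T_R", ")"): [], ("T_R", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E_R", "T", "T_R", "F"}

def parse(tokens, start="E"):
    """Return the productions used, in leftmost-derivation order, or raise."""
    tokens = list(tokens) + ["$"]      # end-of-input marker
    stack = ["$", start]               # start symbol on top of the end marker
    pos, used = 0, []
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:            # empty table cell: syntax error
                raise SyntaxError(f"no entry for ({top}, {look})")
            used.append((top, rhs))
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        elif top == look:
            pos += 1                   # terminal matches: consume it
        else:
            raise SyntaxError(f"expected {top!r}, found {look!r}")
    return used

USED = parse(["id", "+", "id", "*", "id"])
```

Each table lookup replaces the recursive call of a recursive-descent parser, which is why the two variants accept the same language for an LL(1) grammar.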