Syntax Analysis

(1)

Lecture 3: Syntax Analysis

(2)

The role of the parser

• performs context-free syntax analysis
• guides context-sensitive analysis
• constructs an intermediate representation
• produces meaningful error messages
• attempts error correction


(3)

Syntax analysis

Grammars are often written in Backus-Naur form (BNF).

Example:

1. <goal> ::= <expr>
2. <expr> ::= <expr> <op> <expr>
3.         |  num
4.         |  id
5. <op>   ::= +
6.         |  -
7.         |  *

In a BNF for a grammar, we represent

1. non-terminals with <angle brackets> or CAPITAL LETTERS
2. terminals with typewriter font or underline
3. productions as in the example

(4)

BNF Example

Write a BNF grammar for the language of Pascal variable declarations.

var i : integer;

var b : boolean;

var myfloat : real;

mychar : char;

x, y, z : integer;

Solution:

<vardecl> ::= var <vardecllist> ;

<vardecllist> ::= <varandtype> { ; <varandtype> }

<ident> ::= <letter> { <idchar> }

(5)

Notational Conventions Used

• Terminals: a, b, c, … ∈ T
  specific terminals: 0, 1, id, +
• Nonterminals: A, B, C, … ∈ N
  specific nonterminals: expr, term, stmt
• Grammar symbols: X, Y, Z ∈ (N ∪ T)
• Strings of terminals: u, v, w, x, y, z ∈ T*
• Strings of grammar symbols: α, β, γ ∈ (N ∪ T)*

(6)

Scanning vs. parsing

Factoring out lexical analysis simplifies the compiler

term ::= [a-zA-Z] ( [a-zA-Z] | [0-9] )*
      |  0 | [1-9][0-9]*
op   ::= + | - | * | /
expr ::= (term op)* term

Where do we draw the line?

Regular expressions:
- Normally used to classify identifiers, numbers, keywords …
- Simpler and more concise for tokens than a grammar
- More efficient scanners can be built from REs

CFGs are used to impose structure
- Brackets: (), begin … end, if … then … else
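The split between scanner and parser can be sketched in a few lines of Python. This is a minimal sketch, not a production scanner: the token class names (id, num, op), the helper names scan and is_expr, and the rule that whitespace is skipped are assumptions made for illustration; only the regular expressions come from the slide.

```python
import re

# Token classes taken from the regular expressions on this slide.
TOKEN_SPEC = [
    ("id",   r"[a-zA-Z][a-zA-Z0-9]*"),
    ("num",  r"0|[1-9][0-9]*"),
    ("op",   r"[+\-*/]"),
    ("skip", r"\s+"),   # assumption: whitespace separates tokens and is dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(text):
    """Return a list of (kind, lexeme) pairs, or raise on an illegal character."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def is_expr(tokens):
    """Check the token kinds against expr ::= (term op)* term."""
    kinds = [kind for kind, _ in tokens]
    if len(kinds) % 2 == 0:          # a valid expr has an odd number of tokens
        return False
    return all(k in ("id", "num") if i % 2 == 0 else k == "op"
               for i, k in enumerate(kinds))

print(scan("x + 2 * y"))
print(is_expr(scan("x + 2 * y")))
```

The scanner only classifies characters into tokens; the structural check works on token kinds alone, which is the simplification the slide refers to.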

(7)

Hierarchy of grammar classes

LL(k):
- Left-to-right, Leftmost derivation, k tokens lookahead

LR(k):
- Left-to-right, Rightmost derivation (in reverse), k tokens lookahead

SLR:
- Simple LR (uses "follow sets")

LALR:
- Look-Ahead LR (uses "lookahead sets")

(8)

Derivations

We can view the productions of a CFG as rewriting rules.

<goal> ⇒ <expr>
       ⇒ <expr> <op> <expr>
       ⇒ <expr> <op> <expr> <op> <expr>
       ⇒ <id,x> <op> <expr> <op> <expr>
       ⇒ <id,x> + <expr> <op> <expr>
       ⇒ <id,x> + <num,2> <op> <expr>
       ⇒ <id,x> + <num,2> * <expr>
       ⇒ <id,x> + <num,2> * <id,y>

We have derived the sentence: x + 2 * y

We denote this derivation (or parse) as: <goal> ⇒* id + num * id
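The rewriting view can be made concrete with a short sketch. The rule numbers follow the seven productions of the BNF example; the function names leftmost_step and derive, and the convention that nonterminals are the symbols starting with "<", are assumptions of this sketch.

```python
# Productions numbered as in the BNF example (1-7).
RULES = {
    1: ("<goal>", ["<expr>"]),
    2: ("<expr>", ["<expr>", "<op>", "<expr>"]),
    3: ("<expr>", ["num"]),
    4: ("<expr>", ["id"]),
    5: ("<op>", ["+"]),
    6: ("<op>", ["-"]),
    7: ("<op>", ["*"]),
}

def leftmost_step(form, rule_no):
    """Rewrite the leftmost nonterminal of `form` using production `rule_no`."""
    lhs, rhs = RULES[rule_no]
    for i, sym in enumerate(form):
        if sym.startswith("<"):              # first nonterminal from the left
            if sym != lhs:
                raise ValueError(f"rule {rule_no} does not apply to {sym}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left to rewrite")

def derive(rule_sequence, start="<goal>"):
    """Return every sentential form produced by applying the rules in order."""
    forms = [[start]]
    for rule_no in rule_sequence:
        forms.append(leftmost_step(forms[-1], rule_no))
    return forms

# The rule sequence that reproduces the derivation of x + 2 * y.
for form in derive([1, 2, 2, 4, 5, 3, 7, 4]):
    print(" ".join(form))
```

Each printed line is one sentential form; the last one is id + num * id.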

(9)

A new grammar with additions to force precedence

1. <goal>   ::= <expr>
2. <expr>   ::= <expr> + <term>
3.           |  <expr> - <term>
4.           |  <term>
5. <term>   ::= <term> * <factor>
6.           |  <term> / <factor>
7.           |  <factor>
8. <factor> ::= num
9.           |  id

(10)

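Forcing the desired precedence

Now, for the string: x + 2 * y

<goal> ⇒ <expr>
       ⇒ <expr> + <term>
       ⇒ <expr> + <term> * <factor>
       ⇒ <expr> + <term> * <id,y>
       ⇒ <expr> + <factor> * <id,y>
       ⇒ <expr> + <num,2> * <id,y>
       ⇒ <term> + <num,2> * <id,y>
       ⇒ <factor> + <num,2> * <id,y>
       ⇒ <id,x> + <num,2> * <id,y>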
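The effect of the layered grammar can be checked with a small evaluator. This is a sketch under stated assumptions: the function name evaluate, the token encoding (numbers as Python ints, identifiers as strings looked up in env), and the use of loops in place of the left-recursive productions are choices made here, not part of the lecture.

```python
def evaluate(tokens, env):
    """Evaluate a token list with one function per grammar level:
         expr   -> expr (+|-) term   | term
         term   -> term (*|/) factor | factor
         factor -> num | id
    The left recursion is realized with loops, which keeps the
    operators left-associative."""
    pos = 0

    def factor():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return env[tok] if isinstance(tok, str) else tok

    def term():
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    return expr()

# x + 2 * y with x = 1, y = 3: the product is grouped first.
print(evaluate(["x", "+", 2, "*", "y"], {"x": 1, "y": 3}))
```

Because a term is completed before expr looks for + or -, the multiplication in x + 2 * y is grouped below the addition, which is the tree shape the grammar is meant to force.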

(11)

Role of the Parser

• Not all sequences of tokens are programs.
• The parser must distinguish between valid and invalid sequences of tokens.

(12)

Parsing: the big picture

(13)

Top-down versus bottom-up

• Top-down parser:
  - starts at the root of the derivation tree and fills in
  - picks a production and tries to match the input
  - may require backtracking
  - some grammars are backtrack-free (predictive)
• Bottom-up parser:
  - starts at the leaves and fills in
  - starts in a state valid for legal first tokens
  - as input is consumed, changes state to encode possibilities (recognize valid prefixes)

(14)

Top-Down Parsing

• LL methods (Left-to-right, Leftmost derivation) and recursive-descent parsing

Grammar:
E → T + T
T → ( E )
T → - E
T → id

Leftmost derivation:
E ⇒_lm T + T
  ⇒_lm id + T
  ⇒_lm id + id

(15)

Left Recursion

• Productions of the form
    A → A α | β | γ
  are left recursive
• When one of the productions in a grammar is left recursive, a predictive parser loops forever on certain inputs

(16)

Left Recursive Grammar

A grammar is said to be left-recursive if it has a non-terminal A such that there is a derivation A ⇒+ A α for some string α.

Consider the grammar:

(i) A → A a | b

A top-down parser can go into an infinite loop here: to expand A it first expands A again without consuming any input.

Corresponding grammar without left recursion:

A → b R
R → a R | ε

(17)

Elimination of left recursion

Rewrite every left-recursive production

A → A α
  | β
  | γ
  | A δ

into a right-recursive production:

A   → β A_R
    | γ A_R
A_R → α A_R
    | δ A_R
    | ε
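The rewrite can be sketched as a small function. It is a minimal sketch that handles only immediate left recursion for a single nonterminal; the function name, the list-of-lists grammar encoding, and the use of [] for ε are assumptions made for illustration.

```python
def eliminate_immediate_left_recursion(nonterminal, alternatives):
    """Split A -> A a1 | ... | b1 | ... into A -> b A_R and A_R -> a A_R | eps.

    `alternatives` is a list of right-hand sides, each a list of symbols.
    The empty list [] stands for epsilon. The fresh nonterminal is named
    by appending "_R", following the slide's notation.
    """
    new_nt = nonterminal + "_R"
    # Tails of the left-recursive alternatives (the alpha and delta parts).
    recursive = [alt[1:] for alt in alternatives if alt[:1] == [nonterminal]]
    # Alternatives that do not start with A (the beta and gamma parts).
    others = [alt for alt in alternatives if alt[:1] != [nonterminal]]
    if not recursive:
        return {nonterminal: alternatives}      # nothing to rewrite
    return {
        nonterminal: [alt + [new_nt] for alt in others],
        new_nt: [tail + [new_nt] for tail in recursive] + [[]],
    }

# The general pattern from the slide: A -> A α | β | γ | A δ
print(eliminate_immediate_left_recursion(
    "A", [["A", "α"], ["β"], ["γ"], ["A", "δ"]]))
```

Indirect left recursion (A reaching itself through another nonterminal) is outside the scope of this sketch.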

(18)

Eliminate Left recursion

1- E → E + T | T

2- Expr → Expr + Term | Expr - Term | Term
   Term → Term * Factor | Term / Factor | Factor
   Factor → ( Expr ) | Num | Identifier

3- A → B C | a
   B → C A | A b

(19)

Left Factoring

• When a nonterminal has two or more productions whose right-hand sides start with the same grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing
• Replace productions
    A → α β₁ | α β₂ | … | α β_n | γ
  with
    A → α A_R | γ
    A_R → β₁ | β₂ | … | β_n

(20)

Apply Left Factoring

Left factor the following grammar:

1- S → if E then S else S
   S → if E then S
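One way to work this exercise mechanically is sketched below. The function names common_prefix and left_factor, the list-of-lists encoding, and the use of [] for ε are assumptions of this sketch; it performs a single round of factoring and may need to be repeated on the new nonterminal for other grammars.

```python
def common_prefix(seqs):
    """Longest common prefix of a list of symbol lists."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(nonterminal, alternatives):
    """One round of left factoring: alternatives that share a first symbol
    are replaced by  A -> prefix A_R  and  A_R -> the differing tails."""
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[0] if alt else None, []).append(alt)
    result = {nonterminal: []}
    count = 0
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nonterminal].extend(group)    # nothing to factor
            continue
        count += 1
        new_nt = f"{nonterminal}_R{count if count > 1 else ''}"
        prefix = common_prefix(group)
        result[nonterminal].append(prefix + [new_nt])
        result[new_nt] = [alt[len(prefix):] for alt in group]
    return result

print(left_factor("S", [
    ["if", "E", "then", "S", "else", "S"],
    ["if", "E", "then", "S"],
]))
```

For the grammar above this yields S → if E then S S_R and S_R → else S | ε.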

(21)

Predictive Parsing

• If a top-down parser picks the wrong production, it may need to backtrack
• The alternative is to look ahead in the input and use context to pick correctly
• Fortunately, large classes of CFGs can be parsed with limited lookahead
• Most programming-language constructs fall in those subclasses

(22)

Predictive Parsing

• Eliminate left recursion from grammar
• Left factor the grammar
• Compute FIRST and FOLLOW
• Two variants:
  - Recursive (recursive-descent parsing)
  - Non-recursive (table-driven parsing)
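The recursive variant can be sketched for the small grammar E → T + T, T → ( E ) | - E | id from the Top-Down Parsing slide. This is an illustrative recognizer only: the names Parser, ParseError and accepts are hypothetical, tokens are assumed to arrive as plain strings, and "$" is assumed as the end-of-input marker.

```python
class ParseError(Exception):
    pass

class Parser:
    """Recursive-descent recognizer for
         E -> T + T
         T -> ( E ) | - E | id
    with one procedure per nonterminal."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise ParseError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def parse_E(self):              # E -> T + T
        self.parse_T()
        self.match("+")
        self.parse_T()

    def parse_T(self):
        tok = self.peek()           # one token of lookahead picks the production
        if tok == "(":              # T -> ( E )
            self.match("(")
            self.parse_E()
            self.match(")")
        elif tok == "-":            # T -> - E
            self.match("-")
            self.parse_E()
        elif tok == "id":           # T -> id
            self.match("id")
        else:
            raise ParseError(f"unexpected token {tok!r}")

def accepts(tokens):
    """Return True iff the whole token list derives from E."""
    parser = Parser(tokens)
    try:
        parser.parse_E()
        parser.match("$")
        return True
    except ParseError:
        return False

print(accepts(["id", "+", "id"]))
print(accepts(["id", "+"]))
```

No backtracking is needed because each alternative of T starts with a different terminal.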

(23)

Predictive Parsing

FIRST sets:

For some rhs α ∈ G, define FIRST(α) as the set of tokens that appear as the first symbol in some string that derives from α.

(24)

Predictive Parsing

That is, x ∈ FIRST(α) iff α ⇒* x γ for some γ.

(25)

FIRST Set

• FIRST(α) = { the set of terminals that begin all strings derived from α }

FIRST(a) = {a}   if a ∈ T
FIRST(ε) = {ε}
FIRST(A) = ∪_{A → α} FIRST(α)   for A → α ∈ P
FIRST(X₁ X₂ … X_k) =
  if for all j = 1, …, i-1 : ε ∈ FIRST(X_j) then
    add non-ε in FIRST(X_i) to FIRST(X₁ X₂ … X_k)
  if for all j = 1, …, k : ε ∈ FIRST(X_j) then
    add ε to FIRST(X₁ X₂ … X_k)
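These rules translate into a fixed-point computation. The sketch below is one possible reading, not a reference implementation: the grammar encoding (a dict from nonterminal to a list of right-hand sides, with [] for ε), the function names, and the convention that any symbol without productions is a terminal are assumptions. The grammar used is the left-recursion-free expression grammar from this lecture.

```python
EPS = "ε"

def first_of_string(symbols, first):
    """FIRST(X1 X2 ... Xk), given the FIRST sets of the nonterminals."""
    out = set()
    for x in symbols:
        fx = first[x] if x in first else {x}   # a terminal's FIRST is itself
        out |= fx - {EPS}
        if EPS not in fx:
            return out                          # X can not vanish: stop here
    out.add(EPS)            # every Xj can derive ε (or the string is empty)
    return out

def compute_first(grammar):
    """Iterate the rules until no FIRST set grows any further."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first

GRAMMAR = {
    "E":   [["T", "E_R"]],
    "E_R": [["+", "T", "E_R"], []],
    "T":   [["F", "T_R"]],
    "T_R": [["*", "F", "T_R"], []],
    "F":   [["(", "E", ")"], ["id"]],
}

for nonterminal, symbols in compute_first(GRAMMAR).items():
    print(nonterminal, sorted(symbols))
```

The loop is needed because FIRST(E) depends on FIRST(T), which depends on FIRST(F); the sets are only final once a full pass adds nothing.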

(26)

Predictive Parsing

FOLLOW(α) is the set of all words in the grammar that can legally appear immediately after an α.

(27)

FOLLOW

• FOLLOW(A) = { the set of terminals that can immediately follow nonterminal A }

FOLLOW(A) =
  for all (B → α A β) ∈ P do
    add FIRST(β)\{ε} to FOLLOW(A)
  for all (B → α A β) ∈ P and ε ∈ FIRST(β) do
    add FOLLOW(B) to FOLLOW(A)
  for all (B → α A) ∈ P do
    add FOLLOW(B) to FOLLOW(A)
  if A is the start symbol S then
    add $ to FOLLOW(A)
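The FOLLOW rules can be sketched the same way, as an iteration to a fixed point. The encoding and names below are assumptions of this sketch (dict-of-lists grammar, [] for ε, "$" as end marker); FIRST is recomputed inside so that the block stands on its own.

```python
EPS, END = "ε", "$"

def first_of_string(symbols, first):
    """FIRST of a symbol string; symbols without a FIRST entry are terminals."""
    out = set()
    for x in symbols:
        fx = first.get(x, {x})
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    return out | {EPS}

def compute_first(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first) - first[a]
                if new:
                    first[a] |= new
                    changed = True
    return first

def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {a: set() for a in grammar}
    follow[start].add(END)                   # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for b, alternatives in grammar.items():
            for rhs in alternatives:
                for i, a in enumerate(rhs):
                    if a not in grammar:     # only nonterminals have FOLLOW
                        continue
                    beta = first_of_string(rhs[i + 1:], first)
                    new = beta - {EPS}       # FIRST(β) \ {ε}
                    if EPS in beta:          # β is empty or can vanish
                        new |= follow[b]
                    if not new <= follow[a]:
                        follow[a] |= new
                        changed = True
    return follow

GRAMMAR = {
    "E":   [["T", "E_R"]],
    "E_R": [["+", "T", "E_R"], []],
    "T":   [["F", "T_R"]],
    "T_R": [["*", "F", "T_R"], []],
    "F":   [["(", "E", ")"], ["id"]],
}

for nonterminal, symbols in compute_follow(GRAMMAR, "E").items():
    print(nonterminal, sorted(symbols))
```

The second and third rules collapse into one test here, because FIRST of the empty string is {ε}: a nonterminal at the end of a right-hand side is the case β = ε.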

(28)

Find First and Follow sets

CFG

expr → expr + term | term
term → term * factor | factor
factor → number | ( expr )

Reformed

E → E + T | T
T → T * F | F
F → ( E ) | id

(29)

LL(1) Grammar

• A grammar G is LL(1) if it is not left recursive and for each collection of productions
    A → α₁ | α₂ | … | α_n
  for nonterminal A the following holds:

  1. FIRST(α_i) ∩ FIRST(α_j) = ∅ for all i ≠ j
  2. if α_i ⇒* ε then
     2.a. α_j ⇏* ε for all i ≠ j
     2.b. FIRST(α_j) ∩ FOLLOW(A) = ∅ for all i ≠ j

(30)

Non-LL(1) Examples

Grammar            Not LL(1) because:

S → S a | a        Left recursive

S → a S | a        FIRST(a S) ∩ FIRST(a) ≠ ∅

S → a R | ε        For R: S ⇒* ε and ε ⇒* ε
R → S | ε

S → a R a

(31)

Example Table

E → T E_R
E_R → + T E_R | ε
T → F T_R
T_R → * F T_R | ε
F → ( E ) | id

A → α              FIRST(α)    FOLLOW(A)
E → T E_R          ( id        $ )
E_R → + T E_R      +           $ )
E_R → ε            ε           $ )
T → F T_R          ( id        + $ )
T_R → * F T_R      *           + $ )
T_R → ε            ε           + $ )
F → ( E )          (           * + $ )
F → id             id          * + $ )
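The FIRST and FOLLOW sets for this grammar are enough to fill a predictive parse table and run the non-recursive (table-driven) parser. The sketch below assumes the usual construction: enter A → α under every terminal in FIRST(α), and under every symbol in FOLLOW(A) when ε ∈ FIRST(α). The names ROWS, build_table and parse are hypothetical, the F → id row is computed from the grammar, and tokens are assumed to be plain strings.

```python
EPS, END = "ε", "$"

# (A, α, FIRST(α), FOLLOW(A)) for each production; α = [] stands for ε.
ROWS = [
    ("E",   ["T", "E_R"],      {"(", "id"}, {"$", ")"}),
    ("E_R", ["+", "T", "E_R"], {"+"},       {"$", ")"}),
    ("E_R", [],                {EPS},       {"$", ")"}),
    ("T",   ["F", "T_R"],      {"(", "id"}, {"+", "$", ")"}),
    ("T_R", ["*", "F", "T_R"], {"*"},       {"+", "$", ")"}),
    ("T_R", [],                {EPS},       {"+", "$", ")"}),
    ("F",   ["(", "E", ")"],   {"("},       {"*", "+", "$", ")"}),
    ("F",   ["id"],            {"id"},      {"*", "+", "$", ")"}),
]

def build_table(rows):
    """M[A, a] = α for a in FIRST(α); if ε in FIRST(α), also for a in FOLLOW(A)."""
    table = {}
    for a, alpha, first, follow in rows:
        lookaheads = (first - {EPS}) | (follow if EPS in first else set())
        for tok in lookaheads:
            if (a, tok) in table:
                raise ValueError(f"conflict at M[{a}, {tok}]: grammar is not LL(1)")
            table[a, tok] = alpha
    return table

def parse(tokens, table, start="E"):
    """Table-driven predictive parse; returns True iff `tokens` is accepted."""
    nonterminals = {a for a, _ in table}
    stack = [END, start]
    tokens = list(tokens) + [END]
    pos = 0
    while stack:
        top = stack.pop()
        tok = tokens[pos]
        if top in nonterminals:
            if (top, tok) not in table:
                return False                         # empty table entry: error
            stack.extend(reversed(table[top, tok]))  # push α, leftmost on top
        elif top == tok:
            pos += 1                                 # match a terminal (or $)
        else:
            return False
    return pos == len(tokens)

TABLE = build_table(ROWS)
print(parse(["id", "+", "id", "*", "id"], TABLE))
print(parse(["id", "+"], TABLE))
```

A conflict raised by build_table would mean two productions compete for the same table cell, which is exactly the situation the LL(1) conditions rule out.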

(32)

Example Table

E → T E_R
E_R → + T E_R | ε
T → F T_R
T_R → * F T_R | ε
F → ( E ) | id

      id           +                *                (            )          $
E     E → T E_R                                      E → T E_R
E_R                E_R → + T E_R                                  E_R → ε    E_R → ε
T     T → F T_R                                      T → F T_R
T_R                T_R → ε          T_R → * F T_R                 T_R → ε    T_R → ε
F     F → id                                         F → ( E )