Regular Grammars and Finite Automata - Compiler Construction

Figure 5.8a. The extra productions can be removed by a simple substitution: If $B \in N$ occurs exactly twice in a grammar, once in a production of the form $A \to \mu B \nu$ and once in a production of the form $B \to \beta$ ($\mu, \nu, \beta \in V^*$), then $B$ can be eliminated and the two productions replaced by $A \to \mu \beta \nu$. After all such substitutions have been made, the resulting grammar will differ from Figure 5.8a only in the representation of vocabulary symbols.

5.2 Regular Grammars and Finite Automata

A grammar specifies a process for generating sentences, and thus allows us to give a finite description of an infinite language. The analysis phase of the compiler, however, must recognize the phrase structure of a given sentence: It must parse the sentence. Assuming that the language has been described by a grammar, we are interested in techniques for automatically generating a recognizer from that grammar. There are two reasons for this requirement:

It provides a guarantee that the language recognized by the compiler is identical to that defined by the grammar.

It simplifies the task of the compiler writer.

We shall use automata, which we introduce as special cases of general rewriting systems, as models for the parsing process. In this section we develop a theoretical basis for regular languages and finite automata, and then extend the concepts and algorithms to context-free languages and pushdown automata in Section 5.3. The implementation of the automata is covered in Chapters 6 and 7.

5.2.1 Finite Automata

5.12 Definition

A finite automaton (finite state acceptor) is a quintuple $A = (T, Q, R, q_0, F)$, where $Q$ is a nonempty set, $(T \cup Q, R)$ is a general rewriting system, $q_0$ is an element of $Q$ and $F$ is a subset of $Q$. The sets $T$ and $Q$ are disjoint. Each element of $R$ has the form $qt \to q'$, where $q$ and $q'$ are elements of $Q$ and $t$ is an element of $T$. We say that $A$ accepts a set of strings $L(A) = \{\tau \in T^* \mid q_0\tau \Rightarrow^* q,\ q \in F\}$. Two automata, $A$ and $A'$, are equivalent if and only if $L(A) = L(A')$.

We can conceive of the finite automaton as a machine that reads a given input string out of a buffer one symbol at a time and changes its internal state upon absorbing each symbol. $Q$ is the set of internal states, with $q_0$ being the initial state and $F$ the set of final states. We say that a finite automaton is in state $q$ when the current string in the derivation has the form $q\tau$. It makes a transition from state $q$ to state $q'$ if $\tau = t\tau'$ and $qt \to q'$ is an element of $R$. Each state transition removes one symbol from the input string.
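The machine view of Definition 5.12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `accepts`, the dictionary encoding of $R$ and the tiny example automaton are all hypothetical and do not come from the text. Because several rules may share a left-hand side, the sketch tracks the set of all states reachable after each input symbol instead of committing to one of them.

```python
from typing import Dict, Set, Tuple

# Each rule qt -> q' of R is stored under the key (q, t). The value is a set
# because Definition 5.12 allows several rules with the same left-hand side.
Rules = Dict[Tuple[str, str], Set[str]]


def accepts(rules: Rules, start: str, final: Set[str], text: str) -> bool:
    """Return True if start+text can be rewritten to a single state in `final`."""
    current = {start}  # every state the automaton could be in so far
    for t in text:
        # Apply every rule qt -> q' whose left-hand side matches; each
        # transition removes exactly one symbol from the input.
        current = {q2 for q in current for q2 in rules.get((q, t), set())}
        if not current:  # no rule applies, so the derivation is stuck
            return False
    return bool(current & final)


# Hypothetical automaton: zero or more 'a' followed by exactly one 'b'.
EXAMPLE_RULES: Rules = {("p", "a"): {"p"}, ("p", "b"): {"r"}}

if __name__ == "__main__":
    print(accepts(EXAMPLE_RULES, "p", {"r"}, "aab"))
    print(accepts(EXAMPLE_RULES, "p", {"r"}, "aba"))
```

Tracking a set of states sidesteps the choice between competing rules; it is one possible way to run such an automaton, not the only one.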

5.13 Theorem

For every regular grammar, $G$, there exists a finite automaton, $A$, such that $L(A) = L(G)$. The proof of this theorem is an algorithm to construct $A$, given $G = (T, N, P, Z)$. Let $A = (T, N \cup \{f\}, R, Z, F)$, $f \notin N$. $R$ is constructed from $P$ by the following rules:

1. If $X \to t$ ($X \in N$, $t \in T$) is a production of $P$ then let $Xt \to f$ be a production of $R$.

2. If $X \to tY$ ($X, Y \in N$, $t \in T$) is a production of $P$ then let $Xt \to Y$ be a production of $R$.

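$T = \{n, ., +, -, E\}$

$Q = \{C, F, I, X, S, U, q\}$

$R = \{Cn \to q,\ Cn \to F,\ C. \to I,\ F. \to I,\ FE \to S,\ In \to q,\ In \to X,\ XE \to S,\ Sn \to q,\ S{+} \to U,\ S{-} \to U,\ Un \to q\}$

$q_0 = C$

$F = \{q\}$

Figure 5.9: An Automaton Corresponding to Figure 5.4a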

Further, $F = \{f\} \cup \{X \mid X \to \epsilon \in P\}$. Figure 5.9 is an automaton constructed by this process from the grammar of Figure 5.4a.
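The construction in the proof of Theorem 5.13 is mechanical enough to sketch in Python. The grammar below is an assumption: Figure 5.4a is not reproduced in this section, so its productions were obtained by reading the rules of Figure 5.9 backwards. The function name `grammar_to_automaton` and the one-character encoding of symbols are likewise illustrative choices. The sketch names the added final state `f`, as the construction does, whereas Figure 5.9 calls it $q$.

```python
from typing import Dict, List, Set, Tuple

# Assumed productions for Figure 5.4a. Every right-hand side is "" (epsilon),
# "t" or "tY", with a one-character terminal t and nonterminal Y.
GRAMMAR: Dict[str, List[str]] = {
    "C": ["n", "nF", ".I"],
    "F": [".I", "ES"],
    "I": ["n", "nX"],
    "X": ["ES"],
    "S": ["n", "+U", "-U"],
    "U": ["n"],
}


def grammar_to_automaton(
    productions: Dict[str, List[str]], f: str = "f"
) -> Tuple[Dict[Tuple[str, str], Set[str]], Set[str]]:
    """Build the rule set R and the final-state set F from the productions P."""
    rules: Dict[Tuple[str, str], Set[str]] = {}
    final = {f}
    for x, right_sides in productions.items():
        for rhs in right_sides:
            if rhs == "":
                final.add(x)  # X -> epsilon makes X a final state
            elif len(rhs) == 1:
                rules.setdefault((x, rhs), set()).add(f)  # rule 1: Xt -> f
            else:
                t, y = rhs[0], rhs[1]
                rules.setdefault((x, t), set()).add(y)  # rule 2: Xt -> Y
    return rules, final


if __name__ == "__main__":
    rules, final = grammar_to_automaton(GRAMMAR)
    for (x, t), targets in sorted(rules.items()):
        print(x, t, "->", sorted(targets))
```

Two left-hand sides, `("C", "n")` and `("I", "n")`, end up with two possible successors each, so the resulting automaton is not deterministic.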

One can show by induction that the automaton constructed in this manner has the following characteristic: For any derivation $Z\tau\chi \Rightarrow^* X\chi \Rightarrow^* q$ ($\tau, \chi \in T^*$, $X \in N$, $\tau\chi \in L(A)$, $q \in F$), the state $X$ specifies the nonterminal symbol of $G$ that must have been used to derive the string $\chi$. Clearly this statement is true for the initial state $Z$ if $\tau\chi$ belongs to $L(G)$. It remains true until the final state $q$, which does not generate any further symbols, is reached. With the help of this interpretation it is easy to prove that each sentence of $L(G)$ also belongs to $L(A)$ and vice-versa.

Figure 5.9 is an unsatisfactory automaton in practice because at certain steps (for example, in state $I$ with input symbol $n$) several transitions are possible. This is not a theoretical problem since the automaton is capable of producing a derivation for any string in the language. When implementing this automaton in a compiler, however, we must make some arbitrary decision at each step where more than one production might apply. An incorrect decision requires backtracking in order to seek another possibility. There are three reasons why backtracking should be avoided if possible:

The time required to parse a string with backtracking may increase exponentially with the length of the string.

If the automaton does not accept the string then it will be recognized as incorrect. A parse with backtracking makes pinpointing the error almost impossible. (This is illustrated by attempting to parse the string $n.nE{+}{+}n$ with the automaton of Figure 5.9, trying the rules in the sequence in which they are written.)

Other compiler actions are often associated with state transitions. Backtracking then requires unraveling of actions already completed, generally a very difficult task.

In order to avoid backtracking, additional constraints must be placed upon the automata that we are prepared to accept as models for our recognition algorithms.

5.14 Definition

An automaton is deterministic if every derivation can be continued by at most one move. A finite automaton is therefore deterministic if the left-hand sides of all rules are distinct. It can be completely described by a state table that has one row for each element of $Q$ and one column for each element of $T$. Entry $(q, t)$ contains $q'$ if and only if the production $qt \to q'$ is an element of $R$. The rows corresponding to $q_0$ and to the elements of $F$ are suitably marked.


5.15 Theorem

For every regular grammar, $G$, there exists a deterministic finite automaton, $A$, such that $L(A) = L(G)$.

Following construction 5.13, we can derive an automaton from a regular grammar $G = (T, N, P, Z)$ such that, during acceptance of a sentence in $L(G)$, the state at each point specifies the element of $N$ used to derive the remainder of the string. Suppose that the productions $X \to tU$ and $X \to tV$ belong to $P$. When $t$ is the next input symbol, the remainder of the string could have been derived either from $U$ or from $V$. If $A$ is to be deterministic, however, $R$ must contain exactly one production of the form $Xt \to q'$. Thus the state $q'$ must specify a set of nonterminals, any one of which could have been used to derive the remainder of the string. This interpretation of the states leads to the following inductive algorithm for determining $Q$, $R$ and $F$ of a deterministic automaton $A = (T, Q, R, q_0, F)$. (In this algorithm, $q$ represents a subset $N_q$ of $N \cup \{f\}$, $f \notin N$.)

1. Initially let $Q = \{q_0\}$ and $R = \emptyset$, with $N_{q_0} = \{Z\}$.

2. Let $q$ be an element of $Q$ that has not yet been considered. Perform steps (3)-(5) for each $t \in T$.

3. Let $next(q, t) = \{U \mid \exists X \in N_q$ such that $X \to tU \in P\}$.

4. If there is an $X \in N_q$ such that $X \to t \in P$ then add $f$ to $next(q, t)$ if it is not already present; if there is an $X \in N_q$ such that $X \to \epsilon \in P$ then add $f$ to $N_q$ if it is not already present.

5. If $next(q, t) \neq \emptyset$ then let $q'$ be the state representing $N_{q'} = next(q, t)$. Add $q'$ to $Q$ and $qt \to q'$ to $R$ if they are not already present.

6. If all states of $Q$ have been considered then let $F = \{q \mid f \in N_q\}$ and stop. Otherwise return to step (2).

You can easily convince yourself that this construction leads to a deterministic finite automaton $A$ such that $L(A) = L(G)$. In particular, the algorithm terminates: All states represent subsets of $N \cup \{f\}$, of which there are only a finite number.

To illustrate this procedure, consider the construction of a deterministic finite automaton that recognizes strings generated by the grammar of Figure 5.4a. The state table for this grammar, showing the correspondence between states and sets of nonterminals, is given in Figure 5.10a. You should derive this state table for yourself, following the steps of the algorithm. Begin with a single empty row for $q_0$ and work across it, filling in each entry that corresponds to a valid transition. Each time a distinct set of nonterminal symbols is generated, add an empty row to the table. The algorithm terminates when all rows have been processed.
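The row-by-row procedure just described can be checked with a short Python sketch of the six-step algorithm. Several things here are assumptions: the productions again stand in for Figure 5.4a (inferred from Figure 5.9), symbols are single characters, and the names `determinize`, `GRAMMAR` and `TERMINALS` are illustrative. The sketch also adds $f$ to a set $N_q$ at the moment the set is created instead of while the row is being processed; this is a simplification of the second clause of step (4).

```python
from collections import deque
from typing import Dict, FrozenSet, List, Set, Tuple

# Assumed productions for Figure 5.4a; right-hand sides are "", "t" or "tY".
GRAMMAR: Dict[str, List[str]] = {
    "C": ["n", "nF", ".I"],
    "F": [".I", "ES"],
    "I": ["n", "nX"],
    "X": ["ES"],
    "S": ["n", "+U", "-U"],
    "U": ["n"],
}
TERMINALS = ["n", ".", "+", "-", "E"]


def determinize(productions, terminals, start, f="f"):
    """Return (state numbering, rules, final states) of a deterministic automaton."""

    def complete(symbols: Set[str]) -> FrozenSet[str]:
        # Step 4, second clause: X -> epsilon with X in N_q puts f into N_q.
        if any("" in productions.get(x, []) for x in symbols):
            symbols = symbols | {f}
        return frozenset(symbols)

    q0 = complete({start})                        # step 1
    index: Dict[FrozenSet[str], int] = {q0: 0}    # N_q -> state number
    rules: Dict[Tuple[int, str], int] = {}
    work = deque([q0])
    while work:                                   # step 2: next unconsidered state
        nq = work.popleft()
        for t in terminals:
            nxt: Set[str] = set()
            for x in nq:
                for rhs in productions.get(x, []):
                    if rhs[:1] == t:
                        # step 3: X -> tU adds U; step 4: X -> t adds f
                        nxt.add(rhs[1] if len(rhs) == 2 else f)
            if nxt:                               # step 5
                target = complete(nxt)
                if target not in index:
                    index[target] = len(index)
                    work.append(target)
                rules[(index[nq], t)] = index[target]
    final = {i for symbols, i in index.items() if f in symbols}  # step 6
    return index, rules, final


if __name__ == "__main__":
    index, rules, final = determinize(GRAMMAR, TERMINALS, "C")
    for symbols, i in index.items():
        print(f"q{i}", sorted(symbols), "(final)" if i in final else "")
```

Processing states in order of discovery and terminals in the order `n . + - E` numbers the states in the order in which their rows would be added to the table.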

5.16 Theorem

For every finite automaton, $A$, there exists a regular grammar, $G$, such that $L(G) = L(A)$.

Theorems 5.15 and 5.16 together establish the fact that finite automata and regular grammars are equivalent. To prove Theorem 5.16 we construct the production set $P$ of the grammar $G = (T, Q, P, q_0)$ from the automaton $(T, Q, R, q_0, F)$ as follows:

$P = \{q \to tq' \mid qt \to q' \in R\} \cup \{q \to \epsilon \mid q \in F\}$
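The production set of Theorem 5.16 translates directly into code. The following Python sketch is illustrative only: the function name `automaton_to_grammar`, the tuple encoding of right-hand sides (with the empty tuple standing for $\epsilon$) and the two-state example automaton are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

Rules = Dict[Tuple[str, str], str]          # deterministic: (q, t) -> q'
Productions = Dict[str, List[Tuple[str, ...]]]


def automaton_to_grammar(rules: Rules, final: Set[str]) -> Productions:
    """Build P = {q -> t q' | qt -> q' in R} united with {q -> epsilon | q in F}."""
    productions: Productions = {}
    for (q, t), q2 in rules.items():
        productions.setdefault(q, []).append((t, q2))   # q -> t q'
    for q in sorted(final):
        productions.setdefault(q, []).append(())         # q -> epsilon
    return productions


# Hypothetical automaton: zero or more 'a' followed by exactly one 'b'.
EXAMPLE_RULES: Rules = {("q0", "a"): "q0", ("q0", "b"): "q1"}
EXAMPLE_FINAL = {"q1"}

if __name__ == "__main__":
    for left, right_sides in automaton_to_grammar(EXAMPLE_RULES, EXAMPLE_FINAL).items():
        print(left, "->", right_sides)
```

The states of the automaton become the nonterminals of the grammar, and $q_0$ becomes its start symbol.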

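| State | Nonterminal set | $n$ | $.$ | $+$ | $-$ | $E$ |
| $q_0$ | $\{C\}$ | $q_1$ | $q_2$ | | | |
| $q_1$ | $\{f, F\}$ | | $q_2$ | | | $q_3$ |
| $q_2$ | $\{I\}$ | $q_4$ | | | | |
| $q_3$ | $\{S\}$ | $q_5$ | | $q_6$ | $q_6$ | |
| $q_4$ | $\{f, X\}$ | | | | | $q_3$ |
| $q_5$ | $\{f\}$ | | | | | |
| $q_6$ | $\{U\}$ | $q_5$ | | | | |

a) The state table

$T = \{n, ., +, -, E\}$

$Q = \{q_0, q_1, q_2, q_3, q_4, q_5, q_6\}$

$P = \{q_0 n \to q_1,\ q_0 . \to q_2,\ q_1 . \to q_2,\ q_1 E \to q_3,\ q_2 n \to q_4,\ q_3 n \to q_5,\ q_3 {+} \to q_6,\ q_3 {-} \to q_6,\ q_4 E \to q_3,\ q_6 n \to q_5\}$

$F = \{q_1, q_4, q_5\}$

b) The complete automaton

Figure 5.10: A Deterministic Automaton Corresponding to Figure 5.4a

5.2.2 State Diagrams and Regular Expressions

The phrase structure of the basic symbols of the language is usually not interesting, and in fact may simply make the description harder to understand. Two additional formalisms, both of which avoid the need for irrelevant structuring, are available for regular languages. The first is the representation of a finite automaton by a directed graph: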

5.17 Definition

Let $A = (T, Q, R, q_0, F)$ be a finite automaton, $D = \{(q, q') \mid \exists t,\ qt \to q' \in R\}$, and $f: (q, q') \to \{t \mid qt \to q' \in R\}$ be a mapping from $D$ into the powerset of $T$. The directed graph $(Q, D)$ with edge labels $f((q, q'))$ is called the state diagram of the automaton $A$.

Figure 5.11a is the state diagram of the automaton described in Figure 5.10b. The nodes corresponding to elements of $F$ have been represented as squares, while the remaining nodes are represented as circles. Only the state numbers appear in the nodes: 0 stands for $q_0$, 1 for $q_1$, and so forth.

In a state diagram, the sequence of edge labels along a path beginning at $q_0$ and ending at a state in $F$ is a sentence of $L(A)$. Figure 5.11a has exactly 12 such paths. The corresponding sentences are given in Figure 5.11b.
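The count of 12 can be reproduced by walking the state diagram. The Python sketch below assumes the transition rules of Figure 5.10b as transcribed in `RULES` (state numbers stand for $q_0$ to $q_6$); the function name `sentences` is illustrative. The depth-first walk terminates only because this particular diagram has no cycles; a diagram with a cycle describes infinitely many sentences and would need a length bound.

```python
from typing import Dict, List, Set, Tuple

# Transitions assumed from Figure 5.10b: (state, symbol) -> next state.
RULES: Dict[Tuple[int, str], int] = {
    (0, "n"): 1, (0, "."): 2,
    (1, "."): 2, (1, "E"): 3,
    (2, "n"): 4,
    (3, "n"): 5, (3, "+"): 6, (3, "-"): 6,
    (4, "E"): 3,
    (6, "n"): 5,
}
FINAL: Set[int] = {1, 4, 5}


def sentences(rules: Dict[Tuple[int, str], int], start: int, final: Set[int]) -> List[str]:
    """Collect the edge-label sequence of every path from `start` to a final state."""
    found: List[str] = []

    def walk(q: int, prefix: str) -> None:
        if q in final:
            found.append(prefix)        # a path ending in F spells a sentence
        for (p, t), q2 in rules.items():
            if p == q:
                walk(q2, prefix + t)    # follow the edge labelled t

    walk(start, "")
    return found


if __name__ == "__main__":
    result = sentences(RULES, 0, FINAL)
    print(len(result), sorted(result))
```

An edge carrying two labels, such as the one from state 3 to state 6, contributes one sentence per label.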

A state diagram specifies a regular language. Another characterization is the regular expression:

5.18 Definition

Given a vocabulary $V$, and the symbols $E$, $\epsilon$, $+$, $*$, ( and ) not in $V$. A string $\rho$ over $V \cup \{E, \epsilon, +, *, (, )\}$ is a regular expression over $V$ if

1. $\rho$ is a single symbol of $V$ or one of the symbols $E$ or $\epsilon$, or if

2. $\rho$ has the form $(X + Y)$, $(XY)$ or $(X)^*$ where $X$ and $Y$ are regular expressions.

[State diagram with nodes 0 to 6 and edge labels $n$, $.$, $+$, $-$, $E$]

a) State diagram

n .n n.n

nEn nE+n nE-n

.nEn .nE+n .nE-n

n.nEn n.nE+n n.nE-n

b) Paths

Figure 5.11: Another Description of Figure 5.10b

Every regular expression results from a finite number of applications of rules (1) and (2). It describes a language over $V$: The symbol $E$ describes the empty language, $\epsilon$ describes the language consisting only of the empty string, $v \in V$ describes the language $\{v\}$, $(X + Y) = \{\omega \mid \omega \in X$ or $\omega \in Y\}$, $(XY) = \{\chi\gamma \mid \chi \in X,\ \gamma \in Y\}$. The closure operator ($*$) is defined by the following infinite sum:

$X^* = \epsilon + X + XX + XXX + \ldots$

As illustrated in this definition, we shall usually omit parentheses. Star is unary, and takes priority over either binary operator; plus has a lower priority than concatenation. Thus $W + XY^*$ is equivalent to the fully-parenthesized expression $(W + (X(Y)^*))$.

Figure 5.12 summarizes the algebraic properties of regular expressions. The distinct representations for $X^*$ show that several regular expressions can be given for one language.

$X + Y = Y + X$ (commutative)

$(X + Y) + Z = X + (Y + Z)$ (associative)

$(XY)Z = X(YZ)$

$X(Y + Z) = XY + XZ$ (distributive)

$(X + Y)Z = XZ + YZ$

$X + E = E + X = X$ (identity)

$X\epsilon = \epsilon X = X$

$XE = EX = E$ (zero)

$X + X = X$ (idempotent)

$(X^*)^* = X^*$

$X^* = \epsilon + XX^*$

$X^* = X + X^*$

$\epsilon^* = \epsilon$

$E^* = \epsilon$

Figure 5.12: Algebraic Properties of Regular Expressions

The main advantage in using a regular expression to describe a set of strings is that it gives a precise specification, closely related to the 'natural language' description, which can be written in text form suitable for input to a computer. For example, let $l$ denote any single letter and $d$ any single digit. The expression $l(l + d)^*$ is then a direct representation of the natural language description 'a letter followed by any sequence of letters and digits'.
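As a small illustration of "text form suitable for input to a computer", the same expression can be handed to Python's standard `re` module. Two assumptions are made here: $l$ and $d$ are taken to be the ASCII letters and digits, and the notation differs from the text, since Python writes the alternation operator $+$ as `|`. The names `IDENTIFIER` and `is_identifier` are illustrative.

```python
import re

# l(l + d)* in Python's regular expression syntax:
#   l         ->  [A-Za-z]
#   (l + d)*  ->  ([A-Za-z]|[0-9])*
IDENTIFIER = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")


def is_identifier(text: str) -> bool:
    """True if the whole string is a letter followed by letters and digits."""
    return IDENTIFIER.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["x1", "a", "1x", ""]:
        print(repr(candidate), is_identifier(candidate))
```

`fullmatch` is used so that the entire string, not just a prefix, must belong to the language.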

The equivalence of regular expressions and finite automata follows from:

5.19 Theorem

Let $R$ be a regular expression that describes a subset, $S$, of $T^*$. There exists a deterministic finite automaton, $A = (T, Q, P, q_0, F)$ such that $L(A) = S$.

The automaton is constructed in much the same way as that of Theorem 5.15: We create a new expression $R'$ by replacing the elements of $T$ occurring in $R$ by distinct symbols (multiple occurrences of the same element will receive distinct symbols). Further, we prefix another distinct symbol to the altered expression; if $R = E$, then $R'$ consists only of this starting symbol. (As symbols we could use, for example, natural numbers with 0 as the starting symbol.) The states of our automaton correspond to subsets of the symbol set. The set corresponding to the initial state $q_0$ consists solely of the starting symbol. We inspect the states of $Q$ one after another and add new states as required. For each $q \in Q$ and each $t \in T$, let $q'$ correspond to the set of symbols in $R'$ that replace $t$ and follow any of the symbols of the set corresponding to $q$. If the set corresponding to $q'$ is not empty, then we add $qt \to q'$ to $P$ and add $q'$ to $Q$ if it is not already present. The set $F$ of final states consists of all states that include a possible final symbol of $R'$.

Figure 5.13 gives an example of this process. Starting with $q_0 = \{0\}$, we obtain the state table of Figure 5.13b, with states $q_1$, $q_2$ and $q_3$ as final states. Obviously this is not the simplest automaton which we could create for the given language; we shall return to this problem in Section 6.2.2.

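$R = l(l + d)^*$

$R' = 01(2 + 3)^*$

a) Modifying the Regular Expression

| State | Symbol set | $l$ | $d$ |
| $q_0$ | $\{0\}$ | $q_1$ | |
| $q_1$ | $\{1\}$ (final) | $q_2$ | $q_3$ |
| $q_2$ | $\{2\}$ (final) | $q_2$ | $q_3$ |
| $q_3$ | $\{3\}$ (final) | $q_2$ | $q_3$ |

b) The resulting state table

Figure 5.13: Regular Expressions to State Tables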
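The state-by-state inspection used to build such a table can be sketched in Python for the example $R' = 01(2 + 3)^*$. The sketch does not parse regular expressions. It assumes that three pieces of information have already been worked out by hand from $R'$: which terminal each numbered symbol replaces (`SYMBOL_OF`), which symbols may follow each symbol (`FOLLOW`), and which symbols may end a sentence (`LAST`). These names and the function `build_table` are illustrative, not part of the text.

```python
from collections import deque
from typing import Dict, FrozenSet, Set, Tuple

# Hand-derived data for R' = 0 1 (2 + 3)*, where 1 and 2 replace l and 3 replaces d.
SYMBOL_OF: Dict[int, str] = {1: "l", 2: "l", 3: "d"}
FOLLOW: Dict[int, Set[int]] = {0: {1}, 1: {2, 3}, 2: {2, 3}, 3: {2, 3}}
LAST: Set[int] = {1, 2, 3}   # the starred part may be empty, so 1 can be last
ALPHABET = ["l", "d"]


def build_table(symbol_of, follow, last, alphabet):
    """Return (state numbering, rules, final states) for the given expression data."""
    q0: FrozenSet[int] = frozenset({0})           # starting symbol only
    index: Dict[FrozenSet[int], int] = {q0: 0}
    rules: Dict[Tuple[int, str], int] = {}
    work = deque([q0])
    while work:
        q = work.popleft()
        for t in alphabet:
            # Symbols that replace t and follow some symbol of the set for q.
            target = frozenset(
                s for p in q for s in follow[p] if symbol_of[s] == t
            )
            if target:
                if target not in index:
                    index[target] = len(index)
                    work.append(target)
                rules[(index[q], t)] = index[target]
    final = {i for symbols, i in index.items() if symbols & last}
    return index, rules, final


if __name__ == "__main__":
    index, rules, final = build_table(SYMBOL_OF, FOLLOW, LAST, ALPHABET)
    for symbols, i in index.items():
        print(f"q{i}", sorted(symbols), "(final)" if i in final else "")
```

For this input the sketch yields four states, of which every state except the initial one is final.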
