Figure 5.8a. The extra productions can be removed by a simple substitution: If $B \in N$ occurs exactly twice in a grammar, once in a production of the form $A \to \mu B \nu$ and once in a production of the form $B \to \beta$ ($\mu, \nu, \beta \in V^*$), then $B$ can be eliminated and the two productions replaced by $A \to \mu\beta\nu$. After all such substitutions have been made, the resulting grammar will differ from Figure 5.8a only in the representation of vocabulary symbols.
5.2 Regular Grammars and Finite Automata
A grammar specifies a process for generating sentences, and thus allows us to give a finite description of an infinite language. The analysis phase of the compiler, however, must recognize the phrase structure of a given sentence: It must parse the sentence. Assuming that the language has been described by a grammar, we are interested in techniques for automatically generating a recognizer from that grammar. There are two reasons for this requirement:
It provides a guarantee that the language recognized by the compiler is identical to that defined by the grammar.
It simplifies the task of the compiler writer.
We shall use automata, which we introduce as special cases of general rewriting systems, as models for the parsing process. In this section we develop a theoretical basis for regular languages and finite automata, and then extend the concepts and algorithms to context-free languages and pushdown automata in Section 5.3. The implementation of the automata is covered in Chapters 6 and 7.
5.2.1 Finite Automata
5.12 Definition
A finite automaton (finite state acceptor) is a quintuple $A = (T, Q, R, q_0, F)$, where $Q$ is a nonempty set, $(T \cup Q, R)$ is a general rewriting system, $q_0$ is an element of $Q$ and $F$ is a subset of $Q$. The sets $T$ and $Q$ are disjoint. Each element of $R$ has the form $qt \to q'$, where $q$ and $q'$ are elements of $Q$ and $t$ is an element of $T$. We say that $A$ accepts a set of strings $L(A) = \{\tau \in T^* \mid q_0\tau \Rightarrow^* q, q \in F\}$. Two automata, $A$ and $A'$, are equivalent if and only if $L(A) = L(A')$.
We can conceive of the finite automaton as a machine that reads a given input string out of a buffer one symbol at a time and changes its internal state upon absorbing each symbol. $Q$ is the set of internal states, with $q_0$ being the initial state and $F$ the set of final states. We say that a finite automaton is in state $q$ when the current string in the derivation has the form $q\tau$. It makes a transition from state $q$ to state $q'$ if $\tau = t\tau'$ and $qt \to q'$ is an element of $R$. Each state transition removes one symbol from the input string.
5.13 Theorem
For every regular grammar, $G$, there exists a finite automaton, $A$, such that $L(A) = L(G)$.
The proof of this theorem is an algorithm to construct $A$, given $G = (T, N, P, Z)$. Let $A = (T, N \cup \{f\}, R, Z, F)$, $f \notin N$. $R$ is constructed from $P$ by the following rules:
1. If $X \to t$ ($X \in N$, $t \in T$) is a production of $P$ then let $Xt \to f$ be a production of $R$.
2. If $X \to tY$ ($X, Y \in N$, $t \in T$) is a production of $P$ then let $Xt \to Y$ be a production of $R$.
$T = \{n, ., +, -, E\}$
$Q = \{C, F, I, X, S, U, q\}$
$R = \{Cn \to q,\ Cn \to F,\ C. \to I,\ F. \to I,\ FE \to S,\ In \to q,\ In \to X,\ XE \to S,\ Sn \to q,\ S+ \to U,\ S- \to U,\ Un \to q\}$
$q_0 = C$
$F = \{q\}$
Figure 5.9: An Automaton Corresponding to Figure 5.4a
Further, $F = \{f\} \cup \{X \mid X \to \epsilon \in P\}$. Figure 5.9 is an automaton constructed by this process from the grammar of Figure 5.4a.
One can show by induction that the automaton constructed in this manner has the following characteristic: For any derivation $Z\tau\chi \Rightarrow^* X\chi \Rightarrow^* q$ ($\tau, \chi \in T^*$, $X \in N$, $\tau\chi \in L(A)$, $q \in F$), the state $X$ specifies the nonterminal symbol of $G$ that must have been used to derive the string $\chi$. Clearly this statement is true for the initial state $Z$ if $\tau\chi$ belongs to $L(G)$. It remains true until the final state $q$, which does not generate any further symbols, is reached. With the help of this interpretation it is easy to prove that each sentence of $L(G)$ also belongs to $L(A)$ and vice-versa.
Figure 5.9 is an unsatisfactory automaton in practice because at certain steps, for example in state $I$ with input symbol $n$, several transitions are possible. This is not a theoretical problem since the automaton is capable of producing a derivation for any string in the language. When implementing this automaton in a compiler, however, we must make some arbitrary decision at each step where more than one production might apply. An incorrect decision requires backtracking in order to seek another possibility. There are three reasons why backtracking should be avoided if possible:
The time required to parse a string with backtracking may increase exponentially with the length of the string.
If the automaton does not accept the string then it will be recognized as incorrect. A parse with backtracking makes pinpointing the error almost impossible. (This is illustrated by attempting to parse the string n.nE++n with the automaton of Figure 5.9, trying the rules in the sequence in which they are written.)
Other compiler actions are often associated with state transitions. Backtracking then requires unraveling of actions already completed, generally a very difficult task.
In order to avoid backtracking, additional constraints must be placed upon the automata that we are prepared to accept as models for our recognition algorithms.
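The cost of nondeterminism can be made concrete with a small experiment. The following Python sketch is an illustration only, not the implementation developed in later chapters: it encodes the rules of Figure 5.9 in the order in which they are written and searches for a derivation depth-first, backing up whenever a choice leads to a dead end. The names `RULES`, `accepts` and `run` are assumptions of this sketch.

```python
# Rules of Figure 5.9 in written order: (state, input symbol, next state).
RULES = [
    ("C", "n", "q"), ("C", "n", "F"), ("C", ".", "I"), ("F", ".", "I"),
    ("F", "E", "S"), ("I", "n", "q"), ("I", "n", "X"), ("X", "E", "S"),
    ("S", "n", "q"), ("S", "+", "U"), ("S", "-", "U"), ("U", "n", "q"),
]
START, FINAL = "C", {"q"}


def accepts(text):
    """Return (accepted, number of rule applications tried)."""
    steps = 0

    def run(state, pos):
        nonlocal steps
        if pos == len(text):
            return state in FINAL
        for source, symbol, target in RULES:
            if source == state and symbol == text[pos]:
                steps += 1
                if run(target, pos + 1):
                    return True
                # Dead end: back up and try the next applicable rule.
        return False

    return run(START, 0), steps


print(accepts("n.nE+n"))   # (True, 8): two of the eight applications are undone
print(accepts("n.nE++n"))  # (False, 7): no hint of where the error lies
```

For n.nE+n the search first tries $Cn \to q$ and $In \to q$, both of which must be abandoned. For n.nE++n every alternative is exhausted and the search simply returns to the initial state, so the only information available at the end is that the string was rejected.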
5.14 Definition
An automaton is deterministic if every derivation can be continued by at most one move.
A finite automaton is therefore deterministic if the left-hand sides of all rules are distinct. It can be completely described by a state table that has one row for each element of $Q$ and one column for each element of $T$. Entry $(q, t)$ contains $q'$ if and only if the production $qt \to q'$ is an element of $R$. The rows corresponding to $q_0$ and to the elements of $F$ are suitably marked.
5.15 Theorem
For every regular grammar, $G$, there exists a deterministic finite automaton, $A$, such that $L(A) = L(G)$.
Following construction 5.13, we can derive an automaton from a regular grammar $G = (T, N, P, Z)$ such that, during acceptance of a sentence in $L(G)$, the state at each point specifies the element of $N$ used to derive the remainder of the string. Suppose that the productions $X \to tU$ and $X \to tV$ belong to $P$. When $t$ is the next input symbol, the remainder of the string could have been derived either from $U$ or from $V$. If $A$ is to be deterministic, however, $R$ must contain exactly one production of the form $Xt \to q'$. Thus the state $q'$ must specify a set of nonterminals, any one of which could have been used to derive the remainder of the string. This interpretation of the states leads to the following inductive algorithm for determining $Q$, $R$ and $F$ of a deterministic automaton $A = (T, Q, R, q_0, F)$. (In this algorithm, $q$ represents a subset $N_q$ of $N \cup \{f\}$, $f \notin N$):
1. Initially let $Q = \{q_0\}$ and $R = \emptyset$, with $N_{q_0} = \{Z\}$.
2. Let $q$ be an element of $Q$ that has not yet been considered. Perform steps (3)-(5) for each $t \in T$.
3. Let $next(q, t) = \{U \mid \exists X \in N_q \text{ such that } X \to tU \in P\}$.
4. If there is an $X \in N_q$ such that $X \to t \in P$ then add $f$ to $next(q, t)$ if it is not already present; if there is an $X \in N_q$ such that $X \to \epsilon \in P$ then add $f$ to $N_q$ if it is not already present.
5. If $next(q, t) \neq \emptyset$ then let $q'$ be the state representing $N_{q'} = next(q, t)$. Add $q'$ to $Q$ and $qt \to q'$ to $R$ if they are not already present.
6. If all states of $Q$ have been considered then let $F = \{q \mid f \in N_q\}$ and stop. Otherwise return to step (2).
You can easily convince yourself that this construction leads to a deterministic finite automaton $A$ such that $L(A) = L(G)$. In particular, the algorithm terminates: All states represent subsets of $N \cup \{f\}$, of which there are only a finite number.
To illustrate this procedure, consider the construction of a deterministic finite automaton that recognizes strings generated by the grammar of Figure 5.4a. The state table for this grammar, showing the correspondence between states and sets of nonterminals, is given in Figure 5.10a. You should derive this state table for yourself, following the steps of the algorithm. Begin with a single empty row for $q_0$ and work across it, filling in each entry that corresponds to a valid transition. Each time a distinct set of nonterminal symbols is generated, add an empty row to the table. The algorithm terminates when all rows have been processed.
5.16 Theorem
For every finite automaton, $A$, there exists a regular grammar, $G$, such that $L(G) = L(A)$.
Theorems 5.15 and 5.16 together establish the fact that finite automata and regular grammars are equivalent. To prove Theorem 5.16 we construct the production set $P$ of the grammar $G = (T, Q, P, q_0)$ from the automaton $(T, Q, R, q_0, F)$ as follows:
$P = \{q \to tq' \mid qt \to q' \in R\} \cup \{q \to \epsilon \mid q \in F\}$
5.2.2 State Diagrams and Regular Expressions
The phrase structure of the basic symbols of the language is usually not interesting, and in fact may simply make the description harder to understand. Two additional formalisms, both of which avoid the need for irrelevant structuring, are available for regular languages. The first is the representation of a finite automaton by a directed graph (Definition 5.17).
        n     .     +     -     E
q0      q1    q2                      {C}
q1            q2                q3    {f, F}
q2      q4                            {I}
q3      q5          q6    q6          {S}
q4                              q3    {f, X}
q5                                    {f}
q6      q5                            {U}
a) The state table
$T = \{n, ., +, -, E\}$
$Q = \{q_0, q_1, q_2, q_3, q_4, q_5, q_6\}$
$P = \{q_0 n \to q_1,\ q_0 . \to q_2,\ q_1 . \to q_2,\ q_1 E \to q_3,\ q_2 n \to q_4,\ q_3 n \to q_5,\ q_3 + \to q_6,\ q_3 - \to q_6,\ q_4 E \to q_3,\ q_6 n \to q_5\}$
$F = \{q_1, q_4, q_5\}$
b) The complete automaton
Figure 5.10: A Deterministic Automaton Corresponding to Figure 5.4a
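The state table of Figure 5.10a can be reproduced mechanically by following steps (1)-(6) of the algorithm given with Theorem 5.15. The Python sketch below is one possible rendering, under two stated assumptions: the production list is inferred by reading the rules of Figure 5.9 backwards through construction 5.13 (Figure 5.4a itself is not reproduced here), and the second half of step (4) is omitted because this grammar has no productions of the form $X \to \epsilon$. The names `PRODUCTIONS` and `build_dfa` are hypothetical.

```python
# Assumed productions: (X, t, Y) stands for X -> tY, and Y = None for X -> t.
PRODUCTIONS = [
    ("C", "n", None), ("C", "n", "F"), ("C", ".", "I"), ("F", ".", "I"),
    ("F", "E", "S"), ("I", "n", None), ("I", "n", "X"), ("X", "E", "S"),
    ("S", "n", None), ("S", "+", "U"), ("S", "-", "U"), ("U", "n", None),
]
TERMINALS = ["n", ".", "+", "-", "E"]


def build_dfa(productions, terminals, axiom):
    states = [frozenset([axiom])]        # step 1: N_q0 = {Z}, R empty
    rules = {}                           # (state index, t) -> state index
    current = 0
    while current < len(states):         # step 2: take an unconsidered state
        for t in terminals:
            # steps 3 and 4: successors under t; "f" records a production X -> t
            nxt = frozenset(
                "f" if y is None else y
                for x, symbol, y in productions
                if x in states[current] and symbol == t
            )
            if nxt:                      # step 5: add q' and qt -> q' if new
                if nxt not in states:
                    states.append(nxt)
                rules[(current, t)] = states.index(nxt)
        current += 1
    final = {i for i, s in enumerate(states) if "f" in s}   # step 6
    return states, rules, final


STATES, RULES, FINAL = build_dfa(PRODUCTIONS, TERMINALS, "C")
for index, members in enumerate(STATES):
    row = {t: RULES[(index, t)] for t in TERMINALS if (index, t) in RULES}
    print(f"q{index}", sorted(members), row, "final" if index in FINAL else "")
```

With the terminals taken in the column order of Figure 5.10a, the states are discovered in the same order as in the figure, so the printed rows can be compared with the table line by line.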
5.17 Definition
Let $A = (T, Q, R, q_0, F)$ be a finite automaton, $D = \{(q, q') \mid \exists t,\ qt \to q' \in R\}$, and $f : (q, q') \to \{t \mid qt \to q' \in R\}$ be a mapping from $D$ into the powerset of $T$. The directed graph $(Q, D)$ with edge labels $f((q, q'))$ is called the state diagram of the automaton $A$.
Figure 5.11a is the state diagram of the automaton described in Figure 5.10b. The nodes corresponding to elements of $F$ have been represented as squares, while the remaining nodes are represented as circles. Only the state numbers appear in the nodes: 0 stands for $q_0$, 1 for $q_1$, and so forth.
In a state diagram, the sequence of edge labels along a path beginning at $q_0$ and ending at a state in $F$ is a sentence of $L(A)$. Figure 5.11a has exactly 12 such paths. The corresponding sentences are given in Figure 5.11b.
A state diagram specifies a regular language. Another characterization is the regular expression:
5.18 Definition
Given a vocabulary $V$, and the symbols $E$, $\epsilon$, $+$, $*$, ( and ) not in $V$. A string $\rho$ over $V \cup \{E, \epsilon, +, *, (, )\}$ is a regular expression over $V$ if
1. $\rho$ is a single symbol of $V$ or one of the symbols $E$ or $\epsilon$, or if
2. $\rho$ has the form $(X + Y)$, $(XY)$ or $(X)^*$ where $X$ and $Y$ are regular expressions.
[State diagram: nodes 0 to 6, with edges labeled n, ., +, - and E as given by the transitions of Figure 5.10b]
a) State diagram
n       .n      n.n
nEn     nE+n    nE-n
.nEn    .nE+n   .nE-n
n.nEn   n.nE+n  n.nE-n
b) Paths
Figure 5.11: Another Description of Figure 5.10b
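The twelve sentences of Figure 5.11b can be checked by walking the state diagram. The following Python sketch is only an illustration: it takes the transitions of Figure 5.10b and yields the sequence of edge labels along every path from $q_0$ to a final state. It terminates only because this particular diagram has no cycles; the names `RULES`, `FINAL` and `sentences` are assumptions of the sketch.

```python
# Transitions of Figure 5.10b: (state, symbol) -> next state.
RULES = {
    (0, "n"): 1, (0, "."): 2, (1, "."): 2, (1, "E"): 3, (2, "n"): 4,
    (3, "n"): 5, (3, "+"): 6, (3, "-"): 6, (4, "E"): 3, (6, "n"): 5,
}
FINAL = {1, 4, 5}


def sentences(state=0, prefix=""):
    """Yield the edge labels along every path from `state` to a final state."""
    if state in FINAL:
        yield prefix
    for (source, symbol), target in RULES.items():
        if source == state:
            yield from sentences(target, prefix + symbol)


PATHS = sorted(sentences(), key=lambda s: (len(s), s))
print(len(PATHS))   # 12
print(PATHS)
```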
Every regular expression results from a finite number of applications of rules (1) and (2). It describes a language over $V$: The symbol $E$ describes the empty language, $\epsilon$ describes the language consisting only of the empty string, $v \in V$ describes the language $\{v\}$, $(X + Y) = \{\omega \mid \omega \in X \text{ or } \omega \in Y\}$, $(XY) = \{\chi\omega \mid \chi \in X,\ \omega \in Y\}$. The closure operator ($*$) is defined by the following infinite sum:
$X^* = \epsilon + X + XX + XXX + \dots$
As illustrated in this definition, we shall usually omit parentheses. Star is unary, and takes priority over either binary operator; plus has a lower priority than concatenation. Thus $W + XY^*$ is equivalent to the fully-parenthesized expression $(W + (X(Y^*)))$.
Figure 5.12 summarizes the algebraic properties of regular expressions. The distinct representations for $X^*$ show that several regular expressions can be given for one language.
$X + Y = Y + X$                          (commutative)
$(X + Y) + Z = X + (Y + Z)$              (associative)
$(XY)Z = X(YZ)$
$X(Y + Z) = XY + XZ$                     (distributive)
$(X + Y)Z = XZ + YZ$
$X + E = E + X = X$                      (identity)
$X\epsilon = \epsilon X = X$
$XE = EX = E$                            (zero)
$X + X = X$                              (idempotent)
$(X^*)^* = X^*$
$X^* = \epsilon + XX^*$
$X^* = X + X^*$
$\epsilon^* = E^* = \epsilon$
Figure 5.12: Algebraic Properties of Regular Expressions
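The meaning of the operators and several of the identities in Figure 5.12 can be checked experimentally by computing the described language up to a bounded string length. The Python sketch below does this under an encoding chosen purely for illustration: `"E"` is the empty language, `"eps"` the empty string, any other one-character string a symbol of $V$, and tuples `("+", X, Y)`, `(".", X, Y)` and `("*", X)` stand for alternation, concatenation and closure. Agreement up to a length bound is evidence for an identity, not a proof of it.

```python
from itertools import product


def language(expr, limit):
    """Return the set of strings of length <= limit described by expr."""
    if expr == "E":
        return set()
    if expr == "eps":
        return {""}
    if isinstance(expr, str):                    # a single symbol of V
        return {expr} if limit >= 1 else set()
    op = expr[0]
    if op == "+":
        return language(expr[1], limit) | language(expr[2], limit)
    if op == ".":
        left, right = language(expr[1], limit), language(expr[2], limit)
        return {a + b for a, b in product(left, right) if len(a + b) <= limit}
    if op == "*":
        base = language(expr[1], limit)
        result, frontier = {""}, {""}
        while frontier:                          # eps + X + XX + ... up to limit
            frontier = {a + b for a in frontier for b in base
                        if len(a + b) <= limit} - result
            result |= frontier
        return result
    raise ValueError(f"unknown operator {op!r}")


X, Y, Z = ("+", "a", (".", "b", "b")), "b", ("*", "c")
N = 5
# distributive law: X(Y + Z) = XY + XZ
print(language((".", X, ("+", Y, Z)), N)
      == language(("+", (".", X, Y), (".", X, Z)), N))
# closure: X* = eps + XX*
print(language(("*", X), N)
      == language(("+", "eps", (".", X, ("*", X))), N))
```

Both comparisons print `True` for the sample expressions chosen here.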
The main advantage in using a regular expression to describe a set of strings is that it gives a precise specification, closely related to the 'natural language' description, which can be written in text form suitable for input to a computer. For example, let $l$ denote any single letter and $d$ any single digit. The expression $l(l + d)^*$ is then a direct representation of the natural language description 'a letter followed by any sequence of letters and digits'.
The equivalence of regular expressions and finite automata follows from:
5.19 Theorem
Let $R$ be a regular expression that describes a subset, $S$, of $T^*$. There exists a deterministic finite automaton, $A = (T, Q, P, q_0, F)$ such that $L(A) = S$.
The automaton is constructed in much the same way as that of Theorem 5.15: We create a new expression $R'$ by replacing the elements of $T$ occurring in $R$ by distinct symbols (multiple occurrences of the same element will receive distinct symbols). Further, we prefix another distinct symbol to the altered expression; if $R = E$, then $R'$ consists only of this starting symbol. (As symbols we could use, for example, natural numbers with 0 as the starting symbol.) The states of our automaton correspond to subsets of the symbol set. The set corresponding to the initial state $q_0$ consists solely of the starting symbol. We inspect the states of $Q$ one after another and add new states as required. For each $q \in Q$ and each $t \in T$, let $q'$ correspond to the set of symbols in $R'$ that replace $t$ and follow any of the symbols of the set corresponding to $q$. If the set corresponding to $q'$ is not empty, then we add $qt \to q'$ to $P$ and add $q'$ to $Q$ if it is not already present. The set $F$ of final states consists of all states that include a possible final symbol of $R'$.
Figure 5.13 gives an example of this process. Starting with $q_0 = \{0\}$, we obtain the state table of Figure 5.13b, with states $q_1$, $q_2$ and $q_3$ as final states. Obviously this is not the simplest automaton which we could create for the given language; we shall return to this problem in Section 6.2.2.
$R = l(l + d)^*$
$R' = 01(2 + 3)^*$
a) Modifying the Regular Expression
        l     d
q0      q1          {0}
q1      q2    q3    {1} (final)
q2      q2    q3    {2} (final)
q3      q2    q3    {3} (final)
b) The resulting state table
Figure 5.13: Regular Expressions to State Tables
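The construction behind Figure 5.13 can be sketched in a few lines of Python. The sketch assumes a hypothetical tuple encoding of $R'$: `("sym", t, p)` is the occurrence of terminal `t` that was replaced by the number `p` (the starting symbol 0 carries no terminal), and `("+", X, Y)`, `(".", X, Y)` and `("*", X)` are alternation, concatenation and closure. The helper `analyse` computes which numbered symbols can follow which, and `build_dfa` then proceeds as in Theorem 5.15. The cases $R = E$ and explicit $\epsilon$ operands are not handled.

```python
def analyse(expr, follow, terminal_of):
    """Return (nullable, first, last) of expr; fill follow and terminal_of."""
    kind = expr[0]
    if kind == "sym":
        _, terminal, position = expr
        terminal_of[position] = terminal
        follow.setdefault(position, set())
        return False, {position}, {position}
    if kind == "*":
        _, first, last = analyse(expr[1], follow, terminal_of)
        for p in last:                     # a repetition may start over
            follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyse(expr[1], follow, terminal_of)
    n2, f2, l2 = analyse(expr[2], follow, terminal_of)
    if kind == "+":
        return n1 or n2, f1 | f2, l1 | l2
    for p in l1:                           # concatenation: right part follows left
        follow[p] |= f2
    return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())


def build_dfa(expr, terminals):
    follow, terminal_of = {}, {}
    _, _, last = analyse(expr, follow, terminal_of)
    states, rules, current = [frozenset([0])], {}, 0
    while current < len(states):
        for t in terminals:
            nxt = frozenset(p for s in states[current] for p in follow[s]
                            if terminal_of[p] == t)
            if nxt:
                if nxt not in states:
                    states.append(nxt)
                rules[(current, t)] = states.index(nxt)
        current += 1
    final = {i for i, s in enumerate(states) if s & last}
    return states, rules, final


# R' = 0 1 (2 + 3)* for R = l (l + d)*, as in Figure 5.13a.
R_PRIME = (".", ("sym", None, 0),
           (".", ("sym", "l", 1),
            ("*", ("+", ("sym", "l", 2), ("sym", "d", 3)))))
STATES, RULES, FINAL = build_dfa(R_PRIME, ["l", "d"])
for index, members in enumerate(STATES):
    row = {t: RULES[(index, t)] for t in ["l", "d"] if (index, t) in RULES}
    print(f"q{index}", sorted(members), row, "final" if index in FINAL else "")
```

For this input the four printed rows agree with the state table of Figure 5.13b, including the observation that $q_1$, $q_2$ and $q_3$ are final.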