4.1.1 Characters to items

In a two-pass compiler the lexical analysis phase is commonly implemented as a subroutine of the syntax analysis phase, so task iii in figure 4.1 must be the kind of task which can be delegated to a subroutine. This implies that the lexical analyser cannot use information about previous items, or information about the current syntactic context, to determine which items it should encounter: all the necessary information must come from the characters of the item itself. The definition of the syntax of a programming language always gives the syntax of an item in a manner which is independent of the context in which it occurs,[1] and the form of programs, statements, expressions and other constituents of the language in terms of the items and phrases which they may contain. In effect the lexical analysis phase consists of those parts of the syntax analyser whose task is to recognise items by recognising the characters which form their internal structure.

The definition of the syntax of items – the ‘micro-syntax’ of the language – will define the form of an item in such a way that it can be recognised by the fact that it starts with a particular character, or one of a class of characters such as a digit or a letter, and then either consists of a particular sequence of characters or else follows a simple set of rules which determine, for each subsequent character, whether it forms part of the item or not. Descriptions of the way items may be written are usually like those shown in figure 4.2.

[1] In old (obsolescent?) languages such as FORTRAN or COBOL it is necessary to establish a context before an item can reliably be recognised. Such languages require a prescan to accomplish ‘statement identification’ before analysis can begin: more modern languages do not fall into this trap.

60 CHAPTER 4. LEXICAL ANALYSIS AND LOADING

    an <itemX> is the sequence of characters ABCD

    an <itemY> is the character E

    an <itemZ> starts with the characters F, G, H, ... or K and may be continued with any one of the characters L, M, ... X, Y

Figure 4.2: Simple lexical syntax description

    <integer> ::= <digit> | <integer> <digit>

    ...  | <real> E <integer> | <integer> E <integer>

Figure 4.3: Syntax of numbers

Item description in terms of simple sequences of characters makes it possible to write a lexical analyser which only needs to look at the current character in the input stream in order to decide what to do next. This character may itself be a simple-item (e.g. a comma), it may start a simple-item (e.g. the ‘:’ character starts the ‘:=’ item) or it may start an item which is a member of a ‘class’ of items such as an identifier, a numeral or a string. There is some discussion in chapter 15 of the theory of lexical analysis grammars: in this chapter I show only the practical results.
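The decision described above, taken from the current character alone, can be sketched in Python. This is only an illustration: the function name `classify_start` and the particular character sets are assumptions made for the example, not part of any language definition discussed here.

```python
def classify_start(ch):
    """Decide what kind of item begins with the single character ch.

    The character sets below are illustrative assumptions only.
    """
    if ch in ",;()+*":
        # The character is itself a complete simple-item.
        return "simple-item"
    if ch in ":<>":
        # The character may be the first of a longer simple-item such as ':='.
        return "start-of-simple-item"
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        # A letter starts a member of the identifier class.
        return "identifier"
    if "0" <= ch <= "9":
        # A digit starts a member of the numeral class.
        return "numeral"
    if ch == '"':
        # A quote starts a member of the string class.
        return "string"
    return "unknown"
```

Nothing but `ch` is consulted, which is what lets the analyser be called as a simple subroutine.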

Once a character which makes up a simple-item has been read and recognised there is little more for the lexical analyser to do, though in some cases the character read may be the first character of more than one simple-item – e.g. in SIMULA-67 the character ‘:’ signals the start of

    the <assign> item ‘:=’
    the <ref-assign> item ‘:-’
    the <colon> item ‘:’

and in such cases the item definition is so designed that the analyser can separate the simple-items merely by reading the next character and deciding what to do. If the character which is read can start a class-item then the lexical analyser must read characters and follow the micro-syntax rules to determine whether each succeeding character is part of the item or not.
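A minimal Python sketch of separating the three ‘:’ items by reading one further character follows. The function name `read_colon_item` and the convention of returning the item name together with the position of the next unread character are assumptions made for this example.

```python
def read_colon_item(text, pos):
    """Return (item, next_pos) for the item that starts at text[pos] == ':'."""
    assert text[pos] == ":"
    # Look at the character after the colon; "" stands for end of input.
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if nxt == "=":
        return "assign", pos + 2        # ':='
    if nxt == "-":
        return "ref-assign", pos + 2    # ':-'
    return "colon", pos + 1             # ':' alone; nxt is left unread
```

For example, `read_colon_item("a := b", 2)` yields `("assign", 4)`, while a colon followed by any other character is returned as the colon item and the following character is left for the next call.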

Figure 4.4 shows part of a lexical analyser which accepts numbers[2] of the form shown in figure 4.3 (see chapter 15 for a discussion about the properties of this syntax description).

[2] Actually they are numerals – concrete representations of numbers. In this chapter I’ve

The analyser produces three strings ‘ipart’, ‘fpart’ and ‘epart’, representing the

4.1. READING THE SOURCE PROGRAM 61

    let LexAnalyse() be
    { let ipart, fpart, epart = empty, empty, empty

      switchon char into
      { case ‘0’ .. ‘9’:
        { /* read integer part */
          while ‘0’<=char<=‘9’ do
          { ipart := ipart ++ char; readch(@char) }

          if char = ‘.’ then
          { readch(@char)
            if ‘0’<=char<=‘9’ then goto point
            else Error("digit expected after point")
          }
          elsf char = ‘E’ then goto exponent
          else
          { lexitem := integernumber; lexvalue := ipart
            return
          }
        }

        case ‘.’:
        { readch(@char)
          if ‘0’<=char<=‘9’ then
          { /* read fractional part */
            point: while ‘0’<=char<=‘9’ do
            { fpart := fpart ++ char; readch(@char) }

            if char = ‘E’ then
            { /* read exponent */
              exponent: readch(@char)
              unless ‘0’<=char<=‘9’ do
                Error("digit expected in exponent")
              while ‘0’<=char<=‘9’ do
              { epart := epart ++ char; readch(@char) }
            }

            lexitem := realnumber
            lexvalue := node(ipart, fpart, epart)
            return
          }
          else { lexitem := ‘.’; return }
        }

        case ‘:’: .. /* distinguish ‘:’ and ‘:=’ */
        case ‘A’..‘Z’, ‘a’..‘z’: .. /* read identifier */
        ...
      }
    }

              S        I         F0        F1       F2        E1       E2

    0..9:     1, I     3, I      5, F2     5, F2    5, F2     6, E2    6, E2
    ‘.’:      2, F0    4, F1     9, exit   error    8, exit   error    8, exit
    ‘E’:      ....     4, E1     9, exit   error    4, E1     error    8, exit
    other:    ....     7, exit   9, exit   error    8, exit   error    8, exit

    action #1: ipart,fpart,epart := empty,empty,empty
    action #2: ipart,fpart,epart := empty,empty,empty; readch(@char)
    action #3: ipart := ipart++char; readch(@char)
    action #4: readch(@char)
    action #5: fpart := fpart++char; readch(@char)
    action #6: epart := epart++char; readch(@char)
    action #7: lexitem := integernumber; lexvalue := ipart; return
    action #8: lexitem := realnumber
               lexvalue := node(ipart, fpart, epart)
               return
    action #9: lexitem := ‘.’; return

integer part, fractional part and exponent part of a number. Most working lexical analysers convert numbers into their eventual run-time representation as they are read in, although there are arguments against this – see the discussion below.
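For readers who want to experiment, the logic of figure 4.4 can be approximated in Python without labels or gotos. This is a sketch under stated assumptions rather than a transcription: the names `read_numeral`, `LexError` and the helper functions are invented for the example, the input is a string with an explicit position instead of `readch(@char)`, and the three strings are returned as a tuple in place of `node(ipart, fpart, epart)`.

```python
class LexError(Exception):
    """Raised where figure 4.4 calls Error(...)."""


def _peek(text, pos):
    # "" stands for end of input, which is never a digit, '.' or 'E'.
    return text[pos] if pos < len(text) else ""


def _is_digit(c):
    return len(c) == 1 and "0" <= c <= "9"


def _take_digits(text, pos):
    # Collect a (possibly empty) run of digits starting at pos.
    start = pos
    while _is_digit(_peek(text, pos)):
        pos += 1
    return text[start:pos], pos


def read_numeral(text, pos=0):
    """Return (lexitem, lexvalue, next_pos) for a numeral or a lone '.'."""
    fpart = epart = ""
    ipart, pos = _take_digits(text, pos)          # read integer part
    if ipart:
        if _peek(text, pos) == ".":
            pos += 1
            if not _is_digit(_peek(text, pos)):
                raise LexError("digit expected after point")
            fpart, pos = _take_digits(text, pos)  # read fractional part
        elif _peek(text, pos) != "E":
            return "integernumber", ipart, pos
    elif _peek(text, pos) == ".":
        pos += 1
        if not _is_digit(_peek(text, pos)):
            return ".", None, pos                 # the '.' item on its own
        fpart, pos = _take_digits(text, pos)
    else:
        raise LexError("not the start of a numeral")
    if _peek(text, pos) == "E":                   # read exponent
        pos += 1
        if not _is_digit(_peek(text, pos)):
            raise LexError("digit expected in exponent")
        epart, pos = _take_digits(text, pos)
    return "realnumber", (ipart, fpart, epart), pos
```

The parts are kept as strings of characters, as in the figure, so the question of when to convert to a run-time representation is left open.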

Another way of describing the actions of figure 4.4, often used in ‘compiler factories’ where programmers need to be able to generate lexical analysers automatically, is to use a state-transition table, as illustrated in figure 4.5. Each row of the table represents an input character, each column a state of the analyser. Each square contains an <action, new-state> pair: the action which must be performed if that character is read in that analyser state, and the new state which the analyser should then take up. Figure 4.5 shows four rows of a possible table.
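The four rows of figure 4.5 can be driven by a very small interpreter, sketched here in Python. Several things are assumptions of the sketch: the names `TABLE`, `char_class` and `run_table`; the use of a string and a position in place of `readch(@char)`; the treatment of end of input as an ‘other’ character; and the treatment of the ‘....’ squares in state S (which belong to parts of the analyser not shown in the figure) as errors.

```python
def char_class(c):
    """Map a character onto one of the four table rows."""
    if len(c) == 1 and "0" <= c <= "9":
        return "digit"
    if c in (".", "E"):
        return c
    return "other"          # includes "" (end of input)


# TABLE[state][row] = (action number, new state); a missing square is an error.
TABLE = {
    "S":  {"digit": (1, "I"), ".": (2, "F0")},
    "I":  {"digit": (3, "I"), ".": (4, "F1"), "E": (4, "E1"), "other": (7, "exit")},
    "F0": {"digit": (5, "F2"), ".": (9, "exit"), "E": (9, "exit"), "other": (9, "exit")},
    "F1": {"digit": (5, "F2")},
    "F2": {"digit": (5, "F2"), ".": (8, "exit"), "E": (4, "E1"), "other": (8, "exit")},
    "E1": {"digit": (6, "E2")},
    "E2": {"digit": (6, "E2"), ".": (8, "exit"), "E": (8, "exit"), "other": (8, "exit")},
}


def run_table(text, pos=0):
    """Return (lexitem, lexvalue, next_pos) by interpreting TABLE."""
    ipart = fpart = epart = ""
    state = "S"
    while True:
        c = text[pos] if pos < len(text) else ""
        entry = TABLE[state].get(char_class(c))
        if entry is None:
            raise ValueError(f"lexical error in state {state} at position {pos}")
        action, state = entry
        if action == 1:                      # reset the three strings
            ipart = fpart = epart = ""
        elif action == 2:                    # reset, then skip the '.'
            ipart = fpart = epart = ""
            pos += 1
        elif action == 3:
            ipart += c
            pos += 1
        elif action == 4:                    # skip '.' or 'E'
            pos += 1
        elif action == 5:
            fpart += c
            pos += 1
        elif action == 6:
            epart += c
            pos += 1
        elif action == 7:
            return "integernumber", ipart, pos
        elif action == 8:
            return "realnumber", (ipart, fpart, epart), pos
        elif action == 9:
            return ".", None, pos
```

The interpreter itself knows nothing about numerals: changing the language accepted means changing only the table and the actions, which is what makes this form attractive for automatic generation.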

In document Understanding and Writing Compilers (Page 75-79)