4 Writing an Interpreter
4.2 Implementing a Simple Interpreter
4.2.2 The Reader
(Note: this section could be skimmed if you're not interested in how the reader works; it is just a front end to the evaluator, where the interesting work is done.)
We won't write our own reader for our interpreter, but I'll sketch how the reader works. (Our interpreter will just snarf the reader from the underlying Scheme system we're implementing it in, but it's good to know how we could write a reader, and it's a nice example of recursive programming.)
The reader is just the procedure read, which is written in terms of a few lower-level procedures that read individual tokens, which read then puts together into nested data
structures. A token is just a fairly simple item that doesn't have a nested structure. For example, lists nest, but symbol names don't, strings don't, and numbers don't.
The low-level routines that read uses just read individual tokens from the input (a stream of
characters). These tokens include symbols, strings, numbers, and parentheses. Parentheses are special, because they tell the reader when recursion is needed to read nested data structures.
(I haven't explained about character I/O, but don't worry: there are Scheme procedures for reading a character of input at a time, testing characters for equality, etc. For now, we'll ignore those details and I'll just sketch the overall structure of the reader.)
Let's assume we have a simple reader that only reads symbols, integers, and strings, and (possibly nested) lists made up of those things. It'll be pretty clear how to extend it to read other kinds of things.
4.2.2.1 Implementing read
read uses recursion to construct nested lists while reading through the character input from left
to right.
When it sees a left parenthesis, it calls an auxiliary procedure we'll call read-list to read the
elements of a list.
read and read-list are mutually recursive. read-list reads the elements of a list by calling read, which returns directly if the element is a simple token, or calls read-list recursively if the element is itself a nested
list.
Notice that hitting a right parenthesis is the termination condition for the recursion. If we're reading a sublist of a list, and hit a right parenthesis, read-list recognizes that as the sign to
stop, and returns a complete (nested) list to read.
Here's a slightly oversimplified version of read. (The main oversimplification is that we've left
out any error-checking code. We assume that what we're reading is a legal textual representation of a Scheme data structure. We also haven't dealt with reading from files, instead of the standard input, or what to do when reaching the end of a file.)
(Our little reader will use the standard Scheme procedure read-char to read one character of
input at a time, along with the predicates char-alphabetic? and char-numeric?, which tell whether a character represents a letter or a digit. We'll also use the character literals #\"
and #\(, which represent the double quote character and the left parenthesis character.)
[ this code is off the top of my head and needs to be debugged ]
(define (read)
  (let ((first-char (read-char)))
    (if (char=? first-char #\()              ; is the char a left parenthesis?
        (read-list '())                      ; yes: read a list, starting from empty
        (cond ((char-alphabetic? first-char)
               (read-symbol first-char))
              ((char-numeric? first-char)
               (read-number first-char))
              ((char=? first-char #\")       ; is the char a double quote?
               (read-string))))))
Notice that the first if is the recursion test. If we see a left parenthesis, which is a special token,
we call read-list to read a list. read-list can call read again to read the elements of the
list, so read is indirectly recursive.
If we're not reading a list, we call any of several auxiliary procedures to read tokens:
read-symbol. If the character we read is a letter, we're reading a symbol, so we call read-symbol to finish reading it. (We pass it the character we read, since it's the first character
of the symbol's print name.) read-symbol (not shown) just reads through more characters,
saving them until it hits a delimiter (a space or parenthesis). When it finishes reading the whole print name of the symbol, it checks the table of symbols to see if there's already a symbol by that name. If so, it just returns a pointer to it. If not, it constructs a symbol by that name, adds it to the table, and returns a pointer to that.
read-number. If the character we read is a digit, we're reading a number, so we call read-number. (We pass it the first character we read, since that's the first digit of the number.) read-number just reads through successive characters, saving them until it hits a delimiter such
as a space or parenthesis. Then it calls another procedure, string->number, which converts the
sequence of digit characters into a binary number in the usual Scheme number representation, and returns that.
read-string. If the character we read is a double quote ("), we're reading a string, so we
call read-string. (We don't have to pass it the character we read, since the double quote
isn't part of the string itself.) read-string just reads through characters, saving them until it hits another double quote. It
then calls another procedure that constructs a string with that sequence of characters.
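The three token-reading procedures described above might be sketched as follows. This is only a sketch under assumptions that are not in the text: each procedure takes an explicit input port (so it can be tried on a string port made with open-input-string), delimiter? and collect-chars are hypothetical helper names, the standard string->symbol stands in for the symbol-table lookup, and there is no error checking (an unterminated string, for example, is not handled).

```scheme
;; Sketch only: delimiter? and collect-chars are hypothetical helper names.
;; Every procedure takes an explicit port so it can be tried on string ports.

;; A token ends at end of input, at whitespace, or at a parenthesis.
(define (delimiter? c)
  (or (eof-object? c)
      (char-whitespace? c)
      (char=? c #\()
      (char=? c #\))))

;; Collect characters up to (but not including) the next delimiter.
;; chars-so-far holds the characters seen so far, most recent first.
(define (collect-chars port chars-so-far)
  (if (delimiter? (peek-char port))
      (list->string (reverse chars-so-far))
      (collect-chars port (cons (read-char port) chars-so-far))))

;; string->symbol does the symbol-table lookup: the same print name
;; always yields the same (eq?) symbol object.
(define (read-symbol first-char port)
  (string->symbol (collect-chars port (list first-char))))

;; string->number converts the saved digit characters into a number.
(define (read-number first-char port)
  (string->number (collect-chars port (list first-char))))

;; The opening double quote has already been consumed by the caller.
(define (read-string port)
  (let loop ((chars-so-far '()))
    (let ((c (read-char port)))
      (if (char=? c #\")
          (list->string (reverse chars-so-far))
          (loop (cons c chars-so-far))))))

;; Usage: in each case the first character has already been read.
(read-symbol #\f (open-input-string "oo bar"))        ; => foo
(read-number #\4 (open-input-string "2)"))            ; => 42
(read-string (open-input-string "hi there\" rest"))   ; => "hi there"
```

Note that collect-chars only peeks at the delimiter and leaves it in the input, so a right parenthesis that ends a symbol or number is still there for the caller to see.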
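4.2.2.2 Implementing read-list
Here's a slightly oversimplified version of read-list. (Again, the main oversimplification is that
we don't check for illegal syntax, like extra closing parentheses.) It uses the standard procedure peek-char, which returns the next character of input without consuming it, so that character is still there for the next call to read-char.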
[ this code is off the top of my head and needs to be debugged ]
(define (read-list list-so-far)
  (let ((next-char (peek-char)))
    (cond ((char=? next-char #\))          ; if we hit a right parenthesis,
           (read-char)                     ; consume it, then return the list we've
           (reverse list-so-far))          ; read, reversing it into proper order
          ((char-whitespace? next-char)    ; skip whitespace between elements
           (read-char)
           (read-list list-so-far))
          (else                            ; read next item, push it, and call self
           (read-list (cons (read) list-so-far))))))

Notice that we've coded read-list recursively in two ways.

First, we code the iteration that reads successive items in the list as a recursion, passing the list so far as an argument to the recursive call. Because nothing remains to be done after that call returns, it is a tail call.

Second, we read each list element by calling read, which calls read-list again whenever the element is itself a list. We cons each element onto the list so far, and pass that
to a recursive call to read-list. This constructs a list that's backwards, because we push later
elements onto the front of the list. When we hit a right parenthesis and end a recursive call, we consume the parenthesis and reverse the list we've read, to put it in the proper order.
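To try the mutual recursion out, here is a small self-contained variant that can be run on a string. It is a sketch under several assumptions that are not in the text: the procedures are named sketch-read and sketch-read-list so they don't shadow the built-in read, each takes an explicit input port, skip-whitespace, delimiter?, collect-chars and read-string-chars are hypothetical helpers, and there is still no error checking.

```scheme
;; Sketch only. All procedure names here are hypothetical, and illegal
;; input (unbalanced parentheses, unterminated strings) is not handled.

(define (skip-whitespace port)
  (let ((c (peek-char port)))
    (if (and (char? c) (char-whitespace? c))
        (begin (read-char port) (skip-whitespace port)))))

(define (delimiter? c)
  (or (eof-object? c) (char-whitespace? c)
      (char=? c #\() (char=? c #\))))

;; Collect the rest of a symbol or number, stopping before the delimiter.
(define (collect-chars port chars-so-far)
  (if (delimiter? (peek-char port))
      (list->string (reverse chars-so-far))
      (collect-chars port (cons (read-char port) chars-so-far))))

;; Collect the rest of a string, consuming the closing double quote.
(define (read-string-chars port chars-so-far)
  (let ((c (read-char port)))
    (if (char=? c #\")
        (list->string (reverse chars-so-far))
        (read-string-chars port (cons c chars-so-far)))))

(define (sketch-read port)
  (skip-whitespace port)
  (let ((first-char (read-char port)))
    (cond ((char=? first-char #\()          ; start of a (possibly nested) list
           (sketch-read-list port '()))
          ((char-alphabetic? first-char)
           (string->symbol (collect-chars port (list first-char))))
          ((char-numeric? first-char)
           (string->number (collect-chars port (list first-char))))
          ((char=? first-char #\")
           (read-string-chars port '())))))

(define (sketch-read-list port list-so-far)
  (skip-whitespace port)
  (if (char=? (peek-char port) #\))
      (begin (read-char port)               ; consume the right parenthesis
             (reverse list-so-far))         ; and put the elements in order
      (sketch-read-list port (cons (sketch-read port) list-so-far))))

(sketch-read (open-input-string "(foo 42 (bar \"baz\") ())"))
;; => (foo 42 (bar "baz") ())
```

The inner list (bar "baz") is read by a nested call to sketch-read-list, which returns when it consumes its own right parenthesis. The outer call then carries on with the remaining elements.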
4.2.2.3 Comments on the Reader
The reader is really a simple kind of recursive descent parser. (A parser converts a sequence of tokens into a syntax tree that describes the nesting of expressions or statements.) It is a "top-down"
parser, because it recognizes high-level structures before lower-level ones (e.g., it recognizes the beginning of a list before reading and recognizing the tokens and sublists inside it).1 2
It converts a linear sequence of characters into a simple parse tree.
(If you're familiar with standard compiler terminology, you should recognize that read performs
lexical analysis (a.k.a. scanning or tokenization) using read-string, read-symbol, and read-number. It performs predictive recursive-descent parsing via the mutual recursion of read and read-list.)
Unlike most parsers, the data structure read generates is a data structure in the Scheme language,
rather than a data structure internal to a compiler or interpreter. This is one of the nice things about Scheme: there's a simple but flexible parser you can use in your own programs. You can use it for parsing data as well as programs.
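As a small illustration of using the reader to parse data rather than programs, the sketch below reads a settings record written in s-expression form. It assumes a Scheme that provides open-input-string (as R7RS does). The record layout and the name lookup-field are made up for the example.

```scheme
;; Sketch only: the record layout and the name lookup-field are
;; illustrative assumptions, not part of any particular library.

;; Textual data in s-expression form, e.g. the contents of a settings file.
(define settings-text "((width 80) (height 24) (title \"demo\"))")

;; read does all the parsing; open-input-string supplies a port over the text.
(define settings (read (open-input-string settings-text)))

;; The result is an ordinary association list, so assq works on it directly.
(define (lookup-field name)
  (cadr (assq name settings)))

(lookup-field 'width)    ; => 80
(lookup-field 'title)    ; => "demo"
```

No separate parser had to be written for the settings format. The nesting, the symbols, the numbers and the string were all recognized by read.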
When implementing the Scheme language, that's not all there is to doing parsing. The reader does the first part of parsing, translating input into s-expressions. The rest of parsing is done during interpretation or compilation, in a very straightforward way. The rest of the parsing isn't much more complicated than reading, and is also done recursively.3
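To give a feel for that second stage, here is a hedged sketch of a recursive check over s-expressions the reader has already produced. It is not the interpreter developed in this chapter. The name well-formed? and its helpers are hypothetical, and the only special forms it knows about are if and lambda, chosen just for illustration.

```scheme
;; Sketch only: well-formed? and its helpers are hypothetical names, and
;; the subset of special forms (if and lambda) is chosen for illustration.

(define (all-symbols? lst)
  (or (null? lst)
      (and (pair? lst) (symbol? (car lst)) (all-symbols? (cdr lst)))))

(define (all-well-formed? exprs)
  (or (null? exprs)
      (and (pair? exprs)
           (well-formed? (car exprs))
           (all-well-formed? (cdr exprs)))))

(define (well-formed? expr)
  (cond ((or (symbol? expr) (number? expr) (string? expr)) #t)
        ((not (pair? expr)) #f)
        ((eq? (car expr) 'if)              ; (if test consequent alternative)
         (and (list? expr) (= (length expr) 4)
              (all-well-formed? (cdr expr))))
        ((eq? (car expr) 'lambda)          ; (lambda (param ...) body)
         (and (list? expr) (= (length expr) 3)
              (all-symbols? (cadr expr))
              (well-formed? (caddr expr))))
        (else                              ; application: (operator operand ...)
         (all-well-formed? expr))))

;; Both inputs are legal as far as the reader is concerned, but only
;; the first has the shape this sketch requires of an if expression.
(well-formed? (read (open-input-string "(if (< x 0) (- x) x)")))   ; => #t
(well-formed? (read (open-input-string "(if x)")))                 ; => #f
```

The check recurs over subexpressions in the same way read-list recurs over sublists, but it works on lists and symbols rather than on characters.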
1 Unsurprisingly, a bottom-up parser would do the opposite: it would recognize the smaller
constituents first, and then recognize the larger groupings that enclose them.
2 In the technical terminology of programming language processors, the reader is a predictive
parser for an LL(1) grammar. It can parse s-expressions top-down in a single pass through the sequence of tokens, without looking ahead more than one token, because it only needs to see the next token to know what action to take. (E.g., if it sees a left parenthesis, it immediately "knows" that it is parsing a nested list.)
3 It's often said that Lisp and Scheme have such a simple syntax that they "don't need a parser,"
but this is just false. Lisp and Scheme actually have two parsers, because their syntax has
two levels. The "surface" syntax is parenthesized prefix expressions, recognized by the reader, but there is a "deeper" syntax that is recognized by the interpreter or compiler, which analyzes s-expressions in the process of evaluating or compiling them.
As we'll see when we get to macros, Scheme syntax is even more sophisticated than this, despite its simplicity. Technically, Scheme has a transformational grammar that is not "context-free,"
but is easy to parse. (If you don't know what that means, don't worry about it. Scheme is easy to understand without knowing the fancy technical terms.)