Implementing the Pratt Parser - Thorsten Ball-Writing an interpreter in Go (2017).pdf

A Pratt parser’s main idea is the association of parsing functions (which Pratt calls “semantic code”) with token types. Whenever this token type is encountered, the parsing functions are called to parse the appropriate expression and return an AST node that represents it. Each token type can have up to two parsing functions associated with it, depending on whether the token is found in a prefix or an infix position.

The first thing we need to do is to set up these associations. We define two types of functions: prefix parsing functions and infix parsing functions.

// parser/parser.go

type (
	prefixParseFn func() ast.Expression
	infixParseFn  func(ast.Expression) ast.Expression
)

Both function types return an ast.Expression, since that’s what we’re here to parse. But only the infixParseFn takes an argument: another ast.Expression. This argument is the “left side” of the infix operator that’s being parsed. A prefix operator doesn’t have a “left side”, by definition. I know that this doesn’t make a lot of sense yet, but bear with me here, you’ll see how this works. For now, just remember that a prefixParseFn gets called when we encounter the associated token type in prefix position and an infixParseFn gets called when we encounter the token type in infix position.

In order for our parser to get the correct prefixParseFn or infixParseFn for the current token type, we add two maps to the Parser structure:

// parser/parser.go

type Parser struct {
	l      *lexer.Lexer
	errors []string

	curToken  token.Token
	peekToken token.Token

	prefixParseFns map[token.TokenType]prefixParseFn
	infixParseFns  map[token.TokenType]infixParseFn
}

With these maps in place, we can just check if the appropriate map (infix or prefix) has a parsing function associated with curToken.Type.

We also give the Parser two helper methods that add entries to these maps:

// parser/parser.go

func (p *Parser) registerPrefix(tokenType token.TokenType, fn prefixParseFn) {
	p.prefixParseFns[tokenType] = fn
}

func (p *Parser) registerInfix(tokenType token.TokenType, fn infixParseFn) {
	p.infixParseFns[tokenType] = fn
}
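To see the two maps and their helper methods working together, here is a minimal, self-contained sketch. It is not the book's code: TokenType and Expression are simplified stand-ins for token.TokenType and ast.Expression, the token names are made up for illustration, and newParser and roles are hypothetical helpers added only for this example.

```go
package main

import "fmt"

// TokenType and Expression are simplified stand-ins for the book's
// token.TokenType and ast.Expression, so this sketch compiles on its own.
type TokenType string
type Expression interface{}

type (
	prefixParseFn func() Expression
	infixParseFn  func(Expression) Expression
)

type Parser struct {
	prefixParseFns map[TokenType]prefixParseFn
	infixParseFns  map[TokenType]infixParseFn
}

// newParser is a hypothetical constructor that only initializes the maps.
func newParser() *Parser {
	return &Parser{
		prefixParseFns: make(map[TokenType]prefixParseFn),
		infixParseFns:  make(map[TokenType]infixParseFn),
	}
}

func (p *Parser) registerPrefix(t TokenType, fn prefixParseFn) { p.prefixParseFns[t] = fn }
func (p *Parser) registerInfix(t TokenType, fn infixParseFn)   { p.infixParseFns[t] = fn }

// roles is a hypothetical helper that reports in which positions
// a token type has a parsing function registered.
func (p *Parser) roles(t TokenType) (hasPrefix, hasInfix bool) {
	_, hasPrefix = p.prefixParseFns[t]
	_, hasInfix = p.infixParseFns[t]
	return hasPrefix, hasInfix
}

func main() {
	p := newParser()

	// A token type can have up to two functions: one per position.
	// The function bodies are placeholders; only the registration matters here.
	p.registerPrefix("MINUS", func() Expression { return nil })
	p.registerInfix("MINUS", func(left Expression) Expression { return left })
	p.registerPrefix("IDENT", func() Expression { return nil })

	for _, t := range []TokenType{"MINUS", "IDENT", "SEMICOLON"} {
		prefix, infix := p.roles(t)
		fmt.Printf("%s: prefix=%t infix=%t\n", t, prefix, infix)
	}
}
```

In this sketch MINUS ends up with a function in both maps, IDENT only in the prefix map, and SEMICOLON in neither, which is exactly the "up to two parsing functions per token type" idea described above.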

Now we are ready to get to the heart of the algorithm.

Identifiers

We’re going to start with possibly the simplest expression type in the Monkey programming language: identifiers. Used in an expression statement, an identifier looks like this:

foobar;

Of course, the name foobar is arbitrary, and identifiers are expressions in other contexts too, not just in an expression statement:

add(foobar, barfoo);
foobar + barfoo;
if (foobar) {
	// [...]
}

Here we have identifiers as arguments in a function call, as operands in an infix expression and as a standalone expression as part of a conditional. They can be used in all of these contexts, because identifiers are expressions just like 1 + 2. And just like any other expression, identifiers produce a value: they evaluate to the value they are bound to.

We start with a test:

// parser/parser_test.go

func TestIdentifierExpression(t *testing.T) {
	input := "foobar;"

	l := lexer.New(input)
	p := New(l)
	program := p.ParseProgram()
	checkParserErrors(t, p)

	if len(program.Statements) != 1 {
		t.Fatalf("program has not enough statements. got=%d",
			len(program.Statements))
	}

	stmt, ok := program.Statements[0].(*ast.ExpressionStatement)
	if !ok {
		t.Fatalf("program.Statements[0] is not ast.ExpressionStatement. got=%T",
			program.Statements[0])
	}

	ident, ok := stmt.Expression.(*ast.Identifier)
	if !ok {
		t.Fatalf("exp not *ast.Identifier. got=%T", stmt.Expression)
	}
	if ident.Value != "foobar" {
		t.Errorf("ident.Value not %s. got=%s", "foobar", ident.Value)
	}
	if ident.TokenLiteral() != "foobar" {
		t.Errorf("ident.TokenLiteral not %s. got=%s", "foobar",
			ident.TokenLiteral())
	}
}

That’s a lot of lines, but it’s mostly just grunt work. We parse our input foobar;, check the parser for errors, make an assertion about the number of statements in the *ast.Program node and then check that the only statement in program.Statements is an *ast.ExpressionStatement. Then we check that the *ast.ExpressionStatement.Expression is an *ast.Identifier. Finally we check that our identifier has the correct value of "foobar".

Of course, the parser tests fail:

$ go test ./parser
--- FAIL: TestIdentifierExpression (0.00s)
	parser_test.go:110: program has not enough statements. got=0
FAIL
FAIL	monkey/parser	0.007s

The parser doesn’t know anything about expressions yet. We need to write a parseExpression method.

The first thing we need to do is to extend the parseStatement() method of the parser, so that it parses expression statements. Since the only two real statement types in Monkey are let and return statements, we try to parse expression statements if we don’t encounter one of the other two:

// parser/parser.go

func (p *Parser) parseStatement() ast.Statement {
	switch p.curToken.Type {
	case token.LET:
		return p.parseLetStatement()
	case token.RETURN:
		return p.parseReturnStatement()
	default:
		return p.parseExpressionStatement()
	}
}

The parseExpressionStatement method looks like this:

func (p *Parser) parseExpressionStatement() *ast.ExpressionStatement {
	stmt := &ast.ExpressionStatement{Token: p.curToken}

	stmt.Expression = p.parseExpression(LOWEST)

	if p.peekTokenIs(token.SEMICOLON) {
		p.nextToken()
	}

	return stmt
}

We already know the drill: we build our AST node and then try to fill its field by calling other parsing functions. In this case there are a few differences though: we call parseExpression(), which doesn’t exist yet, with the constant LOWEST, which doesn’t exist yet either, and then we check for an optional semicolon. Yes, it’s optional. If the peekToken is a token.SEMICOLON, we advance so it’s the curToken. If it’s not there, that’s okay too; we don’t add an error to the parser. That’s because we want expression statements to have optional semicolons (which makes it easier to type something like 5 + 5 into the REPL later on).

If we now run the tests we can see that compilation fails, because LOWEST is undefined. That’s alright, let’s add it now, by defining the precedences of the Monkey programming language:

// parser/parser.go

const (
	_ int = iota
	LOWEST
	EQUALS      // ==
	LESSGREATER // > or <
	SUM         // +
	PRODUCT     // *
	PREFIX      // -X or !X
	CALL        // myFunction(X)
)

Here we use iota to give the following constants incrementing numbers as values. The blank identifier _ takes the zero value and the following constants get assigned the values 1 to 7. Which numbers we use doesn’t matter, but the order and the relation to each other do. What we want out of these constants is to later be able to answer: “does the * operator have a higher precedence than the == operator? Does a prefix operator have a higher precedence than a call expression?”
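As a quick check of how these constants behave, the following standalone sketch repeats the const block and answers the two questions with plain integer comparisons. The main function exists only for this illustration and is not part of the parser.

```go
package main

import "fmt"

// The same precedence table as in the parser: iota starts at 0 for the
// blank identifier, so LOWEST is 1 and CALL is 7.
const (
	_ int = iota
	LOWEST
	EQUALS      // ==
	LESSGREATER // > or <
	SUM         // +
	PRODUCT     // *
	PREFIX      // -X or !X
	CALL        // myFunction(X)
)

func main() {
	fmt.Println(LOWEST, CALL) // 1 7

	// Does the * operator have a higher precedence than the == operator?
	fmt.Println(PRODUCT > EQUALS) // true

	// Does a prefix operator have a higher precedence than a call expression?
	fmt.Println(PREFIX > CALL) // false
}
```

Because only the relative order matters, inserting a new level between two existing ones shifts the numbers but leaves every comparison between the old levels unchanged.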

In parseExpressionStatement we pass the lowest possible precedence to parseExpression, since we didn’t parse anything yet and we can’t compare precedences. That’s going to make more sense in a short while, I promise. Let’s write parseExpression:

// parser/parser.go

func (p *Parser) parseExpression(precedence int) ast.Expression {
	prefix := p.prefixParseFns[p.curToken.Type]
	if prefix == nil {
		return nil
	}
	leftExp := prefix()

	return leftExp
}

That’s the first version. All it does is check whether we have a parsing function associated with p.curToken.Type in the prefix position. If we do, it calls this parsing function; if not, it returns nil. And nil is what it returns at the moment, since we haven’t associated any tokens with any parsing functions yet. That’s our next step:

// parser/parser.go

func New(l *lexer.Lexer) *Parser {
	// [...]

	p.prefixParseFns = make(map[token.TokenType]prefixParseFn)
	p.registerPrefix(token.IDENT, p.parseIdentifier)

	// [...]
}

func (p *Parser) parseIdentifier() ast.Expression {
	return &ast.Identifier{Token: p.curToken, Value: p.curToken.Literal}
}

We modified the New() function to initialize the prefixParseFns map on Parser and register a parsing function: if we encounter a token of type token.IDENT the parsing function to call is parseIdentifier, a method we defined on *Parser.
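Why registration makes the difference can be shown in isolation. In Go, indexing a map with a missing key yields the zero value of the element type, which for a function type is nil. The sketch below is a simplified illustration, not the parser itself: TokenType and Expression are stand-in types, lookupAndParse is a hypothetical free function that mirrors the first version of parseExpression, and the registered function returns a plain string instead of an AST node.

```go
package main

import "fmt"

// Simplified stand-ins for token.TokenType and ast.Expression.
type TokenType string
type Expression interface{}

type prefixParseFn func() Expression

// lookupAndParse is a hypothetical helper mirroring the first version of
// parseExpression: look up the prefix function, bail out with nil if none
// is registered, otherwise call it.
func lookupAndParse(fns map[TokenType]prefixParseFn, cur TokenType) Expression {
	prefix := fns[cur] // a missing key yields a nil function value
	if prefix == nil {
		return nil
	}
	return prefix()
}

func main() {
	fns := make(map[TokenType]prefixParseFn)

	// Nothing registered yet: the lookup finds no function.
	fmt.Println(lookupAndParse(fns, "IDENT")) // <nil>

	// After registration the same lookup calls the function.
	fns["IDENT"] = func() Expression { return "foobar" }
	fmt.Println(lookupAndParse(fns, "IDENT")) // foobar
}
```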

The parseIdentifier method doesn’t do a lot. It only returns a *ast.Identifier with the current token in the Token field and the literal value of the token in Value. It doesn’t advance the tokens, it doesn’t call nextToken. That’s important. All of our parsing functions, prefixParseFn or infixParseFn, are going to follow this protocol: start with curToken being the type of token you’re associated with and return with curToken being the last token that’s part of your expression type. Never advance the tokens too far.
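The protocol is easier to see with a toy parser that walks a fixed slice of tokens instead of using the lexer. Everything in the following sketch is an assumption made for illustration: miniParser, newMiniParser and the string token types are hypothetical, and parseExpressionStatement is reduced to the identifier case. It shows that parseIdentifier leaves curToken where it found it, and that only the optional semicolon check advances the tokens.

```go
package main

import "fmt"

// Token and Identifier are simplified stand-ins for token.Token and ast.Identifier.
type Token struct {
	Type    string
	Literal string
}

type Identifier struct {
	Token Token
	Value string
}

// miniParser is a hypothetical parser over a fixed token slice.
type miniParser struct {
	tokens    []Token
	pos       int
	curToken  Token
	peekToken Token
}

func newMiniParser(tokens []Token) *miniParser {
	p := &miniParser{tokens: tokens}
	// Read two tokens so that curToken and peekToken are both set.
	p.nextToken()
	p.nextToken()
	return p
}

func (p *miniParser) nextToken() {
	p.curToken = p.peekToken
	if p.pos < len(p.tokens) {
		p.peekToken = p.tokens[p.pos]
		p.pos++
	} else {
		p.peekToken = Token{Type: "EOF"}
	}
}

// parseIdentifier builds the node from curToken and does not call nextToken.
func (p *miniParser) parseIdentifier() *Identifier {
	return &Identifier{Token: p.curToken, Value: p.curToken.Literal}
}

// parseExpressionStatement parses the identifier and then consumes a
// semicolon only if one follows; a missing semicolon is not an error.
func (p *miniParser) parseExpressionStatement() *Identifier {
	ident := p.parseIdentifier()
	if p.peekToken.Type == ";" {
		p.nextToken()
	}
	return ident
}

func main() {
	inputs := [][]Token{
		{{"IDENT", "foobar"}, {";", ";"}}, // foobar;
		{{"IDENT", "foobar"}},             // foobar
	}
	for _, input := range inputs {
		p := newMiniParser(input)
		ident := p.parseExpressionStatement()
		fmt.Printf("value=%s curToken=%q\n", ident.Value, p.curToken.Type)
	}
	// value=foobar curToken=";"
	// value=foobar curToken="IDENT"
}
```

In both runs the statement ends with curToken sitting on the last token that belongs to it: the semicolon when there is one, the identifier itself when there is not.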

Believe it or not, our tests pass:

$ go test ./parser
ok	monkey/parser	0.007s

We successfully parsed an identifier expression! Alright! But before we get off the computer, find someone and proudly tell them, let’s hold our breath a little longer and write some more parsing functions.
