3.3 Csound 7: New Parser, New Possibilities
3.3.1 Parser3
Parser3 is a new parser design and implementation for Csound 7. It is based on the NewParser that was introduced in Csound 5, using the same Flex and Bison tools for lexer and parser generation. However, it takes a very different strategy to parsing than the NewParser. This new strategy was designed to address aspects of semantic analysis in the NewParser design that limited the extensibility of the Csound Orchestra language. The following will discuss motivations for pursuing a new parser design, followed by the design and implementation of Parser3.
Motivations
The primary problem of the NewParser is that its design introduced some aspects of semantic analysis early in the compiling process, specifically in the tokenizing (i.e., lexing) step. These tokens then were used throughout the Bison grammar for the parser with the result that, arguably, the grammar was overly complicated. As a result, implementing new language features for Csound 7 was becoming difficult using the existing grammar.
The design of the NewParser originated in Csound5, where the initial implementation was closely based upon the design and implementation of the OldParser. With the Csound5 NewParser, semantic analysis and verification were being done in various areas of the lexer, parser, and compiler. Following the design of the OldParser allowed for easier verification that the NewParser was generating equivalent results, as well as following the same rules, as the OldParser.
In Csound6, the semantic analysis code that was found in the compiler was separated out into its own phase, run after parsing but before compilation. This was a marked improvement in the clarity of the compiler code and simplified modification for both the analysis and compilation phases. However, the semantic analysis of tokens to determine if they were things like opcode names or reserved identifiers was still being done within the lexer. As a result, the grammar of the parser was still defined in terms of the numerous tokens generated by the lexer, which led to the writing of some complex rules.
The goal for Parser3 then was to continue the work started in Csound6 and to move all semantic analysis into the specific phase run after parsing. Doing so would simplify the specifications for both the lexer and parser. That in turn would make the parser easier to maintain as well as extend.
Implementation
The implementation of Parser3 moved all semantic analysis from the lexer and parser into the semantic analysis phase. Changes were required in each of those parts. They will be discussed individually below.
Lexer Firstly, all lookup-related code was removed from the lexer. In the NewParser, before any compiling was done, a special table was loaded that contained a copy of all opcode names and whether they were T_OPCODE or T_OPCODE0 token types. At parse time, any time an identifier was found (identifiers are words made up of an initial letter, followed by zero or more letters, numbers, or underscores), the lexer would first do a lookup in the special table to see if it was an opcode. If so, the lexer would emit one of the two token types found in the special table. If not, the lexer would emit the token as just an identifier using the T_IDENT type.
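The lookup described above can be sketched roughly as follows. This is an illustrative Python model, not the Flex code used in Csound: the function name classify_newparser, the table contents, and the assignment of token types to particular opcode names are all assumptions made for the example, and the separate lexer rules for reserved identifiers are omitted.

```python
import re

# Hypothetical stand-in for the "special table" loaded before compiling.
# The entries and their token types are illustrative only.
OPCODE_TABLE = {
    "oscil": "T_OPCODE",
    "outs": "T_OPCODE0",
}

# An identifier: an initial letter, then letters, digits, or underscores.
IDENT_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def classify_newparser(word):
    """Return the token type a NewParser-style lexer would emit for `word`."""
    if not IDENT_RE.fullmatch(word):
        raise ValueError(f"not an identifier: {word!r}")
    # Semantic knowledge leaks into lexing here: the token type depends on
    # whether the word happens to name an opcode.
    return OPCODE_TABLE.get(word, "T_IDENT")
```

The point of the sketch is that the lexer's output depends on the opcode table, so the grammar downstream has to be written in terms of several token types rather than one.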
For Parser3, the special table, the table initialisation code, and the opcode lookup were all removed. Instead, when an identifier was found, it would always emit a token with T_IDENT type. The rules in the lexer to identify reserved identifiers (sr, kr, ksmps, nchnls, and nchnls_i) were also removed. This removed all semantic knowledge about what an identifier meant from the lexer.
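A minimal sketch of this simplified behaviour, again as an illustrative Python model rather than the actual Flex specification: the token names T_NUMBER and PUNCT and the function lex are hypothetical, and only T_IDENT comes from the text. Every word, including reserved names such as sr, receives the same token type.

```python
import re

# Hypothetical token patterns; only T_IDENT is named in the text.
TOKEN_RE = re.compile(
    r"""
    (?P<T_NUMBER>\d+(?:\.\d+)?)
  | (?P<T_IDENT>[A-Za-z][A-Za-z0-9_]*)
  | (?P<PUNCT>[,()=])
  | (?P<SKIP>\s+)
    """,
    re.VERBOSE,
)


def lex(line):
    """Tokenize one line without consulting any opcode or keyword table."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at column {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For example, lex("asig oscil 0.5, 440") tags both asig and oscil as T_IDENT; deciding which of them is an opcode is left entirely to a later phase.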
Parser Next, the grammar was rewritten to use only identifiers. In the NewParser, rules were written using the semantically aware tokens. It was here that language ambiguities were also handled, which required knowledge about the types of tokens. This wove together both recognition of the structure of the language as well as the meaning of the language.
The Csound Orchestra language has known ambiguities in its opcode-call syntax. For example, given a line containing two words, such as “word word2”, it is ambiguous whether the statement calls a word opcode with a single input argument word2, or a word2 opcode with a single output argument word.
With the knowledge of whether one of these words was an opcode name, the ambiguity could be resolved. The NewParser then was able to generate a single tree format for all opcode statements. Consequently, the NewParser’s semantic analyser and compiler could treat all opcode statements in the same way. While this worked to handle the ambiguities and simplify the compiler, it also complicated the grammar.
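The two-word case can be illustrated with a small sketch. The function resolve_two_words and the dictionary layout are hypothetical, chosen only to show how knowing the set of opcode names selects one reading; the case where both or neither word names an opcode is simply reported as unresolved here, whereas the real implementation applies further checks.

```python
def resolve_two_words(first, second, opcode_names):
    """Pick an interpretation of the statement `first second`."""
    first_is_opcode = first in opcode_names
    second_is_opcode = second in opcode_names
    if first_is_opcode and not second_is_opcode:
        # "word word2": word is the opcode, word2 is its input argument.
        return {"opcode": first, "outputs": [], "inputs": [second]}
    if second_is_opcode and not first_is_opcode:
        # "word word2": word2 is the opcode, word is its output argument.
        return {"opcode": second, "outputs": [first], "inputs": []}
    raise ValueError(f"cannot resolve statement: {first} {second}")
```

Either reading yields the same uniform record of opcode, outputs, and inputs, which is what allows the later phases to treat all opcode statements alike.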
For Parser3, the grammar was updated to reflect the changes from the lexer. All rules were rewritten using only identifiers, which saw a number of rules removed. However, parsing opcode-statements now required a more complex set of rules (shown in Listing 3.28).
opcall : identifier NEWLINE
       | out_arg_list expr_list NEWLINE
       | out_arg_list '(' ')' NEWLINE
       | out_arg_list identifier expr_list NEWLINE
       ;
Listing 3.28: opcall rule in Parser3
With the opcall rule, four different tree formats could be generated for opcode calls, depending on the structure of the opcode call statement. opcall became a sort of catch-all rule. It would still only match opcode-statements that would be valid in the NewParser, so that aspect was not lost. However, the generated TREEs for opcode-statements could no longer be used as-is by the analyser or compiler.
Semantic Analyser In the NewParser, while opcode names were recognised in the lexer, the actual lookup of the OENTRY for an opcode name was not done until the semantic analysis phase. The OENTRY defines the opcode, including its input and output argument types. This information was necessary only when verifying that opcode use was semantically correct.
In Parser3, the semantic analyser largely stayed the same with the exception of one additional step. Previously, when the analyser encountered an opcode-statement, the TREE structures were all formed in the same way. Now, when the analyser first encounters an opcode-statement, it will run a TREE rewriting step to re-form trees into the same structure as was previously used in the NewParser. With the addition of this disambiguation step, the rest of the analyser could continue to function as-is, as could the compiler.
Note, the general algorithm applied in the TREE rewriting was designed to follow the same exact process found in the NewParser. This reads through the words found in the TREE, checks to see if they are opcode names, then checks against the variable pools, and so on. By applying the same algorithm here, the same process of disambiguation was successfully moved from the lexer and parser to the analyser.
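The rewriting step might be modelled along the following lines. This is a simplified sketch, not the Csound implementation: the OpcodeCall record, the function rewrite_opcall, and the flat list of words standing in for a TREE are all hypothetical. It also assumes that a word found in the variable pool is treated as a variable even if it shares its name with an opcode, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class OpcodeCall:
    """Uniform shape for an opcode statement after rewriting."""
    opcode: str
    outputs: list
    inputs: list


def rewrite_opcall(words, opcode_names, variables):
    """Re-form a flat sequence of identifiers into one uniform structure.

    `words` holds the identifiers of a statement in source order.
    """
    for index, word in enumerate(words):
        # Assumption: known variables shadow opcode names.
        if word in opcode_names and word not in variables:
            return OpcodeCall(word, words[:index], words[index + 1:])
    raise ValueError("no opcode name found in statement")
```

Under these assumptions, the words asig, oscil, kamp, kfreq with oscil as the only opcode name produce an OpcodeCall whose outputs are [asig] and whose inputs are [kamp, kfreq], regardless of which grammar alternative originally matched the statement.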
Summary
Parser3 provides a new approach compared to the NewParser. All semantic analysis has now been removed from the lexer and parser and moved to the semantic analysis phase. Resolution of the language ambiguities present in the Csound Orchestra language was consequently moved to a single location in the analyser. Parser3 also remains backwards compatible with the NewParser, meaning all previous code that could be processed with the NewParser is also valid with Parser3.
The result is that the lexer and grammar specifications have been simplified, making them easier for core developers to maintain and extend. This work provides a foundation for other language developments in Csound 7.