Code generation tools
3.4 BUILDING THE LANGUAGE PARSER TOOLKIT
3.4.2 Inside the LanguageScanners
Layered on top of the Tokenizers is a set of language scanners. There is one language scanner for each language the toolkit supports. Figure 3.5 illustrates the relationship between these classes.
All of the classes inherit from the LanguageScanner base class, which is an interface providing the one common entry point, parse(), that takes a TokenStream. On the SQL side of the fence the basic SQLLanguageScanner class adds the parsing for tables, and PostgreSQLScanner adds support for parsing PostgreSQL stored procedure prototypes.
The C language tree starts with CLanguageScanner, which reads C files and builds a list of function prototypes and #define preprocessor macro values. CPPLanguageScanner derives from the C scanner and parses C++. In order to handle C++ classes, it adds an array of class objects that define instance variables and methods.
Finally, as you’d expect, JavaLanguageScanner parses Java code. It parses the Java class and creates an internal store of instance variables and methods as well as their related JavaDoc comments.
Next, we’ll examine each of these classes. The code for the language scanners is available on the book’s web site.
The LanguageScanner class
The LanguageScanner class and its related classes provide the foundation for all of the specialized language scanners. It contains the basic constructs—prototypes, classes, and class variables—that are used by all of the other scanners. See figure 3.6.
[Class diagram: LanguageScanner (+parse(in tokens)) is the base class. SQLLanguageScanner (-tables) has the subclass PostgreSQLScanner (-prototypes). CLanguageScanner (-prototypes, -defines) has the subclass CPPLanguageScanner (-classes). JavaLanguageScanner (-classes).]
Figure 3.5 The family tree of the LanguageScanner classes
The LanguageScanner class defines an interface with a single entry point that all of the derived scanners implement. Each language scanner stores the features specific to that language. For example, SQLLanguageScanner stores table information, while CLanguageScanner stores function prototypes and #define values.
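The single-entry-point idea described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the LanguageScanner body shown here and the WordCountScanner class are hypothetical stand-ins, not the toolkit's actual code, and a plain array of strings takes the place of a TokenStream.

```ruby
# Hypothetical sketch; the real toolkit classes may be organized differently.

# Base class acting as the interface: it exposes one entry point, parse().
class LanguageScanner
  # tokens is assumed to be an array-like TokenStream.
  def parse(tokens)
    raise NotImplementedError, "#{self.class} must implement parse"
  end
end

# A toy derived scanner. The word counts stand in for the language-specific
# features (tables, prototypes, classes) that a real scanner would store.
class WordCountScanner < LanguageScanner
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def parse(tokens)
    tokens.each { |token| @counts[token] += 1 }
  end
end

scanner = WordCountScanner.new
scanner.parse(%w[select name from name])
scanner.counts.each { |word, count| print "#{word}: #{count}\n" }
```

Because every scanner answers to the same parse() call, code that drives a scanner does not need to know which language it is handling until it reads the results.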
The other classes that are related to the LanguageScanner class are shown in figure 3.7.
Along with the LanguageScanner interface, there is an infrastructure for storing details about both plain functions (through the Prototype object) and object-oriented classes (through the definition of the LanguageClass and ClassVariable classes).
In derived language scanners, one or more LanguageClass objects are used to store information about class definitions found in the token stream. Each LanguageClass object contains the class name, information about the class lineage, the variables in the class (defined by an array of ClassVariable objects), and any methods (as an array of Prototype objects).
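As a rough picture of these storage classes, the sketch below models them as simple Ruby objects. Only the class names and the name and method_name attributes come from the text and listings in this section; every other field name (return_type, arguments, type, parents, prototypes) is an assumption made for illustration.

```ruby
# Hypothetical field layout; the toolkit's own classes may differ.
Prototype     = Struct.new(:method_name, :return_type, :arguments)
ClassVariable = Struct.new(:name, :type)

class LanguageClass
  # prototypes holds the class methods; it is not called "methods" here so
  # that it does not shadow Ruby's built-in Object#methods.
  attr_reader :name, :parents, :variables, :prototypes

  def initialize(name, parents = [])
    @name       = name
    @parents    = parents   # class lineage
    @variables  = []        # array of ClassVariable objects
    @prototypes = []        # array of Prototype objects
  end
end

account = LanguageClass.new("Account", ["Object"])
account.variables  << ClassVariable.new("balance", "double")
account.prototypes << Prototype.new("deposit", "void", ["double amount"])

print "#{account.name} < #{account.parents.join(', ')}\n"
account.variables.each  { |var|   print "  variable #{var.name}\n" }
account.prototypes.each { |proto| print "  method #{proto.method_name}\n" }
```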
Figure 3.6 The LanguageScanner base class
Figure 3.7 The LanguageScanner helper classes
The SQL scanners
The processing flow for ANSI SQL and PostgreSQL is shown in figure 3.8.
Each code flow starts with running the text through the SQLTokenizer, which creates a TokenStream from the text. This TokenStream is then passed to either SQLLanguageScanner (for ANSI SQL) or PostgreSQLScanner (for PostgreSQL). Figure 3.9 shows the UML for these SQL language-scanning classes.
SQLLanguageScanner reads TokenStreams and then parses and stores information about any table definitions it finds. The tables array stores an SQLTable object for each table found. Each of the SQLTable objects in turn contains an array of SQLField objects that describe each field. Any comments found before a table or a field are associated with that table or field in the comment attribute.
PostgreSQLScanner derives from SQLLanguageScanner and handles reading the prototypes for PostgreSQL stored procedures. Listings 3.1 and 3.2 show example code for parsing SQL and PostgreSQL, respectively.
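The table, field, and comment relationship described above might be modeled as follows. This is a toy sketch, not the toolkit's scanner: the scan_tables function, the [kind, text] token pairs, and the type field are all assumptions, and the input is assumed to be a well-formed, very simple CREATE TABLE statement.

```ruby
# Hypothetical storage objects; only the names SQLTable, SQLField, and the
# comment attribute come from the text.
SQLField = Struct.new(:name, :type, :comment)
SQLTable = Struct.new(:name, :fields, :comment)

# Toy scanner. Each token is a [kind, text] pair where kind is :comment
# or :code. A comment is attached to the next table or field that follows.
def scan_tables(tokens)
  tables = []
  pending_comment = nil
  current = nil # the table being filled, or nil outside a definition
  i = 0
  while i < tokens.length
    kind, text = tokens[i]
    if kind == :comment
      pending_comment = text
      i += 1
    elsif current.nil? && text.downcase == "create" &&
          tokens[i + 1] && tokens[i + 1][1].downcase == "table"
      current = SQLTable.new(tokens[i + 2][1], [], pending_comment)
      tables << current
      pending_comment = nil
      i += 4 # skip: create, table, <name>, (
    elsif current && text == ")"
      current = nil
      i += 1
    elsif current && text == ","
      i += 1
    elsif current
      # Inside a table: a field name followed by its type.
      current.fields << SQLField.new(text, tokens[i + 1][1], pending_comment)
      pending_comment = nil
      i += 2
    else
      i += 1
    end
  end
  tables
end

tokens = [
  [:comment, "Customer records"],
  [:code, "create"], [:code, "table"], [:code, "customer"], [:code, "("],
  [:comment, "Primary key"],
  [:code, "id"], [:code, "integer"], [:code, ","],
  [:code, "name"], [:code, "varchar"],
  [:code, ")"], [:code, ";"]
]

scan_tables(tokens).each do |table|
  print "#{table.name} (#{table.comment})\n"
  table.fields.each { |field| print "  #{field.name} #{field.type}\n" }
end
```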
Figure 3.9 The SQL scanning classes
Listing 3.1 Example code for parsing SQL
require "SQLTokenizer"                        # Includes the Tokenizer definitions
require "SQLLanguageScanner"                  # Includes the Scanner definitions
File.open( ARGV[0] ) { |fh|
  in_text = fh.read()                         # Opens and reads the file
  tokenizer = SQLTokenizer.new( )
  tokenizer.parse( in_text )                  # Tokenizes the SQL
  languagescanner = SQLLanguageScanner.new()
  languagescanner.parse( tokenizer.tokens )   # Scans the TokenStream
  languagescanner.tables.each{ |table|
    print "#{table.name}\n"                   # Prints out names of tables
  }
}
Listing 3.2 Example code for parsing PostgreSQL
require "SQLTokenizer"                        # Includes the Tokenizer definitions
require "SQLLanguageScanner"                  # Includes the Scanner definitions
File.open( ARGV[0] ) { |fh|
  in_text = fh.read()                         # Opens and reads the file
  tokenizer = SQLTokenizer.new( )
  tokenizer.parse( in_text )                  # Tokenizes the SQL
  languagescanner = PostgreSQLScanner.new()
  languagescanner.parse( tokenizer.tokens )   # Scans the TokenStream
  languagescanner.prototypes.each{ |proto|
    print "#{proto.method_name}\n"            # Prints out names of stored procedure prototypes
  }
}
The C and C++ scanners
The processing flow of C and C++ files is shown in figure 3.10.
Both C and C++ text is run through a CTokenizer, which returns a TokenStream. This TokenStream is then fed to either the CLanguageScanner or the CPPLanguageScanner. These classes are shown in figure 3.11.
The CLanguageScanner looks for C function prototypes and preprocessor #define macros. It stores the prototypes in an array called prototypes and the macros in a hash table called defines, which is keyed on the #define symbol name.
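To make the two storage structures concrete, here is a minimal sketch that fills a prototypes array and a defines hash from a few lines of C. It is an approximation only: the scan_c function is hypothetical, and it uses line-by-line regular expressions instead of the CTokenizer and TokenStream that the real scanner works from, so it handles only one-line prototypes and simple #define values.

```ruby
# Toy stand-in for the storage CLanguageScanner is described as keeping.
# Returns [prototypes, defines].
def scan_c(text)
  prototypes = [] # function names, in the order they are found
  defines = {}    # keyed on the #define symbol name
  text.each_line do |line|
    if line =~ /^\s*#define\s+(\w+)\s+(.*)$/
      defines[$1] = $2.strip
    elsif line =~ /^\s*[\w\s\*]+?\b(\w+)\s*\([^)]*\)\s*;/
      prototypes << $1
    end
  end
  [prototypes, defines]
end

source = <<~C_SOURCE
  #define MAX_ITEMS 10
  int add(int a, int b);
  void reset(void);
C_SOURCE

prototypes, defines = scan_c(source)
prototypes.each { |name| print "#{name}\n" }
defines.each { |symbol, value| print "#{symbol} = #{value}\n" }
```

Keying the hash on the symbol name makes a lookup such as defines["MAX_ITEMS"] direct, which suits generators that need the value of one specific macro.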
[Flow diagram: C text passes through CTokenizer to a TokenStream and then to CLanguageScanner, which yields function prototypes. C++ text passes through CTokenizer to a TokenStream and then to CPPLanguageScanner, which yields class, method, and variable objects as well as function prototypes.]
Figure 3.10 Parsing C and C++ with the language parsing toolkit
The CPPLanguageScanner looks for classes in C++ code that has been tokenized using CTokenizer. The resulting classes array is populated with LanguageClass objects that describe each class as well as its instance variables and methods.
Because CPPLanguageScanner derives from CLanguageScanner, any standard function prototypes will be read as well. Listing 3.3 contains example code for parsing C, and listing 3.4 shows example code for parsing C++.
Listing 3.3 Example code for parsing C
require "CTokenizer"                          # Includes the Tokenizer definitions
require "CLanguageScanner"                    # Includes the Scanner definitions
File.open( ARGV[0] ) { |fh|
  in_text = fh.read()
  tokenizer = CTokenizer.new( )
  tokenizer.parse( in_text )                  # Tokenizes the C
  languagescanner = CLanguageScanner.new()
  languagescanner.parse( tokenizer.tokens )   # Scans the TokenStream
  languagescanner.prototypes.each{ |proto|
    print "#{proto.method_name}\n"            # Prints out any function names
  }
}
Listing 3.4 Example code for parsing C++
require "CTokenizer"                          # Includes the Tokenizer definitions
require "CPPLanguageScanner"                  # Includes the Scanner definitions
File.open( ARGV[0] ) { |fh|
  in_text = fh.read()
  tokenizer = CTokenizer.new( )
  tokenizer.parse( in_text )                  # Tokenizes the C++
  languagescanner = CPPLanguageScanner.new()
  languagescanner.parse( tokenizer.tokens )   # Scans the TokenStream
  languagescanner.classes.each{ |cpp_class|
    print "#{cpp_class.name}\n"               # Prints out any class names
  }
}
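The way the C++ scanner builds on the C scanner can be illustrated with a pair of toy classes. The names ToyCScanner and ToyCPPScanner are hypothetical, and the input is a list of pre-classified declarations standing in for real token parsing; the point is only to show how a derived scanner can call the inherited parse() so that function prototypes are still collected alongside classes.

```ruby
# Hypothetical stand-in for a C scanner: collects function names only.
class ToyCScanner
  attr_reader :prototypes

  def initialize
    @prototypes = []
  end

  # declarations is assumed to be an array of hashes such as
  # { kind: :function, name: "main" }.
  def parse(declarations)
    declarations.each do |decl|
      @prototypes << decl[:name] if decl[:kind] == :function
    end
  end
end

# Hypothetical stand-in for a C++ scanner: adds a classes array.
class ToyCPPScanner < ToyCScanner
  attr_reader :classes

  def initialize
    super
    @classes = []
  end

  def parse(declarations)
    super # the inherited C behavior still collects function prototypes
    declarations.each do |decl|
      @classes << decl[:name] if decl[:kind] == :class
    end
  end
end

declarations = [
  { kind: :function, name: "main" },
  { kind: :class,    name: "Shape" },
  { kind: :function, name: "draw_all" }
]

cpp_scanner = ToyCPPScanner.new
cpp_scanner.parse(declarations)
cpp_scanner.classes.each    { |name| print "class #{name}\n" }
cpp_scanner.prototypes.each { |name| print "function #{name}\n" }
```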
The Java scanner
Figure 3.12 shows the processing flow for parsing a Java file.
The Java code is fed to the CTokenizer, which returns a TokenStream. The TokenStream is passed to the JavaLanguageScanner, which parses the tokens and stores information about the classes and JavaDoc comments. The JavaLanguageScanner class and its helper classes are shown in figure 3.13.
This is less complicated than it looks. As with CPPLanguageScanner, it is JavaLanguageScanner’s responsibility to read the class definitions in the TokenStream and store relevant information. The surrounding Java classes are derivations of the basic storage classes with JavaDoc elements added.
For example, the JavaClass class is derived from LanguageClass and adds an instance of the JavaDoc object. The JavaDoc object is added to allow handling of any JavaDoc comments associated with the class definition. The JavaVariable and JavaPrototype classes derive from the ClassVariable and Prototype classes, respectively, and add the JavaDoc handling object. Listing 3.5 shows example code for parsing Java.
[Flow diagram: Java text passes through CTokenizer to a TokenStream and then to JavaLanguageScanner, which yields class, method, and variable objects.]
Figure 3.12 Parsing Java using the language parsing toolkit
[Class diagram: JavaLanguageScanner derives from LanguageScanner (+parse(in tokens)). It has a -classes attribute and the methods +parse(in tokens), #parse_class(in tokens, in start, in base_class), #parse_declaration(in codefrag), #parse_prototype(in class_data, in codefrag), and #parse_variable(in class_data, in codefrag). JavaClass, JavaVariable, and JavaPrototype each carry a -javadoc attribute.]
Figure 3.13 The Java language scanner and its related classes
require "CTokenizer"
require "JavaLanguageScanner"
File.open( ARGV[0] ) { |fh|
  in_text = fh.read()
  tokenizer = CTokenizer.new( )
  tokenizer.parse( in_text )
  languagescanner = JavaLanguageScanner.new()
  languagescanner.parse( tokenizer.tokens )
  languagescanner.classes.each{ |jclass|
    print "#{jclass.name}\n"
  }
}
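The derivation of the Java storage classes from the basic ones, as described earlier in this section, could look roughly like the sketch below. The class names and the javadoc attribute follow the text and figure 3.13; the constructors, the text attribute on JavaDoc, and the minimal base classes are assumptions included only so the example runs on its own.

```ruby
# Minimal hypothetical base classes, defined here only to keep the sketch
# self-contained.
class LanguageClass
  attr_reader :name, :variables, :prototypes

  def initialize(name)
    @name = name
    @variables = []
    @prototypes = []
  end
end

class ClassVariable
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class Prototype
  attr_reader :method_name

  def initialize(method_name)
    @method_name = method_name
  end
end

# Holds the JavaDoc comment for one element. Storing the raw comment text
# is an assumption; the real object may parse it further.
class JavaDoc
  attr_accessor :text

  def initialize
    @text = ""
  end
end

# Each Java class adds a JavaDoc object to its base storage class.
class JavaClass < LanguageClass
  attr_reader :javadoc

  def initialize(name)
    super(name)
    @javadoc = JavaDoc.new
  end
end

class JavaVariable < ClassVariable
  attr_reader :javadoc

  def initialize(name)
    super(name)
    @javadoc = JavaDoc.new
  end
end

class JavaPrototype < Prototype
  attr_reader :javadoc

  def initialize(method_name)
    super(method_name)
    @javadoc = JavaDoc.new
  end
end

customer = JavaClass.new("Customer")
customer.javadoc.text = "Holds a customer record."
getter = JavaPrototype.new("getName")
getter.javadoc.text = "Returns the customer name."
customer.prototypes << getter
customer.variables << JavaVariable.new("name")

print "#{customer.name}: #{customer.javadoc.text}\n"
customer.prototypes.each { |proto| print "  #{proto.method_name}: #{proto.javadoc.text}\n" }
```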
NOTE If your generation task is strictly Java based and the input to the generator is a set of class definitions, you may want to consider using the Doclet API (see chapter 6, section 6.4).