3.4 BUILDING THE LANGUAGE PARSER TOOLKIT

3.4.2 Inside the LanguageScanners

Layered on top of the Tokenizers is a set of language scanners. There is one language scanner for each language the toolkit supports. Figure 3.5 illustrates the relationship between these classes.

All of the classes inherit from the LanguageScanner base class, which is an interface providing the one common entry point, parse(), that takes a TokenStream. On the SQL side of the fence, the basic SQLLanguageScanner class adds the parsing for tables, and PostgreSQLScanner adds support for parsing PostgreSQL stored procedure prototypes.

The C language tree starts with CLanguageScanner, which reads C files and builds a list of function prototypes and #define preprocessor macro values. CPPLanguageScanner derives from the C scanner and parses C++. In order to handle C++ classes, it adds an array of class objects that define instance variables and methods.

Finally, as you’d expect, JavaLanguageScanner parses Java code. It parses the Java class and creates an internal store of instance variables and methods as well as their related JavaDoc comments.

Next, we’ll examine each of these classes. The code for the language scanners is available on the book’s web site.

The LanguageScanner class

The LanguageScanner class and its related classes provide the foundation for all of the specialized language scanners. It contains the basic constructs—prototypes, classes, and class variables—that are used by all of the other scanners. See figure 3.6.

Figure 3.5 The family tree of the LanguageScanner classes

    LanguageScanner: +parse(in tokens)
        SQLLanguageScanner: -tables
            PostgreSQLScanner: -prototypes
        CLanguageScanner: -prototypes, -defines
            CPPLanguageScanner: -classes
        JavaLanguageScanner: -classes

The LanguageScanner class defines an interface with a single entry point that all of the derived scanners implement. Each language scanner stores the features specific to that language. For example, SQLLanguageScanner stores table information, while CLanguageScanner stores function prototypes and #define values.
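The single-entry-point design can be pictured with a minimal sketch. Only the LanguageScanner name and the parse() method come from the text; ToyTableScanner, its matching rule, and the use of a plain array of strings in place of a TokenStream are assumptions made for illustration.

```ruby
# Hypothetical sketch, not the toolkit's source: only parse() and the
# class name LanguageScanner come from the text.
class LanguageScanner
  # The single entry point that every derived scanner implements.
  def parse(tokens)
    raise NotImplementedError, "#{self.class} must implement parse"
  end
end

# Toy derived scanner (assumed name) that keeps language-specific results.
class ToyTableScanner < LanguageScanner
  attr_reader :tables

  def initialize
    @tables = []
  end

  # Records the token that follows each CREATE TABLE pair.
  def parse(tokens)
    tokens.each_cons(3) do |first, second, name|
      @tables << name if first.casecmp?("create") && second.casecmp?("table")
    end
  end
end

scanner = ToyTableScanner.new
scanner.parse(%w[CREATE TABLE users ( id INT ) ;])
puts scanner.tables.inspect
```

Callers only need to know about parse(); what each scanner stores afterward (tables here) is up to the derived class.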

The other classes that are related to the LanguageScanner class are shown in figure 3.7.

Along with the LanguageScanner interface, there is an infrastructure for storing details about both plain functions (through the Prototype object) and object-oriented classes (through the definition of the LanguageClass and ClassVariable classes).

In derived language scanners, one or more LanguageClass objects store information about class definitions found in the token stream. Each LanguageClass object contains the class name, information about the class lineage, the variables in the class (as an array of ClassVariable objects), and any methods (as an array of Prototype objects).
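As a rough picture of how these helper objects fit together, here is a sketch using stand-in definitions. The class names come from the text, and method_name appears in the listings, but every other attribute name (parents, variables, method_prototypes, return_type, arguments, type) is an assumption; the toolkit's real classes may differ.

```ruby
# Stand-in definitions for illustration only; attribute names are assumed.
Prototype     = Struct.new(:method_name, :return_type, :arguments)
ClassVariable = Struct.new(:name, :type)

class LanguageClass
  attr_reader :name, :parents, :variables, :method_prototypes

  def initialize(name, parents = [])
    @name = name                 # class name
    @parents = parents           # class lineage
    @variables = []              # array of ClassVariable objects
    @method_prototypes = []      # array of Prototype objects
  end
end

circle = LanguageClass.new("Circle", ["Shape"])
circle.variables << ClassVariable.new("radius", "double")
circle.method_prototypes << Prototype.new("area", "double", [])

puts "#{circle.name} < #{circle.parents.join(', ')}"
circle.variables.each { |v| puts "  var  #{v.type} #{v.name}" }
circle.method_prototypes.each { |m| puts "  func #{m.return_type} #{m.method_name}()" }
```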

Figure 3.6 The LanguageScanner base class (LanguageScanner: +parse(in tokens))

Figure 3.7 The LanguageScanner helper classes

The SQL scanners

The processing flow for ANSI SQL and PostgreSQL is shown in figure 3.8.

Each code flow starts with running the text through the SQLTokenizer, which creates a TokenStream from the text. This TokenStream is then passed to either SQLLanguageScanner (for ANSI SQL) or PostgreSQLScanner (for PostgreSQL). Figure 3.9 shows the UML for these SQL language-scanning classes.

SQLLanguageScanner reads TokenStreams and then parses and stores information about any table definitions it finds. The tables array stores an SQLTable object for each table found. Each of the SQLTable objects in turn contains an array of SQLField objects that describe each field. Any comments found before a table or a field are associated with that table or field in its comment attribute.
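The nesting of tables, fields, and comments described above can be sketched as follows. SQLTable, SQLField, and the comment attribute are named in the text; the Struct layout, the fields and type attribute names, and the describe helper are hypothetical stand-ins rather than the toolkit's definitions.

```ruby
# Stand-in structures; only the class names and `comment` come from the text.
SQLField = Struct.new(:name, :type, :comment)
SQLTable = Struct.new(:name, :fields, :comment)

tables = [
  SQLTable.new(
    "users",
    [
      SQLField.new("id", "integer", "Primary key"),
      SQLField.new("email", "varchar(255)", nil)   # no comment preceded this field
    ],
    "Registered users"
  )
]

# Hypothetical helper: flattens the table/field tree into printable lines.
def describe(tables)
  tables.flat_map do |table|
    header = "#{table.name}: #{table.comment}"
    rows = table.fields.map do |field|
      note = field.comment ? " (#{field.comment})" : ""
      "  #{field.name} #{field.type}#{note}"
    end
    [header] + rows
  end
end

puts describe(tables)
```

Keeping the comment next to the table or field it documents means a generator can carry the original documentation into whatever it emits.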

PostgreSQLScanner derives from SQLLanguageScanner and handles reading the prototypes for PostgreSQL stored procedures. Listings 3.1 and 3.2 show example code for parsing SQL and PostgreSQL, respectively.

Figure 3.9 The SQL scanning classes

Listing 3.1 Example code for parsing SQL

    require "SQLTokenizer"          # Includes the Tokenizer definitions
    require "SQLLanguageScanner"    # Includes the Scanner definitions

    File.open( ARGV[0] ) { |fh|
      in_text = fh.read()                           # Opens and reads the file
      tokenizer = SQLTokenizer.new( )
      tokenizer.parse( in_text )                    # Tokenizes the SQL
      languagescanner = SQLLanguageScanner.new()
      languagescanner.parse( tokenizer.tokens )     # Scans the TokenStream
      languagescanner.tables.each{ |table|
        print "#{table.name}\n"                     # Prints out names of tables
      }
    }

Listing 3.2 Example code for parsing PostgreSQL

    require "SQLTokenizer"          # Includes the Tokenizer definitions
    require "SQLLanguageScanner"    # Includes the Scanner definitions

    File.open( ARGV[0] ) { |fh|
      in_text = fh.read()                           # Opens and reads the file
      tokenizer = SQLTokenizer.new( )
      tokenizer.parse( in_text )                    # Tokenizes the SQL
      languagescanner = PostgreSQLScanner.new()
      languagescanner.parse( tokenizer.tokens )     # Scans the TokenStream
      languagescanner.prototypes.each{ |proto|
        print "#{proto.method_name}\n"              # Prints out names of stored procedures
      }
    }

The C and C++ scanners

The processing flow of C and C++ files is shown in figure 3.10.

Both C and C++ text is run through a CTokenizer, which returns a TokenStream. This TokenStream is then fed to either the CLanguageScanner or the CPPLanguageScanner. These classes are shown in figure 3.11.

The CLanguageScanner looks for C function prototypes and preprocessor #define macros. It stores the prototypes in an array called prototypes and the macros in a hash table called defines, which is keyed on the #define symbol name.
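The storage layout (an array of prototypes plus a hash of defines keyed on symbol name) can be illustrated with a simplified stand-in. This sketch matches whole lines with regular expressions instead of walking a TokenStream, so it is not how CLanguageScanner works internally; the Prototype struct fields other than method_name are assumptions, and the patterns only cover the simple declarations shown.

```ruby
# Simplified, line-based stand-in for illustration; the real scanner
# works on tokens rather than on raw lines.
source = <<~'C'
  #define MAX_USERS 64
  #define APP_NAME "demo"
  int add(int a, int b);
  char *lookup(const char *key);
C

Prototype = Struct.new(:method_name, :return_type, :arguments)

prototypes = []   # array of prototypes, in source order
defines = {}      # hash keyed on the #define symbol name

source.each_line do |line|
  if (m = line.match(/\A\s*#\s*define\s+(\w+)\s+(.*?)\s*\z/))
    defines[m[1]] = m[2]
  elsif (m = line.match(/\A\s*(\w[\w\s*]*?)\s*(\w+)\s*\((.*)\)\s*;/))
    prototypes << Prototype.new(m[2], m[1].strip, m[3])
  end
end

prototypes.each { |proto| puts "#{proto.return_type} #{proto.method_name}(#{proto.arguments})" }
defines.each { |symbol, value| puts "#{symbol} = #{value}" }
```

A hash suits the macros because a generator usually looks a macro up by name, while prototypes are more often walked in order.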

Figure 3.10 Parsing C and C++ with the language parsing toolkit

    C text -> CTokenizer -> TokenStream -> CLanguageScanner -> function prototypes
    C++ text -> CTokenizer -> TokenStream -> CPPLanguageScanner -> class, method, and variable objects

The CPPLanguageScanner looks for classes in C++ code that has been tokenized using CTokenizer. The resulting classes array is populated with LanguageClass objects that describe each class as well as its instance variables and methods.

Because CPPLanguageScanner derives from CLanguageScanner, any standard function prototypes will be read as well. Listing 3.3 contains example code for parsing C, and listing 3.4 shows example code for parsing C++.

Listing 3.3 Example code for parsing C

    require "CTokenizer"            # Includes the Tokenizer definitions
    require "CLanguageScanner"      # Includes the Scanner definitions

    File.open( ARGV[0] ) { |fh|
      in_text = fh.read()
      tokenizer = CTokenizer.new( )
      tokenizer.parse( in_text )                    # Tokenizes the C
      languagescanner = CLanguageScanner.new()
      languagescanner.parse( tokenizer.tokens )     # Scans the TokenStream
      languagescanner.prototypes.each{ |proto|
        print "#{proto.method_name}\n"              # Prints out any function names
      }
    }

Listing 3.4 Example code for parsing C++

    require "CTokenizer"            # Includes the Tokenizer definitions
    require "CPPLanguageScanner"    # Includes the Scanner definitions

    File.open( ARGV[0] ) { |fh|
      in_text = fh.read()
      tokenizer = CTokenizer.new( )
      tokenizer.parse( in_text )                    # Tokenizes the C++
      languagescanner = CPPLanguageScanner.new()
      languagescanner.parse( tokenizer.tokens )     # Scans the TokenStream
      languagescanner.classes.each{ |cpp_class|
        print "#{cpp_class.name}\n"                 # Prints out any class names
      }
    }
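The way the C++ scanner builds on the C scanner can be sketched with two toy classes. ToyCScanner and ToyCPPScanner are hypothetical names, the token stream is a plain array of strings, and the matching rules are deliberately naive; the sketch only shows how a derived parse() can call the inherited one so that function prototypes are still collected alongside classes.

```ruby
# Toy stand-ins (assumed names); the matching rules are deliberately naive.
class ToyCScanner
  attr_reader :prototypes

  def initialize
    @prototypes = []
  end

  # Treats any identifier directly followed by "(" as a function name.
  def parse(tokens)
    tokens.each_cons(2) do |name, following|
      @prototypes << name if following == "(" && name.match?(/\A[A-Za-z_]\w*\z/)
    end
  end
end

class ToyCPPScanner < ToyCScanner
  attr_reader :classes

  def initialize
    super
    @classes = []
  end

  # Runs the inherited C parsing first, then looks for class names.
  def parse(tokens)
    super
    tokens.each_cons(2) do |keyword, name|
      @classes << name if keyword == "class"
    end
  end
end

scanner = ToyCPPScanner.new
scanner.parse(%w[class Point { } ; int add ( int a , int b ) ;])
puts "classes:    #{scanner.classes.inspect}"
puts "prototypes: #{scanner.prototypes.inspect}"
```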

The Java scanner

Figure 3.12 shows the processing flow for parsing a Java file.

The Java code is fed to the CTokenizer, which returns a TokenStream. The TokenStream is passed to the JavaLanguageScanner, which parses the tokens and stores information about the classes and JavaDoc comments. The JavaLanguageScanner class and its helper classes are shown in figure 3.13.

This is less complicated than it looks. As with CPPLanguageScanner, it is JavaLanguageScanner’s responsibility to read the class definitions in the TokenStream and store relevant information. The surrounding Java classes are derivations of the basic storage classes with JavaDoc elements added.

Figure 3.12 Parsing Java using the language parsing toolkit

    Java text -> CTokenizer -> TokenStream -> JavaLanguageScanner -> class, method, and variable objects

Figure 3.13 The Java language scanner and its related classes

    LanguageScanner: +parse(in tokens)
        JavaLanguageScanner: +parse(in tokens), #parse_class(in tokens, in start, in base_class), #parse_declaration(in codefrag), #parse_prototype(in class_data, in codefrag), #parse_variable(in class_data, in codefrag), -classes
    JavaClass, JavaVariable, JavaPrototype: -javadoc

For example, the JavaClass class is derived from LanguageClass and adds an instance of the JavaDoc object. The JavaDoc object is added to allow handling of any JavaDoc comments associated with the class definition. The JavaVariable and JavaPrototype classes derive from the ClassVariable and Prototype classes, respectively, and add the JavaDoc handling object. Listing 3.5 shows example code for parsing Java.

Listing 3.5 Example code for parsing Java

    require "CTokenizer"
    require "JavaLanguageScanner"

    File.open( ARGV[0] ) { |fh|
      in_text = fh.read()
      tokenizer = CTokenizer.new( )
      tokenizer.parse( in_text )
      languagescanner = JavaLanguageScanner.new()
      languagescanner.parse( tokenizer.tokens )
      languagescanner.classes.each{ |jclass|
        print "#{jclass.name}\n"
      }
    }
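The derivation of JavaClass from LanguageClass, with a JavaDoc object attached, might look roughly like the sketch below. The three class names and the javadoc attribute come from the text and figure 3.13; the summary and tags attributes, the parse method on JavaDoc, and its comment-stripping rules are assumptions made for illustration and may not match the toolkit's JavaDoc class.

```ruby
# Stand-in definitions for illustration; the toolkit's classes may differ.
class LanguageClass
  attr_reader :name, :variables, :method_prototypes

  def initialize(name)
    @name = name
    @variables = []
    @method_prototypes = []
  end
end

class JavaDoc
  attr_reader :summary, :tags

  def initialize
    @summary = ""
    @tags = {}
  end

  # Splits a /** ... */ comment into free text and @tag entries.
  def parse(comment)
    comment.each_line do |line|
      text = line.strip.sub(%r{\A(/\*\*|\*/|\*)\s*}, "")
      next if text.empty?
      if text.start_with?("@")
        tag, value = text[1..-1].split(" ", 2)
        @tags[tag] = value
      else
        @summary = [@summary, text].reject(&:empty?).join(" ")
      end
    end
    self
  end
end

# The basic storage class plus a JavaDoc object, as the text describes.
class JavaClass < LanguageClass
  attr_reader :javadoc

  def initialize(name)
    super
    @javadoc = JavaDoc.new
  end
end

jclass = JavaClass.new("Customer")
jclass.javadoc.parse(<<~'DOC')
  /**
   * Represents a customer.
   * @author Jack
   */
DOC

puts "#{jclass.name}: #{jclass.javadoc.summary} #{jclass.javadoc.tags.inspect}"
```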

NOTE If your generation task is strictly Java based and the input to the generator is a set of class definitions, you may want to consider using the Doclet API (see chapter 6, section 6.4).

In document Code Generation in action (Page 81-88)