Reading XML using XML::Parser - The Backend Database

E.1. SNMP in Practice

3.3. Building an Account System to Manage Users

3.3.1. The Backend Database

3.3.1.2. Reading XML using XML::Parser

We'll see one more way of writing XML from Perl in a moment, but before we do, let's turn our attention to the process of reading all of the great XML we've just learned how to write. We need code that will parse the account addition and deletion queues and the main database.

It would be possible to cobble together an XML parser in Perl for our limited data set out of regular

expressions, but that gets trickier as the XML data gets more complicated.[6] For general parsing, it is easier to use the XML::Parser module initially written by Larry Wall and now significantly enhanced and maintained by Clark Cooper.

[6]But it is doable; for instance, see Eric Prud'hommeaux's module at http://www.w3.org/1999/02/26−modules/W3C−SAX−XmlParser−*.

XML::Parser is an event−based module. Event−based modules work like stock market brokers. Before trading begins, you leave a set of instructions with them for actions they should take should certain triggers occur (e.g., sell a thousand shares should the price drop below 314, buy this stock at the beginning of the trading day, and so on). With event−based programs, the triggers are called events and the instruction lists for what to do when an event happens are called event handlers. Handlers are usually just special subroutines designed to deal with a particular event. Some people call them callback routines, since they are run when the main program "calls us back" after a certain condition is established. With the XML::Parser module, our events will be things like "started parsing the data stream," "found a start tag," and "found an XML comment,"

and our handlers will do things like "print the contents of the element you just found."[7]

[7]Though we don't use it here, Chang Liu's XML::Node module allows the programmer to easily request callbacks for only certain elements, further simplifying the process we're about to discuss.

Before we begin to parse our data, we need to create an XML::Parser object. When we create this object, we'll specify which parsing mode, or style, to use. XML::Parser provides several styles, each of which behaves a little different during the parsing of data. The style of a parse will determine which event handlers are called by default and the way data returned by the parser (if any) is structured.

Certain styles require that we specify an association between each event we wish to manually process and its handler. No special actions are taken for events we haven't chosen to explicitly handle. This association is stored in a simple hash table with keys that are the names of the events we want to handle, and values that are references to our handler subroutines. For the styles that require this association, we pass the hash in using a named parameter called Handlers (e.g., Handlers=>{Start=>\&start_handler}) when we create a parser object.

We'll be using the stream style that does not require this initialization step. It simply calls a set of predefined event handlers if certain subroutines are found in the program's namespace. The stream event handlers we'll be using are simple: StartTag, EndTag, and Text. All but Text should be

self−explanatory. Text, according to the XML::Parser documentation, is "called just before start or end tags with accumulated non−markup text in the $_ variable." We'll use it when we need to know the contents

3.3.1.2. Reading XML using XML::Parser 102

of a particular element.

Here's the initialization code we're going to use for our application:

use XML::Parser;

use Data::Dumper; # used for debugging output, not needed for XML parse

$p = new XML::Parser(ErrorContext => 3,

Style => 'Stream',

Pkg => 'Account::Parse');

This code returns a parser object after passing in three named parameters. The first, ErrorContext, tells the parser to return three lines of context from the parsed data if an error should occur while parsing. The second sets the parse style as we just discussed. Pkg, the final parameter, instructs the parser to look in a different namespace than the current one for the event handler subroutines it expects to see. By setting this parameter, we've instructed the parser to look for &Account::Parse::StartTag(),

&Account::Parse::EndTag( ), and so on, instead of just &StartTag( ), &EndTag( ), etc. This doesn't have any operational impact, but it does allow us to sidestep any concerns that our parser might inadvertently call someone else's function called StartTag( ). Instead of using the Pkg parameter, we could have put an explicit package Account::Parse; line before the above code.

Now let's look at the subroutines that perform the event handling functions. We'll go over them one at a time:

package Account::Parse;

sub StartTag {

undef %record if ($_[1] eq "account");

}

&StartTag( ) is called each time we hit a start tag in our data. It is invoked with two parameters: a reference to the parser object and the name of the tag encountered. We'll want to construct a new record hash for each new account record we encounter, so we can use StartTag( ) in order to let us know when we've hit the beginning of a new record (e.g., an <account> start tag). In that case, we obliterate any existing record hash. In all other cases we return without doing anything:

sub Text {

my $ce = $_[0]−>current_element( );

$record{$ce}=$_ unless ($ce eq "account");

}

Here we use &Text( ) to populate the %record hash. Like the previous function, it too receives two parameters upon invocation: a reference to the parser object and the "accumulated nonmarkup text" the parser has collected between the last start and end tag. We determine which element we're in by calling the parser object's current_element( ) method. According to the XML::Parser::Expat documentation, this method "returns the name of the innermost currently opened element." As long as the current element name is not "account," we're sure to be within one of the subelements of <account>, so we record the element name and its contents:

sub EndTag {

print Data::Dumper−>Dump([\%record],["account"]) if ($_[1] eq "account");

# here's where we'd actually do something, instead of just # printing the record

}

Our last handler, &EndTag( ), is just like our first, &StartTag( ), except it gets called when we encounter an end tag. If we reach the end of an account record, we do the mundane thing and print that record out. Here's some example output:

$account = {

The Five−Minute RCS Tutorial (Perl for System Administration)

3.3.1.2. Reading XML using XML::Parser 103

'login' => 'bobf',

If we were really going to use this parse code in our account system we would probably call some function like CreateAccount(\%record) rather than printing the record using Data::Dumper.

Now that we've seen the XML::Parser initialization and handler subroutines, we need to include the piece of code that actually initiates the parse:

# handle multiple account records in a single XML queue file open(FILE,$addqueue) or die "Unable to open $addqueue:$!\n";

# this clever idiom courtesy of Jeff Pinyan read(FILE, $queuecontents, −s FILE);

$p−>parse("<queue>".$queuecontents."</queue>");

This code has probably caused you to raise an eyebrow, maybe even two. The first two lines open our add queue file and read its contents into a single scalar variable called $queuecontents. The third line would probably be easily comprehensible, except for the funny argument being passed to parse( ). Why are we bothering to read in the contents of the actual queue file and then bracket it with more XML before actually parsing it?

Because it is a hack. As hacks go, it's not so bad. Here's why these convolutions are necessary to parse the multiple <account> elements in our queue file.

Every XML document, by definition (in the very first production rule of the XML specification), must have a root or document element. This element is the container element for the rest of the document; all other elements are subelements of it. An XML parser expects the first start tag it sees to be the start tag for the root element of that document and the last end tag it sees to be the end tag for that that element. XML documents that do not conform to this structure are not considered to be well−formed.

This puts us in a bit of a bind when we attempt to model a queue in XML. If we do nothing, <account> will be found as the first tag in the file. Everything will work fine until the parser hits the end </account> tag for that record. At that point it will cease to parse any further, even if there are more records to be found, because it believes it has found the end of the document.

We can easily put a start tag (<queue>) at the beginning of our queue, but how do we handle end tags (</queue>)? We always need the root element's end tag at the bottom of the file (and only there), a difficult task given that we're planning to repeatedly append records to this file.

A plausible but fairly heinous hack would be to seek( ) to the end of the file, and then seek( )

backwards until we backed up just before the last end tag. We could then write our new record over this tag, leaving an end tag at the end of the data we were writing. Just the risk of data corruption (what if you back up too far?) should dissuade you from this method. Also, this method gets tricky in cases where there is no clear end of file, e.g., if you were reading XML data from a network connection. In those cases you would probably need to do some extra shadow buffering of the data stream so it would be possible to back up from the end of transmission.

3.3.1.2. Reading XML using XML::Parser 104

The method we demonstrated in the code above of prepending and appending a root element tag pair to the existing data may be a hack, but it comes out looking almost elegant compared to other solutions. Let's return to more pleasant topics.

In document The Five Minute RCS Tutorial (Perl for System Administration (Page 109-112)