Working with nodes - The Document Object Model


5.2.3.3 Working with nodes

Once within the serializeNode( ) method, the first task is to determine what type of node has been passed in. Although you could approach this with a Java methodology, using the instanceof keyword and Java reflection, the DOM language bindings for Java make this task much simpler. The Node interface defines a helper method, getNodeType( ), which returns an integer value. This value can be compared against a set of constants (also defined within the Node interface), and the type of Node being examined can be quickly and easily determined. This also fits very naturally into the Java switch construct, which can be used to break up serialization into logical sections. The code here covers almost all DOM node types; although there are some additional node types defined (see Figure 5-2), these are the most common, and the concepts here can be applied to the less common node types as well:

public void serializeNode(Node node, Writer writer, String indentLevel)
    throws IOException {

    // Determine action based on node type
    switch (node.getNodeType( )) {
        case Node.DOCUMENT_NODE:
            break;
        case Node.ELEMENT_NODE:
            break;
        case Node.TEXT_NODE:
            break;
        case Node.CDATA_SECTION_NODE:
            break;
        case Node.COMMENT_NODE:
            break;
        case Node.PROCESSING_INSTRUCTION_NODE:
            break;
        case Node.ENTITY_REFERENCE_NODE:
            break;
        case Node.DOCUMENT_TYPE_NODE:
            break;
    }
}

This code is fairly useless; however, it helps to see all of the DOM node types laid out here in a line, rather than mixed in with all of the code needed to perform actual serialization. I want to get to that now, though, starting with the first node passed into this method, an instance of the Document interface.
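To see getNodeType( ) at work before any serialization logic is added, here is a minimal sketch. It assumes the JAXP DocumentBuilderFactory bundled with the JDK rather than the Xerces DOMParser used elsewhere in this chapter, and the NodeTypeDemo class and its describe( ) and parse( ) helpers are hypothetical names introduced only for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypeDemo {

    // Hypothetical helper: turn a getNodeType() constant into a readable label
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE:  return "element";
            case Node.TEXT_NODE:     return "text";
            case Node.COMMENT_NODE:  return "comment";
            default:                 return "other (" + node.getNodeType() + ")";
        }
    }

    // Assumption: JAXP parser from the JDK; checked exceptions are wrapped for brevity
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<book><!-- note -->Java and XML</book>");
        Node root = doc.getDocumentElement();
        System.out.println(describe(doc));
        System.out.println(describe(root));
        // Walk the root's children and label each one
        for (Node child = root.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            System.out.println(describe(child));
        }
    }
}
```

No instanceof checks or reflection are needed; the integer constant alone is enough to pick a branch.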

Because the Document interface is an extension of the Node interface, it can be used interchangeably with the other node types. However, it is a special case, as it contains the root element as well as the XML document's DTD and some other special information not within the XML element hierarchy. As a result, you need to extract the root element and pass that back to the serialization method (starting recursion). Additionally, the XML declaration itself is printed out:

case Node.DOCUMENT_NODE:
    writer.write("<?xml version=\"1.0\"?>");
    writer.write(lineSeparator);

    Document doc = (Document)node;
    serializeNode(doc.getDocumentElement( ), writer, "");
    break;

DOM Level 2 (as well as SAX 2.0) does not expose the XML declaration. This may not seem like a big deal, until you consider that the encoding of the document is included in this declaration. DOM Level 3 is expected to address this deficiency, and I'll cover that in the next chapter. Be careful not to write DOM applications that depend on this information until this feature is in place.

Since the code needs to access a Document-specific method (as opposed to one defined in the generic Node interface), the Node implementation must be cast to the Document interface. Then invoke the object's getDocumentElement( ) method to obtain the root element of the XML input document, and in turn pass that on to the serializeNode( ) method, starting the recursion and traversal of the DOM tree.
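The cast can be tried in isolation with a small sketch. As an assumption, it uses the JDK's JAXP parser instead of Xerces' DOMParser, and DocumentRootDemo, parse( ), and rootName( ) are hypothetical names, not part of DOMSerializer:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DocumentRootDemo {

    // Assumption: JAXP parser from the JDK; checked exceptions are wrapped for brevity
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Accepts a generic Node, as serializeNode() does, and casts only
    // after the node type confirms that the cast is safe
    static String rootName(Node node) {
        if (node.getNodeType() != Node.DOCUMENT_NODE) {
            throw new IllegalArgumentException("not a Document node");
        }
        Document doc = (Document) node;
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        Node node = parse("<book><title>Java and XML</title></book>");
        System.out.println(rootName(node));
    }
}
```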

Of course, the most common task in serialization is to take a DOM Element and print out its name, attributes, and value, and then print its children. As you would suspect, all of these can be easily accomplished with DOM method calls. First you need to get the name of the XML element, which is available through the getNodeName( ) method within the Node interface. The code then needs to get the children of the current element and serialize these as well. A Node's children can be accessed through the getChildNodes( ) method, which returns an instance of a DOM NodeList. It is trivial to obtain the length of this list, and then iterate through the children calling the serialization method on each, continuing the recursion. There's also quite a bit of logic that ensures correct indentation and line feeds; these are really just formatting issues, and I won't spend time on them here. Finally, the closing tag of the element can be output:

case Node.ELEMENT_NODE:
    String name = node.getNodeName( );
    writer.write(indentLevel + "<" + name);
    writer.write(">");

    // recurse on each child
    NodeList children = node.getChildNodes( );
    if (children != null) {
        if ((children.item(0) != null) &&
            (children.item(0).getNodeType( ) == Node.ELEMENT_NODE)) {
            writer.write(lineSeparator);
        }
        for (int i=0; i<children.getLength( ); i++) {
            serializeNode(children.item(i), writer, indentLevel + indent);
        }
        if ((children.item(0) != null) &&
            (children.item(children.getLength( )-1).getNodeType( ) ==
                Node.ELEMENT_NODE)) {
            writer.write(indentLevel);
        }
    }
    writer.write("</" + name + ">");
    writer.write(lineSeparator);
    break;

Of course, astute readers (or DOM experts) will notice that I left out something important: the element's attributes! These are the only pseudo-exception to the strict tree that DOM builds. They should be an exception, though, since an attribute is not really a child of an element; it's (sort of) lateral to it. Basically the relationship is a little muddy. In any case, the attributes of an element are available through the getAttributes( ) method on the Node interface. This method returns a NamedNodeMap, and that too can be iterated through. Each Node within this list can be polled for its name and value, and suddenly the attributes are handled! Enter the code as shown here to take care of this:

case Node.ELEMENT_NODE:
    String name = node.getNodeName( );
    writer.write(indentLevel + "<" + name);
    NamedNodeMap attributes = node.getAttributes( );
    for (int i=0; i<attributes.getLength( ); i++) {
        Node current = attributes.item(i);
        writer.write(" " + current.getNodeName( ) +
                     "=\"" + current.getNodeValue( ) + "\"");
    }
    writer.write(">");

    // recurse on each child
    NodeList children = node.getChildNodes( );
    if (children != null) {
        if ((children.item(0) != null) &&
            (children.item(0).getNodeType( ) == Node.ELEMENT_NODE)) {
            writer.write(lineSeparator);
        }
        for (int i=0; i<children.getLength( ); i++) {
            serializeNode(children.item(i), writer, indentLevel + indent);
        }
        if ((children.item(0) != null) &&
            (children.item(children.getLength( )-1).getNodeType( ) ==
                Node.ELEMENT_NODE)) {
            writer.write(indentLevel);
        }
    }
    writer.write("</" + name + ">");
    writer.write(lineSeparator);
    break;
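If you want to poke at a NamedNodeMap on its own, the following sketch may help. It assumes the JDK's JAXP parser, and AttributeDemo, parse( ), and attributeString( ) are hypothetical names. Two things are worth noticing: the order of the items in the map is not guaranteed to match the source document, and getAttributes( ) returns null for nodes that are not elements, which the sketch guards against:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeDemo {

    // Assumption: JAXP parser from the JDK; checked exceptions are wrapped for brevity
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Builds the ' name="value"' pairs that follow an element's name
    static String attributeString(Node node) {
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return "";  // only Element nodes carry an attribute map
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node current = attributes.item(i);
            out.append(' ').append(current.getNodeName())
               .append("=\"").append(current.getNodeValue()).append('"');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Document doc = parse("<chapter number=\"5\" title=\"DOM\">text</chapter>");
        Node root = doc.getDocumentElement();
        System.out.println("<" + root.getNodeName() + attributeString(root) + ">");
    }
}
```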

Next on the list of node types is Text nodes. Output is quite simple, as you only need to use the now-familiar getNodeValue( ) method of the DOM Node interface to get the textual data and print it out; the same is true for CDATA nodes, except that the data within a CDATA section should be enclosed within the CDATA XML semantics (surrounded by <![CDATA[ and ]]>). You can add the logic within those two cases now:

case Node.TEXT_NODE:
    writer.write(node.getNodeValue( ));
    break;

case Node.CDATA_SECTION_NODE:
    writer.write("<![CDATA[" +
                 node.getNodeValue( ) + "]]>");
    break;
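As a quick check of those two cases, this sketch builds the nodes in memory instead of parsing a file. It assumes the JDK's JAXP DocumentBuilderFactory, and TextAndCdataDemo with its helpers is a hypothetical stand-in for the serializer, not the class from this chapter:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class TextAndCdataDemo {

    // Assumption: an empty in-memory Document from the JDK's JAXP factory
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    // Same two cases as the serializer, returning a String instead of writing
    static String serialize(Node node) {
        switch (node.getNodeType()) {
            case Node.TEXT_NODE:
                // getNodeValue() hands back the character data as-is
                return node.getNodeValue();
            case Node.CDATA_SECTION_NODE:
                return "<![CDATA[" + node.getNodeValue() + "]]>";
            default:
                throw new IllegalArgumentException("not character data");
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        System.out.println(serialize(doc.createTextNode("plain text")));
        System.out.println(serialize(doc.createCDATASection("if (a < b)")));
    }
}
```

Both node types return their content through the same getNodeValue( ) call; only the wrapper around it differs.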

Dealing with comments in DOM is about as simple as it gets. The getNodeValue( ) method returns the text within the <!-- and --> XML constructs. That's really all there is to it; see this code addition:

case Node.COMMENT_NODE:
    writer.write(indentLevel + "<!-- " +
                 node.getNodeValue( ) + " -->");
    writer.write(lineSeparator);
    break;

Moving on to the next DOM node type: the DOM bindings for Java define an interface to handle processing instructions that are within the input XML document, rather obviously called ProcessingInstruction. This is useful, as these instructions do not follow the same markup model as XML elements and attributes, but are still important for applications to know about. In the table of contents XML document, there aren't any PIs present (although you could easily add some for testing).

The PI node in the DOM is a little bit of a break from what you have seen so far: to fit the syntax into the Node interface model, the getNodeValue( ) method returns all data instructions within a PI in one String. This allows quick output of the PI; however, you still need to use getNodeName( ) to get the name of the PI. If you were writing an application that received PIs from an XML document, you might prefer to use the actual ProcessingInstruction interface; although it exposes the same data, the method names (getTarget( ) and getData( )) are more in line with a PI's format. With this understanding, you can add in the code to print out any PIs in supplied XML documents:

case Node.PROCESSING_INSTRUCTION_NODE:
    writer.write("<?" + node.getNodeName( ) +
                 " " + node.getNodeValue( ) +
                 "?>");
    writer.write(lineSeparator);
    break;
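To convince yourself that the generic Node methods and the ProcessingInstruction methods really expose the same data, a sketch like this one can help. It creates a PI in memory through the JDK's JAXP factory (an assumption on my part), and the PiDemo class, its helpers, and the sample xml-stylesheet data are hypothetical:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

class PiDemo {

    // Assumption: an empty in-memory Document from the JDK's JAXP factory
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    // Uses only the generic Node methods, as the serializer does
    static String viaNode(Node node) {
        return "<?" + node.getNodeName() + " " + node.getNodeValue() + "?>";
    }

    // Uses the PI-specific accessors instead
    static String viaInterface(ProcessingInstruction pi) {
        return "<?" + pi.getTarget() + " " + pi.getData() + "?>";
    }

    public static void main(String[] args) {
        ProcessingInstruction pi = newDocument().createProcessingInstruction(
                "xml-stylesheet", "href=\"style.xsl\" type=\"text/xsl\"");
        System.out.println(viaNode(pi));
        System.out.println(viaInterface(pi));
    }
}
```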

While the code to deal with PIs is perfectly workable, there is a problem. In the case that handled document nodes, all the serializer did was pull out the document element and recurse. The problem is that this approach ignores any other child nodes of the Document object, such as top-level PIs and any DOCTYPE declarations. Those node types are actually lateral to the document element (root element), and are ignored. Instead of just pulling out the document element, then, the following code serializes all child nodes on the supplied Document object:

case Node.DOCUMENT_NODE:
    writer.write("<?xml version=\"1.0\"?>");
    writer.write(lineSeparator);

    // recurse on each child
    NodeList nodes = node.getChildNodes( );
    if (nodes != null) {
        for (int i=0; i<nodes.getLength( ); i++) {
            serializeNode(nodes.item(i), writer, "");
        }
    }
    /*
    Document doc = (Document)node;
    serializeNode(doc.getDocumentElement( ), writer, "");
    */
    break;
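A short experiment shows what actually sits at the top level of a Document. This is a sketch under a few assumptions: it uses the JDK's JAXP parser with default settings, the sample XML (including the robots PI) is made up, and TopLevelDemo, parse( ), and topLevelTypes( ) are hypothetical names. The helper returns the getNodeType( ) value of each child of the Document, in order:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TopLevelDemo {

    // Assumption: JAXP parser from the JDK; checked exceptions are wrapped for brevity
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Lists the node type constant of every direct child of the Document
    static String topLevelTypes(Document doc) {
        StringBuilder types = new StringBuilder();
        NodeList nodes = doc.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i > 0) {
                types.append(',');
            }
            types.append(nodes.item(i).getNodeType());
        }
        return types.toString();
    }

    public static void main(String[] args) {
        // An internal DTD subset keeps the parser from fetching an external file
        Document doc = parse(
            "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE book [<!ELEMENT book ANY>]>"
            + "<?robots index=\"no\"?>"
            + "<!-- top-level comment -->"
            + "<book/>");
        System.out.println(topLevelTypes(doc));
    }
}
```

The root element is only one of several children here, which is why pulling out getDocumentElement( ) alone loses information. Note that the XML declaration does not show up as a node at all.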

With this in place, the code can deal with DocumentType nodes, which represent a DOCTYPE declaration. Like PIs, a DTD declaration can be helpful in exposing external information that might be needed in processing an XML document. However, since there can be public and system IDs as well as other DTD-specific data, the code needs to cast the Node instance to the DocumentType interface to access this additional data. Then use the helper methods to get the name of the node (which is the name of the element in the document that is being constrained), the public ID (if it exists), and the system ID of the DTD referenced. Using this information, the original DTD can be serialized:

case Node.DOCUMENT_TYPE_NODE:
    DocumentType docType = (DocumentType)node;
    writer.write("<!DOCTYPE " + docType.getName( ));
    if (docType.getPublicId( ) != null) {
        writer.write(" PUBLIC \"" +
                     docType.getPublicId( ) + "\" ");
    } else {
        writer.write(" SYSTEM ");
    }
    writer.write("\"" + docType.getSystemId( ) + "\">");
    writer.write(lineSeparator);
    break;
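The DocumentType accessors can be exercised without any DTD file on disk. The sketch below assumes the JDK's JAXP factory and creates the node through DOMImplementation.createDocumentType( ); DoctypeDemo and its helpers are hypothetical names, and the check for a missing system ID is my own addition rather than something the serializer above does:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.DocumentType;

class DoctypeDemo {

    // Assumption: DOM implementation supplied by the JDK's JAXP factory
    static DocumentType create(String name, String publicId, String systemId) {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
            return impl.createDocumentType(name, publicId, systemId);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    // Rebuilds the DOCTYPE declaration from name, public ID, and system ID
    static String declaration(DocumentType docType) {
        StringBuilder out = new StringBuilder("<!DOCTYPE ");
        out.append(docType.getName());
        if (docType.getPublicId() != null) {
            out.append(" PUBLIC \"").append(docType.getPublicId()).append("\" ");
        } else if (docType.getSystemId() != null) {
            out.append(" SYSTEM ");
        }
        // Added guard (not in the chapter's code): skip a missing system ID
        if (docType.getSystemId() != null) {
            out.append('"').append(docType.getSystemId()).append('"');
        }
        return out.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(declaration(create("book", null, "DTD/JavaXML.dtd")));
    }
}
```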

All that's left at this point is handling entities and entity references. In this chapter, I will skim over entities and focus on entity references; more details on entities and notations are in the next chapter. For now, a reference can simply be output with the & and ; characters surrounding it:

case Node.ENTITY_REFERENCE_NODE:
    writer.write("&" + node.getNodeName( ) + ";");
    break;

There are a few surprises that may trip you up when it comes to the output from a node such as this. The definition of how entity references should be processed within DOM allows a lot of latitude, and also relies heavily on the underlying parser's behavior. In fact, most XML parsers have expanded and processed entity references before the XML document's data ever makes its way into the DOM tree. Often, when expecting to see an entity reference within your DOM structure, you will find the text or values referenced rather than the entity reference itself. To test this for your parser, you'll want to run the SerializerTest class on the contents.xml document (which I'll cover in the next section) and see what it does with the OReillyCopyright entity reference. In Apache Xerces, this comes across as an entity reference, by the way.
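If you would rather probe your parser directly, a sketch like the following walks the tree and reports whether any entity reference nodes survived parsing. It assumes the JDK's JAXP parser with default settings; the sample document, its replacement text, and the EntityRefDemo helpers are hypothetical. The answer depends on the parser and its configuration, so the sketch prints the result rather than assuming one:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EntityRefDemo {

    // Assumption: JAXP parser from the JDK; checked exceptions are wrapped for brevity
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Depth-first search for any ENTITY_REFERENCE_NODE below the given node
    static boolean containsEntityReference(Node node) {
        if (node.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
            return true;
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (containsEntityReference(child)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The entity is declared in an internal subset, so no file is needed
        Document doc = parse(
            "<!DOCTYPE book [<!ENTITY OReillyCopyright \"Copyright O'Reilly\">]>"
            + "<book>&OReillyCopyright;</book>");
        System.out.println("entity reference nodes present: "
                + containsEntityReference(doc));
        System.out.println("text seen by the application: "
                + doc.getDocumentElement().getTextContent());
    }
}
```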

And that's it! As I mentioned, there are a few other node types, but covering them isn't worth the trouble at this point; you get the idea about how DOM works. In the next chapter, I'll take you deeper than you probably ever wanted to go. For now, let's put the pieces together and see some results.

With the DOMSerializer class complete, all that's left is to invoke the serializer's serialize( ) method in the test class. To do this, add the following lines to the SerializerTest class:

public void test(String xmlDocument, String outputFilename)
    throws Exception {

    File outputFile = new File(outputFilename);
    DOMParser parser = new DOMParser( );

    // Get the DOM tree as a Document object
    parser.parse(xmlDocument);
    Document doc = parser.getDocument( );

    // Serialize
    DOMSerializer serializer = new DOMSerializer( );
    serializer.serialize(doc, outputFile);
}

This fairly simple addition completes the classes, and you can run the example on Chapter 2's contents.xml file, as shown:

C:\javaxml2\build>java javaxml2.SerializerTest c:\javaxml2\ch05\xml\contents.xml output.xml

While you don't get any exciting output here, you can open up the newly created output.xml file and check it over for accuracy. It should contain all the information in the original XML document, with only the differences already discussed in previous sections. A portion of my output.xml is shown in Example 5-3.

Example 5-3. A portion of the output.xml serialized DOM tree

<?xml version="1.0"?>

<!DOCTYPE book SYSTEM "DTD/JavaXML.dtd">

You may notice that there is quite a bit of extra whitespace in the output; that's because the serializer adds some line feeds every time writer.write(lineSeparator) appears in the code. Of course, the underlying DOM tree has some line feeds in it as well, which are reported as Text nodes. The end result in many of these cases is the double line breaks, as seen in the output.
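The whitespace-as-Text-nodes behavior is easy to confirm with a small sketch. It assumes the JDK's non-validating JAXP parser with default settings, and WhitespaceDemo with its helpers is a hypothetical name. A pretty-printed element with one child element ends up with three child nodes, two of which hold nothing but line feeds and indentation:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitespaceDemo {

    // Assumption: JAXP parser from the JDK; checked exceptions are wrapped for brevity
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Counts direct children that are Text nodes made up only of whitespace
    static int whitespaceOnlyChildren(Node parent) {
        int count = 0;
        for (Node child = parent.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<book>\n  <title>DOM</title>\n</book>");
        Node root = doc.getDocumentElement();
        System.out.println("children: " + root.getChildNodes().getLength());
        System.out.println("whitespace-only text nodes: "
                + whitespaceOnlyChildren(root));
    }
}
```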

Let me be very clear that the DOMSerializer class shown in this chapter is for example purposes, and is not a good production solution. While you are welcome to use the class in your own applications, realize that several important options are left out, like encoding and setting advanced options for indentation, line feeds, and line wrapping. Additionally, entities are handled only in passing (complete treatment would be twice as long as this chapter already is!). Your parser probably has its own serializer class, if not multiple classes, that perform this task at least as well as, if not better than, the example in this chapter. However, you now should understand what's going on under the hood in those classes. As a matter of reference, if you are using Apache Xerces, the classes to look at are in the org.apache.xml.serialize package. Some particularly useful ones are the XMLSerializer, XHTMLSerializer, and HTMLSerializer. Check them out; they offer a good solution, until DOM Level 3 comes out with a standardized one.
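For readers on a current JDK, the standardized serializer did eventually arrive as part of DOM Level 3 Load and Save, in the org.w3c.dom.ls package. The sketch below is one way to reach it, under the assumption that your DOM implementation reports the "LS" feature (the one bundled with the JDK does); LoadSaveDemo and its helper methods are hypothetical names:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

class LoadSaveDemo {

    // Builds a tiny document in memory so the sketch needs no input file
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element book = doc.createElement("book");
            book.setAttribute("title", "Java and XML");
            book.appendChild(doc.createTextNode("DOM"));
            doc.appendChild(book);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    static String serialize(Document doc) {
        // Ask the implementation for its Load and Save support
        Object feature = doc.getImplementation().getFeature("LS", "3.0");
        if (!(feature instanceof DOMImplementationLS)) {
            throw new IllegalStateException("DOM Level 3 Load and Save not available");
        }
        LSSerializer serializer = ((DOMImplementationLS) feature).createLSSerializer();
        // Leave out the XML declaration to keep the output short
        serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```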

In document Java and XML 2nd Edition Brett McLaughlin pdf (Page 98-104)