Traversal - DOM Level 2 Modules - Java and XML, 2nd Edition, Brett McLaughlin


6.3 DOM Level 2 Modules

6.3.2 Traversal

First up on the list is the DOM Level 2 Traversal module. This is intended to provide tree-walking capability, but also to allow you to refine the nature of that behavior. In the earlier section on DOM mutation, I mentioned that most of your DOM code will know something about the structure of the DOM tree it works with; this allows for quick traversal and modification of both structure and content. However, for those times when you do not know the structure of the document, the traversal module comes into play.

Consider the auction site again, and the items input by the user. Most critical are the item name and the description. Since most popular auction sites provide some sort of search, you would want to provide the same in this fictional example. Just searching item titles isn't going to cut it in the real world; instead, a set of key words should be extracted from the item descriptions. I say key words because you don't want a search on "adirondack top" (which to a guitar lover obviously applies to the wood on the top of a guitar) to return toys ("top") from a particular mountain range ("Adirondack"). The best way to do this in the format discussed so far is to extract words that are formatted in a certain way. So the words in the description that are bolded, or in italics, are perfect candidates. Of course, you could grab all the nontextual child elements of the description element. However, you'd have to weed through links (the a element), image references (img), and so forth. What you really want is to specify a custom traversal. Good news; you're in the right place.

The whole of the traversal module is contained within the org.w3c.dom.traversal package. Just as everything within core DOM begins with a Document interface, everything in DOM Traversal begins with the org.w3c.dom.traversal.DocumentTraversal interface. This interface provides two methods:

NodeIterator createNodeIterator(Node root, int whatToShow, NodeFilter filter, boolean expandEntityReferences);

TreeWalker createTreeWalker(Node root, int whatToShow, NodeFilter filter, boolean expandEntityReferences);

Most DOM implementations that support traversal choose to have their org.w3c.dom.Document implementation class implement the DocumentTraversal interface as well; this is how it works in Xerces. In a nutshell, using a NodeIterator provides a list view of the nodes it iterates over; the closest analogy is a standard Java List (in the java.util package). TreeWalker provides a tree view, which you may be more used to in working with XML by now.
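To make the "list view" idea concrete, here is a minimal sketch rather than anything from the book's example code. It assumes the Document returned by the JDK's JAXP parser can be cast to DocumentTraversal (the JDK bundles a Xerces-derived implementation, but other parsers may differ). The class name ListViewDemo, the parse( ) helper, and the sample XML are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class ListViewDemo {

    // Parse an XML string with the JDK's JAXP parser (assumption: the
    // returned Document also implements DocumentTraversal).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Flatten every element at or below the root into document order.
    static List<String> elementNames(Document doc) {
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
        List<String> names = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            names.add(n.getNodeName());
        }
        it.detach(); // tell the implementation the iterator is no longer needed
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<item><name>Guitar</name><description><b>ebony</b></description></item>");
        System.out.println(elementNames(doc));
    }
}
```

Note that the nesting of b inside description is lost in the result; the iterator simply hands back one node after another, starting with the root itself.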

6.3.2.1 NodeIterator

I want to get past all the conceptualization and into the code sample I referred to earlier. I want access to all content within the description of an item from the auction site that is within a specific set of formatting tags. To do this, I first need access to the DOM tree itself. Since this doesn't fit into the servlet approach (you probably wouldn't have a servlet building the search phrases, you'd have some standalone class), I need a new class, ItemSearcher (Example 6-5). This class takes any number of item files to search through as arguments.

Example 6-5. The ItemSearcher class

package javaxml2;

import java.io.File;

// DOM imports
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

// Vendor parser
import org.apache.xerces.parsers.DOMParser;

public class ItemSearcher {

    private String docNS = "http://www.oreilly.com/javaxml2";

    public void search(String filename) throws Exception {
        // Parse into a DOM tree
        File file = new File(filename);
        DOMParser parser = new DOMParser( );
        parser.parse(file.toURL().toString( ));
        Document doc = parser.getDocument( );

        // Get node to start iterating with
        Element root = doc.getDocumentElement( );
        NodeList descriptionElements =
            root.getElementsByTagNameNS(docNS, "description");
        Element description = (Element)descriptionElements.item(0);

        // Get a NodeIterator
        NodeIterator i = ((DocumentTraversal)doc)
            .createNodeIterator(description, NodeFilter.SHOW_ALL, null, true);

        Node n;
        while ((n = i.nextNode( )) != null) {
            if (n.getNodeType( ) == Node.ELEMENT_NODE) {
                System.out.println("Encountered Element: '" +
                    n.getNodeName( ) + "'");
            } else if (n.getNodeType( ) == Node.TEXT_NODE) {
                System.out.println("Encountered Text: '" +
                    n.getNodeValue( ) + "'");
            }
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No item files to search through specified.");
            return;
        }

        try {
            ItemSearcher searcher = new ItemSearcher( );
            for (int i=0; i<args.length; i++) {
                System.out.println("Processing file: " + args[i]);
                searcher.search(args[i]);
            }
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}

As you can see, I've created a NodeIterator, and supplied it the description element to start with for iteration. The constant passed as the whatToShow argument instructs the iterator to show all nodes. You could just as easily provide values like NodeFilter.SHOW_ELEMENT and NodeFilter.SHOW_TEXT, which would show only elements or textual nodes, respectively. I haven't yet provided a NodeFilter implementation (I'll get to that next), and I allowed for entity reference expansion. What is nice about all this is that the iterator, once created, doesn't have just the child nodes of description. Instead, it actually has all nodes under description, even when nested multiple levels deep. This is extremely handy for dealing with unknown XML structure!

At this point, you still have all the nodes, which is not what you want. I added some code (the while loop) to show you how to print out the element and text node results. You can run the code as is, but it's not going to help much. Instead, the code needs to provide a filter, so it only picks up elements with the formatting desired: the text within an i or b block. You can provide this customized behavior by supplying a custom implementation of the NodeFilter interface, which defines a single method:

public short acceptNode(Node n);

This method should return NodeFilter.FILTER_SKIP, NodeFilter.FILTER_REJECT, or NodeFilter.FILTER_ACCEPT. The first skips the examined node, but continues to iterate over its children; the second rejects the examined node and its children (this applies only to TreeWalker; a NodeIterator treats it the same as FILTER_SKIP); and the third accepts and passes on the examined node. It behaves a lot like SAX, in that you can intercept nodes as they are being iterated and decide if they should be passed on to the calling method. Add the following nonpublic class to the ItemSearcher.java source file:

class FormattingNodeFilter implements NodeFilter {

    public short acceptNode(Node n) {
        if (n.getNodeType( ) == Node.TEXT_NODE) {
            Node parent = n.getParentNode( );
            if ((parent.getNodeName( ).equalsIgnoreCase("b")) ||
                (parent.getNodeName( ).equalsIgnoreCase("i"))) {
                return FILTER_ACCEPT;
            }
        }

        // If we got here, not interested
        return FILTER_SKIP;
    }
}

This is just plain old DOM code, and shouldn't pose any difficulty to you. First, the code only wants text nodes; the text of the formatted elements is desired, not the elements themselves. Next, the parent is determined, and since it's safe to assume that Text nodes have Element node parents, the code immediately invokes getNodeName( ). If the element name is either "b" or "i", the code has found search text, and returns FILTER_ACCEPT. Otherwise, FILTER_SKIP is returned.
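The difference between FILTER_SKIP and FILTER_REJECT is easier to see side by side. The following is a small sketch under stated assumptions, not part of the book's code: it uses the JDK's JAXP parser instead of the Xerces DOMParser class, and the class name FilterVerdictDemo, its helper methods, and the tiny sample document are invented for illustration. The filter returns a chosen verdict for a elements and accepts everything else.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class FilterVerdictDemo {

    // Parse an XML string with the JDK's JAXP parser (an assumption; the
    // chapter itself uses the Xerces DOMParser class).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the given verdict for <a> elements and accepts all other nodes.
    static NodeFilter verdictForA(short verdict) {
        return node -> "a".equals(node.getNodeName())
                ? verdict : NodeFilter.FILTER_ACCEPT;
    }

    // Element names visited by a TreeWalker moving forward from the root.
    static List<String> walk(Document doc, short verdict) {
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT,
                verdictForA(verdict), true);
        List<String> names = new ArrayList<>();
        for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
            names.add(n.getNodeName());
        }
        return names;
    }

    // Element names returned by a NodeIterator over the same subtree.
    static List<String> iterate(Document doc, short verdict) {
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT,
                verdictForA(verdict), true);
        List<String> names = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            names.add(n.getNodeName());
        }
        it.detach();
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<d><a><b/></a><c/></d>");
        System.out.println("walker, skip:     " + walk(doc, NodeFilter.FILTER_SKIP));
        System.out.println("walker, reject:   " + walk(doc, NodeFilter.FILTER_REJECT));
        System.out.println("iterator, reject: " + iterate(doc, NodeFilter.FILTER_REJECT));
    }
}
```

With the walker, skipping a still exposes its child b, while rejecting a hides the whole subtree. The iterator ignores the distinction, and it also returns the root element first, which the walker's nextNode( ) does not because the walker starts out positioned on the root.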

All that's left now is a change to the iterator creation call instructing it to use the new filter implementation, and to the output, both in the existing search( ) method of the ItemSearcher class:

// Get a NodeIterator
NodeIterator i = ((DocumentTraversal)doc)
    .createNodeIterator(description, NodeFilter.SHOW_ALL,
        new FormattingNodeFilter( ), true);

Node n;
while ((n = i.nextNode( )) != null) {
    System.out.println("Search phrase found: '" + n.getNodeValue( ) + "'");
}

Some astute readers will wonder what happens when a NodeFilter implementation conflicts with the constant supplied to the createNodeIterator( ) method (in this case that constant is NodeFilter.SHOW_ALL). Actually, the whatToShow constant is applied first, and only the nodes that pass it are handed to the filter implementation. If I had supplied the constant NodeFilter.SHOW_ELEMENT, I would not have gotten any search phrases, because my filter would not have received any Text nodes to examine; just Element nodes. Be careful to use the two together in a way that makes sense. In the example, I could have safely used NodeFilter.SHOW_TEXT also.
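The interaction between the whatToShow constant and a filter can be checked with a short experiment. This is only a sketch: it assumes the JDK's JAXP parser in place of the Xerces DOMParser class, re-creates the logic of FormattingNodeFilter as a lambda, and uses a made-up class name (WhatToShowDemo) and a made-up one-line description.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class WhatToShowDemo {

    // Parse an XML string with the JDK's JAXP parser (an assumption; the
    // chapter itself uses the Xerces DOMParser class).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Collect bold/italic text using the same rule as FormattingNodeFilter,
    // but with a caller-supplied whatToShow mask.
    static List<String> phrases(Document doc, int whatToShow) {
        NodeFilter formatting = n -> {
            if (n.getNodeType() == Node.TEXT_NODE) {
                String parent = n.getParentNode().getNodeName();
                if (parent.equalsIgnoreCase("b") || parent.equalsIgnoreCase("i")) {
                    return NodeFilter.FILTER_ACCEPT;
                }
            }
            return NodeFilter.FILTER_SKIP;
        };
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), whatToShow, formatting, true);
        List<String> found = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            found.add(n.getNodeValue());
        }
        it.detach();
        return found;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<description>A <b>great</b> guitar with <i>ebony</i> bridge</description>");
        System.out.println("SHOW_ALL:     " + phrases(doc, NodeFilter.SHOW_ALL));
        System.out.println("SHOW_TEXT:    " + phrases(doc, NodeFilter.SHOW_TEXT));
        System.out.println("SHOW_ELEMENT: " + phrases(doc, NodeFilter.SHOW_ELEMENT));
    }
}
```

SHOW_ALL and SHOW_TEXT yield the same phrases, while SHOW_ELEMENT yields none, because the Text nodes never reach the filter.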

Now, the class is useful and ready to run. Executing it on the bourgOM.xml file I explained in the first section, I get the following results:

bmclaugh@GANDALF ~/javaxml2/build
$ java javaxml2.ItemSearcher ../ch06/xml/item-bourgOM.xml
Processing file: ../ch06/xml/item-bourgOM.xml
Search phrase found: 'beautiful'
Search phrase found: 'Sitka-topped'
Search phrase found: 'Indian Rosewood'
Search phrase found: 'huge sound'
Search phrase found: 'great action'
Search phrase found: 'fossilized ivory'
Search phrase found: 'ebony'
Search phrase found: 'great guitar'

This is perfect: all of the bolded and italicized phrases are now ready to be added to a search facility. (Sorry; you'll have to write that yourself!)

6.3.2.2 TreeWalker

The TreeWalker interface is almost exactly the same as the NodeIterator interface; the only difference is that you get a tree view instead of a list view. This is primarily useful if you want to deal with only a certain type of node within a tree; for instance, the tree with only elements or without any comments. By using the constant filter value (such as NodeFilter.SHOW_ELEMENT) and a filter implementation (like one that passes on FILTER_SKIP for all comments), you can essentially get a view of a DOM tree without extraneous information. The TreeWalker interface provides all the basic node operations, such as firstChild( ), parentNode( ), nextSibling( ), and of course getCurrentNode( ), which tells you where you are currently walking.

I'm not going to give an example here. By now, you should see that this is identical to dealing with a standard DOM tree, except that you can filter out unwanted items by using the NodeFilter constants. This is a great, simple way to limit your view of XML documents to only information you are interested in seeing. Use it well; it's a real asset, as is NodeIterator! You can also check out the complete specification online at http://www.w3.org/TR/DOM-Level-2-Traversal-Range/.
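For readers who would still like something concrete, here is a minimal sketch of walking an elements-only view of a tree. It is not from the book: it assumes the JDK's JAXP parser returns a Document that implements DocumentTraversal, and the class name TreeOutlineDemo, its methods, and the sample XML are hypothetical. Comments and text are hidden by NodeFilter.SHOW_ELEMENT, yet the parent/child structure of the remaining elements is preserved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class TreeOutlineDemo {

    // Parse an XML string with the JDK's JAXP parser (an assumption; the
    // chapter itself uses the Xerces DOMParser class).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Append the walker's current node, then recurse into its visible children.
    static void outline(TreeWalker walker, int depth, StringBuilder out) {
        Node current = walker.getCurrentNode();
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(current.getNodeName()).append('\n');
        for (Node child = walker.firstChild(); child != null;
                child = walker.nextSibling()) {
            outline(walker, depth + 1, out);
        }
        walker.setCurrentNode(current); // step back up before returning
    }

    // Build an indented outline of the elements in the document.
    static String elementOutline(Document doc) {
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
        StringBuilder out = new StringBuilder();
        outline(walker, 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Document doc = parse("<item><!-- internal note --><name>Guitar</name>"
                + "<description><b>ebony</b><!-- check --><i>ivory</i></description></item>");
        System.out.print(elementOutline(doc));
    }
}
```

The same recursion over plain DOM would need explicit checks for comment and text nodes; here the walker does that filtering, so the traversal code only ever sees elements.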

6.3.3 Range

The DOM Level 2 Range module is one of the least commonly used modules, probably due to a lack of understanding of DOM Range rather than any lack of usefulness. This module provides a way to deal with a set of content within a document. Once you've defined that range of content, you can insert into it, copy it, delete parts of it, and manipulate it in various ways. The most important thing to start with is realizing that "range" in this sense refers to a number of pieces of a DOM tree grouped together. It does not refer to a set of allowed values, where a high and low or start and end are defined. Therefore, DOM Range has nothing at all to do with validation of data values. Get that, and you're already ahead of the pack.

Like traversal, working with Range involves a new DOM package: org.w3c.dom.ranges. There are actually only two interfaces and one exception within this package, so it won't take you long to get your bearings. First is the analog to Document (and DocumentTraversal): that's org.w3c.dom.ranges.DocumentRange. As with the DocumentTraversal interface, Xerces' Document implementation class implements DocumentRange. And also like DocumentTraversal, it has very few interesting methods; in fact, only one:

public Range createRange( );

All other range operations operate upon the Range class (rather, an implementation of the interface; but you get the idea). Once you've got an instance of the Range interface, you can set the starting and ending points, and edit away. As an example, let's go back to the UpdateItemServlet. I mentioned that it's a bit of a hassle to try and remove all the children of the description element and then set the new description text; that's because there is no way to tell if a single Text node is within the description, or if many elements and text nodes, as well as nested nodes, exist within a description that is primarily HTML. I showed you how to simply remove the old description element and create a new one. However, DOM Range makes this unnecessary. Take a look at this modification to the doPost( ) method of that servlet:

// Load document
try {
    DOMParser parser = new DOMParser( );
    parser.parse(xmlFile.toURL().toString( ));
    doc = parser.getDocument( );

    Element root = doc.getDocumentElement( );

    // Name of item
    NodeList nameElements =
        root.getElementsByTagNameNS(docNS, "name");
    Element nameElement = (Element)nameElements.item(0);
    Text nameText = (Text)nameElement.getFirstChild( );
    nameText.setData(name);

    // Description of item
    NodeList descriptionElements =
        root.getElementsByTagNameNS(docNS, "description");
    Element descriptionElement = (Element)descriptionElements.item(0);

    // Remove and recreate description
    Range range = ((DocumentRange)doc).createRange( );
    range.setStartBefore(descriptionElement.getFirstChild( ));
    range.setEndAfter(descriptionElement.getLastChild( ));
    range.deleteContents( );
    Text descriptionText = doc.createTextNode(description);
    descriptionElement.appendChild(descriptionText);
    range.detach( );
} catch (SAXException e) {
    // Print error
    PrintWriter out = res.getWriter( );
    res.setContentType("text/html");
    out.println("<HTML><BODY>Error in reading XML: " +
        e.getMessage( ) + ".</BODY></HTML>");
    out.close( );
    return;
}

To remove all the content, I first create a new Range, using the DocumentRange cast. You'll need to add import statements for the DocumentRange and Range classes to your servlet, too (they are both in the org.w3c.dom.ranges package).

In the first part of the DOM Level 2 Modules section, I showed you how to check which modules a parser implementation supports. I realize that Xerces reported that it did not support Range. However, running this code with Xerces 1.3.0, 1.3.1, and 1.4 all worked without a hitch. Strange, isn't it?

Once the range is ready, set the starting and ending points. Since I want all content within the description element, I start before the first child of that Element node (using setStartBefore( )), and end after its last child (using setEndAfter( )). There are other, similar methods for this task, setStartAfter( ) and setEndBefore( ). Once that's done, it's simple to call deleteContents( ). Just like that, not a bit of content is left. Then the servlet creates the new textual description and appends it. Finally, I let the DOM implementation know that it can release any resources associated with the Range by calling detach( ). While this step is commonly overlooked, it can really help with lengthy bits of code that use the extra resources.
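The same sequence of steps can be tried outside the servlet. The following is a sketch under stated assumptions rather than the book's code: it relies on the JDK's JAXP parser returning a Document that implements DocumentRange, and the class name RangeReplaceDemo, the replaceContents( ) helper, and the sample XML are invented here. It also adds a guard for an empty element, since getFirstChild( ) returns null in that case and there would be nothing to use as a range boundary.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ranges.DocumentRange;
import org.w3c.dom.ranges.Range;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class RangeReplaceDemo {

    // Parse an XML string with the JDK's JAXP parser (an assumption; the
    // chapter itself uses the Xerces DOMParser class).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Delete whatever the element contains, then add a single Text node.
    static void replaceContents(Document doc, Element element, String newText) {
        Range range = ((DocumentRange) doc).createRange();
        try {
            if (element.hasChildNodes()) {
                range.setStartBefore(element.getFirstChild());
                range.setEndAfter(element.getLastChild());
                range.deleteContents();
            }
            element.appendChild(doc.createTextNode(newText));
        } finally {
            range.detach(); // always release the range, even on failure
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<item><description>A <b>great</b> guitar</description></item>");
        Element description =
                (Element) doc.getElementsByTagName("description").item(0);
        replaceContents(doc, description, "Sold");
        System.out.println(description.getTextContent());
    }
}
```

Placing detach( ) in a finally block is a design choice of this sketch; it simply makes sure the call is not skipped if one of the range operations throws.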

Another option is to use extractContents( ) instead of deleteContents( ). This method removes the content, then returns the content that has been removed. You could insert this as an archived element, for example:

// Remove and recreate description
Range range = ((DocumentRange)doc).createRange( );
range.setStartBefore(descriptionElement.getFirstChild( ));
range.setEndAfter(descriptionElement.getLastChild( ));
Node oldContents = range.extractContents( );
Text descriptionText = doc.createTextNode(description);
descriptionElement.appendChild(descriptionText);

// Set this as content to some other, archival, element
archivalElement.appendChild(oldContents);

Don't try this in your servlet; there is no archivalElement in this code, and it is just for demonstration purposes. However, it should be starting to sink in that the DOM Level 2 Range module can really help you in editing documents' contents. It also provides yet another way to get a handle on content when you aren't sure of the structure of that content ahead of time.
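If you want to experiment with extractContents( ) anyway, a standalone sketch along these lines should do. Everything specific in it is an assumption made for illustration: the archive element in the sample XML, the class name RangeArchiveDemo, and the use of the JDK's JAXP parser in place of the Xerces DOMParser class. It also uses selectNodeContents( ), another method of the Range interface, as a shortcut for setting both boundaries around an element's children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.ranges.DocumentRange;
import org.w3c.dom.ranges.Range;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class RangeArchiveDemo {

    // Parse an XML string with the JDK's JAXP parser (an assumption; the
    // chapter itself uses the Xerces DOMParser class).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Move the old description content into 'archive', then set new text.
    static void archiveAndReplace(Document doc, Element description,
                                  Element archive, String newText) {
        Range range = ((DocumentRange) doc).createRange();
        try {
            range.selectNodeContents(description); // all children of description
            DocumentFragment oldContents = range.extractContents();
            description.appendChild(doc.createTextNode(newText));
            archive.appendChild(oldContents); // the fragment's children move over
        } finally {
            range.detach();
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<item><description>A <b>great</b> guitar</description>"
                + "<archive/></item>");
        Element description =
                (Element) doc.getElementsByTagName("description").item(0);
        Element archive = (Element) doc.getElementsByTagName("archive").item(0);
        archiveAndReplace(doc, description, archive, "Sold again");
        System.out.println("description: " + description.getTextContent());
        System.out.println("archive:     " + archive.getTextContent());
    }
}
```

Because extractContents( ) returns a DocumentFragment, appending it to another element transfers the extracted nodes, markup included, rather than the fragment itself.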

There's a lot more to ranges in DOM; check this out on your own, along with all of the DOM modules covered in this chapter. However, you should now have enough of an understanding of the basics to get you going. Most importantly, realize that with any active Range instance, you can simply invoke range.insertNode(Node newNode) to add new content at the start of the range, wherever that is in the document! It is this robust editing quality of ranges that makes them so attractive. The next time you need to delete, copy, extract, or add content to a structure that you know little about, think about using ranges. The specification gives you information on all this and more, and is located online at http://www.w3.org/TR/DOM-Level-2-Traversal-Range/.
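As a last illustration, here is a small sketch of insertNode( ). It is an assumption-laden example, not the book's code: the class name RangeInsertDemo, the insertBefore( ) helper, the updated element, and the sample XML are all hypothetical, and the JDK's JAXP parser stands in for the Xerces DOMParser class. The range is positioned with selectNode( ), so its start boundary sits just before the reference node, and insertNode( ) places the new node at that boundary.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ranges.DocumentRange;
import org.w3c.dom.ranges.Range;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the book's example code.
class RangeInsertDemo {

    // Parse an XML string with the JDK's JAXP parser (an assumption; the
    // chapter itself uses the Xerces DOMParser class).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Insert newNode immediately before 'reference' using a Range.
    static void insertBefore(Document doc, Node reference, Node newNode) {
        Range range = ((DocumentRange) doc).createRange();
        try {
            range.selectNode(reference); // range now starts just before reference
            range.insertNode(newNode);   // new content goes in at the range's start
        } finally {
            range.detach();
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<item><name>Guitar</name><description>Nice</description></item>");
        Node description = doc.getElementsByTagName("description").item(0);
        insertBefore(doc, description, doc.createElement("updated"));

        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            System.out.println(children.item(i).getNodeName());
        }
    }
}
```

For this simple case the core DOM insertBefore( ) method would do the same job; the range version becomes more interesting when the start boundary falls inside a Text node, a case this sketch does not cover.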
