
8.4 ■ Prototype 1: Intermediate representation

In the first phase we will develop a component that supports a structural representation of HTML documents. This component must provide methods to build, explore, modify and serialize the representation of HTML pages. The idea is to offer an easy-to-use object-oriented interface for building and manipulating HTML content. This kind of facility is usually called an “object model”.

8.4.1  Analysis

An example of HTML that should be handled by the component is presented in Figure 8.6. It contains an arbitrary selection of the most common tags found in HTML pages on the internet, i.e.:

■ <HTML> and </HTML> delimit the entire document;

■ <HEAD><TITLE> and </TITLE></HEAD> delimit the document title;

■ <BODY> and </BODY> delimit the document body;

■ <H1> and </H1> delimit a first-order title;

■ <A HREF...> and </A> identify a link to another HTML document.

Figure 8.6 Sample HTML

    <HTML>
    <HEAD><TITLE>Test</TITLE></HEAD>
    <BODY bgcolor=white>
    <H1> This is a test </H1>
    This is plain text
    <A HREF="http://www.w3c.org">
    Click here...
    </A>
    </BODY>
    </HTML>

The structural representation shown in Figure 8.7 corresponds to the object structure of the sample HTML document. This figure shows a UML object diagram; the names of the objects have been chosen to be as intuitive as possible. The class names correspond to the element types identified above.

The structure of the corresponding class diagram (shown in Figure 8.8) is isomorphic to the static object structure representing the page. Class Page is the root of the containment hierarchy. It directly contains Head and Body. Class Head is atomic, i.e. it does not contain any other element class. Class Body, on the other hand, represents a composed element and has a containment relationship with the other three classes: Heading, Text and Link. Both Heading and Link contain Text. Text is atomic.

Figure 8.7 Analysis of sample HTML document

    sample : Page
        test : Head
        body : Body
            a title : Heading
                title text : Text
            plain text : Text
            a link : Link
                link text : Text

Figure 8.8 Analysis class diagram (classes: Page, Head, Body, Heading, Text, Link; the containment associations are ordered)

The only additions that cannot be directly derived from the static object structure are the mutual containment relationships between Heading and Link; these relationships have been added based on knowledge of the HTML language.

All containment relationships in this class diagram are ordered because the order in which contained elements appear in the container element is important (e.g. the heading of a section should appear before the text that constitutes its content).

Class Page does not contain any specific information. Class Head contains only the title of the HTML page (i.e. Test in the example above). Class Body contains the background colour, which is encoded in the bgcolor parameter inside the <body> tag. Class Text contains the fragment of text it represents. Class Heading contains the text that constitutes the heading. Finally, class Link contains the destination URL of the hyperlink. In addition, all classes except Head and Text can contain other elements that are linked to them.
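The analysis model can be transcribed almost literally into Java. The following is only a minimal sketch of that transcription, not the design adopted later in this section: the wrapper class `AnalysisModel`, the marker interface `Content`, the field names (`title`, `bgcolor`, `content`, `href`, `children`) and the helper `samplePage` are assumptions introduced for illustration. Ordered containment is modelled here with `java.util.List`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical transcription of the analysis classes (not the final design).
class AnalysisModel {
    // Marker for what Body, Heading and Link may contain.
    // It is coarser than the diagram: it would also let a Heading hold a Heading.
    interface Content {}

    static class Text implements Content {
        final String content;                                // the fragment of text
        Text(String content) { this.content = content; }
    }

    static class Heading implements Content {
        final List<Content> children = new ArrayList<>();    // ordered containment
    }

    static class Link implements Content {
        final String href;                                   // destination URL
        final List<Content> children = new ArrayList<>();
        Link(String href) { this.href = href; }
    }

    static class Head {
        final String title;                                  // title of the page
        Head(String title) { this.title = title; }
    }

    static class Body {
        final String bgcolor;                                // background colour
        final List<Content> children = new ArrayList<>();
        Body(String bgcolor) { this.bgcolor = bgcolor; }
    }

    static class Page {
        final Head head;
        final Body body;
        Page(Head head, Body body) { this.head = head; this.body = body; }
    }

    // Builds the object structure of the sample document.
    static Page samplePage() {
        Body body = new Body("white");
        Heading heading = new Heading();
        heading.children.add(new Text("This is a test"));
        body.children.add(heading);
        body.children.add(new Text("This is plain text"));
        Link link = new Link("http://www.w3c.org");
        link.children.add(new Text("Click here..."));
        body.children.add(link);
        return new Page(new Head("Test"), body);
    }

    public static void main(String[] args) {
        Page page = samplePage();
        System.out.println(page.head.title + ": " + page.body.children.size() + " body elements");
    }
}
```

Each class needs its own typed fields and its own access code, which is the duplication that the design step tries to factor out.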

8.4.2  Design

We now refine the analysis model described so far. First we identify commonalities among different classes and factor them into a common base class connected to the specific classes by inheritance. This is an application of the class generalization idiom presented in Chapter 6. Three main characteristics can be identified as common to the elements of the analysis model.

1 Most elements have a string associated with them, which represents the content of the element or some other parameter. The information associated with the various elements may be very different; the common characteristic is that it can be stored in a string. Examples of such information are the background colour of the body tag, or the text that constitutes a document fragment.

2 All elements must provide an opening tag and a closing one.

3 Most elements must provide mechanisms to iterate on the contained elements for exploring the page elements and for serializing the page contents.

The main differences among these elements are related to the mechanisms used to iterate on the document structure. Each class must provide specific methods for accessing its content; for instance class Page should provide a method for accessing the Head component and another method for accessing its Body component.

Decision point

Iteration can be implemented using two different approaches: element-specific methods or generic methods.

The first approach consists of defining specific document iteration methods for each class. This approach is straightforward in concept but awkward to implement: some elements (e.g. Body) contain a set of heterogeneous components (e.g. Text and Link), and in order to iterate over them the container class must recognize their type. This can be done using the reflection capability of the Java language. Unfortunately, the use of reflection often results in inelegant code.
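To see why type recognition tends to become inelegant, consider a container that must tell its heterogeneous components apart. The sketch below is hypothetical: the wrapper `TypeDispatchSketch`, the `describe` method and the simplified `Text`, `Link` and `Body` classes are not part of the prototype, and the run-time type check is done with the `instanceof` operator rather than with the full reflection API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of element-specific iteration with type checks.
class TypeDispatchSketch {
    static class Text {
        final String content;
        Text(String content) { this.content = content; }
    }

    static class Link {
        final String href;
        Link(String href) { this.href = href; }
    }

    static class Body {
        // Heterogeneous, ordered components: no common base class is assumed.
        final List<Object> components = new ArrayList<>();

        // Every new kind of component forces another branch in this method.
        String describe() {
            StringBuilder sb = new StringBuilder();
            for (Object c : components) {
                if (c instanceof Text) {
                    sb.append("text(").append(((Text) c).content).append(")");
                } else if (c instanceof Link) {
                    sb.append("link(").append(((Link) c).href).append(")");
                } else {
                    throw new IllegalStateException("unknown component: " + c.getClass());
                }
                sb.append(';');
            }
            return sb.toString();
        }
    }

    static String demo() {
        Body body = new Body();
        body.components.add(new Text("This is plain text"));
        body.components.add(new Link("http://www.w3c.org"));
        return body.describe();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The chain of type tests and casts has to be repeated in every container and revised whenever an element type is added.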

The second approach consists of defining an abstract container class (e.g. Element) and deriving a concrete subclass for each element (see Figure 8.9). The abstract container class defines a set of generic methods for document iteration. This solution has been adopted by the World Wide Web Consortium (W3C 2004c). For iteration on elements of an HTML page two methods are defined in the common base class:

■ firstChild, invoked on the container element, returns the first component;

■ nextSibling, invoked on a component element, returns the next component inside the same container.

Figure 8.9 HTML object model (abstract class Element with operations getOpeningTag(), getClosingTag(), firstChild(), nextSibling() and isComposed(); subclasses Page, Head, Body, Heading, Text and Link with their containment associations; Page adds serialize_all())

A straightforward implementation of these two methods requires that the container holds a reference to the first component, and each component has a reference to the next element. In addition it is useful to have a method to test whether a given element is composite or not; this will make the code that iterates on the elements simpler.
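The reference structure just described can be sketched as follows. This is an illustration under stated assumptions rather than the prototype code: the class `Node`, its `append` method and the helper `childNames` are hypothetical names, and, for brevity, a node is treated as composite whenever it has at least one child.

```java
// Hypothetical sketch of the firstChild/nextSibling reference structure.
class SiblingChainSketch {
    static class Node {
        final String name;
        Node firstChild;    // container -> its first component
        Node nextSibling;   // component -> next component in the same container

        Node(String name) { this.name = name; }

        // Simplification for this sketch: composite means "has children".
        boolean isComposite() { return firstChild != null; }

        // Appends a component at the end of the sibling chain.
        void append(Node child) {
            if (firstChild == null) {
                firstChild = child;
                return;
            }
            Node last = firstChild;
            while (last.nextSibling != null) {
                last = last.nextSibling;
            }
            last.nextSibling = child;
        }
    }

    // Visits the direct components of a container in order.
    static String childNames(Node container) {
        StringBuilder sb = new StringBuilder();
        for (Node c = container.firstChild; c != null; c = c.nextSibling) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(c.name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node body = new Node("body");
        body.append(new Node("a title"));
        body.append(new Node("plain text"));
        body.append(new Node("a link"));
        System.out.println(childNames(body)); // prints "a title, plain text, a link"
    }
}
```

Appending walks the whole chain, so it costs time proportional to the number of siblings; for the small pages considered here this is usually acceptable.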

The problem is to transform a two-dimensional representation (the tree structure of the objects) into a one-dimensional, serialized representation (the sequence of characters that form an HTML document). Most HTML elements have an opening and a closing tag; in serialized form, their children are placed between the opening tag and the closing tag. The object structure can therefore be serialized by traversing the tree.

Two methods should be provided to read the opening and closing tags: getOpeningTag() and getClosingTag(). In the case of Text, both methods should return an empty string. This solution can be combined with the iteration in order to obtain a straightforward serialization method.

The Page class defines a serialization method that is able to serialize the entire web page. The serialization procedure is fairly simple thanks to the methods provided by class Element. For each element the serialization should start by printing the opening tag and terminate by printing the closing tag. In between, if the element is composed, the procedure should iterate on all the contained elements and print them. Since each contained element should be printed using the same procedure, a recursive procedure fits our needs very well. The pseudocode for such a method is the following:

serialize_all( Element current ){
    print( current.getOpeningTag() );
    if( current.isComposite() ){
        Element child = current.firstChild();
        while( child != null ){
            serialize_all( child );
            child = child.nextSibling();
        }
    }
    print( current.getClosingTag() );
}

This solution would work fine for all elements except Text: it has neither an opening nor a closing tag, but it has a different kind of information that needs to be serialized. A simple and smart solution to this problem is to make the getOpeningTag() method of class Text return the content of the element. When the serialize_all() method is invoked with a Text as argument, it prints the content, then skips the iteration code (since Text is not composed); finally, the getClosingTag() method returns an empty string and so does not affect the output at all.
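The centralized procedure and the Text trick can be tried out with a small runnable sketch. It is a simplified illustration, not the prototype: the wrapper `CentralSerializeSketch`, the generic composite `Tag`, the method names `serializeAll` and `render`, and the use of a `StringBuilder` as output are assumptions made to keep the example short.

```java
// Hypothetical, simplified version of the centralized serialization.
class CentralSerializeSketch {
    static abstract class Element {
        Element child;      // first contained element, if any
        Element sibling;    // next element in the same container
        abstract String getOpeningTag();
        abstract String getClosingTag();
        boolean isComposite() { return false; }
        Element firstChild() { return child; }
        Element nextSibling() { return sibling; }
    }

    // Generic composite used only in this sketch.
    static class Tag extends Element {
        final String name;
        Tag(String name, Element... children) {
            this.name = name;
            Element previous = null;
            for (Element c : children) {          // chain the children in order
                if (previous == null) child = c; else previous.sibling = c;
                previous = c;
            }
        }
        String getOpeningTag() { return "<" + name + ">"; }
        String getClosingTag() { return "</" + name + ">"; }
        boolean isComposite() { return true; }
    }

    static class Text extends Element {
        final String content;
        Text(String content) { this.content = content; }
        String getOpeningTag() { return content; }   // the content plays the role of the tag
        String getClosingTag() { return ""; }        // nothing to close
    }

    // Recursive, centralized serialization: it walks the structure from outside.
    static void serializeAll(Element current, StringBuilder out) {
        out.append(current.getOpeningTag());
        if (current.isComposite()) {
            for (Element c = current.firstChild(); c != null; c = c.nextSibling()) {
                serializeAll(c, out);
            }
        }
        out.append(current.getClosingTag());
    }

    static String render(Element root) {
        StringBuilder out = new StringBuilder();
        serializeAll(root, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Element body = new Tag("BODY",
                new Tag("H1", new Text("This is a test")),
                new Text("This is plain text"));
        // prints <BODY><H1>This is a test</H1>This is plain text</BODY>
        System.out.println(render(body));
    }
}
```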

The solution provided above for serialization has some defects; in particular it partly violates the information hiding principle. In fact it accesses the internal structure of each element in order to serialize its children.

A better solution consists of decentralizing the serialization algorithm and delegating to each element the task of serializing itself. Since all the objects are (indirect) instances of Element, and since the algorithm is symmetric, the decentralized algorithm can be defined in Element. Therefore all the objects of the document graph will execute the same algorithm. An improved version of the serialize method of Element can be written as follows:

serialize(){
    print( this.getOpeningTag() );
    if( this.isComposite() ){
        Element child = this.firstChild();
        while( child != null ){
            child.serialize();
            child = child.nextSibling();
        }
    }
    println( this.getClosingTag() );
}

The structure of the method is practically the same as the serialize_all() method. The signature of the method has changed since the previous argument has become implicit, i.e. it is the object on which the method is invoked (this).

While two elements (Text and Head) are simple elements, the others can contain other elements. This common feature can be captured by introducing a new class, Composite.

Decision point

We want to define a better serialization procedure that does not violate the information hiding principle.

Decision point

Each composite object can contain elements, therefore we add an association from Composite to Element. But a composite can be contained within other composites, i.e. it behaves like an element, therefore class Composite inherits from class Element. The new classes are shown in Figure 8.10.

There are several advantages with this solution. The atomic elements and the composite elements are clearly distinguished in the class diagram. The class structure is less complex and easier to understand than the previous one. Finally, the methods to compose elements and serialize composite elements are written once in Composite.

Other elements can be added to the class structure easily, e.g. the elements Ul and Li, representing a bulleted list and the items of a list respectively.

The main difference between this design solution and the previous design class diagram is that, based on the semantics of inheritance and aggregation, any composite object can (in principle) contain any other composite or component object. Consequently, containment constraints need to be graphically documented by means of notes attached to the classes in the class diagram.
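The consequence can be observed directly: with a single `addChild(Element)` method inherited from Composite, the compiler accepts any combination. The sketch below illustrates this and shows one possible way, which is an assumption of this example and not part of the prototype, of checking a constraint at run time through a hypothetical `accepts` hook. The rule used for `Body` (no Page and no Head inside it) is likewise assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: containment constraints are not enforced by the types.
class ContainmentSketch {
    static abstract class Element {}

    static abstract class Composite extends Element {
        private final List<Element> children = new ArrayList<>();

        // Hypothetical hook: a subclass states which elements it accepts.
        protected boolean accepts(Element candidate) { return true; }

        final void addChild(Element candidate) {
            if (!accepts(candidate)) {
                throw new IllegalArgumentException(
                        getClass().getSimpleName() + " cannot contain "
                        + candidate.getClass().getSimpleName());
            }
            children.add(candidate);
        }

        int childCount() { return children.size(); }
    }

    static class Head extends Element {}
    static class Text extends Element {}
    static class Page extends Composite {}
    static class Heading extends Composite {}   // no check: accepts anything

    static class Body extends Composite {
        // Assumed rule for this sketch only.
        @Override protected boolean accepts(Element candidate) {
            return !(candidate instanceof Page) && !(candidate instanceof Head);
        }
    }

    // Returns true if the element was added, false if it was rejected.
    static boolean tryAdd(Composite container, Element candidate) {
        try {
            container.addChild(candidate);
            return true;
        } catch (IllegalArgumentException rejected) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryAdd(new Heading(), new Page())); // compiles and succeeds
        System.out.println(tryAdd(new Body(), new Head()));    // rejected at run time
        System.out.println(tryAdd(new Body(), new Text()));    // accepted
    }
}
```

Such a check moves the error from the diagram notes to run time; it does not restore the compile-time guarantees of the earlier, more rigid class diagram.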

8.4.3  Implementation

Here we present the implementation of the classes that have been defined for this prototype. All the classes that implement the HTML object model are in the package HtmlDOM:

package HtmlDOM;

Figure 8.10 Refined design class diagram (classes Element, Composite, Text, Head, Li, Page, Body, Heading, Link and Ul; Composite has an ordered one-to-many association children with Element; attached notes: “can contain one Head and one Body”, “cannot contain Page or Head”, “cannot contain Page, Head, or Body”)

The central class in our design is Element, which represents the generic element of an HTML page. It offers methods for iteration and serialization: getOpeningTag(), getClosingTag() and serialize(). This class, together with Composite, forms an instance of the Composite pattern. The attribute info contains generic information, whose specific semantics are defined in each concrete derived class. The attribute tag contains the name of the tag (e.g. BODY, H1).

The one-to-many association children between Composite and Element is implemented by decomposing it into two associations: a child one-to-one association from Composite to Element and a sibling recursive association on Element (see idiom One-to-many association decomposition in Chapter 4 (p. 84)). The latter is implemented by attribute sibling.

import java.io.PrintWriter;

public abstract class Element {
    protected String info;   // generic information; its meaning depends on the subclass
    protected String tag;    // name of the tag, e.g. BODY or H1
    Element sibling;         // next element inside the same container

    Element(String p_info, String p_tag){
        info = p_info;
        tag = p_tag;
    }

    public String getOpeningTag() { return "<" + tag + ">"; }
    public String getClosingTag() { return ""; }

    // An atomic element has no children: only the opening tag is printed.
    public void serialize(PrintWriter os){
        os.print(getOpeningTag());
        os.print(" ");
    }

    public boolean isComposite() { return false; }
    public boolean hasNext() { return sibling != null; }
    public Element nextSibling() { return sibling; }
    void addSibling(Element newSibling){
        sibling = newSibling;
    }
}

The other class that forms the core of the Composite pattern is Composite, which represents the composite HTML element. This class redefines the isComposite() method to always return true. It also redefines the serialize() method to iterate over the components. Two new methods are added: addChild() to add a new component and firstChild() to navigate the first semi-association resulting from the split of the children association.

import java.io.PrintWriter;

public abstract class Composite extends Element {
    Element child;   // first contained element

    Composite(String info, String tag){
        super(info, tag);
    }

    public boolean isComposite() { return true; }
    public Element firstChild() { return child; }

    // Appends the new component at the end of the sibling chain.
    public void addChild(Element newChild) {
        if(child == null)
            child = newChild;
        else {
            Element currentChild = child;
            while(currentChild.hasNext())
                currentChild = currentChild.nextSibling();
            currentChild.addSibling(newChild);
        }
    }

    public String getClosingTag() { return "</" + tag + ">"; }

    // Prints the opening tag, every contained element in order, then the closing tag.
    public void serialize(PrintWriter os){
        os.print(getOpeningTag());
        Element currentChild = firstChild();
        while(currentChild != null){
            currentChild.serialize(os);
            currentChild = currentChild.nextSibling();
        }
        os.println(getClosingTag());
    }
}

The remaining classes extend either Element or Composite. They represent concrete elements that appear in HTML pages.

public class Br extends Element {
    public Br(){
        super(null, "BR");
    }
}

public class Link extends Composite {
    public Link(String href){
        super(href,"A");
    }

    public String getOpeningTag() {
        return "<" + tag + " href=\"" + info + "\" >";
    }
}

public class Ul extends Composite {
    public Ul(){
        super(null,"UL");
    }
}

public class Head extends Element {
    public Head(String title){
        super(title, "HEAD");
    }

    public String getOpeningTag() {
        return "<head><title>" + info + "</title></head>";
    }
    public String getClosingTag() { return ""; }
}

public class Heading extends Composite {
    public Heading(){
        super(null,"H1");
    }
}

public class Li extends Composite {
    public Li(){
        super(null,"LI");
    }
}

public class Body extends Composite {
    public Body(String bgcolor){
        super(bgcolor,"BODY");
    }

    public String getOpeningTag() {
        return "<" + tag + " bgcolor=\"" + info + "\" >";
    }
}

public class Text extends Element {
    public Text(String text){
        super(text,null);
    }

    public String getOpeningTag() { return info; }
}

public class Font extends Composite {
    public Font(String size, String color, String face){
        super(null,"FONT");
        String param = "";
        if(size != null)
            param = param + " size=\"" + size + "\"";
        if(color != null)
            param = param + " color=\"" + color + "\"";
        if(face != null)
            param = param + " face=\"" + face + "\"";
        info = param;
    }

    public String getOpeningTag() { return "<" + tag + info + " >"; }
}

public class Page extends Composite {
    public Page(){
        super(null, "HTML");
    }
}
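As a usage sketch, the sample page of Figure 8.6 can be built with these classes and serialized into a string. To keep the example self-contained, the block repeats condensed copies of the classes as nested classes of a hypothetical wrapper, `HtmlDomUsageSketch`; the helper `samplePageAsString` and the use of `java.io.StringWriter` as the output target are also assumptions of this example.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical wrapper holding condensed copies of the object model classes.
class HtmlDomUsageSketch {
    static abstract class Element {
        protected String info;
        protected String tag;
        Element sibling;
        Element(String info, String tag) { this.info = info; this.tag = tag; }
        public String getOpeningTag() { return "<" + tag + ">"; }
        public String getClosingTag() { return ""; }
        public void serialize(PrintWriter os) { os.print(getOpeningTag()); os.print(" "); }
        public boolean hasNext() { return sibling != null; }
        public Element nextSibling() { return sibling; }
        void addSibling(Element newSibling) { sibling = newSibling; }
    }

    static abstract class Composite extends Element {
        Element child;
        Composite(String info, String tag) { super(info, tag); }
        public Element firstChild() { return child; }
        public void addChild(Element newChild) {
            if (child == null) { child = newChild; return; }
            Element current = child;
            while (current.hasNext()) current = current.nextSibling();
            current.addSibling(newChild);
        }
        @Override public String getClosingTag() { return "</" + tag + ">"; }
        @Override public void serialize(PrintWriter os) {
            os.print(getOpeningTag());
            for (Element c = firstChild(); c != null; c = c.nextSibling()) c.serialize(os);
            os.println(getClosingTag());
        }
    }

    static class Text extends Element {
        Text(String text) { super(text, null); }
        @Override public String getOpeningTag() { return info; }
    }
    static class Head extends Element {
        Head(String title) { super(title, "HEAD"); }
        @Override public String getOpeningTag() { return "<head><title>" + info + "</title></head>"; }
    }
    static class Heading extends Composite { Heading() { super(null, "H1"); } }
    static class Body extends Composite {
        Body(String bgcolor) { super(bgcolor, "BODY"); }
        @Override public String getOpeningTag() { return "<" + tag + " bgcolor=\"" + info + "\" >"; }
    }
    static class Link extends Composite {
        Link(String href) { super(href, "A"); }
        @Override public String getOpeningTag() { return "<" + tag + " href=\"" + info + "\" >"; }
    }
    static class Page extends Composite { Page() { super(null, "HTML"); } }

    // Builds the sample page and returns its serialized form.
    static String samplePageAsString() {
        Page page = new Page();
        page.addChild(new Head("Test"));
        Body body = new Body("white");
        page.addChild(body);
        Heading heading = new Heading();
        heading.addChild(new Text("This is a test"));
        body.addChild(heading);
        body.addChild(new Text("This is plain text"));
        Link link = new Link("http://www.w3c.org");
        link.addChild(new Text("Click here..."));
        body.addChild(link);

        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        page.serialize(out);
        out.flush();                      // make sure everything reaches the buffer
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(samplePageAsString());
    }
}
```

Because composites close with println and atomic elements append a blank, the output is valid HTML but its whitespace differs from the layout of Figure 8.6; checks on the result are therefore best made on substrings.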

8.4.4  Test

To test the intermediate representation we have to check the correctness of both the internal static representation and the serialized form. The simplified test class TestDOM performs these two checks for the specific case of the page described in Figure 8.6. The test method testStaticStructure() checks if the structure of the internal representation conforms to the structure presented in Figure 8.7. The test method testSerialize() checks if the result of the serialization is as expected; it verifies the presence of given substrings in the output (as done in Chapter 6 for the calculator).

import junit.framework.TestCase;

public class TestDOM extends TestCase {
    public void testStaticStructure() {
        Page P = new Page();
        Head T = new Head("Test");
        Body B = new Body("white");
        P.addChild(T);
        P.addChild(B);
        Heading H = new Heading();
        Text T1 = new Text("This is a test");
        H.addChild(T1);
        B.addChild(H);
        Text T2 = new Text("This is plain text");
        B.addChild(T2);
        Link L = new Link("http://www.w3.org");
        Text T3 = new Text("Click here...");
        L.addChild(T3);
        B.addChild(L);
        assertEquals(T,P.firstChild());
        assertEquals(B,P.firstChild().nextSibling());
        assertEquals(H,B.firstChild());
        assertEquals(T1,H.firstChild());
    }

    public void testSerialize() {
        Page P = new Page();
        Head T = new Head("Test");
        Body B = new Body("white");
        P.addChild(T);
        P.addChild(B);
        Heading H = new Heading();
        Text T1 = new Text("This is a test");
        H.addChild(T1);
        B.addChild(H);
        Text T2 = new Text("This is plain text");
        B.addChild(T2);
        Link L = new Link("http://www.w3c.org");
        Text T3 = new Text("Click here...");