XML – A Mark-up Language - A Model-Driven Architecture for Enterprise Document Management, Sup

5. Implementation

5.1 XML – A Mark-up Language

The Extensible Mark-up Language (XML) has been the subject of much on-going development, publicity and discussion in recent years. As a mark-up language, it represents a small subset of the Standard Generalised Mark-up Language (SGML), which became an ISO standard in 1986. Whereas SGML tools have been slow to emerge and prohibitively expensive over the past 13 years, support for XML has grown rapidly. A number of active development programmes and freely available XML tools have improved its technical and commercial viability. Importantly, XML was released at a time when many organisations were recognising that HTML could not provide a solution to structured information distribution over the World Wide Web. Since the inception of the WWW in 1990, HTML has been its enforced document mark-up language. HTML offers a simple, presentation-driven notation for marking up document content for display in a client browser. The simplicity of HTML has been a significant factor in the growth of the web to this point, but its limitations have hindered the use of the web for larger scale network-based information systems. [Bosak97] explains how HTML is failing in a number of important ways:

• Extensibility: HTML does not allow users to specify their own tags or attributes in order to parameterise or otherwise semantically qualify their data.

• Structure: HTML does not support the specification of complex data structures needed to represent database schemas or object-oriented hierarchies.

• Validation: HTML does not support the kind of language specification that allows consuming applications to check data for structural validity on importation.
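The three points above can be illustrated with a short sketch. It uses Python's standard `xml.dom.minidom` module purely for illustration (it is not the parser used in this work), and the `<addressbook>` element names, the `REQUIRED_CHILDREN` rule and the `check_structure` helper are all hypothetical. The hand-written check only stands in for the kind of structural validation that a formal language specification would provide.

```python
from xml.dom.minidom import parseString

# Hypothetical address-book instance: the tag names are chosen by the
# document author (extensibility) and nest to form a record (structure).
ADDRESSBOOK_XML = """<addressbook>
  <person id="p1">
    <name>Ann Example</name>
    <email>ann@example.org</email>
  </person>
</addressbook>"""

# Assumed rule, standing in for a formal document type definition.
REQUIRED_CHILDREN = ("name", "email")


def check_structure(xml_text):
    """Return a list of structural problems found in an address book."""
    doc = parseString(xml_text)  # raises an error if not well-formed
    problems = []
    for person in doc.getElementsByTagName("person"):
        # Collect the element children, ignoring whitespace text nodes.
        present = {child.tagName for child in person.childNodes
                   if child.nodeType == child.ELEMENT_NODE}
        for required in REQUIRED_CHILDREN:
            if required not in present:
                problems.append(
                    f"person {person.getAttribute('id')}: missing <{required}>")
    return problems


print(check_structure(ADDRESSBOOK_XML))
```

A consuming application could run such a check on importation and reject instances for which the returned list is not empty.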

A number of current web-based information systems recognise that semantic information is lost when structured information is translated into web-friendly HTML, and attempt to circumvent that loss (e.g. MetaCrawler¹⁷ and RoboShopper¹⁸). Their solutions lie in a painstaking extraction (‘mining’) of relevant information from HTML files.

In addition to the computational load of extracting such information, this process has one obvious drawback, especially within the image-conscious world of web page design. Web site designers and owners change the design of their sites and pages frequently, and HTML structures change accordingly. Unless the ‘mining’ services are aware of upcoming changes to particular HTML formats, such methods immediately lose access to relevant information. XML aims to simplify the automation of such document mining processes by retaining the intended data structures for a particular application area.

The ideas behind XML are not new, but widespread XML tool support and international interest have raised its profile far beyond that of other industrial-strength mark-up languages. As a language in itself, XML has expressive power but no processing ability. Its supporting standards - the Document Object Model (DOM), the Extensible Stylesheet Language (XSL) and others - facilitate manipulation and presentation of XML instances for an end-use. XML provides the rules to which its applications must adhere; HTML is one such application¹⁹. This thesis describes the way in which XML assists in the creation of a web-focussed MRA implementation, and so necessarily concentrates on the relationship between HTML and other XML applications.
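The fragility of HTML mining, compared with reading a structured XML instance, can be shown with a minimal sketch. The two HTML fragments, the `<product>` instance and both helper functions are invented for illustration and are not taken from any of the services mentioned above.

```python
import re
from xml.dom.minidom import parseString

# Two hypothetical HTML renderings of the same price, before and after
# a site redesign.
HTML_OLD = "<table><tr><td>Widget</td><td><b>9.99</b></td></tr></table>"
HTML_NEW = "<div class='item'>Widget <span class='cost'>9.99</span></div>"

# Hypothetical XML instance that keeps the intended data structure.
PRODUCT_XML = ("<product><name>Widget</name>"
               "<price currency='GBP'>9.99</price></product>")


def mine_price_from_html(html):
    """Layout-dependent extraction: assumes a bold table cell holds the price."""
    match = re.search(r"<td><b>([\d.]+)</b></td>", html)
    return match.group(1) if match else None


def read_price_from_xml(xml_text):
    """Structure-dependent extraction: asks for the <price> element by name."""
    doc = parseString(xml_text)
    price = doc.getElementsByTagName("price")[0]
    return price.firstChild.data
```

Under these assumptions the mining function finds the price in `HTML_OLD` but returns `None` for `HTML_NEW`, whereas the XML reader is unaffected by how the information is later presented.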

The intention of XML (as it was with SGML) is to maintain a layer of separation between a document’s content, its structure and its eventual presentation. Through this separation an automated system has access to the underlying structure and content of an XML instance, while a stylesheet, coded in XSL, specifies the way in which such an instance may be displayed to an end-user for human inspection and interaction. A detailed description of the purpose and syntax of XML and XSL lies outside the scope of this thesis²⁰, but an example is given in Appendix B to support the discussion in the rest of this thesis. The example explains how an XML file that retains the semantics of the data it holds can be transformed into an HTML representation for display in a web browser. Figure 5-1 provides a simplified overview of the transformation process from XML to web-readable HTML.
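The shape of this transformation can be sketched in a few lines. The sketch below is not XSL: it uses Python's standard `xml.etree.ElementTree` module as a stand-in for a stylesheet processor, and the `<addressbook>` instance and the `addressbook_to_html` function are hypothetical. It only illustrates the idea that presentation mark-up is derived from, and kept separate from, the structured source.

```python
import xml.etree.ElementTree as ET

# Hypothetical source instance whose element names describe the data.
ADDRESSBOOK_XML = """<addressbook>
  <person><name>Ann Example</name><email>ann@example.org</email></person>
  <person><name>Bob Sample</name><email>bob@example.org</email></person>
</addressbook>"""


def addressbook_to_html(xml_text):
    """Turn each <person> into one row of an HTML table."""
    source = ET.fromstring(xml_text)
    html = ET.Element("html")
    table = ET.SubElement(ET.SubElement(html, "body"), "table")
    for person in source.findall("person"):
        row = ET.SubElement(table, "tr")
        for field in ("name", "email"):
            cell = ET.SubElement(row, "td")
            cell.text = person.findtext(field)
    # Serialise the new tree as HTML; the source tree is left untouched.
    return ET.tostring(html, encoding="unicode", method="html")


print(addressbook_to_html(ADDRESSBOOK_XML))
```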

17 http://www.metacrawler.com/ (Last verified 8th March 2000)

18 http://www.roboshopper.com/ (Last verified 8th March 2000)

19 HTML is strictly an SGML application, but it could be easily re-written as an application of XML

20 Many XML and XSL tutorials can be found on-line - http://www.w3c.org (Last verified 8th March 2000) provides a good starting point

Importantly, the transformation from semantically rich XML to presentation-oriented HTML is performed at the very last moment, retaining the semantics until the user wishes to view the information. While the XML representation assists computational extraction and manipulation, a transformation to HTML ensures that the information can be viewed in a browser that does not hold an explicit understanding of how to render XML documents. The current generation of web browsers, however, understands and supports the separation of XML and XSL, and is able to present the XML file according to the XSL file, while retaining the XML file as its source. This increases the ability of the browser to support behaviours and functionality that can be embedded within the XSL stylesheets and applied to XML files at the point of presentation. This emerging ability presents the opportunity to serve native XML files to the browser and rely on the browser to apply the appropriate stylesheet as required. This has a number of advantages:

• Document mining and manipulation become simpler because the semantics of the document are retained.

• Functionality can now be divided between client and server, allowing a much greater degree of client-side manipulation of the data. The XSL syntax permits reordering and alternative views on the source information.

• ‘Publishing to the web’ becomes much simpler as the transformation from XML to HTML is automated and performed by the client on demand.

• A number of different presentation styles can be applied to the same XML document, providing dynamic views of the information stored.

• The transmission size of individual files is much reduced. Instead of passing a bulky HTML file from the server to the client, the server need only pass the XML file.

[Figure 5-1: The XML to HTML Transformation Process, showing an <addressbook> XML instance transformed into <html>]
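The advantage of applying several presentation styles to one source can be sketched as follows. Plain Python functions stand in for alternative XSL stylesheets here; the `<addressbook>` instance and the two view functions are hypothetical and are not part of the implementation described in this chapter.

```python
import xml.etree.ElementTree as ET
from html import escape

# A single hypothetical XML source shared by both views.
SOURCE = ET.fromstring(
    "<addressbook>"
    "<person><name>Bob Sample</name><email>bob@example.org</email></person>"
    "<person><name>Ann Example</name><email>ann@example.org</email></person>"
    "</addressbook>")


def names_view(book):
    """View 1: a bulleted list of names, reordered alphabetically."""
    names = sorted(p.findtext("name") for p in book.findall("person"))
    return "<ul>" + "".join(f"<li>{escape(n)}</li>" for n in names) + "</ul>"


def contacts_view(book):
    """View 2: name and e-mail pairs in a table, in source order."""
    rows = (f"<tr><td>{escape(p.findtext('name'))}</td>"
            f"<td>{escape(p.findtext('email'))}</td></tr>"
            for p in book.findall("person"))
    return "<table>" + "".join(rows) + "</table>"
```

Both functions read the same unmodified source, so a view can be added or changed without touching the stored information.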

The power of XML and its related standards is illustrated throughout this chapter - its use pervades the implementation discussed, at all levels of the MRA.

5.1.1 Application of XML to the MRA

The applications of XML fall into two broad categories: document mark-up and information interchange. Karl Branting states that effective document reuse requires access to the original intentions underlying the document [Branting96]. The MRA addresses this need by associating a document with the classification within which it is perceived to have value, and further by providing a document model-based approach to enable manipulation of the documents themselves. As a document mark-up language, XML facilitates the retention of more of the author’s original intentions by enabling the mark-up that best captures the authored meaning. A document’s ‘social type’ can be coded within the XML document, whereas its HTML equivalent is forced to degrade text descriptions to a single paragraph type. The development of document type definitions (DTDs) allows user communities to develop document styles that have common verifiable structures. It is this definition of styles through DTDs that permits the automated reuse of document resources whose structure is understood by the system.

All layers of the MRA are implemented as XML files and basic operations over those files. Some represent recognisable document structures for the resources held, whereas others represent more informational aspects of the architecture. Beyond its use as a document mark-up language that better retains a document’s semantics, XML can also be used as a basic, open information modelling language. Each of the entities within the MRA information model (Figure 4-9) is represented as an XML document instance with a well-defined and navigable structure. XML manipulation methods, provided through the Document Object Model, are almost identical for both structured document mark-up and information modelling applications. In its basic form XML is able to model many kinds of data structure easily, and the recent Document Content Descriptions (DCD) proposal submitted to W3C by Microsoft and IBM permits more enforceable data type definitions within an XML instance. The expressive power of XML and XML Data (which forms part of the DCD proposal) has proved sufficient for the needs of the demonstrator MRA implementation.
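The use of XML as an information modelling language can be sketched with a small example. The `<resource>` instance, its `datatype` attribute and the `load_entity` helper are invented for illustration; they follow neither the MRA information model nor the DCD syntax, and only indicate how declared data types might be enforced when an instance is read.

```python
from xml.dom.minidom import parseString

# Hypothetical entity instance; element and attribute names are invented
# and do not follow the MRA model or the DCD proposal.
ENTITY_XML = """<resource id="r42">
  <title datatype="string">Project plan</title>
  <version datatype="int">3</version>
</resource>"""

# Assumed mapping from declared data type to a Python converter.
CONVERTERS = {"string": str, "int": int}


def load_entity(xml_text):
    """Read an entity's fields into a dict, converting by declared data type."""
    root = parseString(xml_text).documentElement
    entity = {"id": root.getAttribute("id")}
    for child in root.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip whitespace text nodes between elements
        convert = CONVERTERS[child.getAttribute("datatype")]
        entity[child.tagName] = convert(child.firstChild.data)
    return entity
```

A value that does not match its declared type (for example a non-numeric `<version>`) makes the converter raise an error, which is the kind of enforcement that plain XML without type declarations cannot offer.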

5.1.2 XML versus X500

Virtual Science Park resource rooms are currently implemented using an underlying X500 directory. X500 is a standard for directories which defines an information model, a functional model, a namespace, an authentication framework and a directory access protocol (DAP) with which to provide directory services [Drew97]. X500 was deemed to be suitable for the VSP implementation because of its decentralised model of storage, powerful search facilities, global namespace and structured information model. At first it appeared that the X500 directory structure might be useful for the implementation of the Model-driven Reuse Architecture. After investigation, however, it was decided that XML and its associated standards could provide a similar list of advantages and provide further support for the implementation requirements. In particular,

• The X500 design emphasises a tree-like structure to the domain of the entities it stores. However, the nature of hypertext and cross-enterprise links forms a focus on the ‘interconnectedness of everything’ [Whalley93], which does not naturally map into a tree structure. A network structure better supports such a need and provides further support for incomplete and changing models. XML is more naturally able to support such a network structure with high connectivity and low dependency.

• Current work and opinion on structured documents is leading to a blurring of the distinction between database and document [Ressler95]. A person record, for example, can be considered as either a data record or a structured document. If a part-X500 implementation were undertaken, at some stage of development the designer would need to define which information types were to be regarded as data and which were to be regarded as documents. Person information, for example, may be regarded as data to be navigated and searched, but could also be regarded as a document to be viewed. If the person object were represented as an X500 model it would require explicit transformation on the server for viewing, but if stored as an XML file, that file could be navigated and searched using the Document Object Model and viewed directly in a web browser using an XSL stylesheet.
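The first point above, that XML can carry a network of links on top of its element nesting, can be sketched as follows. The `id` and `ref` attribute names, the `<entries>` instance and the `build_links` helper are assumptions made for illustration and are not the linking scheme used in the MRA implementation.

```python
from xml.dom.minidom import parseString

# Hypothetical instance: the element nesting is a tree, but the "ref"
# attributes link entries to one another to form a network.
NETWORK_XML = """<entries>
  <person id="ann"><worksOn ref="plan"/><knows ref="bob"/></person>
  <person id="bob"><worksOn ref="plan"/></person>
  <document id="plan"><author ref="ann"/></document>
</entries>"""


def build_links(xml_text):
    """Map each entry id to the sorted ids of the entries it refers to."""
    root = parseString(xml_text).documentElement
    links = {}
    for entry in root.childNodes:
        if entry.nodeType != entry.ELEMENT_NODE:
            continue  # skip whitespace text nodes
        targets = {link.getAttribute("ref")
                   for link in entry.childNodes
                   if link.nodeType == link.ELEMENT_NODE}
        links[entry.getAttribute("id")] = sorted(targets)
    return links
```

In this instance `ann` refers to `plan` and `plan` refers back to `ann`, a cycle that a strict tree of entries could not express directly.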

For these reasons, and to investigate the expressive and functional power of XML and its tools, it was decided to design an entirely XML-driven implementation.

5.1.3 The Document Object Model

The Document Object Model (DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents²¹. The DOM provides a standard set of objects for representing XML documents as tree structures, a standard model of how these objects can be combined, and a standard interface for accessing and manipulating them. Its goal is to define a programming interface that operates over an XML instance regardless of the low-level parser used. The DOM represents a document as a hierarchy of ‘node’ objects, each of which contains properties that identify its type, content and attributes. The hierarchy of nodes can be navigated using straightforward DOM-specified API calls such as parentNode, childNodes and nextSibling. Each of these returns the node (or list of nodes) related to the current node in the specified way, so that a program can move step by step through the tree. The API also provides a number of methods by which to create new nodes and place them relative to the current node - insertBefore, replaceChild, appendChild, etc.
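The navigation and construction calls named above can be tried out with any DOM implementation. The sketch below uses Python's standard `xml.dom.minidom` module rather than the parser selected for the MRA implementation, and the `<addressbook>` content is hypothetical.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<addressbook><person>Ann</person><person>Cy</person></addressbook>")
root = doc.documentElement

# Navigation: each property returns a node related to the current one.
first = root.childNodes[0]         # first <person> (Ann)
second = first.nextSibling         # second <person> (Cy)
assert second.parentNode is root   # and back up to the parent

# Construction: create a new node and place it before an existing one.
bob = doc.createElement("person")
bob.appendChild(doc.createTextNode("Bob"))
root.insertBefore(bob, second)     # order is now Ann, Bob, Cy

# Replacement: swap an existing child for a newly created node.
dee = doc.createElement("person")
dee.appendChild(doc.createTextNode("Dee"))
root.replaceChild(dee, second)     # order is now Ann, Bob, Dee

names = [person.firstChild.data for person in root.childNodes]
print(names)  # ['Ann', 'Bob', 'Dee']
```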

21 Source: http://www.w3.org/DOM/ (Last verified 8th March 2000)

The DOM upholds the notions of a document object, a single node within that document, lists of nodes and node attributes - each can be accessed through a different API call.
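These four notions can be seen side by side in a minimal sketch, again using Python's standard `xml.dom.minidom` module as an assumed stand-in for any DOM-compliant parser. The instance and the `summary` dictionary are hypothetical.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<addressbook><person id="p1" role="author">Ann</person></addressbook>')

people = doc.getElementsByTagName("person")  # a list of nodes
person = people[0]                           # a single node
attrs = person.attributes                    # that node's attributes

summary = {
    "document_node": doc.nodeType == doc.DOCUMENT_NODE,
    "node_count": people.length,
    "node_name": person.nodeName,
    "attributes": {attrs.item(i).name: attrs.item(i).value
                   for i in range(attrs.length)},
}
print(summary)
```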

The Document Object Model became a W3C Recommendation in October 1998. Although simple in its aims and recommendations, it was seen as vital to provide a standard to which document parser implementations should adhere. A number of DOM-compliant toolkit implementations are currently available free to developers and users. Microsoft’s latest XML parser incorporates the DOM and was selected as the parser for the MRA implementation.

In document A Model-Driven Architecture for Enterprise Document Management, Supporting Discovery and Reuse (Page 89-94)