4.3 Layered OCMS Framework
4.3.2 Content Layer
The content layer contains the content documents that are the subjects of semantic annotation [Kiryakov et al., 2004]. We define content as any digital information in a textual format, including structured or semi-structured documents, web pages, executable files, software help files, etc. An OCMS essentially deals with content in the form of books, web pages, blogs, newspapers, software products, documentation, help files, reports, publications, etc. [Gruhn et al., 1995] [Abgaz et al., 2010]. The content layer provides the following services.
Storage of documents. The content layer facilitates the storage of content. The storage
can be file-based or database storage. The content layer stores content as files in folders or as tables in databases, and the stored content is accessible from the web. In either case, the content layer provides permanent storage of the content.
Retrieval of the documents. Another service provided by the content layer is the retrieval
of documents. The retrieval service provided by the content layer is crucial in accessing a specific document.
Unique identifiers for content. All documents and parts of documents should be identified
uniquely. The unique identifiers serve as content identifiers that link the content with the ontology. Documents in file-based storage are identified by their path and file name, whereas documents in databases are identified by the database name, the table name and the primary key [Elmasri & Navathe, 2010]. Documents that are stored on the web can be accessed using the URI of the website combined with the file name. However, the detailed implementation is the decision of the content manager.
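The three identification schemes above can be sketched as follows. This is only an illustration under our own assumptions: the function names, the "/" separator used for database identifiers and the example paths are hypothetical, since the concrete scheme is left to the content manager.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin


def file_identifier(directory: str, file_name: str) -> str:
    """Identify a file-based document by its path and file name."""
    return str(PurePosixPath(directory) / file_name)


def database_identifier(database: str, table: str, primary_key: object) -> str:
    """Identify a stored document by database name, table name and primary key.

    The "/" separator is an assumption made for this sketch.
    """
    return f"{database}/{table}/{primary_key}"


def web_identifier(base_uri: str, file_name: str) -> str:
    """Identify a web document by the site URI combined with the file name."""
    # A trailing slash makes urljoin append the file name instead of
    # replacing the last path segment of the base URI.
    return urljoin(base_uri.rstrip("/") + "/", file_name)
```

For instance, `web_identifier("http://example.org/docs", "install.xml")` would yield a single URI that can be stored alongside an ontology element as its content identifier.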
Content in OCMS can be categorized as structured content, semi-structured content or unstructured content.
Structured content. Structured content is content which is well defined with respect
to some data-centric structure. A data-centric structure defines the content, or fragments of the content, as data elements with a schema describing the elements. The content gets its structure by explicitly tagging parts of the content according to the schema. A widely used format is XML. XML is supplemented by DTDs and XML Schema (Section 2.2) to provide further semantics about the data. In a structured document, it is possible to locate and retrieve a specific part of a document using the data elements. The other widely used format for structured content is databases. Relational databases store the content in the form of tables which are organized into columns and rows. The rows represent individual instances and the columns represent the attributes of the instances. The content, or part of the content, in databases is accessible using queries which extract specific rows and columns [Elmasri & Navathe, 2010].
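As a minimal sketch of both access paths, the example below retrieves one part of an XML document through its data elements and extracts specific rows and columns from a relational table. The element names, the `documents` table and its columns are assumptions made for illustration only, and the in-memory SQLite database merely stands in for whatever database the content layer uses.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hypothetical structured document used only for this sketch.
XML_DOC = """
<manual>
  <section id="install"><title>Installation</title><para>Run the installer.</para></section>
  <section id="usage"><title>Usage</title><para>Open the application.</para></section>
</manual>
"""


def xml_section_title(xml_text: str, section_id: str) -> Optional[str]:
    """Locate one section by its id attribute and return its title."""
    root = ET.fromstring(xml_text.strip())
    section = root.find(f".//section[@id='{section_id}']")
    if section is None:
        return None
    return section.findtext("title")


def titles_by_category(rows: List[Tuple[int, str, str]], category: str) -> List[str]:
    """Load rows into an in-memory table and select one column of matching rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, category TEXT)"
        )
        conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT title FROM documents WHERE category = ? ORDER BY id",
            (category,),
        )
        return [title for (title,) in cursor]
    finally:
        conn.close()
```

In both cases the explicit structure (element names and attributes, or columns and keys) is what makes a specific fragment addressable without scanning the whole document.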
In structured documents, parts of the content documents are interdependent through tags and attributes, which allows new content to be composed from the available snippets. Since the fragments of content are highly structured and identifiable using content identifiers, it is possible to combine different content fragments into one and present the result as a new content fragment. Such relations between content fragments can be identified using structures such as DocBook. DocBook defines the logical structure of a document in the form of XML, HTML, XHTML, etc. (Section 2.2).
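The composition of a new content fragment from identifiable snippets can be illustrated with a short sketch. The element names below are borrowed from DocBook for readability, but the documents are not validated DocBook, and the `compose` function is a hypothetical helper of our own rather than part of any DocBook tooling.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Two hypothetical source documents whose sections carry content identifiers.
SOURCE_A = (
    '<book><section id="intro"><para>Welcome.</para></section>'
    '<section id="setup"><para>Install it.</para></section></book>'
)
SOURCE_B = '<book><section id="faq"><para>Common questions.</para></section></book>'


def compose(fragment_refs: Iterable[Tuple[str, str]]) -> ET.Element:
    """Build a new <article> from (xml_text, section_id) pairs, in the given order."""
    article = ET.Element("article")
    for xml_text, section_id in fragment_refs:
        root = ET.fromstring(xml_text)
        for section in root.iter("section"):
            if section.get("id") == section_id:
                article.append(section)
                break
        else:
            # No section with the requested identifier exists in this source.
            raise KeyError(section_id)
    return article
```

Calling `compose([(SOURCE_A, "setup"), (SOURCE_B, "faq")])` returns an element containing the two selected sections, which can then be serialized with `ET.tostring` and stored as a new fragment.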
Semi-structured content. Semi-structured content is content which is organized using
a document-centric structure. A document-centric structure gives structure to the whole document or part of the document focusing on its presentation. In such documents, it is possible to access information based on the available structure, but it needs additional effort to locate and retrieve specific data elements. Content in an HTML file can be considered as semi-structured content which incorporates tags that give some structure to the presentation of the content.
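The additional effort needed for semi-structured content can be seen in the following sketch, which walks the presentation tags of an HTML page to recover its second-level headings. The class and function names are our own, and the choice of `h2` as the tag of interest is an assumption; HTML itself does not declare which tags carry the data elements.

```python
from html.parser import HTMLParser
from typing import List


class HeadingCollector(HTMLParser):
    """Collect the text of every element with a given presentation tag."""

    def __init__(self, tag: str = "h2") -> None:
        super().__init__()
        self.tag = tag
        self.in_heading = False
        self.parts: List[str] = []
        self.headings: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.in_heading = True
            self.parts = []

    def handle_data(self, data):
        # Text inside nested inline tags (for example <em>) is kept as well.
        if self.in_heading:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self.in_heading:
            self.headings.append("".join(self.parts).strip())
            self.in_heading = False


def extract_headings(html_text: str, tag: str = "h2") -> List[str]:
    """Return the text of all elements with the given tag, in document order."""
    collector = HeadingCollector(tag)
    collector.feed(html_text)
    collector.close()
    return collector.headings
```

Unlike the structured case, nothing in the document states that a heading delimits a data element; that interpretation has to be supplied by the code that processes the page.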
Unstructured content. Unstructured content refers to content which does not have any
structure defined for identifying components of the content. Unstructured content holds a series of texts where there is no associated structural information that gives the content a structure.
Our OCMS layer allows any kind of content to be annotated including unstructured content. However, for the purpose of this research, we focus only on content which is either structured or semi-structured.
4.3.2.1 Changes in the Content
Content in OCMS evolves continuously and frequently [Uren et al., 2006] [Adler et al., 2008] [Krotzsch et al., 2011]. The evolution may cause a change in the semantics or in the structure of the content. Changes that affect the structure also cause a change in the semantics. In a dynamic content management system, new documents are produced and existing ones are frequently modified or deleted to provide up-to-date information. The content layer allows changes ranging from the removal of a whole document to the modification of a single element in a document. The changes made in the content layer need to be available to the other layers to ensure the consistency of the OCMS [Javed et al., 2010].
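One simple way to make such changes available to the other layers is to compare two versions of a document element by element. The sketch below is an assumption of ours and not the change detection mechanism of the framework: it presumes that every tracked element carries an `id` attribute and reports which identifiers were added, removed or modified.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def index_by_id(xml_text: str) -> Dict[str, str]:
    """Map the id of each identified element to its serialized form."""
    root = ET.fromstring(xml_text)
    return {
        element.get("id"): ET.tostring(element, encoding="unicode").strip()
        for element in root.iter()
        if element.get("id") is not None
    }


def detect_changes(old_xml: str, new_xml: str) -> Dict[str, List[str]]:
    """Report element ids that were added, removed or modified between versions."""
    old, new = index_by_id(old_xml), index_by_id(new_xml)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Because each element is compared in its serialized form, a change in a nested element also marks every identified ancestor as modified, which matches the observation that structural changes propagate to the semantics of the enclosing content.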
We focus on structured and semi-structured content in the content layer. This is to avoid complications related to accessing and processing changes in unstructured documents. Structured and semi-structured content further allow us to easily identify evolving elements of the content and create a unique reference which can be used for later processing. Thus, in this research, we primarily focus on XML and HTML content documents. The content documents have associated URIs. In the case of XML documents, the different sections are identified by combining the URI with the element ID. In HTML and XHTML files, by contrast, we identify specific parts of the content with an offset showing the beginning and the end of the section relative to the document [Maynard, 2008].
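The two referencing schemes can be sketched as follows. How the URI and the element ID are combined is not fixed above; joining them with "#" in the style of a fragment identifier is our assumption, as are the function names and the `OffsetRef` type. The offsets are character positions relative to the start of the document text.

```python
from typing import NamedTuple


class OffsetRef(NamedTuple):
    """Reference to a part of an HTML document by start and end offsets."""
    uri: str
    start: int
    end: int


def xml_section_ref(document_uri: str, element_id: str) -> str:
    """Combine the document URI with an element ID (assumed "#" separator)."""
    return f"{document_uri}#{element_id}"


def html_section_ref(document_uri: str, html_text: str, section_text: str) -> OffsetRef:
    """Locate the first occurrence of a section and record where it begins and ends."""
    start = html_text.find(section_text)
    if start == -1:
        raise ValueError("section text not found in document")
    return OffsetRef(document_uri, start, start + len(section_text))


def resolve(ref: OffsetRef, html_text: str) -> str:
    """Return the part of the document that an offset reference points to."""
    return html_text[ref.start:ref.end]
```

An offset reference stays valid only while the text before the section is unchanged, so a change earlier in the document requires the offsets to be recomputed, whereas a reference based on an element ID survives such edits.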