Semantic Documents Modeling
3.1 Semantic Documents Design Principles
So far, the term semantic document has referred mainly to documents annotated with concepts from ontologies [40]. The process of annotating documents with ontological concepts is also known as semantic document annotation. The ontological concepts that annotate a document, together with the underlying ontologies that define them, are intended to enable intelligent software agents to search documents in a more meaningful way than traditional content-based search allows, and to discover the desired information or knowledge. Over the past few years, a considerable number of ontology-based semantic annotation approaches have been developed [44, 75, 110, 15, 9]. While differing in many aspects, they all attempt to enhance documents by adding an additional semantic layer containing conceptualized semantic descriptions (i.e., ontological concepts) that refer to the actual documents.
In my vision, the term semantic document does not denote a semantically annotated electronic document, but rather a new document form in which the semantic layer is integrated into the document representation. In other words, the semantic layer is not considered an additional layer on top of the document representation, but an integral part of it. In the rest of this section I describe a set of design principles that I have identified as fundamental for semantic documents.
Document data granularity: Semantic documents are composite information resources, composed of a number of smaller resources called document units. Each document unit is characterized by its binary data and the information provided by that data. There are two types of document units: atomic document units and composite document units. Each atomic document unit has one content stream, which connects the binary data (e.g., text, image, audio, or video) to the document unit. Depending on the concrete implementation of a semantic document, the binary data of document units can be kept in the actual representation of the document or stored in an external binary data repository. Composite document units aggregate a number of atomic or other composite units and add navigation to them.
Document data identity: Semantic documents are identified by globally unique Uniform Resource Identifiers (URIs). Moreover, each document unit within a semantic document, whether atomic or composite, is identified by a globally unique URI.
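The granularity and identity principles can be illustrated with a small data-structure sketch. It is only one possible reading of the text, not the model defined in this thesis: the class names `AtomicUnit` and `CompositeUnit`, the `media_type` field, and the use of UUID URNs as globally unique URIs are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, List, Union
from uuid import uuid4


def new_uri() -> str:
    """Return a globally unique URI (a UUID URN, chosen here only for illustration)."""
    return uuid4().urn


@dataclass
class AtomicUnit:
    """A document unit with exactly one content stream."""
    content: bytes       # binary data; could instead reference an external repository
    media_type: str      # hypothetical field, e.g. "text/plain" or "image/png"
    uri: str = field(default_factory=new_uri)


@dataclass
class CompositeUnit:
    """A document unit that aggregates atomic or other composite units."""
    children: List[Union[AtomicUnit, CompositeUnit]] = field(default_factory=list)
    uri: str = field(default_factory=new_uri)

    def atomic_units(self) -> Iterator[AtomicUnit]:
        """Navigate the aggregation depth-first, yielding atomic units in order."""
        for child in self.children:
            if isinstance(child, AtomicUnit):
                yield child
            else:
                yield from child.atomic_units()


# A document is itself a composite unit: document -> section -> (paragraph, figure).
paragraph = AtomicUnit(b"Semantic documents are composite resources.", "text/plain")
figure = AtomicUnit(b"<binary image data>", "image/png")
section = CompositeUnit([paragraph, figure])
document = CompositeUnit([section])
```

Every unit, at every level of granularity, receives its own URI, so a paragraph or figure can be addressed independently of the document that contains it.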
Document data annotation: Semantic document annotations are entities characterized by an annotation identifier, an annotation type, and an annotation body. The annotation body is determined by the annotation type and can hold a data value or a reference to another entity. The annotation types on which I focus in this thesis are standardized metadata, semantic (ontological) annotations, and social-context annotations. All of these annotation types are considered in several sections of the thesis (Section 3.2.2, Section 4.2.4, Section 5.1.2). It is important to point out that a new annotation type can be introduced by anyone who provides the name of the annotation type and specifies the annotation body. Semantic document annotations can be added at all levels of document granularity: the whole document, composite document units, and atomic document units (Section 3.2.1).
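A minimal sketch of such annotation entities is given below. The names `Annotation`, `Reference`, `register_type`, and `annotate` are hypothetical, as are the type names used as dictionary keys and the particular body checks attached to the three annotation types; the text only states that the type determines the body and that new types can be introduced by naming them and specifying their body.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union
from uuid import uuid4


@dataclass(frozen=True)
class Reference:
    """Annotation body that refers to another entity by its URI."""
    target_uri: str


@dataclass(frozen=True)
class Annotation:
    identifier: str                  # annotation identifier
    type: str                        # annotation type
    body: Union[str, Reference]      # data value or reference to another entity
    annotated_uri: str               # URI of the annotated document or document unit


# Each annotation type specifies which bodies it accepts (assumed rules, for illustration).
BodyCheck = Callable[[object], bool]
ANNOTATION_TYPES: Dict[str, BodyCheck] = {
    "standardized-metadata": lambda body: isinstance(body, str),
    "semantic": lambda body: isinstance(body, Reference),
    "social-context": lambda body: isinstance(body, (str, Reference)),
}


def register_type(name: str, body_check: BodyCheck) -> None:
    """Introduce a new annotation type by giving its name and its body specification."""
    ANNOTATION_TYPES[name] = body_check


def annotate(annotated_uri: str, type_name: str, body: Union[str, Reference]) -> Annotation:
    """Create an annotation after checking the body against its type."""
    if type_name not in ANNOTATION_TYPES:
        raise KeyError(f"unknown annotation type: {type_name}")
    if not ANNOTATION_TYPES[type_name](body):
        raise ValueError(f"body {body!r} is not valid for type {type_name}")
    return Annotation(uuid4().urn, type_name, body, annotated_uri)
```

Because `annotated_uri` can be the URI of a whole document, a composite unit, or an atomic unit, the same structure covers all levels of granularity.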
Document data linking: There are two main types of links that can be established between semantic document units: structural links and semantic links. Structural links are used to build the logical structure of a semantic document. Semantic links are used to explicitly represent the semantic relations between document units. Both structural and semantic links can be established between document units that belong to the same or to different semantic documents. However, only document units connected by structural links are considered to belong to the same document. This means that a document unit connected by structural links to document units of several different documents belongs to each of those documents. Accordingly, document units can easily be added to or removed from a semantic document, simply by linking them to or unlinking them from the document. Document units can have an arbitrary number of semantic links connecting them to other document units. In principle, if two document units are annotated by the same semantic annotations (ontological concepts), then the two units share some semantics, and one or more semantic relations can be identified between them. Semantic links can then be established between these document units. Finally, semantic document units can be linked with any other uniquely identified data (resources) on the Semantic Web. Considering that semantic document units are encapsulated fractions of document data, this linking fulfills one of the main principles of the Semantic Web, that is, the transition from linking documents to linking data [12].
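The membership rule described above (only structural links determine which units belong to a document) can be sketched as follows. The `LinkStore` class, its method names, and the short URIs such as `doc:A` are hypothetical and serve only to illustrate the rule.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class LinkStore:
    """Keeps structural and semantic links between URIs in separate collections."""

    def __init__(self) -> None:
        # parent URI -> URIs of the units linked structurally below it
        self.structural: Dict[str, Set[str]] = defaultdict(set)
        # (source URI, relation name, target URI)
        self.semantic: Set[Tuple[str, str, str]] = set()

    def link_structural(self, parent: str, child: str) -> None:
        self.structural[parent].add(child)

    def unlink_structural(self, parent: str, child: str) -> None:
        self.structural[parent].discard(child)

    def link_semantic(self, source: str, relation: str, target: str) -> None:
        self.semantic.add((source, relation, target))

    def units_of(self, document_uri: str) -> Set[str]:
        """Return all units reachable from the document via structural links only."""
        seen: Set[str] = set()
        stack = [document_uri]
        while stack:
            node = stack.pop()
            for child in self.structural.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


store = LinkStore()
store.link_structural("doc:A", "unit:1")
store.link_structural("doc:A", "unit:2")
store.link_structural("doc:B", "unit:1")            # unit:1 now belongs to both documents
store.link_semantic("unit:1", "relatedTo", "unit:3")  # does not make unit:3 part of doc:A
```

Removing a unit from a document is then a single call to `unlink_structural`; semantic links to and from the unit are unaffected.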
Document data universality: Semantic documents are not owned by a single application. They are universal and platform- and tool-independent. They are transferable across platforms and are easy to port to and integrate with existing applications on any platform. Appropriate transformation functions that map conventional document formats to and from semantic documents can be provided. In that case, semantic documents can also serve as an intermediate step in the transformation of a document from one conventional, application-specific document format to another. In this way, the number of necessary transformations for $N$ target applications (document formats) is reduced from $N^2 - N$ to $2N$, where $N^2 - N$ is the number of all one-to-one transformations and $2N$ counts one transformation to and one from the semantic document for each format.
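The two counts can be compared with a short calculation. The function names below are illustrative only.

```python
def direct_converters(n: int) -> int:
    """Converters needed when every format maps directly to every other format."""
    return n * n - n


def hub_converters(n: int) -> int:
    """Converters needed when each format maps only to and from the semantic document."""
    return 2 * n


# Print the number of formats followed by both counts.
for n in (2, 3, 4, 10):
    print(n, direct_converters(n), hub_converters(n))
```

The two counts are equal at N = 3, so the intermediate format reduces the number of converters only for N > 3 (for example, 90 versus 20 at N = 10).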
Document data accessibility: Semantic documents are completely open and queryable. Humans and software agents can search semantic documents by searching document units based on their binary content or their conceptualized semantics. In search results, the retrieved document units are represented by their URIs. Applications such as semantic document browsers use the obtained document unit URIs to access the machine-processable (MP) representations of the document units and then render human-readable (HR) representations that can be perceived by a user. Moreover, semantic document browsers provide an exploratory interface through which the user can interact with the retrieved document units and can traverse semantic documents by navigating along the semantic links between document units.
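The search-then-render flow can be sketched in a few lines. This is a deliberately simplified illustration: `IndexedUnit`, the two search functions, and `render` are hypothetical names, content search is reduced to substring matching on text, and rendering is reduced to string formatting.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class IndexedUnit:
    uri: str
    text: str                   # content stream, decoded to text for searching
    concepts: FrozenSet[str]    # URIs of the ontological concepts annotating the unit


def search_by_content(units: Dict[str, IndexedUnit], term: str) -> List[str]:
    """Return the URIs of units whose content contains the term (case-insensitive)."""
    needle = term.lower()
    return sorted(uri for uri, unit in units.items() if needle in unit.text.lower())


def search_by_concept(units: Dict[str, IndexedUnit], concept_uri: str) -> List[str]:
    """Return the URIs of units annotated with the given ontological concept."""
    return sorted(uri for uri, unit in units.items() if concept_uri in unit.concepts)


def render(units: Dict[str, IndexedUnit], uri: str) -> str:
    """Stand-in for a browser that dereferences a URI and renders an HR representation."""
    unit = units[uri]
    return f"{unit.text} [{unit.uri}]"


units = {
    u.uri: u
    for u in (
        IndexedUnit("unit:1", "Ontologies define concepts.", frozenset({"onto:Ontology"})),
        IndexedUnit("unit:2", "Agents search documents.", frozenset({"onto:Agent"})),
        IndexedUnit("unit:3", "An agent uses an ontology.",
                    frozenset({"onto:Agent", "onto:Ontology"})),
    )
}
```

Both search functions return URIs rather than content, which mirrors the principle that search results identify document units and leave rendering to the consuming application.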
Document data traceability: Over time, semantic documents change and evolve through a number of versions. In order to verify the evolution path of document units, each document unit has a version identifier in addition to its URI. Moreover, semantic documents provide mechanisms and structures for capturing and formally representing the changes that can be made to document units. Similar to semantic document annotations, document unit changes are represented as uniquely identified entities that are linked into the MP representation of the document units and are characterized by a set of specific properties that hold information about the changes. By using such formalized change representations, previous versions of document units can be rebuilt and redeployed. This is especially important for document revisions in which the previous versions no longer exist.
In accordance with the design principles listed above, I give the following definition of semantic documents:
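A minimal sketch of how formal change records allow earlier versions to be rebuilt is shown below. It is an illustration under simplifying assumptions, not the change model of this thesis: versions are plain integers, each hypothetical `Change` record stores the complete old and new content rather than a difference, and the names `VersionedUnit`, `update`, and `rebuild` are invented here.

```python
from dataclasses import dataclass, field
from typing import List
from uuid import uuid4


@dataclass(frozen=True)
class Change:
    """A uniquely identified change entity with properties describing the change."""
    change_uri: str
    unit_uri: str
    from_version: int
    to_version: int
    old_content: bytes
    new_content: bytes


@dataclass
class VersionedUnit:
    uri: str
    content: bytes
    version: int = 1
    changes: List[Change] = field(default_factory=list)

    def update(self, new_content: bytes) -> None:
        """Record the change, then move the unit to the next version."""
        self.changes.append(
            Change(uuid4().urn, self.uri, self.version, self.version + 1,
                   self.content, new_content)
        )
        self.content = new_content
        self.version += 1

    def rebuild(self, version: int) -> bytes:
        """Rebuild the content of an earlier version by undoing changes in reverse order."""
        if not 1 <= version <= self.version:
            raise ValueError(f"no such version: {version}")
        content = self.content
        for change in reversed(self.changes):
            if change.to_version <= version:
                break
            content = change.old_content
        return content


unit = VersionedUnit("urn:example:unit-1", b"first draft")
unit.update(b"second draft")
unit.update(b"final text")
```

Only the current content and the change records are kept, yet `unit.rebuild(1)` recovers the first draft, which is the property the traceability principle requires.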
Definition of Semantic Documents
Semantic documents are composite information resources composed of uniquely identified, semantically annotated, and semantically interlinked document data/information units of different granularity. Each semantic document is characterized by a unique, permanent machine-processable (MP) representation and a number of temporal human-readable (HR) representations rendered from the MP representation.
3.2 Semantic Document Model
Two categories of potential users, that is, humans and machines (i.e., software agents), determine two possible forms of semantic document representation. The HR representation uses conventional content types such as text, images, audio, and video to represent the information stored in a semantic document. The MP representation uses conceptualized semantics (i.e., ontological concepts) and semantic links between document units to represent document information in a conceptualized, machine-processable form.