Towards semantic interoperability of cultural
information systems — making ontologies work.
Magisterarbeit (Magister thesis) at the Faculty of Philosophy (Philosophische Fakultät), Universität zu Köln
Contents
1 Introduction
2 Establishing digital scholarship
2.1 Functional requirements of digital scholarship
2.2 Implementing digital scholarship
2.3 The interoperability challenge
3 A web of linked cultural heritage data
3.1 Conceptual and technical requirements
3.2 Identification and representation of resources
3.3 Semantic Web tools
4 Standards for semantic interoperability
4.1 Managing archaeological objects
4.2 Linking to bibliographic information
4.3 Linking to other forms of knowledge organisation
5 Dealing with heterogeneity
5.1 Levels of heterogeneity
5.2 Heterogeneity on the schema level
5.2.1 Uniform representation of data models
5.2.2 Mapping data models
5.3 Heterogeneity on the entity level
5.3.1 Data extraction and data quality problems
5.3.2 Entity Identification and record linkage
5.4 Implementing an overall mapping workflow
6 Knowledge visualization for the Semantic Web
6.1 Paradigms for visualizing linked data
6.2 Faceted browsing using Longwell
Chapter 1
Introduction
Recently, new terms have emerged to describe an IT and social infrastructure that should facilitate seamless digital scholarly work. It is usually referred to as Cyberinfrastructure, a term coined by the US National Science Foundation. Many attempts have been made to build such a Cyberinfrastructure. Most of them share the same objective with only slight variations: because existing data sources are fragmented and spread all over the world, some scientific questions cannot be answered today. The main objective therefore has to be to identify, describe and implement elements of an infrastructure that enable scholars to better exploit digital resources [28, 45, 59]. This infrastructure will provide unified access to data sources and offer services that add value to their underlying cultural heritage content.
One step towards an integrated Cyberinfrastructure for cultural heritage is to bring data objects together syntactically and to mediate semantically between different data models. Current research suggests establishing metadata harvesting in addition to crafting software agents that are aware of ontologies. Conceptual reference models like the CIDOC CRM help to mediate between different data models and provide a blueprint for building software that “understands” cultural heritage data [10]. But semantic integration (processing data meaningfully) and syntactic integration (bringing data to a common place) are just one step towards seamless interoperability of cultural heritage information systems. New ideas originating from Semantic Web research and well-established concepts from the world of (digital) libraries may contribute to a digital work environment for scientists.
However, many different information systems with different methodological approaches can be found in the field of historical cultural research today; each one is designed according to a specific scientific question and perspective, using specialized terminology and a certain national language. This could be seen as a rather productive situation, but the experience of using information systems for historical cultural research could be greatly enhanced by creating a common platform for information retrieval. In a joint effort, two parties from classics and archaeology intend to formulate a research program for achieving the goals mentioned above: the Perseus Project and Arachne [9, 8, 20]. The Perseus Project is a digital library currently hosted at Tufts University. It provides humanities resources in digital form with a focus on Classics, but also covers early modern and even more recent material. Arachne is the central database for archaeological objects of the German Archaeological Institute (DAI) and the Research Archive for Ancient Sculpture (FA) at the University of Cologne [11, 21]. DAI and FA joined their efforts in developing Arachne as a free tool for archaeological internet research.
The goal of this thesis is to document the course of this project, i.e. the efforts to gain first experience with building a system that syntactically and semantically integrates data in an international and therefore multilingual environment. It also reports on issues that were encountered during the project and reflects on possible ways to resolve them. It turns out that the greatest challenge is not the conceptual mapping of data models, but extracting data of appropriate quality and identifying multiple digital surrogates that refer to the same entity in a multilingual environment.
The project is designed to contribute to these efforts to establish a digital infrastructure for scientific research in the cultural heritage area. First, to place the project in its greater context, a model of digital scholarship is crafted and the functional requirements are discussed that will have to be implemented during software development. Second, the peculiarities of sharing data between Perseus and Arachne are introduced: how the collections complement each other and where the mutual benefits lie. Third, state-of-the-art concepts and tools are discussed that help with integrating heterogeneous data from multiple sources; most of them originate from current Semantic Web research. Fourth, the reader is given a closer look at standards for digitally representing cultural heritage data, commonly known as (Networked) Knowledge Organisation Systems and Services (NKOS).1 Within the context of this project, the CIDOC CRM was used as a common data model for sharing metadata. The main part discusses the forms of heterogeneity that were encountered while implementing a mapping workflow. The main issues are explained and possible ways to resolve them are suggested. Finally, paradigms for visually presenting integrated data objects to users are explored. Longwell, a Semantic Web browser, was used to index and display the data that was mapped to the CIDOC CRM.
With only one person doing the analysis of both data models, the mapping, and the implementation of fundamental software tools, the overall software architecture had to remain rather lean. Therefore, high-level programming languages were avoided and most of the presented workflow relies on shell scripts, most of them built from basic tools that come with the UNIX operating system, and on style-sheet processing for data extraction and mapping. The mapping results were documented in a simple text file that can be found in the Appendix and were implemented using regular expressions and XSLT style-sheets. Although the infrastructure discussed is suitable for all areas of scientific research, this thesis focuses on the cultural heritage area.
1 NKOS discusses the requirements for enabling knowledge organization systems as network
Chapter 2
Establishing digital scholarship
This section explores the intellectual requirements that facilitate a scholarly workflow as defined above. Although software development is a very young discipline compared to the study of ancient Greek and Roman texts, it has always aimed to facilitate tasks that a person needs to perform. Therefore, functional requirements are defined first, and software developers then build a specific software tool around these agreed and formulated requirements. This involves a lot of communication between experts and software developers before the first line of code can be written. In the larger context of this project, functional requirements can be deduced from traditional scientific workflows, especially within the subjects that deal with culture and history. In the following section, first, a model of digital scholarship is used to help identify and describe the tasks that could be supported by proper integration of cultural heritage information systems. Second, interoperability on the level of data objects presupposes that interoperability of ideas and concepts is also established on more abstract levels. Therefore, several related ideas are discussed and evaluated. Finally, these concepts are applied to the particular project that this thesis reports on.
2.1
Functional requirements of digital scholarship
Large amounts of cultural heritage information have already been migrated to various digital media in recent years. Additionally, the importance of peer-reviewed Open Access material is increasingly recognized within the scientific community. Consequently, a lot of work goes into reflecting on the architecture that currently facilitates scholarly communication and on how it could be transformed in reaction to the new opportunities that arise in a digital environment. One core argument is that new knowledge which has been discovered with the help of taxpayers’ money should not be given to large publishing houses so that libraries have to pay for it again while prices become prohibitive. Making scientific results available at every subsequent intellectual processing stage is a first important step. Adding digital services to the data that is publicly available on the World Wide Web would add even more value to the underlying content.
But which services do scientists need for research in the cultural heritage area? Figure 2.1 introduces a layered logical model of digital scholarship that transcends the components of the aforementioned Cyberinfrastructure. The uppermost layer suggests the need to distinguish objects of the perceptible world from their digital surrogates in one or more digital library collections. These surrogates consist of critical editions of ancient literary texts, archaeological surveys of individual sites or even catalogues of physical artifacts. Scientists create these surrogates as a result of their everyday work, by digitizing material that has been published in traditional form or, in the future, by publishing directly in digital form. Besides the primary sources, digital libraries should also host secondary sources like reference works that capture the results of a longer research process, as well as monographs and research papers that set out new and original ideas. In a digital library, secondary sources should be linked to primary research material to facilitate advanced services.
The model differentiates between three further layers. While pursuing scientific research, scholars need to refer to texts, parts of texts, archaeological objects, and abstract things. If a digital library provides a stable and unambiguous identifier for each relevant instance, a scientist could use this identifier to refer to, for example, an archaeological object. This reference would be more accurate than in traditional scholarly works. In combination with a resolving service, the identifier could be used to obtain one or more digital representations of, for example, a passage of an ancient text. A digital representation could be a scanned image or the result of OCR. This is reflected by the third layer of the model. However, in certain cases scholars do not want to refer to an instance, but to a specific entity, let's say to one of the smaller Alexandrias that were built to honor Alexander the Great, not to the one in Egypt. By grouping instances that have been identified as referring to the same entity in the perceptible world, scholars can refer not only to all digital surrogates that have been digitized so far but also to the one entity that they stand for.
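The interplay of stable identifiers, a resolving service and the grouping of instances under one entity can be illustrated with a minimal sketch. It is not taken from the project; the class names, the `urn:example:` identifiers and the file names are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One digital surrogate with a stable identifier (all identifiers are made up)."""
    identifier: str
    representations: dict = field(default_factory=dict)  # format -> location


class Resolver:
    """Toy resolving service: identifier -> representations, entity -> instances."""

    def __init__(self):
        self._instances = {}
        self._entities = {}  # entity id -> set of instance identifiers

    def register(self, instance, entity_id=None):
        """Store an instance and optionally group it under an entity."""
        self._instances[instance.identifier] = instance
        if entity_id is not None:
            self._entities.setdefault(entity_id, set()).add(instance.identifier)

    def resolve(self, identifier):
        """Return all known representations of one instance."""
        return dict(self._instances[identifier].representations)

    def instances_of(self, entity_id):
        """Return the identifiers of all surrogates grouped under one entity."""
        return sorted(self._entities.get(entity_id, set()))


resolver = Resolver()
resolver.register(
    Instance("urn:example:text:passage-1",
             {"image": "scan-0001.tiff", "ocr": "passage-1.txt"}),
    entity_id="urn:example:place:alexandria-eschate",
)
resolver.register(
    Instance("urn:example:object:coin-7", {"image": "coin-7.jpg"}),
    entity_id="urn:example:place:alexandria-eschate",
)
```

Resolving an instance identifier yields its representations, while asking for an entity yields every surrogate that has been grouped under it. A production service would additionally need persistence and versioning, which this sketch leaves out.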
The model therefore emphasizes two layers between surrogates of primary sources within the digital collection and secondary sources. Secondary sources refer both to named entities that are derived from grouping instances and to the instances themselves that represent the object in the “real” world. A long-term project objective will be to populate both layers with metadata about instances and entities that conform to the CIDOC CRM and other standards. The third layer represents the world of quotations and the fourth layer represents the world of authority documents. This view suggests that annotating objects together with referring to instances and entities are fundamental functions of digital scholarship, especially in the humanities, since arguments have to be connected to their evidence in primary sources. The latter is reflected by the bottom layer of the model [1].
After defining the functional requirements that could advance digital scientific research, software components have to be described as parts of a logical architecture that is able to provide services meeting these requirements. Snow et al. have described a layered logical model that assists archaeological research, consisting of storage management, a web service interface layer and portal software [56]. They also state that in the absence of a new generation of cybertools archaeological research will remain impoverished. Archaeology concentrates on exploring the evolution of culture, growth in population and the interaction of cultures. Research in these areas depends on finding meaningful links between different findings. This in turn depends on being able to access distributed data sources hosting heterogeneous data objects.
Because data is nowadays mainly held in separate silos administered by individuals, museums and governmental institutions, finding those connections is difficult. Both classification and terminology vary, and GIS databases in particular are composed of records that have been accumulated on paper. In addition, there is a voluminous amount of unpublished gray literature with images, maps and photographs embedded. But the problems do not only lie in access; the data is also represented differently internally.
They further state that, due to political boundaries, archaeology will remain a mosaic of provincial efforts in the future as well. This is one of the main motivations to build an integrated framework with customizable access points to methods and data that would help to overcome the current state of fragmentation. Against that background, interoperability is not only a technical goal but also a social project. Sharing design strategies can promote effective cooperation both on the level of human collaboration and electronic interaction. Especially because archaeological research deals with cultural heritage data, sustainability has to be established. Therefore, all host institutions should remain in control of their data. Digital libraries and the services offered should be made publicly available to researchers and whole organizations so that they can store their data.
Figure 2.2 introduces a possible logical infrastructure consisting of data providers and service providers that process the data to offer advanced services on the raw data objects. Additionally, authority naming services will contribute information that can be exploited by software components or by the end user. Both Perseus and Arachne would form repositories that expose well-curated data objects and exhaustive metadata to the web community, possibly by using institutional repository software that will be introduced later. The repository software should implement a protocol such as OAI-PMH that is suitable for the dissemination of huge amounts of data. Institutional repositories (IRs) often also offer advanced services for scalability and durability of the data objects.
Authority naming services will provide specialized structured information on entities of the Greco-Roman world that cannot be covered by gazetteer services like the Getty Thesaurus of Geographic Names [35]. These services host knowledge that has been created by scientists at all times and can be used by them to unambiguously refer to a specific entity. They should be rich in variants and languages to help with information retrieval, entity identification and translation of metadata.
Figure 2.2: Overall system architecture.
The figure also demonstrates how one service provider (indexing) can become a data provider for a second service provider (search and image browsing). Service providers harvest data from institutional repositories to offer advanced services for that data. These could be either services that process large sets of data objects, like statistical analysis and indexing, or services that focus on single data objects to deliver representations like images in multiple formats. The figure shows an indexing service that consults authority naming services to perform entity identification for data merging. A second service obtains processed data from the indexing service to offer searching and browsing facilities.
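How an indexing service might consult an authority naming service before handing its index to a search service can be sketched as follows. This is a simplified illustration under stated assumptions: the authority entries, the entity identifiers and the record structure are invented, and matching is reduced to comparing normalized name variants.

```python
import unicodedata

# Hypothetical authority file: entity id -> name variants in several languages.
AUTHORITY = {
    "place:athens": ["Athens", "Athen", "Athènes", "Athenai"],
    "place:cologne": ["Cologne", "Köln", "Colonia Agrippina"],
}


def normalize(name):
    """Strip diacritics and case so that spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


# Reverse lookup table: normalized variant -> entity id.
VARIANT_TO_ENTITY = {
    normalize(variant): entity
    for entity, variants in AUTHORITY.items()
    for variant in variants
}


def build_index(records):
    """Indexing service: group record ids by the entity their place name refers to."""
    index = {}
    for record in records:
        entity = VARIANT_TO_ENTITY.get(normalize(record["place"]), "unidentified")
        index.setdefault(entity, []).append(record["id"])
    return index


def search(index, place_name):
    """Search service: answers queries from the index built by the indexing service."""
    entity = VARIANT_TO_ENTITY.get(normalize(place_name))
    return index.get(entity, [])


# Records as they might arrive from two data providers (made-up ids).
harvested = [
    {"id": "provider-a:1", "place": "Athen"},
    {"id": "provider-b:9", "place": "ATHENS"},
    {"id": "provider-a:2", "place": "Köln"},
]
index = build_index(harvested)
```

A query for any variant, such as `search(index, "Athènes")`, returns the records of both providers, because the German and English spellings were merged under one entity. Real entity identification is considerably harder than exact variant lookup, since names are ambiguous and authority files are incomplete.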
End users are equipped with pieces of software commonly called agents. The term “agent” refers to a very broad definition of software that performs complex tasks. A software agent could be a web browser or more specialized software that can be influenced directly by a user. Either the tool (by configuration) or the user (for example by typing the address of a web page in the browser address field) has knowledge of the service provider and knows how to connect to and use the service. The agent can also run at a remote site and be controlled by the user with a browser. All these pieces of software offer useful services to the user. These could be compilations of images of data objects, information about unambiguously identified entities or metadata of the data objects themselves.
From a logical perspective it does not matter where the services live and where the data is stored as long as they are scalable, reliable, and accessible. Lately a new buzzword has emerged: grid computing. Although the term is used for many things, it describes some requirements that are valuable for interoperability. Commonly the term refers to different forms of distributed computing, a method of digital information processing that uses a logical layer to run different parts of a computer program simultaneously on distributed machines in order to gain performance.
However, processing “soft” cultural heritage data will lead to scalability issues. A grid infrastructure could help to exploit the resources of many separate computers that are connected by a network. A grid should be able to solve large-scale computing problems by virtualizing resources using a logical layer that mediates between resource consumers and resource providers. For example, large numbers of distributed physical hard drives could be logically connected to form one large volume that hosts huge amounts of image data. Additionally, and fully transparent to the user, this disk array could be plugged into a preservation system. This system would ensure that all data objects are stored redundantly and will be preserved over time.
The infrastructure described would be suitable for building high-level services that manage complex workflows without having to accept multiple media discontinuities (in German “Medienbrüche”). A new form of work environment could support scientists by offering a tool that covers a complex workflow, from targeted information search through compiling and arranging thoughts and ideas into chains of argumentation to online publishing. This agent would be able to use a set of services to support the key workflow steps. The German TextGrid project is one of the larger efforts to achieve this goal, focusing on the field of literary studies [28].
2.2
Implementing digital scholarship
But what is already out there? The following paragraphs deal with how the requirements introduced in the preceding section should be implemented. To approach this challenge it helps to have a look at paradigms that have emerged lately, especially the notion of Digital Libraries and Institutional Repositories. One fundamental paradigm to keep in mind is the notion of a process-oriented view of the overall infrastructure. Leveraging the interoperability capabilities from the metadata to the resource level means supporting scholarly workflows like publication, citation and archiving of resources, not just information retrieval. An effective Cyberinfrastructure will provide functions for discovery, reference, dissemination, aggregation and other forms of reuse and exchange of resources while preserving intellectual property rights.
Today, many Web resources are dynamically created by scripting languages like PHP or by servlet technology. These can be considered part of what is commonly referred to as the Deep Web. Typically this data is managed in relational databases and compiled in a certain way to provide useful presentations to the human user. From the perspective of the Semantic Web, this approach has the disadvantage that crawling services like Google will not have access to this kind of content, so it remains invisible. Only human beings can reveal the contents, by operating a front-end in a certain way. The Semantic Web approach aims to link resources that conceptually belong together. But to be able to do this, all resources need to be publicly available to a certain community. The concept of an Institutional Repository is a step in this direction. Arachne currently approaches this problem by creating a sitemap that helps search engine bots to find objects that are buried within the architecture.1
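The sitemap approach can be illustrated with a short sketch that serializes a list of object pages in the Sitemaps XML format. It does not reproduce how Arachne generates its sitemap; the URL pattern and the record ids are hypothetical, and a real system would read the ids from its database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Serialize an iterable of page URLs as a sitemap document."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URL pattern and record ids, for illustration only.
record_ids = [101, 102, 103]
sitemap_xml = build_sitemap(
    f"https://repository.example.org/object/{record_id}" for record_id in record_ids
)
```

Each database record thereby receives a plain, crawlable address, which is the precondition for a search engine bot to reach content that is otherwise only reachable through a query form.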
In a digital age, the primary function of institutions hosting cultural heritage material is to publicly offer data and services to their audience. Digital library is an emerging term that describes a set of software that can fulfill this task. The term has been used in many different ways in the past. Digital libraries hold collections of digital objects and provide means to rapidly access material in digital form. Additionally, the digital form facilitates new services on that data. While traditional libraries focused on the document as the most granular item needing to be accessed, digital libraries can also focus on the content itself. The content is either created digitally or digitized, for example by scanning and applying OCR software.2
A digital library has at its core some sort of institutional repository software like Fedora or DSpace [12, 3, 58]. Institutional repository software provides methods for collecting, preserving and disseminating the intellectual output of an institution, particularly research institutions. Institutional repositories also help to achieve interoperability of resources from institutions by providing programming interfaces that help with disseminating and federating items of the collections. They can also be used for implementing common services associated with digital libraries.
Since 2006 the Mellon Foundation has been funding an initiative that will develop specifications allowing distributed repositories to share digital objects [33]. In this context digital objects are considered as units of scholarly communication as opposed to the traditional definition. Traditionally, a scientific publication in printed form is one unit of scholarly discourse.
Fedora is an institutional repository that aims at building the foundation for digital libraries [12, 40, 3, 14]. Although the models that are developed by this initiative seem to be very ambitious, they point in a direction that is productive for the further development of digital scholarship in classics and archaeology. From this point of view, in archaeology, a set of metadata and images about an archaeological find can be considered as a unit of scholarly communication. This bundle could be aligned with scientific annotations, leading away from traditional scholarly publication in this domain.
1 Google offers a set of tools for webmasters that facilitate the indexing of dynamically created content at https://www.google.com/webmasters/tools/docs/de/about.html.
2 One remarkable project for the cultural heritage area is the OCRopus project that aims at covering pluggable layout and character recognition as well as statistical language modeling and multilingualism. In a later project phase, OCRopus wants to be able to recognize handwritten documents. More information can be found at http://code.google.com/p/ocropus/.
Fedora stands for Flexible Extensible Digital Object Repository Architecture. Modern digital libraries are supposed to host a large variety of heterogeneous digital objects. During the life-cycle of a digital object a number of management tasks like data creation, organization and dissemination have to be carried out. Fedora tries to reduce costs by providing a set of features that standardize these management tasks. According to the Fedora digital object model, a unit of information consists of one or more data streams. Each data stream could be another representation of a text or an image in a different resolution. Metadata that is associated with a digital object is stored as a separate data stream; multiple metadata formats, images and other data can be associated with one object using this mechanism. Fine-grained access-control policies for the management and access interfaces provide a security architecture. Internally, all data objects together with their data streams are serialized as XML files on a hard disk. This better supports complex tasks associated, for example, with digital preservation. Fedora therefore is one approach to providing a technical foundation for digital library software.
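The idea of a digital object that bundles several data streams and is serialized as XML can be sketched in a few lines. The sketch is a deliberate simplification and does not follow Fedora's actual serialization schema or API; the element names, attribute names, stream ids and file names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class DataStream:
    stream_id: str   # e.g. a metadata format or an image resolution
    mime_type: str
    location: str    # where the content of the stream is stored


@dataclass
class DigitalObject:
    pid: str
    streams: list = field(default_factory=list)

    def add_stream(self, stream):
        """Attach one more representation or metadata record to the object."""
        self.streams.append(stream)

    def to_xml(self):
        """Serialize the object and its stream descriptions as one XML document."""
        root = ET.Element("digitalObject", {"pid": self.pid})
        for stream in self.streams:
            ET.SubElement(root, "datastream", {
                "id": stream.stream_id,
                "mimeType": stream.mime_type,
                "href": stream.location,
            })
        return ET.tostring(root, encoding="unicode")


# A made-up object with one metadata stream and two image resolutions.
statue = DigitalObject("example:statue-1")
statue.add_stream(DataStream("DC", "text/xml", "statue-1-dc.xml"))
statue.add_stream(DataStream("IMAGE_LOW", "image/jpeg", "statue-1-low.jpg"))
statue.add_stream(DataStream("IMAGE_HIGH", "image/tiff", "statue-1-high.tiff"))
serialized = statue.to_xml()
```

Treating metadata as just another stream is what allows several metadata formats to coexist on one object without changing the object model.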
Interestingly, Fedora also implements several features that are relevant for providing Semantic Web services. Any type of relation that is expressed within the metadata of an object is indexed and can be queried using Semantic Web query languages like SPARQL.3 All data streams of digital objects can be associated with behavior for dynamic content delivery (for example image manipulation services or metadata crosswalks). Additionally, the management and access APIs (REST and SOAP) facilitate integration into different application environments. Furthermore, each digital object is associated with a unique URI during the ingest process, and a history of all modifications is stored together with the digital object. This enables references to a specific version of a digital object. Fedora supports dissemination of all data streams, including metadata that is associated with any digital object of the managed collection, by implementing the OAI Protocol for Metadata Harvesting. This is a protocol developed by the Open Archives Initiative and used to collect metadata descriptions of resources for (indexing) services that need to use metadata from many sources [39, 13].
3 SPARQL is a W3C recommendation published at http://www.w3.org/TR/rdf-sparql-query/.
4 The Budapest Open Access Initiative (http://www.soros.org/openaccess/) is recognized
Rooted in the e-print community and well known in the context of Open Access,4 the OAI Protocol for Metadata Harvesting is based on a client-server architecture. Harvesting clients request data from repositories called “Data Providers”. “Service Providers” can then use this data to offer advanced services like indexing or other forms of advanced organization on that data. The metadata to be transported over a network can be in any format that can be serialized as XML and that a certain community has agreed upon. Unqualified Dublin Core always has to be provided as well in order to facilitate a basic layer of interoperability. OAI-PMH claims to be one enabling infrastructure element for supporting new forms of scholarly communication.
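The harvesting exchange described above can be sketched without any network access. The request uses the protocol's `ListRecords` verb with the `oai_dc` metadata prefix for unqualified Dublin Core; the repository address, the record identifier and the abbreviated sample response are assumptions made for illustration (a real response carries further elements such as a response date and datestamps).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a harvesting client would send to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Abbreviated, made-up response for illustration.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Statue of a youth</dc:title>
          <dc:type>sculpture</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Extract identifier and Dublin Core titles from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [element.text for element in record.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    return records
```

A service provider would call `list_records_url` for each known data provider, fetch the responses and feed the parsed records into its own index. Incremental harvesting and the handling of large result sets are left out of this sketch.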
Digital libraries should provide multiple access methods to their collections as well as advanced services on the hosted content. For federated digital libraries, two fundamental paradigms for searching exist: distributed searching and searching an index of previously harvested metadata. Both ways of dealing with federated information systems face fundamental problems on the server as well as on the client side. However, a harvesting approach is more appropriate for the purposes of this project. To exploit the full power of the CIDOC CRM, resources from different places have to be linked extensively. The processing steps that are required for this task will be performed much better if data objects are accumulated in one place.
Distributed searching involves a software component that is aware of a set of associated databases. Search criteria are encoded using a standardized client-server query protocol such as Z39.50.5 The connected information systems translate the query to an internal format and modify the results so that they conform to the standard. The results are then sent back to the querying component, which merges them. This approach delegates the indexing work to each connected database. Thus, computing efforts for index generation and searching are distributed. Since the search results have to be transferred back to the issuing query service, the network traffic during searching is higher. Control over how the index is created and how search results are weighted remains with each federated database system.
Searching previously harvested metadata is the scenario supported by the OAI Protocol for Metadata Harvesting. Here, a service provider harvests data from multiple associated data providers in advance and then builds a local index. This approach bears the disadvantage that all indexing work has to be done locally. But since the harvesting is done in advance, there is less network traffic involved while the actual query is performed. In fact, indexing could be an ongoing process in this case. This approach does not delegate indexing work to federated library institutions, and full control over how the index is technically created remains with the querying software system.
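The difference between the two search paradigms can be made concrete with a small sketch. The two in-memory "providers" stand in for remote databases, and all record ids and titles are invented; no real query protocol such as Z39.50 is implemented here.

```python
# Made-up sources standing in for remote databases.
SOURCES = {
    "provider-a": [{"id": "a1", "title": "Marble statue of Athena"}],
    "provider-b": [
        {"id": "b1", "title": "Bronze statue"},
        {"id": "b2", "title": "Red-figure vase"},
    ],
}


def search_source(records, term):
    """Stand-in for one remote database answering a query with its own index."""
    return [r["id"] for r in records if term.lower() in r["title"].lower()]


def distributed_search(term):
    """Send the query to every source at query time and merge the answers."""
    hits = []
    for name in sorted(SOURCES):
        hits.extend(search_source(SOURCES[name], term))
    return hits


def harvest_index():
    """Harvest all records in advance and build one local inverted index."""
    index = {}
    for records in SOURCES.values():
        for record in records:
            for word in record["title"].lower().split():
                index.setdefault(word, set()).add(record["id"])
    return index


LOCAL_INDEX = harvest_index()


def harvested_search(term):
    """Answer the query from the local index without contacting any source."""
    return sorted(LOCAL_INDEX.get(term.lower(), set()))
```

In the distributed variant every query touches every source, and each source decides how to match. In the harvested variant the cost is paid once when the index is built, and the querying system alone decides how records are tokenized and ranked, which is the property that makes extensive cross-linking of records feasible.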
2.3
The interoperability challenge
Cultural heritage databases use the specialized terminology of their respective domain of research in a certain national language. Moreover, the terminology and standards used may vary even within one domain. Given the fact that the information in each of the databases is of interest to a large community of people, efforts have to be made to overcome current problems with data integration that are caused by the described heterogeneity. Against this background it seems reasonable that the CIDOC CRM, delivering a set of standardized terms and properties, could serve as a basis for overcoming this heterogeneity.
5 Meanwhile, the Z39.50 standard is 20 years old and currently maintained by the Library of
Many projects have started experimenting with the CIDOC CRM for describing cultural heritage data in general and archaeological data in particular [18]. Integration of different cultural heritage vocabularies and descriptive systems is an ongoing research challenge in the course of projects like BRICKS, EPOCH, SCULPTEUR, and IUGO.6 But currently only a few implementations exist that try to bridge the gap between more than one language and several data models at the same time. To overcome the lack of experience with implementing the CIDOC CRM as an intellectual concept and as a software system, Perseus and Arachne want to establish a robust implementation of a mapping workflow in the long term. This thesis reports on the launch of this collaboration by creating a prototype to test the mapping of both databases to a shared metadata format.
Together, Perseus and Arachne are hosting hundreds of texts, thousands of art objects, bibliographic records and large lists of named entities, especially about places and people [8, 9, 20]. Both project partners expect many benefits from integrating their collections by using open standards. Arachne hosts data about approximately 100,000 objects of antiquity and in addition over 200,000 images of these objects in a connected image repository.7 The Perseus Project comprises of 6,000 well-described art and archaeology objects and additionally 36,000 im-ages, but also approximately eight million words of Greek and Latin text as TEI code [34]. First, the integration of records on art and archeology would provide a larger source of information to users, accessible through a common and multi-lingual interface. Second, to facilitate serious digital scholarly research, advanced services regarding those collections should be provided. Users may be interested in browsing passages of Pausanias’ History of Greece (a text that is part of the Perseus digital collection) that are referring to objects in Arachne. Or they may want to consult, for example, the Smiths’ Dictionary of Greek and Roman Antiq-uity that is accessible online at Perseus to rapidly acquire more information about
6 The BRICKS project (http://www.brickscommunity.org/) uses the CIDOC CRM for a software component that manages archaeological finds. EPOCH (http://www.epoch-net.org/) wants to develop a tool that maps from other metadata standards to the CIDOC CRM. The already completed SCULPTEUR project (http://sculpteur.it-innovation.soton.ac.uk/auth/login.jsp) used the CIDOC CRM as internal data model for data integration among several European institutions. IUGO (http://iugo.ilrt.bris.ac.uk/) exploits Semantic Web tools to help locate informally related content of conferences.
a specific record in Arachne, with just one or two mouse clicks. In a nutshell, data integration in this context would mean linking the Greek and Latin collections in Perseus to the Greco-Roman material in Arachne.
Galuzzi points out that museums traditionally present art objects with little context and according to specific curatorial decisions [25]. In the course of changing from analogue to digital media formats, he sees a chance to break with the traditional ways of documentation and information. One challenge of introducing reference models and ontologies like CIDOC CRM is the re-contextualization of those objects by connecting them to other art objects of the same or a different kind, such as ancient texts. This approach makes it possible to emphasize “conceptual similarities” among objects of classics and archaeology. It not only allows the user to find conceptually related objects, but also to navigate from one object to another by means of qualified links. The aim, therefore, must not be to imitate traditional forms of documentation in digital form, but to find new paradigms of data processing and presentation. Arachne and Perseus host unique but conceptually related data objects that could be linked meaningfully.
However, data is currently processed in completely different ways within each database; each institution has designed its own software to deal with its respective specialized data model. Both databases process heterogeneous data: sculptures, vases and entire buildings with their hierarchical arrangement, and of course large amounts of textual data. It does not seem reasonable or feasible to change the internal data models of all participating database systems. Therefore, an abstract mapping agent that can be configured to match each internal data model would certainly be a more rational approach. This mapping agent would have to be aware of both database schemas to be able to translate data to a shared vocabulary of terms with a certain structure. It has been argued that the belief in easily building such a mapping agent is naïve [57]. Therefore, one goal of the project was to estimate the feasibility of an abstract but adaptable mapping component.
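The idea of one generic agent that is configured per database can be illustrated with a minimal sketch. It is not the prototype's implementation: the field names (`kurzbeschreibung`, `fundort`, `title`, `findspot`) and the shared terms (`label`, `find_place`) are hypothetical and are not taken from the Arachne or Perseus schemas.

```python
from typing import Dict

# Hypothetical per-database configurations: local field name -> shared term.
# None of these names come from the real Arachne or Perseus schemas.
ARACHNE_MAPPING: Dict[str, str] = {"kurzbeschreibung": "label", "fundort": "find_place"}
PERSEUS_MAPPING: Dict[str, str] = {"title": "label", "findspot": "find_place"}


def map_record(record: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Translate one local record into the shared vocabulary.

    Fields without a configured mapping are dropped, so the agent itself
    stays generic and only the configuration knows a database schema.
    """
    return {mapping[field]: value for field, value in record.items() if field in mapping}


# Made-up sample records, one per database.
arachne_record = {"kurzbeschreibung": "Statue of Athena", "fundort": "Athens", "intern": "x"}
perseus_record = {"title": "Athena statue", "findspot": "Athens"}

shared_a = map_record(arachne_record, ARACHNE_MAPPING)
shared_p = map_record(perseus_record, PERSEUS_MAPPING)
print(shared_a)
print(shared_p)
```

A flat renaming of fields like this is the simplest possible case. The difficulty noted in [57] arises as soon as one local field corresponds to a whole structure in the shared model.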
But how should this mapping agent be designed? In software technology, flexibility is often described with regard to modularity, adaptability and maintainability. It is interesting that all three claims deal with the reduction of complexity. All become especially problematic when dealing with information systems hosting and processing cultural heritage data. In this context, information systems have to cope with rather complex and non-uniform, sometimes incomplete, sets of data. In addition, in cultural heritage research, functional requirements have a tendency to evolve rapidly while information systems are used by historians. As the understanding of the subject increases, new questions and requirements arise. A flexible information system must therefore be able to advance at the same pace as scientific methodology develops. This should be considered in the design phase already, and
Chapter 3
A web of linked cultural heritage data
The issues described so far are saturated with concepts and ideas that are currently discussed under the notion of the “Semantic Web.” Having discussed the intellectual requirements of digital scholarship and presented means for implementation, this section identifies and describes state-of-the-art developments relating to the current World Wide Web. These are meant to facilitate new and better ways of scholarly communication. Although often criticized, the current Semantic Web research efforts articulate new and interesting ideas on how to deal with data that is to be published on the internet. Also, a fruitful discussion is emerging on how to identify, describe, and retrieve Web resources in future interoperability environments. These means of identifying and retrieving resources could be the glue that ties distributed data together. However, the Semantic Web today lacks larger integrated software solutions and is mostly tool-based. Some of these tools greatly helped with exploring the usability of the Semantic Web for the interoperability prototype that was developed during the course of the project. In this section, the foundations of Semantic Web technology are first described on the basis of the traditional, and admittedly imprecise, Semantic Web “layer cake.” Then, to emphasize the importance of identification and representation, the latest information on this topic is presented and discussed. Finally, the Semantic Web tools that were used during the project are introduced.
3.1 Conceptual and technical requirements
“Insofar as the researcher critically assesses, modifies, or rejects his idea, one could also regard our methodological analysis as a rational reconstruction of the corresponding thought processes [49].”
The World Wide Web Consortium defines the term Semantic Web in a surprisingly simple way: “The Semantic Web is a web of data.”1 Even more surprising is that this describes a condition that does not yet exist in the humanities: a web of data. Why? A great deal of the cultural heritage data that is currently accessible on the Internet is controlled by software written for small and specialized audiences and tailored to a specific purpose. Furthermore, archaeological data is currently collected at a low level of granularity as sets of documents. Today’s cultural heritage web can therefore be described as a web of linked documents, not as a web of linked data.
The “Web of Data”, as the “Semantic Web” should consequently be called, describes all activities aimed at overcoming today’s unsatisfactory state. For that purpose, formal languages and software components need to be developed that deal with two aspects of data integration. Syntactic data integration physically combines data from different data sources by accumulating data objects in one place, a central database for example. Semantic integration builds upon this foundation by assuring that the data is interpreted and processed in a consistent way, namely as intended by the originator of the data. By this means, data from different sources can be combined and queried better than before.
Scientists who want to solve a scientific problem need a phase of creative thinking to collect ideas and materials that contribute to resolving the research problem. Thus, they need to juggle a lot of information in their minds at once to study all aspects of the issue exhaustively. This is what Semantic Web technology is designed for: it is supposed to knock down the boundaries between different “silos” of information.2 The Semantic Web thus aims at allowing scientists to connect information in a seamless and networked way without the need to translate and transform between multiple media formats. Figure 3.1 shows the components of the Semantic Web.3 The layer cake metaphor captures the notion that there are several levels, each of which builds upon a lower one.
The Unicode standard (ISO/IEC Standard 10646) reserves a distinct number for each letter (more generally: character), independent of the platform (operating system), language or program that uses Unicode.4 Major IT companies have adopted Unicode, and other standards and platforms such as XML and Java support it. The concept of the
1 This quote was taken from the basic introductory material about the Semantic Web that can be found at http://www.w3.org/2001/sw/.
2 Tim Berners-Lee expressed this idea in an interview that was published at http://www.businessweek.com/technology/content/apr2007/tc20070409_961951.htm.
3 The image was taken from http://www.w3.org/2001/09/06-ecdl/swlevels.gif.
4 A basic introduction on Unicode can be retrieved at http://www.unicode.org/standard/
Figure 3.1: Semantic Web layer cake.
Semantic Web builds upon Unicode characters for expressing strings. In order to interact with resources on the internet, a Uniform Resource Identifier (URI) was introduced. A URI is a string of Unicode characters that unambiguously names or identifies material or abstract things of the “real” world, provided that there is a digital surrogate available.
URIs can be divided into two subcategories, Uniform Resource Names (URN) and Uniform Resource Locators (URL). While URLs are URIs that provide some additional information on how the reference to a resource can be resolved to an actual object, URNs only provide a unique name for a resource without information about where an agent can get a representation such as an image. An example of the latter are DOIs (The Digital Object Identifier System).5 For DOIs, the DOI website itself provides a resolver that does not directly deliver HTML but redirects to a URL that can be resolved to an HTML page. However, many URLs simply are URIs that have a well-known resolver mechanism, the global Domain Name System. This system is so well established that it seems to be totally transparent.
Metadata is data about data that is used to facilitate the understanding, use, and management of data. In the context of a digital library, a data object could be a digitized text. Metadata for this text would include, for example, information about the author, the publisher, or the number of pages. The Extensible Markup Language (XML) with its hierarchical structure can be used to attach data about data objects to these objects. XML defines a basic syntax that can be used to structure
5 If a user types in the DOI 10.1007/978-0-387-34347-1_6 at http://www.doi.org/, it will be resolved to the URL http://www.springerlink.com/content/h3800073756x7872/. This in turn delivers an HTML page with more information on the paper and a few more browsing facilities.
documents on the Web.6 But XML does not provide any means to make assertions about the semantics of a document or its parts. XML Schema is a language that constrains the structure of an XML document and augments the XML standard with additional typing facilities.7 Whether data is considered self-contained or data about data depends on the context. One could imagine cases where metadata is the object of research. In this event, metadata about metadata would be valuable.
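The role of XML as a purely syntactic carrier for metadata can be made concrete with a short sketch using Python's standard library. The element names (`record`, `author`, `publisher`, `pages`) and the values are made up for illustration and do not follow any particular metadata standard.

```python
import xml.etree.ElementTree as ET

# Build a small, made-up metadata record for a digitized text.
record = ET.Element("record")
ET.SubElement(record, "author").text = "Pausanias"
ET.SubElement(record, "publisher").text = "Example Press"  # hypothetical value
ET.SubElement(record, "pages").text = "412"  # hypothetical value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# XML alone does not say that "pages" holds a number; the application has
# to know this. XML Schema can make this kind of typing explicit.
parsed = ET.fromstring(xml_text)
page_count = int(parsed.findtext("pages"))
print(page_count)
```

The conversion with `int` is knowledge that lives in the program, not in the document, which is the gap that schema and ontology languages are meant to close.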
The lower layers of the model basically deal with questions of syntax while the higher layers are concerned with interpreting the “meaning” of data. The term “semantics” has been used for a lot of things and has never been well defined. Moreover, there is no agreement on how the term semantics relates to the concept of the Semantic Web. As mentioned earlier, the Semantic Web community lately prefers the term “Web of Data” over “Semantic Web.” It can be said that the notion “semantics” itself refers to the meaning that is expressed in some form of representation of information, for example natural or formal language (metadata). Uschold states that the notion of “real world semantics” as defined by Ouksel and Sheth best captures the role of “semantics” in the orbit of the Semantic Web [60, 47]. According to this definition, objects within a model are mapped onto the perceptible world. Uschold then introduces a semantic continuum. According to his model, information can be encoded at different levels of detail, ranging from implicitly, over explicitly informally and explicitly formally for human processing, to explicitly formally for machine processing. Although the far right end of this continuum has not been reached today, there is a lot of value in encoding meaning explicitly and formally for human processing. This helps software developers to write software that is able to process a certain kind of shared data. In the end, the objective will be to build software that dynamically and autonomously resolves the meaning of the data objects it encounters by concept reasoning.
To explicitly make assertions about the semantics of a data object, a hierarchical markup language is insufficient. That is where higher standards like RDF and the notion of “ontologies” come in. Gruber defines a formal ontology as an artifact of a construction that was designed for a specific purpose and is “evaluated against objective design criteria [29].” The meaning of “ontology” is controversially discussed in the artificial intelligence field because the term also has a long tradition in philosophical discourse, where it alludes to the notion of existence. It has often been confused with epistemology, which refers to knowledge and the theory of cognition. In the context of knowledge sharing and reuse, ontology can be defined as a specification of a conceptualization. Thus, an ontology is a description
6 XML is a markup standard derived from SGML (ISO 8879). More information about XML can be found at http://www.w3.org/XML/.
7 XML Schema is a W3C standard and has been published at http://www.w3.org/TR/
(like a formal specification of a computer program) of the concepts and relations within a domain that an agent (again a computer program) or a set of agents can evaluate to process data. By restricting the vocabulary to express what is the case in a specific domain, ontologies facilitate interoperability between multiple pieces of software.8
The Resource Description Framework (RDF) mentioned above is another language that defines a simple data model to describe resources and the relations that can exist between resources. RDF provides trivial semantic concepts like objects and relations and can be expressed in XML but also in other notations like Notation3 [30]. In RDF, information is represented as triples. A triple is an assertion that comprises subject, predicate, and object. RDF Schema builds on top of RDF by providing a vocabulary to group objects into classes and to constrain the relations that may exist between class instances. Thus, RDFS is to RDF what XML Schema is to XML. It augments the semantics of RDF by hierarchical generalization and the definition of properties. It has enough semantic power to describe simple ontologies [5]. Since the CIDOC CRM version 4.2 has been published as RDFS, both Perseus and Arachne data were exported to RDF and evaluated against the published RDFS document [31].
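The triple model and the hierarchical generalization added by RDFS can be sketched without any RDF library. In the following fragment, the `http://example.org/` namespace, the resources and the `depicts` property are hypothetical; only the `rdf:type` and `rdfs:subClassOf` URIs are the standard ones. It is a simplified illustration, not the export code used in the project.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all URIs here

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace

triples: List[Triple] = [
    (EX + "object/30014", RDF_TYPE, EX + "Sculpture"),
    (EX + "Sculpture", RDFS_SUBCLASS, EX + "PhysicalObject"),
    (EX + "object/30014", EX + "depicts", EX + "person/Athena"),
]


def to_ntriples(data: List[Triple]) -> str:
    """Serialize triples whose terms are all URIs as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in data)


def classes_of(resource: str, data: List[Triple]) -> Set[str]:
    """Return asserted classes plus superclasses reachable via rdfs:subClassOf."""
    found = {o for s, p, o in data if s == resource and p == RDF_TYPE}
    frontier = set(found)
    while frontier:
        parents = {o for s, p, o in data if p == RDFS_SUBCLASS and s in frontier}
        frontier = parents - found  # only follow classes not seen before
        found |= frontier
    return found


print(to_ntriples(triples))
print(classes_of(EX + "object/30014", triples))
```

The second function shows what "hierarchical generalization" buys: the object is only asserted to be a sculpture, but an RDFS-aware agent can also treat it as a physical object.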
OWL is a language that reaches beyond the abilities of RDFS, for example by defining further language elements to describe relations between classes (“disjunctive”), restricting cardinalities (“exactly one”), equality, richer typing of properties, features of properties (“symmetry”), and enumerated classes [42]. The concept of the Semantic Web knows three additional layers that have not been addressed extensively until now: Logic, Proof, and Trust. These three upper layers deal with advanced concepts that are irrelevant for the description of the CIDOC CRM. Therefore, they will not be dealt with further in this thesis.
It has been argued that the Semantic Web endeavor is too expensive, that nobody would be willing or even be able to produce enough content to create enough uptake. Shadbolt et al. explain that “uptake is about reaching the point where serendipitous reuse of data, your own and others’, becomes possible [54].” They carry on by saying that, today, most projects lack this viral uptake. In most cases there is no stable URI for objects so that the predicted revolution has not taken place yet. There is a need for small communities that have a pressing need for new technology. Could the cultural heritage sector be such a community?
Viral uptake would create a network effect. In information technology the term “network effect” was coined by Metcalfe, the inventor of Ethernet [43].9
8 Pidcock tries to clarify the distinction between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model at http://www.metamodel.com/article.php?story=20030115211223271.
9 For more information on applying Metcalfe’s law to the Semantic Web refer to http://
He argued that the cost of a network is proportional to the number of network cards installed, but the value of the network is proportional to the square of the number of users. These users can share access to expensive resources like storage. Transferred to the linked data idea, users could then share access to metadata about a uniquely identified resource that has already been annotated by others. A critical mass has to be reached to make the system useful for all users, because the value obtained from the infrastructure has to be greater than or equal to the price paid for establishing the building blocks of the overall system. A reasonable strategy could be to build a system that delivers value to users even without exploiting network effects. As the number of users increases, the system becomes more valuable to everybody. The scalability of these solutions can be greatly enhanced by introducing a peer-to-peer principle instead of hosting all data as a monolithic block on one server. But it is certain that by sharing unique identifiers everybody can add metadata to a specific entity and share it among the community.
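The critical mass argument can be worked through numerically. The sketch below assumes a linear cost model and a quadratic value model as described above; the constants `cost_per_user` and `value_factor` are free parameters chosen for illustration, not figures from [43].

```python
def network_cost(users: int, cost_per_user: float) -> float:
    """Cost grows linearly with the number of participants."""
    return cost_per_user * users


def network_value(users: int, value_factor: float) -> float:
    """Value grows with the square of the number of participants."""
    return value_factor * users ** 2


def critical_mass(cost_per_user: float, value_factor: float) -> int:
    """Smallest number of users for which value >= cost."""
    if value_factor <= 0:
        raise ValueError("value_factor must be positive")
    users = 1
    while network_value(users, value_factor) < network_cost(users, cost_per_user):
        users += 1
    return users


# Illustrative parameters only: 0.5 * n**2 >= 50 * n holds from n = 100 on.
break_even = critical_mass(cost_per_user=50.0, value_factor=0.5)
print(break_even)
```

Below the break-even point every participant pays more than the network returns, which is why a system that is already useful to a single user has a better chance of reaching that point.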
3.2 Identification and representation of resources
Currently every archaeologist can access the Arachne database to conduct research and to choose from a vast amount of information. It is also possible to cite Arachne as a source by mentioning the unique Arachne serial number together with some information that disambiguates the serial number in Arachne, for example by distinguishing buildings from topographic entities. This enables the reader of a publication to write down the serial number and direct his browser to the Arachne website. After logging in, he can use the serial number to access the same information that the author accessed some time ago. This is one method to reconstruct the methodical approach that was used to compile the results in a publication. This traditional approach has a couple of shortcomings and is complicated and time-consuming.
To be able to talk about a specific subject area that has an internet representation, each object on the Web should be identified by a stable URI. Then, this URI can be used to reference the entity, let’s say for annotation purposes, or to resolve a digital representation of this resource (in Fedora terms this is a data stream). Many web servers also support content negotiation. By exploiting this functionality, a software agent can state its preference regarding the representation of a Web resource. The web server can then deliver one or more representations in HTML, a machine-readable representation in RDF/XML or a couple of images for the resource.
By using the traditional HTTP URL scheme for naming a web resource, most Web-enabled programs will be able to rapidly retrieve a representation of the resource. An archaeological object in Arachne could, for example, be named by the URL http://arachne.org/object/30014. By exploiting the mechanism of content negotiation, a software agent could retrieve an RDF/XML representation and discover that there are multiple images connected to this resource. As a second step, the agent retrieves one image by dereferencing the URL http://arachne.org/images/482199 and indicating that compressed JPEG is the preferred format. The agent indicates this by including the string Accept: image/jpeg; q=1.0, application/rdf+xml; q=0.5, text/html; q=0.1 in the header of the request, which a server such as the Apache HTTP server [23] evaluates.10 By transmitting this string together with the request, the user agent thus expresses that it prefers an image over a representation in RDF/XML. The remaining option, if all else fails, is a representation in HTML.
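How a server might evaluate such an Accept header can be sketched as follows. This is a simplified illustration, not Apache's algorithm: it only handles exact media types and q values and ignores wildcards such as `*/*`. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0  # HTTP/1.1 default when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs.append((fields[0], q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose(header: str, available: List[str]) -> Optional[str]:
    """Pick the available media type the client prefers most."""
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in available:
            return media_type
    return None


ACCEPT = "image/jpeg; q=1.0, application/rdf+xml; q=0.5, text/html; q=0.1"
print(choose(ACCEPT, ["text/html", "application/rdf+xml"]))
```

With the header from the text, a resource that offers only HTML and RDF/XML would be served as RDF/XML, because 0.5 outranks 0.1.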
Listing 3.1 demonstrates the process of content negotiation that can direct a client to select the appropriate representation of a specific Web resource. In this particular example, the client tries to retrieve the URL http://dbpedia.org/resource/Berlin and indicates that it prefers an HTML page as the result. The server responds with a 303 message and provides another URL, which most browsers automatically retrieve to display the corresponding HTML page. This process is transparent to the user.11
Listing 3.1: The client requests an HTML representation.
 1 Krabat:~ rokummer$ telnet dbpedia.org 80
 2 Trying 160.45.137.85...
 3 Connected to dbpedia.org.
 4 Escape character is '^]'.
 5 GET /resource/Berlin HTTP/1.1
 6 Host: dbpedia.org
 7 Accept: text/html
 8
 9 HTTP/1.1 303 See Other
10 Date: Tue, 14 Aug 2007 12:05:12 GMT
11 Server: Apache-Coyote/1.1
12 Location: http://dbpedia.org/page/Berlin
13 Content-Length: 0
14 Content-Type: text/plain
15
16 Connection closed by foreign host.
Listing 3.2 shows the client requesting the HTML page that it asked for. After the header information, the HTML code begins at line 34.
Listing 3.2: The client retrieves the HTML representation.
17 Krabat:~ rokummer$ telnet dbpedia.org 80
18 Trying 160.45.137.85...
19 Connected to dbpedia.org.
20 Escape character is '^]'.
21 GET /page/Berlin HTTP/1.1
22 Host: dbpedia.org
23 Accept: text/html
24
10 Apache supports content negotiation according to the HTTP/1.1 standard. More information on Apache content negotiation can be found at http://httpd.apache.org/docs/2.3/content-negotiation.html.
11 This example is inspired by the document “How to publish Linked Data on the Web” at
25 HTTP/1.1 200 OK
26 Date: Wed, 15 Aug 2007 13:36:36 GMT
27 Server: Apache-Coyote/1.1
28 Cache-Control: no-cache
29 Pragma: no-cache
30 Content-Type: text/html; charset=utf-8
31 Transfer-Encoding: chunked
32
33 5b4
34 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
35 <head>
Listing 3.3 shows the client indicating that RDF/XML is the preferred representation. The server again responds with a 303 redirect, but this time to the URL that points to RDF/XML data.
Listing 3.3: The client requests an RDF representation.
36 Krabat:~ rokummer$ telnet dbpedia.org 80
37 Trying 160.45.137.85...
38 Connected to dbpedia.org.
39 Escape character is '^]'.
40 GET /resource/Berlin HTTP/1.1
41 Host: dbpedia.org
42 Accept: application/rdf+xml
43
44 HTTP/1.1 303 See Other
45 Date: Tue, 14 Aug 2007 12:05:50 GMT
46 Server: Apache-Coyote/1.1
47 Location: http://dbpedia.openlinksw.com:8890/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=DESCRIBE+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FBerlin%3E
48 Content-Length: 0
49 Content-Type: text/plain
50
51 Connection closed by foreign host.
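What a software agent has to do with a 303 response can be sketched as follows. The fragment parses the head of the response shown in Listing 3.1, which is embedded as a string so that no network access is needed. The function name `parse_response_head` is hypothetical, and a real agent would rather rely on an HTTP library than on hand-written parsing.

```python
from typing import Dict, Tuple


def parse_response_head(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split the head of an HTTP response into status code and headers."""
    lines = raw.strip().splitlines()
    status_code = int(lines[0].split()[1])  # e.g. "HTTP/1.1 303 See Other"
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # an empty line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


# Response head copied from Listing 3.1.
RESPONSE = """HTTP/1.1 303 See Other
Date: Tue, 14 Aug 2007 12:05:12 GMT
Server: Apache-Coyote/1.1
Location: http://dbpedia.org/page/Berlin
Content-Length: 0
Content-Type: text/plain
"""

status, headers = parse_response_head(RESPONSE)
# A 303 tells the agent to fetch the representation from another URL.
next_url = headers.get("location") if status == 303 else None
print(status, next_url)
```

The agent would then issue a second GET request for `next_url`, which corresponds to the step from Listing 3.1 to Listing 3.2.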
An alternative to addressing resources with HTTP URLs is to use a generic URI and to provide a service that resolves this URI to an appropriate representation. Whilst the use of URLs exploits existing technology, using generic URIs entails building resolving services. This is complex and cost-intensive but is useful in some domains. A sample URL that resolves a URI is http://some.resolver.org/resolve?uri=arachne:objekt:4711&type=application/rdf+xml. Here the content negotiation part is visibly encoded within the URL. There will be a more in-depth description of this mechanism in section 4.2 on page 32.
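Building such a resolver request is mostly a matter of correct escaping. The following sketch uses Python's standard library and the sample resolver address from the text; the parameter names `uri` and `type` are taken from the sample URL, while the function name is hypothetical. Note that the `+` in `application/rdf+xml` has to be percent-encoded, because an unescaped `+` in a query string is commonly decoded as a space.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

RESOLVER = "http://some.resolver.org/resolve"  # sample resolver from the text


def resolver_url(uri: str, media_type: str) -> str:
    """Build a resolver request; urlencode escapes ':', '/' and '+'."""
    return RESOLVER + "?" + urlencode({"uri": uri, "type": media_type})


url = resolver_url("arachne:objekt:4711", "application/rdf+xml")
print(url)

# The resolver side can recover both values unchanged from the query string.
query = parse_qs(urlsplit(url).query)
print(query)
```

Compared with the Accept header, the requested media type becomes part of the address itself, so every representation of a resource gets its own URL.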
3.3 Semantic Web tools
Even if there is enough data represented in a way that can be easily exchanged and shared, there is still the need for software that is able to process the data. These software components are so-called agents. They serve to process Semantic Web data and to provide communication channels to resolve problems collaboratively, with one or more agents for each task. Many tools are evolving in the field of the Semantic Web. In fact, the number of tools that are supposed to deal with the technologies described has grown so fast that the W3C could not cope with the upsurge and decided to create a community-driven portal to keep track of the domain.12 Since most of these toolkits deal with and depend on RDF, we decided to choose RDF for implementing the mapping. Unfortunately, most of these tools come with little or no documentation, and there is little experience on how they deal with large amounts of data.
Shopping agents are degenerate examples of Semantic Web applications. On behalf of their users, they fulfill the fundamental task of comparing prices from disparate and heterogeneous but semantically related sources. They are degenerate because usually none of the sources has published its vocabulary. Shopping agents therefore usually need to scrape the information from multiple HTML pages. This results in additional work for software developers since they always have to deal with individual data models. There is no format that everybody has agreed on, and a lot of semantics has to be hardwired within the agent software. Each time one of the participating vendors changes the appearance of its web page, the agent software needs to be adapted.
Throughout the project, multiple tools served to provide a better understanding of Semantic Web concepts and methods. The following describes the software components used. Protégé was helpful for approaching modeling techniques of ontologies including the CIDOC CRM. The user gets an impression of what the RDF markup could look like if it was produced by an automated mapping algorithm. Strengths and drawbacks of different modeling approaches became visible after manually creating data objects in the CIDOC CRM schema.13 The next tool, called Jena, is a Java framework that supports the development of Semantic Web applications. It provides a programming environment for RDF, RDFS and OWL and embodies a rule-based inference engine. Jena is Open Source, a result of development efforts of the HP Labs Semantic Web Programme. There are a couple of frameworks available for Java and other programming languages, but Jena, with currently 11 developers and 24,600 downloads, appears to be one of the more active projects within the Open Source community.14 Eyeball is a part of the Jena framework that checks RDF models for common problems and is used within the project to check the CIDOC CRM markup before it is further processed by software components. It checks for unknown predicates and classes, bad namespaces, and ill-formed URIs, amongst other things. The Redland RDF Libraries provide a couple of command line tools that are useful to count triples and to reformat RDF code. In this particular case, they were used to count the triples that were generated during the mapping efforts.15
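The two kinds of checks mentioned above, counting triples and flagging unknown predicates, can be illustrated with a small stand-alone sketch. It does not use or reproduce the interfaces of Eyeball or the Redland tools; it handles only the simplest N-Triples form (three URIs per line), and the `http://example.org/` data, including the deliberately misspelled predicate, is made up.

```python
import re
from typing import List, Set, Tuple

# Matches the simplest N-Triples form: three URIs in angle brackets.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def read_triples(text: str) -> List[Tuple[str, ...]]:
    """Return all lines that parse as URI-only triples."""
    matches = (TRIPLE.match(line.strip()) for line in text.splitlines())
    return [m.groups() for m in matches if m]


def unknown_predicates(triples: List[Tuple[str, ...]], known: Set[str]) -> Set[str]:
    """Report predicates that the schema does not declare."""
    return {p for _, p, _ in triples if p not in known}


# Made-up data; the second predicate contains a typo on purpose.
DATA = """
<http://example.org/object/1> <http://example.org/schema#depicts> <http://example.org/person/1> .
<http://example.org/object/1> <http://example.org/schema#depicst> <http://example.org/person/2> .
"""
KNOWN = {"http://example.org/schema#depicts"}

triples = read_triples(DATA)
problems = unknown_predicates(triples, KNOWN)
print(len(triples), problems)
```

Checks of this kind are cheap and catch mapping errors early, before malformed markup reaches later processing steps.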
12 The W3C maintains a wiki-style list of Semantic Web tools at http://esw.w3.org/topic/SemanticWebTools.
13 http://protege.stanford.edu/.
14 http://jena.sourceforge.net/.
15 http://librdf.org/.
Chapter 4
Standards for semantic interoperability
Many cultural historians are happy to conduct scientific research without having to think about formalized and shared conceptualizations. Developing formalized ontologies for easier exchange of knowledge involves more time and effort than doing things intuitively. The issue of building awareness of the advantages that ensue from using standards for digital representation of cultural heritage data still needs to be addressed. Formalizing knowledge with standardized systems not only allows it to be transferred and displayed over network connections, but also to be enriched with annotations and behaviors like searching and browsing. However, Semantic Web concepts are not yet understood and accepted within the cultural heritage area, which currently limits the CRM’s potential.
Common conceptual models like the CIDOC CRM can be used in many ways. Guarino categorizes different uses of ontologies by temporal and structural dimension [52]. Thus, ontologies can be used at development time and at run time. At development time, ontologies can serve as a common language for software developers and domain experts. In this scenario, an ontology helps to model domain concepts as software components. By using standard vocabularies, the software usually achieves a better degree of interoperability. Information systems that are ontology-aware use ontologies at run time. Some software agents recognize data that they encounter as being encoded according to a certain ontology. From a structural point of view, an ontology can be used at different levels of an application program or even pervade the whole information system: the database component, the application component, and the user interface.
Due to the respective focus of each project partner, the project focuses on material objects, ancient Greek and Latin text, and the contexts that these can be linked to. To establish interoperability, multiple standards have to work together to cover the needs of a specific domain. While the CIDOC CRM was developed to represent information about objects, especially those managed by museums, a new version of Functional Requirements for Bibliographic Records (FRBR), FRBRoo, is being developed as an ontology aligned to the CIDOC CRM [17]. As an entity-relationship model, FRBR provides the means to accurately describe bibliographic information in a digital world. FRBRoo provides the means to express the IFLA FRBR data model with the same mechanisms and notations provided by the CIDOC CRM. The CIDOC CRM and FRBR harmonization, especially when extended with the Canonical Text Services protocol [50], will allow collections to integrate complex textual materials with extensive metadata about objects. The following section will focus on the introduction of these standards.
Thus, the concept of the CIDOC CRM itself heavily relies on other forms of shared infrastructure and standards. Gazetteers, other domain-specific naming authorities, and controlled vocabularies provide the means for referencing and describing things and objects that form the context of material and textual objects. These registries still have to be developed and published so that a wide audience will be able to use these vocabularies by referring to entities and contributing to the content. Furthermore, service registries will hook up all participating data providers and play a major role in data discovery.
4.1 Managing archaeological objects
“We have the vision of a global semantic network model, a fusion of relevant knowledge from all museum sources, abstracted from their context of creation and units of documentation under a common conceptual model. The network should, however, not replace the qualities of good scholarly text. Rather it should maintain links to related primary textual sources to enable their discovery under relevant criteria [15].”
Many standards have emerged that facilitate the representation of cultural heritage data, like the Getty Categories for the Description of Works of Art or the standards of the Art Museum Image Consortium, which operated until 2005 [2, 7]. In 2006, the CIDOC Conceptual Reference Model became the official standard ISO 21127:2006. The CIDOC CRM comprises definitions arranged as a structured vocabulary that were developed over a period of ten years by the CIDOC Documentation Standards Group. This group falls within the International Committee for Documentation (ICOM-CIDOC) of the International Council of Museums (ICOM). The CIDOC CRM provides a blueprint to describe cultural heritage and museum information. Therefore the CIDOC CRM will have a major role within the integration efforts of this project. It can help to analyze the data structures of the participating information systems and to identify common information contents.
Technically speaking, the CIDOC CRM is a hierarchy of 84 classes defining concepts that are commonly referred to in museum documentation practice. Each class describes a set of objects that share common features. 141 so-called properties define semantic relations between these conceptual classes. Thus, the CRM builds a foundation for semantic interoperability in the cultural heritage area [10]. Figure 4.1 shows a schematic overview of the most important concepts and the relations that can exist between them, according to the model.¹ By adopting these concepts of formal semantics, the CIDOC CRM is well prepared to play a role in the development of the Semantic Web.
Figure 4.1: Conceptual overview of the CIDOC CRM.
The CIDOC CRM does not intend to prescribe how a certain community should document objects, even though it could serve as a guideline for good documentation practice. The goal is to facilitate read-only integration of data, either materially or virtually. While creating the CIDOC CRM, two design choices were made to further facilitate data integration and to keep the whole vocabulary to a manageable size. First, as the result of a pragmatic approach to ontology design, the CIDOC CRM is property-centric. By providing a large set of properties, richer semantics can be expressed than by using fine-grained hierarchies of classes, as thesauri would do. Classes were thus only introduced to form the domains and ranges for properties. While an attribute is applicable to only one class instance, a relation always concerns two instances. Thus, the CIDOC CRM helps with modeling objects within their context instead of attaching isolated attributes. Second, it has been argued that explicitly including events in ontologies results in models that facilitate better integration of cultural contents [15]. Thus, the CIDOC CRM proposes events that tie objects and their contexts together. Figure 4.1 demonstrates how events link physical things, conceptual objects, places, timeframes, and actors. It goes without saying that a data structure that conforms to this paradigm is more difficult to create than a flat attachment of values to a data object.
An ancient sculpture, for example, would be modeled as an instance of the class E24 Physical Man-Made Thing, a class that “comprises all persistent physical items that are purposely created by human activity [10].” It came into existence through an activity that in turn is an instance of the class E12 Production: “this class comprises activities that are designed to, and succeed in, creating one or more new items.” Both instances are connected by the property P108B was produced by, a property that “identifies the Physical Man-Made Thing that came into existence as a result of a Production Event.”
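The sculpture example can be written down as a small set of subject-predicate-object statements. The following Python sketch is only an illustration under stated assumptions: the identifiers `ex:sculpture_1` and `ex:production_1` as well as the prefixes `crm:` and `ex:` are hypothetical, and a plain set of tuples stands in for a real RDF store.

```python
# Hypothetical identifiers; the prefixes "crm:" and "ex:" are illustrative only.
RDF_TYPE = "rdf:type"

triples = {
    ("ex:sculpture_1", RDF_TYPE, "crm:E24_Physical_Man-Made_Thing"),
    ("ex:production_1", RDF_TYPE, "crm:E12_Production"),
    ("ex:sculpture_1", "crm:P108B_was_produced_by", "ex:production_1"),
}


def objects(subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def production_events_of(thing):
    """Follow P108B from a thing to the production event(s) that created it."""
    return objects(thing, "crm:P108B_was_produced_by")


print(production_events_of("ex:sculpture_1"))
```

Because the production is an instance in its own right, further statements about place, time or actors could be attached to it rather than to the sculpture.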
Data from different sources that follow this scheme can be processed more consistently, even if different sources deliver contradictory information. Unlike Dublin Core, the CIDOC CRM focuses on the cultural heritage domain and adds a class and property hierarchy to its vocabulary definitions. Additionally, attribute assignments can be linked to events, so that the same attribute can be assigned twice with different values as a result of different measurement events, a situation that is common when dealing with soft historical data. Assigning database objects to well-defined classes also facilitates searching for common objects that originate from different data sources.
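How two contradictory values can coexist when each is tied to its own measurement event may be sketched as follows. This is a minimal illustration, not part of the model definition: the CRM names mentioned in the comments (E16 Measurement, P39, P40, P90, P91) are assumptions not discussed in the text above, and all identifiers, source names and values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One measurement event (assumed to correspond to E16 Measurement)."""
    event_id: str    # hypothetical identifier of the event
    source: str      # hypothetical name of the contributing data source
    measured: str    # the object measured (assumed: P39 measured)
    dimension: str   # kind of dimension observed (assumed: P40 observed dimension)
    value: float     # assumed: P90 has value
    unit: str        # assumed: P91 has unit


# Invented example data: two sources disagree about the same attribute.
measurements = [
    Measurement("ex:measurement_1", "source_a", "ex:sculpture_1", "height", 182.0, "cm"),
    Measurement("ex:measurement_2", "source_b", "ex:sculpture_1", "height", 185.5, "cm"),
]


def values_for(obj, dimension):
    """Collect every recorded value together with the event that produced it."""
    return [(m.event_id, m.value, m.unit)
            for m in measurements
            if m.measured == obj and m.dimension == dimension]


heights = values_for("ex:sculpture_1", "height")
print(heights)
```

Neither value overwrites the other; a consumer can decide later which measurement event to trust.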
If certain communities find that class concepts like E24 are too broad in scope, more detailed classes can be agreed on, for example in order to distinguish vases and buildings, which both fall within the category E24. This is usually done by exploiting the extension mechanism of the CIDOC CRM. Certainly, a simple mapping of these concepts to E24 Physical Man-Made Thing would be unsatisfactory because information would get lost. Therefore, the CRM offers means to refine its high-level concepts by using the class E55 Type. This class can be used to attach a thesaurus-like hierarchy of terms to the standard data model. Because the extensions through E55 Type are community-specific and not covered by the standard CIDOC CRM, they have to be documented and published as authority documents. Only once this has been done is seamless and automatic processing of the data assured.
The CRM offers two mechanisms to create more granularity for describing museum objects. One approach would be to define subclasses of the built-in CIDOC CRM classes, for example A1 Sculpture → is a → E24 Physical Man-Made Thing or A2 Building → is a → E24 Physical Man-Made Thing. Interestingly, the same can be done with properties. The other mechanism is to use a type hierarchy that can be constructed by using the class E55 Type. The class E55 Type is treated as universal and specific at the same time. This has the advantage that a type can be discussed as an element of scholarly discourse (E83 Type Creation → P135 created type → E55 Type).
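Both mechanisms can be compared in a short sketch. It assumes the subclasses A1 and A2 from the example above; the type terms "vase", "amphora" and "krater", the record identifiers and the two lookup tables are hypothetical and only serve the illustration.

```python
# Mechanism 1: community-specific subclasses (A1 and A2 follow the example in the text).
SUBCLASS_OF = {
    "A1 Sculpture": "E24 Physical Man-Made Thing",
    "A2 Building": "E24 Physical Man-Made Thing",
}

# Mechanism 2: a thesaurus-like hierarchy of E55 Type terms (hypothetical terms).
BROADER_TYPE = {
    "amphora": "vase",
    "krater": "vase",
}


def ancestors(term, parent_of):
    """Yield the term itself and all of its ancestors, most specific first."""
    while term is not None:
        yield term
        term = parent_of.get(term)


# Hypothetical records: (identifier, class, type term or None).
records = [
    ("ex:obj_1", "A1 Sculpture", None),
    ("ex:obj_2", "A2 Building", None),
    ("ex:obj_3", "E24 Physical Man-Made Thing", "amphora"),
]


def find_by_class(cls):
    """Find records whose class is the given class or one of its subclasses."""
    return [rid for rid, c, _ in records if cls in ancestors(c, SUBCLASS_OF)]


def find_by_type(type_term):
    """Find records whose type is the given term or a narrower term."""
    return [rid for rid, _, t in records
            if t is not None and type_term in ancestors(t, BROADER_TYPE)]


print(find_by_class("E24 Physical Man-Made Thing"))
print(find_by_type("vase"))
```

In both cases a query against the standard class E24 still finds the refined records, but only a system that knows the published lookup tables can exploit the finer distinctions.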
But in some situations this approach seems to be too complicated, and the creation of subclasses or the usage of publicly available and more specialized ontologies seems to be more feasible. This approach has the advantage that those ontologies are already published and often well documented. Defining subclasses and exploiting the type hierarchy has the disadvantage that both extension mechanisms are not covered by the standard, so that other information systems cannot exploit them “out of the box.” Hand-crafted extensions have to be documented and published so that others can easily retrieve the information and build their software accordingly. In any case, the CIDOC CRM becomes more powerful if it is used in connection with other ontologies like SKOS for attaching thesauri, which will be looked at in more detail below.
All properties of the CRM have definite domains and ranges that belong to the vocabulary itself. The CRM offers classes for describing people, places and bibliographic entities. This makes it seem as if the ontology claimed the authority not only for describing museum objects but also for covering most of their contexts. However, it does not seem to be useful to treat the CRM as an all-in-one device suitable for each and every purpose. Additionally, the CRM is an upper-level ontology and therefore cannot and does not intend to cover the peculiarities of each cultural heritage domain. Although it does provide an extension mechanism, variations and specializations have to be documented and published (preferably in a formal language). For each object, a specific unambiguous URI needs to be assigned. This information does not include hints on how that URI could be resolved into a human- or machine-readable representation. For example, the URI of an image does not include the information on how to decode and display it. One could argue that this does not belong to the scope of the CRM and needs to be addressed on other layers, like content negotiation, as described above.
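The opening remark of this paragraph, that every property has a definite domain and range, lends itself to a simple consistency check. The sketch below is an assumption-laden illustration: the domain and range of P108B are read off the definitions quoted earlier (a made thing produced by a production event), the subclass A1 Sculpture is the community extension used as an example above, and the function names are hypothetical.

```python
# Domain and range per property; the pairing for P108B is inferred from the
# definitions quoted in the text, not copied from the standard itself.
PROPERTIES = {
    "P108B was produced by": ("E24 Physical Man-Made Thing", "E12 Production"),
}

# Community extension used as an example in the text.
SUBCLASS_OF = {"A1 Sculpture": "E24 Physical Man-Made Thing"}


def is_a(cls, expected):
    """True if cls equals expected or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == expected:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def is_valid(subject_class, prop, object_class):
    """Check a statement against the domain and range of its property."""
    domain, range_ = PROPERTIES[prop]
    return is_a(subject_class, domain) and is_a(object_class, range_)


print(is_valid("A1 Sculpture", "P108B was produced by", "E12 Production"))
print(is_valid("E12 Production", "P108B was produced by", "A1 Sculpture"))
```

A check of this kind only works for extensions whose subclass relations have been published, which is the documentation requirement stated above.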
4.2 Linking to bibliographic information
The CIDOC CRM mainly concentrates on describing material cultural heritage and museum objects. But the value of this information source can be increased by linking the material objects to other sources of information like gazetteers or bigger bibliographic databases. Information about archaeological objects in Arachne, for