Implementing an overall mapping workflow - Towards semantic interoperability of cultural information systems

After introducing the tesserae that form the mapping process, this section concentrates on how they are tied together. Figure 5.6 shows how an overall mapping workflow should be implemented to contribute to a system that establishes interoperability. Although the model tries to separate the different steps of the mapping process, some steps cannot be clearly separated and need to interact. Entity linking, for example, needs indexing.

Figure 5.6: The overall interoperability workflow.

First, both data models had to be represented in a uniform way for further processing. Perseus’ data was exported to XML using the Collection service. Since the Collection service could not, at the time, handle the amount of data generated by an export of all of Arachne’s data, the MySQL Query Browser was used to export Arachne’s object, literature, and images tables. This export created more than 80,000 files, one for each data object, which had to be distributed across a hierarchical directory structure. Thus, scalability issues already arise in the mapping phase. The export step resulted in a one-to-one XML representation of the data models.
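To illustrate how a large number of exported files can be spread over a hierarchical directory structure, here is a minimal Python sketch. The layout (one directory per block of 1,000 record ids), the function names, and the table name `object` used in the usage note are assumptions made for illustration; the source does not describe the actual layout of the export.

```python
from pathlib import Path


def sharded_path(root: Path, table: str, record_id: int, width: int = 6) -> Path:
    """Return the target path for one exported record (hypothetical layout).

    The id is zero-padded to `width` digits; everything except the last three
    digits becomes directory levels, so a leaf directory holds at most 1,000 files.
    """
    padded = f"{record_id:0{width}d}"
    prefix = padded[:-3]
    levels = [prefix[i:i + 3] for i in range(0, len(prefix), 3)]
    return root.joinpath(table, *levels, f"{record_id}.xml")


def write_record(root: Path, table: str, record_id: int, xml_text: str) -> Path:
    """Write one record to its sharded location, creating directories as needed."""
    target = sharded_path(root, table, record_id)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(xml_text, encoding="utf-8")
    return target
```

With these assumptions, `sharded_path(Path("export"), "object", 12345)` points to `export/object/012/12345.xml`.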

The next step aimed at cleaning the resulting data set. For building the mapping prototype, the Unix sed command was used with various regular expressions to extract bibliographic entities from different fields. Some XML code for bibliographic entries was not valid and had to be dropped until a tool is available that uses heuristics to fix the broken XML markup. Since Arachne maintains its own bibliographic database, extraction of bibliographic entities was easier on this side. In future versions, it would be worthwhile to experiment with professional data cleaning tools to extract more data from fields with informal internal structure. This step again resulted in an XML representation, but this time as an intermediate data model that can be processed more easily.
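The prototype used sed for this step. The following Python sketch only illustrates the same idea of pulling bibliographic entities out of a free-text field with a regular expression. The citation layout ("Author, Title (Year) pages", separated by semicolons), the pattern, and the sample data in the usage note are invented for illustration and do not reproduce the expressions actually used.

```python
import re

# Hypothetical citation layout: "Author, Title (Year) pages"
CITATION = re.compile(
    r"^\s*(?P<author>[^,]+),\s*(?P<title>.+?)\s*"
    r"\((?P<year>\d{4})\)\s*(?P<pages>.*?)\s*$"
)


def extract_citations(field_text: str):
    """Split a free-text field on ';' and parse each entry.

    Returns (found, skipped): parsed entries as dicts, and the raw entries
    that did not match the assumed layout and would need manual review.
    """
    found, skipped = [], []
    for entry in field_text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        match = CITATION.match(entry)
        if match:
            found.append(match.groupdict())
        else:
            skipped.append(entry)
    return found, skipped
```

Called on a made-up field such as `"A. Author, Some Title (1990) 12-14; see also notes"`, the function returns one parsed entry and reports `"see also notes"` as skipped, which mirrors the way unparseable entries had to be dropped in the prototype.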

Following the mapping documentation, an XSLT style-sheet was crafted that implements the mapping rules described. By processing each XML file with this style-sheet, the intermediate data model was then mapped to RDF/XML conforming to the CIDOC CRM. Additionally, the “Eyeball” tool described earlier was used to validate the resulting RDF code against the published CRM RDF definition file. This mapping step also involved assigning a unique identifier, in accordance with RFC 3986, to each material or conceptual object that was created during the mapping process.4
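The mapping itself was implemented in XSLT. As a rough illustration of what such a rule does, the following Python sketch turns one intermediate record into RDF/XML and mints an identifier whose path segments are percent-encoded as RFC 3986 requires. The record layout (`<record id="…"><title>…</title></record>`), the base URI, the placeholder CRM namespace, and the choice of `E22_Man-Made_Object` with an `rdfs:label` are all assumptions for illustration, not the rules from the mapping documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
CRM = "http://example.org/cidoc-crm#"  # placeholder, not the published CRM namespace
BASE = "http://example.org/id/"        # hypothetical base for minted identifiers

ET.register_namespace("rdf", RDF)
ET.register_namespace("rdfs", RDFS)
ET.register_namespace("crm", CRM)


def mint_uri(source: str, record_id: str) -> str:
    """Build an identifier; reserved characters in each segment are percent-encoded."""
    return f"{BASE}{quote(source, safe='')}/{quote(str(record_id), safe='')}"


def record_to_rdf(record_xml: str, source: str) -> str:
    """Map one intermediate record (assumed layout) to a typed RDF/XML node."""
    record = ET.fromstring(record_xml)
    record_id = record.get("id")
    if record_id is None:
        raise ValueError("record has no id attribute")
    root = ET.Element(f"{{{RDF}}}RDF")
    obj = ET.SubElement(
        root,
        f"{{{CRM}}}E22_Man-Made_Object",  # assumed target class for object records
        {f"{{{RDF}}}about": mint_uri(source, record_id)},
    )
    title = record.findtext("title")
    if title:
        ET.SubElement(obj, f"{{{RDFS}}}label").text = title
    return ET.tostring(root, encoding="unicode")
```

Because the identifier depends only on the source name and the record id, the same object receives the same URI in every file that mentions it, which is what the later merging step relies on.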

Once each database record had been mapped to a single RDF/XML file, the data was ready for merging. At this point in the process, all data objects have been cleaned for proper record linkage. The current implementation relies on a simple mechanism that copies the resulting files to a common directory. Thereafter, the RDF information was merged by ingesting the files into the Longwell browser. Longwell ingests all RDF files and uses the “Lucene” search engine to connect objects that bear the same identifiers.5 This mechanism is useful because it accumulates everything that has been said about a specific entity, even if the information is distributed among different physical files. Currently, this is the only form of record linkage that has been achieved; the prototype does not perform multilingual entity identification, because the infrastructure that would make this step feasible is still missing.
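The effect of merging by identifier can be sketched without Longwell or Lucene. The following Python example is not how Longwell works internally; it is a simplified stand-in that reads several RDF/XML documents and accumulates, per `rdf:about` identifier, every property stated about that resource. It only looks at top-level nodes and their direct properties, which is an assumption made to keep the example short.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def merge_descriptions(rdf_documents):
    """Group (property, value) pairs by subject URI across several RDF/XML strings."""
    merged = defaultdict(set)
    for text in rdf_documents:
        for node in ET.fromstring(text):
            subject = node.get(f"{{{RDF}}}about")
            if subject is None:
                continue  # blank nodes are ignored in this sketch
            for prop in node:
                # a property either points to another resource or carries a literal
                value = prop.get(f"{{{RDF}}}resource") or (prop.text or "").strip()
                merged[subject].add((prop.tag, value))
    return merged
```

If two files each describe a resource with the same identifier, the result contains a single entry for it that holds the statements from both files.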

Longwell was also used to visually present the results of the mapping process for debugging purposes. In this case, both indexing and presentation were achieved by ingesting the data into the Longwell browser software. The next section gives a more in-depth introduction to how Longwell has been configured to display cultural heritage data objects.

The mapping workflow presented was chosen to gain experience with the application of Semantic Web concepts to cultural heritage data and to explore the issues that are connected with it. The overall mapping process definitely needs more automation by implementing means to publish and harvest, index, and present the data. Once this automation has been established, further steps need to be introduced. These include multilingual record linkage and the interaction with authority-naming services for better linking data objects and accumulating multilingual metadata. This, in turn, would better facilitate services like cross-language information retrieval in very specialized domains like classics and archaeology. On the conceptual level, the mapping should be enhanced iteratively by including more database fields and extracting more information.

4 The full text of the Request for Comments can be found at http://tools.ietf.org/html/rfc3986.

Chapter 6 Knowledge visualization for the Semantic Web

The visualization of Semantic Web data poses an interesting challenge to software developers. Data structures of almost unlimited complexity need to be presented to users who are usually not aware of the underlying concepts of information representation. The CIDOC CRM, for example, promotes the consideration of events when modeling cultural heritage data. It has been argued that this approach facilitates better data integration and is therefore necessary. Although this method of describing data may be useful and logical, users will probably not immediately agree that it is necessary. This assumption is backed by the observation of current documentation practice: here, events are evidently not considered necessary and are therefore not explicitly documented. This section deals with exploring means to process and visualize data that resulted from prior information integration. First, a survey of paradigms for visualizing Semantic Web data is undertaken. Then, the Longwell browser that was used to index and display the RDF/XML is introduced. Longwell has also been useful for exploring scalability issues with Semantic Web data.

6.1 Paradigms for visualizing linked data

When it comes to presenting data to the user, let's say a scientist engaged in cultural heritage research, a fundamental conflict has to be resolved. Maeda states that “simplicity is about subtracting the obvious, and adding the meaningful” [41]. RDF, however, facilitates the formulation of amazingly complex data models in which huge amounts of interlinked data objects can reside. How, then, do we extract what is meaningful and useful for the end user? Visualization in information technology has always aimed at making explicit the coherence that would remain implicit without smart algorithms.

Geroimenko et al. pioneered the area of visualization of Semantic Web data. They propose the extensive use of SVG and X3D to implement different visualization paradigms.1 The identified application fields range from creating distributed user interfaces, through illustrating complex networks (for example, citation networks), to sophisticated models for knowledge visualization using dynamic SVG charts. One topic particularly interesting for archaeological research is the use of SVG and XSLT to display geo-referenced data on interactive maps [27]. Different user communities need to work out which processing and visualization paradigms they require to create useful data presentations. The most straightforward approach to presenting Semantic Web data would be to generate a textual presentation of web resources that may be linked by the well-known HTTP link mechanism. By using a simple XSLT transformation, an RDF/XML document can be converted to an HTML file that includes links to other data objects. This strategy is pursued by browsers like Disco or the well-known Tabulator.2
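The textual presentation described above is normally produced with XSLT. Since the idea is independent of the transformation language, the following Python sketch shows the same conversion with the standard library instead: each described resource becomes a heading, literal properties become plain list items, and properties that point to other resources become hyperlinks. The function name and the flat HTML layout are assumptions for illustration, and only top-level nodes with direct properties are handled.

```python
import html
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_to_html(rdf_xml: str) -> str:
    """Render each described resource as a heading followed by a property list."""
    parts = []
    for node in ET.fromstring(rdf_xml):
        subject = node.get(f"{{{RDF}}}about", "")
        parts.append(f"<h2>{html.escape(subject)}</h2>\n<ul>")
        for prop in node:
            # keep only the local name of the property for display
            name = html.escape(prop.tag.rsplit("}", 1)[-1])
            target = prop.get(f"{{{RDF}}}resource")
            if target:
                href = html.escape(target, quote=True)
                parts.append(f'<li>{name}: <a href="{href}">{href}</a></li>')
            else:
                literal = html.escape((prop.text or "").strip())
                parts.append(f"<li>{name}: {literal}</li>")
        parts.append("</ul>")
    return "\n".join(parts)
```

Following a generated link leads to whatever the target identifier resolves to, so this kind of browsing only works well when the identifiers assigned during mapping are dereferenceable.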

But there are more sophisticated approaches to displaying RDF data. Since RDF is based on the idea that all data can internally be represented as a graph, there is an almost unlimited number of ways to visualize the data. Robertson created several examples of visualizing cultural heritage data within the scope of his Historical Event Markup and Linking Project (HEML). Historical events can be displayed either on a map, to emphasize the spatial element of an event, or on a timeline, to emphasize the temporal dimension. This particular example is interesting because the HEML language can easily be translated into CIDOC CRM and vice versa [53]. Other projects display data objects as nodes of a graph to emphasize the relations an object has to its surrounding contexts.

In document Towards semantic interoperability of cultural information systems making ontologies work. (Page 54-58)