LINKEDARC.NET - Untangling the web of data : a critical analysis of the Archaeological Semanti

On building an Archaeological Semantic Web RDF triplestore

‘While we're acknowledging writing theory as making stuff, can we also acknowledge making stuff as doing theory?’ (Kirschenbaum 2013)

Introduction

Methodological knowledge, or the principle that there is epistemic value to be found in practice, is a central idea in the digital humanities (McCarty 2005 p. 120). Archaeologists, because of the focus that they place on material things, are comfortable with this concept. Branches of the discipline, such as experimental archaeology (Outram 2008) and material cultural studies (DeMarrais et al. 2005), and popular archaeological schools of thought, such as phenomenology (Tilley 1994), all see the value in the act of doing. While the tools of the trade of the digital humanist tend to be different from those of the archaeologist, the principle remains the same. In place of the trowel and the drafting board, the digital humanist employs the keyboard, the digital display screen and the computer. This chapter presents the central practical component of this thesis: the building of linkedarc.net, an Archaeological Information System. I have designed linkedarc.net from the outset to be fully compliant with Linked Open Data practice, insofar as has been practically possible and achievable within the scope of this project.

No programme of work exists in a knowledge vacuum, however, and we begin this chapter with a review of Archaeological Information Systems and Cultural Heritage Information Systems, which are noteworthy for their adherence to Linked Open Data practices. Having situated linkedarc.net within its technological and intellectual space, we move on to the primary topic of the chapter. This involves an in-depth account and explanation of the linkedarc.net server and web app technical design.

Linked Open Data and archaeology: the trend setters

The archaeological method is fundamentally built on an ability to assimilate, categorise, analyse and interpret large amounts of disparate data with a view to ultimately finding meaning in them (Hodder & Hutson 2003 p. 209). In the following section we analyse a number of projects that are applying this philosophy of practice using techniques that are compatible with the Archaeological Semantic Web ideal.

The Numishare framework, Nomisma and the Online Coins of the Roman Empire

The Numishare framework was developed by Ethan Gruber and Andrew Meadows of the American Numismatic Society as a software platform on which to allow the creation, management and dissemination of numismatic data online (E. Gruber 2015). The project blossomed out of Gruber’s efforts to make the University of Virginia’s Art Museum’s Numismatic Collection available as an online resource (University of Virginia Library 2015). The figurehead of the Numishare community of applications is Nomisma.74 Nomisma is primarily a Linked Open Data vocabulary of terms, which can be used to populate the data fields of digital coin collections (E. Gruber 2012; Meadows & Gruber 2014). In this role, Nomisma acts as a collection of ‘minted’ URIs that can be used by coin databases, thereby making them compliant with Berners-Lee’s fourth law of Linked Data (Berners-Lee 2006). While Nomisma primarily holds URIs for Roman Republican and Imperial coin types, mints, persons related to the creation of coin material and emperors and deities represented on numismatic iconography, it also, but to a far lesser extent, hosts URIs for corresponding entities for ancient Greek coin types, which are decidedly less canonical.

Nomisma also performs a valuable secondary function: it aggregates the numismatic content created by its partner projects, which are not insignificant in number.75 Principal among these has been the American Numismatic Society and, again building on the Numishare framework and the Nomisma URIs, Gruber and the Institute for the Study of the Ancient World at New York University have produced the Online Coins of the Roman Empire (OCRE) collection (American Numismatic Society & Institute for the Study of the Ancient World at New York University 2015). OCRE is an impressive achievement, albeit one with a few caveats. It provides the user with a faceted search function, which makes the discovery and analysis of data on the site intuitive and powerful. It is also possible to generate graphs that summarise certain aspects (the percentage of certain coin types, for example) of the dataset. Figure 29 shows a map of all of the known find spots and mint locations for all of the collection’s coins. This is a successful use of a map visualisation to present distribution data. It clearly highlights the geographical areas from which most of OCRE’s coins originated.

74 http://nomisma.org

75 At the time of writing, there were 31,201 unique coin entries hosted on Nomisma. These all

Figure 29: a distribution map showing the find spots and mint locations for the OCRE coins

Figure 30: the Nomisma record for Carian mint, Apollonia Salbace, represented as RDF Turtle

In many respects Nomisma is a model Linked Open Data resource for the cultural heritage sector. Besides the digital databases associated directly with the activities of the American Numismatic Society, Nomisma has been used to provide thesauri for many other institutions around the world, such as the British Museum, the UK’s Portable Antiquities Scheme and more recently the University College Dublin Classics Museum. At a technical level, before Nomisma made the decision to move to a Linked Open Data platform, it was built upon an XHTML content base, which used Apache Tomcat, Cocoon, Solr, Orbeon and eXist. Its Linked Open Data services are now provided by an Apache Jena RDF triplestore system. Fuseki is used to provide a SPARQL interface to the RDF data and all of its URIs can be retrieved as RDF data in various serialisations. Nomisma’s data is structured using a custom ontology, which is available on its website.76 The ontology is largely flat, which makes it relatively straightforward to implement (Figure 30). Nomisma also employs Dublin Core Terms and the Basic Geo Vocabulary (Brickley 2003). The success of Nomisma is no doubt helped by the fact that the study and taxonomy of Roman Imperial coins in particular is very well established. For instance, the compilation of the multi-volume Roman Imperial Coinage catalogue began in 1923. This research environment has created a high level of taxonomic standardisation within the field and because of this, ancient coins and their types are ideal candidates for representation using RDF.
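To make the idea of a SPARQL interface more concrete, the following is a minimal sketch of how a client might prepare a query for an endpoint of the kind that Fuseki exposes. It uses only the Python standard library and does not send anything over the network. The endpoint address is a placeholder and the query is deliberately generic; neither is taken from Nomisma’s own documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: a stand-in, not Nomisma's documented address.
ENDPOINT = "https://example.org/sparql"

# A generic query that lists a few triples; it assumes nothing about
# the ontology behind the endpoint.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Prepare (but do not send) a SPARQL-over-HTTP GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would execute the query against a live endpoint; that step is omitted here because the address is only illustrative.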

PeriodO

Dates and periods have always presented archaeologists with one of their sternest ontological challenges (Binding 2010; Pare 2008). Even in the pre-digital age, being able to chronologically interrogate disparate information sets, each using its particular periodization system, has presented enormous difficulties. How might one conceptualise the relationship between Artefact A, which has been given the label ‘Late Iron Age’ with another labelled ‘Early Archaic’? Is there an overlap in time between the two subjects? How long do these periods last? Might one artefact have been in use for just a short period, perhaps at the beginning of a very long block of time, while the other enjoyed a much longer period of use? The problem becomes even more entrenched when one considers that periods are often spatially contingent – for example, the ‘Iron Age’ in the eastern Mediterranean might well represent an entirely different range of calendar years to that of the Iron Age in north-western Europe.

Artefacts serve as invaluable proxies for archaeologists. Their analysis and the dates given to them allow archaeologists to date the stratigraphy within which they were found and in turn date the site that contained that stratigraphy (Renfrew & Bahn 2004 pp. 122– 123). Because of this, material dating has always been recognised as one of the key tools at the archaeologist’s disposal and issues such as site contemporaneity (Dewar 1991) and the re-dating of multiple sites as and when new discoveries are brought to light, ultimately emanate from what are often necessarily fuzzy artifactual timescales. This temporal ‘fuzziness’ is exacerbated in the case of prehistoric archaeology.

The PeriodO project77 (Rabinowitz et al. 2015) was established to deal with this very problem using Linked Open Data techniques. As Rabinowitz notes, periods are ‘essentially arbitrary conventions about which scholars disagree’ (Rabinowitz 2014 p. 1). However, despite this perceived arbitrariness, periods form the basis of the archaeological method. Rabinowitz et al. have realised, quite rightly, that the period problem is an ideal candidate for resolution using Linked Open Data methods. The ability of Linked Open Data datasets to be dynamic as a result of their being connected to other (possibly changing) datasets means that the variability of periods, which in the pre-digital world had demanded such enormous intermittent reinvestments of labour, now becomes an attractive feature of a dynamic linked knowledge system.

PeriodO is, essentially, a gazetteer of period values that are created by its community of users. Rabinowitz is at pains to point out that the knowledge production for the system will follow a ‘bottom up’ model, eschewing the ‘top down’ knowledge that characterised so much of pre-digital archaeological practice. Democracy of opinion fits well with the Linked Open Data philosophy. However, despite this potential for a more liberal knowledge environment, the majority of Linked Open Data creation has originated in the corridors of the larger knowledge institutions and, as such, PeriodO’s attempt to swim against this general current should be applauded.

77 http://perio.do

Figure 31: period assertions related to the 'Iron Age' concept for the Levant region (Rabinowitz 2014 fig. 3)

Technically, PeriodO presents what it calls ‘period assertions’ as concepts that are modelled using SKOS.78 As such, PeriodO’s data is a vocabulary of terms that are exposed as subject URIs and which include links to citation and contextual information about the period in question. As is shown in Figure 31, the ‘Iron Age’ in the Levant has been understood in a number of different ways over the years by scholars, with Aharoni assigning it a date range of ‘1200-586 BC’, while Younker, writing in 2003, is less specific and gives it a more descriptive date range of ‘1200 BC – mid-6th century BC’. Others, as would be expected, have slightly different understandings of the Iron Age in Israel. These are all valid readings of this period in their own way. Consumers of PeriodO data will have access to all of these various interpretations, with no one source gaining precedence over another.
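The questions raised earlier about whether two labelled periods overlap can be illustrated with a short sketch. It treats each period assertion as a closed range of calendar years, with years BC written as negative integers, which is a simplification that ignores the fuzziness discussed above. Only the Aharoni range comes from the text; the other two assertions, and all of the class and function names, are hypothetical and exist purely for illustration. This is not PeriodO’s own data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodAssertion:
    """One scholar's dating of a named period (years BC are negative)."""
    label: str
    source: str
    start: int
    stop: int


def overlap(a: PeriodAssertion, b: PeriodAssertion):
    """Return the shared (start, stop) year range, or None if disjoint."""
    start, stop = max(a.start, b.start), min(a.stop, b.stop)
    return (start, stop) if start <= stop else None


# Date range quoted in the text for the Levantine Iron Age.
aharoni = PeriodAssertion("Iron Age", "Aharoni", -1200, -586)
# Hypothetical assertions with invented dates, for illustration only.
other_a = PeriodAssertion("Period X", "hypothetical source", -700, -480)
other_b = PeriodAssertion("Period Y", "hypothetical source", -500, -300)

print(overlap(aharoni, other_a))  # (-700, -586)
print(overlap(aharoni, other_b))  # None
```

Because each assertion keeps its own source, several competing readings of the same label can sit side by side without any one of them taking precedence.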

It is the intention of the project to serialise its RDF data as JSON-LD79 and its DOIs are to be minted using the EZID system of the California Digital Library. The dataset will be hosted on GitHub and it is hoped that the project will eventually also provide a SPARQL endpoint for its data.

78 See Chapter 3 for an overview of where SKOS fits within the Semantic Web model.

Pleiades and Pelagios

The Pleiades80 and Pelagios81 projects have much in common with the Nomisma and PeriodO initiatives both in terms of the technical strategies that they have employed and in their basic objectives. Whereas PeriodO takes as its subject matter the vocabularies that archaeologists use to assign periods to material and sites and Nomisma is interested in the concepts employed by the numismatic community, Pleiades looks to ancient place-names for its subject matter (Simon et al. 2012).

Place-names change through time and vary across languages. How might a human user or, more significantly, a computer realise that the terms Athens, Athenae and Αθήνα all refer to the same conceptual or geospatial entity? This potential for misinterpretation becomes a huge problem when you are dealing with multiple sources, which might have originated in very different temporal or spatial contexts. In the example just cited, Athenae was the term used between 750 BCE-640 CE to describe what is now the city of Athens in English or Αθήνα in Modern Greek. For a human, this collection of terms to describe Athens might not prove an insurmountable challenge. However, the language-processing abilities that humans take for granted, and which allow them to handle such variation with apparent ease, are difficult for a computer to reproduce and, as we have already seen, it is intended that Linked Open Data resources be consumed primarily by computers.

The Pleiades service aims to solve these place name ambiguities by providing a URI for each and every spatial concept in its datastore. It was originally designed to deal exclusively with ancient82 place names but, more recently, it has begun to spread its net wider across more modern datasets (Simon et al. 2014). The Pleiades website can be joined by any member of the public – a user need not have any particular academic credentials. Any user is entitled to create new Pleiades data, although this does not become an official part of the Pleiades listing until it has been vetted by the community of peer reviewers.

80 http://pleiades.stoa.org

81 http://pelagios-project.blogspot.com

Each Pleiades entry is ontologically structured as follows (Figure 32). A pleiades:Place entity links to a pleiades:Location entity via the pleiades:hasLocation predicate. The pleiades:Place entity contains documentary data such as the name, type, description and creator of the resource. It also contains links to bibliographic resources that substantiate the claims contained within the other associated data. The pleiades:Location entity details the geospatial coordinates of the place in question – it can be a point or a polygon. pleiades:Location also expects the user to specify a start and an end date that associates this particular location with the relevant pleiades:Place entity for that period of time. The logic here is that, while a geospatial specification is essentially eternal, places are transient and come and go out of existence with the passage of time.

Figure 32: the Pleiades data model
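The Place and Location relationship described above can be sketched as two small Python classes. This is an illustrative approximation rather than the Pleiades schema itself: the class names, field names and the `location_in` helper are assumptions, the `locations` list merely stands in for the pleiades:hasLocation predicate, and the coordinates are rough values chosen for the example. The date range is the one quoted earlier for the name Athenae, borrowed here to give the location a validity span.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Location:
    """Coordinates plus the span of years for which they apply to a place."""
    coordinates: List[Tuple[float, float]]  # one pair = point, several = polygon
    start_year: int  # negative for BCE
    end_year: int


@dataclass
class Place:
    """Documentary record; `locations` plays the role of pleiades:hasLocation."""
    name: str
    description: str
    locations: List[Location] = field(default_factory=list)

    def location_in(self, year: int) -> Optional[Location]:
        """Return the first location whose date range covers `year`."""
        for loc in self.locations:
            if loc.start_year <= year <= loc.end_year:
                return loc
        return None


# Approximate, illustrative coordinates; dates as quoted in the text.
athenae = Place(
    "Athenae",
    "Ancient city",
    [Location([(37.97, 23.72)], -750, 640)],
)

print(athenae.location_in(-400) is not None)  # True
print(athenae.location_in(1900))  # None
```

Keeping the dates on the location rather than on the place mirrors the reasoning given above: the coordinates persist, while their association with a named place holds only for a bounded period.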

Unsurprisingly, Pleiades data is structured as RDF triples. A Pleiades namespace exists that exposes an ontology of classes, predicates and vocabulary values and these are used alongside the ever-present FOAF, Dublin Core Terms, SKOS and Geo ontologies. Some less well known ontologies used by the project are the PROV Ontology (Belhajjame et al. 2013) and the Citation Typing Ontology (Shotton 2010).

The Pelagios project builds on the foundations of Pleiades. Its goals are twofold: to make it easier for data content providers working on material related to the ancient world to publish place names and to make it easier for users interested in consuming ancient place name data to do so across all of the datasets that expose this type of information (Simon et al. 2012).

Pelagios is a community of users who contribute content that they have curated. The idea is that by creating an aggregation of all of the Linked Open Data datastores that contain references to ancient place names, users will be able to find links between these datasets based on their inclusion of canonical Pleiades place name URIs. Besides this requirement to use Pleiades URIs, partners must also model their place name data using the Open Annotation (Sanderson et al. 2013) and Vocabulary of InterLinked Open Datasets (VoID) (Alexander et al. 2011) ontologies.
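The aggregation principle can be sketched in a few lines: if every partner annotates its items with a shared place URI, links between datasets fall out of a simple grouping step. The dataset names, item identifiers and URIs below are all hypothetical placeholders, not real Pleiades identifiers, and the sketch makes no attempt to reproduce the Open Annotation or VoID modelling that Pelagios actually requires.

```python
from collections import defaultdict

# Hypothetical annotations: (dataset name, item identifier, place URI).
# The URIs are placeholders, not real Pleiades identifiers.
annotations = [
    ("coins", "coin-1", "https://example.org/places/athenae"),
    ("inscriptions", "text-7", "https://example.org/places/athenae"),
    ("coins", "coin-2", "https://example.org/places/roma"),
]


def index_by_place(rows):
    """Group annotated items from all datasets under their place URI."""
    index = defaultdict(list)
    for dataset, item, place_uri in rows:
        index[place_uri].append((dataset, item))
    return index


def cross_dataset_links(index):
    """Keep only the places referenced by more than one dataset."""
    return {
        uri: items
        for uri, items in index.items()
        if len({dataset for dataset, _ in items}) > 1
    }


links = cross_dataset_links(index_by_place(annotations))
for uri, items in links.items():
    print(uri, items)
```

The grouping only works because both datasets use the same canonical URI for the place, which is the reason the shared identifier is a condition of partnership.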

The Getty Linked Open Data resources

The Getty vocabularies83 were first formalised in the 1980s and their aim then and now is to ‘help people categorize, describe, and index cultural heritage objects and information’ (Alexiev et al. 2014). There are four Getty thesauri (AAT, TGN, CONA and ULAN), all of which are compliant with ISO and NISO standards for thesaurus construction, although to date only the first two have been published as Linked Open Data resources.84

The Art and Architecture Thesaurus (AAT) exposes generic concepts related to the fields of art, architecture, broader cultural heritage contexts and conservation. AAT concepts can be used to describe the basic type of a particular work of art or architecture, its stylistic conventions, its component material parts and the subject matter that it deals with. The AAT is concept-centric and it is polyhierarchical, in the sense that any one concept can have more than one parent. The AAT as of August 2014 held 42,000 concepts.
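Polyhierarchy is easiest to see in a small example. In the sketch below, each concept maps to a list of broader concepts, so a concept with two entries in its list has two parents. The concept labels are invented for illustration and are not real AAT terms or identifiers; the traversal simply collects every broader concept that can be reached through any parent.

```python
# Hypothetical concepts; these are not real AAT terms or identifiers.
parents = {
    "bronze mirror": ["mirrors", "bronze objects"],
    "mirrors": ["furnishings"],
    "bronze objects": ["metalwork"],
    "furnishings": [],
    "metalwork": [],
}


def ancestors(concept, parents):
    """Collect every broader concept reachable through any parent."""
    found = set()
    stack = list(parents.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


print(sorted(ancestors("bronze mirror", parents)))
```

In a strict tree the same traversal would follow a single chain upwards; here it fans out across two branches, which is what allows one concept to be found under more than one heading.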

The Thesaurus of Geographic Names (TGN) is concerned with places and their names. The TGN essentially tackles the same problem as Pleiades; places can be associated with multiple names and these can change through time. The TGN contained 1.26 million places and 1.85 million names linked to these place concepts in August 2014.

83 http://www.getty.edu/research/tools/vocabularies

84 This was correct as of 20 March 2015. The ULAN thesaurus was scheduled to be made

Figure 33: Arthur Evans as represented by the Getty ULAN data service

The Union List of Artist Names (ULAN) vocabulary was created by Getty as a repository of names chiefly associated with practitioners of the arts, although as we will see, this list is not exclusively populated with artists. For example, the ULAN subject 500212319 groups information about the British archaeologist, Arthur Evans (Figure 33). The ULAN record includes fields such as the various names, nationality, roles (for example archaeologist, et cetera), gender, birth and death dates, and citations related to Evans. 582,000 names were represented by the ULAN in 2014.

The Cultural Objects Name Authority (CONA) dataset contains information about specific fixed and movable works of art and architecture. The CONA is currently only available to view online in a limited form. However, the few records that do exist exemplify the agenda of the project quite well. Take for example the record with the subject ID of 700000206. This details information about the Pergamon Altar, which is currently housed in the Collection of Classical Antiquities in the Berlin Museum. The CONA record includes information about the altar’s various titles, its type (which points to the AAT altar concept), its classification as a piece of architecture, its creation date in the 2nd century BCE, its current and past locations, style (another AAT reference), related works, subject matter, citations and general notes. CONA is, therefore, a consumer of the AAT, TGN and ULAN. While those three vocabularies describe idealised concepts, the CONA’s data represents material objects.

The Getty vocabularies present a vast range of concepts that can be used by cultural heritage professionals to categorise their material. They allow for the categorisation of most conceivable aspects of cultural heritage material objects. And now with their provision of Linked Open Data interfaces (Harpring 2014), they are making it possible not only for big institutional museums and galleries but also for smaller initiatives, such as the kerameikos.org project,85 to utilise their resources.
