
The semantic annotation output is targeted at supporting the STAR project for interoperability and semantic discovery of archaeological information, in this case grey literature. In order to achieve its aims, the prototype adopts the CIDOC CRM and its extension CRM-EH ontology while utilising a range of English Heritage terminological resources (glossaries and thesauri). The prototype is developed in the language engineering framework GATE and processes a corpus of archaeological excavation and evaluation reports originating from the OASIS grey literature library at the Archaeology Data Service.

3.2.1 Ontology-Based versus Thesauri-Based Semantic Annotation

As discussed in Chapter 2 (section 2.5.2), ontologies can be employed by semantic annotation systems to provide specialised vocabulary and relationships that are exploited during IE. Examples of such systems include h-TechSight (Maynard et al. 2005), which delivers semantic annotation via an ontology that consists of 9 main concepts (Location, Organization, Sector, Job Title, Salary, Expertise, Person and Skill), populated with a vocabulary of 29,000 instances. The ontology is used to enable the task of Named Entity Recognition (NER) over job advertisements. Similarly, KIM (Kiryakov et al. 2004) uses KIMO, an ontology populated with a vocabulary of general-purpose classes, such as Location, City, Currency, Date and Job Title. KIM is a comprehensive semantic application that utilises KIMO beyond NER in order to support document retrieval at the semantic level.

Ontology-based IE projects such as KIM and h-TechSight make use of ontologies that explicitly define classes and their properties. Classes and sub-classes form hierarchies which combine terminological and ontological specialisation. Bontcheva et al. (2004) describe a technique of ontology engagement in IE using the GATE OWLIM-Lite processing resource. The tool associates ontological classes with one or more vocabulary listings (gazetteers). Lists contain entries which populate ontological classes with instances; for example, the class Location can be associated with the list Cities, containing the entries London, Paris, Athens etc.
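The association between classes and gazetteer lists can be illustrated with a minimal Python sketch. It is not the GATE or OWLIM-Lite API; the dictionary layout and the function names `build_lookup` and `annotate` are assumptions made for illustration, and only the Location/Cities example is taken from the text above.

```python
# Illustrative sketch only: this does not use GATE or OWLIM-Lite.
# The data layout and function names are assumptions for this example.

# Each ontological class is associated with one or more gazetteer lists.
CLASS_TO_LISTS = {
    "Location": ["Cities"],
}

# Each gazetteer list holds entries that populate the class with instances.
GAZETTEER_LISTS = {
    "Cities": ["London", "Paris", "Athens"],
}


def build_lookup(class_to_lists, gazetteer_lists):
    """Map each lower-cased gazetteer entry to the set of classes it populates."""
    lookup = {}
    for cls, list_names in class_to_lists.items():
        for list_name in list_names:
            for entry in gazetteer_lists[list_name]:
                lookup.setdefault(entry.lower(), set()).add(cls)
    return lookup


def annotate(text, lookup):
    """Return (token, classes) pairs for single-word tokens found in the lookup."""
    results = []
    for raw in text.split():
        token = raw.strip(".,;:!?()\"'")
        classes = lookup.get(token.lower())
        if classes:
            results.append((token, sorted(classes)))
    return results


lookup = build_lookup(CLASS_TO_LISTS, GAZETTEER_LISTS)
matches = annotate("Finds were sent from Athens to London.", lookup)
```

In this sketch the class of a match is decided by list membership alone, which is the explicit vocabulary-to-class association that the following paragraphs call into question.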

However, according to Tsujii and Ananiadou (2005), the tendency of an ontology-based approach to make explicit semantic associations between vocabulary and individual context is problematic. They argue that contextual dependencies strongly influence the IE process: “... relationships among concepts as well as the concepts themselves remain implicit in text, waiting to be discovered”. Thus, inherent language ambiguity and diversity, as well as domain-dependent inferences and knowledge, cannot be comprehensively encoded in ontological structures. Instead, they argue that terminological thesauri, as language-oriented structures, can support implicit definition of semantics.

In particular, they highlight the case of the Biomedicine domain, which poses problems for purely logical deduction. Different communities within the same broad field have evolved their particular vocabularies and language uses. Interpretation of context is important for the selection of relevant facts, where inevitably language is ambiguous.

“Most of the widely used ontologies have been built on a top-down manner. They are limited in their conceptual coverage and they are mainly oriented for human (expert) use. The difficulties and limitations lie with the definition of concepts (classes, sets of instances) since one is expected to identify all instances of a concept. This task demands evidence from text.”

(Tsujii and Ananiadou 2005).

In some applications, the matching of instances with ontology classes may be less problematic, where language use is constrained or perhaps highly specialised.

The archaeology domain, however, shares some of the context-dependency discussed above. As in the Biomedicine domain, context-independent relationships as explicitly defined in logical ontologies are not the norm. Contextual factors dictate whether, for example, a particular place is an archaeological “context”, or whether a physical object constitutes an archaeological “find”. Such forms of entity specialisation cannot be inferred from a specialised vocabulary alone but are derived from contextual evidence. Therefore, complementary use of terminological and ontological resources may prove a promising avenue of investigation.

3.2.2 Development Pathway Criteria

The prototype development has an experimental focus aimed at obtaining practical experience and results to inform the large-scale semantic annotation effort of this thesis, as discussed in Chapters 4, 5, and 6. The prototype aims to explore an innovative semantic annotation process which does not rely on the use of a single ontology, as is typical of ontology-based IE, but instead makes complementary use of ontological and terminological resources (Figure 3.1).

This particular development pathway was chosen on the basis of the following criteria:

Semantic annotation (via archaeologically specific CRM-EH entities) cannot be reached by using only specialised vocabulary, as discussed in section 3.2.3. The archaeology domain vocabulary does not contain heavily specialised scientific terms and CRM-EH specialisation is subject to contextual dependencies.

The CRM and CRM-EH ontologies have no directly associated vocabularies but define a range of entities and properties which provide semantic definitions and clarifications for the cultural heritage (CRM) and archaeology (CRM-EH) domains.

The semantic annotation effort is targeted at delivering semantic indices which will support information retrieval at the level of concepts. Thus, the prototype system is not concerned with the annotation of unique instances (e.g. post-hole A, post-hole B) but with the annotation of concepts, i.e. the concept of post-hole rather than individual post-hole occurrences. However, a concept may have term variants (e.g. post hole).

Using both ontological and terminological resources empowers semantic annotations with a dual conceptual reference system that enables information retrieval on the ontological level, on the terminological level and the combination of both.
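The dual conceptual reference described in the criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype's implementation: the identifiers `term:posthole`, `term:ditch` and `entity:Context` are hypothetical placeholders rather than real thesaurus or CRM-EH identifiers, and the names `Annotation`, `annotate` and `search` are invented for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical identifiers, for illustration only.
# Term variants resolve to a single terminological concept.
VARIANTS = {
    "post-hole": "term:posthole",
    "post hole": "term:posthole",
    "posthole": "term:posthole",
    "ditch": "term:ditch",
}

# Simplifying assumption: a fixed concept-to-entity table.
CONCEPT_TO_ENTITY = {
    "term:posthole": "entity:Context",
    "term:ditch": "entity:Context",
}


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    text: str
    concept: str  # terminological reference
    entity: str   # ontological reference


def annotate(text, variants, concept_to_entity):
    """Annotate term variants with both a concept and an entity reference."""
    # Longest variants first, so multi-word forms win over shorter ones.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b",
        re.IGNORECASE,
    )
    annotations = []
    for m in pattern.finditer(text):
        concept = variants[m.group(0).lower()]
        annotations.append(
            Annotation(m.start(), m.end(), m.group(0), concept,
                       concept_to_entity[concept])
        )
    return annotations


def search(annotations, concept=None, entity=None):
    """Retrieve by concept, by entity, or by the combination of both."""
    return [
        a for a in annotations
        if (concept is None or a.concept == concept)
        and (entity is None or a.entity == entity)
    ]


text = "Post-hole A cut the ditch; a second post hole lay nearby."
annotations = annotate(text, VARIANTS, CONCEPT_TO_ENTITY)
```

Both "Post-hole" and "post hole" resolve to the same concept, so retrieval operates on the concept rather than on individual occurrences. The fixed concept-to-entity table is a deliberate simplification: as argued in section 3.2.1, the ontological specialisation of a term depends on contextual evidence and would not in general be assigned by lookup alone.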