8.3 The Extraction Workflow
8.3.8 Determining Semantic Patterns
As a result of the relation extraction phase, a set of subjects S, first-level objects O1, second-level objects O2 and field references F has been extracted. In this last step, the semantic relations are built, which are (1:1)-relations of the form (concept1, type, concept2).
When building the relations, the following rules apply:
1. All subjects are related to each other by an equal relation.
2. All subjects are related to all field references by a part-of relation.
3. All subjects are related to all objects.
4. Objects are not (necessarily) related among each other.
The last point may be astonishing at first glance. In fact, there are situations where objects are also related among each other, e.g., by an equal relation as in Netbooks are small laptops or notebooks (<laptop, equal, notebook>). However, in many cases there is no clear relation between the objects, as the following example shows:
A rocket is a missile, spacecraft, aircraft or other vehicle that obtains thrust from a rocket engine.72
The four objects missile, spacecraft, aircraft and vehicle stand in different relations to each other. For example, the relation between aircraft and spacecraft is related (co-hyponyms), the relation between aircraft and vehicle is is-a, while there is no clear relation between missile and aircraft. In contrast to the netbook example, no equal relations occur between those four objects. Therefore, the Wikipedia approach does not build relations between object terms, as the relation type cannot be unequivocally determined.
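The four rules listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the thesis: the function name `build_relations`, the `object_groups` parameter and the choice of is-a for the rocket example are assumptions made for the illustration. The relation type between subjects and objects is passed in, since it depends on the pattern found in the definition sentence.

```python
from itertools import combinations


def build_relations(subjects, field_refs, object_groups):
    """Build (concept1, type, concept2) triples for one article.

    object_groups is a list of (relation_type, objects) pairs, e.g. one
    pair for the first-level and one for the second-level objects.
    (Hypothetical interface, chosen for this sketch only.)
    """
    relations = []
    # Rule 1: every unordered pair of subjects is linked by an equal relation.
    relations += [(a, "equal", b) for a, b in combinations(subjects, 2)]
    # Rule 2: every subject is linked to every field reference by part-of.
    relations += [(s, "part-of", f) for s in subjects for f in field_refs]
    # Rule 3: every subject is linked to every object of every level.
    for rel_type, objects in object_groups:
        relations += [(s, rel_type, o) for s in subjects for o in objects]
    # Rule 4: no relations are built between the objects themselves.
    return relations


# The rocket sentence: one subject, four first-level objects (is-a assumed).
rocket = build_relations(
    ["rocket"], [], [("is-a", ["missile", "spacecraft", "aircraft", "vehicle"])]
)
for triple in rocket:
    print(triple)
```

Note that no triple connects missile and aircraft, which reflects rule 4.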
Let R be the set of (1:1)-relations gained from the extracted concepts and patterns of a Wikipedia article. Let RS be the set of synonym relations that were extracted (RS ⊂ R).
It holds:
\[ |R_S| = \binom{|S|}{2} = \frac{|S| \times (|S| - 1)}{2} \tag{8.1} \]
The number of relations between subjects and objects is |S| × |O1| resp. |S| × |O2|. The number of relations between field references and subjects is |S| × |F |. The overall number of extracted relations is thus:
\[ |R| = \binom{|S|}{2} + (|S| \times |O_1|) + (|S| \times |O_2|) + (|S| \times |F|) \tag{8.2} \]
In some cases, |R| can become quite large. Consider an article with 5 subject terms, 1 field reference, 2 first-level objects and 2 second-level objects. According to Equation 8.2, the number of extracted (1:1)-relations is 10 + 10 + 10 + 5 = 35. Though this is a rare case, the example illustrates the general richness of the approach compared to other work that concentrates on simply linking Wikipedia pages with other resources like WordNet, which usually leads to only one link per Wikipedia article.
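As a quick check of Equation 8.2, the count can be evaluated directly. The helper name `relation_count` and its parameter names are hypothetical and only serve this illustration.

```python
from math import comb


def relation_count(n_subjects, n_objects1, n_objects2, n_field_refs):
    """Number of (1:1)-relations according to Equation 8.2."""
    synonym_relations = comb(n_subjects, 2)          # Equation 8.1
    subject_object = n_subjects * (n_objects1 + n_objects2)
    subject_field = n_subjects * n_field_refs
    return synonym_relations + subject_object + subject_field


# 5 subjects, 2 first-level objects, 2 second-level objects, 1 field reference
print(relation_count(5, 2, 2, 1))  # prints 35
```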
The extracted relations are written into files that serve as import files for SemRep. In detail, there is a file for field references, a file for redirect relations and a file for all remaining extracted relations (which comprises the main data set). In the next chapter, we will explain how the relations are integrated into this repository and how it can be used for mapping enrichment.
9 The SemRep System
In this chapter, we introduce the semantic repository SemRep. This repository allows different lexicographic resources to be imported and queried, and was designed specifically to support mapping enrichment and matching in general. After an introduction (Section 9.1), we describe the general architecture of SemRep in Section 9.2. In Section 9.3, we illustrate how SemRep is used to determine the semantic relation type between two concepts, which we call query execution. This section also covers the efficient search for paths between two concepts, the calculation of the path type in longer (indirect) paths and the scoring of paths (path confidence calculation). Finally, we discuss some technical and implementation aspects of SemRep in Section 9.4.
9.1 Introduction
Once the semantic relations have been obtained from the previously presented Wikipedia approach, they need to be integrated together with the relations from WordNet and further resources in a repository that provides a holistic view on the data. This semantic repository, dubbed SemRep, is designed to serve as a background knowledge repository for STROMA and is primarily used to answer questions about the relation type holding between two given input words. Unlike other frameworks that combine semantic relations from different resources (like [44]), SemRep was designed for the field of ontology mapping, which led to some specific requirements and necessary adaptations. In particular, the following requirements had to be met:
• Velocity. The repository has to process queries within a short time in order to be usable in tasks where a user is interacting with the mapping enrichment tool and expects that the processing of the mapping is carried out within some seconds or at most within a few minutes.
• Extendability. The repository must be able to integrate further resources.
• Multilingualism. The repository should be able to cope with resources from different languages.
• Simplicity. Using and extending the repository should be as simple as possible.
• Correctness and precision. The repository should answer queries (as far as possible) correctly and should return the most relevant relation type.
Presently, SemRep contains semantic relations from up to five different resources: WordNet, Wikipedia, UMLS, OpenThesaurus (German relations) and ConceptNet. ConceptNet is not used in the default configuration of SemRep though, as its quality is relatively low and could impair the general usefulness of SemRep. The Wikipedia resource consists of the automatically extracted Wikipedia relations, the field references and the separately extracted Wikipedia redirects.
Table 9.1 gives an overview of those resources, including their language, the way they were created and the number of concepts and relations they comprise (after all filters have been applied). Note that "Creation" refers to the way the relations were built.
Although the Wikipedia redirects were extracted automatically, they were created manually (by users). This is an important difference from the automatically detected and created Wikipedia relations and from the automatically created ConceptNet relations, which may be imprecise or incorrect. The number of relations refers to the number of links between two concepts, i.e., one relation describes both directions. Thus, the two atomic relations <car, is-a, vehicle> and <vehicle, inverse is-a, car> are treated as one relation, and a resource may contain fewer relations than concepts, as in the case of Wikipedia Redirects. Apparently, this resource contains many concepts x that have only one relation to another concept y. Apart from Wikipedia Redirects, the other resources contain more relations than concepts; WordNet even has a concept-relations ratio of almost 1:10.
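The idea that one stored relation describes both directions can be illustrated with a small sketch. It is not SemRep's actual data structure: the class `RelationStore`, the `INVERSE` table and the type name "inverse part-of" are assumptions made for this example (only "inverse is-a" appears in the text above).

```python
# Inverse of each relation type; equal and related are symmetric.
# "inverse part-of" is an assumed name used only in this sketch.
INVERSE = {
    "equal": "equal",
    "related": "related",
    "is-a": "inverse is-a",
    "inverse is-a": "is-a",
    "part-of": "inverse part-of",
    "inverse part-of": "part-of",
}


class RelationStore:
    """Stores each link between two concepts once and answers both directions."""

    def __init__(self):
        # Unordered concept pair -> (source, type, target) as it was added.
        self._links = {}

    def add(self, source, rel_type, target):
        self._links[frozenset((source, target))] = (source, rel_type, target)

    def relation_type(self, a, b):
        link = self._links.get(frozenset((a, b)))
        if link is None:
            return None
        source, rel_type, _ = link
        # Return the stored type in the stored direction, its inverse otherwise.
        return rel_type if source == a else INVERSE[rel_type]

    def __len__(self):
        return len(self._links)


store = RelationStore()
store.add("car", "is-a", "vehicle")
print(store.relation_type("car", "vehicle"))   # is-a
print(store.relation_type("vehicle", "car"))   # inverse is-a
print(len(store))                              # 1 relation for 2 concepts
```

Counting relations this way, a resource in which most concepts take part in only one link ends up with fewer relations than concepts.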
Resource                     Lang.    Creation       #Concepts  #Relations
WordNet                      English  Manually         119,895   1,881,346
Wikipedia                    English  Automatically    548,610   1,488,784
Wikipedia Field References   English  Automatically     66,965      72,500
Wikipedia Redirects          English  Manually       2,117,001   1,472,117
UMLS                         English  Manually         938,527   1,265,703
OpenThesaurus                German   Manually          58,473     614,559
ConceptNet                   English  Automatically     90,364     245,320
Table 9.1: Resources used in SemRep.
Figure 9.1: Sample excerpt from the repository.
Though SemRep was implemented as a background knowledge repository for schema and ontology mapping, there are further fields of application. First, SemRep could also be used for matching, i.e., in the first step of the two-step approach presented in this thesis. Second, it can be used as a repository for (manually) verified, correct mappings. Such an approach is also called mapping re-use, where the manually confirmed correspondences of mappings are re-used to support new match tasks. Of course, a combination of such mappings and lexicographic resources in SemRep is also possible, and it is even possible to discover potentially incorrect lexicographic relations in other import resources by means of such verified mappings. Finally, any kind of dictionary, thesaurus or background knowledge ontology can be integrated in SemRep, which would also make it applicable as a knowledge base for specific domains, for entity resolution, for specific linguistic tasks like query answering, and for related fields.
Similar to STROMA, SemRep uses different configuration parameters, thresholds and weights, and is a fully configurable system. A list of all default parameters used in SemRep is provided in Appendix B.