

4.2 Background Knowledge

4.2.3 Manually or Collaboratively Generated Resources

The Princeton WordNet19 is one of the most popular and best-known thesauri [48]. Founded by George A. Miller in 1986 and continuously developed since then, it is now among the most comprehensive linguistic resources for the English language. In analogy to linguistic semantics, WordNet is based upon mental concepts (synsets) that are linked with each other by hypernym, meronym or antonym relations. Words referring to the same synset are synonyms. The most recent version, Princeton WordNet 3.0, consists of 155,287 unique words (117,798 nouns) in 117,659 synsets (82,115 noun synsets). In our experience, WordNet is a very effective resource in the field of ontology mapping; however, the currently available version dates from 2006, which means that several modern terms like cloud computing, netbook or tablet PC are not yet contained.

In [84], an approach to effectively use WordNet in schema and ontology matching is presented. Given a word A and a WordNet synset S in which A appears, the authors calculate the Super WordNet Synset (SWS) for A, which is an aggregation of all hypernyms, hyponyms, meronyms and holonyms of S. Given two possible match concepts A and B, the Super WordNet Synsets SWS(A) and SWS(B) are calculated, and depending on the number of words the two SWSs have in common, the two terms are considered either a match or a mismatch. The authors can even distinguish between equal and is-a relations by the number of overlapping words, although the relation type detection was not evaluated.
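The overlap idea behind the SWS approach can be illustrated with a minimal sketch. It is not the implementation from [84]: the toy lexicon, the inclusion of the synset's own words in the SWS, the function names and both thresholds are assumptions made purely for illustration.

```python
from typing import Dict, Set

# Hypothetical toy lexicon (NOT real WordNet data): for each word, the words
# of one synset and of its hypernyms, hyponyms, meronyms and holonyms.
TOY_RELATIONS: Dict[str, Dict[str, Set[str]]] = {
    "car": {
        "synonyms": {"car", "automobile", "motorcar"},
        "hypernyms": {"motor vehicle", "vehicle"},
        "hyponyms": {"cab", "coupe", "convertible"},
        "meronyms": {"wheel", "engine"},
        "holonyms": set(),
    },
    "vehicle": {
        "synonyms": {"vehicle"},
        "hypernyms": {"conveyance"},
        "hyponyms": {"motor vehicle", "car", "bicycle"},
        "meronyms": {"wheel"},
        "holonyms": set(),
    },
    "banana": {
        "synonyms": {"banana"},
        "hypernyms": {"fruit"},
        "hyponyms": {"plantain"},
        "meronyms": {"peel"},
        "holonyms": {"bunch"},
    },
}
# "automobile" shares the synset of "car" in this toy lexicon.
TOY_RELATIONS["automobile"] = TOY_RELATIONS["car"]


def super_synset(word: str) -> Set[str]:
    """Aggregate all related words of `word` into one set (its SWS)."""
    aggregated: Set[str] = set()
    for related_words in TOY_RELATIONS.get(word, {}).values():
        aggregated |= related_words
    return aggregated


def overlap(a: str, b: str) -> int:
    """Number of words that SWS(a) and SWS(b) have in common."""
    return len(super_synset(a) & super_synset(b))


def classify(a: str, b: str, equal_threshold: int = 6, isa_threshold: int = 2) -> str:
    """Map the overlap size to a relation; both thresholds are assumed values."""
    common = overlap(a, b)
    if common >= equal_threshold:
        return "equal"
    if common >= isa_threshold:
        return "is-a"
    return "mismatch"


if __name__ == "__main__":
    for pair in [("car", "automobile"), ("car", "vehicle"), ("car", "banana")]:
        print(pair, overlap(*pair), classify(*pair))
```

In practice, the thresholds would have to be tuned on real data; the source only states that the number of overlapping words is used to separate the cases.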

GermaNet20 is the German language counterpart to WordNet and has been developed at the University of Tübingen since 1997. Technically, it uses the same data structure as WordNet, but differs slightly in some aspects. For instance, GermaNet only regards meronyms without any subdivision into part meronyms, member meronyms and substance meronyms as in WordNet, and it uses artificial concepts to enhance the taxonomic structure. It also provides some more specific relation types, like entailment (e.g., to try – to succeed) and causality (e.g., to kill – to die) [83]. As of 2014, GermaNet consists of 93,246 synsets, 110,738 terms and 110,170 relations.

Word nets have been developed for many further languages, although the Princeton WordNet remains the most comprehensive and most frequently used resource.21 A collection of available word nets for different languages is provided by the Open Multilingual Wordnet.22

EuroWordNet23 is a multilingual thesaurus similar to GermaNet or WordNet that combines the vocabulary of different European languages; WordNet served as the basis for the development of this multilingual project. Its data structure is therefore closely related to WordNet, but it additionally contains an upper ontology as a semantic framework for the different languages [83]. Altogether, word nets from eight languages (Czech, Dutch, English, Estonian, French, German, Italian, Spanish) have been integrated and are interlinked by the so-called interlingual index (ILI). Integrating such resources entailed the development of equivalence relations between synsets of the specific languages and the WordNet synsets. The project was completed in 1999. Some sample data is provided free of charge, while access to the full database requires a commercial license.

19 http://wordnet.princeton.edu/

20 http://www.sfs.uni-tuebingen.de/GermaNet/

21 http://globalwordnet.org/

22 http://compling.hss.ntu.edu.sg/omw/

23

The Universal WordNet approach is a learning-based approach to automatically develop a word net for more than 200 languages.24 Starting again from the Princeton WordNet 3.0, the authors make use of different web dictionaries, Wiktionary, word alignment techniques on web corpora, monolingual thesauri and existing word nets of other languages [39]. Today, the resource contains more than 1.5 million words and can be extended by an additional data set of named entities (MENTA). The generated data sets can be downloaded free of charge.

Crowdsourcing, in which a group of volunteers participates in the creation of linguistic resources, is a promising approach to alleviate the laborious development of such resources. OpenThesaurus25 is an exemplary collaborative project for the German language, which comprises about 93,000 words in 30,000 synsets. The vocabulary also appears to contain a considerable number of entities (like German cities, companies or politicians). As the contributors are not linguistic experts, the quality seems slightly below that of resources like GermaNet or WordNet. Another example, though less influential, is WikiSaurus26, a sub-project of Wiktionary providing synonyms, hypernyms, hyponyms and antonyms for selected concepts (while meronyms and holonyms are rare). It currently provides approx. 1,400 categories, though recent activity seems rather low and no API is available so far.

OpenOffice and LibreOffice use a list of thesauri in different languages (including German and English), which are included in office programs like Writer or Calc. The thesaurus for the English language is also provided as a text file and was automatically derived from WordNet 2.0.27,28 An inspection of the thesaurus file, which contains about 142,000 concepts, shows that it does not contain any more information than WordNet 2.0, and that it is thus not a helpful resource if WordNet is already used. For the German version, OpenOffice and LibreOffice use the data from OpenThesaurus, so that no additional knowledge can be obtained from it if OpenThesaurus is already used.

24 https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/

Less known, but still an interesting and beneficial resource is ConceptNet.29 It is a multilingual semantic network similar to WordNet, but contains more specific relations like X is used for Y, X is made of Y or X is etymologically derived from Y [90, 137]. ConceptNet is based upon the Open Mind Common Sense corpus (OMCS), which is a comprehensive repository of declarative sentences created and added by a crowd of volunteers. Such sentences, like A cook stove is used to prepare a meal, contain basic knowledge used for semantic reasoning and related tasks. ConceptNet extracts such sentences and translates them into semantic relations like "cook stove used_for preparing meals". Additionally, ConceptNet connects knowledge from the Princeton WordNet, parts of DBpedia, Umbel and Wiktionary. Though there are many useful linguistic relations in ConceptNet, there is also a considerable amount of useless or irrelevant information, which is an obvious consequence of the automatic extraction approach. For instance, relations like <computer, is-a, friend> can be found in the data sets, which may originate from a sentence like "The Computer is your friend.". ConceptNet was used in the context of this research, and 687,000 of the approx. 8.7 million English relations were extracted. However, this background knowledge led to worse results in the evaluation, mostly because of the many imprecise relations. Thus, it was not further pursued in this research, yet it could be used successfully in other fields, e.g., for query expansion [73] or word sense disambiguation [74, 31], where a combination of WordNet and ConceptNet apparently led to the best results.
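How an automatic, pattern-based extraction can yield both useful and misleading relations may be illustrated by the following sketch. The regular expressions, the relation names and the function extract_relation are hypothetical and chosen for illustration only; they are not ConceptNet's actual extraction rules.

```python
import re
from typing import Optional, Tuple

# Hypothetical sentence patterns (an assumption, not ConceptNet's real rules).
# More specific patterns are listed first, the generic "is a" pattern last.
PATTERNS = [
    (re.compile(r"^(?:an?|the)\s+(.+?)\s+is used (?:to|for)\s+(.+?)\.?$", re.I), "used_for"),
    (re.compile(r"^(?:an?|the)\s+(.+?)\s+is made of\s+(.+?)\.?$", re.I), "made_of"),
    (re.compile(r"^(?:an?|the)\s+(.+?)\s+is\s+(?:an?|the|your)\s+(.+?)\.?$", re.I), "is_a"),
]


def extract_relation(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return a (source, relation, target) triple, or None if no pattern fits."""
    text = sentence.strip()
    for pattern, relation in PATTERNS:
        match = pattern.match(text)
        if match:
            return (match.group(1).lower(), relation, match.group(2).lower())
    return None


if __name__ == "__main__":
    # A sentence carrying useful common-sense knowledge.
    print(extract_relation("A cook stove is used to prepare a meal"))
    # A figurative sentence that produces a misleading is_a relation.
    print(extract_relation("The Computer is your friend."))
```

The second call shows that a purely syntactic pattern cannot tell a figurative statement from a taxonomic one, which is the kind of imprecision described above.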

Umbel (Upper Mapping and Binding Exchange Layer) is a top-level (reference) ontology primarily used for Linked Open Data resources [105]. It defines some 28,000 general concepts within a strict taxonomic hierarchy and serves as a reference framework for domain ontologies by promoting interoperability with external data sets and domains.30 Among the concepts there are also some entities (such as car brands and companies).

The Cyc project is a logical framework to formalize common sense knowledge [86]. Founded in 1984, it has been continuously extended by knowledge engineers at CycCorp and is primarily designed to foster artificial intelligence and semantic reasoning [90]. The open source version of this large knowledge base, called OpenCyc, provides endpoints for semantic web tools.31 The concepts and assertions were manually created and concentrate on natural, simple facts like cottage is made of wood and wood is able to burn. Using Cyc, a reasoner is thus able to conclude that a cottage is able to catch fire. The knowledge is represented in the Cyc Language (CycL) and stored in different Cyc ontologies, which are provided for download free of charge. OpenCyc has been used in various ways in the field of the Semantic Web, e.g., to classify Wikipedia articles [113], for document annotation [151] or for rule generation in event processing systems [100]. The latest version 4.0 comprises about 2.1 million triples and 239,000 terms that are organized within a manually designed ontology.
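The kind of conclusion drawn in the cottage example can be sketched with a few lines of forward chaining. This is plain Python, not CycL and not the OpenCyc API; the triple encoding, the relation names made_of and able_to, and the single inference rule are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

# The two facts from the example, in an assumed triple encoding.
FACTS: Set[Fact] = {
    ("cottage", "made_of", "wood"),
    ("wood", "able_to", "burn"),
}


def infer(facts: Iterable[Fact]) -> Set[Fact]:
    """Apply one assumed rule until nothing new is derived:
    if X is made_of Y and Y is able_to Z, then X is able_to Z."""
    known: Set[Fact] = set(facts)
    changed = True
    while changed:
        changed = False
        for subject, relation, material in list(known):
            if relation != "made_of":
                continue
            for other, capability_rel, action in list(known):
                if capability_rel == "able_to" and other == material:
                    derived = (subject, "able_to", action)
                    if derived not in known:
                        known.add(derived)
                        changed = True
    return known


if __name__ == "__main__":
    for fact in sorted(infer(FACTS)):
        print(fact)
```

The loop runs until a fixpoint is reached, so chains such as "village made_of cottage-like parts" would also propagate if corresponding facts were added; whether such a rule is sound in general is a modelling question that this sketch does not address.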

FrameNet32 is a different approach to organizing the vocabulary of a language. Instead of synsets, hypernyms, meronyms or antonyms, the lexical database consists of semantic frames that can refer to a process type, situation type or event type. A semantic frame describes the entities and participants that take part in such a type, and which relations hold between them. For instance, the semantic frame transfer specifies that there is a person A (the "donor"), a person B (the "recipient") and an object C (the object to transfer). The frame can be activated by verbs like to transfer, to give, to hand, etc. Frames are related to each other, e.g., the frame "committing crime" is related to "crime investigation", which again is related to the frame "criminal process" (the relation type can be considered as entails) [49]. FrameNet consists of 800 hierarchically arranged frames that can be triggered by more than 10,000 elements, the so-called "lexical units" [83].

30 http://www.umbel.org/

31 http://sw.opencyc.org/

32 https://framenet.icsi.berkeley.edu/fndrupal/

UMLS33 is a large domain-specific knowledge base and thesaurus for the biomedical domain [20]. It was founded as early as 1986 and combines the vocabulary from various biomedical thesauri and taxonomies in the so-called MetaThesaurus34. Additionally, the terms in the thesaurus are linked to the Semantic Network35 of UMLS, which serves for classification and provides a large set of relationships between concepts. Altogether, 133 semantic types and 54 different relationships are distinguished, so that UMLS tends to be both a thesaurus and a knowledge base. It contains about 707,000 terms and 58 million relations, though many relations occur twice or even more often. For example, there may be part-of relations of the kind (Nail, Toe), (Nail, toe), (nail, Toe) and (nail, toe), since UMLS is case-sensitive. Additionally, some obvious equal relations like (Stomach, stomach) are contained.
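When such relations are used as background knowledge, the case variants can be collapsed in a simple preprocessing step. The following sketch assumes that relations are available as (source, relation, target) triples and that case differences carry no meaning; both are assumptions, and the function name normalize_relations is hypothetical.

```python
from typing import List, Tuple

Relation = Tuple[str, str, str]  # (source term, relation type, target term)

# Sample triples taken from the examples in the text.
RAW_RELATIONS: List[Relation] = [
    ("Nail", "part-of", "Toe"),
    ("Nail", "part-of", "toe"),
    ("nail", "part-of", "Toe"),
    ("nail", "part-of", "toe"),
    ("Stomach", "equal", "stomach"),
]


def normalize_relations(relations: List[Relation]) -> List[Relation]:
    """Case-fold all terms, drop trivial equal relations between identical
    terms, and remove duplicates while keeping the original order."""
    seen = set()
    cleaned: List[Relation] = []
    for source, relation, target in relations:
        key = (source.casefold(), relation, target.casefold())
        if relation == "equal" and key[0] == key[2]:
            continue  # e.g. (Stomach, stomach) adds no knowledge
        if key in seen:
            continue  # case variant of a relation that was already kept
        seen.add(key)
        cleaned.append(key)
    return cleaned


if __name__ == "__main__":
    print(normalize_relations(RAW_RELATIONS))
```

Applied to the five sample triples, only the single relation (nail, part-of, toe) remains.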

There is also a selection of knowledge bases that were manually developed. WikiData36 is a collaboratively generated, machine-readable knowledge base about facts and entity data (like birth dates of persons), thus being similar to the automatically developed DBpedia project (see below). Similar to WikiData, but possibly with a stronger focus on the Semantic Web and the Linked Open Data cloud, Freebase37 is a knowledge base that contains millions of facts on a broad range of topics, which were entered by volunteers [21].

GeoNames38 is a collection providing more than 10 million geographic names like countries, towns or rivers, together with information about geo-coordinates, elevation, population, etc. The knowledge base was derived from many geographic resources, and volunteers are able to add new information or update existing entries.

Despite the many available resources, WordNet remains the only solid and appropriate resource for the schema and ontology matching domain. Many match approaches use WordNet as additional background knowledge [57, 85, 78, 122, 36, 81, 130, 103], but practically no other linguistic resource is used. Some match tools for the biomedical domain exploit UMLS as domain-specific background knowledge, e.g., [85, 78, 81], but suitable alternatives for the general-purpose domain are not available. We see a high demand for such general-purpose resources to improve schema and ontology mapping based on background knowledge.