2.3 Methods for the Building of Multilingual Ontologies
2.3.1 Ab initio construction
In the case of ontological resources, this localization procedure is generally adopted when (i) the ontology is being developed from the start and multilinguality is included at the same time, or (ii) the decision to build a multilingual ontology is taken during the first stages of ontology development.
The common feature of most works that adopt this localization procedure is the effort towards the construction of an upper-level conceptual core. On the one hand, this conceptual core makes the ontology under consideration accessible in many languages; on the other, it allows the ontology to represent many different cultures. Below we review the most relevant works in this area.
Two well-known efforts that adopt this procedure are EuroWordNet (EWN) [Vossen, 1997] and MultiWordNet (MWN) [Bentivogli et al., 2000]. EuroWordNet was explicitly based on, and had the same structure as, the Princeton WordNet¹¹ [Miller, 1995], developed as a monolingual lexical database for American English. The work initiated in the EWN project is now being continued by the Global Wordnet Association (GWA)¹².
¹⁰ “Ab initio” is a Latin phrase meaning “from the start”; literally, something done “from scratch”.
The aim of EuroWordNet was to develop a multilingual lexicon with wordnets for several European languages (English, Dutch, Spanish and Italian), which could be used “to improve recall of queries via semantically linked variants in any of these languages”. The general approach for EWN was to build the multilingual database by taking advantage of existing resources in each language. Participants from each country were responsible for a language-specific wordnet, using the tools and resources they had already built up in previous national and international projects. As in WordNet, information about nouns, verbs, adjectives and adverbs was organized in synsets. A synset is “a set of words with the same part-of-speech that can be interchanged in a certain context” [Vossen, 2004]. Synsets are related to each other by semantic relations such as hyponymy or meronymy. The wordnets in EuroWordNet are considered “autonomous language specific ontologies”. These wordnets are then interconnected through an Inter-Lingual-Index (ILI), an unstructured list of meanings taken mainly from Princeton WordNet (specifically WordNet 1.5), which provides the mappings across the wordnets, as illustrated in Figure 2.2.
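The role of the ILI as a flat list of meanings that only points to language-specific synsets can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EWN database format: the class names, the record identifier `ili-dog`, and the example words are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Synset:
    """A set of same part-of-speech words in one language-specific wordnet."""
    language: str
    pos: str
    words: frozenset


@dataclass
class InterLingualIndex:
    """Unstructured list of meanings; a record only points to synsets."""
    records: dict = field(default_factory=dict)  # ili_id -> list of Synset

    def link(self, ili_id, synset):
        self.records.setdefault(ili_id, []).append(synset)

    def equivalents(self, synset, target_language):
        """Return synsets in target_language that share an ILI record."""
        found = []
        for linked in self.records.values():
            if synset in linked:
                found.extend(s for s in linked
                             if s.language == target_language and s != synset)
        return found


# Hypothetical example data (not taken from EWN).
ili = InterLingualIndex()
en_dog = Synset("en", "n", frozenset({"dog"}))
nl_hond = Synset("nl", "n", frozenset({"hond"}))
es_perro = Synset("es", "n", frozenset({"perro"}))
for s in (en_dog, nl_hond, es_perro):
    ili.link("ili-dog", s)
```

In this sketch the wordnets never reference each other directly: a Dutch synset reaches its Spanish counterpart only through the shared index record, which is what allows each wordnet to keep its own internal structure.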
Figure 2.2: The global architecture of the EWN database [Vossen, 2004].
¹¹ http://wordnet.princeton.edu/
MultiWordNet (MWN) is a multilingual lexical database including information about English and Italian words. The model adopted within MWN consists of building language-specific wordnets while keeping as many as possible of the semantic relations available in WordNet. This was done by building the new synsets in correspondence with the WordNet synsets, whenever possible, and importing semantic relations from the corresponding English synsets; i.e., if there are two synsets in WordNet and a relation holding between them, the same relation holds between the corresponding synsets in the new language. By strictly adhering to the WordNet building criteria and subjective choices, the MWN model minimizes the discrepancies that can appear when two wordnets are built independently for two different languages. However, MultiWordNet explicitly recognizes the presence of “lexical gaps” in the correspondence between different languages, due to missing direct translations of some words.
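The relation-import rule described above can be sketched as follows. The function name, the synset identifiers, and the relation label are hypothetical. How MWN actually records lexical gaps is not specified here, so the sketch simply assumes that an English synset without a counterpart is missing from the correspondence table and that relations touching it are set aside.

```python
def import_relations(english_relations, correspondence):
    """Copy each relation whose two English synsets both have a counterpart.

    english_relations: iterable of (source_id, relation, target_id) triples.
    correspondence: dict mapping English synset ids to new-language synset
        ids; an English synset absent from the dict is treated as a lexical
        gap (an assumption of this sketch).
    """
    imported, skipped = [], []
    for source, relation, target in english_relations:
        if source in correspondence and target in correspondence:
            imported.append(
                (correspondence[source], relation, correspondence[target])
            )
        else:
            skipped.append((source, relation, target))
    return imported, skipped


# Hypothetical identifiers, used only for illustration.
english_relations = [
    ("en:dog", "hyponym_of", "en:canine"),
    ("en:canine", "hyponym_of", "en:carnivore"),
]
# No entry for "en:carnivore": it plays the role of a lexical gap.
correspondence = {"en:dog": "it:cane", "en:canine": "it:canide"}

imported, skipped = import_relations(english_relations, correspondence)
```

Returning the skipped triples alongside the imported ones keeps the gaps visible, so that they can be reviewed by hand rather than silently dropped.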
Another approach is given by [Bonino et al., 2004], in which the authors introduce a simple approach to multilingual semantic elaboration using the Distributed Open Semantic Elaboration platform (DOSE). This approach uses a language-independent ontology in which concepts are defined as high-level entities for which language-dependent definitions are specified. Each such entity is linked to a set of definitions, one for each supported language, and to a set of words that the authors call a synset. Figure 2.3 shows the multilingual ontology deployment used in the DOSE platform.
Figure 2.3: Multilingual ontology deployment in the DOSE platform.
The ontology is physically distinct from definitions and synsets, allowing separate management of concepts and language-specific information, isolating the semantic and the textual layers. This assumption guarantees sufficient expressive power to model conceptual entities typical of each language and, at the same time, reduces redundancy by collapsing all common concepts into a single multilingual entity. Synsets and textual definitions are created by human experts through an interactive refinement process. A multilingual team works on concept definitions by comparing ideas and intentions, aided by domain experts with linguistic skills for at least two different languages, and formalizes topics in a mutual learning cycle.
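The separation between the semantic and the textual layers can be illustrated with a small data structure. This is a sketch under stated assumptions rather than the DOSE implementation: the class names, the methods, and the example concepts and definitions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """Language-specific information attached to one concept."""
    definition: str
    synset: set


@dataclass
class MultilingualOntology:
    # Semantic layer: language-independent concept ids and their parent.
    parents: dict = field(default_factory=dict)
    # Textual layer, kept separately: (concept_id, language) -> LexicalEntry.
    lexicon: dict = field(default_factory=dict)

    def add_concept(self, concept_id, parent=None):
        self.parents[concept_id] = parent

    def describe(self, concept_id, language, definition, words):
        if concept_id not in self.parents:
            raise KeyError(f"unknown concept: {concept_id}")
        self.lexicon[(concept_id, language)] = LexicalEntry(definition, set(words))

    def concepts_for(self, word, language):
        """Find the concepts whose synset in `language` contains `word`."""
        return sorted(cid for (cid, lang), entry in self.lexicon.items()
                      if lang == language and word in entry.synset)


# Hypothetical example content.
onto = MultilingualOntology()
onto.add_concept("Vehicle")
onto.add_concept("Car", parent="Vehicle")
onto.describe("Car", "en", "A road vehicle with four wheels.",
              ["car", "automobile"])
onto.describe("Car", "it", "Veicolo stradale a quattro ruote.",
              ["automobile", "macchina"])
```

Because the concept `Car` is stored once and the English and Italian entries only refer to it, adding a further language touches the textual layer alone.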
Finally, the work presented in [Segev and Gal, 2008] proposes an ontology-based model for building multilingual applications. The model is based on a global ontology manually designed for a specific domain. Additionally, it uses local contexts to specify the ontology. The combination of ontologies and contexts lends itself well to multilingual applications in which a single ontology fails to capture all the nuances that stem from language and cultural differences. The procedure used for adapting an existing ontology to the needs of a multilingual environment includes four steps: selection, collection, extraction, and adaptation. In the selection step, an existing ontology is chosen. In the collection step, sample documents that represent ontology concepts are collected. In the extraction step, contexts are extracted from the sample documents. Finally, in the adaptation step, the extracted contexts are associated with ontology concepts. The ontological system works simultaneously in multiple languages, and it is easily extensible and adaptable to other languages. As a final comment, some of the steps in this approach require intensive human labor.
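The four steps can be laid out as a small pipeline. This is a hedged sketch and not the procedure of [Segev and Gal, 2008]: all function names and example data are hypothetical, and the extraction step uses a simple word-frequency count as a placeholder, since the way contexts are extracted is not described here.

```python
from collections import Counter


def select_ontology(candidates, domain):
    """Selection: choose an existing ontology (here, a list of concepts)."""
    return candidates[domain]


def collect_documents(corpus, concepts):
    """Collection: gather sample documents that represent each concept."""
    return {c: corpus.get(c, []) for c in concepts}


def extract_context(documents, size=3):
    """Extraction: placeholder that keeps the `size` most frequent words."""
    counts = Counter(w for doc in documents for w in doc.lower().split())
    return [word for word, _ in counts.most_common(size)]


def adapt(concepts, contexts, language):
    """Adaptation: associate the extracted contexts with ontology concepts."""
    return {c: {language: contexts[c]} for c in concepts}


# Hypothetical example data for a Spanish-language environment.
candidates = {"travel": ["Flight", "Hotel"]}
corpus_es = {
    "Flight": ["vuelo directo", "vuelo barato vuelo"],
    "Hotel": ["hotel centro hotel"],
}

ontology = select_ontology(candidates, "travel")
samples = collect_documents(corpus_es, ontology)
contexts = {c: extract_context(docs) for c, docs in samples.items()}
adapted = adapt(ontology, contexts, "es")
```

In practice the selection and collection steps are where the human labor mentioned above is concentrated; the sketch replaces both with dictionary lookups only to keep the flow of the four steps visible.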
Main advantages and shortcomings
An advantage of the works that adopt this approach is that it is easier to ensure language neutrality (i.e., lack of bias towards any one language). However, producing such an ontology is costly, and the definitions and multilingual equivalences of its terms have to be established a priori.
Two critical issues still need to be solved before building multilingual ontologies from scratch, or other similar efforts, can be used as a shared conceptual framework for all languages: the scarcity of lexical semantic information (especially for endangered languages), and the lack of a linguistically motivated shared conceptual core as the basis of multilingual conceptual representation.