6.6 The Ontology Translation Module
6.6.2 Translator Component
The Translator component relies on three steps (label pre-processing, label translation and label post-processing) to discover appropriate translations according to the lexical and semantic context of the original ontology label. These three steps are executed only if the Leverage component does not return any results. The output of this component is an automatically translated label that is then manually validated by an expert.
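The control flow described above can be sketched as follows. This is a minimal illustration, not the system's actual interface: the function name `translate_label` and the four callables passed to it are hypothetical stand-ins for the Leverage component and the three steps.

```python
from typing import Callable, List, Optional


def translate_label(
    label: str,
    leverage: Callable[[str], Optional[str]],
    pre_process: Callable[[str], List[str]],
    translate: Callable[[List[str]], List[str]],
    post_process: Callable[[str, List[str]], str],
) -> str:
    """Return a translation for `label`, reusing a leveraged one if available."""
    reused = leverage(label)
    if reused is not None:
        # The Leverage component returned a result: the three steps are skipped.
        return reused
    tokens = pre_process(label)             # step 1: label pre-processing
    candidates = translate(tokens)          # step 2: label translation (ranked)
    return post_process(label, candidates)  # step 3: review by an expert


# Toy stand-ins (assumptions for illustration only).
memory = {"Professor": "Profesor"}
steps = dict(
    leverage=memory.get,                           # None when nothing is stored
    pre_process=lambda label: [label],
    translate=lambda tokens: ["Profesor Ayudante"],
    post_process=lambda label, candidates: candidates[0],
)
leveraged = translate_label("Professor", **steps)
translated = translate_label("AssistantProfessor", **steps)
```

In a real setting the `post_process` callable would present the candidates to the expert rather than pick the first one automatically.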
Label Pre-Processing
We consider ontology label pre-processing essential in an ontology localization system, since it simplifies the core translation processing and improves both its quality and its speed. Ontology labels pose particular challenges to MT, which can be attributed to two distinct characteristics:
• Ontology labels differ linguistically and stylistically from written language: phrases are shorter and in some cases poorly structured, and they can contain ungrammatical expressions (e.g., Service Transport instead of Transport Service).
• The current “standard” for naming ontology labels is to use a CamelCase⁴ approach. Therefore, we cannot rely on an initial uppercase letter to identify the first word of a phrase, nor to recognize proper names.
These problematic factors are dealt with in a pre-processing pipeline that prepares the input for processing by a core MT technique. Thus, the task of the ontology label pre-processing pipeline is to make the input amenable to a linguistically principled, domain-independent treatment. This task is accomplished in two ways:
1. By normalizing the input, i.e., removing noise, reducing the input to standard typographical conventions, and also restructuring and simplifying it, whenever this can be done in a reliable, meaning-preserving way.
2. By annotating the input with linguistic information, whenever this can be reliably done with a shallow linguistic analysis, to reduce input ambiguity and make a full linguistic analysis more manageable.
The following subsections describe these tasks in more detail.
Normalization
The label normalization step groups three components, which clean up and tokenize the input.
⁴ CamelCase (also spelled camel case, camel-case or medial capitals) is the practice of writing compound words or phrases in which the elements are joined without spaces, with each element’s initial letter capitalized within the compound, and the first letter either upper or lower case, as in “LaBelle”, “BackColor”, or “iPod”. The name comes from the uppercase “bumps” in the middle of the compound word, suggestive of the humps of a camel. The practice is known by many other names.
The text-level normalization phase performs operations at the string level (on ontology term comments, for example), such as removing extraneous text and punctuation (e.g., brackets used to mark synonyms or usage context), or removing periods from abbreviations. E.g.:
“A publication may have an I.S.B.N.”
⇓
“A publication may have an International Standard Book Number”
The tokenization phase breaks an ontology label into words. The token-level normalization phase recognizes and annotates tokens belonging to special categories (times, numbers, etc.), expands contractions (e.g., AssistProfessor to AssistantProfessor), recognizes and corrects typographic errors (e.g., Profesor to Professor), and identifies compound words. E.g.:
“British” “System” “Education”
⇓
“British” “System Education”
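The three normalization components can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the regular expressions, and the two small lookup tables `CONTRACTIONS` and `COMPOUNDS` are hypothetical, and a real system would draw on proper lexical resources instead.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical resources (assumptions for illustration only).
CONTRACTIONS: Dict[str, str] = {"Assist": "Assistant"}
COMPOUNDS: Set[Tuple[str, str]] = {("System", "Education")}


def normalize_text(text: str) -> str:
    """Text-level step: drop bracketed remarks and periods in abbreviations."""
    text = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", text)  # remove (...) and [...]
    # Collapse dotted abbreviations such as "I.S.B.N." into "ISBN".
    text = re.sub(
        r"\b(?:[A-Za-z]\.){2,}",
        lambda match: match.group(0).replace(".", ""),
        text,
    )
    return text.strip()


def tokenize(label: str) -> List[str]:
    """Split a label on CamelCase boundaries; separators are discarded."""
    # Acronym runs (ISBN), capitalized or lowercase words, and digit runs.
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", label)


def normalize_tokens(tokens: List[str]) -> List[str]:
    """Token-level step: expand contractions, then merge known compounds."""
    expanded = [CONTRACTIONS.get(token, token) for token in tokens]
    merged: List[str] = []
    i = 0
    while i < len(expanded):
        if i + 1 < len(expanded) and (expanded[i], expanded[i + 1]) in COMPOUNDS:
            merged.append(f"{expanded[i]} {expanded[i + 1]}")
            i += 2
        else:
            merged.append(expanded[i])
            i += 1
    return merged


tokens = normalize_tokens(tokenize("BritishSystemEducation"))
# tokens == ['British', 'System Education']
```

Note that the tokenizer relies only on case changes to find word boundaries, so capitalization carries no information about proper names in this sketch.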
Tagging
In the tagging phase, a tagger system⁵ assigns parts of speech to tokens. Part-of-speech information is used by the subsequent pre-processing modules, and also in parsing, to prioritize the most likely lexical assignments of ambiguous items.
Proper name recognition
Proper names are ubiquitous in ontology labels, especially in instance terms. Their recognition is important for deciding which instances should be translated, since a systematically mistranslated instance term is particularly annoying (e.g., in a sports domain ontology, an instance for the golfer Tiger Woods systematically referred to as “los bosques del tigre”, lit. “the woods of the tiger”).
Name recognition is harder in the ontology domain because capitalization is commonly used for naming all types of ontological terms (concepts, properties and instances), which renders unusable all methods that rely on capitalization as the main way to identify candidates. Of course, this problem is even greater when no capitalization information is given. For instance, an expression like “mark shields”, as a possible instance in the ontology, is problematic in the absence of capitalization, as both ‘mark’ and ‘shields’ are three-way ambiguous (proper name, common noun and verb). Our approach does not currently support proper name recognition.
⁵ A tagger system is a tool for annotating text with part-of-speech and lemma information.
Segmentation
Segmentation breaks an ontology label into one or more segments, which are passed separately to subsequent modules. For our purposes, the translation units that we identify are syntactic units, motivated by cross-linguistic considerations. Each unit is a constituent that can be translated independently. Its translation is insensitive to the context in which the unit occurs, and the order of the units is preserved in the translation.
One motivation for segmenting is that processing is faster: syntactic ambiguity is reduced, and backtracking from a module to a previous one does not involve re-processing an entire phrase, but only the segment that failed. A second motivation is robustness: a failure in one segment does not cause the entire phrase to fail, and error recovery can be limited to a single segment. Further motivations are provided by the problems of conventional MT systems. These systems have serious difficulties in dealing with long sentences due to grammar coverage, memory limitations and computational complexity. Without proper treatment of long phrases, the base MT systems may fail to produce understandable translations. Although our proposal does not treat the translation of phrases (as found in term annotations), we consider this component of utmost importance for future versions of the system.
In our approach we use a basic segmentation process to divide the tokens of a compound label. However, for the translation of phrases we devised a segmentation component based on machine learning techniques [Kim et al., 2001], syntactic analysis techniques [Kim and Kim, 1997] or support vector machines [Kim and Oh, 2008].
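The robustness argument can be illustrated with a short sketch, assuming the segments have already been identified. The function `translate_by_segments`, the choice of `LookupError` as the failure signal, and the fallback of keeping the source text are all assumptions made for illustration, not the behaviour of the system described here.

```python
from typing import Callable, Dict, List


def translate_by_segments(
    segments: List[str],
    translate_segment: Callable[[str], str],
) -> List[str]:
    """Translate each segment on its own, preserving segment order."""
    output: List[str] = []
    for segment in segments:
        try:
            output.append(translate_segment(segment))
        except LookupError:
            # Error recovery is confined to the failing segment; the
            # fallback chosen here is to keep the source text unchanged.
            output.append(segment)
    return output


# Toy lexicon standing in for a core MT technique (hypothetical data).
lexicon: Dict[str, str] = {"transport": "transporte"}
result = translate_by_segments(["transport", "service"], lambda s: lexicon[s])
```

Because each segment is handled independently, the missing entry for the second segment does not prevent the first one from being translated.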
Label Translation
After preparing the ontology element for effective translation processing, the Ontology Translator invokes the label translation component, which obtains the most probable translation for each ontology label (see section 4.4.2). This component integrates different translation methods, combining their output by means of different translation strategies. Some natural ways to compose different translation algorithms were presented in section 5.8.1. In addition, in section 5.8.2 we summarized some of the well-known combination methods used to integrate the output of different MT approaches. Each translation method relies on different linguistic and semantic resources to obtain candidate translations. The output of this component is a ranked set of translations for each ontology label.
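As an illustration of how the outputs of several translation methods might be merged into a single ranked set, the sketch below uses a weighted sum of per-method scores. This particular combination strategy, the `Method` type, and the weights are assumptions chosen for simplicity; they are not necessarily among the strategies of sections 5.8.1 and 5.8.2.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A method maps a label to candidate translations with scores in [0, 1].
Method = Callable[[str], Dict[str, float]]


def rank_translations(
    label: str,
    methods: List[Tuple[Method, float]],
) -> List[Tuple[str, float]]:
    """Merge candidates from several methods into one list, best first."""
    totals: Dict[str, float] = defaultdict(float)
    for method, weight in methods:
        for candidate, score in method(label).items():
            totals[candidate] += weight * score
    # Sort by descending combined score; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Two toy methods with made-up scores (hypothetical data).
def dictionary_method(label: str) -> Dict[str, float]:
    return {"profesor": 0.9, "catedrático": 0.4}


def mt_method(label: str) -> Dict[str, float]:
    return {"profesor": 0.7, "maestro": 0.5}


ranked = rank_translations(
    "Professor", [(dictionary_method, 0.5), (mt_method, 0.5)]
)
best_candidate = ranked[0][0]
```

A candidate proposed by several methods accumulates score from each of them, so agreement between methods pushes a translation up the ranking.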
Label Post-Processing
This component shows the translations to the user so that their quality can be reviewed. The quality of a translation is measured by two factors: adequacy and fluency. Adequacy determines the degree to which the meaning of a correct translation is preserved. Fluency determines how well formed the corresponding translation is in the target language.
Checking the quality of a translation is the only task of the ontology localization activity that necessarily requires user interaction.