Information Linguistic Procedures

Chapter 6 LANGUAGE-RELATED ALIGNMENT OF THE DOMAIN SEMANTICS IN HETEROGENEOUS

6.2 Semantic Analysis

6.2.4 Information Linguistic Procedures

Over the past decades, various natural language processing and information linguistic procedures have become available that enable the processing of natural language in and for information systems (Harms and Luckhardt, 2009). They are therefore suitable for ontology matching at the element level (Euzenat and Shvaiko, 2007).

6.2.4.1 Decomposition of Compounds

Terms in natural language vary in complexity: they consist either of a single term or of a combination of terms. A single term usually consists of one word, whereas a term combination comprises several terminological components. In English these are often multi-word terms, whereas in German they are typically compound terms, i.e., combinations of at least two independently existing words into one composite word (Bertram, 2005). For languages that allow the creation of compounds, such as German, decomposing them and subsequently comparing the individual components is considered meaningful (Stock, 2007). When decomposing, it is important to generate conceptually meaningful component parts so that all occurrences of a search term can be found. Appropriate dictionaries can help to avoid meaningless decompositions of compounds or the undesired decomposition of proper names (Bertram, 2005).
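The dictionary-supported decomposition described above can be illustrated with a minimal sketch. It is not the procedure of the cited authors: the function name, the tiny lexicon, and the list of German linking elements are assumptions made purely for illustration. A word that is itself a dictionary entry (for example a proper name) is returned unsplit, and a split is accepted only if every part is a dictionary entry.

```python
def split_compound(word, lexicon, min_len=3, linkers=("s", "es", "n", "en")):
    """Return one decomposition of `word` into lexicon entries, or None.

    `lexicon` and `linkers` are illustrative assumptions, not a real resource.
    """
    word = word.lower()
    # Whole-word entries (e.g. proper names) are never decomposed.
    if word in lexicon:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head not in lexicon:
            continue
        # Try the tail as is, then with a leading linking element removed.
        candidates = [tail] + [tail[len(l):] for l in linkers if tail.startswith(l)]
        for cand in candidates:
            if len(cand) >= min_len:
                rest = split_compound(cand, lexicon, min_len, linkers)
                if rest:
                    return [head] + rest
    # No split into meaningful parts was found.
    return None


# Hypothetical mini-lexicon for demonstration only.
lexicon = {"rechnung", "eingang", "auftrag", "burg", "hamburg"}

print(split_compound("Rechnungseingang", lexicon))  # ['rechnung', 'eingang']
print(split_compound("Hamburg", lexicon))           # ['hamburg']
```

The resulting components can then be compared individually, so that a search for "Eingang" also finds "Rechnungseingang" and "Auftragseingang".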

6.2.4.2 Disambiguation through Resolving Synonymy

Synonyms are different designations for the same concept. The differences take the form of differing inflections, spelling variants, abbreviations or full forms, and alternatively used designations (Weiss, 2001). Resolving synonyms ensures that the semantic correspondence of concepts can be detected even if they are designated differently, which improves the matching results (Stock, 2007). Such synonymy resolution or word sense disambiguation can be supported by a thesaurus used as a synonym dictionary (Stock, 2007). A thesaurus links terms to conceptual entities, with or without listing preferred labels, and relates them to other concepts. Such systems mostly capture semantic relations such as synonymy and ambiguity, hyponymy and hyperonymy, and antonymy as well as association (Stock and Stock, 2008). For creating web-based thesauri the W3C provides SKOS, the Simple Knowledge Organization System (Miles and Bechhofer, 2009). The use of SKOS allows for the reuse of freely available resources, such as WordNet (Fellbaum, 1998) or the STW Thesaurus for Economics (ZBW, 2010).
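As a minimal sketch of thesaurus-based synonym resolution, the following maps each designation to a preferred label before comparison. The dictionary loosely mirrors the distinction between preferred and alternative labels found in SKOS; its entries and all function names are hypothetical and do not come from WordNet, the STW Thesaurus, or any other real resource.

```python
# Hypothetical mini-thesaurus: preferred label -> alternative labels.
THESAURUS = {
    "invoice": {"bill", "inv.", "invoices"},
    "purchase order": {"po", "order form"},
}


def build_index(thesaurus):
    """Map every known label (preferred or alternative) to its preferred label."""
    index = {}
    for preferred, alternatives in thesaurus.items():
        index[preferred] = preferred
        for alt in alternatives:
            index[alt.lower()] = preferred
    return index


def preferred_label(term, index):
    """Normalize case and whitespace, then look up the preferred label."""
    key = " ".join(term.lower().split())
    # Unknown terms are returned in normalized form.
    return index.get(key, key)


def same_concept(a, b, index):
    """Two designations correspond if they resolve to the same preferred label."""
    return preferred_label(a, index) == preferred_label(b, index)


index = build_index(THESAURUS)
print(same_concept("Bill", "Invoice", index))  # True
print(same_concept("bill", "PO", index))       # False
```

A lookup of this kind only covers the synonymy relation; hierarchical or associative relations of a thesaurus would need additional structures.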

6.2.4.3 Treatment of Stopwords

In information retrieval, words that are not considered during indexing are called stop words. Most of them have syntactic functions and are therefore not relevant for drawing conclusions about the content of a document. In German as well as in English these are articles, conjunctions, prepositions, pronouns, and negations (Bertram, 2005). Nevertheless, they are essential for understanding meaning (Bertram, 2005). The number of stop words may vary depending on the domain, since the list can also include words that carry meaning but occur in most documents and are therefore not useful for differentiating content. Accordingly, for the business semantics in process models it seems advantageous not to eliminate them in general as suggested in (Koschmider, 2007), but domain-specifically. Depending on the type of search, refraining from elimination allows for better results when searching for word combinations (Beus, 2008).

Furthermore, in business processes the presence of a negation within decisions is often important when searching for semantically similar elements. Especially in short phrases in which a stop word common to the given language makes a significant difference in meaning, as is the case with negations, stop word elimination can lead to incorrect results (Stock, 2007).
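The effect of negations on short labels can be illustrated with a small sketch of domain-specific stop word filtering. The word lists are hypothetical and far from complete; they only serve to show that general elimination makes two opposite decision labels indistinguishable, whereas retaining negations keeps them apart.

```python
# Illustrative word lists (assumptions, not taken from a real stop word list).
GENERAL_STOP_WORDS = {"the", "a", "an", "of", "is", "and", "or", "to", "not", "no"}
NEGATIONS = {"not", "no"}
DOMAIN_STOP_WORDS = {"process"}  # hypothetical domain-specific addition


def content_tokens(label, keep_negations=True):
    """Tokenize a label and drop stop words, optionally retaining negations."""
    stop = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS
    if keep_negations:
        stop = stop - NEGATIONS
    return [t for t in label.lower().split() if t not in stop]


# General elimination: both labels collapse to the same tokens.
print(content_tokens("Order is approved", keep_negations=False))      # ['order', 'approved']
print(content_tokens("Order is not approved", keep_negations=False))  # ['order', 'approved']

# Retaining negations preserves the difference in meaning.
print(content_tokens("Order is not approved"))  # ['order', 'not', 'approved']
```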

6.2.4.4 Stemming

For morphological analysis, information retrieval offers methods for determining the basic form of a word (lemmatization) and for determining the stem of a word (stemming) (Stock, 2007). Lemmatization determines the grammatical base or principal form by assigning the concrete word form to a dictionary entry. Stemming traces the morphological variants of a word back to their common stem by deleting inflectional endings and derivational suffixes; the resulting stem is not necessarily a lexical term. When matching process models, semantic similarity between activities can thus be detected more precisely, regardless of whether they are named with a nominalized verb or with a combination of a verb and a noun, and likewise between objects, since only the stems are compared. Furthermore, undesired matching of suffixes is prevented, as they are deleted prior to matching.
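The idea of stemming can be sketched with a deliberately naive suffix stripper. The suffix list is an assumption chosen for this example and is much cruder than established stemmers such as the Porter algorithm; the sketch only shows how an activity named with a nominalized verb and one named with a verb and a noun are reduced to the same set of stems.

```python
# Illustrative English suffixes, ordered longest first (an assumption).
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stems(label):
    """Return the set of stems of all words in a label."""
    return {naive_stem(t) for t in label.split()}


print(naive_stem("confirmation"))                              # confirm
print(stems("Order confirmation") == stems("Confirm order"))   # True
```

Because suffixes such as "ation" are removed before comparison, they cannot contribute to a spurious match between otherwise unrelated labels.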

6.2.4.5 String Matching

A string is a sequence of characters of arbitrary length drawn from a defined character set (Euzenat and Shvaiko, 2007). String matching algorithms search for matching character sequences. This task arises in various domains and has over time led to different approaches (Cohen et al., 2003). String metrics allow for measuring the similarity of character sequences (Stoilos et al., 2005). The Levenshtein distance of two strings is the minimum number of single-character insertions, deletions, or substitutions required to convert the first string into the second (Levenshtein, 1966). The Jaccard metric measures the overlap between the sets of words of two expressions (Jaccard, 1912). The Jaro metric compares the characters of two strings and their positions, counting characters as matching even when they are a few positions apart (Jaro, 1989). N-grams can be used for fragmenting words or character sequences (Stock, 2007). On this basis the q-grams algorithm counts the q-grams, typically trigrams, that the strings to be compared have in common and is therefore applicable to so-called approximate string matching (Sutinen and Tarhio, 1995). As the results returned by the different methods can differ considerably, a suitable metric needs to be chosen depending on the language and the function of the terms (Stoilos et al., 2005). Even though string metrics alone cannot fulfill all requirements for finding semantic similarity of designators, they have nevertheless proved useful in this field (Stoilos et al., 2005). They can be used for determining semantic similarity based on the matching of strings in cases where no synonymy of terms is given. Prior stemming can further increase precision, since reducing terms to their word stems prevents, for example, matches between suffixes from being computed.
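Three of the metrics mentioned above can be sketched in a few lines. These are textbook formulations written for illustration, not the implementations used by the cited authors; the function names are assumptions, and the q-gram variant shown here simply counts distinct shared q-grams and ignores repeated occurrences.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Share of common words among all distinct words of two expressions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def qgrams(s, q=3):
    """Set of all substrings of length q (trigrams by default)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def common_qgrams(a, b, q=3):
    """Number of distinct q-grams shared by both strings."""
    return len(qgrams(a, q) & qgrams(b, q))


print(levenshtein("kitten", "sitting"))           # 3
print(jaccard("check invoice", "invoice check"))  # 1.0
print(common_qgrams("invoice", "invoices"))       # 5
```

The example shows why the choice of metric matters: the Jaccard metric treats the two reordered labels as identical, whereas a character-based distance on the same pair would report a large difference.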

6.3 Implementation

To apply the described method, a prototypical system called LaSMat, which stands for Language-aware Semantic Matching, has been implemented.
