2.4 Ontology Matching Techniques
3.1.2 Word Formation
Over the centuries, the evolution of languages has constantly produced new terms for different objects, processes, properties, abstract things, etc. While each language has a number of root words, most of them of unknown origin, the majority of words are derived from the existing vocabulary. Such a derivation implies that there must be at least some semantic relatedness between the original word and its descendant.
The creation of new words is called word formation, and it takes several forms [13].
Derivations
In derivations, a prefix p or suffix s is added to a given word W, so that the derivation is either pW or Ws. The prefix or suffix is a derivational morpheme, as illustrated in the previous section. Derivations are among the most frequently occurring word formations, as they can change the meaning of a given word (prefix morpheme) or allow its usage in a different word class (suffix morpheme).
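The pW / Ws pattern can be sketched in a few lines of Python. This is a minimal illustration, not the method used in this thesis: the affix lists, the function name find_derivation_bases and the toy vocabulary are all assumptions made for the example. It also relies on pure string concatenation, so derivations that alter the spelling of the base (happy, happiness) are not found.

```python
# Illustrative affix lists (assumed); a real system would need a far larger inventory.
PREFIXES = ("un", "re", "dis", "in")
SUFFIXES = ("ness", "less", "able", "er", "ly")


def find_derivation_bases(word, vocabulary):
    """Return (base, affix, kind) triples with word == affix + base or base + affix."""
    results = []
    # Case pW: strip a known prefix and check that the remainder is a known word.
    for prefix in PREFIXES:
        base = word[len(prefix):]
        if word.startswith(prefix) and base in vocabulary:
            results.append((base, prefix, "prefix"))
    # Case Ws: strip a known suffix and check that the remainder is a known word.
    for suffix in SUFFIXES:
        base = word[:len(word) - len(suffix)]
        if word.endswith(suffix) and base in vocabulary:
            results.append((base, suffix, "suffix"))
    return results


# Toy vocabulary, assumed for the example.
vocabulary = {"happy", "teach", "use", "kind"}
print(find_derivation_bases("unhappy", vocabulary))   # [('happy', 'un', 'prefix')]
print(find_derivation_bases("teacher", vocabulary))   # [('teach', 'er', 'suffix')]
print(find_derivation_bases("under", vocabulary))     # []
```

The vocabulary check is what keeps words such as under from being analyzed as un + der.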
Compounds
A compound is a special word C that is a combination ("compound") of two words H (called head) and m (called modifier) [129]. The head word generally occurs at the end of the compound and specifies its basic meaning; the modifier occurs at the beginning and modifies the compound so that C expresses something more specific than H. For example, a database conference is a specific conference and a kitchen chair is a specific chair. Together with derivations, compounds are the most productive means of word formation, especially since compounds can also be derived from existing compounds, like kitchen chair manufacturer, which is a specific chair manufacturer. In such compounds of higher order, the compound consists of more than one modifier, though it still consists of only one head. In this thesis, a compound C is defined as C = m1m2...mnH, thus consisting of n modifiers (n ≥ 1) and one head.
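The definition C = m1m2...mnH can be illustrated with a short Python sketch for compounds whose parts are separated by spaces or hyphens. The function names split_compound and generalizations are hypothetical. The sketch assumes that the last token is the head and that dropping leading modifiers yields a more general term, which is a simplification that does not hold for every compound.

```python
import re


def split_compound(compound):
    """Split a space- or hyphen-separated compound into (modifiers, head)."""
    tokens = [t for t in re.split(r"[\s-]+", compound.strip()) if t]
    if len(tokens) < 2:
        raise ValueError("expected at least one modifier and a head")
    # By assumption, the head is the last token and all others are modifiers.
    return tokens[:-1], tokens[-1]


def generalizations(compound):
    """List increasingly general terms obtained by dropping leading modifiers."""
    modifiers, head = split_compound(compound)
    tokens = modifiers + [head]
    return [" ".join(tokens[i:]) for i in range(1, len(tokens))]


print(split_compound("kitchen chair manufacturer"))
# (['kitchen', 'chair'], 'manufacturer')
print(generalizations("kitchen chair manufacturer"))
# ['chair manufacturer', 'manufacturer']
```

For kitchen chair manufacturer, this reproduces the reading given above: it is a specific chair manufacturer, which in turn is a specific manufacturer.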
From a lexicographic point of view, three types of compounds can be distinguished:
• Closed compound: Head and modifier are directly combined, e.g., cookbook or blackboard.
• Hyphenated compound: Head and modifier are separated by a hyphen, e.g., get-together or see-saw.
• Open compound: Head and modifier are separated by a space, e.g., city hall or computer screen.
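The three lexicographic types can be told apart with a simple heuristic, sketched below in Python. The function name lexicographic_type and the toy vocabulary are assumptions made for this example. Open and hyphenated compounds are recognized by their separator alone, whereas a closed compound can only be detected by checking whether the word splits into two known words.

```python
def lexicographic_type(term, vocabulary):
    """Classify a term as 'open', 'hyphenated' or 'closed'; None if no compound is found."""
    term = term.strip()
    if " " in term:
        return "open"
    if "-" in term:
        return "hyphenated"
    # Closed compounds carry no separator, so try every split position
    # and accept the first one where both parts are known words.
    for i in range(1, len(term)):
        if term[:i] in vocabulary and term[i:] in vocabulary:
            return "closed"
    return None


# Toy vocabulary, assumed for the example.
vocabulary = {"cook", "book", "black", "board", "chair"}
print(lexicographic_type("city hall", vocabulary))     # open
print(lexicographic_type("get-together", vocabulary))  # hyphenated
print(lexicographic_type("cookbook", vocabulary))      # closed
print(lexicographic_type("chair", vocabulary))         # None
```

The lexicon lookup for closed compounds is prone to false positives: with a large vocabulary, a word like carpet would be split into car and pet, so additional evidence is needed in practice.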
Although the English language has no official rules for forming and writing compounds (in fact, some compounds can take different forms, like bus-driver and bus driver), each type has some typical characteristics. Generally, most recently evolved compounds like web space or smart phone use the open form. The same holds for most compounds where either head or modifier consists of two syllables or more, like building site or railroad company. In contrast, more established and more frequently occurring compounds often use the closed form, such as airbag or blackboard. The hyphenated form sometimes marks a development from the open form towards the closed form (e.g., data base – data-base – database).
In English, the components of a compound are normally combined without any additional characters, and only the head is inflected. From a technical point of view, this regularity makes it relatively easy to parse and process English compounds, while this is more difficult in other languages. For instance, in German compounds the modifier can change, as in Städtebund (Stadt + Bund), and additional characters may occur between modifier and head, as in Handelsabkommen (Handel + Abkommen). Only in rare cases has an English compound modifier changed so much that it is no longer a word of the language in its own right, e.g., holiday (holy + day).
From a semantic point of view, three types of compounds can be distinguished [88]:
• Endocentric compounds: It holds that C is a specification of H, as in the example blackboard (which is a specific board). This is the classic form of compounds.
• Exocentric compounds: There is no (obvious) semantic relation between C and H, such as in buttercup (which is not a cup and has no other semantic relation to a cup).
• Copulative or appositional compounds: Head and modifier are at the same level and C is not a specification of H, but rather the sum of what m and H express. An example is bitter-sweet, which means both bitter and sweet (not a specific sweet).
Exocentric compounds are often of figurative meaning, like computer mouse, which resembles a mouse but has no semantic relation to an actual mouse. They can also be the result of words that changed their spelling, e.g., butterfly, which might originate from "flutter by", or cocktail, which might originate from French coquetier (see footnote 6). Finally, they can be the result of two words that often co-occur, like pickpocket (someone who picks pockets) or breakfast (the time to break the fast after the night) [14].
Copulative and appositional compounds are quite rare. Unlike endocentric compounds, they express something more general than the compound head. They are very often hyphenated compounds, as in Bosnia-Herzegovina, actor-director or twenty-one.
Shortenings, Blends and Acronyms
Shortenings (or clippings) are reductions of a word by deleting parts of its base. There are three forms of shortenings, according to which part of the original word remains (see footnote 7).
1. Beginning: lab / laboratory, doc / doctor
2. End: net / Internet, phone / telephone
3. Middle: flu / influenza, fridge / refrigerator
In some cases, such as fridge, the spelling of the shortening changes slightly. The first form, in which the beginning of the word remains, is the most frequent in the English language [112].
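The three forms of shortening can be distinguished by where the short form occurs within the original word. The Python sketch below illustrates this; the function name shortening_form is hypothetical, and the comparison is a plain case-insensitive substring test, so shortenings with a changed spelling are not recognized.

```python
def shortening_form(short, full):
    """Return which part of `full` remains in `short`: 'beginning', 'end', 'middle' or None."""
    short, full = short.lower(), full.lower()
    # A shortening must be non-empty and strictly shorter than the original word.
    if not short or len(short) >= len(full):
        return None
    if full.startswith(short):
        return "beginning"
    if full.endswith(short):
        return "end"
    if short in full:
        return "middle"
    return None


print(shortening_form("lab", "laboratory"))       # beginning
print(shortening_form("net", "Internet"))         # end
print(shortening_form("flu", "influenza"))        # middle
print(shortening_form("fridge", "refrigerator"))  # None
```

The last call returns None because fridge is not a literal substring of refrigerator, which is exactly the spelling change mentioned above.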
Footnote 6: http://www.etymonline.com/
Acronyms are abbreviations that are pronounced according to the regular reading rules of a language, like NATO or radar (Radio Detection And Ranging). Blends consist of two words w1 and w2 where parts of at least one word are deleted. The remnants of w1 and w2 are combined into a new word, very similar to a closed compound. Typical blends are motel (motor + hotel) and brunch (breakfast + lunch).
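Whether a word can be explained as a blend of two given words can be checked with a small Python sketch. It rests on an assumption that goes beyond the definition above: the remnant of w1 is taken to be a prefix of w1, and the remnant of w2 a suffix of w2, which fits the typical examples but not necessarily every blend. The function name blend_splits is hypothetical.

```python
def blend_splits(blend, w1, w2):
    """Return all (prefix of w1, suffix of w2) pairs that together spell out `blend`."""
    blend, w1, w2 = blend.lower(), w1.lower(), w2.lower()
    splits = []
    for i in range(1, len(blend)):
        left, right = blend[:i], blend[i:]
        # At least one source word must lose material; otherwise the
        # result is a closed compound rather than a blend.
        shortened = len(left) < len(w1) or len(right) < len(w2)
        if w1.startswith(left) and w2.endswith(right) and shortened:
            splits.append((left, right))
    return splits


print(blend_splits("motel", "motor", "hotel"))
# [('m', 'otel'), ('mo', 'tel'), ('mot', 'el')]
print(blend_splits("brunch", "breakfast", "lunch"))
# [('br', 'unch')]
print(blend_splits("cookbook", "cook", "book"))
# []
```

Several splits can be valid when the source words overlap, as in motel. The closed compound cookbook is rejected because neither source word is shortened.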
Conversion
Conversion refers to processes where a word of a specific word class is used in a different word class. For example, professional was originally used solely as an adjective, but was later converted into a noun (a professional). Conversions do not influence the spelling of a given word.
Loan Words
Loan words are words that were imported from a different language. In many cases, the spelling was adapted to the conventions of the English language, which makes it nearly impossible to distinguish a loan word from any other word by means of automatic approaches. Only in rare cases was a loan word not adapted to the English language at all, like kindergarten, or adapted only to some degree, like iceberg (German: Eisberg).