Intractable Problems of Compound Nouns - The Problems and Proposed Solution

5.2 The Problems and Proposed Solution

5.2.1 Intractable Problems of Compound Nouns

Nouns and compound nouns are the most important elements for representing a fact in a natural language sentence. A compound noun is formed by combining any number of simple nouns in sequence. Compound nouns carry more specific contextual information than simple nouns, and they have therefore been considered very important in natural language processing. More broadly, a compound noun is a complex of nouns, potentially including particles or words of other parts of speech, that functions as a noun in a sentence. The parts of a compound noun may be written as one word, as separate words, or as a hyphenated word, for example:

saishoku-shugi (vegetarianism) → saishoku (eating vegetables) + shugi (ism)

kankei-kaizen → kankei (relationship) + kaizen (improvement)

senmon-chishiki (expertise) → senmon (expert) + chishiki (knowledge)

Compound nouns are a commonly occurring construction in language: a sequence of nouns acting as a noun phrase, such as fruit tree farmer. For a detailed linguistic theory of compound noun syntax and semantics. Compound nouns (also known as noun-noun compounds, complex nominals or noun sequences) are analysed syntactically by means of a rule such as N → N N, which is applied recursively. Compounds of more than two nouns are therefore ambiguous in syntactic structure. The first step in producing an interpretation of a compound noun (CN) is an analysis of the attachments within the compound. Syntactic parsers cannot choose an appropriate analysis, because attachments are not syntactically governed. Without semantic knowledge, multiple ambiguities arise, resulting in inefficient parsing.
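The structural ambiguity described above can be made concrete with a short sketch. Assuming strictly binary attachment, as the recursive rule N → N N implies, the following Python code enumerates every bracketing of a noun sequence. The function names `bracketings` and `catalan` are illustrative and do not come from the system described in this chapter.

```python
from functools import lru_cache
from math import comb


def bracketings(nouns):
    """Return every binary bracketing of a noun sequence as nested tuples."""
    nouns = tuple(nouns)

    @lru_cache(maxsize=None)
    def build(i, j):
        # All binary trees over nouns[i:j]; a single noun is a leaf.
        if j - i == 1:
            return (nouns[i],)
        trees = []
        for k in range(i + 1, j):  # split point between left and right subtree
            for left in build(i, k):
                for right in build(k, j):
                    trees.append((left, right))
        return tuple(trees)

    return list(build(0, len(nouns)))


def catalan(n):
    """n-th Catalan number: the number of binary trees with n + 1 leaves."""
    return comb(2 * n, n) // (n + 1)


for tree in bracketings(["fruit", "tree", "farmer"]):
    print(tree)
```

For fruit tree farmer this yields two analyses, one grouping tree with farmer and one grouping fruit with tree. The count for n nouns is the Catalan number of n - 1 (2, 5, 14, 42 for three to six nouns), which is why a purely syntactic treatment of long compounds becomes expensive.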

The token occurrence of Japanese noun + noun compounds in the 1996 Mainichi Shinbun Corpus (32m word tokens, Mainichi Newspaper Co. (1996)) is estimated at roughly 10%, underlining their high frequency. The average token frequency per noun + noun compound type is around 7, and slightly more than half of the noun + noun compound types occur only once in the corpus (mirroring the results of Lapata and Lascarides (2003) for English). Additionally, new noun + noun compounds are constantly emerging [44]. All of this motivates a robust translation method that is able to handle novel noun + noun compounds.
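The three corpus figures quoted above (token share, average token frequency per type, and the share of types occurring only once) can be computed as in the following sketch. The function name `compound_stats` and the toy counts are invented for illustration only. They are not taken from the Mainichi corpus.

```python
from collections import Counter


def compound_stats(compound_tokens, total_tokens):
    """Summarise compound frequency from a list of compound occurrences."""
    counts = Counter(compound_tokens)           # type -> token frequency
    n_tokens = sum(counts.values())
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "token_share": n_tokens / total_tokens,      # share of corpus tokens
        "mean_tokens_per_type": n_tokens / n_types,  # average frequency per type
        "hapax_share": hapaxes / n_types,            # types seen exactly once
    }


# Toy data (made up): five compound occurrences in a 50-token corpus.
toy = ["keiyaku+naiyou"] * 3 + ["shouhin+kakaku", "nebiki+riyuu"]
print(compound_stats(toy, total_tokens=50))
```

A high share of once-only types, as reported for the corpus, means that a purely lexicon-based approach will keep encountering compounds it has never seen.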

Compound words pose many problems for linguistic description, and some additional ones for natural language processing in particular:

1. Identification:

How can compounds be distinguished from other words and phrases?

2. Segmentation:

What are the components of a compound? In many languages, orthographic convention is such that compounds are written as single units.

3. Disambiguation:

What is the correct analysis of a compound? On the widespread assumption that compounds have a recursive binary structure, any occurrence with more than two basic elements will admit multiple analyses, from which, normally, a single candidate must be selected.

4. Interpretation:

How can the meaning of a compound word be derived from the meanings of its parts? For many purposes, there is little point in performing any of the other tasks unless this is feasible.
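The segmentation task in the list above can be sketched as an exhaustive dictionary lookup. This is a minimal illustration under strong assumptions: the input is a romanized string rather than kana or kanji, and the lexicon `LEXICON` is a toy set built from the examples earlier in this section. The function name `segmentations` is hypothetical.

```python
# Toy lexicon of component nouns (romanized), taken from the examples above.
LEXICON = {"saishoku", "shugi", "kankei", "kaizen", "senmon", "chishiki"}


def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no components
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


print(segmentations("senmonchishiki", LEXICON))
print(segmentations("kankeikaizen", LEXICON))
```

With a realistic lexicon, many strings have more than one segmentation, and the result feeds directly into the disambiguation problem, since each segmentation with more than two components again admits several bracketings.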

In natural language processing, compound nouns in Japanese are difficult to handle. If the grammar rules license arbitrary noun sequences, the number of parse trees explodes combinatorially, even though not all nouns form compound nouns. If, on the other hand, the compound nouns themselves are entered as lexical items, the lexicon grows explosively. Dates and times (years, months, days of the month, and numbered times) are compound nouns in Japanese, typically consisting of a number and a temporal noun. The compound noun rules first distinguish between time positions and time periods; for example, the adverbial 13-nichi (13 days) could mean either on the thirteenth or for thirteen days. Once it has been determined that the noun phrase refers to a time position, simple regular rules are used to generate the corresponding expressions.
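The position/period ambiguity of expressions such as 13-nichi can be illustrated with a small sketch. It only enumerates the candidate readings and renders them as English glosses. It does not reproduce the actual rules of the system, which are not given in this section, and the names `readings` and `ordinal` as well as the romanized input format are assumptions made for the example.

```python
import re


def ordinal(n):
    """English ordinal for a positive integer, e.g. 13 -> '13th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def readings(expr):
    """Return the candidate readings of a romanized 'N-nichi' expression."""
    match = re.fullmatch(r"(\d+)-nichi", expr)
    if match is None:
        return {}
    n = int(match.group(1))
    # The period reading is always available.
    result = {"period": f"for {n} day" + ("" if n == 1 else "s")}
    # The position reading only makes sense for a valid day of the month.
    if 1 <= n <= 31:
        result["position"] = f"on the {ordinal(n)}"
    return result


print(readings("13-nichi"))
print(readings("40-nichi"))
```

Choosing between the two readings requires context from the rest of the sentence, which this sketch deliberately leaves out.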

We have found that compound nouns are highly frequent and highly productive in Japanese, and that they are a very real problem for both MT systems and human translators. In our previous studies on SFBMT, compound nouns proved difficult to translate correctly, for the following reasons:

1. Lexical idiosyncrasies in Japanese and Chinese:

Translation pairs in which one or more pairs of component nouns do not align under exact translation but are conceptually similar. For example:

bijinesu + bagu (business bag)¹ → gongwen + bao

bijinesu + mashin (business machine) → shiwu + jiqi

bijinesu + sofuto (business software) → shangyong + ruanjian

bijinesu + gurafikkusu (business graphics) → shangye + tubiao

bijinesu + dairi (business agency) → shangwu + daili

In these examples, the word bijinesu receives a different rendering in Chinese in each compound.

2. Existence of non-compositional compound nouns:

Since languages differ as to when to compound, a Japanese compound is not always most properly translated as a compound. For each Japanese compound, the appropriate Chinese construction must be selected according to the conventions of the target language, Chinese. For example, senmon + uriba (special counter) → zhuanmen + guitai is translated most naturally into Chinese as zhuangui. Here zhuangui can be analyzed as a two-character abbreviation derived from zhuanmen (special) and guitai (counter), whose full translation would be zhuanmen + guitai. Another example is medama + shouhin (loss leader) → yanzhu + shangpin, whose natural Chinese translation is remenhuo. Here, remen does not align with medama.

¹ With all examples of Japanese and Chinese compound nouns, we segment the compound into its component nouns through the use of the "+" symbol. Note that no such segmentation boundary is indicated in the original Japanese and Chinese.

3. Constructional variability in the translations:

shouhin + kakaku (price of commodity) ($N_1^J N_2^J$) → shangpin + jiage ($N_1^C N_2^C$) vs. keiyaku + ihan (breach of contract) ($N_1^J N_2^J$) → weifan + hetong ($N_2^C N_1^C$) vs. nebiki + riyuu (discount reason) ($N_1^J N_2^J$) → jiangjia + de + liyou ($N_1^C\,\mathrm{de}\,N_2^C$). Here $N_i^J$ denotes the $i$-th noun of the Japanese compound and $N_i^C$ its Chinese translation. A Japanese noun compound ($N_1^J N_2^J$) thus corresponds to several different constructions in Chinese, such as ($N_1^C N_2^C$), ($N_2^C N_1^C$), and ($N_1^C\,\mathrm{de}\,N_2^C$).

4. Mismatch in semantic explicitness:

Sometimes one compound can have multiple interpretations and can only be reliably interpreted in context [39]. For example, keiyaku + naiyou (contract content) can be translated as heyue + neirong, qiyue + neirong, hetong + neirong, and so on. The semantic content explicitly described in the source compound noun is made implicit in the translation. Furthermore, integrating the translated compound into its context may also present difficult choices. One specific issue concerns potential attributes of the compound.
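Two of the difficulties listed above, head-dependent lexical choice and constructional variability, can be summarised in a small lookup-based sketch. All word pairs are taken from the examples in this section. The table and function names (`BIJINESU_BY_HEAD`, `TEMPLATE`, `translate`, `translate_bijinesu`) are hypothetical, and a real system would have to predict the rendering and the template rather than look them up.

```python
# Rendering of "bijinesu" depends on the head noun it modifies.
BIJINESU_BY_HEAD = {
    "bagu": ("gongwen", "bao"),
    "mashin": ("shiwu", "jiqi"),
    "sofuto": ("shangyong", "ruanjian"),
    "gurafikkusu": ("shangye", "tubiao"),
    "dairi": ("shangwu", "daili"),
}

# One Chinese rendering per component noun; keiyaku has other renderings
# too (heyue, qiyue), which is the mismatch discussed in item 4.
NOUN_TRANSLATION = {
    "shouhin": "shangpin", "kakaku": "jiage",
    "keiyaku": "hetong", "ihan": "weifan",
    "nebiki": "jiangjia", "riyuu": "liyou",
}

# Chinese construction observed for each Japanese N1 N2 compound.
TEMPLATE = {
    ("shouhin", "kakaku"): "N1 N2",
    ("keiyaku", "ihan"): "N2 N1",
    ("nebiki", "riyuu"): "N1 de N2",
}


def translate(n1, n2):
    """Translate a two-noun Japanese compound using its stored template."""
    c1, c2 = NOUN_TRANSLATION[n1], NOUN_TRANSLATION[n2]
    template = TEMPLATE[(n1, n2)]
    if template == "N1 N2":
        parts = [c1, c2]
    elif template == "N2 N1":
        parts = [c2, c1]
    elif template == "N1 de N2":
        parts = [c1, "de", c2]
    else:
        raise ValueError(f"unknown template: {template}")
    return " + ".join(parts)


def translate_bijinesu(head):
    """Translate 'bijinesu + head'; the modifier's rendering varies by head."""
    return " + ".join(BIJINESU_BY_HEAD[head])


print(translate("keiyaku", "ihan"))
print(translate_bijinesu("sofuto"))
```

The tables make the core difficulty visible: neither the rendering of a component noun nor the target construction is determined by the source compound's form alone, so a word-by-word dictionary cannot produce these translations.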

In document SUPER-FUNCTION BASED MACHINE TRANSLATION SYSTEM FOR BUSINESS USER. Xin Zhao (Page 102-105)