A morphological engine for Italian language
C. Morphological analysis
The word to analyze is the input to the analyzer, which first builds a list of strings to analyze. The list contains the word itself, the strings obtained by normalizing the word, the strings obtained by removing possible prefixes from all strings in the list, the strings obtained by removing possible suffixes from all strings in the list, and the strings obtained by renormalizing all strings in the list. A step called segmentation then associates a list of possible entry words with each string in the list. For each possible entry word, a list of entry words with irregular roots is consulted to recover the actual entry word of origin when the word derives from an entry word whose root also changes under inflection. The word-list is then consulted to remove nonexistent entry words from the list of possible entry words. Finally, a compatibility analysis inflects each remaining entry word: if an entry word generates the initial word, it is a solution. The following paragraphs describe each step in detail.
Normalization. This step completes the word if it is truncated. The last characters of the word are analyzed. If the word ends in “-r”, “-l”, “-m”, or “-n”, the vowel “e” is added (e.g. “miel” becomes “miele”). If the word ends in “-r”, “-l”, “-m”, or “-n”, the vowel “o” is added (e.g. “cammin” becomes “cammino”). If the word ends in “-or”, the vowel “e” is added (e.g. “amor” becomes “amore”).
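The normalization rules can be sketched as a small rule table. The following Python fragment is only an illustration under stated assumptions: the names RULES and normalize are hypothetical, and every matching rule is assumed to contribute a candidate, with duplicates dropped.

```python
# Hypothetical rule table: (final letters, vowel to append), as listed in the text.
RULES = [
    (("r", "l", "m", "n"), "e"),   # "miel"   -> "miele"
    (("r", "l", "m", "n"), "o"),   # "cammin" -> "cammino"
    (("or",), "e"),                # "amor"   -> "amore"
]


def normalize(word):
    """Return the candidate completions of a possibly truncated word."""
    candidates = []
    for endings, vowel in RULES:
        if word.endswith(endings):          # str.endswith accepts a tuple
            completed = word + vowel
            if completed not in candidates:  # assumption: no duplicates kept
                candidates.append(completed)
    return candidates
```

Under these assumptions a truncated word yields several candidates (for instance both “miele” and “mielo” for “miel”); the nonexistent ones are expected to be discarded later by the selection of existing entry words.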
Prefixes removal. This step uses two lists of prefixes, one for verbal ones (e.g. “ri-”) and one for adjectival ones (e.g. “super-”).
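A minimal sketch of prefix removal follows. The two lists contain only the examples given in the text, the function name strip_prefixes is hypothetical, and recording the grammatical category with each stripped string is an assumption motivated by the later selection step, which relies on the category seen during affix removal.

```python
# Illustrative lists: only the prefixes mentioned in the text.
VERBAL_PREFIXES = ["ri"]
ADJECTIVAL_PREFIXES = ["super"]


def strip_prefixes(word):
    """Return (remaining string, category, prefix) for every prefix found."""
    results = []
    tables = (("verb", VERBAL_PREFIXES), ("adjective", ADJECTIVAL_PREFIXES))
    for category, prefixes in tables:
        for prefix in prefixes:
            # Require at least one character to remain after the prefix.
            if word.startswith(prefix) and len(word) > len(prefix):
                results.append((word[len(prefix):], category, prefix))
    return results
```

For example, strip_prefixes("rifare") yields the string “fare” tagged as a verb, and strip_prefixes("superveloce") yields “veloce” tagged as an adjective.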
Suffixes removal. This step uses four lists of suffixes: substantival, adjectival, adverbial and verbal enclitics.
Four forms for each substantival and adjectival suffix (generated by changing gender and number) are considered (e.g. “-issimo”, “-issima”, “-issime”, “-issimi” for adjectives and “-accio”, “-accia”, “-accie”, “-acci” for nouns).
Adverbial suffixes (“-mente” and “-issimamente”) are applied to adjectives. A verbal enclitic can be removed only from a present imperative, a present gerund, a present infinitive or a past participle, and the verb cannot be intransitive. After the removal of a verbal enclitic the obtained string is modified: if its last character is a consonant that matches the first character of the detected enclitic, that consonant is removed (enclitics attached to monosyllabic imperatives double their initial consonant); if the resulting string ends in a vowel, a second string with the character [’] appended is also inserted in the list. The latter is used to detect verb forms whose last character is [’], for example “di’”, which derives from the entry word “dire”.
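The adjustment performed after enclitic removal can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, only unaccented vowels are considered, and the grammatical constraints on the verb form (mood, transitivity) are not checked here.

```python
VOWELS = set("aeiou")   # assumption: accented vowels are ignored in this sketch
APOSTROPHE = "’"


def candidates_after_enclitic(word, enclitic):
    """Strip an enclitic and return the strings to add to the analysis list."""
    if not word.endswith(enclitic) or len(word) <= len(enclitic):
        return []
    stem = word[:-len(enclitic)]
    # A doubled initial consonant ("dimmi" = "di" + "m" + "mi") is undone.
    if stem[-1] not in VOWELS and stem[-1] == enclitic[0]:
        stem = stem[:-1]
    candidates = [stem]
    # A stem ending in a vowel may be an apocopated form such as "di’".
    if stem and stem[-1] in VOWELS:
        candidates.append(stem + APOSTROPHE)
    return candidates
```

With the word “dimmi” and the enclitic “mi” the sketch returns “di” and “di’”; with “farlo” and “lo” it returns only “far”, which is then left to renormalization.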
The algorithm stores each modification of the word, together with the kind of suffix removed and the gender and number associated with that suffix.
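One possible way to keep this information is a small record per stripped string. The sketch below is illustrative only: the record type Stripped, the table SUFFIXES and the gender and number labels are assumptions, and the table holds just a few of the suffixes named in the text.

```python
from typing import NamedTuple, Optional


class Stripped(NamedTuple):
    stem: str                 # string left after removing the suffix
    kind: str                 # "noun", "adjective", "adverb", ...
    suffix: str               # the suffix that was removed
    gender: Optional[str]     # "m", "f" or None when not applicable
    number: Optional[str]     # "sg", "pl" or None when not applicable


# Illustrative table: the four forms of "-issimo" plus one adverbial suffix.
SUFFIXES = [
    ("issimo", "adjective", "m", "sg"),
    ("issima", "adjective", "f", "sg"),
    ("issimi", "adjective", "m", "pl"),
    ("issime", "adjective", "f", "pl"),
    ("mente", "adverb", None, None),
]


def strip_suffixes(word):
    """Return one record for every suffix of the table found on the word."""
    found = []
    for suffix, kind, gender, number in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)]
            found.append(Stripped(stem, kind, suffix, gender, number))
    return found
```

For “bellissima” the sketch produces a single record with stem “bell”, kind adjective, gender feminine and number singular, which is the information later used by the compatibility analysis.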
Renormalization. The removal of a suffix can truncate the word to be analyzed. Renormalization operates like normalization, but it also performs the following actions:
• If substantival or adjectival suffixes have been removed, three new strings are added to the list: the string obtained by adding the last character of the suffix, the string obtained by adding the vowel “e” and the string obtained by adding the vowel “o”.
• If adverbial suffixes have been removed, the character “e” is added.
• If verbal suffixes have been removed and the last character of the obtained string is “r”, the string may be a verb in the present infinitive, so two new strings are added to the list: the string obtained by adding the vowel “e” (for verbs ending in “-are”, “-ere” and “-ire”) and the string obtained by adding “re” (for verbs ending in “-rre”).
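The three additional actions of renormalization can be sketched as below. The function name, the category labels and the choice to drop duplicate candidates are assumptions; the ordinary normalization rules, which renormalization also applies, are omitted for brevity.

```python
def renormalize(stem, kind, suffix):
    """Return the extra strings added after a suffix of the given kind is removed."""
    candidates = []
    if kind in ("noun", "adjective"):
        # Last character of the suffix, then "e", then "o".
        for ending in (suffix[-1], "e", "o"):
            completed = stem + ending
            if completed not in candidates:   # assumption: duplicates dropped
                candidates.append(completed)
    elif kind == "adverb":
        candidates.append(stem + "e")
    elif kind == "enclitic" and stem.endswith("r"):
        candidates.append(stem + "e")    # infinitives in -are, -ere, -ire
        candidates.append(stem + "re")   # infinitives in -rre
    return candidates
```

Under these assumptions, renormalize("bell", "adjective", "issima") gives “bella”, “belle” and “bello”, while renormalize("por", "enclitic", "lo") gives “pore” and “porre”, of which only the second is expected to survive the word-list check.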
Segmentation. Segmentation associates a list of possible entry words of origin with each string in the list. A list of possible desinences is used; a string of instructions of the form (3) is associated with each desinence:
desinence: (str)-n+des; (str)-n+des; …; (str)-n+des    (3)
Each instruction is interpreted as follows: the desinence is removed and, if the obtained string ends in str, n characters are removed and the substring des is appended. When the condition (str) is omitted the instruction always applies, and when -n is omitted no character is removed. For example, consider the word “temerono”, in which the desinence “rono” is detected. The following instructions are associated with this desinence: rono:+ere; (a)+re; (e)+re; (i)+re.
This means that, after removing the desinence, the string “ere” is added and the possible entry word “temeere” is added to the list; if the obtained string ends in “a”, the string “re” is added (in this case the instruction is not applied); if the obtained string ends in “e”, the string “re” is added and the possible entry word “temere” is added to the list; if the obtained string ends in “i”, the string “re” is added (in this case the instruction is not applied).
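The interpretation of the instruction strings can be made concrete with a small parser. The sketch below is not the engine's actual implementation: the regular expression, the function names and the treatment of an omitted condition or count (always applicable, zero characters removed) are assumptions drawn from the example above.

```python
import re

# One instruction: optional "(str)", optional "-n", mandatory "+des".
INSTRUCTION = re.compile(r"^(?:\((\w+)\))?(?:-(\d+))?\+(\w*)$")


def parse_rule(line):
    """Split 'desinence: instr; instr; ...' into its parts."""
    desinence, _, body = line.partition(":")
    steps = []
    for item in body.split(";"):
        match = INSTRUCTION.match(item.strip())
        if match is None:
            raise ValueError("malformed instruction: " + item)
        condition, count, addition = match.groups()
        steps.append((condition or "", int(count or 0), addition))
    return desinence.strip(), steps


def segment(word, rule_line):
    """Return the possible entry words produced by one desinence rule."""
    desinence, steps = parse_rule(rule_line)
    if not word.endswith(desinence):
        return []
    stem = word[:len(word) - len(desinence)]
    entries = []
    for condition, count, addition in steps:
        if stem.endswith(condition):          # empty condition always holds
            entries.append(stem[:len(stem) - count] + addition)
    return entries
```

Applied to the example of the text, segment("temerono", "rono:+ere; (a)+re; (e)+re; (i)+re") returns the two possible entry words “temeere” and “temere”.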
Search for irregular roots. In some words inflection changes not only the ending, but also the root. In some words the change is small and common to many entry words; for these cases, additional instructions associated with the endings were added. In rare cases the root changes completely, for example in the verb “andare”, whose first person singular of the present indicative is “vado”. For this reason, there is a list of the strings obtained from the segmentation of these particular words, each associated with the corresponding entry word. For example, the string “vadare”, obtained by segmentation of “vado”, is present in the list and is associated with the entry word “andare”, which is added to the list of possible entry words of origin.
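The lookup of irregular roots amounts to a table consultation. In the sketch below the table contains only the pair given in the text, and the names IRREGULAR_ROOTS and resolve_irregular are hypothetical.

```python
# Illustrative table: segmented string -> actual entry word.
IRREGULAR_ROOTS = {"vadare": "andare"}


def resolve_irregular(possible_entries):
    """Add the real entry word for every segmented string found in the table."""
    resolved = list(possible_entries)
    for candidate in possible_entries:
        entry = IRREGULAR_ROOTS.get(candidate)
        if entry is not None and entry not in resolved:
            resolved.append(entry)
    return resolved
```

The segmented string itself is kept in the list in this sketch; being absent from the word-list, it would be discarded by the next step.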
Selection of existing entry words. At this point there is a list of possible entry words, some of which are nonsensical. This step eliminates the meaningless strings by removing those that are not present in the word-list, as well as those that are in the word-list but belong to a grammatical category different from the one detected during the removal of prefixes and suffixes. If the grammatical category of adverb was detected, the word is searched in the word-list as an adjective.
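The selection step can be sketched as a filter over (entry word, category) pairs. The word-list below is a toy example, the names are hypothetical, and the handling of candidates with no detected category (any category of the word-list is accepted) is an assumption.

```python
# Toy word-list: entry word -> set of grammatical categories.
WORD_LIST = {
    "bello": {"adjective"},
    "lento": {"adjective"},
    "cammino": {"noun"},
    "camminare": {"verb"},
}


def select_existing(candidates, word_list):
    """Keep only the (entry, category) pairs supported by the word-list."""
    kept = []
    for entry, category in candidates:
        known = word_list.get(entry, set())
        if category is None:
            # Assumption: no affix was removed, so any category is acceptable.
            kept.extend((entry, c) for c in sorted(known))
        else:
            # Adverbs derived from adjectives are looked up as adjectives.
            lookup = "adjective" if category == "adverb" else category
            if lookup in known:
                kept.append((entry, category))
    return kept
```

For the candidates “bella”, “belle” and “bello”, all tagged as adjectives, only “bello” is retained.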
Compatibility analysis. In this step each entry word obtained in the previous step is inflected. The step has the following inputs: the entry word with its associated grammatical category, the initial word (normalized and without suffixes and prefixes) and information about the removed suffixes (gender, number and a flag that specifies whether the suffix is “-one”). All inflections of the entry word are generated and compared with the initial word. For each inflected form that matches the word, the following checks are performed:
• If the word is a verb from which an enclitic has been removed, it can only be a present imperative, a present gerund, a present infinitive or a past participle, and it cannot be an intransitive verb.
• If the entry word has the grammatical category of adjective, but the removed suffix is adverbial, the initial word is accepted as an adverb.
• If the removed suffix has an associated gender and number, they must coincide with those of the generated inflected form.
• If the suffix is “-one” and the entry word is a feminine noun, the word can be an augmentative form that has changed its gender: in this case the result is a masculine noun.
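Two of the checks above (agreement of gender and number, and acceptance of adverbs derived from adjectives) can be illustrated on adjectives. The sketch relies on a toy inflector for adjectives ending in “-o”; both functions and their signatures are hypothetical, and the checks on enclitics and on the suffix “-one” are not covered.

```python
def inflect_adjective(entry):
    """Toy inflector: the four forms of an adjective ending in -o."""
    stem = entry[:-1]
    return {
        stem + "o": ("m", "sg"),
        stem + "a": ("f", "sg"),
        stem + "i": ("m", "pl"),
        stem + "e": ("f", "pl"),
    }


def analyze(entry, word, suffix_gender=None, suffix_number=None, adverbial=False):
    """Return the analyses of word that are compatible with the entry word."""
    results = []
    for form, (gender, number) in inflect_adjective(entry).items():
        if form != word:
            continue
        # Gender and number carried by the removed suffix must agree.
        if suffix_gender is not None and (gender, number) != (suffix_gender, suffix_number):
            continue
        if adverbial:
            results.append((entry, "adverb", None, None))
        else:
            results.append((entry, "adjective", gender, number))
    return results
```

With the entry word “bello”, the string “bella” and a removed suffix marked feminine singular, the sketch yields one analysis (adjective, feminine, singular); if the suffix were marked masculine, no analysis would be returned.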
IV. EXAMPLES
Consider the word “bellissima”. In the suffix removal step, the adjectival suffix “-issima” is detected and the string “bell” is added to the list of strings to be analyzed, with the grammatical category of adjective. The renormalization step adds the strings “bella” (obtained by adding the last character of the suffix), “belle” and “bello” to the list, all three with the grammatical category of adjective because they are derived from “bell”. The selection of existing entry words retains only the entry word “bello”, with the grammatical category of adjective. After the compatibility analysis, the result is that “bellissima” is derived from the entry word “bello”
with the suffix “-issima” and it is the feminine singular form.
Consider the word “cammino”. In the suffix removal step the substantival and adjectival suffix “-ino” is detected: the string “camm” is added twice to the list of strings to be analyzed, both as a noun and as an adjective. The renormalization step adds to the list the strings “cammo” (twice, both as a noun and as an adjective) and “camme” (twice, both as a noun and as an adjective). In the segmentation step many possible entry words are generated, then the selection of existing entry words retains the verb “camminare” and the masculine noun “cammino”. After the compatibility analysis, the result is that “cammino” can be the first-person singular of the present indicative of the verb “camminare” or the masculine singular of the noun “cammino”.
V. CONCLUSIONS AND FUTURE WORKS
The algorithm reduces the number of entries in the dictionary, which no longer needs to contain: verbs and adjectives starting with a prefix, adverbs that can be derived from adjectives, adjectives and nouns with suffixes, verbs with enclitics, the whole conjugation of verbs (57 verbal forms are obtained from a single verbal entry word), and all declensions of nouns and adjectives. Thanks to the developed morphological engine the search space is considerably reduced: the parser receives as input a vocabulary containing only the words of the phrase, which lowers its complexity.
The management of compound nouns can be improved, for example by adding new grammatical categories and by adding a new step in the algorithm of morphological analysis after the segmentation step to check if obtained entry words start with another word.
Another possible improvement concerns suffixes: many nouns and adjectives are not compatible with certain suffixes, so the algorithm can currently recognize words that do not exist. The problem can be solved by adding inflectional codes that indicate the set of suffixes compatible with each entry word. The same can be done for prefixes. The morphological engine can also be adapted to other Romance languages such as French.
ACKNOWLEDGMENT
The authors were supported in part by the Sicily Region grants PO FESR 2007/2013: “Rammar Sistema Cibernetico programmabile di interfacce a interazione verbale” PROGETTO POR 4.1.1.1 and “Improved Adaptive Testing with Language Interface Pack (IATLIP)” PROGETTO POR 4.1.1.2.