A morphological engine for Italian language
III. THE MORPHOLOGICAL ENGINE
The morphological engine is developed in C++ and has three components: a word-list, a morphological generator and a morphological analyzer.
A. The word-list
The word-list contains a set of entry words. Each entry word is associated with a grammatical category that encodes its part of speech and the way it forms inflections.
For verbs there are four grammatical categories: VI (intransitive verb), VT (transitive verb), VA (auxiliary verb) and VS (modal verb). If a verb is irregular, a three-digit code, called the inflectional code, is appended to the category; it is used to obtain the irregular inflections.
For nouns there are the following grammatical categories:
• SN: Neuter nouns that can generate four inflectional forms (e.g. “maestro”, “maestra”, “maestri”, “maestre”).
• SNFI: Neuter nouns with invariable feminine form (e.g. “cantante”, “cantanti”).
• SM: Masculine nouns that can generate two inflectional forms, one for masculine singular form and one for masculine plural form (e.g. “gelato”, “gelati”).
• SF: Feminine nouns, that can generate two inflectional forms, one for feminine singular form and one for feminine plural form (e.g. “oliva”, “olive”).
• SINVM: Invariable masculine nouns that do not change in the plural form (e.g. “bar”).
• SINVF: Invariable feminine nouns that do not change in the plural form (e.g. “crisi”).
• SPRN: Proper neuter nouns that are proper masculine nouns from which the feminine form can be derived (e.g. “Roberto”, “Roberta”).
• SINVPRF: Proper feminine nouns (e.g. “Alice”).
• SINVPRM: Proper masculine nouns (e.g. “Matteo”).
• SIRR: Irregular nouns, not following inflectional rules; a list of irregular entry words to obtain the irregular forms must be consulted.
• SMI: Masculine nouns that become feminine in the plural form (e.g. “paio”, “paia”).
• SMI2P: Masculine nouns with two plurals, one masculine and one feminine (e.g. “braccio” becomes “bracci” and “braccia”).
• SMICI: Masculine nouns that end in “-co” and have the plural in “-ci” and not in “-chi” (e.g. “medico”, “medici”).
• SFI: Particular feminine nouns (e.g. “energia” becomes “energie” and not “energe”).
• SNCI: Neuter nouns that end in “-co” and have the masculine plural in “-ci” and not in “-chi” (e.g. “amico”, “amici”).
• SNII: Neuter nouns that have the plural with two “i” (e.g. “zio”, “zii”).
• SNI: Neuter nouns with a particular feminine form (e.g. “leone”, “leonessa”).
• SMC1: Compound nouns.
For adjectives there are the following grammatical categories:
• AINV: Invariable adjectives that do not change in other inflectional forms (e.g. “loro”).
• AN: Neuter adjectives that can generate four inflectional forms (e.g. “vario”).
• AINVNUM: Cardinal number adjectives.
• ANNUM: Ordinal number adjectives that can generate four inflectional forms (e.g. “secondo”).
• AIRR: Irregular adjectives, not following inflectional rules; a list of irregular entry words to obtain the irregular forms must be consulted.
• AI1: Adjectives with particular inflections (e.g. “mio”, “mia”, “miei”, “mie”).
• AI2: Adjectives with particular inflections (e.g. “tuo”, “tua”, “tuoi”, “tue”).
• AI3: Adjectives with particular inflections (e.g. “bello”, “bella”, “belli”, “bei”, “begli”, “belle”).
• ANCI: Adjectives that end in “-co” and have the masculine plural in “-ci” and not in “-chi” (e.g. “economico”, “economici”).
• ANII: Adjectives that have the masculine plural with two “i” (e.g. “pio”, “pii”).
For the other parts of speech there are the following grammatical categories: E (prepositions), C (conjunctions), B (adverbs), R (articles) and P (pronouns). Entry words are written in lower-case letters; accents are represented by the quote character ['] and spaces between the words of a multi-word entry are represented by the underscore [_]. The word-list can also contain identical entry words with different grammatical categories.
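The word-list conventions described above can be illustrated with a minimal C++ sketch. The type alias `WordList` and the functions `categoriesOf`, `splitVerbCategory` and `displayForm` are hypothetical names introduced here for illustration; the actual storage format of the engine's word-list is not specified in this section. The sketch assumes that a verb category starts with “V” and that the inflectional code, when present, consists of its last three digits.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory word-list. The same entry word may appear with
// several grammatical categories, hence a multimap (entry -> category).
using WordList = std::multimap<std::string, std::string>;

// Returns every grammatical category recorded for an entry word.
std::vector<std::string> categoriesOf(const WordList& list, const std::string& entry) {
    std::vector<std::string> result;
    const auto range = list.equal_range(entry);
    for (auto it = range.first; it != range.second; ++it) {
        result.push_back(it->second);
    }
    return result;
}

// Splits a verb category such as "VT024" into the category proper ("VT")
// and the three-digit inflectional code ("024"). For regular verbs and for
// non-verb categories the code is returned empty.
std::pair<std::string, std::string> splitVerbCategory(const std::string& category) {
    const std::size_t n = category.size();
    const auto isDigit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };
    if (n > 3 && category[0] == 'V' && isDigit(category[n - 3]) &&
        isDigit(category[n - 2]) && isDigit(category[n - 1])) {
        return {category.substr(0, n - 3), category.substr(n - 3)};
    }
    return {category, std::string()};
}

// Converts the stored form of an entry into a readable one by turning
// underscores back into spaces. Accents (the quote character) are left as is.
std::string displayForm(std::string entry) {
    for (char& c : entry) {
        if (c == '_') c = ' ';
    }
    return entry;
}
```

With these helpers, `splitVerbCategory("VT024")` yields the pair “VT” and “024”, while a category such as “SMC1” is returned unchanged with an empty code.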
B. The morphological generator
The morphological generation algorithm produces the inflected forms of an entry word according to a set of rules. To inflect a verb, the last three characters of the entry word are examined to determine the verbal group code: if the verb ends in “-are”, the inflections of the first conjugation are applied; if it ends in “-ere” or “-rre”, those of the second conjugation; if it ends in “-ire”, those of the third conjugation. Nouns and adjectives are inflected according to their grammatical category. Each grammatical category has several sets of inflectional rules, selected by the last characters of the entry word, which determine the noun group code or the adjectival group code. In general, each group code associates a grammatical category and an ending substring with the specific inflections to be applied. Each inflection carries the following information: mood, tense, person, number, gender, the characters of the desinence, the characters to remove from the entry word before applying the desinence, the number of characters to remove from the entry word, a flag indicating whether the entire entry word must be replaced to obtain the inflected form, a flag indicating whether the regular inflection is also valid and, finally, a string of instructions describing the actions that produce the inflection, in the following format (1):
*N>C,-M+S; (1)
where N is the position of the character to substitute, C is the character that replaces it, M is the number of characters to delete from the end of the entry word and S is the string of characters to add. Different instructions can correspond to the same grammatical category, depending on the last characters of the entry word. For example, consider the entry word “altoforno”, which belongs to grammatical category “SMC1”. If the entry word ends in “-o”, the following instructions are associated with this category: MS.-1+o;MP.*4>i,-1+i; this means that the masculine singular form is obtained by deleting the last character and adding “o” (“altoforno”), and the masculine plural form is obtained by substituting the fourth character with “i” and then deleting the last character and adding “i” (“altiforni”).
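As an illustration of format (1), the following C++ sketch interprets an instruction string and applies it to an entry word. The function names `applyInstruction` and `generateForms` are hypothetical. The sketch assumes that N is 1-based (which is consistent with the “altoforno” example), that C is a single character, that the substitution part is optional, and that the parts of an instruction are applied from left to right. Malformed numbers are not handled and would raise an exception from `std::stoul`.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Applies one instruction body such as "*4>i,-1+i" or "-1+o" to a word.
std::string applyInstruction(std::string word, const std::string& body) {
    std::istringstream parts(body);
    std::string part;
    while (std::getline(parts, part, ',')) {
        if (part.empty()) continue;
        if (part[0] == '*') {
            // "*N>C": replace the N-th character (1-based) with C.
            const std::size_t gt = part.find('>');
            if (gt == std::string::npos || gt < 2 || gt + 1 >= part.size()) continue;
            const std::size_t pos = std::stoul(part.substr(1, gt - 1));
            if (pos >= 1 && pos <= word.size()) word[pos - 1] = part[gt + 1];
        } else if (part[0] == '-') {
            // "-M+S": delete M characters from the end, then append S.
            const std::size_t plus = part.find('+');
            const std::string digits =
                part.substr(1, plus == std::string::npos ? std::string::npos : plus - 1);
            if (digits.empty()) continue;
            const std::size_t count = std::stoul(digits);
            word.erase(word.size() - std::min(count, word.size()));
            if (plus != std::string::npos) word += part.substr(plus + 1);
        }
    }
    return word;
}

// Splits a rule string such as "MS.-1+o;MP.*4>i,-1+i;" into its tagged
// instructions and returns a map from the gender/number tag to the form.
std::map<std::string, std::string> generateForms(const std::string& entry,
                                                 const std::string& rules) {
    std::map<std::string, std::string> forms;
    std::istringstream items(rules);
    std::string item;
    while (std::getline(items, item, ';')) {
        const std::size_t dot = item.find('.');
        if (dot == std::string::npos) continue;
        forms[item.substr(0, dot)] = applyInstruction(entry, item.substr(dot + 1));
    }
    return forms;
}
```

Calling `generateForms("altoforno", "MS.-1+o;MP.*4>i,-1+i;")` returns “altoforno” for the tag MS and “altiforni” for the tag MP, matching the example in the text.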
Irregular inflections of nouns and adjectives are obtained from a list that associates each irregular entry word with its inflections, indicating the gender and number of each one. For example, consider the entry word “bue”, which belongs to grammatical category “SIRR”; in the list of irregular entry words the following instructions are associated with “bue”: MS-bue,MP-buoi; this means that the masculine singular form is “bue” and the masculine plural form is “buoi”.
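A record of the irregular list, such as the one shown for “bue”, could be read with a few lines of C++. This is only a sketch: the function name `parseIrregular` is hypothetical, and it assumes that the gender/number tag never contains a hyphen, so that each item can be split at its first “-”.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Parses an irregular-entry record such as "MS-bue,MP-buoi" into a map
// from the gender/number tag ("MS", "MP", ...) to the inflected form.
std::map<std::string, std::string> parseIrregular(const std::string& record) {
    std::map<std::string, std::string> forms;
    std::istringstream items(record);
    std::string item;
    while (std::getline(items, item, ',')) {
        const std::size_t dash = item.find('-');
        if (dash == std::string::npos) continue;  // skip malformed items
        forms[item.substr(0, dash)] = item.substr(dash + 1);
    }
    return forms;
}
```

For the record MS-bue,MP-buoi the map contains “bue” under MS and “buoi” under MP.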
If the verb is irregular, the inflectional code is examined: each inflectional code is associated with a combination of groups of irregular inflections, and each group has its own code. A list of elements called sets of rules associates a code with a list of inflectional rules. Each inflectional rule specifies the number of characters to delete from the end of the entry word and indicates whether the regular inflection is also valid. In general a set of rules has the following format (2):
C.D1-N1±, D2-N2±, …, Dn-Nn±; (2)
in which C is the code, D is a number that identifies a set of desinences, N is the number of characters to delete from the end of the entry word and ± is “+” if the regular forms are also valid, “-” otherwise. For example, consider the entry word “scrivere”, which belongs to grammatical category “VT024”. Its inflectional code is 024, which corresponds to the set of rules 024.604-4-,424-4-; this means that the desinences with code 604 and those with code 424 are both applied after deleting 4 characters from the entry word. In this case the regular forms are not valid.
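Format (2) can be parsed as in the following C++ sketch. The structures `IrregularRule` and `RuleSet` and the functions `parseRuleSet` and `stemFor` are hypothetical names chosen for illustration. The sketch tolerates spaces after the commas, as they appear in (2), and assumes that D and N are plain decimal numbers; `std::stoi` throws on anything else.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct IrregularRule {
    int desinenceSet;       // D: identifies a set of desinences
    int charsToDelete;      // N: characters to delete from the end of the entry
    bool regularAlsoValid;  // true for '+', false for '-'
};

struct RuleSet {
    std::string code;                  // C: the inflectional code
    std::vector<IrregularRule> rules;  // one element per "D-N±" item
};

// Parses a set of rules such as "024.604-4-,424-4-;".
RuleSet parseRuleSet(const std::string& text) {
    RuleSet set;
    const std::size_t dot = text.find('.');
    if (dot == std::string::npos) return set;
    set.code = text.substr(0, dot);
    const std::size_t end = text.find(';', dot);
    std::istringstream items(text.substr(
        dot + 1, end == std::string::npos ? std::string::npos : end - dot - 1));
    std::string item;
    while (std::getline(items, item, ',')) {
        // Trim the optional spaces around each item.
        const std::size_t first = item.find_first_not_of(' ');
        if (first == std::string::npos) continue;
        item = item.substr(first, item.find_last_not_of(' ') - first + 1);
        const std::size_t dash = item.find('-');
        if (dash == std::string::npos || item.size() < dash + 3) continue;
        const char flag = item.back();
        if (flag != '+' && flag != '-') continue;
        IrregularRule rule;
        rule.desinenceSet = std::stoi(item.substr(0, dash));
        rule.charsToDelete = std::stoi(item.substr(dash + 1, item.size() - dash - 2));
        rule.regularAlsoValid = (flag == '+');
        set.rules.push_back(rule);
    }
    return set;
}

// Returns the part of the entry word that is kept before a desinence is added.
std::string stemFor(const std::string& entry, const IrregularRule& rule) {
    const std::size_t n = static_cast<std::size_t>(rule.charsToDelete);
    return n >= entry.size() ? std::string() : entry.substr(0, entry.size() - n);
}
```

For the set of rules of “scrivere”, the parser returns the code “024” and two rules (604 and 424), both deleting 4 characters and both with the regular forms marked as not valid; `stemFor` then leaves “scri” as the stem to which the desinences are attached.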
Inflection creation. The morphological generator looks up the entry word in the word-list and retrieves its grammatical category.
If it is a verb, the last three characters of the entry word are analyzed to derive the associated regular inflections, and the inflectional code, if present, is extracted. If the verb is regular, only the inflections associated with the verbal group code are applied. If there is an inflectional code, the irregular inflections are also derived, using the rules explained above.
If it is a noun or an adjective, the grammatical category is analyzed: if it is “SIRR” (for nouns) or “AIRR” (for adjectives), the list of irregular entry words is consulted and the inflections are derived from it; otherwise the rules associated with the group code are applied.
If it is a preposition, a conjunction, an adverb, an article or a pronoun, the algorithm outputs the entry word unchanged.
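The selection of the verbal group from the last three characters of the entry word can be sketched as follows. The enumeration `Conjugation` and the function `verbGroup` are hypothetical names; the engine's actual verbal group codes are not given in this section.

```cpp
#include <string>

enum class Conjugation { First, Second, Third, Unknown };

// Determines the conjugation from the last three characters of the infinitive:
// "-are" -> first, "-ere" or "-rre" -> second, "-ire" -> third.
Conjugation verbGroup(const std::string& infinitive) {
    if (infinitive.size() < 3) return Conjugation::Unknown;
    const std::string ending = infinitive.substr(infinitive.size() - 3);
    if (ending == "are") return Conjugation::First;
    if (ending == "ere" || ending == "rre") return Conjugation::Second;
    if (ending == "ire") return Conjugation::Third;
    return Conjugation::Unknown;
}
```

With this function, “sognare” is assigned to the first conjugation and “scrivere” to the second.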
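The choice among the cases described above can be summarized in a small dispatch function. This is a sketch under stated assumptions: the enumeration `Strategy` and the function `chooseStrategy` are hypothetical, and the sketch assumes that verb, noun and adjective categories can be recognized by their first letter (“V”, “S” and “A” respectively), as the category names listed in the word-list section suggest.

```cpp
#include <string>

enum class Strategy {
    VerbRules,       // verbal group code, plus inflectional code if present
    IrregularList,   // SIRR and AIRR: consult the list of irregular entries
    GroupCodeRules,  // other nouns and adjectives: rules of the group code
    Unchanged,       // E, C, B, R, P: the entry word is output as is
    Unknown
};

// Chooses how an entry word is inflected, given its grammatical category.
Strategy chooseStrategy(const std::string& category) {
    if (category.empty()) return Strategy::Unknown;
    if (category == "SIRR" || category == "AIRR") return Strategy::IrregularList;
    if (category == "E" || category == "C" || category == "B" ||
        category == "R" || category == "P") {
        return Strategy::Unchanged;
    }
    switch (category[0]) {
        case 'V': return Strategy::VerbRules;
        case 'S':
        case 'A': return Strategy::GroupCodeRules;
        default:  return Strategy::Unknown;
    }
}
```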
Orthographic rules. When desinences are combined with the roots, the orthographic rules of the Italian language are respected.
When the entry word is a noun or an adjective:
• If the root ends in “-c” or “-g”, the first character of the desinence of the canonical form is “-a”, “-o” or “-u” and the first character of the desinence to be applied is “-e” or “-i”, the character “h” is inserted between the root and the desinence to maintain the hardness of the sound (e.g. “oca” - “oche”).
• If the root ends in “-c” or “-g”, the first character of the desinence of the canonical form is “-e” or “-i” and the first character of the desinence to be applied is “-a”, “-o” or “-u”, the character “i” is inserted between the root and the desinence to maintain the softness of the sound.
• If the root ends in “-i” and the desinence to be applied is “-i”, the desinence is not added (e.g. “bacio” - “baci”).
• If the root ends in “-ci” or “-gi” and the first character of the desinence to be applied is “-e”, the vowel “i” is removed if “-ci” or “-gi” is preceded by a consonant (e.g. not in “ciliegia”, yes in “frangia”).
Words which do not follow these rules are managed by assigning appropriate grammatical categories, e.g. the category SNII nullifies the third rule and category SMICI nullifies the fourth.
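The four orthographic rules for nouns and adjectives can be sketched in C++ as a single joining function. The name `joinNominal` and its parameters are hypothetical: `root` is the entry word without its canonical desinence, `canonical` is the desinence of the canonical form and `desinence` is the one to be applied. The exceptions handled through dedicated grammatical categories are deliberately not modeled.

```cpp
#include <cstddef>
#include <string>

namespace {
bool isVowel(char c) {
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}
}  // namespace

// Joins a nominal root with a desinence, applying the four rules above.
std::string joinNominal(std::string root, const std::string& canonical,
                        const std::string& desinence) {
    if (root.empty() || desinence.empty()) return root + desinence;
    const char last = root.back();
    const char next = desinence.front();
    const char prev = canonical.empty() ? '\0' : canonical.front();
    const bool hardBefore = prev == 'a' || prev == 'o' || prev == 'u';
    const bool softBefore = prev == 'e' || prev == 'i';
    const bool hardAfter = next == 'a' || next == 'o' || next == 'u';
    const bool softAfter = next == 'e' || next == 'i';

    if (last == 'c' || last == 'g') {
        // Rule 1: keep the hard sound by inserting "h".
        if (hardBefore && softAfter) return root + "h" + desinence;
        // Rule 2: keep the soft sound by inserting "i".
        if (softBefore && hardAfter) return root + "i" + desinence;
    } else if (last == 'i') {
        // Rule 3: a desinence "i" is not added after a root ending in "i".
        if (desinence == "i") return root;
        // Rule 4: before "e", drop the "i" of "-ci"/"-gi" after a consonant.
        const std::size_t n = root.size();
        if (next == 'e' && n >= 3 && (root[n - 2] == 'c' || root[n - 2] == 'g') &&
            !isVowel(root[n - 3])) {
            root.pop_back();
        }
    }
    return root + desinence;
}
```

On the examples from the text, the root “oc” (from “oca”) with desinence “e” gives “oche”, “baci” (from “bacio”) with “i” gives “baci”, “frangi” with “e” gives “frange” and “ciliegi” with “e” gives “ciliegie”.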
When the entry word is a verb:
• The vowel “i” is removed from the desinence if all of the following conditions hold:
    • The root ends in “-gn”.
    • The first character of the desinence to be applied is “-i”.
    • The second character of the desinence to be applied is a vowel.
    • The form to be generated is not the first-person plural of the present indicative.
    • The form to be generated is not the first-person plural of the present subjunctive.
    • The form to be generated is not the first-person plural of the present imperative.
    • The form to be generated is not the second-person plural of the present subjunctive.
  For example, “sogniamo”, from the entry word “sognare”, keeps the “i”.
• If the root ends in “-hi” or “-li” and the first character of the desinence to be applied is “-i”, that character is removed from the desinence (e.g. “macchi” from the entry word “macchiare”).
• If the root ends in “-ci” and the first character of the desinence to be applied is “-e” or “-i”, the “i” is omitted (e.g. “mangera'” from the entry word “mangiare”).
• If the root ends in “-i” and the desinence is “-i”, the desinence is omitted (e.g. “scoppi” from the entry word “scoppiare”).
• If the root ends in “-c” or “-g”, the first character of the desinence of the canonical form is “-a”, “-o” or “-u” and the first character of the desinence to be applied is “-e” or “-i”, the character “h” is added between the root and the desinence to maintain the hardness of the sound (e.g. “bivacchi” from the entry word “bivaccare”).
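The orthographic rules for verbs can be sketched in the same way. All names below (`Mood`, `Slot`, `exemptFromGnRule`, `joinVerbal`) are hypothetical. Two assumptions go beyond the text and should be read as such: roots ending in “-gi” are treated like roots ending in “-ci”, because the example “mangiare” has a root in “-gi”; and for that rule the sketch drops the final “i” of the root, which reproduces “mangera'”. Accents follow the word-list convention and are written with the quote character.

```cpp
#include <string>

enum class Mood { Indicative, Subjunctive, Imperative, Other };

// Describes the form being generated (only what the rules below need).
struct Slot {
    Mood mood;
    bool present;  // true for the present tense
    int person;    // 1, 2 or 3
    bool plural;
};

namespace {
bool hasSuffix(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}
bool isVowel(char c) {
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}
}  // namespace

// True for the forms that keep the "i" after a root ending in "-gn".
bool exemptFromGnRule(const Slot& s) {
    if (!s.present || !s.plural) return false;
    if (s.person == 1) {
        return s.mood == Mood::Indicative || s.mood == Mood::Subjunctive ||
               s.mood == Mood::Imperative;
    }
    return s.person == 2 && s.mood == Mood::Subjunctive;
}

// Joins a verbal root with a desinence, applying the rules listed above.
std::string joinVerbal(std::string root, const std::string& canonical,
                       std::string desinence, const Slot& slot) {
    if (root.empty() || desinence.empty()) return root + desinence;
    const char first = desinence[0];
    const bool softAfter = first == 'e' || first == 'i';
    if (hasSuffix(root, "gn")) {
        if (first == 'i' && desinence.size() > 1 && isVowel(desinence[1]) &&
            !exemptFromGnRule(slot)) {
            desinence.erase(0, 1);  // drop the "i" of the desinence
        }
    } else if (hasSuffix(root, "hi") || hasSuffix(root, "li")) {
        if (first == 'i') desinence.erase(0, 1);
    } else if (hasSuffix(root, "ci") || hasSuffix(root, "gi")) {  // "-gi": assumption
        if (softAfter) root.pop_back();  // drop the "i" of the root
    } else if (root.back() == 'i') {
        if (desinence == "i") desinence.clear();
    } else if (root.back() == 'c' || root.back() == 'g') {
        const char canon = canonical.empty() ? '\0' : canonical[0];
        if ((canon == 'a' || canon == 'o' || canon == 'u') && softAfter) root += 'h';
    }
    return root + desinence;
}
```

On the examples from the text, the sketch produces “sogniamo” for the first-person plural of the present indicative of “sognare”, “macchi” from “macchiare”, “mangera'” from “mangiare”, “scoppi” from “scoppiare” and “bivacchi” from “bivaccare”.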