
Chapter 6 Algorithms for Lexicon Implementation

6.6 Conclusions

In this chapter, a few algorithms for lexicon implementation are discussed. The simple word list, because of its long search time, is not an acceptable solution for real MT applications. Table 6.2 shows the time efficiency of the hash table implementation of a lexicon. The average number of searches per word lookup for Urdu words ranges from 1.6 to 3.3 across different hash functions, which is acceptable and can be improved further by using a better collision resolution strategy than linear open addressing. The advantages of the hash table lexicon are fast access time, a lower requirement for morphological knowledge and easier inclusion of non-morphological word attributes. The only disadvantage is the larger space requirement, which is not a serious issue by current desktop computing standards.
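As an illustration of how the number of searches per lookup arises under linear open addressing, the following is a minimal sketch and not the implementation evaluated in Table 6.2. The class name, the toy hash function (sum of code points modulo the table size) and the sample entries are all assumptions made for the example.

```python
class LinearProbingLexicon:
    """Toy open-addressing hash table that counts probes per lookup."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        # Each slot holds a (word, attributes) pair, or None when empty.
        self.slots = [None] * capacity

    def _home(self, word):
        # Illustrative hash only (an assumption): sum of code points mod table size.
        return sum(ord(ch) for ch in word) % self.capacity

    def insert(self, word, attributes):
        index = self._home(word)
        for _ in range(self.capacity):
            entry = self.slots[index]
            if entry is None or entry[0] == word:
                self.slots[index] = (word, attributes)
                return
            index = (index + 1) % self.capacity  # linear probing: try the next slot
        raise RuntimeError("lexicon table is full")

    def lookup(self, word):
        """Return (attributes or None, number of slots probed)."""
        index = self._home(word)
        for probes in range(1, self.capacity + 1):
            entry = self.slots[index]
            if entry is None:
                return None, probes  # an empty slot ends an unsuccessful search
            if entry[0] == word:
                return entry[1], probes
            index = (index + 1) % self.capacity
        return None, self.capacity


def average_probes(lexicon, words):
    """Average number of probes over a list of lookups."""
    return sum(lexicon.lookup(w)[1] for w in words) / len(words)


# Hypothetical entries; "ab" and "ba" collide under the toy hash function.
lexicon = LinearProbingLexicon(capacity=8)
for word, attrs in [("ab", {"POS": "N"}), ("ba", {"POS": "V"}), ("cd", {"POS": "ADJ"})]:
    lexicon.insert(word, attrs)
```

In this sketch a collision simply moves the entry to the next free slot, so clustered keys lengthen later lookups; this is the behaviour that a better collision resolution strategy would aim to reduce.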

Tries and directed acyclic word graphs have search time proportional to the length of the word and have lower space requirements. Table 6.2 shows that the average Urdu word length is less than 15 characters; a successful word search therefore needs about 15 comparisons at most, while an unsuccessful search in these branching structures is even faster, because it stops at the first character that has no matching branch. These structures are useful for spell checking and automatic stemming applications. A lexicon implementation based on lexical transducers is best suited in terms of both search time and storage space. However, knowledge of stems and affixes, as well as of morphotactics, must be available through morphological analysis for the lexical transducer implementation.
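The claim that trie search time is proportional to word length, and that unsuccessful searches end early, can be seen in a minimal sketch such as the one below. It is an illustration only; the class names and the romanized sample words are assumptions, and the suffix sharing that distinguishes a directed acyclic word graph from a trie is not shown.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a lexicon word ends at this node


class TrieLexicon:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        """Return (found, number of character comparisons made)."""
        node = self.root
        steps = 0
        for ch in word:
            steps += 1
            node = node.children.get(ch)
            if node is None:
                return False, steps  # no branch for this character: stop early
        return node.is_word, steps


# Hypothetical romanized entries, used only to exercise the structure.
trie = TrieLexicon()
for w in ["kitab", "kitabein", "kam"]:
    trie.insert(w)
```

A successful search for a word of length n makes exactly n comparisons here, whereas a search for a string that leaves the trie after its second character makes only two.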


PART III

SYNTACTICAL ANALYSIS

AND


Chapter 7

MODELING URDU NOMINAL SYNTAX

BY IDENTIFYING

CASE MARKERS AND POSTPOSITIONS

In Chapter 3 and Chapter 4, the morphology of various verb, noun and adjective forms in Urdu was analyzed, and the attributes associated with different morphemes were listed. These lexical attributes obtained through morphology are very useful for syntactic analysis based on ‘Lexical Functional Grammar’ (LFG). In the approach used in this research, morphological variations are handled by finite state transducers (Karttunen 1994; Beesley and Karttunen 2003). Given the various word forms, the finite state transducers extract useful grammatical information from the word morphemes. In LFG, these lexical attributes become feature-value pairs at the feature-structure level. A mapping table is used to assign values to syntactic attributes from the tags produced by the finite state transducers. For example, the GEND attribute gets the value MASC or FEM if the finite state tags for the word under consideration include +Masc or +Fem, respectively. When constituent-structure nodes unify, the attributes at the leaf nodes, which are obtained from lexical entries, are unified to generate the overall f-structure.
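The mapping from finite state tags to feature-value pairs, and the unification of the resulting attributes, can be sketched as follows. This is a simplified illustration and not the grammar's actual mapping table: only the +Masc/+Fem to GEND correspondence comes from the text above, while the +Sg/+Pl entries, the function names and the restriction to flat (non-nested) feature structures are assumptions.

```python
# Tag-to-feature mapping table. Only the GEND rows are taken from the text;
# the NUM rows are hypothetical additions for the example.
TAG_TO_FEATURE = {
    "+Masc": ("GEND", "MASC"),
    "+Fem": ("GEND", "FEM"),
    "+Sg": ("NUM", "SG"),
    "+Pl": ("NUM", "PL"),
}


def tags_to_features(tags):
    """Convert a list of finite state tags into a flat feature structure."""
    features = {}
    for tag in tags:
        if tag in TAG_TO_FEATURE:  # tags without a table entry are ignored
            attr, value = TAG_TO_FEATURE[tag]
            features[attr] = value
    return features


def unify(f1, f2):
    """Unify two flat feature structures; return None on a value clash."""
    result = dict(f1)
    for attr, value in f2.items():
        if attr in result and result[attr] != value:
            return None  # conflicting values cannot be unified
        result[attr] = value
    return result


noun_features = tags_to_features(["+Noun", "+Fem", "+Sg"])
```

Unifying `noun_features` with a compatible structure such as `{"NUM": "SG"}` succeeds, while unifying it with `{"GEND": "MASC"}` fails. Full LFG f-structures are nested, so a complete implementation would apply the same check recursively.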

In this chapter, the NP structure is analyzed and its syntactic combination with various case markers and postpositions in Urdu is distinguished. A noun phrase (NP) in Urdu is characterized by a rich case-marking system, which makes free phrase order possible. Case markers and postpositions are similar in nature, and it is not easy to find a definition that clearly separates the two. This chapter introduces an approach to distinguish the various classes of case markers and postpositions. The term ‘case marker’ or ‘case clitic’ is generally used for a word that appears with a noun or a noun phrase such that the resultant phrase is a case-marked noun phrase. With a postposition, in contrast, the resultant phrase is a postpositional phrase that acts as an adjunct to the verb phrase. Some terms that are referred to in this chapter are defined below.

Transitivity refers to the number of objects a verb requires or takes in a grammatically well-formed clause or sentence. The argument structure of a verb always contains a subject and zero, one or two objects; transitivity refers only to the objects present in that argument structure. In grammar modeling theories like X-bar and HPSG, a subject is treated as a specifier of the verb, while the object noun phrases appear in complement position. Urdu, in contrast, has a flat phrase structure with a rich case-marking system, which allows a relatively free order of the daughter phrases of the sentence, and the verb is a sister to the subject noun phrase. The specifier and verb phrase thus do not appear in Urdu as they do in English.

Valency refers to the total number of arguments controlled by a predicate. Thus verb valency counts all the arguments of the verb, including the subject, objects, oblique case-marked noun phrases and complement phrases. Valency is more relevant to the analysis of Urdu verb argument structures presented in this chapter for causative verbs and for other cases that are marked with the marker ‘sey’.

Thematic role is the semantic relationship between a predicate (e.g. a verb) and an argument (e.g. a noun phrase) of a sentence. Different sets of thematic roles are proposed in the literature, and authors do not agree on a single inventory. The more widely used thematic roles are briefly reviewed here.

Agent is the one who deliberately performs the action, the one who is the principal cause of the action and/or the one who controls the event, e.g., ‘Hamid ate the apple’. Experiencer is the one who is affected by sensory, emotional or abstract input, or the one who participates in the event unconsciously, e.g., ‘Anjom is shocked’, and ‘Hamid fears heights’. Beneficiary is the one who benefits from the action, e.g., ‘The teacher teaches Anjom’, and ‘The teacher gave Anjom the book’. Theme or Patient is the role of the undergoer of an action, e.g., ‘The boy crushed the snake’, and ‘The teacher gave Anjom the book’. Instrument is a thing used to carry out the action, e.g., ‘Hamid cut the apple with the knife’. Location is the place in space and time where the action occurs, e.g., ‘Hamid plays cricket in the park’. Goal is the person or place towards which the action is directed, e.g., ‘Hamid is going to the school’, and ‘He writes a letter to her’. Source is the person or place from which the action is initiated, e.g., ‘The rain is coming from the west’, and ‘He received a letter from the principal’.

Thematic hierarchy presents the relative prominence of the various thematic roles. The ‘>’ sign means that the role on the left side is more prominent than the role on the right side. There are variations in the literature; the more widely accepted hierarchy (Bresnan 2001; Dalrymple 2001) is given in (94).

(94) agent > beneficiary > experiencer/recipient > instrument > patient/goal/theme > locative
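The hierarchy in (94) can be encoded as an ordered list of levels, which makes prominence comparisons mechanical. The sketch below is only an illustration; the function names are hypothetical, and treating roles joined by ‘/’ as equally prominent is an assumption about how (94) is to be read.

```python
# Levels of the thematic hierarchy in (94), most prominent first.
# Roles written with '/' in (94) are placed on the same level (an assumption).
THEMATIC_HIERARCHY = [
    {"agent"},
    {"beneficiary"},
    {"experiencer", "recipient"},
    {"instrument"},
    {"patient", "goal", "theme"},
    {"locative"},
]


def rank(role):
    """Return the level of a role; a lower number means more prominent."""
    for level, roles in enumerate(THEMATIC_HIERARCHY):
        if role in roles:
            return level
    raise KeyError(f"role not in hierarchy: {role}")


def most_prominent(roles):
    """Return the most prominent role among those given."""
    return min(roles, key=rank)
```

For example, among the roles theme, agent and instrument, `most_prominent` selects agent, and `rank("goal")` equals `rank("theme")` because both sit on the same level.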

These thematic roles are mapped to the grammatical functions in the argument structure of verbs. The mapping of grammatical functions and thematic roles is called

details (Butt 2005); however, agent and experiencer roles are usually mapped to subjects; patient and theme roles are mapped to objects; and goal and beneficiary roles are mapped to indirect objects. Locative, instrument, source and goal roles fill oblique arguments or are attached as adjuncts, as summarized below.

subject – agent, experiencer

object – patient, theme

indirect object – goal, beneficiary

oblique arguments – instrument, locative, source, goal
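The summary above can be held in a small lookup table, from which the candidate grammatical functions of a given thematic role are read off. This is a sketch of the usual tendencies listed above, not a linking algorithm; the dictionary and function names are hypothetical.

```python
# Usual correspondences between grammatical functions and thematic roles,
# copied from the summary above.
FUNCTION_TO_ROLES = {
    "subject": ["agent", "experiencer"],
    "object": ["patient", "theme"],
    "indirect object": ["goal", "beneficiary"],
    "oblique argument": ["instrument", "locative", "source", "goal"],
}


def candidate_functions(role):
    """Return the grammatical functions that a thematic role usually fills."""
    return [fn for fn, roles in FUNCTION_TO_ROLES.items() if role in roles]
```

The goal role appears under two functions, so `candidate_functions("goal")` returns both the indirect object and the oblique argument; choosing between them requires the verb-specific information discussed in the rest of the chapter.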

This chapter presents data and analysis to show that the role of the case marker ‘sey’ is quite diverse and that it takes on various grammatical functions or thematic roles in the argument structure of different verbs. The role of ‘sey’ is described as versatile, and it is treated as the ‘instrumental case’, which adopts different roles (Mohanan 1990; Butt and King 2002). The marker ‘sey’ marks subjects, objects, instruments, time and space nouns, postpositional phrases, adverbial phrases, etc. The analysis presented in this chapter shows that semantic considerations simplify the classification of these roles. It is also shown that the marker ‘sey’ marks ‘indirect subjects’ for causative form 2 verbs. At the end, the chapter includes evidence of Urdu tetravalent causative verbs and presents a model for handling them.