

3 Indonesian data

We wrote a Python script to extract SVCs from a Sherlock Holmes short story and a Japanese short story in the Indonesian database of the Nanyang Technological University Multilingual Corpus (NTU-MC) (Tan and Bond 2012). The NTU-MC is a parallel English-Chinese-Japanese-Indonesian corpus containing 2,975 Indonesian sentences from three sources: the Singapore Tourism Board website (www.yoursingapore.com), the Sherlock Holmes short story “The Adventure of the Speckled Band”, and “The Spider’s Thread”, a Japanese short story by Akutagawa Ryunosuke. The two short stories were originally written in English and Japanese, respectively, and translated into standard Indonesian. Both the original texts and the translations are part-of-speech (POS) tagged.

Papers from Chula-ISSSEAL – Moeljadi and Ow

The NTU-MC data is organized according to sentence IDs (SID), word IDs (WID) for V1, WID for V2, V1, V2, parts-of-speech (POS), and sentence. After extracting all possible Indonesian SVCs, we used the suggested PARSEME annotation guidelines to determine whether each extracted candidate is in fact an SVC. SVCs following the patterns we observed were then automatically tagged, and the remaining complex verbs were manually tagged. The automatic tags were then checked by hand for errors.
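The record layout described above can be pictured as a small data structure. The following is a minimal sketch only: the class name, the field names, and the WID values in the example are hypothetical and are not taken from the corpus or from the script.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Candidate:
    """One extracted V1+V2 candidate (illustrative field names)."""
    sid: int               # sentence ID (SID)
    wid_v1: int            # word ID (WID) of V1
    wid_v2: int            # word ID (WID) of V2
    v1: str                # surface form of V1
    v2: str                # surface form of V2
    pos: Tuple[str, str]   # POS tags of V1 and V2
    sentence: str          # the sentence containing the candidate


# The SID is the one given for Example (11); the WIDs are placeholders.
example = Candidate(
    sid=10417,
    wid_v1=1,
    wid_v2=2,
    v1="mencoba",
    v2="membuka",
    pos=("V", "V"),
    sentence="…Holmes …mencoba membuka palang itu…",
)
```

Keeping both WIDs in the record makes it easy to confirm that V1 and V2 are adjacent in the sentence.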

3.1 Extracting SVCs

The PARSEME annotation guidelines are state-of-the-art for annotation of verbal multi-word expressions (Candito et al. 2016). They are written with the assumption that a person, not a computer, is doing the annotation. The annotators can use the guidelines to write scripts to automate the annotation. Since PARSEME does not have specific guidelines for Indonesian, we modified the guidelines for English and wrote suggested PARSEME guidelines for the extraction and identification of Indonesian SVCs based on the findings in Section 3. The guidelines describe a series of tests to identify and classify Indonesian SVCs (see Table 1).

1. If V1 is a control verb like ingin “desire”, mau “want”, mencoba “try”, tahu “know”, and mampu “be able to”, and the meaning of V1+V2 is compositional, V1+V2 is a control SVC.

2. If V1 is a raising verb like terlihat “seems”, terasa “feels like”, and tampak “looks”, and the meaning of V1+V2 is compositional, V1+V2 is a raising SVC.

3. If V2 expresses the manner of V1, and V1 is the head of the SVC, V1+V2 is a manner SVC.

4. If V2 indicates the purpose of V1, V1 is the head of the SVC, and V1 and V2 are bound by a temporal sequence (V1 happens before V2), V1+V2 is a purpose SVC.

5. If V1 and V2 occur rapidly and seem to be simultaneous actions, and the meaning of V1+V2 is compositional, V1+V2 is a coordinated action SVC.

6. If V2 happens before V1 and V1 is the head of the SVC, V1+V2 is a source SVC.

7. If V1 indicates an action, V2 is the result of V1 (V1 happens before V2), and V1 is the head of the SVC, V1+V2 is a resultative SVC.

8. If untuk “for/to” can be inserted between V1 and V2, V1+V2 is a control or purpose SVC, where control SVCs are not temporally constrained, but purpose SVCs require the V1 to occur before V2.

9. If a pronoun like dia “s/he” can be inserted between V1 and V2, as in Example (9), V1+V2 is NOT an SVC because it does not fulfill two of the three criteria for prototypical SVCs, i.e. occurring contiguously and encompassing a single intonation unit, as mentioned in Section 1.

(9) Dia mengatakan (dia) mau ke kota…

3SG meN-say (3SG) want to city

‘He says he wants to go to the city...’ (SID:10255)

10. If the complementiser bahwa “that” or apakah “whether” can be inserted between V1 and V2, where V1 is a saying verb like mengatakan “say” or an asking verb like menanyakan “ask”, V1+V2 is NOT an SVC, for the same reason mentioned above. (10) is an example.

(10) Dia mengatakan (bahwa) (dia) mau ke kota…

3SG meN-say (that) (3SG) want to city

‘He says that he wants to go to the city...’ (SID:10255)
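The tests above can be read as a decision procedure over judgments supplied by a human annotator. The sketch below is an illustration under several assumptions: the `Judgments` fields, the `classify` function, the label "NOT SVC", and the order in which the tests are applied are all hypothetical, and test 8 (untuk insertion) is left out because it only narrows the choice to control or purpose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgments:
    """Annotator judgments about one V1+V2 candidate (hypothetical fields)."""
    v1_is_control: bool = False
    v1_is_raising: bool = False
    compositional: bool = False
    v1_is_head: bool = False
    v2_is_manner: bool = False
    v2_is_purpose: bool = False
    v2_is_result: bool = False
    v1_before_v2: bool = False
    v2_before_v1: bool = False
    simultaneous: bool = False
    pronoun_insertable: bool = False         # test 9
    complementiser_insertable: bool = False  # test 10


def classify(j: Judgments) -> Optional[str]:
    """Return a tag, or None if no test applies (assumed test order)."""
    # Tests 9 and 10 exclude the candidate outright, so apply them first.
    if j.pronoun_insertable or j.complementiser_insertable:
        return "NOT SVC"
    if j.v1_is_control and j.compositional:
        return "control"                     # test 1
    if j.v1_is_raising and j.compositional:
        return "raising"                     # test 2
    if j.v2_is_manner and j.v1_is_head:
        return "manner"                      # test 3
    if j.v2_is_purpose and j.v1_is_head and j.v1_before_v2:
        return "purpose"                     # test 4
    if j.simultaneous and j.compositional:
        return "coord.action"                # test 5
    if j.v2_before_v1 and j.v1_is_head:
        return "source"                      # test 6
    if j.v2_is_result and j.v1_before_v2 and j.v1_is_head:
        return "resultative"                 # test 7
    return None


# Example (10): mengatakan + mau admits bahwa and dia between V1 and V2.
ex10 = classify(Judgments(pronoun_insertable=True, complementiser_insertable=True))
# Example (11): mencoba + membuka, a compositional control construction.
ex11 = classify(Judgments(v1_is_control=True, compositional=True, v1_is_head=True))
```

Applying the exclusion tests first means that a candidate such as (10) is rejected even if its V2 happens to be a control verb.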

All possible Indonesian SVCs are pairs of words tagged as “verb” that appear beside each other in the corpus. Following the suggested PARSEME guidelines for Indonesian SVCs, we wrote a script to assign the tags semi-automatically. First, we checked for each sentence in the corpus whether two V POS tags occur side by side. We then checked whether V1 is a raising or a control verb. If V1 is neither a raising nor a control verb, we checked and assigned the tags manually. Instances where V1 or V2 turned out to be a segmentation error, or a noun, adverb, or preposition that was wrongly tagged in the corpus, were marked and tagged as “NOT VV”. The semi-automatically tagged SVCs were then checked for errors.
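The two automatic steps described above (finding adjacent V-V pairs, then checking V1 against lists of control and raising verbs) might be sketched as follows. This is not the script used for the study: the function names are hypothetical, the verb lists contain only the verbs named in the guidelines, and the non-verb POS tags in the sample sentence are placeholders.

```python
# Assumed lexicons: only the verbs named in the guidelines above.
CONTROL_VERBS = {"ingin", "mau", "mencoba", "tahu", "mampu"}
RAISING_VERBS = {"terlihat", "terasa", "tampak"}


def extract_candidates(tagged):
    """Yield (index, v1, v2) for each adjacent pair of tokens tagged 'V'.

    `tagged` is a list of (word, pos) pairs for one sentence.
    """
    for i in range(len(tagged) - 1):
        (w1, p1), (w2, p2) = tagged[i], tagged[i + 1]
        if p1 == "V" and p2 == "V":
            yield i, w1, w2


def auto_tag(v1):
    """Return 'control' or 'raising' if V1 is in a lexicon, else None.

    None means the candidate is left for manual tagging. Compositionality
    cannot be checked here, so automatic tags still need a manual check.
    """
    v1 = v1.lower()
    if v1 in CONTROL_VERBS:
        return "control"
    if v1 in RAISING_VERBS:
        return "raising"
    return None


# Sentence fragment from Example (11); tags other than "V" are placeholders.
sentence = [("Holmes", "N"), ("mencoba", "V"), ("membuka", "V"),
            ("palang", "N"), ("itu", "D")]
candidates = list(extract_candidates(sentence))
tags = [auto_tag(v1) for _, v1, _ in candidates]
```

For this fragment the only candidate is mencoba + membuka, which the lexicon lookup tags as control.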

Table 1: Indonesian SVCs and the corresponding tags

Type               Tag            V1              V2
Control SVC        control        control verb    (in)transitive
Raising SVC        raising        raising verb    (in)transitive
Manner SVC         manner         intransitive    (in)transitive
Purpose SVC        purpose        intransitive    (in)transitive
Coord.action SVC   coord.action   (in)transitive  (in)transitive
Source SVC         source         intransitive    (in)transitive
Resultative SVC    resultative    transitive      intransitive

Table 2: Distribution of Indonesian SVCs in the corpus

Type           Number   Percentage
control        21       72.4%
raising        3        10.3%
manner         2        6.9%
purpose        2        6.9%
coord.action   1        3.4%
source         0        0.0%
resultative    0        0.0%
Total          29       100.0%

3.2 Result

Of the 45 candidates that have the POS tags V-V side by side and can be automatically extracted, only 29 are SVCs. Of these 29, 21 are control SVCs, three are manner SVCs, two are purpose SVCs, two are raising SVCs, and one is a coordinated action SVC. Of the remaining 16 candidates, 13 have incorrect POS tags (e.g. the word masalah “problem” was tagged V but should have been tagged N), two are incorrectly segmented, and the remaining one is the candidate in (10), whose V1 is a saying verb. Table 2 shows the distribution of Indonesian SVCs in the extracted data. All the extracted SVCs comprise exactly two juxtaposed verbs; SVCs with more than two juxtaposed verbs were not found in the corpus. A test-suite (a sample of text illustrating a particular language phenomenon or construction, formatted as interlinear glossed text according to the Leipzig glossing rules) containing the 29 SVC sentences was made.⁴ The sentences were slightly edited to accommodate INDRA, focusing on the SVCs only.⁵
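The percentages in Table 2 and the breakdown of the 45 candidates can be rechecked with a few lines of arithmetic. In this sketch the counts are copied from Table 2 and from the paragraph above; the variable names are illustrative.

```python
# Counts per tag as listed in Table 2.
counts = {
    "control": 21,
    "raising": 3,
    "manner": 2,
    "purpose": 2,
    "coord.action": 1,
    "source": 0,
    "resultative": 0,
}

total_svcs = sum(counts.values())

# Percentage of each tag among all SVCs, rounded to one decimal place.
percentages = {tag: round(100 * n / total_svcs, 1) for tag, n in counts.items()}

# Candidates rejected as non-SVCs, as reported in the text.
rejected = {"incorrect POS tag": 13, "segmentation error": 2, "saying-verb V1": 1}

total_candidates = total_svcs + sum(rejected.values())

for tag, n in counts.items():
    print(f"{tag:<14}{n:>3}{percentages[tag]:>7.1f}%")
```

The SVCs and the rejected candidates together account for all 45 extracted V-V pairs.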

For control SVCs, V1s are the head of the SVCs, and are control verbs like jadi “manage (to)”, ingin “wish”, and bermaksud “intend”. The V2s attached to these V1s are complements, and the meaning of V1+V2 is compositional for all control SVCs in the data, as in (11).

(11) …Holmes …mencoba membuka palang itu…

Holmes meN-try meN-open shutter that

‘…Holmes ...tries to open that shutter…’ (SID:10417)

4 The test-suite can be accessed in the INDRA repository on GitHub: https://github.com/davidmoeljadi/INDRA/blob/master/testsuites/SVC.txt

5 The computational grammar for Indonesian (INDRA) is still being developed and does not cover phenomena such as subordination at present. The 29 SVC sentences were edited so that they do not contain subordinate clauses.


For Indonesian manner SVCs, V2s indicate the manner and direction of V1. For example, in pulang “return home” + melalui “pass through”, the act of returning home is done by passing through a place, as illustrated in Example (12). The V2 melalui “pass through” is a verb in (12) but can also function as a preposition meaning “through”.⁶ Payne (2008:312) notes that serial verbs can also be a source for adpositions.

(12) …saya pulang melalui halaman… itu…

1SG go.home pass.through yard that

‘…I go home by passing through that… yard.’ (SID:10500)

For the raising SVC category, all V1s are raising verbs and are the heads of each SVC, and V2s are complements. The meaning of V1+V2 is compositional, as in (13).

(13) Waktu terasa berlalu dengan lambat sekali.

time feel pass with slow very

‘It seems that time passes very slowly.’ (SID:10585)

For purpose SVCs, V1 is the head of the SVC and V2 indicates the reason for doing V1, and V1 has to happen before V2. The meaning of every V1+V2 is compositional. For example, in bersiap “prepare” + pergi “go”, the act of preparing was done for the purpose of going somewhere, as shown in (14). The other purpose SVC we found is menggapai-gapai “reach out” + mencari “search”. However, one can argue that bersiap pergi “prepare to go” is a control SVC and menggapai-gapai mencari “reach out and search” is a coordinated action SVC. Some SVCs have ambiguous semantic relations (see also (7)).

(14) …Holmes …bersiap pergi.

Holmes prepare go

‘…Holmes ...is prepared to go.’ (SID:10587)

The only coordinated action SVC in the extracted data is berlari “run” + menuju “head towards”, meaning ‘run and go towards’. V1 and V2 happen rapidly and repetitively to describe the seemingly simultaneous action of running and going towards somewhere.

(15) Saya segera berlari menuju kamar ayah tiri kami...

1SG soon run head.toward room father step- 1PL.EXCL

‘I soon run towards our stepfather’s room…’ (SID:10193)

4 HPSG and MRS

We use the theoretical framework of HPSG (Pollard and Sag 1994). HPSG is monostratal, handling orthography, syntax (SYN), and semantics (SEM) in a single structure (the sign), modeled through typed feature structures. Signs in HPSG include words, phrases, sentences, and utterances (Sag et al. 2003). Types are classes of linguistic entities, and each type is associated with a particular feature structure. Feature structures are sets of feature (attribute)-value pairs that represent objects. Features are unanalyzable atomic symbols drawn from a finite set, and values are either atomic symbols or feature structures themselves. Feature structures are usually illustrated with an Attribute-Value Matrix (AVM). HPSG is unification- and constraint-based: words and phrases are combined according to the constraints of the lexical entries, based on the type hierarchy. We use MRS (Copestake et al. 2005) as the semantic framework because it is compatible with HPSG typed feature structures and suitable for both parsing and generation. The semantic structures in MRS are underspecified for scope and are thus suitable for representing scope ambiguity.
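To make the notions of feature structure and unification concrete, the sketch below represents feature structures as nested Python dictionaries with atomic string values. It is a deliberate simplification and not how HPSG grammars are implemented: it has no types, no type hierarchy, and no structure sharing (reentrancy), and the feature values used in the example are illustrative.

```python
class UnificationError(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Unify two untyped feature structures.

    A feature structure is either an atomic value (here, a string) or a
    dict mapping feature names to feature structures.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                # Shared feature: the two values must themselves unify.
                result[feature] = unify(result[feature], value)
            else:
                # Feature present only in b: carry it over.
                result[feature] = value
        return result
    if a == b:
        return a
    raise UnificationError(f"cannot unify {a!r} with {b!r}")


# Two compatible partial descriptions of the same sign.
fs1 = {"SYN": {"HEAD": "verb"}}
fs2 = {"SYN": {"HEAD": "verb", "SUBJ": "np"}, "SEM": {"PRED": "try"}}
merged = unify(fs1, fs2)

# A conflicting description: HEAD cannot be both verb and noun.
try:
    unify(fs1, {"SYN": {"HEAD": "noun"}})
    conflict = False
except UnificationError:
    conflict = True
```

Unification succeeds when the two descriptions are compatible and yields a structure containing the information from both; it fails as soon as any feature has two different atomic values.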

6 If melalui can be passivized, it is a verb, not a preposition. The V2 melalui in Example (12) is a verb because we can change the order of the subject saya “1SG” and the object halaman itu “that yard” and change the verb form to passive: halaman itu saya lalui “that yard is passed through by me”.