4.3 A Machine Learning Method to Tag Compound Clauses and Com-

4.3.1 Token Features

The feature extraction tool derives the values of 39 features of tokens occurring in input sequences corresponding to sentences. I designed the initial pool of features to encode information about the intrasentential linguistic context of each token. This included features intrinsic to the token such as its orthographic form and part of speech and information about its relationship to other tokens in the sequence. It was necessary to engineer features of this type due to the relatively limited size of my dataset, which restricted the ability of the machine learning method to derive even quite limited information about the contexts of tokens and the relationships holding between tokens of different types. For brevity, I do not list the 39 features here, but the full feature set is presented in Appendix C.

In addition to the training data described in Section 4.2.4, validation datasets were also developed for optimisation of the machine learning methods. For the models to tag sentences containing compound clauses, the validation set comprised 2093 sequences while, for models to tag complex constituents, the validation set comprised 2628 sequences. In both cases, the token sequences were from texts of the registers of health, literature, and news. Optimisation was performed using naïve hill climbing and grid search methods to assess the suitability of features in the pool and other parameters for use in the CRF sequence labelling models. When selecting features for the tagging of complex constituents, evaluation was based on the F1-score obtained for classification of sequences involving complexRF NPs (as opposed to other types of complex constituent).
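As an illustration of the kind of search described above, the following is a minimal sketch of one simple hill climbing variant (greedy forward selection) over a feature pool. It is not the procedure used in this work: the function names, the feature names, and the toy scoring function are all hypothetical, and in practice the score would come from training a CRF model and measuring F1 on the validation set.

```python
from typing import Callable, FrozenSet, Iterable, Set


def hill_climb_features(
    pool: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    start: Iterable[str] = (),
) -> Set[str]:
    """Repeatedly add the single feature that most improves the score;
    stop at the first point where no single addition helps."""
    selected = set(start)
    best = score(frozenset(selected))
    remaining = set(pool) - selected
    while remaining:
        # Score every one-feature extension of the current feature set.
        gains = {f: score(frozenset(selected | {f})) for f in sorted(remaining)}
        candidate = max(gains, key=gains.get)
        if gains[candidate] <= best:
            break  # local optimum reached
        selected.add(candidate)
        remaining.remove(candidate)
        best = gains[candidate]
    return selected


def toy_score(features: FrozenSet[str]) -> float:
    """Hypothetical stand-in for 'train a CRF, return validation F1'."""
    weights = {"token": 0.30, "pos": 0.10, "distance_to_sign": 0.05, "noise": -0.02}
    return sum(weights[f] for f in features)


chosen = hill_climb_features(["token", "pos", "distance_to_sign", "noise"], toy_score)
```

With the toy scores, the search keeps the three helpful features and rejects the one that lowers the score. Being greedy, a search of this kind can stop at a local optimum, which is one reason to complement it with a grid search over other parameters.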

Table 4.6 indicates the set of features selected for classification of tokens both in sequences containing compound clauses and in sequences containing complex constituents. This is the set of features exploited when learning the most accurate models for tagging input sequences in accordance with the annotation schemes detailed in Section 4.2.3. In the evaluations performed for feature selection, the CRF tagger was trained using data from all three text registers (health, literature, and news) at once and validated on data from these three registers.

Tables 4.8 and 4.9 list additional features from the initial pool that were selected for inclusion in the models to classify tokens in sequences containing compound clauses and complex constituents, respectively. For each of the two tagging tasks, the features listed in Tables 4.8 and 4.9 bring additional gains in the accuracy of the models when added to the set of features listed in Table 4.6.

Table 4.6: Features selected for tagging of both compound clauses and complex constituents

Type      Feature
Boolean   Token has a part of speech matching that of the first token following the next sign of syntactic complexity
          Token is the word when
          Token is a colon
          Token is a final/illative conjunction (see Table 4.7 for an indicative list of such conjunctions)
Ternary   Position of the token in the sentence: FIRST_THIRD, SECOND_THIRD, or THIRD_THIRD
Numeric   Number of words between token and the next word with part of speech tag IN
          Number of words between token and the next word with part of speech tag VBD
          Number of words between token and the next sign of syntactic complexity
          Number of verbs that precede the token in the sentence
Symbolic  The token
          Part of speech of the token or class label, if the token is a sign of syntactic complexity
          Part of speech of the first word in the sequence
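Several of the positional and numeric features in Table 4.6 can be sketched in a few lines. The code below is illustrative only: the function names are hypothetical, and the handling of boundary cases (where exactly the thirds are split, and what is returned when no matching tag follows) is an assumption rather than a description of the feature extraction tool.

```python
from typing import List, Optional


def position_third(index: int, length: int) -> str:
    """Ternary position of a token in its sentence (0-based index)."""
    if index * 3 < length:
        return "FIRST_THIRD"
    if index * 3 < 2 * length:
        return "SECOND_THIRD"
    return "THIRD_THIRD"


def words_until_tag(pos_tags: List[str], index: int, target: str) -> Optional[int]:
    """Number of words strictly between the token and the next token
    tagged `target`; None if no such token follows (an assumption)."""
    for j in range(index + 1, len(pos_tags)):
        if pos_tags[j] == target:
            return j - index - 1
    return None


def verbs_before(pos_tags: List[str], index: int) -> int:
    """Number of verbal tags (Penn Treebank VB*) preceding the token."""
    return sum(1 for tag in pos_tags[:index] if tag.startswith("VB"))


# Example sentence: "The doctor said that she left"
tags = ["DT", "NN", "VBD", "IN", "PRP", "VBD"]
thirds = [position_third(i, len(tags)) for i in range(len(tags))]
gap_to_in = words_until_tag(tags, 0, "IN")
gap_to_vbd = words_until_tag(tags, 0, "VBD")
verbs_seen = verbs_before(tags, 5)
```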

Table 4.7: Final/illative conjunctions

hence, in consequence, of course, so that, so then, therefore, thus

Table 4.8: Additional features selected for tagging of compound clauses

Type      Feature
Boolean   Part of speech of token matches that of the first word in the sequence
          Token matches the first lexical word in the sequence
          Token is verbal (part of speech is in the set {VB, VBG, VBN, or RB})
          Token is the word some
Ternary   Token is a coordinator: YES (and, but, or or), MAYBE (a punctuation mark followed by and, but, or or), or NO (any other token)
Numeric   Position of the token in the document
Symbolic  Acoustic form of the token (in the token, consonant clusters are rendered as C, single consonants as c, vowel sequences as V, and single vowels as v. The word consonant is thus rendered as cvCvcvC)
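The acoustic form feature in Table 4.8 lends itself to a short sketch. The function below is a hypothetical implementation, not the code used in this work. It assumes that only a, e, i, o, and u count as vowels (so y is treated as a consonant) and that non-alphabetic characters are dropped before the mapping is applied.

```python
from itertools import groupby

VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def acoustic_form(token: str) -> str:
    """Map runs of consonants/vowels to C/c/V/v as described in Table 4.8."""
    letters = [ch for ch in token.lower() if ch.isalpha()]
    out = []
    # Group consecutive letters by whether they are vowels.
    for is_vowel, run in groupby(letters, key=lambda ch: ch in VOWELS):
        run_length = len(list(run))
        if is_vowel:
            out.append("V" if run_length > 1 else "v")
        else:
            out.append("C" if run_length > 1 else "c")
    return "".join(out)


example = acoustic_form("consonant")
```

On the example given in the table, the sketch yields cvCvcvC for the word consonant.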

Table 4.9: Additional features selected for tagging of complex constituents

Type      Feature
Boolean   Token is a relative pronoun (wh-word or that)
          Sentence in which the token appears also contains a clause complement word⁹ (see Table 4.11 for an indicative list of such words)
          Token is the word who and subsequent tokens include a comma immediately followed by a past tense verb (PoS is VBD)
          Token is either that or which and subsequent tokens include a comma immediately followed by a determiner (PoS is DT)
          Token is an adversative conjunction (see Table 4.10 for an indicative list of such conjunctions)
Quinary   Token's relationship to the word because: INDEPENDENT, PRECEDES, FOLLOWS, BOTH_PRECEDES_AND_FOLLOWS, or IS the word because
Numeric   Number of commas in the same sentence as the token
          Number of signs of syntactic complexity in the same sentence as the token
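The quinary feature in Table 4.9 can be sketched as follows. This is a hypothetical reading of the feature, not the original implementation: it assumes that PRECEDES means the token occurs before some occurrence of because in the same sentence, that FOLLOWS means it occurs after one, and that matching is case-insensitive.

```python
from typing import List


def because_relation(tokens: List[str], index: int) -> str:
    """Relationship of tokens[index] to the word 'because' in its sentence."""
    if tokens[index].lower() == "because":
        return "IS"
    before = any(t.lower() == "because" for t in tokens[:index])
    after = any(t.lower() == "because" for t in tokens[index + 1:])
    if before and after:
        return "BOTH_PRECEDES_AND_FOLLOWS"
    if after:
        return "PRECEDES"  # the token comes before an occurrence of 'because'
    if before:
        return "FOLLOWS"   # the token comes after an occurrence of 'because'
    return "INDEPENDENT"


sentence = "He left because it rained".split()
relations = [because_relation(sentence, i) for i in range(len(sentence))]
```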

Table 4.10: Adversative conjunctions

although, contrariwise, conversely, despite, however, instead, nevertheless, nonetheless, though, whereas, while, yet

When deriving the models, tokens were represented using the three sets of feature templates presented in Section 3.2.3.¹⁰ For the model used to tag compound clauses, templates were included for all of the features listed in Tables 4.6 and 4.8. For the model used to tag complex constituents, templates were included for all of the features listed in Tables 4.6 and 4.9. These templates were 5-grams, used to condition the tagging of each token on the basis of information about the value of the feature in the two preceding tokens, the token being tagged, and the two following tokens.
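To make the notion of a 5-gram template concrete, the sketch below generates CRF++ template lines, in which the macro %x[row,col] refers to the value in column col of the token at relative position row. It shows one possible reading of the description above, namely a single template per feature column that joins the values at offsets -2 to +2. The helper name and the column indices are hypothetical, and the actual template file used in this work may differ.

```python
from typing import List


def five_gram_template(column: int, template_id: int) -> str:
    """One CRF++ unigram template joining a feature's values over a
    window of two tokens either side of the current token."""
    window = "/".join(f"%x[{offset},{column}]" for offset in range(-2, 3))
    return f"U{template_id:02d}:{window}"


def template_file(columns: List[int]) -> str:
    """Template lines for a list of feature columns, one line per column."""
    return "\n".join(five_gram_template(col, i) for i, col in enumerate(columns))


# Hypothetical layout: column 0 holds the token, column 1 its PoS/sign tag.
templates = template_file([0, 1])
```

An alternative reading would use five separate templates per feature, one for each offset, which conditions on each neighbouring value independently rather than on their combination.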

⁹ This includes morphological variants such as the past, present, and -ing forms of clause complement verbs. This footnote pertains to the first portion of Table 4.9.

¹⁰ In CRF++, feature selection is implemented via the content of the feature template file.

Table 4.11: Clause complement words

Verbs

accept        acknowledge   add           admit         agree
allege        announce      answer        appreciate    argue
ask           aware         believe       certain       claim
clear         complain      concern       conclude      confirm
convince      decide        demonstrate   deny          disappoint
disclose      discover      doubt         dread         emerge
emphasise     ensure        establish     expect        explain
fear          feel          find          given         guess
hear          hold          hope          illustrate    indicate
infer         insist        intimate      imply         know
learn         maintain      mean          note          order
plain         possible      promise       protest       prove
provide       record        realise       recognise     recommend
read          realise       record        relate        remain
report        retort        reveal        rule          satisfy
saw           say           see           show          state
suggest       suspect       tell          terrified     testify
think         warn

Nouns

allegation    admission     belief        manner        scale
view          way

Adjectives

disappointed  obvious

Identification of the sequences (sentences) to be tagged using these models depends on accurate detection of signs which coordinate clauses in compounds (tagged CEV) and which serve as the left boundaries of subordinate clauses (tagged SSEV). For this reason, the sign tagger described in Chapter 3 of this thesis is of central importance in this approach to tagging compound clauses and complex constituents.

that several features were particularly useful, with ablation negatively affecting accuracy by more than 1%. Table 4.12 lists these features and the effects of their ablation on the accuracy of the models.

Table 4.12: Features for which ablation has the greatest adverse effect on accuracy of derived tagging models

Feature                        F1 (negative change)

Tagging compound clauses
Orthographic form              0.0257
Distance to sign               0.0214
Acoustic form                  0.0155

Tagging complex constituents
Orthographic form              0.0376
Distance to sign               0.0201
Sign is when                   0.0195
Sign is a relative pronoun     0.0147
PoS/sign tag                   0.0101

Of the tagging models, the bigram model performed best. The feature encoding information from the sign tagger (PoS/sign tag) is ranked fifth in terms of its contribution to models tagging sentences which contain complex constituents and, although it is not listed in Table 4.12 because the negative change in F1 < 0.01 (it is 0.0095), it is ranked fourth for models tagging Type 1 sentences. Other linguistic features brought minor improvements in performance, and were also included in the models. Table 4.13 displays micro-averaged F1 scores obtained by the taggers using different combinations of features.
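For readers unfamiliar with micro-averaging, the sketch below shows how a micro-averaged F1 score can be computed by pooling counts before calculating the score. It assumes that pooling is done over the three text registers, and the counts are hypothetical values chosen for illustration; they are not taken from the experiments reported here.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-group (TP, FP, FN) counts:
    pool the counts first, then compute F1 = 2TP / (2TP + FP + FN)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical (TP, FP, FN) counts per register, for illustration only.
hypothetical_counts = {
    "health": (80, 20, 30),
    "literature": (50, 25, 25),
    "news": (70, 15, 20),
}
score = micro_f1(hypothetical_counts)
```

Because the counts are pooled, registers with more tagged items have proportionally more influence on the result than they would under macro-averaging.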

Experiments in which the classification of tokens in the training and validation datasets was extended, using variants of the BIO scheme, did not lead to the derivation of more accurate tagging models.

Table 4.13: Performance of the taggers when exploiting different combinations of features

F1 (micro-averaged, all registers)

Features                              Compound Clauses   Complex Constituents
Orthographic form                     0.4893             0.2577
Orthographic form and PoS/sign tags   0.5041             0.2716
All but PoS/sign tags                 0.7186             0.5391
All                                   0.7281             0.5492
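The ablation effect of the PoS/sign tag feature can be cross-checked against Table 4.13 by subtracting the score for "All but PoS/sign tags" from the score for "All". The short calculation below does this using the values reported in the table; the variable names are illustrative.

```python
# F1 scores copied from Table 4.13.
f1 = {
    "compound clauses": {"all": 0.7281, "all_but_pos_sign": 0.7186},
    "complex constituents": {"all": 0.5492, "all_but_pos_sign": 0.5391},
}

# Drop in F1 when the PoS/sign tag feature is ablated, rounded to 4 places.
drop = {
    task: round(scores["all"] - scores["all_but_pos_sign"], 4)
    for task, scores in f1.items()
}
```

The resulting differences are 0.0095 for compound clauses and 0.0101 for complex constituents. The latter agrees with the PoS/sign tag entry in Table 4.12, and the former agrees with the figure of 0.0095 quoted in the text, which suggests that the Type 1 sentences mentioned there are those containing compound clauses.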