4.3 A Machine Learning Method to Tag Compound Clauses and Complex Constituents
4.3.1 Token Features
The feature extraction tool derives the values of 39 features of tokens occurring in input sequences corresponding to sentences. I designed the initial pool of features to encode information about the intrasentential linguistic context of each token. This included features intrinsic to the token such as its orthographic form and part of speech and information about its relationship to other tokens in the sequence. It was necessary to engineer features of this type due to the relatively limited size of my dataset, which restricted the ability of the machine learning method to derive even quite limited information about the contexts of tokens and the relationships holding between tokens of different types. For brevity, I do not list the 39 features here, but the full feature set is presented in Appendix C.
In addition to the training data described in Section 4.2.4, validation datasets were also developed for optimisation of the machine learning methods. For the models to tag sentences containing compound clauses, the validation set comprised 2093 sequences while, for models to tag complex constituents, the validation set comprised 2628 sequences. In both cases, the token sequences were from texts of the registers of health, literature, and news. Optimisation was performed using naïve hill climbing and grid search methods to assess the suitability of features in the pool and other parameters for use in the CRF sequence labelling models. When selecting features for the tagging of complex constituents, evaluation was based on the F1-score obtained for classification of sequences involving complexRF NPs (as opposed to other types of complex constituent).
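The naïve hill climbing mentioned above can be sketched as a greedy loop over the feature pool. This is only an illustration under stated assumptions, not the procedure actually used: the function name `hill_climb` and the `evaluate` callback (assumed to train a CRF with the given features and return its F1-score on the validation set) are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def hill_climb(
    pool: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    start: Iterable[str] = (),
) -> Tuple[FrozenSet[str], float]:
    """Greedily add features from the pool while the validation score improves.

    `evaluate` is a hypothetical callback: it is assumed to train a model
    with the given feature set and return a validation F1-score.
    """
    selected = frozenset(start)
    best = evaluate(selected)
    improved = True
    while improved:
        improved = False
        # Try each unselected feature in a fixed order; keep it if it helps.
        for feature in sorted(set(pool) - selected):
            candidate = selected | {feature}
            score = evaluate(candidate)
            if score > best:
                selected, best = candidate, score
                improved = True
    return selected, best
```

A grid search over other parameters could wrap a call of this kind, re-running the selection for each parameter setting and keeping the best-scoring combination.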
Table 4.6 indicates the set of features selected for classification of tokens both in sequences containing compound clauses and in sequences containing complex constituents. This is the set of features exploited when learning the most accurate models for tagging input sequences in accordance with the annotation schemes detailed in Section 4.2.3. In the evaluations performed for feature selection, the CRF tagger was trained using data from all three text registers (health, literature, and news) at once and validated on data from these three registers.
Tables 4.8 and 4.9 list additional features from the initial pool that were selected for inclusion in the models to classify tokens in sequences containing compound clauses and complex constituents, respectively. For each of the two tagging tasks, the features listed in Tables 4.8 and 4.9 bring additional gains in the accuracy of the models when added to the set of features listed in Table 4.6.
Table 4.6: Features selected for tagging of both compound clauses and complex constituents
Boolean
  Token has a part of speech matching that of the first token following the next sign of syntactic complexity
  Token is the word when
  Token is a colon
  Token is a final/illative conjunction (see Table 4.7 for an indicative list of such conjunctions)
Ternary
  Position of the token in the sentence: FIRST_THIRD, SECOND_THIRD, or THIRD_THIRD
Numeric
  Number of words between token and the next word with part of speech tag IN
  Number of words between token and the next word with part of speech tag VBD
  Number of words between token and the next sign of syntactic complexity
  Number of verbs that precede the token in the sentence
Symbolic
  The token
  Part of speech of the token or class label, if the token is a sign of syntactic complexity
  Part of speech of the first word in the sequence
Table 4.7: Final/illative conjunctions
hence, in consequence, of course, so that, so then, therefore, thus
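To make the positional and numeric features of Table 4.6 concrete, the following sketch computes three of them from a PoS-tagged sentence. It is an illustration only: the function names, the boundary convention for sentence thirds, the use of `None` when no matching word follows, and the set of Penn Treebank tags counted as verbs are all assumptions rather than details taken from the feature extraction tool.

```python
from typing import List, Optional

# Assumption: these Penn Treebank tags are the ones counted as verbs.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def sentence_third(index: int, length: int) -> str:
    """Ternary position of a token (assumed boundary convention)."""
    if 3 * index < length:
        return "FIRST_THIRD"
    if 3 * index < 2 * length:
        return "SECOND_THIRD"
    return "THIRD_THIRD"


def distance_to_next_tag(tags: List[str], index: int, target: str) -> Optional[int]:
    """Number of words between the token and the next word tagged `target`.

    Returns None when no such word follows (an assumed convention).
    """
    for j in range(index + 1, len(tags)):
        if tags[j] == target:
            return j - index - 1
    return None


def verbs_before(tags: List[str], index: int) -> int:
    """Number of verbs preceding the token in the sentence."""
    return sum(1 for tag in tags[:index] if tag in VERB_TAGS)


def token_features(tokens: List[str], tags: List[str], index: int) -> dict:
    """Collect a few Table 4.6 style features for one token."""
    return {
        "token": tokens[index],
        "pos": tags[index],
        "first_pos": tags[0],
        "is_when": tokens[index].lower() == "when",
        "is_colon": tokens[index] == ":",
        "position": sentence_third(index, len(tokens)),
        "dist_IN": distance_to_next_tag(tags, index, "IN"),
        "dist_VBD": distance_to_next_tag(tags, index, "VBD"),
        "verbs_before": verbs_before(tags, index),
    }
```

For the tagged sentence She/PRP said/VBD that/IN he/PRP left/VBD in/IN May/NNP, calling `token_features` on the first token gives a distance of 1 to the next IN and 0 to the next VBD.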
Table 4.8: Additional features selected for tagging of compound clauses
Boolean
  Part of speech of token matches that of the first word in the sequence
  Token matches the first lexical word in the sequence
  Token is verbal (part of speech is in the set {VB, VBG, VBN, or RB})
  Token is the word some
Ternary
  Token is a coordinator: YES (and, but, or or), MAYBE (a punctuation mark followed by and, but, or or), or NO (any other token)
Numeric
  Position of the token in the document
Symbolic
  Acoustic form of the token (in the token, consonant clusters are rendered as C, single consonants as c, vowel sequences as V, and single vowels as v. The word consonant is thus rendered as cvCvcvC)
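The acoustic form and coordinator features of Table 4.8 lend themselves to a short sketch. The code below is an illustration under assumptions: the vowel inventory (a, e, i, o, u, with y treated as a consonant), the decision to ignore non-alphabetic characters, and the function names are not specified in the text.

```python
import string
from itertools import groupby
from typing import List

VOWELS = set("aeiou")  # assumption: y is treated as a consonant
COORDINATORS = {"and", "but", "or"}


def acoustic_form(token: str) -> str:
    """Render consonant clusters as C, single consonants as c,
    vowel sequences as V, and single vowels as v."""
    letters = [ch for ch in token.lower() if ch.isalpha()]
    pattern = []
    # Group consecutive letters into maximal runs of vowels or consonants.
    for is_vowel, run in groupby(letters, key=lambda ch: ch in VOWELS):
        run_length = len(list(run))
        symbol = "v" if is_vowel else "c"
        pattern.append(symbol.upper() if run_length > 1 else symbol)
    return "".join(pattern)


def coordinator_status(tokens: List[str], index: int) -> str:
    """Ternary coordinator feature: YES, MAYBE, or NO."""
    token = tokens[index].lower()
    if token in COORDINATORS:
        return "YES"
    is_punctuation = bool(token) and all(ch in string.punctuation for ch in token)
    has_next = index + 1 < len(tokens)
    if is_punctuation and has_next and tokens[index + 1].lower() in COORDINATORS:
        return "MAYBE"
    return "NO"
```

With these definitions, `acoustic_form("consonant")` returns `cvCvcvC`, matching the example given in Table 4.8.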
Table 4.9: Additional features selected for tagging of complex constituents
Boolean
  Token is a relative pronoun (wh-word or that)
  Sentence in which the token appears also contains a clause complement word⁹ (see Table 4.11 for an indicative list of such words)
  Token is the word who and subsequent tokens include a comma immediately followed by a past tense verb (PoS is VBD)
  Token is either that or which and subsequent tokens include a comma immediately followed by a determiner (PoS is DT)
  Token is an adversative conjunction (see Table 4.10 for an indicative list of such conjunctions)
Quinary
  Token’s relationship to the word because: INDEPENDENT, PRECEDES, FOLLOWS, BOTH_PRECEDES_AND_FOLLOWS, or IS the word because
Numeric
  Number of commas in the same sentence as the token
  Number of signs of syntactic complexity in the same sentence as the token
Table 4.10: Adversative conjunctions
although contrariwise conversely despite however instead nevertheless nonetheless though whereas while yet
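Two of the features in Table 4.9 can be sketched briefly: the quinary relationship to the word because and the adversative conjunction test. The reading of the five values used here is an assumption (INDEPENDENT when the sentence contains no because; BOTH_PRECEDES_AND_FOLLOWS when instances of because occur on both sides of the token), and the function names are hypothetical.

```python
from typing import List

# Indicative list from Table 4.10.
ADVERSATIVES = {
    "although", "contrariwise", "conversely", "despite", "however",
    "instead", "nevertheless", "nonetheless", "though", "whereas",
    "while", "yet",
}


def because_relation(tokens: List[str], index: int) -> str:
    """Quinary feature describing a token's position relative to 'because'."""
    positions = [j for j, tok in enumerate(tokens) if tok.lower() == "because"]
    if not positions:
        return "INDEPENDENT"
    if index in positions:
        return "IS"
    because_after = any(j > index for j in positions)
    because_before = any(j < index for j in positions)
    if because_after and because_before:
        return "BOTH_PRECEDES_AND_FOLLOWS"
    # The token precedes 'because' when 'because' occurs later in the sentence.
    return "PRECEDES" if because_after else "FOLLOWS"


def is_adversative(token: str) -> bool:
    """Boolean feature: token is in the indicative adversative list."""
    return token.lower() in ADVERSATIVES
```

In the sentence He left because it rained, the token He is labelled PRECEDES, because is labelled IS, and rained is labelled FOLLOWS.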
When deriving the models, tokens were represented using the three sets of feature templates presented in Section 3.2.3.¹⁰ For the model used to tag compound clauses, templates were included for all of the features listed in Tables 4.6 and 4.8. For the model used to tag complex constituents, templates were included for all of the features listed in Tables 4.6 and 4.9. These templates were 5-grams, used to condition the tagging of each token on the basis of information about the value of the feature in the two preceding tokens, the token being tagged, and the two following tokens.
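As an illustration of what a 5-gram template over a two-token window might look like in a CRF++ template file, the sketch below generates one unigram template line per feature column, combining the `%x[row,col]` macros for offsets -2 to +2. This is an assumption about the template layout, not a reproduction of the template files used here; the function name `window_templates` and the column numbering are hypothetical.

```python
from typing import List


def window_templates(num_columns: int, half_width: int = 2) -> List[str]:
    """Build one CRF++ unigram template per feature column.

    Each template conjoins the column's value at every offset in
    [-half_width, +half_width] relative to the token being tagged.
    """
    lines = []
    for col in range(num_columns):
        macros = "/".join(
            f"%x[{offset},{col}]"
            for offset in range(-half_width, half_width + 1)
        )
        lines.append(f"U{col:02d}:{macros}")
    return lines


if __name__ == "__main__":
    # Print templates for a hypothetical training file with three feature columns.
    print("\n".join(window_templates(3)))
```

Under this layout, restricting the model to a subset of features amounts to emitting template lines only for the corresponding columns.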
⁹ This includes morphological variants such as the past, present, and -ing forms of clause complement verbs. This footnote pertains to the first portion of Table 4.9.
¹⁰ In CRF++, feature selection is implemented via the content of the feature template file.
Table 4.11: Clause complement words
Verbs
accept acknowledge add admit agree
allege announce answer appreciate argue
ask aware believe certain claim
clear complain concern conclude confirm
convince decide demonstrate deny disappoint
disclose discover doubt dread emerge
emphasise ensure establish expect explain
fear feel find given guess
hear hold hope illustrate indicate
infer insist intimate imply know
learn maintain mean note order
plain possible promise protest prove
provide record realise recognise recommend
read realise record relate remain
report retort reveal rule satisfy
saw say see show state
suggest suspect tell terrified testify
think warn
Nouns
allegation admission belief manner scale
view way
Adjectives
disappointed obvious
Identification of the sequences (sentences) to be tagged using these models depends on accurate detection of signs which coordinate clauses in compounds (tagged CEV) and which serve as the left boundaries of subordinate clauses (tagged SSEV). For this reason, the sign tagger described in Chapter 3 of this thesis is of central importance in this approach to tagging compound clauses and complex constituents.
Feature ablation indicated that several features were particularly useful, with ablation negatively affecting accuracy by more than 1%. Table 4.12 lists these features and the effects of their ablation on the accuracy of the models.
Table 4.12: Features for which ablation has the greatest adverse effect on accuracy of derived tagging models
Feature                        F1 (negative change)
Tagging compound clauses
  Orthographic form            0.0257
  Distance to sign             0.0214
  Acoustic form                0.0155
Tagging complex constituents
  Orthographic form            0.0376
  Distance to sign             0.0201
  Sign is when                 0.0195
  Sign is a relative pronoun   0.0147
  PoS/sign tag                 0.0101
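The kind of ablation analysis summarised in Table 4.12 can be sketched as follows: each feature is removed in turn, the model is re-evaluated, and features are ranked by the resulting drop in F1. This is a generic illustration; the function name `ablation_ranking` and the `evaluate` callback (assumed to train and validate a model on a given feature set) are hypothetical, and the 0.01 threshold simply mirrors the cut-off mentioned in the text.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def ablation_ranking(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    threshold: float = 0.01,
) -> List[Tuple[str, float]]:
    """Rank features by the F1 lost when each one is removed.

    `evaluate` is a hypothetical callback returning the validation F1
    of a model trained with the given feature set.
    """
    full = frozenset(features)
    baseline = evaluate(full)
    drops = []
    for feature in sorted(full):
        # Retrain/evaluate without this single feature.
        ablated_score = evaluate(full - {feature})
        drops.append((feature, baseline - ablated_score))
    drops.sort(key=lambda pair: pair[1], reverse=True)
    # Keep only features whose removal costs more than the threshold.
    return [(name, drop) for name, drop in drops if drop > threshold]
```

Features whose removal costs less than the threshold are omitted from the returned ranking, which is why a feature with a small but non-zero contribution would not appear in a table of this kind.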
Of the tagging models, the bigram model performed best. The feature encoding information from the sign tagger (PoS/sign tag) is ranked fifth in terms of its contribution to models tagging sentences which contain complex constituents. Although it is not listed in Table 4.12 because the negative change in F1 is below 0.01 (it is 0.0095), it is ranked fourth for models tagging Type 1 sentences. Other linguistic features brought minor improvements in performance and were also included in the models. Table 4.13 displays micro-averaged F1 scores obtained by the taggers using different combinations of features.
Experiments in which the classification of tokens in the training and validation datasets was extended, using variants of the BIO scheme, did not lead to the derivation of more accurate tagging models.
Table 4.13: Performance of the taggers when exploiting different combinations of features
                                      F1 (micro-averaged, all registers)
Features                              Compound clauses   Complex constituents
Orthographic form                     0.4893             0.2577
Orthographic form and PoS/sign tags   0.5041             0.2716
All but PoS/sign tags                 0.7186             0.5391
All                                   0.7281             0.5492
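For reference, a micro-averaged F1 score of the kind reported in Table 4.13 is conventionally computed by pooling true positive, false positive, and false negative counts before calculating precision and recall. The sketch below assumes that pooling is done across registers (and any label classes); the exact averaging procedure used for the table is not detailed here, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def micro_f1(counts: Iterable[Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from (true positive, false positive, false negative)
    triples, one per register or class, pooled before scoring."""
    tp = fp = fn = 0
    for true_pos, false_pos, false_neg in counts:
        tp += true_pos
        fp += false_pos
        fn += false_neg
    denominator = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score.
    return 2 * tp / denominator if denominator else 0.0
```

Because counts are pooled, registers contributing more tagged tokens have proportionally more influence on the score than they would under macro-averaging.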