6.5 Bootstrapping Parser Development –
6.5.3 Sample-Selection-Based Co-Training
6.5.3.1 Related Work
Sample selection involves choosing training items for use in a particular task based on some criterion that approximates their accuracy in the absence of a label or reference. In the context of parsing, Rehbein (2011) chooses additional sentences to add to the parser’s training set based on their similarity to the existing training set – the idea is that sentences that are similar to the training data are likely to have been parsed correctly and so are “safe” to add to the training set. In their parser co-training experiments, Steedman et al. (2003) sample training items based on the confidence of the individual parsers (as approximated by parse probability).
In active learning research (see Section 6.4), the Query By Committee selection method (Seung et al., 1992) is used to choose items for annotation – if a committee of two or more systems disagrees on an item, this is evidence that the item needs to be prioritised for manual correction. Steedman et al. (2003) discuss a sample selection approach based on differences between parsers – if parser A and parser B disagree on an analysis, parser A can be improved by being retrained on parser B’s analysis, and vice versa. In contrast, Ravi et al. (2008) show that parser agreement is a strong indicator of parse quality, and in parser domain adaptation, Sagae and Tsujii (2007) and Le Roux et al. (2012) use agreement between parsers to choose which automatically parsed target domain items to add to the training set.
Sample selection can be used with both self-training and co-training. We restrict our attention to co-training since our previous experiments have demonstrated that it has more potential than self-training. In the following set of experiments, we explore the role of both parser agreement and parser disagreement in sample selection for co-training.
6.5.3.2 Agreement-Based Co-Training
Experimental Setup The main algorithm for agreement-based co-training is given in Algorithm 4. Again, Malt and Mate are used. However, this algorithm differs from the co-training algorithm in Figure 3 in that, rather than adding the full set of 323 newly parsed trees ($P_A^i$ and $P_B^i$) to the training set at each iteration, selected subsets of these trees ($P_A^{i\prime}$ and $P_B^{i\prime}$) are added instead. To define these subsets, we identify the trees that have 85% or higher agreement between the two parser output sets.¹³ As a result, the number of trees in the subsets differs at each iteration. For iteration 1, 89 trees reach the agreement threshold; iteration 2, 93 trees; iteration 3, 117 trees; iteration 4, 122 trees; iteration 5, 131 trees; iteration 6, 114 trees. The training sets are therefore much smaller than those in the experiments of Section 6.5.2.
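The selection step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the experiments: the text does not say how agreement between two trees is measured, so the sketch assumes it is the fraction of tokens to which both parsers assign the same head and the same dependency label. The names `agreement` and `select_by_agreement` and the `(head, label)` tree representation are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: one (head_index, dependency_label) pair per token.
Parse = List[Tuple[int, str]]


def agreement(parse_a: Parse, parse_b: Parse) -> float:
    """Fraction of tokens given the same head AND label by both parsers.

    Assumption: this labelled-attachment style measure stands in for the
    agreement score, which the text does not define.
    """
    if len(parse_a) != len(parse_b):
        raise ValueError("both parses must cover the same tokens")
    if not parse_a:
        return 0.0
    same = sum(1 for a, b in zip(parse_a, parse_b) if a == b)
    return same / len(parse_a)


def select_by_agreement(parses_a: List[Parse], parses_b: List[Parse],
                        threshold: float = 0.85) -> List[int]:
    """Indices of sentences whose two parses agree at or above the threshold."""
    return [i for i, (a, b) in enumerate(zip(parses_a, parses_b))
            if agreement(a, b) >= threshold]


# Toy example: the parsers agree fully on sentence 0 and on half of sentence 1.
parses_a = [[(2, "subj"), (0, "top"), (2, "obj")], [(0, "top"), (1, "subj")]]
parses_b = [[(2, "subj"), (0, "top"), (2, "obj")], [(0, "top"), (1, "obj")]]
selected = select_by_agreement(parses_a, parses_b)
```

Under the same assumption, a disagreement-based criterion would keep the sentences for which `1 - agreement(a, b)` meets the chosen threshold.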
Figure 6.9: Agreement-based Co-Training Results on the Development Set
¹³ We chose 85% as our cut-off because it was more relaxed than 100% agreement, yet seemed a respectable threshold for quality trees when we examined the proportion of agreement between trees in the development set.
Algorithm 4 Sample selection co-training algorithm
A and B are two different parsers.
$M_A^i$ and $M_B^i$ are models of A and B at step i.
$P_A^i$ and $P_B^i$ are sets of trees produced using $M_A^i$ and $M_B^i$.
U is a set of sentences.
$U^i$ is a subset of U at step i.
L is the manually labelled seed training set.
$L_A^i$ and $L_B^i$ are labelled training data for A and B at step i.
Initialise:
    $L_A^0 \leftarrow L_B^0 \leftarrow L$
    $M_A^0 \leftarrow \mathrm{Train}(A, L_A^0)$
    $M_B^0 \leftarrow \mathrm{Train}(B, L_B^0)$
for i = 1 → N do
    $U^i \leftarrow$ Add set of unlabelled sentences from U
    $P_A^i \leftarrow \mathrm{Parse}(U^i, M_A^i)$
    $P_B^i \leftarrow \mathrm{Parse}(U^i, M_B^i)$
    $P_A^{i\prime} \leftarrow$ a subset of X trees from $P_A^i$
    $P_B^{i\prime} \leftarrow$ a subset of X trees from $P_B^i$
    $L_A^{i+1} \leftarrow L_A^i + P_B^{i\prime}$
    $L_B^{i+1} \leftarrow L_B^i + P_A^{i\prime}$
    $M_A^{i+1} \leftarrow \mathrm{Train}(A, L_A^{i+1})$
    $M_B^{i+1} \leftarrow \mathrm{Train}(B, L_B^{i+1})$
end for
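The control flow of Algorithm 4 can be sketched in Python as below. This is an illustrative skeleton under stated assumptions, not the experimental code: the trainers are passed in as functions that return a callable model, `select` is any function that returns the indices of the trees to keep, and the same index set is used for both parsers (which holds for a symmetric criterion such as agreement). All names are hypothetical, and nothing here calls the real Malt or Mate tools.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed interfaces (hypothetical): a model maps a batch of sentences to
# one tree per sentence; a trainer maps labelled trees to such a model.
Model = Callable[[Sequence], List]
Trainer = Callable[[List], Model]
Selector = Callable[[List, List], List[int]]


def sample_selection_cotrain(seed: List, batches: Sequence[Sequence],
                             train_a: Trainer, train_b: Trainer,
                             select: Selector) -> Tuple[Model, Model, List, List]:
    """Co-train two parsers, adding only the selected trees at each step."""
    labelled_a = list(seed)            # training data for parser A
    labelled_b = list(seed)            # training data for parser B
    model_a = train_a(labelled_a)
    model_b = train_b(labelled_b)
    for batch in batches:              # one batch of unlabelled sentences per step
        parsed_a = model_a(batch)
        parsed_b = model_b(batch)
        keep = select(parsed_a, parsed_b)
        # Each parser is retrained on the OTHER parser's selected trees.
        labelled_a = labelled_a + [parsed_b[k] for k in keep]
        labelled_b = labelled_b + [parsed_a[k] for k in keep]
        model_a = train_a(labelled_a)
        model_b = train_b(labelled_b)
    return model_a, model_b, labelled_a, labelled_b
```

Passing the selection criterion in as a function means the agreement-based and disagreement-based variants differ only in the `select` argument.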
Results The results for agreement-based co-training are presented in Figure 6.9. Malt’s best model was trained on 1166 trees at the final iteration (71.0% LAS and 79.8% UAS). Mate’s best model was trained on 1052 trees at the 5th iteration (71.5% LAS and 79.7% UAS). Neither result represents a statistically significant improvement over the baseline.
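The training-set sizes quoted above follow from the per-iteration counts given in the experimental setup. The short check below reproduces them; the seed size of 500 trees is an inference from the reported totals (1166 trees at the final iteration minus the 666 trees added across six iterations), not a figure stated in this section.

```python
from itertools import accumulate

# Assumption: seed size inferred from the reported totals (1166 - 666).
SEED_SIZE = 500

# Trees reaching the 85% agreement threshold at iterations 1 to 6.
added_per_iteration = [89, 93, 117, 122, 131, 114]

# Training-set size after each iteration.
sizes = [SEED_SIZE + total for total in accumulate(added_per_iteration)]
print(sizes)  # [589, 682, 799, 921, 1052, 1166]
```

The fifth and sixth entries match the 1052 and 1166 trees reported for Mate’s and Malt’s best models respectively.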
6.5.3.3 Disagreement-based Co-Training
Experimental Setup This experiment uses the same sample selection algorithm we used for agreement-based co-training (Algorithm 4). For this experiment, however, the way in which the subsets of trees ($P_A^{i\prime}$ and $P_B^{i\prime}$) are selected differs. This time we choose the trees that have 70% or higher disagreement between the two parser output sets. Again, the number of trees in the subsets differs at each iteration. For iteration 1, 91 trees reach the disagreement threshold; iteration 2, 93 trees; iteration 3, 73 trees; iteration 4, 74 trees; iteration 5, 68 trees; iteration 6, 71 trees.
Results The results for our disagreement-based co-training experiment are shown in Figure 6.10. The best Malt model was trained with 831 trees at the 4th iteration (70.8% LAS and 79.8% UAS). Mate’s best models were trained on (i) 684 trees on the 2nd iteration (71.0% LAS) and (ii) 899 trees on the 5th iteration (79.4% UAS). Neither improvement over the baseline is statistically significant.
Figure 6.10: Disagreement-based Co-Training Results on the Development Set
6.5.3.4 Non-Iterative Agreement-based Co-Training
In this section, we explore what happens when we add the additional training data all at once rather than over several iterations. Rather than testing this idea with all our previous setups, we choose sample-selection-based co-training where agreement between parsers is the criterion for selecting additional training data.
Experimental Setup Again, we follow the algorithm for agreement-based co-training as presented in Algorithm 4. However, two different approaches are taken this time, each involving only one iteration. For the first experiment (ACT1a), the subsets of trees ($P_A^{i\prime}$ and $P_B^{i\prime}$) that are added to the training data are chosen based on an agreement threshold of 85% between parsers, and are taken from the full set of unlabelled data (where $U^i = U$), comprising 1938 trees. In this instance, the subset consisted of 603 trees, making a final training set of 1103 trees.
For the second experiment (ACT1b), only trees meeting a parser agreement threshold of 100% are added to the training data. 253 trees ($P_A^{i\prime}$ and $P_B^{i\prime}$) out of 1938 trees ($U^i = U$) meet this threshold. The final training set consisted of 753 trees.
Results ACT1a proved to be the most accurate parsing model for Mate overall. The addition of 603 trees that met the agreement threshold of 85% increased the LAS and UAS scores over the baseline by 1.0% and 1.3% to 71.8 and 80.4 respectively. This improvement is statistically significant. Malt showed a LAS improvement of 0.93% and a UAS improvement of 0.42% (71.0% LAS and 79.6% UAS). The LAS improvement over the baseline is statistically significant.
The increases for ACT1b, where 100% agreement trees are added, are less pronounced and are not statistically significant. Results showed a 0.5% LAS and 0.2% UAS increase over the baseline with Malt, based on the 100% agreement threshold (adding 253 trees). Mate performs at 0.5% above the LAS baseline and 0.1% above the UAS baseline.
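As a consistency check on the training-set sizes reported for the two non-iterative runs, the arithmetic can be written out as follows. The seed size of 500 trees is inferred from the reported ACT1a totals rather than stated in this section, and the percentages are rounded to one decimal place.

```latex
\begin{align*}
|L| &= 1103 - 603 = 500
  && \text{(seed size implied by ACT1a)} \\
|L_{\text{ACT1b}}| &= 500 + 253 = 753
  && \text{(final training set for ACT1b)} \\
603 / 1938 &\approx 31.1\%
  && \text{(share of unlabelled trees at 85\% agreement)} \\
253 / 1938 &\approx 13.1\%
  && \text{(share of unlabelled trees at 100\% agreement)}
\end{align*}
```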
6.5.4 Analysis
We performed an error analysis for the Malt and Mate baseline, self-trained and co-trained models on the development set. We observed the following trends:
• All Malt and Mate parsing models confuse the subj and obj labels. A few possible reasons for this stand out: (i) It is difficult for the parser to discriminate between analytic verb forms and synthetic verb forms. For example, in the phrase phósfainn thusa ‘I would marry you’, phósfainn is a synthetic form of the verb pós ‘marry’ that has been inflected with the incorporated pronoun ‘I’. Not recognising this, the parser decided that it is an intransitive verb, taking thusa, the emphatic form of the pronoun tú ‘you’, as its subject instead of its object. (ii) Possibly due to the VSO word order, when the parser is dealing with relative clauses, it can be difficult to ascertain whether the following noun is the subject or the object.
(38) an cailín a chonaic mé inné
     the girl REL saw me/I yesterday
Example 38 shows an ambiguous relative clause.¹⁴ (iii) There is no passive verb form in Irish. The autonomous form is most closely linked with passive use and is used when the agent is not known or mentioned. A ‘hidden’ or understood subject is incorporated into the verb form, as in Casadh eochair i nglas ‘a key was turned in a lock’ (lit. ‘somebody turned a key in a lock’). In this sentence, eochair ‘key’ is the object.
• For both parsers, there is some confusion between the labelling of obl and padjunct, both of which mark the attachment between verbs and prepositions. Overall, Malt’s confusion decreases over the 6 iterations of self-training, but Mate begins to incorrectly choose padjunct over obl instead. Mixed results are obtained using the various variants of co-training.
• Mate handles coordination better than Malt.¹⁵ It is not surprising then that co-training Malt using Mate parses improves Malt’s coordination handling whereas the opposite is the case when co-training Mate on Malt parses, demonstrating that co-training can both eliminate and introduce errors.
• Other examples of how Mate helps Malt during co-training are in the distinction between top and comp relations, between vparticle and relparticle, and in the analysis of xcomps.
• Distinguishing between relative and cleft particles is a frequent error for Mate, and therefore Malt also begins to make this kind of error when co-trained using Mate. Mate improves using sample-selection-based co-training with Malt.
• The sample-selection-based co-training variants show broadly similar trends to the basic co-training.
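Label confusions of the kind listed above (for example subj versus obj, or obl versus padjunct) can be tallied by comparing gold and predicted labels token by token. The sketch below is one simple way to do this and is not a description of how the analysis in this section was carried out; the function name and the `(head, label)` representation are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Assumed representation: one (head_index, dependency_label) pair per token.
Parse = List[Tuple[int, str]]


def label_confusions(gold: List[Parse], predicted: List[Parse]) -> Counter:
    """Count (gold_label, predicted_label) pairs for tokens whose labels differ."""
    confusions: Counter = Counter()
    for gold_tree, pred_tree in zip(gold, predicted):
        if len(gold_tree) != len(pred_tree):
            raise ValueError("gold and predicted trees must cover the same tokens")
        for (_, gold_label), (_, pred_label) in zip(gold_tree, pred_tree):
            if gold_label != pred_label:
                confusions[(gold_label, pred_label)] += 1
    return confusions


# Toy example: the parser swaps subj and obj in a three-token sentence.
gold = [[(2, "subj"), (0, "top"), (2, "obj")]]
pred = [[(2, "obj"), (0, "top"), (2, "subj")]]
counts = label_confusions(gold, pred)
```

Calling `counts.most_common()` lists the most frequent confusions first, which makes it easy to compare models across iterations.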