Comparing Conditional and Generative Models

5.4 Experiments and Results

5.4.2 Comparing Conditional and Generative Models

The probabilistic generation models presented in Cahill and van Genabith (2006) and Hogan et al. (2007) are conditional models in that they define probabilities of f-structure annotated productions conditional directly on the sets of the input f-structure features/attributes. By contrast, the probabilistic models presented in this thesis are generative (or joint) models based on derivations of standard or augmented PCFGs.
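The distinction can be illustrated with a minimal sketch of relative-frequency estimation under two different conditioning contexts. The toy observations, the feature names and the function names below are hypothetical and heavily simplified (real productions carry f-structure annotations); the sketch is not the implementation used in the thesis or in the cited work.

```python
from collections import Counter, defaultdict
from math import prod

def relative_frequencies(events):
    """MLE: P(outcome | context) = count(context, outcome) / count(context)."""
    counts = defaultdict(Counter)
    for context, outcome in events:
        counts[context][outcome] += 1
    return {
        context: {o: c / sum(outcomes.values()) for o, c in outcomes.items()}
        for context, outcomes in counts.items()
    }

# Hypothetical observations: (LHS category, f-structure feature set, RHS).
observed = [
    ("NP", frozenset({"PRED", "NUM"}), ("NN",)),
    ("NP", frozenset({"PRED", "NUM"}), ("NN",)),
    ("NP", frozenset({"PRED", "ADJUNCT"}), ("ADJP", "NN")),
    ("NP", frozenset({"PRED", "ADJUNCT"}), ("NN",)),
]

# Generative PCFG: the context is the LHS category alone.
pcfg = relative_frequencies((lhs, rhs) for lhs, _, rhs in observed)

# Conditional model: the context also contains the input f-structure features.
conditional = relative_frequencies(
    ((lhs, feats), rhs) for lhs, feats, rhs in observed
)

def derivation_probability(rule_probs):
    """Under a PCFG, a derivation's probability is the product of its rule probabilities."""
    return prod(rule_probs)
```

In this toy example the PCFG pools all four NP expansions into one distribution, whereas the conditional model splits them into two contexts with two observations each. Finer contexts give sharper distributions but fewer counts per context.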

To compare the performance of my generative PCFG models to the probabilistic models conditioned on input f-structure features, I reimplement the conditional models of Cahill and van Genabith (2006) and Hogan et al. (2007) on the CTB data and carry out experiments with the standard PCFG, the PF-PCFG and the PC-PCFG models. To eliminate the effects of unknown lexical features, I extract lexical rules from all treebank trees of CTB5.1, including the test and development sets, so lexical smoothing is not relevant in this experiment.

The generation models are evaluated against the raw text of the testing data in terms of accuracy and coverage. Following Langkilde (2002) and other work on wide-coverage, general-purpose generators, I adopt BLEU score (Papineni et al., 2002), average NIST simple string accuracy (SSA) and percentage of exactly matched sentences for accuracy evaluation. For coverage evaluation, I measure the percentage of input f-structures that generate a sentence. To determine whether the difference between the accuracy scores of two generation models is significant or only due to chance, I employ statistical significance tests. For BLEU scores, I use FastMtEval,⁹ which implements bootstrap resampling, a method popular in machine translation evaluation. For SSA scores, I apply a paired Student's t-test to the mean difference of the SSA scores. As incompleteness of realisations has a negative impact on the BLEU and SSA scores, significance tests comparing two models are conducted only on the intersection of complete sentences which are generated by both models.

⁹ Scripts for the bootstrapping evaluation of confidence intervals and statistical significance testing are available for download at the author's homepage: http://www.computing.dcu.ie/ nstroppa/index.php?page=softwares&lang=en
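As a rough illustration of the SSA-based comparison, the sketch below computes a token-level simple string accuracy and a paired t statistic over the sentences that both models realise completely. It assumes the common definition SSA = 1 − (edit distance / reference length), floored at zero, and a hypothetical data layout in which each model maps a sentence id to a token list, or to None when no complete realisation was produced. It is not the evaluation code used for the reported scores.

```python
import math
from statistics import mean, stdev

def edit_distance(hyp, ref):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def ssa(hyp, ref):
    """Simple string accuracy: 1 - edits / reference length, floored at 0."""
    return max(0.0, 1.0 - edit_distance(hyp, ref) / len(ref))

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired per-sentence scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

def ssa_on_intersection(out_a, out_b, refs):
    """Score both models only on inputs that each of them realised completely."""
    shared = [k for k in refs
              if out_a.get(k) is not None and out_b.get(k) is not None]
    ssa_a = [ssa(out_a[k], refs[k]) for k in shared]
    ssa_b = [ssa(out_b[k], refs[k]) for k in shared]
    return shared, ssa_a, ssa_b
```

The t statistic would then be compared against a t distribution with the returned degrees of freedom to obtain a p-value; that lookup is omitted here to keep the sketch dependency-free.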

Complete Sentence   Coverage   ExMatch   BLEU              SSA
Cahill (2006)       87.19%     24.13%    0.7040            0.6682
Hogan (2007)        84.26%     24.39%    0.7103 >Cahill    0.6746 ∼Cahill
PCFG                98.69%     23.14%    0.7050 ∼Cahill    0.6644 ∼Cahill
PF-PCFG             97.06%     24.03%    0.7142 ∼Hogan     0.6708 ∼Hogan
PC-PCFG             97.55%     24.67%    0.7206 ≫Hogan     0.6840 ≫Hogan

Table 5.4: Results for completely generated sentences on development data

Table 5.4 compares the five generation models, evaluated against the subset of completely generated sentences for f-structures of the development set. In the table, ≫ means statistical significance at the level of p = 0.005, > means significance at p = 0.05 and ∼ means the difference is not significant. With regard to coverage, the conditioning factor contributed by f-structure features leads to relatively low coverage for the two conditional generation models. By contrast, the three generative PCFG models raise the proportion of completely generated sentences by roughly 10 percentage points or more.

With regard to accuracy, the conditioning f-structure features change the probability distribution over the CFG rules, but they do not result in higher accuracy compared to the generative PCFG models. Specifically, the model of Cahill and van Genabith (2006) performs about the same as the simple PCFG model, while the model of Hogan et al. (2007), which also includes the parent GF as a conditioning feature, performs at about the same level as the corresponding PF-PCFG model. Both conditional generation models perform significantly worse than the generative PC-PCFG model. Roughly speaking, three major reasons account for this fact: (i) the generation grammar rules employed by the PCFG models are not conventional CFG rules, but CFG rules annotated with grammatical functions, hence the information contributed by f-structure features is already contained in the annotated CFG rules to some extent; (ii) the implementation of the chart-style generator associates a sub-chart with each sub-f-structure of the generation input f-structure, and this set-up prevents rules incompatible with the input f-structure from being applied; (iii) the two generation models that condition directly on input f-structure features suffer from severe data sparseness in generation grammar rule counts and from overfitting under MLE.

Another observation from the results presented in Table 5.4 concerns the relative performance of the three generative PCFG models. The two models that extend the conditioning context of CFG productions to parent annotations have slightly lower generation coverage than the basic PCFG model. However, as far as generation accuracy is concerned, the PC-PCFG model that includes the phrasal category of the parent node outperforms the other two PCFG models. The PF-PCFG model that includes the grammatical function of the f-structure parent is also better than the simple PCFG model, with a significant improvement in the BLEU score and an observable but not significant improvement in the SSA score.
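To make the idea of extending the conditioning context concrete, the following sketch extracts rule events whose context is the pair (category, parent category), in the spirit of the PC-PCFG model. The nested-tuple tree encoding, the "TOP" root label and the toy tree are assumptions made for illustration; the PF-PCFG variant would need the grammatical function of the f-structure parent, which this sketch does not model.

```python
def parent_annotated_rules(tree, parent="TOP"):
    """Yield ((category, parent category), RHS) events from a nested-tuple tree.

    A tree is (label, child, child, ...); a leaf child is a plain string.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, parent), rhs
    for child in children:
        if not isinstance(child, str):
            yield from parent_annotated_rules(child, parent=label)

# Hypothetical toy tree; the labels are CTB-style categories, the words are placeholders.
tree = ("IP", ("NP", ("NN", "mao")), ("VP", ("VV", "shuijiao")))
events = list(parent_annotated_rules(tree))
```

Each event pairs a richer context with the same right-hand side, so the rule inventory is split across more contexts than in the basic PCFG. This is one plausible reading of why the parent-annotated models trade a little coverage for accuracy.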

All Sentence    Coverage   ExMatch   BLEU     SSA
Cahill (2006)   100%       21.04%    0.6624   0.6403
Hogan (2007)    100%       20.55%    0.6609   0.6410
PCFG            100%       22.84%    0.7034   0.6628
PF-PCFG         100%       23.33%    0.7091   0.6671
PC-PCFG         100%       24.06%    0.7171   0.6796

Table 5.5: Results for all sentences on development data

Table 5.5 lists results evaluated against all (complete and partial) sentences generated from the input f-structures. These results follow naturally from those in Table 5.4: as the conditioning f-structure features do not improve accuracy but reduce generation coverage, the generative PCFG models show substantially better overall performance than the conditional generation models. Again, the model annotated with the syntactic category of the parent (PC-PCFG) achieves the best results among all the PCFG-based generation models.
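The relationship between the two tables can be checked with simple arithmetic. The sketch below uses the BLEU and SSA values copied from Tables 5.4 and 5.5 and computes how much each model loses when partial realisations are included; the variable and function names are illustrative only.

```python
# (BLEU, SSA) on completely generated sentences, from Table 5.4.
complete = {
    "Cahill (2006)": (0.7040, 0.6682),
    "Hogan (2007)": (0.7103, 0.6746),
    "PCFG": (0.7050, 0.6644),
    "PF-PCFG": (0.7142, 0.6708),
    "PC-PCFG": (0.7206, 0.6840),
}

# (BLEU, SSA) on all sentences, complete and partial, from Table 5.5.
all_sentences = {
    "Cahill (2006)": (0.6624, 0.6403),
    "Hogan (2007)": (0.6609, 0.6410),
    "PCFG": (0.7034, 0.6628),
    "PF-PCFG": (0.7091, 0.6671),
    "PC-PCFG": (0.7171, 0.6796),
}

def score_drops(complete_scores, all_scores):
    """Per-model (BLEU drop, SSA drop) when moving from complete to all sentences."""
    return {
        model: (
            round(complete_scores[model][0] - all_scores[model][0], 4),
            round(complete_scores[model][1] - all_scores[model][1], 4),
        )
        for model in complete_scores
    }

drops = score_drops(complete, all_sentences)
for model, (bleu_drop, ssa_drop) in drops.items():
    print(f"{model:15s} BLEU -{bleu_drop:.4f}  SSA -{ssa_drop:.4f}")
```

The two conditional models lose roughly 0.04 to 0.05 BLEU once partial outputs are counted, while the three PCFG models lose about 0.005 or less, which matches the coverage gap reported in Table 5.4.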

In document Treebank-based acquisition of Chinese LFG resources for parsing and generation (Page 121-124)