We use a number of common features in our model, which are listed below. The features are further grouped into two broad classes:
Dense: Dense features occur at every hypernode in a translation hypergraph. For each (partial) path, their value is the sum of the values in all antecedent hypernodes.
Sparse: Sparse features apply only to a subset of nodes, which for most features is likely ∅.
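As a minimal sketch of how such feature values accumulate, the following Python snippet sums per-step feature vectors along one derivation and scores the result with a weight vector. The linear scoring function, the function names, the example steps and the rule-identifier string are assumptions made for illustration. Sparse features are simply absent from the dictionary wherever they do not apply.

```python
from collections import Counter

def derivation_features(steps):
    """Sum the local feature vectors of all steps used in one derivation."""
    total = Counter()
    for local in steps:
        for name, value in local.items():
            total[name] += value
    return dict(total)

def model_score(features, weights):
    """Linear model score (an assumption here); unknown features get weight 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical derivation with three steps. Dense features such as
# WordPenalty appear in every step, the sparse rule identifier in only one.
steps = [
    {"WordPenalty": 2.0, "Glue": 1.0},
    {"WordPenalty": 1.0, "RuleId:X -> X1 hat X2 versprochen | X1 promised X2": 1.0},
    {"WordPenalty": 3.0},
]
weights = {"WordPenalty": -0.1, "Glue": 0.01}  # illustrative weights

features = derivation_features(steps)
score = model_score(features, weights)
```

Storing the features in a dictionary keeps the representation compact even when the total number of sparse features is large, since each derivation only touches a small part of them.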
3.5.1 Dense Feature Set
Dense features, as defined above, are usually features derived from the generative translation and language models. The features in the translation model as used in this work are mostly variations of the following quantities [Lopez, 2007]:
• count(f): Absolute count of the occurrences of the source phrase f in the whole corpus, or the maximum number of samples.
• count(e): Count of the target phrase e in the whole corpus.
• count(fe): Count of the complete rule, composed of the source phrase f and the target phrase e, in the whole corpus. Since the collocations of f are sampled, this count is effectively limited to the number of samples.

The features derived from these quantities are calculated as follows¹⁴ ($v_d$ is the feature's initial weight):

CountEF ($v_d \leftarrow 0.1$):
$$\log_{10}\bigl(1 + \mathrm{count}(fe)\bigr) \tag{3.11}$$
EgivenFCoherent ($v_d \leftarrow -0.1$):
$$-\log_{10}\frac{\mathrm{count}(fe)}{\mathrm{count}(f)} \tag{3.12}$$
IsSingletonF ($v_d \leftarrow -0.01$):
$$\begin{cases} 1.0 & \text{if } \mathrm{count}(f) = 1 \\ 0.0 & \text{else} \end{cases} \tag{3.13}$$
IsSingletonFE ($v_d \leftarrow -0.01$):
$$\begin{cases} 1.0 & \text{if } \mathrm{count}(fe) = 1 \\ 0.0 & \text{else} \end{cases} \tag{3.14}$$
MaxLexFgivenE ($v_d \leftarrow -0.1$):
$$-\sum_{i=1}^{|T([fe:f])|} \log_{10} p_{\max}(f_i \mid e), \tag{3.15}$$
where $T([fe:\cdot])$ is the set of terminal symbols in the source or target side of a particular rule, and $p_{\max}$ is the highest translation probability in a lexical distribution given by relative frequency estimation from the symmetrized word alignments.
MaxLexEgivenF ($v_d \leftarrow -0.1$):
$$-\sum_{i=1}^{|T([fe:e])|} \log_{10} p_{\max}(e_i \mid f) \tag{3.16}$$
SampleCountF ($v_d \leftarrow -0.1$):
$$\log_{10}\bigl(1 + \mathrm{count}(f)\bigr) \tag{3.17}$$
Glue ($v_d \leftarrow 0.01$):
Absolute number of usages of glue rules.
PassThrough ($v_d \leftarrow -0.1$):
Absolute number of usages of pass-through rules.
WordPenalty ($v_d \leftarrow -0.1$):
Absolute number of target terminal symbols.
Arity0/1/2 ($v_d \leftarrow -0.1$):
Absolute number of rules used with arity¹⁵ 0, 1 or 2.
LanguageModel ($v_d \leftarrow 0.1$):
Negative log-likelihood score of the language model.
LanguageModelOOV ($v_d \leftarrow -1$):
Absolute number of tokens unknown¹⁶ to the language model.

¹⁴ Taken from the implementation described in [Baltescu and Blunsom, 2014], listing adapted from [Karimova et al., 2014].
¹⁵ Corresponds to the number of non-terminals in a rule, i.e. arity-0 rules have no non-terminal symbols.

(1) X → X1 hat X2 versprochen | X1 promised X2
(2) X → X1 hat mir X2 versprochen | X1 promised me X2
(3) X → X1 versprach X2 | X1 promised X2
Figure 3.2: SCFG rules for translation.
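The count-based dense features (3.11) to (3.17) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation referenced above: the function names, the dictionary-based lexical table `lex_prob`, the probability floor for unseen word pairs, and the reading of $p_{\max}$ as a maximum over the words on the other side of the rule are all assumptions, and the numbers are toy values.

```python
import math

def count_features(count_f, count_fe):
    """Count-based dense features (3.11)-(3.14) and (3.17) for one rule.

    Assumes count_f >= count_fe >= 1, i.e. the rule was observed at least once.
    """
    return {
        "CountEF": math.log10(1 + count_fe),
        "EgivenFCoherent": -math.log10(count_fe / count_f),
        "IsSingletonF": 1.0 if count_f == 1 else 0.0,
        "IsSingletonFE": 1.0 if count_fe == 1 else 0.0,
        "SampleCountF": math.log10(1 + count_f),
    }

def max_lex(words, other_side, lex_prob, floor=1e-9):
    """Sketch of (3.15)/(3.16): -sum_i log10 p_max(w_i | other_side).

    Assumption: p_max(w | other_side) is the largest lexical probability
    p(w | o) over all words o on the other side of the rule. lex_prob is a
    hypothetical dict mapping (w, o) to that probability; unseen pairs fall
    back to `floor` so that the logarithm stays defined.
    """
    total = 0.0
    for w in words:
        p_max = max((lex_prob.get((w, o), 0.0) for o in other_side), default=0.0)
        total -= math.log10(max(p_max, floor))
    return total

# Toy values, not taken from any corpus.
feats = count_features(count_f=100, count_fe=10)
lex = {("hat", "promised"): 0.1, ("versprochen", "promised"): 0.5}
max_lex_f_given_e = max_lex(["hat", "versprochen"], ["promised"], lex)
```

Only terminal symbols are passed to `max_lex`, mirroring the restriction to $T([fe:\cdot])$ in the definitions.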
3.5.2 Sparse Feature Set
Sparse features only apply to a subset of translations and, in search, only to a subset of arcs. Thus they should be able to discriminate between different translations, which makes them an ideal match for discriminative training. In contrast to e.g. Chiang et al. [2009], who try to find sparse features that cope with very specific phenomena or fix individual problems of the translation system, we seek to find a feature set that can be effectively trained to improve translation quality, without the need for further manual engineering.
To illustrate our proposed features, a sample of three SCFG rules is shown in Figure 3.2.
Rule identifiers (Rule-Id): These features identify each rule by a unique identifier [Blunsom and Osborne, 2008]. Each application of a rule is counted, and the final value of the feature is the sum of its applications in the derivation. Such features roughly correspond to the relative frequencies of rewrite rules used in the dense features described before. Each rule depicted in Figure 3.2 would correspond to a single unique identifier, which is obtained by mapping the rule to a string representation. Combined, these features correspond to the use of a discriminative translation model (grammar).
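A minimal sketch of rule identifier features, assuming rules are given as pairs of source-side and target-side symbol lists. The string format of the identifier and the function names are hypothetical; the text above only requires that each rule maps to a unique string and that applications are counted over the derivation.

```python
from collections import Counter

def rule_id(source, target, lhs="X"):
    """Map a rule to a string identifier (the format is an assumption)."""
    return "RuleId:{} -> {} | {}".format(lhs, " ".join(source), " ".join(target))

def rule_id_features(derivation):
    """Count how often each rule is applied in a derivation."""
    return dict(Counter(rule_id(src, tgt) for src, tgt in derivation))

# Rules (1) and (3) from Figure 3.2.
rule1 = (["X1", "hat", "X2", "versprochen"], ["X1", "promised", "X2"])
rule3 = (["X1", "versprach", "X2"], ["X1", "promised", "X2"])

# Hypothetical derivation that applies rule (1) twice and rule (3) once.
feats = rule_id_features([rule1, rule3, rule1])
```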
Rule bigrams (Rule-Bigram): These features identify bigrams of consecutive items in a rule. We use bigrams on the source- and target-sides of rules. Such features identify possible source- and target-side phrases and thus can give preference to rules including or excluding them. In Figure 3.2, the first rule would fire the following additional features: X1 hat, hat X2, and X2 versprochen on the source-side, and X1 promised, promised X2 on the target-side.

¹⁶ “Out-of-vocabulary” (OOV) items.
Rule shape (Rule-Shape): These features are indicators that abstract away from lexical items by extracting templates that identify the location of sequences of terminal symbols in relation to non-terminal symbols, on both the source- and target-sides of a rule. For example, both rule (1) and rule (2) in Figure 3.2 map to the same indicator, namely to that of a rule that consists of a (non-terminal, terminal *, non-terminal, terminal *) pattern on its source side, and a (non-terminal, terminal *, non-terminal) pattern on its target side (* denotes zero or more occurrences). Rule (3) maps to a different template, (non-terminal, terminal *, non-terminal), on both source- and target-side.
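The rule bigram and rule shape templates described above can be sketched as follows. The feature name formats, the function names and the convention that non-terminals are written X1, X2, ... are assumptions for illustration. The shape function collapses each run of one or more terminals into a single placeholder, which is one simple reading of the "terminal *" notation.

```python
def is_nonterminal(symbol):
    """Assumption: non-terminals are written as X followed by an index."""
    return symbol.startswith("X") and symbol[1:].isdigit()

def rule_bigrams(symbols, side):
    """Bigrams of consecutive items on one side of a rule."""
    return ["RuleBigram:{}:{}_{}".format(side, a, b)
            for a, b in zip(symbols, symbols[1:])]

def side_shape(symbols):
    """Replace symbols by NT/T tags and collapse runs of terminals."""
    shape = []
    for symbol in symbols:
        tag = "NT" if is_nonterminal(symbol) else "T"
        if tag == "T" and shape and shape[-1] == "T":
            continue  # still inside the same terminal sequence
        shape.append(tag)
    return "_".join(shape)

def rule_shape(source, target):
    """Shape indicator over both sides of a rule."""
    return "RuleShape:{}|{}".format(side_shape(source), side_shape(target))

# The three rules from Figure 3.2.
src1, tgt1 = ["X1", "hat", "X2", "versprochen"], ["X1", "promised", "X2"]
src2, tgt2 = ["X1", "hat", "mir", "X2", "versprochen"], ["X1", "promised", "me", "X2"]
src3, tgt3 = ["X1", "versprach", "X2"], ["X1", "promised", "X2"]

bigrams1 = rule_bigrams(src1, "src") + rule_bigrams(tgt1, "tgt")
shapes = [rule_shape(src1, tgt1), rule_shape(src2, tgt2), rule_shape(src3, tgt3)]
```

In this sketch rules (1) and (2) receive the same shape string and rule (3) a different one, in line with the example in the text.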
For some experiments we additionally explore a more syntax-oriented approach for sparse features, or further sparse feature templates. The sparse feature set as described here is denoted by Sparse, and the dense feature set as described before by Dense.
3.5.3 Experiments with Features
In a first experiment we explore the feasibility of our Sparse feature set. The result is depicted in Table 3.5. The largest number of features originates from using the discriminative grammar. Rule bigram features add about 30K features, and rule shapes only account for 51 features. In total, this results in about 180K features for the experiment with the full sparse feature set. The best result is achieved by combining all features. On the development test data, the Dense feature set performs best when using the small data set for tuning.
Tuning on the full bitext results in a similar picture on the test set, which this time is also reflected in the results on the development test data.
To evaluate our proposed feature sets we compare the Dense and Sparse feature sets on all data sets. Results are shown in Tables 3.6 for Nc∗, 3.7 for Ep∗, 3.8 for Wmt13, and 3.9 for Wmt15. All results clearly favor the Sparse feature set, with improvements ranging from 0.3 (Nc∗, Table 3.6) to 1.8 %BLEU (Wmt15, Table 3.9). Overall, we observe that tuning with Sparse features seems to perform better when the underlying training data, used for estimating the generative models, becomes larger.