CHAPTER 7 : Multilingual Supervision for Cross-lingual Entity Linking
7.4 Training and Inference
7.4.3 Training Objective
When only the mention context encoder and the entity vectors are trained, the EC-Loss averaged over all training mentions is minimized. When the two type-aware losses are also used, a weighted sum of EC-Loss, TE-Loss, and TC-Loss is minimized, using the weighting scheme of Kendall et al. (2018):

\[
\frac{\text{EC-Loss}}{2\lambda_{EC}^{2}} + \frac{\text{TE-Loss}}{2\lambda_{TE}^{2}} + \frac{\text{TC-Loss}}{2\lambda_{TC}^{2}} + \log\lambda_{EC}^{2} + \log\lambda_{TE}^{2} + \log\lambda_{TC}^{2} \tag{7.13}
\]

Here $\lambda_i$ are learnable scalar weighting parameters. The factor $\frac{1}{2\lambda_i^{2}}$ scales each loss, and the corresponding $\log\lambda_i^{2}$ term ensures that $\lambda_i^{2}$ does not grow unboundedly. This way, the model learns the relative weight for each loss term.
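To make the role of the two terms in Eq. 7.13 concrete, the following minimal sketch evaluates the weighted objective for fixed loss values. It is an illustration only, not the training code: the function names and the loss values are hypothetical, and in the actual model the $\lambda_i$ are learned by gradient descent rather than set in closed form. For a fixed loss $L_i > 0$, setting the derivative of $L_i / (2s) + \log s$ with respect to $s = \lambda_i^{2}$ to zero gives $s = L_i / 2$, so the weight settles at a finite value tied to the magnitude of its loss.

```python
import math


def weighted_loss(losses, lambdas):
    """Eq. 7.13: sum over tasks of loss_i / (2 * lambda_i^2) + log(lambda_i^2)."""
    total = 0.0
    for name, loss in losses.items():
        lam_sq = lambdas[name] ** 2
        total += loss / (2.0 * lam_sq) + math.log(lam_sq)
    return total


def optimal_lambda_sq(loss):
    """Stationary point of loss / (2s) + log(s) in s = lambda^2, for loss > 0."""
    return loss / 2.0


# Hypothetical loss values, for illustration only.
losses = {"EC": 2.0, "TE": 0.8, "TC": 1.2}

# Weights at the stationary point of each term.
lambdas = {name: math.sqrt(optimal_lambda_sq(v)) for name, v in losses.items()}
best = weighted_loss(losses, lambdas)

# Making every lambda larger or smaller than this point increases the objective.
larger = weighted_loss(losses, {k: 2.0 * v for k, v in lambdas.items()})
smaller = weighted_loss(losses, {k: 0.5 * v for k, v in lambdas.items()})
```

Here both `larger` and `smaller` exceed `best`: shrinking $\lambda_i$ inflates the scaled loss, while growing it is penalized by the logarithmic term.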
During training, mentions from different languages are mixed using an inverse-ratio mini-batch mixing strategy. That is, if two languages have training data sizes proportional to $\alpha : \beta$, then at any time during training the mini-batches seen from them are in the ratio $\frac{1}{\alpha} : \frac{1}{\beta}$. This strategy prevents languages with more training data from overwhelming languages with less training data. Though simple, we found that this strategy yielded good results.
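One way to realize the inverse-ratio mixing described above is to draw the language of each mini-batch with probability proportional to the inverse of its training set size. The sketch below assumes such stochastic sampling; a deterministic schedule with the same ratios would serve equally well, and the text does not specify which was used. The function names and the data sizes are hypothetical.

```python
import random


def inverse_ratio_probs(train_sizes):
    """Map each language to a sampling probability proportional to 1 / size."""
    inverse = {lang: 1.0 / size for lang, size in train_sizes.items()}
    total = sum(inverse.values())
    return {lang: weight / total for lang, weight in inverse.items()}


def sample_batch_languages(train_sizes, num_batches, seed=0):
    """Draw the language of each successive mini-batch (assumed stochastic)."""
    rng = random.Random(seed)
    probs = inverse_ratio_probs(train_sizes)
    langs = list(probs)
    weights = [probs[lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=num_batches)


# Hypothetical sizes in the ratio alpha : beta = 3 : 1.
sizes = {"a": 3000, "b": 1000}
probs = inverse_ratio_probs(sizes)  # batches in the ratio 1/3 : 1, i.e. 0.25 : 0.75
schedule = sample_batch_languages(sizes, num_batches=10)
```

With these sizes the smaller language "b" supplies three mini-batches for every one from "a" in expectation.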
7.5. Experimental Setup
This section describes the training and evaluation datasets, and the previous XEL approaches from the literature used in the experiments for comparison.
Training Mentions. Following previous work, hyperlinks from Wikipedia (dumps dated 05/20/2017) are used as the source of grounded mentions for supervision. As described earlier (Section 4.3), Wikipedias in different languages contain articles describing the same entity, which can be resolved by using inter-language links. For instance, the article लिवरपूल in the Hindi Wikipedia resolves to Liverpool in English. Training mention statistics are shown in Table 24.
Lang.          # Train Mentions   Size Relative to # English Mentions
German (de)    22.6M              43.7%
Spanish (es)   13.8M              26.7%
French (fr)    16.2M              31.3%
Italian (it)   11.5M              22.2%
Chinese (zh)   5.9M               11.4%
Arabic (ar)    3.1M               6.0%
Turkish (tr)   1.8M               3.5%
Tamil (ta)     473k               0.9%
Table 24: Number of train mentions (from Wikipedia) in each language, with % size relative to English (51.7M mentions). Train mentions from Wikipedias like Arabic, Turkish and Tamil are <10% the size of those from the English Wikipedia.
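The grounding step described under Training Mentions, where a hyperlink in a non-English Wikipedia is resolved to an English title through inter-language links, can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the table `LANGLINKS`, the function `ground_mention`, and the choice to drop hyperlinks without an English counterpart are all hypothetical and not taken from the actual pipeline.

```python
# Hypothetical inter-language link table: (language, local title) -> English title.
LANGLINKS = {
    ("hi", "लिवरपूल"): "Liverpool",
    ("de", "München"): "Munich",
}


def ground_mention(lang, surface, target_title, langlinks=LANGLINKS):
    """Turn a hyperlink into a grounded training mention.

    Returns a dict with the mention surface, its language, and the English
    entity title, or None when the link target has no English counterpart
    (dropping such links is an assumption of this sketch).
    """
    entity = langlinks.get((lang, target_title))
    if entity is None:
        return None
    return {"surface": surface, "lang": lang, "entity": entity}


mention = ground_mention("hi", "लिवरपूल", "लिवरपूल")
```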
The evaluation spans 8 languages: German (de), Spanish (es), Italian (it), French (fr), Chinese (zh), Arabic (ar), Turkish (tr) and Tamil (ta), each of which has a varying amount of grounded mentions from the respective Wikipedia (Table 24). Note that our method is applicable to any of the 293 Wikipedia languages as a target language.
Evaluation Datasets. Xelms is evaluated on the following benchmark datasets, spanning 8 different languages, thus providing an extensive evaluation.
(a) McN-Test: Dataset from McNamee et al. (2011). The test set was collected by using parallel document collections, and then crowd-sourcing the ground truths. All the test mentions in this dataset consist of person names only.
(b) TH-Test: A subset of the dataset used in Tsai and Roth (2016), derived from Wikipedia.⁶ The mentions in the dataset fall into two categories, easy and hard, where hard mentions are those for which the most likely candidate according to the prior probability (i.e., $\arg\max_e \Pr_{prior}(e \mid m)$) is not the correct title. Indeed, most Wikipedia mentions can be correctly linked by selecting the most likely candidate (Ratinov et al., 2011). All the hard mentions from Tsai and Roth (2016)'s test splits, henceforth called TH-Test, are used for each language.
⁶ Pan et al. (2017) also created a dataset using Wikipedia, but did not categorize mentions like Tsai and Roth (2016). Preliminary experiments on their dataset showed that Xelms consistently beat Pan et al. (2017)'s model. TH-Test was chosen for more controlled experiments.
(c) TAC15-Test: Dataset from the TAC-KBP 2015 Trilingual Entity Linking Track (Ji et al., 2015) for Chinese and Spanish. It contains 84 news and 82 discussion forum articles in Chinese (166 documents in total) and 84 news and 83 discussion forum articles in Spanish (167 documents in total).
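The easy/hard distinction used for TH-Test can be stated as a short predicate: a mention is hard when the candidate with the highest prior probability is not the gold title. The sketch below is illustrative only; the function name, the dictionary representation of the prior, the handling of mentions without candidates, and the example probabilities are all assumptions.

```python
def is_hard(prior, gold_entity):
    """Return True if the prior's top-ranked candidate differs from the gold title.

    `prior` maps each candidate entity to Pr_prior(e | m) for one mention m.
    """
    if not prior:
        # Assumption of this sketch: with no candidates, the prior cannot be right.
        return True
    top_candidate = max(prior, key=prior.get)
    return top_candidate != gold_entity


# Hypothetical prior for the surface form "Liverpool".
prior = {"Liverpool": 0.7, "Liverpool F.C.": 0.3}
easy_case = is_hard(prior, "Liverpool")        # False: the prior already picks the city
hard_case = is_hard(prior, "Liverpool F.C.")   # True: the prior picks the wrong title
```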
All models are evaluated using linking accuracy on gold mentions, which is the fraction of test mentions that are linked to the correct entity. It is assumed gold mentions are provided at test time, following common practice (Tsai and Roth, 2016; Ganea and Hofmann, 2017; Gupta et al., 2017). Table 25 shows the source domains of the evaluation datasets.
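For concreteness, linking accuracy on gold mentions can be computed as below. This is a minimal sketch; the function name and the example entity labels are hypothetical.

```python
def linking_accuracy(predicted, gold):
    """Fraction of gold mentions whose predicted entity equals the gold entity."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one test mention is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical predictions for four gold mentions; three are linked correctly.
accuracy = linking_accuracy(
    ["Liverpool", "Munich", "Paris", "Rome"],
    ["Liverpool", "Munich", "Paris, Texas", "Rome"],
)
```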
Dataset       Lang.                            Source
TH-Test       de, es, fr, it, zh, ar, tr, ta   Wikipedia
McN-Test      de, es, fr, it, zh, ar, tr       News, Parliament Proceedings
TAC15-Test    es, zh                           News, Discussion Forums
Table 25: Evaluation datasets used in the cross-lingual entity linking experiments.
Implementation and Tuning. All models were implemented using PyTorch.⁷ The Adam (Kingma and Ba, 2014) optimizer was used with a learning rate of 1e-3 in all experiments. The candidate generator was limited to output the top-20 candidates in all experiments. The local context window was set to W = 25 tokens. The convolutional filter width was set to k = 5. The mention surface vocabulary V was limited to size 1M for both monolingual and joint training. The multilingual embeddings (d = 300) were scaled to a fixed norm R (= 5.0), and were not updated during training. Dropout (Srivastava et al., 2014) was applied separately to the local context and document context features, each being tuned over {0.4, 0.45, ..., 0.7}. The size of entity, type and context vectors was fixed to h = 100. Batch size was tuned over {128, 256, 512, 1024}.
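Two of the settings above lend themselves to a short illustration: rescaling an embedding to the fixed norm R, and enumerating the tuned hyperparameter values. The sketch below is not the actual implementation. The function names are hypothetical, and an exhaustive grid over the two dropout rates and the batch size is an assumption, since the text only states the ranges that were tuned over.

```python
import itertools
import math

R = 5.0  # fixed embedding norm from the text


def scale_to_norm(vector, norm=R):
    """Rescale a vector so that its Euclidean length equals `norm`."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return list(vector)  # a zero vector has no direction to preserve
    return [x * norm / length for x in vector]


# Tuned ranges from the text: dropout in {0.4, 0.45, ..., 0.7}, four batch sizes.
DROPOUT_GRID = [round(0.40 + 0.05 * i, 2) for i in range(7)]
BATCH_SIZES = [128, 256, 512, 1024]


def tuning_grid():
    """Yield every combination of the tuned values (exhaustive grid assumed)."""
    for local_dp, doc_dp, batch in itertools.product(
        DROPOUT_GRID, DROPOUT_GRID, BATCH_SIZES
    ):
        yield {"local_dropout": local_dp, "doc_dropout": doc_dp, "batch_size": batch}


scaled = scale_to_norm([3.0, 4.0])  # length 5 already, so the vector is unchanged
configs = list(tuning_grid())
```

Under the exhaustive-grid assumption this gives 7 × 7 × 4 = 196 configurations, each evaluated on the development set.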
The Wikipedia dumps were parsed using the WikiExtractor script.⁸ The Stanford segmenter was used for Arabic (Monroe et al., 2014) and Chinese (Tseng et al., 2005) segmentation. To avoid dataset-specific tuning, all tunable parameters were tuned on a single development set, containing the hard mentions from the train split released by Tsai and Roth (2016), and the same parameters were applied across all datasets.
Comparative Approaches. We compare against the following state-of-the-art (SoTA) approaches, each listed with the language whose mention contexts it uses in parentheses:
(a) Tsai and Roth (2016) (Target Only) trains a separate model for each language using mention contexts from the target language Wikipedia only. Current SoTA on TH-Test.
(b) Pan et al. (2017) (English Only) uses entity coherence statistics from English Wikipedia and the document context of a mention for XEL. Current SoTA on McN-Test, except for Italian and Turkish, for which it is McNamee et al. (2011).
(c) Sil et al. (2018) (English Only) uses multilingual embeddings to transfer a pre-trained English entity linking model to perform XEL for Spanish and Chinese. Prior probabilities $\Pr_{prior}$ are used as a feature. Current SoTA on TAC15-Test.
7.6. Experiments
The experiments in this section aim to evaluate: (a) whether Xelms can train a better entity linking model for a target language by exploiting additional data from a high-resource language like English (Section 7.6.1); (b) how a single XEL model (trained using Xelms) for multiple related languages compares to individually trained models for each language (Section 7.6.2); and (c) whether adding type information through a multi-task loss to Xelms improves performance (Section 7.6.3).
In all experiments, the linking accuracy of Xelms is reported, averaged over 5 different runs. The best result (shown in bold) is marked with ∗ when its difference from the state-of-the-art (SoTA) is statistically significant (p < 0.01) under Student's one-sample t-test.
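The significance check can be sketched as follows: the accuracies from the 5 runs form the sample, and the SoTA accuracy is the reference value of a one-sample t-test. This is an illustration under stated assumptions rather than the evaluation script: the function name and the accuracy values are hypothetical, and a two-sided test is assumed, for which the critical value of Student's t with 4 degrees of freedom at the 0.01 level is 4.604.

```python
import math
import statistics

# Two-sided critical value of Student's t for df = 4 and alpha = 0.01.
T_CRITICAL_DF4_P01 = 4.604


def one_sample_t(run_accuracies, reference):
    """t statistic of the runs' mean accuracy against a fixed reference value."""
    n = len(run_accuracies)
    std_err = statistics.stdev(run_accuracies) / math.sqrt(n)
    if std_err == 0.0:
        raise ValueError("runs are identical; the t statistic is undefined")
    return (statistics.mean(run_accuracies) - reference) / std_err


# Hypothetical accuracies from 5 runs and a hypothetical SoTA accuracy.
runs = [0.50, 0.52, 0.51, 0.53, 0.49]
sota = 0.45

t_stat = one_sample_t(runs, sota)
significant = abs(t_stat) > T_CRITICAL_DF4_P01  # would earn the * marker
```

The hard-coded critical value is only valid for exactly 5 runs; a different number of runs changes the degrees of freedom and therefore the threshold.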