
CHAPTER 7 : Multilingual Supervision for Cross-lingual Entity Linking

7.4 Training and Inference

7.4.3 Training Objective

When only the mention context encoder and entity vectors are trained, the EC-Loss averaged over all training mentions is minimized. When the two type-aware losses are used, a weighted sum of EC-Loss, TE-Loss, and TC-Loss is minimized, using the weighting scheme of Kendall et al. (2018),

$$\frac{\text{EC-Loss}}{2\lambda_{EC}^2} + \frac{\text{TE-Loss}}{2\lambda_{TE}^2} + \frac{\text{TC-Loss}}{2\lambda_{TC}^2} + \log\lambda_{EC}^2 + \log\lambda_{TE}^2 + \log\lambda_{TC}^2 \tag{7.13}$$

Here $\lambda_i$ are learnable scalar weighting parameters, and the respective $\frac{1}{2\lambda_i^2}$ and $\log\lambda_i^2$ terms ensure that $\lambda_i^2$ does not grow unboundedly. This way, the model learns the relative weight for each loss term.
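The weighted objective of Equation 7.13 can be sketched in a few lines. The following is a minimal, framework-free illustration, not the thesis implementation: the function name `weighted_objective`, the dictionary keys, and the loss values are all assumptions made for the example, and in the actual model the $\lambda_i$ would be trainable parameters updated by the optimizer rather than fixed numbers.

```python
import math

def weighted_objective(losses, lambdas):
    """Uncertainty-weighted sum of task losses (form of Eq. 7.13):
    sum_i  L_i / (2 * lambda_i^2) + log(lambda_i^2)."""
    total = 0.0
    for name, loss in losses.items():
        lam_sq = lambdas[name] ** 2
        # The 1/(2*lambda^2) factor down-weights the loss, while the
        # log(lambda^2) term penalizes making lambda arbitrarily large.
        total += loss / (2.0 * lam_sq) + math.log(lam_sq)
    return total

# Illustrative (made-up) loss values; these are not experimental results.
losses = {"EC": 2.0, "TE": 1.0, "TC": 1.0}
unit_lambdas = {"EC": 1.0, "TE": 1.0, "TC": 1.0}

# With every lambda equal to 1 the log terms vanish and the objective
# is half of the plain sum of the losses.
objective_at_unit = weighted_objective(losses, unit_lambdas)
```

For a fixed loss value $L$, setting the derivative of $L/(2s) + \log s$ with respect to $s = \lambda^2$ to zero gives $s = L/2$, which is one way to see why the weights settle at a finite value instead of growing without bound.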

During training, mentions from different languages are mixed using an inverse-ratio mini-batch mixing strategy. That is, if two languages have training data sizes proportional to $\alpha : \beta$, then at any time during training the mini-batches seen from them are in the ratio $\frac{1}{\alpha} : \frac{1}{\beta}$. This strategy prevents languages with more training data from overwhelming languages with less training data. Though simple, we found this strategy yielded good results.
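One way to realize inverse-ratio mixing is to sample the language of each mini-batch with probability proportional to the inverse of its training set size. The sketch below assumes such a stochastic sampler; the text only specifies the target ratio, so the sampling mechanism, the function names, and the toy sizes are all hypothetical.

```python
import random

def inverse_ratio_probs(sizes):
    """Map each language to a mini-batch sampling probability
    proportional to 1 / (training set size)."""
    inverse = {lang: 1.0 / n for lang, n in sizes.items()}
    norm = sum(inverse.values())
    return {lang: w / norm for lang, w in inverse.items()}

def batch_schedule(sizes, num_batches, seed=0):
    """Draw the language of each successive mini-batch."""
    probs = inverse_ratio_probs(sizes)
    langs = sorted(probs)
    rng = random.Random(seed)
    weights = [probs[lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=num_batches)

# Toy sizes in the ratio 1 : 3 (hypothetical, not the Table 24 numbers).
toy_sizes = {"small": 1000, "large": 3000}
probs = inverse_ratio_probs(toy_sizes)   # batches mix in the ratio 1/1 : 1/3
schedule = batch_schedule(toy_sizes, num_batches=10)
```

With sizes in the ratio 1 : 3, the smaller language is drawn three times as often as the larger one, matching the $\frac{1}{\alpha} : \frac{1}{\beta}$ ratio described above.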

7.5. Experimental Setup

This section describes the training and evaluation datasets, and the previous XEL approaches from the literature used in the experiments for comparison.

Training Mentions. Following previous work, hyperlinks from Wikipedia (dumps dated 05/20/2017) are used as the source of grounded mentions for supervision. As described earlier (Section 4.3), Wikipedias in different languages contain articles describing the same entity, which can be resolved by using inter-language links. For instance, the article लिवरपूल in the Hindi Wikipedia resolves to Liverpool in English. Training mention statistics are shown in Table 24.

Lang.           # Train Mentions    Size Relative to # English Mentions
German (de)     22.6M               43.7%
Spanish (es)    13.8M               26.7%
French (fr)     16.2M               31.3%
Italian (it)    11.5M               22.2%
Chinese (zh)    5.9M                11.4%
Arabic (ar)     3.1M                6.0%
Turkish (tr)    1.8M                3.5%
Tamil (ta)      473k                0.9%

Table 24: Number of train mentions (from Wikipedia) in each language, with % size relative to English (51.7M mentions). Train mentions from Wikipedias like Arabic, Turkish and Tamil are <10% the size of those from the English Wikipedia.

The evaluation spans 8 languages: German (de), Spanish (es), Italian (it), French (fr), Chinese (zh), Arabic (ar), Turkish (tr) and Tamil (ta), each of which has a varying amount of grounded mentions from the respective Wikipedia (Table 24). Note that our method is applicable to any of the 293 Wikipedia languages as a target language.

Evaluation Datasets. Xelms is evaluated on the following benchmark datasets, spanning 8 different languages, thus providing an extensive evaluation.

(a) McN-Test: Dataset from McNamee et al. (2011). The test set was collected by using parallel document collections, and then crowd-sourcing the ground truths. All the test mentions in this dataset consist of person names only.

(b) TH-Test: A subset of the dataset used in Tsai and Roth (2016), derived from Wikipedia.[6] The mentions in the dataset fall in two categories, easy and hard, where hard mentions are those for which the most likely candidate according to the prior probability (i.e., $\arg\max_e \Pr_{\text{prior}}(e \mid m)$) is not the correct title. Indeed, most Wikipedia mentions can be correctly linked by selecting the most likely candidate (Ratinov et al., 2011). All the hard mentions from Tsai and Roth (2016)'s test splits, henceforth called TH-Test, are used for each language.

[6] Pan et al. (2017) also created a dataset using Wikipedia, but did not categorize mentions like Tsai and Roth (2016). Preliminary experiments on their dataset showed Xelms consistently beat Pan et al. (2017)'s model. TH-Test was chosen for more controlled experiments.

(c) TAC15-Test: Dataset from the TAC-KBP 2015 Trilingual Entity Linking Track (Ji et al., 2015) for Chinese and Spanish. It contains 84 news and 82 discussion forum articles in Chinese (total 166 documents) and 84 news and 83 discussion forum articles in Spanish (total 167 documents).
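The easy/hard distinction used for TH-Test reduces to a comparison between the highest-prior candidate and the gold title. The snippet below is a small sketch of that check; the function name `is_hard`, the candidate titles, and the prior values are invented for illustration and are not taken from the dataset.

```python
def is_hard(prior, gold):
    """Return True when the highest-prior candidate is not the gold title.

    prior: dict mapping candidate title -> Pr_prior(e | m) for one mention m.
    gold:  the correct title for that mention.
    """
    top_candidate = max(prior, key=prior.get)
    return top_candidate != gold

# Hypothetical prior for a single mention (values invented for illustration).
prior = {"Liverpool": 0.7, "Liverpool_F.C.": 0.3}

easy_case = is_hard(prior, gold="Liverpool")        # prior alone is enough
hard_case = is_hard(prior, gold="Liverpool_F.C.")   # prior picks the wrong title
```

A mention flagged as hard is one on which a prior-only baseline necessarily fails, which is why restricting evaluation to these mentions isolates the contribution of the context model.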

All models are evaluated using linking accuracy on gold mentions, which is the fraction of test mentions that are linked to the correct entity. It is assumed gold mentions are provided at test time, following common practice (Tsai and Roth, 2016; Ganea and Hofmann, 2017; Gupta et al., 2017). Table 25 shows the source domains of the evaluation datasets.

Dataset       Lang.                             Source
TH-Test       de, es, fr, it, zh, ar, tr, ta    Wikipedia
McN-Test      de, es, fr, it, zh, ar, tr        News, Parliament Proceedings
TAC15-Test    es, zh                            News, Discussion Forums

Table 25: Evaluation datasets used in the cross-lingual entity linking experiments.
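The evaluation metric, linking accuracy on gold mentions, is simply the fraction of test mentions whose predicted entity matches the gold entity. A minimal sketch follows; the function name and the toy predictions are assumptions for the example.

```python
def linking_accuracy(predicted, gold):
    """Fraction of gold mentions linked to the correct entity."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one mention is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Toy example with four gold mentions (entity names are placeholders).
predicted = ["A", "B", "C", "D"]
gold = ["A", "B", "X", "D"]
accuracy = linking_accuracy(predicted, gold)  # 3 of 4 mentions are correct
```

Because gold mentions are assumed to be given, the predicted and gold lists are aligned one-to-one and no mention detection errors enter the score.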

Implementation and Tuning. All models were implemented using PyTorch. The ADAM (Kingma and Ba, 2014) optimizer was used with a learning rate of 1e-3 in all experiments. The candidate generator was limited to output the top-20 candidates for all experiments. The local context window was set to W = 25 tokens. The convolutional filter width was set to k = 5. The mention surface vocabulary V was limited to size 1M for both monolingual and joint training. The multilingual embeddings (d = 300) were scaled to a fixed norm R (= 5.0), and were not updated during training. Dropout (Srivastava et al., 2014) was separately applied to the local context and document context features, each being tuned over {0.4, 0.45, ..., 0.7}. The size of entity, type and context vectors was fixed to h = 100. Batch size was tuned over {128, 256, 512, 1024}.
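The settings above can be collected into a small configuration sketch, together with the embedding rescaling step. This is an illustrative, framework-free rendering: the dictionary keys and the helper `rescale_to_norm` are hypothetical names, and the treatment of an all-zero vector is an assumption, since the text does not say how such vectors are handled.

```python
import math

# Fixed settings as reported in the text (key names are hypothetical).
FIXED = {
    "learning_rate": 1e-3,
    "top_k_candidates": 20,
    "context_window": 25,     # W
    "filter_width": 5,        # k
    "embedding_dim": 300,     # d
    "embedding_norm": 5.0,    # R
    "hidden_size": 100,       # h
}

# Tuned settings: dropout over {0.4, 0.45, ..., 0.7}, batch size over four values.
DROPOUT_GRID = [round(0.40 + 0.05 * i, 2) for i in range(7)]
BATCH_SIZE_GRID = [128, 256, 512, 1024]

def rescale_to_norm(vector, target_norm=FIXED["embedding_norm"]):
    """Scale a vector so that its Euclidean norm equals target_norm.

    Assumption: an all-zero vector is returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x * target_norm / norm for x in vector]

scaled = rescale_to_norm([1.0, 2.0, 2.0])  # input norm is 3, output norm is 5
```

Rescaling to a common norm keeps embeddings from different languages on the same scale, which matters here because the embeddings are frozen during training.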

The Wikipedia dumps were parsed using the WikiExtractor script. The Stanford segmenter was used for Arabic (Monroe et al., 2014) and Chinese (Tseng et al., 2005) segmentation. Any dataset-specific tuning is avoided by tuning on a development set and applying the same parameters across all datasets. All tunable parameters were tuned on a development set containing the hard mentions from the train split released by Tsai and Roth (2016).

Comparative Approaches. We compare against the following state-of-the-art (SoTA) approaches, each listed with the language from which it uses mention contexts given in parentheses:

(a) Tsai and Roth (2016) (Target Only) trains a separate model for each language using mention contexts from the target language Wikipedia only. Current SoTA on TH-Test.

(b) Pan et al. (2017) (English Only) uses entity coherence statistics from English Wikipedia and the document context of a mention for XEL. Current SoTA on McN-Test, except for Italian and Turkish, for which it is McNamee et al. (2011).

(c) Sil et al. (2018) (English Only) uses multilingual embeddings to transfer a pre-trained English entity linking model to perform XEL for Spanish and Chinese. Prior probabilities $\Pr_{\text{prior}}$ are used as a feature. Current SoTA on TAC15-Test.

7.6. Experiments

The experiments in this section aim to evaluate: (a) whether Xelms can train a better entity linking model for a target language by exploiting additional data from a high-resource language like English (Section 7.6.1); (b) how a single XEL model (trained using Xelms) for multiple related languages compares to individually trained models for each language (Section 7.6.2); and (c) whether adding type information through a multi-tasking loss to Xelms improves performance (Section 7.6.3).

In all experiments, the linking accuracy of Xelms is reported, averaged over 5 different runs. The best result (shown in bold) is marked with ∗ when it is statistically significant (p < 0.01) against the state of the art (SoTA) under Student's one-sample t-test.
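A one-sample t-test of this kind compares the mean accuracy over the 5 runs against the single published SoTA number. The sketch below computes the t-statistic using only the standard library and compares it with the tabulated critical value for 4 degrees of freedom. Two things are assumptions: that the test is two-sided (the text does not say), and the run accuracies and SoTA value, which are invented placeholders rather than reported results.

```python
import math
import statistics

def one_sample_t(samples, reference):
    """t-statistic for H0: the mean of `samples` equals `reference`."""
    n = len(samples)
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)  # sample std, n-1 denominator
    return (mean - reference) / std_err

# Two-sided critical value of Student's t at p = 0.01 with 4 degrees of freedom.
T_CRIT_DF4_P01 = 4.604

# Placeholder numbers for illustration only (not results from the experiments).
runs = [71.0, 71.5, 70.5, 71.2, 70.8]
sota = 70.0

t_stat = one_sample_t(runs, sota)
significant = abs(t_stat) > T_CRIT_DF4_P01
```

With only 5 runs the test has 4 degrees of freedom, so the critical value is large and a result earns the ∗ only when the run-to-run variance is small relative to the gap over the SoTA number.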

In document Exploiting Cross-Lingual Representations For Natural Language Processing (Page 137-141)