How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions


Goran Glavaš¹, Robert Litschko¹, Sebastian Ruder²,³*, and Ivan Vulić⁴
¹ Data and Web Science Group, University of Mannheim, Germany
² Insight Research Centre, National University of Ireland, Galway, Ireland
³ Aylien Ltd., Dublin, Ireland
⁴ PolyAI Ltd., London, United Kingdom
{goran, litschko}@informatik.uni-mannheim.de, sebastian@ruder.io, iv250@cam.ac.uk

Abstract

Cross-lingual word embeddings (CLEs) facilitate cross-lingual transfer of NLP models. Despite their ubiquitous downstream usage, increasingly popular projection-based CLE models are almost exclusively evaluated on bilingual lexicon induction (BLI). Even the BLI evaluations vary greatly, hindering our ability to correctly interpret performance and properties of different CLE models. In this work, we take the first step towards a comprehensive evaluation of CLE models: we thoroughly evaluate both supervised and unsupervised CLE models, for a large number of language pairs, on BLI and three downstream tasks, providing new insights concerning the ability of cutting-edge CLE models to support cross-lingual NLP. We empirically demonstrate that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance. We indicate the most robust supervised and unsupervised CLE models and emphasize the need to reassess simple baselines, which still display competitive performance across the board. We hope our work catalyzes further research on CLE evaluation and model analysis.

1 Introduction and Motivation

Following the ubiquitous use of word embeddings in monolingual NLP tasks, research in word representation quickly broadened towards cross-lingual word embeddings (CLEs). CLE models learn vectors of words in two or more languages and represent them in a shared cross-lingual word vector space where words with similar meanings obtain similar vectors, irrespective of their language. Owing to this property, CLEs hold promise to support cross-lingual NLP by enabling multilingual modeling of meaning and facilitating cross-lingual transfer for downstream NLP tasks and under-resourced languages. CLEs are used as (cross-lingual) knowledge sources in a range of tasks, such as bilingual lexicon induction (Mikolov et al., 2013), document classification (Klementiev et al., 2012), information retrieval (Vulić and Moens, 2015), dependency parsing (Guo et al., 2015), sequence labeling (Zhang et al., 2016; Mayhew et al., 2017), and machine translation (Artetxe et al., 2018c; Lample et al., 2018), among others.

Earlier work typically induces CLEs by leveraging bilingual supervision from multilingual corpora aligned at the level of sentences (Zou et al., 2013; Hermann and Blunsom, 2014; Luong et al., 2015, inter alia) and documents (Søgaard et al., 2015; Vulić and Moens, 2016; Levy et al., 2017, inter alia). A recent trend are the so-called projection-based CLE models,¹ which post-hoc align pretrained monolingual embeddings.
Their popularity stems from competitive performance coupled with a conceptually simple design, requiring only cheap bilingual supervision (Ruder et al., 2018b): they demand word-level supervision from seed translation dictionaries spanning at most several thousand word pairs (Mikolov et al., 2013; Huang et al., 2015), but it has also been shown that reliable projections can be bootstrapped from small dictionaries of 50–100 pairs (Vulić and Korhonen, 2016; Zhang et al., 2016), identical strings and cognates (Smith et al., 2017; Søgaard et al., 2018), and shared numerals (Artetxe et al., 2017). Moreover, recent work has leveraged topological similarities between monolingual vector spaces to introduce fully unsupervised projection-based CLE models that do not demand any bilingual supervision (Conneau et al., 2018a; Artetxe et al., 2018b, inter alia). Being conceptually attractive, such weakly supervised and unsupervised CLEs have recently taken the field by storm (Grave et al., 2018; Dou et al., 2018; Doval et al., 2018; Hoshen and Wolf, 2018; Ruder et al., 2018a; Kim et al., 2018; Chen and Cardie, 2018; Mukherjee et al., 2018; Nakashole, 2018; Xu et al., 2018; Alaux et al., 2019).

* Sebastian is now affiliated with DeepMind.
¹ In the literature, these methods are sometimes referred to as mapping-based CLE approaches or offline approaches.

Producing the same end result—a shared cross-lingual vector space—all CLE models are directly comparable, regardless of modelling assumptions and supervision requirements. Therefore, they can support exactly the same groups of tasks. Yet, a comprehensive evaluation of recent CLE models is missing. Limited evaluations impede comparative analyses and may lead to inadequate conclusions, as models are commonly trained to perform well on a single task. While early CLE models (Klementiev et al., 2012; Hermann and Blunsom, 2014) were evaluated on downstream tasks like text classification, a large body of recent work is judged exclusively on the task of bilingual lexicon induction (BLI). This limits our understanding of CLE methodology as: 1) BLI is an intrinsic task, and agreement between BLI and downstream performance has been challenged (Ammar et al., 2016; Bakarov et al., 2018); 2) BLI is not the main motivation for inducing cross-lingual embedding spaces—rather, we seek to exploit CLEs to tackle multilinguality and downstream language transfer (Ruder et al., 2018b). In other words, previous research does not evaluate the true capacity of projection-based CLE models to support cross-lingual NLP. It is unclear whether and to what extent BLI performance of (projection-based) CLE models correlates with performance on downstream tasks of different types.

At the moment, it is virtually impossible to directly compare all recent projection-based CLE models on BLI due to the lack of a common evaluation protocol: different papers consider different language pairs and employ different training and evaluation dictionaries. Furthermore, there is a surprising lack of testing of BLI results for statistical significance. The mismatches in evaluation yield partial conclusions and inconsistencies: on the one hand, some unsupervised models (Artetxe et al., 2018b; Hoshen and Wolf, 2018) reportedly outperform competitive supervised CLE models (Artetxe et al., 2017; Smith et al., 2017). On the other hand, the most recent supervised approaches (Doval et al., 2018; Joulin et al., 2018) report performances surpassing the best unsupervised models.

Supervised projection-based CLEs require merely small-sized translation dictionaries (up to a few thousand word pairs), and such bilingual signal is easily obtainable for most language pairs.² Therefore, despite the attractive zero-supervision setup, we see unsupervised CLE models as practically justified only if such models can, unintuitively, indeed outperform their supervised competition.

Contributions. We provide a comprehensive comparative evaluation of a wide range of state-of-the-art—both supervised and unsupervised—projection-based CLE models. Our benchmark encompasses BLI and three cross-lingual (CL) downstream tasks of different nature: document classification (CLDC), information retrieval (CLIR), and natural language inference (XNLI). We unify evaluation protocols for all models and conduct experiments over 28 language pairs spanning diverse language types. Besides providing a unified testbed for guiding CLE research, we aim to answer the following research questions: 1) Is BLI performance a good predictor of downstream performance for projection-based CLE models? 2) Can unsupervised CLE models indeed outperform their supervised counterparts?

The simplest models often outperform more intricate competitors: we apply a simple bootstrapping to the basic Procrustes model (PROC-B, see §2.2) and show it is competitive across the board. We find that overfitting to BLI may severely hurt downstream performance, warranting the coupling of BLI experiments with downstream evaluations in order to paint a more informative picture of CLE models' properties.

2 Projection-Based CLEs: Methodology

In contrast to more recent unsupervised models, CLE models typically require bilingual signal: aligned words, sentences, or documents. CLE models based on sentence and document alignments have been extensively studied in previous work (Vulić and Korhonen, 2016; Upadhyay et al., 2016; Ruder et al., 2018b). Current CLE research is almost exclusively focused on projection-based CLE models; they are thus also the focus of our study.³

² We argue that, if acquiring a few thousand word translation pairs is a challenge, one probably deals with a truly under-resourced language for which it would be difficult to obtain reliable monolingual embeddings in the first place. Furthermore, there are initiatives in typological linguistics research such as the ASJP database (Wichmann et al., 2018), which offers 40-item word lists denoting the same set of concepts in all the world's languages: https://asjp.clld.org/. Indirectly, such lists can offer the initial seed supervision.
³ These methods a) are not bound to any particular word embedding model (i.e., they are fully agnostic to how we obtain monolingual vectors) and b) do not require any multilingual corpora; they therefore lend themselves to a wider spectrum of languages than the alternatives (Ruder et al., 2018b).

[Figure 1: A general framework for post-hoc projection-based induction of cross-lingual word embeddings. The diagram (omitted here) shows the monolingual spaces X_L1 and X_L2, the vector lookup for the dictionary D, the learning of the projections W_L1 and W_L2 from the aligned matrices X_S and X_T, and the projection of both spaces into the shared space X_CL.]

2.1 Projection-Based Framework

The goal is to learn a projection between independently trained monolingual embedding spaces. The mapping is sought using a seed bilingual lexicon, provided beforehand or extracted without supervision. A general post-hoc projection-based CLE framework is depicted in Figure 1. Let X_L1 and X_L2 be monolingual embedding spaces of two languages. All projection-based CLE approaches encompass the following steps:

Step 1: Construct the seed translation dictionary D = {(w^i_L1, w^j_L2)}^K_{k=1} containing K word pairs. Supervised models use an external dictionary; unsupervised models induce D automatically, assuming approximate isomorphism between X_L1 and X_L2.

Step 2: Align monolingual subspaces X_S = {x^i_L1}^K_{k=1} and X_T = {x^j_L2}^K_{k=1} using the translation dictionary D: retrieve the vectors of {w^i_L1}^K_{k=1} from X_L1 and the vectors of {w^j_L2}^K_{k=1} from X_L2.

Step 3: Learn to project X_L1 and X_L2 to the shared cross-lingual space X_CL based on the aligned matrices X_S and X_T. In the general case, we learn two projection matrices W_L1 and W_L2: X_CL = X_L1 W_L1 ∪ X_L2 W_L2. Many models, however, learn to directly project X_L1 to X_L2, i.e., W_L2 = I and X_CL = X_L1 W_L1 ∪ X_L2.

Algorithm 1: Bootstrapping Procrustes (PROC-B)
  X_L1, X_L2 ← monolingual embeddings of L1 and L2
  D ← initial word translation dictionary
  for each of n iterations do
      X_S, X_T ← lookups for D in X_L1, X_L2
      W_L1 ← argmin_W ||X_S W − X_T||_2
      W_L2 ← argmin_W ||X_T W − X_S||_2
      X'_L1 ← X_L1 W_L1;  X'_L2 ← X_L2 W_L2
      D_{1,2} ← nn(X'_L1, X_L2);  D_{2,1} ← nn(X'_L2, X_L1)
      D ← D ∪ (D_{1,2} ∩ D_{2,1})
  return: W_L1 (and/or W_L2)

2.2 Projection-Based CLE Models

While supervised models employ external dictionaries, unsupervised models automatically induce seed translations using diverse strategies: adversarial learning (Conneau et al., 2018a), similarity-based heuristics (Artetxe et al., 2018b), PCA (Hoshen and Wolf, 2018), and optimal transport (Alvarez-Melis and Jaakkola, 2018).

Canonical Correlation Analysis (CCA). Faruqui and Dyer (2014) use CCA to project X_L1 and X_L2 into a shared space X_CL. The projection matrices W_L1 and W_L2 are obtained by applying CCA to X_S and X_T. We evaluate CCA as a simple baseline that has mostly been neglected in recent BLI evaluations.

Solving the Procrustes Problem. In their seminal work, Mikolov et al. (2013) find W_L1 by minimizing the Euclidean distance between the projected X_S and X_T: W_L1 = argmin_W ||X_L1 W − X_L2||_2. Xing et al. (2015) report BLI gains by imposing orthogonality on W_L1. If W_L1 is orthogonal, the above minimization problem becomes the Procrustes problem, with the following closed-form solution (Schönemann, 1966):

W_{L1} = UV^\top, \quad \text{with} \quad U\Sigma V^\top = \mathrm{SVD}(X_T X_S^\top).    (1)

The map W_L1 obtained as the solution to the Procrustes problem is the main baseline in our evaluation (PROC). Furthermore, following the self-learning procedures that unsupervised models use to augment initially induced lexicons D (Artetxe et al., 2018b; Conneau et al., 2018a), we propose a simple bootstrapping extension of the PROC model (dubbed PROC-B).
With PROC-B we aim to boost performance when starting with smaller but reliable external dictionaries. The procedure is summarized in Algorithm 1. In each iteration, we use the translation lexicon D to learn two projections: W_L1 projects X_L1 to X_L2, and W_L2 projects X_L2 to X_L1. Next, we induce two word translation sets, D_{1,2} and D_{2,1}, as cross-lingual nearest neighbours between (1) X_L1 W_L1 and X_L2 and (2) X_L2 W_L2 and X_L1. We finally augment D with the mutual nearest neighbours, i.e., D_{1,2} ∩ D_{2,1}.⁴
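For concreteness, the two building blocks of PROC and PROC-B (the closed-form Procrustes solution and the mutual-nearest-neighbour dictionary augmentation) can be sketched in a few lines of NumPy. This is only an illustrative sketch under our own naming, using a plain cosine nearest-neighbour search; it is not the exact implementation used in our experiments.

```python
import numpy as np

def procrustes(X_s, X_t):
    """Orthogonal W minimizing ||X_s W - X_t||_F (rows are word vectors); cf. Eq. (1)."""
    U, _, Vt = np.linalg.svd(X_s.T @ X_t)
    return U @ Vt

def nn_indices(X_proj, Y):
    """Index of the cosine nearest neighbour in Y for every row of X_proj."""
    Xn = X_proj / np.linalg.norm(X_proj, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return np.argmax(Xn @ Yn.T, axis=1)

def proc_b_iteration(X_l1, X_l2, seed_pairs):
    """One PROC-B step: learn both maps from the current dictionary and
    augment it with mutual cross-lingual nearest neighbours."""
    src, trg = zip(*seed_pairs)
    X_s, X_t = X_l1[list(src)], X_l2[list(trg)]
    W_12 = procrustes(X_s, X_t)                  # projects L1 into L2
    W_21 = procrustes(X_t, X_s)                  # projects L2 into L1
    d_12 = nn_indices(X_l1 @ W_12, X_l2)         # candidate translations L1 -> L2
    d_21 = nn_indices(X_l2 @ W_21, X_l1)         # candidate translations L2 -> L1
    mutual = {(i, int(j)) for i, j in enumerate(d_12) if d_21[j] == i}
    return W_12, sorted(set(seed_pairs) | mutual)

# Toy usage: 5K-word vocabularies, 300-dim vectors, 1K seed pairs.
rng = np.random.default_rng(0)
X_l1, X_l2 = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 300))
W, augmented = proc_b_iteration(X_l1, X_l2, [(i, i) for i in range(1000)])
```

In line with Algorithm 1 and footnote 4, a single such iteration followed by re-solving Procrustes on the augmented dictionary is typically sufficient.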

Discriminative Latent-Variable Model (DLV). Ruder et al. (2018a) augment the seed supervised lexicon through Expectation-Maximization in a latent-variable model. The source words {w^i_L1}^K_{k=1} and target words {w^j_L2}^K_{k=1} are seen as a fully connected bipartite weighted graph G = (E, V_L1 ∪ V_L2) with edges E = V_L1 × V_L2. By drawing embeddings from a Gaussian distribution and normalizing them, the weight of each edge (i, j) ∈ E is shown to correspond to the cosine similarity between the vectors. In the E-step, a maximal bipartite matching is found on the sparsified graph using the Jonker-Volgenant algorithm (Jonker and Volgenant, 1987). In the M-step, a better projection W_L1 is learned by solving the Procrustes problem.

Ranking-Based Optimization (RCSLS). Instead of minimizing the Euclidean distance, Joulin et al. (2018) follow earlier work (Lazaridou et al., 2015) and maximize a ranking-based objective, specifically cross-domain similarity local scaling (CSLS; Conneau et al., 2018a), between X_S W_L1 and X_T. CSLS is an extension of cosine similarity commonly used for BLI inference. Let r(x^k_L1 W, X_L2) be the average cosine similarity of the projected source vector with its N nearest neighbors from X_L2. Inversely, let r(x^k_L2, X_L1 W) be the average cosine similarity of a target vector with its N nearest neighbors from the projected source space X_L1 W. By relaxing the orthogonality constraint on W_L1, the maximization of relaxed CSLS (dubbed RCSLS) becomes a convex optimization problem:

W_{L1} = \arg\min_{W} \frac{1}{K} \sum_{x^k_{L1} \in X_S,\; x^k_{L2} \in X_T} \Big[ -2\cos\big(x^k_{L1}W,\, x^k_{L2}\big) + r\big(x^k_{L1}W,\, X_{L2}\big) + r\big(x^k_{L2},\, X_{L1}W\big) \Big]    (2)

By maximizing (R)CSLS, this model is explicitly designed to induce cross-lingual embedding spaces that perform well in word translation, i.e., BLI.

Adversarial Alignment (MUSE). MUSE (Conneau et al., 2018a) initializes a seed bilingual lexicon solely from monolingual data using Generative Adversarial Networks (GANs; Goodfellow et al., 2014). The generator component is the linear map W_L1. MUSE improves the generator W_L1 by competing with the discriminator (a feed-forward net) that needs to distinguish between true L2 vectors from X_L2 and projections from L1, X_L1 W_L1. MUSE then improves the GAN-induced mapping through a refinement step similar to the PROC-B procedure. MUSE strongly relies on the approximate isomorphism assumption (Søgaard et al., 2018), which often leads to poor GAN-based initialization, especially for distant languages.

Heuristic Alignment (VECMAP). Artetxe et al. (2018b) assume that word translations have approximately the same vectors of monolingual similarity distributions. The seed dictionary D is the set of nearest neighbors according to the similarity between monolingual similarity distribution vectors. Next, they employ a self-learning bootstrapping procedure similar to MUSE. VECMAP owes its robustness to a number of empirically motivated enhancements. It adopts both multi-step pre-processing (unit length normalization, mean centering, and ZCA whitening; Bell and Sejnowski, 1997) and multiple post-processing steps (cross-correlational re-weighting, de-whitening, and dimensionality reduction; Artetxe et al., 2018a). Moreover, VECMAP critically relies on stochastic dictionary induction: elements of the similarity matrix are randomly set to 0 (with varying probability across iterations), allowing the model to escape poor local optima.
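Because CSLS appears twice in our study, as the objective that RCSLS relaxes and optimizes and as the standard retrieval criterion at BLI inference time, a short sketch may be useful. The NumPy code below is our own illustration of the CSLS definition (twice the cosine minus the two neighbourhood terms r(·)); the neighbourhood size k=10 is an assumption, not a prescribed value.

```python
import numpy as np

def csls_matrix(X_src_proj, X_trg, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r(x, X_trg) - r(y, X_src_proj),
    where r(.) is the mean cosine similarity to the k nearest neighbours
    in the other space (Conneau et al., 2018a)."""
    Xn = X_src_proj / np.linalg.norm(X_src_proj, axis=1, keepdims=True)
    Yn = X_trg / np.linalg.norm(X_trg, axis=1, keepdims=True)
    cos = Xn @ Yn.T                                        # |src| x |trg| cosine similarities
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)      # r(x, X_trg)
    r_trg = np.sort(cos, axis=0)[-k:, :].mean(axis=0)      # r(y, X_src_proj)
    return 2.0 * cos - r_src[:, None] - r_trg[None, :]

# BLI inference with CSLS retrieval: the predicted translation of source
# word i is csls_matrix(X_l1 @ W_l1, X_l2).argmax(axis=1)[i].
```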
Iterative Closest Point Model (ICP). Hoshen and Wolf (2018) induce the seed dictionary D by projecting the vectors of the N most frequent words of both languages to a lower-dimensional space with PCA. They then search for an optimal alignment between L1 words (vectors x^i_1) and L2 words (vectors x^j_2), assuming projections W_1 and W_2. Let f_1(i) (vice versa f_2(j)) be the L2 index (vice versa L1 index) to which x^i_1 (vice versa x^j_2) is aligned. The goal is to find projections W_1 and W_2 that minimize the sum of Euclidean distances between optimally-aligned vectors. Since both the projections and the optimal alignment are unknown, they employ a two-step Iterative Closest Point optimization algorithm that first fixes the projections to find the optimal alignment and then uses that alignment to update the projections, by minimizing:

\sum_{i} \big\lVert x^i_1 W_1 - x^{f_1(i)}_2 \big\rVert + \sum_{j} \big\lVert x^j_2 W_2 - x^{f_2(j)}_1 \big\rVert
+ \lambda \sum_{i} \big\lVert x^i_1 - x^i_1 W_1 W_2 \big\rVert + \lambda \sum_{j} \big\lVert x^j_2 - x^j_2 W_2 W_1 \big\rVert    (3)

⁴ We obtain the best performance with just one bootstrapping iteration. Even when starting with a D of size 1K, the first iteration yields between 5K and 10K mutual translations. The second iteration already produces over 20K translations, which seem to be too noisy for further performance gains.

The cyclical constraints in the second row force vectors not to change under round-projection (to the other space and back). They then employ a bootstrapping procedure similar to PROC-B and MUSE and produce the final W_L1 by solving the Procrustes problem.

Gromov-Wasserstein Alignment Model (GWA). Since embedding models employ metric recovery algorithms, Alvarez-Melis and Jaakkola (2018) cast dictionary induction as an optimal transport problem based on the Gromov-Wasserstein distance. They first compute intra-language costs C_L1 = cos(X_L1, X_L1) and C_L2 = cos(X_L2, X_L2) and inter-language similarities C_{12} = C^2_{L1} p 1^\top_m + 1_n q (C^2_{L2})^\top, with p and q as uniform distributions over the respective vocabularies. They then induce the projections by solving the Gromov-Wasserstein optimal transport problem with a fast iterative algorithm (Peyré et al., 2016), iteratively updating the parameter vectors a and b: a = p ⊘ (Kb) and b = q ⊘ (K^\top a), where ⊘ is element-wise division and K = exp(−Ĉ_Γ / λ), with Ĉ_Γ = C_{12} − 2 C_{L1} Γ C^\top_{L2}. The alignment matrix Γ = diag(a) K diag(b) is recomputed in each iteration. The final projection W_L1 is (again!) obtained by solving the Procrustes problem, using the final alignments from Γ as supervision.
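A schematic sketch of the iterative updates just described may make the GWA procedure easier to follow. The code below is only a simplified illustration of an entropic Gromov-Wasserstein solver (uniform marginals, a fixed number of inner scaling steps, no log-domain stabilization); the hyper-parameter values and function names are our assumptions rather than the configuration of Alvarez-Melis and Jaakkola (2018).

```python
import numpy as np

def gw_coupling(C1, C2, lam=0.05, outer_iters=20, inner_iters=10):
    """Schematic Gromov-Wasserstein alignment between two vocabularies.
    C1, C2: intra-language (cosine) similarity matrices, sizes n x n and m x m."""
    n, m = C1.shape[0], C2.shape[0]
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)             # uniform marginals
    # C12 = C1^2 p 1_m^T + 1_n q (C2^2)^T
    C12 = np.outer((C1 ** 2) @ p, np.ones(m)) + np.outer(np.ones(n), (C2 ** 2) @ q)
    Gamma = np.outer(p, q)                                       # initial coupling
    for _ in range(outer_iters):
        C_hat = C12 - 2.0 * C1 @ Gamma @ C2.T                    # pseudo-cost for current Gamma
        K = np.exp(-C_hat / lam)                                 # log-domain tricks needed in practice
        a, b = np.ones(n), np.ones(m)
        for _ in range(inner_iters):                             # a = p ./ (K b), b = q ./ (K^T a)
            a = p / (K @ b)
            b = q / (K.T @ a)
        Gamma = (a[:, None] * K) * b[None, :]                    # diag(a) K diag(b)
    return Gamma

# The row-wise argmax of the returned coupling yields word alignments that
# can then be fed to the Procrustes solver to obtain the final projection W_L1.
```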
In sum, our brief overview points to the main (dis)similarities of all projection-based CLE models: while they differ in the way the initial seed lexicon is extracted, most models are based on bootstrapping procedures that repeatedly solve the Procrustes problem from Eq. (1), typically on the trimmed vocabulary. In the final step, the fine-tuned linear map is applied on the full vocabulary.

3 Bilingual Lexicon Induction

BLI has become the de facto standard evaluation task for projection-based CLE models. Given a CLE space, the task is to retrieve target language translations for a (test) set of source language words. A typical BLI evaluation in the recent literature reports comparisons with the well-known MUSE model on a few language pairs, always involving English as one of the languages—a comprehensive comparative BLI evaluation conducted on a large set of language pairs is missing. Our evaluation spans 28 language pairs, many of which do not involve English.⁵ Furthermore, to allow for (1) fair comparison across supervised models and (2) direct comparisons across different language pairs, we create training and evaluation dictionaries that are fully aligned across all evaluated language pairs. Finally, we also discuss other choices in existing BLI evaluations which are currently taken for granted: e.g., (in)appropriate evaluation metrics and lack of significance testing.

Language Pairs. Our evaluation comprises eight languages: Croatian (HR), English (EN), Finnish (FI), French (FR), German (DE), Italian (IT), Russian (RU), and Turkish (TR). For diversity, we selected two languages from three different Indo-European branches: Germanic (EN, DE), Romance (FR, IT), and Slavic (HR, RU), as well as two non-Indo-European languages (FI, TR). From these, we create a total of 28 language pairs for evaluation.

Monolingual Embeddings. Following prior work, we use 300-dimensional fastText embeddings (Bojanowski et al., 2017),⁶ pretrained on the complete Wikipedia of each language. We trim all vocabularies to the 200K most frequent words.

Translation Dictionaries. We automatically created translation dictionaries using Google Translate, similar to prior work (Conneau et al., 2018a). We selected the 20K most frequent English words and automatically translated them to the other seven languages. We retained only the tuples for which all translations were unigrams found in the vocabularies of the respective monolingual embedding spaces, leaving us with ≈7K tuples. We reserved the 5K tuples created from the more frequent English words for training, and the remaining 2K tuples for testing. We also created two smaller training dictionaries, by selecting the tuples corresponding to the 1K and 3K most frequent English words.

Evaluation Measures and Significance. BLI is generally cast as a ranking task. Existing work uses precision at rank k (P@k, k ∈ {1, 5, 10}) as a BLI evaluation metric. We advocate the use of mean average precision (MAP) instead.⁷ While correlated with P@k, MAP is more informative: unlike MAP, P@k treats all models that rank the correct translation below k equally.⁸

The limited size of BLI test sets warrants statistical significance testing. Yet, most current work provides no significance testing, declaring limited numeric gains (e.g., 1% P@1) to be relevant. We test BLI results for significance with the two-tailed t-test with Bonferroni correction (Dror et al., 2018).

⁵ English participates as one of the languages in all pairs in existing BLI evaluations, with the exception of Estonian–Finnish evaluated by Søgaard et al. (2018).
⁶ https://fasttext.cc/docs/en/pretrained-vectors.html
⁷ In this setup, with only one correct translation for each query, MAP is equivalent to mean reciprocal rank (MRR).
⁸ E.g., P@5 equally penalizes two models of which one ranks the translation at rank 6 and the other at rank 100K.
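Since each test query has exactly one gold translation, the MAP we report reduces to mean reciprocal rank; the following small NumPy sketch (our own code, with hypothetical variable names) shows the metric next to P@k for contrast.

```python
import numpy as np

def bli_map(scores, gold):
    """MAP (= MRR with a single gold translation per query).
    scores: |queries| x |target vocab| similarity matrix (e.g. CSLS);
    gold:   index of the correct translation for each query."""
    gold_scores = scores[np.arange(len(gold)), gold]
    ranks = 1 + (scores > gold_scores[:, None]).sum(axis=1)   # rank of the gold target
    return float(np.mean(1.0 / ranks))

def bli_p_at_k(scores, gold, k=1):
    """P@k: fraction of queries whose gold translation is ranked in the top k."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))
```

Unlike P@k, the reciprocal-rank term keeps rewarding a model for ranking the correct translation at position 6 rather than 100K, which is exactly the distinction discussed in footnote 8.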

Supervised      Dict   All LPs   Filt. LPs   Succ. LPs
CCA             1K     .289      .404        28/28
CCA             3K     .378      .482        28/28
CCA             5K     .400      .498        28/28
PROC            1K     .299      .411        28/28
PROC            3K     .384      .487        28/28
PROC            5K     .405      .503        28/28
PROC-B          1K     .379      .485        28/28
PROC-B          3K     .398      .497        28/28
DLV             1K     .289      .400        28/28
DLV             3K     .381      .484        28/28
DLV             5K     .403      .501        28/28
RCSLS           1K     .331      .441        28/28
RCSLS           3K     .415      .511        28/28
RCSLS           5K     .437      .527        28/28

Unsupervised
VECMAP          –      .375      .471        28/28
MUSE            –      .183      .458        13/28
ICP             –      .253      .424        22/28
GWA             –      .137      .345        15/28

Table 1: Summary of BLI performance (MAP). All LPs: average scores over all 28 language pairs; Filt. LPs: average scores only over language pairs for which all models in the evaluation yield at least one successful run; Succ. LPs: the number of language pairs for which we obtained at least one successful run. A run is considered successful if MAP ≥ 0.05.

Results and Discussion. Table 1 summarizes BLI performance over all 28 language pairs.⁹ RCSLS (Joulin et al., 2018) displays the strongest BLI performance. This is not surprising given that its learning objective is tailored particularly for BLI. RCSLS outperforms the other supervised models (CCA, PROC, and DLV) trained with exactly the same dictionaries. We confirm previous findings (Smith et al., 2017) suggesting that CCA and PROC exhibit very similar performance: CCA and PROC, as well as DLV, are statistically indistinguishable in terms of BLI performance (even at α = 0.1). Our bootstrapping PROC-B approach significantly (p < 0.01) boosts the performance of PROC when given a small translation dictionary with 1K pairs. For the same 1K dictionary, PROC-B also significantly outperforms RCSLS. Interestingly, training on 5K pairs does not significantly outperform training on 3K pairs for any of the supervised models (whereas using 3K pairs is better than using 1K). This is in line with prior findings (Vulić and Korhonen, 2016) that no significant improvements are to be expected from training linear maps on more than 5K word pairs.

The results highlight VECMAP (Artetxe et al., 2018b) as the most robust choice among unsupervised models: besides being the only model to produce successful runs for all language pairs, it also significantly outperforms the other unsupervised models—both when considering all language pairs and only the subset where other models produce successful runs. However, VECMAP still performs worse (p ≤ 0.0002) than PROC-B (trained on only 1K pairs) and all supervised models trained on 3K or 5K word pairs. Our findings challenge unintuitive claims from recent work (Artetxe et al., 2018b; Hoshen and Wolf, 2018; Alvarez-Melis and Jaakkola, 2018) that unsupervised CLE models perform on a par with or even surpass supervised models.

Table 2 shows the scores for a subset of 10 language pairs using a subset of models from Table 1.¹⁰ As expected, all models work reasonably well for major languages—this is most likely due to the higher quality of the respective monolingual embeddings, which are pre-trained on much larger corpora.¹¹ Language proximity also plays a critical role: on average, models achieve better BLI performance for languages from the same family (e.g., compare the results of HR–RU vs. HR–EN).
The gap between the best-performing supervised model (RCSLS) and the best-performing unsupervised model (VECMAP) is more pronounced for cross-family language pairs, especially those consisting of one Germanic and one Slavic or non-Indo-European language (e.g., 19 points MAP difference for EN–RU, 14 for DE–RU and EN–FI, 10 points for EN–TR and EN–HR, 9 points for DE–FI and EN–HR). We suspect that this is due to the heuristic based on intra-language similarity distributions, employed by Artetxe et al. (2018b) to induce an initial translation dictionary. This heuristic, critically relying on the approximate isomorphism assumption, is less effective the more distant the languages are.

⁹ We provide detailed BLI results for each of the 28 language pairs and all models in the appendix.

4 Downstream Evaluation

Moving beyond the limiting BLI evaluation, we evaluate CLE models on three diverse cross-lingual downstream tasks: 1) cross-lingual transfer for natural language inference (XNLI), a language understanding task; 2) cross-lingual document classification (CLDC), a task commonly requiring shallow n-gram-level modelling; and 3) cross-lingual information retrieval (CLIR), an unsupervised ranking task relying on coarser semantic relatedness.

¹⁰ We provide full BLI results for all 28 language pairs and all models in the appendix.
¹¹ For instance, the EN Wikipedia is approximately 3 times larger than the DE and RU Wikipedias, 19 times larger than the FI Wikipedia, and 46 times larger than the HR Wikipedia.

Model     Dict   EN–DE  IT–FR  HR–RU  EN–HR  DE–FI  TR–FR  RU–IT  FI–HR  TR–HR  TR–RU
PROC      1K     0.458  0.615  0.269  0.225  0.264  0.215  0.360  0.187  0.148  0.168
PROC      5K     0.544  0.669  0.372  0.336  0.359  0.338  0.474  0.294  0.259  0.290
PROC-B    1K     0.521  0.665  0.348  0.296  0.354  0.305  0.466  0.263  0.210  0.230
RCSLS     1K     0.501  0.637  0.291  0.267  0.288  0.247  0.383  0.214  0.170  0.191
RCSLS     5K     0.580  0.682  0.404  0.375  0.395  0.375  0.491  0.321  0.285  0.324
VECMAP    –      0.521  0.667  0.376  0.268  0.302  0.341  0.463  0.280  0.223  0.200
Average   –      0.520  0.656  0.343  0.294  0.327  0.304  0.440  0.260  0.216  0.234

Table 2: BLI performance (MAP) with a selection of models on a subset of evaluated language pairs.

4.1 Natural Language Inference

Large training corpora for NLI exist only in English (Bowman et al., 2015; Williams et al., 2018). Recently, Conneau et al. (2018b) released a multilingual XNLI corpus created by translating the development and test portions of the MultiNLI corpus (Williams et al., 2018) to 15 other languages.

Evaluation Setup. XNLI covers 5 out of the 8 languages from our BLI evaluation: EN, DE, FR, RU, and TR. Our setup is straightforward: we train a well-known robust neural NLI model, the Enhanced Sequential Inference Model (ESIM; Chen et al., 2017),¹² on the large English MultiNLI corpus, using EN word embeddings from a shared EN–L2 (L2 ∈ {DE, FR, RU, TR}) embedding space. We then evaluate the model on the L2 portion of XNLI by feeding L2 vectors from the shared space.¹³

¹² Since our aim is to compare different bilingual spaces—input vectors for ESIM, kept fixed during training—we simply use the default ESIM hyper-parameter configuration.
¹³ Our goal is not to compete with current state-of-the-art systems for (X)NLI (Artetxe and Schwenk, 2018; Lample and Conneau, 2019), but rather to provide means to analyze the properties and relative performance of diverse CLE models in a downstream language understanding task.

Supervised     Dict   EN–DE   EN–FR   EN–TR   EN–RU   Avg
PROC           1K     0.561   0.504   0.534   0.544   0.536
PROC           5K     0.607   0.534   0.568   0.585   0.574
PROC-B         1K     0.613   0.543   0.568   0.593   0.579
PROC-B         3K     0.615   0.532   0.573   0.599   0.580
DLV            5K     0.614   0.556   0.536   0.579   0.571
RCSLS          1K     0.376   0.357   0.387   0.378   0.374
RCSLS          5K     0.390   0.363   0.387   0.399   0.385

Unsupervised
VECMAP         –      0.604   0.613   0.534   0.574   0.581
MUSE           –      0.611   0.536   0.359*  0.363*  0.467
ICP            –      0.580   0.510   0.400*  0.572   0.516
GWA            –      0.427*  0.383*  0.359*  0.376*  0.386

Table 3: XNLI performance (test set accuracy). Bold: highest scores, with mutually insignificant differences according to the non-parametric shuffling test (Yeh, 2000). Asterisks denote language pairs for which CLE models could not yield successful runs in the BLI task.

Results and Discussion. XNLI accuracy scores are summarized in Table 3. The mismatch between BLI and XNLI performance is most obvious for RCSLS. While RCSLS is the best-performing model on BLI, it shows subpar performance on XNLI across the board. This suggests that specializing CLE spaces for word translation can seriously hurt cross-lingual transfer for language understanding tasks. As a second indication of the mismatch, the unsupervised VECMAP model, outperformed by supervised models on BLI, performs on par with PROC and PROC-B on XNLI. Finally, there
are significant differences between BLI and XNLI performance across language pairs—while we observe much better BLI performance for EN–DE and EN–FR compared to EN–RU and especially EN–TR, the XNLI performance of most models for EN–RU and EN–TR surpasses that for EN–FR and is close to that for EN–DE. While this can be an artifact of the XNLI dataset creation, we support these observations for individual language pairs by measuring an overall Spearman correlation of only 0.13 between BLI and XNLI scores over individual language pairs (for all models).

The PROC model performs significantly better on XNLI when trained on 5K pairs than with 1K pairs, and this is consistent with the BLI results. However, we show that we can reach the same performance level using 1K pairs and the proposed PROC-B bootstrapping scheme. VECMAP is again the most robust and most effective unsupervised model, but it is outperformed by the PROC-B model on the more distant language pairs, EN–TR and EN–RU.

4.2 Document Classification

Evaluation Setup. Our next evaluation task is cross-lingual document classification (CLDC). We use the TED CLDC corpus (Hermann and Blunsom, 2014), covering 15 topics and 12 language pairs (EN is one of the languages in all pairs).

A binary classifier is trained and evaluated for each topic and each language pair, using predefined train and test splits. Intersecting TED and BLI languages results in five CLDC evaluation pairs: EN–DE, EN–FR, EN–IT, EN–RU, and EN–TR. Since we seek to compare different CLEs and analyse their contribution, and not to match the state of the art on TED, for the sake of simplicity we employ a light-weight CNN-based classifier in the CLDC experiments.¹⁴

Supervised     Dict   DE     FR     IT     RU     TR     Avg
PROC           1K     .250   .107   .158   .127   .309   .190
PROC           5K     .345   .239   .310   .251   .190   .267
PROC-B         1K     .374   .182   .205   .243   .254   .251
PROC-B         3K     .352   .210   .218   .186   .310   .255
DLV            5K     .299   .175   .234   .375   .208   .258
RCSLS          1K     .557   .550   .516   .466   .419   .501
RCSLS          5K     .588   .540   .451   .527   .447   .510

Unsupervised
VECMAP         –      .433   .316   .333   .504   .439   .405
MUSE           –      .288   .223   .198   .226*  .264*  .240
ICP            –      .492   .254   .457   .362   .175*  .348
GWA            –      .180*  .209*  .206*  .151*  .173*  .184

Table 4: CLDC performance (micro-averaged F1 scores); cross-lingual transfer EN–X. Numbers in bold denote the best scores in the model group. Asterisks denote language pairs for which CLE models did not yield successful runs in the BLI task.

Results and Discussion. The CLDC results (F1, micro-averaged over 12 classes) are shown in Table 4. In contrast to XNLI, RCSLS, the best-performing model on BLI, obtains peak scores on CLDC as well, with a wide margin w.r.t. the other models. It significantly outperforms the unsupervised VECMAP model, which in turn significantly outperforms all other supervised models. Surprisingly, the supervised Procrustes-based models (PROC, PROC-B, and DLV) that performed strongly on both BLI and XNLI display very weak performance on CLDC: this calls for further analyses.

4.3 Information Retrieval

Finally, we analyse the behaviour of CLE models in cross-lingual information retrieval. Unlike XNLI and CLDC, we perform CLIR in an unsupervised fashion by comparing aggregate semantic representations of queries in L1 with aggregate semantic representations of documents in L2. Retrieval arguably requires more language understanding than CLDC (where we capture n-grams) and less than XNLI (modeling subtle meaning nuances).

Evaluation Setup. We employ a simple and effective unsupervised CLIR model from Litschko et al. (2018)¹⁵ that (1) builds query and document representations as weighted averages of word vectors from the CLE space and (2) computes the relevance score as the cosine similarity between the aggregate query and document vectors (see the sketch below). We evaluate CLE models on the standard test collections from the CLEF 2000–2003 ad-hoc retrieval Test Suite.¹⁶ We obtain 9 language pairs by intersecting the languages from CLEF and our BLI evaluation.

Results and Discussion. CLIR performance (MAP scores for all 9 language pairs) is shown in Table 5. The supervised Procrustes-based methods (PROC, PROC-B, DLV) appear to have an edge in CLIR, with the bootstrapping PROC-B model outperforming all other CLE methods.¹⁷ Contrary to the other downstream tasks, VECMAP is not the best-performing unsupervised model on CLIR—ICP performs best considering all nine language pairs (All LPs) and MUSE is best on LPs for which all models produced meaningful spaces (Filt. LPs). RCSLS, the best-performing BLI model, displays only mediocre CLIR performance.
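The aggregation-based CLIR model referenced in the Evaluation Setup above admits a compact sketch: embed queries and documents as (weighted) averages of word vectors from the shared space, then rank documents by cosine similarity. Uniform weights are an assumption made here for brevity; the term-weighting variants of Litschko et al. (2018) would slot into the marked line.

```python
import numpy as np

def embed_text(tokens, vectors, weights=None):
    """Aggregate a tokenized query or document into one CLE-space vector.
    vectors: dict mapping a token to its word vector; OOV tokens are skipped."""
    vecs, ws = [], []
    for tok in tokens:
        if tok in vectors:
            vecs.append(vectors[tok])
            ws.append(1.0 if weights is None else weights.get(tok, 1.0))  # e.g. IDF weight
    return None if not vecs else np.average(np.vstack(vecs), axis=0, weights=ws)

def rank_documents(query_vec, doc_vecs):
    """Indices of documents sorted by cosine similarity to the query vector."""
    D = np.vstack(doc_vecs)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(-(D @ q))
```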
4.4 Further Discussion

At first glance, BLI performance shows a weak and inconsistent correlation with results in the downstream tasks. The behaviour of RCSLS is especially peculiar: it is the best-performing BLI model and it achieves the best results on CLDC by a wide margin, but it is not at all competitive on XNLI and falls short of the other supervised models in CLIR. Downstream results of the other models seem, with a few exceptions, to correspond to BLI trends. To further investigate this, in Table 6 we measure correlations (in terms of Pearson's ρ) between aggregate task performances on BLI and each downstream task, considering (1) all models and (2) all models except RCSLS.¹⁸

¹⁴ We implement a CNN with a single 1-D convolutional layer (8 filters for sizes 2–5) and a 1-max pooling layer, coupled with a softmax classifier. We minimize the negative log-likelihood using the Adam algorithm (Kingma and Ba, 2014).
¹⁵ The model is dubbed BWE-AGG in the original paper.
¹⁶ http://catalog.elra.info/en-us/repository/browse/ELRA-E0008/
¹⁷ Similarly to the BLI evaluation, we test significance by applying a Student's t-test on two lists of ranks of relevant documents (concatenated across all test collections), produced by the two models under comparison. Even with Bonferroni correction for multiple tests, PROC-B significantly outperforms all other CLE models at α = 0.05.
¹⁸ For measuring the correlation between BLI and each downstream task T, we average a model's BLI performance only over the language pairs included in task T.

Model      Dict   DE–FI  DE–IT  DE–RU  EN–DE  EN–FI  EN–IT  EN–RU  FI–IT  FI–RU  Avg

Supervised
PROC       1K     0.147  0.155  0.098  0.175  0.101  0.210  0.104  0.113  0.096  0.133
PROC       5K     0.255  0.212  0.152  0.261  0.200  0.240  0.152  0.149  0.146  0.196
PROC-B     1K     0.294  0.230  0.155  0.288  0.258  0.265  0.166  0.151  0.136  0.216
PROC-B     3K     0.305  0.232  0.143  0.238  0.267  0.269  0.150  0.163  0.170  0.215
DLV        5K     0.255  0.210  0.155  0.260  0.206  0.240  0.151  0.147  0.147  0.197
RCSLS      1K     0.114  0.133  0.077  0.163  0.063  0.163  0.106  0.074  0.069  0.107
RCSLS      5K     0.196  0.189  0.122  0.237  0.127  0.210  0.133  0.130  0.113  0.162

Unsupervised
VECMAP     –      0.240  0.129  0.162  0.200  0.150  0.201  0.104  0.096  0.109  0.155
MUSE       –      0.001* 0.210  0.195  0.280  0.000* 0.272  0.002* 0.002* 0.001* 0.107
ICP        –      0.252  0.170  0.167  0.230  0.230  0.231  0.119  0.117  0.124  0.182
GWA        –      0.218  0.139  0.149  0.013  0.005* 0.007* 0.005* 0.058  0.052  0.072

Table 5: CLIR performance (MAP) of CLE models (the first language in each column is the query language, the second is the language of the document collection). Numbers in bold denote the best scores in the model group. Asterisks denote language pairs for which CLE models did not yield successful runs in the BLI evaluation.

Models            XNLI    CLDC    CLIR
All models        0.269   0.390   0.764
All w/o RCSLS     0.951   0.266   0.910

Table 6: Correlations of model-level results between BLI and each of the three downstream tasks.

Without RCSLS, BLI results correlate strongly with XNLI and CLIR results and weakly with CLDC results. But why does RCSLS diverge from the other models? All other models induce an orthogonal projection (using given or induced dictionaries), minimizing the post-projection Euclidean distances between aligned words. In contrast, by maximizing CSLS, RCSLS relaxes the orthogonality condition imposed on the projection. This allows for distortions of the source embedding space after projection. The exact nature of these distortions and their impact on the downstream performance of RCSLS require further investigation. However, these findings indicate that downstream evaluation is even more important for CLE models that learn non-orthogonal projections. For CLE models with orthogonal projections, downstream results seem to be more in line with BLI performance.

This brief task correlation analysis is based on coarse-grained model-level aggregation. The actual selection of the strongest baseline models requires finer-grained tuning at the level of particular language pairs and evaluation tasks of interest. Nonetheless, our experiments detect two robust baselines that should be included as indicative reference points in future CLE research: PROC-B (supervised) and VECMAP (unsupervised).

5 Conclusion

The rapid development of cross-lingual word embedding (CLE) methods is not met with adequate progress in their fair and systematic evaluation. CLE models are commonly evaluated only on bilingual lexicon induction (BLI), and even the BLI task includes a variety of evaluation setups which are not directly comparable, hindering our ability to correctly interpret the key results. In this work, we have made the first step towards a comprehensive evaluation of CLEs. By systematically evaluating CLE models for many language pairs on BLI and three downstream tasks, we shed new light on the ability of current cutting-edge CLE models to support cross-lingual NLP.
In particular, we have empirically shown that the quality of CLE models is largely task-dependent and that overfitting the models to the BLI task can result in deteriorated performance in downstream tasks. We have highlighted the most robust supervised and unsupervised CLE models and have exposed the need for reassessing existing baselines, as well as for unified and comprehensive evaluation protocols. We hope that this study will encourage future work on CLE evaluation and analysis and help guide the development of new CLE models. We make the code and resources available at: https://github.com/codogogo/xling-eval.

Acknowledgments

The work of the first two authors was supported by the Baden-Württemberg Stiftung (AGREE grant of the Eliteprogramm). We thank the anonymous reviewers for their useful comments and suggestions.

(10) References Jean Alaux, Edouard Grave, Marco Cuturi, and Armand Joulin. 2019. Unsupervised hyperalignment for multilingual word embeddings. In Proceedings of ICLR. David Alvarez-Melis and Tommi Jaakkola. 2018. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of EMNLP, pages 1881– 1890. Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A Smith. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925. Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL, pages 451–462. Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of AAAI, pages 5012–5019. Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018b. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of ACL, pages 789–798. Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018c. Unsupervised neural machine translation. In Proceedings of ICLR.. Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of EMNLP, pages 261–270. Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018a. Word translation without parallel data. In Proceedings of ICLR. Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018b. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP, pages 2475–2485. Zi-Yi Dou, Zhi-Hao Zhou, and Shujian Huang. 2018. Unsupervised bilingual lexicon induction via latent variable models. In Proceedings of EMNLP, pages 621–626. Yerai Doval, Jose Camacho-Collados, Luis Espinosa Anke, and Steven Schockaert. 2018. Improving cross-lingual word embeddings by meeting in the middle. In Proceedings of EMNLP, pages 294–304. Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of ACL, pages 1383–1392. Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, pages 462– 471.. Mikel Artetxe and Holger Schwenk. 2018. Massively multilingual sentence embeddings for zeroshot cross-lingual transfer and beyond. CoRR, abs/1812.10464.. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of NIPS, pages 2672–2680.. Amir Bakarov, Roman Suvorov, and Ilya Sochenkov. 2018. The limitations of cross-language word embeddings evaluation. In Proceedings of *SEM, pages 94–100.. Edouard Grave, Armand Joulin, and Quentin Berthet. 2018. Unsupervised alignment of embeddings with Wasserstein procrustes. CoRR, abs/1805.11222.. Anthony Bell and Terrence Sejnowski. 1997. The ’Independent Components’ of Natural Scenes are Edge Filters. Vision Research.. Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of ACL, pages 1234–1244.. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the ACL, 5:135–146.. 
Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proceedings of ACL, pages 58–68.. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642. Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668.. 719. Yedid Hoshen and Lior Wolf. 2018. Non-adversarial unsupervised word translation. In Proceedings of EMNLP, pages 469–478. Kejun Huang, Matt Gardner, Evangelos Papalexakis, Christos Faloutsos, Nikos Sidiropoulos, Tom Mitchell, Partha P. Talukdar, and Xiao Fu. 2015. Translation invariant word embeddings. In Proceedings of EMNLP, pages 1084–1088..

(11) Roy Jonker and Anton Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325– 340.. Tanmoy Mukherjee, Makoto Yamada, and Timothy Hospedales. 2018. Learning unsupervised word translations without adversaries. In Proceedings of EMNLP, pages 627–632.. Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of EMNLP, pages 2979–2984.. Ndapa Nakashole. 2018. NORMA: Neighborhood sensitive maps for multilingual word embeddings. In Proceedings of EMNLP, pages 512–522.. Yunsu Kim, Jiahui Geng, and Hermann Ney. 2018. Improving unsupervised word-by-word translation with language model and denoising autoencoder. In Proceedings of EMNLP, pages 862–868. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. Proceedings of COLING, pages 1459–1474. Guillaume Lample and Alexis Conneau. 2019. Crosslingual language model pretraining. CoRR, abs/1901.07291. Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Proceedings of EMNLP, pages 5039– 5049. Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning. In Proceedings of ACL-IJCNLP, pages 270–280. Omer Levy, Anders Søgaard, and Yoav Goldberg. 2017. A strong baseline for learning cross-lingual word embeddings from sentence alignments. In Proceedings of EACL, pages 765–774. Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vuli´c. 2018. Unsupervised crosslingual information retrieval using monolingual data only. In Proceedings of SIGIR, pages 1253–1256. Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159. Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In Proceedings of EMNLP, pages 2536–2545. Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.. 720. Gabriel Peyré, Marco Cuturi, and Justin Solomon. 2016. Gromov-Wasserstein averaging of kernel and distance matrices. In Proceedings of ICML, pages 2664–2672. Sebastian Ruder, Ryan Cotterell, Yova Kementchedjhieva, and Anders Søgaard. 2018a. A discriminative latent-variable model for bilingual lexicon induction. In Proceedings of EMNLP, pages 458–468. Sebastian Ruder, Anders Søgaard, and Ivan Vuli´c. 2018b. A survey of cross-lingual embedding models. arXiv preprint arXiv:1706.04902. Peter H Schönemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10. Samuel L. Smith, David H.P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In Proceedings of ICLR. Anders Søgaard, Željko Agi´c, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. 2015. Inverted indexing for cross-lingual NLP. In Proceedings of ACL, pages 1713–1722. Anders Søgaard, Sebastian Ruder, and Ivan Vuli´c. 2018. 
On the limitations of unsupervised bilingual dictionary induction. In Proceedings of ACL, pages 778– 788. Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of ACL, pages 1661–1670. Ivan Vuli´c and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL, pages 247–257. Ivan Vuli´c and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of SIGIR, pages 363–372. Ivan Vuli´c and Marie-Francine Moens. 2016. Bilingual distributed word representations from documentaligned comparable data. Journal of Artificial Intelligence Research, 55:953–994. Søren Wichmann, André Müller, Viveka Velupillai, Cecil H Brown, Eric W Holman, Pamela Brown, Sebastian Sauppe, Oleg Belyaev, Matthias Urban, Zarina Molochieva, et al. 2018. The asjp database (version 18)..

(12) Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122. Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of NAACL-HLT, pages 1006–1011. Ruochen Xu, Yiming Yang, Naoki Otani, and Yuexin Wu. 2018. Unsupervised cross-lingual transfer of word embedding spaces. In Proceedings of EMNLP, pages 2465–2474. Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of COLING, pages 947–953. Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016. Ten pairs to tag – Multilingual POS tagging via coarse mapping between embeddings. In Proceedings of NAACL-HLT, pages 1307–1317. Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP, pages 1393–1398.. 721.

