Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings

d′_1, . . . , d′_N′}. Compute a dim-dimensional document embedding vec(d′) for each d′ ∈ DC using the dim-dimensional WEs from the set BWE obtained in the previous step and a semantic composition model (e.g., ADD-BASIC or ADD-SI).
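
As an illustration of this step, the sketch below composes a document embedding additively, assuming ADD-BASIC denotes plain (unweighted) addition of word vectors and an ADD-SI-style variant passes per-word weights (e.g. self-information/IDF); the function name and the dictionary-of-vectors representation are illustrative, not the paper's implementation.

```python
import numpy as np

def compose_document_embedding(tokens, word_vecs, weights=None):
    """Additive composition of word embeddings into a single document vector.

    weights=None corresponds to a plain sum (ADD-BASIC-style); passing a dict of
    per-word weights gives a weighted, ADD-SI-style variant."""
    dim = len(next(iter(word_vecs.values())))
    doc_vec = np.zeros(dim)
    for tok in tokens:
        if tok in word_vecs:                              # skip out-of-vocabulary tokens
            w = 1.0 if weights is None else weights.get(tok, 0.0)
            doc_vec += w * np.asarray(word_vecs[tok])
    norm = np.linalg.norm(doc_vec)
    return doc_vec / norm if norm > 0 else doc_vec        # unit length for cosine retrieval
```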

Cross-Lingual Word Embeddings and the Structure of the Human Bilingual Lexicon

In this work, we ask if the structure of cross-lingual word embedding spaces has the properties that would be expected given human bilingual behaviour. Assuming the distributed, integrated model of the lexicon proposed in the bilingualism literature, the underlying linking hypothesis is that coactivation effects (whether expressed as similarity judgments or measured as reaction times) are the expression of greater or smaller proximity in a multi-dimensional space. On this basis, several hypotheses are proposed. The first hypothesis aims to establish whether cross-lingual word embeddings are sensitive to a bilingual situation and generate an integrated cross-lingual space. Secondly, we test if cross-lingual word embeddings show the shared-translation effect. Finally, we test the cross-linguistic competition/priming of lexical forms from the L1 to the L2 language, comparing cross-lingual to monolingual spaces in true-friends and false-friends scenarios.

Improving cross-lingual word embeddings by meeting in the middle

languages may be more reliable than that of the other. In such cases, it may be of interest to replace the vectors µ_{w,w′} by a weighted average of the monolingual word vectors. Second, while we have only considered bilingual scenarios in this paper, our approach can naturally be applied to scenarios involving more languages. In this case, we would first choose a single target language, and obtain alignments between all the other languages and this target language. To apply our model, we can then simply learn mappings to predict averaged word vectors across all languages. Finally, it would also be interesting to use the obtained embeddings in downstream applications such as language identification or cross-lingual sentiment analysis, and extend our analysis to other languages, with a particular focus on morphologically-rich languages (after seeing our success with Finnish), for which the bilingual induction task has proved more challenging for standard cross-lingual embedding models (Søgaard et al., 2018).
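
A minimal sketch of the weighted-average idea mentioned above, assuming both vectors already live in the shared space; the weighting parameter and function name are illustrative.

```python
import numpy as np

def weighted_average_vector(src_vec, tgt_vec, src_weight=0.5):
    """Weighted counterpart of the averaged vectors mu_{w,w'}: src_weight > 0.5
    trusts the (mapped) source-language vector more than its translation's vector."""
    v = src_weight * np.asarray(src_vec) + (1.0 - src_weight) * np.asarray(tgt_vec)
    return v / np.linalg.norm(v)
```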

Learning Bilingual Sentiment-Specific Word Embeddings without Cross-lingual Supervision

Sentimental Embeddings Continuous word representations encode the syntactic context of a word but often ignore information about sentiment polarity. This drawback makes it hard to distinguish words with similar syntactic contexts but opposite sentiment polarity (e.g. good and bad), resulting in unsatisfactory performance on sentiment analysis. Tang et al. (2014) learned word representations that encode both syntactic context and sentiment polarity by adding an objective to classify the polarity of an n-gram. This method can be generalized to the cross-lingual setting by training monolingual sentimental embeddings in both languages and then aligning them in a common space. However, it requires sentiment resources in the target language and is thus impractical for low-resource languages.
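
To make the joint objective concrete, here is a toy PyTorch sketch in the spirit of the n-gram polarity idea attributed to Tang et al. (2014); the module and layer names are hypothetical and this is not the original architecture.

```python
import torch.nn as nn

class SentimentAwareNgramModel(nn.Module):
    """Toy model: one shared embedding table feeds two heads, a context-plausibility
    score for the n-gram and a polarity classifier, so the learned vectors encode
    both syntactic context and sentiment."""
    def __init__(self, vocab_size, dim=50, ngram=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.context_head = nn.Linear(ngram * dim, 1)     # is this a plausible n-gram?
        self.sentiment_head = nn.Linear(ngram * dim, 2)   # positive vs. negative

    def forward(self, ngram_ids):                         # ngram_ids: (batch, ngram)
        h = self.emb(ngram_ids).flatten(start_dim=1)
        return self.context_head(h), self.sentiment_head(h)

# Training would minimise a context loss plus a weighted cross-entropy polarity loss.
```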

Learning cross-lingual word embeddings from Twitter via distant supervision

Various methods have been proposed for aligning two monolingual embedding spaces. Two recent methods in particular have obtained outstanding results in both unsupervised and semi-supervised settings: MUSE (Conneau et al. 2018) and VecMap (Artetxe, Labaka, and Agirre 2018a). Recall that the seed supervision signal required for these methods comes in the form of a bilingual dictionary, which may be external or automatically generated. These two methods are similar in that they learn an orthogonal linear transformation which maps one monolingual embedding space into the other. In VecMap this is done using SVD, while MUSE uses Procrustes analysis. VecMap applies this approach in an iterative fashion, where at each step the previously used bilingual dictionary is extended based on the current alignment. It is also worth noting that after the initial orthogonal transformation, VecMap fine-tunes the resulting embeddings by giving more weight to highly correlated embedding components, improving its performance in word translation.
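
The shared core of both approaches, solving for an orthogonal map in closed form via an SVD (the Procrustes solution), can be sketched in a few lines; the seed matrices below are assumed to hold the embeddings of the dictionary pairs, one pair per row.

```python
import numpy as np

def procrustes_map(X_seed, Y_seed):
    """Orthogonal W minimising ||X_seed @ W - Y_seed||_F, obtained from the SVD of
    X_seed^T Y_seed; apply it to the full source space with X_full @ W."""
    U, _, Vt = np.linalg.svd(X_seed.T @ Y_seed)
    return U @ Vt
```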

Cross-lingual Wikification Using Multilingual Embeddings

Besides the CCA-based multilingual word embeddings (Faruqui and Dyer, 2014) that we extend in Section 3, several other methods also try to embed words in different languages into the same space. Hermann and Blunsom (2014) use a sentence-aligned corpus to learn bilingual word vectors. The intuition behind the model is that representations of aligned sentences should be similar. Unlike the CCA-based method, which learns monolingual word embeddings first, this model directly learns the cross-lingual embeddings. Luong et al. (2015) propose Bilingual Skip-Gram, which extends the monolingual skip-gram model and learns bilingual embeddings using a parallel corpus and word alignments. The model jointly considers within-language co-occurrence and meaning equivalence across languages. That is, the monolingual objective for each language is also included in their learning objective. Several recent approaches (Gouws et al., 2014; Coulmance et al., 2015; Shi et al., 2015; Soyer et al., 2015) also require a sentence-aligned parallel corpus to learn multilingual embeddings. Unlike other approaches, Vulić and Moens (2015) propose a model that only requires comparable corpora in two languages to induce cross-lingual vectors. Similar to our proposed approach, this model can also be applied to all languages in Wikipedia if we treat documents across two Wikipedia languages as a comparable corpus. However, the quality and quantity of this comparable corpus for low-resource languages

A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Monolingual Word Embeddings All monolingual embeddings are trained using a skip-gram model with negative sampling (Mikolov et al., 2013b). The dimension size is 100 or 200. All other hyperparameters are the Gensim defaults (Řehůřek and Sojka, 2010). News articles except for Amharic are from the Leipzig Corpora (Goldhahn et al., 2012). For Amharic, we use documents from LORELEI (Strassel and Tracey, 2016). MeCab (Kudo et al., 2004) tokenizes Japanese sentences. Bilingual Seed Lexicon For supervised methods, bilingual lexicons from Rolston and Kirchhoff (2016) induce all cross-lingual embeddings except for Danish, which uses Wiktionary.
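
For reference, a training call matching this setup might look as follows in Gensim (4.x argument names; earlier versions use `size` instead of `vector_size`), with `sentences` assumed to be an iterable of token lists from the news corpora.

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,   # iterable of tokenised sentences
    vector_size=100,       # the excerpt uses 100 or 200 dimensions
    sg=1,                  # skip-gram architecture
    negative=5,            # negative sampling (Gensim default)
    window=5, min_count=5, workers=4,  # remaining values are Gensim defaults
)
model.wv.save_word2vec_format("mono_embeddings.vec")
```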

Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only

Recently, methods for inducing shared cross-lingual embedding spaces without the need for any bilingual signal (not even word translation pairs) have been proposed [1, 3]. These methods exploit inherent structural similarities of induced monolingual embedding spaces to learn vector space transformations that align the source language space to the target language space, with strong results observed for bilingual lexicon extraction. In this work, we show that these unsupervised cross-lingual word embeddings offer strong support to the construction of fully unsupervised ad-hoc CLIR models. We propose two different CLIR models: 1) term-by-term translation through the shared cross-lingual space, and 2) query and document representations as IDF-weighted sums of constituent word vectors. To the best of our knowledge, our CLIR
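
A rough sketch of the second model: a query or document is represented as an IDF-weighted sum of cross-lingual word vectors and documents are ranked by cosine similarity. The data structures (dicts of vectors and IDF values) are assumptions for illustration.

```python
import numpy as np

def idf_weighted_embedding(tokens, word_vecs, idf):
    """IDF-weighted sum of the cross-lingual embeddings of a text's words."""
    dim = len(next(iter(word_vecs.values())))
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in word_vecs:
            vec += idf.get(tok, 0.0) * np.asarray(word_vecs[tok])
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def rank(query_vec, doc_vecs):
    """Return document indices sorted by cosine similarity (vectors are unit length)."""
    sims = np.array([query_vec @ d for d in doc_vecs])
    return np.argsort(-sims)
```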

Weakly-Supervised Concept-based Adversarial Learning for Cross-lingual Word Embeddings

Recently, efforts have concentrated on how to limit or avoid reliance on dictionaries. Good results were achieved with some drastically minimal techniques. Zhang et al. (2016) achieved good results at bilingual POS tagging, but not bilingual lexicon induction, using only ten word pairs to build a coarse orthonormal mapping between source and target monolingual embeddings. The work of Smith et al. (2017) has shown that a singular value decomposition (SVD) method can produce a competitive cross-lingual mapping by using identical character strings across languages. Artetxe et al. (2017, 2018b) proposed a self-learning framework, which iteratively trains its cross-lingual mapping by using dictionaries trained in previous rounds. The initial dictionary of the self-learning can be reduced to 25 word pairs or even only a list of numerals and still achieve competitive performance. Furthermore, Artetxe et al. (2018a) extend their self-learning framework to unsupervised models, and build the state of the art for bilingual lexicon induction. Instead of using a pre-built dictionary for initialization, they sort the values of the word vectors in both the source and the target distribution, treat two vectors that have similar permutations as possible translations, and use them as the initialization dictionary. Additionally, their unsupervised framework also includes many optimization augmentations, such as stochastic dictionary induction and symmetric re-weighting, among others.
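
A toy version of such a self-learning loop, alternating an orthogonal (Procrustes) mapping with nearest-neighbour dictionary induction; it omits the stochastic induction, re-weighting and other refinements mentioned above, and the variable names are illustrative.

```python
import numpy as np

def self_learning(X, Y, seed_pairs, n_iter=5):
    """X, Y: row-normalised source/target embedding matrices; seed_pairs: initial
    list of (source_index, target_index) translation pairs."""
    pairs = list(seed_pairs)
    for _ in range(n_iter):
        src, tgt = zip(*pairs)
        U, _, Vt = np.linalg.svd(X[list(src)].T @ Y[list(tgt)])
        W = U @ Vt                                   # orthogonal map for the current dictionary
        nearest = np.argmax(X @ W @ Y.T, axis=1)     # re-induce: nearest target for each source word
        pairs = list(enumerate(nearest))
    return W, pairs
```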

Delexicalized Word Embeddings for Cross-lingual Dependency Parsing

This paper presents a new approach to the problem of cross-lingual dependency parsing, aiming at leveraging training data from different source languages to learn a parser in a target language. Specifically, this approach first constructs word vector representations that exploit structural (i.e., dependency-based) contexts but only consider the morpho-syntactic information associated with each word and its contexts. These delexicalized word embeddings, which can be trained on any set of languages and capture features shared across languages, are then used in combination with standard language-specific features to train a lexicalized parser in the target language. We evaluate our approach through experiments on a set of eight different languages that are part of the Universal Dependencies Project. Our main results show that using such delexicalized embeddings, either trained in a monolingual or multilingual fashion, achieves significant improvements over monolingual baselines.

Sentiment analysis for Hinglish code-mixed tweets by means of cross-lingual word embeddings

It can also be observed that our transfer learning based model is able to perform sentiment analysis with acceptable accuracies without needing code-mixed supervision of any degree. This is a very promising outcome for low(er)-resourced languages, where large dedicated data sets for NLP tasks such as sentiment analysis are lacking. Regarding the baseline approaches, it is also worth noting that the Code-Mixed Baseline does not perform a lot better than the English baseline as one would expect. This can probably be attributed to the quality of the monolingual embeddings, since the English embeddings were trained on the vast Common Crawl data while the Code-Mixed embeddings were trained on a little more than 100,000 scraped tweets. While the classification is understandably accurate for tweets containing a majority of English words like “Exclusive censor reports of Bharat is world class Words like movie of the year” and less reliable for sentences predominantly containing code-mixed words like “YouTube views ko vote samjhne wale agar is bar Nahi jita to Kabhi Nahi jitega”, the performance could be improved with better alignments and possibly a hybrid approach with minimal supervision.

Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing

We introduce a novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion. While contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts, aligning them poses a challenge due to their dynamic nature. To this end, we construct context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces. This mapping readily supports processing of a target language, improving transfer by context-aware embeddings. Our experimental results demonstrate the effectiveness of this approach for zero-shot and few-shot learning of dependency parsing. Specifically, our method consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.
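
One simple way to obtain such context-independent variants is to average all contextual vectors of each word type into an "anchor", align the anchors of the two languages (e.g. with a Procrustes mapping), and apply the resulting rotation to the token-level vectors. The sketch below only shows the anchor-building step and is an assumption about the general recipe, not the paper's exact procedure.

```python
import numpy as np
from collections import defaultdict

def type_level_anchors(tokenised_sents, contextual_vecs):
    """Average the contextual embeddings of every occurrence of a word type.

    tokenised_sents: list of token lists; contextual_vecs: matching lists of
    per-token vectors (e.g. from a pretrained contextual encoder)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, vecs in zip(tokenised_sents, contextual_vecs):
        for tok, vec in zip(tokens, vecs):
            sums[tok] = sums[tok] + np.asarray(vec)
            counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}
```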

A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings

Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution. Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems. Our implementation is released as an open source project at https://github.com/artetxem/vecmap.

A Comparison of Word Embeddings for English and Cross-Lingual Chinese Word Sense Disambiguation

As Navigli (2009) noted, supervised approaches have performed best in WSD, so we focus on integrating word embeddings into supervised approaches; specifically, we explore the use of word embeddings within the IMS framework. We focus our work on Continuous Bag of Words (CBOW) from Word2Vec, Global Vectors for Word Representation (GloVe) and Collobert & Weston's embeddings (C&W). The CBOW embeddings were trained over Wikipedia, while the publicly available vectors from GloVe and C&W were used. Word2Vec provides two architectures for learning word embeddings, Skip-gram and CBOW. In contrast to Iacobacci (2016), which focused on Skip-gram, we focused our work on CBOW. In our first set of evaluations, we used tasks from Senseval-2 (hereafter SE-2), Senseval-3 (hereafter SE-3) and SemEval-2007 (hereafter SE-2007) to evaluate the performance of our classifiers on monolingual WSD. We do this to first validate that our approach is a sound way of performing WSD, showing improved or identical scores to state-of-the-art systems in most tasks.

Multilingual word embeddings and their utility in cross-lingual learning

approach, where the training language parser is employed for parsing an initial set of evaluation language sentences as a form of distant supervision (similar in spirit to Fang and Cohn (2017)). However, though McDonald et al. (2011) showed that de-lexicalization is a viable approach for cross-lingual dependency parsing, it is nonetheless apparent that the success of monolingual parsers is largely owed to their advantage in accounting for lexical features. Recent advancements in multilingual embedding alignment have inspired a trend towards lexicalization in parsing, as the distributional information carried by word embeddings for different languages can be encoded into vectors that reside in a single space. One of the first embedding-based lexicalized parsing approaches was carried out by Guo et al. (2015), whose method includes a transition-based neural dependency parser that is trained on de-lexicalized English features. These include word, POS-tag, and dependency relation features that are projected to an embedding layer which the network estimates throughout training. In addition to this, they include lexical features in the form of monolingual embeddings projected to a multilingual space via an extension of Faruqui and Dyer (2014)'s CCA alignment method. In their experiments, they find that lexicalizing the parser via multilingual embeddings improves the de-lexicalized parser by an average error reduction of 10.9% when evaluating on an unseen language.
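
As a rough illustration of CCA-based alignment in the spirit of Faruqui and Dyer (2014), the sketch below fits CCA on the embeddings of seed translation pairs and projects both full vocabularies into the shared space; the matrix names and the choice of scikit-learn are assumptions, not the cited implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_project(X_seed, Y_seed, X_all, Y_all, dim=100):
    """X_seed, Y_seed: one row per seed translation pair; X_all, Y_all: full vocabularies."""
    cca = CCA(n_components=dim, scale=False)
    cca.fit(X_seed, Y_seed)
    X_proj = cca.transform(X_all)                               # source words in the shared space
    Y_proj = (Y_all - Y_seed.mean(axis=0)) @ cca.y_rotations_   # target words, projected the same way
    return X_proj, Y_proj
```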

Trans-gram, Fast Cross-lingual Word Embeddings

We introduce Trans-gram, a simple and computationally-efficient method to simultaneously learn and align word embeddings for a variety of languages, using only monolingual data and a smaller set of sentence-aligned data. We use our new method to compute aligned word embeddings for twenty-one languages using English as a pivot language. We show that some linguistic features are aligned across languages for which we do not have aligned data, even though those properties do not exist in the pivot language. We also achieve state-of-the-art results on standard cross-lingual text classification and word translation tasks.

A Survey of Cross-lingual Word Embedding Models

A large body of work on multilingual probabilistic topic modeling (Vulić, De Smet, Tang, & Moens, 2015; Boyd-Graber, Hu, & Mimno, 2017) also extracts shared cross-lingual word spaces, now by means of conditional latent topic probability distributions: two words with similar distributions over the induced latent variables/topics are considered semantically similar. The learning process is again steered by the data requirements. The early days witnessed the use of pseudo-bilingual corpora constructed by merging aligned document pairs, and then applying a monolingual representation model such as LSA (Landauer & Dumais, 1997) or LDA (Blei, Ng, & Jordan, 2003) on top of the merged data (Littman, Dumais, & Landauer, 1998; De Smet, Tang, & Moens, 2011). This approach is very similar to the pseudo-cross-lingual approaches discussed in Section 6 and Section 8. More recent topic models learn on the basis of parallel word-level information, enforcing word pairs from seed bilingual lexicons (again!) to obtain similar topic distributions (Boyd-Graber & Blei, 2009; Zhang, Mei, & Zhai, 2010; Boyd-Graber & Resnik, 2010; Jagarlamudi & Daumé III, 2010). In consequence, this also influences topic distributions of related words not occurring in the dictionary. Another group of models utilizes alignments at the document level (Mimno, Wallach, Naradowsky, Smith, & McCallum, 2009; Platt, Toutanova, & Yih, 2010; Vulić, De Smet, & Moens, 2011; Fukumasu, Eguchi, & Xing, 2012; Heyman, Vulić, & Moens, 2016) to induce shared topical spaces. The very same level of supervision (i.e., document alignments) is used by several cross-lingual word embedding models, surveyed in Section 8. Another embedding model based on the document-aligned Wikipedia structure (Søgaard, Agić, Alonso, Plank, Bohnet, & Johannsen, 2015) bears resemblance with the cross-lingual Explicit Semantic Analysis model (Gabrilovich & Markovitch, 2006; Hassan & Mihalcea, 2009; Sorg & Cimiano, 2012).
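
To make the pseudo-bilingual idea concrete, here is a minimal sketch that merges each aligned document pair into one mixed-language document and trains an ordinary monolingual topic model on the result; `aligned_pairs` and the use of Gensim's LDA are illustrative assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def pseudo_bilingual_corpus(aligned_pairs):
    """Merge each (source_tokens, target_tokens) pair into one pseudo-document."""
    return [src + tgt for src, tgt in aligned_pairs]

docs = pseudo_bilingual_corpus(aligned_pairs)        # aligned_pairs: list of token-list pairs
vocab = Dictionary(docs)
bow = [vocab.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=50, id2word=vocab)    # the induced topics span both languages
```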

Cross-Lingual Word Embeddings for Low-Resource Language Modeling

With the parameters tuned on the English validation set as above, we evaluated the LSTM language model when the embedding layer is initialized with various monolingual and cross-lingual word embeddings. Figure 3 compares the performance of a number of language models on the test set. In every case where pre-trained embeddings were used, the embedding layer was held fixed during training. However, we observed similar results when allowing them to deviate from their initial state. For the CLWEs, the same language set was used as in Section 3. The curves for the source languages (Dutch, Greek, Finnish, and Japanese) are remarkably similar, as were those for the languages omitted from the figure (German, Russian, Serbian, Italian, and Spanish). This suggests that the English target embeddings are gleaning similar information from each of the languages, information likely to be more semantic than syntactic, given the syntactic differences between the languages.
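
A minimal PyTorch sketch of the setup described here: the embedding layer is initialised from a pre-trained (cross-lingual) matrix and frozen, while the LSTM and output layer are trained. Dimensions and names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class LstmLanguageModel(nn.Module):
    def __init__(self, pretrained, hidden=256):
        super().__init__()
        weights = torch.as_tensor(pretrained, dtype=torch.float)
        # freeze=True keeps the pre-trained embeddings fixed during training
        self.emb = nn.Embedding.from_pretrained(weights, freeze=True)
        self.lstm = nn.LSTM(weights.shape[1], hidden, batch_first=True)
        self.out = nn.Linear(hidden, weights.shape[0])   # logits over the vocabulary

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        hidden_states, _ = self.lstm(self.emb(token_ids))
        return self.out(hidden_states)
```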

Cross-Lingual Word Embeddings for Morphologically Rich Languages

Cross-lingual word embedding models learn a shared vector space for two or more languages so that words with similar meaning are represented by similar vectors regardless of their language. Although the existing models achieve high performance on pairs of morphologically simple languages, they perform very poorly on morphologically rich languages such as Turkish and Finnish. In this paper, we propose a morpheme-based model in order to increase the performance of cross-lingual word embeddings on morphologically rich languages. Our model includes a simple extension which enables us to exploit morphemes for cross-lingual mapping. We applied our model for the Turkish-Finnish language pair on the bilingual word translation task. Results show that our model outperforms the baseline models by 2% in the nearest neighbour ranking.

A common semantic space for monolingual and cross-lingual meta-embeddings

This Master's Thesis is the continuation of the work done in my final degree project "Estudio de Word Embeddings y métodos de generación de Meta Embeddings" (Study of Word Embeddings and Methods of Generating Meta-Embeddings) (García-Ferrero, 2018), where we studied the performance of different pre-trained word embeddings, normalization methods and meta-embedding generation methods on the word similarity task. The main approach of that work was the development of a meta-embedding generation method using linear transformations (Artetxe et al., 2018b) and averaging. The main objective of this work is to extend the proposed method to the multilingual domain with two intentions. The first is to improve the quality of the generated meta-embeddings by ensembling representations in different languages. The second is to use the method as a transfer learning mechanism, where pre-trained embeddings trained in a resource-rich language can improve the quality of the pre-trained embeddings from a low-resource language. In the original research, we evaluated our embeddings on the word similarity task. Now, we extend the results by evaluating the generated monolingual and cross-lingual meta-embeddings on more challenging tasks: Semantic Textual Similarity (STS), Part-of-Speech (POS) tagging and Named Entity Recognition (NER).
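
A minimal sketch of the averaging step for such meta-embeddings, assuming the individual embedding sets have already been mapped into a common space (e.g. with the linear transformations of Artetxe et al. (2018b)); the representation of each space as a dictionary of vectors is an assumption.

```python
import numpy as np

def meta_embedding(word, aligned_spaces):
    """Average the length-normalised vectors of `word` over every aligned space
    that contains it, yielding its meta-embedding in the common space."""
    vecs = [space[word] / np.linalg.norm(space[word])
            for space in aligned_spaces if word in space]
    if not vecs:
        raise KeyError(f"{word!r} is out of vocabulary in every source space")
    return np.mean(vecs, axis=0)
```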