Exploiting Cross-Lingual Representations For Natural Language Processing

(1)

University of Pennsylvania

ScholarlyCommons

Publicly Accessible Penn Dissertations

2019

Exploiting Cross-Lingual Representations For

Natural Language Processing

Shyam Upadhyay

University of Pennsylvania, [email protected]

Follow this and additional works at:

https://repository.upenn.edu/edissertations

Part of the

Computer Sciences Commons

This paper is posted at ScholarlyCommons.https://repository.upenn.edu/edissertations/3265

Recommended Citation

Upadhyay, Shyam, "Exploiting Cross-Lingual Representations For Natural Language Processing" (2019).Publicly Accessible Penn Dissertations. 3265.

(2)

Exploiting Cross-Lingual Representations For Natural Language

Processing

Abstract

Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is di cult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive to annotate resources which are not available for most languages.

In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches of using them in a diverse set of

downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combining supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.

Degree Type Dissertation

Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Computer and Information Science

First Advisor Dan Roth

Keywords

cross-lingual, low supervision, multilingual, natural language processing, representation learning

(3)

EXPLOITING CROSS-LINGUAL REPRESENTATIONS FOR NATURAL LANGUAGE PROCESSING

Shyam Upadhyay

A DISSERTATION

in

Computer and Information Science

Presented to the Faculties of the University of Pennsylvania

in

Partial Fulfillment of the Requirements for the

Degree of Doctor of Philosophy

2019

Supervisor of Dissertation

Dan Roth, Professor of Computer and Information Science

Graduate Group Chairperson

Rajeev Alur, Professor of Computer and Information Science

Dissertation Committee

Adam Kalai, Principal Researcher, Microsoft Research

Chris Callison-Burch, Professor of Computer and Information Science

Lyle Ungar, Professor of Computer and Information Science

(4)

EXPLOITING CROSS-LINGUAL REPRESENTATIONS

FOR NATURAL LANGUAGE PROCESSING

2019

Shyam Upadhyay

This work is licensed under the

Creative Commons Attribution

NonCommercial-ShareAlike 3.0

License

To view a copy of this license, visit

(5)

(6)

ACKNOWLEDGEMENT

Nothing of me is original. I am the combined

effort of everybody I’ve ever known.

Chuck Palahniuk, Invisible Monsters

Very few Ph.D. students get the opportunity of working with a great advisor like Dan.

Without his continuous support, patience and his open style of advising that grants research

freedom, most of the work in this thesis would not have seen the light of day. Besides being a

great advisor, Dan is one of the kindest and most considerate person I know. I am immensely

grateful for all that he has done for me. I am also indebted to my thesis committee — Adam,

Mitch, Lyle, and Chris. Not only have they provided valuable feedback on the thesis at

various stages, but also have been a source of sage advice.

During my Ph.D., I also got the valuable opportunity to work with some of the best

re-searchers in the field through internships at Microsoft Research and Google. Working with

Ming-Wei and Scott was instrumental in shaping my thinking and temperament for

re-search. The wisdom I got from Ming-Wei will be invaluable for the rest of my research life.

The next summer at MSR Cambridge was equally enjoyable, where Adam, Matt, Kai-Wei,

and James gave me the freedom to define and work on a research problem, and helping

me find a way towards a solution. Lastly, the summer spent working with Gokhan, Dilek

and Larry taught me how to approach problems outside my comfort zone, and keeping in

mind practical considerations when thinking about new ideas. I am thankful for all these

experiences, which have shaped my outlook and approach towards research.

I would like to thank former and current members of CogComp — especially Mark Sammons,

Kai-Wei Chang, Rajhans Samdani, and Gourab Kundu, who helped me fit in the group in

my early days. Working in CogComp also brought me in contact with Christos, Parisa,

Michael, Snigdha, and Wenpeng, who took the trouble of mentoring me when they were

(7)

giving me research advice at different stages of my Ph.D., and constantly reminding me

“not to work too hard over the weekends”. I am also indebted to Yangqiu Song, who really

did all the hard work that later became Chapter 4. Special thanks to Snigdha, who was

influential in helping me frame the early draft of this thesis.

During my stay at UIUC and UPenn, I developed close friends in brilliant fellow students

like Subhro, Haoruo, Nitish, Daniel, John, Chen-Tse, Stephen, Dan (Deutsch), Qiang, Anne,

Jordan, Reno, João, Daphne, and others. When I reflect how much I have learned from all

of you over the years, it amazes me. I am sure you will have prolific and satisfying careers

ahead of you. I will dearly miss the discussions over lunches and happy hours we had. I

will especially miss the evening meals I had with Subhro, Nitish, Snigdha, and Shashank.

The not-so-good Indian food in the US will taste even worse without your company.

I would be remiss in not thanking my remote collaborators — Manaal, Chris (Dyer),

Yo-garshi, and Marine. Working with all of you was a unique experience and I have acquired

countless skills and research wisdom from our interactions. Special thanks to Manaal, who

I often trouble for research and career advice from time to time.

It would be unfair not to thank the folks behind the scenes who ensure things run smoothly

in CogComp at UIUC and UPenn. Thanks to Dawn Cheek, Eric Horn and Jennifer Sheffield,

who often went out of their way to help students like me.

Lastly, I would like to thank Maa, Papa and everyone in my family whose efforts and

sacrifices got me here. Thanks for your unconditional love and support when things were not

going well, and uplifting my spirits when I was facing the despair that naturally accompanies

(8)

ABSTRACT

EXPLOITING CROSS-LINGUAL REPRESENTATIONS

FOR NATURAL LANGUAGE PROCESSING

Shyam Upadhyay

Dan Roth

Traditional approaches to supervised learning require a generous amount of labeled data for

good generalization. While such annotation-heavy approaches have proven useful for some

Natural Language Processing (NLP) tasks in high-resource languages (like English), they are

unlikely to scale to languages where collecting labeled data is difficult and time-consuming.

Translating supervision available in English is also not a viable solution, because developing

a good machine translation system requires expensive to annotate resources which are not

available for most languages.

In this thesis, I argue that cross-lingual representations are an effective means of extending

NLP tools to languages beyond English without resorting to generous amounts of annotated

data or expensive machine translation. These representations can be learned in an

inexpen-sive manner, often from signals completely unrelated to the task of interest. I begin with a

review of different ways of inducing such representations using a variety of cross-lingual

sig-nals and study algorithmic approaches of using them in a diverse set of downstream tasks.

Examples of such tasks covered in this thesis include learning representations to transfer a

trained model across languages for document classification, assist in monolingual lexical

se-mantics like word sense induction, identify asymmetric lexical relationships like hypernymy

between words indifferent languages, or combining supervision across languages through a

shared feature space for cross-lingual entity linking. In all these applications, the

represen-tations make information expressed in other languages available in English, while requiring

(9)

Contents

ACKNOWLEDGEMENT . . . iv

ABSTRACT . . . vi

LIST OF TABLES . . . xviii

LIST OF ILLUSTRATIONS . . . xxiii

CHAPTER 1 : Introduction . . . 1

1.1 Overview . . . 1

1.2 Thesis Statement . . . 4

1.3 Outline of This Thesis . . . 4

CHAPTER 2 : History of Word Representations . . . 7

2.1 Introduction . . . 7

2.2 One-Hot Word Representations . . . 7

2.3 Distributional Semantics . . . 8

2.4 Cluster-based Word Representations . . . 11

2.5 Vector-based Word Representations . . . 13

2.6 Sparse Word Representations using Count Models . . . 15

2.7 Dense Word Representations using Exact Matrix Factorization . . . 17

2.8 Dense Word Representations using Neural Language Modeling . . . 19

2.9 Dense Word Representations using Approximate Matrix Factorization . . . 20

2.10 Limitations of Word Representations . . . 23

CHAPTER 3 : Cross-lingual Word Representations . . . 26

(10)

3.3 Cluster-based Cross-lingual Representations . . . 28

3.4 Vector-based Cross-lingual Representations . . . 29

3.4.1 Sentence and Word-level Alignment . . . 30

3.4.2 Sentence-level Alignment . . . 31

3.4.3 Word-level Alignment . . . 32

3.4.4 Document-level Alignment . . . 33

3.5 Experiments . . . 34

3.5.1 Monolingual Lexical Similarity . . . 35

3.5.2 Cross-lingual Dictionary Induction . . . 37

3.5.3 Cross-lingual Document Classification . . . 38

3.5.4 Cross-lingual Dependency Parsing . . . 40

3.6 Summary . . . 42

CHAPTER 4 : Deriving Cross-lingual Representations from Wikipedia . . . 43

4.2 Background . . . 43

4.2.1 Explicit Semantic Representations (ESA) . . . 43

4.2.2 Dataless Document Classification using ESA . . . 46

4.3 Inter-lingual Structure of Wikipedia . . . 47

4.4 Cross-lingual ESA (CLESA) . . . 48

4.5 Experimental Setup and Experiments . . . 50

4.5.1 CLESA v/s Supervised Classification . . . 50

4.5.2 CLESA v/s Monolingual ESA . . . 51

4.6 Summary . . . 52

CHAPTER 5 : Multi-sense Representation Learning with Cross-lingual Signals . . 54

5.2 Related Work — Representing Word Senses . . . 55

(11)

5.2.2 Data-driven Sense Representations . . . 57

5.2.3 Motivation for Our Work . . . 58

5.3 Model Description . . . 59

5.4 Learning and Disambiguation . . . 62

5.5 Multilingual Extension . . . 64

5.6 Experimental Setup . . . 65

5.7.1 Word Sense Induction (WSI) Results . . . 67

5.7.2 Effect of Language Family Distance . . . 68

5.7.3 Effect of Window Size . . . 69

5.7.4 Qualitative Analysis of Induces Senses . . . 70

5.8 Summary . . . 71

CHAPTER 6 : Identifying Hypernymy across Languages . . . 73

6.2 History of the Hypernymy Detection Task . . . 73

6.2.1 Hypernymy Detection in English . . . 74

6.2.2 Cross-lingual Hypernymy Detection . . . 75

6.3 The BiSparse-DepApproach . . . 76

6.4 Crowd-Sourcing Annotations . . . 81

6.6.1 Dependency v/s Window Contexts . . . 87

6.6.2 Ablating Directionality in Context . . . 89

6.6.3 Qualitative Analysis . . . 90

6.7 Evaluating Robustness ofBiSparse-Dep . . . 91

6.7.1 No Treebank . . . 91

6.7.2 Small Monolingual Corpus . . . 93

(12)

6.7.4 Choice of Hypernymy Detection Measure . . . 94

6.8 Summary . . . 94

CHAPTER 7 : Multilingual Supervision for Cross-lingual Entity Linking . . . 96

7.2 History of the Entity Linking Task . . . 97

7.2.1 Monolingual Entity Linking . . . 98

7.2.2 Cross-lingual Entity Linking . . . 101

7.2.3 Our Work . . . 103

7.3 Cross-lingual EL with Xelms . . . 105

7.3.1 Mention Context Representation . . . 105

7.3.2 Including Type Information . . . 108

7.4 Training and Inference . . . 109

7.4.1 Candidate Generation . . . 109

7.4.2 Inference . . . 110

7.4.3 Training Objective . . . 111

7.6.1 Monolingual and Joint Models . . . 115

7.6.2 Multilingual Training . . . 116

7.6.3 Adding Fine-grained Type Information . . . 117

7.7 Experiments with Limited Resources . . . 117

7.7.1 Zero-shot Setting . . . 118

7.7.2 Low-resource Setting . . . 120

7.8 Summary . . . 121

CHAPTER 8 : Bootstrapping Transliteration with Constrained Discovery . . . 123

(13)

8.2.1 Transliteration Generation . . . 125

8.2.2 Transliteration Discovery . . . 126

8.2.3 Our Work . . . 127

8.3 Transliteration with Hard Monotonic Attention . . . 128

8.4 Low-Resource Bootstrapping . . . 130

8.4.1 The Bootstrapping Algorithm . . . 131

8.4.2 Discovery Constraints . . . 131

8.6.1 Full Supervision Setting . . . 134

8.6.2 Low-Resource Setting . . . 136

8.6.3 Error Analysis . . . 137

8.7 Challenges Inherent to Transliteration . . . 137

8.7.1 Source and Target-Specific Issues . . . 138

8.7.2 Disparity between Native and Foreign . . . 139

8.8 Case Studies . . . 139

8.8.1 Manual Annotation . . . 140

8.8.2 Candidate Generation . . . 142

8.9 Summary . . . 144

CHAPTER 9 : Conclusion . . . 146

9.1 Summary of Contributions . . . 146

9.2 Future Directions . . . 150

APPENDIX . . . 152

A.1 Statistical Measures used in the Thesis . . . 152

A.2 BiSparse — Bilingual Sparse Coding . . . 155

A.3 Qualitative Analysis for Multi-sense Embeddings . . . 155

(14)

List of Tables

TABLE 1 : The distributional hypothesis in action: We (humans) can guess the

meaning of words likewampimuk by the context in which it appears. 8

TABLE 2 : Example of a word-context co-occurrence matrix extracted from a

corpus. Each dimension of a row corresponds to co-occurrence with

a certain context word. For instance, the word apple co-occurs with

fruit (4th _{dimension) 20 times in the corpus. . . .} ₁₄

TABLE 3 : Word similarity score measured in Spearman’s correlation ratio for

English on SimLex-999. The best score for each language pair is

shown in bold. Scores that are significantly better (per Steiger’s

Method with p <0.1) than the next lower score are underlined. For

example, for English-Chinese, BiCVM is significantly better than

BiSkip, which in turn is significantly better than BiVCD. . . 36

TABLE 4 : Intrinsic evaluation of English word vectors measured in terms of

Qvec score across models. Best scores for each language pair is

shown in bold. . . 37

TABLE 5 : Cross-lingual dictionary induction results (top-10 accuracy). The

same trend was also observed across models when computing MRR

(15)

TABLE 6 : Cross-lingual document classification accuracy when trained on

lan-guage l1, and test on language l2. Best scores for each language is

shown inbold. Majority baselines for English_→l2andl1→English

are 49.7% and 46.7% respectively, for all languages. Scores

signif-icantly better (per McNemar’s Test, p < 0.05) than the next best

score are underlined. For example, for Swedish → English, BiVCD

is significantly better than BiSkip, which is significantly better than

BiCVM. . . 39

TABLE 7 : Labeled attachment score (LAS) for dependency parsing when trained

and tested on language l (direct-parsing experiment). Mono refers

to parser trained with mono-lingually induced embeddings. Scores

inboldare better than the Mono scores for each language, showing

improvement from cross-lingual training. . . 40

TABLE 8 : Labeled attachment score (LAS) for cross-lingual dependency

pars-ing when trained on languagel1, and evaluated on languagel2

(model-transfer experiment). The best score for each language is shown in

bold. . . 41

TABLE 9 : Concept-term matrix constructed using the English Wikipedia.

Ex-plicit Semantic Analysis (ESA) representations are derived from the

matrix above by reweighing each entry using TF-IDF and

normal-izing the columns. Each column describes a word in English using a

“bag-of-concepts” in Wikipedia. . . 44

TABLE 10 : Statistics showing number of articles in Wikipedias for 9 languages,

and the size of the intersection (as identified by inter-language links)

with the English Wikipedia. The last column shows the relative

size of the intersection (in % age). For instance, 52% of German

(16)

TABLE 11 : Concept-term matrix constructed using the English, Spanish and

Hindi Wikipedias (compare to Table 9). From left to right, the words

in Spanish and Hindi translate to tower, big, politician, olympic,

democratic, hawaii respectively. The term counts are computed in

the respective article in the Spanish or Hindi Wikipedia. CLESA

representations are derived from this matrix by weighing each entry

using TF-IDF and normalizing the columns. Each column describes

a word using a “bag-of-concepts” in Wikipedia. . . 49

TABLE 12 : Comparing monolingual ESA (Mono) and cross-lingual ESA (CLESA)

for classification on translated documents from 20-Newsgroup. The

table shows the number of languages which achieve accuracies in a

certain range with a given ESA representation (Mono or CLESA).

For instance, CLESA achieves accuracy@1 in the range of 70% to

80% (i.e.,0.8> p≥0.7) for 7 languages. It can be seen that CLESA

achieves high accuracies (e.g., accuracy@1 lies in 1 ≥ p ≥ 0.9) for

20 languages, whereas the same is true for monolingual ESA for 2

languages only. . . 52

TABLE 13 : Corpus statistics (in millions) of the parallel corpora used to train

the multi-sense representations. Horizontal lines demarcate corpora

from the same domain. . . 66

TABLE 14 : Results on word sense induction (left four columns) in ARI and

con-textual word similarity (last column) in percent correlation.

Lan-guage pairs are separated by horizontal lines. Best results inbold. 67

TABLE 15 : Effect (in ARI) of language family distance on WSI task. Best results

for each column is shown in bold. The improvement from Mono

to Full is shown as (3) - (1). Note that this is not comparable to

results in Table 14, as a different training corpus is used to control

(17)

TABLE 16 : Senses discovered for some words under multilingual training, and

their nearest neighbors in the vector space. . . 70

TABLE 17 : Crowd-sourced dataset statistics. #pos (#neg) denote the number

of positives (negatives) in the evaluation set, and #crowd-sourced

denote the number of crowd-sourced pairs. Negatives were

deliber-ately under-sampled to have a balanced evaluation set. Some

exam-ples for positive and negative instances in French and Russian are:

(pêcheur (en:fisher), worker) (vêtement (en:clothing), jeans) (замок

(en:castle), structure) (флора(en:flora), sunflower) . . . 83

TABLE 18 : Training data statistics for different languages. While we use parallel

corpora for computing translation dictionaries, our approach can

work with any bilingual dictionary. . . 85

TABLE 19 : Comparing the different approaches from Section 6.5 with our

BiSparse-Dep approach onHyper-Hypo(random baseline= 0.5). Bold

de-notes the best score for each language, and the * on the best score

indicates a statistically significant (p< 0.05) improvement over the

next best score, using McNemar’s test (McNemar, 1947). . . 88

TABLE 20 : Comparing the different approaches from Section 6.5 with our

BiSparse-Dep approach on Hyper-Cohypo (random baseline= 0.5). Bold

denotes the best score for each language, and the * on the best score

indicates a statistically significant (p< 0.05) improvement over the

next best score, using McNemar’s test (McNemar, 1947). . . 88

TABLE 21 : Common errors made by the different models described in Section 6.3. 90

TABLE 22 : Robustness to absence of a treebank: The delexicalized model

is competitive to the best dependency based and the best window

based models on both test sets. For each dataset, * indicates a

statistically significant (p < 0.05) improvement over the next best

(18)

TABLE 23 : Part of a dictionary compiled from Wikipedia hyperlinks showing the

entities that the stringchicagocan refer to, along with the respective

prior probabilities and counts. . . 99

TABLE 24 : Number of train mentions (from Wikipedia) in each language, with

% size relative to English (51.7M mentions). Train mentions from

Wikipedias like Arabic, Turkish and Tamil are <10% the size of

those from the English Wikipedia. . . 112

TABLE 25 : Evaluation datasets used in the cross-lingual entity linking

experi-ments. . . 113

TABLE 26 : Xelms(joint) improves upon Xelms(mono) and the current

State-of-The-Art (SoTA) onTH-TestandMcN-Test, showing the

ben-efit of using additional supervision from English. The best score

is shown bold and ∗ marks statistical significance of best against

SoTA. Refer Section 7.5 for details on SoTA. . . 115

TABLE 27 : Linking accuracy on TAC15-Test. Numbers for Sil et al. (2018) from

personal communication. . . 116

TABLE 28 : Linking accuracy of asingleXelms(multi) model for four languages

— German, Spanish, French and Italian. For comparison,

individ-ually trained Xelms(joint) scores are also shown. The best score

is shown bold and ∗ marks statistical significance of best against

SoTA. Refer Section 7.5 for details on SoTA. . . 116

TABLE 29 : Adding fine-grained type information further improves linking

ac-curacy (compare to Table 26). The best score is shown bold and ∗

marks statistical significance of best against SoTA. Refer Section 7.5

(19)

TABLE 30 : Linking accuracy of the zero-shot (Z-S) approach on different datasets.

Zero-shot (w/ prior) is close to SoTA for datasets like TAC15-Test,

but performance drops in the more realistic setting of zero-shot (w/o

prior) (Section 7.7.1) on all datasets, indicating most of the

perfor-mance can be attributed to the presence of prior probabilities. The

slight drop in McN-Test is due to trivial mentions that only have

a single candidate. . . 119

TABLE 31 : Cumulative number of person name pairs in Wikipedia inter-language

links. For instance, 45 languages (that cover 22 scripts) have at least

1000 name pairs that can be extracted from inter-language links.

While previous approaches for transliteration generation were

appli-cable to only 24 languages (spanning 15 scripts), our approach is

applicable to 56 languages (23 scripts). When counting scripts we

exclude variants (e.g., all Cyrillic scripts and variants count as one). 127

TABLE 32 : Comparing different generation approaches on the NEWS 2015 dataset

using accuracy@1 as the evaluation metric for five languages —

Hindi (hi), Kannada (kn), Bengali (bn), Tamil (ta) and Hebrew

(he) – in full supervision and low-resource settings. “Ours” denotes

the Seq2Seq(HMA) model, with (.) denoting the inference strategy.

The rest of the approaches are described in Section 8.5. Numbers

for RPI-ISI are from Lin et al. (2016). . . 135

TABLE 33 : Accuracy@1 for native and foreign words for four languages

(Sec-tion 8.7.2). Ratio is native performance relative to foreign. . . 140

TABLE 34 : Corpora used for obtaining foreign vocabularyVf for bootstrapping

(20)

TABLE 35 : Accuracy@1 of Seq2Seq(HMA) model supervised using human

an-notated seed set in Punjabi and Armenian (with and without

boot-strapping). Both languages perform well relative to the other

lan-guages investigated so far. Both annotation sub-tasks took roughly

the same time. . . 141

TABLE 36 : Comparing candidate recall@20 for different approaches on Tigrinya

and Macedonian. CV-split refers to consonant-vowel splitting. Using

our transliteration generation model with bootstrapping yields the

(21)

List of Figures

FIGURE 1 : Overview of the chapters in this thesis. The double-headed arrows

indicate tasks involving sharing of information between languages,

and single-headed arrows indicate tasks involving transfer from one

language to another. . . 5

FIGURE 2 : Vector-based word representations in a 2-dimensional space. . . . 14

FIGURE 3 : Cross-lingual word representations as a vector space

approxima-tion of discrete translaapproxima-tion dicapproxima-tionaries. By encoding items present

in the dictionary as points in a continuous vector space, one can

recover some of the missing entries. Note that while the figure uses

vector-based representations (which partition the space implicitly)

a similar reasoning can be applied to cluster-based representations

(which partition the space explicitly). . . 27

FIGURE 4 : Different forms of cross-lingual alignments one can use as

supervi-sion for learning cross-lingual representations. From left to right,

the cost of the supervision required varies from expensive (sentence

and word-level alignments) to cheap (document-level alignment). . 29

FIGURE 5 : Inter-language links in English Wikipedia and Hindi Wikipedia link

the corresponding articles that describe the entityAlbert_Einstein

in both Wikipedias. . . 47

FIGURE 6 : Comparing CLESA based document classification with supervised

classification on the TED dataset (averaged macro-F1 scores over

15 labels) for 7 languages — English (en), Arabic (ar), German

(22)

FIGURE 7 : Senses listed in WordNet for the word wicked when used as an

adjective. None of the senses listed reflect the sense of wicked in

the sentence, “Tony Hawk did a wicked ollie at the convention”,

wherewicked is used as an adjective denoting awesome-ness. . . . 56

FIGURE 8 : Benefit of Multilingual Information (beyond bilingual): Two

different senses of the wordinterestand their translations to French

and Chinese (word translation shown in [bold]). While the

sur-face form of both senses of interest are same in French, they are

different in Chinese, illustrating the benefit of having more than

two languages. . . 59

FIGURE 9 : The aligned pair (interest,intérêt) is used to predict monolingual

and cross-lingual context in both languages (see factors in

Equa-tion (5.7)). Each sense vector (here 2nd is shown) for interest,

participates in the update. Only the polysemy in English is

mod-elled. . . 61

FIGURE 10 : Tuning window size for English-Chinese and English-French models. 69

FIGURE 11 : The BiSparse-Dep Approach, that learns sparse bilingual

em-beddings using dependency-based contexts. The resulting sparse

embeddings, together with an unsupervised hypernymy detection

measure, can detect hypernyms across languages (e.g.,pomme is a

fruit). . . . 77

FIGURE 12 : Dependency tree for “The tired traveler roamed the sandy desert,

seeking food”. . . . 77

FIGURE 13 : Robustness to small corpus: For most languages,

BiSparse-Depoutperforms the corresponding best window based model on

(23)

FIGURE 14 : Robustness to low quality dictionary: For most languages,

BiSparse-Depoutperforms the corresponding best window based

model onHyper-Hypo, with increasingly lower quality dictionaries. 93

FIGURE 15 : Tamil and English mention contexts containing [mentions] of the

entity Liverpool_F.C. from the respective Wikipedias. Tamil

Wikipedia only has 9 mentions referring toLiverpool_F.C., whereas

English Wikipedia has 5303 such mentions. Clearly, there is a

need to augment the limited contextual evidence in low-resource

languages (like Tamil) with evidence from high-resource languages

(like English). The Tamil sentence translates to “Suarez plays for

[Liverpool] and Uruguay.” . . . 101

FIGURE 16 : (a) Xelms uses grounded mentions from two or more languages

(English and Tamil shown) as supervision. The contextg, entity

e and type t vectors interact through Entity-Context loss (

EC-Loss), Type-Context loss (TC-Loss) and Type-Entity loss (

TE-Loss). The Tamil sentence is the same as in Figure 15, and other

mentions in it translate to [Suarez] and [Uruguay]. (b) The

Mention Context Encoder (Section 7.3.1) encodes the local

con-text (neighboring words) and the document concon-text (surfaces of

other mentions in the document) of the mention into g. Internal

view of local context encoder is in Figure 17. . . 104

FIGURE 17 : Local Context Encoder, for the right context. Figure 16b shows

how it fits inside Mention Context Encoder. . . 107

FIGURE 18 : Candidate generation using inter-language links in Wikipedia. Prior

probabilities are computed from surface to index statistics

(24)

FIGURE 19 : Linking accuracy v/s the number of train mentions in the target

language L (= Turkish (tr), Chinese (zh) and Spanish (es)). We

compare bothXelms(mono) and Xelms(joint) to the best results

using all available supervision, denoted by L-best. To discount the

effect of the prior, all results above are without it. For number of

train mentions = 0,Xelms(joint) is equivalent to zero-shot without

prior. Best viewed in color. . . 120

FIGURE 20 : Limitation of the candidate generation approach described in

Fig-ure 18. For the Russian mention (that transliterates to Berlin),

using inter-language links misses two plausible candidate entities

in the English Wikipedia. . . 123

FIGURE 21 : Generation treats transliteration as a sequence transduction task,

while discovery aims to select the correct transliteration from a

given list of names. . . 125

FIGURE 22 : Transliteration using Seq2Seq transduction with Hard Monotonic

Attention, or Seq2Seq(HMA). The figure shows how decoding

pro-ceeds for transliterating थनोस to thanos. During decoding, the

model attends to a source character (e.g.,थ shown in blue) and

outputs target characters (t,h,a) until astep action is generated,

which moves the attention position forward by one character (to

न), and so on. . . 129

FIGURE 23 : Plot showing accuracy@1 after each bootstrapping iteration for

Hindi, Kannada, Bengali, Tamil and Hebrew, starting with only

500 training pairs as supervision. For comparison, the accuracy@1

of a model trained with all available supervision is also shown

(25)

FIGURE 24 : Summary of contributions made in the thesis. Different chapters

demonstrated how cross-lingual representations can help achieve

model transfer and sharing (Chapter 3 and 4), aid in lexical

se-mantics tasks either monolingually (Chapter 5) or cross-lingually

(Chapter 6), and facilitate cross-lingual information extraction

(Chap-ter 7 and 8). . . 147

FIGURE 25 : PCA plots for the vectors for {apple, bank, interest, itunes, potato,

west, monetary, desire}with multiple sense vectors for apple

(ap-ple_1and apple_2),interest(interest_1and interest_2) andbank

(bank_1 and bank_2) obtained using monolingual (25a), bilingual

(26)

CHAPTER 1 : Introduction

1.1. Overview

Most work in Natural Language Processing (NLP) is driven by the availability of supervision

in the form of labeled data. Such labeled data is relatively easily available for simple

NLP tasks in high-resource languages (like document classification for English), but this

is not true for other languages. In fact, for more sophisticated NLP tasks like semantic

parsing, fully labeled data is hard to come by even in English, let alone other languages.

Consequently, even though there are over 7000 languages from over 140 different language

families in use across the world (Lewis, 2009), the state of NLP is skewed towards

high-resource languages like English.

Why should one care about NLP in other languages? Developing NLP technology that can

operate multilingually has several benefits. A large population of the world is multilingual,

and produces information at a unprecedented scale across several languages thanks to the

web. Indeed, over 40% of the content on the web is not in English, and the rate at which

such content is generated has been growing dramatically.1 Although most of this content is

publicly available (online news, social media platforms etc.), yet consuming it is complicated

by the multitude of languages used. Multilingual NLP makes this information available to

a broader audience than what the language in which it was expressed can reach, helping

overcome the metaphorical language-imposed barrier to knowledge. For instance, knowing

what entities appear in a certain tweet written in Hindi can improve the understanding of an

English reader. This can help faster dissemination of critical information, that can prove

crucial in urgent situations (e.g., early warning systems, financial trading). Multilingual

NLP can also help from a scientific standpoint, in understanding how languages evolve and

interact with each other, revealing previously unknown similarities or differences between

languages (Bender, 2009, 2011).

(27)

Unfortunately, traditional supervised learning approaches, which form the backbone of

cur-rent NLP technology, are inhecur-rently ill-equipped to deal with the lack of labeled data, which

poses a significant challenge in scaling to other languages. As a result, advances made in

NLP for high-resource languages take longer to percolate to the rest of the languages. To

bridge this language barrier, efforts have been made in two directions.

Annotation-heavy Approaches. The first direction is motivated by the overwhelming

success of supervised machine learning approaches for English NLP tasks. Given enough

training data, traditional machine learning approaches have shown remarkable

generaliza-tion abilities. One natural step is to apply these approaches to NLP tasks in non-English

lan-guages, by providing supervision in the target language. Often, such supervision is derived

from multilingual resources that are used in machine translation, such as parliamentary

pro-ceedings (Koehn, 2005) or translations of the Bible (Christodouloupoulos and Steedman,

2015; Agić et al., 2015). A popular approach for achieving this is annotation projection,

where annotation for some linguistic structure is projected from a high resource language to

a low resource language, usually by means of some cross-lingual alignment (Yarowsky and

Ngai, 2001; Hwa et al., 2005). Nevertheless, the community has also invested in curating

an-notated resources in non-English languages for various NLP tasks (Xue, 2008; de Melo and

Weikum, 2009; McDonald et al., 2013), so that traditional supervised learning approaches

can be developed for these languages. Indeed, statistical machine learning approaches have

become integral to advancing NLP in English in past three decades, and it is reasonable

to assume that they will succeed for other languages once the appropriate datasets become

available.

Translation-based Approaches. The second direction has advocated extending NLP

to languages other than English by automatically translating text in the target language

to English using machine translation, and then using state-of-the-art NLP models already

developed for English. Like annotation-heavy approaches, this direction also relies on the

(28)

translation from a language to English will be sufficient to bridge the language barrier.

In-deed, translation quality from a non-English language to English has improved considerably

over the past few years with the advent of neural machine translation (Sutskever et al., 2014;

Bahdanau et al., 2015; Luong et al., 2015a; Sennrich et al., 2016, inter alia).

While promising, both of these directions suffer from limitations. Traditional supervised

learning approaches require generous amount of labeled data to achieve good generalization.

While such annotation-heavy approaches have proved successful for high resource languages

such as English, their applicability to a new language is limited due to lack of labeled data.

Collecting high-quality labelled data for any NLP task in an arbitrary language is

chal-lenging due to the human effort involved and the paucity of expert annotators. There are

practical limitations as well — generating high-quality annotations such as treebanks

(Mar-cus et al., 1993; Prasad et al., 2008; McDonald et al., 2013; Socher et al., 2013) takes time

and expertise. This may not be ideal if one has limited time to develop a system in a

language to achieve certain goals (e.g., for disaster management).

Similar problems affect machine translation. Building a translation system requires

re-sources in the target language (namely large parallel corpora with English), which are not

always available and are expensive to annotate at scale (Lopez and Post, 2013). Unless

several million lines of in-domain parallel text are available, statistical machine translation

approaches perform poorly (Kolachina et al., 2012; Irvine, 2014), an issue that is

exacer-bated for the now popular neural machine translation approaches (Koehn and Knowles,

2017). From a practical standpoint too, including a translation system in the pipeline may

prove to be an overkill for simple understanding tasks such document classification. For

instance, knowing thatगु वाकषणmeansgravityand वांटम यां क meansquantum mechanics

is sufficient to realize a Hindi document containing the above words is about physics,

with-out translating the entire document. Another practical issue is that even though we can

apply NLP tools on the translated text, often the output is desired in the target language

(29)

approach incurs the additional cost of projecting the output back to the original language,

which besides introducing latency, might be a hard task in itself. For instance, how to

project back the Part-Of-Speech (POS) tags for tokens in a English phrase that aligns to a

single word in Hindi?

All the above challenges become more acute when working with low-resource languages. To

extend NLP tools to languages beyond English, we need to find approaches that eliminate

or reduce the dependence on high quality annotated data, and that enable one to scale to

multiple languages quickly without investing too much effort.

1.2. Thesis Statement

In this thesis, I argue thatcross-lingual representationsare an effective means of extending

NLP tools to languages beyond English, without resorting to generous amounts of annotated

data or expensive machine translation. These cross-lingual representations encode semantics

and maintain similarity of lexical items across multiple languages. Such

multilingually-enriched semantic representations can be learnt in an inexpensive manner, often from

cross-lingual signals completely unrelated to the task of interest. By using these representations as

features (in lieu of monolingually derived representations) for training a model for a certain

NLP task, we can learn models that operate on input from multiple languages. Using such

language-agnostic feature representations in the model enable us to leverage any existing

annotations in a high-resource language for joint multilingual training, thereby reducing

the amount of supervision required for the task in a new language.

1.3. Outline of This Thesis

The rest of the document is organized as follows.

• Chapter 2 gives a brief history of different approaches of representing words in NLP

applications, and highlights their limitations in capturing cross-lingual semantics.

(30)

limi-High-Resource Languages (e.g., English)

Other Languages What signals can be used to transfer models from English to other languages?

what is the impact of the choice of the signal?

Can we exploit freely available resources like Wikipedia to generate cross-lingual representations?

Can cross-lingual signals help learn better sense representations in English?

How to use cross-lingual representations to understand relationships (other than translation) across languages?

How to use cross-lingual representations to develop a cross-lingual information extraction system?

Chapter 3

Chapter 7 and 8 Chapter 6 Chapter 4

Chapter 5

Figure 1: Overview of the chapters in this thesis. The double-headed arrows indicate tasks involving sharing of information between languages, and single-headed arrows indicate tasks involving transfer from one language to another.

tation, and motivates them as an effective way of translating feature spaces. I then

describe different cross-lingual signals that we can exploit to learn cross-lingual

repre-sentations and evaluate their suitability for achieving semantic and syntactic transfer.

Based on (Upadhyay et al., 2016).

• In Chapter 4, I describe another approach to learn cross-lingual representations using

multilingual encyclopedic resources such as Wikipedia to encode words in different

languages in a shared semantic space, and use it to classify documents without any

supervision. Based on (Song et al., 2016).

• In Chapter 5, I describe our work on learning multi-sense representations in English

using cross-lingual signals, where we show that using a small amount of multilingual

data we can train models for sense induction that compare favorably with a model

(31)

• Chapter 6 shows how, by using appropriate cross-lingual word representations, one

can detect asymmetric relations across languages. The chapter demonstrates that

cross-lingual representations can aid in semantic tasks beyond those requiring coarse

semantics (such as document classification). Based on (Upadhyay* et al., 2018).

The next two chapters examine a downstream information extraction task, namely

cross-lingual entity linking, along two dimensions.

• Chapter 7 describes how we can exploit cross-lingual representations to develop a

cross-lingual entity linking system. In particular, I describe how we can augment the

limited supervision in a low-resource language with supervision from a high-resource

language by using a shared multilingual representation space. Based on (Upadhyay

et al., 2018a).

• Chapter 8 examines the problem of name transliteration from a language to English,

a crucial component for performing entity linking in the cross-lingual setting. I show

that we can improve transliteration from low-resource languages into English using

constraints to drive learning with limited supervision. Based on (Upadhyay et al.,

2018b).

Finally, Chapter 9 summarizes the contributions of this thesis and provides an overview of

(32)

CHAPTER 2 : History of Word Representations

2.1. Introduction

The generalizability of any machine learning model is a function of the data representation

(or features) that it operates on to make predictions. For instance, the generalizability of a

document classifier depends on what features are used to represent a document. For most

NLP applications, features of smaller units of language such as words are used to derive

features for larger linguistic structures such as sentences and documents. For instance, a

common choice of word-level features in the document classification task is “how often did

word w appear in the document”, and the document-level feature is simply the sum of all

the active word-level features in that document. The choice of feature representation for

units such as words is thus a crucial decision when developing a NLP model.

The idea of learning mathematical representations useful for building classifiers is a general

one (Bengio et al., 2013), and the aim of this chapter is to review different approaches

of representing words as mathematical objects in NLP. The techniques and principles

in-troduced in this chapter for training monolingual representations will serve as the starting

point for cross-lingual representation learning approaches described in the rest of the thesis.

I start with a discussion of one-hot word representations and their limitations. I will then

discuss how the principles of distributional semantics can be put into practice to derive

cluster-based or based word representations. Owing to the popularity of

vector-based representations, I will briefly discuss the different frameworks for learning vector

representations and their benefits. I end the chapter by discussing some known limitations

of these word representations, that I will address in the rest of the thesis.

2.2. One-Hot Word Representations

Conventional approaches represent each word w in the vocabulary V by denoting w by

(33)

disease_}, the one-hot representations will be _{1, 2, 3, 4, 5_}. This representation scheme is

better visualized by treating each word as a vector of 0s with a 1 at the index associated

with the word (or aone-hot vector),

vking =

            1 0 0 0 0            

, vqueen =

            0 1 0 0 0            

, vcomputer=

            0 0 1 0 0            

, vvirus =

            0 0 0 1 0            

, vdisease =

            0 0 0 0 1             (2.1)

It is easy to see that this is an extremely crude representation that fails to capture

relation-ships (or lack thereof) between words. For instance, related words like king and queen are

treated the same (twodistinct atomic symbols), and so are unrelated words like computer

and king. Using features built on top of one-hot word representations will also suffer from

the same problem. For instance, the parameters learnt for the feature “the word king is

present in the document” during training cannot be shared for the feature “the wordqueen

is present in the document” that is only seen at test time. As a result, these one-hot word

representations have limited utility when used as the building blocks for a NLP model.

Another drawback is that the representation size increases with the size of the vocabulary,

which is cumbersome when working with large corpora.

2.3. Distributional Semantics

Marco saw a hairy little wampimuk crouching behind the tree.

During the winter the wampimuk hibernates in his burrow.

The girls fed their pet wampimuk berries and acorns.

Table 1: The distributional hypothesis in action: We (humans) can guess the meaning of

words likewampimuk by the context in which it appears.

How can we learn more meaningful representations for words that capture some notion of

relatedness? Intuitively, humans infer the meaning of a word from the context in which

(34)

wampimukin the sentence “Marco saw a hairy little wampimuk crouching behind the tree.”

can be guessed based on the neighboring words.1 _{Such patterns of word usage also allows}

humans to guess that words likesquirrelorrabbitare similar towampimuk, because they can

substitute wampimuk in the above context. These observations form the basis of the field

of distributional semantics, where word meanings are modeled as a function of statistics

computed over a corpus. This intuition can be formalized in two popular distributional

semantics hypotheses.

The Distributional Hypothesis

The distributional hypothesis states that you shall know a word by the company it

keeps (Weaver, 1955; Firth, 1957; Furnas et al., 1983, inter alia). The hypothesis

has been restated differently (and more formally) over the years — words that are

similar in meaning occur in similar contexts (Rubenstein and Goodenough, 1965);

words with similar meanings will occur with similar neighbors if enough text material

is available (Schütze and Pedersen, 1995).

The definition of context and similarity in the distributional hypothesis is task-dependent.

The notion of similarity is often dictated by the representation paradigm — vector-based

word representations use a geometric similarity metric like cosine or the dot product, while

cluster-based word representations simply check for cluster overlap. Most popular

repre-sentation learning approaches define context to be the lexical neighborhood of the word

under consideration, but this choice can also be dictated by the task. For instance, in

Chapter 6, I will describe word representations that usesyntacticcontext of words to learn

representations for detecting asymmetric relations like hypernymy.

Distributed v/s Distributional. An important distinction is to be made between

dis-tributional word representations and distributed representations, A related term used in

representation learning literature (Bengio et al., 2013). Word representations that

ap-peal to the distributional hypothesis are calleddistributional representations. On the other

(35)

hand,distributedrepresentations are those where each word’s representation is described by

multiple contexts, and each context participates in describing multiple words. Distributed

representations are in contrast to local representations, such as the one-hot representation

discussed earlier. Note that a representation can be both distributional and distributed

(e.g., representations learnt using Skip-gram with Negative Sampling (SGNS), described in

Section 2.9).

While the distributional hypothesis is widely popular, it is worth noting that not all word

representation approaches appeal to it. Another related hypothesis that describes the

se-mantics of a word through statistics computed on a corpus is thebag-of-words hypothesis.

The Bag-of-Words Hypothesis

The bag-of-words hypothesis states that similar documents contain similar words,

and similar words appear in similar documents. More formally,documents that have

similar column vectors in a term-document matrix are similar (Salton et al., 1975;

Salton and Buckley, 1988).

The term-document matrix is computed by treating each document as a bag-of-words,

hence the hypothesis is referred to as the bag-of-words hypothesis. The bag-of-words

hy-pothesis examines the document level co-occurrence of words, and is close relative of the

distributional hypothesis.2 _{In Chapter 4, I will describe Explicit Semantic Analysis (ESA)}

representations, that appeal to the bag-of-words hypothesis to represent words as the

‘bag-of-concepts’ that they describe in an encyclopedic resource like Wikipedia.

By invoking the distributional hypothesis and the bag-of-words hypothesis, representation

learning approaches can extract contextual knowledge directly from raw corpora. This

way of extracting semantics automatically is particularly appealing compared to the

time-consuming approach of hand-coding semantics (e.g., building lexical ontologies). Another

benefit of these representations is their re-usability — once extracted from a large and

2_{In fact, by defining the context of a word to be the documents it appears in, one can arrive at the}

(36)

general enough corpora, the same word representation can be used for multiple tasks.

2.4. Cluster-based Word Representations

The problem with the one-hot representations were that they were extremely high

dimen-sional (of the size of the vocabulary), and thus did not permit any sharing. An alternative

to the crude one-hot representation is to first cluster the words in the vocabulary, such

that words that share meaningful linguistic properties gets assigned to the same cluster.

Intuitively, this cluster-based word representation reduces the effective vocabulary size to

the number of word clusters, such that words that are similar (clustered together) have

same one-hot representations. Here, I discuss one popular clustering algorithm for learning

such representations, Brown clustering (Brown et al., 1992).

The input to the Brown clustering algorithm is a corpus of wordsw1, w2,· · ·, wT, whereT

is the size of the corpus. The aim of the algorithm is to assign each word to a cluster, where

C:V →1,2,· · ·, K is a cluster assignment function that maps each word wi in the

vocab-ulary V to its clusterC(wi). The sequence of cluster assignments C(w1),C(w2),· · · ,C(wT)

for wordsw1, w2,· · · , wT is modeled using aclass-based bi-gram language model. Under this

model the generative story of the corpus is the following — transition to the current cluster

C(wi)conditioned on the previous clusterC(wi−1), sample a wordwiconditioned on the

cur-rent cluster _C(wi), and repeat the process. The log-likelihood of the corpusw1, w2,· · ·, wT

(or the clustering quality) under this model is computed as,

log Pr(w1, w2,· · · , wT) = n

∑

i=1

log Pr(wi | C(wi))Pr(C(wi)|C(wi−1)) (2.2)

The output of the clustering algorithm is a binary tree, whose leaves are words, and internal

nodes are clusters of the words in the sub-tree rooted at that node. Each cluster can be

expressed using a binary code, which is the representation shared by all the words that

(37)

mutual information of adjacent clusters,

Quality(C) =∑ c,c′

Pr(c, c′)log Pr(c, c′) Pr(c)Pr(c′) −

∑

w

Pr(w)log Pr(w) (2.3)

=I(C)−H (2.4)

Thus, Brown clustering uses mutual information of adjacent clusters (i.e., at bi-gram level)

as a measure of distributional similarity between the words in those clusters. Starting with

each word in its own distinct cluster, the brown clustering algorithm greedily merges pair

of clusters that cause the smallest decrease in the likelihood of the corpus with respect to

the class-based bi-gram language model. The greedy clustering strategy naturally leads

to a hierarchical clustering upon termination. However, naively implementing this

clus-tering strategy has a run-time of O(|V|5₎_{, which was reduced to} _O₍_|_V_|3₎ _{in Brown et al.}

(1992). Later, Liang (2005) proposed a more efficient approach that reduced the run-time

toO(|V|m2+n) by starting withm initial clusters for each of them most popular words,

wherem is number of initial clusters andnis the corpus length.

Brown clustering has proved successful in several semi-supervised scenarios, improving

per-formance on tasks like dependency parsing (Koo et al., 2008; Tratz and Hovy, 2011),

syntac-tic chunking (Turian et al., 2010), with sequence labeling tasks like named-entity recognition

enjoying large gains (Freitag, 2004; Miller et al., 2004; Ratinov and Roth, 2009).

Other Clustering Representations. Many variants of the Brown clustering algorithm

exists. For instance, Ushioda (1996) extended the Brown clustering algorithm to compute

clusters for words and phrases, while Martin et al. (1998) extended the bi-gram model

to tri-grams. Similarly, Uszkoreit and Brants (2008) used a more expressive class-based

language model that transitions to the current cluster _C(wi) conditioned on the previous

(38)

clustering quality,

log Pr(w1, w2,· · · , wT) = n

∑

i=1

log Pr(wi| C(wi))Pr(C(wi)|wi−1) (2.5)

More recently, Stratos et al. (2014) noticed that the greedy strategy employed in Brown

clus-tering is simply a heuristic and is not guaranteed to recover the (provably) correct clusclus-tering

when given sufficient number of training examples. Furthermore, even with Liang (2005)’s

improved algorithm, the run-time is prohibitively long. Stratos et al. (2014) proposed an

improved cluster based representation learning approach that is guaranteed to recover the

correct clustering and runs around 10 times faster than the Liang (2005) implementation.

Limitations of Cluster-based Representations. A key limitation of using

clustering-based representations is that they are not amenable to end-to-end learning aimed at a

downstream task. This is due to the fact that the cluster assignment function_Cis, in general,

discontinuous, so end-to-end learning methods like back-propagation cannot be used with

them. Furthermore, approaches like Brown clustering scale quadratically in the number

of clusters, and thus are extremely slow to train, preventing more expressive partitioning

of the space. The discreteness of the space induced by the clustering also means that the

granularity of similarity score between pair of words is discretely quantized. For instance, at

a certain level of the Brown clusters, two words either belong to the same cluster (similarity

of 1), or don’t (similarity of 0). One can circumvent this by describing a word by its path

from the root in the binary tree (e.g., apple becomes 101101). However, this approach

partially alleviates the discreteness of the space. We will see that vector space models

allow for a more expressive representation equipped with a continuous similarity measures

(39)

car bike

king queen book

magazine

muffin cake

Figure 2: Vector-based word representations in a 2-dimensional space.

2.5. Vector-based Word Representations

Vector-based word representations use a point in a n-dimensional vector space to describe

each word in a language (see Figure 2). Under this representation paradigm, geometric

proximity in the vector space is used as surrogate for semantic similarity of two words. For

instance, in Figure 2, related words likeking and queen are closer than king and car. The

similarity of two wordsv and w represented in a vector space asvand wrespectively, can

then be computed using some natural geometric measure of vector proximity, like the dot

product dot(v, w) =v⊺w, or the cosine cosine(v, w) = dot_|_v(_||v,w_w_|). This notion of “proximity

as a surrogate for similarity” traces its roots in cognition (Lakoff and Johnson, 1980, 1999)

as discussed in (Sahlgren, 2006).

As these representationsembed a word in a geometric space, they are also called word

em-beddings or word vectors. Over the years, several different ways of learning vector-based

word representations have been developed, that differ in the form of the vector

representa-tion (sparse or dense), and the learning framework used — count-based model and matrix

(40)

dimension_→ 1st ₂nd ₃rd ₄th ₅th ₆th ₇th _{. . .}

word

context

computer is of fruit breed physics bank . . .

apple 10 70 90 20 0 0 0 . . .

cat 0 60 75 5 20 0 0 . . .

interest 25 45 55 5 5 30 30 . . .

..

. ... ... ... ... ... ... ... . ..

Table 2: Example of a word-context co-occurrence matrix extracted from a corpus. Each dimension of a row corresponds to co-occurrence with a certain context word. For instance,

the word apple co-occurs with fruit (4th _{dimension) 20 times in the corpus.}

2.6. Sparse Word Representations using Count Models

Sparse word representations (Lin, 1998a,b; Turney and Pantel, 2010; Baroni and Lenci,

2010; Levy and Goldberg, 2014a, inter alia) describe each word in the vocabulary using a

high dimensional but extremelysparse(>90% entries of the vector are 0s) vector.

The key ingredient in all sparse word representations is aco-occurrence matrixM∈R|V|×|C|,

such as the one shown in Table 2, where |V| is the size of the word vocabulary V, |C| is

the size of the context vocabulary C. The entryM[w, c]counts how often the word w∈V

has appeared in a contextc∈C.3 _{As a result, these approaches are also called} _count-based

modelsfor learning word representations (Baroni et al., 2014). The matrixM is extremely

sparse, e.g., for the text8 corpus containing about 17 million tokens and 71 thousand distinct

words, the co-occurrence matrix has _≈1% non-zero entries only (Jiang et al., 2018). The

sparsity of M is an artifact of the Zipfian distribution of words in natural language (Zipf,

1949). The representation of a word w can be directly read off the from the relevant row

w∈R|C| _{in the co-occurrence matrix, where}_|_C_|_{is the size of the context vocabulary.}

Variants of count-based models differ along two dimensions — the definition of context,

and how the entries of Mare computed. As discussed in Section 2.3, the definition of the

3_{While I yield to the popular usage, the entries of the matrix can denote event frequencies other than}

(41)

context is task-dependent, with lexical neighborhood being a popular choice. Similarly,

there are several ways to compute the entries of the co-occurrence matrixM. A naive way

of computing the entries of the occurrence matrix is to simply set each to the raw

co-occurrence frequencyn(w, c). However, raw frequency is not a good measure of relatedness

between words, because it is skewed towards co-occurrences with function words like the,

of, is etc., that do not truly reflect the semantics of the context. Different approaches of

computing the entries ofMhave been proposed to remedy this issue.

Point-wise Mutual Information (PMI). A popular approach is to weigh each entry of

the co-occurrence matrix using point-wise mutual information or PMI (Church and Hanks,

1990). PMI measures the association between word w and context c as,

P M I(w, c) =log Pr(w, c)

Pr(w)Pr(c) (2.6)

The probability of the co-occurrence is simply estimated using counts Pr(w, c) = n(w,c_N ),

Pr(w) = n(_Nw), Pr(c) = n_N(c), wheren(w, c)is the frequency of(w, c)co-occurrence,n(w)and

n(c) are the frequencies ofw and crespectively, andN is the total corpus size. Intuitively,

PMI measures how much the probability Pr(w, c) differs from what we would expect it to

be assuming independence of Pr(w) and Pr(c) (Bouma, 2009).

Positive PMI (PPMI). An issue with PMI is that negative PMI values do not convey

meaningful information. When will PMI assume a negative value? when we have Pr(w, c)<

Pr(w)Pr(c), that is when one of w or c tend to occur individually rather than together.

For instance, this can happen when computing co-occurrences with words like the. PPMI

helps to ignore such uninformative co-occurrences. Furthermore, for unobserved(w, c)pairs

P M I(w, c) =log0 =−∞, which makes the co-occurrence matrix dense (and ill-defined).

To remedy these issues, Levy and Bullinaria (2001) suggested using positive PMI, that

(42)

Positive Local Mutual Information (PLMI). PMI and PPMI have a bias towards

rare events (i.e., smalln(w, c)). To see this, notice that for perfectly correlated wordwand

contextc (i.e., Pr(w) =Pr(c) =Pr(w, c)), the PMI or PPMI value is ₋log Pr(w, c), which

is higher for low Pr(w, c) (i.e., rare co-occurrences). Positive Local Mutual Information

(PLMI) (Evert, 2005, 2008) balances PPMI by including a factor for the co-occurrence

count n(w, c) of word w and context c, P LM I(w, c) = n(w, c)×P P M I(w, c). This way

rare co-occurrences get appropriately low entries inM.

Limitations of Sparse Representations. While sparse word representations do

over-come most of the limitations of one-hot word presentations, they have their own limitations.

The entries ofM are frequency estimates from a large corpus, that may be unreliable due

to chance co-occurrences of unrelated word pairs and missing evidence for unobserved but

related word pairs. Furthermore, many columns of the matrixMwill be highly correlated.

For instance, words that appear frequently with the context word big are also likely to

appear with the context wordlarge. These correlated context dimensions introduce

redun-dancy in the representation. The size of the representations also increases with the size of

the context vocabulary, just like one-hot representations.

Dense low dimensional word representations alleviate the above limitations of the sparse

word representations. These representations are dense, in that most entries in the

rep-resentation are non-zero values. Unlike sparse reprep-resentation, the reprep-resentation size is

significantly smaller than the size of the vocabulary (e.g., dense representation of size 100

can be used for a vocabulary of size 200k words). The reduced representation size also

reduces the redundancy due to correlated dimensions in the representation. Over the years,

several different approaches have been used to learn dense word representations. While

some approaches derive the dense representations by factorizing the sparse co-occurrence

matrix (eitherexactlyorapproximately), others directly learn dense representation by using