Word Representations - Cross-Lingual Transfer of Natural Language Processing Systems

Lexical features, i.e. words, are the main type of feature used in natural language processing applications. Depending on the task, other features such as part-of-speech tags are also used, but lexical features are the most influential. Empirically, ignoring lexical features significantly decreases the final performance of a natural language processing system. There are three main issues when dealing with lexical features:

• Out-of-vocabulary words: Word frequencies in a language follow a Zipfian distribution: the frequency of each word is inversely proportional to its rank in the frequency list of a large sample text of that language [Zipf, 1935]. Consequently, most words in a text corpus have a very low frequency even in a large sample, leading to a distribution with a heavy tail. Many words in any test data are therefore not seen in the training data, which makes out-of-vocabulary words a challenge in natural language processing. Figure 2.5 shows the log-scale frequency of words in the Penn Treebank [Marcus et al., 1993]. As shown in the figure, nearly half of the word types occur only once in the data.

• Representing categorical features: In traditional machine learning methods such as maximum entropy models [Ratnaparkhi, 1996], each categorical feature is converted to a binary indicator feature whose value is 1 if the feature is present in the input and 0 otherwise. This leads to a huge sparse feature vector in which only a few values are non-zero.

• Cross-lingual lexical features: When dealing with cross-lingual data, as in our case, vocabularies have a small overlap across languages. One solution is to represent the lexical features of all languages in a shared space in which semantically similar words are close to each other. Such a lexical abstraction can be created by mapping words to vectors or cluster ids.
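The first two issues can be made concrete with a minimal Python sketch. The toy sentences and the function names `type_statistics` and `indicator_vector` are illustrative assumptions, not part of the thesis; the sketch only counts word types that occur once, measures the out-of-vocabulary rate of a test sample, and builds a binary indicator vector.

```python
from collections import Counter

def type_statistics(train_tokens, test_tokens):
    """Return (fraction of training word types seen exactly once,
    fraction of test tokens that never occur in the training data)."""
    counts = Counter(train_tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    unseen = sum(1 for tok in test_tokens if tok not in counts)
    return singletons / len(counts), unseen / len(test_tokens)

def indicator_vector(word, vocabulary):
    """One binary feature per vocabulary entry: 1 if it matches `word`, else 0."""
    return [1 if word == entry else 0 for entry in vocabulary]

# Toy data, invented for illustration only.
train = "the cat sat on the mat while the dog slept".split()
test = "the bird sat on the fence".split()

singleton_fraction, oov_rate = type_statistics(train, test)
vocabulary = sorted(set(train))
cat_vector = indicator_vector("cat", vocabulary)    # a single non-zero entry
bird_vector = indicator_vector("bird", vocabulary)  # all zeros: "bird" is out of vocabulary
```

Even in this tiny sample most word types occur once, and an unseen test word activates no indicator feature at all, which is the motivation for the shared abstractions discussed in this section.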

We use two types of word representation throughout this thesis: 1) hierarchical word clusters, and 2) word embedding vectors.

Hierarchical Word Clusters

[Plot: log-frequency (vertical axis, 0 to 10) against frequency rank (horizontal axis, 1 to 42,000).]

Figure 2.5: Zipfian distribution of the English words in the training section of the Penn Treebank [Marcus et al., 1993].

A clustering is a function C(w) that maps each word w in a vocabulary to a cluster. A hierarchical clustering is a function C(w, l) that maps a word w together with an integer l to a cluster at level l in the hierarchy. The goal of hierarchical word clustering is to map each word in a vocabulary to a point in a hierarchy such that words in a close neighborhood in the hierarchy are syntactically or semantically similar. As one example, the Brown clustering algorithm [Brown et al., 1992] gives a hierarchical clustering. The level l allows cluster features at different levels of granularity.
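As a small illustration of C(w, l), the sketch below assumes the hierarchy is encoded as bitstrings, as in the Brown clustering discussed in this section, so that the cluster at level l is simply the length-l prefix of a word's bitstring. The toy assignments are copied from Figure 2.6; the function names and the feature-string format are hypothetical.

```python
# Toy bitstrings copied from Figure 2.6. A real system would load them
# from the output of a clustering run instead.
BITSTRINGS = {
    "apple": "000", "pear": "001", "Apple": "010", "IBM": "011",
    "bought": "100", "run": "101", "of": "110", "in": "111",
}

def cluster(word, level):
    """C(w, l): the cluster of `word` at depth `level`, i.e. a bitstring prefix."""
    bits = BITSTRINGS.get(word)
    if bits is None:
        return None  # the word has no cluster assignment
    return bits[:level]

def cluster_features(word, levels=(1, 2, 3)):
    """Feature strings at several granularities (the format is an assumption)."""
    features = []
    for level in levels:
        cluster_id = cluster(word, level)
        if cluster_id is not None:
            features.append(f"cluster@{level}={cluster_id}")
    return features

apple_features = cluster_features("apple")
pear_features = cluster_features("pear")
```

Short prefixes give coarse clusters shared by many words, while longer prefixes separate them: in this toy hierarchy "apple" and "pear" share their level-1 and level-2 clusters and differ only at level 3.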

Brown Clustering. Brown clustering [Brown et al., 1992] is a class-based bigram language model in which each word in the vocabulary receives a unique bitstring as its cluster identity. Figure 2.6 shows a small vocabulary with the hierarchical bitstring assignments from the Brown clustering algorithm. One can view this algorithm as unsupervised part-of-speech induction with a bigram transition model in which every word in the training data can have only one possible part-of-speech tag.

[Slide "Learning Cross-Lingual Word Clusters", from "Advances in Cross-Lingual Syntactic Transfer" by Mohammad Sadegh Rasooli. A clustering function maps a word to a cluster; in a cross-lingual clustering, the clusters are shared across different languages. The slide shows the following hierarchy:]

0
    00
        000 apple
        001 pear
    01
        010 Apple
        011 IBM
1
    10
        100 bought
        101 run
    11
        110 of
        111 in

Figure 2.6: A simplified depiction of the hierarchical word clusters for some English words. This example is taken from the hierarchy shown by Koo et al. [2008].

To define the probability of a corpus with N tokens, the algorithm assumes that every word $w_i$ has a unique cluster assignment $C(w_i)$:

$$P(w_i \mid w_{i-1}) = P(w_i \mid C(w_i)) \, P(C(w_i) \mid C(w_{i-1}))$$

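Definition 2.7 Given a clustering assignment C, the quality measure for that assignment with respect to the data is defined as follows:

$$\mathrm{Quality}(C) = \frac{1}{n} \log P(w_1, \cdots, w_n) = \frac{1}{n} \log \prod_{i=1}^{n} P(w_i \mid C(w_i)) \, P(C(w_i) \mid C(w_{i-1}))$$

Here $w_1, \cdots, w_n$ denotes the token sequence of the data, so n counts tokens in this definition.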
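The quality measure can be computed directly from counts. The following is a minimal sketch under stated assumptions: both probabilities are estimated by maximum likelihood from the same data, and a hypothetical `START` cluster stands in for the predecessor of the first token, a choice the definition itself does not specify.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical cluster used as predecessor of the first token

def quality(tokens, assignment):
    """(1/n) * sum_i log[ P(w_i | C(w_i)) * P(C(w_i) | C(w_{i-1})) ],
    with both probabilities estimated by relative frequency."""
    clusters = [assignment[w] for w in tokens]
    previous = [START] + clusters[:-1]
    word_count = Counter(tokens)
    cluster_count = Counter(clusters)
    bigram_count = Counter(zip(previous, clusters))
    previous_count = Counter(previous)
    total = 0.0
    for w, c, p in zip(tokens, clusters, previous):
        emission = word_count[w] / cluster_count[c]             # P(w | C(w))
        transition = bigram_count[(p, c)] / previous_count[p]   # P(C(w) | C(w_prev))
        total += math.log(emission) + math.log(transition)
    return total / len(tokens)

# Toy corpus and two candidate clusterings, invented for illustration.
tokens = "the dog runs the cat runs the dog sleeps the cat sleeps".split()
by_category = {"the": "D", "dog": "N", "cat": "N", "runs": "V", "sleeps": "V"}
mixed = {"the": "D", "dog": "N", "cat": "V", "runs": "N", "sleeps": "V"}

quality_by_category = quality(tokens, by_category)
quality_mixed = quality(tokens, mixed)
```

On this toy corpus the clustering that groups nouns together and verbs together makes every cluster transition deterministic, so it scores higher than the mixed assignment.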

The original work by Brown et al. [1992] uses a greedy heuristic algorithm to find the best clustering of the data. A naive implementation has a time complexity of $O(n^5)$. Brown et al. [1992] proposed an algorithm that reduces this runtime to $O(n^3)$, which is still not tractable for a large corpus. To solve this issue, a precomputation trick is used to reduce the runtime to $O(N + nm^2)$, where N is the number of tokens, n is the number of distinct word types, and m is the number of clusters.

Inputs: Corpus with N tokens and n distinct word types $w_1, \cdots, w_n$ ordered by decreasing frequency; m: number of clusters, where $m \le n$.

Algorithm:
    Initialize active clusters $C = \{\{w_1\}, \cdots, \{w_m\}\}$
    for $i = m + 1$ to $n + m - 1$ do
        if $i \le n$ then
            Set $C = C \cup \{\{w_i\}\}$
        Merge $c, c' \in C$ that cause the smallest decrease in the likelihood of the corpus (Definition 2.7).

Output: The clustering C.

Figure 2.7: Brown clustering algorithm.

The algorithm in Figure 2.7 shows the pseudo-code for deriving the Brown clusters using the greedy heuristic. The algorithm starts by assigning m unique clusters to the m most frequent words. It then visits the remaining words in order of decreasing frequency, assigns a new ((m + 1)th) cluster to each new word, and merges the two clusters whose merge causes the smallest decrease in the quality measure. Thus, after each of these steps, the algorithm ends up with exactly m clusters. Once all n words have been visited, the remaining m − 1 iterations only merge clusters, and the sequence of merges defines the hierarchy.
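The greedy loop of Figure 2.7 can be sketched in a few lines of Python. This is a naive illustration, not the efficient algorithm: it recomputes the quality measure from scratch for every candidate merge. It also makes several assumptions of its own: tokens whose word has not been visited yet are ignored when scoring, the first token contributes no transition term, and the full merge history is kept as a binary tree so that bitstrings can be read off as in Figure 2.6. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def leaves(node):
    """Words below a tree node: a leaf is a word, an internal node a (left, right) pair."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def quality(tokens, clusters):
    """Definition 2.7 with relative-frequency estimates (see assumptions above)."""
    cluster_of = {w: k for k, node in enumerate(clusters) for w in leaves(node)}
    words = [w for w in tokens if w in cluster_of]  # skip words not visited yet
    seq = [cluster_of[w] for w in words]
    word_count, cluster_count = Counter(words), Counter(seq)
    bigram_count, prev_count = Counter(zip(seq, seq[1:])), Counter(seq[:-1])
    total = sum(math.log(word_count[w] / cluster_count[c]) for w, c in zip(words, seq))
    total += sum(math.log(bigram_count[p, c] / prev_count[p]) for p, c in zip(seq, seq[1:]))
    return total / len(seq)

def brown_clusters(tokens, m):
    """Greedy merging as in Figure 2.7; returns the root of the merge tree."""
    vocab = [w for w, _ in Counter(tokens).most_common()]  # decreasing frequency
    n = len(vocab)
    active = list(vocab[:m])                  # one cluster per frequent word
    for i in range(m + 1, n + m):             # i = m+1, ..., n+m-1
        if i <= n:
            active.append(vocab[i - 1])       # w_i starts in its own cluster
        best_score, best_clusters = None, None
        for a, b in combinations(range(len(active)), 2):
            candidate = [c for k, c in enumerate(active) if k not in (a, b)]
            candidate.append((active[a], active[b]))
            score = quality(tokens, candidate)
            if best_score is None or score > best_score:
                best_score, best_clusters = score, candidate
        active = best_clusters                # keep the least harmful merge
    return active[0]

def bitstrings(node, prefix=""):
    """Read one bitstring per word off the merge tree (0 = left, 1 = right)."""
    if isinstance(node, str):
        return {node: prefix}
    return {**bitstrings(node[0], prefix + "0"), **bitstrings(node[1], prefix + "1")}

# Toy corpus, invented for illustration.
tokens = "the dog runs the cat runs the dog sleeps the cat sleeps".split()
codes = bitstrings(brown_clusters(tokens, m=3))
```

Because every candidate merge triggers a full pass over the corpus, this sketch is only usable on toy data; the precomputation trick mentioned above exists precisely to avoid that cost.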

Even $O(N + nm^2)$ is very time-consuming for a very large corpus. Alternatives such as the method of Stratos et al. [2015] obtain the clustering assignments at a lower computational cost. Word clusters have shown promising results in dependency parsing [Koo et al., 2008], part-of-speech tagging and named entity recognition [Turian et al., 2010]. They are usually used as additional features in a traditional classifier such as a log-linear model or the Perceptron algorithm.

Word Embeddings

Word embedding models embed each word into a d-dimensional vector by creating a matrix in $\mathbb{R}^{N \times d}$ from a vocabulary with N words. The goal is to obtain a set of vectors for words in the vocabulary such that semantically similar words are similar in vector space. One well-known model for obtaining word embeddings is the Skip-gram model of Mikolov et al. [2013b]: given a sequence of training words $w_1, \cdots, w_T$, the objective function of the Skip-gram model maximizes the average log-probability of the data:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_O = w_{t+j} \mid w_I = w_t)$$

where c is the size of the training context. The probability $p(w_O \mid w_I)$ can be defined as proportional to the exponentiated dot product of the “word” representation $v'_{w_O}$ of $w_O$ and the “context” representation $v_{w_I}$ of $w_I$. Defining two different vectors for words and context words gives the model more flexibility to distinguish between the word being targeted and its context:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w \in V} \exp\left({v'_w}^{\top} v_{w_I}\right)}$$

where V is the vocabulary. The softmax function in the above equation normalizes over a large vocabulary V, which is computationally expensive and memory intensive. Mikolov et al. [2013b] instead approximate $\log p(w_O \mid w_I)$ by sampling k words from the vocabulary:

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

where σ is the sigmoid function and $P_n(w)$ is the noise distribution from which the k sample words are drawn. Mikolov et al. [2013b] empirically found that a noise distribution proportional to the unigram count of each word raised to the 3/4 power performs better than the uniform distribution and the unmodified unigram distribution on standard word analogy and similarity tasks. With this approximate method, called negative sampling, we maximize the likelihood of a pair of input and context words in contrast to k noise context examples.
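The pieces of the Skip-gram model described above can be put together in a short pure-Python sketch. It is an illustration under assumptions rather than the reference implementation: the vectors are random and untrained, the expectation over the noise distribution is replaced by k sampled words, and all names (`skipgram_pairs`, `word_vecs` for the $v'$ vectors, `context_vecs` for the $v$ vectors, and so on) are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def skipgram_pairs(tokens, c):
    """Yield (w_I, w_O) = (w_t, w_{t+j}) for -c <= j <= c, j != 0."""
    for t, w_in in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield w_in, tokens[t + j]

def log_softmax_probability(w_out, w_in, word_vecs, context_vecs):
    """log p(w_O | w_I) with the full softmax over the vocabulary."""
    scores = {w: dot(v, context_vecs[w_in]) for w, v in word_vecs.items()}
    shift = max(scores.values())  # subtract the max for numerical stability
    log_z = shift + math.log(sum(math.exp(s - shift) for s in scores.values()))
    return scores[w_out] - log_z

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, then normalized."""
    weights = {w: count ** power for w, count in counts.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

def log_sigmoid(x):
    """log(sigmoid(x)), written to avoid overflow for large |x|."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def negative_sampling_term(w_out, w_in, word_vecs, context_vecs, noise, k, rng):
    """One observed pair scored against k words sampled from the noise distribution."""
    v_in = context_vecs[w_in]
    value = log_sigmoid(dot(word_vecs[w_out], v_in))
    words = list(noise)
    for w_neg in rng.choices(words, weights=[noise[w] for w in words], k=k):
        value += log_sigmoid(-dot(word_vecs[w_neg], v_in))
    return value

# Toy data and untrained random vectors, invented for illustration.
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
rng = random.Random(0)
word_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(4)] for w in vocab}     # v'
context_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(4)] for w in vocab}  # v

pairs = list(skipgram_pairs(tokens, c=1))
exact = sum(log_softmax_probability(o, i, word_vecs, context_vecs)
            for i, o in pairs) / len(tokens)
noise = noise_distribution({w: tokens.count(w) for w in vocab})
approx = sum(negative_sampling_term(o, i, word_vecs, context_vecs, noise, 2, rng)
             for i, o in pairs) / len(tokens)
```

The full-softmax objective touches every vocabulary entry for every pair, while the negative-sampling term touches only k + 1 vectors. The two values are different objectives and are not expected to agree numerically.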

The Skip-gram model is not the only way to obtain word embedding vectors. Other methods include the continuous bag-of-words (CBOW) model [Mikolov et al., 2013a], GloVe [Pennington et al., 2014], and the spectral method of Stratos et al. [2015]. It is not clear which model gives the best accuracy: it depends on the target task in which the embeddings are used.
