
The main features used in natural language processing applications are lexical features, i.e., words. Depending on the task, other features such as part-of-speech tags are also used, but lexical features are the most influential. Empirically, ignoring lexical features significantly decreases the final performance of a natural language processing system. There are three main issues when one deals with lexical features:

Out-of-vocabulary words: Word frequencies in languages follow the Zipfian distribution: the frequency of each word is inversely proportional to its rank in the frequency list of a large sample text of that language [Zipf, 1935]. Consequently, most of the words in a text corpus have a very low frequency even in a large sample, leading to a distribution with a heavy tail. As a result, many words in any test data are not seen in the training data: out-of-vocabulary words are a persistent challenge in natural language processing. Figure 2.5 shows the log-scale frequency of words in the Penn Treebank [Marcus et al., 1993]. As shown in the figure, nearly half of the word types occur only once in the data.

Representing categorical features: In traditional machine learning methods such as maximum entropy models [Ratnaparkhi, 1996], each feature is converted to a binary indicator feature whose value is 1 if the categorical feature occurs in the input and 0 otherwise. This leads to a huge, sparse feature vector in which only a few values are non-zero.

Cross-lingual lexical features: When dealing with cross-lingual data, as in our case, vocabularies have a small overlap across languages. One solution is to represent the lexical features of all languages in a shared space in which semantically similar words are close to each other. One can create such a lexical abstraction by mapping words to vectors or cluster ids.
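The first two issues can be made concrete with a small sketch. The toy corpus, the feature strings (such as `word=run`), and all function names below are illustrative assumptions, not part of the thesis; the code only shows how singleton word types, out-of-vocabulary tokens, and sparse indicator vectors arise.

```python
from collections import Counter


def hapax_fraction(tokens):
    """Fraction of word types that occur exactly once in the data."""
    counts = Counter(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)


def build_feature_index(examples):
    """Assign one dimension to every categorical feature seen in training."""
    index = {}
    for feats in examples:
        for f in feats:
            index.setdefault(f, len(index))
    return index


def to_indicator_vector(feats, index):
    """Binary indicator vector: 1 if the feature occurs in the input, else 0.
    Features that were never seen in training have no dimension and are dropped."""
    vec = [0] * len(index)
    for f in feats:
        if f in index:
            vec[index[f]] = 1
    return vec


# Toy data (hypothetical).
train = "the cat sat on the mat while the dog slept".split()
test = "the bird sat on the fence".split()
print(hapax_fraction(train), oov_rate(train, test))

index = build_feature_index([
    ["word=bank", "pos=NN"],
    ["word=run", "pos=VB"],
    ["word=apple", "pos=NN"],
])
print(to_indicator_vector(["word=run", "pos=NN", "word=pear"], index))
```

With a realistic vocabulary the index has hundreds of thousands of dimensions, so in practice only the positions of the non-zero entries would be stored.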

We use two types of word representation throughout this thesis: 1) hierarchical word clusters, and 2) word embedding vectors.

Figure 2.5: Zipfian distribution of the English words in the training section of the Penn Treebank [Marcus et al., 1993]. The horizontal axis is the frequency rank (1 to 42,000) and the vertical axis is the log-frequency (0 to 10).

Hierarchical Word Clusters

A clustering is a function $C(w)$ that maps each word $w$ in a vocabulary to a cluster. A hierarchical clustering is a function $C(w, l)$ that maps a word $w$ together with an integer $l$ to a cluster at level $l$ in the hierarchy. The goal of hierarchical word clustering is to map each word in a vocabulary to a point in a hierarchy such that words in a close neighborhood of the hierarchy are syntactically or semantically similar. For example, the Brown clustering algorithm [Brown et al., 1992] gives a hierarchical clustering. The level $l$ allows cluster features at different levels of granularity.
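As a minimal sketch of the level-based lookup C(w, l), one can treat the level as a prefix length over a word's bitstring. The bitstrings below follow the toy example of Figure 2.6; the function names and the feature-string format are assumptions made for illustration.

```python
# Toy bitstring assignments in the style of Figure 2.6.
BITSTRINGS = {
    "apple": "000", "pear": "001",
    "Apple": "010", "IBM": "011",
    "bought": "100", "run": "101",
    "of": "110", "in": "111",
}


def cluster(word, level, bitstrings=BITSTRINGS):
    """C(w, l): the length-`level` prefix of the word's bitstring.
    Returns None for words without a cluster assignment."""
    bits = bitstrings.get(word)
    if bits is None:
        return None
    return bits[:level]


def cluster_features(word, levels=(1, 2, 3)):
    """One cluster feature per level of granularity (hypothetical format)."""
    feats = []
    for l in levels:
        c = cluster(word, l)
        if c is not None:
            feats.append(f"cluster{l}={c}")
    return feats


print(cluster("apple", 2), cluster("pear", 2))  # same cluster at level 2
print(cluster_features("IBM"))
```

Shorter prefixes give coarser clusters: at level 1 the nouns share one cluster and the remaining words the other, while at level 3 every word in this toy vocabulary is distinguished.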

Brown Clustering Brown clustering [Brown et al., 1992] is a class-based bigram language model in which each word in the vocabulary receives a unique bitstring as its cluster identity. Figure 2.6 shows a small vocabulary with the hierarchical bitstring assignments from the Brown clustering algorithm. One can view this algorithm as unsupervised part-of-speech induction with a bigram transition model in which every word in the training data can have only one possible part-of-speech tag. To define the probability of data with $N$ tokens, the algorithm assumes that every word $w_i$ has a unique cluster assignment $C(w_i)$:

\[ P(w_i \mid w_{i-1}) = P(w_i \mid C(w_i)) \, P(C(w_i) \mid C(w_{i-1})) \]

[Figure 2.6 content, recovered from a presentation slide (Mohammad Sadegh Rasooli, Advances in Cross-Lingual Syntactic Transfer). A clustering function maps a word to a cluster; in a cross-lingual clustering, the clusters are shared across different languages. The bitstring hierarchy is:

0
    00    000 apple     001 pear
    01    010 Apple     011 IBM
1
    10    100 bought    101 run
    11    110 of        111 in
]

Figure 2.6: A simplified depiction of the hierarchical word clusters for some English words. This example is taken from the hierarchy shown by Koo et al. [2008].

Definition 2.7 Given a clustering assignment $C$, the quality measure for that assignment with respect to the data is defined as follows:

\[ \mathrm{Quality}(C) = \frac{\log P(w_1, \cdots, w_n)}{n} = \frac{\log \prod_{i=1}^{n} P(w_i \mid C(w_i)) \, P(C(w_i) \mid C(w_{i-1}))}{n} \]
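The quality measure of Definition 2.7 can be computed directly on a toy corpus. The sketch below is illustrative only: it assumes maximum-likelihood (relative frequency) estimates for both probabilities and a special start class `<s>` preceding the first token, neither of which is specified in the text above.

```python
import math
from collections import Counter


def quality(tokens, assign):
    """Quality(C): average log-probability of the tokens under the
    class-based bigram model P(w_i | C(w_i)) * P(C(w_i) | C(w_{i-1})).

    Assumptions: relative-frequency estimates, and a start class "<s>"
    that precedes the first token."""
    classes = [assign[w] for w in tokens]
    prev = ["<s>"] + classes[:-1]
    word_count = Counter(tokens)
    class_count = Counter(classes)
    prev_count = Counter(prev)
    bigram_count = Counter(zip(prev, classes))

    total = 0.0
    for w, c, p in zip(tokens, classes, prev):
        emit = word_count[w] / class_count[c]         # P(w_i | C(w_i))
        trans = bigram_count[(p, c)] / prev_count[p]  # P(C(w_i) | C(w_{i-1}))
        total += math.log(emit) + math.log(trans)
    return total / len(tokens)


tokens = "a b a b".split()
separate = {"a": "A", "b": "B"}   # each word in its own cluster
together = {"a": "X", "b": "X"}   # both words in one cluster
print(quality(tokens, separate), quality(tokens, together))
```

On this toy corpus, merging the two words into a single cluster lowers the quality, which is the quantity the greedy algorithm tries to keep as high as possible when it chooses which clusters to merge.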

The original work by Brown et al. [1992] uses a greedy heuristic algorithm to find the best clustering of the data. A naive implementation has a time complexity of $O(n^5)$, where $n$ is the number of distinct word types. Brown et al. [1992] proposed an algorithm that reduces this runtime to $O(n^3)$, which is still not tractable for a large corpus. To solve this issue, a precomputation trick is used to reduce the runtime to $O(N + nm^2)$, in which $N$ is the number of tokens and $m$ is the number of clusters.

Inputs: Corpus with $N$ tokens and $n$ distinct word types $w_1, \cdots, w_n$ ordered by decreasing frequency; $m$: number of clusters, where $m \le n$.

Algorithm:
    Initialize active clusters $C = \{\{w_1\}, \cdots, \{w_m\}\}$
    for $i = m+1$ to $n+m-1$ do
        if $i \le n$ then
            Set $C = C \cup \{\{w_i\}\}$
        Merge $c, c' \in C$ that cause the smallest decrease in the likelihood of the corpus (Definition 2.7).

Output: The clustering $C$.

Figure 2.7: Brown clustering algorithm.

The algorithm in Figure 2.7 shows the pseudo-code for deriving the Brown clusters using the greedy heuristic. As shown in the pseudo-code, the algorithm starts by assigning $m$ unique clusters to the $m$ most frequent words. It then visits the other words in decreasing order of frequency, assigns a new ($(m+1)$th) cluster to the new word, and merges two of the clusters based on the quality measure. Thus, at each step, the algorithm ends up with exactly $m$ clusters.
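The greedy procedure of Figure 2.7 can be sketched with brute-force re-scoring, which is far slower than the precomputation trick mentioned above and is meant only to show the control flow. Several choices here are assumptions rather than part of the algorithm as described: relative-frequency estimates with a start class `<s>`, a single placeholder class `<pending>` for words that have not been inserted yet, and stopping once all words are inserted (so that m clusters remain) instead of continuing the merges to build the full hierarchy.

```python
import math
from collections import Counter
from itertools import combinations


def quality(tokens, assign):
    """Average log-probability under the class-based bigram model
    (Definition 2.7), with relative-frequency estimates (assumption)."""
    classes = [assign[w] for w in tokens]
    prev = ["<s>"] + classes[:-1]
    word_count, class_count = Counter(tokens), Counter(classes)
    prev_count, bigram_count = Counter(prev), Counter(zip(prev, classes))
    total = 0.0
    for w, c, p in zip(tokens, classes, prev):
        total += math.log(word_count[w] / class_count[c])
        total += math.log(bigram_count[(p, c)] / prev_count[p])
    return total / len(tokens)


def brown_clusters(tokens, m):
    """Naive greedy clustering: keep m active clusters, insert the next most
    frequent word as cluster m + 1, then merge the pair whose merge gives the
    highest quality (i.e., the smallest decrease in likelihood)."""
    types = [w for w, _ in Counter(tokens).most_common()]
    clusters = [{w} for w in types[:m]]

    def score(candidate):
        assign = {w: k for k, c in enumerate(candidate) for w in c}
        # Assumption: words not inserted yet share one placeholder class.
        full = {w: assign.get(w, "<pending>") for w in types}
        return quality(tokens, full)

    for w in types[m:]:
        clusters.append({w})  # m + 1 active clusters
        best_score, best_clusters = None, None
        for a, b in combinations(range(len(clusters)), 2):
            merged = [c for k, c in enumerate(clusters) if k not in (a, b)]
            merged.append(clusters[a] | clusters[b])
            s = score(merged)
            if best_score is None or s > best_score:
                best_score, best_clusters = s, merged
        clusters = best_clusters  # back to m active clusters
    return clusters


corpus = "the cat runs the dog runs a cat sleeps a dog sleeps".split()
for c in brown_clusters(corpus, 3):
    print(sorted(c))
```

Each insertion re-scores every pair of active clusters against the whole corpus, which is what the more efficient implementations avoid by caching the quantities that change after a merge.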

When dealing with a very large corpus, even the $O(N + nm^2)$ algorithm is very time-consuming. There are alternatives, such as the method of Stratos et al. [2015], that obtain the clustering assignments more efficiently. Word clusters have shown promising results in dependency parsing [Koo et al., 2008], part-of-speech tagging, and named entity recognition [Turian et al., 2010]. They are usually used as additional features in a traditional classifier such as a log-linear model or the Perceptron algorithm.

Word Embeddings

Word embedding models embed each word into a $d$-dimensional vector by creating a matrix in $\mathbb{R}^{N \times d}$ from a vocabulary with $N$ words. The goal is to obtain a set of vectors for words in the vocabulary such that semantically similar words are similar in vector space. One well-known model for obtaining word embeddings is the Skip-gram model of Mikolov et al. [2013b]: given a sequence of training words $w_1, \cdots, w_T$, the objective function of the Skip-gram model maximizes the average log-probability of the data:

\[ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_O = w_{t+j} \mid w_I = w_t) \]

where $c$ is the size of the training context. The probability $p(w_O \mid w_I)$ can be defined as proportional to the exponentiated dot product of the “word” representation of $w_O$ and the “context” representation of $w_I$. Defining two different vectors for context words and words gives the model more flexibility to distinguish between the word being targeted and its context:

\[ p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w \in V} \exp({v'_{w}}^{\top} v_{w_I})} \]

where $V$ is the vocabulary. The softmax function in the above equation normalizes over a large vocabulary $V$. This normalization is computationally expensive and memory-intensive. Mikolov et al. [2013b] use an approximation of the softmax that samples $k$ words from the vocabulary:

\[ \log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right] \]

where $\sigma$ is the sigmoid function and $P_n(w)$ is the noise distribution from which the $k$ sample words are drawn. Mikolov et al. [2013b] empirically found that the unigram distribution raised to the 3/4 power (i.e., each word's probability is proportional to its unigram count raised to the power 3/4) gives better performance than the uniform and unmodified unigram distributions on standard word analogy and similarity tasks. With this approximate method, called negative sampling, we maximize the likelihood of a pair of input and context words in contrast to $k$ noise context examples.
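The pieces of the Skip-gram model described above can be put together in a short pure-Python sketch. It is illustrative only: the toy corpus, the random initialization, and all function names are assumptions, and the expectation in the negative-sampling objective is approximated by a single draw per noise word. No training step is shown.

```python
import math
import random
from collections import Counter


def skipgram_pairs(tokens, c):
    """(w_I, w_O) pairs with w_I = w_t and w_O = w_{t+j}, -c <= j <= c, j != 0."""
    pairs = []
    for t, w in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((w, tokens[t + j]))
    return pairs


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_prob(w_out, w_in, v_in, v_out):
    """p(w_O | w_I) with the full softmax; cost grows with the vocabulary size."""
    scores = {w: dot(v_out[w], v_in[w_in]) for w in v_out}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[w_out]) / z


def noise_distribution(tokens, power=0.75):
    """P_n(w) proportional to the unigram count raised to the 3/4 power."""
    weights = {w: n ** power for w, n in Counter(tokens).items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_objective(w_out, w_in, v_in, v_out, noise, k, rng):
    """log sigma(v'_O . v_I) + sum over k noise words of log sigma(-v'_i . v_I),
    with each expectation estimated from one sampled noise word (assumption)."""
    words = list(noise)
    negatives = rng.choices(words, weights=[noise[w] for w in words], k=k)
    value = log_sigmoid(dot(v_out[w_out], v_in[w_in]))
    for w in negatives:
        value += log_sigmoid(-dot(v_out[w], v_in[w_in]))
    return value


rng = random.Random(0)
tokens = "the cat sat on the mat".split()
d = 4  # embedding dimension (toy value)
v_in = {w: [rng.uniform(-0.5, 0.5) for _ in range(d)] for w in set(tokens)}
v_out = {w: [rng.uniform(-0.5, 0.5) for _ in range(d)] for w in set(tokens)}
noise = noise_distribution(tokens)

w_in, w_out = skipgram_pairs(tokens, c=2)[0]
print(softmax_prob(w_out, w_in, v_in, v_out))
print(negative_sampling_objective(w_out, w_in, v_in, v_out, noise, k=3, rng=rng))
```

The full softmax touches every word in the vocabulary for each training pair, whereas the negative-sampling objective touches only the output word and the k sampled noise words.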

The Skip-gram model is not the only way to obtain word embedding vectors. Other methods exist, such as the continuous bag-of-words (CBOW) model [Mikolov et al., 2013a], GloVe [Pennington et al., 2014], and the spectral method of Stratos et al. [2015]. It is not clear which model gives the best accuracy: it depends on the target task in which the embeddings are used.