The main features used across natural language processing applications are lexical features, i.e., words. Depending on the task, other features such as part-of-speech tags are also used, but lexical features are the most influential. Empirically, ignoring lexical features significantly decreases the final performance of a natural language processing system. There are three main issues when one deals with lexical features:
• Out-of-vocabulary words: Word frequencies in languages follow the Zipfian distribution: the frequency of each word is inversely proportional to its rank in the frequency list of a large sample text of that language [Zipf, 1935]. Consequently, most of the words in a text corpus have a very low frequency even in a large sample, leading to a distribution with a heavy tail. As a result, many words in any testing data are not seen in the training data, and this problem of out-of-vocabulary words is a challenge in natural language processing. Figure 2.5 shows the logarithm-scale frequency of words in the Penn Treebank [Marcus et al., 1993]. As shown in the figure, nearly half of the word types occur only once in the data.
• Representing categorical features: In traditional machine learning methods such as maximum entropy models [Ratnaparkhi, 1996], each feature is converted to a binary indicator feature that indicates whether a categorical feature exists in the input (value 1) or not (value 0). This leads to a huge sparse feature vector in which only a few of the values are non-zero.
• Cross-lingual lexical features: When dealing with cross-lingual data, as in our case, vocabularies have a small overlap across languages. One solution is to represent the lexical features of all languages in a shared space in which semantically similar words are close to each other. One can create such a lexical abstraction by mapping words to vectors or cluster ids.
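As a small illustration of the first two issues, the following sketch computes the share of word types that occur only once, the out-of-vocabulary rate of a test text, and a binary indicator vector for one word. The toy sentences and the helper names `type_statistics` and `indicator_vector` are assumptions made for this example; they are not taken from the Penn Treebank experiments.

```python
from collections import Counter


def type_statistics(train_tokens, test_tokens):
    """Return (hapax_ratio, oov_rate) for two lists of tokens."""
    counts = Counter(train_tokens)
    # Fraction of training word types that occur exactly once.
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(counts)
    # Fraction of test tokens whose word type never appears in training.
    oov_rate = sum(1 for w in test_tokens if w not in counts) / len(test_tokens)
    return hapax_ratio, oov_rate


def indicator_vector(word, vocabulary):
    """Binary indicator features: 1 at the position of `word`, 0 elsewhere."""
    return [1 if word == v else 0 for v in vocabulary]


# Toy data (hypothetical).
train = "the cat sat on the mat and the dog sat".split()
test = "the bird sat on the mat".split()

hapax_ratio, oov_rate = type_statistics(train, test)
vocabulary = sorted(set(train))
print(hapax_ratio, oov_rate)
print(indicator_vector("cat", vocabulary))
```

Even in this tiny example most word types are seen only once, and an unseen word such as "bird" receives an all-zero indicator vector, which is the situation that word clusters and embeddings are meant to alleviate.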
We use two types of word representation throughout this thesis: 1) hierarchical word clusters, and 2) word embedding vectors.
Hierarchical Word Clusters
Figure 2.5: Zipfian distribution of the English words in the training section of the Penn Treebank [Marcus et al., 1993]. (Plot: frequency rank on the horizontal axis, log-frequency on the vertical axis.)

A clustering is a function $C(w)$ that maps each word $w$ in a vocabulary to a cluster. A hierarchical clustering is a function $C(w, l)$ that maps a word $w$ together with an integer $l$ to a cluster at level $l$ in the hierarchy. The goal of hierarchical word clustering is to map each word in a vocabulary to a point in a hierarchy in which words in a close neighborhood should be syntactically or semantically similar. As one example, the Brown clustering algorithm [Brown et al., 1992] gives a hierarchical clustering. The level $l$ allows cluster features at different levels of granularity.
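One way to realize the function C(w, l), assumed here purely for illustration, is to give every word a bitstring and to take the length-l prefix of that bitstring as its level-l cluster. The sketch below uses the bitstrings of the example in Figure 2.6; the names `cluster` and `cluster_features` and the feature string format are hypothetical.

```python
# Bitstring cluster identities from the example in Figure 2.6.
BITSTRINGS = {
    "apple": "000", "pear": "001", "Apple": "010", "IBM": "011",
    "bought": "100", "run": "101", "of": "110", "in": "111",
}


def cluster(word, level, bitstrings=BITSTRINGS):
    """C(w, l): the length-`level` bitstring prefix, or None for unknown words."""
    bits = bitstrings.get(word)
    return None if bits is None else bits[:level]


def cluster_features(word, levels=(1, 2, 3)):
    """One feature string per level of granularity (assumed format)."""
    return [f"C{level}={cluster(word, level)}" for level in levels]


print(cluster("apple", 2), cluster("pear", 2))  # both fall into cluster "00"
print(cluster_features("run"))
```

Short prefixes group many words together (coarse clusters), while longer prefixes separate them (fine clusters), so a classifier can use several levels at once.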
Brown Clustering
Brown clustering [Brown et al., 1992] is a class-based bigram language model in which each word in the vocabulary receives a unique bitstring as its cluster identity. Figure 2.6 shows a small vocabulary with the hierarchical bitstring assignments from the Brown clustering algorithm. One can view this algorithm as unsupervised part-of-speech induction with a bigram transition model such that every word in the training data can have only one possible part-of-speech tag.

[Figure content: a binary tree over bitstring prefixes with the leaves 000 apple, 001 pear, 010 Apple, 011 IBM, 100 bought, 101 run, 110 of, 111 in. Slide text: "A clustering function maps a word to a cluster. In a cross-lingual clustering, the clusters are shared across different languages." (Mohammad Sadegh Rasooli, Advances in Cross-Lingual Syntactic Transfer)]
Figure 2.6: A simplified depiction of the hierarchical word clusters for some English words. This example is taken from the hierarchy shown by Koo et al. [2008].

To define the probability of a corpus with $N$ tokens, the algorithm assumes that every word $w_i$ has a unique cluster assignment $C(w_i)$:
\[
P(w_i \mid w_{i-1}) = P(w_i \mid C(w_i)) \, P(C(w_i) \mid C(w_{i-1}))
\]
Definition 2.7 Given a clustering assignment $C$, the quality measure for that assignment with respect to a corpus of $N$ tokens is defined as follows:
\[
\mathrm{Quality}(C) = \frac{\log P(w_1, \cdots, w_N)}{N} = \frac{\log \prod_{i=1}^{N} P(w_i \mid C(w_i)) \, P(C(w_i) \mid C(w_{i-1}))}{N}
\]
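The quality measure can be computed directly for a small corpus. The sketch below is a minimal illustration under two assumptions that the definition leaves open: both probabilities are estimated by relative frequencies on the same corpus, and the first token is conditioned on an artificial start cluster `"<s>"`. The function name `quality` is hypothetical.

```python
import math
from collections import Counter


def quality(tokens, assignment, start="<s>"):
    """Average log-likelihood of `tokens` under a class-based bigram model."""
    clusters = [assignment[w] for w in tokens]
    previous = [start] + clusters[:-1]  # cluster of the preceding token
    word_count = Counter(tokens)
    cluster_count = Counter(clusters)
    bigram_count = Counter(zip(previous, clusters))
    previous_count = Counter(previous)

    log_prob = 0.0
    for w, c, p in zip(tokens, clusters, previous):
        emission = word_count[w] / cluster_count[c]           # P(w_i | C(w_i))
        transition = bigram_count[(p, c)] / previous_count[p]  # P(C(w_i) | C(w_{i-1}))
        log_prob += math.log(emission) + math.log(transition)
    return log_prob / len(tokens)


tokens = "a b a b".split()
separate = quality(tokens, {"a": 0, "b": 1})  # each word in its own cluster
merged = quality(tokens, {"a": 0, "b": 0})    # both words in one cluster
print(separate, merged)
```

In this toy corpus the two-cluster assignment predicts the alternation perfectly, so its quality is 0, while merging both words into one cluster lowers the quality to log 0.5.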
The original work by Brown et al. [1992] uses a greedy heuristic algorithm to search for the best clustering of the data. A naive implementation has a time complexity of $O(n^5)$. Brown et al. [1992] proposed an algorithm that reduces this runtime to $O(n^3)$, which is still not tractable for a large corpus. To solve this issue, a precomputation trick is used to reduce the runtime to $O(N + nm^2)$, in which $N$ is the number of tokens, $n$ is the number of distinct word types, and $m$ is the number of clusters. Figure 2.7 shows the pseudo-code for deriving the Brown clusters using the greedy heuristic. As shown in the pseudo-code, the algorithm starts by assigning $m$ unique clusters to the $m$ most frequent words. It then visits the other words in decreasing order of frequency, assigns a new ($(m+1)$th) cluster to the new word, and merges two of the clusters based on the quality measure. Thus, after each of these steps, the algorithm ends up with exactly $m$ clusters. Once all words have been visited, the remaining $m - 1$ iterations of the loop only merge clusters, which builds the upper levels of the hierarchy.

Inputs: Corpus with $N$ tokens and $n$ distinct word types $w_1, \cdots, w_n$ ordered by decreasing frequency; $m$: number of clusters, where $m \le n$.
Algorithm:
    Initialize active clusters $C = \{\{w_1\}, \cdots, \{w_m\}\}$
    for $i = m+1$ to $n+m-1$ do
        if $i \le n$ then
            Set $C = C \cup \{\{w_i\}\}$
        Merge $c, c' \in C$ that cause the smallest decrease in the likelihood of the corpus (Definition 2.7).
Output: The clustering $C$.
Figure 2.7: Brown clustering algorithm.
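The greedy loop of Figure 2.7 can be sketched in a few lines of Python. This is a naive illustration and not the efficient algorithm: it recomputes the quality measure from scratch for every candidate merge, so it is far slower than the runtimes discussed above. It also makes assumptions that the pseudo-code leaves open: probabilities are relative frequencies with an artificial start cluster, words that have not been visited yet are treated as singleton clusters when scoring, and ties are broken by iteration order. The names `quality` and `brown_clusters` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def quality(tokens, assignment, start="<s>"):
    """Average log-likelihood under a class-based bigram model (Definition 2.7)."""
    clusters = [assignment[w] for w in tokens]
    previous = [start] + clusters[:-1]
    word_count, cluster_count = Counter(tokens), Counter(clusters)
    bigram_count, previous_count = Counter(zip(previous, clusters)), Counter(previous)
    total = 0.0
    for w, c, p in zip(tokens, clusters, previous):
        total += math.log(word_count[w] / cluster_count[c])
        total += math.log(bigram_count[(p, c)] / previous_count[p])
    return total / len(tokens)


def brown_clusters(tokens, m):
    """Return the m flat clusters and the full merge history (requires 1 <= m <= n)."""
    types = [w for w, _ in Counter(tokens).most_common()]  # decreasing frequency
    n = len(types)
    clusters = [frozenset([w]) for w in types[:m]]
    flat, merges = list(clusters), []
    for i in range(m + 1, n + m):  # i = m+1, ..., n+m-1
        if i <= n:
            clusters.append(frozenset([types[i - 1]]))
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            candidate = [c for k, c in enumerate(clusters) if k not in (a, b)]
            candidate.append(clusters[a] | clusters[b])
            # Assumption: unvisited words count as singleton clusters.
            assignment = {w: ("type", w) for w in types}
            for k, c in enumerate(candidate):
                for w in c:
                    assignment[w] = k
            score = quality(tokens, assignment)
            if best is None or score > best[0]:
                best = (score, candidate, (clusters[a], clusters[b]))
        _, clusters, merged = best
        merges.append(merged)
        if i <= n:
            flat = list(clusters)  # snapshot with exactly m clusters
    return flat, merges


tokens = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
flat, merges = brown_clusters(tokens, 2)
print(flat)
```

The returned merge history records which two clusters were joined at each step; reading it backwards gives the binary tree from which bitstrings can be assigned.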
When dealing with a very large corpus, even the $O(N + nm^2)$ runtime is very time-consuming. There are alternatives, such as the method of Stratos et al. [2015], that obtain the clustering assignments at a lower computational cost. Word clusters have shown promising results in dependency parsing [Koo et al., 2008], as well as in part-of-speech tagging and named entity recognition [Turian et al., 2010]. They are usually used as additional features in a traditional classifier such as a log-linear model or the Perceptron algorithm.
Word Embeddings
Word embedding models embed each word into a $d$-dimensional vector by creating a matrix in $\mathbb{R}^{N \times d}$ from a vocabulary with $N$ words. The goal is to obtain a set of vectors for the words in the vocabulary such that semantically similar words have similar vectors. One famous model for obtaining word embeddings is the Skip-gram model by Mikolov et al. [2013b]: given a sequence of training words $w_1, \cdots, w_T$, the Skip-gram model maximizes the average log-probability of the data:
\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_O = w_{t+j} \mid w_I = w_t)
\]
where $c$ is the size of the training context. The probability $p(w_O \mid w_I)$ can be defined as proportional to the exponentiated dot product of the “word” representation of $w_O$ and the “context” representation of $w_I$. Defining two different vectors for context words and words gives the model more flexibility to distinguish between the word being targeted and its context:
\[
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w \in V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
\]
where $V$ is the vocabulary. The softmax function in the above equation normalizes over a large vocabulary $V$, which is computationally expensive and memory intensive. Mikolov et al. [2013b] instead approximate the softmax function by sampling $k$ words from the vocabulary:
\[
\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]
\]
where $\sigma$ is the sigmoid function, and $P_n(w)$ is the noise distribution from which the $k$ sample words are drawn. Mikolov et al. [2013b] empirically found that the unigram distribution raised to the 3/4 power, in which the probability of each word is proportional to its unigram count raised to the power 3/4, gives better performance than the uniform and the unmodified unigram distributions in standard word analogy and similarity tasks. With this approximate method, called negative sampling, we are able to maximize the likelihood of a pair of input and context words in contrast to $k$ noise context examples.
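The quantities of the Skip-gram model can be written down for a toy vocabulary. The sketch below is only an illustration in pure Python, not the training procedure of Mikolov et al. [2013b]: the vectors are random, the unigram counts are made up, each expectation is approximated by a single draw, and the noise terms use $\sigma(-x)$ as in the original formulation. All function names are hypothetical.

```python
import math
import random


def context_pairs(num_tokens, c):
    """All position pairs (t, t + j) with -c <= j <= c and j != 0."""
    return [(t, t + j) for t in range(num_tokens)
            for j in range(-c, c + 1)
            if j != 0 and 0 <= t + j < num_tokens]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_log_prob(o, i, v_in, v_out):
    """log p(w_O | w_I) with the full softmax over the vocabulary."""
    scores = [dot(v, v_in[i]) for v in v_out]
    return scores[o] - math.log(sum(math.exp(s) for s in scores))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, then normalized."""
    weights = [count ** power for count in counts]
    total = sum(weights)
    return [w / total for w in weights]


def log_sigmoid(x):
    return -math.log1p(math.exp(-x))


def negative_sampling_objective(o, i, v_in, v_out, noise_probs, k, rng):
    """log sigma(v'_O . v_I) plus k sampled noise terms log sigma(-v'_i . v_I)."""
    noise = rng.choices(range(len(v_out)), weights=noise_probs, k=k)
    positive = log_sigmoid(dot(v_out[o], v_in[i]))
    negative = sum(log_sigmoid(-dot(v_out[w], v_in[i])) for w in noise)
    return positive + negative


rng = random.Random(0)
counts = [5, 2, 2, 1, 1]  # made-up unigram counts for five word types
d = 4
v_in = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in counts]   # v_w
v_out = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in counts]  # v'_w

noise_probs = noise_distribution(counts)
objective = negative_sampling_objective(1, 0, v_in, v_out, noise_probs, k=3, rng=rng)
print(context_pairs(4, 1))
print(softmax_log_prob(1, 0, v_in, v_out), objective)
```

The full softmax touches every vector in the vocabulary for a single pair, whereas the negative sampling objective touches only the target word and the $k$ sampled noise words.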
The Skip-gram model is not the only way to obtain word embedding vectors. Other methods exist, such as the continuous bag-of-words (CBOW) model [Mikolov et al., 2013a], GloVe [Pennington et al., 2014], and the spectral method of Stratos et al. [2015]. It is not clear which model gives the best accuracy: it depends on the target task in which the embeddings are used.