
CHAPTER 6 : Identifying Hypernymy across Languages

6.3 The BiSparse-Dep Approach

This section proposes BiSparse-Dep, a family of approaches that uses bilingual, dependency-based word embeddings to detect hypernymy between words in different languages. An overview of BiSparse-Dep is shown in Figure 11. BiSparse-Dep has two key components:

(a) Dependency-based word representations, which enable generalization across languages with minimal customization by abstracting away language-specific word order.

(b) Bilingual sparse coding, which allows us to align dependency-based word representations in a shared semantic space using a small bilingual dictionary.

The resulting sparse bilingual embeddings can then be used with an unsupervised hypernymy detection measure (Section 6.3) to determine hypernymy between word pairs.

Dependency-based Word Representations

As discussed in Chapter 2, the context of a word can be described in multiple ways when learning distributional representations. One such context is defined using the syntactic neighborhood of the word in a dependency graph. For instance, for the sentence in Figure 12,

Figure 12: Dependency tree for “The tired traveler roamed the sandy desert, seeking food” (arc labels in the figure: det, amod, nsubj, dobj, advcl, amod, det, dobj).

the syntactic neighborhood for the target word traveler can be described in the following two ways:

• Full context (Padó and Lapata, 2007; Baroni and Lenci, 2010; Levy and Goldberg, 2014b): Children and parent words, concatenated with the label and direction of the relation (e.g., roamed#nsubj⁻¹ and tired#amod are contexts for traveler).

• Joint context (Chersoni et al., 2016): The parent concatenated with each sibling of the target word (e.g., roamed#desert and roamed#seeking are contexts for traveler).

Both context types encode directionality into the context, either through label direction or through sibling-parent relations. The two contexts exploit different amounts of syntactic information: unlike Full, Joint does not require labeled parses. Joint context combines parent and sibling information, while Full keeps them as distinct contexts.
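The two context types can be made concrete with a minimal sketch. It assumes a hand-written list of (head, label, dependent) arcs for the example sentence, uses word forms as node identifiers (so every word must be unique), and writes the inverse relation with an ASCII "-1" suffix. The function names are hypothetical and are not taken from the cited work.

```python
# Hand-written (head, label, dependent) arcs for the example sentence.
# Assumption: word forms are unique, so they can serve as node identifiers.
ARCS = [
    ("traveler", "det", "The"),
    ("traveler", "amod", "tired"),
    ("roamed", "nsubj", "traveler"),
    ("roamed", "dobj", "desert"),
    ("desert", "det", "the"),
    ("desert", "amod", "sandy"),
    ("roamed", "advcl", "seeking"),
    ("seeking", "dobj", "food"),
]


def full_contexts(arcs, target):
    """Children and parent of `target`, each tagged with label and direction."""
    contexts = []
    for head, label, dep in arcs:
        if head == target:
            # Child of the target: word#label
            contexts.append(f"{dep}#{label}")
        if dep == target:
            # Parent of the target: word#label-1 (inverse direction)
            contexts.append(f"{head}#{label}-1")
    return contexts


def joint_contexts(arcs, target):
    """Parent of `target` paired with each sibling; arc labels are ignored."""
    contexts = []
    for head, _, dep in arcs:
        if dep != target:
            continue
        for other_head, _, sibling in arcs:
            if other_head == head and sibling != target:
                contexts.append(f"{head}#{sibling}")
    return contexts


full = full_contexts(ARCS, "traveler")
joint = joint_contexts(ARCS, "traveler")
print(full)
print(joint)
```

Note that joint_contexts never reads the arc label, which mirrors the observation that Joint does not need labeled parses.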

Word representations learnt using the syntactic neighborhood of a word (such as the ones described above) are popularly called dependency-based word representations. Dependency-based word representations capture functional similarity (e.g., singing and rapping), in contrast to the topical similarity (e.g., singing and dancing) captured by lexical context (Levy and Goldberg, 2014b). Such dependency-based embeddings have been shown to outperform window-based embeddings on many tasks (Bansal et al., 2014; Hill et al., 2014; Melamud et al., 2016). In fact, for the monolingual hypernymy detection task, it has been shown that dependency embeddings can recover Hearst patterns (Roller and Erk, 2016), and are almost always superior to window-based embeddings (Shwartz et al., 2017).

Dependency Contexts without a Treebank

Using dependency contexts in multilingual settings may not always be possible, as large dependency-parsed corpora are hard to obtain. One can parse a raw corpus in the language of interest using a dependency parser, but pre-trained dependency parsers are available for only a handful of languages. Training a parser is often not feasible either, as the dependency treebanks used to supervise these parsers are not available for many languages. To circumvent these issues, a weak dependency parser is trained on languages related to the language of interest. Specifically, a delexicalized parser is trained using treebanks of related languages, where the word form features are turned off, so that the parser is trained on purely non-lexical features (e.g., POS tags). The rationale is that related languages share common syntactic structure that can be transferred to another language via delexicalized parsing (Zeman and Resnik, 2008; McDonald et al., 2011, inter alia).
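As an illustration of what "turning off" word form features can look like at the data level, the sketch below blanks the FORM and LEMMA columns of a treebank line. It assumes the treebank is stored in the 10-column, tab-separated CoNLL-U format, which the text above does not specify; the function name is hypothetical, and a real parser may instead disable lexical features through its own configuration.

```python
def delexicalize_conllu_line(line):
    """Replace FORM and LEMMA with '_' so only non-lexical columns remain.

    Assumption: `line` comes from a CoNLL-U file with the columns
    ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    """
    line = line.rstrip("\n")
    # Comment lines and sentence-separating blank lines are passed through.
    if not line or line.startswith("#"):
        return line
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # not a token line; leave untouched
    cols[1] = "_"  # FORM
    cols[2] = "_"  # LEMMA
    return "\t".join(cols)


token = "3\ttraveler\ttraveler\tNOUN\tNN\tNumber=Sing\t4\tnsubj\t_\t_"
delexicalized = delexicalize_conllu_line(token)
print(delexicalized)
```

After this step the POS tags, morphological features, heads, and relation labels are still present, so a parser trained on the output can only rely on non-lexical evidence.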

Unsupervised Hypernymy Detection Measure

When can one assert that a word y is a hypernym of a word x using their distributional representations? To answer this question, Weeds et al. (2004) and Geffet and Dagan (2005) formalized the following hypothesis.

The Distributional Inclusion Hypothesis

The distributional inclusion hypothesis (DIH) states that if a word y is a hypernym of a word x, then the set of distributional features of x is included in the set of distributional features of y (Weeds et al., 2004; Geffet and Dagan, 2005). That is, the contexts in which x occurs are a subset of those in which y occurs.

DIH was introduced by Weeds et al. (2004) as distributional generality for hypernymy detection, and was stated more generally for determining lexical entailment by Geffet and Dagan (2005). Intuitively, DIH states that a hypernym y can replace appearances of its hyponym x. For example, rodent can replace squirrel in the sentence “The squirrel is hibernating for the winter”. Notice that the reverse is not true: squirrel cannot replace rodent in a sentence like “The capybara is the largest living rodent”. To apply DIH to detect hypernymy, one needs to quantify the amount of overlap in the co-occurrences of x and y with other contexts.
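To make the notion of overlap concrete, the following toy sketch measures what fraction of one word's contexts also occur with another word. The context sets are invented for illustration, and this unweighted ratio is a simplification rather than the measure adopted in this section.

```python
def inclusion_ratio(x_contexts, y_contexts):
    """Fraction of the contexts of x that are also contexts of y."""
    x_contexts, y_contexts = set(x_contexts), set(y_contexts)
    if not x_contexts:
        return 0.0
    return len(x_contexts & y_contexts) / len(x_contexts)


# Invented toy contexts; a real system would read them from a corpus.
squirrel = {"hibernates", "eats", "climbs"}
rodent = {"hibernates", "eats", "climbs", "gnaws", "burrows"}

forward = inclusion_ratio(squirrel, rodent)   # squirrel's contexts inside rodent's
backward = inclusion_ratio(rodent, squirrel)  # rodent's contexts inside squirrel's
print(forward, backward)
```

Under DIH, the ratio is expected to be high in the hyponym-to-hypernym direction and lower in the reverse direction, which is the asymmetry a hypernymy measure needs to capture.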

Known Limitations. The intuition behind DIH is not always correct. For instance, Kartsaklis and Sadrzadeh (2016) noted that in sentences with quantifiers (e.g., “all”, “none”), replacing a word with its hypernym is not always appropriate: changing “all squirrels hibernate” to “all animals hibernate” yields a claim that no longer follows from the original. In such contexts, DIH fails. Similarly, Rimell (2014) noted that DIH is not correct in contexts that are collocational (e.g., “hot dog” to “hot animal”), highly specific (e.g., “squirrels eat nuts” to “animals eat nuts”), or when the hypernym being considered is too general (e.g., entity and squirrel).

Despite these limitations, DIH has enjoyed success in many lexical entailment applications, and several unsupervised hypernymy detection measures have been developed that appeal to it (Weeds and Weir, 2003; Weeds et al., 2004; Clarke, 2009; Lenci and Benotto, 2012).

The BalAPinc Measure. In this work, we use one such measure, BalAPinc, first described by Kotlerman et al. (2009), to score word pairs for hypernymy. BalAPinc is defined using two other measures: LIN (Lin, 1998b) and APinc (Kotlerman et al., 2009). The LIN measure defines a symmetric similarity score for a word pair (x, y):

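$$\mathrm{LIN}(x, y) = \frac{\sum_{f \in \mathbf{x} \cap \mathbf{y}} \left(\mathbf{x}[f] + \mathbf{y}[f]\right)}{\sum_{f \in \mathbf{x}} \mathbf{x}[f] + \sum_{f \in \mathbf{y}} \mathbf{y}[f]} \tag{6.1}$$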

where the vectors x and y are the representations of the words x and y respectively, f ∈ x and f ∈ y are the feature indices active in x and y respectively, and f ∈ x∩y are the feature indices that are active (i.e., have non-zero values) in both x and y. The value of a feature at index f is x[f]. The LIN measure computes the amount of information needed to describe the commonality of x and y.

On the other hand, APinc is an asymmetric score measuring a relevance-weighted overlap in the context co-occurrences of x and y, computed using a modified version of average precision:

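$$\mathrm{APinc}(x \rightarrow y) = \frac{\sum_{r} P(r) \cdot \mathrm{rel}(f)}{|\mathbf{x}|}, \quad \text{where } \mathrm{rel}(f) = \begin{cases} 1 - \dfrac{\mathrm{rank}(f, \mathbf{y})}{|\mathbf{y}| + 1} & f \in \mathbf{y} \\ 0 & \text{otherwise} \end{cases} \tag{6.2}$$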

Here the sum runs over the ranks r of the active features of x, with f the feature of x at rank r; rel(f) is a measure of the relevance of feature f; rank(f, y) is the rank of feature f (among all active features) in the representation of y; and P(r) is the precision at rank r for the features of y with respect to the features of x (i.e., the ratio between the number of features of x from rank 1 to r that are included in y, and r). APinc computes the proportion of the included (active) features of y, weighted by their relevance score rel(f), with respect to all active features of x. BalAPinc is defined as the geometric mean of these measures:

$$\mathrm{BalAPinc}(x \rightarrow y) = \sqrt{\mathrm{LIN}(x, y) \cdot \mathrm{APinc}(x \rightarrow y)} \tag{6.3}$$
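The three measures can be sketched in a few lines. The sketch assumes each sparse representation is stored as a dictionary that maps active feature indices to positive values, that features are ranked by decreasing value (ties broken by feature name), and that ranks start at 1. The toy vectors are invented, and the code is an illustration of Equations 6.1 to 6.3 rather than the implementation used in this work.

```python
import math


def lin(x, y):
    """Symmetric LIN similarity (Eq. 6.1) over dict-based sparse vectors."""
    shared = x.keys() & y.keys()
    numerator = sum(x[f] + y[f] for f in shared)
    denominator = sum(x.values()) + sum(y.values())
    return numerator / denominator if denominator else 0.0


def ranked(v):
    """Active features of v, sorted by decreasing value (assumed ranking)."""
    return sorted(v, key=lambda f: (-v[f], f))


def apinc(x, y):
    """Asymmetric APinc(x -> y) (Eq. 6.2)."""
    if not x:
        return 0.0
    rank_in_y = {f: r for r, f in enumerate(ranked(y), start=1)}
    total, included = 0.0, 0
    for r, f in enumerate(ranked(x), start=1):
        if f in rank_in_y:
            included += 1
            precision = included / r                    # P(r)
            rel = 1.0 - rank_in_y[f] / (len(y) + 1)     # rel(f)
            total += precision * rel
        # Features of x that are missing from y have rel(f) = 0.
    return total / len(x)


def balapinc(x, y):
    """Geometric mean of LIN and APinc (Eq. 6.3)."""
    return math.sqrt(lin(x, y) * apinc(x, y))


# Invented toy vectors: every feature of squirrel is also active for rodent.
squirrel = {"eats_nuts": 3.0, "hibernates": 2.0, "climbs": 1.0}
rodent = {"eats_nuts": 2.0, "hibernates": 2.0, "climbs": 1.0,
          "gnaws": 4.0, "burrows": 3.0}

print(balapinc(squirrel, rodent), balapinc(rodent, squirrel))
```

On these toy vectors the score is higher in the squirrel-to-rodent direction than in the reverse, which is the asymmetry the measure is meant to expose. The dictionary representation also shows why sparsity matters: the set of active features is simply the set of keys.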

Bilingual Sparse Coding

To compare the features of words x and y belonging to different languages, one first has to align the independently learnt dependency representations. Moreover, the BalAPinc metric described above requires word representations that are sparse, so that the active features used to compute the score can be identified. To achieve this, we generate BiSparse-Dep embeddings using the BiSparse framework from Vyas and Carpuat (2016). BiSparse generates sparse, bilingual word embeddings using a dictionary learning objective with a sparsity-inducing ℓ1 penalty. Appendix A.2 describes the learning objective in detail.

BiSparse takes as input pre-computed monolingual embeddings X_e and X_f for two languages, along with a translation matrix S, and outputs sparse matrices A_e and A_f that are bilingual representations in a shared semantic space. The translation matrix S (of size v_e × v_f) captures correspondences between the vocabularies (of size v_e and v_f) of the two languages. For instance, each row of S can be a one-hot vector that identifies the word in v_f that is most frequently aligned with the word in v_e for that row in a large parallel corpus, thus building a one-to-many mapping between the two languages.
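The one-hot construction of S described above can be sketched as follows. The sketch assumes the word alignment counts from the parallel corpus are already available as a v_e × v_f table of non-negative integers; the function name and the toy counts are hypothetical, and rows for words that were never aligned are left as all zeros (a choice made here, not stated in the text).

```python
def build_translation_matrix(align_counts):
    """Build S from a (v_e x v_f) table of alignment counts.

    Each row becomes a one-hot vector selecting the most frequently
    aligned word of the other language. Ties go to the lowest index.
    """
    S = []
    for row in align_counts:
        one_hot = [0] * len(row)
        if any(count > 0 for count in row):
            best = max(range(len(row)), key=row.__getitem__)
            one_hot[best] = 1
        S.append(one_hot)
    return S


# Invented counts: 3 words in the first language, 3 in the second.
counts = [
    [5, 1, 0],  # word 0 is aligned mostly with word 0
    [0, 0, 0],  # word 1 was never aligned
    [2, 2, 7],  # word 2 is aligned mostly with word 2
]
S = build_translation_matrix(counts)
print(S)
```

Several rows may select the same column, so one word in the second vocabulary can correspond to many words in the first.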
