• No results found

Measuring semantic content in distributional vectors

N/A
N/A
Protected

Academic year: 2020

Share "Measuring semantic content in distributional vectors"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

Measuring semantic content in distributional vectors

Aur´elie Herbelot EB Kognitionswissenschaft

Universit¨at Potsdam Golm, Germany

aurelie.herbelot@cantab.net

Mohan Ganesalingam Trinity College University of Cambridge

Cambridge, UK mohan0@gmail.com

Abstract

Some words are more contentful than oth-ers: for instance,makeis intuitively more general than produce and fifteen is more ‘precise’ than a group. In this paper, we propose to measure the ‘semantic con-tent’ of lexical items, as modelled by distributional representations. We inves-tigate the hypothesis that semantic con-tent can be computed using the Kullback-Leibler (KL) divergence, an information-theoretic measure of the relative entropy of two distributions. In a task focus-ing on retrievfocus-ing the correct orderfocus-ing of hyponym-hypernym pairs, the KL diver-gence achieves close to 80% precision but does not outperform a simpler (linguis-tically unmotivated) frequency measure. We suggest that this result illustrates the rather ‘intensional’ aspect of distributions. 1 Introduction

Distributional semantics is a representation of lex-ical meaning that relies on a statistlex-ical analysis of the way words are used in corpora (Curran, 2003; Turney and Pantel, 2010; Erk, 2012). In this framework, the semantics of a lexical item is accounted for by modelling its co-occurrence with other words (or any larger lexical context). The representation of a target word is thus a vector in a space where each dimension corresponds to a pos-sible context. The weights of the vector compo-nents can take various forms, ranging from sim-ple co-occurrence frequencies to functions such as Pointwise Mutual Information (for an overview, see (Evert, 2004)).

This paper investigates the issue of comput-ing the semantic content of distributional vectors.

That is, we look at the ways we can distribution-ally express thatmakeis a more general verb than

produce, which is itself more general than, for instance, weave. Although the task is related to the identification of hyponymy relations, it aims to reflect a more encompassing phenomenon: we wish to be able to compare the semantic content of words within parts-of-speech where the standard notion of hyponymy does not apply (e.g. preposi-tions: see withvs. next to orofvs. concerning) and across parts-of-speech (e.g.fifteenvs.group). The hypothesis we will put forward is that se-mantic content is related to notions of relative en-tropy found in information theory. More specif-ically, we hypothesise that the more specific a word is, the more the distribution of the words co-occurring with it will differ from the baseline distribution of those words in the language as a whole. (A more intuitive way to phrase this is that the more specific a word is, the more information it gives us about which other words are likely to occur near it.) The specific measure of difference that we will use is the Kullback-Leibler divergence of the distribution of words co-ocurring with the target word against the distribution of those words in the language as a whole. We evaluate our hy-pothesis against a subset of the WordNet hierar-chy (given by (Baroni et al, 2012)), relying on the intuition that in a hyponym-hypernym pair, the hy-ponym should have higher semantic content than its hypernym.

The paper is structured as follows. We first define our notion of semantic content and moti-vate the need for measuring semantic content in distributional setups. We then describe the im-plementation of the distributional system we use in this paper, emphasising our choice of weight-ing measure. We show that, usweight-ing the

(2)

nents of the described weighting measure, which are both probability distributions, we can calculate the relative entropy of a distribution by inserting those probability distributions in the equation for the Kullback-Leibler (KL) divergence. We finally evaluate the KL measure against a basic notion of frequency and conclude with some error analysis.

2 Semantic content

As a first approximation, we will define seman-tic content as informativeness with respect to de-notation. Following Searle (1969), we will take a ‘successful reference’ to be a speech act where the choice of words used by the speaker appropri-ately identifies a referent for the hearer. Glossing over questions of pragmatics, we will assume that a more informative word is more likely to lead to a successful reference than a less informative one. That is, if Kim owns a cat and a dog, the identify-ing expressionmy catis a better referent thanmy petand socatcan be said to have more semantic content thanpet.

While our definition relies on reference, it also posits a correspondence between actual utterances and denotation. Given two possible identifying ex-pressionse1 ande2,e1may be preferred in a

par-ticular context, and so, context will be an indicator of the amount of semantic content in an expres-sion. In Section 5, we will produce an explicit hypothesis for how the amount of semantic con-tent in a lexical item affects the contexts in which it appears.

A case where semantic content has a direct cor-respondence with a lexical relation is hyponymy. Here, the correspondence relies entirely on a basic notion of extension. For instance, it is clear that

hammeris more contentful thantool because the extension of hammeris smaller than that oftool, and therefore more discriminating in a given iden-tifying expression (SeeGive me the hammer ver-susGive me the tool). But we can also talk about semantic content in cases where the notion of ex-tension does not necessarily apply. For example, it is not usual to talk of the extension of a prepo-sition. However, in context, the use of a preposi-tion against another one might be more discrim-inating in terms of reference. Compare a)Sandy is with Kimand b)Sandy is next to Kim. Given a set of possible situations involving, say, Kim and Sandy at a party, we could show that b) is more discriminating than a), because it excludes the

sit-uations where Sandy came to the party with Kim but is currently talking to Kay at the other end of the room. The fact that next toexpresses physi-cal proximity, as opposed to just being in the same situation, confers it more semantic content accord-ing to our definition. Further still, there may be a need for comparing the informativeness of words across parts of speech (compareA group of/Fifteen people was/were waiting in front of the town hall). Although we will not discuss this in detail, there is a notion of semantic content above the word level which should naturally derive from compo-sition rules. For instance, we would expect the composition of a given intersective adjective and a given noun to result into a phrase with a seman-tic content greater than that of its components (or at least equal to it).

3 Motivation

The last few years have seen a growing interest in distributional semantics as a representation of lex-ical meaning. Owing to their mathematlex-ical inter-pretation, distributions allow linguists to simulate human similarity judgements (Lund, Burgess and Atchley, 1995), and also reproduce some of the features given by test subjects when asked to write down the characteristics of a given concept (Ba-roni and Lenci, 2008). In a distributional semantic space, for instance, the word ‘cat’ may be close to ‘dog’ or to ‘tiger’, and its vector might have high values along the dimensions ‘meow’, ‘mouse’ and ‘pet’. Distributional semantics has had great suc-cesses in recent years, and for many computational linguists, it is an essential tool for modelling phe-nomena affected by lexical meaning.

If distributional semantics is to be seen as a general-purpose representation, however, we should evaluate it across all properties which we deem relevant to a model of the lexicon. We con-sider semantic content to be one such property. It underlies the notion of hyponymy and naturally models our intuitions about the ‘precision’ (as op-posed to ‘vagueness’) of words.

(3)

from content words might, in the end, involve tak-ing into account how much ‘descriptive’ content they have.

4 An implementation of a distributional system

The distributional system we implemented for this paper is close to the system of Mitchell and La-pata (2010) (subsequently M&L). As background data, we use the British National Corpus (BNC) in lemmatised format. Each lemma is followed by a part of speech according to the CLAWS tagset for-mat (Leech, Garside, and Bryant, 1994). For our experiments, we only keep the first letter of each part-of-speech tag, thus obtaining broad categories such as N or V. Furthermore, we only retain words in the following categories: nouns, verbs, adjec-tives and adverbs (punctuation is ignored). Each article in the corpus is converted into a 11-word window format, that is, we are assuming that con-text in our system is defined by the five words pre-ceding and the five words following the target.

To calculate co-occurrences, we use the follow-ing equations:

freqci =

X

t

freqci,t (1)

freqt = X

ci

freqci,t (2)

freqtotal = X

ci,t

freqci,t (3)

The quantities in these equations represent the following:

freqci,t frequency of the context wordci

with the target wordt

freqtotal total count of word tokens

freqt frequency of the target wordt freqci frequency of the context wordci

As in M&L, we use the 2000 most frequent words in our corpus as the semantic space dimen-sions. M&L calculate the weight of each context term in the distribution as follows:

vi(t) = p(ci|t)

p(ci)

= f reqci,t×f reqtotal

f reqt×f reqci

(4)

We will not directly use the measure vi(t)as it

is not a probability distribution and so is not suit-able for entropic analysis; instead our analysis will

be phrased in terms of the probability distributions p(ci|t)andp(ci)(the numerator and denominator

invi(t)).

5 Semantic content as entropy: two measures

Resnik (1995) uses the notion of information con-tent to improve on the standard edge counting methods proposed to measure similarity in tax-onomies such as WordNet. He proposes that the information content of a termtis given by the self-information measure−logp(t). The idea behind this measure is that, as the frequency of the term increases, its informativeness decreases. Although a good first approximation, the measure cannot be said to truly reflect our concept of semantic con-tent. For instance, in the British National Corpus,

timeandseeare more frequent thanthingormay

andmanis more frequent thanpart. However, it seems intuitively right to say that time, see and

manare more ‘precise’ concepts than thing, may

andpartrespectively. Or said otherwise, there is no indication that more general concepts occur in speech more than less general ones. We will there-fore consider self-information as a baseline.

As we expect more specific words to be more informative about which words co-occur with them, it is natural to try to measure the specificity of a word by using notions from information the-ory to analyse the probability distribution p(ci|t) associated with the word. The standard notion of entropy is not appropriate for this purpose, be-cause it does not take account of the fact that the words serving as semantic space dimensions may have different frequencies in language as a whole, i.e. of the fact thatp(ci)does not have a uniform

distribution. Instead we need to measure the de-gree to whichp(ci|t)differs from the context word distributionp(ci). An appropriate measure for this

is the Kullback-Leibler (KL) divergence or rela-tive entropy:

DKL(PkQ) = X

i

ln(P(i)

Q(i))P(i) (5)

By takingP(i)to bep(ci|t)andQ(i)to bep(ci)

(as given by Equation 4), we calculate the rela-tive entropy ofp(ci|t) andp(ci). The measure is

clearly informative: it reflects the way thatt mod-ifies the expectation of seeing ci in the corpus.

(4)

more ‘distorted’ distribution p(ci|t) and that the KL divergence will reflect this.1

6 Evaluation

In Section 2, we defined semantic content as a no-tion encompassing various referential properties, including a basic concept of extension in cases where it is applicable. However, we do not know of a dataset providing human judgements over the general informativeness of lexical items. So in or-der to evaluate our proposed measure, we inves-tigate its ability to retrieve the right ordering of hyponym pairs, which can be considered a subset of the issue at hand.

Our assumption is that if X is a hypernym of Y, then the information content inXwill be lower than inY (because it has a more ‘general’ mean-ing). So, given a pair of words {w1, w2} in a

known hyponymy relation, we should be able to tell which of w1 orw2 is the hypernym by

com-puting the respective KL divergences.

We use the hypernym data provided by (Baroni et al, 2012) as testbed for our experiment.2 This

set of hyponym-hypernym pairs contains 1385 in-stances retrieved from the WordNet hierarchy. Be-fore running our system on the data, we make slight modifications to it. First, as our distributions are created over the British National Corpus, some spellings must be converted to British English: for instance,coloris replaced bycolour. Second, five of the nouns included in the test set are not in the BNC. Those nouns are brethren, intranet, iPod, webcamand IX. We remove the pairs containing those words from the data. Third, numbers such as

elevenorsixtyare present in the Baroni et al set as nouns, but not in the BNC. Pairs containing seven such numbers are therefore also removed from the data. Finally, we encounter tagging issues with three words, which we match to their BNC equiv-alents:acousticsandannalsare matched to acous-ticandannal, andtrousertotrousers. These mod-ifications result in a test set of 1279 remaining pairs.

We then calculate both the self-information measure and the KL divergence of all terms

in-1Note that KL divergence is not symmetric:

DKL(p(ci|t)kp(ci))) is not necessarily equal to

DKL(p(ci)kp(ci|t)). The latter is inferior as a few

very small values ofp(ci|t)can have an inappropriately large

effect on it.

2The data is available at http://clic.cimec.

unitn.it/Files/PublicData/eacl2012-data. zip.

cluded in our test set. In order to evaluate the sys-tem, we record whether the calculated entropies match the order of each hypernym-hyponym pair. That is, we count a pair as correctly represented by our system if w1 is a hypernym of w2 and

KL(w1) < KL(w2) (or, in the case of the

baseline, SI(w1) < SI(w2) where SI is

self-information).

Self-information obtains 80.8% precision on the task, with the KL divergence lagging a little be-hind with 79.4% precision (the difference is not significant). In other terms, both measures per-form comparably. We analyse potential reasons for this disappointing result in the next section.

7 Error analysis

It is worth reminding ourselves of the assumption we made with regard to semantic content. Our hypothesis was that with a ‘more general’ target word t, the p(ci|t) distribution would be fairly similar top(ci).

Manually checking some of the pairs which were wrongly classified by the KL divergence re-veals that our hypothesis might not hold. For ex-ample, the pair beer – beverage is classified in-correctly. When looking at the beverage distri-bution, it is clear that it does not conform to our expectations: it shows high vi(t) weights along

thefood, wine, coffeeandteadimensions, for in-stance, i.e. there is a large difference between p(cf ood) and p(cf ood|t), etc. Although beverage

is an umbrella word for many various types of drinks, speakers of English use it in very partic-ular contexts. So, distributionally, it isnota ‘gen-eral word’. Similar observations can be made for, e.g. liquid(strongly associated withgas, presum-ably via coordination),anniversary(linked to the verbmarkor the nounsilver), or againprojectile

(co-occurring withweapon,motionandspeed). The general point is that, as pointed out else-where in the literature (Erk, 2013), distributions are a good representation of (some aspects of) in-tension, but they are less apt to model extension.3

So a term with a large extension like beverage

may have a more restricted (distributional) inten-sion than a word with a smaller exteninten-sion, such as

3We qualify ‘intension’ here, because in the sense of a

(5)

beer.4

Contributing to this issue, fixed phrases, named entities and generally strong collocations skew our distributions. So for instance, in thejewelry distri-bution, the most highly weighted context ismental

(withvi(t) = 395.3) because of the music album Mental Jewelry. While named entities could eas-ily be eliminated from the system’s results by pre-processing the corpus with a named entity recog-niser, the issue is not so simple when it comes to fixed phrases of a more compositional nature (e.g.

army ant): excluding them might be detrimental for the representation (it is, after all, part of the meaning ofantthat it can be used metaphorically to refer to people) and identifying such phrases is a non-trivial problem in itself.

Some of the errors we observe may also be related to word senses. For instance, the word

medium, to be found in the pair magazine – medium, can be synonymous with middle, clair-voyant or again mode of communication. In the sense of clairvoyant, it is clearly more specific than in the sense intended in the test pair. As dis-tributions do not distinguish between senses, this will have an effect on our results.

8 Conclusion

In this paper, we attempted to define a mea-sure of distributional semantic content in or-der to model the fact that some words have a more general meaning than others. We com-pared the Kullback-Leibler divergence to a sim-ple self-information measure. Our experiments, which involved retrieving the correct ordering of hyponym-hypernym pairs, had disappointing re-sults: the KL divergence was unable to outperform self-information, and both measures misclassified around 20% of our testset.

Our error analysis showed that several factors contributed to the misclassifications. First, distri-butions are unable to model extensional properties which, in many cases, account for the feeling that a word is more general than another. Second, strong collocation effects can influence the measurement of information negatively: it is an open question which phrases should be considered ‘words-with-spaces’ when building distributions. Finally,

dis-4Although it is more difficult to talk of the extension of

e.g. adverbials (very) or some adjectives (skillful), the general point is that text is biased towards a certain usage of words, while the general meaning a competent speaker ascribes to lexical items does not necessarily follow this bias.

tributional representations do not distinguish be-tween word senses, which in many cases is a de-sirable feature, but interferes with the task we sug-gested in this work.

To conclude, we would like to stress that we do not think another information-theoretic measure would perform hugely better than the KL diver-gence. The point is that the nature of distributional vectors makes them sensitive to word usage and that, despite the general assumption behind dis-tributional semantics, word usage might not suf-fice to model all aspects of lexical semantics. We leave as an open problem the issue of whether a modified form of our ‘basic’ distributional vectors would encode the right information.

Acknowledgements

This work was funded by a postdoctoral fellow-ship from the Alexander von Humboldt Founda-tion to the first author, and a Title A Fellowship from Trinity College, Cambridge, to the second author.

References

Baroni, Marco, and Lenci, Alessandro. 2008. Con-cepts and properties in word spaces. In Alessan-dro Lenci (ed.),From context to meaning: Distribu-tional models of the lexicon in linguistics and cog-nitive science(Special issue of the Italian Journal of Linguistics 20(1)), pages 55–88.

Baroni, Marco, Raffaella Bernardi, Ngoc-Quynh Do and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceed-ings of the 13th Conference of the European Chap-ter of the Association for Computational Linguistics (EACL2012), pages 23–32.

Baroni, Marco, Raffaella Bernardi, and Roberto Zam-parelli. 2012. Frege in Space: a Program for Com-positional Distributional Semantics. Under review.

Curran, James. 2003. From Distributional to Semantic

Similarity. Ph.D. thesis, University of Edinburgh, Scotland, UK.

Erk, Katrin. 2012. Vector space models of word

mean-ing and phrase meanmean-ing: a survey. Language and

Linguistics Compass, 6:10:635–653.

Erk, Katrin. 2013. Towards a semantics for distribu-tional representations. InProceedings of the Tenth International Conference on Computational Seman-tics (IWCS2013).

Evert, Stefan. 2004. The statistics of word

(6)

Leech, Geoffrey, Roger Garside, and Michael Bryant. 1994. Claws4: The tagging of the british national

corpus. In Proceedings of the 15th International

Conference on Computational Linguistics (COLING 94), pages 622–628, Kyoto, Japan.

Lund, Kevin, Curt Burgess, and Ruth Ann Atchley. 1995. Semantic and associative priming in

high-dimensional semantic space. InProceedings of the

17th annual conference of the Cognitive Science So-ciety, Vol. 17, pages 660–665.

McNally, Louise. 2013. Formal and distributional

se-mantics: From romance to relationship. In

Proceed-ings of the ‘Towards a Formal Distributional Seman-tics’ workshop, 10th International Conference on Computational Semantics (IWCS2013), Potsdam, Germany. Invited talk.

Mitchell, Jeff and Mirella Lapata. 2010. Composition

in Distributional Models of Semantics. Cognitive

Science, 34(8):1388–1429, November.

Resnik, Philipp. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Con-ference on Artificial Intelligence (IJCAI-95), pages 448–453.

Searle, John R. 1969. Speech acts: An essay in the phi-losophy of language. Cambridge University Press.

References

Related documents

[r]

Do not try more times if the terminal declaims the debit card as it will cost you every time Always store all the receipts and get as much information as you can from the customer

The impacts on air environment from a mining activity depend on various factors like production capacity, machinery involved, operations and maintenance of various equipments

In the case of a group insurance contract concluded by a policy- holder which is a legal person and under which insurance protection is afforded to members of a group with regard

Through an interpretive design, we suggest that the SMEs need to have a strategic and incremental intent, understand their organizational structure, understand the external

in most of the applications, RFID tags respond to the reader’s query automatically, without authenticating the reader (only the tag authenticates itself). interaction usually

As a newly appointed school guidance counsellor, I found myself witnessing the stories and experiences of a small group of Tongan students in ways that had not been available to me as

There are software companies that have the technical ability to build a mobile application but no knowledge or experience on how to use it effectively as a marketing tool.. Then