Distributional Similarity

(1)

Overview and Applications

Tim Van de Cruys
Alpage, INRIA & Paris VII

Séminaire CENTAL

(2)

Outline

1 Introduction
2 Methodology
    Similarity
    Context
    Dimensionality reduction
        Latent semantic analysis
        Non-negative matrix factorization
        Tensors
3 Applications
    MWE extraction

(3)

Semantic similarity

Most work on semantic similarity relies on the distributional hypothesis (Harris 1954)

Take a word and its contexts:

tasty tnassiorc
greasy tnassiorc
tnassiorc with butter
tnassiorc for breakfast

(4)

Semantic similarity

tasty tnassiorc

⇒ FOOD

(7)

Matrix

Capture co-occurrence frequencies of two entities

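          rouge   délicieux   rapide   d’occasion
pomme       2         1          0          0
vin         2         2          0          0
voiture     1         0          1          2

(rows: pomme ‘apple’, vin ‘wine’, voiture ‘car’; columns: rouge ‘red’, délicieux ‘delicious’, rapide ‘fast’, d’occasion ‘second-hand’)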

(8)

Matrix

          rouge   délicieux   rapide   d’occasion
pomme       7         9          0          0
vin        12         6          0          0
voiture     7         0          8          4

(9)

Matrix

          rouge   délicieux   rapide   d’occasion
pomme      56        98          0          0
vin        44        34          0          0
voiture    23         0         31         39

(10)

Matrix

          rouge   délicieux   rapide   d’occasion
pomme      728       592         1          0
vin       1035       437         0          2
voiture    392         0       487        370

(11)

Similarity calculation

Cosine

\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}

Examples:

cos(pomme, vin) = .96

cos(pomme, voiture) = .42

Other possibilities:
    set-theoretic measures
        Dice
        Jaccard
    probabilistic measures
        Kullback-Leibler divergence
        Jensen-Shannon divergence
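The two example values can be checked against the largest co-occurrence matrix shown earlier. The following is a minimal sketch with NumPy; the dictionary `vectors` and the helper `cosine` are names introduced here for illustration only.

```python
import numpy as np

# Counts copied from the largest matrix on the slides.
# Columns: rouge, délicieux, rapide, d'occasion
vectors = {
    "pomme":   np.array([728, 592, 1, 0], dtype=float),
    "vin":     np.array([1035, 437, 0, 2], dtype=float),
    "voiture": np.array([392, 0, 487, 370], dtype=float),
}

def cosine(x, y):
    """Cosine of the angle between two co-occurrence vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(round(cosine(vectors["pomme"], vectors["vin"]), 2))      # 0.96
print(round(cosine(vectors["pomme"], vectors["voiture"]), 2))  # 0.42
```

Because the cosine normalizes by vector length, it compares the shape of the context distributions rather than raw frequency, which is why the frequent noun vin and the less frequent pomme still come out as close.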

(12)

Different kinds of context

Three different word space models based on context:

    document-based model (nouns × documents)
    window-based model (nouns × context words)
    syntax-based model (nouns × dependency relations)

Each model comes with a plethora of parameters!

    document size, window size, type of dependency relations
    weighting function

(14)

Different kinds of semantic similarity

‘tight’, synonym-like similarity: (near-)synonymous or (co-)hyponymous

loosely related, topical similarity: more loose relationships, such as association and meronymy

Example

médecin ‘doctor’: docteur ‘doctor’, médecin de famille ‘family doctor’, chirurgien ‘surgeon’, spécialiste ‘specialist’, dermatologue ‘dermatologist’, gynécologue ‘gynaecologist’

médecin ‘doctor’: malade ‘patient’, maladie ‘disease’,

(16)

Relation context – similarity

Different contexts lead to different kinds of similarity

Syntax, small window ↔ large window, documents

The former models induce tight, synonym-like similarity

The latter models induce topical relatedness

Evaluation

The syntax-based model scores best when evaluated with WordNet similarity measures (Cornetto)

Large-window and document-based models do not score well on WordNet similarity, but hold up on the WordNet domain evaluation

(17)

Introduction

Two reasons for performing dimensionality reduction:

Intractable computations
    when the number of elements and the number of features are too large, similarity computations may become intractable
    reducing the number of features makes computation tractable again

Generalization capacity
    the dimensionality reduction is able to describe the data better, or is able to capture intrinsic semantic features
    dimensionality reduction is able to improve the results (countering data sparseness and noise)

(18)

Latent semantic analysis: introduction

Application of a mathematical/statistical technique to simulate how humans learn the semantics of words

LSA finds ‘latent semantic dimensions’ according to which words and documents can be identified

(19)

Latent semantic analysis: introduction

What is latent semantic analysis, technically speaking?

The application of singular value decomposition to a term-document matrix

(20)

Singular value decomposition

Find the dimensions that explain most variance by solving a number of eigenvector problems
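As an illustration of the idea, here is a minimal sketch of a truncated SVD with NumPy, applied to the small noun-by-adjective matrix from the earlier slides rather than to a real term-document matrix. The function name `truncated_svd` and the choice k = 2 are assumptions made for this example.

```python
import numpy as np

def truncated_svd(A, k):
    """Keep only the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy matrix (rows: pomme, vin, voiture), reused from the slides.
A = np.array([[728, 592, 1, 0],
              [1035, 437, 0, 2],
              [392, 0, 487, 370]], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)

# Rank-2 reconstruction: the best fit to A in the least-squares sense.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Reduced representation of each row (noun) in the 2 latent dimensions.
row_vectors = U_k * s_k
print(row_vectors.shape)  # (3, 2)
```

The singular vectors are eigenvectors of A·Aᵀ and Aᵀ·A, which is the sense in which the decomposition amounts to solving eigenvector problems. Note that the reduced vectors may contain negative values.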

(21)

(22)

LSA: Example

LSA in (part of) Twente Nieuws Corpus: 10 years of Dutch newspaper texts (AD, NRC, TR, VK, PAR)

terms = nouns

documents = paragraphs

20,000 terms × 2,000,000 documents matrix, reduced to 300 dimensions

(23)

LSA: criticism

LSA has been criticized for a number of reasons:

    the dimensionality reduction is the best fit in a least-squares sense, but this is only valid for normally distributed data; language data is not normally distributed

    dimensions may contain negative values; it is not clear what negativity on a semantic scale should designate

These shortcomings are remedied by subsequent techniques (PLSA, LDA, NMF, ...)

(24)

Non-negative matrix factorization: technique

Given a non-negative matrix V, find non-negative matrix factors W and H such that:

V_{n \times m} \approx W_{n \times r} H_{r \times m}    (1)

Choosing r ≪ n, m reduces the data

Constraint on factorization: all values in the three matrices need to be non-negative (≥ 0)

Constraint brings about a parts-based representation: only

(25)

Non-negative matrix factorization: technique

Different kinds of NMF that minimize different cost functions:

    Square of Euclidean distance
    Kullback-Leibler divergence
    ⇒ better suited for language phenomena

To find an NMF is to minimize D(V ‖ WH) with respect to W and H, subject to the constraints W, H ≥ 0

This can be done with update rules

H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}}{\sum_k W_{ka}} \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} \frac{V_{i\mu}}{(WH)_{i\mu}}}{\sum_v H_{av}}    (2)

these update rules converge to a local minimum in the
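The multiplicative update rules can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used for the experiments: the random initialization, the fixed iteration count, and the small constant `eps` (added to avoid division by zero) are all assumptions of this example.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence D(V || WH); 0 * log(0) is treated as 0."""
    mask = V > 0
    return float(np.sum(V[mask] * np.log(V[mask] / (WH[mask] + eps)))
                 - V.sum() + WH.sum())

def nmf_kl(V, r, n_iter=200, seed=0, eps=1e-12):
    """Factorize V (n x m) into W (n x r) and H (r x m), all non-negative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        # H update: numerator sums over rows i, denominator is column sums of W
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # W update: numerator sums over columns mu, denominator is row sums of H
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

# Toy matrix reused from the earlier slides (rows: pomme, vin, voiture).
V = np.array([[728, 592, 1, 0],
              [1035, 437, 0, 2],
              [392, 0, 487, 370]], dtype=float)
W, H = nmf_kl(V, r=2)
print(kl_divergence(V, W @ H))
```

Because every update only multiplies by non-negative ratios, W and H stay non-negative throughout, so no explicit projection step is needed.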

(26)

(27)

NMF: results

Context vectors (5k nouns × 2k co-occurring nouns) extracted from the CLEF corpus

NMF is able to capture clear semantic dimensions

Examples:

bus ‘bus’, taxi ‘taxi’, trein ‘train’, halte ‘stop’, reiziger ‘traveler’, perron ‘platform’, tram ‘tram’, station ‘station’, chauffeur ‘driver’, passagier ‘passenger’

bouillon ‘broth’, slagroom ‘cream’, ui ‘onion’, eierdooier ‘egg yolk’, laurierblad ‘bay leaf’, zout ‘salt’, deciliter ‘decilitre’,

(29)

Two-way vs. three-way

all methods use two-way co-occurrence frequencies → matrix

suitable for two-way problems
    words × documents
    nouns × dependency relations

not suitable for n-way problems → tensor
    words × documents × authors
    verbs × subjects × direct objects

(30)

Non-negative tensor factorization: technique

Idea similar to non-negative matrix factorization

Calculations are different

\min_{x_i \in \mathbb{R}^{D_1}_{\geq 0},\; y_i \in \mathbb{R}^{D_2}_{\geq 0},\; z_i \in \mathbb{R}^{D_3}_{\geq 0}} \| T - \sum^{k} \ldots

(31)

(32)

Introduction

Task: automatic extraction of multi-word expressions (MWEs) from large corpora

Starting point: many MWEs are non-compositional, i.e. the meaning of the MWE is not the sum of the meanings of the individual words

Intuition: a noun within an MWE cannot easily be replaced by a semantically similar noun

→ Use of semantic clusters to determine whether a

(33)

Intuition

break the ice    ←→    break the vase
      *snow                      cup
      *hail                      dish

In the first expression, it is not possible to replace ice with semantically related nouns such as snow or hail;

in the second expression, it is possible to replace vase with

(34)

Overview of the method

verb + prepositional complement instances are extracted from a 500M word corpus (focus on MWEs with a PP)

a matrix of 5K verb-preposition combinations × 10K nouns is created

the 10K nouns are automatically clustered using distributional similarity measures

a number of statistical measures are applied to determine ‘unique associations’ given the cluster in which a noun appears

(35)

Measures

inspired by selectional preferences (Resnik 1993), entropy-based

Kullback-Leibler divergence between P(n) and P(n|v)

S_v = \sum_n p(n \mid v) \log \frac{p(n \mid v)}{p(n)}

(36)

Measures

The preference of the verb for the noun → [0,1]

A_{v \rightarrow n} = \frac{p(n \mid v) \log \frac{p(n \mid v)}{p(n)}}{S_v}    (4)

Ratio of verb preference for a particular noun, compared to other nouns in the cluster → [0,1]

R_{v \rightarrow n} = \frac{A_{v \rightarrow n}}{\sum \ldots}
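The first two quantities, the selectional preference strength S_v and the association A_{v→n}, can be computed directly from a verb-by-noun count matrix. The sketch below uses made-up counts, and the function name `selectional_association` is introduced here for illustration. The ratio measure R is left out because its denominator is not fully given above.

```python
import numpy as np

def selectional_association(counts):
    """counts[v, n] = co-occurrence frequency of verb v with noun n.

    Returns S_v (one value per verb) and A[v, n].
    Assumes every verb row has at least one non-zero count and S_v > 0.
    """
    counts = np.asarray(counts, dtype=float)
    p_n = counts.sum(axis=0) / counts.sum()                   # prior p(n)
    p_n_given_v = counts / counts.sum(axis=1, keepdims=True)  # p(n | v)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Per-noun contribution p(n|v) log(p(n|v) / p(n)); 0 log 0 is taken as 0.
        terms = np.where(p_n_given_v > 0,
                         p_n_given_v * np.log(p_n_given_v / p_n),
                         0.0)
    S_v = terms.sum(axis=1)   # KL divergence between p(n | v) and p(n)
    A = terms / S_v[:, None]  # each noun's share of the verb's preference strength
    # Terms are negative where p(n|v) < p(n); no clipping is applied in this sketch.
    return S_v, A

# Hypothetical counts: 2 verbs x 3 nouns.
S_v, A = selectional_association([[10, 0, 1],
                                  [1, 8, 3]])
print(S_v)
print(A)
```

By construction, the association values of one verb sum to 1 over all nouns, so A expresses how much of the verb's overall preference strength is attributable to each noun.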

(37)

Measures

The preference of the noun for the verb → [0,1]

A_{n \rightarrow v} = \frac{p(v \mid n) \log \frac{p(v \mid n)}{p(v)}}{S_n}    (6)

Ratio of noun preference for a particular verb, compared to other nouns in the cluster → [0,1]

R_{n \rightarrow v} = \frac{A_{n \rightarrow v}}{\sum \ldots}

(38)

An elaborated example

in de smaak vallen       ←→    in de put vallen
      *geur                          kuil
      *voorkeur                      krater
      *stijl                         greppel

in the taste fall              in the well fall
‘to be appreciated’            ‘to fall down the well’
      *smell                         hole
      *preference                    crater

(39)

An elaborated example

smaak: idioom, karakter, persoonlijkheid, stijl, temperament, thematiek, uiterlijk, uitstraling, voorkomen

MWE candidate      (1)    (2)    (3)    (4)    MWE?
val#in smaak       .12    1.00   .07    1.00   yes
val#in karakter    .00    .00    .00    .00    no

(40)

An elaborated example

put: gaatje, gat, kloof, krater, kuil, lek, scheur, valkuil

MWE candidate      (1)    (2)    (3)    (4)    MWE?
val#in put         .00    .04    .10    .05    no
val#in kuil        .00    .11    .10    .38    no

(41)

Quantitative Evaluation

Fully automated, compared to Referentie Bestand Nederlands (RBN)

Upper bound consists of all RBN MWEs present in the data

Parameters                   N       Prec (%)   Rec (%)   F-measure (%)
(1)    (2)    (3)    (4)
.10    .80    –      –       2916    24.66      16.64     19.87
.10    .80    .01    .80     1694    30.99      12.15     17.45
Fazly/Stevenson              3470    16.89      13.56     15.04
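The F-measure column can be reproduced from the precision and recall columns, assuming it is the balanced harmonic mean (the slide does not state the weighting). A minimal sketch, with `f_measure` as a helper name introduced here:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (both in %)."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(24.66, 16.64), 2))  # 19.87
print(round(f_measure(16.89, 13.56), 2))  # 15.04
```

These match the first and last rows of the table. Recomputing the middle row from the rounded figures gives a value that differs in the last digit, presumably because the reported score was computed from unrounded precision and recall.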

(42)

Conclusion

The non-compositionality-based algorithm is able to rule out expressions that are labelled as MWEs by traditional algorithms, improving on the state of the art

Using measures (1) and (2) gives the best results; using (3) and (4) increases precision but degrades recall

(49)

Ambiguity

Problem: ambiguity

bar

↔

Main research question: can ‘topical’ similarity and tight, synonym-like similarity be combined to differentiate between various senses of a word?

(50)

Methodology

Goal: classification of nouns according to both window-based context (with large window) and syntactic context

⇒ Construct three matrices capturing co-occurrence frequencies for each mode
    nouns cross-classified by dependency relations
    nouns cross-classified by (bag-of-words) context words
    dependency relations cross-classified by context words

⇒ Apply NMF to the matrices, but interleave the process
    the result of the former factorization is used to initialize the factorization of the next one

(51)

Graphical Representation

[Figure: the three interleaved factorizations, each with 50 latent dimensions. A (nouns × dependency relations) = W × H; B (nouns × context words) = V × G; C (context words × dependency relations) = U × F. Sizes: 5k nouns, 80k dependency relations, 2k context words.]

(52)

Sense subtraction

‘switch off’ dimension(s) of an ambiguous word to reveal other possible senses

From matrix W, we know which dimensions are the most important for a certain word

Matrix H gives the importance of each dependency relation given a dimension

‘subtract’ dependency relations that are responsible for a given dimension from the original noun vector

\vec{v}_{new} = \vec{v}_{orig} (\vec{1} - \vec{h}_{dim})

each dependency relation is multiplied by a scaling factor, according to the load of the feature on the subtracted
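The subtraction formula is an element-wise product, which the following sketch spells out. The vectors are made-up toy values, and rescaling the dimension's loads to [0, 1] by dividing by their maximum is an assumption of this example; the slide does not say how the scaling factor is normalized.

```python
import numpy as np

def subtract_dimension(v_orig, h_dim):
    """Down-weight the features of v_orig that load on the subtracted dimension."""
    h_scaled = h_dim / h_dim.max()    # assumption: loads rescaled to [0, 1]
    return v_orig * (1.0 - h_scaled)  # element-wise: v_orig * (1 - h_dim)

# Hypothetical noun vector over 4 dependency relations, and the row of H
# belonging to the dimension that is being switched off.
v_orig = np.array([4.0, 2.0, 6.0, 1.0])
h_dim = np.array([0.0, 0.5, 2.0, 1.0])

v_new = subtract_dimension(v_orig, h_dim)
print(v_new)  # [4.  1.5 0.  0.5]
```

Features that do not load on the subtracted dimension keep their original weight, while the feature with the highest load is removed entirely, so the similarity computation on the new vector is driven by the remaining senses.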

(53)

Combination with clustering

A simple clustering algorithm (k-means) assigns each ambiguous noun to its predominant sense

The centroid of the cluster is folded into the NMF model

The dimensions that define the centroid are subtracted from the ambiguous noun vector

(54)

Experimental Design

Approach applied to Dutch, using the Twente Nieuws Corpus (±500M words)

Corpus parsed with the Dutch dependency parser Alpino

three matrices constructed with:
    5k nouns × 80k dependency relations
    5k nouns × 2k context words
    80k dependency relations × 2k context words

(55)

Example dimension: transport

1 nouns: auto ‘car’, wagen ‘car’, tram ‘tram’, motor ‘motorbike’, bus ‘bus’, metro ‘subway’, automobilist ‘driver’, trein ‘train’, stuur ‘steering wheel’, chauffeur ‘driver’

2 context words: auto ‘car’, trein ‘train’, motor ‘motorbike’, bus ‘bus’, rij ‘drive’, chauffeur ‘driver’, fiets ‘bike’, reiziger ‘traveler’, passagier ‘passenger’, vervoer ‘transport’

3 dependency relations: viertraps_adj ‘four pedal’, verplaats met_obj ‘move with’, toeter_adj ‘honk’, tank in houd_obj [parsing error], tank_subj ‘refuel’, tank_obj ‘refuel’, rij voorbij_subj ‘pass by’, rij voorbij_adj ‘pass by’,

(56)

Pop: most similar words

pop music ↔ doll

1 pop, rock, jazz, meubilair ‘furniture’, popmuziek ‘pop music’, heks ‘witch’, speelgoed ‘toy’, kast ‘cupboard’, servies ‘[tea] service’, vraagteken ‘question mark’

2 pop, meubilair ‘furniture’, speelgoed ‘toy’, kast ‘cupboard’, servies ‘[tea] service’, heks ‘witch’, vraagteken ‘question mark’, sieraad ‘jewel’, sculptuur ‘sculpture’, schoen ‘shoe’

3 pop, rock, jazz, popmuziek ‘pop music’, heks ‘witch’, danseres ‘dancer’, servies ‘[tea] service’, kopje ‘cup’, house ‘house

(57)

Barcelona: most similar words

Spanish city ↔ Spanish football club

1 Barcelona, Arsenal, Inter, Juventus, Vitesse, Milaan ‘Milan’, Madrid, Parijs ‘Paris’, Wenen ‘Vienna’, München ‘Munich’

2 Barcelona, Milaan ‘Milan’, München ‘Munich’, Wenen ‘Vienna’, Madrid, Parijs ‘Paris’, Bonn, Praag ‘Prague’, Berlijn ‘Berlin’, Londen ‘London’

3 Barcelona, Arsenal, Inter, Juventus, Vitesse, Parma,

(58)

Clustering example: werk

1 werk ‘work’, beeld ‘image’, foto ‘photo’, schilderij ‘painting’, tekening ‘drawing’, doek ‘canvas’, installatie ‘installation’, afbeelding ‘picture’, sculptuur ‘sculpture’, prent ‘picture’, illustratie ‘illustration’, handschrift ‘manuscript’, grafiek ‘print’, aquarel ‘aquarelle’, maquette ‘scale-model’, collage ‘collage’, ets ‘etching’

2 werk ‘work’, boek ‘book’, titel ‘title’, roman ‘novel’, boekje ‘booklet’, debuut ‘debut’, biografie ‘biography’, bundel ‘collection’, toneelstuk ‘play’, bestseller ‘bestseller’, kinderboek ‘children’s book’, autobiografie ‘autobiography’, novelle ‘short story’,

3 werk ‘work’, voorziening ‘service’, arbeid ‘labour’, opvoeding ‘education’, kinderopvang ‘child care’, scholing ‘education’, huisvesting ‘housing’, faciliteit ‘facility’, accommodatie ‘accommodation’, arbeidsomstandigheid ‘working condition’

(59)

Evaluation: methodology

Comparison to EuroWordNet senses

using Wu & Palmer’s WordNet similarity measure

a sense is assigned to a correct cluster if:
    the average similarity between the sense and the top 4 words of the cluster exceeds a similarity threshold in EuroWordNet
    and the sense has not yet been considered correct

when multiple senses exist in EuroWordNet, the one with

(60)

Evaluation: precision & recall

Precision
    of a word: percentage of correct clusters to which it is assigned
    overall: average precision of all words in the test set

Recall
    of a word: percentage of senses in EuroWordNet that have a corresponding cluster
    overall: average recall of all words in the test set

Test set: words covered by the algorithm that are also present in EuroWordNet (3683 words)

(61)

Evaluation: results

threshold θ             .40 (%)   .50 (%)   .60 (%)
kmeans_nmf    prec.     78.97     69.18     55.16
              rec.      63.90     55.95     44.77
CBC           prec.     44.94     38.13     29.74
              rec.      69.61     60.00     48.00
kmeans_orig   prec.     86.13     74.99     58.97

(62)

Evaluation: results

kmeans_nmf beats CBC with regard to precision

CBC beats kmeans_nmf with regard to recall

kmeans_nmf has higher recall than kmeans_orig, so algorithm is

(63)

Conclusion

Combining bag-of-words data and syntactic data is useful
    bag-of-words data (factorized with NMF) puts its finger on topical dimensions
    syntactic data is particularly good at finding similar words

A clustering approach allows one to determine which topical dimension(s) are responsible for a certain sense
    and adapt the (syntactic) feature vector of the noun accordingly
    subtracting the more dominant sense to discover less dominant senses

Algorithm scores better with regard to precision; lower with regard to recall

(64)

Introduction

Standard selectional preference models: two-way co-occurrences

Keeping track of single relationships

But: two-way selectional preference models are not sufficiently rich

Compare:

The skyscraper is playing coffee.
The turntable is playing the piano.

(65)

Introduction

The skyscraper is playing coffee.

(play,su,skyscraper) ↓

(play,obj,coffee) ↓

The turntable is playing the piano.

(play,su,turntable) ↑

(play,obj,piano) ↑

(66)

Methodology

Three-way extraction of selectional preferences

Approach applied to Dutch, using the Twente Nieuws Corpus (500M words of newspaper texts)

parsed with the Dutch dependency parser Alpino

three-way co-occurrences of verbs with subjects and direct objects extracted

adapted with an extension of pointwise mutual information

Resulting tensor: 1k verbs × 10k subjects × 10k direct objects

non-negative tensor factorization with k dimensions

(67)

(68)

Example dimension: police action

subjects                         su_s   verbs                   v_s    objects                     obj_s
politie ‘police’                 .99    houd aan ‘arrest’       .64    verdachte ‘suspect’         .16
agent ‘policeman’                .07    arresteer ‘arrest’      .63    man ‘man’                   .16
autoriteit ‘authority’           .05    pak op ‘run in’         .41    betoger ‘demonstrator’      .14
Justitie ‘Justice’               .05    schiet dood ‘shoot’     .08    relschopper ‘rioter’        .13
recherche ‘detective force’      .04    verdenk ‘suspect’       .07    raddraaier ‘instigator’     .13
marechaussee ‘military police’   .04    tref aan ‘find’         .06    overvaller ‘raider’         .13
justitie ‘justice’               .04    achterhaal ‘overtake’   .05    Roemeen ‘Romanian’          .13
arrestatieteam ‘special squad’   .03    verwijder ‘remove’      .05    actievoerder ‘campaigner’   .13
leger ‘army’                     .03    zoek ‘search’           .04    hooligan ‘hooligan’         .13

(69)

Example dimension: legislation

subjects                  su_s   verbs                v_s    objects                     obj_s
meerderheid ‘majority’    .33    steun ‘support’      .83    motie ‘motion’              .63
VVD                       .28    dien in ‘submit’     .44    voorstel ‘proposal’         .53
D66                       .25    neem aan ‘pass’      .23    plan ‘plan’                 .28
Kamermeerderheid          .25    wijs af ‘reject’     .17    wetsvoorstel ‘bill’         .19
fractie ‘party’           .24    verwerp ‘reject’     .14    hem ‘him’                   .18
PvdA                      .23    vind ‘think’         .08    kabinet ‘cabinet’           .16
CDA                       .23    aanvaard ‘accept’    .05    minister ‘minister’         .16
Tweede Kamer              .21    behandel ‘treat’     .05    beleid ‘policy’             .13
partij ‘party’            .20    doe ‘do’             .04    kandidatuur ‘candidature’   .11

(70)

Example dimension: exhibition

subjects                       su_s   verbs                  v_s    objects                  obj_s
tentoonstelling ‘exhibition’   .50    toon ‘display’         .72    schilderij ‘painting’    .47
expositie ‘exposition’         .49    omvat ‘cover’          .63    werk ‘work’              .46
galerie ‘gallery’              .36    bevat ‘contain’        .18    tekening ‘drawing’       .36
collectie ‘collection’         .29    presenteer ‘present’   .17    foto ‘picture’           .33
museum ‘museum’                .27    laat ‘let’             .07    sculptuur ‘sculpture’    .25
oeuvre ‘oeuvre’                .22    koop ‘buy’             .07    aquarel ‘aquarelle’      .20
Kunsthal                       .19    bezit ‘own’            .06    object ‘object’          .19
kunstenaar ‘artist’            .15    zie ‘see’              .05    beeld ‘statue’           .12
dat ‘that’                     .12    koop aan ‘acquire’     .05    overzicht ‘overview’     .12

(71)

Quality count

44 dimensions contain similar, frame-like semantics

43 dimensions contain less clear-cut semantics
    single verbs account for one dimension
    verb senses are mixed up

13 dimensions based on syntax rather than semantics
    fixed expressions
    pronouns

(72)

Evaluation: methodology

pseudo-disambiguation task to test generalization capacity (standard automatic evaluation for selectional preferences)

s             v          o           s'            o'
jongere       drink      bier        coalitie      aandeel
‘youngster’   ‘drink’    ‘beer’      ‘coalition’   ‘share’
werkgever     riskeer    boete       doel          kopzorg
‘employer’    ‘risk’     ‘fine’      ‘goal’        ‘worry’
directeur     zwaai      scepter     informateur   wodka
‘manager’     ‘sway’     ‘sceptre’   ‘informer’    ‘vodka’
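The scoring behind such a task can be sketched as follows, under two assumptions that the slide does not spell out: the model counts as correct when it scores the attested triple (s, v, o) higher than the triple with the random subject and object (s', v, o'), and a triple's score under a k-dimensional factorization is the sum over dimensions of the product of the three loadings. All names and the toy factor matrices are hypothetical.

```python
import numpy as np

def triple_score(X, Y, Z, v, s, o):
    """Score of (verb v, subject s, object o): sum_i X[v,i] * Y[s,i] * Z[o,i]."""
    return float(np.sum(X[v] * Y[s] * Z[o]))

def pseudo_disambiguation_accuracy(X, Y, Z, test_items):
    """test_items: index tuples (v, s, o, s_rand, o_rand)."""
    correct = sum(
        triple_score(X, Y, Z, v, s, o) > triple_score(X, Y, Z, v, s_rand, o_rand)
        for v, s, o, s_rand, o_rand in test_items
    )
    return correct / len(test_items)

# Toy factor matrices with k = 2 dimensions (verbs, subjects, objects).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Z = np.array([[1.0, 0.0], [0.0, 1.0]])

items = [(0, 0, 0, 1, 1),   # attested triple scores 1, random triple 0
         (1, 1, 1, 0, 0),   # attested triple scores 1, random triple 0
         (0, 1, 1, 0, 0)]   # attested triple scores 0, random triple 1
print(pseudo_disambiguation_accuracy(X, Y, Z, items))
```

In the toy data, two of the three attested triples outscore their random counterparts, so the accuracy is 2/3.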

(73)

Evaluation: models

Evaluation of 4 different models

2 matrix models
→ 1k verbs × (10k subjects + 10k direct objects)
    singular value decomposition (ℝ)
    non-negative matrix factorization (ℝ≥0)

2 tensor models
→ 1k verbs × 10k subjects × 10k direct objects
    parallel factor analysis (ℝ)

(74)

Evaluation: results

dimensions   50 (%)          100 (%)         300 (%)
SVD          69.60 ± 0.41    62.84 ± 1.30    45.22 ± 1.01
NMF          81.79 ± 0.15    78.83 ± 0.40    75.74 ± 0.63
PARAFAC      85.57 ± 0.25    83.58 ± 0.59    80.12 ± 0.76
NTF          89.52 ± 0.18    90.43 ± 0.14    90.89 ± 0.16

(75)

Conclusion

novel method able to investigate three-way co-occurrences

capable of automatically inducing selectional preferences

three-way methods improve on two-way methods

non-negativity constraint improves on unconstrained models

non-negative tensor factorization outperforms other models

(76)

Wordnet-based similarity

Compare results to the Dutch Cornetto database

Two similarity measures:
    path length: Wu & Palmer’s similarity measure
    information theoretic: Lin’s similarity measure

Nouns close in the hierarchy are tightly similar

Pairwise similarity for k similar words

(77)

Wordnet-based similarity

                 Wu & Palmer’s similarity    Lin’s similarity
model            k = 1   k = 3   k = 5       k = 1   k = 3   k = 5
document         .379    .331    .309        .354    .320    .304
window (w=par)   .442    .377    .349        .404    .357    .336
window (w=2)     .633    .561    .526        .541    .485    .456
syntax           .648    .584    .554        .555    .504    .480
baseline         .128    .126    .126        .164    .163    .163

(78)

Wordnet-based similarity

Syntax (with PMI) best

Closely followed by small window (with PMI)

Large window and document perform much worse

dimensionality reduction only helps to improve the document-based model (a little)

(79)

Cluster quality

Compare output of clustering algorithm to gold standard classification

Two cluster tasks (ESSLLI 2008 workshop’s shared task)

concrete noun categorization (44 nouns)
    2-way: natural, artefact
    3-way: animal, vegetable, artefact
    6-way: bird, groundAnimal, fruitTree, green, tool, vehicle

abstract/concrete noun discrimination (30 nouns)
    2-way: hi, lo

Evaluation measures:
    Entropy: distribution of classes within cluster (small = good)
    Purity: ratio of largest class present in cluster (large = good)
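The two measures can be sketched as below. The exact variant used for the shared task is not given here, so this example assumes the common formulation in which both measures are averaged over clusters weighted by cluster size, and entropy is normalized by the logarithm of the number of gold classes so that it falls in [0, 1]. The function name and the toy labels are illustrative.

```python
import math
from collections import Counter, defaultdict

def entropy_and_purity(clusters, gold):
    """clusters, gold: equal-length lists of cluster ids and gold class labels."""
    n = len(gold)
    q = len(set(gold))                   # number of gold classes
    members = defaultdict(list)
    for cluster_id, label in zip(clusters, gold):
        members[cluster_id].append(label)

    entropy = purity = 0.0
    for labels in members.values():
        counts = Counter(labels).values()
        size = len(labels)
        # Entropy of the class distribution inside this cluster
        h = -sum((c / size) * math.log(c / size) for c in counts)
        if q > 1:
            h /= math.log(q)             # assumption: normalize to [0, 1]
        entropy += (size / n) * h        # weight by cluster size
        purity += max(counts) / n        # share of items in the majority class
    return entropy, purity

# Perfect clustering: entropy 0, purity 1.
print(entropy_and_purity([0, 0, 1, 1], ["animal", "animal", "tool", "tool"]))
# Everything in one cluster: entropy 1, purity 0.5.
print(entropy_and_purity([0, 0, 0, 0], ["animal", "animal", "tool", "tool"]))
```

A clustering that separates the gold classes perfectly therefore reaches entropy .000 and purity 1.000.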

(80)

Cluster quality

                 2-way            3-way            6-way
model            ent     pur      ent     pur      ent     pur
document         .930    .614     .179    .932     .292    .682
window (w=par)   .911    .659     .541    .705     .377    .591
window (w=2)     .000    1.000    .213    .909     .206    .773
syntax-based     .000    1.000    .000    1.000    .153    .864

(81)

Cluster quality

Same tendencies as wordnet-based similarity

model with large window seems to extract topically related clusters:

aardappel ‘potato’, ananas ‘pineapple’, banaan ‘banana’, champignon ‘mushroom’, fles ‘bottle’, kers ‘cherry’, ketel ‘kettle’, kip ‘chicken’, kom ‘bowl’, lepel ‘spoon’, peer ‘pear’, sla ‘lettuce’, ui ‘onion’

(82)

Domain coherence

Coherence of semantic domain tags (available in Cornetto)

‘particular areas of human knowledge’ (politics, medicine, sports)

→ topical similarity

Ratio of most frequent domain tag (also in tagset of target word) over top 10 similar words

(83)

Domain coherence

model            sim_topic
document         .394
window (w=par)   .399
window (w=2)     .414
syntax           .441
baseline         .048

(84)

Domain coherence

Syntax still scores best

Other models do not perform much worse

No real difference between small window and large window

→ large window and document do not extract tight similarity,