Distributional Similarity
Overview and Applications
Tim Van de Cruys, Alpage, INRIA & Paris VII
Séminaire CENTAL
Outline
1 Introduction
2 Methodology
Similarity
Context
Dimensionality reduction
Latent semantic analysis
Non-negative matrix factorization
Tensors
3 Applications
MWE extraction
Semantic similarity
Most work on semantic similarity relies on the distributional hypothesis (Harris 1954)
Take a word and its contexts:
tasty tnassiorc
greasy tnassiorc
tnassiorc with butter
tnassiorc for breakfast
⇒ FOOD
Matrix
Capture co-occurrence frequencies of two entities
          rouge   délicieux   rapide   d’occasion
pomme         2           1        0            0
vin           2           2        0            0
voiture       1           0        1            2
Matrix
Capture co-occurrence frequencies of two entities
          rouge   délicieux   rapide   d’occasion
pomme         7           9        0            0
vin          12           6        0            0
voiture       7           0        8            4
Matrix
Capture co-occurrence frequencies of two entities
          rouge   délicieux   rapide   d’occasion
pomme        56          98        0            0
vin          44          34        0            0
voiture      23           0       31           39
Matrix
Capture co-occurrence frequencies of two entities
          rouge   délicieux   rapide   d’occasion
pomme       728         592        1            0
vin        1035         437        0            2
voiture     392           0      487          370
Similarity calculation
Cosine
\[ \cos(\vec{x},\vec{y}) = \frac{\vec{x}\cdot\vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \]
Examples:
cos(pomme, vin) = .96
cos(pomme, voiture) = .42
Other possibilities:
set-theoretic measures: Dice, Jaccard
probabilistic measures: Kullback-Leibler divergence, Jensen-Shannon divergence
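The two example values can be rechecked from the last co-occurrence matrix shown earlier (the one with counts such as 728 and 1035). The following is a minimal sketch, assuming NumPy is available; the function name `cosine` is only illustrative.

```python
import numpy as np

# Rows of the final co-occurrence matrix
# (columns: rouge, délicieux, rapide, d'occasion).
pomme = np.array([728, 592, 1, 0], dtype=float)
vin = np.array([1035, 437, 0, 2], dtype=float)
voiture = np.array([392, 0, 487, 370], dtype=float)

def cosine(x, y):
    """Cosine of the angle between two co-occurrence vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(f"{cosine(pomme, vin):.2f}")      # 0.96
print(f"{cosine(pomme, voiture):.2f}")  # 0.42
```

Because the cosine normalizes by vector length, the much larger raw counts for vin do not by themselves make it more or less similar to pomme.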
Different kinds of context
Three different word space models based on context:
document-based model (nouns × documents)
window-based model (nouns × context words)
syntax-based model (nouns × dependency relations)
Each model comes with a plethora of parameters!
document size, window size, type of dependency relations
weighting function
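To make the window-based model concrete, here is a small sketch of how co-occurrence counts for a target noun might be collected. The function name, the toy sentence and the symmetric window are assumptions made for illustration, not the setup used in the talk.

```python
from collections import Counter

def window_cooccurrences(tokens, targets, window=2):
    """Count context words within `window` positions on either side of each target."""
    counts = Counter()
    for i, word in enumerate(tokens):
        if word not in targets:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target position itself
                counts[(word, tokens[j])] += 1
    return counts

# Toy sentence; in practice the tokens would come from a large corpus.
tokens = "we drink red wine with the tasty cheese".split()
counts = window_cooccurrences(tokens, targets={"wine"}, window=2)
```

Widening `window` (up to a whole paragraph) pulls in more loosely associated context words, which is exactly the parameter the slide refers to.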
Different kinds of semantic similarity
‘tight’, synonym-like similarity: (near-)synonymous or (co-)hyponymous
loosely related, topical similarity: more loose relationships, such as association and meronymy
Example
médecin ‘doctor’: docteur ‘doctor’, médecin de famille ‘family doctor’, chirurgien ‘surgeon’, spécialiste ‘specialist’, dermatologue ‘dermatologist’, gynécologue ‘gynaecologist’
médecin ‘doctor’: malade ‘patient’, maladie ‘disease’,
Relation context – similarity
Different contexts lead to different kinds of similarity
Syntax, small window ↔ large window, documents
The former models induce tight, synonym-like similarity
The latter models induce topical relatedness
Evaluation
Syntax-based model scores best when evaluated according to WordNet similarity measures (Cornetto)
Large-window and document-based models do not score well on WordNet similarity, but do score on the WordNet domain evaluation
Introduction
Two reasons for performing dimensionality reduction:
Intractable computations
When the number of elements and the number of features are too large, similarity computations may become intractable
reducing the number of features makes computation tractable again
Generalization capacity
the dimensionality reduction is able to describe the data better, or is able to capture intrinsic semantic features
dimensionality reduction is able to improve the results (counter data sparseness and noise)
Latent semantic analysis: introduction
Application of a mathematical/statistical technique to simulate how humans learn the semantics of words
LSA finds ‘latent semantic dimensions’ according to which words and documents can be identified
Latent semantic analysis: introduction
What is latent semantic analysis technically speaking?
The application of singular value decomposition
to a term-document matrix
Singular value decomposition
Find the dimensions that explain most variance by solving a number of eigenvector problems
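As an illustration of the decomposition step, the sketch below applies NumPy's `np.linalg.svd` to a made-up 3 × 4 term-document matrix and keeps the top k dimensions. The matrix and the function name `truncated_svd` are assumptions for the example; the singular vectors it returns are the eigenvectors of X Xᵀ and Xᵀ X.

```python
import numpy as np

def truncated_svd(X, k):
    """Return rank-k term vectors and the rank-k approximation of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_vectors = U[:, :k] * s[:k]   # one k-dimensional vector per term (row)
    X_k = term_vectors @ Vt[:k, :]    # best rank-k fit in the least-squares sense
    return term_vectors, X_k

# Toy term-document matrix (rows: terms, columns: documents).
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 2.0]])
term_vectors, X_2 = truncated_svd(X, k=2)
```

Similarity between terms can then be computed on `term_vectors` instead of the full rows of X.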
LSA: Example
LSA on (part of) the Twente Nieuws Corpus: 10 years of Dutch newspaper texts (AD, NRC, TR, VK, PAR)
terms = nouns
documents = paragraphs
20,000 terms × 2,000,000 documents matrix reduced to 300 dimensions
LSA: criticism
LSA criticized for a number of reasons:
The dimensionality reduction is a best fit in the least-squares sense, but this is only valid for normally distributed data; language data is not normally distributed
Dimensions may contain negative values; it is not clear what negativity on a semantic scale should designate
Shortcomings are remedied by subsequent techniques (PLSA, LDA, NMF, ...)
Non-negative matrix factorization: technique
Given a non-negative matrix V, find non-negative matrix factors W and H such that:
\[ V_{n \times m} \approx W_{n \times r} H_{r \times m} \quad (1) \]
Choosing r ≪ n, m reduces data
Constraint on factorization: all values in the three matrices need to be non-negative (≥ 0)
Constraint brings about a parts-based representation: only
Non-negative matrix factorization: technique
Different kinds of NMF that minimize different cost functions:
Square of Euclidean distance
Kullback-Leibler divergence
⇒ better suited for language phenomena
Finding an NMF means minimizing D(V ‖ WH) with respect to W and H, subject to the constraints W, H ≥ 0
This can be done with update rules
\[ H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}} \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_v H_{av}} \quad (2) \]
these update rules converge to a local minimum in the
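The multiplicative update rules for the Kullback-Leibler cost can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in the talk: the random initialization, the fixed iteration count and the small constant `eps` (added to avoid division by zero) are all assumptions.

```python
import numpy as np

def nmf_kl(V, r, n_iter=200, eps=1e-9, seed=0):
    """Factorize non-negative V (n x m) into W (n x r) and H (r x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H[a, mu] *= sum_i W[i, a] * V[i, mu] / (WH)[i, mu]  /  sum_k W[k, a]
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W[i, a] *= sum_mu H[a, mu] * V[i, mu] / (WH)[i, mu]  /  sum_v H[a, v]
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_divergence(V, WH, eps=1e-9):
    """Generalized KL divergence D(V || WH) minimized by the updates above."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

# Toy input: the small noun x adjective count matrix used earlier in the slides.
V = np.array([[728, 592, 1, 0],
              [1035, 437, 0, 2],
              [392, 0, 487, 370]], dtype=float)
W, H = nmf_kl(V, r=2)
```

Since only multiplications by non-negative factors are involved, W and H stay non-negative throughout, which is what yields the additive, parts-based representation.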
NMF: results
Context vectors (5k nouns × 2k co-occurring nouns) extracted from the CLEF corpus
NMF is able to capture clear semantic dimensions
Examples:
bus ‘bus’, taxi ‘taxi’, trein ‘train’, halte ‘stop’, reiziger ‘traveler’, perron ‘platform’, tram ‘tram’, station ‘station’, chauffeur ‘driver’, passagier ‘passenger’
bouillon ‘broth’, slagroom ‘cream’, ui ‘onion’, eierdooier ‘egg yolk’, laurierblad ‘bay leaf’, zout ‘salt’, deciliter ‘decilitre’,
Two-way vs. three-way
all methods use two-way co-occurrence frequencies → matrix
suitable for two-way problems
words × documents
nouns × dependency relations
not suitable for n-way problems → tensor
words × documents × authors
verbs × subjects × direct objects
Non-negative tensor factorization: technique
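Idea similar to non-negative matrix factorization
Calculations are different
\[ \min_{x_i \in \mathbb{R}^{D_1}_{\geq 0},\; y_i \in \mathbb{R}^{D_2}_{\geq 0},\; z_i \in \mathbb{R}^{D_3}_{\geq 0}} \Big\| T - \sum^{k} \]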
Introduction
Task: automatic extraction of multi-word expressions (MWEs) from large corpora
Starting point: many MWEs are non-compositional, i.e. the meaning of the MWE is not the sum of the meanings of the individual words
Intuition: a noun within an MWE cannot easily be replaced by a semantically similar noun
→ Use of semantic clusters to determine whether a
Intuition
break the ice   ←→   break the vase
          *snow                 cup
          *hail                 dish
In the first expression, it is not possible to replace ice with semantically related nouns such as snow or hail;
in the second expression, it is possible to replace vase with
Overview of the method
verb + prepositional complement instances are extracted from a 500M-word corpus (focus on MWEs with a PP)
a matrix of 5K verb-preposition combinations × 10K nouns is created
the 10K nouns are automatically clustered using distributional similarity measures
a number of statistical measures are applied to determine ‘unique associations’ given the cluster in which a noun appears
Measures
inspired by selectional preferences (Resnik 1993), entropy-based
Kullback-Leibler divergence between P(n) and P(n|v)
\[ S_v = \sum_n p(n \mid v) \log \frac{p(n \mid v)}{p(n)} \]
Measures
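The preference of the verb for the noun → [0,1]
\[ A_{v \to n} = \frac{p(n \mid v) \log \frac{p(n \mid v)}{p(n)}}{S_v} \quad (4) \]
Ratio of verb preference for a particular noun, compared to other nouns in the cluster → [0,1]
\[ R_{v \to n} = \frac{A_{v \to n}}{\sum_{n' \in C} A_{v \to n'}} \qquad \text{(C: the cluster containing n)} \]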
Measures
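The preference of the noun for the verb → [0,1]
\[ A_{n \to v} = \frac{p(v \mid n) \log \frac{p(v \mid n)}{p(v)}}{S_n} \quad (6) \]
Ratio of noun preference for a particular verb, compared to other nouns in the cluster → [0,1]
\[ R_{n \to v} = \frac{A_{n \to v}}{\sum_{n' \in C} A_{n' \to v}} \qquad \text{(C: the cluster containing n)} \]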
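The verb-to-noun measures can be sketched from a count matrix as follows. This assumes that S_v is the Kullback-Leibler divergence between P(n|v) and P(n), and that the ratio divides a noun's preference by the summed preference of all nouns in its cluster; the function names and toy counts are made up for the example.

```python
import numpy as np

def verb_noun_preference(counts):
    """A[v, n]: share of S_v contributed by noun n, computed from counts[v, n].

    Assumes every row and column has at least one count and that S_v > 0.
    """
    p_n = counts.sum(axis=0) / counts.sum()               # p(n)
    p_n_v = counts / counts.sum(axis=1, keepdims=True)    # p(n|v)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Zero counts contribute nothing to the divergence.
        terms = np.where(p_n_v > 0, p_n_v * np.log(p_n_v / p_n), 0.0)
    S_v = terms.sum(axis=1, keepdims=True)                # KL(P(n|v) || P(n))
    return terms / S_v

def cluster_ratio(A, v, n, cluster):
    """Preference of verb v for noun n relative to all nouns in n's cluster."""
    return A[v, n] / A[v, cluster].sum()

# Toy counts: 2 verb-preposition combinations x 3 nouns.
counts = np.array([[8.0, 1.0, 1.0],
                   [2.0, 5.0, 3.0]])
A = verb_noun_preference(counts)
R = cluster_ratio(A, v=0, n=0, cluster=[0, 1])
```

On such tiny toy data individual terms of the divergence can be negative, so this sketch does not attempt to enforce the [0,1] range stated on the slide. The noun-to-verb measures follow by transposing `counts`.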
An elaborated example
in de smaak vallen     ←→    in de put vallen
      *geur                        kuil
      *voorkeur                    krater
      *stijl                       greppel
in the taste fall            in the well fall
‘to be appreciated’          ‘to fall down the well’
      *smell                       hole
      *preference                  crater
An elaborated example
smaak: idioom, karakter, persoonlijkheid, stijl, temperament, thematiek, uiterlijk, uitstraling, voorkomen
MWE candidate      (1)    (2)    (3)    (4)    MWE?
val#in smaak       .12   1.00    .07   1.00    yes
val#in karakter    .00    .00    .00    .00    no
An elaborated example
put: gaatje, gat, kloof, krater, kuil, lek, scheur, valkuil
MWE candidate      (1)    (2)    (3)    (4)    MWE?
val#in put         .00    .04    .10    .05    no
val#in kuil        .00    .11    .10    .38    no
Quantitative Evaluation
Fully automated, compared to Referentie Bestand Nederlands (RBN)
Upper bound consists of all RBN MWEs present in the data
Parameters                        Prec     Rec      F-measure
(1)    (2)    (3)    (4)    N     (%)      (%)      (%)
.10    .80    –      –      2916  24.66    16.64    19.87
.10    .80    .01    .80    1694  30.99    12.15    17.45
Fazly/Stevenson             3470  16.89    13.56    15.04
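The F-measure column is consistent with the balanced harmonic mean of precision and recall, which can be checked with a short sketch (the row labels in the dictionary are only illustrative):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in %) copied from the three rows of the table.
rows = {
    "measures (1)+(2)": (24.66, 16.64),
    "measures (1)-(4)": (30.99, 12.15),
    "Fazly/Stevenson": (16.89, 13.56),
}
scores = {name: f_measure(p, r) for name, (p, r) in rows.items()}
```

The recomputed values agree with the reported ones to within 0.01, the small differences being due to rounding of the published precision and recall.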
Conclusion
Non-compositionality-based algorithm is able to rule out expressions that are labelled as MWEs by traditional algorithms, improving on the state of the art
Using measures (1) and (2) gives the best results; using (3) and (4) increases precision but degrades recall
Ambiguity
Problem: ambiguity
bar
↔
Main research question: can ‘topical’ similarity and tight, synonym-like similarity be combined to differentiate between various senses of a word?
Methodology
Goal: classification of nouns according to both window-based context (with large window) and syntactic context
⇒ Construct three matrices capturing co-occurrence frequencies for each mode
nouns cross-classified by dependency relations
nouns cross-classified by (bag of words) context words
dependency relations cross-classified by context words
⇒ Apply NMF to the matrices, but interleave the process
The result of the former factorization is used to initialize the factorization of the next one
Graphical Representation
A (nouns × dependency relations, 5k × 80k) = W (5k × 50) × H (50 × 80k)
B (nouns × context words, 5k × 2k) = V (5k × 50) × G (50 × 2k)
C (context words × dependency relations, 2k × 80k) = U (2k × 50) × F (50 × 80k)
Sense subtraction
‘switch off’ dimension(s) of an ambiguous word to reveal other possible senses
From matrix W, we know which dimensions are the most important for a certain word
Matrix H gives the importance of each dependency relation given a dimension
‘subtract’ dependency relations that are responsible for a given dimension from the original noun vector
\[ \vec{v}_{new} = \vec{v}_{orig} (\vec{v}_{1} - \vec{h}_{dim}) \]
each dependency relation is multiplied by a scaling factor, according to the load of the feature on the subtracted
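A minimal sketch of the subtraction step, reading the formula as an element-wise product of the original noun vector with one minus the feature loads of the switched-off dimension. Two points are assumptions of this sketch rather than statements from the slides: the vector written with index 1 is taken to be a vector of ones, and the loads are rescaled to [0, 1] by dividing by their maximum.

```python
import numpy as np

def subtract_sense(v_orig, h_dim):
    """Scale down each feature of v_orig by its load on the subtracted dimension."""
    load = h_dim / h_dim.max()      # assumption: loads rescaled to [0, 1]
    return v_orig * (1.0 - load)    # element-wise v_orig * (1 - h_dim)

v_orig = np.array([4.0, 2.0, 6.0])      # toy noun vector over 3 dependency relations
h_dim = np.array([0.0, 0.5, 1.0])       # toy row of H for the dimension to switch off
v_new = subtract_sense(v_orig, h_dim)   # array([4., 1., 0.])
```

Features that carry no weight on the subtracted dimension are left untouched, while the features most typical of that dimension are removed entirely.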
Combination with clustering
A simple clustering algorithm (k-means) assigns each ambiguous noun to its predominant sense
The centroid of the cluster is folded into the NMF model
The dimensions that define the centroid are subtracted from the ambiguous noun vector
Experimental Design
Approach applied to Dutch, using the Twente Nieuws Corpus (±500M words)
Corpus parsed with the Dutch dependency parser Alpino
three matrices constructed with:
5k nouns × 80k dependency relations
5k nouns × 2k context words
80k dependency relations × 2k context words
Example dimension: transport
1 nouns: auto ‘car’, wagen ‘car’, tram ‘tram’, motor ‘motorbike’, bus ‘bus’, metro ‘subway’, automobilist ‘driver’, trein ‘train’, stuur ‘steering wheel’, chauffeur ‘driver’
2 context words: auto ‘car’, trein ‘train’, motor ‘motorbike’, bus ‘bus’, rij ‘drive’, chauffeur ‘driver’, fiets ‘bike’, reiziger ‘traveler’, passagier ‘passenger’, vervoer ‘transport’
3 dependency relations: viertraps_adj ‘four pedal’, verplaats met_obj ‘move with’, toeter_adj ‘honk’, tank in houd_obj [parsing error], tank_subj ‘refuel’, tank_obj ‘refuel’, rij voorbij_subj ‘pass by’, rij voorbij_adj ‘pass by’,
Pop: most similar words
pop music ↔ doll
1 pop, rock, jazz, meubilair ‘furniture’, popmuziek ‘pop music’, heks ‘witch’, speelgoed ‘toy’, kast ‘cupboard’, servies ‘[tea] service’, vraagteken ‘question mark’
2 pop, meubilair ‘furniture’, speelgoed ‘toy’, kast ‘cupboard’, servies ‘[tea] service’, heks ‘witch’, vraagteken ‘question mark’, sieraad ‘jewel’, sculptuur ‘sculpture’, schoen ‘shoe’
3 pop, rock, jazz, popmuziek ‘pop music’, heks ‘witch’, danseres ‘dancer’, servies ‘[tea] service’, kopje ‘cup’, house ‘house
Barcelona: most similar words
Spanish city ↔ Spanish football club
1 Barcelona, Arsenal, Inter, Juventus, Vitesse, Milaan ‘Milan’, Madrid, Parijs ‘Paris’, Wenen ‘Vienna’, München ‘Munich’
2 Barcelona, Milaan ‘Milan’, München ‘Munich’, Wenen ‘Vienna’, Madrid, Parijs ‘Paris’, Bonn, Praag ‘Prague’, Berlijn ‘Berlin’, Londen ‘London’
3 Barcelona, Arsenal, Inter, Juventus, Vitesse, Parma,
Clustering example: werk
1 werk ‘work’, beeld ‘image’, foto ‘photo’, schilderij ‘painting’, tekening ‘drawing’, doek ‘canvas’, installatie ‘installation’, afbeelding ‘picture’, sculptuur ‘sculpture’, prent ‘picture’, illustratie ‘illustration’, handschrift ‘manuscript’, grafiek ‘print’, aquarel ‘aquarelle’, maquette ‘scale-model’, collage ‘collage’, ets ‘etching’
2 werk ‘work’, boek ‘book’, titel ‘title’, roman ‘novel’, boekje ‘booklet’, debuut ‘debut’, biografie ‘biography’, bundel ‘collection’, toneelstuk ‘play’, bestseller ‘bestseller’, kinderboek ‘children’s book’, autobiografie ‘autobiography’, novelle ‘short story’,
3 werk ‘work’, voorziening ‘service’, arbeid ‘labour’, opvoeding ‘education’, kinderopvang ‘child care’, scholing ‘education’, huisvesting ‘housing’, faciliteit ‘facility’, accommodatie ‘accommodation’, arbeidsomstandigheid ‘working condition’
Evaluation: methodology
Comparison to EuroWordNet senses
using Wu & Palmer’s WordNet similarity measure
a sense is assigned to a correct cluster if:
for the top 4 words of the cluster
the average similarity between the sense and the top 4 words exceeds a similarity threshold in EuroWordNet
and the sense has not yet been considered correct
when multiple senses exist in EuroWordNet, the one with
Evaluation: precision & recall
Precision
of a word: percentage of correct clusters to which it is assigned
overall: average precision of all words in test set
Recall
of a word: percentage of senses in EuroWordNet that have a corresponding cluster
overall: average recall of all words in test set
Test set: words covered by algorithm that are also present in EuroWordNet (3683 words)
Evaluation: results
threshold θ            .40 (%)   .50 (%)   .60 (%)
kmeans_nmf    prec.    78.97     69.18     55.16
              rec.     63.90     55.95     44.77
cbc           prec.    44.94     38.13     29.74
              rec.     69.61     60.00     48.00
kmeans_orig   prec.    86.13     74.99     58.97
Evaluation: results
kmeans_nmf beats cbc with regard to precision
cbc beats kmeans_nmf with regard to recall
kmeans_nmf has higher recall than kmeans_orig, so algorithm is
Conclusion
Combining bag of words data and syntactic data is useful
bag of words data (factorized with NMF) puts its finger on topical dimensions
syntactic data is particularly good at finding similar words
a clustering approach allows one to determine which topical dimension(s) are responsible for a certain sense
and adapt the (syntactic) feature vector of the noun accordingly
subtracting the more dominant sense to discover less dominant senses
Algorithm scores better with regard to precision; lower with regard to recall
Introduction
Standard selectional preference models: two-way co-occurrences
Keeping track of single relationships
But: two-way selectional preference models are not sufficiently rich
Compare:
The skyscraper is playing coffee.
The turntable is playing the piano.
Introduction
The skyscraper is playing coffee.
(play, su, skyscraper) ↓
(play, obj, coffee) ↓
The turntable is playing the piano.
(play, su, turntable) ↑
(play, obj, piano) ↑
Methodology
Three-way extraction of selectional preferences
Approach applied to Dutch, using the Twente Nieuws Corpus (500M words of newspaper texts)
parsed with the Dutch dependency parser Alpino
three-way co-occurrences of verbs with subjects and direct objects extracted
frequencies adapted with an extension of pointwise mutual information
Resulting tensor: 1k verbs × 10k subjects × 10k direct objects
non-negative tensor factorization with k dimensions
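The slides do not spell out the extension of pointwise mutual information. One common generalization, shown here purely as an assumption, compares the joint probability of a (verb, subject, object) triple with the product of the three marginal probabilities. The sketch below applies it to a small count tensor with NumPy.

```python
import numpy as np

def three_way_pmi(T):
    """log p(v,s,o) / (p(v) p(s) p(o)) for a count tensor T[v, s, o].

    Assumed generalization of PMI; unseen triples are set to 0.
    """
    p = T / T.sum()
    p_v = p.sum(axis=(1, 2))
    p_s = p.sum(axis=(0, 2))
    p_o = p.sum(axis=(0, 1))
    expected = np.einsum("i,j,k->ijk", p_v, p_s, p_o)   # independence baseline
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, np.log(p / expected), 0.0)

# Toy tensor in which verb, subject and object are statistically independent.
T = np.einsum("i,j,k->ijk", [1.0, 2.0], [1.0, 3.0], [2.0, 2.0])
pmi = three_way_pmi(T)
```

For this independent toy tensor every value is (numerically) zero. Since a non-negative factorization requires non-negative input, negative values would have to be clipped or otherwise handled before factorizing.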
Example dimension: police action
subjects                          su_s   verbs                     v_s   objects                      obj_s
politie ‘police’                   .99   houd aan ‘arrest’         .64   verdachte ‘suspect’            .16
agent ‘policeman’                  .07   arresteer ‘arrest’        .63   man ‘man’                      .16
autoriteit ‘authority’             .05   pak op ‘run in’           .41   betoger ‘demonstrator’         .14
Justitie ‘Justice’                 .05   schiet dood ‘shoot’       .08   relschopper ‘rioter’           .13
recherche ‘detective force’        .04   verdenk ‘suspect’         .07   raddraaier ‘instigator’        .13
marechaussee ‘military police’     .04   tref aan ‘find’           .06   overvaller ‘raider’            .13
justitie ‘justice’                 .04   achterhaal ‘overtake’     .05   Roemeen ‘Romanian’             .13
arrestatieteam ‘special squad’     .03   verwijder ‘remove’        .05   actievoerder ‘campaigner’      .13
leger ‘army’                       .03   zoek ‘search’             .04   hooligan ‘hooligan’            .13
Example dimension: legislation
subjects                    su_s   verbs                  v_s   objects                       obj_s
meerderheid ‘majority’       .33   steun ‘support’        .83   motie ‘motion’                  .63
VVD                          .28   dien in ‘submit’       .44   voorstel ‘proposal’             .53
D66                          .25   neem aan ‘pass’        .23   plan ‘plan’                     .28
Kamermeerderheid             .25   wijs af ‘reject’       .17   wetsvoorstel ‘bill’             .19
fractie ‘party’              .24   verwerp ‘reject’       .14   hem ‘him’                       .18
PvdA                         .23   vind ‘think’           .08   kabinet ‘cabinet’               .16
CDA                          .23   aanvaard ‘accept’      .05   minister ‘minister’             .16
Tweede Kamer                 .21   behandel ‘treat’       .05   beleid ‘policy’                 .13
partij ‘party’               .20   doe ‘do’               .04   kandidatuur ‘candidature’       .11
Example dimension: exhibition
subjects                        su_s   verbs                    v_s   objects                   obj_s
tentoonstelling ‘exhibition’     .50   toon ‘display’           .72   schilderij ‘painting’       .47
expositie ‘exposition’           .49   omvat ‘cover’            .63   werk ‘work’                 .46
galerie ‘gallery’                .36   bevat ‘contain’          .18   tekening ‘drawing’          .36
collectie ‘collection’           .29   presenteer ‘present’     .17   foto ‘picture’              .33
museum ‘museum’                  .27   laat ‘let’               .07   sculptuur ‘sculpture’       .25
oeuvre ‘oeuvre’                  .22   koop ‘buy’               .07   aquarel ‘aquarelle’         .20
Kunsthal                         .19   bezit ‘own’              .06   object ‘object’             .19
kunstenaar ‘artist’              .15   zie ‘see’                .05   beeld ‘statue’              .12
dat ‘that’                       .12   koop aan ‘acquire’       .05   overzicht ‘overview’        .12
Quality count
44 dimensions contain similar, framelike semantics
43 dimensions contain less clear-cut semantics
single verbs account for one dimension
verb senses are mixed up
13 dimensions based on syntax rather than semantics
fixed expressions
pronouns
Evaluation: methodology
pseudo-disambiguation task to test generalization capacity (standard automatic evaluation for selectional preferences)
s             v          o           s′             o′
jongere       drink      bier        coalitie       aandeel
‘youngster’   ‘drink’    ‘beer’      ‘coalition’    ‘share’
werkgever     riskeer    boete       doel           kopzorg
‘employer’    ‘risk’     ‘fine’      ‘goal’         ‘worry’
directeur     zwaai      scepter     informateur    wodka
‘manager’     ‘sway’     ‘sceptre’   ‘informer’     ‘vodka’
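The pseudo-disambiguation test can be sketched as follows: for each test item, the model should score the attested triple (s, v, o) higher than the triple with the randomly substituted s′ and o′. The scoring function below rebuilds a tensor entry from k-dimensional factor matrices; the exact comparison protocol, the function names and the toy factor values are assumptions of this sketch.

```python
import numpy as np

def triple_score(V, S, O, v, s, o):
    """Entry (v, s, o) of the tensor rebuilt from k-dimensional factor matrices."""
    return float(np.sum(V[v] * S[s] * O[o]))

def pseudo_disambiguation_accuracy(V, S, O, items):
    """Share of items where the attested (s, v, o) outscores the random (s', v, o')."""
    hits = sum(
        triple_score(V, S, O, v, s, o) > triple_score(V, S, O, v, s_rand, o_rand)
        for s, v, o, s_rand, o_rand in items
    )
    return hits / len(items)

# Toy factor matrices with k = 2 dimensions (rows: items, columns: dimensions).
V = np.array([[0.9, 0.0], [0.0, 0.8]])        # verbs 0, 1
S = np.array([[0.7, 0.0], [0.0, 0.6]])        # subjects 0, 1
O = np.array([[0.5, 0.1], [0.0, 0.9]])        # objects 0, 1
items = [(0, 0, 0, 1, 1), (1, 1, 1, 0, 0)]    # (s, v, o, s', o') as indices
accuracy = pseudo_disambiguation_accuracy(V, S, O, items)
```

A matrix model would need a different scoring function, since it only has access to separate verb-subject and verb-object associations.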
Evaluation: models
Evaluation of 4 different models
2 matrix models
→ 1k verbs × (10k subjects + 10k direct objects)
singular value decomposition (ℝ)
non-negative matrix factorization (ℝ≥0)
2 tensor models
→ 1k verbs × 10k subjects × 10k direct objects
parallel factor analysis (ℝ)
Evaluation: results
dimensions   50 (%)          100 (%)         300 (%)
svd          69.60 ± 0.41    62.84 ± 1.30    45.22 ± 1.01
nmf          81.79 ± 0.15    78.83 ± 0.40    75.74 ± 0.63
parafac      85.57 ± 0.25    83.58 ± 0.59    80.12 ± 0.76
ntf          89.52 ± 0.18    90.43 ± 0.14    90.89 ± 0.16
Conclusion
novel method able to investigate three-way co-occurrences
capable of automatically inducing selectional preferences
three-way methods improve on two-way methods
non-negativity constraint improves on unconstrained models
non-negative tensor factorization outperforms other models
Wordnet-based similarity
Compare results to the Dutch Cornetto database
Two similarity measures:
path length: Wu & Palmer similarity measure
information theoretic: Lin’s similarity measure
Nouns close in the hierarchy are tightly similar
Pairwise similarity for k similar words
Wordnet-based similarity
                  Wu & Palmer’s similarity    Lin’s similarity
model             k = 1    k = 3    k = 5     k = 1    k = 3    k = 5
document          .379     .331     .309      .354     .320     .304
window (w=par)    .442     .377     .349      .404     .357     .336
window (w=2)      .633     .561     .526      .541     .485     .456
syntax            .648     .584     .554      .555     .504     .480
baseline          .128     .126     .126      .164     .163     .163
Wordnet-based similarity
Syntax (with PMI) best
Closely followed by small window (with PMI)
Large window and document perform much worse
dimensionality reduction only helps to improve document-based (a little)
Cluster quality
Compare output of clustering algorithm to gold standard classification
Two cluster tasks (ESSLLI 2008 workshop’s shared task)
concrete noun categorization (44 nouns)
2-way: natural, artefact
3-way: animal, vegetable, artefact
6-way: bird, groundAnimal, fruitTree, green, tool, vehicle
abstract/concrete noun discrimination (30 nouns)
2-way: hi, lo
Evaluation measures:
Entropy: distribution of classes within cluster (small = good)
Purity: ratio of largest class present in cluster (large = good)
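The two evaluation measures can be sketched as below. The slides give only informal definitions, so the exact form is an assumption: purity sums the size of the largest gold class in each cluster and divides by the number of items, and entropy is the size-weighted average of each cluster's class entropy, normalized by the log of the number of classes. The tiny gold standard reuses a few Dutch nouns from the slides with made-up class labels.

```python
import math
from collections import Counter

def purity_and_entropy(clusters, gold):
    """clusters: list of lists of items; gold: dict mapping item -> class label."""
    n = len(gold)
    q = len(set(gold.values()))          # number of gold classes (must be > 1)
    purity, entropy = 0.0, 0.0
    for members in clusters:
        class_counts = Counter(gold[w] for w in members)
        size = len(members)
        purity += max(class_counts.values()) / n
        h = -sum((c / size) * math.log(c / size) for c in class_counts.values())
        entropy += (size / n) * (h / math.log(q))   # normalized, weighted by size
    return purity, entropy

# Hypothetical gold labels for four nouns.
gold = {"kip": "animal", "ui": "vegetable", "peer": "vegetable", "lepel": "artefact"}
perfect = purity_and_entropy([["kip"], ["ui", "peer"], ["lepel"]], gold)
mixed = purity_and_entropy([["kip", "ui"], ["peer", "lepel"]], gold)
```

A clustering that matches the gold standard exactly gives purity 1 and entropy 0, the pattern reported as 1.000 and .000 in the results table.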
Cluster quality
                  2-way           3-way           6-way
model             ent     pur     ent     pur     ent     pur
document          .930    .614    .179    .932    .292    .682
window (w=par)    .911    .659    .541    .705    .377    .591
window (w=2)      .000    1.000   .213    .909    .206    .773
syntax-based      .000    1.000   .000    1.000   .153    .864
Cluster quality
Same tendencies as WordNet-based similarity
model with large window seems to extract topically related clusters:
aardappel ‘potato’, ananas ‘pineapple’, banaan ‘banana’, champignon ‘mushroom’, fles ‘bottle’, kers ‘cherry’, ketel ‘kettle’, kip ‘chicken’, kom ‘bowl’, lepel ‘spoon’, peer ‘pear’, sla ‘lettuce’, ui ‘onion’
Domain coherence
Coherence of semantic domain tags (available in Cornetto)
‘particular areas of human knowledge’ (politics, medicine, sports)
→ topical similarity
Ratio of most frequent domain tag (also in tagset of target word) over top 10 similar words
Domain coherence
model simtopic
document .394
window(w=par) .399
window(w=2) .414
syntax .441
baseline .048
Domain coherence
Syntax still scores best
Other models do not perform much worse
No real difference between small window and large window
→ large window and document do not extract tight similarity,