Explaining away ambiguity: Learning verb selectional preference with Bayesian networks

(1)

with Bayesian networks

Massimiliano Ciaramita and Mark Johnson

Cognitive and LinguisticSciences

Box 1978, Brown University

Providence, RI 02912,USA

massimiliano [email protected] [email protected]

Abstract

ThispaperpresentsaBayesianmodelfor

unsu-pervisedlearningofverbselectionalpreferences.

For each verb the model creates a Bayesian

network whose architecture is determined by

the lexical hierarchy of Wordnet and whose

parameters are estimated from a list of

verb-object pairs found from a corpus. \Explaining

away", a well-known propertyof Bayesian

net-works, helps the model deal in a natural

fash-ion with word sense ambiguity in the training

data. Ona wordsense disambiguationtest our

model performed better thanother state of the

art systems for unsupervised learning of

selec-tional preferences. Computational complexity

problems, ways ofimprovingthisapproachand

methodsforimplementing\explainingaway"in

other graphicalframeworksarediscussed.

1 Selectional preference and sense

ambiguity

Regularitiesofaverbwithrespecttothe

seman-tic class of its arguments (subject, object and

indirectobject) are called selectional

prefer-ences (SP) (Katz and Fodor, 1964; Chomsky,

1965;Johnson-Laird,1983). Theverbpilot

car-riestheinformationthatitsobjectwilllikelybe

some kindofvehicle;subjects oftheverbthink

tendtobehuman;andsubjectsoftheverbbark

tend to be dogs. For the sake of simplicity we

will focus on the verb-object relation although

the techniques we will describe can be applied

to other verb-argumentpairs.

WewouldliketothanktheBrownLaboratoryfor

Lin-guistic Information Processing; Thomas Hofmann; Elie

Bienenstock;PhilipResnik,whoprovideduswith

train-ingandtestdata;andDanielGarciaforhishelpwiththe

SMILElibraryofclasses forBayesiannetworksthatwe

usedfor ourexperiments. Thisresearchwas supported

ENTITY

something

FOOD

LIQUID

PHSYSICAL_OBJECT

BEVERAGE

LAND

earth

ISLAND

drink

COFFEE

JAVA-1

JAVA-2

java

ESPRESSO

BALI

island

espresso

bali

liquid

WATER

ALIMENT

TEA

beverage

entity

ORGANISM

object

ICE

land

ground

CAPE

coffee

food

Figure 1: A portionof Wordnet.

Models of the acquisition of SP are

impor-tantintheirownrightand have applicationsin

NaturalLanguageProcessing(NLP).The

selec-tionalpreferencesofa verbcanbeusedto infer

thepossiblemeaningsofanunknownargument

of a known verb; e.g., it might be possible to

inferthatxxxx isa kindofdog fromthe

follow-ing sentence: \The xxxx barked all night". In

parsingasentenceselectionalpreferencescanbe

usedtorankcompetingparses,providinga

par-tial measure of semantic well-formedness.

In-vestigatingSPmight helpusto understandthe

structureofthe mental lexicon.

SystemsforunsupervisedlearningofSP

usu-ally combine statistical and knowledge-based

approaches. The knowledge-base component

is typically a database that groups words into

classes. In the models we will see, the

knowl-edge base is Wordnet (Miller, 1990).

Word-net groups nouns into classes of synonyms

representing concepts, called synsets, e.g.,

(2)

be-sitive and asymmetrical relation, hyponymy,

is dened between synsets. A synset is a

hy-ponym of another synset if the former has the

latter asabroader concept;forexample,

BEV-ERAGE isahyponymofLIQUID.Figure1

de-pictsa portion ofthehierarchy.

The statistical component consists of

predicate-argument pairs extracted from a

corpusinwhichthesemanticclassofthewords

is not indicated. A trivial algorithm might

get a list of words that occurred as objects

of the verb and output the semantic classes

the words belong to according to Wordnet.

For example, if the verb drink occurred with

water and water2LIQUID,the modelwould

learnthatdrink selectsforLIQUID.AsResnik

(1997)andAbneyandLight(1999)havefound,

the main problem these systems face is the

presence of ambiguous words in the training

data. If the word java also occurred as an

object of drink, since java 2 BEVER AGE

and java 2 ISLAND,this model would learn

that drink selects for both BEVER AGE and

ISLAND.

More complex models have been proposed.

These models, though, deal with word sense

ambiguity by applying an unselective strategy

similarto theone above;i.e., they assumethat

ambiguouswordsprovideequal evidenceforall

their senses. These models choose as the

con-ceptstheverbselectsforthosethat arein

com-mon among several words (e.g., BEVERAGE

above). This strategyworksto theextent that

these overlapping senses are also the concepts

theverb selectsfor.

2 Previous approaches to learning

selectional preference

2.1 Resnik's model

Ourssystemiscloselyrelatedtothoseproposed

in(Resnik,1997)and (AbneyandLight,1999).

The fact that a predicate p selects for a class

c, given a syntactic relation r, can be

repre-sented as a relation, selects(p;r;c); e.g., that

eat selects for FOOD in object position can

be represented as selects(eat;object;FOOD).

In(Resnik,1997)selectionalpreferenceis

quan-tied by comparing the prior distribution of

a given class c appearing as an argument,

P(c), and the conditional probability of the

FOOD 7/4

FLESH

ESSENCE 1/4

1/4

COGNITION

1/4 FRUIT 1/2 BREAD 1/2

DAIRY 1/2

apple(1)

bagel(1)

cheese(1)

meat(1)

idea(0)

Figure 2: Simplied Wordnet. The numbers

next to the synsets represent the values of

freq(p;r;c)estimatedusing(3), thenumbersin

parenthesesrepresentthevaluesoffreq(p;r;w).

tic relation P(cjp;r), e.g., P(FOOD) and

P(FOODjeat;object). Therelativeentropy

be-tween P(c) and P(cjp;r) measures how much

the predicateconstrainsits arguments:

S(p;r)=D(P(cjp;r) jj P(c)) (1)

Resnik denes the selectional association of

apredicateforaparticularclassctobethe

por-tionoftheselectionalpreferencestrengthdueto

that class:

A(p;r;c)= 1

S(p;r)

P(cjp;r) log

P(cjp;r)

P(c) (2)

Here the main problem is the estimation of

P(cjp;r). Resnik suggests as a plausible

esti-mator ^

P(cjp;r) def

= freq(p;r;c)=freq(p;r): But

since themodel istrained on datathat arenot

sense-tagged, there is no obvious way to

esti-mate freq(p;r;c). Resnik suggests considering

each observation ofa wordasevidence foreach

of theclassesthe wordbelongsto,

freq(p;r;c) X

w2c

count(p;r;w)

classes(w)

(3)

where count(p;r;w) is the number of times

the word w occurred as an argument of p in

relation r, and classes(w) is the number of

classes w belongs to. For example, suppose

the system is trained on (eat,object) pairs and

the verb occurred once each with meat,

ap-ple, bagel, and cheese, and Wordnet is

simpli-ed as in Figure 2. An ambiguous word like

meat providesevidence also forclassesthat

ap-pear unrelated to those selected by the verb.

(3)

se-1\8

7\8

ESSENCE

FLESH

FRUIT

BREAD

cheese

Figure 3: The HMM version of the simple

ex-ample.

of the observed words, and hence will receive

the highest values for P(cjp;r). Using (3) we

nd that the highest frequency is in fact

as-sociatedwith FOOD:freq(eat;object;food)

1

However, some evidenceisfoundalsofor

COG-NITION: freq(eat;object;cognition) 1

4 and

P(COGNITIONjeat)=0:06.

2.2 Abney and Light's approach

Abney and Light (1999) pointed out that the

distribution of senses of an ambiguous word is

not uniform. They noticed also that it is not

clearhowtheprobabilityP(cjp;r)istobe

inter-pretedsince there isno explicitstochastic

gen-erationmodelinvolved.

They proposed a system that associates

a Hidden Markov Model (HMM) with each

predicate-relation pair (p,r). Transitions

be-tweensynsetstatesrepresent thehyponymy

re-lation, and ", theempty word, is emitted with

probability1;transitionsto analstate emita

word w with probability0 P(w) 1. T

ran-sition and emission probabilities are estimated

using the EM algorithm on training data that

consistofthenounsthatoccurredwiththeverb.

Abney and Light's model estimates P(cjp;r)

from the model trained for (p;r); the

distri-bution P(c) can be calculated from a model

trainedfor allnouns inthecorpus.

This model did not perform as well as

ex-pected. An ambiguous word in the model can

be generatedbymore thanone state sequence.

Abney and Light discovered that the EM

al-gorithm nds parameter values that associate

some probability mass with all the transitions

eralstatesequencesforthesameword,EMdoes

notselectoneofthemovertheothers. 1

Figure3

shows theparameters estimated byEM forthe

same example as above. The transition to the

COGNITION statehasbeenassigneda

proba-bilityof1=8becauseitispartofapossiblepath

to meat. The HMM model does not solve the

problem of the unselective distribution of the

frequency of occurrence of an ambiguous word

toall its senses. Abney and Light claimedthat

thisis aseriousproblem,particularlywhenthe

ambiguous word is a frequent one, and caused

the model to learn the wrong selectional

pref-erences. To correct this undesirable outcome

theyintroducedsome smoothingand balancing

techniques. However, even withthese

modica-tionstheirsystem'sperformancewasbelowthat

achieved byResnik's.

3 Bayesian networks

A Bayesian network (Pearl, 1988), or

Bayesianbeliefnetwork(BBN),consistsofaset

of variablesandasetof directededges

con-necting the variables. The variables and the

edges dene a directed acyclic graph (DAG)

where each variable is represented by a node.

Eachvariableisassociatedwithanitenumber

of(mutuallyexclusive)states. Toeach variable

A with parents B

1 ;:::;B

n

is attached a

condi-tional probability table (CPT) P(AjB

1 ;:::;B

n ).

Given a BBN, Bayesian inference can be used

to estimate marginal and posterior

proba-bilitiesgiventhe evidence at handand the

in-formationstoredintheCPTs,theprior

prob-abilities, by means of Bayes' rule, P(HjE) =

P(H)P(EjH)

P(E)

,whereH standsforhypothesis and

Eforevidence.

Bayesiannetworksdisplayanextremely

inter-estingpropertycalledexplainingaway. Word

senseambiguityintheprocessoflearningSP

de-nesaproblemthatmightbesolvedbyamodel

that implements an explaining away strategy.

Supposewe are learningthe selectional

prefer-enceofdrink, andthenetworkinFigure4isthe

1

As a matter of fact, for this HMM there are

(in-nitely)manyparametervaluesthatmaximizethe

like-lihoodofthetrainingdata;i.e., theparametersarenot

identiable. The intuitively correct solution is one of

them,butsoareinnitelymanyother,intuitively

(4)

ISLAND

BEVERAGE

water

java

Figure4: ABayesiannetworkforword

ambigu-ity.

knowledge base. The verb occurred with java

and water. This situation can be represented

asa Bayesiannetwork. The variablesISLAND

and BEVERAGE represent concepts in a

se-mantic hierarchy. The variablesjava andwater

standforpossibleinstantiationsoftheconcepts.

Allthe variablesare Boolean; i.e., they are

as-sociatedwithtwostates, true orfalse. Suppose

thefollowingCPTsdene thepriorsassociated

with each node. 2

The unconditional

probabili-tiesareP(I =true)=P(B=true)=0:01and

P(I =false)=P(B =false)=0:99, and the

CPTs forthechildnodesare

P(X=xjY1 =y1;Y2=y2)

I;B I;:B :I;B :I;:B

j=true 0.99 0.99 0.99 0.01

j=false 0.01 0.01 0.01 0.99

w=true 0.99 0.99 0.01 0.01

w=false 0.01 0.01 0.99 0.99

Thesevaluesmeanthattheoccurrenceofeither

concept isa priori unlikely. Ifeither concept is

true theword java islikelyto occur. Similarly,

ifBEVERAGEoccursitislikelytoobservealso

the word water. As the posterior probabilities

show,ifjavaoccurs,thebeliefsinbothconcepts

increase: P(Ijj) =P(Bjj)= 0:3355. However,

water providesevidence forBEVERAGE only.

Overall there is more evidence for the

hypoth-esis that the concept being expressed is

BEV-ERAGE and not ISLAND. Bayesian networks

implementthisinferencescheme;ifwecompute

the conditional probabilities given that both

wordsoccurred,weobtainP(Bjj;w)=0:98and

P(Ijj;w)=0:02. The new evidence caused the

\island"hypothesisto be explained away!

3.1 The relevance of priors

Explainingaway seems to dependon the

spec-ication of the prior probabilities. The priors

2

I, B, j and w abbreviate ISLAND, BEVERAGE,

COGNITION

FOOD

ESSENCE

FLESH

FRUIT

BREAD

DAIRY

cheese

bagel

apple

meat

idea

true

Figure 5: A Bayesian network for the simple

example.

dene the background knowledge available to

themodel relative to theconditional

probabili-tiesof the events represented by the variables,

butalso aboutthejoint distributionsof several

events. In the simple network above, we

de-ned the probability that either concept is

se-lected (i.e., that the corresponding variable is

true) to be extremely small. Intuitively, there

are many concepts and the probability of

ob-servinganyparticularoneissmall. Thismeans

that the joint probability of the two events is

much higher in the case in which only one of

themis true (0.0099)thanin thecase inwhich

they arebothtrue (0.0001). Therefore, via the

priors,weintroducedabias accordingto which

thehypothesisthat one concept isselected will

befavoredovertwoco-occurringones. Thisisa

generalpatternof Bayesiannetworks; theprior

causessimplerexplanationstobepreferredover

morecomplexones, and therebytheexplaining

awayeect.

4 A Bayesian network approach to

learning selectional preference

4.1 Structure and parameters of the

model

The hierarchy of nouns in Wordnet denes a

DAG. Its mapping into a BBN is

straightfor-ward. Each word or synset in Wordnet is a

node in the network. If A is a hyponym of B

thereisan arcinthenetwork fromBto A.All

thevariablesareBoolean. Asynsetnodeistrue

iftheverbselectsforthatclass. Awordnodeis

true if the word can appear as an argument of

the verb. The priors are dened following two

intuitive principles. First, it is unlikely that a

verb a priori selects for any particular synset.

(5)

its hyponyms, sayFRUIT.Thesame principles

applyto words: itislikely that awordappears

asanargumentoftheverbiftheverbselectsfor

any of its possible senses. On the other hand,

if the verb does not select for a synset, it is

unlikely thatthewordsinstantiating thesynset

occurasitsarguments. \Likely"and\unlikely"

are given numerical values that sum up to 1.

The following table denes the scheme for the

CPTsassociatedwitheachnodeinthenetwork;

p

i

(X) denotesthe ithparent ofthenode X.

P(X=xjp1(X)_;:::;_pn(X)=true)

x=true likely

x=false unlikely

P(X=xjp1(X)^;:::;^pn(X)=fal se)

x=true unlikely

x=false likely

For the root nodes, the table reduces to the

unconditional probability of the node. Now

we can test the model on the simple example

seen earlier. W +

is the set of words that

oc-curred with the verb. The nodes

correspond-ing to the words in W +

are set to true and

the others left unset. For the previous

ex-ample W +

= fmeat;apple;bagel;cheeseg, and

the corresponding nodes are set to true, as

de-picted in Figure 5. With likely and unlikely

respectively equal to 0.99 and 0.01, the

poste-rior probabilities are 3

P(Fjm;a;b;c) = 0:9899

and P(Cjm;a;b;c) = 0:0101. Explainingaway

works. The posterior probability of

COGNI-TION gets as low as its prior, whereas the

probability of FOOD goes up to almost 1. A

Bayesian network approach seems to actually

implementtheconservativestrategywethought

to be thecorrect one for unsupervisedlearning

of selectionalrestrictions.

4.2 Computational issues in building

BBNs based on Wordnet

The implementationof aBBN forthe wholeof

Wordnet facescomputational complexity

prob-lems typical of graphical models. A densely

connectedBBN presentstwokindsofproblems.

The rst is the storage of the CPTs. The size

ofa CPTgrows exponentiallywiththenumber

of parents of the node. 4

This problem can be

3

F, C, m,a, b andc respectively stand for FOOD,

COGNITION,meat,apple,bagel andcheese

4

Somewords inWordnethavemorethan20senses.

java

JAVA-1

JAVA-2

ISLAND

COFFEE

BEVERAGE

LAND

PHSYSICAL_OBJECT

LIQUID

FOOD

ENTITY

drink

Figure6: Thesubnetwork fordrink.

solvedbyoptimizingtherepresentationofthese

tables. In ourcasemostof theentries have the

same values, and a compact representation for

them can be found (much like the one used in

the noisy-ORmodel (Pearl,1988)).

A harder problem is performing inference.

The graphical structure of a BBN represents

the dependency relations among the random

variables of the network. The algorithms used

with BBNs usually perform inference by

dy-namic programming on the triangulated moral

graph. A lower bound on the number of

com-putations that arenecessary to modelthejoint

distribution over thevariables usingsuch

algo-rithmsis 2 jnj+1

,wherenis thesizeofthe

max-imal boundary set according to the visitation

schedule.

4.3 Subnetworks and balancing

Because ofthese problemswecould notbuilda

single BBN for Wordnet. Instead we simplied

thestructureofthemodelbybuildingasmaller

subnetworkforeachpredicate-argumentpair. A

subnetwork consists of the union of the sets of

ancestors of the words in W +

. Figure 6

pro-vides an exampleof theunion of these

\ances-tral subgraphs" of Wordnet for the words java

and drink (compare it withFigure 1).

This simplication does not aect the

com-putation of the distributions we are interested

in; that is, the marginals of the synset nodes.

A BBN provides a compact representation for

the joint distribution over the set of variables

senses.ThesizeofitsCPTistherefore2 26

.Storinga

ta-bleofoatnumbersforthisnodealonerequiresaround

(6)

inthenetwork. IfN =X

1 ;:::;X

n

isaBayesian

network withvariables X

1 ;:::;X

n

,its joint

dis-tributionP(N) is theproductof all the

condi-tionalprobabilitiesspeciedinthenetwork,

P(N)= Y

j P(X

j jpa(X

j

)) (4)

wherepa(X)isthesetofparentsofX. ABBN

generates a factorization of the joint

distribu-tion over its variables. Consider a network of

three nodesA;B;C witharcs fromA to B and

C. Itsjointdistributioncanbecharacterizedas

P(A;B;C) = P(A)P(BjA)P(CjA). If there is

no evidence forC thejointdistribution is

P(A;B;C) = P(A)P(BjA) X

C

P(CjA)

= P(A)P(BjA)

= P(A;B) (5)

Thenode C gets marginalizedout.

Marginaliz-ingoverachildlessnodeisequivalentto

remov-ing it with its connections from the network.

Thereforethesubnetworksareequivalenttothe

whole network; i.e., they have the same joint

distribution.

Our model computes the value of P(cjp;r),

but we did not compute the prior P(c) for all

nouns in the corpus. We assumed this to be

a constant, equal to the unlikely value, for all

classes. In a BBN the values of the marginals

increasewiththeirdistancefromtherootnodes.

Toavoidundesiredbias(seetable ofresults)we

dened a balancing formula that adjusted the

conditional probabilitiesof theCPTs insuch a

way that we got all the marginals to have

ap-proximately thesame value. 5

5 Experiments and results 6

5.1 Learning of selectional preferences

When trained on predicate-argument pairs

ex-tracted from a largecorpus, theSan Jose

Mer-curyCorpus,themodelgave very goodresults.

The corpus contains about 1.3 million

verb-objecttokens. Theobtainedrankingsofclasses

accordingto theirposteriormarginal

probabili-ties were good. Table 1 shows the topand the

5

Moredetailscanbefoundinanextendedversionof

thepaper: www.cog.brown.edu/massi/.

6

Forthese experimentswe usedvaluesfor the likely

1 VEHICLE 0:9995

2 VESSEL 0:9893

3 AIRCRAFT 0:9937

4 AIRPLANE 0:9500

5 SHIP 0:9114

... ... ...

255 CONCEPT 0:1002

256 LAW 0:1001

257 PHILOSOPHY 0:1000

258 JURISPRUDENCE 0:1000

Table 1: Resultsfor(maneuver,object).

bottom of the list of synsets for the verb

ma-neuver. Themodellearnedthatmaneuver

\se-lects" formembers of the class VEHICLE and

of other plausible classes, hyponyms of

VEHI-CLE. It also learned that the verb does not

select for direct objects that are members of

classes,likeCONCEPT orPHILOSOPHY.

5.2 Word sense disambiguation test

A direct evaluation measure for unsupervised

learning of SP models does not exist. These

models are instead evaluated on a word-sense

disambiguation test (WSD). The idea is that

systems that learnSP produce word sense

dis-ambiguationas aside-eect. Java might be

in-terpretedastheisland orthebeverage,butina

context like \thetourists ew to Java" the

for-merseemsmorecorrect, becausey couldselect

forgeographic locations but not for beverages.

A system trained on a predicate p should be

able to disambiguate arguments of p if it has

learnedits selectionalrestrictions.

We tested our model using the test and

trainingdata developed byResnik (seeResnik,

1997). The same test was used in (Abney

and Light, 1999). The training data consists

of predicate-object counts extracted from 4/5

of the Brown corpus (about 1M words). The

test set consists of predicate-object pairs from

the remaining 1/5 of the corpus, which has

beenmanually sense-annotatedbyWordnet

re-searchers. The results are shown in Table 2.

The baseline algorithm chooses at random one

of the multiple senses of an ambiguous word.

The \rst sense" method always chooses the

(7)

Baseline 28:5%

Abney and Light (HMM smoothed) 35:6%

Abney and Light (HMM balanced) 42:3%

Resnik 44:3%

BBN (withoutbalancing) 45:6%

BBN (with balancing) 51:4%

FirstSense 82:5%

Table2: Results

formed better than thestate of the art models

forunsupervisedlearningof SP.It seemsto

de-neabetter estimator forP(cjp;r).

It isremarkablethatthemodelachieved this

result making only a limited use of

distribu-tional information. A noun is in W +

if it

oc-curredat leastonceinthetrainingset, butthe

systemdoesnotknowifitoccurredonceor

sev-eral times; either it occurred or it didn't. The

model did not suer too much from this

limi-tation during this task. This is probably due

to the sparseness of the training data for the

test. For each verb the average number of

ob-ject types is 3:3, for each of them the average

numberoftokens is1:3; i.e., mostof thewords

in the training data only occurred once. For

thistrainingset we alsotested a version of the

modelthatbuiltawordnode foreach observed

objecttokenandthereforeintegratedthe

distri-butional information. On theWSD test it

per-formedexactlythesame asthesimplerversion.

When trainedontheSan JoseMercuryCorpus

the model performed worse on the WSD test

(35:8%). This is nottoo surprisingconsidering

thedierencesbetweentheSJMandtheBrown

corpora: theformer, a recent newswirecorpus;

the latter, an older, balanced corpus. Another

importantfactoristhedierentrelevanceof

dis-tributionalinformation. Thetrainingdatafrom

the SJM Corpus are much richer and noisier

than the Brown data. Here the frequency

in-formation is probablycrucial; however, in this

casewe couldnotimplementthesimplescheme

above.

5.3 Conclusion

Explaining away implements a cognitively

at-tractive and successful strategy. A

straightfor-make full use of the distributional information

present in the training data; we only partially

achieved this. Bayesian networks are usually

confronted with a single presentation of

evi-dence. Their extension to multiple evidence is

not trivial. We believe the model can be

ex-tendedinthisdirection. Possiblythereare

sev-eral ways to do so(multinomialsampling,

ded-icatedimplementations,etc.). However, we

be-lieve that the most relevant nding of this

re-search might be that \explaining away" is not

only a property of Bayesian networks but of

Bayesianinference ingeneraland thatit might

be implementable in other kinds of graphical

models. Weobservedthatthepropertyseemsto

dependon the specicationof theprior

proba-bilities. WefoundthattheHMMmodelof

(Ab-neyandLight,1999)wasunidentiable;thatis,

thereareseveralsolutionsfortheparametersof

themodel,includingthedesired one. Our

intu-itionisthat itshouldbepossibleto implement

\explaining away" in a HMM with priors, so

thatitwouldpreferonlyone ora fewsolutions

overmany. Thismodelwouldhave alsothe

ad-vantageof beingcomputationally simpler.

References

S.Abney and M.Light. 1999. Hidinga

seman-tic hierarchyin a markov model. In Procee

d-ingsof theWorkshoponUnsupervised

Learn-ingin Natural Language Processing, ACL.

N. Chomsky. 1965. Aspects of the Theory of

Syntax. MIT Press,Cambridge, MA.

P.N.Johnson-Laird. 1983. MentalModels:

To-wardsaCognitiveScienceofLanguage,

Infer-ence,andConsciousness. HarvardUniversity

Press.

J. J. Katz and J. A. Fodor. 1964. The

struc-ture of a semantic theory. In J. J. Katz and

J. A. Fodor, editors, The Structure of

Lan-guage. Prentice-Hall.

G. Miller. 1990. Wordnet: An on-line lexical

database. International Journal of

Lexicog-raphy, 3(4).

J.Pearl. 1988. ProbabilisticReasoningin

Intel-ligent Systems: Networks of Plausible

Infer-ence. MorganKaufman.

P. Resnik. 1997. Selectional preference and

sense disambiguation. In Proceedings of the