with Bayesian networks
Massimiliano Ciaramita and Mark Johnson
Cognitive and Linguistic Sciences
Box 1978, Brown University
Providence, RI 02912, USA
massimiliano [email protected] [email protected]
Abstract
This paper presents a Bayesian model for unsupervised learning of verb selectional preferences. For each verb the model creates a Bayesian network whose architecture is determined by the lexical hierarchy of Wordnet and whose parameters are estimated from a list of verb-object pairs extracted from a corpus. "Explaining away", a well-known property of Bayesian networks, helps the model deal in a natural fashion with word sense ambiguity in the training data. On a word sense disambiguation test our model performed better than other state-of-the-art systems for unsupervised learning of selectional preferences. Computational complexity problems, ways of improving this approach, and methods for implementing "explaining away" in other graphical frameworks are discussed.
1 Selectional preference and sense
ambiguity
Regularities of a verb with respect to the semantic class of its arguments (subject, object and indirect object) are called selectional preferences (SP) (Katz and Fodor, 1964; Chomsky, 1965; Johnson-Laird, 1983). The verb pilot carries the information that its object will likely be some kind of vehicle; subjects of the verb think tend to be human; and subjects of the verb bark tend to be dogs. For the sake of simplicity we will focus on the verb-object relation, although the techniques we will describe can be applied to other verb-argument pairs.
We would like to thank the Brown Laboratory for Linguistic Information Processing; Thomas Hofmann; Elie Bienenstock; Philip Resnik, who provided us with training and test data; and Daniel Garcia for his help with the SMILE library of classes for Bayesian networks that we used for our experiments. This research was supported [...]
Figure 1: A portion of Wordnet.
Models of the acquisition of SP are important in their own right and have applications in Natural Language Processing (NLP). The selectional preferences of a verb can be used to infer the possible meanings of an unknown argument of a known verb; e.g., it might be possible to infer that xxxx is a kind of dog from the following sentence: "The xxxx barked all night". In parsing a sentence, selectional preferences can be used to rank competing parses, providing a partial measure of semantic well-formedness. Investigating SP might help us to understand the structure of the mental lexicon.
Systems for unsupervised learning of SP usually combine statistical and knowledge-based approaches. The knowledge-base component is typically a database that groups words into classes. In the models we will see, the knowledge base is Wordnet (Miller, 1990). Wordnet groups nouns into classes of synonyms representing concepts, called synsets, e.g., [...] A transitive and asymmetrical relation, hyponymy, is defined between synsets. A synset is a hyponym of another synset if the former has the latter as a broader concept; for example, BEVERAGE is a hyponym of LIQUID. Figure 1 depicts a portion of the hierarchy.
The statistical component consists of predicate-argument pairs extracted from a corpus in which the semantic class of the words is not indicated. A trivial algorithm might get a list of words that occurred as objects of the verb and output the semantic classes the words belong to according to Wordnet. For example, if the verb drink occurred with water and $\text{water} \in \text{LIQUID}$, the model would learn that drink selects for LIQUID. As Resnik (1997) and Abney and Light (1999) have found, the main problem these systems face is the presence of ambiguous words in the training data. If the word java also occurred as an object of drink, since $\text{java} \in \text{BEVERAGE}$ and $\text{java} \in \text{ISLAND}$, this model would learn that drink selects for both BEVERAGE and ISLAND.
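The failure mode of the trivial algorithm can be made concrete with a minimal sketch. The word-to-class map below is a hypothetical stand-in for a Wordnet lookup, and the function name is illustrative only; neither comes from the paper.

```python
# Toy word-to-class map; a hypothetical stand-in for a Wordnet lookup.
WORD_CLASSES = {
    "water": {"LIQUID"},
    "java": {"BEVERAGE", "ISLAND"},
}


def trivial_selected_classes(objects, word_classes=WORD_CLASSES):
    """Return every class that any observed object word belongs to."""
    selected = set()
    for word in objects:
        # Each word contributes all of its classes, whether or not the
        # verb actually selects for them.
        selected |= word_classes.get(word, set())
    return selected


learned = trivial_selected_classes(["water", "java"])
print(sorted(learned))
```

Because java is ambiguous, ISLAND ends up among the classes "learned" for drink, which is exactly the problem described above.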
More complex models have been proposed. These models, though, deal with word sense ambiguity by applying an unselective strategy similar to the one above; i.e., they assume that ambiguous words provide equal evidence for all their senses. These models choose as the concepts the verb selects for those that are in common among several words (e.g., BEVERAGE above). This strategy works to the extent that these overlapping senses are also the concepts the verb selects for.
2 Previous approaches to learning
selectional preference
2.1 Resnik's model
Our system is closely related to those proposed in (Resnik, 1997) and (Abney and Light, 1999). The fact that a predicate $p$ selects for a class $c$, given a syntactic relation $r$, can be represented as a relation, $\mathrm{selects}(p, r, c)$; e.g., that eat selects for FOOD in object position can be represented as $\mathrm{selects}(\text{eat}, \text{object}, \text{FOOD})$. In (Resnik, 1997) selectional preference is quantified by comparing the prior distribution of a given class $c$ appearing as an argument, $P(c)$, and the conditional probability of the same class given the predicate and the syntactic relation, $P(c \mid p, r)$, e.g., $P(\text{FOOD})$ and $P(\text{FOOD} \mid \text{eat}, \text{object})$. The relative entropy between $P(c)$ and $P(c \mid p, r)$ measures how much the predicate constrains its arguments:

$$S(p, r) = D\big(P(c \mid p, r) \,\|\, P(c)\big) \tag{1}$$

Figure 2: Simplified Wordnet. The numbers next to the synsets represent the values of freq(p, r, c) estimated using (3); the numbers in parentheses represent the values of freq(p, r, w). Synset values: FOOD 7/4; FRUIT 1/2; BREAD 1/2; DAIRY 1/2; FLESH 1/4; ESSENCE 1/4; COGNITION 1/4. Word values: apple (1), bagel (1), cheese (1), meat (1), idea (0).
Resnik defines the selectional association of a predicate for a particular class $c$ to be the portion of the selectional preference strength due to that class:

$$A(p, r, c) = \frac{1}{S(p, r)}\, P(c \mid p, r) \log \frac{P(c \mid p, r)}{P(c)} \tag{2}$$
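Equations (1) and (2) can be sketched in a few lines. The distributions below are made-up numbers chosen only for illustration (they are not estimates from any corpus), and the function names are hypothetical.

```python
import math


def preference_strength(posterior, prior):
    """S(p, r): relative entropy D(P(c | p, r) || P(c)) over classes."""
    return sum(
        q * math.log(q / prior[c]) for c, q in posterior.items() if q > 0
    )


def selectional_association(c, posterior, prior):
    """A(p, r, c): the share of S(p, r) contributed by class c."""
    q = posterior[c]
    if q == 0:
        return 0.0
    return q * math.log(q / prior[c]) / preference_strength(posterior, prior)


# Made-up distributions for illustration only (not from the paper).
prior = {"FOOD": 0.2, "VEHICLE": 0.4, "COGNITION": 0.4}
posterior = {"FOOD": 0.8, "VEHICLE": 0.1, "COGNITION": 0.1}

strength = preference_strength(posterior, prior)
assoc = {c: selectional_association(c, posterior, prior) for c in prior}
print(strength, assoc)
```

By construction the associations sum to 1; classes that are less probable given the predicate than a priori receive a negative association.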
Here the main problem is the estimation of $P(c \mid p, r)$. Resnik suggests as a plausible estimator $\hat{P}(c \mid p, r) \stackrel{\mathrm{def}}{=} \mathrm{freq}(p, r, c) / \mathrm{freq}(p, r)$. But since the model is trained on data that are not sense-tagged, there is no obvious way to estimate $\mathrm{freq}(p, r, c)$. Resnik suggests considering each observation of a word as evidence for each of the classes the word belongs to,

$$\mathrm{freq}(p, r, c) \approx \sum_{w \in c} \frac{\mathrm{count}(p, r, w)}{\mathrm{classes}(w)} \tag{3}$$
where $\mathrm{count}(p, r, w)$ is the number of times the word $w$ occurred as an argument of $p$ in relation $r$, and $\mathrm{classes}(w)$ is the number of classes $w$ belongs to. For example, suppose the system is trained on (eat, object) pairs and the verb occurred once each with meat, apple, bagel, and cheese, and Wordnet is simplified as in Figure 2. An ambiguous word like meat provides evidence also for classes that appear unrelated to those selected by the verb. [...] of the observed words, and hence will receive the highest values for $P(c \mid p, r)$. Using (3) we find that the highest frequency is in fact associated with FOOD: $\mathrm{freq}(\text{eat}, \text{object}, \text{FOOD}) \approx 1\tfrac{3}{4}$ (the value 7/4 shown in Figure 2). However, some evidence is found also for COGNITION: $\mathrm{freq}(\text{eat}, \text{object}, \text{COGNITION}) \approx \tfrac{1}{4}$ and $P(\text{COGNITION} \mid \text{eat}) = 0.06$.

Figure 3: The HMM version of the simple example.
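The estimate in (3) for the (eat, object) example can be reproduced with a short sketch. The class lists below are an assumed encoding of the simplified hierarchy of Figure 2 (each word is listed with its classes and their ancestors); the function name is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Classes of each word in the simplified hierarchy of Figure 2; a class
# list includes ancestors. This encoding is an assumption for illustration.
CLASSES = {
    "apple": ["FRUIT", "FOOD"],
    "bagel": ["BREAD", "FOOD"],
    "cheese": ["DAIRY", "FOOD"],
    "meat": ["FLESH", "FOOD", "ESSENCE", "COGNITION"],
    "idea": ["COGNITION"],
}


def class_frequencies(counts, classes=CLASSES):
    """Spread each word count uniformly over the word's classes, as in (3)."""
    freq = defaultdict(Fraction)
    for word, n in counts.items():
        share = Fraction(n, len(classes[word]))
        for c in classes[word]:
            freq[c] += share
    return dict(freq)


counts = {"apple": 1, "bagel": 1, "cheese": 1, "meat": 1, "idea": 0}
freq = class_frequencies(counts)
total = sum(counts.values())             # freq(p, r)
p_cognition = freq["COGNITION"] / total  # estimate of P(COGNITION | eat)
print(freq["FOOD"], freq["COGNITION"], float(p_cognition))
```

With these assumptions FOOD receives 7/4 and COGNITION 1/4, and the estimate for COGNITION is 1/16, which rounds to the 0.06 reported above.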
2.2 Abney and Light's approach
Abney and Light (1999) pointed out that the distribution of senses of an ambiguous word is not uniform. They noticed also that it is not clear how the probability $P(c \mid p, r)$ is to be interpreted since there is no explicit stochastic generation model involved.
They proposed a system that associates a Hidden Markov Model (HMM) with each predicate-relation pair $(p, r)$. Transitions between synset states represent the hyponymy relation, and $\varepsilon$, the empty word, is emitted with probability 1; transitions to a final state emit a word $w$ with probability $0 < P(w) \le 1$. Transition and emission probabilities are estimated using the EM algorithm on training data that consist of the nouns that occurred with the verb. Abney and Light's model estimates $P(c \mid p, r)$ from the model trained for $(p, r)$; the distribution $P(c)$ can be calculated from a model trained for all nouns in the corpus.
This model did not perform as well as expected. An ambiguous word in the model can be generated by more than one state sequence. Abney and Light discovered that the EM algorithm finds parameter values that associate some probability mass with all the transitions [...] several state sequences for the same word, EM does not select one of them over the others.¹ Figure 3 shows the parameters estimated by EM for the same example as above. The transition to the COGNITION state has been assigned a probability of 1/8 because it is part of a possible path to meat. The HMM model does not solve the problem of the unselective distribution of the frequency of occurrence of an ambiguous word to all its senses. Abney and Light claimed that this is a serious problem, particularly when the ambiguous word is a frequent one, and caused the model to learn the wrong selectional preferences. To correct this undesirable outcome they introduced some smoothing and balancing techniques. However, even with these modifications their system's performance was below that achieved by Resnik's.
3 Bayesian networks
A Bayesian network (Pearl, 1988), or Bayesian belief network (BBN), consists of a set of variables and a set of directed edges connecting the variables. The variables and the edges define a directed acyclic graph (DAG) where each variable is represented by a node. Each variable is associated with a finite number of (mutually exclusive) states. To each variable $A$ with parents $B_1, \ldots, B_n$ is attached a conditional probability table (CPT) $P(A \mid B_1, \ldots, B_n)$. Given a BBN, Bayesian inference can be used to estimate marginal and posterior probabilities given the evidence at hand and the information stored in the CPTs, the prior probabilities, by means of Bayes' rule, $P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(E)}$, where $H$ stands for hypothesis and $E$ for evidence.
¹ As a matter of fact, for this HMM there are (infinitely) many parameter values that maximize the likelihood of the training data; i.e., the parameters are not identifiable. The intuitively correct solution is one of them, but so are infinitely many other, intuitively [...]

Figure 4: A Bayesian network for word ambiguity.

Bayesian networks display an extremely interesting property called explaining away. Word sense ambiguity in the process of learning SP defines a problem that might be solved by a model that implements an explaining away strategy. Suppose we are learning the selectional preference of drink, and the network in Figure 4 is the knowledge base. The verb occurred with java and water. This situation can be represented as a Bayesian network. The variables ISLAND and BEVERAGE represent concepts in a semantic hierarchy. The variables java and water stand for possible instantiations of the concepts. All the variables are Boolean; i.e., they are associated with two states, true or false. Suppose the following CPTs define the priors associated with each node.² The unconditional probabilities are $P(I = \text{true}) = P(B = \text{true}) = 0.01$ and $P(I = \text{false}) = P(B = \text{false}) = 0.99$, and the CPTs for the child nodes are
P(X = x | Y1 = y1, Y2 = y2)
             I, B     I, ¬B    ¬I, B    ¬I, ¬B
j = true     0.99     0.99     0.99     0.01
j = false    0.01     0.01     0.01     0.99
w = true     0.99     0.01     0.99     0.01
w = false    0.01     0.99     0.01     0.99
These values mean that the occurrence of either concept is a priori unlikely. If either concept is true the word java is likely to occur. Similarly, if BEVERAGE occurs it is likely to observe also the word water. As the posterior probabilities show, if java occurs, the beliefs in both concepts increase: $P(I \mid j) = P(B \mid j) = 0.3355$. However, water provides evidence for BEVERAGE only. Overall there is more evidence for the hypothesis that the concept being expressed is BEVERAGE and not ISLAND. Bayesian networks implement this inference scheme; if we compute the conditional probabilities given that both words occurred, we obtain $P(B \mid j, w) = 0.98$ and $P(I \mid j, w) = 0.02$. The new evidence caused the "island" hypothesis to be explained away!
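The posteriors quoted above can be checked by brute-force enumeration over the two concept variables. The following is a minimal sketch, assuming (as the text states) that java is likely if either ISLAND or BEVERAGE is true while water depends on BEVERAGE only. The function names are illustrative and are not the paper's implementation.

```python
from itertools import product

P_TRUE = 0.01        # prior that a concept (ISLAND or BEVERAGE) is true
HI, LO = 0.99, 0.01  # "likely" and "unlikely" CPT entries


def p_java(i, b):
    # java is likely if either concept holds
    return HI if (i or b) else LO


def p_water(b):
    # water depends on BEVERAGE only (assumption stated in the text)
    return HI if b else LO


def posterior(concept, water_observed):
    """P(concept = true | java = true [, water = true]) by enumeration."""
    num = den = 0.0
    for i, b in product([True, False], repeat=2):
        joint = (P_TRUE if i else 1 - P_TRUE) * (P_TRUE if b else 1 - P_TRUE)
        joint *= p_java(i, b)
        if water_observed:
            joint *= p_water(b)
        den += joint
        if (i if concept == "ISLAND" else b):
            num += joint
    return num / den


p_island_j = posterior("ISLAND", water_observed=False)     # about 0.3356
p_bev_jw = posterior("BEVERAGE", water_observed=True)      # about 0.98
p_island_jw = posterior("ISLAND", water_observed=True)     # about 0.02
print(p_island_j, p_bev_jw, p_island_jw)
```

Observing java alone raises both hypotheses equally; adding water pushes BEVERAGE close to 1 and ISLAND back toward its prior, which is the explaining away effect.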
3.1 The relevance of priors
Explaining away seems to depend on the specification of the prior probabilities. The priors define the background knowledge available to the model relative to the conditional probabilities of the events represented by the variables, but also about the joint distributions of several events. In the simple network above, we defined the probability that either concept is selected (i.e., that the corresponding variable is true) to be extremely small. Intuitively, there are many concepts and the probability of observing any particular one is small. This means that the joint probability of the two events is much higher in the case in which only one of them is true ($0.01 \times 0.99 = 0.0099$) than in the case in which they are both true ($0.01 \times 0.01 = 0.0001$). Therefore, via the priors, we introduced a bias according to which the hypothesis that one concept is selected will be favored over two co-occurring ones. This is a general pattern of Bayesian networks; the prior causes simpler explanations to be preferred over more complex ones, and thereby the explaining away effect.

² I, B, j and w abbreviate ISLAND, BEVERAGE, java and water.

Figure 5: A Bayesian network for the simple example.
4 A Bayesian network approach to
learning selectional preference
4.1 Structure and parameters of the
model
The hierarchy of nouns in Wordnet defines a DAG. Its mapping into a BBN is straightforward. Each word or synset in Wordnet is a node in the network. If A is a hyponym of B there is an arc in the network from B to A. All the variables are Boolean. A synset node is true if the verb selects for that class. A word node is true if the word can appear as an argument of the verb. The priors are defined following two intuitive principles. First, it is unlikely that a verb a priori selects for any particular synset. [...] its hyponyms, say FRUIT. The same principles apply to words: it is likely that a word appears as an argument of the verb if the verb selects for any of its possible senses. On the other hand, if the verb does not select for a synset, it is unlikely that the words instantiating the synset occur as its arguments. "Likely" and "unlikely" are given numerical values that sum up to 1.
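The following table defines the scheme for the CPTs associated with each node in the network; $p_i(X)$ denotes the $i$th parent of the node $X$.

$P(X = x \mid p_1(X) \lor \ldots \lor p_n(X) = \text{true})$   (at least one parent is true)
    x = true     likely
    x = false    unlikely
$P(X = x \mid p_1(X) \land \ldots \land p_n(X) = \text{false})$   (all parents are false)
    x = true     unlikely
    x = false    likely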
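The CPT scheme can be written down as a small helper. This is a sketch under the assumption that "likely" and "unlikely" take the values 0.99 and 0.01 used in the running example; `make_cpt` is a hypothetical name, not part of any library.

```python
from itertools import product

LIKELY, UNLIKELY = 0.99, 0.01  # the values used in the running example


def make_cpt(n_parents, likely=LIKELY, unlikely=UNLIKELY):
    """Map each parent assignment to P(X = true | parents).

    X is likely to be true if at least one parent is true and unlikely
    otherwise. With no parents (a root node) the table has a single
    entry, the prior, which is `unlikely`.
    """
    cpt = {}
    for parents in product([True, False], repeat=n_parents):
        cpt[parents] = likely if any(parents) else unlikely
    return cpt


cpt = make_cpt(2)
for parents, p_true in cpt.items():
    print(parents, p_true, round(1 - p_true, 2))
```

P(X = false | parents) is simply one minus the stored entry, so only the "true" column needs to be kept.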
For the root nodes, the table reduces to the unconditional probability of the node. Now we can test the model on the simple example seen earlier. $W^+$ is the set of words that occurred with the verb. The nodes corresponding to the words in $W^+$ are set to true and the others left unset. For the previous example $W^+ = \{\text{meat}, \text{apple}, \text{bagel}, \text{cheese}\}$, and the corresponding nodes are set to true, as depicted in Figure 5. With likely and unlikely respectively equal to 0.99 and 0.01, the posterior probabilities are³ $P(F \mid m, a, b, c) = 0.9899$ and $P(C \mid m, a, b, c) = 0.0101$. Explaining away works. The posterior probability of COGNITION gets as low as its prior, whereas the probability of FOOD goes up to almost 1. A Bayesian network approach seems to actually implement the conservative strategy we thought to be the correct one for unsupervised learning of selectional restrictions.
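The qualitative behaviour described above can be explored with a brute-force sketch. The network structure below is an assumption reconstructed from the simplified hierarchy (FOOD above FLESH, FRUIT, BREAD and DAIRY; COGNITION above ESSENCE; meat under both FLESH and ESSENCE), and the enumeration is a generic stand-in for the inference library used in the paper, so the exact figures it produces need not match the reported ones digit for digit.

```python
from itertools import product

LIKELY, UNLIKELY = 0.99, 0.01

# Assumed structure of the example network: node -> list of parents.
PARENTS = {
    "FOOD": [], "COGNITION": [],
    "FLESH": ["FOOD"], "FRUIT": ["FOOD"], "BREAD": ["FOOD"],
    "DAIRY": ["FOOD"], "ESSENCE": ["COGNITION"],
    "meat": ["FLESH", "ESSENCE"], "apple": ["FRUIT"],
    "bagel": ["BREAD"], "cheese": ["DAIRY"], "idea": ["COGNITION"],
}


def joint(assign):
    """Product of the CPT entries for one full truth assignment."""
    p = 1.0
    for node, parents in PARENTS.items():
        p_true = LIKELY if any(assign[q] for q in parents) else UNLIKELY
        p *= p_true if assign[node] else 1.0 - p_true
    return p


def posterior(query, evidence):
    """P(query = true | evidence) by summing over all assignments."""
    nodes = list(PARENTS)
    num = den = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assign = dict(zip(nodes, values))
        if any(assign[n] != v for n, v in evidence.items()):
            continue
        p = joint(assign)
        den += p
        if assign[query]:
            num += p
    return num / den


evidence = {w: True for w in ("meat", "apple", "bagel", "cheese")}
p_food = posterior("FOOD", evidence)
p_cognition = posterior("COGNITION", evidence)
print(p_food, p_cognition)
```

Under these assumptions the posterior of FOOD comes out close to 1, while that of COGNITION stays close to its prior of 0.01.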
4.2 Computational issues in building
BBNs based on Wordnet
The implementation of a BBN for the whole of Wordnet faces computational complexity problems typical of graphical models. A densely connected BBN presents two kinds of problems. The first is the storage of the CPTs. The size of a CPT grows exponentially with the number of parents of the node.⁴ This problem can be solved by optimizing the representation of these tables. In our case most of the entries have the same values, and a compact representation for them can be found (much like the one used in the noisy-OR model (Pearl, 1988)).

Figure 6: The subnetwork for drink.

³ F, C, m, a, b and c respectively stand for FOOD, COGNITION, meat, apple, bagel and cheese.

⁴ Some words in Wordnet have more than 20 senses. [...]
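The storage problem, and one possible compact representation, can be sketched as follows. The sketch assumes the "likely/unlikely" CPT scheme of Section 4.1, in which an entry depends only on whether any parent is true; the class `OrCPT` and the function `full_cpt_entries` are hypothetical names, and this is not necessarily the representation the authors used.

```python
def full_cpt_entries(n_parents):
    """Entries in an explicit table for a Boolean node: one per state of
    the node and per configuration of its Boolean parents."""
    return 2 ** (n_parents + 1)


class OrCPT:
    """Compact CPT: two stored numbers, whatever the number of parents."""

    def __init__(self, likely=0.99, unlikely=0.01):
        self.likely = likely
        self.unlikely = unlikely

    def p_true(self, parent_values):
        # Only the disjunction of the parents matters.
        return self.likely if any(parent_values) else self.unlikely


for n in (2, 10, 20):
    print(n, full_cpt_entries(n))

compact = OrCPT()
print(compact.p_true([False] * 19 + [True]))
```

An explicit table for a node with 20 parents already has over two million entries, whereas the compact form stores two numbers and computes each entry on demand.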
A harder problem is performing inference. The graphical structure of a BBN represents the dependency relations among the random variables of the network. The algorithms used with BBNs usually perform inference by dynamic programming on the triangulated moral graph. A lower bound on the number of computations that are necessary to model the joint distribution over the variables using such algorithms is $2^{|n|+1}$, where $n$ is the size of the maximal boundary set according to the visitation schedule.
4.3 Subnetworks and balancing
Because of these problems we could not build a single BBN for Wordnet. Instead we simplified the structure of the model by building a smaller subnetwork for each predicate-argument pair. A subnetwork consists of the union of the sets of ancestors of the words in $W^+$. Figure 6 provides an example of the union of these "ancestral subgraphs" of Wordnet for the words java and drink (compare it with Figure 1).
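Collecting the union of the ancestral subgraphs amounts to a graph traversal from the observed words toward the roots. The parent map below is a hypothetical fragment loosely modelled on Figures 1 and 6; its exact links are assumptions made for illustration, as is the function name.

```python
# Hypothetical child -> parents map for a fragment of the hierarchy.
PARENTS = {
    "java": ["JAVA-1", "JAVA-2"],
    "JAVA-1": ["ISLAND"],
    "JAVA-2": ["COFFEE"],
    "ISLAND": ["LAND"],
    "LAND": ["PHYSICAL_OBJECT"],
    "PHYSICAL_OBJECT": ["ENTITY"],
    "COFFEE": ["BEVERAGE"],
    "BEVERAGE": ["LIQUID", "FOOD"],
    "LIQUID": ["ENTITY"],
    "FOOD": ["ENTITY"],
    "drink": ["BEVERAGE"],
    "ENTITY": [],
    "TEA": ["BEVERAGE"],
    "tea": ["TEA"],
}


def ancestral_subgraph(words, parents=PARENTS):
    """Return the observed words together with all of their ancestors."""
    seen = set()
    stack = list(words)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, []))
    return seen


sub = ancestral_subgraph(["java", "drink"])
print(sorted(sub))
```

Nodes that are not ancestors of any observed word (here TEA and tea) are left out of the subnetwork.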
⁴ (continued) [...] senses. The size of its CPT is therefore $2^{26}$. Storing a table of float numbers for this node alone requires around [...]

This simplification does not affect the computation of the distributions we are interested in; that is, the marginals of the synset nodes. A BBN provides a compact representation for the joint distribution over the set of variables in the network. If $N = X_1, \ldots, X_n$ is a Bayesian network with variables $X_1, \ldots, X_n$, its joint distribution $P(N)$ is the product of all the conditional probabilities specified in the network,

$$P(N) = \prod_j P(X_j \mid \mathrm{pa}(X_j)) \tag{4}$$
where $\mathrm{pa}(X)$ is the set of parents of $X$. A BBN generates a factorization of the joint distribution over its variables. Consider a network of three nodes $A$, $B$, $C$ with arcs from $A$ to $B$ and $C$. Its joint distribution can be characterized as $P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A)$. If there is no evidence for $C$, summing the joint distribution over $C$ gives

$$\sum_C P(A, B, C) = P(A)\,P(B \mid A) \sum_C P(C \mid A) = P(A)\,P(B \mid A) = P(A, B) \tag{5}$$

The node $C$ gets marginalized out. Marginalizing over a childless node is equivalent to removing it with its connections from the network. Therefore the subnetworks are equivalent to the whole network; i.e., they have the same joint distribution.
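The three-node argument can be verified numerically. The CPT values below are arbitrary illustrative numbers (not taken from the paper), and the helper names are hypothetical.

```python
from itertools import product

P_A = 0.01                               # P(A = true)
P_B_GIVEN_A = {True: 0.99, False: 0.01}  # P(B = true | A)
P_C_GIVEN_A = {True: 0.99, False: 0.01}  # P(C = true | A)


def bern(p_true, value):
    """Probability of a Boolean value given P(value = true)."""
    return p_true if value else 1.0 - p_true


def joint_abc(a, b, c):
    # Factorization of the full network A -> B, A -> C.
    return bern(P_A, a) * bern(P_B_GIVEN_A[a], b) * bern(P_C_GIVEN_A[a], c)


def joint_ab(a, b):
    # Joint of the network with the childless node C and its arc removed.
    return bern(P_A, a) * bern(P_B_GIVEN_A[a], b)


# Largest difference between summing C out and dropping C altogether.
max_gap = max(
    abs(sum(joint_abc(a, b, c) for c in (True, False)) - joint_ab(a, b))
    for a, b in product((True, False), repeat=2)
)
print(max_gap)
```

The gap is zero up to floating-point rounding, since the entries of P(C | A) sum to 1 for each value of A.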
Our model computes the value of $P(c \mid p, r)$, but we did not compute the prior $P(c)$ for all nouns in the corpus. We assumed this to be a constant, equal to the unlikely value, for all classes. In a BBN the values of the marginals increase with their distance from the root nodes. To avoid undesired bias (see the results in Table 2) we defined a balancing formula that adjusted the conditional probabilities of the CPTs in such a way that we got all the marginals to have approximately the same value.⁵
5 Experiments and results⁶
5.1 Learning of selectional preferences
When trained on predicate-argument pairs extracted from a large corpus, the San Jose Mercury Corpus, the model gave very good results. The corpus contains about 1.3 million verb-object tokens. The obtained rankings of classes according to their posterior marginal probabilities were good. Table 1 shows the top and the bottom of the list of synsets for the verb maneuver. The model learned that maneuver "selects" for members of the class VEHICLE and of other plausible classes, hyponyms of VEHICLE. It also learned that the verb does not select for direct objects that are members of classes like CONCEPT or PHILOSOPHY.

Rank   Synset           Posterior
1      VEHICLE          0.9995
2      VESSEL           0.9893
3      AIRCRAFT         0.9937
4      AIRPLANE         0.9500
5      SHIP             0.9114
...    ...              ...
255    CONCEPT          0.1002
256    LAW              0.1001
257    PHILOSOPHY       0.1000
258    JURISPRUDENCE    0.1000
Table 1: Results for (maneuver, object).

⁵ More details can be found in an extended version of the paper: www.cog.brown.edu/massi/.

⁶ For these experiments we used values for the likely [...]
5.2 Word sense disambiguation test
A direct evaluation measure for unsupervised learning of SP models does not exist. These models are instead evaluated on a word sense disambiguation (WSD) test. The idea is that systems that learn SP produce word sense disambiguation as a side-effect. Java might be interpreted as the island or the beverage, but in a context like "the tourists flew to Java" the former seems more correct, because fly could select for geographic locations but not for beverages. A system trained on a predicate $p$ should be able to disambiguate arguments of $p$ if it has learned its selectional restrictions.
We tested our model using the test and training data developed by Resnik (see Resnik, 1997). The same test was used in (Abney and Light, 1999). The training data consist of predicate-object counts extracted from 4/5 of the Brown corpus (about 1M words). The test set consists of predicate-object pairs from the remaining 1/5 of the corpus, which has been manually sense-annotated by Wordnet researchers. The results are shown in Table 2. The baseline algorithm chooses at random one of the multiple senses of an ambiguous word. The "first sense" method always chooses the [...] performed better than the state-of-the-art models for unsupervised learning of SP. It seems to define a better estimator for $P(c \mid p, r)$.

Method                              Result
Baseline                            28.5%
Abney and Light (HMM smoothed)      35.6%
Abney and Light (HMM balanced)      42.3%
Resnik                              44.3%
BBN (without balancing)             45.6%
BBN (with balancing)                51.4%
First sense                         82.5%
Table 2: Results
It is remarkable that the model achieved this result making only a limited use of distributional information. A noun is in $W^+$ if it occurred at least once in the training set, but the system does not know if it occurred once or several times; either it occurred or it didn't. The model did not suffer too much from this limitation during this task. This is probably due to the sparseness of the training data for the test. For each verb the average number of object types is 3.3, and for each of them the average number of tokens is 1.3; i.e., most of the words in the training data occurred only once. For this training set we also tested a version of the model that built a word node for each observed object token and therefore integrated the distributional information. On the WSD test it performed exactly the same as the simpler version. When trained on the San Jose Mercury Corpus the model performed worse on the WSD test (35.8%). This is not too surprising considering the differences between the SJM and the Brown corpora: the former, a recent newswire corpus; the latter, an older, balanced corpus. Another important factor is the different relevance of distributional information. The training data from the SJM Corpus are much richer and noisier than the Brown data. Here the frequency information is probably crucial; however, in this case we could not implement the simple scheme above.
5.3 Conclusion
Explaining away implements a cognitively attractive and successful strategy. A straightforward [...] make full use of the distributional information present in the training data; we only partially achieved this. Bayesian networks are usually confronted with a single presentation of evidence. Their extension to multiple evidence is not trivial. We believe the model can be extended in this direction. Possibly there are several ways to do so (multinomial sampling, dedicated implementations, etc.). However, we believe that the most relevant finding of this research might be that "explaining away" is not only a property of Bayesian networks but of Bayesian inference in general and that it might be implementable in other kinds of graphical models. We observed that the property seems to depend on the specification of the prior probabilities. We found that the HMM model of (Abney and Light, 1999) was unidentifiable; that is, there are several solutions for the parameters of the model, including the desired one. Our intuition is that it should be possible to implement "explaining away" in a HMM with priors, so that it would prefer only one or a few solutions over many. This model would have also the advantage of being computationally simpler.
References
S. Abney and M. Light. 1999. Hiding a semantic hierarchy in a Markov model. In Proceedings of the Workshop on Unsupervised Learning in Natural Language Processing, ACL.
N. Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.
P. N. Johnson-Laird. 1983. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press.
J. J. Katz and J. A. Fodor. 1964. The structure of a semantic theory. In J. J. Katz and J. A. Fodor, editors, The Structure of Language. Prentice-Hall.
G. Miller. 1990. Wordnet: An on-line lexical database. International Journal of Lexicography, 3(4).
J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
P. Resnik. 1997. Selectional preference and
sense disambiguation. In Proceedings of the