Identifying Unambiguous Triples - Experiments in Resolving Ambiguity

Experiments in Resolving Ambiguity

7.3 Identifying Unambiguous Triples

Triples comprising a preposition and two unambiguous words, then, indicate reliable patterns of usage.

Those patterns will later act as templates for helping to resolve ambiguous cases. The problem is that WordNet often reports several POS for any given word; completely unambiguous results, though, must have just one POS. By that definition, many words remain potentially ambiguous; in such cases, machines must examine the number of senses reported by WordNet for each candidate POS, in order to reveal the most likely one by means of familiarity.

Method IIa for collating unambiguous triples

Words that appeared around each of the top ten prepositions were researched in WordNet. Words with several POS were considered unambiguous, should one POS have at least two more senses than any remaining candidate. To that end, the new AstonWordNet class uses the JWNL Java package, which comes with WordNet. That package provides a Dictionary class that allows any Java program to research words. A further JWNL class called IndexWord retrieves synsets from WordNet. Instances of IndexWord represent a particular POS, and carry fields such as the number of senses. That information is retrieved by using the getSenseCount() method in the IndexWord class. The following code, then, shows how AstonWordNet researches a word:

// posType is one of the JWNL POS constants, such as POS.NOUN
Dictionary dictionary = Dictionary.getInstance();
IndexWord indexWord = dictionary.lookupIndexWord(posType, word);
int senses = indexWord.getSenseCount();

Once an instance of the Dictionary class has been obtained, the lookupIndexWord() method retrieves information for a specified POS; although the subsequent call to getSenseCount() has no arguments, it pertains just to the retrieved IndexWord object, and so to that specific POS. The getSenseCount() method, then, was invoked separately for each of the four POS held in WordNet. Should all four attempts fail to identify a word, the stop-word adjunct in AstonWordNet provided a pseudo-synset, if possible. All results were stored as instances of WNet. Any specific preposition, along with results for the two associated words, was stored in a further bespoke Java class called Triple.
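The per-POS lookup described above can be sketched as follows. This is a minimal, self-contained illustration rather than the AstonWordNet code: the SenseSource interface is a hypothetical stand-in for the JWNL calls, so that the loop runs without a WordNet installation, and the stubbed counts for 'known' are those reported later in this section.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: the names below are illustrative, not taken from AstonWordNet.
class SenseCountSketch {

    // The four POS held in WordNet.
    enum Pos { ADJECTIVE, ADVERB, NOUN, VERB }

    // Stand-in for the JWNL lookup followed by getSenseCount();
    // assumed to return 0 when the word has no entry for the given POS.
    interface SenseSource {
        int senseCount(Pos pos, String word);
    }

    // Queries the source once per POS and keeps only the POS that were found.
    static Map<Pos, Integer> sensesByPos(SenseSource source, String word) {
        Map<Pos, Integer> counts = new EnumMap<>(Pos.class);
        for (Pos pos : Pos.values()) {
            int senses = source.senseCount(pos, word);
            if (senses > 0) {
                counts.put(pos, senses);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stubbed counts for 'known': 11 verb senses and 1 adjective sense.
        SenseSource stub = (pos, word) -> {
            if (!word.equals("known")) return 0;
            switch (pos) {
                case VERB: return 11;
                case ADJECTIVE: return 1;
                default: return 0;
            }
        };
        System.out.println(sensesByPos(stub, "known")); // {ADJECTIVE=1, VERB=11}
    }
}
```

An empty map from sensesByPos() would correspond to the case in which all four attempts fail, at which point the stop-word adjunct would be consulted.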

Any specific Triple object, then, held a particular preposition, two words from GRiST mind maps, and a unique key, nodeID, of any associated node used to retrieve associated node texts. Further, two WNet objects represented each of any triple’s content words, along with those words’ familiarity: the sum of all of the senses reported by WordNet for a specific POS. Should several candidate POS arise, the preferred one had a familiarity of two or more senses greater than any remaining POS. Further, all POS held a ratio of the highest sense count and the next highest.
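The selection rule just described might be sketched as below. The class and method names are hypothetical, and the code is not the thesis's WNet or Triple implementation; it assumes only that a sole POS is accepted outright, and that otherwise the leading POS must exceed every other candidate by the stated number of senses.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the selection rule; not the WNet or Triple code itself.
class PreferredPosSketch {

    enum Pos { ADJECTIVE, ADVERB, NOUN, VERB }

    // Returns the POS whose sense count exceeds every other candidate's
    // by at least minDifference, or empty if no candidate is that dominant.
    static Optional<Pos> preferredPos(Map<Pos, Integer> senses, int minDifference) {
        Pos best = null;
        int highest = 0;
        int nextHighest = 0;
        for (Map.Entry<Pos, Integer> entry : senses.entrySet()) {
            int count = entry.getValue();
            if (count > highest) {
                nextHighest = highest;
                highest = count;
                best = entry.getKey();
            } else if (count > nextHighest) {
                nextHighest = count;
            }
        }
        if (best == null) return Optional.empty();
        // A sole POS is accepted without further analysis.
        if (senses.size() == 1) return Optional.of(best);
        return highest - nextHighest >= minDifference ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<Pos, Integer> known = new EnumMap<>(Pos.class);
        known.put(Pos.VERB, 11);
        known.put(Pos.ADJECTIVE, 1);
        System.out.println(preferredPos(known, 2)); // Optional[VERB]
    }
}
```

Returning an empty Optional for tied or near-tied candidates keeps such words out of the unambiguous set, leaving them for later resolution.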

Results IIa of collating unambiguous triples

Figure 7.1 shows an example Triple object created by MidmapPOSAnalysis; that triple carried two unambiguous words, ‘known’ and ‘culture’, respectively reported as a verb and a noun. The first line in that triple holds a unique identifier for the representative node. Subsequent indented lines show the two WNet objects associated with that triple:

Figure 7.1: Unambiguous instances of the new WNet class.

The two WNet instances from Figure 7.1 report unambiguous POS for ‘known’ and ‘culture’. Indeed, the WNet instance for ‘culture’ holds just a sole POS, which was accepted without further analysis. The word ‘known’, though, had two candidates, as a verb and as an adjective; the large difference of 10 senses favoured the verb, which was duly assigned a meta-type of ‘action’¹. Remaining entries in square brackets show alternative interpretations that were declined: in this case, just the adjectival POS for ‘known’, which had just one sense; that, in turn, gave an overall familiarity of 11 + 1 = 12. Accordingly, the ratio of senses for ‘known’ was 1 : 11 = 0.09, when rounded to two decimal places.
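The figures quoted for ‘known’ can be checked with a few lines of arithmetic. The sketch below uses hypothetical method names; it assumes that familiarity is the sum over candidate POS, and that the reported value of 0.09 corresponds to dividing the adjective's single sense by the verb's eleven.

```java
// Hypothetical sketch; the method names are illustrative only.
class KnownArithmeticSketch {

    // Sum of the senses across all candidate POS.
    static int familiarity(int... senseCounts) {
        int total = 0;
        for (int count : senseCounts) {
            total += count;
        }
        return total;
    }

    // Lower count divided by higher count, rounded to two decimal places.
    static double ratio(int lower, int higher) {
        return Math.round(100.0 * lower / higher) / 100.0;
    }

    public static void main(String[] args) {
        int verb = 11;
        int adjective = 1;
        System.out.println(verb - adjective);             // 10
        System.out.println(familiarity(verb, adjective)); // 12
        System.out.println(ratio(adjective, verb));       // 0.09
    }
}
```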

Table 7.3 next presents triples that were treated as unambiguous in that way. Even though some words had several candidate POS, one form was predominant; examples in the first column depict such words in italics, with the associated preposition in bold². The next column restates those words, for clarity. After that come the POS reported for any word, as adjective, adverb, noun, or verb. The column ∆s gives the difference between the highest sense count and the next highest. The final column shows the selected POS, which had two more senses than the next most likely interpretation:

Examples from GRiST mind maps                    Ambiguous    POS senses          ∆s   Best
                                                 word         Aj   Av   N    V         POS

[felt like battering] wife                       battering    -    -    1    3    2    V
[focussed on specific] thing                     specific     4    -    2    -    2    Aj
constant [smell of urine]                        smell        -    -    5    3    2    N
how surprised / not surprised [still with us]    still        6    4    4    4    2    Aj
rq : SH if it provides some [kind of release]    release      -    -    11   9    2    N

Table 7.3: Correctly identified unambiguous words, for senses difference = 2.

POS for ambiguous words from Table 7.3 were deemed appropriate, yielding triples centred on the prepositions ‘like’, ‘on’, ‘of’, and ‘with’. A difference of 2 senses arose from subtracting low counts, such as 3 − 1 = 2, through to higher values, such as 11 − 9 = 2. Indeed, that is why the preposition ‘of’ has two entries: the latter example was just to show high sense counts. In that way, the word ‘release’ that followed ‘of’ was interpreted as a noun, due to having two more senses than as a verb. Note that the word ‘still’ from Table 7.3 might have been any of WordNet’s four POS, although it was in preference seen as an adjective.

¹ In fact, just the meta-type ‘a’ was stored on any WNet, in a normalised way.

² Henceforth, that use of square brackets, italics and bold type for triple components in nodes will be adopted.

In contrast to such appropriate choices, Table 7.4 lists some incorrect ones that arose from a sense difference of 2, which in turn arose from relatively low sense counts. Prepositions in the following triples were, then, ‘for’, ‘in’, ‘of’ and ‘as’:

Examples from GRiST mind maps                    Ambiguous      POS senses          ∆s   Best
                                                 word           Aj   Av   N    V         POS

less [services for women] with children          services       -    -    1    3    2    V
get [feeling in objective] & subjective way      objective      4    -    2    -    2    Aj
depends on their past [experiences of services]  experiences    -    -    3    5    2    V
what see the [future as holding]                 future         5    -    3    -    2    Aj

Table 7.4: WordNet results incorrectly identified as unambiguous by senses difference = 2.

Note that choices from Table 7.4 were either between an adjective and a noun, or between a noun and a verb. Both triples from the former category were taken as adjectives, while those from the latter emerged as verbs. Excluding such borderline results, by demanding a larger difference, accordingly reduced the number of unambiguous triples. While a senses difference of 2 yielded 1601 unambiguous triples, a difference of 3 gave 1345 unambiguous triples: 256 fewer than with the less stringent test for a difference of 2.
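The effect of tightening the test can be illustrated on the example rows alone. In the sketch below, the sense counts are copied from Tables 7.3 and 7.4, in the order Aj, Av, N, V, with 0 standing for a POS that WordNet did not report; the class and method names are hypothetical. Because every example row has a difference of exactly 2, all nine pass at a threshold of 2 and none at a threshold of 3.

```java
import java.util.Arrays;

// Hypothetical sketch; the sense counts are copied from Tables 7.3 and 7.4.
class ThresholdSketch {

    // Difference between the two largest sense counts in a row.
    // Assumes the word has at least two candidate POS.
    static int topTwoDifference(int[] senses) {
        int[] sorted = senses.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - 1] - sorted[sorted.length - 2];
    }

    // Number of rows whose leading POS wins by at least minDifference senses.
    static long countAccepted(int[][] rows, int minDifference) {
        return Arrays.stream(rows)
                .filter(row -> topTwoDifference(row) >= minDifference)
                .count();
    }

    public static void main(String[] args) {
        int[][] rows = {
            {0, 0, 1, 3},  // battering
            {4, 0, 2, 0},  // specific
            {0, 0, 5, 3},  // smell
            {6, 4, 4, 4},  // still
            {0, 0, 11, 9}, // release
            {0, 0, 1, 3},  // services
            {4, 0, 2, 0},  // objective
            {0, 0, 3, 5},  // experiences
            {5, 0, 3, 0},  // future
        };
        System.out.println(countAccepted(rows, 2)); // 9
        System.out.println(countAccepted(rows, 3)); // 0
    }
}
```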

Discussion IIa of collating unambiguous triples

The unambiguous triple from Figure 7.1 showed the word ‘culture’ to have 8 noun senses. That word was perfectly specific: WordNet suggested no other POS. In that same triple, the word ‘known’ had 11 verb senses, but just one as an adjective. A large difference of 10 senses made the verb much more likely.

In contrast, words in triples from Table 7.3 had several POS, although they were treated as unambiguous due to differences of 2 senses between competitors. Choices were largely between nouns and verbs; in all but one of such cases, the noun was preferred. The exception was the word ‘battering’, which was correctly treated as a verb. Of the remaining triples from Table 7.3, two correct distinctions between nouns and adjectives arose. The word ‘release’ did indeed act as a noun rather than as a verb, while ‘specific’ really was an adjective for ‘thing’.

A final point about unambiguous triples from Table 7.3 concerns the word ‘still’, which could have been any of the four WordNet POS. The adjective had 6 senses, whereas each remaining POS had 4 senses. That meant treating ‘still’ as an adjective, although it was really an adverb; the predominant POS proved inaccurate in that case. In fact, that did not matter: adjectives and adverbs were treated jointly as modifiers, so eliminating the noun and verb senses sufficed.
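The reason that the misreading of ‘still’ had no effect can be shown with a small mapping from POS to meta-type. The sketch is hypothetical: the text confirms only that verbs received the meta-type ‘action’ and that adjectives and adverbs were treated jointly as modifiers, so the label used here for nouns, ENTITY, is an assumption made purely for illustration.

```java
// Hypothetical sketch; only the MODIFIER and ACTION groupings come from the text.
class MetaTypeSketch {

    enum Pos { ADJECTIVE, ADVERB, NOUN, VERB }

    // ENTITY is an assumed label for nouns, not one taken from the thesis.
    enum MetaType { MODIFIER, ACTION, ENTITY }

    // Collapses the four WordNet POS into coarser meta-types.
    static MetaType metaType(Pos pos) {
        switch (pos) {
            case ADJECTIVE:
            case ADVERB:
                return MetaType.MODIFIER;
            case VERB:
                return MetaType.ACTION;
            default:
                return MetaType.ENTITY;
        }
    }

    public static void main(String[] args) {
        // 'still' was selected as an adjective, although it acted as an adverb.
        System.out.println(metaType(Pos.ADJECTIVE) == metaType(Pos.ADVERB)); // true
    }
}
```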

Triples from Table 7.4, though, revealed inappropriate choices between candidate POS, arising from relatively low sense counts. For example, the word ‘future’ was treated as an adjective, when it was in fact a noun. Although that will contribute an incorrect POS to subsequent CA, Fellbaum et al. (1990) note a tendency for nouns to act as adjectives in the English language. That, though, applies just to contiguous nouns, rather than those separated by prepositions. Further research might take into account such heuristics, although they would be better determined by machines than imposed by humans.

Although triples from Table 7.4 were treated as unambiguous, then, inappropriate POS were selected. The word ‘services’ had 1 noun sense and 3 verb senses. Although the predominant verb was selected, ‘services for women’ clearly intended the noun. Indeed, the idea of servicing women has a distinct and unwarranted sexual connotation. In a similar way, the word ‘experiences’ had 3 noun senses and 5 for the verb, which was chosen incorrectly. Further, the nouns ‘objective’ and ‘future’ were wrongly treated as adjectives. All of the errors described, though, arose for words having low sense counts: particular POS for such words had relatively few distinct meanings.

In fact, using absolute differences in senses might be too coarse a measure. Future research might apply ratios between sense counts, instead; for the moment, that ratio is calculated but not used. That said, differences of 2 senses proved largely adequate, as opposed to the approach taken by Mihalcea and Moldovan (2000) that insisted on monosemous words: just those having a single POS were considered as unambiguous. Indeed, removing triples having multiple POS that yet have a predominant type would risk discarding an appreciable portion of the data. Any effect on ensuing analyses will be made clear in Section 7.6, when CA will be run for sense differences of both 2 and 3.
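A ratio-based test of the kind suggested for future research might look like the sketch below. It is not part of the reported method: the method name and the cut-off of 0.5 are arbitrary assumptions, and the ratio is taken as the next-highest count divided by the highest, in line with the value of 0.09 reported for ‘known’. The sense counts are those given for ‘known’, ‘services’ and ‘release’.

```java
// Hypothetical sketch of a ratio-based test; the 0.5 cut-off is an arbitrary assumption.
class RatioRuleSketch {

    // True when the runner-up has at most maxRatio times the senses of the leader.
    static boolean isDominant(int highest, int nextHighest, double maxRatio) {
        return (double) nextHighest / highest <= maxRatio;
    }

    public static void main(String[] args) {
        double cutOff = 0.5; // assumed value, for illustration only
        System.out.println(isDominant(11, 1, cutOff)); // 'known', 1 : 11, prints true
        System.out.println(isDominant(3, 1, cutOff));  // 'services', 1 : 3, prints true
        System.out.println(isDominant(11, 9, cutOff)); // 'release', 9 : 11, prints false
    }
}
```

Unlike a fixed difference of 2, such a test separates 3 − 1 from 11 − 9. With this particular cut-off, though, the wrongly chosen ‘services’ would still pass while the rightly chosen ‘release’ would not, so the sketch makes no claim that a ratio would perform better; any cut-off would need to be determined empirically.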

In document Discovering knowledge structures in mind maps of mental health risks (Page 180-184)