
3 Investigation into Morphology


3.2 A Rule-based Approach

3.2.2 Pilot Study on the Formulation and Application of Morphological Rules

3.2.2.4 Application to the Enrichment of WordNet

In order to investigate whether WordNet could be usefully enriched by encoding more morphological relations between word senses and whether it could be further usefully enriched by interpreting morphological relations between word senses as semantic relations (Bilgin et al., 2004; Koeva et al., 2008; §3.1.3), the first step is to discover what proportion of morphological relations are already encoded in WordNet, either as derivational pointers or as other types of relation.

[53] See Appendix 50 for the paucity of prefixes of Anglo-Saxon origin: only "hind-", "mid-", "under-", "be-", "deed-", "die-", "kin-", "none-", "off-", "un-" and "with-" occur, though "a-" (non-antonymous) and "in-" (non-antonymous) are sometimes Anglo-Saxon. These amount to 2% of the valid prefixes identified in §5. In most words beginning with an English preposition, including all prefixations derived from English prepositions not listed here, the rest of the word is also a word in its own right. Such cases can be considered as concatenations.

WordNet Relations between members of CatVar Clusters

Inasmuch as the CatVar sample is representative of morphologically related word clusters, it is pertinent to ask how many of the morphological relations between members of the sample clusters are already encoded in WordNet. Class CatVarTuple stores the relations in which the WordNet senses of the word form it represents, or the synsets to which these senses belong, participate [54]. All the words in the sample dataset were implemented as instances of CatVarTuple and each cluster was implemented as a CatVarCluster [55]. The Suffixation and Suffix Stripping Algorithms were adapted to output CatVarTuple arrays instead of POSTaggedWord arrays, and these arrays were similarly grouped into clusters for each seed word. It was then a simple matter to count the number of WordNet relations between the members of each CatVarCluster. WordNet derivational pointers were counted separately. For the CatVar sample dataset, 2366 WordNet relations were found between pairs of synsets or word senses containing one or more words from within the same CatVar cluster. Of these, 1963 (82.97%) are derivational pointers, giving an average of 4.54 WordNet relations (3.77 derivational pointers) per cluster.
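The counting step described above can be sketched as follows. This is a minimal illustration only: the `Relation` class, the `words_of` mapping and the toy data are hypothetical stand-ins, not the CatVarTuple or CatVarCluster classes themselves, and the rule that both endpoints must contain a cluster member is an assumed reading of the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for one WordNet relation (not the thesis's class)."""
    source: str  # identifier of a word sense or synset
    target: str
    kind: str    # "DERIV" marks a derivational pointer


def count_cluster_relations(words_of, cluster, relations):
    """Return (all relations, derivational pointers) found inside one cluster.

    A relation is counted when both of its endpoints contain at least one
    word of the cluster (an assumed reading of the counting rule).
    """
    total = deriv = 0
    for rel in relations:
        if words_of[rel.source] & cluster and words_of[rel.target] & cluster:
            total += 1
            deriv += rel.kind == "DERIV"
    return total, deriv


# Toy data, invented for illustration; not actual WordNet content.
words_of = {
    "s1": {"differ"},
    "s2": {"difference"},
    "s3": {"different", "unlike"},
    "s4": {"same"},
}
relations = [
    Relation("s1", "s2", "DERIV"),
    Relation("s3", "s2", "DERIV"),
    Relation("s2", "s1", "OTHER"),
    Relation("s3", "s4", "ANTONYM"),  # leaves the cluster, so not counted
]
cluster = {"differ", "difference", "different"}
print(count_cluster_relations(words_of, cluster, relations))  # (3, 2)
```

Keeping the derivational count separate from the overall count mirrors the separate tally of derivational pointers reported for each dataset.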

Since it is possible for more than one WordNet relation to exist between the same two synsets, or for one relation to exist between two synsets and another to exist between two word senses each of which belongs to one of the two synsets, the number of duplicate relations was also calculated, totalling 86. The maximum possible number of relational pairings for each cluster (excluding duplicates) was calculated as

$$\frac{n(n-1)}{2}$$

where n = the number of members of the cluster. This would be the number of relations if there was a relation between each member of the cluster and every other member.

Since derivation is a directional phenomenon, each member of a cluster other than the root can be considered to be directly derived from one and only one other member. However, all correct members are related directly or indirectly, and every member is directly or indirectly derived from a common root, so that the entire cluster forms a derivational tree (§3.1.4; Fig. 5). The ideal or optimal number of relations per cluster is then equivalent to the number of links between nodes in a tree, which is

$$n - 1$$

where n = the number of nodes.
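As a small worked example of the two bounds, the sketch below assumes that the maximum is the number of unordered pairs of members, n(n − 1)/2, and that the optimum is the number of links in a tree, n − 1. The function names are illustrative only. A cluster of 13 members, the size of the cluster shown in Fig. 5, is included.

```python
def max_pairings(n: int) -> int:
    """Unordered pairs among n cluster members: n(n - 1) / 2."""
    return n * (n - 1) // 2


def optimal_links(n: int) -> int:
    """Links in a tree with n nodes: n - 1 (0 for an empty cluster)."""
    return max(n - 1, 0)


for n in (2, 5, 13):
    print(n, max_pairings(n), optimal_links(n))
# 2 1 1
# 5 10 4
# 13 78 12
```

The gap between the two bounds grows quickly with cluster size, which is why the proportion of the maximum realised in WordNet is so much lower than the proportion of the optimum.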

Fig. 5: Derivational tree for a CatVar cluster (nodes listed by level, root first)

differ, VERB
different, ADJ.    differing, ADJ.
difference, NOUN    differently, ADV.
differential, ADJ.    differentiate, VERB
differential, NOUN    differentially, ADV.
differentiator, NOUN    differentiable, ADJ.    differentiation, NOUN    differentiated, ADJ.

The representation of derivational relationships within a cluster as a derivational tree, implying the directionality of morphological relations, might be useful for detecting false morphological relations generated algorithmically. For instance, the CatVar dataset links the word "student" to the word "stud". If one morphological rule represented the transformation from a noun to another noun by appending "-ent", and another rule represented the transformation from a noun with suffix "-y" to another noun by substituting "-ent", then the word "student" would be treated as simultaneously derived from "stud" and from "study" [56]. This dual inheritance would violate the tree structure, so the algorithm could detect it as an exception. This would highlight the fact that only one of the proposed roots of "student" can be correct, at which point human intervention could quickly establish that "study", and not "stud", is the root of "student".
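The check for dual inheritance can be sketched in a few lines. This is an illustrative sketch, not the algorithm used in the study: the function name and the representation of proposed derivations as (derived word, root) pairs are assumptions, and the "studious" pair is an invented extra link added only to show a word that passes the check.

```python
from collections import defaultdict


def find_multiple_parents(proposed):
    """Map each word to its proposed roots when more than one root is proposed."""
    parents = defaultdict(set)
    for derived, root in proposed:
        parents[derived].add(root)
    return {word: sorted(roots) for word, roots in parents.items() if len(roots) > 1}


proposed = [
    ("student", "stud"),    # noun + "-ent" rule
    ("student", "study"),   # "-y" replaced by "-ent" rule
    ("studious", "study"),  # illustrative extra link, not from the source
]
print(find_multiple_parents(proposed))  # {'student': ['stud', 'study']}
```

Any word returned by such a check has more than one proposed parent and so cannot sit in a tree; these are the cases that would be passed to a human for adjudication.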

Using the above definitions of maximum possible and ideal or optimal, it was discovered that over the entire CatVar sample dataset, only 6.17% of the maximum possible relations were realised in WordNet while 54.64% of the optimal number were realised. This means that almost half these morphological relations are not encoded, confirming the potential for further enrichment of WordNet with morphological relations.

With the dataset generated from the word list (§3.2.2.2.1) by suffixation, there was an average of 0.75 WordNet relations per cluster, of which 80.29% (0.60 per cluster) were derivational pointers. The WordNet relations represented 3.90% of the maximum possible and 34.14% of the optimum. With the dataset generated from the word list by suffix stripping, there was an average of 1.15 WordNet relations per cluster, of which 78.87% (0.91 per cluster) were derivational pointers. The WordNet relations represented 4.02% of the maximum possible and 34.00% of the optimum.
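The proportions quoted above can be approximately reproduced from the per-cluster averages in Table 25. The sketch below assumes that duplicate relations are subtracted before dividing by the maximum or the optimum; that reading is an inference from the phrase "excluding duplicates", not a documented step. Because the inputs are rounded averages, the results agree with the reported percentages only to within about 0.3 of a percentage point.

```python
# Per-cluster averages copied from Table 25:
# (WN relations, duplicate relations, maximum possible, optimal)
table_25 = {
    "CatVar dataset": (4.54, 0.17, 70.98, 8.01),
    "Word list suffixation": (0.75, 0.02, 18.54, 2.12),
    "Word list suffix stripping": (1.15, 0.03, 27.95, 3.30),
}


def realised(relations, duplicates, maximum, optimal):
    """Return (% of maximum, % of optimal), with duplicates excluded (assumed)."""
    distinct = relations - duplicates
    return 100 * distinct / maximum, 100 * distinct / optimal


for name, row in table_25.items():
    pct_max, pct_opt = realised(*row)
    print(f"{name}: {pct_max:.2f}% of maximum, {pct_opt:.2f}% of optimal")
```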

Comparison of WordNet relation occurrence between members of clusters of derivationally related words for each experiment.

Table 25 shows little variance between experiments in the proportion of the WordNet relations which are derivational pointers. However, using CatVar data as a starting point yields a substantially higher relation count (an average of 4.54 relations per cluster, against 0.75 and 1.15 for the two word list datasets). This discovery suggested that CatVar data had already been used for WordNet enrichment, as planned (Habash & Dorr, 2003). However, this is refuted by Fellbaum and Miller (2007; §3.1.3). It would appear then that the undocumented methodology used for the creation of CatVar was similar to that adopted by Fellbaum and Miller, and it seems likely that some derivational pointers have been subsequently re-encoded as other WordNet relations. It is also abundantly clear that there is plenty of scope for further enrichment.

Table 25: WordNet relations between members of clusters of derivationally related words

                                                     CatVar dataset              Word list suffixation       Word list suffix stripping
                                                     TOTAL     AVERAGE           TOTAL     AVERAGE           TOTAL     AVERAGE
WN DERIV relations within cluster                    1963      3.77              664       0.60              1008      0.91
WN relations within cluster                          2366      4.54              827       0.75              1278      1.15
DERIV as proportion of WN relations                            82.97%                      80.29%                      78.87%
Duplicate relations                                  86        0.17              26        0.02              34        0.03
Total synsets / cluster                                        9.01                        3.12                        4.30
MAX possible relations / cluster excl. duplicates              70.98                       18.54                       27.95
Proportion of possible relations in WN                         6.17%                       3.90%                       4.02%
Optimal relation count / cluster                               8.01                        2.12                        3.30
Proportion of optimal relation count realised in WN            54.64%                      34.14%                      34.00%

In document Lexical database enrichment through semi-automated morphological analysis (Page 132-136)