Prefix Tree Construction - Automatic Prefix Discovery

3 Investigation into Morphology


3.4 Automatic Affix Discovery

3.4.1 Automatic Prefix Discovery

3.4.1.1 Prefix Tree Construction

The prefix tree is populated level by level, with the candidate prefixes at each level one character longer than those at the previous level. Every possible combination of alphabetic characters at each level is looked up in the lexicon to see whether it occurs at the start of more than one word. If so, a Prefix object is created for that character combination (Fig. 6). The number of levels was limited to 10, since at the last level no character sequences were found which occurred more than once at the beginning of a word.

Fig. 6: Part of prefix tree rooted at "su-" (prefix candidates with occurrence count < 10 have been omitted)

Nodes shown in the figure, by prefix length:
3 characters: sub, suc, sud, sum, sun, sup, etc.
4 characters: subc, subd, subj, subl, subm, subo, subs, subv, succ, suff, summ, sunb, sund, supp, supr
5 characters: subli, subor, subse, subsi, subst, succe, summa, super, suppl, suppo
6 characters: superf, superi, supern, supers, suppos
7 or more characters: subsidi, substanti

The first attempt at constructing a prefix tree, branch by branch, took about 24 hours to run, because of the large number of lexicon traversals required. In order to improve efficiency the algorithm was optimised to construct each level of the prefix tree in succession, so as to minimise the number of lexicon traversals required. This added complexity but reduced runtime to about 5 seconds. A single lexicon traversal is performed for each level of the tree and the number of characters is increased at each level. At each level, all the possible character combinations are generated in the same order as they appear in the lexicon, which accounts for the improved performance. Because of the duplication criterion, candidate prefixes with only one occurrence are excluded from the tree. Candidates with only one child are deleted after constructing the tree, since their status as parents of a single child cannot be established when they are instantiated, but only on instantiation of the child.
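The level-by-level strategy can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `build_prefix_levels` and `prune_single_child` are hypothetical, a word identical to a prefix is assumed to count as an occurrence of it, and the single-child rule follows the prose description above (candidates with exactly one child are removed once the tree is complete).

```python
from collections import Counter


def build_prefix_levels(lexicon, max_levels=10):
    """Return {prefix: count} for every prefix that starts more than one word.

    One traversal of the lexicon is made per level; level n holds the
    candidate prefixes of n characters.
    """
    words = set(lexicon)  # ignore duplicate lexicon entries
    tree = {}
    for level in range(1, max_levels + 1):
        # Single traversal for this level: count each n-character word start.
        counts = Counter(w[:level] for w in words if len(w) >= level)
        # Duplication criterion: keep only starts shared by more than one word.
        kept = {p: c for p, c in counts.items() if c > 1}
        if not kept:
            break  # no repeated word starts at this length
        tree.update(kept)
    return tree


def prune_single_child(tree):
    """Remove candidates that have exactly one child in the completed tree."""
    # A candidate's parent is assumed to be the candidate minus its last letter.
    child_counts = Counter(p[:-1] for p in tree if len(p) > 1)
    return {p: c for p, c in tree.items() if child_counts[p] != 1}
```

Pruning has to wait until the tree is complete because, as noted above, whether a candidate has a single child is only known once its children have been created.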

The algorithm needs not only to find candidate prefixes but also to store information which may be relevant to determining which candidates satisfy the semantic criterion. The frequency of lexicon occurrence (as a prefix) f_c (affix frequency) of a candidate is obviously related to the probability of its being a valid prefix and is calculated by the prefix constructor. Also, the higher the proportion of the occurrences of its parent f_p (parent frequency) which is represented by a candidate, the more likely it is that it is a valid prefix.
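As a rough illustration of these two quantities, the sketch below computes f_c for a candidate and the proportion f_c / f_p of its parent's occurrences that it represents. The function names `affix_frequency` and `parent_share` are hypothetical, and counting by a simple scan of a word list is an assumption made for brevity, not the method used by the prefix constructor.

```python
def affix_frequency(prefix, lexicon):
    """f_c: number of lexicon words that begin with `prefix`."""
    return sum(1 for word in lexicon if word.startswith(prefix))


def parent_share(prefix, parent, lexicon):
    """f_c / f_p: share of the parent's occurrences taken by the candidate."""
    f_c = affix_frequency(prefix, lexicon)
    f_p = affix_frequency(parent, lexicon)
    # Guard against a parent that never occurs in the lexicon.
    return f_c / f_p if f_p else 0.0
```

With the toy lexicon `["subject", "submit", "sunny", "supper"]`, the candidate "sub" has f_c = 2 and its parent "su" has f_c = 4, so "sub" represents half of its parent's occurrences.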

Prefix Tree Construction Algorithm (see also Class Diagrams 9 & 10)

discoverPrefixes {
    prefixTree = new PrefixTree();
    look up stems in lexicon;
    for (each prefix in prefixTree) {
        if (prefix does not have more than one child) {
            delete prefix as irrelevant;
        }
    }
    create prefix set ordered according to a heuristic;
}

prefixTree() {
    root = new Prefix("");
    for each level {
        addLevel(root);
        while (newRoot does not exist) {
            if (root has child) {
                newRoot = first child of root;
            } else {
                root = changeBranch(root);
            }
        }
        root = newRoot;
    }
}

addLevel(parent) {
    reset lexicon iterator;
    form = parent.form + "a";
    while ((currentPrefix is not in lexicon) && (form does not end with "z")) {
        form = next possible lexical form with same number of characters;
        currentPrefix = new Prefix(form);
        currentPrefix.f_p = parent.f_c;
    }
    if (currentPrefix is not in lexicon) {
        navigationalPrefix = currentPrefix;    // mark for removal
    }
    make currentPrefix child of parent;
    while (currentPrefix exists) {
        currentPrefix = nextPrefix(currentPrefix);
    }
    if (navigationalPrefix exists) {
        remove navigationalPrefix;
    }
}

nextPrefix(previousPrefix) {
    valid = false;
    currentForm = previousPrefix.form;
    parentPrefix = parent of previousPrefix;
    while (not valid) {
        if (currentForm ends with "z") {
            parentPrefix = changeBranch(parentPrefix);
            newForm = parentPrefix.form;
            newForm = newForm + "a";
        } else {
            newForm = currentForm with last letter increased;
        }
        newPrefix = new Prefix(newForm);
        newPrefix.f_p = parentPrefix.f_c;
        if (newPrefix occurs more than once) {
            valid = true;
        } else {
            currentForm = newForm;
        }
    }
    make newPrefix child of parentPrefix;
    return newPrefix;
}

changeBranch(currentPrefix) {
    generationCounter = 0;
    rightPlace = false;
    while (not rightPlace) {
        nextPrefix = next sibling of currentPrefix;
        while (nextPrefix does not exist) {
            currentPrefix = parent of currentPrefix;
            increment generationCounter;
            nextPrefix = next sibling of currentPrefix;
        }
        currentPrefix = nextPrefix;
        while (generationCounter > 0) {
            currentPrefix = first child of currentPrefix;
            decrement generationCounter;
        }
        rightPlace = true;
    }
    return currentPrefix;
}

Recording Stem Information

Every word beginning with a candidate prefix can be segmented into a prefix and a residue, which can provisionally be considered as the stem. It might be relevant to examine whether the stem obtained by such a segmentation exists as a word in the lexicon (Hafer & Weiss, 1974; §3.3.2). To this end, the prefix constructor stores all the stems that occur with each prefix, and the prefix tree maintains a global alphabetic list of stems, each associated with a list of the prefixes with which it occurs. Once construction of the tree is complete, one final traversal of the lexicon identifies which of the stems exist as words in their own right. The proportion of the stems occurring with each prefix which are also words is then calculated and stored with the prefix as its stem validity quotient q_s. The data concerning stems were not analysed or evaluated initially, but proved to be a productive research direction (§3.4.4).
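The stem validity quotient can be sketched as follows. This is an illustrative reading of the description above, not the thesis code: the function name `stem_validity_quotient` is hypothetical, the lexicon is assumed to be a plain collection of word strings, and a word identical to the prefix (which would leave an empty stem) is assumed to be ignored.

```python
def stem_validity_quotient(prefix, lexicon):
    """q_s: proportion of the stems found with `prefix` that are also words."""
    words = set(lexicon)
    # Residue left after stripping the prefix; empty residues are skipped.
    stems = {
        w[len(prefix):]
        for w in words
        if w.startswith(prefix) and len(w) > len(prefix)
    }
    if not stems:
        return 0.0
    valid = sum(1 for stem in stems if stem in words)
    return valid / len(stems)
```

For the toy lexicon `["unhappy", "happy", "unkind", "kind", "under", "uncle"]`, the candidate "un" yields the stems "happy", "kind", "der" and "cle", of which two are words, giving q_s = 0.5.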

In document Lexical database enrichment through semi-automated morphological analysis (Page 156-162)