
Stemming for GRiST Mind Maps

4.2. AN OVERVIEW OF STEMMING

By whatever definition, stemming produces strings of letters that are not in themselves words. That end-users might find such stems difficult to interpret is unimportant. Rather, stems serve to facilitate IR by expanding words in any query with related forms. In that way, queries might retrieve a larger set of relevant documents. Now, although stemming might improve the recall of information, there is a risk of inappropriate stems retrieving non-relevant documents. The challenge lies in achieving the best trade-off between those opposing tendencies (Xu & Croft, 1998).

Approaches to stemming reflect the two categories that were seen earlier for spelling correction: processing words in isolation as opposed to considering context. As to the former, basic stemming might simply involve removing suffixes from plurals; researching tables of common word-endings is one way to do that (Xu & Croft, 1998). In a similar vein, a stem dictionary might be used (Porter, 1980). More advanced approaches to stemming isolated words employ so-called conflation algorithms. While some researchers embed linguistic knowledge in such algorithms, others treat words as just n-grams of letters.

Approaches that consider context, on the other hand, use whatever text surrounds any stemmed word to assess the likelihood of being correct. Having introduced the two main approaches to stemming, attention now turns to various studies that employed them.

4.2.1 Deriving Stems from Isolated Words

Stemming, then, identifies groups of related words. For example, Aas and Eikvil (1999) derived the stem ‘walk’ for ‘walker’, ‘walked’, and ‘walking’ by removing the respective suffixes ‘-er’, ‘-ed’, and ‘-ing’. That the stem ‘walk’ is a word in its own right is of no particular importance; Xu and Croft (1998) point out that stems are better seen as n-grams that are embedded in related words. In fact, the algorithm that Aas and Eikvil (1999) used was what they call the popular Porter stemmer, which is discussed next.

The Porter Stemmer

Porter (1980) devised an ingenious approach of splitting words into n-grams that contained just vowels or consonants. N-grams composed of vowels were termed V, while C denoted n-grams of consonants. The length of any particular n-gram, then, depended solely on the number of contiguous letters of a specific type. Having identified V and C n-grams within any word, the Porter stemmer goes on to analyse repeating sequences of VC. In that way, words are reduced to recurring blocks that comprise one or more vowels followed by one or more consonants. The symbols [C] and [V] represent an optional leading run of consonants and an optional trailing run of vowels, should they exist. Between those extremes might be m repetitions of VC. The following expression, then, sums up the construction of any English word:

$[C](VC)^{m}[V]$ (Porter, 1980).

Cases where m = 0 mean that words comprise just the [C] and [V] components. Words that do contain VC groups have m > 0; in those cases, the [C] and [V] components are allowed to be null (Porter, 1980).


Although Porter (1980) provides examples, they are not well discussed; what follows is an interpretation of some of those examples. Table 4.1 shows the construction of words having m values from 0 to 2.

The middle columns of that table show [C], VC and [V] as described by Porter (1980), while the final column is my comment:

Table 4.1: Stems from the Porter stemmer, adapted from Porter (1980).

All of the rows from Table 4.1 have an identical [C] component, ‘tr’. In the first row, the word ‘tree’ was accounted for by just the [C] and [V] components, each having two letters. For m = 1, the word ‘trouble’ starts with ‘tr’ in [C], while the [V] component contains the final ‘e’. In between comes a single VC group, with ‘ou’ and ‘bl’ in V and C respectively. The final entry contains the word ‘troubles’, for which m = 2. Note that taking the plural of ‘trouble’ raises m from 1 to 2; adding the consonant ‘s’ to the vowel ‘e’ yields an additional VC group. Because of that, the terminating [V] is left empty.
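The decomposition just described can be checked mechanically. What follows is a minimal sketch rather than Porter's own code: the function names `is_consonant`, `decompose` and `measure` are my own, and the treatment of ‘y’ (a vowel only when it follows a consonant) follows my reading of Porter (1980).

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def decompose(word):
    """Split a word into (leading C, list of (V, C) groups, trailing V)."""
    word = word.lower()
    runs = []  # contiguous runs of one letter type, e.g. ("C", "tr")
    for i, ch in enumerate(word):
        kind = "C" if is_consonant(word, i) else "V"
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1] + ch)
        else:
            runs.append((kind, ch))
    # Peel off the optional [C] and [V] components.
    leading = runs.pop(0)[1] if runs and runs[0][0] == "C" else ""
    trailing = runs.pop()[1] if runs and runs[-1][0] == "V" else ""
    # Whatever remains alternates V, C, V, C, ... and so pairs up exactly.
    groups = [(runs[i][1], runs[i + 1][1]) for i in range(0, len(runs), 2)]
    return leading, groups, trailing


def measure(word):
    """Return m, the number of VC groups in the word."""
    return len(decompose(word)[1])


for example in ("tree", "trouble", "troubles"):
    print(example, decompose(example), "m =", measure(example))
```

For the three words discussed above, the sketch reports m values of 0, 1 and 2, with ‘tr’ as the [C] component in each case.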

Having discovered a way of sub-dividing words, Porter (1980) goes on to specify five steps in his algorithm. Each of those steps bears a transformation rule that results in a shortened suffix. In addition, an optional condition stipulates when any particular rule should be applied. Those conditions used a shorthand notation based on the symbol ‘∗’. For example, ∗v∗ meant that any stem resulting from removing a suffix must contain a vowel, while ∗o stood for a stem that ended with the pattern consonant-vowel-consonant (CVC), where the second consonant is not W, X or Y. Table 4.2, then, lists those five steps in stemming:

Step  Suffix    Replacement  Condition        Example word  Stem
2     -tional   -tion        (m > 0)          conditional   condition
3     -alize    -al          (m > 0)          formalize     formal
4     -ance     -            (m > 1)          allowance     allow
5     -e        -            (m = 1 & !∗o)    cease         ceas

Table 4.2: Steps that comprise the Porter stemmer, adapted from Porter (1980).


The lack of a condition in the first entry of Table 4.2 means that ‘-ing’ can always be removed. Conversely, the second example for Step 1 stipulates that any stem must contain a vowel; that was the case for ‘motor’, which actually contains two ‘o’s. The next step, number 2, specifies that ‘-tional’ may be shortened to ‘-tion’ just for stems having one or more VC groups. That same condition applies to Step 3, which adjusts, for example, ‘-alize’ to ‘-al’. In a similar way, Step 4 applies just to stems having at least two VC groups. Lastly, the ‘!’ symbol in Step 5 negates the ‘∗o’ requirement for any remaining stem to end in CVC; because ‘ceas’ ends with the pattern vowel-vowel-consonant (VVC) rather than CVC, it satisfies that negated condition (Porter, 1980).
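The way in which such conditions gate each rule can be sketched in a few lines. The code below is an illustration under stated assumptions, not the full Porter stemmer: it encodes only the four sample rules for Steps 2 to 5, applies them in a single pass, and tests each condition against the stem that remains once the suffix is removed. The names `RULES`, `ends_cvc` and `apply_rules` are hypothetical.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC groups (m) by counting vowel-to-consonant transitions."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    return sum(
        1
        for i in range(1, len(pattern))
        if pattern[i - 1] == "V" and pattern[i] == "C"
    )


def ends_cvc(stem):
    """The *o condition: stem ends CVC and the last consonant is not w, x or y."""
    n = len(stem)
    if n < 3:
        return False
    return (
        is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


# (suffix, replacement, condition on the remaining stem)
RULES = [
    ("tional", "tion", lambda s: measure(s) > 0),
    ("alize", "al", lambda s: measure(s) > 0),
    ("ance", "", lambda s: measure(s) > 1),
    ("e", "", lambda s: measure(s) == 1 and not ends_cvc(s)),
]


def apply_rules(word):
    """Apply each sample rule in turn whenever its suffix and condition match."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition(stem):
                word = stem + replacement
    return word


for example in ("conditional", "formalize", "allowance", "cease", "love"):
    print(example, "->", apply_rules(example))
```

The word ‘love’ is my own addition to show the ∗o condition at work: its stem ‘lov’ has m = 1 but ends in CVC, so the final ‘e’ is retained under this rule.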

Porter (1980) notes that the first step deals just with plurals and past participles. That step in fact had three discrete parts, although subsequent steps were seen as more straightforward. Importantly, the algorithm avoids removing suffixes that would result in too short a stem. The value of m for the stem that would remain was seen as a useful guide in that respect. For example, removing the suffix ‘-ate’ from ‘relate’ would render the short stem ‘rel’, for which m = 1. For ‘activate’, though, the remaining stem ‘activ’ has m = 2 and is adequate. In practice, applying suffix-stripping steps to 10,000 words yielded 6,370 distinct stems; stemming thus reduced the number of terms by about one third (Porter, 1980).

The Dual Problems of Under- and Over-Stemming

Over-aggressive stemming, though, risks obtaining stems that are too short to be useful. That risk recurs in research by Orengo and Huyck (2001), who note that what appears to be a suffix might actually be part of any desired stem. Removing such false suffixes would produce stems that conflate unrelated words. That problem was overcome for the Portuguese language by specifying a minimum length for any stem; even so, avoiding over-stemming was difficult. For example, although the suffix ‘-inho’ might indicate diminutive forms, ‘golfinho’ means ‘dolphin’, rather than a smaller form of ‘golf’. In that case, it would be wrong to treat ‘-inho’ as a suffix (Orengo & Huyck, 2001).

In order to alleviate that problem, lists of exceptions were created to specify words that are not, in fact, directly related by any given stem. The following entry for ‘-inho’ stipulates such exceptions:

inho, 3, {caminho, carinho, cominho, golfinho, padrinho, sobrinho, vizinho}.

The first part of the entry is the suffix ‘-inho’, while the number 3 dictates the minimum number of letters allowed in any stem resulting from removing that suffix. Words enclosed within braces {...} were treated separately from words reliably suffixed by ‘-inho’. Employing such exceptions reduced over-stemming mistakes by 5%. Conversely, under-stemming occurs when a true suffix is not removed, meaning that related words will not be fully conflated (Orengo & Huyck, 2001).
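An entry of that kind translates readily into a small data structure. The sketch below assumes a simplified rule that merely removes the suffix; the class `SuffixRule`, the function `strip_suffix` and the test words ‘gatinho’ and ‘vinho’ are my own illustrative choices rather than details taken from Orengo and Huyck (2001).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SuffixRule:
    """One entry: a suffix, a minimum stem length and a set of exceptions."""

    suffix: str
    min_stem_length: int
    exceptions: frozenset = field(default_factory=frozenset)


# The '-inho' entry quoted above.
INHO = SuffixRule(
    suffix="inho",
    min_stem_length=3,
    exceptions=frozenset(
        {"caminho", "carinho", "cominho", "golfinho",
         "padrinho", "sobrinho", "vizinho"}
    ),
)


def strip_suffix(word, rule):
    """Remove the suffix unless the word is an exception or the stem is too short."""
    if word in rule.exceptions or not word.endswith(rule.suffix):
        return word
    stem = word[: -len(rule.suffix)]
    if len(stem) < rule.min_stem_length:
        return word  # stripping would leave too short a stem
    return stem


for example in ("gatinho", "golfinho", "vinho"):
    print(example, "->", strip_suffix(example, INHO))
```

Here ‘golfinho’ survives because it is a listed exception, while ‘vinho’ survives because removing ‘-inho’ would leave a single letter, which is shorter than the minimum of three.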


Shortcomings of Language-Specific Stemmers

Algorithms such as the Porter stemmer, though, suffer the serious drawback of being language-specific. Stemming any novel language will require new linguistic rules, which in turn demand a detailed knowledge of any language under investigation. In contrast, concentrating on n-grams of letters overcomes any reliance on embedded linguistic knowledge, and allows a language-neutral stemmer. That approach should work over a wide variety of languages (Mayfield & McNamee, 2003).

To that end, words were parsed to see if they contained n-grams that occur naturally. Those known n-grams were compared with sub-strings from words in question; each letter within any word was treated as the start of a new sub-string. Any words that contained a particular n-gram were conflated in that way. Such derived stems were further checked for over-stemming by means of a measure called the inverse document frequency (IDF). That reflected how many words would be conflated by any particular stem: the more widespread an n-gram, the lower its IDF. Those having low IDF were discarded as too general (Mayfield & McNamee, 2003).
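A toy version of that procedure might look as follows. This is a sketch under several assumptions of my own, not the algorithm of Mayfield and McNamee (2003): IDF is taken in its usual form, log(N / df), so that n-grams found in most documents score low; the threshold `min_idf` and the four-document collection are invented for illustration.

```python
import math
from collections import defaultdict


def char_ngrams(word, n):
    """Every letter starts a new sub-string of length n, where one fits."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_idf(documents, n):
    """Map each n-gram to log(N / df), df being the documents that contain it."""
    doc_freq = defaultdict(int)
    for doc in documents:
        seen = set()
        for word in doc:
            seen |= char_ngrams(word, n)
        for gram in seen:
            doc_freq[gram] += 1
    total = len(documents)
    return {gram: math.log(total / df) for gram, df in doc_freq.items()}


def conflate(documents, n, min_idf):
    """Group words by shared n-gram, discarding n-grams that are too general."""
    idf = ngram_idf(documents, n)
    groups = defaultdict(set)
    for doc in documents:
        for word in doc:
            for gram in char_ngrams(word, n):
                if idf[gram] >= min_idf:
                    groups[gram].add(word)
    return dict(groups)


# Hypothetical collection: each inner list stands for one document.
DOCS = [
    ["juggling", "jugglers"],
    ["juggle", "walking"],
    ["walked", "talking"],
    ["singing"],
]

groups = conflate(DOCS, n=3, min_idf=0.5)
print(sorted(groups["jug"]))
print("ing" in groups)
```

In this collection ‘ing’ occurs in every document, so its IDF is zero and it is discarded, whereas ‘jug’ conflates the three juggling-related words. The nested loops over words and sub-strings also hint at the cost of the many string comparisons involved.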

That approach based on n-grams proved viable for eight European languages, of which no knowledge was built into the algorithm. The Wilcoxon test of significance showed it to perform as well as a language-specific stemmer for some languages. The sole adjustment for any given language involved selecting a suitable number of letters, that is, a value of ‘n’ for any n-gram. There was, though, an important prerequisite: a pre-compiled list of n-gram frequencies. All the same, calculating such frequencies was seen as a straightforward task. A more serious drawback concerned the performance penalty incurred by the high number of string comparisons required (Mayfield & McNamee, 2003).

4.2.2 Accounting for Context during Stemming

The Porter stemmer, then, is a rule-based algorithm that removes suffixes from words to reveal underlying stems; problems, though, arise from employing such programmes. Notably, what seems to be a suffix might in fact be part of a stem. Such over-aggressive stemming yields stems that conflate words having little shared meaning, such as ‘pol’ for ‘policy’ and ‘police’. Conversely, lenient stemming risks the opposite effect, under-stemming, which fails to conflate truly related words such as ‘matrix’ and ‘matrices’. Such under- and over-stemming might lead to serious failures in information retrieval (Xu & Croft, 1998).

Corpus-Based Stemming

A noted problem with the Porter stemmer concerns a failure to reflect actual language usage. Although rules in that algorithm are specific to English, that is not seen as a built-in linguistic model. Rather, such rules are designed to handle specific aspects of isolated words within a body of text, while neglecting any wider knowledge existing in such texts. An alternative approach of corpus-based stemming might rectify that shortcoming, by using what is called the co-occurrence of word variants. Put another way, word forms that should be conflated for a given corpus will occur together in the very documents from that corpus. Unrelated words, on the other hand, should co-occur rarely (Xu & Croft, 1998).

To that end, a metric called EM, a variation of the Expected Mutual Information Measure (EMIM), measures the degree to which any word a co-occurs with word b within any corpus. Although the actual meaning of EM was left unclear, the following expression shows how it was calculated. Variables $n_a$ and $n_b$ are the respective occurrences of words a and b, and $n_{ab}$ is the frequency with which a and b appear together. That actual frequency is compared with the expected number of co-occurrences, $En(a, b)$. The function max() avoids negative results arising from the subtraction term:

$$em(a, b) = \max\left(\frac{n_{ab} - En(a, b)}{n_a + n_b},\ 0\right)$$ (Xu & Croft, 1998).

My interpretation of that metric is that the expected co-occurrence of two words is subtracted from their actual co-occurrence, and that the difference is divided by the sum of the words’ independent frequencies from the corpus. The numerator is positive only when two words appear together more often than expected; otherwise, max() sets EM to zero. The denominator, in turn, lowers the result for words whose isolated occurrences are numerous relative to that excess. Overall, relatively high EM values arise when two words often co-occur. Any given EM value, though, is reduced should such words appear by themselves relatively more than they do together.
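A short numerical sketch may make that behaviour concrete. It reads the expression as the excess of actual over expected co-occurrences, divided by the sum of the two words' frequencies, and floors the result at zero. The expected count is passed in directly, since how it is estimated is not described here; the function name `em` and all of the counts are hypothetical.

```python
def em(n_ab, n_a, n_b, expected_ab):
    """EM as read above: max((n_ab - expected_ab) / (n_a + n_b), 0)."""
    if n_a + n_b == 0:
        return 0.0  # neither word occurs, so there is nothing to measure
    return max((n_ab - expected_ab) / (n_a + n_b), 0.0)


# Words that co-occur far more often than expected score highly.
print(em(n_ab=30, n_a=50, n_b=50, expected_ab=10))

# Fewer co-occurrences than expected: max() clamps the result to zero.
print(em(n_ab=2, n_a=50, n_b=50, expected_ab=10))

# The same excess, but both words are far more frequent on their own.
print(em(n_ab=30, n_a=500, n_b=500, expected_ab=10))
```

The third call shows the damping effect of the denominator: an identical excess of co-occurrences counts for less when both words are common in isolation.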

In fact, the corpus in that study comprised text windows on a desktop machine. In that restricted environment, co-occurring word variants were seen to enhance the performance of stemming algorithms without a need for expert linguistic knowledge. Notably, that approach successfully avoided over-stemming the words ‘company’ and ‘computer’ to yield ‘com’. That result, though, did not arise from applying the EM measure; rather, ‘company’ and ‘computer’ were assigned a value of em = 0 from the outset. Although ‘company’ and ‘computer’ share the same prefix, adjacent n-grams ‘pan’ and ‘put’ differ. Because of that, those words were deemed to be unrelated (Xu & Croft, 1998).

Statistical Approaches to Stemming

The Porter stemmer, then, might be criticised for its reliance on rules. Even so, such a rule dictated permissible n-grams in the co-occurrence approach of Xu and Croft (1998). In a similar way, Larkey, Ballesteros, and Connell (2002) note that designers of stemmers often build linguistic expertise into algorithms. In contrast, statistical methods promote conflation without resort to linguistic rules. Related words might be grouped purely by measuring the similarity of n-grams. So-called equivalence classes can be formed solely from words that share a particular n-gram of letters; the challenge lies in setting an appropriate threshold for the proportion of any related words that such n-grams must comprise (Larkey et al., 2002).
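The role of that threshold can be illustrated with a small experiment. The sketch below is not the method of Larkey et al. (2002): as a stand-in for n-gram similarity it uses the Dice coefficient over letter bigrams, and it forms classes by single-link grouping, both of which are assumptions of mine. The function names and the word list are likewise hypothetical.

```python
def bigrams(word):
    """The set of adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient: shared bigrams as a proportion of all bigrams."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def equivalence_classes(words, threshold):
    """Merge any two words whose similarity reaches the threshold."""
    words = list(words)
    parent = {word: word for word in words}

    def find(word):
        # Follow parent links to the class representative, compressing paths.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if dice(a, b) >= threshold:
                parent[find(a)] = find(b)

    classes = {}
    for word in words:
        classes.setdefault(find(word), set()).add(word)
    return sorted(classes.values(), key=sorted)


WORDS = ["walk", "walked", "walking", "police", "policy"]
print(equivalence_classes(WORDS, threshold=0.6))
print(equivalence_classes(WORDS, threshold=0.9))
```

At a threshold of 0.6 the forms of ‘walk’ are grouped, but so are ‘police’ and ‘policy’, which is over-stemming; at 0.9 every word stands alone, which is under-stemming. No rule of language is consulted at any point, so the quality of conflation rests entirely on the chosen threshold.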