Spelling Correction for GRiST Mind Maps
3.4 Experiments in Improving Spelling Correction
The proposed means of checking spelling corrections were assessed by implementing the pseudo code from Figures 3.9 and 3.10. The resulting Java class MindmapSpellChecker was run against the collection of GRiST mind maps to assess the overall effectiveness of those measures. Specific attention was paid to values of maxL that led to accepting appropriate corrections, while rejecting others.
Method
The first step was to compile a list of unique words from GRiST mind maps. To that end, all nodes were retrieved from the database that held those mind maps1. Text from each node was transformed to lower case, then split into separate words by the Java method split() from the String class. Characters other than alphanumeric ones served as delimiters, and were discarded during that process. For a particular node, that yielded one or more words composed of just the letters a − z and the numerals 0 − 9. Words containing numerals were subsequently discarded, and did not participate in the analysis. The following command, then, splits text in a Java String object called nodeText into an array of single words:
String[] words = nodeText.toLowerCase().split("\\W");
Words resulting from splitting nodes were loaded into a Java TreeSet object. By storing any particular value just once, such objects omit duplicated entries. The ensuing list served as an extra dictionary for checking spelling corrections, although the mind map containing any non-word was not consulted.
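The steps above can be sketched in Java as follows. This is a minimal illustration rather than the code of MindmapSpellChecker: the class name WordListSketch, the method uniqueWords() and the sample node texts are all assumptions made for the example. It also splits on the explicit pattern [^a-z0-9]+ instead of \W, so that underscores count as delimiters and runs of punctuation do not yield empty fragments.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class WordListSketch {

    // Collects unique, wholly alphabetic words from the text of each node.
    static TreeSet<String> uniqueWords(List<String> nodeTexts) {
        TreeSet<String> words = new TreeSet<String>();
        for (String nodeText : nodeTexts) {
            // Lower-case, then split on runs of anything other than a-z or 0-9.
            for (String word : nodeText.toLowerCase().split("[^a-z0-9]+")) {
                // Skip empty fragments and any word that contains a numeral.
                if (!word.isEmpty() && word.matches("[a-z]+")) {
                    words.add(word); // a TreeSet stores each value just once
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical node texts, invented for illustration.
        List<String> nodes = Arrays.asList("Half a spliff", "2nd referral; half-day");
        System.out.println(uniqueWords(nodes)); // [a, day, half, referral, spliff]
    }
}
```

The word 'half' occurs in both sample nodes but is stored once, while '2nd' is discarded for containing a numeral.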
Suspected errors absent from that list of GRiST words were passed to the Jazzy spelling checker from Idzelis (2005). Jazzy provides several dictionaries as separate files, of which just two were used: the general and the U.K. dictionaries; words from the U.S. dictionary did not contribute spelling corrections.
Candidate corrections from Jazzy were further processed by a bespoke Java object, MindmapSpellChecker, which calculated L and L0 between errors and corresponding suggestions. Although matrices introduced in Section 3.3.2 revealed shared sub-strings between words, that facility will not be used to validate spelling corrections. Appropriate corrections might be rejected needlessly if, say, a letter must be inserted in order to correct a misspelling, splitting any shared sub-string to either side of that insertion. The very nature of spelling correction makes that likely. For that reason, just the Edit Distance from such matrices will be used to refine suggested corrections.
1Chapter 9 deals in full with retrieving information from GRiST mind maps.
Proposed refinements for spelling corrections are summarised next as pseudo code in Figure 3.9, which shows how a given word is researched, and any suggestions evaluated:
Pseudo Code for L0 spelling correction     Comments
get corrections for word                   Retrieve suggestions for the current word
for each correction until one accepted     Find the first acceptable suggestion
    if correction in GRiST                 Correction is a word from GRiST mind maps
        accept correction                  Accept proposed correction for the current word
    else
        if correctionOk()                  Check acceptability with L0
            accept correction              Accept proposed correction for the current word
        end-if
    end-if
end-for
Figure 3.9: Pseudo Code for checking spelling corrections by means of L0.
An empty list from the first step from Figure 3.9 indicates a valid word; otherwise, suggested corrections are checked in GRiST mind maps, which in that way constitute a dictionary. The loop that processes spelling corrections will be entered, then, should any suggestions arise. Words that receive no support from those mind maps are checked further by the function correctionOk()1. Additional pseudo code for that function appears next as Figure 3.10, which applies L0 to proposed corrections:
Pseudo Code for correctionOk()             Comments
get L between word & correction            Compute L and L0 = L − d between word & suggestion
if L0 = 0                                  Length differences account for variation
    accept correction                      Accept proposed correction for the current word
else
    if (d = 0 AND L ≤ maxL)                Equal lengths, and reasonably similar words
        accept correction                  Accept proposed correction for the current word
    end-if
end-if
Figure 3.10: Pseudo code for the correctionOk() function called from Figure 3.9.
Pseudo code from Figure 3.10 first checks for corrections having L0 = 0 in comparison with spelling mistakes; such suggestions are accepted. The second check, applied when L0 > 0, is performed on words that yield corrections having the same length. In such instances d = 0, so that L0 = L, and that value is checked against a constant given as maxL. Suggestions are rejected should L exceed that constant, indicating excessive variation.
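One way to read Figures 3.9 and 3.10 together is sketched below in Java. It is not the source of MindmapSpellChecker: the class name CorrectionSketch, the helper editDistance() and the method firstAccepted() are assumptions made for illustration, as is the misspelling 'referal'. The second test is taken as L ≤ maxL, following the statement that suggestions are rejected should L exceed that constant.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CorrectionSketch {

    // Standard edit distance L: insertions, deletions and substitutions each cost 1.
    static int editDistance(String a, String b) {
        int[][] m = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) m[i][0] = i;
        for (int j = 0; j <= b.length(); j++) m[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                m[i][j] = Math.min(Math.min(m[i - 1][j] + 1, m[i][j - 1] + 1),
                                   m[i - 1][j - 1] + cost);
            }
        }
        return m[a.length()][b.length()];
    }

    // Figure 3.10: accept when L0 = 0, or when lengths match and L is within maxL.
    static boolean correctionOk(String word, String correction, int maxL) {
        int l = editDistance(word, correction);
        int d = Math.abs(word.length() - correction.length());
        int l0 = l - d; // adjusted distance L0 = L - d
        if (l0 == 0) return true;
        return d == 0 && l <= maxL;
    }

    // Figure 3.9: return the first acceptable suggestion, or null if none passes.
    static String firstAccepted(String word, List<String> suggestions,
                                Set<String> gristWords, int maxL) {
        for (String s : suggestions) {
            if (gristWords.contains(s) || correctionOk(word, s, maxL)) return s;
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the list of words compiled from GRiST mind maps.
        Set<String> grist = new TreeSet<String>(Arrays.asList("referral"));
        System.out.println(firstAccepted("referal", Arrays.asList("referral"), grist, 2));
        System.out.println(firstAccepted("hetrosexual", Arrays.asList("heterosexual"), grist, 2));
        System.out.println(firstAccepted("tranqyillisers", Arrays.asList("tranquillisers"), grist, 2));
        System.out.println(firstAccepted("clorazil", Arrays.asList("gloriously"), grist, 2));
    }
}
```

The four calls print referral (found in the stand-in GRiST list), heterosexual (L = 1, d = 1, so L0 = 0), tranquillisers (d = 0 and L = 1) and null (lengths differ and L0 > 0).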
1Please note the font used to depict functions such as correctionOk(), and subsequently, all Java classes.
The first candidate separated by L0 = 0 from any offending word was accepted automatically. Otherwise, simple L assessed corrections that were the same length as the word that produced them. Values from the inclusive range 0..2 were applied as the constant maxL, which defined successively more lenient criteria for accepting corrections. Suggestions were dismissed should comparisons with offending words yield L > 2. Such unrecognised words were subsequently researched in the list from GRiST, in search of smaller words that yielded L0= 0. That, in turn, would reveal valid words embedded in misspellings.
Non-words arising from conjoined words were split into separate, valid words. That was done by a combination of L0 and researching words from GRiST mind maps. Non-words were compared against each word from the list compiled from those mind maps. Cases having L0 = 0 were added to a TreeSet from an array of such objects; each element of that array represented a particular non-word. On completion, MindmapSpellChecker examined the resulting list of TreeSets. Should any such object contain more than just a single word, the lengths of those words were summed. Non-words having lengths equal to the summed length from the associated TreeSet were split into those corresponding words.
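The splitting procedure just described might look as follows in Java, for a single non-word. This is a sketch under stated assumptions: the class ConjoinedSplitSketch, its methods, and the example non-word 'lowmood' with its three-word stand-in for the GRiST list are all invented for illustration and do not come from the results. The order in which the parts are reassembled is not addressed here; the TreeSet simply returns them alphabetically.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class ConjoinedSplitSketch {

    // Standard edit distance L, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    // GRiST words at an adjusted distance of L0 = L - d = 0 from the non-word.
    static TreeSet<String> embeddedWords(String nonWord, Set<String> gristWords) {
        TreeSet<String> embedded = new TreeSet<String>();
        for (String w : gristWords) {
            int d = Math.abs(nonWord.length() - w.length());
            if (editDistance(nonWord, w) - d == 0) embedded.add(w);
        }
        return embedded;
    }

    // Returns the embedded words when there are several and their lengths sum
    // to the non-word's length; otherwise returns an empty set (no split).
    static TreeSet<String> split(String nonWord, Set<String> gristWords) {
        TreeSet<String> embedded = embeddedWords(nonWord, gristWords);
        int total = 0;
        for (String w : embedded) total += w.length();
        boolean splittable = embedded.size() > 1 && total == nonWord.length();
        return splittable ? embedded : new TreeSet<String>();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the list of words from GRiST mind maps.
        Set<String> grist = new TreeSet<String>(Arrays.asList("low", "mood", "referral"));
        System.out.println(split("lowmood", grist)); // [low, mood]
    }
}
```

Here 'low' and 'mood' both give L0 = 0 against 'lowmood', and their lengths (3 + 4) sum to its length of 7, so the non-word is split; 'referral' gives L0 > 0 and is ignored.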
Results
GRiST mind maps contained 4447 unique words, of which 4253 were wholly alphabetic. Jazzy failed to recognise 374 of those 4253 words, and offered suggestions for 337, leaving 37 with no suggestion at all. Of those suggestions, 212 appeared in GRiST mind maps, and were accepted as novel words. Of the remaining 125 unrecognised words assessed by L0, 59 yielded corrections that gave L0 = 0. A further 30 corrections of identical length to the word producing them were accepted by means of the standard Edit Distance, where L = 1; an additional 11 corrections were accepted by L = 2. Suggestions for the remaining 25 words were rejected, having failed each of those tests up to the ceiling of maxL = 2. A summary of results presented so far follows as Table 3.1:
Note                    Test Performed on Non-Word vs. Jazzy Suggestion    Counts
Rejected words          No suggestions from Jazzy                              37
Accepted suggestions    Found as a word in GRiST mind maps                    212
                        L0 = 0                                                 59
                        d = 0 and L = 1                                        30
                        d = 0 and L = 2                                        11
Rejected suggestions    All suggestions failed the above tests                 25
Totals                                                                        374
Table 3.1: Summary of Spelling Correction Results.
Details of results from Table 3.1 are presented next in four parts. First of all come corrections that were accepted solely by consulting Jazzy, to show how it performed unaided. After those come corrections that found support from GRiST mind maps. The third set of results shows suggestions that were accepted
due to checking L and L0. After that come results from Jazzy alone that improved on consulting GRiST mind maps, or on checking L and L0. For each of those first three sets of results, acceptable corrections are shown first. Separate tables of inappropriate corrections, where needed, appear immediately after any suitable ones. The fourth main set shows results that corrected conjoined words by splitting them into separate, valid words. Following the main four sets of results from spelling correction, a supplementary set highlights corrections that will subsequently affect stemming.
Results I for Jazzy Corrections Alone
Taking the first suggestion from Jazzy for any dubious word sometimes yielded suitable corrections. A few examples are presented next as Table 3.2:
Non-Word       Suggestion       Non-Word      Suggestion    Non-Word       Suggestion
aquire         acquire          flucky        fluky         survelience    surveillance
begining       beginning        likliehood    likelihood    tattooes       tattoos
colabarating   collaborating    openess       openness      welbeing       wellbeing
Table 3.2: Appropriate spelling corrections from Jazzy alone.
In contrast to corrections from Table 3.2, Jazzy accepted various candidate corrections that, to a human, would be blatantly misleading. Examples of such inappropriate suggestions follow as Table 3.3:
Non-Word     Suggestion    Non-Word        Suggestion      Non-Word       Suggestion
agrophoia    agraphia      keyworker       coworker        sytem          stem
benzoes      bonzes        parasuicidal    presystole      untreatable    intratubal
clorazil     gloriously    premorbid       pyromorphite    whinging       winging
Table 3.3: Inappropriate spelling corrections from Jazzy alone.
Discussion I of Jazzy Corrections Alone
The first set of results, then, determined the efficacy of simply consulting Jazzy. As Table 3.2 showed, appropriate suggestions arose for misspellings of various lengths. Those cases needed no further analysis, any first suggestion from Jazzy proving adequate. Conversely, Table 3.3 showed various unacceptable suggestions from Jazzy. Indeed, certain of those words were not mistakes at all, but valid words missing from Jazzy’s dictionaries. That echoes the distinction made by Kukich (1993) between non-words which constitute true errors, and those that, while unrecognised, are actually valid. Examples included ‘parasuicidal’ and ‘clorazil’, having respective corrections ‘presystole’ and ‘gloriously’. To a human eye, such suggestions bear little, if any, resemblance to corresponding suspect words.
Specialised medical terminology, though, might legitimately be excluded from general-purpose checkers
such as Jazzy. Even so, less arcane words such as ‘keyworker’, ‘untreatable’ and ‘whinging’ failed to find dictionary entries. Corrections for truly misspelt words suffered a similar problem. Although ‘agrophoia’ was an error, the suggestion of ‘agraphia’ would be misleading. The intended word ‘agoraphobia’ was not offered as a suggestion, revealing the limitations of dictionary files searched by Jazzy. Perhaps one or more additional dictionaries might yield a majority decision in such cases; even so, nothing short of a full list of English words could be seen as complete. Any such list would not be current for very long, due to the dynamic nature of English noted by Kukich (1993).
Results II for Jazzy Corrections Researched in GRiST Mind Maps
Several misspellings yielded just a sole suggestion. In 148 such cases, suggestions existed in one or more GRiST mind maps. Table 3.4 gives examples of sole corrections accepted in that way:
Non-Word Suggestion Non-Word Suggestion Non-Word Suggestion
Table 3.4: Appropriate sole suggestions accepted by reference to GRiST mind maps.
In addition, various unrecognised words yielded several suggestions. In 41 such cases, the first suggestion was accepted on finding support in one or more GRiST mind maps. A selection of those corrections appears next as Table 3.5. The number of suggestions for any word from the Non-Word column is given under column n, followed by the first correction from any proffered list. Examples are ordered alphabetically by the number of suggestions:
Non-Word     Suggestions       Non-Word    Suggestions      Non-Word    Suggestions
             n   1st                       n   1st                      n   1st
absense      2   absence       neglectt    2   neglect      reqire      5   require
diference    2   difference    severly     2   severely     worring     5   worrying
familiy      2   family        diferent    3   different
Table 3.5: Appropriate first suggestions accepted by referring to GRiST.
In cases of multiple suggestions, the first correction in any such list did not always exist in GRiST’s mind maps. The second suggestion from Jazzy, though, was found in those mind maps in 14 such cases, a selection of which appears next as Table 3.6. Preferred second suggestions in that table appear in bold type; examples are again ordered alphabetically by number of suggestions:
Non-Word     Suggestions                   Non-Word    Suggestions
             n   1st         2nd                       n   1st          2nd
especialy    2   especial    especially    illnesss    3   illness’s    illnesses
sayng        2   sang        saying        tenous      5   tenus        tenuous
seperate     2   sperate     separate      ligher      6   liger        lighter
Table 3.6: Appropriate second suggestions accepted by referring to GRiST mind maps.
Note that all second suggestions from Table 3.6 were longer than corresponding first entries; that was due entirely to Jazzy. Further, second suggestions were better than those offered before them.
Discussion II of Jazzy Corrections Researched in GRiST Mind Maps
Suggestions from Jazzy, then, were researched in a list of words from GRiST mind maps. Cases from Table 3.4 where Jazzy offered just a single correction showed various acceptable suggestions. Corrections such as ‘psychotic’ for ‘pyschotic’ proved Jazzy to contain certain mental-health terms, at least; such words, though, enjoy relatively common usage. That is further true of corrections such as ‘referral’, ‘depressed’, and ‘nihilistic’. In Jazzy’s favour is the suggestion ‘shitty’ for ‘shittiy’, which shows a coverage of slang terms. Note, though, that the difference between, say, ‘psychotic’ and ‘pyschotic’ arises from transposing the letters ‘sy’ for ‘ys’; in terms of edit distance, two substitutions constitute a relatively small difference that should not challenge Jazzy. Support from GRiST mind maps, though, further boosts confidence in such corrections.
A similar approach yielded results from Table 3.5, except that any corrections appeared as the first entries in longer lists of suggestions. Finding corrections in GRiST mind maps terminated such searches immediately, ignoring any further suggestions: the first one proved acceptable. Suggestions were, though, relatively similar to any offending words; the word ‘reqire’, for example, needs just a single insertion of ‘u’ to make ‘require’. Because such suggestions were but competing candidates, support from GRiST mind maps was all the more important in selecting the best correction.
Table 3.6 further presented corrections appearing second in any list from Jazzy. Interestingly, Jazzy presented shorter corrections before longer ones; take, for example, the misspelling ‘ligher’ that yielded six suggestions. The first one, ‘liger’, is a cross between a male lion and a tigress, whereas the longer subsequent entry ‘lighter’ was a far more likely replacement. For longer lists of suggestions, then, researching GRiST mind maps avoided inappropriate corrections in favour of longer, apposite ones.
Ignoring whatever mind map contained any non-word avoided taking support from the author responsible. Cutting and pasting non-words in FreeMind would lend false evidence of such words’ true existence, whereas usage by a further author is more convincing. In fact, such research yielded no inappropriate corrections, regardless of position in any list of suggestions. That is not to say, though, that all spelling
errors were rectified in that way. Corrections that failed to find support from GRiST mind maps, then, were further checked by means of L0; results from that procedure are presented next.
Results IIIa for Spelling Corrections Checked by L0= 0
In the absence of any support from GRiST mind maps, then, proposed spelling corrections were analysed by means of L0. Suggestions at an adjusted distance of L0 = 0 from corresponding unrecognised words were accepted without further checks; a selection of such corrections appears next in Table 3.7. Column L lists simple edit distances between words and any suggested corrections. Differences in length between words and suggestions appear under column d, with the resulting adjusted distance under L0:
L d L0 Non-Word Suggestion L d L0 Non-Word Suggestion
1 1 0
Table 3.7: Appropriate spelling corrections accepted by L0 = 0.
All but one of the suggestions from that table were the first or only one offered by Jazzy, as was the case for subsequent results in this section1. All of the inappropriate spelling corrections accepted by applying a criterion of L0= 0 are presented next as Table 3.8:
L d L0 Non-Word Suggestion L d L0 Non-Word Suggestion
1 1 0
Table 3.8: Inappropriate spelling corrections accepted by L0= 0.
Discussion IIIa of Spelling Corrections Checked by L0= 0
Suggestions from Jazzy, then, might not exist in any GRiST mind map. In such cases, corrections were compared with any word they sought to replace. To that end, L0 quantified any similarity between non-words and candidate corrections. For example, the value of L = 1 between the replacement ‘heterosexual’ for ‘hetrosexual’ from Table 3.7 indicated just a single insertion, deletion, or substitution. The difference in those words’ lengths, d = 1, gives L0 = L − d = 1 − 1 = 0. That lengths differed by 1, then, accounts for all the variation noted by L = 1. In other words, a letter must have been inserted, rather than deleted or substituted; that is the letter ‘e’ inserted between ‘het’ and ‘rosexual’. In a similar way, the correction ‘monosyllabic’ was accepted for ‘monsylabic’. Although that case gave d = 2 and L = 2, the value of L0 = L − d = 0 indicated two insertions: the letter ‘o’ after ‘mon’, and a second ‘l’ in ‘sylabic’.
1The sole exception was the second suggestion ‘wouldst’ that replaced ‘wouldnt’.
Table 3.7 showed the majority of suggestions accepted in that way to be suitable, with exceptions listed in Table 3.8. Of those inappropriate substitutions, ‘ptsd’ was an acronym for post-traumatic stress disorder rather than a real word. For the misspelling ‘paradoxicaly’, accepting the first suggestion that had L0 = 0 led to abandoning the search in favour of ‘paradoxical’. The more suitable word ‘paradoxically’ came second in the list, due to Jazzy’s tendency to offer shorter suggestions first. Checking might be allowed to continue, stopping on the first unlikely suggestion rather than on the first reasonable one.
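The idea of continuing past the first reasonable suggestion can be illustrated with a short Java sketch. It is one possible reading of that proposal, not a procedure that was run in these experiments: the class ContinueSearchSketch and the method lastBeforeFailure() are hypothetical names, and only the two suggestions for ‘paradoxicaly’ whose order is reported above are used.

```java
import java.util.Arrays;
import java.util.List;

class ContinueSearchSketch {

    // Standard edit distance L, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    // True when the adjusted distance L0 = L - d is zero.
    static boolean adjustedDistanceIsZero(String word, String suggestion) {
        int d = Math.abs(word.length() - suggestion.length());
        return editDistance(word, suggestion) - d == 0;
    }

    // Walks the list while suggestions give L0 = 0 and returns the last one seen
    // before the first failure (null if the very first suggestion fails).
    static String lastBeforeFailure(String word, List<String> suggestions) {
        String accepted = null;
        for (String s : suggestions) {
            if (!adjustedDistanceIsZero(word, s)) break;
            accepted = s;
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> suggestions = Arrays.asList("paradoxical", "paradoxically");
        System.out.println(lastBeforeFailure("paradoxicaly", suggestions)); // paradoxically
    }
}
```

Both suggestions give L = 1 and d = 1 against ‘paradoxicaly’, so the search no longer stops at ‘paradoxical’. Whether that policy helps in general would need testing, since a later suggestion that still yields L0 = 0 need not be the better one.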
Further problems arose from accepting suggestions from single-entry lists, when any suggestion similar enough to the corresponding non-word was applied automatically. For example, lack of support for the word ‘spliff’ elsewhere in GRiST mind maps led to L0 = 0 accepting ‘spiff’ as a replacement. In fact, the node [patient x .half a spliff] was a child of [cannabis] that in turn branched from [substance misuse] by way of [drugs]. Context, as Kukich (1993) and Wilcox-O’Hearn et al. (1998) note, might contribute knowledge that leads machines to better choices. That, though, would mean considering semantic distance, in order to link ‘spliff’, which is actually a cannabis cigarette, to related words that concern drug abuse.
Results IIIb for Spelling Corrections Checked by d = 0 and L0 = 1
Whereas L0 = 0 in Table 3.8 guaranteed acceptance, cases having L0 > 0 were subjected to further checking by means of L, the standard Edit Distance. That metric was applied to non-words having suggestions of identical length, that is, when d = 0. An acceptance threshold of L = 1 in such cases gave the results in Table 3.9; the redundant L0 column is retained to highlight that L0= L when d = 0:
L d L0 Non-Word Suggestion L d L0 Non-Word Suggestion
1 0 1
Table 3.9: Spelling Corrections Accepted by d = 0 and L = 1.
While corrections from Table 3.9 were appropriate, others were less so. Such inappropriate corrections accepted by d = 0 and L = 1 appear next as Table 3.10:
L d L0 Non-Word Suggestion L d L0 Non-Word Suggestion
1 0 1
Table 3.10: Inappropriate Corrections Accepted by d = 0 and L = 1.
Discussion IIIb of Spelling Corrections Checked by d = 0 and L0 = 1
Further to cases having L0 = 0, Table 3.9 presented largely suitable corrections separated by L0 = 1 from any suspect word. In those cases, suggestions were accepted should d = 0, indicating a corresponding non-word of identical length. In that way, the error ‘tranqyillisers’ was correctly changed to ‘tranquillisers’. Identical lengths of 14 letters gave d = 0; a value of L = 1 further yielded L0 = 1, indicating the sole substitution of ‘u’ for ‘y’. In fact, the majority of corrections from Table 3.9 involved replacing one vowel with another, such as in ‘peripheral’ for ‘perepheral’, ‘dysthymia’ for ‘dysthimia’, and ‘fundamental’ for ‘fundemental’1; that suggests a rule to be researched further. For now, the correction ‘tranquillisers’ in particular encourages links between forms of drugs, whether obtained on prescription or off the street.
The few exceptions that arose were listed in Table 3.10. Of those, the word ‘aspergers’ was a medical term missing from Jazzy’s dictionaries, as was the missing word ‘detox’, a slang term for detoxification therapy undergone by drug addicts. Surprisingly, ‘rejector’ was the sole suggestion offered for ‘rejecton’; ‘rejection’ would have been preferable, emphasising that suggestions can be checked only if they actually arise.
Results IIIc for Spelling Corrections Checked by d = 0 and L0 = 2
While continuing to insist on identical lengths, allowing slightly more variation between words and suggested corrections produced the results in Table 3.11, for cases where d = 0 and L = 2: