Restricting the search


The last chapter showed how the prototype corrector, given about a hundred words from the dictionary that seem to be reasonably promising candidates for a misspelling, ranks them in order so that it can offer a shortlist to the user, with its best guess at the top. This chapter looks at how the corrector selects the reasonably promising hundred out of a dictionary of about 70,000.

I introduced the Soundex system in Chapter Seven and showed how it could be used to produce a list of candidates. Briefly, each word in the dictionary is given a Soundex code which encapsulates certain of the word’s salient features. The hope is that any misspelling of a word will contain at least these same salient features, so that the misspelling and the correct spelling will have the same code. A code is calculated from the misspelling, and all the words in the dictionary that have this same code are retrieved to form the candidate list. The details of the code (Knuth 1973) are presented in Figure 9.1, with some examples.

Take as an example the misspelling disapont. A corrector would compute the code D215 from disapont and then retrieve all the words with code D215: disband, disbands, disbanded, disbanding, disbandment, disbandments, dispense, dispenses, dispensed, dispensing, dispenser, dispensers, dispensary, dispensaries, dispensable, dispensation, dispensations, deceiving, deceivingly, despondent, despondency, despondently, disobeying, disappoint, disappoints, disappointed, disappointing, disappointedly, disappointingly, disappointment, disappointments, disavowing.

1) Keep the first letter (in upper case).
2) Replace these letters with hyphens: a, e, i, o, u, y, h, w.
3) Replace the other letters by numbers as follows:
      b, f, p, v: 1
      c, g, j, k, q, s, x, z: 2
      d, t: 3
      l: 4
      m, n: 5
      r: 6
4) Delete adjacent repeats of a number.
5) Delete the hyphens.
6) Keep the first three numbers or pad out with zeros.

For example:

Birkbeck   Zbygniewski   toy    car    lorry   bicycle
B-621-22   Z1-25---22-   T--    C-6    L-66-   B-2-24-
B-621-2    Z1-25---2-    T--    C-6    L-6-    B-2-24-
B621       Z125          T000   C600   L600    B224

Figure 9.1 The Soundex code, with some examples

The prototype corrector uses Soundex but not in its original form. Soundex was invented in America, as one can see from the prominence it gives to the letter r. Americans generally pronounce the phoneme /r/ even when it is at the end of a word or is followed by a consonant phoneme – Ma (mother) and mar (spoil), lava (volcanoes) and larva (insects) are not homophones for most Americans. I had British people in mind, however, in designing my spellchecker, and most people in the British Isles only pronounce /r/ when it’s followed by a vowel sound – compare far away and far behind. Though there are some exceptions, notably the Scots, the Irish and people in the West Country (Wells 1982), I felt that the letter r was not so salient for most British people, so I included it along with h and w and so on in the list of letters to be ignored.
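The six steps of Figure 9.1 can be sketched in a few lines of Python. This is an illustration only, not the prototype's own code: the function names, the `ignore_r` parameter and the small word list are assumptions made for the example. With `ignore_r=True` the letter r is treated as one of the ignored letters, as just described.

```python
# Letter-to-digit table from step 3 of Figure 9.1.
DIGITS = {ch: digit
          for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6")]
          for ch in letters}


def soundex(word, ignore_r=False):
    """Code a non-empty word by the six steps of Figure 9.1.

    ignore_r=True (a name assumed for this sketch) treats r like a vowel.
    """
    word = word.lower()
    marks = []
    for ch in word[1:]:
        if ch == "r" and ignore_r:
            marks.append("-")
        else:
            marks.append(DIGITS.get(ch, "-"))    # steps 2 and 3
    digits = []
    previous = "-"
    for mark in marks:
        if mark != "-" and mark != previous:     # step 4, then step 5
            digits.append(mark)
        previous = mark
    return word[0].upper() + "".join(digits[:3]).ljust(3, "0")  # steps 1 and 6


def candidates(misspelling, dictionary):
    """Return the dictionary words that share the misspelling's code."""
    target = soundex(misspelling)
    return [w for w in dictionary if soundex(w) == target]


# A tiny illustrative word list, not the real 70,000-word dictionary.
WORDS = ["disband", "dispense", "despondent", "disappoint", "distant"]
print(soundex("disapont"))              # D215
print(candidates("disapont", WORDS))    # every word above except distant
```

Calling `soundex("car", ignore_r=True)` gives C000 rather than C600, which shows how much less weight the modified code gives to r.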

The second problem with Soundex, when used for spelling correction, is that it takes the first letter of the misspelling as part of the code. I noted in Chapter Four that the majority of misspellings have the same first letter as the correct spelling, but not all of them do. A corrector working from the candidate lists that Soundex retrieves for rong, enything and neumonia would fail to correct misspellings of this kind, though they are trivial errors that ought to be easy to correct.

There are two ways of dealing with this, and both are used in the prototype. The first is to treat some words that begin with different letters as if they began with the same letter. For example, words beginning with j are assigned the same code as words beginning with g, so that, say, jipsy would get the same code as gypsy.

The second is to compute a code from the pronunciation of a word as well as from its spelling, and to store the entry more than once in the dictionary file if the codes are different. For instance, the spelling psyche produces code P220 whereas the pronunciation produces S200, so the entry for psyche is stored twice.

The second method obviously enlarges the dictionary file whereas the first method does not. The second method also deals with another, related problem, which arises when the misspellings of a word are likely to have a different consonant pattern from the correct spelling, as in dett for debt. Since the code is computed from the pronunciation as well as from the spelling, the entry for debt will be included twice in the file, under D130 and D300.

Misspellings that are not based on pronunciation can also be anticipated in this way. The dictionary file contains information about letters likely to be omitted, substituted and so on – as explained in the last chapter – and this can be used in computing extra codes. The second m of remember, for instance, is flagged as likely to be omitted, so the code R510 is computed as well as R551.
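Storing an entry under several codes can be sketched as follows. This is a guess at the idea, not the prototype's file format. The respellings `saiki`, `det` and `remeber` are hypothetical stand-ins for the pronunciation and omission information held in the dictionary file, and `FIRST_LETTER` holds only the j/g pair mentioned above. A compact version of the Figure 9.1 coding, with r ignored, is included so that the block runs on its own.

```python
from collections import defaultdict

# Digits as in Figure 9.1, but with r left out so that it is ignored like a vowel.
DIGITS = {ch: d
          for letters, d in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                             ("l", "4"), ("mn", "5")]
          for ch in letters}

# First letters treated as equivalent (only the pair named in the text).
FIRST_LETTER = {"J": "G"}


def code(form):
    """Modified Soundex-like code of a non-empty letter string."""
    form = form.lower()
    first = form[0].upper()
    first = FIRST_LETTER.get(first, first)
    digits, previous = [], None
    for ch in form[1:]:
        d = DIGITS.get(ch)               # None plays the role of a hyphen
        if d is not None and d != previous:
            digits.append(d)
        previous = d
    return first + "".join(digits[:3]).ljust(3, "0")


def build_index(entries):
    """Map each code to the set of words filed under it."""
    index = defaultdict(set)
    for word, extra_forms in entries.items():
        for form in [word, *extra_forms]:
            index[code(form)].add(word)
    return index


# word -> extra forms to code (hypothetical respellings, for illustration only)
ENTRIES = {
    "psyche": ["saiki"],      # stands in for the pronunciation
    "debt": ["det"],          # stands in for the pronunciation
    "remember": ["remeber"],  # second m flagged as likely to be omitted
    "gypsy": [],
}

index = build_index(ENTRIES)
print(sorted(c for c, words in index.items() if "psyche" in words))  # ['P220', 'S200']
print(code("jipsy") == code("gypsy"))                                # True
```

The cost is the one noted above: every extra code is an extra entry, so the file grows with each kind of error that is anticipated.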

Further codes are computed, if necessary, to make sure that a homophone has an entry under its partner’s code as well as under its own. For example, if sought were detected as a real-word error – in ‘this is the sought of thing,’ for instance – then we would want sort to appear in the candidate list. However, the group of words with code S230 does not contain sort. (Group S300, the group for sort, contains sought because of the pronunciation, but not vice-versa.) So sort is simply added to group S230, and likewise for all pairs of homophones.

A final group of errors that has to be anticipated in this way is caused by typing slips that affect the first letter, such as nad for and. It is not feasible to attempt to anticipate all possible errors of this kind, since each word would require many entries and the dictionary file would be enlarged many times over, but a corrector ought to be able to cope with nad, hte and the like, so extra entries are included for about a thousand common words; the word and, for example, has code N300 as well as A530, so that it will appear in the candidate list for nad. If the corrector encountered an error of this kind in some other word, such as ocmputer, it would fail to retrieve the correct word (computer) in its candidate list, but such errors are rare.

The effect of entering a number of words more than once in the dictionary file in all the ways just described is to increase its size from about 70,000 entries to about 85,000.

These Soundex-like codes divide up the dictionary into a number of groups of entries, each group sharing a particular code. It would be convenient if these groups were about the same size – say 850 groups of a hundred words each. Unfortunately the pattern is quite different. Nearly two thousand groups are produced; the largest groups have a few hundred entries while there are several hundred groups with fewer than ten entries each. These small groups are a nuisance. As I will explain shortly, the corrector in the course of producing candidates for a misspelling actually retrieves not just one group but several, and, on the system on which the prototype was developed, it is more efficient to retrieve a hundred entries at one go than ten groups of ten.

The solution is simply to collapse together various collections of small groups. Having computed a code, the corrector consults a table of codes that are to be recoded, and changes the code if necessary. For example, the groups F515, F514, F513, F512, F511 are all joined with F510 to make a single group. After these rearrangements, there are about 750 groups, and the smallest group has fifty entries.
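The recoding step amounts to a table lookup applied after the code has been computed. The sketch below assumes the table is a simple mapping from old code to new code. Only the F51x entries come from the text; the words in the example groups are illustrative, chosen because they would fall under those codes when r is ignored.

```python
# Codes that are to be recoded (only the example given in the text).
RECODE = {old: "F510" for old in ("F511", "F512", "F513", "F514", "F515")}


def group_code(code):
    """Return the code of the group in which entries with this code are stored."""
    return RECODE.get(code, code)


def merge_groups(groups):
    """Collapse small groups into the groups named by the recoding table."""
    merged = {}
    for code, words in groups.items():
        merged.setdefault(group_code(code), []).extend(words)
    return merged


# Illustrative groups, not taken from the real dictionary file.
groups = {"F510": ["fanfare", "funfair"], "F514": ["fumble", "fumbles"]}
print(merge_groups(groups))
# {'F510': ['fanfare', 'funfair', 'fumble', 'fumbles']}
```

Because the same table is consulted when a misspelling is coded, a misspelling with code F514 is sent straight to the merged F510 group.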

This modified Soundex system copes with the sort of errors that have some fairly obvious motivation – double letters in place of single (and vice-versa), ‘silent’ letters, unstressed vowels and so on. It does not cope well, however, with keying errors and with unexpected, but still correctable, spelling errors. For example, it is easy to see what atlogether is meant to be, but the group retrieved by the code A342 will not include altogether, so the corrector will fail to offer the right word.

If we take it as a target that the corrector should at least be able to cope with misspellings that have just one single-letter error (substitution, omission, insertion or transposition) at any place in the word after the initial letter, then the corrector will have to retrieve a number of groups with codes slightly different from the misspelling’s code. For example, to cope with the transposition of the leading consonants in atlogether, the corrector has to look in group A432, where the right word is to be found, as well as in the misspelling’s own group (A342).

Having computed the misspelling’s code and having retrieved and considered the words in that group, the corrector then generates all the related codes that could contain a word differing from the misspelling in one of the above four ways and retrieves the words in those groups. Depending on the original code, this can produce over forty codes. This need not mean that over forty groups need to be retrieved, however, for two reasons. The first is that some of these codes correspond to empty groups – there never were any entries in the dictionary with these codes. The corrector has a table of these and can simply delete them from the generated list. The second is that subgroups of these related codes often appear in the table of codes that are to be recoded, so several of them get collapsed together to form larger groups.
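The text does not say how the related codes are generated, so the sketch below substitutes a brute-force method: form every string that differs from the misspelling by one substitution, omission, insertion or transposition after the initial letter, code each one, and collect the distinct codes. The names `related_codes` and `groups_to_fetch`, and the `empty_codes` and `recode` arguments, are assumptions of the example. A compact version of the coding with r ignored is included so that the block runs on its own.

```python
import string

# Digits as in Figure 9.1, with r left out so that it is ignored like a vowel.
DIGITS = {ch: d
          for letters, d in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                             ("l", "4"), ("mn", "5")]
          for ch in letters}


def code(form):
    """Modified Soundex-like code of a non-empty letter string."""
    form = form.lower()
    digits, previous = [], None
    for ch in form[1:]:
        d = DIGITS.get(ch)               # None plays the role of a hyphen
        if d is not None and d != previous:
            digits.append(d)
        previous = d
    return form[0].upper() + "".join(digits[:3]).ljust(3, "0")


def one_edit_variants(word):
    """All strings one single-letter error away, keeping the initial letter."""
    word = word.lower()
    variants = set()
    for i in range(1, len(word) + 1):
        left, right = word[:i], word[i:]
        for ch in string.ascii_lowercase:
            variants.add(left + ch + right)                  # insertion
            if right:
                variants.add(left + ch + right[1:])          # substitution
        if right:
            variants.add(left + right[1:])                   # omission
        if len(right) > 1:
            variants.add(left + right[1] + right[0] + right[2:])  # transposition
    return variants


def related_codes(misspelling):
    """Codes, other than the misspelling's own, reachable by one error."""
    own = code(misspelling)
    return {code(v) for v in one_edit_variants(misspelling)} - {own}


def groups_to_fetch(misspelling, empty_codes=frozenset(), recode=None):
    """Drop codes known to be empty, then apply the recoding table."""
    recode = recode or {}
    kept = related_codes(misspelling) - set(empty_codes)
    return {recode.get(c, c) for c in kept}


print(code("atlogether"))                      # A342
print("A432" in related_codes("atlogether"))   # True
```

Working on the strings rather than on the codes is wasteful, since many variants share a code, but it makes it easy to see that the group holding altogether is among those retrieved.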

Even so, this procedure increases substantially the number of words to be considered, and a large proportion of these words bear very little resemblance to the misspelling. To save time, the corrector performs a quick test on them. It arranges the letters of the misspelling into letter-string form (that is to say, in alphabetical order and without repeats); mutliply, for instance, would become ilmptuy. Each dictionary entry contains its spelling in both original and letter-string form. Having retrieved an entry, the corrector compares its letter-string with that of the misspelling. If the two differ by no more than a certain number of letters (this number is larger for longer strings), the candidate has passed this quick test and is passed on for the full string-matching as described in the previous chapter. This quick test winnows out most of the entries from these related-code groups.
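The quick test might look something like this in Python. Two things here are assumptions rather than details from the text: the difference between two letter-strings is taken to be the number of letters that occur in one but not the other, and the threshold in `allowed_difference` is an invented formula that merely grows with length, as the text requires.

```python
def letter_string(word):
    """Letters of the word in alphabetical order, without repeats."""
    return "".join(sorted(set(word.lower())))


def letter_difference(a, b):
    """Number of letters found in one letter-string but not in the other."""
    return len(set(a) ^ set(b))


def allowed_difference(letters):
    """Hypothetical threshold: larger for longer letter-strings."""
    return 2 + len(letters) // 4


def passes_quick_test(misspelling, entry_letter_string):
    """Compare the misspelling with an entry's stored letter-string."""
    own = letter_string(misspelling)
    return letter_difference(own, entry_letter_string) <= allowed_difference(own)


print(letter_string("mutliply"))                               # ilmptuy
print(passes_quick_test("mutliply", letter_string("multiply")))  # True
print(passes_quick_test("mutliply", letter_string("disband")))   # False
```

Since each entry already carries its letter-string, the test costs only one set comparison per retrieved entry, which is far cheaper than the full string-matching.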

One final, and important, way of restricting the search is to take account of word length. People’s misspellings are generally about the same length as the words they are trying to spell. Table 9.1 shows how misspellings from a spelling test differed in length from the correct spellings.

Table 9.1 Lengths of misspellings from a spelling test

                             Missps of    Missps of short wds   Missps of long wds
                             all words    (3 to 5 letters)      (10 letters or more)
Missps were:
  −3 or more                     6%              0%                   11%
  −2                            12%              1%                   14%
  −1                            40%             26%                   41%
  same length                   32%             49%                   27%
  +1                             9%             20%                    6%
  +2                             1%              2%                    1%
  +3 or more                     *               2%                   **
Number of missps (=100%)      2514              96                  1194

*  8 missps (= 0.3%)
** 2 missps (= 0.2%)

The table shows, for example, that, out of all 2514 misspellings, thirty-two per cent were the same length as the words they were misspellings of. Forty per cent were one letter shorter, twelve per cent were two letters shorter, while six per cent were three or more letters shorter than the words they were misspellings of, e.g. betful for beautiful.

These misspellings were taken from a test given to fifteen-year-olds in Cambridge in 1970 – the same people who produced the corpus of errors described in Chapter Four. They are typical of spelling test results. The majority of misspellings are close to the length of the correct spellings. Misspellings of short words often err on the side of being longer than the correct spellings (depending somewhat on the actual words used in the test), while misspellings of long words tend to be shorter than the correct spellings. Comparison of results from a number of tests indicates that this bias is more marked in the misspellings of poorer spellers.

Spellcheckers are intended to correct the misspellings that people make in free writing, not in spelling tests, and it is possible that the errors they make in free writing follow a different pattern. The following table presents the same analysis as the previous one, but this time based on misspellings from free writing, in fact from compositions written by the same people who did the spelling test.


Table 9.2 Lengths of misspellings from free writing

                             Missps of    Missps of short wds   Missps of long wds
                             all words    (3 to 5 letters)      (10 letters or more)
Missps were:
  −3 or more                     1%              *                     2%
  −2                             9%              3%                   14%
  −1                            38%             37%                   40%
  same length                   30%             33%                   31%
  +1                            20%             24%                   12%
  +2                             2%              3%                    1%
  +3 or more                    **               0%                    0%
Number of missps (=100%)      3351            1145                   349

*  1 missp (= 0.09%)
** 1 missp (= 0.03%)

Misspellings of hyphenated words and misspellings containing spaces, such as head miss dress for headmistress, are not included in this table.

This table differs somewhat from the previous one, and this difference is good news for spellcheckers. In the spelling test results of Table 9.1, eleven per cent of the misspellings of long words were three or more letters shorter than the correct spellings; the corresponding figure from Table 9.2 is only two per cent. All but a handful of the misspellings in free writing were within two letters (in length) of the correct spelling. The following are examples of the few misspellings that differed by more than two letters: parallellel, hankichies, teral (terrible), divent (different), pertickly, semblys (assemblies), seticates (certificates).
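The comparison can be checked by adding up the rows of the two tables. The sketch below copies the rounded percentages from Tables 9.1 and 9.2 and enters the starred cells (each well under half a per cent) as 0, so the totals are approximate; the variable and function names are invented for the example.

```python
# Length difference of each row; -3 and 3 stand for "3 or more".
OFFSETS = [-3, -2, -1, 0, 1, 2, 3]

# Rounded percentages from Tables 9.1 (test) and 9.2 (free writing).
TABLES = {
    "test, all words": [6, 12, 40, 32, 9, 1, 0],
    "test, long words": [11, 14, 41, 27, 6, 1, 0],
    "free, all words": [1, 9, 38, 30, 20, 2, 0],
    "free, long words": [2, 14, 40, 31, 12, 1, 0],
}


def share_within(percentages, max_diff):
    """Percentage of misspellings within max_diff letters of the correct length."""
    return sum(p for off, p in zip(OFFSETS, percentages) if abs(off) <= max_diff)


for name, percentages in TABLES.items():
    print(name, share_within(percentages, 2))
# test, all words 94
# test, long words 89
# free, all words 99
# free, long words 98
```

On these figures a length window of two letters either way would miss about one misspelling in ten of the long words in the spelling test, but only about one in fifty in free writing.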

The difference between the two tables has a simple explanation: poor spellers use short words. In a spelling test, everyone has to attempt all the words. The poor spellers get the longer (usually harder) words wrong, sometimes very wrong, and often produce misspellings that are much shorter than the correct spellings. They know they cannot spell these words so, in their own writing, they simply don’t attempt them. It is not that these words are outside their vocabulary, but that they curtail their vocabulary, perhaps severely, in an attempt to reduce the number of spelling errors (Moseley 1989). The misspellings of long words in Table 9.2 are largely the work of the better spellers, whose misspellings are usually close, in length and in other ways, to the correct spelling.

There are, of course, exceptions to this. A poor speller occasionally has to attempt a long word, as in the semblys and seticates above. And there are some poor spellers who regularly use long words and spell them badly (Frith 1980), but they are exceptions. In the corpus from which the above results were taken, the following passages are typical, the first of a poor speller, the second of a good one:

In my primary shcool, it was roton because you we treated like littal babys, And you had to do babys work, and that was no good, and if you did’ent do ore home work you were smacked on the legs one day when the teachur smacked me on the legs, I put my tun out to her, so she done it agian, my last teachur at my old school was Mr Woods and he was a nice teachur and we got on with him we done owe work good and so for that he used to take us on visits

We did Maths and English perhaps for two hours at a time, with the occasional woodwork and extra games period. I remember also that I used to be quite good at everything including cricket and football which I certainly am not now. I was given the appointment of Head Boy which seemed a bit ridiculous because it made no difference to myself and anybody else in the school as I wasn’t really made a prefect and had no authority.

Since a corrector is meant to cope with free writing rather than with the results of spelling tests, it can restrict its search to words that are just one or two letters longer or shorter than the misspelling, perhaps taking in a slightly larger range on the long side for longer misspellings, to include long words that may have been shortened.
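A length restriction of this kind could be expressed as a window around the length of the misspelling. The text gives no exact limits, so the parameter values below (two letters either way, with one extra letter on the long side for misspellings of eight letters or more) are assumptions chosen only to illustrate the idea.

```python
def length_window(misspelling, base=2, long_from=8, extra_on_long_side=1):
    """Return (shortest, longest) candidate lengths worth considering.

    All three parameters are hypothetical values, not the prototype's.
    """
    n = len(misspelling)
    shortest = max(1, n - base)
    longest = n + base
    if n >= long_from:
        longest += extra_on_long_side   # allow for long words that were shortened
    return shortest, longest


def within_window(misspelling, candidate):
    """True if the candidate's length falls inside the misspelling's window."""
    shortest, longest = length_window(misspelling)
    return shortest <= len(candidate) <= longest


print(length_window("sissors"))                     # (5, 9)
print(within_window("seticates", "certificates"))   # True
print(within_window("sissors", "certificates"))     # False
```

With these values seticates still reaches certificates, which is three letters longer, while short misspellings are compared only with words of similar length.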

The result of restricting the search in all these ways is that the corrector performs the string-matching on a relatively small number of candidates. The actual number varies from one misspelling to another – for example, it takes 155 candidates for sissors and 90 for mutliply – but the selection procedure is pretty reliable. If the required word is in the dictionary at all, it is almost always in the list that gets considered by the string-matching.
