The prototype implementation - English spelling and the computer

Chapters Eight to Ten described the general ideas incorporated into the corrector. This appendix gives some details of their specific implementation in the prototype. Here again is the simplified overview of the program presented in Chapter Eight:

Taking the text sentence by sentence:

    Split the input into words and store each word in memory.
    Look up each word in the dictionary, and mark it if not found.
    Check each pair of words for anomalies of syntax.
    Display sentence, possibly with queries.
    If any words have been queried, then
        For each queried word, do:
            Generate list of suggestions.
            Offer best few to user.
            Get user’s decision.
        Insert user’s choice in sentence.

The procedure described as ‘Generate list of suggestions’ has three main parts:

1. Retrieve misspelling’s own S-code group and string-match all the words from length x to length y against the misspelling, putting the best into an ordered shortlist.

2. Retrieve related S-code groups; for each of these, take the words from length x to y and do a quick test on their letter-strings, passing on the successful few for string-matching and possible addition to the shortlist.

THE PROTOTYPE IMPLEMENTATION

The prototype was written in Cobol. This is a language more associated with file processing in business and administration than with natural-language applications, but it was obvious from the beginning that the corrector would require random file access, and Cobol provides this as a feature of the standard language. It is provided in many other languages – Pascal, for instance – only as a manufacturer’s extension to the standard, and I preferred to work in a standard language, rather than in some dialect specific to one manufacturer. The 1985 Cobol standard meets many of the criticisms levelled at earlier versions of the language, and it even includes some simple features for string-manipulation. I also considered Icon (Griswold and Griswold 1983), a language with powerful string-handling facilities, but decided against it, partly because of its weakness in file-access, but also because, paradoxically, I did not have much use for its string-handling features. The string-matching that produces the closeness-score was something I had to program myself in detail; the program’s other string-handling is fairly straightforward.

An early version of the prototype, built along the lines described in Chapters Eight to Ten, made a reasonable job of detecting and correcting misspellings, but it was rather slow. It was never intended that the prototype should perform as fast as a commercial piece of software. Nonetheless, I did not want it to be purely an academic exercise; I wanted to show that it had at least the potential to be turned into a real-life spellchecker. The rest of this appendix describes some modifications that were introduced to speed it up.

An elementary observation to be made about ordinary prose is that a small number of words occur a great many times. The prototype holds the 1024 words that occur most frequently in running text in a Cobol table in main store. When looking words up in the dictionary, simply to establish whether they are there or not, the spellchecker first consults this table, which is searched by binary search (using the Cobol SEARCH verb), and only proceeds to consult the dictionary file on disc if it fails to find the word in the table. (Retrieving a record from secondary storage takes a lot longer, of course, than doing a binary search on a table in main store.) Taking the first sentence of this paragraph as an example, the spellchecker would need to consult the disc file only for

elementary, observation, prose and occur.
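The two-stage look-up can be sketched as follows. This is a minimal illustration in Python rather than the prototype's Cobol; the table contents are a hypothetical stand-in for the 1024-word table (only enough words to cover the example sentence), and `disc_lookup` is an assumed callable standing for the read from the dictionary file.

```python
from bisect import bisect_left

# Hypothetical stand-in for the 1024-word table; the real contents are not listed here.
FREQUENT_WORDS = sorted([
    "a", "about", "an", "be", "great", "is", "made", "many",
    "number", "of", "ordinary", "small", "that", "times", "to", "words",
])

def in_frequent_table(word, table=FREQUENT_WORDS):
    """Binary search of the sorted in-memory table."""
    i = bisect_left(table, word)
    return i < len(table) and table[i] == word

def is_known(word, disc_lookup):
    """Consult the table first; fall back to the slower disc look-up only on a miss."""
    return in_frequent_table(word) or disc_lookup(word)

def words_needing_disc(sentence):
    """Words of a sentence that are not in the table, in order of appearance."""
    return [w for w in sentence.lower().split() if not in_frequent_table(w)]
```

Applied to the example sentence (without its final full stop), `words_needing_disc` picks out the same four words as those listed above.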

In the first version of the dictionary file, each word occupied one record; if the spellchecker needed to look at a hundred words, it retrieved a hundred records. Using the simple timing facilities offered by VAX/VMS (the system under which the corrector was developed), I discovered that the program spent over half of its time – and this was CPU time, not elapsed time – carrying out these READ operations. Some simple experiments showed that it was far quicker to pull in a hundred words as a single large record than as a hundred small ones, so I reorganized the dictionary file to take advantage of this.

When the corrector searches the dictionary for promising candidates, it considers sets of words that are in the same Soundex-type group, so the obvious thing was to turn each of these groups into a single, variable-length record. This provided the opportunity to give each of these large records some internal structure, as follows:

Soundex-type code (the record key)
Fifteen fixed-length pointer-items
Variable number (50 to 300+) of letter-string-items
Variable number (50 to 300+) of spelling-fields

The first of the pointer-items refers to words of length one, the second to words of length two, and so on; the fifteenth refers to words of length fifteen or more. Each pointer-item contains this information:

Number of words of length n
Position in letter-string section of first word of length n
Position in spelling-field section of first word of length n

The positions are held as byte offsets from the beginning of the record. The letter-string items are variable length and are separated by low-value bytes; likewise the spelling-fields.
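A rough Python sketch of the pointer-item idea, for one section only. It makes several assumptions that go beyond the description above: offsets are counted from the start of the section rather than from the start of the record, each item is terminated (not merely separated) by a low-value byte, the words within a length are kept in plain alphabetical order, and the names `slot`, `build_section` and `words_of_length` are invented for the illustration. The sample words are arbitrary and need not share a Soundex-type code.

```python
LOW_VALUE = b"\x00"
SLOTS = 15  # one pointer-item per length; the last covers fifteen or more

def slot(length):
    """0-based pointer-item index for a word length."""
    return min(length, SLOTS) - 1

def build_section(words):
    """Pack words, grouped by length, into one byte string.

    Returns the packed section and a list of (count, first_offset) pairs,
    one per pointer-item; first_offset is None for an empty slot.
    """
    ordered = sorted(words, key=lambda w: (slot(len(w)), w))
    counts = [0] * SLOTS
    first = [None] * SLOTS
    section = bytearray()
    for w in ordered:
        s = slot(len(w))
        if first[s] is None:
            first[s] = len(section)      # byte offset of the first word of this length
        counts[s] += 1
        section += w.encode("ascii") + LOW_VALUE
    return bytes(section), list(zip(counts, first))

def words_of_length(section, pointers, length):
    """Use the pointer-item to jump straight to the words of one length."""
    count, start = pointers[slot(length)]
    if count == 0:
        return []
    items = section[start:].split(LOW_VALUE)[:count]
    return [item.decode("ascii") for item in items]
```

The point of the pointer-items is visible in `words_of_length`: the words from length x to length y can be reached without scanning the shorter words that precede them.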

I explained in Chapter Nine how the spellchecker retrieves groups that have codes related to that of the misspelling (so as to succeed with things like atlogether and unerstand) and how it discards most of the words in them with a quick test based only on each word’s letter-string – the letters of the word in alphabetical order, without repeats. For the great majority of words considered as candidates for a given misspelling, it would be a waste of time to unstring the spelling, word-tags and other fields only to discard them after a quick look at the letter-string, so the letter-strings are held in a separate section of the record. Each letter-string item contains the required information for one word in the following form:

Letter-string

Byte-offset of this word’s spelling-field
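The letter-string itself is simple to compute. A small Python sketch, following the definition given above (the function name is an assumption, and the quick test that uses the letter-strings is described in Chapter Nine and is not reproduced here):

```python
def letter_string(word):
    """The distinct letters of a word in alphabetical order, without repeats."""
    return "".join(sorted(set(c for c in word.lower() if c.isalpha())))
```

A transposition leaves the letter-string unchanged, so altogether and atlogether share the letter-string aeghlort.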

The final section of the record is a long string containing all the remaining information about the words, with each field divided from the next by a separator. The information for one word is as follows:

A single-digit value called ‘start-point’ (explained below)
Spelling in coded form (for string-matching)
Letters that might be inserted in this word
Spelling in ordinary form
Word-tag(s)
Number of syllables
Homophone code

Having the letter-string information in a separate section enables the corrector to run through the letter-strings (using the Cobol UNSTRING verb) and to perform the quick test on one word after another as fast as possible. It picks out the other information about a word only for the few words that pass the quick test. By contrast, when considering words in the same group as the misspelling, it performs the string-matching on all the words from length x to length y, so it simply moves along a section of the long string of spelling-fields, unstringing one after another.

These arrangements have the unfortunate effect of making it more difficult to simply look up a word. The spellchecker computes the word’s Soundex-type code and retrieves the appropriate record. Then it computes the word’s letter-string, takes that part of the letter-string section that contains words of the right length and searches it for a match. If it finds one, it unstrings the corresponding spelling and compares it with the word being looked up. If they don’t match, it carries on. The letter-string items are arranged in alphabetical order so that the corrector can abandon the search if it finds it has gone past the place where the word’s letter-string would have been. Despite these convolutions, the dictionary look-up still takes very little time compared with the retrieval and ordering of candidates for a misspelling.
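The look-up within one length-slice of a record might be sketched like this in Python. It is a simplified model, not the prototype's code: each item here carries the spelling directly, whereas the real item carries a byte-offset to the spelling-field, and the names `make_slice` and `look_up` are invented for the illustration.

```python
def letter_string(word):
    """Distinct letters of the word in alphabetical order."""
    return "".join(sorted(set(word)))

def make_slice(words):
    """Letter-string items for words of one length, ordered by letter-string."""
    return sorted((letter_string(w), w) for w in words)

def look_up(word, items):
    """Search one length-slice of a record for an exact spelling."""
    target = letter_string(word)
    for ls, spelling in items:
        if ls > target:
            return False          # gone past where the letter-string would have been
        if ls == target and spelling == word:
            return True           # letter-string and spelling both match
        # same letter-string but a different spelling (or an earlier one): carry on
    return False
```

Words such as united and untied share a letter-string, which is why a letter-string match alone is not enough and the spelling has to be compared as well.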

This reorganization of the file reduced dramatically the time required for the mere retrieval of records from the dictionary; when run over a file of test data, the program took only about forty per cent of the time that it had taken with the previous file organization. The part that now took up most of the corrector’s time was the string-matching that produced the closeness scores.

To recap briefly on Chapter Eight, the string-matching is regarded as the traversing of a directed network. Computationally, each node of the network is represented as an element in a two-dimensional array, and the algorithm takes account of all paths across the network by computing values for the elements of the array in an ordered sequence. The computation of the value for a single element in the array requires the calculation of four values (corresponding to a single-letter omission, insertion, substitution or transposition), and each of these entails taking a number from a table and adding it to one of the array values already calculated; the lowest of the four is retained as the value for the element.
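As an illustration of the array computation, here is a minimal Python sketch. The four costs are fixed constants chosen for the example; in the prototype they are taken from tables and vary with the letters involved, and the actual mincost function is the one given in Chapter Eight. Rows are taken to correspond to the letters of the candidate and columns to the letters of the misspelling.

```python
# Hypothetical fixed costs; the prototype looks its costs up in tables.
OMIT, INSERT, SUBST, TRANSPOSE = 4, 4, 5, 5

def mincost(candidate, misspelling):
    """Lowest total cost of turning the candidate into the misspelling."""
    rows, cols = len(candidate) + 1, len(misspelling) + 1
    a = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        a[0][j] = a[0][j - 1] + INSERT
    for i in range(1, rows):
        a[i][0] = a[i - 1][0] + OMIT
        for j in range(1, cols):
            same = candidate[i - 1] == misspelling[j - 1]
            best = min(
                a[i - 1][j] + OMIT,                         # candidate letter omitted
                a[i][j - 1] + INSERT,                       # extra letter inserted
                a[i - 1][j - 1] + (0 if same else SUBST),   # match or substitution
            )
            if (i > 1 and j > 1
                    and candidate[i - 1] == misspelling[j - 2]
                    and candidate[i - 2] == misspelling[j - 1]):
                best = min(best, a[i - 2][j - 2] + TRANSPOSE)  # adjacent letters swapped
            a[i][j] = best
    return a[-1][-1]
```

With these costs, unerstand scores one omission against understand, and atlogether scores one transposition against altogether.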

The reason why this takes time is not that the calculation of an array value is slow – Cobol indexed tables are used and all variables involved are specified as COMPUTATIONAL – but that there are so many array values to be calculated. The size of the array depends on the length of the misspelling and the length of the candidate being compared with it; a misspelling and a candidate that were both of length eight would require an array of eighty-one elements (nine by nine). The matching of this misspelling against a hundred candidates – some shorter, some longer, some the same length – would require the computation of about eight thousand array values. Attempts to reduce the time spent on string-matching therefore focus on ways to cut down the number of array values to be computed.

The mincost function presented in Chapter Eight computes the array values row by row. Take the first three columns of two rows to be represented by the letters u to z, as follows:

u v w
x y z

Ignoring transpositions for the moment, the value z is the lowest of the following:

w + an omission cost
v + a substitution cost
y + an insertion cost

Costs are either zero (for substituting, say, a for a), or positive. So:

z >= min(v, w, y)

The value y is in the same position with respect to u, v and x:

y >= min(u, v, x)

And x is u plus an omission cost (positive), so:

x > u

It follows that:

y >= min(u, v)
z >= min(v, w, min(u, v))
z >= min(u, v, w)

And, in general, the lowest value on row i cannot be lower than the lowest value on row i−1. In other words, a candidate’s score cannot get better with each row calculated; it can only stay the same or get worse.

Transposition costs could spoil this picture. Transposition provides a further way of calculating the value z as follows:

r s t
u v w
x y z

r + a transposition cost

Suppose that r had the value 3, that 7 was the lowest value in the u-v-w row and that the transposition cost was 3. Then z could have the value 6, lower than the previous row’s lowest.

This can be prevented by setting the transposition cost to be at least as large as the largest omission cost. Suppose that both were set at 5. This clearly avoids the problem in the example since the transposition route to z now costs 8. In fact it avoids it altogether since it is always possible to take the omission route from one row to the next, which will never cost more than 5, so the transposition route from row i−2 to i cannot now cost less than the lowest-cost route from row i−2 to i−1.

The importance of all this is that it enables the corrector often to abandon the calculation of array values after doing only the first few rows of an array – a simple application of the ‘branch and bound’ technique. When considering a hundred candidates, it begins by putting the first fifteen into its shortlist. Thereafter, it puts a candidate into the list only if the closeness-score is better (lower) than that of the worst candidate in the list so far. Suppose it were considering the seventieth candidate, and the worst candidate in the shortlist had a score of twelve. Suppose also that the lowest value in row five for this (the seventieth) candidate was fourteen. Then there would be no point in continuing the array calculations to the end since the final score could not be less than fourteen. This candidate is obviously not going to make the shortlist, so it can be rejected without more ado.
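The early abandonment can be sketched in Python as follows. The costs are hypothetical constants (all set to 5, so that the transposition cost is no lower than the omission cost, as required above), and the function names and the shortlist representation are assumptions made for the illustration.

```python
def bounded_score(candidate, misspelling, bound=None,
                  omit=5, insert=5, subst=5, transpose=5):
    """Closeness-score, or None once it is clear the score cannot beat `bound`."""
    n = len(misspelling)
    prev2 = None
    prev = [j * insert for j in range(n + 1)]
    for i in range(1, len(candidate) + 1):
        cur = [prev[0] + omit]
        for j in range(1, n + 1):
            same = candidate[i - 1] == misspelling[j - 1]
            best = min(prev[j] + omit,
                       cur[j - 1] + insert,
                       prev[j - 1] + (0 if same else subst))
            if (prev2 is not None and j > 1
                    and candidate[i - 1] == misspelling[j - 2]
                    and candidate[i - 2] == misspelling[j - 1]):
                best = min(best, prev2[j - 2] + transpose)
            cur.append(best)
        # Row minima never decrease, so the final score cannot be below min(cur).
        if bound is not None and min(cur) >= bound:
            return None
        prev2, prev = prev, cur
    score = prev[-1]
    return None if bound is not None and score >= bound else score

def shortlist(misspelling, candidates, size=15):
    """Keep the `size` best (score, word) pairs, lowest score first."""
    best = []
    for word in candidates:
        # No bound until the list is full; afterwards, the worst score in the list.
        bound = best[-1][0] if len(best) >= size else None
        score = bounded_score(word, misspelling, bound)
        if score is not None:
            best.append((score, word))
            best.sort()
            del best[size:]
    return best
```

The saving comes from the `return None` inside the loop: for a poor candidate it is usually reached after only a few rows.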

A variation on this idea is presented in a recent paper (Du and Chang 1992). Instead of calculating the array values row by row, the authors suggest calculating them ‘layer by layer’. What they mean by this is illustrated below.

1 2 3 4
2 2 3 4
3 3 3 4
4 4 4 4

Beginning at the top left, you calculate value 1, then the values marked 2, then those marked 3 and so on. What I have just described for rows also applies to these layers. A candidate’s score in layer 4 cannot be lower than its lowest score in layer 3.
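The layer numbering in the figure can be generated with a few lines of Python. This sketch only reproduces the ordering shown above; it is not taken from the paper, and the function names are invented.

```python
def layer_numbers(rows, cols):
    """Layer of each array element, numbered from 1 as in the figure."""
    return [[max(i, j) + 1 for j in range(cols)] for i in range(rows)]

def cells_in_layer_order(rows, cols):
    """Array positions in an order that finishes each layer before the next.

    Within a layer the elements are visited so that every value an element
    depends on (above, left, diagonal) has already been computed.
    """
    cells = ((i, j) for i in range(rows) for j in range(cols))
    return sorted(cells, key=lambda c: (max(c), c))
```

An element at row i, column j (counting from zero) belongs to layer max(i, j) + 1, so a layer consists of part of one row and part of one column meeting at the diagonal.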

The next time-saving modification builds on the observation that words in the dictionary often come in sequences that begin with the same first few letters. This does not happen as much as it would in a dictionary that was completely in alphabetical order, but, even in the prototype’s dictionary, there are short runs of words in the same Soundex group and of the same length that begin with the same letters. Given the misspelling undeterd, for example, all the words in group U533 would be considered from length 6 to length 10. Suppose that undertake was followed by undertook, both being matched, of course, against the same misspelling. There would be no need to compute the first few rows for undertook (the ones corresponding to undert) since they will be exactly the same as they were for undertake. To facilitate this, each dictionary entry carries a number – the ‘start-point’ mentioned above – telling the corrector how many rows of the array it can skip, assuming that the previous word in the dictionary was the last word it dealt with.
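One plausible way of deriving the start-points when the dictionary is built is sketched below in Python. The rule used here (the length of the prefix shared with the preceding word, capped at 9 because the value is a single digit) is an assumption; the exact rule used for the prototype's dictionary is not spelt out above.

```python
def common_prefix_len(a, b):
    """Number of leading letters that two words share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def start_points(words):
    """Start-point for each word, given the order in which the words are stored."""
    points = []
    previous = ""
    for word in words:
        points.append(min(common_prefix_len(previous, word), 9))
        previous = word
    return points
```

For undertake followed by undertook the start-point of the second word is 6, the rows for undert being reusable.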

The corrector is here saving a little time by making use of the state of the array left over from the previous call to the string-matching function. This is no problem in Cobol since local variables retain their value from one call to the next.

The size of the array increases as the square of the length of the misspelling, so it is particularly important to reduce, if possible, the number of array values to be calculated in the larger arrays. In addition to slicing off the bottom and the top of the array, as described above, the corrector also cuts off the corners.

In traversing the directed network from the start node to the diagonally-opposite end node, a route that takes in either of the other two corners is most unlikely to be a low-cost route since these routes correspond to lots of insertions followed by lots of omissions (or vice-versa). A low-cost traversal is almost certain to contain a number of zero-cost substitutions, which means it will stay fairly close to the diagonal. For larger arrays, therefore, the corrector computes values only for a diagonal band across the array, of about three elements to either side of the diagonal, and ignores the elements outside this (see Note 1).
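A Python sketch of one division-free way to decide whether an element lies inside the band. The treatment of non-square arrays (widening the band on one side by the difference in lengths, so that both the start node and the end node fall inside it) is an assumption made for this illustration, not necessarily what the prototype does.

```python
def in_band(i, j, rows, cols, width=3):
    """True if element (i, j) lies within `width` of the diagonal region."""
    skew = cols - rows   # zero for a square array
    return min(0, skew) - width <= j - i <= max(0, skew) + width

def band_size(rows, cols, width=3):
    """Number of array elements that would be computed."""
    return sum(in_band(i, j, rows, cols, width)
               for i in range(rows) for j in range(cols))
```

For the nine-by-nine array mentioned earlier, the band contains 51 of the 81 elements; the proportion saved grows as the array gets larger.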

Compared with the dramatic difference in speed produced by reorganizing the dictionary file, the effect of these program modifications is rather modest. Each one reduces the running time by about five per cent, though the precise effect depends on the actual misspellings being corrected. To give some idea of the speed of the prototype, I ran it over the following four sentences (with one misspelling per sentence):

1. It’s a granuler substance.
2. This is the sought of thing we want.
3. You can chooce whichever you want.
4. I can’t onderstan it.

Running on a VAX 11/750 under VMS, the CPU times were as follows (see Note 2):

                            1      2      3      4    Total
Total CPU seconds          1.7    2.6    3.6    5.5    13.4
of which:
a) string-matching         0.5    0.9    2.5    2.1     6.0
b) quick comparison
   of letter-strings       0.7    1.1    0.7    2.3     4.8
c) other                   0.5    0.6    0.4    1.1     2.6

Elapsed times, of course, are larger than CPU times, but they are not worth reporting since they depend almost entirely on the amount of work the machine happens to be doing for other users.

The bulk of the time goes into producing the lists of suggestions. The time-consuming parts, as the table shows, are the string-matching and the comparison of letter-strings, but some of the time marked ‘other’ also goes into producing the lists of suggestions – for example reading records from the dictionary file and reordering the shortlist by word-tag and word-frequency. The variation from one misspelling to another is explained by the number of candidate words in the misspelling’s Soundex-type group, the number of neighbouring groups that have to be searched and the number of words in these neighbouring groups. Granuler belongs to the smallest group (G540), whereas chooce and onderstan belong to two of the largest.

Notes

1. Herewith a cautionary tale for program optimizers. Since the array is not always square – you might be comparing a twelve-letter candidate with a nine-letter misspelling – my first attempt at this included some calculations to establish which elements fell on the diagonal. This version actually ran slower than the one it was supposed to be improving on. The problem was that the calculations included a division, which is a computationally lengthy operation. A less elegant but simpler version had the desired effect.

2. I include these times to give some idea of the relative lengths of the various operations. The College has now replaced the VAX 11/750 with a VAX 4100
