Word Recognition Experiments - Word-Level Experiments

8.4 Experiments

8.4.1 Word-Level Experiments

8.4.1.1. Word Recognition Experiments

In order to evaluate the performance of the GME distance approach to the recognition of handwritten words, we carried out several sets of experiments on the IAM database. In order to keep the evaluation process as unbiased as possible, in the following experiments, we did not use the part of the IAM database that was already used for the generation of the cursive characters database and the training of the underlying cursive character models as described in Chapter 6.

Table 8.1 Average recognition rate of the GME approach over several test subsets of IAM.

Model Cost Functions Test Lexicon Size

10 20 50 100

5-state

Default 91.0 83.6 80.4 74.3

Trained 92.1 84.7 81.3 75.7

Adapted 92.4 85.1 81.7 76.2

159-state

Default 93.0 86.8 82.1 77.0

Trained 93.4 87.2 82.7 77.6

Adapted 94.1 87.7 82.8 77.8

315-state

Default 93.0 87.5 82.6 77.4

Trained 93.5 88.1 83.1 78.0

Adapted 94.3 88.7 83.5 78.5

The GME approach to word recognition can be used with or without the training of the associated cost functions as we discussed in Chapter 7. In fact, using the default cost functions we can accomplish word-level recognition without the need for word-level training data. However, given that word-level training data are available, we can optimize the cost functions for a specific distribution of data using the hidden Markov model framework. The model can be trained using a wide range of handwriting samples from different writers or it can be further adapted for a specific person by using their

140

handwriting samples. In the following, the former model will be referred to as “trained”, and the latter model will be referred to as “trained/adapted”, or simple “adapted”.

Moreover, the GME model may or may not use the lexicon knowledge. The three variations that we discussed in the previous chapter are the basic 5-state model (without knowledge of the lexicon), the 159-state model (with unigram and bigram knowledge of the lexicon) and the 315-state model (with unigram, bigram and trigram knowledge of the lexicon).

Table 8.1 shows the average recognition rate of the GME approach over several test subsets of the IAM database with different lexicon sizes. Again, for the trained and adapted models, the training and test subsets are completely disjoint.

The reason we limited the lexicon size to 100 words is that in a typical word spotting application, the lexicon of interest contains only a few tens of keyword. In the French keyword spotting application that we mentioned in the previous section, the keyword lexicon contains 48 keywords. The number of effective classes is even smaller; because among these keywords, we have conjugations and plural forms; for example “contrat”

(contract) and “contrats” (contracts), or “résilier” (cancel) and “résiliee” (cancelled) which essentially indicate the same class/action. This means that if the recognition algorithm mistakes “contrat” for “contrats” or vice versa, it is not counted as an error from the word spotting view point, because they belong to the same keyword family.

However, in the word recognition experiments that we summarize in this section, we simply treat all words as separate classes; so for example, in the test sets of the IAM database, we have “American” and “Americans” as two different classes.

141

The recognition results summarized in Table 8.1 indicate that adding the trigram knowledge to the model slightly improves the recognition results over the unigram/bigram model, which is in turn better than the basic model without any knowledge about the lexicon. The performance difference between the trigram-based models and the unigram/bigram-based models is 0.5% on average. However, the performance difference between the unigram/bigram-based models and the basic models without the lexicon knowledge is around 2.0% on average. This means that the major recognition improvement is obtained by virtue of the unigram/bigram knowledge. In summary, the most elaborate model (with adaptation given that a writer’s handwriting is available) improves the recognition rate over the model basic model (without training/adaptation) from 3.1% to 5.1%. It is interesting to note that the GME approach can achieve an acceptable word recognition performance without any training at the word-level.

Table 8.2 Average recognition rate of the GME approach over several test subsets of IAM without perturbation-based character recognition.

Model Cost Functions Test Lexicon Size

10 20 50 100

5-state

Default 90.1 81.1 76.3 69.1

Trained 90.9 82.0 77.2 70.2

Adapted 91.5 82.3 77.5 70.8

159-state

Default 92.1 84.1 77.9 71.5

Trained 92.4 84.6 78.6 72.1

Adapted 93.0 85.2 78.7 72.3

315-state

Default 92.3 85.1 78.5 72.1

Trained 92.6 85.7 79.3 72.9

Adapted 92.2 86.2 79.6 73.4

142

One advantage of the GME model is that it can be combined with any character recognition algorithm. The results that we summarized in Table 8.1 were obtained using the perturbation-based cursive character recognition algorithm that we described in Chapter 6. In order to show the effectiveness of the perturbation-based approach in the context of word recognition, we repeated the same experiments as in Table 8.1 but without character perturbation. The results are summarized in Table 8.2. As can be seen, the perturbation-based character recognition improves the word recognition performance in all cases, by 1% (for the smallest lexicon) to 5% (for the largest lexicon). The average recognition improvement due to the perturbation is 3.3%.

Table 8.3 Average recognition rate of implemented handwritten word recognition algorithms several test subsets of IAM.

Model Training Test Lexicon Size

10 20 50 100

143

In order to put our results into perspective, we experimented with several popular HMM-based approaches to word recognition. A direct comparison of the results we presented in Table 8.1 and published works is not quite meaningful; because first of all, not all public databases define a standard training and test sets (the IAM database is an example);

second, different word recognition algorithms use different pre-processing/post-processing techniques that may considerably affect the performance, even with the same training and testing sets of the same database; and third, the evaluation criteria may not be the same. Therefore, in order to see how the proposed 2D GME approach compares with traditional HMM-based approaches, we carried out our experiments on the same training and test sets based on our implementation of these algorithms with the same pre-processing steps for all algorithms.

The four common pre-processing steps include: 1) image height normalization; 2) stroke-width normalization based on skeletonization [ZS84] followed by dilation; 3 and 4) skew correction and slant correction based on horizontal and vertical projection profiles.

Currently, we do not apply any post-processing techniques such as lexicon reduction or verification based on language models. The handwriting recognition algorithms that we experimented with include: 1) Left-to-Right Discrete HMM (LR-D-HMM) [CLK95]; 2) Left-to-Right Continuous HMM (LR-C-HMM) [RSP09b]; 3) Multi-Stream HMM (MS-HMM) [KPBH10]; and Pseudo 2D-HMM (P2D-(MS-HMM) [KA94]. The experimental results are shown in Table 8.3., where we copied the GME results from Table 8.1 for the sake of comparison. As can be seen, the GME approach based on a 159-state or 315-state EHMM and ensemble of neural networks significantly outperforms the other HMM-based approaches on small lexica. However, as the size of the lexicon grows, the GME

144

approach, which is based on an exact 2D model, and the P2D-HMM show almost the same performance. The interesting observation is that all the three HMM models that use some kind of 2D information, namely MS, P2D, and GME, outperform LR HMMs in all cases, particularly when the lexicon grows.

Although, the focus of this research is not to recognize large size lexica, it is worth mentioning that our manual inspection of misrecognized words shows that the major source of the error, is due to the difficulties in the recognition of curve characters. As we mentioned in Chapter 6, the amount of variations in cursive forms of characters is so high that in order for the classifier to resolve the ambiguity, it will have assign an unknown input shape to all possible character classes. The right sequence of characters is then chosen by using a larger context; that is, in the simplest form, a lexicon of words which acts as a constraint over the sequence of character hypotheses. However, as the lexicon size increases, the chance of a sequence of shapes being matched against more than one valid sequence of characters (i.e. lexicon entry) also increases, which in general, results in lower top-1 recognition rates for words.

It should be emphasized that the word recognition results that we presented above are based on an exact (all or nothing) evaluation criterion; meaning that the output of the word recognition algorithm is considered correct if and only if the top-1 word recognition hypothesis and the ground truth transcription of the input word image match character by character for all character positions. In [GLF+09] the authors have presented a word recognition system based on a new recurrent neural network architecture that achieves a reported word recognition rate of 73.3% to 74.9% over very large lexica of words (~5000

145

to 20000). However, these results are based on an approximate evaluation criterion which the authors refer to as word accuracy:

riptions) set transc

test of length total

deletions ons

substituti insertions

1 ( 100 accuracy

word = × − + + (8.1)

Therefore, for example if the word “Americans” is recognized as “American”, based on the approximate criterion that is used in [GLF+09], the calculated performance is 100×(1-1/9) ≈ 88.9%. However, based on the exact criterion that we use, the calculated performance is 0% in this example.

In document Arbitrary Keyword Spotting in Handwritten Documents (Page 151-157)