
Recognition methods

CHAPTER 7. RECOGNITION METHODS

the extent of this context is known, but longer-term dependencies cannot be learnt. Because of the rigid hierarchy of the input and hidden units, dependencies of variable length are hard to learn. Each perceptron can only associate features which are a fixed distance apart. The recurrent network, on the other hand, stores all context in the hidden units which are available at every time step. If the context is of variable length, the feedback units will vary slowly and the correlation between two features can be detected at an arbitrary delay.

It is believed to be for this reason that TDNNs did not perform well on this handwriting recognition task. They were also found to be unwieldy since the architecture of a TDNN is specified by a large number of parameters. The number of hidden layers must be specified, as well as the number of units in each and the size of each receptive field. A further parameter that can be controlled is the number of frames shifted between successive operations of each of the sets of perceptrons. Finding a good set of values for all these parameters requires a long search, whereas the recurrent network has a single such parameter: the number of feedback units (section 7.1.3). Because of this poor initial performance, TDNNs were not investigated further, and no results are presented for them here.

7.3 Discrete probability estimation

This section describes the third technique investigated for probability estimation. This involves computing a number of integer-valued indices from each frame and using these to look up probability values in pre-computed tables. When combined with the hidden Markov models (HMMs) described in the next chapter, the system is a conventional discrete HMM since this is the usual method of calculating probabilities for a discrete HMM. By contrast, the recurrent network and HMM together would be termed a hybrid system.

The probabilities that must be estimated are the likelihoods $P(x_t \mid i)$, the probability of a frame of data being generated, given the identity of the letter. Since the data are represented as about 80 features, each coded as a byte (256 possible values), to store the probability of each possible co-occurrence would require $256^{80} \times 26$ probabilities to be stored and estimated. This is clearly computationally impractical and would require infeasible quantities of data to give estimates of the probabilities. Parametric distributions could be used, which calculate these probabilities as functions of a smaller number of parameters, but the numbers are still impractical, and the re-estimation more difficult. Two methods are used to simplify the estimation.
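To give a sense of scale for the full joint table, the count can be evaluated exactly. This is only an illustrative calculation: it takes the figure of "about 80" features as exactly 80, and the constant names are chosen here for the example.

```python
# Size of the full joint table: one probability for every combination of
# byte-valued features, for each letter.
NUM_FEATURES = 80          # assumption: "about 80" taken as exactly 80
VALUES_PER_FEATURE = 256   # each feature is coded as one byte
NUM_LETTERS = 26

full_table_entries = VALUES_PER_FEATURE ** NUM_FEATURES * NUM_LETTERS

# Python integers have arbitrary precision, so the count is exact.
num_digits = len(str(full_table_entries))
print(f"The full table would need a {num_digits}-digit number of entries.")
```

The number of entries has 195 decimal digits, which is why the table can be neither stored nor estimated directly.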


7.3.1 A simple system

First, since the units mostly record simply the presence or absence of a feature, even for the skeleton where the coarse coding does give values between 0 and 1, the most important information is whether a line segment is present or not. The inputs are thus re-quantized to be binary-valued (or some other number of values much less than 256). Secondly, the features are assumed to be independent. Thus the probability of the co-occurrence of all the features in a frame is simply the product of the probabilities of occurrence of the individual features.

$$P(x_t \mid i) \approx \prod_j P\big((x_t)_j \mid i\big) \qquad (7.4)$$

Now only $80 \times 2 \times 26$ probabilities need to be stored or, since the pairs must sum to one, only $80 \times 26$.
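The simple system can be sketched as follows, as a minimal illustration and not the implementation used in this work. The function names, the binarization threshold of 128, the add-one smoothing and the use of frames with known letter labels are all assumptions made for the example. Logarithms are used so that the product over features becomes a sum.

```python
import numpy as np

def binarize(frame, threshold=128):
    """Re-quantize byte-valued features (0..255) to binary presence flags.
    The threshold value is an assumption, not taken from the text."""
    return (np.asarray(frame) >= threshold).astype(np.uint8)

def estimate_p_on(binary_frames, letters, num_letters):
    """Estimate P(feature j present | letter i) from labelled binary frames.
    Returns an array of shape (num_letters, num_features)."""
    frames = np.asarray(binary_frames, dtype=float)
    letters = np.asarray(letters)
    p_on = np.empty((num_letters, frames.shape[1]))
    for i in range(num_letters):
        rows = frames[letters == i]
        # Add-one smoothing (an assumption) keeps each value inside (0, 1).
        p_on[i] = (rows.sum(axis=0) + 1.0) / (len(rows) + 2.0)
    return p_on

def log_likelihood(binary_frame, p_on):
    """log P(x_t | i) for every letter i under the independence assumption."""
    x = np.asarray(binary_frame, dtype=float)
    # Present features contribute log p, absent features log(1 - p).
    return np.log(p_on) @ x + np.log1p(-p_on) @ (1.0 - x)

# Toy usage: two features, two letters.
toy_frames = [[1, 0], [1, 0], [0, 1]]
toy_letters = [0, 0, 1]
toy_p_on = estimate_p_on(toy_frames, toy_letters, num_letters=2)
toy_scores = log_likelihood([1, 0], toy_p_on)
```

Only one probability per feature and letter is stored, because the probability of absence is one minus the probability of presence.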

The assumption of independence in the occurrence of features in the input is clearly inaccurate since, for example, the occurrence of a vertical stroke in one box is highly correlated with the occurrence of a vertical stroke in the box below. In practice, the assumption is far too strong, and the performance of the HMM system is much worse than that of the recurrent network (an error rate greater than 50%). The following section describes a system which obviates the independence assumption, and gives better recognition results.

7.3.2 Vector quantization

Vector quantization (VQ) is a method of characterizing each frame by a single number, or code c(xt). The quantization process is designed so that similar frames are all coded as the same number. Then, instead of estimating the probability of all the features in a frame given the character class, it is only the probability of the code given the character class that must be estimated:

$$P(x_t \mid i) \approx P(c(x_t) \mid i).$$

In vector quantization, each frame is considered as a vector in a metric space with as many dimensions as there are elements in the frame. Quantization determines a codebook of code vectors $c_i$ in this space. Each frame $x_t$ is then coded according to the nearest code vector: $c(x_t) = \arg\min_i \lVert c_i - x_t \rVert^2$.
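The nearest-code-vector rule, and the table of code probabilities it feeds, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system described here: the function names are hypothetical, the codebook is taken as given, and the table is filled by counting over frames with known letter labels using add-one smoothing, whereas a full system would estimate these probabilities during HMM training.

```python
import numpy as np

def encode(frames, codebook):
    """Return, for each frame, the index of the nearest code vector
    under squared Euclidean distance."""
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    codebook = np.asarray(codebook, dtype=float)
    # Pairwise differences, shape (num_frames, num_codes, num_dims).
    diffs = frames[:, None, :] - codebook[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dist, axis=1)

def estimate_code_probs(codes, letters, num_codes, num_letters):
    """Estimate P(code | letter) by counting labelled codes.
    Counts start at one (add-one smoothing, an assumption) so that
    no code has zero probability."""
    counts = np.ones((num_letters, num_codes))
    np.add.at(counts, (np.asarray(letters), np.asarray(codes)), 1.0)
    return counts / counts.sum(axis=1, keepdims=True)

# Toy usage: a two-entry codebook in a two-dimensional space.
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_codes = encode([[1.0, 1.0], [9.0, 8.0], [0.0, 2.0]], toy_codebook)
toy_probs = estimate_code_probs(toy_codes, [0, 1, 0], num_codes=2, num_letters=2)
```

With a codebook of K entries, the table holds K probabilities per letter, and no independence assumption between the features of a frame is required.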

In the subsequent training, it is these codes that are the features, and it is the probability of a code being part of a given letter that must be estimated. Before being able to estimate the probabilities, the code vectors must be determined. To be representative, they must be well distributed in the space of vectors actually produced by the preprocessing system, and each should represent a typical group of vectors which can be considered to be similar. The groups of equivalent vectors are assumed to be those close to one another in the metric space, and the code vectors are determined by a clustering algorithm which finds these clusters in the training vectors. Each code vector is then the centroid of a cluster of training vectors. A number of algorithms exist for carrying out this clustering, and a number are reviewed by Gray


In document TrinityHall, Cambridge, England. (Page 74-76)