Application to handwritten digit recognition

Handwriting recognition has long been a challenging task in pattern recognition. Because handwriting depends strongly on the writer, and because nobody writes the same character in exactly the same way every time, it is not possible to build a general system that recognizes any character reliably in every application. Recognition systems are therefore typically tailored to specific applications to achieve better performance. In particular, handwritten digit recognition has been applied to reading amounts written on bank checks and zip codes on envelopes for postal services (the USPS database). In both cases, good results were obtained.

A handwritten digit recognition system can be divided into several stages: preprocessing (filtering, segmentation, normalization, thinning...), feature extraction (and selection), classification and verification. This section focuses on feature extraction and classification. The main purpose of this example is to show that learning the features in a black-box manner can be very effective, while transformation-invariance is necessary to achieve state-of-the-art performance.

The recognition system is applied to handwritten digit recognition and compared to other methods on the MNIST database [161], a well-known digit database often used as a benchmark. Since the aim here is to show how to use virtual samples in a practical application, the reader is referred to the original presentation of these results [153] for more details on the feature extractor and the experimental setup.

Feature extraction by learning

A feature extractor processes the raw data (the gray-scale image in this case) to generate a feature vector. This vector has a smaller dimension than the original data while retaining as much of the useful information in the data as possible. As an example, a feature extractor might build a feature vector whose component i is the number of crossing points in the ith line of the image. Such a feature extractor is constructed from prior knowledge of the application, because we know that the crossing points carry pertinent information for differentiating digits. Another approach is to consider the feature extractor as a black box trained to give relevant features as outputs, with no prior knowledge of the data. In the following, a neural network, originally developed as a classifier, is used as a feature extractor.
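The hand-crafted example above can be sketched in a few lines. This is only an illustration under stated assumptions: a "crossing point" is taken here to mean a background-to-ink transition along a row, and the function name `crossing_counts`, the threshold of 128 and the convention that high values are ink are all hypothetical choices, not taken from [153].

```python
def crossing_counts(image, threshold=128):
    """Return one feature per row: the number of background-to-ink transitions.

    Assumption: pixel values >= threshold are ink, lower values are background.
    """
    features = []
    for row in image:
        count = 0
        previous_is_ink = False
        for value in row:
            is_ink = value >= threshold
            if is_ink and not previous_is_ink:
                count += 1  # a new stroke starts on this row
            previous_is_ink = is_ink
        features.append(count)
    return features


# Toy 5 x 7 image of a ring-shaped "0": 255 = ink, 0 = background.
ring = [
    [0, 255, 255, 255, 255, 255, 0],
    [0, 255, 0, 0, 0, 255, 0],
    [0, 255, 0, 0, 0, 255, 0],
    [0, 255, 0, 0, 0, 255, 0],
    [0, 255, 255, 255, 255, 255, 0],
]
print(crossing_counts(ring))  # [1, 2, 2, 2, 1]
```

The 35 pixel values are reduced to a 5-dimensional vector, which illustrates the dimension reduction mentioned above.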

Amongst all the classifiers that have been applied to character recognition, neural networks became very popular in the 1980s, as demonstrated by the performance obtained by LeCun's LeNet family of neural networks [160]. These are convolutional neural networks, which are sensitive to the topological properties of the input (here, the image), whereas simple fully connected networks are not. As a convolutional neural network, LeNet-5 extracts the features in its first layers. For a 10-class problem, the last layer has 10 units, one for each class. The outputs of this layer can be considered as membership probabilities, and an input pattern is assigned to the class corresponding to the maximal probability. The weights (parameters of the network) are trained by minimizing the errors between the outputs of the last layer and the targets encoding the class labels.

Note that the results presented here were obtained prior to the thesis. However, submission and publication of [153] took place during the thesis.

[Figure 2.4 diagram, with the local receptive field indicated: input image 32 × 32 → C1: 6 × 28 × 28 → S2: 6 × 14 × 14 → C3: 16 × 10 × 10 → S4: 16 × 5 × 5 → C5: 120 → 10 fully connected outputs (training) or 120 features (testing).]

Figure 2.4: The architecture of the trainable feature extractor (TFE). The outputs of the layer C5 are either directed to the 10 outputs for the training phase or used as features for the testing phase.

Here the idea is to use an architecture where the last layers of the original LeNet-5, originally containing nonlinear units, are replaced by a set of 10 linear output units. These units thus perform a linear classification of the samples on the basis of the features given by the previous layer, i.e. the last convolutional layer C5, as shown in Fig. 2.4. Because the training procedure minimizes the output error, the 120 outputs of layer C5 are optimized so that the samples can be linearly separated by the output layer. Once the network has been trained, these features can be used as inputs for any other classifier. The resulting system is a trainable feature extractor (TFE) that can quickly be applied to a particular image recognition application without prior knowledge of the features.
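The layer sizes of Fig. 2.4 can be checked with a short arithmetic sketch. It assumes 5 × 5 convolution kernels without padding and 2 × 2 non-overlapping subsampling. These values are not stated in this section, but they are the usual LeNet-5 settings and they reproduce the sizes given in the figure. The names `conv_size`, `subsample_size` and `tfe_shapes` are illustrative only.

```python
def conv_size(size, kernel=5):
    """Side length after a convolution without padding and with stride 1."""
    return size - kernel + 1


def subsample_size(size, factor=2):
    """Side length after non-overlapping subsampling."""
    return size // factor


# (layer name, size operation, number of feature maps), following Fig. 2.4.
LAYERS = [
    ("C1", conv_size, 6),
    ("S2", subsample_size, 6),
    ("C3", conv_size, 16),
    ("S4", subsample_size, 16),
    ("C5", conv_size, 120),
]


def tfe_shapes(input_size=32):
    """Return (name, maps, height, width) for each layer of the extractor."""
    shapes = []
    size = input_size
    for name, operation, maps in LAYERS:
        size = operation(size)
        shapes.append((name, maps, size, size))
    return shapes


for name, maps, height, width in tfe_shapes():
    print(f"{name}: {maps} x {height} x {width}")
```

Under these assumptions the C5 maps shrink to 1 × 1, so the layer delivers exactly 120 scalar features per image.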

Transformation-invariance

In character recognition, the input takes the form of a 2-dimensional image containing rows of pixels. For gray-level images, a pixel is represented by its coordinates (x, y) and its value p (usually a number between 0 and 255 indicating its darkness or brightness). It is clear that an image representing a character will still represent the same character if, for instance, translated by one pixel. Thus, one often looks for classifiers that can incorporate some translation-invariance as prior knowledge.

If the number of training samples is small, generating additional data using transformations (such as translations) may improve the performance of character recognition [240]. Using neural networks, results on the MNIST database were improved by applying transformations to the data, thereby multiplying the size of the training set by ten [160]. This shows that one can create new training samples from prior knowledge of transformation-invariance properties in order to increase the recognition ability of the classifier. The following describes two image transformations typically used in character recognition.

Simple distortions such as translations, rotations and scaling can be generated by applying affine displacement fields to images. For each pixel (x, y), a target location (u, v) is computed w.r.t. the displacement fields ∆x(x, y) and ∆y(x, y) at this position by (u, v) = (x + ∆x(x, y), y + ∆y(x, y)). For instance if ∆x(x, y) = αx and ∆y(x, y) = αy, the image is scaled by α.

The new grey level in the transformed image for the pixel at position (x, y) is the grey level of the pixel at position (u, v) in the original image. For affine transformations, the target location (u, v) is computed by

\[
\begin{pmatrix} u \\ v \end{pmatrix} = A \begin{pmatrix} x \\ y \end{pmatrix} + b + \begin{pmatrix} x \\ y \end{pmatrix}, \tag{2.83}
\]

where the 2 × 2 matrix A and the vector b are the parameters of the transformation, e.g.

• for scaling: $A = \begin{pmatrix} \alpha & 0 \\ 0 & \alpha \end{pmatrix}$, $b = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$,

• for translations of one pixel: A = 0 and b takes 8 different values, in the 4 main directions and in the 4 diagonals, i.e. [1, 0]^T, [0, 1]^T, [−1, 0]^T, [0, −1]^T, [1, 1]^T, [−1, 1]^T, [−1, −1]^T and [1, −1]^T.
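As a rough illustration of equation (2.83), the sketch below computes target locations and generates the 8 one-pixel translations of a toy image as virtual samples. The helper names (`target_location`, `translate`, `virtual_samples`), the `image[y][x]` indexing convention and the choice to fill pixels whose target falls outside the image with 0 are assumptions made for this example. Integer translations need no interpolation, so none is implemented here.

```python
def target_location(x, y, A, b):
    """Equation (2.83): (u, v)^T = A (x, y)^T + b + (x, y)^T."""
    u = A[0][0] * x + A[0][1] * y + b[0] + x
    v = A[1][0] * x + A[1][1] * y + b[1] + y
    return u, v


ZERO = [[0, 0], [0, 0]]
# The 8 one-pixel translation vectors b listed above.
TRANSLATIONS = [(1, 0), (0, 1), (-1, 0), (0, -1),
                (1, 1), (-1, 1), (-1, -1), (1, -1)]


def translate(image, b, fill=0):
    """New grey level at (x, y) is the original grey level at (u, v)."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            u, v = target_location(x, y, ZERO, b)
            if 0 <= u < width and 0 <= v < height:
                out[y][x] = image[v][u]
    return out


def virtual_samples(image):
    """One translated copy of the image per translation vector."""
    return [translate(image, b) for b in TRANSLATIONS]


image = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]
samples = virtual_samples(image)
print(len(samples))  # 8
print(samples[0])    # [[0, 0, 0], [9, 0, 0], [0, 0, 0]]

# Scaling with alpha = 0.5: the pixel at (10, 20) reads from (15.0, 30.0).
print(target_location(10, 20, [[0.5, 0], [0, 0.5]], [0, 0]))
```

Note that with b = [1, 0]^T each pixel reads its value from its right-hand neighbour, so the visible content moves one pixel to the left.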

Figure 2.5: Samples generated by elastic distortion from the original pattern shown on the left.

Table 2.4: Best results on the MNIST database. The second column shows whether virtual samples were used and with which transformations they were created.

Classifier                      distortion  reference  test error (%)
SVM                             -           [66]       1.4
LeNet-5                         -           [160]      0.95
TFE-SVM                         -           [153]      0.83
VSVM                            affine      [66]       0.8
LeNet-5                         affine      [160]      0.8
boosted LeNet-4                 affine      [160]      0.7
VSVM2                           affine      [66]       0.68
VSVM2 + deskewing               affine      [66]       0.56
TFE-SVM                         elastic     [153]      0.56
TFE-SVM                         affine      [153]      0.54
convolutional neural net. (NN)  elastic     [240]      0.4
large conv. NN + pretraining    elastic     [217]      0.39
human                           -           [162]      0.2

For transformed pixels with non-integer target locations (u, v), bilinear interpolation is used [240].

Elastic distortion is another image transformation, introduced in [240] to imitate the variations of handwriting. An elastic distortion is generated as follows. First, random displacement fields are created from a uniform distribution between −1 and +1. They are then convolved with a Gaussian of standard deviation σ. After normalization and multiplication by a scaling factor α that controls the intensity of the deformation, they are applied to the image. A small σ (called here the elastic coefficient) means more elastic distortion. For a large σ, the deformation approaches an affine one, and if σ is very large, the displacements become translations. Figure 2.5 shows some samples generated by elastic distortion.
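The steps above can be sketched with the Python standard library only. This is a minimal illustration, not the implementation used in [240] or [153]. Several details are assumptions: the Gaussian kernel is truncated at about 3σ, values outside the field or image are treated as zero, and "normalization" is taken to mean dividing each field by its largest absolute value, so that α is the maximal displacement in pixels. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma):
    """1-D Gaussian weights, truncated at about 3 sigma and summing to 1."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(field, sigma):
    """Separable Gaussian convolution; values outside the field count as zero."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2

    def blur_rows(rows):
        n = len(rows[0])
        return [[sum(kernel[k + radius] * row[i + k]
                     for k in range(-radius, radius + 1) if 0 <= i + k < n)
                 for i in range(n)]
                for row in rows]

    def transpose(rows):
        return [list(column) for column in zip(*rows)]

    return transpose(blur_rows(transpose(blur_rows(field))))


def normalize(field):
    """Assumed normalization: scale so the largest absolute value is 1."""
    peak = max(abs(value) for row in field for value in row)
    if peak == 0:
        return field
    return [[value / peak for value in row] for row in field]


def bilinear(image, u, v, fill=0.0):
    """Grey level at real-valued (u, v); outside the image it is `fill`."""
    height, width = len(image), len(image[0])
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0

    def pixel(x, y):
        return image[y][x] if 0 <= x < width and 0 <= y < height else fill

    top = (1 - fx) * pixel(x0, y0) + fx * pixel(x0 + 1, y0)
    bottom = (1 - fx) * pixel(x0, y0 + 1) + fx * pixel(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom


def elastic_distortion(image, sigma, alpha, rng):
    """Apply smoothed random displacement fields scaled by alpha."""
    height, width = len(image), len(image[0])

    def random_field():
        raw = [[rng.uniform(-1.0, 1.0) for _ in range(width)]
               for _ in range(height)]
        return [[alpha * value for value in row]
                for row in normalize(smooth(raw, sigma))]

    dx, dy = random_field(), random_field()
    return [[bilinear(image, x + dx[y][x], y + dy[y][x])
             for x in range(width)]
            for y in range(height)]


# Toy 8 x 8 image containing a filled square.
image = [[255.0 if 2 <= x <= 5 and 2 <= y <= 5 else 0.0 for x in range(8)]
         for y in range(8)]
distorted = elastic_distortion(image, sigma=2.0, alpha=1.5,
                               rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the generated virtual samples reproducible. Setting `alpha=0` returns the original image unchanged, which is a convenient sanity check.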

Recognition system and results

The proposed recognition system is composed of the trainable feature extractor (TFE) of Fig. 2.4 connected to a set of binary SVMs for the testing phase. It is thus labeled TFE-SVM in the following experiments. The multiclass approach for the SVMs is either the one-vs-one method, which needs 45 binary classifiers to separate every pair of the 10 classes, or the one-vs-all method, which involves 10 classifiers, each one assigned to the separation of one class from the others. For the results of Table 2.4, both methods are applied and only the better of the two results is shown.
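The two multiclass schemes can be sketched as follows. The decision functions below are hypothetical stand-ins for trained binary SVMs operating on a single toy feature. The majority-vote rule for one-vs-one and the largest-decision-value rule for one-vs-all are common choices assumed here, not details taken from [153].

```python
from itertools import combinations


def one_vs_one_pairs(classes):
    """All unordered pairs of classes: one binary classifier per pair."""
    return list(combinations(classes, 2))


def one_vs_one_predict(features, pairwise_classifiers):
    """Each binary classifier votes for one class of its pair."""
    votes = {}
    for (a, b), decide in pairwise_classifiers.items():
        winner = a if decide(features) >= 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    # Ties are broken in favour of the smallest label.
    return max(sorted(votes), key=lambda label: votes[label])


def one_vs_all_predict(features, class_classifiers):
    """Pick the class whose classifier gives the largest decision value."""
    return max(sorted(class_classifiers),
               key=lambda label: class_classifiers[label](features))


digits = list(range(10))
pairs = one_vs_one_pairs(digits)
print(len(pairs))  # 45

# Hypothetical decision functions on a 1-D feature: a class "wins" when
# the feature value is closer to its label.
pairwise = {(a, b): (lambda f, a=a, b=b: abs(f - b) - abs(f - a))
            for a, b in pairs}
per_class = {c: (lambda f, c=c: -abs(f - c)) for c in digits}

print(one_vs_one_predict(6.8, pairwise))   # 7
print(one_vs_all_predict(6.8, per_class))  # 7
```

For 10 classes the pair count is 10 × 9 / 2 = 45, against 10 classifiers for one-vs-all.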

In character recognition, the generation of virtual samples has become very popular and almost necessary to achieve first-class performance, as can be seen in Table 2.4, which shows the best results on the MNIST database (also available and updated at the MNIST homepage [161]). Other transformations, such as morphing [138], were specifically developed, and it appears that some of the best results are obtained with elastic distortions [240, 217], even though these are based on random displacements of pixels in the image. This highlights the fact that more samples help to learn better even if they are not perfectly accurate.

An analysis of the errors performed in [153] led to the conclusion that the performance could not be increased above a certain limit, because of bad samples in the test set not recognizable without ambiguity by humans. Nonetheless some error samples are very clear and are misrecognized because of their structure and the lack of samples of the same prototype in the training set. Regarding these errors, the generation of more samples of rare prototypes could lead to further improvement.

Note that the idea of using the extracted features of a convolutional network for another classifier can be found in [162]. The last layers of a LeNet-4 network were replaced by a K-Nearest Neighbors (K-NN) classifier to work on the extracted features. However, this method did not improve the results compared to a plain LeNet-4 network.

In document From Support Vector Machines to Hybrid System Identification (Page 66-69)