7.3 Concentric Discs
7.4.2 Classifiers for Face Detection using Moments
In this experiment, a classifier using 11 moment invariants (computed over square areas) was trained with AdaBoost. The positive images were taken from FERET and the negative images from the same source used in chapter 3. When a large number of positive images was used, the training algorithm failed to produce classifiers with low false detection rates. The fact that the images are too ambiguous for classification indicates that the 11 moments, extracted from square sub-windows, are not enough to discriminate between the positive and negative sets. The only reported successful face detector using moments relied on colour segmentation (Terrillon et al., 1998), which makes faces easier to differentiate from the background. Even with colour segmentation, Terrillon et al. (1998) found a number of false detections that could not be overcome.
The experiment showed that the 11 moment invariants have relatively poor discriminative power for face detection applications.
Experiment 6: face detection using the CDMI approach
In order to show the potential of the method of concentric discs, real-time face detection was tested using two simple classifiers. A positive set with 250 face images acquired using a web camera was used. The negative set contained the same room’s background to guarantee a low false detection rate. Figure 7.11 shows some samples of the application using a classifier produced with 250 positive examples.
Figure 7.11: Examples of successful detection with low false detection using 66 CDMI.
The face classifier produced with moment invariants is rotation invariant, as the examples in figure 7.11 show. The performance is somewhat slower than that of the original Viola-Jones method. An average of 1 frame per second was achieved at a resolution of 480x640 pixels, with a kernel size of 128x128 pixels, a scaling factor of 1.1 and a translation factor of 3 pixels.
The experiment revealed a number of interesting characteristics of the CDMI method. Firstly, the rotation invariance property holds well, as several faces rotated at random angles are correctly detected. Secondly, the scaling invariance property allows for some variation in the kernel size. Even though the original kernel size of the trained classifier is 128x128 pixels, detection with smaller kernels is achieved using the same classifier. In comparison, the Haar-like features only allow the use of kernels that are of equal or larger size than
the original size used during training. Thirdly, the training process is faster due to the limited dimension of the training sets.
Experiment 7: Measuring Accuracy with Small Negative Sets
The two previous experiments showed that the moment invariants do not have discrimination power as strong as that of the Haar-like features. The attempt to train using the Viola-Jones version of AdaBoost did not yield good classifiers when using more than 500 face samples from FERET.
In this final experiment, the number of negative sample images was limited to 20000 and the classifiers were trained using algorithm 3 (chapter 3). The classifiers used in this experiment yielded too many false detections to be used in a real environment with random backgrounds, but they allow us to examine how accuracy varies with the dimensions of the training set.
The training process used up to 2000 face samples from FERET. The negative set was composed of 20000 images, randomly acquired from various images with no faces. Each classifier was tuned to keep the hit rates at 100% and trained up to 50 layers. Each layer was limited to 400 weak classifiers.
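The layered structure described above (up to 50 layers, each with at most 400 weak classifiers, each layer tuned to keep every face) can be pictured as an attentional cascade in the style of Viola and Jones. The following Python sketch is illustrative only: algorithm 3 is not reproduced here, and the decision-stump form of the weak classifiers, the tuple layout and the function names are assumptions made for the example.

```python
def layer_score(weak_classifiers, features):
    """Weighted vote of one layer.

    Each weak classifier is assumed to be a decision stump given as
    (feature_index, threshold, polarity, alpha), voting alpha when
    polarity * feature < polarity * threshold.
    """
    score = 0.0
    for index, threshold, polarity, alpha in weak_classifiers:
        if polarity * features[index] < polarity * threshold:
            score += alpha
    return score


def cascade_accepts(cascade, features):
    """Return True only if every layer accepts the feature vector.

    `cascade` is a list of (weak_classifiers, layer_threshold) pairs.
    A sub-window is rejected as soon as one layer scores below its
    threshold, so most background windows are discarded early.
    """
    for weak_classifiers, layer_threshold in cascade:
        if layer_score(weak_classifiers, features) < layer_threshold:
            return False
    return True


if __name__ == "__main__":
    # Toy cascade with two layers of one stump each (made-up values).
    toy_cascade = [
        ([(0, 0.5, 1, 1.0)], 0.5),
        ([(1, 2.0, -1, 1.0)], 0.5),
    ]
    print(cascade_accepts(toy_cascade, [0.2, 3.0]))  # passes both layers
    print(cascade_accepts(toy_cascade, [0.9, 3.0]))  # rejected by layer 1
```

In such a scheme, lowering a layer threshold until all positive samples pass corresponds to keeping the hit rate at 100%, at the price of letting more negative samples through to later layers.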
Figures 7.12 and 7.13 show the false detection ratio plotted against the number of weak classifiers used by the AdaBoost classifier. The figures show that training converges faster when using a larger number of CDMIs. Also, when more CDMI features are used in the training process, a smaller number of weak classifiers is needed in order to achieve the same level of false detections.
[Plot for figure 7.12: false detection rate against the number of weak classifiers, for classifiers trained with 1000 FERET faces; one curve each for 11, 22, 44 and 66 CDMI.]
Figure 7.12: The false detection rate as a function of the number of weak classifiers. The positive set contains 1000 FERET faces and the negative set contains 20000 background images.
[Plot for figure 7.13: false detection rate against the number of weak classifiers, for classifiers trained with 2000 FERET faces; one curve each for 11, 22, 44 and 66 CDMI.]
Figure 7.13: The false detection rate as a function of the number of weak classifiers. The positive set contains 2000 FERET faces and the negative set contains 20000 background images.
7.5 Summary
A new feature extraction method combining moment invariants (Hu, 1962; Flusser, 2000a) with SATs (Crow, 1984) has been presented. The method speeds up the computation of moment invariants over sub-windows acquired from a larger image and has the potential to be used in real-time computer vision algorithms. It is possible to implement the Viola-Jones method using Hu’s features instead of Haar-like features. Besides the advantage of dealing with rotation invariant features, moment invariant features also limit the dimension of the feature set for the training sets.
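The idea of combining moments with SATs can be illustrated with a short Python sketch. It builds one summed-area table per raw moment order, so that the moments of any axis-aligned sub-window are obtained with four lookups per table. This is a minimal sketch under stated assumptions rather than the implementation used in this work: it uses rectangular windows instead of concentric discs, it stops at second-order moments, and all function names are hypothetical. Only the first Hu invariant is computed as an example.

```python
def summed_area_table(values):
    """Return a (h+1) x (w+1) table with sat[y][x] = sum of values[:y][:x]."""
    h, w = len(values), len(values[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += values[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def rect_sum(sat, x0, y0, x1, y1):
    """Sum over x0 <= x < x1 and y0 <= y < y1 using four lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def moment_tables(image, max_order=2):
    """One SAT per raw moment m_pq = sum x^p y^q I(x, y), with p + q <= max_order."""
    h, w = len(image), len(image[0])
    tables = {}
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            weighted = [[(x ** p) * (y ** q) * image[y][x] for x in range(w)]
                        for y in range(h)]
            tables[(p, q)] = summed_area_table(weighted)
    return tables


def window_moments(tables, x0, y0, x1, y1):
    """Central moments and the first Hu invariant of one sub-window.

    Assumes the window has non-zero mass (m00 > 0).
    """
    m = {pq: rect_sum(t, x0, y0, x1, y1) for pq, t in tables.items()}
    m00 = m[(0, 0)]
    xc, yc = m[(1, 0)] / m00, m[(0, 1)] / m00
    mu20 = m[(2, 0)] - xc * m[(1, 0)]
    mu02 = m[(0, 2)] - yc * m[(0, 1)]
    mu11 = m[(1, 1)] - xc * m[(0, 1)]
    # First Hu invariant: eta20 + eta02, with eta_pq = mu_pq / m00^2 for p + q = 2.
    phi1 = (mu20 + mu02) / (m00 ** 2)
    return {"m00": m00, "mu20": mu20, "mu02": mu02, "mu11": mu11, "phi1": phi1}


if __name__ == "__main__":
    image = [[(3 * x + 5 * y) % 7 + 1 for x in range(8)] for y in range(6)]
    tables = moment_tables(image)           # built once per image
    print(window_moments(tables, 2, 1, 7, 5))  # constant time per window
```

The tables are built once per image, after which the cost of each sub-window no longer depends on its size. Because central moments are translation invariant, the raw moments can be accumulated in global image coordinates without shifting each window to its own origin.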
Advantages of this method compared to Haar-like features:
• Rotation invariance (smooth detection when objects rotate in front of the camera).
• Faster training due to the limited feature space dimension.
• Flexible kernel size: due to the scaling invariance, classifiers trained with a large kernel can be used with a proportionally smaller kernel with similar accuracy.
Some of the face classifiers trained for section 7.4.2 were accurate enough to be used in practice with a web camera, even though generic face detection for the CMU-MIT dataset was not possible. The results are encouraging, but a question remains: is it possible to produce good classifiers for generic recognition of shapes? In chapter 8 the method is applied to a hand-written digit recognition problem, a very difficult problem due to the similarity among the images.
Chapter 8
Digits Recognition Using the Rapid Moment Extraction Method
This chapter presents the results of experiments with digit recognition using two different types of features. The first type used the normalised central moments (ηn), which are scale invariant but not rotation invariant. The second type used the method of the concentric discs (CDMI) developed in chapter 7 (based on Flusser’s set of moment invariants ψn), which are both scale and rotation invariant. The experiments were constrained to handwritten digits, but the concepts seen here can also be applied to general OCR problems. The main question addressed in this chapter is whether the proposed set of features is discriminative enough to cope with handwritten characters. The scope of the study is limited to handwritten digits using a standard set of images collected by NIST.
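For reference, the normalised central moments mentioned above are usually defined as follows (Hu, 1962). The two-index notation is the standard one; the single index n used in this chapter presumably enumerates these moments, which is an assumption made here.

```latex
\mu_{pq} = \sum_{x}\sum_{y} (x-\bar{x})^{p}\,(y-\bar{y})^{q}\, f(x,y),
\qquad
\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}},
\qquad
\gamma = \frac{p+q}{2} + 1 .
```

In the continuous case, scaling the image uniformly by a factor α multiplies μpq by α^(p+q+2) and μ00^γ by α^(2γ) = α^(p+q+2), so the two factors cancel and ηpq is unchanged. A rotation, in contrast, mixes the x and y powers, which is why the ηpq alone are not rotation invariant.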
Classifiers were trained using a modified version of AdaBoost. The classification problem presented in this chapter is of a different nature from the face detection problem. In face detection, the training is carried out considering a set of positive images against “the universe” of images. In practice, new negative samples are added in each AdaBoost round, making the training of the next stage more difficult. The opposite is true in digit recognition, i.e., a number of negative samples are eliminated as new stages are added to the classifiers, until no negative samples are left.
The results showed that the discriminative power of features based on moments was not strong enough to create classifiers as reliable as the ones described in the literature. The best results achieved by these experiments were just below 10% test error (based on the MNIST database). Considering that the method is scale invariant, very fast and simple to implement, it has potential use as a first stage in recognition problems.
This chapter is organised as follows. Firstly, a brief literature review shows the state-of-the-art in handwritten digit recognition and points to the difficulties faced by most methods in terms of accuracy and performance. The next section describes the methods used to extract the features and to train the classifiers for the experiments in this chapter. Next, the results for training experiments are shown and an analysis of the accuracy of classifiers is discussed. In the final section, a detailed analysis of the errors for individual classifiers is presented.
8.1 Related Work
There are a number of handwritten digit databases, such as MNIST, USPS, NIST and others. MNIST has been used recently as a benchmark for OCR methods. The MNIST database was based on the NIST SD-3 and SD-1 databases. It is publicly available, and contains 60000 digits in the training set and 10000 digits in the test set (LeCun et al., 1998). The digits were normalised to fit a 20x20 pixel image and were centred in a final 28x28 pixel image.
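The final layout of an MNIST image can be pictured with a few lines of Python. The text does not state how the centring was carried out, so this sketch simply assumes geometric centring, which leaves a 4-pixel margin around a 20x20 digit. The function name centre_in_canvas and the list-of-lists image representation are hypothetical.

```python
def centre_in_canvas(digit, canvas_size=28):
    """Place a small image in the middle of a blank square canvas.

    Assumption: plain geometric centring. The procedure actually used to
    build MNIST may differ (for example, centring on the centre of mass).
    """
    h, w = len(digit), len(digit[0])
    top = (canvas_size - h) // 2
    left = (canvas_size - w) // 2
    canvas = [[0] * canvas_size for _ in range(canvas_size)]
    for y in range(h):
        for x in range(w):
            canvas[top + y][left + x] = digit[y][x]
    return canvas


if __name__ == "__main__":
    digit = [[1] * 20 for _ in range(20)]   # stand-in for a normalised digit
    canvas = centre_in_canvas(digit)
    print(len(canvas), len(canvas[0]))
```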
The task of recognising handwritten characters in real-time is a very difficult one. One of the critical steps involves feature extraction. Usually, one has to choose a compromise among certain characteristics such as invariance, discriminative power, dimensionality, and computational complexity of the feature set. However, the classification process needs features that contain enough information about the class, and that is where moment invariants have problems.
Wong et al. (1995) proposed a new set of invariants based on Hu (1962) that achieved good recognition rates for a simple OCR problem using printed characters. Moments were limited to lower orders due to numerical instabilities and noise sensitivity. Trier et al. (1996) surveyed feature extraction methods for OCR applications. They concluded that while the recognition of printed characters is relatively simple, handwritten character recognition needs more sophisticated methods and it is much more difficult to train accurate classifiers. They noted that most successful OCR systems needed at least 10-15 features. However, a larger number of features was needed to achieve better accuracies with handwritten digits. Liao et al. (1997) used moment invariants to build a Chinese character recognition system, and noticed that characters that were too similar had to be grouped together in order to make a strong system.
Baluja (1999) studied the problem of recognising rotated digits. He used three different methods to cope with rotated digits. The first method used an exhaustive approach and was not accurate. The second used a two-step approach in which a de-rotation neural network was trained to return an angle for unknown digits, followed by a single neural network that classified the de-rotated images. The third approach used the same de-rotation step, but individual classifiers were trained for each digit. The last approach was the most accurate, achieving 93% recognition, although the method itself does not provide scale invariance.
Table 8.1: Results reported on the MNIST database.
Method                             Authors                         Reported Test Error
linear classifier (1-layer NN)     LeCun et al. (1998)             12.0%
2-layer NN, 1000 hidden units      LeCun et al. (1998)             4.5%
Euclidean nearest neighbour        Simard et al. (1992)            3.5%
Haar-like features and AdaBoost    Casagrande (2005)               1.3%
3-Stage NN-NN-SVM                  Gorgevik and Cakmakov (2004)    0.83%
LeNet4 with distortions            LeCun et al. (1998)             0.7%
BoostMap and BoostMap-C            Athitsos et al. (2005)          0.58%
Combination of the methods         Keysers (2006)                  0.35%
When analysing the errors made by various methods in the task of digit recognition, Suen and Tan (2005) found that some of the characters were so ambiguous that hardly any of the available methods could correctly classify them. They presented a list of 127 handwritten digits from MNIST as being very difficult (which, out of the 10000 test digits, already represents an error of 1.27%). They divided the most common errors into three categories:
• Category 1: geometric similarity (such as 4s and 9s, 0s and 6s etc). The errors in this category are very difficult to overcome because there is usually an undefined boundary between such digits in any feature space.
• Category 2: noisy images. The errors in this category are due to degraded images. Common problems include writing habits, thick pens or pens that fail to write part of the digit.
• Category 3: images easily recognisable by humans. Usually errors in this category are due to the feature extraction process or due to the training process. If the feature set is not discriminative enough, further training is unlikely to improve the results.
There are a number of reports using the MNIST database with a variety of methods. It is beyond the scope of this work to discuss them all in detail. A good review of about 30 methods was presented by Keysers (2006). LeCun et al. (1998) discuss implementations of convolutional neural networks (including their ‘LeNet’ method). Examples of overall results are shown in table 8.1. The errors varied between 12.0% and 0.35%. The methods that presented very low errors added geometrically transformed samples to the training set.
Summing up, moment invariants have been used before in OCR, with limitations in accuracy. Moment invariants of higher order are sensitive to noise, limiting the dimensionality of moment based feature sets. There are many feature extraction methods that were successfully used in OCR, but most of the features are not invariant to scale or rotation, with the exception of Haar-like features, which are invariant to scale. Most of the feature extraction methods surveyed were developed specially for OCR applications and would be difficult to apply to other recognition problems. It is useful to extend the feature set based on moments, as these features are invariant to scale and rotation, fast and easy to implement.