Summary of Geometric Matching Methods - Improving Bags-of-Words model for object categorization

The geometric matching model was the first meaningful attempt by the computer vision community to tackle the problem of object recognition. The fundamental principle is that an object is represented as a collection of parts, which implies that recognition will be viewpoint invariant. Mundy [105] proposed four reasons why geometric representation played such an important part in the development of recognition theory and the resulting algorithms and systems: invariance to viewpoint, invariance to illumination, well-developed theory, and man-made objects.

• Invariance to viewpoint. Geometric object descriptions allow the projected shape of an object to be accurately predicted under perspective projection.

• Invariance to illumination. Recognizing geometric descriptions from images can be achieved using edge detection and geometric boundary segmentation.

• Well developed theory. Geometry has been under active investigation by mathematicians for a long time. The geometric framework has achieved a high degree of maturity and effective algorithms exist for analyzing and manipulating geometric structures.

• Man-made objects. A large fraction of manufactured objects are naturally described by primitive geometric elements, such as planes and spheres.
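The viewpoint argument in the list above can be made concrete with a small worked example. The following is a minimal sketch, assuming an ideal pinhole camera with the object expressed in the camera frame; the function names, the focal length and the square model are illustrative assumptions and do not come from [105].

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(points: List[Point3D], focal_length: float = 1.0) -> List[Point2D]:
    """Project camera-frame 3D points onto the image plane (pinhole model)."""
    image_points = []
    for x, y, z in points:
        if z <= 0:
            raise ValueError("points must lie in front of the camera (z > 0)")
        image_points.append((focal_length * x / z, focal_length * y / z))
    return image_points


def translate(points: List[Point3D], offset: Point3D) -> List[Point3D]:
    """Move the object relative to the camera (a simple change of viewpoint)."""
    dx, dy, dz = offset
    return [(x + dx, y + dy, z + dz) for x, y, z in points]


# Hypothetical model: four corners of a unit square facing the camera.
square = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (1.0, 1.0, 2.0), (0.0, 1.0, 2.0)]

near_view = project(square)                              # object at depth 2
far_view = project(translate(square, (0.0, 0.0, 2.0)))   # object at depth 4
```

Because the model is geometric, the image of the square at any depth follows directly from the projection equations: doubling the depth halves the projected side length.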

One of the major problems of the geometric approach is that an object can be seen from different points of view, resulting in different images that need to be recognized as portraying the same object [166]. In addition, extracting edges from natural images can be difficult in the presence of extensive illumination differences, background clutter and occlusion.

Category Level Recognition

This thesis is concerned with the learning and recognition of object categories. Unlike object instance recognition, category-level recognition is not only about matching concrete shapes but also about making sense of shape concepts. Traditional strategies such as template matching, geometric models and texture region matching are no longer capable of handling such tasks. This is not because the models are inflexible, but because the template database is no longer well defined at the level of abstraction on which the system operates: instances of a category are not identical, so the matching scheme must account for the variability across instances in the extracted features. Object categories are more general, require more complex representations, and are more difficult to learn, which is why most work today focuses on modeling and learning object categories.

In the last two decades, the research community has mainly focused on challenging problems such as complex scenes and large numbers of classes. This section reviews some of the most notable approaches in turn, starting from handwritten digits [82], pedestrians [69] and faces [161, 136].

Figure B.1: Sample digits from the MNIST dataset.

B.1 Categorizing Handwritten Digits

The recognition of handwritten digits is a challenging problem, not only because a digit can be written in many different ways, as shown in Figure B.1, but also because of the strict requirements of specific problems. Performance is primarily measured by recognition accuracy and speed, and most researchers have adopted the classical pattern recognition approach, in which image pre-processing is followed by feature extraction and classification. This section does not attempt to review in depth the work that has gone into this area over the past three decades; instead, it summarizes the research directions and methodologies in the field.

Work in this area can be roughly organized along two dimensions: statistical/structural and local/global approaches. For the global statistical approach, Cash et al. [17] extract central and raw mathematical moments and use them as features, while Shridhar et al. [139] use features derived from the topological character profiles of the image (e.g. crossings, endpoints and holes), which depend on the global properties of the data.
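To illustrate what moment features look like in practice, the sketch below computes raw and central moments of a small grey-level image using the standard definitions. It is not the feature set of [17], whose exact choice of moments is not given here; the function names and the `max_order` parameter are assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]  # img[y][x] holds the pixel intensity


def raw_moment(img: Image, p: int, q: int) -> float:
    """Raw moment m_pq = sum over pixels of x^p * y^q * I(x, y)."""
    return sum((x ** p) * (y ** q) * value
               for y, row in enumerate(img)
               for x, value in enumerate(row))


def central_moment(img: Image, p: int, q: int) -> float:
    """Central moment mu_pq, taken about the intensity centroid."""
    m00 = raw_moment(img, 0, 0)
    if m00 == 0:
        raise ValueError("image has no mass; the centroid is undefined")
    cx = raw_moment(img, 1, 0) / m00
    cy = raw_moment(img, 0, 1) / m00
    return sum(((x - cx) ** p) * ((y - cy) ** q) * value
               for y, row in enumerate(img)
               for x, value in enumerate(row))


def moment_features(img: Image, max_order: int = 2) -> List[float]:
    """Central moments with 2 <= p + q <= max_order, as a feature vector."""
    return [central_moment(img, p, q)
            for p in range(max_order + 1)
            for q in range(max_order + 1)
            if 2 <= p + q <= max_order]


# A vertical stroke, and the same stroke shifted one pixel to the right.
stroke = [[0, 1, 0, 0],
          [0, 1, 0, 0],
          [0, 1, 0, 0]]
shifted = [[0, 0, 1, 0],
           [0, 0, 1, 0],
           [0, 0, 1, 0]]
```

Raw moments change when the stroke moves, whereas central moments do not, which is one reason both kinds appear as global features: `moment_features(stroke)` and `moment_features(shifted)` are equal even though `raw_moment(stroke, 1, 0)` and `raw_moment(shifted, 1, 0)` differ.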

For the local structural approach, Lam et al. [77] extracted local geometric information consisting of lines and convex polygons and used these as input to a structural classifier. Other notable attempts include automatically learning appropriate local features using feed-forward neural networks [25]. Hinton et al. [55] argue that it is also possible to discriminate by fitting a separate probability density model to each class and then picking the class of the model that assigns the highest density to a test image. However, one disadvantage of this relative density approach is that it generally requires more computational time during recognition. In recent years, a more intuitive approach [154] has begun to emerge in the form of deformable templates. In this approach, an image deformation is used to match test images against a library of training images. Research on this approach has concentrated on taking the outlines of images and representing them with a number of combinations of curve segments; the image is then deformed by altering the curve parameters [67].
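The relative density idea described above can be sketched in a few lines. The example below assumes, purely for illustration, an independent Gaussian per feature as the class density; the models used in [55] are different and are not reproduced here, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Model = Tuple[List[float], List[float]]  # per-feature means and variances


def fit_diagonal_gaussian(samples: Sequence[Sequence[float]],
                          min_var: float = 1e-3) -> Model:
    """Fit one independent Gaussian per feature (an illustrative assumption)."""
    n, dim = len(samples), len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(dim)]
    variances = [max(sum((s[i] - means[i]) ** 2 for s in samples) / n, min_var)
                 for i in range(dim)]
    return means, variances


def log_density(x: Sequence[float], model: Model) -> float:
    """Log of the density that the class model assigns to x."""
    means, variances = model
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, means, variances))


def classify(x: Sequence[float], models: Dict[str, Model]) -> str:
    """Pick the class whose density model assigns the highest density to x."""
    return max(models, key=lambda label: log_density(x, models[label]))


# Toy two-feature training data for two classes (made up for the example).
training = {
    "0": [[0.0, 0.1], [0.1, 0.0]],
    "1": [[1.0, 0.9], [0.9, 1.0]],
}
models = {label: fit_diagonal_gaussian(samples)
          for label, samples in training.items()}
```

Note that `classify` evaluates every class model for each test image, which is consistent with the extra recognition time noted above.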

Along the same line of research, Lam and Suen [77] proposed a two-stage scheme. In their work, samples are first classified by their structure using a tree classifier. Samples that cannot be confidently assigned to a class by this process are passed to a slower relaxation matching algorithm that uses deformable templates.
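The control flow of such a two-stage scheme can be sketched as follows. This is only an outline of the cascade, not the method of [77]: the tree classifier and the relaxation matcher are represented by caller-supplied functions, and the confidence score and its threshold are assumptions introduced for the example.

```python
from typing import Callable, Tuple, TypeVar

Sample = TypeVar("Sample")


def two_stage_classify(
    sample: Sample,
    fast_classifier: Callable[[Sample], Tuple[str, float]],
    slow_matcher: Callable[[Sample], str],
    confidence_threshold: float = 0.9,  # hypothetical cut-off
) -> Tuple[str, str]:
    """Return (label, stage), where stage records which classifier decided.

    fast_classifier stands in for the structural tree classifier and returns
    a label with a confidence in [0, 1]; slow_matcher stands in for the
    slower template-based matcher and is called only for uncertain samples.
    """
    label, confidence = fast_classifier(sample)
    if confidence >= confidence_threshold:
        return label, "fast"
    return slow_matcher(sample), "slow"
```

The design keeps the expensive matcher off the common path: it runs only on the samples the first stage cannot resolve, so average recognition time stays close to that of the fast classifier.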
