Discussion - Object Part Localization Using Exemplar-based Models

One might expect that fine-grained classification problems would be extremely difficult, that telling a Beagle from a Basset Hound would be much harder than telling a car from a computer mouse. Our main contribution is to show that much of this difficulty can be mitigated by the fact that it is possible to establish accurate correspondence between instances from a large family of related classes. We extract visual features that can be effectively located using generic and breed-specific models of part locations. An additional contribution is the creation of a large, publicly available dataset for dog breed identification, coupled with a practical system that achieves high accuracy in real-world images.

CHAPTER 4. FISH AND BIRD SPECIES CLASSIFICATION 29

Chapter 4

Fish and Bird Species Classification

As the part-based approach has shown promising results in dog breed classification, we would like to apply it to other fine-grained categories such as fish and birds. In this chapter, we build recognition systems that identify fish species and bird species. This time, we employ an updated version of the part localizer (referred to as CoE-ext), which will be described in Chapter 5. With the improved part localizer, we seek to detect all the parts explicitly, regardless of how difficult each part is to localize. As a result, the classification pipeline is simplified to three steps: (1) localize the parts, (2) extract part-based features, (3) predict the class labels. Fig. 4.1 illustrates the pipeline with a test image of a fish. In the following sections, we assume the part locations have been detected and describe how we extract the part-based features for classification.

Figure 4.1: Pipeline of our fine-grained classification system: parts (green dots) are first detected in the test image, then features are extracted at the locations dictated by the parts, and finally species classifiers predict the most likely label.

Figure 4.2: Top: sample fish images from 29 species. Bottom: the number of images per species.

4.1 Fish Species Classification

4.1.1 Columbia Fish Dataset

As there is no public fish dataset available, we collect our own data to evaluate our system. The fish dataset consists of 2,127 images from 29 species. Some sample images and statistics of the dataset are shown in Fig. 4.2. We randomly partition the dataset with a fixed ratio for each class to generate 1,335 training images and 792 testing images. We use the training set to build the part detectors and species classifiers, and apply them to the testing set.
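The per-class partition described above can be illustrated with a minimal sketch. The function name `stratified_split`, the toy labels, and the ratio of 0.6 are assumptions made for illustration; the text does not state the ratio that was used, only that it was the same for every class.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio, seed=0):
    """Split sample indices so every class uses the same train/test ratio.

    labels: list with one class label per image.
    train_ratio: fraction of each class sent to training (assumed value;
        the source does not report the ratio).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(train_ratio * len(indices))
        if len(indices) > 1:
            # Keep at least one image of the class on each side of the split.
            n_train = max(1, min(len(indices) - 1, n_train))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Toy data: three hypothetical species with 10, 20 and 30 images.
labels = ["a"] * 10 + ["b"] * 20 + ["c"] * 30
train_idx, test_idx = stratified_split(labels, train_ratio=0.6)
```

Splitting inside each class, rather than over the pooled images, keeps rare species represented in both sets, which matters when the number of images per species is uneven.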

Besides the fish images and species labels, we labeled eight fish parts that are common to all the species, as shown in Fig. 4.3. As with our dog dataset, the images were submitted to MTurk to have the species labels verified and the part locations annotated.

Figure 4.3: Examples of eight labeled parts.

Part                 Error (pixels)
Eye                  3.72
Mouth                4.29
Second Dorsal Fin    6.03
Caudal Fin           6.01
Anal Fin             6.94
First Dorsal Fin     4.96
Pectoral Fin         4.23
Ventral Fin          6.21

Table 4.1: Average localization error for each fish part. Fish length is normalized to 100 pixels.

4.1.2 Fish Features

Given the detected parts, we first normalize the image so that the fish has a fixed length (measured by the distance between Eye and Caudal Fin). The fish is also made upright and facing right (left-right flipping may be used). From the normalized image, we extract three types of features: fine-scale SIFT features, which capture the texture of the fish body (i.e., fish scales); coarse-scale SIFT features, which capture the shape and appearance at the parts; and color histograms, which capture the color pattern of the fish body. We concatenate these features to represent a fish image. Fig. 4.4 shows the specific regions from which these features are extracted.
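One way to picture the normalization step is as a similarity transform fixed by two parts, followed by an optional reflection. The sketch below works on part coordinates only and is an illustration under stated assumptions, not the implementation used here: the function name `normalize_parts`, the canonical length of 100, the choice of Caudal Fin as the origin, and the use of First Dorsal Fin to decide the flip are all hypothetical. An image would be warped with the same transform.

```python
def normalize_parts(parts, length=100.0):
    """Map part locations into a canonical fish frame.

    parts: dict of part name -> (x, y) in image coordinates (y points down).
    Returns a dict in which Caudal Fin is at (0, 0), Eye is at (length, 0),
    and First Dorsal Fin lies above the body axis (negative y).
    Raises ZeroDivisionError if Eye and Caudal Fin coincide.
    """
    eye = complex(*parts["Eye"])
    tail = complex(*parts["Caudal Fin"])
    # Treat points as complex numbers: z -> a * (z - tail) is a rotation plus
    # uniform scaling that sends the tail-to-eye vector onto (length, 0).
    a = length / (eye - tail)
    mapped = {name: a * (complex(*xy) - tail) for name, xy in parts.items()}
    # A dorsal fin below the axis means the fish is mirrored; reflecting about
    # the body axis corresponds to mirroring the image before alignment.
    if mapped["First Dorsal Fin"].imag > 0:
        mapped = {name: z.conjugate() for name, z in mapped.items()}
    return {name: (z.real, z.imag) for name, z in mapped.items()}


# A fish that faces left in the image: the eye is left of the caudal fin.
example = {"Eye": (10, 50), "Caudal Fin": (110, 50), "First Dorsal Fin": (60, 30)}
normalized = normalize_parts(example)
```

In this example the rotation alone would turn the fish upside down, so the reflection is applied and the dorsal fin ends up above the axis, midway along the body.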

4.1.3 Results

We first measure the localization errors for the fish parts, which are listed in Tab. 4.1. From these numbers, we can see that parts that are more variable across different subcategories have larger localization errors. We can also see that the average error is around 5% of the fish length, which indicates that our part localizer makes reasonable predictions.
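The "around 5%" figure can be checked directly from the per-part values in Tab. 4.1. The sketch below also shows one way a single normalized error could be computed; the function `localization_error` and its use of two reference points to define the fish length are assumptions for illustration, since the text does not spell out the exact procedure.

```python
import math


def localization_error(predicted, truth, ref_a, ref_b):
    """Distance between a predicted and a true part location, expressed as a
    percentage of the fish length (the distance between two reference points)."""
    fish_length = math.dist(ref_a, ref_b)
    return 100.0 * math.dist(predicted, truth) / fish_length


# Per-part average errors copied from Tab. 4.1 (fish length = 100 pixels).
table_4_1 = [3.72, 4.29, 6.03, 6.01, 6.94, 4.96, 4.23, 6.21]
mean_error = sum(table_4_1) / len(table_4_1)

# A prediction 5 pixels off on a fish 50 pixels long is a 10% error.
toy_error = localization_error((3, 4), (0, 0), (0, 0), (50, 0))
```

The mean of the eight entries is about 5.3 pixels, that is, about 5.3% of the normalized fish length.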

Using the part-based features, we build one-vs-all species classifiers using SVMs with RBF kernels. We evaluate the classification performance by plotting Cumulative Match Characteristic (CMC) curves (Fig. 4.5). The rank-1 accuracy of our method is about 72%, which is remarkable for such a challenging problem. Moreover, our method significantly outperforms a well-known image classification technique, LLC [Wang et al., 2009], demonstrating again the benefit of parts. We also show the upper bound of our classification method by using the ground-truth part locations. We observe a large gap between the accuracy obtained with detected parts and with ground-truth parts, presumably because the visual features are sensitive to the part locations. Some classification examples of our method are shown in Fig. 4.6.

Figure 4.4: Illustration of fish features extracted from the normalized images. (a) Two fine-scale SIFT descriptors (grayscale) are extracted at the halfway points between upper and lower fins. (b) Five coarse-scale SIFT descriptors (grayscale) are extracted at part locations. (c) Two color histograms are extracted from two convex hulls of subsets of parts. (d) 64 RGB color centers learned with k-means.

Figure 4.5: Cumulative Match Characteristic (CMC) curves for fish species classification.

Figure 4.6: Testing examples of fish species classification. Green words below the images indicate the correct labels. Success cases are marked with a green frame, and failure cases with a red frame. Each image is overlaid with colored dots (the detected parts) and a pink box (the object bounding box).
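To make the CMC evaluation concrete, the following minimal sketch computes a CMC curve from per-image classifier scores. The function name `cmc_curve`, the dictionary-based score format, and the toy scores are assumptions for illustration and are not results from the fish dataset.

```python
def cmc_curve(scores, true_labels, max_rank):
    """Cumulative Match Characteristic.

    scores: list with one dict per test image, mapping label -> classifier
        score (higher means more likely).
    true_labels: list with the correct label of each test image.
    Returns a list whose entry k-1 is the fraction of images whose correct
    label is among the k highest-scoring labels.
    """
    hits = [0] * max_rank
    for image_scores, truth in zip(scores, true_labels):
        ranked = sorted(image_scores, key=image_scores.get, reverse=True)
        rank = ranked.index(truth)  # 0-based position of the correct label
        for k in range(rank, max_rank):
            hits[k] += 1
    return [h / len(true_labels) for h in hits]


# Toy example: four test images, three hypothetical species.
toy_scores = [
    {"a": 0.90, "b": 0.06, "c": 0.04},  # correct label ranked first
    {"a": 0.50, "b": 0.40, "c": 0.10},  # correct label ranked second
    {"a": 0.20, "b": 0.30, "c": 0.50},  # correct label ranked first
    {"a": 0.60, "b": 0.30, "c": 0.10},  # correct label ranked third
]
toy_truth = ["a", "b", "c", "c"]
curve = cmc_curve(toy_scores, toy_truth, max_rank=3)
```

The first entry of the returned list is the rank-1 accuracy, the quantity reported as about 72% for the fish classifier; later entries show how often the correct species appears within the top k predictions.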
