5.3 Scale-Invariant Feature Transform
5.3.3 Comparison with HMAX
Table 5.4 summarizes the results of the face discrimination task for both HMAX and SIFT inputs. Performance is fairly close between the two underlying models, indicating that they do a similar job of projecting images into a feature space good for discriminating between different faces while generalizing between different images of the same face. A few more subtle details expose differences between the two models, however. SIFT was designed to take advantage of features that should be very similar between different images of the same object (as opposed to images of different exemplars of the same class), so the features are generally very specific to that object. In fact, just three descriptors or so are often good enough to match an object between two images (Lowe, 1999). Performance as measured by the ROC or semi-supervised metrics, then, is somewhat better when the model is applied to
[Figure 5.10 image omitted. Panel (a): activation levels shown above the images, in descending order: 1.288, 1.218, 1.184, 1.112, 0.473, 0.237, 0.146, 0.131, 0.115, 0.107, 0.095, 0.079, 0.070, 0.067, 0.066, 0.064, 0.060, 0.058, 0.055, 0.040. Panel (b): response histogram, x-axis v_i (0 to 2), y-axis number of responses (0 to 100); inset ROC plot of p(detection) against p(false alarm), area = 1.000.]
Figure 5.10: Responses of one selective unit (out of 15) after the unsupervised category learning on the same image set as in Figure 5.3 using SIFT features. (a): images that evoked the top responses, with the 10 most important SIFT descriptors outlined and the activation level above each image. Every 2nd image omitted for clarity. (b): response histograms. x-axis is the activation level; y-axis is the number of test images (100 total) evoking a response at that level. Responses to preferred person in black; responses to all other images in white. Insets: ROC curves. Solid line is ROC curve for selected unit, dashed line is ROC curve for best principal component.
SIFT features rather than HMAX features. With SIFT inputs, a unit is less likely to be excited by a non-preferred person, since such an image is likely to be well separated from images of the preferred person in feature space. Furthermore, with fewer people in the input set than available coding units, the SIFT features allow finer distinctions between images than the HMAX features do, so categories are more likely to be split into subcategories. Under the semi-supervised metric 2, this splitting improves performance, because multiple units representing different subsets of the same category are taken into account. Under the unsupervised metric 3, however, it lowers performance for small numbers of input categories.
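The ROC area quoted for a single unit, as in the inset of Figure 5.10(b), can be illustrated with a minimal sketch. It assumes the area is computed from the unit's activation levels for preferred-person images against those for all other images; the function name `roc_area` and the activation values are hypothetical and are not taken from the experiments.

```python
def roc_area(preferred, others):
    """Area under the ROC curve for one unit.

    Computed as the fraction of (preferred, other) response pairs in which
    the preferred-image response is larger, with ties counted as one half.
    This pairwise form equals the area under the ROC curve obtained by
    sweeping a detection threshold over the activation level.
    """
    if not preferred or not others:
        raise ValueError("both response lists must be non-empty")
    wins = 0.0
    for p in preferred:
        for o in others:
            if p > o:
                wins += 1.0
            elif p == o:
                wins += 0.5
    return wins / (len(preferred) * len(others))


# Hypothetical activation levels for a well-separated unit (illustration only).
preferred_responses = [1.3, 1.2, 1.1]
other_responses = [0.5, 0.2, 0.1, 0.1]

area = roc_area(preferred_responses, other_responses)
print(f"ROC area = {area:.3f}")
```

An area of 1 means every preferred-image response exceeds every other response, and an area of 0.5 corresponds to chance.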
I also tested the SIFT approach on the multi-class categorization task (airplanes-cars-motorbikes-faces) described in Section 5.2.3 above, with very different results. In that case, the images from a single category are much more widely separated, so the generalization capabilities of the model need to be correspondingly better. This is
            Metric 1               Metric 2               Metric 3
# people    HMAX   SIFT   chance   HMAX   SIFT   chance   HMAX   SIFT   chance
4           91.9   94.0   50.0     85.6   92.0   25.0     67.0   60.2   6.7
5           92.2   94.8   50.0     81.4   90.5   20.0     70.0   67.4   6.7
6           92.7   93.6   50.0     81.6   85.8   16.7     72.2   69.7   6.7
7           91.3   93.6   50.0     73.6   84.9   14.3     68.7   73.8   6.7
8           90.6   93.1   50.0     70.0   83.7   12.5     65.7   76.6   6.7
9           90.2   93.7   50.0     67.5   80.7   11.1     63.5   77.5   6.7
10          90.1   93.0   50.0     64.1   75.7   10.0     63.3   74.1   6.7
Table 5.4: Comparison of performance of sparse coding network applied to HMAX and SIFT features on face discrimination task.
where SIFT performs much worse than HMAX; in fact, performance is barely better than chance in this setting (and so the details are omitted). It turns out that, for example, images of two different motorcycles may be as widely separated in SIFT feature space as an image of a motorcycle and one of an airplane, so they are not likely to be clustered together.
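The separation argument can be made concrete by comparing mean within-category and between-category distances in feature space. The sketch below assumes each image is summarized by a fixed-length feature vector compared with Euclidean distance; this representation, the function names, and the toy vectors are illustrative assumptions, not the procedure used in the experiments.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distances(features, labels):
    """Return (mean within-category distance, mean between-category distance)."""
    within, between = [], []
    for i, j in combinations(range(len(features)), 2):
        d = euclidean(features[i], features[j])
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    if not within or not between:
        raise ValueError("need at least one within- and one between-category pair")
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical 2-D feature vectors for four images (illustration only).
features = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = ["motorcycle", "motorcycle", "airplane", "airplane"]

within, between = mean_pairwise_distances(features, labels)
print(f"within = {within:.3f}, between = {between:.3f}, ratio = {within / between:.3f}")
```

A ratio well below 1, as in this toy data, means categories form compact clusters. A ratio near 1 is the situation described above, in which two motorcycles can be as far apart as a motorcycle and an airplane, so clustering cannot recover the categories.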
These distinctions between HMAX and SIFT suggest a possible hierarchy for object and category representation in the brain. At one stage, neurons may operate on HMAX-like inputs to become selective to broad categories such as motorcycles and faces. Such neurons would explicitly represent their preferred categories, but within each category the identity of a particular exemplar would be carried only across the population. As discussed in Chapter 3, the “face cells” of the macaque inferior temporal cortex are an example of such neurons: individual cells respond much more strongly to faces than to non-faces, but facial identity is carried across the population (Young & Yamane, 1992; Rolls & Tovee, 1995). These neurons may then be making explicit the image features best suited to fine distinctions between objects within their preferred category, but perhaps not suited to broader category judgements; these features would be more akin to the SIFT features used here. A large population of these neurons with the same category selectivity, then, may form the input to a second sparse coding stage that makes identity within the category explicit. The sparse, invariant human MTL neurons are the clear example here (Quian Quiroga et al., 2005).
A second possibility also comes to mind, however. With only roughly 10,000 afferents on average, cortical neurons receive input from only a tiny fraction of cells in the preceding region. Simply by chance, then, some neurons may receive input from neurons representing a subset of features well-suited to broad categorization (HMAX-like features) while others receive input from neurons that respond to features better adapted to making fine distinctions within some category (SIFT-like features). The emergence of category- and exemplar-selective cells would then happen in parallel rather than as a two-stage process. With the available data it is unclear which of these two architectures is more likely (or if there is a third possibility), though the clear existence of face-selective cells in macaque IT and individual-selective cells in human MTL makes the hierarchical architecture attractive.