and extension of Marr’s theory. We start by considering Biederman’s approach before moving on to more recent theories.
Biederman’s recognition-by-components theory
The central assumption of Biederman’s (1987, 1990) recognition-by-components theory is that objects consist of basic shapes or components known as “geons” (geometric ions). Examples of geons are blocks, cylinders, spheres, arcs,
and wedges. According to Biederman (1987), there are approximately 36 different geons. That may seem suspiciously few to provide descriptions of all the objects we can recognise and identify. However, we can identify enormous numbers of spoken English words even though there are only approximately 44 phonemes (basic sounds) in the English language. This is because these phonemes can be arranged in almost endless combinations. The same is true of geons: part of the reason for the richness of the object descriptions provided by geons stems from the different possible spatial relationships among them. For example, a cup can be described by an arc connected to the side of a cylinder, and a pail can be described by the same two geons, but with the arc connected to the top of the cylinder.
The essence of recognition-by-components theory is shown in Figure 3.6. The stage we have discussed is that of the determination of the components or geons of a visual object and their relationships. When this information is available, it is matched with stored object representations or structural models containing information about the nature of the relevant geons, their orientations, sizes, and so on. The identification of any given visual object is determined by whichever stored object representation provides the best fit with the component- or geon-based information obtained from the visual object.
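The matching stage can be made concrete with a small sketch. The Python below is an illustration only, not Biederman’s model: the `Relation` record, the `STORED_MODELS` table, and the overlap score used in `fit` are all assumptions introduced here, and only the cup and pail descriptions come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One spatial relationship between two geons (hypothetical encoding)."""
    part: str   # geon that is attached, e.g. "arc"
    base: str   # geon it is attached to, e.g. "cylinder"
    where: str  # attachment site on the base, e.g. "side" or "top"


# Hypothetical stored structural models, limited to the two objects in the text.
STORED_MODELS = {
    "cup": frozenset({Relation("arc", "cylinder", "side")}),
    "pail": frozenset({Relation("arc", "cylinder", "top")}),
}


def fit(observed, stored):
    """Overlap between two sets of relations (1.0 means identical sets)."""
    union = observed | stored
    return len(observed & stored) / len(union) if union else 0.0


def recognise(observed):
    """Return the name of the stored model that best fits the observed description."""
    return max(STORED_MODELS, key=lambda name: fit(observed, STORED_MODELS[name]))


seen = frozenset({Relation("arc", "cylinder", "top")})
print(recognise(seen))  # pail
```

The same two geons yield different objects purely because the spatial relationship differs, which is the point of the cup and pail example.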
As indicated in Figure 3.6, the first step in object recognition is edge extraction. Biederman (1987, p. 117) described this as follows: “[There is] an early edge extraction stage, responsive to differences in surface characteristics, namely, luminance, texture, or colour, providing a line drawing description of the object.”
The next step is to decide how a visual object should be segmented to establish its parts or components. Biederman (1987) argued that the concave parts of an object’s contour are of particular value in accomplishing the task of segmenting the visual image into parts. The importance of concave and convex regions was discussed earlier (Vecera et al., 2004).
The other major element is to decide which edge information from an object possesses the important characteristic of remaining invariant across different viewing angles. According to Biederman (1987), there are five such invariant properties of edges:
• Curvature: points on a curve
• Parallel: sets of points in parallel
• Cotermination: edges terminating at a common point
• Symmetry: versus asymmetry
• Collinearity: points sharing a common line
According to the theory, the components or geons of a visual object are constructed from these invariant properties. For example, a cylinder has curved edges and two parallel edges connecting the curved edges, whereas a brick has three parallel edges and no curved edges. Biederman (1987, p. 116) argued that the five properties:
have the desirable properties that they are invariant over changes in orientation and can be determined from just a few points on each edge. Consequently, they allow a primitive (component or geon) to be extracted with great tolerance for variations of viewpoint, occlusions (obstructions), and noise.
Figure 3.6 An outline of Biederman’s recognition-by-components theory. Adapted from Biederman (1987). [Flow diagram with boxes labelled: edge extraction; detection of non-accidental properties; parsing of regions of concavity; determination of components; matching of components to object representations.]
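The cylinder and brick example can be written as a lookup from edge properties to geons. This is a minimal sketch under stated assumptions: `EdgeProperties`, `GEON_SIGNATURES`, and `classify_geon` are hypothetical names, and the table holds only the two geons described above rather than Biederman’s full set.

```python
from typing import NamedTuple, Optional


class EdgeProperties(NamedTuple):
    """Invariant edge properties of one component (hypothetical encoding)."""
    has_curved_edges: bool
    parallel_edges: int


# Signatures taken from the text's example; all other geons are omitted.
GEON_SIGNATURES = {
    EdgeProperties(has_curved_edges=True, parallel_edges=2): "cylinder",
    EdgeProperties(has_curved_edges=False, parallel_edges=3): "brick",
}


def classify_geon(props: EdgeProperties) -> Optional[str]:
    """Return the geon whose signature matches, or None if it is not in the table."""
    return GEON_SIGNATURES.get(props)


print(classify_geon(EdgeProperties(True, 2)))   # cylinder
print(classify_geon(EdgeProperties(False, 3)))  # brick
```

Because the signature contains no viewing angle, the same answer is returned whatever the viewpoint, which is the sense in which such properties are said to support viewpoint-invariant recognition.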
This part of the theory leads to the key prediction that object recognition is typically viewpoint-invariant, meaning an object can be recognised equally easily from nearly all viewing angles. (Note that Marr (1982) assumed that the three-dimensional model representation was viewpoint-invariant.) Why is this prediction made? Object recognition depends crucially on the identification of geons, which can be identified from a great variety of viewpoints. It follows that object recognition from a given viewing angle would be difficult only when one or more geons were hidden from view.
An important part of Biederman’s (1987) theory with respect to the invariant properties is the “non-accidental” principle. According to this principle, regularities in the visual image reflect actual (or non-accidental) regularities in the world rather than depending on accidental characteristics of a given viewpoint. Thus, for example, a two-dimensional symmetry in the visual image is assumed to indicate symmetry in the three-dimensional object. Use of the non-accidental principle occasionally leads to error. For example, a straight line in a visual image usually reflects a straight edge in the world, but it might not (e.g., a bicycle viewed end on).
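The bicycle example can be checked numerically. The sketch below assumes an idealised wheel (a unit circle), an orthographic projection that simply discards depth, and two hypothetical helpers, `project_wheel` and `is_collinear`; none of this comes from Biederman (1987).

```python
import math


def project_wheel(theta_deg, n=12):
    """Image of a unit circle rotated theta_deg about the vertical axis.

    Orthographic projection is assumed: the depth coordinate is discarded.
    """
    theta = math.radians(theta_deg)
    points = []
    for k in range(n):
        t = 2 * math.pi * k / n
        x = math.cos(t) * math.cos(theta)  # horizontal extent shrinks as the wheel turns
        y = math.sin(t)                    # vertical extent is unaffected
        points.append((x, y))
    return points


def is_collinear(points, tol=1e-9):
    """True if every image point lies on one straight line (within tol)."""
    x0, y0 = points[0]
    # The point farthest from the first one defines the candidate line.
    x1, y1 = max(points, key=lambda p: (p[0] - x0) ** 2 + (p[1] - y0) ** 2)
    return all(
        abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) <= tol
        for x, y in points
    )


print(is_collinear(project_wheel(0)))   # False: face on, the rim looks curved
print(is_collinear(project_wheel(45)))  # False: still curved (an ellipse)
print(is_collinear(project_wheel(90)))  # True: end on, the rim looks straight
```

Only the single end-on angle produces a straight image from a curved edge, which is why treating straightness in the image as a real property of the object is usually, but not always, correct.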
How do we recognise objects when conditions are suboptimal (e.g., an intervening object obscures part of the target object)? Biederman (1987) argued that the following factors are important in such conditions:
• The invariant properties (e.g., curvature, parallel lines) of an object can still be detected even when only parts of edges are visible.
• Provided the concavities of a contour are visible, there are mechanisms allowing the missing parts of the contour to be restored.
• There is generally much redundant information available for recognising complex objects, and so they can still be recognised when some geons or components are missing. For example, a giraffe could be identified from its neck even if its legs were hidden from view.
Evidence
The central prediction of Biederman’s (1987, 1990) recognition-by-components theory is that object recognition is viewpoint-invariant. Biederman and Gerhardstein (1993) obtained support for that prediction in an experiment in which a to-be-named object was preceded by a prime. Object naming was primed as well when there was an angular change of 135° as when the two views of the object were identical. Biederman and Gerhardstein used familiar objects, which have typically been encountered from multiple viewpoints, and this facilitated the task of dealing with different viewpoints. Not surprisingly, Tarr and Bülthoff (1995) obtained different findings when they used novel objects and gave observers extensive practice at recognising these objects from certain specified viewpoints. Object recognition was viewpoint-dependent, with performance being better when familiar viewpoints were used rather than unfamiliar ones.
It could be argued that developing expertise with given objects produces a shift from viewpoint-dependent to viewpoint-invariant recognition. However, Gauthier and Tarr (2002) found no evidence of such a shift. Observers received seven hours of practice in learning to identify Greebles (artificial objects belonging to various “families”; see Figure 3.7). Two Greebles were presented in rapid succession, and observers decided whether the second Greeble was the same as the first. The second Greeble was presented at the same orientation as the first, or at various other orientations up to 75°.
Gauthier and Tarr’s (2002) findings are shown in Figure 3.8. There was a general increase in speed as expertise developed. However, performance remained strongly viewpoint-dependent throughout the experiment. Such findings are hard to reconcile with Biederman’s emphasis on viewpoint-invariant recognition.
Support for recognition-by-components theory was reported by Biederman (1987). He presented observers with degraded line drawings of objects (see Figure 3.9). Object recognition was much harder to achieve when parts of the contour providing information about concavities were omitted than when other parts of the contour were deleted. This confirms that concavities are important for object recognition.
Support for the importance of geons was obtained by Cooper and Biederman (1993) and Vogels, Biederman, Bar, and Lorincz (2001). Cooper and Biederman (1993) asked observers to decide whether two objects presented in rapid succession had the same name (e.g., hat). There were two conditions in which the two objects shared the same name but were not identical: (1) one of the geons was changed (e.g., from a top hat to a bowler hat); and (2) the second object was larger or smaller than the first. Task performance was significantly worse when a geon changed than when it did not. Vogels et al. (2001) assessed the response of individual neurons in inferior temporal cortex to changes in a geon compared to changes in the size of an object with no change in the geon. Some neurons responded more to geon changes than to changes in object size, thus providing some support for the reality of geons.
According to the theory, object recognition depends on edge information rather than on surface information (e.g., colour). However, Sanocki, Bowyer, Heath, and Sarkar (1998) pointed out that edge-extraction processes are less likely to lead to accurate object recognition when objects are presented in the context of other objects rather than on their own. This is because it can be difficult to decide which edges belong to which object when several objects are presented together. Sanocki et al. presented observers briefly with objects in the form of line drawings or full-colour photographs, and these objects were presented in isolation or in context. Object recognition was much worse with the edge drawings than with the colour photographs, especially when objects were presented in context. Thus, Biederman (1987) exaggerated the role of edge-based extraction processes in object recognition.
Figure 3.7 Examples of “Greebles”. In the top row five different “families” are represented. For each family, a member of each “gender” is shown. Images provided courtesy of Michael J. Tarr (Carnegie Mellon University, Pittsburgh, PA), see www.tarrlab.org [Image grid: rows labelled Males and Females; columns labelled Family 1 to Family 5.]
Figure 3.8 Speed of Greeble matching as a function of stage of training and difference in orientation between successive Greeble stimuli. Based on data in Gauthier and Tarr (2002). [Graph: mean speed of Greeble matching (ms) plotted against shift in orientation between stimuli (0, 25, 50, and 75 degrees), with separate lines for early in training, middle of training, and end of training.]
Look back at Figure 3.6. It shows that recognition-by-components theory strongly emphasises bottom-up processes. Information extracted from the visual stimulus is used to construct a geon-based representation that is then compared against object representations stored in long-term memory. According to the theory, top-down processes depending on factors such as expectation and knowledge do not influence the early stages of object recognition. In fact, however, top-down processes are often very important (see Bar et al., 2006, for a review). For example, Palmer (1975) presented a picture of a scene (e.g., a kitchen) followed by the very brief presentation of the picture of an object. This object was either appropriate to the context (e.g., a loaf) or inappropriate (e.g., a mailbox or drum). There was also a further condition in which no contextual scene was presented. The probability of identifying the object correctly was greatest when the object was appropriate to the context, intermediate with no context, and lowest when the object was contextually inappropriate.
Evaluation
A central puzzle is how we manage to identify objects in spite of substantial differences among the members of any given category in shape, size, and orientation. Biederman’s (1987) recognition-by-components theory provides a reasonably plausible account of object recognition explaining how this is possible. The assumption that geons or geon-like components are involved in visual object recognition seems plausible. In addition, there is evidence that the identification of concavities and edges is of major importance in object recognition.
Biederman’s theoretical approach possesses various limitations. First, the theory focuses primarily on bottom-up processes triggered directly by the stimulus input. By so doing, it de-emphasises the importance of top-down processes based on expectations and knowledge. This important limitation is absent from several recent theories (e.g., Bar, 2003; Lamme, 2003).
Second, it only accounts for fairly unsubtle perceptual discriminations. Thus, it explains how we decide whether the animal in front of us is a dog or cat, but not how we decide whether it is our dog or cat. We can easily make discriminations within categories such as identifying individual faces, but Biederman, Subramaniam, Bar, Kalocsai, and Fiser (1999) admitted that the theory is not applicable to face recognition.
Third, it is assumed within recognition-by-components theory that object recognition generally involves matching an object-centred representation independent of the observer’s viewpoint with object information stored in long-term memory. However, as discussed below, there is considerable evidence for viewpoint-dependent object recognition (e.g., Gauthier & Tarr, 2002; Tarr & Bülthoff, 1995). Thus, the theory is oversimplified.
Figure 3.9 Intact figures (left-hand side), with degraded line drawings either preserving (middle column) or not preserving (far-right column) parts of the contour providing information about concavities. Adapted from Biederman (1987).
Fourth, Biederman’s theory assumes that objects consist of invariant geons, but object recognition is actually much more flexible than that. As Hayward and Tarr (2005, p. 67) pointed out, “You can take almost any object, put a working light-bulb on the top, and call it a lamp . . . almost anything in the image might constitute a feature in appropriate conditions.” The shapes of some objects (e.g., clouds) are so variable that they do not have identifiable geons.