Objective Functions for Feature Discrimination
Pascal Fua a n d A n d r e w J . H a n s o n * Artificial Intelligence Center
SRI International
333 Ravenswood Avenue
Menlo Park, California 94025
A b s t r a c t
We propose and evaluate a class of objective functions t h a t rank hypotheses for feature la- bels. Our approach takes into account the representation cost and quality of the shapes themselves, and balances the geometric require-
ments against the photometric evidence T h i s balance is essential for any system using un- derconstrained or generic feature models. We introduce examples of specific models allowing the actual c o m p u t a t i o n of the terms in the ob- jective f u n c t i o n , and show how this framework
leads n a t u r a l l y to control parameters that have a clear semantic meaning. We illustrate the properties of our objective functions on syn-
thetic and real images.
1 I n t r o d u c t i o n
A l l approaches to the problem of extracting features f r o m images can in principle be phrased in terms of de- cision theory; however, the concepts of decision theory are very hard to p u t i n t o practice because of the diffi- culty of evaluating the required probability measures.
Therefore, most practical approaches to model-based vision for both specific models, e.g., [Binford, 1982,
Bolles and Horaud, 1986, Brooks, 1981, Shneier et a/., 1986], and generic models, e.g., [Fischler et r/./., 1981, O h t a et a/., 1979, M c K e o w n and Denlinger, 1984, Huer-
tas and Nevatia, 1988], rely on heuristic measures to select among competing scene parses. These methods,
although they may be effective in the context for which they were designed, are extremely hard to extend and require the use of many parameters whose significance is not clearly understood.
On the other h a n d , approaches such as those of Feldman and Yakimovsky [1974], Georgeff and Wallace [1984], and Rissanen [1983, 1987] provide a sound theo- retical basis for the decision problem but offer few prac- tical computational methods for dealing w i t h complex scenes in real images.
In this paper, we focus on an objective function ap- proach to the task of r a n k i n g scene-labeling hypotheses.
*This research was supported in part by the Defense Advanced Research Projects Agency under Contract Nos.
MDA903-86-C-0084 and DACA76-85-C-0004.
For brevity, we o m i t discussion of the related problem of hypothesis-generation, and refer the reader to [Fua and Hanson, 1989]. We define a class of objective func- tions based upon theoretical arguments similar to those of Georgeff, Wallace and Rissanen, and show t h a t the
required probability estimates can actually be computed in the context of a few natural assumptions.
Our formulation has many desirable features, but is not by itself a complete solution to the feature extrac- tion problem. To be effective it must be coupled w i t h a robust hypothesis generation mechanism and an efficient o p t i m i z a t i o n procedure. Furthermore, one would like to have models for geometric quality analysis much more complex than those presented here. It should come as no surprise that discovering good models and hypothesis-
generation strategies are the most difficult tasks in the development of a system a t t e m p t i n g to perform shape perception. The strength of our approach is that it pro- vides a unified framework t h a t clearly exposes the criti- cal components and characteristics of model-based vision systems.
2 D e r i v a t i o n o f t h e O b j e c t i v e F u n c t i o n
The goal of feature extraction is to parse a scene in terms of objects conforming to particular models. To discriminate among competing parses, an objective func- t i o n must be able to measure the goodness of fit to feature models t h a t include such characteristics as area photometry, edge photometry, shape, and semantic re- lationships. In this section, we define a basic class of models, discuss the parameters we expect to control our objective functions, derive the theoretical forms of the objective functions themselves, and provide an interpre-
t a t i o n of the resulting functions in terms of i n f o r m a t i o n theory.
2 . 1 O b j e c t M o d e l i n g
For the purposes of this work, we define a model to be a geometric description of an object in the world charac- terized by its geometric constraints and its photometric signature; we define the evidence for such objects in dig- i t a l images to be a collection of delineated areas corre- sponding to m a j o r object parts, together w i t h associated quantities directly derivable f r o m the pixel values in such areas.
1596 Vision and Robotics
Here F is what we call the encoding-effectiveness of the set of models The first term in F is the number of bits needed to describe the evidence in the absence of the model, while the second term gives the number of bits needed to describe the evidence in terms of the model.
The term effectiveness is thus motivated by the fact that F represents the number of bits saved by representing the evidence using the model, and the fact t h a t F increases
as the fit improves.
G is the number of bits needed to encode the evidence- free model representation information, and quantifies the elegance of the chosen set of model instances as well as
their dependencies.
R e m a r k s
F e a t u r e E x t r a c t i o n V i e w e d a s a n O p t i m i z a t i o n P r o b l e m . The problem of finding the best parse of a scene can now be rephrased as the problem of optimiz-
ing over sets of hypotheses evaluated by Eq. (3). Global optimization corresponds to a blind search procedure, which searches all possibilities without a t t e m p t i n g to determine which candidates are more likely than o t h - ers. In practice, the search space may be far too large
for this type of search. Since intelligent heuristics can overcome this drawback, a natural way to design an ap-
plication system is to incorporate hypothesis-generation algorithms that project from the space of all possible hy- potheses onto a subspace of very likely hypotheses. Such projections have the side effect of reducing the discrimi- natory burden placed upon the objective function.
G e n e r i c M o d e l s R e q u i r e P h o t o m e t r i c / G e o m e t r i c B a l a n c e . When a model's geometry is completely de- termined beforehand, as it is for template-matching ap- proaches to automatic shape recognition, there is no need for the geometric information component of the objec-
tive function, since it is constant and m a x i m u m like- lihood analysis alone w i l l do. The geometric terms in the objective function begin to play a critical role when we utilize models defined by a set of general geometric constraints in place of a specific shape template. Such generic models, w i t h arbitrarily large numbers of param- eters, require objective functions like ours that balance
their geometric aspects against their photometry.
3 Photometry: Computing F
T w o of the main characteristics of an object in an i m - age are its interior photometry and its contrast w i t h the background, which produces edges. Here we explore sim- ple models for the area and for the edges of an object that have proven useful in analyzing imagery. W h e n working w i t h stereo pairs of images, we also incorporate a stereoscopic model, and compute the depth parameters of an object in the scene by optimizing the corresponding stereo effectiveness.
1598 Vision and Robotics
Fua and Hanson 1599
3.2 E d g e M o d e l
We a d o p t the definition [Rosenfeld, 1970, Haralick, 1984, Canny, 1986] of edge pixels as m a x i m a of the local image derivative, and we classify edges according to whether or not an edge b o u n d a r y pixel conforms to this d e f i n i t i o n . In the absence of a m o d e l , it would take 1 b i t per pixel to encode this i n f o r m a t i o n . If we now use the 1-parameter
model t h a t takes into account the p r o p o r t i o n of m a x i - m a l edge pixels, the most efficient H u f f m a n n [ H a m m i n g ,
1985] code for this i n f o r m a t i o n w o u l d require
As in the case of the area t e r m , we have neglected the ( 1 / 2 ) log( L / s ) bits required to encode the one i n t e r n a l parameter of the model [Rissanen, 1983, Schwarz, 1978].
As shown in the r i g h t c o l u m n of Figure 2, this edge score is m a x i m a l when all b o u n d a r y pixels conform to our edge m o d e l , and degrades as the p r o p o r t i o n of such pixels diminishes. T h i s model has proven effective in our a p p l i c a t i o n of these techniques to aerial images because it provides a measure of edge-quality t h a t does not i n - clude an image-dependent threshold on edge s t r e n g t h .
We have also experimented w i t h an edge model t h a t requires the gradient direction be n o r m a l to the object o u t l i n e , and computes the encoding cost of deviations f r o m the n o r m a l vector. B o t h models yield s i m i l a r rank- ings.
3.3 S t e r e o g r a p h y
T h e simplest stereo model assumes t h a t corresponding pixels have the same grey-levels in b o t h images. In prac- tice, to c o m p u t e the stereo effectiveness of E q . (6), we determine the number of bits required to encode the pro- jected patch in the second image, while k n o w i n g its pho-
t o m e t r y in the first. We c o m p u t e the deviations of the intensities f r o m their predicted values and encode t h e m
using the same Gaussian m o d e l w i t h anomalies t h a t we used for the area t e r m . T h e a n o m a l y d i s c o u n t i n g is re- quired because of the possibility of occlusions. We also take i n t o account the edge q u a l i t y of the contour in the second image and its edge-encoding effectiveness.
T h e stereographic effectiveness t e r m Fs is therefore the s u m of an edge and an area t e r m :
where A2 is the area of the projected patch in the second image, L2 is its b o u n d a r y l e n g t h , and k2 and KE2 are the corresponding model encoding costs.
We can use the effectiveness measure (12) to o p t i - mize the elevation parameters of a t w o - d i m e n s i o n a l de- l i n e a t i o n found in the first image. T h e search space is ex-
tremely constrained since the projected shape is k n o w n and the only degree of freedom is epipolar m o t i o n in the second image.
Let us consider the stereo pair of images in F i g u r e 3(a,c). A s s u m i n g t h a t the roof is h o r i z o n t a l , we p l o t in Figure 3(b) the value of Fs as a f u n c t i o n of the assumed disparity between the candidate o u t l i n e in the left image (a) and the projected o u t l i n e in the r i g h t image. We note t h a t Fs has a sharp peak for the correct, m a t c h o u t l i n e d in (c).
1600 Vision and Robotics
Figure 4: (a) R a t i o of single-square to double-rectangle score as a. f u n c t i o n of noise variance (40, 20, 10). (b) S i m i l a r plot c o m p a r i n g the score of the square interpre-
t a t i o n to the "U' i n t e r p r e t a t i o n .
In Figure 4 ( a ) , we show how the length t e r m (14), which gives preference to compact objects, influences the parse when a s p l i t square is interpreted alternately as a single c o m p a c t square or t w o adjacent rectangles.
T h e b o t t o m g r a p h takes three images, w i t h noise vari- ance 40, 20 and 10, and plots the ratios (two-rectangle score)/(square score) as a f u n c t i o n of scale for fixed 7 — 1 . Note t h a t increasing the scale i n this example a m o u n t s to l o o k i n g at a subsampled image in which fine details arc no longer visible. T h e interesting value of the scale is t h a t for w h i c h the scores are equal, i.e., the ratio
is one. T h u s we p l o t in the upper graphs the locus of p o i n t s where the r a t i o is u n i t y as a f u n c t i o n of 7 as well as scale. In F i g u r e 4 ( b ) , we carry o u t a s i m i l a r p l o t for an image of a square w i t h a missing p o r t i o n t h a t makes it
" U " - s h a p e d . W e see t h a t the r a t i o ( " U " score)/(square score) behaves so t h a t the square i n t e r p r e t a t i o n is pre- ferred at a large scale in the best image, and at a much
lower scale in the noisier images.
5 Examples
W e have a p p l i e d the p r i n c i p l e o f o b j e c t i v e - f u n c t i o n o p t i - m i z a t i o n t o o p e r a t o r - i n i t i a t e d shape e x t r a c t i o n and t o a u t o m a t e d e x t r a c t i o n of generic cartographic features such as b u i l d i n g s f r o m aerial imagery, described else- where [Fua a n d H a n s o n , 1989]. In the a u t o m a t e d ap-
For example, for Figure 5(a), the system produces t w o conflicting interpretations: one in terms of a. single p o l y - gon enclosing both wings as in Figure 5(b) the other in terms of two polygons, one for each w i n g as in Figure 5(c). At low scale the latter w i l l be preferred because of its better fit to the photometric data, while at high scale the former w i l l dominate due to its lower geometric cost.
In Figure (J, we show the hypotheses generated and retained by the svstem for scale values of (6, 7 and 8, w i t h fixed shape coefficient; for this scene, scale 8 clearly gives the best, parse.
From the examples shown in this section, we can f o r m an i n t u i t i v e understanding of the scale parameter: s tunes the scale not of the physical size of the o b j e c t , but the scale of its quality. Objects w i t h close fits to the strict model are selected first as we r a m p the scale d o w n f r o m a high value.
6 Conclusion
In this work, we have shown how an i n f o r m a t i o n the- oretic approach to the feature e x t r a c t i o n p r o b l e m can
be formulated in such a way as to p e r m i t realistic c o m - p u t a t i o n a l techniques for the required p r o b a b i l i t y esti- mates. Our approach provides a f i r m theoretical basis for understanding complex feature e x t r a c t i o n problems t h a t require a balance between p h o t o m e t r i c evidence and geometric quality. Of course, the objective f u n c t i o n ap- proach given here cannot by itself lead to good solutions
to the feature extraction problem, b u t must be teamed
Fua and Hanson 1601
w i t h a competent, ( h u m a n or a u t o m a t e d ) hypothesis gen- erator [Fua and Hanson, 1989]. A m o n g the goals of f u -
ture work w i l l be the extension of the range of our models and the t r e a t m e n t of complex semantic dependencies in terms of their information-theoretic context.
References
[ B i n f o r d , 1982] T . O . B i n f o r d . Survey of model-based image analysis systems. The International Journal of
Robotics Research, 1 (1 ):18 64, S p r i n g 1982.
[Bolles and H o r a u d , 1986] R.C. Bolles and R. H o r a u d . 3dpo, a three-dimensional part o r i e n t a t i o n system.
International Journal of Robotics Research, 5:3- 26, 1986.
[Brooks, 1981] R.A. Brooks. Symbolic reasoning a m o n g 3-d models and 2-d images. .Artificial Intelligence
Journal, 16, 1981.
[Canny, 1986] .1. Canny. A c o m p u t a t i o n a l approach to edge detection. IEEE Trans. RAMI, 8:679-698, 1986.
[Feldman and Y a k i m o v s k y , 1974]
J . A . Feldman and Y. Yakimovsky. Decision theory and artificial intelligence: I. a semantics-based region analyzer. Artificial Intelligence, 5:349 3 7 1 , 1974.
[Fischler et al, 1981] M . A . Fischler, .J.M. T e n e n b a u n i , and H.C. Wolf. Detection of roads and linear struc- tures in low-resolution aerial imagery using a m u l t i -
source knowledge i n t e g r a t i o n technique. Computer Graphics and Image Processing, 15:201 223, 1981.
[Fua and Hanson, 1989] P. Fua and A . J . Hanson. O b - jective functions for feature d i s c r i m i n a t i o n : A p p l i - cation to semiautomated and a u t o m a t e d feature ex-
t r a c t i o n . In Proceedings of the Image Understanding Workshop, Palo A l t o , C A , May 1989.
[Georgeff and Wallace, 1984] M.P. Georgeff and C.S.
Wallace. A general selection criterion for i n d u c t i v e inference. In T. O'Shea, editor, Proceedings of Ad-
vances in Artificial Intelligence, Pisa, Italy, A m s t e r - d a m , September 1984. North H o l l a n d .
[ H a m m i n g , 1985] R . W . H a m m i n g . Coding and Informa- tion Theory. Prentice H a l l , New Jersey, 1985.
[Haralick, 1984] R . M . Haralick. D i g i t a l step edges f r o m zero crossings of second directional derivatives. IEEE
Trans. PAMI, 6:58-68, 1984.