A Computational Investigation into the Human Representation and Processing of Visual Information

(1)

Project Editor: Judith Wilson

Copy Editor: Paul Monsour

Production Coordinator: Linda Jupiter

Illustration Coordinator: Richard Quiñones

Designer: Ron Newcomer

Artists: Catherine Brandei and Victor Royer

Compositor: Graphic Typesetting Service

Printer and Binder: The Maple-Vail Book Manufacturing Group

Library of Congress Cataloging in Publication Data

Marr, David, 1945-1980. Vision.

Bibliography: p. Includes index.

1. Vision--Data processing. 2. Vision--Mathematical models. 3. Human information processing. I. Title.

QP475.M27 1982 152.1'4028'54 81-15076

ISBN 0-7167-1567-8

No part of this book may be reproduced by any mechanical, photographic, or electronic process, or in the form of a phonographic recording, nor may it be stored in a retrieval system, transmitted, or otherwise copied for public or private use, without written permission from the publisher.

Printed in the United States of America


A Computational Investigation into the Human Representation and Processing of Visual Information

David Marr

Late of the Massachusetts Institute of Technology


(2)

CHAPTER 1

The Philosophy and the Approach

1.1 BACKGROUND

The problems of visual perception have attracted the curiosity of scientists for many centuries. Important early contributions were made by Newton (1704), who laid the foundations for modern work on color vision, and Helmholtz (1910), whose treatise on physiological optics generates interest even today. Early in this century, Wertheimer (1912, 1923) noticed the apparent motion not of individual dots but of wholes, or “fields,” in images presented sequentially as in a movie. In much the same way we perceive the migration across the sky of a flock of geese: the flock somehow constitutes a single entity, and is not seen as individual birds. This observation started the Gestalt school of psychology, which was concerned with describing the qualities of wholes by using terms like solidarity and distinctness, and with trying to formulate the “laws” that governed the creation of these wholes. The attempt failed for various reasons, and the Gestalt school dissolved into the fog of subjectivism. With the death of the school, many of its early and genuine insights were unfortunately lost to the mainstream of experimental psychology.

Figure 1-1. A random-dot stereogram of the type used extensively by Bela Julesz. The left and right images are identical except for a central square region that is displaced slightly in one image. When fused binocularly, the images yield the impression of the central square floating in front of the background.

Since then, students of the psychology of perception have made no serious attempts at an overall understanding of what perception is, concentrating instead on the analysis of properties and performance. The trichromatism of color vision was firmly established (see Brindley, 1970), and the preoccupation with motion continued, with the most interesting developments perhaps being the experiments of Miles (1931) and of Wallach and O’Connell (1953), which established that under suitable conditions an unfamiliar three-dimensional shape can be correctly perceived from only its changing monocular projection.*

The development of the digital electronic computer made possible a similar discovery for binocular vision. In 1960 Bela Julesz devised computer-generated random-dot stereograms, which are image pairs constructed of dot patterns that appear random when viewed monocularly but fuse when viewed one through each eye to give a percept of shapes and surfaces with a clear three-dimensional structure. An example is shown in Figure 1-1. Here the image for the left eye is a matrix of black and white squares generated at random by a computer program. The image for the right eye is made by copying the left image, shifting a square-shaped region at its center slightly to the left, and then providing a new random pattern to fill the gap that the shift creates. If each of the eyes sees only one matrix, as if the matrices were both in the same physical place, the result is the sensation of a square floating in space. Plainly such percepts are caused solely by the stereo disparity between matching elements in the images presented to each eye; from such experiments, we know that the analysis of stereoscopic information, like the analysis of motion, can proceed independently in the absence of other information. Such findings are of critical importance because they help us to subdivide our study of perception into more specialized parts which can be treated separately. I shall refer to these as independent modules of perception.
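The construction of a random-dot stereogram described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Julesz's program: the function name `make_stereogram`, its parameters, and the default sizes are assumptions made for the example, and pixels are simply 0 (white) or 1 (black).

```python
import random

def make_stereogram(size=16, square=6, shift=2, seed=0):
    """Return (left, right) images as lists of rows of 0/1 pixels.

    Hypothetical helper for illustration; assumes the shifted square
    stays inside the image, i.e. (size - square) // 2 >= shift.
    """
    rng = random.Random(seed)
    # Left image: every pixel is black (1) or white (0) at random.
    left = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    # Right image starts as an exact copy of the left image.
    right = [row[:] for row in left]
    top = (size - square) // 2    # first row of the central square
    first = (size - square) // 2  # first column of the central square
    for r in range(top, top + square):
        # Shift the central square region to the left by `shift` pixels.
        for c in range(first, first + square):
            right[r][c - shift] = left[r][c]
        # Fill the gap uncovered by the shift with a new random pattern.
        for c in range(first + square - shift, first + square):
            right[r][c] = rng.randint(0, 1)
    return left, right
```

Viewed alone, each image is just noise; the only trace of the square is the horizontal disparity between matching dots in the two images.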

The most recent contribution of psychophysics has been of a different kind but of equal importance. It arose from a combination of adaptation and threshold detection studies and originated from the demonstration by Campbell and Robson (1968) of the existence of independent, spatial-frequency-tuned channels (that is, channels sensitive to intensity variations in the image occurring at a particular scale or spatial interval) in the early stages of our perceptual apparatus. This paper led to an explosion of articles on various aspects of these channels, which culminated ten years later with quite satisfactory quantitative accounts of the characteristics of the first stages of visual perception (Wilson and Bergen, 1979). I shall discuss this in detail later on.

Recently a rather different approach has attracted considerable attention. In 1971, Roger N. Shepard and Jacqueline Metzler made line drawings of simple objects that differed from one another either by a three-dimensional rotation or by a rotation plus a reflection (see Figure 1-2). They asked how long it took to decide whether two depicted objects differed by a rotation and a reflection or merely a rotation. They found that the time taken depended on the three-dimensional angle of rotation necessary to bring the two objects into correspondence. Indeed, the time varied linearly with this angle. One is led thereby to the notion that a mental rotation of sorts is actually being performed: that a mental description of the first shape in a pair is being adjusted incrementally in orientation until it matches the second, such adjustment requiring greater time when greater angles are involved.

The significance of this approach lies not so much in its results, whose interpretation is controversial, as in the type of questions it raised. For until then, the notion of a representation was not one that visual psychologists took seriously. This type of experiment meant that the notion had to be considered. Although the early thoughts of visual psychologists were naive compared with those of the computer vision community, which had had to face the problem of representation from the beginning, it was not long before the thinking of psychologists became more sophisticated (see Shepard, 1979).

Figure 1-2. Some drawings similar to those used in Shepard and Metzler’s experiments on mental rotation. The ones shown in (a) are identical, as a clockwise turning of this page by 80° will readily prove. Those in (b) are also identical, and again the relative angle between the two is 80°. Here, however, a rotation in depth will make the first coincide with the second. Finally, those in (c) are not at all identical, for no rotation will bring them into congruence. The time taken to decide whether a pair is the same was found to vary linearly with the angle through which one figure must be rotated to be brought into correspondence with the other. This suggested to the investigators that a stepwise mental rotation was in fact being performed by the subjects of their experiments.

(4)

was stimulated, as one might have expected from anatomical studies. This led to the view that the peripheral nerve fibers could be thought of as a simple mapping supplying the sensorium with a copy of the physical events at the body surface (Adrian, 1947). The rest of the explanation, it was thought, could safely be left to the psychologists.

The next development was the technical improvement in amplification that made possible the recording of single neurons (Granit and Svaetichin, 1939; Hartline, 1938; Galambos and Davis, 1943). This led to the notion of a cell’s “receptive field” (Hartline, 1940) and to the Harvard School’s famous series of studies of the behavior of neurons at successively deeper levels of the visual pathway (Kuffler, 1953; Hubel and Wiesel, 1962, 1968). But perhaps the most exciting development was the new view that questions of psychological interest could be illuminated and perhaps even explained by neurophysiological experiments. The clearest early example of this was Barlow’s (1953) study of ganglion cells in the frog retina, and I cannot put it better than he did:

If one explores the responsiveness of single ganglion cells in the frog’s retina using handheld targets, one finds that one particular type of ganglion cell is most effectively driven by something like a black disc subtending a degree or so moved rapidly to and fro within the unit’s receptive field. This causes a vigorous discharge which can be maintained without much decrement as long as the movement is continued. Now, if the stimulus which is optimal for this class of cells is presented to intact frogs, the behavioural response is often dramatic; they turn towards the target and make repeated feeding responses consisting of a jump and snap. The selectivity of the retinal neurons and the frog’s reaction when they are selectively stimulated, suggest that they are “bug detectors” (Barlow 1953) performing a primitive but vitally important form of recognition.

The result makes one suddenly realize that a large part of the sensory machinery involved in a frog’s feeding responses may actually reside in the retina rather than in mysterious “centres” that would be too difficult to understand by physiological methods. The essential lock-like property resides in each member of a whole class of neurons and allows the cell to discharge only to the appropriate key pattern of sensory stimulation. Lettvin et al. (1959) suggested that there were five different classes of cell in the frog, and Barlow, Hill and Levick (1964) found an even larger number of categories in the rabbit. [Barlow et al.] called these key patterns “trigger features,” and Maturana et al. (1960) emphasized another important aspect of the behaviour of these ganglion cells; a cell continues to respond to the same trigger feature in spite of changes in light intensity over many decades. The properties of the retina are such that a ganglion cell can, figuratively speaking, reach out and determine that something specific is happening in front of the eye. Light is the agent by which it does this, but it is the detailed pattern of the light that carries the information, and the overall level of illumination prevailing at the time is almost totally disregarded. (p. 373)

Barlow (1972) then goes on to summarize these findings in the following way:

The cumulative effect of all the changes I have tried to outline above has been to make us realise that each single neuron can perform a much more complex and subtle task than had previously been thought (emphasis added). Neurons do not loosely and unreliably remap the luminous intensities of the visual image onto our sensorium, but instead they detect pattern elements, discriminate the depth of objects, ignore irrelevant causes of variation and are arranged in an intriguing hierarchy. Furthermore, there is evidence that they give prominence to what is informationally important, can respond with great reliability, and can have their pattern selectivity permanently modified by early visual experience. This amounts to a revolution in our outlook. It is now quite inappropriate to regard unit activity as a noisy indication of more basic and reliable processes involved in mental operations: instead, we must regard single neurons as the prime movers of these mechanisms. Thinking is brought about by neurons and we should not use phrases like “unit activity reflects, reveals, or monitors thought processes,” because the activities of neurons, quite simply, are thought processes.

This revolution stemmed from physiological work and makes us realize that the activity of each single neuron may play a significant role in perception. (p. 380)

This aspect of his thinking led Barlow to formulate the first and most important of his five dogmas: ‘A description of that activity of a single nerve cell which is transmitted to and influences other nerve cells and of a nerve cell’s response to such influences from other cells, is a complete enough description for functional understanding of the nervous system. There is nothing else “looking at” or controlling this activity, which must therefore provide a basis for understanding how the brain controls behaviour’ (Barlow, 1972, p. 380).

(5)


Rocha-Miranda, and Bender (1972), who found “hand-detectors” in the inferotemporal cortex, seemed to show that the application of the reductionist approach would not be limited just to the early parts of the visual pathway.

It was, of course, recognized that physiologists had been lucky: If one probes around in a conventional electronic computer and records the behavior of single elements within it, one is unlikely to be able to discern what a given element is doing. But the brain, thanks to Barlow’s first dogma, seemed to be built along more accommodating lines: people were able to determine the functions of single elements of the brain. There seemed no reason why the reductionist approach could not be taken all the way.

I was myself fully caught up in this excitement. Truth, I also believed, was basically neural, and the central aim of all research was a thorough functional analysis of the structure of the central nervous system. My enthusiasm found expression in a theory of the cerebellar cortex (Marr, 1969). According to this theory, the simple and regular cortical structure is interpreted as a simple but powerful memorizing device for learning motor skills; because of a simple combinatorial trick, each of the 15 million Purkinje cells in the cerebellum is capable of learning over 200 different patterns and discriminating them from unlearned patterns. Evidence is gradually accumulating that the cerebellum is involved in learning motor skills (Ito, 1978), so that something like this theory may in fact be correct.

The way seemed clear. On the one hand we had new experimental techniques of proven power, and on the other, the beginnings of a theoretical approach that could back them up with a fine analysis of cortical structure. Psychophysics could tell us what needed explaining, and the recent advances in anatomy (the Fink-Heimer technique from Nauta’s laboratory and the recent successful deployment by Szentagothai and others of the electron microscope) could provide the necessary information about the structure of the cerebral cortex.

But somewhere underneath, something was going wrong. The initial discoveries of the 1950s and 1960s were not being followed by equally dramatic discoveries in the 1970s. No neurophysiologists had recorded new and clear high-level correlates of perception. The leaders of the 1960s had turned away from what they had been doing: Hubel and Wiesel concentrated on anatomy, Barlow turned to psychophysics, and the mainstream of neurophysiology concentrated on development and plasticity (the concept that neural connections are not fixed) or on a more thorough analysis of the cells that had already been discovered (for example, Bishop, Coombs, and Henry, 1971; Schiller, Finlay, and Volman, 1976a, 1976b), or on cells in species like the owl (for example, Pettigrew and Konishi, 1976). None of the new studies succeeded in elucidating the function of the visual cortex.

It is difficult to say precisely why this happened, because the reasoning was never made explicit and was probably largely unconscious. However, various factors are identifiable. In my own case, the cerebellar study had two effects. On the one hand, it suggested that one could eventually hope to understand cortical structure in functional terms, and this was exciting. But at the same time the study has disappointed me, because even if the theory was correct, it did not much enlighten one about the motor system: it did not, for example, tell one how to go about programming a mechanical arm. It suggested that if one wishes to program a mechanical arm so that it operates in a versatile way, then at some point a very large and rather simple type of memory will prove indispensable. But it did not say why, nor what that memory should contain.

The discoveries of the visual neurophysiologists left one in a similar situation. Suppose, for example, that one actually found the apocryphal grandmother cell.* Would that really tell us anything much at all? It would tell us that it existed (Gross’s hand-detectors tell us almost that) but not why or even how such a thing may be constructed from the outputs of previously discovered cells. Do the single-unit recordings (the simple and complex cells) tell us much about how to detect edges or why one would want to, except in a rather general way through arguments based on economy and redundancy? If we really knew the answers, for example, we should be able to program them on a computer. But finding a hand-detector certainly did not allow us to program one.

As one reflected on these sorts of issues in the early 1970s, it gradually became clear that something important was missing that was not present in either of the disciplines of neurophysiology or psychophysics. The key observation is that neurophysiology and psychophysics have as their business to describe the behavior of cells or of subjects but not to explain such behavior. What are the visual areas of the cerebral cortex actually doing? What are the problems in doing it that need explaining, and at what level of description should such explanations be sought?

The best way of finding out the difficulties of doing something is to try to do it, so at this point I moved to the Artificial Intelligence Laboratory at MIT, where Marvin Minsky had collected a group of people and a powerful computer for the express purpose of addressing these questions.

(6)


The first great revelation was that the problems are difficult. Of course, these days this fact is a commonplace. But in the 1960s almost no one realized that machine vision was difficult. The field had to go through the same experience as the machine translation field did in its fiascoes of the 1950s before it was at last realized that here were some problems that had to be taken seriously. The reason for this misperception is that we humans are ourselves so good at vision. The notion of a feature detector was well established by Barlow and by Hubel and Wiesel, and the idea that extracting edges and lines from images might be at all difficult simply did not occur to those who had not tried to do it. It turned out to be an elusive problem: Edges that are of critical importance from a three-dimensional point of view often cannot be found at all by looking at the intensity changes in an image. Any kind of textured image gives a multitude of noisy edge segments; variations in reflectance and illumination cause no end of trouble; and even if an edge has a clear existence at one point, it is as likely as not to fade out quite soon, appearing only in patches along its length in the image. The common and almost despairing feeling of the early investigators like B.K.P. Horn and T.O. Binford was that practically anything could happen in an image and furthermore that practically everything did.

Three types of approach were taken to try to come to grips with these phenomena. The first was unashamedly empirical, associated most with Azriel Rosenfeld. His style was to take some new trick for edge detection, texture discrimination, or something similar, run it on images, and observe the result. Although several interesting ideas emerged in this way, including the simultaneous use of operators* of different sizes as an approach to increasing sensitivity and reducing noise (Rosenfeld and Thurston, 1971), these studies were not as useful as they could have been because they were never accompanied by any serious assessment of how well the different algorithms performed. Few attempts were made to compare the merits of different operators (although Fram and Deutsch, 1975, did try), and an approach like trying to prove mathematically which operator was optimal was not even attempted. Indeed, it could not be, because no one had yet formulated precisely what these operators should be trying to do. Nevertheless, considerable ingenuity was shown. The most clever was probably Hueckel’s (1973) operator, which solved in an ingenious way the problem of finding the edge orientation that best fit a given intensity change in a small neighborhood of an image.

*Operator refers to a local calculation to be applied at each location in the image, making use of the intensity there and in the immediate vicinity.

The second approach was to try for depth of analysis by restricting the scope to a world of single, illuminated, matte white toy blocks set against a black background. The blocks could occur in any shapes provided only that all faces were planar and all edges were straight. This restriction allowed more specialized techniques to be used, but it still did not make the problem easy. The Binford-Horn line finder (Horn, 1973) was used to find edges, and both it and its sequel (described in Shirai, 1973) made use of the special circumstances of the environment, such as the fact that all edges there were straight.

These techniques did work reasonably well, however, and they allowed a preliminary analysis of later problems to emerge: roughly, what does one do once a complete line drawing has been extracted from a scene? Studies of this had begun sometime before with Roberts (1965) and Guzman (1968), and they culminated in the works of Waltz (1975) and Mackworth (1973), which essentially solved the interpretation problem for line drawings derived from images of prismatic solids. Waltz’s work had a particularly dramatic impact, because it was the first to show explicitly that an exhaustive analysis of all possible local physical arrangements of surfaces, edges, and shadows could lead to an effective and efficient algorithm for interpreting an actual image. Figure 1-3 and its legend convey the main ideas behind Waltz’s theory.

The hope that lay behind this work was, of course, that once the toy world of white blocks had been understood, the solutions found there could be generalized, providing the basis for attacking the more complex problems posed by a richer visual environment. Unfortunately, this turned out not to be so. For the roots of the approach that was eventually successful, we have to look at the third kind of development that was taking place then.

(7)


[Figure 1-3 legend: edges are labeled + (convex), − (concave), or with a separate mark for occluding edges.]

Figure 1-3. Some configurations of edges are physically realizable, and some are not. The trihedral junctions of three convex edges (a) or of three concave edges (b) are realizable, whereas the configuration (c) is impossible. Waltz cataloged all the possible junctions, including shadow edges, for up to four coincident edges. He then found that by using this catalog to implement consistency relations [requiring, for example, that an edge be of the same type all along its length like edge E in (d)], the solution to the labeling of a line drawing that included shadows was often uniquely determined.

clever parallel algorithm for this, and I suggested how it might be implemented by neurons in the retina (Marr, 1974a).

I do not now believe that this is at all a correct analysis of color vision or of the retina, but it showed the possible style of a correct analysis. Gone are the ad hoc programs of computer vision; gone is the restriction to a special visual miniworld; gone is any explanation in terms of neurons, except as a way of implementing a method. And present is a clear understanding of what is to be computed, how it is to be done, the physical assumptions on which the method is based, and some kind of analysis of algorithms that are capable of carrying it out.

The other piece of work was Horn’s (1975) analysis of shape from shading, which was the first in what was to become a distinguished series of articles on the formation of images. By carefully analyzing the way in which the illumination, surface geometry, surface reflectance, and viewpoint conspired to create the measured intensity values in an image, Horn formulated a differential equation that related the image intensity values to the surface geometry. If the surface reflectance and illumination are known, one can solve for the surface geometry (see also Horn, 1977). Thus from shading one can derive shape.

The message was plain. There must exist an additional level of understanding at which the character of the information-processing tasks carried out during perception is analyzed and understood in a way that is independent of the particular mechanisms and structures that implement them in our heads. This was what was missing: the analysis of the problem as an information-processing task. Such analysis does not usurp an understanding at the other levels (of neurons or of computer programs), but it is a necessary complement to them, since without it there can be no real understanding of the function of all those neurons.

This realization was arrived at independently and formulated together by Tomaso Poggio in Tübingen and myself (Marr and Poggio, 1977; Marr, 1977b). It was not even quite new: Leon D. Harmon was saying something similar at about the same time, and others had paid lip service to a similar distinction. But the important point is that if the notion of different types of understanding is taken very seriously, it allows the study of the information-processing basis of perception to be made rigorous. It becomes possible, by separating explanations into different levels, to make explicit statements about what is being computed and why and to construct theories stating that what is being computed is optimal in some sense or is guaranteed to function correctly. The ad hoc element is removed, and heuristic computer programs are replaced by solid foundations on which a real subject can be built. This realization (the formulation of what was missing, together with a clear idea of how to supply it) formed the basic foundation for a new integrated approach, which it is the purpose of this book to describe.

1.2 UNDERSTANDING COMPLEX INFORMATION-PROCESSING SYSTEMS

(8)

temperature, pressure, density, and the relationships among these factors, is not formulated by using a large set of equations, one for each of the particles involved. Such effects are described at their own level, that of an enormous collection of particles; the effort is to show that in principle the microscopic and macroscopic descriptions are consistent with one another. If one hopes to achieve a full understanding of a system as complicated as a nervous system, a developing embryo, a set of metabolic pathways, a bottle of gas, or even a large computer program, then one must be prepared to contemplate different kinds of explanation at different levels of description that are linked, at least in principle, into a cohesive whole, even if linking the levels in complete detail is impractical. For the specific case of a system that solves an information-processing problem, there are in addition the twin strands of process and representation, and both these ideas need some discussion.

Representation and Description

A representation is a formal system for making explicit certain entities or types of information, together with a specification of how the system does this. And I shall call the result of using a representation to describe a given entity a description of the entity in that representation (Marr and Nishihara, 1978).

For example, the Arabic, Roman, and binary numeral systems are all formal systems for representing numbers. The Arabic representation consists of a string of symbols drawn from the set (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), and the rule for constructing the description of a particular integer n is that one decomposes n into a sum of multiples of powers of 10 and unites these multiples into a string with the largest powers on the left and the smallest on the right. Thus, thirty-seven equals $3 \times 10^1 + 7 \times 10^0$, which becomes 37, the Arabic numeral system’s description of the number. What this description makes explicit is the number’s decomposition into powers of 10. The binary numeral system’s description of the number thirty-seven is 100101, and this description makes explicit the number’s decomposition into powers of 2. In the Roman numeral system, thirty-seven is represented as XXXVII.
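As a small illustration of the same number receiving different descriptions in different representations, the sketch below builds the Arabic, binary, and Roman descriptions of thirty-seven. The helper names `positional` and `roman` and the `ROMAN_SYMBOLS` table are assumptions introduced for this example; `positional` only handles bases up to 10.

```python
# Value/symbol pairs for standard Roman numerals, largest first.
ROMAN_SYMBOLS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def positional(n, base):
    """Describe a non-negative integer as a digit string in `base` (2..10)."""
    digits = []
    while n > 0:
        # Peel off the coefficient of the lowest remaining power of the base.
        n, digit = divmod(n, base)
        digits.append(str(digit))
    # Largest powers go on the left, smallest on the right.
    return "".join(reversed(digits)) or "0"

def roman(n):
    """Describe a positive integer as a Roman numeral string."""
    parts = []
    for value, symbol in ROMAN_SYMBOLS:
        while n >= value:
            parts.append(symbol)
            n -= value
    return "".join(parts)

print(positional(37, 10))  # 37
print(positional(37, 2))   # 100101
print(roman(37))           # XXXVII
```

All three strings describe the same entity; what differs is which decomposition of the number each one makes explicit.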

This definition of a representation is quite general. For example, a representation for shape would be a formal scheme for describing some aspects of shape, together with rules that specify how the scheme is applied to any particular shape. A musical score provides a way of representing a symphony; the alphabet allows the construction of a written representation of words; and so forth. The phrase “formal scheme” is critical to the definition, but the reader should not be frightened by it. The reason is simply that we are dealing with information-processing machines, and the way such machines work is by using symbols to stand for things (to represent things, in our terminology). To say that something is a formal scheme means only that it is a set of symbols with rules for putting them together, no more and no less.

A representation, therefore, is not a foreign idea at all; we all use representations all the time. However, the notion that one can capture some aspect of reality by making a description of it using a symbol and that to do so can be useful seems to me a fascinating and powerful idea. But even the simple examples we have discussed introduce some rather general and important issues that arise whenever one chooses to use one particular representation. For example, if one chooses the Arabic numeral representation, it is easy to discover whether a number is a power of 10 but difficult to discover whether it is a power of 2. If one chooses the binary representation, the situation is reversed. Thus, there is a trade-off; any particular representation makes certain information explicit at the expense of information that is pushed into the background and may be quite hard to recover.
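The trade-off can be made concrete with a small Python sketch. It assumes descriptions are plain digit strings; the helper names `arabic`, `binary`, and `is_one_then_zeros` are hypothetical. One trivial test on the string (a leading 1 followed only by zeros) answers the power-of-10 question for an Arabic description and the power-of-2 question for a binary one, but not the other way around.

```python
def arabic(n: int) -> str:
    """Arabic (decimal) description of a positive integer."""
    return str(n)


def binary(n: int) -> str:
    """Binary description of a positive integer."""
    return format(n, "b")


def is_one_then_zeros(description: str) -> bool:
    """True if the description is a single 1 followed only by zeros."""
    return description[0] == "1" and all(ch == "0" for ch in description[1:])


# 1000 is a power of 10: explicit in the Arabic description, hidden in binary.
print(is_one_then_zeros(arabic(1000)), is_one_then_zeros(binary(1000)))
# 1024 is a power of 2: explicit in the binary description, hidden in Arabic.
print(is_one_then_zeros(arabic(1024)), is_one_then_zeros(binary(1024)))
```

Recovering the hidden property from the "wrong" description requires real arithmetic rather than a glance at the symbols.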

This issue is important, because how information is represented can greatly affect how easy it is to do different things with it. This is evident even from our numbers example: It is easy to add, to subtract, and even to multiply if the Arabic or binary representations are used, but it is not at all easy to do these things (especially multiplication) with Roman numerals. This is a key reason why the Roman culture failed to develop mathematics in the way the earlier Arabic cultures had.


difficulty with which operations may subsequently be carried out on that information.

Process

The term process is very broad. For example, addition is a process, and so is taking a Fourier transform. But so is making a cup of tea, or going shopping. For the purposes of this book, I want to restrict our attention to the meanings associated with machines that are carrying out information-processing tasks. So let us examine in depth the notions behind one simple such device, a cash register at the checkout counter of a supermarket.

There are several levels at which one needs to understand such a device, and it is perhaps most useful to think in terms of three of them. The most abstract is the level of what the device does and why. What it does is arithmetic, so our first task is to master the theory of addition. Addition is a mapping, usually denoted by +, from pairs of numbers into single numbers; for example, + maps the pair (3, 4) to 7, and I shall write this in the form (3 + 4) → 7. Addition has a number of abstract properties, however. It is commutative: both (3 + 4) and (4 + 3) are equal to 7; and associative: the sum of 3 + (4 + 5) is the same as the sum of (3 + 4) + 5. Then there is the unique distinguished element, zero, the adding of which has no effect: (4 + 0) → 4. Also, for every number there is a unique “inverse,” written (−4) in the case of 4, which when added to the number gives zero: [4 + (−4)] → 0.

Notice that these properties are part of the fundamental theory of addition. They are true no matter how the numbers are written (whether in binary, Arabic, or Roman representation) and no matter how the addition is executed. Thus part of this first level is something that might be characterized as what is being computed.

The other half of this level of explanation has to do with the question of why the cash register performs addition and not, for instance, multiplication when combining the prices of the purchased items to arrive at a final bill. The reason is that the rules we intuitively feel to be appropriate for combining the individual prices in fact define the mathematical operation of addition. These can be formulated as constraints in the following way:

1. If you buy nothing, it should cost you nothing; and buying nothing and something should cost the same as buying just the something. (The rules for zero.)

2. The order in which goods are presented to the cashier should not affect the total. (Commutativity.)

3. Arranging the goods into two piles and paying for each pile separately should not affect the total amount you pay. (Associativity; the basic operation for combining prices.)

4. If you buy an item and then return it for a refund, your total expenditure should be zero. (Inverses.)

It is a mathematical theorem that these conditions define the operation of addition, which is therefore the appropriate computation to use.
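The four constraints can be spot-checked mechanically. The sketch below is an illustration, not a proof of the theorem: it tests a candidate combining operation on a handful of sample prices (in whole cents, an assumption made here to avoid rounding), and the function name `satisfies_constraints` is hypothetical.

```python
from itertools import product
from operator import add, mul, neg


def satisfies_constraints(combine, zero, inverse, samples) -> bool:
    """Spot-check the four cash-register constraints on sample values."""
    for a in samples:
        # 1. Rules for zero: combining with nothing changes nothing.
        if combine(a, zero) != a or combine(zero, a) != a:
            return False
        # 4. Inverses: buying and then returning an item costs nothing.
        if combine(a, inverse(a)) != zero:
            return False
    # 2. Commutativity: the order of presentation does not matter.
    for a, b in product(samples, repeat=2):
        if combine(a, b) != combine(b, a):
            return False
    # 3. Associativity: splitting the goods into piles does not matter.
    for a, b, c in product(samples, repeat=3):
        if combine(combine(a, b), c) != combine(a, combine(b, c)):
            return False
    return True


prices_in_cents = [0, 5, 120, 399]
print(satisfies_constraints(add, 0, neg, prices_in_cents))  # addition passes
print(satisfies_constraints(mul, 0, neg, prices_in_cents))  # multiplication fails rule 1
```

Multiplication already fails the rule for zero, since combining any price with "nothing" would give nothing.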

This whole argument is what I call the computational theory of the cash register. Its important features are (1) that it contains separate arguments about what is computed and why and (2) that the resulting operation is defined uniquely by the constraints it has to satisfy. In the theory of visual processes, the underlying task is to reliably derive properties of the world from images of it; the business of isolating constraints that are both powerful enough to allow a process to be defined and generally true of the world is a central theme of our inquiry.

In order that a process shall actually run, however, one has to realize it in some way and therefore choose a representation for the entities that the process manipulates. The second level of the analysis of a process, therefore, involves choosing two things: (1) a representation for the input and for the output of the process and (2) an algorithm by which the transformation may actually be accomplished. For addition, of course, the input and output representations can both be the same, because they both consist of numbers. However, this is not true in general. In the case of a Fourier transform, for example, the input representation may be the time domain, and the output, the frequency domain. If the first of our levels specifies what and why, this second level specifies how. For addition, we might choose Arabic numerals for the representations, and for the algorithm we could follow the usual rules about adding the least significant digits first and “carrying” if the sum exceeds 9. Cash registers, whether mechanical or electronic, usually use this type of representation and algorithm.
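As a minimal sketch of this second level, the following Python function fixes the representation (Arabic digit strings for both input and output) and the algorithm (least significant digits first, carrying when a column sum exceeds 9). The name `add_arabic` is hypothetical, and the inputs are assumed to be non-negative integers written without signs.

```python
def add_arabic(a: str, b: str) -> str:
    """Add two non-negative integers given as Arabic digit strings."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1          # start at the least significant digits
    while i >= 0 or j >= 0 or carry:
        digit_a = int(a[i]) if i >= 0 else 0
        digit_b = int(b[j]) if j >= 0 else 0
        column = digit_a + digit_b + carry
        carry, digit = divmod(column, 10)   # carry 1 if the column sum exceeds 9
        result.append(str(digit))
        i -= 1
        j -= 1
    return "".join(reversed(result))        # most significant digit on the left


print(add_arabic("37", "85"))
```

The function never converts its arguments to machine integers as a whole; it manipulates the symbols of the chosen representation column by column, which is the sense in which this level specifies how.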


more efficient than another, or another may be slightly less efficient but more robust (that is, less sensitive to slight inaccuracies in the data on which it must run). Or again, one algorithm may be parallel, and another, serial. The choice, then, may depend on the type of hardware or machinery in which the algorithm is to be embodied physically.

This brings us to the third level, that of the device in which the process is to be realized physically. The important point here is that, once again, the same algorithm may be implemented in quite different technologies. The child who methodically adds two numbers from right to left, carrying a digit when necessary, may be using the same algorithm that is implemented by the wires and transistors of the cash register in the neighborhood supermarket, but the physical realization of the algorithm is quite different in these two cases. Another example: Many people have written computer programs to play tic-tac-toe, and there is a more or less standard algorithm that cannot lose. This algorithm has in fact been implemented by W. D. Hillis and R. Silverman in a quite different technology, in a computer made out of Tinkertoys, a children’s wooden building set. The whole monstrously ungainly engine, which actually works, currently resides in a museum at the University of Missouri in St. Louis.

Some styles of algorithm will suit some physical substrates better than others. For example, in conventional digital computers, the number of connections is comparable to the number of gates, while in a brain, the number of connections is much larger ($\times 10^4$) than the number of nerve cells. The underlying reason is that wires are rather cheap in biological architecture, because they can grow individually and in three dimensions. In conventional technology, wire laying is more or less restricted to two dimensions, which quite severely restricts the scope for using parallel techniques and algorithms; the same operations are often better carried out serially.

The Three Levels

We can summarize our discussion in something like the manner shown in Figure 1-4, which illustrates the different levels at which an information-processing device must be understood before one can be said to have understood it completely. At one extreme, the top level, is the abstract computational theory of the device, in which the performance of the device is characterized as a mapping from one kind of information to another, the abstract properties of this mapping are defined precisely, and its appropriateness and adequacy for the task at hand are demonstrated. In the center is the choice of representation for the input and output and the algorithm to be used to transform one into the other. And at the other extreme are the details of how the algorithm and representation are realized physically (the detailed computer architecture, so to speak). These three levels are coupled, but only loosely. The choice of an algorithm is influenced, for example, by what it has to do and by the hardware in which it must run. But there is a wide choice available at each level, and the explication of each level involves issues that are rather independent of the other two.

Computational theory: What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?

Representation and algorithm: How can this computational theory be implemented? In particular, what is the representation for the input and output, and what is the algorithm for the transformation?

Hardware implementation: How can the representation and algorithm be realized physically?

Figure 1-4. The three levels at which any machine carrying out an information-processing task must be understood.


Figure 1-5. The so-called Necker illusion, named after L. A. Necker, the Swiss naturalist who developed it in 1832. The essence of the matter is that the two-dimensional representation (a) has collapsed the depth out of a cube and that a certain aspect of human vision is to recover this missing third dimension. The depth of the cube can indeed be perceived, but two interpretations are possible, (b) and (c). A person’s perception characteristically flips from one to the other.

brain, but few would feel satisfied by an account that failed to mention the existence of two different but perfectly plausible three-dimensional interpretations of this two-dimensional image.

For some phenomena, the type of explanation required is fairly obvious. Neuroanatomy, for example, is clearly tied principally to the third level, the physical realization of the computation. The same holds for synaptic mechanisms, action potentials, inhibitory interactions, and so forth. Neurophysiology, too, is related mostly to this level, but it can also help us to understand the type of representations being used, particularly if one accepts something along the lines of Barlow’s views that I quoted earlier. But one has to exercise extreme caution in making inferences from neurophysiological findings about the algorithms and representations being used, particularly until one has a clear idea about what information needs to be represented and what processes need to be implemented.

Psychophysics, on the other hand, is related more directly to the level of algorithm and representation. Different algorithms tend to fail in radically different ways as they are pushed to the limits of their performance or are deprived of critical information. As we shall see, primarily psychophysical evidence proved to Poggio and myself that our first stereo-matching algorithm (Marr and Poggio, 1976) was not the one that is used by the brain, and the best evidence that our second algorithm (Marr and Poggio, 1979) is roughly the one that is used also comes from psychophysics. Of course, the underlying computational theory remained the same in both cases; only the algorithms were different.

Psychophysics can also help to determine the nature of a representation. The work of Roger Shepard (1975), Eleanor Rosch (1978), or Elizabeth Warrington (1975) provides some interesting hints in this direction. More specifically, Stevens (1979) argued from psychophysical experiments that surface orientation is represented by the coordinates of slant and tilt, rather than (for example) the more traditional (p, q) of gradient space (see Chapter 3). He also deduced from the uniformity of the size of errors made by subjects judging surface orientation over a wide range of orientations that the representational quantities used for slant and tilt are pure angles and not, for example, their cosines, sines, or tangents.
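To make the two candidate representations concrete, here is a small Python sketch of the conversion between them. It rests on an assumption not spelled out in this section: the usual convention in which p and q are the partial derivatives of depth with respect to x and y, slant is the angle between the surface normal and the line of sight, and tilt is the image-plane direction of the gradient. The function names are hypothetical.

```python
import math


def gradient_to_slant_tilt(p: float, q: float) -> tuple[float, float]:
    """Convert gradient-space coordinates (p, q) to (slant, tilt) in radians."""
    slant = math.atan(math.hypot(p, q))   # steeper gradient means larger slant
    tilt = math.atan2(q, p)               # direction of steepest depth change
    return slant, tilt


def slant_tilt_to_gradient(slant: float, tilt: float) -> tuple[float, float]:
    """Convert (slant, tilt) in radians back to gradient-space coordinates (p, q)."""
    magnitude = math.tan(slant)           # grows without bound as slant nears 90 degrees
    return magnitude * math.cos(tilt), magnitude * math.sin(tilt)


slant, tilt = gradient_to_slant_tilt(1.0, 1.0)
print(math.degrees(slant), math.degrees(tilt))
print(slant_tilt_to_gradient(slant, tilt))
```

Both pairs describe the same surface orientation, but under this convention equal steps in slant correspond to increasingly large steps in (p, q) as the surface turns edge-on, which is the kind of difference a psychophysical error pattern could in principle distinguish.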

More generally, if the idea that different phenomena need to be explained at different levels is kept clearly in mind, it often helps in the assessment of the validity of the different kinds of objections that are raised from time to time. For example, one favorite is that the brain is quite different from a computer because one is parallel and the other serial. The answer to this, of course, is that the distinction between serial and parallel is a distinction at the level of algorithm; it is not fundamental at all, since anything programmed in parallel can be rewritten serially (though not necessarily vice versa). The distinction, therefore, provides no grounds for arguing that the brain operates so differently from a computer that a computer could not be programmed to perform the same tasks.

Importance of Computational Theory

Although algorithms and mechanisms are empirically more accessible, it is the top level, the level of computational theory, which is critically important from an information-processing point of view. The reason for this is that the nature of the computations that underlie perception depends more upon the computational problems that have to be solved than upon the particular hardware in which their solutions are implemented. To phrase the matter another way, an algorithm is likely to be understood more readily by understanding the nature of the problem being solved than by examining the mechanism (and the hardware) in which it is embodied.


as they do by studying their wiring and interactions, but in order to understand why the receptive fields are as they are (why they are circularly symmetrical and why their excitatory and inhibitory regions have characteristic shapes and distributions) we have to know a little of the theory of differential operators, band-pass channels, and the mathematics of the uncertainty principle (see Chapter 2).

Perhaps it is not surprising that the very specialized empirical disciplines of the neurosciences failed to appreciate fully the absence of computational theory; but it is surprising that this level of approach did not play a more forceful role in the early development of artificial intelligence. For far too long, a heuristic program for carrying out some task was held to be a theory of that task, and the distinction between what a program did and how it did it was not taken seriously. As a result, (1) a style of explanation evolved that invoked the use of special mechanisms to solve particular problems, (2) particular data structures, such as the lists of attribute value pairs called property lists in the LISP programming language, were held to amount to theories of the representation of knowledge, and (3) there was frequently no way to determine whether a program would deal with a particular case other than by running the program.

Failure to recognize this theoretical distinction between what and how also greatly hampered communication between the fields of artificial intelligence and linguistics. Chomsky’s (1965) theory of transformational grammar is a true computational theory in the sense defined earlier. It is concerned solely with specifying what the syntactic decomposition of an English sentence should be, and not at all with how that decomposition should be achieved. Chomsky himself was very clear about this (it is roughly his distinction between competence and performance, though his idea of performance did include other factors, like stopping in midutterance), but the fact that his theory was defined by transformations, which look like computations, seems to have confused many people. Winograd (1972), for example, felt able to criticize Chomsky’s theory on the grounds that it cannot be inverted and so cannot be made to run on a computer; I had heard reflections of the same argument made by Chomsky’s colleagues in linguistics as they turn their attention to how grammatical structure might actually be computed from a real English sentence.

The explanation is simply that finding algorithms by which Chomsky’s theory may be implemented is a completely different endeavor from formulating the theory itself. In our terms, it is a study at a different level, and both tasks have to be done. This point was appreciated by Marcus (1980), who was concerned precisely with how Chomsky’s theory can be realized and with the kinds of constraints on the power of the human grammatical processor that might give rise to the structural constraints in syntax that Chomsky found. It even appears that the emerging “trace” theory of grammar (Chomsky and Lasnik, 1977) may provide a way of synthesizing the two approaches, showing that, for example, some of the rather ad hoc restrictions that form part of the computational theory may be consequences of weaknesses in the computational power that is available for implementing syntactical decoding.

The Approach of J. J. Gibson

In perception, perhaps the nearest anyone came to the level of computational theory was Gibson (1966). However, although some aspects of his thinking were on the right lines, he did not understand properly what information processing was, which led him to seriously underestimate the complexity of the information-processing problems involved in vision and the consequent subtlety that is necessary in approaching them.

Gibson’s important contribution was to take the debate away from the philosophical considerations of sense-data and the affective qualities of sensation and to note instead that the important thing about the senses is that they are channels for perception of the real world outside or, in the case of vision, of the visible surfaces. He therefore asked the critically important question, How does one obtain constant perceptions in everyday life on the basis of continually changing sensations? This is exactly the right question, showing that Gibson correctly regarded the problem of perception as that of recovering from sensory information “valid” properties of the external world. His problem was that he had a much oversimplified view of how this should be done. His approach led him to consider higher-order variables (stimulus energy, ratios, proportions, and so on) as “invariants” of the movement of an observer and of changes in stimulation intensity.


resonate. This was the basic idea behind the notion of ecological optics (Gibson, 1966, 1979).

Although one can criticize certain shortcomings in the quality of Gibson’s analysis, its major and, in my view, fatal shortcoming lies at a deeper level and results from a failure to realize two things. First, the detection of physical invariants, like image surfaces, is exactly and precisely an information-processing problem, in modern terminology. And second, he vastly underrated the sheer difficulty of such detection. In discussing the recovery of three-dimensional information from the movement of an observer, he says that “in motion, perspective information alone can be used” (Gibson, 1966, p. 202). And perhaps the key to Gibson is the following:

The detection of non-change when an object moves in the world is not as difficult as it might appear. It is only made to seem difficult when we assume that the perception of constant dimensions of the object must depend on the correcting of sensations of inconstant form and size. The information for the constant dimension of an object is normally carried by invariant relations in an optic array. Rigidity is specified. (emphasis added)

Yes, to be sure, but how? Detecting physical invariants is just as difficult as Gibson feared, but nevertheless we can do it. And the only way to understand how is to treat it as an information-processing problem.

The underlying point is that visual information processing is actually very complicated, and Gibson was not the only thinker who was misled by the apparent simplicity of the act of seeing. The whole tradition of philosophical inquiry into the nature of perception seems not to have taken seriously enough the complexity of the information processing involved. For example, Austin’s (1962) Sense and Sensibilia entertainingly demolishes the argument, apparently favored by earlier philosophers, that since we are sometimes deluded by illusions (for example, a straight stick appears bent if it is partly submerged in water), we see sense-data rather than material things. The answer is simply that usually our perceptual processing does run correctly (it delivers a true description of what is there), but although evolution has seen to it that our processing allows for many changes (like inconstant illumination), the perturbation due to the refraction of light by water is not one of them. And incidentally, although the example of the bent stick has been discussed since Aristotle, I have seen no philosophical inquiry into the nature of the perceptions of, for instance, a heron, which is a bird that feeds by pecking up fish first seen from above the water surface. For such birds the visual correction might be present.

Anyway, my main point here is another one. Austin (1962) spends much time on the idea that perception tells one about real properties of the external world, and one thing he considers is “real shape” (p. 66), a notion which had cropped up earlier in his discussion of a coin that “looked elliptical” from some points of view. Even so,

it had a real shape which remained unchanged. But coins in fact are rather special cases. For one thing their outlines are well defined and very highly stable, and for another they have a known and a nameable shape. But there are plenty of things of which this is not true. What is the real shape of a cloud? . . . or of a cat? Does its real shape change whenever it moves? If not, in what posture is its real shape on display? Furthermore, is its real shape such as to be fairly smooth outlines, or must it be finely enough serrated to take account of each hair? It is pretty obvious that there is no answer to these questions: no rules according to which, no procedure by which, answers are to be determined. (emphasis added) (p. 67)

But there are answers to these questions. There are ways of describing the shape of a cat to an arbitrary level of precision (see Chapter 5), and there are rules and procedures for arriving at such descriptions. That is exactly what vision is about, and precisely what makes it complicated.

1.3 A REPRESENTATIONAL FRAMEWORK FOR VISION

Vision is a process that produces from images of the external world a description that is useful to the viewer and not cluttered with irrelevant information (Marr, 1976; Marr and Nishihara, 1978). We have already seen that a process may be thought of as a mapping from one representation to another, and in the case of human vision, the initial representation is in no doubt: it consists of arrays of image intensity values as detected by the photoreceptors in the retina.

It is quite proper to think of an image as a representation; the items that are made explicit are the image intensity values at each point in the array, which we can conveniently denote by I(x, y) at coordinate (x, y). In order to simplify our discussion, we shall neglect for the moment the fact that there are several different types of receptor, and imagine instead that there is just one, so that the image is black-and-white. Each value of I(x, y) thus specifies a particular level of gray; we shall refer to each detector as a picture element or pixel and to the whole array I as an image.
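A minimal Python sketch of this representation follows. The storage layout (a list of rows, so that the pixel at coordinate (x, y) lives at `image[y][x]`), the integer gray levels, and the helper names `sample_image` and `intensity_at` are all assumptions made for illustration.

```python
def sample_image(width: int, height: int, gray_level) -> list[list[int]]:
    """Build an image by evaluating a gray-level function at every pixel (x, y)."""
    return [[gray_level(x, y) for x in range(width)] for y in range(height)]


def intensity_at(image: list[list[int]], x: int, y: int) -> int:
    """Return I(x, y), the gray level made explicit at coordinate (x, y)."""
    return image[y][x]


# A tiny 4-by-3 image whose gray level rises from left to right and top to bottom.
ramp = sample_image(4, 3, lambda x, y: 10 * x + y)
print(intensity_at(ramp, 2, 1))
```

What this description makes explicit is the intensity at each point and nothing else; edges, surfaces, and objects are not yet represented and would have to be computed from the array.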


alone specify precisely and an important aspect of this new approach is that it makes quite concrete proposals about what that end is. But before we begin that discussion, let us step back a little and spend a little time formulating the more general issues that are raised by these questions.

The Purpose of Vision

The usefulness of a representation depends upon how well suited it is to the purpose for which it is used. A pigeon uses vision to help it navigate, fly, and seek out food. Many types of jumping spider use vision to tell the difference between a potential meal and a potential mate. One type, for example, has a curious retina formed of two diagonal strips arranged in a V. If it detects a red V on the back of an object lying in front of it, the spider has found a mate. Otherwise, maybe a meal. The frog, as we have seen, detects bugs with its retina; and the rabbit retina is full of special gadgets, including what is apparently a hawk detector, since it responds well to the pattern made by a preying hawk hovering overhead. Human vision, on the other hand, seems to be very much more general, although it clearly contains a variety of special-purpose mechanisms that can, for example, direct the eye toward an unexpected movement in the visual field or cause one to blink or otherwise avoid something that approaches one’s head too quickly.

Vision, in short, is used in such a bewildering variety of ways that the visual systems of different animals must differ significantly from one another. Can the type of formulation that I have been advocating, in terms of representations and processes, possibly prove adequate for them all? I think so. The general point here is that because vision is used by different animals for such a wide variety of purposes, it is inconceivable that all seeing animals use the same representations; each can confidently be expected to use one or more representations that are nicely tailored to the owner’s purposes.

As an example, let us consider briefly a primitive but highly efficient visual system that has the added virtue of being well understood. Werner Reichardt’s group in Tübingen has spent the last 14 years patiently unraveling the visual flight-control system of the housefly, and in a famous collaboration, Reichardt and Tomaso Poggio have gone far toward solving the problem (Reichardt and Poggio, 1976, 1979; Poggio and Reichardt, 1976). Roughly speaking, the fly’s visual apparatus controls its flight through a collection of about five independent, rigidly inflexible, very fast responding systems (the time from visual stimulus to change of torque is only 21 ms). For example, one of these systems is the landing system; if the visual field “explodes” fast enough (because a surface looms nearby), the fly automatically “lands” toward its center. If this center is above the fly, the fly automatically inverts to land upside down. When the feet touch, power to the wings is cut off. Conversely, to take off, the fly jumps; when the feet no longer touch the ground, power is restored to the wings, and the insect flies again.

In-flight control is achieved by independent systems controlling the fly’s vertical velocity (through control of the lift generated by the wings) and horizontal direction (determined by the torque produced by the asymmetry of the horizontal thrust from the left and right wings). The visual input to the horizontal control system, for example, is completely described by the two terms

$$r(\psi)\,\dot{\psi} + D(\psi)$$

where r and D have the form illustrated in Figure 1-6. This input describes how the fly tracks an object that is present at angle $\psi$ in the visual field and has angular velocity $\dot{\psi}$. This system is triggered to track objects of a certain angular dimension in the visual field, and the motor strategy is such that if the visible object was another fly a few inches away then it would be

[Figure 1-6 plots, panels (a) and (b): horizontal axis $\psi$, marked at $-\pi$, $-\pi/2$, 0, $+\pi/2$, and $+\pi$.]

Figure 1-6. The horizontal component of the visual input R to the fly’s flight system is described by the formula $R = D(\psi) - r(\psi)\,\dot{\psi}$, where $\psi$ is the direction of the stimulus and $\dot{\psi}$ is its angular velocity in the fly’s visual field. $D(\psi)$ is an odd function, as shown in (a), which has the effect of keeping the target centered in the fly’s visual field; $r(\psi)$ is essentially constant as shown in (b).