Junction detection: Computation and psychophysics

(1)

Junction Detection: Computation and Psychophysics

by

Joshua Hartman McDermott

Submitted for the Degree of Mphil

at

University Colledge London

(2)

ProQuest Number: U643289

INFORMATION TO ALL USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed,

a note will indicate the deletion.

uest.

ProQuest U643289

(3)

Abstract

Junctions, formed at the intersection of image contours, have been thought to play an important

and early role in vision since the time of Helmholtz, and have recently been the focus of much

research in vision science. The interest injunctions can be attributed in part to the notion that

they are local image features that are easy to detect which nonetheless provide valuable

information about important scene events like occlusion and transparency. Despite the

prevalence of this view, the extent to which it holds for junctions in real images remains unclear.

In this thesis we put this notion to the test. Our first approach involves psychophysical

experiments with real images as stimuli in which we use the human visual system as a tool to

measure the visual information available in local image regions. W e test the ability of observers

to detect occlusion points, which by hypothesis are signified by local junctions, given only a

small image region surrounding the occlusion point. Human performance in our experiments

places constraints on how junctions might be detected and what their role in vision might be. We

also attempted to train neural networks to detect junctions given only a small image patch

centered on the junction. Together, our experiments suggest that although some junctions are

locally defined and can be detected with simple and local mechanisms, a substantial fraction

necessitate the use of more complex and global processes, and may not play the bottom-up role in

(4)

Abstract

2 1 Introduction

4 2 Psychophysics with Real Images

7

2.1 Stimulus Acquisition 7

2.2 Experiment 1: Aperture Size 9

2.3 Illustrative Example Stimuli 17

2.4 Experiment 2: Controls 21

3 Psychophysics Experiments on Scale

24

3.1 Experiment 3: Scale 24

3.2 Scale Examples 29

3.3 Scale Artifacts 36

3.4 Nonlocal Occlusion Points 41

3.5 Scale Selection 45

3.6 General Discussion of Psychophysics 47

4 Neural Networks for Junction Detection

51

4.1 General Motivation 51

4.2 Network Implementation 54

4.3 Training on Real Points of Occlusion 56

4.4 Training on Synthetic Junctions 61

5 Rotation Invariant Neural Networks

66

5.1 Training on a Single Orientation 66

5.2 General Discussion of Neural Networks 71

6 Conclusions

73 Acknowledgements

75

(5)

Chapter 1: Introduction

Vision is the task o f inferring representations of scenes in the world from images. Doing this

properly involves implicitly estimating a probability distribution o f scenes given an observed

image, and perhaps searching through the space of scenes to find the one that is most likely or in

some other sense best given the distribution. The problem, o f course, is that the direct estimation

o f this distribution is intractable due to its high dimensionality - a huge num ber o f parameters is

required to describe an entire scene. Moreover, searching through the space of scenes is also

intractable, again due to the high dimensionality. A more promising alternative is to estimate

many reduced versions of this distribution at different locations in an image, i.e. to estimate at

each location the distribution over the local scene parameters (e.g. surface curvature, reflectance,

etc.) given the image region at that location. Such local distributions are (hopefully) more

tractable to estimate and use because of the smaller number o f parameters involved. Locally the

distribution of scene structure given the image is less peaked and unimodal because there is less

information to constrain scene interpretation. However, the local distributions could narrow the

possible interpretations and might then be combined to yield a global scene estimate. Methods

have recently been suggested that accomplish this in some simple cases (e.g. Saund, 1999;

Freeman, 2000).

Something akin to this broad strategy seems likely to play a large role in vision; the big search

involving the entire image is too hard, and yet we manage to see astoundingly well, implying that

shortcuts are available and are taken by biological visual systems. The hierarchical nature of

mammalian visual systems may also be reflective o f such a strategy. The main question, then, is

just whether the local processing will be intelligible, i.e. whether vision science will be able to

produce a blueprint for what happens locally. It might be the case that the local processes are

messy and difficult to make sense of, but the assumption implicit in much o f vision research is

that this is not the case. It is for this reason that there is much interest in local cues such as

junctions.

Junctions, formed at the intersections of image contours, are attractive candidates for elements of

early vision because they are presumed to be local image features which nonetheless provide

substantial information about many important scene characteristics. Junctions have been

(6)

Anandan, 1990; T-junctions, Anderson, 1997), and reflectance, illumination, and atmospheric

boundaries (T-junctions, Gilchrist, 1979; X-junctions, Adelson, 1993, 1999). Furthermore,

junctions lend themselves to appealingly principled and rule-driven models o f many aspects of

mid-level vision (e.g. Sajda and Finkel, 1995).

Although a growing body o f psychophysics suggests an important role for local junction analysis

in mid-level vision, virtually all the psychophysical and modeling work on junctions is conducted

with "cartoon" images composed of discrete luminance values and step edges. W hile these may

be a useful abstraction of real images, it is unclear how representative the junctions in such

images are of junctions in real images. For junctions to be used in the manner that the simple

models and psychophysics suggest, they must be locally discernible, and yet it is not obvious that

this is the case in real images. Psychophysicists are happy to point out junctions in real images,

but in doing so are making use of the entire image and their entire visual system. It might be the

case that a substantial fraction of junctions become visible only after the image in which they lie

is interpreted, in which case the theoretical claims that have been made about them would be

called into question.

One obvious approach to investigating this issue is to attempt to design a mechanism for

detecting junctions in real images, in that the success of a local m echanism at detecting junctions

would imply that they are indeed locally defined. However, the results o f several previous such

attempts have been discouraging (Beymer, 1989; Heitger et al., 1991; Freeman, 1992; Alvarez

and Morales, 1997; Lindeberg, 1998; Farida et al., 1998). Sensible computations involving

oriented filtering and local templates result in apparently poor performance on real images. This

is surprising given that much o f the interest in junctions is predicated on their apparent simplicity.

These results might be taken to indicate that there is insufficient information in local regions of

real images to detect junctions. Note, though, that there is an inherent difficulty in claiming on

the basis of failures in machine vision that a low-level/local mechanism is insufficient to do some

(7)

1986; Zucker et al., 1988; Shashua and Ullman, 1990). Recently there have been further attempts

to refine the local operators (Iverson and Zucker, 1995). Although it is generally agreed that

some degree of nonlocal post-processing is necessary to generate edge maps suitable for further

analysis (e.g. inferring shape from contours), it is not clear how well a local operator should be

expected to perform, in part because they are typically evaluated by looking at their outputs in the

context of an entire image. It is thus difficult to ascertain the extent to which more sophisticated,

nonlocal analysis is needed.

Traditional psychophysics experiments on people, in contrast, can help by demonstrating the

existence of novel mechanisms in human vision, using cleverly designed stimuli that low-level

mechanisms are provably unable to deal with. Classic examples o f this include second-order

stimuli, which can demonstrate the existence o f texture-based mechanisms for edge detection, and

the association field stimuli of Field, Hayes and Hess (1993), which have been used to

demonstrate "contour integration" in human vision. Such findings can be useful for machine

vision because the existence of such mechanisms in human vision suggests that vision problems

may necessitate them. However, this approach has its own shortcomings, the main one being that

there is no way o f knowing the extent to which such mechanisms are active outside the lab in the

presence of real world stimuli. There are many people who contend, for instance, that second-

order mechanisms for motion are largely epiphenomenal in normal vision.

In this thesis we take an alternative approach: psychophysics with real images as stimuli. We

contend that psychophysics with real images can complement machine vision and traditional

human psychophysics by showing what human observers can and can’t do given some fraction of

the information available in real images. Simple manipulations (filtering, masking etc.) can be

used to restrict the information available to an observer. An observer’s competence at a task is an

existence proof that it can be done given the restrictions, whereas failure suggests that it cannot.

W e first present a series of experiments within this paradigm which provide insights and

constraints on junctions and their detection. W e then attempt to leam junction detectors given the

(8)

Chapter 2: Psychophysics

The goal o f our psychophysics experiments was to examine the extent to which occlusion in real

images produces simple and local cues like junctions that might be easily detected by early visual

mechanisms. Junctions are purported to be locally discernible image features, and we wanted to

put this notion to the test. The main idea is straightforward: if junctions are locally defined, then

observers should be able to reliably detect them given only a small image region centered on the

junction.

To test this idea, we first needed a large set o f junction candidates to use as stimuli. Junctions

are, by hypothesis, generated by significant events in the world such as the occlusion of one edge

by another, so this is where we looked for them. W e restricted our study to junctions generated at

points of occlusion, because this is by far the most common junction-generating situation

occurring in real scenes; such points were plentiful enough to study in some detail. It was also

fairly easy to describe to observers how they should label a point o f occlusion. W e thus began by

having naive observers identify points of occlusion in natural images.

2.1 Stimulus Acquisition

Observers were asked to label the points in an image where "one edge occludes another", a task

that tended to be intuitive and natural for naive observers to do. Their instructions were as

follows:

"In each image you should label the "points of occlusion". A point of occlusion is defined as a

point in an image where one edge is occluded by (i.e. goes behind) another. The occluding edge

will be the border of a surface or object, but the occluded edge (i.e. the one that goes behind) can

be any sort o f edge (e.g. a surface marking such as paint, the borders o f shadows, etc., in addition

(9)

thin as to make it impossible to distinctly label the two points of occlusion occurring on each side

of the line, subjects were instructed to take the center of the line to be the point of occlusion.

Such locations were obtained for 20 real images taken from the online database of Van Hateren

(Van Hateren and van der Schaaf, 1998) consisting of images of various locations in the

Netherlands taken with a digital camera. Images were logarithmically transformed and scaled for

display purposes. We selected images which appeared to have lots of occlusion points. Some

contained buildings and other man-made objects, whereas others consisted mainly of trees and

plants. Most images were labeled by two observers, and labels tended to be consistent across

observers. We then sifted through the sets of locations to eliminate the occasional erroneous label

as well as to ensure that the locations were centered as well as was humanly possible. In the end

we accumulated approximately 1200 occlusion points. An example image with labels is shown in

Figure 2.1.

A''

-mm

(10)

These labeled points were all locations where occlusion could be seen upon inspection of the

entire image at once. To test for the presence of local signatures of occlusion, such as T-

junctions, we asked observers to detect occlusion points given only a small image region

surrounding them.

2.2 Experiment 1: Aperture Size

Observers were presented with a succession of image patches of varying size. For each patch,

they were asked to guess whether it was centered on an occlusion point. Some of the patches

were centered on the previously labeled occlusion point locations, some were selected randomly

from the same images, some were synthetic edges, and some were synthetic T- and L-junctions.

If points of occlusion are signaled by a local image signature such as a T-junction, subjects

should perform well on an occlusion discrimination task even for small apertures. Furthermore,

the synthetic stimuli can help us characterize what subjects are basing their judgment on. If

subjects are using T-junctions to detect point of occlusion, then they should also classify the

synthetic T-junctions, but not the L-junctions, as occlusion points.

Figure 2.2. Example experimental display. Patch is 97 pixels wide.

Methods

(11)

hairs could be moved in front of the image when localization was in doubt, as shown in Figure

2.3.

Figure 2.3. Example experimental display with crosshairs toggled in front.

Subjects were given the following instructions:

"The purpose of this experiment is to leam how the human visual system detects occlusion

relationships in the world. We are particularly interested in whether the locations where one

object occludes another can be detected using very local information in an image. In the first

stage of this study, several observers labeled "occlusion points" in a series of images of the world.

In the present experiment, you will be presented with a series of image patches of varying size,

taken from the images that were labeled in the first stage of the study. Some of these patches will

be centered on the previously labeled occlusion points, while some will be centered on randomly

selected locations. Of the latter, some will contain just a single edge, some will be regions of

texture, some will be blank, and some will contain occlusion points that are not centered on the

patch. You will be asked to guess whether a given patch is centered on an occlusion point. In

some cases this will be obvious, and others less so."

The randomly selected patches were drawn randomly from the same images from which the

occlusion points were obtained, subject to the constraint that the standard deviation of the

intensities of the central 13 by 13 pixels exceeded 10 (intensities ranged from 0-255). This was to

reduce the number of "uninteresting" patches (e.g. those that were essentially uniform) that could

be trivially distinguished from occlusion points. The synthetic edges and junctions were

generated in Matlab. The different sectors o f the junctions and edges were uniform in luminance

(12)

the 0-255 scale, to ensure that the junctions and edges were well above threshold. The stimuli

were generated at large sizes and blurred and subsampled to reduce aliasing. A small amount of

noise was then added (normally distributed with a standard deviation o f 5). Because they were so

easy to see, the synthetic stimuli were only displayed at the three smallest aperture sizes. The

proportions of stimuli in the experiments were as follows: 40% o f the patches were centered on

real point of occlusion, 45% were randomly selected patches o f real images, and there were 5%

each of synthetic edges, L-junctions, and T-junctions.

Patches were displayed at one o f 5 sizes (13, 2 5,49, 97, or 201 pixels in diameter) on a uniform

background set to slightly above or below the mean luminance o f the patch. This was achieved

by multiplying each image region with a circular mask. At the borders the mask faded gradually

to zero over a 5 pixel radius increment. (Note that as a result, the effective size of each patch was

roughly 4 pixels greater than the listed size.) Stimuli were generated and displayed in Matlab on

a PC running Linux. Patches were displayed at full resolution on a Yama Vision M aster Pro 450

m onitor in ambient illumination. Subjects were instructed to sit at a comfortable viewing

distance. This was usually a bit less than arm ’s-length, at which the smallest patch size subtended

roughly 0.25 degrees visual angle. On each trial the display was present until subjects entered

their responses; viewing time was unlimited.

Subjects entered two responses on each trial. They were first asked to make their best guess as to

whether the current patch was centered on an occlusion point or not, by pressing one of two

buttons. They were then asked to make a confidence rating, pressing one of three buttons to

indicate whether they were essentially guessing, somewhat sure, or quite sure that their previous

judgm ent was correct. Subjects were instructed to envision a continuum o f confidence and to

assign thirds of that continuum to each button. Each response was confirmed with a second key

press before which subjects had the option o f changing the response. This was intended to

minimize mistaken responses, and seemed to be effective: subjects were debriefed following each

experiment and consistently claimed to have made only a handful of such mistaken responses.

(13)

classifications ju st on the basis of an edge running through the center of the patch. No feedback

was given on trials with synthetic junctions, since these were intended as a barometer of a

subject’s strategy. Subjects were told that the feedback signal was probabilistic (i.e. would not

occur for every error), to be consistent with the absence o f feedback on these trials. Subjects

were told the following about the feedback:

"To help you perform the task as well as you can, on some trials you will be given feedback. The

computer will sometimes make a buzzing sound when you incorrectly label the current patch.

You should try to take this feedback into account as you decide how to respond. There may be

particular patterns that occur at points of occlusion that you can use to detect them."

Subjects began their participation in the experiment by completing roughly an hour’s worth of

practice trials. Once fully comfortable with the judgm ents and response requirements, each

subject completed approximately 700 experimental trials. The subjects who participated in the

experiments were different from the subjects who labeled the images, and had never seen the

images from which the patches were extracted. W e collected data from 5 subjects.

Results

The data from all 5 subjects were qualitatively similar, and were combined for analysis purposes.

Figure 2.4a plots classification errors for the real image patches as a function of the aperture size.

At the smallest aperture size (13 pixels in diameter), subjects misclassify 27 percent of the image

patches (chance performance is 50 percent). Errors drop gradually as the aperture size is

increased, but do not reach a minimum until an aperture size of about 100 pixels. Figure 2.4b

plots histograms o f subjects’occlusion judgm ents for the real image patches, divided into patches

that actually were points o f occlusion and those which were randomly selected. Most of the

effect of aperture size is on misses: points o f occlusion that are not classified as such. At the

smallest aperture size, just over half of the patches that are in fact centered on points of occlusion

are correctly classified; this proportion increases to over 90 percent at the largest aperture size.

Note that all of these image patches were clearly points of occlusion when the entire image was

viewed at once; this was the criterion by w hich they were labeled. The fact that subjects

misclassify a small proportion of the occlusion points at the largest aperture size tested implies

that a degree of ambiguity exists even for this large size (otherwise subject would have been close

(14)

a

40

6 20

Figure 2.4. Results of Experiment 1.

Occlusion Classification of Real Image Patches

T otal Errors Results for 5 subjects

1 I r

I

12.1

13 25 49 97

<—Patch Diameter in Pixels—>

11

201

b

100

I

o 50 H

I

13 25

Ju d gem en ts fo r Points o f O cclu sion

J

1

49

J

1

97 J1 201

Ju d g em en ts fo r R an d om Points 100 0 Q.

1

O H o

I

97

13 25 49 201

■ Judged as Ctelusion Points □ NOT Judged as Occlusion Poirts

(15)

occlusion are randomly selected. Occasionally, locations are randomly selected that happen to lie

on points of occlusion. W ith purely random selection this would be a very improbable event, but

as described above, the patches were also constrained to have high variance in their pixel

intensities, which predisposed the selection process towards edges and other regions of high

contrast. Moreover, subjects were given intentionally lax guidelines with regard to the centering

of the occlusion points: they were instructed to classify something as being centered on an

occlusion point if there appeared to be such a point within the cross hair boundaries, a 10 by 10

pixel region. This made it easier still for randomly selected patches to legitimately contain

occlusion points. Thus many o f the "random" points that were classified as occlusion points were

classified correctly, and are not true false positive responses. The only way to get around this

would be to painstakingly weed out these points by hand, which we did not do. W e are thus

clearly overestimating false positive rates; this is a weakness of the method.

Our qualitative results are quite clear, however: even when forced to guess, human observers are

unable to identify many points of occlusion just on the basis of local information. This suggests

that in many cases there isn’t much o f a local cue such as a T-junction with which to detect the

occlusion. Conversely, though, in many cases observers can reliably identify occlusion points

with purely local information, providing an existence proof for local cues of some sort. It would

be of interest to characterize the nature o f such cues. However, our occlusion discrimination task

could conceivably be performed with any number of different strategies, and our results with real

image patches provide few constraints on the strategies our subjects could be adopting. Synthetic

(16)

Figure 2.5. Results of Experiment 1 for synthetic stimuli.

§

100 <u CE: o H

I

50

0 Occlusion Judgements for Synthetic Junctions

T -Junctions

_CL

13 25 49

L -Junctions

100

50

0

■ Judged as Ct elusion Points □ NOT Judged as Occlusion Poirts

Figure 2.5 shows occlusion judgments for the synthetic junctions inserted as "catch trials" in the

experiments. In contrast to the real image patches, the classification of the synthetic junctions

depends very little on aperture size. Virtually all of the T-junctions are classified as occlusion

points whereas most of the L-junctions are not. Note that this does not show that observers’

judgments on the real points of occlusion are mediated by junctions; they could very well be

monitoring other things. It is, however, consistent with the notion that junctions are involved.

Moreover, it suggests that if junctions were present at most points of occlusion in real images,

observers would make use of them within our paradigm to identify the occlusion points as such.

Thus the fact that they fail to pick out many of the occlusion points suggests that junctions are not

visible there.

Figure 2.6 plots histograms of our observers’ confidence ratings in each experimental condition.

For display purposes, the ratings were combined with the occlusion judgments, such that the six

(17)

aperture size, the majority of judgments are very confident and correct. There is a similar but

weaker trend evident in the responses for the randomly selected patches. In contrast, the vast

majority of the confidence ratings for the synthetic junctions are high, for both small and large

apertures. Evidently the occlusion evidence in real images is far more ambiguous.

Confidence Ratings for Occlusion Classification

a

_{J u d g em en ts fo r Points o f O cclusion} Results for 5 subjects

Ju d g em en ts fo r R an dom Points

a 40

49 97

Highly Confident

Guessing

Highly Confident

NOT J Ldged as J udged as Occlusion Points Occlusion Points

(18)

2.3 Illustrative Examples

A few examples of our stimuli help illustrate the range of situations encountered in real images.

Figure 2.7 shows a number of real points of occlusion that are easily identifiable as such viewed

through apertures 13 pixels in diameter. All of these points generate local image structures that

we would be comfortable terming T-junctions.

Figure 2.7. Example stimuli viewed through apertures 13 pixels in diameter.

Even when only the local information visible through the apertures is

available, all of these locations are good candidates for occlusion points,

presumably because they generate T-junctions. Image is blown up for

reproduction purposes.

Figure 2.8 shows another group of occlusion points, again viewed through apertures 13 pixels in

diameter. In contrast to the previous group, none of these image patches are obviously the

product of an occlusion point, and accordingly none were correctly classified by our observers.

None of the occlusion points in this group generate prototypical T-junctions, for reasons that vary

from case to case. In some instances the occluded contour is very low contrast and hard to

(19)

... ..

ÎJS«’ =■ AïJfeÊrbs

QBUWB#

m m

Figure 2.8. Example stimuli stimuli viewed through apertures 13 pixels in

diameter. Just given the local evidence visible through these small apertures,

these locations are relatively poor candidates for occlusion points. Image is

blown up for reproduction purposes.

Figures 2.9-2.16 further illustrate this, showing some of these and other occlusion points over a

range of aperture sizes (the four smallest sizes used in the experiments). Although not obviously

occlusion points when viewed through the smallest apertures, as the aperture size increases it

becomes easy to tell that the image patches are centered on points of occlusion. In most of these

(20)

Figure 2.9. Example occlusion point viewed through apertures 13, 25, 49, and 97 pixels in

diameter. Location 14.

Figure 2.10. Location 114.

;

(21)

1

The fact that a large number of pixels (e.g. 50 by 50) is needed to recognize many points of

occlusion might indicate that a rather complicated mechanism is necessary for their detection

(more complicated than mechanisms that could detect synthetic junctions, which can evidently be

recognized with far less information). Before considering this possibility, however, we will first

(22)

2.4 Experiment 2: Controls W ith Zoomed Image Patches

One relatively uninteresting explanation of the basic result of Experiment 1 might involve lateral

masking between the aperture border and aspects of the image patch. For small apertures, the

borders are close to the center of the patch, and might thus interfere to a greater extent on

subjects’ ability to detect the image structure relevant to their occlusion judgments. It seemed

unlikely that masking could be so strong as to generate the trend present in our results, but we

nonetheless sought to rule it out. W e thus repeated Experiment 1 with the image patches "blown

up"; each image was upsampled and blurred to twice the original size.

Methods

Image patches were blown up by a factor of two using the upBlur function in Eero Simoncelli’s

M atlab Pyramid Toolbox. The patch sizes were 14, 26, 50, 98, and 194 pixels in diameter, each

generated from a patch of the original image half that size. Thus the smallest patch size was

generated from a patch 7 pixels in diameter, a size not included in the original experiment. The

other four patch sizes were generated from the four smallest sizes o f the original experiment.

Results

As shown in Figure 2.17, the basic result of Experiment 1 was replicated with the zoomed

patches; discrimination is poor for small apertures and improves markedly as the aperture size is

increased and a greater number o f pixels become visible. M oreover, the results were

quantitatively indistinguishable from those of Experiment 1 even though the displays were twice

as large and the aperture borders thus further from the relevant image structures. We wanted to

check if error rates were similar to those from the original experiment for patches generated from

the same number of pixels but displayed over a larger region, and the 95% confidence intervals,

plotted as error bars, confum that this is the case. There were no significant differences between

(23)

Figure 2.17. Error rates for different aperture sizes for zoomed stimuli. Error bars

denote 95% confidence intervals. Compare to Figure 2.4a.

Occlusion Classification of Zoomed Real Image Patches

Total Errors Results for 4 subjects

5^ 40

5 20

25 49

We ran a second experiment on two subjects, in which the small patches were blown up more

than the large ones, such that the display size was the same for all the apertures. Again,

discrimination was much worse for the displays generated from a small number of pixels, even

though the patch borders were now at the same eccentricity in all conditions (data not shown). It

thus appears that interference by the aperture borders is unlikely to account for our results.

We also replicated the results of Experiment 1 with images reduced (blurred and subsampled) by

a factor of two. We took the labels of the original images and weeded out points that were no

longer visible in the reduced images (those that were formed by very thin contours in the original

images, for instance). We ran two subjects in the aperture size experiment with this new set of

locations from the reduced images, and found similar proportions of detectable occlusion points

for each patch size. This controls for the possibility that the CCD used to take the pictures might

have corrupted the high frequencies through aliasing. It also suggests that the image property we

are measuring in our experiments is statistically scale invariant, i.e. that the frequencies of

detectable occlusions at different aperture sizes are independent of the scale at which the image is

sampled. This is unsurprising given that the statistics of many image properties are scale

invariant in natural images (e.g. Ruderman and Bialek, 1994), but it confirms that pixels are the

right unit with which to measure the visual information used in our task (as opposed to, say,

(24)

In sum, it seems unlikely that our effects are due to problems with our stimuli. W hat else might

(25)

Chapter 3: Psychophysics Experiments on Scale

3.1 Experiment 3: Scale

Another obvious explanation of the results of Experiment 1 involves the spatial scale of any

putative signature of occlusion. Intensity changes in images due to object boundaries and other

structures in the world occur over both large and small scales (Koenderink, 1984), and one would

expect junctions to exist at different scales as well. The detection of junctions at large scales

would require analysis of a large image region even though they might require no more

parameters for their description than do fme-scale, locally defined junctions. As illustrated in

Figure 3.1, a large-scale junction lacking crisp edges is not clearly visible when viewed through a

small aperture - a large area is needed to detect the intensity gradients defining the junction.

Note, though, that this example junction is a simple structure and could presumably be recognized

with a fairly small number of image samples, they just need to be distributed over a large region

of space. Similar examples can be found in real images; Figure 3.2 shows one, and some of the

examples shown earlier also appear to fall in this category. Such image regions appear to contain

relatively simple image features even though they require a large area in order to be visible. The

dependence of occlusion detection on aperture size might thus be the result of simple image

structures distributed across a variety of scales, and could still be consistent with the notion of

junctions as simple and locally defined cues. It must be emphasized that this question, of whether

the existence of junctions at multiple scales can account for our results, is independent of our

observation that the dependence of occlusion detection on aperture size is scale invariant. It is

hard to see how the latter sort of scale invariance could fail to hold, since that would mean that

our results would be dependent on the distance from the scenes being imaged to the camera. The

present question, in contrast, is very much open and needs to be tested.

Figure 3.1. A blurred synthetic junction viewed through apertures 13, 25, 49, and 97 pixels in

(26)

% ** AS r«^rc*^3iv A i

Figure 3.2. An example real occlusion point which appears to have simple junction structure at a

large scale due to blurred edges. Location 760.

How can we test the extent to which large-scale junctions could account for our results?

Intuitively, we want to isolate the large-scale structure of image regions whose occlusion points

are not visible through small apertures, and see if the occlusion points remain detectable when

only the large-scale structure could be underlying performance. A standard technique for

isolating structure at large scales is to blur and subsample an image (e.g. Burt and Adelson,

1983). Blurring eliminates the fine scale structure, and subsampling converts the remaining low

frequencies into high frequencies defined over a small number of pixels. The point of this is

twofold. First, simply blurring the image patches would be confounded by the fact that the

resulting patches look blurry, which could conceivably affect subjects'judgments. Subsampling

the blurred patches makes them look as sharp as the originals. Second, the subsampling step

takes the large-scale structure that was formerly distributed over many pixels and compactly

represents it over a small set of pixels. If large-scale junctions are mediating occlusion

discrimination for the occlusion points that are visible only through large apertures, these points

should be rendered visible through small apertures once the large-scale structure is isolated via

blurring and subsampling. Figure 3.3 shows the effect of blurring and subsampling the large-

scale synthetic junction of Figure 3.1. Each row was generated by viewing a blurred and

subsampled version of the previous row’s image through apertures of different sizes. The

junction is not visible in the original image when viewed through the small aperture, but once

reduced it becomes crisp and easy to see. We thus have a test of the extent to which simple

(27)

Figure 3.3. Reduced versions of a blurred synthetic junction viewed through apertures of

different sizes. The image for each row is a reduced version of the previous rows image.

Note that the junction is visible in the reduced images viewed through the small

apertures, because the large scale structure defining the junction has been isolated.

Methods

Each image location was shown in 6 conditions: through (1) small, (2) medium, and (3) large

apertures at full resolution, (4) blurred, subsampled and viewed through a small aperture (i.e. the

medium apertures reduced once) and (5) blurred, subsampled and viewed through a medium

aperture (i.e. the large apertures reduced once), and (6) blurred and subsampled twice viewed

through small apertures (i.e. the large apertures reduced twice, down to the size of the small

apertures). The extent to which junctions at large scales might be mediating occlusion

discrimination can be measured by examining whether occlusion points that are visible through a

large aperture but not through a small one become visible through a small aperture once they have

been reduced.

Unlike the previous experiments, in which each image location was included in just a single

(28)

of the experiment’s design. This led to a concern that a subject’s judgm ent for a particular

location in one condition might influence their judgment for that location in another condition

seen later in the experiment. For an occlusion point clearly visible when viewed through a large

aperture, for instance, it seemed conceivable that when presented with the same image masked by

a smaller aperture subjects might realize that they had seen it before, and base their judgment on

having already seen it as an occlusion point regardless o f the local evidence for occlusion. To

minimize the chances of that sort of thing occurring, we randomly left-right reversed and rotated

the image patches 90 degrees clockwise or counter clockwise in the different conditions. This

made it less obvious that image patches were being shown multiple times. Subjects were not

informed that this was the case, and none ever reported seeing the same thing more than once; we

suspect the repeats had little to no effect on the results.

Subjects’ instructions were the same as in the previous two experiments. All subjects were naive

as to the specific purpose of this experiment. As in the previous experiments, feedback was given

to enable subjects to perform as well as possible.

W e ran two versions of the experiment, one in which the aperture sizes were 7, 13, and 25 pixels

in diameter and the images were expanded by a factor o f 2 as in Experiment 2, and one in which

the aperture sizes were 13, 25, and 49 pixels in diameter and the images were shown at full

resolution. Images were reduced by a factor of 2 or 4 in each dimension, and the aperture sizes

were chosen so that image patches visible through the larger apertures would reduce down to the

size o f the smaller apertures. Images were reduced (blurred and subsampled) using the blurDn

function in Eero Simoncelli’s Matlab Pyramid Toolbox. A 5-tap binomial filter was used for

blurring.

Results

To test for the presence of large-scale junctions, we examined all the points of occlusion which

were visible through large apertures but not through small apertures at normal resolution. Each of

(29)

Scale Analysis of Points of Occlusion

1-, (U H

c : 3 < C 100

(U

_K

p _<

80

> tSJ) c ■£ e 00 o 60

1

₄₀ Q C c/5 o

oZ

c o 20 "u '•GCJ

u 3 0

o

13 D

O C

Experiment A

Results for 4 subjects

visible at 13 but not at 7

pixels

< t:

Experiment B

pixels

Figure 3.4. Results of Experiment 3. Error bars are 95% confidence intervals.

Figure 3.4 plots the proportion of such occlusion points that remained visible after blurring and

subsampling, for each of the three analysis conditions in each of the two experiments. The results

of the two experiments are fairly similar: between 60 and 80 percent of the occlusion points that

(30)

subsampling, even though they are represented with far fewer pixels (1/4 or 1/16 as many,

depending on the analysis condition). This is good news for junction lovers, since it suggests that

in many cases in which occlusion points require a large region of analysis to be detected, there is

nonetheless a fairly simple signature of their presence.

3.2 Scale Exam ples

Figures 3.5-3.10 show some examples for which this is clearly the case. As before, each row

consists of the same image viewed through apertures of different sizes. The image for each row

is a blurred and subsampled version of the image from the previous row. The aperture sizes are

arranged such that the image patches in each column cover the same region of the original image

at different scales. Occlusion is hard to see through the small apertures at high resolution, but is

much clearer after blurring and subsampling when the large-scale structure is isolated, mainly

because T-junctions become evident.

Figure 3.5. A real point of occlusion displayed at successively coarser scales, viewed

(31)

n - -s a r a '

i # sT#!

k; \ ;'.

m ^ m r

f

# 0

9%»^ ‘_* f A

Figure 3.6. Location 761.

:ïAë7.!:.j

m

:

(32)

Figure 3.8. Location 516.

«%

(33)

_____

N

■

lE .jgl' ,$%

,3 2r'S

I #'-T^:;: r ' W--

■'■ f i

:’

y * ^ ; y N *. I* V

0 9

9 0 JWKh*.*.

(34)

(35)

However, upon inspection of other examples, it becomes clear that in some cases what appears to

be evidence for occlusion after blurring and subsampling is in some sense artifactual.

Figure 3.13. Example of an occlusion point that is artifactually visible given local

information at a large scale. See text for details. Location 702.

Figure 3.13 shows one such example. The contour being occluded at the center of the patch is the

edge of a tree trunk which happens to be fairly low contrast and in part visible due to a texture

boundary. It is not visible until a large region is available for analysis. When the image is

reduced, the vertical edge of the labeled occlusion point becomes invisible, but the shading on the

tree trunk produces a high contrast edge that is shifted towards the center of the patch by the

reduction process. This image was classified as an occlusion point, but the T-junction responsible

for the judgment is not the product of the occlusion point under consideration, and is not even

produced by something that was labeled as an occlusion point in the original image. The fact that

subjects make the correct response for a stimulus such as this is thus an artifact of the reduction

process. The response is artifactual in the sense that the apparent large-scale signature of

(36)

in the world that was originally labeled, either being due to another nearby occlusion point

brought closer to the center of the image by the reduction process, or an accidental large-scale

image structure, and 2) presumably not responsible for the (correct) percept of occlusion in the

full-scale patches.

3.3 Scale Artifacts

W e are thus probably overestimating the extent to which occlusion points are often indicated by

large-scale occlusion cues; some of what appear to be occlusion cues are artifacts. Fortunately,

we can estimate the frequency with which these artifactual occlusion signatures are produced by

examining our subjects’ occlusion judgments on the randomly selected patches. W e looked at

those randomly selected patches which were correctly labeled as not being occlusion points at

two aperture sizes, and computed the proportion which were then labeled (incorrectly) as

occlusion points when they were reduced from the size of the large aperture to that of the small

one. Since these points were originally selected at random and did not appear to be occlusion

points at either aperture size at full resolution, we can be reasonably sure that the appearance of

occlusion in the reduced image is an artifact of the reduction. M oreover, the appearance of

occlusion in the reduced image clearly is not the determinant o f occlusion judgments for the full

resolution images, because we have selected patches that at full resolution are judged to not be

occlusion points. One might thus say that the large-scale structure at such points is both

physically artifactual (in that they are not really due to an occlusion point at the center of the

patch) and psychologically artifactual (in that they are not responsible for occlusion percepts at

full resolution). Figures 3.14-3.17 show several examples of image patches from our randomly

(37)

I

Figure 3.14. Example of a randomly selected patch which at large scales

(38)

^ â

(39)

m m

9

=

0#0B

M S P

m m »

##

#m

dmwMN

T.

#P

Figure 3.16. Additional example of an artifactual large-scale occlusion signature.

(40)

1#

Figure 3.17. Additional example of an artifactual large-scale occlusion

signature.

Figure 3.18 plots the proportion of these artifactual occlusion judgments for each of the three

analysis conditions in each experiment. On average, blurring and subsampling introduces

artifactual occlusion signatures in around 20 percent of the randomly selected patches. Assuming

that the frequency of these artifacts is the same for image regions centered on actual points of

occlusion and those randomly selected, we can estimate that on the order of 20 percent of the

occlusion points visible after blurring and subsampling are artifactual: the large-scale structure,

although apparently consistent with occlusion, is neither physically due to occlusion nor

psychologically responsible for occlusion judgments at full resolution. The proportion of

(41)

Artifactual Large-scale Occlusion Signatures

Experiment A

§ C o 100 u o

o

'- B c o U 80 60 > '.n CL o 40

Oh c_(U

M £ c3 20 ~o 3 •—> 0 correctly rejected at both 13 and 7

pixels

correctly rejected at both 25 and 7

pixels

correctly rejected at both

25 and 13 pixels

Experiment B

1 1

II

P

il

tin -3^

100 80 60 40 20

_ É _

correctly correctly correctly rejected at both rejected at both rejected at both

25 and 13 49 and 13 49 and 25

pixels pixels pixels

(42)

3.4 Nonlocally Defined Occlusion Points

Figures 3.19-3.24 show several examples which fall in this category; these occlusion points do

not become visible until viewed through large apertures, and blurring and subsampling them does

not preserve this visibility. This is for a variety of reasons. In some cases one or more of the

contours is not visible at the larger scales (because lines, unlike step edges, are not scale

invariant); in other cases spurious large-scale structures are introduced which seem inconsistent

with occlusion. Perhaps a quarter of our occlusion points fall in this category; their detection

evidently requires a mechanism that can make use of all of the pixels in a large region.

F C" , ■ .

Figure 3.19. Example of an occlusion point whose visibility requires a large number of pixels

regardless of the scale of analysis. Here the stem of the junction is a thin contour that is

(43)

Figure 3.20. Another example of an occlusion point whose

visibility requires a large number of pixels regardless of the

(44)

Figure 3.21. Another such example, with four aperture sizes instead of three. Here the branches

are difficult to individuate for all but the largest apertures.

A

, ^ S'

Figure 3.22. Another example. The structure of the woman’s shirt sleeve near the occlusion

(45)

Figure 3.23. Here the thin stem contour is again lost at large scales and hardly present locally.

(46)

3.5 Scale Selection

But although many occlusion points seem to necessitate analysis of fine scale structure over a

large region, a substantial portion produce fairly simple signatures that are visible at fine scale,

and the remainder seem to produce fairly simple signatures only at a large-scale. If there was a

simple way to select the scale of analysis, a large proportion of occlusion points could thus be

detected using fairly simple mechanisms. The selection of the correct scale is critical, though.

The existence of artifactual junctions at large scales discussed earlier implies that one cannot

simply take a junction at a large-scale as evidence for occlusion; large-scale structures are

frequently spurious. Moreover, spurious large-scale structure generally does not "fool" the

human visual system, implying that scale selection mechanisms are at work in human vision.

W hat might they involve?

M any theories of scale selection rely on the magnitude of image gradients at different scales to

choose the scale of analysis. Elder and Zucker (1998), for instance, have proposed an elegant

framework that chooses the minimum reliable scale for edge detection by comparing gradient

magnitudes at different scales with the values that would be produced by sensor noise alone (see

also Morrone, Navangione, and Burr, 1995; Lindeberg, 1996). Edge detection at a particular

location is conducted at the minimum scale where the gradient exceeds these noise levels. This

prevents spurious responses to noise (to which fine scale operators are more vulnerable) while

minimizing the interference of neighboring image structure (which is a problem at large scales, as

we have seen). One could envision a similar scheme for junction detection at different scales;

detection would occur at the minimum scale at which the gradient magnitudes exceeded the levels

expected just from noise.

For some junctions, such a strategy seems reasonable. Consider, for instance, the blurry synthetic

junction of Figure 3.1 that was used as a motivating example: there is very little high frequency

structure. W hen viewed through a small aperture at high resolution, little is visible other than a

few blurry image gradients, if that. Derivative magnitudes computed for this junction will be

(47)

the pole that has an unusual shape due to the pole’s curvature. When viewed through a small

aperture, the shadow is hard to interpret as such and prevents most observers from correctly

categorizing the image as an occlusion point. When observers have access to a larger region of

the image, the shape of the pole becomes evident and the shadow starts to look like a shadow, and

it becomes obvious that there is a point of occlusion at the center of the image. Notably, most

observers continue to classify this image as an occlusion point after it is reduced. This is because

the reduction erases the small shadow, leaving what appears to be a fairly crisp T-junction.

However, the high frequency structure in the image is salient and inconsistent with the large-scale

T-junction. The scale selection methods outlined above would clearly fail to isolate the large-

scale structure in images such as this - there is abundant high contrast, high frequency structure,

and no obvious bottom-up method with which to select the large-scale for junction analysis. We

cannot directly estimate the frequency of this situation from our data, but inspection of our stimuli

suggests that it is quite common.

Figure 3.25. Example of an occlusion point where a cast shadow drastically alters the

local intensity signature. Although there is a junction at coarse scales, how should one

choose to ignore the fine scale shadow unless it has already been recognized as such?

Thus even though points of occlusion in many cases generate large-scale junctions, the utility of

(48)

selecting the proper scale of analysis. If scale is not properly selected, false positive responses to

spurious large-scale structure will result. In many cases, it is not obvious that there is a bottom-

up method for scale selection (e.g. in our shadow example), in which case the presence of large-

scale junctions does not simplify the analysis of the large pixel arrays needed to detect many of

our points of occlusion. In our shadow example, the shadow may need to be recognized as a

shadow before occlusion can be identified.

The shadow example also highlights a shortcoming of our paradigm, namely that it is difficult to

say anything specific about why so many pixels are needed to detect certain points of occlusion.

The fact that observers require a large number of pixels is consistent with both bottom-up (low

contrast contours must be integrated across large numbers o f pixels to be detectable, textures

require large numbers o f pixels to be segregated, etc.) and top-down (shadows must be correctly

interpreted) accounts of how occlusion points are being detected. For individual examples, we

can speculate what is going on, but it is not obvious how to assess this issue experimentally

across an ensemble o f occlusion points. For shadows, presenting the images in negative might be

an effective way of testing how often shadow interpretation is critical to detecting occlusion

points, since shadows are generally interpreted as such only when they are dark (Cavanagh and

Leclerc, 1989; Anstis, 1992), but in general we lack a method to assess the importance of high

level interpretation.

3.6 General Discussion of Psychophysics

We have introduced a method for studying local occlusion cues in real images. Our results offer

some new insights into junctions and their detection that would be difficult to obtain with

traditional psychophysics experiments using synthetic stimuli, because synthetic stimuli do not

reflect the demands of the real world. Our results suggest that although junctions as traditionally

conceived do indeed occur at a significant number of occlusion points in real images, many other

occlusion points lack a simple, local signature. The recovery o f occlusion geometry may thus

(49)

stimuli are poorly controlled and the judgments ill-defined in comparison to traditional

psychophysics with synthetic stimuli. W e think that in many instances these difficulties are not

insuperable, in which case our approach may be a useful addition to the vision scientist’s toolbox.

Nonetheless, there are a number of potential concerns with our paradigm that we wish to address.

First, the success of our method depends critically on the ability o f observers to perform the task

effectively. One might argue that people simply might not know what characterizes occlusion

points in real images, since they are not used to seeing them in isolation. Perhaps as a result of

this they might be impaired at our task irrespective o f the available local information. This is a

possibility we can never fully rule out, but we did take a num ber of steps to minimize the chances

of this. Stimulus duration was unlimited, and subjects were encouraged to take as long as they

needed on each trial. The responses to our synthetic junctions inserted as catch trials helped to

insure that subjects were doing something reasonable. Inspection of their responses to various

stimuli further confirm this. Furthermore, our subjects received feedback on their responses and

were instructed to use the feedback to maximize their performance. Subjects were initially given

extensive practice prior to taking part in the actual experiments, and most o f our subjects came

back for multiple sessions. Subjects did improve somewhat with practice, but even our most

practiced subjects were unable to make fewer than 20 percent errors in classifying real image

patches through the smallest apertures we used. In addition, the task we required subjects to

perform was both natural to our subjects and of obvious ecological relevance. Given all this, it

seems likely that our subjects were fairly effective at picking out local occlusion cues when they

were present.

A second area of concern is the choice of randomly selected image patches as negative examples

o f points of occlusion. The negative examples are critical to investigating occlusion detection,

since otherwise subjects could simply label everything a point of occlusion without even looking

at the stimuli. Given that we are interested in how occlusion points might be detected in real

images, patches of real images not centered on occlusion points seemed the logical thing to use as

negative examples. As we have discussed, many randomly selected patches of real images are

fairly uniform at the center, and thus trivially easy to distinguish from a likely point of occlusion.

To force subjects to make a more specific discrimination, we screened out those randomly

selected patches whose intensity variance was below a threshold and added synthetic edges as

additional negative examples, but these were essentially arbitrary steps. The negative examples

(50)

make an even more specific judgment, and only label points of occlusion as such if they are

signified by very well-defined junctions.

A third concern, also alluded to earlier, is also related to the negative examples. A consequence

of using randomly selected image patches as the negative examples is that some fraction of them

were actually points o f occlusion. The presence of these impostor stimuli results in inflated false

positive levels; many o f the false positive errors subjects make are not really errors. Again, a

more carefully chosen set of negative examples could avoid this problem.

Addressing each of these issues differently could certainly alter our quantitative results to some

degree, but it seems unlikely that our qualitative conclusions would be substantially affected. We

are hopeful that such concerns could be dealt with in other contexts as well.

Although our methodology could conceivably be applied to a variety o f topics in early vision, it

could also be used to further investigate occlusion detection in human vision. Our results with

synthetic junctions suggest that our paradigm can be used to measure the extent to which any

arbitrary synthetic stimulus is interpreted as being due to occlusion. W e found that when

intermixed with real image patches, our synthetic T-junctions were nearly always judged to be

points of occlusion whereas our synthetic L-j unctions were not. This is consistent with other

psychophysical results on the interpretation of isolated junctions (McDermott and Adelson,

unpublished manuscript). This suggests an attractive method for investigating other stimulus

properties potentially related to occlusion. Synthetic stimuli could be generated that vary along

dimensions of interest, and inserted alongside the real image patches in an experiment similar to

those we have described. The extent to which they are judged to be occlusion points can be taken

as a measure of the strength of the occlusion cue being manipulated. For instance, Stevens and

Brookes (1988) have proposed that concave cusps offer a strong cue to occlusion, and constructed

a number of complicated displays as evidence for this. One could perhaps better isolate the effect

by incorporating concave cusps into an occlusion detection experiment with real images. With

larger displays one could also conceivably study the influence of context on the interpretation of

(51)

enhance even the local sense of depth and three-dimensionality, which in turn can greatly

strengthen the percept o f occlusion. Contour curvature also made a substantial contribution.

M any junctions formed by straight edges with sectors of uniform luminance convey very weak

percepts of occlusion, presumably because they have other plausible explanations. These

observations are informal, but suggest that there may be a rich assortment o f local occlusion cues

that could be studied in the manner we have discussed.

W e can draw a number of conclusions from the experiments described so far. First, we have

addressed the question of whether junctions are in fact locally defined. It appears that although a

fraction of junctions are indeed detectable given purely local information, many others require

larger image regions in order to be seen. Many junctions do not become visible until the image

region containing them is sufficiently large that many other things are visible besides just the

junction: other edges, other junctions, patches of texture etc. Moreover, often this is not simply a

matter of selecting the correct scale of analysis, a large num ber o f pixels are needed to detect

many junctions even at the appropriate scale. This suggests that any mechanisms for detecting

junctions are performing a nontrivial analysis of a high dimensional input space. The negative

spin on these results is that junctions may not be all they are cracked up to be - a significant

fraction are not locally detectable, and thus may not be good candidates for foundational elements

o f vision. The flip side to this, though, is that a significant fraction of junctions are locally

detectable. This means that local mechanisms, operating on image regions on the order of 10-15

pixels in diameter, ought to be able to detect this portion of the things that people label as

junctions when viewing the whole image. We therefore attempted to learn to detect junctions

(52)

Chapter 4: Neural Networks for Junction Detection

4.1 General Motivation

First of all, why should one try to build a junction detector? W e had several motivations. First, a

functioning junction detector would be useful for a variety o f image interpretation tasks, and

particularly for object recognition algorithms which depend on the shape o f the bounding contour.

Second, a functioning junction detector which operates on local image regions could further

corroborate our psychophysics results showing that many junctions are locally defined and could

play an early role in human vision. Third, the way that a working junction detector works might

provide clues to how the brain might solve the same problem.

Our main motivation, though, was that a functioning junction detector, regardless of its biological

plausability, would be useful for measuring junction statistics. O f primary interest are the

probabilities with which junctions are actually associated with the scene events like occlusion to

which they are purported to be cues. Figure 4.1 shows an image o f a zebra whose stripes produce

a host of T-junctions. Notably, though, nearly all of them DO NOT indicate occlusion. T-

junctions like these, that are not due to occlusion, are easy to find once you start looking for them.

It remains unclear just how common they are relative to the generic type caused by occlusion.

W ithout a principled method of systematically finding junctions, we lack a good method with

which to measure the frequencies of generic and non-generic T-junctions and thus measure the

extent to which T-junctions are in fact good cues. Assuming one could produce a mechanism

capable of labeling local T-junctions with reasonable accuracy, one could then scan it across a

collection of images and check the degree of correspondence between points of occlusion and T-

junctions. To be able to do this, and hence compute the cue probability, would be of substantial

(53)

^ ___

Figure 4.1. A zebra, which generates a host of T-junctions nearly all of which are

inconsistent with the border ownership

There are thus a number of reasons to try to produce a good junction detector. Unfortunately, this

is easier said than done. Previous attempts to detect junctions have mostly been "top-down":

people have started with some idea of what is appropriate, and then tried to make it work.

Junction detection: Computation and psychophysics