Seeing, sensing, and selection: modeling visual perception in complex environments


RIT Scholar Works

Theses

Thesis/Dissertation Collections

9-29-2003

Seeing, sensing, and selection: modeling visual perception in complex environments

Roxanne Canosa

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Dissertation is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].

Recommended Citation

Seeing, Sensing, and Selection: Modeling Visual Perception in Complex Environments

Roxanne Canosa

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Center for Imaging Science in the College of Science

Rochester Institute of Technology

September 29, 2003

Signature of Author _

Roxanne Canosa

Accepted by _

THESIS RELEASE PERMISSION

ROCHESTER INSTITUTE OF TECHNOLOGY

COLLEGE OF SCIENCE

Seeing, Sensing, and Selection:

Modeling Visual Perception in Complex Environments

I, Roxanne Canosa, hereby grant permission to the Wallace Memorial Library of RIT to reproduce my dissertation in whole or part. Any reproduction will not be for commercial use or profit.

Signature of Author _

Roxanne Canosa

CENTER FOR IMAGING SCIENCE

COLLEGE OF SCIENCE

ROCHESTER INSTITUTE OF TECHNOLOGY

ROCHESTER, NEW YORK

CERTIFICATE OF APPROVAL

Ph.D. DEGREE THESIS

The Ph.D. degree of Roxanne Canosa

has been examined and approved by the dissertation committee

as satisfactory for the dissertation requirement for the

Doctor of Philosophy degree in Imaging Science

Dr. Jeff B. Pelz, Dissertation Advisor

Dr. Marc Marschark, Committee Chair

Dr. Julie A. Adams

Dr. Dana H. Ballard

Dr. Roger S. Gaborski

I would like to thank and acknowledge all those who helped to make this dissertation possible. Thanks to Jeff Pelz for his guidance, support, and encouragement over the many years of work that were required to bring this research to fruition. None of this would have been possible without his vast knowledge of the subject matter, technical expertise, and cheerful willingness to discuss and explore new ideas. Thanks also to Jason Babcock for the countless hours spent developing and fine-tuning the eye-tracker, and for sharing his ideas on experimental design and data analysis. This work was built upon earlier work conducted by Jeff Pelz, Mary Hayhoe, and Dana Ballard at the University of Rochester, and also Roger Gaborski at the Rochester Institute of Technology. Thanks for their insights and for providing the "shoulders upon which I stand."

I am also greatly indebted to Julie Adams for her thorough review of this work and her many useful and enlightening comments. The skills and ideas of Marianne Lipps, Constantin Rothkopf, and Vishal Vaingankar have proved to be invaluable, and I am grateful for our many discussions that helped to form and focus this dissertation. Also, thanks to Marc Marschark for his support. This work was supported in part by the Naval Research Laboratories, the New York State Office of Science, Technology, and Academic Research, and the Xerox Corporation. Finally, I owe my greatest appreciation to John, Elyse, Sandra, and Dennis Canosa, for their continued love and support through

The author was born in Rochester, New York and has spent virtually her entire life there. She attended college as a first-generation college student at the State University of New York College at Potsdam, earning a Bachelor of Arts degree in Art History in 1980. She then returned to Rochester and earned an Associate of Applied Science degree in Optical Engineering Technology from Monroe Community College in 1983, and studied Electrical Engineering part-time at the Rochester Institute of Technology. While raising three children and working at JML Optical Industries, Inc., and Eastman Kodak Company, she returned to school to earn yet another degree. In May of 1998 she earned the Bachelor of Science degree in Computer Science at the State University of New York College at Brockport, and the following fall began graduate studies in Imaging Science at RIT. In October of 2000 she earned the Master of Science degree in Imaging Science, and began doctoral studies under the direction of Professor Jeff Pelz. She is currently an

Abstract

The purpose of this thesis is to investigate human visual perception at the level of eye movements by describing the interaction between vision and action during natural, everyday tasks in a real-world environment. The results of the investigation provide motivation for the development of a biologically-based model of selective visual perception that relies on the relative perceptual conspicuity of certain regions within the field of view. Several experiments were designed and conducted that form the basis for the model. The experiments provide evidence that the visual system is not passive, nor is it general-purpose, but rather it is active and specific, tightly coupled to the requirements of planned behavior and action. The implication for an active and task-specific visual system is that an explicit representation of the environment can be eschewed in favor of a compact representation with large potential savings in computational efficiency. The compact representation is in the form of a topographic map of relative perceptual conspicuity values. Other recent attempts at compact scene representations have focused mainly on low-level maps that code certain salient features of the scene including color, edges, and luminance. This study has found that the low-level maps do not correlate well with subjects' fixation locations; therefore, a map of perceptual conspicuity is presented that incorporates high-level information. The high-level information is in the form of figure/ground segmentation, potential object detection, and task-specific location bias. The resulting model correlates well with the fixation densities of human viewers of natural scenes, and can be used as a pre-processing module for image understanding or

List of Figures

List of Tables

1. Introduction

1.1 Overview

1.2 Problem statement

1.3 Outline of the presented work

2. Background

2.1 Historical perspective

2.2 The human visual system

2.2.1 Image formation

2.2.2 Center-surround organization of receptive fields

2.2.3 Contrast sensitivity function

2.2.4 Opponent processes

2.2.5 Eye movements

2.3 Visual attention and selectivity

2.3.1 The influence of attention on neural response

2.3.2 Orienting of attention

2.3.3 Behavioral data on selectivity and capacity limitations

2.4 Task-oriented vision

2.4.1 Task-dependency of visual scanpaths

2.4.2 Limited memory representations

2.4.3 Natural tasks

2.5 Computational modeling of visual attention

2.5.1 Hierarchical models of attention

2.5.2 Connectionist models of attention

2.5.3 Graphical models of attention

3.2 The benefits of eye-tracking

3.3 Eye-tracking - theory of operation

3.3.1 Bright-pupil detection

3.3.2 Calculation of eye position

3.4 The VPL portable eye-tracker

3.4.1 The optics module and mirror

3.4.2 The eye camera

3.4.3 The scene camera

3.4.4 The LASER

3.4.5 The control unit

3.4.6 Eye-tracker setup and calibration

3.4.7 Eye movement monitoring

3.4.8 Portable eye-tracker precision, accuracy, and noise

3.5 The ASL model 501 eye-tracker

3.5.1 Integrating head movements

3.5.2 ASL eye-tracker precision, accuracy, and noise

3.5.3 Estimation and correction of accuracy loss

3.5.4 Fixation finding

4. Modular Visual Routines

4.1 Introduction

4.2 Method

4.3 Results

4.3.1 Mean fixation durations of tasks - pooled data

4.3.2 Variance of fixation duration - pooled data

4.3.3 Statistical differences between subjects

4.3.4 Mean saccade amplitude of tasks - pooled data

4.3.5 Variance of saccade amplitude - pooled data

4.3.6 Statistical differences between subjects

4.4 Discussion

5. Task-dependencies of Fixation Locations

5.1 Introduction

5.2 Fixation locations in a simple environment

5.2.1 Method

5.2.2 Results

5.2.3 Discussion

5.3 Fixation locations in an extended environment

5.3.1 Method

5.3.2 Results

6. The Conspicuity Map

6.1 Overview

6.2 Model description

6.2.1 Input image processing

6.2.2 The low-level saliency map

6.2.3 High-level proto-object map

6.2.4 Expected location mask

6.3 Verification of model using eye-tracking methods

6.3.1 Data collection

6.3.2 Comparison of fixation densities to model predictions

6.3.3 Determination of map weights

6.4 Natural-task images

6.4.1 Comparison to extended environment

6.4.2 Free-view and multi-view

6.4.3 Estimation of location bias

6.4.4 Expected locations

6.5 General discussion and conclusion

7. Conclusion

References

Figure 2-1 Cross-section of the human eye, depicting image formation components

Figure 2-2 Spectral sensitivities of the three types of cones. Measurements include light loss due to absorption from the cornea, lens, and other pigments in the eye

Figure 2-3 Receptive fields of two types of retinal neurons: on-center/off-surround and off-center/on-surround. Yellow areas indicate locations of light stimulus

Figure 2-4 Contrast sensitivity function with example spatial frequencies and on-center/off-surround neuron tuned to the peak response

Figure 2-5 Study showing that scan paths are task dependent. Original painting of I. E. Repin's Unexpected Return is at upper left, with five example scanpaths for a single subject who viewed the painting while being asked to formulate answers to the various questions

Figure 3-1 Image of the pupil (white) and corneal reflection (black) as detected by the eye camera. Centers are indicated by crosshairs. A slight offset between the actual centers of the images and the displayed centers is due to a timing offset during data capture, and does not affect calculation of eye movement amplitude and direction

Figure 3-2 Calculation of the line-of-gaze

Figure 3-3 Portable eye-tracking headgear and backpack

Figure 3-4 Top view of headgear

Figure 3-5 Optics module

Figure 3-6 Diffraction pattern used for calibration

Figure 3-7 Eye movement trace after calibration. The subject was instructed to look at

seconds each

Figure 3-8 Vertical eye position

Figure 3-9 Horizontal eye position

Figure 3-10 Expanded view of Figure 3-8

Figure 3-11 Expanded view of Figure 3-9

Figure 3-12 Eye-tracker noise, no averaging

Figure 3-13 Eye-tracker noise, two field ave

Figure 3-14 Eye-tracker noise, four field ave

Figure 3-15 Eye-tracker noise, eight field ave

Figure 3-16 Average angular deviation for each of the nine calibration points at the start of the experiment, across eight subjects

Figure 3-17 Average angular deviation for each of eight subjects at the start of the experiment, across nine calibration points

Figure 3-18 Average angular deviation for each of the nine calibration points at mid-experiment, across six subjects

Figure 3-19 Average angular deviation for each of six subjects at mid-experiment, across nine calibration points

Figure 3-20 ASL model 501 eye-tracker

Figure 3-21 Deviations from calibration target points at the start of the experiment, before and after correction, across eleven subjects

Figure 3-22 Deviations from calibration target points at the end of the experiment, before and after correction, across eleven subjects

Figure 3-23 Deviations from calibration target points at the start of the experiment, before and after correction, across nine points

Figure 3-24 Deviations from calibration target points at the end of the experiment, before and after correction, across nine points

Figure 3-25 Raw ASL eye-head data

Figure 4-1 Relative frequency of fixation durations for subjects JB and JP for Reading, Search, and Manipulations during rocket-building

Figure 4-2 Fixation sequences for three sub-tasks in the rocket-building task - Reading, Search, and Manipulation. Bars indicate periods of fixation, spaces indicate gaze changes between fixation points

Figure 4-3 Walking along a hallway

Figure 4-4 Having a face-to-face conversation

Figure 4-5 Telephone conversation

Figure 4-6 Sorting cards

Figure 4-7 Sorting blocks

Figure 4-8 Reading poster

Figure 4-9 Reading form

Figure 4-10 Counting change

Figure 4-11 Counting red blocks

Figure 4-12 Mean fixation duration for each of the nine tasks, pooled across all eight subjects

Figure 4-13 95% confidence interval of the mean fixation durations for each of the tasks. A statistically significant difference between two tasks exists if there is no overlap of the corresponding confidence intervals. Center dots represent the mean values

Figure 4-14 The gamma density function with α = 2 and β = 1

Figure 4-15 Walk hall pooled data

Figure 4-16 Talk conversation pooled data

Figure 4-17 Talk telephone pooled data

Figure 4-18 Sort cards pooled data

Figure 4-19 Sort blocks pooled data

Figure 4-21 Read form pooled data

Figure 4-22 Count change pooled data

Figure 4-23 Count blocks pooled data

Figure 4-24 Relationship between mean and standard deviation for all of the tasks. From left, tasks are: RF, CC, RP, SC, SB, TT, WH, CB, TC

Figure 4-25 Mean fixation duration for each subject, all tasks

Figure 4-26 Calculation of visual angle from field of view

Figure 4-27 Mean saccade amplitude for each of the nine tasks, pooled across all eight subjects

Figure 4-28 Mean saccade amplitude for each subject, all tasks, with standard error bars

Figure 4-29 95% confidence intervals of the mean saccade amplitudes for each of the nine tasks. An overlap between two or more intervals indicates that there is no statistically significant difference between the corresponding mean values

Figure 4-30 Walk hall pooled data

Figure 4-31 Talk conversation pooled data

Figure 4-32 Talk telephone pooled data

Figure 4-33 Sort cards pooled data

Figure 4-34 Sort blocks pooled data

Figure 4-35 Read poster pooled data

Figure 4-36 Read form pooled data

Figure 4-37 Count change pooled data

Figure 4-38 Count blocks pooled data

Figure 4-39 Relationship between the mean and the standard deviation of saccade amplitude. From the left, tasks are RP, CC, CB (lower), RF, SB (lower), TC,

Figure 5-1 Block copying task. This is the display that was shown on the computer screen. The display subtended an area of 17° x 13° visual angle. A trace of the eye movement and of the hand movement is shown as arrows connecting the different regions

Figure 5-2 Eye movement strategies used for block copying task. Relative frequencies of each strategy from a sample containing approximately fifty block moves for each of seven subjects

Figure 5-3 Fixation duration as a function of task difficulty for a driving task

Figure 5-4 Sorting Cards

Figure 5-5 Sorting Blocks

Figure 5-6 Copy-model-same-room

Figure 5-7 Model from copy-model-different-room

Figure 5-8 Resource and Workspace from copy-model-different-room

Figure 5-9 Amount of time taken by each subject to complete each of the four tasks. The tasks along the x-axis are ordered according to the order of performance by Group 1 (subjects B, D, F, and H). The first four bars for each task correspond to the Group 1 subjects, and the second four bars correspond to the Group 2 subjects (A, C, E, and G) who performed the tasks in the reverse order

Figure 5-10 Division of time between the two different regions - sorting blocks and sorting cards

Figure 5-11 Division of time between three different regions - copy-model-same-room and copy-model-different-room

Figure 5-12 Depiction of four extended environments used for the portable eye-tracking study. Clockwise from the top left, Washroom, Hallway, Office, and Vending

Figure 5-13 Relative amounts of time spent on different objects in the Washroom environment, pooled across all fixations and all subjects

Figure 5-14 Washroom environment. Time spent fixating objects as the tasks progress for Subject T. Tasks are, from the top, "Wash your hands," "Fill a cup with water," and "Comb your hair."

Figure 5-15 Relative amounts of time spent on different objects in the Hallway environment, pooled across all fixations and all subjects

Figure 5-16 Relative amounts of time spent on different objects in the Office environment, pooled across all fixations and all subjects

Figure 5-17 Relative amounts of time spent on different objects in the Vending environment, pooled across all fixations and all subjects

Figure 5-18 Hallway environment. Time spent fixating objects as the tasks progress for Subject T. Tasks are, from the top, "Throw something in the garbage," "The fire alarm just went off," and "Find a bathroom."

Figure 5-19 Office environment. Time spent fixating objects as the tasks progress for Subject U. Tasks are, from the top, "Get supplies from the closet," "Work at the computer," and "Make a photocopy."

Figure 5-20 Vending machine environment. Time spent fixating objects as the tasks progress for Subject U. Tasks are, from the top, "Check for Skittles," "Buy a Snickers bar," and "Check for change."

Figure 5-21 Comparison of fixation types

Figure 6-1 Construction of the Conspicuity Map

Figure 6-2 Creation of the color map from photoreceptor responses. Upper left is input image, upper right is C1 (red/green) signal, lower left is C2 (blue/yellow) signal, and lower right is the resulting color map. Dark blue areas in the signal maps correspond to low signal values, yellow corresponds to medium signal values, and red corresponds to high signal values

Figure 6-3 Intensity map for example input image of Figure 6-2

Figure 6-4 The seven levels of the multi-resolution Gaussian pyramid for the example input image

Figure 6-5 The seven Gaussian convolution kernels of the Laplacian pyramid. They are used to create the bandpass filters that detect a specific range of frequencies in the input image. The spatial domain representation is shown in the top row, and the corresponding frequency domain representation is shown in the bottom row

Figure 6-6 Six bandpass filters used to detect frequencies of a particular range in the input image. F1-F2 is elliptical in shape because f1 is odd-sized (5x5) and f2 is

Figure 6-7 One-dimensional frequency response characteristics of the six bandpass filters shown in Figure 6-6. Note that only the right half of the response curves is shown, i.e., they are symmetrical about the origin

Figure 6-8 Six levels of the Laplacian edge cube (difference-of-Gaussians) derived from the seven levels of the Gaussian pyramid (labeled L0 - L6) after weighting each Laplacian level by the response from the contrast sensitivity function. Top row from left L0-L1, L1-L2, and L2-L3. Bottom row from left L3-L4, L4-L5, and L5-L6

Figure 6-9 Basis functions of the Gabor filters used to model the tuning of receptive fields in area V1 of striate cortex. From left, 0°, 45°, 90°, and 135°

Figure 6-10 Four oriented edge signals and resulting oriented edge map. Top row from left, 0°, 45°. Middle row, from left, 90°, 135°. Bottom row, false-colored oriented edge map

Figure 6-11 Low-level feature maps and resulting low-level saliency map. Top row, from left, color map and intensity map. Bottom row from left, oriented edge map and saliency map

Figure 6-12 Creation of the binary proto-object map. Top row from left, input image, estimation of background, and foreground segmentation. Bottom row from left, after thresholding the foreground image and Canny edge detection, after dilation, and after hole filling and erosion

Figure 6-13 Comparison of F/M ratios for 76 images in set A, free-view condition. Three maps were generated for each image, as given in Equations 6-15 through 6-17. Images numbered 44-47, 64-66, 69-71, and 80-83 are duplicate images, and are not shown here

Figure 6-14 Comparison of F/M ratios for 76 images in set B, free-view condition. Three maps were generated for each image, as given in Equations 6-15 through 6-17. Images numbered 1-4, 44-47, 67, 68, and 80-84 are duplicate images, and are not shown here

Figure 6-15 Mean F/M ratios for the three different maps, averaged across all 152 images

Figure 6-16 Example images and overlaid fixation plot for which the optimal weights were found using the random weight generation method. The corresponding weighted conspicuity map (C-Map) is shown beneath each image. From left, A1, A28, B17

Figure 6-17 F/M ratio for set A images, free-view condition, using the C-Map. The F/M ratios for the other three maps are the same as shown in Figure 6-13, and are

74 is off the chart and has a value of 6.89

Figure 6-18 F/M ratio for set B images, free-view condition, using the C-Map. The F/M ratios for the other three maps are the same as shown in Figure 6-14, and are included for comparison to the C-Map

Figure 6-19 Mean F/M ratios for all 152 images using four different maps

Figure 6-20 Four natural-task images with overlaid fixation plots from one subject, free-view condition, and corresponding maps. From left, Washroom (A1), Hallway (A2), Office (A3), and Vending (A4). Maps are, from top to bottom, the CIE map, the P map, the CIEP map, and the C-Map

Figure 6-21 Fixation density plots for free-view and three multi-view conditions for four images. Images are from top, Washroom, Hallway, Office, and Vending Machine

Figure 6-22 F/M ratios for free-view and multi-view conditions for the four natural-task images. A comparison is shown between the low-level CIE map and the high-level perceptual conspicuity C-Map for each image

Figure 6-23 F/M ratios for 1000 randomly generated fixation locations

Figure 6-24 F/M ratios computed for mixed image and fixation data. Each chart is for one of the four images for which two maps were computed, CIE map (saliency) and C-Map (conspicuity). The free-view fixation data is indicated along the x-axis

Figure 6-25 Histograms of fixation distances from the center of each image

Figure 6-26 F/M ratios for random fixations restricted to 1/4 image size distance from center, and 1/16 image size distance from center

Figure 6-27 Nine grid locations used to compute the expected location map

Figure 6-28 F/M ratios for different locations in the C-Map, found by turning on a single

List of Tables

Table 3-1 Newell's temporal hierarchy of brain organization

Table 4-1 Time in seconds and number of fixations (in parenthesis) per task for the eight subjects who performed the experiment

Table 4-2 Order of tasks for Group 1 and Group 2

Table 4-3 Task abbreviations

Table 4-4 Pairwise comparisons for significant differences in fixation durations between tasks. An X indicates that a statistically significant difference exists between the corresponding tasks in the row and column

Table 4-5 Hallway Walking (WH)

Table 4-6 Conversation (TC)

Table 4-7 Telephone Talking (TT)

Table 4-8 Sorting Cards (SC)

Table 4-9 Sorting Blocks (SB)

Table 4-10 Reading Poster (RP)

Table 4-11 Reading Form (RF)

Table 4-12 Counting Change (CC)

Table 4-14 Subject A task differences

Table 4-15 Subject F task differences

Table 4-16 Subject C task differences

Table 4-17 Subject H task differences

Table 4-18 Subject E task differences

Table 4-19 Subject G task differences

Table 4-20 Subject B task differences

Table 4-21 Subject D task differences

Table 4-22 Pairwise comparisons for significant differences in saccade amplitude between tasks. An X indicates that a statistically significant difference exists between the corresponding tasks in the row and column

Table 4-23 Hallway walking (WH)

Table 4-24 Conversation (TC)

Table 4-25 Telephone talking (TT)

Table 4-26 Sorting cards (SC)

Table 4-27 Sorting blocks (SB)

Table 4-28 Reading poster (RP)

Table 4-29 Reading form (RF)

Table 4-30 Counting change (CC)

Table 4-31 Counting blocks (CB)

Table 4-32 Summary of results from study of natural tasks

Table 5-1 Statistical comparison of completion times for the subjects of Group 1 and Group 2. In each case the null hypothesis is not rejected (h=0), indicating that there is no statistically significant difference between the ordering of the tasks in terms of

Table 5-2 Order of instructions for Group A and Group B during extended environment study

Table 6-1 Maximum F/M ratios and associated weights for three example images using the random weight generation method. 10,000 trials

Table 6-2 Maximum F/M ratios and associated weights for three example images using the genetic algorithm method. 2,400 trials. #Gens refers to the actual number of trials required before a solution converged. Images A30, A32, A76, B30, and B88 are not included in the range data because the weights were greater than 50, due to many mutations

Table 6-3 Instructions for multi-view part of the experiment

Table 6-4 Three most frequently fixated objects and percentage of time spent looking at those objects for each of the tasks in the extended environment study from Section 5.3, over all subjects

Table 7-1 Classification of tasks into feature vector corresponding to both the level of visual engagement with the environment and amount of strategic planning

Introduction

1.1 Overview

Visual perception is an inherently active and selective process. As an individual goes about performing daily activities, the visual system is constantly monitoring the environment to provide the individual with information about the scene that will enable meaningful interactions or contemplative study. The outcome is usually a change in the cognitive state of the individual that leads to the realization of a plan of action. In this sense, vision is not a passive process whereby information is merely collected and processed or perhaps stored for later retrieval, but rather it is an active process that integrates specific, local aspects of the scene with goal-oriented behavior. Consequently, the purpose of vision is to serve the needs of the individual as those needs arise.

An essential component of active visual perception is a selective mechanism.

from the retina to the cortex. The advantage of selecting less information than is available is that the meaning of a particular scene or image can be represented compactly, thus making optimal use of limited neural resources.

Currently, it is uncertain exactly how useful the study of eye movements is for understanding human behavior (Viviani, 1990). Recent studies support the idea that eye movements are an external manifestation of selective attention and can play an important role in indicating which attributes of the scene carry the most pertinent information. Patterns of visual fixations over time as well as space can reveal cognitive strategies that are not amenable to conscious control or verbalization, and as such can be thought of as providing a window into pre-conscious thought. The locations of fixations as well as the particular sequence have been shown to be dependent upon not only the characteristics of the scene, but also upon the goals of the observer. The results of an eye movement analysis can yield important insights into the nature of decision-making and reasoning under a variety of environmental and task-specific situations.

The purpose of this thesis is to develop a biologically-plausible model of selective visual perception for individuals who are engaged in realistic, everyday tasks such as walking down a hallway, filling a cup with water, or making a copy at a copier machine. The model is in the form of heuristics gleaned from eye-tracking studies conducted on subjects navigating in natural, extended environments, and is combined with a computational simulation of low-level properties of the primate visual system. The computational aspects of the model augment the heuristics to provide a detailed account

Many studies have been conducted showing that eye fixations are not directed to random locations in the field, but rather to regions in the image or scene that rate high in information content, such as edge density, colorfulness, or luminance contrast. Presumably, a random selection process would not be an efficient means of gathering visual information for someone who is engaged in a visual task, whether the task requires formulating a plan of action or just contemplative thought. Thus, strategic planning of saccades plays an important role in extracting useful information. An unresolved question is how the visual system determines what strategy to use when deciding where in the scene to look next. More specifically, what is the role of context in determining oculomotor behavior? The central hypothesis of this thesis is that it is the subjective, or perceived, conspicuity of context-relevant objects in the scene that guides the fixation strategy, in addition to the objective amount of information inherent in the scene at the potential target of fixation.

Recent attempts at computational modeling of the human visual system have focused mainly on bottom-up, or stimulus-driven, processing; in other words, processing that begins with pixel counts from the digitized image and proceeds upward through successive layers of increasing abstraction. The idea is to detect luminance differences directly from the digitized image, and from those differences locate edges, boundaries, homogeneous regions, surfaces, and eventually objects and their 3D representations. Scene semantics arrive last in the chain of processing, if at all. An advantage of this approach is that the scene is represented in its entirety

computationally prohibitive and does not make optimal use of limited neural resources.
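The first step of the bottom-up chain described above, detecting luminance differences directly from the digitized image, can be illustrated with a minimal sketch. The Python fragment below is an illustration only and is not the model developed in this thesis: the function name `luminance_edges`, the list-of-rows image format, the forward-difference operator, and the threshold value are all assumptions chosen for brevity.

```python
def luminance_edges(image, threshold=0.25):
    """Mark pixels whose luminance differs sharply from the right or lower neighbor.

    image: list of rows of luminance values in [0, 1] (hypothetical input format).
    threshold: assumed cut-off on the absolute luminance difference.
    Returns a binary map of the same size (1 = candidate edge pixel).
    """
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Forward differences; pixels in the last row or column
            # have no neighbor in that direction, so the difference is 0.
            dx = image[r][c + 1] - image[r][c] if c + 1 < cols else 0.0
            dy = image[r + 1][c] - image[r][c] if r + 1 < rows else 0.0
            if max(abs(dx), abs(dy)) > threshold:
                edges[r][c] = 1
    return edges


# A dark region beside a bright region: the only luminance step
# lies between the second and third columns.
patch = [
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
]
edge_map = luminance_edges(patch)
```

Every later stage of a purely bottom-up chain (boundaries, regions, surfaces, objects) would have to be built on top of a map like `edge_map` for the whole image, which gives a sense of why scene semantics arrive last, if at all.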

An alternative to bottom-up processing is top-down, or task-dependent, processing. With top-down processing, one might begin with a conceptualized object described in abstract terms, such as "a chair has four legs," and proceed down through a hierarchy of increasing detail, eventually reaching a scene description in terms of primitive features. The disadvantage here is the difficulty of conceptualizing abstract, or non-representational, items; however, the advantage is a compact representation of the scene semantics.

A model based on selective perception and perceived conspicuity combines aspects of both bottom-up and top-down information in a unique way. The degree to which bottom-up or top-down is employed is largely a function of the goals of the system and its current state. The result is a computational model of visual perception and processing that is a reflection of the ongoing interaction between an active visual system and the environment.
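One simple way to picture a combination of bottom-up and top-down information is a weighted blend of feature maps. The sketch below is hypothetical: the combination rule (a weighted sum gated by a location mask), the function name `combine_maps`, the weights, and the toy 2 x 2 maps are assumptions made for illustration, and the actual formulation is the one developed in Chapter 6.

```python
def combine_maps(saliency, proto_objects, location_mask,
                 w_saliency=0.5, w_proto=0.5):
    """Blend one bottom-up map with two top-down maps.

    Assumed rule (not taken from the thesis):
        conspicuity = mask * (w_saliency * saliency + w_proto * proto_objects)
    All maps are lists of rows with identical dimensions.
    """
    rows, cols = len(saliency), len(saliency[0])
    combined = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            blended = (w_saliency * saliency[r][c]
                       + w_proto * proto_objects[r][c])
            # The mask suppresses locations that are not expected
            # to be useful for the current task.
            combined[r][c] = location_mask[r][c] * blended
    return combined


saliency = [[0.2, 0.8], [0.4, 0.6]]   # bottom-up feature strength
proto_objects = [[0, 1], [1, 0]]      # 1 where a potential object was found
location_mask = [[1, 1], [0, 1]]      # 0 where the task makes a location irrelevant
conspicuity = combine_maps(saliency, proto_objects, location_mask)
```

In this toy case the lower-left location contains a potential object but is excluded by the mask, so its conspicuity is zero, while the upper-right location, which is both salient and object-like, receives the highest value. Changing the weights shifts the balance between stimulus-driven and task-driven influence, in the spirit of the passage above.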

This thesis is devoted to the goal of showing that the perceptually significant information content of any particular region in an image or scene must ultimately take into account the implicit semantics of that image or scene, that is, the "meaningfulness" of the scene for the viewer. This approach implies that specific objects, as well as their relative and expected locations, play an important role in determining meaningfulness in natural scenes, especially when combined with action-implied imperatives. The low-level, bottom-up features such as edges, colors, and luminance contrast cannot be ignored

will be shown that successfully predicting fixation densities in natural images and in real, extended environments requires computational algorithms that combine bottom-up processing with top-down constraints in a way that is context-sensitive and ultimately most meaningful for the viewer.

1.3 Outline ofthe presented work

The remainder of this thesis is organized as follows: Chapter 2 highlights background

material that is essential for a complete understanding of the issues involved in

computational modeling of selective visual perception and human visual behavior in

natural, unrestricted environments. This chapter includes a detailed historical account

and review of the literature showing how previous work has led to the present state of the

field. Issues relating to the physiology of the human visual system, attentional

mechanisms, task-oriented vision and natural tasks are discussed, as are variations in

approaches that have been applied to the computational modeling of visual attention.

Chapter 3 is an outline of the experimental method that was applied in order to

extract data on human visual behavior in natural environments, as well as in the restricted

environment of 2D image viewing. Studying eye movements outside of the confines of a

restricted laboratory setting is a topic of current interest, yet this research area remains

largely unexplored and undocumented in the literature. Novel hardware and software

were developed by the Visual Perception Laboratory at the Rochester Institute of

Technology to enable a thorough data collection and analysis procedure. The results,

(27)

hardware, as well as the software that was developed for the eye-tracking sessions and

data analysis, is included in Chapter 3.

Chapter 4 describes a result gleaned from evaluating eye-tracking data in the real

world: that visual routines are somewhat "modular" in nature. That is, when metrics such

as fixation duration, saccade amplitude, and gaze-change interval are used to describe

certain "primitive" visual behaviors such as reading text or having a face-to-face

conversation, stereotypical patterns of oculomotor behavior result. This evidence

supports the hypothesis that the human brain is organized in such a way as to make use of

pre-determined low-level visual routines in order to enhance functioning in a complex

and constantly changing environment. Pre-determined routines may affect perceived

conspicuity by restricting the focus of attention to expected useful locations.

Chapter 5 is an extended study into the high-level visual strategies employed by

people that either enhance or detract from perceptual conspicuity in the environment.

For example, when walking along a corridor after a high-cognitive load task has been

imposed (memorization of a random block pattern), fixations tend to be longer, more

centrally located in the scene, and have shorter saccade amplitudes than when there has

been no cognitive load imposed. This implies that objects in the environment that would

have attracted attention when the task is not cognitively challenging do not do so when

the system is otherwise occupied.

Chapter 6 outlines the computational model that is developed and used to predict

fixation densities on natural-task images and in the real world. The model is a

(28)

and top-down, task-oriented constraints. The model takes the form of a topographic map

of perceived conspicuity values, where the value at a coordinate in the map is a measure

of how important that coordinate is for perception. The model is a partial adaptation of

the stimulus-driven approach taken by others (as discussed in Chapter 2), yet it is original

in the sense that it uses a novel method to take into account context-sensitive information

about the scene at both the higher levels and the lower levels. A novel algorithm is used

to inhibit regions in the scene that are not likely to be perceptually important and enhance

those regions that are. The resulting model is shown to correlate well with the fixation

densities of human subjects.

Chapter 7 is a summary and conclusion of the work presented in this thesis, and

(29)
(30)

Background

2.1 Historical perspective

One of the earliest theories of spatial attention, the attentional spotlight, originates from

the psychophysical work of Hermann von Helmholtz (1867). The spotlight metaphor

captures the concept of an "internal eye", i.e., an implicit fovea which localizes an object

in space and focuses all of the processing on that one object before moving to another

location in the field. Any information that is not centered on the implicit fovea is

diminished.

The idea of using a spotlight as a metaphor for attention was further developed

later in the 20th century (Crick, 1984; Treisman, 1982). Within the spotlight objects

which are being attended to are highlighted, or illuminated, so that information about that

object will be processed more efficiently and at a higher level than other objects in the

(31)

existence of an attentional spotlight (Sagi and Julesz, 1986); however, most current

thinking considers the metaphor to be too simplistic to capture all of the nuances of

selective attention. The early evidence points to observations made during

psychophysical studies of filtering tasks. For example, Sagi and Julesz (1986) studied the

ability of subjects to discriminate the orientation of briefly presented bar targets located

in the periphery. On some trials a small light was flashed close to the peripheral target;

on other trials the light was flashed near a peripheral non-target. Subjects were able to

detect the light only when it was flashed within a certain area near the target, even though

in both cases the light was located at the same foveal eccentricity. The authors suggest

that the area around the target at which the light could reliably be detected delineated the

contour of the spotlight of attention.

Other studies have demonstrated that the area covered by the spotlight does not

necessarily cover contiguous regions in the field (Pylyshyn and Storm, 1988). Duncan

(1980) showed subjects a circular display containing eight characters from which they

were to locate the target letter, Q. Distractor letters were either O's and C's, or O's and

K's, placed at random circular positions in the display. Subjects were told ahead of time

which four of the eight positions the target could be located in (the relevant positions).

The other four positions were irrelevant and could be ignored.

The study found that the O and C distractors had little effect on the subjects'

ability to locate the target, regardless of whether they were located in the relevant or

irrelevant field. When the O and K distractors were in the relevant field, search times

were slowed, presumably because of feature interaction (Treisman and Gelade, 1980).

(32)

field search times were not slowed, presumably because the subjects were able to attend

to the non-contiguous relevant locations while ignoring the also non-contiguous

irrelevant locations. The ability to attend to non-contiguous areas when the demands of

the task so require is evidence that high-level processes can mediate the acquisition of

visual stimuli.

Another study found that the spotlight does not end abruptly at one location

before it moves to the next, nor does it sweep continuously across the field of view

(Sperling and Weichselgartner, 1995). Processing is completed in a select area, fades,

and then moves to a new area to resume building strength there. An extension of the

spotlight metaphor for attentional capture is the zoom lens metaphor, which suggests that

the area under consideration is examined with variable spatial extent (Eriksen and St.

James, 1986). In this case the amount of detail available for processing is inversely

related to the size of the area being processed.

The theories mentioned thus far have assumed a serial mechanism for selection,

i.e., the focus of processing is completed at a single select region before moving on to the

next region. An alternative to focused, serial processing of attention is dispersed, or

parallel processing, originating with the work of James (1890). With dispersed

processing, the focus of attention is spread uniformly across the field of view. Neisser

(1967) was the first to show that the two theories need not be mutually exclusive, but

rather may be thought of as part of the same process existing at two distinct phases. The

pre-attentive phase is the earlier stage in terms of processing, and is considered relatively

(33)

attentive stage integrates information from a particular area of the field, and is considered

slow, voluntary, and progresses serially from one region to the next.

Much of the work that has been conducted on the 2-phase theory of selective

perception has been under the experimental paradigm known as visual search. In this

paradigm the amount of time it takes to complete a search is plotted as a function of the

total number of items in the display. A flat response indicates a fast, parallel process,

whereas a linear response indicates a slower sequential process. The feature integration

theory of selective visual attention (Treisman and Gelade, 1980; Treisman, 1988) is an

attempt to define the purpose of focused attention using the visual search paradigm.

According to feature integration theory, elementary features such as color and shape are

processed before objects that require a conjunction of several features, such as a blue box,

or a gray kitten. Focused attention is necessary to conjoin the separate features, which

then enables proper identification of the object.

A series of experiments were designed to distinguish between features that are

elementary (also called integral) and features that are separable and require focused

attention for integration. The hypothesis was that an integral feature would elicit a flat

response time and exhibit "pop-out" in a field of distractors, whereas an object with

separable features would require a sequential (conjunctive) search and elicit a linear

response time. The results showed that when the elementary features were chosen to be

colors or shapes (for example a green object in a field of red distractors), search times

were constant with the number of distractors. When separable features were chosen as a

conjunction of two elementary features (a green disc in a field of green squares and red

(34)

A hallmark of the theory is that the pre-attentive stage extracts the primitive

features in parallel across the visual field, and the attentive stage is required for binding

the separable features within a small part of the field. As evidence against the theory, it

has been shown that it is possible to perform some conjunctive searches in parallel if the

separable features consist of color, motion, or depth (Nakayama and Silverman, 1986).

Also, recent studies have shown that reaction times for conjunctive searches can range

from close to 0 seconds per item (pop-out) to 30-50 milliseconds per item, depending

upon the degree of similarity between the target and the distractors (Deco, Pollatos and

Zihl, 2002).
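
The distinction between a flat and a linear search function can be made concrete by fitting a straight line to response time as a function of display size and reading off the slope in milliseconds per item. The following is a minimal sketch and is not taken from the studies cited; the function name `search_slope` and the sample response times are hypothetical and chosen only for illustration.

```python
def search_slope(set_sizes, response_times_ms):
    """Least-squares slope (ms per item) of response time vs. display size."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(response_times_ms) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(set_sizes, response_times_ms))
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    return sxy / sxx


# Hypothetical data, for illustration only (not measured values).
set_sizes = [4, 8, 16, 32]
pop_out_rt = [450, 452, 449, 451]        # roughly flat: parallel search
conjunction_rt = [500, 660, 980, 1620]   # grows by 40 ms per added item

pop_out_slope = search_slope(set_sizes, pop_out_rt)
conjunction_slope = search_slope(set_sizes, conjunction_rt)
```

Under these assumed data the pop-out slope is close to zero, while the conjunction slope falls within the 30-50 milliseconds per item range quoted above.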

Two-phase theories of visual attention do not explicitly describe how the selection

process is controlled. Questions such as "what is the region of interest?" and "where

should the next fixation be?" can be approached by considering the purpose of focused

attention.

The notion of a saliency map was proposed to define the relationship between the

components of a scene according to their relative importance to the observer (Koch and

Ullman, 1985; Mahoney and Ullman, 1988). The essential components of a saliency map

include a priority map for rating the relative components of the scene, and a gating

mechanism whereby the selected regions are processed and the non-selected regions are

inhibited. According to the theory, the visual system performs an initial low-frequency

parsing of the environment to identify potential regions of interest, and assigns to each

region a weight according to the computed saliency. For example, bright colors, high

(35)

assigned a heavy weight. This information is recorded in a topographic map of the scene,

which indicates the weight of every element in that scene.

The map is dynamic in the sense that the gating mechanism chooses the element

with the highest current weight to be the target of focused processing, and then

suppresses this element when processing is complete. An inhibition-of-return (Posner

and Cohen, 1984) mechanism is used to reduce the saliency at the current focus of

attention so that the next highest region may be selected for processing. This mechanism

is thought to bias the attentional resources toward novel stimuli that appear in the field by

reducing the salience of an item that has been viewed for at least 300 msec.
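
The selection cycle described above (choose the element with the highest weight, process it, then suppress it) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the mechanism of Koch and Ullman as published: the map is a small grid of hypothetical weights, suppression sets the attended weight to a fixed low value, and the timing of inhibition (the 300 msec figure) is not modeled. The names `scan_path` and `toy_map` are hypothetical.

```python
def scan_path(saliency, n_fixations, inhibition=0.0):
    """Return the (row, col) locations selected by repeated winner-take-all
    selection with inhibition of return. `saliency` is a list of lists."""
    weights = [row[:] for row in saliency]  # copy so the input is not modified
    path = []
    for _ in range(n_fixations):
        # Gating step: pick the location with the highest current weight.
        r, c = max(
            ((i, j) for i in range(len(weights)) for j in range(len(weights[0]))),
            key=lambda rc: weights[rc[0]][rc[1]],
        )
        path.append((r, c))
        # Inhibition of return: suppress the location that was just attended.
        weights[r][c] = inhibition
    return path


# Hypothetical saliency weights, for illustration only.
toy_map = [
    [0.1, 0.9, 0.2],
    [0.4, 0.3, 0.7],
]
fixations = scan_path(toy_map, 3)
```

With these assumed weights the sketch visits the locations in order of decreasing weight, which is the behavior the inhibition-of-return step is meant to produce.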

The guided search model (Wolfe, Cave, and Franzel, 1989; Wolfe, 1994) is an

adaptation of the visual search paradigm that uses the concept of a salience map to

prioritize potential items of focused attention. The basic idea is that a parallel-feature

computation stage guides a later serial attentive stage. The highly salient targets should

be detected more quickly than non-salient targets, giving rise to constant search times for

elementary features. Slower, conjunctive search times are the result of the contribution

of noise from competing feature dimensions during the parallel feature computation

stage.

Alternatives to the guided search model are the search via recursive rejection

model (Humphreys and Müller, 1993) and models based on signal detection theory (see

Verghese, 2001, for a review). The search via recursive rejection model (SERR) is a

connectionist model that recursively rejects regions where clearly defined groupings of

distractors occur. In other words, if stable groupings occur everywhere at the lowest

(36)

(differently grouped) distractors. Search is slowed when groupings contain elements of both target and distractors. Signal detection theory uses a variable threshold to distinguish between fast search and slow search. The threshold is usually described as a decision rule that depends upon distractor discriminability rather than a parallel/serial dichotomy. Accordingly, the decision rule takes into account a wide range of factors that

might contribute to search response times, and does not assume a parallel pre-attentive stage is followed by a serial attentive stage.

In summary, the history of thought on the topic of selective visual perception

begins with the earliest metaphors of an internal eye, and a spotlight or zoom lens of

focused attention. From there, the visual search paradigm has produced theories describing a pre-attentive and attentive processing of integrated features, and progressed to the more current concepts of a topographic map of saliency values or signal detection.

What remains is a means of incorporating context sensitivity and task-relevancy into theories of selective visual perception.

2.2 The human visual system

2.2.1 Image formation

Light from the surrounding area enters the eye and undergoes several transformations that enable the brain to make use of information from that surrounding. The transformations are both optical and neural in nature, and begin with the transformation of light energy

(37)

[Figure 2-1: labeled components include the cornea, retina, and optic nerve; horizontal axis: visual angle (degrees from fovea).]

Figure 2-1 Cross section of the human eye, depicting image formation components. Adapted

from Palmer, 1999

The retina is a layer of neural tissue approximately 0.4mm thick and is the

repository of over 100 million light-sensitive photoreceptors called rods and cones

(approximately 100 million rods and 5 million cones in each eye). Figure 2-1 shows that

the distribution of rods and cones across the surface of the retina is highly uneven, with

most of the cones located in a small central area of the retina called the fovea. The cones

are responsible for both color perception and high visual acuity; thus, the eyes must move

in order to obtain detailed, high-resolution information from different regions in the

visual field. The area of the field covered by the fovea is approximately 2° of visual

angle, which is approximately the width of a thumb extended at arm's length.

Retinal cones can be classified into one of three different types, depending upon

the wavelength sensitivity of the cell's photopigment

(38)

(for short, medium, and long wavelength response). Figure 2-2 shows the spectral

sensitivities of the three cone types.

[Figure 2-2: curves for S-cones, M-cones, and L-cones; horizontal axis: wavelength, nm (400 to 700); vertical axis scaled from 0.0 to 1.0.]

Figure 2-2 Spectral sensitivities of the three types of cone photoreceptors. Measurements

include light loss due to absorption from the cornea, lens and other pigments in the eye. From Stockman and MacLeod, 1993.

The absorption of photons by the S-cone photoreceptors is significantly different

from that of the M- and L-cone photoreceptors. The S-cones are particularly sensitive to

short-wavelength photons and are the primary detectors when short-wavelength light is at

the threshold of detection. Both the M- and L-cones will detect longer wavelength light

since there is a greater amount of overlap in those response curves. Also, S-cones are

known to be relatively rare in the retina and are not present at all in the central part of the

fovea (Wandell, 1995). S-cones are spaced relatively far apart in the fovea, with a

spacing of 10 arc minutes, whereas the spacing for the L- and M-cones is 0.5 arc minutes.

The consequence of wide spacing is that the sampling frequency is reduced for the

(39)

mosaic is that the visual system will encode only relatively slowly varying spatial and

temporal signals originating in the short wavelength region of the spectrum.
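
The effect of receptor spacing on sampling frequency can be worked through numerically. The sketch below assumes a regular, one-dimensional sampling array and applies the Nyquist criterion (the highest recoverable frequency is half the sampling rate); the real cone mosaic is two-dimensional and irregular, so the numbers are only indicative. The function name `nyquist_limit_cpd` is hypothetical.

```python
def nyquist_limit_cpd(spacing_arcmin):
    """Highest spatial frequency (cycles per degree) that a regular sampling
    array with the given receptor spacing (arc minutes) can represent."""
    samples_per_degree = 60.0 / spacing_arcmin  # 60 arc minutes per degree
    return samples_per_degree / 2.0             # Nyquist: half the sampling rate


s_cone_limit = nyquist_limit_cpd(10.0)   # S-cones, 10 arc minute spacing
lm_cone_limit = nyquist_limit_cpd(0.5)   # L- and M-cones, 0.5 arc minute spacing
```

Under these assumptions the S-cone array is limited to about 3 cycles per degree, compared with about 60 cycles per degree for the L- and M-cone array, which is consistent with the statement that only slowly varying short-wavelength signals are encoded.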

2.2.2 Center-surround organization of receptive fields

Retinal neurons and cortical neurons develop from the same bio-chemical processes, and

as such the retina can be considered to be part of the central nervous system (Wandell,

1995). Many of the physiological and organizational properties of the cortex apply to the

retina as well. Similar to the cortex, the retina is a multi-layered surface, with the first

few layers of the retina consisting of ganglion cells that exhibit spatial interaction with

neighboring cells. Neurons in each layer excite corresponding neurons at a higher layer

and inhibit neighboring neurons in the same layer. The result of the network of

connections is called lateral inhibition. The network projecting from any particular

neuron to neighboring neurons is called the projective field of that neuron. The pattern of

connections in the opposite direction, from the receiving neuron to those neurons that

influence it, is called the receptive field of that neuron.

As mentioned earlier, visual perception can be described as a series of

transformations that begins with the input ambient light array and proceeds through

higher levels of cortical processing. Since the receptive field of a retinal neuron is the

area in which light influences the neuron's response, lateral inhibition and receptive

fields can be thought of as the transformation properties of retinal neurons.

The receptive field of a neuron in the retina can be described as having a

(40)

of action potentials results. However, if light activates only the central part of the

receptive field and not the surrounding area, an elevated response in terms of the firing

rate with respect to the random response will result, and the neuron is said to have an

on-center/off-surround organization. For this case, light activating only the inhibitory

surround will cause a significant decrease in the firing rate. A neuron exhibiting the

opposite pattern of activation is said to have an off-center/on-surround organization.

Figure 2-3 depicts a schematic of the different response properties of retinal neurons.
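
A common way to illustrate an on-center/off-surround receptive field is a difference of Gaussians: a narrow excitatory center minus a broad inhibitory surround. The text above does not name this model, so the sketch below is an assumption introduced for illustration, as are the one-dimensional layout, the widths `sigma_center` and `sigma_surround`, and the stimulus definitions. The response is expressed relative to the spontaneous firing rate, so positive values mean an elevated rate and negative values a reduced rate.

```python
import math


def dog_weight(x, sigma_center=1.0, sigma_surround=3.0):
    """Difference-of-Gaussians weight at position x (arbitrary units)."""
    def gauss(pos, sigma):
        return math.exp(-pos * pos / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
    return gauss(x, sigma_center) - gauss(x, sigma_surround)


def response(stimulus):
    """Weighted sum of a stimulus given as {position: light intensity}."""
    return sum(dog_weight(x) * intensity for x, intensity in stimulus.items())


positions = range(-10, 11)
center_only = {x: 1.0 for x in positions if abs(x) <= 1}    # light on the center
surround_only = {x: 1.0 for x in positions if abs(x) >= 3}  # light on the surround
full_field = {x: 1.0 for x in positions}                    # uniform illumination

center_response = response(center_only)
surround_response = response(surround_only)
full_field_response = response(full_field)
```

In this sketch, light confined to the center gives a positive response, light confined to the surround gives a negative response, and uniform light over the whole field gives a response near zero because excitation and inhibition roughly cancel.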

[Figure 2-3: stimulus and response panels for on-center/off-surround and off-center/on-surround neurons.]

Figure 2-3 Receptive fields of two types of retinal neurons: on-center/off-surround and

off-center/on-surround. Yellow areas indicate locations of light stimulus.

The receptive field structure of neurons continues along the central nervous

system from the retina to the lateral geniculate nucleus (LGN) of the thalamus and onto

area V1 (primary visual cortex), with some qualitative differences. For example, the

(41)

have elongated shapes and are orientation and direction selective. Also, cortical neurons

can be classified into two broad categories: simple and complex (Hubel, 1988). Simple

cells have response properties that conform to linearity and superposition principles,

whereas complex cells do not.

2.2.3 Contrast sensitivity function

The contrast sensitivity function (CSF) is typically defined as the sensitivity of observers

to sinusoidal gratings of varying frequencies. The technique used to measure the CSF is

to ask observers to adjust a threshold until a just-noticeable difference between a uniform

gray field and a sinusoidal pattern is detected. When thresholds are measured for a range

of frequencies, a contrast threshold function is plotted showing the minimum contrast at

threshold as a function of spatial frequency. The reciprocal of the contrast threshold

function is the contrast sensitivity function. A typical contrast sensitivity function is

depicted in Figure 2-4, showing that frequencies in the range of 4-5 cycles per degree of

(42)

[Figure 2-4: vertical axis: contrast sensitivity (low to high); horizontal axis: spatial frequency (cpd).]

Figure 2-4 Contrast sensitivity function with example spatial frequencies and

on-center/off-surround neuron tuned to the peak response. Adapted from Wandell, 1995.

The CSF can also describe a retinal ganglion cell's receptive field. The most

effective frequency for any ganglion cell is a measure of the size of that cell's receptive

field (Wandell, 1995). For example, Figure 2-4 depicts an on-center/off-surround

ganglion cell whose peak response is at the peak sensitivity of the contrast sensitivity

curve, i.e., the most effective spatial frequencies for this cell are the intermediate

frequencies. At lower spatial frequencies, light falling on the surround reduces activity

from the center, and at high spatial frequencies, light falling on the center is averaged

over several cycles, again lowering the overall activity.
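
The relationship between the contrast threshold function and the contrast sensitivity function is a simple reciprocal, which the following sketch makes explicit. The threshold values are hypothetical and chosen only so that the resulting curve has a band-pass shape with a peak at an intermediate frequency; they are not measured data, and the function name `contrast_sensitivity` is an assumption.

```python
def contrast_sensitivity(thresholds):
    """Convert {spatial frequency (cpd): contrast threshold} into
    {spatial frequency (cpd): sensitivity}, where sensitivity = 1 / threshold."""
    return {freq: 1.0 / threshold for freq, threshold in thresholds.items()}


# Hypothetical thresholds, for illustration only (not measured data).
thresholds = {0.5: 0.020, 1.0: 0.010, 4.0: 0.004, 16.0: 0.020, 32.0: 0.100}

csf = contrast_sensitivity(thresholds)
peak_frequency = max(csf, key=csf.get)  # frequency with the highest sensitivity
```

Because the lowest threshold in the assumed data occurs at 4 cycles per degree, that is where the sketch places the sensitivity peak, with sensitivity falling off at both lower and higher frequencies.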

In general, contrast patterns such as sinusoidal gratings at a fixed luminance level

provide an effective measure of the input/output behavior of neurons. Adaptation effects

operating over a very large range of luminance levels make direct comparisons difficult

because of the highly non-linear response characteristics of neurons. Therefore, a

(43)

response can then be characterized by cumulative comparisons over a range of mean

luminance levels.

2.2.4 Opponent processes

In 1867 Helmholtz described what has come to be known as the trichromatic theory of

color vision (Helmholtz, 1867/1925). Essentially, this theory describes color perception

as the result of the three photoreceptors' responses to photons of a particular wavelength.

Any single photoreceptor cannot distinguish between different colors; it is the overlap

among the three spectral response curves that contributes to the unique perception of

color.

Trichromatic theory explained much about color perception, such as the

psychophysical observation that any perceived color can be matched by a combination of

the three primary colors of red, blue, and green. It cannot explain many subjective color

experiences, however, such as the observation that certain color combinations such as red

and green, or blue and yellow are not easily imagined as a single color. In addition it

does not explain why color vision deficiencies are always the result of the loss of pairs of

colors: red and green, or blue and yellow. Also, psychologically, yellow appears to be a

primary color and not the combination of red and green as would be predicted by

trichromatic theory.

In 1878 Ewald Hering proposed the opponent process theory of color perception

to explain the perceived, or subjective, experience of color (Hering, 1878/1964).

Opponent process theory describes color perception as the result of four chromatic

primaries that are arranged in polar pairs

(44)

and yellow form the other polar pair. Each of the three retinal receptor types are

responsible for detecting photons of the proper wavelength range along one polar

dimension: the R/G dimension, the B/Y dimension and an achromatic dimension of

black/white that detects luminance levels. Physiologically, Hering theorized that the

experience of red could be the result of a sufficient amount of a certain chemical in the

R/G photoreceptor, and the experience of green could be the result of a depletion of that

chemical on the same photoreceptor. Hurvich and Jameson (1957) conducted

psychophysical experiments to verify predictions of opponent-process theory, using hue

cancellation techniques. The central idea was that if blue and yellow are polar

components of the same mechanism, then one should be able to cancel the amount of

"blueness" in a light by adding a certain amount of "yellow". The results of those

experiments showed strong evidence supporting the opponent-process theory.

In 1905 von Kries laid the foundation for a dual-process theory of color

perception that consists of two sequential stages of color processing: a trichromatic

stage at the level of retinal photoreceptors and an opponent-process stage at a higher level

(von Kries, 1905). More recent physiological studies have shown that color opponent

cells exist in the LGN of macaque monkeys and that both processing stages are

performed in the retina (De Valois, 1965, and De Valois, Abramov and Jacobs, 1966).

The implication of dual-process theory for visual perception at a higher, conscious

level of awareness is that the re-parameterization of responses from the three

photoreceptors to a more psychological color appearance is more ecologically

useful. Separating luminance from chromaticity is advantageous because it allows the

(45)

falling over a surface (a measurement along the luminance axis) and changes in the scene

that result from encountering a new surface (a measurement along one of the

chrominance axes).
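
The re-parameterization from three cone responses to one luminance axis and two chrominance axes can be sketched as a simple linear transform. The weights below are illustrative assumptions, not physiological measurements or the transform used elsewhere in this thesis, and the function name `opponent_channels` is hypothetical. The example shows why the separation is useful: a change in illumination level alone moves a neutral surface along the luminance axis and leaves both chrominance axes unchanged.

```python
def opponent_channels(l, m, s):
    """Map cone responses (L, M, S) to (luminance, red_green, blue_yellow).
    The weights are illustrative assumptions, not fitted values."""
    luminance = l + m                  # achromatic (black/white) axis
    red_green = l - m                  # R/G opponent axis
    blue_yellow = s - (l + m) / 2.0    # B/Y opponent axis
    return luminance, red_green, blue_yellow


# A neutral surface under dim and bright illumination: every cone response
# is scaled by the same factor (hypothetical values).
dim = opponent_channels(0.2, 0.2, 0.2)
bright = opponent_channels(0.4, 0.4, 0.4)

# A different surface changes the balance between cone types instead.
reddish = opponent_channels(0.4, 0.2, 0.2)
```

Under these assumptions the dim and bright cases differ only in the first (luminance) value, whereas the reddish surface produces a non-zero value on the R/G axis.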

2.2.5 Eye movements

In general, eye movements fall under two broad categories: smooth and saccadic.

Smooth eye movements, such as smooth pursuit, vergence, and the vestibulo-ocular

reflex (VOR) enable the tracking of moving objects, whereas saccadic eye movements

are swift and abrupt, and allow the eyes to shift fixation from one object in the field to

another. The optokinetic response (OKR) is a combination of both a smooth and a

saccadic movement, and is characterized by a slow, smooth phase followed by a swift,

saccadic snap of the eyes back in the direction opposite the movement of the tracked

object. From a cognitive point of view, a saccadic eye movement is the more interesting

oculomotor behavior primarily because it is an external manifestation of a pre-conscious

choice, i.e., the eyes must move in order to obtain detailed, high-resolution information

from interesting areas in the environment.

Saccades are high velocity, ballistic eye movements that have the function of

bringing retinal images of objects of interest from the periphery to the fovea for closer

inspection. A typical saccade takes approximately 150-200 msec to plan and execute;

planning takes about 150 msec on average and the duration of the eye movement is

approximately 20 msec plus 2 msec per degree of visual angle (Carpenter, 1988).

Saccades can reach up to 900 degrees per second, and individuals typically make 3 or 4

(46)

Studies on eye movements during reading have shown that saccades during

reading are typically seven letters long, which results in a saccade length of between 1°

and 2° for reading standard size text at a distance of 40 cm (O'Regan, 1990). There is

also a wide distribution of within-word target landing for reading text, i.e., there is no

precise position within the word that is the saccadic landing target; anywhere within the

word is sufficient for comprehension (Morgan, et al., 1990). Fixations are defined as the

time between successive saccades. A typical fixation duration for reading text is between

200 and 300 msec.
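
The figures quoted in this section can be tied together with two short calculations: the visual angle subtended by a stretch of text at a given viewing distance, and the saccade duration predicted by the linear rule attributed to Carpenter (1988). This is a minimal sketch; the function names are hypothetical, and the assumed physical width of seven letters (1.0 cm) is an illustrative guess rather than a value from the cited studies.

```python
import math


def visual_angle_deg(size, distance):
    """Visual angle (degrees) subtended by an object of extent `size`
    viewed at `distance`; both arguments use the same length unit."""
    return math.degrees(2 * math.atan(size / (2 * distance)))


def saccade_duration_ms(amplitude_deg):
    """Approximate saccade duration from the linear rule quoted in the text:
    20 msec plus 2 msec per degree of visual angle."""
    return 20.0 + 2.0 * amplitude_deg


# Assumption: seven letters of standard-size text span about 1.0 cm.
reading_saccade_deg = visual_angle_deg(1.0, 40.0)   # viewed at 40 cm
reading_saccade_ms = saccade_duration_ms(reading_saccade_deg)
```

With the assumed letter width, the computed saccade length falls between 1 and 2 degrees, and the predicted duration is only a few milliseconds above the 20 msec baseline, far shorter than the planning time or the typical fixation duration.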

It should be noted that saccadic eye movements are one example of overt

manifestation of visual selectivity and orienting of attention; head movements and

postural adjustments are among the others. Covert orienting of attention and inner acts of

selection are not necessarily accompanied by any overt signs. It is possible, though

unusual, to attend to one area of the visual field while fixating another (Corbetta, 1998;

Kustov and Robinson, 1996).

Recent studies have suggested that the classification of eye movements into

subcategories such as smooth-pursuit, vergence, and VOR ignores the behavioral

significance of eye movements, and reflects the simple tasks of the early studies that were

performed in a constrained and sparse visual environment (Steinman, Kowler, and

Collewijn, 1990). The claim is that the experimental results of such early studies reflect

low-level and involuntary aspects of oculomotor control that do not surface in a
