Seeing, sensing, and selection: modeling visual perception in complex environments


RIT Scholar Works

Theses

Thesis/Dissertation Collections

9-29-2003

Seeing, sensing, and selection: modeling visual perception in complex environments

Roxanne Canosa

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Dissertation is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].


Modeling Visual Perception in Complex Environments

Roxanne Canosa

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Center for Imaging Science in the College of Science

Rochester Institute of Technology

September 29, 2003

Signature of Author: Roxanne Canosa

Accepted by

THESIS RELEASE PERMISSION

ROCHESTER INSTITUTE OF TECHNOLOGY

COLLEGE OF SCIENCE

Seeing, Sensing, and Selection:

Modeling Visual Perception in Complex Environments

I, Roxanne Canosa, hereby grant permission to the Wallace Memorial Library of RIT to reproduce my dissertation in whole or part. Any reproduction will not be for commercial use or profit.

Signature of Author: Roxanne Canosa

CENTER FOR IMAGING SCIENCE

COLLEGE OF SCIENCE

ROCHESTER INSTITUTE OF TECHNOLOGY

ROCHESTER, NEW YORK

CERTIFICATE OF APPROVAL

Ph.D. DEGREE THESIS

The Ph.D. degree of Roxanne Canosa

has been examined and approved by the dissertation committee

as satisfactory for the dissertation requirement for the

Doctor of Philosophy degree in Imaging Science

Dr. Jeff B. Pelz, Dissertation Advisor

Dr. Marc Marschark, Committee Chair

Dr. Julie A. Adams

Dr. Dana H. Ballard

Dr. Roger S. Gaborski

I would like to thank and acknowledge all those who helped to make this dissertation possible. Thanks to Jeff Pelz for his guidance, support, and encouragement over the many years of work that were required to bring this research to fruition. None of this would have been possible without his vast knowledge of the subject matter, technical expertise, and cheerful willingness to discuss and explore new ideas. Thanks also to Jason Babcock for the countless hours spent developing and fine-tuning the eye-tracker, and for sharing his ideas on experimental design and data analysis. This work was built upon earlier work conducted by Jeff Pelz, Mary Hayhoe, and Dana Ballard at the University of Rochester, and also Roger Gaborski at the Rochester Institute of Technology. Thanks for their insights and for providing the "shoulders upon which I stand."

I am also greatly indebted to Julie Adams for her thorough review of this work and her many useful and enlightening comments. The skills and ideas of Marianne Lipps, Constantin Rothkopf, and Vishal Vaingankar have proved to be invaluable, and I am grateful for our many discussions that helped to form and focus this dissertation. Also, thanks to Marc Marschark for his support. This work was supported in part by the Naval Research Laboratories, the New York State Office of Science, Technology, and Academic Research, and the Xerox Corporation. Finally, I owe my greatest appreciation to John, Elyse, Sandra, and Dennis Canosa, for their continued love and support through

The author was born in Rochester, New York and has spent virtually her entire life there. She attended college as a first-generation college student at the State University of New York College at Potsdam, earning a Bachelor of Arts degree in Art History in 1980. She then returned to Rochester and earned an Associate of Applied Science degree in Optical Engineering Technology from Monroe Community College in 1983, and studied Electrical Engineering part-time at the Rochester Institute of Technology. While raising three children and working at JML Optical Industries, Inc., and Eastman Kodak Company, she returned to school to earn yet another degree. In May of 1998 she earned the Bachelor of Science degree in Computer Science at the State University of New York College at Brockport, and the following fall began graduate studies in Imaging Science at RIT. In October of 2000 she earned the Master of Science degree in Imaging Science, and began doctoral studies under the direction of Professor Jeff Pelz. She is currently an

Abstract

The purpose of this thesis is to investigate human visual perception at the level of eye movements by describing the interaction between vision and action during natural, everyday tasks in a real-world environment. The results of the investigation provide motivation for the development of a biologically-based model of selective visual perception that relies on the relative perceptual conspicuity of certain regions within the field of view. Several experiments were designed and conducted that form the basis for the model. The experiments provide evidence that the visual system is not passive, nor is it general-purpose, but rather it is active and specific, tightly coupled to the requirements of planned behavior and action. The implication of an active and task-specific visual system is that an explicit representation of the environment can be eschewed in favor of a compact representation with large potential savings in computational efficiency. The compact representation is in the form of a topographic map of relative perceptual conspicuity values. Other recent attempts at compact scene representations have focused mainly on low-level maps that code certain salient features of the scene including color, edges, and luminance. This study has found that the low-level maps do not correlate well with subjects' fixation locations; therefore, a map of perceptual conspicuity is presented that incorporates high-level information. The high-level information is in the form of figure/ground segmentation, potential object detection, and task-specific location bias. The resulting model correlates well with the fixation densities of human viewers of natural scenes, and can be used as a pre-processing module for image understanding or

List of Figures xvi
List of Tables xxiv
1. Introduction 1
1.1 Overview 1
1.2 Problem statement 3
1.3 Outline of the presented work 5
2. Background 9
2.1 Historical perspective 9
2.2 The human visual system 15
2.2.1 Image formation 15
2.2.2 Center-surround organization of receptive fields 18
2.2.3 Contrast sensitivity function 20
2.2.4 Opponent processes 22
2.2.5 Eye movements 24
2.3 Visual attention and selectivity 26
2.3.1 The influence of attention on neural response 26
2.3.2 Orienting of attention 29
2.3.3 Behavioral data on selectivity and capacity limitations 32
2.4 Task-oriented vision 35
2.4.1 Task-dependency of visual scanpaths 36
2.4.2 Limited memory representations 38
2.4.3 Natural tasks 42
2.5 Computational modeling of visual attention 45
2.5.1 Hierarchical models of attention 45
2.5.2 Connectionist models of attention 47
2.5.3 Graphical models of attention 49

3.2 The benefits of eye-tracking 58
3.3 Eye-tracking - theory of operation 59
3.3.1 Bright-pupil detection 59
3.3.2 Calculation of eye position 61
3.4 The VPL portable eye-tracker 62
3.4.1 The optics module and mirror 64
3.4.2 The eye camera 65
3.4.3 The scene camera 66
3.4.4 The LASER 67
3.4.5 The control unit 67
3.4.6 Eye-tracker set-up and calibration 68
3.4.7 Eye movement monitoring 71
3.4.8 Portable eye-tracker precision, accuracy, and noise 72
3.5 The ASL model 501 eye-tracker 76
3.5.1 Integrating head movements 77
3.5.2 ASL eye-tracker precision, accuracy, and noise 78
3.5.3 Estimation and correction of accuracy loss 80
3.5.4 Fixation finding 82
4. Modular Visual Routines 87
4.1 Introduction 87
4.2 Method 91
4.3 Results 95
4.3.1 Mean fixation durations of tasks - pooled data 96
4.3.2 Variance of fixation duration - pooled data 101
4.3.3 Statistical differences between subjects 106
4.3.4 Mean saccade amplitude of tasks - pooled data 111
4.3.5 Variance of saccade amplitude - pooled data 117
4.3.6 Statistical differences between subjects 119
4.4 Discussion 121
5. Task-dependencies of Fixation Locations 127
5.1 Introduction 127
5.2 Fixation locations in a simple environment 133
5.2.1 Method 134
5.2.2 Results 138
5.2.3 Discussion 145
5.3 Fixation locations in an extended environment 145
5.3.1 Method 146
5.3.2 Results 150

6. The Conspicuity Map 167
6.1 Overview 167
6.2 Model description 169
6.2.1 Input image processing 169
6.2.2 The low-level saliency map 175
6.2.3 High-level proto-object map 187
6.2.4 Expected location mask 189
6.3 Verification of model using eye-tracking methods 191
6.3.1 Data collection 191
6.3.2 Comparison of fixation densities to model predictions 192
6.3.3 Determination of map weights 196
6.4 Natural-task images 206
6.4.1 Comparison to extended environment 206
6.4.2 Free-view and multi-view 213
6.4.3 Estimation of location bias 215
6.4.4 Expected locations 221
6.5 General discussion and conclusion 224
7. Conclusion 227
References 239

Figure 2-1 Cross-section of the human eye, depicting image formation components 16
Figure 2-2 Spectral sensitivities of the three types of cones. Measurements include light loss due to absorption from the cornea, lens, and other pigments in the eye 17
Figure 2-3 Receptive fields of two types of retinal neurons: on-center/off-surround and off-center/on-surround. Yellow areas indicate locations of light stimulus 19
Figure 2-4 Contrast sensitivity function with example spatial frequencies and on-center/off-surround neuron tuned to the peak response 21
Figure 2-5 Study showing that scan paths are task dependent. Original painting of I. E. Repin's Unexpected Return is at upper left, with five example scanpaths for a single subject who viewed the painting while being asked to formulate answers to the various questions 38
Figure 3-1 Image of the pupil (white) and corneal reflection (black) as detected by the eye camera. Centers are indicated by crosshairs. A slight offset between the actual centers of the images and the displayed centers is due to a timing offset during data capture, and does not affect calculation of eye movement amplitude and direction 60
Figure 3-2 Calculation of the line-of-gaze 61
Figure 3-3 Portable eye-tracking headgear and backpack 63
Figure 3-4 Top view of headgear 64
Figure 3-5 Optics module 64
Figure 3-6 Diffraction pattern used for calibration 69
Figure 3-7 Eye movement trace after calibration. The subject was instructed to look at

seconds each 71
Figure 3-8 Vertical eye position 73
Figure 3-9 Horizontal eye position 73
Figure 3-10 Expanded view of Figure 3-8 73
Figure 3-11 Expanded view of Figure 3-9 73
Figure 3-12 Eye-tracker noise, no averaging 74
Figure 3-13 Eye-tracker noise, two field ave 74
Figure 3-14 Eye-tracker noise, four field ave 74
Figure 3-15 Eye-tracker noise, eight field ave 74
Figure 3-16 Average angular deviation for each of the nine calibration points at the start of the experiment, across eight subjects 75
Figure 3-17 Average angular deviation for each of eight subjects at the start of the experiment, across nine calibration points 75
Figure 3-18 Average angular deviation for each of the nine calibration points at mid-experiment, across six subjects 76
Figure 3-19 Average angular deviation for each of six subjects at mid-experiment, across nine calibration points 76
Figure 3-20 ASL model 501 eye-tracker 77
Figure 3-21 Deviations from calibration target points at the start of the experiment, before and after correction, across eleven subjects 80
Figure 3-22 Deviations from calibration target points at the end of the experiment, before and after correction, across eleven subjects 80
Figure 3-23 Deviations from calibration target points at the start of the experiment, before and after correction, across nine points 80
Figure 3-24 Deviations from calibration target points at the end of the experiment, before and after correction, across nine points 80
Figure 3-25 Raw ASL eye-head data 85

Figure 4-1 Relative frequency of fixation durations for subjects JB and JP for Reading, Search, and Manipulations during rocket-building 89
Figure 4-2 Fixation sequences for three sub-tasks in the rocket-building task: Reading, Search, and Manipulation. Bars indicate periods of fixation, spaces indicate gaze changes between fixation points 90
Figure 4-3 Walking along a hallway 92
Figure 4-4 Having a face-to-face conversation 92
Figure 4-5 Telephone conversation 92
Figure 4-6 Sorting cards 92
Figure 4-7 Sorting blocks 92
Figure 4-8 Reading poster 92
Figure 4-9 Reading form 93
Figure 4-10 Counting change 93
Figure 4-11 Counting red blocks 93
Figure 4-12 Mean fixation duration for each of the nine tasks, pooled across all eight subjects 97
Figure 4-13 95% confidence interval of the mean fixation durations for each of the tasks. A statistically significant difference between two tasks exists if there is no overlap of the corresponding confidence intervals. Center dots represent the mean values 99
Figure 4-14 The gamma density function with α = 2 and β = 1 103
Figure 4-15 Walk hall pooled data 103
Figure 4-16 Talk conversation pooled data 103
Figure 4-17 Talk telephone pooled data 104
Figure 4-18 Sort cards pooled data 104
Figure 4-19 Sort blocks pooled data 104

Figure 4-21 Read form pooled data 104
Figure 4-22 Count change pooled data 104
Figure 4-23 Count blocks pooled data 104
Figure 4-24 Relationship between mean and standard deviation for all of the tasks. From left, tasks are: RF, CC, RP, SC, SB, TT, WH, CB, TC 104
Figure 4-25 Mean fixation duration for each subject, all tasks 107
Figure 4-26 Calculation of visual angle from field of view 113
Figure 4-27 Mean saccade amplitude for each of the nine tasks, pooled across all eight subjects 113
Figure 4-28 Mean saccade amplitude for each subject, all tasks, with standard error bars 114
Figure 4-29 95% confidence intervals of the mean saccade amplitudes for each of the nine tasks. An overlap between two or more intervals indicates that there is no statistically significant difference between the corresponding mean values 115
Figure 4-30 Walk hall pooled data 117
Figure 4-31 Talk conversation pooled data 117
Figure 4-32 Talk telephone pooled data 117
Figure 4-33 Sort cards pooled data 117
Figure 4-34 Sort blocks pooled data 117
Figure 4-35 Read poster pooled data 117
Figure 4-36 Read form pooled data 117
Figure 4-37 Count change pooled data 117
Figure 4-38 Count blocks pooled data 117
Figure 4-39 Relationship between the mean and the standard deviation of saccade amplitude. From the left, tasks are RP, CC, CB (lower), RF, SB (lower), TC,

Figure 5-1 Block copying task. This is the display that was shown on the computer screen. The display subtended an area of 17° x 13° visual angle. A trace of the eye movement and of the hand movement is shown as arrows connecting the different regions 129
Figure 5-2 Eye movement strategies used for block copying task. Relative frequencies of each strategy from a sample containing approximately fifty block moves for each of seven subjects 130
Figure 5-3 Fixation duration as a function of task difficulty for a driving task 131
Figure 5-4 Sorting Cards 135
Figure 5-5 Sorting Blocks 135
Figure 5-6 Copy-model-same-room 135
Figure 5-7 Model from copy-model-different-room 136
Figure 5-8 Resource and Workspace from copy-model-different-room 136
Figure 5-9 Amount of time taken by each subject to complete each of the four tasks. The tasks along the x-axis are ordered according to the order of performance by Group 1 (subjects B, D, F, and H). The first four bars for each task correspond to the Group 1 subjects, and the second four bars correspond to the Group 2 subjects (A, C, E, and G) who performed the tasks in the reverse order 138
Figure 5-10 Division of time between the two different regions: sorting blocks and sorting cards 142
Figure 5-11 Division of time between three different regions: copy-model-same-room and copy-model-different-room 143
Figure 5-12 Depiction of four extended environments used for the portable eye-tracking study. Clockwise from the top left, Washroom, Hallway, Office, and Vending 149
Figure 5-13 Relative amounts of time spent on different objects in the Washroom environment, pooled across all fixations and all subjects 151
Figure 5-14 Washroom environment. Time spent fixating objects as the tasks progress for Subject T. Tasks are, from the top, "Wash your hands," "Fill a cup with water," and "Comb your hair."

Figure 5-15 Relative amounts of time spent on different objects in the Hallway environment, pooled across all fixations and all subjects 155
Figure 5-16 Relative amounts of time spent on different objects in the Office environment, pooled across all fixations and all subjects 157
Figure 5-17 Relative amounts of time spent on different objects in the Vending environment, pooled across all fixations and all subjects 158
Figure 5-18 Hallway environment. Time spent fixating objects as the tasks progress for Subject T. Tasks are, from the top, "Throw something in the garbage," "The fire alarm just went off," and "Find a bathroom." 160
Figure 5-19 Office environment. Time spent fixating objects as the tasks progress for Subject U. Tasks are, from the top, "Get supplies from the closet," "Work at the computer," and "Make a photocopy." 161
Figure 5-20 Vending machine environment. Time spent fixating objects as the tasks progress for Subject U. Tasks are, from the top, "Check for Skittles," "Buy a Snickers bar," and "Check for change." 162
Figure 5-21 Comparison of fixation types 164
Figure 6-1 Construction of the Conspicuity Map 170
Figure 6-2 Creation of the color map from photoreceptor responses. Upper left is input image, upper right is C1 (red/green) signal, lower left is C2 (blue/yellow) signal, and lower right is the resulting color map. Dark blue areas in the signal maps correspond to low signal values, yellow corresponds to medium signal values, and red corresponds to high signal values 176
Figure 6-3 Intensity map for example input image of Figure 6-2 177
Figure 6-4 The seven levels of the multi-resolution Gaussian pyramid for the example input image 178
Figure 6-5 The seven Gaussian convolution kernels of the Laplacian pyramid. They are used to create the bandpass filters that detect a specific range of frequencies in the input image. The spatial domain representation is shown in the top row, and the corresponding frequency domain representation is shown in the bottom row 180
Figure 6-6 Six bandpass filters used to detect frequencies of a particular range in the input image. F1-F2 is elliptical in shape because f1 is odd-sized (5x5) and f2 is

Figure 6-7 One-dimensional frequency response characteristics of the six bandpass filters shown in Figure 6-6. Note that only the right half of the response curves are shown, i.e., they are symmetrical about the origin 181
Figure 6-8 Six levels of the Laplacian edge cube (difference-of-Gaussians) derived from the seven levels of the Gaussian pyramid (labeled L0-L6) after weighting each Laplacian level by the response from the contrast sensitivity function. Top row from left L0-L1, L1-L2, and L2-L3. Bottom row from left L3-L4, L4-L5, and L5-L6 183
Figure 6-9 Basis functions of the Gabor filters used to model the tuning of receptive fields in area V1 of striate cortex. From left, 0°, 45°, 90°, and 135° 184
Figure 6-10 Four oriented edge signals and resulting oriented edge map. Top row from left, 0°, 45°. Middle row, from left, 90°, 135°. Bottom row, false-colored oriented edge map 185
Figure 6-11 Low-level feature maps and resulting low-level saliency map. Top row, from left, color map and intensity map. Bottom row from left, oriented edge map and saliency map 186
Figure 6-12 Creation of the binary proto-object map. Top row from left, input image, estimation of background, and foreground segmentation. Bottom row from left, after thresholding the foreground image and Canny edge detection, after dilation, and after hole filling and erosion 188
Figure 6-13 Comparison of F/M ratios for 76 images in set A, free-view condition. Three maps were generated for each image, as given in Equations 6-15 through 6-17. Images numbered 44-47, 64-66, 69-71, and 80-83 are duplicate images, and are not shown here 195
Figure 6-14 Comparison of F/M ratios for 76 images in set B, free-view condition. Three maps were generated for each image, as given in Equations 6-15 through 6-17. Images numbered 1-4, 44-47, 67, 68, and 80-84 are duplicate images, and are not shown here 196
Figure 6-15 Mean F/M ratios for the three different maps, averaged across all 152 images 200
Figure 6-16 Example images and overlaid fixation plot for which the optimal weights were found using the random weight generation method. The corresponding weighted conspicuity map (C-Map) is shown beneath each image. From left, A1, A28, B17 200
Figure 6-17 F/M ratio for set A images, free-view condition, using the C-Map. The F/M ratios for the other three maps are the same as shown in Figure 6-13, and are

74 is off the chart and has a value of 6.89 204
Figure 6-18 F/M ratio for set B images, free-view condition, using the C-Map. The F/M ratios for the other three maps are the same as shown in Figure 6-14, and are included for comparison to the C-Map 205
Figure 6-19 Mean F/M ratios for all 152 images using four different maps 206
Figure 6-20 Four natural-task images with overlaid fixation plots from one subject, free-view condition, and corresponding maps. From left, Washroom (A1), Hallway (A2), Office (A3), and Vending (A4). Maps are, from top to bottom, the CIE map, the P map, the CIEP map, and the C-Map 208
Figure 6-21 Fixation density plots for free-view and three multi-view conditions for four images. Images are, from top, Washroom, Hallway, Office, and Vending Machine 210
Figure 6-22 F/M ratios for free-view and multi-view conditions for the four natural-task images. A comparison is shown between the low-level CIE map and the high-level perceptual conspicuity C-Map for each image 214
Figure 6-23 F/M ratios for 1000 randomly generated fixation locations 215
Figure 6-24 F/M ratios computed for mixed image and fixation data. Each chart is for one of the four images for which two maps were computed, CIE map (saliency) and C-Map (conspicuity). The free-view fixation data is indicated along the x-axis 217
Figure 6-25 Histograms of fixation distances from the center of each image 219
Figure 6-26 F/M ratios for random fixations restricted to 1/4 image size distance from center, and 1/16 image size distance from center 220
Figure 6-27 Nine grid locations used to compute the expected location map 221
Figure 6-28 F/M ratios for different locations in the C-Map, found by turning on a single

List of Tables

Table 3-1 Newell's temporal hierarchy of brain organization 58
Table 4-1 Time in seconds and number of fixations (in parentheses) per task for the eight subjects who performed the experiment 94
Table 4-2 Order of tasks for Group 1 and Group 2 94
Table 4-3 Task abbreviations 96
Table 4-4 Pairwise comparisons for significant differences in fixation durations between tasks. An X indicates that a statistically significant difference exists between the corresponding tasks in the row and column 98
Table 4-5 Hallway Walking (WH) 107
Table 4-6 Conversation (TC) 107
Table 4-7 Telephone Talking (TT) 107
Table 4-8 Sorting Cards (SC) 107
Table 4-9 Sorting Blocks (SB) 107
Table 4-10 Reading Poster (RP) 107
Table 4-11 Reading Form (RF) 107
Table 4-12 Counting Change (CC) 107

Table 4-14 Subject A task differences 109
Table 4-15 Subject F task differences 109
Table 4-16 Subject C task differences 109
Table 4-17 Subject H task differences 109
Table 4-18 Subject E task differences 109
Table 4-19 Subject G task differences 109
Table 4-20 Subject B task differences 109
Table 4-21 Subject D task differences 109
Table 4-22 Pairwise comparisons for significant differences in saccade amplitude between tasks. An X indicates that a statistically significant difference exists between the corresponding tasks in the row and column 114
Table 4-23 Hallway walking (WH) 119
Table 4-24 Conversation (TC) 119
Table 4-25 Telephone talking (TT) 120
Table 4-26 Sorting cards (SC) 120
Table 4-27 Sorting blocks (SB) 120
Table 4-28 Reading poster (RP) 120
Table 4-29 Reading form (RF) 120
Table 4-30 Counting change (CC) 120
Table 4-31 Counting blocks (CB) 120
Table 4-32 Summary of results from study of natural tasks 122

Table 5-1 Statistical comparison of completion times for the subjects of Group 1 and Group 2. In each case the null hypothesis is not rejected (h=0), indicating that there is no statistically significant difference between the ordering of the tasks in terms of

Table 5-2 Order of instructions for Group A and Group B during extended environment study 147
Table 6-1 Maximum F/M ratios and associated weights for three example images using the random weight generation method. 10,000 trials 200
Table 6-2 Maximum F/M ratios and associated weights for three example images using the genetic algorithm method. 2,400 trials. #Gens refers to the actual number of trials required before a solution converged. Images A30, A32, A76, B30, and B88 are not included in the range data because the weights were greater than 50, due to many mutations 203
Table 6-3 Instructions for multi-view part of the experiment 207
Table 6-4 Three most frequently fixated objects and percentage of time spent looking at those objects for each of the tasks in the extended environment study from Section 5.3, over all subjects 212
Table 7-1 Classification of tasks into feature vector corresponding to both the level of visual engagement with the environment and amount of strategic planning

Introduction

1.1 Overview

Visual perception is an inherently active and selective process. As an individual goes about performing daily activities, the visual system is constantly monitoring the environment to provide the individual with information about the scene that will enable meaningful interactions or contemplative study. The outcome is usually a change in the cognitive state of the individual that leads to the realization of a plan of action. In this sense, vision is not a passive process whereby information is merely collected and processed or perhaps stored for later retrieval, but rather it is an active process that integrates specific, local aspects of the scene with goal-oriented behavior. Consequently, the purpose of vision is to serve the needs of the individual as those needs arise.

An essential component of active visual perception is a selective mechanism.

from the retina to the cortex. The advantage of selecting less information than is available is that the meaning of a particular scene or image can be represented compactly, thus making optimal use of limited neural resources.

It is currently uncertain exactly how useful the study of eye movements is for understanding human behavior (Viviani, 1990). Recent studies support the idea that eye movements are an external manifestation of selective attention and can play an important role in indicating which attributes of the scene carry the most pertinent information. Patterns of visual fixations over time as well as space can reveal cognitive strategies that are not amenable to conscious control or verbalization, and as such can be thought of as providing a window into pre-conscious thought. The locations of fixations as well as the particular sequence have been shown to be dependent upon not only the characteristics of the scene, but also upon the goals of the observer. The results of an eye movement analysis can yield important insights into the nature of decision-making and reasoning under a variety of environmental and task-specific situations.

The purpose of this thesis is to develop a biologically-plausible model of selective visual perception for individuals who are engaged in realistic, everyday tasks such as walking down a hallway, filling a cup with water, or making a copy at a copier machine. The model is in the form of heuristics gleaned from eye-tracking studies conducted on subjects navigating in natural, extended environments, and is combined with a computational simulation of low-level properties of the primate visual system. The computational aspects of the model augment the heuristics to provide a detailed account

1.2 Problem statement

Many studies have shown that eye fixations are not directed to random locations in the field, but rather to regions in the image or scene that rate high in information content, such as edge density, colorfulness, or luminance contrast. Presumably, a random selection process would not be an efficient means of gathering visual information for someone who is engaged in a visual task, whether the task requires formulating a plan of action or just contemplative thought. Thus, strategic planning of saccades plays an important role in extracting useful information. An unresolved question is how the visual system determines what strategy to use when deciding where in the scene to look next. More specifically, what is the role of context in determining oculomotor behavior? The central hypothesis of this thesis is that it is the subjective, or perceived, conspicuity of context-relevant objects in the scene that guides the fixation strategy, in addition to the objective amount of information inherent in the scene at the potential target of fixation.

Recent attempts at computational modeling of the human visual system have focused mainly on bottom-up, or stimulus-driven, processing; in other words, processing that begins with pixel counts from the digitized image and proceeds upward through successive layers of increasing abstraction. The idea is to detect luminance differences directly from the digitized image, and from those differences locate edges, boundaries, homogeneous regions, surfaces, and eventually objects and their 3D representations. Scene semantics arrive last in the chain of processing, if at all. An advantage of this approach is that the scene is represented in its entirety

computationally prohibitive and does not make optimal use of limited neural resources.

An alternative to bottom-up processing is top-down, or task-dependent, processing. With top-down processing, one might begin with a conceptualized object described in abstract terms, such as "a chair has four legs," and proceed down through a hierarchy of increasing detail, eventually reaching a scene description in terms of primitive features. The disadvantage here is the difficulty of conceptualizing abstract, or non-representational, items; however, the advantage is a compact representation of the scene semantics.

A model based on selective perception and perceived conspicuity combines aspects of both bottom-up and top-down information in a unique way. The degree to which bottom-up or top-down is employed is largely a function of the goals of the system and its current state. The result is a computational model of visual perception and processing that is a reflection of the ongoing interaction between an active visual system and the environment.

This thesis is devoted to the goal of showing that the perceptually significant information content of any particular region in an image or scene must ultimately take into account the implicit semantics of that image or scene, that is, the "meaningfulness" of the scene for the viewer. This approach implies that specific objects, as well as their relative and expected locations, play an important role in determining meaningfulness in natural scenes, especially when combined with action-implied imperatives. The low-level, bottom-up features such as edges, colors, and luminance contrast cannot be ignored

willbe shownthat_successfully_{predicting fixation densities in}natural imagesand inreal,

extended environments requires computational algorithms that combine _bottom-up

processing with top-down constraints in a _way that is context-sensitive and _ultimately

mostmeaningfulfortheviewer.

1.3 Outline ofthe presented work

The remainder of this thesis is organized as follows: Chapter 2 highlights background material that is essential for a complete understanding of the issues involved in computational modeling of selective visual perception and human visual behavior in natural, unrestricted environments. This chapter includes a detailed historical account and review of the literature showing how previous work has led to the present state of the field. Issues relating to the physiology of the human visual system, attentional mechanisms, task-oriented vision and natural tasks are discussed, as are variations in approaches that have been applied to the computational modeling of visual attention.

Chapter 3 is an outline of the experimental method that was applied in order to extract data on human visual behavior in natural environments, as well as in the restricted environment of 2D image viewing. Studying eye movements outside of the confines of a restricted laboratory setting is a topic of current interest, yet this research area remains largely unexplored and undocumented in the literature. Novel hardware and software were developed by the Visual Perception Laboratory at the Rochester Institute of Technology to enable a thorough data collection and analysis procedure. The results,


hardware, as well as the software that was developed for the eye-tracking sessions and data analysis, is included in Chapter 3.

Chapter 4 describes a result gleaned from evaluating eye-tracking data in the real world: that visual routines are somewhat "modular" in nature. That is, when metrics such as fixation duration, saccade amplitude, and gaze-change interval are used to describe certain "primitive" visual behaviors such as reading text or having a face-to-face conversation, stereotypical patterns of oculomotor behavior result. This evidence supports the hypothesis that the human brain is organized in such a way as to make use of pre-determined low-level visual routines in order to enhance functioning in a complex and constantly changing environment. Pre-determined routines may affect perceived conspicuity by restricting the focus of attention to expected useful locations.

Chapter 5 is an extended study into the high-level visual strategies employed by people that either enhance or detract from perceptual conspicuity in the environment. For example, when walking along a corridor after a high-cognitive load task has been imposed (memorization of a random block pattern), fixations tend to be longer and more centrally located in the scene, and saccade amplitudes tend to be shorter, than when there has been no cognitive load imposed. This implies that objects in the environment that would have attracted attention when the task is not cognitively challenging do not do so when the system is otherwise occupied.

Chapter 6 outlines the computational model that is developed and used to predict fixation densities on natural-task images and in the real world. The model is a


and top-down, task-oriented constraints. The model takes the form of a topographic map of perceived conspicuity values, where the value at a coordinate in the map is a measure of how important that coordinate is for perception. The model is a partial adaptation of the stimulus-driven approach taken by others (as discussed in Chapter 2), yet it is original in the sense that it uses a novel method to take into account context-sensitive information about the scene at both the higher levels and the lower levels. A novel algorithm is used to inhibit regions in the scene that are not likely to be perceptually important and enhance those regions that are. The resulting model is shown to correlate well with the fixation densities of human subjects.

Chapter 7 is a summary and conclusion of the work presented in this thesis, and


Background

2.1 Historical perspective

One of the earliest theories of spatial attention, the attentional spotlight, originates from the psychophysical work of Hermann von Helmholtz (1867). The spotlight metaphor captures the concept of an "internal eye", i.e., an implicit fovea which localizes an object in space and focuses all of the processing on that one object before moving to another location in the field. Any information that is not centered on the implicit fovea is diminished.

The idea of using a spotlight as a metaphor for attention was further developed later in the 20th century (Crick, 1984; Treisman, 1982). Within the spotlight, objects which are being attended to are highlighted, or illuminated, so that information about that object will be processed more efficiently and at a higher level than other objects in the


existence of an attentional spotlight (Sagi and Julesz, 1986), however most current thinking considers the metaphor to be too simplistic to capture all of the nuances of selective attention. The early evidence points to observations made during psychophysical studies of filtering tasks. For example, Sagi and Julesz (1986) studied the ability of subjects to discriminate the orientation of briefly presented bar targets located in the periphery. On some trials a small light was flashed close to the peripheral target; on other trials the light was flashed near a peripheral non-target. Subjects were able to detect the light only when it was flashed within a certain area near the target, even though in both cases the light was located at the same foveal eccentricity. The authors suggest that the area around the target at which the light could reliably be detected delineated the contour of the spotlight of attention.

Other studies have demonstrated that the area covered by the spotlight does not necessarily cover contiguous regions in the field (Pylyshyn and Storm, 1988). Duncan (1980) showed subjects a circular display containing eight characters from which they were to locate the target letter, Q. Distractor letters were either O's and C's, or O's and K's, placed at random circular positions in the display. Subjects were told ahead of time which four of the eight positions the target could be located in (the relevant positions). The other four positions were irrelevant and could be ignored.

The study found that the O and C distractors had little effect on the subjects' ability to locate the target, regardless of whether they were located in the relevant or irrelevant field. When the O and K distractors were in the relevant field, search times were slowed, presumably because of feature interaction (Treisman and Gelade, 1980). When the O and K distractors were in the irrelevant field, search times were not slowed, presumably because the subjects were able to attend to the non-contiguous relevant locations while ignoring the also non-contiguous irrelevant locations. The ability to attend to non-contiguous areas when the demands of the task so require is evidence that high-level processes can mediate the acquisition of visual stimuli.

Another study found that the spotlight does not end abruptly at one location before it moves to the next, nor does it sweep continuously across the field of view (Sperling and Weichselgartner, 1995). Processing is completed in a select area, fades, and then moves to a new area to resume building strength there. An extension of the spotlight metaphor for attentional capture is the zoom lens metaphor, which suggests that the area under consideration is examined with variable spatial extent (Eriksen and St. James, 1986). In this case the amount of detail available for processing is inversely related to the size of the area being processed.

The theories mentioned thus far have assumed a serial mechanism for selection, i.e., the focus of processing is completed at a single select region before moving on to the next region. An alternative to focused, serial processing of attention is dispersed, or parallel, processing, originating with the work of James (1890). With dispersed processing, the focus of attention is spread uniformly across the field of view. Neisser (1967) was the first to show that the two theories need not be mutually exclusive, but rather may be thought of as part of the same process existing at two distinct phases. The pre-attentive phase is the earlier stage in terms of processing, and is considered relatively


attentive stage integrates information from a particular area of the field, and is considered slow, voluntary, and progresses serially from one region to the next.

Much of the work that has been conducted on the 2-phase theory of selective perception has been under the experimental paradigm known as visual search. In this paradigm the amount of time it takes to complete a search is plotted as a function of the total number of items in the display. A flat response indicates a fast, parallel process, whereas a linear response indicates a slower sequential process. The feature integration theory of selective visual attention (Treisman and Gelade, 1980; Treisman, 1988) is an attempt to define the purpose of focused attention using the visual search paradigm. According to feature integration theory, elementary features such as color and shape are processed before objects that require a conjunction of several features, such as a blue box, or a gray kitten. Focused attention is necessary to conjoin the separate features, which then enables proper identification of the object.
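
The distinction between a flat and a linear response in the visual search paradigm can be made concrete with a small sketch. The Python below fits a least-squares slope (msec per item) to reaction time as a function of display size and labels the search accordingly. The function names, the sample reaction times, and the 5 msec-per-item cutoff are illustrative assumptions, not values taken from the studies cited above.

```python
def search_slope(set_sizes, reaction_times_ms):
    """Least-squares slope (msec per item) of reaction time against display size."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(reaction_times_ms) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(set_sizes, reaction_times_ms))
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    return sxy / sxx


def classify_search(slope_ms_per_item, flat_cutoff=5.0):
    """Label a search 'parallel' if the slope is nearly flat, else 'serial'.

    flat_cutoff is a hypothetical threshold chosen for illustration only.
    """
    return "parallel" if abs(slope_ms_per_item) < flat_cutoff else "serial"


# Hypothetical data: display sizes and mean reaction times (msec).
SET_SIZES = [4, 8, 16, 32]
FEATURE_RT = [450, 452, 449, 451]        # roughly flat ("pop-out")
CONJUNCTION_RT = [480, 600, 840, 1320]   # grows by 30 msec per item

feature_label = classify_search(search_slope(SET_SIZES, FEATURE_RT))
conjunction_label = classify_search(search_slope(SET_SIZES, CONJUNCTION_RT))
```

With these made-up numbers the feature search is labeled parallel and the conjunction search serial. In practice the boundary between the two is not sharp, so any single cutoff should be read as a convenience rather than a claim about the underlying mechanism.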

A series of experiments was designed to distinguish between features that are elementary (also called integral) and features that are separable and require focused attention for integration. The hypothesis was that an integral feature would elicit a flat response time and exhibit "pop-out" in a field of distractors, whereas an object with separable features would require a sequential (conjunctive) search and elicit a linear response time. The results showed that when the elementary features were chosen to be colors or shapes (for example a green object in a field of red distractors), search times were constant with the number of distractors. When separable features were chosen as a conjunction of two elementary features (a green disc in a field of green squares and red


A hallmark of the theory is that the pre-attentive stage extracts the primitive features in parallel across the visual field, and the attentive stage is required for binding the separable features within a small part of the field. As evidence against the theory, it has been shown that it is possible to perform some conjunctive searches in parallel if the separable features consist of color, motion, or depth (Nakayama and Silverman, 1986). Also, recent studies have shown that reaction times for conjunctive searches can range from close to 0 milliseconds per item (pop-out) to 30-50 milliseconds per item, depending upon the degree of similarity between the target and the distractors (Deco, Pollatos and Zihl, 2002).

Two-phase theories of visual attention do not explicitly describe how the selection process is controlled. Questions such as "what is the region of interest?" and "where should the next fixation be?" can be approached by considering the purpose of focused attention.

The notion of a saliency map was proposed to define the relationship between the components of a scene according to their relative importance to the observer (Koch and Ullman, 1985; Mahoney and Ullman, 1988). The essential components of a saliency map include a priority map for rating the relative components of the scene, and a gating mechanism whereby the selected regions are processed and the non-selected regions are inhibited. According to the theory, the visual system performs an initial low-frequency parsing of the environment to identify potential regions of interest, and assigns to each region a weight according to the computed saliency. For example, bright colors, high


assigned a heavy weight. This information is recorded in a topographic map of the scene, which indicates the weight of every element in that scene.

The map is dynamic in the sense that the gating mechanism chooses the element with the highest current weight to be the target of focused processing, and then suppresses this element when processing is complete. An inhibition-of-return mechanism (Posner and Cohen, 1984) is used to reduce the saliency at the current focus of attention so that the next highest region may be selected for processing. This mechanism is thought to bias the attentional resources toward novel stimuli that appear in the field by reducing the salience of an item that has been viewed for at least 300 msec.
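
The interplay between the gating mechanism and inhibition of return can be sketched in a few lines of Python. This is a minimal illustration, assuming a small grid of made-up saliency weights and a single multiplicative suppression factor; the function name and parameters are hypothetical, and the 300 msec viewing-time condition mentioned above is not modeled.

```python
def scan_saliency_map(saliency, n_fixations, suppression=0.0):
    """Select locations in order of decreasing saliency with inhibition of return.

    saliency    -- 2D list of non-negative weights (hypothetical values)
    suppression -- factor applied to a location once it has been attended;
                   0.0 removes it from further competition
    """
    working = [row[:] for row in saliency]   # copy so the input map is untouched
    fixations = []
    for _ in range(n_fixations):
        # Gating step: winner-take-all over the current weights.
        row, col = max(
            ((r, c) for r in range(len(working)) for c in range(len(working[r]))),
            key=lambda rc: working[rc[0]][rc[1]],
        )
        fixations.append((row, col))
        # Inhibition of return: reduce the weight at the attended location.
        working[row][col] *= suppression
    return fixations


EXAMPLE_MAP = [
    [0.1, 0.9, 0.2],
    [0.4, 0.3, 0.8],
]
scan_path = scan_saliency_map(EXAMPLE_MAP, n_fixations=3)
```

For the example map the selected locations are visited in order of decreasing weight (0.9, then 0.8, then 0.4). Without the suppression step the same peak would win every iteration, which is the behavior inhibition of return is meant to prevent.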

The guided search model (Wolfe, Cave, and Franzel, 1989; Wolfe, 1994) is an adaptation of the visual search paradigm that uses the concept of a salience map to prioritize potential items of focused attention. The basic idea is that a parallel-feature computation stage guides a later serial attentive stage. The highly salient targets should be detected more quickly than non-salient targets, giving rise to constant search times for elementary features. Slower, conjunctive search times are the result of the contribution of noise from competing feature dimensions during the parallel feature computation stage.

Alternatives to the guided search model are the search via recursive rejection model (Humphreys and Müller, 1993) and models based on signal detection theory (see Verghese, 2001, for a review). The search via recursive rejection model (SERR) is a connectionist model that recursively rejects regions where clearly defined groupings of distractors occur. In other words, if stable groupings occur everywhere at the lowest


(differently grouped) distractors. Search is slowed when groupings contain elements of both target and distractors. Signal detection theory uses a variable threshold to distinguish between fast search and slow search. The threshold is usually described as a decision rule that depends upon distractor discriminability rather than a parallel/serial dichotomy. Accordingly, the decision rule takes into account a wide range of factors that might contribute to search response times, and does not assume a parallel pre-attentive stage is followed by a serial attentive stage.

In summary, the history of thought on the topic of selective visual perception begins with the earliest metaphors of an internal eye, and a spotlight or zoom lens of focused attention. From there, the visual search paradigm has produced theories describing a pre-attentive and attentive processing of integrated features, and progressed to the more current concepts of a topographic map of saliency values or signal detection. What remains is a means of incorporating context sensitivity and task-relevancy into theories of selective visual perception.

2.2 The human visual system

2.2.1 Image formation

Light from the surrounding area enters the eye and undergoes several transformations that enable the brain to make use of information from that surrounding. The transformations are both optical and neural in nature, and begin with the transformation of light energy


Figure 2-1 Cross section of the human eye, depicting image formation components (labels: cornea, retina, optic nerve; horizontal axis: visual angle, degrees from fovea). Adapted from Palmer, 1999.

The retina is a layer of neural tissue approximately 0.4 mm thick and is the repository of over 100 million light-sensitive photoreceptors called rods and cones (approximately 100 million rods and 5 million cones in each eye). Figure 2-1 shows that the distribution of rods and cones across the surface of the retina is highly uneven, with most of the cones located in a small central area of the retina called the fovea. The cones are responsible for both color perception and high visual acuity; thus, the eyes must move in order to obtain detailed, high-resolution information from different regions in the visual field. The area of the field covered by the fovea is approximately 2° of visual angle, which is approximately the width of a thumb extended at arm's length.

Retinal cones can be classified into one of three different types, depending upon the wavelength sensitivity of the cell's photopigment


(for short, medium, and long wavelength response). Figure 2-2 shows the spectral sensitivities of the three cone types.

Figure 2-2 Spectral sensitivities of the three types of cone photoreceptors (curves labeled S-cones, M-cones, and L-cones; horizontal axis: wavelength, nm, from 400 to 700). Measurements include light loss due to absorption from the cornea, lens and other pigments in the eye. From Stockman and MacLeod, 1993.

The absorption of photons by the S-cone photoreceptors is significantly different from that of the M- and L-cone photoreceptors. The S-cones are particularly sensitive to short-wavelength photons and are the primary detectors when short-wavelength light is at the threshold of detection. Both the M- and L-cones will detect longer wavelength light since there is a greater amount of overlap in those response curves. Also, S-cones are known to be relatively rare in the retina and are not present at all in the central part of the fovea (Wandell, 1995). S-cones are spaced relatively far apart in the fovea, with a spacing of 10 arc minutes, whereas the spacing for the L- and M-cones is 0.5 arc minutes. The consequence of wide spacing is that the sampling frequency is reduced for the


mosaic is that the visual system will encode only relatively slowly varying spatial and temporal signals originating in the short wavelength region of the spectrum.
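
The effect of cone spacing on sampling frequency can be worked through numerically. The sketch below assumes, as a simplification not made in the text, that each cone class forms a regular one-dimensional sampling lattice, and applies the standard Nyquist limit of one cycle per two samples to the spacings quoted above. The function name is hypothetical.

```python
def nyquist_limit_cpd(spacing_arcmin):
    """Highest spatial frequency (cycles per degree) that a regular sampling
    lattice with the given spacing can represent without aliasing."""
    samples_per_degree = 60.0 / spacing_arcmin   # 60 arc minutes per degree
    return samples_per_degree / 2.0              # two samples per cycle


s_cone_limit = nyquist_limit_cpd(10.0)   # S-cone spacing quoted in the text
lm_cone_limit = nyquist_limit_cpd(0.5)   # L- and M-cone spacing quoted in the text
```

Under this assumption the S-cone mosaic is limited to about 3 cycles per degree, compared with about 60 cycles per degree for the L- and M-cones, which is consistent with the statement that only slowly varying short-wavelength signals are encoded.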

2.2.2 Center-surround organization of receptive fields

Retinal neurons and cortical neurons develop from the same bio-chemical processes, and as such the retina can be considered to be part of the central nervous system (Wandell, 1995). Many of the physiological and organizational properties of the cortex apply to the retina as well. Similar to the cortex, the retina is a multi-layered surface, with the first few layers of the retina consisting of ganglion cells that exhibit spatial interaction with neighboring cells. Neurons in each layer excite corresponding neurons at a higher layer and inhibit neighboring neurons in the same layer. The result of the network of connections is called lateral inhibition. The network projecting from any particular neuron to neighboring neurons is called the projective field of that neuron. The pattern of connections in the opposite direction, from the receiving neuron to those neurons that influence it, is called the receptive field of that neuron.

As mentioned earlier, visual perception can be described as a series of transformations that begins with the input ambient light array and proceeds through higher levels of cortical processing. Since the receptive field of a retinal neuron is the area in which light influences the neuron's response, lateral inhibition and receptive fields can be thought of as the transformation properties of retinal neurons.

The receptive field of a neuron in the retina can be described as having a


of action potentials results. However, if light activates only the central part of the receptive field and not the surrounding area, an elevated response in terms of the firing-rate with respect to the random response will result, and the neuron is said to have an on-center/off-surround organization. For this case, light activating only the inhibitory surround will cause a significant decrease in the firing rate. A neuron exhibiting the opposite pattern of activation is said to have an off-center/on-surround organization. Figure 2-3 depicts a schematic of the different response properties of retinal neurons.

Figure 2-3 Receptive fields of two types of retinal neurons: on-center/off-surround and off-center/on-surround (each panel shows a stimulus and the corresponding response). Yellow areas indicate locations of light stimulus.
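
The two response patterns can be summarized with a toy rate model. The Python below is only a sketch: it assumes a linear difference between center and surround illumination added to a baseline firing rate and clipped at zero, and the baseline and gain values are arbitrary placeholders rather than physiological measurements.

```python
def firing_rate(center_light, surround_light, on_center=True,
                baseline=10.0, gain=20.0):
    """Toy firing rate (spikes/sec) of a center-surround neuron.

    center_light, surround_light -- illumination in each region, 0.0 to 1.0
    baseline, gain               -- hypothetical constants for illustration
    """
    drive = center_light - surround_light   # center excites, surround inhibits
    if not on_center:
        drive = -drive                      # off-center/on-surround: roles swap
    return max(0.0, baseline + gain * drive)


center_only = firing_rate(1.0, 0.0)       # light on the center only
surround_only = firing_rate(0.0, 1.0)     # light on the surround only
off_center_surround_only = firing_rate(0.0, 1.0, on_center=False)
```

In this toy model, light confined to the center of an on-center cell raises the rate above baseline, light confined to the surround lowers it, and the off-center cell shows the opposite pattern.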

The receptive field structure of neurons continues along the central nervous system from the retina to the lateral geniculate nucleus (LGN) of the thalamus and on to area V1 (primary visual cortex), with some qualitative differences. For example, the


have elongated shapes and are orientation and direction selective. Also, cortical neurons can be classified into two broad categories: simple and complex (Hubel, 1988). Simple cells have response properties that conform to linearity and superposition principles, whereas complex cells do not.

2.2.3 Contrast sensitivity function

The contrast sensitivity function (CSF) is typically defined as the sensitivity of observers to sinusoidal gratings of varying frequencies. The technique used to measure the CSF is to ask observers to adjust a threshold until a just-noticeable difference between a uniform gray field and a sinusoidal pattern is detected. When thresholds are measured for a range of frequencies, a contrast threshold function is plotted showing the minimum contrast at threshold as a function of spatial frequency. The reciprocal of the contrast threshold function is the contrast sensitivity function. A typical contrast sensitivity function is depicted in Figure 2-4, showing that frequencies in the range of 4-5 cycles per degree of


Figure 2-4 Contrast sensitivity function with example spatial frequencies and on-center/off-surround neuron tuned to the peak response (vertical axis: contrast sensitivity, low to high; horizontal axis: spatial frequency, cpd). Adapted from Wandell, 1995.
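
The reciprocal relationship between the contrast threshold function and the CSF can be shown with a short computation. The thresholds below are invented for illustration and are not measurements from any cited study; they are chosen only so that the resulting curve has the band-pass shape described in the text.

```python
def contrast_sensitivity(thresholds):
    """Sensitivity at each frequency is the reciprocal of the contrast threshold."""
    return {freq: 1.0 / threshold for freq, threshold in thresholds.items()}


# Hypothetical contrast thresholds (0..1) keyed by spatial frequency in cpd.
THRESHOLDS = {0.5: 0.02, 1: 0.01, 2: 0.005, 4: 0.004, 8: 0.008, 16: 0.03, 32: 0.2}

csf = contrast_sensitivity(THRESHOLDS)
peak_frequency = max(csf, key=csf.get)   # frequency with the highest sensitivity
```

Because the lowest threshold in this made-up data set sits at 4 cpd, that is where the computed sensitivity peaks, with sensitivity falling off toward both lower and higher frequencies.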

The CSF can also describe a retinal ganglion cell's receptive field. The most effective frequency for any ganglion cell is a measure of the size of that cell's receptive field (Wandell, 1995). For example, Figure 2-4 depicts an on-center/off-surround ganglion cell whose peak response is at the peak sensitivity of the contrast sensitivity curve, i.e., the most effective spatial frequencies for this cell are the intermediate frequencies. At lower spatial frequencies, light falling on the surround reduces activity from the center, and at high spatial frequencies, light falling on the center is averaged over several cycles, again lowering the overall activity.

In general, contrast patterns such as sinusoidal gratings at a fixed luminance level provide an effective measure of the input/output behavior of neurons. Adaptation effects operating over a very large range of luminance levels make direct comparisons difficult because of the highly non-linear response characteristics of neurons. Therefore, a


response can then be characterized by cumulative comparisons over a range of mean luminance levels.

2.2.4 Opponent processes

In 1867 Helmholtz described what has come to be known as the trichromatic theory of color vision (Helmholtz, 1867/1925). Essentially, this theory describes color perception as the result of the three photoreceptors' response to photons of a particular wavelength. Any single photoreceptor cannot distinguish between different colors; it is the overlap among the three spectral response curves that contributes to the unique perception of color.

Trichromatic theory explained much about color perception, such as the psychophysical observation that any perceived color can be matched by a combination of the three primary colors of red, blue, and green. It cannot explain many subjective color experiences, however, such as the observation that certain color combinations such as red and green, or blue and yellow, are not easily imagined as a single color. In addition it does not explain why color vision deficiencies are always the result of the loss of pairs of colors: red and green, or blue and yellow. Also, psychologically, yellow appears to be a primary color and not the combination of red and green as would be predicted by trichromatic theory.

In 1878 Ewald Hering proposed the opponent process theory of color perception to explain the perceived, or subjective, experience of color (Hering, 1878/1964). Opponent process theory describes color perception as the result of four chromatic primaries that are arranged in polar pairs


and yellow form the other polar pair. Each of the three retinal receptor types is responsible for detecting photons of the proper wavelength range along one polar dimension: the R/G dimension, the B/Y dimension and an achromatic dimension of black/white that detects luminance levels. Physiologically, Hering theorized that the experience of red could be the result of a sufficient amount of a certain chemical in the R/G photoreceptor, and the experience of green could be the result of a depletion of that chemical on the same photoreceptor. Hurvich and Jameson (1957) conducted psychophysical experiments to verify predictions of opponent-process theory, using hue cancellation techniques. The central idea was that if blue and yellow are polar components of the same mechanism, then one should be able to cancel the amount of "blueness" in a light by adding a certain amount of "yellow". The results of those experiments showed strong evidence supporting the opponent-process theory.

In 1905 von Kries laid the foundation for a dual-process theory of color perception that consists of two sequential stages of color processing: a trichromatic stage at the level of retinal photoreceptors and an opponent-process stage at a higher level (von Kries, 1905). More recent physiological studies have shown that color opponent cells exist in the LGN of macaque monkeys and that both processing stages are performed in the retina (De Valois, 1965, and De Valois, Abramov and Jacobs, 1966).

The implication of dual-process theory for visual perception at a higher, conscious level of awareness is that the re-parameterization of responses from the three photoreceptors to a more psychological color appearance is more ecologically useful. Separating luminance from chromaticity is advantageous because it allows the


falling over a surface (a measurement along the luminance axis) and changes in the scene that result from encountering a new surface (a measurement along one of the chrominance axes).
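
The benefit of separating luminance from chromaticity can be illustrated with a simplified re-parameterization of cone responses. The Python below is a sketch under strong assumptions: the channel weights are a textbook-style simplification chosen for illustration, not the transform used in this work or measured from physiology, and the cone response values are invented.

```python
def opponent_channels(l_cone, m_cone, s_cone):
    """Map three cone responses to (luminance, red_green, blue_yellow).

    The weights are an illustrative assumption, not measured values.
    The chromatic channels are divided by luminance so that they do not
    change when all three responses are scaled by the same factor.
    """
    luminance = l_cone + m_cone
    if luminance <= 0:
        raise ValueError("cone responses must give positive luminance")
    red_green = (l_cone - m_cone) / luminance
    blue_yellow = (s_cone - 0.5 * (l_cone + m_cone)) / luminance
    return luminance, red_green, blue_yellow


lit = opponent_channels(0.6, 0.3, 0.1)        # a surface in full light
shaded = opponent_channels(0.3, 0.15, 0.05)   # same surface, half the light
other = opponent_channels(0.3, 0.6, 0.1)      # a different surface
```

Halving the illumination changes only the luminance value, while the two chromatic values stay the same; moving to a different surface changes the chromatic values. This is the sense in which a luminance axis and chrominance axes carry different kinds of scene information.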

2.2.5 Eye movements

In general, eye movements fall under two broad categories: smooth and saccadic. Smooth eye movements, such as smooth pursuit, vergence, and the vestibulo-ocular reflex (VOR), enable the tracking of moving objects, whereas saccadic eye movements are swift and abrupt, and allow the eyes to shift fixation from one object in the field to another. The optokinetic response (OKR) is a combination of both a smooth and a saccadic movement, and is characterized by a slow, smooth phase followed by a swift, saccadic snap of the eyes back in the direction opposite the movement of the tracked object. From a cognitive point of view, a saccadic eye movement is the more interesting oculomotor behavior primarily because it is an external manifestation of a pre-conscious choice, i.e., the eyes must move in order to obtain detailed, high-resolution information from interesting areas in the environment.

Saccades are high velocity, ballistic eye movements that have the function of bringing retinal images of objects of interest from the periphery to the fovea for closer inspection. A typical saccade takes approximately 150-200 msec to plan and execute: planning takes about 150 msec on average and the duration of the eye movement is approximately 20 msec plus 2 msec per degree of visual angle (Carpenter, 1988). Saccades can reach up to 900 degrees per second, and individuals typically make 3 or 4


Studies on eye movements during reading have shown that saccades during reading are typically seven letters long, which results in a saccade length of between 1° and 2° for reading standard size text at a distance of 40 cm (O'Regan, 1990). There is also a wide distribution of within-word target landing for reading text, i.e., there is no precise position within the word that is the saccadic landing target; anywhere within the word is sufficient for comprehension (Morgan, et al., 1990). Fixations are defined as the time between successive saccades. A typical fixation duration for reading text is between 200 and 300 msec.
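
The figures quoted above for saccade duration and reading saccade length can be checked with a short calculation. The sketch below uses the standard visual angle formula and the duration rule attributed to Carpenter (1988); the letter width of 0.2 cm is an assumed value for "standard size" text, and the function names are hypothetical.

```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle subtended by an object of the given size at the given distance."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


def saccade_duration_ms(amplitude_deg):
    """Movement duration: about 20 msec plus 2 msec per degree (Carpenter, 1988)."""
    return 20.0 + 2.0 * amplitude_deg


LETTER_WIDTH_CM = 0.2   # hypothetical letter pitch for "standard size" text
amplitude = visual_angle_deg(7 * LETTER_WIDTH_CM, 40.0)   # seven-letter saccade
duration = saccade_duration_ms(amplitude)
```

With the assumed letter width, a seven-letter saccade at 40 cm spans roughly 2 degrees, at the upper end of the range given in the text, and the corresponding movement lasts roughly 24 msec. Adding the planning time of about 150 msec gives a total within the 150-200 msec range quoted for a typical saccade.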

It should be noted that saccadic eye movements are one example of overt manifestation of visual selectivity and orienting of attention; head movements and postural adjustments are among the others. Covert orienting of attention and inner acts of selection are not necessarily accompanied by any overt signs. It is possible, though unusual, to attend to one area of the visual field while fixating another (Corbetta, 1998; Kustov and Robinson, 1996).

Recent studies have suggested that the classification of eye movements into sub-categories such as smooth-pursuit, vergence, and VOR ignores the behavioral significance of eye movements, and reflects the simple tasks of the early studies that were performed in a constrained and sparse visual environment (Steinman, Kowler, and Collewijn, 1990). The claim is that the experimental results of such early studies reflect low-level and involuntary aspects of oculomotor control that do not surface in a

te