Visual-Auditory Integration in the Superior Colliculus

The multimodal integration of spatial information from vision and audition is accomplished (at least in part) by the superior colliculus (SC), which has been implicated in the spatial processing of both types of sensory information. In the SC, one finds topographically organized spatial maps in a number of different sensory modalities, including vision and audition. Some of these maps are modality-specific, such that their cells respond only to visual stimuli or only to auditory stimuli; however, the SC also contains maps whose cells respond to multisensory stimuli. That is, these cells receive both visual and auditory input and respond with roughly equal strength to auditory and visual stimuli in a particular region of space. The result is that when a multimodal stimulus (i.e., one that can be discriminated by both the visual and auditory systems) falls within the receptive fields of these multisensory cells, the combined excitatory stimulation from auditory and visual inputs increases the salience of the stimulus.

What’s more, all of these distinct topographic maps (visual, auditory, multisensory, and others) are aligned with one another in the SC; they are literally superimposed over one another in the brain. As Jay & Spark (1987) point out, this implies that these sensory signals have been translated into a common coordinate system. Indeed, as was mentioned above, in order to c-coordinate maps in different sensory modalities, they must be translated into a common coordinate scheme. Prior to the SC, the location of a visual stimulus is encoded in terms of retinal coordinates plus eye position, while the location of an auditory stimulus is encoded in head-centered spherical bipolar coordinates of azimuth, elevation, and distance. If the multimodal maps in SC truly do represent spatial locations in a way that is independent of either vision or audition, then it must be in terms of some common, inter-translatable coordinate scheme.

There are, of course, different possible ways in which these coordinate schemes could be translated into a common one. For example, both could be translated into an egocentric Cartesian coordinate scheme centered on the head. (Indeed, given that both coordinate schemes are head-centered to some degree, this makes a certain amount of intuitive sense.) However, it seems that this is not how the brain represents visuo-auditory space. Rather, multimodal cells in SC that respond to both auditory and visual information appear to encode spatial relations in terms of motor coordinates; specifically, in terms of what is known as gaze shift, the change in eye position required to orient towards a stimulus.

Thus, auditory space has been translated into motor coordinates which specify the change in eye position required to look at an auditory stimulus, encoded in terms of retinal displacement. (Jay & Spark, 1987, p.50) Furthermore, in order to maintain the alignment of visual and auditory maps following a saccade, there is a dynamic re-mapping of auditory receptive fields in SC: the spatial location of the receptive field of cells that selectively respond to auditory stimuli is significantly modulated by the position of the eyes in their sockets.
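The translation into gaze-shift coordinates can be illustrated with a minimal sketch. It is not a model of SC circuitry: it assumes that angles simply subtract (a small-angle simplification), it ignores distance, and the names `Direction`, `visual_gaze_shift`, and `auditory_gaze_shift` are hypothetical labels introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Direction:
    """Angular direction in degrees: azimuth (+ = right), elevation (+ = up)."""
    azimuth: float
    elevation: float

    def minus(self, other: "Direction") -> "Direction":
        # Simplifying assumption: angular coordinates subtract component-wise.
        return Direction(self.azimuth - other.azimuth,
                         self.elevation - other.elevation)


def visual_gaze_shift(retinal: Direction) -> Direction:
    # Retinal location is measured from the fovea, so it already equals
    # the eye movement needed to look at the stimulus.
    return retinal


def auditory_gaze_shift(head_centered: Direction,
                        eye_in_head: Direction) -> Direction:
    # Head-centered sound location minus current eye position gives the
    # required change in eye position.
    return head_centered.minus(eye_in_head)


# A sound source 20 degrees right of the head's midline, at eye level.
speaker = Direction(20.0, 0.0)

# Eyes deviated 10 degrees to the right before the saccade.
eyes_before = Direction(10.0, 0.0)
shift_before = auditory_gaze_shift(speaker, eyes_before)

# After a 15-degree leftward saccade the eyes sit at -5 degrees.
eyes_after = Direction(-5.0, 0.0)
shift_after = auditory_gaze_shift(speaker, eyes_after)

# A light at the speaker's location falls on the retina at
# (target - eye position), so vision yields the same motor vector.
light_on_retina = speaker.minus(eyes_before)
maps_aligned = visual_gaze_shift(light_on_retina) == shift_before

print(shift_before, shift_after, maps_aligned)
```

In this sketch the same stationary sound calls for a 10-degree gaze shift before the saccade and a 25-degree gaze shift after it, which is why an auditory receptive field expressed in these coordinates has to move whenever the eyes do.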

This integration of auditory and visual spatial information with information about the motor system is possible because the SC contains topographic maps not only of visual and auditory space but also of motor space.101 That is, just as one can specify the spatial location of a stimulus in terms of a coordinate scheme defined in terms of the stimulation of sensory receptors (e.g., in terms of retinal location, or interaural time difference, or bodily location of a tactile receptor), one can also specify the relative spatial position (and orientation) of body parts in terms of a coordinate scheme defined in terms of motor activity. As Grush puts it,

“…it is possible to specify the location of my hand relative to my torso by giving the angles of my shoulder, and elbow joints. Given that my shoulder has three (actually more than three, but let’s keep it simple) and my elbow one degree of freedom, one can specify my hand position relative to my torso as a point in a four-dimensional joint-angle ‘space’” (2000, p.66)

101 In fact, the SC also contains maps of somatosensory space that are aligned with the visual, auditory, motor, and multi-sensory maps. However, I will refrain from including it in the discussion here for the purposes of simplicity.
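Grush's four-dimensional joint-angle space can be made concrete with a small forward-kinematics sketch. Everything specific here is an assumption made for illustration: the segment lengths, the torso-fixed axes (x forward, y left, z up, origin at the shoulder), the yaw-pitch-roll ordering of the three shoulder angles, and the function name `hand_position`.

```python
import math


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    # 3x3 matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(m, v):
    # 3x3 matrix times 3-vector.
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def hand_position(joint_angles, upper_arm=0.30, forearm=0.25):
    """Map a point in 4-D joint-angle space to a 3-D hand position (metres).

    joint_angles = (yaw, pitch, roll, elbow) in radians. With all angles
    at zero the arm points straight ahead along +x, fully extended.
    Segment lengths are illustrative defaults, not measured values.
    """
    yaw, pitch, roll, elbow = joint_angles
    shoulder = matmul(rot_z(yaw), matmul(rot_y(pitch), rot_x(roll)))
    elbow_pos = apply(shoulder, [upper_arm, 0.0, 0.0])
    # Elbow flexion rotates the forearm about the upper arm's local z axis.
    forearm_vec = apply(matmul(shoulder, rot_z(elbow)), [forearm, 0.0, 0.0])
    return [e + f for e, f in zip(elbow_pos, forearm_vec)]


straight = hand_position((0.0, 0.0, 0.0, 0.0))
bent = hand_position((0.0, 0.0, 0.0, math.pi / 2))
bent_rolled = hand_position((0.0, 0.0, math.pi / 2, math.pi / 2))

print(straight, bent, bent_rolled)
```

Each four-tuple of angles picks out one hand position relative to the torso, which is the sense in which joint angles serve as a coordinate scheme. The last two calls differ only in shoulder roll, yet the hand ends up to the side in one case and raised in the other.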

The s-coordination of sensory (e.g., visual and auditory) information with motor information is absolutely crucial to set the stage for a subsequent c-coordination of visual and auditory space. Auditory and visual spatial information can be integrated into a multimodal topographic map only because the motor system actively translates auditory spatial location into gaze shift vectors in a coordinate scheme based on retinotopic displacement. As Grush points out, the involvement of motor maps is particularly important in the representation of spatial location because it provides a common frame of reference that allows for the integration of different types of sensory information, especially in cross-modal cases.

Consider, for example, what sort of s-coordinations are actually required in order to locate a stimulus in visual egocentric space. First (as was mentioned above) one must coordinate visual information about the location of retinal stimulation with proprioceptive information about not only the position of the eyes in their sockets, but also the orientation of the head relative to the torso. This type of proprioceptive information is registered by, e.g., muscle spindles102 in both the extraocular muscles (which control eye movement) and the neck muscles. A representation of auditory egocentric space can be formed in an analogous manner; that is, via the coordination of information about the azimuth, elevation, and distance of auditory stimuli with proprioceptive information about the position of the head relative to the torso and vestibular information about the orientation of the body and/or head. (Indeed, precisely this sort of integration of proprioceptive information with audition is necessary in creatures who move their heads and ears to obtain better spatial resolution!)103

102 These cells essentially act as stretch receptors, such that their firing rate is a function of the amount of change in muscle length as well as the speed of that change. In this way, they are able to encode information about the position of various body parts relative to one another.

It is important to note that the coordination described above is overly simplistic and incomplete. For example, I have said nothing about the vital need for sensory systems to be sensitive to inputs that carry efference copies of motor commands. Furthermore, even this additional motor information is not yet sufficient for localizing a stimulus in visual or auditory egocentric space; rather, one must also coordinate such information with vestibular cues about orientation and balance, which in turn are provided by neural mechanisms in the inner ear. (That is, the ordered array consisting of [retinal location + eye position + head position] will correspond to different locations in egocentric space depending on whether one is lying down or standing upright.)
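The parenthetical point about lying down versus standing upright can be shown with a deliberately reduced sketch. It assumes that all rotations occur about a single left-right axis, so that elevation angles simply add, and it treats torso pitch as the quantity supplied by vestibular cues. The function name `elevation_re_gravity` and the example angles are hypothetical.

```python
def elevation_re_gravity(retinal, eye_in_head, head_on_torso, torso_pitch):
    """Elevation of a stimulus relative to the horizon, in degrees.

    Assumes every rotation is about the same left-right axis, so the
    angles compose by addition. torso_pitch stands in for the
    vestibular signal (0 = upright, 90 = lying on one's back).
    """
    return retinal + eye_in_head + head_on_torso + torso_pitch


# One and the same ordered array of sensory and proprioceptive values:
# stimulus 5 deg above the fovea, eyes raised 10 deg, head lowered 15 deg.
sensory_array = dict(retinal=5.0, eye_in_head=10.0, head_on_torso=-15.0)

upright = elevation_re_gravity(**sensory_array, torso_pitch=0.0)
supine = elevation_re_gravity(**sensory_array, torso_pitch=90.0)

print(upright, supine)
```

With identical retinal, eye, and head values, the stimulus lies on the horizon when the body is upright and directly overhead when the body is supine, so the array alone underdetermines the location.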

Nevertheless, the general point should be clear: the representation of spatial location crucially depends on the coordination of many different kinds of sensory and motor information. The content of such representations is a product of the joint contribution of all the lower-level, modality-specific sensory and motor maps that play a role in its formation. For example, the spatial content of a sensory representation of an auditory stimulus is determined in part by the change in eye and head position that would be required to look towards it. Similarly, as Grush (2000) puts it,

“part of the content of, say, a visual stimulus is provided in part by how one would orient towards that stimulus (motor action), and how one would move one’s arm in order to bring the hand to that point, such as THE THING GRASPABLE BY REACHING THUS. Similarly, part of the content of a felt location is given by how one would visually orient that location, and how the hand would look when the eyes are trained on it.” (p.70)

There is ample empirical evidence for this claim, based on well-documented cross-modal influences on spatial perception. For example, in the well-known “ventriloquist effect”, subjects misrepresent the spatial location of an auditory stimulus because of the causal influence of a concurrent visual stimulus. Similarly, in the “rubber-hand illusion” (discussed in the next section), subjects come to misrepresent the spatial location of their limbs due to the contribution of certain visual cues.

103 The use of motor feedback is particularly important in olfaction, which is not an inherently spatial
