CHAPTER 4 DEEP MODELS MADE INTERPRETABLE: A COG-
4.2 Brain Encoding using Deep Models: An Exploratory Study
4.2.1 Background and Motivation
In this part, we explore the applicability of deep networks to the novel application domain of brain encoding [83]. Brain encoding refers to the challenging task of predicting brain activity, e.g., blood oxygenation level dependent (BOLD) responses, from stimuli. Recently, [84] developed a two-stage cascaded second-order contrast (SOC) model that accepts a grayscale image as input and predicts BOLD responses in early visual cortex as output. The SOC model has a cascade architecture consisting of two stages of linear and nonlinear operations. The first stage involves well-established computations (local oriented filters and divisive normalization), whereas the second stage involves novel computations: compressive spatial summation (a form of normalization) and a variance-like nonlinearity that generates selectivity for second-order contrast. The SOC model has only eight controlling parameters; it relies heavily on specific nonlinear computations that are distilled from neuroscience expertise. The philosophy behind SOC is that because data are limited, parameters are precious, and very specific computations have to be built into the model.
In this section, we try to apply more heavily parameterized deep models. Such models could be more flexible, by allowing the data to inform the model as to what types of computations are necessary. One major obstacle is that training data are extremely limited. It is also infeasible to artificially increase the data volume, for example by generating “synthetic” data or performing data augmentation. Therefore, the classical “data-driven” setting for deep learning, as well as many popular training techniques, does not directly apply here.
4.2.2 Dataset
We refer to Kendrick Kay et al.’s publicly available datasets of BOLD responses in visual cortex4, measured by functional magnetic resonance imaging (fMRI) in human subjects. Specifically, we adopt their stimulus set 2, stimulus set 3, (response) dataset 4, and (response) dataset 5.
All stimuli are band-pass filtered grayscale images. Following [84], we resize them to 150×150 pixels. Stimulus sets 2 and 3 consist of 156 and 35 distinct stimuli, respectively. Responses are recorded at a total of 200 voxels. At each voxel, one scalar fMRI measurement is recorded for each input stimulus. Note that a separate brain encoding model must be trained for each voxel. Dataset 4 consists of one person’s responses to stimulus set 2, while dataset 5 contains the same person’s responses to stimulus set 3. Details about the datasets can be found in [84].
Our goal is to train a regression-type model using stimulus set 2 and dataset 4 (as the training set). The model is then used to generate predictions and is evaluated on stimulus set 3 and dataset 5 (as the testing set). With only 156 training stimuli per voxel, this is clearly an ill-conditioned “small data” problem.
4.2.3 Model and Experiment
To design a well-adapted architecture that learns from very limited data, it is necessary to incorporate task-specific domain expertise. Since the input consists of visual stimuli (structured natural images), it is natural to choose fully convolutional networks [82] as the computational model, and to solve brain encoding as a regression problem. Fully convolutional networks also require fewer parameters than typical deep models with fully-connected layers. For brain encoding, the SOC model [84] further suggests that encoding in the human brain goes through some type of spatial summation, followed by a compressive power-law function (compressive nonlinearity). That motivates us to try the sigmoid neuron and average pooling, and to compare their effects with those of the popular ReLU neuron and max pooling in this specific scenario.
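To make the notion of a compressive nonlinearity concrete, the following minimal sketch compares how three units respond when their input is doubled. The exponent 0.5, the test input, and all function names are illustrative assumptions; they are not values taken from [84] or from the models studied here.

```python
import math

def power_law(x, n=0.5):
    """Compressive power law x**n for x >= 0; n = 0.5 is an assumed exponent."""
    return x ** n

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def gain_when_doubled(f, x):
    """Ratio f(2x) / f(x); a value below 2 means the response grows sublinearly."""
    return f(2.0 * x) / f(x)

for name, f in [("power law", power_law), ("sigmoid", sigmoid), ("ReLU", relu)]:
    print(name, round(gain_when_doubled(f, 2.0), 3))
```

For positive inputs, the power law and the sigmoid both return a ratio below 2 (they compress), whereas ReLU is linear and returns exactly 2.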
We construct five different deep models for comparison:
• Model I: 1-hidden-layer fully convolutional network, with one convolutional layer configured by: channel number = 16, filter size = 3, stride = 2, zero padding = 2, followed by a global average pooling operator.5 The output is then passed through an average pooling operator, followed by a mean squared error (MSE) loss.
• Model II: 2-hidden-layer fully convolutional network, obtained by adding a second convolutional layer with channel number = 8, filter size = 3, stride = 1, zero padding = 1. All other configurations are identical to Model I.
• Model III: 3-hidden-layer fully convolutional network, obtained by adding a third convolutional layer whose configuration is the same as that of the second one. All other configurations are identical to Model II.
• Model IV: replacing the sigmoid neurons in Model II with ReLU neurons, while leaving everything else unchanged.
• Model V: replacing the last average pooling operator in Model II with max pooling, while leaving everything else unchanged.

Table 4.4: The averaged R^2 performance comparison of different models for brain encoding.
Model   I        II       III      IV       V        SOC
R^2     82.6027  88.0655  87.4333  87.7403  88.0646  87.7628
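The layer configurations listed for Models I to III can be checked with a little arithmetic. The sketch below computes the spatial size of each feature map and the cumulative number of convolutional weights. It assumes a single-channel 150×150 input, the usual floor convention for strided convolution, and one bias per output channel; the bias terms and the helper names are assumptions rather than details stated in the text.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(in_ch, out_ch, kernel, bias=True):
    """Weights of a square-kernel conv layer; the bias terms are an assumption."""
    return out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)

# (in_channels, out_channels, kernel, stride, padding)
LAYERS = [
    (1, 16, 3, 2, 2),   # first layer (Model I)
    (16, 8, 3, 1, 1),   # second layer (added in Model II)
    (8, 8, 3, 1, 1),    # third layer (added in Model III)
]

size, total = 150, 0
for depth, (cin, cout, k, s, p) in enumerate(LAYERS, start=1):
    size = conv_out_size(size, k, s, p)
    total += conv_params(cin, cout, k)
    print(f"after layer {depth}: {size}x{size}x{cout}, cumulative parameters = {total}")
```

Under these assumptions even the three-layer model has fewer than two thousand convolutional parameters. By comparison, a single fully-connected layer mapping the 22,500 input pixels to only 16 units would already need 360,016.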
We initialize the first layer using PCA, and the remaining layers (if any) using layer-wise pre-training [10], to ensure that the models converge properly. A learning rate of 0.01 and a momentum of 0.9 are applied. The model accuracy is quantified as the percentage of variance explained (R^2) in the measured response amplitudes by the cross-validated predictions of the response amplitudes (see p. 14 of [84] for definitions). R^2 ranges between 0 and 100: the higher R^2 is, the more accurate the model is.
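As a rough illustration of the metric, the sketch below computes a percentage of variance explained for one voxel. The exact definition is given in [84] and should be taken from there; in particular, whether the total variance is measured relative to zero or relative to the mean is left here as a flag, and the response values are hypothetical.

```python
def percent_variance_explained(measured, predicted, subtract_mean=False):
    """R^2 in percent: 100 * (1 - SS_residual / SS_total).

    subtract_mean=False measures SS_total relative to zero;
    subtract_mean=True uses the conventional mean-centred total.
    Which convention [84] uses should be checked against its definitions.
    """
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    ref = sum(measured) / len(measured) if subtract_mean else 0.0
    ss_res = sum((d - m) ** 2 for d, m in zip(measured, predicted))
    ss_tot = sum((d - ref) ** 2 for d in measured)
    return 100.0 * (1.0 - ss_res / ss_tot)

# Hypothetical response amplitudes for one voxel (not data from the study).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(percent_variance_explained(measured, predicted))
print(percent_variance_explained(measured, measured))
```

The figure reported for each model would then be the average of this per-voxel quantity over all 200 voxels.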
The preliminary experiment yields encouraging results, as shown in Table 4.4. In terms of averaged R^2 performance across 200 voxels, the SOC baseline reaches 87.7028, which is very competitive and shows neuroscience expertise to be powerful. By adding one more layer, Model II gains a sharp advantage of about 5.46 percentage points over Model I, and also slightly outperforms the SOC model. However, adding further layers hardly brings any additional benefit, as evidenced by the slight performance drop from Model II to Model III. That reflects the potential overfitting problem due to the limited training data, and calls for the collection of larger datasets.

5 The global average pooling operator is inserted for fusing all channels into one feature map. It should be distinguished from the following pooling operator, which performs feature selection over the feature map.
A comparison between Models II and IV suggests that sigmoid neurons potentially suit the brain encoding scenario better than the popular choice of ReLU neurons. We may view it as a success of learning from neuroscience; however, sigmoid neurons are more likely to suffer from computational difficulties (“saturation”) when the networks grow deeper. That could be partially alleviated by pre-training.
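The saturation concern can be illustrated numerically. The sketch below evaluates the derivative of the sigmoid and of ReLU at a few inputs, then prints an upper bound on the factor that a stack of sigmoid activations alone contributes to a back-propagated gradient. The sample inputs and depths are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}  sigmoid' = {sigmoid_grad(x):.6f}  ReLU' = {relu_grad(x):.1f}")

# Upper bound on the gradient factor contributed by the activations alone
# when back-propagating through `depth` sigmoid layers.
for depth in [1, 2, 3, 10]:
    print(depth, 0.25 ** depth)
```

Because the sigmoid derivative never exceeds 0.25 and decays quickly for large inputs, gradients shrink as sigmoid layers are stacked. For the shallow models considered here (at most three layers) the effect is still mild.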
On the other hand, choosing either the max or the average pooling operator does not affect the performance much in this experiment. Previous literature [126] suggests that sensory processing in the brain follows a sparse coding strategy over a highly over-complete basis when finding stimuli that effectively activate the neurons. Therefore, we conjecture that the max pooling operator, which introduces sparsity, brings performance benefits too, and may also have neuroscience grounds in the brain encoding process.
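The difference between the two pooling operators can be seen on a toy example. The sketch below applies non-overlapping 2×2 max and average pooling to a small feature map; the feature map values, the window size, and the helper names are hypothetical and are not taken from the experiments.

```python
def pool2d(feature_map, size, reduce):
    """Non-overlapping size x size pooling over a 2-D list.

    `reduce` maps the list of values in one window to a scalar.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled

def mean(values):
    return sum(values) / len(values)

# Hypothetical 4x4 feature map with one strong, localized activation.
fmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
print(pool2d(fmap, 2, max))
print(pool2d(fmap, 2, mean))
```

Max pooling passes on only the strongest response in each window, while average pooling spreads an isolated peak over the window; on a uniform window the two coincide.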
4.2.4 Remarks and Discussions
The above experiments, although restricted by the availability of training data, have shown the promise of task-specific deep architectures designed from neuroscience expertise. While preliminary conclusions were drawn, several open questions remain for future investigation, to list a few:
• If more training data can be collected, will a deeper architecture be useful? Does it really cost thousands of layers to model a brain network?
• Will sigmoid always outperform ReLU? If so, how can the computational difficulties of the former, i.e., the saturation phenomenon of the compressive nonlinearity, be resolved?
• Is sparsity also an indispensable part of brain encoding? Will a mixed use of average pooling and max pooling boost the model performance further?
• Is there really a clear mapping between the neural network layers and the brain hierarchy? In other words, to what extent could neuroscience expertise guide the deep architecture design?