The network described above performs feature extraction on a scale defined by the dimensionality of the basis vectors (e.g. the size of the image patches). Since the learned RFs appear to model V1 RFs more closely with increasing overcompleteness, it would be interesting to investigate whether higher M:N ratios than the 4:1 studied here lead to even better agreement with experimental data, especially given the 25:1 ratios inferred from cat brain data [62]. However, this would only be feasible with significantly faster computational resources than those currently at my disposal (on a 3 GHz Pentium 4 system, the 4x overcomplete model took ≈4 days to learn).
In contrast to the network studied here, the visual cortex consists of more than one layer of neurons. It has been observed in [66] that stacking two OF networks with the same number of basis vectors on top of each other results in the top one learning an identity mapping between inputs and outputs. Something similar could be expected for my network. To integrate features on a larger scale, the top network would at least have to combine the outputs from several image patches of the lower network. More importantly, the question of how the outputs are combined needs to be addressed. In [41], it was demonstrated that a square-and-add nonlinearity, inspired by a model of V1 complex cells proposed in [39], leads to top layer units that show some phase and position invariance in their RFs. Alternatively, one could also constrain the outputs of the bottom layer to positive values only, which is neurophysiologically plausible, since real neurons cannot produce negative signals. In that case, one might expect the bottom layer to develop antagonistic pairs of feature detectors. The top layer could then simply marginalize over suitably chosen groups of bottom-layer outputs to generate responses that are invariant under certain transformations. Another interesting way of incorporating architectural constraints in a multilayer network model of the early visual system was explored in [90]: the optic nerve of many mammals has significantly fewer fibers than V1 has cells. Combining this observation with synaptic and firing rate energy constraints was shown to give rise to response characteristics of the simulated retina and V1 units which resemble those found in neurophysiological data. However, getting hierarchical networks to learn is difficult, if not generally infeasible. Thus, one will have to resort to approximations, e.g. the Helmholtz machine [18].
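The phase invariance attributed to the square-and-add nonlinearity can be illustrated with a minimal sketch. It assumes an idealized pair of bottom-layer units whose responses to a grating are in quadrature (cosine and sine of the stimulus phase). This pair, and the names `square_and_add` and `quadrature_pair`, are illustrative assumptions and not the learned units of [41].

```python
import math

def square_and_add(outputs):
    """Sum of squared bottom-layer outputs (square-and-add pooling)."""
    return sum(o * o for o in outputs)

def quadrature_pair(phase, amplitude=1.0):
    """Hypothetical bottom-layer pair: responses to a grating of the
    given phase, 90 degrees apart (an idealization, not learned RFs)."""
    return (amplitude * math.cos(phase), amplitude * math.sin(phase))

# The individual outputs change with the stimulus phase ...
phases = [k * math.pi / 4 for k in range(8)]
pairs = [quadrature_pair(p) for p in phases]

# ... but the pooled top-layer response stays constant.
pooled = [square_and_add(pair) for pair in pairs]
```

Under this idealization each bottom-layer output varies with phase while the pooled value does not, which is the sense in which the top-layer unit is phase invariant.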
Given that asymmetric noise is a sparseness-promoting factor, it would be interesting to investigate whether this asymmetry is present in biological neurons and how much sparseness one would expect from it. There is reason to believe that this might be so: assume that a neuron's firing behavior can to some degree be described as a Poisson process (i.e. the probability of spike generation is constant per unit time, with the probability being a monotonically increasing function of the neuron's activation), and that another neuron receiving these spikes merely counts how many of them arrive in a given time window. Then, no matter how high the firing probability p_fire, there is always a chance that the receiver gets no input. On the other hand, if p_fire = 0, then with certainty no spike will be generated. In other words, the situation in real brains might resemble that of the limit studied in section 3.6.
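The asymmetry in this argument can be made concrete with a small numerical sketch. It assumes that time is discretized into `n_steps` independent steps, each producing a spike with probability `p_fire` (a Bernoulli approximation of the Poisson process). The discretization, the function name and the example values are assumptions made for illustration only.

```python
def prob_no_spike(p_fire: float, n_steps: int) -> float:
    """Probability that the receiver counts zero spikes in a window of
    n_steps independent time steps, each firing with probability p_fire."""
    if not 0.0 <= p_fire <= 1.0:
        raise ValueError("p_fire must lie in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return (1.0 - p_fire) ** n_steps

# For any p_fire < 1 an empty window remains possible ...
chance_of_silence = prob_no_spike(0.9, 20)

# ... whereas p_fire = 0 yields an empty window with certainty.
certain_silence = prob_no_spike(0.0, 20)
```

Silence at the receiver is therefore compatible with any firing probability below one, while a spike count above zero rules out p_fire = 0. This is the asymmetry referred to above.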
Chapter 4
Information extraction from neural spike trains I: Bayesian Bin Classification
4.1 Introduction
In the following two chapters, two Bayesian methods will be developed for information extraction from small and/or noisy datasets. As an example application, they will be employed to extract features from neural spike trains and to quantify the information contained therein. The experimental setup is schematically depicted in fig. 4.1; for a detailed description see [47, 48]. A monkey is presented with a visual stimulus, labelled y, which evokes a neural response that is recorded in the form of a spike train, i.e. a temporally ordered sequence of time indices. Each time index marks the occurrence of a spike. This spike train is subjected to a function f() which condenses it into a quantity x that contains as much information as possible about y. The quantity x will be used in three ways:
1. The Bayesian Bin Classification algorithm (BBCa), which is the subject of this chapter, can be used to compute P(y|x), i.e. the most probable y given x can be determined. This kind of classification task must also be performed in some way by the brain, when visual object recognition is carried out.
2. If y is known, the function f() can be inferred. More precisely, assume that f() depends on a parameter vector ~θ which determines the shape of f(). For example, if f() counted the number of spikes in a temporal window, ~θ would contain the start and end positions of this window. Given experimental data, the BBCa can then be used to compute the posterior distribution of ~θ. It would then be possible to pick the best ~θ in a maximum-a-posteriori sense. However, one can also evaluate various expectations under the posterior, such as means and variances of the components of ~θ. This will be done so as to provide not only the expected f(), but also a measure of its reliability. Knowing the posterior of ~θ (and thus, of f()) enables one to draw conclusions about how the brain encodes stimulus-related information. The necessary 'feedback signal' (see fig. 4.1) is provided by the BBCa as well.
3. Given y and x from the inferred f(), the mutual information I(x;y) can be inferred, i.e. the amount of information a neuron transmits about a stimulus. This will be done via the Bayesian Bin Distribution Inference algorithm (BBDIa), the subject of the next chapter.
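The spike-counting example for f() mentioned in item 2 can be sketched as follows. The sketch assumes that the spike train is a sorted list of spike times and that the window is half-open, [start, end). The function name `spike_count`, the window convention and the example times are hypothetical choices, not taken from the experiments in [47, 48].

```python
from bisect import bisect_left

def spike_count(spike_times, theta):
    """Example f(): number of spikes with start <= t < end.

    spike_times: temporally ordered spike times (sorted ascending).
    theta: (start, end) of the counting window, the parameters of f().
    """
    start, end = theta
    if end < start:
        raise ValueError("window end must not precede window start")
    # Both bisections run on the sorted list, so the count is O(log n).
    return bisect_left(spike_times, end) - bisect_left(spike_times, start)

# Hypothetical spike train (seconds after stimulus onset).
spike_train = [0.012, 0.047, 0.051, 0.120, 0.300]
x = spike_count(spike_train, (0.040, 0.130))
```

Different choices of theta yield different values of x for the same spike train, which is why the window parameters are themselves the object of inference.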
One might wonder why two separate algorithms are necessary. Wouldn't it be sufficient to infer the mutual information for a given f() and then search for the function that maximizes I(x;y)? The answer is no. As explained in section 2.3.1, only the formalism of probability theory (or any formalism isomorphic to it) is suitable for conducting (Bayesian) inference. Since mutual information is not a probability, it is ruled out for the purpose of inferring the posterior distribution of f(). To do that, the possible choices for f() need to be weighted by a probability. Moreover, this probability needs to be a measure of classification performance if we are to learn which f()s are suited for carrying out the classification task and which ones are not. The most natural choice is thus P(y|x), which will be high if, for given x and y, correct classification can be done with some certainty.
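To illustrate why P(y|x) can serve as a measure of classification performance, the following sketch computes it for a single observed x from a known joint probability table and picks the most probable stimulus. The table values and function names are hypothetical, and the sketch does not implement the BBCa, which has to infer such quantities from data.

```python
def posterior_over_stimuli(joint_row):
    """joint_row[j] = P(x, y=j) for one fixed x; returns P(y=j | x)."""
    norm = sum(joint_row)  # this is P(x)
    if norm <= 0.0:
        raise ValueError("P(x) must be positive")
    return [p / norm for p in joint_row]

def most_probable_stimulus(joint_row):
    """Index of the most probable stimulus and its posterior probability."""
    post = posterior_over_stimuli(joint_row)
    best = max(range(len(post)), key=post.__getitem__)
    return best, post[best]

# Hypothetical x produced by a well-chosen f(): it points clearly to y=0.
informative = most_probable_stimulus([0.45, 0.05])

# Hypothetical x produced by a poorly chosen f(): both stimuli equally likely.
uninformative = most_probable_stimulus([0.25, 0.25])
```

In the first case the posterior probability of the chosen stimulus is high, in the second it is at chance level, which is the sense in which P(y|x) weights the candidate f()s.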
It might be argued that, since P(y) is determined by the experimental setup, one could also try to infer p(x|y) instead and convert it via Bayes' rule into P(y|x). However, this approach is likely to meet with computational difficulties: p(x|y) will usually be parameterized in some fashion. After the posterior distributions of these parameters have been inferred from the data, the marginalizations necessary for the determination of the posterior distribution of f() are, in most cases, going to be
[Figure 4.1 diagram. Labels: stimulus y; spike train; x = f(spike train); BBCa; inference; stimulus classification; P(y|x); BBDIa, mutual information I(x;y)]
Figure 4.1: Schematic representation of the RSVP (rapid serial visual presentation) experiment and its evaluation. A monkey is presented with a visual stimulus y which evokes a neural response that is recorded in the form of a spike train. This spike train is subjected to a function f() which condenses it into a quantity x that contains as much information as possible about y. The BBCa then allows for the computation of P(y|x) and thus for the determination of the most probable y. Conversely, if y is known, f() can be inferred, and subsequently the mutual information I(x;y) as well.
evaluation of the posterior distribution of f(), and BBDIa, which avails one of an exact Bayesian estimate of the mutual information.
Nevertheless, mutual information is an information-theoretic measure of average classification performance (see section 2.5.3). Thus, one would expect that probabilistic classification measures should be closely related to it. That this is indeed so will be demonstrated in the next chapter.
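For reference, the standard definition of mutual information, I(x;y) = Σ P(x,y) log2[P(x,y) / (P(x)P(y))], can be evaluated directly when the joint distribution is known. The sketch below does only that. It is not the BBDIa, which estimates I(x;y) from limited data, and the function name and example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """Mutual information in bits for a joint table joint[i][j] = P(x=i, y=j)."""
    total = sum(sum(row) for row in joint)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("joint probabilities must sum to 1")
    px = [sum(row) for row in joint]        # marginal P(x)
    py = [sum(col) for col in zip(*joint)]  # marginal P(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # terms with P(x,y) = 0 contribute nothing
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

# x independent of y: no information about the stimulus.
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])

# x determines y exactly: one full bit for two equiprobable stimuli.
mi_perfect = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

The two extreme cases correspond to chance-level and perfect classification, respectively, in line with the reading of mutual information as average classification performance.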