5.13 Sparse Priors
5.14.3 Is the mutual information zero?
Since the mutual informations presented above are rather small, especially in the faster presentation conditions, one might wonder whether they are in fact zero. The expectation of the mutual information is ill suited to answer this question, because I(X;Y) is a continuous variable on a bounded interval (here: [0,3] bits). Therefore, unless its posterior density diverges at I(X;Y) = 0, its expectation will always be greater than 0. Conversely, if the density is finite at that point, then the probability of I(X;Y) = 0 is zero. If, on the other hand, this posterior were known, then questions of the form 'what is the probability that I(X;Y) < 0.01 bits?' could be addressed. Unfortunately, computing this posterior is very difficult, if not infeasible.
There is, however, another way to circumvent this problem. As stated in section 2.4.2, the mutual information is zero if and only if the joint density is equal to the product of the marginals. This observation allows for the construction of a Bayesian hypothesis test:
H0: The joint density is the product of the marginal densities, i.e. I(X;Y) = 0.
H1: The joint density is not the product of the marginal densities, i.e. I(X;Y) > 0.
To compute the posterior probabilities of these hypotheses, one must first assign a prior to both of them. In the following, the maximum entropy choice P(H0) = P(H1) = 0.5 will be used. This is sensible from the perspective of the hypotheses, because there are only two alternatives. However, when the possible densities included in each hypothesis are considered, one might argue that this choice is heavily biased towards H0: as noted above, there are many more densities for which I(X;Y) > 0, i.e. a prior P(H0) ≪ P(H1) could also be motivated. Thus, if the posterior probabilities favor H1 given the uniform prior, this might be taken as strong evidence that the mutual information is not zero.
Next, given a dataset, the probabilities (or densities) P(D|H0) and P(D|H1) need to be evaluated. For H1, this is given by
P(D|H_1) = \sum_{m} P(D|m) P(m)    (5.86)
where P(D|m) is eqn. (5.3) (extended to multiple classes, as explained in section 5.11), P(m) is the (uniform) prior over the number of bin boundaries, and the sum over m was chosen to run from 0 to K − 1, i.e. all possible density models are considered.
P(D|H0) can be calculated by treating the stimulus label as independent of the response, i.e.
P(D|H_0) = P_s \sum_{m} P(\tilde{D}|m) P(m)    (5.87)
where \tilde{D} is the dataset which contains only the responses, thus P(\tilde{D}|m) is eqn. (5.3) in its original form, and
P_s = \frac{\prod_{y=1}^{C} n_y!}{(N + C - 1)!} (C - 1)!    (5.88)
is the evidence of the distribution of the stimulus labels (for a derivation, see section 4.6). n_y is the number of times stimulus y was presented, C is the size of the stimulus set (here: 8) and N = \sum_{y=1}^{C} n_y. Now the posterior
P(H_0|D) = \frac{P(D|H_0) P(H_0)}{P(D|H_0) P(H_0) + P(D|H_1) P(H_1)}    (5.89)
P(H_1|D) = \frac{P(D|H_1) P(H_1)}{P(D|H_0) P(H_0) + P(D|H_1) P(H_1)}    (5.90)
can be computed.
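Evidences of this kind are typically extremely small numbers, so a numerical evaluation of eqns. (5.88) and (5.89) is best carried out in log space. The following is a minimal sketch under that assumption, not the implementation used for the results reported here. The function names are hypothetical, and the log evidences log P(D|H0) and log P(D|H1) are assumed to have been computed elsewhere.

```python
import math


def log_stimulus_evidence(counts):
    """Natural log of P_s, eqn. (5.88): prod_y n_y! * (C-1)! / (N+C-1)!.

    counts[y] is n_y, the number of presentations of stimulus y.
    lgamma(k + 1) equals log(k!), which avoids overflowing factorials.
    """
    C = len(counts)
    N = sum(counts)
    log_numerator = sum(math.lgamma(n + 1) for n in counts) + math.lgamma(C)
    return log_numerator - math.lgamma(N + C)


def posterior_h0(log_ev_h0, log_ev_h1, prior_h0=0.5):
    """P(H0|D), eqn. (5.89), from the log evidences of both hypotheses.

    prior_h0 must lie strictly between 0 and 1. The larger of the two
    log terms is subtracted before exponentiating, so that at least one
    exponential equals 1 and the denominator cannot underflow to zero.
    """
    a = log_ev_h0 + math.log(prior_h0)
    b = log_ev_h1 + math.log(1.0 - prior_h0)
    shift = max(a, b)
    return math.exp(a - shift) / (math.exp(a - shift) + math.exp(b - shift))
```

For example, `posterior_h0(-1000.0, -990.0)` evaluates the posterior for two evidences that cannot themselves be represented as doubles. P(H1|D) from eqn. (5.90) follows as one minus the returned value.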
If L datasets are available, the probability that all of them contain no mutual information between stimulus and response is
P(H_0|D_1, \ldots, D_L) = \prod_{l=1}^{L} P(H_0|D_l)    (5.91)
given that P(D_1, \ldots, D_L|H_0) = \prod_{l=1}^{L} P(D_l|H_0) (and likewise for P(D_1, \ldots, D_L|H_1)), i.e. the probability of a particular dataset given one of the hypotheses does not depend on any of the other datasets. This is a justifiable assumption if the datasets contain recordings from different cells.
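A product of many small probabilities such as eqn. (5.91) can easily fall below the smallest representable double. One way to keep the result informative is to accumulate logarithms instead of multiplying directly. The sketch below illustrates this with made-up per-dataset posteriors. The function name and the example values are assumptions for illustration only.

```python
import math


def log10_joint_h0(posteriors_h0):
    """log10 of eqn. (5.91): the sum of log10 P(H0|D_l) over all datasets.

    Every entry must be strictly positive.
    """
    return sum(math.log10(p) for p in posteriors_h0)


# Hypothetical example: 200 datasets, each with P(H0|D_l) = 0.01.
example = [1e-2] * 200
direct_product = math.prod(example)       # underflows to 0.0 in double precision
log10_product = log10_joint_h0(example)   # close to -400
```

The direct product is reported as exactly zero, whereas the log-space result retains the order of magnitude.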
SOA [ms]    P(H0|D)
222         ≈ 0
110         ≈ 0
56          ≈ 0
42          ≈ 0
28          4.0904 × 10^{-37}
14          0.99997
Table 5.1: Probabilities that the mutual information between stimulus label and neural response is zero in all available cell recordings. ≈ 0 means that the probability was close to the smallest number representable by a variable of type 'double' (ca. 10^{-300}).
Table 5.1 shows the results for all available cell recordings. For SOAs from 42 ms to 222 ms, the probabilities of H0 were found to be close to the smallest number representable by a variable of type 'double' (ca. 10^{-300}), i.e. one can be very certain that there is nonzero mutual information in (at least one of) the datasets. This also holds, to a slightly lesser degree, for SOA 28 ms: the probability is still close enough to 0 to justify the statement that the responses contain stimulus-related information at this SOA.
The situation is somewhat different at SOA 14 ms. Here, the posterior probability is strongly in favor of H0, which indicates that stimulus-related information is hard to detect in the responses. As noted above, this does not necessarily mean that
has to be very small. This is supported by the findings in [48], where a small signal was found at this SOA. Furthermore, the log evidence graph (fig. 4.9) also suggests that a very small amount of classification information is transmitted at SOA 14 ms.
5.15 Conclusion
5.15.1 Algorithm
The presented algorithm computes exact evidences and relevant expectations of probability distributions/densities with polynomial computational effort. This is a significant improvement over the naïve approach, in which the number of operations grows exponentially with the degrees of freedom. It is also able to find the optimal number of degrees of freedom necessary to explain the data, without the danger of overfitting. Furthermore, the expectations of entropies and mutual information have been shown to be close to their true values for relatively small sample sizes. In the past, a variety of methods for dealing with systematic errors due to small sample sizes have been proposed, such as stimulus shuffling in [67] or regularization of neural responses by convolving them with Gaussian kernels. What most of these approaches have in common is that the marginal P(X) and the conditional P(X|Y) distributions of the responses are estimated first, and the mutual information is then computed from these estimates via
I(X;Y) = H(X) − H(X|Y)    (5.92)
where H(X) (H(X|Y)) is computed from P(X) (P(X|Y) and P(Y), which is usually set by the experimenter) via eqn. (2.20) (eqn. (2.27)). The problem with such a procedure is that the entropy is a nonlinear function of the probabilities, hence the expectation of the entropy is not equal to the entropy of the expected probabilities. More precisely, from a Bayesian perspective, this exchange of the order of computing the expectations would only be justified if the posterior density of the distributions was concentrated at a single point. This, however, is not likely to happen with the small datasets usually available from neurophysiological experiments. Thus, while these approaches work to some degree, they lack the sound theoretical foundation of exact Bayesian treatments.
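The effect of exchanging the two expectations can be illustrated with a deliberately simple toy posterior. The sketch below assumes a posterior that places equal weight on two candidate binary distributions. This posterior is invented for illustration and is not derived from any of the datasets discussed here.

```python
import math


def entropy_bits(p):
    """Shannon entropy of a discrete distribution in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)


# Hypothetical posterior: two candidate distributions, equally probable.
candidates = [(0.9, 0.1), (0.1, 0.9)]
weights = [0.5, 0.5]

# Entropy of the expected probabilities (the plug-in order of operations).
expected_p = [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(2)]
entropy_of_expectation = entropy_bits(expected_p)

# Expectation of the entropy (the Bayesian order of operations).
expectation_of_entropy = sum(w * entropy_bits(c) for w, c in zip(weights, candidates))
```

Here the expected probabilities are (0.5, 0.5), with an entropy of 1 bit, whereas each candidate has an entropy of roughly 0.47 bits, so the expected entropy is less than half of the plug-in value. Because the entropy is concave, the plug-in value can never be smaller than the expected entropy, and the two agree only if the posterior is concentrated on distributions of equal entropy that average to themselves, e.g. a single point.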
In [74], finite size corrections are given based on the number of effective bins, i.e. the number of bins necessary to explain the data. Therein, it is also demonstrated that this leads to information estimates which converge much more rapidly to the true value than the other techniques mentioned (shuffling and convolving). However, as the authors of [74] themselves admit, their method of choosing the number of effective bins is only 'Bayesian-like'. Furthermore, the initial regularization applied to the data is debatable: the number of bins is chosen equal to the number of stimuli, and the bin boundaries are then set so that all bins contain the same number of data points, a procedure also used by [79]. (It should, however, be pointed out that this equi-probable binning procedure is not an essential ingredient for the successful application of the finite-size corrections of [74]. It has recently been demonstrated [2] that, provided a decent amount of data is available, the methods of [74] can be used to yield reasonably unbiased estimates for M = K − 1.) On the one hand, one might argue that the posterior of M is still broad when the data set is small (see fig. 5.3, top row), so choosing the wrong number of bins will do little damage. On the other hand, the bin boundaries must certainly not be chosen in such a way that all bins contain the same number of points. Doing so will destroy the structure present in the data. Consider e.g. fig. 5.3, third row: there are many more data points in the interval [0.58,0.68] than there are in [0.15,0.58], which reflects a feature of the distribution from which the data were drawn and should thus be modeled by any good density estimation technique. This will, however, not be the case if the boundaries are chosen as proposed: there would be a boundary somewhere at ≈ 0.63 instead of 0.58, and the step at 0.58 would be replaced by a considerably smaller one at ≈ 0.63, thus misrepresenting the underlying distribution. In other words, this procedure would not even converge to the correct distribution as the data set size grows larger. Consequently, mutual information estimates calculated from those estimated distributions must be interpreted with great care.
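The lack of convergence can be made concrete with a small calculation. The sketch below assumes a hypothetical piecewise-constant density on [0,1] with a single step at 0.58 and probability masses 0.3 and 0.7 on either side. These numbers are chosen for illustration and do not reproduce the distribution in fig. 5.3. In the limit of infinitely many data points, the equi-probable bin boundaries are the quantiles of this density, which the code computes exactly.

```python
def quantile(u, edges, masses):
    """Inverse CDF of a piecewise-constant density.

    edges are the segment boundaries, masses[i] is the probability
    mass of the segment between edges[i] and edges[i + 1].
    """
    cumulative = 0.0
    for lo, hi, mass in zip(edges[:-1], edges[1:], masses):
        if u <= cumulative + mass:
            # Linear interpolation inside the segment containing u.
            return lo + (u - cumulative) / mass * (hi - lo)
        cumulative += mass
    return edges[-1]


# Hypothetical true density: step at 0.58, masses 0.3 (left) and 0.7 (right).
edges = [0.0, 0.58, 1.0]
masses = [0.3, 0.7]

# Large-sample limit of equi-probable binning with two bins:
# the single inner boundary is the median of the density.
n_bins = 2
boundaries = [quantile(k / n_bins, edges, masses) for k in range(1, n_bins)]
```

With these assumptions the boundary lies at 0.70 rather than at the true step position 0.58, regardless of how much data is available. The estimated density on the left bin is 0.5/0.70 ≈ 0.71, compared with the true value 0.3/0.58 ≈ 0.52 below the step.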
The author believes that the results presented here show that those drawbacks can be overcome by a Bayesian treatment, which also shows improved performance over finite-size corrections. Thus, the algorithm should be useful in several areas of research where large datasets are hard to come by, such as neuroscience.
estimation is the Nemenman-Shafee-Bialek (NSB) method [61]. It exploits the fact that the typical distributions under the symmetric Dirichlet prior (5.71) have very similar entropies, with a variance that vanishes as K grows large. This observation is then employed to construct a prior which is (almost) uniform in the entropy. The resulting entropy estimator is demonstrated to work very well even for relatively small datasets. However, as demonstrated above, it yields (at least in the current implementation) inconsistent estimates for the mutual information.
In contrast to the NSB method, my approach deals with finite size effects by determining the model complexity (i.e. the posterior of M). It might be interesting to combine the two: since the NSB prior depends only on θ and K, the required numerical integration (eqn. (9) in [61]) could be carried out, with eqn. (10) in [61] replaced by P(D|θ) (i.e. the denominator of (5.20) for a given θ).
It was proven in [73] that uniformly (over all possible distributions) consistent entropy estimators can be constructed for distributions comprised of any number of bins M, even if M ≫ N. The results presented above (fig. 5.6) suggest that the expected entropies computed with our algorithm are asymptotically unbiased and consistent. Furthermore, the true entropy was usually found within the expected standard deviation. It remains to be determined how the algorithm performs if M ≫ N.
Since the upper bound (5.69) on the variance of the mutual information is rather large for small sample sizes, it might be interesting to invest some more work into computing the exact variance of the mutual information. This, however, turns out to be difficult.