
Extension to multiple classes

Assume each data point $x_i$ was labeled, the label being $y_i \in \{1, \ldots, C\}$. In other words, each $x_i$ is drawn from one of $C$ classes. The algorithm will now be extended so as to infer the joint distribution (or density) $P((x, y)|M, D)$, where $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.

Instead of $M + 1$ probabilities, there are now $(M + 1)C$. $P_m^y$ is the probability between $k_{m-1}$ and $k_m$ in class $y$, i.e. it is assumed that the bins are located at the same places across classes. This might seem like a rather arbitrary restriction. It is, however, imposed for two reasons:

1 The algorithm for iterating over the bin configurations keeps a simple form, similar to (5.16) and (5.17).

2 If different sets of $k_m$ were allowed for the classes, computing marginal distributions and entropies would become exceedingly difficult. While possible in principle, it would involve confluent hypergeometric functions and a significantly increased computational cost.

Let $n_m^y$ be the number of data points in class $y$ and bin $m$, and $\tilde{n}_m = \sum_{y=1}^{C} n_m^y$. The likelihood (5.2) now becomes

$$P(D|M,\{P_m^y, k_m\}) = \prod_{m=0}^{M} \frac{\prod_{y=1}^{C} (P_m^y)^{n_m^y}}{\Delta k_m^{\tilde{n}_m}} \tag{5.63}$$
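As an illustration, the logarithm of the multi-class likelihood can be evaluated directly from a table of counts. The following is a minimal sketch, not the implementation used in this work. It assumes that the bin width enters as $1/\Delta k_m^{\tilde{n}_m}$ (a piecewise-constant density $P_m^y/\Delta k_m$ within each bin); the function name `log_likelihood` and its array layout are hypothetical.

```python
import numpy as np

def log_likelihood(counts, probs, widths):
    """Log of the multi-class bin likelihood.

    counts[m, y] : n_m^y, number of points of class y in bin m
    probs[m, y]  : P_m^y, probability of bin m in class y
    widths[m]    : Delta k_m, width of bin m (shared by all classes)
    All names and the array layout are illustrative assumptions.
    """
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    widths = np.asarray(widths, dtype=float)
    n_tilde = counts.sum(axis=1)  # total count per bin, summed over classes
    # Empty cells contribute nothing, even if their probability is zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        prob_term = np.where(counts > 0, counts * np.log(probs), 0.0)
    return prob_term.sum() - (n_tilde * np.log(widths)).sum()
```

Working in log space avoids underflow, since the product over many data points quickly becomes too small for floating-point arithmetic.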

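Thus, following the same reasoning as above, the iteration rules now are:

$$a(0,\tilde{K}) = \frac{\prod_{y=1}^{C} n_0^y!}{(\tilde{K}+1)^{\tilde{n}_0}} \tag{5.64}$$

where $n_0^y$ is the total number of data points in class $y$ for which $k \leq \tilde{K}$, and $\tilde{n}_0 = \sum_{y=1}^{C} n_0^y$. Furthermore,

$$a(\tilde{M}+1,\tilde{K}) = \sum_{\tilde{k}=\tilde{M}}^{\tilde{K}-1} a(\tilde{M},\tilde{k}) \frac{\prod_{y=1}^{C} n_{\tilde{M}+1}^y!}{(\tilde{K}-\tilde{k})^{\tilde{n}_{\tilde{M}+1}}} \tag{5.65}$$

where $n_{\tilde{M}+1}^y$ is the total number of data points in class $y$ for which $\tilde{k} < k \leq \tilde{K}$, and $\tilde{n}_{\tilde{M}+1} = \sum_{y=1}^{C} n_{\tilde{M}+1}^y$.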

Now one can evaluate the expected joint distribution and its variance at any $(k, y)$, using (5.22) and (5.23) as before. To compute the marginal distribution, note that

$$P(k|M, D) = \sum_{y=1}^{C} P((k, y)|M, D) \tag{5.66}$$

and likewise for its square.
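Returning to the iteration rules (5.64) and (5.65), the recursion can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation used here: the data are taken to be discretized to integer positions $0, \ldots, K-1$, class labels are numbered $0, \ldots, C-1$ rather than $1, \ldots, C$, and the helper names `class_bin_counts`, `log_bin_term` and `log_a_table` are hypothetical. The terms are accumulated in log space because the factorials and powers overflow quickly.

```python
import math
import numpy as np

def class_bin_counts(k, y, K, C):
    """cum[c, j] = number of points of class c with position < j."""
    cum = np.zeros((C, K + 1), dtype=int)
    for ki, yi in zip(k, y):
        cum[yi, ki + 1:] += 1
    return cum

def log_bin_term(cum, lo, hi):
    """log( prod_y n^y! / (hi - lo)^n_tilde ) for the bin lo < k <= hi."""
    n = cum[:, hi + 1] - cum[:, lo + 1]  # per-class counts in the bin
    log_factorials = sum(math.lgamma(v + 1) for v in n)
    return log_factorials - n.sum() * math.log(hi - lo)

def log_a_table(k, y, K, C, M_max):
    """log a(M~, K~) for M~ = 0..M_max and K~ = 0..K-1 (-inf where undefined)."""
    cum = class_bin_counts(k, y, K, C)
    log_a = np.full((M_max + 1, K), -np.inf)
    # Rule (5.64): a single bin covering all positions k <= K~, width K~ + 1.
    for K_t in range(K):
        log_a[0, K_t] = log_bin_term(cum, -1, K_t)
    # Rule (5.65): append one more bin covering k~ < k <= K~.
    for M_t in range(M_max):
        for K_t in range(M_t + 1, K):
            terms = [log_a[M_t, k_t] + log_bin_term(cum, k_t, K_t)
                     for k_t in range(M_t, K_t)]
            log_a[M_t + 1, K_t] = np.logaddexp.reduce(terms)
    return log_a
```

For example, `log_a_table([0, 2], [0, 1], K=3, C=2, M_max=1)` fills a 2 x 3 table. The cumulative count table lets every bin term be evaluated in $O(C)$ time, so the whole table costs $O(M K^2 C)$ operations in this sketch.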

5.12 Computing the mutual information and an upper bound on its variance

The mutual information between class label $y$ and $x$ is given by

$$I(X;Y) = H(X) + H(Y) - H(X, Y) \tag{5.67}$$

which has to be averaged over the posterior distribution of the model parameters, $p(\{P_m^y, k_m\}|M, D)$. This can be accomplished term-by-term, as described above, yielding the exact expectation of the mutual information under the posterior. The evaluation of its variance is somewhat more difficult, due to the mixed terms $E[H(X)H(Y)]$, $E[H(X)H(X, Y)]$ and $E[H(Y)H(X, Y)]$. For the time being, an upper bound shall thus suffice. Using the inequality (for a derivation, see (D.42))

$$\mathrm{Var}\left[\sum_{i=1}^{N} X_i\right] \leq N \sum_{i=1}^{N} \mathrm{Var}[X_i] \tag{5.68}$$

with $N = 3$ yields

$$\mathrm{Var}[I(X;Y)] \leq 3\,(\mathrm{Var}[H(X)] + \mathrm{Var}[H(Y)] + \mathrm{Var}[H(X, Y)]) \tag{5.69}$$

All terms on the r.h.s. can be computed as above.
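To make the decomposition concrete, the following sketch computes the mutual information of a single, fixed joint probability table from its entropies, using the standard identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$, and evaluates the bound (5.69) from three given variances. It does not perform the posterior averaging described above; the function names and the table layout (rows index $x$, columns index $y$) are assumptions made for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information_bits(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()            # normalize, in case counts were passed
    h_x = entropy_bits(joint.sum(axis=1))  # marginal over y
    h_y = entropy_bits(joint.sum(axis=0))  # marginal over x
    h_xy = entropy_bits(joint)
    return h_x + h_y - h_xy

def mi_variance_upper_bound(var_hx, var_hy, var_hxy):
    """Right-hand side of (5.69): N = 3 terms in the sum of entropies."""
    return 3.0 * (var_hx + var_hy + var_hxy)
```

An independent table such as `[[0.25, 0.25], [0.25, 0.25]]` gives zero mutual information, while a perfectly dependent one such as `[[0.5, 0.0], [0.0, 0.5]]` gives one bit.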

[Figure 5.9 plot: x-axis, # data points per class (100 to 1e+06); y-axis, I(X;Y) [bit] (0 to 1)]

Figure 5.9: Expected mutual information (symbols) and upper bounds on the standard deviations (error bars, computed via (5.69)) for different data set sizes. Solid line of the same color as the symbol: mutual information of the generating density. Circles: I(X;Y) = 0.169 bit, squares: I(X;Y) = 0.357 bit, diamonds: I(X;Y) = 0.558 bit, stars: I(X;Y) = 0.724 bit. Dataset sizes were 10, 100, 1000, 10000, 100000 and 1000000 data points; symbols are shifted to disentangle the error bars.

Figure 5.9 shows the results of some test runs on artificial data. Points were drawn from two classes with equal probability. Within each class, a three-bin distribution was used to generate the data. The probabilities in the bins were varied to create four different values of the mutual information. Before inferring the mutual information from the data, the best discretization stepsize was determined via (5.24). Each depicted value is an average over 100 datasets. In all cases, the true value lies well within the error bars. However, especially for small sample sizes the error bars seem rather too large, an indication that the upper bound given by (5.69) needs future refinement. Nevertheless, observe that the expectation of $I(X;Y)$ is close to its true value from 100 data points per class onwards. One might argue that acquiring 100 data points per class would be difficult in most neurophysiological experiments, and thus, reliable mutual information estimates could not be obtained with the BBDIa. This problem can be remedied by changing the prior, as will be demonstrated in section 5.13 for sparse distributions. Another way towards faster convergence of the mutual information estimates is the incorporation of a bias towards certain values of the Fano factor (i.e. the variance of $X$ divided by its mean), which has been shown to assume values between 1.1 and 1.8 for neurons from many cortical areas [32]. While it would in principle be possible to do that (through a properly chosen mixture of Dirichlet priors), it is rather difficult and thus this option will not be explored here.
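For reference, the Fano factor mentioned above is straightforward to estimate from a sample of counts. The snippet below is only a sketch of that definition (variance divided by mean); the function name and the choice of the unbiased sample variance (`ddof=1`) are assumptions, and it is not part of the inference algorithm.

```python
import numpy as np

def fano_factor(counts):
    """Sample variance of the counts divided by their sample mean."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

# Three hypothetical spike counts: mean 4, sample variance 4.
print(fano_factor([2, 4, 6]))  # 1.0
```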

A closer upper bound on the variance of the mutual information seems to be given by the variance of the joint entropy:

$$\mathrm{Var}[I(X;Y)] \lesssim \mathrm{Var}[H(X, Y)] \tag{5.70}$$

as can be seen in fig. 5.10.

Each type of symbol represents the values of the empirical standard deviation (dashed connecting lines) and the expected standard deviation of the joint entropy (solid connecting lines) for a given mutual information (see legend). The bound held in all cases tried. Moreover, for large datasets, the connecting lines seem to be parallel on the logarithmic axes, which indicates that the two standard deviations may differ only by a constant factor. Note also that this behavior differs from that of the empirical standard deviations of the differential entropy (see fig. 5.7), which appear to approach the exact expected standard deviations in the limit of large datasets. However, no strict proof of these observations is available at present.

[Figure 5.10 plot: x-axis, # data points per class (1 to 1e+06); y-axis, σ_I [bit] (0.0001 to 1); legend: 0.1173 bit, 0.2472 bit, 0.3870 bit, 0.5020 bit]

Figure 5.10: Expected standard deviations of the joint entropy (solid lines) and empirical standard deviations of the mutual information (dashed lines), computed by averaging over 100 datasets. The former seems to be an upper bound on the latter. Each type of symbol represents the values for a given mutual information (see legend). Note that the connecting lines for a given mutual information appear to be parallel for large datasets, which indicates that the empirical and expected standard deviation may just differ by a factor.