Extension to multi-layer networks - Out of equilibrium Statistical Physics of learning

It is easy to imagine that, if the restriction to discrete synapses can turn the learning problem in the simplest neural network architecture into a hard task, finding an effective approach in multi-layer discrete networks can be a very hard problem. The main issue is that the message-passing algorithms which inspired the effective heuristics for the Perceptron suffer, in this scenario, from inherent convergence problems [29, 26]. An easy-to-spot source of these problems is the permutation symmetry: when single Perceptrons are stacked and connected to obtain a more complex architecture, if the Perceptrons in the upper layers are initialized in the same unbiased way, then during the message-passing iterations they will exchange exactly the same messages with the rest of the network and will not be able to differentiate. This kind of symmetry is disruptive for the classification performance, since the network becomes completely redundant.

A seemingly reasonable solution is to apply to these variables a small random external field, which could potentially play a symmetry-breaking role. This heuristic seems to help, but even in the case of two-layer binary networks (committee machines) the results obtained with BP are questionable at best: a growing number of distinct BP fixed points (not attributable to the permutation symmetry alone) may be found, and using the information obtained through the message-passing procedure to find single solutions, as in the R-BP algorithm, requires a very delicate fine-tuning of the reinforcement rate. In fact, the extension of BP to multi-layer networks introduces all sorts of numerical stability problems, due to the “loopy” nature of the factor graph and to the presence of long-range correlations which are neglected in the BP approximation. For example, the Gaussian approximation in equation (3.10) and, in some cases, even the finite machine precision can cause the message-passing procedure to go off the rails.
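The symmetry argument, and the effect of a small random field, can be illustrated with a toy sketch. The update rule below is a simple deterministic perceptron-like rule chosen only for illustration (it is not BP), and the sizes `N`, `K`, `P`, the number of sweeps and the field amplitude `1e-3` are arbitrary assumptions.

```python
import random

def sign(x):
    # Convention assumed here: sign(0) = +1.
    return 1 if x >= 0 else -1

def train(h, patterns, sweeps=5):
    # Toy deterministic rule (NOT BP): on a committee error, every hidden unit
    # that disagrees with the label moves its hidden state towards sigma * xi.
    for _ in range(sweeps):
        for xi, sigma in patterns:
            taus = [sign(sum(hi * x for hi, x in zip(row, xi))) for row in h]
            if sign(sum(taus)) != sigma:
                for row, t in zip(h, taus):
                    if t != sigma:
                        for i, x in enumerate(xi):
                            row[i] += sigma * x
    return h

def n_distinct(h):
    # Number of different hidden-state vectors among the units.
    return len({tuple(row) for row in h})

rng = random.Random(0)
N, K, P = 9, 3, 6  # inputs, hidden units, patterns (arbitrary toy sizes)
patterns = [([rng.choice((-1, 1)) for _ in range(N)], rng.choice((-1, 1)))
            for _ in range(P)]

# Identical unbiased initialization: all units receive identical updates.
symmetric = train([[0.0] * N for _ in range(K)], patterns)

# Small random external field on the initial hidden states.
perturbed = train([[rng.uniform(-1e-3, 1e-3) for _ in range(N)]
                   for _ in range(K)], patterns)

print(n_distinct(symmetric), n_distinct(perturbed))
```

With identical initialization the units remain exact copies of each other, so `n_distinct(symmetric)` is 1, while the perturbed units are no longer constrained to evolve identically.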

All these numerical issues motivated the search for a simplified heuristic [1], inspired by the efficient ones described above. It is indeed possible to heuristically extend the CP+R algorithm to the case of a multi-layer classifier, obtained by stacking two layers of fully-connected committee machines, with $L$ possible output labels. Because of a symmetry associated to any simultaneous change in the sign of a synapse in the top layers and in all the synapses directly below it, it is sufficient to learn only the synapses in the first layer.

Fig. 3.3 Multi-layer architecture considered in the extension of the CP+R algorithm. The neural network can be seen as a “committee of committees”, with an argmax at the end, in order to allow for multi-label classification.

More specifically, the architecture (in figure 3.3) consists of an array of $K_2$ committee machines, each comprising $K_1$ hidden units. The $K_2$ outputs are sent to $L$ summation nodes (each one specifically associated to a possible label), and the maximum one is chosen as the predicted output of the network. The non-linear function represented by the network can be written as:

$$\phi(\xi) = \operatorname*{argmax}_{l \in \{1,\dots,L\}} \left( \sum_{k_2=1}^{K_2} Y_{k_2 l} \, \operatorname{sign}\left( \sum_{k_1=1}^{K_1} \tau\left(W^{k_1 k_2}, \xi\right) \right) \right) \qquad (3.15)$$

where $Y_{k_2 l} \in \{-1, 1\}$ are random quenched binary weights, defining mutually perpendicular directions associated a priori to the labels, and $W^{k_1 k_2} \in \{-1, 1\}^N$ are the synaptic weights learned by the algorithm.
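The forward pass in equation (3.15) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: $\tau(W, \xi)$ is taken to be the sign of the overlap $\sum_i W_i \xi_i$, `sign(0)` is set to $+1$, labels are indexed from 0, ties in the argmax go to the lowest index, and all function names are hypothetical.

```python
def sign(x):
    # Convention assumed here: sign(0) = +1, so outputs are always in {-1, +1}.
    return 1 if x >= 0 else -1

def tau(w, xi):
    # Output of one hidden unit, assumed to be the sign of the overlap w . xi.
    return sign(sum(wi * x for wi, x in zip(w, xi)))

def committee_outputs(W, xi):
    # W[k2][k1] is the binary weight vector of hidden unit k1 in committee k2.
    return [sign(sum(tau(w, xi) for w in committee)) for committee in W]

def label_scores(W, Y, xi):
    # Y[k2][l] in {-1, +1}: fixed random weight from committee k2 to label l.
    c = committee_outputs(W, xi)
    return [sum(Y[k2][l] * c[k2] for k2 in range(len(W)))
            for l in range(len(Y[0]))]

def predict(W, Y, xi):
    # Argmax over the L summation nodes (first index wins on ties).
    scores = label_scores(W, Y, xi)
    return max(range(len(scores)), key=scores.__getitem__)

# Tiny example with K2 = 3 committees, K1 = 3 hidden units, N = 3 inputs, L = 2.
W = [
    [[1, 1, 1], [1, -1, 1], [1, 1, 1]],
    [[-1, -1, -1], [-1, -1, -1], [-1, -1, 1]],
    [[1, 1, 1], [1, -1, 1], [-1, -1, -1]],
]
Y = [[1, -1], [-1, 1], [1, -1]]
xi = [1, 1, 1]
print(committee_outputs(W, xi), label_scores(W, Y, xi), predict(W, Y, xi))
# [1, -1, 1] [3, -3] 0
```

Only `W` would be modified by learning; `Y` stays fixed, in line with the remark that it is sufficient to learn the first-layer synapses.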

The unsupervised reinforcement term, characteristic of CP+R, can be left unaltered from the single-layer version of the algorithm. Instead, it is necessary to design a scheme for back-propagating the observed errors ($\phi(\xi) \neq \sigma$) down to the synapses $W^{k_1 k_2} \in \{-1, 1\}^N$. The main idea is that allowing too many changes of the synapses in the first layer can easily destabilize the learning procedure. A possible cure for this problem is the following: first, one finds all the committee machines which contributed to the error, i.e. all those for which $\operatorname{sign}\left( \sum_{k_1=1}^{K_1} \tau\left(W^{k_1 k_2}, \xi\right) \right) \neq Y_{k_2 \sigma}$. Then, for each of these, the signal is further propagated only in the branch corresponding to the hidden unit whose mistake is easiest to fix, i.e. the one for which $Y_{k_2 \sigma} \sum_{i=1}^{N} W_i^{k_1 k_2} \xi_i$ is least negative. In these branches, the update of the hidden states associated to the synapses simply follows the standard CP+R rule. The generalization performance can be greatly improved if a “robustness” requirement is added, such that an error signal is also emitted when $\phi(\xi) = \sigma$ but the gap between the two largest outputs of the $L$ summation nodes is smaller than some threshold $r$.
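The error-routing scheme just described can be sketched as follows. This is one possible reading, not a definitive implementation: the function names are hypothetical, `sign(0)` is taken to be $+1$, labels are indexed from 0, a unit counts as mistaken when its output differs from $Y_{k_2 \sigma}$, and the CP+R update of the selected units is deliberately left out.

```python
def sign(x):
    # Convention assumed here: sign(0) = +1.
    return 1 if x >= 0 else -1

def overlap(w, xi):
    return sum(wi * x for wi, x in zip(w, xi))

def needs_update(scores, sigma, r=0):
    # Error signal: wrong prediction, or correct prediction whose margin over
    # the runner-up label is smaller than the robustness threshold r.
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    gap = scores[ranked[0]] - scores[ranked[1]]
    return ranked[0] != sigma or gap < r

def units_to_update(W, Y, xi, sigma):
    # For each committee k2 whose output differs from Y[k2][sigma], select the
    # mistaken hidden unit k1 whose field Y[k2][sigma] * (w . xi) is least
    # negative, i.e. the unit whose mistake is easiest to fix.
    selected = []
    for k2, committee in enumerate(W):
        target = Y[k2][sigma]
        overlaps = [overlap(w, xi) for w in committee]
        if sign(sum(sign(o) for o in overlaps)) == target:
            continue  # this committee did not contribute to the error
        wrong = [k1 for k1, o in enumerate(overlaps) if sign(o) != target]
        best = max(wrong, key=lambda k1: target * overlaps[k1])
        selected.append((k2, best))
    return selected

# Same toy sizes as a K2 = 3, K1 = 3, N = 3, L = 2 network.
W = [
    [[1, 1, 1], [1, -1, 1], [1, 1, 1]],
    [[-1, -1, -1], [-1, -1, -1], [-1, -1, 1]],
    [[1, 1, 1], [1, -1, 1], [-1, -1, -1]],
]
Y = [[1, -1], [-1, 1], [1, -1]]
xi = [1, 1, 1]
scores = [3, -3]  # outputs of the L summation nodes for this input
sigma = 1         # true label, while the network predicts label 0
if needs_update(scores, sigma, r=0):
    print(units_to_update(W, Y, xi, sigma))  # [(0, 1), (1, 2), (2, 1)]
```

Selecting a single unit per wrong committee keeps the number of first-layer changes small, which is the stated purpose of the scheme; with `r = 0` the robustness condition never fires and only genuine errors trigger an update.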

This extension allows us to test the algorithm on real-world data, for example on the MNIST database benchmark [71], which consists of $7 \cdot 10^4$ gray-scale images of hand-written digits ($L = 10$). The last $10^4$ images are reserved for assessing the generalization performance of the learned network. We observed that it is very easy to learn the training set perfectly, and that very good generalization errors can be reached despite the binary nature of the synapses, without any specialization of the architecture for this particular dataset. Moreover, the algorithm seems to completely avoid over-fitting, even when the considered networks are very large. The smallest network found to achieve zero training error had $K_1 = 11$ and $K_2 = 30$, with $r = 0$, reaching a generalization error around 2.4%. Very large networks can achieve much better generalization error rates, e.g. 1.25% with $K_1 = 81$, $K_2 = 200$, $r = 120$.
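To give a sense of scale, the number of learned binary synapses in the two networks quoted above can be estimated with a short calculation. The input size is an assumption: each hidden unit is taken to see all $28 \times 28 = 784$ pixels of an MNIST image, which the text does not state explicitly.

```python
# Assumption: every hidden unit is connected to all N = 28 * 28 pixels.
N = 28 * 28

def n_binary_synapses(K1, K2, n_inputs=N):
    # One binary weight vector of length n_inputs per hidden unit.
    return K1 * K2 * n_inputs

n_train = 7 * 10**4 - 10**4            # images left after holding out the last 10^4
small = n_binary_synapses(11, 30)      # smallest network with zero training error
large = n_binary_synapses(81, 200)     # large network reaching 1.25% error
print(n_train, small, large)           # 60000 258720 12700800
```

Under this assumption the larger network has roughly 200 binary synapses per training image, which puts the reported absence of over-fitting in context.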
