Teacher-student scenario - Out of equilibrium Statistical Physics of learning

The solution to the system of equations stemming from the RS saddle point produces qualitatively very similar results for both the classification case (with α < α_c) and the generalization case (with α < α_TS) [1].

In the teacher-student scenario, the relevant quantity is the generalization error, i.e. the ability to give a correct prediction on a newly presented pattern drawn from the same distribution as those in the training set. The analytical prediction for the generalization error rate depends only on the alignment of the student with the teacher:

$$p_e = \frac{1}{\pi} \arccos\left( \frac{1}{N}\, W \cdot W^T \right)$$

where $W^T$ denotes the teacher weight vector. As can be seen in figure 4.10, the generalization properties of the optimal reference solutions ˜W are generally much better than those of typical solutions.
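The relation between alignment and error rate can be checked numerically. The following is a minimal sketch, assuming binary (±1) weights and i.i.d. ±1 patterns; the size `n`, the number of flipped weights and the sample count are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

def generalization_error(student, teacher):
    """Predicted error rate p_e = arccos(W . W^T / N) / pi for two weight vectors."""
    overlap = np.dot(student, teacher) / len(student)
    return np.arccos(np.clip(overlap, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(0)
n = 501                                   # odd n: sign(W . xi) is never zero
teacher = rng.choice([-1, 1], size=n)
student = teacher.copy()
student[rng.choice(n, size=50, replace=False)] *= -1   # overlap = 1 - 100/501

p_theory = generalization_error(student, teacher)

# Empirical disagreement rate between student and teacher on fresh patterns.
patterns = rng.choice([-1, 1], size=(5000, n))
p_empirical = np.mean(np.sign(patterns @ student) != np.sign(patterns @ teacher))

print(f"predicted {p_theory:.4f}, measured {p_empirical:.4f}")
```

The two numbers should agree up to sampling noise and finite-size corrections, since the arccos formula is a large-N statement.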

Moreover, the curve for small D is found to be in striking agreement with the numerical results produced using solutions obtained from the SBPI algorithm. The generalization error decreases monotonically when D is increased, and saturates to a plateau when S(D) equals the equilibrium entropy (that of the typical solutions).

This good generalization property might be justified with a Bayesian argument: given a new pattern-output association, the optimal Bayesian prediction is obtained by averaging the outputs of all the solutions of the training problem, as in:

$$P\left(\sigma \mid \xi^{\mathrm{new}}; \{\xi^\mu, \sigma^\mu\}_{\mu=1}^{\alpha N}\right) = \int dW\, P\left(\sigma \mid W, \xi^{\mathrm{new}}\right) P\left(W \mid \{\xi^\mu, \sigma^\mu\}_{\mu=1}^{\alpha N}\right) \qquad (4.62)$$

Since a solution in the sub-dominant cluster is immersed in a dense region of solutions, its output can be seen as a local Bayesian estimator of the output of its neighboring solutions. This means that the weight of this output in the full Bayesian prediction is likely larger than that of the output of a typical isolated solution; therefore the high (exponential) density guarantees good generalization properties.
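For a very small system the Bayesian average over solutions can be carried out by exhaustive enumeration. The sketch below assumes a uniform prior over binary weights and noiseless teacher labels, so that the posterior is uniform over the solutions of the training problem and the optimal prediction reduces to a majority vote. The sizes `n` and `p` and the tie-breaking rule are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 11, 8                      # odd n avoids ties in sign(W . xi)
teacher = rng.choice([-1, 1], size=n)
patterns = rng.choice([-1, 1], size=(p, n))
labels = np.sign(patterns @ teacher)

# All 2^n binary weight vectors, one per row.
all_w = np.array(list(itertools.product([-1, 1], repeat=n)))

# Solutions: weight vectors that reproduce every training label.
is_solution = np.all(np.sign(all_w @ patterns.T) == labels, axis=1)
solutions = all_w[is_solution]

def bayes_predict(xi_new):
    """Majority vote over all solutions (uniform posterior); ties go to +1."""
    votes = np.sign(solutions @ xi_new)
    return 1 if votes.sum() >= 0 else -1

xi_new = rng.choice([-1, 1], size=n)
print(len(solutions), "solutions; Bayes prediction:", bayes_predict(xi_new),
      "teacher output:", int(np.sign(xi_new @ teacher)))
```

On the training patterns themselves every solution agrees with the label by construction, so the vote is unanimous; on new patterns the vote is dominated by wherever most solutions lie.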

The same qualitative scenario seems to hold in the case of multi-layer networks: if one considers a random walk constrained to the solutions of the training problem, the generalization properties of the starting solution (obtained with the extension of the CP+R algorithm) are found to be better than those of the neighboring solutions found in later stages of the random walk, as can be seen in figure 4.8.

Fig. 4.11 Generalization vs density in a multi-layer network (with K1 = 11, K2 = 30, r = 0). Performing a random walk over solutions to the training set, one can observe that, moving away from the starting solution, the generalization error (red, circles) increases, and the solution density (blue, squares) decreases. The same qualitative behavior is observed with all network sizes.

Entropy driven Monte Carlo

The results of the Large Deviation Analysis presented in the previous section finally put forward an explanation for the success of the heuristic solvers of the binary Perceptron. In a landscape dominated by frozen solutions, which cannot be found in sub-exponential time by local search strategies [66], the algorithms exploit the presence of a region in which the solutions accumulate and form a complex connected structure – branching with a decreasing density to the whole space of solutions – and sample solutions near the core of this cluster. The key ingredient for highlighting analytically these special structures was the introduction of a local entropy potential, that was used for enhancing the statistical weight of dense regions of solutions.

Now that we know what kind of solutions attract the efficient algorithms, we can devise solvers that are under better theoretical control and that explicitly target the dense cluster of solutions [2, 3]. Markov Chain Monte Carlo (MCMC) algorithms are often used in the context of combinatorial optimization for approximating the stationary distribution π of the studied problem. This distribution is monotonically decreasing with respect to the objective function to be minimized, and can be made more focused on the optima by tuning a properly introduced temperature, a simple procedure exploited in the Simulated Annealing algorithm [37]. Depending on the smoothness of the stationary distribution, the sampling process can rapidly converge to low-energy minima or get trapped in sub-optimal local minima of the loss function. Typically, there is a trade-off between the optimality of the sampled configurations and the form of π: at high temperature, sampling from smooth and close to uniform distributions is usually easy, but the obtained configurations are most often far from optimal. On the other hand, as the statistical weight of the minima is increased by lowering the temperature, in hard optimization problems one has to deal with the emergence of a glassy landscape, where the number of meta-stable minima that can trap the MCMC, breaking the ergodicity of the sampling process, is exponential.
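As a point of comparison for what follows, a standard energy-based annealing can be sketched for the binary Perceptron. This is a minimal illustration and not the implementation used in the text: the energy is taken to be the number of misclassified training patterns, moves are single weight flips, and the schedule parameters `beta0` and `beta_rate` as well as the teacher-generated labels are assumptions chosen only for the example.

```python
import numpy as np

def energy(w, patterns, labels):
    """Number of misclassified training patterns."""
    return int(np.sum(np.sign(patterns @ w) != labels))

def simulated_annealing(patterns, labels, rng, sweeps=200, beta0=0.5, beta_rate=1.05):
    """Metropolis sampling of exp(-beta * E) with a slowly increasing beta."""
    n = patterns.shape[1]
    w = rng.choice([-1, 1], size=n)
    e = energy(w, patterns, labels)
    beta = beta0
    for _ in range(sweeps):
        for _ in range(n):
            if e == 0:
                return w, e                 # a solution has been found
            i = rng.integers(n)
            w[i] *= -1                      # propose a single weight flip
            e_new = energy(w, patterns, labels)
            if e_new <= e or rng.random() < np.exp(-beta * (e_new - e)):
                e = e_new                   # accept the move
            else:
                w[i] *= -1                  # reject: undo the flip
        beta *= beta_rate                   # lower the temperature
    return w, e

rng = np.random.default_rng(4)
n, p = 51, 10
teacher = rng.choice([-1, 1], size=n)
patterns = rng.choice([-1, 1], size=(p, n))
labels = np.sign(patterns @ teacher)
w_sa, e_sa = simulated_annealing(patterns, labels, rng)
print("final energy:", e_sa)
```

At such a small size and load the landscape is easy; the trapping described above only shows up for large systems at higher loads.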

In section 4.5 (large deviation analysis, unconstrained case), we have seen that the dense cluster can be found even when lifting the requirement of selecting a solution as the reference configuration, since it is sufficient to look for configurations immersed in a neighborhood of zero-energy configurations. This fact suggests that it is possible to treat the local entropy as a pure objective function and to define a novel MCMC scheme, the Entropy-driven Monte Carlo (EdMC), where this new “energy” is maximized in a simple Metropolis procedure. The computation of the local entropy is obviously more involved than that of the energy, but EdMC is able to explore a smoother landscape (see figure 5.1) than the one seen by a standard Simulated Annealing (SA), which is usually hindered by the proliferation of local minima. Moreover, EdMC offers a numerical method for validating the Replica calculations on single problem instances, and can help in understanding the properties of the heuristic BP-inspired algorithms that are able to find a solution in the binary Perceptron.
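To make the idea of the local entropy as an objective function concrete, here is a toy sketch in which it is computed by brute force. It assumes a hard neighborhood (all solutions within a given Hamming distance of the reference) and a system small enough to enumerate. The function name `local_entropy`, the sizes and the teacher-generated labels are illustrative assumptions, and exhaustive counting is only a stand-in for whatever estimator is used in practice.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p = 13, 6
teacher = rng.choice([-1, 1], size=n)
patterns = rng.choice([-1, 1], size=(p, n))
labels = np.sign(patterns @ teacher)

# Enumerate all binary weight vectors and keep the zero-energy ones.
all_w = np.array(list(itertools.product([-1, 1], repeat=n)))
solutions = all_w[np.all(np.sign(all_w @ patterns.T) == labels, axis=1)]

def local_entropy(w_ref, max_flips):
    """(1/n) log of the number of solutions within Hamming distance max_flips
    of w_ref; -inf when the neighborhood contains no solution.
    The reference w_ref does not have to be a solution itself."""
    hamming = (n - solutions @ w_ref) // 2
    count = int(np.sum(hamming <= max_flips))
    return np.log(count) / n if count > 0 else -np.inf

w_random = rng.choice([-1, 1], size=n)
for d in (1, 3, 5):
    print(d, local_entropy(teacher, d), local_entropy(w_random, d))
```

Unlike the energy, this objective is informative even at configurations that violate some constraints, as long as their neighborhood contains solutions.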

5.1 Energy of the reference configuration

An important question is which radius, defining the neighborhood considered in the local entropy estimation, should be chosen in order to be confident that an algorithm like EdMC would eventually land on a solution. To address this question, we need to look at the behavior of the typical energy density of the unconstrained reference configuration ˜W, as a function of the selected distance D or, equivalently, of the typical overlap S = 1 − 2D (between the surrounding solutions and the reference).
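The equivalence between distance and overlap follows from counting: if a fraction D of the N binary weights differ, the scalar product receives N(1 − D) contributions equal to +1 and ND equal to −1. A short numerical check, with an arbitrary size and number of flips chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
w_ref = rng.choice([-1, 1], size=n)
w = w_ref.copy()
w[rng.choice(n, size=150, replace=False)] *= -1   # flip 150 of the 1000 weights

d = np.mean(w != w_ref)        # normalized Hamming distance D
s = np.dot(w, w_ref) / n       # overlap S

print(d, s, 1 - 2 * d)
```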

The energy density can be easily related to the probability of classifying incorrectly a pattern ξ⋆, drawn uniformly at random from the training set.

Fig. 5.1 Energy landscape compared to local entropy landscape in an illustrative toy example. The energy landscape (gray curve) can be very rugged, with a large number of narrow local minima. Some isolated global minima can also be observed on the right. On the left, there is a region of denser minima which coalesce into a wide global optimum. The red curves show the local entropy landscape (equation 5.7 with the opposite sign) computed at increasing values of the interaction parameter γ, i.e., at progressively finer scales. At low values of γ (dashed curve), the landscape is extremely smooth and the dense region is identifiable on a coarse-grained scale. At intermediate values of γ (dot-dashed curve) the global minimum is narrower and located in a denser region, but it does not correspond to a global energy minimum yet. At large values of γ (solid curve) finer-grain features appear as several local minima, but the global minimum is now located inside a wide global optimum of the energy. Note that in a high-dimensional space the isolated global minima can be exponentially more numerous and thus dominate the equilibrium measure, but they are “filtered out” in the local entropy description.

This quantity can be obtained by calculating:

$$P(\sigma^\star \neq 1) = \left\langle \Theta\left( -\frac{1}{\sqrt{N}} \sum_i \tilde{W}_i \xi_i^\star \right) \right\rangle_{\tilde{W}} \qquad (5.1)$$

where the average is defined over the re-weighted unconstrained measure $d\mu_W(\tilde{W}) = d\mu(\tilde{W})\, \mathcal{N}_\xi(\tilde{W}, S)^y$. This calculation can be carried out straightforwardly by exploiting the replica trick, rewriting the ensemble average as:

$$\lim_{n \to 0} \int d\mu_W(\tilde{W})\, \Theta\left( -\frac{1}{\sqrt{N}} \sum_i \tilde{W}_i \xi_i^\star \right) \left[ \int d\mu_W(\tilde{W}) \right]^{n-1} = \lim_{n \to 0} \int \prod_c d\mu_W(\tilde{W}^c)\, \Theta\left( -\frac{1}{\sqrt{N}} \sum_i \tilde{W}_i^1 \xi_i^\star \right) \qquad (5.2)$$

We have thus introduced n − 1 unconstrained replicas of the reference solution, leaving out the replica index 1 for the probing ˜W-replica, which is coupled to the pattern ξ⋆ by the constraint. In this way one can first average out the quenched disorder, and then recover the initial expression in the n → 0 limit. When one extracts the overlaps referred to the reference configurations by introducing vanishing constraints (i.e. when γ → 0), the conjugate parameters related to these overlaps tend to vanish as well. Therefore, if one organizes the calculation in the same way as the previously presented ones, the entropic terms cancel out and the only non-zero contribution to the average comes from the energetic part. The final expression, in the 1RSB Ansatz, is the following:

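$$P(\sigma^\star \neq 1) = \int Dz_0\, \frac{\int Dz_1 \left( \int Dz_2\, H(A)^y \right)^{m-1} \int Dz_2\, H(A)^y\, H(-C)}{\int Dz_1 \left( \int Dz_2\, H(A)^y \right)^{m}} \qquad (5.3)$$

with the definitions:

$$A(z_0, z_1, z_2) = \frac{z_0 \sqrt{q_0} + z_1 \sqrt{q_1 - q_0} + z_2 \sqrt{q_2 - q_1}}{\sqrt{1 - q_2}} \qquad (5.4)$$

$$C(z_0, z_1, z_2) = \frac{z_0 \frac{\tilde{S}_0}{\sqrt{q_0}} + z_1 \frac{\tilde{S}_1 - \tilde{S}_0}{\sqrt{q_1 - q_0}} + z_2 \frac{S - \tilde{S}_1}{\sqrt{q_2 - q_1}}}{\sqrt{1 - \frac{\tilde{S}_0^2}{q_0} - \frac{(\tilde{S}_1 - \tilde{S}_0)^2}{q_1 - q_0} - \frac{(S - \tilde{S}_1)^2}{q_2 - q_1}}} \qquad (5.5)$$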
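Expressions of this kind are typically evaluated by nested Gaussian quadrature. The sketch below shows the innermost integral of (5.3). It assumes the usual conventions, which are not restated in this section: Dz is the standard normal measure and H(x) = erfc(x/√2)/2 is the Gaussian tail. The overlap values and the number of quadrature nodes are arbitrary illustrative choices, not saddle-point solutions.

```python
import math
import numpy as np

def H(x):
    """Gaussian tail H(x) = int_x^inf Dz = erfc(x / sqrt(2)) / 2 (assumed convention)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def A(z0, z1, z2, q0, q1, q2):
    """Argument defined in equation (5.4)."""
    num = z0 * math.sqrt(q0) + z1 * math.sqrt(q1 - q0) + z2 * math.sqrt(q2 - q1)
    return num / math.sqrt(1.0 - q2)

# Gauss-Hermite nodes and weights for the weight function exp(-z^2 / 2);
# dividing by sqrt(2 pi) turns the weights into the normalized measure Dz.
nodes, weights = np.polynomial.hermite_e.hermegauss(60)
weights = weights / math.sqrt(2.0 * math.pi)

def inner_integral(z0, z1, q0, q1, q2, y):
    """Quadrature approximation of int Dz2 H(A(z0, z1, z2))^y."""
    return sum(w * H(A(z0, z1, z2, q0, q1, q2)) ** y
               for z2, w in zip(nodes, weights))

q0, q1, q2 = 0.2, 0.5, 0.8          # illustrative values only
val = inner_integral(0.3, -0.4, q0, q1, q2, y=1.0)
print(val)
```

For y = 1 the result can be checked against the closed form H(a/√(1 − q1)), with a = z0√q0 + z1√(q1 − q0), which follows from the convolution of two Gaussians.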

Fig. 5.2 A. Probability of a classification error by the optimal reference configuration ˜W, for various values of α, as a function of S. The dashed parts of the curves correspond to the parts with negative local entropy (cf. figure 4.6); the curves have a gap above α ≳ 0.77. B. Same as panel A, but in logarithmic scale on the y axis, which shows that all curves tend to zero errors for S → 1.

In the limit y → ∞, assuming the same scaling behaviors as in section 4.3, we get:

$$P(\sigma^\star \neq 1) = \int Dz_0\, \frac{\int Dz_1\, e^{x B(z_0, z_1)}\, H\left( -\frac{z_0 \frac{\tilde{S}_0}{\sqrt{q_0}} + z_1 \frac{\tilde{S}_1 - \tilde{S}_0}{\sqrt{q_1 - q_0}} + z_2 \frac{\delta S}{\sqrt{\delta q}}}{\sqrt{1 - \frac{\tilde{S}_0^2}{q_0} - \frac{(\tilde{S}_1 - \tilde{S}_0)^2}{q_1 - q_0}}} \right)}{\int Dz_1\, e^{x B(z_0, z_1)}} \qquad (5.6)$$

The analytic curves are plotted in figure 5.2, where we can see that, as long as the cluster exists, the probability of a classification error drops exponentially as the overlap S is increased to 1 (i.e. for small distances D → 0 and strong couplings γ → ∞).

Unfortunately, if one starts the learning procedure from a random configuration directly at a high value of γ, the information provided by the local entropy is not sufficient for reaching the dense cluster. Therefore, in the same spirit as the annealing procedure employed in SA, we can initialize γ to a small value and devise a learning scheme in which the MC is slowly led to lower-energy solutions by increasing the coupling γ, focusing the local entropy evaluation on smaller and smaller neighborhoods of the reference configuration and biasing the measure towards denser and denser regions of solutions. In the following, we call this special annealing the “scoping” procedure.
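The scoping idea can be illustrated on a toy instance. The sketch below is only a schematic stand-in for EdMC. It assumes a soft-coupling objective F(˜W, γ) = (1/N) log Σ_W exp(γ W·˜W), with the sum running over solutions found by exhaustive enumeration, a Metropolis rule at inverse temperature yN, and a geometric schedule for γ. The function names, the schedule parameters and the sizes are all hypothetical choices for the example.

```python
import itertools
import numpy as np

def build_problem(n, p, rng):
    """Toy teacher-student instance; all solutions are found by enumeration."""
    teacher = rng.choice([-1, 1], size=n)
    patterns = rng.choice([-1, 1], size=(p, n))
    labels = np.sign(patterns @ teacher)
    all_w = np.array(list(itertools.product([-1, 1], repeat=n)))
    solutions = all_w[np.all(np.sign(all_w @ patterns.T) == labels, axis=1)]
    return patterns, labels, solutions

def free_entropy(w_ref, solutions, gamma):
    """(1/n) log sum over solutions W of exp(gamma * W . w_ref)."""
    fields = gamma * (solutions @ w_ref)
    top = fields.max()                      # subtract the max for stability
    return (top + np.log(np.exp(fields - top).sum())) / len(w_ref)

def energy(w, patterns, labels):
    """Number of misclassified training patterns."""
    return int(np.sum(np.sign(patterns @ w) != labels))

def edmc_scoping(patterns, labels, solutions, rng, gamma0=0.1,
                 gamma_factor=1.3, rounds=25, steps=100, y=3.0):
    n = patterns.shape[1]
    w = rng.choice([-1, 1], size=n)         # random reference configuration
    gamma = gamma0
    for _ in range(rounds):
        f = free_entropy(w, solutions, gamma)
        for _ in range(steps):
            i = rng.integers(n)
            w[i] *= -1                      # propose a single weight flip
            f_new = free_entropy(w, solutions, gamma)
            if f_new >= f or rng.random() < np.exp(y * n * (f_new - f)):
                f = f_new                   # accept: the objective is maximized
            else:
                w[i] *= -1                  # reject: undo the flip
        if energy(w, patterns, labels) == 0:
            break                           # the reference is now a solution
        gamma *= gamma_factor               # scoping: tighten the neighborhood
    return w, gamma

rng = np.random.default_rng(7)
patterns, labels, solutions = build_problem(11, 5, rng)
w_final, gamma_final = edmc_scoping(patterns, labels, solutions, rng)
print("energy:", energy(w_final, patterns, labels), "final gamma:", gamma_final)
```

At small γ the objective is nearly flat and only weakly prefers the side of configuration space where solutions are more numerous; as γ grows, it becomes dominated by the closest solutions, which pulls the reference onto one of them.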
