Understanding saddle points

In document On Recurrent and Deep Neural Networks (Page 132-137)

4.4 Saddle-points

4.4.1 Understanding saddle points

Saddle points are critical points of a function L which are neither minima nor maxima. See, for example, the few illustrations provided in Figure 4.4. Note that the Hessian is a symmetric matrix, and therefore its eigenvalues are real numbers.

Figure 4.4: Illustrations of three different types of saddle points (a-c) plus a gutter structure (d): (a) saddle point in 1-D given by $x^3$; (b) classical saddle point in 2-D given by $x^2 - y^2$; (c) monkey saddle given by $x^3 - 3xy^2$; (d) gutter structure given by $(x^2 + y^2 - 1)^2$. Note that for the gutter structure, any point on the circle $x^2 + y^2 = 1$ is a minimum. The function is shaped like the bottom of a wine bottle, which means that the minimum is now a ring instead of a single point. The Hessian is singular at any of these points.

The critical points can be categorized based on the signs of these eigenvalues. Specifically, we know that:

1. If all eigenvalues of the Hessian are non-zero and positive, then the critical point is a local minimum. The Hessian is positive definite.

2. If all eigenvalues are non-zero and negative, then the critical point is a local maximum. The Hessian is negative definite.

3. If the eigenvalues are non-zero and we have both positive and negative eigenvalues, then the critical point is a saddle point with a min-max structure (see Figure 4.4(b)). That is, if we restrict the function L to the subspace given by the eigenvectors corresponding to positive eigenvalues, then the saddle point is a minimum of this restriction. On the other hand, if we restrict L to the subspace of the eigenvectors corresponding to negative eigenvalues, then the saddle point becomes a maximum of this restriction.

4. If the Hessian matrix is singular, then the degenerate critical point can be a saddle point, as it is, for example, for $x^3$, $x \in \mathbb{R}$, or for the monkey saddle (Figure 4.4(a) and (c)). If it is a saddle, then, if we restrict $\theta$ to change only along the direction of singularity, the restricted function exhibits neither a minimum nor a maximum. We would only have a plateau. When moving from one side of the plateau to the other, the eigenvalue corresponding to this direction changes sign, being exactly zero at the critical point. Note that an eigenvalue of zero can also indicate the presence of a gutter structure. A gutter is a connected set of points that are all either minima, maxima or saddles (depending on the rest of the eigenvalues). In the direction of the gutter, the function is constant. This structure can have the shape of a line or subspace if, for example, one or more coordinates do not affect the function at all. It can also have more interesting shapes.
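The four cases above can be sketched as a small classifier. This is a minimal illustration restricted to two dimensions, where the eigenvalues of a symmetric matrix have a closed form. The function names and the tolerance `tol` are assumptions introduced here, and the Hessians are computed by hand for the functions of Figure 4.4 plus a plain bowl.

```python
import math

def sym2x2_eigenvalues(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius

def classify_critical_point(hessian, tol=1e-12):
    """Classify a critical point from its 2x2 Hessian ((a, b), (b, c)).

    `tol` is a hypothetical threshold below which an eigenvalue counts as zero.
    """
    (a, b), (_, c) = hessian
    lo, hi = sym2x2_eigenvalues(a, b, c)
    if abs(lo) <= tol or abs(hi) <= tol:
        return "degenerate"        # case 4: singular Hessian
    if lo > 0:
        return "local minimum"     # case 1: positive definite
    if hi < 0:
        return "local maximum"     # case 2: negative definite
    return "saddle point"          # case 3: mixed signs

# Hessians evaluated by hand at the stated critical points.
examples = {
    "x^2 + y^2 at (0, 0)": ((2.0, 0.0), (0.0, 2.0)),
    "x^2 - y^2 at (0, 0)": ((2.0, 0.0), (0.0, -2.0)),
    "x^3 - 3xy^2 at (0, 0)": ((0.0, 0.0), (0.0, 0.0)),
    "(x^2 + y^2 - 1)^2 at (1, 0)": ((8.0, 0.0), (0.0, 0.0)),
}
results = {name: classify_critical_point(h) for name, h in examples.items()}
for name, label in results.items():
    print(f"{name}: {label}")
```

Note that the "degenerate" label only says the second-order test is inconclusive: the monkey saddle and the gutter both land in case 4, and telling them apart requires looking beyond the Hessian.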

In the enumeration above we make a distinction between gutters and plateaus. In this text, a plateau is an almost flat region in some direction. This structure arises when the eigenvalues (which describe the curvature) corresponding to the directions of the plateau are close to 0, but not exactly 0. It can also arise from a large discrepancy between the magnitudes of the eigenvalues. This large difference makes the directions of "relatively" small eigenvalues look flat compared to the directions corresponding to large eigenvalues. A gutter is the extreme case in which the surface is perfectly flat and the eigenvalue is 0.
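To make the notion of a plateau concrete, the following sketch runs plain gradient descent on a diagonal quadratic whose two curvatures differ by a factor of 1000. The quadratic, the learning rate and the number of steps are assumptions chosen for illustration only. The direction with the large eigenvalue converges almost immediately, while the direction with the small eigenvalue barely moves.

```python
def gradient_descent(curvatures, start, lr, steps):
    """Run gradient descent on f(p) = 0.5 * sum(c_i * p_i**2)."""
    point = list(start)
    for _ in range(steps):
        # The gradient of the quadratic along axis i is c_i * p_i.
        point = [p - lr * c * p for p, c in zip(point, curvatures)]
    return point

# Eigenvalues of the (diagonal) Hessian: one large, one close to zero.
curvatures = (1.0, 0.001)
x, y = gradient_descent(curvatures, start=(1.0, 1.0), lr=0.5, steps=100)
print(f"x = {x:.3e}, y = {y:.3f}")
```

After 100 steps the high-curvature coordinate is numerically zero, whereas the low-curvature coordinate has covered only about 5% of the distance to the minimum.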

One simple way of analyzing (and understanding) nondegenerate critical points is by relying on Morse's lemma (see, for example, Chapter 7.3, Theorem 7.16 in Callahan (2010)). It states that in a neighbourhood of such a critical point there exists a change of coordinates such that the function can be rewritten as a sum of squares:

$$L(\theta^* + \Delta\theta) = L(\theta^*) - (\Delta v_1)^2 - \ldots - (\Delta v_r)^2 + (\Delta v_{r+1})^2 + \ldots + (\Delta v_{n_\theta})^2$$

Note that the $v_i$ are the new coordinates and that we subtract $r$ squares and add $n_\theta - r$ squares. $r$ is the index (or index of inertia) of the nondegenerate critical point and is equal to the number of negative eigenvalues of the Hessian. By abuse of notation we will refer to the index of inertia also as the fraction of negative eigenvalues (and denote it by $r$ as well).
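As a worked illustration of the lemma, consider a quadratic in two variables chosen here purely as an example (it does not appear in the cited references). Its only critical point is the origin, where $L(\theta^*) = 0$, and an explicit rotation followed by a rescaling brings it into the sum-of-squares form:

```latex
\begin{align*}
L(x, y) &= x^2 + 4xy + y^2, \qquad
H = \begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix}, \qquad
\lambda_1 = -2,\ \lambda_2 = 6, \\
u &= \tfrac{1}{\sqrt{2}}(x + y), \qquad w = \tfrac{1}{\sqrt{2}}(x - y)
\;\Longrightarrow\; L = 3u^2 - w^2, \\
\Delta v_1 &= w, \qquad \Delta v_2 = \sqrt{3}\,u
\;\Longrightarrow\; L = -(\Delta v_1)^2 + (\Delta v_2)^2 .
\end{align*}
```

One square is subtracted and one is added, so the index is $r = 1$ with $n_\theta = 2$, or $r = 1/2$ when expressed as a fraction of negative eigenvalues.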

This reparametrization provides a clear geometrical understanding of the landscape around the critical point. We can explore each coordinate of the reparametrized space independently. Along each dimension, the function has the shape of a bowl, where the critical point is either a minimum (if the bowl is concave up) or a maximum (if the bowl is concave down). It also shows that if all eigenvalues have the same sign, then the critical point is a local minimum or a local maximum of the function. Otherwise, it has the min-max structure and is a saddle point. For a more in-depth understanding of saddle points from a geometrical/mathematical perspective we recommend Chapter 7 of Callahan (2010).

Before going further in our analysis, one question we need to answer is: why should we look at saddle points, and how common are they? Some results on these questions come from statistical physics, where the nature of critical points of random Gaussian error functions on high dimensional continuous domains is studied. See the seminal work of Bray and Dean (2007); Fyodorov and Williams (2007). The results presented in these works rely on replica theory, a mathematical technique for analysing large dimensional systems with quenched disorder, such as spin glasses. A recent description of this technique is given in Parisi (2007).

Recall that the index $r$ of a critical point is the fraction of negative eigenvalues of the Hessian, and let $L$ denote the error obtained at the critical point. Any function will have a global minimum with $L = L_{\min}$ and $r = 0$ and a global maximum with $L = L_{\max}$ and $r = 1$. Bray and Dean (2007) counted the number of critical points of a random function in a finite volume of $N$ dimensions within a range of error $L$ and index $r$. The authors found that any such function with large enough $N$ has an exponential number of critical points. If we project these points onto the plane whose axes are given by $L$ (i.e. the amount of error) and $r$, they are overwhelmingly likely to be located on a monotonically increasing curve $L^*(r)$ that rises from $L_{\min}$ to $L_{\max}$ as $r$ goes from 0 to 1. The probability of a critical point being $O(1)$ away from the curve is exponentially small in the dimensionality $N$ of the space.

This result states that most critical points that correspond to an error larger than $L_{\min}$ are highly likely to be saddle points, and the larger the error $L$ is, the larger $r$ becomes (we have more and more negative eigenvalues as the error at the critical point increases). This means that the values of all local minima are concentrated close to the value of the global minimum of the function.

Another way of understanding these findings is via random matrix theory. Bray and Dean (2007) state that the eigenvalue distribution of the Hessian follows the semi-circular law described by Wigner (1958), except that the semi-circle is shifted according to $L$. In particular, for $L = L_{\min}$, the semi-circle is shifted so far to the right that all eigenvalues are positive. This means that, besides most critical points being saddle points, any such saddle point that corresponds to a reasonably large error has many directions of low curvature (many eigenvalues are very close to 0). This indicates the presence of a plateau-like structure around the saddle point, a plateau that can considerably affect stochastic gradient descent. The findings of Fyodorov and Williams (2007) are very similar.
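The intuition that minima become rare when the eigenvalue spectrum is not shifted can be probed with a toy experiment. The sketch below is not the calculation of Bray and Dean (2007): it simply assumes that the Hessian at a critical point behaves like a symmetric matrix with independent standard Gaussian entries (an unshifted spectrum), and estimates how often all eigenvalues are positive as the dimension grows. Positive definiteness is tested by attempting a Cholesky factorization, and the sample count and seed are arbitrary choices.

```python
import math
import random

def random_symmetric(n, rng):
    """Symmetric n x n matrix with independent standard Gaussian entries."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            m[i][j] = m[j][i] = rng.gauss(0.0, 1.0)
    return m

def is_positive_definite(m):
    """Attempt a Cholesky factorization; it succeeds iff m is positive definite."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False   # a non-positive pivot rules out definiteness
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True

def fraction_of_minima(n, samples, rng):
    """Fraction of sampled matrices whose eigenvalues are all positive."""
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(samples))
    return hits / samples

rng = random.Random(0)
fractions = {n: fraction_of_minima(n, 2000, rng) for n in (1, 2, 4, 6)}
for n, frac in fractions.items():
    print(f"N = {n}: fraction with all eigenvalues positive = {frac:.4f}")
```

Under this assumption the fraction starts near one half for $N = 1$ and drops quickly with $N$, which is consistent with the picture that, away from $L_{\min}$, a critical point in high dimensions almost always has some negative eigenvalues.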

In Baldi and Hornik (1989) the error surface of a single hidden layer MLP with linear units is analysed. The number of hidden units is assumed to be less than the number of input units. Such an error surface shows only saddle points and no local minimum or local maximum. This result agrees with the observation made by Bray and Dean (2007). In fact, as long as we do not get stuck in the plateaus surrounding these saddle points, for such a model we are guaranteed to obtain the global minimum of the error. A similar observation is also made in Saxe et al. (2014), where the existence of symmetries in the weights of deep linear feedforward models leads to saddle structures.
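A scalar toy problem shows the same qualitative picture. It is an assumption made here for illustration and is not the setting analysed by Baldi and Hornik (1989): a chain of two linear weights $w_1, w_2$ fitted to a single target value of 1.

```latex
\begin{align*}
L(w_1, w_2) &= (w_2 w_1 - 1)^2, \qquad
\nabla L = 2\,(w_1 w_2 - 1) \begin{pmatrix} w_2 \\ w_1 \end{pmatrix}, \qquad
H = \begin{pmatrix} 2 w_2^2 & 4 w_1 w_2 - 2 \\ 4 w_1 w_2 - 2 & 2 w_1^2 \end{pmatrix}, \\
\text{at } (0, 0):\quad H &= \begin{pmatrix} 0 & -2 \\ -2 & 0 \end{pmatrix},
\qquad \lambda = \pm 2 \quad \text{(saddle point, } L = 1\text{)}, \\
\text{on } w_1 w_2 = 1:\quad \det H &= 4 w_1^2 w_2^2 - 4 = 0
\quad \text{(global minima, } L = 0\text{, singular Hessian)}.
\end{align*}
```

The only critical point with non-zero error is the saddle at the origin, and the global minima form a gutter along the hyperbola $w_1 w_2 = 1$, since rescaling $w_1$ and inversely rescaling $w_2$ leaves the function unchanged.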

In Saad and Solla (1995) the dynamics of stochastic gradient descent are analysed for soft committee machines. A soft committee machine is an MLP with a single nonlinear hidden layer, whose output weights are all equal to 1 and which has no output activation function. The paper explores how well a student model can learn to imitate a randomly sampled teacher model. The approach taken is analytical: differential equations are provided that describe the evolution of the learning dynamics. An important observation of this work is that learning goes through an initial phase of being trapped in the symmetric subspace. In other words, due to symmetries in the randomly initialized weights, the network has to traverse one or more plateaus that are caused by units with similar behaviour. Rattray et al. (1998); Inoue et al. (2003) provide further analysis, stating that the initial phase of learning is plagued with saddle point structures caused by symmetries in the weights.
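The symmetric subspace can be illustrated in its most extreme form with a short sketch. The code below is not the analysis of Saad and Solla (1995). It assumes a tanh activation, two hidden units, five inputs, and a student whose two units start with exactly the same weights, all of which are choices made here for illustration. Because both units then receive identical gradients, online gradient descent can never separate them.

```python
import math
import random

def forward(weights, x):
    """Soft committee machine: sum of hidden activations, output weights fixed to 1."""
    return sum(math.tanh(sum(w_i * x_i for w_i, x_i in zip(w, x))) for w in weights)

def sgd_step(student, teacher, x, lr):
    """One online gradient step on the squared error 0.5 * (student - teacher)**2."""
    error = forward(student, x) - forward(teacher, x)
    updated = []
    for w in student:
        pre = sum(w_i * x_i for w_i, x_i in zip(w, x))
        slope = 1.0 - math.tanh(pre) ** 2   # derivative of tanh at the pre-activation
        updated.append([w_i - lr * error * slope * x_i for w_i, x_i in zip(w, x)])
    return updated

rng = random.Random(0)
dim = 5
# Randomly sampled teacher with two hidden units.
teacher = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
# Student whose two hidden units start exactly symmetric (assumed extreme case).
shared = [rng.gauss(0.0, 0.1) for _ in range(dim)]
student = [list(shared), list(shared)]

for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    student = sgd_step(student, teacher, x, lr=0.05)

units_identical = student[0] == student[1]
print("hidden units still identical:", units_identical)
```

With a random initialization the units are only approximately symmetric, so the student does eventually escape, but the closer the initial weights are to each other, the longer the learning dynamics stay near this subspace.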

Mizutani and Dreyfus (2010) look at the effect of negative curvature on learning and, implicitly, at the effect of saddle point structures in the error surface. Their findings are similar. A proof is given that the error surface of a single layer MLP has saddle points (where the Hessian matrix is indefinite). Other small scale problems, such as the XOR problem, are also discussed.
