Conclusion and Discussion - Improving sampling, optimization and feature extraction in Boltzman

In this paper, we have shown that while exact calculation of the partition function of RBMs may be intractable, one can exploit the smoothness of gradient descent learning in order to approximately track the evolution of the log-partition function during learning. Our method exploits the Parallel Tempering framework. At each time-step t, Bridge Sampling allows us to estimate the ∆Z_i between adjacent chains, providing a path to a known partition function Z_M. AIS can then be applied with very few interpolating chains between models at nearby learning iterations. For small enough learning rates, this can even be reduced to zero, as was the case in our experiment, and results in large computational gains. Treating the log Z_i’s as unknown variables, the formalism of the Gaussian graphical model of Figure B.1 allowed us to combine multiple sources of information, smooth out the estimates and achieve good tracking of the partition function.
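As a rough illustration of the zero-interpolation case mentioned above, AIS between two nearby models reduces to plain importance sampling: samples from the old model are reweighted by the ratio of unnormalized probabilities. The sketch below is not the code used in the paper. It assumes a binary RBM small enough to enumerate, draws exact samples where a real run would reuse the Parallel Tempering chains, and uses hypothetical names (`free_energy`, `log_z_increment`) chosen for this example.

```python
import numpy as np

def free_energy(v, W, b, c):
    """Free energy F(v) of a binary RBM, so that p(v) = exp(-F(v)) / Z."""
    return -v @ b - np.logaddexp(0.0, c + v @ W).sum(axis=-1)

def all_visible_states(n_visible):
    """Enumerate every binary visible configuration (tiny models only)."""
    idx = np.arange(2 ** n_visible)[:, None]
    return ((idx >> np.arange(n_visible)) & 1).astype(float)

def exact_log_z(W, b, c):
    """Exact log-partition function by brute-force enumeration."""
    neg_f = -free_energy(all_visible_states(W.shape[0]), W, b, c)
    m = neg_f.max()
    return m + np.log(np.exp(neg_f - m).sum())

def log_z_increment(samples, old, new):
    """Estimate log Z_new - log Z_old from samples of the old model."""
    log_w = free_energy(samples, *old) - free_energy(samples, *new)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
old = (0.5 * rng.standard_normal((n_visible, n_hidden)),
       0.1 * rng.standard_normal(n_visible),
       0.1 * rng.standard_normal(n_hidden))
# Assumption: a small perturbation stands in for one low-learning-rate update.
new = tuple(p + 0.01 * rng.standard_normal(p.shape) for p in old)

# Exact samples from the old model (illustrative; not how SML-PT samples).
states = all_visible_states(n_visible)
probs = np.exp(-free_energy(states, *old) - exact_log_z(*old))
probs /= probs.sum()
samples = states[rng.choice(len(states), size=5000, p=probs)]

estimate = log_z_increment(samples, old, new)
exact = exact_log_z(*new) - exact_log_z(*old)
```

Because the two models are close, the importance weights have low variance and `estimate` should sit near `exact`; the quality of such an estimate degrades as the parameter change between iterations grows.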

The method presented in the paper is also computationally attractive, with only a small computational overhead relative to SML-PT training. The added computational cost lies in the computation of the importance weights for AIS and Bridge Sampling. However, this boils down to computing free-energies, which are mostly pre-computed in the course of gradient updates, with the sole exception being the computation of an extra p_{i,t}(x_{i,t−1}) term in the “learning path” AIS.

In comparison to AIS, our method allows us to fairly accurately track the partition function, and at a per-point estimate cost well below that of AIS. Having a reliable and accurate online estimate of the partition function opens the doors to interesting research avenues.

7

Prologue to Third Article

7.1 Article Details

Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines. Guillaume Desjardins, Razvan Pascanu, Aaron Courville and Yoshua Bengio. International Conference on Learning Representations (ICLR), 2013.

Personal Contribution. The particular form of the natural gradient for BMs was derived by myself, with the generic derivation of the natural gradient being joint work with Razvan Pascanu and Aaron Courville. I was the main author for the DBM code, and developed the MFNG algorithm with Razvan Pascanu, who also provided the crucial code for Conjugate Gradient and MinRes. The experiments and writing are my own, with the exception of the results found in Figure 8.2.

7.2 Context

The Hessian-Free (HF) optimization method of Martens (2010); Martens and Sutskever (2011) represents an important breakthrough in training deep and recurrent neural networks. Until its publication, pretraining was the method of choice for successfully training deep multi-layer perceptrons (Hinton et al., 2006; Bengio et al., 2007; Lee et al., 2009). Using Hessian-Free, the authors were able to outperform the pre-trained auto-encoders of Hinton and Salakhutdinov (2006) using pure supervised learning. Later, this same method was adapted to learn long-term dependencies in Recurrent Neural Networks (RNN), seemingly bypassing issues of exploding and vanishing gradients (Bengio et al., 1994). HF belongs to the family of truncated Newton methods explored in Section 1.2.2. In addition to using CG to compute the Newton update direction H^{-1} g (where H is the Hessian matrix and g the estimated gradient), HF bypasses the need for computing or storing the Hessian matrix explicitly by using the R-operator, an efficient mechanism for computing Hessian-vector products.
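To make the truncated Newton idea concrete, the following sketch solves H d = g with conjugate gradient while only ever evaluating products H v. It is an illustration under stated assumptions, not the HF implementation: the model is an L2-regularized logistic regression chosen because its Hessian-vector product has a simple closed form, whereas the R-operator obtains the same product for general networks by differentiating through the computation. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y, l2):
    """Gradient of the mean logistic loss plus an L2 penalty."""
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y) + l2 * w

def hessian_vector_product(w, v, X, l2):
    """Return H v using two matrix-vector products, never forming H."""
    p = sigmoid(X @ w)
    return X.T @ (p * (1.0 - p) * (X @ v)) / X.shape[0] + l2 * v

def conjugate_gradient(matvec, rhs, max_iters=50, tol=1e-20):
    """Solve A x = rhs for symmetric positive-definite A given only x -> A x."""
    x = np.zeros_like(rhs)
    r = rhs.copy()          # residual rhs - A x (x starts at zero)
    d = r.copy()            # current search direction
    rs = r @ r
    for _ in range(max_iters):
        if rs < tol:
            break
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (rng.random(200) < 0.5).astype(float)
w = np.zeros(5)
l2 = 1e-2

g = gradient(w, X, y, l2)
newton_direction = conjugate_gradient(
    lambda v: hessian_vector_product(w, v, X, l2), g)
```

In a truncated Newton method `max_iters` is kept small, so the returned direction is only an approximate solve; the cost per iteration is that of one Hessian-vector product rather than of building or factorizing H.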

At the time of publication, however, it was not clear how the above could be adapted to the probabilistic setting of BMs.

Subsequently, Montavon and Muller (2012) showed how centering (see Section 2.5.2) could enable joint-training of DBMs. This paper provided convincing evidence that our previous reliance on greedy layer-wise pretraining (Salakhutdinov and Hinton, 2009a) stemmed from issues of optimization. This motivated us to revisit second-order optimization methods for BMs, to determine if they could not only subsume the centering trick but improve training altogether.

7.3 Contributions

To the authors’ knowledge, this paper introduced the first practical algorithm for applying the natural gradient to large Boltzmann Machines. As with HF, our method uses a linear solver to invert the Fisher Information Matrix (FIM) and avoids the need to compute or store it explicitly through an efficient matrix-vector operation.

Our paper shows that the Metric-Free Natural Gradient (MFNG) algorithm (and its diagonal approximation) improves convergence speed when training DBMs via variational SML. Unfortunately, the method as presented in the paper is not yet computationally efficient. Surprisingly, we also found that the natural gradient was not a replacement for proper centering of the energy function.

7.4 Recent Developments

Concurrently with our work, Pascanu and Bengio (2013) showed that using Newton’s method with the extended Gauss-Newton (EGN) matrix in lieu of the Hessian is directly equivalent to the natural gradient algorithm. Since this approximation was found to work best in Martens (2010), HF and MFNG are essentially equivalent at a high level: both are efficient implementations of the natural gradient, HF being tailored to the optimization of deterministic functions and MFNG to the optimization of MRFs.

Since publication, it was brought to our attention that Byrd et al. (2011) may provide the key to making our method computationally efficient. In the context of a truncated Newton method, they found that it was preferable to use a much smaller batch size for estimating the Hessian and allocate more capacity towards a careful estimation of the gradient (via a larger batch size). Pascanu and Bengio (2013) also observed improved performance when the samples used to estimate the FIM were separate from those used to estimate the gradient.

With regards to the need for centering, we are currently exploring two hypotheses. (1) The benefits of centering might stem from its global re-parametrization of the energy, whereas the natural gradient is only locally invariant to re-parametrizations of the model. (2) Alternatively, our treatment of latent variables in the derivation of the FIM might be to blame for our inability to perform joint-training without centering. In particular, we have observed that for a fixed setting of the parameters, centering the BM energy can greatly reduce the number of iterations required for inference. Given a maximal number of inference iterations, the failure of MFNG alone to perform joint-training might therefore stem from failures of the inference process. This suggests taking into account the manifold structure of the posterior in the context of inference.

Finally, our paper also failed to cite the Information-Geometric Optimization (IGO) algorithm of Arnold et al. (2011), which applied the natural gradient to a toy RBM and predates our work. In addition to the (local) invariance properties of the natural gradient to re-parametrization of the model, their method also incorporates invariances to re-parametrizations of the input and monotonic transformations of the function undergoing optimization.

8

Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines

This paper introduces the Metric-Free Natural Gradient (MFNG) algorithm for training Boltzmann Machines. Similar in spirit to the Hessian-Free method of Martens (2010), our algorithm belongs to the family of truncated Newton methods and exploits an efficient matrix-vector product to avoid explicitly storing the natural gradient metric L. This metric is shown to be the expected second derivative of the log-partition function (under the model distribution), or equivalently, the covariance of the vector of partial derivatives of the energy function. We evaluate our method on the task of joint-training a 3-layer Deep Boltzmann Machine and show that MFNG does indeed have faster per-epoch convergence compared to Stochastic Maximum Likelihood with centering, though wall-clock performance is currently not competitive.
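In symbols, the covariance form of the metric can be written as follows. The notation is introduced here for illustration only and may differ from that of Section 8.2: E(x; θ) denotes the energy function, p_θ(x) ∝ exp(−E(x; θ)) the model distribution, and θ_i, θ_j individual parameters.

```latex
L_{ij}
  = \mathbb{E}_{p_\theta}\!\left[
      \frac{\partial E}{\partial \theta_i}\,
      \frac{\partial E}{\partial \theta_j}
    \right]
  - \mathbb{E}_{p_\theta}\!\left[\frac{\partial E}{\partial \theta_i}\right]
    \mathbb{E}_{p_\theta}\!\left[\frac{\partial E}{\partial \theta_j}\right]
```

Written this way, L is an expectation over model samples, which is what makes a sample-based matrix-vector product possible without storing L itself.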

8.1 Introduction

Boltzmann Machines (BM) have become a popular method in Deep Learning for performing feature extraction and probability modeling. The emergence of these models as practical learning algorithms stems from the development of efficient training algorithms, which estimate the negative log-likelihood gradient by either contrastive (Carreira-Perpiñan and Hinton, 2005) or stochastic (Tieleman, 2008; Younes, 1998) approximations. However, the success of these models has for the most part been limited to the Restricted Boltzmann Machine (RBM) (Freund and Haussler, 1992), whose architecture allows for efficient exact inference. Unfortunately, this comes at the cost of the model’s representational capacity, which is limited to a single layer of latent variables. The Deep Boltzmann Machine (DBM) (Salakhutdinov and Hinton, 2009a) addresses this by defining a joint energy function over multiple disjoint layers of latent variables, where interactions within a layer are prohibited. While this affords the model a rich inference scheme incorporating top-down feedback, it also makes training much more difficult, requiring until recently an initial greedy layer-wise pretraining scheme. Since then, Montavon and Muller (2012) have shown that this difficulty stems from an ill-conditioning of the Hessian matrix, which can be addressed by a simple reparameterization of the DBM energy function, a trick called centering (an analogue to centering and skip-connections found in the deterministic neural network literature (Schraudolph, 1998; Raiko et al., 2012)). As the barrier to joint-trainingⁱ is overcoming a challenging optimization problem, it is apparent that second-order gradient methods might prove to be more effective than simple stochastic gradient methods. This should prove especially important as we consider models with increasingly complex posteriors or higher-order interactions between latent variables.
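The sense in which centering is "only" a reparameterization can be checked numerically. The sketch below assumes the commonly used form of the centered energy for a single pair of layers, in which offsets are subtracted from each unit before it interacts with the others; the offset names `beta` and `mu`, the RBM-sized example and the function names are assumptions made for illustration and are not taken from this chapter.

```python
import numpy as np

def energy(v, h, W, a, b):
    """Standard (uncentered) binary RBM energy."""
    return -(v @ W @ h) - a @ v - b @ h

def centered_energy(v, h, W, a, b, beta, mu):
    """Energy with each unit's offset subtracted before it interacts."""
    vc, hc = v - beta, h - mu
    return -(vc @ W @ hc) - a @ vc - b @ hc

def equivalent_uncentered_biases(W, a, b, beta, mu):
    """Biases of the uncentered model that defines the same distribution."""
    return a - W @ mu, b - W.T @ beta

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.standard_normal((n_v, n_h))
a, b = rng.standard_normal(n_v), rng.standard_normal(n_h)
# Assumption: offsets in (0, 1), e.g. standing in for mean activations.
beta, mu = rng.random(n_v), rng.random(n_h)

a2, b2 = equivalent_uncentered_biases(W, a, b, beta, mu)
configs = [(rng.integers(0, 2, n_v).astype(float),
            rng.integers(0, 2, n_h).astype(float)) for _ in range(5)]
# The two energies differ by the same constant for every configuration,
# so both parameterizations describe the same probability distribution.
gaps = [centered_energy(v, h, W, a, b, beta, mu) - energy(v, h, W, a2, b2)
        for v, h in configs]
```

Since the distributions coincide, any benefit of centering has to come from how the gradient with respect to (W, a, b) behaves in the new coordinates, which is consistent with the conditioning argument above.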

To this end, we explore the use of the Natural Gradient (Amari, 1998), which seems ideally suited to the stochastic nature of Boltzmann Machines. Our paper is structured as follows. Section 8.2 provides a detailed derivation of the natural gradient, including its specific form for BMs. While most of these equations have previously appeared in Amari et al. (1992), our derivation aims to be more accessible as it attempts to derive the natural gradient from basic principles, while minimizing references to Information Geometry. Section 8.3 represents the true contribution of the paper: a practical natural gradient algorithm for BMs which exploits the persistent Markov chains of Stochastic Maximum Likelihood (SML) (Tieleman, 2008), with a Hessian-Free (HF) like algorithm (Martens, 2010). The method, named Metric-Free Natural Gradient (MFNG) (in recognition of the similarities of our method to HF), avoids explicitly storing the natural gradient metric L and uses a linear solver to perform the required matrix-vector product L^{-1} E_q[∇ log p_θ]. Preliminary experimental results on DBMs are presented in Section 8.4, with the discussion appearing in Section 8.5.
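A minimal sketch of the kind of matrix-vector product a linear solver would need is given below. It treats the metric as the sample covariance of per-sample energy gradients and computes its action on a vector without building the matrix. This is an illustration, not the MFNG implementation: the RBM-sized model, the random binary states standing in for the persistent SML chains, the `damping` term and all function names are assumptions of this example.

```python
import numpy as np

def energy_gradients(v, h):
    """Per-sample dE/dtheta for a binary RBM, flattened as [W, a, b].

    With E(v, h) = -v^T W h - a^T v - b^T h, the gradient is minus the
    sufficient statistics (v h^T, v, h).
    """
    outer = v[:, :, None] * h[:, None, :]
    return -np.concatenate([outer.reshape(len(v), -1), v, h], axis=1)

def metric_vector_product(grads, vec, damping=0.0):
    """Return (Cov[grads] + damping * I) @ vec without building the matrix."""
    centered = grads - grads.mean(axis=0)
    return centered.T @ (centered @ vec) / len(grads) + damping * vec

rng = np.random.default_rng(0)
n_samples, n_v, n_h = 256, 5, 3
# Assumption: random binary states stand in for samples from the model.
v = rng.integers(0, 2, (n_samples, n_v)).astype(float)
h = rng.integers(0, 2, (n_samples, n_h)).astype(float)

grads = energy_gradients(v, h)
vec = rng.standard_normal(grads.shape[1])
Lv = metric_vector_product(grads, vec, damping=0.1)

# Reference computation that does store the full metric (small models only).
L_explicit = (np.cov(grads, rowvar=False, bias=True)
              + 0.1 * np.eye(grads.shape[1]))
```

A function such as `metric_vector_product` is all that a Krylov solver like Conjugate Gradient or MinRes requires. Materializing `grads` for every sample is only reasonable at this toy size.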

i. Joint-training refers to the act of jointly optimizing θ (the concatenation of all model parameters, across all layers of the DBM) through maximum likelihood. This is in contrast to Salakhutdinov and Hinton (2009a), where joint-training is preceded by a greedy layer-wise pretraining strategy.
