4.3 Estimating Statistics and Parameters of Boltzmann Machines

4.3.3 Stochastic Approximation Procedure for Boltzmann Machines

Using the naive stochastic gradient method described in Section 2.5, we can find the set of parameters that maximizes the marginal log-likelihood (4.3) or the variational lower bound (4.20) of a Boltzmann machine. One simply repeats the following update for each parameter:

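$$\theta_{t+1} = \theta_t + \eta_t \left( \left\langle h(\mathbf{x}, \mathbf{h} \mid \theta_t) \right\rangle_{\mathrm{d}} - \left\langle h(\mathbf{x}, \mathbf{h} \mid \theta_t) \right\rangle_{\mathrm{m}} \right) = \theta_t + \eta_t \left( H_0 - H \right), \tag{4.22}$$

where $\theta_t$ and $\eta_t$ are the parameter value and the learning rate at time $t$, and

$$h(\mathbf{x}, \mathbf{h} \mid \theta) = \frac{\partial \left( -E(\mathbf{x}, \mathbf{h} \mid \theta) \right)}{\partial \theta}.$$

$\eta_t$ should decrease over time while satisfying Eqs. (2.37)–(2.38). Note that we used the following shorthand notations for simplicity:

$$H_0 = \left\langle h(\mathbf{x}, \mathbf{h} \mid \theta_t) \right\rangle_{\mathrm{d}}, \qquad H = \left\langle h(\mathbf{x}, \mathbf{h} \mid \theta_t) \right\rangle_{\mathrm{m}}.$$

4 Note that the order of $Q$ and $P$ matters when their KL-divergence is computed, as the KL-divergence is not a symmetric measure.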

The first term $H_0$ can be computed quite efficiently by the variational approximation with a fixed number of training samples randomly collected from the training set. Let $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$ be a set of randomly chosen samples from the training set, and let $\boldsymbol{\mu}^{(n)}$ be the variational parameters obtained by iteratively applying Eq. (4.21) to all hidden units conditioned on $\mathbf{x}^{(n)}$. Then,

$$H_0 \approx \frac{1}{N} \sum_{n=1}^{N} h\left(\mathbf{x}^{(n)}, \boldsymbol{\mu}^{(n)} \mid \theta_t\right).$$
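As a rough illustration of this estimate, the following NumPy sketch computes the part of $H_0$ that belongs to the visible-to-hidden weights. It rests on assumptions that are not stated in this section: binary units, an energy whose relevant terms are $-E = \mathbf{c}^\top \mathbf{h} + \mathbf{x}^\top W \mathbf{h} + \tfrac{1}{2} \mathbf{h}^\top V \mathbf{h}$ with $V$ symmetric and zero on the diagonal, and a sigmoid fixed-point update standing in for Eq. (4.21). The names `mean_field`, `data_term_W`, `W`, `V` and `c` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(x, W, V, c, n_iters=20):
    """Variational parameters mu for one visible vector x.

    Assumed update (stand-in for Eq. 4.21):
        mu_j = sigmoid(c_j + sum_i x_i W_ij + sum_k V_jk mu_k)
    """
    mu = np.full(c.shape, 0.5)
    for _ in range(n_iters):
        for j in range(len(mu)):          # update one hidden unit at a time
            mu[j] = sigmoid(c[j] + x @ W[:, j] + V[j] @ mu)
    return mu

def data_term_W(X_batch, W, V, c):
    """Estimate H_0 for W as (1/N) * sum_n x^(n) mu^(n)^T."""
    H0 = np.zeros_like(W)
    for x in X_batch:
        H0 += np.outer(x, mean_field(x, W, V, c))
    return H0 / len(X_batch)

# Toy usage with random parameters and N = 8 random binary "training" samples.
rng = np.random.default_rng(0)
n_vis, n_hid, N = 6, 4, 8
W = 0.1 * rng.standard_normal((n_vis, n_hid))
A = 0.1 * rng.standard_normal((n_hid, n_hid))
V = (A + A.T) / 2                         # symmetric hidden-hidden weights
np.fill_diagonal(V, 0.0)                  # no self-connections
c = np.zeros(n_hid)
X_batch = rng.integers(0, 2, size=(N, n_vis)).astype(float)
H0_W = data_term_W(X_batch, W, V, c)
```

The statistic for a weight $W_{ij}$ is $x_i h_j$, so under the factorized variational posterior the hidden state is replaced by its mean $\mu_j$. The other parameter groups would be handled in the same way.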

The problem is with the second term $H$, which requires running a Gibbs sampling chain until convergence. For instance, let us assume that we collected a finite number $N_0$ of samples $(\mathbf{x}^{(1)}, \mathbf{h}^{(1)}), \ldots, (\mathbf{x}^{(N_0)}, \mathbf{h}^{(N_0)})$ from the model distribution using Gibbs sampling. Then,

$$H \approx \frac{1}{N_0} \sum_{n=1}^{N_0} h\left(\mathbf{x}^{(n)}, \mathbf{h}^{(n)} \mid \theta_t\right).$$

The problem is that it is difficult to choose or determine $N_0$. Furthermore, the $N_0$ that is needed might be too large to be of any practical use.

A computationally efficient method to overcome this problem was proposed by Younes (1988). This algorithm, sometimes called the stochastic approximation procedure (Salakhutdinov, 2009), does not run a Gibbs sampling chain from random states until convergence at each update.

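Let

$$X_t = \left\{ \left(\mathbf{x}_t^{(1)}, \mathbf{h}_t^{(1)}\right), \ldots, \left(\mathbf{x}_t^{(N_0)}, \mathbf{h}_t^{(N_0)}\right) \right\}$$

be a set of states of visible and hidden units. At time $t = 0$, $X_0$ is initialized with random samples, or a randomly chosen subset of the training set. Then at each time $t$, before updating parameters by Eq. (4.22), we obtain $X_{t+1}$ by applying the following transition to each sample a few times:

$$\left(\mathbf{x}_{t+1}^{(n)}, \mathbf{h}_{t+1}^{(n)}\right) \sim T_{\theta_t}\left(\mathbf{x}, \mathbf{h} \mid \mathbf{x}_t^{(n)}, \mathbf{h}_t^{(n)}\right),$$

where $T_{\theta_t}$ is the transition probability of the Gibbs sampling on the Boltzmann machine parameterized by $\theta_t$. We then approximate $H$ by

$$H \approx \frac{1}{N_0} \sum_{(\mathbf{x}, \mathbf{h}) \in X_{t+1}} h(\mathbf{x}, \mathbf{h} \mid \theta_t).$$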

Simply put, this approach does not wait for the Gibbs sampling chain to converge to the equilibrium distribution. Rather, it performs only a few Gibbs sampling steps starting from the samples used during the last update, and uses the new samples to compute the second term $H$ of the gradient. The algorithm is motivated by the fact that if the parameters converge slowly to, for instance, $\theta$, then $X_t$ will converge to the equilibrium distribution of the Boltzmann machine parameterized by $\theta$ in the limit of $t \to \infty$.
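The procedure can be sketched in NumPy as follows. This is only an illustration under assumptions not fixed by this section: binary units, visible and hidden units stacked into one state vector $\mathbf{s}$ with energy $-E(\mathbf{s}) = \mathbf{a}^\top \mathbf{s} + \tfrac{1}{2} \mathbf{s}^\top M \mathbf{s}$ ($M$ symmetric with zero diagonal), a mean-field estimate of $H_0$, and a simple decaying learning rate that has not been checked against Eqs. (2.37)–(2.38). All function and variable names (`train_sap`, `gibbs_sweep`, `M`, `a`, `k`, ...) are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(S, M, a, rng):
    """One sequential Gibbs sweep over every unit of every persistent chain."""
    for i in range(S.shape[1]):
        p = sigmoid(a[i] + S @ M[:, i])           # p(s_i = 1 | all other units)
        S[:, i] = rng.random(S.shape[0]) < p
    return S

def model_term(S):
    """H: averages of s s^T (weights) and s (biases) over the chains X_{t+1}."""
    return S.T @ S / len(S), S.mean(axis=0)

def data_term(X, M, a, n_iters=10):
    """H_0: visibles clamped to X, hiddens replaced by mean-field means."""
    n_vis = X.shape[1]
    mu = np.full((len(X), len(a) - n_vis), 0.5)
    for _ in range(n_iters):
        for j in range(mu.shape[1]):
            col = M[:, n_vis + j]
            mu[:, j] = sigmoid(a[n_vis + j] + X @ col[:n_vis] + mu @ col[n_vis:])
    S = np.hstack([X, mu])
    return S.T @ S / len(S), S.mean(axis=0)

def train_sap(X_train, n_hid, n_chains=16, n_updates=200, k=1, eta0=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = X_train.shape[1] + n_hid
    M, a = np.zeros((n, n)), np.zeros(n)
    off_diag = 1.0 - np.eye(n)                    # keep self-connections at zero
    S = (rng.random((n_chains, n)) < 0.5).astype(float)   # X_0: random states
    for t in range(n_updates):
        eta = eta0 / (1.0 + t / 50.0)             # assumed decaying schedule
        batch = X_train[rng.choice(len(X_train), size=n_chains)]
        H0_M, H0_a = data_term(batch, M, a)
        for _ in range(k):                        # a few steps from the old chains
            S = gibbs_sweep(S, M, a, rng)
        H_M, H_a = model_term(S)
        M += eta * (H0_M - H_M) * off_diag        # update of Eq. (4.22): H_0 - H
        a += eta * (H0_a - H_a)
    return M, a, S

# Toy usage: 64 random binary training vectors with 5 visible units.
rng = np.random.default_rng(1)
X_train = (rng.random((64, 5)) < 0.8).astype(float)
M, a, S = train_sap(X_train, n_hid=3)
```

The key point is that `S` is never reset inside the loop: each update continues the chains from where the previous update left them, which is what distinguishes this sketch from running a fresh chain to convergence every time.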

This approach was proposed independently for training a restricted Boltzmann machine by Tieleman (2008). Tieleman (2008) called this approach persistent contrastive divergence, based on the similarity between this approach and an approach of minimizing contrastive divergence (see Section 4.4.2).

Although this approach is only a special case of a stochastic gradient method, we refer to this algorithm as a stochastic approximation procedure in order to distinguish it from a method that uses a randomly sampled subset of training samples to compute a gradient.

4.4 Structurally-restricted Boltzmann Machines

Besides the intractability of computing exactly the statistics of the distributions modeled by a fully-connected Boltzmann machine, the approximate methods introduced before, such as MCMC methods and variational approximations, are still computationally very expensive. Especially when it comes to using MCMC methods such as Gibbs sampling, the full connectivity of Boltzmann machines prevents an efficient, parallel sampling procedure.

In this section, we first describe how the Boltzmann machine can be interpreted as a Markov random field. This interpretation allows us to examine the underlying reason for the difficulty in parallelizing Gibbs sampling in a fully-connected Boltzmann machine. Furthermore, it sheds light on the direction in which the structural restriction will be applied.

Based on this interpretation, we introduce two structurally restricted variants of Boltzmann machines that have become widely used recently. The first model, called a restricted Boltzmann machine, simplifies the connectivity of units such that no pair of units of the same type is connected. This allows an extremely efficient and exact computation of the posterior probability of the hidden units, avoiding any need for the variational approximation. Furthermore, this bipartite structure allows an easy implementation of parallel Gibbs sampling.
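To make the benefit of the bipartite structure concrete, here is a small NumPy sketch of the exact hidden posterior and one parallel Gibbs step. It assumes binary units and the common energy $-E(\mathbf{x}, \mathbf{h}) = \mathbf{b}^\top \mathbf{x} + \mathbf{c}^\top \mathbf{h} + \mathbf{x}^\top W \mathbf{h}$, which is not defined in this section. The function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_posterior(X, W, c):
    """Exact p(h_j = 1 | x) for all hidden units of every row of X.

    Without hidden-hidden connections the posterior factorizes over j.
    """
    return sigmoid(c + X @ W)

def visible_conditional(H, W, b):
    """p(x_i = 1 | h) for all visible units of every row of H."""
    return sigmoid(b + H @ W.T)

def block_gibbs_step(X, W, b, c, rng):
    """Sample all hidden units in parallel, then all visible units in parallel."""
    H = (rng.random((len(X), len(c))) < hidden_posterior(X, W, c)).astype(float)
    X_new = (rng.random(X.shape) < visible_conditional(H, W, b)).astype(float)
    return X_new, H

# Toy usage: 10 chains, 4 visible and 3 hidden units, random parameters.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
c = rng.standard_normal(3)
X = rng.integers(0, 2, size=(10, 4)).astype(float)
X, H = block_gibbs_step(X, W, b, c, rng)
```

Each layer is sampled with a single matrix product, in contrast to the unit-by-unit sweep that a fully-connected Boltzmann machine requires.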

The other model is called a deep Boltzmann machine. It relaxes the structural restriction of the restricted Boltzmann machine by allowing multiple layers of hidden units, instead of just a single one. Again, each pair of adjacent layers is fully connected, while no pair of units in the same layer is connected.