Advanced Model Initialization Techniques
5.2 Deep Belief Network Pretraining
An RBM can be viewed as a generative model with an infinite number of layers in which all the layers share the same weight matrix, as shown in Fig. 5.5a, b. If we separate the bottom layer from the deep generative model in Fig. 5.5b, the remaining layers form another generative model that also has an infinite number of layers with shared weight matrices. These remaining layers are equivalent to another RBM whose visible and hidden layers are switched, as shown in Fig. 5.5c. The resulting model is a special generative model called a deep belief network (DBN), in which the top layer is an undirected RBM and the lower layers form a directed generative model. Applying the same reasoning to Fig. 5.5c, we can see that it is equivalent to the DBN illustrated in Fig. 5.5d.
Algorithm 5.1 The contrastive divergence algorithm for training RBMs
1: procedure TrainRBMWithCD(S = {om | 0 ≤ m < M}, N)    ▷ S is the training set with M samples, N is the number of CD steps
2:     Randomly initialize {W0, a0, b0}
3:     while Stopping Criterion Not Met do    ▷ Stop if the maximum number of iterations is reached or the training criterion improvement is small
4:         Randomly select a minibatch O with Mb samples
5:         V0 ← O    ▷ Positive phase
The relationship between the RBM and the DBN suggests a layer-wise procedure for training very deep generative models [8]. Once we have trained an RBM, we use it to re-represent the data. For each data vector v, we compute a vector h of expected hidden neuron activations (which equal the activation probabilities). We then use these hidden expectations as training data for a new RBM. Each set of RBM weights can thus be used to extract features from the output of the previous layer. Once we stop training RBMs, we have initial values for all the hidden-layer weights of a DBN whose number of hidden layers equals the number of RBMs we trained. The DBN can be further fine-tuned using algorithms such as the wake-sleep algorithm [10].
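The layer-wise procedure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the book's implementation: it assumes Bernoulli-Bernoulli RBMs, a single CD step (N = 1), that a is the visible bias and b the hidden bias, and that reconstructions use probabilities rather than samples. The function names `train_rbm_cd1` and `pretrain_dbn`, and all hyperparameter values, are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=5, batch_size=20, lr=0.1, rng=None):
    """Train a Bernoulli-Bernoulli RBM with one-step contrastive divergence.

    Returns (W, a, b): W has shape (n_hidden, n_visible); a is assumed to be
    the visible bias and b the hidden bias.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)
    for _ in range(epochs):
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            v0 = data[order[start:start + batch_size]]
            # Positive phase: hidden probabilities given the data.
            h0 = sigmoid(v0 @ W.T + b)
            # Negative phase: sample hidden states, reconstruct the visible
            # layer, then re-infer the hidden probabilities.
            h0_sample = (rng.random(h0.shape) < h0).astype(float)
            v1 = sigmoid(h0_sample @ W + a)
            h1 = sigmoid(v1 @ W.T + b)
            # CD-1 gradient estimate, averaged over the minibatch.
            m = len(v0)
            W += lr * (h0.T @ v0 - h1.T @ v1) / m
            a += lr * (v0 - v1).mean(axis=0)
            b += lr * (h0 - h1).mean(axis=0)
    return W, a, b

def pretrain_dbn(data, hidden_sizes, **rbm_kwargs):
    """Greedy layer-wise pretraining: each RBM is trained on the expected
    hidden activations produced by the RBM below it."""
    layers = []
    layer_input = data
    for n_hidden in hidden_sizes:
        W, a, b = train_rbm_cd1(layer_input, n_hidden, **rbm_kwargs)
        layers.append((W, b))  # the upward pass only needs W and b
        # Re-represent the data as expected hidden activations.
        layer_input = sigmoid(layer_input @ W.T + b)
    return layers

# Usage on random binary toy data (illustrative only).
toy_rng = np.random.default_rng(1)
toy = (toy_rng.random((200, 12)) < 0.3).astype(float)
dbn = pretrain_dbn(toy, hidden_sizes=[8, 6, 4], epochs=2)
print([W.shape for W, _ in dbn])
```

Each entry of the returned list holds the weights and hidden bias of one stacked RBM, so the number of hidden layers equals the number of RBMs trained. Note that every RBM above the first is trained on real-valued probabilities rather than binary vectors.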
In the above procedure, we have assumed that all the RBMs share the same dimensions. In this setup, the DBN performs exactly as the RBM does if the RBM is perfectly trained. However, this assumption is not necessary, and we can stack RBMs with different dimensions. This allows for flexibility in the DBN architecture, and stacking additional layers can potentially improve the variational lower bound on the likelihood.
Fig. 5.5 An RBM (a) is equivalent to a generative model with an infinite number of layers (b), in which all the layers share the same weight matrix. By replacing the top layers with RBMs, (b) is equivalent to the DBNs (c) and (d)

The DBN weights can be used as the initial weights of a sigmoidal DNN. This is because the conditional probability P(h|v) in the RBM has the same form as that in the DNN when the sigmoid nonlinear activation function is used. The DNN described in Chap. 4 can be viewed as a statistical graphical model in which each hidden layer $0 < \ell < L$ models the posterior probabilities of conditionally independent binary hidden neurons $h^{\ell}$ given the input vectors $v^{\ell-1}$ as a Bernoulli distribution

$$P(h^{\ell} \mid v^{\ell-1}) = \sigma(W^{\ell} v^{\ell-1} + b^{\ell}),$$

and the output layer approximates the label $y$ conditioned on the input as a multinomial probability distribution

$$P(y \mid v^{L-1}) = \mathrm{softmax}(W^{L} v^{L-1} + b^{L}).$$
Given the observed feature o and the label y, precisely modeling P(y|o) requires marginalizing over all possible values of h across all layers, which is infeasible. An effective practical trick is to replace the marginalization with the mean-field approximation [17]. In other words, we define
$$v^{\ell} = E(h^{\ell} \mid v^{\ell-1}) = P(h^{\ell} \mid v^{\ell-1}) = \sigma(W^{\ell} v^{\ell-1} + b^{\ell}), \tag{5.34}$$
and we get the conventional nonstochastic description of the DNN discussed in Chap.4.
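To make the mean-field approximation concrete, the sketch below compares the deterministic forward pass of Eq. (5.34) with a Monte Carlo estimate that actually samples binary hidden states at every layer. It is an illustration under stated assumptions: the function names, the layer representation as `(W, b)` pairs with W of shape (outputs, inputs), and the softmax output layer are choices made here, not taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_field_forward(o, hidden_layers, output_layer):
    """Deterministic pass: each layer outputs E(h | v) = sigmoid(W v + b)."""
    v = o
    for W, b in hidden_layers:
        v = sigmoid(v @ W.T + b)
    W_out, b_out = output_layer
    return softmax(v @ W_out.T + b_out)

def sampled_forward(o, hidden_layers, output_layer, rng, n_samples=100):
    """Monte Carlo estimate of P(y | o): sample binary hidden states layer
    by layer and average the resulting output distributions."""
    W_out, b_out = output_layer
    total = 0.0
    for _ in range(n_samples):
        v = o
        for W, b in hidden_layers:
            p = sigmoid(v @ W.T + b)
            v = (rng.random(p.shape) < p).astype(float)  # Bernoulli sample
        total = total + softmax(v @ W_out.T + b_out)
    return total / n_samples

# Usage with random, untrained weights (illustrative only).
rng = np.random.default_rng(0)
hidden = [(rng.standard_normal((5, 7)), np.zeros(5)),
          (rng.standard_normal((4, 5)), np.zeros(4))]
output = (rng.standard_normal((3, 4)), np.zeros(3))
o = rng.random((6, 7))
print(mean_field_forward(o, hidden, output))
print(sampled_forward(o, hidden, output, rng))
```

The mean-field pass costs one matrix product per layer, whereas the sampled estimate needs many passes and is still only approximate. The two outputs generally differ, since pushing expectations through a nonlinearity is not the same as taking the expectation of the nonlinearity.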
Based on this view of the sigmoidal DNN, we can see that the DBN weights can be used as the initial weights of the DNN. The only difference between the DBN and the DNN is that the DNN has labels. As such, when pretraining finishes, we add a randomly initialized softmax output layer to the DNN and use backpropagation to fine-tune all the weights in the network discriminatively.
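The initialization and fine-tuning step just described might look like the following sketch. It assumes the pretrained DBN is given as a list of `(W, b)` pairs with W of shape (outputs, inputs), uses plain gradient descent on the cross-entropy loss, and draws the softmax layer from a small Gaussian. The names `init_dnn_from_dbn` and `finetune_step`, the initialization scale, and the learning rate are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_dnn_from_dbn(dbn_layers, n_classes, rng=None, scale=0.01):
    """Copy the pretrained hidden layers and append a random softmax layer."""
    rng = np.random.default_rng(0) if rng is None else rng
    hidden = [(W.copy(), b.copy()) for W, b in dbn_layers]
    top_width = hidden[-1][0].shape[0]
    W_out = scale * rng.standard_normal((n_classes, top_width))
    b_out = np.zeros(n_classes)
    return hidden + [(W_out, b_out)]

def finetune_step(layers, x, y, lr=0.1):
    """One backpropagation step on the cross-entropy loss.

    Updates every layer in place and returns the loss before the update.
    y holds integer class labels.
    """
    # Forward pass, keeping the input to every layer.
    acts = [x]
    for W, b in layers[:-1]:
        acts.append(sigmoid(acts[-1] @ W.T + b))
    W_out, b_out = layers[-1]
    probs = softmax(acts[-1] @ W_out.T + b_out)
    m = len(x)
    loss = -np.log(probs[np.arange(m), y]).mean()
    # Backward pass: gradient of the loss w.r.t. the softmax pre-activation.
    delta = probs.copy()
    delta[np.arange(m), y] -= 1.0
    delta /= m
    for i in range(len(layers) - 1, -1, -1):
        W, b = layers[i]
        grad_W = delta.T @ acts[i]
        grad_b = delta.sum(axis=0)
        if i > 0:
            # Propagate through W (before updating it) and the sigmoid.
            delta = (delta @ W) * acts[i] * (1.0 - acts[i])
        W -= lr * grad_W
        b -= lr * grad_b
    return loss

# Usage: random stand-ins for pretrained DBN weights and labeled data.
rng = np.random.default_rng(0)
dbn_layers = [(0.1 * rng.standard_normal((8, 12)), np.zeros(8)),
              (0.1 * rng.standard_normal((6, 8)), np.zeros(6))]
dnn = init_dnn_from_dbn(dbn_layers, n_classes=3, rng=rng)
x = rng.random((30, 12))
y = rng.integers(0, 3, size=30)
losses = [finetune_step(dnn, x, y) for _ in range(50)]
print(losses[0], losses[-1])
```

Copying the DBN weights keeps the pretrained model intact, so the same DBN can seed several fine-tuning runs. Only the output layer starts from random values; all layers are then updated jointly.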
Initializing the DNN weights with generative pretraining may improve the performance of the DNN on the test set, for three reasons. First, the DNN is highly nonlinear and non-convex, so the initialization point may greatly affect the final model, especially if a batch-mode training algorithm is used. Second, the generative criterion used in the pretraining stage differs from the discriminative criterion used in the backpropagation (BP) phase. Starting BP training from the generatively pretrained model thus implicitly regularizes the model. Third, since only the supervised fine-tuning phase requires labeled data, we can potentially leverage a large quantity of unlabeled data during pretraining. Experiments have shown that generative pretraining often helps and never hurts the training of DNNs, although pretraining takes additional time. Generative pretraining is particularly helpful when the training set is small.
DBN pretraining is not important when only one hidden layer is used, and it typically works best with two hidden layers [18,19]. As the number of hidden layers increases, its effectiveness often decreases. This is because DBN pretraining employs two approximations. First, the mean-field approximation is used as the generation target when training the next layer. Second, the approximate contrastive divergence algorithm is used to learn the model parameters. Both approximations introduce modeling errors for each additional layer. As the number of layers increases, the accumulated errors grow and the effectiveness of DBN pretraining decreases. For the same reason, although we can still use the DBN-pretrained model as the initial model for DNNs with rectified linear units, the effectiveness is greatly discounted since there is no direct link between the two models.