Learning the Minibatch Online - Stochastic Variational Inference for Binary Matrices

5.4 Stochastic Variational Inference for Binary Matrices

5.4.6 Learning the Minibatch Online

Stochastic methods often use minibatches to reduce the variance of the noisy estimates of the natural gradient, which helps the algorithm converge faster. Instead of updating the variational parameters after subsampling a single matrix entry, the updates are averaged over a minibatch of data. When using a minibatch of size $S$, we randomly subsample $S$ entries from $X$. For each subsampled entry $x_{i,j}$, we compute and store the parameter values $\mathring{u}^{\star}_{i,d}$ and $\mathring{v}^{\star}_{j,d}$ that would have been produced during the normal execution of SIBM without minibatches. After subsampling $S$ entries, we update each $\mathring{u}_{i,d}$ if at least one of the last $S$ subsampled entries belongs to the $i$-th row of $X$. The minibatch update rule follows from Equation (5.14),

$$\mathring{u}^{\text{new}}_{i,d} = (1 - \rho^{u}_{i})\,\mathring{u}^{\text{old}}_{i,d} + \rho^{u}_{i}\,\mathring{u}^{\star,\text{avg}}_{i,d}, \tag{5.17}$$

where

$$\mathring{u}^{\star,\text{avg}}_{i,d} = \frac{1}{n(i)} \sum_{s=1}^{n(i)} \mathring{u}^{\star,s}_{i,d},$$

and $n(i)$ is the number of entries from the $i$-th row found in the last minibatch of $S$ subsampled entries, with $\mathring{u}^{\star,s}_{i,d}$ being the value of $\mathring{u}^{\star}_{i,d}$ produced when the $s$-th of those entries is subsampled. The minibatch update rules for $\mathring{v}_{j,d}$ and $z$ are similar.
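The averaged update in (5.17) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `minibatch_update_rows`, the dictionary layout, and the representation of each natural-parameter vector as a plain list of floats are assumptions made for illustration only.

```python
from collections import defaultdict


def minibatch_update_rows(u, u_star_samples, rho):
    """Apply the averaged update (5.17) to every row touched by a minibatch.

    u              : dict, row index i -> list of floats (current parameters)
    u_star_samples : list of (i, u_star) pairs, one per subsampled entry
    rho            : dict, row index i -> learning rate rho_i^u
    (All names and data layouts here are hypothetical.)
    """
    # Group the stored single-entry values u_star by the row they belong to.
    grouped = defaultdict(list)
    for i, u_star in u_star_samples:
        grouped[i].append(u_star)

    new_u = dict(u)  # rows absent from the minibatch keep their old value
    for i, samples in grouped.items():
        n_i = len(samples)  # n(i): entries from row i in this minibatch
        # Element-wise average of the n(i) stored values.
        avg = [sum(vals) / n_i for vals in zip(*samples)]
        # Convex combination of the old value and the minibatch average.
        new_u[i] = [(1 - rho[i]) * old + rho[i] * a
                    for old, a in zip(u[i], avg)]
    return new_u
```

Rows with no entry in the minibatch are left unchanged, which matches the condition that a parameter is updated only if at least one subsampled entry falls in its row.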

An important question in stochastic methods is how to choose the minibatch size $S$. The choice of $S$ is particularly relevant when working with matrix factorization models, because parameter distributions are often heavy-tailed [Lakshminarayanan et al., 2011]. In our stochastic method, this results in heavy-tailed noisy estimates of the natural gradients. The choice of $S$ governs a trade-off between reducing these heavy tails and slowing convergence with excessively large minibatches; in the limit $S = LM$, the method reduces to batch optimization.

Typically $S$ is hand-tuned to each dataset or optimized with an expensive cross-validation search. To avoid these procedures, we propose an adaptive algorithm that selects $S$ according to the statistics of the data during learning. In particular, we choose $S$ to bound the magnitude of the error in the noisy gradient. Let $\mathring{u}^{\star,\star}_{i,d}$ be the value of $\mathring{u}_{i,d}$ that maximizes the exact ELBO (5.11), that is, the optimum given all of the data with the other parameters fixed. We obtain a probabilistic bound on the relative error of $\mathring{u}^{\star,\text{avg}}_{i,d}$ in (5.17) with respect to the global maximizer of the ELBO, $\mathring{u}^{\star,\star}_{i,d}$, using Markov’s inequality. Markov’s inequality is an upper bound on the probability that a non-negative random variable exceeds a particular value. This is a general bound that makes no assumptions about the distribution of the variable. This gives us the following bound on the error,

$$\delta = p\!\left[\frac{\|\mathring{u}^{\star,\text{avg}}_{i,d} - \mathring{u}^{\star,\star}_{i,d}\|_2^2}{\|\mathring{u}^{\star,\star}_{i,d}\|_2^2} \geq \theta\right] \leq \frac{\mathrm{E}\big[\|\mathring{u}^{\star,\text{avg}}_{i,d} - \mathring{u}^{\star,\star}_{i,d}\|_2^2\big]}{\theta\,\|\mathring{u}^{\star,\star}_{i,d}\|_2^2} = \frac{\|\mathrm{Var}[\mathring{u}^{\star}_{i,d}]\|_1}{\theta\,\|\mathrm{E}[\mathring{u}^{\star}_{i,d}]\|_2^2}\,\mathrm{E}\!\left[\frac{1}{n(i)}\right] \approx \frac{\|\mathrm{Var}[\mathring{u}^{\star}_{i,d}]\|_1}{S\,p(i)\,\theta\,\|\mathrm{E}[\mathring{u}^{\star}_{i,d}]\|_2^2}, \tag{5.18}$$

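where $p(i)$ is the probability of sampling an element from the $i$-th row of $X$, $p(i) = \sum_{j=1}^{M} p(i,j)$. In Equation (5.18) we approximate $\mathrm{E}[1/n(i)]$ by $1/[p(i)S]$. Also note that $\mathring{u}^{\star,\star}_{i,d} = \mathrm{E}[\mathring{u}^{\star,\text{avg}}_{i,d}]$. We now solve for $S$ to obtain a minibatch size that approximately limits the probability that the relative error of $\mathring{u}^{\star,\text{avg}}_{i,d}$ is larger than $\theta$,

$$S^{u}_{i,d} = \frac{\|\mathrm{Var}[\mathring{u}^{\star}_{i,d}]\|_1}{\theta\,\delta\,p(i)\,\|\mathrm{E}[\mathring{u}^{\star}_{i,d}]\|_2^2}. \tag{5.19}$$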
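The equality in (5.18) can be unpacked with one intermediate step. The following is only a sketch: it assumes, beyond what is stated in the text, that the $n(i)$ stored values $\mathring{u}^{\star,s}_{i,d}$ are independent and identically distributed given $n(i) \geq 1$, and the component index $k$ is introduced here for illustration.

```latex
\begin{align*}
\mathrm{E}\Big[\big\|\mathring{u}^{\star,\text{avg}}_{i,d} - \mathrm{E}[\mathring{u}^{\star}_{i,d}]\big\|_2^2 \,\Big|\, n(i)\Big]
  &= \sum_{k} \mathrm{Var}\Big[\big(\mathring{u}^{\star,\text{avg}}_{i,d}\big)_k \,\Big|\, n(i)\Big] \\
  &= \frac{1}{n(i)} \sum_{k} \mathrm{Var}\Big[\big(\mathring{u}^{\star}_{i,d}\big)_k\Big]
   = \frac{\big\|\mathrm{Var}[\mathring{u}^{\star}_{i,d}]\big\|_1}{n(i)} .
\end{align*}
```

Taking the expectation over $n(i)$ then produces the factor $\mathrm{E}[1/n(i)]$. Under the same assumptions, each of the $S$ draws falls in row $i$ with probability $p(i)$, so $\mathrm{E}[n(i)] = S\,p(i)$, and replacing $\mathrm{E}[1/n(i)]$ by $1/\mathrm{E}[n(i)]$ yields the final approximation in (5.18).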

Intuitively, the resulting minibatch size increases with the inverse of the signal-to-noise ratio (SNR) in the estimate $\mathring{u}^{\star}_{i,d}$ of the global maximizer of the exact ELBO in (5.11), $\mathring{u}^{\star,\star}_{i,d}$. If the SNR decreases, this rule chooses larger minibatches to mitigate the greater relative errors. The rule in (5.19) provides a different minibatch size $S^{u}_{i,d}$ for each $\mathring{u}_{i,d}$, and similarly for each $\mathring{v}_{j,d}$. Therefore, to select the overall size $S$ we average the minibatch sizes chosen for each parameter,

$$S = \frac{\sum_{i=1}^{L}\sum_{d=1}^{D} S^{u}_{i,d} + \sum_{j=1}^{M}\sum_{d=1}^{D} S^{v}_{j,d}}{DL + DM}. \tag{5.20}$$
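The two rules (5.19) and (5.20) can be sketched in a few lines of Python. The function names, the use of flat lists for the per-parameter sizes, and the decision to leave the result unrounded are assumptions for illustration; the text does not specify how $S$ is converted to an integer or how a zero mean estimate is handled.

```python
def param_minibatch_size(var, mean, p_row, theta_delta):
    """Minibatch size for one parameter vector, following (5.19).

    var, mean   : lists of floats, element-wise variance and mean estimates
    p_row       : probability of sampling an entry from the matching row
                  (or column, for the v parameters)
    theta_delta : the product theta * delta
    (Hypothetical helper; raises ZeroDivisionError if mean is all zeros.)
    """
    var_l1 = sum(abs(v) for v in var)        # ||Var||_1
    mean_sq_l2 = sum(m * m for m in mean)    # ||E||_2^2
    return var_l1 / (theta_delta * p_row * mean_sq_l2)


def overall_minibatch_size(sizes_u, sizes_v):
    """Average the per-parameter sizes as in (5.20).

    sizes_u holds the D*L values S^u_{i,d}; sizes_v holds the D*M values
    S^v_{j,d}, so the denominator DL + DM is the total number of values.
    """
    all_sizes = list(sizes_u) + list(sizes_v)
    return sum(all_sizes) / len(all_sizes)
```

For example, a parameter with unit variance and unit mean in both of two components, sampled with `p_row = 0.5` and `theta_delta = 2.0`, gets a size of `2 / (2.0 * 0.5 * 2) = 1.0`.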

The proposed approach requires choosing a single dataset-independent parameter, the product of $\theta$ and $\delta$, as opposed to hand-tuning $S$ to each dataset. By making $\theta\delta$ small we limit the expected deviation of $\mathring{u}^{\star,\text{avg}}_{i,d}$ from $\mathring{u}^{\star,\star}_{i,d}$. Empirically we find that $\theta\delta = 2$ leads to good performance.

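Equation (5.19) requires $\mathrm{E}[\mathring{u}^{\star}_{i,d}]$ and $\mathrm{Var}[\mathring{u}^{\star}_{i,d}]$, which are unknown a priori. Therefore, we estimate these quantities online using exponentially weighted moving averages. Let $\bar{u}_{i,d}$ and $\bar{\bar{u}}_{i,d}$ denote, respectively, estimates of the mean and mean squared value of $\mathring{u}^{\star}_{i,d}$. Each time we draw a sample from the $i$-th row of $X$, we update these averages as

$$\bar{u}_{i,d} = (1 - \hat{\rho}^{u}_{i})\,\bar{u}_{i,d} + \hat{\rho}^{u}_{i}\,\mathring{u}^{\star}_{i,d},$$

$$\bar{\bar{u}}_{i,d} = (1 - \hat{\rho}^{u}_{i})\,\bar{\bar{u}}_{i,d} + \hat{\rho}^{u}_{i}\,\big[\mathring{u}^{\star}_{i,d} \circ \mathring{u}^{\star}_{i,d}\big],$$

where “$\circ$” denotes the Hadamard element-wise product operation. The interpolation weight $\hat{\rho}^{u}_{i}$ is selected as $\hat{\rho}^{u}_{i} = (1 + \hat{t}^{u}_{i})^{-\lambda}$, where $\hat{t}^{u}_{i}$ is the number of times that we have sampled an entry in the $i$-th row of $X$, and we set $\lambda = 0.7$. The quantities $\mathrm{E}[\mathring{u}^{\star}_{i,d}]$ and $\mathrm{Var}[\mathring{u}^{\star}_{i,d}]$ are then estimated using $\mathrm{E}[\mathring{u}^{\star}_{i,d}] \approx \bar{u}_{i,d}$ and $\mathrm{Var}[\mathring{u}^{\star}_{i,d}] \approx \bar{\bar{u}}_{i,d} - \bar{u}_{i,d} \circ \bar{u}_{i,d}$. The minibatch sizes $S^{v}_{j,d}$ for the natural parameters $\mathring{v}_{j,d}$ are obtained in a similar manner.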
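The moving-average estimates described above can be sketched as a small Python class. The class name `RunningMoments`, the zero initialization, and the list-of-floats representation are assumptions for illustration. The sketch also assumes that the sample counter is read before it is incremented, so the first sample receives weight 1 and overwrites the initial values; the text does not state which convention is used.

```python
class RunningMoments:
    """Exponentially weighted estimates of a vector's mean and variance."""

    def __init__(self, dim, lam=0.7):
        self.mean = [0.0] * dim     # estimate of the mean (u-bar)
        self.mean_sq = [0.0] * dim  # estimate of the mean square (u-double-bar)
        self.t = 0                  # number of samples seen so far (t-hat)
        self.lam = lam              # decay exponent lambda

    def update(self, u_star):
        # Interpolation weight rho-hat = (1 + t-hat)^(-lambda).
        # Assumption: t-hat excludes the current sample, so rho-hat = 1
        # on the first call.
        rho = (1 + self.t) ** (-self.lam)
        self.mean = [(1 - rho) * m + rho * x
                     for m, x in zip(self.mean, u_star)]
        # Element-wise (Hadamard) square of the new sample.
        self.mean_sq = [(1 - rho) * q + rho * x * x
                        for q, x in zip(self.mean_sq, u_star)]
        self.t += 1

    def variance(self):
        # Var is approximated by mean square minus squared mean, element-wise.
        return [q - m * m for q, m in zip(self.mean_sq, self.mean)]
```

One such tracker per row (and per column for the $\mathring{v}_{j,d}$ parameters) would supply the mean and variance estimates needed by (5.19).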

As learning progresses and the parameters are updated, these statistics change, so the algorithm adapts the minibatch size online.

The statistics for $\mathring{u}_{i,d}$ and $\mathring{v}_{j,d}$ only change if the minibatch includes a sample in the $i$-th row or $j$-th column, respectively. To collect the initial statistics, we use $S = 5L$ for the first minibatch; subsequent values of $S$ chosen by the algorithm are insensitive to this choice, as evidenced by our experiments in Section 5.5.

In document Efficient Bayesian active learning and matrix modelling (Page 97-100)