3.10 Parallelization
3.10.1 Feature Selection, Regularization & Multi-Task Learning
3.10.1.2 Asynchronous Parallelization
In the current implementation of parallelization, a full averaged model computed from the weight matrix W is only observed after a single epoch of parallel training.
Depending on the size of the shards or tasks, this delay before obtaining a joint model could be too long.
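As a rough illustration of the averaging step referred to above (the final steps of Algorithm 12), the following NumPy sketch stacks per-shard weight vectors into W, keeps the K columns with the largest ℓ2 norm, and averages them over the shards. The function name mix_with_selection and the toy matrix are hypothetical, and resetting unselected features to zero is an assumption that the algorithm listing does not state explicitly.

```python
import numpy as np

def mix_with_selection(W: np.ndarray, K: int) -> np.ndarray:
    """Average per-shard weights, keeping only the K columns of largest l2 norm.

    W has shape (Z, D): one row per shard, one column per feature.
    Assumption: features that are not selected are reset to zero.
    """
    Z, D = W.shape
    col_norms = np.linalg.norm(W, axis=0)             # l2 norm of each feature column
    K = min(K, D)
    top = np.argsort(-col_norms, kind="stable")[:K]   # indices of the K largest norms
    v = np.zeros(D)
    v[top] = W[:, top].mean(axis=0)                   # average selected columns over shards
    return v

# Toy example with Z = 3 shards and D = 3 features.
W = np.array([[1.0, 0.0, 0.1],
              [3.0, 0.0, -0.1],
              [2.0, 0.5, 0.0]])
v = mix_with_selection(W, K=2)
```

In this toy example the third column has the smallest norm and is dropped, while the first two columns are averaged over the three shards.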
As an alternative to this parallelization scheme, Dean et al. [2012] propose an asynchronous variant of SGD that shows strong empirical results compared to vanilla synchronous SGD (without parallelization), at a greatly reduced running time. A variant of their proposed scheme, adapted for pairwise ranking optimization, is depicted in Algorithm 13. In the algorithm, parallel sentences (annotated with the respective per-sentence grammar) are distributed in a round-robin fashion to worker processes running in parallel. The current model parameters w, which are maintained within the main loop, are also distributed to the worker
Algorithm 12 Iterative mixing algorithm with feature selection (adapted from [Simianer et al., 2012]).
Inputs: number of epochs T, number of shards Z, parallel training data I, learning rate η, gold-standard function g(·), pair generation algorithm Q, regularization parameter K.
Partition data into Z shards, each of size S = I/Z, and distribute to machines.
procedure IterMixSelSGD
    Initialize v ← 0.
    for epochs t ← 0 … T−1: do
        for all shards z ∈ {1 … Z}: parallel do
            w_{z,t,0,0} ← v
            for all i ∈ {0 … S−1}: do
                Decode i-th input with w_{z,t,i,0}.
                Generate training examples P using algorithm Q_g.
                for all pairs x_j, j ∈ {0 … |P|−1}: do
                    w_{z,t,i,j+1} ← w_{z,t,i,j} − η ∇l_j(w_{z,t,i,j})
                end for
                w_{z,t,i+1,0} ← w_{z,t,i,|P|}
            end for
        end for
        Collect/stack weights W ← [w_{1,t,S,0} | … | w_{Z,t,S,0}]^T
        Select top K feature columns of W by ℓ2 norm and set
        for k ← 1 … K do
            v[k] = (1/Z) Σ_{z=1}^{Z} W[z][k].
        end for
    end for
    return v
end procedure
Algorithm 13 Asynchronous optimization with iterative feature selection, AsyncSGD.
Inputs: number of epochs T, number of workers Z, parallel data I, feature selection frequency F, number of features K, learning rate η, gold-standard function g(·), pair generation algorithm Q.
procedure Main Loop(I, T, Z)
    w ← 0
    Prepare Z worker processes
    Set up queue U for incoming weight updates
    for epochs t ← 0 … T−1: do
        for all i ∈ {0 … |I|−1}: do
            Send i-th input, current weights w, and pointer to queue U to next available worker by round-robin allocation.
            for all u ∈ U do
                w ← w + u
                if (i + 1) mod F is 0 then
                    Select top K feature columns of w by ℓ2 norm
                    for k ← 1 … K do
                        w′ = w[k]
                    end for
                    w ← w′
                end if
            end for
        end for
    end for
    return w
end procedure
procedure Worker(i, w, η, g, Q, U)
    Decode input i with w.
    Generate training examples P using algorithm Q_g.
    for all pairs x_j, j ∈ {0 … |P|−1}: do
        w_{j+1} ← w_j − η ∇l_j(w_j)
    end for
    U ← U ∪ w_{|P|}
end procedure
processes. Model updates from the workers are put into a queue, which is regularly checked by the main process, and its contents are incorporated into the main copy of the model as soon as they are available. Our proposed implementation is designed such that a single update consists of a mini-batch, which corresponds to all pairs extracted from a single k-best list.
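The interplay between main loop, workers, and update queue can be sketched as follows. This is a minimal toy version under several assumptions, not the implementation evaluated below: Python threads stand in for worker processes, a semaphore models "next available worker", each input is reduced to a list of feature-difference vectors, the loss is a hinge-style margin loss, and each worker enqueues its weight change (rather than its full weight vector) so that w ← w + u accumulates updates. All function names are hypothetical. Feature selection by weight magnitude every F inputs is included as in Algorithm 13.

```python
import queue
import threading
import numpy as np

def worker(x_pairs, w, eta, updates):
    """Run SGD over all pairs of one mini-batch and enqueue the weight delta."""
    w_local = w.copy()
    for d in x_pairs:                  # d: feature difference (better minus worse)
        if w_local @ d < 1.0:          # margin violated (assumed hinge-style loss)
            w_local += eta * d         # w <- w - eta * grad, where grad = -d
    updates.put(w_local - w)

def select_top_k(w, K):
    """Keep the K weights of largest magnitude and zero out the rest."""
    if K >= w.size:
        return w
    keep = np.argsort(-np.abs(w), kind="stable")[:K]
    out = np.zeros_like(w)
    out[keep] = w[keep]
    return out

def async_sgd(data, dim, T=1, Z=2, F=2, K=2, eta=0.1):
    w = np.zeros(dim)
    updates = queue.Queue()            # queue U for incoming weight updates
    slots = threading.Semaphore(Z)     # at most Z workers are active at once

    def run(x_pairs, w_snapshot):
        try:
            worker(x_pairs, w_snapshot, eta, updates)
        finally:
            slots.release()            # worker becomes available again

    threads = []
    for _ in range(T):
        for i, x_pairs in enumerate(data):
            slots.acquire()            # wait for the next available worker
            t = threading.Thread(target=run, args=(x_pairs, w.copy()))
            t.start()
            threads.append(t)
            while True:                # incorporate whatever has arrived so far
                try:
                    w += updates.get_nowait()
                except queue.Empty:
                    break
            if (i + 1) % F == 0:
                w = select_top_k(w, K)
    for t in threads:                  # collect updates still in flight
        t.join()
    while not updates.empty():
        w += updates.get()
    return w

# Toy data: 8 inputs, each yielding 3 identical pairs along feature 0.
data = [[np.array([1.0, 0.0, 0.0])] * 3 for _ in range(8)]
w = async_sgd(data, dim=3, T=2, Z=4, F=2, K=2)
```

Because workers start from possibly stale snapshots of w, the exact value of the result depends on thread timing; only its qualitative behavior (here, a positive weight on the first feature) is stable across runs.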
We also incorporate a heuristic for feature selection, selecting K features after having processed F mini-batches. Since there is no weight matrix available, but only single weight vectors, we propose to select weights simply by their ℓ2 norm (which, for a single weight, is its absolute value), selecting a fixed number of K features with maximal norm value.
3.10.2 Evaluation
We first explore the effectiveness of our general parallelization scheme on the Nc@ data set, training on the full bitext. Results are depicted in Table 3.14. The statistical significance of result differences on the test set is assessed with an approximate randomization test for the BLEU score [Riezler and Maxwell, 2005], as described in Section 2.3.2.4. Significant results are annotated by referencing the respective experiment in brackets, where p ≤ 0.05.
Using only the Dense feature set, the algorithms⁴⁸ show about the same performance, without significant differences. However, using Sparse features, the iterative mixing approaches perform better than mixing once. Feature selection by ℓ1/ℓ2 regularization (selecting 100,000 features⁴⁹ after each epoch) results in some minor, but significant gains over all other algorithms. All algorithms are run for 15 epochs and use fixed random shards with a size of 1,000 segments, if applicable.
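The significance test mentioned above can be sketched as follows. This is a simplified illustration, not the procedure of Section 2.3.2.4: BLEU is a corpus-level metric, so a faithful version would recompute it from the shuffled outputs, whereas this sketch assumes per-segment scores and uses the difference in mean score as the test statistic. The function name and its parameters are hypothetical.

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided approximate randomization test on paired per-segment scores.

    Returns an estimated p-value for the observed difference in mean score.
    Assumption: the statistic is a mean of segment scores, not corpus BLEU.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:     # swap the two systems' outputs for this segment
                a, b = b, a
            sum_a += a
            sum_b += b
        # Small tolerance guards against floating-point rounding.
        if abs(sum_a - sum_b) / n >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)

p_same = approximate_randomization([0.1, 0.2], [0.1, 0.2], trials=100)
p_diff = approximate_randomization([1.0] * 20, [0.0] * 20, trials=1000)
```

A difference would be reported as significant when the returned estimate is at most 0.05, matching the threshold used in Table 3.14.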
Another parameter we explore is sharding: since our general domain data has no obvious partitioning, we use random sharding. There are, however, two ways to implement this: generating shards once before optimization, or randomly re-sharding after each epoch. Results for these experiments on the Nc∗ and Wmt13 data sets are presented in Tables 3.15 and 3.16, respectively. Since we tune on a smaller tuning set for Wmt13, we only select 10,000 features after each epoch.
According to these results, there is no substantial difference between randomly sharding once or repeatedly. It is also worth noting that, despite having introduced a random factor in training, there is no large variance observable in the results for repeated experiments, as indicated by the standard deviations. All experiments were repeated three times.
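The two sharding variants compared here differ only in when the random partition is drawn. A minimal sketch, with hypothetical function names and a fixed seed chosen purely for reproducibility of the illustration:

```python
import random

def make_shards(n_segments, shard_size, rng):
    """Randomly partition segment indices into shards of at most shard_size."""
    idx = list(range(n_segments))
    rng.shuffle(idx)
    return [idx[i:i + shard_size] for i in range(0, n_segments, shard_size)]

def shards_per_epoch(n_segments, shard_size, epochs, reshard, seed=0):
    """Yield the shard assignment used in each epoch.

    reshard=False: shards are generated once before optimization.
    reshard=True:  a fresh random partition is drawn for every epoch.
    """
    rng = random.Random(seed)
    shards = make_shards(n_segments, shard_size, rng)
    for _ in range(epochs):
        if reshard:
            shards = make_shards(n_segments, shard_size, rng)
        yield shards

once = list(shards_per_epoch(10, 4, epochs=3, reshard=False))
again = list(shards_per_epoch(10, 4, epochs=3, reshard=True))
```

In both variants every segment is seen exactly once per epoch; only the grouping of segments into shards changes.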
Results for the proposed asynchronous optimization algorithm, contrasted with its synchronous counterparts, are depicted in Table 3.17. Both algorithms periodically selected 10,000 features. Both synchronous and asynchronous variants were trained for 15 epochs, using randomized data (either randomized shards
⁴⁸ Note that we omitted the IterMixSelSGD algorithm.
⁴⁹ In preliminary experiments we determined that a fixed model size of 100,000 represents a practical tradeoff between decoding speed, model size and communication overhead.
Nc@
System                       Dev. Test    Test
Dense, MixSGD (1)            25.7         27.9
Dense, IterMixSGD (2)        26.1         27.9

Sparse, MixSGD (1)           26.1         27.9
Sparse, IterMixSGD (2)       26.4         (1)28.6
Sparse, IterMixSelSGD (3)    26.8         (1,2)28.8
Table 3.14: Comparing different synchronous parallel optimization schemes on the small Nc@ data set, training on the full bitext with Dense and Sparse feature sets. Significance is assessed with approximate randomization tests between all experiments in a group; significant improvements are denoted by the number of the respective algorithms. Table adapted from [Simianer et al., 2012].
Nc∗
System      Dev. Test
Once        26.3±0.0
Re-shard    26.2±0.0
Table 3.15: Random re-sharding per epoch versus sharding once on the Nc∗ data, tuning on the bitext with Sparse features.
Wmt13
System      Dev. Test1    Test1       Dev. Test2    Test2
Once        24.9±0.1      23.1±0.2    26.1±0.1      25.4±0.1
Re-shard    24.9±0.0      23.0±0.1    26.1±0.1      25.5±0.1
Table 3.16: Random re-sharding per epoch versus sharding once on the Wmt13 data set using the TuningL data for training.
Wmt13
System                     Dev. Test1    Test1       Dev. Test2    Test2
IterMixSelSGD, Once        24.9±0.1      23.1±0.2    26.1±0.1      25.4±0.1
IterMixSelSGD, Re-shard    24.9±0.0      23.0±0.1    26.1±0.1      25.5±0.1
AsyncSGD, 2 Workers        25.0±0.1      23.1±0.1    26.2±0.2      25.6±0.1
AsyncSGD, 4 Workers        24.8±0.1      23.3±0.1    26.4±0.1      25.3±0.2
AsyncSGD, 10 Workers       24.8±0.1      23.3±0.1    26.4±0.0      25.6±0.1
AsyncSGD, 20 Workers       23.6±0.6      21.8±0.9    24.8±0.6      24.6±0.6
Table 3.17: Synchronous and asynchronous parallelized SGD with ℓ1/ℓ2 regularization-based feature selection using the Sparse feature set on the TuningL data.
or random permutations of the training data). We used 2, 4, 10, and 20 parallel workers for the asynchronous algorithm, and ten shards for the synchronous algorithm. The segments of the training data are distributed in a round-robin fashion, skipping workers that have not yet returned. Each worker sends its weight vector immediately after each mini-batch, and receives a new segment along with a newly computed global weight vector. Features are selected by the main loop after every 100 segments in total. All experiments were repeated three times to account for optimizer instability. We use the TuningL data set for tuning, and we employ the margin perceptron (cf. Section 3.10.3) with the Sparse feature set.
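The round-robin allocation that skips busy workers can be illustrated with a small helper. The function name, the representation of busy workers as a set of indices, and the convention of returning None when all workers are busy are assumptions made for this sketch.

```python
def next_free_worker(n_workers, busy, start):
    """Return the first idle worker at or after `start` in cyclic order.

    busy is the set of worker indices that have not yet returned.
    Returns None if every worker is currently busy.
    """
    for step in range(n_workers):
        k = (start + step) % n_workers
        if k not in busy:
            return k
    return None

# Workers 1 and 2 are still busy, so the segment goes to worker 3.
chosen = next_free_worker(4, busy={1, 2}, start=1)
```

The main loop would call such a helper with start set to the index following the most recently used worker, and wait for an incoming update whenever no worker is free.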
Results for both algorithmic variants are very similar. The asynchronous version, however, breaks down when using more than ten workers: with 20 workers, all scores in Table 3.17 drop by at least one point compared to ten workers, and the standard deviation rises to between 0.6 and 0.9. The variation for the other settings is negligible.