FULLY QUANTIZING A SIMPLIFIED TRANSFORMER FOR END-TO-END SPEECH RECOGNITION


Alex Bie∗, Bharat Venkitesh, Joao Monteiro, Md. Akmal Haidar, Mehdi Rezagholizadeh

Huawei Noah’s Ark Lab, Montreal Research Centre, Canada

[email protected], {bharat.venkitesh, joao.monteiro, md.akmal.haidar, mehdi.rezagholizadeh}@huawei.com

ABSTRACT

While end-to-end automatic speech recognition (ASR) performance has improved significantly in recent years, these improvements were obtained with very large neural networks that are unfit for embedded use on edge devices. In this paper, we work on simplifying and compressing Transformer-based encoder-decoder architectures for the end-to-end ASR task. We empirically introduce a more compact Speech-Transformer by investigating the impact of discarding particular modules on the performance of the model. Moreover, we evaluate reducing the numerical precision of our network’s weights and activations while maintaining the performance of the full-precision model. Our experiments show that we can reduce the number of parameters of the full-precision model and then further compress the model 4x by fully quantizing to 8-bit fixed-point precision.

Index Terms— automatic speech recognition, sequence-to-sequence, quantization, compression, Transformer

1. INTRODUCTION

End-to-end automatic speech recognition (ASR) systems combine the functionality of acoustic, pronunciation, and language modelling components into a single neural network. Early approaches to end-to-end ASR employ CTC [1, 2]; however, these models require rescoring with an external language model (LM) to obtain good performance [3]. RNN encoder-decoder [4, 5] equipped with attention [6], originally proposed for machine translation, is an effective approach for end-to-end ASR [3, 7]. These systems see less of a performance drop in the no-LM setting [3].

More recently, the Transformer [8] encoder-decoder architecture has been applied to ASR [9, 10, 11]. Transformer training is parallelizable across time, leading to faster training times than recurrent models [8]. This makes Transformers especially amenable to the large audio corpora encountered in speech recognition. Furthermore, Transformers are powerful autoregressive models [12, 13], and have achieved reasonable ASR results without incurring the storage and computational overhead associated with using LMs during inference [10].

Although current end-to-end technology has seen significant improvements in accuracy, the computational requirements, in terms of both time and space, for performing inference with such models remain prohibitive for edge devices. Thus, there has been increased interest in reducing model sizes to enable on-device computation. The model compression literature explores many techniques to tackle the problem, including quantization [14], pruning [15, 16], and knowledge distillation [17, 18]. An all-neural, end-to-end solution based on RNN-T [19] is presented in [20]. The authors make several runtime optimizations to inference and perform post-training quantization, allowing the model to be successfully deployed to edge devices.

∗ Work performed during an internship at Huawei Noah’s Ark Lab.

Fig. 1: Components of end-to-end Transformer ASR as described in [10]. 2D convolutional blocks are used for feature extraction and down-sampling. 1D causal convolutions are applied on the decoder side. Our proposed simplified Transformer only uses the green and red blocks (no decoder causal convolutions or sinusoidal positional encodings); as such, the decoder receives no explicit positional information.

In this contribution, we turn our focus to refining the Transformer architecture so as to enable its use on edge devices. The absence of recurrent connections in Transformers provides a significant advantage in terms of speeding up computation, and therefore, quantizing a Transformer-based ASR system would be an important step towards on-device ASR. We report findings on direct improvements to the model through removing components which do not significantly affect performance, and finally reduce the numerical precision of model weights and activations. Specifically, we reduced the dimensionality of the inner representations throughout the model, removed convolutional layers employed prior to the decoder’s layers (as in Fig. 1), and finally applied 8-bit quantization to the model’s weights and activations. As verified in terms of recognition performance, our results on the Librispeech dataset [21] support the claim that one can recover the original performance even after greatly reducing the model’s computational requirements.

The remainder of this work is organized as follows: Section 2 gives an overview of Transformer-based ASR, and Section 3 describes the details of our quantization scheme. Section 4 describes our experiments with the Librispeech dataset. Section 5 is a discussion of our results. Connection to prior work is presented in Section 6. Finally, we draw conclusions and describe future directions in Section 7.

2. TRANSFORMER NETWORKS FOR ASR

Casting ASR as a sequence-to-sequence task, the Transformer encoder takes as input a sequence of frame-level acoustic features $(x_1, \dots, x_T)$ and maps it to a sequence of high-level representations $(h_1, \dots, h_N)$. The decoder generates a transcription $(y_1, \dots, y_L)$ one token at a time. Each choice of output token $y_l$ is conditioned on the hidden states $(h_1, \dots, h_N)$ and previously generated tokens $(y_1, \dots, y_{l-1})$ through attention mechanisms. The typical choice of acoustic features is frame-level log-Mel filterbank coefficients. The target transcripts are represented by word-level tokens or by sub-word units such as characters or units produced through byte pair encoding [22].

2.1. Transformer architecture

The encoder and decoder of the Transformer are stacks of N Transformer layers. The layers of the encoder iteratively refine the representation of the input sequence with a combination of multi-head self-attention and frame-level affine transformations. Specifically, the inputs to each layer are projected into keys $K$, queries $Q$, and values $V$. Scaled dot product attention is then used to compute a weighted sum of values for each query vector:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

where $d_k$ is the dimension of the keys. We obtain multi-head attention by performing this computation $h$ times independently with different sets of projections, and concatenating:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O \tag{2}$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \tag{3}$$

The $W_i^*$ are learned linear transformations $W_i^* : d_{model} \to d_*$, and $W^O : h \cdot d_v \to d_{model}$. We use $d_* = d_{model}/h$. The self-attention operation allows frames to gather context from all timesteps and build an informative sequence of high-level features. The outputs of multi-head attention go through a 2-layer position-wise feed-forward network with hidden size $d_{ff}$.

$$\mathrm{FFN}(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2 \tag{4}$$
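As an illustration of eq. 1, the following is a minimal sketch of scaled dot product attention in plain Python, for a single head and small lists of row vectors. The function names `softmax` and `attention` are ours and are not taken from the paper or from fairseq; a practical implementation would use batched tensor operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot product attention (eq. 1) for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
# A query with equal similarity to both keys returns the mean of the values.
context = attention([[0.0, 0.0]], K, V)
```

Multi-head attention (eq. 2 and 3) would repeat this computation on $h$ projected copies of the inputs and concatenate the results.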

On the decoder side, each layer performs two rounds of multi-head attention: the first one being self-attention over the representations of previously emitted tokens ($Q = K = V$), and the second being attention over the output of the final layer of the encoder ($Q$ are previous layer outputs, $K = V$ are $(h_1, \dots, h_N)$). The output of the final decoder layer for token $y_{l-1}$ is used to predict the following token $y_l$. Other components of the architecture such as sinusoidal positional encodings, residual connections and layer normalization are described in [8].
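For reference, the sinusoidal positional encoding of [8] assigns to position $pos$ the values $\sin(pos / 10000^{2i/d_{model}})$ at even indices $2i$ and $\cos(pos / 10000^{2i/d_{model}})$ at odd indices $2i+1$. A minimal sketch follows; the function name is ours and an even $d_{model}$ is assumed.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Positional encoding vector of [8] for one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even index i
        pe.append(math.cos(angle))  # odd index i + 1
    return pe


first = sinusoidal_encoding(0, 4)
```

Since the encoding is a fixed function of the absolute position, a decoder only sees encodings for positions that occur in its training transcripts.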

2.2. Convolutional layers

Following previous work [9, 10, 23], we apply frequency-time 2-dimensional convolution blocks followed by max pooling to our audio features, prior to feeding them into the encoder, as seen in Fig. 1. We can achieve significant savings in computation given that the resulting length of the input is considerably reduced and the computation required for self-attention layers scales quadratically with respect to the sequence length.
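The quadratic dependence can be made concrete with a small calculation. The sketch below counts pairwise attention scores before and after shortening the sequence; the 4x downsampling factor and the 1000-frame input are hypothetical values chosen for illustration, not figures reported in this work.

```python
def attention_cost_ratio(length, downsampling):
    """Ratio of pairwise attention scores before vs. after downsampling."""
    reduced = length // downsampling
    return (length * length) / (reduced * reduced)


# 1000 input frames with a hypothetical 4x reduction in length.
ratio = attention_cost_ratio(1000, 4)
```

Under these assumptions, reducing the length by a factor of 4 reduces the number of attention scores per layer by a factor of 16.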

Moreover, it has been shown that temporal convolutions are effective in modeling time dependencies [24], and serve to encode ordering into learned high-level representations of the input signal. Based on these observations, [10] proposes to replace sinusoidal positional encodings in the Transformer with convolutions, employing 2D convolutions over spectrogram inputs and 1D causal convolutions over word embeddings in the decoder (pictured in Fig. 1).
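To illustrate what "causal" means here, the following sketch applies a 1D causal convolution to a sequence of scalars: the output at step $t$ only depends on inputs at steps up to $t$. This is a simplification under our own assumptions (single channel, zero left-padding); the decoder convolutions of [10] operate on multi-channel embedding vectors.

```python
def causal_conv1d(sequence, kernel):
    """1D causal convolution over a list of scalars.

    Output t depends only on inputs at positions <= t; missing history is
    treated as zeros (left padding). The last kernel tap multiplies the
    current input, earlier taps multiply older inputs.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(sequence)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(sequence))
    ]


# A kernel that copies the previous step: the output is the input delayed by one.
delayed = causal_conv1d([1.0, 2.0, 3.0], [1.0, 0.0])
```

Because each output mixes a token with its immediate predecessors, the layer exposes local ordering to the decoder without any absolute position index.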

3. MODEL COMPRESSION

A simple approach to reducing computational requirements is to reduce the precision of weights and activations in the model. It is shown in [25] that stochastic uniform quantization is an unbiased estimator of its input and that quantizing the weights of a network is equivalent to adding Gaussian noise over parameters, which can induce a regularization effect and help avoid overfitting. Quantization has several advantages: 1) Computation is performed in fixed-point precision, which can be done more efficiently on hardware. 2) With 8-bit quantization, the model can be compressed to as little as one quarter of its original 32-bit size. 3) In several architectures, memory access dominates power consumption, and moving 8-bit data is four times more efficient than moving 32-bit floating-point data. All three factors contribute to faster inference, with a 2-3x speed-up [14], and further improvements are possible with optimized low-precision vector arithmetic.

3.1. Quantization scheme

For our experiments, we apply the quantization scheme introduced in [14]: we use a uniform quantization function $Q : [a, b] \subseteq \mathbb{R} \to [-2^{K-1}, 2^{K-1} - 1] \subseteq \mathbb{Z}$ which maps real values (weights and activations) in the range $[a, b]$ to $K$-bit signed integers:

$$Q(x) = \mathrm{round}\left(\frac{x - a}{\Delta}\right) \tag{5}$$

with $\Delta = \frac{b - a}{2^K - 1}$. In the case that $x$ is not in the range $[a, b]$, we first apply the clamp operator:

$$\mathrm{clamp}(x; a, b) = \min(\max(x, a), b) \tag{6}$$

The de-quantization function $D(\cdot)$ is given by:

$$D(x_Q) = x_Q \times \Delta + a \tag{7}$$

where $x_Q = Q(x)$ refers to the quantized integer value corresponding to the real value $x$.
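The three operations above can be sketched in a few lines of Python. This is an illustrative reading of eq. 5 to 7 for scalar inputs, not the implementation used in our experiments; the function names and the `num_bits` parameter are ours, and $b > a$ is assumed. Taken literally, eq. 5 produces integer codes in $[0, 2^K - 1]$; subtracting $2^{K-1}$ would shift them to the signed range, which this sketch does not do.

```python
def clamp(x, a, b):
    """Eq. 6: restrict x to the range [a, b]."""
    return min(max(x, a), b)


def quantize(x, a, b, num_bits=8):
    """Eq. 5 applied after clamping; returns an integer code."""
    delta = (b - a) / (2 ** num_bits - 1)
    # Python's round() resolves exact ties to the nearest even integer.
    return round((clamp(x, a, b) - a) / delta)


def dequantize(x_q, a, b, num_bits=8):
    """Eq. 7: map an integer code back to a real value."""
    delta = (b - a) / (2 ** num_bits - 1)
    return x_q * delta + a


code = quantize(0.3, -1.0, 1.0)
approx = dequantize(code, -1.0, 1.0)
```

For a value inside $[a, b]$, the round trip `dequantize(quantize(x))` differs from `x` by at most $\Delta / 2$; values outside the range incur an additional clamping error.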

During training, forward propagation simulates the effects of quantized inference by incorporating the de-quantized values of both weights and activations in the floating-point arithmetic of the forward pass. We apply the quantization and de-quantization operations of eq. 5 and eq. 7, respectively, to each layer. The clamping ranges are computed differently for weights and activations. For a weight matrix $X$, we set $a$ and $b$ to be $X_{min}$ and $X_{max}$ respectively. For activations, the clamping range depends on $x$, the input to the layer. We calculate $[a, b]$ by keeping track of $x_{min}$ and $x_{max}$ for each mini-batch during training, and aggregating them using an exponential moving average with smoothing parameter set to 0.9. Quantization of activations starts after a fixed number of steps (3000). This ensures that the network has reached a more stable stage and the estimated ranges do not exclude a significant fraction of values. We quantize to $K = 8$-bit precision in our experiments.
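The range tracking for activations can be sketched as follows. The class and method names are hypothetical. Two details are assumptions on our part, since the text does not fix them: the smoothing parameter 0.9 is taken to weight the running estimate (with 0.1 on the new mini-batch), and the first mini-batch initializes the range.

```python
class ActivationRange:
    """Tracks a clamping range [a, b] with an exponential moving average."""

    def __init__(self, smoothing=0.9, start_step=3000):
        self.smoothing = smoothing
        self.start_step = start_step
        self.a = None
        self.b = None
        self.step = 0

    def update(self, batch):
        """Fold the min/max of one mini-batch of activations into the range."""
        lo, hi = min(batch), max(batch)
        if self.a is None:
            # Assumption: the first mini-batch initializes the range.
            self.a, self.b = lo, hi
        else:
            s = self.smoothing
            self.a = s * self.a + (1 - s) * lo
            self.b = s * self.b + (1 - s) * hi
        self.step += 1

    def quantization_enabled(self):
        """True once the fixed number of warm-up steps has elapsed."""
        return self.step >= self.start_step


tracker = ActivationRange(smoothing=0.9, start_step=2)
tracker.update([-1.0, 0.5, 1.0])
tracker.update([-3.0, 0.0, 3.0])
```

With a smoothing value close to 1, a single mini-batch with outlying activations moves the range only slightly, which keeps the clamping range stable across steps.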


Table 1: WER (%) results of different hyperparameter configurations for Conv-Context. The first 2 rows are taken directly from [10].

Model         d_model  Enc layers  Dec layers  Params  dev-clean  dev-other  test-clean  test-other
Conv-Context  1024     16          6           315M    4.8        12.7       4.7         12.9
Conv-Context  1024     6           6           138M    5.6        14.5       5.7         15.3
Conv-Context  512      6           6           52M     5.3        14.9       5.7         14.8

3.2. Quantization choices

We quantize all matrix multiplication operations, that is, both the inputs and the weights of each matrix multiplication. For other operations such as addition, quantization does not lead to computational gains during inference, so we do not quantize them. Specifically, we quantize all weights and activations, excluding the biases. The biases are summed with the INT32 output of matrix multiplications. In the multi-head attention module, we quantize the inputs ($Q$, $K$, $V$), the softmax layer (including numerator, denominator and division) and the scaled dot product’s output. In the position-wise feed-forward network, we quantize the weights, its output and the output of ReLUs. The weights in the layer norms ($\gamma$), the division operation and the outputs of layer norm are also quantized.
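To show why quantized matrix multiplication only needs integer accumulation, the sketch below computes one dot product from 8-bit codes. Writing each real value as $x \approx \Delta_x x_Q + a_x$ and $w \approx \Delta_w w_Q + a_w$ (eq. 7), the product expands into one integer sum of products plus correction terms that depend on integer sums and the range offsets. The function name and argument layout are ours. How the bias is brought to the accumulator's scale is not specified in the text, so this sketch simply adds it in floating point after rescaling, which is an assumption.

```python
def quantized_dot(x_q, w_q, x_range, w_range, bias=0.0, num_bits=8):
    """Dot product of two de-quantized vectors, computed from integer codes."""
    levels = 2 ** num_bits - 1
    (ax, bx), (aw, bw) = x_range, w_range
    dx, dw = (bx - ax) / levels, (bw - aw) / levels
    n = len(x_q)
    # Integer-only accumulations (these would be INT32 accumulators on hardware).
    acc = sum(xi * wi for xi, wi in zip(x_q, w_q))
    sum_x = sum(x_q)
    sum_w = sum(w_q)
    # Rescale and correct for the range offsets a_x and a_w.
    return dx * dw * acc + dx * aw * sum_x + dw * ax * sum_w + n * ax * aw + bias


# Codes 0 and 255 de-quantize to the range endpoints: x = [-1, 1], w = [2, -2].
result = quantized_dot([0, 255], [255, 0], (-1.0, 1.0), (-2.0, 2.0))
```

The inner loop, which dominates the cost of a matrix multiplication, involves only integer multiply-accumulate operations; the floating-point rescaling is applied once per output element.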

4. EXPERIMENTS

We use the open-source sequence modelling toolkit fairseq [26]. We conduct our experiments on LibriSpeech 960h [21], and follow the same setup as [10]: the input features are 80-dimensional log-Mel filterbanks extracted from 25ms windows every 10ms, and the output tokens come from a 5K subword vocabulary created with the SentencePiece [27] “unigram” model. For fair comparison, we also optimize with AdaDelta [28] with a learning rate of 1.0 and gradient clipping at 10.0, and run for 80 epochs, averaging checkpoints saved over the last 30 epochs. The dropout rate was set to 0.15.

4.1. Comparison of Transformer variants

We perform preliminary experiments comparing full-precision Transformer variants and choose one to quantize. We start from Conv-Context [10], which proposes to replace sinusoidal positional encodings in the encoder and decoder with 2D convolutions (over audio features) and 1D convolutions (over previous token embeddings) respectively. Motivated by recent results in Transformer-based speech recognition [11] and language modelling [29], we allocate our parameter budget towards depth over width, and retrain their model under the configuration of Transformer Base [8], namely: 6 encoder/decoder layers, $d_{model} = 512$, 8 heads, and $d_{ff} = 2048$. We obtain a satisfactory trade-off between model size and performance (Table 1), and adopt this configuration for the remainder of this work.
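A rough count of the weight matrices in this configuration can be obtained from eq. 2 to 4. The sketch below is an estimate under our own simplifications: it counts only the attention projections and feed-forward matrices, and ignores biases, layer normalization, the convolutional front-end, token embeddings and the output layer, so it is expected to fall below the totals reported in Table 1.

```python
def transformer_matrix_params(d_model, d_ff, enc_layers, dec_layers):
    """Count attention and feed-forward weight matrices only."""
    attention = 4 * d_model * d_model  # W^Q, W^K, W^V (all heads) and W^O
    ffn = 2 * d_model * d_ff           # W_1 and W_2
    encoder = enc_layers * (attention + ffn)
    # Each decoder layer has self-attention and encoder-decoder attention.
    decoder = dec_layers * (2 * attention + ffn)
    return encoder + decoder


base = transformer_matrix_params(512, 2048, 6, 6)
```

Under these assumptions the Transformer Base configuration has about 44M parameters in these matrices, with the remainder of the model accounted for by the components left out of the estimate.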

Next, we propose removing the 1D convolutional layers on the decoder side, based on previous work [29] demonstrating that the autoregressive Transformer training setup provides enough of a positional signal for Transformer decoders to reconstruct order in the deeper layers. We observe that removing these layers does not affect our performance, and reduces our parameter count from 52M to 51M. Finally, we add positional encodings on top of this configuration and see, counter-intuitively, that our performance degrades. These results are reported in Table 2.

4.2. Quantization

Table 2: Comparison of 3 full-precision model variants.

Model         1D Conv  Pos. enc.  Params  dev-clean  dev-other  test-clean  test-other
Conv-Context  yes      no         52M     5.3        14.9       5.7         14.8
Proposed      no       no         51M     5.6        14.2       5.5         14.8
+ Pos. enc.   no       yes                6.0        14.6       6.0         14.5

For quantization, we restrict our attention to our proposed simplified Transformer (no decoder-side convolutions or positional encodings), since it performs well and is the least complex of the Transformer variants. We compare the results of quantization-aware training to the full-precision model, as well as to the results of post-training quantization. In post-training quantization, we start from the averaged full-precision model, keep the weights fixed, and compute the clamping range $[a, b]$ for our activations over 1k training steps. To report the checkpoint-averaged result of quantization-aware training, we average the weights of quantization-aware training checkpoints, initialize our activation ranges $[a, b]$ with checkpoint averages, and adjust them over 1k training steps. In both cases, no additional updates are made to the weights.
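Checkpoint averaging, as used in both settings above, amounts to an element-wise mean of the saved parameters. A minimal sketch follows, assuming each checkpoint is a dictionary mapping a parameter name to a flat list of floats; this representation and the function name are ours and do not reflect how fairseq stores checkpoints.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts of the form {name: list of floats}."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th entry of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


mean_model = average_checkpoints([
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
])
```

Since the averaged weights differ from those of any individual checkpoint, the activation ranges are re-estimated on the averaged model, which is the role of the 1k adjustment steps.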

Our results are summarised in Table 3. Our quantized models perform comparably to the full-precision model, and represent reasonable trade-offs in accuracy for model size and inference time. The last row of the table represents a result of 10x compression over the 138M parameter baseline with no loss in performance. Our quantization-aware training scheme did not result in significant gains over post-training quantization.

Table 3: Quantization results of our proposed model (no positional encodings or decoder-side convolutions).

Model                 Fully quantized  dev-clean  dev-other  test-clean  test-other
Full-precision        no               5.6        14.2       5.5         14.8
Post-training quant   yes              5.6        14.6       5.6         15.1
Quant-aware training  yes              5.4        14.5       5.5         15.2
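The 10x compression figure can be checked with simple arithmetic on parameter counts and bit widths. The sketch below assumes every parameter is stored at the stated width; since biases are not quantized in our scheme, the 8-bit size is a slight underestimate. The function name is ours.

```python
def model_size_mb(num_params, bits_per_param):
    """Storage in megabytes if every parameter uses the same bit width."""
    return num_params * bits_per_param / 8 / 1e6


baseline_fp32 = model_size_mb(138e6, 32)  # 138M-parameter baseline, 32-bit
proposed_fp32 = model_size_mb(51e6, 32)   # proposed model before quantization
proposed_int8 = model_size_mb(51e6, 8)    # proposed model, 8-bit
compression = baseline_fp32 / proposed_int8
```

Under this assumption the reduction from 138M to 51M parameters contributes a factor of about 2.7, and the move from 32-bit to 8-bit a further factor of 4, for a combined ratio of about 10.8.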

5. DISCUSSION

5.1. Representing positional information

The 3 Transformer variants explored in this work differ in how they present token-positional information to the decoder. We study their behaviour to get a better understanding of why our proposed simplified model performs well.

We remark that sinusoidal position encodings hurt performance because of longer sequences at test time. It has been observed that decoder-side positional encodings do worse than 1D convolutions [10] (and also nothing at all, from our results). This performance drop is from under-generation; on dev-clean, our proposed model’s WER increases 5.6 → 6.0 after adding positional encodings, with deletion rate increasing 0.7 → 1.3. Our plot in Fig. 2 shows that this can be attributed to the inability of sinusoidal positional encodings to generalize to lengths longer than encountered in the training set.
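The deletion rate discussed here is one of the three error types that make up WER. As a minimal sketch, the function below counts substitutions, deletions and insertions in a minimum-edit-distance word alignment; the function names are ours, and scoring tools may break ties between equally short alignments differently.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimal word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                best = prev[j - 1]  # match, no new error
            else:
                t, s, d, n = prev[j - 1]
                best = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = prev[j]
            best = min(best, (t + 1, s, d + 1, n))  # reference word dropped
            t, s, d, n = curr[j - 1]
            best = min(best, (t + 1, s, d, n + 1))  # extra hypothesis word
            curr.append(best)
        prev = curr
    return prev[-1][1:]


def wer(reference, hypothesis):
    """Word error rate: total errors divided by the reference length."""
    return sum(edit_counts(reference, hypothesis)) / len(reference.split())


# An output that stops early (under-generation) shows up as deletions.
counts = edit_counts("a b c d", "a d")
```

Dividing the deletion count by the number of reference words gives a deletion rate, so a rise in deletions at a roughly constant substitution and insertion count points to under-generation.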

Examining the same plot, we notice utterances with large deletion counts in the outputs of models without sinusoidal positional encoding. An example is shown in Fig. 3. Our models without sinusoidal positional encoding exhibit skipping. We hypothesize the issue lies in the time-axis translation-invariance of decoder inputs: repeated n-grams confuse the decoder into losing its place in the input audio. Cross-attention visualizations between inputs to the final decoder layer and encoder outputs (left column of Fig. 4) support this hypothesis. We remark that being able to handle repetition is crucial for transcribing spontaneous speech. Imposing constraints on attention matrices or expanding relative positional information context are some possible approaches for addressing this problem.

Fig. 2: A plot of reference length vs. deletion for the dev-clean system output of our 3 models. The histograms in orange represent the length distribution of training transcriptions.

Reference: This second part is divided into two, for in the first I speak of her as regards the nobleness of her soul relating some of her virtues proceeding from her soul. In the second I speak of her as regards the nobleness of her body narrating some of her beauties here love saith concerning her.

Conv Context: Thesecond part has divided into two for in the first I speak of her as regards the nobleness of her soul relating some of her virtues proceeding from her soul. In the second I speak of her as regards the nobleness of her body narrating some of her beauties here love saith concerning her.

Fig. 3: An example of "skipping" taken from dev-clean. Punctuation is added for readability. In bold are repeated n-grams. The output of our proposed model is mostly identical to 1D Conv. The model employing positional encodings makes no errors.

Finally, we affirm the hypothesis proposed in [29] that the Transformer with no positional encodings reconstructs ordering in deeper layers. The second column of Fig. 4 shows visualizations of cross-attention as we go up the decoder stack.

5.2. Training the Transformer

We observe no significant gain with quantization-aware training. Furthermore, it increases training time by more than 4x due to its expansion of our computational graph. We note that in post-training quantization, the 1k steps used to fine-tune activation clamping ranges are very important. Without this step, system output is degenerate.

In our experiments, we found that training with large batch sizes (80k audio frames) was necessary for convergence. Similar optimization behaviour was observed across all experiments: a plateau at ∼25% frame-level accuracy followed by a jump to 80% within a single epoch. This jump was not observed when training with smaller batch sizes.

6. RELATION TO PRIOR WORK

Transformers for speech recognition. Several studies have focused on adapting Transformer networks for end-to-end speech recognition. In particular, [9, 10] present models augmenting Transformers with convolutions. [11] focuses on refining the training process, and shows that Transformer-based end-to-end ASR is highly competitive with state-of-the-art methods over 15 datasets. These studies focus only on performance, and do not consider the trade-offs required for edge deployment.

Compression with knowledge distillation. [17] proposes a knowledge distillation strategy applied to Transformer ASR to recover the performance of a larger model with fewer parameters. Distilled models still work in 32-bit floating point, and do not take advantage of faster, more energy-efficient hardware available when working with 8-bit fixed-point. Additionally, we believe this work is orthogonal to ours, and the two methods can be combined for further improvement.

Transformer quantization. Quantization strategies for the Transformer have been proposed in the context of machine translation. [30] proposes a quantization scheme that allows them to improve upon the original full-precision performance.

Necessity of positional encodings. For language modelling, [29] achieve better perplexity scores without positional encodings, and argue that the autoregressive setup used to train the Transformer decoder provides a sufficient positional signal.

Fig. 4: Decoder-encoder attention matrices for the utterance in Fig. 3. On the left column, we see the models without positional encoding sometimes exhibit bi-modality in attention distributions over the input audio. The transcription for the repeated section attends to both positions in the input audio. When decoding, the shorter path that skips the segment between the repetition has higher likelihood.

7. CONCLUSION

In this paper, we proposed a compact Transformer-based end-to-end ASR system, fully quantized to enable edge deployment. The proposed compact version has a smaller hidden size and no decoder-side convolutions or positional encodings. We then fully quantize it to 8-bit fixed point. Compared to the 138M baseline we started from, we achieve more than 10x compression with no loss in performance. The final model also takes advantage of efficient hardware to enable fast inference. Our training strategy and model configurations are not highly tuned. Future work includes exploring additional training strategies and incorporating text data, so as to bring highly performant, single-pass, end-to-end ASR to edge devices.

8. ACKNOWLEDGEMENTS

We would like to thank our colleagues Ella Charlaix, Eyyüb Sari, and Gabriele Prato for their valuable insights and discussions.


9. REFERENCES

[1] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.

[2] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International conference on machine learning, 2014, pp. 1764–1772.

[3] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016, pp. 4945–4949.

[4] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 1724–1734, Association for Computational Linguistics.

[5] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

[6] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[7] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.

[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.

[9] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.

[10] A. Mohamed, D. Okhonko, and L. Zettlemoyer, “Transformers with convolutional context for asr,” arXiv preprint arXiv:1904.11660, 2019.

[11] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, et al., “A comparative study on transformer vs rnn in speech applications,” arXiv preprint arXiv:1909.06317, 2019.

[12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners.”

[13] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran, “Image transformer,” in International Conference on Machine Learning, 2018, pp. 4052–4061.

[14] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.

[15] Yann LeCun, John S Denker, and Sara A Solla, “Optimal brain damage,” in Advances in neural information processing systems, 1990, pp. 598–605.

[16] Song Han, Huizi Mao, and William J Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.

[17] H.-G. Kim, H. Na, H. Lee, J. Lee, T. G. Kang, M.-J. Lee, and Y. S. Choi, “Knowledge distillation using output errors for self-attention end-to-end models,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6181–6185.

[18] Yoon Kim and Alexander M Rush, “Sequence-level knowledge distillation,” arXiv preprint arXiv:1606.07947, 2016.

[19] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.

[20] Y. He, T. N. Sainath, R. Prabhavalkar, I. Mcgraw, R. Alvarez, D. Zhao, D. Rybach, Y. Kannan, A. Wu, R. Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” 2018.

[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[22] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.

[23] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.

[24] Shaojie Bai, J Zico Kolter, and Vladlen Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.

[25] A. Polino, R. Pascanu, and D. Alistarh, “Model compression via distillation and quantization,” arXiv preprint arXiv:1802.05668, 2018.

[26] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli, “fairseq: A fast, extensible toolkit for sequence modeling,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 48–53.

[27] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” arXiv preprint arXiv:1808.06226, 2018.

[28] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[29] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language modeling with deep transformers,” arXiv preprint arXiv:1905.04226, 2019.

[30] Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh, “Fully quantized transformer for improved translation,” arXiv