for Contextual Word Classification
Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney
• Background
• Methodology
• Experiments
• Conclusion
Contextual Word Classification
• language modeling (LM) and machine translation (MT)
• dominated by neural networks (NN)
• p(w1T|...) = p(w0)QT
1 p(wt|w0t−1, ...) = QT
1 p(wt|ht)
• despite many choices to learn the context vector h: – feed-forward NN
– recurrent NN – convolutional NN – self-attention NN – ...
• output is often modeled with standard softmax and trained with cross entropy:
p(wv|h) = exp(WvTh)/PV
v0=1exp(WvT0h) L = − log p(wv|h)
Softmax Bottleneck [Yang et al., 2018]
• from previous:
L = − log p(wv|h) = − log
exp(WvTh)/PV
v0=1 exp(WvT0h)
• exponential-and-logarithm calculation:
log p(wv|h) + logPV
v0=1exp(WvT0h) = WvTh
• approximate true log posteriors with inner products:
log ˜p(wv|h) + Ch ≈ WvTh
Softmax Bottleneck Cont.
• from previous: log ˜p(wv|h) + Ch ≈ WvTh
• in matrix form:
log
˜
p(w1|h1) . . . ˜p(w1|hN) ... . . . ...
˜
p(wV|h1) . . . ˜p(wV|hN)
V×N
+ Ch1 . . .ChN
V×N
| {z }
rank ∼V
≈
W1T
...
WVT
V×d
h1. . .hN
d×N
| {z }
rank ∼d
• factorization the true log posterior matrix: log ˜P + C ≈ WTH
• Softmax Bottleneck:
– log ˜P is high-rank for natural language: rank(log ˜P) ∼ V – C decreases the rank of left-hand side by maximum 1
– rank of WTH is bounded by hidden dimension: rank(WTH) ∼ d – typically V ∼ 100, 000 and d ∼ 1000
≈ → 6≈
Breaking Softmax Bottleneck
• mixture-of-softmax (mos) [Yang et al., 2018]:
pmos(wv|h) =
K
X
k=1
πk exp(WvTfk(h)) PV
v0=1exp(WvT0fk(h))
• sigsoftmax [Kanai et al., 2018]:
psigsoftmax(wv|h) = exp(WvTh)σ(WvTh) PV
v0=1exp(WvT0h)σ(WvT0h)
• weight norm regularization [Herold et al., 2018]:
Lwnr = −X
data
X
V
log(pmos(wv|h)) + ρ v u u t
V
X
v=1
(||Wv||2 − ν)2
Breaking Softmax Bottleneck Cont.
• let z = WvTh
• theoretically, to break softmax bottleneck with activation g(z) [Kanai et al., 2018]:
– nonlinearity of log(g(z)) – numerical stable
– non-negative
– monotonically increasing
Geometric Explanation of Softmax Bottleneck
• an intuitive example:
– ˜p(dog|a common home pet is ...) ≈ ˜p(cat|a common home pet is ...) ≈ 50%
– learned word vectors are close: Wdog ≈ Wcat
– posteriors over dog and cat are thus close: p(dog|...) ≈ p(cat|...) – there exist contexts that would fail the model:
– overall an expressiveness problem
Kernel Trick
• widely used in support-vector machine (SVM) and logistic regression
• improve expressiveness by implicitly transforming data into high dimensional feature spaces
[Eric, 2019]
Kernel Trick Cont.
• K(x,y) =< φ(x), φ(y) >
• Ksq(x1 x2
,y1 y2
) = (x1y1 +x2y2)2 = x12,√
2x1x2,x22
y12,√
2y1y2,y22T
= φT x1 x2
φy1 y2
• for a kernel function (kernel) to be valid:
– positive semidefinite (PSD) Gram matrix
– corresponds to a scalar product in some feature space
• empirically, non PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005]
→ we do not enforce PSD
• where there is an inner product, there could be an application of the kernel trick – import vector machine [Zhu and Hastie, 2002]
– multilayer kernel machine [Cho and Saul, 2009]
– gated softmax [Memisevic et al., 2010]
Non-Euclidean Word Embedding
• Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]
[Athiwaratkun and Wilson, 2017]
Non-Euclidean Word Embedding Cont.
• hyperbolic (Poincar´e) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]
[Nickel and Kiela, 2017]
Quick Recap
• contextual word classification
• softmax bottleneck
• breaking softmax bottleneck
• geometric explanation of softmax bottleneck
• kernel trick
• non-Euclidean word embedding
→ kernels in softmax
• Background
• Methodology
• Experiments
• Conclusion
Generalized Softmax
• model posterior:
p(wv|h) =
K
X
k=1
πk exp(Sk(Wv,fk(h))) PV
v0=1exp(Sk(Wv0,fk(h)))
• mixture weight:
πk = exp(MkTh) PK
k0=1 exp(MkT0h)
• nonlinearly transformed context:
fk(h) = tanh(QkTh)
• with trainable parameters: W ∈ Rd×V, M ∈ Rd×K and Qk ∈ Rd×d
Generalized Softmax Cont.
• from previous:
p(wv|h) =
K
X
k=1
πk exp(Sk(Wv,fk(h))) PV
v0=1exp(Sk(Wv0,fk(h)))
• note:
– replace inner product with kernels: WvTh → Sk(Wv,fk(h)) – replace single softmax with a mixture: 1 → PKk=1 πk
– replace context vector with transformed ones: h → fk(h) – shared word vectors due to memory restriction
• motivations:
– different kernels give different feature spaces
– based on context, model chooses which feature space is suitable – for each feature space, the context vector could be different
– ideally, for each feature space, the word vector could also be different
Individual Kernels
Slin(Wv, h) = WvTh
Slog(Wv, h) = − log(||Wv − h||p + 1) Spow(Wv, h) = −||Wv − h||p
Spol(Wv, h) = (αWvTh + c)p
Srbf(Wv, h) = exp(−γ||Wv − h||2) Swav(Wv, h) = cos(||Wv − h||2
a ) exp(−||Wv − h||2
b )
Sssg(Wv, h) = log Z
N(µWv, ΣWv)N(µh, Σh) Smog(Wv, h) = X
i ,j
log Z
N(µi ,Wv, Σi ,Wv)N(µj ,h, Σj ,h)
Shpb(Wv, h) = −acos(1 + 2||Wv − h||2
(1 − ||Wv||2)(1 − ||h||2))
Individual Kernels Cont.
• kernel sources:
– baseline: lin
– classic (e.g. from SVM): log, pow, pol, pol, rbf, wav – from non-Eulidean word embedding: ssg, mog, hpb
• dimension reduction step in d common to kernels above not immediately executable – simplify the wavelet kernel
– use spherical covariance matrices
– rewrite power of vector difference norm:
||Wv −h||p = (||Wv||2 + ||h||2 − 2WvTh)p2
||Wv −h||p → a vector norm regularized version of the inner product
• Background
• Methodology
• Experiments
• Conclusion
Experimental Setup
• LM on Switchboard:
– 30K vocabulary – 25M training tokens
– LSTM Recurrent NN [Sundermeyer et al., 2012]
• MT on IWSLT2014 German→English:
– 10K joint byte pair encoding merge operations – 160K parallel training sentences
– Transformer NN [Vaswani et al., 2017]
• overall:
– grid search hyperparameters of kernels – vary K and Sk
– fairseq toolkit [Ott et al., 2019]
Performance of Individual Kernels
Method SWB IWSLT
(Perplexity) (Bleu
[%]) Ref. [Irie et al., 2018] [Wu et al., 2019]
47.6 35.2
lin 46.8 34.3
log 103.0 0.4
pow 46.8 32.8
pol 47.3 31.7
rbf 284.9 0.0
ssg 49.9 34.6
mog 46.7 34.2
hpb 122.6 0.3
wav 289.7 0.0
Gradient Properties
0 2 4 6
||Wv − h||p -6
-4 -2 0 2
S(Wv,h)
rbf wav log pow
• expected performance: rbf ≈ wav < log < pow
Performance of Individual Kernels Revisited
Method SWB IWSLT
(Perplexity) (Bleu
[%]) Ref. [Irie et al., 2018] [Wu et al., 2019]
47.6 35.2
lin 46.8 34.3
log 103.0 0.4
pow 46.8 32.8
pol 47.3 31.7
rbf 284.9 0.0
ssg 49.9 34.6
mog 46.7 34.2
hpb 122.6 0.3
wav 289.7 0.0
Mixture of Kernels
• encourage equal contribution of mixture components, similar to [Takase et al., 2018]:
Lreg = Lce + ρ N
X
N
VarN(πk)
• compromise between regularization and performance → ρ = 0.1
ρ Variance PPL
0.001 4.74 46.8
0.01 4.98 46.6
0.1 3.67 47.2
1 3.81 47.4
Mixture of Kernels Cont.
Name Mixture settings PPL
mos* 9×lin 47.8
mix
big1 of each kernel 47.1 mix
1lin, log, rbf, hpb, wav 46.6
mix
23×lin, log 46.5
mix
3lin, log, pow, pol 47.3 mix
4lin, log, rbf, hpb 47.1 mix
52×lin, 2×rbf 46.7
* tricks like activation regularization and average stochastic gradient descent are not applied
• no major performance difference:
– at least one lin mixture component
– consistently, a total mixture weight of more than 50% in lin component – tied projection matrices across mixture components
Disambiguation Abilities
• automatic extraction of word clusters:
– recall the dog and cat example – extract projection matrix
– calculate all pairwise distances between word vectors – for each word, sum the distances of five closest neighbors – sort the results and pick the words with the smallest values – make sure each word in the clusters appears at least 500 times
• three extracted clusters:
– {quickly, slowly, soon, quick, easily}
– {democrat, republicans, politicians, republican, democrats}
– {hamster, hamsters, parakeets, rabbits, turtles}
Disambiguation Abilities Cont.
Model Prediction
Ground Truth ... books can end up being outdated very quickly lin ... books can end up being outdated very soon mix
big... books can end up being outdated very quickly Ground Truth ... if you vote for a republican or vote for a democrat lin ... if you vote for a republican or vote for a republican mix
big... if you vote for a republican or vote for a democrat Ground Truth ... well we had some turtles and a hamster lin ... well we had some turtles and a turtle mix
big... well we had some turtles and a hamster
• manual inspection of appearances of words in the clusters
• examples where mixbig is right and lin is wrong
• need a more systematic way to judge if a model is better at word disambiguation
• Background
• Methodology
• Experiments
• Conclusion
Conclusion and Outlook
• softmax bottleneck: theory and intuition
• generalized softmax: introducing kernels and mixture of kernels
• individual kernels: 9 different kernels, lin, pol, pow, ssg and mog perform the best
• gradient properties: justification of why some kernels perform better than others
• mixture of kernels: consistent large mixture weight of lin when sharing word vectors
• disambiguition abilities: interesting observations, lack a more systematic way for evaluation
• outlook: untie the word embeddings
Any questions?
Athiwaratkun, B. and Wilson, A. G. (2017).
Multimodal word distributions.
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656, Vancouver, Canada.
Boughorbel, S., Tarel, J.-P., and Boujemaa, N. (2005).
Conditionally positive definite kernels for svm based image recognition.
In 2005 IEEE International Conference on Multimedia and Expo, pages 113–116. IEEE.
Cho, Y. and Saul, L. K. (2009).
Kernel methods for deep learning.
In Advances in neural information processing systems, pages 342–350.
Dhingra, B., Shallue, C., Norouzi, M., Dai, A., and Dahl, G. (2018).
Embedding text in hyperbolic spaces.
In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69.
Eric, K. (2019).
Illustration of the kernel trick.
http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html.
Accessed: 2019.10.31.
Herold, C., Gao, Y., and Ney, H. (2018).
Improving neural language models with weight norm initialization and regularization.
In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100.
Irie, K., Lei, Z., Deng, L., Schl¨uter, R., and Ney, H. (2018).
Investigation on estimation of sentence probability by combining forward, backward and bi-directional lstm-rnns.
Proc. Interspeech 2018, pages 392–395.
Kanai, S., Fujiwara, Y., Yamanaka, Y., and Adachi, S. (2018).
Sigsoftmax: Reanalysis of the softmax bottleneck.
In 32nd Conference on Neural Information Processing Systems (NeurIPS), Montr´eal, Canada.
Lin, H.-T. and Lin, C.-J. (2003).
A study on sigmoid kernels for svm and the training of non-psd kernels by smo-type methods.
Technical report.
Memisevic, R., Zach, C., Pollefeys, M., and Hinton, G. E. (2010).
Gated softmax classification.
In Advances in neural information processing systems, pages 1603–1611.
Nickel, M. and Kiela, D. (2017).
Poincar´e embeddings for learning hierarchical representations.
In Advances in neural information processing systems, pages 6338–6347.
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019).
fairseq: A fast, extensible toolkit for sequence modeling.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, MN, USA.
Sundermeyer, M., Schl¨uter, R., and Ney, H. (2012).
Lstm neural networks for language modeling.
In Thirteenth annual conference of the international speech communication association, pages 194–197, Portland, OR, USA.
Takase, S., Suzuki, J., and Nagata, M. (2018).
Direct output connection for a high-rank language model.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).
Attention is all you need.
In Advances in Neural Information Processing Systems, pages 5998–6008.
Vilnis, L. and McCallum, A. (2015).
Word representations via gaussian embedding.
In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.
Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. (2019).
Pay less attention with lightweight and dynamic convolutions.
In Seventh International Conference on Learning Representations (ICLR), New Orleans, LA, USA.
Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2018).
Breaking the softmax bottleneck: A high-rank rnn language model.
In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada.
Zhu, J. and Hastie, T. (2002).
Kernel logistic regression and the import vector machine.
In Advances in neural information processing systems, pages 1081–1088.