Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification. Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney

(1)


(2)

• Background

• Methodology

• Experiments

• Conclusion

(3)

Contextual Word Classification

• language modeling (LM) and machine translation (MT)

• dominated by neural networks (NN)

• p(w_1^T | ...) = p(w_0) \prod_{t=1}^{T} p(w_t | w_0^{t-1}, ...) = \prod_{t=1}^{T} p(w_t | h_t)

• despite many choices to learn the context vector h:

– feed-forward NN

– recurrent NN

– convolutional NN

– self-attention NN

– ...

• output is often modeled with standard softmax and trained with cross entropy:

p(w_v | h) = \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}

L = -\log p(w_v | h)
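The standard softmax output and its cross-entropy loss can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the toy sizes `d` and `V`, the random parameters and the function names are assumptions made for the example.

```python
import numpy as np

def softmax_posteriors(W, h):
    """Standard softmax over the vocabulary.

    W: (d, V) word vectors, one column per word. h: (d,) context vector.
    """
    logits = W.T @ h                 # inner products W_v^T h, shape (V,)
    logits = logits - logits.max()   # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def cross_entropy(W, h, target):
    """L = -log p(w_target | h)."""
    return -np.log(softmax_posteriors(W, h)[target])

rng = np.random.default_rng(0)
d, V = 8, 20                         # toy sizes, chosen for illustration only
W = rng.normal(size=(d, V))
h = rng.normal(size=d)
p = softmax_posteriors(W, h)
loss = cross_entropy(W, h, target=3)
```

Subtracting the maximum logit before exponentiating leaves the posteriors unchanged and only guards against overflow.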

(4)

Softmax Bottleneck [Yang et al., 2018]

• from previous:

L = -\log p(w_v | h) = -\log \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}

• exponential-and-logarithm calculation:

\log p(w_v | h) + \log \sum_{v'=1}^{V} \exp(W_{v'}^T h) = W_v^T h

• approximate true log posteriors with inner products:

\log \tilde{p}(w_v | h) + C_h \approx W_v^T h

(5)

Softmax Bottleneck Cont.

• from previous: \log \tilde{p}(w_v | h) + C_h \approx W_v^T h

• in matrix form:

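\underbrace{
  \log
  \begin{bmatrix}
    \tilde{p}(w_1 | h_1) & \cdots & \tilde{p}(w_1 | h_N) \\
    \vdots & \ddots & \vdots \\
    \tilde{p}(w_V | h_1) & \cdots & \tilde{p}(w_V | h_N)
  \end{bmatrix}_{V \times N}
  +
  \begin{bmatrix} C_{h_1} & \cdots & C_{h_N} \end{bmatrix}_{V \times N}
}_{\text{rank} \sim V}
\approx
\underbrace{
  \begin{bmatrix} W_1^T \\ \vdots \\ W_V^T \end{bmatrix}_{V \times d}
  \begin{bmatrix} h_1 & \cdots & h_N \end{bmatrix}_{d \times N}
}_{\text{rank} \sim d}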

• factorization of the true log posterior matrix: \log \tilde{P} + C \approx W^T H

• Softmax Bottleneck:

– \log \tilde{P} is high-rank for natural language: rank(\log \tilde{P}) ∼ V

– C decreases the rank of the left-hand side by at most 1

– rank of W^T H is bounded by the hidden dimension: rank(W^T H) ∼ d

– typically V ∼ 100,000 and d ∼ 1,000

→ with d much smaller than V the approximation cannot hold: ≈ → ≉
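The rank argument can be checked numerically on a toy problem. The sketch below uses random Gaussian matrices as stand-ins (an assumption: a random V×N matrix plays the role of a high-rank log posterior matrix) and much smaller sizes than those quoted on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, N = 200, 10, 300          # toy sizes; the slide quotes V ~ 100,000 and d ~ 1,000

W = rng.normal(size=(d, V))     # word vectors as columns
H = rng.normal(size=(d, N))     # N context vectors as columns
logits = W.T @ H                # (V, N) matrix of inner products W_v^T h_n

rank_logits = np.linalg.matrix_rank(logits)

# A generic V x N matrix, standing in for a high-rank log posterior matrix.
A = rng.normal(size=(V, N))
rank_generic = np.linalg.matrix_rank(A)

# Adding one constant per context (the same row added to every row) is a rank-1 update.
C = rng.normal(size=(1, N))
rank_shifted = np.linalg.matrix_rank(A + C)
```

Here `rank_logits` cannot exceed `d`, while the stand-in matrix has rank min(V, N), and the per-context constants move that rank by at most one.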

(6)

Breaking Softmax Bottleneck

• mixture-of-softmax (mos) [Yang et al., 2018]:

p_{mos}(w_v | h) = \sum_{k=1}^{K} \pi_k \frac{\exp(W_v^T f_k(h))}{\sum_{v'=1}^{V} \exp(W_{v'}^T f_k(h))}

• sigsoftmax [Kanai et al., 2018]:

p_{sigsoftmax}(w_v | h) = \frac{\exp(W_v^T h) \sigma(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h) \sigma(W_{v'}^T h)}

• weight norm regularization [Herold et al., 2018]:

L_{wnr} = -\sum_{\text{data}} \sum_{V} \log(p_{mos}(w_v | h)) + \rho \sqrt{\sum_{v=1}^{V} (\|W_v\|_2 - \nu)^2}
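As an illustration of the sigsoftmax formula above, the following sketch computes it next to the standard softmax for a handful of toy logits z_v = W_v^T h. The logit values and function names are assumptions for the example. The log-domain formulation is one possible way to keep the computation stable, not necessarily what the cited work does.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigsoftmax(z):
    """p_v proportional to exp(z_v) * sigmoid(z_v), computed in the log domain."""
    # log(exp(z) * sigmoid(z)) = z + log sigmoid(z) = z - log(1 + exp(-z))
    log_g = z - np.logaddexp(0.0, -z)
    log_g = log_g - log_g.max()
    g = np.exp(log_g)
    return g / g.sum()

z = np.array([2.0, 0.5, -1.0, -3.0])   # toy logits, sorted in decreasing order
p_soft = softmax(z)
p_sig = sigsoftmax(z)

# Direct evaluation of the formula, used only as a reference for the stable version.
p_ref = np.exp(z) * sigmoid(z)
p_ref = p_ref / p_ref.sum()
```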

(7)

Breaking Softmax Bottleneck Cont.

• let z = W_v^Th

• theoretically, to break softmax bottleneck with activation g(z) [Kanai et al., 2018]:

– nonlinearity of log(g(z))

– numerically stable

– non-negative

– monotonically increasing
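These properties can be probed numerically for a concrete activation. The sketch below compares g(z) = exp(z), for which log(g(z)) is linear, with the sigsoftmax activation g(z) = exp(z)σ(z). The grid and the second-difference test for nonlinearity are illustrative choices, not part of the cited analysis.

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 101)          # uniform grid of logits

log_g_exp = z                             # g(z) = exp(z)            ->  log g(z) = z
log_g_sig = z - np.logaddexp(0.0, -z)     # g(z) = exp(z) sigmoid(z) ->  z + log sigmoid(z)

def max_abs_second_difference(y):
    """Zero (up to rounding) for a function that is linear on a uniform grid."""
    return np.abs(np.diff(y, n=2)).max()

curv_exp = max_abs_second_difference(log_g_exp)
curv_sig = max_abs_second_difference(log_g_sig)

g_sig = np.exp(log_g_sig)
is_non_negative = bool(np.all(g_sig >= 0.0))
is_increasing = bool(np.all(np.diff(g_sig) > 0.0))
```

A vanishing `curv_exp` and a clearly non-zero `curv_sig` reflect that only the second activation makes log(g(z)) nonlinear on this grid.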

(8)

Geometric Explanation of Softmax Bottleneck

• an intuitive example:

– ˜p(dog|a common home pet is ...) ≈ ˜p(cat|a common home pet is ...) ≈ 50%

– learned word vectors are close: W_dog ≈ W_cat

– posteriors over dog and cat are thus close: p(dog|...) ≈ p(cat|...)

– there exist contexts that would fail the model

– overall an expressiveness problem
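The dog and cat example can be made concrete. With a standard softmax, the log posterior gap between two words equals (W_dog − W_cat)^T h, so by the Cauchy-Schwarz inequality it is bounded by ||W_dog − W_cat|| · ||h||. The sketch below checks this on random data; the vocabulary indices, sizes and noise scale are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 16, 50
W = rng.normal(size=(d, V))
dog, cat = 0, 1                                      # hypothetical vocabulary indices
W[:, cat] = W[:, dog] + 0.01 * rng.normal(size=d)    # W_cat is close to W_dog

def log_softmax(W, h):
    logits = W.T @ h
    m = logits.max()
    return logits - (m + np.log(np.exp(logits - m).sum()))

gaps, bounds = [], []
for _ in range(1000):
    h = rng.normal(size=d)                           # a random context vector
    logp = log_softmax(W, h)
    gaps.append(abs(logp[dog] - logp[cat]))
    # Cauchy-Schwarz: |(W_dog - W_cat)^T h| <= ||W_dog - W_cat|| * ||h||
    bounds.append(np.linalg.norm(W[:, dog] - W[:, cat]) * np.linalg.norm(h))

max_gap = max(gaps)
```

No context of bounded norm can therefore separate the two words by much, whatever the data requires.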

(9)

Kernel Trick

• widely used in support-vector machine (SVM) and logistic regression

• improve expressiveness by implicitly transforming data into high dimensional feature spaces

[Eric, 2019]

(10)

Kernel Trick Cont.

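• K(x, y) = \langle \phi(x), \phi(y) \rangle

• K_{sq}\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = (x_1 y_1 + x_2 y_2)^2 = \begin{bmatrix} x_1^2, \sqrt{2} x_1 x_2, x_2^2 \end{bmatrix} \begin{bmatrix} y_1^2, \sqrt{2} y_1 y_2, y_2^2 \end{bmatrix}^T = \phi^T\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) \phi\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right)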

• for a kernel function (kernel) to be valid:

– positive semidefinite (PSD) Gram matrix

– corresponds to a scalar product in some feature space

• empirically, non PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005]

→ we do not enforce PSD

• where there is an inner product, there could be an application of the kernel trick

– import vector machine [Zhu and Hastie, 2002]

– multilayer kernel machine [Cho and Saul, 2009]

– gated softmax [Memisevic et al., 2010]
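The squared-kernel identity from this slide is easy to verify numerically: evaluating K_sq in the 2-dimensional input space gives the same value as an inner product in the 3-dimensional feature space. The input vectors below are arbitrary example values.

```python
import numpy as np

def k_sq(x, y):
    """Squared kernel, evaluated directly in the input space."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces k_sq."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
lhs = k_sq(x, y)                          # (1*3 + 2*(-1))^2
rhs = float(np.dot(phi(x), phi(y)))       # inner product in feature space
```

The kernel side never builds the feature vectors, which is what makes the trick attractive when the feature space is large.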

(11)

Non-Euclidean Word Embedding

• Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]

[Athiwaratkun and Wilson, 2017]

(12)

Non-Euclidean Word Embedding Cont.

• hyperbolic (Poincaré) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]

[Nickel and Kiela, 2017]

(13)

Quick Recap

• contextual word classification

• softmax bottleneck

• breaking softmax bottleneck

• geometric explanation of softmax bottleneck

• kernel trick

• non-Euclidean word embedding

→ kernels in softmax

(14)

• Background

• Methodology

• Experiments

• Conclusion

(15)

Generalized Softmax

• model posterior:

p(w_v | h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}

• mixture weight:

\pi_k = \frac{\exp(M_k^T h)}{\sum_{k'=1}^{K} \exp(M_{k'}^T h)}

• nonlinearly transformed context:

f_k(h) = tanh(Q_k^Th)

• with trainable parameters: W ∈ R^{d×V}, M ∈ R^{d×K} and Q_k ∈ R^{d×d}
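A minimal NumPy sketch of the generalized softmax for a single context vector follows. It assumes two example kernels (linear and power), toy dimensions and random parameters. The helper names are hypothetical and this is not the fairseq implementation used in the experiments.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def s_lin(W, c):
    """Linear kernel: W_v^T c for all words v."""
    return W.T @ c

def s_pow(W, c, p=1.0):
    """Power kernel: -||W_v - c||^p for all words v."""
    return -np.linalg.norm(W - c[:, None], axis=0) ** p

def generalized_softmax(W, M, Q, kernels, h):
    """W: (d, V), M: (d, K), Q: list of K (d, d) matrices, kernels: list of K callables."""
    pi = softmax(M.T @ h)                        # mixture weights pi_k, shape (K,)
    p = np.zeros(W.shape[1])
    for k, S_k in enumerate(kernels):
        f_k = np.tanh(Q[k].T @ h)                # transformed context f_k(h)
        p += pi[k] * softmax(S_k(W, f_k))        # softmax over V for component k
    return p

rng = np.random.default_rng(0)
d, V = 8, 30                                     # toy sizes
kernels = [s_lin, s_pow]
K = len(kernels)
W = rng.normal(size=(d, V))                      # word vectors shared across components
M = rng.normal(size=(d, K))
Q = [rng.normal(size=(d, d)) for _ in range(K)]
h = rng.normal(size=d)
p = generalized_softmax(W, M, Q, kernels, h)
```

Because each component is a normalized distribution and the mixture weights sum to one, the result is again a distribution over the vocabulary.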

(16)

Generalized Softmax Cont.

• from previous:

p(w_v | h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}

• note:

– replace inner product with kernels: W_v^T h → S_k(W_v, f_k(h))

– replace single softmax with a mixture: 1 → \sum_{k=1}^{K} \pi_k

– replace context vector with transformed ones: h → f_k(h)

– shared word vectors due to memory restriction

• motivations:

– different kernels give different feature spaces

– based on context, model chooses which feature space is suitable

– for each feature space, the context vector could be different

– ideally, for each feature space, the word vector could also be different

(17)

Individual Kernels

S_{lin}(W_v, h) = W_v^T h

S_{log}(W_v, h) = -\log(\|W_v - h\|^p + 1)

S_{pow}(W_v, h) = -\|W_v - h\|^p

S_{pol}(W_v, h) = (\alpha W_v^T h + c)^p

S_{rbf}(W_v, h) = \exp(-\gamma \|W_v - h\|^2)

S_{wav}(W_v, h) = \cos\left(\frac{\|W_v - h\|^2}{a}\right) \exp\left(-\frac{\|W_v - h\|^2}{b}\right)

S_{ssg}(W_v, h) = \log \int \mathcal{N}(\mu_{W_v}, \Sigma_{W_v}) \mathcal{N}(\mu_h, \Sigma_h)

S_{mog}(W_v, h) = \sum_{i,j} \log \int \mathcal{N}(\mu_{i,W_v}, \Sigma_{i,W_v}) \mathcal{N}(\mu_{j,h}, \Sigma_{j,h})

S_{hpb}(W_v, h) = -\operatorname{acosh}\left(1 + \frac{2 \|W_v - h\|^2}{(1 - \|W_v\|^2)(1 - \|h\|^2)}\right)
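Several of the kernels above translate directly into code. The sketch below covers lin, log, pow, pol, rbf, wav and hpb for a single word vector and context vector. The default hyperparameter values and the example vectors are assumptions, and hpb is written with the inverse hyperbolic cosine of the Poincaré distance, which is only defined for vectors inside the unit ball. The Gaussian kernels ssg and mog are left out.

```python
import numpy as np

def sq_dist(w, h):
    return float(np.sum((w - h) ** 2))

def s_lin(w, h):
    return float(w @ h)

def s_log(w, h, p=2.0):
    return -np.log(sq_dist(w, h) ** (p / 2.0) + 1.0)

def s_pow(w, h, p=2.0):
    return -sq_dist(w, h) ** (p / 2.0)

def s_pol(w, h, alpha=1.0, c=1.0, p=2):
    return (alpha * float(w @ h) + c) ** p

def s_rbf(w, h, gamma=1.0):
    return np.exp(-gamma * sq_dist(w, h))

def s_wav(w, h, a=1.0, b=1.0):
    r2 = sq_dist(w, h)
    return np.cos(r2 / a) * np.exp(-r2 / b)

def s_hpb(w, h):
    # Requires ||w|| < 1 and ||h|| < 1 (points inside the unit ball).
    denom = (1.0 - float(w @ w)) * (1.0 - float(h @ h))
    return -np.arccosh(1.0 + 2.0 * sq_dist(w, h) / denom)

w = np.array([0.3, -0.2])     # example word vector
h = np.array([0.1, 0.4])      # example context vector
scores = {name: fn(w, h) for name, fn in
          [("lin", s_lin), ("log", s_log), ("pow", s_pow), ("pol", s_pol),
           ("rbf", s_rbf), ("wav", s_wav), ("hpb", s_hpb)]}
```

All distance-based kernels reach their maximum when the word vector coincides with the context vector, whereas lin and pol depend on the inner product.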

(18)

Individual Kernels Cont.

• kernel sources:

– baseline: lin

– classic (e.g. from SVM): log, pow, pol, rbf, wav

– from non-Euclidean word embedding: ssg, mog, hpb

• the kernels above share a reduction step over the dimension d and are not immediately executable:

– simplify the wavelet kernel

– use spherical covariance matrices

– rewrite the power of the vector difference norm:

\|W_v - h\|^p = (\|W_v\|^2 + \|h\|^2 - 2 W_v^T h)^{p/2}

\|W_v - h\|^p → a vector-norm-regularized version of the inner product
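The rewrite of the vector difference norm matters once all V word vectors are scored against a batch of N contexts. A naive implementation materializes a d×V×N tensor of differences before reducing over d, whereas the expanded form needs only squared norms and one matrix product. The sketch below compares both on toy sizes; the function names are hypothetical.

```python
import numpy as np

def pairwise_dist_pow_naive(W, H, p):
    """||W_v - h_n||^p via broadcasting; materializes a (d, V, N) tensor."""
    diff = W[:, :, None] - H[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0)) ** p

def pairwise_dist_pow_expanded(W, H, p):
    """Same quantity from squared norms and one matrix product; (V, N) intermediates only."""
    w_sq = (W ** 2).sum(axis=0)[:, None]      # ||W_v||^2, shape (V, 1)
    h_sq = (H ** 2).sum(axis=0)[None, :]      # ||h_n||^2, shape (1, N)
    sq = w_sq + h_sq - 2.0 * (W.T @ H)        # ||W_v - h_n||^2, shape (V, N)
    sq = np.maximum(sq, 0.0)                  # guard against tiny negative rounding errors
    return sq ** (p / 2.0)

rng = np.random.default_rng(0)
d, V, N, p = 16, 40, 25, 1.5                  # toy sizes and an arbitrary power
W = rng.normal(size=(d, V))
H = rng.normal(size=(d, N))
naive = pairwise_dist_pow_naive(W, H, p)
fast = pairwise_dist_pow_expanded(W, H, p)
```

The expanded form also makes the last remark on the slide visible: the score is the inner product W_v^T h combined with the two squared norms.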

(19)

• Background

• Methodology

• Experiments

• Conclusion

(20)

Experimental Setup

• LM on Switchboard:

– 30K vocabulary

– 25M training tokens

– LSTM Recurrent NN [Sundermeyer et al., 2012]

• MT on IWSLT2014 German→English:

– 10K joint byte pair encoding merge operations

– 160K parallel training sentences

– Transformer NN [Vaswani et al., 2017]

• overall:

– grid search hyperparameters of kernels

– vary K and S_k

– fairseq toolkit [Ott et al., 2019]

(21)

Performance of Individual Kernels

Method   SWB (Perplexity)          IWSLT (BLEU [%])
Ref.     47.6 [Irie et al., 2018]  35.2 [Wu et al., 2019]
lin      46.8                      34.3
log      103.0                     0.4
pow      46.8                      32.8
pol      47.3                      31.7
rbf      284.9                     0.0
ssg      49.9                      34.6
mog      46.7                      34.2
hpb      122.6                     0.3
wav      289.7                     0.0

(22)

Gradient Properties

[Figure: kernel value S(W_v, h) (vertical axis) plotted against ||W_v − h||^p (horizontal axis) for the rbf, wav, log and pow kernels]

• expected performance: rbf ≈ wav < log < pow
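One way to read this ordering is through the gradient of each score with respect to the distance r = ||W_v − h||: a kernel that saturates for distant word vectors passes almost no gradient to them. The sketch below estimates |dS/dr| at a large distance by finite differences. The hyperparameters (γ = a = b = p = 1) and the probe distance are illustrative assumptions, and this reading is an interpretation rather than the authors' stated analysis.

```python
import numpy as np

# Scores written as functions of the distance r = ||W_v - h||.
# gamma, a, b and p are all set to 1 here; these are not the tuned values.
kernels = {
    "rbf": lambda r: np.exp(-r ** 2),
    "wav": lambda r: np.cos(r ** 2) * np.exp(-r ** 2),
    "log": lambda r: -np.log(r + 1.0),
    "pow": lambda r: -r,
}

def grad_magnitude(f, r, eps=1e-5):
    """Central finite-difference estimate of |dS/dr|."""
    return abs(f(r + eps) - f(r - eps)) / (2.0 * eps)

r_far = 5.0   # a word vector far away from the context vector
grads = {name: grad_magnitude(f, r_far) for name, f in kernels.items()}
```

Under these assumptions the gradient magnitudes for rbf and wav are vanishingly small, the one for log decays like 1/(r + 1), and the one for pow stays constant, matching the order rbf ≈ wav < log < pow.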

(23)

Performance of Individual Kernels Revisited

Method   SWB (Perplexity)          IWSLT (BLEU [%])
Ref.     47.6 [Irie et al., 2018]  35.2 [Wu et al., 2019]
lin      46.8                      34.3
log      103.0                     0.4
pow      46.8                      32.8
pol      47.3                      31.7
rbf      284.9                     0.0
ssg      49.9                      34.6
mog      46.7                      34.2
hpb      122.6                     0.3
wav      289.7                     0.0

(24)

Mixture of Kernels

• encourage equal contribution of mixture components, similar to [Takase et al., 2018]:

L_{reg} = L_{ce} + \frac{\rho}{N} \sum_{N} \mathrm{Var}_N(\pi_k)

• compromise between regularization and performance → ρ = 0.1

ρ      Variance  PPL
0.001  4.74      46.8
0.01   4.98      46.6
0.1    3.67      47.2
1      3.81      47.4

(25)

Mixture of Kernels Cont.

Name     Mixture settings         PPL
mos*     9×lin                    47.8
mix_big  1 of each kernel         47.1
mix_1    lin, log, rbf, hpb, wav  46.6
mix_2    3×lin, log               46.5
mix_3    lin, log, pow, pol       47.3
mix_4    lin, log, rbf, hpb       47.1
mix_5    2×lin, 2×rbf             46.7

* tricks such as activation regularization and averaged stochastic gradient descent are not applied

• no major performance difference:

– at least one lin mixture component

– consistently, a total mixture weight of more than 50% on the lin component

– tied projection matrices across mixture components

(26)

Disambiguation Abilities

• automatic extraction of word clusters:

– recall the dog and cat example

– extract projection matrix

– calculate all pairwise distances between word vectors

– for each word, sum the distances of five closest neighbors

– sort the results and pick the words with the smallest values

– make sure each word in the clusters appears at least 500 times

• three extracted clusters:

– {quickly, slowly, soon, quick, easily}

– {democrat, republicans, politicians, republican, democrats}

– {hamster, hamsters, parakeets, rabbits, turtles}
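The extraction procedure above can be sketched as follows. The function name, the Euclidean distance and the choice to report a seed word together with its nearest neighbors are assumptions, since the slide does not specify how a selected word is turned into a cluster. The toy vocabulary and vectors are made up for the example.

```python
import numpy as np

def extract_clusters(W, counts, vocab, n_neighbors=5, min_count=500, n_clusters=3):
    """W: (V, d) word vectors as rows, counts: (V,) corpus frequencies, vocab: list of V words."""
    sq = (W ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (W @ W.T), 0.0))
    np.fill_diagonal(dist, np.inf)                    # a word is not its own neighbor

    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]
    tightness = np.take_along_axis(dist, neighbors, axis=1).sum(axis=1)

    clusters = []
    for v in np.argsort(tightness):                   # smallest summed distance first
        members = [v, *neighbors[v]]
        if all(counts[m] >= min_count for m in members):
            clusters.append([vocab[m] for m in members])
        if len(clusters) == n_clusters:
            break
    return clusters

rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(12)]                  # hypothetical vocabulary
W = rng.normal(size=(12, 4)) * 5.0
W[:6] = W[0] + 0.01 * rng.normal(size=(6, 4))         # w0..w5 form one tight group
counts = np.full(12, 1000)
clusters = extract_clusters(W, counts, vocab, n_clusters=1)
```

Seeds from the same tight group yield near-identical clusters, so a practical version would probably deduplicate them before inspection.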

(27)

Disambiguation Abilities Cont.

Model         Prediction
Ground Truth  ... books can end up being outdated very quickly
lin           ... books can end up being outdated very soon
mix_big       ... books can end up being outdated very quickly

Ground Truth  ... if you vote for a republican or vote for a democrat
lin           ... if you vote for a republican or vote for a republican
mix_big       ... if you vote for a republican or vote for a democrat

Ground Truth  ... well we had some turtles and a hamster
lin           ... well we had some turtles and a turtle
mix_big       ... well we had some turtles and a hamster

• manual inspection of appearances of words in the clusters

• examples where mix_big is right and lin is wrong

• need a more systematic way to judge if a model is better at word disambiguation

(28)

• Background

• Methodology

• Experiments

• Conclusion

(29)

Conclusion and Outlook

• softmax bottleneck: theory and intuition

• generalized softmax: introducing kernels and mixture of kernels

• individual kernels: 9 different kernels, lin, pol, pow, ssg and mog perform the best

• gradient properties: justification of why some kernels perform better than others

• mixture of kernels: consistent large mixture weight of lin when sharing word vectors

• disambiguation abilities: interesting observations, but a more systematic way of evaluation is lacking

• outlook: untie the word embeddings

(30)

Any questions?

(31)

Athiwaratkun, B. and Wilson, A. G. (2017).

Multimodal word distributions.

In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656, Vancouver, Canada.

Boughorbel, S., Tarel, J.-P., and Boujemaa, N. (2005).

Conditionally positive definite kernels for svm based image recognition.

In 2005 IEEE International Conference on Multimedia and Expo, pages 113–116. IEEE.

Cho, Y. and Saul, L. K. (2009).

Kernel methods for deep learning.

In Advances in neural information processing systems, pages 342–350.

Dhingra, B., Shallue, C., Norouzi, M., Dai, A., and Dahl, G. (2018).

Embedding text in hyperbolic spaces.

In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69.

Eric, K. (2019).

Illustration of the kernel trick.

http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html.

Accessed: 2019.10.31.

Herold, C., Gao, Y., and Ney, H. (2018).

Improving neural language models with weight norm initialization and regularization.

In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100.

Irie, K., Lei, Z., Deng, L., Schlüter, R., and Ney, H. (2018).

Investigation on estimation of sentence probability by combining forward, backward and bi-directional lstm-rnns.

Proc. Interspeech 2018, pages 392–395.

(32)

Kanai, S., Fujiwara, Y., Yamanaka, Y., and Adachi, S. (2018).

Sigsoftmax: Reanalysis of the softmax bottleneck.

In 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada.

Lin, H.-T. and Lin, C.-J. (2003).

A study on sigmoid kernels for svm and the training of non-psd kernels by smo-type methods.

Technical report.

Memisevic, R., Zach, C., Pollefeys, M., and Hinton, G. E. (2010).

Gated softmax classification.

Nickel, M. and Kiela, D. (2017).

Poincaré embeddings for learning hierarchical representations.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019).

fairseq: A fast, extensible toolkit for sequence modeling.

In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, MN, USA.

Sundermeyer, M., Schlüter, R., and Ney, H. (2012).

Lstm neural networks for language modeling.

In Thirteenth annual conference of the international speech communication association, pages 194–197, Portland, OR, USA.

Takase, S., Suzuki, J., and Nagata, M. (2018).

Direct output connection for a high-rank language model.

In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609.

(33)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).

Attention is all you need.

In Advances in Neural Information Processing Systems, pages 5998–6008.

Vilnis, L. and McCallum, A. (2015).

Word representations via gaussian embedding.

In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.

Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. (2019).

Pay less attention with lightweight and dynamic convolutions.

In Seventh International Conference on Learning Representations (ICLR), New Orleans, LA, USA.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2018).

Breaking the softmax bottleneck: A high-rank rnn language model.

In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada.

Zhu, J. and Hastie, T. (2002).

Kernel logistic regression and the import vector machine.