• No results found

Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification. Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney

N/A
N/A
Protected

Academic year: 2022

Share "Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification. Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney"

Copied!
33
0
0

Loading.... (view fulltext now)

Full text

(1)

for Contextual Word Classification

Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney

(2)

Background

Methodology

Experiments

Conclusion

(3)

Contextual Word Classification

language modeling (LM) and machine translation (MT)

dominated by neural networks (NN)

p(w1T|...) = p(w0)QT

1 p(wt|w0t−1, ...) = QT

1 p(wt|ht)

despite many choices to learn the context vector h: – feed-forward NN

– recurrent NN – convolutional NN – self-attention NN – ...

output is often modeled with standard softmax and trained with cross entropy:

p(wv|h) = exp(WvTh)/PV

v0=1exp(WvT0h) L = − log p(wv|h)

(4)

Softmax Bottleneck [Yang et al., 2018]

from previous:

L = − log p(wv|h) = − log



exp(WvTh)/PV

v0=1 exp(WvT0h)



exponential-and-logarithm calculation:

log p(wv|h) + logPV

v0=1exp(WvT0h) = WvTh

approximate true log posteriors with inner products:

log ˜p(wv|h) + Ch ≈ WvTh

(5)

Softmax Bottleneck Cont.

from previous: log ˜p(wv|h) + Ch ≈ WvTh

in matrix form:

log

˜

p(w1|h1) . . . ˜p(w1|hN) ... . . . ...

˜

p(wV|h1) . . . ˜p(wV|hN)

V×N

+ Ch1 . . .ChN

V×N

| {z }

rank ∼V

 W1T

...

WVT

V×d

h1. . .hN

d×N

| {z }

rank ∼d

factorization the true log posterior matrix: log ˜P + C ≈ WTH

Softmax Bottleneck:

– log ˜P is high-rank for natural language: rank(log ˜P) ∼ VC decreases the rank of left-hand side by maximum 1

– rank of WTH is bounded by hidden dimension: rank(WTH) ∼ d – typically V ∼ 100, 000 and d ∼ 1000

≈ → 6≈

(6)

Breaking Softmax Bottleneck

mixture-of-softmax (mos) [Yang et al., 2018]:

pmos(wv|h) =

K

X

k=1

πk exp(WvTfk(h)) PV

v0=1exp(WvT0fk(h))

sigsoftmax [Kanai et al., 2018]:

psigsoftmax(wv|h) = exp(WvTh)σ(WvTh) PV

v0=1exp(WvT0h)σ(WvT0h)

weight norm regularization [Herold et al., 2018]:

Lwnr = −X

data

X

V

log(pmos(wv|h)) + ρ v u u t

V

X

v=1

(||Wv||2 − ν)2

(7)

Breaking Softmax Bottleneck Cont.

let z = WvTh

theoretically, to break softmax bottleneck with activation g(z) [Kanai et al., 2018]:

– nonlinearity of log(g(z)) – numerical stable

– non-negative

– monotonically increasing

(8)

Geometric Explanation of Softmax Bottleneck

an intuitive example:

– ˜p(dog|a common home pet is ...) ≈ ˜p(cat|a common home pet is ...) ≈ 50%

– learned word vectors are close: WdogWcat

– posteriors over dog and cat are thus close: p(dog|...) ≈ p(cat|...) – there exist contexts that would fail the model:

– overall an expressiveness problem

(9)

Kernel Trick

widely used in support-vector machine (SVM) and logistic regression

improve expressiveness by implicitly transforming data into high dimensional feature spaces

[Eric, 2019]

(10)

Kernel Trick Cont.

K(x,y) =< φ(x), φ(y) >

Ksq(x1 x2



,y1 y2



) = (x1y1 +x2y2)2 = x12,√

2x1x2,x22

y12,√

2y1y2,y22T

= φT x1 x2



φy1 y2



for a kernel function (kernel) to be valid:

– positive semidefinite (PSD) Gram matrix

– corresponds to a scalar product in some feature space

empirically, non PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005]

→ we do not enforce PSD

where there is an inner product, there could be an application of the kernel trick – import vector machine [Zhu and Hastie, 2002]

– multilayer kernel machine [Cho and Saul, 2009]

– gated softmax [Memisevic et al., 2010]

(11)

Non-Euclidean Word Embedding

Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]

[Athiwaratkun and Wilson, 2017]

(12)

Non-Euclidean Word Embedding Cont.

hyperbolic (Poincar´e) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]

[Nickel and Kiela, 2017]

(13)

Quick Recap

contextual word classification

softmax bottleneck

breaking softmax bottleneck

geometric explanation of softmax bottleneck

kernel trick

non-Euclidean word embedding

→ kernels in softmax

(14)

Background

Methodology

Experiments

Conclusion

(15)

Generalized Softmax

model posterior:

p(wv|h) =

K

X

k=1

πk exp(Sk(Wv,fk(h))) PV

v0=1exp(Sk(Wv0,fk(h)))

mixture weight:

πk = exp(MkTh) PK

k0=1 exp(MkT0h)

nonlinearly transformed context:

fk(h) = tanh(QkTh)

with trainable parameters: W ∈ Rd×V, M ∈ Rd×K and Qk ∈ Rd×d

(16)

Generalized Softmax Cont.

from previous:

p(wv|h) =

K

X

k=1

πk exp(Sk(Wv,fk(h))) PV

v0=1exp(Sk(Wv0,fk(h)))

note:

– replace inner product with kernels: WvTh → Sk(Wv,fk(h)) – replace single softmax with a mixture: 1 → PKk=1 πk

– replace context vector with transformed ones: hfk(h) – shared word vectors due to memory restriction

motivations:

– different kernels give different feature spaces

– based on context, model chooses which feature space is suitable – for each feature space, the context vector could be different

– ideally, for each feature space, the word vector could also be different

(17)

Individual Kernels

Slin(Wv, h) = WvTh

Slog(Wv, h) = − log(||Wv − h||p + 1) Spow(Wv, h) = −||Wv − h||p

Spol(Wv, h) = (αWvTh + c)p

Srbf(Wv, h) = exp(−γ||Wv − h||2) Swav(Wv, h) = cos(||Wv − h||2

a ) exp(−||Wv − h||2

b )

Sssg(Wv, h) = log Z

N(µWv, ΣWv)N(µh, Σh) Smog(Wv, h) = X

i ,j

log Z

N(µi ,Wv, Σi ,Wv)N(µj ,h, Σj ,h)

Shpb(Wv, h) = −acos(1 + 2||Wv − h||2

(1 − ||Wv||2)(1 − ||h||2))

(18)

Individual Kernels Cont.

kernel sources:

– baseline: lin

– classic (e.g. from SVM): log, pow, pol, pol, rbf, wav – from non-Eulidean word embedding: ssg, mog, hpb

dimension reduction step in d common to kernels above not immediately executable – simplify the wavelet kernel

– use spherical covariance matrices

– rewrite power of vector difference norm:

||Wvh||p = (||Wv||2 + ||h||2 − 2WvTh)p2

||Wvh||p → a vector norm regularized version of the inner product

(19)

Background

Methodology

Experiments

Conclusion

(20)

Experimental Setup

LM on Switchboard:

– 30K vocabulary – 25M training tokens

– LSTM Recurrent NN [Sundermeyer et al., 2012]

MT on IWSLT2014 German→English:

– 10K joint byte pair encoding merge operations – 160K parallel training sentences

– Transformer NN [Vaswani et al., 2017]

overall:

– grid search hyperparameters of kernels – vary K and Sk

– fairseq toolkit [Ott et al., 2019]

(21)

Performance of Individual Kernels

Method SWB IWSLT

(Perplexity) (Bleu

[%]

) Ref. [Irie et al., 2018] [Wu et al., 2019]

47.6 35.2

lin 46.8 34.3

log 103.0 0.4

pow 46.8 32.8

pol 47.3 31.7

rbf 284.9 0.0

ssg 49.9 34.6

mog 46.7 34.2

hpb 122.6 0.3

wav 289.7 0.0

(22)

Gradient Properties

0 2 4 6

||Wv − h||p -6

-4 -2 0 2

S(Wv,h)

rbf wav log pow

expected performance: rbf ≈ wav < log < pow

(23)

Performance of Individual Kernels Revisited

Method SWB IWSLT

(Perplexity) (Bleu

[%]

) Ref. [Irie et al., 2018] [Wu et al., 2019]

47.6 35.2

lin 46.8 34.3

log 103.0 0.4

pow 46.8 32.8

pol 47.3 31.7

rbf 284.9 0.0

ssg 49.9 34.6

mog 46.7 34.2

hpb 122.6 0.3

wav 289.7 0.0

(24)

Mixture of Kernels

encourage equal contribution of mixture components, similar to [Takase et al., 2018]:

Lreg = Lce + ρ N

X

N

VarNk)

compromise between regularization and performance → ρ = 0.1

ρ Variance PPL

0.001 4.74 46.8

0.01 4.98 46.6

0.1 3.67 47.2

1 3.81 47.4

(25)

Mixture of Kernels Cont.

Name Mixture settings PPL

mos* 9×lin 47.8

mix

big

1 of each kernel 47.1 mix

1

lin, log, rbf, hpb, wav 46.6

mix

2

3×lin, log 46.5

mix

3

lin, log, pow, pol 47.3 mix

4

lin, log, rbf, hpb 47.1 mix

5

2×lin, 2×rbf 46.7

* tricks like activation regularization and average stochastic gradient descent are not applied

no major performance difference:

– at least one lin mixture component

– consistently, a total mixture weight of more than 50% in lin component – tied projection matrices across mixture components

(26)

Disambiguation Abilities

automatic extraction of word clusters:

– recall the dog and cat example – extract projection matrix

– calculate all pairwise distances between word vectors – for each word, sum the distances of five closest neighbors – sort the results and pick the words with the smallest values – make sure each word in the clusters appears at least 500 times

three extracted clusters:

– {quickly, slowly, soon, quick, easily}

– {democrat, republicans, politicians, republican, democrats}

– {hamster, hamsters, parakeets, rabbits, turtles}

(27)

Disambiguation Abilities Cont.

Model Prediction

Ground Truth ... books can end up being outdated very quickly lin ... books can end up being outdated very soon mix

big

... books can end up being outdated very quickly Ground Truth ... if you vote for a republican or vote for a democrat lin ... if you vote for a republican or vote for a republican mix

big

... if you vote for a republican or vote for a democrat Ground Truth ... well we had some turtles and a hamster lin ... well we had some turtles and a turtle mix

big

... well we had some turtles and a hamster

manual inspection of appearances of words in the clusters

examples where mixbig is right and lin is wrong

need a more systematic way to judge if a model is better at word disambiguation

(28)

Background

Methodology

Experiments

Conclusion

(29)

Conclusion and Outlook

softmax bottleneck: theory and intuition

generalized softmax: introducing kernels and mixture of kernels

individual kernels: 9 different kernels, lin, pol, pow, ssg and mog perform the best

gradient properties: justification of why some kernels perform better than others

mixture of kernels: consistent large mixture weight of lin when sharing word vectors

disambiguition abilities: interesting observations, lack a more systematic way for evaluation

outlook: untie the word embeddings

(30)

Any questions?

(31)

Athiwaratkun, B. and Wilson, A. G. (2017).

Multimodal word distributions.

In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656, Vancouver, Canada.

Boughorbel, S., Tarel, J.-P., and Boujemaa, N. (2005).

Conditionally positive definite kernels for svm based image recognition.

In 2005 IEEE International Conference on Multimedia and Expo, pages 113–116. IEEE.

Cho, Y. and Saul, L. K. (2009).

Kernel methods for deep learning.

In Advances in neural information processing systems, pages 342–350.

Dhingra, B., Shallue, C., Norouzi, M., Dai, A., and Dahl, G. (2018).

Embedding text in hyperbolic spaces.

In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69.

Eric, K. (2019).

Illustration of the kernel trick.

http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html.

Accessed: 2019.10.31.

Herold, C., Gao, Y., and Ney, H. (2018).

Improving neural language models with weight norm initialization and regularization.

In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100.

Irie, K., Lei, Z., Deng, L., Schl¨uter, R., and Ney, H. (2018).

Investigation on estimation of sentence probability by combining forward, backward and bi-directional lstm-rnns.

Proc. Interspeech 2018, pages 392–395.

(32)

Kanai, S., Fujiwara, Y., Yamanaka, Y., and Adachi, S. (2018).

Sigsoftmax: Reanalysis of the softmax bottleneck.

In 32nd Conference on Neural Information Processing Systems (NeurIPS), Montr´eal, Canada.

Lin, H.-T. and Lin, C.-J. (2003).

A study on sigmoid kernels for svm and the training of non-psd kernels by smo-type methods.

Technical report.

Memisevic, R., Zach, C., Pollefeys, M., and Hinton, G. E. (2010).

Gated softmax classification.

In Advances in neural information processing systems, pages 1603–1611.

Nickel, M. and Kiela, D. (2017).

Poincar´e embeddings for learning hierarchical representations.

In Advances in neural information processing systems, pages 6338–6347.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019).

fairseq: A fast, extensible toolkit for sequence modeling.

In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, MN, USA.

Sundermeyer, M., Schl¨uter, R., and Ney, H. (2012).

Lstm neural networks for language modeling.

In Thirteenth annual conference of the international speech communication association, pages 194–197, Portland, OR, USA.

Takase, S., Suzuki, J., and Nagata, M. (2018).

Direct output connection for a high-rank language model.

In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609.

(33)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).

Attention is all you need.

In Advances in Neural Information Processing Systems, pages 5998–6008.

Vilnis, L. and McCallum, A. (2015).

Word representations via gaussian embedding.

In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.

Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. (2019).

Pay less attention with lightweight and dynamic convolutions.

In Seventh International Conference on Learning Representations (ICLR), New Orleans, LA, USA.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2018).

Breaking the softmax bottleneck: A high-rank rnn language model.

In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada.

Zhu, J. and Hastie, T. (2002).

Kernel logistic regression and the import vector machine.

In Advances in neural information processing systems, pages 1081–1088.

References

Related documents

The Scottish Early Rheumatoid Arthritis (SERA) study was designed to create a national inception cohort of patients with newly diagnosed RA or undifferentiated arthritis (UA),

In our study, both α -TOS and 2-DG, when administered as single agents, were limited to inhibit the proliferation of HT29, HeLa and A549 tumor cells, and when administered in

This near linear trend in Tafel slope for varying mole fractions would suggest a linear superposition type behavior of the two mixed oxides i.e.. two different

The jumping apparatus is localized in the hind legs and formed by the femur, tibia, femoro-tibial joint, modified metafemoral extensor tendon, extensor ligament, tibial flexor

He observed that the bright red oxyhaemoglobin present in Anisops pellucens at the beginning of a dive slowly changed to a dark purple once the insect had reached neutral

Resting metabolic rate was positively correlated with whole body mass in all groups, and female mice faced with both low and high foraging costs had similar mass- specific

Like bats, toothed whales terminate a prey capture with a buzz composed of clicks with lower source level and higher repetition rate (Miller et al., 1995; Madsen et al., 2002; Miller

The goal of this research is to provide recommendations for the key requirements specification of an information system that supports planned and unplanned use of the shared meeting