Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency


Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3698–3707 Brussels, Belgium, October 31 - November 4, 2018. c©2018 Association for Computational Linguistics. 3698. Noise Contrastive Estimation and Negative Sampling for Conditional. Models: Consistency and Statistical Efficiency. Zhuang Ma. University of Pennsylvania⇤ Michael Collins. Google AI Language and Columbia University† Abstract. Noise Contrastive Estimation (NCE) is a pow- erful parameter estimation method for log- linear models, which avoids calculation of the partition function or its derivatives at each training step, a computationally demanding step in many cases. It is closely related to negative sampling methods, now widely used in NLP. This paper considers NCE-based es- timation of conditional models. Conditional models are frequently encountered in practice; however there has not been a rigorous theo- retical analysis of NCE in this setting, and we will argue there are subtle but important ques- tions when generalizing NCE to the condi- tional case. In particular, we analyze two vari- ants of NCE for conditional models: one based on a classification objective, the other based on a ranking objective. We show that the ranking- based variant of NCE gives consistent param- eter estimates under weaker assumptions than the classification-based method; we analyze the statistical efficiency of the ranking-based and classification-based variants of NCE; fi- nally we describe experiments on synthetic data and language modeling showing the ef- fectiveness and trade-offs of both methods.. 1 Introduction. This paper considers parameter estimation in con- ditional models of the form. p(y|x; ✓) = exp (s(x, y; ✓)). Z(x; ✓) (1). where s(x, y; ✓) is the unnormalized score of label y in conjunction with input x under parameters ✓, Y is a finite set of possible labels, and Z(x; ✓) = P. y2Y exp (s(x, y; ✓)) is the partition function for input x under parameters ✓.. It is hard to overstate the importance of models of this form in NLP. In log-linear models, includ- ing both the original work on maximum-entropy models (Berger et al., 1996), and later work on conditional random fields (Lafferty et al., 2001),. ⇤Part of this work done at Google. †Work done at Google.. the scoring function s(x, y; ✓) = ✓ · f(x, y) where f(x, y) 2 Rd is a feature vector, and ✓ 2 Rd are the parameters of the model. In more recent work on neural networks the function s(x, y; ✓) is a non- linear function. In Word2Vec the scoring function is s(x, y; ✓) = ✓. x. · ✓0 y. where y is a word in the context of word x, and ✓. x. 2 Rd and ✓0 y. 2 Rd are “inside” and “outside” word embeddings x and y.. In many NLP applications the set Y is large. Maximum likelihood estimation (MLE) of the pa- rameters ✓ requires calculation of Z(x; ✓) or its derivatives at each training step, thereby requiring a summation over all members of Y, which can be computationally expensive. This has led to many authors considering alternative methods, often re- ferred to as “negative sampling methods”, where a modified training objective is used that does not require summation over Y on each example. In- stead negative examples are drawn from some dis- tribution, and a objective function is derived based on binary classification or ranking. Prominent ex- amples are the binary objective used in word2vec ((Mikolov et al., 2013), see also (Levy and Gold- berg, 2014)), and the Noise Contrastive Estima- tion methods of (Mnih and Teh, 2012; Jozefowicz et al., 2016) for estimation of language models.. In spite of the centrality of negative sampling methods, they are arguably not well understood from a theoretical standpoint. There are clear connections to noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2012), a negative sam- pling method for parameter estimation in joint models of the form. p(y) = exp (s(y; ✓)). Z(✓) ; Z(✓) =. X. y2Y exp (s(y; ✓)). (2) However there has not been a rigorous theoretical analysis of NCE in the estimation of conditional models of the form in Eq. 1, and we will argue there are subtle but important questions when gen- eralizing NCE to the conditional case. In partic- ular, the joint model in Eq 2 has a single parti- tion function Z(✓) which is estimated as a param-. 3699. eter of the model (Gutmann and Hyvärinen, 2012) whereas the conditional model in Eq 1 has a sepa- rate partition function Z(x; ✓) for each value of x. This difference is critical.. We show the following (throughout we define K � 1 to be the number of negative examples sampled per training example): • For any K � 1, a binary classification variant. of NCE, as used by (Mnih and Teh, 2012; Mikolov et al., 2013), gives consistent parameter estimates under the assumption that Z(x; ✓) is constant with respect to x (i.e., Z(x; ✓) = H(✓) for some func- tion H). Equivalently, the method is consistent under the assumption that the function s(x, y; ✓) is powerful enough to incorporate log Z(x; ✓). • For any K � 1, a ranking-based variant of. NCE, as used by (Jozefowicz et al., 2016), gives consistent parameter estimates under the much weaker assumption that Z(x; ✓) can vary with x. Equivalently, there is no need for s(x, y; ✓) to be powerful enough to incorporate log Z(x; ✓). • We analyze the statistical efficiency of the. ranking-based and classification-based NCE vari- ants. Under respective assumptions, both vari- ants achieve Fisher efficiency (the same asymp- totic mean square error as the MLE) as K ! 1. • We discuss application of our results to ap-. proaches of (Mnih and Teh, 2012; Mikolov et al., 2013; Levy and Goldberg, 2014; Jozefowicz et al., 2016) giving a unified account of these methods. • We describe experiments on synthetic data. and language modeling evaluating the effective- ness of the two NCE variants. 2 Basic Assumptions. We assume the following setup throughout: • We have sets X and Y, where X , Y are finite. • There is some unknown joint distribution. p X,Y. (x, y) where x 2 X and y 2 Y. We assume that the marginal distributions satisfy p. X. (x) > 0 for all x 2 X and p. Y. (y) > 0 for all y 2 Y. • We have training examples {x(i), y(i)}n. i=1 drawn I.I.D. from p. X,Y. (x, y). • We have a scoring function s(x, y; ✓) where. ✓ are the parameters of the model. For example, s(x, y; ✓) may be defined by a neural network. • We use ⇥ to refer to the parameter space. We. assume that ⇥ ✓ Rd for some integer d. • We use p. N. (y) to refer to a distribution from which negative examples are drawn in the NCE approach. We assume that p. N. satisfies p N. (y) > 0 for all y 2 Y.. We will consider estimation under the following two assumptions:. Assumption 2.1 There exists some parameter. value ✓⇤ 2 ⇥ such that for all (x, y) 2 X ⇥ Y,. p Y |X(y|x) =. exp(s(x, y; ✓⇤)). Z(x; ✓⇤) (3). where Z(x; ✓⇤) = P. y2Y exp(s(x, y; ✓ ⇤ )).. Assumption 2.2 There exists some parameter. value ✓⇤ 2 ⇥, and a constant �⇤ 2 R, such that for all (x, y) 2 X ⇥ Y,. p Y |X(y|x) = exp (s(x, y; ✓⇤) � �⇤) . (4). Assumption 2.2 is stronger than Assump- tion 2.1. It requires log Z(x; ✓⇤) ⌘ �⇤ for all x 2 X , that is, the conditional distribution is per- fectly self-normalized. Under Assumption 2.2, it must be the case that 8x 2 X X. y. p Y |X(y|x) =. X. y. exp{s(x, y; ✓⇤) � �⇤} = 1. There are |X| constraints but only d + 1 free pa- rameters. Therefore self-normalization is a non- trivial assumption when |X| � d. In the case of language modeling, |X| = |V |k � d + 1, where |V | is the vocabulary size and k is the length of the context. The number of constraints grows ex- ponentially fast.. Given a scoring function s(x, y; ✓) that satisfies assumption 2.1, we can derive a scoring function s0 that satisfies assumption 2.2 by defining. s0(x, y; ✓, {c x. : x 2 X}) = s(x, y; ✓) � c x. where c x. 2 R is a parameter for history x. Thus we introduce a new parameter c. x. for each possible history x. This is the most straightforward exten- sion of NCE to the conditional case; it is used by (Mnih and Teh, 2012). It has the clear drawback however of introducing a large number of addi- tional parameters to the model.. 3 Two Estimation Algorithms. Figure 1 shows two NCE-based parameter esti- mation algorithms, based respectively on binary objective and ranking objective. The input to either algorithm is a set of training examples {x(i), y(i)}n. i=1, a parameter K specifying the num- ber of negative examples per training example, and. 3700. a distribution p N. (·) from which negative exam- ples are sampled. The algorithms differ only in the choice of objective function being optimized: Ln. B. for binary objective, and Ln R. for ranking objective. Binary objective essentially corresponds to a prob- lem where the scoring function s(x, y; ✓) is used to construct a binary classifier that discriminates between positive and negative examples. Rank- ing objective corresponds to a problem where the scoring function s(x, y; ✓) is used to rank the true label y(i) above negative examples y(i,1) . . . y(i,K). for the input x(i). Our main result is as follows:. Theorem 3.1 (Informal: see section 4 for a for-. mal statement.) For any K � 1, the binary classification-based algorithm in figure 1 is con-. sistent under Assumption 2.2, but is not always. consistent under the weaker Assumption 2.1. For. any K � 1, the ranking-based algorithm in fig- ure 1 is consistent under either Assumption 2.1 or. Assumption 2.2. Both algorithms achieve the same. statistical efficiency as the maximum-likelihood. estimate as K ! 1.. The remainder of this section gives a sketch of the argument underlying consistency, and dis- cusses use of the two algorithms in previous work.. 3.1 A Sketch of the Consistency Argument. for the Ranking-Based Algorithm. In this section, in order to develop intuition under- lying the ranking algorithm, we give a proof sketch of the following theorem:. Theorem 3.2 (First part of theorem 4.1 below.). Define L1 R. (✓) = E[Ln R. (✓)]. Under Assump- tion 2.1,. ¯✓ 2 arg max ✓. L1 R. (✓) if and only if, for all (x, y) 2 X ⇥ Y,. p Y |X(y|x) = exp(s(x, y; ¯✓))/Z(x, ¯✓).. This theorem is key to the consistency argument. Intuitively as n increases Ln. R. (✓) converges to L1 R. (✓), and the output to the algorithm converges to ✓0 such that p(y|x; ✓0) = p. Y |X(y|x) for all x, y. Section 4 gives a formal argument.. We now give a proof sketch for theo- rem 3.2. Consider the algorithm in figure 1. For convenience define ȳ(i) to be the vector (y(i,0), y(i,1), . . . , y(i,K)). Define ↵(x, ȳ) =. Inputs: Training examples {x(i), y(i)}n i=1, sampling dis-. tribution p N. (·) for generating negative examples, an in- teger K specifying the number of negative examples per training example, a scoring function s(x, y; ✓). Flags {BINARY = true, RANKING = false} if binary classifica- tion objective is used, {BINARY = false, RANKING = true} if ranking objective is used.. Definitions: Define s̄(x, y; ✓) = s(x, y; ✓) � log p N. (y). Algorithm:. • For i = 1 . . . n, k = 1 . . . K, draw y(i,k) I.I.D. from the distribution p. N. (y). For convenience define y. (i,0) = y. (i).. • If RANKING, define the ranking objective function. Ln R. (✓) =. 1. n. n. X. i=1. log. exp(s̄(x. (i) , y. (i,0) ; ✓)). P. K. k=0 exp(s̄(x (i) , y. (i,k) ; ✓)). ,. and the estimator b✓ R. = argmax. ✓2⇥ Ln. R. (✓).. • If BINARY, define the binary objective function. Ln B. (✓, �) =. 1. n. n. X. i=1. n. log g(x. (i) , y. (i,0) ; ✓, �). +. K. X. k=1. log. ⇣. 1 � g(x(i), y(i,k); ✓, �) ⌘o. ,. and estimator (b✓ B. , b�. B. ) = argmax. ✓2⇥,�2� Ln. B. (✓, �), where. g(x, y; ✓, �) =. exp (s̄(x, y; ✓) � �) exp (s̄(x, y; ✓) � �) + K. .. • Define b✓ = b✓ R. if RANKING and b✓ = b✓ B. otherwise. Return b✓ and. bp. Y |X(y|x) = exp(s(x, y;. b. ✓)). P. y2Y exp(s(x, y; b. ✓)). Figure 1: Two NCE-based estimation algorithms, using ranking objective and binary objective respectively.. P. K. k=0 pX,Y (x, ȳk) Q. j 6=k pN(ȳj), and. q(k|x, ȳ; ✓) = exp(s̄(x, ȳ. k. ; ✓)) P. K. k=0 exp(s̄(x, ȳk; ✓)) ,. �(k|x, ȳ) = p X,Y. (x, ȳ k. ). Q. j 6=k pN(ȳj). ↵(x, ȳ). =. p Y |X(ȳk|x)/pN(ȳk). P. N. k=0 pY |X(ȳk|x)/pN(ȳk). C(x, ȳ; ✓) = � K. X. k=0. �(k|x, ȳ) log q(k|x, ȳ; ✓). Intuitively, q(·|x, ȳ; ✓) and �(·|x, ȳ) are posterior. 3701. distributions over the true label k 2 {0 . . . K} given an input x, ȳ, under the parameters ✓ and the true distributions p. X,Ȳ. (x, ȳ) respectively; C(x, ȳ; ✓) is the negative cross-entropy between these two distributions.. The proof of theorem 3.2 rests on two iden- tities. The first identity states that the objective function is the expectation of the negative cross- entropy w.r.t. the density function 1. K+1↵(x, ȳ) (see Section B.1.1 of the supplementary material for derivation):. L1 R. (✓) = X. x. X. ȳ. 1. K + 1 ↵(x, ȳ)C(x, ȳ; ✓). (5). The second identity concerns the relationship be- tween q(·|x, ȳ; ✓) and �(·|x, ȳ). Under assump- tion 2.1, for all x, ȳ, k 2 {0 . . . K},. q(k|x, ȳ; ✓⇤). =. p Y |X(ȳk|x)Z(x; ✓⇤)/pN(yk). P. K. k=0 pY |X(ȳk|x)Z(x; ✓⇤)/pN(yk) = �(k|x, ȳ) (6). It follows immediately through the properties of negative cross entropy that. 8x, ȳ, ✓⇤ 2 argmax ✓. C(x, ȳ; ✓) (7). The remainder of the argument is as follows: • Eqs. 7 and 5 imply that ✓⇤ 2 argmax. ✓. L1 R. (✓). • Assumption 2.1 implies that ↵(x, ȳ) > 0 for. all x, ȳ. It follows that any ✓0 2 arg max ✓. L1 R. (✓) satisfies. for all x, ȳ, k, (8) q(k|x, ȳ; ✓0) = q(k|x, ȳ; ✓⇤) = �(k|x, ȳ). Otherwise there would be some x, ȳ such that C(x, ȳ; ✓0) < C(x, ȳ; ✓⇤). • Eq. 8 implies that 8x, y, p(y|x; ✓0) =. p(y|x; ✓⇤). See the proof of lemma B.3 in the sup- plementary material.. In summary, the identity in Eq. 5 is key: the ob- jective function in the limit, L1. R. (✓), is related to a negative cross-entropy between the underlying distribution �(·|x, ȳ) and a distribution under the parameters, q(·|x, ȳ; ✓). The parameters ✓⇤ maxi- mize this negative cross entropy over the space of all distributions {q(·|x, ȳ; ✓), ✓ 2 ⇥}.. 3.2 The Algorithms in Previous Work. To motivate the importance of the two algorithms, we now discuss their application in previous work.. Mnih and Teh (2012) consider language mod- eling, where x = w1w2 . . . wn�1 is a history con- sisting of the previous n�1 words, and y is a word. The scoring function is defined as. s(x, y; ✓) = ( n�1 X. i=1. C i. r w. i. ) · q y. + b y. � c x. where r w. i. is an embedding (vector of parameters) for history word w. i. , q y. is an embedding (vector of parameters) for word y, each C. i. for i = 1 . . . n�1 is a matrix of parameters specifying the contribu- tion of r. w. i. to the history representation, b y. is a bias term for word y, and c. x. is a parameter corre- sponding to the log normalization term for history x. Thus each history x has its own parameter c. x. . The binary objective function is used in the NCE algorithm. The noise distribution p. N. (y) is set to be the unigram distribution over words in the vo- cabulary.. This method is a direct application of the original NCE method to conditional estimation, through introduction of the parameters c. x. corre- sponding to normalization terms for each history. Interestingly, Mnih and Teh (2012) acknowledge the difficulties in maintaining a separate parame- ter c. x. for each history, and set c x. = 0 for all x, noting that empirically this works well, but with- out giving justification.. Mikolov et al. (2013) consider an NCE-based method using the binary objective function for estimation of word embeddings. The skip-gram method described in the paper corresponds to a model where x is a word, and y is a word in the context. The vector v. x. is the embedding for word x, and the vector v0. y. is an embedding for word y (separate embeddings are used for x and y). The method they describe uses. s̄(x, y; ✓) = v0 y. · v x. or equivalently. s(x, y; ✓) = v0 y. · v x. + log p N. (y). The negative-sampling distribution p N. (y) was chosen as the unigram distribution p. Y. (y) raised to the power 3/4. The end goal of the method was to learn useful embeddings v. w. and v0 w. for each word. 3702. in the vocabulary; however the method gives a consistent estimate for a model of the form. p(y|x) = exp. �. v0 y. · v x. + log p N. (y) �. P. y. exp. �. v0 y. · v x. + log p N. (y) �. =. p N. (y) exp �. v0 y. · v x. �. Z(x; ✓). assuming that Assumption 2.2 holds, i.e. Z(x; ✓) =. P. y. p N. (y) exp �. v0 y. · v x. �. ⌘ H(✓) which does not vary with x.. Levy and Goldberg (2014) make a connec- tion between the NCE-based method of (Mikolov et al., 2013), and factorization of a matrix of point- wise mutual information (PMI) values of (x, y) pairs. Consistency of the NCE-based method un- der assumption 2.2 implies a similar result, specif- ically: if we define p. N. (y) = p Y. (y), and de- fine s(x, y; ✓) = v0. y. · v x. + log p N. (y) implying s̄(x, y; ✓) = v0. y. · v x. , then parameters v0 y. and v x. converge to values such that. p(y|x) = p Y. (y) exp �. v0 y. · v x. �. H(✓). or equivalently. PMI(x, y) = log p(y|x) p(y). = v0 y. · v x. � log H(✓). That is, following (Levy and Goldberg, 2014), the inner product v0. y. · v x. is an estimate of the PMI up to a constant offset H(✓).. Finally, Jozefowicz et al. (2016) introduce the ranking-based variant of NCE for the language modeling problem. This is the same as the ranking-based algorithm in figure 1. They do not, however, make the connection to assumptions 2.2 and 2.1, or derive the consistency or efficiency results in the current paper. Jozefowicz et al. (2016) partially motivate the ranking-based vari- ant throught the importance sampling viewpoint of Bengio and Senécal (2008). However there are two critical differences: 1) the algorithm of Ben- gio and Senécal (2008) does not lead to the same objective Ln. R. in the ranking-based variant of NCE; instead it uses importance sampling to derive an objective that is similar but not identical; 2) the importance sampling method leads to a biased es- timate of the gradients of the log-likelihood func- tion, with the bias going to zero only as K ! 1. In contrast the theorems in the current paper show that the NCE-based methods are consistent for any. value of K. In summary, while it is tempting to view the ranking variant of NCE as an impor- tance sampling method, the NCE-based view gives stronger guarantees for finite values of K.. 4 Theory. This section states the main theorems. The supple- mentary material contains proofs. Throughout the paper, we use E. X. [ · ], E Y. [ · ], E X,Y. [ · ], E Y |X=x[ · ]. to represent the expectation w.r.t. p X. (·), p Y. (·), p X,Y. (·, ·), p Y |X(·|x). We use k · k to denote ei-. ther the l2 norm when the operand is a vector or the spectral norm when the operand is a matrix. Finally, we use ) to represent converge in distri- bution. Recall that we have defined. s̄(x, y; ✓) = s(x, y; ✓) � log p N. (y).. 4.1 Ranking. In this section, we study noise contrastive estima- tion with ranking objective under Assumption 2.1. First consider the following function:. L1 R. (✓) = X. x,y0,··· ,y K. p X,Y. (x, y0) K. Y. i=1. p N. (y i. ). ⇥ log. exp(s̄(x, y0; ✓)) P. K. k=0 exp(s̄(x, yk; ✓)). !. .. By straightforward calculation, one can find that. L1 R. (✓) = E [Ln R. (✓)] .. Under mild conditions, Ln R. (✓) converges to L1 R. (✓) as n ! 1. Denote the set of maximiz- ers of L1. R. (✓) by ⇥⇤ R. , that is. ⇥. ⇤ R. = arg max. ✓2⇥ L1 R. (✓) .. The following theorem shows that any parameter vector ¯✓ 2 ⇥⇤. R. if and only if it gives the correct conditional distribution p. Y |X(y|x). Assumption 4.1 (Identifiability). For any ✓ 2 ⇥, if there exists a function c(x) such that s(x, y; ✓)� s(x, y; ✓⇤) ⌘ c(x) for all (x, y) 2 X ⇥ Y, then ✓ = ✓⇤ and thus c(x) = 0 for all x.. Theorem 4.1 Under Assumption 2.1,. ¯✓ 2 ⇥⇤ R. if. and only if, for all (x, y) 2 X ⇥ Y,. p Y |X(y|x) = exp(s(x, y; ¯✓))/Z(x, ¯✓).. In addition, ⇥. ⇤ R. is a singleton if and only if As-. sumption 4.1 holds.. 3703. Next we consider consistency of the estimation algorithm based on the ranking objective under the following regularity assumptions: Assumption 4.2 (Continuity). s(x, y; ✓) is con- tinuous w.r.t. ✓ for all (x, y) 2 X ⇥ Y. Assumption 4.3 ⇥. ⇤ R. is contained in the interior. of a compact set ⇥ ⇢ Rd. For a given estimate bp. Y |X of the conditional dis- tribution p. Y |X , define the error metric d(·, ·) by. d �. bp Y |X, pY |X. �. =. X. x2X,y2Y p X,Y. (x, y). ⇥ �. bp Y |X(y|x) � pY |X(y|x). �2 .. For a sequence of IID observations (x(1), y(1)), (x(2), y(2)), . . . , define the sequences of esti- mates (b✓1. R. , bp1 Y |X), (. b✓2 R. , bp2 Y |X), . . . where the. nth estimate (b✓n R. , bpn Y |X) is obtained by op-. timizing the ranking objective of figure 1 on (x(1), y(1)), (x(2), y(2)), . . . , (x(n), y(n)). Theorem 4.2 (Consistency) Under Assump-. tions 2.1, 4.2, 4.3, the estimates based on the. ranking objective are strongly consistent in the. sense that for any fixed K � 1,. P n. lim. n!1 min. ✓. ⇤2⇥⇤ R. kb✓n R. � ✓⇤k = 0 o. = P n. lim. n!1 d ⇣. bpn Y |X, pY |X. ⌘. = 0. o. = 1. Further, if Assumption 4.1 holds,. P n. lim. n!1 b✓n R. = ✓⇤ o. = 1.. Remark 4.1 Thoughout the paper, all NCE esti-. mators are defined for some fixed K. We suppress the dependence on K to simplify notation (e.g. b✓n. R. should be interpreted as. b✓ n,K. R. ).. 4.2 Classification. Now we turn to the analysis of NCE with binary objective under Assumption 2.2. First consider the following function,. L1 B. (✓, �) = X. x,y. n. p X,Y. (x, y) log (g(x, y; ✓, �)). + Kp X. (x)p N. (y) log (1 � g(x, y; ✓, �)) o. One can find that. L1 B. (✓, �) = E [Ln B. (✓, �)] .. Denote the set of maximizers of L1 B. (✓, �) by ⌦⇤ B. :. ⌦. ⇤ B. = arg max. ✓2⇥,�2� L1 B. (✓, �) .. Parallel results of Theorem 4.1, 4.2 are established as follows. Assumption 4.4 (Identifiability). For any ✓ 2 ⇥, if there exists some constant c such that s(x, y; ✓)�s(x, y; ✓⇤) ⌘ c for all (x, y) 2 X ⇥Y, then ✓ = ✓⇤ and thus c = 0.. Assumption 4.5 ⌦. ⇤ B. is in the interior of ⇥ ⇥ � where ⇥ ⇢ Rd, � ⇢ R are compact sets. Theorem 4.3 Under Assumption 2.2, (. ¯✓, �̄) 2 ⌦. ⇤ B. if and only if, for all (x, y) 2 X ⇥ Y,. p Y |X(y|x) = exp(s(x, y; ¯✓) � �̄). for all (x, y). ⌦⇤ B. is a singleton if and only if As-. sumption 4.4 holds.. Similarly we can define the sequence of es- timates (b✓1. B. , b�1 B. , bp1 Y |X), (. b✓2 B. , b�2 B. , bp2 Y |X), . . .. based on the binary objective.. Theorem 4.4 (Consistency) Under Assump-. tion 2.2, 4.2, 4.5, the estimates defined by the. binary objective are strongly consistent in the. sense that for any K � 1,. P n. lim. n!1 min. (✓⇤,�⇤)2⌦⇤ B. k(b✓n B. ,b�n B. ) � (✓⇤, �⇤)k = 0 o. = P n. lim. n!1 d ⇣. bpn Y |X, pY |X. ⌘. = 0. o. = 1. If further Assumption 4.4 holds,. P n. lim. n!1 (. b✓n B. ,b�n B. ) = (✓⇤, �⇤) o. = 1.. 4.3 Counterexample. In this section, we give a simple example to demonstrate that the binary classification approach fails to be consistent when assumption 2.1 holds but assumption 2.2 fails (i.e. the partition function depends on the input).. Consider X 2 X = {x1, x2} with marginal distribution. p X. (x1) = pX(x2) = 1/2,. and Y 2 Y = {y1, y2} generated by the condi- tional model specified in assumption 2.1 with the score function parametrized by ✓ = (✓1, ✓2) and. s(x1, y1; ✓) = log ✓1,. 3704. s(x1, y2; ✓) = s(x2, y1; ✓) = s(x2, y2; ✓) = log ✓2.. Assume the true parameter is ✓⇤ = (✓⇤1, ✓ ⇤ 2) =. (1, 3). By simple calculation,. Z(✓⇤; x1) = 4, Z(✓ ⇤ ; x2) = 6,. p X,Y. (x1, y1) = 1/8, pX,Y (x1, y2) = 3/8, p X,Y. (x2, y1) = pX,Y (x2, y2) = 1/4.. Suppose we choose the negative sampling distri- bution p. N. (y1) = pN(y2) = 1/2. For any K � 1, by the Law of Large Numbers, as n goes to infin- ity, Ln. B. (✓, �) will converge to L1 B. (✓, �). Substi- tute in the parameters above. One can show that. L1 B. (✓, �) = 1. 8. log. 2✓1 2✓1 + K exp(�). +. K. 4. log. K exp(�). 2✓1 + K exp(�). +. 7. 8. log. 2✓2 2✓2 + K exp(�). +. 3K. 4. log. K exp(�). 2✓2 + K exp(�) .. Setting the derivatives w.r.t. ✓1, ✓2 to zero, one will obtain. ✓1 = 1. 4. exp(�), ✓2 = 7. 12. exp(�).. So for any (e✓1, e✓2,e�) 2 argmax ✓,�. L1 B. (✓, �), (. e✓1, e✓2,e�) will satisfy the equalities above. Then the estimated distribution ep. Y |X will satisfy. ep Y |X(y1|x1) ep Y |X(y2|x1). =. e✓1 e✓2. =. 1/4. 7/12 =. 3. 7. ,. which contradicts the fact that. p Y |X(y1|x1). p Y |X(y2|x1). =. p X,Y. (x1, y1). p X,Y. (x1, y2) =. 1. 3. .. So the binary objective does not give consistent estimation of the conditional distribution.. 4.4 Asymptotic Normality and Statistical. Efficiency. Noise Contrastive Estimation significantly reduces the computational complexity, especially when the label space |Y| is large. It is natural to ask: does such scalability come at a cost? Classical likeli- hood theory tells us, under mild conditions, the maximum likelihood estimator (MLE) has nice properties like asymptotic normality and Fisher ef- ficiency. More specifically, as the sample size goes. to infinity, the distribution of the MLE will con- verge to a multivariate normal distribution, and the mean square error of the MLE will achieve the Cramer-Rao lower bound (Ferguson, 1996).. We have shown the consistency of the NCE es- timators in Theorem 4.2 and Theorem 4.4. In this part of the paper, we derive their asymptotic distri- bution and quantify their statistical efficiency. To this end, we restrict ourselves to the case where ✓⇤. is identifiable (i.e. Assumptions 4.1 or 4.4 hold) and the scoring function s(x, y; ✓) satisfies the fol- lowing smoothness condition:. Assumption 4.6 (Smoothness). The scoring func-. tion s(x, y; ✓) is twice continuous differentiable w.r.t. ✓ for all (x, y) 2 X ⇥ Y.. We first introduce the following maximum- likelihood estimator.. b✓ MLE = arg min ✓. LnMLE(✓). := arg min. ✓. n. X. i=1. log. exp(s(x(i), y(i); ✓)) P. y2Y exp(s(x (i), y; ✓)). !. .. Define the matrix. I ✓. ⇤ = E. X. ⇥. Var. Y |X=x [r✓s(x, y; ✓⇤)] ⇤. .. As shown below, I ✓. ⇤ is essentially the Fisher in- formation matrix under the conditional model.. Theorem 4.5 Under Assumption 2.1, 4.1, 4.3,. and 4.6, if I ✓. ⇤ is non-singular, as n ! 1. p n(b✓ MLE � ✓⇤) ) N(0, I�1. ✓. ⇤ ).. For any given estimator b✓, define the scaled asymptotic mean square error by. MSE1(b✓) = lim n!1. E ". �. �. �. �. r. n. d. ⇣. b✓ � ✓⇤ ⌘. �. �. �. �. 2 #. ,. where d is the dimension of the parameter ✓⇤. The- orem 4.5 implies that,. MSE1(b✓ MLE. ) = Tr(I�1 ✓. ⇤ )/d.. where Tr(·) denotes the trace of a matrix. Accord- ing to classical MLE theory (Ferguson, 1996), un- der certain regularity conditions, this is the best achievable mean square error. So the next question to answer is: can these NCE estimators approach this limit?. 3705. Assumption 4.7 There exist positive constants. c, C such that �min(I ✓. ⇤ ) � c and. max. (x,y)2X⇥Y. n. |s(x, y; ✓⇤)|, kr ✓. s(x, y; ✓⇤)k , �. �r2 ✓. s(x, y; ✓⇤) �. �. o.  C.. where �min(·) denotes the smallest singular value.. Theorem 4.6 (Ranking) Under Assumption 2.1,. 4.1, 4.3, 4.6, 4.7, there exists an integer K0 such that for all K � K0, as n ! 1. p n ⇣. b✓ R. � ✓⇤ ⌘. ) N(0, I�1 R,K. ), (9). for some matrix I R,K. . There exists a constant C. such that for all K � K0,. | MSE1(b✓R) � MSE1(b✓ MLE)|  C/ p K. kI�1 R,K. � I�1 ✓. ⇤ k  C/ p K. Theorem 4.7 (Binary) Under Assumption 2.2,. 4.4, 4.5, 4.6, 4.7, there exists an integer K0 such that, for any K � K0, as n ! 1. p n ⇣. b✓ B. � ✓⇤ ⌘. ) N(0, I�1 B,K. ), (10). for some matrix I B,K. . There exists a constant C. such that for all K � K0,. | MSE1(b✓B) � MSE1(b✓ MLE)|  C/K kI�1. B,K. � I�1 ✓. ⇤ k  C/K.. Remark 4.2 Theorem 4.6 and 4.7 reveal that un-. der respective model assumptions, for any given. K � K0 both NCE estimators are asymptotically normal and. p n-consistent. Moreover, both NCE. estimators approach Fisher efficiency (statistical. optimality) as K grows.. 5 Experiments. 5.1 Simulations. Suppose we have a feature space X ⇢ Rd with |X| = m. x. , label space Y = {1, · · · , m y. }, and pa- rameter ✓ = (✓1, · · · , ✓m. y. ) 2 Rmy⇥d. Then for any given sample size n, we can generate observa- tions (x(i), y(i)) by first sampling x(i) uniformly from X and then sampling y(i) 2 Y by the con- dional model. p(y|x; ✓) = exp(x0✓ y. )/. m. y. X. y=1. exp(x0✓ y. ).. We first consider the estimation of ✓ by MLE and NCE-ranking. We fix d = 4, m. x. = 200, m y. =. 100 and generate X and the parameter ✓ from sep- arate mixtures of Gaussians. We try different con- figurations of (n, K) and report the KL divergence between the estimated distribution and true distri- bution, as summarized in the left panel of figure 2. The observations are: • The NCE estimators are consistent for any. fixed K. For a fixed sample size, the NCE estima- tors become comparable to MLE as K increases. • The larger the sample size, the less sensitive. are the NCE estimators to K. A very small value of K seems to suffice for large sample size.. Apparently, under the parametrization above, the model is not self-normalized. To use NCE- binary, we add an extra x-dependent bias parame- ter b. x. to the score function (i.e. s(x, y; ✓) = x0✓ y. +. b x. ) to make the model self-normalized or else the algorithm will not be consistent. Similar patterns to figure 2 are observed when varying sample size and K (see Section A.1 of the supplementary ma- terial). However this makes NCE-binary not di- rectly comparable to NCE-ranking/MLE since its performance will be compromised by estimating extra parameters and the number of extra param- eters depends on the richness of the feature space X . To make this clear, we fix n = 16000, d = 4, m. y. = 100, K = 32 and experiment with m x. =. 100, 200, 300, 400. The results are summarized on the right panel of figure 2. As |X| increases, the KL divergence will grow while the performance of NCE-ranking/MLE is independent of |X|. With- out the x-dependent bias term for NCE-binary, the KL divergence will be much higher due to lack of consistency (0.19, 0.21, 0.24, 0.26 respectively).. 5.2 Language Modeling. We evaluate the performance of the two NCE al- gorithms on a language modeling problem, using the Penn Treebank (PTB) dataset (Marcus et al., 1993). We choose (Zaremba et al., 2014) as the benchmark where the conditional distribution is modeled by two-layer LSTMs and the parame- ters are estimated by MLE (note that the current state-of-the-art is (Yang et al., 2018)). Zaremba et al. (2014) implemented 3 model configurations: “Small” , “Medium” and “Large”, which have 200, 650 and 1500 units per layer respectively. We follow their setup (model size, unrolled steps, dropout ratio, etc) but train the model by maximiz-. 3706. Small Medium Large MLE 111.5 82.7 78.4 NCE Ranking Binary Ranking Binary Ranking Binary. K = 200 113.8 106.8 83.2 82.1 79.3 76.0 K = 400 112.9 105.6 82.3 81.5 77.9 75.6 K = 800 111.9 105.3 81.4 81.6 77.8 75.7 K = 1600 110.6 104.8 81.7 81.5 77.5 75.9 reg-MLE 105.4 79.9 77.0. reg-Ranking (K = 1600) 105.4 79.8 75.0 reg-Binary (K = 1600) 104.8 82.5 75.7. Table 1: Perplexity on the test set of Penn Treebank. We show performance for the ranking v.s. binary loss algorithms, with different values for K, and with/without regularization.. Figure 2: KL divergence between the true distribution and the estimated distribution.. ing the two NCE objectives. We use the unigram distribution as the negative sampling distribution and consider K = 200, 400, 800, 1600.. The results on the test set are summarized in ta- ble 1. Similar patterns are observed on the vali- dation set (see Section A.2 of the supplementary material). As shown in the table, the performance of NCE-ranking and NCE-binary improves as the number of negative examples increases, and fi- nally outperforms the MLE.. An interesting observation is, without regular- ization, the binary classification approach outper- forms both ranking and MLE. This suggests the model space (two-layer LSTMs) is rich enough as to approximately incorporate the x-dependent par- tition function Z(✓; x), thus making the model ap- proximately self-normalized. This motivates us to modify the ranking and MLE objectives by adding the following regularization term:. ↵. n. n. X. i=1. 0. @. log. 0. @. 1. m. m. X. j=1. exp. ⇣. s̄(x(i), ey(i,j); ✓) ⌘. 1. A. 1. A. 2. ⇡ ↵ E X. h. (log Z(x; ✓)) 2 i. ,. where ey(i,j), 1  j  m are sampled from the noise distribution p. N. (·). This regularization. term promotes a constant partition function, that is Z(x; ✓) ⇡ 1 for all x 2 X . In our experiments, we fix m to be 1/10 of the vocabulary size, K = 1600 and tune the regularization parameter ↵. As shown in the last three rows of the table, regularization significantly improves the performance of both the ranking approach and the MLE.. 6 Conclusions. In this paper we have analyzed binary and rank- ing variants of NCE for estimation of conditional models p(y|x; ✓). The ranking-based variant is consistent for a broader class of models than the binary-based algorithm. Both algorithms achieve Fisher efficiency as the number of negative exam- ples increases. Experiments show that both algo- rithms outperform MLE on a language modeling task. The ranking-based variant of NCE outper- forms the binary-based variant once a regularizer is introduced that encourages self-normalization.. Acknowledgments. The authors thank Emily Pitler and Ali Elkahky for many useful conversations about the work, and David Weiss for comments on an earlier draft of the paper.. 3707. References. Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate train- ing of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722.. Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy ap- proach to natural language processing. Comput. Linguist., 22(1):39–71.. Thomas Shelburne Ferguson. 1996. A course in large sample theory, volume 49. Chapman & Hall Lon- don.. Michael U Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized sta- tistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361.. Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling se- quence data. In Proceedings of the Eighteenth Inter- national Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Mor- gan Kaufmann Publishers Inc.. Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Pro- ceedings of the 27th International Conference on. Neural Information Processing Systems - Volume 2, NIPS’14, pages 2177–2185, Cambridge, MA, USA. MIT Press.. Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of english: The penn treebank. Computa- tional linguistics, 19(2):313–330.. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor- rado, and Jeffrey Dean. 2013. Distributed represen- tations of words and phrases and their composition- ality. In Proceedings of the 26th International Con- ference on Neural Information Processing Systems -. Volume 2, NIPS’13, pages 3111–3119, USA. Curran Associates Inc.. Andriy Mnih and Yee W Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1751–1758.. Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representa-. tions.. Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

New documents

Comparing all results for Swahili broadcast speech transcription task table 1, Morfessor based segmentation ASR system is the only one, with 34.8% WER, performing significantly better

5 Proposition de claviers et de polices ergonomiques pour la saisie en API Pour chaque langue on transforme une police Unicode Times New Roman.ttf en déplaçant les glyphes copier-coller

Figure 2: a proportion of fruit set and b fruit length mean ± SE of bagged shaded bars and open unshaded bars flowers hand-pollinated with self same flower, geitonogamous same plant and

La plate-forme sera adaptée spécifiquement au projet DiLAF car, en sus des dictionnaires, des informations spécifiques au projet doivent être accessibles aux visiteurs : — présentation

The study sheds new light on this unusual breeding strategy of delayed implantation and superfetation, and highlights a number of significant differences between the reproductive

Exemples: dimiro «à la brebis»; maləmbo «au marabout»; dallo «au bouc» +ye, marque du locuteur, s’écrit +ye tout court après toutes les voyelles, +be après la nasale m et +e

Système d’alignement automatique à partir du français Afin de pouvoir rechercher et écouter des mots ou des réalisations de séquences de phonèmes mbochi spécifiques dans le signal,

The following are a list of the possible feature types and the restrictions that can be imposed on them: • Symbolic – the feature value must be one of an enumerated list of possible

Cet article traite de la reconnaissance automatique de la parole pour l’amharique et le swahili, deux langues peu dotées à morphologie riche en utilisant des unités de découpage au

In this respect it would be useful to explore whether the degree of policy congruence at the national level impacts on public attitudes towards representation in the EU, as the systems