Nonparametric e-mixture Estimation


LETTER. Communicated by Shun-ichi Amari.

Ken Takano
Graduate School of Advanced Science and Engineering, Waseda University, Shinjuku, Tokyo 169-8555, Japan

Hideitsu Hino
Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki 305-8573, Japan

Shotaro Akaho
National Institute of Advanced Industrial Science and Technology, Tsukuba, Ibaraki 305-8568, Japan

Noboru Murata
Graduate School of Advanced Science and Engineering, Waseda University, Shinjuku, Tokyo 169-8555, Japan

Neural Computation 28, 2687–2725 (2016) doi:10.1162/NECO_a_00888. © 2016 Massachusetts Institute of Technology. Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/NECO_a_00888 by guest on 30 March 2021.

This study considers the common situation in data analysis in which there are few observations from the distribution of interest, or target distribution, while abundant observations are available from auxiliary distributions. In this situation, it is natural to compensate for the lack of data from the target distribution by using data sets from these auxiliary distributions; in other words, to approximate the target distribution in a subspace spanned by a set of auxiliary distributions. Mixture modeling is one of the simplest ways to integrate information from the target and auxiliary distributions in order to express the target distribution as accurately as possible. There are two typical mixtures in the context of information geometry: the m- and e-mixtures. The m-mixture is applied in a variety of research fields because the well-known expectation-maximization algorithm is available for parameter estimation, whereas the e-mixture is rarely used because it is difficult to estimate, particularly for nonparametric models. The e-mixture, however, is a well-tempered distribution that satisfies the principle of maximum entropy. To model a target distribution with scarce observations accurately, this letter proposes a novel framework for nonparametric modeling of the e-mixture and a geometrically inspired estimation algorithm. As numerical examples of the proposed framework, a transfer learning setup is considered. The experimental results show that this framework works well for three types of synthetic data sets, as well as an EEG real-world data set.

1 Introduction

Constructing a mixture model of probability distributions (Everitt & Hand, 1981; McLachlan & Peel, 2000) is a standard approach for integrating information from different sources and representing the presence of different subpopulations underlying the overall population. In this letter, we present a nonparametric e-mixture estimation method, namely, an algorithm to estimate a logarithmic mixture of nonparametric models for solving a variant of user adaptation or transfer learning (Blum & Mitchell, 1998; Pan & Yang, 2010). Our task is to construct a good model of the target data set when it includes an insufficient number of examples by using auxiliary data sets that have a sufficient amount of relevant information. This problem is considered to be one of approximating the target distribution in a subspace spanned by a set of auxiliary distributions. In these situations, mixture modeling is a popular method for estimating the target distribution.

There are two typical mixtures in the context of information geometry (Amari & Nagaoka, 2000): the m-mixture and the e-mixture. The m-mixture $p^m(x)$ is a convex combination of auxiliary probability density functions (pdfs) $p_i(x)$,

$$p^m(x; \theta) = \sum_{i=1}^{N} \theta_i p_i(x), \qquad \sum_{i=1}^{N} \theta_i = 1, \qquad \theta_i \ge 0, \tag{1.1}$$

where $\theta = \{\theta_i\}_{i=1}^{N}$ is a mixture ratio vector of the pdfs $p_i(x)$, $i = 1, \ldots, N$, and $N$ is the number of the pdfs. The gaussian mixture model (GMM) is an example of the m-mixture (McLachlan & Peel, 2000). The e-mixture $p^e(x)$ has the following form,

$$p^e(x; \theta) = \exp\left\{ \sum_{i=1}^{N} \theta_i \log p_i(x) - b(\theta) \right\}, \qquad \sum_{i=1}^{N} \theta_i = 1, \tag{1.2}$$

where $b(\theta)$ is the normalization term. In the m-mixture form, the pdfs are combined by the weighted arithmetic average.
In contrast, in the e-mixture form, a weighted average of log densities is used. Figure 1 shows the e- and the m-mixtures of two gaussian distributions. The solid lines indicate the two gaussian distributions, and the dashed and dotted lines indicate the m- and the e-mixtures of these two gaussian distributions, respectively, with a uniform mixture ratio. The difference between the m- and the e-mixtures can be understood by analogy with the difference between logical OR and AND. The idea of the e-mixture is also related to the classical mixture of experts (Jordan & Jacobs, 1994; Heskes, 1998; Choi, Choi, & Choe, 2013).

Figure 1: An example of the m- and e-mixture of two gaussian distributions.

To motivate e-mixture modeling, we present two important characteristics for mixture models. First, the set of models should contain all the auxiliary pdfs; indeed, it is natural to require the mixture model to contain (or well approximate) the target pdf and all the auxiliary pdfs. Second, the set of mixture models, which is parameterized by θ, should be as simple as possible. This consideration is important to maintain generalization ability and avoid overfitting (Wang, Greiner, & Wang, 2013). The e-mixture satisfies both of these characteristics, while the m-mixture satisfies only the first. Hence, a set of e-mixtures is simpler than that of m-mixtures in terms of the principle of maximum entropy, as shown in section 3.3: e-mixtures are included in the exponential family. Since entropy is a measure of ambiguity, a pdf with small entropy concentrates most of its probability mass in small regions of the data space. Particularly in the problem of model estimation from finite samples, small entropy implies overfitting. The exponential family is known to be a natural solution to the maximum entropy problem. The principle of maximum entropy shows that the e-mixture is the flattest or minimally informative pdf in a certain family of distributions (Cover & Thomas, 1991).

In some cases, the e-mixture is a more natural modeling approach than the m-mixture. For example, suppose N distinct data sets are well approximated by gaussian distributions with different means and covariance values. In this case, it is natural to model the (N + 1)st data set with a gaussian distribution rather than a mixture of gaussians. Further, the gaussian distribution belongs to the exponential family, which is known to have the e-mixture property, that is, an e-mixture of gaussians becomes a gaussian.
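The closure property just stated can be checked numerically. The following sketch builds the m- and e-mixtures of two one-dimensional gaussians on a grid and compares the e-mixture with the gaussian whose precision is the θ-weighted average of the component precisions. The grid, the parameter values, and the helper names are illustrative assumptions, not taken from the letter.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of a one-dimensional gaussian with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Illustrative setup (assumed values): two components, uniform mixture ratio.
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
theta = np.array([0.5, 0.5])
mus = np.array([-2.0, 2.0])
variances = np.array([1.0, 2.0])
components = np.array([gaussian_pdf(x, m, v) for m, v in zip(mus, variances)])

# m-mixture (equation 1.1): weighted arithmetic average of the densities.
m_mix = theta @ components

# e-mixture (equation 1.2): weighted average of log densities, then
# normalization; the Riemann sum plays the role of exp(b(theta)).
unnormalized = np.exp(theta @ np.log(components))
e_mix = unnormalized / (unnormalized.sum() * dx)

# Closed form for gaussians: precisions and precision-weighted means average.
precision = np.sum(theta / variances)
mean = np.sum(theta * mus / variances) / precision
closed_form = gaussian_pdf(x, mean, 1.0 / precision)

max_gap = np.max(np.abs(e_mix - closed_form))
print("max |e-mixture - closed-form gaussian| =", max_gap)
```

The m-mixture keeps both modes (the OR-like behavior), while the e-mixture is a single gaussian located between the components (the AND-like behavior).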
Thus, one might say that the e-mixture model is a natural extension of the exponential family, which contains all the auxiliary pdfs (Akaho, 2004).

In a variety of research fields such as speaker verification (Douglas, Thomas, & Robert, 2000), background subtraction (Zivlovic, 2004), and genetic analysis (Ji, Wu, Liu, Wang, & Coombes, 2005), m-mixtures are applied along with the EM algorithm (Dempster, Laird, & Rubin, 1977). However, few authors have applied e-mixtures despite their good properties (Genest & Zidek, 1986), since estimating them can be computationally intractable because of the use of log and exponential functions. Indeed, one must select a distribution family for the auxiliary pdfs for which the e-mixture form can actually be calculated.

In this letter, from the viewpoint of information geometry, we propose a novel framework for estimating the e-mixtures of nonparametric models. Suppose we are given a set of observed data points $\{x_k\}_{k=1}^{K}$. A naive approach for nonparametric e-mixture modeling would be to replace the auxiliary distributions $p_i(x)$ in equation 1.2 with empirical distributions. However, as shown in section 3.1, the substitution of empirical distributions in the logarithmic function is prohibited. Instead, we consider a nonparametric representation of e-mixture models by using a weighted empirical distribution function of all the given data,

$$p(x) = \sum_{k=1}^{K} w_k \, \delta(x - x_k), \tag{1.3}$$

where $w = \{w_k\}_{k=1}^{K}$, $\sum_{k=1}^{K} w_k = 1$, $w_k \ge 0$ is a weight vector; that is, each element $w_k$ of $w$ represents the sampling probability of a datum $x_k \in \mathbb{R}^d$, and $\delta(\cdot)$ is the Dirac delta function. When a sufficient number of data are given, nonparametric models such as an empirical distribution function can express the underlying distribution more precisely than parametric models. We can control the empirical distribution function, equation 1.3, by changing the weights $w_k$ of the data points. Here, our objective is to determine the weight $w$ of all the given data so that the empirical distribution, equation 1.3, becomes the e-mixture of the auxiliary distributions. We construct our nonparametric e-mixture estimation algorithm with the aid of two theorems: the characterization of the e-mixture and the Pythagorean relation.
In exploratory data analysis, it is preferable not to assume any specific form of probability distribution behind the data in advance, and therefore nonparametric approaches are often preferred. In addition, when we consider a parametric e-mixture, both auxiliary and target distributions must be restricted to belong to a certain family of distributions to ensure the feasibility of calculating the e-mixture. To ensure that the modeling is flexible, we aim to estimate the e-mixture in a nonparametric manner.

The remainder of this letter is organized as follows. In section 2, we introduce the information geometry required to explain our approach. The detailed problem addressed in this letter is described in section 3, and our proposed framework is explained in section 4. Section 5 presents the experimental results by using both artificial and real-world data sets, and the last section is devoted to a discussion and conclusion.

2 Preliminary on Information Geometry

Information geometry is a framework for discussing the mechanisms of statistical inference or machine learning by focusing on the geometrical structure of the manifold of probability distributions. In this section, we discuss some basic ideas of information geometry, focusing on the Pythagorean relation derived from the Kullback–Leibler (KL) divergence.

2.1 KL Divergence. We consider a statistical space $S$ composed of arbitrary pdfs $\{p(x)\}$, where $x$ is a random variable. A point in the space $S$ corresponds to a pdf. We also consider a subspace $M \subset S$ composed of pdfs $\{p(x; \xi)\}$, where $\xi$ is a parameter of the pdf. For example, $\xi$ is composed of mean $\mu$ and variance $\sigma^2$ when we consider a family of gaussian distributions. The parameter $\xi$ plays the role of a coordinate system in this subspace $M$. The problem of statistical inference is reduced to the problem of searching for the closest point on the subspace $M$ from the given data, where closeness is measured by using a certain divergence function. A schematic diagram of these notions of information geometry is shown in Figure 2.

The KL divergence (Kullback & Leibler, 1951) is an example of the divergence between two probability distributions $p$ and $q$, which is defined as

$$D_{KL}(p, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{2.1}$$

Let us consider a relationship based on the KL divergence among three pdfs, $p$, $q$, and $r$:

$$D_{KL}(p, q) - D_{KL}(p, r) - D_{KL}(r, q) = \int \{p(x) - r(x)\}\{\log r(x) - \log q(x)\} \, dx. \tag{2.2}$$

When the right-hand side of equation 2.2 equals zero, $\{p(x) - r(x)\}$ and $\{\log r(x) - \log q(x)\}$ can be regarded as orthogonal vectors in the statistical space $S$, as shown in Figure 3.

Theorem 1 (Pythagorean relation) (Amari & Nagaoka, 2000). Let $p$, $q$, and $r$ be probability density functions. If $\{p - r\}$ and $\{\log r - \log q\}$ are orthogonal, namely, $\int \{p(x) - r(x)\}\{\log r(x) - \log q(x)\} \, dx = 0$, then the Pythagorean relation holds:

$$D_{KL}(p, q) = D_{KL}(p, r) + D_{KL}(r, q). \tag{2.3}$$
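Equation 2.2 is an algebraic identity, so it can be checked directly on a finite sample space, where the integrals become sums. The sketch below uses three randomly drawn discrete distributions; the sample-space size, the random seed, and the helper name `kl` are assumptions made only for illustration.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence sum_x p(x) log(p(x) / q(x)) for positive p, q."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
# Three strictly positive distributions on a six-point sample space (assumed).
p, q, r = (rng.dirichlet(np.ones(6)) for _ in range(3))

# Left- and right-hand sides of equation 2.2.
lhs = kl(p, q) - kl(p, r) - kl(r, q)
rhs = float(np.sum((p - r) * (np.log(r) - np.log(q))))

print("lhs =", lhs, "rhs =", rhs)
```

The two sides agree for any choice of p, q, and r; the Pythagorean relation of theorem 1 is the special case in which this common value is zero.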

Figure 2: Schematic diagram of the statistical inference.

Figure 3: Geometrical view of the Pythagorean relation.

2.2 Geodesics and Flatness. As opposed to the Euclidean space, a statistical space $S$ is curved and distorted in general. Theorem 1 induces two types of geodesics in $S$.

Definition 1 (m-geodesic and m-flat subspace). Let $p$ and $q$ be probability density functions. The m-geodesic is defined as the set of internal divisions between $p$ and $q$ parameterized by $t$:

$$r(x; t) = (1 - t) \cdot p(x) + t \cdot q(x), \qquad 0 \le t \le 1. \tag{2.4}$$

Every internal division $r(x; t)$ is also a pdf because its integral is equal to one. The m-flat subspace is defined by the set generated from $N$ pdfs $p_i(x)$ by the m-mixture

$$M_m(\{p_i\}) = \left\{ r(x; t) = \sum_{i=1}^{N} t_i \, p_i(x) \;\middle|\; \sum_{i=1}^{N} t_i = 1, \; t_i \ge 0 \right\}. \tag{2.5}$$

The m-geodesic of two arbitrary pdfs in $M_m(\{p_i\})$ is included in $M_m(\{p_i\})$.

The m-flat subspace is specified by a set of pdfs that spans the subspace; however, when it is clear from the context or when there is no need to specify, we omit the set of pdfs and simply denote the m-flat subspace by $M_m$. The e-geodesic and the e-flat subspace are derived in the same way as the m-geodesic and the m-flat subspace.

Definition 2 (e-geodesic and e-flat subspace). Let $p$ and $q$ be probability density functions. The e-geodesic is defined as the set of internal divisions between $p$ and $q$ parameterized by $t$:

$$\log r(x; t) = (1 - t) \cdot \log p(x) + t \cdot \log q(x) - \phi(t), \qquad 0 \le t \le 1, \tag{2.6}$$

where $\phi$ is the normalization term defined as $\phi(t) = \log \int p(x)^{1-t} q(x)^{t} \, dx$. The e-flat subspace is defined by the set generated from $N$ pdfs $p_i(x)$ by the e-mixture:

$$M_e(\{p_i\}) = \left\{ r(x; t) = \exp\left\{ \sum_{i=1}^{N} t_i \log p_i(x) - b(t) \right\} \;\middle|\; \sum_{i=1}^{N} t_i = 1, \; t_i \ge 0 \right\}, \tag{2.7}$$

where $b(t)$ is a normalization constant. The e-geodesic of two arbitrary pdfs in $M_e(\{p_i\})$ is included in $M_e(\{p_i\})$.

We note that it is possible to define the e-flat subspace by allowing the $t_i$'s to include negative values, because the product of exponential functions is positive. In this letter, however, we define the e-flat subspace as all the internal points of the subspace spanned by a finite number of pdfs; that is, the $t_i$'s are defined as nonnegative values to simplify the argument.

2.3 Projection. Let $p$, $q$, and $r$ be pdfs in $S$. When the m-geodesic connecting $p$ and $r$ is orthogonal at $r$ to the e-geodesic connecting $q$ and $r$, the Pythagorean relation holds: $D_{KL}(p, q) = D_{KL}(p, r) + D_{KL}(r, q)$. When the Pythagorean relation holds, $D_{KL}(p, q) \ge D_{KL}(p, r)$ for any $q$ in $M_e(\{q, r\})$ also holds because of the nonnegativity of the KL divergence.

Therefore, the intersection point $r$ of the two geodesics becomes the minimizer of the KL divergence between $p$ and an arbitrary point in $M_e(\{q, r\})$. The pdf $r$ uniquely exists, and it is known as the projection.

Definition 3 (m-projection and e-projection). The m-projection from a point $p \in S$ to an e-flat subspace $M_e$ is given by finding the closest point $r_m \in M_e$ from $p$:

$$r_m = \arg\min_{q \in M_e} D_{KL}(p, q) \qquad \text{(m-projection)}. \tag{2.8}$$

Similarly, the e-projection from a point $q \in S$ to an m-flat subspace $M_m$ is given by finding the closest point $r_e \in M_m$ from $q$:

$$r_e = \arg\min_{p \in M_m} D_{KL}(p, q) \qquad \text{(e-projection)}. \tag{2.9}$$

The m-projection to an e-flat subspace $M_e$ and the e-projection to an m-flat subspace $M_m$ are uniquely determined (Amari, 2016).

2.4 Mixture Models and Their Characterization. In information geometry, points in a subspace $M$ can be represented in different coordinate systems called the m- and e-representations. We can consider two types of mixtures of pdfs in the m-representation and the e-representation. Mixture models are regarded as flat subspaces spanned by a finite number of pdfs. The following two theorems characterize the m-mixture and the e-mixture of pdfs, respectively. Let $p_i$, $i = 1, \ldots, N$ be pdfs and $\theta = \{\theta_i\}_{i=1}^{N} \in \Delta_N$ be the associated mixture ratios, where

$$\Delta_N = \left\{ \theta \;\middle|\; \sum_{i=1}^{N} \theta_i = 1, \; \theta_i \ge 0 \right\}. \tag{2.10}$$

Theorem 2 (characterization of the m-mixture) (Murata & Fujimoto, 2009). For any $\theta = \{\theta_i\}_{i=1}^{N} \in \Delta_N$, the sum of the KL divergences $\sum_{i=1}^{N} \theta_i D_{KL}(p_i, q)$ weighted by $\theta$ is minimized at the m-mixture:

$$\arg\min_{q \in P} \sum_{i=1}^{N} \theta_i D_{KL}(p_i, q) = p^m(x; \theta) \in M_m(\{p_i\}), \tag{2.11}$$

where $P$ is the set of probability density functions.

Theorem 3 (characterization of the e-mixture) (Murata & Fujimoto, 2009). For any $\theta = \{\theta_i\}_{i=1}^{N} \in \Delta_N$, the sum of the KL divergences $\sum_{i=1}^{N} \theta_i D_{KL}(q, p_i)$ weighted by $\theta$ is minimized at the e-mixture:

$$\arg\min_{q \in P} \sum_{i=1}^{N} \theta_i D_{KL}(q, p_i) = p^e(x; \theta) \in M_e(\{p_i\}), \tag{2.12}$$

where $P$ is the set of probability density functions.

The proofs of theorems 2 and 3 are given in appendixes A and B.

3 Nonparametric e-Mixture

In this section, to consider the problem of nonparametric e-mixture estimation from the viewpoint of information geometry, we define and restate the notions introduced in the previous section in the nonparametric setting.

3.1 Problem Formulation. Suppose we have a target data set $D^{(0)} = \{x_k^{(0)}\}_{k=1}^{n_0}$ composed of $n_0$ samples $x_k^{(0)} \in \mathbb{R}^d$ generated from a probability distribution with a pdf $p_0$. We also have $N$ auxiliary data sets $\{D^{(i)}\}_{i=1}^{N}$. The $i$th data set $D^{(i)} = \{x_k^{(i)}\}_{k=1}^{n_i}$ is composed of $n_i$ samples $x_k^{(i)} \in \mathbb{R}^d$ generated from a probability distribution with a pdf $p_i$. We consider the situation in which the target data set has far fewer data than the auxiliary data sets, and we wish to obtain a more reliable estimate of $p_0$ by taking advantage of the informative auxiliary data sets. This situation is often seen, for example, in classification problems of EEG (Tu & Sun, 2011) and audio signals (Sturim, Reynolds, Singer, & Campbell, 2001), and it is formulated as the transfer learning problem. We consider representing the target pdf $p_0$ as an e-mixture of the auxiliary pdfs $p_i$ by weighting the data in the auxiliary data sets. For the auxiliary pdfs to be informative about the target pdf, the supports of the target and auxiliary pdfs must overlap sufficiently. To facilitate the discussion, we assume

$$\mathrm{supp}(p_0) \subseteq \bigcup_{i=1}^{N} \mathrm{supp}(p_i), \tag{3.1}$$

where $\mathrm{supp}(f)$ is the support of a function $f$.

Figure 4 shows a conceptual diagram of our framework in parametric and nonparametric manners from the viewpoint of information geometry (Amari, 1991, 2016; Amari & Nagaoka, 2000).
The curved surface with the solid lines in the left panel of Figure 4 shows the subspace $M$ of a certain family of pdfs $p(x; \xi)$ parameterized by $\xi$. In the conventional parametric mixture estimation setting, each parameter $\xi_i$ of the model is estimated based on each data set $D^{(i)}$, and this procedure is regarded as the projection of the empirical distribution for $D^{(i)}$ onto the subspace $M$. Then the mixture ratio $\theta$ is updated by minimizing the sum of divergences between $p(x; \xi_i) \in M$, $i = 1, \ldots, N$ and $p(x; \xi^e) \in M$, namely, the projection of $D^{(0)}$ onto $M$.

Figure 4: (Left) The curved surface with the solid lines represents the subspace $M$ parameterized by $\xi$. Typical algorithms such as the EM algorithm work on this surface. (Right) Conversely, our algorithm works on the curved surface with the dotted lines, representing the subspace $E$ of the e-mixtures of empirical distributions.

Figure 5 illustrates two ways of estimating the parametric e-mixture: the conventional method that uses the gradient descent method and the proposed method we introduce in section 4. Suppose we have two auxiliary data sets $D^{(1)}$ and $D^{(2)}$ and a target data set $D^{(0)}$. When we consider gaussian distributions as the probabilistic models of those data sets, we obtain the parameters of these gaussian distributions $\xi_1 = \{\mu_1, \sigma_1^2\}$, $\xi_2 = \{\mu_2, \sigma_2^2\}$, and $\xi_0 = \{\mu_0, \sigma_0^2\}$ from $D^{(1)}$, $D^{(2)}$, and $D^{(0)}$, respectively. The goal of parametric e-mixture estimation is to obtain the mixture ratio $\theta$ that minimizes the KL divergence between the target pdf $p_0(x)$ and the e-mixture $p^e(x; \theta)$. As noted above, we can consider two methods to estimate the optimal mixture ratio $\theta$: the gradient descent method and the proposed Pythagorean relation-based method, introduced in the next section.

In the gradient descent method, we minimize the KL divergence $D_{KL}(p_0, p^e(\cdot; \theta))$ by using its gradient with respect to the mixture ratio $\theta$. In the one-dimensional gaussian case, we can derive the gradient of the KL divergence as shown in appendix B. The left panels of Figure 5 show the results by using the gradient descent method. The upper panel shows the estimated mixture ratios $\theta_1$ and $\theta_2$ by iterations. The dotted lines indicate the ground-truth mixture ratio $\theta_1 = 0.8$, $\theta_2 = 0.2$. The lower panel shows the parameter space spanned by $\mu$ and $\sigma^2$. Every marked point indicates the estimated parameter at each iteration.
We can see that the estimated point approaches $\xi_0$ as the iterations proceed. The right panels in Figure 5 show the results by using the proposed method based on the Pythagorean relation. As shown in Figure 5, the proposed method can estimate the mixture ratio $\theta$ as well. Since the proposed method requires only the KL divergence, we need not calculate the complicated differentials of the KL divergence.

Figure 5: An illustrative example of two estimation methods for a parametric e-mixture estimation. (Left) The estimation results from using the gradient descent method. (Right) The estimation results from using the proposed algorithm based on the Pythagorean relation. This figure shows that both methods provide the correct mixture ratio.

3.2 Nonparametric e-Mixture Modeling. In contrast to the m-mixture, the e-mixture is a nonlinear combination of the pdfs. In general, obtaining the closed-form solution for the e-mixture of distributions is impossible, even in the parametric framework. Furthermore, nonparametric modeling for the mixture of distributions is desirable when we have no prior knowledge on the form of data distribution. The problem of the nonparametric e-mixture estimation is more difficult than its parametric counterpart, as explained later. We denote the whole data set as

$$D = D^{(0)} \cup D^{(1)} \cup D^{(2)} \cup \cdots \cup D^{(N)}, \tag{3.2}$$

and let $K = |D|$ be the number of data points in $D$. In this letter, we regard the weighted data set $D_w = \{D, w\}$ as a nonparametric representation of a pdf, with $D$ fixed and $w = \{w_k\}_{k=1}^{K}$ the parameter of the weighted empirical distribution. We optimize the parameter $w$ so that equation 1.3 represents the e-mixture of the entire data set, which gives the nonparametric expression of the target pdf.

We construct an empirical distribution $p_0$ from the target data set $D^{(0)} = \{x_k^{(0)}\}_{k=1}^{n_0}$ as $p_0(x) = \frac{1}{n_0} \sum_{k=1}^{n_0} \delta(x - x_k^{(0)})$. Similarly, we construct auxiliary empirical distributions $p_i$, $i = 1, \ldots, N$ from the auxiliary data sets $D^{(i)} = \{x_k^{(i)}\}_{k=1}^{n_i}$ as $p_i(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} \delta(x - x_k^{(i)})$. The e-mixture $p^e$ of these auxiliary empirical distributions $p_i$, $i = 1, \ldots, N$ can be written, with an abuse of the delta functions, as

$$p^e(x; \theta) = \exp\left\{ \sum_{i=1}^{N} \theta_i \log \frac{1}{n_i} \sum_{k=1}^{n_i} \delta(x - x_k^{(i)}) - b(\theta) \right\}, \tag{3.3}$$

where $\theta = \{\theta_i\}_{i=1}^{N}$ is the mixture ratio vector of the auxiliary pdfs $p_i$. Expression 3.3 is not mathematically well defined because of the log of delta functions, which is another source of difficulty in nonparametric e-mixture modeling. Thus, we define a mixture model that satisfies equation 2.12 of theorem 3 as an e-mixture of the nonparametric models. Given a mixture parameter $\theta$, from theorem 3, we obtain a nonparametric e-mixture $\hat{p}^e$ as $\hat{p}^e(x; \theta) = \arg\min_{q \in P} \sum_{i=1}^{N} \theta_i D_{KL}(q, p_i)$. As the set of pdfs $P$ in theorem 3, in which the e-mixture is found, we use the set of pdfs parameterized by the weights as

$$P = \left\{ p \;\middle|\; p(x; w) = \sum_{k=1}^{K} w_k \, \delta(x - y_k), \; \sum_{k=1}^{K} w_k = 1, \; w_k \ge 0, \; y_k \in D \right\}. \tag{3.4}$$

A weight vector $w$ specifies a point in $P$, and the point is determined by equation 2.12. Although the weight vector depends on the mixture ratio $\theta$, for simplicity, we write the weight vector $w$ instead of $w(\theta)$.

Conversely, given a weight vector $w$ for all the data points in $D$, we aim to optimize the mixture parameter $\theta$ so that the e-mixture defined by equation 2.12 is a good approximation of the target distribution $p_0$. The curved surface $E$ with the dotted lines in the right panel of Figure 4 shows a subspace of the e-mixtures of the empirical distributions, which is defined by

$$E = \left\{ p \;\middle|\; p(\cdot; \theta) = \arg\min_{q \in P} \sum_{i=1}^{N} \theta_i D_{KL}(q, p_i), \; \sum_{i=1}^{N} \theta_i = 1, \; \theta_i \ge 0 \right\}. \tag{3.5}$$

A mixture parameter $\theta$ specifies a point in $E$.
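The characterization that this definition relies on (theorem 3) can be illustrated on a finite sample space, where the log of a probability mass function is well defined. In that setting the weighted objective satisfies $\sum_i \theta_i D_{KL}(q, p_i) - \sum_i \theta_i D_{KL}(q^*, p_i) = D_{KL}(q, q^*)$, where $q^*$ is the normalized weighted geometric mean of the $p_i$. The following minimal sketch checks this; the discrete distributions, the mixture ratio, the seed, and all function names are assumptions made for illustration and are not part of the letter.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence for strictly positive p and q."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def e_mixture(ps, theta):
    """Normalized weighted geometric mean: the discrete analog of equation 1.2."""
    log_q = theta @ np.log(ps)
    q = np.exp(log_q - log_q.max())  # shift for numerical stability
    return q / q.sum()

def objective(q, ps, theta):
    """Weighted sum of KL divergences minimized in theorem 3 (equation 2.12)."""
    return sum(t * kl(q, p) for t, p in zip(theta, ps))

rng = np.random.default_rng(1)
ps = rng.dirichlet(np.ones(5), size=3)   # three auxiliary distributions (assumed)
theta = np.array([0.6, 0.3, 0.1])        # a mixture ratio in the simplex (assumed)

q_star = e_mixture(ps, theta)
j_star = objective(q_star, ps, theta)

# Compare against many random candidate distributions q.
candidates = rng.dirichlet(np.ones(5), size=200)
gaps = np.array([objective(q, ps, theta) - j_star for q in candidates])
kls = np.array([kl(q, q_star) for q in candidates])

print("smallest objective gap over candidates:", gaps.min())
```

Every gap is nonnegative and equals the KL divergence from the candidate to the e-mixture, which is why the minimizer can serve as a definition of the e-mixture when the explicit form 3.3 is unavailable.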
Let the projection of $p_0$ onto $E$ be $\hat{p}^e_w$, namely,

$$\hat{p}^e_w = \arg\min_{p \in E} D_{KL}(p_0, p).$$

Since we optimize $\theta$ so that $\hat{p}^e_w$ is closest to $p_0$ and since $\hat{p}^e_w$ is specified by $w$, the optimal $\theta$ depends on the given $w$. For simplicity, we write the mixture parameter $\theta$ instead of $\theta(w)$. We perform our proposed e-mixture estimation algorithm on this surface, as is typical in the conventional parametric e-mixture estimation method, by satisfying the two requisite conditions for the e-mixture described in section 3.3.

Finally, we note that equation 3.4 does not indicate the m-mixture of the auxiliary pdfs because each weight $w_k$ in equation 3.4 contains $\theta$ implicitly. Since the e- and m-mixtures have different restrictions with respect to $\theta$, restrictions on the weights also differ depending on the mixture model. Equation 3.4 is the m-representation of the e-mixture, and we develop an algorithm to optimize the weight vector $w$ in equation 3.4 for the e-mixture estimation.

3.3 Requisite Conditions for the e-Mixture. As described in section 3.2, the gradient descent method cannot be used for the nonparametric e-mixture estimation because the e-representation of the nonparametric mixture model (equation 3.3) is not well defined. Therefore, we consider a geometrical algorithm that needs only the KL divergence. To estimate the e-mixture in a nonparametric manner, the following two conditions are imposed:

1. $\hat{p}^e$ is the e-mixture of auxiliary empirical distributions $p_i$, $i = 1, \ldots, N$.
2. $\hat{p}^e$ is the projection of $p_0$ onto the subspace $E$.

Let us consider the first condition. According to theorem 3 (characterization of the e-mixture), the pdf that minimizes the weighted KL divergence is written in the form of the e-mixture. Therefore, if we obtain the weight $w$ of $\hat{p}^e$ (in the representation of equation 3.4) that minimizes the weighted KL divergence in equation 2.12, $\hat{p}^e$ is regarded as the e-mixture of the auxiliary pdfs with the given mixture ratios $\theta$.

For the second condition, we consider the subspace $E$, which includes $p_i$, $i = 1, \ldots, N$.
Our objective is to find the closest pdf $\hat{p}^e \in E$ from $p_0$ in the sense of the KL divergence. Then any pdf $q \in E$ satisfies theorem 1 (Pythagorean relation) when we consider $p = p_0$, $q = q$, and $r = \hat{p}^e$ in equation 2.3. Moreover, because each auxiliary pdf $p_i$ is in $E$, the following equation holds:

$$D_{KL}(p_0, \hat{p}^e) + D_{KL}(\hat{p}^e, p_i) - D_{KL}(p_0, p_i) = 0. \tag{3.6}$$

The subspaces $P$ and $E$ are those to which $\hat{p}^e$ belongs. The former is the search space for $w$ given $\theta$ from equation 2.12 (condition 1), while the latter is the search space for $\theta$ given the weight $w$ (condition 2). To find the optimal $\hat{p}^e$ in $P$ and $E$, a nonparametric KL divergence estimator is thus required.
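One estimator of this kind is the weighted nearest-neighbor estimator defined in section 3.4 (equation 3.12), which needs only distances between sample points. The sketch below is a minimal, unoptimized reading of that definition under stated assumptions: the small tolerance added to α, the fallback to the nearest neighbor when no point satisfies the quantile condition, the synthetic gaussian samples, and all function names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def eps_alpha(x, points, weights, alpha):
    """Distance from x to its alpha-quantile point in a weighted sample."""
    dist = np.linalg.norm(points - x, axis=1)
    order = np.argsort(dist)
    cum = np.cumsum(weights[order])
    # k_hat = number of nearest points whose cumulative weight stays <= alpha
    # (the 1e-12 tolerance guards against floating-point round-off; assumed).
    k_hat = int(np.searchsorted(cum, alpha + 1e-12, side="right"))
    k_hat = max(k_hat, 1)  # assumption: fall back to the nearest neighbor
    return dist[order[k_hat - 1]]

def kl_estimate(x, w, y, v, alpha):
    """d * sum_k w_k [log eps(x_k, D'_v) - log eps(x_k, D_w without x_k)]."""
    n, d = x.shape
    total = 0.0
    for k in range(n):
        keep = np.arange(n) != k
        w_loo = w[keep] / (1.0 - w[k])  # renormalized leave-one-out weights
        eps_self = eps_alpha(x[k], x[keep], w_loo, alpha)
        eps_cross = eps_alpha(x[k], y, v, alpha)
        total += w[k] * (np.log(eps_cross) - np.log(eps_self))
    return d * total

# Synthetic check with uniform weights (all values below are assumptions).
rng = np.random.default_rng(0)
n = 400
x = rng.normal(0.0, 1.0, size=(n, 1))
y_same = rng.normal(0.0, 1.0, size=(n, 1))  # same distribution as x
y_far = rng.normal(2.0, 1.0, size=(n, 1))   # shifted distribution
u = np.full(n, 1.0 / n)

kl_same = kl_estimate(x, u, y_same, u, alpha=0.05)
kl_far = kl_estimate(x, u, y_far, u, alpha=0.05)
print("same distribution:", kl_same, "shifted distribution:", kl_far)
```

The constant $\log(c_d / \alpha)$ appears in both the entropy and cross-entropy estimators and cancels in their difference, so only the two neighbor distances enter the estimate.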

3.4 Nonparametric KL Divergence Estimator. Divergence estimators based on the k-nearest neighbor method have been widely investigated (Wang, Kulkarni, & Verdú, 2005, 2009), and such methods have been extended to deal with weighted observations (Hino & Murata, 2013). Suppose we are given two weighted data sets $D_w = \{D, w\} = \{(x_k, w_k)\}_{k=1}^{n}$ and $D'_v = \{D', v\} = \{(y_k, v_k)\}_{k=1}^{m}$, whose empirical distributions are expressed by equation 1.3. We denote the index of the $k$th nearest point in $D_w$ from an inspection point $x'$ by $(k)$. Then we define the quantile of $x_{(k)}$ with respect to the inspection point by $\alpha = \sum_{h=1}^{k} w_{(h)}$. Conversely, when the quantile $\alpha$ is specified, the point $x_{(\hat{k})}$ in $D_w$, where $\hat{k} = \max\{k \mid \sum_{h=1}^{k} w_{(h)} \le \alpha\}$, is called the α-quantile point of $x'$. Let $\varepsilon_\alpha(x', D_w)$ be the Euclidean distance between the inspection point $x'$ and its α-quantile point $x_{(\hat{k})}$ in $D_w$, and let the $\varepsilon_\alpha$-ball $b(x', \varepsilon_\alpha)$ be the hypersphere of radius $\varepsilon_\alpha(x', D_w)$ centered at $x'$. The probability mass of the $\varepsilon_\alpha$-ball centered at $x'$ is denoted by

$$\int_{b(x', \varepsilon_\alpha)} p(x) \, dx = \sum_{h=1}^{\hat{k}} w_{(h)}. \tag{3.7}$$

We obtain the following approximation formula by using Taylor's expansion of the integrand in equation 3.7,

$$\int_{b(x', \varepsilon_\alpha)} p(x) \, dx \simeq c_d \, (\varepsilon_\alpha(x', D_w))^d \, p(x'), \tag{3.8}$$

where $c_d = \pi^{d/2} / \Gamma(1 + d/2)$ is the volume of the unit ball in $\mathbb{R}^d$ and $\Gamma(x)$ is the gamma function. Now, the probability density at the inspection point $x'$ is estimated from the above expression as

$$\hat{p}_\alpha(x') = \frac{\alpha}{c_d \, (\varepsilon_\alpha(x', D_w))^d}. \tag{3.9}$$

From this density estimator, we can obtain an estimator of Shannon's differential entropy as

$$\hat{H}(D_w) = -\sum_{k=1}^{n} w_k \log \hat{p}_\alpha\left(x_k; D_w^{-k}\right) = \log \frac{c_d}{\alpha} + d \sum_{k=1}^{n} w_k \log \varepsilon_\alpha\left(x_k, D_w^{-k}\right), \tag{3.10}$$

where $D_w^{-k} = \{D \setminus x_k, \frac{1}{1 - w_k}\, w \setminus w_k\}$ is a renormalized weighted data set excluding $(x_k, w_k)$. When we have two weighted data sets, $D_w = \{D, w\} = \{(x_k, w_k)\}_{k=1}^{n}$ and $D'_v = \{D', v\} = \{(y_k, v_k)\}_{k=1}^{m}$, a cross-entropy estimator is written as

$$\hat{H}(D_w, D'_v) = \log \frac{c_d}{\alpha} + d \sum_{k=1}^{n} w_k \log \varepsilon_\alpha(x_k, D'_v). \tag{3.11}$$

By specifying the quantile $\alpha$, we can estimate the KL divergence in a nonparametric manner:

$$\hat{D}_{KL}(D_w, D'_v) = d \sum_{k=1}^{n} w_k \left\{ \log \varepsilon_\alpha(x_k, D'_v) - \log \varepsilon_\alpha\left(x_k, D_w^{-k}\right) \right\}. \tag{3.12}$$

3.5 Why the e-Mixture? Before deriving the proposed algorithm, we emphasize our motivation behind estimating the e-mixture. The major reason we consider the e-mixture is that it satisfies the principle of maximum entropy (Cover & Thomas, 1991). Let $F$ be a set of pdfs $p$ satisfying

$$p(x) \ge 0, \text{ with equality outside the support set } S, \tag{3.13}$$

$$\int_S p(x) \, dx = 1, \tag{3.14}$$

$$\int_S p(x) f_i(x) \, dx = \tau_i, \qquad i = 1, \ldots, N, \tag{3.15}$$

where $f_i$, $i = 1, \ldots, N$ are certain vector-valued functions and $\tau_i$, $i = 1, \ldots, N$ are moments with respect to the functions $f_i$.

Theorem 4 (maximum entropy and e-mixture) (Murata & Fujimoto, 2009). Let $q$ be the maximum entropy function in $F$, which is defined by

$$q(x) = \arg\max_{p \in F} \left\{ -\int p(x) \log p(x) \, dx \right\}. \tag{3.16}$$

Then the probability density function $q$ in equation 3.16 is written in the form of an exponential family as

$$q(x; \theta) = \exp\left\{ \sum_{i=1}^{N} \theta_i f_i(x) - b(\theta) \right\}. \tag{3.17}$$

This theorem shows that a pdf that belongs to an exponential family satisfies the principle of maximum entropy. Thus, the e-mixture also satisfies the principle of maximum entropy. Conversely, we now consider the problem of representing a mixture of the auxiliary pdfs $p_i$, $i = 1, \ldots, N$ as a pdf in the exponential family. The most natural definition of $f_i$ for including all auxiliary pdfs in this exponential form is $f_i = a \log p_i$, where $a$ is a constant. This form can be easily derived by considering mixture ratios $\theta_i$, $i = 1, \ldots, N$ that are all zero except for a single one. Therefore, the e-mixture is the most natural pdf in the form of exponential families that can represent all auxiliary pdfs $p_i$, $i = 1, \ldots, N$.

4 Algorithm for the Nonparametric e-Mixture

To estimate the e-mixture of distributions that approximate the target distribution, we first explore a general algorithmic framework. Considering the two conditions stated in section 3.3, it is natural to find $\hat{p}^e$ from algorithm 1. In the parametric setting, the projection of $p_0$ on the subspace $E$ spanned by the auxiliary distributions and the calculation of the mixture parameter $\theta$ for $\hat{p}^e$ are both straightforward. However, in our nonparametric setting, this task is difficult in general. To overcome this difficulty, we derive a specific nonparametric e-mixture estimation algorithm by using the techniques introduced in sections 2 and 3.1. Since the e-mixture of the nonparametric models is determined by the weight $w$ and the mixture ratio $\theta$, we denote the e-mixture by $\hat{p}^e_{w(\theta)}$ when it is clearer. Step 1 of the proposed algorithm computes the weight $w$ given a fixed mixture ratio $\theta$, and step 2 computes the mixture ratio $\theta$ given a fixed weight $w$. These two steps are computed alternately until $w$ converges. For the initialization, we start with uniform weights $\{w_k = 1/K\}_{k=1}^{K}$ and uniform ratios $\{\theta_i = 1/N\}_{i=1}^{N}$.

4.1 Step 1. In step 1, the e-mixture is estimated by using a given mixture ratio $\theta$ based on theorem 3:

Step 1: Compute the weight $w$ that minimizes the weighted KL divergence by using a fixed mixture ratio $\theta$:

\[ \min_{w} \sum_{i=1}^{N} \theta_i D_{KL}(\hat{p}^e_w, p_i). \tag{4.1} \]

By substituting the estimator defined by equation 3.12 into equation 4.1, we obtain the objective function:

\[ L(w) \equiv \sum_{i=1}^{N} \theta_i \hat{D}_{KL}(D_w, D_u^{(i)}) = \sum_{i=1}^{N} \theta_i \hat{H}(D_w, D_u^{(i)}) - \sum_{i=1}^{N} \theta_i \hat{H}(D_w) = \sum_{i=1}^{N} \theta_i \hat{H}(D_w, D_u^{(i)}) - \hat{H}(D_w), \tag{4.2} \]

where $u$ denotes the uniform weight; in other words, $D_u^{(i)} = \{x_k^{(i)}, 1/n_i\}_{k=1}^{n_i}$. By denoting

\[ g = \{g_k\}_{k=1}^{K}, \quad g_k = d \sum_{i=1}^{N} \theta_i \log \varepsilon_\alpha(y_k, D_u^{(i)}), \]
\[ f(w) = \{f_k\}_{k=1}^{K}, \quad f_k = -d \log \varepsilon_\alpha\left(y_k, D_w^{-k}\right), \]

equation 4.2 is written as

\[ L(w) = w^T \left( g + f(w) \right). \tag{4.3} \]

We obtain the weight vector $w$ by minimizing equation 4.3 according to the gradient projection method. Since the objective function $L(w)$ is discontinuous because of the index $\hat{k}$ in $\varepsilon_\alpha$, we evaluate $f(w)$ using the weight obtained in the previous iteration. That is, with some abuse of notation, the gradient at iteration $s + 1$ is approximated as

\[ \left. \frac{\partial}{\partial w} L(w) \right|_{w = w^{(s)}} = g + f(w^{(s)}) + w^{(s)T} \left. \frac{\partial f}{\partial w} \right|_{w = w^{(s)}} \simeq g + f(w^{(s)}), \]

and the updating formula is given by

\[ w^{(s+1)} \leftarrow \Pi\left[ w^{(s)} - \eta_{s,\gamma} \left( g + f(w^{(s)}) \right) \right], \tag{4.4} \]

where $\eta_{s,\gamma}$ is the learning rate and $\Pi$ is the projection operator for the weight $w^{(s+1)}$ onto the constraint set $\sum_{k=1}^{K} w_k^{(s+1)} = 1$, $w_k^{(s+1)} \ge 0$.

When we estimate the entropy and the cross-entropy by using equations 3.10 and 3.11, the parameter $\alpha$ is set to be small to reduce the bias. Then we can assume that for most of $w_k$, $k = 1, \ldots, K$, a small change in $w_k$ does not change the distance $\varepsilon_\alpha(y_k, D_w^{-k})$ and that the derivatives of $f(w)$ with respect to $w_k$, if they exist, are zero. For $w_k$ such that the derivative of $f(w)$ does not exist, a small change in $w_k$ causes a jump in the value of $\varepsilon_\alpha$. The size of the jump is of the order of the distance between two data points, and it is divided by $\varepsilon_\alpha$, the sum of the distances. Hence, we expect that the resultant absolute value is small, and the value is multiplied by $w_k < 1$. From this consideration, we omit the term $w^{(s)T} \left. \frac{\partial f}{\partial w} \right|_{w = w^{(s)}}$ from the gradient of the objective function. We can construct a pathological example in which the jump is too large to ignore, but in our experiments, the objective function $L(w)$ decreased monotonically under the approximated gradient descent. We set $\eta_{s,\gamma} = 0.9^{s} \times \frac{1}{\gamma \times K}$ in our experiment, where $\gamma$ is the parameter that determines the updating speed. We then run this gradient projection method until $w^{(s)}$ converges.

4.2 Step 2.

We next determine the mixture ratio $\theta$ so that $\hat{p}^e_w(\theta)$ becomes the projection of $p_0$ onto the subspace $E$. Figure 6 shows the geometrical interpretation of step 2. If $\hat{p}^e_w(\theta)$ is the projection of $p_0$ onto $E$, the dotted line between $p_0$ and $\hat{p}^e_w(\theta)$ and the solid line between $\log \hat{p}^e_w(\theta)$ and $\log p_i$ are orthogonal: namely, $p_i$, $\hat{p}^e$, and $p_0$ satisfy the Pythagorean relation. Based on this geometrical intuition, we update the mixture ratio $\theta$ according to the violation of the Pythagorean relation $r_i \equiv D_{KL}(p_0, \hat{p}^e_w(\theta)) + D_{KL}(\hat{p}^e_w(\theta), p_i) - D_{KL}(p_0, p_i)$, which takes the value 0 when the mixture ratio $\theta_i$ is optimal. The two top triangles in Figure 6 show the relations among $p_0$, $\hat{p}^e_w(\theta)$, and $p_i$, $i \in \{1, 4\}$. The upper left panel shows the case of the acute-angled triangle, where $r_1$ is positive.
In this case, $\theta_1$ is smaller than optimal and $\hat{p}^e_w(\theta)$ should be closer to $p_1$. Conversely, the upper right panel shows the case of the obtuse-angled triangle, where $\theta_4$ is larger than optimal. To reflect this geometrical comprehension, we introduce a weakly increasing piecewise linear function $\phi$ of $r_i$, as shown in Figure 7. The function $\phi$ controls $\theta$, reflecting the degree of the violation of the Pythagorean relation $r_i$. The function $\phi$ increases (decreases) $\theta_i$ when $r_i$ satisfies $r_i > 0$ ($r_i < 0$). By using this function $\phi$, we update $\theta_i$ as follows.

Step 2: Update the mixture ratio $\theta = \{\theta_i\}_{i=1}^{N}$ according to the violation of the Pythagorean relation:

\[ \theta_i \leftarrow \theta_i \times \phi(\hat{r}_i), \tag{4.5} \]

where $\hat{r}_i$ is estimated by the estimator in equation 3.12 as

\[ \hat{r}_i = \hat{D}_{KL}(D_u^{(0)}, D_w) + \hat{D}_{KL}(D_w, D_u^{(i)}) - \hat{D}_{KL}(D_u^{(0)}, D_u^{(i)}), \tag{4.6} \]

and $\phi$ is defined as

\[ \phi(\hat{r}_i) = \begin{cases} c \times \hat{r}_i + 1 & (\underline{r} \le \hat{r}_i \le \overline{r}), \\ c \times \underline{r} + 1 & (\hat{r}_i < \underline{r}), \\ c \times \overline{r} + 1 & (\hat{r}_i > \overline{r}), \end{cases} \tag{4.7} \]

where $c$ is a positive constant. Note that the function $\phi$ satisfies $\phi(0) = 1$. In our experiments, we use $\underline{r} = -0.95/c$, $\overline{r} = 0.95/c$, and $c \in (0, 1)$. After updating $\theta_i$ for all $i = 1, \ldots, N$, they are normalized so that $\sum_{i=1}^{N} \theta_i = 1$.

These two steps search for the closest e-mixture of the auxiliary pdfs from $D^{(0)}$ by assigning weights to the samples in the set $D$. Since we assume that the weight vector $w$ satisfies the definition of a probability distribution, our proposed algorithm eventually provides a way in which to sample from the set $D$.

Figure 6: Step 2 of the algorithm. Each empirical distribution $p_i$ is on the curved surface with the dotted lines, which is the subspace $E$ of the e-mixtures of the empirical distributions. In this space, $p_0$, each data set $p_i$, and the weighted empirical distribution $\hat{p}^e_w(\theta)$ form triangles. To make $\hat{p}^e_w(\theta)$ the projection of $p_0$ onto $E$, this triangle must satisfy the Pythagorean relation.

Figure 7: The weakly increasing piecewise linear function $\phi(r_i)$, with $\underline{r} = -0.95/c$, $\overline{r} = 0.95/c$, and $c = 0.5$.
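The two update rules of this section (equations 4.4, 4.5, and 4.7) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the vectors `g`, `f`, and `r_hat` are assumed to have been computed beforehand with the estimator in equation 3.12, and the projection $\Pi$ is assumed to be the Euclidean projection onto the probability simplex, which the text does not specify.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : sum(w) = 1, w >= 0}.

    Assumption: the text only says that Pi projects onto this set;
    the standard sort-based Euclidean projection is used here.
    """
    u = sorted(v, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:  # holds for a prefix of j, so the last hit is the threshold
            tau = t
    return [max(x - tau, 0.0) for x in v]


def step1_update(w, g, f, s, gamma):
    """One iteration of eq. 4.4 with the approximated gradient g + f(w)."""
    K = len(w)
    eta = 0.9 ** s / (gamma * K)  # learning rate eta_{s,gamma}
    moved = [wk - eta * (gk + fk) for wk, gk, fk in zip(w, g, f)]
    return project_simplex(moved)


def phi(r, c=0.5):
    """Clipped linear function of eq. 4.7 with bounds -0.95/c and 0.95/c."""
    r_lo, r_hi = -0.95 / c, 0.95 / c
    return c * min(max(r, r_lo), r_hi) + 1.0


def step2_update(theta, r_hat, c=0.5):
    """Multiplicative update of eq. 4.5 followed by renormalization."""
    scaled = [t * phi(r, c) for t, r in zip(theta, r_hat)]
    total = sum(scaled)
    return [t / total for t in scaled]
```

Because $\phi$ is clipped to the interval $[0.05, 1.95]$, the multiplicative update keeps every positive $\theta_i$ positive, so the renormalization in `step2_update` is always well defined.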

The hyperparameter $\gamma$ is introduced to control the update speed in equation 4.4, and $c$ is used to control the penalty for violating the Pythagorean relation in equation 4.7; both were tuned in our preliminary experiments. As for the computational cost of the proposed algorithm, the most time-consuming part is estimating the KL divergence by using equation 3.12, which requires sorting $K$ data points $K$ times, amounting to $O(K^2 \log K)$ computation. Note that this is mainly the cost for sorting, and it is required only in the proposed algorithm. We close the section with the pseudocode of the nonparametric e-mixture estimation algorithm in algorithm 2.

5 Experiments

We conducted a set of experiments on three synthetic data sets and one real-world data set to evaluate the proposed algorithm. In the real-world data set, we considered the situation where we have only a few samples from the target pdf, and our algorithm is used for data augmentation in the classification problem. In all the experiments, to reduce the computational cost of estimating the KL divergence, we sampled half the number of data points from $D$, where $D$ is redefined as the sampled subset of the original.

5.1 Synthetic Data

5.1.1 Simple Setup.

First, we used synthetic data to demonstrate how our proposed algorithm works. We show that our nonparametric e-mixture estimation algorithm works when the underlying distributions are gaussian. Suppose we have a set of auxiliary data sets $\{D^{(i)}\}_{i=1}^{N}$. Each data set has 2000 data points sampled from a gaussian distribution $N(\mu_i, \Sigma_i)$. We are also given a target data set $D^{(0)}$, which contains $n_0 = 200$ data points from the e-mixture $p_e = N(\mu_e, \Sigma_e)$ of the auxiliary pdfs $\{N(\mu_i, \Sigma_i)\}_{i=1}^{N}$. The mean vector and covariance matrix of the e-mixture $p_e$ are calculated by

\[ \mu_e = \Sigma_e \sum_{i=1}^{N} \theta_i \Sigma_i^{-1} \mu_i \tag{5.1} \]

and

\[ \Sigma_e^{-1} = \sum_{i=1}^{N} \theta_i \Sigma_i^{-1}, \tag{5.2} \]

respectively. For illustration purposes, we consider a two-component two-dimensional ($N = 2$, $d = 2$) gaussian mixture. Our aim is to generate 2000 points from the nonparametric e-mixture distribution constructed from $D^{(0)}$, $D^{(1)}$, and $D^{(2)}$. We set the quantile value to $\alpha = 0.003$ in equation 3.12, the updating speed to $\gamma = 20$ in equation 4.4 to facilitate the convergence, and the coefficient to $c = 0.5$ in equation 4.7. The value of the quantile $\alpha$ is determined so that it contains 5 to 10 data points from an inspection point when the weight $w$ is uniform.

The top panels of Figure 8A show the data sets $D^{(1)}$, $D^{(2)}$, and 2000 points sampled from $p_e$. The bottom panels show $D^{(0)}$, the uniformly weighted empirical distribution $\hat{p}_u(x) = \frac{1}{K} \sum_{i=1}^{K} \delta(x - x_i)$, $K = |D| = 4{,}200$, $D = D^{(0)} \cup D^{(1)} \cup D^{(2)}$, and the result of the nonparametric e-mixture estimation, respectively. The size of each mark in the bottom panels represents the weight of each sample. The estimated $\theta$ by iterations is shown in Figure 8B. The horizontal dotted lines are the true values of the mixture ratio $\theta = \{0.8, 0.2\}$.
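Equations 5.1 and 5.2 say that the e-mixture of gaussians averages the precision matrices, not the covariances. A minimal NumPy sketch follows; the function name and the example means and covariances are hypothetical (only the ratio $\theta = \{0.8, 0.2\}$ is taken from the text), and the code assumes that every $\Sigma_i$ is invertible.

```python
import numpy as np


def gaussian_e_mixture(thetas, means, covs):
    """Mean and covariance of the e-mixture of gaussians (eqs. 5.1 and 5.2)."""
    precisions = [np.linalg.inv(np.asarray(S, dtype=float)) for S in covs]
    # Eq. 5.2: the precision of the e-mixture is the weighted sum of precisions.
    precision_e = sum(t * P for t, P in zip(thetas, precisions))
    cov_e = np.linalg.inv(precision_e)
    # Eq. 5.1: precision-weighted average of the means.
    weighted_means = sum(
        t * (P @ np.asarray(m, dtype=float))
        for t, P, m in zip(thetas, precisions, means)
    )
    mu_e = cov_e @ weighted_means
    return mu_e, cov_e


# Hypothetical two-component example with theta = {0.8, 0.2}.
mu_e, cov_e = gaussian_e_mixture(
    [0.8, 0.2],
    means=[[0.0, 0.0], [2.0, 0.0]],
    covs=[np.eye(2), np.eye(2)],
)
```

With identical covariances, as in this example, the e-mixture keeps that covariance and its mean is the $\theta$-weighted average of the component means.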
In this experiment, the underlying distributions for both target and auxiliary are gaussians, which are characterized by the means and the covariance matrices. Figure 8C shows the contours of the empirical covariance matrices centered at the empirical means, which are estimated by using the obtained weights, to see the behavior of the estimates with the progress of the algorithm. Each ellipse includes 90% of the probability mass for the gaussian distribution. The solid ellipses express the original gaussian distributions $p_1 = N(\mu_1, \Sigma_1)$ and $p_2 = N(\mu_2, \Sigma_2)$. The solid gray ellipse is the ground-truth e-mixture $p_e$, while all the dotted ellipses express the estimated e-mixtures $\hat{p}^e_w(\theta)$ by iterations. These experimental results suggest that the proposed algorithm successfully approximates the ground-truth distribution for the target data set as the nonparametric e-mixture constructed from the given data sets with the appropriate weights.

Figure 8: (A) Top panels show the scatter plots of the data sets $D^{(1)}$, $D^{(2)}$ and 2000 points sampled from $p_e$. Bottom panels plot the target data set $D^{(0)}$, the uniformly weighted empirical distribution $\hat{p}_u$, and the estimated $\hat{p}^e_w(\theta)$. (B) Mixture ratios $\theta_1$ and $\theta_2$ by iterations. (C) Estimated gaussian distributions by iterations.

5.1.2 The e-Mixture of the pdfs Represented by the Gaussian Mixtures.

In the second experiment with a synthetic data set, we assess how the proposed algorithm performs in the specific case where the auxiliary pdfs are nongaussian. In particular, we consider the case where the auxiliary pdfs are multimodal, represented by the m-mixture of gaussians. We use two five-component GMMs as the auxiliary pdfs. We obtain two data sets $D^{(1)}$ and $D^{(2)}$, each of which has 2500 points sampled from a five-component GMM, which is not included in the exponential family; the data points of these two data sets form an S-shape. In this experiment, we use the e-mixture of these two GMMs as the target pdf. The left panel of Figure 9 shows the data sets $D^{(1)}$, $D^{(2)}$ (marked ×), and $D^{(0)}$, which is sampled from the target pdf.

The points of the target data set $D^{(0)}$ were sampled as follows. Initially we created two five-component GMMs: $p_i(x) = \sum_{j=1}^{5} \frac{1}{5} N(x; \mu_j, \Sigma_j)$, $i = 1, 2$. Detailed values of the parameters are shown in appendix C. Their density outputs are shown in panels a and b in Figure 9, respectively. Then we obtain the e-mixture of GMMs $p_e$, which is shown in panel c, in the form of the density function with the mixture ratio $\theta = \{0.5, 0.5\}$, that is, $p_e = \exp(0.5 \times \log p_1 + 0.5 \times \log p_2 - b(\theta))$. We sampled the data points from $p_e$ by using the rejection sampling method. The experimental results, reported in the density outputs of the right panels of Figure 9, show that the proposed algorithm weights the data points correctly to make the e-mixture of GMMs.
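Rejection sampling from an e-mixture does not require the normalizer $b(\theta)$. The sketch below shows one possible construction under assumptions that go beyond the text: the proposal distribution is not stated in this section, so the m-mixture $\sum_i \theta_i p_i$ is used here, which is a valid envelope because the weighted arithmetic-geometric mean inequality gives $\prod_i p_i^{\theta_i} \le \sum_i \theta_i p_i$ for ratios on the simplex. The function names are hypothetical, and one-dimensional gaussians stand in for the GMMs to keep the example short.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of a one-dimensional gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def sample_e_mixture(components, thetas, n, rng):
    """Draw n samples from p_e proportional to prod_i p_i(x)^theta_i.

    components: list of (logpdf, sampler) pairs, one per auxiliary pdf.
    thetas: positive mixture ratios summing to one.
    Proposal (an assumption of this sketch): the m-mixture sum_i theta_i p_i.
    """
    samples = []
    while len(samples) < n:
        # Draw x from the m-mixture proposal.
        _, sampler = rng.choices(components, weights=thetas, k=1)[0]
        x = sampler(rng)
        log_dens = [logpdf(x) for logpdf, _ in components]
        # Unnormalized log target: sum_i theta_i log p_i(x).
        log_target = sum(t * lp for t, lp in zip(thetas, log_dens))
        # Log proposal density via a stabilized log-sum-exp.
        m = max(log_dens)
        log_proposal = m + math.log(
            sum(t * math.exp(lp - m) for t, lp in zip(thetas, log_dens))
        )
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        if math.log(u) + log_proposal <= log_target:
            samples.append(x)
    return samples
```

For two unit-variance gaussians centered at $-1$ and $1$ with $\theta = \{0.5, 0.5\}$, the e-mixture is the standard gaussian, which gives a simple sanity check on the accepted samples.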
If we use the m-mixture algorithm in this experiment, we obtain the result shown in panel e of Figure 9, which is completely different from the ground truth. The nonparametric m-mixture algorithm is reported in appendix D. It is difficult to represent the true target pdf shown in panel c by using the m-mixture of the two GMMs, which mixes the two pdfs like a logical OR, whereas the true target pdf behaves like a logical AND of the two GMMs. Since the proposed algorithm can weight every data point, it is possible to reduce the weights of those points not related to the true target pdf, as shown in panel f.

Figure 9: (Left) The data sets $D^{(1)}$, $D^{(2)}$, and 500 points sampled from the e-mixture $p_e$ of the five-component GMMs. The right panels show the density outputs of (a) $D^{(1)}$; (b) $D^{(2)}$; (c) the density of $p_e$; (d) $D^{(0)}$, which is sampled by rejection sampling from the e-mixture of GMMs; (e) the result of the nonparametric m-mixture estimation; (f) the result of the proposed algorithm. We use the quantile value $\alpha = 0.008$, updating speed $\gamma = 10$, and the parameter for the violation of the Pythagorean relation $c = 0.5$.

5.1.3 Estimation of the Mixture Ratio on High-Dimensional Data.

In this section, we show simple experimental results for artificial high-dimensional data. We evaluate how the proposed algorithm estimates the mixture ratios on high-dimensional data. We build five different auxiliary data sets $\{D^{(i)}\}_{i=1}^{5}$ that contain 1000 data points sampled from the $d$-dimensional gaussian distributions $p_i$, $i = 1, \ldots, 5$ with different means $\{\mu_i\}_{i=1}^{5}$ and different covariances $\{\Sigma_i\}_{i=1}^{5}$, respectively. Covariance $\Sigma_i$ is a diagonal matrix whose diagonal elements take the same value generated from the uniform distribution on the interval $[0.5, 1]$. Mean $\mu_i$ is a $d$-dimensional vector generated from a uniform distribution on the surface of a $d$-dimensional sphere. The target data set also contains $n_0$ data points from the gaussian distribution, which is the e-mixture $p_e$ of the auxiliary pdfs $p_i$, $i = 1, \ldots, 5$ with a mixture ratio $\theta$. The parameters $\mu_e$ and $\Sigma_e$ of $p_e$ are calculated by using equations 5.1 and 5.2. We compare the ground-truth mixture ratio $\theta$ with the mixture ratio estimated by the proposed algorithm. The ground-truth mixture ratio is a realization of the Dirichlet distribution,

\[ p(x; \beta) = \frac{\Gamma\left( \sum_{i=1}^{d} \beta_i \right)}{\prod_{i=1}^{d} \Gamma(\beta_i)} \prod_{i=1}^{d} [x]_i^{\beta_i - 1}, \tag{5.3} \]

where $[x]_i$ indicates the $i$th element of variable $x$ and $\beta = \{\beta_i\}_{i=1}^{d}$ is the parameter of the Dirichlet distribution. We use $\beta = \{1, 1, \ldots, 1\} \in \mathbb{R}^d$. We run this experiment 10 times. The average distances between the ground truth and estimated results are shown in Figure 10. We use the cosine distance

\[ d_{\cos}(\theta, \theta') = 1 - \frac{\sum_{i=1}^{d} \theta_i \theta'_i}{\sqrt{\sum_{i=1}^{d} \theta_i^2} \sqrt{\sum_{i=1}^{d} \theta_i'^2}} \tag{5.4} \]

as the distance measure between the mixture ratios. In this experiment, we evaluate the extent to which the number of target data points $n_0$ and the dimension $d$ affect the performance of the estimation. To do so, we vary $n_0 = \{50, 100, 200, 500, 1000\}$ and $d = \{10, 20, 50, 100\}$. The results in Figure 10 show that the parameters estimated by using the proposed algorithm are close to the ground-truth parameters. Examples of the estimated mixture ratio of 10 trials are shown in Figures 11 and 12. We also observe that the performance of the proposed algorithm does not improve when $n_0$ increases. In contrast, the performance worsens when the dimension $d$ increases.

Figure 10: Box plots of the distances between the true mixture ratio and the estimated mixture ratio. We use the quantile value $\alpha = 0.008$, updating speed $\gamma = 20$, and coefficient $c = 0.3$.

Figure 11: The bar plot of 10 trials in the case of $n_0 = 200$, $d = 5$. (Upper panels) The true mixture ratios. (Lower panels) The estimated mixture ratios.

Figure 12: The bar plot of 10 trials in the case of $n_0 = 200$, $d = 100$. (Upper panels) The true mixture ratios. (Lower panels) The estimated mixture ratios.

In general, to estimate the mixture ratio accurately, each auxiliary pdf must be isolated from the subspace of the other auxiliary pdfs; that is, the auxiliary pdfs should be in a general position. Figure 13 shows examples of failure cases where the auxiliary pdfs are not in a general position. To evaluate how a failure case affects the estimation results, we conduct an additional experiment. We use similar experimental settings to those explained in the previous paragraph. Here, we construct the fifth auxiliary pdf $p_5$ to be an e-mixture of the other four auxiliary pdfs $\{p_i\}_{i=1}^{4}$ with gaussian noise $N(0, 0.1)$ for the elements of mean vector $\mu_i$. We evaluate the estimation performance in the $n_0 = 1000$, $d = 10$ case. The hyperparameters are the same as those in the previous experiment. The result in Figure 14 therefore shows the case where an alternative representation of the e-mixture exists, which leads to the wrong estimation.

Figure 13: Schematics of failure cases, that is, auxiliary pdfs that are not in a general position. This figure shows an image of the specific auxiliary pdf(s) near the model subspace of the e-mixture of the other auxiliary pdfs.

5.2 EEG Data Set.

We developed a method to represent the target pdf in a nonparametric manner and to provide a way in which to approximately sample from the target pdf by using the auxiliary data sets. In this experiment, we consider the brain-computer interface (BCI) task, which is formulated as a classification problem by using brain signals. We evaluate the usefulness of our approach by applying it to the BCI task, in which parametric classifiers are adopted. We use calibration data from NIPS BCI Competition IV data set 1 (motor imagery, uncued classifier application) (Blankertz, Dornhege, Krauledat, Mueller, & Curio, 2007) (see note 1). We have continuous signals of 59 EEG channels of five subjects denoted by i, ii, iii, iv, and v (see note 2). They provide EEG signals corresponding to two classes: the motor imagery of the left (L) and right (R) hands. In the experiment, we aim to obtain a good feature extractor for the classification task by using data sets weighted by the proposed algorithm.

Note 1: http://www.bbci.de/competition/iv/.

Note 2: NIPS BCI Competition IV data set 1 contains seven subjects denoted a, b, c, d, e, f, and g. We use only subjects who are not selected for the "foot" class. We renumber these subjects as b = i, c = ii, d = iii, e = iv, and g = v.

To extract the features from the EEG signals, the common spatial pattern (CSP) (Koles, Lazar, & Zhou, 1990) is one of the most popular methods. The basic idea of the CSP is to find a transformation matrix $B = [b^{(1)}, \ldots, b^{(D)}] \in \mathbb{R}^{59 \times D}$ that simultaneously diagonalizes both class covariance matrices $S^L$ and $S^R$. The covariance matrices $S^c$, $c \in \{L, R\}$ are defined as

\[ S^c = \frac{1}{N^c} \sum_{k=1}^{N^c} \frac{X_k^{cT} X_k^c}{\operatorname{tr}(X_k^{cT} X_k^c)}, \quad c \in \{L, R\}, \tag{5.5} \]

where $X_k^c \in \mathbb{R}^{T \times 59}$ is the EEG data set from the $k$th trial of class $c \in \{L, R\}$. In other words, we treat the normalized covariance matrix $\frac{X_k^{cT} X_k^c}{\operatorname{tr}(X_k^{cT} X_k^c)} \in \mathbb{R}^{59 \times 59}$, $c \in \{L, R\}$ as a datum in this experiment. To maximize the differences between the features of the different classes, the CSP finds a transformation matrix $B$ whose $d$th column vector $b^{(d)}$ maximizes the ratio

\[ \frac{b^{(d)T} S^L b^{(d)}}{b^{(d)T} S^R b^{(d)}}. \tag{5.6} \]

A transformation matrix $B$ can be easily found by the generalized eigenvalue decomposition. The projection of $X_k^c$ by $B$ is expressed as $Z_k^c = X_k^c B \in \mathbb{R}^{T \times D}$. The $d$th element of a feature vector $\zeta_k = \{\zeta_k^{(1)}, \ldots, \zeta_k^{(D)}\}$ is the normalized variance of each spatial-filtered trial $Z_k^c$,

\[ \zeta_k^{(d)} = \log \frac{\frac{1}{T} \sum_{t=1}^{T} \left( z_{k,t}^{(d)} \right)^2 - \left( \frac{1}{T} \sum_{t=1}^{T} z_{k,t}^{(d)} \right)^2}{\sum_{h=1}^{D} \left[ \frac{1}{T} \sum_{t=1}^{T} \left( z_{k,t}^{(h)} \right)^2 - \left( \frac{1}{T} \sum_{t=1}^{T} z_{k,t}^{(h)} \right)^2 \right]}, \tag{5.7} \]

where $z_{k,t}^{(d)}$ is the $t$th element of $z_k^{(d)}$, the $d$th column vector of the $k$th spatial-filtered trial $Z_k^c$.

Figure 14: The bar plot of 10 trials in the case of $n_0 = 1000$, $d = 10$. In this case, the auxiliary pdfs are not in a general position. (Upper panels) The true mixture ratios. (Lower panels) The estimated mixture ratios.

We assume that we are given only $n_0 = 60$ trials as the target data set, while we are given full trials (180–190 trials) for the other four subjects as the auxiliary data sets. The target data set has $n_0/2$ trials for each class $c$. Our algorithm finds the weights $w_k^c$ of the $k$th normalized covariance matrix in the whole set of trials $D^c = \bigcup_{i=0}^{4} D^{c,(i)}$ of class $c$, where $D^{c,(0)}$ is composed of the normalized covariance matrices for the target subject and $D^{c,(i)}$, $i = 1, \ldots, 4$ are composed of those for the auxiliary subjects. By using the estimated weights, we compute the weighted average of the covariance matrices instead of equation 5.5 as

\[ S^c = \sum_{k=1}^{N^c} w_k^c \frac{X_k^{cT} X_k^c}{\operatorname{tr}(X_k^{cT} X_k^c)}, \quad c \in \{L, R\}. \tag{5.8} \]

We obtain the transformation matrix $B^e$ by using the weighted average of the covariance matrices in equation 5.8. For the distance between the data points for the nonparametric KL estimator in equation 3.12, we adopt the symmetrized gaussian KL divergence defined by

\[ D_{KL}^{sym}(S, S') = \frac{1}{2} \left( D_{KL}(S, S') + D_{KL}(S', S) \right) = \frac{1}{4} \operatorname{tr}\left( (S')^{-1} S + (S)^{-1} S' \right) - \frac{59}{2}, \tag{5.9} \]

where the KL divergence for the gaussian distributions with covariances $S$ and $S'$ is defined by

\[ D_{KL}(S, S') = \frac{1}{2} \left[ \log \frac{|S'|}{|S|} + \operatorname{tr}\{(S')^{-1} S\} - 59 \right]. \tag{5.10} \]

Note that the symmetrized KL divergence serves as the distance between the data points, which is used for the KL divergence estimator in equation 3.12.

For the preprocessing, the time window from 2.0 s to 6.0 s is retained for each trial, with 8 Hz to 30 Hz bandpass filtering applied following the work of Tu and Sun (2011). We evaluate the improvement in classification accuracy by using the transformation matrix $B^e$, which is created from the augmented data. We compare $B^e$ (e-mixture) with three other conditions: small, uniform, and reg. In the small case, the transformation matrix $B$ is computed by using only target trials $D^{(0)}$. In the uniform case, $B$ is computed by using the uniform weights in equation 5.8. The reg. case is a supervised approach proposed by Lotte and Guan (2010) that regularizes the estimated covariance matrix in equation 5.5 toward the average of the covariance matrices of the other subjects. All subjects have 180 to 190 trials. We train a linear SVM (Cortes & Vapnik, 1995) by using $n_0 = 60$ trials and evaluate the error rate by using the rest of the data for the subject.

Table 1 shows the average with one standard deviation of the 10-fold cross-validated test error rates of the SVM. For subjects i, ii, iv, and v, the e-mixture provides advantages compared with the small and uniform cases. The error rates of the proposed method are close to those of the reg. method, a supervised method for transfer learning. We conjecture that the principle of maximum entropy contributes to the improvement in the accuracy for these subjects. The e-mixture may avoid overfitting and thereby improve the generalization error.
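The distance of equations 5.9 and 5.10 is straightforward to compute for any pair of symmetric positive-definite matrices. The NumPy sketch below is illustrative only: the function names are hypothetical, and the matrix dimension is read from the input rather than fixed at 59. When the two directed divergences are averaged, the log-determinant terms cancel, so only the two trace terms and a constant depending on the dimension remain.

```python
import numpy as np


def gaussian_kl(S, S_prime):
    """KL divergence between zero-mean gaussians with covariances S and S_prime.

    Follows eq. 5.10, with the constant 59 replaced by the matrix dimension.
    """
    n = S.shape[0]
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_Sp = np.linalg.slogdet(S_prime)
    trace_term = np.trace(np.linalg.solve(S_prime, S))  # tr((S')^{-1} S)
    return 0.5 * (logdet_Sp - logdet_S + trace_term - n)


def sym_gaussian_kl(S, S_prime):
    """Symmetrized divergence: the average of the two directed divergences."""
    return 0.5 * (gaussian_kl(S, S_prime) + gaussian_kl(S_prime, S))


# Hypothetical 2 x 2 example.
S1 = np.diag([1.0, 2.0])
S2 = np.diag([2.0, 1.0])
d12 = sym_gaussian_kl(S1, S2)
```

The value is zero for identical matrices and symmetric in its arguments, which is what allows it to serve as the distance inside the nearest-neighbor-type estimator.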
For subject iii, there is no advantage over the others, suggesting that the support of the underlying pdfs of the trials of subject iii is not covered by those of the underlying pdfs of the trials of the other subjects. We see no statistically significant improvement by using the proposed method compared with the conventional method of Lotte and Guan (2010); however, the weight optimization under the proposed method improves the classification accuracy compared with that obtained when using uniform weights. In addition, the proposed method is comparable to the conventional method, which supports the approach presented in this letter.
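The weighted CSP step of this section (equations 5.6 and 5.8) can be sketched with NumPy alone. This is a hedged illustration, not the authors' code: the function names are hypothetical, $S^R$ is assumed to be positive definite, and the generalized eigenproblem $S^L b = \lambda S^R b$ is solved by whitening with a Cholesky factor of $S^R$. The sketch returns the filters with the largest ratios; it does not attempt to reproduce any further filter selection the experiments may have used.

```python
import numpy as np


def weighted_covariance(trials, weights):
    """Eq. 5.8: weighted average of trace-normalized trial covariances.

    trials: list of arrays of shape (T, channels); weights: one weight per trial.
    """
    S = 0.0
    for X, w in zip(trials, weights):
        C = X.T @ X
        S = S + w * C / np.trace(C)
    return S


def csp_filters(S_L, S_R, n_filters):
    """Filters b maximizing (b^T S_L b) / (b^T S_R b), as in eq. 5.6."""
    C = np.linalg.cholesky(S_R)         # S_R = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ S_L @ C_inv.T           # whitened problem: M v = lambda v
    eigvals, V = np.linalg.eigh(M)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_filters]
    B = C_inv.T @ V[:, order]           # map back: b = C^{-T} v
    return B, eigvals[order]


# Hypothetical 2-channel example.
B, ratios = csp_filters(np.diag([3.0, 1.0]), np.diag([1.0, 2.0]), n_filters=1)
```

Passing uniform weights $1/N^c$ to `weighted_covariance` recovers equation 5.5, which corresponds to the uniform condition of the comparison.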

Table 1: Test Errors Obtained by the small, uniform, reg. (Lotte & Guan, 2010), and Proposed e-Mixture Methods from the EEG Real-World Data Set.

Method      Subject i        Subject ii       Subject iii      Subject iv       Subject v
small       37.22 ± 8.67     33.73 ± 12.03    29.05 ± 12.27*   41.27 ± 9.81     40.87 ± 4.38
uniform     36.03 ± 11.70    31.35 ± 12.37    39.68 ± 9.86     40.63 ± 6.22     37.22 ± 4.29
reg.        31.51 ± 9.79     21.59 ± 9.47*    31.83 ± 11.22    30.08 ± 11.90    29.44 ± 12.40
e-mixture   29.12 ± 9.07*    23.57 ± 10.97    35.71 ± 10.77    30.00 ± 12.28*   27.22 ± 10.04*

Notes: The averages of the classification error rates are written in percentage terms with one standard deviation. An asterisk indicates the lowest error rate among the four methods. We use hyperparameters that minimized the cross-validation error calculated by using the training data. See appendix E for details.

6 Conclusion and Future Work

In this study, we proposed a nonparametric e-mixture estimation algorithm based on the geometric characterization of e-mixtures. First, we discussed the relationship between a certain pdf and the closest e-mixture of the auxiliary pdfs with the given mixture ratios in terms of the KL divergence as stated in theorem 3. Second, we gave a representation of a nonparametric e-mixture model by using a weighted empirical distribution. We used an estimator of the KL divergence between the weighted empirical distributions, thereby reducing the problem to finding the optimal weights of the weighted empirical distribution of all the given data that satisfy the condition of theorem 3. Consequently, we provided a way in which to use the samples included in the auxiliary data sets in a manner similar to data augmentation. The effectiveness of our proposed algorithm was then demonstrated by using three types of synthetic data sets, and its practical capability was shown based on the EEG data set within a variant of the transfer learning setup.
For the nonparametric e-mixture estimation, the support of the auxiliary data sets must properly overlap with that of the target data set, as we assumed in equation 3.1. Thus, it is preferable to use a sufficient number of auxiliary data sets for our algorithm. On the other hand, when many auxiliary data sets are given, it is computationally expensive to calculate all the distances between the data points. Hence, when many data sets are available, it is important to select an appropriate subset to estimate the e-mixture. In addition, how to tune the quantile $\alpha$ and the update speed $\gamma$ remains to be covered by future work.

One of the advantages of the e-mixture model is that negative mixture ratios can be considered because the mixture ratios here are in the exponential function. In this study, we derived a multiplicative update algorithm to estimate the mixture ratios, which does not support negative mixture ratios. Therefore, another interesting direction of future work might be to design an e-mixture estimation algorithm that can deal with negative mixture ratios.

Appendix A: Proof of Theorem 2

Proof 1 (characterization of the m-mixture). From the definition of the KL divergence in equation 2.1, we can rewrite the weighted KL divergence as

\[ \sum_{i=1}^{N} \theta_i D_{KL}(p_i, q) = \sum_{i=1}^{N} \theta_i \left[ \int p_i(x) \log p_i(x)\,dx - \int p_i(x) \log q(x)\,dx \right] \]
\[ = \sum_{i=1}^{N} \theta_i \int p_i(x) \log p_i(x)\,dx - \int \sum_{i=1}^{N} \theta_i p_i(x) \log q(x)\,dx \]
\[ = \sum_{i=1}^{N} \theta_i \int p_i(x) \log p_i(x)\,dx - \int p_m(x; \theta) \log q(x)\,dx. \]

Therefore, the optimization objective becomes

\[ \arg\min_q \sum_{i=1}^{N} \theta_i D_{KL}(p_i, q) = \arg\min_q \Bigg\{ \underbrace{\sum_{i=1}^{N} \theta_i \int p_i(x) \log p_i(x)\,dx}_{\text{does not depend on } q} - \int p_m(x; \theta) \log q(x)\,dx \Bigg\} \]
\[ = \arg\min_q \Bigg\{ \underbrace{\int p_m(x; \theta) \log p_m(x; \theta)\,dx}_{\text{does not depend on } q} - \int p_m(x; \theta) \log q(x)\,dx \Bigg\} \]
\[ = \arg\min_q D_{KL}(p_m, q) = p_m(x; \theta), \]

which proves theorem 2.
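The identity behind proof 1 can be checked numerically on discrete distributions, where the integrals become finite sums. The following pure-Python sketch is only a sanity check under assumed example values (the two distributions and the ratios are hypothetical): it verifies that the weighted divergence to any $q$ exceeds its value at the m-mixture by exactly $D_{KL}(p_m, q)$, which is nonnegative.

```python
import math
import random


def kl(p, q):
    """Discrete KL divergence sum_j p_j log(p_j / q_j); q must be positive."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def weighted_kl_to(q, ps, thetas):
    """Objective of theorem 2: sum_i theta_i D_KL(p_i, q)."""
    return sum(t * kl(p, q) for t, p in zip(thetas, ps))


def m_mixture(ps, thetas):
    """Discrete m-mixture: sum_i theta_i p_i."""
    return [sum(t * p[j] for t, p in zip(thetas, ps)) for j in range(len(ps[0]))]


# Hypothetical example distributions on three outcomes.
ps = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
thetas = [0.8, 0.2]
pm = m_mixture(ps, thetas)
best = weighted_kl_to(pm, ps, thetas)

rng = random.Random(1)
gaps = []
for _ in range(100):
    raw = [rng.random() + 1e-3 for _ in range(3)]
    q = [r / sum(raw) for r in raw]
    # By the proof, this gap equals D_KL(p_m, q) and is therefore nonnegative.
    gaps.append(weighted_kl_to(q, ps, thetas) - best - kl(pm, q))
```

Every entry of `gaps` should vanish up to floating-point error, mirroring the step in the proof where the $q$-independent term is replaced.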

Proof 2 (characterization of the e-mixture). From the definition of the KL divergence in equation 2.1, we can rewrite the weighted KL divergence as

\[ \sum_{i=1}^{N} \theta_i D_{KL}(q, p_i) = \sum_{i=1}^{N} \theta_i \left[ \int q(x) \log q(x)\,dx - \int q(x) \log p_i(x)\,dx \right] \]
\[ = \int q(x) \log q(x)\,dx - \int q(x) \sum_{i=1}^{N} \theta_i \log p_i(x)\,dx. \]

Here we recall the formula of the e-mixture,

\[ p_e(x; \theta) = \exp\left( \sum_{i=1}^{N} \theta_i \log p_i(x) - b(\theta) \right), \tag{A.1} \]

and rewrite the optimization objective as

\[ \arg\min_q \sum_{i=1}^{N} \theta_i D_{KL}(q, p_i) = \arg\min_q \left\{ \int q(x) \log q(x)\,dx - \int q(x) \sum_{i=1}^{N} \theta_i \log p_i(x)\,dx \right\} \]
\[ = \arg\min_q \Bigg\{ \int q(x) \log q(x)\,dx - \int q(x) \log p_e(x; \theta)\,dx - \underbrace{b(\theta)}_{\text{does not depend on } q} \Bigg\} \]
\[ = \arg\min_q D_{KL}(q, p_e) = p_e(x; \theta), \]

which proves theorem 3.

Appendix B: Proof of Theorem 3

We consider the e-mixture of two one-dimensional gaussian distributions $p_1 = N(\mu_1, \sigma_1^2)$ and $p_2 = N(\mu_2, \sigma_2^2)$ with the mixture ratio $\theta = \{\theta_1, \theta_2\}$. The mean $\mu_e$ and variance $\sigma_e^2$ of the e-mixture are calculated by

\[ \mu_e = \frac{\theta_1 \mu_1 \sigma_2^2 + \theta_2 \mu_2 \sigma_1^2}{\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2} \tag{B.1} \]

and

\[ \sigma_e^2 = \frac{\sigma_1^2 \sigma_2^2}{\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2}, \tag{B.2} \]

respectively. Considering that the KL divergence between the two gaussians $p_1$ and $p_2$ is calculated by

\[ D_{KL}(p_1, p_2) = \frac{1}{2} \left[ \log \frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} + \frac{(\mu_1 - \mu_2)^2}{\sigma_2^2} - 1 \right], \tag{B.3} \]

the KL divergence between the target pdf $p_0 = N(\mu_0, \sigma_0^2)$ and the e-mixture of $p_i$, $i \in \{1, 2\}$ is written as

\[ D_{KL}(p_0, p_e) = \frac{1}{2} \left[ \log \frac{\sigma_e^2}{\sigma_0^2} + \frac{\sigma_0^2}{\sigma_e^2} + \frac{(\mu_0 - \mu_e)^2}{\sigma_e^2} - 1 \right] \]
\[ = \frac{1}{2} \Bigg[ \log \sigma_1^2 \sigma_2^2 - \log\left( \sigma_0^2 \theta_1 \sigma_2^2 + \sigma_0^2 \theta_2 \sigma_1^2 \right) + \frac{\sigma_0^2 \theta_1 \sigma_2^2}{\sigma_1^2 \sigma_2^2} + \frac{\sigma_0^2 \theta_2 \sigma_1^2}{\sigma_1^2 \sigma_2^2} + \frac{Q(\theta)}{\sigma_1^2 \sigma_2^2 (\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2)} - 1 \Bigg], \]

where $\mu_e$ and $\sigma_e^2$ are substituted from equations B.1 and B.2 and

\[ Q(\theta) = \theta_1^2 (\mu_0 \sigma_2^2 - \mu_1 \sigma_2^2)^2 + 2 \theta_1 \theta_2 (\mu_0 \sigma_2^2 - \mu_1 \sigma_2^2)(\mu_0 \sigma_1^2 - \mu_2 \sigma_1^2) + \theta_2^2 (\mu_0 \sigma_1^2 - \mu_2 \sigma_1^2)^2. \]

The derivatives of the KL divergence with respect to $\theta_1$ and $\theta_2$ are also derived:

\[ \frac{\partial D_{KL}(p_0, p_e)}{\partial \theta_1} = \frac{1}{2} \Bigg[ -\frac{\sigma_0^2 \sigma_2^2}{\sigma_0^2 \theta_1 \sigma_2^2 + \sigma_0^2 \theta_2 \sigma_1^2} + \frac{\sigma_0^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2} + \frac{\sigma_1^2 \sigma_2^2 (\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2) \{ 2 \theta_1 (\mu_0 \sigma_2^2 - \mu_1 \sigma_2^2)^2 + 2 \theta_2 (\mu_0 \sigma_2^2 - \mu_1 \sigma_2^2)(\mu_0 \sigma_1^2 - \mu_2 \sigma_1^2) \}}{(\sigma_1^2 \sigma_2^2 (\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2))^2} - \frac{\sigma_1^2 \sigma_2^4 \, Q(\theta)}{(\sigma_1^2 \sigma_2^2 (\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2))^2} \Bigg], \]

\[ \frac{\partial D_{KL}(p_0, p_e)}{\partial \theta_2} = \frac{1}{2} \Bigg[ -\frac{\sigma_0^2 \sigma_1^2}{\sigma_0^2 \theta_1 \sigma_2^2 + \sigma_0^2 \theta_2 \sigma_1^2} + \frac{\sigma_0^2 \sigma_1^2}{\sigma_1^2 \sigma_2^2} + \frac{\sigma_1^2 \sigma_2^2 (\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2) \{ 2 \theta_2 (\mu_0 \sigma_1^2 - \mu_2 \sigma_1^2)^2 + 2 \theta_1 (\mu_0 \sigma_2^2 - \mu_1 \sigma_2^2)(\mu_0 \sigma_1^2 - \mu_2 \sigma_1^2) \}}{(\sigma_1^2 \sigma_2^2 (\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2))^2} - \frac{\sigma_1^4 \sigma_2^2 \, Q(\theta)}{(\sigma_1^2 \sigma_2^2 (\theta_1 \sigma_2^2 + \theta_2 \sigma_1^2))^2} \Bigg]. \]

We obtain the optimal mixture ratio $\theta$ by using the gradient projection method with these derivatives of the KL divergence.

Appendix C: Parameters of GMMs Used in Section 5.1

We show the parameters of the two five-component GMMs $p_1$ and $p_2$ used in the experiment in section 5.1.

Parameters of the five-component GMM $p_1$:

\[ \Sigma_1 = \begin{pmatrix} 0.06073770 & -0.03141351 \\ -0.03141351 & 0.02065030 \end{pmatrix}, \quad \mu_1 = (-0.4537003, 0.1612960) \]
\[ \Sigma_2 = \begin{pmatrix} 0.003546679 & -0.002938532 \\ -0.002938532 & 0.077841327 \end{pmatrix}, \quad \mu_2 = (-0.9583421, 0.9429759) \]
\[ \Sigma_3 = \begin{pmatrix} 0.04768301 & 0.03660208 \\ 0.03660208 & 0.03370499 \end{pmatrix}, \quad \mu_3 = (-0.6073216, 1.7477465) \]
\[ \Sigma_4 = \begin{pmatrix} 0.07457504 & -0.015512384 \\ -0.01551238 & 0.006812967 \end{pmatrix}, \quad \mu_4 = (0.2142159, 1.9344870) \]
\[ \Sigma_5 = \begin{pmatrix} 0.01855797 & -0.02997590 \\ -0.02997590 & 0.06283003 \end{pmatrix}, \quad \mu_5 = (0.8814549, 1.4168586) \]

Parameters of the five-component GMM $p_2$:

\[ \Sigma_1 = \begin{pmatrix} 0.05659599 & -0.03369993 \\ -0.03369993 & 0.02479202 \end{pmatrix}, \quad \mu_1 = (0.4871073, -0.1745825) \]
\[ \Sigma_2 = \begin{pmatrix} 0.0034401648 & -0.0008427231 \\ -0.0008427231 & 0.0779478411 \end{pmatrix}, \quad \mu_2 = (0.9530586, -0.9877180) \]
\[ \Sigma_3 = \begin{pmatrix} 0.05417223 & 0.03474041 \\ 0.03474041 & 0.02721578 \end{pmatrix}, \quad \mu_3 = (0.5725446, -1.7796748) \]
\[ \Sigma_4 = \begin{pmatrix} 0.0709275 & -0.02178290 \\ -0.0217829 & 0.01046051 \end{pmatrix}, \quad \mu_4 = (-0.2641227, -1.9209650) \]

(47) Nonparametric e-Mixture Estimation. 0.01779915 −0.02940042 5 = , −0.02940042 0.06358885. 2723. . μ5 = (−0.8832122, −1.3969769). Appendix D: A Nonparametric m-Mixture Estimation We provide the nonparametric m-mixture estimation algorithm. The subspace of the m-mixture of the empirical distributions is defined as. Em =. ⎧ ⎨ ⎩. pm (x; θ) =. N i=1. ⎫ ni N ⎬ 1 θi δ(x − xk(i) ), θi = 1, θi ≥ 0 . ⎭ ni k=1. i=1. Our objective is to compute pˆm ∈ E m , the closest pdf of the target distribution p0 . Thus, to obtain the optimal mixture ratio θ, we update the mixture ratio by using the following procedure. In this situation, we can use step 2 of the procedure similar to that in section 4.2. Step 2 (modified for the m-mixture): Update the mixture ratio θ = {θi }N i=1 according to the violation of the Pythagorean relation, θi ← θi × φ(rî ), where rî is estimated by the estimator equation, (3.12) as ˆ (pm , D (0) ) + D ˆ (pm , C ) − D ˆ (D (i) , D (0) ), rî = D KL KL w KL u u u and φ is defined as ⎧ ⎨ c × rî + 1 (r ≤ rî ≤ r), φ(ˆri ) = c × r + 1 (ˆri < r), ⎩ c × r + 1 (ˆri > r). We note that the arguments in equation 2.3 are put in reverse order toward the e-mixture version. Appendix E: A Nonparametric m-Mixture Estimation The hyperparameters used in the classification task of the EEG data set were determined by three-fold cross-validation using the training data. Table 2 shows their details.. Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/NECO_a_00888 by guest on 30 March 2021.
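To make the role of the hyperparameter c concrete, the multiplicative update of appendix D can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it reads φ as c times the clipped value of r̂_i plus 1, takes the estimates r̂_i as given inputs (the estimator of equation 3.12 is not reproduced here), and uses hypothetical names `r_lo` and `r_hi` with illustrative values for the lower and upper clipping bounds. The final renormalization onto the simplex is an assumption added so that the ratios keep summing to one.

```python
import numpy as np


def phi(r_hat, c, r_lo, r_hi):
    """Update factor: c * clip(r_hat, r_lo, r_hi) + 1 (clipped linear function)."""
    return c * np.clip(r_hat, r_lo, r_hi) + 1.0


def update_mixture_ratio(theta, r_hat, c=0.5, r_lo=-1.0, r_hi=1.0):
    """One multiplicative update theta_i <- theta_i * phi(r_hat_i).

    r_lo and r_hi are hypothetical bound values chosen for illustration.
    """
    theta = np.asarray(theta, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    factors = phi(r_hat, c, r_lo, r_hi)
    if np.any(factors <= 0.0):
        raise ValueError("c * r_lo + 1 must be positive to keep theta nonnegative")
    new_theta = theta * factors
    # Assumption: renormalize so that the mixture ratios sum to one.
    return new_theta / new_theta.sum()


# Illustrative inputs: three auxiliary distributions, uniform initial ratio.
theta = np.full(3, 1.0 / 3.0)
r_hat = np.array([0.2, -3.0, 5.0])  # made-up values of the Pythagorean violation
factors = phi(r_hat, c=0.5, r_lo=-1.0, r_hi=1.0)
new_theta = update_mixture_ratio(theta, r_hat)
```

With c = 0.5, the value listed for every subject in Table 2, and bounds of ±1, each update factor stays between 0.5 and 1.5, so a single noisy estimate of r̂_i cannot change a mixture ratio by more than 50% in one step.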

Table 2: Hyperparameters Tuned by a Validation Process Using the Range of Values γ ∈ {25, 30, 35, 40} and α ∈ {0.06, 0.07, 0.08}.

Subject    γ     α      c
i          35    0.08   0.5
ii         30    0.06   0.5
iii        25    0.08   0.5
iv         25    0.06   0.5
v          25    0.08   0.5

Note: This set of values was chosen after performing preliminary experiments.

Acknowledgments

We express our special thanks to the editor and reviewers whose comments led to valuable improvements of this letter. Part of this work was supported by JSPS KAKENHI No. 25120009, 25120011, and 16K16108.
