Extensions - Non-negative least squares (NNLS) for high-dimensional linear models

1.4 Non-negative least squares (NNLS) for high-dimensional linear models

1.4.9 Extensions

In light of promising theoretical properties and empirical success, it is worthwhile to explore possible extensions of NNLS. Below, we collect a few ideas that either build directly upon NNLS or that make use of concepts such as non-negativity, sparsity and self-regularization. One of these extensions has meanwhile been published [143] while the others are left as topics of future research.

Least squares regression with half-space constraints. The constraint set of NNLS can be represented as an intersection of half-spaces: we have $\mathbb{R}^p_+ = \bigcap_{j=1}^p \{x \in \mathbb{R}^p : \langle e_j, x \rangle \geq 0\}$. More generally, one can consider constraint sets of the form $\bigcap_{j=1}^q \{x \in \mathbb{R}^p : \langle a_j, x \rangle \geq 0\}$ for given $a_j \in \mathbb{R}^p$, $j = 1, \dots, q$. This prompts the following generalization of NNLS:

$$\min_{\beta:\, A\beta \succeq 0} \|y - X\beta\|_2^2, \tag{1.175}$$

where $A \in \mathbb{R}^{q \times p}$ contains the $\{a_j\}_{j=1}^q$ as its rows. Accordingly, statistical analysis involves the sparsity of $A\beta^*$ in place of the sparsity of $\beta^*$. As an example, let $A$ represent the first order difference operator, i.e.

$$A = \begin{pmatrix} -1 & 1 & 0 & \dots & 0 \\ 0 & -1 & 1 & \dots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & -1 & 1 \end{pmatrix}.$$

Then the constraint set is given by all $\beta$ satisfying $\beta_1 \leq \beta_2 \leq \dots \leq \beta_p$, while sparsity means that the sequence $\{\beta_j\}_{j=1}^p$ has few 'jumps', i.e. $\beta_j = \beta_{j+1}$ for most $j$. The central question to be answered is whether, under suitable conditions on $X$ and $A$, minimizers of (1.175) enjoy performance guarantees similar to those of NNLS.
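The monotone example can be illustrated with a short numerical sketch. It is not taken from the text: it assumes the reparametrization $\beta = C\theta$, with $C$ the lower-triangular matrix of ones, so that $\theta_1 = \beta_1$ is free and the increments $\theta_j = \beta_j - \beta_{j-1}$, $j \geq 2$, are non-negative. Problem (1.175) with the first order difference operator then becomes a bound-constrained least squares problem, solved here with SciPy's `lsq_linear`. The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import lsq_linear


def first_difference_operator(p):
    """(p-1) x p matrix A with rows (0, ..., -1, 1, ..., 0), so (A beta)_j = beta_{j+1} - beta_j."""
    A = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    A[idx, idx] = -1.0
    A[idx, idx + 1] = 1.0
    return A


def monotone_least_squares(X, y):
    """Minimize ||y - X beta||_2^2 subject to beta_1 <= ... <= beta_p.

    Assumption (illustration only): write beta = C theta with C lower-triangular
    ones, leave theta_1 unconstrained and require theta_j >= 0 for j >= 2.
    """
    p = X.shape[1]
    C = np.tril(np.ones((p, p)))      # cumulative-sum map: beta_k = theta_1 + ... + theta_k
    lower = np.zeros(p)
    lower[0] = -np.inf                # the level beta_1 may take any sign
    upper = np.full(p, np.inf)
    result = lsq_linear(X @ C, y, bounds=(lower, upper))
    return C @ result.x


# Small synthetic demo: a non-decreasing target with few jumps.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((50, 8))
beta_demo = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 3.0, 3.0])
y_demo = X_demo @ beta_demo + 0.1 * rng.standard_normal(50)
beta_hat_demo = monotone_least_squares(X_demo, y_demo)
```

In this parametrization, sparsity of $A\beta$ corresponds to sparsity of $(\theta_2, \dots, \theta_p)$, which makes the analogy with NNLS explicit; for a general $A$ no such change of variables need exist.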

In the sequel, we consider three extensions involving convex cones of real matrices.

Non-negative least squares approximations for matrices. Let $Y \in \mathbb{R}^{m_1 \times m_2}$ and $Z_j \in \mathbb{R}^{m_1 \times m_2}$, $j = 1, \dots, p$, be given matrices and suppose that we wish to find a non-negative combination of the $\{Z_j\}_{j=1}^p$ optimally approximating $Y$ in a least squares sense. This yields the optimization problem

$$\min_{\beta \succeq 0} \Big\| Y - \sum_{j=1}^p \beta_j Z_j \Big\|_F^2, \tag{1.176}$$

where $\|M\|_F = \big(\sum_{a,b} M_{ab}^2\big)^{1/2}$ denotes the Frobenius norm of a matrix $M$. Strictly speaking, (1.176) is not a proper extension of NNLS, as it can be converted to a standard NNLS problem (1.16) by vectorizing $Y$, vectorizing the $\{Z_j\}_{j=1}^p$ and stacking them as columns of a corresponding design matrix. Nevertheless, (1.176) is worth mentioning because of its usefulness for solving several projection problems on polyhedral cones of real matrices considered in the literature [111, 126], such as the set of diagonally dominant matrices with positive diagonal, or interesting subsets thereof like the set of Laplacian matrices

$$\mathcal{L}^m = \Big\{ A \in \mathbb{R}^{m \times m} : A = A^\top,\ a_{ii} = -\sum_{j \neq i} a_{ij},\ i = 1, \dots, m,\ a_{ij} \leq 0 \text{ for all } i \neq j \Big\}. \tag{1.177}$$

It is not hard to verify that the set $\mathcal{L}^m$ can equivalently be expressed as

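$$\mathcal{L}^m = \Big\{ A \in \mathbb{R}^{m \times m} : A = \sum_{k=1}^m \sum_{l > k} \lambda_{kl} \big(e_k e_k^\top + e_l e_l^\top - e_k e_l^\top - e_l e_k^\top\big),\ \lambda_{kl} \geq 0 \ \forall k, l \Big\}. \tag{1.178}$$

Accordingly, the Euclidean projection of $Y \in \mathbb{R}^{m \times m}$ on $\mathcal{L}^m$ can be recast as a problem of the form (1.176) with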

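$$Z_1 = e_1 e_1^\top + e_2 e_2^\top - e_1 e_2^\top - e_2 e_1^\top,\quad Z_2 = e_1 e_1^\top + e_3 e_3^\top - e_1 e_3^\top - e_3 e_1^\top,\quad \dots,\quad Z_p = e_{m-1} e_{m-1}^\top + e_m e_m^\top - e_{m-1} e_m^\top - e_m e_{m-1}^\top,$$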

where $p = m(m-1)/2$. Using the conic hull representation (1.178) may be advantageous over approaches that try to compute the projection based on the half-space representation (1.177), because in the former case it is possible to make use of fast solvers available for NNLS.
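As a rough illustration of this reduction, the following sketch builds the generators from (1.178), vectorizes them into the columns of a design matrix, and calls SciPy's `nnls` on the vectorized $Y$. The function names and the ordering of the index pairs $(k, l)$ are assumptions made for this example, and $m \geq 2$ is assumed.

```python
import itertools

import numpy as np
from scipy.optimize import nnls


def laplacian_generators(m):
    """Generators (e_k - e_l)(e_k - e_l)^T for k < l, i.e. the Z_j of (1.178)."""
    generators = []
    for k, l in itertools.combinations(range(m), 2):
        d = np.zeros(m)
        d[k], d[l] = 1.0, -1.0
        generators.append(np.outer(d, d))   # = e_k e_k^T + e_l e_l^T - e_k e_l^T - e_l e_k^T
    return generators


def project_onto_laplacians(Y):
    """Frobenius-norm projection of a square matrix Y onto the Laplacian cone via NNLS."""
    m = Y.shape[0]
    generators = laplacian_generators(m)
    # Vectorize each generator and stack the results as columns of the design matrix.
    design = np.column_stack([Z.ravel() for Z in generators])
    weights, _ = nnls(design, Y.ravel())
    projection = (design @ weights).reshape(m, m)
    return projection, weights


# Demo: project a random (non-symmetric) matrix.
rng = np.random.default_rng(1)
Y_demo = rng.standard_normal((4, 4))
P_demo, lam_demo = project_onto_laplacians(Y_demo)
```

The design matrix has $m^2$ rows and $m(m-1)/2$ columns, so for large $m$ a sparse representation of the generators would be preferable to the dense one used here.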

Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov random fields. Besides regression, sparsity has become a key concept for various other statistical estimation problems. Estimation of the covariance matrix or its inverse (also known as the precision matrix) of a random vector is a traditional task in multivariate analysis that poses a considerable challenge in the high-dimensional setting. Let $Z = (Z_j)_{j=1}^p \sim N(\mu^*, \Sigma^*)$ be a $p$-dimensional Gaussian random vector and suppose that we want to estimate its precision matrix $\Omega^* = (\omega^*_{jk}) = (\Sigma^*)^{-1}$ from samples $\{z_1, \dots, z_n\}$ which are i.i.d. realizations of $Z$, assuming that $\mu^*$ is known. Then maximum likelihood estimation can be shown to be equivalent to the log-determinant divergence minimization problem

$$\min_{\Omega \in \mathbb{S}^p_+} -\log\det(\Omega) + \operatorname{tr}(\Omega S), \qquad S := \frac{1}{n} \sum_{i=1}^n (z_i - \mu^*)(z_i - \mu^*)^\top, \tag{1.179}$$

where $\mathbb{S}^p_+$ denotes the set of $p \times p$ symmetric positive definite matrices, cf. [74], §17.3. Once $n < p$, the sample covariance matrix $S$ is singular, and as a result, (1.179) is unbounded from below so that a maximum likelihood estimator fails to exist. Moreover, in this situation one cannot hope to reasonably estimate $\Omega^*$ unless it possesses additional structure that can be exploited. Sparsity of the off-diagonal entries of $\Omega^*$ has primarily been considered in this context, and different forms of sparsity-promoting regularization have been proposed [26, 59, 131]. In the Gaussian setting, sparsity of the off-diagonal entries of $\Omega^*$ has a particularly convenient interpretation in terms of the induced conditional independence graph, in which a pair of variables $(Z_j, Z_k)$, $k \neq j$, is not connected by an edge if and only if $Z_j$ and $Z_k$ are conditionally independent given the remaining variables $\{Z_l\}_{l \notin \{j,k\}}$, which can be shown to be equivalent to $\omega^*_{jk} = 0$, cf. [90], §5. Thus, sparsity of $\Omega^*$ translates to sparsity of the associated conditional independence graph. Independent of Gaussianity, one can also show that $(-\omega^*_{jk}/\omega^*_{jj})_{k \neq j}$ equals the vector of regression coefficients of the linear regression in which $Z_j$ is regressed on $\{Z_k\}_{k \neq j}$, $j = 1, \dots, p$. In our recent paper [143], we specialize to the case in which all these regression coefficients are non-negative. Equivalently, we consider the following subset of $\mathbb{S}^p_+$:

$$\mathcal{M}^p = \{\Omega = (\omega_{jk}) \in \mathbb{S}^p_+ : \omega_{jk} \leq 0,\ j, k = 1, \dots, p,\ j \neq k\},$$

the set of symmetric positive definite M-matrices [8]. In [143], we investigate whether the sign constraints on the off-diagonal elements can be exploited in estimation, and whether adaptation to sparsity similar to that in non-negative regression is possible. Specifically, as a direct modification of (1.179), we consider sign-constrained log-determinant divergence minimization

$$\min_{\Omega \in \mathcal{M}^p} -\log\det(\Omega) + \operatorname{tr}(\Omega S). \tag{1.180}$$

We show that, under a mild condition on the sample covariance matrix $S$, there exists a unique minimizer of (1.180) even if $n < p$. Moreover, we provide theoretical and empirical evidence indicating that thresholding of the off-diagonal entries of the resulting minimizer may be a suitable approach for recovering the sparsity pattern of an underlying sparse target $\Omega^*$.
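To make (1.180) concrete, the sketch below applies a plain projected gradient method with backtracking. This is an illustrative choice and not necessarily the algorithm used in [143]. It relies on the gradient $S - \Omega^{-1}$ of the objective and on the fact that projecting onto the sign constraints amounts to clipping positive off-diagonal entries at zero. Positive definiteness is enforced by rejecting any step for which a Cholesky factorization fails. The function names, the starting point and the step-size parameters are all assumptions.

```python
import numpy as np


def logdet_objective(Omega, S):
    """-log det(Omega) + tr(Omega S); returns +inf if Omega is not positive definite."""
    try:
        chol = np.linalg.cholesky(Omega)
    except np.linalg.LinAlgError:
        return np.inf
    return -2.0 * np.sum(np.log(np.diag(chol))) + np.sum(Omega * S)


def clip_off_diagonal(Omega):
    """Projection onto symmetric matrices with non-positive off-diagonal entries."""
    sym = 0.5 * (Omega + Omega.T)
    clipped = np.minimum(sym, 0.0)
    np.fill_diagonal(clipped, np.diag(sym))   # the diagonal is left unconstrained
    return clipped


def sign_constrained_logdet(S, n_iter=500, step=1.0, shrink=0.5, tol=1e-8):
    """Projected gradient sketch for (1.180); assumes diag(S) > 0."""
    Omega = np.diag(1.0 / np.diag(S))          # feasible start: diagonal, positive definite
    f = logdet_objective(Omega, S)
    for _ in range(n_iter):
        grad = S - np.linalg.inv(Omega)
        t = step
        while True:
            candidate = clip_off_diagonal(Omega - t * grad)
            d = candidate - Omega
            f_candidate = logdet_objective(candidate, S)
            # Sufficient decrease test; it fails automatically if candidate is not PD.
            if f_candidate <= f + np.sum(grad * d) + np.sum(d * d) / (2.0 * t):
                break
            t *= shrink
            if t < 1e-12:
                return Omega                   # no acceptable step found, stop here
        Omega, f = candidate, f_candidate
        if np.sqrt(np.sum(d * d)) <= tol:
            break
    return Omega


# Demo on data drawn from a Gaussian whose precision matrix is an M-matrix.
rng = np.random.default_rng(2)
Omega_true = 2.0 * np.eye(5) - 0.5 * (np.eye(5, k=1) + np.eye(5, k=-1))
Z_demo = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Omega_true), size=200)
S_demo = Z_demo.T @ Z_demo / 200
Omega_hat_demo = sign_constrained_logdet(S_demo)
```

Thresholding the off-diagonal entries of the returned matrix, as suggested in the text, would then be a separate post-processing step.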

Recovering symmetric positive definite matrices of low rank. The field of compressed sensing started with the problem of recovering a sparse vector from incomplete linear measurements, but was readily extended to the more general problem of recovering a low rank matrix $B^* \in \mathbb{R}^{m_1 \times m_2}$ from linear measurements of the form $y_i = \operatorname{tr}(X_i B^*)$ for certain measurement matrices $X_i \in \mathbb{R}^{m_2 \times m_1}$, $i = 1, \dots, n$, cf. e.g. [29, 115, 127]. While sparsity now refers to the singular values of $B^*$, symmetric positive definiteness appears to be the natural counterpart to non-negativity in the vector case. First recovery results in this direction have been shown in [166], but these fall short of those established for trace norm regularization (the counterpart to non-negative $\ell_1$-regularization) in [25, 37]. Besides, the results in [166] are limited to a noiseless setting.

In document Topics in learning sparse and low-rank models of non-negative data (Page 103-105)