1.4 Non-negative least squares (NNLS) for high-dimensional linear models
1.4.9 Extensions
In light of promising theoretical properties and empirical success, it is worthwhile to explore possible extensions of NNLS. Below, we collect a few ideas that either build directly upon NNLS or that make use of concepts such as non-negativity, sparsity and self-regularization. One of these extensions has meanwhile been published [143] while the others are left as topics of future research.
Least squares regression with half-space constraints. The constraint set of NNLS can be represented as an intersection of half-spaces: we have $\mathbb{R}^p_+ = \bigcap_{j=1}^p \{x \in \mathbb{R}^p : \langle e_j, x \rangle \geq 0\}$. More generally, one can consider constraint sets of the form $\bigcap_{j=1}^q \{x \in \mathbb{R}^p : \langle a_j, x \rangle \geq 0\}$ for given $a_j \in \mathbb{R}^p$, $j = 1, \ldots, q$. This prompts the following generalization of NNLS:
$$\min_{\beta:\, A\beta \succeq 0} \|y - X\beta\|_2^2, \tag{1.175}$$
where $A \in \mathbb{R}^{q \times p}$ contains the $\{a_j\}_{j=1}^q$ as its rows. Accordingly, statistical analysis involves the sparsity of $A\beta^*$ in place of the sparsity of $\beta^*$. As an example, let $A$ represent the first order difference operator, i.e.
$$A = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix} \in \mathbb{R}^{(p-1) \times p}.$$
Then the constraint set is given by all $\beta$ satisfying $\beta_1 \leq \beta_2 \leq \ldots \leq \beta_p$, while sparsity means that the sequence $\{\beta_j\}_{j=1}^p$ has few 'jumps', i.e. $\beta_j = \beta_{j+1}$ for most $j$. The central question to be answered is whether, under suitable conditions on $X$ and $A$, minimizers of (1.175) enjoy performance guarantees similar to those of NNLS.
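For the monotonicity example, problem (1.175) can be reduced to a bound-constrained least squares problem. The following is a minimal sketch, not a method taken from the text: it assumes the reparametrization $\beta = L\theta$ with $L$ the lower-triangular matrix of ones, so that $\theta_1 = \beta_1$ is free and $\theta_j = \beta_j - \beta_{j-1} \geq 0$ for $j \geq 2$. The function name `monotone_least_squares`, the simulated data and the use of `scipy.optimize.lsq_linear` are illustrative choices.

```python
import numpy as np
from scipy.optimize import lsq_linear


def monotone_least_squares(X, y):
    """Minimize ||y - X b||_2^2 subject to b_1 <= b_2 <= ... <= b_p.

    Uses b = L @ theta, where L is lower-triangular with all ones:
    theta_1 is unconstrained and theta_j >= 0 for j >= 2.
    """
    p = X.shape[1]
    L = np.tril(np.ones((p, p)))                       # cumulative-sum matrix
    lower = np.concatenate(([-np.inf], np.zeros(p - 1)))
    upper = np.full(p, np.inf)
    res = lsq_linear(X @ L, y, bounds=(lower, upper))
    return L @ res.x


# Simulated example (hypothetical data): a monotone target with two jumps.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_star = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
y = X @ beta_star + 0.1 * rng.standard_normal(n)

beta_hat = monotone_least_squares(X, y)

# First order difference operator A of shape (p-1, p); rows are e_{j+1} - e_j.
A = np.diff(np.eye(p), axis=0)
print(np.round(A @ beta_hat, 3))   # estimated increments, non-negative up to rounding
```

The reparametrization only works because this particular $A$ can be completed to an invertible matrix; for general $A$ a quadratic programming solver would be needed instead.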
In the sequel, we consider three extensions involving convex cones of real matrices.
Non-negative least squares approximations for matrices. Let $Y \in \mathbb{R}^{m_1 \times m_2}$ and $Z_j \in \mathbb{R}^{m_1 \times m_2}$, $j = 1, \ldots, p$, be given matrices and suppose that we wish to find a non-negative combination of the $\{Z_j\}_{j=1}^p$ optimally approximating $Y$ in a least squares sense. This yields the optimization problem
$$\min_{\beta \succeq 0} \Big\| Y - \sum_{j=1}^p \beta_j Z_j \Big\|_F^2, \tag{1.176}$$
where $\|M\|_F = \big(\sum_{a,b} M_{ab}^2\big)^{1/2}$ denotes the Frobenius norm of a matrix $M$. Actually, (1.176) is not a proper extension of NNLS as it can be converted to a standard NNLS problem (1.16) by vectorizing $Y$, vectorizing the $\{Z_j\}_{j=1}^p$ and stacking them as columns of a corresponding design matrix. Nevertheless, it is worth mentioning (1.176) because of its usefulness for solving several projection problems on polyhedral cones of real matrices considered in the literature [111, 126] such as the set of diagonally dominant matrices with positive diagonal, or interesting subsets thereof like the set of Laplacian matrices
$$\mathcal{L}_m = \Big\{ A \in \mathbb{R}^{m \times m} : A = A^\top,\ a_{ii} = -\sum_{j \neq i} a_{ij},\ i = 1, \ldots, m,\ a_{ij} \leq 0 \text{ for all } i \neq j \Big\}. \tag{1.177}$$
It is not hard to verify that the set $\mathcal{L}_m$ can equivalently be expressed as
$$\mathcal{L}_m = \Big\{ A \in \mathbb{R}^{m \times m} : A = \sum_{k=1}^m \sum_{l > k} \lambda_{kl} \big(e_k e_k^\top + e_l e_l^\top - e_k e_l^\top - e_l e_k^\top\big),\ \lambda_{kl} \geq 0\ \forall k, l \Big\}. \tag{1.178}$$
Accordingly, the Euclidean projection of $Y \in \mathbb{R}^{m \times m}$ on $\mathcal{L}_m$ can be recast as a problem of the form (1.176) with
$$Z_1 = e_1 e_1^\top + e_2 e_2^\top - e_1 e_2^\top - e_2 e_1^\top, \quad Z_2 = e_1 e_1^\top + e_3 e_3^\top - e_1 e_3^\top - e_3 e_1^\top, \quad \ldots, \quad Z_p = e_{m-1} e_{m-1}^\top + e_m e_m^\top - e_{m-1} e_m^\top - e_m e_{m-1}^\top,$$
where $p = m(m-1)/2$. Using the conic hull representation (1.178) may be advantageous over approaches which try to compute the projection based on the half-space representation (1.177), because in the former case it is possible to make use of fast solvers available for NNLS.
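The reduction of (1.176) to a standard NNLS problem, applied to the projection on the set of Laplacian matrices, can be sketched as follows. This is an illustrative sketch only: the helper names `laplacian_generators` and `project_onto_laplacians` are hypothetical, and `scipy.optimize.nnls` stands in for whichever NNLS solver one prefers. Each generator is written as $(e_k - e_l)(e_k - e_l)^\top$, which expands to the four-term expression in (1.178).

```python
import numpy as np
from itertools import combinations
from scipy.optimize import nnls


def laplacian_generators(m):
    """Return the m(m-1)/2 matrices (e_k - e_l)(e_k - e_l)^T for k < l."""
    gens = []
    for k, l in combinations(range(m), 2):
        d = np.zeros(m)
        d[k], d[l] = 1.0, -1.0
        gens.append(np.outer(d, d))
    return gens


def project_onto_laplacians(Y):
    """Frobenius-norm projection of a square matrix Y via vectorized NNLS."""
    m = Y.shape[0]
    gens = laplacian_generators(m)
    # Each vectorized generator becomes one column of the design matrix.
    design = np.column_stack([Z.ravel() for Z in gens])
    coef, _ = nnls(design, Y.ravel())
    return (design @ coef).reshape(m, m), coef


rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 4))
P, coef = project_onto_laplacians(Y)
print(np.round(P, 3))
```

The design matrix has $m^2$ rows and $m(m-1)/2$ columns, so this dense formulation is only practical for moderate $m$.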
Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov random fields. Besides regression, sparsity has become a key concept for various other statistical estimation problems. Estimation of the covariance matrix or its inverse (also known as the precision matrix) of a random vector is a traditional task in multivariate analysis that poses a considerable challenge in the high-dimensional setting. Let $Z = (Z_j)_{j=1}^p \sim N(\mu^*, \Sigma^*)$ be a $p$-dimensional Gaussian random vector and suppose that we want to estimate its precision matrix $\Omega^* = (\omega^*_{jk}) = (\Sigma^*)^{-1}$ from samples $\{z_1, \ldots, z_n\}$ which are i.i.d. realizations of $Z$, assuming that $\mu^*$ is known. Then maximum likelihood estimation can be shown to be equivalent to the log-determinant divergence minimization problem
$$\min_{\Omega \in \mathbb{S}^p_+} -\log\det(\Omega) + \mathrm{tr}(\Omega S), \qquad S := \frac{1}{n} \sum_{i=1}^n (z_i - \mu^*)(z_i - \mu^*)^\top, \tag{1.179}$$
where $\mathbb{S}^p_+$ denotes the set of $p \times p$ symmetric positive definite matrices, cf. [74], §17.3. If $n < p$, the sample covariance matrix $S$ is singular, and as a result, the objective in (1.179) is unbounded from below so that a maximum likelihood estimator fails to exist. Moreover, in this situation one cannot hope to reasonably estimate $\Omega^*$ unless it possesses additional structure that can be exploited. Sparsity of the off-diagonal entries of $\Omega^*$ has primarily been considered in this context, and different forms of sparsity-promoting regularization have been proposed [26, 59, 131]. In the Gaussian setting, sparsity of the off-diagonal entries of $\Omega^*$ has a particularly convenient interpretation in terms of the induced conditional independence graph in which pairs of variables $(Z_j, Z_k)$, $k \neq j$, are
not connected by an edge if and only if $Z_j$ and $Z_k$ are conditionally independent given the remaining variables $\{Z_l\}_{l \notin \{j,k\}}$, which can be shown to be equivalent to $\omega^*_{jk} = 0$, cf. [90], §5. Thus, sparsity of $\Omega^*$ translates to sparsity of the associated conditional independence graph. Independent of Gaussianity, one can also show that $(-\omega^*_{jk}/\omega^*_{jj})_{k \neq j}$ equals the vector of regression coefficients of the linear regression in which $Z_j$ is regressed on $\{Z_k\}_{k \neq j}$, $j = 1, \ldots, p$. In our recent paper [143], we specialize to the case in which all these regression coefficients are non-negative. Equivalently, we consider the following subset of $\mathbb{S}^p_+$:
$$\mathcal{M}^p = \{\Omega = (\omega_{jk}) \in \mathbb{S}^p_+ : \omega_{jk} \leq 0,\ j, k = 1, \ldots, p,\ j \neq k\},$$
the set of symmetric positive definite M-matrices [8]. In [143], we investigate whether the sign constraints on the off-diagonal elements can be exploited in estimation, and whether adaptation to sparsity similar to that in non-negative regression is possible. Specifically, as a direct modification of (1.179), we consider sign-constrained log-determinant divergence minimization
$$\min_{\Omega \in \mathcal{M}^p} -\log\det(\Omega) + \mathrm{tr}(\Omega S). \tag{1.180}$$
We show that, under a mild condition on the sample covariance matrix $S$, there exists a unique minimizer of (1.180) even if $n < p$. Moreover, we provide theoretical and empirical evidence indicating that thresholding of the off-diagonal entries of the resulting minimizer may be a suitable approach for recovering the sparsity pattern of an underlying sparse target $\Omega^*$.
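Two facts used in this paragraph lend themselves to a quick numerical check: the identity between the rows of the precision matrix and the regression coefficients, and the singularity of $S$ when $n < p$. The sketch below is illustrative only. The tridiagonal precision matrix is a hypothetical example chosen to lie in the set of symmetric positive definite M-matrices, and the helper names `regression_coefficients` and `is_m_matrix` are not taken from [143]. It does not solve (1.180).

```python
import numpy as np


def regression_coefficients(Sigma, j):
    """Population coefficients of regressing Z_j on the remaining variables."""
    rest = [k for k in range(Sigma.shape[0]) if k != j]
    return np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])


def is_m_matrix(Omega, tol=1e-12):
    """Check: symmetric, positive definite, non-positive off-diagonal entries."""
    off = Omega - np.diag(np.diag(Omega))
    return bool(np.allclose(Omega, Omega.T)
                and np.all(off <= tol)
                and np.all(np.linalg.eigvalsh(Omega) > 0))


# Hypothetical precision matrix: tridiagonal and strictly diagonally dominant.
p = 5
Omega = 2.0 * np.eye(p) - 0.8 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Omega)

# Coefficients from the covariance versus -omega_jk / omega_jj, for j = 0.
j = 0
from_sigma = regression_coefficients(Sigma, j)
from_omega = -np.delete(Omega[j], j) / Omega[j, j]
print(np.allclose(from_sigma, from_omega), is_m_matrix(Omega))

# With n < p samples (known mean zero), the sample covariance is rank deficient.
rng = np.random.default_rng(0)
n = 3
samples = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = samples.T @ samples / n
print(np.linalg.matrix_rank(S) < p)
```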
Recovering symmetric positive definite matrices of low rank. The field of compressed sensing started with the problem of recovering a sparse vector from incomplete linear measurements, but was readily extended to the more general problem of recovering a low rank matrix $B^* \in \mathbb{R}^{m_1 \times m_2}$ from linear measurements of the form $y_i = \mathrm{tr}(X_i B^*)$ for certain measurement matrices $X_i \in \mathbb{R}^{m_2 \times m_1}$, $i = 1, \ldots, n$, cf. e.g. [29, 115, 127]. While sparsity now refers to the singular values of $B^*$, symmetric positive definiteness appears to be the natural counterpart to non-negativity in the vector case. First recovery results in this direction have been shown in [166], but these fall short of those established for trace norm regularization (the counterpart to non-negative $\ell_1$-regularization) in [25, 37]. Besides, the results in [166] are limited to a noiseless setting.