Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be a sample of $n$ independent and identically distributed random vectors with unknown $d \times d$ covariance matrix $\Sigma_0$. Define the empirical estimate of $\Sigma_0$ to be

$$\hat{\Sigma}^{\mathrm{emp}} = n^{-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^{T},$$

where $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$ is the sample mean. The goal of the following procedure is to construct a sparse estimator, $\hat{\Sigma}^{\mathrm{sp}}$, for $\Sigma_0$ by first constructing a confidence set for $\Sigma_0$ about the estimator $\hat{\Sigma}^{\mathrm{emp}}$ and then searching this set for the sparsest member. Two different search methods for such a sparse member are outlined in Sections 2.2.1 and 2.2.2.
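As a minimal sketch of the empirical estimate defined above, assuming the sample is stored as an $n \times d$ NumPy array with one observation per row (the function name `empirical_covariance` is ours, chosen for illustration):

```python
import numpy as np

def empirical_covariance(X):
    """Return n^{-1} * sum_i (X_i - Xbar)(X_i - Xbar)^T for an (n, d) array X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)      # subtract the sample mean from every row
    return centered.T @ centered / n   # note the 1/n (not 1/(n-1)) normalisation

# Usage on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
S_emp = empirical_covariance(X)
```

The $1/n$ normalisation matches the definition in the text; it agrees with `np.cov(X, rowvar=False, bias=True)`.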
To construct such a confidence set about $\hat{\Sigma}^{\mathrm{emp}}$, concentration inequalities are employed. Specific inequalities are chosen based on data assumptions and are discussed in Sections 2.3.1, 2.3.2, and 2.3.3. In general, the inequalities all take a similar form. Let $d(\cdot, \cdot)$ be some metric measuring the distance between two covariance matrices, and let $\psi : \mathbb{R} \to \mathbb{R}$ be monotonically increasing. Then, the
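general form of the concentration inequalities is

$$P\left( d(\Sigma_0, \hat{\Sigma}^{\mathrm{emp}}) \ge \mathrm{E}\, d(\Sigma_0, \hat{\Sigma}^{\mathrm{emp}}) + r \right) \le e^{-\psi(r)},$$

which is a bound on the tail of the distribution of $d(\Sigma_0, \hat{\Sigma}^{\mathrm{emp}})$ as it deviates above its mean. Thus, to construct a $(1 - \alpha)$-confidence set, the variable $r = r_\alpha$ is chosen such that $\exp(-\psi(r_\alpha)) = \alpha$. This $r_\alpha$ will be referred to as the deviation threshold.

2.2. SPARSE ESTIMATION PROCEDURE

Assumption on $X_i$         $d(\hat{\Sigma}^{\mathrm{sp}}, \Sigma_0)$                  $r_\alpha$
Log-Concave Measure         $\| \hat{\Sigma}^{\mathrm{sp}} - \Sigma_0 \|_p^{1/2}$      $\sqrt{(-2/c_0 n) \log \alpha}$
Bounded in norm             $\| \hat{\Sigma}^{\mathrm{sp}} - \Sigma_0 \|_p$            $U \sqrt{(-1/2n) \log \alpha}$
Sub-Exponential Measure     $\| \hat{\Sigma}^{\mathrm{sp}} - \Sigma_0 \|_p^{1/2}$      $\max\{ -K \log \alpha / \sqrt{n},\ \sqrt{-K \log \alpha} \}$

Table 2.1: Specific metrics $d(\cdot, \cdot)$ and deviation thresholds $r_\alpha$ given specific assumptions on the data $X_i$ discussed in subsequent sections.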
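The deviation thresholds of Table 2.1 can be evaluated numerically as in the following sketch. It rests on our reading of the table entries: $(-2/c_0 n)$ is taken to mean $-2/(c_0 n)$ and $(-1/2n)$ to mean $-1/(2n)$. The constants `c0`, `U` and `K` come from the concentration inequalities of Section 2.3 and must be supplied by the user; the function names are hypothetical.

```python
import math

def r_log_concave(alpha, n, c0):
    """sqrt(-2 log(alpha) / (c0 n)); assumes (-2/c0 n) means -2/(c0*n)."""
    return math.sqrt(-2.0 * math.log(alpha) / (c0 * n))

def r_bounded(alpha, n, U):
    """U * sqrt(-log(alpha) / (2 n)); assumes (-1/2n) means -1/(2*n)."""
    return U * math.sqrt(-math.log(alpha) / (2.0 * n))

def r_sub_exponential(alpha, n, K):
    """max{ -K log(alpha) / sqrt(n), sqrt(-K log(alpha)) }."""
    t = -K * math.log(alpha)  # positive because 0 < alpha < 1
    return max(t / math.sqrt(n), math.sqrt(t))

# Usage: thresholds for alpha = 1e-6 and n = 100 with illustrative constants
alpha, n = 1e-6, 100
thresholds = (r_log_concave(alpha, n, c0=1.0),
              r_bounded(alpha, n, U=1.0),
              r_sub_exponential(alpha, n, K=1.0))
```

Under this reading, the first threshold corresponds to $\psi(r) = c_0 n r^2 / 2$ and the second to $\psi(r) = 2 n r^2 / U^2$, so that $\exp(-\psi(r_\alpha)) = \alpha$ in both cases.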
Table 2.1 contains some explicit choices for the metric and deviation threshold given specific assumptions on the $X_i$, which are used to drive the choice of concentration inequality. These three cases are discussed separately in Section 2.3. Now, let $\hat{\Sigma}^{\mathrm{sp}}$ be our sparse estimator for $\Sigma_0$. We want these two to be close in the sense of the above confidence set and therefore choose a $\hat{\Sigma}^{\mathrm{sp}}$ such that $d(\hat{\Sigma}^{\mathrm{sp}}, \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$. Consequently, we have that
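$$\begin{aligned}
P\left( d(\hat{\Sigma}^{\mathrm{sp}}, \Sigma_0) \ge \mathrm{E}\, d(\hat{\Sigma}^{\mathrm{emp}}, \Sigma_0) + 2 r_\alpha \right)
&\le P\left( d(\hat{\Sigma}^{\mathrm{sp}}, \hat{\Sigma}^{\mathrm{emp}}) + d(\hat{\Sigma}^{\mathrm{emp}}, \Sigma_0) \ge \mathrm{E}\, d(\hat{\Sigma}^{\mathrm{emp}}, \Sigma_0) + 2 r_\alpha \right) \\
&\le P\left( d(\hat{\Sigma}^{\mathrm{emp}}, \Sigma_0) \ge \mathrm{E}\, d(\hat{\Sigma}^{\mathrm{emp}}, \Sigma_0) + r_\alpha \right) \\
&\le \exp(-\psi(r_\alpha)) = \alpha,
\end{aligned}$$

where the first inequality follows from the triangle inequality and the second from the constraint $d(\hat{\Sigma}^{\mathrm{sp}}, \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$.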
To actually identify such a $\hat{\Sigma}^{\mathrm{sp}}$, we require some criterion to optimize over all elements of the confidence set. Two such methods are proposed in the following subsections. The zeroing method in Section 2.2.1 takes inspiration from thresholding techniques for sparse covariance estimation (Bickel and Levina, 2008a; Rothman et al., 2009; Cai and Liu, 2011). It begins with $\hat{\Sigma}^{\mathrm{emp}}$ and attempts to zero as many entries as possible in the empirical estimate while still remaining in the confidence set, which is similar to applying a hard thresholding estimator restricted to the confines of the confidence set. The Procrustes method in Section 2.2.2 is more closely related to the shrinkage estimators (Daniels and Kass, 1999, 2001; Hoff, 2009; Johnstone and Lu, 2012). It chooses $\hat{\Sigma}^{\mathrm{sp}}$ to be a convex combination of $\hat{\Sigma}^{\mathrm{emp}}$ and some sparse target matrix using the Procrustes size and shape distance, which has been shown to be a useful metric when one is concerned with inference in the space of covariance matrices (Dryden et al., 2009).
2.2.1 Zeroing Method
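Beginning with $\hat{\Sigma}^{\mathrm{emp}}$, the goal of this method is to set as many entries of $\hat{\Sigma}^{\mathrm{emp}}$ to zero as possible while respecting the restriction that $d(\hat{\Sigma}^{\mathrm{sp}}, \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$. Here, the $(i, j)$ entry of $\hat{\Sigma}^{\mathrm{sp}}$ is denoted as $\tilde{\sigma}_{i,j}$ and the $(i, j)$ entry of $\hat{\Sigma}^{\mathrm{sp}}_k$ is denoted as $\tilde{\sigma}^{k}_{i,j}$.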
0. Set $\hat{\Sigma}^{\mathrm{sp}}_0 = \hat{\Sigma}^{\mathrm{emp}}$. Choose an $\alpha$ and compute $r_\alpha$. Later it will be shown that this method is fairly robust to the choice of $\alpha$. The cross-validation technique described below in Section 2.2.3 can be used to select a desirable $\alpha$ in practice. We also observed in the Gaussian data experiments of Section 2.4.1 that $\alpha \approx 10^{-6}$ gave good performance.
1. While $d(\hat{\Sigma}^{\mathrm{sp}}_k, \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$ and $\hat{\Sigma}^{\mathrm{sp}}_k$ has at least one non-zero off-diagonal entry:
(a) Choose the non-zero off-diagonal entry of $\hat{\Sigma}^{\mathrm{sp}}_k$ that is smallest in absolute value. That is, determine $(i, j)$ such that $i < j$ and $0 < |\tilde{\sigma}^{k}_{i,j}| \le |\tilde{\sigma}^{k}_{i',j'}|$ for all off-diagonal $(i', j') \ne (i, j)$ such that $|\tilde{\sigma}^{k}_{i',j'}| > 0$. If the set of such pairs $(i, j)$ has more than one element, then choose one pair uniformly at random and continue.
(b) Construct $\hat{\Sigma}^{\mathrm{sp}}_{k+1}$ with entries $\tilde{\sigma}^{k+1}_{i,j} = \tilde{\sigma}^{k+1}_{j,i} = 0$ and $\tilde{\sigma}^{k+1}_{i',j'} = \tilde{\sigma}^{k}_{i',j'}$ for all other $(i', j') \ne (i, j)$.
2. Denote by $\hat{\Sigma}^{\mathrm{sp}}$ the final matrix resulting from this recursion. If $\hat{\Sigma}^{\mathrm{sp}}$ is not positive semi-definite, then project it onto the space of positive semi-definite matrices by mapping the negative eigenvalues to zero.
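The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the metric is passed in as a function (the operator norm is used as a default), the loop returns the last iterate that still satisfies $d \le r_\alpha$, and the names `zeroing_method`, `project_psd` and `operator_norm_dist` are ours.

```python
import numpy as np

def operator_norm_dist(A, B):
    """Operator (spectral) norm of A - B; an illustrative choice of metric."""
    return np.linalg.norm(A - B, ord=2)

def project_psd(S, tol=1e-12):
    """Map negative eigenvalues to zero; leave S untouched if already PSD."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    if w.min() >= -tol:
        return S
    return (V * np.clip(w, 0.0, None)) @ V.T

def zeroing_method(S_emp, r_alpha, dist=operator_norm_dist, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    S_emp = np.asarray(S_emp, dtype=float)
    S = S_emp.copy()                                   # step 0
    iu, ju = np.triu_indices(S.shape[0], k=1)          # pairs with i < j
    while True:
        mags = np.abs(S[iu, ju])
        nonzero = mags > 0
        if not nonzero.any():                          # nothing left to zero
            break
        smallest = mags[nonzero].min()
        ties = np.flatnonzero(nonzero & (mags == smallest))
        pick = rng.choice(ties)                        # step (a): break ties at random
        candidate = S.copy()                           # step (b): zero (i, j) and (j, i)
        candidate[iu[pick], ju[pick]] = 0.0
        candidate[ju[pick], iu[pick]] = 0.0
        if dist(candidate, S_emp) > r_alpha:           # would leave the confidence set
            break
        S = candidate
    return project_psd(S)                              # step 2

# Usage on a small example
S_emp = np.array([[2.0, 0.1, 0.5],
                  [0.1, 2.0, 0.05],
                  [0.5, 0.05, 2.0]])
S_sp = zeroing_method(S_emp, r_alpha=0.2, rng=np.random.default_rng(0))
```

In this example the two smallest off-diagonal pairs are removed, while zeroing the entry 0.5 would push the operator-norm distance above 0.2, so it is retained.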
In the case that the metric $d(\cdot, \cdot)$ is a monotonically increasing function of the Hilbert-Schmidt / Frobenius norm $\| \hat{\Sigma}^{\mathrm{sp}}_k - \hat{\Sigma}^{\mathrm{emp}} \|_2$, the sequence $d(\hat{\Sigma}^{\mathrm{sp}}_k, \hat{\Sigma}^{\mathrm{emp}})$ will be increasing in $k$. This is true because the Frobenius norm is equivalent to the $\ell^2$ norm of the entries in the matrix, and as we run the algorithm, more entries in the difference $\hat{\Sigma}^{\mathrm{sp}}_k - \hat{\Sigma}^{\mathrm{emp}}$ will be non-zero. This property guarantees that the above algorithm will find the sparsest $\hat{\Sigma}^{\mathrm{sp}}$ in the confidence set in the sense of having the most zero entries. However, for an arbitrary metric, this sequence may not necessarily be strictly increasing in $k$. Another commonly used norm, which will be shown in Section 2.4 to give superior performance on simulated data, is the operator norm $\| \hat{\Sigma}^{\mathrm{sp}}_k - \hat{\Sigma}^{\mathrm{emp}} \|_\infty$, which does not yield a monotonically increasing sequence. This sequence is, however, roughly increasing in the sense that it is lower bounded by definition by the maximum $\ell^2$ norm of the columns of $\hat{\Sigma}^{\mathrm{sp}}_k - \hat{\Sigma}^{\mathrm{emp}}$, which is an increasing sequence. Furthermore, it is upper bounded by the maximum $\ell^1$ norm of the columns of $\hat{\Sigma}^{\mathrm{sp}}_k - \hat{\Sigma}^{\mathrm{emp}}$, which follows from the Gershgorin circle theorem (Iserles, 2009), and which is also an increasing sequence. In practice, the operator norm in particular gives superior performance in the numerical simulations of Section 2.4.
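The two bounds on the operator norm can be checked numerically. The following sketch computes, for a symmetric difference matrix `D`, the maximum column $\ell^2$ norm, the operator norm, and the maximum column $\ell^1$ norm; the function name `operator_norm_bounds` is hypothetical.

```python
import numpy as np

def operator_norm_bounds(D):
    """Return (max column l2 norm, operator norm, max column l1 norm) of D."""
    D = np.asarray(D, dtype=float)
    lower = np.linalg.norm(D, axis=0).max()   # largest Euclidean column norm
    op = np.linalg.norm(D, ord=2)             # largest singular value
    upper = np.abs(D).sum(axis=0).max()       # largest absolute column sum
    return lower, op, upper

# Usage: a random symmetric matrix with zero diagonal, mimicking the
# difference between a zeroed iterate and the empirical estimate
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
D = np.triu(A, k=1)
D = D + D.T
lower, op, upper = operator_norm_bounds(D)
```

For a symmetric `D` the three values satisfy `lower <= op <= upper`, which is the sandwich used in the text.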
From a computational perspective, the above algorithm as stated requires an unacceptable $O(d^2)$ eigenvalue decompositions and thus does not scale well as the dimension of the matrix increases. To account for this, a binary search routine can be incorporated, resulting in a reduction to $O(\log_2 d)$ eigenvalue decompositions. In short, set $z_k = \lfloor (d^2 - d)/2^k \rfloor$ to be the number of non-zero off-diagonal entries to set to zero in step (1a). If the resulting $\hat{\Sigma}^{\mathrm{sp}}_{k+1}$ from step (1b) is such that $d(\hat{\Sigma}^{\mathrm{sp}}_{k+1}, \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$, then continue as normal and attempt to remove $z_{k+1}$ more entries. Otherwise, if $d(\hat{\Sigma}^{\mathrm{sp}}_{k+1}, \hat{\Sigma}^{\mathrm{emp}}) > r_\alpha$, set $\hat{\Sigma}^{\mathrm{sp}}_{k+1} \leftarrow \hat{\Sigma}^{\mathrm{sp}}_k$, then continue again as normal with $z_{k+1}$ as before.
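A step-halving search in the spirit of this routine might look as follows. Two simplifications are assumptions on our part: the search counts symmetric pairs $(i, j)$, $i < j$, rather than individual entries, and the step sizes are descending powers of two instead of $\lfloor (d^2 - d)/2^k \rfloor$, so that every count of zeroed pairs is reachable. The result coincides with the entry-by-entry algorithm only when the distance is monotone in the number of zeroed entries, as for the Frobenius norm.

```python
import numpy as np

def zero_smallest_pairs(S_emp, order, t):
    """Zero the t off-diagonal pairs of smallest magnitude (order is pre-sorted)."""
    S = np.array(S_emp, dtype=float)
    iu, ju = order
    S[iu[:t], ju[:t]] = 0.0
    S[ju[:t], iu[:t]] = 0.0
    return S

def zeroing_step_halving(S_emp, r_alpha, dist):
    S_emp = np.asarray(S_emp, dtype=float)
    iu, ju = np.triu_indices(S_emp.shape[0], k=1)
    idx = np.argsort(np.abs(S_emp[iu, ju]), kind="stable")
    order = (iu[idx], ju[idx])          # pairs sorted by increasing magnitude
    m = len(idx)                        # number of off-diagonal pairs
    step = 1
    while step * 2 <= m:                # largest power of two not exceeding m
        step *= 2
    t, evaluations = 0, 0
    while step >= 1:
        if t + step <= m:
            evaluations += 1
            trial = zero_smallest_pairs(S_emp, order, t + step)
            if dist(trial, S_emp) <= r_alpha:
                t += step               # accept: stay inside the confidence set
            # otherwise revert, i.e. keep t, and retry with a smaller step
        step //= 2
    return zero_smallest_pairs(S_emp, order, t), evaluations

# Usage with the Frobenius distance
def frobenius(A, B):
    return np.linalg.norm(A - B)

S_emp = np.array([[2.0, 0.1, 0.5],
                  [0.1, 2.0, 0.05],
                  [0.5, 0.05, 2.0]])
S_sp, n_eval = zeroing_step_halving(S_emp, 0.2, frobenius)
```

The number of distance evaluations is at most $\lfloor \log_2 m \rfloor + 1$ for $m = (d^2 - d)/2$ pairs, compared with up to $m$ evaluations for the entry-by-entry loop.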
2.2.2 Procrustes Method
Past research into estimation and hypothesis testing for covariance matrices and operators has highlighted the superior performance of the Procrustes size and shape distance when compared with other metrics (Dryden et al., 2009; Pigoli et al., 2014; Cabassi et al., 2017). The intuition behind this metric and why it is popular in the context of shape analysis is that it allows for unitary transformations to best align the two objects under scrutiny.
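In the context of sparse estimation, the Procrustes distance is used to construct $\hat{\Sigma}^{\mathrm{sp}}$ as a convex combination of $\hat{\Sigma}^{\mathrm{emp}}$ and some sparse target matrix $\Sigma^{\mathrm{tar}}$ that presumably lies outside of the confidence set. Hence, this approach attempts to move or shrink from the empirical estimator to the sparse target along a path determined by the Procrustes metric. Specifically, set $L^{\mathrm{emp}} = (\hat{\Sigma}^{\mathrm{emp}})^{1/2}$ and $L^{\mathrm{tar}} = (\Sigma^{\mathrm{tar}})^{1/2}$, and construct the estimator as a function of some $\gamma \in [0, 1]$ to be

$$\hat{\Sigma}^{\mathrm{sp}}(\gamma) = \left[ L^{\mathrm{emp}} + \gamma (L^{\mathrm{tar}} R - L^{\mathrm{emp}}) \right] \left[ L^{\mathrm{emp}} + \gamma (L^{\mathrm{tar}} R - L^{\mathrm{emp}}) \right]^{T},$$

where $R = U V^{T}$ and $U$ and $V$ are, respectively, the left and right matrices of singular vectors for the matrix $(L^{\mathrm{tar}})^{T} L^{\mathrm{emp}}$ (Pigoli et al., 2014, Section 3). The argument $\gamma \in [0, 1]$ is chosen to be as large as possible while it still holds that $d(\hat{\Sigma}^{\mathrm{sp}}(\gamma), \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$.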
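The construction of $\hat{\Sigma}^{\mathrm{sp}}(\gamma)$ can be sketched with NumPy as below. We assume the symmetric positive semi-definite square root for both $L^{\mathrm{emp}}$ and $L^{\mathrm{tar}}$; the text only specifies a square root, so this choice and the function names `sqrt_psd` and `procrustes_path` are illustrative.

```python
import numpy as np

def sqrt_psd(S):
    """Symmetric square root of a positive semi-definite matrix."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def procrustes_path(S_emp, S_tar, gamma):
    """Point at position gamma in [0, 1] on the path from S_emp to S_tar."""
    L_emp, L_tar = sqrt_psd(S_emp), sqrt_psd(S_tar)
    U, _, Vt = np.linalg.svd(L_tar.T @ L_emp)   # (L_tar)^T L_emp = U diag(s) V^T
    R = U @ Vt                                  # optimal rotation R = U V^T
    L = L_emp + gamma * (L_tar @ R - L_emp)
    return L @ L.T                              # positive semi-definite by construction

# Usage: shrink halfway towards the identity target
S_emp = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.5, 0.2],
                  [0.1, 0.2, 1.0]])
S_half = procrustes_path(S_emp, np.eye(3), 0.5)
```

At $\gamma = 0$ the construction returns $\hat{\Sigma}^{\mathrm{emp}}$ and at $\gamma = 1$ it returns $\Sigma^{\mathrm{tar}}$, since $R R^{T}$ is the identity.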
This method finds the estimator closest to $\Sigma^{\mathrm{tar}}$ with respect to the Procrustes distance that is still in some confidence ball about $\hat{\Sigma}^{\mathrm{emp}}$. In practice, a choice of $\Sigma^{\mathrm{tar}}$ must be made based on some assumption regarding the nature of the true $\Sigma_0$. In the case of sparse estimation, either $I_d$, the $d \times d$ identity matrix, or the diagonal [...] semi-Bayesian feel as we are compromising between the empirical estimate and some prior chosen sparse target.
It is easily seen that this distance is an increasing function of $\gamma$ for any $p$-Schatten norm. Thus, our goal is to determine the maximal value of $\gamma$ such that $d_{\mathrm{Proc}}(\hat{\Sigma}^{\mathrm{sp}}(\gamma), \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$. To compute this estimator in practice, a binary search procedure similar to that for the above zeroing method can be implemented. Begin with the initial values $\gamma = \delta = 0.5$. If $d(\hat{\Sigma}^{\mathrm{sp}}(\gamma), \hat{\Sigma}^{\mathrm{emp}}) \le r_\alpha$, set $\gamma \leftarrow \gamma + \delta/2$ and $\delta \leftarrow \delta/2$. Otherwise set $\gamma \leftarrow \gamma - \delta/2$ and $\delta \leftarrow \delta/2$. This will quickly converge on the optimal choice of $\gamma$.
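The binary search for $\gamma$ can be sketched as follows. The distance is supplied as a function `dist_at(gamma)` that is assumed to be increasing in $\gamma$. Two additions are ours rather than part of the description above: a preliminary check of $\gamma = 1$, and returning the largest visited $\gamma$ that satisfied the constraint so that the output never leaves the confidence set.

```python
def largest_feasible_gamma(dist_at, r_alpha, n_iter=40):
    """Approximate the largest gamma in [0, 1] with dist_at(gamma) <= r_alpha."""
    if dist_at(1.0) <= r_alpha:      # the target itself lies in the confidence set
        return 1.0
    gamma, delta = 0.5, 0.5          # initial values gamma = delta = 0.5
    best = 0.0                       # gamma = 0 gives the empirical estimate
    for _ in range(n_iter):
        if dist_at(gamma) <= r_alpha:
            best = max(best, gamma)  # remember the largest feasible value seen
            gamma += delta / 2.0
        else:
            gamma -= delta / 2.0
        delta /= 2.0
    return best

# Usage with a toy increasing distance function (hypothetical)
gamma_hat = largest_feasible_gamma(lambda g: g, r_alpha=0.3)
```

Each iteration halves $\delta$, so after `n_iter` iterations the returned value lies within roughly $2^{-\texttt{n\_iter}}$ of the maximal feasible $\gamma$.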
2.2.3 Cross-Validation
In practice, an optimal value of α ∈ (0, 1) must be chosen to enforce the proper amount of sparsity. Beyond that, many of the concentration inequalities arrive with finite but unknown coefficients that may only have loose upper bounds known. Hence, we propose a cross-validation technique for tuning α, which takes its inspiration from the similar technique proposed in the thresholding literature (Bickel and Levina, 2008a; Rothman et al., 2009; Cai and Liu, 2011).
Given $n = 2m$ observations, we split the data randomly in half to get $X^{1}_{1}, \ldots, X^{1}_{m}$ and $X^{2}_{1}, \ldots, X^{2}_{m}$. Then, the two empirical estimators $\hat{\Sigma}^{\mathrm{emp}}_1$ and $\hat{\Sigma}^{\mathrm{emp}}_2$ are constructed. The desired sparsifying procedure is applied to $\hat{\Sigma}^{\mathrm{emp}}_2$ for a variety of $\alpha \in A$, resulting in the collection of estimators $\{ \hat{\Sigma}^{\mathrm{sp}}_\alpha \}_{\alpha \in A}$. The value of $\alpha$ is chosen as $\alpha = \arg\min_{\alpha \in A} d(\hat{\Sigma}^{\mathrm{emp}}_1, \hat{\Sigma}^{\mathrm{sp}}_\alpha)$ for some metric $d(\cdot, \cdot)$. This process is repeated $k$ times, resulting in the set $\{ \alpha_1, \ldots, \alpha_k \}$. Then, the cross-validated choice is the average of the $\alpha_i$ in the log domain, that is, the geometric mean $\alpha = \exp\left( k^{-1} \sum_{i=1}^{k} \log \alpha_i \right)$. The reason for the log, as will be seen in the following sections, is that our deviation threshold $r_\alpha$ is often a function of $\log \alpha$ stemming from the application of the concentration inequalities.
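A sketch of this cross-validation loop is given below. The sparsifying procedure and the metric are passed in as functions; `toy_threshold` is a hypothetical stand-in sparsifier used only to make the example runnable, and is not one of the methods of Sections 2.2.1 and 2.2.2.

```python
import numpy as np

def emp_cov(X):
    """Empirical covariance with 1/n normalisation."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def cross_validate_alpha(X, alphas, sparsify, dist, k=10, seed=0):
    """Geometric mean over k random splits of the best alpha per split."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    m = X.shape[0] // 2
    log_choices = []
    for _ in range(k):
        perm = rng.permutation(X.shape[0])           # random split into two halves
        S1 = emp_cov(X[perm[:m]])
        S2 = emp_cov(X[perm[m:2 * m]])
        losses = [dist(S1, sparsify(S2, a)) for a in alphas]
        log_choices.append(np.log(alphas[int(np.argmin(losses))]))
    return float(np.exp(np.mean(log_choices)))       # average in the log domain

def toy_threshold(S, alpha):
    """Hypothetical sparsifier: zero small off-diagonals, level grows with -log(alpha)."""
    level = 0.1 * np.sqrt(-np.log(alpha))
    T = S.copy()
    off_diag = ~np.eye(S.shape[0], dtype=bool)
    T[off_diag & (np.abs(T) < level)] = 0.0
    return T

# Usage on simulated data with the Frobenius distance
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
alpha_cv = cross_validate_alpha(X, [1e-1, 1e-2, 1e-4], toy_threshold,
                                lambda A, B: np.linalg.norm(A - B), k=5)
```

Because the average is taken over $\log \alpha_i$, the returned value always lies between the smallest and largest candidate in the grid.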