Sparse Estimation Procedure - A concentration inequality based statistical methodology for infe

Let $X_1, \dots, X_n \in \mathbb{R}^d$ be a sample of $n$ independent and identically distributed random vectors with unknown $d \times d$ covariance matrix $\Sigma_0$. Define the empirical estimate of $\Sigma_0$ to be

$$\hat\Sigma^{\mathrm{emp}} = n^{-1} \sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)^T,$$

where $\bar X = n^{-1} \sum_{i=1}^{n} X_i$ is the sample mean. The goal of the following procedure is to construct a sparse estimator, $\hat\Sigma^{\mathrm{sp}}$, for $\Sigma_0$ by first constructing a confidence set for $\Sigma_0$ about the estimator $\hat\Sigma^{\mathrm{emp}}$ and then searching this set for the sparsest member. Two different search methods for such a sparse member are outlined in Sections 2.2.1 and 2.2.2.
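As a minimal sketch of the empirical estimate defined above, the following NumPy snippet computes it with the $n^{-1}$ normalisation used in the text. The function name `empirical_covariance` and the simulated Gaussian sample are illustrative assumptions, not part of the original procedure.

```python
import numpy as np

def empirical_covariance(X):
    """Return n^{-1} sum_i (X_i - Xbar)(X_i - Xbar)^T for the rows X_i of X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)      # subtract the sample mean from every row
    return centered.T @ centered / n   # 1/n normalisation, as in the text

# Illustrative data only: n = 50 observations in R^4.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
sigma_emp = empirical_covariance(X)
```

Note that `np.cov` divides by $n - 1$ by default; passing `bias=True` reproduces the $n^{-1}$ convention used here.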

To construct such a confidence set about $\hat\Sigma^{\mathrm{emp}}$, concentration inequalities are employed. Specific inequalities are chosen based on data assumptions and are discussed in Sections 2.3.1, 2.3.2, and 2.3.3. In general, the inequalities all take a similar form. Let $d(\cdot, \cdot)$ be some metric measuring the distance between two covariance matrices, and let $\psi : \mathbb{R} \to \mathbb{R}$ be monotonically increasing. Then, the general form of the concentration inequalities is

$$P\left( d(\Sigma_0, \hat\Sigma^{\mathrm{emp}}) \ge E\, d(\Sigma_0, \hat\Sigma^{\mathrm{emp}}) + r \right) \le e^{-\psi(r)},$$

which is a bound on the tail of the distribution of $d(\Sigma_0, \hat\Sigma^{\mathrm{emp}})$ as it deviates above its mean. Thus, to construct a $(1 - \alpha)$-confidence set, the variable $r = r_\alpha$ is chosen such that $\exp(-\psi(r_\alpha)) = \alpha$. This $r_\alpha$ will be referred to as the deviation threshold.

| Assumption on $X_i$ | $d(\hat\Sigma^{\mathrm{sp}}, \Sigma_0)$ | $r_\alpha$ |
|---|---|---|
| Log-Concave Measure | $\lVert \hat\Sigma^{\mathrm{sp}} - \Sigma_0 \rVert_p^{1/2}$ | $\sqrt{(-2/c_0 n) \log \alpha}$ |
| Bounded in norm | $\lVert \hat\Sigma^{\mathrm{sp}} - \Sigma_0 \rVert_p$ | $U \sqrt{(-1/2n) \log \alpha}$ |
| Sub-Exponential Measure | $\lVert \hat\Sigma^{\mathrm{sp}} - \Sigma_0 \rVert_p^{1/2}$ | $\max\{ -K \log \alpha / \sqrt{n},\ \sqrt{-K \log \alpha} \}$ |

Table 2.1: Specific metrics $d(\cdot, \cdot)$ and deviation thresholds $r_\alpha$ given specific assumptions on the data $X_i$ discussed in subsequent sections.
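To make the deviation threshold concrete, the sketch below evaluates the three thresholds listed in Table 2.1 as functions of $\alpha$ and $n$. It rests on assumptions: the function names are hypothetical, the constants `c0`, `U` and `K` must be supplied by the user, and the fractions in the table are read as $-2/(c_0 n)$ and $-1/(2n)$.

```python
import math

def r_log_concave(alpha, n, c0):
    # sqrt((-2 / (c0 * n)) * log(alpha)); reading c0 * n as the denominator
    # is an assumption about the table's notation.
    return math.sqrt(-2.0 * math.log(alpha) / (c0 * n))

def r_bounded(alpha, n, U):
    # U * sqrt((-1 / (2 * n)) * log(alpha)); same caveat about the denominator.
    return U * math.sqrt(-math.log(alpha) / (2.0 * n))

def r_sub_exponential(alpha, n, K):
    # max{ -K log(alpha) / sqrt(n), sqrt(-K log(alpha)) }
    t = -K * math.log(alpha)
    return max(t / math.sqrt(n), math.sqrt(t))

# Illustrative values only.
r_example = r_log_concave(alpha=0.05, n=100, c0=2.0)
```

Under this reading, the log-concave threshold corresponds to $\psi(r) = c_0 n r^2 / 2$, so that $\exp(-\psi(r_\alpha)) = \alpha$ holds by construction. A smaller $\alpha$ gives a larger $r_\alpha$ and hence a larger confidence set.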

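Table 2.1 contains some explicit choices for the metric and deviation threshold given specific assumptions on the $X_i$, which are used to drive the choice of concentration inequality. These three cases are discussed separately in Section 2.3.

Now, let $\hat\Sigma^{\mathrm{sp}}$ be our sparse estimator for $\Sigma_0$. We want these two to be close in the sense of the above confidence set and therefore choose a $\hat\Sigma^{\mathrm{sp}}$ such that $d(\hat\Sigma^{\mathrm{sp}}, \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$. Consequently, we have that

$$\begin{aligned}
P\left( d(\hat\Sigma^{\mathrm{sp}}, \Sigma_0) \ge E\, d(\hat\Sigma^{\mathrm{emp}}, \Sigma_0) + 2 r_\alpha \right)
&\le P\left( d(\hat\Sigma^{\mathrm{sp}}, \hat\Sigma^{\mathrm{emp}}) + d(\hat\Sigma^{\mathrm{emp}}, \Sigma_0) \ge E\, d(\hat\Sigma^{\mathrm{emp}}, \Sigma_0) + 2 r_\alpha \right) \\
&\le P\left( d(\hat\Sigma^{\mathrm{emp}}, \Sigma_0) \ge E\, d(\hat\Sigma^{\mathrm{emp}}, \Sigma_0) + r_\alpha \right) \\
&\le \exp(-\psi(r_\alpha)) = \alpha,
\end{aligned}$$

where the first inequality is the triangle inequality and the second uses $d(\hat\Sigma^{\mathrm{sp}}, \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$.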

To actually identify such a $\hat\Sigma^{\mathrm{sp}}$, we require some criterion to optimize over all elements of the confidence set. Two such methods are proposed in the following subsections. The zeroing method in Section 2.2.1 takes inspiration from thresholding techniques for sparse covariance estimation (Bickel and Levina, 2008a; Rothman et al., 2009; Cai and Liu, 2011). It begins with $\hat\Sigma^{\mathrm{emp}}$ and attempts to zero as many entries as possible in the empirical estimate while still remaining in the confidence set, which is similar to applying a hard thresholding estimator restricted to the confines of the confidence set. The Procrustes method in Section 2.2.2 is more closely related to the shrinkage estimators (Daniels and Kass, 1999, 2001; Hoff, 2009; Johnstone and Lu, 2012). It chooses $\hat\Sigma^{\mathrm{sp}}$ to be a convex combination of $\hat\Sigma^{\mathrm{emp}}$ and some sparse target matrix using the Procrustes size and shape distance, which has been shown to be a useful metric when one is concerned with inference in the space of covariance matrices (Dryden et al., 2009).

2.2.1 Zeroing Method

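Beginning with $\hat\Sigma^{\mathrm{emp}}$, the goal of this method is to remove as many entries of $\hat\Sigma^{\mathrm{emp}}$ as possible while respecting the restriction that $d(\hat\Sigma^{\mathrm{sp}}, \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$. Here, the $(i, j)$ entry of $\hat\Sigma^{\mathrm{sp}}$ is denoted as $\tilde\sigma_{i,j}$ and the $(i, j)$ entry of $\hat\Sigma^{\mathrm{sp}}_k$ is denoted as $\tilde\sigma^{k}_{i,j}$.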

0. Set $\hat\Sigma^{\mathrm{sp}}_0 = \hat\Sigma^{\mathrm{emp}}$. Choose an $\alpha$ and compute $r_\alpha$. Later it will be shown that this method is fairly robust to the choice of $\alpha$. The cross-validation technique described below in Section 2.2.3 can be used to select a desirable $\alpha$ in practice. We also observed in the Gaussian data experiments of Section 2.4.1 that $\alpha \approx 10^{-6}$ gave good performance.

1. While $d(\hat\Sigma^{\mathrm{sp}}_k, \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$ and $\hat\Sigma^{\mathrm{sp}}_k$ has at least one non-zero off-diagonal entry:

(a) Choose the smallest non-zero off-diagonal entry in $\hat\Sigma^{\mathrm{sp}}_k$ and construct $\hat\Sigma^{\mathrm{sp}}_{k+1}$ by setting it equal to zero. That is, determine $(i, j)$ such that $i < j$ and $0 < |\tilde\sigma^{k}_{i,j}| \le |\tilde\sigma^{k}_{i',j'}|$ for all $i' < j'$ with $\tilde\sigma^{k}_{i',j'} \ne 0$.

(b) Construct $\hat\Sigma^{\mathrm{sp}}_{k+1}$ with entries $\tilde\sigma^{k+1}_{i,j} = \tilde\sigma^{k+1}_{j,i} = 0$ and $\tilde\sigma^{k+1}_{i',j'} = \tilde\sigma^{k}_{i',j'}$ for all other $(i', j') \ne (i, j)$.

2. Denote by $\hat\Sigma^{\mathrm{sp}}$ the final matrix resulting from this recursion. If $\hat\Sigma^{\mathrm{sp}}$ is not positive semi-definite, then project it onto the space of positive semi-definite matrices by mapping the negative eigenvalues to zero.
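The steps above can be sketched in NumPy as follows. This is an illustrative reading, not a definitive implementation: the function names are hypothetical, the Frobenius distance is used only as a default, and the loop is assumed to keep the last iterate that still lies inside the confidence set rather than the first one outside it.

```python
import numpy as np

def project_psd(matrix):
    """Map negative eigenvalues to zero; a PSD input is returned unchanged."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    if eigvals.min() >= 0.0:
        return matrix
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def frobenius(a, b):
    return np.linalg.norm(a - b, "fro")

def zeroing_method(sigma_emp, r_alpha, dist=frobenius):
    d = sigma_emp.shape[0]
    current = sigma_emp.copy()
    iu, ju = np.triu_indices(d, k=1)
    # Entries only ever change to zero, so sorting once by magnitude visits
    # them in the same order as repeatedly picking the smallest non-zero one.
    order = np.argsort(np.abs(sigma_emp[iu, ju]), kind="stable")
    for idx in order:
        i, j = iu[idx], ju[idx]
        if current[i, j] == 0.0:
            continue                      # already zero, nothing to remove
        candidate = current.copy()
        candidate[i, j] = candidate[j, i] = 0.0
        if dist(candidate, sigma_emp) > r_alpha:
            break                         # assumption: keep the last feasible iterate
        current = candidate
    return project_psd(current)

sigma_emp = np.array([[2.00, 0.05, 0.80],
                      [0.05, 1.00, 0.02],
                      [0.80, 0.02, 1.50]])
sigma_sp = zeroing_method(sigma_emp, r_alpha=0.1)
```

In this toy example the two smallest off-diagonal entries (0.02 and 0.05) are zeroed, while zeroing 0.80 would leave the Frobenius ball of radius 0.1.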

In the case that the metric $d(\cdot, \cdot)$ is a monotonically increasing function of the Hilbert-Schmidt / Frobenius norm $\lVert \hat\Sigma^{\mathrm{sp}}_k - \hat\Sigma^{\mathrm{emp}} \rVert_2$, the sequence $d(\hat\Sigma^{\mathrm{sp}}_k, \hat\Sigma^{\mathrm{emp}})$ will be increasing in $k$. This is true because the Frobenius norm is equal to the $\ell^2$ norm of the entries in the matrix, and as we run the algorithm, more entries in the difference $\hat\Sigma^{\mathrm{sp}}_k - \hat\Sigma^{\mathrm{emp}}$ will be non-zero. This property guarantees that the above algorithm will find the sparsest $\hat\Sigma^{\mathrm{sp}}$ in the confidence set in the sense of having the most zero entries. However, for an arbitrary metric, this sequence may not necessarily be strictly increasing in $k$. Another commonly used norm, which will be shown in Section 2.4 to give superior performance on simulated data, is the operator norm $\lVert \hat\Sigma^{\mathrm{sp}}_k - \hat\Sigma^{\mathrm{emp}} \rVert_\infty$, which does not yield a monotonically increasing sequence.

However, this sequence is roughly increasing in the sense that it is lower bounded by definition by the maximum $\ell^2$ norm of the columns of $\hat\Sigma^{\mathrm{sp}}_k - \hat\Sigma^{\mathrm{emp}}$, which is an increasing sequence. Furthermore, it is upper bounded by the maximum $\ell^1$ norm of the columns of $\hat\Sigma^{\mathrm{sp}}_k - \hat\Sigma^{\mathrm{emp}}$, which follows from the Gershgorin circle theorem (Iserles, 2009), and which is also an increasing sequence. In practice, the operator norm in particular gives superior performance in the numerical simulations of Section 2.4.
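The bounds just described can be checked numerically. The sketch below, on simulated data chosen purely for illustration, zeroes off-diagonal entries in order of magnitude and records the Frobenius norm, the operator norm, and the maximum column $\ell^2$ and $\ell^1$ norms of the difference $\hat\Sigma^{\mathrm{sp}}_k - \hat\Sigma^{\mathrm{emp}}$ at each step.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 6))                 # illustrative sample only
sigma_emp = np.cov(X, rowvar=False, bias=True)

iu, ju = np.triu_indices(6, k=1)
order = np.argsort(np.abs(sigma_emp[iu, ju]))    # smallest magnitude first

current = sigma_emp.copy()
frob, op, col_l2, col_l1 = [], [], [], []
for idx in order:
    i, j = iu[idx], ju[idx]
    current[i, j] = current[j, i] = 0.0
    diff = current - sigma_emp
    frob.append(np.linalg.norm(diff, "fro"))             # Frobenius norm
    op.append(np.linalg.norm(diff, 2))                   # largest singular value
    col_l2.append(np.linalg.norm(diff, axis=0).max())    # max column l2 norm
    col_l1.append(np.abs(diff).sum(axis=0).max())        # max column l1 norm
```

At every step `col_l2[k] <= op[k] <= col_l1[k]`, and the Frobenius and column-norm sequences never decrease, whereas `op` itself carries no such guarantee.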

From a computational perspective, the above algorithm as stated requires an unacceptable $O(d^2)$ eigenvalue decompositions and thus does not scale well as the dimension of the matrix increases. To account for this, a binary search routine can be incorporated, resulting in a reduction to $O(\log_2 d)$ eigenvalue decompositions. In short, set $z_k = \lfloor (d^2 - d)/2^k \rfloor$ to be the number of non-zero off-diagonal entries to set to zero in step (1a). If the resulting $\hat\Sigma^{\mathrm{sp}}_{k+1}$ from step (1b) is such that $d(\hat\Sigma^{\mathrm{sp}}_{k+1}, \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$, then continue as normal and attempt to remove $z_{k+1}$ more entries. Otherwise, if $d(\hat\Sigma^{\mathrm{sp}}_{k+1}, \hat\Sigma^{\mathrm{emp}}) > r_\alpha$, set $\hat\Sigma^{\mathrm{sp}}_{k+1} \leftarrow \hat\Sigma^{\mathrm{sp}}_k$, then continue again as normal with $z_{k+1}$ as before.
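A minimal sketch of this binary search variant follows, assuming $z_k$ counts upper-triangular pairs (each zeroed symmetrically) and using the Frobenius distance as a stand-in metric. The function name and the example matrix are hypothetical, and the positive semi-definite projection of step 2 is omitted for brevity.

```python
import numpy as np

def frobenius(a, b):
    return np.linalg.norm(a - b, "fro")

def zeroing_binary_search(sigma_emp, r_alpha, dist=frobenius):
    d = sigma_emp.shape[0]
    iu, ju = np.triu_indices(d, k=1)
    order = np.argsort(np.abs(sigma_emp[iu, ju]), kind="stable")
    rows, cols = iu[order], ju[order]      # pairs sorted by increasing magnitude
    n_pairs = len(order)
    n_zeroed, k = 0, 1
    while True:
        z_k = (d * d - d) // 2 ** k        # block size floor((d^2 - d) / 2^k)
        if z_k < 1:
            break
        upto = min(n_zeroed + z_k, n_pairs)
        candidate = sigma_emp.copy()
        candidate[rows[:upto], cols[:upto]] = 0.0
        candidate[cols[:upto], rows[:upto]] = 0.0
        if dist(candidate, sigma_emp) <= r_alpha:
            n_zeroed = upto                # accept: these entries stay zero
        k += 1                             # on rejection, retry with a smaller block
    result = sigma_emp.copy()
    result[rows[:n_zeroed], cols[:n_zeroed]] = 0.0
    result[cols[:n_zeroed], rows[:n_zeroed]] = 0.0
    return result, n_zeroed

sigma_emp = np.array([[2.00, 0.01, 0.02, 0.03],
                      [0.01, 2.00, 0.04, 0.50],
                      [0.02, 0.04, 2.00, 0.60],
                      [0.03, 0.50, 0.60, 2.00]])
sigma_sp, n_zeroed = zeroing_binary_search(sigma_emp, r_alpha=0.1)
```

Because of the floor in $z_k$, this schedule can stop slightly short of the sparsest member for some dimensions; a few single-entry steps of the original loop could be appended if exactness matters.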

2.2.2 Procrustes Method

Past research into estimation and hypothesis testing for covariance matrices and operators has highlighted the superior performance of the Procrustes size and shape distance when compared with other metrics (Dryden et al., 2009; Pigoli et al., 2014; Cabassi et al., 2017). The intuition behind this metric and why it is popular in the context of shape analysis is that it allows for unitary transformations to best align the two objects under scrutiny.

In the context of sparse estimation, the Procrustes distance is used to construct $\hat\Sigma^{\mathrm{sp}}$ as a convex combination of $\hat\Sigma^{\mathrm{emp}}$ and some sparse target matrix $\Sigma^{\mathrm{tar}}$ that presumably lies outside of the confidence set. Hence, this approach attempts to move or shrink from the empirical estimator to the sparse target along a path determined by the Procrustes metric. Specifically, set $L^{\mathrm{emp}} = (\hat\Sigma^{\mathrm{emp}})^{1/2}$ and $L^{\mathrm{tar}} = (\Sigma^{\mathrm{tar}})^{1/2}$, and construct the estimator as a function of some $\gamma \in [0, 1]$ to be

$$\hat\Sigma^{\mathrm{sp}}(\gamma) = \left[ L^{\mathrm{emp}} + \gamma (L^{\mathrm{tar}} R - L^{\mathrm{emp}}) \right] \left[ L^{\mathrm{emp}} + \gamma (L^{\mathrm{tar}} R - L^{\mathrm{emp}}) \right]^T,$$

where $R = U V^T$ and $U$ and $V$ are, respectively, the left and right matrices of singular vectors for the matrix $(L^{\mathrm{tar}})^T L^{\mathrm{emp}}$ (Pigoli et al., 2014, Section 3). The argument $\gamma \in [0, 1]$ is chosen to be as large as possible while it still holds that $d(\hat\Sigma^{\mathrm{sp}}(\gamma), \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$.
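The construction of $\hat\Sigma^{\mathrm{sp}}(\gamma)$ can be sketched as below. The symmetric square root obtained from an eigendecomposition is one possible choice of $L = \Sigma^{1/2}$ and is an assumption here, as are the function names and the small example matrix.

```python
import numpy as np

def sym_sqrt(sigma):
    """Symmetric PSD square root via the eigendecomposition (one choice of L)."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return (eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T

def procrustes_path(sigma_emp, sigma_tar, gamma):
    L_emp = sym_sqrt(sigma_emp)
    L_tar = sym_sqrt(sigma_tar)
    # np.linalg.svd returns U, singular values, and V^T, so R = U V^T = U @ Vt.
    U, _, Vt = np.linalg.svd(L_tar.T @ L_emp)
    R = U @ Vt
    L = L_emp + gamma * (L_tar @ R - L_emp)
    return L @ L.T

sigma_emp = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
sigma_tar = np.eye(2)                     # identity as the sparse target
halfway = procrustes_path(sigma_emp, sigma_tar, gamma=0.5)
```

By construction, `gamma=0` returns $\hat\Sigma^{\mathrm{emp}}$ and `gamma=1` returns $\Sigma^{\mathrm{tar}}$, since $R$ is orthogonal. Every point on the path is of the form $L L^T$ and is therefore positive semi-definite.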

This method finds the estimator closest to $\Sigma^{\mathrm{tar}}$ with respect to the Procrustes distance that is still in some confidence ball about $\hat\Sigma^{\mathrm{emp}}$. In practice, a choice of $\Sigma^{\mathrm{tar}}$ must be made based on some assumption regarding the nature of the true $\Sigma_0$. In the case of sparse estimation, either $I_d$, the $d \times d$ identity matrix, or the diagonal [...] semi-Bayesian feel as we are compromising between the empirical estimate and some prior chosen sparse target.

It is easily seen that this distance is an increasing function of $\gamma$ for any $p$-Schatten norm. Thus, our goal is to determine the maximal value of $\gamma$ such that $d_{\mathrm{Proc}}(\hat\Sigma^{\mathrm{sp}}(\gamma), \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$. To compute this estimator in practice, a binary search procedure similar to that for the above zeroing method can be implemented. Begin with the initial values $\gamma = \delta = 0.5$. If $d(\hat\Sigma^{\mathrm{sp}}(\gamma), \hat\Sigma^{\mathrm{emp}}) \le r_\alpha$, set $\gamma \leftarrow \gamma + \delta/2$ and $\delta \leftarrow \delta/2$. Otherwise set $\gamma \leftarrow \gamma - \delta/2$ and $\delta \leftarrow \delta/2$. This will quickly converge on the optimal choice of $\gamma$.
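The update rule for $\gamma$ can be sketched as a short bisection. The helper below is hypothetical: it takes any callable mapping $\gamma$ to the distance $d(\hat\Sigma^{\mathrm{sp}}(\gamma), \hat\Sigma^{\mathrm{emp}})$, assumes that distance is increasing in $\gamma$, and additionally tracks the largest feasible $\gamma$ seen so far, which the description above does not spell out.

```python
def largest_feasible_gamma(dist_of_gamma, r_alpha, n_steps=40):
    """Bisection on gamma in [0, 1], starting from gamma = delta = 0.5."""
    gamma, delta = 0.5, 0.5
    best = 0.0                      # gamma = 0 gives the empirical estimate itself
    for _ in range(n_steps):
        if dist_of_gamma(gamma) <= r_alpha:
            best = max(best, gamma)  # still inside the confidence set
            gamma += delta / 2.0
        else:
            gamma -= delta / 2.0
        delta /= 2.0
    return best

# Toy monotone distance, for illustration only: d(gamma) = 2 * gamma.
gamma_star = largest_feasible_gamma(lambda g: 2.0 * g, r_alpha=0.6)
```

With the toy distance the feasible region is $\gamma \le 0.3$, and the search approaches that boundary from inside, halving its step at each iteration.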

2.2.3 Cross-Validation

In practice, an optimal value of α ∈ (0, 1) must be chosen to enforce the proper amount of sparsity. Beyond that, many of the concentration inequalities arrive with finite but unknown coefficients that may only have loose upper bounds known. Hence, we propose a cross-validation technique for tuning α, which takes its inspiration from the similar technique proposed in the thresholding literature (Bickel and Levina, 2008a; Rothman et al., 2009; Cai and Liu, 2011).

Given $n = 2m$ observations, we split the data randomly in half to get $X^1_1, \dots, X^1_m$ and $X^2_1, \dots, X^2_m$. Then, the two empirical estimators $\hat\Sigma^{\mathrm{emp}}_1$ and $\hat\Sigma^{\mathrm{emp}}_2$ are constructed. The desired sparsifying procedure is applied to $\hat\Sigma^{\mathrm{emp}}_2$ for a variety of $\alpha \in A$, resulting in the collection of estimators $\{\hat\Sigma^{\mathrm{sp}}_\alpha\}_{\alpha \in A}$. The value of $\alpha$ is chosen as $\alpha = \arg\min_{\alpha \in A} d(\hat\Sigma^{\mathrm{emp}}_1, \hat\Sigma^{\mathrm{sp}}_\alpha)$ for some metric $d(\cdot, \cdot)$. This process is repeated $k$ times, resulting in the set $\{\alpha_1, \dots, \alpha_k\}$. Then, the cross-validated choice is the average of the $\alpha_i$ in the log domain, which is $\alpha = \exp\left(k^{-1} \sum_{i=1}^{k} \log \alpha_i\right)$. The reason for the log, as will be seen in the following sections, is that our deviation threshold $r_\alpha$ is often a function of $\log \alpha$ stemming from the application of the concentration inequalities.
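The cross-validation loop can be sketched as follows. Everything named here is an illustrative assumption: `sparsify` stands for whichever procedure from Sections 2.2.1 or 2.2.2 is in use, and `toy_sparsify` is only a hard-thresholding stand-in so that the example runs on its own.

```python
import numpy as np

def log_domain_average(alphas):
    """exp of the mean of log(alpha_i), i.e. the geometric mean."""
    return float(np.exp(np.mean(np.log(alphas))))

def frobenius(a, b):
    return np.linalg.norm(a - b, "fro")

def toy_sparsify(sigma, alpha, m):
    # Hypothetical stand-in: zero off-diagonal entries below sqrt(-log(alpha) / m).
    threshold = np.sqrt(-np.log(alpha) / m)
    out = np.where(np.abs(sigma) < threshold, 0.0, sigma)
    np.fill_diagonal(out, np.diag(sigma))       # never threshold the diagonal
    return out

def cross_validate_alpha(X, alphas, sparsify, dist=frobenius, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = n // 2
    chosen = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)               # random split into two halves
        emp1 = np.cov(X[perm[:m]], rowvar=False, bias=True)
        emp2 = np.cov(X[perm[m:2 * m]], rowvar=False, bias=True)
        losses = [dist(emp1, sparsify(emp2, a, m)) for a in alphas]
        chosen.append(alphas[int(np.argmin(losses))])
    return log_domain_average(chosen)

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))                # illustrative data only
alphas = [1e-1, 1e-2, 1e-4, 1e-6]
alpha_cv = cross_validate_alpha(X, alphas, toy_sparsify)
```

Averaging in the log domain means that, for example, the choices $10^{-2}$ and $10^{-4}$ combine to $10^{-3}$ rather than to roughly $5 \times 10^{-3}$.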

In document A concentration inequality based statistical methodology for inference on covariance matrices and operators (Page 34-38)