Remarks - Discovery of low-dimensional structure in high-dimensional inference problems

In this chapter we presented our information-theoretic analysis of the sample complexity of sparse signal processing problems. Our framework is unifying in the sense that it applies to arbitrary (e.g., linear or nonlinear) observation models and variable distributions, since we require only the simple set of assumptions (A1)–(A5) listed in Section 3.2. We used an analogy to channel coding to formulate the problem of set decoding, which allowed us to obtain mutual information formulas that determine upper and lower bounds. We then extended the initial analysis to different modalities, such as models with latent observation variables and scaling models. This makes our analysis applicable to many problems commonly studied in sparse signal processing, for which we present specific results in Chapter 5.

Set decoding vs. channel coding. The main difference between the error-probability analyses of set decoding and channel coding is that, in set decoding, the codewords of a candidate set ˆS and of the true set S are not necessarily independent even if X is generated IID, since the two sets may overlap. To overcome this difficulty, in Lemmas 3.3.1 and 3.6.1 we separated the error events Ei, i = 1, . . . , K, of misidentifying i items of the true set S, and averaged over the possible codeword realizations for every candidate set with i differing elements S \ ˜S.

Interpretation. Intuitively, the bounds in (3.9), (3.34) and (3.46) can be explained as follows. For each i, the numerator is approximately the number of bits required to represent all sets Sω that differ from S in i elements. The denominator represents the information that the output variable Y provides about the remaining i indices S \ ˜S, given the subset ˜S of K − i true indices. Hence, the ratio represents the number of samples needed to control i support errors, and the maximization accounts for all possible support errors.
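The counting behind the numerator can be checked on a toy instance. The sketch below is only an illustration: it enumerates every K-subset of a small index set, groups the candidates by the number i of true indices they miss, and compares the counts with the plain combinatorial formula C(K, i) · C(D − K, i). The exact numerators of (3.9), (3.34) and (3.46) are not reproduced here, and the per-sample information values passed to `samples_needed` are made-up placeholders, not quantities derived from any observation model.

```python
from itertools import combinations
from math import comb, log2


def sets_by_error_count(D, K, S):
    """Count the K-subsets of {0, ..., D-1} by i = number of elements of S they miss."""
    S = frozenset(S)
    counts = {i: 0 for i in range(K + 1)}
    for candidate in combinations(range(D), K):
        i = len(S - set(candidate))  # true indices missing from the candidate
        counts[i] += 1
    return counts


def samples_needed(D, K, info_per_sample):
    """Max over i of (bits to index the i-error candidates) / (bits gained per sample).

    info_per_sample[i] is a HYPOTHETICAL stand-in for the mutual information
    term in the denominator; it is not computed from a model here.
    """
    return max(
        log2(comb(K, i) * comb(D - K, i)) / info_per_sample[i]
        for i in range(1, K + 1)
    )


D, K = 10, 3
S = (0, 1, 2)
counts = sets_by_error_count(D, K, S)
for i in range(1, K + 1):
    closed_form = comb(K, i) * comb(D - K, i)
    assert counts[i] == closed_form  # enumeration agrees with the formula
    print(f"i={i}: {counts[i]} candidate sets, {log2(closed_form):.2f} bits")

# Made-up information values (bits per sample) for i = 1, 2, 3 errors.
hypothetical_info = {1: 0.2, 2: 0.5, 3: 0.9}
print("samples needed (toy numbers):", samples_needed(D, K, hypothetical_info))
```

With these placeholder values the maximum is attained at i = 1, which illustrates why the maximization over i is needed: the hardest error event is not necessarily the one with the most candidate sets.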

Upper vs. lower bounds. We have shown and remarked in Sections 3.5 and 3.6 that, for arbitrarily small recovery error ε → 0, the upper bound for non-scaling models in Theorem 3.6.1 is tight, as it matches the lower bound given in Theorem 3.5.1. The upper bound in Theorem 3.7.1 for arbitrary models is also tight up to a constant factor C, provided that mild regularity conditions on the problem hold. A novel contribution of our work is proving that worst-case conditions w.r.t. βS are both necessary and sufficient to obtain support recovery with a small error probability, in contrast to previous results that either consider a deterministic setup for βS or consider an average-case analysis for lower bounds.

Partial recovery. Since we analyze the error probability separately for i = 1, . . . , K support errors, corresponding to ˜S ⊂ S with |S \ ˜S| = i, in order to obtain the necessity and sufficiency results, it is straightforward to determine necessary and sufficient conditions for partial support recovery instead of exact support recovery. By restricting the maximization from all subsets ˜S ⊂ S (i.e. i = 1, . . . , K) to those ˜S ⊂ S such that |S \ ˜S| > K − k (i.e. i = K − k + 1, . . . , K), the conditions under which at least k of the K support indices can be determined are obtained.

Technical issues with typicality decoding. It is worth mentioning that a typicality decoder can also be analyzed to obtain a sufficient condition, as was done in the early versions of (Atia and Saligrama, 2009). However, typicality conditions must be defined carefully to obtain a tight bound w.r.t. K, since with standard typicality definitions the atypicality probability may dominate the decoding error probability in the typical set. For instance, for the group testing scenario considered in (Atia and Saligrama, 2012), where Xn ∼ Bernoulli(1/K), we have Pr[XS = (1, . . . , 1)] = (1/K)^K, which would require the undesirable scaling of N as K^K to ensure typicality in the strong sense (as needed to apply results such as the packing lemma (El Gamal and Kim, 2011, p. 32)). Redefining the typical set as in (Atia and Saligrama, 2009) is then necessary, but it is problem-specific and makes the analysis cumbersome compared to the ML decoder adopted herein and in (Atia and Saligrama, 2012). Furthermore, the case where K scales together with D requires an even more subtle analysis, whereas the analysis of the ML decoder is more straightforward with regard to that scaling. Typicality decoding has also been reported as infeasible for the analysis of similar problems, such as multiple access channels where the number of users scales with the coding block length (Chen and Guo, 2013).
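To get a feel for how quickly this scaling becomes prohibitive, the following sketch evaluates the all-ones probability (1/K)^K and its reciprocal K^K for a few values of K. It assumes K IID Bernoulli(1/K) entries, as in the group testing example above; reading the reciprocal as the expected number of samples before the all-ones pattern is first observed is the standard geometric waiting-time fact and is used here only as an illustration, not as part of the original argument.

```python
def all_ones_probability(K):
    """Pr[X_S = (1, ..., 1)] for K IID Bernoulli(1/K) entries."""
    return (1.0 / K) ** K


def expected_samples_until_all_ones(K):
    """Mean geometric waiting time 1/p for the all-ones pattern, i.e. K**K."""
    return K ** K


for K in (2, 4, 8, 16):
    p = all_ones_probability(K)
    n = expected_samples_until_all_ones(K)
    print(f"K={K:2d}  Pr[all ones]={p:.3e}  K^K={n}")
```

Already for moderate K the count K^K far exceeds any sample size of practical interest, which is the reason a problem-specific typical set is needed for that style of analysis.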

Support recovery and support coefficients. In the proof of Theorem 3.5.1, we showed that βS being unknown with prior P(βS) induces a penalty term in the denominator given by I(βS; XS\˜S | X˜S, Y, S)/N, compared to the case where the support coefficients βS are fixed and known. In the proof of Theorem 3.6.1, we similarly showed that a random βS induces the H_{1/2}(βS) term, which is dominated by I_˜S and therefore does not affect the sample complexity asymptotically for non-scaling models. This shows that recovering the support given knowledge of the support coefficients is as hard as recovering the support with unknown coefficients in high sparsity regimes, underlining the importance of recovering the support in sparse recovery problems.

Structured sparsity. For candidate sets we considered an unstructured framework, where we assumed that the underlying set Sω belonged to the combinatorial set S = {S : S ⊂ {1, . . . , D}, |S| = K} of all K-sets with equal probability. However, it is trivial to extend our analysis to problems with “structured” sparsity, where the set of candidate sets S is a subset of the set S as defined above. One example of such problems is multivariate regression in Section 5.1.7, where multiple problems are constrained to share the same support. Another example is where the structural information can be encoded with respect to a D-node graph G = (V, E). Here, we can consider the collection S as the family of all connected subgraphs of size K, so that each S in the collection is a set of K nodes whose induced subgraph is connected. Such problems arise in many interesting scenarios (Qian and Saligrama, 2014a; Qian et al., 2014), such as disease outbreak detection, medical imaging and inverse problems, where the underlying signal must satisfy connectivity constraints.

In these problems, our high-level sample complexity expression would change from (3.3) to

N > max_{˜S ⊂ S} log |S_˜S| / I_˜S, (3.49)

where S_˜S = {S ∈ S : S ⊃ ˜S}, with the membership S ∈ S taken over the structured family of candidate sets. Intuitively, S_˜S is the set of all structures that are consistent with the partially recovered set ˜S. Only the numerator changes relative to (3.3) and the analysis in this section, accounting for the change in the number of feasible K-sets.
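The effect of structure on the numerator can be illustrated on a toy graph. The sketch below is an assumption-laden example rather than part of the analysis: it takes a path graph on D nodes (a hypothetical choice), enumerates the connected K-node subsets that contain a given partial set, and compares that count with the number of unstructured K-sets containing the same partial set, C(D − |˜S|, K − |˜S|). The function names and the brute-force enumeration are illustrative only and would not scale beyond small D.

```python
from itertools import combinations
from math import comb, log2


def is_connected(nodes, edges):
    """Check whether the subgraph induced by `nodes` is connected (DFS)."""
    nodes = set(nodes)
    if not nodes:
        return False
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adjacency[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def consistent_structures(D, K, edges, partial):
    """All connected K-node subsets of {0, ..., D-1} that contain `partial`."""
    partial = set(partial)
    return [
        S for S in combinations(range(D), K)
        if partial <= set(S) and is_connected(S, edges)
    ]


# Toy instance: path graph 0 - 1 - ... - 7, supports of size 3, node 3 known.
D, K = 8, 3
path_edges = [(v, v + 1) for v in range(D - 1)]
partial = (3,)

structured = len(consistent_structures(D, K, path_edges, partial))
unstructured = comb(D - len(partial), K - len(partial))
print(f"structured:   {structured} sets, {log2(structured):.2f} bits")
print(f"unstructured: {unstructured} sets, {log2(unstructured):.2f} bits")
```

On this toy instance the connectivity constraint leaves far fewer feasible completions of the partial set than the unstructured case, so the numerator in (3.49) shrinks while the denominator is unaffected.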
