CHAPTER 3 : A Convex Optimization Approach to High-dimensional Sparse Quadratic

3.4 Theoretical Guarantees

We first establish upper bounds for the estimation and classification errors and then provide matching minimax lower bounds, up to logarithmic factors. Together, these results show the near-optimality of the SDAR rule.

3.4.1. Upper bounds

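To overcome the limitations illustrated in Section 3.2, we consider a parameter space for $\theta = (\pi_1, \pi_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2)$ in which both the discriminating direction $\beta$ and the differential graph $D$ are assumed to be sparse. Let $f_{Q,\theta}$ be the probability density of $Q(z;\theta)$ defined in (3.3). We consider the following parameter space:

\[
\begin{aligned}
\Theta_p(s_1, s_2) = \Big\{ \theta = (\pi_1, \pi_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2) :\ & \mu_1, \mu_2 \in \mathbb{R}^p,\ \Sigma_1, \Sigma_2 \succ 0,\ |D|_0 \le s_1,\ \|\beta\|_0 \le s_2, \\
& \|D\|_F, \|\beta\|_2 \le M_0,\ M_1^{-1} \le \lambda_{\min}(\Sigma_k) \le \lambda_{\max}(\Sigma_k) \le M_1,\ k = 1, 2, \\
& \sup_{|x| < \delta} f_{Q,\theta}(x) < M_2,\ c \le \pi_1, \pi_2 \le 1 - c \Big\},
\end{aligned}
\tag{3.8}
\]

for some constants $M_0 > 0$, $M_1 > 1$, $\delta, M_2 > 0$ and $c \in (0, 1/2)$.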

Remark 3. We assume sparsity of both the discriminant direction $\beta$ and the differential graph $D$; the necessity of these two assumptions is shown by Theorems 1 and 2. The upper bound on $\|\beta\|_2$ is a common assumption in LDA, see Cai and Liu (2011), Neykov et al. (2015) and Cai et al. (2019a), and we assume the same for $\|\mathrm{vec}(D)\|_2 = \|D\|_F$ in the QDA setting. Moreover, the bounded density condition is commonly assumed in discriminant analysis, see condition (C1) in Cai and Liu (2011) and the discussions in Li and Shao (2015) and Jiang et al. (2015). We now present a condition on $\theta$ under which this bounded density assumption holds. Note that the term $z^\top D z + \beta^\top z$ is equal in distribution to a weighted sum of non-central chi-square random variables. By an argument similar to the proof of Lemma 7.2 in Xu et al. (2014), the condition $\sup_{|x| < \delta} f_{Q,\theta}(x) < M_2$ holds when either the two largest positive eigenvalues of $D$, $\lambda_1(D), \lambda_2(D)$, or the two largest negative eigenvalues of $D$, $\tilde\lambda_1(D), \tilde\lambda_2(D)$, are of the same order, that is,

\[
0 < \liminf_{n \to \infty} \frac{\lambda_1(D)}{\lambda_1(D) + \lambda_2(D)} \le \limsup_{n \to \infty} \frac{\lambda_1(D)}{\lambda_1(D) + \lambda_2(D)} < 1
\]

or

\[
0 < \liminf_{n \to \infty} \frac{\tilde\lambda_1(D)}{\tilde\lambda_1(D) + \tilde\lambda_2(D)} \le \limsup_{n \to \infty} \frac{\tilde\lambda_1(D)}{\tilde\lambda_1(D) + \tilde\lambda_2(D)} < 1.
\]
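The reduction of $z^\top D z + \beta^\top z$ to a weighted sum of (shifted) squared standard normals can be checked numerically. The following is a minimal sketch, not the argument used in the thesis: it assumes $z = \mu + \Sigma^{1/2} w$ with $w$ standard normal and a symmetric $D$, and the function names and the randomly generated inputs are hypothetical. After rotating by the eigenvectors of $\Sigma^{1/2} D \Sigma^{1/2}$, each coordinate contributes $\lambda_i u_i^2 + b_i u_i$, which for $\lambda_i \neq 0$ equals $\lambda_i (u_i + b_i / (2\lambda_i))^2$ minus a constant, that is, a weighted non-central chi-square term with one degree of freedom. Coordinates with $\lambda_i = 0$ contribute only a Gaussian linear term.

```python
import numpy as np


def quadratic_form_direct(z, D, beta):
    """Evaluate z^T D z + beta^T z directly."""
    return float(z @ D @ z + beta @ z)


def quadratic_form_decomposed(w, mu, Sigma, D, beta):
    """Evaluate the same quantity for z = mu + Sigma^{1/2} w, one coordinate at a time."""
    # Symmetric square root of Sigma (Sigma is assumed positive definite).
    vals, vecs = np.linalg.eigh(Sigma)
    root = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    # Rotate so that the quadratic part becomes diagonal with weights lam.
    lam, U = np.linalg.eigh(root @ D @ root)
    u = U.T @ w  # still standard normal when w is
    b = U.T @ root @ (2.0 * D @ mu + beta)  # coefficients of the linear part
    const = mu @ D @ mu + beta @ mu
    # Sum of lam_i * u_i^2 + b_i * u_i over coordinates, plus a constant.
    return float(np.sum(lam * u**2 + b * u) + const)


# Hypothetical inputs, chosen only to exercise the identity.
rng = np.random.default_rng(0)
p = 6
B = rng.normal(size=(p, p))
Sigma = B @ B.T + np.eye(p)
S = rng.normal(size=(p, p))
D = (S + S.T) / 2.0  # symmetric, like a difference of precision matrices
beta = rng.normal(size=p)
mu = rng.normal(size=p)
w = rng.normal(size=p)

vals, vecs = np.linalg.eigh(Sigma)
z = mu + vecs @ np.diag(np.sqrt(vals)) @ vecs.T @ w

direct = quadratic_form_direct(z, D, beta)
decomposed = quadratic_form_decomposed(w, mu, Sigma, D, beta)
print(direct, decomposed)
```

The two printed values should agree up to floating-point error, since the decomposition is an exact algebraic identity for every draw of $w$.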

The following theorem shows that the estimators obtained from (3.5) and (3.6) converge to the true parameters $D$ and $\beta$. This theorem will then be used to establish the consistency of the proposed classification rule.

Theorem 3. Consider the parameter space $\Theta_p(s_1, s_2)$, and assume that $n_1 \asymp n_2$ and $s_1 + s_2 \lesssim n / \log p$, where $n = \min\{n_1, n_2\}$. In the optimization problems (3.5) and (3.6), let $\lambda_{i,n} = c_i \sqrt{\log p / n}$, $i = 1, 2$, where $c_1, c_2 > 0$ are sufficiently large constants. Then, as $n$ goes to infinity, the estimators obtained in (3.5) and (3.6) satisfy, with probability at least $1 - p^{-1}$,

\[
\|\hat{D} - D\|_F \lesssim \sqrt{\frac{s_1 \log p}{n}}; \qquad \|\hat\beta - \beta\|_2 \lesssim \sqrt{\frac{s_2 \log p}{n}}.
\]

The above theorem shows that, although our estimating procedure (3.6) differs from that of Zhao et al. (2014), it attains the same convergence rate under milder theoretical conditions. In fact, Zhao et al. (2014) assume that $\|\Omega_1\|_1$ and $\|\Omega_2\|_1$ are both bounded, and additionally require that the off-diagonal elements of $\Sigma_1$ and $\Sigma_2$ vanish as $n \to \infty$, which is much stronger than the conditions in (3.8). In addition, the above bound implies that when $\Sigma_1 = \Sigma_2$, that is, $s_1 = 0$, we have $\hat{D} = D = 0$ when $\lambda_{1,n}$ is suitably chosen. This implies that when the two covariance matrices are equal, the SDAR rule (3.7) adaptively reduces to the LPD rule of Cai and Liu (2011), which was designed for high-dimensional LDA.

We now turn to the performance of the classification rule $\hat{G}_{\mathrm{SDAR}}$. The behavior of $\hat{G}_{\mathrm{SDAR}}$ is measured by the excess risk $R_\theta(\hat{G}_{\mathrm{SDAR}}) - R_\theta(G^*_\theta)$, defined in (4.32). The following theorem provides an upper bound for the excess classification error.

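Theorem 4. Consider the parameter space $\Theta_p(s_1, s_2)$, and assume that $n_1 \asymp n_2$ and $s_1 + s_2 \lesssim n / (\log p \cdot \log^2 n)$. Then the proposed SDAR classification rule in (3.7) satisfies, for sufficiently large $n$,

\[
\sup_{\theta \in \Theta_p(s_1, s_2)} \mathbb{E}\left[ R_\theta(\hat{G}_{\mathrm{SDAR}}) - R_\theta(G^*_\theta) \right] \lesssim (s_1 + s_2) \cdot \frac{\log p}{n} \cdot \log^2 n.
\]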

The result in Theorem 4 shows that $\hat{G}_{\mathrm{SDAR}}$ is able to mimic $G^*_\theta$ consistently over the parameter space $\Theta_p(s_1, s_2)$ and, to the best of our knowledge, gives the first explicit convergence rate of the classification error for the high-dimensional QDA problem.

Remark 4. Related work on the convergence of the classification error includes Li and Shao (2015) and Jiang et al. (2015), but both Theorem 3 in Li and Shao (2015) and Theorem 4 in Jiang et al. (2015) only show the consistency of their proposed classification rules, not explicit convergence rates. Although Corollary 3 of Jiang et al. (2015) gives a convergence rate for the classification error of order $s_1 s_2^2 \sqrt{\log p / n}$ under some regularity conditions, this result is based on the assumption that an intercept term $\eta$, defined in their paper, is known. Jiang et al. (2015) proposed to estimate $\eta$ by cross validation, and their Theorem 3 shows the consistency of this estimate without an explicit convergence rate. In contrast, our paper shows that the convergence rate $O((s_1 + s_2) \log p \cdot \log^2 n / n)$ is achievable, which is much faster than their results. In addition, the assumptions here are weaker.

The major technical challenge of this improvement is the characterization of the distribution of $Q(z;\theta)$, which involves a sum of weighted non-central chi-square random variables. In the next section we show that this convergence rate is indeed optimal up to logarithmic factors.

3.4.2. Minimax lower bound for sparse QDA

In this section we establish the minimax lower bound for the convergence rate of $R_\theta(\hat{G}) - R_\theta(G^*_\theta)$, and thus show the optimality of $\hat{G}_{\mathrm{SDAR}}$ up to logarithmic factors.

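Theorem 5. Consider the parameter space $\Theta_p(s_1, s_2)$ defined in (3.8). Suppose $n_1 \asymp n_2$, $1 \le s_1, s_2 \le o(n / \log p)$, and $\hat{G}$ is constructed based on the observations $x_1, \dots, x_{n_1} \overset{\mathrm{i.i.d.}}{\sim} N_p(\mu_1, \Sigma_1)$ and $y_1, \dots, y_{n_2} \overset{\mathrm{i.i.d.}}{\sim} N_p(\mu_2, \Sigma_2)$. Then the minimax risk of the classification error over $\Theta_p(s_1, s_2)$ satisfies

\[
\inf_{\hat{G}} \sup_{\theta \in \Theta_p(s_1, s_2)} \mathbb{E}\left[ R_\theta(\hat{G}) - R_\theta(G^*_\theta) \right] \gtrsim (s_1 + s_2) \cdot \frac{\log p}{n}.
\]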
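To make "optimal up to logarithmic factors" concrete, the rate in the upper bound of Theorem 4 can be compared with the rate in the lower bound of Theorem 5. The sketch below is only an illustration: the function names and the numerical values of $s_1, s_2, n, p$ are hypothetical, and the unspecified constants hidden in $\lesssim$ and $\gtrsim$ are set to 1.

```python
import math


def upper_rate(s1, s2, n, p):
    """Rate in the Theorem 4 upper bound, with the hidden constant set to 1."""
    return (s1 + s2) * math.log(p) / n * math.log(n) ** 2


def lower_rate(s1, s2, n, p):
    """Rate in the Theorem 5 lower bound, with the hidden constant set to 1."""
    return (s1 + s2) * math.log(p) / n


# Hypothetical problem sizes, chosen only for illustration.
s1, s2, n, p = 10, 5, 2000, 5000
gap = upper_rate(s1, s2, n, p) / lower_rate(s1, s2, n, p)
print(gap, math.log(n) ** 2)
```

Whatever the sparsity levels and dimension, the ratio of the two rates is exactly $\log^2 n$, so the upper and lower bounds differ only by this logarithmic factor.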

The challenge in proving Theorem 5 is that the excess risk $R_\theta(\hat{G}) - R_\theta(G^*_\theta)$ does not satisfy the triangle inequality (or subadditivity), which is essential to the standard minimax lower bound techniques. To overcome this challenge, we define an alternative loss function $L_\theta(\hat{G})$ as follows:

\[
L_\theta(\hat{G}) := P_\theta\left( \hat{G}(z) \neq G^*_\theta(z) \right). \tag{3.9}
\]

The loss function $L_\theta(\hat{G})$ is the probability that $\hat{G}$ produces a different label than $G^*_\theta$, and it satisfies the triangle inequality, as shown in Lemma 13. The connection between $R_\theta(\hat{G}) - R_\theta(G^*_\theta)$ and $L_\theta(\hat{G})$ is given by the following lemma, which shows that it suffices to provide a lower bound for $L_\theta(\hat{G})$ in order to prove Theorem 5.
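The triangle inequality for a disagreement probability of this kind can be illustrated on simulated labels. The sketch below is an assumption-laden toy example, not the construction in Lemma 13: the three labelling rules and the simulated points are hypothetical, and the probability is replaced by an empirical frequency. The inequality holds because, point by point, two rules that disagree with each other cannot both agree with a third rule.

```python
import numpy as np


def disagreement(labels_a, labels_b):
    """Empirical frequency with which two labelings differ."""
    return float(np.mean(np.asarray(labels_a) != np.asarray(labels_b)))


# Hypothetical data and rules; labels take values in {1, 2}.
rng = np.random.default_rng(1)
z = rng.normal(size=(5000, 3))
g_ref = np.where(z[:, 0] + 0.5 * z[:, 1] ** 2 > 0.5, 1, 2)  # stand-in for a reference rule
g_a = np.where(z[:, 0] > 0.0, 1, 2)                         # a linear rule
g_b = np.where(z[:, 0] + 0.3 * z[:, 2] > 0.2, 1, 2)         # another linear rule

lhs = disagreement(g_a, g_ref)
rhs = disagreement(g_a, g_b) + disagreement(g_b, g_ref)
print(lhs, rhs)
```

Here `lhs` never exceeds `rhs`, whichever rules are plugged in, which is the subadditivity that the excess risk itself lacks.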

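Lemma 1. Suppose $\theta \in \Theta_p(s_1, s_2)$. There exists a constant $c > 0$, not depending on $n$ or $p$, such that for any classification rule $G$ with $L_\theta(G) < c$,

\[
L_\theta^2(G) \lesssim P_\theta\left( G(z) \neq L(z) \right) - P_\theta\left( G^*_\theta(z) \neq L(z) \right).
\]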

Based on Lemma 1, we apply Fano's inequality to a carefully designed collection of least favorable multivariate normal distributions to complete the proofs of Theorems 2 and 5. The details are given in Section 3.7.

In document Minimax Optimality In High-Dimensional Classification, Clustering, And Privacy (Page 77-81)