On dual model-free variable selection with two groups of variables

(1)

This is an Open Access document downloaded from ORCA, Cardiff University's institutional

repository: http://orca.cf.ac.uk/112212/

This is the author’s version of a work that was submitted to / accepted for publication.

Citation for final published version:

Alothman, Ahmad, Dong, Yuexiao and Artemiou, Andreas 2018. On dual model-free variable

selection with two groups of variables. Journal of Multivariate Analysis 167 , pp. 366-377.

10.1016/j.jmva.2018.06.003 file

Publishers page: http://dx.doi.org/10.1016/j.jmva.2018.06.003

<http://dx.doi.org/10.1016/j.jmva.2018.06.003>

Please note:

Changes made as a result of publishing processes such as copy-editing, formatting and page

numbers may not be reflected in this version. For the definitive version of this publication, please

refer to the published source. You are advised to consult the publisher’s version if you wish to cite

this paper.

This version is being made available in accordance with publisher policies. See

http://orca.cf.ac.uk/policies.html for usage policies. Copyright and moral rights for publications

made available in ORCA are retained by the copyright holders.

(2)

On dual model-free variable selection with two groups of variables

Ahmad Alothmana_{, Yuexiao Dong}a,∗_{, Andreas Artemiou}b

a_{Department of Statistical Science, Temple University} b_{School of Mathematics, Cardi}_ff_University

Abstract

In the presence of two groups of variables, existing model-free variable selection methods only reduce the dimension-ality of the predictors. We extend the popular marginal coordinate hypotheses [3] in the suffi_{cient dimension reduction} literature and consider the dual marginal coordinate hypotheses, where the role of the predictor and the response is not important. Motivated by canonical correlation analysis (CCA), we propose a CCA-based test for the dual marginal coordinate hypotheses, and devise a joint backward selection algorithm for dual model-free variable selection. The performances of the proposed test and the variable selection procedure are evaluated through synthetic examples and a real data analysis.

Keywords: Canonical correlation analysis, Dual marginal coordinate hypotheses, Sliced inverse regression, Trace test

1. Introduction

In this paper, we consider dual model-free variable selection with two groups of variablesx_∈Rp_and_y

∈Rq_{. As}

a popular tool for multivariate analysis, classical variable selection aims at identifying important variables amongx for the prediction ofy. Most existing variable selection methods are model-based, and consider selecting important predictors under a given parametric or semi-parametric model. Variable selection methods in linear regression include LASSO [17], SCAD [5], the adaptive LASSO [22], and the Dantzig selector [1]. Variable selection in semi-parametric models have been studied in [7, 14, 18]. In multivariate association studies with two sets of random vectors, popular methods such as canonical correlation analysis (CCA) [6] focus on reducing the dimensionality for both sets of variables, where the role of the predictor and the response is not important. This viewpoint motivates us to consider dual variable selection, where the goal is to simultaneously identify the important variables amongxfor the prediction ofyand the important variables amongyfor the prediction ofx.

Unlike model-based procedures in the literature, our proposal is model-free and does not require assuming specific models betweenxandy. Existing model-free variable selection methods all focus on selecting important variables amongxfor the prediction ofy. See, for example, [9, 12, 13, 21]. The aforementioned model-free variable selection methods are closely related to suffi_{cient dimension reduction [2]. An important link between su}ffi_{cient dimension} re-duction and model-free variable selection is elucidated in [21], where popular suffi_{cient dimension reduction methods} such as sliced inverse regression (SIR) [11], sliced average variance estimation [4], and directional regression [10] are used to construct corresponding model-free variable selection procedures.

To achieve dual model-free variable selection, we demonstrate that CCA can be viewed as a valid suffi_cient dimension reduction procedure under suitable conditions. There is an important diff_{erence between CCA and popular} suffi_{cient dimension reduction methods such as SIR: CCA maintains the symmetry between}_x_and_y_{while SIR does} not. We follow Yu et al. [21] and develop CCA-based model-free variable selection procedures. Unlike the procedures proposed in Yu et al. [21] that select important variables amongx, the symmetry in CCA provides a unique opportunity to perform dual variable selection among bothxandysimultaneously.

∗_{Corresponding author.}

(3)

The rest of the paper is organized as follows. We review SIR-based trace test for variable selection in Section 2. The general framework for dual model-free variable selection is introduced in Section 3. CCA-based trace test for dual variable selection is developed in Section 4. Numerical studies are performed in Section 5 and we conclude the paper with some discussions in Section 6. All the proofs are relegated to the Appendix.

2. Review of SIR-based trace test

Letx =_(X₁_{, . . . ,}_X_p₎⊤_and_y=_(Y₁_{, . . . ,}_Y_q₎⊤_{. Without loss of generality, assume E(x)}=₀_{and E(y)}=_{0. Denote}

x₋k=(X1, . . . ,Xk−1,Xk+1, . . . ,Xp)⊤fork∈ {1, . . . ,p}. To test the importance of thekth predictorXk, we may consider

the following hypotheses

H0−k:y⊥⊥x|x−k vs. Ha−k:y̸⊥⊥x|x−k, (1)

where⊥⊥means independence and̸⊥⊥means no independence. The hypothesisH−k

0 : y⊥⊥x | x−kabove implies that Xk is not important for the prediction ofyin the presence of all the other predictors. Hypotheses (1) are known as

the marginal coordinate hypotheses [3]. Once we have a valid test for (1), sequential procedures such as forward selection, backward selection and stepwise regression can be designed in parallel to the classical procedures in linear regression. For example, Li et al. [13] consider backward selection through the marginal coordinate test proposed in [3], while forward selection and stepwise regression through the trace test are discussed in [21].

Since [3], diff_{erent tests for (1) have been proposed in the literature. Most tests have the same flavor as the original} marginal coordinate test in [3], such as [16, 19, 20]. Yu et al. [21] introduce a novel family of trace tests, which can be combined with various suffi_{cient dimension reduction methods. In the following, we first review SIR as a su}ffi_cient dimension reduction method, and then we revisit the SIR-based trace test for (1).

Classical suffi_{cient dimension reduction aims to find}_β_∈Rp_×d_{with the smallest possible column space such that}

y_⊥⊥x_|β⊤x. The corresponding column space is known as the central space for the regression ofyonx, and is denoted by_Sy_|x. LetΣx =var(x) and let{J1, . . . ,JH}denote a measurable partition ofΘy, the sample space ofy. Under the linearity condition that E(x_|β⊤x) is linear inβ⊤x, we know from [11] that

E(z|y_∈Jh)∈Σx1/2Sy_|x, (2)

wherez = _Σ−_x1/2_x _{is the standardized version of}_{x. Define} _{z-scale SIR kernel matrix as} _MSIR = ∑H

h=1πhE(z|y ∈

Jh)E⊤(z|y∈Jh), whereπh=Pr(y∈Jh). From (2), we know

Span(MSIR)⊆Σ1x/2Sy_|x, (3)

where Span denotes the column space. LetΣx−k =var(x−k) andz−k =Σ

−1/2

x₋k x−k. Similar toM

SIR_{, we define}_MSIR

−k =

∑H

h=1πhE(z−k|y∈Jh)E⊤(z−k|y∈ Jh).

Yu et al. [21] consider

δSIR_k =_tr(MSIR₎₋_tr(MSIR

−k ), (4)

where tr denotes the trace. Yu et al. [21] provide the asymptotic distribution of ˆδSIR

k underH− k

0 , where ˆδ SIR

k is the

sample version ofδSIR_k . BecauseδSIR_k =_{0 under}_H−k

0 in (1),H−

k

0 is rejected if ˆδ SIR

k is larger than a threshold determined

by its asymptotic distribution under null.

3. The principle of dual model-free variable selection

DenoteIx={1, . . . ,p}as the full index set forx. Define the active setAfor the regression ofyonxas

A=_{_k_{∈ I}_x_:_y_{depends on}_x_through_X_k_}_. ₍₅₎

(4)

Similarly, let_Iy={1, . . . ,q}denote the full index set fory, and the active setBfor the regression ofxonybe defined as

B=_{_j_{∈ I}_y_:_x_{depends on}_y_through_Y_j_}_. ₍₆₎ Letx_A=_{_X_k_:_k_{∈ A}}_and_y

B={Yj: j∈ B}. We have the following result.

Proposition 1. The following three conditions are equivalent, and all are implied from the definitions of_Ain(5)and

Bin(6).

Let∅_{denote the empty set. It follows from Proposition 1 that}_A = ∅_{if and only if}_B = ∅_{. We remark that} Proposition 1 is parallel to Proposition 1 in [8], where the dual central spaces for suffi_{cient dimension reduction are} studied.

The goal of dual model-free variable selection is to identify_Aand_Bwithout assuming specific models between xandy. Letx_F =_{_X_k _:_k_{∈ F }}_and_y

G ={Yj : j ∈ G}, whereF ⊆ Ixis the working active set forxandG ⊆ Iyis the working active set fory. Motivated from part (i) in Proposition 1 and the marginal coordinate hypotheses (1) in Section 2, we consider the following dual marginal coordinate hypotheses

H0F,[G] :y⊥⊥x|xF andy⊥⊥x|yG vs. HF

,[G]

a :y̸⊥⊥x|xF ory̸⊥⊥x|yG. (7)

If_H₀F,[G] in (7) is true, then obviously we have_{A ⊆ F} and_{B ⊆ G}. We can then recover_Aand_Bby looking for the combination of the smallest possible_F and the smallest possible_Gsuch that_H₀F,[G]is not rejected.

4. CCA-based trace tests and dual model-free variable selection

We have reviewed in Section 2 that SIR-based trace test can be used to test the marginal coordinate hypotheses (1). To test the dual marginal coordinate hypotheses (7), where the roles ofxandyare symmetric, we need a dimension reduction method that maintains the symmetry betweenxandy. In Section 4.1, we introduce CCA as a dual suffi_cient dimension reduction method. In Section 4.2, we study CCA-based trace tests for selecting variables among eitherx ory. In Section 4.3, CCA-based test for the dual marginal coordinate hypotheses (7) is developed. In Section 4.4, we propose a sample level algorithm for dual model-free variable selection.

4.1. CCA for dual sufficient dimension reduction

Recall thatz=_Σ−_x1/2_x_{is the standardized version of}_{x. Let}_w=_Σ−_y1/2_y_{be the standardized version of}_{y, where}

Σy=var(y). Define kernel matrices

M=_E(zw⊤_)E(wz⊤₎ _and _Me =_E(wz⊤_)E(zw⊤₎_. ₍₈₎ Givenx _∈Rp_and_y

∈Rq_{, the}_ℓ_{th pair of canonical covariates (u}

ℓ,vℓ) is defined asuℓ =aℓ⊤xandvℓ =b⊤ℓy, such

that var(uℓ) =var(vℓ) =1 and cov(uℓ,vℓ) is maximized. Forℓ >1,uℓ andvℓ satisfy the additional constraints that

cov(uℓ,uk)=0 and cov(vℓ,vk) =0 for allk< ℓ. It is well-known that theaℓ =Σ−x1/2cℓ, wherecℓis the eigenvector

corresponding to theℓth largest eigenvalue ofM. Similarly, bℓ = Σy−1/2dℓ, wheredℓ is theℓth eigenvector ofM.e

DenoteSy_|xandSx_|yas the dual central spaces for the regression ofyonxand the regression ofxony, respectively. The next result states that matricesMandMeare closely related to suffi_{cient dimension reduction.}

(5)

Proposition 2. SupposeE(x)=₀_and_E(y)=_{0. Assume}_β_{is the basis for}_S_y

|xandηis the basis forSx|y. (i) IfE(x_|β⊤x)is linear inβ⊤x, thenSpan(M)_⊆Σ1x/2Sy|x;

(ii) IfE(y|η⊤_y)_{is linear in}_η⊤_{y, then}_Span(_M)_e _⊆_Σ1/2 y Sx_|y.

The assumptions made in this proposition are common in the suffi_{cient dimension reduction literature.} Propo-sition 2 implies that the column space ofΣ−x1/2Mcan recover the central space for the regression ofy onx, while the column space ofΣ−y1/2Me can recover the central space for the regression ofxony. It follows thataℓ ∈ Sy_|x and bℓ ∈ Sx_|y. We remark that the conclusions in Proposition 2 bare close resemblance to (3) about the SIR-based kernel matrixMSIR.

4.2. CCA-based trace tests for marginal coordinate hypotheses

We consider two sets of marginal coordinate hypotheses in this section, both of which are related to (7). The first set is

H0F :y⊥⊥x|xF vs. HaF :y̸⊥⊥x|xF. (9)

Hypotheses (9) include (1) as a special case, asx_F becomesx₋kwhen we takeF ={1, . . . ,k−1,k+1, . . . ,p}. The

second set is

H0[G]:y⊥⊥x|yG vs. H [_G]

a :y̸⊥⊥x|yG. (10)

While hypotheses (9) can be used for selecting important variables amongx, hypotheses (10) are useful for selecting important variables amongy.

First we focus on the CCA-based trace test for (9). Let Σx_F = var(x_F) andz_F = Σ−xF1/2xF. Motivated by the

SIR-based trace test, we consider

δ_−F =_tr(M)₋_tr(M

F), (11)

whereM_F =_E(z

Fw⊤)E(wz⊤_F). We remark thatδ−F is constructed as the trace difference of twoz-scale CCA kernel

matrices, which has the same flavor asδSIR_k in (4). Let_Fc_{be the complement of}

F in_Ixand denoteΣxFc =var(x_Fc). DefineΣ_x

Fc_|xF =ΣxFc−E(x_Fcx⊤ F)Σ− 1 xFE(xFx⊤Fc) andγx_Fc_|x_F =xFc−E(x_Fcx⊤ F)Σ− 1 x_FxF. Then we have

Proposition 3. SupposeE(x_Fc|x_F)is a linear function ofx_F. Then

(i) δ_−F =_tr_{_Σ−1

x_Fc_|x_FE(γx_Fc_|x_Fy⊤)Σ−y1E(yγ⊤x_Fc_|xF)}.

(ii) δ_−F =₀_under_HF

0 :y⊥⊥x|xF.

The assumption made in this proposition is common in the model-free variable selection literature, and it is satisfied ifxis normal. The first part of Proposition 3 provides the explicit formula to calculateδ_−F. The second part states that ifx_Fc is unimportant for the prediction ofygivenx_F, thenδ_−F becomes zero. Yu et al. [21] have shown

thatδSIR

k =0 ifXkis unimportant for the prediction ofygivenx−k. Our result here is more general asF

c_{can contain}

more than one variable. Denote ˆδ_−F as the sample version ofδ_−F. We reject_H₀F if ˆδ_−F is too large. The asymptotic distribution of ˆδ_−F under_H₀F is provided in Corollary 1 in the Appendix.

Next we introduce the CCA-based trace test for (10). LetΣyG = var(yG) andwG = Σ

−1/2 y_G yG. DenoteMeG = E(w_Gz⊤_)E(zw⊤ G) and consider δ−G=_tr(_M)e ₋_tr(_Me G). (12) LetGc

be the complement ofGinIyand denoteΣy_Gc =var(y_Gc). DefineΣy_Gc_|yG =ΣyGc −E(y_Gcy⊤_G)Σ−

1

y_GE(yGy⊤Gc). Let

γy_Gc_|y_G =yGc−E(y_Gcy⊤

G)Σ−

1

yGyG. Parallel to Proposition 3, we have

(6)

Proposition 4. SupposeE(y_Gc|y_G)is a linear function ofy_G. Then

(i) δ−G=_tr_{_Σ−1

y_Gc_|y_GE(γy_Gc_|y_Gx⊤)Σ−x1E(xγ⊤yGc|yG)}.

(ii) δ−G=₀_under_H[G]

0 :y⊥⊥x|yG.

Let ˆδ−Gbe the sample version ofδ−G. We rejectH0[G]if ˆδ−Gis too large.

The asymptotic distribution of ˆδ−GunderH0[G]is provided in Corollary 2 in the Appendix. 4.3. CCA-based trace test for dual marginal coordinate hypotheses

In this section, we develop a test forH0F,[G] :y⊥⊥x|xF andy⊥⊥x|yGversus the alternativeH

F,[G]

a :y̸⊥⊥x|x_F

ory_̸⊥⊥ x _|y_G. From the definition ofM=_E(zw⊤_)E(wz⊤_{) and}_Me =_E(wz⊤_)E(zw⊤_{) in (8), we have tr(M)}=_tr(_M).e RecallM_F =_E(z

Fw⊤)E(wz⊤_F) and defineMG =E(zw⊤_G)E(wGz⊤). It is easy to see that tr(MG)=tr(MeG). Henceδ−G

in (12) becomes

δ−G=_tr(M)₋_tr(MG₎_. ₍₁₃₎

We have seen thatδ_−F =_tr(M)₋_tr(M

F) in (11) can be used to testH0F :y⊥⊥x |xF, andδ−Gin (13) can be used to test_H₀[G]:y_⊥⊥x_|y_G. This motivates us to consider

δ−G

−F =tr(M)−tr(MGF), (14)

whereMG

F =E(zFw⊤G)E(wGz⊤F). Note thatδ −G

−F in (14) includeδ−F andδ−Gas special cases. If we takeF =Ix, then

δ−G

−F becomesδ−G. On the other hand,δ−G−F reduces toδ−F when we setG=Iy.

The symmetry betweenzandwin the definition ofMandMeallows tr(M) to simultaneously capture the regression information betweeny andx as well as the regression information between x andy. This is a unique feature of the CCA-based trace test, as tr(MSIR_{) in (4) only captures the regression information between}_y_and_{x. Parallel to} Proposition 3 and Proposition 4, we have

Proposition 5. SupposeE(x_Fc|x_F)is a linear function ofx_F andE(y_Gc|y_G)is a linear function ofy_G. Then

(i) δ−G −F =tr{Σ− 1 xFc|xFE(γxFc_|xFy⊤)Σ −1 y E(yγ⊤x_Fc_|x_F)}+tr{Σ− 1 yGc|yGE(γyGc_|yGx⊤F)Σ− 1 xFE(xFγ⊤y_Gc_|y_G)}. (ii) δ−G −F =0underH F,[_G] 0 :y⊥⊥x|xF andy⊥⊥x|yG.

Let{(x(1),y(1)), . . . ,(x(n),y(n))}be an iid sample. Let ¯x=_n−1∑n

i=1x (i) , ˜x(i) =_x(i)₋_{x, and ˆ}_¯ _Σ_x=_n−1∑n i=1x˜ (i) (˜x(i))⊤. Similarly let ¯y =_n−1∑n i=1y (i) , ˜y(i) =_y(i)₋_{y, ˆ}_¯ _Σ_y =_n−1∑n i=1y˜ (i) (˜y(i))⊤, En(xy⊤)=n−1∑ni=1x˜ (i) (˜y(i))⊤, and En(yx⊤) = n−1∑n_i=1y˜ (i) (˜x(i))⊤. Then ˆM=_Σˆ−1/2 x En(xy⊤) ˆΣ− 1 y En(yx⊤) ˆΣ− 1/2

x . Similarly one can calculate ˆ MG F =Σˆ −1/2 x_F En(xFy⊤G) ˆΣ −1 y_GEn(yGx⊤_F) ˆΣ− 1/2 x_F . Then the sample version ofδ−G

−F in (14) becomes

ˆ

δ−G

−F =tr( ˆM)−tr( ˆMGF).

We conclude this section with the asymptotic distribution of ˆδ−G

−F underHF

,[_G]

0 . Assume|F |=p1and|G|=q1, where

| · |denotes cardinality.

Theorem 1. SupposeE(x)=0,E(y)=0,E(x_Fc|x_F)is a linear function ofx_F andE(y_Gc|y_G)is a linear function of

y_G. Then under_H₀F,[G], nδˆ−G −F L ∑ ℓ=1 τℓχ2ℓ(1)

(7)

as n _{→ ∞}, where means convergence in distribution, L = _pq₋_p₁_q₁_,_χ2

ℓ(1)is independent chi-square with one

degree of freedom for allℓ_{∈ {}1, . . . ,L_}andτ1≥ · · · ≥τLare the eigenvalues ofΩ, and the exact form ofΩis provided

in the Appendix.

The asymptotic distribution in Theorem 1 needs to be estimated in practice. Specifically, let ˆΩbe the sample estimators ofΩ, and let ˆτ1 ≥ · · · ≥ τˆLbe the eigenvalues of ˆΩ. Denoteζ = (ˆτ1, . . . ,τˆL)⊤ ∈ RLand letΞ ∈ RN×L

consist of i.i.d. χ2_{(1) realizations. Then the}_N_{elements of}

Ξζbecome realizations of the approximate asymptotic distribution ofnδˆ−G

−F under null. The proportion of theseN elements greater thannδˆ−G−F become the approximate

p-value. For a given significance levelα, we reject_H₀F,[G] if this p-value is smaller thanα. We useN = _{500 in our} numerical studies.

4.4. Algorithm for dual model-free variable selection Let_{(x(1)_,_y(1)₎_{, . . . ,}_(x(n)_,_y(n)₎

}be an iid sample of_{x_∈Rp_,_y ∈Rq

}. We devise a sample-level algorithm for dual model-free variable selection in this section. From the development in Section 3, we have seen that the active setsA

andBcan be recovered by the smallest possibleF and the smallest possibleGsuch thatH0F,[G]is not rejected. This motivates us to consider the following joint backward selection procedure.

1. Initial step. SetF(0)₌

{1, . . . ,p_}andG(0)₌

{1, . . . ,q_}. Letαbe the pre-specified significance level.

1.1 For eachi _{∈ {}1, . . . ,p_}, denote_F₋(0)_i as the index set whereiis removed from_F(0)_{, and let}_ϱ

i,(0) be the approximatep-value from testingHF

(0)

−i,[G(0)]

0 against its alternative.

1.2 For each j_{∈ {}1, . . . ,q_}, denoteG(0)₋j as the index set where jis removed fromG

(0)_{, and let}_ϱ

p+j,(0)be the approximatep-value from testingHF

(0)_,_[_G(0)

−j]

0 against its alternative. 1.3 Let k(0) = arg max

ι_∈{1,...,p+q_}

ϱι,(0) and ϱ(0) = max

ι_∈{1,...,p+q_}ϱι,(0). If ϱ(0) ≥ α andk(0) ≤ p, then set F

(1) ₌

F₋(0)k(0), G(1)₌

G(0)_{, and go to Step 2. If}_ϱ

(0)≥αandk(0)>p, then setF(1)=F(0),G(1)=G (0)

−{k(0)−p}, and go to Step 2. Ifϱ(0)< α, then stop the algorithm and return ˆA=F(0), ˆB=G(0).

2. Iteration step. At the beginning of the ℓth iteration, let _F(ℓ) _and

G(ℓ) _{be the working index sets. Assume}

|F(ℓ)

|=_p(_ℓ₎_and_|G(ℓ)

|=_q(_ℓ_).

2.1 For eachi_{∈ {}1, . . . ,p(ℓ)}, denoteF (ℓ)

−i as the index set where theith element ofF

(ℓ)_{is removed from}

F(ℓ)_, and letϱi,(ℓ)be the approximatep-value from testingHF

(ℓ)

−i,[G (ℓ)_]

0 against its alternative. 2.2 For each j_{∈ {}1, . . . ,q(ℓ)}, denoteG

(ℓ)

−jas the index set where the jth element ofG

(ℓ)_{is removed from}

G(ℓ)_, and letϱp(ℓ)+j,(ℓ)be the approximatep-value from testingH

F(ℓ)_,_[

G(ℓ)−j]

0 against its alternative. 2.3 Letk(ℓ) = arg max

ι∈{1,...,p(ℓ)+q(ℓ)}

ϱι,(ℓ)andϱ(ℓ) = max

ι∈{1,...,p(ℓ)+q(ℓ)}

ϱι,(ℓ). Ifϱ(ℓ) ≥αandk(ℓ) ≤p(ℓ), then setF(ℓ +1)₌

F₋(ℓk)(ℓ), G(ℓ+1)₌

G(ℓ)

, and repeat Step 2. Ifϱ(ℓ) ≥αandk(ℓ) >p(ℓ), then setF(ℓ +1)₌

F(ℓ)

,G(ℓ+1)₌

G(_−{ℓ)k(ℓ)−p(ℓ)}, and repeat Step 2. Ifϱ(ℓ)< α, then stop the iteration and return ˆA=F(ℓ), ˆB=G(ℓ).

In the initial step, we first testy_⊥⊥x_|x₋iagainst its alternative for eachi∈ {1, . . . ,p}. Then we testy⊥⊥x|y₋jfor

each j_{∈ {}1, . . . ,q_}. The correspondingp-values are denoted asϱι,(0)forι∈ {1, . . . ,p+q}. The maximump-valueϱ(0)

is then compared toα. Ifϱ(0)is smaller thanα, thenϱι,(0) < αfor anyι∈ {1, . . . ,p+q}. Thus we rejecty⊥⊥x |x₋i

for anyi_{∈ {}1, . . . ,p_}, and we rejecty_⊥⊥x _|y₋jfor any j∈ {1, . . . ,q}. Hence we estimate the active sets byF(0) and G(0)

. In the case withϱ(0)≥α, the least significant element, which is indexed byk(0), can be removed from the active sets. Fork(0)_≤p, we updateF by removing the least significant element, which corresponds to an element inx. For k(0)>p, the least significant element corresponds to an element inyand we only updateG.

In theℓth iteration, we start from working index setsF(ℓ)_and_G(ℓ)_{. Note that}_HF(ℓ)_,_[

G(ℓ)_]

0 is not rejected from the last iteration, as we only go to theℓth iteration ifϱ(ℓ₋1) ≥α. In another word, theℓth iteration is needed only when

(8)

Table 1: Frequencies of rejectingH0F,[G]based on 1000 repetitions. Model 1 Model 2 α=₀.01 α=₀.05 α=₀.1 α=₀.01 α=₀.05 α=₀.1 F =_{₃,4},G=_{₁,2} 1 1 1 0.007 0.033 0.085 F =_{₁,2},G=_{₃,4} 0.007 0.044 0.093 1 1 1 F(ℓ₋1) orG(ℓ₋1)

is updated. Parallel to the initial step, after removing one element at a time from eitherF(ℓ) orG(ℓ)

, we test the dual marginal coordinate hypotheses (7) and get p-valueϱι,(ℓ) forι ∈ {1, . . . ,p(ℓ) +q(ℓ)}. The maximum p-valueϱ(ℓ)is then compared toα. Ifϱ(ℓ)< α, thenF(ℓ)andG(ℓ)can not be further reduced. We stop the iteration and estimate the active sets by_F(ℓ) _and

G(ℓ)_{. Otherwise we go to the next iteration, where either}

F(ℓ)_or

G(ℓ)_{is updated} by removing the least significant element. Note that in each iteration, we are testing the conditional independence betweenxandy, and our procedure asymptotically controls the type-I error rate at the significance levelα.

5. Numerical results

We use synthetic data in Section 5.1, and a real data analysis is considered in Section 5.2. 5.1. Simulation studies

Letx=_(X1_{, . . . ,}_X_p₎⊤_{be multivariate normal with mean}₀_{and covariance matrix}_Σ_x=₍_σ_{i j}_{), where}_σ_{i j} =_σ|i−j|_for

1≤i,j_≤p. Similarly, letϵ=₍_ϵ₁_{, ϵ}₂_{, . . . , ϵ}_q₎⊤_{be multivariate normal with mean}₀_{and covariance matrix}_Σ_ϵ =₍_θ_{i j}_), whereθi j=θ|i−j|for alli,j∈ {1, . . . ,q}. The errorϵis independent ofx. Then we generatey=(Y1, . . . ,Yq)⊤from the

following two models:

Model 1:Y1=_ϵ₁_,_Y2=_ϵ₂_,_Y3=_ϵ₃_,_Y4=_X1+_X2+_ϵ_4;

Model 2:Y1=e0.5X3+sin(X4)+ϵ1,Y2=X33+X4+ϵ2,Y3=ϵ3,Y4=ϵ4.

In both models, we setp=_{5 and}_q=_{4. In Model 1, we set}_σ=_{0 and}_θ=₀_._{5. The active set for the regression of}_y onxis_A=_{₁_,₂_}_{. Due to the nonzero correlation among the}_ϵ_{’s, we cannot determine}_B_{by evaluating the forward} regression betweenyandx. Instead we calculate E(x_| y)=₍₀_,₀_,₀_,_4Y4_/₁₁₋_2Y3_/₁₁_,_4Y4_/₁₁₋_2Y3_/₁₁₎⊤_{, and thus} the active set for the regression ofxonyis_B=_{₃_,₄_}_{. More details are provided in the Appendix. In Model 2, we set}

σ=₀_._{5 and}_θ=_{0. We have}_A=_{₃_,₄_}_and_B=_{₁_,₂_}_{as the result of}_θ=_0.

First we evaluate the performance of the CCA-based trace test for the dual marginal coordinate hypotheses (7). For user-specifiedF andG, we testH0F,[G] : y⊥⊥x | xF andy⊥⊥x | yG versus the alternativeHF

,[_G]

a : y ̸⊥⊥ x | xF

ory _̸⊥⊥ x _| y_G. Based on 1000 repetitions, the frequencies ofH0F,[G] being rejected are reported in Table 1. Fix sample sizen = _{300 and take}_α _{∈ {}₀_.₀₁_,₀_.₀₅_,₀_.₁_}_{. We consider two combinations of}_F _and_G_{. When}_F =_{₃_,₄_} andG=_{₁_,₂_}_,_HF,[G]

0 does not hold for Model 1. We see from Table 1 thatHF

,[_G]

0 is always rejected regardless of theαvalue. WhenF =_{₁_,₂_}_and_G = _{₃_,₄_}_,_y_⊥⊥_x _| _x

F andy⊥⊥x | yGfor Model 1. We see that the frequencies

of rejectingH0F,[G] are close to the corresponding nominal level. The results for Model 2 are reversed. The first combination of_F and_Gnow corresponds to the Type-I error, and the frequencies are close to the nominal level. The second combination of_F and_Gcorresponds to the estimated power, which is 1 for allαvalues.

Next we investigate the performance of the joint backward selection algorithm proposed in Section 4.4. For each

i_{∈ {}1, . . . ,1000_}, denote the estimated active sets of theith repetition as ˆ_Aiand ˆBi. We define the over-fitted frequency

(OF), the correctly-fitted frequency (CF), and the under-fitted frequency (UF) as

OF=₁₀₀₀−1 1000 ∑ i=1 { 1(A ⊆Aî)1(B ⊆Bî)−1(A=Aî)1(B=Bî) } , CF=₁₀₀₀−1 1000_∑ i=1 1(A=_Aˆ_i₎₁₍_B=_Bˆ_i₎_,

andU F = ₁₋_CF₋_{OF, where} ₁ _{denotes the indicator function. The average model size is defined as} _MS = 1000−1∑1000

i=1 (|Aˆi|+|Bˆi|). Based on 1000 repetitions, we report UF, CF, OF, MS together with the frequencies of each

(9)

Table 2: Variable selection results for Model 1 based on 1000 repetitions.

n X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 UF CF OF MS

100 1 1 0.01 0.01 0.02 0.02 0.04 0.15 1 0.85 0.13 0.02 3.25 300 1 1 0.01 0.01 0.01 0.01 0.02 0.9 1 0.1 0.86 0.04 3.96 700 1 1 0.02 0.02 0.02 0.02 0.03 1 1 0 0.94 0.06 4.1

Table 3: Variable selection results for Model 2 based on 1000 repetitions.

α X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 UF CF OF MS 0.01 0 0 1 0.99 0 0.99 0.99 0 0 0.03 0.96 0.01 3.97 0.05 0.01 0.01 1 1 0.02 1 1 0.02 0.02 0 0.95 0.05 4.08 0.1 0.02 0.03 1 1 0.03 1 1 0.05 0.03 0 0.9 0.1 4.17

The variable selection results for Model 1 is summarized in Table 2. We fixα=₀_._{05 and take}_n_{∈ {}₁₀₀_,₃₀₀_,₇₀₀_}_. We see that the variable selection performance improves asnincreases. Forn=_{300 and}_n =_{700, the unimportant} variablesX3,X4,X5,Y1andY2are selected with very low frequencies; the important variablesX1,X2,Y3andY4are selected with frequency 1 or frequency close to 1; and the average model size is close to 4. We also note that the frequency of correctly-fitted model becomes close to 1₋αwithn=_700.

Table 3 summarizes the variable selection results for Model 2. We fixn=_{300 and consider}_α_{∈ {}₀_.₀₁_,₀_.₀₅_,₀_.₁_}_. Our backward algorithm works well at all nominal levels. The important variablesX3,X4,Y1andY2are selected with high frequencies, the unimportant variablesX1,X2,X5,Y3andY4are selected with low frequencies, and the average model size is close to 4. As expected,α=₀_._{01 leads to smaller models and}_α=₀_._{1 tend to result in larger models.} Again, the frequency of correctly-fitted model is close to 1₋α.

5.2. Real data analysis

Beta-carotene and retinol are well studied chemical compounds in the human plasma. Several studies suggest that low levels of both compounds in plasma are associated with increased risk of an array of diseases such as cancer, cardiovascular disease, and cataracts. To determine the role of dietary habits and other health related metrics in plasma concentrations of beta-carotene and retinol, [15] did a cross-sectional study with 12 personal characteristics and dietary metrics for 315 patients with nonmelanoma skin cancer. After removing three categorical variables, we considerx=_(X1_{, . . . ,}_X9)⊤_{. The response variables are}_y =_(Y1_,_Y2)⊤_{, where}_Y1_{is the plasma concentration of} beta-carotene andY2 is the plasma concentration of retinol. After exploratory data analysis, we remove six observations with extreme values and getn=_{309. We apply our proposed dual variable selection procedure from Section 4.4 with} significance levelα=₀_._{05, and end up with ˆ}_A=_{₁_,₂_,₆_,₈_}_{and ˆ}_B=_{₁_,₂_}_{. This suggests that to further study the} multivariate associations between dietary habits and the plasma compound concentrations, we can focus only on six variablesX1,X2,X6,X8,Y1andY2instead of the originalxandy.

To demonstrate the eff_{ect of variable selection on canonical correlation analysis, we first calculate the first two} pairs of canonical covariates (u1,v1) and (u2,v2) based on the original data, wherex _∈ R9 _andy _∈ R2_{. Then we} calculate the first two pairs of canonical covariates ( ˜u1,v1) and ( ˜˜ u2,v2) based on the reduced data, where˜ x_Aˆ ∈ R4 andx_Bˆ ∈R2. The plots of the canonical covariate from the original data versus the corresponding canonical covariate from the reduced data are provided in Figure 1. The scatterplots are close to the dotted 45 degree line, suggesting that the canonical covariates before and after the data reduction largely agree with each other. This implies that the reduced data keeps the canonical information from the original data.

6. Concluding remarks

In this paper, we propose the CCA-based trace test for the dual marginal coordinate hypotheses and study the asymptotic properties of the resulting test statistic. The validity of the asymptotic test is justified through simulation studies. Based on this novel test, we design a joint backward selection algorithm for dual model-free variable selection.

(10)

−4 −2 0 2 4 −4 −2 0 2 4 u ~

1 from reduced data

u1

from orginal data

−6 −4 −2 0 2 4 −6 −4 −2 0 2 4 v ~

1 from reduced data

v1

from orginal data

−2 0 2 4 −2 0 2 4 u ~

2 from reduced data

u2

from orginal data

−6 −4 −2 0 2 −6 −4 −2 0 2 v ~

2 from reduced data

v2

from orginal data

Figure 1. Scatterplots of the canonical covariates from the original data versus the canonical covariate from the reduced data.

The finite-sample performance of the proposed test and the variable selection algorithm are very promising. The dual variable selection and feature screening in the case of divergingpandqis worth future investigation.

Appendix

Proof of Proposition 1. The proof is similar to Proposition 1 in Iaci et al. [8], and is thus omitted. ✷

Proof of Proposition 2. For part (i), note that Span(M)=_Span_{_E(zw⊤₎_}_{. Plug in}_z=_Σ_x−1/2_x_and_w=_Σ−_y1/2_{y, and all} we need to prove becomes

Span_{Σ−x1E(xy⊤)}=Span{Σ− 1/2

x E(zw⊤)} ⊆ Sy_|x=Span(β). (A.1)

From the law of iterated expectations and the fact thaty_⊥⊥x_|β⊤x, we have

E(xy⊤)=_E_{_xE⊤_(y_|_x)_}=_E_{_xE⊤_(y_|_β⊤_x)_}_. _(A.2) From the property of conditional expectation and the assumption that E(x|β⊤x) is linear inβ⊤x, we have

(11)

(A.2) and (A.3) together lead to

Σ−x1E(xy⊤)=β(β⊤Σxβ)−1β⊤E(xy⊤). (A.4) (A.1) follows from (A.4) and proof of part (i) is done. Proof of part (ii) is similar to the proof of part (i), and is thus

omitted. ✷

Proof of Proposition 3. For part (i), assumex_F _∈Rp1_and_x

Fc ∈Rp2withp1+p2 =p. Letx=(x⊤ F,x⊤Fc)⊤. DefineC andDas C= ( Ip1 0 −E(x_Fcx⊤ F)Σ− 1 xF Ip2 ) and D= ( ΣxF 0 0 _ΣxFc_|xF ) . ThenCx=_(x⊤ F,γ⊤xFc_|xF)

⊤_,_C_Σ_x_C⊤=_D_and_Σ−_x1=_C⊤_D−1_{C. It follows that} tr(M)=_tr{_Σ−1 x E(xy⊤)Σ− 1 y E(yx⊤) } =_tr{_C⊤_D−1_CE(xy⊤₎_Σ−1 y E(yx⊤) } =_tr {( Σ−xF1 0 0 _Σ−_x1 Fc_|x_F ) ( E(x_Fy⊤) E(γx_Fc_|x_Fy⊤) ) Σ−y1 ( E(yx⊤_F),E(yγ⊤_x Fc_|xF) )} =_tr       Σ −1

x_FE(xFy⊤)Σ−y1E(yx⊤_F) Σx−_F1E(xFy⊤)Σy−1E(yγ⊤xFc_|xF)

Σ−xF1c_|xFE(γxFc_|xFy ⊤₎_Σ−1 y E(yx⊤_F) Σ− 1 xFc_|xFE(γxFc_|xFy ⊤₎_Σ−1 y E(yγ⊤x_Fc_|x_F)       =_tr_{_Σ−_x1 FE(xFy ⊤₎_Σ−1 y E(yx⊤_F)}+tr{Σ− 1 x_Fc_|xFE(γxFc_|x_Fy⊤)Σ− 1 y E(yγ⊤x_Fc_|xF)}. Together with tr(M_F)=_tr_{_Σ−_x1 FE(xFy ⊤₎_Σ−1 y E(yx⊤_F)}, we get δ_−F =_tr(M)₋_tr(M F)=tr{Σ−xF1c_|xFE(γxFc_|xFy ⊤₎_}_Σ−1 y E(yγ⊤xFc_|xF)}. (A.5)

For part (ii), the assumption that E(x_Fc|x_F) is a linear function ofx_F implies E(x_Fc|x_F)=E(x_Fcx⊤

F)Σ− 1 xFxF. It follows that E{E(x_Fcx⊤ F)Σ− 1 xFxFy ⊤_}=_E_{_E(x Fc|x_F)y⊤}=E{x_FcE⊤(y|x_F)}. (A.6)

UnderH0F :y⊥⊥x|xF, we have E(y|x)=E(y|xF). Thus

E_{x_FcE⊤(y|x_F)}=E{x_FcE⊤(y|x)}=E(x_Fcy⊤). (A.7)

The definitionγxFc|xF =xFc −E(xFcx⊤F)Σ−

1

xFxF together with (A.6) and (A.7) leads to E(γxFc|xFy⊤)= 0. It follows

from part (i) thatδ_−F =_{0 under}_HF

0 . ✷

Proof of Proposition 4. The proof is similar to the proof of Proposition 3, and is thus omitted. ✷

Proof of Proposition 5. For part (i), assumey_G _∈Rq1 _and_y

Gc ∈Rq2withq1+q2=q. Lety=(y⊤ G,y⊤Gc)⊤. DefineK andOas K= ( Iq1 0 −E(y_Gcy⊤ G)Σ− 1 yG Iq2 ) andO= ( Σy_G 0 0 _Σy_Gc_|y_G ) . 10

(12)

ThenKy=_(y⊤ G,γ⊤y_Gc_|y_G)⊤,KΣyK⊤=OandΣ−y1=K⊤O−1K. Thus tr(M_F)=_tr{_Σ−_x1 FE(xFy ⊤₎_Σ−1 y E(yx⊤_F) } =_tr{_Σ−_x1 FE(xy ⊤₎_K⊤_O−1_K E(yx⊤)} =_tr { Σ−x_F1 ( E(x_Fy⊤_G),E(x_Fγ⊤_y Gc_|yG) ) (Σ−1 y_G 0 0 _Σ−_y1 Gc_|yG ) ( E(y_Gx⊤ F) E(γy_Gc_|y_Gx⊤_F) )} =_tr_{_Σ−_x1 FE(xFy ⊤ G)Σ− 1 y_GE(yGx⊤F)}+tr{Σ− 1 x_FE(xFγ⊤y_Gc_|yG)Σ −1 y_Gc_|yGE(γyGc_|y_Gx⊤_F)} =_tr(MG F)+tr{Σ− 1 y_Gc_|y_GE(γyGc_|yGx ⊤ F)Σ− 1 xFE(xFγ ⊤ y_Gc_|y_G)}.

Together with (A.5) from the proof of Proposition 3, we get the desired result in part (i). For part (ii), we have seen thaty_⊥⊥x _|x_F leads to E(γ_x

Fc_|xFy

⊤₎ =₀_{from the proof of Proposition 3. Following} similar steps, we can show thaty_⊥⊥x _| y_G leads to E(γy_Gc_|yGx⊤F) = 0. It follows from part (i) thatδ−G−F = 0 under

H0F,[G]:y⊥⊥x|xF andy⊥⊥x|yG. ✷

We need Lemma 1 and Lemma 2 before we prove Theorem 1. Let ¯x_F = _n−1∑n

i=1x (i) F, ˜x (i) F = x (i) F −x¯F, and ˆ ΣxF = n− 1∑n i=1x˜ (i) F(˜x (i) F)⊤. Similarly we define ˜y (i) _{and ˜}_x(i) Fc. Let En(xFx⊤_Fc) = n− 1∑n i=1x˜ (i) F(˜x (i) Fc)⊤, ˆγ (i) xFc_|xF = x_˜(i) Fc − En(x_Fcx⊤ F) ˆΣ −1 x_Fx˜ (i) F, and En(yγˆ⊤x_Fc_|x_F)=n− 1∑n i=1y˜ (i) ( ˆγ(_xi) Fc_|xF) ⊤_{. Define} Π(i) ={_y_˜(i)₋_E(yx⊤ F)Σ− 1 x_Fx˜ (i) F } { (˜x(i) Fc)⊤−(˜x (i) F)⊤Σ− 1 x_FE(xFx⊤_Fc) }

and we have the following result.

Lemma 1. SupposeE(x)=_0,_E(y)=_{0, and}_E(x

Fc|x_F)is a linear function ofx_F. Ify⊥⊥x|x_F, then

En(yγˆ⊤x_Fc_|x_F)= 1 n n ∑ i=1 Π(i)+_O_p_(n−1₎_, where the first term on the right-hand side is of order Op(n−1/2).

Proof of Lemma 1. From the definition of ˆγx_Fc_|x_F andγx_Fc_|x_F, we have En(yγˆ⊤x_Fc_|x_F)−E(yγ⊤x_Fc_|x_F)={En(yx⊤_Fc)−E(yx⊤_Fc)} −[En{yx⊤_FΣˆ−

1 xFEn(xFx ⊤ Fc)} −E{yx⊤_FΣ− 1 xFE(xFx ⊤ Fc)}]. (A.8)

Because E(x)=₀_{and E(y)}=_{0, it can be shown that}

En(yx⊤_Fc)−E(yx⊤_Fc)= 1 n n ∑ i=1 {y˜(i)(˜x(i) Fc)⊤−E(yx⊤_Fc)}+Op(n−1). (A.9)

The asymptotic expansions of ˆΣ−xF1and En(xFx

⊤ Fc) are, respectively, ˆ Σ−xF1−Σ −1 xF =−Σ −1 xF   1_n n ∑ i=1 { ˜ x(i) F(˜x (i) F)⊤−ΣxF }_ Σ−xF1+Op(n −1 ) and (A.10) En(xFx⊤_Fc)−E(x_Fx⊤_Fc)= 1 n n ∑ i=1 { ˜ x(i) F(˜x (i) Fc)⊤−E(xFx⊤_Fc) } +_O_p_(n−1₎_. _(A.11)

(13)

Eqs. (A.10) and (A.11) together lead to ˆ Σ−x_F1En(xFx⊤_Fc)−Σ− 1 x_FE(xFx⊤_Fc)= 1 n n ∑ i=1 Σ−x_F1x˜ (i) F { (˜x(i) Fc)⊤−(˜x (i) F)⊤Σ− 1 x_FE(xFx⊤_Fc) } +_O_p_(n−1₎_. _(A.12) It follows from (A.12) that

En{yx⊤_FΣˆ− 1 x_FEn(x_Fx⊤_Fc)} −E{yx⊤_FΣ−x_F1E(xFx⊤Fc)}= 1 n n ∑ i=1 E(yx⊤_F)Σ−x_F1x˜ (i) F { (˜x(i) Fc)⊤−(˜x (i) F)⊤Σ− 1 x_FE(xFx⊤Fc) } +1 n n ∑ i=1 {y˜(i)(˜x(i)

F)⊤−E(yx⊤Fc)}Σ−x_F1E(xFx⊤Fc)+Op(n−1). (A.13)

Eqs. (A.8), (A.9) and (A.13) together lead to

En(yγˆ⊤x_Fc_|xF)=En(yx ⊤ Fc)−En{yx⊤_FΣˆ− 1 x_FEn(x_Fx⊤_Fc)}= 1 n n ∑ i=1 [ ˜ y(i)(˜x(i) Fc)⊤−y˜ (i) (˜x(i) F)⊤Σ− 1 x_FE(xFx⊤Fc) −E(yx⊤_F)Σ−xF1x˜ (i) F { (˜x(i) Fc)⊤−(˜x (i) F)⊤Σ− 1 xFE(xFx ⊤ Fc) }] +_O_p_(n−1₎ =1 n n ∑ i=1 { ˜ y(i)₋E(yx⊤_F)Σ−x_F1x˜ (i) F } { (˜x(i) Fc)⊤−(˜x (i) F)⊤Σ− 1 x_FE(xFx⊤_Fc) } +_O_p_(n−1₎_, _(A.14)

which is the desired result. ✷

Similarly, let En(y_Gy⊤_Gc)=n− 1∑n i=1y˜ (i) G(˜y (i) Gc)⊤, ˆγ (i) yGc_|yG =_y_˜(i) Gc−En(y_Gcy⊤ G) ˆΣ −1 y_Gy˜ (i) G, and En(xFγˆ⊤y_Gc_|y_G)=n− 1∑n i=1x˜ (i) F( ˆγ (i) yGc_|yG) ⊤_. Define Λ(i)={_x_˜(i) F −E(xFy⊤G)Σ− 1 y_Gy˜ (i) G } { (˜y(i) Gc)⊤−(˜y (i) G)⊤Σ− 1 y_GE(yGy⊤Gc) } and we have

Lemma 2. SupposeE(x)=_0,_E(y)=_{0, and}_E(y

Gc|y_G)is a linear function ofy_G. Ify⊥⊥x|y_G, then

En(xFγˆ⊤y_Gc_|yG)= 1 n n ∑ i=1 Λ(i)+_O_p_(n−1₎_, where the first term on the right-hand side is of order Op(n−1/2).

Proof of Lemma 2. The proof is similar to the proof of Lemma 1, and is thus omitted. ✷

Proof of Theorem 1. Recall that _{|F |} = _p_1, _|Fc

| = _p_2, _|G| = _q_1, _|Gc | = _q_2, _p₁ +_p₂ = _p _and _q₁ +_q₂ = _q. Letϕ1 = vec{Σ− 1/2 y E(yγ⊤x_Fc_|x_F)Σ− 1/2 xFc_|xF} ∈ Rqp2_, _ϕ 2 = vec{Σ− 1/2 x_F E(xFγ⊤y_Gc_|y_G)Σ− 1/2 yGc_|yG} ∈ Rp1q2_{, and} _ψ = _(ϕ⊤ 1,ϕ⊤2)⊤, where vec denotes vectorization. Then we have δ−G

−F = ψ⊤ψ. At the sample level, let ˆψ = ( ˆϕ ⊤ 1,ϕˆ ⊤ 2)⊤, where ˆ ϕ₁=_vec_{_Σˆ−1/2 y En(yγˆ⊤_x Fc_|xF) ˆΣ −1/2 xFc|xF}and ˆϕ2=vec{Σˆ −1/2 xF En(xFγˆ ⊤ yGc_|yG) ˆΣ −1/2 yGc|yG}. Then we have ˆ δ−G −F =ψˆ ⊤_ψ_ˆ_. (A.15) Under_H₀F,[G], we havey_⊥⊥x_|x_F. It follows that E(yγ⊤

x_Fc_|x_F)=0andϕ1=0. Together with Lemma 1, we have ˆ ϕ1= 1 n n ∑ i=1 vec_{Σ−y1/2Π(i)Σ− 1/2 xFc_|xF} +_O_p_(n−1₎_, _(A.16) 12

(14)

where the first term on the right-hand side is of orderOp(n−1/2). Similarly, we havey⊥⊥x|yGunderH0F,[G]. It follows that E(x_Fγ⊤

y_Gc_|y_G)=0andϕ2 =0. Together with Lemma 2, we have ˆ ϕ2= 1 n n ∑ i=1 vec_{Σ−x_F1/2Λ(i)Σ− 1/2 y_Gc_|y_G}+Op(n−1), (A.17)

where the first term on the right-hand side is of orderOp(n−1/2). It follows from (A.16) and (A.17) that

ˆ ψ= 1 n n ∑ i=1 ϑ(i)+_O_p_(n−1₎_, _(A.18) where ϑ(i) = { vec⊤(Σ−y1/2Π(i)Σ− 1/2 x_Fc_|x_F),vec⊤(Σ− 1/2 x_F Λ(i)Σ− 1/2 y_Gc_|y_G) }⊤ ∈RL with E(ϑ(i))=₀_and_L=_qp2+_p1q2=_pq₋_{p1q1. As a result of (A.18), we have}

√

nψˆ N(0,_Ω) (A.19)

asn_{→ ∞}, whereΩ=_E_{_ϑ(i)_(ϑ(i)₎⊤_}_{. Eqs. (A.19) and (A.15) lead to the desired result.} _✷ As a special case,δ−G

−F reduces toδ−F when we setG=Iy. Thenδ−F =ϕ1⊤ϕ1and ˆδ−F =ϕˆ⊤1ϕˆ1. It follows from (A.16) that ˆϕ1 =n−1 ∑n i=1ϑ (i) 1 +Op(n−1), whereϑ (i) 1 =vec{Σ− 1/2 y Π(i)Σ− 1/2 xFc_|xF} ∈ Rp2q_{. Thus} √_n_ϕˆ 1 N(0,Γ), where Γ=_E_{_ϑ(i) 1(ϑ (i)

1 )⊤}. The asymptotic distribution of ˆδ−F is summarized in the next result. Corollary 1. SupposeE(x)=_0,_E(y)=_{0, and}_E(x

Fc|x_F)is a linear function ofx_F. Then underH₀F :y⊥⊥x|x_F,

nδˆ_−F

p2q

∑

ℓ=1

ρℓχ2ℓ(1)

as n_{→ ∞}, whereχ2_ℓ(1)is independent chi-square with one degree of freedom forℓ_{∈ {}1, . . . ,p2q_}, andρ1≥ · · · ≥ρp2q are the eigenvalues of_Γ.

Similarly,δ−G

−F becomesδ−Gwhen we setF =Ix. Note thatδ−G=ϕ⊤3ϕ3withϕ3=vec{Σ− 1/2

x E(xγ⊤yGc|yG)Σ

−1/2 y_Gc_|yG} ∈

Rpq2_{, and ˆ}_δ−G = _ϕˆ⊤₃_ϕˆ₃ _{with ˆ}_ϕ₃ = _vec_{_Σˆ−_x1/2_E_n_(x_γ_ˆ⊤ y_Gc_|y_G) ˆΣ

−1/2

y_Gc_|y_G}. Similar to (A.17), it can be shown that ˆϕ3 = n−1∑n_i=1ϑ (i) 3 +Op(n− 1 ), whereϑ(₃i) =_vec_{_Σ_x−1/2_Λ(i)_Σ−1/2 y_Gc_|y_G}. Thus √_n_ϕ_ˆ 3 N(0,Υ), whereΥ =E{ϑ (i) 3(ϑ (i) 3 )⊤}. The asymptotic distribution of ˆδ−Gis summarized in the next result.

Corollary 2. SupposeE(x)=0,E(y)=0, andE(y_Gc|y_G)is a linear function ofy_G. Then underH[G]

0 :y⊥⊥x|yG, nδˆ−G pq2 ∑ ℓ=1 ωℓχ2ℓ(1) as n_{→ ∞}, whereχ2

ℓ(1)is independent chi-square with one degree of freedom forℓ∈ {1, . . . ,pq2}, andω1≥ · · · ≥ωpq2 are the eigenvalues of_Υ.

Derivation of_Bfor Model 1. Let

Σϵ =      1 .5 .25 .125 .5 1 .5 .25 .25 .5 1 .5 .125 .25 .5 1      and H=      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0      .

(15)

Becausey=_Hx+_{ϵ, we have} Σy=HΣxH⊤+Σϵ =      1 .5 .25 .125 .5 1 .5 .25 .25 .5 1 .5 .125 .25 .5 3      and Σ−y1=       4/3 ₋2/3 0 0 −2/3 5/3 −2/3 0 0 ₋2/3 47/33 ₋2/11 0 0 −2/11 4/11       . It follows that E(x|y)=_Σ_xy_Σ−1 y y=       0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0             4/3 ₋2/3 0 0 −2/3 5/3 −2/3 0 0 −2/3 47/33 −2/11 0 0 ₋2/11 4/11            Y1 Y2 Y3 Y4      = 2 11       2Y4−Y3 2Y4−Y3 0 0 0       .

Thus we haveB=_{₃_,₄_}_{for Model 1.} _✷

References

[1] E. Cand`es, T. Tao, The Dantzig selector: Statistical estimation whenpis much larger thann, Ann. Statist. 35 (2007) 2313–2351. [2] R.D. Cook, Regression Graphics: Ideas for Studying Regressions through Graphics, Wiley, New York, 1998.

[3] R.D. Cook, Testing predictor contributions in sufficient dimension reduction, Ann. Statist. 32 (2004) 1062–1092.

[4] R.D. Cook, S. Weisberg, Discussion of “sliced inverse regression for dimension reduction” by K.C. Li, J. Amer. Statist. Assoc. 86 (2004) 328–332.

[5] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360. [6] H. Hotelling, Relations between two sets of variables. Biometrika 58 (1936) 433–451.

[7] J. Huang, J.L. Horowitz, F. Wei, Variable selection in nonparametric additive models, Ann. Statist. 38 (2010) 2282–2313. [8] R. Iaci, X. Yin, L.X. Zhu, The dual central subspaces in dimension reduction, J. Multivariate Anal. 145 (2016) 178–189. [9] B. Jiang, J. Liu, Variable selection for general index models via sliced inverse regression, Ann. Statist. 42 (2014) 1751–1786. [10] B. Li, S. Wang, On directional regression for dimension reduction, J. Amer. Statist. Assoc. 102 (2007) 997–1008.

[11] K.C. Li, Sliced inverse regression for dimension reduction (with discussion), J. Amer. Statist. Assoc. 86 (1991) 316–342. [12] L. Li, Sparse sufficient dimension reduction, Biometrika 94 (2007) 603–613.

[13] L. Li, R.D. Cook, C.J. Nachtsheim, Model-free variable selection, J. R. Stat. Soc. Ser. B 67 (2005) 285–299.

[14] J.Y. Liu, R. Li, R. Wu, Feature selection for varying coeffi_{cient models with ultrahigh dimensional covariates, J. Amer. Statist. Assoc. 109} (2014) 266–274.

[15] D.W. Nierenberg, T.A. Stukel, J.A. Baron, B.J. Dain, E.R. Greenberg, S.C.P.S. Group, Determinants of plasma levels of beta-carotene and retinol, Amer. J. Epidemiol. 130 (1989) 511–521.

[16] Y. Shao, R.D. Cook, S. Weisberg, Mariginal tests with sliced average variance estimation, Biometrika 94 (2007) 285–296. [17] R.J. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1996) 267–288.

[18] L. Wang, L. Xue, A. Qu, H. Liang, Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates, Ann. Statist. 42 (2014) 592–624.

[19] Z. Yu, Y. Dong. Model-free coordinate test and variable selection via directional regression, Statist. Sinica 26 (2016) 1159–1174.

[20] Z. Yu, Y. Dong, Y. Fang, Marginal coordinate tests for central mean subspace with principal hessian directions, Chinese J. Appl. Probab. Statist. 26 (2010) 544–552.

[21] Z. Yu, Y. Dong, L.X. Zhu, Trace pursuit: A general framework for model-free variable selection, J. Amer. Statist. Assoc. 111 (2016) 813–821. [22] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc. 101 (2006) 1418–1429.