Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure

(1)

Web-based Supplementary Materials

for

Multivariate sparse group lasso for the multivariate multiple linear

regression with an arbitrary group structure

by

Yanming Li, Bin Nan and Ji Zhu

A

Proofs of technical results

A.1 Some matrix algebra

Lemma A.1 Let

L0(B) = 1 2kY − XBk 2 2 = 1 2tr(Y − XB) T (Y − XB) = 1₂ n X i=1 q X k=1 (yik₋ p X j=1 xijβjk)2. Then ∂L0(B)/∂βjk _{= −x}T j(Y − XB)·k = −Sjk+ kxjk22βjk, where Sjk = xTj(Y − XB(−j))·k.

A.2 Proof of Theorem 3.1

Following Lemma A.1,

L(B) = 1 nL 0_{(B) +} X 1≤j≤p;1≤k≤q λjk_|βjk| + X g∈G λg_kBgk2.

(2)

For a coordinate βjk_{in B, denote G}jk = {g : βjk ∈ Bg ∈ G}, then when L(B) is differentiable at βjk, ∂L(B) ∂βjk = −Sjk/n + kxjk 2 2βjk/n + λjksign(βjk) + X Gjk λgβjk/kBgk2.

If βjk> 0, then for any Bg ∈ Gjk, kBgk2 > 0, and

∂L(B) ∂βjk = −Sjk/n + kxjk 2 2βjk/n + λjk+ X Gjk λgβjk_/kBgk2.

Notice that ∂L(B)/∂βjk _{≥ 0 if and only if}

βjk_≥ Sjk− nλjk kxjk22+ n P Gjkλg/kBgk2 △ = ˜β_jk+,

and ∂L(B)/∂βjk< 0 if and only if βjk< ˜β_jk+. So fixing all other coordinates of B, if βjk > 0,

then L(B) is monotone increasing with respect to βjk when βjk > ˜β_jk+ and decreasing when

βjk< ˜βjk+. Therefore, if ˆβjk,min+ minimizes L(B) with respect to βjk when βjk > 0, then

ˆ β_jk,min+ = ( _S_jk_−nλ_jk kxjk22+n P Gjkλg/k ˆBgk2 , if Sjk> nλjk 0, otherwise. (A.1)

Similarly, if ˆβ_jk,min− minimizes L(B) with respect to βjk when βjk < 0, then

ˆ

β_jk,min− =

( Sjk+nλjk

kxjk2₂+nP_Gjkλg/k ˆBgk2, if Sjk< −nλjk

0, otherwise. (A.2)

Based on both (A.1) and (A.2), when βjk_{6= 0, the minimizer ˆ}βjk can be summarized into

a unified form: ˆ βjk= sgn(Sjk) (|Sjk| − nλjk)+ kxjk22+ n P {g∈Gjk}λg/k ˆBgk2 .

When βjk = 0, we discuss two separate cases according to weather all the groups

con-taining βjk _{are zero groups or not. First, if none of the groups in G}jk is a zero group, then

ˆ

βjk needs to satisfy the subgradient equation

Sjk_{/n − kx}jk22βjk/n = λjku +ˆ

X

Gjk ˆ

(3)

with |u| ≤ 1. Then a similar discussion to the case when βjk 6= 0 based on whether u > 0 or u ≤ 0 in (A.3) yields the same expression:

ˆ βjk= sgn(Sjk) (|Sjk| − nλjk)+ kxjk22+ n P {g∈Gjk}λg/k ˆBgk2 .

Second, if some of the groups in Gjk are zero groups, then neither | · | nor k · k2 in L(B)

is differentiable at zero. Let Gjk⊗ = {g : βjk ∈ Bg ∈ G, kBkg > 0}. Then for any zero group

B_g0 _{in G}_jk _{and for all βjk}_{∈ B}_{g0, ˆ}_βjk _{needs to satisfy the subgradient equation:}

Sjk_{/n − kx}jk22βjk/n = λjku + λg0vjkˆ +

X

G_jk⊗

λgβjkˆ _{/k ˆ}B_g_k_2, _(A.4)

where u is the subgradient scalar for the L1 _{norm | · | and v is the subgradient vector for}

the L2 _{norm k · k}2 with constrains |u| ≤ 1 and kvk2 ≤ 1, and λg0 is the tuning parameter

associated with Bg0. It can be seen directly (similar to the case of βjk = 0) that (A.4) yields

ˆ B_g0_{= 0 if} s X {jk: βjk∈Bg0} (|Sjk|/n − λjk)2₊ ≤ λg0.

A.3 Proof of Theorem 3.3

Lemma A.2 _{Under the assumptions in Theorem 3.3, for any B ∈ R}p×q, with probability at

least 1 − (pq)1−A2_/2 , 1 nkX(B ∗ − ˆB_)k2 2+ λ| ˆB− B|1+ 2 X g∈G λg_{k ˆ}B_g_{− B}_g_k₂ _(A.5) ≤ _n1kX(B∗− B)k22+ 4λ X jk∈J1(B) | ˆβjk_{− β}jk| + 4 X g∈J2(B) λg_{k ˆ}B_g _{− B}_g_k_2, M1( ˆB_{) ≤} 4 λ2_n2 X jk∈J1( ˆB) |[XT_X ( ˆB_{− B}∗_)]jk_|2 _≤ 4 λ2_n2kX T_X ( ˆB_{− B}∗_)k2 2. (A.6)

(4)

Proof. For any B ∈ Rp×q_{, we have} 1 nkY − X ˆBk 2 2+ 2λ| ˆB|1+ X g∈G 2λg_{k ˆ}B_g_k₂ _≤ 1 nkY − XBk 2 2+ 2λ|B|1+ X g∈G 2λg_kBgk2.

Plugging Y = XB∗+ W into the above inequality, we obtain

1 nkX(B ∗ − ˆB_)k2 2 ≤ 1 nkX(B ∗ − B)k22+ 2 n n X i=1 q X k=1 [X( ˆB_{− B)]}_ikωik +2λ(|B|1− | ˆB_|1_{) +}X g∈G 2λ_g(kBgk2_{− k ˆ}B_g_k2_),

where [X( ˆB_{− B)]}_ik _{denotes the ikth element of the product matrix X( ˆ}B_{− B) and ω}_ik _is

the ikth element of W. Notice that n X i=1 q X k=1 [X( ˆB_{− B)]ik}_ω_ik ₌ n X i=1 ( _q X k=1 " _p X j=1 xij( ˆβjk− βjk) # ωik ) ≤ max 1≤k≤q,1≤j≤p n X i=1 xijωik q X k=1 p X j=1 | ˆβjk_{− β}jk| = |XTW|∞| ˆB− B|1 where |XT_W

|∞= max1≤k≤q,1≤j≤p|Pni=1xijωik| is the maximum absolute value of entries of

XT_W

.

Let Vjk = xT

j · wk, 1 ≤ j ≤ p, 1 ≤ k ≤ q. Since wk ∼ N(0, σ2kIq) for 1 ≤ q ≤ Q,

then var(Vjk) = xT

pcov(wq)xp = nσ2q. Therefore (nσq2)−1/2Vjk are standard normal random

variables. Consider the random event

A = 2_n|XT_W

|∞≤ λ

. It is easy to see that the complement of A can be expressed as

Ac = At least one |Vjk| > λn 2 , 1 ≤ j ≤ p, 1 ≤ k ≤ q .

Denote B(0, λn/2) to be a 1-dimensional ball centered at 0 and with radius λn/2, then

P r{Ac} ≤ p X j=1 q X k=1 P r Vjk ∈ B/ 0,λn 2 = p q X k=1 P r (nσ_k2)−1/2Vjk ∈ B/ 0,λn 1/2 2σk ≤ pq × P r |Z| ≥ λn 1/2 2σ ≤ pq exp −λ 2_n 8σ2 = (pq)1−A2/2,

(5)

where Z is a standard normal random variable, and the last inequality is obtained by

P r{|Z| > a} ≤ exp (−a2_{/2). But on event A, we have}

1 nkX(B ∗ − ˆB_)k2 2+ λ| ˆB− B|1+ 2 X g∈G λg_{k ˆ}B_g_{− B}_g_k₂ ≤ _n1kX(B∗− B)k22+ 2λ(| ˆB− B|1+ |B|1− | ˆB|1) +2X g∈G λg_{(k ˆ}B_g_{− B}_g_k₂_{+ kB}_g_k₂_{− k ˆ}B_g_k₂₎ ≤ _n1kX(B∗− B)k22+ 4λ X jk∈J1(B) | ˆβjk_{− β}jk| + 4 X g∈J2(B) λg_{(k ˆ}B_g_{− B}_g_k_2.

This completes the proof of the first inequality in Lemma A.2.

To prove the second inequality, we use the KKT conditions and obtain ( (1/n)[XT (Y − X ˆB_)]jk _{= 2λsgn( ˆ}_{βjk) + 2}P g∈Gλgβjkˆ /k ˆBgk2, ˆβjk6= 0; (1/n)|[XT (Y − X ˆB_)]_{jk| ≤ 2λ + 2}P g∈Gλg, βˆjk= 0.

From the first condition we can see that ∀ ˆβjk_{6= 0,}

λ ≤ 1

n|[X

T

(Y − X ˆB_)]_jk|.

On the other hand, we have on A 1 n|[X T (Y − X ˆB_)]jk_{| ≤} 1 n|[X T_X (B∗_{− ˆ}B_)]jk_{+ [X}T_W ]jk_| ≤ _n1|[XT_X (B∗_{− ˆ}B_)]jk_{| +} 1 n|X T_W |∞ ≤ _n1|[XT_X (B∗_{− ˆ}B_)]jk_{| +}λ 2.

Then combine the above two inequalities, we have λ 2 ≤ 1 n|[X T_X_{( ˆ}_B − B∗)]_jk|. Therefore M1( ˆB_{) = |J}_{1( ˆ}B_{)| ≤} 4 λ2_n2 X jk∈J1( ˆB) |[XT_X ( ˆB_{− B}∗_)]jk_|2 _≤ 4 λ2_n2kX T_X ( ˆB_{− B}∗_)k2 2.

(6)

This completes the proof of Lemma A.2. Proof of Theorem 3.3.

By setting B = B∗ _{in (A.5) in Lemma A.2, we have that on event A,}

1 nkX( ˆB− B ∗ )k22 ≤ 4λ X jk∈J1(B∗) | ˆβjk_{− β}_{jk| + 4}∗ X g∈J2(B∗) λg_{k ˆ}B_g_{− B}∗_gk₂ _(A.7) ≤ 4λr1/2_{k( ˆ}_B_{− B}∗₎ J1(B∗)k2+ 4   X g∈J2(B∗) λ2 g   1/2 k( ˆB_{− B}∗₎_J 2(B∗)k2.

The last inequality is by Cauchy-Schwarz. Specifically, we have   X jk∈J1(B∗) | ˆβjk_{− β}_jk|∗   2 =   X jk∈J1(B∗) 1 × | ˆβjk_{− β}_jk|∗   2 ≤   X jk∈J1(B∗) 12     X jk∈J1(B∗) | ˆβjk_{− β}_jk|∗ 2   = rk( ˆB_{− B}∗_)J 1(B∗)k 2 2, and   X g∈J2(B∗) λg_{k ˆ}B_g_{− B}∗_gk₂   2 ≤   X g∈J2(B∗) λ2_g     X g∈J2(B∗) k ˆB_g _{− B}∗_gk2 2  .

Also by inequality (A.5), on event A, we have

λ| ˆB_{− B}∗_|₁_{+ 2}X g∈G λg_{k ˆ}B_g_{− B}∗_gk₂ _(A.8) ≤ 4λ X jk∈J1(B∗) | ˆβjk_{− β}_{jk| + 4}∗ X g∈J2(B∗) λg_{k ˆ}B_g_{− B}∗_gk_2. This is equivalent to λ X jk∈Jc 1(B∗) | ˆβjk_{− β}_{jk| + 2}∗ X g∈Jc 2(B∗) λg_{k ˆ}B_g_{− B}∗_gk₂ ≤ 3λ X jk∈J1(B∗) | ˆβjk_{− β}_{jk| + 2}∗ X g∈J2(B∗) λg_{k ˆ}B_g_{− B}∗_gk_2.

(7)

Thus the condition in Assumption 1 holds with ∆ = ˆB_{− B}∗ _{and ρg} _{= λg/λ. Therefore,} k( ˆB_{− B}∗₎_J 1(B∗)k2 ≤ kX( ˆB_{− B}∗_)k2 κ1n1/2 , k( ˆB− B ∗₎ J2(B∗)k2 ≤ kX( ˆB_{− B}∗_)k2 κ2n1/2 .

Plugging the above two inequalities into (A.7), we have 1 nkX( ˆB− B ∗ )k22 ≤    4λr1/2 κ1n1/2 + 4P g∈J2(B∗)λ 2 g 1/2 κ2n1/2    kX( ˆB− B ∗ )k2 =    4λr1/2 κ1n1/2 + 4λP g∈J2(B∗)ρ 2 g 1/2 κ2√n    kX( ˆB− B ∗ )k2, which gives 1 nkX( ˆB− B ∗ )k22 ≤ 16λ2    r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2    2 .

Define kAk2,1 =Pg∈G2_∪G1kAk2 for a matrix A, where each coefficient in G1 = GL forms

a group. Hence k ˆB_{− B}∗_k_2,1 _{= | ˆ}B_{− B}∗_|₁₊X g∈G k ˆB_g_{− B}∗_gk₂ _(A.9) ≤ (c + 1)| ˆB_{− B}∗_|_1. _(A.10) Then we have (λ + ρλ)k ˆB_{− B}∗_k_2,1 _{= λk ˆ}B_{− B}∗_k_2,1_{+ ρλk ˆ}B_{− B}∗_k_2,1 ≤ (c + 1)λ| ˆB_{− B}∗_|₁_{+ ρλk ˆ}B_{− B}∗_k_2,1 _{(by (A.10))} = (c + 1)λ| ˆB_{− B}∗_|₁_{+ ρλ| ˆ}B_{− B}∗_|₁₊X g∈G ρλk ˆB_g_{− B}∗_gk₂ _{(by (A.9))} ≤ (c + 2)λ| ˆB_{− B}∗_|₁₊X g∈G λg_{k ˆ}B_g_{− B}∗ gk2 ≤ (c + 2)λ| ˆB_{− B}∗_|₁_{+ 2}X g∈G λg_{k ˆ}B_g_{− B}∗_gk₂ ≤ (c + 2) " λ| ˆB_{− B}∗_|₁_{+ 2}X g∈G λg_{k ˆ}B_g_{− B}∗_gk₂ #

(8)

By (A.8) and the last inequality in (A.7) we obtain 1 + ρ c + 2λk ˆB− B ∗ k2,1 ≤ λ| ˆB_{− B}∗_|₁_{+ 2}X g∈G λg_{k ˆ}B_g _{− B}∗ gk2 ≤ 4λ X jk∈J1(B∗) | ˆβjk_{− β}_{jk| + 4}∗ X g∈J2(B∗) λg_{k ˆ}B_g_{− B}∗_gk₂ ≤ 4λr1/2_{k( ˆ}_B_{− B}∗₎ J1(B∗)k2+ 4   X g∈J2(B∗) λ2 g   1/2 k( ˆB_{− B}∗₎_J 2(B∗)k2 ≤    4λr1/2 κ1n1/2 + 4P g∈J2(B∗)λ 2 g 1/2 κ2n1/2    kX( ˆB− B ∗ )k2 ≤    4λr1/2 κ1n1/2 + 4λP g∈J2(B∗)ρ 2 g 1/2 κ2n1/2   4n 1/2_λ    r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2    = 16λ2    r1/2 κ1 + P g∈5J2(B∗)ρ 2 g 1/2 κ2    2 .5 Therefore, k ˆB_{− B}∗_k2,1 _≤ 16(c + 2)λ 1 + ρ    r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2    2 = 32(c + 2)σA 1 + ρ log(pq) n 1/2    r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2    2 . It is trivial that | ˆB − B∗_| 1 ≤ k ˆB − B∗k2,1.

From (A.6) in Lemma A.2, we obtain

M1( ˆB_{) ≤} 4 λ2_n2kX T_X ( ˆB_{− B}∗_)k2 2 ≤ 4ψmax λ2_n kX( ˆB− B ∗ )k22,

(9)

where the second inequality is from k[XT_X ( ˆB_{− B}∗_)]·k_k2 2 = ( ˆB− B ∗ )T ·kX T (XXT )X( ˆB_{− B}∗_)·k ≤ nψmaxkX( ˆB_{− B}∗_)·k_k2 2

for each 1 ≤ k ≤ q. By the upper bound of kX( ˆB_{− B}∗_)k2

2 we have M1( ˆB_{) ≤ 64ψ}_max    r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2    2 .

A.4 Proof of Proposition 4.1

To prove Proposition 4.1, we first show the following lemma.

Lemma A.3 _{For every pair of (j, k), the sequence of coordinate decent estimates { ˆ}β_jk(m) :

m = 0, 1, 2, · · · } obtained at each step m by solving the following equation ˆ βjk = sgn(Sjk) (|Sjk| − nλjk)+ kxjk22+ n P {g∈G: βjk∈Bg, k ˆBgk2>0}λg/(k ˆBg−(jk)k 2 2+ | ˆβjk|2)1/2 (A.11)

for ˆβjk while fixing others converges to a global minimizer of the objective function (2) in the

main text.

First, it is easy to see that the exact solution of (A.11) exists. If k ˆB_g−(jk)_k₂ _{= 0, the}

close form solution of (A.11) is just the lasso solution. If k ˆB_g−(jk)_k₂ _{6= 0, then the right hand}

side of (A.11) is a continuous function of ˆβjk, which is monotone when ˆβjk > 0 or ˆβjk < 0,

bounded away from zero when ˆβjk = 0, and bounded away from ±∞ when ˆβjk goes to ±∞,

therefore must intersect with either y = ˆβjk or y = − ˆβjk. Therefore an exact solution of

(10)

Wu and Lange (2008) proved the convergence to a minimal point of the lasso objective function for the greedy coordinate descent algorithm. In a very similar way, one can extend the proof to the multivariate sparse group lasso objective function and the coordinate descent algorithm of iteratively solving for the exact solution of (A.11). Due to significant overlapping with Wu and Lange (2008), we omit the proof of Lemma A.3 here.

Proof of Proposition 4.1.

Denote { ˆβ_jk(m)_{} the sequence of estimates of jk}th _{coordinate from the coordinate descent}

algorithm that solves equation (A.11) in each step indexed by m. Starting from ˆB(m−1)_,

denote ˆβ_jkMCD(m−1)the one step update of the jkth_{coordinate by the mixed coordinate descent}

algorithm. We prove in the following that

| ˆβ_jk(m)_{| ≤ | ˆ}β_jkMCD(m)_{| ≤ | ˆ}β_jk(m−1)_| (A.12)

with equalities hold only when | ˆβ_jk(m)_{| = | ˆ}β_jk(m−1)_|.

First, if ˆβ_jk(m) is updated by (3) in the main text, then (A.12) is automatically satisfied

since

0 = | ˆβ_jk(m)_{| = | ˆ}β_jkMCD(m)_{| ≤ | ˆ}β_jk(m−1)_|

Otherwise, if ˆβ_jk(m) is updated by (II) or (III) or (IV) in Section 4 of the main text, then

it must be one of the following cases. (i) If ˆ β_jk(m−1) < ˆβ_jk(m) = − (|Sjk| − nλjk)+ kxjk22 + n P {g∈G: βjk∈Bg, k ˆB(m−1)g k2>0}λg/(k ˆ B(m−1)_g−(jk)k2 2+ | ˆβ (m) jk |2)1/2 < 0, then ˆ β_jkMCD(m) = − (|Sjk| − nλjk)+ kxjk22+ n P {g∈G: βjk∈Bg, k ˆB(m−1)g k2>0}λg/(k ˆ B(m−1)_g−(jk)k2 2+ | ˆβ (m−1) jk |2)1/2 < ˆβ_jk(m).

(11)

From the proof of Theorem 3.1, ˆβ_jk(m−1) < ˆβ_jk(m) if and only if ∂L(B) ∂βjk _ˆ β_jk(m−1) = −Sjk/n + kxjk22βˆ (m−1) jk /n − λjk+ X Gjk λgβˆ_jk(m−1)_{/k ˆ}B(m−1) g k2< 0.

Notice that the above is also the partial derivative of Lnet_{(B) w.r.t. βjk} _{taking value at}

ˆ

β_jk(m−1), where Lnet_{(B) is the elastic net objective function}

Lnet(B) = 1 2nkY − XBk 2 2+ X jk λjk_|βjk| + X g∈G λg_kBgk22/(2k ˆB (m−1) g k2) holding k ˆB(m−1)

g k2 as constants and constraining that βjk< 0.

Following exactly the same argument as in the proof of Theorem 3.1, we can prove that ∂Lnet(B) ∂βjk _ˆ β(m−1)_jk < 0 if and only if ˆβ (m−1)

jk is less than the solution of ∂L(B)/∂βjk= 0 with the

constraint βjk < 0, which is the solution of

−Sjk/n + kxjk22βjk/n − λjk+ X Gjk λgβjk_{/k ˆ}B(m−1) g k2 = 0 under βjk< 0 given by − (|Sjk| − nλjk)₊ kxjk22+ n P {g∈G: βjk∈Bg, k ˆBgk(m−1)₂ >0}λg/k ˆ B(m−1)_k2 = ˆβ MCD(m) jk . Therefore, we have ˆ β_jk(m−1) < ˆβ_jkMCD(m) < ˆβ_jk(m)< 0. (ii) If ˆ β_jk(m−1) > ˆβ_jk(m)= (|Sjk| − nλjk)+ kxjk22+ n P {g∈G: βjk∈Bg, k ˆB (m−1) g k2>0}λg/(k ˆ B(m−1)_g−(jk)k2 2+ | ˆβ (m) jk |2)1/2 ≥ 0, with similar argument, we have that

ˆ

(12)

(iii) If ˆβ_jk(m−1) = ˆβ_jk(m), the mixed coordinate descent algorithm will be exact update and we will have

ˆ

β_jk(m−1) = ˆβ_jkMCD(m) = ˆβ_jk(m). In summary, we have (A.12).

Lemma A.3 shows that the sequence of estimates of jkth _{coordinate { ˆ}_β(m)

jk } iteratively

updated from solving (A.11) converges to a global minimizer regardless the value of the

starting point. For each term in the sequence { ˆβ_jkMCD(l)_{}, suppose one can construct a}

se-quence of { ˆβ_jk(m)_{} starting from ˆ}β_jkMCD(l), then those sequences all converge to minimizers (if

the minimizer is not unique, e.g. for not strictly convex objective function) with the same

minimum value. Thus from (A.12) we know that { ˆβ_jkMCD(l)_{} converge to a global minimizer}

with the same minimum value.

Figure A.1 illustrates coordinate updates by the standard coordinate descent and the mixed coordinate descent algorithms on a contour surface of a two-dimensional objective function. Given the same starting values, one step update on one coordinate from the mixed coordinate descent algorithm is always bounded between the previous and current values from the standard coordinate descent algorithm.

A.5 Comparison of computing costs

The computational cost of coordinate descent algorithm with inner iterations is much higher than our mixed coordinate descent algorithm. Figure A.2 shows the comparison between these two algorithms. The group structure of the regression coefficient matrix used is set to be (b) in Figure 1 in the main text. In Figure A.2, the mixed coordinate descent algorithm converges to a minimizer after 500 iterations while the coordinate descent algorithm with inner iterations converges after 150000 iterations.

(13)

( 1 (m-1) , 2 (m-1) ) ₍ 1 (m) , 2 (m-1) ) ( 1 (m) , 2 (m) ) X ( 1 MCD(m) , 2 (m-1) ) ( 1 (m) , 2 MCD(m) ) X

Figure A.1: Illustration of coordinate updates by the standard coordinate descent and the mixed coordinate descent algorithms on a contour surface of a two-dimensional objective function.

B

Comparison between univariate and multivariate

ap-proaches

Figures B.1 and B.2 illustrate the comparisons between univariate approaches and

multi-variate approaches. The true regression coefficient matrix takes a GXY ∪ GX group structure.

It can be seen that when different response variables have a similar sparsity to the predic-tors, the multiple univariate lasso (using different λ values for different response variables) and the multivariate lasso (using the same λ value for all response variables) have similar performance on variable selection. The multiple univariate sparse group lasso approach has

(14)

1 e +0 5 2 e +0 5 3 e +0 5 4 e +0 5 5 e +0 5 iteration numbers o b je ct iv e f u n ct io n 0 ₅₀0 100050001000 0 5000 0 1000 00 1500 00 2000 00 CD−w/ inner iteration MCD−w/o inner iteration

Figure A.2: Decreasing of the objective function. The gray line is for the proposed mixed coordinate descent (MCD) algorithm without inner iterations of updating (A.11) and the black line is for the coordinate descent (CD) algorithm with inner iterations.

a slightly better variable selection performance than the multiple univariate lasso. The pro-posed multivariate sparse group lasso yields the best variable selection result by borrowing information from other response variables within the same group. It also has the smallest prediction error.

C

More simulations

Figures C.1 to C.5 show the variable selection and prediction effects in some other simulation settings, such as with different autocorrelation coefficient values or with a true “all-in-all-out”

(15)

(a) (b) (c)

(d) (e)

Figure B.1: Heatmaps of coefficient matrices. (a) True B∗_{; (b) The multiple univariate}

lasso; (c) The multiple univariate sparse group lasso (d) The multivariate lasso; (e) The

multivariate sparse group lasso; The true B∗ _{has a “not all in all out” and X+XY group}

structure with p = q = 200, n = 100, ρ = 0.5. u n i L u n i SG L m u lt i L m u lt i SG L False Positive 0 500 1000 1500 2000 2500 u n i L u n i SG L m u lt i L m u lt i SG L False Negative 0 250 500 750 1000 1250 u n i L u n i SG L m u lt i L m u lt i SG L Sensitivity(%) 0 25 50 75 u n i L u n i SG L m u lt i L m u lt i SG L Specificity(%) 0 25 50 75 100 u n i L u n i SG L m u lt i L m u lt i SG L Prediction Error 6.70e+6 6.95e+6 7.20e+6 7.45e+6

Univariate and multivariate comparison, N=100, P=200, Q=200, ρ = 0.5

Figure B.2: Comparison between multiple-univariate and multivariate approaches from 100 simulated data sets. “uni L” – the multiple univariate lasso; “uni SGL” – the multiple univariate group lasso; “multi L” – the multivariate lasso; “multi SGL” – the multivariate sparse group lasso with an XY group structure on the coefficient matrix.

(16)

group structure.

D

Network structure for the yeast eQTL data analysis

Figure D.1 shows the network constructed from the multivariate sparse group lasso method. The top association signals are highlighted in dark lines and also reported in Table 2 and 3 in the main text.

References

Wu, T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Annal of Applied Statistics, 2:224–44.

(17)

(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 500 1500 2500 3500 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 50 100 150 200 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 1.90e+6 2.00e+6 2.10e+6 2.20e+6 X group structure (b) L L X L XY L XXY G X G XY G XXY False Positive 0 3000 6000 9000 L L X L XY L XXY G X G XY G XXY False Negative 0 150 300 450 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 1.90e+6 2.15e+6 2.40e+6 2.65e+6 2.90e+6 XY group structure (c) L L X L XY L XXY G X G XY G XXY False Positive 0 2000 4000 6000 L L X L XY L XXY G X G XY G XXY False Negative 0 200 400 600 700 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 3.70e+6 3.95e+6 4.20e+6 4.45e+6 4.70e+6 4.95e+6

X+XY group structure

(d) L SG L G False Positive 0 1000 2000 3000 L SG L G False Negative 0 200 400 600 L SG L G Sensitivity(%) 0 25 50 75 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 3.20e+6 3.35e+6 3.50e+6 3.65e+6

Overlapping group structure

Figure C.1: More simulation results, “not all in all out” cases with n = 150, p = q = 200 and ρ = 0.2.

(18)

(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 500 1500 2500 3500 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 100 200 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 2.00e+6 2.10e+6 2.20e+6 2.30e+6 2.40e+6 2.50e+6 X group structure (b) L L X L XY L XXY G X G XY G XXY False Positive 0 1000 2000 3000 L L X L XY L XXY G X G XY G XXY False Negative 0 150 300 450 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 1.80e+6 2.00e+6 2.20e+6 2.40e+6 XY group structure (c) L L X L XY L XXY G X G XY G XXY False Positive 0 2000 4000 6000 7000 L L X L XY L XXY G X G XY G XXY False Negative 0 150 300 450 600 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 3.50e+6 3.75e+6 4.00e+6 4.25e+6 4.50e+6

X+XY group structure

(d) L SG L G False Positive 0 2000 4000 6000 L SG L G False Negative 0 100 200 300 400 L SG L G Sensitivity(%) 0 25 50 75 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 3.10e+6 3.15e+6 3.20e+6 3.25e+6 3.30e+6

Overlapping group structure

Figure C.2: More simulation results, “not all in all out” cases with n = 150, p = q = 200 and ρ = 0.8.

(19)

(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 500 1000 1500 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 1000 2000 3000 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 1.40e+7 1.50e+7 1.60e+7

X group structure, all in all out

(b) L L X L XY L XXY G X G XY G XXY False Positive 0 500 1000 1500 L L X L XY L XXY G X G XY G XXY False Negative 0 1000 2000 3000 4000 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 1.40e+7 1.45e+7 1.50e+7 1.55e+6 1.60e+6 1.65e+6

XY group structure, all in all out

(c) L L X L XY L XXY G X G XY G XXY False Positive 0 500 1000 1500 L L X L XY L XXY G X G XY G XXY False Negative 0 2000 4000 6000 7000 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 2.70e+7 2.80e+7 2.90e+7 3.00e+7 3.10e+7

X+XY group structure, all in all out

(d) L SG L G False Positive 0 1000 2000 3000 L SG L G False Negative 0 1500 3000 4500 L SG L G Sensitivity(%) 0 25 50 75 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 1.20e+7 1.30e+7 1.40e+7

Overlapping group structure, all in all out

Figure C.3: More simulation results, “all in all out” cases with n = 150, p = q = 200 and ρ = 0.5.

(20)

(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 200 400 600 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 200 400 600 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 3.20e+6 3.30e+6 3.40e+6 3.50e+6 3.60e+6

X group structure, all in all out

(b) L L X L XY L XXY G X G XY G XXY False Positive 0 200 400 600 800 L L X L XY L XXY G X G XY G XXY False Negative 0 200 400 600 800 1000 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 3.30e+6 3.50e+6 3.70e+6 3.90e+6 4.20e+6

XY group structure, all in all out

(c) L L X L XY L XXY G X G XY G XXY False Positive 0 200 400 600 800 L L X L XY L XXY G X G XY G XXY False Negative 0 400 800 1200 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 6.10e+6 6.50e+6 6.90e+6 7.30e+6

X+XY group structure, all in all out

(d) L SG L G False Positive 0 300 600 900 L SG L G False Negative 0 200 400 600 800 L SG L G Sensitivity(%) 0 25 50 75 100 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 2.80e+6 2.95e+6 3.10e+6 3.25e+6

Overlapping group structure, all in all out

Figure C.4: More simulation results, “all in all out” cases with n = 150, p = q = 100 and ρ = 0.5.

(21)

(a) (b) (c) (d)

(e) (f) (g) (h)

Figure C.5: Heatmaps of coefficient matrices, selection effects. “Not all in all out” XY group

structure with n = 100, p = 200, q = 200, and ρ = 0.5. (a) B∗; (b) ˆB_L_{; (c) ˆ}B_LX_{; (d) ˆ}B_LXY_;

(22)

● ● ● ●_●_● ● ● ● ● ●● ● ● ●_● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●●● ● ●●●● ● ●●● ● ●● ● ● ●●● ●●●● ●● ● ● ●_● ● ● ● ●● ●●●●● ● ●●● ● ● ●●● ●● ● ● ● _● ● ● ● ● ● ●●●●● ● ●● ● ●●● ●●● ●●●● ●●● ●●●● ●●● ● ● ●●● ●● ●●●● ●●● ●●●● ●●●● ● ●●●_●●●●●●●●●●●●●●●●●●●●●●●●●●_●●●●_●●●● ● ●●●●_● ● ●●● ● ●●● ●●_● _● ●● ● ● ● ● ● ● ● ●_● ●● ● ● ●● ● ●●●●● ● ● ●●●●●●●● ●●●●●●●● ●● ● ●●● ● ● ● ● ●●●● ●●_● ● ● ● ● ● ●●●● ● ● ● ● ● ●● ●●●●● ● ● ● ●●●●●● ●●●● ● ● ● ●● ●● ● ●●● ●●● ●●● ● ●●●●●●● ● ● ●●● ●● ●●●● ●●● ●●●● ●●_● ● ● ● ● ● ●●●● ● ● ● ●● ●● ● ● ● ●●● ● ● ● ●●● ●●●● ● ● ● ●●●●●●●_● ● ● ● ●● ● ● ●● ● ● gA L gBL gBR gC L gC R gEL gER gH L gJL gL R ML g g N L LO_g PLg N BR N C R ND R NEL NER NH L NH R NLR NML NNL NOL NPL SNR YAL YAR YBR YCL YCR YDR YEL YER YG L YG R YH L YH R YJL YL R YML YN L Y_O L YPL MA PK cell circl e cance r riboso me

Figure D.1: Network constructed from the multivariate sparse group lasso method. Network structure is between gene expressions grouped in mitogen-activated protein kinases (MAPK), cell cycle, cancer, ribosome pathways and markers grouped in 45 gene groups. Gray lines

connect expression-marker pairs with non-zero ˆβjk. Dark lines are for the top 10 associations

in each pathways. The strength of these top associations are indicated by the width of the dark lines. The dotted circles indicate the overlapping pathway group structure.