
Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure


Academic year: 2021


Web-based Supplementary Materials for

Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure

by Yanming Li, Bin Nan and Ji Zhu

A

Proofs of technical results

A.1

Some matrix algebra

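Lemma A.1 Let

$$L_0(B) = \frac{1}{2}\|Y - XB\|_2^2 = \frac{1}{2}\,\mathrm{tr}\{(Y - XB)^T (Y - XB)\} = \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{q}\Bigl(y_{ik} - \sum_{j=1}^{p} x_{ij}\beta_{jk}\Bigr)^2.$$

Then

$$\partial L_0(B)/\partial\beta_{jk} = -x_j^T (Y - XB)_{\cdot k} = -S_{jk} + \|x_j\|_2^2\,\beta_{jk},$$

where $S_{jk} = x_j^T (Y - XB_{(-j)})_{\cdot k}$.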
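The derivative formula in Lemma A.1 can be checked numerically. The sketch below compares the closed form with a central finite difference on small random matrices. It assumes that $B_{(-j)}$ means $B$ with its $j$th row set to zero (the usual convention, not restated in this supplement); the function names are hypothetical.

```python
import random

def loss0(Y, X, B):
    """L0(B) = 0.5 * sum_{i,k} (y_ik - sum_j x_ij * beta_jk)^2."""
    n, p, q = len(X), len(B), len(B[0])
    total = 0.0
    for i in range(n):
        for k in range(q):
            fit = sum(X[i][j] * B[j][k] for j in range(p))
            total += (Y[i][k] - fit) ** 2
    return 0.5 * total

def grad_entry(Y, X, B, j, k):
    """Closed form from Lemma A.1: -S_jk + ||x_j||_2^2 * beta_jk."""
    n, p = len(X), len(B)
    # S_jk uses the residual with predictor j left out of the fit
    # (assumption: B_(-j) is B with row j set to zero).
    s_jk = sum(
        X[i][j] * (Y[i][k] - sum(X[i][l] * B[l][k] for l in range(p) if l != j))
        for i in range(n)
    )
    norm_sq = sum(X[i][j] ** 2 for i in range(n))
    return -s_jk + norm_sq * B[j][k]

def numeric_grad(Y, X, B, j, k, h=1e-6):
    """Central finite difference of L0 in the (j, k) entry of B."""
    up = [row[:] for row in B]
    down = [row[:] for row in B]
    up[j][k] += h
    down[j][k] -= h
    return (loss0(Y, X, up) - loss0(Y, X, down)) / (2 * h)

rng = random.Random(0)
n, p, q = 6, 4, 3
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Y = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(n)]
B = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(p)]

# Largest disagreement between the closed form and the finite difference.
max_gap = max(
    abs(grad_entry(Y, X, B, j, k) - numeric_grad(Y, X, B, j, k))
    for j in range(p)
    for k in range(q)
)
```

Because $L_0$ is quadratic in each entry, the central difference agrees with the closed form up to rounding error.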

A.2

Proof of Theorem 3.1

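Following Lemma A.1,

$$L(B) = \frac{1}{n}L_0(B) + \sum_{1\le j\le p;\,1\le k\le q}\lambda_{jk}|\beta_{jk}| + \sum_{g\in\mathcal{G}}\lambda_g\|B_g\|_2.$$

For a coordinate $\beta_{jk}$ in $B$, denote $G_{jk} = \{g : \beta_{jk}\in B_g\in\mathcal{G}\}$. Then, when $L(B)$ is differentiable at $\beta_{jk}$,

$$\frac{\partial L(B)}{\partial\beta_{jk}} = -S_{jk}/n + \|x_j\|_2^2\,\beta_{jk}/n + \lambda_{jk}\,\mathrm{sign}(\beta_{jk}) + \sum_{g\in G_{jk}}\lambda_g\beta_{jk}/\|B_g\|_2.$$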

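If $\beta_{jk} > 0$, then for any $B_g$ with $g\in G_{jk}$, $\|B_g\|_2 > 0$, and

$$\frac{\partial L(B)}{\partial\beta_{jk}} = -S_{jk}/n + \|x_j\|_2^2\,\beta_{jk}/n + \lambda_{jk} + \sum_{g\in G_{jk}}\lambda_g\beta_{jk}/\|B_g\|_2.$$

Notice that $\partial L(B)/\partial\beta_{jk} \ge 0$ if and only if

$$\beta_{jk} \ge \frac{S_{jk} - n\lambda_{jk}}{\|x_j\|_2^2 + n\sum_{g\in G_{jk}}\lambda_g/\|B_g\|_2} \triangleq \tilde\beta_{jk}^{+},$$

and $\partial L(B)/\partial\beta_{jk} < 0$ if and only if $\beta_{jk} < \tilde\beta_{jk}^{+}$. So, fixing all other coordinates of $B$, if $\beta_{jk} > 0$, then $L(B)$ is monotone increasing with respect to $\beta_{jk}$ when $\beta_{jk} > \tilde\beta_{jk}^{+}$ and decreasing when $\beta_{jk} < \tilde\beta_{jk}^{+}$. Therefore, if $\hat\beta_{jk,\min}^{+}$ minimizes $L(B)$ with respect to $\beta_{jk}$ when $\beta_{jk} > 0$, then

$$\hat\beta_{jk,\min}^{+} = \begin{cases} \dfrac{S_{jk} - n\lambda_{jk}}{\|x_j\|_2^2 + n\sum_{g\in G_{jk}}\lambda_g/\|\hat B_g\|_2}, & \text{if } S_{jk} > n\lambda_{jk}, \\[2ex] 0, & \text{otherwise.} \end{cases} \tag{A.1}$$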

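Similarly, if $\hat\beta_{jk,\min}^{-}$ minimizes $L(B)$ with respect to $\beta_{jk}$ when $\beta_{jk} < 0$, then

$$\hat\beta_{jk,\min}^{-} = \begin{cases} \dfrac{S_{jk} + n\lambda_{jk}}{\|x_j\|_2^2 + n\sum_{g\in G_{jk}}\lambda_g/\|\hat B_g\|_2}, & \text{if } S_{jk} < -n\lambda_{jk}, \\[2ex] 0, & \text{otherwise.} \end{cases} \tag{A.2}$$

Based on both (A.1) and (A.2), when $\beta_{jk} \ne 0$, the minimizer $\hat\beta_{jk}$ can be summarized in a unified form:

$$\hat\beta_{jk} = \frac{\mathrm{sgn}(S_{jk})\,(|S_{jk}| - n\lambda_{jk})_+}{\|x_j\|_2^2 + n\sum_{\{g\in G_{jk}\}}\lambda_g/\|\hat B_g\|_2}.$$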
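As a minimal sketch of the unified form, the function below evaluates the soft-thresholded update for one coordinate. It treats the group norms $\|\hat B_g\|_2$ as given positive numbers, whereas in the proof they depend on $\hat\beta_{jk}$ itself. The function name and argument layout are hypothetical.

```python
def coordinate_update(s_jk, x_norm_sq, n, lam_jk, group_terms):
    """Unified minimizer for a non-zero coordinate.

    s_jk        -- S_jk
    x_norm_sq   -- ||x_j||_2^2
    n           -- sample size
    lam_jk      -- coordinate-level penalty lambda_jk
    group_terms -- list of (lambda_g, ||B_g||_2) pairs for the groups in G_jk,
                   with every norm strictly positive (assumed given)
    """
    # Soft-threshold |S_jk| at n * lambda_jk.
    shrunk = max(abs(s_jk) - n * lam_jk, 0.0)
    # Group penalties inflate the denominator and shrink the estimate further.
    denom = x_norm_sq + n * sum(lam_g / norm_g for lam_g, norm_g in group_terms)
    sign = (s_jk > 0) - (s_jk < 0)
    return sign * shrunk / denom
```

With no group terms the expression reduces to the usual lasso coordinate update; for example, `coordinate_update(5.0, 10.0, 10, 0.1, [])` thresholds 5 at 1 and divides by 10.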

When $\beta_{jk} = 0$, we discuss two separate cases according to whether all the groups containing $\beta_{jk}$ are zero groups or not. First, if none of the groups in $G_{jk}$ is a zero group, then $\hat\beta_{jk}$ needs to satisfy the subgradient equation

$$S_{jk}/n - \|x_j\|_2^2\,\hat\beta_{jk}/n = \lambda_{jk}u + \sum_{g\in G_{jk}}\lambda_g\hat\beta_{jk}/\|\hat B_g\|_2 \tag{A.3}$$

with $|u| \le 1$. Then a discussion similar to the case $\beta_{jk} \ne 0$, based on whether $u > 0$ or $u \le 0$ in (A.3), yields the same expression:

$$\hat\beta_{jk} = \frac{\mathrm{sgn}(S_{jk})\,(|S_{jk}| - n\lambda_{jk})_+}{\|x_j\|_2^2 + n\sum_{\{g\in G_{jk}\}}\lambda_g/\|\hat B_g\|_2}.$$

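Second, if some of the groups in $G_{jk}$ are zero groups, then neither $|\cdot|$ nor $\|\cdot\|_2$ in $L(B)$ is differentiable at zero. Let $G_{jk}^{\otimes} = \{g : \beta_{jk}\in B_g\in\mathcal{G},\ \|B_g\|_2 > 0\}$. Then for any zero group $B_{g_0}$ in $G_{jk}$ and for all $\beta_{jk}\in B_{g_0}$, $\hat\beta_{jk}$ needs to satisfy the subgradient equation

$$S_{jk}/n - \|x_j\|_2^2\,\hat\beta_{jk}/n = \lambda_{jk}u + \lambda_{g_0}v_{jk} + \sum_{g\in G_{jk}^{\otimes}}\lambda_g\hat\beta_{jk}/\|\hat B_g\|_2, \tag{A.4}$$

where $u$ is the subgradient scalar for the $L_1$ norm $|\cdot|$ and $v$ is the subgradient vector for the $L_2$ norm $\|\cdot\|_2$, with constraints $|u| \le 1$ and $\|v\|_2 \le 1$, and $\lambda_{g_0}$ is the tuning parameter associated with $B_{g_0}$. It can be seen directly (similar to the case of $\beta_{jk} = 0$) that (A.4) yields $\hat B_{g_0} = 0$ if

$$\sqrt{\sum_{\{jk:\ \beta_{jk}\in B_{g_0}\}}(|S_{jk}|/n - \lambda_{jk})_+^2} \le \lambda_{g_0}. \qquad\square$$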
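The zero-group condition at the end of the proof is a single thresholding test. The sketch below evaluates it for one group, given the $S_{jk}$ values and penalties of the coordinates in that group; the function name and inputs are hypothetical.

```python
import math

def group_is_zero(s_values, lam_values, lam_group, n):
    """Return True when sqrt(sum_jk (|S_jk|/n - lambda_jk)_+^2) <= lambda_g0.

    s_values   -- S_jk for every coordinate in the group B_g0
    lam_values -- matching coordinate penalties lambda_jk
    lam_group  -- group penalty lambda_g0
    n          -- sample size
    """
    total = sum(
        max(abs(s) / n - lam, 0.0) ** 2
        for s, lam in zip(s_values, lam_values)
    )
    return math.sqrt(total) <= lam_group
```

For instance, with `s_values = [3.0, -4.0]`, `lam_values = [0.1, 0.1]` and `n = 10`, the thresholded terms are about 0.2 and 0.3, so the group is set to zero for `lam_group = 0.4` but not for `lam_group = 0.3`.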

A.3

Proof of Theorem 3.3

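Lemma A.2 Under the assumptions in Theorem 3.3, for any $B\in\mathbb{R}^{p\times q}$, with probability at least $1 - (pq)^{1-A^2/2}$,

$$\frac{1}{n}\|X(B^* - \hat B)\|_2^2 + \lambda|\hat B - B|_1 + 2\sum_{g\in\mathcal{G}}\lambda_g\|\hat B_g - B_g\|_2 \le \frac{1}{n}\|X(B^* - B)\|_2^2 + 4\lambda\sum_{jk\in J_1(B)}|\hat\beta_{jk} - \beta_{jk}| + 4\sum_{g\in J_2(B)}\lambda_g\|\hat B_g - B_g\|_2, \tag{A.5}$$

$$M_1(\hat B) \le \frac{4}{\lambda^2 n^2}\sum_{jk\in J_1(\hat B)}\bigl|[X^T X(\hat B - B^*)]_{jk}\bigr|^2 \le \frac{4}{\lambda^2 n^2}\|X^T X(\hat B - B^*)\|_2^2. \tag{A.6}$$

Proof. For any $B\in\mathbb{R}^{p\times q}$, we have

$$\frac{1}{n}\|Y - X\hat B\|_2^2 + 2\lambda|\hat B|_1 + \sum_{g\in\mathcal{G}}2\lambda_g\|\hat B_g\|_2 \le \frac{1}{n}\|Y - XB\|_2^2 + 2\lambda|B|_1 + \sum_{g\in\mathcal{G}}2\lambda_g\|B_g\|_2.$$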

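Plugging $Y = XB^* + W$ into the above inequality, we obtain

$$\frac{1}{n}\|X(B^* - \hat B)\|_2^2 \le \frac{1}{n}\|X(B^* - B)\|_2^2 + \frac{2}{n}\sum_{i=1}^{n}\sum_{k=1}^{q}[X(\hat B - B)]_{ik}\,\omega_{ik} + 2\lambda(|B|_1 - |\hat B|_1) + \sum_{g\in\mathcal{G}}2\lambda_g(\|B_g\|_2 - \|\hat B_g\|_2),$$

where $[X(\hat B - B)]_{ik}$ denotes the $ik$th element of the product matrix $X(\hat B - B)$ and $\omega_{ik}$ is the $ik$th element of $W$. Notice that

$$\sum_{i=1}^{n}\sum_{k=1}^{q}[X(\hat B - B)]_{ik}\,\omega_{ik} = \sum_{i=1}^{n}\Bigl\{\sum_{k=1}^{q}\Bigl[\sum_{j=1}^{p}x_{ij}(\hat\beta_{jk} - \beta_{jk})\Bigr]\omega_{ik}\Bigr\} \le \max_{1\le k\le q,\,1\le j\le p}\Bigl|\sum_{i=1}^{n}x_{ij}\omega_{ik}\Bigr|\ \sum_{k=1}^{q}\sum_{j=1}^{p}|\hat\beta_{jk} - \beta_{jk}| = |X^T W|_\infty\,|\hat B - B|_1,$$

where $|X^T W|_\infty = \max_{1\le k\le q,\,1\le j\le p}|\sum_{i=1}^{n}x_{ij}\omega_{ik}|$ is the maximum absolute value of the entries of $X^T W$.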

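Let $V_{jk} = x_j^T w_k$, $1\le j\le p$, $1\le k\le q$, where $w_k$ is the $k$th column of $W$. Since $w_k\sim N(0, \sigma_k^2 I_n)$ for $1\le k\le q$, we have $\mathrm{var}(V_{jk}) = x_j^T\,\mathrm{cov}(w_k)\,x_j = n\sigma_k^2$. Therefore $(n\sigma_k^2)^{-1/2}V_{jk}$ are standard normal random variables. Consider the random event

$$\mathcal{A} = \Bigl\{\frac{2}{n}|X^T W|_\infty \le \lambda\Bigr\}.$$

It is easy to see that the complement of $\mathcal{A}$ can be expressed as

$$\mathcal{A}^c = \Bigl\{\text{at least one } |V_{jk}| > \frac{\lambda n}{2},\ 1\le j\le p,\ 1\le k\le q\Bigr\}.$$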

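Denote by $B(0, \lambda n/2)$ the one-dimensional ball centered at 0 with radius $\lambda n/2$. Then

$$\Pr\{\mathcal{A}^c\} \le \sum_{j=1}^{p}\sum_{k=1}^{q}\Pr\Bigl\{V_{jk}\notin B\Bigl(0, \frac{\lambda n}{2}\Bigr)\Bigr\} = p\sum_{k=1}^{q}\Pr\Bigl\{(n\sigma_k^2)^{-1/2}V_{jk}\notin B\Bigl(0, \frac{\lambda n^{1/2}}{2\sigma_k}\Bigr)\Bigr\} \le pq\times\Pr\Bigl\{|Z| \ge \frac{\lambda n^{1/2}}{2\sigma}\Bigr\} \le pq\exp\Bigl(-\frac{\lambda^2 n}{8\sigma^2}\Bigr) = (pq)^{1-A^2/2},$$

where $Z$ is a standard normal random variable, and the last inequality is obtained from $\Pr\{|Z| > a\} \le \exp(-a^2/2)$. But on event $\mathcal{A}$, we have

$$\begin{aligned}
&\frac{1}{n}\|X(B^* - \hat B)\|_2^2 + \lambda|\hat B - B|_1 + 2\sum_{g\in\mathcal{G}}\lambda_g\|\hat B_g - B_g\|_2 \\
&\quad\le \frac{1}{n}\|X(B^* - B)\|_2^2 + 2\lambda\bigl(|\hat B - B|_1 + |B|_1 - |\hat B|_1\bigr) + 2\sum_{g\in\mathcal{G}}\lambda_g\bigl(\|\hat B_g - B_g\|_2 + \|B_g\|_2 - \|\hat B_g\|_2\bigr) \\
&\quad\le \frac{1}{n}\|X(B^* - B)\|_2^2 + 4\lambda\sum_{jk\in J_1(B)}|\hat\beta_{jk} - \beta_{jk}| + 4\sum_{g\in J_2(B)}\lambda_g\|\hat B_g - B_g\|_2.
\end{aligned}$$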

This completes the proof of the first inequality in Lemma A.2.
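The two probabilistic ingredients of this part of the proof, the Gaussian tail bound $\Pr\{|Z| > a\} \le \exp(-a^2/2)$ and the resulting union bound, can be checked numerically. The sketch below assumes $\lambda = 2\sigma A\{\log(pq)/n\}^{1/2}$, the choice under which $pq\exp(-\lambda^2 n/(8\sigma^2))$ equals $(pq)^{1-A^2/2}$; the numerical values of $p$, $q$, $n$, $\sigma$ and $A$ are illustrative assumptions, not settings from the paper.

```python
import math

def two_sided_normal_tail(a):
    """P(|Z| > a) for a standard normal Z, via the complementary error function."""
    return math.erfc(a / math.sqrt(2.0))

def union_bound(p, q, n, lam, sigma):
    """pq * exp(-lam^2 * n / (8 * sigma^2)), the bound on P(A^c)."""
    return p * q * math.exp(-lam ** 2 * n / (8.0 * sigma ** 2))

# Check the tail inequality on a grid of a in [0, 6].
grid = [0.05 * i for i in range(0, 121)]
tail_ok = all(two_sided_normal_tail(a) <= math.exp(-a * a / 2.0) for a in grid)

# Illustrative values (assumptions made for this example only).
p, q, n, sigma, A = 200, 200, 100, 1.0, 2.0
lam = 2.0 * sigma * A * math.sqrt(math.log(p * q) / n)  # assumed form of lambda
bound = union_bound(p, q, n, lam, sigma)
closed_form = (p * q) ** (1.0 - A ** 2 / 2.0)
```

With this choice of $\lambda$, `bound` and `closed_form` coincide up to rounding, which is the last equality in the probability display.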

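To prove the second inequality, we use the KKT conditions and obtain

$$\begin{cases}
(1/n)[X^T(Y - X\hat B)]_{jk} = 2\lambda\,\mathrm{sgn}(\hat\beta_{jk}) + 2\sum_{g\in\mathcal{G}}\lambda_g\hat\beta_{jk}/\|\hat B_g\|_2, & \hat\beta_{jk} \ne 0; \\[1ex]
(1/n)\bigl|[X^T(Y - X\hat B)]_{jk}\bigr| \le 2\lambda + 2\sum_{g\in\mathcal{G}}\lambda_g, & \hat\beta_{jk} = 0.
\end{cases}$$

From the first condition we can see that, for all $\hat\beta_{jk} \ne 0$,

$$\lambda \le \frac{1}{n}\bigl|[X^T(Y - X\hat B)]_{jk}\bigr|.$$

On the other hand, on $\mathcal{A}$ we have

$$\frac{1}{n}\bigl|[X^T(Y - X\hat B)]_{jk}\bigr| \le \frac{1}{n}\bigl|[X^T X(B^* - \hat B)]_{jk} + [X^T W]_{jk}\bigr| \le \frac{1}{n}\bigl|[X^T X(B^* - \hat B)]_{jk}\bigr| + \frac{1}{n}|X^T W|_\infty \le \frac{1}{n}\bigl|[X^T X(B^* - \hat B)]_{jk}\bigr| + \frac{\lambda}{2}.$$

Combining the above two inequalities, we have $\lambda/2 \le \frac{1}{n}\bigl|[X^T X(\hat B - B^*)]_{jk}\bigr|$. Therefore

$$M_1(\hat B) = |J_1(\hat B)| \le \frac{4}{\lambda^2 n^2}\sum_{jk\in J_1(\hat B)}\bigl|[X^T X(\hat B - B^*)]_{jk}\bigr|^2 \le \frac{4}{\lambda^2 n^2}\|X^T X(\hat B - B^*)\|_2^2.$$

This completes the proof of Lemma A.2. $\square$

Proof of Theorem 3.3.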

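By setting $B = B^*$ in (A.5) in Lemma A.2, we have that on event $\mathcal{A}$,

\begin{align}
\frac{1}{n}\|X(\hat B - B^*)\|_2^2 &\le 4\lambda\sum_{jk\in J_1(B^*)}|\hat\beta_{jk} - \beta^*_{jk}| + 4\sum_{g\in J_2(B^*)}\lambda_g\|\hat B_g - B^*_g\|_2 \tag{A.7} \\
&\le 4\lambda r^{1/2}\|(\hat B - B^*)_{J_1(B^*)}\|_2 + 4\Bigl(\sum_{g\in J_2(B^*)}\lambda_g^2\Bigr)^{1/2}\|(\hat B - B^*)_{J_2(B^*)}\|_2. \notag
\end{align}

The last inequality is by Cauchy-Schwarz. Specifically, we have

$$\Bigl(\sum_{jk\in J_1(B^*)}|\hat\beta_{jk} - \beta^*_{jk}|\Bigr)^2 = \Bigl(\sum_{jk\in J_1(B^*)}1\times|\hat\beta_{jk} - \beta^*_{jk}|\Bigr)^2 \le \Bigl(\sum_{jk\in J_1(B^*)}1^2\Bigr)\Bigl(\sum_{jk\in J_1(B^*)}|\hat\beta_{jk} - \beta^*_{jk}|^2\Bigr) = r\,\|(\hat B - B^*)_{J_1(B^*)}\|_2^2,$$

and

$$\Bigl(\sum_{g\in J_2(B^*)}\lambda_g\|\hat B_g - B^*_g\|_2\Bigr)^2 \le \Bigl(\sum_{g\in J_2(B^*)}\lambda_g^2\Bigr)\Bigl(\sum_{g\in J_2(B^*)}\|\hat B_g - B^*_g\|_2^2\Bigr).$$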

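Also by inequality (A.5), on event $\mathcal{A}$, we have

$$\lambda|\hat B - B^*|_1 + 2\sum_{g\in\mathcal{G}}\lambda_g\|\hat B_g - B^*_g\|_2 \le 4\lambda\sum_{jk\in J_1(B^*)}|\hat\beta_{jk} - \beta^*_{jk}| + 4\sum_{g\in J_2(B^*)}\lambda_g\|\hat B_g - B^*_g\|_2. \tag{A.8}$$

This is equivalent to

$$\lambda\sum_{jk\in J_1^c(B^*)}|\hat\beta_{jk} - \beta^*_{jk}| + 2\sum_{g\in J_2^c(B^*)}\lambda_g\|\hat B_g - B^*_g\|_2 \le 3\lambda\sum_{jk\in J_1(B^*)}|\hat\beta_{jk} - \beta^*_{jk}| + 2\sum_{g\in J_2(B^*)}\lambda_g\|\hat B_g - B^*_g\|_2.$$

Thus the condition in Assumption 1 holds with $\Delta = \hat B - B^*$ and $\rho_g = \lambda_g/\lambda$. Therefore,

$$\|(\hat B - B^*)_{J_1(B^*)}\|_2 \le \frac{\|X(\hat B - B^*)\|_2}{\kappa_1 n^{1/2}}, \qquad \|(\hat B - B^*)_{J_2(B^*)}\|_2 \le \frac{\|X(\hat B - B^*)\|_2}{\kappa_2 n^{1/2}}.$$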

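Plugging the above two inequalities into (A.7), we have

$$\frac{1}{n}\|X(\hat B - B^*)\|_2^2 \le \Biggl\{\frac{4\lambda r^{1/2}}{\kappa_1 n^{1/2}} + \frac{4\bigl(\sum_{g\in J_2(B^*)}\lambda_g^2\bigr)^{1/2}}{\kappa_2 n^{1/2}}\Biggr\}\|X(\hat B - B^*)\|_2 = \Biggl\{\frac{4\lambda r^{1/2}}{\kappa_1 n^{1/2}} + \frac{4\lambda\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2 n^{1/2}}\Biggr\}\|X(\hat B - B^*)\|_2,$$

which gives

$$\frac{1}{n}\|X(\hat B - B^*)\|_2^2 \le 16\lambda^2\Biggl\{\frac{r^{1/2}}{\kappa_1} + \frac{\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2}\Biggr\}^2.$$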
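The step from the first display to the second is a one-line cancellation. Writing $a = \|X(\hat B - B^*)\|_2$ and $K$ for the bracketed constant (both are shorthand introduced here for illustration only), it can be spelled out as:

```latex
\begin{align*}
\frac{1}{n}a^2 \le \frac{4\lambda}{n^{1/2}}\,K\,a
\quad\Longrightarrow\quad a \le 4\lambda n^{1/2} K
\quad\Longrightarrow\quad \frac{1}{n}a^2 \le 16\lambda^2 K^2,
\qquad
K = \frac{r^{1/2}}{\kappa_1} + \frac{\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2}.
\end{align*}
```

The first implication divides both sides by $a/n$ when $a > 0$; when $a = 0$ the final bound holds trivially.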

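Define $\|A\|_{2,1} = \sum_{g\in\mathcal{G}_2\cup\mathcal{G}_1}\|A_g\|_2$ for a matrix $A$, where each coefficient in $\mathcal{G}_1 = \mathcal{G}_L$ forms a group. Hence

\begin{align}
\|\hat B - B^*\|_{2,1} &= |\hat B - B^*|_1 + \sum_{g\in\mathcal{G}}\|\hat B_g - B^*_g\|_2 \tag{A.9} \\
&\le (c + 1)|\hat B - B^*|_1. \tag{A.10}
\end{align}

Then we have

\begin{align*}
(\lambda + \rho\lambda)\|\hat B - B^*\|_{2,1} &= \lambda\|\hat B - B^*\|_{2,1} + \rho\lambda\|\hat B - B^*\|_{2,1} \\
&\le (c + 1)\lambda|\hat B - B^*|_1 + \rho\lambda\|\hat B - B^*\|_{2,1} && \text{(by (A.10))} \\
&= (c + 1)\lambda|\hat B - B^*|_1 + \rho\lambda|\hat B - B^*|_1 + \sum_{g\in\mathcal{G}}\rho\lambda\|\hat B_g - B^*_g\|_2 && \text{(by (A.9))} \\
&\le (c + 2)\lambda|\hat B - B^*|_1 + \sum_{g\in\mathcal{G}}\lambda_g\|\hat B_g - B^*_g\|_2 \\
&\le (c + 2)\lambda|\hat B - B^*|_1 + 2\sum_{g\in\mathcal{G}}\lambda_g\|\hat B_g - B^*_g\|_2 \\
&\le (c + 2)\Bigl[\lambda|\hat B - B^*|_1 + 2\sum_{g\in\mathcal{G}}\lambda_g\|\hat B_g - B^*_g\|_2\Bigr].
\end{align*}

By (A.8) and the last inequality in (A.7) we obtain

\begin{align*}
\frac{1 + \rho}{c + 2}\lambda\|\hat B - B^*\|_{2,1} &\le \lambda|\hat B - B^*|_1 + 2\sum_{g\in\mathcal{G}}\lambda_g\|\hat B_g - B^*_g\|_2 \\
&\le 4\lambda\sum_{jk\in J_1(B^*)}|\hat\beta_{jk} - \beta^*_{jk}| + 4\sum_{g\in J_2(B^*)}\lambda_g\|\hat B_g - B^*_g\|_2 \\
&\le 4\lambda r^{1/2}\|(\hat B - B^*)_{J_1(B^*)}\|_2 + 4\Bigl(\sum_{g\in J_2(B^*)}\lambda_g^2\Bigr)^{1/2}\|(\hat B - B^*)_{J_2(B^*)}\|_2 \\
&\le \Biggl\{\frac{4\lambda r^{1/2}}{\kappa_1 n^{1/2}} + \frac{4\bigl(\sum_{g\in J_2(B^*)}\lambda_g^2\bigr)^{1/2}}{\kappa_2 n^{1/2}}\Biggr\}\|X(\hat B - B^*)\|_2 \\
&\le \Biggl\{\frac{4\lambda r^{1/2}}{\kappa_1 n^{1/2}} + \frac{4\lambda\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2 n^{1/2}}\Biggr\}\,4n^{1/2}\lambda\Biggl\{\frac{r^{1/2}}{\kappa_1} + \frac{\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2}\Biggr\} \\
&= 16\lambda^2\Biggl\{\frac{r^{1/2}}{\kappa_1} + \frac{\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2}\Biggr\}^2.
\end{align*}

Therefore,

$$\|\hat B - B^*\|_{2,1} \le \frac{16(c + 2)\lambda}{1 + \rho}\Biggl\{\frac{r^{1/2}}{\kappa_1} + \frac{\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2}\Biggr\}^2 = \frac{32(c + 2)\sigma A}{1 + \rho}\Bigl\{\frac{\log(pq)}{n}\Bigr\}^{1/2}\Biggl\{\frac{r^{1/2}}{\kappa_1} + \frac{\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2}\Biggr\}^2.$$

It is trivial that $|\hat B - B^*|_1 \le \|\hat B - B^*\|_{2,1}$.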

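From (A.6) in Lemma A.2, we obtain

$$M_1(\hat B) \le \frac{4}{\lambda^2 n^2}\|X^T X(\hat B - B^*)\|_2^2 \le \frac{4\psi_{\max}}{\lambda^2 n}\|X(\hat B - B^*)\|_2^2,$$

where the second inequality is from

$$\|[X^T X(\hat B - B^*)]_{\cdot k}\|_2^2 = (\hat B - B^*)_{\cdot k}^T X^T(XX^T)X(\hat B - B^*)_{\cdot k} \le n\psi_{\max}\|X(\hat B - B^*)_{\cdot k}\|_2^2$$

for each $1\le k\le q$. By the upper bound of $\|X(\hat B - B^*)\|_2^2$ we have

$$M_1(\hat B) \le 64\psi_{\max}\Biggl\{\frac{r^{1/2}}{\kappa_1} + \frac{\bigl(\sum_{g\in J_2(B^*)}\rho_g^2\bigr)^{1/2}}{\kappa_2}\Biggr\}^2. \qquad\square$$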

A.4

Proof of Proposition 4.1

To prove Proposition 4.1, we first show the following lemma.

Lemma A.3 For every pair $(j, k)$, the sequence of coordinate descent estimates $\{\hat\beta_{jk}^{(m)} : m = 0, 1, 2, \cdots\}$ obtained at each step $m$ by solving the equation

$$\hat\beta_{jk} = \frac{\mathrm{sgn}(S_{jk})\,(|S_{jk}| - n\lambda_{jk})_+}{\|x_j\|_2^2 + n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_g\|_2 > 0\}}\lambda_g/(\|\hat B_{g-(jk)}\|_2^2 + |\hat\beta_{jk}|^2)^{1/2}} \tag{A.11}$$

for $\hat\beta_{jk}$ while fixing the others converges to a global minimizer of the objective function (2) in the main text.

First, it is easy to see that an exact solution of (A.11) exists. If $\|\hat B_{g-(jk)}\|_2 = 0$, the closed-form solution of (A.11) is just the lasso solution. If $\|\hat B_{g-(jk)}\|_2 \ne 0$, then the right-hand side of (A.11) is a continuous function of $\hat\beta_{jk}$, which is monotone when $\hat\beta_{jk} > 0$ or $\hat\beta_{jk} < 0$, bounded away from zero when $\hat\beta_{jk} = 0$, and bounded away from $\pm\infty$ when $\hat\beta_{jk}$ goes to $\pm\infty$, and therefore must intersect either $y = \hat\beta_{jk}$ or $y = -\hat\beta_{jk}$. Therefore an exact solution of (A.11) exists.

Wu and Lange (2008) proved the convergence to a minimal point of the lasso objective function for the greedy coordinate descent algorithm. In a very similar way, one can extend the proof to the multivariate sparse group lasso objective function and the coordinate descent algorithm that iteratively solves for the exact solution of (A.11). Because of the significant overlap with Wu and Lange (2008), we omit the proof of Lemma A.3 here.
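The existence argument for (A.11) suggests a simple way to compute the exact solution for one coordinate. The sketch below uses bisection, which is a choice made for this illustration and not necessarily the solver used by the authors. It assumes every group containing $\beta_{jk}$ has $\|\hat B_{g-(jk)}\|_2 > 0$ and that $\|x_j\|_2^2 > 0$; the function name is hypothetical.

```python
import math

def solve_a11(s_jk, x_norm_sq, n, lam_jk, groups, tol=1e-12):
    """Solve (A.11) for one coordinate by bisection.

    groups -- list of (lambda_g, rest_sq) pairs, where rest_sq is
              ||B_g without beta_jk||_2^2 and is assumed strictly positive.
    """
    target = max(abs(s_jk) - n * lam_jk, 0.0)
    if target == 0.0:
        return 0.0

    def lhs(b):
        # b times the denominator of (A.11); strictly increasing for b >= 0.
        return b * (x_norm_sq + n * sum(lg / math.sqrt(rs + b * b) for lg, rs in groups))

    # lhs(0) = 0 < target, and lhs(target / x_norm_sq) >= target because the
    # group terms are non-negative, so the root lies in [lo, hi].
    lo, hi = 0.0, target / x_norm_sq
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    # The solution carries the sign of S_jk.
    return math.copysign(0.5 * (lo + hi), s_jk)
```

Multiplying (A.11) through by its denominator turns the fixed-point equation into a monotone root-finding problem in $|\hat\beta_{jk}|$, which is why bisection is safe here.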

Proof of Proposition 4.1.

Denote by $\{\hat\beta_{jk}^{(m)}\}$ the sequence of estimates of the $jk$th coordinate from the coordinate descent algorithm that solves equation (A.11) in each step indexed by $m$. Starting from $\hat B^{(m-1)}$, denote by $\hat\beta_{jk}^{MCD(m)}$ the one-step update of the $jk$th coordinate by the mixed coordinate descent algorithm. We prove in the following that

$$|\hat\beta_{jk}^{(m)}| \le |\hat\beta_{jk}^{MCD(m)}| \le |\hat\beta_{jk}^{(m-1)}| \tag{A.12}$$

with equality holding only when $|\hat\beta_{jk}^{(m)}| = |\hat\beta_{jk}^{(m-1)}|$.

First, if $\hat\beta_{jk}^{(m)}$ is updated by (3) in the main text, then (A.12) is automatically satisfied since

$$0 = |\hat\beta_{jk}^{(m)}| = |\hat\beta_{jk}^{MCD(m)}| \le |\hat\beta_{jk}^{(m-1)}|.$$

Otherwise, if $\hat\beta_{jk}^{(m)}$ is updated by (II) or (III) or (IV) in Section 4 of the main text, then it must be one of the following cases.

(i) If

$$\hat\beta_{jk}^{(m-1)} < \hat\beta_{jk}^{(m)} = -\frac{(|S_{jk}| - n\lambda_{jk})_+}{\|x_j\|_2^2 + n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_g^{(m-1)}\|_2 > 0\}}\lambda_g/(\|\hat B_{g-(jk)}^{(m-1)}\|_2^2 + |\hat\beta_{jk}^{(m)}|^2)^{1/2}} < 0,$$

then

$$\hat\beta_{jk}^{MCD(m)} = -\frac{(|S_{jk}| - n\lambda_{jk})_+}{\|x_j\|_2^2 + n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_g^{(m-1)}\|_2 > 0\}}\lambda_g/(\|\hat B_{g-(jk)}^{(m-1)}\|_2^2 + |\hat\beta_{jk}^{(m-1)}|^2)^{1/2}} < \hat\beta_{jk}^{(m)}.$$

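From the proof of Theorem 3.1, $\hat\beta_{jk}^{(m-1)} < \hat\beta_{jk}^{(m)}$ if and only if

$$\frac{\partial L(B)}{\partial\beta_{jk}}\Big|_{\hat\beta_{jk}^{(m-1)}} = -S_{jk}/n + \|x_j\|_2^2\,\hat\beta_{jk}^{(m-1)}/n - \lambda_{jk} + \sum_{g\in G_{jk}}\lambda_g\hat\beta_{jk}^{(m-1)}/\|\hat B_g^{(m-1)}\|_2 < 0.$$

Notice that the above is also the partial derivative of $L_{net}(B)$ with respect to $\beta_{jk}$ evaluated at $\hat\beta_{jk}^{(m-1)}$, where $L_{net}(B)$ is the elastic net objective function

$$L_{net}(B) = \frac{1}{2n}\|Y - XB\|_2^2 + \sum_{jk}\lambda_{jk}|\beta_{jk}| + \sum_{g\in\mathcal{G}}\lambda_g\|B_g\|_2^2/(2\|\hat B_g^{(m-1)}\|_2),$$

holding $\|\hat B_g^{(m-1)}\|_2$ as constants and constraining $\beta_{jk} < 0$.

Following exactly the same argument as in the proof of Theorem 3.1, we can prove that $\partial L_{net}(B)/\partial\beta_{jk}\big|_{\hat\beta_{jk}^{(m-1)}} < 0$ if and only if $\hat\beta_{jk}^{(m-1)}$ is less than the solution of $\partial L(B)/\partial\beta_{jk} = 0$ with the constraint $\beta_{jk} < 0$, which is the solution of

$$-S_{jk}/n + \|x_j\|_2^2\,\beta_{jk}/n - \lambda_{jk} + \sum_{g\in G_{jk}}\lambda_g\beta_{jk}/\|\hat B_g^{(m-1)}\|_2 = 0$$

under $\beta_{jk} < 0$, given by

$$-\frac{(|S_{jk}| - n\lambda_{jk})_+}{\|x_j\|_2^2 + n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_g^{(m-1)}\|_2 > 0\}}\lambda_g/\|\hat B_g^{(m-1)}\|_2} = \hat\beta_{jk}^{MCD(m)}.$$

Therefore, we have $\hat\beta_{jk}^{(m-1)} < \hat\beta_{jk}^{MCD(m)} < \hat\beta_{jk}^{(m)} < 0$.

(ii) If

$$\hat\beta_{jk}^{(m-1)} > \hat\beta_{jk}^{(m)} = \frac{(|S_{jk}| - n\lambda_{jk})_+}{\|x_j\|_2^2 + n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_g^{(m-1)}\|_2 > 0\}}\lambda_g/(\|\hat B_{g-(jk)}^{(m-1)}\|_2^2 + |\hat\beta_{jk}^{(m)}|^2)^{1/2}} \ge 0,$$

then with a similar argument we have that $\hat\beta_{jk}^{(m-1)} > \hat\beta_{jk}^{MCD(m)} > \hat\beta_{jk}^{(m)} \ge 0$.

(iii) If $\hat\beta_{jk}^{(m-1)} = \hat\beta_{jk}^{(m)}$, the mixed coordinate descent algorithm performs an exact update and we have $\hat\beta_{jk}^{(m-1)} = \hat\beta_{jk}^{MCD(m)} = \hat\beta_{jk}^{(m)}$.

In summary, we have (A.12).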

Lemma A.3 shows that the sequence of estimates of the $jk$th coordinate $\{\hat\beta_{jk}^{(m)}\}$, iteratively updated by solving (A.11), converges to a global minimizer regardless of the value of the starting point. For each term in the sequence $\{\hat\beta_{jk}^{MCD(l)}\}$, suppose one can construct a sequence $\{\hat\beta_{jk}^{(m)}\}$ starting from $\hat\beta_{jk}^{MCD(l)}$; then those sequences all converge to minimizers (if the minimizer is not unique, e.g. for an objective function that is not strictly convex) with the same minimum value. Thus from (A.12) we know that $\{\hat\beta_{jk}^{MCD(l)}\}$ converges to a global minimizer with the same minimum value. $\square$

Figure A.1 illustrates coordinate updates by the standard coordinate descent and the mixed coordinate descent algorithms on a contour surface of a two-dimensional objective function. Given the same starting values, one step update on one coordinate from the mixed coordinate descent algorithm is always bounded between the previous and current values from the standard coordinate descent algorithm.
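The "bounded between" behaviour can be illustrated for a single coordinate with $S_{jk} > 0$ and one group. In the sketch below the exact update is approximated by repeated substitution into (A.11), while the mixed coordinate descent (MCD) update plugs the previous value into the denominator once. All numerical values and function names are illustrative assumptions, not taken from the paper.

```python
import math

def rhs_a11(b_inside, target, x_norm_sq, n, lam_g, rest_sq):
    """Right-hand side of (A.11) for S_jk > 0 and one group, with |beta_jk| = b_inside.

    target  -- (|S_jk| - n * lambda_jk)_+
    rest_sq -- ||B_g without beta_jk||_2^2, assumed strictly positive
    """
    return target / (x_norm_sq + n * lam_g / math.sqrt(rest_sq + b_inside ** 2))

def exact_update(target, x_norm_sq, n, lam_g, rest_sq, iters=200):
    """Approximate the exact solution of (A.11) by repeated substitution."""
    b = target / x_norm_sq  # lasso value, an upper bound on the solution
    for _ in range(iters):
        b = rhs_a11(b, target, x_norm_sq, n, lam_g, rest_sq)
    return b

def mcd_update(b_prev, target, x_norm_sq, n, lam_g, rest_sq):
    """One MCD step: evaluate the group norm at the previous coordinate value."""
    return rhs_a11(b_prev, target, x_norm_sq, n, lam_g, rest_sq)

# Illustrative values (assumptions made for this example only).
target, x_norm_sq, n, lam_g, rest_sq = 4.0, 10.0, 10, 0.2, 1.0
b_exact = exact_update(target, x_norm_sq, n, lam_g, rest_sq)
b_mcd_from_above = mcd_update(1.0, target, x_norm_sq, n, lam_g, rest_sq)
b_mcd_from_below = mcd_update(0.1, target, x_norm_sq, n, lam_g, rest_sq)
```

In both runs the MCD value lies between the previous value and the exact update: starting from 1.0 it lands above `b_exact`, and starting from 0.1 it lands below `b_exact`.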

A.5

Comparison of computing costs

The computational cost of the coordinate descent algorithm with inner iterations is much higher than that of our mixed coordinate descent algorithm. Figure A.2 compares the two algorithms. The group structure of the regression coefficient matrix is set to be (b) in Figure 1 in the main text. In Figure A.2, the mixed coordinate descent algorithm converges to a minimizer after 500 iterations, while the coordinate descent algorithm with inner iterations converges after 150000 iterations.


Figure A.1: Illustration of coordinate updates by the standard coordinate descent and the mixed coordinate descent algorithms on a contour surface of a two-dimensional objective function.

Figure A.2: Decreasing of the objective function against the iteration numbers. The gray line is for the proposed mixed coordinate descent (MCD) algorithm without inner iterations of updating (A.11) and the black line is for the coordinate descent (CD) algorithm with inner iterations.

B

Comparison between univariate and multivariate approaches

Figures B.1 and B.2 illustrate the comparisons between univariate approaches and multivariate approaches. The true regression coefficient matrix takes a $G_{XY}\cup G_X$ group structure. It can be seen that when different response variables have a similar sparsity to the predictors, the multiple univariate lasso (using different λ values for different response variables) and the multivariate lasso (using the same λ value for all response variables) have similar performance on variable selection. The multiple univariate sparse group lasso approach has a slightly better variable selection performance than the multiple univariate lasso. The proposed multivariate sparse group lasso yields the best variable selection result by borrowing information from other response variables within the same group. It also has the smallest prediction error.

Figure B.1: Heatmaps of coefficient matrices. (a) True $B^*$; (b) the multiple univariate lasso; (c) the multiple univariate sparse group lasso; (d) the multivariate lasso; (e) the multivariate sparse group lasso. The true $B^*$ has a “not all in all out” and X+XY group structure with p = q = 200, n = 100, ρ = 0.5.

Figure B.2: Comparison between multiple-univariate and multivariate approaches from 100 simulated data sets (N = 100, P = 200, Q = 200, ρ = 0.5). Panels show False Positive, False Negative, Sensitivity (%), Specificity (%) and Prediction Error. “uni L” – the multiple univariate lasso; “uni SGL” – the multiple univariate sparse group lasso; “multi L” – the multivariate lasso; “multi SGL” – the multivariate sparse group lasso with an XY group structure on the coefficient matrix.

C

More simulations

Figures C.1 to C.5 show the variable selection and prediction effects in some other simulation settings, such as with different autocorrelation coefficient values or with a true “all-in-all-out” group structure.

D

Network structure for the yeast eQTL data analysis

Figure D.1 shows the network constructed from the multivariate sparse group lasso method. The top association signals are highlighted in dark lines and are also reported in Tables 2 and 3 in the main text.

References

Wu, T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2:224–244.

Figure C.1: More simulation results, “not all in all out” cases with n = 150, p = q = 200 and ρ = 0.2. Panels: (a) X group structure; (b) XY group structure; (c) X+XY group structure; (d) overlapping group structure. Each panel shows False Positive, False Negative, Sensitivity (%), Specificity (%) and Prediction Error by method.

Figure C.2: More simulation results, “not all in all out” cases with n = 150, p = q = 200 and ρ = 0.8. Panels: (a) X group structure; (b) XY group structure; (c) X+XY group structure; (d) overlapping group structure. Each panel shows False Positive, False Negative, Sensitivity (%), Specificity (%) and Prediction Error by method.

Figure C.3: More simulation results, “all in all out” cases with n = 150, p = q = 200 and ρ = 0.5. Panels: (a) X group structure, all in all out; (b) XY group structure, all in all out; (c) X+XY group structure, all in all out; (d) overlapping group structure, all in all out. Each panel shows False Positive, False Negative, Sensitivity (%), Specificity (%) and Prediction Error by method.

Figure C.4: More simulation results, “all in all out” cases with n = 150, p = q = 100 and ρ = 0.5. Panels: (a) X group structure, all in all out; (b) XY group structure, all in all out; (c) X+XY group structure, all in all out; (d) overlapping group structure, all in all out. Each panel shows False Positive, False Negative, Sensitivity (%), Specificity (%) and Prediction Error by method.

Figure C.5: Heatmaps of coefficient matrices, selection effects, in panels (a) to (h). “Not all in all out” XY group structure with n = 100, p = 200, q = 200, and ρ = 0.5. (a) $B^*$; (b) $\hat B_{L}$; (c) $\hat B_{LX}$; (d) $\hat B_{LXY}$;

Figure D.1: Network constructed from the multivariate sparse group lasso method. The network structure is between gene expressions grouped in the mitogen-activated protein kinase (MAPK), cell cycle, cancer and ribosome pathways, and markers grouped in 45 gene groups. Gray lines connect expression-marker pairs with non-zero $\hat\beta_{jk}$. Dark lines are for the top 10 associations in each pathway. The strength of these top associations is indicated by the width of the dark lines. The dotted circles indicate the overlapping pathway group structure.
