Web-based Supplementary Materials
for
Multivariate sparse group lasso for the multivariate multiple linear
regression with an arbitrary group structure
by
Yanming Li, Bin Nan and Ji Zhu
A
Proofs of technical results
A.1
Some matrix algebra
Lemma A.1 Let
L0(B) = 1 2kY − XBk 2 2 = 1 2tr(Y − XB) T (Y − XB) = 12 n X i=1 q X k=1 (yik− p X j=1 xijβjk)2. Then ∂L0(B)/∂βjk = −xT j(Y − XB)·k = −Sjk+ kxjk22βjk, where Sjk = xTj(Y − XB(−j))·k.
A.2
Proof of Theorem 3.1
Following Lemma A.1,
L(B) = 1 nL 0(B) + X 1≤j≤p;1≤k≤q λjk|βjk| + X g∈G λgkBgk2.
For a coordinate βjkin B, denote Gjk = {g : βjk ∈ Bg ∈ G}, then when L(B) is differentiable at βjk, ∂L(B) ∂βjk = −Sjk/n + kxjk 2 2βjk/n + λjksign(βjk) + X Gjk λgβjk/kBgk2.
If βjk> 0, then for any Bg ∈ Gjk, kBgk2 > 0, and
∂L(B) ∂βjk = −Sjk/n + kxjk 2 2βjk/n + λjk+ X Gjk λgβjk/kBgk2.
Notice that ∂L(B)/∂βjk ≥ 0 if and only if
βjk≥ Sjk− nλjk kxjk22+ n P Gjkλg/kBgk2 △ = ˜βjk+,
and ∂L(B)/∂βjk< 0 if and only if βjk< ˜βjk+. So fixing all other coordinates of B, if βjk > 0,
then L(B) is monotone increasing with respect to βjk when βjk > ˜βjk+ and decreasing when
βjk< ˜βjk+. Therefore, if ˆβjk,min+ minimizes L(B) with respect to βjk when βjk > 0, then
ˆ βjk,min+ = ( Sjk−nλjk kxjk22+n P Gjkλg/k ˆBgk2 , if Sjk> nλjk 0, otherwise. (A.1)
Similarly, if ˆβjk,min− minimizes L(B) with respect to βjk when βjk < 0, then
ˆ
βjk,min− =
( Sjk+nλjk
kxjk22+nPGjkλg/k ˆBgk2, if Sjk< −nλjk
0, otherwise. (A.2)
Based on both (A.1) and (A.2), when βjk6= 0, the minimizer ˆβjk can be summarized into
a unified form: ˆ βjk= sgn(Sjk) (|Sjk| − nλjk)+ kxjk22+ n P {g∈Gjk}λg/k ˆBgk2 .
When βjk = 0, we discuss two separate cases according to weather all the groups
con-taining βjk are zero groups or not. First, if none of the groups in Gjk is a zero group, then
ˆ
βjk needs to satisfy the subgradient equation
Sjk/n − kxjk22βjk/n = λjku +ˆ
X
Gjk ˆ
with |u| ≤ 1. Then a similar discussion to the case when βjk 6= 0 based on whether u > 0 or u ≤ 0 in (A.3) yields the same expression:
ˆ βjk= sgn(Sjk) (|Sjk| − nλjk)+ kxjk22+ n P {g∈Gjk}λg/k ˆBgk2 .
Second, if some of the groups in Gjk are zero groups, then neither | · | nor k · k2 in L(B)
is differentiable at zero. Let Gjk⊗ = {g : βjk ∈ Bg ∈ G, kBkg > 0}. Then for any zero group
Bg0 in Gjk and for all βjk∈ Bg0, ˆβjk needs to satisfy the subgradient equation:
Sjk/n − kxjk22βjk/n = λjku + λg0vjkˆ +
X
Gjk⊗
λgβjkˆ /k ˆBgk2, (A.4)
where u is the subgradient scalar for the L1 norm | · | and v is the subgradient vector for
the L2 norm k · k2 with constrains |u| ≤ 1 and kvk2 ≤ 1, and λg0 is the tuning parameter
associated with Bg0. It can be seen directly (similar to the case of βjk = 0) that (A.4) yields
ˆ Bg0= 0 if s X {jk: βjk∈Bg0} (|Sjk|/n − λjk)2+ ≤ λg0.
A.3
Proof of Theorem 3.3
Lemma A.2 Under the assumptions in Theorem 3.3, for any B ∈ Rp×q, with probability at
least 1 − (pq)1−A2/2 , 1 nkX(B ∗ − ˆB)k2 2+ λ| ˆB− B|1+ 2 X g∈G λgk ˆBg− Bgk2 (A.5) ≤ n1kX(B∗− B)k22+ 4λ X jk∈J1(B) | ˆβjk− βjk| + 4 X g∈J2(B) λgk ˆBg − Bgk2, M1( ˆB) ≤ 4 λ2n2 X jk∈J1( ˆB) |[XTX ( ˆB− B∗)]jk|2 ≤ 4 λ2n2kX TX ( ˆB− B∗)k2 2. (A.6)
Proof. For any B ∈ Rp×q, we have 1 nkY − X ˆBk 2 2+ 2λ| ˆB|1+ X g∈G 2λgk ˆBgk2 ≤ 1 nkY − XBk 2 2+ 2λ|B|1+ X g∈G 2λgkBgk2.
Plugging Y = XB∗+ W into the above inequality, we obtain
1 nkX(B ∗ − ˆB)k2 2 ≤ 1 nkX(B ∗ − B)k22+ 2 n n X i=1 q X k=1 [X( ˆB− B)]ikωik +2λ(|B|1− | ˆB|1) +X g∈G 2λg(kBgk2− k ˆBgk2),
where [X( ˆB− B)]ik denotes the ikth element of the product matrix X( ˆB− B) and ωik is
the ikth element of W. Notice that n X i=1 q X k=1 [X( ˆB− B)]ikωik = n X i=1 ( q X k=1 " p X j=1 xij( ˆβjk− βjk) # ωik ) ≤ max 1≤k≤q,1≤j≤p n X i=1 xijωik q X k=1 p X j=1 | ˆβjk− βjk| = |XTW|∞| ˆB− B|1 where |XTW
|∞= max1≤k≤q,1≤j≤p|Pni=1xijωik| is the maximum absolute value of entries of
XTW
.
Let Vjk = xT
j · wk, 1 ≤ j ≤ p, 1 ≤ k ≤ q. Since wk ∼ N(0, σ2kIq) for 1 ≤ q ≤ Q,
then var(Vjk) = xT
pcov(wq)xp = nσ2q. Therefore (nσq2)−1/2Vjk are standard normal random
variables. Consider the random event
A = 2n|XTW
|∞≤ λ
. It is easy to see that the complement of A can be expressed as
Ac = At least one |Vjk| > λn 2 , 1 ≤ j ≤ p, 1 ≤ k ≤ q .
Denote B(0, λn/2) to be a 1-dimensional ball centered at 0 and with radius λn/2, then
P r{Ac} ≤ p X j=1 q X k=1 P r Vjk ∈ B/ 0,λn 2 = p q X k=1 P r (nσk2)−1/2Vjk ∈ B/ 0,λn 1/2 2σk ≤ pq × P r |Z| ≥ λn 1/2 2σ ≤ pq exp −λ 2n 8σ2 = (pq)1−A2/2,
where Z is a standard normal random variable, and the last inequality is obtained by
P r{|Z| > a} ≤ exp (−a2/2). But on event A, we have
1 nkX(B ∗ − ˆB)k2 2+ λ| ˆB− B|1+ 2 X g∈G λgk ˆBg− Bgk2 ≤ n1kX(B∗− B)k22+ 2λ(| ˆB− B|1+ |B|1− | ˆB|1) +2X g∈G λg(k ˆBg− Bgk2+ kBgk2− k ˆBgk2) ≤ n1kX(B∗− B)k22+ 4λ X jk∈J1(B) | ˆβjk− βjk| + 4 X g∈J2(B) λg(k ˆBg− Bgk2.
This completes the proof of the first inequality in Lemma A.2.
To prove the second inequality, we use the KKT conditions and obtain ( (1/n)[XT (Y − X ˆB)]jk = 2λsgn( ˆβjk) + 2P g∈Gλgβjkˆ /k ˆBgk2, ˆβjk6= 0; (1/n)|[XT (Y − X ˆB)]jk| ≤ 2λ + 2P g∈Gλg, βˆjk= 0.
From the first condition we can see that ∀ ˆβjk6= 0,
λ ≤ 1
n|[X
T
(Y − X ˆB)]jk|.
On the other hand, we have on A 1 n|[X T (Y − X ˆB)]jk| ≤ 1 n|[X TX (B∗− ˆB)]jk+ [XTW ]jk| ≤ n1|[XTX (B∗− ˆB)]jk| + 1 n|X TW |∞ ≤ n1|[XTX (B∗− ˆB)]jk| +λ 2.
Then combine the above two inequalities, we have λ 2 ≤ 1 n|[X TX( ˆB − B∗)]jk|. Therefore M1( ˆB) = |J1( ˆB)| ≤ 4 λ2n2 X jk∈J1( ˆB) |[XTX ( ˆB− B∗)]jk|2 ≤ 4 λ2n2kX TX ( ˆB− B∗)k2 2.
This completes the proof of Lemma A.2. Proof of Theorem 3.3.
By setting B = B∗ in (A.5) in Lemma A.2, we have that on event A,
1 nkX( ˆB− B ∗ )k22 ≤ 4λ X jk∈J1(B∗) | ˆβjk− βjk| + 4∗ X g∈J2(B∗) λgk ˆBg− B∗gk2 (A.7) ≤ 4λr1/2k( ˆB− B∗) J1(B∗)k2+ 4 X g∈J2(B∗) λ2 g 1/2 k( ˆB− B∗)J 2(B∗)k2.
The last inequality is by Cauchy-Schwarz. Specifically, we have X jk∈J1(B∗) | ˆβjk− βjk|∗ 2 = X jk∈J1(B∗) 1 × | ˆβjk− βjk|∗ 2 ≤ X jk∈J1(B∗) 12 X jk∈J1(B∗) | ˆβjk− βjk|∗ 2 = rk( ˆB− B∗)J 1(B∗)k 2 2, and X g∈J2(B∗) λgk ˆBg− B∗gk2 2 ≤ X g∈J2(B∗) λ2g X g∈J2(B∗) k ˆBg − B∗gk2 2 .
Also by inequality (A.5), on event A, we have
λ| ˆB− B∗|1+ 2X g∈G λgk ˆBg− B∗gk2 (A.8) ≤ 4λ X jk∈J1(B∗) | ˆβjk− βjk| + 4∗ X g∈J2(B∗) λgk ˆBg− B∗gk2. This is equivalent to λ X jk∈Jc 1(B∗) | ˆβjk− βjk| + 2∗ X g∈Jc 2(B∗) λgk ˆBg− B∗gk2 ≤ 3λ X jk∈J1(B∗) | ˆβjk− βjk| + 2∗ X g∈J2(B∗) λgk ˆBg− B∗gk2.
Thus the condition in Assumption 1 holds with ∆ = ˆB− B∗ and ρg = λg/λ. Therefore, k( ˆB− B∗)J 1(B∗)k2 ≤ kX( ˆB− B∗)k2 κ1n1/2 , k( ˆB− B ∗) J2(B∗)k2 ≤ kX( ˆB− B∗)k2 κ2n1/2 .
Plugging the above two inequalities into (A.7), we have 1 nkX( ˆB− B ∗ )k22 ≤ 4λr1/2 κ1n1/2 + 4P g∈J2(B∗)λ 2 g 1/2 κ2n1/2 kX( ˆB− B ∗ )k2 = 4λr1/2 κ1n1/2 + 4λP g∈J2(B∗)ρ 2 g 1/2 κ2√n kX( ˆB− B ∗ )k2, which gives 1 nkX( ˆB− B ∗ )k22 ≤ 16λ2 r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2 2 .
Define kAk2,1 =Pg∈G2∪G1kAk2 for a matrix A, where each coefficient in G1 = GL forms
a group. Hence k ˆB− B∗k2,1 = | ˆB− B∗|1+X g∈G k ˆBg− B∗gk2 (A.9) ≤ (c + 1)| ˆB− B∗|1. (A.10) Then we have (λ + ρλ)k ˆB− B∗k2,1 = λk ˆB− B∗k2,1+ ρλk ˆB− B∗k2,1 ≤ (c + 1)λ| ˆB− B∗|1+ ρλk ˆB− B∗k2,1 (by (A.10)) = (c + 1)λ| ˆB− B∗|1+ ρλ| ˆB− B∗|1+X g∈G ρλk ˆBg− B∗gk2 (by (A.9)) ≤ (c + 2)λ| ˆB− B∗|1+X g∈G λgk ˆBg− B∗ gk2 ≤ (c + 2)λ| ˆB− B∗|1+ 2X g∈G λgk ˆBg− B∗gk2 ≤ (c + 2) " λ| ˆB− B∗|1+ 2X g∈G λgk ˆBg− B∗gk2 #
By (A.8) and the last inequality in (A.7) we obtain 1 + ρ c + 2λk ˆB− B ∗ k2,1 ≤ λ| ˆB− B∗|1+ 2X g∈G λgk ˆBg − B∗ gk2 ≤ 4λ X jk∈J1(B∗) | ˆβjk− βjk| + 4∗ X g∈J2(B∗) λgk ˆBg− B∗gk2 ≤ 4λr1/2k( ˆB− B∗) J1(B∗)k2+ 4 X g∈J2(B∗) λ2 g 1/2 k( ˆB− B∗)J 2(B∗)k2 ≤ 4λr1/2 κ1n1/2 + 4P g∈J2(B∗)λ 2 g 1/2 κ2n1/2 kX( ˆB− B ∗ )k2 ≤ 4λr1/2 κ1n1/2 + 4λP g∈J2(B∗)ρ 2 g 1/2 κ2n1/2 4n 1/2λ r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2 = 16λ2 r1/2 κ1 + P g∈5J2(B∗)ρ 2 g 1/2 κ2 2 .5 Therefore, k ˆB− B∗k2,1 ≤ 16(c + 2)λ 1 + ρ r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2 2 = 32(c + 2)σA 1 + ρ log(pq) n 1/2 r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2 2 . It is trivial that | ˆB − B∗| 1 ≤ k ˆB − B∗k2,1.
From (A.6) in Lemma A.2, we obtain
M1( ˆB) ≤ 4 λ2n2kX TX ( ˆB− B∗)k2 2 ≤ 4ψmax λ2n kX( ˆB− B ∗ )k22,
where the second inequality is from k[XTX ( ˆB− B∗)]·kk2 2 = ( ˆB− B ∗ )T ·kX T (XXT )X( ˆB− B∗)·k ≤ nψmaxkX( ˆB− B∗)·kk2 2
for each 1 ≤ k ≤ q. By the upper bound of kX( ˆB− B∗)k2
2 we have M1( ˆB) ≤ 64ψmax r1/2 κ1 + P g∈J2(B∗)ρ 2 g 1/2 κ2 2 .
A.4
Proof of Proposition 4.1
To prove Proposition 4.1, we first show the following lemma.
Lemma A.3 For every pair of (j, k), the sequence of coordinate decent estimates { ˆβjk(m) :
m = 0, 1, 2, · · · } obtained at each step m by solving the following equation ˆ βjk = sgn(Sjk) (|Sjk| − nλjk)+ kxjk22+ n P {g∈G: βjk∈Bg, k ˆBgk2>0}λg/(k ˆBg−(jk)k 2 2+ | ˆβjk|2)1/2 (A.11)
for ˆβjk while fixing others converges to a global minimizer of the objective function (2) in the
main text.
First, it is easy to see that the exact solution of (A.11) exists. If k ˆBg−(jk)k2 = 0, the
close form solution of (A.11) is just the lasso solution. If k ˆBg−(jk)k2 6= 0, then the right hand
side of (A.11) is a continuous function of ˆβjk, which is monotone when ˆβjk > 0 or ˆβjk < 0,
bounded away from zero when ˆβjk = 0, and bounded away from ±∞ when ˆβjk goes to ±∞,
therefore must intersect with either y = ˆβjk or y = − ˆβjk. Therefore an exact solution of
Wu and Lange (2008) proved the convergence to a minimal point of the lasso objective function for the greedy coordinate descent algorithm. In a very similar way, one can extend the proof to the multivariate sparse group lasso objective function and the coordinate descent algorithm of iteratively solving for the exact solution of (A.11). Due to significant overlapping with Wu and Lange (2008), we omit the proof of Lemma A.3 here.
Proof of Proposition 4.1.
Denote { ˆβjk(m)} the sequence of estimates of jkth coordinate from the coordinate descent
algorithm that solves equation (A.11) in each step indexed by m. Starting from ˆB(m−1),
denote ˆβjkMCD(m−1)the one step update of the jkthcoordinate by the mixed coordinate descent
algorithm. We prove in the following that
| ˆβjk(m)| ≤ | ˆβjkMCD(m)| ≤ | ˆβjk(m−1)| (A.12)
with equalities hold only when | ˆβjk(m)| = | ˆβjk(m−1)|.
First, if ˆβjk(m) is updated by (3) in the main text, then (A.12) is automatically satisfied
since
0 = | ˆβjk(m)| = | ˆβjkMCD(m)| ≤ | ˆβjk(m−1)|
Otherwise, if ˆβjk(m) is updated by (II) or (III) or (IV) in Section 4 of the main text, then
it must be one of the following cases. (i) If ˆ βjk(m−1) < ˆβjk(m) = − (|Sjk| − nλjk)+ kxjk22 + n P {g∈G: βjk∈Bg, k ˆB(m−1)g k2>0}λg/(k ˆ B(m−1)g−(jk)k2 2+ | ˆβ (m) jk |2)1/2 < 0, then ˆ βjkMCD(m) = − (|Sjk| − nλjk)+ kxjk22+ n P {g∈G: βjk∈Bg, k ˆB(m−1)g k2>0}λg/(k ˆ B(m−1)g−(jk)k2 2+ | ˆβ (m−1) jk |2)1/2 < ˆβjk(m).
From the proof of Theorem 3.1, ˆβjk(m−1) < ˆβjk(m) if and only if ∂L(B) ∂βjk ˆ βjk(m−1) = −Sjk/n + kxjk22βˆ (m−1) jk /n − λjk+ X Gjk λgβˆjk(m−1)/k ˆB(m−1) g k2< 0.
Notice that the above is also the partial derivative of Lnet(B) w.r.t. βjk taking value at
ˆ
βjk(m−1), where Lnet(B) is the elastic net objective function
Lnet(B) = 1 2nkY − XBk 2 2+ X jk λjk|βjk| + X g∈G λgkBgk22/(2k ˆB (m−1) g k2) holding k ˆB(m−1)
g k2 as constants and constraining that βjk< 0.
Following exactly the same argument as in the proof of Theorem 3.1, we can prove that ∂Lnet(B) ∂βjk ˆ β(m−1)jk < 0 if and only if ˆβ (m−1)
jk is less than the solution of ∂L(B)/∂βjk= 0 with the
constraint βjk < 0, which is the solution of
−Sjk/n + kxjk22βjk/n − λjk+ X Gjk λgβjk/k ˆB(m−1) g k2 = 0 under βjk< 0 given by − (|Sjk| − nλjk)+ kxjk22+ n P {g∈G: βjk∈Bg, k ˆBgk(m−1)2 >0}λg/k ˆ B(m−1)k2 = ˆβ MCD(m) jk . Therefore, we have ˆ βjk(m−1) < ˆβjkMCD(m) < ˆβjk(m)< 0. (ii) If ˆ βjk(m−1) > ˆβjk(m)= (|Sjk| − nλjk)+ kxjk22+ n P {g∈G: βjk∈Bg, k ˆB (m−1) g k2>0}λg/(k ˆ B(m−1)g−(jk)k2 2+ | ˆβ (m) jk |2)1/2 ≥ 0, with similar argument, we have that
ˆ
(iii) If ˆβjk(m−1) = ˆβjk(m), the mixed coordinate descent algorithm will be exact update and we will have
ˆ
βjk(m−1) = ˆβjkMCD(m) = ˆβjk(m). In summary, we have (A.12).
Lemma A.3 shows that the sequence of estimates of jkth coordinate { ˆβ(m)
jk } iteratively
updated from solving (A.11) converges to a global minimizer regardless the value of the
starting point. For each term in the sequence { ˆβjkMCD(l)}, suppose one can construct a
se-quence of { ˆβjk(m)} starting from ˆβjkMCD(l), then those sequences all converge to minimizers (if
the minimizer is not unique, e.g. for not strictly convex objective function) with the same
minimum value. Thus from (A.12) we know that { ˆβjkMCD(l)} converge to a global minimizer
with the same minimum value.
Figure A.1 illustrates coordinate updates by the standard coordinate descent and the mixed coordinate descent algorithms on a contour surface of a two-dimensional objective function. Given the same starting values, one step update on one coordinate from the mixed coordinate descent algorithm is always bounded between the previous and current values from the standard coordinate descent algorithm.
A.5
Comparison of computing costs
The computational cost of coordinate descent algorithm with inner iterations is much higher than our mixed coordinate descent algorithm. Figure A.2 shows the comparison between these two algorithms. The group structure of the regression coefficient matrix used is set to be (b) in Figure 1 in the main text. In Figure A.2, the mixed coordinate descent algorithm converges to a minimizer after 500 iterations while the coordinate descent algorithm with inner iterations converges after 150000 iterations.
( 1 (m-1) , 2 (m-1) ) ( 1 (m) , 2 (m-1) ) ( 1 (m) , 2 (m) ) X ( 1 MCD(m) , 2 (m-1) ) ( 1 (m) , 2 MCD(m) ) X
Figure A.1: Illustration of coordinate updates by the standard coordinate descent and the mixed coordinate descent algorithms on a contour surface of a two-dimensional objective function.
B
Comparison between univariate and multivariate
ap-proaches
Figures B.1 and B.2 illustrate the comparisons between univariate approaches and
multi-variate approaches. The true regression coefficient matrix takes a GXY ∪ GX group structure.
It can be seen that when different response variables have a similar sparsity to the predic-tors, the multiple univariate lasso (using different λ values for different response variables) and the multivariate lasso (using the same λ value for all response variables) have similar performance on variable selection. The multiple univariate sparse group lasso approach has
1 e +0 5 2 e +0 5 3 e +0 5 4 e +0 5 5 e +0 5 iteration numbers o b je ct iv e f u n ct io n 0 500 100050001000 0 5000 0 1000 00 1500 00 2000 00 CD−w/ inner iteration MCD−w/o inner iteration
Figure A.2: Decreasing of the objective function. The gray line is for the proposed mixed coordinate descent (MCD) algorithm without inner iterations of updating (A.11) and the black line is for the coordinate descent (CD) algorithm with inner iterations.
a slightly better variable selection performance than the multiple univariate lasso. The pro-posed multivariate sparse group lasso yields the best variable selection result by borrowing information from other response variables within the same group. It also has the smallest prediction error.
C
More simulations
Figures C.1 to C.5 show the variable selection and prediction effects in some other simulation settings, such as with different autocorrelation coefficient values or with a true “all-in-all-out”
(a) (b) (c)
(d) (e)
Figure B.1: Heatmaps of coefficient matrices. (a) True B∗; (b) The multiple univariate
lasso; (c) The multiple univariate sparse group lasso (d) The multivariate lasso; (e) The
multivariate sparse group lasso; The true B∗ has a “not all in all out” and X+XY group
structure with p = q = 200, n = 100, ρ = 0.5. u n i L u n i SG L m u lt i L m u lt i SG L False Positive 0 500 1000 1500 2000 2500 u n i L u n i SG L m u lt i L m u lt i SG L False Negative 0 250 500 750 1000 1250 u n i L u n i SG L m u lt i L m u lt i SG L Sensitivity(%) 0 25 50 75 u n i L u n i SG L m u lt i L m u lt i SG L Specificity(%) 0 25 50 75 100 u n i L u n i SG L m u lt i L m u lt i SG L Prediction Error 6.70e+6 6.95e+6 7.20e+6 7.45e+6
Univariate and multivariate comparison, N=100, P=200, Q=200, ρ = 0.5
Figure B.2: Comparison between multiple-univariate and multivariate approaches from 100 simulated data sets. “uni L” – the multiple univariate lasso; “uni SGL” – the multiple univariate group lasso; “multi L” – the multivariate lasso; “multi SGL” – the multivariate sparse group lasso with an XY group structure on the coefficient matrix.
group structure.
D
Network structure for the yeast eQTL data analysis
Figure D.1 shows the network constructed from the multivariate sparse group lasso method. The top association signals are highlighted in dark lines and also reported in Table 2 and 3 in the main text.
References
Wu, T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Annal of Applied Statistics, 2:224–44.
(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 500 1500 2500 3500 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 50 100 150 200 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 1.90e+6 2.00e+6 2.10e+6 2.20e+6 X group structure (b) L L X L XY L XXY G X G XY G XXY False Positive 0 3000 6000 9000 L L X L XY L XXY G X G XY G XXY False Negative 0 150 300 450 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 1.90e+6 2.15e+6 2.40e+6 2.65e+6 2.90e+6 XY group structure (c) L L X L XY L XXY G X G XY G XXY False Positive 0 2000 4000 6000 L L X L XY L XXY G X G XY G XXY False Negative 0 200 400 600 700 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 3.70e+6 3.95e+6 4.20e+6 4.45e+6 4.70e+6 4.95e+6
X+XY group structure
(d) L SG L G False Positive 0 1000 2000 3000 L SG L G False Negative 0 200 400 600 L SG L G Sensitivity(%) 0 25 50 75 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 3.20e+6 3.35e+6 3.50e+6 3.65e+6
Overlapping group structure
Figure C.1: More simulation results, “not all in all out” cases with n = 150, p = q = 200 and ρ = 0.2.
(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 500 1500 2500 3500 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 100 200 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 2.00e+6 2.10e+6 2.20e+6 2.30e+6 2.40e+6 2.50e+6 X group structure (b) L L X L XY L XXY G X G XY G XXY False Positive 0 1000 2000 3000 L L X L XY L XXY G X G XY G XXY False Negative 0 150 300 450 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 1.80e+6 2.00e+6 2.20e+6 2.40e+6 XY group structure (c) L L X L XY L XXY G X G XY G XXY False Positive 0 2000 4000 6000 7000 L L X L XY L XXY G X G XY G XXY False Negative 0 150 300 450 600 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 3.50e+6 3.75e+6 4.00e+6 4.25e+6 4.50e+6
X+XY group structure
(d) L SG L G False Positive 0 2000 4000 6000 L SG L G False Negative 0 100 200 300 400 L SG L G Sensitivity(%) 0 25 50 75 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 3.10e+6 3.15e+6 3.20e+6 3.25e+6 3.30e+6
Overlapping group structure
Figure C.2: More simulation results, “not all in all out” cases with n = 150, p = q = 200 and ρ = 0.8.
(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 500 1000 1500 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 1000 2000 3000 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 1.40e+7 1.50e+7 1.60e+7
X group structure, all in all out
(b) L L X L XY L XXY G X G XY G XXY False Positive 0 500 1000 1500 L L X L XY L XXY G X G XY G XXY False Negative 0 1000 2000 3000 4000 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 1.40e+7 1.45e+7 1.50e+7 1.55e+6 1.60e+6 1.65e+6
XY group structure, all in all out
(c) L L X L XY L XXY G X G XY G XXY False Positive 0 500 1000 1500 L L X L XY L XXY G X G XY G XXY False Negative 0 2000 4000 6000 7000 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 2.70e+7 2.80e+7 2.90e+7 3.00e+7 3.10e+7
X+XY group structure, all in all out
(d) L SG L G False Positive 0 1000 2000 3000 L SG L G False Negative 0 1500 3000 4500 L SG L G Sensitivity(%) 0 25 50 75 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 1.20e+7 1.30e+7 1.40e+7
Overlapping group structure, all in all out
Figure C.3: More simulation results, “all in all out” cases with n = 150, p = q = 200 and ρ = 0.5.
(a) L L X L XY L XXY grp X g rp XY g rp XXY False Positive 0 200 400 600 L L X L XY L XXY grp X g rp XY g rp XXY False Negative 0 200 400 600 L L X L XY L XXY grp X g rp XY g rp XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY grp X g rp XY g rp XXY Prediction Error 3.20e+6 3.30e+6 3.40e+6 3.50e+6 3.60e+6
X group structure, all in all out
(b) L L X L XY L XXY G X G XY G XXY False Positive 0 200 400 600 800 L L X L XY L XXY G X G XY G XXY False Negative 0 200 400 600 800 1000 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 3.30e+6 3.50e+6 3.70e+6 3.90e+6 4.20e+6
XY group structure, all in all out
(c) L L X L XY L XXY G X G XY G XXY False Positive 0 200 400 600 800 L L X L XY L XXY G X G XY G XXY False Negative 0 400 800 1200 L L X L XY L XXY G X G XY G XXY Sensitivity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Specificity(%) 0 25 50 75 100 L L X L XY L XXY G X G XY G XXY Prediction Error 6.10e+6 6.50e+6 6.90e+6 7.30e+6
X+XY group structure, all in all out
(d) L SG L G False Positive 0 300 600 900 L SG L G False Negative 0 200 400 600 800 L SG L G Sensitivity(%) 0 25 50 75 100 L SG L G Specificity(%) 0 25 50 75 100 L SG L G Prediction Error 2.80e+6 2.95e+6 3.10e+6 3.25e+6
Overlapping group structure, all in all out
Figure C.4: More simulation results, “all in all out” cases with n = 150, p = q = 100 and ρ = 0.5.
(a) (b) (c) (d)
(e) (f) (g) (h)
Figure C.5: Heatmaps of coefficient matrices, selection effects. “Not all in all out” XY group
structure with n = 100, p = 200, q = 200, and ρ = 0.5. (a) B∗; (b) ˆBL; (c) ˆBLX; (d) ˆBLXY;
● ● ● ●●● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●●● ● ●●●● ● ●●● ● ●● ● ● ●●● ●●●● ●● ● ● ●● ● ● ● ●● ●●●●● ● ●●● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ●●●●● ● ●● ● ●●● ●●● ●●●● ●●● ●●●● ●●● ● ● ●●● ●● ●●●● ●●● ●●●● ●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●● ● ●●● ● ●●● ●●● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ●●●●● ● ● ●●●●●●●● ●●●●●●●● ●● ● ●●● ● ● ● ● ●●●● ●●● ● ● ● ● ● ●●●● ● ● ● ● ● ●● ●●●●● ● ● ● ●●●●●● ●●●● ● ● ● ●● ●● ● ●●● ●●● ●●● ● ●●●●●●● ● ● ●●● ●● ●●●● ●●● ●●●● ●●● ● ● ● ● ● ●●●● ● ● ● ●● ●● ● ● ● ●●● ● ● ● ●●● ●●●● ● ● ● ●●●●●●●● ● ● ● ●● ● ● ●● ● ● gA L gBL gBR gC L gC R gEL gER gH L gJL gL R ML g g N L LOg PLg N BR N C R ND R NEL NER NH L NH R NLR NML NNL NOL NPL SNR YAL YAR YBR YCL YCR YDR YEL YER YG L YG R YH L YH R YJL YL R YML YN L YO L YPL MA PK cell circl e cance r riboso me
Figure D.1: Network constructed from the multivariate sparse group lasso method. Network structure is between gene expressions grouped in mitogen-activated protein kinases (MAPK), cell cycle, cancer, ribosome pathways and markers grouped in 45 gene groups. Gray lines
connect expression-marker pairs with non-zero ˆβjk. Dark lines are for the top 10 associations
in each pathways. The strength of these top associations are indicated by the width of the dark lines. The dotted circles indicate the overlapping pathway group structure.