ASYMPTOTIC LOG-DET SUM-OF-RANKS MINIMIZATION VIA TENSOR (ALTERNATING) ITERATIVELY REWEIGHTED
LEAST SQUARES
SEBASTIAN KR ¨AMER
Abstract. Affine sum-of-ranks minimization (ASRM) generalizes the affine rank minimization (ARM) problem from matrices to tensors. Here, the inter-est lies in the ranks of a family K of different matricizations. Transferring our priorly discussed results on asymptotic log-det rank minimization, we show that iteratively reweighted least squares with weight strengthp = 0 remains a, theoretically and practically, particularly viable method denoted as IRLS-0K. As in the matrix case, we prove global convergence of asymptotic minimizers of the log-det sum-of-ranks function to desired solutions. Further, we show local convergence of IRLS-0K in dependence of the rate of decline of the therein ap-pearing regularization parameterγ & 0. For hierarchical families K, we show how an alternating version (AIRLS-0K, related to prior work under the name SALSA) can be evaluated solely through tensor tree network based operations. The method can thereby be applied to high dimensions through the avoidance of exponential computational complexity. Further, the otherwise crucial rank adaption process becomes essentially superfluous even for completion prob-lems. In numerical experiments, we show that the therefor required subspace restrictions and relaxation of the affine constraint cause only a marginal loss of approximation quality. On the other hand, we demonstrate that IRLS-0K allows to observe the theoretical phase transition also for generic tensor recoverability in practice. Concludingly, we apply AIRLS-0K to larger scale problems.
Key words. affine rank minimization, iteratively reweighted least square, ma-trix recovery, mama-trix completion, log-det function
AMS subject classifications. 15A03, 15A29, 65J20, 90C31, 90C26
1. Introduction
The setting of affine sum-of-ranks minimization (ASRM) is a generalization of the affine rank minimization (ARM) problem for matrices to tensors. While the tensor rank refers to the minimal number of elementary tensors required for a decompo-sition into a sum, we are here interested in the ranks of so called matricizations. Let [d] = {1, . . . , d}, d ∈ N, as well as nµ ∈ N, µ = 1, . . . , d. For ∅ 6= J ( [d] and
Jc:= [d] \ J, we define such matricizations (cf. [15])
(·)[J]: Rn1⊗ . . . ⊗ Rnd→ RnJ×nJc, n
S :=
Y
µ∈S
nµ,
as the simple reshaping isomorphisms induced via (v1⊗ . . . ⊗ vd)[J]:= vec( O j∈J vj) · vec( O j∈Jc vj)T, vi∈ Rni, i = 1, . . . , d,
Institut f¨ur Geometrie und Praktische Mathematik, RWTH Aachen University, Templergraben 55, 52056 Aachen, Germany ([email protected],https://www.igpm.rwth-aachen.de).
1
where vec(·) : R×µ∈Snµ → RnSdenotes the vectorization in co-lexicographic
(column-wise) order. As usual, we identify Rn1⊗ . . . ⊗ Rnd∼= Rn1×...×nd. For a (not
neces-sarily hierarchical) family of subsets K ⊆ {J ( [d] | J 6= ∅} and a surjective linear operator L : Rn1×...×nd→ R`, ` < n
[d], as well as measurements y ∈ image(L), we
then define ASRM to refer to the problem of finding argmin X∈Rn1×...×nd X J∈K rank(X[J]) subject to L(X) = y. (1.1)
This setting is not only of particular interest due to its regularizing properties, but its close relation to so called hierarchical (or tensor tree) decompositions (cf. [15,25]). We are here however mainly interested in the problem itself, and only secondarily in the possibility to recover an eventual ground truth tensor from its measurements. As large parts of this work rely on our preceding article [26], which in turn is based on [5,7,11,29], we strongly recommend to take notice of such. 1.1. Approaches to ASRM and tensor recovery. Affine rank minimization (ARM), as theoretical origin of ASRM, is included as such for the dimension d = 2 and K = {{1}} [26], and consequently defined as the problem to find a matrix
X∗∈ argmin
X∈Rn×m
rank(X) subject to L(X) = y.
This setting in turn is based on the affine cardinality minimization problem (ACM), that is to find a vector
x∗∈ argmin
x∈Rn card(x) subject to L(x) = y.
A short overview over recovery methods as well as the role of iteratively reweighted least squares (IRLS, cf. [11,29] for ARM and [5,7] for ACM) for these two problems can be found in our preceding article [26]. To the best of our knowledge, IRLS has only priorly been considered with regard to the ASRM problem for tensors in the thesis [25], from which also the related, so called stable ALS approximation algo-rithm [16] stems. Relaxations of ASRM itself however have been considered before, including the minimization of the sum of nuclear norms [12,28,35]. The tensor rank as outlined in the introduction, Section1, however, is hard to calculate, and usually not the direct target of minimization. Though [37,38] utilize the canonical polyadic decomposition to a certain fiber completion problem. Other algorithm rely on the explicit, separate adaption of unknown ranks such as low rank manifold [27,36,39] or a-priorly representation or subspace based optimization [18,19,34]. However, non-intrusive rank adaption schemes, even if elaborate, tend to be problematic [16]. The AIRLS related method presented therein, as well as [2,13,14] based thereon, contrarily consider an intrusive regularization related to reweighting that circum-vents the instability and overfitting problems otherwise caused. Another class of algorithm requires to choose specific sampling points, prominently cross approxi-mation based methods [1,22,31]. Though such are preferable in that setting, we here however assume the affine measurement operator to be a priorly given. 1.2. Contributions and organization of this paper. The novel aspects of this paper are organized as follows.
• In Sections1.3and1.4, we generalize the optimization as well as reweighting process from the matrix to the tensor case in an introductory manner. Sec-tion 1.5 contains a preliminary description of hierarchical decompositions and the thereto related data sparse optimization.
• Section3provides global convergence results for the adjusted tensor IRLS-0K algorithm with respect to sequences of complementary weights, under consideration of the rate of decline of the regularization parameter γ. • In Section4, we discuss the relaxation of the affine constraint together with
the restriction to iteratively defined sequences of admissible subspaces. • Section 5 concisely reintroduces hierarchical formats as non-rooted tree
tensor networks with an emphasis on its graph theoretical foundation. It contains several fundamental statements required for the subsequently in-troduced A(lternating)IRLS-0K algorithm.
• In Section 6, we utilize tree tensor networks to derive the AIRLS-0K al-gorithm which allows a non-exponentially scaling realization of the relaxed IRLS-0K method introduced in Section 4 through an evaluation within given low rank representations.
• Section7contains a comprehensive series of numerical experiments. Firstly, we demonstrate that IRLS-0K allows to observe the theoretical phase tran-sition [4] regarding the required number of measurements for recoveries. Secondly, we follow the relaxations laid out in this work made from IRLS-0K up to the AIRLS-IRLS-0K approach. We demonstrate the improvement, but likewise common ground towards our priorly introduced, so called SALSA algorithm [16], as well as superiority over conventional ALS. We conclude with an application of AIRLS-0K to large scale problems in higher dimen-sions.
• Appendix Acontains a postponed proof. The supplementary SectionSM1
includes a further numerical experiment. Section SM3 contains extended visualization of results as explained in SectionSM2. Technical proofs con-cerning branch evaluations and therefor partially necessary notation can be found in Sections SM4 and SM5. The AIRLS-0K method is summarized in Algorithm 3, whereas SectionSM6 discusses viable heuristics.
1.3. Asymptotic minimization. We have priorly discussed in [26] as based on [5,7,11,29] in which way the ARM problem for matrices can be approached via the asymptotic minimization (cf. Definition1.1) of the family
fγ(A) := log k1
Y
i=1
(σ2
i(A) + γ) = log det(AAT + γI), γ & 0,
(1.2)
for which σi(A), i = 1, . . . , r, are defined as the singular values of A ∈ Rk1×k2
and σi(A) = 0, i > r, r = rank(A). Plainly analogous, its tensor version for the
minimization of a sum of ranks is defined as (see Section 2) fγK(X) := X J∈K fγ(X[J]) = log Y J∈K nJ Y i=1 (σ(J)i (X)2+ γ), (1.3)
where σi(J)(X) = σi(X[J]) is the i-th singular value of the matrix X[J] ∈ RnJ×nJc.
Thus the matrix version corresponds to K = {{1}}, whereas for the alternating IRLS method, we have also considered the complementary K = {{2}}. In [26], we have already reasoned the choice p = 0 of the therein appearing weight strength parameter p ∈ [0, 1]. Thus, we here only regard1 the thereto corresponding log-det approach laid out above, as opposed to the other extreme p = 1 associated to nuclear norm minimization. This leads us to the following, potential solutions to the ASRM problem.
Definition 1.1. We define X∗ := {X∗| ∃(Xγ)γ>0⊂ L−1(y), X∗= lim γ&0Xγ, f K γ (Xγ) = min X∈L−1(y)f K γ(X)}.
This set of asymptotic, global minimizers indeed yields the desired solutions as we prove in Theorem2.4. The decline of the parameter γ is no less important here as more detailly remarked on in the predecessor [26]. It should further be noted that neither the ranks r(J):= rank(X[J]) (cf. Section2), nor the families of singular
values σ(J), J ∈ K, are independent of each other [24], though not prohibitively so
in regard of aboves approach.
1.4. Iteratively reweighted least squares (IRLS). In line with the overall gen-eralization, also iteratively reweighted least squares (IRLS) allows to be applied to the minimization of a sum of ranks of a tensor. For the matrix case, one version (cf. [26,29]) defines (k · kF being the Frobenius norm)
X(i):= argmin
X∈L−1(y)
kWγ1/2(i−1),X(i−1)XkF, Wγ,X := (XXT+ γI)−1,
for a monotonically decreasing sequence {γ(i)}
i≥0 ⊂ R>0. The tensor variant
straightforwardly is given by (see Section 3) X(i):= argmin X∈L−1(y) X J∈K (W(J) γ(i−1),X(i−1)) 1/2(X(i))[J] 2 F, (1.4)
where the weight matrices2follow the same generalization with Wγ,X(J) := Wγ,X[J]= X[J](X[J])T + γI
−1
, J ∈ K.
Continued from the vector as well as matrix case, it also here holds true that for a sequence Xγ →X with sufficiently fast declining singular values
X J∈K W(J) γ,XγX [J] γ 2 F = X J∈K nJ X i=1 σi(J)(Xγ)2 σ(J)i (Xγ)2+ γ −→ γ&0 X J∈K rank(X[J]).
Though largely similar to the matrix case, there is however at least one difference as we discuss in Section 2. Due to its dependence on p = 0 and the family K, we abbreviate aboves algorithm (1.4) as IRLS-0K.
1.5. Data sparse optimization. With increasing dimensions d, the size of the space Rn1×...×nd quickly becomes prohibitively large. While for smaller instances,
IRLS-0K is by all means a viable algorithm, it otherwise remains a theoretical ideal. However, for hierarchical families K, that is if
(J ⊂ S ∨ S ⊂ J ∨ J ∩ S = ∅) ∧ J 6= Sc, ∀J, S ∈ K,
(1.5)
so called hierarchical decompositions [15] or, basically synonymously, tensor tree networks (cf. [10,25]) provide remedy in the same way the ordinary low rank matrix decomposition does (cf. [26]). In the latter case, the data space Dr:= {(Y, Z) | Y ∈
Rk1×r, Z ∈ Rr×k2} represents the low rank variety Vk1,k2
≤r := {A ∈ R
k1×k2| rank(A) ≤ r}
via the surjective (but not injective) bilinear map
τr: Dr→ V≤r, τr(Y, Z) := Y Z ∈ Rn1×n2.
(1.6)
The alternating method AIRLS then only requires to operate on Dr, while directly
minimizing fγ subject to relaxed affine constraints (see Section 4). In the tensor
case, where the rank becomes r = {r(J)}
J∈K∈ NK, the variety
V≤rK := {X ∈ Rn1×...×nd| X[J] ∈ VnJ,nJc
≤r(J) , J ∈ K},
(1.7)
has a logarithmicly lower dimension and is likewise represented by a data space Dr
together with a simple, surjective and multilinear contraction map τr: Dr→ V≤rK
(see Section 5). Thereby, a sparse optimization as for matrices is also possible in higher dimensions. Ultimately, also AIRLS-0K (see Section 6) distinguishes itself from well known unregularized alternating least squares (ALS) [20] only through an additional penalty term. However, it thereby not only becomes stable by means of [16], but it is derived from and directly minimizes the objective function fK γ
restricted to V≤rK.
2. Underlying structure and global behavior
Phrased more generalized, we in principle desire to solve the problem (cf. [26]) of finding X∗∈ argmin X∈L−1(y) CV(X), CV(v) := min V∈V: X∈Vdim(V ), (2.1)
where in this setting the family of varieties V is VdK:= {V≤rK ⊂ Rn1×...×nd| r = {r(J)}
J∈K∈ NK0},
for VK
≤ras defined in (1.7). In general however, the dimension of V≤rK does not equal
P
J∈Kr(J), and is thus not directly represented by the sum of ranks as in (1.1).
While
V≤eKr( V≤rK ⇒ er(J)≤ r(J), J ∈ K, er 6= r ⇒ dim(V≤eKr) < dim(V≤rK),
neither of the converse implications holds true in general. Firstly, some differently indexed varieties are equal since some constellations r ∈ NK
0 are unfeasible [25].
Definition 2.1. The values r = {r(J)}
J∈K are called (un)feasible (for n ∈ Nd),
if there exists (not) at least one tensor X ∈ Rn1×...×nd with rank(X[J]) = r(J),
J ∈ K.
For hierarchical sets K, these bounds are (cf. [25]) r(Jeˆ)≤ n
vQe∈Ev\{ˆe}r
(Je) for
ˆ
e ∈ Ev, v ∈ V . This natural interrelation of ranks is somewhat beneficial to the
simplified ranks approach as it excludes some extremal cases. The sum-of-ranks minimization is itself a necessary relaxation of the (arguably) more desirable objective function CVK
d, yet it is closer than it might first seem. What remains
however is that, contrarily to the matrix case, the varieties are only partially nested. 2.1. Determinant expansion and convergence of (global) minimizers. Fol-lowing from the matrix case, one can likewise expand the function fγKinto squared
sums of minors defined as det2k(A) := X I∈Pk([nJ]) X J∈Pk([nJc]) det(AI,J)2, k = 1, . . . , nJ,
for AI,J := {Ai,j}i∈I,j∈J ∈ R|I|×|J| and Pk([`]) := {I ⊆ {1, . . . , `} | |I| = k}. For
simplicity of notation, we further define det20(A) := 1.
Corollary 2.2. Let X ∈ Rn1×...×nd andγ ≥ 0. Then
with gs(X) := X {kJ}J ∈K∈Ξs Y J∈K det2kJ(X[J]), (2.2) forΞs:= {{kJ}J∈K| 0 ≤ kJ≤ nJ, J ∈ K, PJ∈KkJ= s}. Proof. As QnJ i=1(σ (J)
i (X)2+ γ) = det(X[J](X[J])T + γI), the first equality follows
by [26]. The third term is merely a restructured version. The minimizers of these functions are nested in the sense of the following Lemma. Lemma 2.3. For gs(X), s = 0, . . . ,PJ∈KnJ, as in Corollary 2.2, we have
gs(X) = 0 ⇔
X
J∈K
rank(X[J]) < s
for all X ∈ Rn1×...×nd.
Proof. By definition of gs(X), we have
gs(X) 6= 0 ⇔ ∃{kJ}J∈K: X J∈K kJ = s ∀J ∈ K : rank(X[J]) ≥ kJ ⇔ X J∈K rank(X[J]) ≥ s. By Lemma 2.3, it directly follows that each gs(X) = 0 implies gs+1(X) = 0.
With this structure, we can apply the nested minimization scheme as in [26] to conclude the following Theorem 2.4.
Theorem 2.4. Let s∗= min X∈L−1(y) X J∈K rank(X[J]).
Then for any convergent sequence of (global) minimizers Xγ of fγK(X) subject to
L(X) = y, we have X∗:= lim γ→0Xγ ∈X∈L−1(y), Pargmin J∈Krank(X[J])=s∗ Y J∈K rank(X[J]) Y i=1 σi(J)(X) with σrank((X(J) ∗)[J])+1(Xγ) 2∈ O(γ), J ∈ K. (2.3)
If there is only one Xs∗∈ L−1(y) withPJ∈Krank(Xs[J]∗) = s∗, then Xγ → Xs∗.
Proof. Since argminX∈L−1 gs(X) ⊂ argminX∈L−1 gs+1(X) due to Lemma2.3, the
proof is analogous to the corresponding one in [26]. 3. Log-det tensor iteratively reweighted least squares (IRLS-0K)
Although the global minimizers of fK
γ yield the sought solution, it is not
3.1. Minimization of an augmented function. The augmented map3 analo-gous to fγ corresponding to the tensor function fγK is
JγK(X, {W(J)} J∈K) := X J∈K Jγ,nJ(X [J], W(J)) for
Jγ,m(A, H) = trace(H(AAT + γI)) − log det(H) − m
= X
J∈K
kH1/2Ak2F+ γkH1/2k2F− log det(H) − m,
where each W(J)∈ RnJ×nJ ranges over W(J)= (W(J))T 0 (symmetric positive
definite). Consequently, with the same argumentation as in [11,26], it is ∂ ∂W(J) JγK(X, {W(J)}J∈K) = X[J](X[J])T + γI − (W(J))−1 and thus Wγ,X(J) := argmin W(J)=(W(J))T0 JγK(X, {W(J)}J∈K) = (X[J](X[J])T + γI)−1. (3.1)
It likewise holds true that
fγK(X) = JγK(X, {W (J) γ,X}J∈K).
(3.2)
Further, the minimizer in X is determined by an ordinary least squares problem. In order to derive the closed form solution for the minimizer, we note that each W(J), J ∈ K, defines linear operations
(W(J))α: Rn1×...×nd→ Rn1×...×nd, ((W(J))α(X))[J]:= (W(J))αX[J], α > 0.
We can thereby write X J∈K k(W(J))1/2X[J]k2F = X J∈K k(W(J))1/2(X)k2F = kW K (X)k2F, where WK(X) := {(W(J))1/2(X)}
J∈K ∈ Rn1×...×nd×|K|. Based on the operator
WK (cf. [26]), the sought minimizer is given by XWK := argmin X∈L−1(y) JγK(X, {W(J)}J∈K) = cW−1◦ L∗◦ (L ◦ cW−1◦ L∗)−1(y) (3.3) for c WK(X) := (WK)∗◦ WK(X) =X J∈K W(J)(X), (3.4)
where (·)∗ denotes adjoint operators. Further, following [11,26,29], we have c
WK(XWK) ⊥ kernel(L).
(3.5)
Vice versa, XWK is the unique solution to (3.5) subject to L(XWK) = y. A more
stable update formula is provided by [26] through
XWK = X0− K ◦ (K∗◦ cWK◦ K)−1◦ K∗◦ cWK(X0),
(3.6)
where X0 is one arbitrary solution to L(X0) = y and K : R Qd
i=1ni−`→ Rn1×...×nd
is a kernel representation of L, whereby image(K) = kernel(L). Due to the sum structure, also the gradient properties generalize to the tensor case.
3Due to the distinguishable roles ofJ ∈ K and the map JK
γ , we here remain faithful to prior
Corollary 3.1. It is
∇XfγK(X) = ∇XJγK(X, {W(J)}J∈K)|W(J)=W(J) γ,X, J∈K.
(3.7)
Thus X is a stationary point of fγK if and only ifX = XWK forW(J)= W (J) γ,X, J ∈
K, which means that (X, {Wγ,X(J)}J∈K) is a stationary point of JK γ.
As in the matrix case, γ → ∞ provides a unique, canonical starting value. Corollary 3.2. Independently of X(0)∈ L−1(y), it holds
lim γ→∞Xargmin∈L−1(y) fγK(X) = lim γ→∞X{Wγ,X(0)(J) }J∈K = argmin X∈L−1(y) kXkF,
where the first limit is possibly a set convergence.
3.2. Complementary weights. In the matrix case [26], there is one more eq-uitable choice f(2)(A) = log det(ATA + γI) as opposed to f(1)
γ (A) = fγ(A) =
log det(AAT+ γI). For families K containing more subsets, each set J ∈ K may be
replaced by its complement. For a subset S ⊂ K, let therefor KS := (K \ S) ∪ {Jc| J ∈ S}, Jc:= [d] \ J,
(3.8)
for W(J)= W(J)
γ,X, J ∈ KS. Although the updates X (K) W and X
(KS)
W in general differ,
the overall properties outlined in Section3 are not influenced as fγKS(X) = X J∈S nJc X i=1 log(σ(Ji c)(X)2+ γ) + X J∈K\S nJ X i=1 log(σi(J)(X)2+ γ) = fγK(X) + X J∈S (nJc− nJ) log γ.
While the weights are in that sense interchangeable, switching between complemen-tary weights becomes essential for AIRLS-0K as captured in Lemma 6.2.
3.3. Adjusted IRLS-0K algorithm. Based on a monotonically declining sequence {γ(i)}
i≥0⊂ R>0 (cf. Definition1.1), and (optionally) a sequence Si ⊂ K (cf.
Sec-tion3.2), Algorithm1defines the sequence {(X(i), {W(i,J)}
J∈K)}i≥0with L(X(i)) =
y and (W(i,J))T = W(i,J) 0, J ∈ K, i ≥ 0. These iterates behave largely analo-gously to the matrix version [26] (cf. [5,7,11,29]). In particular, that case is included in Theorem3.3for d = 2 and K = {{1}}.
Algorithm 1 Iteratively reweighted least squares with switching weights
1: set X(0)∈ L−1(y), γ(0)> 0
2: fori = 1, 2, . . . do
3: set Si−1 ⊂ K (cf. Section3.2)
4: {W(i−1,J)}J∈KSi−1 := {Wγ(J)(i−1),X(i−1)}J∈KSi−1 (cf. (3.1)) 5: X(i):= XWKSi−1(i−1) (cf. (3.3))
6: set γ(i)≤ γ(i−1) 7: end for
Theorem 3.3. Let {(X(i))}
i≥0 be generated by Algorithm 1for {Si}i∈N0 and the
weakly decreasing sequence {γi}i≥0⊂ R>0. Let further Sγ∗⊂ L−1(y) be the
station-ary points offK
γ|L−1(y) forγ > 0, as well as γ∗:= limi→∞γ(i).
(i) For each i ∈ N and each S ⊂ K, it holds fγK(i)S(X(i)) ≤ fK
S
(ii) Ifγ∗> 0, then the sequences X(i)and |fKS
γ(i)(X
(i))|, S ⊂ K, remain bounded.
(iii) Further, if γ∗> 0, then lim
i→∞kX
(i)− X(i−1)k F = 0
(3.9)
and each accumulation point of X(i) is in S∗ γ∗.
(iv) (See Remark 3.4) Let Θ ⊂ R>0 be an arbitrary, infinite, bounded set with
its only accumulation point at inf(Θ) = 0, and let δi:= inf
S∈S∗ γ(i)
kX(i)− Sk, i ∈ N.
For an arbitrary, bounded sequence A = {αi}i∈N0 with inf(A) > 0 (e.g.
αi= 1, i ∈ N0) and for γ(0)= max(Θ), we recursively define
γ(i+1)= (
θi ifαiδi < θi
γ(i) otherwise , θi:= max{z ∈ Θ | z < γ (i)},
i ∈ N0.
Then limi→∞δi = γ∗ = 0 and for at least one subsequence {X(i`)}`∈N,
there exists a sequence of stationary points {S`}`∈N,S`∈ Sγ∗(i`), with kS`−
X(i`)k → 0.
Remark 3.4. Part (iv) of Theorem3.3as well as its proof are literally the same as in the matrix case [26]. Roughly, if the sequence {γ(i)}
i∈Nis decreased to γ∗= 0 slowly
enough, then X(i)can only converge to a limit of stationary points of f
γ|L−1(y) for
γ & 0. The contrary case of too fast decline has been covered in [26] as well.
Proof. See AppendixA.
4. Relaxed iteratively reweighted least squares
Too large mode sizes n or high dimensions d in practice prohibit to even operate on the spaces L−1(y) or Rn1×...×nd directly. As hinted on in Section1.5, so called
hierarchical decompositions can provide remedy in the same way low rank matrix decompositions do. This however first requires to relax the affine constraint L(X) = y.
4.1. Relaxation of affine constraint. Let aγ(s) := s −PJ∈KnJlog(γ), γ > 0.
As each of these function is monotonically increasing, a composition with such does not change minimizers. We correspondingly define
fγa,K(X) := aγ◦ fγK(X) = log Y J∈K ∞ Y i=1 (1 +σ (J) i (X)2 γ ), (4.1)
with σi(J)(X) := 0 for i > nJ, J ∈ K. Likewise, let Jγa,K(X, {W(J)}J∈K) :=
aγ◦ JγK(X, {W(J)}J∈K). With the same reasoning as in [26], one then defines
Fγ,ωa,K(X) := kL(X) − yk2F+ cL· ω2· fγa,K(X),
Jγ,ωa,K(X, W ) := kL(X) − yk2
F+ cL· ω2· Jγ,aK(X, W ).
(4.2)
for an appropriate scaling constant cL. As ∂γ∂ F a,K
γ,√γ(X) = cL· ∂
∂γ(γ · fγa,K(X)) ≥ 0,
Algorithm 2 Subspace restricted IRLS with switching weights 1: set X(0)∈ L−1(y), γ(0)> 0 2: fori = 1, 2, . . . do 3: set Si−1 ⊂ K (cf. Section3.2) 4: {W(i−1,J)}J ∈K(Si−1) := {W (J) γ(i−1),X(i−1)}J∈K(Si−1) (cf. (3.1)) 5: set a subspace Ti−1 ⊂ Rn1×...×nd with T
i−13 X(i−1)
6: X(i):= argminX∈Ti−1 Jγa,K(i−1)(Si−1)(X, {W(i−1,J)}J∈K(Si−1)) (cf. (4.2))
7: set γ(i)≤ γ(i−1) 8: end for
4.2. Subspace dependent, relaxed optimization algorithm. To later incor-porate the alternating optimization, we here also consider an additional sequence of subspaces {Ti}i∈N0 with Ti⊆ R
n1×...×nd, as well as T
i∩ Ti−1 3 X(i), i ∈ N0. The
latter condition ensures that the previous iterate remains admissible. This then yields the modified Algorithm2. While the objective function is still monotonically decreased as provided by Corollary4.1, to show the remaining parts of Theorem3.3
as far as possible for now remains subject to future research. Corollary 4.1. For X(i)as defined by Algorithm 2it holds
Fγa,K(i)S(X
(i)) ≤ Fa,KS
γ(i−1)(X(i−1)),
for all i ∈ N and all S ⊂ K.
Proof. The argumentation is the same as in Theorem 3.3 part (i) as steps (a) to
(g) analogously hold true (cf. Section 4.1).
5. Hierarchical decomposition
We briefly reintroduce hierarchical tensor decompositions [15] as tensor tree net-works with reference to the introductory Section1.5. For further reading, we rec-ommend [8,10,15,17,23,25,30].
5.1. Notational deviation. In the following, G = (V, E) denotes a tree graph with vertices V ⊇ [d] and edges E ⊆ {{v, w} | v 6= w ∈ V }. Due to the complex description of general tensor (tree) networks, we require a certain minimum of notational deviation. That is, we dismiss the order of modes when indexing tensors. Instead, in order to avoid ambiguity, each specific object is consistently referenced with the same, distinctly assigned labels, based on the graph G = (V, E). The first group is given by αS = {αµ}µ∈S, for αS ∈ [nS], nS =Qµ∈Snµ, S ⊆ V . We set
nµ = 1 for µ > d, but any such αµ is only denoted when required for notational
simplicity. Further, the second group is given by β = {βe}
e∈E with βe ∈ [r(Je)],
Je∈ K (see Section5.3), whereas the measurement index is denoted by ζ ∈ [`]. For
each such label, we correspondingly define the spaces Hαµ := R
[nµ], v ∈ V, H
βe := R[r (Je)]
, e ∈ E, Hζ := R[`].
The entirety of labels is formally required to be ordered, but the exact ordering is irrelevant. To each collection Γ of such labels, we consequently assign the space
HΓ:=
O
γ∈Γ
Hγ.
(5.1)
Some, in particular labels corresponding to edges also appear as unequally treated, so called primed labels βe06= βe, e ∈ E. Each is however still thought to refer to the
it shall become apparent that it is in fact mostly redundant to explicitly denote these labels. While we nevertheless here hold on to indices, SectionSM5does make use of this fact to more compactly repeat some of following statements and lay out their proofs. What is here introduced as notation, is the basis to the formalized arithmetic introduced in [25]. For the Matlab toolbox that realizes the latter through automated contractions, on which the implementation of (A)IRLS-0K is based on, please contact the author.
5.2. Graph notation. We denote each the path from excluding c ∈ V to excluding v ∈ V \ {v} within a tree G = (V, E) as the unique ordered set
c ˚+v := (p1, . . . , p−1) = p ⊂ V,
(5.2)
for which {c, p1} ∈ E, {pi, pi+1} ∈ E, i = 1, . . . , |p| − 1 as well as {p−1, v} ∈ E.
We further define the neighbors of v ∈ V , as well as the predecessor and set of descendants of v ∈ V \ {c} relative to c ∈ V as
neigh(v) := {h ∈ V | {h, v} ∈ E}, predc(v) := p−1, descc(v) := neigh(v) \ {p−1}.
We define the branches relative to c ∈ V as
branchc(v) := {v} ∪ {b ∈ V \ {c, v} | v ∈ c ˚+b}.
Each root c ∈ V splits the graph into the multiple connected components of V \{c}, ˙
[
h∈neigh(c)branchc(h) = V \ {c}.
For any v 6= w ∈ V , we further define the sets Jw(v) := branchw(v) ∩ [d]. Thus if
e = {v, w} ∈ E is an edge, then Jw(v) ˙∪ Jv(w) = [d].
5.3. Tree corresponding to hierarchical family. Without loss of generality, we from here on postulate that hierarchical families K (cf. Section 1.5) are by definition also dimension separating. That is, we assume that there does not exist a map π : [d] → [d − 1], for which π(J) /∈ {π( ˆJ), [d − 1] \ π( ˆJ)} for all J, ˆJ ∈ K. Lemma 5.1. Each (dimension separating) hierarchical family K defines an, up to equivalence, unique tree GK = (V, E), V ⊇ [d] and root c ∈ V , for which |E| =
|V | − 1 = |K| and K = {Jc(v)}v∈V \{c} — and vice versa.
Proof. See for example [15,25].
Definition 5.2. Let GK correspond to the hierarchical family K. We define Je∈
{Jw(v), Jv(w)}, e = {v, w} ∈ E, as each the one set that is contained in K.
This convention implies a bijection K = {Je| e ∈ E} to E. The simple graph
that corresponds to the matrix case K2= {{1}} for d = 2 is for instance given by
the tree
GK2 = (V, E), V = {1, 2}, E = {{1, 2}},
whereby J{1,2}= {1}. For KTucker= {{1}, . . . , {d}} (cf. Example5.5), we have
GKTucker= (V, E), V = {1, . . . , d + 1}, E = {{1, d + 1}, . . . , {d, d + 1}},
(5.3)
and J{µ,d+1}= {µ}, µ ∈ [d]. As required later, for subsets S ⊂ V , we further define ES := {{v, w} ⊂ E | v ∈ S, w ∈ neigh(v)},
(5.4)
˚
ES := {{v, w} ⊂ E | v, w ∈ S}, ∂ES:= ES\ ˚ES.
5.4. Representation map corresponding to tree. Whereas each hierarchical family K defines a tree GK = (V, E), each such (not necessarily rooted) graph together with r ∈ NK in turn defines a certain data space D
rand a representation
map τr: Dr→ Rn1×...×nd for values r ∈ NK.
Definition 5.3. With reference to Section 5.1, let Dr:=
×
v∈V Hmv, mv:= {β e} e∈Ev ∪ ( {αv} if v ∈ [d], ∅ otherwise.The dimension of each node Nv ∈ Hmv, {Nv}v∈V ∈ Dr, is thus the degree of
v ∈ V , plus one if v ∈ [d]. The representation map τris now defined as the map that
proceeds each a contraction over modes with common labels. With the notation declared in Section5.1, we may write
τr(N )α1,...,αd:= X βe∈E Y µ∈[d] (Nv)αv,{βe}e∈Ev Y v∈V \[d] (Nv){βe}e ∈Ev, (5.5) where αµ∈ [nµ], µ ∈ [d].
Example 5.4. In the matrix case with r = r(J{1,2}) ∈ N, we simply have an
ordinary matrix multiplication (cf. Section 1.5) τr(Y, Z)α1,α2 =
Pr
β=1Yα1,βZβ,α2,
where the summation ranges over α1 ∈ [n1] and α2 ∈ [n2]. Here, β = β{1,2}∈ [r]
is the label assigned to the only edge.
Example 5.5. For d ∈ N, the Tucker format [40] or MLSVD4[9] is defined through the graph KTucker (5.3) and consists of the components {Nµ}d+1µ=1 ∈ Dr of sizes
Nµ ∈ Rnµ×r(J{µ,d+1}) and Nd+1 ∈ Rr(J{1,d+1})×...×r(J{d,d+1}). The corresponding
contraction map is given by (though less convenient when written out in particular cases) Xα1,...,αd= τr(N1, . . . , Nd, Nd+1)α1,...,αd = r(J{1,d+1})X β{1,d+1}=1 . . . r(J{d,d+1})X β{d,d+1}=1 (N1)α1,β{1,d+1}. . . (Nd)αd,β{d,d+1}(Nd+1)β{1,d+1},...,β{d,d+1},
for αµ= 1, . . . , nµ, µ = 1, . . . , d as visualized in Fig. 1.
Nd+1 N1 N2 N3 N4 α1 α2 α 3 α 4 β{1 ,d+1 } β {2 ,d+1 } β{3 ,d +1 } β {4,d +1 } e Nd+1 Ned+2 N1 N2 N3 N4 α1 α2 α 3 α 4 β{1 ,d+1 } β {2 ,d +1 } β{d+1,d+2} β{ 3 ,d +2 } β {4,d +2 }
Figure 1. [Left] The contraction diagram for the Tucker representation in Example 5.5 for d = 4. The dotted line indicates the part which for J = {1} yields Z(J), whereas Y(J) =N
1 (cf. (5.7)). [Right] A balanced binary
hierarchical Tucker (HT) representation (cf. Section5.6) for the exhaustive family K = {{1, 2}, {1}, . . . , {4}}, Y({1,2})=τr({N1, N2, eNd+1}) (cf. (5.6)). In contrast to conventional literature (cf. [15]), the root node has been omitted as it is redundant here (cf. [25]).
While initially defined on the whole network, we can also extend the map τr to
contract nodes over any subset S ⊂ V via τr({Ns}s∈S){αs}s∈S,{βe}e∈∂ES := X βe: e∈˚ES Y v∈S (Nv)αv,{βe}e∈Ev, (5.6)
for αs∈ [ns], s ∈ S, and ∂ES as well as ˚ES as defined by (5.4). Here, some αv,
that is for v > d, are redundant (cf. Section5.1).
5.5. Decomposition theorem. The following theorem is fundamental to hierar-chical tensor approximation theory.
Theorem 5.6 ( [15]). Let GK be the tree corresponding to the hierarchical family
K (cf. Lemma 5.1). Then for each r = {r(J)}J∈K∈ NK, the according multilinear
representation map τr: Dr→ Rn1×...×nd is non-injective withimage(τr) = V≤rK.
In other words, for each tensor X ∈ Rn1×...×nd with rank(X[J]) ≤ r(J), J ∈ K,
there exists a (non-unique) decomposition N ∈ Dr with X = τr(N ). Each edge
e = {v, w} ∈ E, assuming J = Je= Jw(v) ∈ K, splits the tree into two disconnected
subgraphs and yields a corresponding matrix decomposition
X[J]= Y(J)Z(J), Y(J)∈ R[nJ]×r(J), Z(J)∈ Rr(J)×[nJc].
(5.7)
The matrices Y(J)and Z(J)are obtained by contractions over each (N
h)h∈branchw(v)
and (Nh)h∈branchv(w), respectively. In explicit, abbreviating S = branchw(v), we
have Yα(J)J,βe = τr({Ns}s∈S){αµ}µ∈J,βe = X βe: e∈˚E S Y v∈S (Nv)αv,{βe}e∈Ev,
for ˚ESas defined in Section5.3. Given (5.7), it is easy to see that indeed image(τr) ⊆
VK
≤r, whereas the other direction requires some more work (cf. [15,25]).
Lemma 5.7. The dimension of the variety corresponding to a feasible r ∈ NK for a hierarchical family K is dim(V≤rK) = X µ∈[d] nv Y e∈Eµ r(Je)+ X v∈V \[d] Y e∈Ev r(Je)−X e∈E (r(Je))2,
where GK= (V, E) is the corresponding graph. The set VK
=r in turn is a manifold
of equal dimension.
Proof. Follows by a generalization of the argumentation in [21,41]5. 5.6. Exhaustive hierarchical families. The larger the family K, the more reg-ularizing the IRLS approach. Thus, one may desire such to be exhaustive in the following sense.
Definition 5.8. Let K be a hierarchical family. We say K is exhaustive if there does not exist another hierarchical family eK with eK ) K.
Exhaustive hierarchical families in a certain sense yield particularly data sparse formats as specified in the following Lemma 5.9. For any such family, it further holds |K| = 2d − 3 = |E| and |V | = 2d − 2 (cf. Lemma 5.1).
Lemma 5.9. Let K be an exhaustive hierarchical family. Then GK consists only
of inner vertices v ∈ V \ [d] of degree 3 and leafs v ∈ [d] ⊂ V of degree 1.
Proof. See for instance [15,25].
The Tucker family KTucker for example is not exhaustive (for d ≥ 4). The degree
of the vertex d + 1 ∈ V is d, whereby the dimension of the node Nd+1 is d as
well. For d = 4, all exhaustive families are equivalent (up to permutation of modes) to K = {{1, 2}, {1}, {2}, {3}, {4}} (see Fig. 1). In general, exhaustive hierarchical families correspond to so called binary hierarchical Tucker formats (cf. [15,25]). 5.7. Rooted trees and orthonormalization. A root c ∈ V , if at all, may be chosen freely, leading us back to the choice of complementary weights in Section3.2. Lemma 5.10. For each c ∈ V , there exists a unique subset Sc ⊂ K for which
KSc = {J
c(v) | v ∈ V \ {c}} (cf. Section5.2).
Proof. Follows directly with Sc= {Jc(v)c| Jc(v) /∈ K, v ∈ V \ {c}} ⊆ K.
The set equality in Lemma5.10implies that for each J ∈ KSc, there is a unique
vertex v =: vc,J∈ V \ {c} with J = Jc(v). Note that only the sets Sc, c ∈ V , again
lead to hierarchical families KSc as opposed to the 2|K| generally possible subsets
S ⊂ K. One can utilize the non-injectivity of the map τr to orthonormalize the
representation in the sense of the following Theorem5.11, yet without the need to calculate the represented, full tensor.
Theorem 5.11. Let r(J)= rank(X[J]), J ∈ K. Then there exists a representation X = τr(N ), N ∈ Dr, such thatY(J)∈ R[nJ]×r
(J)
,J ∈ KSc, as defined in (5.7), are
orthonormal matrices.
Proof. Can for example be found in [25].
Note that the matrices Y(J) in Theorem5.11are defined via the representation
N . If may further be achieved that these matrices each consist of the left singu-lar vectors, Y(J) = U(J), of the compact matrix SVDs X[J] = U(J)Σ(J)(V(J))T,
J ∈ KSc. Thereby, the decomposition in fact becomes essentially unique6 [15,25].
However, mere orthonormality is in general sufficient and can be ensured with sig-nificantly less effort in an alternating optimization. In case of the Tucker format (Example 5.5), if indeed Y(J) = U(J), this canonical form is specifically known
as MLSVD [9], while for the tensor train format [32], it is known as canonical MPS [42]. General canonical forms of tensor tree networks and their properties are further discussed in [25].
6. Alternating iteratively reweighted least squares (AIRLS-0K) In this section, let K be a hierarchical family, GK = (V, E) the corresponding tree as well as τr : Dr → V≤rK, with Dr =
×
v∈V Hmv, the representation map forr ∈ NKas described in Section5. The idea of alternating least squares (ALS) is to in each step fixate N = {Nv}v∈V ∈ Drbut the one component Nc, where the root
c ∈ V iteratively cycles through all vertices. We therefor define the linear map N6=c: Hmc → V≤rK, N6=c( ˆNc) = τr({Nv}v∈V \{c}∪ { ˆNc}).
As the image of that map is independent of the specific, chosen representation, we obtain the (well defined) subspace (cf. Section 4.2)
Tc(τr(N )) := image(N6=c).
(6.1)
Though one avoids to ever calculate the full tensor X = τr({Nv}v∈V) ∈ Rn1×...×nd,
we define the resulting update as
Xγ,ω,WN,c := argminX∈Tc(N ) J
a,KSc
γ,ω (X, {W(J)}J∈KSc),
(6.2)
as well as (Nc)N,cγ,ω,W ∈ Hmc via N6=c((Nc)
N,c
γ,ω,W) := X N,c
γ,ω,W. The subset Sc ∈ K
is defined as by Lemma 5.10, the objective functions in (4.2). The subsequent sections are summarized in Algorithm 3, though it generates the same iterates X(i)= τ
r(N(i)), i ∈ N0, as Algorithm 2 when the subspaces are chosen according
to (6.1).
6.1. Sweeps, micro steps and stability. The update Xγ,ω,WN,c (cf. (6.2)) is in-dependent of the specific, chosen representation N of the previous iterate X (cf. Theorem 5.6). Thus, for each c ∈ V , the updating maps
M(c)r : Dr→ Dr, M(c)r (N ) := {Nv}v∈V \{c}∪ {(Nc)N,cγ,ω,W}, W = Wγ,τr(N ),
operating on the data space, as well as the one operating on the full tensor space, ζM(c): Rn1×...×nd→ Rn1×...×nd, ζM(c)(X) := τr(X)◦ M(c)τ
r(X)◦ τ
−1 r(X)(X),
are well defined. Here, r(X) ∈ NKdenotes the ranks of each X and N = τr−1(X) is
each an arbitrary representation. A whole sweep (for fixed γ and ω) is defined as Mr:= c∈V M(c)r , ζM:= c∈V ζM(c),
where the order of composition may be chosen as most suitable. Issues around these functions in particular concerning stable rank adaptivity have been discussed in [16].
6.2. Representation based evaluation. In order to obtain a practically viable algorithm, it remains to show that each next iterate Xγ,ω,WN,c (cf. (6.2)) given W = Wγ,τr(N ) and ω =
√
γ > 0 (cf. Section 4.1) can indeed be calculated through its representation, that is, without the need to construct full tensors in Rn1×...×nd.
The updated node (Nc)N,cγ,ω,W ∈ Hmc, for which X
N,c
γ,ω,W = N6=c((Nc) N,c
γ,ω,W), is given
by the lineare least squares problem (cf. Sections3.1,3.2 and4.1) (Nc)N,cγ,ω,W = argmin e Nc∈Hmc kL ◦ N6=c( eNc) − yk2+ cLγ X J∈KSc k(W(J))1/2N6=c( eNc)k2F,
for W(J) as in (3.4). The minimizer is thus given as solution (N
c)N,cγ,ω,W := eNc to N6=c∗ ◦ L∗◦ L ◦ N6=c( eNc) + cLγ X J∈KSc N6=c∗ ◦ W(J)◦ N6=c( eNc) = N6=c∗ ◦ L∗(y). (6.3)
Following are two aspects that are required for a representation based evaluation. The first one in Section6.3depends on the operator L itself and can in that sense not be influenced. The second one in Section 6.4in turn merely asks for the right choices of Sc, namely the one in Lemma5.10, and can thus always be achieved.
6.3. Decomposition of measurement operator. Like each linear operator, L : Rn1×...×nd→ R` has a tensor description L ∈ R`×n1×...×nd in terms of
L(X)ζ = n1 X α1=1 . . . nd X αd=1 Lζ,α1,...,αdXα1,...,αd. (6.4)
This tensor L must itself somehow allow for an efficient handling. Similar to The-orem 5.6, each L ∈ R`×n1×...×nd can for some r
L ∈ NK (assumed to be low) be
The symbols εe, e ∈ E, are additional labels with range εe∈ [r(Je)
L ], whereas ζ ∈ [`].
The assigned multilinear representation map is ρrL(L)ζ,α1,...,αd:= X εe: e∈E Y µ∈[d] (Lµ)ζ,αµ,{εe}e∈Eµ Y v∈V \[d] (Lv){εe}e ∈Ev,
For simplicity of notation, as with the representation N , we will also denote indices ζ and αv in nodes Lv with v > d. While this is formally compatible as long as
(Lv)ζ,αv,{εe}e∈Ev, v > d, is constant in ζ, these indices can likewise be omitted.
Sampling operators for instance can be decomposed for rL(J)≡ 1, J ∈ K.
Example 6.1. For d ∈ N, the operator decomposition corresponding to the Tucker graph KTucker(5.3) consists of the components {Lv}v∈V of sizes Lµ ∈ R`×nµ×r
({µ})
and Ld+1∈ Rr(J{1,d+1})×...×r(J{d,d+1}). The corresponding contraction map ρrL is
Lζ,α1,...,αd = ρrL(L1, . . . , Ld, Ld+1)ζ,α1,...,αd = r(J{1,d+1})L X β{1,d+1}=1 . . . rL(J{d,d+1})X β{d,d+1}=1 (L1)ζ,α1,β{1,d+1}. . . (Ld)ζ,αd,β{d,d+1}(Ld+1){β{µ,d+1}}µ∈[d],
for αµ= 1, . . . , nµ, µ = 1, . . . , d and ζ = 1, . . . , ` as visualized in Fig.2.
Ld+1 L1 L2 L3 L4 α1 α2 α 3 α 4 ε{1 ,d+1 } ε{1 ,d+1 } ε {2 ,d+1 } ε {2 ,d+1 } ε {3 ,d +1 } ε { 3 ,d +1 } ε{4,d +1 } ε{4,d +1 } ζ
Figure 2. The contraction diagram for the Tucker-like decomposition of L as in Example6.1ford = 4 (cf. Example5.5).
In general, when all summations over αv, v ∈ [d], are proceeded first, then L(X)
can be efficiently evaluated by means of the tree structure of (cf. Proposition 6.5) L(X)ζ = X εe,βe: e∈E Y v∈V X αv (Lv)ζ,αv,{βe}e∈Ev(Nv)αv,{βe}e∈Ev , ζ ∈ [`]. (6.5)
Note that we have here again made use of the redundant additional indices ζ and αv for v > d. Likewise, the composition of L and N6=ccan be proceeded efficiently.
6.4. Equivalent low rank weights. The switching between each complementary weights introduced in Section3.2has the following motivation.
Lemma 6.2. Letc ∈ V and Sc be as in Lemma5.10, and letN be a representation
for which Y(J), J ∈ KSc, are orthonormal (cf. Theorem 5.11). Then the update
Xγ,ω,WN,c as defined in(6.2) for the rank nJmatricesW(J)= Wγ,X(J) = (X[J](X[J])T+
γI)−1,J ∈ KSc,X = τ
r(N ), is the same as for the rank r(J) matrices
W(J)= Wγ,N,c(J) := Y(J)(H(J)+ γI)−1(Y(J))T, H(J):= Z(J)(Z(J))T, J ∈ KSc.
Proof. It suffices to show that for every eNc ∈ Hmc, we have
(Wγ,X(J))1/2N
Let the orthonormal matrix UJ,⊥ ∈ RnJ×nJ−r(J) span the orthogonal complement
of the r(J) dimensional space range(Y(J)). Then
(Wγ,X(J))1/2= (Wγ,N,c(J) + γ−1UJ,⊥(UJ,⊥)T)1/2= (Wγ,N,c(J) )1/2+ γ−1/2UJ,⊥(UJ,⊥)T. It thus remains to show that range(N6=c( eNc)[J]) ⊥ range(UJ,⊥) for all eNc. As by
construction of Sc, the matrix Y(J)does not depend on the vertex c ∈ V , we have
range(N6=c( eNc)[J]) ⊆ range(Y(J)).
6.5. Path evaluations. Lemma 6.2allows for a further, significant simplification in the evaluation of (Nc)N,cγ,ω,W as defined in (6.3). Let in the following c ∈ V and
ˆ
J ∈ KSc be fixed. Further, let v
c, ˆJ ∈ V \{c} be the uniquely determined vertex with
Jc(vc, ˆJ) ∩ [d] = ˆJ (cf. Section 5.2), as in Section5.7. Without explicit indication
of the dependence on the above, we denote (cf. (5.2) and (5.4))
p := c ˚+vc, ˆJ ⊆ V \ {c, vc, ˆJ}, EY := ∂E{c}∪p= Ec\ {e1} ∪ ∂Ep.
For empty p, we set p1= vc, ˆJ and p−1= c for convenience. We further define the
edges e1= {c, p1} /∈ EY and ˆe = {p−1, vc, ˆJ} ∈ EY, such that ˆJ = Jeˆ.
Theorem 6.3 (cf. Section SM5.2). Let c ∈ V and Sc be as in Lemma 5.10,
and let N be a representation for which Y(J), J ∈ KSc, are orthonormal (as in
Theorem 5.11). Further, let the operator N6=c∗ ◦ W( ˆJ) ◦ N
6=c : Hmc → Hmc be
described by the matrixA( ˆJ)∈ H
mc⊗ Hmc. Then (cf. Fig.3) A( ˆαJ)0c,{βe0}e∈Ec; αc,{βe} e∈Ec = δα0c,αc Y e∈Ec\{e1} δβe0,βeM( ˆJ) βe1 0,βe1, (6.6)
where each δγ0,γ∈ {0, 1} is a Kronecker delta, as well as
(6.7) Mβ( ˆJ)e1 0,βe1 = X βe0,βe: e ∈Ep\{e1}, αv: v∈p Y e∈∂Ep\{e1,ˆe} δβe0,βe Y v∈p (Nv)αv,{βe0}e∈Ev(Nv)αv,{βe}e∈Ev ! (H( ˆJ)+ γI)−1βe 0ˆ,βˆe,
and further (cf. Fig. 4) (6.8) Hβ( ˆJ)ˆe0,βˆe= X βe0,βe: e ∈E{c}∪p\{ˆe}, αv: v∈{c}∪p Y e∈EY\{ˆe} δβe0,βe Y v∈{c}∪p (Nv)αv,{βe0}e∈Ev(Nv)αv,{βe}e∈Ev ,
for each αv ∈ [nv], v ∈ V , and βe0, βe∈ [r(Je)], e ∈ E.
Proof. See Figs.3and4. For the rigorous, though exceedingly technical proof, see Section SM4. A more elegant version can be found in SectionSM5.2.
The formula for A( ˆJ) simplifies whenever ˆe ∈ E
Nc Np1 Np−1 Y( ˆJ) Y(Je) Y(Je) Y(Je) Y(Je) βe1 β ˆe Nc Np1 Np−1 Y( ˆJ) Y(Je) Y(Je) Y(Je) Y(Je) βe1 0 β ˆe0 αJ αp1 (Z( ˆJ)(Z( ˆJ))T+ γI)−1 Y( ˆJ) β Y( ˆJ) ˆ e0 βˆe = Nc Np1 Np−1 Nc Np1 Np−1 δβe0,βe αp1 (Z(J)(Z(J))T+ γI)−1 β ˆe βˆe0
Figure 3. Network diagram for A( ˆJ) representing N6=c∗ ◦ W( ˆJ)◦ N6=c (cf.
(6.6) and (6.7)) for a particular case of a certain K, network N and a path p = (p1, p−1) of length |p| = 2. Contractions over labels αS,S ⊂ [d], are in
gray, whereas uncontracted modes are visualized via dashed lines. Here, it is c, p−1∈ [d], but p/ 1∈ [d]. We recommend to view the digital version for better
readability. [Lefthand] Emphasized are the segmentsG( ˆJ)(N∗
6=c at south-west
and N6=c south-east) and Wγ,N,c( ˆJ) (north) as in Theorem 6.3. The lighter shaded nodes are the partial contractionsY(Je) fore ∈ E
Y, the orthogonality
constraints of which are indicated with corresponding arrows. [Righthand] The contracted version in which only the nodes {Nv}v∈p and their copies (as
encircled) as well as the matrix (H( ˆJ)+γI)−1= (Z( ˆJ)(Z( ˆJ))T+γI)−1remain
as well as some delta tensors.
Nc Np1 Np−1 Y(Je) Y(Je) Y(Je) Y(Je) βeˆ Nc Np1 Np−1 Y(Je) Y(Je) Y(Je) Y(Je) βˆe αJ αp1 = Nc Np1 Np−1 βˆe Nc Np1 Np−1 βˆe0 αp1
Figure 4. Network diagram for H( ˆJ)=Z( ˆJ)(Z( ˆJ))T (cf. (6.8)) for the same
particular case as in Fig. 3. [Lefthand] The lighter shaded nodes are the partial contractionsY(Je) fore ∈ E
Y, the orthogonality constraints of which
are indicated with corresponding arrows. [Righthand] The contracted version in which only the nodes {Nv}v∈{c}∪pand their copies (as encircled) remain.
Corollary 6.4. In Theorem 6.3, if ˆe = e1= {c, v} for v ∈ neigh(c), then p = ∅.
Thus, we haveM( ˆJ)= (H( ˆJ)+ γI)−1 and
Hβ( ˆJ)ˆe0,βeˆ= X βe: e∈Ec\{ˆe}, αc (Nc)αc,{βe}e∈Ec\{ˆe},βˆe0· (Nc)αc,{βe}e∈Ec, forαv∈ [nv], and βe0, βe∈ [r(Je)], e ∈ Ec.
6.6. Branch evaluations. As described in the following, each expression in the update formula of (Nc)N,cγ,ω,W (cf. (6.3)) can be rewritten, such that reevaluations of
identical terms are avoided. This is particularly useful (Proposition6.5) for the first measurement related summand (L◦N6=c)∗◦L◦N6=c: Hmc → Hmcand the righthand
side (L ◦ N6=c)∗ : R` → H
mc since only few branch-wise evaluations change after
Proposition 6.5. Let c ∈ V and each Je ∈ KSc, e ∈ E. Further, let L ◦ N6=c :
Hmc→ R
` be represented by the tensorF
c∈ R[`]⊗ Hmc. Then (Fc)ζ,αc,{βe}e∈Ec = X εe: e∈Ec (Lc)ζ,αc,{εe}e∈Ec Y v∈neigh(c) S(J{c,v}) ζ,β{c,v},ε{c,v},
forζ ∈ [`] (not being contracted), as well as S(Jeˆ) ζ,βeˆ,εeˆ= X εe,βe: e ∈Ev\{ˆe} αv (Lv)ζ,αv,{εe}e∈Ev(Nv)αv,{βe}e∈Ev Y b∈descc(v) S(J{v,b}) ζ,β{v,b},ε{v,b},
fore = {predˆ c(v), v}, v ∈ V \ {c} and ζ ∈ [`] (not being contracted).
Proof. See SectionSM5.3.
The paths appearing in the evaluation of A( ˆJ)∈ H
mc⊗ Hmc representing N6=c∗ ◦
W( ˆJ)◦ N
6=c : Hmc → Hmc, J ∈ KS
c (cf. Theorem 6.3) naturally overlap. In the
evaluation of A :=PJ∈KA( ˆJ) (cf. (6.3)), this can be utilized.
Proposition 6.6. Letc ∈ V and each Je∈ KSc,e ∈ E. It is
X ˆ J∈K A( ˆαJ)0c,{βe0}e∈Ec; αc,{βe} e∈Ec = δα0c,αc X v∈neigh(c) Y e∈Ec\{{c,v}} δβe0,βeB (J{c,v}) β{c,v}0,β{c,v} with B(Jeˆ)= (H(Jˆe)+ γI)−1+P b∈descc(v)Be (J{v,b}),e = {predˆ c(v), v}, v ∈ V \ {c},
as well as, for b ∈ descc(v),
e B(J{v,b}) βe 0ˆ,βˆe = X βe: e ∈Ev\{ˆe}, β{v,b}0, αv (Nv)αv,βe0ˆ,{βe}e∈Ev \{{v,b},ˆe},β{v,b}0· B (J{v,b}) β{v,b}0,β{v,b}· (Nv)αv,{βe}e∈Ev.
Proof. See SectionSM5.3.
Due to the recursive structure in Proposition 6.6, the evaluation is to be pro-ceeded in order leaves to root. In turn, also the matrices H(Jˆe), ˆe ∈ E, (cf. (6.8))
can be simplified, but in the opposing root to leaves order. The starting points for this recursion are given by Corollary 6.4.
Proposition 6.7. Let c ∈ V and each Je ∈ KSc, e ∈ E. For ˆe = {predc(v), v},
v ∈ V \ {c}, and b ∈ descc(v), it is H(J{v,b}) β{v,b}0,β{v,b}= X βe: e∈Ev\{{v,b}} βˆe 0, α v (Nv)αv,β{v,b}0,{βe}e∈Ev \{{v,b},ˆe},βˆe 0· H (Jeˆ) βˆe 0,βˆe· (Nv)αv,{βe}e∈Ev.
Proof. See SectionSM5.3.
7. Numerical Experiments
The following Sections7.1to7.4specify terminology and configurations referred to in the subsequent experiments SM1and7.1to7.4in SectionsSM1,7.7and7.9. The presentation of results is further laid out in Section 7.5. For simplicity, the mode sizes {nµ}µ∈[d] are chosen uniformly as n ∈ N in all experiments. For the
7.1. Reference solutions, measurements vectors and family K. Each mea-surement vector is constructed via a (not necessarily sought for) reference solution with ranks r(rs)∈ NK, which in turn relies on a randomly generated representation,
y = L(X(rs)) ∈ R`, X(rs)= τr(rs)(N
(rs)) ∈ L−1(y) ∩ VK ≤r(rs).
All entries of the components {Nv(rs)}v∈V(rs) are assigned independent, normally
distributed entries. For simplicity7and to limit the amount of randomness, we also choose the components {r(rs)(J)}J∈K uniformly, as r(rs)∈ N. We distinguish between
four different types.
Tucker format. With K = KTucker = {{1}, . . . , {d}}, the components of the
repre-sentation {N(rs)}
v∈V(rs) follow the scheme in Example5.5.
Balanced, binary hierarchical Tucker format (bbHT).A balanced, binary hierarchi-cal Tucker format can be defined by the property of K = KbbHTto be exhaustive
(cf. Section5.6) and to minimize the maximal distance of any two vertices v, w ∈ [d] within GK (that is, the depth of the rooted tree, cf. [15]).
Exponentially declining singular values. Firstly, a bbHT representation as defined above is generated. As second step, all singular values σ(J), J ∈ K, are manipulated
such they decline exponentially. In explicit, for a constant s(expfac) ∈ (0, 1), it is
σi(J)≈ max(σmin, sx(expfac)), i = 1, . . . , r(rs), J ∈ K, where each x is an independent
random, normally distributed value and σmin > ε > 0 (cf. Section7.4) is a lower
bound. We denote such reference solutions by the abbreviated exp.dec.bbHT. Canonical polyadic (CP) decomposition.For r(rs)∈ N, the reference solution does
here not rely on K, but is generated as sum of r(rs)elementary tensors, X (rs) α1,...,αd:= τr(rs)(φ (rs)) = Pr(rs) γ=1(φ (rs) 1 )α1,γ. . . (φ (rs) d )αd,γ, where (φ (rs) µ ) ∈ R[nµ]×[r(rs)], for µ =
1, . . . , d. The corresponding graph is a hypertree, and the set image(τr(rs)) of at
most rank r(rs)tensors is a semi-algebraic subset of V≤rKmax(rs)(cf. (1.7)) for the in that
case defined, non-hierarchical family Kmax := {J ( [d] | J 6= ∅}, given r(J)(rs)≡ r(rs),
J ∈ Kmax.
7.2. Operators. We consider three types of operators L, where in each case L(X) := L vec(X) is based on the tensor L ∈ R`×n1...nd.
(Full) Gaussian operator.With a Gaussian operator, we refer to a randomly gen-erated tensor L ∈ R`×n1...nd with independent, normally distributed entries.
Gaussian low rank operator.For (low) uniform ranks r(J)L ≡ rL ∈ N, J ∈ K, the
operator is defined through the representation of L := ρrL({Lv}v∈V(rs)) (cf.
Sec-tion6.3). Each component therein are assigned independent, normally distributed entries.
Random sampling operator.As sampling operator, we denote L(X) := {Xpi}
` i=1,
for uniformly randomly drawn indices {p1, . . . , p`} ⊂
×
dµ=1[nµ]. Note that samplingoperators can trivially be decomposed, for rL(J)= 1, J ∈ K.
7.3. Solution methods. Based on a sufficiently large starting value γ(0)> 0, we
choose γ(i)= νγ(i−1), where ν < 1 remains constant throughout each single run of
an algorithm. We consider the following types of optimization.
Full, image based (IRLS-0K). As in (1.4), the full tensor is optimized based on the (literally interpreted) image update formula (3.3) without further modification (Al-gorithm1 for Si≡ ∅, i ∈ N0). When instability threatens to occur, the equivalent
Full, relaxed.The relaxed constraints described in Section 4.1 are utilized, but without subspace restrictions or weight switching (Algorithm2 with Ti≡ L−1(y),
Si≡ ∅, i ∈ N0). In this case, the residual kL(X) − yk is expected to converge to 0
parallel to the decline of γ, but this is not guaranteed.
Alternating (AIRLS-0K). We apply the representation based, necessarily relaxed, alternating optimization (Algorithm 2for Ti= Tci(τr(N
(i))) and S
i= Sci, i ∈ N0)
further discussed in Section 6. The update formulas for the single components make use of the branch-wise evaluations as derived in Section 6.6. Whether the (maximal) ranks r ∈ NK of the iterate, that is, the sizes of {N
v}v∈V, are fixed or
adapted, as well as the potential use of other heuristics laid out in SectionSM6, is specified in the respective experiments.
Neigh.The same algorithm as aboves AIRLS-0K is applied, but in each update of the node Nc, c ∈ V , only weights corresponding to Je∈ KSc, e ∈ Ec, are included
(cf. Corollary6.4) in order to reduce the computational complexity. This reduction of paths yields the variant closest to our priorly introduced algorithm SALSA, and in particular the minimal number of weights in each the update of Nc, c ∈ V , for
which the rank adaption stability property, as further introduced in [16], still holds true.
Plain ALS without reweighting. In one instance in Experiment7.3, we also compare to the plain alternating least squares (ALS) residual minimization ((6.2) for ω = γ = 0) for fixed ranks r = r(rs)∈ NK. This algorithm is thus granted additional, in
practice generally unavailable information and does not adapt ranks.
7.4. Experimental setup and evaluation. In order to evaluate each output X(alg), we compare its non-neglectable singular values to those of X(rs). We define
detKn,γ,ε(X) := Y J∈K detnJ,γ,ε(X [J]), X ∈ Rn1×...×nd,
where the matrix version is as in [26] given by
det2m,γ,ε(A) := γm−rankε(A)
rankε(A)
Y
i=1
(σi(A)2+ γ),
for rankε(A) := max{i ∈ [m] | σi(A) > · kAkF}. Therein, we choose :=
10−6. We firstly examine the residual norm kL(X(rs)) − yk
F, secondly compare the
approximate ranks, and lastly compare the products of singular values. The latter two aspects are reflected by the limit of the quotient
Qε(X(alg), X(rs)) := lim γ&0
detKn,γ,ε(X(alg))
detKn,γ,ε(X(rs))
∈ [0, 0.98] ∪ (0.98, 1.005) ∪ [1.005, ∞]. The three intervals are related to the categorization into improvements, successes or the two types of failures as outlined below, where the limits 0 or ∞ are reached if and only if PJ∈Krankε((X(alg))[J]) andPJ∈Krankε((X(rs))[J]) differ.
Post iteration. In order to avoid misjudgment, in cases where the tensor X(alg)may
be an improving solution (though that seldomly happens here), we apply a post iteration analogous to the one discussed in the matrix case [26] in order to allow the parameter ε to be reduced to machine precision.
Details of comparison.As in [26], if kL(X(alg)) − yk > 10−6kyk or if for the
quo-tient, it holds Qε(X(alg), X(rs)) = ∞, then the result is considered a strong
fail-ure. If kL(X(alg)) − yk ≤ 10−6kyk, then on the one hand we refer to 1.005 ≤
Qε(X(alg), X(rs)) < ∞ as weak failure. On the other, for 0.98 < Qε(X(alg), X(rs)) <
1.005, we consider the result successful, while for Qε(X(alg), X(rs)) ≤ 0.98, we say
Sensitivity analysis. With the exception of Experiment7.3, we lower the meta pa-rameter ν = νk =
√
νk−1 (cf. Section 7.3), starting with ν0 = 1.2, and rerun the
respective algorithm from the start until the result is not a failure. However, after too many reruns k > kmax, we give up and thus either achieve a weak or strong
failure depending on the result for k = kmax. All other meta parameters for each
algorithm are common to all respective experiments.
7.5. Presentation of results. Each but experiments7.1and7.3is reflected upon in three different ways as summarized in TableSM2.
ASRM/recovery tables.For each instance, we list the percentual numbers of ASRM improvements, successes or fails as defined in Section 7.4. Successes are further distinguished regarding recoveries, whether kX(alg)− X(rs)k
F ≤ 10−4kX(rs)kF. In
near all cases where this is fulfilled, the relative residual even falls below 10−6
(see Section SM2), in which case the algorithm stops automatically8. Note that both improvements as well as fails with respect to ASRM naturally nearly exclude recoveries with accuracy 10−4, and always so for 10−6.
ASRM/recovery figures. More distinguished visualizations of the results underlying the above mentioned tables can be found in Section SM2as described therein. γ-decline sensitivity. A depiction of results regarding the sensitivity analysis out-lined in Section 7.4is covered in SectionSM2 as well.
7.6. Observing the theoretical phase transition for generic recoveries. Experiment 7.1. For d = 4, n = 5 and r(rs)= 3, we consider the ASRM-KbbHT
problem based on Gaussian measurements for reference solutions given via bbHT representations for ` ∈ {68, 69, 70}. The solution method in both cases utilizes full, image based updates (cf. Section 7.1). Each constellation is repeated 100 times, for a comparatively large value of kmax= 10. The results are covered in Table1.
The dimension of the given bbHT variety is dim(VKbbHT
≤r(rs) ) = 4nr(rs)+ 2r
3 (rs)−
5r2(rs) = 69 (cf. Lemma 5.7). The value ` = dim(VKbbHT
≤r(rs) ) + 1 (which here is
` = 70) in turn provides the minimal sufficient number of generic9 measurements (thus not including sampling) to provide L−1(L(X(rs))) ∩ VKbbHT
≤r(rs) = {X
(rs)} for
generic X(rs) ∈ VKbbHT
≤r(rs) , as more generally proven in [4]. We can indeed observe
(see Table1) that for ` = 69, multiple solutions are found as verified through the post iteration process up to machine accuracy. For the value ` = 70 in turn, no duplicate solutions seem to exist. The one improving solution as well as the two weak failures are not the reference solution, though in fact neither within VKbbHT
≤r
for r = r(rs) but r = ˜r, ˜r({1,2}) = 3, (˜r({1}), . . . , ˜r({4})) = (2, 2, 4, 4). From the
perspective of a dimension minimization (cf. (2.1)) in turn, not even the improving result would be preferable as dim(VKbbHT
≤ˆr ) = 71 (V≤ˆKrbbHT + V≤rKbbHT(rs) ).
7.7. Affine sum-of-ranks minimization.
Experiment 7.2. For d = 4, n = 5 and r(rs) = 3, we consider the
ASRM-K problem based on samples or Gaussian measurements and reference solutions given via bbHT representations for ` ∈ {83, 111, 138} and K = KbbHT or by CP
decompositions for ` ∈ {62, 82, 102} and K = Kmax. The solution method in both
cases utilizes full, image based updates based on the respective families K (cf.
8Needless to say, this is the only point at which the reference solution itself is used within the algorithm, and only done in order to save a considerable amount of unnecessary computation time.
instance ASRM-recovery:K: Qε∈ [0, 0.98]no Qεno∈ (0.98, 1.005)yes Qε∈ [1.005, ∞)no Qε=∞
∆` =−1 •HT, ` = 68 24.0 2.0 + 0.0 25.0 49.0
∆` = 0 •HT, ` = 69 9.0 1.0 + 2.0 7.0 81.0
∆` = 1 •HT, ` = 70 1.0 0.0 + 10.0 2.0 87.0
Table 1. tensor recovery (full, Gaussian, image method, d = 4, n = 5, rrs= 3) – table as specified in Section7.5for Experiment7.1(see Fig.SM2
for more details)
Section 7.1). Each constellation is repeated 100 times, for kmax = 8. The results
are covered in Table 2and Figs.SM3 andSM4.
instance ASRM-KbbHT/ max: Qε∈ [0, 0.98] Qε∈ (0.98, 1.005) Qε∈ [1.005, ∞) Qε=∞
recovery: no no yes no cmf= 1.2 •gaussian, HT, ` = 83 0.0 0.0 + 96.0 0.0 4.0 •gaussian, CP, ` = 62 0.0 0.0 + 18.0 0.0 82.0 •sampling, HT, ` = 83 0.0 0.0 + 33.0 0.0 67.0 •sampling, CP, ` = 62 0.0 0.0 + 1.0 0.0 99.0 cmf= 1.6 •gaussian, HT, ` = 111 0.0 0.0 + 100.0 0.0 0.0 •gaussian, CP, ` = 82 0.0 0.0 + 86.0 0.0 14.0 •sampling, HT, ` = 111 0.0 0.0 + 94.0 0.0 6.0 •sampling, CP, ` = 82 0.0 0.0 + 44.0 0.0 56.0 cmf= 2.0 •gaussian, HT, ` = 138 0.0 0.0 + 100.0 0.0 0.0 •gaussian, CP, ` = 102 0.0 0.0 + 100.0 0.0 0.0 •sampling, HT, ` = 138 0.0 0.0 + 100.0 0.0 0.0 •sampling, CP, ` = 102 0.0 0.0 + 84.0 0.0 16.0
Table 2. IRLS-0K (full, image method, d = 4, n = 5, rrs= 3) – table as
specified in Section7.5for Experiment7.2(see Fig.SM4for more details)
The dimension or even the more particular structure of the variety VKmax
≤r(rs), for
Kmax = {J ( [d] | J 6= ∅}, d ≥ 4, as applied in the CP case (cf. Section 7.1), is
unknown to the best of our knowledge. While real tensors of at most rank r(rs)
do not form varieties, complex ones with at most this border rank do, here with a dimension of dim(V≤r(rs),C) = r(rs)(d(n − 1) + 1) = 51 (cf. [3,6,33]). Though we
assume this dimension to be lower than the one for Kmax, we take this smaller value
as reference. In that sense, the considered values ` are each (rounded) multiples cmf∈ {1.2, 1.6, 2}. To our surprise, if successeful, the CP reference solution is (near
perfectly) recovered even for ` = 62, considering that this value is smaller than 69 ≡ dim(VKbbHT
≤r(rs) ) for every exhaustive hierarchical family KbbHT. One possible
explanation would be that dim(VKmax
≤r(rs)) is lower or equal to 61, but further
inves-tigation remains subject to future work. While we can not, as theory provides, expect generic completions in case of sampling problems, the failures with respect to ASRM are subject of IRLS-0K itself. In particular, slower rates of decline ν (cf. Fig.SM3) may be required, and allow for better results for both sampling and Gaussian measurements as suggested by Fig. SM1. Though already for cmf = 1.2,
the rate of decline seems to suffice.
7.8. Alternating, affine sum-of-ranks minimization.
Experiment 7.3. For d = 4, n = 5, r(rs)= 3 and ` ∈ {126, 168, 210} we consider
the ASRM-KTucker problem based on samples or rank rL = 1 Gaussian
measure-ments for reference solutions given through Tucker representations. We compare the following four solution methods:
(a) full, image based
(c) alternating, based on fixed, maximally feasible ranks r(J)= 5, J ∈ K Tucker
(d) alternating, with adaptive ranks r(J)∈ [5], J ∈ K
Tucker(cf. SectionSM6.3)
(e) plain, ALS without reweighting, based on the, a-priorly provided, fixed ranks r(J)= r
A fixed rate ν = 1.002−1 of decline is used (applicable to the first three methods).
Each constellation is repeated 100 times, for which the results are covered in Table3
and Fig.SM7.
instance ASRM-KTucker: Qε∈ [0, 0.98] Qε∈ (0.98, 1.005) Qε∈ [1.005, ∞) Qε=∞
recovery: no no yes no
` = 126
•gaussian (rL= 1): (a) full 0.0 0.0 + 99.0 0.0 1.0
•(c) alt 0.0 0.0 + 79.0 0.0 21.0
•(d) alt adapt 0.0 0.0 + 78.0 0.0 22.0
•(e) plain ALS 0.0 0.0 + 0.0 0.0 100.0
•samp: (a) full 0.0 0.0 + 89.0 0.0 11.0
•(c) alt 0.0 0.0 + 16.0 0.0 84.0
•(d) alt adapt 0.0 0.0 + 19.0 0.0 81.0
•(e) plain ALS 0.0 0.0 + 0.0 0.0 100.0
` = 168
•gaussian (rL= 1): (a) full 0.0 0.0 + 100.0 0.0 0.0
•(c) alt 0.0 0.0 + 100.0 0.0 0.0
•(d) alt adapt 0.0 0.0 + 100.0 0.0 0.0
•(e) plain ALS 0.0 0.0 + 28.0 0.0 72.0
•samp: (a) full 0.0 0.0 + 100.0 0.0 0.0
•(c) alt 0.0 0.0 + 77.0 0.0 23.0
•(d) alt adapt 0.0 0.0 + 80.0 0.0 20.0
•(e) plain ALS 0.0 0.0 + 4.0 0.0 96.0
` = 210
•gaussian (rL= 1): (a) full 0.0 0.0 + 100.0 0.0 0.0
•(c) alt 0.0 0.0 + 100.0 0.0 0.0
•(d) alt adapt 0.0 0.0 + 100.0 0.0 0.0
•(e) plain ALS 0.0 0.0 + 74.0 0.0 26.0
•samp: (a) full 0.0 0.0 + 100.0 0.0 0.0
•(c) alt 0.0 0.0 + 100.0 0.0 0.0
•(d) alt adapt 0.0 0.0 + 98.0 0.0 2.0
•(e) plain ALS 0.0 0.0 + 24.0 0.0 76.0
Table 3. (A)IRLS-0KTucker (Tucker, d = 4,n = 5, rrs= 3) – table as
spec-ified in Section7.5for Experiment7.3(see Fig.SM7for more details)
The degrees of freedom within a Tucker decompositions for d = 4 in this setting is dim(VKTucker
≤r(rs) ) = r
4
(rs)+ 4nr(rs)− 4r2(rs)= 105 (cf. Lemma 5.7). The number of
measurements ` are (rounded) multiples cmf ∈ {1.2, 1.6, 2} of such. As in
Exper-iment SM1, there is nearly no difference between the version using fixed ranks or adaptive ranks, but both instances are slightly worse than the full version using unrelaxed constraints (note that here, these methods use the same, fixed rate of decay ν = 1.002−1). Plain alternating least squares on the other hand (even though only in that case, the ranks of the reference solution are provided) is significantly worse than the other methods, also for larger numbers of measurements.
7.9. Large scale, alternating ASRM.
Experiment 7.4. For d = 8, n = 20, r(rs)= 5 and ` ∈ {6500, 13000, 19500, 26000},
we consider the ASRM-KbbHT problem based on samples, rank rL = 1 or rank
rL = 2 Gaussian measurements for reference solutions given via bbHT
represen-tations with exponentially declining singular values, s(expfac) = 13. For Gaussian
measurements, we also consider unmodified singular values. As solution method, we apply alternating optimization with explicit rank adaption (limited only by r(J)≤ 8, J ∈ K
bbHT) as well as the applicable heuristics laid out in Section SM6.
The maximal length of paths is either unrestricted, or limited to neighbors. Each constellation is repeated 100 times, for kmax= 5. The results are covered in Tables4
and 5and Figs.SM8 toSM11.
The degrees of freedom within 8-dimensional bbHT decompositions in this setting is dim(VKbbHT
≤r(rs) ) = 8nr(rs)+ 6r
3
(rs)− 13r2(rs)= 1225 (cf. Lemma5.7), while ` = 6500
constitutes a fraction of about 2.5 · 10−7 of the total size nd = 2.56 · 1010 of the
tensor. Due to the long runtime for values k > 5, it yet remains speculation whether the restriction of paths to neighboring nodes does result in a loss of approximation quality or, as in other cases, rather a need for a lower parameter ν (cf. Figs. SM8