arxiv: v1 [math.na] 29 Jun 2021

(1)

ASYMPTOTIC LOG-DET SUM-OF-RANKS MINIMIZATION VIA TENSOR (ALTERNATING) ITERATIVELY REWEIGHTED

LEAST SQUARES

SEBASTIAN KR ¨AMER

Abstract. Affine sum-of-ranks minimization (ASRM) generalizes the affine rank minimization (ARM) problem from matrices to tensors. Here, the inter-est lies in the ranks of a family K of different matricizations. Transferring our priorly discussed results on asymptotic log-det rank minimization, we show that iteratively reweighted least squares with weight strengthp = 0 remains a, theoretically and practically, particularly viable method denoted as IRLS-0K. As in the matrix case, we prove global convergence of asymptotic minimizers of the log-det sum-of-ranks function to desired solutions. Further, we show local convergence of IRLS-0K in dependence of the rate of decline of the therein ap-pearing regularization parameterγ & 0. For hierarchical families K, we show how an alternating version (AIRLS-0K, related to prior work under the name SALSA) can be evaluated solely through tensor tree network based operations. The method can thereby be applied to high dimensions through the avoidance of exponential computational complexity. Further, the otherwise crucial rank adaption process becomes essentially superfluous even for completion prob-lems. In numerical experiments, we show that the therefor required subspace restrictions and relaxation of the affine constraint cause only a marginal loss of approximation quality. On the other hand, we demonstrate that IRLS-0K allows to observe the theoretical phase transition also for generic tensor recoverability in practice. Concludingly, we apply AIRLS-0K to larger scale problems.

Key words. affine rank minimization, iteratively reweighted least square, ma-trix recovery, mama-trix completion, log-det function

AMS subject classifications. 15A03, 15A29, 65J20, 90C31, 90C26

1. Introduction

The setting of affine sum-of-ranks minimization (ASRM) is a generalization of the affine rank minimization (ARM) problem for matrices to tensors. While the tensor rank refers to the minimal number of elementary tensors required for a decompo-sition into a sum, we are here interested in the ranks of so called matricizations. Let [d] = {1, . . . , d}, d ∈ N, as well as nµ ∈ N, µ = 1, . . . , d. For ∅ 6= J ( [d] and

Jc_{:= [d] \ J, we define such matricizations (cf. [}₁₅_])

(·)[J]: Rn1_{⊗ . . . ⊗ R}nd_{→ R}nJ×nJc_, _n

S :=

Y

µ∈S

nµ,

as the simple reshaping isomorphisms induced via (v1⊗ . . . ⊗ vd)[J]:= vec( O j∈J vj) · vec( O j∈Jc vj)T, vi∈ Rni, i = 1, . . . , d,

Institut f¨ur Geometrie und Praktische Mathematik, RWTH Aachen University, Templergraben 55, 52056 Aachen, Germany ([email protected],https://www.igpm.rwth-aachen.de).

1

(2)

where vec(·) : R×µ∈Snµ _{→ R}nS_{denotes the vectorization in co-lexicographic}

(column-wise) order. As usual, we identify Rn1_{⊗ . . . ⊗ R}nd∼_{= R}n1×...×nd_{. For a (not}

neces-sarily hierarchical) family of subsets K ⊆ {J ( [d] | J 6= ∅} and a surjective linear operator L : Rn1×...×nd_{→ R}`_{, ` < n}

[d], as well as measurements y ∈ image(L), we

then define ASRM to refer to the problem of finding argmin X∈Rn1×...×nd X J∈K rank(X[J]₎ _{subject to L(X) = y.} (1.1)

This setting is not only of particular interest due to its regularizing properties, but its close relation to so called hierarchical (or tensor tree) decompositions (cf. [15,25]). We are here however mainly interested in the problem itself, and only secondarily in the possibility to recover an eventual ground truth tensor from its measurements. As large parts of this work rely on our preceding article [26], which in turn is based on [5,7,11,29], we strongly recommend to take notice of such. 1.1. Approaches to ASRM and tensor recovery. Affine rank minimization (ARM), as theoretical origin of ASRM, is included as such for the dimension d = 2 and K = {{1}} [26], and consequently defined as the problem to find a matrix

X∗∈ argmin

X∈Rn×m

rank(X) subject to L(X) = y.

This setting in turn is based on the affine cardinality minimization problem (ACM), that is to find a vector

x∗∈ argmin

x∈Rn card(x) subject to L(x) = y.

A short overview over recovery methods as well as the role of iteratively reweighted least squares (IRLS, cf. [11,29] for ARM and [5,7] for ACM) for these two problems can be found in our preceding article [26]. To the best of our knowledge, IRLS has only priorly been considered with regard to the ASRM problem for tensors in the thesis [25], from which also the related, so called stable ALS approximation algo-rithm [16] stems. Relaxations of ASRM itself however have been considered before, including the minimization of the sum of nuclear norms [12,28,35]. The tensor rank as outlined in the introduction, Section1, however, is hard to calculate, and usually not the direct target of minimization. Though [37,38] utilize the canonical polyadic decomposition to a certain fiber completion problem. Other algorithm rely on the explicit, separate adaption of unknown ranks such as low rank manifold [27,36,39] or a-priorly representation or subspace based optimization [18,19,34]. However, non-intrusive rank adaption schemes, even if elaborate, tend to be problematic [16]. The AIRLS related method presented therein, as well as [2,13,14] based thereon, contrarily consider an intrusive regularization related to reweighting that circum-vents the instability and overfitting problems otherwise caused. Another class of algorithm requires to choose specific sampling points, prominently cross approxi-mation based methods [1,22,31]. Though such are preferable in that setting, we here however assume the affine measurement operator to be a priorly given. 1.2. Contributions and organization of this paper. The novel aspects of this paper are organized as follows.

• In Sections1.3and1.4, we generalize the optimization as well as reweighting process from the matrix to the tensor case in an introductory manner. Sec-tion 1.5 contains a preliminary description of hierarchical decompositions and the thereto related data sparse optimization.

(3)

• Section3provides global convergence results for the adjusted tensor IRLS-0K algorithm with respect to sequences of complementary weights, under consideration of the rate of decline of the regularization parameter γ. • In Section4, we discuss the relaxation of the affine constraint together with

the restriction to iteratively defined sequences of admissible subspaces. • Section 5 concisely reintroduces hierarchical formats as non-rooted tree

tensor networks with an emphasis on its graph theoretical foundation. It contains several fundamental statements required for the subsequently in-troduced A(lternating)IRLS-0K algorithm.

• In Section 6, we utilize tree tensor networks to derive the AIRLS-0K al-gorithm which allows a non-exponentially scaling realization of the relaxed IRLS-0K method introduced in Section 4 through an evaluation within given low rank representations.

• Section7contains a comprehensive series of numerical experiments. Firstly, we demonstrate that IRLS-0K allows to observe the theoretical phase tran-sition [4] regarding the required number of measurements for recoveries. Secondly, we follow the relaxations laid out in this work made from IRLS-0K up to the AIRLS-IRLS-0K approach. We demonstrate the improvement, but likewise common ground towards our priorly introduced, so called SALSA algorithm [16], as well as superiority over conventional ALS. We conclude with an application of AIRLS-0K to large scale problems in higher dimen-sions.

• Appendix Acontains a postponed proof. The supplementary SectionSM1

includes a further numerical experiment. Section SM3 contains extended visualization of results as explained in SectionSM2. Technical proofs con-cerning branch evaluations and therefor partially necessary notation can be found in Sections SM4 and SM5. The AIRLS-0K method is summarized in Algorithm 3, whereas SectionSM6 discusses viable heuristics.

1.3. Asymptotic minimization. We have priorly discussed in [26] as based on [5,7,11,29] in which way the ARM problem for matrices can be approached via the asymptotic minimization (cf. Definition1.1) of the family

fγ(A) := log k1

Y

i=1

(σ2

i(A) + γ) = log det(AAT + γI), γ & 0,

(1.2)

for which σi(A), i = 1, . . . , r, are defined as the singular values of A ∈ Rk1×k2

and σi(A) = 0, i > r, r = rank(A). Plainly analogous, its tensor version for the

minimization of a sum of ranks is defined as (see Section 2) f_γK(X) := X J∈K fγ(X[J]) = log Y J∈K nJ Y i=1 (σ(J)_i (X)2_{+ γ),} (1.3)

where σi(J)(X) = σi(X[J]) is the i-th singular value of the matrix X[J] ∈ RnJ×nJc.

Thus the matrix version corresponds to K = {{1}}, whereas for the alternating IRLS method, we have also considered the complementary K = {{2}}. In [26], we have already reasoned the choice p = 0 of the therein appearing weight strength parameter p ∈ [0, 1]. Thus, we here only regard1 _{the thereto corresponding} log-det approach laid out above, as opposed to the other extreme p = 1 associated to nuclear norm minimization. This leads us to the following, potential solutions to the ASRM problem.

(4)

Definition 1.1. We define X∗ := {X∗| ∃(Xγ)γ>0⊂ L−1(y), X∗= lim γ&0Xγ, f K γ (Xγ) = min X∈L−1(y)f K γ(X)}.

This set of asymptotic, global minimizers indeed yields the desired solutions as we prove in Theorem2.4. The decline of the parameter γ is no less important here as more detailly remarked on in the predecessor [26]. It should further be noted that neither the ranks r(J)_{:= rank(X}[J]_{) (cf. Section}₂_{), nor the families of singular}

values σ(J)_{, J ∈ K, are independent of each other [}₂₄_{], though not prohibitively so}

in regard of aboves approach.

1.4. Iteratively reweighted least squares (IRLS). In line with the overall gen-eralization, also iteratively reweighted least squares (IRLS) allows to be applied to the minimization of a sum of ranks of a tensor. For the matrix case, one version (cf. [26,29]) defines (k · kF being the Frobenius norm)

X(i):= argmin

X∈L−1(y)

kW_γ1/2(i−1),X(i−1)XkF, Wγ,X := (XXT+ γI)−1,

for a monotonically decreasing sequence {γ(i)_}

i≥0 ⊂ R>0. The tensor variant

straightforwardly is given by (see Section 3) X(i):= argmin X∈L−1_(y) X J∈K (W(J) γ(i−1),X(i−1)) 1/2_(X(i)₎[J] 2 F, (1.4)

where the weight matrices2follow the same generalization with Wγ,X(J) := Wγ,X[J]= X[J](X[J])T + γI

−1

, J ∈ K.

Continued from the vector as well as matrix case, it also here holds true that for a sequence Xγ →X with sufficiently fast declining singular values

X J∈K W(J) γ,XγX [J] γ 2 F = X J∈K nJ X i=1 σi(J)(Xγ)2 σ(J)i (Xγ)2+ γ −→ γ&0 X J∈K rank(X[J]).

Though largely similar to the matrix case, there is however at least one difference as we discuss in Section 2. Due to its dependence on p = 0 and the family K, we abbreviate aboves algorithm (1.4) as IRLS-0K.

1.5. Data sparse optimization. With increasing dimensions d, the size of the space Rn1×...×nd _{quickly becomes prohibitively large. While for smaller instances,}

IRLS-0K is by all means a viable algorithm, it otherwise remains a theoretical ideal. However, for hierarchical families K, that is if

(J ⊂ S ∨ S ⊂ J ∨ J ∩ S = ∅) ∧ J 6= Sc_, _{∀J, S ∈ K,}

(1.5)

so called hierarchical decompositions [15] or, basically synonymously, tensor tree networks (cf. [10,25]) provide remedy in the same way the ordinary low rank matrix decomposition does (cf. [26]). In the latter case, the data space Dr:= {(Y, Z) | Y ∈

Rk1×r, Z ∈ Rr×k2} represents the low rank variety Vk1,k2

≤r := {A ∈ R

k1×k2_{| rank(A) ≤ r}}

via the surjective (but not injective) bilinear map

τr: Dr→ V≤r, τr(Y, Z) := Y Z ∈ Rn1×n2.

(1.6)

(5)

The alternating method AIRLS then only requires to operate on Dr, while directly

minimizing fγ subject to relaxed affine constraints (see Section 4). In the tensor

case, where the rank becomes r = {r(J)_}

J∈K∈ NK, the variety

V_≤rK := {X ∈ Rn1×...×nd_{| X}[J] _{∈ V}nJ,nJc

≤r(J) , J ∈ K},

(1.7)

has a logarithmicly lower dimension and is likewise represented by a data space Dr

together with a simple, surjective and multilinear contraction map τr: Dr→ V_≤rK

(see Section 5). Thereby, a sparse optimization as for matrices is also possible in higher dimensions. Ultimately, also AIRLS-0K (see Section 6) distinguishes itself from well known unregularized alternating least squares (ALS) [20] only through an additional penalty term. However, it thereby not only becomes stable by means of [16], but it is derived from and directly minimizes the objective function fK γ

restricted to V_≤rK.

2. Underlying structure and global behavior

Phrased more generalized, we in principle desire to solve the problem (cf. [26]) of finding X∗∈ argmin X∈L−1(y) C_V(X), C_V(v) := min V∈V: X∈Vdim(V ), (2.1)

where in this setting the family of varieties V is V_dK:= {V_≤rK ⊂ Rn1×...×nd_{| r = {r}(J)_}

J∈K∈ NK0},

for VK

≤ras defined in (1.7). In general however, the dimension of V≤rK does not equal

P

J∈Kr(J), and is thus not directly represented by the sum of ranks as in (1.1).

While

V_≤eKr( V≤rK ⇒ er(J)≤ r(J), J ∈ K, er 6= r ⇒ dim(V≤eKr) < dim(V≤rK),

neither of the converse implications holds true in general. Firstly, some differently indexed varieties are equal since some constellations r ∈ NK

0 are unfeasible [25].

Definition 2.1. The values r = {r(J)_}

J∈K are called (un)feasible (for n ∈ Nd),

if there exists (not) at least one tensor X ∈ Rn1×...×nd _with _rank(X[J]_{) = r}(J)_,

J ∈ K.

For hierarchical sets K, these bounds are (cf. [25]) r(Jeˆ)_{≤ n}

vQe∈Ev\{ˆe}r

(Je) _for

ˆ

e ∈ Ev, v ∈ V . This natural interrelation of ranks is somewhat beneficial to the

simplified ranks approach as it excludes some extremal cases. The sum-of-ranks minimization is itself a necessary relaxation of the (arguably) more desirable objective function C_VK

d, yet it is closer than it might first seem. What remains

however is that, contrarily to the matrix case, the varieties are only partially nested. 2.1. Determinant expansion and convergence of (global) minimizers. Fol-lowing from the matrix case, one can likewise expand the function fγKinto squared

sums of minors defined as det2k(A) := X I∈Pk([nJ]) X J∈Pk([nJc]) det(AI,J)2, k = 1, . . . , nJ,

for AI,J := {Ai,j}i∈I,j∈J ∈ R|I|×|J| and Pk([`]) := {I ⊆ {1, . . . , `} | |I| = k}. For

simplicity of notation, we further define det20(A) := 1.

Corollary 2.2. Let X ∈ Rn1×...×nd _and_{γ ≥ 0. Then}

(6)

with gs(X) := X {kJ_}_J ∈K∈Ξs Y J∈K det2kJ(X[J]), (2.2) forΞs:= {{kJ}J∈K| 0 ≤ kJ≤ nJ, J ∈ K, P_J∈KkJ= s}. Proof. As QnJ i=1(σ (J)

i (X)2+ γ) = det(X[J](X[J])T + γI), the first equality follows

by [26]. The third term is merely a restructured version. The minimizers of these functions are nested in the sense of the following Lemma. Lemma 2.3. For gs(X), s = 0, . . . ,P_J∈KnJ, as in Corollary 2.2, we have

gs(X) = 0 ⇔

X

J∈K

rank(X[J]) < s

for all _{X ∈ R}n1×...×nd_.

Proof. By definition of gs(X), we have

gs(X) 6= 0 ⇔ ∃{kJ}J∈K: X J∈K kJ = s ∀J ∈ K : rank(X[J]) ≥ kJ ⇔ X J∈K rank(X[J]) ≥ s. By Lemma 2.3, it directly follows that each gs(X) = 0 implies gs+1(X) = 0.

With this structure, we can apply the nested minimization scheme as in [26] to conclude the following Theorem 2.4.

Theorem 2.4. Let s∗= min X∈L−1(y) X J∈K rank(X[J]).

Then for any convergent sequence of (global) minimizers Xγ of fγK(X) subject to

L(X) = y, we have X∗:= lim γ→0Xγ ∈_X_∈L−1(y), Pargmin J∈Krank(X[J])=s∗ Y J∈K rank(X[J]₎ Y i=1 σi(J)(X) with σ_rank((X(J) _∗₎[J]₎₊₁(Xγ) 2_{∈ O(γ),} _{J ∈ K.} (2.3)

If there is only one Xs∗∈ L−1(y) withP_J∈Krank(X_s[J]∗) = s∗, then Xγ → Xs∗.

Proof. Since argmin_X∈L−1 gs(X) ⊂ argmin_X∈L−1 gs+1(X) due to Lemma2.3, the

proof is analogous to the corresponding one in [26]. 3. Log-det tensor iteratively reweighted least squares (IRLS-0K)

Although the global minimizers of fK

γ yield the sought solution, it is not

(7)

3.1. Minimization of an augmented function. The augmented map3 analo-gous to fγ corresponding to the tensor function fγK is

J_γK(X, {W(J)_} J∈K) := X J∈K Jγ,nJ(X [J]_{, W}(J)₎ for

Jγ,m(A, H) = trace(H(AAT + γI)) − log det(H) − m

= X

J∈K

kH1/2Ak2F+ γkH1/2k2F− log det(H) − m,

where each W(J)_{∈ R}nJ×nJ _{ranges over W}(J)_{= (W}(J)₎T _{0 (symmetric positive}

definite). Consequently, with the same argumentation as in [11,26], it is ∂ ∂W(J) JγK(X, {W(J)}J∈K) = X[J](X[J])T + γI − (W(J))−1 and thus Wγ,X(J) := argmin W(J)_=(W(J)₎T₀ JγK(X, {W(J)}J∈K) = (X[J](X[J])T + γI)−1. (3.1)

It likewise holds true that

fγK(X) = JγK(X, {W (J) γ,X}J∈K).

(3.2)

Further, the minimizer in X is determined by an ordinary least squares problem. In order to derive the closed form solution for the minimizer, we note that each W(J)_{, J ∈ K, defines linear operations}

(W(J))α: Rn1×...×nd_{→ R}n1×...×nd_, _((W(J)₎α_(X))[J]_{:= (W}(J)₎α_X[J]_, _{α > 0.}

We can thereby write X J∈K k(W(J))1/2X[J]k2_F = X J∈K k(W(J))1/2(X)k2F = kW K (X)k2F, where WK(X) := {(W(J)₎1/2_(X)}

J∈K ∈ Rn1×...×nd×|K|. Based on the operator

WK (cf. [26]), the sought minimizer is given by XWK := argmin X∈L−1(y) JγK(X, {W(J)}J∈K) = cW−1◦ L∗◦ (L ◦ cW−1◦ L∗)−1(y) (3.3) for c WK(X) := (WK)∗◦ WK(X) =X J∈K W(J)(X), (3.4)

where (·)∗ denotes adjoint operators. Further, following [11,26,29], we have c

WK(XWK) ⊥ kernel(L).

(3.5)

Vice versa, XWK is the unique solution to (3.5) subject to L(XWK) = y. A more

stable update formula is provided by [26] through

XWK = X0− K ◦ (K∗◦ cWK◦ K)−1◦ K∗◦ cWK(X0),

(3.6)

where X0 is one arbitrary solution to L(X0) = y and K : R Qd

i=1ni−`_{→ R}n1×...×nd

is a kernel representation of L, whereby image(K) = kernel(L). Due to the sum structure, also the gradient properties generalize to the tensor case.

3_{Due to the distinguishable roles of}_{J ∈ K and the map J}K

γ , we here remain faithful to prior

(8)

Corollary 3.1. It is

∇XfγK(X) = ∇XJγK(X, {W(J)}J∈K)|W(J)_=W(J) γ,X, J∈K.

(3.7)

Thus X is a stationary point of fγK if and only ifX = XWK forW(J)= W (J) γ,X, J ∈

K, which means that (X, {W_γ,X(J)}_J∈K) is a stationary point of JK γ.

As in the matrix case, γ → ∞ provides a unique, canonical starting value. Corollary 3.2. Independently of X(0)_{∈ L}−1_{(y), it holds}

lim γ→∞_Xargmin_∈L−1(y) fγK(X) = lim γ→∞X{Wγ,X(0)(J) }J∈K = argmin X∈L−1(y) kXkF,

where the first limit is possibly a set convergence.

3.2. Complementary weights. In the matrix case [26], there is one more eq-uitable choice f(2)_{(A) = log det(A}T_{A + γI) as opposed to f}(1)

γ (A) = fγ(A) =

log det(AAT_{+ γI). For families K containing more subsets, each set J ∈ K may be}

replaced by its complement. For a subset S ⊂ K, let therefor KS := (K \ S) ∪ {Jc_{| J ∈ S},} _Jc_{:= [d] \ J,}

(3.8)

for W(J)_{= W}(J)

γ,X, J ∈ KS. Although the updates X (K) W and X

(KS₎

W in general differ,

the overall properties outlined in Section3 are not influenced as fγKS(X) = X J∈S nJc X i=1 log(σ(J_i c)(X)2+ γ) + X J∈K\S nJ X i=1 log(σ_i(J)(X)2+ γ) = fγK(X) + X J∈S (nJc− n_J) log γ.

While the weights are in that sense interchangeable, switching between complemen-tary weights becomes essential for AIRLS-0K as captured in Lemma 6.2.

3.3. Adjusted IRLS-0K algorithm. Based on a monotonically declining sequence {γ(i)_}

i≥0⊂ R>0 (cf. Definition1.1), and (optionally) a sequence Si ⊂ K (cf.

Sec-tion3.2), Algorithm1defines the sequence {(X(i)_{, {W}(i,J)_}

J∈K)}i≥0with L(X(i)) =

y and (W(i,J))T = W(i,J) 0, J ∈ K, i ≥ 0. These iterates behave largely analo-gously to the matrix version [26] (cf. [5,7,11,29]). In particular, that case is included in Theorem3.3for d = 2 and K = {{1}}.

Algorithm 1 Iteratively reweighted least squares with switching weights

1: set X(0)∈ L−1(y), γ(0)> 0

2: fori = 1, 2, . . . do

3: set Si−1 ⊂ K (cf. Section3.2)

4: {W(i−1,J)}_J_∈KSi−1 := {W_γ(J)(i−1),X(i−1)}J∈KSi−1 (cf. (3.1)) 5: X(i):= X_WKSi−1(i−1) (cf. (3.3))

6: set γ(i)≤ γ(i−1) 7: end for

Theorem 3.3. Let {(X(i)_)}

i≥0 be generated by Algorithm 1for {Si}i∈N0 and the

weakly decreasing sequence {γi}i≥0⊂ R>0. Let further Sγ∗⊂ L−1(y) be the

station-ary points offK

γ|L−1(y) forγ > 0, as well as γ∗:= limi→∞γ(i).

(i) For each _{i ∈ N and each S ⊂ K, it holds} f_γK(i)S(X(i)) ≤ fK

S

(9)

(ii) Ifγ∗_{> 0, then the sequences X}(i)_{and |f}KS

γ(i)(X

(i)_{)|, S ⊂ K, remain bounded.}

(iii) Further, if γ∗> 0, then lim

i→∞kX

(i)_{− X}(i−1)_k F = 0

(3.9)

and each accumulation point of X(i) _{is in S}∗ γ∗.

(iv) (See Remark 3.4) Let Θ ⊂ R>0 be an arbitrary, infinite, bounded set with

its only accumulation point at inf(Θ) = 0, and let δi:= inf

S∈S∗ γ(i)

kX(i)− Sk, i ∈ N.

For an arbitrary, bounded sequence A = {αi}i∈N0 with inf(A) > 0 (e.g.

αi= 1, i ∈ N0) and for γ(0)= max(Θ), we recursively define

γ(i+1)= (

θi ifαiδi < θi

γ(i) _otherwise , θi:= max{z ∈ Θ | z < γ (i)_},

i ∈ N0.

Then limi→∞δi = γ∗ = 0 and for at least one subsequence {X(i`)}`∈N,

there exists a sequence of stationary points {S`}`∈N,S`∈ S_γ∗_(i`), with kS`−

X(i`)_{k → 0.}

Remark 3.4. Part (iv) of Theorem3.3as well as its proof are literally the same as in the matrix case [26]. Roughly, if the sequence {γ(i)_}

i∈Nis decreased to γ∗= 0 slowly

enough, then X(i)_{can only converge to a limit of stationary points of f}

γ|L−1(y) for

γ & 0. The contrary case of too fast decline has been covered in [26] as well.

Proof. See AppendixA.

4. Relaxed iteratively reweighted least squares

Too large mode sizes n or high dimensions d in practice prohibit to even operate on the spaces L−1_{(y) or R}n1×...×nd _{directly. As hinted on in Section}_1.5_{, so called}

hierarchical decompositions can provide remedy in the same way low rank matrix decompositions do. This however first requires to relax the affine constraint L(X) = y.

4.1. Relaxation of affine constraint. Let aγ(s) := s −P_J∈KnJlog(γ), γ > 0.

As each of these function is monotonically increasing, a composition with such does not change minimizers. We correspondingly define

fγa,K(X) := aγ◦ fγK(X) = log Y J∈K ∞ Y i=1 (1 +σ (J) i (X)2 γ ), (4.1)

with σi(J)(X) := 0 for i > nJ, J ∈ K. Likewise, let Jγa,K(X, {W(J)}J∈K) :=

aγ◦ JγK(X, {W(J)}J∈K). With the same reasoning as in [26], one then defines

Fγ,ωa,K(X) := kL(X) − yk2F+ cL· ω2· fγa,K(X),

J_γ,ωa,K(X, W ) := kL(X) − yk2

F+ cL· ω2· Jγ,aK(X, W ).

(4.2)

for an appropriate scaling constant cL. As ∂γ∂ F a,K

γ,√γ(X) = cL· ∂

∂γ(γ · fγa,K(X)) ≥ 0,

(10)

Algorithm 2 Subspace restricted IRLS with switching weights 1: set X(0)∈ L−1(y), γ(0)> 0 2: fori = 1, 2, . . . do 3: set Si−1 ⊂ K (cf. Section3.2) 4: {W(i−1,J)}_J ∈K(Si−1) := {W (J) γ(i−1),X(i−1)}J∈K(Si−1) (cf. (3.1)) 5: set a subspace T_i−1 ⊂ Rn1×...×nd _{with T}

i−13 X(i−1)

6: X(i):= argmin_X∈T_i₋₁ J_γa,K(i−1)(Si−1)(X, {W(i−1,J)}J∈K(Si−1)) (cf. (4.2))

7: set γ(i)≤ γ(i−1) 8: end for

4.2. Subspace dependent, relaxed optimization algorithm. To later incor-porate the alternating optimization, we here also consider an additional sequence of subspaces {Ti}i∈N0 with Ti⊆ R

n1×...×nd_{, as well as T}

i∩ Ti−1 3 X(i), i ∈ N0. The

latter condition ensures that the previous iterate remains admissible. This then yields the modified Algorithm2. While the objective function is still monotonically decreased as provided by Corollary4.1, to show the remaining parts of Theorem3.3

as far as possible for now remains subject to future research. Corollary 4.1. For X(i)_{as defined by Algorithm} ₂_{it holds}

F_γa,K(i)S(X

(i)_{) ≤ F}a,KS

γ(i−1)(X(i−1)),

for all i ∈ N and all S ⊂ K.

Proof. The argumentation is the same as in Theorem 3.3 part (i) as steps (a) to

(g) analogously hold true (cf. Section 4.1).

5. Hierarchical decomposition

We briefly reintroduce hierarchical tensor decompositions [15] as tensor tree net-works with reference to the introductory Section1.5. For further reading, we rec-ommend [8,10,15,17,23,25,30].

5.1. Notational deviation. In the following, G = (V, E) denotes a tree graph with vertices V ⊇ [d] and edges E ⊆ {{v, w} | v 6= w ∈ V }. Due to the complex description of general tensor (tree) networks, we require a certain minimum of notational deviation. That is, we dismiss the order of modes when indexing tensors. Instead, in order to avoid ambiguity, each specific object is consistently referenced with the same, distinctly assigned labels, based on the graph G = (V, E). The first group is given by αS = {αµ}µ∈S, for αS ∈ [nS], nS =Q_µ∈Snµ, S ⊆ V . We set

nµ = 1 for µ > d, but any such αµ is only denoted when required for notational

simplicity. Further, the second group is given by β = {βe_}

e∈E with βe ∈ [r(Je)],

Je∈ K (see Section5.3), whereas the measurement index is denoted by ζ ∈ [`]. For

each such label, we correspondingly define the spaces Hαµ := R

[nµ]_{, v ∈ V,} _H

βe := R[r (Je)_]

, e ∈ E, Hζ := R[`].

The entirety of labels is formally required to be ordered, but the exact ordering is irrelevant. To each collection Γ of such labels, we consequently assign the space

HΓ:=

O

γ∈Γ

Hγ.

(5.1)

Some, in particular labels corresponding to edges also appear as unequally treated, so called primed labels βe06= βe_{, e ∈ E. Each is however still thought to refer to the}

(11)

it shall become apparent that it is in fact mostly redundant to explicitly denote these labels. While we nevertheless here hold on to indices, SectionSM5does make use of this fact to more compactly repeat some of following statements and lay out their proofs. What is here introduced as notation, is the basis to the formalized arithmetic introduced in [25]. For the Matlab toolbox that realizes the latter through automated contractions, on which the implementation of (A)IRLS-0K is based on, please contact the author.

5.2. Graph notation. We denote each the path from excluding c ∈ V to excluding v ∈ V \ {v} within a tree G = (V, E) as the unique ordered set

c ˚+v := (p1, . . . , p−1) = p ⊂ V,

(5.2)

for which {c, p1} ∈ E, {pi, pi+1} ∈ E, i = 1, . . . , |p| − 1 as well as {p−1, v} ∈ E.

We further define the neighbors of v ∈ V , as well as the predecessor and set of descendants of v ∈ V \ {c} relative to c ∈ V as

neigh(v) := {h ∈ V | {h, v} ∈ E}, predc(v) := p−1, descc(v) := neigh(v) \ {p−1}.

We define the branches relative to c ∈ V as

branchc(v) := {v} ∪ {b ∈ V \ {c, v} | v ∈ c ˚+b}.

Each root c ∈ V splits the graph into the multiple connected components of V \{c}, ˙

[

h∈neigh(c)branchc(h) = V \ {c}.

For any v 6= w ∈ V , we further define the sets Jw(v) := branchw(v) ∩ [d]. Thus if

e = {v, w} ∈ E is an edge, then Jw(v) ˙∪ Jv(w) = [d].

5.3. Tree corresponding to hierarchical family. Without loss of generality, we from here on postulate that hierarchical families K (cf. Section 1.5) are by definition also dimension separating. That is, we assume that there does not exist a map π : [d] → [d − 1], for which π(J) /∈ {π( ˆJ), [d − 1] \ π( ˆJ)} for all J, ˆJ ∈ K. Lemma 5.1. Each (dimension separating) hierarchical family K defines an, up to equivalence, unique tree GK = (V, E), V ⊇ [d] and root c ∈ V , for which |E| =

|V | − 1 = |K| and K = {Jc(v)}v∈V \{c} — and vice versa.

Proof. See for example [15,25].

Definition 5.2. Let GK correspond to the hierarchical family K. We define Je∈

{Jw(v), Jv(w)}, e = {v, w} ∈ E, as each the one set that is contained in K.

This convention implies a bijection K = {Je| e ∈ E} to E. The simple graph

that corresponds to the matrix case K2= {{1}} for d = 2 is for instance given by

the tree

GK2 = (V, E), V = {1, 2}, E = {{1, 2}},

whereby J_{1,2}= {1}. For KTucker= {{1}, . . . , {d}} (cf. Example5.5), we have

G_KTucker= (V, E), V = {1, . . . , d + 1}, E = {{1, d + 1}, . . . , {d, d + 1}},

(5.3)

and J_{µ,d+1}= {µ}, µ ∈ [d]. As required later, for subsets S ⊂ V , we further define ES := {{v, w} ⊂ E | v ∈ S, w ∈ neigh(v)},

(5.4)

˚

ES := {{v, w} ⊂ E | v, w ∈ S}, ∂ES:= ES\ ˚ES.

(12)

5.4. Representation map corresponding to tree. Whereas each hierarchical family K defines a tree G_K = (V, E), each such (not necessarily rooted) graph together with r ∈ NK _{in turn defines a certain data space D}

rand a representation

map τr: Dr→ Rn1×...×nd for values r ∈ NK.

Definition 5.3. With reference to Section 5.1, let Dr:=

×

v∈V Hmv, mv:= {β e_} e∈Ev ∪ ( {αv} if v ∈ [d], ∅ otherwise.

The dimension of each node Nv ∈ Hmv, {Nv}v∈V ∈ Dr, is thus the degree of

v ∈ V , plus one if v ∈ [d]. The representation map τris now defined as the map that

proceeds each a contraction over modes with common labels. With the notation declared in Section5.1, we may write

τr(N )α1,...,αd:= X βe_∈E Y µ∈[d] (Nv)αv,{βe}e∈Ev Y v∈V \[d] (Nv){βe_}_e ∈Ev, (5.5) where αµ∈ [nµ], µ ∈ [d].

Example 5.4. In the matrix case with r = r(J_{1,2}) _{∈ N, we simply have an}

ordinary matrix multiplication (cf. Section 1.5) τr(Y, Z)α1,α2 =

Pr

β=1Yα1,βZβ,α2,

where the summation ranges over α1 ∈ [n1] and α2 ∈ [n2]. Here, β = β{1,2}∈ [r]

is the label assigned to the only edge.

Example 5.5. For d ∈ N, the Tucker format [40] or MLSVD4_[₉_{] is defined through} the graph KTucker (5.3) and consists of the components {Nµ}d+1µ=1 ∈ Dr of sizes

Nµ ∈ Rnµ×r(J{µ,d+1}) and Nd+1 ∈ Rr(J{1,d+1})×...×r(J{d,d+1}). The corresponding

contraction map is given by (though less convenient when written out in particular cases) Xα1,...,αd= τr(N1, . . . , Nd, Nd+1)α1,...,αd = r(J{1,d+1})_X β{1,d+1}=1 . . . r(J{d,d+1})_X β{d,d+1}=1 (N1)α1,β{1,d+1}. . . (Nd)αd,β{d,d+1}(Nd+1)β{1,d+1},...,β{d,d+1},

for αµ= 1, . . . , nµ, µ = 1, . . . , d as visualized in Fig. 1.

Nd+1 N1 N2 N3 N4 α1 α2 α 3 α 4 β{1 ,d+1 } β {2 ,d+1 } β{₃ ,d +1 } β {4,d +1 } e Nd+1 Ned+2 N1 N2 N3 N4 α1 α2 α 3 α 4 β{1 ,d+1 } β {2 ,d +1 } β{d+1,d+2} β_{ 3 ,d +2 } β {4_,d +2 }

Figure 1. [Left] The contraction diagram for the Tucker representation in Example 5.5 for d = 4. The dotted line indicates the part which for J = {1} yields Z(J)_{, whereas} _Y(J) ₌_N

1 (cf. (5.7)). [Right] A balanced binary

hierarchical Tucker (HT) representation (cf. Section5.6) for the exhaustive family K = {{1, 2}, {1}, . . . , {4}}, Y({1,2})₌_τ_r_({N₁_{, N}₂_{, e}_N_d+1}) (cf. (5.6)). In contrast to conventional literature (cf. [15]), the root node has been omitted as it is redundant here (cf. [25]).

(13)

While initially defined on the whole network, we can also extend the map τr to

contract nodes over any subset S ⊂ V via τr({Ns}s∈S){αs}s∈S,{βe}e_∈∂ES := X βe_{: e}_∈˚_E_S Y v∈S (Nv)αv,{βe}e∈Ev, (5.6)

for αs∈ [ns], s ∈ S, and ∂ES as well as ˚ES as defined by (5.4). Here, some αv,

that is for v > d, are redundant (cf. Section5.1).

5.5. Decomposition theorem. The following theorem is fundamental to hierar-chical tensor approximation theory.

Theorem 5.6 ( [15]). Let GK be the tree corresponding to the hierarchical family

K (cf. Lemma 5.1). Then for each r = {r(J)}J∈K∈ NK, the according multilinear

representation map τr: Dr→ Rn1×...×nd is non-injective withimage(τr) = V_≤rK.

In other words, for each tensor X ∈ Rn1×...×nd _{with rank(X}[J]_{) ≤ r}(J)_{, J ∈ K,}

there exists a (non-unique) decomposition N ∈ Dr with X = τr(N ). Each edge

e = {v, w} ∈ E, assuming J = Je= Jw(v) ∈ K, splits the tree into two disconnected

subgraphs and yields a corresponding matrix decomposition

X[J]= Y(J)Z(J), Y(J)∈ R[nJ]×r(J)_, _Z(J)_{∈ R}r(J)×[nJc]_.

(5.7)

The matrices Y(J)_{and Z}(J)_{are obtained by contractions over each (N}

h)h∈branchw(v)

and (Nh)h∈branchv(w), respectively. In explicit, abbreviating S = branchw(v), we

have Y_α(J)_J_,βe = τr({Ns}s∈S){αµ}µ∈J,βe = X βe_{: e}_∈˚_E S Y v∈S (Nv)αv,{βe}e∈Ev,

for ˚ESas defined in Section5.3. Given (5.7), it is easy to see that indeed image(τr) ⊆

VK

≤r, whereas the other direction requires some more work (cf. [15,25]).

Lemma 5.7. The dimension of the variety corresponding to a feasible _{r ∈ N}K for a hierarchical family K is dim(V_≤rK) = X µ∈[d] nv Y e∈Eµ r(Je)₊ X v∈V \[d] Y e∈Ev r(Je)₋X e∈E (r(Je)₎2_,

where G_K= (V, E) is the corresponding graph. The set VK

=r in turn is a manifold

of equal dimension.

Proof. Follows by a generalization of the argumentation in [21,41]5. 5.6. Exhaustive hierarchical families. The larger the family K, the more reg-ularizing the IRLS approach. Thus, one may desire such to be exhaustive in the following sense.

Definition 5.8. Let K be a hierarchical family. We say K is exhaustive if there does not exist another hierarchical family eK with eK ) K.

Exhaustive hierarchical families in a certain sense yield particularly data sparse formats as specified in the following Lemma 5.9. For any such family, it further holds |K| = 2d − 3 = |E| and |V | = 2d − 2 (cf. Lemma 5.1).

Lemma 5.9. Let K be an exhaustive hierarchical family. Then GK consists only

of inner vertices v ∈ V \ [d] of degree 3 and leafs v ∈ [d] ⊂ V of degree 1.

Proof. See for instance [15,25].

(14)

The Tucker family KTucker for example is not exhaustive (for d ≥ 4). The degree

of the vertex d + 1 ∈ V is d, whereby the dimension of the node Nd+1 is d as

well. For d = 4, all exhaustive families are equivalent (up to permutation of modes) to K = {{1, 2}, {1}, {2}, {3}, {4}} (see Fig. 1). In general, exhaustive hierarchical families correspond to so called binary hierarchical Tucker formats (cf. [15,25]). 5.7. Rooted trees and orthonormalization. A root c ∈ V , if at all, may be chosen freely, leading us back to the choice of complementary weights in Section3.2. Lemma 5.10. For each c ∈ V , there exists a unique subset Sc ⊂ K for which

KSc _{= {J}

c(v) | v ∈ V \ {c}} (cf. Section5.2).

Proof. Follows directly with Sc= {Jc(v)c| Jc(v) /∈ K, v ∈ V \ {c}} ⊆ K.

The set equality in Lemma5.10implies that for each J ∈ KSc_{, there is a unique}

vertex v =: vc,J∈ V \ {c} with J = Jc(v). Note that only the sets Sc, c ∈ V , again

lead to hierarchical families KSc _{as opposed to the 2}|K| _{generally possible subsets}

S ⊂ K. One can utilize the non-injectivity of the map τr to orthonormalize the

representation in the sense of the following Theorem5.11, yet without the need to calculate the represented, full tensor.

Theorem 5.11. Let r(J)= rank(X[J]), J ∈ K. Then there exists a representation X = τr(N ), N ∈ Dr, such thatY(J)∈ R[nJ]×r

(J)

,J ∈ KSc_{, as defined in} ₍_5.7_{), are}

orthonormal matrices.

Proof. Can for example be found in [25].

Note that the matrices Y(J) _{in Theorem}_5.11_{are defined via the representation}

N . If may further be achieved that these matrices each consist of the left singu-lar vectors, Y(J) _{= U}(J)_{, of the compact matrix SVDs X}[J] _{= U}(J)_Σ(J)_(V(J)₎T_,

J ∈ KSc_{. Thereby, the decomposition in fact becomes essentially unique}6 _[₁₅_,₂₅_].

However, mere orthonormality is in general sufficient and can be ensured with sig-nificantly less effort in an alternating optimization. In case of the Tucker format (Example 5.5), if indeed Y(J) _{= U}(J)_{, this canonical form is specifically known}

as MLSVD [9], while for the tensor train format [32], it is known as canonical MPS [42]. General canonical forms of tensor tree networks and their properties are further discussed in [25].

6. Alternating iteratively reweighted least squares (AIRLS-0K) In this section, let K be a hierarchical family, G_K = (V, E) the corresponding tree as well as τr : Dr → V_≤rK, with Dr =

×

v∈V Hmv, the representation map for

r ∈ NKas described in Section5. The idea of alternating least squares (ALS) is to in each step fixate N = {Nv}v∈V ∈ Drbut the one component Nc, where the root

c ∈ V iteratively cycles through all vertices. We therefor define the linear map N_6=c: Hmc → V_≤rK, N6=c( ˆNc) = τr({Nv}v∈V \{c}∪ { ˆNc}).

As the image of that map is independent of the specific, chosen representation, we obtain the (well defined) subspace (cf. Section 4.2)

Tc(τr(N )) := image(N6=c).

(6.1)

Though one avoids to ever calculate the full tensor X = τr({Nv}v∈V) ∈ Rn1×...×nd,

we define the resulting update as

X_γ,ω,WN,c := argminX∈Tc(N ) J

a,KSc

γ,ω (X, {W(J)}J∈KSc),

(6.2)

(15)

as well as (Nc)N,cγ,ω,W ∈ Hmc via N6=c((Nc)

N,c

γ,ω,W) := X N,c

γ,ω,W. The subset Sc ∈ K

is defined as by Lemma 5.10, the objective functions in (4.2). The subsequent sections are summarized in Algorithm 3, though it generates the same iterates X(i)_{= τ}

r(N(i)), i ∈ N0, as Algorithm 2 when the subspaces are chosen according

to (6.1).

6.1. Sweeps, micro steps and stability. The update X_γ,ω,WN,c (cf. (6.2)) is in-dependent of the specific, chosen representation N of the previous iterate X (cf. Theorem 5.6). Thus, for each c ∈ V , the updating maps

M(c)_r : Dr→ Dr, M(c)r (N ) := {Nv}v∈V \{c}∪ {(Nc)N,cγ,ω,W}, W = Wγ,τr(N ),

operating on the data space, as well as the one operating on the full tensor space, ζ_M(c): Rn1×...×nd→ Rn1×...×nd, ζ_M(c)(X) := τ_r(X)◦ M(c)_τ

r(X)◦ τ

−1 r(X)(X),

are well defined. Here, r(X) ∈ NKdenotes the ranks of each X and N = τr−1(X) is

each an arbitrary representation. A whole sweep (for fixed γ and ω) is defined as Mr:= c∈V M(c)r , ζM:= c∈V ζM(c),

where the order of composition may be chosen as most suitable. Issues around these functions in particular concerning stable rank adaptivity have been discussed in [16].

6.2. Representation based evaluation. In order to obtain a practically viable algorithm, it remains to show that each next iterate X_γ,ω,WN,c (cf. (6.2)) given W = Wγ,τr(N ) and ω =

√

γ > 0 (cf. Section 4.1) can indeed be calculated through its representation, that is, without the need to construct full tensors in Rn1×...×nd_.

The updated node (Nc)N,c_γ,ω,W ∈ Hmc, for which X

N,c

γ,ω,W = N6=c((Nc) N,c

γ,ω,W), is given

by the lineare least squares problem (cf. Sections3.1,3.2 and4.1) (Nc)N,cγ,ω,W = argmin e Nc∈Hmc kL ◦ N_6=c( eNc) − yk2+ cLγ X J∈KSc k(W(J))1/2N_6=c( eNc)k2F,

for W(J) _{as in (}_3.4_{). The minimizer is thus given as solution (N}

c)N,c_γ,ω,W := eNc to N_6=c∗ ◦ L∗◦ L ◦ N_6=c( eNc) + cLγ X J∈KSc N_6=c∗ ◦ W(J)◦ N_6=c( eNc) = N_6=c∗ ◦ L∗(y). (6.3)

Following are two aspects that are required for a representation based evaluation. The first one in Section6.3depends on the operator L itself and can in that sense not be influenced. The second one in Section 6.4in turn merely asks for the right choices of Sc, namely the one in Lemma5.10, and can thus always be achieved.

6.3. Decomposition of measurement operator. Like each linear operator, L : Rn1×...×nd_{→ R}` _{has a tensor description L ∈ R}`×n1×...×nd _{in terms of}

L(X)ζ = n1 X α1=1 . . . nd X αd=1 Lζ,α1,...,αdXα1,...,αd. (6.4)

This tensor L must itself somehow allow for an efficient handling. Similar to The-orem 5.6, each L ∈ R`×n1×...×nd _{can for some r}

L ∈ NK (assumed to be low) be

(16)

The symbols εe_{, e ∈ E, are additional labels with range ε}e_{∈ [r}(Je)

L ], whereas ζ ∈ [`].

The assigned multilinear representation map is ρrL(L)ζ,α1,...,αd:= X εe_{: e}_∈E Y µ∈[d] (Lµ)ζ,αµ,{εe}e∈Eµ Y v∈V \[d] (Lv){εe_}_e ∈Ev,

For simplicity of notation, as with the representation N , we will also denote indices ζ and αv in nodes Lv with v > d. While this is formally compatible as long as

(Lv)ζ,αv,{εe}e∈Ev, v > d, is constant in ζ, these indices can likewise be omitted.

Sampling operators for instance can be decomposed for rL(J)≡ 1, J ∈ K.

Example 6.1. For d ∈ N, the operator decomposition corresponding to the Tucker graph KTucker(5.3) consists of the components {Lv}v∈V of sizes Lµ ∈ R`×nµ×r

({µ})

and Ld+1∈ Rr(J{1,d+1})×...×r(J{d,d+1}). The corresponding contraction map ρrL is

Lζ,α1,...,αd = ρrL(L1, . . . , Ld, Ld+1)ζ,α1,...,αd = r(J{1,d+1})L _X β{1,d+1}=1 . . . rL(J{d,d+1})_X β{d,d+1}=1 (L1)ζ,α1,β{1,d+1}. . . (Ld)ζ,αd,β{d,d+1}(Ld+1){β{µ,d+1}}µ∈[d],

for αµ= 1, . . . , nµ, µ = 1, . . . , d and ζ = 1, . . . , ` as visualized in Fig.2.

Ld+1 L1 L2 L3 L4 α1 α2 α 3 α 4 ε{1 ,d+1 } ε{1 ,d+1 } ε {2 ,d+1 } ε {2 ,d+1 } ε {3 ,d +1 } ε { 3 ,d +1 } ε{4,d +1 } ε{4,d +1 } ζ

Figure 2. The contraction diagram for the Tucker-like decomposition of L as in Example6.1ford = 4 (cf. Example5.5).

In general, when all summations over αv, v ∈ [d], are proceeded first, then L(X)

can be efficiently evaluated by means of the tree structure of (cf. Proposition 6.5) L(X)ζ = X εe_,βe_{: e}_∈E Y v∈V X αv (Lv)ζ,αv,{βe}e∈Ev(Nv)αv,{βe}e∈Ev , ζ ∈ [`]. (6.5)

Note that we have here again made use of the redundant additional indices ζ and αv for v > d. Likewise, the composition of L and N6=ccan be proceeded efficiently.

6.4. Equivalent low rank weights. The switching between each complementary weights introduced in Section3.2has the following motivation.

Lemma 6.2. Letc ∈ V and Sc be as in Lemma5.10, and letN be a representation

for which Y(J)_, _{J ∈ K}Sc_{, are orthonormal (cf. Theorem} _5.11_{). Then the update}

Xγ,ω,WN,c as defined in(6.2) for the rank nJmatricesW(J)= Wγ,X(J) = (X[J](X[J])T+

γI)−1,J ∈ KSc_,_{X = τ}

r(N ), is the same as for the rank r(J) matrices

W(J)= W_γ,N,c(J) := Y(J)(H(J)+ γI)−1(Y(J))T, H(J):= Z(J)(Z(J))T, J ∈ KSc_.

Proof. It suffices to show that for every eNc ∈ Hmc, we have

(W_γ,X(J))1/2_N

(17)

Let the orthonormal matrix UJ,⊥ _{∈ R}nJ×nJ−r(J) _{span the orthogonal complement}

of the r(J) _{dimensional space range(Y}(J)_{). Then}

(W_γ,X(J))1/2= (W_γ,N,c(J) + γ−1UJ,⊥(UJ,⊥)T)1/2= (W_γ,N,c(J) )1/2+ γ−1/2UJ,⊥(UJ,⊥)T. It thus remains to show that range(N_6=c( eNc)[J]) ⊥ range(UJ,⊥) for all eNc. As by

construction of Sc, the matrix Y(J)does not depend on the vertex c ∈ V , we have

range(N6=c( eNc)[J]) ⊆ range(Y(J)).

6.5. Path evaluations. Lemma 6.2allows for a further, significant simplification in the evaluation of (Nc)N,cγ,ω,W as defined in (6.3). Let in the following c ∈ V and

ˆ

J ∈ KSc _{be fixed. Further, let v}

c, ˆJ ∈ V \{c} be the uniquely determined vertex with

Jc(vc, ˆJ) ∩ [d] = ˆJ (cf. Section 5.2), as in Section5.7. Without explicit indication

of the dependence on the above, we denote (cf. (5.2) and (5.4))

p := c ˚+v_{c, ˆ}_J ⊆ V \ {c, v_{c, ˆ}_J}, EY := ∂E_{c}∪p= Ec\ {e1} ∪ ∂Ep.

For empty p, we set p1= vc, ˆJ and p−1= c for convenience. We further define the

edges e1= {c, p1} /∈ EY and ˆe = {p−1, vc, ˆJ} ∈ EY, such that ˆJ = Jeˆ.

Theorem 6.3 (cf. Section SM5.2). Let c ∈ V and Sc be as in Lemma 5.10,

and let N be a representation for which Y(J)_, _{J ∈ K}Sc_{, are orthonormal (as in}

Theorem 5.11). Further, let the operator N_6=c∗ ◦ W( ˆJ) _{◦ N}

6=c : Hmc → Hmc be

described by the matrixA( ˆJ)_{∈ H}

mc⊗ Hmc. Then (cf. Fig.3) A( ˆ_αJ)0_c,{βe0_}_e∈Ec_{; α}_c_,_{βe_} e∈Ec = δα0c,αc Y e∈Ec\{e1} δβe0,βeM( ˆJ) βe1 0,βe1, (6.6)

where each δγ0,γ∈ {0, 1} is a Kronecker delta, as well as

(6.7) M_β( ˆJ)e1 0,βe1 = X βe0_,βe_{: e} ∈Ep\{e1}, αv: v∈p Y e∈∂Ep\{e1,ˆe} δβe0_,βe Y v∈p (Nv)αv,{βe0}e∈Ev(Nv)αv,{βe}e∈Ev ! (H( ˆJ)+ γI)−1_βe 0ˆ_,βˆe,

and further (cf. Fig. 4) (6.8) H_β( ˆJ)ê0_,βê= X βe0_,βe_{: e} ∈E{c}∪p\{ê}, αv: v∈{c}∪p Y e∈EY\{ê} δβe0,βe Y v∈{c}∪p (Nv)αv,{βe0}e∈Ev(Nv)αv,{βe}e∈Ev ,

for each αv ∈ [nv], v ∈ V , and βe0, βe∈ [r(Je)], e ∈ E.

Proof. See Figs.3and4. For the rigorous, though exceedingly technical proof, see Section SM4. A more elegant version can be found in SectionSM5.2.

The formula for A( ˆJ) _{simplifies whenever ˆ}_{e ∈ E}

(18)

Nc Np1 Np−1 Y( ˆJ) Y(Je) Y(Je) Y(Je) Y(Je) βe1 β ê Nc Np1 Np−1 Y( ˆJ) Y(Je) Y(Je) Y(Je) Y(Je) βe1 0 β ê0 αJ αp1 (Z( ˆJ)_(Z( ˆJ)₎T_{+ γI)}−1 Y( ˆJ) β _Y( ˆJ) ˆ e0 _βê = Nc Np1 Np−1 Nc Np1 Np−1 δβe0,βe αp1 (Z(J)_(Z(J)₎T_{+ γI)}−1 _β ê βê0

Figure 3. Network diagram for A( ˆJ) representing N_6=c∗ ◦ W( ˆJ)◦ N6=c (cf.

(6.6) and (6.7)) for a particular case of a certain K, network N and a path p = (p1, p−1) of length |p| = 2. Contractions over labels αS,S ⊂ [d], are in

gray, whereas uncontracted modes are visualized via dashed lines. Here, it is c, p−1∈ [d], but p/ 1∈ [d]. We recommend to view the digital version for better

readability. [Lefthand] Emphasized are the segmentsG( ˆJ)_(N∗

6=c at south-west

and N_6=c south-east) and W_γ,N,c( ˆJ) (north) as in Theorem 6.3. The lighter shaded nodes are the partial contractionsY(Je) _for_{e ∈ E}

Y, the orthogonality

constraints of which are indicated with corresponding arrows. [Righthand] The contracted version in which only the nodes {Nv}v∈p and their copies (as

encircled) as well as the matrix (H( ˆJ)_+γI)−1_{= (Z}( ˆJ)_(Z( ˆJ)₎T_+γI)−1_remain

as well as some delta tensors.

Nc Np1 Np−1 Y(Je) Y(Je) Y(Je) Y(Je) βeˆ Nc Np1 Np−1 Y(Je) Y(Je) Y(Je) Y(Je) βê αJ αp1 = Nc Np1 Np−1 βê Nc Np1 Np−1 βê0 αp1

Figure 4. Network diagram for H( ˆJ)=Z( ˆJ)_(Z( ˆJ)₎T _{(cf. (}_6.8_{)) for the same}

particular case as in Fig. 3. [Lefthand] The lighter shaded nodes are the partial contractionsY(Je) _for_{e ∈ E}

Y, the orthogonality constraints of which

are indicated with corresponding arrows. [Righthand] The contracted version in which only the nodes {Nv}_v∈{c}∪pand their copies (as encircled) remain.

Corollary 6.4. In Theorem 6.3, if ˆe = e1= {c, v} for v ∈ neigh(c), then p = ∅.

Thus, we haveM( ˆJ)_{= (H}( ˆJ)_{+ γI)}−1 _and

H_β( ˆJ)ê0_,βeˆ= X βe: e∈Ec\{ê}, αc (Nc)αc,{βe}e∈Ec\{ê},βê0· (Nc)αc,{βe}e∈Ec, forαv∈ [nv], and βe0, βe∈ [r(Je)], e ∈ Ec.

6.6. Branch evaluations. As described in the following, each expression in the update formula of (Nc)N,c_γ,ω,W (cf. (6.3)) can be rewritten, such that reevaluations of

identical terms are avoided. This is particularly useful (Proposition6.5) for the first measurement related summand (L◦N6=c)∗◦L◦N6=c: Hmc → Hmcand the righthand

side (L ◦ N_6=c)∗ _{: R}` _{→ H}

mc since only few branch-wise evaluations change after

(19)

Proposition 6.5. Let c ∈ V and each Je ∈ KSc, e ∈ E. Further, let L ◦ N6=c :

Hmc→ R

` _{be represented by the tensor}_F

c∈ R[`]⊗ Hmc. Then (Fc)ζ,αc,{βe}e∈Ec = X εe_{: e}_∈E_c (Lc)ζ,αc,{εe}e∈Ec Y v∈neigh(c) S(J{c,v}) ζ,β{c,v},ε{c,v},

forζ ∈ [`] (not being contracted), as well as S(Jeˆ) ζ,βeˆ_,εeˆ= X εe_,βe_{: e} ∈Ev\{ˆe} αv (Lv)ζ,αv,{εe}e∈Ev(Nv)αv,{βe}e∈Ev Y b∈descc(v) S(J{v,b}) ζ,β{v,b},ε{v,b},

fore = {predˆ c(v), v}, v ∈ V \ {c} and ζ ∈ [`] (not being contracted).

Proof. See SectionSM5.3.

The paths appearing in the evaluation of A( ˆJ)_{∈ H}

mc⊗ Hmc representing N_6=c∗ ◦

W( ˆJ)_{◦ N}

6=c : Hmc → Hmc, J ∈ KS

c _{(cf. Theorem} _6.3_{) naturally overlap. In the}

evaluation of A :=P_J_∈KA( ˆJ) _{(cf. (}_6.3_{)), this can be utilized.}

Proposition 6.6. Letc ∈ V and each Je∈ KSc,e ∈ E. It is

X ˆ J∈K A( ˆ_αJ)0_c,{βe0_}_e∈Ec_{; α}_c_,_{βe_} e∈Ec = δα0c,αc X v∈neigh(c) Y e∈Ec\{{c,v}} δβe0,βeB (J_{c,v}) β{c,v}0,β{c,v} with B(Jeˆ)_{= (H}(Jˆe)_{+ γI)}−1₊P b∈descc(v)Be (J_{v,b})_,_{e = {pred}_ˆ c(v), v}, v ∈ V \ {c},

as well as, for b ∈ descc(v),

e B(J{v,b}) βe 0ˆ_,βê = X βe_{: e} ∈Ev\{ê}, β{v,b}0_{, α}_v (Nv)αv,βe0ˆ,{βe}e∈Ev \{{v,b},ê},β{v,b}0· B (J_{v,b}) β{v,b}0,β{v,b}· (Nv)αv,{βe}e∈Ev.

Due to the recursive structure in Proposition 6.6, the evaluation is to be pro-ceeded in order leaves to root. In turn, also the matrices H(Jˆe)_{, ˆ}_{e ∈ E, (cf. (}_6.8₎₎

can be simplified, but in the opposing root to leaves order. The starting points for this recursion are given by Corollary 6.4.

Proposition 6.7. Let c ∈ V and each Je ∈ KSc, e ∈ E. For ˆe = {predc(v), v},

v ∈ V \ {c}, and b ∈ descc(v), it is H(J{v,b}) β{v,b}0_,β{v,b}= X βe: e∈Ev\{{v,b}} βê 0_{, α} v (Nv)αv,β{v,b}0,{βe}e∈Ev \{{v,b},ê},βê 0· H (Jeˆ) βê 0_,βê· (Nv)αv,{βe}e∈Ev.

7. Numerical Experiments

The following Sections7.1to7.4specify terminology and configurations referred to in the subsequent experiments SM1and7.1to7.4in SectionsSM1,7.7and7.9. The presentation of results is further laid out in Section 7.5. For simplicity, the mode sizes {nµ}µ∈[d] are chosen uniformly as n ∈ N in all experiments. For the

(20)

7.1. Reference solutions, measurements vectors and family K. Each mea-surement vector is constructed via a (not necessarily sought for) reference solution with ranks r(rs)∈ NK, which in turn relies on a randomly generated representation,

y = L(X(rs)) ∈ R`, X(rs)= τr(rs)(N

(rs)_{) ∈ L}−1_{(y) ∩ V}K ≤r(rs).

All entries of the components {Nv(rs)}v∈V(rs) are assigned independent, normally

distributed entries. For simplicity7_{and to limit the amount of randomness, we also} choose the components {r_(rs)(J)}J∈K uniformly, as r(rs)∈ N. We distinguish between

four different types.

Tucker format. With K = KTucker = {{1}, . . . , {d}}, the components of the

repre-sentation {N(rs)_}

v∈V(rs) follow the scheme in Example5.5.

Balanced, binary hierarchical Tucker format (bbHT).A balanced, binary hierarchi-cal Tucker format can be defined by the property of K = KbbHTto be exhaustive

(cf. Section5.6) and to minimize the maximal distance of any two vertices v, w ∈ [d] within GK (that is, the depth of the rooted tree, cf. [15]).

Exponentially declining singular values. Firstly, a bbHT representation as defined above is generated. As second step, all singular values σ(J)_{, J ∈ K, are manipulated}

such they decline exponentially. In explicit, for a constant s(expfac) ∈ (0, 1), it is

σ_i(J)≈ max(σmin, sx(expfac)), i = 1, . . . , r(rs), J ∈ K, where each x is an independent

random, normally distributed value and σmin > ε > 0 (cf. Section7.4) is a lower

bound. We denote such reference solutions by the abbreviated exp.dec.bbHT. Canonical polyadic (CP) decomposition.For r(rs)∈ N, the reference solution does

here not rely on K, but is generated as sum of r(rs)elementary tensors, X (rs) α1,...,αd:= τr(rs)(φ (rs)_{) =} Pr(rs) γ=1(φ (rs) 1 )α1,γ. . . (φ (rs) d )αd,γ, where (φ (rs) µ ) ∈ R[nµ]×[r(rs)], for µ =

1, . . . , d. The corresponding graph is a hypertree, and the set image(τr(rs)) of at

most rank r(rs)tensors is a semi-algebraic subset of V_≤rKmax_(rs)(cf. (1.7)) for the in that

case defined, non-hierarchical family Kmax := {J ( [d] | J 6= ∅}, given r(J)_(rs)≡ r(rs),

J ∈ Kmax.

7.2. Operators. We consider three types of operators L, where in each case L(X) := L vec(X) is based on the tensor L ∈ R`×n1...nd_.

(Full) Gaussian operator.With a Gaussian operator, we refer to a randomly gen-erated tensor L ∈ R`×n1...nd _{with independent, normally distributed entries.}

Gaussian low rank operator.For (low) uniform ranks r(J)_L ≡ rL ∈ N, J ∈ K, the

operator is defined through the representation of L := ρrL({Lv}v∈V(rs)) (cf.

Sec-tion6.3). Each component therein are assigned independent, normally distributed entries.

Random sampling operator.As sampling operator, we denote L(X) := {Xpi}

` i=1,

for uniformly randomly drawn indices {p1, . . . , p`} ⊂

×

dµ=1[nµ]. Note that sampling

operators can trivially be decomposed, for r_L(J)= 1, J ∈ K.

7.3. Solution methods. Based on a sufficiently large starting value γ(0)_{> 0, we}

choose γ(i)_{= νγ}(i−1)_{, where ν < 1 remains constant throughout each single run of}

an algorithm. We consider the following types of optimization.

Full, image based (IRLS-0K). As in (1.4), the full tensor is optimized based on the (literally interpreted) image update formula (3.3) without further modification (Al-gorithm1 for Si≡ ∅, i ∈ N0). When instability threatens to occur, the equivalent

(21)

Full, relaxed.The relaxed constraints described in Section 4.1 are utilized, but without subspace restrictions or weight switching (Algorithm2 with Ti≡ L−1(y),

Si≡ ∅, i ∈ N0). In this case, the residual kL(X) − yk is expected to converge to 0

parallel to the decline of γ, but this is not guaranteed.

Alternating (AIRLS-0K). We apply the representation based, necessarily relaxed, alternating optimization (Algorithm 2for Ti= Tci(τr(N

(i)_{)) and S}

i= Sci, i ∈ N0)

further discussed in Section 6. The update formulas for the single components make use of the branch-wise evaluations as derived in Section 6.6. Whether the (maximal) ranks r ∈ NK _{of the iterate, that is, the sizes of {N}

v}v∈V, are fixed or

adapted, as well as the potential use of other heuristics laid out in SectionSM6, is specified in the respective experiments.

Neigh.The same algorithm as aboves AIRLS-0K is applied, but in each update of the node Nc, c ∈ V , only weights corresponding to Je∈ KSc, e ∈ Ec, are included

(cf. Corollary6.4) in order to reduce the computational complexity. This reduction of paths yields the variant closest to our priorly introduced algorithm SALSA, and in particular the minimal number of weights in each the update of Nc, c ∈ V , for

which the rank adaption stability property, as further introduced in [16], still holds true.

Plain ALS without reweighting. In one instance in Experiment7.3, we also compare to the plain alternating least squares (ALS) residual minimization ((6.2) for ω = γ = 0) for fixed ranks r = r(rs)∈ NK. This algorithm is thus granted additional, in

practice generally unavailable information and does not adapt ranks.

7.4. Experimental setup and evaluation. In order to evaluate each output X(alg)_{, we compare its non-neglectable singular values to those of X}(rs)_{. We define}

detKn,γ,ε(X) := Y J∈K detnJ,γ,ε(X [J]_), X ∈ Rn1×...×nd_,

where the matrix version is as in [26] given by

det2m,γ,ε(A) := γm−rankε(A)

rankε(A)

Y

i=1

(σi(A)2+ γ),

for rankε(A) := max{i ∈ [m] | σi(A) > · kAkF}. Therein, we choose :=

10−6_{. We firstly examine the residual norm kL(X}(rs)_{) − yk}

F, secondly compare the

approximate ranks, and lastly compare the products of singular values. The latter two aspects are reflected by the limit of the quotient

Qε(X(alg), X(rs)) := lim γ&0

detKn,γ,ε(X(alg))

detKn,γ,ε(X(rs))

∈ [0, 0.98] ∪ (0.98, 1.005) ∪ [1.005, ∞]. The three intervals are related to the categorization into improvements, successes or the two types of failures as outlined below, where the limits 0 or ∞ are reached if and only if P_J_∈Krankε((X(alg))[J]) andPJ∈Krankε((X(rs))[J]) differ.

Post iteration. In order to avoid misjudgment, in cases where the tensor X(alg)_may

be an improving solution (though that seldomly happens here), we apply a post iteration analogous to the one discussed in the matrix case [26] in order to allow the parameter ε to be reduced to machine precision.

Details of comparison.As in [26], if kL(X(alg)_{) − yk > 10}−6_{kyk or if for the}

quo-tient, it holds Qε(X(alg), X(rs)) = ∞, then the result is considered a strong

fail-ure. If kL(X(alg)_{) − yk ≤ 10}−6_{kyk, then on the one hand we refer to 1.005 ≤}

Qε(X(alg), X(rs)) < ∞ as weak failure. On the other, for 0.98 < Qε(X(alg), X(rs)) <

1.005, we consider the result successful, while for Qε(X(alg), X(rs)) ≤ 0.98, we say

(22)

Sensitivity analysis. With the exception of Experiment7.3, we lower the meta pa-rameter ν = νk =

√

νk−1 (cf. Section 7.3), starting with ν0 = 1.2, and rerun the

respective algorithm from the start until the result is not a failure. However, after too many reruns k > kmax, we give up and thus either achieve a weak or strong

failure depending on the result for k = kmax. All other meta parameters for each

algorithm are common to all respective experiments.

7.5. Presentation of results. Each but experiments7.1and7.3is reflected upon in three different ways as summarized in TableSM2.

ASRM/recovery tables.For each instance, we list the percentual numbers of ASRM improvements, successes or fails as defined in Section 7.4. Successes are further distinguished regarding recoveries, whether kX(alg)_{− X}(rs)_k

F ≤ 10−4kX(rs)kF. In

near all cases where this is fulfilled, the relative residual even falls below 10−6

(see Section SM2), in which case the algorithm stops automatically8_{. Note that} both improvements as well as fails with respect to ASRM naturally nearly exclude recoveries with accuracy 10−4, and always so for 10−6.

ASRM/recovery figures. More distinguished visualizations of the results underlying the above mentioned tables can be found in Section SM2as described therein. γ-decline sensitivity. A depiction of results regarding the sensitivity analysis out-lined in Section 7.4is covered in SectionSM2 as well.

7.6. Observing the theoretical phase transition for generic recoveries. Experiment 7.1. For d = 4, n = 5 and r(rs)= 3, we consider the ASRM-KbbHT

problem based on Gaussian measurements for reference solutions given via bbHT representations for ` ∈ {68, 69, 70}. The solution method in both cases utilizes full, image based updates (cf. Section 7.1). Each constellation is repeated 100 times, for a comparatively large value of kmax= 10. The results are covered in Table1.

The dimension of the given bbHT variety is dim(VKbbHT

≤r(rs) ) = 4nr(rs)+ 2r

3 (rs)−

5r2_(rs) = 69 (cf. Lemma 5.7). The value ` = dim(VKbbHT

≤r(rs) ) + 1 (which here is

` = 70) in turn provides the minimal sufficient number of generic9 _measurements (thus not including sampling) to provide L−1_(L(X(rs)_{)) ∩ V}KbbHT

≤r(rs) = {X

(rs)_{} for}

generic X(rs) _{∈ V}KbbHT

≤r(rs) , as more generally proven in [4]. We can indeed observe

(see Table1) that for ` = 69, multiple solutions are found as verified through the post iteration process up to machine accuracy. For the value ` = 70 in turn, no duplicate solutions seem to exist. The one improving solution as well as the two weak failures are not the reference solution, though in fact neither within VKbbHT

≤r

for r = r(rs) but r = ˜r, ˜r({1,2}) = 3, (˜r({1}), . . . , ˜r({4})) = (2, 2, 4, 4). From the

perspective of a dimension minimization (cf. (2.1)) in turn, not even the improving result would be preferable as dim(VKbbHT

≤ˆr ) = 71 (V≤ˆKrbbHT + V≤rKbbHT(rs) ).

7.7. Affine sum-of-ranks minimization.

Experiment 7.2. For d = 4, n = 5 and r(rs) = 3, we consider the

ASRM-K problem based on samples or Gaussian measurements and reference solutions given via bbHT representations for ` ∈ {83, 111, 138} and K = KbbHT or by CP

decompositions for ` ∈ {62, 82, 102} and K = Kmax. The solution method in both

cases utilizes full, image based updates based on the respective families K (cf.

8_{Needless to say, this is the only point at which the reference solution itself is used within} the algorithm, and only done in order to save a considerable amount of unnecessary computation time.

(23)

instance ASRM-_recovery:K: Qε∈ [0, 0.98]_no Qε_no∈ (0.98, 1.005)_yes Qε∈ [1.005, ∞)_no Qε=∞

∆` =−1 •HT, ` = 68 24.0 2.0 + 0.0 25.0 49.0

∆` = 0 •HT, ` = 69 9.0 1.0 + 2.0 7.0 81.0

∆` = 1 •HT, ` = 70 1.0 0.0 + 10.0 2.0 87.0

Table 1. tensor recovery (full, Gaussian, image method, d = 4, n = 5, rrs= 3) – table as specified in Section7.5for Experiment7.1(see Fig.SM2

for more details)

Section 7.1). Each constellation is repeated 100 times, for kmax = 8. The results

are covered in Table 2and Figs.SM3 andSM4.

instance ASRM-KbbHT/ max: Qε∈ [0, 0.98] Qε∈ (0.98, 1.005) Qε∈ [1.005, ∞) Qε=∞

recovery: no no yes no cmf= 1.2 •gaussian, HT, ` = 83 0.0 0.0 + 96.0 0.0 4.0 •gaussian, CP, ` = 62 0.0 0.0 + 18.0 0.0 82.0 •sampling, HT, ` = 83 0.0 0.0 + 33.0 0.0 67.0 •sampling, CP, ` = 62 0.0 0.0 + 1.0 0.0 99.0 cmf= 1.6 •gaussian, HT, ` = 111 0.0 0.0 + 100.0 0.0 0.0 •gaussian, CP, ` = 82 0.0 0.0 + 86.0 0.0 14.0 •sampling, HT, ` = 111 0.0 0.0 + 94.0 0.0 6.0 •sampling, CP, ` = 82 0.0 0.0 + 44.0 0.0 56.0 cmf= 2.0 •gaussian, HT, ` = 138 0.0 0.0 + 100.0 0.0 0.0 •gaussian, CP, ` = 102 0.0 0.0 + 100.0 0.0 0.0 •sampling, HT, ` = 138 0.0 0.0 + 100.0 0.0 0.0 •sampling, CP, ` = 102 0.0 0.0 + 84.0 0.0 16.0

Table 2. IRLS-0K (full, image method, d = 4, n = 5, rrs= 3) – table as

specified in Section7.5for Experiment7.2(see Fig.SM4for more details)

The dimension or even the more particular structure of the variety VKmax

≤r(rs), for

Kmax = {J ( [d] | J 6= ∅}, d ≥ 4, as applied in the CP case (cf. Section 7.1), is

unknown to the best of our knowledge. While real tensors of at most rank r(rs)

do not form varieties, complex ones with at most this border rank do, here with a dimension of dim(V≤r(rs),C) = r(rs)(d(n − 1) + 1) = 51 (cf. [3,6,33]). Though we

assume this dimension to be lower than the one for Kmax, we take this smaller value

as reference. In that sense, the considered values ` are each (rounded) multiples cmf∈ {1.2, 1.6, 2}. To our surprise, if successeful, the CP reference solution is (near

perfectly) recovered even for ` = 62, considering that this value is smaller than 69 ≡ dim(VKbbHT

≤r(rs) ) for every exhaustive hierarchical family KbbHT. One possible

explanation would be that dim(VKmax

≤r(rs)) is lower or equal to 61, but further

inves-tigation remains subject to future work. While we can not, as theory provides, expect generic completions in case of sampling problems, the failures with respect to ASRM are subject of IRLS-0K itself. In particular, slower rates of decline ν (cf. Fig.SM3) may be required, and allow for better results for both sampling and Gaussian measurements as suggested by Fig. SM1. Though already for cmf = 1.2,

the rate of decline seems to suffice.

7.8. Alternating, affine sum-of-ranks minimization.

Experiment 7.3. For d = 4, n = 5, r(rs)= 3 and ` ∈ {126, 168, 210} we consider

the ASRM-KTucker problem based on samples or rank rL = 1 Gaussian

measure-ments for reference solutions given through Tucker representations. We compare the following four solution methods:

(a) full, image based

(c) alternating, based on fixed, maximally feasible ranks r(J)_{= 5, J ∈ K} Tucker

(d) alternating, with adaptive ranks r(J)_{∈ [5], J ∈ K}

Tucker(cf. SectionSM6.3)

(e) plain, ALS without reweighting, based on the, a-priorly provided, fixed ranks r(J)_{= r}

(24)

A fixed rate ν = 1.002−1 _{of decline is used (applicable to the first three methods).}

Each constellation is repeated 100 times, for which the results are covered in Table3

and Fig.SM7.

instance ASRM-KTucker: Qε∈ [0, 0.98] Qε∈ (0.98, 1.005) Qε∈ [1.005, ∞) Qε=∞

recovery: no no yes no

` = 126

•gaussian (rL= 1): (a) full 0.0 0.0 + 99.0 0.0 1.0

•(c) alt 0.0 0.0 + 79.0 0.0 21.0

•(d) alt adapt 0.0 0.0 + 78.0 0.0 22.0

•(e) plain ALS 0.0 0.0 + 0.0 0.0 100.0

•samp: (a) full 0.0 0.0 + 89.0 0.0 11.0

•(c) alt 0.0 0.0 + 16.0 0.0 84.0

•(d) alt adapt 0.0 0.0 + 19.0 0.0 81.0

•(e) plain ALS 0.0 0.0 + 0.0 0.0 100.0

` = 168

•(c) alt 0.0 0.0 + 100.0 0.0 0.0

•(d) alt adapt 0.0 0.0 + 100.0 0.0 0.0

•(e) plain ALS 0.0 0.0 + 28.0 0.0 72.0

•samp: (a) full 0.0 0.0 + 100.0 0.0 0.0

•(c) alt 0.0 0.0 + 77.0 0.0 23.0

•(d) alt adapt 0.0 0.0 + 80.0 0.0 20.0

•(e) plain ALS 0.0 0.0 + 4.0 0.0 96.0

` = 210

•(c) alt 0.0 0.0 + 100.0 0.0 0.0

•(d) alt adapt 0.0 0.0 + 100.0 0.0 0.0

•(e) plain ALS 0.0 0.0 + 74.0 0.0 26.0

•samp: (a) full 0.0 0.0 + 100.0 0.0 0.0

•(c) alt 0.0 0.0 + 100.0 0.0 0.0

•(d) alt adapt 0.0 0.0 + 98.0 0.0 2.0

•(e) plain ALS 0.0 0.0 + 24.0 0.0 76.0

Table 3. (A)IRLS-0KTucker (Tucker, d = 4,n = 5, rrs= 3) – table as

spec-ified in Section7.5for Experiment7.3(see Fig.SM7for more details)

The degrees of freedom within a Tucker decompositions for d = 4 in this setting is dim(VKTucker

≤r(rs) ) = r

4

(rs)+ 4nr(rs)− 4r2(rs)= 105 (cf. Lemma 5.7). The number of

measurements ` are (rounded) multiples cmf ∈ {1.2, 1.6, 2} of such. As in

Exper-iment SM1, there is nearly no difference between the version using fixed ranks or adaptive ranks, but both instances are slightly worse than the full version using unrelaxed constraints (note that here, these methods use the same, fixed rate of decay ν = 1.002−1). Plain alternating least squares on the other hand (even though only in that case, the ranks of the reference solution are provided) is significantly worse than the other methods, also for larger numbers of measurements.

7.9. Large scale, alternating ASRM.

Experiment 7.4. For d = 8, n = 20, r(rs)= 5 and ` ∈ {6500, 13000, 19500, 26000},

we consider the ASRM-KbbHT problem based on samples, rank rL = 1 or rank

rL = 2 Gaussian measurements for reference solutions given via bbHT

represen-tations with exponentially declining singular values, s(expfac) = 13. For Gaussian

measurements, we also consider unmodified singular values. As solution method, we apply alternating optimization with explicit rank adaption (limited only by r(J)_{≤ 8, J ∈ K}

bbHT) as well as the applicable heuristics laid out in Section SM6.

The maximal length of paths is either unrestricted, or limited to neighbors. Each constellation is repeated 100 times, for kmax= 5. The results are covered in Tables4

and 5and Figs.SM8 toSM11.

The degrees of freedom within 8-dimensional bbHT decompositions in this setting is dim(VKbbHT

≤r(rs) ) = 8nr(rs)+ 6r

3

(rs)− 13r2(rs)= 1225 (cf. Lemma5.7), while ` = 6500

constitutes a fraction of about 2.5 · 10−7 _{of the total size n}d _{= 2.56 · 10}10 _{of the}

tensor. Due to the long runtime for values k > 5, it yet remains speculation whether the restriction of paths to neighboring nodes does result in a loss of approximation quality or, as in other cases, rather a need for a lower parameter ν (cf. Figs. SM8