Factoring Big Tensors in the Cloud. A Tutorial on Big Tensor Data Analytics

(1)

Factoring Tensors in the Cloud:

A Tutorial on Big Tensor Data Analytics

Nikos Sidiropoulos, UMN, and Evangelos Papalexakis, CMU

(2)

Web, papers, software, credits

Nikos Sidiropoulos, UMN

Signal and Tensor Analytics Research (STAR) group

https://sites.google.com/a/umn.edu/nikosgroup/home Signal processing, communications, optimization

Tensor decomposition, factor analysis Big data

Spectrum sensing, cognitive radio

Sidekicks: preference measurement, conference review & session assignment

Vangelis Papalexakis, Christos Faloutsos, CMU

http://www.cs.cmu.edu/˜epapalex/and

http://www.cs.cmu.edu/˜christos/ Data mining

Graph mining

Tensor analysis, coupled matrix-tensor analysis Big data, Hadoop

Christos Faloutsos, Tom Mitchell (CMU), Nikos Sidiropoulos, George Karypis (UMN), NSF-NIH/BIGDATA: Big Tensor Mining: Theory, Scalable Algorithms and Applications, NSF IIS-1247489/1247632.

(3)

Matrices, rank decomposition

Amatrix(ortwo-wayarray) is a datasetXindexed bytwo indices,(i,j)-th entryX(i,j).

Simple matrixS(i,j) =a(i)b(j),∀i,j; separable, every row (column)

proportional to every other row (column). Can write asS=abT_.

rank(X):= smallest number of ‘simple’ (separable, rank-one) matrices

needed to generateX- a measure of complexity.

X(i,j) = F X f=1 af(i)bf(j); orX= F X f=1 afbTf =ABT.

Turns out rank(X)= maximum number of linearly independent rows (or,

columns) inX.

Rank decomposition for matrices is not unique (except for matrices of

rank = 1), as∀invertibleM:

X=ABT = (AM)M−TBT= (AM) BM−1T

(4)

Tensor? What is this?

CS ‘slang’ forthree-wayarray: datasetXindexed bythree indices,

(i,j,k)-th entryX(i,j,k).

In plain words: a‘shoebox’!

Warning:different meaning in Physics!

For two vectorsa(I×1) andb(J×1),a◦bis anI×J rank-one matrix

with (i,j)-th elementa(i)b(j); i.e.,a◦b=abT_.

For three vectors,a(I×1),b(J×1),c(K×1),a◦b◦cis anI×J×K

rank-one three-way array with(i,j,k)-th elementa(i)b(j)c(k).

Therank of a three-way arrayXis the smallest number of outer products

(5)

Motivating example: NELL @ CMU / Tom Mitchell

Crawl web, learn language ‘like children do’: encounter new concepts, learn from context

NELL triplets of “subject-verb-object” naturally lead to a 3-mode tensor

X ≈ object subject verb ~ a1 ~ b1 ~ c1 + ~ a2 ~ b2 ~ c2 + ~ aF ~ bF ~ cF . . . +

Each rank-one factor corresponds to aconcept, e.g., ‘leaders’ or ‘tools’

E.g., saya1,b1,c1corresponds to ‘leaders’: subjects/rows with high

score ona1will be “Obama”, “Merkel”, “Steve Jobs”, objects/columns with

high score onb1will be “USA”, “Germany”, “Apple Inc.”, and verbs/fibers

with high score onc1will be ‘verbs’, like “lead”, “is-president-of”, and

(6)

Kronecker and Khatri-Rao products

⊗stands for the Kronecker product:

A⊗B=    BA(1,1),BA(1,2),· · · BA(2,1),BA(2,2),· · · .. .   

stands for the Khatri-Rao (column-wise Kronecker) product: givenA

(I×F) andB(J×F),ABis theJI×F matrix

AB=

A(:,1)⊗B(:,1)· · ·A(:,F)⊗B(:,F)

vec(ABC) = (CT ⊗A)vec(B)

(7)

Rank decomposition for tensors

•Tensor: X= F X f=1 af◦bf ◦cf •Scalar: X(i,j,k) = F X f=1 ai,fbj,fck,f, ∀i∈ {1,· · ·,I} ∀j∈ {1,· · ·,J} ∀k ∈ {1,· · · ,K} •Slabs: Xk =ADk(C)BT, k =1,· · ·,K •Matrix: X(KJ×I) _{= (}_B_C₎_AT •Tall vector: x(KJI) _:=_vec_X(KJ×I)_{= (}_A₍_B_C₎₎₁ F×1= (ABC)1F×1

(8)

Do I need to worry about higher-way (or,

higher-order

)

tensors?

Semantic analysis of Brain fMRI data

fMRI→semantic category scores

fMRI mode is vectorized (O(105₋₁₀6₎₎

Spatial coherence - better to treat as three separate spatial modes→

5-way array

(9)

Rank decomposition for higher-order tensors

•Tensor: X= F X f=1 a(_f1)◦ · · · ◦a(_fN) •Scalar: X(i1,· · ·,iN) = F X f=1 N Y n=1 a(_in) n,f •Matrix: X(I1I2···IN−1×IN)₌ A(N−1)A(N−2) · · · A(1) A(N)T •Tall vector: x(I1···IN)_:=_vec X(I1I2···IN−1×IN) = A(N)A(N−1)A(N−2) · · · A(1)1F×1

(10)

Tensor Singular Value Decomposition?

For matrices, SVD is instrumental: rank-revealing, Eckart-Young So is there a tensor equivalent to the matrix SVD?

Yes, ... and no! In fact there is no single tensor SVD. Two basic decompositions:

CANonical DECOMPosition (CANDECOMP), also known as PARAllel FACtor (PARAFAC) analysis, or CANDECOMP-PARAFAC (CP) for short: non-orthogonal, unique under certain conditions.

Tucker3, orthogonal without loss of generality, non-unique except for very special cases.

Both are outer product decompositions, but with very different structural properties.

Rule of thumb: use Tucker3 for subspace estimation and tensor approximation, e.g., compression applications; use PARAFAC for latent parameter estimation - recovering the ‘hidden’ rank-one factors.

(11)

Tucker3

I×J×K three-way arrayX

A:I×L,B:J×M,C:K ×N mode loading matrices

(12)

Tucker3, continued

Consider anI×J×K three-way arrayXcomprisingK matrix slabs

{Xk}Kk=1, arranged into matrixX:= [vec(X1),· · ·,vec(XK)].

The Tucker3 model can be written as

X≈(B⊗A)GCT,

whereGis the Tucker3 core tensorGrecast in matrix form. The non-zero

elements of the core tensor determine the interactions between columns ofA,B,C.

The associated model-fitting problem is min

A,B,C,G||X−(B⊗A)GC

T_||2

F,

which is usually solved using an alternating least squares procedure.

vec(X)≈(C⊗B⊗A)vec(G).

Highly non-unique- e.g., rotateC, counter-rotateGusing unitary matrix.

Subspaces can be recovered; Tucker3 is good fortensor approximation,

(13)

PARAFAC

Figure :From Matlab Central / N-way toolbox (Rasmus Bro)http://www.

mathworks.com/matlabcentral/fileexchange/1088-the-n-way-toolbox

Low-rank tensor decomposition / approximation

X≈

F X

f=1

af◦bf◦cf,

PARAFAC [Harshman ’70-’72], CANDECOMP [Carroll & Chang, ’70], now CP; also cf. [Hitchcock, ’27]

Xk ≈ADk(C)BT, whereDk(C)is a diagonal matrix holding thek-th row of

Cin its diagonal.

Combining slabs and using Khatri-Rao product,

(14)

Uniqueness

Under certain conditions, PARAFAC isessentially unique, i.e.,(A,B,C)

can be identified fromXup to permutation and scaling of columns

-there’s no rotational freedom; cf. [Kruskal ’77, Sidiropouloset al’00 - ’07,

de Lathauwer ’04-, Stegeman ’06-, Chiantini, Ottaviani ’11-, ...]

I×J×K tensorXof rankF, vectorized asIJK×1 vector

x= (ABC)1, for someA(I×F),B(J×F), andC(K ×F) - a

PARAFAC model of sizeI×J×K and orderF parameterized by

(A,B,C).

TheKruskal-rankofA, denotedkA, is the maximumk such thatany k

columns ofAare linearly independent (kA≤rA:=rank(A)).

GivenX(⇔x), ifkA+kB+kC≥2F+2, then(A,B,C)are unique up to a

common column permutation and scaling/counter-scaling (e.g., multiply

first column ofAby 5, divide first column ofBby 5, outer product stays

the same) - cf. [Kruskal, 1977]

N-way case:PN

(15)

Special case: Two slabs

Assumesquare/tall, full column rankA(I×F)andB(J×F); and distinct

nonzero elements along the diagonal ofD:

X1 = ABT,

X2 = ADBT

Introduce the compactF-component SVD of

X1 X2 = A AD BT ₌_U_Σ_VH

rank(B) =F implies that span(U) =span

A AD

; hence there exists

a nonsingular matrixPsuch that

U= U1 U2 = A AD P

(16)

Special case: Two slabs, continued

Next, construct the auto- and cross-product matrices

R1 = UH₁U1 = PHAHAP =: QP,

R2 = UH1U2 = PHAHADP = QDP.

All matrices on the right hand side are square and full-rank. Solving the first equation and substituting to the second, yields the eigenvalue problem

R−1₁ R2

P−1=P−1D

P−1(andP) recovered up to permutation and scaling

From bottom of previous page,Alikewise recovered, thenB

(17)

Alternating Least Squares (ALS)

Based on matrix view:

X(KJ×I)= (BC)AT Multilinear LS problem: min A,B,C||X (KJ×I)₋₍_B_C₎_AT_||2 F

NP-hard- even for a single component, i.e., vectora,b,c. See [Hillar and Lim, “Most tensor problems are NP-hard,” 2013]

But ... given interim estimates ofB,C, can easily solve for conditional LS

update ofA:

ACLS =

(BC)†X(KJ×I)

T

Similarly for the CLS updates ofB,C(symmetry); alternate until cost

(18)

Other algorithms?

Many! - first-order (gradient-based), second-order (Hessian-based) Gauss-Newton, line search, Levenberg-Marquardt, weighted least squares, majorization

Algebraic initialization (matters)

See Tomasi and Bro, 2006, for a good overview

Second-order advantage when close to optimum, but can (and do) diverge

First-order often prone to local minima, slow to converge

Stochastic gradient descent (CS community) - simple, parallel, but very slow

Difficult to incorporate additional constraints like sparsity, non-negativity, unimodality, etc.

(19)

ALS

No parameters to tune!

Easy to program, uses standard linear LS Monotone convergence of cost function

Does not require any conditions beyond model identifiability

Easy to incorporate additional constraints, due to multilinearity, e.g., replace linear LS with linear NNLS for NN

Even non-convex (e.g., FA) constraints can be handled with column-wise updates (optimal scaling lemma)

Cons: sequential algorithm, convergence can be slow Still workhorse after all these years

(20)

Outliers, sparse residuals

Instead of LS,

min||X(KJ×I)₋₍_B_C₎_AT_||

1

Conditional update: LP

Almost as good: coordinate-wise, usingweighted median filtering(very

cheap!) [Vorobyov, Rong, Sidiropoulos, Gershman, 2005]

PARAFAC CRLB: [Liu & Sidiropoulos, 2001] (Gaussian); [Vorobyov, Rong, Sidiropoulos, Gershman, 2005] (Laplacian, etc differ only in pdf-dependent scale factor).

Alternating optimization algorithms approach the CRLB when the problem is well-determined (meaning: not barely identifiable).

(21)

Big (tensor) data

Tensors can easily become really big! - size exponential in the number of dimensions (‘ways’, or ‘modes’).

Datasets with millions of items per mode - examples in second part of this tutorial.

Cannot load in main memory; may reside in cloud storage.

Sometimesverysparse - can store and process as (i,j,k,value) list,

nonzero column indices for each row, runlength coding, etc.

(Sparse) Tensor Toolbox for Matlab [Koldaet al].

(22)

Tensor partitioning?

Parallel algorithms for matrix algebra use data partitioning Can we reuse some of these ideas?

Low-hanging fruit?

First considered in [Phan, Cichocki, Neurocomputing, 2011]

(23)

Tensor partitioning

In [Phan & Cichocki, Neurocomputing, 2011] each sub-block is

independently decomposed, then low-rank factors of the big tensor are ‘stitched’ together.

The decomposition of each sub-block must be identifiable→far more stringent ID conditions;

Each sub-block has different permutation and scaling of the factors - all these must be reconciled before stitching (not discussed in paper);

Noise affects sub-block decomposition more than full tensor decomposition.

More recently, [Almeida & Kibangou, CAMSAP 2013, ICASSP 2014]:

Also rely on partitioning, so sub-block identification conditions are required; But more robust, based on collaborative consensus-averaging, and the matching of different permutations and scales is explicitly taken into account; Requires considerable communication overhead across clusters and nodes; i.e., the parallel threads are not independent.

(24)

Tensor compression

Commonly used compression method for ‘moderate’-size tensors: fit orthogonal Tucker3 model, regress data onto fitted mode-bases.

Implemented in COMFAC

http://www.ece.umn.edu/˜nikos/comfac.m

Lossless if exact mode bases used [CANDELINC]; but Tucker3 fitting is itself cumbersome for big tensors (big matrix SVDs), cannot compress below mode ranks without introducing errors

(25)

Tensor compression

Consider compressingx=vec(X)intoy=Sx, whereSisd ×IJK,

d IJK.

In particular, consider a specially structured compression matrix S=UT_⊗_VT _⊗_WT

Corresponds to multiplying (every slab of)Xfrom theI-mode withUT_,

from theJ-mode withVT_{, and from the}_K_{-mode with}_WT_{, where}_U_is_I_×_L_,

VisJ×M, andWisK×N, withL≤I,M ≤J,N≤K andLMNIJK

I L M I J J K N L K M X N Y _ _

(26)

Key

Due to a property of the Kronecker product

UT _⊗_VT _⊗_WT₍_A_B_C_{) =}

(UTA)(VTB)(WTC), from which it follows that

y=(UTA)(VTB)(WTC)1=A˜ B˜ C˜1.

i.e., the compressed data follow a PARAFAC model of sizeL×M×N

and orderF parameterized by( ˜A,B˜,C˜), withA˜ :=UT_A_,_B˜ _:=_VT_B_, ˜

(27)

Random

multi-way compression can be better!

Sidiropoulos & Kyrillidis, IEEE SPL Oct. 2012

Assume that the columns ofA,B,Care sparse, and letna(nb,nc) be an

upper bound on the number of nonzero elements per column ofA

(respectivelyB,C).

Let the mode-compression matricesU(I×L,L≤I),V(J×M,M ≤J),

andW(K ×N,N≤K) be randomly drawn from an absolutely continuous

distribution with respect to the Lebesgue measure inRIL,RJM, andRKN,

respectively. If

min(L,kA) +min(M,kB) +min(N,kC)≥2F+2, and L≥2na, M ≥2nb, N≥2nc,

then the original factor loadingsA,B,Care almost surely identifiable from

(28)

Proof rests on two lemmas + Kruskal

Lemma 1: ConsiderA˜ :=UT_A_{, where}_A_is_I_×_F_{, and let the}_I_×_L_matrix

Ube randomly drawn from an absolutely continuous distribution with

respect to the Lebesgue measure inRIL(e.g., multivariate Gaussian with

a non-singular covariance matrix). Thenk_A˜=min(L,kA)almost surely

(with probability 1) [Sidiropoulos & Kyrillidis, 2012]

Lemma 2: ConsiderA˜ :=UTA, whereA˜ andUare given andAis sought.

Suppose that every column ofAhas at mostnanonzero elements, and

thatkUT ≥2n_a. (The latter holds with probability 1 if theI×LmatrixUis

randomly drawn from an absolutely continuous distribution with respect to

the Lebesgue measure inRIL, and min(I,L)≥2na.) ThenAis the unique

solution with at mostnanonzero elements per column [Donoho & Elad,

(29)

Complexity

First fitting PARAFAC in compressed space and then recovering the

sparseA,B,Cfrom the fitted compressed factors entails complexity

O(LMNF+ (I3.5₊_J3.5₊_K3.5₎_F₎_{in practice.}

Using sparsity first and then fitting PARAFAC in raw space entails

complexityO(IJKF+ (IJK)3.5)- the difference is huge.

Also note that the proposed approach does not require computations in the uncompressed data domain, which is important for big data that do not fit in memory for processing.

(30)

Further compression - down to

O

(

√

F

)

in 2/3 modes

Sidiropoulos & Kyrillidis, IEEE SPL Oct. 2012

(respectivelyB,C).

respectively. If

rA=rB=rC=F

L(L−1)M(M−1)≥2F(F −1), N≥F, and

L≥2na, M ≥2nb, N≥2nc,

(31)

Proof: Lemma 3 + results on a.s. ID of PARAFAC

Lemma 3: ConsiderA˜ =UT_A_{, where}_A₍_I_×_F₎_{is deterministic,}

tall/square (I≥F) and full column rankrA=F, and the elements ofU

(I×L) are i.i.d. Gaussian zero mean, unit variance random variables.

Then the distribution ofA˜ is nonsingular multivariate Gaussian.

From [Stegeman, ten Berge, de Lathauwer 2006] (see also [Jiang, Sidiropoulos 2004], we know that PARAFAC is almost surely identifiable if

the loading matricesA˜,B˜ are randomly drawn from an absolutely

continuous distribution with respect to the Lebesgue measure inR(L+M)F,

˜

(32)

Generalization to higher-way arrays

Theorem 3: Letx= (A1 · · · Aδ)1∈R Qδ d=1Id_{, where}_A d isId×F, and consider compressing it toy= UT 1 ⊗ · · · ⊗UTδ x= (UT 1A1) · · · (UTδAδ) 1=A˜1 · · · A˜δ 1∈_RQδd=1Ld_{, where the}

mode-compression matricesUd(Id×Ld,Ld ≤Id) are randomly drawn

from an absolutely continuous distribution with respect to the Lebesgue

measure inRIdLd_{. Assume that the columns of}_A

d are sparse, and letnd

be an upper bound on the number of nonzero elements per column ofAd,

for eachd∈ {1,· · ·, δ}. If

δ

X

d=1

min(Ld,kAd)≥2F+δ−1, and Ld ≥2nd, ∀d ∈ {1,· · ·, δ},

then the original factor loadings{Ad}

δ

d=1are almost surely identifiable

from the compressed datayup to a common column permutation and

scaling.

(33)

Latest on PARAFAC uniqueness

Luca Chiantini and Giorgio Ottaviani, On Generic Identifiability of 3-Tensors of Small Rank, SIAM. J. Matrix Anal. & Appl., 33(3), 1018–1037:

Consider anI×J×K tensorXof rankF, and order the dimensions so

thatI≤J ≤K

Leti be maximal such that 2i _≤_I_{, and likewise}_j _{maximal such that 2}i _≤_J

IfF ≤2i+j−2_{, then}_X_{has a unique decomposition almost surely}

ForI,J powers of 2, the condition simplifies toF ≤ IJ

4

More generally, condition implies:

ifF ≤(I+1)(J+1)

(34)

Even further compression

(respectivelyB,C).

respectively.

AssumeL≤M ≤N, andL,M are powers of 2, for simplicity

If

rA=rB=rC=F LM≥4F, N≥M ≥L, and

L≥2na, M ≥2nb, N≥2nc,

the compressed data up to a common column permutation and scaling.

(35)

What if

A

,

B

,

C

are not sparse?

IfA,B,Care sparse with respect toknownbases, i.e.,A=RAˇ,B=SBˇ,

andC=TCˇ, withR,S,Tthe respective sparsifying bases, andAˇ,Bˇ,Cˇ

sparse

Then the previous results carry over under appropriate conditions, e.g.,

whenR,S,Tare non-singular.

(36)

PARACOMP: PArallel RAndomly COMPressed Cubes

I J K X _ (U2,V2,W2) … M1 L1 Y_1 N1 M2 L2 N2 Y _2 MP LP Y_P NP … (A2,B2,C2)~ ~ ~ … (A,B,C) join (LS) fork I Lp Mp I J J K Np Lp K Mp X Np Y _ _ UpT V p _T WpT p

(37)

PARACOMP

AssumeA˜p,B˜p,C˜pidentifiable fromYp(up to perm & scaling of cols)

Upon factoringY_pintoF rank-one components, we obtain

˜

Ap=UTpAΠpΛp. (1)

Assume first 2 columns of eachUpare common, letU¯ denote this

common part, andA¯p:=first two rows ofA˜p. Then

¯

Ap= ¯UTAΠpΛp.

Dividing each column ofA¯pby the element of maximum modulus in that

column, denoting the resulting 2×F matrixAˆp,

ˆ

Ap= ¯UTAΛΠp.

Λdoes not affect the ratio of elements in each 2×1 column. If ratios are

distinct, then permutations can be matched by sorting the ratios of the

(38)

PARACOMP

In practice using a few more ‘anchor’ rows will improve perm-matching.

WhenSanchor rows are used, the opt permutation matching cast as

min

Π ||

ˆ

A1−AˆpΠ||2F,

Optimization over set of permutation matrices - hard?

||Aˆ1−AˆpΠ||2F =Tr ( Â1−AˆpΠ)T( Â1−AˆpΠ) = ||Aˆ1||2F +||AˆpΠ||2F−2Tr( ÂT1AˆpΠ) = ||Aˆ1||2F+||Aˆp||2F−2Tr( ÂT1AˆpΠ). ⇐⇒max Π Tr( Â T 1AˆpΠ),

(39)

PARACOMP

After perm-matching, back to (1) and permute columns→A˘psatisfying

˘

Ap=UTpAΠΛp.

Remains to get rid ofΛp. For this, we can again resort to the first two

common rows, and divide each column ofA˘pwith its top element→

ˇ

Ap=UTpAΠΛ.

For recovery ofAup to perm-scaling of cols, we then require that

   ˇ A1 .. . ˇ AP   =    UT 1 .. . UT P   AΠΛ (2)

(40)

PARACOMP

This implies that

2+

P X

p=1

(Lp−2)≥I.

Every sub-matrix contains the two common anchor rows, duplicate rows do not increase rank.

Once dim requirement met, full rank w.p. 1, non-redundant entries drawn from jointly cont. distribution (by design).

AssumingLp=L,∀p∈ {1,· · ·,P}

P≥ I−2 L−2.

Common column perm ofA˜p,B˜p,C˜piscommon→

P≥max I−2 L−2, J M, K N

(41)

PARACOMP

If compression ratios in different modes are similar, makes sense to use longest mode for anchoring; if this is the last mode, then

P≥max I L, J M, K−2 N−2

Theorem:Assume thatF ≤I≤J ≤K, andA,B,Care full column rank

(F). Further assume thatLp=L,Mp=M,Np=N,∀p∈ {1,· · ·,P},

L≤M ≤N,(L+1)(M+1)≥16F, random{Up}P_p=1,{Vp}

P

p=1, eachWp

contains two common anchor columns, otherwise random{Wp}P_p=1.

ThenA˜p,B˜p,C˜p

unique up to column permutation and scaling. If, in addition,P ≥max_LI,_MJ,_NK−2₋₂, then(A,B,C)are almost surely identifiable fromnA˜p,B˜p,C˜p

oP

p=1up to a common column permutation

(42)

PARACOMP - Significance

Indicative of a family of results that can be derived.

Theorem shows that fully parallel computation of the big tensor

decomposition is possible – first result thatguaranteesID of the big

tensor decomposition from the small tensor decompositions, without stringent additional constraints.

Corollary:If K−2 N−2 =max I L, J M, K−2 N−2

, then the memory / storage and computational complexity savings afforded by PARACOMP relative to brute-force computation are of order IJ_F.

Note on complexity of solving master join equation: after removing redundant rows, system matrix in (2) will have approximately orthogonal

(43)

PARACOMP

Cases whereF >min(I,J,K)can be treated using Kruskal’s condition;

total storage and complexity gains still of order IJ

F2.

Latent sparsity: Assume that every column ofA(B,C) has at mostna

(resp.nb,nc) nonzero elements. A column ofAcan be uniquely

recovered from only 2naincoherent linear equations.

Therefore, we may replace the condition

P≥max I L, J M, K −2 N−2 , with P≥max 2na L , 2nb M , 2nc−2 N−2 .

(44)

Color of compressed noise

Y=X+Z, whereZ: zero-mean additive white noise.

y=x+z, withy:=vec(Y),x:=vec(X),z:=vec(Z).

Multi-way compression→Y_c yc :=vec(Yc) = UT _⊗_VT _⊗_WT_y₌ UT ⊗VT ⊗WTx+UT ⊗VT ⊗WTz. Letzc := UT ⊗VT ⊗WT

z. Clearly,E[zc] =0; it can be shown that

EhzczTc i

=σ2UTU⊗VTV⊗WTW.

⇒IfU,V,Ware orthonormal, then noise in the compressed domain is white.

(45)

‘Universal’ LS

E[zc] =0, and

EhzczTc i

=σ2UTU⊗VTV⊗WTW.

For largeI andUdrawn from a zero-mean unit-variance uncorrelated distribution,UT_U_≈_I_{by the law of large numbers.}

Furthermore, even ifzis not Gaussian,zc will be approximately Gaussian

for largeIJK, by the Central Limit Theorem.

Follows that least-squares fitting is approximately optimal in the compressed domain, even if it is not so in the uncompressed domain.

(46)

Component energy

≈

preserved after compression

Consider randomly compressing a rank-one tensorX=a◦b◦c, written

in vectorized form asx=a⊗b⊗c.

The compressed tensor isX˜, in vectorized form

˜

x=UT ⊗VT ⊗WT(a⊗b⊗c) = (UTa)⊗(VTb)⊗(WTc).

Can be shown that, for moderateL,M,Nand beyond, Frobenious norm

of compressed rank-one tensor approximately proportional to Frobenious norm of the uncompressed rank-one tensor component of original tensor. In other words: compression approximately preserves component energy

⇒order.

⇒Low-rank least-squares approximation of the compressed tensor low-rank least-squares approximation of the big tensor, approximately.

Match component permutations across replicas by sorting component energies.

(47)

PARACOMP: Numerical results

Nominal setup:

I=J=K =500;F =5;A,B,C∼randn(500,5);

L=M=N=50 (each replica = 0.1% of big tensor);

P=12 replicas (overall cloud storage = 1.2% of big tensor).

S=3 (vs.Smin=2) anchor rows.

↑Satisfy identifiability without much ‘slack’. + WGN stdσ=0.01.

COMFACwww.ece.umn.edu/˜nikosused for all factorizations, big and small

(48)

PARACOMP: MSE as a function of

L

=

M

=

N

FixP=12, varyL=M =N. 50 100 150 10−7 10−6 10−5 10−4 10−3 L=M=N ||A−Ahat|| F 2 I=J=K=500; F=5; sigma=0.01; P=12; S=3 PARACOMP Direct, no compression 32% of full data 1.2% of full data (P=12 processors, each w/ 0.1 % of full data)

(49)

PARACOMP: MSE as a function of

P

FixL=M =N=50, varyP. 20 30 40 50 60 70 80 90 100 110 10−7 10−6 10−5 10−4 10−3 P ||A−Ahat|| F 2 I=J=K=500; L=M=N=50; F=5; sigma=0.01; S=3 PARACOMP Direct, no compression

12% of the full data in P=120 processors w/ 0.1% each 1.2% of the full data

in P=12 processors w/ 0.1% each

(50)

PARACOMP: MSE vs AWGN variance

σ

2 FixL=M =N=50,P=12, varyσ2_. 10−8 10−6 10−4 10−2 100 10−12 10−10 10−8 10−6 10−4 10−2 100 102 sigma2 ||A−Ahat|| F 2 I=J=K=500; F=5; L=M=N=50; P=12; S=3 PARACOMP Direct, no compression

(51)

Missing elements

Recommender systems, NELL, many other datasets: over 90% of the values are missing!

PARACOMP to the rescue: fortuitous fringe benefit of ‘compression’ (rather: taking linear combinations)!

LetT denote the set of all elements, andΨthe set of available elements.

Consider one element of the compressed tensor, as it would have been computed had all elements been available; and as it can be computed from the available elements (notice normalization - important!):

Yν(l,m,n) = 1 |T | X (i,j,k)∈T ul(i)vm(j)wn(k)X(i,j,k) e Yν(l,m,n) = 1 E[|Ψ|] X (i,j,k)∈Ψ ul(i)vm(j)wn(k)X(i,j,k)

(52)

Missing elements

Theorem:[Marcos & Sidiropoulos, IEEE ISCCSP 2014] Assume a

Bernoulli i.i.d. miss model, with parameterρ=Prob[(i,j,k)∈Ψ], and let

X(i,j,k) =PF

f=1af(i)bf(j)cf(k), where the elements ofaf,bf andcf are

all i.i.d. random variables drawn fromaf(i)∼ Pa(µa, σa),

bf(j)∼ Pb(µb, σb), andcf(k)∼ Pc(µc, σc), withpa:=µ2a+σa2,

pb:=µ2b+σ2b,pc :=µ2c+σc2, andF0:= (F−1). Then, forµa,µb,µc all

6 =0, E[kEνk2_F] E[kYνk2_F] ≤ (1−ρ) ρ|T | (1+ σ2 U µ2 u )(1+σ 2 V µ2 v )(1+σ 2 W µ2 w )(F 0 F + papbpc Fµ2_aµ2_bµ2_c)

(53)

Missing elements

10−5 10−4 10−3 10−2 10−1 100 −10 0 10 20 30 40 50 60 70

Bernoulli Density parameter (ρ)

S N RY (d B ) SNR ofY, forrank(X) = 1, (µX,σX) = (1,1)

Actual SNR using expected number of available alements Actual SNR using actual number of available alements Theoretical SNR Lower bound

Theoretical SNR (Exact)

I = J = K = 50

I = 400 J = K = 50 I = J = K = 200

(54)

Missing elements

10−5 10−4 10−3 10−2 10−1 100 −5 0 5 10 15 20 25 30 35 40 45

Bernoulli Density parameter (ρ)

S N RA (d B )

SNR of loadings A, forrank(X) = 1, (µX,σX) = (1,1)

(I,J,K) = (200,200,200) (I,J,K) = (100,100,100) (I,J,K) = (50,50,50) (I,J,K) = (200,100,50) (I,J,K) = (400,50,50)

(55)

Missing elements

Three-way (emission, excitation, sample) fluorescence spectroscopy

0 50 100 0 100 200 300 −100 0 100 200 300 400 500 data slab 0 50 100 0 100 200 300 −100 0 100 200 300 400 500 randn−based imputation 0 50 100 150 200 250 −0.05 0 0.05 0.1 0.15 0.2 0.25 0.3

Figure :Measured and imputed data; recovered latent spectra