Factoring Tensors in the Cloud:
A Tutorial on Big Tensor Data Analytics
Nikos Sidiropoulos, UMN, and Evangelos Papalexakis, CMU
Web, papers, software, credits
Nikos Sidiropoulos, UMN
Signal and Tensor Analytics Research (STAR) group
https://sites.google.com/a/umn.edu/nikosgroup/home Signal processing, communications, optimization
Tensor decomposition, factor analysis Big data
Spectrum sensing, cognitive radio
Sidekicks: preference measurement, conference review & session assignment
Vangelis Papalexakis, Christos Faloutsos, CMU
http://www.cs.cmu.edu/˜epapalex/and
http://www.cs.cmu.edu/˜christos/ Data mining
Graph mining
Tensor analysis, coupled matrix-tensor analysis Big data, Hadoop
Christos Faloutsos, Tom Mitchell (CMU), Nikos Sidiropoulos, George Karypis (UMN), NSF-NIH/BIGDATA: Big Tensor Mining: Theory, Scalable Algorithms and Applications, NSF IIS-1247489/1247632.
Matrices, rank decomposition
Amatrix(ortwo-wayarray) is a datasetXindexed bytwo indices,(i,j)-th entryX(i,j).
Simple matrixS(i,j) =a(i)b(j),∀i,j; separable, every row (column)
proportional to every other row (column). Can write asS=abT.
rank(X):= smallest number of ‘simple’ (separable, rank-one) matrices
needed to generateX- a measure of complexity.
X(i,j) = F X f=1 af(i)bf(j); orX= F X f=1 afbTf =ABT.
Turns out rank(X)= maximum number of linearly independent rows (or,
columns) inX.
Rank decomposition for matrices is not unique (except for matrices of
rank = 1), as∀invertibleM:
X=ABT = (AM)M−TBT= (AM) BM−1T
Tensor? What is this?
CS ‘slang’ forthree-wayarray: datasetXindexed bythree indices,
(i,j,k)-th entryX(i,j,k).
In plain words: a‘shoebox’!
Warning:different meaning in Physics!
For two vectorsa(I×1) andb(J×1),a◦bis anI×J rank-one matrix
with (i,j)-th elementa(i)b(j); i.e.,a◦b=abT.
For three vectors,a(I×1),b(J×1),c(K×1),a◦b◦cis anI×J×K
rank-one three-way array with(i,j,k)-th elementa(i)b(j)c(k).
Therank of a three-way arrayXis the smallest number of outer products
Motivating example: NELL @ CMU / Tom Mitchell
Crawl web, learn language ‘like children do’: encounter new concepts, learn from context
NELL triplets of “subject-verb-object” naturally lead to a 3-mode tensor
X ≈ object subject verb ~ a1 ~ b1 ~ c1 + ~ a2 ~ b2 ~ c2 + ~ aF ~ bF ~ cF . . . +
Each rank-one factor corresponds to aconcept, e.g., ‘leaders’ or ‘tools’
E.g., saya1,b1,c1corresponds to ‘leaders’: subjects/rows with high
score ona1will be “Obama”, “Merkel”, “Steve Jobs”, objects/columns with
high score onb1will be “USA”, “Germany”, “Apple Inc.”, and verbs/fibers
with high score onc1will be ‘verbs’, like “lead”, “is-president-of”, and
Kronecker and Khatri-Rao products
⊗stands for the Kronecker product:
A⊗B= BA(1,1),BA(1,2),· · · BA(2,1),BA(2,2),· · · .. .
stands for the Khatri-Rao (column-wise Kronecker) product: givenA
(I×F) andB(J×F),ABis theJI×F matrix
AB=
A(:,1)⊗B(:,1)· · ·A(:,F)⊗B(:,F)
vec(ABC) = (CT ⊗A)vec(B)
Rank decomposition for tensors
•Tensor: X= F X f=1 af◦bf ◦cf •Scalar: X(i,j,k) = F X f=1 ai,fbj,fck,f, ∀i∈ {1,· · ·,I} ∀j∈ {1,· · ·,J} ∀k ∈ {1,· · · ,K} •Slabs: Xk =ADk(C)BT, k =1,· · ·,K •Matrix: X(KJ×I) = (BC)AT •Tall vector: x(KJI) :=vecX(KJ×I)= (A(BC))1 F×1= (ABC)1F×1Do I need to worry about higher-way (or,
higher-order
)
tensors?
Semantic analysis of Brain fMRI data
fMRI→semantic category scores
fMRI mode is vectorized (O(105−106))
Spatial coherence - better to treat as three separate spatial modes→
5-way array
Rank decomposition for higher-order tensors
•Tensor: X= F X f=1 a(f1)◦ · · · ◦a(fN) •Scalar: X(i1,· · ·,iN) = F X f=1 N Y n=1 a(in) n,f •Matrix: X(I1I2···IN−1×IN)= A(N−1)A(N−2) · · · A(1) A(N)T •Tall vector: x(I1···IN):=vec X(I1I2···IN−1×IN) = A(N)A(N−1)A(N−2) · · · A(1)1F×1Tensor Singular Value Decomposition?
For matrices, SVD is instrumental: rank-revealing, Eckart-Young So is there a tensor equivalent to the matrix SVD?
Yes, ... and no! In fact there is no single tensor SVD. Two basic decompositions:
CANonical DECOMPosition (CANDECOMP), also known as PARAllel FACtor (PARAFAC) analysis, or CANDECOMP-PARAFAC (CP) for short: non-orthogonal, unique under certain conditions.
Tucker3, orthogonal without loss of generality, non-unique except for very special cases.
Both are outer product decompositions, but with very different structural properties.
Rule of thumb: use Tucker3 for subspace estimation and tensor approximation, e.g., compression applications; use PARAFAC for latent parameter estimation - recovering the ‘hidden’ rank-one factors.
Tucker3
I×J×K three-way arrayX
A:I×L,B:J×M,C:K ×N mode loading matrices
Tucker3, continued
Consider anI×J×K three-way arrayXcomprisingK matrix slabs
{Xk}Kk=1, arranged into matrixX:= [vec(X1),· · ·,vec(XK)].
The Tucker3 model can be written as
X≈(B⊗A)GCT,
whereGis the Tucker3 core tensorGrecast in matrix form. The non-zero
elements of the core tensor determine the interactions between columns ofA,B,C.
The associated model-fitting problem is min
A,B,C,G||X−(B⊗A)GC
T||2
F,
which is usually solved using an alternating least squares procedure.
vec(X)≈(C⊗B⊗A)vec(G).
Highly non-unique- e.g., rotateC, counter-rotateGusing unitary matrix.
Subspaces can be recovered; Tucker3 is good fortensor approximation,
PARAFAC
Figure :From Matlab Central / N-way toolbox (Rasmus Bro)http://www.
mathworks.com/matlabcentral/fileexchange/1088-the-n-way-toolbox
Low-rank tensor decomposition / approximation
X≈
F X
f=1
af◦bf◦cf,
PARAFAC [Harshman ’70-’72], CANDECOMP [Carroll & Chang, ’70], now CP; also cf. [Hitchcock, ’27]
Xk ≈ADk(C)BT, whereDk(C)is a diagonal matrix holding thek-th row of
Cin its diagonal.
Combining slabs and using Khatri-Rao product,
Uniqueness
Under certain conditions, PARAFAC isessentially unique, i.e.,(A,B,C)
can be identified fromXup to permutation and scaling of columns
-there’s no rotational freedom; cf. [Kruskal ’77, Sidiropouloset al’00 - ’07,
de Lathauwer ’04-, Stegeman ’06-, Chiantini, Ottaviani ’11-, ...]
I×J×K tensorXof rankF, vectorized asIJK×1 vector
x= (ABC)1, for someA(I×F),B(J×F), andC(K ×F) - a
PARAFAC model of sizeI×J×K and orderF parameterized by
(A,B,C).
TheKruskal-rankofA, denotedkA, is the maximumk such thatany k
columns ofAare linearly independent (kA≤rA:=rank(A)).
GivenX(⇔x), ifkA+kB+kC≥2F+2, then(A,B,C)are unique up to a
common column permutation and scaling/counter-scaling (e.g., multiply
first column ofAby 5, divide first column ofBby 5, outer product stays
the same) - cf. [Kruskal, 1977]
N-way case:PN
Special case: Two slabs
Assumesquare/tall, full column rankA(I×F)andB(J×F); and distinct
nonzero elements along the diagonal ofD:
X1 = ABT,
X2 = ADBT
Introduce the compactF-component SVD of
X1 X2 = A AD BT =UΣVH
rank(B) =F implies that span(U) =span
A AD
; hence there exists
a nonsingular matrixPsuch that
U= U1 U2 = A AD P
Special case: Two slabs, continued
Next, construct the auto- and cross-product matrices
R1 = UH1U1 = PHAHAP =: QP,
R2 = UH1U2 = PHAHADP = QDP.
All matrices on the right hand side are square and full-rank. Solving the first equation and substituting to the second, yields the eigenvalue problem
R−11 R2
P−1=P−1D
P−1(andP) recovered up to permutation and scaling
From bottom of previous page,Alikewise recovered, thenB
Alternating Least Squares (ALS)
Based on matrix view:
X(KJ×I)= (BC)AT Multilinear LS problem: min A,B,C||X (KJ×I)−(BC)AT||2 F
NP-hard- even for a single component, i.e., vectora,b,c. See [Hillar and Lim, “Most tensor problems are NP-hard,” 2013]
But ... given interim estimates ofB,C, can easily solve for conditional LS
update ofA:
ACLS =
(BC)†X(KJ×I)
T
Similarly for the CLS updates ofB,C(symmetry); alternate until cost
Other algorithms?
Many! - first-order (gradient-based), second-order (Hessian-based) Gauss-Newton, line search, Levenberg-Marquardt, weighted least squares, majorization
Algebraic initialization (matters)
See Tomasi and Bro, 2006, for a good overview
Second-order advantage when close to optimum, but can (and do) diverge
First-order often prone to local minima, slow to converge
Stochastic gradient descent (CS community) - simple, parallel, but very slow
Difficult to incorporate additional constraints like sparsity, non-negativity, unimodality, etc.
ALS
No parameters to tune!
Easy to program, uses standard linear LS Monotone convergence of cost function
Does not require any conditions beyond model identifiability
Easy to incorporate additional constraints, due to multilinearity, e.g., replace linear LS with linear NNLS for NN
Even non-convex (e.g., FA) constraints can be handled with column-wise updates (optimal scaling lemma)
Cons: sequential algorithm, convergence can be slow Still workhorse after all these years
Outliers, sparse residuals
Instead of LS,
min||X(KJ×I)−(BC)AT||
1
Conditional update: LP
Almost as good: coordinate-wise, usingweighted median filtering(very
cheap!) [Vorobyov, Rong, Sidiropoulos, Gershman, 2005]
PARAFAC CRLB: [Liu & Sidiropoulos, 2001] (Gaussian); [Vorobyov, Rong, Sidiropoulos, Gershman, 2005] (Laplacian, etc differ only in pdf-dependent scale factor).
Alternating optimization algorithms approach the CRLB when the problem is well-determined (meaning: not barely identifiable).
Big (tensor) data
Tensors can easily become really big! - size exponential in the number of dimensions (‘ways’, or ‘modes’).
Datasets with millions of items per mode - examples in second part of this tutorial.
Cannot load in main memory; may reside in cloud storage.
Sometimesverysparse - can store and process as (i,j,k,value) list,
nonzero column indices for each row, runlength coding, etc.
(Sparse) Tensor Toolbox for Matlab [Koldaet al].
Tensor partitioning?
Parallel algorithms for matrix algebra use data partitioning Can we reuse some of these ideas?
Low-hanging fruit?
First considered in [Phan, Cichocki, Neurocomputing, 2011]
Tensor partitioning
In [Phan & Cichocki, Neurocomputing, 2011] each sub-block is
independently decomposed, then low-rank factors of the big tensor are ‘stitched’ together.
The decomposition of each sub-block must be identifiable→far more stringent ID conditions;
Each sub-block has different permutation and scaling of the factors - all these must be reconciled before stitching (not discussed in paper);
Noise affects sub-block decomposition more than full tensor decomposition.
More recently, [Almeida & Kibangou, CAMSAP 2013, ICASSP 2014]:
Also rely on partitioning, so sub-block identification conditions are required; But more robust, based on collaborative consensus-averaging, and the matching of different permutations and scales is explicitly taken into account; Requires considerable communication overhead across clusters and nodes; i.e., the parallel threads are not independent.
Tensor compression
Commonly used compression method for ‘moderate’-size tensors: fit orthogonal Tucker3 model, regress data onto fitted mode-bases.
Implemented in COMFAC
http://www.ece.umn.edu/˜nikos/comfac.m
Lossless if exact mode bases used [CANDELINC]; but Tucker3 fitting is itself cumbersome for big tensors (big matrix SVDs), cannot compress below mode ranks without introducing errors
Tensor compression
Consider compressingx=vec(X)intoy=Sx, whereSisd ×IJK,
d IJK.
In particular, consider a specially structured compression matrix S=UT⊗VT ⊗WT
Corresponds to multiplying (every slab of)Xfrom theI-mode withUT,
from theJ-mode withVT, and from theK-mode withWT, whereUisI×L,
VisJ×M, andWisK×N, withL≤I,M ≤J,N≤K andLMNIJK
I L M I J J K N L K M X N Y _ _
Key
Due to a property of the Kronecker product
UT ⊗VT ⊗WT(ABC) =
(UTA)(VTB)(WTC), from which it follows that
y=(UTA)(VTB)(WTC)1=A˜ B˜ C˜1.
i.e., the compressed data follow a PARAFAC model of sizeL×M×N
and orderF parameterized by( ˜A,B˜,C˜), withA˜ :=UTA,B˜ :=VTB, ˜
Random
multi-way compression can be better!
Sidiropoulos & Kyrillidis, IEEE SPL Oct. 2012
Assume that the columns ofA,B,Care sparse, and letna(nb,nc) be an
upper bound on the number of nonzero elements per column ofA
(respectivelyB,C).
Let the mode-compression matricesU(I×L,L≤I),V(J×M,M ≤J),
andW(K ×N,N≤K) be randomly drawn from an absolutely continuous
distribution with respect to the Lebesgue measure inRIL,RJM, andRKN,
respectively. If
min(L,kA) +min(M,kB) +min(N,kC)≥2F+2, and L≥2na, M ≥2nb, N≥2nc,
then the original factor loadingsA,B,Care almost surely identifiable from
Proof rests on two lemmas + Kruskal
Lemma 1: ConsiderA˜ :=UTA, whereAisI×F, and let theI×Lmatrix
Ube randomly drawn from an absolutely continuous distribution with
respect to the Lebesgue measure inRIL(e.g., multivariate Gaussian with
a non-singular covariance matrix). ThenkA˜=min(L,kA)almost surely
(with probability 1) [Sidiropoulos & Kyrillidis, 2012]
Lemma 2: ConsiderA˜ :=UTA, whereA˜ andUare given andAis sought.
Suppose that every column ofAhas at mostnanonzero elements, and
thatkUT ≥2na. (The latter holds with probability 1 if theI×LmatrixUis
randomly drawn from an absolutely continuous distribution with respect to
the Lebesgue measure inRIL, and min(I,L)≥2na.) ThenAis the unique
solution with at mostnanonzero elements per column [Donoho & Elad,
Complexity
First fitting PARAFAC in compressed space and then recovering the
sparseA,B,Cfrom the fitted compressed factors entails complexity
O(LMNF+ (I3.5+J3.5+K3.5)F)in practice.
Using sparsity first and then fitting PARAFAC in raw space entails
complexityO(IJKF+ (IJK)3.5)- the difference is huge.
Also note that the proposed approach does not require computations in the uncompressed data domain, which is important for big data that do not fit in memory for processing.
Further compression - down to
O
(
√
F
)
in 2/3 modes
Sidiropoulos & Kyrillidis, IEEE SPL Oct. 2012
Assume that the columns ofA,B,Care sparse, and letna(nb,nc) be an
upper bound on the number of nonzero elements per column ofA
(respectivelyB,C).
Let the mode-compression matricesU(I×L,L≤I),V(J×M,M ≤J),
andW(K ×N,N≤K) be randomly drawn from an absolutely continuous
distribution with respect to the Lebesgue measure inRIL,RJM, andRKN,
respectively. If
rA=rB=rC=F
L(L−1)M(M−1)≥2F(F −1), N≥F, and
L≥2na, M ≥2nb, N≥2nc,
then the original factor loadingsA,B,Care almost surely identifiable from
Proof: Lemma 3 + results on a.s. ID of PARAFAC
Lemma 3: ConsiderA˜ =UTA, whereA(I×F)is deterministic,
tall/square (I≥F) and full column rankrA=F, and the elements ofU
(I×L) are i.i.d. Gaussian zero mean, unit variance random variables.
Then the distribution ofA˜ is nonsingular multivariate Gaussian.
From [Stegeman, ten Berge, de Lathauwer 2006] (see also [Jiang, Sidiropoulos 2004], we know that PARAFAC is almost surely identifiable if
the loading matricesA˜,B˜ are randomly drawn from an absolutely
continuous distribution with respect to the Lebesgue measure inR(L+M)F,
˜
Generalization to higher-way arrays
Theorem 3: Letx= (A1 · · · Aδ)1∈R Qδ d=1Id, whereA d isId×F, and consider compressing it toy= UT 1 ⊗ · · · ⊗UTδ x= (UT 1A1) · · · (UTδAδ) 1=A˜1 · · · A˜δ 1∈RQδd=1Ld, where themode-compression matricesUd(Id×Ld,Ld ≤Id) are randomly drawn
from an absolutely continuous distribution with respect to the Lebesgue
measure inRIdLd. Assume that the columns ofA
d are sparse, and letnd
be an upper bound on the number of nonzero elements per column ofAd,
for eachd∈ {1,· · ·, δ}. If
δ
X
d=1
min(Ld,kAd)≥2F+δ−1, and Ld ≥2nd, ∀d ∈ {1,· · ·, δ},
then the original factor loadings{Ad}
δ
d=1are almost surely identifiable
from the compressed datayup to a common column permutation and
scaling.
Latest on PARAFAC uniqueness
Luca Chiantini and Giorgio Ottaviani, On Generic Identifiability of 3-Tensors of Small Rank, SIAM. J. Matrix Anal. & Appl., 33(3), 1018–1037:
Consider anI×J×K tensorXof rankF, and order the dimensions so
thatI≤J ≤K
Leti be maximal such that 2i ≤I, and likewisej maximal such that 2i ≤J
IfF ≤2i+j−2, thenXhas a unique decomposition almost surely
ForI,J powers of 2, the condition simplifies toF ≤ IJ
4
More generally, condition implies:
ifF ≤(I+1)(J+1)
Even further compression
Assume that the columns ofA,B,Care sparse, and letna(nb,nc) be an
upper bound on the number of nonzero elements per column ofA
(respectivelyB,C).
Let the mode-compression matricesU(I×L,L≤I),V(J×M,M ≤J),
andW(K ×N,N≤K) be randomly drawn from an absolutely continuous
distribution with respect to the Lebesgue measure inRIL,RJM, andRKN,
respectively.
AssumeL≤M ≤N, andL,M are powers of 2, for simplicity
If
rA=rB=rC=F LM≥4F, N≥M ≥L, and
L≥2na, M ≥2nb, N≥2nc,
then the original factor loadingsA,B,Care almost surely identifiable from
the compressed data up to a common column permutation and scaling.
What if
A
,
B
,
C
are not sparse?
IfA,B,Care sparse with respect toknownbases, i.e.,A=RAˇ,B=SBˇ,
andC=TCˇ, withR,S,Tthe respective sparsifying bases, andAˇ,Bˇ,Cˇ
sparse
Then the previous results carry over under appropriate conditions, e.g.,
whenR,S,Tare non-singular.
PARACOMP: PArallel RAndomly COMPressed Cubes
I J K X _ (U2,V2,W2) … M1 L1 Y_1 N1 M2 L2 N2 Y _2 MP LP Y_P NP … (A2,B2,C2)~ ~ ~ … (A,B,C) join (LS) fork I Lp Mp I J J K Np Lp K Mp X Np Y _ _ UpT V p T WpT pPARACOMP
AssumeA˜p,B˜p,C˜pidentifiable fromYp(up to perm & scaling of cols)
Upon factoringYpintoF rank-one components, we obtain
˜
Ap=UTpAΠpΛp. (1)
Assume first 2 columns of eachUpare common, letU¯ denote this
common part, andA¯p:=first two rows ofA˜p. Then
¯
Ap= ¯UTAΠpΛp.
Dividing each column ofA¯pby the element of maximum modulus in that
column, denoting the resulting 2×F matrixAˆp,
ˆ
Ap= ¯UTAΛΠp.
Λdoes not affect the ratio of elements in each 2×1 column. If ratios are
distinct, then permutations can be matched by sorting the ratios of the
PARACOMP
In practice using a few more ‘anchor’ rows will improve perm-matching.
WhenSanchor rows are used, the opt permutation matching cast as
min
Π ||
ˆ
A1−AˆpΠ||2F,
Optimization over set of permutation matrices - hard?
||Aˆ1−AˆpΠ||2F =Tr ( ˆA1−AˆpΠ)T( ˆA1−AˆpΠ) = ||Aˆ1||2F +||AˆpΠ||2F−2Tr( ˆAT1AˆpΠ) = ||Aˆ1||2F+||Aˆp||2F−2Tr( ˆAT1AˆpΠ). ⇐⇒max Π Tr( ˆA T 1AˆpΠ),
PARACOMP
After perm-matching, back to (1) and permute columns→A˘psatisfying
˘
Ap=UTpAΠΛp.
Remains to get rid ofΛp. For this, we can again resort to the first two
common rows, and divide each column ofA˘pwith its top element→
ˇ
Ap=UTpAΠΛ.
For recovery ofAup to perm-scaling of cols, we then require that
ˇ A1 .. . ˇ AP = UT 1 .. . UT P AΠΛ (2)
PARACOMP
This implies that
2+
P X
p=1
(Lp−2)≥I.
Every sub-matrix contains the two common anchor rows, duplicate rows do not increase rank.
Once dim requirement met, full rank w.p. 1, non-redundant entries drawn from jointly cont. distribution (by design).
AssumingLp=L,∀p∈ {1,· · ·,P}
P≥ I−2 L−2.
Common column perm ofA˜p,B˜p,C˜piscommon→
P≥max I−2 L−2, J M, K N
PARACOMP
If compression ratios in different modes are similar, makes sense to use longest mode for anchoring; if this is the last mode, then
P≥max I L, J M, K−2 N−2
Theorem:Assume thatF ≤I≤J ≤K, andA,B,Care full column rank
(F). Further assume thatLp=L,Mp=M,Np=N,∀p∈ {1,· · ·,P},
L≤M ≤N,(L+1)(M+1)≥16F, random{Up}Pp=1,{Vp}
P
p=1, eachWp
contains two common anchor columns, otherwise random{Wp}Pp=1.
ThenA˜p,B˜p,C˜p
unique up to column permutation and scaling. If, in addition,P ≥maxLI,MJ,NK−2−2, then(A,B,C)are almost surely identifiable fromnA˜p,B˜p,C˜p
oP
p=1up to a common column permutation
PARACOMP - Significance
Indicative of a family of results that can be derived.
Theorem shows that fully parallel computation of the big tensor
decomposition is possible – first result thatguaranteesID of the big
tensor decomposition from the small tensor decompositions, without stringent additional constraints.
Corollary:If K−2 N−2 =max I L, J M, K−2 N−2
, then the memory / storage and computational complexity savings afforded by PARACOMP relative to brute-force computation are of order IJF.
Note on complexity of solving master join equation: after removing redundant rows, system matrix in (2) will have approximately orthogonal
PARACOMP
Cases whereF >min(I,J,K)can be treated using Kruskal’s condition;
total storage and complexity gains still of order IJ
F2.
Latent sparsity: Assume that every column ofA(B,C) has at mostna
(resp.nb,nc) nonzero elements. A column ofAcan be uniquely
recovered from only 2naincoherent linear equations.
Therefore, we may replace the condition
P≥max I L, J M, K −2 N−2 , with P≥max 2na L , 2nb M , 2nc−2 N−2 .
Color of compressed noise
Y=X+Z, whereZ: zero-mean additive white noise.
y=x+z, withy:=vec(Y),x:=vec(X),z:=vec(Z).
Multi-way compression→Yc yc :=vec(Yc) = UT ⊗VT ⊗WTy= UT ⊗VT ⊗WTx+UT ⊗VT ⊗WTz. Letzc := UT ⊗VT ⊗WT
z. Clearly,E[zc] =0; it can be shown that
EhzczTc i
=σ2UTU⊗VTV⊗WTW.
⇒IfU,V,Ware orthonormal, then noise in the compressed domain is white.
‘Universal’ LS
E[zc] =0, and
EhzczTc i
=σ2UTU⊗VTV⊗WTW.
For largeI andUdrawn from a zero-mean unit-variance uncorrelated distribution,UTU≈Iby the law of large numbers.
Furthermore, even ifzis not Gaussian,zc will be approximately Gaussian
for largeIJK, by the Central Limit Theorem.
Follows that least-squares fitting is approximately optimal in the compressed domain, even if it is not so in the uncompressed domain.
Component energy
≈
preserved after compression
Consider randomly compressing a rank-one tensorX=a◦b◦c, written
in vectorized form asx=a⊗b⊗c.
The compressed tensor isX˜, in vectorized form
˜
x=UT ⊗VT ⊗WT(a⊗b⊗c) = (UTa)⊗(VTb)⊗(WTc).
Can be shown that, for moderateL,M,Nand beyond, Frobenious norm
of compressed rank-one tensor approximately proportional to Frobenious norm of the uncompressed rank-one tensor component of original tensor. In other words: compression approximately preserves component energy
⇒order.
⇒Low-rank least-squares approximation of the compressed tensor low-rank least-squares approximation of the big tensor, approximately.
Match component permutations across replicas by sorting component energies.
PARACOMP: Numerical results
Nominal setup:
I=J=K =500;F =5;A,B,C∼randn(500,5);
L=M=N=50 (each replica = 0.1% of big tensor);
P=12 replicas (overall cloud storage = 1.2% of big tensor).
S=3 (vs.Smin=2) anchor rows.
↑Satisfy identifiability without much ‘slack’. + WGN stdσ=0.01.
COMFACwww.ece.umn.edu/˜nikosused for all factorizations, big and small
PARACOMP: MSE as a function of
L
=
M
=
N
FixP=12, varyL=M =N. 50 100 150 10−7 10−6 10−5 10−4 10−3 L=M=N ||A−Ahat|| F 2 I=J=K=500; F=5; sigma=0.01; P=12; S=3 PARACOMP Direct, no compression 32% of full data 1.2% of full data (P=12 processors, each w/ 0.1 % of full data)PARACOMP: MSE as a function of
P
FixL=M =N=50, varyP. 20 30 40 50 60 70 80 90 100 110 10−7 10−6 10−5 10−4 10−3 P ||A−Ahat|| F 2 I=J=K=500; L=M=N=50; F=5; sigma=0.01; S=3 PARACOMP Direct, no compression12% of the full data in P=120 processors w/ 0.1% each 1.2% of the full data
in P=12 processors w/ 0.1% each
PARACOMP: MSE vs AWGN variance
σ
2 FixL=M =N=50,P=12, varyσ2. 10−8 10−6 10−4 10−2 100 10−12 10−10 10−8 10−6 10−4 10−2 100 102 sigma2 ||A−Ahat|| F 2 I=J=K=500; F=5; L=M=N=50; P=12; S=3 PARACOMP Direct, no compressionMissing elements
Recommender systems, NELL, many other datasets: over 90% of the values are missing!
PARACOMP to the rescue: fortuitous fringe benefit of ‘compression’ (rather: taking linear combinations)!
LetT denote the set of all elements, andΨthe set of available elements.
Consider one element of the compressed tensor, as it would have been computed had all elements been available; and as it can be computed from the available elements (notice normalization - important!):
Yν(l,m,n) = 1 |T | X (i,j,k)∈T ul(i)vm(j)wn(k)X(i,j,k) e Yν(l,m,n) = 1 E[|Ψ|] X (i,j,k)∈Ψ ul(i)vm(j)wn(k)X(i,j,k)
Missing elements
Theorem:[Marcos & Sidiropoulos, IEEE ISCCSP 2014] Assume a
Bernoulli i.i.d. miss model, with parameterρ=Prob[(i,j,k)∈Ψ], and let
X(i,j,k) =PF
f=1af(i)bf(j)cf(k), where the elements ofaf,bf andcf are
all i.i.d. random variables drawn fromaf(i)∼ Pa(µa, σa),
bf(j)∼ Pb(µb, σb), andcf(k)∼ Pc(µc, σc), withpa:=µ2a+σa2,
pb:=µ2b+σ2b,pc :=µ2c+σc2, andF0:= (F−1). Then, forµa,µb,µc all
6 =0, E[kEνk2F] E[kYνk2F] ≤ (1−ρ) ρ|T | (1+ σ2 U µ2 u )(1+σ 2 V µ2 v )(1+σ 2 W µ2 w )(F 0 F + papbpc Fµ2aµ2bµ2c)
Missing elements
10−5 10−4 10−3 10−2 10−1 100 −10 0 10 20 30 40 50 60 70Bernoulli Density parameter (ρ)
S N RY (d B ) SNR ofY, forrank(X) = 1, (µX,σX) = (1,1)
Actual SNR using expected number of available alements Actual SNR using actual number of available alements Theoretical SNR Lower bound
Theoretical SNR (Exact)
I = J = K = 50
I = 400 J = K = 50 I = J = K = 200
Missing elements
10−5 10−4 10−3 10−2 10−1 100 −5 0 5 10 15 20 25 30 35 40 45Bernoulli Density parameter (ρ)
S N RA (d B )
SNR of loadings A, forrank(X) = 1, (µX,σX) = (1,1)
(I,J,K) = (200,200,200) (I,J,K) = (100,100,100) (I,J,K) = (50,50,50) (I,J,K) = (200,100,50) (I,J,K) = (400,50,50)
Missing elements
Three-way (emission, excitation, sample) fluorescence spectroscopy
0 50 100 0 100 200 300 −100 0 100 200 300 400 500 data slab 0 50 100 0 100 200 300 −100 0 100 200 300 400 500 randn−based imputation 0 50 100 150 200 250 −0.05 0 0.05 0.1 0.15 0.2 0.25 0.3
Figure :Measured and imputed data; recovered latent spectra