Tensor Factorization via Matrix Factorization
(Kuleshov et al., 2015)
Amir Zakeri, Sebastien Henwood
March 24, 2020
Outline
1 Introduction
2 Tensor Factorization via Matrix Factorization (TFMF)
3 Simultaneous diagonalization
4 Experiments
5 Conclusion
Introduction
Given a tensor $T \in \mathbb{R}^{d \times d \times d}$, observed with noise as $\hat{T}$, with the following CP decomposition:
$$\hat{T} = \sum_{i=1}^{k} \pi_i \, a_i \otimes b_i \otimes c_i + \text{noise},$$
our goal is to estimate the factors $a_i, b_i, c_i$ and the factor weights $\pi \in \mathbb{R}^k$.
To solve this, we saw ALS and gradient-based approaches in class.
This presentation ⇒ Tensor Factorization via Matrix Factorization (TFMF)
The core idea
Project $T$ along a vector $w$ and eigendecompose the resulting matrix; repeat $L$ times.
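To make the model concrete, here is a minimal NumPy sketch that builds a small tensor from its CP factors. The helper name `cp_tensor`, the sizes and the Gaussian noise model are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cp_tensor(pi, A, B, C):
    """Return sum_i pi[i] * a_i (x) b_i (x) c_i, where a_i, b_i, c_i
    are the i-th columns of A, B, C."""
    return np.einsum("i,ai,bi,ci->abc", pi, A, B, C)

rng = np.random.default_rng(0)
d, k = 5, 3                         # illustrative sizes
pi = np.array([1.0, 2.0, 3.0])      # factor weights
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))

T = cp_tensor(pi, A, B, C)                           # noiseless tensor
T_hat = T + 1e-3 * rng.standard_normal((d, d, d))    # noisy observation (assumed noise model)
```

Each entry is `T[a, b, c] = sum_i pi[i] * A[a, i] * B[b, i] * C[c, i]`, which is what the `einsum` string encodes.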
Tensor Factorization via Matrix Factorization (TFMF)
TFMF algorithm overview
1 Input: $L$ random vectors $w_1, \dots, w_L$ and a tensor $T$
2 Project $T$ along each random vector $w_\ell$, producing $L$ matrices $M_\ell$
3 Simultaneously diagonalize the $M_\ell$, producing estimates $\tilde{u}_i$ of the CP factors
4 Refine by repeating with the factor estimates in place of the random vectors
5 Output: CP factor estimates $\tilde{u}_i$
Application: orthogonal, non-orthogonal and asymmetric tensors of arbitrary order.
Novelty: simultaneous matrix diagonalization.
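The five steps can be sketched end to end for the simplest setting: a noiseless symmetric tensor with orthonormal factors and k = d. This is a sketch under those assumptions, not the paper's implementation. The function names are hypothetical, the diagonalization step is a standard Jacobi-angle routine for the orthogonal case, and the weight recovery $\pi_i = T(u_i, u_i, u_i)$ is an added assumption that holds for orthonormal factors.

```python
import numpy as np

def project(T, W):
    """Contract the third mode of T with each row of W; returns an (L, d, d) stack."""
    return np.einsum("abc,lc->lab", T, W)

def joint_diag_jacobi(Ms, sweeps=100, tol=1e-12):
    """Jacobi-rotation joint diagonalization of symmetric matrices (orthogonal case)."""
    Ms = np.array(Ms, dtype=float)            # working copy, shape (L, d, d)
    d = Ms.shape[1]
    V = np.eye(d)
    for _ in range(sweeps):
        rotated = False
        for p in range(d - 1):
            for q in range(p + 1, d):
                # Rows of h: (a_pp - a_qq) and 2 a_pq, one column per matrix
                h = np.stack([Ms[:, p, p] - Ms[:, q, q], 2.0 * Ms[:, p, q]])
                if np.linalg.norm(h[1]) < tol:
                    continue                  # pair (p, q) is already diagonal
                _, vecs = np.linalg.eigh(h @ h.T)
                x, y = vecs[:, -1]            # principal eigenvector = (cos 2t, sin 2t)
                if x < 0:
                    x, y = -x, -y
                c = np.sqrt((1.0 + x) / 2.0)  # cos t
                s = y / (2.0 * c)             # sin t
                if abs(s) < tol:
                    continue
                J = np.eye(d)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = -s, s
                Ms = J.T @ Ms @ J             # rotate every matrix in the stack
                V = V @ J
                rotated = True
        if not rotated:
            break
    return V

def tfmf_orthogonal(T, L, rng):
    d = T.shape[0]
    W = rng.standard_normal((L, d))           # step 1: random vectors
    V = joint_diag_jacobi(project(T, W))      # steps 2-3: project, then diagonalize
    V = joint_diag_jacobi(project(T, V.T))    # step 4: plug-in refinement
    # Assumed weight recovery: pi_i = T(u_i, u_i, u_i) for orthonormal u_i
    pi_est = np.einsum("abc,ai,bi,ci->i", T, V, V, V)
    signs = np.sign(pi_est)                   # fix the sign ambiguity of each factor
    return V * signs, pi_est * signs          # step 5

rng = np.random.default_rng(0)
d = 4
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal ground-truth factors
pi = np.array([1.0, 2.0, 3.0, 4.0])
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)

U_est, pi_est = tfmf_orthogonal(T, L=10, rng=rng)
```

The estimates match the true factors only up to a permutation of the columns, so a comparison should either sort the weights or rebuild the tensor from `U_est` and `pi_est`.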
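Factors $u$?
When $a_i = b_i = c_i = u_i \ \forall i$ ⇒ symmetric factorization! We have:
$$\sum_i \pi_i \, u_i^{\otimes 3} \tag{1}$$
Project along a vector $w$:
$$\sum_i \pi_i \, (w^\top u_i) \, u_i^{\otimes 2} \tag{2}$$
Estimate $u_i$ by eigendecomposition of Eq. 2 ⇒ $\tilde{u}_i$
The error $\|u_i - \tilde{u}_i\|_2$ is sensitive to noise.
Sensitivity ≈ inverse of the smallest difference between eigenvalues (the eigengap):
$$\max_{j \neq i} \frac{1}{|\lambda_i - \lambda_j|}$$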
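A single projection can be tried out numerically. The sketch below assumes orthonormal factors and no noise; the sizes and variable names are illustrative. It projects a symmetric tensor along one random vector, reads the factors off the eigenvectors, and evaluates the sensitivity term for each factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal factors u_i (assumed)
pi = np.array([1.0, 2.0, 3.0])
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)      # Eq. (1)

w = rng.standard_normal(d)
M = np.einsum("abc,c->ab", T, w)                   # Eq. (2): projection along w

# M is symmetric with eigenvalues lambda_i = pi_i * (w^T u_i) and d - k zeros
eigvals, eigvecs = np.linalg.eigh(M)
top = np.argsort(-np.abs(eigvals))[:k]             # keep the k largest in magnitude
U_est = eigvecs[:, top]                            # factor estimates, up to sign and order

# Sensitivity of each factor: max_{j != i} 1 / |lambda_i - lambda_j|
# (the zero eigenvalues of M are ignored here, as in the formula above)
lam = pi * (U.T @ w)
diffs = np.abs(lam[:, None] - lam[None, :])
np.fill_diagonal(diffs, np.inf)
sensitivity = 1.0 / diffs.min(axis=1)
```

An unlucky draw of `w` can place two $\lambda_i$ close together, which inflates `sensitivity`; this is the weakness of relying on a single projection.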
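Solving the eigengap with multiple projections (orth. case)
Using $L$ random projections $w_1, \dots, w_L$, we have the matrices $M_\ell$:
$$M_\ell = \sum_i \pi_i \, (w_\ell^\top u_i) \, u_i^{\otimes 2} \tag{3}$$
The matrices $M_\ell$ have common eigenvectors ⇒ simultaneous diagonalization!
The error bound then follows, where $\epsilon$ is the noise level:
$$\|\tilde{u}_i - u_i\|_2 \le \left( \frac{2\sqrt{2 \|\pi\|_1 \pi_{\max}}}{\pi_i^2} + \frac{C(\delta)}{\pi_i} \right) \epsilon + o(\epsilon) \tag{4}$$
with $C(\delta) = O\!\left( \log(kd/\delta) \sqrt{d/L} \right)$
⇒ The bigger $L$, the lower the error bound!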
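The shared eigenvectors can be checked numerically. The sketch below assumes orthonormal factors, k = d and no noise. It also compares the eigengap of each single projection with a joint separation $\sqrt{\sum_\ell (\lambda_{i\ell} - \lambda_{j\ell})^2}$; this quantity is used here only as an informal illustration of why several projections help, not as the paper's exact statement.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 4, 6
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal factors (assumed)
pi = np.array([1.0, 2.0, 3.0, 4.0])
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)

W = rng.standard_normal((L, d))                    # rows are w_1, ..., w_L
Ms = np.einsum("abc,lc->lab", T, W)                # Eq. (3): one matrix per projection

# In the basis U every M_l is diagonal, with entries pi_i * (w_l^T u_i)
D = U.T @ Ms @ U                                   # shape (L, d, d)
lams = (W @ U) * pi                                # lams[l, i] = pi_i * (w_l^T u_i)

# Matrices with common eigenvectors commute pairwise
commutator = Ms[0] @ Ms[1] - Ms[1] @ Ms[0]

# Eigengap of each single projection vs. separation across all projections
diff = np.abs(lams[:, :, None] - lams[:, None, :])          # (L, d, d)
off_diag = ~np.eye(d, dtype=bool)
single_gaps = diff[:, off_diag].min(axis=1)                 # one gap per projection
joint_sep = np.sqrt((diff ** 2).sum(axis=0))[off_diag].min()
```

By construction `joint_sep` is at least as large as every entry of `single_gaps`, so a poorly separated projection no longer determines the outcome on its own.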
Using estimates instead of random $W$
After a first pass with random $W$, the paper proposes to use the estimates $\tilde{u}$ as the projection vectors.
The error bound then becomes:
$$\|\tilde{u}_i - u_i\|_2 \le \frac{2\sqrt{\|\pi\|_1 \pi_{\max}}}{\pi_i^2} \, \epsilon + o(\epsilon) \tag{5}$$
⇒ same as the previous slide's bound, up to a constant, when $L \to \infty$
What about non-orthogonal tensors?
The paper extends this analysis with a new coefficient > 1 (spoiler: the error bound grows).
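A small check of what plug-in projections do in the ideal case, assuming orthonormal factors and exact estimates (the true $u_\ell$ stand in for $\tilde{u}_\ell$): projecting along $u_\ell$ isolates the $\ell$-th component, so each projected matrix is rank one.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal factors (assumed)
pi = np.array([1.0, 2.0, 3.0, 4.0])
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)

# Project along each factor: M_l = sum_i pi_i (u_l^T u_i) u_i u_i^T = pi_l u_l u_l^T
plug_in = np.einsum("abc,cl->lab", T, U)
expected = np.stack([pi[l] * np.outer(U[:, l], U[:, l]) for l in range(d)])

# Eigenvalue of factor i under projection l is pi_i if i == l, and 0 otherwise
lams = np.einsum("lab,ai,bi->li", plug_in, U, U)
```

With these projections the eigenvalue profiles of two different factors never coincide, whereas a random projection can leave two of them close together.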
Simultaneous diagonalization
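Symmetric matrices $M_1, \dots, M_L \in \mathbb{R}^{d \times d}$ of the form:
$$M_\ell = U \Lambda_\ell U^\top + R_\ell,$$
where $U \in \mathbb{R}^{d \times k}$ is common, while $\Lambda_\ell \in \mathbb{R}^{k \times k}$ and $R_\ell$ are specific to each matrix.
Goal: find inverse factors $V^{-1} \in \mathbb{R}^{d \times d}$ such that $V^{-1} M_\ell V^{-\top}$ is nearly diagonal.
Objective function to optimize in order to find $V$:
$$F(X) \triangleq \sum_{\ell=1}^{L} \operatorname{off}(X^{-1} M_\ell X^{-\top}), \qquad \operatorname{off}(A) = \sum_{i \neq j} A_{ij}^2.$$
⇒ this penalizes the off-diagonal terms!
Solved with the Jacobi and QRJ1D algorithms.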
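The objective is easy to evaluate directly. The sketch below assumes k = d so that $U$ is square and invertible; the function names `off` and `objective` are illustrative. It checks that $F$ vanishes at the true factors when $R_\ell = 0$ and becomes positive once a small symmetric perturbation is added.

```python
import numpy as np

def off(A):
    """Sum of squared off-diagonal entries of A."""
    return np.sum(A ** 2) - np.sum(np.diag(A) ** 2)

def objective(X, Ms):
    """F(X) = sum_l off(X^{-1} M_l X^{-T})."""
    X_inv = np.linalg.inv(X)
    return sum(off(X_inv @ M @ X_inv.T) for M in Ms)

rng = np.random.default_rng(0)
d, L = 4, 5
U = np.eye(d) + 0.3 * rng.standard_normal((d, d))   # common non-orthogonal factors (k = d assumed)
Lams = rng.standard_normal((L, d))                  # diagonal of each Lambda_l
Ms = np.stack([U @ np.diag(lam) @ U.T for lam in Lams])   # R_l = 0

F_true = objective(U, Ms)              # essentially zero: U diagonalizes every M_l
F_identity = objective(np.eye(d), Ms)  # positive: the M_l are not diagonal as given

# Small symmetric perturbation R_l: the M_l are no longer exactly jointly diagonalizable
R = rng.standard_normal((L, d, d))
R = (R + R.transpose(0, 2, 1)) / 2
F_noisy = objective(U, Ms + 1e-3 * R)
```

Jacobi and QRJ1D differ in how they search over $X$ (orthogonal versus general invertible matrices); both aim to reduce this same quantity.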
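For asymmetric and higher-order tensors
Asymmetric tensors:
The $\ell$-th projection $M_\ell$ of an asymmetric tensor has the following form:
$$M_\ell = \sum_i \lambda_{i\ell} \, u_i v_i^\top = U \Lambda_\ell V^\top,$$
where $\Lambda_\ell$ is a diagonal but not necessarily positive matrix, and $U, V$ are common but not necessarily orthogonal.
For each $M_\ell$ we define another matrix $N_\ell$ as:
$$N_\ell = \begin{bmatrix} 0 & M_\ell^\top \\ M_\ell & 0 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} V & V \\ U & -U \end{bmatrix} \begin{bmatrix} \Lambda_\ell & 0 \\ 0 & -\Lambda_\ell \end{bmatrix} \begin{bmatrix} V & V \\ U & -U \end{bmatrix}^\top.$$
The $N_\ell$ are symmetric matrices with common (in general, non-orthogonal) factors.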
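The block identity can be verified numerically. In this sketch the sizes and the random $U$, $V$, $\Lambda_\ell$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
U = rng.standard_normal((d, k))       # common factors, not orthogonal
V = rng.standard_normal((d, k))
lam = rng.standard_normal(k)          # diagonal of Lambda_l, entries of either sign
M = U @ np.diag(lam) @ V.T            # one asymmetric projection M_l

# Symmetric embedding N_l = [[0, M^T], [M, 0]]
Z = np.zeros((d, d))
N = np.block([[Z, M.T], [M, Z]])

# Factored form: (1/2) P diag(Lambda, -Lambda) P^T with P = [[V, V], [U, -U]]
P = np.block([[V, V], [U, -U]])              # shape (2d, 2k)
D = np.diag(np.concatenate([lam, -lam]))     # shape (2k, 2k)
N_factored = 0.5 * P @ D @ P.T
```

Because `P` does not depend on $\ell$, the symmetric machinery of the previous slide applies to the $N_\ell$, and $U$ and $V$ can be read off the blocks of the recovered factor matrix.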
For asymmetric and higher-order tensors
Higher-order tensors:
For a higher-order tensor (say fourth order):
$$T = \sum_i \pi_i \, a_i \otimes b_i \otimes c_i \otimes d_i,$$
we first determine $a_i, b_i$ by projecting into matrices along vectors $w$ and $u$:
$$T(I, I, w, u) = \sum_i \pi_i \, (w^\top c_i)(u^\top d_i) \, a_i \otimes b_i,$$
then determine $c_i, d_i$ by projecting along the first two components.
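The two projections can be written as tensor contractions. The sketch below uses random factors and illustrative sizes, and checks each projected matrix against its closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
pi = np.array([1.0, 2.0, 3.0])
A, B, C, D = (rng.standard_normal((d, k)) for _ in range(4))
T = np.einsum("i,ai,bi,ci,di->abcd", pi, A, B, C, D)   # fourth-order CP tensor

w, u = rng.standard_normal(d), rng.standard_normal(d)

# Contract modes 3 and 4: the result has factors a_i, b_i
M_ab = np.einsum("abcd,c,d->ab", T, w, u)
weights_ab = pi * (C.T @ w) * (D.T @ u)      # pi_i (w^T c_i)(u^T d_i)
M_ab_expected = (A * weights_ab) @ B.T

# Contract modes 1 and 2 instead: the result has factors c_i, d_i
M_cd = np.einsum("abcd,a,b->cd", T, w, u)
weights_cd = pi * (A.T @ w) * (B.T @ u)
M_cd_expected = (C * weights_cd) @ D.T
```

Each projected matrix has the asymmetric form $U \Lambda V^\top$, so it can be handled by the block construction for asymmetric tensors.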
Convergence properties
Convergence depends on the choice of joint diagonalization subroutine.
Theoretically:
Convergence to a local minimum at a quadratic rate is guaranteed.
Convergence to the global minimum is an open question!
Empirically, convergence to global minima is observed.
Experiments
Examining convergence to global minima in the orthogonal setting (Jacobi algorithm): ⇒ using 1000 random starting points, we get the same solution!
Figure 1: Histogram of objective function values, in the orthogonal setting (x-axis: objective function value)
Experiments
Plotting histograms for different values of $\epsilon$ ($\epsilon = 0$, $\epsilon = 10^{-4}$, $\epsilon = 10^{-3}$)
Figure 2: Comparing histograms for different sizes of $\epsilon$, in the orthogonal setting (panels: epsilon = 0.0, epsilon = 1e−4, epsilon = 1e−3; x-axis: objective function value)
For small enough $\epsilon$, convergence is guaranteed.
Experiments
Examining convergence to the global minimum in the non-orthogonal setting:
Figure 3: Histograms when µ is big (panels: epsilon = 0.0, epsilon = 1e−4; x-axis: objective function value)
Experiments
Examining convergence to the global minimum in the non-orthogonal setting (for small µ)
Figure 4: Histograms when µ is small (panels: epsilon = 1e−4, epsilon = 1e−3; x-axis: objective function value)
Experiments
Comparing random vs. plug-in projections
Figure: Error vs. number of projections, for random projections and plug-in projections (panels: orthogonal case, non-orthogonal case)
Experiments
Performance comparison:
Figure: Effect of noise on algorithm performance, error vs. noise level, for OJD1, OJD0, TPM, Lathauwer and NOJD (panels: d=100, k=100 and d=50, k=10)
Conclusion
TFMF, another take on CP decomposition
TFMF = random projections + simultaneous diagonalization + plug-in estimates
Works for orthogonal, non-orthogonal, symmetric, asymmetric and high-order tensors
Is more accurate than the state of the art.