Tensor Factorization via Matrix Factorization

(1)

(Kuleshov et al., 2015)

Amir Zakeri, Sebastien Henwood

March 24, 2020

(2)

Outline

1 Introduction

2 Tensor Factorization via Matrix Factorization (TFMF)

3 Simultaneous diagonalization

4 Experiments

5 Conclusion

(4)

Introduction

Given a tensor $\hat{T} \in \mathbb{R}^{d \times d \times d}$ with the following CP decomposition:

$$\hat{T} = \sum_{i=1}^{k} \pi_i \, a_i \otimes b_i \otimes c_i + \text{noise},$$

our goal is to estimate the factors $a_i, b_i, c_i$ and the factor weights $\pi \in \mathbb{R}^k$. To solve this, we saw ALS and gradient-based approaches in class.

This presentation ⇒ Tensor Factorization via Matrix Factorization (TFMF)

The core idea

Project the tensor along a vector $w$, eigendecompose the resulting matrix, and repeat $L$ times.
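The projection step can be sketched in NumPy as follows. This is a minimal illustration, assuming a noiseless tensor and randomly drawn factors; the names `A`, `B`, `C`, `pi`, and `w` are placeholders chosen here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3

# Hypothetical factors and weights (random, for illustration only).
A = rng.standard_normal((d, k))
B = rng.standard_normal((d, k))
C = rng.standard_normal((d, k))
pi = rng.uniform(1.0, 2.0, size=k)

# T = sum_i pi_i * a_i (x) b_i (x) c_i  (the noise term is omitted here).
T = np.einsum("i,ai,bi,ci->abc", pi, A, B, C)

# Project T along w on the third mode: M[a, b] = sum_c T[a, b, c] * w[c].
w = rng.standard_normal(d)
M = np.einsum("abc,c->ab", T, w)

# The projection is a matrix with the same factors a_i, b_i,
# reweighted by pi_i * (w^T c_i).
M_expected = (A * (pi * (C.T @ w))) @ B.T
print(np.allclose(M, M_expected))
```

The projected matrix keeps the factors $a_i$ and $b_i$, which is why a matrix decomposition of it can reveal them.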

(9)

Tensor Factorization via Matrix Factorization (TFMF)

TFMF algorithm overview

1 Input: a tensor $T$ and $L$ random vectors $w_1, \dots, w_L$

2 Project $T$ along each random vector $w_\ell$, producing $L$ matrices $M_\ell$

3 Simultaneously diagonalize the $M_\ell$, producing estimates $\tilde{u}_i$ of the CP factors

4 Refine by repeating with the factor estimates in place of the random vectors

5 Output: CP factor estimates $\tilde{u}_i$

Application: to orthogonal, non-orthogonal and asymmetric tensors of arbitrary order.

Novelty: Simultaneous matrix diagonalization.
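The five steps above can be sketched end to end for the simplest setting. This is only an outline under strong assumptions: a symmetric, noiseless tensor with a full set of orthonormal factors ($k = d$). The function names are hypothetical, and `crude_joint_diag` is a deliberately simple stand-in (eigenvectors of one random linear combination of the $M_\ell$), not the joint diagonalization routine used in the paper.

```python
import numpy as np

def project(T, W):
    """Contract the third mode of T with each row of W; returns shape (L, d, d)."""
    return np.einsum("abc,lc->lab", T, W)

def crude_joint_diag(Ms, rng):
    """Stand-in joint diagonalizer (NOT the paper's Jacobi/QRJ1D step):
    eigenvectors of one random linear combination of the M_l."""
    coeffs = rng.standard_normal(len(Ms))
    M = np.einsum("l,lab->ab", coeffs, Ms)
    _, vecs = np.linalg.eigh((M + M.T) / 2.0)
    return vecs

def tfmf_orthogonal(T, L, rng, joint_diag=crude_joint_diag):
    d = T.shape[0]
    # Steps 1-3: random projections, then joint diagonalization.
    W = rng.standard_normal((L, d))
    U_est = joint_diag(project(T, W), rng)
    # Step 4: plug-in refinement, projecting along the estimated factors.
    U_est = joint_diag(project(T, U_est.T), rng)
    # Weights: pi_i = T(u_i, u_i, u_i); flip signs so that pi_i >= 0.
    pi_est = np.einsum("abc,ai,bi,ci->i", T, U_est, U_est, U_est)
    sign = np.where(pi_est < 0, -1.0, 1.0)
    return U_est * sign, np.abs(pi_est)

# Usage on a synthetic orthogonal tensor (assumed setup).
rng = np.random.default_rng(0)
d = 5
U_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
pi_true = rng.uniform(1.0, 2.0, size=d)
T = np.einsum("i,ai,bi,ci->abc", pi_true, U_true, U_true, U_true)

U_est, pi_est = tfmf_orthogonal(T, L=10, rng=rng)
T_rec = np.einsum("i,ai,bi,ci->abc", pi_est, U_est, U_est, U_est)
print(np.allclose(T_rec, T))
```

Passing the diagonalizer as an argument keeps the pipeline separate from the choice of joint diagonalization subroutine, so a proper routine can be swapped in without changing the other steps.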

(10)

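Factors u?

When $a_i = b_i = c_i = u_i$ for all $i$, the factorization is symmetric. We have:

$$\sum_i \pi_i \, u_i^{\otimes 3} \qquad (1)$$

Project along a vector $w$:

$$\sum_i \pi_i (w^T u_i) \, u_i^{\otimes 2} \qquad (2)$$

Estimate $u_i$ by eigendecomposition of Eq. 2 ⇒ $\tilde{u}_i$. The error $\|u_i - \tilde{u}_i\|_2$ is noise sensitive.

Sensitivity ≈ inverse of the smallest difference between eigenvalues (the eigengap), where the $\lambda_i$ are the eigenvalues of the matrix in Eq. 2:

$$\max_{j \neq i} \frac{1}{|\lambda_i - \lambda_j|}$$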
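A small numerical check of the single-projection idea, assuming orthonormal factors and no noise (both are assumptions made for this sketch). In that case the eigenvalues of the projected matrix are $\pi_i (w^T u_i)$ and its eigenvectors are the $u_i$ up to sign and ordering.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

# Assumed setup: orthonormal factors u_i (columns of U) and positive weights.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
pi = rng.uniform(1.0, 2.0, size=d)
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)

# Eq. (2): project along w, giving a symmetric matrix
# with eigenvalues pi_i * (w^T u_i).
w = rng.standard_normal(d)
M = np.einsum("abc,c->ab", T, w)
eigvals, eigvecs = np.linalg.eigh(M)

# Eigengap-based sensitivity: max_{j != i} 1 / |lambda_i - lambda_j|.
gaps = np.abs(eigvals[:, None] - eigvals[None, :])
np.fill_diagonal(gaps, np.inf)
sensitivity = 1.0 / gaps.min(axis=1)

# Each eigenvector matches one true factor up to sign and ordering.
overlap = np.abs(eigvecs.T @ U)
print(overlap.max(axis=1))
print(sensitivity)
```

A single unlucky $w$ can place two eigenvalues close together, which inflates the sensitivity for the corresponding factors.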

(15)

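Solving the eigengap with multiple projections (orth. case)

Using $L$ random projections $w_1, \dots, w_L$, we have the matrices

$$M_\ell = \sum_i \pi_i (w_\ell^T u_i) \, u_i^{\otimes 2} \qquad (3)$$

The matrices $M_\ell$ have common eigenvectors ⇒ simultaneous diagonalization! The error bound then follows:

$$\|\tilde{u}_i - u_i\|_2 \le \left( \frac{2\sqrt{2 \|\pi\|_1 \pi_{\max}}}{\pi_i^2} + \frac{C(\delta)}{\pi_i} \right) \varepsilon + o(\varepsilon) \qquad (4)$$

with $C(\delta) = O\!\left( \log(kd/\delta) \sqrt{d/L} \right)$, where $\varepsilon$ is the noise level.

⇒ The bigger $L$, the lower the error bound!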
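The shared-eigenvector property can be checked numerically. The sketch below assumes orthonormal factors and no noise; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, L = 6, 8

# Assumed setup: orthonormal factors (columns of U) and positive weights.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
pi = rng.uniform(1.0, 2.0, size=d)
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)

# Eq. (3): one projected matrix per random vector w_l (rows of W).
W = rng.standard_normal((L, d))
Ms = np.einsum("abc,lc->lab", T, W)           # shape (L, d, d)

# In the basis U every M_l is diagonal, with diagonal pi_i * (w_l^T u_i).
D = np.einsum("ai,lab,bj->lij", U, Ms, U)
Lambda = np.einsum("lii->li", D)              # shape (L, d)
off_diag = D - Lambda[:, :, None] * np.eye(d)

print(np.abs(off_diag).max())                 # numerically zero
print(np.allclose(Lambda, (W @ U) * pi))
```

Because each $M_\ell$ has a different diagonal in the common basis, two factors that are poorly separated in one projection are generally well separated in another.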

(18)

Using estimates instead of random W

After a first pass with random $W$, the paper proposes using the estimates $\tilde{u}_i$ as the projection vectors. The error bound then becomes

$$\|\tilde{u}_i - u_i\|_2 \le \frac{2\sqrt{\|\pi\|_1 \pi_{\max}}}{\pi_i^2} \, \varepsilon + o(\varepsilon) \qquad (5)$$

⇒ comparable to the bound on the previous slide in the limit $L \to \infty$

What about non-orthogonal tensors?

The paper extends this analysis with a new coefficient > 1 (spoiler: the error bound grows)

(22)

Simultaneous diagonalization

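Consider symmetric matrices $M_1, \dots, M_L \in \mathbb{R}^{d \times d}$ of the form

$$M_l = U \Lambda_l U^T + R_l,$$

where $U \in \mathbb{R}^{d \times k}$ is common to all matrices, while $\Lambda_l \in \mathbb{R}^{k \times k}$ and $R_l$ are specific to each $M_l$.

Goal: find inverse factors $V^{-1} \in \mathbb{R}^{d \times d}$ such that $V^{-1} M_l V^{-T}$ is nearly diagonal for every $l$.

Objective function to optimize in order to find $V$:

$$F(X) \triangleq \sum_{l=1}^{L} \operatorname{off}(X^{-1} M_l X^{-T}), \qquad \operatorname{off}(A) = \sum_{i \neq j} A_{ij}^2.$$

⇒ this penalizes the off-diagonal terms!

Solvers: Jacobi and QRJ1D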
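For the orthogonal case, a Jacobi-type joint diagonalizer can be sketched with Givens rotations. The code below follows the classical scheme in which each rotation angle is taken from the principal eigenvector of a 2x2 matrix; it is an illustration assumed to be in the spirit of the Jacobi routine named on the slide, not the presenters' implementation, and the function names are hypothetical. For orthogonal $V$ the objective reduces to $F(V) = \sum_l \operatorname{off}(V^T M_l V)$.

```python
import numpy as np

def off(A):
    """Sum of squared off-diagonal entries of a square matrix."""
    return np.sum(A**2) - np.sum(np.diag(A)**2)

def jacobi_joint_diag(Ms, sweeps=50, tol=1e-12):
    """Orthogonal joint diagonalization by Jacobi (Givens) rotations.

    Ms: array of shape (L, d, d) holding symmetric matrices.
    Returns (V, Ds) with V orthogonal and Ds[l] = V.T @ Ms[l] @ V.
    """
    Ds = np.array(Ms, dtype=float)
    d = Ds.shape[1]
    V = np.eye(d)
    for _ in range(sweeps):
        rotated = False
        for p in range(d - 1):
            for q in range(p + 1, d):
                # h_l = (A_pp - A_qq, 2 A_pq) for each matrix; G = sum_l h_l h_l^T.
                h = np.stack([Ds[:, p, p] - Ds[:, q, q], 2.0 * Ds[:, p, q]])
                G = h @ h.T
                _, vecs = np.linalg.eigh(G)
                x, y = vecs[:, -1]        # principal eigenvector = (cos 2t, sin 2t)
                if x < 0:
                    x, y = -x, -y
                c = np.sqrt((1.0 + x) / 2.0)
                s = y / (2.0 * c)
                if abs(s) < tol:
                    continue
                rotated = True
                R = np.eye(d)
                R[p, p] = R[q, q] = c
                R[p, q], R[q, p] = -s, s
                Ds = R.T @ Ds @ R         # broadcasts over the L matrices
                V = V @ R
        if not rotated:
            break
    return V, Ds

# Usage on noiseless matrices M_l = U diag(Lams[l]) U^T (assumed setup).
rng = np.random.default_rng(3)
d, L = 5, 10
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Lams = rng.standard_normal((L, d))
Ms = np.einsum("ai,li,bi->lab", U, Lams, U)

V, Ds = jacobi_joint_diag(Ms)
F = sum(off(D) for D in Ds)               # objective F(V) for orthogonal V
overlap = np.abs(V.T @ U)                 # columns of V vs. true factors
print(F)
```

Each rotation minimizes the summed squared $(p, q)$ entries over all $L$ matrices at once, which is what distinguishes it from running a separate Jacobi eigenvalue iteration on each matrix.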

(26)

For asymmetric and higher-order tensors

Asymmetric tensors:

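The $l$-th projection $M_l$ of an asymmetric tensor has the following form:

$$M_l = \sum_i \lambda_{il} \, u_i v_i^T = U \Lambda_l V^T,$$

where $\Lambda_l$ is a diagonal but not necessarily positive matrix, and $U$, $V$ are common to all projections but not necessarily orthogonal.

For each $M_l$ we define another matrix $N_l$ as:

$$N_l = \begin{bmatrix} 0 & M_l^T \\ M_l & 0 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} V & V \\ U & -U \end{bmatrix} \begin{bmatrix} \Lambda_l & 0 \\ 0 & -\Lambda_l \end{bmatrix} \begin{bmatrix} V & V \\ U & -U \end{bmatrix}^T.$$

The $N_l$ are symmetric matrices with common (in general, non-orthogonal) factors.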
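The block identity for $N_l$ is easy to verify numerically. The sketch below uses randomly drawn, non-orthogonal $U$ and $V$ and a diagonal $\Lambda_l$ with mixed signs; all values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 4, 4

# Hypothetical common (non-orthogonal) factors and a diagonal Lambda_l.
U = rng.standard_normal((d, k))
V = rng.standard_normal((d, k))
lam = rng.standard_normal(k)
M = U @ np.diag(lam) @ V.T                # M_l = U Lambda_l V^T

# Symmetric embedding N_l = [[0, M_l^T], [M_l, 0]].
Z = np.zeros((d, d))
N = np.block([[Z, M.T], [M, Z]])

# Common factor [[V, V], [U, -U]] and diag(Lambda_l, -Lambda_l).
W = np.block([[V, V], [U, -U]])
D = np.diag(np.concatenate([lam, -lam]))
N_factored = 0.5 * W @ D @ W.T

print(np.allclose(N, N.T), np.allclose(N, N_factored))
```

Since the factor matrix does not depend on $l$, the $N_l$ can be handed to the same joint diagonalization machinery as in the symmetric case.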

(29)

For asymmetric and higher-order tensors

Higher order tensors:

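For a higher-order tensor (say fourth order):

$$T = \sum_i \pi_i \, a_i \otimes b_i \otimes c_i \otimes d_i,$$

we first determine $a_i, b_i$ by projecting $T$ along two vectors $\omega$ and $u$ into a matrix:

$$T(I, I, \omega, u) = \sum_i \pi_i (\omega^T c_i)(u^T d_i) \, a_i \otimes b_i,$$

where $T(I, I, \omega, u)$ denotes $T$ contracted with $\omega$ and $u$ on its last two modes.

Then we determine $c_i, d_i$ by projecting along the first two components.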
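The two projections for a fourth-order tensor can be sketched as below. Random vectors `omega` and `u` are used for both contractions, which is an assumption of this sketch; the factor names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 4, 3

# Hypothetical factors and weights of a fourth-order CP tensor.
A, B, C, Dm = (rng.standard_normal((d, k)) for _ in range(4))
pi = rng.uniform(1.0, 2.0, size=k)
T = np.einsum("i,ai,bi,ci,di->abcd", pi, A, B, C, Dm)

omega = rng.standard_normal(d)
u = rng.standard_normal(d)

# Contract modes 3 and 4: a d x d matrix that still carries a_i and b_i.
M_ab = np.einsum("abcd,c,d->ab", T, omega, u)
expected_ab = (A * (pi * (C.T @ omega) * (Dm.T @ u))) @ B.T

# Contract modes 1 and 2 instead: a d x d matrix that carries c_i and d_i.
M_cd = np.einsum("abcd,a,b->cd", T, omega, u)
expected_cd = (C * (pi * (A.T @ omega) * (B.T @ u))) @ Dm.T

print(np.allclose(M_ab, expected_ab), np.allclose(M_cd, expected_cd))
```

Both results are ordinary asymmetric matrices, so they reduce to the matrix case handled on the previous slides.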

(32)

Convergence properties

Convergence depends on the choice of joint diagonalization subroutine.

Theoretically:

Convergence to a local minimum at a quadratic rate is guaranteed.

Convergence to the global minimum is an open question!

Empirically, convergence to the global minimum is observed.

(38)

Experiments

Examining convergence to global minima in orthogonal setting (Jacobi Algo.): ⇒ Using 1000 random starting points, getting the same solution!

Figure 1: Histogram of objective function values in the orthogonal setting (x-axis: objective function value).

(39)

Experiments

Plotting histograms for different values of ε (ε = 0, ε = 1e−4, ε = 1e−3)

Figure 2: Comparing histograms for different ε sizes in the orthogonal setting (panels: ε = 0.0, ε = 1e−4, ε = 1e−3; x-axis: objective function value).

For small enough ε, convergence is guaranteed.

(40)

Experiments

Examining convergence to the global minimum in the non-orthogonal setting:

Figure 3: Histograms when µ is big (panels: ε = 0.0, ε = 1e−4; x-axis: objective function value).

(41)

Experiments

Examining convergence to the global minimum in the non-orthogonal setting (for small µ):

Figure 4: Histograms when µ is small (panels: ε = 1e−4, ε = 1e−3).

(42)

Experiments

Comparing random vs. plugin projection

Figure: Error vs. number of projections for random projections and plug-in projections (panels: orthogonal case, non-orthogonal case).

(43)

Experiments

Performance comparison:

Figure: Effect of noise on algorithm performance, error vs. noise level, for OJD1, OJD0, TPM, Lathauwer, and NOJD (panels: d=100, k=100 and d=50, k=10).

(45)

Conclusion

TFMF, another take on CP decomposition

TFMF = random projections + simultaneous diagonalization + plug-in estimates

Works for orthogonal, non-orthogonal, symmetric, asymmetric, and higher-order tensors

Is more accurate than the state of the art