Tensor Factorization via Matrix Factorization

(Kuleshov et al., 2015)

Amir Zakeri, Sebastien Henwood

March 24, 2020

Outline

1 Introduction

2 Tensor Factorization via Matrix Factorization (TFMF)

3 Simultaneous diagonalization

4 Experiments

5 Conclusion

Introduction

Given a tensor $\hat{T} \in \mathbb{R}^{d \times d \times d}$ with the following CP decomposition:

$$\hat{T} = \sum_{i=1}^{k} \pi_i \, a_i \otimes b_i \otimes c_i + \text{noise},$$

our goal is to estimate the factors $a_i, b_i, c_i$ and the factor weights $\pi \in \mathbb{R}^k$. To solve this problem, we saw ALS and gradient-based approaches in class.

This presentation ⇒ Tensor Factorization via Matrix Factorization (TFMF)

The core idea

Project $\hat{T}$ along a vector $w$ to obtain a matrix, eigendecompose that matrix, and repeat for $L$ projections.
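
The setup above can be sketched in a few lines of NumPy. This is only an illustration under assumed sizes and noise level; the helper names `cp_tensor` and `project` are hypothetical and not taken from the paper. It builds a noisy rank-$k$ tensor and contracts its third mode with a vector $w$, which gives a $d \times d$ matrix.

```python
import numpy as np

def cp_tensor(pi, A, B, C):
    """Return sum_i pi[i] * a_i (x) b_i (x) c_i; factors are stored as columns."""
    return np.einsum("i,ai,bi,ci->abc", pi, A, B, C)

def project(T, w):
    """Contract the third mode of T with w, giving a d x d matrix."""
    return np.einsum("abc,c->ab", T, w)

rng = np.random.default_rng(0)
d, k, eps = 6, 3, 1e-3                      # illustrative sizes and noise level
pi = rng.uniform(1.0, 2.0, size=k)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))

T_clean = cp_tensor(pi, A, B, C)
T_hat = T_clean + eps * rng.standard_normal((d, d, d))   # noisy observation

w = rng.standard_normal(d)
M = project(T_hat, w)                       # the matrix that gets eigendecomposed
```

Without noise, the projection equals $\sum_i \pi_i (w^\top c_i)\, a_i b_i^\top$, so the factors of the tensor reappear as factors of a matrix.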

Tensor Factorization via Matrix Factorization (TFMF)

TFMF algorithm overview

1 Input: a tensor $T$ and $L$ random vectors $w_1, \dots, w_L$

2 Project $T$ along each random vector $w_l$, producing $L$ matrices $M_l$

3 Simultaneously diagonalize the $M_l$, producing estimates $\tilde{u}_i$ of the CP factors

4 Refine by repeating with the factor estimates in place of the random vectors

5 Output: CP factor estimates $\tilde{u}_i$

Application: to orthogonal, non-orthogonal and asymmetric tensors of arbitrary order.

Novelty: Simultaneous matrix diagonalization.
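
The five steps can be sketched end to end for the simplest case, a noise-free symmetric tensor with orthonormal factors. Important assumption: the simultaneous diagonalization of step 3 is replaced here by a crude stand-in (eigendecomposition of one random linear combination of the $M_l$), which is not the paper's method and has none of its robustness to noise. The function names are hypothetical.

```python
import numpy as np

def project(T, w):
    """Contract the third mode of T with w."""
    return np.einsum("abc,c->ab", T, w)

def common_eigvecs(Ms, k, rng):
    """Stand-in for joint diagonalization: eigenvectors of a random mix of the M_l."""
    mix = sum(rng.standard_normal() * M for M in Ms)
    vals, vecs = np.linalg.eigh((mix + mix.T) / 2)
    top = np.argsort(-np.abs(vals))[:k]      # keep the k dominant directions
    return vecs[:, top]

def tfmf_orthogonal(T, k, L, rng):
    d = T.shape[0]
    # Steps 1-3: random projections, then shared eigenvectors.
    Ms = [project(T, rng.standard_normal(d)) for _ in range(L)]
    U = common_eigvecs(Ms, k, rng)
    # Step 4: repeat with the estimates as projection directions.
    Ms = [project(T, U[:, i]) for i in range(k)]
    U = common_eigvecs(Ms, k, rng)
    # Weights: pi_i = T(u_i, u_i, u_i) when the factors are orthonormal.
    pi = np.einsum("abc,ai,bi,ci->i", T, U, U, U)
    return U * np.sign(pi), np.abs(pi)       # fix the sign so that pi_i > 0

rng = np.random.default_rng(1)
d, k = 8, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
U_true, pi_true = Q[:, :k], rng.uniform(1.0, 3.0, size=k)
T = np.einsum("i,ai,bi,ci->abc", pi_true, U_true, U_true, U_true)

U_est, pi_est = tfmf_orthogonal(T, k, L=10, rng=rng)
```

The factors are recovered only up to a permutation of the columns, which is inherent to CP decomposition.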

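Factors u?

When $a_i = b_i = c_i = u_i \ \forall i$ ⇒ symmetric factorization! We have:

$$T = \sum_{i} \pi_i \, u_i^{\otimes 3} \tag{1}$$

Project along a vector $w$!

$$T(I, I, w) = \sum_{i} \pi_i (w^\top u_i) \, u_i^{\otimes 2} \tag{2}$$

Estimate $u_i$ by eigendecomposition of Eq. 2 ⇒ $\tilde{u}_i$. The error $\|u_i - \tilde{u}_i\|_2$ is sensitive to noise.

Sensitivity ≈ inverse of the smallest difference between eigenvalues (the eigengap), where $\lambda_i = \pi_i (w^\top u_i)$ for orthonormal $u_i$:

$$\max_{j \neq i} \frac{1}{|\lambda_i - \lambda_j|}$$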
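
A small numerical check of this slide, as a sketch under the assumption of orthonormal factors and no noise (sizes and seed are arbitrary). It projects a symmetric tensor along one random $w$, confirms that the eigenvalues of the projection are $\pi_i (w^\top u_i)$, and computes the sensitivity term for each eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal factors u_i as columns
pi = rng.uniform(1.0, 2.0, size=d)
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)      # Eq. (1)

w = rng.standard_normal(d)
M = np.einsum("abc,c->ab", T, w)                   # Eq. (2)

lam, U_est = np.linalg.eigh(M)                     # eigenvalues in ascending order
lam_expected = np.sort(pi * (U.T @ w))             # lambda_i = pi_i * (w^T u_i)

# Sensitivity of each eigenvector: max_{j != i} 1 / |lambda_i - lambda_j|
gaps = np.abs(lam[:, None] - lam[None, :])
np.fill_diagonal(gaps, np.inf)
sensitivity = 1.0 / gaps.min(axis=1)
```

An unlucky $w$ can make two eigenvalues nearly equal, which inflates `sensitivity` for both; this is the problem that multiple projections address.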

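Solving the eigengap problem with multiple projections (orth. case)

Using $L$ random projections, we obtain the matrices $M_l$:

$$M_l = \sum_{i} \pi_i (w_l^\top u_i) \, u_i^{\otimes 2} \tag{3}$$

The matrices $M_l$ have common eigenvectors ⇒ simultaneous diagonalization! The error bound then follows:

$$\|\tilde{u}_i - u_i\|_2 \le \left( \frac{2\sqrt{2 \|\pi\|_1 \pi_{\max}}}{\pi_i^2} + \frac{C(\delta)}{\pi_i} \right) \epsilon + o(\epsilon) \tag{4}$$

with $C(\delta) = O\!\left(\log(kd/\delta) \sqrt{d/L}\right)$, where $\epsilon$ is the noise level.

⇒ The bigger $L$, the lower the error bound!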
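
That all projections share the same eigenvectors can be verified numerically. The sketch below assumes orthonormal factors and no noise, with arbitrary sizes; it stacks $L$ projections and checks that the single basis $U$ diagonalizes every one of them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 5, 8
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal factors as columns
pi = rng.uniform(1.0, 2.0, size=d)
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)

W = rng.standard_normal((L, d))                    # rows are the random vectors w_l
Ms = np.einsum("abc,lc->lab", T, W)                # Eq. (3): stack of the L matrices M_l

# Express every M_l in the basis U: the result should be diagonal for all l.
D = np.einsum("ai,lab,bj->lij", U, Ms, U)
expected_diag = pi * (W @ U)                       # entry (l, i) = pi_i * (w_l^T u_i)

# Matrices with a common orthonormal eigenbasis commute.
commutator = np.linalg.norm(Ms[0] @ Ms[1] - Ms[1] @ Ms[0])
```

Each $M_l$ has a different set of eigenvalues, so a pair of factors that is poorly separated in one projection is usually well separated in another.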

Using estimates instead of random W

After a first pass with random $W$, the paper proposes using the estimates $\tilde{u}_i$ as the projection vectors. The error bound then becomes

$$\|\tilde{u}_i - u_i\|_2 \le \frac{2\sqrt{\|\pi\|_1 \pi_{\max}}}{\pi_i^2} \, \epsilon + o(\epsilon) \tag{5}$$

⇒ essentially the bound of the previous slide in the limit $L \to \infty$

What about non-orthogonal tensors?

The paper extends this analysis with a new coefficient > 1 (spoiler: the error bound grows).
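
Why plug-in projections help can be seen in a small sketch, assuming orthonormal factors and no noise. The first-pass estimate `u_tilde` is simulated here by perturbing a true factor, which is an assumption of this example and not the output of an actual first pass. Projecting along it gives a matrix dominated by a single component, so that component has a large eigengap.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal factors as columns
pi = rng.uniform(1.0, 2.0, size=d)
T = np.einsum("i,ai,bi,ci->abc", pi, U, U, U)

j = 2
u_tilde = U[:, j] + 0.02 * rng.standard_normal(d)  # simulated rough estimate of u_j
u_tilde /= np.linalg.norm(u_tilde)

# Plug-in projection: M is close to pi_j * u_j u_j^T.
M = np.einsum("abc,c->ab", T, u_tilde)
lam, vecs = np.linalg.eigh(M)
top = np.argmax(np.abs(lam))
u_refined = vecs[:, top]

# Gap between the dominant eigenvalue and all the others.
gap = np.abs(lam[top]) - np.max(np.abs(np.delete(lam, top)))
```

With a random $w$ the gap around a given eigenvalue can be arbitrarily small; with the plug-in direction it is close to $\pi_j$.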

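Simultaneous diagonalization

Symmetric matrices $M_1, \dots, M_L \in \mathbb{R}^{d \times d}$ of the form:

$$M_l = U \Lambda_l U^\top + R_l,$$

where $U \in \mathbb{R}^{d \times k}$ is common to all matrices, while $\Lambda_l \in \mathbb{R}^{k \times k}$ and $R_l$ are specific to each $M_l$.

Goal: find inverse factors $V^{-1} \in \mathbb{R}^{d \times d}$ such that $V^{-1} M_l V^{-\top}$ is nearly diagonal.

Objective function optimized to find $V$:

$$F(X) \triangleq \sum_{l=1}^{L} \operatorname{off}(X^{-1} M_l X^{-\top}), \qquad \operatorname{off}(A) = \sum_{i \neq j} A_{ij}^2.$$

⇒ this penalizes the off-diagonal terms!

Algorithms used: Jacobi and QRJ1D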
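
For the orthogonal case, the objective can be minimized with Jacobi-style sweeps of plane rotations. The sketch below is a minimal version of that idea, restricted to orthogonal $X$ (so $X^{-1} = X^\top$); it is not the authors' implementation, QRJ1D for the non-orthogonal case is not shown, and the names `off`, `objective` and `jacobi_joint_diag` are hypothetical. For each index pair $(p, q)$ it picks the rotation angle that minimizes the summed squared $(p, q)$ entries across all matrices.

```python
import numpy as np

def off(A):
    """Sum of squared off-diagonal entries of A."""
    return np.sum((A - np.diag(np.diag(A))) ** 2)

def objective(X, Ms):
    """F(X) = sum_l off(X^{-1} M_l X^{-T})."""
    X_inv = np.linalg.inv(X)
    return sum(off(X_inv @ M @ X_inv.T) for M in Ms)

def jacobi_joint_diag(Ms, sweeps=100, tol=1e-12):
    A = np.array(Ms, dtype=float)                  # working copies, shape (L, d, d)
    d = A.shape[1]
    X = np.eye(d)
    for _ in range(sweeps):
        rotated = False
        for p in range(d - 1):
            for q in range(p + 1, d):
                # 2 x L matrix of (diagonal difference, twice the off-diagonal entry)
                h = np.stack([A[:, p, p] - A[:, q, q], A[:, p, q] + A[:, q, p]])
                G = h @ h.T
                theta = 0.25 * np.arctan2(G[0, 1] + G[1, 0], G[0, 0] - G[1, 1])
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > tol:
                    rotated = True
                    R = np.array([[c, -s], [s, c]])
                    X[:, [p, q]] = X[:, [p, q]] @ R
                    # A_l <- R^T A_l R on rows and columns p, q
                    A[:, [p, q], :] = np.einsum("ji,ljk->lik", R, A[:, [p, q], :])
                    A[:, :, [p, q]] = A[:, :, [p, q]] @ R
        if not rotated:                            # a full sweep without rotation
            break
    return X

rng = np.random.default_rng(0)
d, L = 5, 6
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Ms = [U @ np.diag(rng.standard_normal(d)) @ U.T for _ in range(L)]   # R_l = 0 here

X = jacobi_joint_diag(Ms)
```

Each rotation can only decrease the objective, because it leaves the Frobenius norm of every matrix unchanged while moving mass from the $(p, q)$ entries onto the diagonal.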

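For asymmetric and higher-order tensors

Asymmetric tensors:

The $l$-th projection $M_l$ of an asymmetric tensor has the following form:

$$M_l = \sum_{i} \lambda_{il} \, u_i v_i^\top = U \Lambda_l V^\top,$$

where $\Lambda_l$ is diagonal but not necessarily positive, and $U, V$ are common but not necessarily orthogonal.

For each $M_l$ we define another matrix $N_l$ as:

$$N_l = \begin{bmatrix} 0 & M_l^\top \\ M_l & 0 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} V & V \\ U & -U \end{bmatrix} \begin{bmatrix} \Lambda_l & 0 \\ 0 & -\Lambda_l \end{bmatrix} \begin{bmatrix} V & V \\ U & -U \end{bmatrix}^\top.$$

The $N_l$ are symmetric matrices with common (in general, non-orthogonal) factors.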
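
The block identity can be checked numerically. This is a quick sketch with random, non-orthogonal $U$ and $V$ and a diagonal $\Lambda_l$ whose entries may be negative; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 4
U = rng.standard_normal((d, k))            # common factors, not orthogonal
V = rng.standard_normal((d, k))
Lam = np.diag(rng.standard_normal(k))      # diagonal, entries of either sign

M = U @ Lam @ V.T                          # one asymmetric projection M_l
Z = np.zeros((d, d))
N = np.block([[Z, M.T], [M, Z]])           # symmetric embedding N_l

B = np.block([[V, V], [U, -U]])            # common (non-orthogonal) factor
D = np.block([[Lam, np.zeros((k, k))], [np.zeros((k, k)), -Lam]])
N_factored = 0.5 * B @ D @ B.T
```

Since `B` depends only on $U$ and $V$, every $N_l$ shares it, which brings the asymmetric case back to simultaneous diagonalization of symmetric matrices.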

For asymmetric and higher-order tensors

Higher-order tensors:

For a higher-order tensor (say fourth order):

$$T = \sum_{i} \pi_i \, a_i \otimes b_i \otimes c_i \otimes d_i,$$

we first determine $a_i, b_i$ by projecting into matrices:

$$T(I, I, \omega, u) = \sum_{i} \pi_i (\omega^\top c_i)(u^\top d_i) \, a_i \otimes b_i,$$

then determine $c_i, d_i$ by projecting along the first two components.
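
Both projection stages for a fourth-order tensor are plain contractions. The sketch below uses arbitrary sizes and random projection vectors; in the second stage the vectors `x` and `y` are random placeholders, chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
pi = rng.uniform(1.0, 2.0, size=k)
A, B, C, D = (rng.standard_normal((d, k)) for _ in range(4))
T = np.einsum("i,ai,bi,ci,di->abcd", pi, A, B, C, D)

# Stage 1: contract modes 3 and 4, leaving a matrix whose factors are a_i, b_i.
omega, u = rng.standard_normal(d), rng.standard_normal(d)
M_ab = np.einsum("abcd,c,d->ab", T, omega, u)

# Stage 2: contract modes 1 and 2, leaving a matrix whose factors are c_i, d_i.
x, y = rng.standard_normal(d), rng.standard_normal(d)
M_cd = np.einsum("abcd,a,b->cd", T, x, y)
```

Either matrix can then be handled like the projections of a third-order asymmetric tensor.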

Convergence properties

Convergence depends on the choice of joint diagonalization subroutine.

Theoretically:

Convergence to a local minimum at a quadratic rate is guaranteed.

Convergence to a global minimum is an open question!

Empirically, convergence to global minima is observed.

Experiments

Examining convergence to global minima in the orthogonal setting (Jacobi algorithm): ⇒ 1000 random starting points all yield the same solution!

[Figure 1: Histogram of objective function values in the orthogonal setting. x-axis: objective function value.]

Experiments

Plotting histograms for different $\epsilon$ values ($\epsilon = 0$, $\epsilon = 10^{-4}$, $\epsilon = 10^{-3}$)

[Figure 2: Comparing histograms for different $\epsilon$ sizes in the orthogonal setting. Panels: epsilon = 0.0, epsilon = 1e-4, epsilon = 1e-3; x-axis: objective function value.]

For small enough $\epsilon$, convergence is guaranteed.

Experiments

Examining convergence to the global minimum in the non-orthogonal setting:

[Figure 3: Histograms when $\mu$ is big. Panels: epsilon = 0.0, epsilon = 1e-4; x-axis: objective function value.]

Experiments

Examining convergence to the global minimum in the non-orthogonal setting (for small $\mu$):

[Figure 4: Histograms when $\mu$ is small. Panels: epsilon = 1e-4, epsilon = 1e-3; x-axis: objective function value.]

Experiments

Comparing random vs. plug-in projections

[Figure: Error vs. number of projections. Panels: orthogonal case, non-orthogonal case; curves: random projections, plug-in projections.]

Experiments

Performance comparison:

[Figure: Effect of noise on algorithm performance, error vs. noise level. Panels: d=100, k=100 and d=50, k=10; methods: OJD1, OJD0, TPM, Lathauwer, NOJD.]

Conclusion

TFMF, another take on CP decomposition

TFMF = random projections + simultaneous diagonalization + plug-in estimates

Works for orthogonal, non-orthogonal, symmetric, asymmetric, and higher-order tensors

Is more accurate than the state-of-the-art baselines.
