Computing the Factorization - A Three-Way Model for Relational Learning

2. A Three-Way Model for Relational Learning

2.5. Computing the Factorization

Recall from section 2.2 that to compute the R factorization we seek the solution to the optimization problem

$$\operatorname*{arg\,min}_{A,\,R} \; \|X - R \times_1 A \times_2 A\|^2 + \lambda_A \|A\|^2 + \lambda_R \|R\|^2$$

The dimensionality of the latent space, r ∈ {r ∈ ℕ | 0 < r < n}, as well as the regularization parameters λ_A ≥ 0 and λ_R ≥ 0, are assumed to be known. In practice, we will use cross-validation to find the optimal settings for these parameters. Local optima of the optimization problem in equation (2.5) can be computed by taking the partial derivatives of its objective function with respect to A and R and using any gradient-based non-linear optimization algorithm. However, to improve the computational efficiency of the algorithm, we exploit the connections between R and D and adapt techniques from the very efficient Alternating Simultaneous Approximation, Least Squares and Newton (A) algorithm by Bader et al. (2007) to compute the R model.
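As a concrete reading of this objective, the following NumPy sketch evaluates it for dense data. It assumes that the tensors are stored as arrays of frontal slices (X with shape (m, n, n), R with shape (m, r, r)), so that the mode products reduce to A R_k Aᵀ for every slice. The function name `objective` and the array layout are illustrative choices, not part of the original text.

```python
import numpy as np

def objective(X, A, R, lam_A, lam_R):
    """Regularized least-squares objective for slice-stacked dense tensors.

    X: (m, n, n) frontal slices X_k, R: (m, r, r) frontal slices R_k, A: (n, r).
    The storage layout is an assumption made for this sketch.
    """
    # Reconstruction A R_k A^T for every slice k at once.
    X_hat = np.einsum("ir,krs,js->kij", A, R, A)
    fit = np.sum((X - X_hat) ** 2)
    return fit + lam_A * np.sum(A ** 2) + lam_R * np.sum(R ** 2)

rng = np.random.default_rng(0)
n, r, m = 6, 2, 3
A = rng.standard_normal((n, r))
R = rng.standard_normal((m, r, r))
X_exact = np.einsum("ir,krs,js->kij", A, R, A)  # data generated by the model itself

print(objective(X_exact, A, R, 0.0, 0.0))  # no residual and no penalty
print(objective(X_exact, A, R, 0.1, 0.1))  # only the penalty terms remain
```

With data generated exactly by the model, the fit term vanishes, which makes the contribution of the two regularization terms easy to inspect in isolation.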

A is based on the alternating least-squares (ALS) method, which is also the stan- dard method to compute tensor factorizations such as CP or T for a least-squares loss function (Acar and Yener, 2009; Kolda and Bader, 2009). Being a block coordinate descent (BCD) method, in ALS, the variables of an optimization problem are partitioned into disjoint blocks, such that the objective function is optimized via alternating updates of these vari- able blocks until a certain convergence criterion is met (Sra et al., 2012). For R, the optimization variables are already partitioned naturally via the core tensor and the factor matrices, such that we alternatingly keep one of the factorsAandRﬁxed and compute the update for the remaining factor. In each of these update steps, we will use the method of normal equations to computeclosed-formsolutions for the optimization subproblems. In this context, regularization serves also another purpose, as it increases the numerical stability

2.5 Computing the Factorization 47 of normal-equation-based algorithms (Neumaier, 1998). In the following, we will refer to this algorithm as R-ALS The updates for the latent factorsAandRin R-ALS are computed as follows:

Updates for A

To compute updates for the factor matrix A, we keep R fixed and seek the solution to the optimization problem

$$\operatorname*{arg\,min}_{A} \; \|X - R \times_1 A \times_2 A\|^2 + \lambda_A \|A\|^2. \tag{2.8}$$

Since the variable A appears twice in equation (2.8), this optimization problem can no longer be reduced to linear regression, and it would be difficult to derive scalable closed-form updates for it. For this reason, we follow an approach similar to the one taken in A and use an approximate procedure, which is based on the idea of considering the unfoldings of the first and second mode of X simultaneously. To motivate this approach, we rewrite equation (2.8) as the constrained optimization problem

$$\operatorname*{arg\,min}_{A} \; \|X - R \times_1 A_\ell \times_2 A_r\|^2 + \lambda_A \|A_\ell\|^2 \quad \text{subject to } A_\ell = A_r \tag{2.9}$$

where A_ℓ corresponds to the left-hand A in equation (2.8) and A_r to the right-hand A. It can be seen from the equivalence

$$\|X - R \times_1 A_\ell \times_2 A_r\|^2 = \|X_{(1)} - A_\ell R_{(1)} (I \otimes A_r)^T\|^2 = \|X_{(2)} - A_r R_{(2)} (I \otimes A_\ell)^T\|^2$$

that A_ℓ appears as the left-hand factor in the unfolding of the first mode, while A_r appears as the left-hand factor in the unfolding of the second mode. Furthermore, it holds for both tensors X and R that

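$$\begin{aligned}
X_{(1)} &= \begin{bmatrix} X_1 & \cdots & X_m \end{bmatrix}, & R_{(1)} &= \begin{bmatrix} R_1 & \cdots & R_m \end{bmatrix}, \\
X_{(2)} &= \begin{bmatrix} X_1^T & \cdots & X_m^T \end{bmatrix}, & R_{(2)} &= \begin{bmatrix} R_1^T & \cdots & R_m^T \end{bmatrix}
\end{aligned}$$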

since they are third-order tensors with square frontal slices. Now, to approximate equation (2.9), the idea is to stack X₍₁₎ and X₍₂₎, as well as R₍₁₎ and R₍₂₎, side by side and to solve only for the left-hand factor while keeping the right-hand factor constant. This way, the information of both unfoldings is included in an update of A, but the optimization problem can be reduced to simple linear regression. Therefore, to compute an update of A, we set

$$\bar{X} = \begin{bmatrix} X_1 & X_1^T & \cdots & X_m & X_m^T \end{bmatrix}, \qquad \bar{R} = \begin{bmatrix} R_1 & R_1^T & \cdots & R_m & R_m^T \end{bmatrix}$$
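The unfolding identities and the stacked matrices can be checked numerically. The sketch below assumes dense tensors stored as arrays of frontal slices and uses hypothetical helper names (`unfold1`, `unfold2`, `stack_slices`); it compares the slice-wise residual with the two unfolded forms for independent left and right factors, and then builds the stacked matrix.

```python
import numpy as np

def unfold1(T):
    """Mode-1 unfolding [T_1 ... T_m] of a tensor stored as (m, p, p) slices."""
    return np.hstack(list(T))

def unfold2(T):
    """Mode-2 unfolding [T_1^T ... T_m^T] of the same tensor."""
    return np.hstack([Tk.T for Tk in T])

def stack_slices(T):
    """Stacked matrix [T_1 T_1^T ... T_m T_m^T]."""
    return np.hstack([B for Tk in T for B in (Tk, Tk.T)])

def sq_norm(Z):
    return float(np.sum(Z ** 2))

rng = np.random.default_rng(0)
n, r, m = 5, 3, 4
X = rng.standard_normal((m, n, n))
R = rng.standard_normal((m, r, r))
A_l = rng.standard_normal((n, r))  # left-hand factor
A_r = rng.standard_normal((n, r))  # right-hand factor

I_m = np.eye(m)
slice_form = sum(sq_norm(X[k] - A_l @ R[k] @ A_r.T) for k in range(m))
mode1_form = sq_norm(unfold1(X) - A_l @ unfold1(R) @ np.kron(I_m, A_r).T)
mode2_form = sq_norm(unfold2(X) - A_r @ unfold2(R) @ np.kron(I_m, A_l).T)
print(slice_form, mode1_form, mode2_form)  # three evaluations of the same residual

X_bar = stack_slices(X)
R_bar = stack_slices(R)
print(X_bar.shape, R_bar.shape)  # (n, 2*m*n) and (r, 2*m*r)
```

The explicit Kronecker products are used here only because the example is tiny; they are what makes the direct approach expensive for larger n and m.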

Furthermore, let M = R̄(I_2m ⊗ A)ᵀ, where X̄ and R̄ denote the stacked matrices defined above and the A inside M is held constant at its current value. Then, we approximate equation (2.8) by computing the solution to

$$\operatorname*{arg\,min}_{A} \; \|\bar{X} - A M\|^2 + \lambda_A \|A\|^2. \tag{2.10}$$

The advantage of the optimization problem in equation (2.10) over equation (2.8) is that it has the form of a Tikhonov regularization problem, which has the closed-form solution (Boyd and Vandenberghe, 2004, Section 6.3.2)

$$A = \bar{X} M^T \left(M M^T + \lambda_A I\right)^{-1}$$
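A minimal sketch of this closed-form solution, assuming generic dense inputs: the function name `solve_tikhonov` and the random test matrices are hypothetical. It solves the linear system instead of forming the inverse explicitly, and checks that the gradient of the regularized objective vanishes at the result.

```python
import numpy as np

def solve_tikhonov(X_bar, M, lam):
    """Minimize ||X_bar - A M||_F^2 + lam * ||A||_F^2 over A."""
    r = M.shape[0]
    G = M @ M.T + lam * np.eye(r)  # symmetric r x r matrix
    # A G = X_bar M^T is equivalent to G A^T = M X_bar^T because G is symmetric.
    return np.linalg.solve(G, M @ X_bar.T).T

rng = np.random.default_rng(1)
n, r, p = 5, 3, 40
X_bar = rng.standard_normal((n, p))
M = rng.standard_normal((r, p))
lam = 0.1

A_new = solve_tikhonov(X_bar, M, lam)

# Gradient of the objective with respect to A, evaluated at the solution.
grad = 2.0 * (A_new @ M - X_bar) @ M.T + 2.0 * lam * A_new
print(np.max(np.abs(grad)))  # close to zero up to rounding error
```

For lam > 0 the matrix G is positive definite, so the system always has a unique solution; this is the numerical-stability benefit of regularization in normal-equation-based updates.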

However, computing MMᵀ directly is not practical, other than for very small data sets, since M is a dense matrix of size r × 2mn. Fortunately, both terms X̄Mᵀ and MMᵀ can be reduced significantly using properties of the Kronecker product and block-partitioned matrices, such that an update for A can be computed by

$$A \leftarrow \left[\sum_{k=1}^{m} X_k A R_k^T + X_k^T A R_k\right] \left[\sum_{k=1}^{m} \left(B_k + C_k\right) + \lambda_A I\right]^{-1} \tag{2.11}$$

where

$$B_k = R_k A^T A R_k^T, \qquad C_k = R_k^T A^T A R_k.$$

The derivation of equation (2.11) from properties of the Kronecker product and block partitioned matrices is given in detail in appendix A.2.
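The agreement between the reduced update of equation (2.11) and the direct closed-form solution can be checked on a small example. The following is a sketch under the assumption of dense slices stored as NumPy arrays; `update_A` and `update_A_direct` are hypothetical names, and the direct variant builds M explicitly only for comparison.

```python
import numpy as np

def update_A(X, R, A, lam_A):
    """One update of A following equation (2.11), without forming M."""
    n, r = A.shape
    AtA = A.T @ A
    F = np.zeros((n, r))  # accumulates X_k A R_k^T + X_k^T A R_k
    E = np.zeros((r, r))  # accumulates B_k + C_k
    for Xk, Rk in zip(X, R):
        F += Xk @ A @ Rk.T + Xk.T @ A @ Rk
        E += Rk @ AtA @ Rk.T + Rk.T @ AtA @ Rk
    E += lam_A * np.eye(r)
    # E is symmetric, so A_new E = F is solved as E A_new^T = F^T.
    return np.linalg.solve(E, F.T).T

def update_A_direct(X, R, A, lam_A):
    """Same update via the stacked matrices and the dense r x 2mn matrix M."""
    m, r = len(X), A.shape[1]
    X_bar = np.hstack([B for Xk in X for B in (Xk, Xk.T)])
    R_bar = np.hstack([B for Rk in R for B in (Rk, Rk.T)])
    M = R_bar @ np.kron(np.eye(2 * m), A).T
    return X_bar @ M.T @ np.linalg.inv(M @ M.T + lam_A * np.eye(r))

rng = np.random.default_rng(2)
n, r, m = 6, 3, 4
X = rng.standard_normal((m, n, n))
R = rng.standard_normal((m, r, r))
A = rng.standard_normal((n, r))

A_fast = update_A(X, R, A, 0.5)
A_slow = update_A_direct(X, R, A, 0.5)
print(np.max(np.abs(A_fast - A_slow)))  # difference between the two variants
```

The reduced variant only ever handles n × r and r × r intermediates per slice, whereas the direct variant has to materialize a 2mn × 2mr Kronecker product.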

Updates for R

To update the core tensor R, we keep the matrix A fixed and seek the solution to

$$\operatorname*{arg\,min}_{R} \; \|X - R \times_1 A \times_2 A\|^2 + \lambda_R \|R\|^2. \tag{2.12}$$

Due to the T-2 structure of R, equation (2.12) can be written equivalently in matrix notation as

$$\operatorname*{arg\,min}_{R} \; \sum_{k=1}^{m} \|X_k - A R_k A^T\|^2 + \lambda_R \|R_k\|^2, \tag{2.13}$$


In document Nickel, Maximilian (2013): Tensor factorization for relational learning. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 62-65)