2. A Three-Way Model for Relational Learning
2.5. Computing the Factorization
Recall from section 2.2 that to compute the RESCAL factorization we seek the solution to the optimization problem
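\[
\underset{A,\,\mathcal{R}}{\arg\min} \; \|\mathcal{X} - \mathcal{R} \times_1 A \times_2 A\|^2 + \lambda_A \|A\|^2 + \lambda_R \|\mathcal{R}\|^2
\]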
The dimensionality of the latent space $r \in \{r \in \mathbb{N} \mid 0 < r < n\}$, as well as the regularization parameters $\lambda_A \geq 0$ and $\lambda_R \geq 0$, are assumed to be known. In practice, we will use cross-validation to find the optimal settings for these parameters. Local optima of the optimization problem in equation (2.5) can be computed by taking the partial derivatives of its objective function with respect to $A$ and $\mathcal{R}$ and using any gradient-based non-linear optimization algorithm. However, to improve the computational efficiency of the algorithm, we exploit the connections between RESCAL and DEDICOM and adapt techniques from the very efficient Alternating Simultaneous Approximation, Least Squares and Newton (ASALSAN) algorithm by Bader et al. (2007) to compute the RESCAL model.
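To make the objective concrete, the following is a minimal NumPy sketch of how the regularized loss could be evaluated. It is an illustration only: the function name `objective`, the storage layout (frontal slices stacked along the first array axis) and the random test data are assumptions made for this example and are not taken from the text.

```python
import numpy as np

def objective(X, A, R, lam_A, lam_R):
    """Regularized least-squares loss, evaluated via the frontal slices.

    Assumed layout: X has shape (m, n, n) with X[k] = X_k,
    A has shape (n, r), R has shape (m, r, r) with R[k] = R_k.
    """
    # The k-th frontal slice of R x_1 A x_2 A is A R_k A^T.
    recon = np.einsum("ia,kab,jb->kij", A, R, A)
    fit = np.sum((X - recon) ** 2)
    return fit + lam_A * np.sum(A ** 2) + lam_R * np.sum(R ** 2)

# Small synthetic example (hypothetical sizes).
rng = np.random.default_rng(0)
n, m, r = 6, 3, 2
A_true = rng.standard_normal((n, r))
R_true = rng.standard_normal((m, r, r))
X = np.einsum("ia,kab,jb->kij", A_true, R_true, A_true)

loss_exact = objective(X, A_true, R_true, 0.0, 0.0)  # data fit only
loss_reg = objective(X, A_true, R_true, 0.1, 0.1)    # fit plus penalties
```

With the generating factors and no regularization the fit term vanishes, so any positive value of `loss_reg` stems from the two penalty terms alone.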
ASALSAN is based on the alternating least-squares (ALS) method, which is also the standard method to compute tensor factorizations such as CP or Tucker for a least-squares loss function (Acar and Yener, 2009; Kolda and Bader, 2009). ALS is a block coordinate descent (BCD) method: the variables of an optimization problem are partitioned into disjoint blocks, and the objective function is optimized via alternating updates of these variable blocks until a convergence criterion is met (Sra et al., 2012). For RESCAL, the optimization variables are already partitioned naturally into the core tensor and the factor matrices, such that we alternately keep one of the factors $A$ and $\mathcal{R}$ fixed and compute the update for the remaining factor. In each of these update steps, we will use the method of normal equations to compute closed-form solutions for the optimization subproblems. In this context, regularization also serves another purpose, as it increases the numerical stability of normal-equation-based algorithms (Neumaier, 1998). In the following, we will refer to this algorithm as RESCAL-ALS.
The updates for the latent factors $A$ and $\mathcal{R}$ in RESCAL-ALS are computed as follows:
Updates for $A$
To compute updates for the factor matrix $A$, we keep $\mathcal{R}$ fixed and seek the solution to the optimization problem
\[
\underset{A}{\arg\min} \; \|\mathcal{X} - \mathcal{R} \times_1 A \times_2 A\|^2 + \lambda_A \|A\|^2. \tag{2.8}
\]
Since the variable $A$ appears twice in equation (2.8), this optimization problem cannot be reduced to linear regression, and it would be difficult to derive scalable closed-form updates for it. For this reason, we take an approach similar to ASALSAN and use an approximate procedure, which is based on the idea of considering the unfoldings of the first and second mode of $\mathcal{X}$ simultaneously. To motivate this approach, we rewrite equation (2.8) as the constrained optimization problem
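\[
\underset{A}{\arg\min} \; \|\mathcal{X} - \mathcal{R} \times_1 A_\ell \times_2 A_r\|^2 + \lambda_A \|A_\ell\|^2 \quad \text{subject to } A_\ell = A_r, \tag{2.9}
\]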
where $A_\ell$ corresponds to the left-hand $A$ in equation (2.8) and $A_r$ to the right-hand $A$. It can be seen from the equivalence
\[
\|\mathcal{X} - \mathcal{R} \times_1 A_\ell \times_2 A_r\|^2 = \|X_{(1)} - A_\ell R_{(1)} (I \otimes A_r)^T\|^2 = \|X_{(2)} - A_r R_{(2)} (I \otimes A_\ell)^T\|^2
\]
that $A_\ell$ appears as the left-hand factor for the unfolding of the first mode, while $A_r$ appears as the left-hand factor for the unfolding of the second mode. Furthermore, it holds for both tensors $\mathcal{X}$ and $\mathcal{R}$ that
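\[
\begin{aligned}
X_{(1)} &= [\, X_1 \;\cdots\; X_m \,], & R_{(1)} &= [\, R_1 \;\cdots\; R_m \,], \\
X_{(2)} &= [\, X_1^T \;\cdots\; X_m^T \,], & R_{(2)} &= [\, R_1^T \;\cdots\; R_m^T \,],
\end{aligned}
\]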
since they are third-order tensors with square frontal slices. Now, to approximate equation (2.9), the idea is to stack $X_{(1)}$ and $X_{(2)}$ as well as $R_{(1)}$ and $R_{(2)}$ side by side and to solve only for the left-hand factor, while keeping the right-hand factor constant. This way, the information of both unfoldings is included in an update of $A$, but the optimization problem can be reduced to simple linear regression. Therefore, to compute an update of $A$, we set
\[
\bar{X} = [\, X_1 \; X_1^T \;\cdots\; X_m \; X_m^T \,], \qquad \bar{R} = [\, R_1 \; R_1^T \;\cdots\; R_m \; R_m^T \,].
\]
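The unfoldings and the stacked matrices above can be checked numerically on a small random example. The sketch below is only an illustration: the helper names `unfold1`, `unfold2` and `stack`, the array layout (frontal slices along the first axis) and the problem sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 5, 3, 2                      # hypothetical sizes
X = rng.standard_normal((m, n, n))     # X[k] is the frontal slice X_k
R = rng.standard_normal((m, r, r))     # R[k] is the frontal slice R_k
A_l = rng.standard_normal((n, r))      # left-hand factor
A_r = rng.standard_normal((n, r))      # right-hand factor

def unfold1(T):
    """Mode-1 unfolding: frontal slices side by side."""
    return np.hstack([T_k for T_k in T])

def unfold2(T):
    """Mode-2 unfolding: transposed frontal slices side by side."""
    return np.hstack([T_k.T for T_k in T])

def stack(T):
    """Interleave each slice with its transpose: [T_1 T_1^T ... T_m T_m^T]."""
    return np.hstack([B for T_k in T for B in (T_k, T_k.T)])

I_m = np.eye(m)
# Residual in tensor form: the k-th slice of the model is A_l R_k A_r^T.
res_tensor = np.sum((X - np.einsum("ia,kab,jb->kij", A_l, R, A_r)) ** 2)
# The same residual via the mode-1 and mode-2 unfoldings.
res_mode1 = np.sum((unfold1(X) - A_l @ unfold1(R) @ np.kron(I_m, A_r).T) ** 2)
res_mode2 = np.sum((unfold2(X) - A_r @ unfold2(R) @ np.kron(I_m, A_l).T) ** 2)

X_bar, R_bar = stack(X), stack(R)      # shapes (n, 2mn) and (r, 2mr)
```

All three residuals agree up to floating-point error, which is the equivalence used to motivate the stacked formulation.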
Furthermore, let $M = \bar{R}(I_{2m} \otimes A)^T$, where the factor $A$ inside $M$ is held constant. Then, we approximate equation (2.8) by computing the solution to
\[
\underset{A}{\arg\min} \; \|\bar{X} - A M\|^2 + \lambda_A \|A\|^2. \tag{2.10}
\]
The advantage of the optimization problem in equation (2.10) over equation (2.8) is that it has the form of a Tikhonov regularization problem, which has the closed-form solution (Boyd and Vandenberghe, 2004, Section 6.3.2)
\[
A = \bar{X} M^T (M M^T + \lambda_A I)^{-1}.
\]
However, computing $M M^T$ directly is not practical, other than for very small data sets, since $M$ is a dense matrix of size $r \times 2mn$. Fortunately, both terms $\bar{X} M^T$ and $M M^T$ can be reduced significantly using properties of the Kronecker product and block partitioned matrices, such that an update for $A$ can be computed by
\[
A \leftarrow \left[ \sum_{k=1}^{m} X_k A R_k^T + X_k^T A R_k \right] \left[ \sum_{k=1}^{m} \left( B_k + C_k \right) + \lambda_A I \right]^{-1} \tag{2.11}
\]
where
\[
B_k = R_k A^T A R_k^T, \qquad C_k = R_k^T A^T A R_k.
\]
The derivation of equation (2.11) from properties of the Kronecker product and block partitioned matrices is given in detail in appendix A.2.
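As a sanity check of equation (2.11), the following NumPy sketch compares the slice-wise update with the direct closed-form solution of equation (2.10) on a small random problem. The function names `update_A` and `update_A_naive`, the array layout (frontal slices along the first axis) and the problem sizes are assumptions made for illustration; this is a sketch, not the reference implementation.

```python
import numpy as np

def update_A(X, A, R, lam_A):
    """One update of A following equation (2.11), never forming M."""
    r = A.shape[1]
    AtA = A.T @ A
    num = np.zeros(A.shape)            # accumulates X_k A R_k^T + X_k^T A R_k
    den = lam_A * np.eye(r)            # accumulates B_k + C_k, plus lam_A * I
    for X_k, R_k in zip(X, R):
        num += X_k @ A @ R_k.T + X_k.T @ A @ R_k
        den += R_k @ AtA @ R_k.T + R_k.T @ AtA @ R_k
    # den is symmetric, so num @ inv(den) equals solve(den, num.T).T
    return np.linalg.solve(den, num.T).T

def update_A_naive(X, A, R, lam_A):
    """The same update via the dense matrix M; feasible for tiny problems only."""
    m, r = len(X), A.shape[1]
    X_bar = np.hstack([B for X_k in X for B in (X_k, X_k.T)])
    R_bar = np.hstack([B for R_k in R for B in (R_k, R_k.T)])
    M = R_bar @ np.kron(np.eye(2 * m), A).T      # shape (r, 2mn)
    return X_bar @ M.T @ np.linalg.inv(M @ M.T + lam_A * np.eye(r))

rng = np.random.default_rng(2)
n, m, r = 5, 3, 2                      # hypothetical sizes
X = rng.standard_normal((m, n, n))
A = rng.standard_normal((n, r))
R = rng.standard_normal((m, r, r))

A_fast = update_A(X, A, R, lam_A=0.5)
A_naive = update_A_naive(X, A, R, lam_A=0.5)
```

The slice-wise version only works with matrices of size $n \times r$ and $r \times r$ in addition to the data slices, whereas the naive version materializes $M$ with $2mn$ columns.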
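Updates for $\mathcal{R}$
To update the core tensor $\mathcal{R}$, we keep the matrix $A$ fixed and seek the solution to
\[
\underset{\mathcal{R}}{\arg\min} \; \|\mathcal{X} - \mathcal{R} \times_1 A \times_2 A\|^2 + \lambda_R \|\mathcal{R}\|^2. \tag{2.12}
\]
Due to the Tucker-2 structure of RESCAL, equation (2.12) can be written equivalently in matrix notation as
\[
\underset{\mathcal{R}}{\arg\min} \; \sum_{k=1}^{m} \left( \|X_k - A R_k A^T\|^2 + \lambda_R \|R_k\|^2 \right), \tag{2.13}
\]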