2.4 Related work
2.4.1 Fixed-rank matrix factorization with missing data
Low-rank matrix factorization with missing data arises in a range of computer vision and machine learning applications: bundle adjustment using affine cameras [Tomasi and Kanade, 1992], non-rigid structure-from-motion using basis shapes or point trajectory basis functions [Bregler et al., 2000], photometric stereo assuming ambient light and Lambertian surfaces [Belhumeur and Kriegman, 1996] and recommender systems [Bennett and Lanning, 2007], to name a few.
Over the last two decades, the computer vision and machine learning communities have seen a plethora of low-rank matrix factorization algorithms [Boumal and Absil, 2011; Buchanan and Fitzgibbon, 2005; Cabral et al., 2013; Chen, 2008; Del Bue et al., 2012; Kennedy et al., 2016; Okatani and Deguchi, 2007; Okatani et al., 2011; Vidal and Hartley, 2004]. Many of those algorithms were based on the space-efficient alternating least squares (ALS) algorithm, which has extremely poor convergence properties. Buchanan and Fitzgibbon [2005] introduced damping with a damped Newton algorithm, but again ignored the Wiberg algorithm [Wiberg, 1976]. Okatani and Deguchi [2007] reconsidered Wiberg, showing its strong convergence properties, and Okatani et al. [2011] then combined damping and Wiberg to boost convergence rates to near 100% on some previously difficult problems. At the same time, the column space fitting (CSF) algorithm of Gotardo and Martinez [2011] showed similar improvements.
Although the derivations of the individual successful algorithms [Chen, 2008; Gotardo and Martinez, 2011; Okatani et al., 2011] may have been inspired by different sources, they are all related to the approach dubbed variable projection (VarPro) [Golub and Pereyra, 1973], in which the problem is solved by eliminating one matrix, the one of larger dimension, from the original objective function and using a second-order optimizer such as Levenberg-Marquardt [Levenberg, 1944; Marquardt, 1963]. This observation forms the basis of the unifications made in Chapter 3.
Ruhe and Wedin [1980] considered VarPro in the Gauss-Newton scenario and presented three numbered algorithms, abbreviated here as RW1, RW2 and RW3. RW1 is the Gauss-Newton solver on the reduced objective (see Section 2.1). RW2, like Wiberg, approximates the Jacobian used in RW1 by eliminating a term. RW3 is alternation (ALS). Whether the full Gauss-Newton matrix or its approximation is better is still debated [O’Leary and Rust, 2013], and the question has been explored previously in the context of matrix factorization [Chen, 2008; Gotardo and Martinez, 2011]. Chapter 3 extends these comparisons.
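As a point of reference for the alternation scheme (RW3/ALS) mentioned above, the following is a minimal sketch rather than any of the cited implementations. It assumes the data matrix M (m × n) is approximated by U Vᵀ over the entries flagged in a boolean mask W; the names `solve_rows`, `als`, `M`, `W`, `U` and `V` are illustrative choices and not notation taken from this dissertation.

```python
import numpy as np

def solve_rows(A, M, W):
    """For each column j of M, least-squares fit A[obs] @ x ~ M[obs, j] over observed rows."""
    X = np.zeros((M.shape[1], A.shape[1]))
    for j in range(M.shape[1]):
        obs = W[:, j]  # boolean mask of the observed entries in column j
        X[j], *_ = np.linalg.lstsq(A[obs], M[obs, j], rcond=None)
    return X

def als(M, W, r, iters=50, seed=0):
    """Alternate exact least-squares updates of V (U fixed) and U (V fixed)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((M.shape[0], r))  # random starting point (assumption)
    costs = []
    for _ in range(iters):
        V = solve_rows(U, M, W)        # fix U, update V
        U = solve_rows(V, M.T, W.T)    # fix V, update U
        R = (U @ V.T - M)[W]           # residuals on observed entries only
        costs.append(float(R @ R))
    return U, V, costs

# Small synthetic rank-2 problem with roughly 20% of the entries missing.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 8))
W = rng.random((10, 8)) < 0.8
U_est, V_est, costs = als(M, W, r=2, iters=30)
```

Each half-step is an exact linear least-squares solve, so the recorded cost cannot increase from one iteration to the next; this says nothing about how quickly, or to which optimum, the iteration converges.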
2.4.2 Variable projection (VarPro)
Variable projection (VarPro) was first proposed by Golub and Pereyra [1973] for separable nonlinear least squares problems, and was applied to principal components analysis (i.e. matrix factorization) by Wiberg [1976]. In short, it applies a second-order optimizer such as Levenberg-Marquardt [Levenberg, 1944; Marquardt, 1963] to a reduced objective, which is obtained by optimally eliminating (or projecting out) one set of the unknowns. It is especially applicable to factorization problems, since in these problem instances one of the unknown factors can be eliminated in closed form.
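The closed-form elimination described above can be illustrated with a short sketch. It is written under stated assumptions and is not the implementation developed in this dissertation: the data matrix M is approximated by U Vᵀ on the entries flagged in a boolean mask W, V is the factor being eliminated, and the function names `eliminate_v`, `joint_cost` and `reduced_cost` are hypothetical.

```python
import numpy as np

def eliminate_v(U, M, W):
    """Closed-form V*(U): one small linear least-squares solve per column of M."""
    V = np.zeros((M.shape[1], U.shape[1]))
    for j in range(M.shape[1]):
        obs = W[:, j]  # observed entries of column j
        V[j], *_ = np.linalg.lstsq(U[obs], M[obs, j], rcond=None)
    return V

def joint_cost(U, V, M, W):
    """Sum of squared residuals over observed entries, as a function of both factors."""
    R = (U @ V.T - M)[W]
    return float(R @ R)

def reduced_cost(U, M, W):
    """Reduced objective: the joint cost evaluated at the optimal V for this U."""
    return joint_cost(U, eliminate_v(U, M, W), M, W)

# Synthetic, noise-free rank-2 data with missing entries.
rng = np.random.default_rng(0)
U_true = rng.standard_normal((8, 2))
M = U_true @ rng.standard_normal((2, 6))
W = rng.random((8, 6)) < 0.7
U0 = rng.standard_normal((8, 2))   # arbitrary point in U
V0 = rng.standard_normal((6, 2))   # arbitrary (non-optimal) V
```

A second-order optimizer would then be run on `reduced_cost` as a function of U alone. By construction, `reduced_cost(U0, M, W)` is never larger than `joint_cost(U0, V0, M, W)` for any `V0`, and it vanishes (up to rounding) at `U_true` because the data are noise-free.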
Several works [Gotardo and Martinez, 2011; Okatani et al., 2011; O’Leary and Rust, 2013] (and Chapters 3 and 4 of this dissertation) have experimentally demonstrated that VarPro applied to matrix factorization problems is much more likely to reach a better optimum than a second-order method applied to the full problem (joint optimization, i.e. without eliminating one set of unknowns). Golub and Pereyra [2002] also noted that VarPro for this type of problem not only reduces the number of parameters but also decreases the number of iterations required for convergence. Yet the reasons for this difference in algorithmic behaviour have not been carefully analyzed in the literature, and VarPro is mostly used as a black-box tool. O’Leary and Rust [2013] stated that the reduction in problem dimension could be a reason for the improved efficiency, and that a potential reduction in the number of local minima could allow a better chance of global convergence.
In contrast, Zollhöfer et al. [2014] argued that employing joint optimization may yield better results than using variable projection for lifted robust optimization problems (see Section A.2.4), which have model parameters (θ) and relaxed robust kernel weights (w) as optimization variables [Zach, 2014]. Their conclusion is based on a robust 2D line fitting example incorporating a truncated quadratic robust kernel, which assigns a fixed cost to a residual when its norm is greater than the inlier radius. In their VarPro implementation,
the robust weights (w) were optimally eliminated given the model parameters (θ), and thus VarPro’s inferior performance arises from the optimal weights w∗(θ) having zero gradients (dw∗/dθ ≈ 0) when the residual is large. Since bad initial model parameters lead to large residuals, and therefore to zero weights and gradients, VarPro falls into a local minimum from which it cannot recover. When jointly optimizing over θ and w, on the other hand, Zollhöfer et al. [2014] initialized each robust weight to 1, giving each residual an initial “opportunity” to participate as an inlier. If one instead optimally eliminates the model parameters (i.e. obtains θ∗(w)) when using VarPro and sets the initial weights to w = 1, it can be shown that joint optimization and variable projection converge to the global optimum with similar probabilities. Alternatively, one may choose a robust kernel with non-zero gradients, such as the Huber kernel [Huber, 1964], which may allow VarPro to escape a bad initial local minimum. Designing a gold-standard algorithm for solving robust optimization problems remains a challenge to this date.
Several papers have pointed out structural similarities between joint optimization and VarPro. Ruhe and Wedin [1980] and Okatani et al. [2011] pointed out the similarity between the update equations of VarPro and joint optimization, but this was confined to the Gauss-Newton algorithm, where no damping is present. Strelow [2012b] pointed out that VarPro performs additional minimization over the eliminated parameters. The Ceres solver [Agarwal et al., 2014], a widely used nonlinear optimization library, assumes the same. Chapter 4 shows that these are not exactly performing VarPro, and that the removal of damping in some places plays a key role in implementing “pure” VarPro and widening the convergence basin.
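Returning briefly to the robust line-fitting discussion earlier in this section, the zero-gradient behaviour of the truncated quadratic kernel can be seen directly from the kernels themselves. The sketch below is illustrative only: it does not reproduce the lifted formulation of Zach [2014] or the experiment of Zollhöfer et al. [2014], and the particular scaling of both kernels (cost r² inside the inlier radius `tau`) is an assumption made for this example.

```python
import numpy as np

def truncated_quadratic(r, tau):
    """Cost r^2 inside the inlier radius tau, constant tau^2 outside."""
    return np.minimum(r**2, tau**2)

def truncated_quadratic_grad(r, tau):
    """Derivative w.r.t. r: exactly zero once |r| exceeds tau."""
    return np.where(np.abs(r) < tau, 2.0 * r, 0.0)

def huber(r, tau):
    """Quadratic inside tau, linear outside, continuous with continuous slope at |r| = tau."""
    a = np.abs(r)
    return np.where(a <= tau, r**2, 2.0 * tau * a - tau**2)

def huber_grad(r, tau):
    """Derivative w.r.t. r: constant magnitude 2*tau outside the inlier radius."""
    return np.where(np.abs(r) <= tau, 2.0 * r, 2.0 * tau * np.sign(r))

r = np.array([0.5, 3.0, -10.0])  # one inlier residual and two large residuals
tau = 1.0
g_trunc = truncated_quadratic_grad(r, tau)
g_huber = huber_grad(r, tau)
```

For the two large residuals, `g_trunc` is zero, so those terms give a gradient-based solver no signal, whereas `g_huber` keeps a non-zero magnitude of `2 * tau` and continues to pull the model towards them.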
VarPro/Wiberg was believed to be comparatively slow and memory-consuming [Cabral et al., 2013; Chen, 2008], which is incorrect, as will be shown in Chapter 4. With regard to scalable implementations of VarPro, RTRMC [Boumal and Absil, 2011] in principle solves the VarPro-reduced problem indirectly, which is essentially what this work also proposes. However, their algorithm is implemented under the assumption that the problem is regularized. This may be suitable for machine learning recommender systems and other random matrices, but it suffers from numerical instability on SfM problems (as will be demonstrated in Chapter 3), where regularization is inappropriate because it effectively imposes unrealistic priors on camera and point parameters (see Section 3). Chapter 4 provides a numerically stable and scalable VarPro algorithm, which is tested and works well on matrix factorization problems of various sizes and densities.