Conclusion - Optimization Algorithms for Machine Learning Designed for Parallel and Distributed

This study proposes a parallel framework for the SINCO algorithm originally introduced in Scheinberg and Rish [2010]. The main motivation for this algorithm is its simplicity of implementation and its potential to be massively parallelized. The new approach to data distribution provides a more efficient implementation of the parallel framework. Further, the SINCO2D algorithm is introduced, in which the rank-2 updating steps are strengthened by taking into account the structure of semidefinite programs. Numerical examples verify that the new algorithm improves on the convergence rate of SINCO and, from a practical standpoint, can achieve more accurate solutions than other powerful packages such as QUIC.

Chapter 3 Inverse Covariance Selection: Parallel Graphical Lasso

3.1 Introduction

In this chapter, we derive another parallel algorithm for the SICS problem, based on a different coordinate descent algorithm. In the previous chapter, we considered the single-coordinate (SINCO) and augmented (SINCO2D) coordinate descent algorithms for this problem. However, both consider only a handful of individual coordinates. In this chapter, we study the Graphical Lasso (GLasso), a block coordinate descent algorithm that updates one row (and one column) of the variable matrix at a time and that was originally proposed by Banerjee et al. [2008]. The algorithm was further developed by Friedman et al. [2008], who propose solving the dual of the SICS problem

$$\max_{X \succ 0} \; \log\det X - \langle S, X \rangle - \lambda \|X\|_1. \tag{3.1}$$

In what follows, we first describe the derivation of the dual problem and then consider the block coordinate descent GLasso algorithm developed by Friedman et al. [2008]. The row subproblems considered in GLasso are in fact Lasso (Tibshirani [1996]) problems. We study different methods of distributing the data in order to develop a parallel solver based on GLasso. The data distribution that minimizes communication in the parallel setting is chosen, and a reformulation of the Lasso problem is then proposed to minimize the computational effort under this data distribution.

The reformulated Lasso problem makes it easier to apply the parallel method for solving the Lasso problem studied by Richtárik and Takáč [2013, 2016]. Therefore, at each outer iteration of the algorithm, all subproblems are solved using a parallel Lasso solver. This algorithm becomes essential whenever the matrices are too large to fit in the memory of a single worker.

3.1.1 Analysis of SICS problem

To introduce the Lasso approach to the SICS problem, we restate the formulation and then motivate the use of the dual model in the derivation of GLasso. Recall that the maximum likelihood estimator, augmented with a regularization term, was defined as

$$\max_{X \succ 0} \; \log\det X - \langle X, S \rangle - \lambda \|X\|_1.$$

As a way to reformulate this problem in a nicer form, we can eliminate the $\ell_1$-norm penalty. Since the problem is a maximization, each nonzero element $x_{ij}$ of the matrix $X$ contributes $-\lambda|x_{ij}|$ to the objective value. Another way to model this behavior is through an auxiliary (implicit) variable $U$.

Consider the inner product $\langle X, U \rangle$. For a given $X$, we can write another mathematical model, with $U$ as the decision variable, that gathers the contributions of the regularization term:

$$\min_{U} \; -\langle X, U \rangle \quad \text{s.t.} \quad -\lambda \le u_{ij} \le \lambda, \;\; \forall i, j.$$

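As a result, for a given $X$, if $x_{ij} < 0$ the optimal corresponding element of $U$ would be $u_{ij} = -\lambda$, and if $x_{ij} > 0$ it would be $u_{ij} = \lambda$. In both cases the term $-x_{ij} u_{ij}$ equals $-\lambda |x_{ij}|$, so the optimal objective value of this subproblem will be equal to the regularization term. As a result, the original problem can theoretically be written as a two-level optimization problem: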

$$\max_{X \succ 0} \; \min_{-\lambda \le u_{ij} \le \lambda} \; \log\det X - \langle X, S \rangle - \langle X, U \rangle. \tag{3.2}$$

The bound constraint can be written succinctly as $\|U\|_\infty \le \lambda$, where $\|\cdot\|_\infty$ denotes the largest absolute value of an element. The next section shows how the dual problem is obtained from this reformulation.
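The claim that the inner subproblem reproduces the regularization term can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `l1_penalty_via_box` and the random test matrix are illustrative choices, not part of the original text. It uses the closed-form minimizer $u_{ij} = \lambda\,\mathrm{sign}(x_{ij})$ and compares the resulting value with $-\lambda\|X\|_1$.

```python
import numpy as np

def l1_penalty_via_box(X, lam):
    """Solve min_U -<X, U> subject to |u_ij| <= lam in closed form.

    The problem separates over entries: each term -x_ij * u_ij is linear
    in u_ij, so a minimizer sits at the bound with the same sign as x_ij.
    """
    U = lam * np.sign(X)      # u_ij = lam if x_ij > 0, -lam if x_ij < 0
    value = -np.sum(X * U)    # -<X, U> = -lam * sum_ij |x_ij|
    return U, value

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T + np.eye(4)       # an arbitrary symmetric positive definite X
lam = 0.3

U_opt, inner_value = l1_penalty_via_box(X, lam)
penalty = -lam * np.abs(X).sum()   # the regularization term -lam * ||X||_1

# Any other feasible U should give a value that is at least as large.
U_rand = rng.uniform(-lam, lam, size=X.shape)
random_value = -np.sum(X * U_rand)
```

Here `inner_value` and `penalty` agree up to rounding, and `random_value` is never smaller than `inner_value`, which is the behavior that (3.2) relies on.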

3.1.2 Dual of SICS problem

Banerjee et al. [2008] derived the dual for the SICS problem. By rewriting the problem as

$$\max_{X \succ 0} \; \min_{\|U\|_\infty \le \lambda} \; \log\det X - \langle X, S + U \rangle, \tag{3.3}$$

they show that interchanging the inner and outer problems yields the dual of the SICS problem:

$$\min_{\|U\|_\infty \le \lambda} \; \max_{X \succ 0} \; \log\det X - \langle X, S + U \rangle. \tag{3.4}$$

We derive $X = (S+U)^{-1}$ as the solution of the inner problem by solving it analytically using the first-order condition, resulting in

$$\min_{\|U\|_\infty \le \lambda} \; -\log\det(S+U) - p. \tag{3.5}$$

Letting $W = S + U$, we have:

$$\max_{W} \; \log\det W \quad \text{s.t.} \quad \|W - S\|_\infty \le \lambda.$$
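The step from (3.4) to (3.5) can be checked numerically. The sketch below, assuming NumPy, fixes one feasible $U$, evaluates the inner objective at $X = (S+U)^{-1}$, and compares it with $-\log\det(S+U) - p$. The helper name `inner_objective`, the synthetic matrix `S`, and the particular choice $U = \lambda I$ are assumptions made only for illustration.

```python
import numpy as np

def inner_objective(X, M):
    """Evaluate log det X - <X, M> for positive definite X, with M = S + U."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        raise ValueError("X must be positive definite")
    return logdet - np.sum(X * M)

rng = np.random.default_rng(1)
p = 5
B = rng.standard_normal((p, 2 * p))
S = B @ B.T / (2 * p)            # synthetic sample-covariance-like matrix
lam = 0.1
U = lam * np.eye(p)              # one feasible U: every |u_ij| <= lam
M = S + U

X_star = np.linalg.inv(M)        # first-order condition: X^{-1} - M = 0
closed_form = -np.linalg.slogdet(M)[1] - p
value_at_X_star = inner_objective(X_star, M)

# The inner objective is concave in X, so a small symmetric perturbation
# of X_star should not increase it.
D = rng.standard_normal((p, p))
D = 1e-3 * (D + D.T)
value_perturbed = inner_objective(X_star + D, M)
```

The value at `X_star` matches `closed_form` because $\langle (S+U)^{-1}, S+U \rangle = \operatorname{tr}(I) = p$.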

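Friedman et al. [2008] consider a block coordinate descent framework for this problem, which results in the subproblems having a special form. Let us decompose the matrix $W$ by its first row and column:

$$W = \begin{pmatrix} w_{11} & W_1^T \\ W_1 & W_{\backslash 1 \backslash 1} \end{pmatrix},$$

where $w_{11}$ is a scalar, $W_1$ is the first column of $W$ without the entry $w_{11}$, and $W_{\backslash 1 \backslash 1}$ is $W$ with its first row and column removed. Treating $W_{\backslash 1 \backslash 1}$ as a constant, and noting the determinant identity obtained from the Schur complement, we have

$$\det W = \det(W_{\backslash 1 \backslash 1}) \left( w_{11} - W_1^T (W_{\backslash 1 \backslash 1})^{-1} W_1 \right). \tag{3.6}$$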
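The determinant identity (3.6) is easy to confirm on a small example. The following sketch, assuming NumPy and a randomly generated positive definite test matrix (both illustrative choices), computes the determinant directly and through the Schur complement of the lower-right block.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
B = rng.standard_normal((p, p))
W = B @ B.T + p * np.eye(p)     # symmetric positive definite test matrix

w11 = W[0, 0]                   # scalar top-left entry
W1 = W[1:, 0]                   # first column without w11
W_rest = W[1:, 1:]              # W with its first row and column removed

# Schur complement w11 - W1^T (W_rest)^{-1} W1, via a linear solve
# rather than an explicit inverse.
schur = w11 - W1 @ np.linalg.solve(W_rest, W1)

det_direct = np.linalg.det(W)
det_schur = np.linalg.det(W_rest) * schur
```

Since `W_rest` is held fixed in the block update, maximizing $\log\det W$ over the first row and column amounts to maximizing the scalar `schur`.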

We can solve the resulting optimization problem over the first row as follows:

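$$\min_{\beta} \; \beta^T (W_{\backslash 1 \backslash 1})^{-1} \beta \quad \text{s.t.} \quad \|\beta - S_1\|_\infty \le \lambda,$$

where $\beta$ stands for $W_1$ and $S_1$ denotes the first column of $S$ without its diagonal entry.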

Friedman et al. [2008] point out that this problem is similar to the dual of a Lasso problem and propose an algorithm known as the Graphical Lasso, which solves the SICS problem by solving a Lasso problem at each iteration using coordinate descent.
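To make the connection concrete, the sketch below solves one row subproblem through a Lasso formulation. It assumes the Lasso counterpart can be written as $\min_\theta \tfrac{1}{2}\theta^T V \theta - s^T \theta + \lambda\|\theta\|_1$ with $V = W_{\backslash 1 \backslash 1}$ and $s = S_1$, from which $\beta = V\theta$ is recovered. The names `soft_threshold`, `lasso_row_subproblem`, `V`, `s`, and the synthetic data are illustrative, and this serial loop is not the parallel solver developed in this chapter.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_row_subproblem(V, s, lam, n_sweeps=500, tol=1e-10):
    """Cyclic coordinate descent for
    min_theta 0.5 * theta^T V theta - s^T theta + lam * ||theta||_1.

    V stands for W with its first row and column removed, s for the
    matching part of the first column of S (names chosen for this sketch).
    Returns theta and beta = V @ theta.
    """
    theta = np.zeros(len(s))
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(len(s)):
            # s_j minus the contribution of all coordinates except j
            r = s[j] - V[j] @ theta + V[j, j] * theta[j]
            new = soft_threshold(r, lam) / V[j, j]
            max_change = max(max_change, abs(new - theta[j]))
            theta[j] = new
        if max_change < tol:
            break
    return theta, V @ theta

rng = np.random.default_rng(3)
p = 6
B = rng.standard_normal((p, 3 * p))
S = B @ B.T / (3 * p)            # synthetic sample covariance
lam = 0.2
W = S + lam * np.eye(p)          # a positive definite stand-in for the current W
V, s = W[1:, 1:], S[1:, 0]

theta, beta = lasso_row_subproblem(V, s, lam)
box_violation = np.max(np.abs(beta - s)) - lam   # <= 0 when beta is feasible
```

The optimality conditions of the Lasso give $V\theta - s = -\lambda z$ with $|z_j| \le 1$, so the recovered $\beta$ satisfies $\|\beta - S_1\|_\infty \le \lambda$ up to the solver tolerance, which is exactly the constraint of the row subproblem.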
