Bundle methods for regularized risk minimization

1.6 Regularized risk estimation and optimizations

1.6.4 Bundle methods for regularized risk minimization

A natural heuristic for stabilization is to penalize the displacement of $w_t$ from $w_{t-1}$:

$$w_t := \operatorname*{argmin}_{w} \; \lambda \|w - w_{t-1}\|^2 + f_t^{cp}(w).$$

This idea is called the proximal bundle method (Kiwiel, 1990), since the cutting planes $\{a_i, b_i\}$ are regarded as a bundle and $w_t$ is attracted to the proximity of $w_{t-1}$. A large volume of work has been done in this area over several decades, e.g., (Kiwiel, 1985) and (Hiriart-Urruty & Lemaréchal, 1993a, Chapters XIII to XV). The underlying idea is Moreau-Yosida regularization (Moreau, 1965; Yosida, 1964), and the method is guaranteed to find an $\epsilon$-approximate solution in $O(1/\epsilon^3)$ steps (Kiwiel, 2000). When the objective function is strongly convex, the convergence rate can be linear under some assumptions (Robinson, 1999). Variants of this idea are also widely used, e.g., the trust region bundle method (Schramm & Zowe, 1992), which upper bounds the displacement instead of penalizing it, and the level set bundle method (Lemaréchal et al., 1995), which minimizes the displacement subject to a level of $f_t^{cp}(w)$.

It is noteworthy that the above methods treat the objective function as a black box which provides function and gradient evaluations at any given location. However, RRM problems are not black boxes, but are explicitly composed of two parts: the empirical risk $R_{emp}$ and the regularizer $\Omega$. The free availability of the regularizer motivated Teo et al. (2007) and Smola et al. (2007b) to apply cutting planes to $R_{emp}$ only, and to use $\Omega$ as the stabilizer. This is called the bundle method for machine learning (BMRM). Different from Moreau-Yosida regularization, where $w_t$ is stabilized about $w_{t-1}$, $\Omega(w)$ usually attracts

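Algorithm 2: Exact inner solver for BMRM (qp-bmrm)

Input: Previous subgradients $\{a_i\}_{i=1}^{t}$ and intercepts $\{b_i\}_{i=1}^{t}$.
1 Assemble $A_t := (a_1, \ldots, a_t)$ and $b_t := (b_1, \ldots, b_t)^\top$.
2 Solve $\alpha_t := \operatorname*{argmax}_{\alpha \in \Delta_t} -\lambda \Omega^*(-\lambda^{-1} A_t \alpha) + \langle \alpha, b_t \rangle$.
3 return $w_t := \partial \Omega^*(-\lambda^{-1} A_t \alpha_t)$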

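Algorithm 3: Inexact line search inner solver for BMRM (ls-bmrm)

Input: Previous subgradients $\{a_i\}_{i=1}^{t}$ and intercepts $\{b_i\}_{i=1}^{t}$.
1 Assemble $A_t := (a_1, \ldots, a_t)$ and $b_t := (b_1, \ldots, b_t)^\top$.
2 Solve $\eta_t := \operatorname*{argmax}_{\eta \in [0,1]} -\lambda \Omega^*(-\lambda^{-1} A_t \alpha_t(\eta)) + \langle \alpha_t(\eta), b_t \rangle$, where $\alpha_t(\eta) := ((1-\eta)\alpha_{t-1}^\top, \eta)^\top$.
3 $\alpha_t \leftarrow ((1-\eta_t)\alpha_{t-1}^\top, \eta_t)^\top$.
4 return $w_t := \partial \Omega^*(-\lambda^{-1} A_t \alpha_t)$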

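entropy regularizer. Technically, BMRM modifies the cutting plane algorithm just by replacing Eq. (1.29) with:

$$w_t := \operatorname*{argmin}_{w \in \operatorname{dom} f} \; \lambda\Omega(w) + R^{cp}_{emp,t}(w) = \operatorname*{argmin}_{w} \; \underbrace{\lambda\Omega(w) + \max_{i \in [t]} \{\langle a_i, w \rangle + b_i\}}_{:= J_t(w)}. \tag{1.31}$$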

We summarize the BMRM algorithm in Algorithm 1.
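To make the objective $J_t(w)$ of Eq. (1.31) concrete, the following is a minimal sketch under stated assumptions. It assumes $\Omega(w) = \frac{1}{2}\|w\|^2$, a toy logistic loss on a single data point, and the usual first-order construction of the intercept, $b_i = R_{emp}(w_i) - \langle a_i, w_i \rangle$. None of these choices is fixed by the text above, and the helper names are hypothetical.

```python
import math

def risk(w):
    """Toy empirical risk (assumed): logistic loss on the point x = (1, 1), y = +1."""
    return math.log1p(math.exp(-(w[0] + w[1])))

def subgrad(w):
    """Gradient of the toy risk; plays the role of the subgradient a_i."""
    s = -1.0 / (1.0 + math.exp(w[0] + w[1]))
    return [s, s]

def cutting_plane(w_i):
    """Return (a_i, b_i), the first-order lower bound of the risk taken at w_i."""
    a = subgrad(w_i)
    b = risk(w_i) - sum(aj * wj for aj, wj in zip(a, w_i))
    return a, b

def J_t(w, planes, lam):
    """lam * Omega(w) + max_i (<a_i, w> + b_i), with Omega(w) = 0.5 * ||w||^2."""
    r_cp = max(sum(aj * wj for aj, wj in zip(a, w)) + b for a, b in planes)
    return 0.5 * lam * sum(wj * wj for wj in w) + r_cp

def J(w, lam):
    """The true regularized risk, for comparison with its lower bound J_t."""
    return 0.5 * lam * sum(wj * wj for wj in w) + risk(w)

points = [[0.0, 0.0], [2.0, 0.0]]
planes = [cutting_plane(w_i) for w_i in points]
```

Because the toy risk is convex, every plane lies below it, so `J_t(w, planes, lam)` never exceeds `J(w, lam)` and matches it at the points where the planes were taken.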

The most expensive steps in BMRM are steps 4 and 7 in Algorithm 1. In step 4, the computation of the subgradient needs to go through the whole dataset, and this admits straightforward data parallelization. In particular, if $R_{emp}$ sums the loss from individual data points, as in the statistical query model (Kearns, 1998), then one can divide the whole dataset into subsets residing on distributed computing devices, compute their contributions to the gradient in parallel, and finally sum them up. This makes BMRM very promising for the coming era when parallel computing is the mainstream.
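The decomposition can be sketched as follows. This is an illustration only: it assumes the hinge loss as the per-point loss, and it computes the shards sequentially in one process rather than on real distributed workers. The function names are hypothetical.

```python
def hinge_subgradient_sum(w, data):
    """Sum of hinge-loss subgradients of max(0, 1 - y * <w, x>) over (x, y) pairs."""
    g = [0.0] * len(w)
    for x, y in data:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        if margin < 1.0:                      # only margin violators contribute
            for j, xj in enumerate(x):
                g[j] -= y * xj
    return g

def sharded_subgradient(w, data, n_shards):
    """Split the data into shards, compute each partial sum, then add them up.

    The per-shard calls are independent of each other, so each one could run
    on a separate device; here they simply run one after another.
    """
    shards = [data[k::n_shards] for k in range(n_shards)]
    partials = [hinge_subgradient_sum(w, shard) for shard in shards]
    total = [sum(column) for column in zip(*partials)]
    return [gj / len(data) for gj in total]

data = [([1.0, 2.0], 1), ([-1.0, 0.5], -1), ([0.5, -1.5], 1),
        ([2.0, 1.0], -1), ([0.0, 1.0], 1)]
w = [0.1, -0.2]
full = [gj / len(data) for gj in hinge_subgradient_sum(w, data)]
sharded = sharded_subgradient(w, data, n_shards=2)
```

The only communication needed is the final sum of one vector per shard, which is what makes this step easy to parallelize.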

The other expensive step is to solve the optimization problem Eq. (1.31), i.e., step 7 of Algorithm 1. Teo et al. (2007) resorted to the dual problem:

$$\alpha_t := \operatorname*{argmax}_{\alpha \in \Delta_t} \; -\lambda \Omega^*(-\lambda^{-1} A_t \alpha) + \langle \alpha, b_t \rangle, \tag{1.32}$$

where $\Delta_t$ is the $t$-dimensional simplex $\{(\alpha_1, \ldots, \alpha_t)^\top \in \mathbb{R}^t : \alpha_i \ge 0, \sum_i \alpha_i = 1\}$, $A_t := (a_1, \ldots, a_t)$ and $b_t := (b_1, \ldots, b_t)^\top$. The dual connection is $w_t = \partial \Omega^*(-\lambda^{-1} A_t \alpha_t)$. See Algorithm 2. Since the $\Omega^*$ in this dual problem is assumed to be twice differentiable and the constraint is a simple simplex, one can solve Eq. (1.32) with relatively more ease, e.g., with the method of Dai & Fletcher (2006), which is specialized to the L2 regularizer $\frac{1}{2}\|w\|^2$, and with penalty/barrier methods (Nocedal & Wright, 1999) in general.
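As a worked instance of Eq. (1.32), assume the L2 regularizer $\Omega(w) = \frac{1}{2}\|w\|^2$ (chosen here for illustration only). Its Fenchel conjugate is again a squared norm, so the dual objective and the dual connection take an explicit form:

```latex
\begin{aligned}
\Omega^*(\mu) &= \sup_{w} \; \langle \mu, w \rangle - \tfrac{1}{2}\|w\|^2 = \tfrac{1}{2}\|\mu\|^2, \\
-\lambda \Omega^*(-\lambda^{-1} A_t \alpha) + \langle \alpha, b_t \rangle
  &= -\frac{1}{2\lambda} \|A_t \alpha\|^2 + \langle \alpha, b_t \rangle, \\
w_t &= \partial \Omega^*(-\lambda^{-1} A_t \alpha_t) = -\lambda^{-1} A_t \alpha_t .
\end{aligned}
```

Under this assumption the dual is a concave quadratic program over the simplex whose Hessian $-\lambda^{-1} A_t^\top A_t$ is a $t \times t$ matrix, so the size of the problem grows with the iteration count $t$.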

To circumvent the growing cost of solving Eq. (1.31) or Eq. (1.32), Teo et al. (2007) proposed the following approximation. Instead of searching for $\alpha_t$ in $\Delta_t$, we restrict the search domain to a line segment $\{((1-\eta)\alpha_{t-1}^\top, \eta)^\top : \eta \in [0,1]\}$. See Algorithm 3. If $\Omega(w) = \frac{1}{2}\|w\|^2$, then we are essentially restricting the search for $w_t$ to the line segment between $w_{t-1}$ and $-\lambda^{-1} a_t$. In this case, we call Algorithm 3 ls-bmrm, and Algorithm 2 qp-bmrm as it solves a full quadratic program. As the feasible region of ls-bmrm in Eq. (1.32) is a proper subset of that of qp-bmrm, ls-bmrm makes less progress than qp-bmrm in each iteration, and hence converges more slowly.
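For $\Omega(w) = \frac{1}{2}\|w\|^2$ the one-dimensional problem has a closed-form solution, because the dual objective restricted to the segment is a concave quadratic in $\eta$. The sketch below works this out under that assumption. The function names and the running scalar `beta_prev` $= \langle \alpha_{t-1}, b_{t-1} \rangle$ are introduced here for illustration and do not come from the text.

```python
def ls_bmrm_step(w_prev, beta_prev, a_new, b_new, lam):
    """One ls-bmrm inner step, assuming Omega(w) = 0.5 * ||w||^2.

    w_prev       : previous iterate, equal to -A_{t-1} alpha_{t-1} / lam
    beta_prev    : <alpha_{t-1}, b_{t-1}>
    a_new, b_new : newest cutting plane (subgradient and intercept)
    Returns (eta, w_new, beta_new).
    """
    v = [-aj / lam for aj in a_new]                # other end of the segment
    d = [vj - wj for vj, wj in zip(v, w_prev)]     # direction from w_prev to v
    dd = sum(dj * dj for dj in d)
    # Derivative of the dual objective with respect to eta at eta = 0.
    slope0 = (b_new - beta_prev) - lam * sum(wj * dj for wj, dj in zip(w_prev, d))
    if dd == 0.0:
        eta = 1.0 if slope0 > 0 else 0.0           # objective is linear in eta
    else:
        eta = min(1.0, max(0.0, slope0 / (lam * dd)))  # clip stationary point
    w_new = [wj + eta * dj for wj, dj in zip(w_prev, d)]
    beta_new = (1.0 - eta) * beta_prev + eta * b_new
    return eta, w_new, beta_new

def dual_value(w, beta, lam):
    """Dual objective of Eq. (1.32) for the L2 regularizer, written via w."""
    return -0.5 * lam * sum(wj * wj for wj in w) + beta

lam = 1.0
eta, w_new, beta_new = ls_bmrm_step([1.0, 0.0], 0.5, [0.0, -1.0], 1.0, lam)
```

Each step costs only a few inner products, independent of $t$, in contrast to the full quadratic program of qp-bmrm.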

The key result on the convergence rate of BMRM is (Teo et al., 2010, Theorem 5):

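Theorem 27 (Convergence rate for BMRM) Assume that $J(w) > 0$ for all $w$. Assume $\|\partial_w R_{emp}(w)\| \le G$ for all $w \in \operatorname{dom} J$. Also assume that $\Omega^*$ has bounded curvature, i.e., $\|\partial_\mu^2 \Omega^*(\mu)\| \le H^*$ for all $\mu \in \{-\lambda^{-1} \sum_{i=1}^{t+1} \alpha_i a_i : \alpha \in \Delta_{t+1}\}$. For any $\epsilon < 4G^2H^*/\lambda$, the algorithm BMRM converges to the desired precision $\epsilon$ after

$$k \le \log_2 \frac{\lambda J(0)}{G^2 H^*} + \frac{8 G^2 H^*}{\lambda \epsilon} - 1$$

steps. Furthermore, if the Hessian of $J(w)$ is bounded as $\|\partial_w^2 J(w)\| \le H$, convergence to any $\epsilon \le H/2$ takes at most the following number of steps:

$$k \le \log_2 \frac{\lambda J(0)}{4 G^2 H^*} + \frac{4H^*}{\lambda} \max\left\{0, \; H - \frac{8 G^2 H^*}{\lambda}\right\} + \frac{4 H H^*}{\lambda} \log_2 \frac{H}{2\epsilon}.$$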
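To see how the first bound scales, the sketch below evaluates $\log_2 \frac{\lambda J(0)}{G^2 H^*} + \frac{8 G^2 H^*}{\lambda \epsilon} - 1$ for a few precisions. The constants are made-up placeholders rather than values from any experiment, and $H^* = 1$ corresponds to the assumed regularizer $\Omega(w) = \frac{1}{2}\|w\|^2$.

```python
import math

def bmrm_step_bound(lam, J0, G, H_star, eps):
    """First bound of the theorem: log2(lam*J0/(G^2 H*)) + 8 G^2 H*/(lam*eps) - 1."""
    if not eps < 4 * G**2 * H_star / lam:
        raise ValueError("the bound is stated only for eps < 4 G^2 H* / lam")
    log_term = math.log2(lam * J0 / (G**2 * H_star))
    return log_term + 8 * G**2 * H_star / (lam * eps) - 1

# Placeholder constants: lam = J(0) = G = H* = 1.
bounds = {eps: bmrm_step_bound(lam=1.0, J0=1.0, G=1.0, H_star=1.0, eps=eps)
          for eps in (1e-1, 1e-2, 1e-3)}
```

With these placeholder constants the logarithmic term vanishes and the bound is $8/\epsilon - 1$, so each tenfold reduction in $\epsilon$ multiplies the bound by roughly ten, which is the $O(1/\epsilon)$ behavior.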

Teo et al. (2010, Theorem 5) proved this rate for ls-bmrm, where each iteration only solves a simple one-dimensional optimization. In contrast, qp-bmrm performs a much more expensive optimization at every iteration, and it was therefore conjectured that the rate of convergence of qp-bmrm could be improved. This was also supported by the empirical convergence behavior of qp-bmrm, which is much better than the theoretically predicted rates on a number of real-life problems (Teo et al., 2010, Section 5). In Section 5.2, we answer this question in the negative by explicitly constructing a regularized risk minimization problem for which qp-bmrm takes at least $O(1/\epsilon)$ iterations.

In document Graphical Models: Modeling, Optimization, and Hilbert Space Embedding (Page 57-59)