1.6 Regularized risk estimation and optimizations
1.6.4 Bundle methods for regularized risk minimization
A natural heuristic for stabilization is to penalize the displacement of $w_t$ from $w_{t-1}$:
$$w_t := \operatorname*{argmin}_{w} \; \lambda \|w - w_{t-1}\|^2 + f_t^{\mathrm{cp}}(w).$$
This idea is called the proximal bundle method (Kiwiel, 1990), as the cutting planes $\{a_i, b_i\}$ are deemed as bundles, and $w_t$ is attracted to the proximity of $w_{t-1}$. A large volume of work has been done in this area over several decades, e.g., (Kiwiel, 1985) and (Hiriart-Urruty & Lemaréchal, 1993a, Chapter XIII to XV). The underlying idea is Moreau-Yosida regularization (Moreau, 1965; Yosida, 1964), and it is guaranteed to find an $\epsilon$ approximate solution in $O(1/\epsilon^3)$ steps (Kiwiel, 2000). When the objective function is strongly convex, the convergence rate can be linear under some assumptions (Robinson, 1999). Variants of this idea are also widely used, e.g., the trust region bundle method (Schramm & Zowe, 1992), which upper bounds the displacement instead of penalizing it, and the level set bundle method (Lemaréchal et al., 1995), which minimizes the displacement subject to a level of $f_t^{\mathrm{cp}}(w)$.
It is noteworthy that the above methods treat the objective function as a black box which provides function and gradient evaluation at any given location. However, RRM problems are not black boxes, but are explicitly composed of two parts: the empirical risk $R_{\mathrm{emp}}$ and the regularizer $\Omega$. The free availability of the regularizer motivated Teo et al. (2007); Smola et al. (2007b) to perform cutting plane on $R_{\mathrm{emp}}$ only, and use $\Omega$ as the stabilizer. This is called bundle method for machine learning (BMRM). Different from Moreau-Yosida regularization, where $w_t$ is stabilized about $w_{t-1}$, $\Omega(w)$ usually attracts
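Algorithm 2: Exact inner solver for BMRM (qp-bmrm)
Input: Previous subgradients $\{a_i\}_{i=1}^{t}$ and intercepts $\{b_i\}_{i=1}^{t}$.
1 Assemble $A_t := (a_1, \ldots, a_t)$ and $b_t := (b_1, \ldots, b_t)^\top$.
2 Solve $\alpha_t := \operatorname*{argmax}_{\alpha \in \Delta_t} -\lambda \Omega^*(-\lambda^{-1} A_t \alpha) + \langle \alpha, b_t \rangle$.
3 return $w_t := \partial \Omega^*(-\lambda^{-1} A_t \alpha_t)$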
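Algorithm 3: Inexact line search inner solver for BMRM (ls-bmrm)
Input: Previous subgradients $\{a_i\}_{i=1}^{t}$ and intercepts $\{b_i\}_{i=1}^{t}$.
1 Assemble $A_t := (a_1, \ldots, a_t)$ and $b_t := (b_1, \ldots, b_t)^\top$.
2 Solve $\eta_t := \operatorname*{argmax}_{\eta \in [0,1]} -\lambda \Omega^*(-\lambda^{-1} A_t \alpha_t(\eta)) + \langle \alpha_t(\eta), b_t \rangle$, where $\alpha_t(\eta) := ((1-\eta)\alpha_{t-1}^\top, \eta)^\top$.
3 $\alpha_t \leftarrow ((1-\eta_t)\alpha_{t-1}^\top, \eta_t)^\top$.
4 return $w_t := \partial \Omega^*(-\lambda^{-1} A_t \alpha_t)$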
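entropy regularizer. Technically, BMRM modifies the cutting plane algorithm just by replacing Eq. (1.29) with:
$$w_t := \operatorname*{argmin}_{w \in \operatorname{dom} f} \lambda\Omega(w) + R^{\mathrm{cp}}_{\mathrm{emp},t}(w) = \operatorname*{argmin}_{w} \underbrace{\lambda\Omega(w) + \max_{i \in [t]} \{\langle a_i, w \rangle + b_i\}}_{:= J_t(w)}. \tag{1.31}$$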
We summarize the BMRM algorithm in Algorithm 1.
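To make the objective in Eq. (1.31) concrete, the following is a minimal sketch of how the cutting planes and the model $J_t$ could be evaluated. It assumes the average hinge loss as $R_{\mathrm{emp}}$ and $\Omega(w) = \frac{1}{2}\|w\|^2$; both choices, and the function names `hinge_risk`, `cutting_plane` and `J_t`, are illustrative assumptions rather than part of the original text. Each plane is the first-order Taylor expansion of $R_{\mathrm{emp}}$ at a previous iterate, so by convexity the piecewise linear model never exceeds $R_{\mathrm{emp}}$.

```python
import numpy as np

def hinge_risk(w, X, y):
    """Average hinge loss and one subgradient at w (illustrative choice of R_emp)."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    risk = np.maximum(margins, 0.0).mean()
    # Subgradient of the average hinge loss: -(1/n) * sum of y_i x_i over violated margins.
    subgrad = -(X[active] * y[active, None]).sum(axis=0) / len(y)
    return risk, subgrad

def cutting_plane(w_prev, X, y):
    """Return (a, b) such that <a, w> + b is tangent to R_emp at w_prev."""
    risk, a = hinge_risk(w_prev, X, y)
    b = risk - a @ w_prev
    return a, b

def J_t(w, planes, lam):
    """Model objective: lam * 0.5 * ||w||^2 + max_i (<a_i, w> + b_i)."""
    A = np.array([a_i for a_i, _ in planes])   # one plane per row
    b = np.array([b_i for _, b_i in planes])
    return lam * 0.5 * (w @ w) + np.max(A @ w + b)
```

At any point where a plane was generated, the model agrees with the true regularized risk; elsewhere it is a lower bound.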
The most expensive steps in BMRM are steps 4 and 7 in Algorithm 1. In step 4, the computation of the subgradient needs to go through the whole dataset, and this admits straightforward data parallelization. In particular, if $R_{\mathrm{emp}}$ sums the loss from individual data points, as in the statistical query model (Kearns, 1998), then one can divide the whole dataset into subsets residing on distributed computing devices, compute their contributions to the gradient in parallel, and finally sum them up. This makes BMRM very promising for the coming era when parallel computing is the mainstream.
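The data-parallel evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used by Teo et al.: the hinge loss stands in for a generic decomposable loss, a local thread pool stands in for distributed devices, and the names `shard_contribution` and `parallel_risk` are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def shard_contribution(w, X, y):
    """Un-normalized hinge loss and subgradient summed over one data shard."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    loss = margins[active].sum()
    grad = -(X[active] * y[active, None]).sum(axis=0)
    return loss, grad

def parallel_risk(w, shards, n_total, max_workers=4):
    """Map over shards in parallel, then reduce by summation and normalize."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(lambda s: shard_contribution(w, *s), shards))
    loss = sum(p[0] for p in parts) / n_total
    grad = sum(p[1] for p in parts) / n_total
    return loss, grad
```

Because the loss is a plain sum over data points, the reduced result matches a single pass over the full dataset up to floating-point rounding, and only the current $w$ and one loss/gradient pair per shard need to be communicated.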
The other expensive step is to solve the optimization problem Eq. (1.31), i.e. step 7 of Algorithm 1. Teo et al. (2007) resorted to the dual problem:
$$\alpha_t := \operatorname*{argmax}_{\alpha \in \Delta_t} -\lambda\Omega^*(-\lambda^{-1} A_t \alpha) + \langle \alpha, b_t \rangle, \tag{1.32}$$
where $\Delta_t$ is the $t$-dimensional simplex $\{(\alpha_1, \ldots, \alpha_t)^\top \in \mathbb{R}^t : \alpha_i \ge 0, \sum_i \alpha_i = 1\}$, $A_t := (a_1, \ldots, a_t)$ and $b_t := (b_1, \ldots, b_t)^\top$. The dual connection is $w_t = \partial\Omega^*(-\lambda^{-1} A_t \alpha_t)$. See Algorithm 2. Since the $\Omega^*$ in this dual problem is assumed to be twice differentiable and the constraint is a simple simplex, one can solve Eq. (1.32) with relative ease, e.g., with the method of Dai & Fletcher (2006), which is specialized to the L2 regularizer $\frac{1}{2}\|w\|^2$, or with penalty/barrier methods (Nocedal & Wright, 1999) in general.
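For the L2 regularizer $\Omega(w) = \frac{1}{2}\|w\|^2$ we have $\Omega^*(\mu) = \frac{1}{2}\|\mu\|^2$, so Eq. (1.32) becomes the quadratic program $\max_{\alpha \in \Delta_t} -\frac{1}{2\lambda}\|A_t\alpha\|^2 + \langle \alpha, b_t \rangle$ with $w_t = -\lambda^{-1} A_t \alpha_t$. The sketch below solves it approximately by projected gradient ascent with a sort-based simplex projection. This solver is a simple stand-in chosen for brevity; it is not the Dai & Fletcher method or a barrier method, and the names `project_simplex` and `qp_bmrm_l2` are assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {alpha >= 0, sum(alpha) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def qp_bmrm_l2(A, b, lam, iters=2000):
    """Approximately maximize -(1/(2*lam)) * ||A @ alpha||^2 + <alpha, b> over the simplex.

    A holds the subgradients a_i as columns (shape d x t); b holds the intercepts.
    Returns the dual variable alpha and the primal point w = -A @ alpha / lam.
    """
    t = A.shape[1]
    Q = A.T @ A / lam
    # Step size 1/L, where L is the largest eigenvalue of Q (guarded against Q = 0).
    step = 1.0 / max(np.linalg.norm(Q, 2), 1e-12)
    alpha = np.full(t, 1.0 / t)
    for _ in range(iters):
        grad = b - Q @ alpha            # gradient of the concave dual objective
        alpha = project_simplex(alpha + step * grad)
    w = -(A @ alpha) / lam              # dual connection for the L2 regularizer
    return alpha, w
```

The fixed iteration count is an arbitrary choice for the sketch; in practice one would stop once the gap between $J_t(w)$ and the dual objective falls below a tolerance.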
To circumvent the growing cost of solving Eq. (1.31) or Eq. (1.32), Teo et al. (2007) proposed the following approximation. Instead of searching for $\alpha_t$ in $\Delta_t$, we restrict the search domain to a line segment $\{((1-\eta)\alpha_{t-1}^\top, \eta)^\top : \eta \in [0,1]\}$. See Algorithm 3. If $\Omega(w) = \frac{1}{2}\|w\|^2$, then we are essentially restricting the search for $w_t$ to the line segment between $w_{t-1}$ and $-\lambda^{-1} a_t$. In this case, we call Algorithm 3 ls-bmrm, and Algorithm 2 qp-bmrm as it solves a full quadratic program. As the feasible region of ls-bmrm in Eq. (1.32) is a proper subset of that of qp-bmrm, ls-bmrm can make no more progress than qp-bmrm in each iteration, and hence is expected to converge more slowly.
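For $\Omega(w) = \frac{1}{2}\|w\|^2$ the one-dimensional problem in Algorithm 3 has a closed form. Writing $w(\eta) = (1-\eta) w_{t-1} + \eta(-\lambda^{-1} a_t)$ and $c_{t-1} = \langle \alpha_{t-1}, b_{t-1} \rangle$, the dual objective along the segment is $-\frac{\lambda}{2}\|w(\eta)\|^2 + (1-\eta) c_{t-1} + \eta b_t$, a concave quadratic in $\eta$, so it suffices to clip its unconstrained maximizer to $[0, 1]$. A minimal sketch of one such update follows; the function name and argument layout are assumptions made for illustration.

```python
import numpy as np

def ls_bmrm_l2_step(w_prev, c_prev, a_new, b_new, lam):
    """One ls-bmrm update for Omega(w) = 0.5 * ||w||^2.

    w_prev : previous iterate, equal to -A_{t-1} @ alpha_{t-1} / lam
    c_prev : <alpha_{t-1}, b_{t-1}>, the intercept part of the previous dual value
    a_new, b_new : newest subgradient and intercept
    Returns (eta, w_new, c_new).
    """
    u = -a_new / lam                    # other end point of the segment
    d = u - w_prev
    denom = lam * (d @ d)
    if denom == 0.0:
        # Degenerate segment: the objective is linear in eta.
        eta = 1.0 if b_new > c_prev else 0.0
    else:
        # Stationary point of the concave quadratic, clipped to [0, 1].
        eta = (b_new - c_prev - lam * (w_prev @ d)) / denom
        eta = float(np.clip(eta, 0.0, 1.0))
    w_new = w_prev + eta * d
    c_new = (1.0 - eta) * c_prev + eta * b_new
    return eta, w_new, c_new
```

Each update costs a few inner products in the dimension of $w$ and needs only $w_{t-1}$ and the scalar $c_{t-1}$, rather than the full bundle $A_t$, which is what keeps the per-iteration cost from growing with $t$.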
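The key result on the convergence rate of BMRM is (Teo et al., 2010, Theorem 5):
Theorem 27 (Convergence rate for BMRM) Assume that $J(w) > 0$ for all $w$. Assume $\|\partial_w R_{\mathrm{emp}}(w)\| \le G$ for all $w \in \operatorname{dom} J$. Also assume that $\Omega^*$ has bounded curvature, i.e. $\|\partial^2_\mu \Omega^*(\mu)\| \le H^*$ for all $\mu \in \{-\lambda^{-1} \sum_{i=1}^{t+1} \alpha_i a_i : \alpha \in \Delta_{t+1}\}$. For any $\epsilon < 4G^2H^*/\lambda$, the algorithm BMRM converges to the desired precision $\epsilon$ after
$$k \le \log_2 \frac{\lambda J(0)}{G^2 H^*} + \frac{8 G^2 H^*}{\lambda \epsilon} - 1$$
steps. Furthermore, if the Hessian of $J(w)$ is bounded as $\|\partial^2_w J(w)\| \le H$, convergence to any $\epsilon \le H/2$ takes at most the following number of steps:
$$k \le \log_2 \frac{\lambda J(0)}{4 G^2 H^*} + \frac{4H^*}{\lambda} \max\left[0, H - \frac{8 G^2 H^*}{\lambda}\right] + \frac{4 H H^*}{\lambda} \log_2 \frac{H}{2\epsilon}.$$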
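To get a feel for the two bounds in the theorem, they can be evaluated numerically. The helper functions below are hypothetical and simply transcribe the two formulas; the constants in the usage note are arbitrary illustrative values, not taken from any experiment.

```python
import math

def bmrm_bound_general(eps, lam, G, H_star, J0):
    """First bound of the theorem; requires 0 < eps < 4 G^2 H* / lam."""
    if not 0 < eps < 4 * G**2 * H_star / lam:
        raise ValueError("the bound is stated only for 0 < eps < 4 G^2 H* / lam")
    return (math.log2(lam * J0 / (G**2 * H_star))
            + 8 * G**2 * H_star / (lam * eps) - 1)

def bmrm_bound_smooth(eps, lam, G, H_star, J0, H):
    """Second bound, for a bounded Hessian of J; requires 0 < eps <= H / 2."""
    if not 0 < eps <= H / 2:
        raise ValueError("the bound is stated only for 0 < eps <= H / 2")
    return (math.log2(lam * J0 / (4 * G**2 * H_star))
            + 4 * H_star / lam * max(0.0, H - 8 * G**2 * H_star / lam)
            + 4 * H * H_star / lam * math.log2(H / (2 * eps)))
```

For example, with $\lambda = G = H^* = J(0) = 1$ and $\epsilon = 0.01$, the first bound evaluates to approximately 799 steps and is dominated by the $8G^2H^*/(\lambda\epsilon)$ term, which makes the $O(1/\epsilon)$ dependence visible. The second bound grows only with $\log_2(1/\epsilon)$.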
Teo et al. (2010, Theorem 5) proved this rate for ls-bmrm, where each iteration only solves a simple one-dimensional optimization. In contrast, qp-bmrm performs a much more expensive optimization at every iteration; it was therefore conjectured that the rates of convergence of qp-bmrm could be improved. This was also supported by the empirical convergence behavior of qp-bmrm, which is much better than the theoretically predicted rates on a number of real life problems (Teo et al., 2010, Section 5). In Section 5.2, we answer this question in the negative by explicitly constructing a regularized risk minimization problem for which qp-bmrm takes at least $O(1/\epsilon)$ iterations.