Optimal first-order minimization methods

(1)

Optimal first-order minimization methods

with applications to image reconstruction and ML

Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology

University of Michigan

http://web.eecs.umich.edu/˜fessler

ECE Colloquium, Univ. of Minnesota

(2)

Disclosure

• Research support from GE Healthcare

• Supported in part by NIH grant U01 EB018753

• Equipment support from Intel Corporation

(3)

Lower-dose X-ray CT image reconstruction

Thin-slice FBP ASIR Statistical

Seconds A bit longer Much longer

Image reconstruction as an optimization problem: ˆ x = arg min x0 1 2ky − Axk 2 W + R(x),

y data, A system model, W statistics, R(x) regularizer. (Same sinogram, so all at same dose.)

(4)

Outline

Motivation Problem setting Existing algorithms

Gradient descent

Nesterov’s “optimal” first-order method

Optimizing first-order minimization methods Numerical examples

Logistic regression for machine learning CT image reconstruction

Further acceleration using OS

Generalizing OGM Summary / future work

(5)

Outline

Motivation

Problem setting

Existing algorithms Gradient descent

Nesterov’s “optimal” first-order method Optimizing first-order minimization methods Numerical examples

Further acceleration using OS Generalizing OGM

(6)

Optimization problem setting

ˆ

x ∈ arg min

x f (x)

I Unconstrained

I Large-scale (Hessian ∇2f too big to store and/or undefined)

I image reconstruction / inverse problems

I big-data / machine learning

I ...

I Cost function assumptions (throughout)

I _{f : R}M_{7→ R}

I convex(need not be strictly convex)

I non-empty set of global minimizers: ˆ

x ∈ X∗=x?∈ RM : f (x?) ≤ f (x), ∀x ∈ RM

I smooth(differentiable with L-Lipschitz gradient) k∇ f (x) −∇ f (z)k₂≤ L kx − zk₂, ∀x, z ∈ RM

(7)

Example: Fair potential function

Fair’s potential function [1] (similar to Huber function and hyperbola): ψ(z) = δ2[|z/δ| − log(1 + |z/δ|)] ˙ ψ(z) = z 1 + |z/δ| ¨ ψ(z) = 1 (1 + |z/δ|)2 ≤ 1. Thus L = 1. -1 0 1 3 0 1 z f(z) -1 0 1 3 -1 0 1 z f’(z)

(8)

Example: Machine learning for classification

To learn weights x of binary classifier given feature vectors {vi}

and labels {yi = ±1}: ˆ x = arg min x f (x), f (x) = X i ψ(yihx, vii) . loss functions ψ(z) I _{0-1: I}_{z≤0} I exponential: exp(−z)

I logistic: log(1 + exp(−z))

I hinge: max {0, 1 − z} -2 0 2 z 0 1 8 ψ ( z )

Loss functions (surrogates)

exponential hinge logistic 0-1

Which of these ψ fit our conditions?

(9)

Outline

Motivation Problem setting

Existing algorithms

Gradient descent

(10)

Gradient descent

I Problem: ˆ x = arg min x f (x) . I Initial guess x0.

I Simple recursive iteration:

xn+1= xn−

1

L∇f (xn) .

I Step size 1/L ensures monotonic descent of f .

I Telescoping (for intuition, not implementation):

xn+1= x0− 1 L n X k=0 ∇f (x_k) . 10 / 56

(11)

Gradient descent convergence rate

I Classic O(1/n) convergence rate of cost function descent:

f (xn) − f (x?) | {z } inaccuracy ≤ L kx0− x?k 2 2 2n .

I Drori & Teboulle (2014) derive tight inaccuracy bound:

f (xn) − f (x?) ≤

L kx0− x?k22

4n + 2 .

I They construct a Huber-like function f for which GD achieves

that bound =⇒ case closed for GD with step size 1/L.

(12)

Generalizing GD slightly

I GD with general step size h:

xn+1= xn−

h

L∇f (xn) .

I Classical monotone descent result:

h ∈ (0, 2) =⇒ f (x_n+1) < f (xn) when xn is not a minimizer.

I What is best h?

I If f is quadratic, then asymptotic best choice is:

h∗=

2L

λmax(∇2f ) + λmin(∇2f )

.

(13)

Generalizing GD slightly

I GD with general step size h:

xn+1= xn−

h

L∇f (xn) .

I More generally, Taylor et al. [3] recently (2017) conjectured:

f (xN) − f (x?) ≤ L kx0− x?k2₂ 2 max ₁ 2Nh+ 1, (1 −h) 2N_.

I Proof for 0 <h≤ 1 by Drori and Teboulle [2]

I Upper bounds achieved by Huber-like function and quadratic function f (x ) = (L/2)x2_{respectively.}

I Besthdepends on N !

(For N = 1, h∗= 1.5; for N = 100, h∗= 1.9705.) I Must select N in advance?

(14)

Heavy ball method and momentum

I Quest for accelerated convergence.

I Heavy ball iteration (Polyak, 1987):

xn+1 = xn− α L ∇f (xn) + β (x| n− x{z n−1}) momentum! (recursive form to implement) xn+1 = xn− 1 L n X k=0 αβn−k | {z } coefficients ∇f (x_k) (summation form to analyze)

I How to choose α and β?

I How to optimize coefficients more generally?

(15)

General first-order method classes

I General “first-order” (GFO) method:

xn+1= function(x0, f (x0), ∇f (x0), . . . , f (xn), ∇f (xn)) .

I First-order (FO) methods with fixed step-size coefficients:

xn+1= xn− 1 L n X k=0 hn+1,k∇f (xk) . Primary goals:

I Analyze convergence rate of FO for any given {hn,k}

I Optimize step-size coefficients {hn,k}

I fast convergence

I efficient recursive implementation

(16)

Example: Barzilai-Borwein gradient method

Barzilai & Borwein, 1988: g(n) , ∇f (xn) αn= kxn− xn−1k2₂ hx_n− x_n−1, g(n)_{− g}(n−1)_i xn+1= xn− αn∇f (xn) .

I In “general” first-order (GFO) class, but

I not in class FO with fixed step-size coefficients.

I Likewise for methods like

I steepest descent (with line search),

I conjugate gradient,

I quasi-Newton ...

(17)

Nesterov’s fast gradient method (FGM1)

Nesterov (1983) iteration: Initialize: t0 = 1, z0 = x0

zn+1= xn− 1 L∇f (xn) (usual GD update) tn+1= 1 2 1 + q 1 + 4t2 n

(magicmomentum factors)

xn+1= zn+1+

tn− 1 tn+1

(zn+1− zn) (update with momentum) .

Reverts to GD if tn= 1, ∀n. FGM1 is in class FO: xn+1= xn− 1 L n X k=0 hn+1,k∇f (xk) hn+1,k=          tn− 1 tn+1 hn,k, k = 0, . . . , n − 2 tn− 1 tn+1 (hn,n−1− 1) , k = n − 1 1 +tn− 1 tn+1 , k = n.   1 0 0 0 0 0 1.25 0 0 0 0 0.10 1.40 0 0 0 0.05 0.20 1.50 0 0 0.03 0.11 0.29 1.57  

(18)

Nesterov’s FGM1 optimal convergence rate

Shown by Nesterov to be O(1/n2_{) for “auxiliary” sequence:}

f (zn) − f (x?) ≤

2L kx0− x?k22 (n + 1)2 .

Nesterov constructed a simple quadratic function f such that, for any general FO method:

3

32L kx0− x?k

2 2

(n + 1)2 ≤ f (xn) − f (x?) .

Thus O(1/n2) rate of FGM1 is optimal.

New results(Donghwan Kim & JF, 2016):

• Bound on convergence rate of primary sequence {xn}:

f (xn) − f (x?) ≤

2L kx0− x?k22 (n + 2)2 .

• Verifies (numerically inspired) conjecture of Drori & Teboulle (2014).

(19)

Outline

Gradient descent

Nesterov’s “optimal” first-order method

Optimizing first-order minimization methods

Numerical examples

(20)

Overview

First-order (FO) method with fixed step-size coefficients:

xn+1= xn− 1 L n X k=0 hn+1,k∇f (xk)

I Analyze (i.e., bound) convergence rate as a function of

I number of iterations N

I Lipschitz constant L

I step-size coefficients H = {hn+1,k}

I initial distance to a solution: R = kx0− x?k.

I Optimize H by minimizing the bound.

I Seek an equivalent recursive form for efficient implementation.

(21)

Ideal “universal” bound for first-order methods

For given

• number of iterations N

• Lipschitz constant L

• step-size coefficients H = {hn+1,k}

• initial distance to a solution: R = kx0− x?k,

try to bound the worst-case convergence rate of a FO method:

B1(H, R, L, N, M) , max f ∈FL max x0,x1,...,xN∈RM max x?∈X∗_{(f )} kx0−x?k≤R f (xN) − f (x?) such that xn+1 = xn− 1 L n X k=0 hn+1,k∇f (xk), n = 0, . . . , N − 1.

Clearly for any FO method, this cost-function bound would hold:

(22)

Towards practical bounds for first-order methods

For convex functions with L-Lipschitz gradients: 1

2Lk∇f (x) − ∇f (z)k

2_{≤ f (x) − f (z) − h∇f (z), x − zi,}

∀x, z ∈ RM.

Drori & Teboulle (2014) use this inequality to propose a “more tractable” (finite-dimensional) relaxed bound:

B2(H, R, L, N, M) , max g₀,...,g_N∈RMδ0,...,δmaxN∈R max x0,x1,...,xN∈RM max x?: kx0−x?k≤R LRδ_N2 such that xn+1= xn− 1 L n X k=0 hn+1,kR gk, n = 0, . . . , N − 1, 1 2 gi− gj 2 ≤ δ_i − δ_j− 1 R hgj, xi − xji, i , j = 0, . . . , N, ∗, where g_n= _LR1 ∇f (xn) and δn= _LR1 (f (xn) − f (x?)) .

For any FO method:

f (xN) − f (x?) ≤ B1(H, R, L, N, M) ≤ B2(H, R, L, N, M)

However, even B2 is as of yet unsolved (for general H).

(23)

Numerical bounds for first-order methods

I Drori & Teboulle (2014) further relax the bound:

f (xN) − f (x?) ≤ B1(H, . . .) ≤ B2(H, . . .) ≤ B3(H, R, L, N).

I For given step-size coefficients H, and given number of

iterations N, they use a semi-definite program (SDP) to

compute B3 numerically.

I They find numerically that for the FGM1 choice of H,

the convergence bound B3 is slightly below

2L kx0− x?k2₂

(N + 1)2 .

(24)

Optimizing step-size coefficients numerically

Drori & Teboulle (2014) also computed numerically the minimizer over H of their relaxed bound for given N using a SDP:

H∗ = arg min

H

B3(H, R, L, N).

Numerical solution for H∗ for N = 5 iterations: [2, Ex. 3]

Drawbacks:

• Must choose N in advance

• Requires O(N) memory for all gradient vectors {∇f (xn)}N_n=1

• O(N2) computation for N iterations

Benefit: convergence bound (for specific N) ≈ 2× lower than for Nesterov’s FGM1.

(25)

New analytical solution (D. Kim, JF, 2016)

I Analytical solution for optimized step-size coefficients [8], [9]:

H∗ : hn+1,k =      θn−1 θn+1hn,k, k = 0, . . . , n − 2 θn−1 θn+1 (hn,n−1− 1) , k = n − 1 1 +2_θθn−1 n+1 , k = n. θn=        1, n = 0 1 2 1 +q1 + 4θ2_n−1, n = 1, . . . , N − 1 1 2 1 +q1 + 8θ2_n−1, n = N.

I Analytical convergence bound for this optimized H∗:

f (xN) − f (x?) ≤ B3(H∗, R, L, N) =

1L kx0− x?k22 (N + 1)(N + 1 +√2).

I Of course bound is O(1/N2_{), but constant is twice better.} I No numerical SDP needed =⇒ feasible for large N.

(26)

Optimized gradient method (OGM1)

Donghwan Kim & JF (2016) also foundefficient recursiveiteration:

Initialize: θ0= 1, z0 = x0 zn+1= xn− 1 L∇f (xn) θn=    1 2 1 +q1 + 4θ2 n−1 , n = 1, . . . , N − 1 1 2 1 +q1 + 8θ2 n−1 , n = N xn+1= zn+1+ θn− 1 θn+1 (zn+1− zn) + θn θn+1 (zn+1− xn) | {z } new momentum .

Reverts to Nesterov’s FGM1 by removing thenew term.

• Very simple modification of existing Nesterov code.

• No need to solve SDP.

• Factor of 2 better bound than Nesterov’s “optimal” FGM1.

• Similar momentum to G¨uler’s 1992 proximal point algorithm [10]. (Proofs omitted.)

(27)

Recent refinement of OGM1

New version OGM1’ [11], [12]:

zn+1 = xn− 1 L∇f (xn) tn+1 = 1 2 1 + q 1 + 4t2 n (momentum factors) xn+1 = xn− 1 +tn/tn+1 L ∇f (xn) | {z } over-relaxed GD +tn− 1 tn+1 (zn+1− zn) | {z } FGM momentum .

I New convergence bound for every iteration:

f (zn) − f (x?) ≤

1L kx0− x?k22

(n + 1)2 .

I Simpler and more practical implementation.

(28)

OGM1’ momentum factors

0 50 0 1 xn+1= xn− 1+tn/tn+1 L ∇f (xn) + tn− 1 tn+1 (zn+1− zn) 27 / 56

(29)

Optimized gradient method (OGM) is optimal!

For the class of first-order (FO) methods with fixed step sizes:

xn+1= xn− 1 L n X k=0 hn+1,k∇f (xk),

we optimized OGM and proved the convergence rate upper bound:

f (xN) − f (x?) ≤

L kx0− x?k22

N2 .

Recently Y. Drori [13] considered the class of general FO methods: xn+1 =F(x0, f (x0), ∇f (x0), . . . , f (xn), ∇f (xn)) ,

and showed any algorithm in this case has a function f such that

L kx0− x?k2₂

N2 ≤ f (xN) − f (x?),

for d > N (large-scale). Thus OGM hasoptimalcomplexity among

(30)

Worst-case functions for OGM

From [11], [12], worst-case behavior is:

OGM has two worst-case functions (like GD): a Huber-like function and a quadratic function. Worst-case means: f (xN) − f (x?) = LR2 θ2 N ≤ LR 2 (N + 1)(N + 1 +√2) ≤ LR2 (N + 1)2. 29 / 56

(31)

Outline

Gradient descent

Nesterov’s “optimal” first-order method Optimizing first-order minimization methods

Numerical examples

(32)

Machine learning (logistic regression)

To learn weights x of binary classifier given feature vectors {vi}

and labels {yi = ±1}: ˆ x = arg min x f (x), f (x) = X i ψ(yihx, vii) +β 1 2kxk 2 2. logistic: ψ(z) = log(1 + e−z), ψ(z) =˙ −1 ez_{+ 1}, ψ(z) =¨ ez (ez_{+ 1)}2 ∈ 0,1 4 . Gradient ∇f (x) =P iyiviψ(y˙ ihx, vii) +βx

Hessian is positive definite so strictly convex: ∇2f (x) =X i viψ(y¨ ihx, vii) v0i+βI 1 4 X i viv0i+βI =⇒ L , 1 4ρ X i viv0i ! + β ≥ max x ρ ∇2f (x) 31 / 56

(33)

Numerical Results: logistic regression

-4 -1 0 1 4 -4 -1 0 1 4

Training data (points); initial decision boundary (red); final decision boundary (magenta); ideal boundary (yellow).

(34)

Numerical Results: convergence rates

0 20 40 60 80 0 8 GD Nesterov FGM OGM1

OGM faster than FGM in early iterations...

(35)

Numerical Results: adaptive restart

0 20 40 60 80 10-3 10-2 10-1 100 101

FGM restart, O’Donoghue & Cand`es, 2015.

(36)

Low-dose 2D X-ray CT image reconstruction simulation

0 50 100 150 200 0 5 10 15 20 25 30 Iteration RMSD [HU] GD FGM OGM 35 / 56

(37)

Outline

Gradient descent

Nesterov’s “optimal” first-order method Optimizing first-order minimization methods

Numerical examples

(38)

Combining ordered subsets (OS) with momentum

I Optimization problems in image reconstruction (and machine

learning) involve sums of many similar terms:

f (x) = M

X

m=1 fm(x) .

I Approximate gradients using just one term at a time:

∇ f (x) ≈ M∇ fm(x)

I Ordered subsets (OS) in tomography [16]

I Incremental gradients in optimization / machine learning

I Combining OS with momentum dramatically accelerates!

(39)

OS + OGM1 method

Initialize: θ0= 1, z0 = x0 (D. Kim, S. Ramani, JF, 2015) [17]

For each iteration n

For each subset m = 1, . . . , M k= nM + m − 1 zk+1 = xk− M L ∇fm(xk) (usual OSupdate) θk = 1 2 1 +q1 + 4θ2 k−1 (momentum factors) xk+1 = zk+1+ θk− 1 θk+1 (zk+1− zk) + θk θk+1 (zk+1− xk) | {z } new momentum .

• Simple modification of existing OS code

• ≈ O 1/(Mn)2

(40)

Results: 3D X-ray CT patient scan

• 3D cone-beam helical CT scan with pitch 0.5

Initial FBP image x(0) Converged image x(∞) 800 850 900 950 1000 1050 1100 1150 1200

• Convergence rate in RMSD [HU], within ROI, versus iteration:

RMSD_ROI(xn) ,

||x_ROI(n)_√− ˆxROI||2

NROI

.

(Disclaimer: RMSD may not relate to task performance...)

(41)

Results: RMSD [HU] vs. iteration: without OS

0 50 100 150 200 0 5 10 15 20 25 30 Iteration RMSD [HU] GD FGM OGM

• Computation time: OGM <FGM GD

• OGM requires about √1

2-times fewer iterations than FGM

(42)

Results: RMSD [HU] vs. iteration: with OS

0 5 10 15 20 0 5 10 15 20 25 30 Iteration RMSD [HU] GD FGM OGM OS(12)−GD OS(12)−FGM OS(12)−OGM • M = 12 subsets in OS algorithm.

• Proposed OS-OGMconverges faster than OS-FGM.

• Computation time per iteration of all algorithms are similar.

(43)

Outline

Gradient descent

Further acceleration using OS

Generalizing OGM

(44)

Generalizing OGM - faster gradient norm decrease

I Cost function decrease: f (xn) − f (x?) ∼ O(1/n2)

I Gradient norm decrease? k∇f (xn)k → 0 at what rate?

Important especially for problems involving duality.

(45)

Bounds on gradient norm decrease

I Known bounds [18] [20]: GD: min 0≤n≤Nk∇f (xn)k = k∇f (xN)k ≤ √ 2 N LR FGM: k∇f (x_N)k ≤ 2 NLR.

I New very recent bounds (DK & JF, 2016) [21], [22]:

FGM: min 0≤n≤Nk∇f (xn)k ≤ 2√3 N3/2LR OGM: min 0≤n≤Nk∇f (xn)k ≤ k∇f (xN)k ≤ √ 2 N LR.

(46)

Generalized OGM (GOGM) recursive iteration

Very recent generalization (DK & JF, 2016) [21], [22]

Input: f ∈ FL, x0∈ RN, z0 = x0, t0∈ (0, 1]. for n = 0, 1, . . . zn+1= xn− 1 L∇f (xn) tn+1 > 0 s.t. tn+12 ≤ Tn+1, n+1 X k=0 tk (momentum_factors) xn+1 = zn+1+ (Tn− tn)tn+1 Tn+1tn (zn+1− zn) +(2t 2 n− Tn)tn+1 Tn+1tn (zn+1− xn) . I Simple implementation

I Best choice of factors tn(in terms of gradient norm decrease)?

(47)

Generalized OGM (GOGM)

Optimized choice of momentum factors (for decreasing gradient norm) (DK & JF, 2016) [21], [22]: tn,      1, n = 0, 1 2 1 +q1 + 4t_n−12 , n = 0, . . . , bN/2c − 1, (N − n + 1)/2, n = bN/2c , . . . N.

(48)

Optimized parameters for OGM-OG

0 25 50 0 15 tn 0 25 50 0 1 (Tn− tn)tn+1/(Tn+1tn) n 0 25 50 -1 0 1 (2t2 n− Tn)tn+1/(Tn+1tn) 47 / 56

(49)

OGM-OG convergence rate bounds

I Convergence bound for cost function for OGM-OG:

f (zN) − f (x?) ≤

2L kx0− x?k22

N2 .

I Same as Nesterov’s FGM.

I Convergence bound for gradient norm is best known:

min

0≤n≤Nk∇f (zn)k ≤ min0≤n≤Nk∇f (xn)k ≤

√ 6

N3/2LR.

I √2 better than FGM’s smallest gradient norm bound.

I Variations that do not require choosing N in advance,

but that have slightly larger constants in bounds.

I Derivation uses relaxations that are not tight.

(50)

Summary of (fast?) gradient decreasing FO methods

From [21], [22]:

Numerical examples are work-in-progress.

(51)

Non-smooth (composite) convex problems

Composite cost function: arg min

x F (x), F (x) , f (x) + g(x)

f (x) : convex, smooth with Lipshitz gradient g (x) : convex but possibly (usually) non-smooth

Examples:

• g (x) = kxk₁

• g (x) characteristic function of a convex constraint

Fast iterative soft thresholding algorithm (FISTA) (Beck & Teboulle, 2009) [23]

AKA “fast proximal gradient method” (FPGM)

Simple recursive iteration with O(1/n2) cost function convergence

(52)

Improving on FISTA

DK & JF, 2016 [24], [25]

FPGM with “optimized proximal gradient” (FPGM-OPG). Best known bound on proximal gradient convergence rate.

(53)

Summary

I New optimized first-order minimization algorithm (optimal!)

I Simple implementation akin to Nesterov’s FGM

I Analytical converge rate bound

I Bound on cost function decrease is 2× better than Nesterov

Future work

• Constraints

• Non-smooth cost functions, e.g., `1

• Tighter bounds

• Strongly convex case

• Asymptotic / local convergence rates

• Incremental gradients

• Stochastic gradient descent

• Adaptive restart

• Distributed computation

(54)

Weisman Art Museum, Univ. of Minnesota

Charles Biederman # 4 Diaz

1986

(55)

Bibliography I

[1] R. C. Fair,“On the robust estimation of econometric models,” Ann. Econ. Social Measurement, vol. 2,

667–77, Oct. 1974.

[2] Y. Drori and M. Teboulle,“Performance of first-order methods for smooth convex minimization: A novel approach,” Mathematical Programming, vol. 145, no. 1-2, 451–82, Jun. 2014.

[3] A. B. Taylor, J. M. Hendrickx, and Franc¸ois. Glineur,“Smooth strongly convex interpolation and exact worst-case performance of first- order methods,” Mathematical Programming, vol. 161, no. 1, 307–45,

Jan. 2017.

[4] B. T. Polyak, Introduction to optimization. New York: Optimization Software Inc, 1987.

[5] J. Barzilai and J. Borwein,“Two-point step size gradient methods,” IMA J. Numerical Analysis, vol. 8,

no. 1, 141–8, 1988.

[6] Y. Nesterov,“A method for unconstrained convex minimization problem with the rate of convergence

O(1/k2),” Dokl. Akad. Nauk. USSR, vol. 269, no. 3, 543–7, 1983.

[7] ——,“Smooth minimization of non-smooth functions,” Mathematical Programming, vol. 103, no. 1,

127–52, May 2005.

[8] D. Kim and J. A. Fessler,Optimized first-order methods for smooth convex minimization, arxiv 1406.5468, 2014.

[9] ——,“Optimized first-order methods for smooth convex minimization,” Mathematical Programming,

vol. 159, no. 1, 81–107, Sep. 2016.

[10] O. G¨uler,“New proximal point algorithms for convex minimization,” SIAM J. Optim., vol. 2, no. 4,

649–64, 1992.

[11] D. Kim and J. A. Fessler,On the convergence analysis of the optimized gradient methods, arxiv 1510.08573, 2015.

(56)

Bibliography II

[12] ——,“On the convergence analysis of the optimized gradient methods,” J. Optim. Theory Appl., vol.

172, no. 1, 187–205, Jan. 2017.

[13] Y. Drori,“The exact information-based complexity of smooth convex minimization,” J. Complexity,

2016.

[14] D. B¨ohning and B. G. Lindsay,“Monotonicity of quadratic approximation algorithms,” Ann. Inst. Stat. Math., vol. 40, no. 4, 641–63, Dec. 1988.

[15] B. O’Donoghue and E. Cand`es,“Adaptive restart for accelerated gradient schemes,” Found. Comp. Math., vol. 15, no. 3, 715–32, Jun. 2015.

[16] H. Erdo˘gan and J. A. Fessler,“Ordered subsets algorithms for transmission tomography,” Phys. Med. Biol., vol. 44, no. 11, 2835–51, Nov. 1999.

[17] D. Kim, S. Ramani, and J. A. Fessler,“Combining ordered subsets and momentum for accelerated X-ray CT image reconstruction,” IEEE Trans. Med. Imag., vol. 34, no. 1, 167–78, Jan. 2015.

[18] Y. Nesterov,How to make the gradients small, Optima 88, 2012.

[19] A. Beck and M. Teboulle,“A fast dual proximal gradient algorithm for convex minimization and applications,” Operations Research Letters, vol. 42, no. 1, 1–6, Jan. 2014.

[20] I. Necoara and A. Patrascu,“Iteration complexity analysis of dual first order methods for conic convex programming,” Optimization Methods and Software, vol. 31, no. 3, 645–78, 2016.

[21] D. Kim and J. A. Fessler,Generalizing the optimized gradient method for smooth convex minimization, arxiv 1607.06764, 2016.

[22] ——,“Generalizing the optimized gradient method for smooth convex minimization,” Mathematical Programming, 2016, Submitted.

[23] A. Beck and M. Teboulle,“Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems,” IEEE Trans. Im. Proc., vol. 18, no. 11, 2419–34, Nov. 2009.

(57)

Bibliography III

[24] D. Kim and J. A. Fessler,Another look at the ”Fast iterative shrinkage/Thresholding algorithm (FISTA), arxiv 1608.03861, 2016.

[25] ——,“Another look at the “Fast iterative shrinkage/Thresholding algorithm (FISTA)”,” SIAM J. Optim., 2016, Submitted.