LARGE-SCALE OPTIMIZATION FOR MACHINE LEARNING AND SEQUENTIAL DECISION MAKING

QIHANG LIN

Tepper School of Business
Carnegie Mellon University

Thesis Committee
Javier Peña (Chair)
Geoffrey Gordon
Fatma Kılınç-Karzan
Lin Xiao

Submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy

Dedicated to my parents Yunwu Lin and Biying Liu, and my stepmother Feique Lin.


Thanks to the development of modern digital technology, people today are able to generate, collect and store data of unprecedented volume, dimension and complexity. This growing trend of big data brings new needs for powerful tools to explore and analyze different data sets. Among different tools for data analysis, structured regression is one of the most popular techniques. It has been successfully applied to data from many disciplines in business, science and engineering.

A structured regression model is formulated as an optimization problem with a large number of decision variables. Traditional techniques often suffer from low scalability and unaffordable computational time when applied to optimization problems at such a large scale. To address this challenge, in the first main part of this thesis, we propose several optimization methods with low memory requirements for different structured regression problems based on homotopy, smoothing, stochastic sampling and other techniques. In particular, our contributions include:

1. A gradient homotopy method for the Lasso problem, the most popular structured regression model. This method finds an ε-optimal solution within O(log(1/ε)) iterations. This complexity improves on the traditional O(1/√ε) complexity, which is theoretically not improvable if an algorithm is only allowed to use black-box gradient information. What allows us to achieve a better complexity is that our method fully utilizes the local strong convexity property of the Lasso problem under the restricted eigenvalue assumption, which provides much more information than gradients.

2. A smoothing proximal gradient algorithm for solving general structured regressions. One difficulty of applying first-order methods to structured regressions with sophisticated penalty terms is that the proximal mapping, which must be computed in each iteration, does not have a closed-form solution. The algorithm we proposed constructs a smooth approximation to the penalty terms so that the proximal mapping can be computed in closed form, and thus, first-order methods can be applied efficiently.

3. A unified theoretical framework for analyzing the geometric nature of different first-order methods. We show that most popular versions of accelerated gradient methods essentially construct an estimate sequence of the optimization problem in different ways. This explains why these methods achieve the same optimal convergence rate even with different updating schemes.

4. A stochastic first-order method for structured regression, which utilizes a stochastic gradient constructed using only a random sample of the whole data set. This method is favorable when the size of the data grows beyond the storage capacity and deterministic first-order methods cannot be applied due to the overwhelming computational cost of computing the exact gradient. Our method achieves the “uniformly” optimal complexity O(1/√ε + σ/ε²), which means that, when the noise σ of the stochastic gradient is reduced to zero, our method achieves the optimal complexity O(1/√ε) of a deterministic first-order method. By contrast, traditional stochastic gradient methods would still have the sub-optimal complexity O(1/ε²).

The challenges of large-scale optimization arise not only from the large volume of data but also from the exponential rate of growth of the number of variables in multistage decision making models. In the second main part of this thesis, we explore the scalability of first-order methods in the latter case.

We focus on the optimal trade execution problem under a coherent dynamic risk measure. This is a large-scale multistage stochastic optimization problem that arises in financial engineering.

Relying on the dual representation of the coherent risk measure, we can formulate this problem as a saddle-point problem and solve it with a primal-dual first-order method. The truncated simplex structure of the primal and dual domains allows us to obtain a closed-form solution to the relevant proximal mapping sub-problem, resulting in an efficient implementation of this first-order method. Our models and algorithms are tested on real limit order book data and demonstrate promising numerical properties. Furthermore, we generalize the same primal-dual first-order method to the case where the multistage decision making problem is modeled with a scenario tree in a non-Markovian fashion.

In the last part of the thesis, we discuss future research directions in optimization techniques for solving high-dimensional structured regression problems and multistage decision making problems from both the theoretical and computational aspects.

KEYWORDS

Machine Learning, Sparse Learning, Lasso, Optimization, Regression, First-Order Method, Proximal Gradient Method, Stochastic Optimization, Stochastic Gradient, Convex Programming, Dynamic Coherent Risk Measure, Saddle-Point Problem, Markov Decision Process, Trade Execution, Multistage Decision Making, Scenario Tree, Limit Order Book.

First, I would like to thank my advisor Javier Peña for his great advice and support through my entire Ph.D. journey. I am particularly grateful to him not only for directing me towards exciting problems and applications, but also for allowing me the freedom to pursue the topics that I am passionate about. His vision on new research problems, his attitude to the academic career, as well as his extremely nice and sincere personality all deeply influenced me and will continue to shape me in the future.

I also deeply appreciate my other thesis committee members, Geoffrey Gordon, Fatma Kılınç-Karzan and Lin Xiao. They put a lot of effort into this thesis and provided many insightful suggestions. I also thank many other faculty members in the Tepper School of Business: Egon Balas, Gerard Cornuejols, R. Ravi, John Hooker and Alan Scheller-Wolf. I was so fortunate to take excellent courses from them. Their courses opened the doors to different areas of operations research, a great field that I am delighted to devote my entire career to exploring.

During my Ph.D., I did a great summer internship at Microsoft Research in 2012. I sincerely thank my mentor Lin Xiao, who is also my committee member, my collaborator Dengyong Zhou and other friends at Microsoft Research for their help and support. I am also indebted to many other colleagues and friends at CMU, who played critical roles during my Ph.D. through collaborations, discussions and suggestions. Some of them include: Marco Molinaro, Ishani Aggarwal, David Bergman, Hao Xue, Xin Fang, Nan Xiong, Selvaprabu Nadarajah, Negar Soheili, Andre Cire, Amitabh Basu, Andrea Qualizza and Yangfang Zhou.

In particular, I would like to thank Xi Chen. He, as one of my most important collaborators, not only taught me knowledge about statistics, but also proposed many interesting problems to me. We had so many enjoyable discussions and I was often amazed by his programming skills and knowledge of statistical machine learning and data mining.

I also owe special thanks to Lawrence Rapp, who did a superb administration job and provided crucial assistance in making sure that my graduate experience went smoothly. The tea break he organized every Monday and Thursday was always something I looked forward to and enjoyed every week.

Most importantly, I want to thank my parents, Yunwu Lin and Biying Liu as well as my stepmother Feique Lin. They made all kinds of efforts to create a healthy family environment where I could grow up happily. This thesis is dedicated to them.


CONTENTS

I THESIS OVERVIEW
1 Thesis Overview
  1.1 Motivation and Statement
  1.2 Thesis Overview
  1.3 Main Results and Organization
    1.3.1 Part II: Background
    1.3.2 Part III: First-Order Methods for Structured Regressions
    1.3.3 Part IV: First-Order Methods for Multistage Decision Problems
    1.3.4 Part V: Conclusions and Future Work

II BACKGROUND
2 Background
  2.1 Structured Regression
  2.2 Multi-task Structured Regression
  2.3 First-Order Methods
  2.4 Stochastic First-Order Methods
  2.5 A Guideline for Choosing Algorithms in the Thesis

III FIRST-ORDER METHODS FOR STRUCTURED REGRESSIONS
3 Accelerated Gradient Homotopy Method for Lasso
  3.1 Introduction
    3.1.1 Minimizing composite objective functions
    3.1.2 Homotopy continuation for sparse optimization
    3.1.3 Outline of the chapter
  3.2 Preliminaries and Notation
    3.2.1 Composite gradient mapping
    3.2.2 Proximal gradient method with line search
  3.3 An APG Method for Minimizing Strongly Convex Functions
    3.3.1 Proof of Theorem 3.1
    3.3.2 The non-blowout property
  3.4 An Adaptive APG Method with Restart
  3.5 Homotopy Continuation for Sparse Optimization
    3.5.1 Sparsity along the solution path
    3.5.2 Proof of Theorem 3.3
    3.5.3 Proof of Theorem 3.4
  3.6 Numerical Experiments
    3.6.1 Experiments on the AdapAPG method
    3.6.2 Experiments on homotopy continuation
4 Smoothing Proximal Gradient Method for Structured Regression
  4.1 Introduction
  4.2 Linear Regression Regularized by Structured Sparsity-inducing Penalties
  4.3 Smoothing Proximal Gradient
    4.3.1 Reformulation of Structured Sparsity-inducing Penalty
    4.3.2 Smooth Approximation to Structured Sparsity-inducing Penalty
    4.3.3 Smoothing Proximal Gradient Method
    4.3.4 Issues on the Computation of the Lipschitz Constant
    4.3.5 Convergence Rate and Time Complexity
    4.3.6 Summary and Discussions
  4.4 Related Optimization Methods
    4.4.1 Related work for mixed-norm based group-lasso penalty
    4.4.2 Related work for fused lasso
  4.5 Extensions to Multi-task Regression with Structures on Outputs
    4.5.1 Multi-task Linear Regression Regularized by Structured Sparsity-inducing Penalties
    4.5.2 Smoothing Proximal Gradient Descent
  4.6 Numerical Experiments
    4.6.1 Simulation Study I: Overlapping Group Lasso
    4.6.2 Simulation Study II: Multi-task Graph-guided Fused Lasso
    4.6.3 Real Data Analysis: Pathway Analysis of Breast Cancer Data
5 Estimate Sequences and Accelerated Proximal Gradient Methods
  5.1 Introduction
  5.2 Estimate Sequence
  5.3 A Generic Accelerated Gradient Algorithm
6 Optimal Regularized Dual Averaging Methods for Stochastic Optimization
  6.1 Introduction
  6.2 Preliminary and Notation
  6.3 Optimal Regularized Dual Averaging Method
    6.3.1 Convergence Rate
    6.3.2 Variance Bounds
    6.3.3 High Probability Bounds
  6.4 Multistage ORDA for Stochastic Strongly-Convex Optimization
  6.5 Related Work
  6.6 Numerical Experiments
    6.6.1 Simulated Experiments
    6.6.2 Real Data Experiments

IV FIRST-ORDER METHODS FOR MULTISTAGE DECISION PROBLEMS
7 Optimal Trade Execution with Coherent Dynamic Risk Measures
  7.1 Introduction
  7.2 Trade Execution Model
  7.3 Coherent Dynamic Risk Measure
  7.4 Convex Optimization Formulation
  7.5 The Optimal Strategy for Single Asset Liquidation
  7.6 The Optimal Strategy for Multiple Asset Liquidation
    7.6.1 Sample average approximation and dual representation of coherent risk measures
    7.6.2 Saddle-point formulation and Mirror-Prox algorithm
    7.6.3 Acceleration by excessive gap method
  7.7 Extension to Nonlinear Market Impacts
  7.8 Numerical Experiments
    7.8.1 Scalability and efficiency
    7.8.2 Early completion vs. small tail
    7.8.3 Efficient frontier
    7.8.4 Experiments with NYSE limit order book data
8 Dynamic Coherent Risk Minimization over Scenario Trees
  8.1 Introduction
  8.2 Dynamic Risk Minimization and Saddle-Point Formulation
  8.3 Mirror-Prox Algorithm for Trade Execution under a Scenario Tree Model
    8.3.1 Mirror-Prox algorithm
    8.3.2 Trade execution with scenario tree price model
    8.3.3 Solving the minimization sub-problems
  8.4 Theoretical Convergence Rate
  8.5 Numerical Experiments
    8.5.1 Comparison of distance generating functions
    8.5.2 Comparison of Markovian and Non-Markovian Policies

V CONCLUSIONS AND FUTURE WORK
9 Conclusions and Future Directions
  9.1 Conclusions
  9.2 Future Directions

Bibliography

LIST OF FIGURES

Figure 2.1 The guideline for using the algorithms in this thesis.

Figure 3.1 Minimizing a random instance of the log-sum-exp function.

Figure 3.2 Minimizing a random instance of the log-sum-exp function. CPU time in seconds: PG(4.25), FISTA(7.22), FISTA+RS(1.93), AdapAPG(µ₀=200)(2.68) and AdapAPG(µ₀=0.2)(1.82).

Figure 3.3 Minimizing another random instance of the log-sum-exp function. CPU time in seconds: PG(2.66), FISTA(6.04), FISTA+RS(3.63), AdapAPG(µ₀=200)(2.08) and AdapAPG(µ₀=0.2)(1.18).

Figure 3.4 Solving an ill-conditioned ℓ₁-LS problem. AdapAPG1 starts with µ₀ = L₀/10, and AdapAPG2 starts with µ₀ = L₀/100. CPU time in seconds: PG(27.41), FISTA(54.03), FISTA+RS(22.76), AdapAPGmu1(46.94), AdapAPGmu2(55.39), PG+H(21.61), FISTA+H(28.78), FISTA+RS+H(22.49), AdapAPGmu1+H(23.12), AdapAPGmu2+H(13.91).

Figure 3.5 Solving a randomly generated ℓ₁-LS problem. AdapAPG1 starts with µ₀ = L₀/10, and AdapAPG2 starts with µ₀ = L₀/100. CPU time in seconds: PG(5.33), FISTA(3.52), FISTA+RS(2.94), AdapAPGmu1(4.58), AdapAPGmu2(3.11), PG+H(1.97), FISTA+H(2.77), FISTA+RS+H(2.79), AdapAPGmu1+H(1.68), AdapAPGmu2+H(1.57).

Figure 3.6 Solving a randomly generated ℓ₁-LS problem with non-sparse x̄. AdapAPG1 starts with µ₀ = L₀/10, and AdapAPG2 starts with µ₀ = L₀/100. CPU time in seconds: PG(7.97), FISTA(9.97), FISTA+RS(5.74), AdapAPGmu1(7.82), AdapAPGmu2(6.35), PG+H(2.93), FISTA+H(5.21), FISTA+RS+H(4.05), AdapAPGmu1+H(2.22), AdapAPGmu2+H(2.82).

Figure 4.1 A geometric illustration of the smoothness of Ψµ(β). (a) The 3-D plot of z(α, β), (b) the projection of (a) onto the β-z space, (c) the 3-D plot of z_s(α, β), and (d) the projection of (c) onto the β-z space.

Figure 4.2 Regression coefficients estimated by different methods based on a single simulated data set. b = 0.8 and threshold ρ = 0.3 for the output correlation graph are used. Red pixels indicate large values. (a) The correlation coefficient matrix of phenotypes, (b) the edges of the phenotype correlation graph obtained at threshold 0.3 are shown as black pixels, (c) the true regression coefficients used in simulation. Absolute values of the estimated regression coefficients are shown for (d) lasso, (e) ℓ₁/ℓ₂ regularized multi-task regression, (f) Graph-guided fused lasso. Rows correspond to outputs and columns to inputs.

Figure 4.3 Comparisons of SPG, FOBOS and QP. (a) Vary k from 50 to 10,000, fixing m = 500, n = 100; (b) Vary m from 50 to 10,000, fixing m = 1000, k = 50; and (c) Vary n from 500 to 10,000, fixing n = 100, k = 50.

Figure 4.4 Results from the analysis of breast cancer data set. (a) Balanced error rate for varying the number of selected genes, and (b) the number of pathways for varying the number of selected genes.

Figure 6.1 Objective values vs. iterations when ρ = 0. Only the first 200 iterations are plotted for better visualization and the ease of comparisons.

Figure 6.2 Objective values vs. iterations when ρ = 1. Only the first 200 iterations are plotted for better visualization and the ease of comparisons.

Figure 6.3 ORDA vs. M_ORDA.

Figure 7.1 Optimal trading strategies (Δ) and trajectories (Π) in Mean-Variance Model and Dynamic Risk Model: α_k = 0.98, λ = 0.0001, Expected trading cost EC(Δ) = 3.6231e+005 ($)

Figure 7.2 Optimal trading strategies (Δ) and trajectories (Π) in Mean-Variance Model and Dynamic Risk Model: α_k = 0.7, λ = 0.00004682, Expected trading cost EC(Δ) = 2.5345e+005 ($)

Figure 7.3 Optimal trading strategies (Δ) and trajectories (Π) in Mean-Variance Model and Dynamic Risk Model: α_k = 0.5, λ = 0.0000305, Expected trading cost EC(Δ) = 2.1253e+005 ($)

Figure 7.4 Efficient frontier

Figure 8.1 Objective values decrease in different Mirror-Prox algorithms.

Figure 8.2 The values of L_Φ/σ₁ = L_Φ/σ₂ found by the search scheme in Appendix in each iteration.

Figure 8.3 The sample paths of simulated prices.

Figure 8.4 The scenario tree constructed using the method from [52] based on the sample in Figure 8.3.

Figure 8.5 ... Markovian and Markovian strategies.

Figure 8.6 The trading policy n and trading trajectory x when α = 0, T = 5 and x₀ = 1000.

Figure 8.7 The trading policy n and trading trajectory x when α = 0.5, T = 5 and x₀ = 1000.

Figure 8.8 The trading policy n and trading trajectory x when α = 0.9, T = 5 and x₀ = 1000.

Figure 8.9 The trading policy n and trading trajectory x when α = 0, T = 8 and x₀ = 5000.

LIST OF TABLES

Table 4.1 Comparison of Per-iteration Time Complexity

Table 4.2 Comparisons of different first-order methods for optimizing mixed-norm based overlapping-group-lasso penalties.

Table 4.3 Comparisons of different methods for optimizing graph-guided fused lasso

Table 4.4 Comparison of Per-iteration Time Complexity for Multi-task Regression

Table 4.5 Comparisons of different optimization methods on the overlapping group lasso

Table 6.1 Summary for different stochastic gradient algorithms. V is short for V(x*, x⁽⁰⁾); AC for “accelerated”; M for “multi-stage” and NA stands for either “not applicable” or “no analysis of the rate”.

Table 6.2 Comparisons for different algorithms in objective value and F1-score for solving Lasso problem.

Table 6.3 Comparisons for different algorithms in objective value and F1-score for solving Elastic-net problem.

Table 6.4 The statistics of the experimental datasets.

Table 6.5 Experimental results for MNIST in terms of objective value, density of the final solution and testing error.

Table 6.6 Experimental results for 20-newsgroup in terms of objective value, density of the final solution and testing error.

Table 7.1 Comparisons of scalability and efficiency of Mirror-Prox and EG with CVX

Table 7.2 Some information of the limit order data.

Table 7.3 A snapshot of the limit order book of DAL at 9:45 AM on July 2nd, 2010.

Table 7.4 The mean of trading cost

Table 7.5 The standard deviation of trading cost

Table 7.6 The CVaR_{66.7%} of trading cost

Table 8.1 CPU times and objective values by different Mirror-Prox algorithms when T = 5 and |c(ν)| = 3.

Table 8.2 CPU times and objective values by different Mirror-Prox algorithms when T = 8 and |c(ν)| = 5.


Part I

THESIS OVERVIEW

1 THESIS OVERVIEW

1.1 Motivation and Statement

With the development of modern digital technology, data of unprecedented size and dimension are generated and collected every day in business, finance, health care, energy and many other areas. To understand and extract useful knowledge from these massive data, a variety of statistical learning models have been developed, some of which can be formulated as optimization problems.

However, as the size and the dimension of the data increase, the size of the resulting optimization problems quickly renders traditional methods impractical due to excessive memory or computational requirements. A similar challenge is faced when trying to find the optimal strategy of a multistage decision making problem built on massive historical data. In such a problem, the number of scenarios, and thus of decision variables, increases very quickly as the number of stages grows, also leading to a large-scale optimization problem. To address these challenges from different angles for various applications, this thesis develops algorithms with low memory requirements for large-scale optimization using homotopy, smoothing, data sampling and other problem-specific techniques.

1.2 Thesis Overview

This thesis focuses on addressing the computational challenge in both the learning problems over high-dimensional large-scale data and the decision making problems with multiple stages. In particular, we consider the following two main problems:

• Structured regression:

As one of the most important subjects in statistical machine learning, regression analysis has been used to analyze different types of data sets to discover the relationship between input and response variables. When a regression model contains hundreds of thousands of input variables, it is important to identify the most relevant ones, so as to reduce the dimension of the original regression model and focus on the most important covariates.

This variable selection problem can be solved by enforcing a sparse structure in the regression coefficients so that the less relevant variables have a zero coefficient and can be removed from the model. An efficient approach to achieving sparsity is to formulate the regression as a joint minimization of a fitness loss function and a sparsity-inducing term, e.g., the ℓ₁-norm, of the regression coefficients. Depending on the prior knowledge about the data, we may want to enforce different kinds of structure on the regression coefficients, such as group sparsity, low rank and group-wise constancy. Like sparsity, these types of structure can be obtained by minimizing a fitness loss function jointly with the corresponding structure-inducing terms. Minimization problems of this kind are called structured regressions.

We propose efficient proximal gradient methods that address the computational challenges of different structured regression problems [81, 27, 28].

We further adapt the proposed algorithms to stochastic settings for data-intensive or online applications; the resulting stochastic optimization algorithm [28] achieves a uniformly optimal convergence rate.

• Trade execution problem:

We study the problem of finding an optimal strategy for selling a large portfolio of assets in an illiquid market by minimizing a coherent dynamic risk of the transaction cost [80]. Two different dynamic models of the asset prices are considered: a discrete-time Markov process model and a non-Markovian scenario tree model. In both cases, the prices can be perturbed by either temporary or permanent impacts related to the trading speed, which is the main source of transaction costs. The goal is to compute the optimal trading speed in order to reach the best trade-off between the large cost of fast trading and the large uncertainty of slow trading.

We show that this optimal execution problem is equivalent to a saddle-point problem, and propose efficient first-order methods to compute the optimal strategy numerically [80].

1.3 Main Results and Organization

1.3.1 Part II: Background

In Part II, we briefly review some background on structured regression, including the choices of fitness loss functions, the choices of structure-inducing terms and their multi-task extensions. We also present some basics of first-order optimization methods for structured regressions as well as their stochastic versions based on stochastic gradients.

1.3.2 Part III: First-Order Methods for Structured Regressions

The main challenge in solving structured regressions lies in their high dimensionality. In Part III, we present techniques to tackle this challenge that belong to the family of first-order methods. As the name suggests, these methods only utilize gradient or sub-gradient information from the objective function. Compared to other optimization methods, the biggest advantage of a first-order method is that it mainly needs matrix-vector multiplications in each iteration, so it requires very little memory and scales to large instances in practice. Indeed, for problems with tens of thousands of variables or more, first-order methods are currently among the best choices. We present our main contributions in the following four chapters.

1.3.2.1 Accelerated Gradient Homotopy Method for Lasso

In Chapter 3, we consider a particular structured regression problem, namely the ℓ₁-norm regularized least-squares problem, also known as the Lasso problem. Existing first-order methods for Lasso, e.g., FISTA [12], can find an ε-optimal solution using O(1/√ε) iterations, which is well known to be optimal if one can only use black-box gradient information [98]. However, for a specific problem like Lasso, where we have more structural information than just the gradient, there is still potential to obtain a better iteration complexity using first-order methods. We derive a novel gradient homotopy method [81] that can find an ε-optimal solution with only O(log(1/ε)) iterations under the restricted eigenvalue assumption. This improves on the former complexity bound O(1/√ε) and is the best complexity result in the literature so far.

The idea we used is based on the local strong convexity property of the Lasso problem. Using a homotopy technique, which is a multistage warm-start method, our algorithm always keeps the path of the solutions within a sparse subspace where the original non-strongly convex objective function becomes strongly convex. The strong convexity of the objective function helps accelerate the first-order method so that the overall complexity of our gradient homotopy method is improved.
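The multistage warm-start idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of Chapter 3: it uses a plain (non-accelerated) proximal gradient inner solver with a fixed step size, and the names `lasso_homotopy`, `eta` (the factor by which the regularization parameter shrinks per stage) and `delta` (the relative inner tolerance) are hypothetical choices made for this sketch.

```python
import numpy as np


def soft_threshold(v, tau):
    """Closed-form proximal mapping of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def prox_grad(A, b, lam, x0, step, tol, max_iter=5000):
    """Proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1, warm-started at x0."""
    x = x0.copy()
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)  # two matrix-vector products per iteration
        x_new = soft_threshold(x - step * grad, step * lam)
        # Stop when the norm of the gradient mapping (x - x_new)/step is small.
        if np.linalg.norm(x - x_new) / step <= tol:
            return x_new
        x = x_new
    return x


def lasso_homotopy(A, b, lam_target, eta=0.7, delta=0.2, tol=1e-6):
    """Solve a sequence of Lasso problems with decreasing lam (assumed schedule)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L with L = ||A||_2^2
    lam = np.max(np.abs(A.T @ b))           # x = 0 is optimal for lam >= this value
    x = np.zeros(A.shape[1])
    while eta * lam > lam_target:
        lam *= eta
        # Intermediate stages are solved only roughly; each one warm-starts the next.
        x = prox_grad(A, b, lam, x, step, tol=delta * lam)
    # The final stage is solved to the requested accuracy.
    return prox_grad(A, b, lam_target, x, step, tol=tol)
```

Starting from the largest regularization parameter, for which the zero vector is optimal, and decreasing it gradually is what tends to keep the intermediate iterates sparse in this sketch; the stopping rule based on the ℓ₂-norm of the gradient mapping is a simplification chosen here for brevity.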

1.3.2.2 Smoothing Proximal Gradient Method for Structured Sparse Regression

In Chapter 4, we consider a challenge faced by all first-order methods when the structure-inducing term in the structured regression becomes more sophisticated. Take FISTA [12] as an example. In each iteration, FISTA needs to solve a proximal mapping sub-problem, which is itself a minimization problem. When the structure-inducing term is simple, e.g., the ℓ₁-norm, the proximal mapping sub-problem has a closed-form solution, which guarantees an efficient implementation of FISTA. However, in general cases, e.g., overlapping group Lasso [61] and graph-guided fused Lasso [69], the structure-inducing terms are more elaborate and non-smooth. In these cases, the proximal mapping sub-problem cannot be solved so easily. This essentially prevents the use of most first-order methods.

One alternative is to use sub-gradient methods, but this would require O(1/ε²) iterations to find an ε-optimal solution, which is not efficient.

To design an efficient first-order method for general structured regressions, we present a technique based on a key observation: These sophisticated structure inducing terms can often be represented as a linear maximization over a compact set. As a result, the structured regression problem can be formulated as a saddle-point problem. Using a technique by Nesterov [104] for saddle-point problems, we construct a smooth approximation of the structure inducing term.

This approximation yields a closed-form solution of the proximal mapping sub-problem, just as in the case of the ℓ₁-norm. Incorporating this smoothing technique into an accelerated gradient method, we propose a smoothing proximal gradient method [27], which solves structured regressions in O(1/ε) iterations, an improvement over the O(1/ε²) complexity of sub-gradient methods.
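To make the smoothing step concrete, the sketch below applies it to one simple instance, assumed here purely for illustration: the penalty ‖Cβ‖₁, written as the linear maximization max over ‖α‖∞ ≤ 1 of αᵀCβ, with C a first-difference matrix as in a chain-structured fused penalty. Subtracting (µ/2)‖α‖² inside the maximization gives a smooth function whose maximizer and gradient are available in closed form. The function name `smoothed_penalty` and the choice of C are hypothetical.

```python
import numpy as np


def smoothed_penalty(C, beta, mu):
    """Smooth approximation of ||C beta||_1 = max_{||a||_inf <= 1} a^T C beta.

    Returns the value and gradient of
        f_mu(beta) = max_{||a||_inf <= 1} a^T C beta - (mu / 2) * ||a||^2.
    """
    z = C @ beta
    alpha = np.clip(z / mu, -1.0, 1.0)  # closed-form maximizer (a projection onto the box)
    value = alpha @ z - 0.5 * mu * (alpha @ alpha)
    grad = C.T @ alpha                  # gradient is Lipschitz with constant ||C||_2^2 / mu
    return value, grad


# Example: first-difference matrix on 6 coefficients (an assumed chain graph).
C = np.diff(np.eye(6), axis=0)
beta = np.array([0.0, 0.0, 1.0, 1.0, 1.0, -2.0])
value, grad = smoothed_penalty(C, beta, mu=0.1)
exact = np.abs(C @ beta).sum()
# The approximation error is at most mu * (number of rows of C) / 2.
print(exact - value)
```

In this instance the smoothed function never exceeds the original penalty and undershoots it by at most µk/2, where k is the number of rows of C, so µ trades approximation accuracy against the Lipschitz constant of the gradient.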

1.3.2.3 Estimate Sequences and Accelerated Proximal Gradient Methods

In Chapter 5, we give a unified geometric analysis of different accelerated proximal gradient methods proposed in the literature, including [35, 37, 55, 77, 99, 73, 74, 51, 63]. We show that, although these methods use different updating rules, they are essentially running the same procedure, that is, constructing a so-called estimate sequence [103] for the objective function and using it to push the iterates towards optimality. Based on this observation, we propose several new estimate sequences, all of which lead to new accelerated proximal gradient algorithms that achieve the optimal convergence rate. The framework we provide covers the first-order methods in the survey by Tseng [142] and more. In particular, our analysis holds for both non-strongly convex and strongly convex optimization.

1.3.2.4 Optimal Regularized Dual Averaging Methods for Stochastic Optimization

In Chapter 6, we develop a stochastic first-order method [28] for structured regression by incorporating data sampling schemes into gradient computation.

When the size of the data goes beyond the storage capacity of the machine, computing the exact gradient of the objective function may become very time consuming or impossible. To solve structured regression problems over such data sets, in each iteration, we construct a stochastic gradient using only a small random sample of data that fits into memory. Despite the random noise in the stochastic gradients, our algorithms converge at a rate that is “uniformly optimal” [98]. That is, the optimal convergence rate is obtained with the same algorithm whether the problem is deterministic or stochastic. Compared to previous stochastic first-order methods, our method updates the solution using all historical stochastic gradients instead of just the current one. Hence, our method has a high tolerance to gradient noise and consequently outperforms other methods numerically when few data points are available to construct the stochastic gradient.
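The two ingredients mentioned above, a stochastic gradient built from a small random sample and an update driven by the average of all past stochastic gradients, can be illustrated with a basic regularized dual averaging step for the ℓ₁-regularized least-squares loss (1/(2m))‖Ax − b‖² + λ‖x‖₁. This is a simplified, non-accelerated sketch and not the ORDA method of Chapter 6; the function names, the parameter `gamma` and the 1/√t scaling of the quadratic term are assumptions of this example.

```python
import numpy as np


def minibatch_gradient(A, b, x, batch_idx):
    """Unbiased estimate of the gradient of (1/(2m))*||Ax - b||^2 from the rows in batch_idx."""
    A_s, b_s = A[batch_idx], b[batch_idx]
    return A_s.T @ (A_s @ x - b_s) / len(batch_idx)


def rda_l1_step(g_bar, t, lam, gamma):
    """Closed-form argmin_x <g_bar, x> + lam*||x||_1 + (gamma / (2*sqrt(t))) * ||x||^2."""
    shrunk = np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
    return -(np.sqrt(t) / gamma) * shrunk


def rda_lasso(A, b, lam, gamma=1.0, batch_size=10, n_iter=500, seed=0):
    """Basic dual averaging loop: only batch_size rows of A are touched per iteration."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    g_bar = np.zeros(n)
    for t in range(1, n_iter + 1):
        idx = rng.choice(m, size=batch_size, replace=False)
        g = minibatch_gradient(A, b, x, idx)
        g_bar += (g - g_bar) / t  # running average of all past stochastic gradients
        x = rda_l1_step(g_bar, t, lam, gamma)
    return x
```

Because the update thresholds the averaged gradient rather than the current one, coordinates whose averaged gradient stays below λ in magnitude are set exactly to zero in this sketch.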

1.3.3 Part IV: First-Order Methods for Multistage Decision Problems

Multistage stochastic decision models and the related optimization techniques are useful tools for decision making in environments with uncertainty. One of the challenges in solving a multistage stochastic optimization problem is the exponential growth of the problem size as the number of stages increases. For such an optimization problem, a first-order method is attractive due to its low memory requirement and good scalability. Although the efficiency of first-order methods has been demonstrated for solving large-scale problems in machine learning, signal processing and statistics, they have not received an equal amount of attention for solving multistage stochastic optimization problems. To contribute in this direction, in Part IV, we present first-order methods for multistage decision making problems under a dynamic coherent risk measure, with an application to the portfolio trade execution problem [80]. Our contributions in this part are presented in the following two chapters.

1.3.3.1 Optimal Trade Execution with Coherent Dynamic Risk Measures

In Chapter7, we apply first-order methods for computing the optimal strategy for the trade execution problem, which is a multistage decision making problem with a large number of decision variables. Trading a large portfolio in an illiquid market creates an adverse impact on the asset prices, resulting in big transaction costs. One of the tasks of algorithmic trading is to strategically exe- cute the portfolio transaction with the goal of reducing transaction costs. Due to the uncertainty of the market, the transaction cost may become very large and even unbearable to the trader in the worst case. In order to reduce the chance of running into such a bad scenario, a risk-averse strategy is usually preferred in practice. This is obtained by minimizing a risk measure of the transaction cost. Different risk measures have been considered in the literature. Unfortu- nately, their corresponding optimization problems either are computationally intractable or lead to a time inconsistent strategy.

We propose an execution strategy [80] that minimizes a coherent dynamic risk measure [127]. This risk measure evaluates the risk of the transaction cost by sequential certainty equivalence, so that it represents the rationale of a risk-averse decision maker in a multistage setting. We prove that the optimal strategy under this dynamic risk measure is always static and time consistent. The static nature of the optimal strategy allows us to reformulate the trade execution problem as a saddle-point problem, which can be solved by a primal-dual first-order method due to Nemirovski [97]. This method computes an $\epsilon$-optimal strategy numerically with complexity $O(1/\epsilon)$.

1.3.3.2 Dynamic Coherent Risk Minimization over Scenario Tree

In Chapter 8, we further explore the efficiency of first-order methods for the trade execution problem in a more general setting. Unlike Chapter 7, where the price process is assumed to be Markovian with stage-wise independent price increments, in this chapter the price process is modeled as a general scenario tree, so that the price increments can be serially dependent. Due to this change in the model, the number of variables in the corresponding saddle-point problem increases exponentially with the number of stages. We show that the trade execution problem can still be formulated as a saddle-point problem, which we can solve by an extension of the first-order method in Chapter 7. At the heart of our approach is a careful construction of the distance generating functions for the primal and dual feasible sets, which enables the proximal mapping sub-problems to be solved efficiently.


1.3.4 Part V: Conclusions and Future Work

The conclusions of the thesis and future directions are provided in the last part of the thesis.


Part II

BACKGROUND


2

BACKGROUND

2.1 Structured Regression

In a high-dimensional regression problem, our goal is to predict the output $b \in \mathbb{R}$ from a high-dimensional input vector $a \in \mathbb{R}^n$.¹ Given a data set of $m$ input/output pairs $\{a_i, b_i\}$ for $i = 1, \dots, m$, let $b = (b_1, \dots, b_m)^T$ denote the vector of outputs and $A = (a_1, \dots, a_m)^T$ denote the $m \times n$ matrix of inputs of the $m$ samples. A structured regression problem can be formulated as the following convex optimization problem:

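$$\min_{x \in \mathbb{R}^n} \phi(x) \equiv f(x) + \Psi(x) = \sum_{i=1}^{m} \ell(b_i, a_i^T x) + \Psi(x), \tag{2.1}$$

where $\ell : \mathbb{R}^2 \to \mathbb{R}$ is a convex differentiable fitness loss function. Typically, $f(x)$ is assumed to be a convex differentiable function with a Lipschitz continuous gradient, i.e., there exists a constant $L_f$ such that, for all $x, y \in \mathbb{R}^n$,

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L_f \|x - y\|_2, \tag{2.2}$$

or equivalently,

$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L_f}{2} \|x - y\|_2^2. \tag{2.3}$$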

The second term $\Psi(x)$ in (2.1) is a structured penalty, which is usually convex and non-smooth and is added to enforce some structural property in the optimal solution $x^\star$ of problem (2.1). We say that the loss function $f(x)$ in (2.1) is strongly convex if there exists a constant $\mu_f > 0$ (the convexity parameter) such that

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu_f}{2} \|x - y\|_2^2. \tag{2.4}$$

Typical examples of fitness loss functions include:

1. The squared loss for linear regression:

$$\ell(b_i, a_i^T x) = \frac{1}{2}(b_i - a_i^T x)^2, \tag{2.5}$$

and the corresponding $f(x) = \frac{1}{2}\|Ax - b\|_2^2$;

2. The logistic loss for classification problems, with $b_i \in \{-1, +1\}$:

$$\ell(b_i, a_i^T x) = \log(1 + \exp(-b_i a_i^T x)). \tag{2.6}$$
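As a concrete illustration of the two losses in (2.5) and (2.6), the following NumPy sketch evaluates $f(x)$ and its gradient for a data matrix `A` and an output vector `b`. The function names are chosen here for illustration only and do not come from the thesis.

```python
import numpy as np

def squared_loss(A, b, x):
    """Return f(x) = 0.5 * ||Ax - b||_2^2 and its gradient A^T (Ax - b)."""
    r = A @ x - b
    return 0.5 * (r @ r), A.T @ r

def logistic_loss(A, b, x):
    """Return f(x) = sum_i log(1 + exp(-b_i a_i^T x)) and its gradient.

    Labels b_i are assumed to lie in {-1, +1}.
    """
    z = -b * (A @ x)                       # z_i = -b_i a_i^T x
    value = np.sum(np.logaddexp(0.0, z))   # stable log(1 + exp(z_i))
    sigma = 0.5 * (1.0 + np.tanh(0.5 * z)) # sigmoid(z_i), written stably
    grad = A.T @ (-b * sigma)              # chain rule: sigmoid(z_i) * dz_i/dx
    return value, grad
```

Both functions return the pair (value, gradient), which is all that the first-order methods of Section 2.3 need from the smooth part $f$.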

1 Unless indicated otherwise, all vectors in this thesis are assumed to be column vectors.


The structured penalty Ψ(x) is chosen according to the prior information about the regression model or the data set. In the following, we briefly overview some widely used structured penalties.

1. $\ell_1$-norm Penalty

The most widely used structured penalty is the $\ell_1$-norm of the solution $x$, i.e.,

$$\Psi(x) \equiv \lambda \|x\|_1 = \lambda \sum_{i=1}^{n} |x_i|. \tag{2.7}$$

This function is well known for its ability to enforce a sparse optimal solution $x^\star$ for problem (2.1) [137]. Here, $\lambda$ is a positive constant called the regularization parameter. When $f(x)$ is the squared loss function and $\Psi(x)$ is the $\ell_1$-norm of $x$, problem (2.1) becomes the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem, also known as the Lasso problem, which is applied to select the most important input variables in a regression model. In the last decade, the $\ell_1$-regularized regression problem has been studied in numerous ways, from the theoretical perspective [158, 146, 16, 156] to efficient computational methods [40, 107, 44, 149, 12]. Please refer to [50] (Chapter 18) for an in-depth discussion of $\ell_1$-regularized methods.

2. Group Lasso Penalty with Disjoint Groups

In some situations, the variables are partitioned into groups and it is desirable to select all or none of the variables within a group [153]. A typical example arises with categorical data: each variable can be expressed via a group of dummy variables, and one should conduct variable selection at the group level instead of the individual level. To achieve group-level variable selection, one can adopt the $\ell_1/\ell_q$ mixed-norm based group Lasso penalty with any $q > 1$ [153]:

$$\Psi(x) \equiv \gamma \sum_{g \in \mathcal{G}} w_g \|x_g\|_q \equiv \gamma \sum_{g \in \mathcal{G}} w_g \Big( \sum_{i \in g} |x_i|^q \Big)^{1/q}, \tag{2.8}$$

where $\mathcal{G}$ denotes a partition of $\{1, \dots, n\}$; $x_g \in \mathbb{R}^{|g|}$ is the subvector of $x$ consisting of the variables in group $g$; $w_g$ is the predefined weight for group $g$; and $\|\cdot\|_q$ is the vector $\ell_q$-norm. This $\ell_1/\ell_q$ mixed-norm penalty plays the role of jointly setting all of the coefficients within each group to zero or non-zero values. In practice, the $\ell_1/\ell_2$ and $\ell_1/\ell_\infty$ norms are the most popular. Theoretically, it has been demonstrated that if the group structure is consistent with the true sparsity pattern, the $\ell_1/\ell_2$ group Lasso penalty has the potential to improve the accuracy of the estimator [57].
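The penalty in (2.8) is straightforward to evaluate once the groups are given as index lists. The sketch below is one minimal way to do so; the function name and the list-of-lists representation of the partition are assumptions made here for illustration.

```python
import numpy as np

def group_lasso_penalty(x, groups, weights, gamma, q=2):
    """Evaluate gamma * sum_g w_g * ||x_g||_q as in (2.8).

    groups  : list of index lists, assumed to partition range(len(x))
    weights : one positive weight w_g per group
    q       : any value > 1, or np.inf for the l1/l_inf penalty
    """
    total = 0.0
    for g, w in zip(groups, weights):
        total += w * np.linalg.norm(x[list(g)], ord=q)
    return gamma * total
```

A group whose coefficients are all zero contributes nothing to the sum, which is what allows whole groups to be switched off jointly.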

3. Group Lasso Penalty with Overlapping Groups

To model more complex group structures, the groups in $\mathcal{G}$ in (2.8) can be allowed to overlap [61]. In other words, the coefficient for a variable $x_j$ can appear in different $\ell_q$-norms. Due to the shrinkage property of the $\ell_q$-norm, the penalty in (2.8) will set some $x_g$ to zero. If we denote the set of groups $g$ with $x_g = 0$ by $\mathcal{G}_0 \subset \mathcal{G}$, the support of the solution $x$ satisfies

$$\operatorname{supp}(x) \subset \Big( \bigcup_{g \in \mathcal{G}_0} g \Big)^c, \tag{2.9}$$

where $(\cdot)^c$ denotes the complement. A commonly used special case of the general overlapping group structure is the hierarchical structure (e.g., a tree or a forest) [160]. Specifically, we assume that variables correspond to nodes of a tree and that a given variable is included in the model only if all its ancestors in the tree have already been selected. The general overlapping group structure has an important application in pathway selection for gene expression data. In more detail, a biological pathway is a group of genes that participate in a particular biological process to perform a certain function in a cell. To find the controlling factors related to a disease, it is more meaningful to study the genes by considering their pathways. We will discuss this application further in Chapter 4.

4. Chain-structured Fused Lasso Penalty

If the variables are ordered in some meaningful way (e.g., on a timeline), we can learn a piecewise constant coefficient vector by using the (chain-structured) fusion penalty [138]. The fusion penalty, which is the $\ell_1$-norm of the coefficients' successive differences, takes the following form:

$$\Psi(x) = \gamma \sum_{i=1}^{n-1} |x_{i+1} - x_i|. \tag{2.10}$$

It has been widely applied to hot-spot detection for comparative genomic hybridization (CGH) data [140] and time-series data analysis [71].
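For reference, the fusion penalty in (2.10) can be computed with a single call to `np.diff`. This is only a sketch of how the penalty is evaluated; the function name is chosen here for illustration.

```python
import numpy as np

def fused_lasso_penalty(x, gamma):
    """Evaluate gamma * sum_{i=1}^{n-1} |x_{i+1} - x_i| as in (2.10)."""
    return gamma * np.sum(np.abs(np.diff(x)))
```

A piecewise constant vector pays only for its jumps: for example, `x = [1, 1, 1, 3, 3, 0]` has two jumps, of sizes 2 and 3, so the penalty is `5 * gamma`.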

5. Graph-guided Fused Lasso Penalty

The work in [69] extends the chain-structured fusion penalty in (2.10) to a more general graph-guided fusion penalty. The graph-guided fusion penalty can encode prior knowledge about the structural constraints over features in the form of pairwise relatedness described by a graph $G \equiv (V, E)$, where $V = \{1, \dots, n\}$ denotes the variables of interest and $E$ denotes the set of edges among $V$. Additionally, we let $r_{ml} \in \mathbb{R}$ denote the weight of the edge $e = (m, l) \in E$, corresponding to the correlation or another suitable similarity measure between features $m$ and $l$. The graph-guided fusion penalty, which encourages the coefficients of related variables to share similar magnitudes, is defined as follows:

$$\Psi(x) = \gamma \sum_{(m,l) \in E,\; m < l} \tau(r_{ml})\, |x_m - \operatorname{sign}(r_{ml})\, x_l|, \tag{2.11}$$

where $\tau(r_{ml})$ represents a general weight function that enforces a fusion effect over the coefficients $x_m$ and $x_l$ of relevant features. It can be any monotonically increasing function of the absolute values of the correlations; the most popular examples include $\tau(r) = |r|$ and $\tau(r) = |r|^2$. The term $\operatorname{sign}(r_{ml})$ in (2.11) ensures that two positively correlated inputs tend to influence the output in the same direction, whereas two negatively correlated inputs have opposite effects. Since the fusion effect is calibrated by the edge weight, the graph-guided fusion penalty in (2.11) encourages highly inter-correlated inputs corresponding to a densely connected subnetwork in $G$ to be jointly selected as relevant.

It is noteworthy that when $r_{ml} = 1$ for all $e = (m, l) \in E$ and $G$ is simply a chain over the nodes, the graph-guided fusion penalty reduces to the chain-structured fusion penalty in (2.10).
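The graph-guided penalty in (2.11) can be evaluated by looping over the edge list. In the sketch below, the edge list, the weight vector `r` and the default choice $\tau(r) = |r|$ are illustrative assumptions; any monotonically increasing function of $|r|$ could be passed as `tau`.

```python
import numpy as np

def graph_fusion_penalty(x, edges, r, gamma, tau=np.abs):
    """Evaluate gamma * sum_{(m,l) in E} tau(r_ml) * |x_m - sign(r_ml) x_l|.

    edges : list of index pairs (m, l) with m < l
    r     : edge weights r_ml, in the same order as `edges`
    tau   : weight function applied to each r_ml (default |r|)
    """
    total = 0.0
    for (m, l), r_ml in zip(edges, r):
        total += tau(r_ml) * abs(x[m] - np.sign(r_ml) * x[l])
    return gamma * total
```

With a chain graph and all weights equal to one, this returns the same value as the chain-structured penalty (2.10), matching the remark above.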

One of the properties shared by all of the structured penalties listed above is that they can be formulated as

$$\Psi(x) = \max_{y \in Q}\; y^T C x, \tag{2.12}$$

where $Q$ is a convex and compact set. This property is the key to deriving the smoothing proximal gradient algorithm in Chapter 4.
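To make (2.12) concrete for two of the penalties above, one possible choice of $Q$ and $C$ is sketched below. It relies on the standard dual-norm identity $\|z\|_1 = \max_{\|y\|_\infty \le 1} y^T z$ and is an assumption of this example, not necessarily the construction used in Chapter 4: $Q$ is the unit $\ell_\infty$ ball, with $C = \lambda I$ for the $\ell_1$-norm penalty (2.7) and $C = \gamma D$ for the fusion penalty (2.10), where $D$ is the $(n-1) \times n$ successive-difference matrix.

```python
import numpy as np

def max_over_linf_ball(C, x):
    """Return max_{||y||_inf <= 1} y^T C x and a maximizer y = sign(Cx)."""
    z = C @ x
    y = np.sign(z)          # each coordinate of y picks the sign of z_i
    return y @ z, y

def l1_matrix(n, lam):
    """C = lam * I, so that the maximum equals lam * ||x||_1."""
    return lam * np.eye(n)

def fusion_matrix(n, gamma):
    """C = gamma * D, where (Dx)_i = x_{i+1} - x_i."""
    return gamma * np.diff(np.eye(n), axis=0)
```

In both cases $Q$ is convex and compact, as required, and the maximum is attained at a vertex of the ball.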

2.2 Multi-task Structured Regression

In multi-task learning, we are interested in learning multiple related tasks jointly by analyzing data from all of the tasks at the same time instead of considering each task individually [136,25,152,155,111]. When data are scarce, it is greatly advantageous to borrow the information in the data from other related tasks to learn each task more effectively. More specifically, we consider the multi-task sparse regression problem, where each task is to learn a functional mapping from a high-dimensional input space to a continuous-valued output space and only a small number of inputs are relevant to the output. In multi-task regression, it is often assumed that parameters for different tasks share the same sparsity pattern [143,8, 112]; and the task of conducting variable selection can be achieved via learning the joint sparsity pattern of parameters.

For simplicity of illustration, we assume that all tasks share the same input matrix. Let $A \in \mathbb{R}^{m \times n}$ denote the matrix of input data for $n$ inputs and $B \in \mathbb{R}^{m \times k}$ denote the matrix of output data for $k$ outputs over $m$ samples. We assume a linear regression model for the $j$-th output: $B_j = A X_j + \epsilon_j$, $\forall j = 1, \dots, k$, where $B_j \in \mathbb{R}^m$ is the $j$-th column of $B$, $X_j = [X_{1j}, \dots, X_{nj}]^T \in \mathbb{R}^n$ is the regression coefficient vector for the $j$-th output and $\epsilon_j$ is a Gaussian noise vector.

Let $X = [X_1, \dots, X_k] \in \mathbb{R}^{n \times k}$ be the matrix of regression coefficients for all of the $k$ outputs. Then, the multi-task (or multivariate-response) structured regression problem can be naturally formulated as the following optimization problem:

$$\min_{X \in \mathbb{R}^{n \times k}} \phi(X) \equiv f(X) + \Psi(X) = \frac{1}{2} \|B - AX\|_F^2 + \Psi(X), \tag{2.13}$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm and $\Psi(X)$ is a structured sparsity-inducing penalty with a structure over the outputs.


A popular approach is to adopt a joint sparsity regularization to encourage sparsity across all tasks. In particular, one can adopt the $\ell_1/\ell_q$ mixed-norm penalty with $q > 1$ [8, 112, 96]:

$$\Psi(X) = \gamma \|X\|_{q,1} = \gamma \sum_{i=1}^{n} \|X^i\|_q, \tag{2.14}$$

where $X^i = (X_{i1}, X_{i2}, \dots, X_{ik}) \in \mathbb{R}^k$ is the $i$-th row of $X$ and $\gamma$ is a positive regularization parameter. However, this penalty cannot incorporate a complex structure in which the outputs themselves are correlated. In many applications, it is advantageous to utilize the prior structural information among the outputs to guide the variable selection. To do this, one can extend the structured penalties introduced in Section 2.1 and obtain the following three most widely used structured penalties for multi-task regression.

1. $\ell_1$-norm Penalty in Multi-task Regression

Similar to the $\ell_1$-norm penalty in the single-task regression case, the $\ell_1$-norm penalty in the multi-task setting is defined as

$$\Psi(X) \equiv \lambda \|X\|_1 = \lambda \sum_{i=1}^{n} \sum_{j=1}^{k} |X_{ij}|. \tag{2.15}$$

2. Overlapping Group Lasso Penalty in Multi-task Regression

We define the group Lasso penalty for a structured multi-task regression as follows:

$$\Psi(X) \equiv \gamma \sum_{i=1}^{n} \sum_{g \in \mathcal{G}} w_g \|X_{ig}\|_q, \tag{2.16}$$

where $\mathcal{G} = \{g_1, \dots, g_{|\mathcal{G}|}\}$ is a subset of the power set of $\{1, \dots, k\}$ and $X_{ig}$ is the vector of regression coefficients corresponding to the outputs in group $g$: $\{X_{ij},\, j \in g\}$. The $\ell_1/\ell_q$ mixed-norm penalty for multi-task regression in (2.14) is a special case of (2.16) where $\mathcal{G}$ has only one group $g = \{1, \dots, k\}$. The tree-structured group Lasso penalty introduced in [68] is also a special case of (2.16).

3. Graph-guided Fused Lasso Penalty in Multi-task Regression

Assuming that a graph structure over the $k$ outputs is given as $G$, with a set of nodes $V = \{1, \dots, k\}$, each corresponding to an output variable, and a set of edges $E$, the graph-guided fusion penalty for a structured multi-task regression is given as:

$$\Psi(X) = \gamma \sum_{e = (m,l) \in E} \tau(r_{ml}) \sum_{i=1}^{n} |X_{im} - \operatorname{sign}(r_{ml})\, X_{il}|. \tag{2.17}$$
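The multi-task penalties (2.14) and (2.17) act on the coefficient matrix $X$ row by row and output by output. The following sketch shows how both can be evaluated with NumPy; the function names, the edge-list representation and the default $\tau(r) = |r|$ are assumptions made for illustration.

```python
import numpy as np

def mixed_norm_penalty(X, gamma, q=2):
    """Evaluate gamma * sum_i ||X^i||_q over the rows X^i of X, as in (2.14)."""
    return gamma * np.sum(np.linalg.norm(X, ord=q, axis=1))

def multitask_graph_fusion_penalty(X, edges, r, gamma, tau=np.abs):
    """Evaluate (2.17): fuse the columns of X along the edges of the output graph.

    edges : list of output index pairs (m, l)
    r     : edge weights r_ml, in the same order as `edges`
    """
    total = 0.0
    for (m, l), r_ml in zip(edges, r):
        total += tau(r_ml) * np.sum(np.abs(X[:, m] - np.sign(r_ml) * X[:, l]))
    return gamma * total
```

A row of zeros in `X` contributes nothing to (2.14), which corresponds to an input that is irrelevant to every output.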


2.3 First-order Methods

The aforementioned sparse regression problem (2.1) is a convex optimization problem. Two traditional generic solvers are (1) the subgradient descent method and (2) the interior point method (IPM). The subgradient method converges very slowly, requiring $O(1/\epsilon^2)$ iterations, where $\epsilon$ is the desired optimality gap, and the obtained solutions are usually not sparse. For all the structured penalties considered here, the corresponding regression problem can always be cast as a semidefinite program or one of its simpler special forms (e.g., a second-order cone program (SOCP) or a quadratic program (QP)) and solved by an IPM. Although an IPM has a logarithmic iteration complexity of $O(\log(1/\epsilon))$, it is computationally prohibitive for problems of even moderate size due to its $O(n^3)$ complexity in each iteration.

Due to the separability of some non-smooth penalties (e.g., the $\ell_1$-norm penalty), coordinate descent methods can be directly applied, where we optimize the objective over one variable (or a block of variables) at a time while keeping all others fixed [44]. Although coordinate descent has surprisingly good empirical performance, it is limited in that convergence cannot be guaranteed when the nondifferentiable terms are not separable [141].

Another class of optimization methods, accelerated proximal gradient (APG) methods have become increasingly popular in the past few years in the machine learning community. They enjoy optimal convergence rate under the first-order black-box model; and more importantly, since they only use the gradient information, they are much more scalable than second-order methods and hence more suitable for large-scale applications.

Although APG methods have many variations [12, 107, 104, 103, 102, 101, 142, 75, 32] (see [142] for a survey of different APG methods), all of them need to compute a so-called proximal mapping (or proximal operator, proximal problem, projection mapping, generalized gradient update) at each iteration. The proximal mapping at a point $y$ takes the following form:

$$T_L(y) = \arg\min_{x} \Big\{ f(y) + \nabla f(y)^T (x - y) + \frac{L}{2} \|x - y\|_2^2 + \Psi(x) \Big\}, \tag{2.18}$$

where $L$ is set to $L_f$ or determined by a line search procedure. The minimization problem in (2.18) is called the proximal mapping sub-problem. If $L$ is chosen to be $L_f$, the objective function of the minimization in (2.18) becomes an upper bound of $\phi(x) = f(x) + \Psi(x)$, since its smooth part upper-bounds $f(x)$ according to (2.3).

We also note that the term $\frac{1}{2}\|x - y\|_2^2$ in (2.18) can be replaced by any Bregman divergence between $x$ and $y$, denoted by $V(x, y)$, which is defined as

$$V(x, y) := \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y \rangle, \tag{2.19}$$

where $\omega(x)$ is a strongly convex and differentiable function.
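The definition (2.19) can be checked numerically for two common choices of $\omega$. The Euclidean choice is the one used in this section; the negative-entropy choice is a standard textbook example added here as an assumption for illustration (it is strongly convex on the probability simplex with respect to the $\ell_1$-norm and yields the Kullback-Leibler divergence).

```python
import numpy as np

def bregman_divergence(omega, grad_omega, x, y):
    """V(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>, as in (2.19)."""
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

# Euclidean distance generating function: V(x, y) = 0.5 * ||x - y||_2^2.
def half_sq_norm(x):
    return 0.5 * (x @ x)

def half_sq_norm_grad(x):
    return x

# Negative entropy on the simplex (entries assumed strictly positive):
# V(x, y) = sum_i x_i * log(x_i / y_i) when x and y both sum to one.
def neg_entropy(x):
    return np.sum(x * np.log(x))

def neg_entropy_grad(x):
    return np.log(x) + 1.0
```

Swapping the distance generating function changes the geometry of the proximal step without changing the structure of the sub-problem.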

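When $\omega(x) = \frac{1}{2}\|x\|_2^2$, so that $V(x, y) = \frac{1}{2}\|x - y\|_2^2$, the proximal mapping can be written as:

$$T_L(y) = \arg\min_{x} \; \frac{1}{2} \Big\| x - \Big( y - \frac{1}{L} \nabla f(y) \Big) \Big\|_2^2 + \frac{1}{L} \Psi(x). \tag{2.20}$$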


We note that if the non-smooth term $\Psi(x)$ is zero, then (2.20) simply reduces to the standard gradient descent update rule with step size $1/L$: $T_L(y) = y - \frac{1}{L} \nabla f(y)$. When $\Psi(x)$ is not zero but simple enough, the proximal mapping sub-problem admits a closed-form or exact solution. For example, when $\Psi(x) = \lambda \|x\|_1$, $T_L(y)$ has the closed-form solution

$$T_L(y) = \operatorname{shrink}\Big( y - \frac{1}{L} \nabla f(y), \; \frac{\lambda}{L} \Big), \tag{2.21}$$

where $\operatorname{shrink} : \mathbb{R}^n \times \mathbb{R}_+ \to \mathbb{R}^n$ is the well-known shrinkage or soft-thresholding operator, defined as

$$(\operatorname{shrink}(x, \alpha))_i = \operatorname{sgn}(x_i) \max\{ |x_i| - \alpha, 0 \}, \quad i = 1, \dots, n. \tag{2.22}$$

Starting at an initial solution $x^{(0)}$ to (2.1), the so-called proximal gradient (PG) method generates a sequence of solutions approaching optimality by iteratively applying the proximal mapping (2.18):

$$x^{(k+1)} = T_L(x^{(k)}), \quad k = 0, 1, 2, \dots. \tag{2.23}$$

To find an $\epsilon$-optimal solution $x^{(k)}$, i.e., a point $x^{(k)} \in \mathbb{R}^n$ such that $\phi(x^{(k)}) - \phi(x^\star) \le \epsilon$, the PG method requires $O(L_f/\epsilon)$ iterations (i.e., $k = O(L_f/\epsilon)$) when $\mu_f = 0$. When $\mu_f > 0$, that is, when $f(x)$ is strongly convex, the iteration complexity of the PG method is reduced to $O\big( \frac{L_f}{\mu_f} \log(1/\epsilon) \big)$.
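The PG iteration (2.23) with the closed-form proximal mapping (2.21) and (2.22) can be sketched for the Lasso problem as follows. This is a minimal illustration, assuming the squared loss $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, a fixed $L$ equal to the squared spectral norm of $A$ (which is a valid $L_f$ for this loss), a zero starting point and a fixed iteration count instead of a stopping test.

```python
import numpy as np

def shrink(x, alpha):
    """Soft-thresholding operator (2.22), applied coordinate-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def proximal_gradient_lasso(A, b, lam, num_iters=500):
    """Proximal gradient iteration (2.23) for 0.5*||Ax-b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2      # largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])           # x^(0) = 0 (an arbitrary choice)
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)       # gradient of the smooth part at x^(k)
        x = shrink(x - grad / L, lam / L)   # x^(k+1) = T_L(x^(k)), see (2.21)
    return x
```

When `A` is the identity matrix, a single iteration already returns `shrink(b, lam)`, which is the exact Lasso solution in that special case.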

Utilizing more updating sequences and a more sophisticated updating scheme than (2.23), APG methods find an $\epsilon$-optimal solution using only $O\big(\sqrt{L_f/\epsilon}\big)$ iterations when $\mu_f = 0$ and $O\big( \sqrt{L_f/\mu_f}\, \log(1/\epsilon) \big)$ iterations when $\mu_f > 0$. Notice that APG methods require fewer iterations than the PG method in both cases.
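To illustrate what "more updating sequences" can look like, the sketch below uses one well-known member of the APG family, a FISTA-style momentum scheme, applied to the Lasso problem. It is chosen here only as an example for the case $\mu_f = 0$; other APG variants, including those developed later in this thesis, use different updating schemes. The fixed step size $1/L$ and the fixed iteration count are simplifying assumptions.

```python
import numpy as np

def shrink(x, alpha):
    """Soft-thresholding operator (2.22)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def apg_lasso(A, b, lam, num_iters=500):
    """FISTA-style accelerated proximal gradient for the Lasso problem."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])               # main sequence x^(k)
    y = x.copy()                           # auxiliary (extrapolated) sequence
    t = 1.0                                # momentum parameter
    for _ in range(num_iters):
        # Proximal mapping (2.21) evaluated at the extrapolated point y.
        x_next = shrink(y - A.T @ (A @ y - b) / L, lam / L)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        # Extrapolate along the direction of the last step.
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x
```

Compared with the plain PG iteration, the only extra cost per iteration is the extrapolation step, which is a vector addition.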

Moreover, according to [103, 98], the iteration complexity achieved by APG methods is "optimal" under a black-box first-order oracle assumption in both cases, which means that if the only available information is the gradient $\nabla f(x)$, it is impossible to derive an algorithm for solving (2.1) with a better complexity than APG methods.

2.4 Stochastic First-order Methods

The fitness loss function $f(x)$ of the structured regression problem (2.1) is defined based on a finite data set of input/output pairs $\{a_i, b_i\}$ for $i = 1, \dots, m$.

As an alternative, we can also define the loss function on the underlying distribution of the data points. More specifically, the loss function $f(x)$ can also take the form

$$f(x) := \mathbb{E}_\xi [F(x, \xi)] = \int F(x, \xi)\, dP(\xi), \tag{2.24}$$

where $\xi$ is a random vector (data point) with distribution $P$. In a typical structured regression problem setting, $\xi$ represents the input/output pair $(a, b)$. We assume that for every random vector $\xi$, $F(x, \xi)$ is a convex and continuous function in $x$. Therefore, $f(x)$ is also convex. Furthermore, we also assume that $f(x)$ satisfies (2.3) and (3.5) as in the deterministic case.

One of the challenges of applying a first-order method to solve (2.1) with $f(x)$ given by (2.24) is that the gradient $\nabla f(x) = \mathbb{E}_\xi \nabla F(x, \xi) = \int \nabla F(x, \xi)\, dP(\xi)$ becomes computationally intractable for a high-dimensional $P$. Moreover, in most cases, the distribution $P$ is highly complicated or even unknown, so that the exact gradient information $\nabla f(x)$ is unavailable. To deal with this difficulty, a stochastic gradient $G(x, \xi)$ is constructed to approximate $\nabla f(x)$. For example, we can simply choose $G(x, \xi)$ to be $\nabla F(x, \xi)$ or its mini-batch version $\frac{1}{m} \sum_{i=1}^{m} \nabla F(x, \xi_i)$, where $\{\xi_1, \dots, \xi_m\}$ are drawn from $P$ independently. The algorithms utilizing these stochastic gradients to solve (2.1) are called stochastic first-order methods.

Another situation where a stochastic first-order method is a more suitable choice is when the data set in the regression problem is so large that the computation of $\nabla f(x)$ is time-consuming or impossible due to limited memory.

For example, in a linear regression problem, the gradient of $f(x)$ defined on the entire data set $\{a_i, b_i\}$ for $i = 1, \dots, m$ is $\nabla f(x) = A^T(Ax - b)$. If we draw a small random subset of the data, i.e., $\{a_i, b_i\}_{i \in S}$ with $S \subset \{1, 2, \dots, m\}$, a stochastic gradient can be given as $G(x, S) = \frac{m}{|S|} \sum_{i \in S} a_i (a_i^T x - b_i)$, whose computation may require much less memory than that of $\nabla f(x) = A^T(Ax - b)$.
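The subsampled gradient $G(x, S)$ described above can be sketched as follows. The function names, the uniform sampling without replacement and the use of a NumPy random generator are assumptions of this illustration.

```python
import numpy as np

def full_gradient(A, b, x):
    """Gradient of f(x) = 0.5 * ||Ax - b||_2^2 over the whole data set."""
    return A.T @ (A @ x - b)

def subset_gradient(A, b, x, S):
    """G(x, S) = (m / |S|) * sum_{i in S} a_i (a_i^T x - b_i)."""
    m = A.shape[0]
    S = np.asarray(S)
    A_S = A[S]                          # only the sampled rows are touched
    return (m / len(S)) * (A_S.T @ (A_S @ x - b[S]))

def minibatch_gradient(A, b, x, batch_size, rng):
    """Draw S uniformly without replacement and return G(x, S)."""
    S = rng.choice(A.shape[0], size=batch_size, replace=False)
    return subset_gradient(A, b, x, S)
```

The factor $m/|S|$ makes the estimate unbiased under uniform sampling: averaging $G(x, \{i\})$ over all single indices $i$ recovers $A^T(Ax - b)$ exactly.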

Research related to stochastic gradient methods dates back to Robbins and Monro’s stochastic approximation algorithm [122] in 1951, which was further developed by Polyak and Juditsky [118]. In the past few years, many stochastic first-order methods [35,37,55,77,99,73,74,51,63] have been applied to different stochastic optimization problems of the form (2.1). These methods enjoy low per-iteration complexity and the capability of scaling up to very large data sets.

However, a stochastic gradient unavoidably contains a certain level of noise, and hence a stochastic first-order method needs more iterations to achieve the same optimality gap as a deterministic method. In fact, to find a solution $x^{(k)}$ with $\mathbb{E}\,\phi(x^{(k)}) - \phi(x^\star) \le \epsilon$, a stochastic first-order method typically requires $O(1/\epsilon^2)$ iterations when $\mu_f = 0$ and $O(1/\epsilon)$ iterations when $\mu_f > 0$. According to [98], neither of these complexities can be improved.

2.5 A Guideline for Choosing Algorithms in the Thesis

To provide a guideline on how to choose among the algorithms in this thesis for solving structured regression problems (2.1) with different penalty terms and different problem sizes, we include a decision tree in Figure 2.1.

In a structured regression problem, if $m$ is tremendously larger than $n$, the challenge of solving (2.1) comes from "big data". In this case, stochastic methods such as Algorithm 6.1 in Chapter 6 may dominate deterministic methods due to their low computational cost and memory requirement in each iteration. If $m$ is moderate but $n$ is very large, the challenge comes from "big model" rather than "big data". In this case, which algorithm to use depends heavily on the penalty function $\Psi(x)$.


Figure 2.1: The guideline for using the algorithms in this thesis.

If $\Psi(x) = \lambda \|x\|_1$ as in the Lasso case, Algorithm 3.6 in Chapter 3 is a better choice than the other algorithms in this thesis because it has a better theoretical and practical convergence rate. If $\Psi(x)$ is not $\lambda \|x\|_1$ but still simple enough, e.g., a group Lasso penalty with disjoint groups, so that the proximal mapping (2.18) can be solved easily, one can use Algorithm 5.1 in Chapter 5. If $\Psi(x)$ is complicated, so that (2.18) cannot be solved easily, but can be reformulated as (2.12), one may use Algorithm 4.1 in Chapter 4. However, if $\Psi(x)$ cannot be represented as (2.12), the problem is beyond the scope of this thesis, and subgradient descent methods might be an alternative choice.
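The guideline of this section can be read as a short chain of conditions. The sketch below encodes that reading; the function name and the boolean arguments are hypothetical, the order of the checks follows the text rather than Figure 2.1 itself, and deciding whether $m$ is "tremendously larger" than $n$ is left to the caller.

```python
def choose_algorithm(m_much_larger_than_n, penalty_is_l1,
                     prox_is_easy, has_max_form):
    """Map the problem characteristics of Section 2.5 to a suggested algorithm.

    m_much_larger_than_n : "big data" regime (many samples, few variables)
    penalty_is_l1        : Psi(x) = lambda * ||x||_1 (Lasso)
    prox_is_easy         : the proximal mapping (2.18) is easy to solve
    has_max_form         : Psi(x) can be written in the form (2.12)
    """
    if m_much_larger_than_n:
        return "Algorithm 6.1 (Chapter 6)"
    if penalty_is_l1:
        return "Algorithm 3.6 (Chapter 3)"
    if prox_is_easy:
        return "Algorithm 5.1 (Chapter 5)"
    if has_max_form:
        return "Algorithm 4.1 (Chapter 4)"
    return "beyond the scope of the thesis (subgradient descent as a fallback)"
```

For instance, a group Lasso problem with disjoint groups and a moderate number of samples would be routed to Algorithm 5.1.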


Part III

FIRST-ORDER METHODS FOR STRUCTURED REGRESSIONS


3

ACCELERATED GRADIENT HOMOTOPY METHOD FOR LASSO

In this chapter, we focus on a special structured regression problem, the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem, also known as the Lasso problem, in the high-dimensional setting.¹ We first present an accelerated proximal gradient (APG) method for problems where the smooth part of the objective function is also strongly convex. This method incorporates an efficient line-search procedure and achieves the optimal iteration complexity for such composite optimization problems. In case the strong convexity parameter is unknown, we also develop an adaptive scheme that can automatically estimate it on the fly, at the cost of a slightly worse iteration complexity.

In the $\ell_1$-LS problem, the smooth part of the objective (least-squares) is not strongly convex over the entire domain. Nevertheless, we can exploit its restricted strong convexity over sparse vectors using the adaptive APG method combined with a homotopy continuation scheme. We show that such a combination leads to a global geometric rate of convergence, and that the overall iteration complexity has an improved dependency on the restricted condition number compared with previous work.

3.1 Introduction

Exploiting problem structure has become an important theme in recent advances in convex optimization. It is well known that proper use of problem structure at the numerical linear algebra level may dramatically improve the efficiency of an optimization method. More recently, it has become clear that exploiting problem structure can also lead to more efficient optimization methods in terms of their iteration complexity, sometimes significantly surpassing the limitations of the black-box complexity theory (see [108] for an excellent discussion). Such examples range from the theory of self-concordant functions for interior-point methods [110] to the more recent development of smoothing techniques [104], minimization of composite objective functions [107], and acceleration via manifold identification (e.g., [147]).

In this chapter, we first develop an adaptive accelerated proximal gradient method for minimizing objective functions that are strongly convex, without knowledge of their convexity parameters or any lower bound. Then we employ this method in a homotopy continuation scheme for sparse optimization (with $\ell_1$-regularization), and show that it achieves an improved iteration complexity over previous methods for solving the sparse least-squares problem.

1 This chapter is based on the technical report [81] submitted to Mathematical Programming.
