LARGE-SCALE OPTIMIZATION FOR MACHINE LEARNING AND SEQUENTIAL DECISION MAKING
QIHANG LIN
Tepper School of Business, Carnegie Mellon University
Thesis Committee: Javier Peña (Chair), Geoffrey Gordon, Fatma Kılınç-Karzan, Lin Xiao
Submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy
Copyright © June 2013 by Qihang Lin
Dedicated to my parents Yunwu Lin and Biying Liu, and my stepmother Feique Lin.
Thanks to the development of modern digital technology, people today are able to generate, collect and store data of unprecedented volume, dimension and complexity. This growing trend of big data brings new needs for powerful tools to explore and analyze different data sets. Among different tools for data analysis, structured regression is one of the most popular techniques. It has been successfully applied to data from many disciplines in business, science and engineering.
A structured regression model is formulated as an optimization problem using a large number of decision variables. Traditional techniques often suffer from low scalability and unaffordable computational time when applied to optimization problems at such a large scale. To address this challenge, in the first main part of this thesis, we propose several optimization methods with low memory requirements for different structured regression problems based on homotopy, smoothing, stochastic sampling and other techniques. In particular, our contributions include:
1. A gradient homotopy method for the Lasso problem, the most popular structured regression model. This method finds an ε-optimal solution within O(log(1/ε)) iterations. This complexity improves the traditional O(1/√ε) complexity, which is theoretically not improvable if an algorithm is only allowed to use black-box gradient information. What allows us to achieve a better complexity is that our method fully utilizes the local strong convexity property of the Lasso problem under the restricted eigenvalue assumption, which provides much more information than gradients.
2. A smoothing proximal gradient algorithm for solving general structured regressions. One difficulty of applying first-order methods to structured regressions with sophisticated penalty terms is that the proximal mapping, which must be computed in each iteration, does not have a closed-form solution. The algorithm we proposed constructs a smooth approximation to the penalty terms so that the proximal mapping can be computed in closed form, and thus, first-order methods can be applied efficiently.
3. A unified theoretical framework for analyzing the geometric nature of different first-order methods. We show that most popular versions of accelerated gradient methods essentially construct an estimate sequence of the optimization problem in different ways. This explains why these methods achieve the same optimal convergence rate even with different updating schemes.
4. A stochastic first-order method for structured regression, which utilizes a stochastic gradient constructed using only a random sample of the whole data set. This method is favorable when the size of the data grows beyond the storage capacity and deterministic first-order methods cannot be applied due to the overwhelming computational cost involved with computing the exact gradient. Our method achieves the “uniformly” optimal complexity O(1/√ε + σ/ε²), which means that, when the noise σ of the stochastic gradient is reduced to zero, our method achieves the optimal complexity O(1/√ε) as a deterministic first-order method. By contrast, the traditional stochastic gradient methods would still have the sub-optimal complexity O(1/ε²).
The challenges of large-scale optimization arise not only from the large volume of data but also from the exponential rate of growth of the number of variables in multistage decision making models. In the second main part of this thesis, we explore the scalability of first-order methods in the latter case.
We focus on the optimal trade execution problem under a coherent dynamic risk measure. This is a large-scale multistage stochastic optimization problem that arises in financial engineering.
Relying on the dual representation of the coherent risk measure, we can formulate this problem as a saddle-point problem and solve it with a primal-dual first-order method. The truncated simplex structure of the primal and dual domains allows us to obtain a closed-form solution to the relevant proximal mapping sub-problem, resulting in an efficient implementation of this first-order method. Our models and algorithms are tested on real limit order book data and demonstrate promising numerical properties. Furthermore, we generalize the same primal-dual first-order method to the case where the multistage decision making problem is modeled with a scenario tree in a non-Markovian fashion.
In the last part of the thesis, we discuss future research directions in optimization techniques for solving high-dimensional structured regression problems and multistage decision making problems from both the theoretical and computational aspects.
KEYWORDS
Machine Learning, Sparse Learning, Lasso, Optimization, Regression, First-Order Method, Proximal Gradient Method, Stochastic Optimization, Stochastic Gradient, Convex Programming, Dynamic Coherent Risk Measure, Saddle-Point Problem, Markov Decision Process, Trade Execution, Multistage Decision Making, Scenario Tree, Limit Order Book.
First, I would like to thank my advisor Javier Peña for his great advice and support through my entire Ph.D. journey. I am particularly grateful to him not only for directing me towards exciting problems and applications, but also for allowing me the freedom to pursue the topics that I am passionate about. His vision on new research problems, his attitude to the academic career, as well as his extremely nice and sincere personality all deeply influenced me and will continue to shape me in the future.
I also deeply appreciate my other thesis committee members, Geoffrey Gordon, Fatma Kılınç-Karzan and Lin Xiao. They put a lot of effort into this thesis and provided many insightful suggestions. I also thank many other faculty members in the Tepper School of Business: Egon Balas, Gerard Cornuejols, R. Ravi, John Hooker and Alan Scheller-Wolf. I was so fortunate to take excellent courses from them. Their courses opened the doors to different areas of operations research, a great field that I am delighted to devote my entire career to exploring.
During my Ph.D., I did a great summer internship at Microsoft Research in 2012. I sincerely thank my mentor Lin Xiao, who is also my committee member, my collaborator Dengyong Zhou and other friends in Microsoft Research for their help and support. I am also indebted to many other colleagues and friends at CMU, who played critical roles during my Ph.D. through collaborations, discussions and suggestions. Some of them include: Marco Molinaro, Ishani Aggarwal, David Bergman, Hao Xue, Xin Fang, Nan Xiong, Selvaprabu Nadarajah, Negar Soheili, Andre Cire, Amitabh Basu, Andrea Qualizza and Yangfang Zhou.
In particular, I would like to thank Xi Chen. He, as one of my most important collaborators, not only taught me knowledge about statistics, but also proposed many interesting problems to me. We had so many enjoyable discussions and I was often amazed by his programming skills and knowledge of statistical machine learning and data mining.
I also owe special thanks to Lawrence Rapp, who did a superb administration job and provided crucial assistance in making sure that my graduate experience went smoothly. The tea break organized by him every Monday and Thursday was always something I looked forward to and enjoyed every week.
Most importantly, I want to thank my parents, Yunwu Lin and Biying Liu as well as my stepmother Feique Lin. They made all kinds of efforts to create a healthy family environment where I could grow up happily. This thesis is dedicated to them.
CONTENTS
I Thesis Overview
1 Thesis Overview
  1.1 Motivation and Statement
  1.2 Thesis Overview
  1.3 Main Results and Organization
    1.3.1 Part II: Background
    1.3.2 Part III: First-Order Methods for Structured Regressions
    1.3.3 Part IV: First-Order Methods for Multistage Decision Problems
    1.3.4 Part V: Conclusions and Future Work
II Background
2 Background
  2.1 Structured Regression
  2.2 Multi-task Structured Regression
  2.3 First-Order Methods
  2.4 Stochastic First-Order Methods
  2.5 A Guideline for Choosing Algorithms in the Thesis
III First-Order Methods for Structured Regressions
3 Accelerated Gradient Homotopy Method for Lasso
  3.1 Introduction
    3.1.1 Minimizing composite objective functions
    3.1.2 Homotopy continuation for sparse optimization
    3.1.3 Outline of the chapter
  3.2 Preliminaries and Notation
    3.2.1 Composite gradient mapping
    3.2.2 Proximal gradient method with line search
  3.3 An APG Method for Minimizing Strongly Convex Functions
    3.3.1 Proof of Theorem 3.1
    3.3.2 The non-blowout property
  3.4 An Adaptive APG Method with Restart
  3.5 Homotopy Continuation for Sparse Optimization
    3.5.1 Sparsity along the solution path
    3.5.2 Proof of Theorem 3.3
    3.5.3 Proof of Theorem 3.4
  3.6 Numerical Experiments
    3.6.1 Experiments on the AdapAPG method
    3.6.2 Experiments on homotopy continuation
4 Smoothing Proximal Gradient Method for Structured Regression
  4.1 Introduction
  4.2 Linear Regression Regularized by Structured Sparsity-inducing Penalties
  4.3 Smoothing Proximal Gradient
    4.3.1 Reformulation of Structured Sparsity-inducing Penalty
    4.3.2 Smooth Approximation to Structured Sparsity-inducing Penalty
    4.3.3 Smoothing Proximal Gradient Method
    4.3.4 Issues on the Computation of the Lipschitz Constant
    4.3.5 Convergence Rate and Time Complexity
    4.3.6 Summary and Discussions
  4.4 Related Optimization Methods
    4.4.1 Related work for mixed-norm based group-lasso penalty
    4.4.2 Related work for fused lasso
  4.5 Extensions to Multi-task Regression with Structures on Outputs
    4.5.1 Multi-task Linear Regression Regularized by Structured Sparsity-inducing Penalties
    4.5.2 Smoothing Proximal Gradient Descent
  4.6 Numerical Experiments
    4.6.1 Simulation Study I: Overlapping Group Lasso
    4.6.2 Simulation Study II: Multi-task Graph-guided Fused Lasso
    4.6.3 Real Data Analysis: Pathway Analysis of Breast Cancer Data
5 Estimate Sequences and Accelerated Proximal Gradient Methods
  5.1 Introduction
  5.2 Estimate Sequence
  5.3 A Generic Accelerated Gradient Algorithm
6 Optimal Regularized Dual Averaging Methods for Stochastic Optimization
  6.1 Introduction
  6.2 Preliminary and Notation
  6.3 Optimal Regularized Dual Averaging Method
    6.3.1 Convergence Rate
    6.3.2 Variance Bounds
    6.3.3 High Probability Bounds
  6.4 Multistage ORDA for Stochastic Strongly-Convex Optimization
  6.5 Related Work
  6.6 Numerical Experiments
    6.6.1 Simulated Experiments
    6.6.2 Real Data Experiments
IV First-Order Methods for Multistage Decision Problems
7 Optimal Trade Execution with Coherent Dynamic Risk Measures
  7.1 Introduction
  7.2 Trade Execution Model
  7.3 Coherent Dynamic Risk Measure
  7.4 Convex Optimization Formulation
  7.5 The Optimal Strategy for Single Asset Liquidation
  7.6 The Optimal Strategy for Multiple Asset Liquidation
    7.6.1 Sample average approximation and dual representation of coherent risk measures
    7.6.2 Saddle-point formulation and Mirror-Prox algorithm
    7.6.3 Acceleration by excessive gap method
  7.7 Extension to Nonlinear Market Impacts
  7.8 Numerical Experiments
    7.8.1 Scalability and efficiency
    7.8.2 Early completion vs. small tail
    7.8.3 Efficient frontier
    7.8.4 Experiments with NYSE limit order book data
8 Dynamic Coherent Risk Minimization over Scenario Trees
  8.1 Introduction
  8.2 Dynamic Risk Minimization and Saddle-Point Formulation
  8.3 Mirror-Prox Algorithm for Trade Execution under a Scenario Tree Model
    8.3.1 Mirror-Prox algorithm
    8.3.2 Trade execution with scenario tree price model
    8.3.3 Solving the minimization sub-problems
  8.4 Theoretical Convergence Rate
  8.5 Numerical Experiments
    8.5.1 Comparison of distance generating functions
    8.5.2 Comparison of Markovian and Non-Markovian Policies
V Conclusions and Future Work
9 Conclusions and Future Directions
  9.1 Conclusions
  9.2 Future Directions
Bibliography
LIST OF FIGURES
Figure 2.1 The guideline for using the algorithms in this thesis.
Figure 3.1 Minimizing a random instance of the log-sum-exp function.
Figure 3.2 Minimizing a random instance of the log-sum-exp function. CPU time in seconds: PG(4.25), FISTA(7.22), FISTA+RS(1.93), AdapAPG(µ0=200)(2.68) and AdapAPG(µ0=0.2)(1.82).
Figure 3.3 Minimizing another random instance of the log-sum-exp function. CPU time in seconds: PG(2.66), FISTA(6.04), FISTA+RS(3.63), AdapAPG(µ0=200)(2.08) and AdapAPG(µ0=0.2)(1.18).
Figure 3.4 Solving an ill-conditioned ℓ1-LS problem. AdapAPG1 starts with µ0 = L0/10, and AdapAPG2 starts with µ0 = L0/100. CPU time in seconds: PG(27.41), FISTA(54.03), FISTA+RS(22.76), AdapAPGmu1(46.94), AdapAPGmu2(55.39), PG+H(21.61), FISTA+H(28.78), FISTA+RS+H(22.49), AdapAPGmu1+H(23.12), AdapAPGmu2+H(13.91).
Figure 3.5 Solving a randomly generated ℓ1-LS problem. AdapAPG1 starts with µ0 = L0/10, and AdapAPG2 starts with µ0 = L0/100. CPU time in seconds: PG(5.33), FISTA(3.52), FISTA+RS(2.94), AdapAPGmu1(4.58), AdapAPGmu2(3.11), PG+H(1.97), FISTA+H(2.77), FISTA+RS+H(2.79), AdapAPGmu1+H(1.68), AdapAPGmu2+H(1.57).
Figure 3.6 Solving a randomly generated ℓ1-LS problem with non-sparse x̄. AdapAPG1 starts with µ0 = L0/10, and AdapAPG2 starts with µ0 = L0/100. CPU time in seconds: PG(7.97), FISTA(9.97), FISTA+RS(5.74), AdapAPGmu1(7.82), AdapAPGmu2(6.35), PG+H(2.93), FISTA+H(5.21), FISTA+RS+H(4.05), AdapAPGmu1+H(2.22), AdapAPGmu2+H(2.82).
Figure 4.1 A geometric illustration of the smoothness of Ψµ(β). (a) The 3-D plot of z(α, β), (b) the projection of (a) onto the β-z space, (c) the 3-D plot of zs(α, β), and (d) the projection of (c) onto the β-z space.
Figure 4.2 Regression coefficients estimated by different methods based on a single simulated data set. b = 0.8 and threshold ρ = 0.3 for the output correlation graph are used. Red pixels indicate large values. (a) The correlation coefficient matrix of phenotypes, (b) the edges of the phenotype correlation graph obtained at threshold 0.3 are shown as black pixels, (c) the true regression coefficients used in simulation. Absolute values of the estimated regression coefficients are shown for (d) lasso, (e) ℓ1/ℓ2 regularized multi-task regression, (f) Graph-guided fused lasso. Rows correspond to outputs and columns to inputs.
Figure 4.3 Comparisons of SPG, FOBOS and QP. (a) Vary k from 50 to 10,000, fixing m = 500, n = 100; (b) Vary m from 50 to 10,000, fixing m = 1000, k = 50; and (c) Vary n from 500 to 10000, fixing n = 100, k = 50.
Figure 4.4 Results from the analysis of breast cancer data set. (a) Balanced error rate for varying the number of selected genes, and (b) the number of pathways for varying the number of selected genes.
Figure 6.1 Objective values vs. iterations when ρ = 0. Only the first 200 iterations are plotted for better visualization and the ease of comparisons.
Figure 6.2 Objective values vs. iterations when ρ = 1. Only the first 200 iterations are plotted for better visualization and the ease of comparisons.
Figure 6.3 ORDA vs. M_ORDA.
Figure 7.1 Optimal trading strategies (∆) and trajectories (Π) in Mean-Variance Model and Dynamic Risk Model: αk = 0.98, λ = 0.0001, Expected trading cost EC(∆) = 3.6231e+005 ($)
Figure 7.2 Optimal trading strategies (∆) and trajectories (Π) in Mean-Variance Model and Dynamic Risk Model: αk = 0.7, λ = 0.00004682, Expected trading cost EC(∆) = 2.5345e+005 ($)
Figure 7.3 Optimal trading strategies (∆) and trajectories (Π) in Mean-Variance Model and Dynamic Risk Model: αk = 0.5, λ = 0.0000305, Expected trading cost EC(∆) = 2.1253e+005 ($)
Figure 7.4 Efficient frontier
Figure 8.1 Objective values decrease in different Mirror-Prox algorithms.
Figure 8.2 The values of L_Φ/σ_1 = L_Φ/σ_2 found by the search scheme in Appendix in each iteration.
Figure 8.3 The sample paths of simulated prices.
Figure 8.4 The scenario tree constructed using the method from [52] based on the sample in Figure 8.3.
Markovian and Markovian strategies.
Figure 8.6 The trading policy n and trading trajectory x when α = 0, T = 5 and x0 = 1000.
Figure 8.7 The trading policy n and trading trajectory x when α = 0.5, T = 5 and x0 = 1000.
Figure 8.8 The trading policy n and trading trajectory x when α = 0.9, T = 5 and x0 = 1000.
Figure 8.9 The trading policy n and trading trajectory x when α = 0, T = 8 and x0 = 5000.
Figure 8.10 The trading policy n and trading trajectory x when α = 0.5, T = 8 and x0 = 5000.
Figure 8.11 The trading policy n and trading trajectory x when α = 0.9, T = 8 and x0 = 5000.
LIST OF TABLES
Table 4.1 Comparison of Per-iteration Time Complexity
Table 4.2 Comparisons of different first-order methods for optimizing mixed-norm based overlapping-group-lasso penalties.
Table 4.3 Comparisons of different methods for optimizing graph-guided fused lasso
Table 4.4 Comparison of Per-iteration Time Complexity for Multi-task Regression
Table 4.5 Comparisons of different optimization methods on the overlapping group lasso
Table 6.1 Summary for different stochastic gradient algorithms. V is short for V(x*, x^(0)); AC for “accelerated”; M for “multi-stage” and NA stands for either “not applicable” or “no analysis of the rate”.
Table 6.2 Comparisons for different algorithms in objective value and F1-score for solving Lasso problem.
Table 6.3 Comparisons for different algorithms in objective value and F1-score for solving Elastic-net problem.
Table 6.4 The statistics of the experimental datasets.
Table 6.5 Experimental results for MNIST in terms of objective value, density of the final solution and testing error.
Table 6.6 Experimental results for 20-newsgroup in terms of objective value, density of the final solution and testing error.
Table 7.1 Comparisons of scalability and efficiency of Mirror-Prox and EG with CVX
Table 7.2 Some information of the limit order data.
Table 7.3 A snapshot of the limit order book of DAL at 9:45 AM on July 2nd, 2010.
Table 7.4 The mean of trading cost
Table 7.5 The standard deviation of trading cost
Table 7.6 The CVaR66.7% of trading cost
Table 8.1 CPU times and objective values by different Mirror-Prox algorithms when T = 5 and |c(ν)| = 3.
Table 8.2 CPU times and objective values by different Mirror-Prox algorithms when T = 8 and |c(ν)| = 5.
Part I
THESIS OVERVIEW
1 THESIS OVERVIEW
1.1 Motivation and Statement
With the development of modern digital technology, data of unprecedented size and dimension are generated and collected every day in business, finance, health care, energy and many other areas. To understand and extract useful knowledge from these massive data, a variety of statistical learning models have been developed, some of which can be formulated as optimization problems.
However, as the size and the dimension of the data increase, the size of the resulting optimization problems quickly renders traditional methods impractical due to their excessive memory or computational requirements. A similar challenge is faced when trying to find the optimal strategy of a multistage decision making problem built on massive historical data. In such a problem, the number of scenarios, and thus decision variables, increases very quickly as the number of stages grows, also leading to a large-scale optimization problem. To address these challenges from different angles for various applications, this thesis develops algorithms with low-memory requirements for large-scale optimization using homotopy, smoothing, data sampling and other problem-specific techniques.
1.2 Thesis Overview
This thesis focuses on addressing the computational challenge in both the learning problems over high-dimensional large-scale data and the decision making problems with multiple stages. In particular, we consider the following two main problems:
• Structured regression:
As one of the most important subjects in statistical machine learning, regression analysis has been used to analyze different types of data sets to discover the relationship between input and response variables. When a regression model contains hundreds of thousands of input variables, it is important to identify the input variables that are most relevant to reduce the dimension of the original regression model and focus on the most important covariates.
This variable selection problem can be solved by enforcing a sparse structure in the regression coefficients so that the less relevant variables will have a zero coefficient and can be removed from the model. An efficient approach to achieve sparsity is to formulate the regression as a joint minimization of a fitness loss function and a sparsity inducing term, e.g., the ℓ1-norm, of the regression coefficients. Depending on the prior knowledge about the data, we may want to enforce different kinds of structure on the regression coefficients such as group sparsity, low-rank and group-wise constant. Like sparsity, these types of structure can be obtained by minimizing a fitness loss function jointly with the corresponding structure inducing terms. The minimization problems of this kind are named structured regressions.
We proposed efficient proximal gradient methods to address computational challenges for different structured regression problems [81, 27, 28]. We further adapt the proposed algorithms to stochastic settings for data-intensive or online applications [28]. Our stochastic optimization algorithm [28] achieves a uniformly optimal convergence rate.
• Trade execution problem:
We study the problem of finding the optimal strategy for selling a large portfolio of assets in an illiquid market by minimizing a coherent dynamic risk of the transaction cost [80]. Two different dynamic models of the asset prices are considered: a discrete-time Markov process model and a non-Markovian scenario tree model. In both cases, the prices can be perturbed by either temporary or permanent impacts related to the trading speed, which is the main source of transaction costs. The goal is to compute the optimal trading speed in order to reach the best trade-off between the large cost of fast trading and the large uncertainty of slow trading.
We show that this optimal execution problem is equivalent to a saddle-point problem, and propose efficient first-order methods to compute the optimal strategy numerically [80].
1.3 Main Results and Organization
1.3.1 Part II: Background
In Part II, we briefly review some background on structured regression, including the choices of fitness loss functions, the choices of structure inducing terms and their multi-task extensions. We also present some basics of first-order optimization methods for structured regressions as well as their stochastic versions based on stochastic gradients.
1.3.2 Part III: First-Order Methods for Structured Regressions
The main challenge in solving structured regressions lies in their high dimensionality. In Part III, we present techniques to tackle this challenge that belong to the family of first-order methods. As the name suggests, these methods only utilize gradient or sub-gradient information from the objective function. Compared to other optimization methods, the biggest advantage of a first-order method is that it mainly needs matrix-vector multiplications in each iteration, so that it requires very little memory space and is scalable to large instances in practice. Indeed, for problems with tens of thousands of variables or more, first-order methods are currently one of the best choices. We present our main contributions in the following four chapters.
1.3.2.1 Accelerated Gradient Homotopy Method for Lasso
In Chapter 3, we consider a particular structured regression problem, namely the ℓ1-norm regularized least-square problem, which is also known as the Lasso problem. Existing first-order methods for Lasso, e.g., FISTA [12], can find an ε-optimal solution using O(1/√ε) iterations, which is well-known to be optimal if one can only use black box gradient information [98]. However, for a specific problem like Lasso where we have more structural information than just its gradient, there is still the potential to get a better iteration complexity using first-order methods. We derive a novel gradient homotopy method [81] that can find an ε-optimal solution with only O(log(1/ε)) iterations under the assumption of restricted eigenvalue property. This complexity improves the former complexity bound O(1/√ε), and is the best complexity result in the literature so far.
The idea we used is based on the local strong convexity property of the Lasso problem. Using a homotopy technique, which is a multistage warm-start method, our algorithm always keeps the path of the solutions within a sparse subspace where the original non-strongly convex objective function becomes strongly convex. The strong convexity of the objective function helps accelerate the first-order method so that the overall complexity of our gradient homotopy method is improved.
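As a minimal illustration of this idea, the following Python sketch combines a plain proximal gradient step, using the closed-form soft-thresholding operator, with a geometric decrease of the regularization parameter, warm-starting each stage from the previous solution. It is a sketch under simplifying assumptions and not the algorithm analyzed in Chapter 3: the function names, the fixed step size `step` (assumed to be at most 1/L, with L the largest eigenvalue of AᵀA), the decrease factor `eta` and the fixed number of inner iterations are all illustrative choices.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, x: Vector) -> Vector:
    """Compute A x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def rmatvec(A: Matrix, r: Vector) -> Vector:
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def soft_threshold(v: float, t: float) -> float:
    """Closed-form proximal mapping of t * |.| (soft-thresholding)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient(A: Matrix, b: Vector, lam: float, x0: Vector,
                      step: float, iters: int) -> Vector:
    """Proximal gradient iterations for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    x = list(x0)
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(A, x), b)]
        grad = rmatvec(A, residual)
        x = [soft_threshold(x_j - step * g_j, step * lam)
             for x_j, g_j in zip(x, grad)]
    return x


def homotopy_lasso(A: Matrix, b: Vector, lam_target: float, step: float,
                   eta: float = 0.5, inner_iters: int = 50) -> Vector:
    """Solve a sequence of Lasso problems with decreasing lam (illustrative)."""
    x = [0.0] * len(A[0])
    # x = 0 is optimal whenever lam >= ||A^T b||_inf, so start from there.
    lam = max(abs(g) for g in rmatvec(A, b))
    while lam > lam_target:
        lam = max(eta * lam, lam_target)
        # Warm start: the previous stage's solution initializes the next stage.
        x = proximal_gradient(A, b, lam, x, step, inner_iters)
    return x
```

When A is the identity matrix, every stage reduces to soft-thresholding b, so the sketch returns the exact Lasso solution for the target parameter.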
1.3.2.2 Smoothing Proximal Gradient Method for Structured Sparse Regression
In Chapter 4, we consider a challenge faced by all first-order methods when the structure inducing term in the structured regression becomes more sophisticated. Take FISTA [12] as an example. In each iteration, FISTA needs to solve a proximal mapping sub-problem, which is itself a minimization problem. When the structure inducing term is simple, e.g., the ℓ1-norm, the proximal mapping sub-problem has a closed-form solution, which guarantees an efficient implementation of FISTA. However, in general cases, e.g., overlapping group Lasso [61] and graph-guided fused Lasso [69], the structure inducing terms are more elaborate and non-smooth. In these cases, the proximal mapping sub-problem cannot be solved so easily. This essentially prevents the use of most first-order methods. One alternative solution is to utilize sub-gradient methods, but this would require O(1/ε²) iterations to find an ε-optimal solution, which is not efficient.
To design an efficient first-order method for general structured regressions, we present a technique based on a key observation: these sophisticated structure inducing terms can often be represented as a linear maximization over a compact set. As a result, the structured regression problem can be formulated as a saddle-point problem. Using a technique by Nesterov [104] for saddle-point problems, we construct a smooth approximation of the structure inducing term. This approximation yields a closed-form solution of the proximal mapping sub-problem just as for the case of the ℓ1-norm. Incorporating this smoothing technique into an accelerated gradient method, we proposed a smoothing proximal gradient method [27] which can solve structured regressions using O(1/ε) iterations, which is better than the O(1/ε²) complexity of sub-gradient methods.
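The smoothing step can be illustrated on a simple special case. The Python sketch below assumes a penalty of the form Ω(β) = ‖Cβ‖₁ = max over ‖α‖∞ ≤ 1 of αᵀCβ, where the matrix `C` is a hypothetical structure matrix (for instance, first differences along a chain, as in a fused Lasso). Subtracting (µ/2)‖α‖² inside the maximization gives a smooth function whose maximizer and gradient are available in closed form. The function names and the choice of `C` are illustrative assumptions, not the notation of Chapter 4.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(C: Matrix, beta: Vector) -> Vector:
    """Compute C beta."""
    return [sum(c_ij * b_j for c_ij, b_j in zip(row, beta)) for row in C]


def exact_penalty(C: Matrix, beta: Vector) -> float:
    """Non-smooth penalty ||C beta||_1."""
    return sum(abs(u_i) for u_i in matvec(C, beta))


def smoothed_penalty(C: Matrix, beta: Vector, mu: float) -> Tuple[float, Vector]:
    """Value and gradient of max_{||alpha||_inf <= 1} alpha^T C beta - (mu/2)||alpha||^2."""
    u = matvec(C, beta)
    # The maximizer is the projection of (C beta) / mu onto the box [-1, 1].
    alpha = [max(-1.0, min(1.0, u_i / mu)) for u_i in u]
    value = sum(a_i * u_i - 0.5 * mu * a_i * a_i for a_i, u_i in zip(alpha, u))
    # The gradient of the smoothed penalty with respect to beta is C^T alpha.
    grad = [sum(C[i][j] * alpha[i] for i in range(len(C)))
            for j in range(len(beta))]
    return value, grad
```

Because 0 ≤ (µ/2)‖α‖² ≤ µr/2 on the box, where r is the number of rows of `C`, the smoothed value lies between Ω(β) − µr/2 and Ω(β), so a smaller µ gives a tighter but less smooth approximation.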
1.3.2.3 Estimate Sequences and Accelerated Proximal Gradient Methods
In Chapter 5, we give a unified geometric analysis of different accelerated proximal gradient methods proposed in the literature, including [35, 37, 55, 77, 99, 73, 74, 51, 63]. We show that, although these methods use different updating rules, they are essentially running the same procedure, that is, constructing a so-called estimate sequence [103] for the objective function and using it to push the iterates towards optimality. Based on this observation, we proposed several new estimate sequences which all lead to new accelerated proximal gradient algorithms that achieve the optimal convergence rate. The framework we provide covers the first-order methods in the survey given by Tseng [142] and beyond. In particular, our analysis holds for both non-strongly convex and strongly convex optimization.
1.3.2.4 Optimal Regularized Dual Averaging Methods for Stochastic Optimization
In Chapter 6, we develop a stochastic first-order method [28] for structured regression by incorporating data sampling schemes into gradient computation. When the size of the data goes beyond the storage capacity of the machine, computing the exact gradient of the objective function may become very time consuming or impossible. To solve structured regression problems over such types of data sets, in each iteration, we construct a stochastic gradient using only a small random sample of data which can be fitted into memory. Despite the random noise in the stochastic gradients, our algorithms converge with a convergence rate that is “uniformly optimal” [98]. That means that the optimal convergence rate is obtained with the same algorithm whether the problem is deterministic or stochastic. Compared to previous stochastic first-order methods, our method updates the solution using all historical stochastic gradients instead of just the current one. Hence, our method has a high tolerance to the gradient noise and consequently numerically outperforms other methods when few data points are available to construct the stochastic gradient.
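To make the two ingredients above concrete, the following Python sketch builds a stochastic gradient from a small random batch of rows and feeds the running average of all past stochastic gradients into a basic regularized dual averaging update for an ℓ1 penalty. This is the simple, non-accelerated dual averaging scheme, shown only as an illustration; it is not the ORDA method of Chapter 6. The loss scaling (1/(2m))‖Ax − b‖², the parameter `gamma`, and all function names are assumptions made for the example.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def soft_threshold(v: float, t: float) -> float:
    """Closed-form proximal mapping of t * |.|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def minibatch_gradient(A: Matrix, b: Vector, x: Vector, batch: List[int]) -> Vector:
    """Estimate the gradient of (1/(2m)) * ||A x - b||^2 from the sampled rows."""
    grad = [0.0] * len(x)
    for i in batch:
        residual = sum(a * x_j for a, x_j in zip(A[i], x)) - b[i]
        for j in range(len(x)):
            grad[j] += residual * A[i][j] / len(batch)
    return grad


def rda_l1(A: Matrix, b: Vector, lam: float, gamma: float,
           iters: int, batch_size: int, seed: int = 0) -> Vector:
    """Basic regularized dual averaging with an l1 penalty (illustrative)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [0.0] * n
    grad_sum = [0.0] * n
    for t in range(1, iters + 1):
        batch = [rng.randrange(m) for _ in range(batch_size)]
        g = minibatch_gradient(A, b, x, batch)
        # Accumulate every past stochastic gradient, not only the current one.
        grad_sum = [s + g_j for s, g_j in zip(grad_sum, g)]
        grad_avg = [s / t for s in grad_sum]
        # Minimizer of <grad_avg, x> + lam*||x||_1 + (gamma / sqrt(t)) * 0.5*||x||^2.
        scale = math.sqrt(t) / gamma
        x = [-scale * soft_threshold(g_j, lam) for g_j in grad_avg]
    return x
```

Because each update depends on the average of all past gradients, the noise of any single batch is damped, and coordinates whose averaged gradient stays below `lam` in absolute value are set exactly to zero.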
1.3.3 Part IV: First-Order Methods for Multistage Decision Problems
Multistage stochastic decision models and the related optimization techniques are useful tools for decision making in environments with uncertainty. One of the challenges in solving a multistage stochastic optimization problem is the exponential growth rate of the problem size as the number of stages increases. For such an optimization problem, a first-order method is attractive due to its low memory requirement and good scalability. Although the efficiency of first-order methods has been demonstrated for solving large-scale problems in machine learning, signal processing and statistics, they have not received an equal amount of attention for solving multistage stochastic optimization problems. To contribute in this direction, in Part IV, we present first-order methods for multistage decision making problems under a dynamic coherent risk measure, with an application to the portfolio trade execution problem [80]. Our contributions in this part are presented in the following two chapters.
1.3.3.1 Optimal Trade Execution with Coherent Dynamic Risk Measures
In Chapter 7, we apply first-order methods for computing the optimal strategy for the trade execution problem, which is a multistage decision making problem with a large number of decision variables. Trading a large portfolio in an illiquid market creates an adverse impact on the asset prices, resulting in big transaction costs. One of the tasks of algorithmic trading is to strategically execute the portfolio transaction with the goal of reducing transaction costs. Due to the uncertainty of the market, the transaction cost may become very large and even unbearable to the trader in the worst case. In order to reduce the chance of running into such a bad scenario, a risk-averse strategy is usually preferred in practice. This is obtained by minimizing a risk measure of the transaction cost. Different risk measures have been considered in the literature. Unfortunately, their corresponding optimization problems either are computationally intractable or lead to a time inconsistent strategy.
We propose an execution strategy [80] that minimizes a coherent dynamic risk measure [127]. This risk measure evaluates the risk of the transaction cost by sequential certainty equivalence so that it represents the rationale of a risk-averse decision maker in a multistage setting. We prove that the optimal strategy under this dynamic risk measure is always static and time consistent. The static nature of the optimal strategy allows us to reformulate the trade execution problem as a saddle-point problem which can be solved by a primal-dual first-order method due to Nemirovski [97]. This method computes an ε-optimal strategy numerically with complexity O(1/ε).
1.3.3.2 Dynamic Coherent Risk Minimization over Scenario Tree
In Chapter 8, we further explore the efficiency of first-order methods for the trade execution problem under a more general setting. Unlike Chapter 7, where the price process is assumed to be Markovian with stage-wise independent price increments, in this chapter the price process is modeled as a general scenario tree so that the price increments can be serially dependent. Due to this change in the model, the number of variables in the corresponding saddle-point problem increases exponentially with the number of stages. We show that the trade execution problem can still be formulated as a saddle-point problem, which we can solve by an extension of the first-order method in Chapter 7. At the heart of our approach is a construction of the distance generating functions for the primal and dual feasible sets that enables the proximal mapping sub-problems to be solved efficiently.
1.3.4 Part V: Conclusions and Future Work
The conclusions of the thesis and future directions are provided in the last part of the thesis.
Part II
Background
2
Background
2.1 Structured Regression
In a high-dimensional regression problem, our goal is to predict the output $b \in \mathbb{R}$ from a high-dimensional input vector $a \in \mathbb{R}^n$.¹ Given a data set of $m$ input/output pairs $\{a_i, b_i\}$ for $i = 1, \ldots, m$, let $b = (b_1, \ldots, b_m)^T$ denote the vector of outputs and $A = (a_1, \ldots, a_m)^T$ denote the $m \times n$ matrix of inputs of the $m$ samples. A structured regression problem can be formulated as the following convex optimization problem:
$$\min_{x \in \mathbb{R}^n} \phi(x) \equiv f(x) + \Psi(x) = \sum_{i=1}^{m} \ell(b_i, a_i^T x) + \Psi(x), \tag{2.1}$$
where $\ell : \mathbb{R}^2 \to \mathbb{R}$ is a convex differentiable fitness loss function. Typically, $f(x)$ is assumed to be a convex differentiable function with a Lipschitz continuous gradient, i.e., there exists a constant $L_f$ such that
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L_f \|x - y\|_2, \tag{2.2}$$
or equivalently,
$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L_f}{2} \|x - y\|_2^2. \tag{2.3}$$
The second term $\Psi(x)$ in (2.1) is a structured penalty which is usually convex and non-smooth and is added to enforce some structural property in the optimal solution $x^\star$ of problem (2.1). We say that the loss function $f(x)$ in (2.1) is strongly convex if there exists a constant $\mu_f > 0$ (the convexity parameter) such that
$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu_f}{2} \|x - y\|_2^2. \tag{2.4}$$
Typical examples of fitness loss functions include:
1. The squared loss for linear regression:
$$\ell(b_i, a_i^T x) = \frac{1}{2}(b_i - a_i^T x)^2, \tag{2.5}$$
with the corresponding $f(x) = \frac{1}{2}\|Ax - b\|_2^2$;
2. The logistic loss for classification problems, with $b_i \in \{-1, +1\}$:
$$\ell(b_i, a_i^T x) = \log(1 + \exp(-b_i a_i^T x)). \tag{2.6}$$
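As a small numerical illustration of the two losses and of the bound (2.3), the following sketch evaluates $f(x)$ for both cases on synthetic data. The function names, the random data and the use of NumPy are assumptions made for this example only; the constant $L_f$ for the squared loss is taken as the largest eigenvalue of $A^T A$, which is a standard fact rather than something stated above.

```python
import numpy as np

def squared_loss_f(A, b, x):
    """f(x) = 0.5 * ||Ax - b||_2^2, the sum of the squared losses (2.5)."""
    r = A @ x - b
    return 0.5 * float(r @ r)

def squared_loss_grad(A, b, x):
    """Gradient of the squared loss: A^T (Ax - b)."""
    return A.T @ (A @ x - b)

def logistic_loss_f(A, b, x):
    """f(x) = sum_i log(1 + exp(-b_i a_i^T x)), with labels b_i in {-1, +1}."""
    # logaddexp(0, z) evaluates log(1 + exp(z)) without overflow
    return float(np.sum(np.logaddexp(0.0, -b * (A @ x))))

def lipschitz_squared_loss(A):
    """L_f for the squared loss: the largest eigenvalue of A^T A."""
    return float(np.linalg.norm(A, 2) ** 2)

rng = np.random.default_rng(0)  # synthetic data, for illustration only
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x, y = rng.standard_normal(5), rng.standard_normal(5)
L_f = lipschitz_squared_loss(A)

# Right-hand side of the quadratic upper bound (2.3) evaluated at (x, y)
upper = (squared_loss_f(A, b, y)
         + squared_loss_grad(A, b, y) @ (x - y)
         + 0.5 * L_f * np.sum((x - y) ** 2))
```

With these definitions, `squared_loss_f(A, b, x)` should never exceed `upper`, which is exactly what (2.3) asserts for this choice of $L_f$.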
1 Unless indicated otherwise, all vectors in this thesis are assumed to be column vectors
The structured penalty Ψ(x) is chosen according to the prior information about the regression model or the data set. In the following, we briefly overview some widely used structured penalties.
1. $\ell_1$-norm Penalty
The most widely used structured penalty is the $\ell_1$-norm of the solution $x$, i.e.,
$$\Psi(x) \equiv \lambda \|x\|_1 = \lambda \sum_{i=1}^{n} |x_i|. \tag{2.7}$$
This function is well known for its ability to enforce a sparse optimal solution $x^\star$ for problem (2.1) [137]. Here, $\lambda$ is a positive constant called the regularization parameter. When $f(x)$ is the squared loss function and $\Psi(x)$ is the $\ell_1$-norm of $x$, problem (2.1) becomes the $\ell_1$-regularized least-squares problem ($\ell_1$-LS), also known as the Lasso problem, which is applied to select the most important input variables in a regression model. In the last decade, the $\ell_1$-regularized regression problem has been studied in numerous ways, from the theoretical perspective [158, 146, 16, 156] to efficient computational methods [40, 107, 44, 149, 12]. Please refer to [50] (Chapter 18) for an in-depth discussion of $\ell_1$-regularized methods.
2. Group Lasso Penalty with Disjoint Groups
In some situations, the variables are partitioned into groups and it is desirable to select all or none of the variables within a group [153]. A typical example arises with categorical data, where each variable can be expressed via a group of dummy variables and one should conduct variable selection at the group level instead of the individual level. To achieve group-level variable selection, one can adopt the $\ell_1/\ell_q$ mixed-norm based group Lasso penalty with any $q > 1$ [153]:
$$\Psi(x) \equiv \gamma \sum_{g \in \mathcal{G}} w_g \|x_g\|_q \equiv \gamma \sum_{g \in \mathcal{G}} w_g \Big( \sum_{i \in g} |x_i|^q \Big)^{1/q}, \tag{2.8}$$
where $\mathcal{G}$ denotes a partition of $\{1, \ldots, n\}$; $x_g \in \mathbb{R}^{|g|}$ is the subvector of $x$ consisting of the variables in group $g$; $w_g$ is the predefined weight for group $g$; and $\|\cdot\|_q$ is the vector $\ell_q$-norm. This $\ell_1/\ell_q$ mixed-norm penalty plays the role of jointly setting all of the coefficients within each group to zero or non-zero values. In practice, $\ell_1/\ell_2$ and $\ell_1/\ell_\infty$ are the most popular norms. Theoretically, it has been demonstrated that if the group structure is consistent with the true sparsity pattern, the $\ell_1/\ell_2$ group Lasso penalty has the potential to improve the accuracy of the estimator [57].
3. Group Lasso Penalty with Overlapping Groups
To model more complex group structures, the groups in $\mathcal{G}$ in (2.8) can be allowed to overlap [61]. In other words, the coefficient for a variable $x_j$ can appear in different $\ell_q$-norms. Due to the shrinkage property of the
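$\ell_q$-norm, the penalty in (2.8) will set some $x_g$ to zero. If we denote the set of groups $g$ with $x_g = 0$ by $\mathcal{G}_0 \subset \mathcal{G}$, the support of the solution $x$ satisfies:
$$\operatorname{supp}(x) \subset \Big( \bigcup_{g \in \mathcal{G}_0} g \Big)^c, \tag{2.9}$$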
where $(\cdot)^c$ stands for the complementary set. A commonly used special case of the general overlapping group structure is the hierarchical structure (e.g., a tree or a forest) [160]. Specifically, we assume that variables correspond to nodes of a tree and a given variable is included in the model only if all its ancestors in the tree have already been selected. The general overlapping group structure has an important application in pathway selection for gene expression data. In more detail, a biological pathway is a group of genes that participate in a particular biological process to perform a certain function in a cell. To find the controlling factors related to a disease, it is more meaningful to study the genes by considering their pathways. We will discuss this application further in Chapter 4.
4. Chain-structured Fused Lasso Penalty
If the variables are ordered in some meaningful way (e.g., on a timeline), using the (chain-structured) fusion penalty, we can learn a piecewise constant coefficient vector [138]. The fusion penalty, which is the $\ell_1$-norm of the coefficients' successive differences, takes the following form:
$$\Psi(x) = \gamma \sum_{i=1}^{n-1} |x_{i+1} - x_i|. \tag{2.10}$$
It has been widely applied to hot-spot detection for comparative genomic hybridization (CGH) data [140] and time-series data analysis [71].
5. Graph-guided Fused Lasso Penalty
The work in [69] extends the chain-structured fusion penalty in (2.10) to a more general graph-guided fusion penalty. The graph-guided fusion penalty can encode prior knowledge about the structural constraints over features in the form of pairwise relatedness described by a graph $G \equiv (V, E)$, where $V = \{1, \ldots, n\}$ denotes the variables of interest and $E$ denotes the set of edges among $V$. Additionally, we let $r_{ml} \in \mathbb{R}$ denote the weight of the edge $e = (m, l) \in E$, corresponding to the correlation or another suitable similarity measure between features $m$ and $l$. The graph-guided fusion penalty, which encourages the coefficients of related variables to share similar magnitudes, is defined as follows:
$$\Psi(x) = \gamma \sum_{(m,l) \in E,\, m < l} \tau(r_{ml}) \, |x_m - \operatorname{sign}(r_{ml}) x_l|, \tag{2.11}$$
where $\tau(r_{ml})$ represents a general weight function that enforces a fusion effect over the coefficients $x_m$ and $x_l$ of relevant features. It can be any monotonically increasing function of the absolute values of the correlations; the most popular examples include $\tau(r) = |r|$ and $\tau(r) = |r|^2$. The $\operatorname{sign}(r_{ml})$ in (2.11) ensures that two positively correlated inputs tend to influence the output in the same direction, whereas two negatively correlated inputs have opposite effects. Since the fusion effect is calibrated by the edge weight, the graph-guided fusion penalty in (2.11) encourages highly inter-correlated inputs corresponding to a densely connected subnetwork in $G$ to be jointly selected as relevant.
It is noteworthy that when $r_{ml} = 1$ for all $e = (m, l) \in E$ and $G$ is simply a chain over the nodes, the graph-guided fusion penalty reduces to the chain-structured fusion penalty in (2.10).
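The penalties (2.7), (2.8), (2.10) and (2.11) can be evaluated directly from their definitions. The sketch below is one possible way to do so; the function names, the 0-based indexing and the representation of groups and edges as Python lists are assumptions of this example, not notation from the thesis.

```python
import numpy as np

def l1_penalty(x, lam):
    """(2.7): lam * ||x||_1."""
    return lam * np.sum(np.abs(x))

def group_lasso_penalty(x, groups, weights, gamma, q=2):
    """(2.8): gamma * sum_g w_g * ||x_g||_q; the groups may also overlap."""
    return gamma * sum(w * np.linalg.norm(x[list(g)], ord=q)
                       for g, w in zip(groups, weights))

def chain_fusion_penalty(x, gamma):
    """(2.10): gamma * sum_i |x_{i+1} - x_i|."""
    return gamma * np.sum(np.abs(np.diff(x)))

def graph_fusion_penalty(x, edges, gamma, tau=abs):
    """(2.11): edges is a list of (m, l, r_ml) with 0-based node indices."""
    return gamma * sum(tau(r) * abs(x[m] - np.sign(r) * x[l])
                       for m, l, r in edges)

x = np.array([1.0, 1.0, -2.0, 0.0])
# A chain graph with all edge weights r_ml = 1
chain_edges = [(i, i + 1, 1.0) for i in range(len(x) - 1)]

p_l1 = l1_penalty(x, lam=1.0)
p_group = group_lasso_penalty(x, groups=[(0, 1), (2, 3)], weights=[1.0, 1.0], gamma=1.0)
p_chain = chain_fusion_penalty(x, gamma=1.0)
p_graph = graph_fusion_penalty(x, chain_edges, gamma=1.0)
```

On this small vector, `p_graph` coincides with `p_chain`, which illustrates the remark that the graph-guided penalty reduces to the chain-structured one when the graph is a chain with unit weights.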
One of the properties shared by all of the structured penalties listed above is that they can be formulated as
$$\Psi(x) = \max_{y \in Q} \; y^T C x, \tag{2.12}$$
where $Q$ is a convex and compact set. This property is the key to deriving the smoothing proximal gradient algorithm in Chapter 4.
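To make (2.12) concrete, the following sketch checks it numerically for two of the penalties above. The particular choices of $C$ and $Q$ used here ($C = \lambda I$ for the $\ell_1$-norm, $C = \gamma D$ with $D$ the first-difference matrix for the chain fusion penalty, and $Q$ the unit $\ell_\infty$-ball in both cases) are standard dual-norm constructions assumed for illustration; the thesis may use a different but equivalent parametrization.

```python
import numpy as np

def max_over_unit_box(C, x):
    """Maximum of y^T C x over Q = {y : ||y||_inf <= 1}, attained at y = sign(Cx)."""
    z = C @ x
    y_star = np.sign(z)
    return float(y_star @ z), y_star

n, lam, gamma = 5, 0.3, 0.7
x = np.array([2.0, -1.0, 0.0, 0.5, 0.5])

C_l1 = lam * np.eye(n)              # l1 penalty: C = lam * I
D = np.diff(np.eye(n), axis=0)      # (n-1) x n matrix, row i equals e_{i+1} - e_i
C_fused = gamma * D                 # chain fusion penalty: C = gamma * D

val_l1, y_l1 = max_over_unit_box(C_l1, x)
val_fused, y_fused = max_over_unit_box(C_fused, x)
```

Here `val_l1` should equal $\lambda\|x\|_1$ and `val_fused` should equal $\gamma \sum_i |x_{i+1} - x_i|$, and no other point $y$ in the box can give a larger value of $y^T C x$.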
2.2 Multi-task Structured Regression
In multi-task learning, we are interested in learning multiple related tasks jointly by analyzing data from all of the tasks at the same time instead of considering each task individually [136, 25, 152, 155, 111]. When data are scarce, it is greatly advantageous to borrow the information in the data from other related tasks to learn each task more effectively. More specifically, we consider the multi-task sparse regression problem, where each task is to learn a functional mapping from a high-dimensional input space to a continuous-valued output space and only a small number of inputs are relevant to the output. In multi-task regression, it is often assumed that parameters for different tasks share the same sparsity pattern [143, 8, 112], and the task of conducting variable selection can be achieved via learning the joint sparsity pattern of the parameters.
For simplicity of illustration, we assume that all tasks share the same input matrix. Let $A \in \mathbb{R}^{m \times n}$ denote the matrix of input data for $n$ inputs and $B \in \mathbb{R}^{m \times k}$ denote the matrix of output data for $k$ outputs over $m$ samples. We assume a linear regression model for the $j$-th output: $B_j = A X_j + e_j$, $j = 1, \ldots, k$, where $B_j \in \mathbb{R}^m$ is the $j$-th column of $B$, $X_j = [X_{1j}, \ldots, X_{nj}]^T \in \mathbb{R}^n$ is the regression coefficient vector for the $j$-th output and $e_j$ is a Gaussian noise vector.
Let $X = [X_1, \ldots, X_k] \in \mathbb{R}^{n \times k}$ be the matrix of regression coefficients for all of the $k$ outputs. Then, the multi-task (or multivariate-response) structured regression problem can be naturally formulated as the following optimization problem:
$$\min_{X \in \mathbb{R}^{n \times k}} \phi(X) \equiv f(X) + \Psi(X) = \frac{1}{2} \|B - AX\|_F^2 + \Psi(X), \tag{2.13}$$
where $\|\cdot\|_F$ denotes the matrix Frobenius norm and $\Psi(X)$ is a structured sparsity-inducing penalty with a structure over the outputs.
A popular approach is to adopt a joint sparsity regularization to encourage sparsity across all tasks. In particular, one can adopt the $\ell_1/\ell_q$ mixed-norm penalty with $q > 1$ [8, 112, 96]:
$$\Psi(X) = \gamma \|X\|_{q,1} = \gamma \sum_{i=1}^{n} \|X_i\|_q, \tag{2.14}$$
where $X_i = (X_{i1}, X_{i2}, \ldots, X_{ik}) \in \mathbb{R}^k$ here denotes the $i$-th row of $X$ and $\gamma$ is a positive regularization parameter. However, this penalty cannot incorporate a complex structure in which the outputs themselves are correlated. In many applications, it is advantageous to utilize the prior structural information among outputs to guide the variable selection. To do this, one can extend the structured penalties introduced in Section 2.1 and obtain the following three most widely used structured penalties for multi-task regression.
1. $\ell_1$-norm Penalty in Multi-task Regression
Similar to the $\ell_1$-norm penalty in the single-task regression case, the $\ell_1$-norm penalty in the multi-task setting is defined as
$$\Psi(X) \equiv \lambda \|X\|_1 = \lambda \sum_{i=1}^{n} \sum_{j=1}^{k} |X_{ij}|. \tag{2.15}$$
2. Overlapping Group Lasso Penalty in Multi-task Regression
We define the group Lasso penalty for a structured multi-task regression as follows:
$$\Psi(X) \equiv \gamma \sum_{i=1}^{n} \sum_{g \in \mathcal{G}} w_g \|X_{ig}\|_q, \tag{2.16}$$
where $\mathcal{G} = \{g_1, \ldots, g_{|\mathcal{G}|}\}$ is a subset of the power set of $\{1, \ldots, k\}$ and $X_{ig}$ is the vector of regression coefficients corresponding to the outputs in group $g$: $\{X_{ij}, j \in g\}$. The $\ell_1/\ell_q$ mixed-norm penalty for multi-task regression in (2.14) is a special case of (2.16) where $\mathcal{G}$ has only one group $g = \{1, \ldots, k\}$. The tree-structured group Lasso penalty introduced in [68] is also a special case of (2.16).
3. Graph-guided Fused Lasso Penalty in Multi-task Regression
Assuming that a graph structure over the $k$ outputs is given as $G$, with a set of nodes $V = \{1, \ldots, k\}$ each corresponding to an output variable and a set of edges $E$, the graph-guided fusion penalty for a structured multi-task regression is given as:
$$\Psi(X) = \gamma \sum_{e=(m,l) \in E} \tau(r_{ml}) \sum_{i=1}^{n} |X_{im} - \operatorname{sign}(r_{ml}) X_{il}|. \tag{2.17}$$
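The multi-task penalties (2.14) and (2.17) act on the rows and on pairs of columns of $X$, respectively. The sketch below evaluates both on a tiny coefficient matrix; the function names, the 0-based column indices and the example edge weights are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np

def mixed_norm_penalty(X, gamma, q=2):
    """(2.14): gamma * sum over the rows i of ||X_i||_q."""
    return gamma * np.sum(np.linalg.norm(X, ord=q, axis=1))

def multitask_graph_fusion(X, edges, gamma, tau=abs):
    """(2.17): edges is a list of (m, l, r_ml) over output (column) indices."""
    return gamma * sum(
        tau(r) * np.sum(np.abs(X[:, m] - np.sign(r) * X[:, l]))
        for m, l, r in edges
    )

# n = 3 inputs (rows), k = 3 outputs (columns)
X = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 0.0],
              [1.0, -1.0, 1.0]])

p_mixed = mixed_norm_penalty(X, gamma=1.0)   # 5 + 0 + sqrt(3)
edges = [(0, 1, 1.0), (1, 2, -0.5)]          # hypothetical output graph
p_fusion = multitask_graph_fusion(X, edges, gamma=1.0)
```

The second row of `X` is entirely zero and contributes nothing to the mixed norm, which reflects how this penalty removes an input from all tasks at once.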
2.3 First-order Methods
The aforementioned sparse regression problem (2.1) is a convex optimization problem. Two traditional generic solvers are (1) the subgradient descent method and (2) the interior point method (IPM). The subgradient method converges very slowly, requiring $O(1/\epsilon^2)$ iterations, where $\epsilon$ is the desired optimality gap, and the obtained solutions are usually not sparse. For all the structured penalties that we considered here, the corresponding regression problem can always be cast as a semidefinite program or one of its simpler special forms (e.g., a second-order cone program (SOCP) or a quadratic program (QP)) and solved by an IPM. Although an IPM needs only $O(\log(1/\epsilon))$ iterations, it is computationally prohibitive for problems of even a moderate size due to its $O(n^3)$ complexity per iteration.
Due to the separability of some non-smooth penalties (e.g., the $\ell_1$-norm penalty), coordinate descent methods can be directly applied, where we optimize the objective over one variable (or a block of variables) at a time while keeping all others fixed [44]. Although coordinate descent has surprisingly good empirical performance, it is limited in that convergence cannot be guaranteed when the nondifferentiable terms are not separable [141].
Another class of optimization methods, accelerated proximal gradient (APG) methods, has become increasingly popular in the past few years in the machine learning community. They enjoy the optimal convergence rate under the first-order black-box model; and more importantly, since they only use gradient information, they are much more scalable than second-order methods and hence more suitable for large-scale applications.
Although APG methods have many variations [12, 107, 104, 103, 102, 101, 142, 75, 32] (see [142] for a survey of different APG methods), all of them need to compute a so-called proximal mapping (or proximal operator, proximal problem, projection mapping, generalized gradient update) at each iteration. The proximal mapping at a point $y$ takes the following form:
$$T_L(y) = \arg\min_{x} \Big\{ f(y) + \nabla f(y)^T (x - y) + \frac{L}{2} \|x - y\|_2^2 + \Psi(x) \Big\}, \tag{2.18}$$
where $L$ is set to be $L_f$ or determined by a line search procedure. The minimization problem in (2.18) is called the proximal mapping sub-problem. If $L$ is chosen to be $L_f$, the objective function of the minimization in (2.18) becomes an upper bound of $\phi(x) = f(x) + \Psi(x)$ according to (2.3).
We also note that the term $\frac{1}{2}\|x - y\|_2^2$ in (2.18) can be replaced by any Bregman divergence between $x$ and $y$, denoted by $V(x, y)$, which is defined as
$$V(x, y) := \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y \rangle, \tag{2.19}$$
where $\omega(x)$ is a strongly convex and differentiable function.
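When $\omega(x) = \frac{1}{2}\|x\|_2^2$, so that $V(x, y) = \frac{1}{2}\|x - y\|_2^2$, the proximal mapping can be written as:
$$T_L(y) = \arg\min_{x} \Big\{ \frac{1}{2} \Big\| x - \Big( y - \frac{1}{L} \nabla f(y) \Big) \Big\|_2^2 + \frac{1}{L} \Psi(x) \Big\}. \tag{2.20}$$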
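The Bregman divergence (2.19) can be computed for any differentiable $\omega$. The sketch below evaluates it for the Euclidean choice $\omega(x) = \frac{1}{2}\|x\|_2^2$ and, as an additional illustration not taken from the text, for the negative entropy $\omega(x) = \sum_i x_i \log x_i$ on positive vectors, for which $V(x, y)$ is the Kullback-Leibler divergence when $x$ and $y$ both sum to one. The function names are assumptions of this example.

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    """V(x, y) = omega(x) - omega(y) - <grad omega(y), x - y>, as in (2.19)."""
    return omega(x) - omega(y) - grad_omega(y) @ (x - y)

# Euclidean choice: omega(x) = 0.5 * ||x||_2^2
def sq(x):
    return 0.5 * (x @ x)

def sq_grad(x):
    return x

# Negative entropy on positive vectors (an illustrative, assumed choice)
def negent(x):
    return np.sum(x * np.log(x))

def negent_grad(x):
    return np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

v_euclid = bregman(sq, sq_grad, x, y)            # equals 0.5 * ||x - y||_2^2
v_entropy = bregman(negent, negent_grad, x, y)   # equals sum_i x_i * log(x_i / y_i)
```

Choosing $\omega$ to match the geometry of the feasible set is what makes the proximal mapping sub-problem easy in some non-Euclidean settings.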
We note that if the non-smooth term $\Psi(x)$ is zero, then (2.20) simply reduces to the standard gradient descent update rule with step size $1/L$: $T_L(y) = y - \frac{1}{L} \nabla f(y)$. When $\Psi(x)$ is not zero but simple enough, the proximal mapping sub-problem admits a closed-form or exact solution. For example, when $\Psi(x) = \lambda \|x\|_1$, $T_L(y)$ has the closed-form solution
$$T_L(y) = \operatorname{shrink}\Big( y - \frac{1}{L} \nabla f(y), \; \frac{\lambda}{L} \Big), \tag{2.21}$$
where $\operatorname{shrink} : \mathbb{R}^n \times \mathbb{R}_+ \to \mathbb{R}^n$ is the well-known shrinkage or soft-thresholding operator, defined as
$$(\operatorname{shrink}(x, \alpha))_i = \operatorname{sgn}(x_i) \max\{|x_i| - \alpha, 0\}, \quad i = 1, \ldots, n. \tag{2.22}$$
Starting from an initial point $x^{(0)}$, the so-called proximal gradient (PG) method generates a sequence of solutions approaching optimality by iteratively applying the proximal mapping (2.18):
$$x^{(k+1)} = T_L(x^{(k)}), \quad k = 0, 1, 2, \ldots. \tag{2.23}$$
To find an $\epsilon$-optimal solution $x^{(k)}$, i.e., a point $x^{(k)} \in \mathbb{R}^n$ such that $\phi(x^{(k)}) - \phi(x^\star) \le \epsilon$, the PG method requires $k = O(L_f/\epsilon)$ iterations when $\mu_f = 0$. When $\mu_f > 0$, that is, when $f(x)$ is strongly convex, the iteration complexity of the PG method is reduced to $O\big(\frac{L_f}{\mu_f} \log(1/\epsilon)\big)$.
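For the Lasso problem, the iteration (2.23) with the closed-form mapping (2.21) takes only a few lines. The following is a minimal sketch, assuming a fixed constant $L = L_f$ equal to the largest eigenvalue of $A^T A$ and a fixed number of iterations; the function names and the synthetic data are assumptions for illustration, and no line search or stopping test is included.

```python
import numpy as np

def shrink(x, alpha):
    """Soft-thresholding operator (2.22), applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def lasso_objective(A, b, lam, x):
    """phi(x) = 0.5 * ||Ax - b||_2^2 + lam * ||x||_1."""
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def proximal_gradient_lasso(A, b, lam, num_iters=500):
    """Iterate x <- T_L(x) as in (2.23), using the closed form (2.21)."""
    L = np.linalg.norm(A, 2) ** 2          # L = L_f for the squared loss
    x = np.zeros(A.shape[1])               # initial point x^(0)
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)           # gradient of f at the current point
        x = shrink(x - grad / L, lam / L)  # proximal mapping for lam * ||x||_1
    return x

rng = np.random.default_rng(0)             # synthetic sparse regression data
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0
b = A @ x_true
x_hat = proximal_gradient_lasso(A, b, lam=0.1)
```

With $L = L_f$, each step minimizes an upper bound of $\phi$ that is tight at the current point, so the objective value cannot increase from one iteration to the next.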
Utilizing more updating sequences and a more sophisticated updating scheme than (2.23), APG methods find an $\epsilon$-optimal solution using only $O\big(\sqrt{L_f/\epsilon}\big)$ iterations when $\mu_f = 0$ and $O\big(\sqrt{L_f/\mu_f}\, \log(1/\epsilon)\big)$ iterations when $\mu_f > 0$. Notice that APG methods require fewer iterations than the PG method in both cases.
Moreover, according to [103, 98], the iteration complexity achieved by APG methods is "optimal" under a black-box first-order oracle assumption in both cases, which means that if the only available information is the gradient $\nabla f(x)$, it is impossible to derive an algorithm for solving (2.1) with a better complexity than APG methods.
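As one concrete instance of the extra updating sequence that acceleration requires, the sketch below implements the FISTA scheme of Beck and Teboulle for the Lasso problem with $\mu_f = 0$. FISTA is chosen here only because it is a widely known APG variant; it is not claimed to be the specific method developed in this thesis, and the function names and the fixed step constant $L = L_f$ are assumptions of this example.

```python
import numpy as np

def shrink(x, alpha):
    """Soft-thresholding operator (2.22)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def fista_lasso(A, b, lam, num_iters=200):
    """Accelerated proximal gradient (FISTA variant) for the Lasso problem."""
    L = np.linalg.norm(A, 2) ** 2      # L = L_f for the squared loss
    x = np.zeros(A.shape[1])           # main sequence x^(k)
    y = x.copy()                       # auxiliary (extrapolated) sequence
    t = 1.0                            # momentum parameter
    for _ in range(num_iters):
        # Proximal mapping (2.21) evaluated at the auxiliary point y
        x_next = shrink(y - A.T @ (A @ y - b) / L, lam / L)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        # Extrapolate along the direction of the last step
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x
```

The only differences from the plain PG iteration are the auxiliary point `y` and the scalar `t`; the cost per iteration is essentially unchanged, while the number of iterations needed for a given accuracy drops from $O(L_f/\epsilon)$ to $O(\sqrt{L_f/\epsilon})$.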
2.4 Stochastic First-order Methods
The fitness loss function $f(x)$ of the structured regression problem (2.1) is defined based on a data set of finitely many input/output pairs $\{a_i, b_i\}$ for $i = 1, \ldots, m$.
As an alternative, we can also define the loss function on the underlying distribution of the data points. More specifically, the loss function $f(x)$ can also take the form:
$$f(x) := \mathbb{E}_\xi(F(x, \xi)) = \int F(x, \xi) \, dP(\xi), \tag{2.24}$$
where $\xi$ is a random vector (data point) with the distribution $P$. In a typical structured regression problem setting, $\xi$ represents the input/output pairs $(a, b)$. We assume that for every random vector $\xi$, $F(x, \xi)$ is a convex and continuous function in $x$. Therefore, $f(x)$ is also convex. Furthermore, we also assume $f(x)$ satisfies (2.3) and (3.5) as in the deterministic case.
One of the challenges of applying a first-order method to solve (2.1) with $f(x)$ given by (2.24) is that the gradient $\nabla f(x) = \mathbb{E}_\xi \nabla F(x, \xi) = \int \nabla F(x, \xi) \, dP(\xi)$ becomes computationally intractable for a high-dimensional $P$. Moreover, in most cases, the distribution $P$ is highly complicated or even unknown, so that the exact gradient information $\nabla f(x)$ is unavailable. To deal with this difficulty, a stochastic gradient $G(x, \xi)$ is constructed to approximate $\nabla f(x)$. For example, we can simply choose $G(x, \xi)$ to be $\nabla F(x, \xi)$ or its mini-batch version $\frac{1}{m} \sum_{i=1}^{m} \nabla F(x, \xi_i)$, where $\{\xi_1, \ldots, \xi_m\}$ are drawn from $P$ independently. The algorithms utilizing these stochastic gradients to solve (2.1) are called stochastic first-order methods.
Another situation where a stochastic first-order method is a more suitable choice is when the data set in the regression problem is so large that the computation of $\nabla f(x)$ is time-consuming or impossible due to limited memory.
For example, in a linear regression problem, the gradient of $f(x)$ defined on the entire data set $\{a_i, b_i\}$ for $i = 1, \ldots, m$ is $\nabla f(x) = A^T(Ax - b)$. If we draw a small random subset of the data, i.e., $\{a_i, b_i\}_{i \in S}$ with $S \subset \{1, 2, \ldots, m\}$, a stochastic gradient can be given as $G(x, S) = \frac{m}{|S|} \sum_{i \in S} a_i (a_i^T x - b_i)$, whose computation may require much less memory than that of $\nabla f(x) = A^T(Ax - b)$.
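The mini-batch gradient $G(x, S)$ above can be sketched as follows. The function names and the synthetic data are assumptions for illustration; the subset $S$ is drawn uniformly without replacement, which is one natural way to sample it.

```python
import numpy as np

def full_gradient(A, b, x):
    """Gradient of 0.5 * ||Ax - b||_2^2 over the entire data set."""
    return A.T @ (A @ x - b)

def minibatch_gradient(A, b, x, S):
    """G(x, S) = (m / |S|) * sum_{i in S} a_i (a_i^T x - b_i)."""
    m = A.shape[0]
    A_S, b_S = A[S], b[S]              # only the sampled rows are touched
    return (m / len(S)) * (A_S.T @ (A_S @ x - b_S))

rng = np.random.default_rng(0)         # synthetic data, for illustration only
m, n = 200, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

S = rng.choice(m, size=20, replace=False)   # random subset of 20 samples
g_stochastic = minibatch_gradient(A, b, x, S)
g_full = full_gradient(A, b, x)
```

The factor $m/|S|$ makes the estimate unbiased: averaging $G(x, \{i\})$ over all single-sample subsets recovers $A^T(Ax - b)$ exactly, while each evaluation reads only $|S|$ rows of $A$.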
Research related to stochastic gradient methods dates back to Robbins and Monro's stochastic approximation algorithm [122] in 1951, which was further developed by Polyak and Juditsky [118]. In the past few years, many stochastic first-order methods [35, 37, 55, 77, 99, 73, 74, 51, 63] have been applied to different stochastic optimization problems of the form (2.1). These methods enjoy low per-iteration complexity and the capability of scaling up to very large data sets.
However, a stochastic gradient unavoidably contains a certain level of noise, and hence a stochastic first-order method needs more iterations than a deterministic method to achieve the same optimality gap. In fact, to find a solution $x^{(k)}$ with $\mathbb{E}\phi(x^{(k)}) - \phi(x^\star) \le \epsilon$, a stochastic first-order method typically requires $O(1/\epsilon^2)$ iterations when $\mu_f = 0$ and $O(1/\epsilon)$ iterations when $\mu_f > 0$. According to [98], neither of these two complexities can be improved.
2.5 A Guideline for Choosing Algorithms in the Thesis
To provide a guideline for choosing among the algorithms in this thesis when solving structured regression problems (2.1) with different penalty terms and problem sizes, we include a decision tree in Figure 2.1.
In a structured regression problem, if $m$ is tremendously larger than $n$, the challenge of solving (2.1) comes from "big data". In this case, stochastic methods such as Algorithm 6.1 in Chapter 6 may dominate deterministic methods due to their low computational cost and memory requirement in each iteration. If $m$ is moderate but $n$ is very large, the challenge comes from "big model" rather than "big data". In this case, the choice of algorithm depends heavily on the penalty function $\Psi(x)$.
Figure 2.1: The guideline for using the algorithms in this thesis.
If $\Psi(x) = \lambda \|x\|_1$ as in the Lasso case, Algorithm 3.6 in Chapter 3 is a better choice than the other algorithms in this thesis because it has a better theoretical and practical convergence rate. If $\Psi(x)$ is not $\lambda \|x\|_1$ but still simple enough (e.g., the group Lasso penalty with disjoint groups) that the proximal mapping (2.18) can be solved easily, one can use Algorithm 5.1 in Chapter 5. If $\Psi(x)$ is so complicated that (2.18) cannot be solved easily but $\Psi(x)$ can be reformulated as (2.12), one may use Algorithm 4.1 in Chapter 4. However, if $\Psi(x)$ cannot be represented as (2.12), the problem is beyond the scope of this thesis and subgradient descent methods might be an alternative choice.
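The guideline above can be summarized as a short decision function. This is only a sketch of the text: the function name and the boolean flags are hypothetical, and deciding when $m$ counts as "tremendously larger" than $n$ or when a proximal mapping is "easy" is left to the user.

```python
def choose_algorithm(m_much_larger_than_n, penalty_is_l1, prox_is_easy, has_max_form):
    """Return the algorithm suggested by the guideline described in the text.

    m_much_larger_than_n: the "big data" regime
    penalty_is_l1:        Psi(x) = lam * ||x||_1 (Lasso)
    prox_is_easy:         the proximal mapping (2.18) can be solved easily
    has_max_form:         Psi(x) can be written in the form (2.12)
    """
    if m_much_larger_than_n:
        return "Algorithm 6.1 (Chapter 6)"
    if penalty_is_l1:
        return "Algorithm 3.6 (Chapter 3)"
    if prox_is_easy:
        return "Algorithm 5.1 (Chapter 5)"
    if has_max_form:
        return "Algorithm 4.1 (Chapter 4)"
    return "outside the scope of the thesis; subgradient descent is a possible fallback"

# Example: an overlapping group Lasso penalty with moderate m and large n
suggestion = choose_algorithm(
    m_much_larger_than_n=False,
    penalty_is_l1=False,
    prox_is_easy=False,
    has_max_form=True,
)
```

The order of the checks mirrors the order in which the text considers the cases, so the first matching condition determines the suggestion.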
Part III
First-Order Methods for Structured Regressions
3
Accelerated Gradient Homotopy Method for Lasso
In this chapter, we focus on a special structured regression problem, the $\ell_1$-regularized least-squares ($\ell_1$-LS) problem, also known as the Lasso problem, in the high-dimensional setting.¹ We first present an accelerated proximal gradient (APG) method for problems where the smooth part of the objective function is also strongly convex.
This method incorporates an efficient line-search procedure and achieves the optimal iteration complexity for such composite optimization problems. In case the strong convexity parameter is unknown, we also develop an adaptive scheme that can automatically estimate it on the fly, at the cost of a slightly worse iteration complexity.
In the $\ell_1$-LS problem, the smooth part of the objective (least-squares) is not strongly convex over the entire domain. Nevertheless, we can exploit its restricted strong convexity over sparse vectors using the adaptive APG method combined with a homotopy continuation scheme. We show that such a combination leads to a global geometric rate of convergence, and that the overall iteration complexity has a better dependence on the restricted condition number than previous work.
3.1 Introduction
Exploiting problem structure has become an important theme in recent advances in convex optimization. It is well known that proper use of problem structure at the numerical linear algebra level may dramatically improve the efficiency of an optimization method. More recently, it has become clear that exploiting problem structure can also lead to more efficient optimization methods in terms of their iteration complexity, sometimes significantly surpassing the limitations of the black-box complexity theory (see [108] for an excellent discussion). Such examples range from the theory of self-concordant functions for interior-point methods [110] to the more recent development of the smoothing technique [104], minimization of composite objective functions [107], and acceleration via manifold identification (e.g., [147]).
In this chapter, we first develop an adaptive accelerated proximal gradient method for minimizing objective functions that are strongly convex, without the knowledge of their convexity parameters or any lower bound. Then we employ this method in a homotopy continuation scheme for sparse optimization (with $\ell_1$-regularization), and show that it achieves an improved iteration complexity over previous methods for solving the sparse least-squares problem.
1 This chapter is based on the technical report [81] submitted to Mathematical Programming.