PICO modifi cations - LDRD final report on massively-parallel linear programming : the parPCx s

We have modifi ed the core and integer programming modules of PICO to include a new ramp-up procedure to exploit procedural (as opposed to subproblem) parallelism early in the search. This is a natural way to integrate parallel (root) LP solves into PICO. We use ramp up early in a branch-and-bound computation when there isn’t suffi cient subproblem parallelism to keep a large number of processors busy. For example, at the root of the search tree, there is only one problem. In this setting, it is best to evaluate subproblems in parallel, parallelizing individual steps including bounding, cut generation, gradient calcu-lations, searching for feasible solutions, etc. A parallel LP solver naturally fi ts into this new ramp up, certainly for the root solve, and possibly even for early subproblems if there are a signifi cant number of new cuts (additional constraints). When there are enough indepen-dent subproblems relative to the number of processors (or there is insuffi cient parallelism at the subproblem level), the computation switches to tree-level parallelism where processors evaluate independent subproblems in parallel. Because processors stay in lock step during ramp up, PICO makes this transition without communication and with an initially perfect load balance.

Problem rows cols KLU (1) DSC (1) DSC (2) DSC (4) DSC (8) DSC (16)

25fv47 821 1571 3.26 3.36 4.03 3.99 4.21 4.46

maros-r7 3136 9408 48.5 20.2 16.7 13.8 11.6 11.7

Figure 5. Time (sec.) for parallel solution for different linear solvers. Number of processors in parentheses.

To integrate parPCx into PICO, we also built an (IBM Open-Source Initiative) interface for the solver and the extended PICO LP interface (of course, throwing exceptions for calls to methods that are only applicable to simplex methods). The solver is not fully debugged as of this writing, but likely will be before this report is published. Because parPCx uses the PCx interface, the OSI solver interface now enables access to the PCx serial solver, pPCx and parPCx for any code that uses this standard interface.

7 Parallel Results

The parPCx system is still under development so it is still too early to publish meaningful parallel results. However, for the purpose of this status report we have done some parallel runs. All results in this section were obtained with the qPCx code, where only the solver is parallel. The fully distributed memory version (PCy) should outperform qPCx, but we are still debugging this version.

All the results below use test problems from the Netlib LP collection [17], which despite its age is still a dominant LP benchmark suite. We used a Compaq Alpha cluster with 32 processors, of which we could use up to 16.

We examine the results with direct solvers here, as we already discussed the perfor-mance of iterative solvers in Section 4.1. Even though the only changes from PCx are the formation of the ADA^T matrix and the Cholesky solver, the solutions from parPCx some-times differed slightly from PCx solutions. That is, parPCx did approach the same solution as PCx but due to small numerical differences the status of the fi nal iterate was in a few cases classifi ed as “unknown” or “infeasible” instead of “optimal”. Also, a few problems did not converge or failed in other ways. There may be several causes, but the lack of special treatment of very small pivots is a likely factor. We therefore focus on cases where parPCx (qPCx) and PCx did produce the same answers. In Figure 5 we show the execution time for two test problems with varying number of processors using the DSCPACK linear solver.

For the 25fv47 problem there was no parallel speedup. This is a small problem, and it appears that the parallel overhead and data redistribution cost cancelled the small savings from computing the Cholesky factorization in parallel. For the larger maros-r7 problem,

there is some parallel speedup but it is poor. Clearly, we cannot expect linear speedup since only the Cholesky solver is parallel, but we had hoped for better speedup. We are cur-rently investigating the cause. One contributing factor is that vectors are replicated across processors, and therefore require a lot of communication to update. There may also be ineffi ciencies in the Amesos interface layer, which sits between parPCx and the third-party linear solver (e.g., DSCPACK). We have also identifi ed the sparse matrix-matrix multipli-cation as a slow part on some problems; apparently due to an ineffi cient implementation in EpetraExt. (For some problems, computing ADA^T took substantially longer than factoring ADA^T.)

We also attempted to run qPCx on some much larger problems (up to 100,000 vari-ables), like CRE-B, but qPCx crashed with floating-point exception using parallel Cholesky (DSCPACK) so we were only able to perform serial solves (using KLU). We did obtain the optimal solution with the KLU solver and need to investigate why the code failed with DSCPACK. This shows that robustness is a serious issue in interior-point methods, even with direct solvers.

8 Future Work

Although many goals of this LDRD project have been accomplished, some work still re-mains before the parPCx/PCy codes become truly useful. A continuation of the project would need to tackle:

• Debug and test PCy. The fully distributed code still has bugs, and must be debugged and verifi ed against the original PCx.

• Make all vectors distributed (parallel). Currently, most vectors are replicated across processors but this is easy to change using Epetra. Important to get good parallel speedup.

• Dense column handling using either the Sherman-Morrison-Woodbury formula or column splitting. This feature is missing in parPCx and only partially implemented in PCy. Important for performance.

• Profi ling and performance optimization. The current code is too slow.

• Debug and test PICO/OSI interface.

• Implement cross-over operation. Essential for use in the MIP solver PICO.

A wish list for future expansion might include:

• Expand solver class to optionally solve augmented system via direct or iterative methods. This may be a better approach than the normal equations in many cases.

• Provide simpler methods to add new preconditioners.

• Implement LSQR for solving least-squares version of the augmented system. This will almost surely work better (more robust) than the current method (CG on the normal equations).

• Investigate better scalings to reduce condition number.

• Implement direct–iterative hybrid solver.

• Investigate new preconditioners. Good scalable parallel performance can only be achieved with iterative methods.

• Parallel presolve. Currently, the presolver is serial.

Some of our parallel code (in particular, the Solver class) may be useful in other, more general, interior-point method codes. For example, we could integrate our parallel solver with the OOQP convex QP package. This would require some modifi cations since the linear systems for QPs require the Hessian.

References

[1] V. Baryamureeba and T. Steihaug. Properties and computational issues of a precon-ditioner for interior point methods. Technical Report 180, Dept. of Informatics, Univ.

of Bergen, Norway, 1999.

[2] J. Berry, L. Fleischer, W. E. Hart, and C. A. Phillips. Sensor placement in municipal water networks. In Proceedings of the World Water and Environmental Resources Congress (electronic proceedings). ASCE, 2003.

[3] J. Berry, W. E. Hart, C. A. Phillips, and J. Uber. A general integer-programming-based framework for sensor placement in municipal water networks. In Proceedings of the 6th Annual Symposium on Water Distribution System Analysis (electronic pro-ceedings). ASCE, 2004.

[4] R. H. Bisseling, T. M. Doup, and L.D.J.C. Loyens. A parallel interior point algorithm for linear programming on a network of transputers. Annals of Operations Research, 43:51–86, 1993.

[5] ˚Ake Bj¨orck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, PA, 1996.

[6] E. G. Boman, D. Chen, B. Hendrickson, and S. Toledo. Maximum-weight-basis pre-conditioners. Numerical Linear Algebra with Appl., 11(8–9):695–721, 2004.

[7] E. G. Boman and B. Hendrickson. Support theory for preconditioning. SIAM J. on Matrix Anal. and Appl., 25(3):694–717, 2004. (Published electronically on 17 Dec 2003.).

[8] R. E. Carlson, M. A. Turnquist, and L. K. Nozick. Expected losses, insurability, and benefi ts from reducing vulnerability to attack. Technical Report SAND2004-0742, Sandia National Laboratories, 2004.

[9] R.D. Carr and G. Konjevod. Polyhedral combinatorics. In H.J. Greenberg, editor, Tu-torials on Emerging Methodologies and Applications in Operations Research. Kluwer Academic Press, 2004.

[10] Ü. Ç atalyürek and C. Aykanat. Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Dist. Systems, 10(7):673–693, 1999.

[11] D. Chen and S. Toledo. Vaidya’s preconditioners: Implementation and experimental study. ETNA, 16, 2003. Available from http://etna.mcs.kent.edu.

[12] T. Coleman, J. Czyzyk, C. Sun, M. Wagner, and S.J. Wright. pPCx: parallel software for linear programming. In Proc. Eighth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, 1997.

[13] J. Czyzyk, S. Mehrotra, and S.J. Wright. PCx user’s guide. Technical Report Tech. re-port OTC 96/01, Argonne National Lab, 1996.

[14] Karen Devine, Erik Boman, Robert Heaphy, Bruce Hendrickson, and Courte-nay Vaughan. Zoltan data management services for parallel dynamic ap-plications. Computing in Science and Engineering, 4(2):90–97, 2002.

http://www.cs.sandia.gov/Zoltan.

[15] J. Eckstein. Parallel branch-and-bound algorithms for general mixed integer program-ming on the CM-5. SIAM Journal on Optimization, 4(4):794–814, 1994.

[16] Jonathan Eckstein, Cynthia A. Phillips, and William E. Hart. PICO: An object-oriented framework for parallel branch-and-bound. In Proc Inherently Parallel Al-gorithms in Feasibility and Optimization and Their Applications, Elsevier Scientifi c Series on Studies in Computational Mathematics, pages 219–265, 2001.

[17] D.M. Gay. Electronic mail distribution of linear programming test problems. COAL Newsletter, Math. Prog. Soc., 13:10–12, 2003.

[18] P. Gill, W. Murray, and M. Wright. Practical Optimization. Academic Press, 1981.

[19] P.E. Gill, W. Murray, D.B. Ponceleon, and M.A. Saunders. Preconditioners for in-defi nite systems arising in optimization. SIAM J. on Matrix Analysis, 13(1):292–311, 1992.

[20] P.E. Gill, W. Murray, D.B. Ponceleon, and M.A. Saunders. Primal-dual methods for linear programming. Math. Prog., 70(3):251–277, 1995.

[21] J. Gondzio and R. Sarkissian. Parallel interior-point solver for structured linear pro-grams. Mathematical Programming, 96(3):561–584, 2003.

[22] M.D. Grigoriadis and L.G. Khachiyan. An interior-point method for bordered block-diagonal linear programs. SIAM J. Optim., 6(4):913–932, 1996.

[23] Michael Heroux, Roscoe Bartlett, Vicki Howle, Robert Hoekstra, Jonathan Hu, Tamara Kolda, Richard Lehoucq, Kevin Long, Roger Pawlowski, Eric Phipps, An-drew Salinger, Heidi Thornquist, Ray Tuminaro, James Willenbring, and Alan Williams. An overview of Trilinos. Technical Report SAND2003-2927, Sandia Na-tional Laboratories, Albuquerque, NM, 2003. http://software.sandia.gov/trilinos.

[24] D. Hysom and A. Pothen. A scalable parallel algorithm for incomplete factor precon-ditioning. SIAM J. Sci. Comp., 22(6):2194–, 2001.

[25] J.J. Judice, J. Patricio, L.F. Portugal, M.G.C. Recende, and G. Veiga. A study of preconditioners for network interior point methods. Computational Optimization and Applications, 24(1):5–35, 2003.

[26] P. Koka, T. Suh, M Smelyanskiy, R. Grzeszczuk, and C. Dulong. Construction and performance characterization of parallel interior point solver on 4-way intel itanium 2 multiprocessor system. In IEEE Seventh Annual Workshop on Workload Characteri-zation, 2004.

[27] S. Mehrotra and J. Wang. Conjugate gradient based implementation of interior point methods for network flow problems. In L. Adams and J. Nazareth, editors, Linear and Nonlinear Conjugate Gradient Related Methods, pages 124–142. SIAM, 1995.

[28] R.D.C. Monteiro, J.W. O’Neill, and T. Tsuchiya. Uniform boundedness of a precondi-tioned normal matrix used in interior-point methods. SIAM J. Optim., 15(1):96–100, 2004.

[29] L.F. Portugal, M.G.C. Recende, G. Veiga, and J.J. Judice. A truncated primal-infeasible dual-feasible interior point network flow method. Networks, 35:91–108, 2000.

[30] M. Resende and G. Veiga. An effi cient implementation of a network interior point method. In D. Johnson and C. McGeoch, editors, Network Flow and Matching:

First DIMACS Implementation Challenge, volume 12, pages 299–384, Providence, RI, 1993. AMS.

[31] M.A. Saunders. PDCO: primal-dual interior method for convex objectives. Software available at http://www.stanford.edu/group/SOL/software/pdco.html.

[32] M.A. Saunders and J.A. Tomlin. Solving regularized linear programs using barrier methods and KKT systems. Technical Report Report SOL 96-4, Dept. of EESOR, Stanford University, 1996.

[33] G. Schultz and R.R. Meyer. An interior-point method for block angular optimization.

SIAM J. Optim., 1(1):583–602, 1991.

[34] D. Spielman and S-H. Teng. Nearly-linear time algorithms for graph partition-ing, graph sparsifi cation, and solving linear systems. Manuscript available from www.arxiv.org/abs/cs.DS/0310051, 2003.

[35] D. Spielman and S-H. Teng. Solving sparse, symmetric, diagonally-dominant linear systems in time 0(m¹^.31). In Proc. of FOCS’03, pages 416–427, 2003.

[36] D.A. Spielman and S.H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. of the ACM, 51(3):385–463, May 2004.

[37] Sivan Toledo. TAUCS - a library of sparse linear solvers. Software available at http://www.math.tau.ac/∼stoledo/taucs/.

[38] P. M. Vaidya. Solving linear equations with symmetric diagonally dominant matrices by constructing good preconditioners. Unpublished manuscript. A talk based on this manuscript was presented at the IMA Workshop on Graph Theory and Sparse Matrix Computations, Minneapolis, October 1991.

[39] R. Vanderbei. Symmetrical quasidefi nite matrices. SIAM J. Optimization, 5(1):100–

113, Feb. 1995.

[40] Brendan Vastenhouw and Rob H. Bisseling. A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. Preprint 1238, Dept. Mathe-matics, Utrecht University, May 2002. To appear in SIAM Review, 2005.

[41] W. Wang and D.P. O’Leary. Adaptive use of iterative methods in interior point meth-ods for linear programming. Tech. report CS-3650, Univ. of Maryland, 1995.

[42] J.-P. Watson, H. J. Greenberg, and W. E. Hart. A multiple-objective analysis of sensor placement optimization in water networks. In Proceedings of the 6th Annual Sympo-sium on Water Distribution System Analysis (electronic proceedings). ASCE, 2004.

[43] M. H. Wright. Interior methods for constrained optimization. In A. Iserles, editor, Acta Numerica 1992, pages 341–407. Cambridge University Press, 1992.

[44] S. Wright. Primal-Dual Interior-Point Methods. SIAM, Philadelphia, 1997.

DISTRIBUTION:

3 Prof. Ojas Parekh Math/CS Department 400 Downman Dr.

Atlanta, GA 30322 3 Prof. Stephen J. Wright

University of Wisconsin 1210 West Dayton Street Madison, WI 53706

1 MS 9018

Central Technical Files, 8945-1 3 MS 1110

Erik Boman, 09215

3 MS 1110

Cynthia Phillips, 09215 1 MS 1110

Michael Heroux, 09214 3 MS 9159

Victoria Howle, 08962 1 MS 1110

Suzanne L. K. Rountree, 09215 2 MS 0899

Technical Library, 9616 1 MS 0123

D. Chavez, LDRD Offi ce, 1011

In document LDRD final report on massively-parallel linear programming : the parPCx system. (Page 27-35)